# Make One-Shot Video Object Segmentation Efficient Again

Tim Meinhardt, Technical University of Munich, tim.meinhardt@tum.de
Laura Leal-Taixé, Technical University of Munich, leal.taixe@tum.de

Abstract

Video object segmentation (VOS) describes the task of segmenting a set of objects in each frame of a video. In the semi-supervised setting, the first mask of each object is provided at test time. Following the one-shot principle, fine-tuning VOS methods train a segmentation model separately on each given object mask. However, the VOS community has recently deemed such a test time optimization and its impact on the test runtime unfeasible. To mitigate the inefficiencies of previous fine-tuning approaches, we present efficient One-Shot Video Object Segmentation (e-OSVOS). In contrast to most VOS approaches, e-OSVOS decouples the object detection task and predicts only local segmentation masks by applying a modified version of Mask R-CNN. The one-shot test runtime and performance are optimized without a laborious and handcrafted hyperparameter search. To this end, we meta learn the model initialization and learning rates for the test time optimization. To achieve an optimal learning behavior, we predict individual learning rates at a neuron level. Furthermore, we apply an online adaptation to address the common performance degradation throughout a sequence by continuously fine-tuning the model on previous mask predictions, supported by a frame-to-frame bounding box propagation. e-OSVOS provides state-of-the-art results on DAVIS 2016, DAVIS 2017 and YouTube-VOS for one-shot fine-tuning methods while reducing the test runtime substantially. Code is available at https://github.com/dvl-tum/e-osvos.

1 Introduction

Video object segmentation (VOS) describes a two-class (foreground-background) pixel-level classification task on each frame of a given video sequence. Multiple objects are discriminated by predicting individual foreground-background pixel masks. In this work, we address a variant of VOS which is semi-supervised at test time. To this end, the ground truth foreground-background segmentation mask of the first frame is provided for each object.

Machine learning methods that tackle semi-supervised VOS are categorized by their utilization of the provided object ground truth masks. We focus on fine-tuning methods [6, 22, 39, 20, 34], which exploit the transfer learning capabilities of neural networks and follow a multi-step training procedure: (i) pre-training: learn general image and segmentation features by training the model on images and video sequences, and (ii) fine-tuning: a one-shot test time optimization which enables the model to learn foreground-background characteristics specific to each object and video sequence. While elegant through their simplicity, fine-tuning methods face important shortcomings: (i) the pre-training is fixed and not optimized for the subsequent fine-tuning, and (ii) the hyperparameters of the test time optimization are often excessively handcrafted and fail to generalize between datasets. The common existing fine-tuning setups [6, 20] are inefficient and suffer from high test runtimes with as many as 1000 training iterations per segmented object. As a consequence, recent methods refrain from such an optimization at test time and instead opt for solutions such as template matching [7, 13] and mask propagation [8, 9, 24, 25, 35, 44] for semi-supervised VOS.
Figure 1: Performance versus runtime (frames per second [Hz], logarithmic axis) comparison of modern video object segmentation (VOS) approaches on the DAVIS 2017 validation set; the compared methods include CINM, FEELVOS and our e-OSVOS-10, e-OSVOS-50, e-OSVOS-50-OnA and e-OSVOS-100-OnA variants, grouped into one-shot fine-tuning and other methods. We only show methods with publicly available runtime information. Our e-OSVOS approach demonstrates the relevance of fine-tuning for VOS and its inherent flexibility, as we apply the same meta learned optimization for a varying number of iterations and with online adaptation (OnA).

In this work, we revisit the concept of one-shot fine-tuning for VOS and show how to leverage the power of meta learning to overcome the aforementioned issues. To this end, we propose three key design choices which make one-shot fine-tuning for VOS efficient again:

Learning the Model Initialization
The common pre-training [6, 22, 20, 34] yields a segmentation model which is not specifically optimized for the subsequent fine-tuning task and requires an unlearning of potential false positive objects. Therefore, we propose to meta learn the pre-training step, i.e., we learn the best initialization of the segmentation model for a subsequent fine-tuning to any object.

Learning Neuron-Level Learning Rates
We replace the laborious and handcrafted hyperparameter search of [6, 22, 20, 34] and additionally optimize learning rates for each neuron of the model. In contrast to a single learning rate for the entire model [1] or millions of rates for all of its parameters [39], this allows for an ideal balance between individual learning behavior and additional trainable parameters.

Optimization of Model with Object Detection
To account for the foreground-background pixel imbalance and the challenging object discrimination caused by individual fine-tuning, previous fine-tuning methods [22, 20, 34] rely on additional mask proposal or bounding box prediction methods. In contrast, we directly fine-tune Mask R-CNN [11] with its separate, end-to-end trainable object detection head, which limits mask predictions to local object bounding boxes.

This leads to our efficient one-shot video object segmentation (e-OSVOS) approach, which achieves state-of-the-art segmentation performance compared to all previous fine-tuning methods on the DAVIS 2016, DAVIS 2017 and YouTube-VOS benchmarks, at a much lower test runtime, see Figure 1. Overall, our results combat the negative preconceptions with respect to fine-tuning as a principle for semi-supervised VOS and are intended to motivate future research in this direction.

1.1 Related Work

We categorize VOS methods by their application of one-shot fine-tuning for semi-supervised VOS.

Without Fine-Tuning
Several methods [7, 13, 33] pose VOS as a pixel retrieval task in a learned embedding space. After the embedding learning, no fine-tuning is necessary during inference: pixels are simply assigned to their respective nearest neighbors in the learned embedding space [7, 13], or the embeddings are used as a guide for the segmentation network [33]. Other methods propagate segmentation masks using optical flow or point trajectories [9, 35, 36] or segment, propagate and combine object parts [8]. The authors of [24] propagate and decode segmentation masks based on the first- and query-frame embeddings.
STM [31] leverages a memory network to capture the information of the object in past frames, which is then decoded to predict the current frame mask. It achieves state-of-the-art performance but fails to capture small objects and requires a large GPU memory for sequences with many objects.

With Fine-Tuning
The concept of fine-tuning for semi-supervised VOS was first introduced in OSVOS [6]. This family of methods fine-tunes a pre-trained segmentation model on the first frame ground truth mask of a given object and predicts segmentation masks for the remaining video frames. OnAVOS [34] extends this approach by adapting the target appearance model online on consecutive frames using heuristics-based fine-tuning policies. While conceptually elegant, the aforementioned methods have no notion of individual objects, shapes, or motion consistency. To remedy this issue, OSVOS-S [22] and PReMVOS [21] leverage object detection and instance segmentation methods (e.g., Mask R-CNN [11]) during inference as additional object guidance cues. This approach is akin to the tracking-by-detection paradigm commonly followed in the multi-object tracking community. Fine-tuning methods for VOS all share one major drawback: the online fine-tuning process requires an extensive manual hyperparameter search and, so far, numerous training iterations during inference (up to 1000 in the original OSVOS [6] method). Hence, recent methods refrain from such an optimization at test time due to its impact on the runtime.

Towards Efficient Fine-Tuning
Ideally, we would like to learn an appearance model and perform as few training steps as possible during inference. One viable approach consists of posing the video object segmentation task as a meta learning problem and optimizing the fine-tuning policies (e.g., the generic model initialization, learning rates, and the number of fine-tuning iterations). The first attempt in this direction, MVOS [39], proposes to learn the initialization and learning rates per model parameter. However, this approach is impractical for modern large-scale detection/segmentation networks. In this paper, we revisit the concept of meta learning for VOS and propose several critical design choices which yield state-of-the-art results and vastly outperform MVOS [39] and other fine-tuning methods.

Meta Learning for Few-Shot Learning
Previous works have addressed analogous issues for image classification. The authors of MAML [10] propose to learn the model initialization for an optimal subsequent fine-tuning at test time. Such an initialization is supposed to benefit the fine-tuning beyond traditional transfer learning, which merely internalizes the training data. The MAML++ [1] and Meta-SGD [16] approaches suggest several improvements to MAML and complement the model initialization by learning the optimal learning rate. However, both approaches limit their potential by optimizing only a single global learning rate for the entire model. The authors of [45] conduct an analysis of meta learning in few-shot scenarios and address the memorization problem with a specifically tailored loss function. Other approaches, such as [29], suggest to not only predict the learning rate but to apply a parameterized model that predicts the entire update step. However, these approaches are so far limited in their applicability to large-scale neural networks.
2 One-Shot Fine-Tuning for Video Object Segmentation

For a given video sequence $n$ with $I^n$ image frames $\{x_i^n : 0 \le i < I^n\}$ and $K^n$ objects, video object segmentation (VOS) predicts individual object masks $y_i^{n,k}$ for all frames $x_i^n$. In the case of semi-supervised VOS, the ground truth mask $\hat{y}_0^{n,k}$ of a single frame is provided at test time for each object. For simplicity, we assume that the given frame always corresponds to the first frame $i = 0$ of the video. However, a video might contain multiple objects entering the sequence at different frames.

The common approach for one-shot fine-tuning of a segmentation model $f$ with parameters $\theta_f$ follows the three-step optimization pipeline presented in [6]: (i) Base network: learn general object features by training the feature extractor backbone of $f$ on a large-scale image recognition challenge, e.g., ImageNet [30]. (ii) Parent network: train $f$ on a segmentation dataset, e.g., the DAVIS 2017 training set [28], to learn the foreground-background segmentation problem. (iii) Fine-tuning: learn object and sequence specific features by separately fine-tuning the parent network to each object of a given video sequence. It should be noted that one-shot learning by nature runs full-batch updates, hence iteration and epoch are often used interchangeably.

For a sequence $n$, the fine-tuning yields $K^n$ separately trained models $f^{n,k}$ with parameters $\theta_f^{n,k}$. The final object masks $y_i^{n,k} = f^{n,k}(x_i^n)$ are obtained from the maximum over the predicted pixel probabilities over all $K^n$ objects. Steps (ii) and (iii) minimize the segmentation loss $\mathcal{L}_{seg}(D, \theta_f)$, e.g., the binary cross-entropy, of the model $f$ on a given training dataset $D$. For clarity, we omit the sequence and object indices on $f$ and $\theta_f$ in future references and refer to a problem solved by an optimization $g$ as:

$$\theta_f^g = \underset{\theta_f \text{ with } g}{\operatorname{argmin}} \; \mathcal{L}_{seg}(D, \theta_f) \quad (1)$$

Such an optimization is defined by several hyperparameters, including the model initialization, the number of training iterations, and the type of stochastic gradient descent (SGD) method as well as its learning rate(s).

3 Efficient One-Shot Video Object Segmentation

We describe the key design choices of e-OSVOS, namely, the model choice, the meta learning of the fine-tuning optimization, and two additional test time modifications that further enhance our performance.

3.1 Optimization of Model with Object Detection

Fine-tuning a fully-convolutional model for VOS suffers from two major issues: (i) the imbalance between foreground and background pixels and (ii) the challenging object discrimination caused by individual fine-tuning. Typically, the latter requires an unlearning of potential false positive pixels. Several fine-tuning approaches [22, 20, 34] tackle these issues by including separate mask proposal or bounding box prediction methods. We propose to directly fine-tune Mask R-CNN [11], which decouples the object detection and restricts the demanding pixel-wise segmentation to local bounding boxes.

Mask R-CNN consists of a feature extraction backbone, a Region Proposal Network (RPN) and two network heads, namely, the bounding box object detection and mask segmentation heads. The RPN produces potential bounding box candidates, also known as proposals, on the intermediate feature representation provided by the backbone. The box detection head predicts the object class and regresses the final bounding boxes for each proposal. Finally, the segmentation head provides object masks for each object class and bounding box.
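To make this decoupled box-then-mask design concrete, the following is a minimal sketch based on torchvision's reference Mask R-CNN (ResNet50-FPN backbone, COCO-pretrained). It only shows the unmodified reference model and a dummy forward pass, not the adapted e-OSVOS network; the loss and normalization changes are described below.

```python
# A rough sketch of the base model: torchvision's COCO-pretrained Mask R-CNN
# with a ResNet50-FPN backbone. Boxes come from the detection head; masks are
# only predicted inside those boxes. This is the unmodified reference model,
# not the adapted e-OSVOS network (Lovász-Softmax mask loss, group
# normalization and the meta-learned fine-tuning are omitted here).
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

frame = torch.rand(3, 480, 854)            # dummy RGB frame (C, H, W) in [0, 1]
with torch.no_grad():
    out = model([frame])[0]                # dict with 'boxes', 'labels', 'scores', 'masks'

# 'masks' holds one soft mask per detected box, already pasted back into
# full image coordinates with shape (num_detections, 1, H, W).
print(out['boxes'].shape, out['masks'].shape)
```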
The segmentation loss mentioned in Section 2 corresponds to the multi-task Mask R-CNN loss: $\mathcal{L}_{seg} = \mathcal{L}_{RPN} + \mathcal{L}_{box} + \mathcal{L}_{mask}$. We adapt Mask R-CNN for the VOS task by replacing the pixel-wise cross-entropy loss $\mathcal{L}_{mask}$ with the Lovász-Softmax loss [5]. The Lovász-Softmax loss directly optimizes the intersection-over-union and demonstrates superior performance in our one-shot fine-tuning setting. In contrast to the commonly applied batch normalization, group normalization [38] allows for fine-tuning even on single-sample (frame) batches. Therefore, we replace all normalization layers of the backbone with group normalization.

3.2 Meta Learning the One-Shot Test Time Optimization

As outlined by [29], meta learning is of particular interest for semi-supervised or few-shot learning scenarios. In this work, we extend this idea from image classification to VOS and meta learn steps (ii) and (iii) of the optimization pipeline from Section 2. To this end, we learn differentiable components of the test time optimization, specifically, the model initialization and the SGD learning rate(s).

3.2.1 Meta Tasks

In order to meta learn the optimization, we formulate the VOS fine-tuning problem as a meta task. A task represents the fine-tuning optimization on a single object of a video sequence. Given a set of $N$ unique video sequences, each with $K^n$ objects, we define the corresponding taskset $\mathcal{T} = \{T^{n,k} : 0 \le k < K^n, 0 \le n < N\}$ with $T^{n,k} = \{D_{train}^{n,k}, D_{test}^{n,k}\}$. As illustrated in Figure 2a, an individual task is created by splitting each sequence into a training and a test dataset consisting of disjoint sets of video frames. The goal of task $T^{n,k}$ is to minimize the test loss $\mathcal{L}_{seg}(D_{test}^{n,k}, f)$ of the model $f$. The datasets $D_{train}^{n,k} = \{x_0^n, \hat{y}_0^{n,k}\}$ and $D_{test}^{n,k} = \{\{x_i^n, \hat{y}_i^{n,k}\} : 1 \le i < I^n\}$ include the first and all consecutive frames, respectively. We train e-OSVOS on the taskset $\mathcal{T}_{train}$ such that the fine-tuning optimization on any $D_{train}^{n,k}$ yields optimal results on the corresponding $D_{test}^{n,k}$. This involves two optimizations, namely, the inner fine-tuning and the outer meta optimization. As for all machine learning methods, the final generalization to the test taskset $\mathcal{T}_{test}$ is paramount. In future references of the datasets $D_{train}$ and $D_{test}$ we again omit the sequence and object indices $n, k$.

Figure 2: (a) Meta taskset and (b) meta optimization. The test time optimization $g$ of e-OSVOS is meta learned on a VOS taskset structured as in (a). Each task represents a video sequence with its frames split into training $D_{train}^n$ and test $D_{test}^n$ datasets. The optimization $g$ depicted in (b) consists of the model initialization and a set of learning rates applied with vanilla stochastic gradient descent, both of which are meta learned by backpropagating the final test loss $\mathcal{L}_{seg}(D_{test}, \theta_f^T)$.

3.2.2 Meta Optimization

In analogy to How to train your MAML [1], our test time optimization consists of vanilla SGD with two trainable components, namely, the initialization of the model $f$ and learning rates $\lambda$, which are applied for a fixed number of iterations. We refer to the trainable parameters of such an optimization $g$ with $\theta_g$. Learning a single task involves the following bi-level optimization problem for $\theta_g$ and $\theta_f$:

$$\theta_g^{\ast} = \underset{\theta_g}{\operatorname{argmin}} \; \mathcal{L}_{seg}(D_{test}, \theta_f^g), \quad (2)$$
$$\text{s.t.} \quad \theta_f^g = \underset{\theta_f \text{ with } g}{\operatorname{argmin}} \; \mathcal{L}_{seg}(D_{train}, \theta_f). \quad (3)$$

The outer optimization in Equation (2) is handcrafted and performed on batches of tasks from $\mathcal{T}_{train}$. The bi-level optimization aims to maximize the generalization from a given training to test dataset.
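To make the taskset structure of Section 3.2.1 on which this bi-level problem is optimized concrete, the following sketch builds one task per (sequence, object) pair; the `sequences` container is a hypothetical stand-in for the actual data loading and not part of the released code.

```python
# Schematic construction of the meta taskset of Section 3.2.1: one task per
# (sequence, object) pair, with D_train holding the first frame and its ground
# truth mask and D_test holding all remaining annotated frames. The `sequences`
# container is hypothetical and stands in for an actual dataset loader.
def build_taskset(sequences):
    taskset = []
    for seq_id, (frames, object_masks) in enumerate(sequences):
        # object_masks[k][i] is the ground truth mask of object k in frame i
        for k, masks in enumerate(object_masks):
            taskset.append({
                'sequence': seq_id,
                'object': k,
                'train': [(frames[0], masks[0])],          # D_train: first frame only
                'test': list(zip(frames[1:], masks[1:])),   # D_test: consecutive frames
            })
    return taskset
```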
In practice, one step of Equation (2) includes multiple steps of Equation (3). This corresponds to fine-tuning the model $f$ for multiple iterations on the first frame $D_{train}$. The optimization $g$ is trained by Backpropagation Through Time (BPTT) of the test loss after $T$ training iterations:

$$\mathcal{L}_{BPTT} = \mathcal{L}_{seg}(D_{test}, \theta_f^T), \quad (4)$$
$$\text{with} \quad \theta_f^{t+1} = g(\nabla_{\theta_f^t} \mathcal{L}_{seg}(D_{train}, \theta_f^t), \theta_f^t). \quad (5)$$

As illustrated in Figure 2b, $g$ connects the computational graph of each iteration over time. The optimization applies a gradient descent step with respect to $D_{train}$ and updates $f$. To this end, the optimization receives the current model parameters $\theta_f^t$ and their gradients $\nabla_{\theta_f^t} \mathcal{L}_{seg}(D_{train}, \theta_f^t)$. After $T$ updates, the optimization itself is updated to minimize $\mathcal{L}_{BPTT}$ with respect to the updated model $f$. As Equation (5) already requires the computation of model parameter gradients, the outer backpropagation of $\mathcal{L}_{BPTT}$ introduces second order derivatives. To reduce the computational effort, these can be omitted, which is equivalent to ignoring the dashed edges of the graph in Figure 2b.

3.2.3 Learning the Segmentation Model Initialization

Meta learning the model initialization for a subsequent optimization (fine-tuning) yields superior performance compared to classic transfer learning approaches (parent network training). The initialization not only internalizes the data of the tasks, but also benefits the subsequent fine-tuning step. Previous works [10, 16, 39] have applied this successfully to few-shot image classification. For semi-supervised VOS, meta learning is able to provide a model initialization $\theta_f^0$ for the fine-tuning optimization in Equation (5). Such an initialization avoids biases towards specific objects and eases the individual fine-tuning to each object significantly.

In addition to the classic overfitting to $\mathcal{T}_{train}$, meta learning is prone to zero-shot collapsing, also called the memorization problem [45]. For image classification, this is avoided by randomly shuffling the class labels for each training task. For multi-object VOS, we tackle the issue of zero-shot collapsing by separating objects of the same sequence into multiple tasks. The first two example tasks in Figure 2a demonstrate the necessity of a one-shot optimization for segmenting different objects given the same input image.

3.2.4 Learning Neuron-Level Learning Rates

The optimization $g$ performs Equation (5) with a vanilla SGD step and updates the segmentation model by applying a set of meta learned learning rates $\lambda$ for a fixed number of iterations. The entire set of trainable optimization parameters is denoted as $\theta_g = \{\theta_f^0, \lambda\}$. Previous meta learning approaches for few-shot learning applied learning rates at varying parameter hierarchy levels, from a single global learning rate for the entire model in [1] to learning rates for all model parameters $\theta_f$ in MVOS [39]. The latter is unfeasible for many modern state-of-the-art segmentation networks as it effectively doubles the number of trainable optimization parameters ($|\theta_g| \approx 2|\theta_f^0|$). Therefore, we propose an ideal balance between individual learning behavior and additional trainable parameters by optimizing a set of learning rates at the neuron level. A common linear neural network layer consists of multiple neurons, or kernels for convolution layers, where each neuron applies a weight tensor and a corresponding scalar bias. We predict a pair of learning rates for each neuron of the model $f$, i.e., a single rate for each weight tensor and bias scalar.
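The inner fine-tuning of Eq. (5) with neuron-level learning rates and the BPTT meta-update of Eq. (4) can be sketched as follows. For readability, the model is reduced to a single convolution and the loss to a plain binary cross-entropy standing in for $\mathcal{L}_{seg}$; shapes, rates and the Adam outer optimizer (a stand-in for the RAdam used in our setup) are purely illustrative.

```python
# Schematic sketch of the meta-learned test time optimization: inner SGD steps
# of Eq. (5) with one learning rate per kernel weight and per bias entry
# (Section 3.2.4), followed by the BPTT meta-update of Eq. (4).
import torch
import torch.nn.functional as F

# Meta-learned parameters theta_g = {theta_f^0, lambda}: model initialization
# and one learning rate per output channel of the weight plus one per bias.
theta0 = {'weight': torch.randn(8, 3, 3, 3, requires_grad=True),
          'bias':   torch.zeros(8, requires_grad=True)}
lrs    = {'weight': torch.full((8, 1, 1, 1), 1e-3, requires_grad=True),
          'bias':   torch.full((8,), 1e-3, requires_grad=True)}

def seg_loss(params, x, y):
    logits = F.conv2d(x, params['weight'], params['bias'], padding=1)
    return F.binary_cross_entropy_with_logits(logits, y)

def inner_finetune(x_train, y_train, T=5, second_order=False):
    # second_order=False drops the second order derivatives, as described above.
    params = dict(theta0)                                   # theta_f^0
    for _ in range(T):                                      # Eq. (5): vanilla SGD steps
        grads = torch.autograd.grad(seg_loss(params, x_train, y_train),
                                    list(params.values()),
                                    create_graph=second_order)
        params = {k: p - lrs[k] * g
                  for (k, p), g in zip(params.items(), grads)}
    return params                                           # theta_f^T

# One meta update on a single task (first frame -> remaining frames):
x_tr, y_tr = torch.rand(1, 3, 32, 32), torch.randint(0, 2, (1, 8, 32, 32)).float()
x_te, y_te = torch.rand(1, 3, 32, 32), torch.randint(0, 2, (1, 8, 32, 32)).float()

meta_opt = torch.optim.Adam(list(theta0.values()) + list(lrs.values()), lr=1e-4)
meta_opt.zero_grad()
seg_loss(inner_finetune(x_tr, y_tr), x_te, y_te).backward()  # Eq. (4): L_BPTT
meta_opt.step()
with torch.no_grad():                                        # keep learning rates non-negative
    for lr in lrs.values():
        lr.clamp_(min=0)
```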
The amount of additional trainable parameters is negligible for modern segmentation models as their total number of parameters typically exceeds $10^7$. In Algorithm 1 of the supplementary, we illustrate the full e-OSVOS training pipeline for a given VOS taskset $\mathcal{T}_{train}$.

3.3 Online Adaptation and Bounding Box Propagation

By nature, fine-tuning methods are prone to overfit to the given single-frame dataset $D_{train}^{n,k} = \{x_0^n, \hat{y}_0^{n,k}\}$. For sequences with changing object appearance or new similar objects entering the scene, such an overfitting often results in degrading recognition performance or drifting of the segmentation mask. However, e-OSVOS incorporates two test time techniques to overcome these problems.

Online Adaptation
Inspired by [34], we apply an online adaptation (OnA) which continuously fine-tunes the segmentation model on both the given first frame ground truth and past mask predictions. First, we fine-tune the model for $T$ iterations only on the first frame, which yields $\theta_f^T$, and then continue the fine-tuning every $I_{OnA}$ frames for $T_{OnA}$ additional iterations on the combined online dataset $D_{train}^{n,k} = \{x_0^n, \hat{y}_0^{n,k}\} \cup \{x_i^n, y_i^{n,k}\}$. In contrast to [34], our efficient test time optimization allows for a reset of the model to the first-frame model state $\theta_f^T$ before every additional fine-tuning. Such a reset avoids the accumulation of false positive pixels wrongly considered as ground truth. Our learned optimization $g$ generalizes to such an online adaptation without any additional meta learning.

Bounding Box Propagation
In analogy to [4], we extend the RPN proposals with the detected object boxes of the previous frame. To account for the changing position of the object, we augment the previous boxes with random spatial transformations. Starting with the first frame ground truth boxes, the frame-to-frame propagation facilitates the tracking of each object over the sequence.

4 Experiments

We demonstrate the applicability of e-OSVOS on three semi-supervised VOS benchmarks, namely, DAVIS 2016 [27], DAVIS 2017 [28] and YouTube-VOS [41]. The tasksets $\mathcal{T}$ for training and evaluation of e-OSVOS are constructed from the corresponding training, validation and test video sequences of each benchmark.

4.1 Datasets and Evaluation Metrics

DAVIS 2016
The DAVIS 2016 [27] benchmark consists of a training and a validation set with 30 and 20 single-object video sequences, respectively. Every sequence is captured at 24 frames per second (FPS) and semi-supervision is achieved by providing the respective first frame object mask.

DAVIS 2017
The DAVIS 2017 [28] benchmark extends DAVIS 2016 with 100 additional sequences, including dedicated test-dev and test sets. The validation, test-dev and test sets each consist of 30 sequences. The extended train set contains the remaining 60 video sequences. In addition, DAVIS 2017 contains a mix of single- and multi-object sequences with varying image resolutions.

YouTube-VOS
Our largest benchmark, YouTube-VOS [41], consists of 4453 video sequences, including dedicated test and validation sets with 508 and 474 sequences, respectively. As DAVIS 2017, this benchmark contains single- and multi-object sequences in multiple resolutions but provides segmentation ground truth only at 6 FPS. In general, [41] requires stronger tracking capabilities, as objects enter in the middle of the sequence or leave and reenter the frame entirely.

Evaluation Metrics
We evaluate the standard VOS metrics defined by [27]. For the intersection over union between predicted and ground truth masks, also known as the Jaccard index J (in %), we evaluate the mean as well as the decay over all frames. Furthermore, we report the mean contour accuracy F (in %), the combination of both mean metrics J&F (in %), and the frames per second (FPS) in Hz.
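As a reference for these metrics, the region similarity J of a single frame is the standard intersection over union of binary masks; a minimal NumPy sketch of J and its mean over a sequence follows. The contour accuracy F and the exact quartile-based decay definition are left to the official benchmark code of [27] and not reproduced here.

```python
# Minimal sketch of the region similarity J (Jaccard index / intersection over
# union) between binary masks, and its mean over a sequence. The contour
# accuracy F and the exact decay definition follow the official benchmark
# code of [27] and are not reproduced here.
import numpy as np

def jaccard(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                      # both masks empty: treated as a perfect match here
    return np.logical_and(pred, gt).sum() / union

def mean_jaccard(pred_masks, gt_masks):
    return float(np.mean([jaccard(p, g) for p, g in zip(pred_masks, gt_masks)]))
```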
4.2 Implementation Details

We conduct all experiments on a Mask R-CNN model with a ResNet50 [12] backbone and FPN [17], pre-trained on the COCO [18] segmentation dataset. In order to optimize the learning rates and model initialization jointly without overfitting, we follow previous VOS approaches such as [33, 23] and train e-OSVOS on YouTube-VOS combined with DAVIS 2017. To improve generalization, we construct training tasks $T^{n,k} \in \mathcal{T}_{train}$ by randomly sampling a single frame from a sequence and augmenting it with spatial and color transformations for each train and test dataset. Furthermore, for both DAVIS datasets, we fine-tune the meta learning of the model initialization for each dataset while keeping the previously learned learning rates fixed.

For the outer optimization we apply RAdam [19] with a fixed learning rate $\beta$, as shown in Algorithm 1 of the supplementary, on batches of 4 training tasks, each distributed to a Quadro RTX 6000 GPU, for a total of 4 days. To limit the computational effort, we ignore second order derivatives and fine-tune for $T = 5$ BPTT iterations. The learning rates are clamped to be non-negative after each meta update. The online adaptation (OnA) is applied every $I_{OnA} = 5$ steps for $T_{OnA} = 10$ iterations. To further boost inner-sequence generalization, we apply spatial random transformations as in [6] during the initial fine-tuning but not during the online adaptation. While the number of iterations is fixed to $T = 5$ during the meta learning, e-OSVOS generalizes to varying numbers of iterations and to the online adaptation without any further learning. To indicate the different versions of e-OSVOS, we denote the number of initial fine-tuning iterations and whether online adaptation is applied.

4.3 Ablation Study

We demonstrate the effect of the individual e-OSVOS components on the DAVIS 2017 validation set in Table 1. Both the parent and the meta training utilize the combined dataset of YouTube-VOS and DAVIS 2017. For a fair comparison between varying numbers of fine-tuning iterations, we refrain from any spatial random transformations at test time. The first row shows a handcrafted equivalent of the e-OSVOS test time optimization, for which we apply a grid search to find the optimal global fine-tuning learning rate. Note that this baseline is not representative of state-of-the-art fine-tuning VOS approaches, as we omit any additional handcrafted test time improvements [6, 22, 20, 34], such as layer-wise learning rates, learning rate scheduling, or contour snapping.

The handcrafted approach is inferior to meta learning the initialization and a single global learning rate, even for substantially more iterations. The neuron-level learning rates and the additional modifications to Mask R-CNN motivated in Section 3 both yield substantial segmentation performance gains. While the improvement from bounding box propagation is comparatively small, it only adds an insignificant amount of additional runtime. The marginal improvement from e-OSVOS-50 to e-OSVOS-100 clearly motivates the application of an online adaptation to combat overfitting to the first frame and the degrading performance over the course of the sequence.
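For reference, the online adaptation motivated above corresponds roughly to the following inference loop (Section 3.3, with $I_{OnA} = 5$ and $T_{OnA} = 10$ from Section 4.2); `finetune` and `predict_mask` are hypothetical helpers standing in for the meta-learned test time optimization and the Mask R-CNN forward pass, and the exact composition of the online dataset is a schematic interpretation rather than the released implementation.

```python
# Schematic outline of the online adaptation (OnA) of Section 3.3 with
# I_OnA = 5 and T_OnA = 10 as in Section 4.2. `finetune` and `predict_mask`
# are hypothetical helpers for the meta-learned test time optimization and
# the segmentation forward pass.
import copy

def run_sequence(model, frames, first_mask, finetune, predict_mask,
                 T=50, T_ona=10, I_ona=5):
    online_set = [(frames[0], first_mask)]                   # first frame ground truth
    model = finetune(model, online_set, iterations=T)        # initial one-shot fine-tuning
    first_frame_state = copy.deepcopy(model)                 # theta_f^T, reset point

    masks = [first_mask]
    for i in range(1, len(frames)):
        if i % I_ona == 0:
            online_set.append((frames[i - 1], masks[i - 1]))    # add a past mask prediction
            model = finetune(copy.deepcopy(first_frame_state),  # reset before adapting
                             online_set, iterations=T_ona)
        masks.append(predict_mask(model, frames[i]))
    return masks
```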
Table 1: Ablation study of each e-OSVOS component on the DAVIS 2017 validation set. The first row represents a handcrafted equivalent of our test time optimization. We present performance gains component-wise for 10 fine-tuning iterations and iteration-wise for our final e-OSVOS version.

| Method | Iterations (T) | J&F | Gain |
|---|---|---|---|
| Mask R-CNN + parent training + single LR search | 10 | 33.6 | |
|  | 50 | 39.7 | |
|  | 100 | 41.6 | |
|  | 1000 | 42.7 | |
| Mask R-CNN + learn model initialization and single LR | 10 | 64.4 | +30.8 |
| + learn neuron-level learning rates | 10 | 67.2 | +2.8 |
| + group normalization + Lovász-Softmax | 10 | 69.4 | +2.2 |
| + bounding box propagation (e-OSVOS) | 10 | 69.9 | +0.5 |
| + online adaptation (e-OSVOS-OnA) | 10 | 71.2 | +1.3 |
| e-OSVOS-T | 50 | 71.3 | |
|  | 100 | 71.2 | |
| e-OSVOS-T-OnA | 50 | 73.7 | |
|  | 100 | 74.8 | |

Figure 3: We evaluate e-OSVOS for an increasing number of initial fine-tuning iterations T on the DAVIS 2017 validation set (plotted against frames per second [Hz]). The first iterations yield the largest performance gains while still running at comparatively high frames per second rates.

In Figure 3, we further demonstrate the efficiency of e-OSVOS on the DAVIS 2017 validation set. The meta learning enables large gains in segmentation performance after only a few fine-tuning iterations without suffering from low frames per second rates. With an increasing number of iterations, the actual inference time of the sequence becomes negligible.

4.4 Benchmark Evaluation

We present state-of-the-art VOS results for fine-tuning methods on DAVIS 2016 and 2017 in Table 2 and for YouTube-VOS in Table 3. We focus our evaluation on fine-tuning and hence separate the results of methods without fine-tuning (FT). The overall state-of-the-art method STM [31], which does not leverage fine-tuning, currently surpasses all existing approaches in terms of performance and runtime. Nevertheless, we want to motivate fine-tuning as a concept which is applicable to further boost the results of methods like STM without harming their efficiency.

DAVIS 2016 and 2017
In terms of the important J metric, we outperform all previous one-shot fine-tuning approaches on the validation set while reducing the runtime by multiple orders of magnitude. It is important to note that, unlike our approach, all previous fine-tuning methods rely on post-processing or an ensemble of models to achieve optimal results. We even surpass PReMVOS [21], the longtime state-of-the-art VOS method, with a much simpler and more efficient fine-tuning approach. PReMVOS applies an additional contour snapping, as in [6], which explains its superiority in terms of contour accuracy F. On the test-dev set, all methods achieve substantially worse results in all metrics compared to the validation set. This is due to sequences that are more challenging with respect to instance identity preservation. We do not achieve state-of-the-art results for fine-tuning methods on the test-dev set. However, our approach still demonstrates the potential of fine-tuning, as we surpass most non-fine-tuning methods without applying any post-processing or an ensemble of models.

Table 2: VOS performance evaluation on the DAVIS 2016 and 2017 benchmarks. We categorize methods by their application of fine-tuning (FT) and post-processing (PP) of the predicted masks and label methods that use an ensemble of models. The table is ordered by J&F on DAVIS 2017 validation. The evaluation metrics are detailed in Sec. 4.1. If not publicly available, we adopted runtimes (FPS) from [3]. Column groups: DAVIS 2016 validation (16), DAVIS 2017 validation (17-val) and DAVIS 2017 test-dev (17-td).
| Method | FT | PP | J (16) | J Decay (16) | F (16) | FPS | J (17-val) | J Decay (17-val) | F (17-val) | J&F (17-val) | J (17-td) | J Decay (17-td) | F (17-td) | J&F (17-td) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FAVOS [8] | | | 82.4 | 4.5 | 79.5 | 0.56 | 54.6 | 14.1 | 61.8 | 58.2 | 42.9 | 18.1 | 44.2 | 43.6 |
| RGMP [24] | | | 81.5 | 10.9 | 82.0 | 7.70 | 64.8 | 18.9 | 68.6 | 66.7 | 51.3 | 34.3 | 54.4 | 52.8 |
| RVOS [24] | | | | | | | 57.5 | 24.9 | 63.6 | 60.6 | 47.9 | 35.7 | 52.6 | 50.3 |
| MetaVOS [3] | | | 81.5 | 5.0 | 82.7 | 4.0 | 63.9 | 14.4 | 70.7 | 67.3 | | | | |
| RANet [37] | | | 86.6 | 7.4 | 87.6 | 30.3 | 63.2 | 18.6 | 68.2 | 65.7 | 53.4 | 21.9 | 57.3 | 55.4 |
| FEELVOS [33] | | | 81.1 | 13.7 | 82.2 | 2.22 | 69.1 | 17.5 | 74.0 | 71.5 | 55.1 | 29.8 | 60.4 | 57.8 |
| MHP-VOS [42] | | | 87.6 | 6.9 | 89.5 | 0.01 | 73.4 | 17.8 | 78.9 | 76.1 | 66.4 | 18.0 | 72.7 | 69.5 |
| STM [31] | | | 88.7 | 5.0 | 90.1 | 6.25 | 79.2 | 8.0 | 84.3 | 81.7 | 69.3 | 16.9 | 75.2 | 72.2 |
| CINM [2] | | | 83.4 | 12.3 | 85.0 | 0.01 | 67.2 | 24.6 | 74.4 | 70.7 | 64.5 | 20.0 | 70.5 | 67.5 |
| Lucid [15] | | | 83.9 | 9.1 | 82.0 | 0.005 | 63.4 | 19.5 | 69.9 | 66.6 | | | | |
| MVOS [39] | | | 83.3 | | 84.1 | 4.0 | 56.3 | | 62.1 | 59.2 | | | | |
| OSVOS [6] | | | 79.8 | 14.9 | 80.6 | 0.11 | 56.6 | 26.1 | 63.9 | 60.3 | 47.0 | 19.2 | 54.8 | 50.9 |
| OSVOS-S [22] | | | 85.6 | 5.5 | 87.5 | 0.22 | 64.7 | 15.1 | 71.3 | 68.0 | 52.9 | 24.1 | 62.1 | 57.5 |
| OnAVOS [34] | | | 86.1 | 5.2 | 84.9 | 0.08 | 61.6 | 27.9 | 69.1 | 65.3 | 49.9 | 23.0 | 55.7 | 52.8 |
| PReMVOS [21] | | | 84.9 | 8.8 | 88.6 | 0.01 | 73.9 | 16.2 | 81.8 | 77.8 | 67.5 | 21.7 | 75.8 | 71.6 |
| e-OSVOS-10 | | | 85.1 | 5.0 | 84.8 | 5.3 | 69.2 | 18.5 | 74.6 | 71.9 | | | | |
| e-OSVOS-50 | | | 85.5 | 5.0 | 85.8 | 1.64 | 70.7 | 18.6 | 75.9 | 73.3 | | | | |
| e-OSVOS-50-OnA | | | 85.9 | 5.2 | 85.9 | 0.35 | 73.0 | 13.6 | 78.3 | 75.6 | 60.9 | 22.1 | 68.6 | 64.8 |
| e-OSVOS-100-OnA | | | 86.6 | 4.5 | 87.0 | 0.29 | 74.4 | 13.0 | 80.0 | 77.2 | | | | |

Table 3: VOS performance evaluated on the YouTube-VOS validation set. This benchmark additionally evaluates the performance on completely unseen object classes. Results of other methods are copied from [31].

| Method | FT | PP | Overall | J Seen | F Seen | J Unseen | F Unseen |
|---|---|---|---|---|---|---|---|
| OSMN [43] | | | 51.2 | 60.0 | 60.1 | 40.6 | 44.0 |
| MSK [26] | | | 53.1 | 59.9 | 59.5 | 45.0 | 47.9 |
| RGMP [24] | | | 53.8 | 59.5 | | 45.2 | |
| RVOS [32] | | | 56.8 | 63.6 | 67.2 | 45.5 | 51.0 |
| S2S [40] | | | 64.4 | 71.0 | 70.0 | 55.5 | 61.2 |
| A-GAME [14] | | | 66.1 | 67.8 | | 60.8 | |
| STM [31] | | | 79.4 | 79.7 | 84.2 | 72.8 | 80.9 |
| OnAVOS [34] | | | 55.2 | 60.1 | 62.7 | 46.6 | 51.4 |
| OSVOS [6] | | | 58.8 | 59.8 | 60.5 | 54.2 | 60.7 |
| PReMVOS [21] | | | 66.9 | 71.4 | 75.9 | 56.5 | 63.7 |
| e-OSVOS-50-OnA | | | 71.4 | 71.7 | 66.0 | 74.3 | 73.8 |

YouTube-VOS
On the more challenging YouTube-VOS dataset, our approach yields overall better results than all previous fine-tuning methods. In particular, PReMVOS suffers from inferior performance on unseen object classes. This indicates that our meta learned initialization provides a superior fine-tuning initialization which is less prone to overfitting. It should be noted that some methods were evaluated on an earlier version of the YouTube-VOS benchmark, which causes slight variations in the final results.

5 Conclusion

This work demonstrates the application of meta learning to VOS fine-tuning and makes one-shot video object segmentation efficient again. We first motivate our model choice, a modified Mask R-CNN instead of a fully convolutional segmentation model. Furthermore, we meta learn the model initialization and a set of neuron-level learning rates. In addition, e-OSVOS applies common test time techniques which mitigate performance degradation, namely, an online adaptation with continuous fine-tuning and a bounding box propagation. We demonstrate the best performance amongst fine-tuning methods and aspire to reignite research in this promising approach to semi-supervised VOS.

Acknowledgements

This research was funded by the Humboldt Foundation through the Sofja Kovalevskaja Award.
Broader Impact

Authors are asked to include a section in their submissions discussing the broader impact of their work, including possible societal consequences, both positive and negative.

Many methods for video object segmentation or multiple object tracking rely on appearance models of objects. In this work, we have shown that one can rely on the simple but elegant solution of fine-tuning a model as a way to build appearance models. Semi-supervised video object segmentation is often used to automate video editing, e.g., to remove one object from a video. While it is clear that more automatic methods would have a positive impact in reducing the manual work needed to perform such video edits, there is also potential to misuse such technology. One could imagine the creation of fake videos, where objects are taken out of or placed into the scene to create out-of-context content that might lead viewers to misinterpret the situation. Nonetheless, we believe the technology is still in early stages and far from being able to create fake content without substantial knowledge and manual work. Therefore, we believe that, for this particular task, the benefits outweigh the potential misuses of the technology.

Appearance models are also key to tackling multi-object tracking and segmentation, which is important for applications such as robotics. For example, social robots are often tasked with following one specific person, hence the robot has to learn fast and on the fly the appearance of the specific person that it has to follow. This can be extended to multiple people tracking, where each model would be fine-tuned to a specific person in the scene. Segmentation of an object of interest is also key for robotic tasks such as grasping or any object-robot interaction. But multi-object tracking and video object segmentation also have a dark side, with applications such as illegal surveillance. We want to note that our method does not make use of any kind of identifying characteristic of a person (if a person were the object to follow and segment). Therefore, we believe our technology does not directly contribute to nor promote these kinds of misuses.

We believe that the simple concept of fine-tuning a model to a specific object is incredibly powerful. With our work, we hope to inspire researchers to continue with that paradigm, now that we can properly train it to achieve state-of-the-art results. Looking at the impact that these tools can have on society, one can see extremely positive things such as the realization of social robots that could help the elderly in their daily chores.

References

[1] Antreas Antoniou, Harrison Edwards, and Amos Storkey. How to train your MAML. In International Conference on Learning Representations, 2019.
[2] Linchao Bao, Baoyuan Wu, and Wei Liu. CNN in MRF: Video object segmentation via inference in a CNN-based higher-order spatio-temporal MRF. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 5977-5986, 2018.
[3] Harkirat Singh Behl, Mohammad Najafi, Anurag Arnab, and Philip H. S. Torr. Meta learning deep visual words for fast video object segmentation. In NeurIPS 2019 Workshop on Machine Learning for Autonomous Driving, 2018.
[4] Philipp Bergmann, Tim Meinhardt, and Laura Leal-Taixé. Tracking without bells and whistles. In IEEE International Conference on Computer Vision (ICCV), October 2019.
[5] Maxim Berman, Amal Rannen Triki, and Matthew B. Blaschko.
The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4413-4421, 2018.
[6] Sergi Caelles, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, and Luc Van Gool. One-shot video object segmentation. In IEEE Conf. on Computer Vision and Pattern Recognition, 2017.
[7] Yuhua Chen, Jordi Pont-Tuset, Alberto Montes, and Luc Van Gool. Blazingly fast video object segmentation with pixel-wise metric learning. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 1189-1198, 2018.
[8] J. Cheng, Y.-H. Tsai, W.-C. Hung, S. Wang, and M.-H. Yang. Fast and accurate online video object segmentation via tracking parts. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[9] J. Cheng, Y.-H. Tsai, S. Wang, and M.-H. Yang. SegFlow: Joint learning for video object segmentation and optical flow. In Int. Conf. on Computer Vision, 2017.
[10] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conf. on Machine Learning (ICML), pages 1126-1135, 2017.
[11] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), pages 2980-2988, 2017.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conf. on Computer Vision and Pattern Recognition, 2016.
[13] Yuan-Ting Hu, Jia-Bin Huang, and Alexander G. Schwing. VideoMatch: Matching based video object segmentation. In European Conf. on Computer Vision, 2018.
[14] Joakim Johnander, Martin Danelljan, Emil Brissman, Fahad Shahbaz Khan, and Michael Felsberg. A generative appearance model for end-to-end video object segmentation. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 8945-8954, 2019.
[15] Anna Khoreva, Rodrigo Benenson, Eddy Ilg, Thomas Brox, and Bernt Schiele. Lucid data dreaming for video object segmentation. International Journal of Computer Vision, 127(9):1175-1197, September 2019.
[16] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv.org, July 2017.
[17] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In IEEE Conf. on Computer Vision and Pattern Recognition, July 2017.
[18] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and Larry Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), September 2014.
[19] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. In International Conference on Learning Representations (ICLR), April 2020.
[20] Jonathon Luiten, Paul Voigtlaender, and Bastian Leibe. PReMVOS: Proposal-generation, refinement and merging for video object segmentation. In Asian Conference on Computer Vision, 2018.
[21] Jonathon Luiten, Paul Voigtlaender, and Bastian Leibe.
PReMVOS: Proposal-generation, refinement and merging for video object segmentation. In Asian Conference on Computer Vision, 2018.
[22] Kevis-Kokitsi Maninis, Sergi Caelles, Yuhua Chen, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, and Luc Van Gool. Video object segmentation without temporal information. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018.
[23] Seoung Wug Oh, Joon-Young Lee, Ning Xu, and Seon Joo Kim. Video object segmentation using space-time memory networks, April 2019.
[24] Seoung Wug Oh, Joon-Young Lee, Kalyan Sunkavalli, and Seon Joo Kim. Fast video object segmentation by reference-guided mask propagation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[25] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung. Learning video object segmentation from static images. In IEEE Conf. on Computer Vision and Pattern Recognition, 2017.
[26] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung. Learning video object segmentation from static images. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 3491-3500, 2017.
[27] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 724-732, June 2016.
[28] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbelaez, Alexander Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv, abs/1704.00675, 2017.
[29] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conf. on Learning Representations, 2017.
[30] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, December 2015.
[31] Seoung Wug Oh, Joon-Young Lee, Ning Xu, and Seon Joo Kim. Video object segmentation using space-time memory networks. In Int. Conf. on Computer Vision, 2019.
[32] Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques, and Xavier Giro-i-Nieto. RVOS: End-to-end recurrent network for video object segmentation. In IEEE Conf. on Computer Vision and Pattern Recognition, June 2019.
[33] Paul Voigtlaender, Yuning Chai, Florian Schroff, Hartwig Adam, Bastian Leibe, and Liang-Chieh Chen. FEELVOS: Fast end-to-end embedding learning for video object segmentation. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019.
[34] Paul Voigtlaender and Bastian Leibe. Online adaptation of convolutional neural networks for video object segmentation. In BMVC, 2017.
[35] W. Wang, J. Shen, F. Porikli, and R. Yang. Semi-supervised video object segmentation with super-trajectories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(4):985-998, April 2019.
[36] Ziqin Wang, Jun Xu, Li Liu, Fan Zhu, and Ling Shao. RANet: Ranking attention network for fast video object segmentation. In Int. Conf. on Computer Vision, October 2019.
[37] Ziqin Wang, Jun Xu, Li Liu, Fan Zhu, and Ling Shao. RANet: Ranking attention network for fast video object segmentation. In IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[38] Yuxin Wu and Kaiming He.
Group normalization. In ECCV, 2018.
[39] Huaxin Xiao, Bingyi Kang, Yu Liu, Maojun Zhang, and Jiashi Feng. Online meta adaptation for fast video object segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1-1, January 2019.
[40] Ning Xu, Linjie Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian L. Price, Scott Cohen, and Thomas S. Huang. YouTube-VOS: Sequence-to-sequence video object segmentation. arXiv, abs/1809.00461, 2018.
[41] Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas S. Huang. YouTube-VOS: A large-scale video object segmentation benchmark. arXiv, abs/1809.03327, 2018.
[42] Shuangjie Xu, Daizong Liu, Linchao Bao, Wei Liu, and Pan Zhou. MHP-VOS: Multiple hypotheses propagation for video object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[43] L. Yang, Y. Wang, X. Xiong, J. Yang, and A. K. Katsaggelos. Efficient video object segmentation via network modulation. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 6499-6507, 2018.
[44] Linjie Yang, Yanran Wang, Xuehan Xiong, Jianchao Yang, and Aggelos K. Katsaggelos. Efficient video object segmentation via network modulation. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 6499-6507, 2018.
[45] Mingzhang Yin, George Tucker, Mingyuan Zhou, Sergey Levine, and Chelsea Finn. Meta-learning without memorization. CoRR, abs/1912.03820, 2019.