# Mix-and-Match Tuning for Self-Supervised Semantic Segmentation

Xiaohang Zhan, Ziwei Liu, Ping Luo, Xiaoou Tang, Chen Change Loy
Department of Information Engineering, The Chinese University of Hong Kong
{zx017, lz013, pluo, xtang, ccloy}@ie.cuhk.edu.hk
Project page: http://mmlab.ie.cuhk.edu.hk/projects/M&M/

Deep convolutional networks for semantic image segmentation typically require large-scale labeled data, e.g., ImageNet and MS COCO, for network pre-training. To reduce annotation efforts, self-supervised semantic segmentation has recently been proposed to pre-train a network without any human-provided labels. The key to this new form of learning is to design a proxy task (e.g., image colorization) from which a discriminative loss can be formulated on unlabeled data. Many proxy tasks, however, lack the critical supervision signals that could induce discriminative representations for the target image segmentation task. Thus self-supervision's performance is still far from that of supervised pre-training. In this study, we overcome this limitation by incorporating a mix-and-match (M&M) tuning stage in the self-supervision pipeline. The proposed approach is readily pluggable into many self-supervision methods and does not use more annotated samples than the original process. Yet, it is capable of boosting the performance of the target image segmentation task to surpass its fully-supervised pre-trained counterpart. The improvement is made possible by better harnessing the limited pixel-wise annotations in the target dataset. Specifically, we first introduce the mix stage, which sparsely samples and mixes patches from the target set to reflect the rich and diverse local patch statistics of target images. A match stage then forms a class-wise connected graph, which can be used to derive a strong triplet-based discriminative loss for fine-tuning the network. Our paradigm follows the standard practice in existing self-supervised studies, and no extra data or labels are required. With the proposed M&M approach, for the first time, a self-supervision method achieves comparable or even better performance than its ImageNet pre-trained counterpart on both the PASCAL VOC2012 and Cityscapes datasets.

## Introduction

Semantic image segmentation is a classic computer vision task that aims at assigning each pixel in an image a class label such as "chair", "person", and "dog". It enjoys a wide spectrum of applications, such as scene understanding (Li, Socher, and Fei-Fei 2009; Lin et al. 2014; Li et al. 2017b) and autonomous driving (Geiger et al. 2013; Cordts et al. 2016; Li et al. 2017a). Deep convolutional neural networks (CNNs) are now the state-of-the-art technique for semantic image segmentation (Long, Shelhamer, and Darrell 2015; Liu et al. 2015; Zhao et al. 2017; Liu et al. 2017). The excellent performance, however, comes at the price of expensive and laborious label annotation. In most existing pipelines, a network is usually first pre-trained on millions of class-labeled images, e.g., ImageNet (Russakovsky et al. 2015) and MS COCO (Lin et al. 2014), and subsequently fine-tuned with thousands of pixel-wise annotated images. Self-supervised learning is a new paradigm proposed for learning deep representations without extensive annotations.
This new technique has been applied to the task of image segmentation (Zhang, Isola, and Efros 2016a; Larsson, Maire, and Shakhnarovich 2016; 2017). In general, self-supervised image segmentation can be divided into two stages: the proxy stage and the fine-tuning stage. The proxy stage does not need any labeled data, but requires one to design a proxy or pretext task with self-derived supervisory signals on unlabeled data. For instance, learning by colorization (Larsson, Maire, and Shakhnarovich 2017) utilizes the fact that a natural image is composed of a luminance channel and chrominance channels. The proxy task is formulated with a cross-entropy loss to predict an image's chrominance from the luminance of the same image. In the fine-tuning stage, the learned representations are used to initialize the target semantic segmentation network. The network is then fine-tuned with pixel-wise annotations. It has been shown that without large-scale class-labeled pre-training, semantic image segmentation can still gain encouraging performance over random initialization or from-scratch training.

Though promising, the performance of self-supervised learning is still far from that achieved by supervised pre-training. For instance, a VGG-16 network trained with the self-supervised method of (Larsson, Maire, and Shakhnarovich 2017) achieves a 56.0% mean Intersection over Union (mIoU) on the PASCAL VOC 2012 segmentation benchmark (Everingham et al. 2010), higher than a randomly initialized network that only yields 35.0% mIoU. However, an identical network trained on ImageNet achieves 64.2% mIoU. There exists a considerable gap between self-supervised and purely supervised pre-training.

We believe that the performance discrepancy is mainly caused by the semantic gap between the proxy task and the target task. Taking learning by colorization as an example, the goal of the proxy task is to colorize gray-scale images. The representations learned from colorization may be well-suited for modeling color distributions, but are likely poor at discriminating high-level semantics. For instance, as shown in Fig. 1(a), a red car can be arbitrarily more similar to a red bus than to a blue car. The features of the car and bus classes are heavily overlapping, as depicted by the feature embedding in the left plot of Fig. 1(b).

Figure 1: (a) Samples of patches from the categories bus and car; the two categories have similar color distributions but different patch statistics. (b) Deep feature distributions of bus and car, before and after mix-and-match, visualized with t-SNE (Maaten and Hinton 2008). Best viewed in color.

Improving the performance of self-supervised image segmentation requires one to improve the discriminative power of the representation tailored to the target task. This goal is non-trivial: the target set's pixel-wise annotations are discriminative for this purpose, but they are often available only in small quantities, typically thousands of labeled images. Existing approaches typically use a pixel-wise softmax loss to exploit pixel-wise annotations for fine-tuning a network. This strategy may be sufficient for a network that is well-initialized by supervised pre-training, but can be inadequate for a self-supervised network whose features are weak.
We argue that the pixel-wise softmax loss is not the sole way of harnessing the information provided by pixel-wise annotations. In this study, we present a new learning strategy called mix-and-match (M&M), which helps harness the scarce labeled information of a target set to improve the performance of networks pre-trained by self-supervised learning. M&M learning is conducted after the proxy stage and before the usual target fine-tuning stage, serving as an intermediate step to bridge the gap between the proxy and target tasks. It is noteworthy that M&M uses only the target images and their labels; thus no additional annotation is required.

The essence of M&M is inspired by metric learning. In the mix step, we randomly sample a large number of local patches from the target set and mix them together. The patch set is formed across images, thus decoupling any intra-image dependency to faithfully reflect the diverse and rich target distribution. Extracting patches also allows us to generate a massive number of triplets from the small target image set to produce stable gradients for training our network. In the match step, we form a graph whose nodes are patches represented by their deep features. An edge between nodes is defined as attractive if the nodes share the same class label; otherwise, it is a rejective edge. We enforce a class-wise connected graph, that is, all nodes from the same class in the graph compose a connected subgraph, as shown in Fig. 3(c). This ensures global consistency in triplet selection coherent with the class labels. With the graph, we can derive a robust triplet loss that encourages the network to map each patch to a point in feature space so that patches belonging to the same class lie close together while patches of different classes are separated by a wide margin. The way we sample triplets from a class-wise connected graph differs significantly from the existing approach (Schroff, Kalenichenko, and Philbin 2015), which forms multiple disconnected subgraphs for each class.

We summarize our contributions as follows.
1) We formulate a novel mix-and-match tuning method which, for the first time, allows networks pre-trained with self-supervised learning to outperform their supervised-learning counterparts. Specifically, with VGG-16 as the backbone network and image colorization as the proxy task, our M&M method achieves 64.5% mIoU on the PASCAL VOC2012 dataset, outperforming the ImageNet pre-trained network that achieves 64.2% mIoU. Our method also obtains 66.4% mIoU on the Cityscapes dataset, comparable to the 67.9% mIoU achieved by an ImageNet pre-trained network. This improvement is significant considering that our approach is based on unsupervised pre-training.
2) Apart from the learning-by-colorization method, M&M also improves the learning-by-context method (Noroozi and Favaro 2016) by a large margin.
3) In the setting of random initialization, our method achieves significant improvements with both AlexNet and VGG-16, on both PASCAL VOC2012 and Cityscapes. This makes training semantic segmentation networks from scratch feasible.
4) In addition to the new notion of mix-and-match, we present a triplet selection mechanism based on a class-wise connected graph, which is more robust than the conventional selection scheme for our task.

## Related Work

Self-supervision. It is a standard and established practice to pre-train a deep network with large-scale class-labeled images (e.g., ImageNet) before fine-tuning the model for other visual tasks.
Recent research efforts are gearing towards reducing the degree of, or eliminating, supervised pre-training altogether. Among various alternatives, self-supervised learning is gaining substantial interest. To enable self-supervised learning, proxy tasks are designed so that meaningful representations can be induced from the problem-solving process. Popular proxy tasks include sample reconstruction (Pathak et al. 2016b), temporal correlation (Wang and Gupta 2015; Pathak et al. 2016a), learning by context (Doersch, Gupta, and Efros 2015; Noroozi and Favaro 2016), cross-transform correlation (Dosovitskiy et al. 2015), and learning by colorization (Zhang, Isola, and Efros 2016a; 2016b; Larsson, Maire, and Shakhnarovich 2016; 2017). In this study, we do not design a new proxy task, but present an approach that uplifts the discriminative power of a self-supervised network tailored to the image segmentation task. We demonstrate the effectiveness of M&M on learning by colorization and learning by context.

Weakly-supervised segmentation. There exists a rich body of literature that investigates approaches for reducing annotations in learning deep models for the task of image segmentation. Alternative annotations such as points (Bearman et al. 2016), bounding boxes (Dai, He, and Sun 2015; Papandreou et al. 2015), scribbles (Lin et al. 2016), and videos (Hong et al. 2017) have been explored as cheap supervision to replace the pixel-wise counterpart. Note that these methods still require ImageNet classification as a pre-training task. Self-supervised learning is more challenging in that no image-level supervision is provided in the pre-training stage. The proposed M&M approach is dedicated to improving the weak representation learned by self-supervised pre-training.

Graph-based segmentation. Graph-based image segmentation has been explored since the early years of the field. The main idea is to exploit dependencies between pixels. Different from the conventional graph built on pixels or superpixels of a single image, the proposed method defines the graph on image patches sampled from multiple images. We do not partition an image by performing cuts on a graph, but use the graph to select triplets for the proposed discriminative loss.

## Mix-and-Match Tuning

Figure 2 illustrates the proposed approach, where (a) and (c) depict the conventional stages for self-supervised semantic image segmentation, while (b) shows the proposed mix-and-match (M&M) tuning. Specifically, in (a), a proxy task, e.g., learning by colorization, is designed to pre-train the CNN using unlabeled images. In (c), the pre-trained CNN is fine-tuned on the images and the associated per-pixel label maps of a target task. This work inserts M&M tuning between the proxy task and the target task, as shown in (b). It is noteworthy that M&M uses the same target images and label maps as in (c); hence no additional data is required. As the name implies, M&M tuning consists of two steps, namely mix and match. We explain these steps as follows.

### The Mix Step

Patch Sampling. Recall that our goal is to better harness the information in the pixel-wise annotations of the target set. Image patches have long been considered a strong visual primitive (Singh, Gupta, and Efros 2012) that incorporates both appearance and structure information.
Visual patches have been successfully applied to various tasks in visual understanding (Li, Wu, and Tu 2013). Inspired by these pioneering works, the first step of M&M tuning is designed as a mix step that samples patches across images. The relation between these patches can be exploited for optimization in the subsequent match operation. More precisely, a large number of image patches with various spatial sizes are randomly sampled from a batch of images. Heavily overlapped patches are discarded. These patches are represented by the features extracted from the CNN pre-trained in the stage of Fig. 2(a), and each is assigned a class label based on the corresponding label map. The patches across all images are mixed to decouple any intra-image dependency, so as to reflect the diverse and rich target distribution. The mixed patches are subsequently utilized as the input for the match operation.

Figure 2: An overview of the mix-and-match approach. Our approach starts with a self-supervised proxy task (a) and uses the learned CNN parameters to initialize the CNN in mix-and-match tuning (b). Given an image batch with label maps of the target task, we select and mix image patches, then match them according to their classes via a class-wise connected graph. The matching gives rise to a triplet loss, which is optimized to tune the parameters of the network via back-propagation. Finally, the modified CNN parameters are further fine-tuned on the target task (c).
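The mix step can be summarized with a short sketch. The snippet below is a minimal Python illustration of the sampling procedure described above, not the authors' released implementation; the minimum patch size and the overlap threshold are our own assumptions, since the text only states that patches have various sizes and that heavily overlapped patches are discarded.

```python
import random
import numpy as np

def box_overlap(a, b):
    """Intersection-over-union of two square boxes given as (y, x, size)."""
    ya, xa, sa = a
    yb, xb, sb = b
    ih = max(0, min(ya + sa, yb + sb) - max(ya, yb))
    iw = max(0, min(xa + sa, xb + sb) - max(xa, xb))
    inter = ih * iw
    return inter / float(sa * sa + sb * sb - inter)

def mix_patches(label_maps, patches_per_image=10, min_size=64, max_overlap=0.5):
    """Mix step (sketch): sample patch boxes from every image in the batch,
    label each box with the class of its central pixel, and pool ("mix")
    the boxes from all images into one flat list."""
    mixed = []  # entries: (image_index, (y, x, size), class_label)
    for idx, lbl in enumerate(label_maps):
        h, w = lbl.shape
        kept = []
        for _ in range(100 * patches_per_image):            # retry budget
            if len(kept) == patches_per_image:
                break
            s = random.randint(min_size, min(h, w))          # random spatial size
            y, x = random.randint(0, h - s), random.randint(0, w - s)
            if any(box_overlap((y, x, s), b) > max_overlap for b in kept):
                continue                                      # discard heavy overlaps
            kept.append((y, x, s))
            mixed.append((idx, (y, x, s), int(lbl[y + s // 2, x + s // 2])))
    random.shuffle(mixed)                                     # decouple intra-image order
    return mixed

# Toy usage with two random 256x256 label maps (21 PASCAL classes assumed).
boxes = mix_patches([np.random.randint(0, 21, (256, 256)) for _ in range(2)])
```

In the actual pipeline the sampled boxes are cropped, resized, and passed through the pre-trained CNN to obtain their feature representations, as described in the implementation details later.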
### The Match Step

Perceptual Patch Graph. Our next goal is to exploit the patches to generate stable gradients for tuning the network. This is possible since the patches belong to different classes, and this relation can be employed to form a massive number of triplets. A triplet is denoted as (Pa, Pp, Pn), where Pa is an anchor patch, Pp is a positive patch that shares the same label as Pa, and Pn is a negative patch with a different class label. With the triplets, one can formulate a discriminative triplet loss for fine-tuning the network.

A conventional way of sampling triplets is to follow the notion of Schroff, Kalenichenko, and Philbin (2015). For convenience, we call this strategy random triplets. In this strategy, triplets are randomly picked from the input batch. For instance, as shown in Fig. 3(a), nodes {1, 2} and an arbitrary negative patch form a triplet, and nodes {3, 4} and another negative patch form another triplet. As can be seen, there is no positive connection between nodes {1, 2} and {3, 4} despite them sharing a common class label. While the distance within each triplet is optimized locally, the boundary of the positive class can be loose, since the global constraint (i.e., that all nodes {1, 2, 3, 4} must lie close together) is not enforced. We term this phenomenon global inconsistency. Empirically, we found that this approach tends to perform worse than the proposed method, which is introduced next.

The proposed match step draws triplets in a different way from the conventional approach (Schroff, Kalenichenko, and Philbin 2015). In particular, the match step begins with graph construction based on the mixed patches. For each CNN learning iteration, we construct a graph on-the-fly given a batch of input images. The nodes of the graph are patches. Two types of edges are defined between nodes: a) attractive, if two nodes have an identical class label, and b) rejective, if two nodes have different class labels. Different from (Schroff, Kalenichenko, and Philbin 2015), we enforce the graph to be connected and, importantly, class-wise connected. That is, all nodes from the same class in the graph compose a connected subgraph via attractive edges. We adopt an iterative strategy to create such a graph. At first, the graph is initialized to be empty. Then, as shown in Fig. 3(b), patches are absorbed individually into the graph as nodes, and each new node creates one attractive and one rejective edge with existing nodes in the graph. An example of an established graph is shown in Fig. 3(c). Considering nodes {1, 2, 3, 4} again, unlike random triplets, the nodes form a connected subgraph. Different classes, represented by green and pink nodes, also form coherent clusters based on their respective classes, imposing tighter constraints than random triplets. To fully realize such class-wise constraints, each node in the graph takes turns serving as an anchor for loss optimization. An added benefit of permitting every node as a possible anchor candidate is the improved utilization efficiency of patch relations compared to random triplets.

Figure 3: Different strategies of drawing triplets. The colors of the nodes represent their labels. Blue and red edges denote attractive and rejective edges, respectively. (a) depicts the random triplet strategy (Schroff, Kalenichenko, and Philbin 2015), where nodes from the same class do not necessarily form a connected subgraph. (b-i) and (b-ii) show the proposed triplet selection strategy: a class-wise connected graph is constructed to sample triplets, which enforces tighter constraints on the positive class boundary. Details are explained in the main text of the methodology section. Best viewed in color.

### The Tuning Loss

Loss function. To optimize the semantic consistency within the graph, for any two nodes connected by an attractive edge, we seek to minimize their distance in the feature space; for any two nodes connected by a rejective edge, the distance should be maximized. For a node that connects to two other nodes via an attractive and a rejective edge, we denote it as an anchor, while the two connected nodes are denoted as positive and negative, respectively. These three nodes are grouped into a triplet. When constructing the graph, we ensure that each node can serve as an anchor, except for those nodes whose labels are unique among all the nodes. Thus, the number of nodes equals the number of triplets. Assume that in each iteration we discover N triplets in the graph. By converting the graph optimization problem into triplet ranking (Schroff, Kalenichenko, and Philbin 2015), we formulate our loss function as follows:

$$\sum_{i} \max\left( D\!\left(P_a^i, P_p^i\right) - D\!\left(P_a^i, P_n^i\right) + \alpha,\ 0 \right), \qquad (1)$$

where Pa, Pp, Pn denote the anchor, positive, and negative nodes in a triplet, α is a regularization factor controlling the distance margin, and D(·, ·) is a distance metric measuring the patch relationship. In this work, we leverage perceptual distance (Gatys, Ecker, and Bethge 2015) to characterize the relationship between patches. This is different from previous works (Singh, Gupta, and Efros 2012; Doersch et al. 2012) that define patch distance using low-level cues (e.g., colors and edges). Specifically, the perceptual representation can be formulated as f: P → x, where f denotes a convolutional neural network (CNN) and x denotes the extracted representation. D(Pi, Pj) is the perceptual distance between two patches, which can be formulated as:

$$D(P_i, P_j) = \left\lVert \frac{x_i}{\lVert x_i \rVert_2} - \frac{x_j}{\lVert x_j \rVert_2} \right\rVert_2^2, \qquad (2)$$

where xi and xj are the CNN representations extracted from patches Pi and Pj. L2 normalization is used here for calculating Euclidean distances. By optimizing the triplet ranking loss, our perceptual patch graph converges to both intra-class and inter-class semantic consistency.
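To make the graph construction and the loss concrete, here is a minimal PyTorch sketch written under our own reading of the iterative construction described above (each incoming node is linked to one earlier same-class node and one earlier different-class node); it is an illustration rather than the authors' code. The distance follows Eq. (2) and the hinge term follows Eq. (1), with the loss averaged over triplets.

```python
import torch
import torch.nn.functional as F

def classwise_triplets(labels):
    """Match step (sketch): absorb nodes one by one, linking each new node
    with one attractive edge (an earlier node of the same class) and one
    rejective edge (an earlier node of a different class). Every node then
    acts as an anchor; a node whose label is unique is duplicated as its
    own positive. Returns a list of (anchor, positive, negative) indices."""
    triplets = []
    for i, li in enumerate(labels):
        same = [j for j in range(i) if labels[j] == li]
        diff = [j for j in range(i) if labels[j] != li]
        pos = same[-1] if same else i                       # duplicate if unique so far
        if diff:
            neg = diff[-1]
        else:                                               # no earlier negative: look ahead
            later = [j for j in range(len(labels)) if labels[j] != li]
            neg = later[0] if later else i                  # degenerate single-class batch
        triplets.append((i, pos, neg))
    return triplets

def mm_triplet_loss(feats, labels, margin=2.1):
    """Triplet ranking loss (Eq. 1) with the perceptual distance of Eq. (2):
    squared Euclidean distance between L2-normalized patch features."""
    x = F.normalize(feats, p=2, dim=1)                      # x_i / ||x_i||_2
    idx = torch.tensor(classwise_triplets(labels), device=feats.device)
    a, p, n = x[idx[:, 0]], x[idx[:, 1]], x[idx[:, 2]]
    d_ap = ((a - p) ** 2).sum(dim=1)                        # D(P_a, P_p)
    d_an = ((a - n) ** 2).sum(dim=1)                        # D(P_a, P_n)
    return F.relu(d_ap - d_an + margin).mean()              # Eq. (1), averaged over triplets
```

Because every node contributes exactly one triplet, the number of triplets equals the number of nodes, consistent with the description above.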
M&M implementation details. We use both AlexNet (Krizhevsky, Sutskever, and Hinton 2012) and VGG-16 (Simonyan and Zisserman 2015) as our backbone CNN architectures, as illustrated in Fig. 2. For initialization, we try random initialization and two proxy tasks: Jigsaw Puzzles (Noroozi and Favaro 2016) and Colorization (Larsson, Maire, and Shakhnarovich 2017). From a batch of 16 images in each CNN iteration, we sample 10 patches per image with various sizes and resize them to a fixed size of 128 × 128. We then extract the pool5 features of these patches from the CNN for later use. We assign each patch the label of its central pixel using the corresponding label map. We then perform the iterative strategy to construct the graph, as discussed in the methodology section. We use each node in the graph as an anchor, which is made possible by our graph construction strategy. If a node's label is unique among all the nodes, we duplicate the node as its own positive counterpart. In this way, we obtain a batch of meaningful triplets whose number equals the number of nodes, and feed them into a triplet loss layer whose margin α is set to 2.1. M&M tuning is conducted for 8,000 iterations on the PASCAL VOC2012 or Cityscapes training set. The learning rate is fixed at 0.01 before iteration 6,000 and then dropped to 0.001. We apply batch normalization to speed up convergence.

Segmentation fine-tuning details. Finally, we fine-tune the CNN on the semantic segmentation task. For AlexNet, we follow the same setting as presented in (Noroozi and Favaro 2016), and for VGG-16, we follow (Larsson, Maire, and Shakhnarovich 2017), whose architecture is equipped with hypercolumns (Hariharan et al. 2015). The fine-tuning process runs for 40k iterations, with an initial learning rate of 0.01, dropped by a factor of 10 at iterations 24k and 36k. We keep tuning the batch normalization layers before pool5. All experiments follow the same setting.
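Putting the pieces together, the loop below sketches one possible M&M tuning schedule, reusing `mix_patches` and `mm_triplet_loss` from the sketches above. Only the hyper-parameters (a batch of 16 images, 10 patches each, 128 × 128 crops, margin 2.1, 8,000 iterations, learning rate 0.01 dropped to 0.001 at iteration 6,000) come from the text; `backbone`, `loader`, the SGD momentum, and the cropping helper are placeholders of our own.

```python
import torch
import torch.nn.functional as F

def crop_and_resize(images, boxes, size=128):
    """Crop the sampled boxes from a [B, 3, H, W] batch and resize to 128x128."""
    crops, labels = [], []
    for img_idx, (y, x, s), label in boxes:
        crop = images[img_idx:img_idx + 1, :, y:y + s, x:x + s]
        crops.append(F.interpolate(crop, size=(size, size), mode='bilinear',
                                   align_corners=False))
        labels.append(label)
    return torch.cat(crops), labels

def mm_tuning(backbone, loader, iterations=8000):
    """M&M tuning loop (sketch). `backbone` maps 128x128 patches to pool5
    feature vectors and is initialized from the proxy task; `loader` yields
    (images, label_maps) batches of 16 images indefinitely."""
    opt = torch.optim.SGD(backbone.parameters(), lr=0.01, momentum=0.9)
    for it, (images, label_maps) in zip(range(iterations), loader):
        if it == 6000:                                      # learning-rate drop
            for group in opt.param_groups:
                group['lr'] = 0.001
        boxes = mix_patches(label_maps)                     # mix step (sketch above)
        patches, labels = crop_and_resize(images, boxes)
        feats = backbone(patches)                           # pool5 features, one per patch
        loss = mm_triplet_loss(feats, labels, margin=2.1)   # match step + Eq. (1)
        opt.zero_grad()
        loss.backward()
        opt.step()
```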
## Experiments

Settings. Different proxy tasks are combined with our M&M tuning to demonstrate its merits. In our experiments, as initialization, we use the released models of different proxy tasks from learning by context (or Jigsaw Puzzles) (Noroozi and Favaro 2016) and learning by colorization (Larsson, Maire, and Shakhnarovich 2017). Both methods adopt the 1.3 million unlabeled images of the ImageNet dataset (Deng et al. 2009) for training. Besides that, we also perform experiments on randomly initialized networks. In M&M tuning, we make use of the PASCAL VOC2012 dataset (Everingham et al. 2010), which consists of 10,582 training samples with pixel-wise annotations. The same dataset is used in (Noroozi and Favaro 2016; Larsson, Maire, and Shakhnarovich 2017) for fine-tuning, so no additional data is used in M&M. For fair comparisons, all self-supervision methods are benchmarked on the PASCAL VOC2012 validation set, which comes with 1,449 images. We show the benefits of M&M tuning on different backbone networks, including AlexNet and VGG-16. To demonstrate the generalization ability of our learned model, we also report the performance of our VGG-16 full model on the PASCAL VOC2012 test set. We further apply our method on the Cityscapes dataset (Cordts et al. 2016), with 2,974 training samples, and report results on the 500 validation samples. All results are reported in mean Intersection over Union (mIoU), the standard evaluation criterion for semantic segmentation.

Table 1: We test our model on the PASCAL VOC2012 validation set, which is the generally accepted benchmark for semantic segmentation with self-supervised pre-training. Our method achieves the state of the art with both VGG-16 and AlexNet architectures.

| Method | Arch. | VOC12 mIoU (%) |
|---|---|---|
| ImageNet | VGG-16 | 64.2 |
| Random | VGG-16 | 35.0 |
| Larsson et al. (Larsson, Maire, and Shakhnarovich 2016) | VGG-16 | 50.2 |
| Larsson et al. (Larsson, Maire, and Shakhnarovich 2017) | VGG-16 | 56.0 |
| Ours (M&M + Graph, colorization pre-trained) | VGG-16 | 64.5 |
| ImageNet | AlexNet | 48.0 |
| Random | AlexNet | 23.5 |
| k-means (Krähenbühl et al. 2015) | AlexNet | 32.6 |
| Pathak et al. (Pathak et al. 2016b) | AlexNet | 29.7 |
| Donahue et al. (Donahue, Krähenbühl, and Darrell 2016) | AlexNet | 35.2 |
| Zhang et al. (Zhang, Isola, and Efros 2016a) | AlexNet | 35.6 |
| Zhang et al. (Zhang, Isola, and Efros 2016b) | AlexNet | 36.0 |
| Noroozi et al. (Noroozi and Favaro 2016) | AlexNet | 37.6 |
| Larsson et al. (Larsson, Maire, and Shakhnarovich 2017) | AlexNet | 38.4 |
| Ours (M&M + Random Triplets, colorization pre-trained) | AlexNet | 40.9 |
| Ours (M&M + Graph, colorization pre-trained) | AlexNet | 42.8 |
| Ours (M&M + Graph, randomly initialized) | AlexNet | 43.6 |

Overall. Existing self-supervision works report segmentation results on the PASCAL VOC2012 dataset. The highest performance attained by existing self-supervision methods is learning by colorization (Larsson, Maire, and Shakhnarovich 2017), which achieves 38.4% mIoU and 56.0% mIoU with AlexNet and VGG-16 as the backbone network, respectively. Therefore, we adopt learning by colorization as our proxy task here. With our M&M tuning, we boost the performance to 42.8% mIoU and 64.5% mIoU with AlexNet and VGG-16 as the backbone network, respectively. As shown in Table 1, our method achieves state-of-the-art performance on semantic segmentation, outperforming (Larsson, Maire, and Shakhnarovich 2016) by 14.3% and (Larsson, Maire, and Shakhnarovich 2017) by 8.5% when using VGG-16 as the backbone network. Notably, our M&M self-supervision paradigm shows comparable results (a 0.3-point advantage) to its ImageNet pre-trained counterpart. Furthermore, on the PASCAL VOC2012 test set, our approach achieves 64.3% mIoU, which is a record-breaking performance for self-supervision methods. Qualitative results of this model are shown in Fig. 6.

We additionally perform an ablation study in the AlexNet setting. As shown in Table 1, with the colorization task as pre-training, our class-wise connected graph outperforms random triplets by 2.5%, suggesting the importance of the class-wise connected graph. With random initialization, our model surprisingly performs even better than with colorization pre-training.

Per-class results. We analyze the per-class results of M&M tuning on the PASCAL VOC2012 validation set. The results are summarized in Table 2.
When comparing our method with the baseline model that uses colorization as pre-training (we obtain higher performance than reported by using the released pre-training model of Larsson, Maire, and Shakhnarovich (2017)), our approach demonstrates significant improvements in classes including aeroplane, bike, bottle, bus, car, chair, motorbike, sheep, and train. A further attempt at combining our self-supervised model and the fully-supervised model (through averaging their predictions) leads to an even higher mIoU of 67.4%. The results suggest that self-supervision serves as a strong candidate complementary to the current fully-supervised paradigm.

Table 2: Per-class segmentation results on PASCAL VOC2012 val. The last row shows the additional results of our model combined with the ImageNet pre-trained model by averaging their prediction probabilities. The results suggest the complementary nature of our self-supervised method and the ImageNet pre-trained model.

| Method | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ImageNet | 81.7 | 37.4 | 73.3 | 55.8 | 59.6 | 82.4 | 74.7 | 82.4 | 30.8 | 60.3 | 46.1 | 71.4 | 65.3 | 72.6 | 76.7 | 49.7 | 70.6 | 34.2 | 72.7 | 60.2 | 64.2 |
| Colorization | 73.6 | 28.5 | 67.5 | 55.5 | 50.2 | 78.3 | 66.1 | 78.3 | 26.8 | 60.8 | 50.6 | 70.6 | 64.9 | 62.2 | 73.5 | 38.2 | 66.8 | 38.8 | 68.1 | 55.1 | 60.2 |
| M&M | 83.1 | 37.0 | 69.6 | 56.1 | 62.9 | 84.4 | 76.4 | 82.8 | 33.4 | 61.5 | 44.7 | 67.3 | 68.5 | 68.0 | 78.5 | 42.2 | 72.7 | 37.2 | 75.7 | 58.6 | 64.5 |
| Ensemble ImageNet+M&M | 84.5 | 39.4 | 76.3 | 60.3 | 64.6 | 85.4 | 77.7 | 84.1 | 35.6 | 63.6 | 50.4 | 70.6 | 72.0 | 73.6 | 80.1 | 50.2 | 73.7 | 37.6 | 77.8 | 66.6 | 67.4 |

Applicability to different proxy tasks. Besides colorization (Larsson, Maire, and Shakhnarovich 2017), we also explore the possibility of using Jigsaw Puzzles (Noroozi and Favaro 2016) as our proxy task. Similarly, our M&M tuning boosts the segmentation performance from 36.5% mIoU (we use the released pre-training model of Jigsaw Puzzles for fine-tuning and obtain a slightly lower baseline than the 37.6% mIoU reported in that paper) to 41.2% mIoU. The result suggests that the proposed approach is widely applicable to other self-supervision methods. Our method can also be applied to randomly initialized cases. On PASCAL VOC2012, M&M tuning boosts the performance from 19.8% mIoU to 43.6% mIoU with AlexNet, and from 35.0% mIoU to 56.7% mIoU with VGG-16. The improvements of our method over different baselines on PASCAL VOC2012 are shown in Table 3.

Generalizability to Cityscapes. We apply our method on the Cityscapes dataset. With colorization as pre-training, naive fine-tuning yields 57.5% mIoU, and M&M tuning improves it to 66.4% mIoU. The result is comparable with the ImageNet pre-trained counterpart, which yields 67.9% mIoU.
With a randomly initialized network, M&M brings a large improvement, from 42.5% mIoU to 49.1% mIoU. The comparison can be found in Table 3.

Table 3: Improvements of our method with different pre-training tasks: Random (Xavier initialization) with AlexNet and VGG-16, Jigsaw Puzzles (Noroozi and Favaro 2016) with AlexNet, and Colorization (Larsson, Maire, and Shakhnarovich 2017) with AlexNet and VGG-16. Baselines are produced with naive fine-tuning. ImageNet pre-trained results are regarded as an upper bound. Evaluations are conducted on the PASCAL VOC2012 and Cityscapes validation sets; results on the test sets are shown in brackets.

| | VOC12, AlexNet, Random | VOC12, AlexNet, Jigsaw | VOC12, AlexNet, Colorize | VOC12, VGG-16, Random | VOC12, VGG-16, Colorize | Cityscapes, VGG-16, Random | Cityscapes, VGG-16, Colorize |
|---|---|---|---|---|---|---|---|
| baseline | 19.8 | 36.5 | 38.4 | 35.0 | 60.2 | 42.5 | 57.5 |
| M&M | 43.6 | 41.2 | 42.8 | 56.7 | 64.5 (64.3) | 49.1 | 66.4 (65.6) |
| ImageNet | 48.0 | 48.0 | 48.0 | 64.2 | 64.2 | 67.9 | 67.9 |

## Further Analysis

Learned representations. To illustrate the learned representations enabled by M&M tuning, we visualize the sample distribution changes in the t-SNE embedding space. As shown in Fig. 4, after M&M tuning, samples from the same category tend to stay close, while those from different categories are torn apart. Notably, this effect is more pronounced for the categories aeroplane, bike, bottle, bus, car, chair, motorbike, sheep, train, and tv, which aligns with the per-class performance improvements listed in Table 2.

Figure 4: Feature distribution with and without the proposed mix-and-match (M&M) tuning. We use 17,684 patches obtained from the PASCAL VOC2012 validation set to extract features and map the high-dimensional features to a 2-D space with t-SNE, along with their categories. For clarity, we split the 20 classes into four groups in order. The first row shows the feature distribution of a naively fine-tuned model without M&M; the second row depicts the feature distribution of a model additionally tuned with M&M. Note that the features are extracted from CNNs that have already been fine-tuned to the segmentation task, so the two CNNs have seen an identical amount of data and labels. Best viewed in color.

The effect of graph size. Here we investigate how the self-supervision performance is influenced by the graph size (the number of nodes in the graph), which determines the number of triplets that can be discovered. Specifically, we set the image batch size to {10, 20, 40}, so that the number of nodes is {100, 200, 400}, as shown in Fig. 5. The comparative study is performed on AlexNet with learning by colorization (Larsson, Maire, and Shakhnarovich 2017) as initialization. We have the following observations. On the one hand, a larger graph leads to higher performance, since it brings more diverse samples for more accurate metric learning. On the other hand, a larger graph takes a longer time to process, since a larger batch of images is fed in each iteration.

Figure 5: A larger graph brings better performance but costs a longer time per iteration. We train the model with the same hyper-parameters for the different settings and test on the PASCAL VOC2012 validation set.

Efficiency. The previous study suggests that a performance-speed trade-off can be achieved through graph size adjustment. Nevertheless, our graph training process is very efficient. It costs 3.5 hours for AlexNet and 5.8 hours for VGG-16 on a single TITAN X, which is much faster than conventional ImageNet pre-training or any other self-supervised pre-training task.

Failure cases. We also include some failure cases of our method, as shown in Fig. 7. The failed examples can be explained as follows. First, patches sampled from thin objects may fail to reflect the key characteristics of the object due to clutter, so the boat in the figure ends up as a false negative. Second, our M&M tuning method inherits from its base model (i.e., the colorization model) to some extent, which accounts for the case in the figure where the dog is falsely classified as a cat.
Figure 6: Visual comparison on the PASCAL VOC2012 validation set (top four rows) and the Cityscapes validation set (bottom three rows). (a) Image. (b) Ground truth. (c) Results with ImageNet supervised pre-training. (d) Results with colorization pre-training. (e) Our results.

Figure 7: Our failure cases. (a) Image. (b) Ground truth. (c) Results with ImageNet supervised pre-training. (d) Results with colorization pre-training. (e) Our results.

## Conclusion

We have presented a novel mix-and-match (M&M) tuning method for improving the performance of self-supervised learning on the semantic image segmentation task. Our approach effectively exploits mixed image patches to form a class-wise connected graph, from which triplets can be sampled to compute a discriminative loss for M&M tuning. Our approach not only improves the performance of self-supervised semantic segmentation with different proxy tasks and different backbone CNNs on different benchmarks, achieving state-of-the-art results, but also outperforms its ImageNet pre-trained counterpart for the first time in the literature, shedding light on the enormous potential of self-supervised learning. M&M tuning can potentially be applied to various other tasks and is worth further exploration. Future work will focus on the essence and advantages of multi-step optimization schemes like M&M tuning.

## Acknowledgement

This work is supported by SenseTime Group Limited and the General Research Fund sponsored by the Research Grants Council of the Hong Kong SAR (CUHK 14241716, 14224316, 14209217).

## References

Bearman, A.; Russakovsky, O.; Ferrari, V.; and Fei-Fei, L. 2016. What's the point: Semantic segmentation with point supervision. In ECCV.

Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; and Schiele, B. 2016. The Cityscapes dataset for semantic urban scene understanding. In CVPR.

Dai, J.; He, K.; and Sun, J. 2015. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV.

Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In CVPR.

Doersch, C.; Singh, S.; Gupta, A.; Sivic, J.; and Efros, A. 2012. What makes Paris look like Paris? ACM Transactions on Graphics 31(4).

Doersch, C.; Gupta, A.; and Efros, A. A. 2015. Unsupervised visual representation learning by context prediction. In ICCV.

Donahue, J.; Krähenbühl, P.; and Darrell, T. 2016. Adversarial feature learning. arXiv:1605.09782.

Dosovitskiy, A.; Fischer, P.; Springenberg, J.; Riedmiller, M.; and Brox, T. 2015. Discriminative unsupervised feature learning with exemplar convolutional neural networks. arXiv:1506.02753.

Everingham, M.; Van Gool, L.; Williams, C. K.; Winn, J.; and Zisserman, A. 2010. The PASCAL visual object classes (VOC) challenge. IJCV 88(2):303-338.

Gatys, L. A.; Ecker, A. S.; and Bethge, M. 2015. A neural algorithm of artistic style. arXiv:1508.06576.

Geiger, A.; Lenz, P.; Stiller, C.; and Urtasun, R. 2013. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research 32(11):1231-1237.

Hariharan, B.; Arbeláez, P.; Girshick, R.; and Malik, J. 2015. Hypercolumns for object segmentation and fine-grained localization. In CVPR.

Hong, S.; Yeo, D.; Kwak, S.; Lee, H.; and Han, B. 2017. Weakly supervised semantic segmentation using web-crawled videos. arXiv:1701.00352.
Krähenbühl, P.; Doersch, C.; Donahue, J.; and Darrell, T. 2015. Data-dependent initializations of convolutional neural networks. arXiv:1511.06856.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In NIPS.

Larsson, G.; Maire, M.; and Shakhnarovich, G. 2016. Learning representations for automatic colorization. In ECCV.

Larsson, G.; Maire, M.; and Shakhnarovich, G. 2017. Colorization as a proxy task for visual understanding. arXiv:1703.04044.

Li, X.; Liu, Z.; Luo, P.; Loy, C. C.; and Tang, X. 2017a. Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer cascade. In CVPR.

Li, X.; Qi, Y.; Wang, Z.; Chen, K.; Liu, Z.; Shi, J.; Luo, P.; Tang, X.; and Loy, C. C. 2017b. Video object segmentation with re-identification. In CVPRW.

Li, L.-J.; Socher, R.; and Fei-Fei, L. 2009. Towards total scene understanding: Classification, annotation and segmentation in an automatic framework. In CVPR.

Li, Q.; Wu, J.; and Tu, Z. 2013. Harvesting mid-level visual concepts from large-scale internet images. In CVPR.

Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In ECCV.

Lin, D.; Dai, J.; Jia, J.; He, K.; and Sun, J. 2016. ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In CVPR.

Liu, Z.; Li, X.; Luo, P.; Loy, C.-C.; and Tang, X. 2015. Semantic image segmentation via deep parsing network. In ICCV.

Liu, Z.; Li, X.; Luo, P.; Loy, C. C.; and Tang, X. 2017. Deep learning Markov random field for semantic segmentation. TPAMI.

Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In CVPR.

Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-SNE. JMLR 9:2579-2605.

Noroozi, M., and Favaro, P. 2016. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV.

Papandreou, G.; Chen, L.-C.; Murphy, K. P.; and Yuille, A. L. 2015. Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In ICCV.

Pathak, D.; Girshick, R.; Dollár, P.; Darrell, T.; and Hariharan, B. 2016a. Learning features by watching objects move. arXiv:1612.06370.

Pathak, D.; Krähenbühl, P.; Donahue, J.; Darrell, T.; and Efros, A. A. 2016b. Context encoders: Feature learning by inpainting. In CVPR.

Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. ImageNet large scale visual recognition challenge. IJCV 115(3):211-252.

Schroff, F.; Kalenichenko, D.; and Philbin, J. 2015. FaceNet: A unified embedding for face recognition and clustering. In CVPR.

Simonyan, K., and Zisserman, A. 2015. Very deep convolutional networks for large-scale image recognition. In ICLR.

Singh, S.; Gupta, A.; and Efros, A. 2012. Unsupervised discovery of mid-level discriminative patches. In ECCV.

Wang, X., and Gupta, A. 2015. Unsupervised learning of visual representations using videos. In ICCV.

Zhang, R.; Isola, P.; and Efros, A. A. 2016a. Colorful image colorization. In ECCV.

Zhang, R.; Isola, P.; and Efros, A. A. 2016b. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In CVPR.

Zhao, H.; Shi, J.; Qi, X.; Wang, X.; and Jia, J. 2017. Pyramid scene parsing network. In CVPR.