# UNSUPERVISED REPRESENTATION LEARNING BY PREDICTING IMAGE ROTATIONS

Published as a conference paper at ICLR 2018

Spyros Gidaris, Praveer Singh, Nikos Komodakis
University Paris-Est, LIGM, Ecole des Ponts ParisTech
{spyros.gidaris,praveer.singh,nikos.komodakis}@enpc.fr

ABSTRACT

Over the last years, deep convolutional neural networks (ConvNets) have transformed the field of computer vision thanks to their unparalleled capacity to learn high-level semantic image features. However, in order to successfully learn those features, they usually require massive amounts of manually labeled data, which is both expensive and impractical to scale. Therefore, unsupervised semantic feature learning, i.e., learning without requiring manual annotation effort, is of crucial importance in order to successfully harvest the vast amount of visual data that is available today. In our work we propose to learn image features by training ConvNets to recognize the 2d rotation that is applied to the image that they get as input. We demonstrate both qualitatively and quantitatively that this apparently simple task actually provides a very powerful supervisory signal for semantic feature learning. We exhaustively evaluate our method on various unsupervised feature learning benchmarks and we exhibit state-of-the-art performance in all of them. Specifically, our results on those benchmarks demonstrate dramatic improvements w.r.t. prior state-of-the-art approaches in unsupervised representation learning and thus significantly close the gap with supervised feature learning. For instance, on the PASCAL VOC 2007 detection task our unsupervised pre-trained AlexNet model achieves the state-of-the-art (among unsupervised methods) mAP of 54.4%, which is only 2.4 points lower than the supervised case. We get similarly striking results when we transfer our unsupervised learned features to various other tasks, such as ImageNet classification, PASCAL classification, PASCAL segmentation, and CIFAR-10 classification. The code and models of our paper will be published on: https://github.com/gidariss/FeatureLearningRotNet.

1 INTRODUCTION

In recent years, the widespread adoption of deep convolutional neural networks (LeCun et al., 1998) (ConvNets) in computer vision has led to tremendous progress in the field. Specifically, by training ConvNets on the object recognition (Russakovsky et al., 2015) or the scene classification (Zhou et al., 2014) tasks with a massive amount of manually labeled data, they manage to learn powerful visual representations suitable for image understanding tasks. For instance, the image features learned by ConvNets in this supervised manner have achieved excellent results when they are transferred to other vision tasks, such as object detection (Girshick, 2015), semantic segmentation (Long et al., 2015), or image captioning (Karpathy & Fei-Fei, 2015). However, supervised feature learning has the main limitation of requiring intensive manual labeling effort, which is both expensive and infeasible to scale on the vast amount of visual data that is available today. Due to that, there is lately an increased interest in learning high-level ConvNet-based representations in an unsupervised manner that avoids manual annotation of visual data.
Among them, a prominent paradigm is the so-called self-supervised learning, which defines an annotation-free pretext task, using only the visual information present in the images or videos, in order to provide a surrogate supervision signal for feature learning. For example, in order to learn features, Zhang et al. (2016a) and Larsson et al. (2016) train ConvNets to colorize grayscale images, Doersch et al. (2015) and Noroozi & Favaro (2016) predict the relative position of image patches, and Agrawal et al. (2015) predict the egomotion (i.e., self-motion) of a moving vehicle between two consecutive frames. The rationale behind such self-supervised tasks is that solving them will force the ConvNet to learn semantic image features that can be useful for other vision tasks. In fact, image representations learned with the above self-supervised tasks, although they have not managed to match the performance of supervised-learned representations, have proved to be good alternatives for transferring to other vision tasks, such as object recognition, object detection, and semantic segmentation (Zhang et al., 2016a; Larsson et al., 2016; Zhang et al., 2016b; Larsson et al., 2017; Doersch et al., 2015; Noroozi & Favaro, 2016; Noroozi et al., 2017; Pathak et al., 2016a; Doersch & Zisserman, 2017). Other successful cases of unsupervised feature learning are clustering based methods (Dosovitskiy et al., 2014; Liao et al., 2016; Yang et al., 2016), reconstruction based methods (Bengio et al., 2007; Huang et al., 2007; Masci et al., 2011), and methods that involve learning generative probabilistic models (Goodfellow et al., 2014; Donahue et al., 2016; Radford et al., 2015).

Our work follows the self-supervised paradigm and proposes to learn image representations by training ConvNets to recognize the geometric transformation that is applied to the image that they get as input. More specifically, we first define a small set of discrete geometric transformations, then each of those geometric transformations is applied to each image in the dataset and the produced transformed images are fed to the ConvNet model, which is trained to recognize the transformation of each image. In this formulation, it is the set of geometric transformations that actually defines the classification pretext task that the ConvNet model has to learn. Therefore, in order to achieve unsupervised semantic feature learning, it is of crucial importance to properly choose those geometric transformations (we further discuss this aspect of our methodology in Section 2.2). What we propose is to define the geometric transformations as the image rotations by 0, 90, 180, and 270 degrees. Thus, the ConvNet model is trained on the 4-way image classification task of recognizing one of the four image rotations (see Figure 2). We argue that in order for a ConvNet model to be able to recognize the rotation transformation that was applied to an image it will need to understand the concept of the objects depicted in the image (see Figure 1), such as their location in the image, their type, and their pose. Throughout the paper we support that argument both qualitatively and quantitatively. Furthermore, we demonstrate in the experimental section of the paper that despite the simplicity of our self-supervised approach, the task of predicting rotation transformations provides a powerful surrogate supervision signal for feature learning and leads to dramatic improvements on the relevant benchmarks.
Note that our self-supervised task is different from the works of Dosovitskiy et al. (2014) and Agrawal et al. (2015), which also involve geometric transformations. Dosovitskiy et al. (2014) train a ConvNet model to yield representations that are discriminative between images and at the same time invariant to geometric and chromatic transformations. In contrast, we train a ConvNet model to recognize the geometric transformation applied to an image. It is also fundamentally different from the egomotion method of Agrawal et al. (2015), which employs a ConvNet model with a Siamese-like architecture that takes as input two consecutive video frames and is trained to predict (through regression) their camera transformation. Instead, in our approach, the ConvNet takes as input a single image to which we have applied a random geometric transformation (i.e., rotation) and is trained to recognize (through classification) this geometric transformation without having access to the initial image.

Our contributions are:

- We propose a new self-supervised task that is very simple and, at the same time, as we demonstrate throughout the paper, offers a powerful supervisory signal for semantic feature learning.
- We exhaustively evaluate our self-supervised method under various settings (e.g., semi-supervised or transfer learning settings) and on various vision tasks (i.e., CIFAR-10, ImageNet, Places, and PASCAL classification, detection, or segmentation tasks). In all of them, our novel self-supervised formulation demonstrates state-of-the-art results with dramatic improvements w.r.t. prior unsupervised approaches. As a consequence we show that for several important vision tasks, our self-supervised learning approach significantly narrows the gap between unsupervised and supervised feature learning.

In the following sections, we describe our self-supervised methodology in Section 2, we provide experimental results in Section 3, and finally we conclude in Section 4.

Figure 1: Images rotated by random multiples of 90 degrees (e.g., 0, 90, 180, or 270 degrees). The core intuition of our self-supervised feature learning approach is that if someone is not aware of the concepts of the objects depicted in the images, they cannot recognize the rotation that was applied to them.

2 METHODOLOGY

2.1 OVERVIEW

The goal of our work is to learn ConvNet-based semantic features in an unsupervised manner. To achieve that goal we propose to train a ConvNet model F(.) to estimate the geometric transformation applied to an image that is given to it as input. Specifically, we define a set of K discrete geometric transformations G = \{g(\cdot|y)\}_{y=1}^{K}, where g(\cdot|y) is the operator that applies to image X the geometric transformation with label y, yielding the transformed image X^{y} = g(X|y). The ConvNet model F(.) gets as input an image X^{y^*} (where the label y^* is unknown to model F(.)) and yields as output a probability distribution over all possible geometric transformations:

$$F(X^{y^*}|\theta) = \{F^{y}(X^{y^*}|\theta)\}_{y=1}^{K}, \qquad (1)$$

where F^{y}(X^{y^*}|\theta) is the predicted probability for the geometric transformation with label y and \theta are the learnable parameters of model F(.). Therefore, given a set of N training images D = \{X_i\}_{i=1}^{N}, the self-supervised training objective that the ConvNet model must learn to solve is:

$$\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} loss(X_i, \theta), \qquad (2)$$

where the loss function loss(.) is defined as:

$$loss(X_i, \theta) = -\frac{1}{K} \sum_{y=1}^{K} \log\big(F^{y}(g(X_i|y)\,|\,\theta)\big). \qquad (3)$$
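In practice, Eq. (3) is simply a cross-entropy loss averaged over the K rotated copies of each image. A minimal PyTorch-style sketch of this objective is given below; the function name and the assumption that `model` maps an image batch to K logits are our own illustrative choices, not the authors' released implementation:

```python
import torch
import torch.nn.functional as nnf


def rotation_loss(model, images, K=4):
    """Loss of Eq. (3): average cross-entropy over the K rotated copies of a
    batch of images (a B x C x H x W tensor). `model` maps images to K logits."""
    losses = []
    for y in range(K):
        # g(X | y): rotate every image in the batch by y * 90 degrees.
        rotated = torch.rot90(images, k=y, dims=(2, 3))
        targets = torch.full((images.size(0),), y,
                             dtype=torch.long, device=images.device)
        losses.append(nnf.cross_entropy(model(rotated), targets))
    return torch.stack(losses).mean()
```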
In the following subsection we describe the type of geometric transformations that we propose in our work.

2.2 CHOOSING GEOMETRIC TRANSFORMATIONS: IMAGE ROTATIONS

In the above formulation, the geometric transformations G must define a classification task that forces the ConvNet model to learn semantic features useful for visual perception tasks (e.g., object detection or image classification). In our work we propose to define the set of geometric transformations G as all the image rotations by multiples of 90 degrees, i.e., 2d image rotations by 0, 90, 180, and 270 degrees (see Figure 2). More formally, if Rot(X, \phi) is an operator that rotates image X by \phi degrees, then our set of geometric transformations consists of the K = 4 image rotations G = \{g(X|y)\}_{y=1}^{4}, where g(X|y) = Rot(X, (y-1) \cdot 90).

Figure 2: Illustration of the self-supervised task that we propose for semantic feature learning. Given four possible geometric transformations, the 0, 90, 180, and 270 degrees rotations, we train a ConvNet model F(.) to recognize the rotation that is applied to the image that it gets as input. F^{y}(X^{y^*}) is the probability of rotation transformation y predicted by model F(.) when it gets as input an image that has been transformed by the rotation transformation y^*.

Forcing the learning of semantic features: The core intuition behind using these image rotations as the set of geometric transformations relates to the simple fact that it is essentially impossible for a ConvNet model to effectively perform the above rotation recognition task unless it has first learnt to recognize and detect classes of objects as well as their semantic parts in images. More specifically, to successfully predict the rotation of an image the ConvNet model must necessarily learn to localize salient objects in the image, recognize their orientation and object type, and then relate the object orientation with the dominant orientation that each type of object tends to be depicted with in the available images. In Figure 3b we visualize some attention maps generated by a model trained on the rotation recognition task. These attention maps are computed based on the magnitude of activations at each spatial cell of a convolutional layer and essentially reflect where the network puts most of its focus in order to classify an input image. We observe, indeed, that in order for the model to accomplish the rotation prediction task it learns to focus on high-level object parts in the image, such as eyes, noses, tails, and heads. By comparing them with the attention maps generated by a model trained on the object recognition task in a supervised way (see Figure 3a) we observe that both models seem to focus on roughly the same image regions. Furthermore, in Figure 4 we visualize the first-layer filters that were learnt by an AlexNet model trained on the proposed rotation recognition task. As can be seen, they appear to include a large variety of edge filters at multiple orientations and multiple frequencies. Remarkably, these filters seem to have a greater amount of variety even than the filters learnt on the supervised object recognition task.
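As a concrete illustration of how such attention maps can be computed, the procedure described in the caption of Figure 3 below (raise each activation of a conv. feature map to a power p and sum over the channels at each spatial location) can be sketched as follows; the function name and the use of the absolute value as a safeguard for non-ReLU activations are our assumptions:

```python
import torch


def attention_map(feature_map, p=1):
    """Collapse a (B, C, H, W) conv. feature map into a (B, H, W) attention map:
    raise the magnitude of each activation to the power p and sum over channels."""
    return feature_map.abs().pow(p).sum(dim=1)


# Per the caption of Figure 3: p = 1 for conv1, p = 2 for conv2, p = 4 for conv3.
```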
Figure 3: Attention maps generated by an AlexNet model trained (a) to recognize objects (supervised), and (b) to recognize image rotations (self-supervised); the panels show the Conv1 (27 x 27) and Conv3 (13 x 13) attention maps. In order to generate the attention map of a conv. layer we first compute the feature maps of this layer, then we raise each feature activation to the power p, and finally we sum the activations at each location of the feature map. For the conv. layers 1, 2, and 3 we used the powers p = 1, p = 2, and p = 4 respectively. For visualizations of our self-supervised model's attention maps for all the rotated versions of the images see Figure 6 in Appendix A.

Absence of low-level visual artifacts: An additional important advantage of using image rotations by multiples of 90 degrees over other geometric transformations is that they can be implemented by flip and transpose operations (as we will see below) that do not leave any easily detectable low-level visual artifacts that would lead the ConvNet to learn trivial features with no practical value for visual perception tasks. In contrast, had we decided to use as geometric transformations, e.g., scale and aspect ratio image transformations, we would need to use image resizing routines that leave easily detectable image artifacts in order to implement them.

Well-posedness: Furthermore, human-captured images tend to depict objects in an upright position, thus making the rotation recognition task well defined, i.e., given an image rotated by 0, 90, 180, or 270 degrees, there is usually no ambiguity about which rotation transformation was applied (with the exception of images that only depict round objects). In contrast, that is not the case for the object scale, which varies significantly in human-captured images.

Implementing image rotations: In order to implement the image rotations by 90, 180, and 270 degrees (the 0 degrees case is the image itself), we use flip and transpose operations. Specifically, for the 90 degrees rotation we first transpose the image and then flip it vertically (upside-down flip), for the 180 degrees rotation we first flip the image vertically and then horizontally (left-right flip), and finally for the 270 degrees rotation we first flip the image vertically and then we transpose it (see the sketch below).
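A minimal sketch of this flip/transpose recipe for a single image tensor; the helper name and the (C, H, W) PyTorch tensor layout are our assumptions:

```python
import torch


def rotate_image(x, y):
    """Rotate a (C, H, W) image tensor by y * 90 degrees using only flip and
    transpose operations, following the recipe described above."""
    if y == 0:
        return x                                # 0 degrees: identity
    if y == 1:
        return x.transpose(1, 2).flip([1])      # 90 degrees: transpose, then upside-down flip
    if y == 2:
        return x.flip([1]).flip([2])            # 180 degrees: vertical flip, then horizontal flip
    if y == 3:
        return x.flip([1]).transpose(1, 2)      # 270 degrees: vertical flip, then transpose
    raise ValueError("y must be one of {0, 1, 2, 3}")
```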
2.3 DISCUSSION

The simple formulation of our self-supervised task has several advantages. It has the same computational cost as supervised learning and a similar training convergence speed (which is significantly faster than image reconstruction based approaches; our AlexNet model trains in around 2 days using a single Titan X GPU), and it can trivially adopt the efficient parallelization schemes devised for supervised learning (Goyal et al., 2017), making it an ideal candidate for unsupervised learning on internet-scale data (i.e., billions of images). Furthermore, our approach does not require any special image pre-processing routine in order to avoid learning trivial features, as many other unsupervised or self-supervised approaches do. Despite the simplicity of our self-supervised formulation, as we will see in the experimental section of the paper, the features learned by our approach achieve dramatic improvements on the unsupervised feature learning benchmarks.

3 EXPERIMENTAL RESULTS

In this section we conduct an extensive evaluation of our approach on the most commonly used image datasets, such as CIFAR-10 (Krizhevsky & Hinton, 2009), ImageNet (Russakovsky et al., 2015), PASCAL (Everingham et al., 2010), and Places205 (Zhou et al., 2014), as well as on various vision tasks, such as object detection, object segmentation, and image classification. We also consider several learning scenarios, including transfer learning and semi-supervised learning. In all cases, we compare our approach with the corresponding state-of-the-art methods.

Figure 4: First-layer filters learned by an AlexNet model trained on (a) the supervised object recognition task and (b) the self-supervised task of recognizing rotated images. We observe that the filters learned by the self-supervised task are mostly oriented edge filters at various frequencies and, remarkably, they seem to have more variety than those learned on the supervised task.

Table 1: Evaluation of the unsupervised learned features by measuring the classification accuracy that they achieve when we train a non-linear object classifier on top of them. The reported results are from CIFAR-10. The size of the ConvB1 feature maps is 96 x 16 x 16 and the size of the rest of the feature maps is 192 x 8 x 8.

| Model | ConvB1 | ConvB2 | ConvB3 | ConvB4 | ConvB5 |
| --- | --- | --- | --- | --- | --- |
| RotNet with 3 conv. blocks | 85.45 | 88.26 | 62.09 | - | - |
| RotNet with 4 conv. blocks | 85.07 | 89.06 | 86.21 | 61.73 | - |
| RotNet with 5 conv. blocks | 85.04 | 89.76 | 86.82 | 74.50 | 50.37 |

3.1 CIFAR EXPERIMENTS

We start by evaluating, on the object recognition task of CIFAR-10, the ConvNet-based features learned by the proposed self-supervised task of rotation recognition. We will hereafter call a ConvNet model that is trained on the self-supervised task of rotation recognition a RotNet model.

Implementation details: In our CIFAR-10 experiments we implement the RotNet models with Network-In-Network (NIN) architectures (Lin et al., 2013). In order to train them on the rotation prediction task, we use SGD with batch size 128, momentum 0.9, weight decay 5e-4, and a learning rate of 0.1. We drop the learning rate by a factor of 5 after epochs 30, 60, and 80, and we train for 100 epochs in total. In our preliminary experiments we found that we get a significant improvement when during training we feed the network all four rotated copies of an image simultaneously, instead of randomly sampling a single rotation transformation each time. Therefore, at each training batch the network sees 4 times more images than the batch size (see the sketch below).
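A sketch of how a mini-batch can be expanded with all four rotated copies of each image together with the corresponding rotation labels; the tensor layout and the helper name are our assumptions, and torch.rot90 could equally be replaced by the flip/transpose helper sketched in Section 2.2:

```python
import torch


def make_rotation_batch(images):
    """Given B images (B, C, H, W), return the 4*B rotated copies and their
    rotation labels, so every image is seen under all four rotations per batch."""
    rotated = torch.cat([torch.rot90(images, k=y, dims=(2, 3)) for y in range(4)], dim=0)
    labels = torch.arange(4).repeat_interleave(images.size(0))  # [0]*B + [1]*B + [2]*B + [3]*B
    return rotated, labels
```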
Evaluation of the learned feature hierarchies: First, we explore how the quality of the learned features depends on their depth (i.e., the depth of the layer that they come from) as well as on the total depth of the RotNet model. For that purpose, we first train, using the CIFAR-10 training images, three RotNet models which have 3, 4, and 5 convolutional blocks respectively (note that each conv. block in the NIN architectures that implement our RotNet models has 3 conv. layers; therefore, the total number of conv. layers of the examined RotNet models is 9, 12, and 15 for 3, 4, and 5 conv. blocks respectively). Afterwards, we learn classifiers on top of the feature maps generated by each conv. block of each RotNet model. Those classifiers are trained in a supervised way on the object recognition task of CIFAR-10. They consist of 3 fully connected layers; the 2 hidden layers have 200 feature channels each and are followed by batch-norm and ReLU units. We report the accuracy results on the CIFAR-10 test set in Table 1. We observe that in all cases the feature maps generated by the 2nd conv. block (which actually has depth 6 in terms of the total number of conv. layers up to that point) achieve the highest accuracy, i.e., between 88.26% and 89.06%. The features of the conv. blocks that follow the 2nd one gradually degrade the object recognition accuracy, which we assume is because they start becoming more and more specific to the self-supervised task of rotation prediction. Also, we observe that increasing the total depth of the RotNet models leads to increased object recognition performance by the feature maps generated by earlier layers (after the 1st conv. block). We assume that this is because increasing the depth of the model, and thus the complexity of its head (i.e., top ConvNet layers), allows the features of earlier layers to be less specific to the rotation prediction task.

Table 2: Exploring the quality of the self-supervised learned features w.r.t. the number of recognized rotations. For all the entries we trained a non-linear classifier with 3 fully connected layers (similar to Table 1) on top of the feature maps generated by the 2nd conv. block of a RotNet model with 4 conv. blocks in total. The reported results are from CIFAR-10.

| # Rotations | Rotations | CIFAR-10 Classification Accuracy |
| --- | --- | --- |
| 4 | 0°, 90°, 180°, 270° | 89.06 |
| 8 | 0°, 45°, 90°, 135°, 180°, 225°, 270°, 315° | 88.51 |
| 2 | 0°, 180° | 87.46 |
| 2 | 90°, 270° | 85.52 |

Exploring the quality of the learned features w.r.t. the number of recognized rotations: In Table 2 we explore how the quality of the self-supervised features depends on the number of discrete rotations used in the rotation prediction task. For that purpose we defined three extra rotation recognition tasks: (a) one with 8 rotations that includes all the multiples of 45 degrees, (b) one with only the 0° and 180° rotations, and (c) one with only the 90° and 270° rotations.
In order to implement the rotation transformations of the 45°, 135°, 225°, and 315° rotations (in the 8 discrete rotations case) we used an image warping routine and then took care to crop only the central square image regions that do not include any of the empty image areas introduced by the rotation transformations (and which could easily indicate the image rotation). We observe that indeed for 4 discrete rotations (as we proposed) we achieve better object recognition performance than in the 8 or 2 rotation cases. We believe that this is because the 2 orientations case offers too few classes for recognition (i.e., less supervisory information is provided), while in the 8 orientations case the geometric transformations are not distinguishable enough and, furthermore, the 4 extra rotations introduced may lead to visual artifacts on the rotated images. Moreover, we observe that among the RotNet models trained with 2 discrete rotations, the RotNet model trained with the 90° and 270° rotations achieves worse object recognition performance than the model trained with the 0° and 180° rotations, which is probably due to the fact that the former model does not see during the unsupervised phase the 0° rotation that is typically used during the object recognition training phase.

Figure 5: (a) Plot with the rotation prediction accuracy and object recognition accuracy as a function of the training epochs used for solving the rotation prediction task. The red curve is the object recognition accuracy of a fully supervised model (a NIN model), which is independent of the training epochs on the rotation prediction task. The yellow curve is the object recognition accuracy of an object classifier trained on top of feature maps learned by a RotNet model at different snapshots of the training procedure. (b) Accuracy as a function of the number of training examples per category in CIFAR-10. "Ours semi-supervised" is a NIN model whose first 2 conv. blocks come from a RotNet model trained in a self-supervised way on the entire training set of CIFAR-10, while the 3rd conv. block, along with a linear prediction layer, was trained on the object recognition task using only the available set of labeled images.

Table 3: Evaluation of unsupervised feature learning methods on CIFAR-10. The Supervised NIN and the (Ours) RotNet + conv entries have exactly the same architecture, but the first was trained fully supervised while in the second the first 2 conv. blocks were trained unsupervised with our rotation prediction task and only the 3rd block was trained in a supervised manner. In the Random Init. + conv entry a conv. classifier (similar to that of (Ours) RotNet + conv) is trained on top of two NIN conv. blocks that are randomly initialized and stay frozen. Note that each of the prior approaches has a different ConvNet architecture and thus the comparison with them is only indicative.

| Method | Accuracy |
| --- | --- |
| Supervised NIN | 92.80 |
| Random Init. + conv | 72.50 |
| (Ours) RotNet + non-linear | 89.06 |
| (Ours) RotNet + conv | 91.16 |
| (Ours) RotNet + non-linear (fine-tuned) | 91.73 |
| (Ours) RotNet + conv (fine-tuned) | 92.17 |
| Roto-Scat + SVM (Oyallon & Mallat, 2015) | 82.3 |
| Exemplar CNN (Dosovitskiy et al., 2014) | 84.3 |
| DCGAN (Radford et al., 2015) | 82.8 |
| Scattering (Oyallon et al., 2017) | 84.7 |

Comparison against supervised and other unsupervised methods: In Table 3 we compare our unsupervised learned features against other unsupervised (or hand-crafted) features on CIFAR-10. For our entries we use the feature maps generated by the 2nd conv. block of a RotNet model with 4 conv. blocks in total. On top of those RotNet features we train 2 different classifiers: (a) a non-linear classifier with 3 fully connected layers as before (entry (Ours) RotNet + non-linear), and (b) three conv. layers plus a linear prediction layer (entry (Ours) RotNet + conv; note that this entry is basically a 3-block NIN model with the first 2 blocks coming from a RotNet model and the 3rd being randomly initialized and trained on the recognition task). We observe that we improve over the prior unsupervised approaches and achieve state-of-the-art results on CIFAR-10 (note that each of the prior approaches has a different ConvNet architecture, thus the comparison with them is only indicative). More notably, the accuracy gap between the RotNet-based model and the fully supervised NIN model is very small, only 1.64 percentage points (92.80% vs 91.16%).
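A sketch of the "(Ours) RotNet + conv" protocol just described: the first two conv. blocks of a trained RotNet are frozen and a randomly initialized NIN-style third block plus a linear prediction layer is trained on the labeled data. The module attribute names (block1, block2) and the exact composition of the head are our illustrative assumptions:

```python
import torch.nn as nn


def rotnet_plus_conv(rotnet, num_classes=10):
    """'(Ours) RotNet + conv': freeze the first two conv. blocks of a trained
    RotNet and train a new NIN-style conv. block plus a linear prediction layer."""
    for p in list(rotnet.block1.parameters()) + list(rotnet.block2.parameters()):
        p.requires_grad = False                        # keep the RotNet features frozen

    head = nn.Sequential(                              # randomly initialized 3rd conv. block
        nn.Conv2d(192, 192, kernel_size=3, padding=1), nn.BatchNorm2d(192), nn.ReLU(),
        nn.Conv2d(192, 192, kernel_size=1), nn.BatchNorm2d(192), nn.ReLU(),
        nn.Conv2d(192, 192, kernel_size=1), nn.BatchNorm2d(192), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(192, num_classes),                   # linear prediction layer
    )
    # ConvB2 feature maps are 192 x 8 x 8 (Table 1), hence 192 input channels.
    return nn.Sequential(rotnet.block1, rotnet.block2, head)
```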
We provide a per-class breakdown of the classification accuracy of our unsupervised model as well as the supervised one in Table 9 (in Appendix B). In Table 3 we also report the performance of the RotNet features when, instead of being kept frozen, they are fine-tuned during the object recognition training phase. We observe that fine-tuning the unsupervised learned features further improves the classification performance, thus reducing even more the gap with the supervised case.

Correlation between the object classification task and the rotation prediction task: In Figure 5a, we plot the object classification accuracy as a function of the training epochs used for solving the self-supervised task of recognizing rotations, which learns the features used by the object classifier. More specifically, in order to create the object recognition accuracy curve, at each training snapshot of RotNet (i.e., every 20 epochs), we pause its training procedure and we train from scratch (until convergence) a non-linear object classifier on top of the so-far learnt RotNet features. Therefore, the object recognition accuracy curve depicts the accuracy of those non-linear object classifiers after the end of their training, while the rotation prediction accuracy curve depicts the accuracy of the RotNet at those snapshots. We observe that, as the ability of the RotNet features to solve the rotation prediction task improves (i.e., as the rotation prediction accuracy increases), their ability to help solve the object recognition task improves as well (i.e., the object recognition accuracy also increases). Furthermore, we observe that the object recognition accuracy converges fast w.r.t. the number of training epochs used for solving the pretext task of rotation prediction.

Table 4: Task Generalization: ImageNet top-1 classification with non-linear layers. We compare our unsupervised feature learning approach with other unsupervised approaches by training non-linear classifiers on top of the feature maps of each layer to perform the 1000-way ImageNet classification task, as proposed by Noroozi & Favaro (2016). For instance, for the conv5 feature map we train the layers that follow the conv5 layer in the AlexNet architecture (i.e., fc6, fc7, and fc8), and similarly for the conv4 feature maps. We implemented those non-linear classifiers with batch normalization units after each linear layer (fully connected or convolutional) and without employing dropout units. All approaches use AlexNet variants and were pre-trained on ImageNet without labels except the ImageNet labels and Random entries. During testing we use a single crop and do not perform flipping augmentation. We report top-1 classification accuracy.

| Method | Conv4 | Conv5 |
| --- | --- | --- |
| ImageNet labels (from Bojanowski & Joulin, 2017) | 59.7 | 59.7 |
| Random (from Noroozi & Favaro, 2016) | 27.1 | 12.0 |
| Tracking (Wang & Gupta, 2015) | 38.8 | 29.8 |
| Context (Doersch et al., 2015) | 45.6 | 30.4 |
| Colorization (Zhang et al., 2016a) | 40.7 | 35.2 |
| Jigsaw Puzzles (Noroozi & Favaro, 2016) | 45.3 | 34.6 |
| BIGAN (Donahue et al., 2016) | 41.9 | 32.2 |
| NAT (Bojanowski & Joulin, 2017) | - | 36.0 |
| (Ours) RotNet | 50.0 | 43.8 |

Semi-supervised setting: Motivated by the very high performance of our unsupervised feature learning method, we also evaluate it in a semi-supervised setting.
More specifically, we first train a 4-block RotNet model on the rotation prediction task using the entire image dataset of CIFAR-10 and then we train object classifiers on top of its feature maps using only a subset of the available images and their corresponding labels. As feature maps we use those generated by the 2nd conv. block of the RotNet model. As a classifier we use a set of convolutional layers that has the same architecture as the 3rd conv. block of a NIN model plus a linear classifier, all randomly initialized. For training the object classifier we use 20, 100, 400, 1000, or 5000 image examples per category. Note that 5000 image examples per category is the extreme case of using the entire CIFAR-10 training dataset. Also, we compare our method with a supervised model that is trained only on the available examples each time. In Figure 5b we plot the accuracy of the examined models as a function of the available training examples. We observe that in this semi-supervised setting our unsupervised trained model exceeds the supervised model when the number of examples per category drops below 1000. Furthermore, as the number of examples decreases, the performance gap in favor of our method increases. This empirical evidence demonstrates the usefulness of our method in semi-supervised settings.

3.2 EVALUATION OF SELF-SUPERVISED FEATURES TRAINED ON IMAGENET

Here we evaluate the performance of our self-supervised ConvNet models on the ImageNet, Places, and PASCAL VOC datasets. Specifically, we first train a RotNet model on the training images of the ImageNet dataset and then we evaluate the performance of the self-supervised features on the image classification tasks of the ImageNet, Places, and PASCAL VOC datasets and on the object detection and object segmentation tasks of PASCAL VOC.

Table 5: Task Generalization: ImageNet top-1 classification with linear layers. We compare our unsupervised feature learning approach with other unsupervised approaches by training logistic regression classifiers on top of the feature maps of each layer to perform the 1000-way ImageNet classification task, as proposed by Zhang et al. (2016a). All weights are frozen and feature maps are spatially resized (with adaptive max pooling) so as to have around 9000 elements. All approaches use AlexNet variants and were pre-trained on ImageNet without labels except the ImageNet labels and Random entries.

| Method | Conv1 | Conv2 | Conv3 | Conv4 | Conv5 |
| --- | --- | --- | --- | --- | --- |
| ImageNet labels | 19.3 | 36.3 | 44.2 | 48.3 | 50.5 |
| Random | 11.6 | 17.1 | 16.9 | 16.3 | 14.1 |
| Random rescaled (Krähenbühl et al., 2015) | 17.5 | 23.0 | 24.5 | 23.2 | 20.6 |
| Context (Doersch et al., 2015) | 16.2 | 23.3 | 30.2 | 31.7 | 29.6 |
| Context Encoders (Pathak et al., 2016b) | 14.1 | 20.7 | 21.0 | 19.8 | 15.5 |
| Colorization (Zhang et al., 2016a) | 12.5 | 24.5 | 30.4 | 31.5 | 30.3 |
| Jigsaw Puzzles (Noroozi & Favaro, 2016) | 18.2 | 28.8 | 34.0 | 33.9 | 27.1 |
| BIGAN (Donahue et al., 2016) | 17.7 | 24.5 | 31.0 | 29.9 | 28.0 |
| Split-Brain (Zhang et al., 2016b) | 17.7 | 29.3 | 35.4 | 35.2 | 32.8 |
| Counting (Noroozi et al., 2017) | 18.0 | 30.6 | 34.3 | 32.5 | 25.7 |
| (Ours) RotNet | 18.8 | 31.7 | 38.7 | 38.2 | 36.5 |
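A sketch of the linear-probe protocol described in the captions of Tables 5 and 6: each frozen conv. feature map is spatially resized with adaptive max pooling so that it has roughly 9000 elements, and a single linear layer (multinomial logistic regression) is trained on top. The helper name and the way the pooled size is chosen are our assumptions:

```python
import torch.nn as nn


def linear_probe(num_channels, num_classes=1000, target_elements=9000):
    """Linear probe on frozen conv. features: adaptive max pooling to roughly
    `target_elements`, then a single linear layer (multinomial logistic regression)."""
    # Choose the pooled spatial size s so that num_channels * s * s ~ target_elements,
    # e.g. an AlexNet conv5 map (256 channels) gives s = 6, i.e. 256 * 6 * 6 = 9216 features.
    s = max(1, int(round((target_elements / num_channels) ** 0.5)))
    return nn.Sequential(
        nn.AdaptiveMaxPool2d(s),
        nn.Flatten(),
        nn.Linear(num_channels * s * s, num_classes),
    )
```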
Table 6: Task & Dataset Generalization: Places top-1 classification with linear layers. We compare our unsupervised feature learning approach with other unsupervised approaches by training logistic regression classifiers on top of the feature maps of each layer to perform the 205-way Places classification task (Zhou et al., 2014). All unsupervised methods are pre-trained (in an unsupervised way) on ImageNet. All weights are frozen and feature maps are spatially resized (with adaptive max pooling) so as to have around 9000 elements. All approaches use AlexNet variants and were pre-trained on ImageNet without labels except the Places labels, ImageNet labels, and Random entries.

| Method | Conv1 | Conv2 | Conv3 | Conv4 | Conv5 |
| --- | --- | --- | --- | --- | --- |
| Places labels (Zhou et al., 2014) | 22.1 | 35.1 | 40.2 | 43.3 | 44.6 |
| ImageNet labels | 22.7 | 34.8 | 38.4 | 39.4 | 38.7 |
| Random | 15.7 | 20.3 | 19.8 | 19.1 | 17.5 |
| Random rescaled (Krähenbühl et al., 2015) | 21.4 | 26.2 | 27.1 | 26.1 | 24.0 |
| Context (Doersch et al., 2015) | 19.7 | 26.7 | 31.9 | 32.7 | 30.9 |
| Context Encoders (Pathak et al., 2016b) | 18.2 | 23.2 | 23.4 | 21.9 | 18.4 |
| Colorization (Zhang et al., 2016a) | 16.0 | 25.7 | 29.6 | 30.3 | 29.7 |
| Jigsaw Puzzles (Noroozi & Favaro, 2016) | 23.0 | 31.9 | 35.0 | 34.2 | 29.3 |
| BIGAN (Donahue et al., 2016) | 22.0 | 28.7 | 31.8 | 31.3 | 29.7 |
| Split-Brain (Zhang et al., 2016b) | 21.3 | 30.7 | 34.0 | 34.1 | 32.5 |
| Counting (Noroozi et al., 2017) | 23.3 | 33.9 | 36.3 | 34.7 | 29.6 |
| (Ours) RotNet | 21.5 | 31.0 | 35.1 | 34.6 | 33.7 |

Implementation details: For those experiments we implemented our RotNet model with an AlexNet architecture. Our implementation of the AlexNet model does not have local response normalization units, dropout units, or groups in the convolutional layers, while it includes batch normalization units after each linear layer (either convolutional or fully connected). In order to train the AlexNet-based RotNet model, we use SGD with batch size 192, momentum 0.9, weight decay 5e-4, and a learning rate of 0.01. We drop the learning rate by a factor of 10 after epochs 10 and 20, and we train for 30 epochs in total. As in the CIFAR experiments, during training we feed the RotNet model all four rotated copies of an image simultaneously (in the same mini-batch).
Table 7: Task & Dataset Generalization: PASCAL VOC 2007 classification and detection results, and PASCAL VOC 2012 segmentation results. We used the publicly available testing frameworks of Krähenbühl et al. (2015) for classification, of Girshick (2015) for detection, and of Long et al. (2015) for segmentation. For classification, we either fix the features before conv5 (column fc6-8) or we fine-tune the whole model (column all). For detection we use multi-scale training and single-scale testing. All approaches use AlexNet variants and were pre-trained on ImageNet without labels except the ImageNet labels and Random entries. After unsupervised training, we absorb the batch normalization units into the linear layers and we use the weight rescaling technique proposed by Krähenbühl et al. (2015) (which is common among the unsupervised methods). As customary, we report the mean average precision (mAP) on the classification and detection tasks, and the mean intersection over union (mIoU) on the segmentation task.

| Method | Classification fc6-8 (%mAP) | Classification all (%mAP) | Detection all (%mAP) | Segmentation all (%mIoU) |
| --- | --- | --- | --- | --- |
| ImageNet labels | 78.9 | 79.9 | 56.8 | 48.0 |
| Random | - | 53.3 | 43.4 | 19.8 |
| Random rescaled (Krähenbühl et al., 2015) | 39.2 | 56.6 | 45.6 | 32.6 |
| Egomotion (Agrawal et al., 2015) | 31.0 | 54.2 | 43.9 | - |
| Context Encoders (Pathak et al., 2016b) | 34.6 | 56.5 | 44.5 | 29.7 |
| Tracking (Wang & Gupta, 2015) | 55.6 | 63.1 | 47.4 | - |
| Context (Doersch et al., 2015) | 55.1 | 65.3 | 51.1 | - |
| Colorization (Zhang et al., 2016a) | 61.5 | 65.6 | 46.9 | 35.6 |
| BIGAN (Donahue et al., 2016) | 52.3 | 60.1 | 46.9 | 34.9 |
| Jigsaw Puzzles (Noroozi & Favaro, 2016) | - | 67.6 | 53.2 | 37.6 |
| NAT (Bojanowski & Joulin, 2017) | 56.7 | 65.3 | 49.4 | - |
| Split-Brain (Zhang et al., 2016b) | 63.0 | 67.1 | 46.7 | 36.0 |
| Color Proxy (Larsson et al., 2017) | - | 65.9 | - | 38.4 |
| Counting (Noroozi et al., 2017) | - | 67.7 | 51.4 | 36.6 |
| (Ours) RotNet | 70.87 | 72.97 | 54.4 | 39.1 |

ImageNet classification task: We evaluate the task generalization of our self-supervised learned features by training non-linear object classifiers on top of them for the ImageNet classification task (following the evaluation scheme of Noroozi & Favaro (2016)). In Table 4 we report the classification performance of our self-supervised features and we compare it with the other unsupervised approaches. We observe that our approach surpasses all the other methods by a significant margin. For the feature maps generated by the Conv4 layer, our improvement is more than 4 percentage points, and for the feature maps generated by the Conv5 layer our improvement is even bigger, around 8 percentage points. Furthermore, our approach significantly narrows the performance gap between unsupervised features and supervised features. In Table 5 we report similar results but for linear (logistic regression) classifiers (following the evaluation scheme of Zhang et al. (2016a)). Again, our unsupervised method demonstrates significant improvements over prior unsupervised methods.

Transfer learning evaluation on PASCAL VOC: In Table 7 we evaluate the task and dataset generalization of our unsupervised learned features by fine-tuning them on the PASCAL VOC classification, detection, and segmentation tasks. As with the ImageNet classification task, we outperform by a significant margin all the competing unsupervised methods in all tested tasks, significantly narrowing the gap with the supervised case. Notably, the PASCAL VOC 2007 object detection performance that our self-supervised model achieves is 54.4% mAP, which is only 2.4 points lower than the supervised case. We provide the per-class detection performance of our method in Table 8 (in Appendix B).

Places classification task: In Table 6 we evaluate the task and dataset generalization of our approach by training linear (logistic regression) classifiers on top of the learned features in order to perform the 205-way Places classification task. Note that in this case the learnt features are evaluated w.r.t. their generalization to classes that were unseen during the unsupervised training phase. As can be seen, even in this case our method manages to either surpass or achieve comparable results w.r.t. prior state-of-the-art unsupervised learning approaches.

4 CONCLUSIONS

In our work we propose a novel formulation for self-supervised feature learning that trains a ConvNet model to recognize the image rotation that has been applied to its input images. Despite the simplicity of our self-supervised task, we demonstrate that it successfully forces the ConvNet model trained on it to learn semantic features that are useful for a variety of visual perception tasks, such as object recognition, object detection, and object segmentation.
We exhaustively evaluate our method on various unsupervised and semi-supervised benchmarks and we achieve state-of-the-art performance in all of them. Specifically, our self-supervised approach manages to drastically improve the state-of-the-art results on unsupervised feature learning for ImageNet classification, PASCAL classification, PASCAL detection, PASCAL segmentation, and CIFAR-10 classification, surpassing prior approaches by a significant margin and thus drastically reducing the gap between unsupervised and supervised feature learning.

5 ACKNOWLEDGEMENTS

This work was supported by the ANR SEMAPOLIS project, an INTEL gift, and a hardware donation by NVIDIA.

REFERENCES

Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. In Proceedings of the IEEE International Conference on Computer Vision, pp. 37–45, 2015.

Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems, pp. 153–160, 2007.

Piotr Bojanowski and Armand Joulin. Unsupervised learning by predicting noise. arXiv preprint arXiv:1704.05310, 2017.

Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. CoRR, abs/1708.07860, 2017.

Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430, 2015.

Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.

Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 766–774, 2014.

M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.

Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, 2015.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

Fu Jie Huang, Y-Lan Boureau, Yann LeCun, et al. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Computer Vision and Pattern Recognition (CVPR 2007), pp. 1–8. IEEE, 2007.

Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137, 2015.

Philipp Krähenbühl, Carl Doersch, Jeff Donahue, and Trevor Darrell. Data-dependent initializations of convolutional neural networks. arXiv preprint arXiv:1511.06856, 2015.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Learning representations for automatic colorization. In European Conference on Computer Vision, pp. 577–593. Springer, 2016.
Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Colorization as a proxy task for visual understanding. arXiv preprint arXiv:1703.04044, 2017.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Renjie Liao, Alex Schwing, Richard Zemel, and Raquel Urtasun. Learning deep parsimonious representations. In Advances in Neural Information Processing Systems, pp. 5076–5084, 2016.

Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

Jonathan Masci, Ueli Meier, Dan Cireşan, and Jürgen Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. Artificial Neural Networks and Machine Learning (ICANN 2011), pp. 52–59, 2011.

Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84. Springer, 2016.

Mehdi Noroozi, Hamed Pirsiavash, and Paolo Favaro. Representation learning by learning to count. arXiv preprint arXiv:1708.06734, 2017.

Edouard Oyallon and Stéphane Mallat. Deep roto-translation scattering for object classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2865–2873, 2015.

Edouard Oyallon, Eugene Belilovsky, and Sergey Zagoruyko. Scaling the scattering transform: Deep hybrid networks. arXiv preprint arXiv:1703.08961, 2017.

Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. arXiv preprint arXiv:1612.06370, 2016a.

Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544, 2016b.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802, 2015.

Jianwei Yang, Devi Parikh, and Dhruv Batra. Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5147–5156, 2016.

Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In European Conference on Computer Vision, pp. 649–666. Springer, 2016a.

Richard Zhang, Phillip Isola, and Alexei A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. arXiv preprint arXiv:1611.09842, 2016b.

Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 27, pp. 487–495. Curran Associates, Inc., 2014.
APPENDIX A: VISUALIZING ATTENTION MAPS OF ROTATED IMAGES

Here we visualize the attention maps generated by an AlexNet model trained on the self-supervised task of rotation recognition for all the rotated copies of a few images. We observe that the attention maps of all the rotated copies of an image are roughly the same, i.e., the attention maps are equivariant w.r.t. the image rotations. This practically means that in order to accomplish the rotation prediction task the network focuses on the same object parts regardless of the image rotation.

Figure 6: Attention maps of the Conv3 (size: 13 x 13) and Conv5 (size: 6 x 6) feature maps generated by an AlexNet model trained on the self-supervised task of recognizing image rotations. Here we present the attention maps generated for all the 4 rotated copies (0, 90, 180, and 270 degrees) of an image.

APPENDIX B: PER CLASS BREAKDOWN OF DETECTION AND CLASSIFICATION PERFORMANCE

In Tables 8 and 9 we report the per-class performance of our unsupervised learning method on the PASCAL detection and CIFAR-10 classification tasks respectively.

Table 8: Per-class PASCAL VOC 2007 detection performance. As usual, we report the average precision metric. The results of the supervised model (i.e., the ImageNet labels entry) come from Doersch et al. (2015).

| Classes | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ImageNet labels | 64.0 | 69.6 | 53.2 | 44.4 | 24.9 | 65.7 | 69.6 | 69.2 | 28.9 | 63.6 | 62.8 | 63.9 | 73.3 | 64.6 | 55.8 | 25.7 | 50.5 | 55.4 | 69.3 | 56.4 |
| (Ours) RotNet | 65.5 | 65.3 | 43.8 | 39.8 | 20.2 | 65.4 | 69.2 | 63.9 | 30.2 | 56.3 | 62.3 | 56.8 | 71.6 | 67.2 | 56.3 | 22.7 | 45.6 | 59.5 | 71.6 | 55.3 |

Table 9: Per-class CIFAR-10 classification accuracy.

| Classes | aero | car | bird | cat | deer | dog | frog | horse | ship | truck |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Supervised | 93.7 | 96.3 | 89.4 | 82.4 | 93.6 | 89.7 | 95.0 | 94.3 | 95.7 | 95.2 |
| (Ours) RotNet | 91.7 | 95.8 | 87.1 | 83.5 | 91.5 | 85.3 | 94.2 | 91.9 | 95.7 | 94.2 |