# CompRess: Self-Supervised Learning by Compressing Representations

Soroush Abbasi Koohpayegani, Ajinkya Tejankar, Hamed Pirsiavash
University of Maryland, Baltimore County
{soroush,at6,hpirsiav}@umbc.edu

Equal contribution.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract

Self-supervised learning aims to learn good representations with unlabeled data. Recent works have shown that larger models benefit more from self-supervised learning than smaller models. As a result, the gap between supervised and self-supervised learning has been greatly reduced for larger models. In this work, instead of designing a new pseudo task for self-supervised learning, we develop a model compression method to compress an already learned, deep self-supervised model (teacher) to a smaller one (student). We train the student model so that it mimics the relative similarity between the datapoints in the teacher's embedding space. For AlexNet, our method outperforms all previous methods, including the fully supervised model, on ImageNet linear evaluation (59.0% compared to 56.5%) and on nearest neighbor evaluation (50.7% compared to 41.4%). To the best of our knowledge, this is the first time a self-supervised AlexNet has outperformed the supervised one on ImageNet classification. Our code is available here: https://github.com/UMBCvision/CompRess

1 Introduction

Supervised deep learning needs lots of annotated data, but the annotation process is particularly expensive in some domains like medical images. Moreover, the process is prone to human bias and may result in ambiguous annotations. Hence, we are interested in self-supervised learning (SSL), where we learn rich representations from unlabeled data. One may use these learned features along with a simple linear classifier to build a recognition system with little annotated data. It has been shown that SSL models trained on ImageNet without labels outperform supervised models when transferred to other tasks [9, 24].

Some recent self-supervised learning algorithms have shown that increasing the capacity of the architecture results in much better representations. For instance, for the SimCLR method [9], the gap between supervised and self-supervised is much smaller for ResNet-50x4 compared to ResNet-50 (also shown in Figure 1). Given this observation, we are interested in learning better representations for small models by compressing a deep self-supervised model.

In edge computing applications, we prefer to run the model (e.g., an image classifier) on the device (e.g., IoT) rather than sending the images to the cloud. During inference, this reduces privacy concerns, latency, power usage, and cost. Hence, there is a need for rich, small models. Compressing SSL models goes beyond that and reduces privacy concerns at training time as well. For instance, one can download a rich self-supervised MobileNet model that generalizes well to other tasks and finetune it on one's own data without sending any data to the cloud for training.

Since we assume our teacher has not seen any labels, its output is an embedding rather than a probability distribution over some categories. Hence, standard model distillation methods [26] cannot be used directly.
Figure 1: ImageNet evaluation: We compare our Ours-1q self-supervised model with supervised and SOTA self-supervised models on ImageNet using linear classification (left), nearest neighbor (middle), and cluster alignment (right) evaluations. Our AlexNet model outperforms the supervised counterpart on all evaluations. This model is compressed from ResNet-50x4 trained with the SimCLR method using unlabeled ImageNet. All models have seen ImageNet images only. All SOTA SSL models are MoCo except ResNet-50x4, which is SimCLR. The teacher for our AlexNet and ResNet-50 is SimCLR ResNet-50x4, and for ResNet-18 and MobileNet-V2 it is MoCo ResNet-50.

One can employ a nearest neighbor classifier in the teacher space by calculating distances between an input image (query) and all datapoints (anchor points) and then converting them to a probability distribution. Our idea is to transfer this probability distribution from the teacher to the student so that, for any query point, the student matches the teacher in the ranking of anchor points.

Traditionally, most SSL methods are evaluated by learning a linear classifier on the features to perform a downstream task (e.g., ImageNet classification). However, this evaluation process is expensive and has many hyperparameters (e.g., the learning rate schedule) that need to be tuned, as one set of parameters may not be optimal for all SSL methods. We believe a simple nearest neighbor classifier, used in some recent works [57, 67, 60], is a better alternative: it has no parameters, is much faster to evaluate, and still measures the quality of the features. Hence, we use this evaluation extensively in our experiments. Moreover, inspired by [30], we use another related evaluation that measures the alignment between k-means clusters and image categories.

Our extensive experiments show that our compressed SSL models outperform state-of-the-art compression methods as well as state-of-the-art SSL counterparts using the same architecture on most downstream tasks. Our AlexNet model, compressed from ResNet-50x4 trained with the SimCLR method, outperforms the standard supervised AlexNet model on linear evaluation (by 2 points), on nearest neighbor evaluation (by 9 points), and on cluster alignment evaluation (by 4 points). This is interesting, as all parameters of the supervised model are trained on the downstream task itself, while the SSL model and its teacher have seen only ImageNet without labels. To the best of our knowledge, this is the first time an SSL model performs better than the supervised one on the ImageNet task itself rather than in transfer learning settings.

2 Related work

Self-supervised learning: In self-supervised learning for images, we learn rich features by solving a pretext task that needs unlabeled data only. The pseudo task may be colorization [64], inpainting [42], solving Jigsaw puzzles [36], counting visual primitives [37], or clustering images [7].

Contrastive learning: Our method is related to contrastive learning [23, 39, 27, 67, 4, 49, 25], where the model learns to contrast a positive pair against many negative pairs. The positive pair comes from the same image and model but different augmentations in [24, 9, 51], and from the same image and augmentation but different models (teacher and student) in [50]. Our method uses a soft probability distribution instead of a positive/negative classification [67] and, in the Ours-2q variant, does not couple the two embeddings (teacher and student) directly [50].
Contrastive learning is improved with a more robust memory bank in [24] and with a temperature and better image augmentations in [9]. Our ideas are related to exemplar CNNs [15, 34], but we use them for compression.

Figure 2: Our compression method: The goal is to transfer the knowledge from the self-supervised teacher to the student. For each image, we compare it with a random set of data points called anchors and obtain a set of similarities. These similarities are then converted into a probability distribution over the anchors. This distribution represents each image in terms of its nearest neighbors. Since we want to transfer this knowledge to the student, we obtain the same distribution from the student as well. Finally, we train the student to minimize the KL divergence between the two distributions. Intuitively, we want each data point to have the same neighbors in both the teacher and student embeddings. This illustrates the Ours-2q method. For Ours-1q, we simply remove the student memory bank and use the teacher's anchor points for the student as well.

Model compression: The task of training a simpler student to mimic the output of a complex teacher is called model compression in [6] and knowledge distillation in [26]. In [26], the softened class probabilities from the teacher are transferred to the student by reducing KL divergence. The knowledge in the hidden activations of the teacher's intermediate layers is transferred by regressing linear projections [45], aggregated feature maps [62], and Gram matrices [59]. Knowledge at the final layer can also be transferred in different ways [3, 26, 31, 41, 43, 40, 53, 2, 50, 61, 56, 5, 18, 48]. In [2, 50], distillation is formulated as maximizing the mutual information between the teacher and student.

Similarity-based distillation: Pairwise-similarity-based knowledge distillation has been used with supervised teachers: [43, 53, 40] use a supervised loss in distillation. [41] is probably the closest to our setting, as it does not use labels in the distillation step. We differ in that we use a memory bank and a softmax with temperature, and we apply the method to compressing self-supervised models at large scale. We compare with a reproduced variant of [41] in the experiments (Section 4.3).

Model compression for self-supervision: Standard model compression techniques either directly use the output of supervised training [26] or have a supervised loss term [50, 40] in addition to the compression loss term. Thus, they cannot be directly applied to compress self-supervised models. In [38], the knowledge from the teacher is transferred to the student by first clustering the teacher embeddings and then training the student to predict the cluster assignments. In [58], the method of [38] is applied to regularize self-supervised models.

3 Method

Our goal is to train a deep model (e.g., ResNet-50) using an off-the-shelf self-supervised learning algorithm and then compress it to a shallower model (e.g., AlexNet) while preserving the discriminative power of the features. Figure 2 shows our method. Assuming a frozen teacher embedding $t(x) \in \mathbb{R}^N$ with parameters $\theta_t$ that maps an image $x$ into an $N$-dimensional feature space, we want to learn a student embedding $s(x) \in \mathbb{R}^M$ with parameters $\theta_s$ that mimics the same behavior as $t(x)$ when used for a downstream supervised task, e.g., image classification. Note that the teacher and student may use architectures from different families, so we do not necessarily want to couple them together directly.
Hence, we transfer the similarity between data points from the teacher to the student rather than their final predictions.

For simplicity, we use $t_i = t(x_i)$ for the embedding of the model $t(\cdot)$ on the input image $x_i$, normalized by its $\ell_2$ norm. We assume a random set of the training data $\{x_j\}_{j=1}^{n}$ are the anchor points and embed them using both the teacher and student models to get $\{t^a_j\}_{j=1}^{n}$ and $\{s^a_j\}_{j=1}^{n}$. Given a query image $q_i$ and its embeddings $t^q_i$ for the teacher and $s^q_i$ for the student, we calculate the pairwise similarity between $t^q_i$ and all anchor embeddings $\{t^a_j\}_{j=1}^{n}$, and then optimize the student model so that, in the student's embedding space, the query $s^q_i$ has the same relationship with the anchor points $\{s^a_j\}_{j=1}^{n}$.

To measure the relationship between the query and anchor points, we calculate their cosine similarity. We convert the similarities into a probability distribution over the anchor points using a softmax operator. For the teacher, the probability of the $i$-th query for the $j$-th anchor point is:

$$p^j_i(t) = \frac{\exp(t^{q\top}_i t^a_j / \tau)}{\sum_{k=1}^{n} \exp(t^{q\top}_i t^a_k / \tau)}$$

where $\tau$ is the temperature hyperparameter. Then, we define the loss for a particular query point as the KL divergence between the probabilities over all anchor points under the teacher and student models, and we sum this loss over all query points:

$$L(t, s) = \sum_i KL\big(p_i(t) \,\|\, p_i(s)\big)$$

where $p_i(s)$ is the probability distribution of query $i$ over all anchor points under the student network. Finally, since the teacher is frozen, we optimize the student by solving:

$$\arg\min_{\theta_s} L(t, s) = \arg\min_{\theta_s} \; -\sum_{i,j} p^j_i(t) \log\big(p^j_i(s)\big)$$

Memory bank: One may use the same minibatch for both query and anchor points by excluding the query from each set of anchor points. However, we need a large set of anchor points (ideally the whole training set) so that they have enough variation to cover the neighborhood of any query image. Our experiments verify that using a minibatch of size 256 for the anchor points is not enough for learning rich representations. This is reasonable, as ImageNet has 1000 categories, so the query may not be close to any anchor point in a minibatch of size 256. However, it is computationally expensive to process many images in a single iteration due to limited computation and memory. Similar to [57], we maintain a memory bank of anchor points from the several most recent iterations. We use the momentum contrast framework [24] to implement the memory bank for the student. However, unlike [24], we find that our method is not affected by the momentum parameter, which requires further investigation. Since the teacher is frozen, we implement its memory bank as a simple FIFO queue.

Temperature parameter: Since the anchor points have large variation, covering the whole dataset, many of them may be very far from the query image. We use a small temperature value (less than one) since we want to focus mainly on transferring the relationships from the close neighborhood of the query rather than from faraway points. Note that this results in sharper probabilities compared to $\tau = 1$. We show that $\tau = 1$ degrades the results dramatically. The temperature value acts similarly to the kernel width in kernel density estimation methods.
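To make the objective concrete, the sketch below computes the teacher and student distributions over a bank of anchor embeddings and the KL term in PyTorch. This is a minimal illustration of the formulation above, not the released CompRess code; the function and tensor names are our own, the anchor banks are assumed to be pre-filled and ℓ2-normalized, and the bank used for the student can be either a separate student bank or the teacher's own bank (the two variants discussed below).

```python
import torch
import torch.nn.functional as F

def similarity_kl_loss(student_q, teacher_q, teacher_bank, student_bank=None, tau=0.04):
    """Sketch of the similarity-transfer loss.

    student_q:    (B, M) student embeddings of the query images
    teacher_q:    (B, N) frozen (or cached) teacher embeddings of the same images
    teacher_bank: (K, N) teacher anchor embeddings (assumed L2-normalized)
    student_bank: (K, M) student anchor embeddings; if None, the teacher bank is
                  reused for the student as well (then M must equal N).
    """
    student_q = F.normalize(student_q, dim=1)
    teacher_q = F.normalize(teacher_q, dim=1)

    # Cosine similarities to the anchors, turned into distributions with temperature tau.
    p_teacher = F.softmax(teacher_q @ teacher_bank.t() / tau, dim=1)      # (B, K)
    bank = teacher_bank if student_bank is None else student_bank
    log_p_student = F.log_softmax(student_q @ bank.t() / tau, dim=1)      # (B, K)

    # KL(p_teacher || p_student); since the teacher is frozen, this matches the
    # cross-entropy form of the arg-min above up to a constant.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```

In the full method, the banks are queues updated every iteration: a plain FIFO queue for the teacher and, in the two-queue variant, a momentum-encoder queue for the student, as described above.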
Student using the teacher's memory bank: So far, we assumed that the teacher and student embeddings are decoupled, so we used a separate memory bank (queue) for each. We call this method Ours-2q. However, we may instead use the teacher's anchor points in calculating the similarity for the student model as well. This way, the model may learn faster and be more stable in the initial stages of learning, since the teacher anchor points are already mature. We call this variation Ours-1q in our experiments. Note that in the Ours-1q method, we do not use momentum since the teacher is constant.

Caching the teacher embeddings: Since we are interested in using very deep models (e.g., ResNet-50x4) as the teacher, calculating the embeddings for the teacher is expensive in terms of both computation and memory. Also, we are not optimizing the teacher model. Hence, for such large models, we can cache the results of the teacher on all images of the dataset and keep them in memory. This caching has the drawback that we cannot augment the images for the teacher, meaning that the teacher sees the exact same images in all epochs. However, since the student still sees augmented images, it is less prone to overfitting. On the other hand, caching may actually help the student by encouraging the relationship between the query and anchor points to stay close even under different augmentations, hence improving the representation in a way similar to regular contrastive learning [57, 24]. In our experiments, we find that caching degrades the results by only a small margin while being much faster and more efficient. We use caching when we compress from ResNet-50x4 to AlexNet.

4 Experiments and results

We use different combinations of architectures as student-teacher pairs (listed in Table 1). We use three teachers: (a) a ResNet-50 model trained with the MoCo-v2 method for 800 epochs [10], (b) a ResNet-50 trained with SwAV [8] for 800 epochs, and (c) a ResNet-50x4 model trained with the SimCLR method for 1000 epochs [9]. We use the officially published weights of these models [44, 47, 54]. For supervised models, we use the official PyTorch weights [52]. We use ImageNet (ILSVRC2012) [46] without labels for all self-supervised and compression methods, and use various datasets (ImageNet, PASCAL-VOC [16], Places [66], CUB200 [55], and Cars196 [32]) for evaluation.

Implementation details: Here, we report the implementation details for the Ours-2q and Ours-1q compression methods. The implementation details for all baselines and transfer experiments are included in the appendix. We use PyTorch along with SGD (weight decay=1e-4, learning rate=0.01, momentum=0.9, epochs=130, and batch size=256). We multiply the learning rate by 0.2 at epochs 90 and 120. We use the standard ImageNet data augmentation found in PyTorch. Compressing from ResNet-50x4 to ResNet-50 takes ~100 hours on four Titan-RTX GPUs, while compressing from ResNet-50 to ResNet-18 takes ~90 hours on two 2080-Ti GPUs. We adapt the unofficial implementation of MoCo [24] in [11] to implement the memory bank for our method. We use a memory bank size of 128,000 and set the moving average weight for the key encoder to 0.999. We use a temperature of 0.04 for all experiments involving the SimCLR ResNet-50x4 and MoCo ResNet-50 teachers. We pick these values based on the ablation study done for the temperature parameter in Section 4.5. For the SwAV ResNet-50 teacher, we use a temperature of 0.007, since we find that it works better than 0.04.
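The sketch below illustrates the cached-teacher setup and the FIFO anchor queue described above. It is our own simplification rather than the released implementation: it assumes the data loader yields (images, indices) pairs so cached features can be stored by dataset index, and the queue is initialized with random normalized vectors.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cache_teacher_features(teacher, loader, feat_dim, device="cuda"):
    """Precompute L2-normalized teacher embeddings once, since the teacher is frozen.
    Assumes `loader` yields (images, indices) so features can be stored by dataset index."""
    bank = torch.zeros(len(loader.dataset), feat_dim)
    teacher.eval()
    for images, indices in loader:
        feats = F.normalize(teacher(images.to(device)), dim=1)
        bank[indices] = feats.cpu()
    return bank

class TeacherQueue:
    """Teacher memory bank: a plain FIFO queue of anchor embeddings (no momentum needed)."""
    def __init__(self, feat_dim, size=128_000):
        self.bank = F.normalize(torch.randn(size, feat_dim), dim=1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, feats):
        # Overwrite the oldest entries with the newest teacher embeddings.
        idx = torch.arange(self.ptr, self.ptr + feats.shape[0]) % self.bank.shape[0]
        self.bank[idx] = feats
        self.ptr = (self.ptr + feats.shape[0]) % self.bank.shape[0]
```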
4.1 Evaluation Metrics

Linear classifier (Linear): We treat the student as a frozen feature extractor, train a linear classifier on the labeled training set of ImageNet, and evaluate it on the validation set with Top-1 accuracy. To reduce the computational overhead of tuning hyperparameters per experiment, we standardize the Linear evaluation as follows. We first normalize the features by their ℓ2 norm, then shift and scale each dimension to have zero mean and unit variance. For all linear layer experiments, we use SGD with lr=0.01, epochs=40, batch size=256, weight decay=1e-4, and momentum=0.9. At epochs 15 and 30, the lr is multiplied by 0.1.

Nearest Neighbor (NN): We also evaluate the student representations using a nearest neighbor classifier with cosine similarity. We use the FAISS GPU library [1] to implement it. This method does not need any parameter tuning and is very fast (~25 minutes for ResNet-50 on a single 2080-Ti GPU).

Cluster Alignment (CA): The goal is to measure the alignment between clusters of our SSL representations and visual categories, e.g., ImageNet categories. We use k-means (with k=1000) to cluster our self-supervised features trained on unlabeled ImageNet, map each cluster to an ImageNet category, and then evaluate on the ImageNet validation set. In order to map clusters to categories, we first calculate the alignment between all (cluster, category) pairs as the number of common images divided by the size of the cluster. Then, we find the best mapping between clusters and categories using the Hungarian algorithm [33], which maximizes the total alignment. This labels the clusters. Then, we report the classification accuracy on the validation set. This setting is similar to the object discovery setting in [30]. In Figure 3(c), we show some random images from random clusters, where the images inside each cluster are semantically related.
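As a rough sketch of the Cluster Alignment metric (not the authors' evaluation code), the procedure can be written as below. We assume scikit-learn k-means and SciPy's Hungarian solver, and we assume the training-split labels are used to score the cluster-to-category alignment before accuracy is reported on the validation split.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

def cluster_alignment_accuracy(train_feats, train_labels, val_feats, val_labels, k=1000):
    """Cluster Alignment (CA) sketch: k-means on the features, Hungarian matching of
    clusters to categories, then classification accuracy on the validation set."""
    kmeans = KMeans(n_clusters=k, n_init=1).fit(train_feats)

    # Alignment of every (cluster, category) pair: shared images divided by cluster size.
    counts = np.zeros((k, k))
    for c, y in zip(kmeans.labels_, train_labels):
        counts[c, y] += 1
    alignment = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)

    # The Hungarian algorithm picks the one-to-one mapping that maximizes total alignment.
    rows, cols = linear_sum_assignment(-alignment)
    cluster_to_class = dict(zip(rows, cols))

    # Label each validation image by its cluster's assigned category and report accuracy.
    val_clusters = kmeans.predict(val_feats)
    preds = np.array([cluster_to_class[c] for c in val_clusters])
    return float((preds == np.asarray(val_labels)).mean())
```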
4.2 Baselines

Contrastive Representation Distillation (CRD): CRD [50] is a state-of-the-art, information-maximization-based distillation method that includes a supervised loss term. It directly compares the embeddings of the teacher and student in a contrastive setting. We remove the supervised loss in our experiments.

Cluster Classification (CC): Cluster Classification [38] is an unsupervised knowledge distillation method that improves self-supervised learning by quantizing the teacher representations. This is similar to the recent work of ClusterFit [58].

Regression (Reg): We implement a modified version of [45] that regresses only the embedding-layer features [61]. Similar to [45, 61], we add a linear projection head on top of the student to match the embedding dimension of the teacher. As noted in CRD [50], transferring knowledge from all intermediate layers does not perform well, since the teacher and student may have different architecture styles. Hence, we use the regression loss only for the embedding layer of the networks.

Regression with Batch Norm (Reg-BN): We realized that Reg does not perform well for model compression. We suspect the reason is the mismatch between the embedding spaces of the teacher and student networks. Hence, we added a non-parametric Batch Normalization layer on the last layer of both the student and teacher networks to match their statistics. The BN layer uses statistics from the current minibatch only (element-wise whitening). Interestingly, this simple modified baseline is better than other, more sophisticated baselines for model compression.

Table 1: Comparison of distillation methods on full ImageNet: Our method is better than all compression methods for various teacher-student combinations and evaluation benchmarks. In addition, as reported in Table 5 and Figure 1, when we compress ResNet-50x4 to AlexNet, we get 59.0% for Linear, 50.7% for Nearest Neighbor (NN), and 27.6% for Cluster Alignment (CA), which outperforms the supervised model. On NN, our ResNet-50 is only 1 point worse than its ResNet-50x4 teacher. Note that models below the teacher row use the student architecture. Since a forward pass through the teacher is expensive for ResNet-50x4, we do not compare with CRD, Reg, and Reg-BN. The teacher is MoCo ResNet-50 for the AlexNet, ResNet-18, and MobileNet-V2 students, and SimCLR ResNet-50x4 for the ResNet-50 student.

| Method | AlexNet Linear | NN | CA | ResNet-18 Linear | NN | CA | MobileNet-V2 Linear | NN | CA | ResNet-50 Linear | NN | CA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Teacher | 70.8 | 57.3 | 34.2 | 70.8 | 57.3 | 34.2 | 70.8 | 57.3 | 34.2 | 75.6 | 64.5 | 38.7 |
| Supervised | 56.5 | 41.4 | 22.9 | 69.8 | 63.0 | 44.9 | 71.9 | 64.9 | 46.0 | 76.2 | 71.4 | 55.6 |
| CC [38] | 46.4 | 31.6 | 13.7 | 61.1 | 51.1 | 25.2 | 59.2 | 50.2 | 24.7 | 68.9 | 55.6 | 26.4 |
| CRD [50] | 54.4 | 36.9 | 14.1 | 58.4 | 43.7 | 17.4 | 54.1 | 36.0 | 12.0 | - | - | - |
| Reg | 49.9 | 35.6 | 9.5 | 52.2 | 41.7 | 25.6 | 48.0 | 38.6 | 25.4 | - | - | - |
| Reg-BN | 56.1 | 42.8 | 22.3 | 58.2 | 47.3 | 27.2 | 62.3 | 48.7 | 27.0 | - | - | - |
| Ours-2q | 56.4 | 48.4 | 33.3 | 61.7 | 53.4 | 34.7 | 63.0 | 54.4 | 35.5 | 71.0 | 63.0 | 41.1 |
| Ours-1q | 57.5 | 48.0 | 27.0 | 62.6 | 53.5 | 33.0 | 65.8 | 54.8 | 32.8 | 71.9 | 63.3 | 41.4 |

Table 2: Comparison of distillation methods on full ImageNet for SwAV ResNet-50 (teacher) to ResNet-18 (student). Note that SwAV (concurrent work) [8] is different from MoCo and SimCLR in that it performs contrastive learning through online clustering.

| Method | Linear | NN | CA |
|---|---|---|---|
| Teacher | 75.6 | 60.7 | 27.6 |
| Supervised | 69.8 | 63.0 | 44.9 |
| CRD | 58.2 | 44.7 | 16.9 |
| CC | 60.8 | 51.0 | 22.8 |
| Reg-BN | 60.6 | 47.6 | 20.8 |
| Ours-2q | 62.4 | 53.7 | 26.7 |
| Ours-1q | 65.6 | 56.0 | 26.3 |

Table 3: NN evaluation for ImageNet with fewer labels: We report NN evaluation on the validation data using small training data (both ImageNet) for ResNet-18 compressed from MoCo ResNet-50. For 1-shot, we report the standard deviation over 10 runs.

| Model | 1-shot | 1% | 10% |
|---|---|---|---|
| Supervised (entire labeled ImageNet) | 29.8 (±0.3) | 48.5 | 56.8 |
| CC [38] | 16.3 (±0.3) | 31.6 | 41.9 |
| CRD [50] | 11.4 (±0.3) | 23.3 | 33.6 |
| Reg-BN | 21.5 (±0.1) | 33.4 | 40.1 |
| Ours-2q | 29.0 (±0.3) | 41.2 | 47.6 |
| Ours-1q | 26.5 (±0.3) | 39.6 | 47.2 |

4.3 Experiments Comparing Compression Methods

Evaluation on full ImageNet: We train the teacher on unlabeled ImageNet, compress it to the student, and evaluate the student on the ImageNet validation set. As shown in Table 1, our method outperforms other distillation methods on all evaluation benchmarks. For a fair comparison, on ResNet-18, we trained MoCo for 1,000 epochs and got 54.5% in Linear and 41.1% in NN, which still does not match our model. Also, a variation of our method (MoCo R50 to R18) without the softmax, temperature, and memory bank (similar to [41]) results in 53.6% in Linear and 42.3% in NN. To evaluate the effect of the teacher's SSL method, in Table 2, we use SwAV ResNet-50 as the teacher and compress it to ResNet-18. We still get better accuracy compared to other distillation methods.
Table 4: Transfer to CUB200 and Cars196: We train the features on unlabeled ImageNet, freeze the features, and return the top-k nearest neighbors based on cosine similarity. We evaluate the recall at different k values (1, 2, 4, and 8) on the validation set.

| Method | AlexNet Teacher | CUB200 R@1 | R@2 | R@4 | R@8 | Cars196 R@1 | R@2 | R@4 | R@8 |
|---|---|---|---|---|---|---|---|---|---|
| Sup. on ImageNet | - | 33.5 | 45.5 | 59.2 | 71.9 | 26.6 | 36.3 | 45.9 | 57.8 |
| CRD [50] | ResNet-50 | 16.6 | 25.9 | 36.3 | 48.7 | 20.9 | 28.2 | 37.7 | 48.9 |
| Reg-BN | ResNet-50 | 16.8 | 25.5 | 36.2 | 48.0 | 20.9 | 29.0 | 38.5 | 49.7 |
| CC | ResNet-50 | 23.2 | 32.5 | 45.1 | 58.2 | 23.7 | 31.4 | 41.1 | 52.4 |
| Ours-2q | ResNet-50 | 23.1 | 33.0 | 45.1 | 58.0 | 23.6 | 32.8 | 42.9 | 54.9 |
| Ours-1q | ResNet-50 | 22.7 | 31.9 | 43.2 | 55.8 | 22.5 | 30.6 | 40.4 | 52.3 |
| CC | ResNet-50x4 | 23.6 | 33.6 | 44.9 | 58.4 | 25.4 | 33.2 | 43.2 | 54.3 |
| Ours-2q | ResNet-50x4 | 26.5 | 37.0 | 49.4 | 62.4 | 28.4 | 38.5 | 48.7 | 60.4 |
| Ours-1q | ResNet-50x4 | 21.9 | 32.4 | 43.2 | 55.9 | 25.0 | 34.2 | 45.1 | 57.3 |

Evaluation on smaller ImageNet: We evaluate our representations with a NN classifier using only 1%, 10%, and only 1 sample per category of ImageNet. The results are shown in Table 3. For 1-shot, the Ours-2q model achieves an accuracy close to the supervised model, which has seen all labels of ImageNet when learning its features.

Transfer to CUB200 and Cars196: We transfer AlexNet student models to the task of image retrieval on the CUB200 [55] and Cars196 [32] datasets. We evaluate on these tasks without any finetuning. The results are shown in Table 4. Surprisingly, for the combination of the Cars196 dataset and the ResNet-50x4 teacher, our model even outperforms the ImageNet supervised model. Since in Ours-2q the student embedding is less restricted and does not follow the teacher closely, the student may generalize better compared to the Ours-1q method. Hence, we see better results for Ours-2q on almost all transfer experiments. This effect is similar to [38, 58].

4.4 Experiments Comparing Self-Supervised Methods

Evaluation on ImageNet: We compare our features with SOTA self-supervised learning methods in Table 5 and Figure 1. Our method outperforms all baselines on all small-capacity architectures (AlexNet, MobileNet-V2, and ResNet-18). On AlexNet, it outperforms even the supervised model. Table 6 shows the results of a linear classifier using only 1% and 10% of ImageNet for ResNet-50.

Transferring to Places: We evaluate our intermediate representations learned from unlabeled ImageNet on the Places scene recognition task. We train linear layers on top of intermediate representations, similar to [21]. Details are in the appendix. The results are shown in Table 5. We find that our best layer's performance is better than that of a model trained with ImageNet labels.

Transferring to PASCAL-VOC: We evaluate AlexNet compressed from ResNet-50x4 on PASCAL-VOC classification and detection tasks in Table 7. For the classification task, we only train a linear classifier on top of the frozen backbone, in contrast to the baselines, which finetune all layers. For object detection, we use Fast-RCNN [20], as used in [38, 19], and finetune all layers.

4.5 Ablation Study

To speed up the ablation study, we use 25% of ImageNet (randomly sampled, ~320k images) and cached features of a MoCo ResNet-50 teacher to train a ResNet-18 student. For the temperature ablation, the memory bank size is 128k, and for the memory bank ablation, the temperature is 0.04. All ablations were performed with the Ours-2q method.

Temperature: The results of varying the temperature between 0.02 and 1.0 are shown in Figure 3(a). We find that the optimal temperature is 0.04, and the student gets worse as the temperature gets closer to 1.0. We believe this happens because a small temperature focuses on close neighborhoods by sharpening the probability distribution. A similar behavior is also reported in [9]. As opposed to other similarity-based distillation methods [41, 43, 40, 53], by using a small temperature we focus on the close neighborhood of a data point, which results in an improved student.
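As a small numeric illustration of this sharpening effect (the similarity values are our own toy example, not from the paper): with cosine similarities of 0.9, 0.5, and 0.1 to three anchors, τ = 1 spreads the probability mass almost evenly, while τ = 0.04 concentrates essentially all of it on the nearest anchor.

```python
import torch

sims = torch.tensor([0.9, 0.5, 0.1])       # toy cosine similarities of a query to 3 anchors

print(torch.softmax(sims / 1.0, dim=0))    # ~[0.47, 0.32, 0.21]: far anchors still get weight
print(torch.softmax(sims / 0.04, dim=0))   # ~[1.0, 4.5e-5, 2.1e-9]: only the nearest anchor matters
```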
Table 5: Linear evaluation on ImageNet and Places: Comparison with SOTA self-supervised methods. We pick the best layer to report the results, written in parentheses: f7 refers to the fc7 layer and c4 refers to the conv4 layer. R50x4 refers to the teacher trained with SimCLR and R50 to the teacher trained with MoCo. On ResNet-50, our model, compressed from SimCLR R50x4, is better than SimCLR itself, but worse than SwAV, BYOL, and InfoMin, which are concurrent works. * refers to 10-crop evaluation.

AlexNet students:

| Method | Ref | ImageNet top-1 | Places top-1 |
|---|---|---|---|
| Sup. on ImageNet | - | 56.5 (f7) | 39.4 (c4) |
| Inpainting [42] | [60] | 21.0 (c3) | 23.4 (c3) |
| BiGAN [14] | [38] | 29.9 (c4) | 31.8 (c3) |
| Colorization [64] | [19] | 31.5 (c4) | 30.3 (c4) |
| Context [13] | [19] | 31.7 (c4) | 32.7 (c4) |
| Jigsaw [36] | [19] | 34.0 (c3) | 35.0 (c3) |
| Counting [37] | [38] | 34.3 (c3) | 36.3 (c3) |
| SplitBrain [65] | [38] | 35.4 (c3) | 34.1 (c4) |
| InstDisc [57] | [57] | 35.6 (c5) | 34.5 (c4) |
| CC+Vgg+Jigsaw [38] | [38] | 37.3 (c3) | 37.5 (c3) |
| RotNet [19] | [19] | 38.7 (c3) | 35.1 (c3) |
| Artifact [29] | [17] | 38.9 (c4) | 37.3 (c4) |
| AND [28] | [60] | 39.7 (c4) | - |
| DeepCluster [7] | [7] | 39.8 (c4) | 37.5 (c4) |
| LA* [67] | [67] | 42.4 (c5) | 40.3 (c4) |
| CMC [49] | [60] | 42.6 (c5) | - |
| AET [63] | [60] | 44.0 (c3) | 37.1 (c3) |
| RFDecouple [17] | [17] | 44.3 (c5) | 38.6 (c5) |
| SeLa+Rot+aug [60] | [60] | 44.7 (c5) | 37.9 (c4) |
| MoCo | - | 45.7 (f7) | 36.6 (c4) |
| Ours-2q (from R50x4) | - | 57.6 (f7) | 40.4 (c5) |
| Ours-1q (from R50x4) | - | 59.0 (f7) | 40.3 (c5) |

ResNet-18 students:

| Method | Ref | ImageNet top-1 |
|---|---|---|
| Sup. on ImageNet | - | 69.8 (L5) |
| InstDisc [57] | [57] | 44.5 (L5) |
| LA* [67] | [67] | 52.8 (L5) |
| MoCo | - | 54.5 (L5) |
| Ours-2q (from R50) | - | 61.7 (L5) |
| Ours-1q (from R50) | - | 62.6 (L5) |

ResNet-50 students:

| Method | Ref | ImageNet top-1 |
|---|---|---|
| Sup. on ImageNet | - | 76.2 (L5) |
| InstDisc [57] | [57] | 54.0 (L5) |
| CF-Jigsaw [58] | [58] | 55.2 (L4) |
| CF-RotNet [58] | [58] | 56.1 (L4) |
| LA* [67] | [67] | 60.2 (L5) |
| SeLa [60] | [60] | 61.5 (L5) |
| PIRL [35] | [35] | 63.6 (L5) |
| SimCLR [9] | [9] | 69.3 (L5) |
| MoCo [10] | [10] | 71.1 (L5) |
| InfoMin [51] | [51] | 73.0 (L5) |
| BYOL [22] | [22] | 74.3 (L5) |
| SwAV [8] | [8] | 75.3 (L5) |
| Ours-2q (from R50x4) | - | 71.0 (L5) |
| Ours-1q (from R50x4) | - | 71.9 (L5) |

Table 6: Evaluation of ResNet-50 features on a smaller set of ImageNet: ResNet-50x4 is used as the teacher. Unlike other methods, which fine-tune the whole network, we only train the last layer. Interestingly, despite fine-tuning fewer parameters, our method achieves better results on the 1% dataset. This demonstrates that our method can produce more data-efficient models. * denotes concurrent methods.

| Method | Top-1 (1%) | Top-1 (10%) | Top-5 (1%) | Top-5 (10%) |
|---|---|---|---|---|
| Supervised | 25.4 | 56.4 | 48.4 | 80.4 |
| InstDisc [57] | - | - | 39.2 | 77.4 |
| PIRL [35] | - | - | 57.2 | 83.8 |
| SimCLR [9] | 48.3 | 65.6 | 75.5 | 87.8 |
| BYOL* [22] | 53.2 | 68.8 | 78.4 | 89.0 |
| SwAV* [8] | 53.9 | 70.2 | 78.5 | 89.9 |
| Only the linear layer is trained: | | | | |
| Ours-2q | 57.8 | 66.3 | 80.4 | 87.0 |
| Ours-1q | 59.7 | 67.0 | 82.3 | 87.5 |
Table 7: Transferring to PASCAL-VOC classification and detection tasks: All models use AlexNet, and ours is compressed from ResNet-50x4. Our model is on par with the ImageNet supervised model. For classification, we denote the fine-tuned layers in parentheses. For detection, all layers are finetuned. * denotes the bigger AlexNet of [60].

| Method | Cls. (mAP) | Det. (mAP) |
|---|---|---|
| Supervised on ImageNet | 79.9 (all) | 59.1 |
| Random Rescaled [60] | 56.6 (all) | 45.6 |
| Context* [13] | 65.3 (all) | 51.1 |
| Jigsaw [36] | 67.6 (all) | 53.2 |
| Counting [37] | 67.7 (all) | 51.4 |
| CC+vgg-Jigsaw++ [38] | 72.5 (all) | 56.5 |
| Rotation [19] | 73.0 (all) | 54.4 |
| DeepCluster* [7] | 73.7 (all) | 55.4 |
| RFDecouple* [17] | 74.7 (all) | 58.0 |
| SeLa+Rot* [60] | 77.2 (all) | 59.2 |
| MoCo [24] | 71.3 (fc8) | 55.8 |
| Ours-2q | 79.7 (fc8) | 58.1 |
| Ours-1q | 76.2 (fc8) | 59.3 |

Figure 3: Ablation and qualitative results: We show the effect of varying the temperature in (a) and the memory bank size in (b), using ResNet-18 distilled from cached features of MoCo ResNet-50. In (c), we show randomly selected images from randomly selected clusters for our best AlexNet model. Each row is a cluster. This is done without cherry-picking or manual inspection. Note that most rows are aligned with semantic categories. We have more of these examples in the Appendix.

Size of memory bank: Intuitively, a larger number of anchor points should capture more details about the geometry of the teacher's embedding, thus resulting in a student that approximates the teacher more closely. We validate this in Figure 3(b), where a larger memory bank results in a more accurate student. When coupled with a small temperature, the large memory bank can help find anchor points that are closer to a query point, thus accurately depicting its close neighborhood.

Effect of momentum parameter: We evaluate various momentum parameters [24] in the range (0.999, 0.7, 0.5, 0) and get NN accuracies of (47.35%, 47.45%, 47.40%, 47.34%), respectively. Interestingly, unlike [24], we do not see any reduction in accuracy when removing the momentum. The cause deserves further investigation. Note that momentum is only applicable to the Ours-2q method.

Effect of caching the teacher features: We study the effect of caching the features of the whole training data when compressing ResNet-50 to ResNet-18 using all ImageNet training data. We find that caching reduces the accuracy by only a small margin, from 53.4% to 53.0% on NN and from 61.7% to 61.2% on linear evaluation, while reducing the running time by a factor of almost 3. Hence, for all experiments using ResNet-50x4, we cache the teacher, as we cannot afford not to.

5 Conclusion

We introduce a simple compression method to train SSL models using deeper SSL teacher models. Our model outperforms the supervised counterpart on the same task of ImageNet classification. This is interesting, as the supervised model has access to strictly more information (labels). Obviously, we do not conclude that our SSL method works better than supervised models in general. We simply compare with the supervised AlexNet trained with a cross-entropy loss, which is standard in the SSL literature. One can use a more advanced supervised training, e.g., compressing a supervised ResNet-50x4 to AlexNet, to get much better performance for the supervised model.

Acknowledgment: This material is based upon work partially supported by the United States Air Force under Contract No. FA8750-19-C-0098, funding from SAP SE, and also NSF grant number 1845216.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the United States Air Force, DARPA, and other funding agencies. Moreover, we would like to thank Vipin Pillai and Erfan Noury for the valuable initial discussions. We also acknowledge the fruitful comments by all reviewers, specifically Reviewer 2 for suggesting to use the teacher's queue for the student, which improved our results.

Broader Impact

Ethical concerns of AI: Most AI algorithms can be exploited for non-ethical applications. Unfortunately, our method is not an exception. For instance, rich self-supervised features may enable harmful surveillance applications.

AI for all: Model compression reduces the computation needed at inference time, and self-supervised learning reduces the annotation needed in training. Both of these benefits may make rich deep models accessible to a larger community that does not have access to expensive computation and labeling resources.

Privacy and edge computation: Model compression enables running deep models on devices with limited computational and power resources, e.g., IoT devices. This reduces privacy issues, since the data does not need to be uploaded to the cloud. Moreover, compressing self-supervised learning models can be even better in this sense, since a small model, e.g., MobileNet, that generalizes well to new tasks can be finetuned on the device itself, so even the finetuning data does not need to be uploaded to the cloud.

References

[1] A library for efficient similarity search and clustering of dense vectors. https://github.com/facebookresearch/faiss.
[2] Sungsoo Ahn et al. Variational information distillation for knowledge transfer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 9163–9171.
[3] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In: Advances in Neural Information Processing Systems. 2014, pp. 2654–2662.
[4] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In: Advances in Neural Information Processing Systems. 2019, pp. 15509–15519.
[5] Hessam Bagherinezhad et al. Label refinery: Improving ImageNet classification through label progression. In: arXiv preprint arXiv:1805.02641 (2018).
[6] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2006, pp. 535–541.
[7] Mathilde Caron et al. Deep clustering for unsupervised learning of visual features. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 132–149.
[8] Mathilde Caron et al. Unsupervised learning of visual features by contrasting cluster assignments. In: arXiv preprint arXiv:2006.09882 (2020).
[9] Ting Chen et al. A simple framework for contrastive learning of visual representations. In: arXiv preprint arXiv:2002.05709 (2020).
[10] Xinlei Chen et al. Improved baselines with momentum contrastive learning. In: arXiv preprint arXiv:2003.04297 (2020).
[11] Contrastive Multiview Coding. https://github.com/HobbitLong/CMC.
[12] Contrastive Representation Distillation (CRD), and benchmark of recent knowledge distillation methods. https://github.com/HobbitLong/RepDistiller.
[13] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 1422–1430.
[14] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In: International Conference on Learning Representations (ICLR). 2016.
[15] Alexey Dosovitskiy et al. Discriminative unsupervised feature learning with convolutional neural networks. In: Advances in Neural Information Processing Systems. 2014, pp. 766–774.
[16] Mark Everingham et al. The PASCAL visual object classes (VOC) challenge. In: International Journal of Computer Vision 88.2 (2010), pp. 303–338.
[17] Zeyu Feng, Chang Xu, and Dacheng Tao. Self-supervised representation learning by rotation feature decoupling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 10364–10374.
[18] Tommaso Furlanello et al. Born-again neural networks. In: Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018. Ed. by Jennifer G. Dy and Andreas Krause. Vol. 80. Proceedings of Machine Learning Research. PMLR, 2018, pp. 1602–1611. URL: http://proceedings.mlr.press/v80/furlanello18a.html.
[19] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In: International Conference on Learning Representations. 2018. URL: https://openreview.net/forum?id=S1v4N2l0-.
[20] Ross Girshick. Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 1440–1448.
[21] Priya Goyal et al. Scaling and benchmarking self-supervised visual representation learning. In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 6391–6400.
[22] Jean-Bastien Grill et al. Bootstrap your own latent: A new approach to self-supervised learning. In: arXiv preprint arXiv:2006.07733 (2020).
[23] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06). Vol. 2. IEEE. 2006, pp. 1735–1742.
[24] Kaiming He et al. Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 9729–9738.
[25] Olivier J Hénaff et al. Data-efficient image recognition with contrastive predictive coding. In: arXiv preprint arXiv:1905.09272 (2019).
[26] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In: arXiv preprint arXiv:1503.02531 (2015).
[27] R Devon Hjelm et al. Learning deep representations by mutual information estimation and maximization. In: International Conference on Learning Representations. 2019. URL: https://openreview.net/forum?id=Bklr3j0cKX.
[28] Jiabo Huang et al. Unsupervised deep learning by neighbourhood discovery. In: Proceedings of the 36th International Conference on Machine Learning. Ed. by Kamalika Chaudhuri and Ruslan Salakhutdinov. Vol. 97. Proceedings of Machine Learning Research. Long Beach, California, USA: PMLR, 2019, pp. 2849–2858. URL: http://proceedings.mlr.press/v97/huang19b.html.
[29] Simon Jenni and Paolo Favaro. Self-supervised feature learning by learning to spot artifacts. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 2733–2742.
[30] Xu Ji, João F Henriques, and Andrea Vedaldi. Invariant information clustering for unsupervised image classification and segmentation. In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 9865–9874.
[31] Jangho Kim, SeongUk Park, and Nojun Kwak. Paraphrasing complex network: Network compression via factor transfer. In: Advances in Neural Information Processing Systems. 2018, pp. 2760–2769.
[32] Jonathan Krause et al. 3D object representations for fine-grained categorization. In: Proceedings of the IEEE International Conference on Computer Vision Workshops. 2013, pp. 554–561.
[33] Harold W Kuhn. The Hungarian method for the assignment problem. In: Naval Research Logistics Quarterly 2.1-2 (1955), pp. 83–97.
[34] Tomasz Malisiewicz, Abhinav Gupta, and Alexei A Efros. Ensemble of exemplar-SVMs for object detection and beyond. In: 2011 International Conference on Computer Vision. IEEE. 2011, pp. 89–96.
[35] Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In: arXiv preprint arXiv:1912.01991 (2019).
[36] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In: European Conference on Computer Vision. Springer. 2016, pp. 69–84.
[37] Mehdi Noroozi, Hamed Pirsiavash, and Paolo Favaro. Representation learning by learning to count. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5898–5906.
[38] Mehdi Noroozi et al. Boosting self-supervised learning via knowledge transfer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 9359–9367.
[39] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. 2018. arXiv: 1807.03748 [cs.LG].
[40] Wonpyo Park et al. Relational knowledge distillation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 3967–3976.
[41] Nikolaos Passalis and Anastasios Tefas. Learning deep representations with probabilistic knowledge transfer. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 268–284.
[42] Deepak Pathak et al. Context encoders: Feature learning by inpainting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 2536–2544.
[43] Baoyun Peng et al. Correlation congruence for knowledge distillation. In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 5007–5016.
[44] PyTorch implementation of MoCo: https://arxiv.org/abs/1911.05722. https://github.com/facebookresearch/moco.
[45] Adriana Romero et al. FitNets: Hints for thin deep nets. In: 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2015. URL: http://arxiv.org/abs/1412.6550.
[46] Olga Russakovsky et al. ImageNet large scale visual recognition challenge. In: International Journal of Computer Vision (IJCV) 115.3 (2015), pp. 211–252. DOI: 10.1007/s11263-015-0816-y.
[47] SimCLR - A simple framework for contrastive learning of visual representations, https://arxiv.org/abs/2002.05709. https://github.com/google-research/simclr.
[48] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In: Advances in Neural Information Processing Systems. 2017, pp. 1195–1204.
[49] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In: arXiv preprint arXiv:1906.05849 (2019).
[50] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. In: International Conference on Learning Representations. 2020. URL: https://openreview.net/forum?id=SkgpBJrtvS.
[51] Yonglong Tian et al. What makes for good views for contrastive learning. In: arXiv preprint arXiv:2005.10243 (2020).
[52] Torchvision models. https://pytorch.org/docs/stable/torchvision/models.html.
[53] Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 1365–1374.
[54] Unsupervised learning of visual features by contrasting cluster assignments. https://github.com/facebookresearch/swav.
[55] Peter Welinder et al. Caltech-UCSD Birds 200. 2010.
[56] Ancong Wu et al. Distilled person re-identification: Towards a more scalable system. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2019.
[57] Zhirong Wu et al. Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 3733–3742.
[58] Xueting Yan et al. ClusterFit: Improving generalization of visual representations. In: CVPR. 2020.
[59] Junho Yim et al. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 4133–4141.
[60] Asano YM., Rupprecht C., and Vedaldi A. Self-labelling via simultaneous clustering and representation learning. In: International Conference on Learning Representations. 2020. URL: https://openreview.net/forum?id=Hyx-jyBFPr.
[61] Lu Yu et al. Learning metrics from teachers: Compact networks for image embedding. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2019.
[62] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In: ICLR. 2017. URL: https://arxiv.org/abs/1612.03928.
[63] Liheng Zhang et al. AET vs. AED: Unsupervised representation learning by auto-encoding transformations rather than data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 2547–2555.
[64] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In: European Conference on Computer Vision. Springer. 2016, pp. 649–666.
[65] Richard Zhang, Phillip Isola, and Alexei A Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 1058–1067.
[66] Bolei Zhou et al. Learning deep features for scene recognition using places database. In: Advances in Neural Information Processing Systems. 2014, pp. 487–495.
[67] Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Local aggregation for unsupervised learning of visual embeddings. In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 6002–6012.