Published as a conference paper at ICLR 2022

EQUIVARIANT CONTRASTIVE LEARNING

Rumen Dangovski (MIT EECS) rumenrd@mit.edu · Li Jing (Facebook AI Research) ljng@fb.com · Charlotte Loh (MIT EECS) cloh@mit.edu · Seungwook Han (MIT-IBM Watson AI Lab) sh3264@columbia.edu · Akash Srivastava (MIT-IBM Watson AI Lab) akashsri@mit.edu · Brian Cheung (MIT CSAIL & BCS) cheungb@mit.edu · Pulkit Agrawal (MIT CSAIL) pulkitag@mit.edu · Marin Soljačić (MIT Physics) soljacic@mit.edu

ABSTRACT

State-of-the-art self-supervised learning (SSL) pre-training produces semantically good representations by encouraging them to be invariant under meaningful transformations prescribed from human knowledge. In fact, the property of invariance is a trivial instance of a broader class called equivariance, which can be intuitively understood as the property that representations transform according to the way the inputs transform. Here, we show that rather than using only invariance, pre-training that encourages non-trivial equivariance to some transformations, while maintaining invariance to other transformations, can be used to improve the semantic quality of representations. Specifically, we extend popular SSL methods to a more general framework which we name Equivariant Self-Supervised Learning (E-SSL). In E-SSL, a simple additional pre-training objective encourages equivariance by predicting the transformations applied to the input. We demonstrate E-SSL's effectiveness empirically on several popular computer vision benchmarks, e.g. improving SimCLR to 72.5% linear probe accuracy on ImageNet. Furthermore, we demonstrate the usefulness of E-SSL for applications beyond computer vision; in particular, we show its utility on regression problems in photonics science. Our code, datasets and pre-trained models are available at https://github.com/rdangovs/essl to aid further research in E-SSL.

1 INTRODUCTION

Human knowledge about what makes a good representation and the abundance of unlabeled data have enabled the learning of useful representations via self-supervised learning (SSL) pretext tasks. State-of-the-art SSL methods encourage the representations not to contain information about the way the inputs are transformed, i.e. to be invariant to a set of manually chosen transformations. One such method is contrastive learning, which sets up a binary classification problem to learn invariant features. Given a set of data points (say images), different transformations of the same data point constitute positive examples, whereas transformations of other data points constitute the negatives (He et al., 2020; Chen et al., 2020). Beyond contrastive learning, many SSL methods also rely on learning representations by encouraging invariance (Grill et al., 2020; Chen & He, 2021; Caron et al., 2021; Zbontar et al., 2021). Here, we refer to such methods as Invariant-SSL (I-SSL). The natural question in I-SSL is to what transformations the representations should be insensitive (Chen et al., 2020; Tian et al., 2020; Xiao et al., 2020). Chen et al. (2020) highlighted the importance of transformations and empirically evaluated which transformations are useful for contrastive learning (e.g., see Figure 5 in their paper). Some transformations, such as four-fold rotations, despite preserving semantic information, were shown to be harmful for contrastive learning. This does not mean that four-fold rotations are not useful for I-SSL at all.
In fact, predicting four-fold rotations is a good proxy task for evaluating the representations produced with contrastive learning (Reed et al., 2021). Furthermore, instead of being insensitive to rotations (invariance), training a neural network to predict them, i.e. to be sensitive to four-fold rotations, results in good image representations (Gidaris et al., 2018; 2019). These results indicate that the choice of making features sensitive or insensitive to a particular group of transformations can have a substantial effect on the performance of downstream tasks. However, prior work in SSL has exclusively focused on being either entirely insensitive (Grill et al., 2020; Chen & He, 2021; Caron et al., 2021; Zbontar et al., 2021) or entirely sensitive (Agrawal et al., 2015; Doersch et al., 2015; Zhang et al., 2016; Noroozi & Favaro, 2016; Gidaris et al., 2018) to a set of transformations. In particular, the I-SSL literature has proposed to simply remove transformations that hurt performance when applied as invariances.

To understand how sensitivity/insensitivity to a particular transformation affects the resulting features, we ran a series of experiments summarized in Figure 1. We trained and tested a simple I-SSL baseline, SimCLR (Chen et al., 2020), on CIFAR-10 using only the random resized cropping transformation (solid yellow line). The test accuracy is calculated as the retrieval accuracy of a k-nearest neighbors (kNN) classifier with a memory bank consisting of the representations of the training set obtained after pre-training for 800 epochs. Next, in addition to being invariant to resized cropping, we additionally encouraged the model to be either sensitive (shown in pink) or insensitive (shown in blue) to a second transformation. We encourage insensitivity by adding the transformation to the SimCLR data augmentation, and sensitivity by predicting it (see Section 4). We varied the choice of this second transformation. We found that for some transformations, such as horizontal flips and grayscale, insensitivity results in better features, but it is detrimental for other transformations, such as four-fold rotations, vertical flips, 2x2 jigsaws (4! = 24 classes), four-fold Gaussian blurs (4 levels of blurring) and color inversions. When we encourage sensitivity to these transformations, the trend is reversed. In summary, we observe that if invariance to a particular transformation hurts feature learning, then imposing sensitivity to the same transformation may improve performance. This leads us to conjecture that instead of choosing the features to be only invariant or only sensitive as done in prior work, it may be possible to learn better features by imposing invariance to certain transformations (e.g., cropping) and sensitivity to other transformations (e.g., four-fold rotations).

The concepts of sensitivity and insensitivity are both captured by the mathematical idea of equivariance (Agrawal et al., 2015; Jayaraman & Grauman, 2015; Cohen & Welling, 2016). Let G be a group of transformations. For any g ∈ G, let T_g(x) denote the function with which g transforms an input image x. For instance, if G is the group of four-fold rotations then T_g(x) rotates the image x by a multiple of π/2. Let f be the encoder network that computes the feature representation f(x). I-SSL encourages the property of invariance to G, which states f(T_g(x)) = f(x), i.e. the output representation f(x) does not vary with T_g. Equivariance, a generalization of invariance, is defined as ∀x: f(T_g(x)) = T'_g(f(x)), where T'_g is a fixed transformation (i.e., without any parameters). Intuitively, equivariance encourages the feature representation to change in a well-defined manner in response to the transformation applied to the input. Thus, invariance is a trivial instance of equivariance, where T'_g is the identity function, i.e. T'_g(f(x)) = f(x). While there are many possible choices for T'_g (Cohen & Welling, 2016; Bronstein et al., 2021), I-SSL uses only the trivial choice that encourages f to be insensitive to G. In contrast, if T'_g is not the identity, then f will be sensitive to G and we say that the equivariance to G is non-trivial.
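As a concrete, purely illustrative reading of these definitions, the following sketch (our own code, not from the paper) measures how far an encoder is from satisfying f(T_g(x)) = T'_g(f(x)) for four-fold rotations; passing the identity action reduces it to a check of invariance, the trivial case:

```python
import torch

def rotate4(x, g):
    """T_g: rotate a batch of NCHW images by g * 90 degrees, g in {0, 1, 2, 3}."""
    return torch.rot90(x, k=g, dims=(2, 3))

@torch.no_grad()
def equivariance_gap(f, x, feature_action):
    """Mean distance between f(T_g(x)) and T'_g(f(x)) over the group.

    feature_action(z, g) is a candidate fixed action T'_g on features;
    with the identity action this becomes an invariance measure.
    """
    gaps = []
    for g in range(4):
        lhs = f(rotate4(x, g))         # representation of the transformed input
        rhs = feature_action(f(x), g)  # transformed representation
        gaps.append((lhs - rhs).norm(dim=1).mean())
    return torch.stack(gaps).mean()

identity_action = lambda z, g: z       # T'_g = identity (trivial equivariance)
```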
Therefore, in order to encourage potentially more useful equivariance properties, we generalize SSL to an Equivariant Self-Supervised Learning (E-SSL) framework. In our experiments on standard computer vision data, such as the small-scale CIFAR-10 (Torralba et al., 2008; Krizhevsky, 2009) and the large-scale ImageNet (Deng et al., 2009), we show that extending I-SSL to E-SSL by also predicting four-fold rotations improves the semantic quality of the representations. We show that this approach works for other transformations too, such as vertical flips, 2x2 jigsaws, four-fold Gaussian blurs and color inversions, but focus on four-fold rotations as the most promising improvement we obtain with the initial E-SSL experiments in Figure 1. We also note that the applications of E-SSL in this paper are task specific, meaning that the representations from E-SSL may work best for a particular downstream task that benefits from equivariance dictated by the available data.

E-SSL can be further extended to applications in science; in particular, we focus on predictive tasks using (unlabelled and labelled) data collected via experiments or simulations. The downstream tasks in prediction problems in science are often fixed and can be aided by incorporating scientific insights. Here, we also explore the generality of E-SSL beyond computer vision, on a different application: regression problems in photonics science, and demonstrate examples where E-SSL is effective over I-SSL.

Figure 1: SSL representations should be encouraged to be either insensitive or sensitive to transformations (kNN accuracy in % for insensitive vs. sensitive variants of horizontal flips, grayscale, four-fold rotations, vertical flips, 2x2 jigsaws, four-fold blurs and color inversions). The baseline is SimCLR with random resized cropping only. Each transformation on the horizontal axis is combined with random resized cropping. The dataset is CIFAR-10 and the kNN accuracy is on the test set. More experimental details can be found in Section 4.

Our contributions can be summarized as follows:

- We introduce E-SSL, a generalization of popular SSL methods that highlights the complementary nature of invariance and equivariance. To our knowledge, we are the first to create a method that benefits from such complementarity.
- We improve state-of-the-art SSL methods on CIFAR-10 and ImageNet by encouraging equivariance to four-fold rotations. We also show that E-SSL is more general and works for many other transformations, previously unexplored in related works.
- We demonstrate the usefulness of E-SSL beyond computer vision with experiments on regression problems in photonics science. We also show that our method works both for finite and infinite groups.
The rest of the paper is organized as follows. In Section 2 we elaborate on related work. In Section 3 we introduce our experimental method for E-SSL. In Section 4 we present our main experiments in computer vision. In Section 5 we provide a discussion around our work that extends our study beyond computer vision. Beginning from Appendix A, we provide more details behind our findings and discuss several potential avenues of future work.

2 RELATED WORK

To encourage non-trivial equivariance, we observe that a simple task that predicts the synthetic transformation applied to the input works well and improves I-SSL already; some prediction tasks create representations that can be transferred to other tasks of interest, such as classification, object detection and segmentation. While prediction tasks alone have been realized successfully before in SSL (Agrawal et al., 2015; Doersch et al., 2015; Zhang et al., 2016; Misra et al., 2016; Noroozi & Favaro, 2016; Zamir et al., 2016; Lee et al., 2017; Mundhenk et al., 2018; Gidaris et al., 2018; Zhang et al., 2019; Zhang, 2020), to our knowledge we are the first to combine simple predictive objectives for synthetic transformations with I-SSL and successfully improve the semantic quality of representations. We found that the notion of equivariance captures the generality of our method.

To improve representations with pretext tasks, Gidaris et al. (2018) use four-fold rotations prediction as a pretext task for learning useful visual representations via a new model named RotNet. Feng et al. (2019) learn decoupled representations: one part trained with four-fold rotations prediction and another with non-parametric instance discrimination (Wu et al., 2018) and invariance to four-fold rotations. Yamaguchi et al. (2021) use a joint training objective between four-fold rotations prediction and image enhancement prediction. Xiao et al. (2020) propose to learn representations as follows: for each atomic augmentation from the contrastive learning augmentation policy, they leave it out and project to a new space on which I-SSL encourages invariance to all augmentations but the left-out one. The resulting representation can either be a concatenation of all projected left-out views' representations, or the representation in the shared space, before the individual projections.

Figure 2: The E-SSL framework. Left: framework, with transformations acting as levers between insensitive and sensitive. Right: methods. Insensitive methods include MoCo (He et al., 2020), SimCLR (Chen et al., 2020), BYOL (Grill et al., 2020), BT (Zbontar et al., 2021), DINO (Caron et al., 2021), etc.; sensitive methods include Egomotion (Agrawal et al., 2015), Context (Doersch et al., 2015), Colorization (Zhang et al., 2016), Jigsaw (Noroozi & Favaro, 2016), RotNet (Gidaris et al., 2018), etc. Egomotion, Context, Colorization and Jigsaw use transformations other than rotations, but their pattern looks like RotNet's. Likewise, E-SSL can use transformations different from rotations.

Our method differs from the above contributions in that E-SSL is the only hybrid framework that encourages both insensitive representations for some transformations and sensitive representations for others, and it does not require representations to be sensitive and insensitive to a particular transformation at the same time. Thus, what distinguishes our work is the complementary nature of invariance and equivariance for multiple transformations, including finite and infinite groups.
To obtain performance gains from transformations, Tian et al. (2020) study which transformations are best for contrastive learning through the lens of mutual information. Reed et al. (2021) use four-fold rotations prediction as an evaluation measure to tune optimal augmentations for contrastive learning. Wang & Qi (2021) use strong augmentations to improve contrastive learning by matching the distributions of strongly and weakly augmented views' representation similarities to a memory bank. Wang et al. (2021) provide an effective way to bridge transformation-insensitive and transformation-sensitive approaches in self-supervised learning methods via residual relaxation. A growing body of work encourages invariance to domain-agnostic transformations (Tamkin et al., 2021; Lee et al., 2021; Verma et al., 2021) or strengthens invariance with regularization (Foster et al., 2021). Our framework is different from the above works, because we work with transformations that encourage equivariance beyond invariance.

To understand and improve equivariant properties of neural networks, Lenc & Vedaldi (2015) study emerging equivariant properties of neural networks, and Cohen & Welling (2016) and Bronstein et al. (2021) construct equivariant neural networks. In contrast, our work does not enforce strict equivariance, but only encourages equivariant properties for the encoder network through the choice of the loss function. While strict equivariance is concerned with groups, some of the transformations we consider, such as random resized cropping and Gaussian blurs, may not even form groups, but they can still be analyzed in the E-SSL framework. Thus, ours is a flexible framework, which allows us to consider a variety of transformations and how the encoder might exhibit equivariant properties to them.

3 METHOD

Our method is designed to test our primary conjecture: that a hybrid approach of sensitive and insensitive representations learns better features. Surprisingly, this hybrid approach is not yet present in SSL, as Figure 2 illustrates. In this figure, we can view transformations in SSL as levers. Each downstream task has an optimal configuration of the levers, which should be tuned in the SSL objective: left for insensitive and right for sensitive representations, e.g., make representations insensitive to horizontal flips and grayscale and sensitive to four-fold rotations, vertical flips, 2x2 jigsaws, Gaussian blurs or color inversions. Formally, insensitive and sensitive features correspond to trivial and regular group representations, respectively. Here, we present an effective method to achieve this control.

Let f(·; θ_f) with trainable parameters θ_f be a backbone encoder. Analogously, let p_1(·; θ_{p_1}) be a projector network for the I-SSL loss. (There might be an extra prediction head with its own parameters, depending on the objective, which we omit for simplicity.) Let p_2(·; θ_{p_2}) be the predictor network for encouraging sensitivity, which we call the predictor for equivariance. We share the backbone encoder f jointly between I-SSL and the objective of predicting the transformations from the backbone representations. Let ℓ_I-SSL be the I-SSL loss and ℓ_E-SSL be the added E-SSL loss that encourages sensitivity to a particular transformation.

Figure 3: Sketch of E-SSL with four-fold rotations prediction, resulting in a backbone that is sensitive to rotations and insensitive to flips and blurring. Augmented views pass through the backbone f and the projector p_1 (invariance); rotated prediction views pass through the backbone f and the predictor p_2 (equivariance). ImageNet example n01534433:169.
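A minimal PyTorch-style sketch of these three components (our own illustration, not the authors' code; the LayerNorm + ReLU structure and hidden width of p_2 follow the choices reported in Section 4.1, while the projector p_1 shape and the feature dimension are assumptions):

```python
import torch.nn as nn

class ESSLModel(nn.Module):
    """Shared backbone f with a projector p1 (invariance) and a
    predictor-for-equivariance p2 (here: 4-way rotation logits)."""
    def __init__(self, backbone, feat_dim=512, hidden=2048, n_classes=4):
        super().__init__()
        self.f = backbone                      # e.g. a ResNet without its fc head
        self.p1 = nn.Sequential(               # projector for the I-SSL loss
            nn.Linear(feat_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        self.p2 = nn.Sequential(               # predictor for equivariance
            nn.Linear(feat_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),      # logits over the group G
        )

    def forward(self, views, rotated_views):
        z = self.p1(self.f(views))               # fed to the invariance loss
        logits = self.p2(self.f(rotated_views))  # fed to the equivariance loss
        return z, logits
```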
Let the parameter λ be the strength of the E-SSL loss. The optimization objective for an image x with views {x'} in the batch is given as follows:

argmin_{θ_f, θ_{p_1}, θ_{p_2}}  ℓ_{I-SSL}(p_1(f({x'}; θ_f); θ_{p_1})) + λ E_{g∈G}[PredictionLoss(g, p_2(f(T_g(x'); θ_f); θ_{p_2}))],    (1)

where ℓ_E-SSL (the expectation in the second summand) can take either one or all of the views, but we take only one for simplicity. The goal of ℓ_E-SSL is to predict g from the representation p_2(f(T_g(x'); θ_f); θ_{p_2}), which encourages equivariance to the group of transformations G. The PredictionLoss can be either a cross-entropy loss for finite groups or an L1/MSE loss for infinite groups. In practice we replace the expectation with an unbiased estimate. Most of our experiments in this paper focus on finite groups, but we show one example for an infinite group in Appendix F.

E-SSL can be constructed for any semantically meaningful transformation (see, for example, Figure 1). From Figure 1 we choose four-fold rotations as the most promising transformation and fix it for the upcoming section. As a minor motivation, we also present empirical results about the similarities between four-fold rotations prediction and I-SSL in Appendix C. In particular, both tasks benefit from the same data augmentation. Figure 3 sketches how our construction works for predicting four-fold rotations. In particular, we sample each of the 4 possible rotations uniformly and use the cross-entropy loss as the PredictionLoss in Equation 1.

What transformations could work for E-SSL? A common property of the successful transformations we have studied up to this point is that they form groups in the mathematical sense, i.e. (i) each transformation is invertible, (ii) the composition of two transformations is part of the set of transformations and (iii) compositions are associative. In this paper, we encourage equivariance to a group of transformations by predicting them. This does not guarantee that the encoder we learn will be strictly equivariant to the group. In practice we observe that invariance and equivariance are well encouraged by the training objectives we use (see Appendix D.3 for a detailed analysis). In fact, even strict equivariance is possible, i.e. there exists an encoder that is non-trivially equivariant, under a reasonable assumption which is formulated as follows. Let X be the set of all images. Let G be a group whose elements g ∈ G transform X via the function T_g : X → X. Let X' = {T_g(x) | g ∈ G, x ∈ X} be the set of all transformed images. Let f(·; θ): X' → S be an encoder network that we learn with parameters θ. We write f(·) ≡ f(·; θ) for simplicity. Finally, let S = {f(x') | x' ∈ X'} be the set of all representations of the images in X'. The following is our statement.

Proposition 1 (Non-trivial Equivariance). Given T_g : X → X for the group G, there exists an encoder f : X' → S that is non-trivially equivariant to the group G, under the assumption that if f(T_g(x)) = f(T_{g'}(x')) then g = g' and x = x' for all g, g' ∈ G and x, x' ∈ X.
We defer the proof to Appendix B. The significance of this proof is that it explicitly constructs a non-trivially equivariant encoder network for the group G if the assumption is satisfied. The intuition of the assumption is that if the representations of two transformed inputs are the same, the inputs should coincide, and likewise the transformations. More formally, the assumption reflects the condition that the dataset contains only one element of each group orbit. We speculate that satisfying this assumption is reasonable for the datasets in this work, since we observe a natural setting of the data, e.g. horizontal mirror symmetry in Gpm, and we consider transformations that disturb this natural setting. In Appendix G we also show that E-SSL is crucial for the Flowers-102 dataset, for which this assumption might be less clear. In Appendix H we also present a natural modification of E-SSL for scenarios where the assumption is violated.

Could other transformations still help? To motivate our work, in Figure 1 we observed additional transformations that could be useful, such as vertical flips, 2x2 jigsaws, four-fold Gaussian blurs and color inversions. All of these transformations are groups, except for four-fold Gaussian blurs: each Gaussian blur is invertible (de-blurring), but the inverse is not a transformation in the set. Interestingly, we observe that four-fold Gaussian blurs still improve the baseline, which means the success of E-SSL may not be limited to groups. We might also consider combining the prediction of multiple transformations to encourage sensitivity to all of them. However, the gains we saw in Figure 1 may not add up when we combine transformations, because they may not be independent. The gains may also depend on the transformations that we choose for I-SSL. While we see combinations of transformations as promising future work, we focus on a single transformation to make a clear presentation of E-SSL.

4 EXPERIMENTS

4.1 EXPERIMENTAL SETUP

CIFAR-10 setup. We use the CIFAR-10 experimental setup from (Chen & He, 2021). We consider two simple I-SSL methods: SimCLR (with the InfoNCE loss (Oord et al., 2018) and temperature 0.5) and SimSiam (Chen & He, 2021). We were able to obtain baseline results close to those in (Chen & He, 2021). The predictor for equivariance takes a smaller crop of size 16x16. We report performance on the standard linear probe. We tune λ to 0.4 both for SimCLR and SimSiam (full tuning in Table 6 in Appendix D). Remaining experimental details can be found in Appendix D.

ImageNet setup. We use the original augmentation setting for each method. The predictor for equivariance takes a smaller crop of size 96x96. We use a ResNet-50 (He et al., 2016) backbone for each method. In terms of optimizer and batch size settings, we follow the standard training recipe for each method. For our SimCLR experiments we use a slightly more optimal implementation that uses BYOL's augmentations (i.e. it includes solarization), initializes the ResNet with zero BatchNorm weights and uses the InfoNCE loss with temperature 0.2.

Photonic-crystals setup. Photonic crystals (PhCs) are periodically-structured materials engineered for wide-ranging applications by manipulating light waves (Yablonovitch, 1987; Joannopoulos et al., 2008). The density-of-states (DOS) is often used as a design metric to engineer the desired properties of these crystals, and thus here we consider the regression task of predicting the DOS of PhCs. Examples of this dataset are depicted in Section 5 and further details can be found in Appendix F. The use of symmetry or invariance knowledge is common in scientific problems; here, the DOS labels are invariant to several physical transformations of the unit cell, namely rolling translations (due to its periodicity), operations arising from the symmetry group (C4v) of the square lattice, i.e. rotations and mirror flips, and refractive scaling. We construct an encoder network comprising simple convolutional and fully-connected layers (see Appendix F) and create various synthetic datasets to investigate encouragement of equivariance. After SSL/E-SSL, we fine-tune the network with an L1 loss; for better interpretability of prediction accuracies, we use a relative error metric (Liu et al., 2018; Loh et al., 2021) for evaluation, given by ℓ_DOS = (Σ_ω |DOS_pred − DOS|) / (Σ_ω DOS), reported in %. We defer the results to Section 5, because of the novelty of the experimental setup.
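A direct transcription of this metric (our own sketch; the array names are hypothetical, and the DOS vectors are assumed to be sampled on a shared frequency grid ω):

```python
import numpy as np

def dos_relative_error(dos_pred, dos_true):
    """Relative DOS error in percent:
    l_DOS = sum_w |DOS_pred - DOS| / sum_w DOS."""
    dos_pred = np.asarray(dos_pred, dtype=float)
    dos_true = np.asarray(dos_true, dtype=float)
    return 100.0 * np.abs(dos_pred - dos_true).sum() / dos_true.sum()
```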
rotations and mirror flips, and refractive scaling. We construct an encoder network comprising of simple convolutional and fully-connected layers (see Appendix F) and create various synthetic datasets to investigate encouragement of equivariance. After SSL/ E-SSL, we fine-tune the network with L1 loss; for better interpretability of prediction accuracies, we use a relative error metric (Liu et al., 2018; Loh et al., 2021) for evaluation, given by ℓDOS = (P ω DOSpred DOS )/(P ω DOS), reported in (%). We defer the results to Section 5, because of the novelty of the experimental setup. The predictor p2 for E-SSL. The predictor is a 2 layer MLP for CIFAR-10 and Photonic-crystals, and a 3 layer MLP for Image Net, followed by a linear head that produces the logits for the an n-way Published as a conference paper at ICLR 2022 Algorithm 1 Py Torch-style pseudocode for E-SSL, predicting four-fold rotations. # f: backbone encoder network # p1: projector network for I-SSL # p2: predictor network for E-SSL # ssl_loss: loss function for I-SSL # lambda: weight of the E-SSL for x in loader: # large views for SSL and small view for EE V_large = augment(x, small_crop=False) # list of views v_small = augment(x, small_crop=True) # change: crop with size=96 and scale=(0.05, 0.14) # loss loss_invariance = ssl_loss(p1(f(V_large)) labels = [0] * N + [1] * N + [2] * N + [3] * N # 4Nx1 v_cat = cat([v_small] * 4, dim=0) # 4Nx3x96x96 v_equivariance = rot90(v_cat, labels) # constructing the rotated views logits = p2(f(v_equivariance)) # 4Nx4 loss_equivariance = Cross Entropy Loss(logits, labels) # rotation prediction loss = loss_invariance + lambda * loss_equivariance # optimization step loss.backward() optimizer.step() Table 1: Linear probe accuracy (%) on CIFAR-10. Models are pre-trained for 800 epochs. Baseline results are from Appendix D in (Chen & He, 2021). Standard deviations are from 5 different random initializations for the linear head. Deviations are small because the linear probe is robust to the seed. Method Sim CLR Sim Siam (Chen et al., 2020) (Chen & He, 2021) Baseline (Chen & He, 2021) 91.1 91.8 Baseline (our reproduction) 92.0 0.0 91.6 0.0 E-SSL (ours) 94.1 0.0 94.2 0.1 Ablating E-SSL Single random rotation 93.4 0.0 ( 0.7) 92.6 0.0 ( 1.6) Linear predictor for equivariance 93.3 0.0 ( 0.8) 93.4 0.0 ( 0.8) No SSL augmentation in equivariance views 92.7 0.1 ( 1.4) 92.0 0.1 ( 2.2) Alternatives to E-SSL Disentangled representations 91.3 0.0 ( 2.7) 91.1 0.0 ( 3.1) Insensitive instead of sensitive 86.3 0.1 ( 7.8) 86.1 0.1 ( 8.1) classification (for example four-fold rotations is 4-way classification), or a single node for the continuous group experiment. The predictor s hidden dimension is shared across all layers and it equals 2048 for CIFAR-10 and Image Net and 512 for Ph C. After each linear layer, there is a Layer Normalization (Ba et al., 2016) followed by Re LU. We experimented with Batch Normalization (Ioffe & Szegedy, 2015) (with trainable affine parameters) instead of Layer Normalization, but did not observe any significant changes. For some experiments, we discovered that removing the last Re LU from the MLP improves the results slightly. In particular, for Sim Siam on CIFAR-10 and for all models on Image Net we omit the last Re LU. Finally, Algorithm 1 presents pseudocode for E-SSL with four-fold rotations on Image Net. 
In our implementation, we use a smaller resolution for the rotated images, so that we can fit all views in the same batch and have minimal overhead for pre-training (additional details in Table 9 in Appendix E).

4.2 MAIN RESULTS

CIFAR-10 results. To highlight the benefits of our method, Table 1 demonstrates the improvement we obtain by using E-SSL on top of SimCLR and SimSiam, and then shows different ablations and alternative methods. We label the E-SSL extensions E-SimCLR and E-SimSiam, respectively.

Table 1: Linear probe accuracy (%) on CIFAR-10. Models are pre-trained for 800 epochs. Baseline results are from Appendix D in (Chen & He, 2021). Standard deviations are from 5 different random initializations for the linear head; deviations are small because the linear probe is robust to the seed.

| Method | SimCLR (Chen et al., 2020) | SimSiam (Chen & He, 2021) |
| --- | --- | --- |
| Baseline (Chen & He, 2021) | 91.1 | 91.8 |
| Baseline (our reproduction) | 92.0 ± 0.0 | 91.6 ± 0.0 |
| E-SSL (ours) | 94.1 ± 0.0 | 94.2 ± 0.1 |
| Ablating E-SSL: | | |
| Single random rotation | 93.4 ± 0.0 (-0.7) | 92.6 ± 0.0 (-1.6) |
| Linear predictor for equivariance | 93.3 ± 0.0 (-0.8) | 93.4 ± 0.0 (-0.8) |
| No SSL augmentation in equivariance views | 92.7 ± 0.1 (-1.4) | 92.0 ± 0.1 (-2.2) |
| Alternatives to E-SSL: | | |
| Disentangled representations | 91.3 ± 0.0 (-2.7) | 91.1 ± 0.0 (-3.1) |
| Insensitive instead of sensitive | 86.3 ± 0.1 (-7.8) | 86.1 ± 0.1 (-8.1) |

We observe that we can increase a tuned baseline accuracy by about 2-3%. When ablating E-SSL, we see that each component of E-SSL is important; most useful is the SSL augmentation applied on top of the rotated views. We also study alternatives to E-SSL. With "Disentangled representations" we investigate whether a middle ground is optimal for E-SSL: half of the representation is trained to be insensitive to a transformation and the other half to be sensitive to the same transformation. We conducted this experiment by using four-fold rotations in I-SSL for half of the representation and E-SSL for the other half. This results in a degradation of performance, which reflects our hypothesis that the representations should be either insensitive or sensitive. Finally, making the representations "Insensitive instead of sensitive" to four-fold rotations hurts the performance significantly, as is also observed in Figure 1 and in (Chen et al., 2020; Xiao et al., 2020).

Figure 4 reveals that E-SSL is more robust to removing transformations for I-SSL or reducing the labels for training. For example, E-SimCLR and E-SimSiam with only random resized cropping obtain 83.5% and 84.6% accuracies. Encouraging sensitivity to one transformation, namely four-fold rotations, can reduce the need for selecting many transformations for I-SSL, and with only 1% of the training data, E-SimCLR and E-SimSiam achieve 90.0 ± 1.0% and 88.6 ± 1.0%, respectively.

Figure 4: Reducing the labels for training (top-1 semi-supervised linear probe accuracy vs. fraction of training data) and the data augmentation for pre-training (No Augment., Crop + Flip only, No Grayscale, Full Augment.) on CIFAR-10, for SimCLR and SimSiam and their E-SSL versions. Error bars are for 5 different training data splits.

ImageNet results. Table 2 demonstrates our main results on the linear probe on ImageNet after pre-training with various state-of-the-art I-SSL methods and their E-SSL versions.

Table 2: Linear probe accuracy (%) on ImageNet. Each model is pre-trained for 100 epochs. Baseline results are from Table B.1 in (Chen et al., 2020) and from Table 4 in (Chen & He, 2021). Numbers marked with * use a less optimal setting than our reproduction for SimCLR (see ImageNet setup).

| Method | SimCLR (Chen et al., 2020) | SimSiam (Chen & He, 2021) | Barlow Twins (Zbontar et al., 2021) |
| --- | --- | --- | --- |
| Baseline (Chen et al., 2020) | 64.7* | - | - |
| Baseline (Chen & He, 2021) | 66.5* | 68.1 | - |
| Baseline (our reproduction) | 67.3 | 68.1 | 66.9 |
| E-SSL (ours) | 68.3 | 68.6 | 68.2 |

Table 3: Linear probe accuracy (%) on ImageNet with longer pre-training. BT is short for Barlow Twins.

| Method | 100 epochs | 200 epochs | 300 epochs |
| --- | --- | --- | --- |
| SimCLR (repro) | 67.3 | 69.7 | 70.6 |
| E-SimCLR (ours) | 68.3 | 70.5 | 71.5 |
| BT (repro) | 66.9 | 70.0 | 71.1 |
| E-BT (ours) | 68.2 | 71.0 | 71.9 |
By only sweeping λ and slightly reducing the original learning rate for SimSiam, we obtain consistent 1%/0.5%/1.3% improvements for SimCLR/SimSiam/Barlow Twins, respectively. Additionally, in Table 3 we observe consistent benefits of using E-SSL with longer pre-training. Finally, after 800 epochs of pre-training, E-SimCLR achieves 72.5%, which is 0.6% better than SimCLR's 71.9% baseline.

5 DISCUSSION

To show that other domains benefit from E-SSL in a qualitatively similar way to the applications in the previous section, here we introduce two datasets in photonics science. Figure 5 depicts the datasets, i.e. input-label pairs consisting of 2D square periodic unit cells of PhCs and their associated DOS. The physics of the problem dictates that the DOS is invariant to (rolling) translations, scaling of all pixels by a fixed positive factor, and operations of the C4v symmetry group, i.e. rotations and mirror flips.

Figure 5: PhC datasets with transformations for sensitivity (four-fold translations at the top, four-fold rotations at the bottom). The regression task is to predict the DOS labels (an example of a label in R^400, DOS vs. frequency in arbitrary units, is shown on the right) from 2D square periodic unit cells (examples of the inputs in R^{32x32} are shown on the left). We consider two types of input unit cells: at the top is the Blob dataset, where the feature variation is always centered; at the bottom is the Group pm (Gpm) dataset, where inputs have a horizontal mirror symmetry.

In choosing the transformations that E-SSL should encourage sensitivity to, we observe that the transformations that have worked for CIFAR-10 and ImageNet disturb the natural setting of the data (e.g. rotations disturb the natural upright setting of images). Thus, we encourage sensitivity to transformations that fit this observation, and insensitivity to the rest of the transformations. In Figure 5, the top dataset is the Blob dataset, where the shape variation in each image is centered. We encourage sensitivity to the group of four-fold translations, given by G = {e, h, v, hv}, where h and v are 1/2-unit-cell translations along the horizontal and vertical axes, respectively, e is the unit element (no transformation) and hv is the composition of h and v. In the bottom dataset of Figure 5, the PhC unit cells are generated to have a horizontal mirror symmetry, i.e. we use the 2D wallpaper (or crystallographic plane) group pm. We encourage sensitivity to the group of four-fold rotations (the same group we used for CIFAR-10 and ImageNet), since rotating any of the images disturbs the (horizontal) mirror symmetry. More precisely, since only ±π/2 rotations disturb the symmetry, we separate the rotations into two classes, {π/2, −π/2} and {0, π}, and perform binary prediction in E-SSL.
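As an illustration, the two group actions above can be sketched as follows (our own code; `cell` is a 32x32 NumPy array as in Figure 5, and the group-element encodings are our own conventions):

```python
import numpy as np

def four_fold_translation(cell, g):
    """Blob dataset: G = {e, h, v, hv}, where h and v are half-unit-cell
    rolling translations along the horizontal and vertical axes."""
    n = cell.shape[0]
    shifts = {"e": (0, 0), "h": (0, n // 2), "v": (n // 2, 0), "hv": (n // 2, n // 2)}
    return np.roll(cell, shifts[g], axis=(0, 1))

def rotation_class(g):
    """Gpm dataset: only +-pi/2 rotations disturb the horizontal mirror
    symmetry, so four-fold rotations collapse to a binary label."""
    return 1 if g in (1, 3) else 0   # g counts 90-degree rotations
```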
Table 4 shows the results of fine-tuning the backbone and an additional DOS-predictor head (see Appendix F) with 3000 labelled samples for this regression task. We observe that encouraging sensitivity to the selected transformations (via E-SimCLR) leads to the largest reduction in the error. On the contrary, including these transformations in SimCLR (indicated by "+ Transform") increases the error.

Table 4: Fine-tuning the backbone on PhC datasets using 3000/2000 labelled train/test samples. Relative error (%) is ℓ_DOS = (Σ_ω |DOS_pred − DOS|)/(Σ_ω DOS); lower is better. SimCLR for Blob includes C4v (rotations and flips); SimCLR for Gpm includes rolling translations and mirrors. E-SimCLR encourages the features to be sensitive to the selected transformation explained in the text (four-fold translations for Blob and four-fold rotations for Gpm). "+ Transform" means adding this transformation to SimCLR. Error bars are for 3 different training data splits.

| PhC Dataset | Supervised | SimCLR | SimCLR + Transform | E-SimCLR (ours) |
| --- | --- | --- | --- | --- |
| Blob | 1.068 ± 0.015 | 0.987 ± 0.005 | 0.999 ± 0.005 | 0.974 ± 0.009 |
| Gpm | 3.212 ± 0.041 | 3.122 ± 0.002 | 3.139 ± 0.005 | 3.091 ± 0.006 |

Furthermore, we explore scaling transformations and show that E-SSL can be generalized to infinite groups (see Appendix F). This supports our observations about the usefulness of E-SSL over I-SSL and demonstrates E-SSL's generality beyond computer vision.

ACKNOWLEDGEMENTS

We thank Shurong Lin, Kristian Georgiev, Alexander Atanasov, Peter Lu and the anonymous reviewers for fruitful conversations and support. R.D. dedicates this work to the memory of Boyko Dangovski. The authors acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center (Reuther et al., 2018) for providing HPC and consultation resources that have contributed to the research results reported within this paper. Research was sponsored by the United States Air Force Research Laboratory and the United States Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. This material is also based in part upon work supported by the Air Force Office of Scientific Research under award number FA9550-21-1-0317 and the U.S. Army Research Office through the Institute for Soldier Nanotechnologies at MIT, under Collaborative Agreement Number W911NF-18-2-0048. This work is also supported in part by the National Science Foundation under Cooperative Agreement PHY-2019786 (The NSF AI Institute for Artificial Intelligence and Fundamental Interactions, http://iaifi.org/).

REPRODUCIBILITY STATEMENT

Algorithm 1, the original public code for each of the I-SSL methods we use in the paper, and the experimental setups in Section 4.1 and in Appendices D, E and F can be used for reproducibility. Our code is available at https://github.com/rdangovs/essl.

REFERENCES

Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. In Proceedings of the IEEE International Conference on Computer Vision, pp. 37-45, 2015.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. In NeurIPS Deep Learning Symposium, 2016.

Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges, 2021.

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294, 2021.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597-1607. PMLR, 2020.
Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750-15758, 2021.

Thomas Christensen, Charlotte Loh, Stjepan Picek, Domagoj Jakobović, Li Jing, Sophie Fisher, Vladimir Ceperic, John D. Joannopoulos, and Marin Soljačić. Predictive and generative machine learning models for photonic crystals. Nanophotonics, 9(13):4183-4192, October 2020. ISSN 2192-8614. doi: 10.1515/nanoph-2020-0197. URL https://www.degruyter.com/document/doi/10.1515/nanoph-2020-0197/html.

Thomas Christensen, Hoi Chun Po, John D. Joannopoulos, and Marin Soljačić. Location and topology of the fundamental gap in photonic crystals, 2021.

Taco Cohen and Max Welling. Group equivariant convolutional networks. In International Conference on Machine Learning, pp. 2990-2999. PMLR, 2016.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009.

Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422-1430, 2015.

Zeyu Feng, Chang Xu, and Dacheng Tao. Self-supervised representation learning by rotation feature decoupling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10364-10374, 2019.

Adam Foster, Rattana Pukdee, and Tom Rainforth. Improving transformation invariance in contrastive representation learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=NomEDgIEBwE.

Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.

Spyros Gidaris, Andrei Bursuc, Nikos Komodakis, Patrick Pérez, and Matthieu Cord. Boosting few-shot visual learning with self-supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8059-8068, 2019.

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729-9738, 2020.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

Dinesh Jayaraman and Kristen Grauman. Learning image representations tied to ego-motion. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.

J. D. Joannopoulos, S. G. Johnson, J. N. Winn, and R. D. Meade. Photonic Crystals: Molding the Flow of Light. Princeton University Press, 2nd edition, 2008. URL http://ab-initio.mit.edu/book/.

Steven G. Johnson and J. D. Joannopoulos. Block-iterative frequency-domain methods for Maxwell's equations in a planewave basis. Optics Express, 8(3):173-190, January 2001. ISSN 1094-4087. doi: 10.1364/OE.8.000173. URL https://www.osapublishing.org/oe/abstract.cfm?uri=oe-8-3-173.
Samuel Kim, Peter Y. Lu, Charlotte Loh, Jamie Smith, Jasper Snoek, and Marin Soljačić. Scalable and flexible deep Bayesian optimization with auxiliary information for scientific problems. arXiv preprint arXiv:2104.11667, April 2021. URL http://arxiv.org/abs/2104.11667.

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision, pp. 667-676, 2017.

Kibok Lee, Yian Zhu, Kihyuk Sohn, Chun-Liang Li, Jinwoo Shin, and Honglak Lee. i-Mix: A domain-agnostic strategy for contrastive representation learning. In ICLR, 2021.

Karel Lenc and Andrea Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 991-999, 2015.

Boyuan Liu, Steven G. Johnson, John D. Joannopoulos, and Ling Lu. Generalized Gilat-Raubenheimer method for density-of-states calculation in photonic crystals. Journal of Optics, 20(4):044005, April 2018. ISSN 2040-8978, 2040-8986. doi: 10.1088/2040-8986/aaae52. URL http://arxiv.org/abs/1711.07993.

Charlotte Loh, Thomas Christensen, Rumen Dangovski, Samuel Kim, and Marin Soljačić. Surrogate- and invariance-boosted contrastive learning for data-scarce applications in science, 2021. URL https://arxiv.org/abs/2110.08406.

Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: Unsupervised learning using temporal order verification. In European Conference on Computer Vision, pp. 527-544. Springer, 2016.

T. Nathan Mundhenk, Daniel Ho, and Barry Y. Chen. Improvements to context based self-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722-729. IEEE, 2008.

Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69-84. Springer, 2016.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Colorado Reed, Sean Metzger, Aravind Srinivas, Trevor Darrell, and Kurt Keutzer. Evaluating self-supervised pretraining without using labels. In CVPR, 2021.

Albert Reuther, Jeremy Kepner, Chansup Byun, Siddharth Samsi, William Arcand, David Bestor, Bill Bergeron, Vijay Gadepally, Michael Houle, Matthew Hubbell, et al. Interactive supercomputing on 40,000 cores for machine learning and data analysis. In 2018 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1-6. IEEE, 2018.

Alex Tamkin, Mike Wu, and Noah Goodman. Viewmaker networks: Learning views for unsupervised representation learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=enoVQWLsfyL.

Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning? arXiv preprint arXiv:2005.10243, 2020.

Antonio Torralba, Rob Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958-1970, 2008. doi: 10.1109/TPAMI.2008.128.
Vikas Verma, Thang Luong, Kenji Kawaguchi, Hieu Pham, and Quoc Le. Towards domain-agnostic contrastive learning. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 10530-10541. PMLR, 18-24 Jul 2021. URL https://proceedings.mlr.press/v139/verma21a.html.

Xiao Wang and Guo-Jun Qi. Contrastive learning with stronger augmentations. arXiv preprint arXiv:2104.07713, 2021.

Yifei Wang, Zhengyang Geng, Feng Jiang, Chuming Li, Yisen Wang, Jiansheng Yang, and Zhouchen Lin. Residual relaxation for multi-view representation learning. Advances in Neural Information Processing Systems, 34, 2021.

Zhirong Wu, Yuanjun Xiong, Stella Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance-level discrimination. In CVPR, 2018.

Tete Xiao, Xiaolong Wang, Alexei A Efros, and Trevor Darrell. What should not be contrastive in contrastive learning. arXiv preprint arXiv:2008.05659, 2020.

Eli Yablonovitch. Inhibited spontaneous emission in solid-state physics and electronics. Physical Review Letters, 58(20):2059-2062, May 1987. doi: 10.1103/PhysRevLett.58.2059. URL https://link.aps.org/doi/10.1103/PhysRevLett.58.2059.

Shin'ya Yamaguchi, Sekitoshi Kanai, Tetsuya Shioda, and Shoichiro Takeda. Image enhanced rotation prediction for self-supervised learning. In 2021 IEEE International Conference on Image Processing (ICIP), pp. 489-493. IEEE, 2021.

Amir R Zamir, Tilman Wekel, Pulkit Agrawal, Colin Wei, Jitendra Malik, and Silvio Savarese. Generic 3D representation via pose estimation and matching. In European Conference on Computer Vision, pp. 535-553. Springer, 2016.

Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230, 2021.

Liheng Zhang. Equivariance and invariance for robust unsupervised and semi-supervised learning. 2020.

Liheng Zhang, Guo-Jun Qi, Liqiang Wang, and Jiebo Luo. AET vs. AED: Unsupervised representation learning by auto-encoding transformations rather than data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2547-2555, 2019.

Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision, pp. 649-666. Springer, 2016.

A SUMMARY OF MAIN TEXT AND LAYOUT OF APPENDIX

In this paper we motivated the generalization of state-of-the-art methods in self-supervised learning to the more general framework of equivariant self-supervised learning (E-SSL). In E-SSL, rather than using only invariance as a trivial case of equivariance, we encouraged non-trivial equivariance and improved state-of-the-art methods on common computer vision benchmarks and regression tasks in photonics science. We also discussed that there are many types of equivariance we can consider for E-SSL. We observed that most of the successful transformations for E-SSL that we explored form groups, but foresee that potentially many more transformations could be explored. For future work, one could learn transformations that are equivariances, instead of setting them manually.
Thus, the concept of E-SSL could potentially be extended to natural language processing or other science domains, whose transformations for SSL are less well understood. To facilitate further research in E-SSL, below we provide additional details and analysis of the experiments in the main text. We also discuss interesting avenues for future work.

B PROOF OF PROPOSITION 1

Proof. To construct a non-trivially equivariant f, we first need to show that both X' and S are G-sets, i.e. that there is a group action T_g of G on X', which is given by the statement of the proposition, and another (non-trivial) group action T'_g of G on S, which we will construct. Then, we need to show that f commutes with the group action, i.e. that f(T_g(x')) = T'_g(f(x')).

Group actions. Note that by the setup of the problem, we are already given how G acts on the input X', i.e. T_g is known. For example, if G is the group of four-fold rotations, then T_g is the rotation of the input by a multiple of π/2. We proceed to construct the non-trivial group action T'_g of G on S. Define the function T': G × S → S as T'(g, s) = f(T_g(T_{g'}(x))), where s = f(T_{g'}(x)). Note that T' is well-defined, because gg' ∈ G by the closure of the group and s is uniquely written as s = f(T_{g'}(x)). To see why s is uniquely written, it suffices to show that if f(T_{g'}(x)) = f(T_{g''}(x'')) then both g' = g'' and x = x'', which follows directly from our assumption in the statement. Now, to prove that T' is a group action, it suffices to show two properties.

Identity: T'(e, s) = s for s = f(T_{g'}(x)), where e is the unit element of the group. To show this, note that by definition T'(e, s) = f(T_e(T_{g'}(x))) = f(T_{g'}(x)), because eg' = g'.

Compositionality: T'(g, T'(h, f(T_{g'}(x)))) = T'(gh, f(T_{g'}(x))). To show this, we expand the LHS and use the definition of T' to obtain

T'(g, T'(h, f(T_{g'}(x)))) = T'(g, f(T_h(T_{g'}(x)))) = f(T_g(T_{hg'}(x))) = f(T_{ghg'}(x)) = f(T_{gh}(T_{g'}(x))) = T'(gh, f(T_{g'}(x))),

because the group operation is associative. Hence, T' is a group action, and thus S is a G-set, and we can write T'(g, ·) ≡ T'_g(·).

Commuting with the group action. To see this property, write x' = T_g(x) and note that T'_{g'}(f(x')) = T'_{g'}(f(T_g(x))) = f(T_{g'}(T_g(x))) = f(T_{g'}(x')), as desired. Note that T'_g is non-trivial. Therefore, we can conclude that f, which satisfies the constructed group action T'_g, is non-trivially equivariant to the group G.

C ROTATION PREDICTION AND I-SSL BENEFIT FROM SIMILAR DATA AUGMENTATION

Recently, rotation prediction with a linear head from the frozen backbone representations proved to be useful for validating the augmentation policies of contrastive learning (Reed et al., 2021). This shows that the two tasks of classifying ground-truth classes and synthetic rotation classes from frozen backbone representations benefit from similar augmentation policies. We took this experiment a step further and performed rotation prediction with the augmentation policies typically used in contrastive learning. The result is in Table 5. Interestingly, RotNet benefits from augmentations typically used in contrastive learning, and the RotNet training shares the same sweet spot (Tian et al., 2020) as kNN classification.

Table 5: RotNet's augmentation sweet spot. kNN and Rotation Prediction have the same sweet spot (Level 4), which gives the best accuracy in both columns. RotNet is trained on CIFAR-10 for 100 epochs with the same optimization setup as in our I-SSL experiments. Accuracies are on the test split. Parentheses mark the deviation from the sweet spot. Every new level adds a new augmentation to the previous level incrementally.

| Level | Added Augmentation | Supervised kNN Acc. (%) | Rotation Prediction Acc. (%) |
| --- | --- | --- | --- |
| 0 | none | 44.8 (-19.8) | 90.2 (-4.8) |
| 1 | random resized cropping | 59.2 (-5.4) | 93.7 (-1.3) |
| 2 | horizontal flips w.p. 0.5 | 59.4 (-5.2) | 94.5 (-0.5) |
| 3 | color jitter w.p. 0.8 | 64.3 (-0.3) | 94.9 (-0.1) |
| 4 | grayscale w.p. 0.2 | 64.6 | 95.0 |
| 5 | Gaussian blur w.p. 0.2 | 64.1 (-0.5) | 94.5 (-0.5) |
| 6 | random rotation (±π/6) | 59.4 (-5.2) | 93.1 (-1.9) |
| 7 | vertical flip w.p. 0.5 | 51.9 (-12.7) | 90.6 (-4.4) |

There are several takeaways from this experiment: (i) we can find good augmentations for contrastive learning by doing RotNet alone, i.e. without doing any contrastive learning; (ii) RotNet benefits from augmentations needed in contrastive learning; (iii) we may be able to combine four-fold rotations prediction and contrastive learning.
D CIFAR-10 EXPERIMENTS

D.1 EXPERIMENTAL SETUP

Our experiments use the following architectural choices: ResNet-18 backbone (the CIFAR-10 version has kernel size 3, stride 1, padding 1 and no max pooling afterwards); 512 batch size (only our baseline SimSiam model uses batch size 1024); 0.03 base learning rate for the baseline SimCLR and SimSiam and 0.06 base learning rate for E-SimCLR and E-SimSiam; 800 pre-training epochs; standard cosine-decayed learning rate; 10 epochs of linear warmup; a two-layer projector with hidden dimension 2048 and output dimension 2048; for SimSiam, a two-layer (bottleneck) predictor with hidden dimension 512 whose learning rate is not decayed; the last batch normalization of the projector does not have learnable affine parameters; 0.0005 weight decay; SGD optimizer with momentum 0.9. The augmentation is random resized cropping with scale (0.2, 1.0), aspect ratio (3/4, 4/3) and size 32x32, random horizontal flips with probability 0.5, color jittering (0.4, 0.4, 0.4, 0.1) with probability 0.8 and grayscale with probability 0.2.

Some of our evaluations use a kNN classifier with 200 neighbors, cosine similarity and a Gaussian kernel with temperature 0.1. This evaluation correlates well with the standard linear probe, but it is more efficient to calculate. We report the kNN accuracy in % at the end of the 800 epochs of training. For our main results, we report a linear probe accuracy from training a linear classifier for 100 epochs on top of the frozen representations with SGD with momentum 0.9 and cosine decay of the learning rate, batch size 256 and initial learning rate of 30. For linear probe experiments we try 5 different initializations of the linear head and report means and standard deviations. The deviations are negligible because the linear probe is robust to the random seed. All parameters are reported in a PyTorch-like style.

For Figure 1 we use a resolution of 32x32 for the transformations studied. The 4 levels of the Gaussian blur are for kernel sizes 0, 5, 9 and 15 in the default Gaussian blur torchvision implementation. The prediction of the transformations follows the experimental setup in Section 3. When we apply the transformations in I-SSL, we add them at the beginning of the augmentation policy with probability 1. The same setup is used for "Disentangled representations" and "Insensitive instead of sensitive" in Table 1.
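For concreteness, the augmentation policy above can be written with standard torchvision transforms; this is our own rendering of the listed parameters, not the released code:

```python
from torchvision import transforms

# CIFAR-10 SSL augmentation policy described above (sketch).
ssl_augmentation = transforms.Compose([
    transforms.RandomResizedCrop(32, scale=(0.2, 1.0), ratio=(3 / 4, 4 / 3)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])
```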
D.2 ADDITIONAL EXPERIMENTS

Explored hyperparameters. Both for SimCLR and SimSiam we ran a grid search over the following hyperparameters: base learning rate: {0.01, 0.03, 0.06}; batch size: {512, 1024}; λ (for E-SSL): {0.0, 0.2, 0.4, 0.6, 0.8, 1.0}; predictor's MLP depth: {2, 3, 4}; predictor's normalization: {None, BatchNorm, LayerNorm}; nonlinearity at the last MLP layer of the predictor: {True, False}.

Tuning λ. Table 6 shows tuning of the CIFAR-10 results. We observe noticeable improvements over the SSL baselines by using E-SSL instead.

Table 6: Tuning the λ parameter for CIFAR-10. The baseline column corresponds to λ = 0.0.

| Method | λ = 0.0 (Baseline) | 0.2 | 0.4 | 0.6 | 0.8 | 1.0 |
| --- | --- | --- | --- | --- | --- | --- |
| SimCLR | 92.0 ± 0.0 | 93.6 ± 0.0 | 94.1 ± 0.0 | 94.0 ± 0.0 | 94.1 ± 0.0 | 93.5 ± 0.0 |
| SimSiam | 91.1 ± 0.0 | 94.1 ± 0.0 | 94.2 ± 0.1 | 93.7 ± 0.0 | 93.8 ± 0.0 | 93.3 ± 0.0 |

Sensitivity to transformations for I-SSL. Table 7 demonstrates that E-SSL can produce good representations with as few SSL transformations for I-SSL as possible. We observe that E-SSL is less sensitive than SSL to the choice of data augmentation.

Table 7: Comparing the augmentation sensitivity for CIFAR-10. Levels: 0 is no transformations; 1 adds random resized cropping; 2 adds horizontal flips; 3 adds color jitter; 4 adds grayscale.

| Method | Level 0 | Level 1 | Level 2 | Level 3 | Level 4 |
| --- | --- | --- | --- | --- | --- |
| SimCLR | 28.0 ± 0.1 | 76.0 ± 0.1 | 77.0 ± 0.0 | 87.4 ± 0.0 | 92.0 ± 0.0 |
| E-SimCLR | 69.5 ± 0.0 | 83.5 ± 0.0 | 84.6 ± 0.1 | 91.6 ± 0.0 | 94.1 ± 0.0 |
| SimSiam | 17.6 ± 0.1 | 71.5 ± 0.0 | 72.2 ± 0.0 | 88.1 ± 0.0 | 91.1 ± 0.0 |
| E-SimSiam | 67.5 ± 0.1 | 84.6 ± 0.0 | 85.7 ± 0.1 | 92.9 ± 0.0 | 94.2 ± 0.1 |

The importance of complete invariance or sensitivity. Table 8 studies whether a middle ground for the representations exists, i.e. whether it is possible to have part of the representation invariant and the other part sensitive to the transformation. If we apply the E-SSL loss only to half of the representation, then there is a very small drop in performance. Furthermore, we observe that having a disjoint mix between insensitivity and sensitivity in the representation is noticeably harmful.

Table 8: Studying the effect of disjoint representations on CIFAR-10. "Split Representation" means that we encourage similarity only on one half of the backbone representation. "Disentangled Representation" means that one half of the representation is trained to be insensitive to four-fold rotations and the other half to be sensitive to four-fold rotations. Linear probe accuracy (%) after 800 epochs.

| Method | Baseline | Split Representation | Disentangled Representation |
| --- | --- | --- | --- |
| E-SimCLR | 94.1 ± 0.0 | 94.1 ± 0.0 (-0.0) | 91.3 ± 0.0 (-2.7) |
| E-SimSiam | 94.2 ± 0.1 | 93.8 ± 0.0 (-0.4) | 91.1 ± 0.0 (-3.1) |

Fully connected backbone. We perform a simple experiment with a fully connected backbone instead of a ResNet-18 (a sketch of this backbone appears at the end of this subsection). The hidden dimensions of the backbone are listed in order as {3·32·32, 2048, 2048, 512}, with Batch Normalization and ReLUs in between. The rest of the experimental setup is exactly the same. On the linear probe (%), we obtain 70.5 ± 0.0 for SimCLR and 73.8 ± 0.1 for E-SimCLR, and 70.9 ± 0.0 for SimSiam and 73.5 ± 0.1 for E-SimSiam, highlighting noticeable gains from using E-SSL.

CIFAR-100 experiments. We test our CIFAR-10 experimental setup directly on CIFAR-100. On the linear probe (%), we obtain 65.8 ± 0.0 for SimCLR and 69.5 ± 0.1 for E-SimCLR, and 65.8 ± 0.1 for SimSiam and 69.3 ± 0.1 for E-SimSiam, highlighting sizable gains from using E-SSL.

Large crop study. We study whether using a large crop with a single rotation on CIFAR-10 can be just as good as a small crop. We obtain 93.9 ± 0.0 on the linear probe using E-SimCLR, which is only 0.2 absolute points below our best result of 94.1 ± 0.0 using four small crops.
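As referenced in the "Fully connected backbone" paragraph above, a sketch of that backbone (our own rendering of the listed dimensions) might look as follows:

```python
import torch.nn as nn

# Fully connected backbone with dimensions {3*32*32, 2048, 2048, 512}
# and BatchNorm + ReLU between layers, as described above (sketch).
fc_backbone = nn.Sequential(
    nn.Flatten(),                      # 3x32x32 image -> 3072-dim vector
    nn.Linear(3 * 32 * 32, 2048), nn.BatchNorm1d(2048), nn.ReLU(),
    nn.Linear(2048, 2048), nn.BatchNorm1d(2048), nn.ReLU(),
    nn.Linear(2048, 512),              # 512-dim backbone representation
)
```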
Large crop study. We study whether using a large crop with a single rotation on CIFAR-10 can be just as good as using four small crops. We obtain 93.9±0.0 on the linear probe using E-SimCLR, which is only 0.2 absolute points below our best result of 94.1±0.0 using four small crops.

D.3 NORM-DIFFERENCES ANALYSIS

In Figure 6 we present an analysis showing that our training objectives encourage invariance and equivariance to the respective transformations. We take our best performing E-SimCLR and E-SimSiam methods on CIFAR-10. During training we track two measures that capture how invariant/equivariant the backbone representations are.

The invariance measure computes the negative cosine similarity between two views of the backbone representations. The lower this measure is, the higher the similarity between the two views, and thus the more invariant the backbone representations are to the transformations in I-SSL. We observe that high similarity between the two views is maintained during training (roughly between 0.8 and 0.9), which indicates that invariance is encouraged in the backbone representations, as desired.

Likewise, the equivariance measure computes the average cosine similarity of the backbone representations between all six pairs of the four rotated views. The lower this measure is, the lower the similarity between the four views, and thus the more non-trivially equivariant the backbone representations are to the transformations for equivariance. We observe that the measure decays to about 0.3 during training, which indicates that the backbone representations are encouraged to be equivariant to four-fold rotations, as desired.

Figure 6: Demonstration of the evolution of the invariance (top) and equivariance (bottom) measures during training (x-axis: training steps). Left is E-SimCLR and right is E-SimSiam.
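As a reference, here is a minimal sketch of how these two measures could be computed from batches of backbone features. The function names and the batch-mean reduction are our assumptions.

```python
import torch
import torch.nn.functional as F
from itertools import combinations

def invariance_measure(h1, h2):
    # Negative cosine similarity between the backbone features of two
    # augmented views; lower means more invariant.
    return -F.cosine_similarity(h1, h2, dim=1).mean()

def equivariance_measure(h_rot):
    # h_rot: list of four feature batches, one per 90-degree rotation of
    # the same inputs. Average cosine similarity over the six distinct
    # pairs; lower means more (non-trivially) equivariant.
    pair_sims = [F.cosine_similarity(a, b, dim=1).mean()
                 for a, b in combinations(h_rot, 2)]
    return torch.stack(pair_sims).mean()
```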
E IMAGENET EXPERIMENTS

We had limited computational resources, so we kept the learning rates the same as in the original methods; only for SimSiam did we find that choosing a smaller learning rate of 0.08 leads to better results for E-SimSiam. We swept only the λ parameter: for SimCLR and SimSiam the sweep was between 0 and 1, and for Barlow Twins it was between 0 and 100. The optimal λ is 0.4 for SimCLR, 0.08 for SimSiam and 8 for Barlow Twins. We use a (0.05, 0.14) scale range for 100 pre-training epochs. For more pre-training epochs we use (0.05, 0.14) for SimCLR and (0.08, 1.0) for Barlow Twins. Table 9 lists the overhead from using rotation prediction in our experiments.

Table 9: Overhead of doing rotation prediction. GPU hours reported for a 100-epoch experiment.

 | SimCLR | SimSiam | Barlow Twins
Baseline | 256 | 295 | 246
E-SSL (ours) | 307 | 364 | 294
Overhead | 20% | 23% | 19%

F PHC EXPERIMENTS

Dataset generation. 2D photonic crystals (PhCs) are characterized by a periodically varying permittivity ε(x, y); here, for simplicity, we consider a two-tone permittivity profile, i.e. ε ∈ {ε1, ε2} with εi ∈ [1, 20], discretized to a resolution of 32x32. To generate the unit cells in the Blob dataset, we follow the procedure in Christensen et al. (2020). For the Gpm dataset, the unit cells are defined using a level set of a 2D Fourier sum function as in Kim et al. (2021); Loh et al. (2021), with additional constraints applied to the lattice to create the mirror symmetry, adopted from the method in Christensen et al. (2021). We then follow the procedure in Loh et al. (2021) to compute, and subsequently process, the density-of-states (DOS) of each unit cell, specifically via the MIT Photonic Bands (MPB) software (Johnson & Joannopoulos, 2001) and the Generalized Gilat-Raubenheimer method in the implementation of Liu et al. (2018).

Network architecture. We use an encoder network composed of simple convolutional (CNN) and fully connected (FC) layers for the backbone; specifically, our backbone begins with 3 CNN layers, all with a kernel size of 7 and channel dimensions given by [64, 256, 256]. The output is flattened and fed into 2 FC layers, each with 1024 nodes (i.e. the representations have dimension 1024). We include BatchNorm (Ioffe & Szegedy, 2015), ReLU and max pooling for the CNNs, and ReLU only for the first FC layer (a sketch of this backbone appears below). The projector and predictor networks, p1 and p2, are 2-layer MLPs with hidden dimension 512, with BatchNorm and ReLU between each layer except the last; the projection dimension of p1 is 256. Additionally, since this is a regression task and the label space is much larger than in image classification tasks, we include a dense DOS-predictor head after the representations, which is fine-tuned with 3000 labelled samples after SSL or E-SSL. The DOS-predictor has 4 FC layers, with numbers of nodes given by [1024, 1024, 512, 400]. We explore two fine-tuning protocols for the DOS-predictor: freezing the backbone (discussed later in this appendix) or fine-tuning the backbone (discussed in the main text).

Hyperparameters. For SSL and E-SSL, we performed 250 pre-training epochs using the SGD optimizer with a standard cosine-decayed learning rate; the batch size was fixed to 512. The pre-trained model was saved at various epochs {20, 50, 100, 180, 250} for further fine-tuning. Fine-tuning was performed for 100 epochs using the Adam optimizer and a fixed batch size of 64. No transformations were applied to the input during fine-tuning, for both freezing and fine-tuning the backbone. We ran a grid search over the following hyperparameters: for pre-training, base learning rate: {10⁻³, 10⁻⁴, 10⁻⁵} and λ (for E-SSL): {0.2, 1.0, 2.0, 5.0, 10.0}; for fine-tuning, learning rate: {10⁻³, 10⁻⁴, 10⁻⁵}.

Frozen backbone experiment. In Table 10 we present our results from freezing the backbone encoder while fine-tuning the DOS-predictor head. We observe similar trends as in Table 4, where we allowed fine-tuning of the backbone. Relative error is reported in %; the lower the error, the better. SimCLR for Blob includes C4v (rotations and flips) and SimCLR for Gpm includes rolling translations and flips. E-SimCLR encourages the features to be sensitive to the selected transformation (four-fold translations for Blob and four-fold rotations for Gpm), which improves the performance of SimCLR. On the contrary, adding the selected transformation to SimCLR, as indicated by '+ Transform', increases the error of SimCLR. Error bars are reported for 3 different choices of training data. 'Supervised (frozen)' refers to the impractical situation of freezing a random backbone and fine-tuning the DOS-predictor.

Table 10: Frozen backbone experiment on PhC datasets for 3000/2000 labelled train/test samples.

PhC Dataset | Supervised (frozen) | SimCLR | SimCLR + Transform | E-SimCLR (ours)
Blob | 1.686±0.014 | 1.237±0.005 | 1.242±0.013 | 1.165±0.020
Gpm | 5.450±0.077 | 3.214±0.048 | 3.313±0.029 | 3.187±0.000
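The following sketch instantiates the PhC backbone described above. The kernel size, channel widths, FC widths and the placement of ReLU are from the text; the padding of 3, the 2x2 max pooling after each CNN layer, and the single-channel permittivity input are our assumptions to make the shapes work out.

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    # Kernel size 7 and channel widths follow the text; padding 3 and
    # 2x2 max pooling are assumed.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=7, padding=3),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

phc_backbone = nn.Sequential(
    conv_block(1, 64),       # 32x32 -> 16x16 (single-channel input assumed)
    conv_block(64, 256),     # 16x16 -> 8x8
    conv_block(256, 256),    # 8x8 -> 4x4
    nn.Flatten(),            # 256 * 4 * 4 = 4096 features
    nn.Linear(256 * 4 * 4, 1024),
    nn.ReLU(inplace=True),   # ReLU only after the first FC layer
    nn.Linear(1024, 1024),   # 1024-dimensional representations
)
```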
Continuous group experiment. In all experiments shown so far, we dealt with finite groups of transformations. To show that E-SSL generalizes beyond the finite-group setting, we also explore transformations from a continuous group. An example is the scaling transformation, where every pixel of the input unit cell is scaled by the same positive factor. More specifically, this set of positive scaling transformations g(s)x = sx defines a continuous group G = {g(s) | s ∈ R+}, which leaves the DOS labels invariant due to the physics of the problem and the normalization applied when pre-processing the dataset (Loh et al., 2021). In our experiment, we uniformly sample s ∈ (1, smax] and apply the inverse with probability 0.5 (i.e. we cap the scaling factor to a maximum of smax ∈ {5, 10} for numerical stability during training, and we apply up-scaling and down-scaling with equal probability). To encourage equivariance to this group, we simply predict the scale factor applied to the input using an L1 loss (i.e. the final layer of the predictor p2 is a single node); see the sketch below.

In Table 11, we show results after fine-tuning the backbone and the DOS-predictor network with 3000 labelled samples. We observe similar trends to Table 4: encouraging sensitivity to scaling produces the lowest error, and adding scaling to SimCLR increases the error. To isolate the effect of the scaling transformation, the remaining physics-governed invariances excluding scaling (translations, rotations and mirrors) are used in SimCLR and in the invariance part of E-SimCLR for both datasets.

Table 11: Fine-tuning the backbone on PhC datasets using 3000/2000 labelled train/test samples. Relative error (%) is ℓDOS = (Σω |DOSpred − DOS|) / (Σω DOS). Lower is better. E-SimCLR encourages the features to be sensitive to scaling. '+ Scaling' means adding scaling to SimCLR. Error bars are for 3 different training data splits.

PhC Dataset | Supervised | SimCLR | SimCLR + Scaling | E-SimCLR (ours)
Blob (smax = 10) | 1.068±0.015 | 0.988±0.001 | 1.005±0.006 | 0.974±0.000
Blob (smax = 5) | 1.068±0.015 | 0.988±0.001 | 1.000±0.014 | 0.987±0.017
Gpm (smax = 10) | 3.212±0.041 | 3.073±0.003 | 3.112±0.011 | 3.062±0.005
Gpm (smax = 5) | 3.212±0.041 | 3.073±0.003 | 3.082±0.013 | 3.058±0.008
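A minimal sketch of the scaling branch, assuming a per-batch scale draw for simplicity (per-sample scales are equally possible); the function names are ours for illustration.

```python
import torch
import torch.nn.functional as F

def random_scale(x, s_max=10.0):
    # Sample s uniformly from (1, s_max]; apply the inverse (down-scaling)
    # with probability 0.5, per the description above.
    s = 1.0 + torch.rand(()).item() * (s_max - 1.0)
    if torch.rand(()).item() < 0.5:
        s = 1.0 / s
    return s * x, s

def scale_prediction_loss(backbone, predictor, x, s_max=10.0):
    # Equivariance branch for the continuous scaling group: the predictor's
    # final layer is a single node and the scale is regressed with an L1 loss.
    x_scaled, s = random_scale(x, s_max)
    pred = predictor(backbone(x_scaled)).squeeze(-1)
    return F.l1_loss(pred, torch.full_like(pred, s))
```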
G FLOWERS-102 EXPERIMENTS

In order to study the importance of the assumption in Proposition 1, we perform an experiment with a dataset that might not be amenable to such an assumption at first sight. We choose the Flowers-102 dataset (Nilsback & Zisserman, 2008), because at first glance it might not benefit from four-fold rotations in E-SSL. This dataset therefore complements the experiments we performed on CIFAR-10 and ImageNet.

Experimental setup. We train SimCLR and E-SimCLR. We use the same optimization hyperparameters as in the experimental setup for our CIFAR-10 experiments. We downsize the images to 96x96 resolution and use the standard ResNet-18 instead of its modified version for CIFAR-10. For the data augmentation in I-SSL, we use the same random resized cropping as in CIFAR-10 (with crop size 96 being the only difference), and the same color jittering and random horizontal flips as in the CIFAR-10 experiments. We report the kNN accuracy (%) on the validation set. We study both four-fold rotations and four-fold translations as transformations for invariance/equivariance. Four-fold rotations are chosen following the hypothesis that most of the data points should be invariant to rotation. Four-fold translations are chosen because of our observation that most of the data points are centered, just like in the Blob PhC dataset. The λ for predicting four-fold translations is 0.01 and for four-fold rotations is 0.5 (chosen from a grid search over {0.001, 0.01, 0.1, 0.5, 1.0, 2.0}).

Results. Following our observations in Figure 1, we find that encouraging insensitivity to four-fold rotations and translations, by adding the transformations to the SimCLR data augmentation, worsens the SimCLR baseline. In contrast, using these transformations for E-SSL improves the baselines and further shows the utility of E-SSL for real-world data. We even observe a benefit from encouraging equivariance to four-fold rotations, which goes against the intuition that the representations should be invariant to rotations. This is probably due to the fact that some images in the dataset are not truly rotationally invariant (see examples of the data points in Figure 8).

Figure 7: E-SimCLR gives sizable improvements for Flowers-102 SSL pre-training. The two panels ('Translation Experiment' and 'Rotation Experiment') plot kNN accuracy (%) on the validation set against pre-training epochs, comparing SimCLR (baseline), E-SimCLR with the transformation (ours) and SimCLR with the transformation added to the augmentation.

Figure 8: The Flowers-102 dataset is not completely invariant to rotation. The top row shows data points which are roughly invariant to four-fold rotations. The bottom row shows counterexamples to that hypothesis.

H RELATIVE ORIENTATION PREDICTION

In our experiments we demonstrate that if a shared bias between the train and test sets exists, we should exploit it via the E-SSL training objective. However, in some scenarios the class label of the downstream task depends on the orientation (e.g., classifying road signs), and the current E-SSL method may not be very useful, because both x and Tg(x) exist in the data. This situation invites a natural generalization of our method in the spirit of Agrawal et al. (2015). If x and Tg(x) exist in the data, then we can modify E-SSL minimally to form a useful objective. In particular, we can set the objective for p2 to predict the relative orientation between two data points: given x, we form Tg′(x) by sampling g′ from G uniformly, and then predict g′ from p2(z; θp2), where z = [f(x), f(Tg′(x))] is the concatenation of the two representations. This modification requires minimal change to our framework; a sketch follows below.
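A minimal sketch of the relative orientation objective for four-fold rotations; the function names are ours for illustration and do not necessarily correspond to the released code.

```python
import torch
import torch.nn.functional as F

def relative_rotation_loss(f, p2, x):
    # Sample a four-fold rotation g' per image, rotate, and predict g' from
    # the concatenation z = [f(x), f(T_g'(x))]. Note that p2's input
    # dimension is doubled because two representations are concatenated.
    k = torch.randint(4, (x.size(0),), device=x.device)
    x_rot = torch.stack([torch.rot90(img, int(r), dims=(1, 2))
                         for img, r in zip(x, k)])
    z = torch.cat([f(x), f(x_rot)], dim=1)
    return F.cross_entropy(p2(z), k)
```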
To test the usefulness of the modified method when x and Tg′(x) exist in the data, we artificially modify CIFAR-10 so that any rotation of an image can appear in the dataset. We consider the downstream task of predicting the rotation orientation of an image, which clearly depends on the orientation of the image. Our hypothesis is that the modified E-SimCLR will be better than SimCLR, which is what we observe in our results below. Intuitively, SimCLR is not able to capture the orientation of the image, while the modified E-SimCLR is, because the latter predicts relative orientation. The experimental setting is as follows: we pre-train for 100 epochs; the input dimension of the predictor for equivariance is doubled, because we concatenate two representations; we set λ to 0.4; all other hyperparameters are the same as in the rest of the CIFAR-10 experiments. The downstream task is 4-way rotation orientation classification. Using pre-training with SimCLR on the standard CIFAR-10 dataset, we obtain a baseline linear probe accuracy (%) of 67.1±0.1. Using our modification of E-SSL in the same experimental setting, we obtain 71.2±0.1 linear probe accuracy, which is a sizable gain and points to the promise of relative orientation prediction as future work.

The relative orientation prediction scenario is also highly relevant in other domains, such as photonic crystals. For example, we can modify the PhC setup to remove the orientation bias of the datasets while predicting a different property, the band structures. Unlike the DOS, the PhC band structures are not invariant to rotations, so we can consider the group of four-fold rotations. This setup would fit the scenario described above since (1) both x and Tg′(x) exist in the data due to the lack of bias, and (2) the downstream task is sensitive to rotation. We will explore the above framework on this setup in future work.