Published as a conference paper at ICLR 2023

ARCL: ENHANCING CONTRASTIVE LEARNING WITH AUGMENTATION-ROBUST REPRESENTATIONS

Xuyang Zhao1,2, Tianqi Du1,3, Yisen Wang3,4, Jun Yao5, Weiran Huang2
1 School of Mathematical Sciences, Peking University
2 Qing Yuan Research Institute, Shanghai Jiao Tong University
3 National Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University
4 Institute for Artificial Intelligence, Peking University
5 Huawei Noah's Ark Lab

ABSTRACT

Self-Supervised Learning (SSL) is a paradigm that leverages unlabeled data for model training. Empirical studies show that SSL can achieve promising performance in distribution shift scenarios, where the downstream and training distributions differ. However, the theoretical understanding of its transferability remains limited. In this paper, we develop a theoretical framework to analyze the transferability of self-supervised contrastive learning by investigating the impact of data augmentation on it. Our results reveal that the downstream performance of contrastive learning depends largely on the choice of data augmentation. Moreover, we show that contrastive learning fails to learn domain-invariant features, which limits its transferability. Based on these theoretical insights, we propose a novel method called Augmentation-robust Contrastive Learning (ArCL), which provably learns domain-invariant features and can be easily integrated with existing contrastive learning algorithms. We conduct experiments on several datasets and show that ArCL significantly improves the transferability of contrastive learning.

1 INTRODUCTION

A common assumption in designing machine learning algorithms is that training and test samples are drawn from the same distribution.
However, this assumption may not hold in real-world applications, and algorithms may suffer from distribution shifts, where the training and test distributions differ. This issue has motivated a plethora of research in various settings, such as transfer learning, domain adaptation and domain generalization (Blanchard et al., 2011; Muandet et al., 2013; Wang et al., 2021a; Shen et al., 2021). Different ways of characterizing the relationship between test and training distributions lead to different algorithms. Most of the literature studies this in the supervised learning scenario, aiming to find features that capture some invariance across different distributions and assuming that such invariance also applies to test distributions (Peters et al., 2016; Rojas-Carulla et al., 2018; Arjovsky et al., 2019; Mahajan et al., 2021; Jin et al., 2020; Ye et al., 2021).

Self-Supervised Learning (SSL) has attracted great attention in many fields (He et al., 2020; Chen et al., 2020; Grill et al., 2020; Chen & He, 2021; Zbontar et al., 2021). It first learns a representation from a large amount of unlabeled training data, and then fine-tunes the learned encoder to obtain a final model on the downstream task. Due to its two-step nature, SSL is more likely to encounter the distribution shift issue, and exploring its transferability under distribution shifts has become an important topic. Some recent works study this issue empirically (Liu et al., 2021; Goyal et al., 2021; von Kügelgen et al., 2021; Wang et al., 2021b; Shi et al., 2022). However, the theoretical understanding is still limited, which also hinders the development of algorithms.

In this paper, we study the transferability of self-supervised contrastive learning in distribution shift scenarios from a theoretical perspective. In particular, we investigate which downstream distributions will result in good performance for the representation obtained by contrastive learning. We study this problem by deriving a connection between the contrastive loss and the downstream risk. Our main finding is that data augmentation is essential: contrastive learning provably performs well on downstream tasks whose distributions are close to the augmented training distribution. Moreover, the idea behind contrastive learning is to find representations that are invariant under data augmentation. This is similar to domain-invariance based supervised learning methods, since applying each kind of augmentation to the training data can be viewed as inducing a specific domain. Unfortunately, from this perspective, we discover that contrastive learning fails to produce a domain-invariant representation, limiting its transferability.

To address this issue, we propose a new method called Augmentation-robust Contrastive Learning (ArCL), which can be integrated with various widely used contrastive learning algorithms, such as SimCLR (Chen et al., 2020) and MoCo (He et al., 2020). In contrast to standard contrastive learning, ArCL forces the representation to align the two farthest positive samples, and thus provably learns domain-invariant representations. We conducted experiments on various downstream tasks to test the transferability of representations learned by ArCL on CIFAR10 and ImageNet. Our experiments demonstrate that ArCL significantly improves the standard contrastive learning algorithms.

(* Equal contribution. This work was partially done when Xuyang was visiting Qing Yuan Research Institute. Correspondence to Weiran Huang (weiran.huang@outlook.com).)

RELATED WORK

Distribution shift in supervised learning. The distribution shift problem has been studied extensively (Blanchard et al., 2011; Muandet et al., 2013; Wang et al., 2021a; Shen et al., 2021).
Most works aim to learn a representation that performs well on different source domains simultaneously (Rojas-Carulla et al., 2018; Mahajan et al., 2021; Jin et al., 2020), following the idea of causal invariance (Peters et al., 2016; Arjovsky et al., 2019). Structural equation models are often assumed for theoretical analysis (von Kügelgen et al., 2021; Liu et al., 2020; Mahajan et al., 2021). Distributionally robust optimization directly optimizes a model's worst-case performance over some uncertainty set (Krueger et al., 2021; Sagawa et al., 2019; Duchi & Namkoong, 2021; Duchi et al., 2021). Stable learning (Shen et al., 2020; Kuang et al., 2020) learns a set of global sample weights that could remove the confounding bias for all potential treatments from the data distribution. Disentangled representation learning (Bengio et al., 2013; Träuble et al., 2021; Kim & Mnih, 2018) aims to learn representations in which distinct and informative factors of variation in the data are separated.

Theoretical understanding of contrastive learning. A number of recent works aim to theoretically explain the success of contrastive learning in IID settings. One way to explain it is through the mutual information between positive samples (Tian et al., 2020; Hjelm et al., 2018; Tschannen et al., 2019). Arora et al. (2019) directly analyze the generalization of the InfoNCE loss based on the assumption that positive samples are drawn from the same latent classes. In the same setting, Bao et al. (2022) establish an equivalence between InfoNCE and supervised loss and give sharper upper and lower bounds. Huang et al. (2021) take data augmentation into account and provide generalization bounds based on the nearest-neighbor classifier.

Contrastive learning in distribution shift. Shen et al. (2022) and HaoChen et al. (2022) study contrastive learning in unsupervised domain adaptation, where unlabeled target data are obtained. Shi et al.
(2022) show that SSL is the most robust to distribution shift compared to autoencoders and supervised learning. Hu et al. (2022) improve the out-of-distribution performance of SSL from an SNE perspective. Other robust contrastive learning methods (Kim et al., 2020; Jiang et al., 2020) focus on adversarial robustness, while this paper focuses on distributional robustness.

2 PROBLEM FORMULATION

Given a set of unlabeled data where each sample $X$ is i.i.d. sampled from a training distribution $D$ on $\mathcal{X} \subseteq \mathbb{R}^d$, the goal of Self-Supervised Learning (SSL) is to learn an encoder $f: \mathcal{X} \to \mathbb{R}^m$ for different downstream tasks. Contrastive learning is a popular approach to SSL, which augments each sample $X$ twice to obtain a positive pair $(X_1, X_2)$, and then learns the encoder $f$ by pulling them close and pushing random samples (also called negative samples) away in the embedding space. The data augmentation is done by applying a transformation $A$ to the original data, where $A$ is randomly selected from a transformation set $\mathcal{A}$ according to some distribution $\pi$. We use $D_A$ to denote the distribution of augmented data under a specific transformation $A$, and $D_\pi$ to denote the distribution of augmented data, i.e., $D_\pi = \int_{\mathcal{A}} D_A \, d\pi(A)$. The loss of contrastive learning typically consists of two terms:

$$L_{con}(f; D, \pi) := L_{align}(f; D, \pi) + \lambda L_{reg}(f; D, \pi), \qquad (1)$$

where $L_{align}(f; D, \pi) := \mathbb{E}_{X \sim D}\, \mathbb{E}_{(A_1, A_2) \sim \pi^2} \, \| f(A_1(X)) - f(A_2(X)) \|^2$ measures the alignment of positive samples¹, and $L_{reg}(f; D, \pi)$ is a regularization term to avoid feature collapse. For example, the regularization term in the InfoNCE loss is

$$L_{reg}(f; D, \pi) := \mathbb{E}_{(X, X') \sim D^2}\, \mathbb{E}_{(A_1, A_2, A') \sim \pi^3} \log \left( e^{f(A_1(X))^\top f(A_2(X))} + e^{f(A_1(X))^\top f(A'(X'))} \right).$$

In this paper, we consider multi-class classification problems $\mathcal{X} \subseteq \mathbb{R}^d \to \mathcal{Y} = \{1, \dots, K\}$ as downstream tasks, and focus on the covariate shift setting, i.e., the target distribution $D_{tar}$ differs from $D$ but $P(Y \mid X)$ is fixed.
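To make the two-term structure of equation 1 concrete, here is a minimal NumPy sketch; the Gaussian-jitter transformation set, the toy projection encoder, and $\lambda = 1$ are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Toy encoder: project onto the first two coordinates and L2-normalize
    (the paper assumes ||f(x)|| = 1)."""
    z = x[..., :2]
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def augment(x, rng):
    """One draw A ~ pi: additive Gaussian jitter (an assumed transformation set)."""
    return x + 0.1 * rng.standard_normal(x.shape)

def contrastive_loss(X, lam=1.0, rng=rng):
    """L_con = L_align + lam * L_reg, mirroring equation (1)."""
    z1, z2 = f(augment(X, rng)), f(augment(X, rng))      # two views per sample
    l_align = np.mean(np.sum((z1 - z2) ** 2, axis=1))    # E ||f(A1 X) - f(A2 X)||^2
    # InfoNCE-style regularizer: log(e^{pos sim} + e^{neg sim}), with negatives
    # drawn as a random permutation of the batch (one negative per sample here).
    neg = f(augment(X[rng.permutation(len(X))], rng))
    l_reg = np.mean(np.log(np.exp(np.sum(z1 * z2, axis=1))
                           + np.exp(np.sum(z1 * neg, axis=1))))
    return l_align + lam * l_reg

X = rng.standard_normal((256, 8))
loss = contrastive_loss(X)
```

Minimizing the alignment term alone would collapse all features to a point; the regularizer is what keeps the embedding spread out.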
We use the best linear classifier on top of $f$ for downstream classification, and evaluate the performance of $f$ on $D_{tar}$ by its risk (also used by Shi et al. (2022)):

$$R(f; D_{tar}) := \min_{h \in \mathbb{R}^{K \times m}} \mathbb{E}_{X \sim D_{tar}}\, \ell(h \circ f(X), Y), \qquad (2)$$

where $\ell$ is the loss function. We focus on the case where $\ell$ is the square loss. With a slight abuse of notation, for a classifier $h \circ f: \mathbb{R}^m \to \mathbb{R}^K$, we also use $R(h \circ f; D_{tar})$ to denote its risk. Our goal is to study the transferability of the representation learned by contrastive learning on different downstream distributions. For simplicity, we assume throughout this paper that $\|f(x)\| = 1$ for every $x \in \mathcal{X}$ and that $f$ is $L$-Lipschitz continuous.

At the end of this section, we remark on the difference between transformations and augmentations. Transformations are deterministic mappings from $\mathcal{X}$ to $\mathcal{X}$, and augmentations are random applications of them. We give examples to demonstrate that many augmentation methods used in practice indeed satisfy this model.

Example 2.1. The augmentations used in SimCLR (Chen et al., 2020) are compositions of random transformations, such as RandomCrop, Horizontal Flip and Color distortion. Each $A$ denotes a specific composition here. Users determine the probability that each transformation is applied. If applied, the parameters of the random transformation, such as the crop range, are also selected randomly. Once the parameters of each applied transformation are set, the composed transformation is deterministic. Therefore, we can view these augmentation processes as randomly selecting deterministic transformations, following the distribution $\pi$ determined by the user.

Example 2.2. Some recent works (Jahanian et al., 2021) propose to augment data by transformations in the latent space. Suppose we have a pre-trained generative model with an encoder $T: \mathcal{X} \to \mathcal{Z}$ and a generator $G: \mathcal{Z} \to \mathcal{X}$, where $\mathcal{Z}$ is a latent space. Since $\mathcal{Z}$ is usually simple, one can augment data in $\mathcal{X}$ by randomly shifting $T(X)$.
In this setting, the augmentation can be parameterized by $A_\theta(X) = G(T(X) + \theta)$ for $\theta \in \mathcal{Z}$. The distribution $\pi$ of $\theta$ is usually chosen to be a normal distribution.

3 TRANSFERABILITY ANALYSIS: THE CRUCIAL ROLE OF DATA AUGMENTATION

In this section, we theoretically demonstrate that data augmentation plays a crucial role in the transferability of contrastive learning. The first step is to establish the connection between the contrastive loss $L_{con}(f; D, \pi)$ and the downstream risk $R(f; D_{tar})$. For this purpose, we adopt the $(\sigma, \delta)$-augmentation notion proposed by Huang et al. (2021), which was used there to study the KNN error of contrastive learning.

Definition 3.1 ($(\sigma, \delta)$-transformation). Let $C_k \subseteq \mathcal{X}$ be the set of all points in class $k$. A data transformation set $\mathcal{A}$ is called a $(\sigma, \delta)$-transformation on $D$ for some $\sigma \in (0, 1]$ and $\delta > 0$ if, for every $k \in [K]$, there exists $C_k^0 \subseteq C_k$ such that $P_{X \sim D}(X \in C_k^0) \geq \sigma \cdot P_{X \sim D}(X \in C_k)$ and $\sup_{x_1, x_2 \in C_k^0} d_{\mathcal{A}}(x_1, x_2) \leq \delta$, where $d_{\mathcal{A}}(x_1, x_2) := \inf_{A_1, A_2 \in \mathcal{A}} d(A_1(x_1), A_2(x_2))$ for some distance $d(\cdot, \cdot)$.

¹In this paper, the notation $\|\cdot\|$ stands for the L2-norm for vectors and the Frobenius norm for matrices.

This definition measures the concentration of augmented data. A transformation set with smaller $\delta$ and larger $\sigma$ clusters the original data more, i.e., samples from the same class are closer after augmentation. Therefore, one can expect the learned representation $f$ to have better clustering performance, which was measured by the KNN error in (Huang et al., 2021). We remark that most contrastive learning theories (Arora et al., 2019; Ash et al., 2021; Bao et al., 2022) assume that, given class $k$, positive pairs are obtained by i.i.d. sampling from $P_{X \sim D}(\cdot \mid X \in C_k)$. Compared with this data generation model, our $(\sigma, \delta)$ notion is more practical. By taking into account the linear classifier on top of $f$, we extend the KNN result in (Huang et al., 2021) to the downstream risk, and obtain the following lemma.

Lemma 3.1.
For any distribution $D$ and encoder $f$, define the class centers $\mu_k(f; D) := \mathbb{E}_{X \sim D}[f(X) \mid X \in C_k]$ for $k \in [K]$. Suppose that $\mathcal{A}$ is a $(\sigma, \delta)$-transformation. For any representation $f: \mathcal{X} \to \mathbb{R}^{d_1}$ and linear layer $h \in \mathbb{R}^{K \times d_1}$, we have

$$R(h \circ f; D_\pi) \leq c\,\|h\| K \sigma \left( L_{align}(f; D, \pi) \right)^{\frac14} + \|h\|\, \tau(\sigma, \delta) + \sum_{k=1}^{K} D_\pi(C_k)\, \big\| e_k - h\, \mu_k(f; D_\pi) \big\|, \qquad (3)$$

where $c$ is an absolute constant and $\tau(\sigma, \delta)$ is a constant that depends only on $(\sigma, \delta)$. The detailed form of $\tau$ is given in the appendix.

The first term in equation 3 is the alignment of $f$, which is optimized during contrastive pre-training on $D$. The second term is determined solely by the $(\sigma, \delta)$ quantity of the data augmentation; larger $\sigma$ and smaller $\delta$ induce smaller $\tau(\sigma, \delta)$. These two terms together quantify how well the representation $f$ learned by contrastive learning clusters the data. The third term concerns the linear layer $h$ on top of $f$ and is minimized during downstream training, where each class center is mapped to the corresponding ground-truth label. If the regularization term $L_{reg}$ is appropriately chosen, the class centers can be distinguished from each other, and the third term can be reduced to 0 by $h$. Taking the linear classifier into account, we obtain the following main result.

Theorem 3.2. Suppose that $\mathcal{A}$ is a $(\sigma, \delta)$-transformation on $D$, and $\pi$ is the augmentation distribution on $\mathcal{A}$. For an encoder $f$, define

$$\gamma_{reg}(f; D, \pi) = 1 - c_1 \left( L_{align}(f; D, \pi) \right)^{1/4} - \tau(\sigma, \delta) - c_2 \max_{k \neq k'} \left| \mu_k(f; D_\pi)^\top \mu_{k'}(f; D_\pi) \right|$$

for some positive constants $c_1$ and $c_2$. Then for any $f$ such that $\{\mu_k(f; D_\pi)\}_{k=1}^{K}$ are linearly independent and $\gamma_{reg}(f; D, \pi) > 0$, we have

$$R(f; D_\pi) \leq c \left( \gamma_{reg}^{-1}(f; D, \pi) \left( L_{align}(f; D, \pi) \right)^{1/4} + \gamma_{reg}^{-1}(f; D, \pi)\, \tau(\sigma, \delta) \right), \qquad (4)$$

where $c$ is an absolute constant.

Remark 1. (a) The linear independence condition on $\{\mu_k(f; D_\pi)\}_{k=1}^{K}$ is weak, since its complement is a null set in Euclidean space under the Lebesgue measure, and has probability 0 under any distribution that is absolutely continuous w.r.t. the Lebesgue measure.
(b) The quantity $\gamma_{reg}^{-1}$ is essentially a bound on the L2 norm of the linear classifier $h$ that reduces the third term $\sum_{k=1}^{K} D_\pi(C_k) \| e_k - h\, \mu_k(f; D_\pi) \|$ in equation 3 to 0, i.e., that maps the class centers to their corresponding ground-truth labels. The more orthogonal the class centers, the larger $\gamma_{reg}$ and the better the bound. The theory developed in (Huang et al., 2021) shows that popular algorithms such as SimCLR and Barlow Twins can indeed push the class centers apart, so $\gamma_{reg}$ can be ensured to have a constant gap away from 0. Therefore, for simplicity we directly impose this requirement on $f$.

This theorem shows that contrastive learning on distribution $D$ with augmentation $\pi$ essentially optimizes the supervised risk on the augmented distribution $D_\pi$, rather than on the original distribution $D$. Therefore, the augmentation $\pi$ is crucial to the transferability of contrastive learning. If the downstream distribution $D_{tar}$ is close to $D_\pi$, the encoder obtained by contrastive learning performs well on it. The more diverse $\pi$ is, the more accurately we can expect $D_\pi$ to approximate downstream distributions, and the better the transferability of contrastive learning.

Besides, if we have some prior knowledge of the downstream distribution, we can specify an appropriate augmentation such that $D_\pi$ and $D_{tar}$ are close, in order to improve downstream performance. For example, consider the case where the original data $D$ are gray digits and the downstream distribution $D_{tar}$ is obtained by dyeing them red. If we use random coloring as data augmentation and conduct contrastive learning on $D$, the learned feature will generalize well to $D_{tar}$, even though $D_{tar}$ is far away from $D$.

4 TRANSFERABILITY LIMITATION: A DOMAIN-INVARIANCE VIEW

Our theoretical results in the last section suggest that data augmentation is crucial to the transferability of contrastive learning.
However, there exists a fundamental limitation of contrastive learning: the learned representation is not domain-invariant, a property that is important for a model to generalize to more downstream datasets (Peters et al., 2016; Arjovsky et al., 2019). Roughly speaking, domain-invariance means that the learned model extracts intrinsic features that are invariant across different domains. In supervised domain generalization, a common setting is to have a training domain set consisting of multiple source domains. Correspondingly, in contrastive learning we can consider $D_A$ as a transformation-induced domain for each data transformation $A$, and consider $\{D_A\}_{A \in \mathcal{A}}$ as the training domain set. Since the goal of contrastive learning is also to align different $D_A$, we would naturally expect the features it learns to be domain-invariant. However, since the alignment loss $L_{align}$ in contrastive learning is obtained by averaging over different transformations (i.e., taking an expectation), the learned features are not invariant even across $\{D_A\}_{A \in \mathcal{A}}$. In other words, encoders learned through contrastive learning can behave extremely differently on different $D_A$. The following toy model gives an illustrative example.

Proposition 4.1. Consider a two-dimensional classification problem with data $(X_1, X_2) \sim N(0, I_2)$. The label $Y$ satisfies $Y = \mathbb{1}(X_1 \geq 0)$, and the data augmentation multiplies $X_2$ by standard normal noise, i.e., $A_\theta(X) = (X_1, \theta X_2)$ with $\theta \sim N(0, 1)$. The corresponding transformation-induced domain set is $\mathcal{P} = \{ D_c : D_c = (X_1, c X_2) \text{ for } c \in \mathbb{R} \}$. We consider the 0-1 loss in equation 2. Then for every $\varepsilon > 0$, there exist a representation $f$ and two domains $D_c$ and $D_{c'}$ such that $L_{align}(f; D, \pi) < \varepsilon$ but $|R(f; D_c) - R(f; D_{c'})| \geq 1/2$.

This example shows that there exists a representation with arbitrarily small contrastive loss but with very different performance over different transformation-induced domains.
The intuition behind this example is that, in order to make $L_{align}$ small, it suffices to align different domains (induced by different transformations) in an average rather than uniform sense. Therefore, the representation may still suffer a large alignment loss on some rarely selected augmented domains.

5 ARCL: AUGMENTATION-ROBUST CONTRASTIVE LEARNING

To further improve the transferability of contrastive learning, we consider how to learn a representation that is domain-invariant across $\{D_A\}_{A \in \mathcal{A}}$. First of all, we need a formal definition that mathematically characterizes domain-invariance, so that we can design algorithms based on it. Here, we borrow the notion proposed by Arjovsky et al. (2019).

Definition 5.1. We say that a representation $f$ elicits an invariant predictor $h_0 \circ f$ across a domain set $\mathcal{P}$ if there is a classifier $h_0$ simultaneously optimal for all domains in $\mathcal{P}$, that is, $h_0 \in \arg\min_h R(h \circ f; D)$ for all $D \in \mathcal{P}$.

This definition is equivalent to learning features whose correlations with the target variable are stable, and it has been shown to improve distribution shift transferability both practically and theoretically in supervised learning. Interestingly, by setting $\mathcal{P}$ to be $\{D_A\}_{A \in \mathcal{A}}$, this definition is well suited to contrastive learning. Specifically, define the augmentation-robust loss as

$$L_{AR}(f; D) := \mathbb{E}_{X \sim D} \sup_{A, A' \in \mathcal{A}} \| f(A(X)) - f(A'(X)) \|^2,$$

which is a uniform version of the original alignment loss $L_{align}$. Then we have the following results.

Theorem 5.1. For any two transformations $A$ and $A'$, linear predictor $h$ and representation $f$, we have

$$\sup_{A, A' \in \mathcal{A}} \left| R(h \circ f; D_A) - R(h \circ f; D_{A'}) \right| \leq c\, \|h\| \sqrt{L_{AR}(f; D)}. \qquad (5)$$

Moreover, fix $f$ and let $h_A \in \arg\min_h R(h \circ f; D_A)$. Then we have

$$\left| R(h_A \circ f; D_{A'}) - R(h_{A'} \circ f; D_{A'}) \right| \leq 2c \left( \|h_A\| + \|h_{A'}\| \right) \sqrt{L_{AR}(f; D)}. \qquad (6)$$

Compared with $L_{align}$, $L_{AR}$ replaces the expectation over $A$ with a supremum; hence $L_{align}(f; D, \pi) \leq L_{AR}(f; D)$ for all $f$ and $\pi$.
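The gap between averaged and uniform alignment is easy to observe numerically. A minimal sketch, assuming a toy normalized encoder and Gaussian-jitter transformations, with the supremum over $\mathcal{A}$ approximated by a max over sampled views:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Toy normalized encoder (stand-in for a learned representation)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def views(x, n, rng):
    """n augmented views of x under an assumed Gaussian-jitter transformation set."""
    return f(x[None, :] + 0.2 * rng.standard_normal((n, x.shape[0])))

X = rng.standard_normal((128, 10))
align, ar = [], []
for x in X:
    v = views(x, 16, rng)
    # pairwise ||f(A x) - f(A' x)||^2 over the sampled transformation pairs
    d2 = np.sum((v[:, None, :] - v[None, :, :]) ** 2, axis=-1)
    off = d2[~np.eye(len(v), dtype=bool)]
    align.append(off.mean())   # L_align: expectation over transformation pairs
    ar.append(d2.max())        # L_AR: supremum, here a max over sampled pairs
l_align, l_ar = np.mean(align), np.mean(ar)
```

By construction the max dominates the mean for every sample, so `l_align <= l_ar` always holds; a small average can coexist with a much larger worst case, which is exactly the failure mode Proposition 4.1 formalizes.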
If $L_{AR}(f)$ is small, the above theorem shows that $R(h \circ f; D_A)$ varies only slightly with $A$, so the optimal $h$ for $D_A$ is close to that for $D_{A'}$. In other words, a representation with smaller $L_{AR}$ tends to elicit the same optimal linear predictor across different domains, a property that does not hold for the original alignment loss. A small $L_{AR}$ not only requires $f$ to pull positive pairs closer, but also forces this alignment to be uniform across all transformations in $\mathcal{A}$. Therefore, we propose to train the encoder using the augmentation-robust loss for better transfer performance under distribution shift, i.e., to replace $L_{align}$ by $L_{AR}$ in the contrastive loss equation 1.

In practical implementations, $\sup_{A_1, A_2 \in \mathcal{A}} \| f(A_1(X)) - f(A_2(X)) \|_2^2$ cannot be computed exactly unless $|\mathcal{A}|$ is finite and small. Therefore, we use the following approach to approximate it. For each sample $X$, we first randomly select $m$ augmentations to obtain a set of augmented samples denoted by $\hat{\mathcal{A}}_m(X)$. Then, rather than taking the average as in standard contrastive learning, we use

$$\sup_{A_1, A_2 \in \hat{\mathcal{A}}_m(X)} \| f(A_1(X)) - f(A_2(X)) \|_2^2$$

as the alignment term, which approximates $\sup_{A_1, A_2 \in \mathcal{A}} \| f(A_1(X)) - f(A_2(X)) \|_2^2$. The integer $m$ is called the number of views, and it controls the approximation accuracy. Then, given $n$ unlabeled samples $(X_1, \dots, X_n)$, the empirical augmentation-robust loss is

$$\hat{L}_{AR}(f) := \frac{1}{n} \sum_{i=1}^{n} \sup_{A_1, A_2 \in \hat{\mathcal{A}}_m(X_i)} \| f(A_1(X_i)) - f(A_2(X_i)) \|_2^2. \qquad (7)$$

We remark that our proposed method can be applied to any contrastive learning method whose objective can be formulated as equation 1, such as SimCLR, MoCo, Barlow Twins (Zbontar et al., 2021) and so on. We give the detailed form of SimCLR + ArCL as an example in Algorithm 1, and provide MoCo + ArCL in the Appendix. The only difference between Algorithm 1 and SimCLR is the construction of positive pairs. In each epoch $t$, for every sample $X$, we first randomly select $m$ transformations, denoted by $\hat{\mathcal{A}}(X) = \{A_1, \dots, A_m\}$.
Then, based on the current encoder $f$ and projector $g$, we select the two transformations with the worst alignment (minimal inner product) among all pairs in $\hat{\mathcal{A}}(X)$, and use them to construct the positive pair of $X$ that is actually used. The construction of negative pairs and the update rule are the same as in SimCLR.

5.1 THEORETICAL JUSTIFICATION

We now give theoretical guarantees for our proposed approximation, i.e., we bound the gap between $\hat{L}_{AR}(f)$ and $L_{AR}(f)$. For each representation $f \in \mathcal{F}$, where $\mathcal{F}$ is some hypothesis space, define $z_f(x) := \sup_{A_1, A_2 \in \mathcal{A}} \| f(A_1 x) - f(A_2 x) \|^2$ and $\mathcal{Z}_{\mathcal{F}} := \{ z_f : f \in \mathcal{F} \}$. Then we use the Rademacher complexity of $\mathcal{Z}_{\mathcal{F}}$ to define the alignment complexity of $\mathcal{F}$ as

$$\tilde{R}_n(\mathcal{F}) := R_n(\mathcal{Z}_{\mathcal{F}}) = \mathbb{E} \sup_{z \in \mathcal{Z}_{\mathcal{F}}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i z(X_i).$$

This quantity characterizes the complexity of $\mathcal{F}$ in terms of its alignment. If most functions in $\mathcal{F}$ have good alignment under $\mathcal{A}$, then $\tilde{R}_n(\mathcal{F})$ is small even though $R_n(\mathcal{F})$ could be large. Using this definition, we have the following result.

Algorithm 1: SimCLR + ArCL
Input: batch size $N$, temperature $\tau$, augmentation distribution $\pi$, number of views $m$, number of epochs $T$, encoder $f$, projector $g$.
1. for $t = 1, \dots, T$ do
2.   sample a minibatch $\{X_i\}_{i=1}^{N}$;
3.   for $i = 1, \dots, N$ do
4.     draw $m$ augmentations $\hat{\mathcal{A}} = \{A_1, \dots, A_m\} \sim \pi$;
5.     $z_{i,j} = g(f(A_j(X_i)))$ for $j \in [m]$;
6.     // select the worst positive pair
7.     $s_i^+ = \min_{j,k \in [m]} z_{i,j}^\top z_{i,k} / (\|z_{i,j}\| \|z_{i,k}\|)$;
8.     // compute negative similarities
9.     for $j = 1, \dots, N$ do
10.      $s_{i,j}^- = z_{i,1}^\top z_{j,1} / (\|z_{i,1}\| \|z_{j,1}\|)$;
11.      $s_{i,j+N}^- = z_{i,1}^\top z_{j,2} / (\|z_{i,1}\| \|z_{j,2}\|)$;
12.  compute $L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(s_i^+ / \tau)}{\sum_{j=1, j \neq i}^{2N} \exp(s_{i,j}^- / \tau)}$;
13.  update $f$ and $g$ to minimize $L$;
14. return $f$

Theorem 5.2. Suppose that $\Theta = B_{d_A}(1)$, i.e., the $d_A$-dimensional unit ball, and that every $A \in \mathcal{A}$ is $L_A$-Lipschitz continuous. Let $\pi$ be a distribution on $\Theta$ with density $p_\pi$ such that $\inf_{\theta \in \Theta} p_\pi(\theta) \geq c_\pi$. Then with probability at least $1 - 2\varepsilon$, for every $f \in \mathcal{F}$,

$$L_{AR}(f) \leq \hat{L}_{AR}(f) + 2 \tilde{R}_n(\mathcal{F}) + O\!\left( \sqrt{\frac{\log(1/\varepsilon)}{n}} \right) + O\!\left( \left( \frac{1}{c_\pi m} \right)^{1/d_A} \right),$$

where $n$ is the training sample size and $m$ is the number of views.
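A minimal NumPy sketch of one loss computation in the spirit of Algorithm 1; the loss form and worst-pair selection follow the algorithm, while the toy normalized encoder/projector and Gaussian-jitter augmentations are illustrative stand-ins for the real image pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def gf(x):
    """Stand-in for the projector-composed encoder g(f(.)), L2-normalized."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def simclr_arcl_loss(X, m=4, tau=0.5, sigma=0.1, rng=rng):
    """One ArCL step (cf. Algorithm 1): per sample, draw m views and use the pair
    with MINIMAL cosine similarity as the positive; negatives come from the
    first view of the other samples in the batch."""
    N, d = X.shape
    Z = gf(X[:, None, :] + sigma * rng.standard_normal((N, m, d)))   # (N, m, dim)
    sims = Z @ Z.transpose(0, 2, 1)                                  # (N, m, m)
    s_pos = np.array([s[~np.eye(m, dtype=bool)].min() for s in sims])  # worst pair
    z = Z[:, 0]                                  # one view per sample for negatives
    s_neg = z @ z.T                              # (N, N) cross-sample similarities
    logits = np.exp(s_neg / tau)
    np.fill_diagonal(logits, 0.0)                # exclude the sample itself
    return float(np.mean(-s_pos / tau + np.log(logits.sum(axis=1))))

X = rng.standard_normal((32, 16))
loss = simclr_arcl_loss(X)
```

Replacing the `min` over view pairs with a mean recovers an averaged multi-view alignment, which is the AAL baseline compared against in Section 6.3.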
The bound in Theorem 5.2 consists of two error terms, of order $O(n^{-1/2})$ and $O(m^{-1/d_A})$. The first is due to the finite sample size, and the second is caused by approximating the supremum with a finite maximization. The factor $1/c_\pi$ acts like the volume of the transformation set and is of constant order. As $m$ increases, the second term decreases and the bound becomes tighter. Note that we need to assume that the augmentation distribution $\pi$ behaves uniformly, i.e., $\inf_{\theta \in \Theta} p_\pi(\theta) \geq c$ for some positive constant $c$; otherwise we cannot ensure that the finite maximization is an acceptable approximation.

6 EXPERIMENTS

We conduct experiments in several distribution shift settings to evaluate our proposed method. To show the adaptability of our approach, we apply ArCL to SimCLR (Chen et al., 2020) and MoCo V2 (He et al., 2020), respectively. Following the setting of SimCLR and MoCo V2, we add a projection head after the backbone in the pre-training stage and remove it in the downstream task. For linear evaluation, we follow the protocol in which the encoder is fixed and a linear classifier on top of the encoder is trained on the downstream dataset. For full fine-tuning, we also add a linear classifier on top of the encoder, but we do not fix the encoder.

6.1 EXPERIMENTS ON MODIFIED CIFAR10 AND CIFAR100

Setup. Our representations are trained on CIFAR-10 (Krizhevsky, 2009) using SimCLR modified with our proposed ArCL. Different batch sizes (256, 512) and numbers of views (m = 4, 6, 8) are considered. We use ResNet-18 as the encoder and a 2-layer MLP as the projection head. We train the representation for 500 epochs. The temperature is 0.5. Warm-up is not used. We conduct linear evaluation for 100 epochs on different downstream datasets to test the OOD performance of the learned representations.

Table 1: 5 different augmentations.
Each of Aug 1 through Aug 5 is a composition of a different subset of {Grayscale, Random Crop, Horizontal Flip, Color Jitter}.

Table 2: Linear evaluation results (%) of pretrained CIFAR10 models on CIFAR10, CIFAR100 and their modified versions.

CIFAR10 downstream:

| Method | Batch Size | Aug 1 | Aug 2 | Aug 3 | Aug 4 | Aug 5 | Original |
|---|---|---|---|---|---|---|---|
| SimCLR | 256 | 86.36 | 83.21 | 86.93 | 86.42 | 86.13 | 86.76 |
| SimCLR + ArCL (views=4) | 256 | 88.68 | 86.77 | 89.01 | 88.70 | 88.31 | 88.95 |
| SimCLR + ArCL (views=6) | 256 | 88.95 | 87.18 | 89.54 | 88.92 | 88.61 | 89.11 |
| SimCLR | 512 | 88.62 | 86.27 | 88.96 | 88.56 | 88.37 | 88.81 |
| SimCLR + ArCL (views=4) | 512 | 89.97 | 88.06 | 90.48 | 89.91 | 89.59 | 90.20 |
| SimCLR + ArCL (views=6) | 512 | 90.24 | 89.54 | 90.69 | 90.43 | 90.07 | 90.69 |
| SimCLR + ArCL (views=8) | 512 | 90.44 | 88.96 | 90.98 | 90.63 | 90.31 | 90.84 |

CIFAR100 downstream:

| Method | Batch Size | Aug 1 | Aug 2 | Aug 3 | Aug 4 | Aug 5 | Original |
|---|---|---|---|---|---|---|---|
| SimCLR | 256 | 51.65 | 47.55 | 53.17 | 52.05 | 51.36 | 52.75 |
| SimCLR + ArCL (views=4) | 256 | 53.76 | 49.80 | 55.68 | 54.19 | 52.96 | 54.83 |
| SimCLR + ArCL (views=6) | 256 | 54.13 | 50.74 | 55.74 | 54.75 | 53.46 | 55.29 |
| SimCLR | 512 | 52.28 | 48.09 | 53.45 | 52.58 | 51.53 | 53.12 |
| SimCLR + ArCL (views=4) | 512 | 53.40 | 50.16 | 54.92 | 53.77 | 52.61 | 54.20 |
| SimCLR + ArCL (views=6) | 512 | 54.00 | 50.57 | 56.24 | 55.04 | 53.77 | 55.60 |
| SimCLR + ArCL (views=8) | 512 | 54.59 | 50.85 | 55.74 | 54.62 | 53.21 | 55.96 |

Datasets. We use CIFAR10 to train the representation. The data augmentation follows the setting in SimCLR, i.e., a composition of Random Crop, Random Horizontal Flip, Color Jitter and Random Grayscale. To evaluate the transferability of the learned representation, we use different downstream datasets to train the linear classifier and test its accuracy. The downstream datasets are generated by modifying the original CIFAR10 and CIFAR100 datasets, i.e., by applying different data augmentations to them. We choose 5 different augmentations, as shown in Table 1. The results are shown in Table 2.

Results. In all settings, our proposed approach improves the transferability of SimCLR significantly. As the number of views m increases, the accuracy also increases, which matches our theoretical results.
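The linear-evaluation protocol behind these numbers fits a linear head on frozen features. A simplified sketch using the square loss of equation 2, with a closed-form least-squares head on synthetic stand-in features (the actual experiments train a classifier for 100 epochs on real encoder outputs):

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_evaluation(Z_train, y_train, Z_test, y_test, n_classes):
    """Linear evaluation with the square loss of equation (2): fit the
    closed-form least-squares head h on frozen features Z, predict by argmax."""
    Y = np.eye(n_classes)[y_train]                    # one-hot targets
    h, *_ = np.linalg.lstsq(Z_train, Y, rcond=None)   # min_h ||Z h - Y||^2
    pred = (Z_test @ h).argmax(axis=1)
    return (pred == y_test).mean()

# Synthetic "frozen features": two well-separated Gaussian clusters standing in
# for the encoder outputs of two downstream classes.
Z0 = rng.standard_normal((100, 8)) + np.r_[3.0, np.zeros(7)]
Z1 = rng.standard_normal((100, 8)) - np.r_[3.0, np.zeros(7)]
Z = np.vstack([Z0, Z1])
y = np.array([0] * 100 + [1] * 100)
acc = linear_evaluation(Z, y, Z, y, n_classes=2)
```

Because the encoder is frozen, this evaluation measures only how linearly separable the downstream classes are in the learned feature space.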
Besides, the performance improvement brought by increasing m tends to saturate, which suggest that m does not need to be very large. 6.2 EXPERIMENTS ON IMAGENET TRANSFERRING TO SMALL DATASETS Setup. In this part, we will first train an encoder on Image Net (Deng et al., 2009) using Mo Co v2 modified with our proposed Ar CL. We select the number of Ar CL views from {2, 3, 4}. Notice that since there exists asymmetric network architecture in Mo Co, there is difference between Mo Co and Mo Co + Ar CL(views=2). We use Res Net50 as the backbone and follow all the other architecture settings of Mo Co. The memory bank size K is set to 65536 and the batch size is set to 256. For the training epochs, we propose three schemes: 1) 200 epochs training from scratch, 2) 50 epochs training from the model which is pretrained by Mo Co for 800 epochs, 3) 100 epochs training from the model which is pretrained by Mo Co for 800 epochs. For each setting, we search the best start learning rate in {0.03, 0.0075, 0.0015} and the best temperature in {0.20, 0.15, 0.12, 0.10, 0.05}. We use the SGD optimizer and apply the cosine learning rate decay. After the model has been pre-trained by contrastive learning, we test its transferability with linear evaluation and full fine-tuning on various downstream small datasets, following the settings in (Ericsson et al., 2021). For the target datasets, we adopt FGVC Aircraft (Maji et al., 2013), Caltech-101 (Fei-Fei et al., 2004), Stanford Cars (Krause et al., 2013), CIFAR10, CIFAR100, DTD (Cimpoi et al., 2014), Oxford 102 Flowers (Nilsback & Zisserman, 2008), Food-101 (Bossard et al., 2014) and Oxford-IIIT Pets (Parkhi et al., 2012). For linear evaluation, multinomial logistic regression is Published as a conference paper at ICLR 2023 Table 3: Results comparison on linear evaluation (up) and finetuning (down) of pretrained Image Net models on popular recognition datasets. 
Additionally, we provide a supervised baseline for comparison: a standard pretrained ResNet-50 from the PyTorch library. The supervised baseline results are taken from Ericsson et al. (2021). Results style: best in the same epoch setting. Avg is the average of the results on all nine small datasets.

Linear evaluation:

| Method | Epochs | ImageNet | Aircraft | Caltech101 | Cars | CIFAR10 | CIFAR100 | DTD | Flowers | Food | Pets | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MoCo | 800 | 70.68 | 41.79 | 87.92 | 39.31 | 92.28 | 74.90 | 73.88 | 90.07 | 68.95 | 83.30 | 72.49 |
| MoCo | 800+50 | 70.64 | 41.00 | 87.63 | 39.01 | 92.27 | 75.14 | 74.31 | 88.31 | 68.57 | 83.69 | 72.21 |
| MoCo + ArCL (views=2) | 800+50 | 69.70 | 44.29 | 89.79 | 42.15 | 93.07 | 76.70 | 74.20 | 90.40 | 70.94 | 83.68 | 73.91 |
| MoCo + ArCL (views=3) | 800+50 | 69.80 | 44.57 | 89.48 | 42.11 | 93.29 | 77.33 | 74.63 | 91.13 | 71.16 | 84.23 | 74.21 |
| MoCo + ArCL (views=4) | 800+50 | 69.80 | 44.62 | 89.66 | 42.88 | 93.22 | 76.83 | 75.00 | 91.59 | 71.35 | 83.99 | 74.35 |
| MoCo | 800+100 | 70.64 | 41.65 | 87.64 | 39.31 | 92.12 | 75.03 | 73.94 | 89.53 | 68.31 | 83.55 | 72.34 |
| MoCo + ArCL (views=2) | 800+100 | 70.26 | 41.86 | 89.52 | 40.21 | 92.64 | 75.73 | 74.04 | 88.97 | 70.06 | 84.31 | 73.04 |
| MoCo + ArCL (views=3) | 800+100 | 70.92 | 43.87 | 89.36 | 42.37 | 93.30 | 76.93 | 72.93 | 90.50 | 71.14 | 84.03 | 73.83 |
| MoCo + ArCL (views=4) | 800+100 | 69.42 | 45.55 | 89.61 | 41.91 | 93.55 | 77.05 | 74.04 | 91.17 | 70.93 | 85.48 | 74.37 |
| MoCo | 200 | 67.72 | 40.02 | 86.59 | 37.41 | 90.90 | 72.43 | 73.88 | 87.97 | 66.97 | 80.00 | 70.69 |
| MoCo + ArCL (views=2) | 200 | 68.04 | 42.34 | 87.92 | 36.45 | 92.29 | 71.71 | 74.68 | 89.00 | 67.87 | 81.42 | 71.85 |
| MoCo + ArCL (views=3) | 200 | 68.74 | 43.21 | 88.26 | 38.27 | 92.49 | 75.02 | 74.68 | 89.61 | 68.31 | 81.64 | 72.39 |
| MoCo + ArCL (views=4) | 200 | 68.92 | 41.65 | 88.42 | 38.77 | 92.70 | 75.73 | 75.43 | 88.95 | 68.63 | 81.60 | 72.43 |
| Supervised* | - | 77.20 | 43.59 | 90.18 | 44.92 | 91.42 | 73.90 | 72.23 | 89.93 | 69.49 | 91.45 | 74.12 |

Fine-tuning:

| Method | Epochs | Aircraft | Caltech101 | Cars | CIFAR10 | CIFAR100 | DTD | Flowers | Food | Pets | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MoCo | 800 | 83.56 | 82.54 | 85.09 | 95.89 | 71.81 | 69.95 | 95.26 | 76.81 | 88.83 | 83.30 |
| MoCo | 800+50 | 83.15 | 84.50 | 85.90 | 96.13 | 72.58 | 70.16 | 94.44 | 79.34 | 86.12 | 83.59 |
| MoCo + ArCL (views=2) | 800+50 | 86.05 | 87.38 | 87.28 | 96.33 | 79.39 | 72.18 | 95.89 | 81.36 | 89.03 | 86.10 |
| MoCo + ArCL (views=3) | 800+50 | 84.03 | 87.64 | 86.34 | 96.88 | 80.98 | 72.87 | 96.14 | 81.90 | 89.20 | 86.22 |
| MoCo + ArCL (views=4) | 800+50 | 84.19 | 88.42 | 86.67 | 96.68 | 81.17 | 73.09 | 95.90 | 81.70 | 89.52 | 86.37 |
| MoCo | 800+100 | 83.18 | 84.50 | 84.27 | 96.01 | 72.14 | 70.27 | 95.53 | 78.23 | 88.73 | 83.65 |
| MoCo + ArCL (views=2) | 800+100 | 84.45 | 86.84 | 87.20 | 96.40 | 78.40 | 71.91 | 95.93 | 80.54 | 88.56 | 85.58 |
| MoCo + ArCL (views=3) | 800+100 | 85.94 | 86.85 | 87.34 | 96.36 | 79.75 | 71.44 | 96.00 | 81.48 | 88.26 | 85.94 |
| MoCo + ArCL (views=4) | 800+100 | 85.65 | 88.50 | 86.39 | 96.91 | 81.29 | 73.35 | 96.17 | 81.82 | 89.30 | 86.60 |
| MoCo | 200 | 83.18 | 82.66 | 84.47 | 95.51 | 72.54 | 70.43 | 94.99 | 77.39 | 86.12 | 83.03 |
| MoCo + ArCL (views=2) | 200 | 81.09 | 83.93 | 86.54 | 95.88 | 76.18 | 70.69 | 94.44 | 76.78 | 86.98 | 83.61 |
| MoCo + ArCL (views=3) | 200 | 84.79 | 85.61 | 85.39 | 96.56 | 78.81 | 70.59 | 95.84 | 80.71 | 87.91 | 85.13 |
| MoCo + ArCL (views=4) | 200 | 84.88 | 86.19 | 85.90 | 96.35 | 78.62 | 70.69 | 95.77 | 80.46 | 88.00 | 85.21 |
| Supervised* | - | 83.50 | 91.01 | 82.61 | 96.39 | 82.91 | 73.30 | 95.50 | 84.60 | 92.42 | 86.92 |

For linear evaluation, a linear classifier is fit on the extracted features. For full fine-tuning, the full models are trained for 5,000 steps using SGD with Nesterov momentum. We report the average results over the nine downstream datasets.

Results. We compare our methods with the MoCo baseline. ArCL brings an average improvement of about 2% in the linear evaluation setting and about 3% in the fine-tuning setting. In particular, we obtain a large improvement on CIFAR100 fine-tuning, where our proposed methods outperform the baseline by up to 10%. It is also worth mentioning that our methods match the supervised results on nearly all the datasets. These facts all indicate the effectiveness of our proposed methods. Besides, the results show that, under the same epoch setting, the average performance grows as the number of views increases, which fits our theoretical analysis. One can also notice that there is only a small improvement in the 800+100 epoch setting compared to the 800+50 epoch setting. This implies that our method converges rather quickly and is effective within a few epochs of training.
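The linear-evaluation protocol used above (a classifier trained only on frozen features) can be sketched as follows. This is a minimal illustration: the random-projection "encoder" and the ridge-regression probe are stand-ins for the frozen MoCo/ArCL backbone and the logistic-regression head of the actual evaluation pipeline, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen pretrained encoder: a fixed (untrained) feature map.
W_enc = rng.normal(size=(32, 64))

def encode(x):
    # Frozen features: W_enc is never updated during evaluation.
    return np.maximum(x @ W_enc, 0.0)

# Two synthetic "downstream" classes (stand-ins for a downstream dataset).
x0 = rng.normal(loc=-1.0, size=(200, 32))
x1 = rng.normal(loc=+1.0, size=(200, 32))
X = encode(np.vstack([x0, x1]))
y = np.array([0] * 200 + [1] * 200)

# Linear evaluation: fit only a linear head on the frozen features.
# A ridge-regression probe on one-hot targets keeps the sketch dependency-free.
Y = np.eye(2)[y]
lam = 1e-3
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
acc = (np.argmax(X @ w, axis=1) == y).mean()
print(f"linear-probe accuracy: {acc:.2f}")
```

The point of the protocol is that only the head `w` is fit while the encoder stays frozen, so the resulting accuracy measures the quality of the pretrained representation itself.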
6.3 COMPARISON WITH AVERAGE ALIGNMENT LOSS

Compared to CL methods, ArCL sees more augmentations in total, since for each original sample we construct m augmented versions. To make the comparison between ArCL and CL fairer, for original CL we also construct m views but use the average similarity over all positive pairs among the m views, namely the average alignment loss (AAL), as the learning objective. Results for SimCLR on CIFAR10 and CIFAR100 and MoCo on ImageNet can be seen in Tables 5 and 6 in the appendix; detailed experimental settings are also deferred there. As the experiments show, contrastive learning with AAL performs similarly to the vanilla methods, which is still much worse than ArCL on distribution-shift downstream tasks.

ACKNOWLEDGEMENT

Yisen Wang is partially supported by the National Key R&D Program of China (2022ZD0160304), the National Natural Science Foundation of China (62006153), Open Research Projects of Zhejiang Lab (No. 2022RC0AB05), and Huawei Technologies Inc. We would like to express our sincere gratitude to the reviewers of ICLR 2023 for their insightful and constructive feedback. Their valuable comments have greatly contributed to improving the quality of our work.

REFERENCES

Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019.

Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, and Dipendra Misra. Investigating the role of negatives in contrastive representation learning. arXiv preprint arXiv:2106.09943, 2021.
Han Bao, Yoshihiro Nagano, and Kento Nozawa. On the surrogate gap between contrastive and supervised losses. In International Conference on Machine Learning, pp. 1585-1606. PMLR, 2022.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.

Gilles Blanchard, Gyemin Lee, and Clayton Scott. Generalizing from several related classification tasks to a new unlabeled sample. Proc. of NeurIPS, 2011.

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 - mining discriminative components with random forests. In ECCV, 2014.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In Proc. of ICML, 2020.

Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proc. of CVPR, 2021.

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255, 2009. doi: 10.1109/CVPR.2009.5206848.

John C. Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 2021.

John C. Duchi, Peter W. Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. Mathematics of Operations Research, 2021.

Linus Ericsson, Henry Gouk, and Timothy M. Hospedales. How well do self-supervised models transfer? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5414-5423, June 2021.

Li Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pp. 178-178, 2004. doi: 10.1109/CVPR.2004.383.

Priya Goyal, Mathilde Caron, Benjamin Lefaudeux, Min Xu, Pengchao Wang, Vivek Pai, Mannat Singh, Vitaliy Liptchinsky, Ishan Misra, Armand Joulin, et al. Self-supervised pretraining of visual features in the wild. arXiv preprint arXiv:2103.01988, 2021.

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. Proc. of NeurIPS, 2020.

Jeff Z. HaoChen, Colin Wei, Ananya Kumar, and Tengyu Ma. Beyond separability: Analyzing the linear transferability of contrastive representations to related subpopulations. arXiv preprint arXiv:2204.02683, 2022.

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proc. of CVPR, 2020.

R. Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.

Tianyang Hu, Zhili Liu, Fengwei Zhou, Wenjia Wang, and Weiran Huang. Your contrastive learning is secretly doing stochastic neighbor embedding. arXiv preprint arXiv:2205.14814, 2022.

Weiran Huang, Mingyang Yi, and Xuyang Zhao. Towards the generalization of contrastive self-supervised learning. arXiv preprint arXiv:2111.00743, 2021.

Ali Jahanian, Xavier Puig, Yonglong Tian, and Phillip Isola. Generative models as a data source for multiview representation learning. arXiv preprint arXiv:2106.05258, 2021.
Ziyu Jiang, Tianlong Chen, Ting Chen, and Zhangyang Wang. Robust pre-training by adversarial contrastive learning. Proc. of NeurIPS, 2020.

Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Domain extrapolation via regret minimization. arXiv preprint arXiv:2006.03908, 2020.

Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In Proc. of ICML, 2018.

Minseon Kim, Jihoon Tack, and Sung Ju Hwang. Adversarial self-supervised contrastive learning. Proc. of NeurIPS, 2020.

Jonathan Krause, Jia Deng, Michael Stark, and Li Fei-Fei. Collecting a large-scale dataset of fine-grained cars. 2013.

Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.

David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Dinghuai Zhang, Remi Le Priol, and Aaron Courville. Out-of-distribution generalization via risk extrapolation (REx). In Proc. of ICML, 2021.

Kun Kuang, Ruoxuan Xiong, Peng Cui, Susan Athey, and Bo Li. Stable prediction with model misspecification and agnostic distribution shift. In Proc. of AAAI, 2020.

Chang Liu, Xinwei Sun, Jindong Wang, Haoyue Tang, Tao Li, Tao Qin, Wei Chen, and Tie-Yan Liu. Learning causal semantic representation for out-of-distribution prediction. arXiv preprint arXiv:2011.01681, 2020.

Hong Liu, Jeff Z. HaoChen, Adrien Gaidon, and Tengyu Ma. Self-supervised learning is more robust to dataset imbalance. arXiv preprint arXiv:2110.05025, 2021.

Divyat Mahajan, Shruti Tople, and Amit Sharma. Domain generalization using causal matching. In Proc. of ICML, 2021.

Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft, 2013. URL https://arxiv.org/abs/1306.5151.

Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation. In Proc. of ICML, 2013.

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722-729, 2008. doi: 10.1109/ICVGIP.2008.47.

Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Cats and dogs. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3498-3505, 2012. doi: 10.1109/CVPR.2012.6248092.

Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: Identification and confidence intervals. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 2016.

Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, and Jonas Peters. Invariant models for causal transfer learning. The Journal of Machine Learning Research, 2018.

Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks. In Proc. of ICLR, 2019.

Harshay Shah, Kaustav Tamuly, Aditi Raghunathan, Prateek Jain, and Praneeth Netrapalli. The pitfalls of simplicity bias in neural networks. In NeurIPS.

Kendrick Shen, Robbie Jones, Ananya Kumar, Sang Michael Xie, Jeff Z. HaoChen, Tengyu Ma, and Percy Liang. Connect, not collapse: Explaining contrastive learning for unsupervised domain adaptation. arXiv preprint arXiv:2204.00570, 2022.

Zheyan Shen, Peng Cui, Tong Zhang, and Kun Kuang. Stable learning via sample reweighting. In Proc. of AAAI, 2020.

Zheyan Shen, Jiashuo Liu, Yue He, Xingxuan Zhang, Renzhe Xu, Han Yu, and Peng Cui. Towards out-of-distribution generalization: A survey. arXiv preprint arXiv:2108.13624, 2021.

Yuge Shi, Imant Daunhawer, Julia E. Vogt, Philip H. S. Torr, and Amartya Sanyal. How robust are pre-trained models to distribution shift? arXiv preprint arXiv:2206.08871, 2022.

Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In Proc. of ECCV, 2020.
Frederik Träuble, Elliot Creager, Niki Kilbertus, Francesco Locatello, Andrea Dittadi, Anirudh Goyal, Bernhard Schölkopf, and Stefan Bauer. On disentangled representations learned from correlated data. In Proc. of ICML, 2021.

Michael Tschannen, Josip Djolonga, Paul K. Rubenstein, Sylvain Gelly, and Mario Lucic. On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625, 2019.

Julius von Kügelgen, Yash Sharma, Luigi Gresele, Wieland Brendel, Bernhard Schölkopf, Michel Besserve, and Francesco Locatello. Self-supervised learning with data augmentations provably isolates content from style. arXiv preprint arXiv:2106.04619, 2021.

Jindong Wang, Cuiling Lan, Chang Liu, Yidong Ouyang, Wenjun Zeng, and Tao Qin. Generalizing to unseen domains: A survey on domain generalization. arXiv preprint arXiv:2103.03097, 2021a.

Tan Wang, Zhongqi Yue, Jianqiang Huang, Qianru Sun, and Hanwang Zhang. Self-supervised learning disentangled group representation as feature. Proc. of NeurIPS, 2021b.

Haotian Ye, Chuanlong Xie, Tianle Cai, Ruichen Li, Zhenguo Li, and Liwei Wang. Towards a theoretical framework of out-of-distribution generalization. arXiv preprint arXiv:2106.04496, 2021.

Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In Proc. of ICML, 2021.

A MOCO + ARCL

Algorithm 2: MoCo + ArCL
input: batch size $N$, temperature $\tau$, memory bank $\mathcal{M}$, augmentation distribution $\pi$, number of views $m$, number of epochs $T$, encoders $f_q, f_k$, projector $g$.
1  for $t = 1, \dots, T$ do
2      sample minibatch $\{X_i\}_{i=1}^N$;
3      for $i = 1, \dots, N$ do
4          draw $m$ augmentations $\hat{\mathcal{A}} = \{A_1, \dots, A_m\} \sim \pi$;
5          $z^1_{i,j} = g(f_q(A_j X_i))$ for $j \in [m]$;
6          $z^2_{i,j} = g(f_k(A_j X_i))$ for $j \in [m]$;
7          # select the worst positive pair
8          $s^+_i = \min_{j,l \in [m],\, j \neq l} \ \langle z^1_{i,j}, z^2_{i,l} \rangle / (\|z^1_{i,j}\|\,\|z^2_{i,l}\|)$;
9          # compute similarities to the negative samples
10         for $m' \in \mathcal{M}$ do
11             $s^-_{i,m'} = \langle z^1_{i,1}, m' \rangle / (\|z^1_{i,1}\|\,\|m'\|)$;
12     compute $L = -\frac{1}{N} \sum_{i=1}^N \log \frac{\exp(s^+_i/\tau)}{\sum_{m' \in \mathcal{M}} \exp(s^-_{i,m'}/\tau)}$;
13     update $f_q$ and $g$ to minimize $L$;
14     update $\mathcal{M}$;
15 return $f_q$

B PROOFS IN SECTION 3

Proof of Lemma 3.1. For simplicity we omit the notations $D$ and $\vartheta$ in this proof; we use $x_1$ to denote the positive sample (i.e., the augmented data), $p_k$ to denote $D_\vartheta(C_k)$, and $\mu_k$ to denote $\mu_k(f; D_\vartheta)$. From Theorem 2 and Lemma B.1 in Huang et al. (2021), we have
$$\mathbb{E}_{x \sim C_k} \big\| \mathbb{E}_{x_1 \sim A(x)} f(x_1) - \mu_k \big\| \le c\, \mathcal{L}^{1/4}_{\mathrm{pos}}(f) + \tau(\sigma, \delta) \tag{8}$$
for some constant $c$, where $\tau(\sigma, \delta)$ is decreasing in $\sigma$ and increasing in $\delta$. Then, for every linear layer $h \in \mathbb{R}^{K \times d_1}$,
$$c\, \mathcal{L}^{1/4}_{\mathrm{pos}}(f) + \tau(\sigma, \delta) \ \ge\ \sum_{k=1}^K p_k\, \mathbb{E}_{x \sim C_k} \big\| \mathbb{E}_{x_1 \sim A(x)} f(x_1) - \mu_k \big\| \ \ge\ \frac{1}{\|h\|} \sum_{k=1}^K p_k\, \mathbb{E}_{x \sim C_k} \big\| \mathbb{E}_{x_1 \sim A(x)} h \circ f(x_1) - h\mu_k \big\|$$
$$\ge\ \frac{1}{\|h\|} \sum_{k=1}^K p_k\, \mathbb{E}_{x \sim C_k} \big\| \mathbb{E}_{x_1 \sim A(x)} h \circ f(x_1) - e_k \big\| - \frac{1}{\|h\|} \sum_{k=1}^K p_k \|e_k - h\mu_k\| \ =\ \frac{1}{\|h\|} R(h \circ f) - \frac{1}{\|h\|} \sum_{k=1}^K p_k \|e_k - h\mu_k\|.$$
Therefore, we obtain
$$R(h \circ f) \ \le\ c\, \|h\|\, \mathcal{L}^{1/4}_{\mathrm{pos}}(f) + \|h\|\, \tau(\sigma, \delta) + \sum_{k=1}^K p_k \|e_k - h\mu_k\|.$$

Proof of Theorem 3.2. By the triangle inequality and equation (8), we have
$$\|\mu_k\| \ \ge\ 1 - \mathbb{E}_{x \sim C_k} \big\| \mathbb{E}_{x_1 \sim A(x)} f(x_1) - \mu_k \big\| \ \ge\ 1 - C\, \mathcal{L}^{1/4}_{\mathrm{pos}}(f) - \tau(\sigma, \delta),$$
and hence
$$\min_k \|\mu_k\|^2 \ \ge\ 1 - C\, \mathcal{L}^{1/4}_{\mathrm{pos}}(f) - \tau(\sigma, \delta).$$
Define $U = (\mu_1, \dots, \mu_K) \in \mathbb{R}^{d_1 \times K}$ and let $h_0 = U^+$ be the Moore-Penrose inverse of $U$, so that $h_0 U = I_K$ and $\|h_0\| = 1/\sigma_{\min}(U)$ with $\sigma_{\min}(U)^2 = \min_{\|x\|_2 = 1} \|Ux\|^2$. For every $x = (x_1, \dots, x_K)$ with $\|x\|_2 = 1$, we have
$$\|Ux\|^2 = \sum_{i=1}^K x_i^2 \|\mu_i\|^2 + \sum_{i \neq j} x_i x_j\, \mu_i^\top \mu_j \ \ge\ \min_i \|\mu_i\|^2 - \max_{i \neq j} |\mu_i^\top \mu_j| \sum_{i \neq j} |x_i x_j| \ \ge\ \min_i \|\mu_i\|^2 - (K-1) \max_{i \neq j} |\mu_i^\top \mu_j|$$
$$\ge\ 1 - C\, \mathcal{L}^{1/4}_{\mathrm{pos}}(f) - \tau(\sigma, \delta) - K \max_{i \neq j} |\mu_i^\top \mu_j| \ =:\ \frac{1}{\gamma_{\mathrm{reg}}(f; D, \vartheta)^2}.$$
Therefore $\|h_0\| \le \gamma_{\mathrm{reg}}(f; D, \vartheta)$. Then, by Lemma 3.1 (note that $e_k - h_0 \mu_k = 0$ for all $k$), we complete the proof:
$$R(f) = \min_h R(h \circ f) \ \le\ R(h_0 \circ f) \ \le\ c\, \gamma_{\mathrm{reg}}(f; D, \vartheta)\, \mathcal{L}^{1/4}_{\mathrm{pos}}(f; D, \vartheta) + \gamma_{\mathrm{reg}}(f; D, \vartheta)\, \tau(\sigma, \delta).$$

C PROOF IN SECTION 4

Proof of Proposition 4.1. For any $\varepsilon > 0$ (without loss of generality $\varepsilon < 2$), let $t = \varepsilon/2$ and $f(x_1, x_2) = x_1 + t x_2$. Then the alignment loss of $f$ satisfies
$$\mathcal{L}_{\mathrm{align}}(f; D, \pi) = t^2\, \mathbb{E}[X_2^2]\; \mathbb{E}_{(\theta_1, \theta_2) \sim N(0,1)^2}\big[(\theta_1 - \theta_2)^2\big] = 2t^2 < \varepsilon.$$
Let $c = 0$ and $c' = 1/t$. Then obviously $R(f; D_c) = 0$, but
$$R(f; D_{c'}) = \mathbb{P}(X_1 < 0,\, X_1 + X_2 \ge 0) + \mathbb{P}(X_1 \ge 0,\, X_1 + X_2 < 0) = 1/4.$$

D PROOFS IN SECTION 5

Proof of Theorem 5.1. For any $f$ and $h$ with $\|h\| \le c_h$, we have
$$R(h \circ f; D_A) - R(h \circ f; D_{A'}) = \mathbb{E}_{(X,Y) \sim D} \big[ |h \circ f(AX) - Y|^2 - |h \circ f(A'X) - Y|^2 \big]$$
$$= \mathbb{E}_{(X,Y) \sim D} \big[ \big(h \circ f(AX) - h \circ f(A'X)\big)\big(h \circ f(AX) + h \circ f(A'X) - 2Y\big) \big]$$
$$\le\ c\, \mathbb{E}_{(X,Y) \sim D} \big| h \circ f(AX) - h \circ f(A'X) \big| \ \le\ c\, \|h\|\, \mathbb{E}_{(X,Y) \sim D} \big\| f(AX) - f(A'X) \big\| \ \le\ c\, \|h\|\, \sqrt{\mathcal{L}_{AR}(f)}$$
for some constant $c$ which depends on $c_h$.

Proof of Theorem 5.2. We first fix the sample $(x_1, \dots, x_n)$ and consider the randomness of $\mathcal{A}_m(x_i)$. Define
$$A^*_1(x_i),\, A^*_2(x_i) = \arg\sup_{A_1, A_2 \in \mathcal{A}} \|f(A_1 x_i) - f(A_2 x_i)\|_2^2$$
(their order does not matter). For every $\varepsilon > 0$, set $r_m := \big(\log(2n/\varepsilon)/(c_\vartheta m)\big)^{1/d}$ and consider the event
$$\frac{1}{n} \sum_{i=1}^n \sup_{A_1, A_2 \in \mathcal{A}} \|f(A_1 x_i) - f(A_2 x_i)\|_2^2 \ \ge\ \frac{1}{n} \sum_{i=1}^n \sup_{A_1, A_2 \in \mathcal{A}_m(x_i)} \|f(A_1 x_i) - f(A_2 x_i)\|_2^2 + L_f^2 L_{\mathcal{A}}^2\, r_m^2.$$
(i) By the Lipschitz continuity of $f$ and of the augmentations, this event requires that for some $i$,
$$\inf_{A \in \mathcal{A}_m(x_i)} \|A - A^*_1(x_i)\| \ge r_m \quad \text{or} \quad \inf_{A \in \mathcal{A}_m(x_i)} \|A - A^*_2(x_i)\| \ge r_m.$$
(ii) Taking a union bound over these $2n$ events, (iii) using the fact that the volume of a $d$-dimensional ball of radius $r$ is proportional to $r^d$, and (iv) using the inequality $(1 - 1/a)^a \le e^{-1}$ for $a > 1$, the probability of the event is at most
$$2n\, \big(1 - c_\vartheta r_m^d\big)^m \ \le\ 2n\, e^{-c_\vartheta m r_m^d} \ =\ 2n \cdot \frac{\varepsilon}{2n} \ =\ \varepsilon.$$
Therefore, with probability at least $1 - \varepsilon$,
$$\frac{1}{n} \sum_{i=1}^n \sup_{A_1, A_2 \in \mathcal{A}} \|f(A_1 x_i) - f(A_2 x_i)\|_2^2 \ \le\ \widehat{\mathcal{L}}_{AR}(f) + L_f^2 L_{\mathcal{A}}^2\, r_m^2.$$
Besides, by a standard Rademacher complexity argument, for every $f \in \mathcal{F}$ we have
$$\mathcal{L}_{AR}(f) \ \le\ \frac{1}{n} \sum_{i=1}^n \sup_{A_1, A_2 \in \mathcal{A}} \|f(A_1 x_i) - f(A_2 x_i)\|_2^2 + 2\mathcal{R}_n(\mathcal{F}) + \sqrt{\frac{\log(1/\varepsilon)}{2n}}$$
with probability at least $1 - \varepsilon$. Combining the two bounds, we finally obtain that with probability at least $1 - 2\varepsilon$, for every $f \in \mathcal{F}$,
$$\mathcal{L}_{AR}(f) \ \le\ \widehat{\mathcal{L}}_{AR}(f) + 2\mathcal{R}_n(\mathcal{F}) + \sqrt{\frac{\log(1/\varepsilon)}{2n}} + L_f^2 L_{\mathcal{A}}^2 \left( \frac{\log(2n/\varepsilon)}{c_\vartheta m} \right)^{2/d}.$$

Table 4: Linear evaluation results of pretrained CIFAR10 models using MoCo v2 on CIFAR10, CIFAR100 and their modified versions.
| Method | Aug 1 | Aug 2 | Aug 3 | Aug 4 | Aug 5 | Original |
|---|---|---|---|---|---|---|
| MoCo | 88.34 | 87.44 | 89.14 | 88.76 | 88.12 | 89.29 |
| MoCo + ArCL (views=2) | 89.12 | 88.13 | 89.62 | 89.25 | 88.91 | 90.44 |
| MoCo + ArCL (views=3) | 90.22 | 89.11 | 90.93 | 90.13 | 89.75 | 91.22 |
| MoCo + ArCL (views=4) | 90.77 | 89.64 | 91.22 | 90.86 | 90.18 | 91.45 |
| SimCLR | 52.98 | 48.77 | 54.47 | 53.18 | 52.93 | 55.88 |
| SimCLR + ArCL (views=2) | 53.37 | 49.14 | 55.09 | 53.46 | 53.14 | 56.62 |
| SimCLR + ArCL (views=3) | 54.49 | 50.29 | 55.33 | 53.78 | 53.62 | 57.13 |
| SimCLR + ArCL (views=4) | 55.42 | 51.42 | 55.96 | 53.76 | 54.41 | 57.98 |

E ADDITIONAL EXPERIMENTS

To further verify the generality of our method, we conduct more experiments in different settings and compare our objective with different losses.

E.1 MOCO ON CIFAR10

Setup. In this part, we conduct experiments with MoCo v2 pretrained on CIFAR10, following the MoCo v2 setup. We use ResNet-18 as the encoder and train the representation for 400 epochs. The temperature is set to 0.2, the initial learning rate to 0.15, and the memory bank size to 4096. We use the SGD optimizer with a cosine learning-rate schedule; warm-up is not used. We conduct linear evaluation for 100 epochs on the modified CIFAR10 and CIFAR100 datasets used in Section 6.1.

Results. Results are shown in Table 4. The trends observed earlier carry over to MoCo: our proposed approach improves the transferability of MoCo, and the accuracy increases as the number of views grows.

E.2 COMPARISON WITH AVERAGE ALIGNMENT LOSS

To make the comparison between ArCL and CL fairer, we add new experiments for original CL. For each sample, we also construct m views and use the expectation of the similarity between positive pairs, namely the average alignment loss (AAL), as the learning objective. The settings and results are described below.

Loss Function. For image $x_i$, the normalized features of its $m$ views from the online branch are $z_{i1}, z_{i2}, \dots, z_{im}$ and the normalized features from the target branch are $z'_{i1}, z'_{i2}, \dots, z'_{im}$.
Our ArCL alignment loss is
$$\mathcal{L}^{\mathrm{align}}_{\mathrm{ArCL}} = -\sum_i \min_{j \neq k}\, z_{ij}^\top z'_{ik} / \tau,$$
and the average alignment loss is
$$\mathcal{L}^{\mathrm{align}}_{\mathrm{Average}} = -\sum_i \mathop{\mathrm{avg}}_{j \neq k}\, z_{ij}^\top z'_{ik} / \tau = -\sum_i \Big( \sum_{j \neq k} z_{ij}^\top z'_{ik} / \tau \Big) \Big/ (m^2 - m).$$
The uniformity loss stays the same, and the total loss is the sum of the alignment loss and the uniformity loss.

Hyperparameters and Datasets. For SimCLR, we train the models on CIFAR10 and evaluate them on modified CIFAR10 and CIFAR100, just as in the main paper. We set the number of augmentation views to 4 and the batch size to 512. For MoCo, we conduct experiments on ImageNet, initializing from the 800-epochs-pretrained model. We train the model for 50 epochs with the same settings as in Section 6.2.

Results. Results for the SimCLR experiments can be seen in Table 5; we compare linear evaluation results on modified CIFAR10 and CIFAR100. Results for the MoCo experiments can be seen in Table 6; we compare both linear evaluation and fine-tuning results on the small datasets. As the experiments show, when contrastive learning with the average alignment loss uses the same amount of data as ArCL, it achieves higher accuracy than the vanilla methods (which use only two views) on the ID sets, but it still performs similarly to the vanilla methods on the OOD sets, which is much worse than ArCL. This further verifies the superiority of our method.

Table 5: Linear evaluation results of pretrained CIFAR10 models using SimCLR with two different alignment losses on CIFAR10, CIFAR100 and their modified versions.

CIFAR10:

| Method | Aug 1 | Aug 2 | Aug 3 | Aug 4 | Aug 5 | Original |
|---|---|---|---|---|---|---|
| SimCLR | 88.62 | 86.27 | 88.96 | 88.56 | 88.37 | 88.81 |
| SimCLR + AAL (views=4) | 88.90 | 86.34 | 88.72 | 88.65 | 88.44 | 89.97 |
| SimCLR + ArCL (views=4) | 89.97 | 88.06 | 90.48 | 89.91 | 89.59 | 90.20 |

CIFAR100:

| Method | Aug 1 | Aug 2 | Aug 3 | Aug 4 | Aug 5 | Original |
|---|---|---|---|---|---|---|
| SimCLR | 52.28 | 48.09 | 53.45 | 52.58 | 51.53 | 53.12 |
| SimCLR + AAL (views=4) | 52.34 | 48.77 | 53.12 | 52.44 | 51.82 | 53.21 |
| SimCLR + ArCL (views=4) | 53.40 | 50.16 | 54.92 | 53.77 | 52.61 | 54.20 |

Table 6: Results comparison on linear evaluation (top) and fine-tuning (bottom) of pretrained ImageNet models with the ArCL alignment loss and the average alignment loss on popular recognition datasets. Results style: best in the same augmentation-view setting. Avg is the average over the nine small datasets.

Linear evaluation:

| Method | ImageNet | Aircraft | Caltech101 | Cars | CIFAR10 | CIFAR100 | DTD | Flowers | Food | Pets | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MoCo | 70.68 | 41.79 | 87.92 | 39.31 | 92.28 | 74.90 | 73.88 | 90.07 | 68.95 | 83.30 | 72.49 |
| MoCo + AAL (views=2) | 71.12 | 40.53 | 87.80 | 38.64 | 92.23 | 75.14 | 74.95 | 88.64 | 69.24 | 83.17 | 72.26 |
| MoCo + ArCL (views=2) | 69.70 | 44.29 | 89.79 | 42.15 | 93.07 | 76.70 | 74.20 | 90.40 | 70.94 | 83.68 | 73.91 |
| MoCo + AAL (views=3) | 71.37 | 40.41 | 87.79 | 42.09 | 92.64 | 75.31 | 74.89 | 89.23 | 69.37 | 83.79 | 72.84 |
| MoCo + ArCL (views=3) | 69.80 | 44.57 | 89.48 | 42.11 | 93.29 | 77.33 | 74.63 | 91.13 | 71.16 | 84.23 | 74.21 |

Fine-tuning:

| Method | Aircraft | Caltech101 | Cars | CIFAR10 | CIFAR100 | DTD | Flowers | Food | Pets | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| MoCo | 83.56 | 82.54 | 85.09 | 95.89 | 71.81 | 69.95 | 95.26 | 76.81 | 88.83 | 83.30 |
| MoCo + AAL (views=2) | 83.87 | 82.76 | 85.90 | 96.38 | 71.43 | 72.71 | 95.50 | 76.95 | 89.05 | 83.84 |
| MoCo + ArCL (views=2) | 86.05 | 87.38 | 87.28 | 96.33 | 79.39 | 72.18 | 95.89 | 81.36 | 89.03 | 86.10 |
| MoCo + AAL (views=3) | 83.07 | 83.21 | 85.19 | 96.37 | 72.02 | 72.55 | 95.74 | 79.62 | 88.83 | 84.07 |
| MoCo + ArCL (views=3) | 84.03 | 87.64 | 86.34 | 96.88 | 80.98 | 72.87 | 96.14 | 81.90 | 89.20 | 86.22 |

E.3 EXPERIMENTS ON MNIST-CIFAR

MNIST-CIFAR is a synthetic dataset proposed by Shah et al. It consists of 3x64x32 synthetic images, each of which is a vertical concatenation of a 3x32x32 MNIST image from class 0 or class 1 and a 3x32x32 CIFAR10 image from class 0 or class 1. We follow the distribution settings proposed by Shi et al. (2022):

ID train: 1.0 correlation between MNIST and CIFAR10 labels. Contains two classes: class 0 with MNIST "0" and CIFAR10 "automobile", and class 1 with MNIST "1" and CIFAR10 "plane".

OOD train: no correlation between MNIST and CIFAR10 labels; images from the two classes of MNIST and CIFAR10 are randomly paired. We take the CIFAR10 label as the label of the concatenated image.
OOD test: generated in the same way as the OOD train set, using the test sets of MNIST and CIFAR10.

In this experiment, we train the model on the ID train set, conduct linear evaluation on the OOD train set, and report accuracy on the OOD test set. We adopt SimCLR as the CL framework and use a 4-layer CNN as the backbone, just as Shi et al. (2022) do. We fix the base feature size of the CNN to C = 32 and the latent dimension to L = 128. The batch size is set to 128, the optimizer to SGD, and the learning-rate scheduler to warmup-cosine. The learning rate and weight decay are searched over the ranges given in Table 2 of Shi et al. (2022). Since the training process converges easily, we set both the training and linear evaluation epochs to 5. We also compare with the average alignment loss (AAL) described in Section 6.3.

Table 7: Linear evaluation results of pretrained models using SimCLR with two different alignment losses on the MNIST-CIFAR dataset. Average results over three different random seeds are given.

| Method | Accuracy (%) |
|---|---|
| SimCLR | 85.6 |
| SimCLR + AAL (views=3) | 85.4 |
| SimCLR + AAL (views=4) | 86.1 |
| SimCLR + AAL (views=5) | 86.3 |
| SimCLR + AAL (views=6) | 85.8 |
| SimCLR + ArCL (views=3) | 86.0 |
| SimCLR + ArCL (views=4) | 87.2 |
| SimCLR + ArCL (views=5) | 87.3 |
| SimCLR + ArCL (views=6) | 88.4 |

Results. We can see that with ArCL, SimCLR enjoys improved performance, and the accuracy grows as the number of views grows, which fits our theory. We also notice that the accuracy of SimCLR with AAL does not vary much as the number of views grows, indicating that the key point is not merely the use of multiple views but adopting the minimum instead of the average.
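The distinction behind this result (averaging over positive pairs versus taking the worst pair) can be illustrated numerically. The toy feature vectors below stand in for the projected view features of one image; they are illustrative, not the output of any trained encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for m = 4 augmented views of one image: three nearly
# identical views plus one deliberately dissimilar view, mimicking
# a "hard" augmentation that flips the feature.
base = rng.normal(size=8)
views = [base + 0.05 * rng.normal(size=8) for _ in range(3)]
views.append(-base)

pairs = [(j, k) for j in range(4) for k in range(4) if j != k]
sims = [cosine(views[j], views[k]) for j, k in pairs]

# AAL optimizes the average pairwise similarity; ArCL optimizes the
# worst (minimum) pair.
aal_term = np.mean(sims)
arcl_term = np.min(sims)
print(aal_term, arcl_term)
```

The dissimilar view barely moves the average (the three well-aligned views wash it out), while the minimum is dominated by it; this is the mechanism by which taking the minimum, rather than the average, forces augmentation-robust features.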