# Interventional Contrastive Learning with Meta Semantic Regularizer

Wenwen Qiang *1 2 3, Jiangmeng Li *1 2 3, Changwen Zheng 1 3, Bing Su 4 5, Hui Xiong 6 7

## Abstract

Contrastive learning (CL)-based self-supervised learning models learn visual representations in a pairwise manner. Although the prevailing CL model has achieved great progress, in this paper we uncover an ever-overlooked phenomenon: when the CL model is trained with full images, the performance tested on full images is better than that on foreground areas; when the CL model is trained with foreground areas, the performance tested on full images is worse than that on foreground areas. This observation reveals that backgrounds in images may interfere with the model's learning of semantic information, and that their influence has not been fully eliminated. To tackle this issue, we build a Structural Causal Model (SCM) that models the background as a confounder. We propose a backdoor adjustment-based regularization method, namely Interventional Contrastive Learning with Meta Semantic Regularizer (ICL-MSR), to perform causal intervention on the proposed SCM. ICL-MSR can be incorporated into any existing CL method to alleviate background distractions during representation learning. Theoretically, we prove that ICL-MSR achieves a tighter error bound. Empirically, our experiments on multiple benchmark datasets demonstrate that ICL-MSR is able to improve the performance of different state-of-the-art CL methods.

*Equal contribution. 1 Science & Technology on Integrated Information System Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China. 2 University of Chinese Academy of Sciences, Beijing, China. 3 Southern Marine Science and Engineering Guangdong Laboratory (Guangzhou), Guangdong, China. 4 Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China. 5 Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing, China. 6 Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China. 7 Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China. Correspondence to: Bing Su.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

## 1. Introduction

Learning robust and generic representations without human annotation is a long-standing and important topic in machine learning. Contrastive learning (CL)-based self-supervised learning (SSL), an innovative unsupervised representation learning paradigm, has recently demonstrated superiority in computer vision tasks such as classification (Chen et al., 2020a), object identification (Grill et al., 2020), and transfer learning (Zbontar et al., 2021). The success of CL is partly due to its instance-based learning paradigm: CL treats each sample in the training dataset as a distinct class. This paradigm can be applied to any type of data to capture common semantic information applicable to different tasks. In general, for a sample $X$ in a mini-batch of training data, two augmented samples $X^1$ and $X^2$ are generated by applying random augmentation transformations $T$ to $X$, i.e., $X^1, X^2 = T(X)$.
Then, taking one of the two augmented samples as the anchor, most existing CL frameworks treat the remaining augmented sample as the positive sample and the augmented samples generated from the other samples in the mini-batch as negative samples. The contrastive loss (Chen et al., 2020a) is used to train the feature extractor. According to the instance-based learning paradigm, minimizing the contrastive loss entails pulling the positive sample closer to the anchor and pushing the negative samples further away from the anchor in the learned feature space (Wang & Isola, 2020). The contrastive loss can also be viewed, from an information-theoretic standpoint, as a way to assess the mutual information between the positive sample and the anchor (Oord et al., 2018). The high similarity or mutual information between the positive sample and the anchor should therefore be due to shared semantic, i.e., foreground-related, information. Observations from several toy experiments, however, contradict this. To be more specific, we run toy experiments on the COCO dataset (Lin et al., 2014) with four different experimental settings: 1) training and testing the CL model on full images; 2) training the CL model on full images and testing on foreground images; 3) training and testing the CL model on foreground images; and 4) training the CL model on foreground images and testing on full images. SimCLR (Chen et al., 2020a) and BYOL (Grill et al., 2020), two CL models, were chosen as baselines.

Figure 1. The experimental results for two CL models. Setting 1 represents training and testing on full images, setting 2 represents training on full images and testing on foreground images, setting 3 represents training and testing on foreground images, and setting 4 represents training on foreground images and testing on full images.

Figure 1 shows an often-overlooked characteristic of current CL models. Comparing the results produced under settings 1) and 2), where the model is trained on full images, the performance evaluated on full images is clearly superior to the performance tested on foreground images. In addition, when comparing the results obtained under settings 1) and 4), the model trained with full images outperforms the model trained with foreground images when both are tested on full images. That is, when the background is removed from the full image during training or testing, the performance of the two CL models suffers. This suggests that background-related information influences the learning process of CL models. However, comparing the results obtained under settings 3) and 4), we discover that when the model is trained with foreground images, the performance tested on foreground images is considerably better than the performance tested on full images. In addition, when all settings are considered, we find that training and testing with only foreground images produces the best results. Based on this, we can conclude that background-related information degrades the performance of the CL models. As we can see, the two conclusions are mutually exclusive. A plausible explanation is that a feature extractor trained on full images extracts background-dependent semantic features. During the test phase, because the full image contains both foreground and background parts, the background part also plays a certain role, besides the foreground part, in promoting the classification.
CL, on the other hand, strives to be adaptable to a variety of downstream tasks, such as object detection, object segmentation, and so on. Only foreground-related semantic information can ensure the robustness of the learned features across such tasks. This is in accordance with the second observation. To this end, we develop a Structural Causal Model (SCM) to describe the causal relationships between the semantic information, the positive sample, and the anchor. Based on this SCM, we can represent the background as a confounder, which fits the explanation given above. Then, to perform causal intervention on the proposed SCM, we present a backdoor adjustment-based regularization approach called Interventional Contrastive Learning with Meta Semantic Regularizer (ICL-MSR). To eliminate background distractions from representation learning, ICL-MSR can be readily incorporated into most existing CL approaches. We show that ICL-MSR achieves a tighter error bound than CL methods that merely minimize the contrastive loss. Our experiments on multiple benchmark datasets show that ICL-MSR can improve the performance of state-of-the-art CL-based self-supervised learning approaches. Our major contributions are as follows:

- We discover a paradox: under different settings, background information can both improve and hinder the performance of the learned feature representations.
- To capture the causal links between the semantic information, the positive sample, and the anchor, we establish a Structural Causal Model (SCM). From it we can deduce that the background is effectively a confounder that produces misleading correlations between the positive sample and the anchor.
- We propose a new method called Interventional Contrastive Learning with Meta Semantic Regularizer (ICL-MSR) by applying backdoor adjustment to the proposed SCM.
- We provide a theoretical guarantee on the error bound and empirical evaluations to demonstrate that ICL-MSR can improve the performance of different state-of-the-art CL methods.

## 2. Related Works

Self-supervised learning, such as contrastive learning (CL), aims to learn a generalized feature extractor that can be applied well to downstream tasks. The objective of contrastive learning is mainly based on the InfoNCE loss. It was first proposed in (Oord et al., 2018) and can be seen as a lower bound of the mutual information between the feature and the context. SimCLR (Chen et al., 2020a;b; Sordoni et al., 2021; Wen & Li, 2021) extends the InfoNCE loss to maximize the similarity between two different data augmentations. To further improve the performance of contrastive learning, InfoMin (Tian et al., 2020b) proposes a set of stronger augmentations that reduce the mutual information between views while keeping task-relevant information intact. AlignUniform (Wang & Isola, 2020) relates the contrastive loss to two critical properties, alignment and uniformity, to form a new objective. The CL model has shown its superiority in many vision tasks, but some challenges remain. Firstly, CL is sensitive to batch size. To address this, MoCo (He et al., 2020; Chen et al., 2020c; 2021b) increases the number of negative examples by using a memory bank. Secondly, some negative samples may contain similar semantic information to the positive sample.
To tackle this issue, SwAV (Caron et al., 2020) and PCL (Li et al., 2020) learn good-quality negative samples by introducing clustering methods into the training process, thereby reducing the number of negative samples. BYOL (Grill et al., 2020) proposes learning feature representations without negative samples. However, this also brings a new challenge: degenerate solutions. BYOL therefore learns feature representations with a structurally asymmetric feature extractor and adopts a moving average as the optimization method for the model parameters. DirectPred (Tian et al., 2021) provides a theoretical analysis that verifies the effectiveness of BYOL. At the same time, a series of effective works such as SimSiam (Chen & He, 2021), Barlow Twins (Zbontar et al., 2021), W-MSE (Ermolov et al., 2021), and SSL-HSIC (Li et al., 2021) have been proposed to avoid degenerate solutions. Among them, W-MSE and Barlow Twins do not require asymmetric networks and are conceptually simpler. In addition to improvements of the model, contrastive learning theory is also attracting increasing attention. Some works (Nozawa & Sato, 2021; Arora et al., 2019) provide generalization bounds for CL. RELIC (Mitrovic et al., 2021) gives a causal explanation for the objective of CL. The goal of this paper is to explore the impact of background information on the representation learning process of contrastive learning. Our proposed ICL-MSR can be incorporated into most existing CL methods to alleviate background distractions. There are three differences between ICL-MSR and Causal3DIdent (Von Kügelgen et al., 2021). First, ICL-MSR is motivated by causal intervention, while Causal3DIdent is motivated by counterfactuals. Second, ICL-MSR is mainly concerned with the objective function of contrastive learning and regards the background-dependent semantic features as confounding factors, while Causal3DIdent focuses on data augmentation and regards it as a counterfactual under soft style intervention. Third, ICL-MSR implements backdoor adjustment, and Causal3DIdent can be used as a component of ICL-MSR.

## 3. Problem Formulations

### 3.1. Contrastive Learning

In this paper, we mainly focus on CL-based representation learning approaches. The basic goal of CL methods is to learn a generic feature extractor $f$, which projects a sample from the original input space to a latent space for extracting intrinsic features. Formally, we first randomly sample a mini-batch of training data and denote it as $X_{tr} = \{X_i\}_{i=1}^{N}$, where $X_i$ represents the $i$-th sample and $N$ is the number of samples in the mini-batch. We perform stochastic data augmentation (e.g., random crop) to transform each sample $X_i$ into two augmented views $X_i^1$ and $X_i^2$. Since there are $N$ samples in $X_{tr}$, we finally obtain $2N$ augmented samples, denoted as $X_{tr}^{aug} = \{X_i^1, X_i^2\}_{i=1}^{N}$. We then feed all training samples into the feature extractor $f$ to obtain their feature representations, i.e., $Z_i^j = f(X_i^j)$, $i \in \{1, \dots, N\}$, $j \in \{1, 2\}$. The general objective of CL is formulated as:

$$
\mathcal{L}_{ct} = -\frac{1}{2N}\sum_{i=1}^{N}\sum_{j=1}^{2}\log\frac{\exp\!\left(\mathrm{sim}(Z_i^j, Z_i^{3-j})/\tau\right)}{\exp\!\left(\mathrm{sim}(Z_i^j, Z_i^{3-j})/\tau\right)+\sum_{k=1,k\neq i}^{N}\sum_{l=1}^{2}\exp\!\left(\mathrm{sim}(Z_i^j, Z_k^l)/\tau\right)}, \tag{1}
$$

where $\tau$ is the temperature hyper-parameter and $\mathrm{sim}(u, v) = u^{\top}v / (\|u\|\,\|v\|)$ denotes the dot product between $\ell_2$-normalized $u$ and $v$ (i.e., the cosine similarity). For a sample $X_i^j$ randomly selected from the dataset $X_{tr}^{aug}$, CL regards it as the anchor, the pair $\{X_i^j, X_i^{3-j}\}$ as positive, and the pairs $\{X_i^j, X_k^l\}_{k\neq i,\, l\in\{1,2\}}$ as negative. The loss $\mathcal{L}_{ct}$ in objective (1) is computed across all positive pairs.
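To make objective (1) concrete, the following is a minimal PyTorch-style sketch of the contrastive loss for a batch of paired embeddings. The function name `nt_xent_loss` and the tensor layout are our own illustrative choices rather than part of any released implementation, and we assume the denominator contains the positive pair plus the views of the other mini-batch samples, as defined above.

```python
import torch
import torch.nn.functional as F


def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Contrastive loss of objective (1) for a batch of paired embeddings.

    z1, z2: (N, d) embeddings of the two augmented views of the same N images.
    For each anchor, the other view of the same image is the positive; the
    views coming from the other images act as negatives.
    """
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)          # (2N, d), unit norm -> dot product = cosine
    sim = z @ z.t() / tau                                       # pairwise similarities scaled by temperature
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)  # exclude self-similarity from the denominator
    sim = sim.masked_fill(mask, float("-inf"))
    pos = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)  # index of each row's positive
    return F.cross_entropy(sim, pos)                            # -log softmax at the positive, averaged over 2N anchors


# usage sketch: z1 = f_ph(f(x1)), z2 = f_ph(f(x2)) for the two augmented views of a mini-batch
```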
### 3.2. Structural Causal Model

Figure 2. The proposed SCM between the semantic information $X_s$, the positive sample $X_i^{3-j}$, and the anchor (or label) $Y(X_i^{3-j})$.

Minimizing objective (1) makes the sample $X_i^{3-j}$ in the positive pair close to the anchor and the samples $X_k^l$ in the negative pairs far away from the anchor. From this perspective, the anchor can be seen as the label, and minimizing objective (1) amounts to predicting $X_i^{3-j}$ as the label $X_i^j$. The SCM implicated in CL can then be formalized as in Figure 2. The nodes in the SCM represent abstract data variables and the directed edges represent (functional) causality, e.g., $X_i^{3-j} \rightarrow Y(X_i^{3-j})$ means that $X_i^{3-j}$ is the cause and $Y(X_i^{3-j})$ is the effect. In the following, we describe the proposed SCM and the rationale behind its construction at a high level; please refer to Section 4 for the detailed functional implementations.

- $X_i^{3-j} \rightarrow Y(X_i^{3-j})$. $Y(X_i^{3-j})$ denotes the label corresponding to $X_i^{3-j}$. As mentioned above, the label equals the anchor, so $Y(X_i^{3-j}) = X_i^j$. This link assumes that $X_i^{3-j}$ should be similar to the anchor $X_i^j$.
- $X_i^{3-j} \leftarrow X_s \rightarrow Y(X_i^{3-j})$. $X_s$ represents the semantic information and can be regarded as the convolution kernels of the feature extractor $f$. The links between $X_s$ and $X_i^{3-j}$ and between $X_s$ and $Y(X_i^{3-j})$ assume that the feature representations of $X_i^{3-j}$ and $Y(X_i^{3-j})$ in the latent space are extracted by $f$, and that each feature channel corresponds to a piece of semantic information.

An ideal contrastive learning model should capture the true causality between $X_i^{3-j}$ and $Y(X_i^{3-j})$ and generalize well to unseen samples. For example, we expect the prediction of the label $Y(X_i^{3-j})$ to be caused by the foreground features, not by the background information. However, according to the proposed SCM, the increased likelihood of $Y(X_i^{3-j})$ given $X_i^{3-j}$ is not only due to $X_i^{3-j} \rightarrow Y(X_i^{3-j})$, but also to the spurious correlation via $X_s$, i.e., $X_i^{3-j} \leftarrow X_s \rightarrow Y(X_i^{3-j})$: the background of $X_i^{3-j}$ generates the background feature, which provides useful context for predicting the anchor. This corresponds to our first observation in Figure 1: when the model is trained on full images, the performance evaluated on full images is clearly superior to the performance tested on foreground images. Therefore, to pursue the true causality between $X_i^{3-j}$ and $Y(X_i^{3-j})$, we need to use the causal intervention $P(Y(X_i^{3-j})\,|\,do(X_i^{3-j}))$ instead of $P(Y(X_i^{3-j})\,|\,X_i^{3-j})$ to measure the causal relation.
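The gap between conditioning and intervening can be illustrated with a small numerical sketch on synthetic discrete data. The scenario (a binary background variable $B$ confounding $X$ and $Y$) and all numbers below are illustrative assumptions of ours, not measurements from the paper.

```python
import numpy as np

# Synthetic confounded data: the background B drives both the input X and the label Y,
# while X itself has no causal effect on Y.
rng = np.random.default_rng(0)
n = 200_000
B = rng.integers(0, 2, n)                                       # background stratum: 0 or 1
X = (rng.random(n) < np.where(B == 1, 0.8, 0.3)).astype(int)    # B raises P(X = 1)
Y = (rng.random(n) < np.where(B == 1, 0.7, 0.2)).astype(int)    # B also raises P(Y = 1)

# Observational conditioning picks up the spurious correlation through B.
p_y_given_x1 = Y[X == 1].mean()
p_y_given_x0 = Y[X == 0].mean()

# Backdoor adjustment: P(Y | do(X = x)) = sum_b P(Y | X = x, B = b) * P(B = b).
def p_do(x: int) -> float:
    return sum(Y[(X == x) & (B == b)].mean() * (B == b).mean() for b in (0, 1))

print(f"P(Y=1 | X=1) = {p_y_given_x1:.3f}, P(Y=1 | X=0) = {p_y_given_x0:.3f}")   # clearly differ
print(f"P(Y=1 | do(X=1)) = {p_do(1):.3f}, P(Y=1 | do(X=0)) = {p_do(0):.3f}")     # approximately equal
```

In this toy setting the conditional probabilities differ although $X$ has no causal effect on $Y$, while stratifying over the confounder removes the spurious dependence; this is exactly the role the backdoor adjustment plays for the background in the SCM above.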
### 3.3. Causal Intervention via Backdoor Adjustment

Before introducing the backdoor adjustment, we first give an intuition for why the feature extractor $f$ trained by previous contrastive learning models extracts background-dependent semantic features. The fact that the positive set is too small, i.e., only one sample, could explain this problem. As shown in objective (1), for a randomly given anchor there is only one positive sample but $2N-2$ negative samples. Moreover, some samples in the negative pairs (false negative samples) have the same kind of foreground as the positive sample. Due to the randomness of the sampling process, the background similarity between a false negative sample and the anchor is lower than the foreground similarity between them. The anchor and the sample in the positive pair, however, are generated from the same original image, so both the foreground and the background of the two positive samples are similar. When minimizing objective (1), the shared foreground and background of a positive pair will jointly drag the positive sample towards the anchor if there are too few positive samples. At the same time, false negative samples carry semantically equivalent information to the anchor (Tian et al., 2020a), so pushing the false negative samples further from the anchor mainly lowers the foreground similarity. In other words, this in turn may cause the drag operation to pay greater attention to the background to some extent. As a result, the role of the background is enhanced, which leads previous contrastive learning models to extract background-dependent semantic features.

Above, we have shown that the feature extractor $f$ can extract background-dependent semantic features and analyzed how these semantic features affect the training process of the contrastive learning model. Below, we propose to use the backdoor adjustment (Glymour et al., 2016) to eliminate the interference of the background-dependent semantics, i.e., to achieve $P(Y(X_i^{3-j})\,|\,do(X_i^{3-j}))$. Specifically, the backdoor adjustment assumes that we can observe and stratify the confounder. In the proposed SCM, the confounder is contained in $X_s$, so we can stratify it into different semantic features, e.g., $X_s = \{Z_s^t\}_{t=1}^{n}$, where $Z_s^t$ represents one stratum of semantic features. Formally, the backdoor adjustment for the proposed SCM is:

$$
P(Y(X_i^{3-j})\,|\,do(X_i^{3-j})) = \sum_{t=1}^{n} P(Y(X_i^{3-j})\,|\,X_i^{3-j}, Z_s^t)\,P(Z_s^t), \tag{2}
$$

where $P(Y(X_i^{3-j})\,|\,do(X_i^{3-j}))$ represents the true causality between $Y(X_i^{3-j})$ and $X_i^{3-j}$. See Appendix B for a detailed derivation of equation (2).

## 4. Methodology

### 4.1. Meta Semantic Regularizer

In this subsection, we present the implementation of the backdoor adjustment during the training phase. As shown in equation (2), we first need a concrete functional implementation of $Z_s^t$. Without loss of generality, we denote the dimension of the output $Z_i^j$ of the feature extractor $f$ as $w \times h \times c$, where $w$ is the width, $h$ is the height, and $c$ is the number of feature channels. More specifically, we write $Z_i^j = [Z_{i,1}^j, \dots, Z_{i,c}^j]$, where $Z_{i,r}^j \in \mathbb{R}^{w \times h}$ is a feature map of $Z_i^j$, $r \in \{1, \dots, c\}$. Note that for any pre-trained CNN-based feature extractor, each channel corresponds to a kind of semantic information or visual concept (Zeiler & Fergus, 2014; Zhou et al., 2016). However, it is difficult to encode one visual concept with a single channel. This motivates us to find a weight vector for each piece of semantic information. Our idea is that each piece of semantic information corresponds to one subset of channels, so the weights for this subset of channels should be large and the weights for channels outside the subset should be small. Given a weight vector $a_t = [a_{1,t}, \dots, a_{c,t}]^{\top}$, we implement $Z_s^t$ as $Z_s^t = a_t$ and set $P(Z_s^t) = 1/n$. We then implement $P(Y(X_i^{3-j})\,|\,X_i^{3-j}, Z_s^t)$ as

$$
P(Y(X_i^{3-j})\,|\,X_i^{3-j}, Z_s^t) = \frac{\exp\!\left(\mathrm{sim}(Z_i^j, a_t \odot Z_i^{3-j})/\tau\right)}{\exp\!\left(\mathrm{sim}(Z_i^j, a_t \odot Z_i^{3-j})/\tau\right) + \sum_{k=1,k\neq i}^{N}\sum_{l=1}^{2}\exp\!\left(\mathrm{sim}(Z_i^j, Z_k^l)/\tau\right)}, \tag{3}
$$

where $a_t \odot Z_i^{3-j} = [a_{1,t} Z_{i,1}^{3-j}, \dots, a_{c,t} Z_{i,c}^{3-j}]$. As a result, the overall backdoor adjustment is:

$$
P(Y(X_i^{3-j})\,|\,do(X_i^{3-j})) = \frac{1}{n}\sum_{t=1}^{n}\frac{\exp\!\left(\mathrm{sim}(Z_i^j, a_t \odot Z_i^{3-j})/\tau\right)}{\exp\!\left(\mathrm{sim}(Z_i^j, a_t \odot Z_i^{3-j})/\tau\right) + \sum_{k=1,k\neq i}^{N}\sum_{l=1}^{2}\exp\!\left(\mathrm{sim}(Z_i^j, Z_k^l)/\tau\right)}. \tag{4}
$$
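As a rough sketch of how equations (2)-(4) could be evaluated for one anchor/positive pair, the function below averages the channel-reweighted contrastive probability over the $n$ semantic strata with the prior $P(Z_s^t) = 1/n$ stated above. Variable names (`z_anchor`, `z_positive`, `z_negatives`, `weights`) are ours, and for simplicity the $w \times h \times c$ feature maps are assumed to be pooled to $c$-dimensional vectors; this is an illustrative simplification, not the authors' released code.

```python
import torch
import torch.nn.functional as F


def backdoor_adjusted_log_prob(z_anchor: torch.Tensor,      # (c,)   pooled features of the anchor view
                               z_positive: torch.Tensor,    # (c,)   pooled features of the other view
                               z_negatives: torch.Tensor,   # (K, c) pooled features of the negatives
                               weights: torch.Tensor,       # (n, c) semantic weight vectors a_1 .. a_n
                               tau: float = 0.5) -> torch.Tensor:
    """log P(Y | do(X)) of Eq. (4): average Eq. (3) over the n strata, each with prior 1/n."""
    def cos(u, v):
        return F.cosine_similarity(u, v, dim=-1)

    neg_term = torch.exp(cos(z_anchor.unsqueeze(0), z_negatives) / tau).sum()  # shared denominator part
    probs = []
    for a_t in weights:                                      # one stratum Z_s^t per weight vector
        pos = torch.exp(cos(z_anchor, a_t * z_positive) / tau)   # channel-wise re-weighting a_t ⊙ Z
        probs.append(pos / (pos + neg_term))
    p_do = torch.stack(probs).mean()                         # sum_t P(. | ., Z_s^t) * (1/n)
    return torch.log(p_do)


# L_msr in objective (6) below is then the average of -backdoor_adjusted_log_prob over all anchors in the batch.
```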
The proposed meta semantic regularizer can be thought of as a learnable module $f_{msr}$ that generates the semantically relevant weight matrix $A_s = [a_1, \dots, a_n]$ and is implemented by a convolutional neural network. Specifically, for an input sample $X$, we first obtain two augmented samples $X^1, X^2$ by feeding $X$ to a stochastic data augmentation module. We then feed $X$ to the module $f_{msr}$ to obtain the weight matrix $A_s$; the two augmented samples $X^1, X^2$ share the same weight matrix $A_s$.

### 4.2. Model Objectives

Figure 3. The framework of the proposed ICL-MSR.

We now introduce the objective of the proposed interventional contrastive learning with meta semantic regularizer (ICL-MSR). The whole learning framework of ICL-MSR is shown in Figure 3, and the training process is given in the Appendix. ICL-MSR consists of three modules: the feature extractor $f$, the meta semantic regularizer $f_{msr}$, and the projection head $f_{ph}$. The meta semantic regularizer is trained alongside the feature extractor, with two stages per epoch. In the first stage, $f$ and $f_{ph}$ are learned using the augmented training dataset $X_{tr}^{aug}$ and the semantically relevant weight matrix $A_s$. In the second stage, $f_{msr}$ is updated by computing its gradients with respect to the contrastive loss. We train the modules in an iterative manner until convergence. In the first stage of each epoch, the parameters of $f$ and $f_{ph}$ are updated by minimizing the objective $\mathcal{L}_{to}$:

$$
\min_{f, f_{ph}} \mathcal{L}_{to} = \mathcal{L}_{ct} + \lambda \mathcal{L}_{msr}, \tag{5}
$$

where $\mathcal{L}_{ct}$ is given in objective (1), $\lambda$ is a hyper-parameter, and $\mathcal{L}_{msr}$ is defined as:

$$
\mathcal{L}_{msr} = -\frac{1}{2N}\sum_{i=1}^{N}\sum_{j=1}^{2}\log P(Y(X_i^{3-j})\,|\,do(X_i^{3-j})). \tag{6}
$$

In the second stage of each epoch, to learn the parameters of $f_{msr}$, we propose a meta learning-based training mechanism. That is, $f_{msr}$ is updated by encouraging the weight matrix to be chosen such that, if $f$ and $f_{ph}$ were trained using this weight matrix, the performance of the primary contrastive learning task on the same training data would be maximized. Specifically, we first update $f$ and $f_{ph}$ once with learning rate $\alpha$:

$$
f^1 = f - \alpha \nabla_f \mathcal{L}_{to}, \qquad f_{ph}^1 = f_{ph} - \alpha \nabla_{f_{ph}} \mathcal{L}_{to}. \tag{7}
$$

These two updates can be seen as learning a good $f$ and a good $f_{ph}$. After this step, $f^1$ and $f_{ph}^1$ can be seen as functions of $f_{msr}$, because $\nabla_f \mathcal{L}_{to}$ and $\nabla_{f_{ph}} \mathcal{L}_{to}$ depend on $f_{msr}$. We then update $f_{msr}$ by minimizing:

$$
\min_{f_{msr}} \mathcal{L}_{ct}\!\left(f^1, f_{ph}^1\right) + \gamma \mathcal{L}_{uni}, \tag{8}
$$

where $\gamma$ is a hyper-parameter, $\mathcal{L}_{ct}(f^1, f_{ph}^1)$ denotes the loss $\mathcal{L}_{ct}$ computed with the parameters $f^1$ and $f_{ph}^1$, and $\mathcal{L}_{uni}$ is a uniformity loss that constrains the distribution of the elements in $A_s$ to approximate a uniform distribution, so that the resulting visual semantics are as distinct as possible. Based on the Gaussian potential kernel (Borodachov et al., 2019; Cohn & Kumar, 2007; Wang & Isola, 2020), $\mathcal{L}_{uni}$ can be written as:

$$
\mathcal{L}_{uni} = \log \sum_{a_i, a_j \in A_s} G_t(a_i, a_j), \qquad G_t(a_i, a_j) = \exp\!\left(2t\, a_i^{\top} a_j - 2t\right), \tag{9}
$$

where $t$ is a fixed hyper-parameter. Note that the average pairwise Gaussian potential is closely tied to the uniform distribution; for more details please refer to (Wang & Isola, 2020). A natural question is why minimizing objective (8) makes $f_{msr}$ learn semantic information. Note that only the semantic information shared between the positive pair can make $X_i^{3-j}$ and $Y(X_i^{3-j})$ similar and thus minimize the contrastive loss. From the SCM, the shared semantic information contains both background-related and foreground-related information. The idea behind objective (8) is that minimizing the contrastive loss drives $f_{msr}$ to learn this shared semantic information. This step can also be seen as promoting the learning of $\mathcal{L}_{ct}$ once again on the basis of $\mathcal{L}_{ct}$, which is similar to learning to learn. This is also the reason why we call it a meta semantic regularizer.
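The two-stage procedure of equations (5)-(8) (summarized as Algorithm 1 in the appendix) can be sketched as follows. This is a simplified illustration assuming PyTorch 2.x (`torch.func.functional_call`); the module and loss names (`encoder`, `proj_head`, `msr`, `contrastive_loss`, `msr_loss`, `uniformity_loss`, `opt_main`, `opt_msr`) are hypothetical helpers of ours, and a full implementation would follow the authors' released settings rather than this sketch.

```python
import torch
from torch.func import functional_call


def icl_msr_step(encoder, proj_head, msr, x, x1, x2,
                 contrastive_loss, msr_loss, uniformity_loss,
                 opt_main, opt_msr, lam=1.0, gamma=1.0, alpha=1e-3):
    """One ICL-MSR iteration: stage 1 updates f and f_ph, stage 2 meta-updates f_msr."""
    # ---- Stage 1 (Eq. 5): update f and f_ph on L_to = L_ct + lambda * L_msr ----
    with torch.no_grad():
        a_s = msr(x)                                  # weight matrix A_s, held fixed in this stage
    z1, z2 = proj_head(encoder(x1)), proj_head(encoder(x2))
    loss_to = contrastive_loss(z1, z2) + lam * msr_loss(z1, z2, a_s)
    opt_main.zero_grad()
    loss_to.backward()
    opt_main.step()

    # ---- Stage 2 (Eqs. 7-8): meta-update f_msr through the fast weights f^1, f_ph^1 ----
    a_s = msr(x)                                      # now differentiable w.r.t. the msr parameters
    z1, z2 = proj_head(encoder(x1)), proj_head(encoder(x2))
    loss_to = contrastive_loss(z1, z2) + lam * msr_loss(z1, z2, a_s)

    fast = {}
    for key, module in (("enc", encoder), ("ph", proj_head)):
        params = dict(module.named_parameters())
        grads = torch.autograd.grad(loss_to, list(params.values()), create_graph=True)
        fast[key] = {name: p - alpha * g for (name, p), g in zip(params.items(), grads)}

    # Evaluate L_ct(f^1, f_ph^1) + gamma * L_uni; create_graph lets gradients flow back to msr.
    h1 = functional_call(proj_head, fast["ph"], (functional_call(encoder, fast["enc"], (x1,)),))
    h2 = functional_call(proj_head, fast["ph"], (functional_call(encoder, fast["enc"], (x2,)),))
    meta_loss = contrastive_loss(h1, h2) + gamma * uniformity_loss(a_s)
    opt_msr.zero_grad()
    meta_loss.backward()                              # only opt_msr is stepped; encoder/head grads are discarded
    opt_msr.step()
```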
Table 1. Classification accuracy (top 1) of a linear classifier and a 5-nearest-neighbors classifier for the compared methods and datasets with the ResNet-18 feature extractor.

| Method | CIFAR-10 linear | CIFAR-10 5-nn | CIFAR-100 linear | CIFAR-100 5-nn | STL-10 linear | STL-10 5-nn | Tiny ImageNet linear | Tiny ImageNet 5-nn |
|---|---|---|---|---|---|---|---|---|
| SimCLR (Chen et al., 2020a) | 91.80 | 88.42 | 66.83 | 56.56 | 90.51 | 85.68 | 48.84 | 32.86 |
| BYOL (Grill et al., 2020) | 91.73 | 89.45 | 66.60 | 56.82 | 91.99 | 88.64 | 51.00 | 36.24 |
| W-MSE (Ermolov et al., 2021) | 91.99 | 89.87 | 67.64 | 56.45 | 91.75 | 88.59 | 49.22 | 35.44 |
| ReSSL (Zheng et al., 2021) | 90.20 | 88.26 | 63.79 | 53.72 | 88.25 | 86.33 | 46.60 | 32.39 |
| LMCL (Chen et al., 2021a) | 91.91 | 88.52 | 67.01 | 56.86 | 90.87 | 85.91 | 49.24 | 32.88 |
| SSL-HSIC (Li et al., 2021) | 91.95 | 89.99 | 67.23 | 57.01 | 92.09 | 88.91 | 51.37 | 36.03 |
| RELIC (Mitrovic et al., 2021) | 91.96 | 89.35 | 67.24 | 56.88 | 91.15 | 86.21 | 49.17 | 32.97 |
| ICL-MSR (SimCLR + MSR) | 92.34 | 89.47 | 67.59 | 57.64 | 92.03 | 86.94 | 50.12 | 32.88 |
| ICL-MSR (BYOL + MSR) | 92.26 | 90.12 | 66.97 | 57.97 | 93.22 | 89.36 | 52.54 | 37.54 |
| ICL-MSR (LMCL + MSR) | 92.45 | 89.38 | 67.99 | 57.71 | 91.56 | 87.73 | 52.61 | 32.35 |
| ICL-MSR (ReSSL + MSR) | 91.77 | 89.06 | 65.12 | 55.07 | 89.91 | 88.06 | 47.17 | 33.03 |

## 5. Error Bound for Downstream Classification

The classification task is often used to evaluate the performance of CL methods. Therefore, we present the generalization error bound (GEB) of the proposed ICL-MSR for the classification task, which trains a softmax classifier by minimizing the traditional cross-entropy loss (Zhang & Sabuncu, 2018), i.e., $L_{SM}(f; T) = \inf_{W} L_{CE}(W \circ f; T)$, where $W$ is the linear classifier and $T$ is the label. For a feature embedding $f(X)$, the generalization error is defined by $L_{SM}^{T}(f) = \mathbb{E}_X[L_{SM}(f; T)]$. We then investigate how far such a generalization error $L_{SM}^{T}(f)$ is from the contrastive learning objective $\mathcal{L}_{ct}$.

Theorem 5.1. Let $f^* \in \arg\min_f \mathcal{L}_{cl} + \lambda \mathcal{L}_{msr}$. Then with probability at least $1 - \delta$, we have

$$
L_{SM}^{T}(f^*) - \mathcal{L}_{cl}(f^*) \le O\!\left(Q_1\,\mathfrak{R}_{\mathcal{H}}(\lambda) + \sqrt{Q_2/N}\right), \tag{10}
$$

where $M$ is the total number of training samples, $N$ is the mini-batch size, $Q_1 = \sqrt{1 + 1/N}$, $Q_2$ is defined as in Lemma C.1 in the appendix, and $\mathfrak{R}_{\mathcal{H}}(\lambda)$ is the Rademacher complexity. Moreover, $\mathfrak{R}_{\mathcal{H}}(\lambda)$ is monotonically decreasing w.r.t. $\lambda$. The detailed proof can be found in Appendix C. As shown in equation (10), the error bound gradually decreases as the training sample size $M$ increases.
Note that this observation is consistent with traditional supervised learning methods. Also, the error term $\sqrt{Q_2/N}$ involving the mini-batch size $N$ becomes negligible for a large $N$. In this case, a relatively large $N$ also effectively reduces the first error term $Q_1\,\mathfrak{R}_{\mathcal{H}}(\lambda)$, thereby tightening the error bound. Finally, when we enlarge the regularization parameter $\lambda$, the Rademacher complexity $\mathfrak{R}_{\mathcal{H}}(\lambda)$ decreases, further reducing the error bound and improving the generalizability of the contrastive learning algorithm.

## 6. Experiments

### 6.1. Benchmark Datasets

The following datasets are used to evaluate the performance of the proposed ICL-MSR. CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009) are two small-scale datasets consisting of 32x32 images from 10 and 100 classes, respectively. STL-10 (Coates et al., 2011) is derived from ImageNet and consists of more than 100K training samples at 96x96 resolution. Tiny ImageNet (Le & Yang, 2015) can be seen as a simplified version of ImageNet; it contains 100K training samples and 10K testing samples from 200 classes at an image scale of 64x64. ImageNet-100 (Tian et al., 2020a) is a randomly sampled subset of ImageNet with a total of 100 classes. ImageNet (Deng et al., 2009) is a well-known large-scale dataset consisting of about 1.3M training images and 50K test images from 1000 classes.

### 6.2. Implementation Details

The experiments are designed to evaluate the effectiveness of the proposed meta semantic regularizer, so we control for secondary factors such as the neural network architecture. To this end, for CIFAR-10, CIFAR-100, STL-10, and Tiny ImageNet, we use the same feature extractor (ResNet-18) for all compared methods. For ImageNet and ImageNet-100, we use ResNet-50 as the feature extractor. For all datasets, the obtained feature representations are L2-normalized unless otherwise specified. We set $t = 2$ and $n = 6$. For SimCLR, we set $\tau = 0.5$. For BYOL, we use an exponential moving average with a cosine-increasing coefficient starting from 0.99. For all compared methods, the Adam optimizer (Kingma & Ba, 2014) is used for the small- and medium-sized datasets. For CIFAR-10 and CIFAR-100, the number of epochs is set to 1,000 and the learning rate to $3\times 10^{-3}$; for Tiny ImageNet and ImageNet-100, the number of epochs is set to 1,000 and the learning rate to $2\times 10^{-3}$; for STL-10, the number of epochs is set to 2,000 and the learning rate to $2\times 10^{-3}$. For all datasets, we use a learning rate warm-up for the first 500 iterations of the optimizer and a 0.2 learning rate drop 50 and 25 epochs before the end of training. The output dimension of the projection head $f_{ph}$ is set to 1024. The weight decay is set to $10^{-6}$. The output dimension of $f$ is set to 64 for CIFAR-10 and CIFAR-100, and 128 for STL-10 and Tiny ImageNet. Finally, for ImageNet, we keep the implementation and hyperparameters the same as (Chen et al., 2020b; Chuang et al., 2020).

As for image transformations, on CIFAR-10, CIFAR-100, and Tiny ImageNet, the extracted crops use a random size from 0.2 to 1.0 and a random aspect ratio from 3:4 to 4:3 of the input image. Horizontal mirroring is applied with probability 0.5. Color jittering with configuration (0.4, 0.4, 0.4, 0.1) is applied with probability 0.8, and grayscaling with probability 0.1. For ImageNet and ImageNet-100, the crop size varies from 0.08 to 1.0, we use stronger jittering (0.8, 0.8, 0.8, 0.2), and we set the probability of grayscaling to 0.2 and the probability of Gaussian blurring to 0.5.
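For reference, the small-resolution augmentation recipe described above could be expressed with torchvision roughly as follows. This is our own sketch of the stated parameters (crop scale, flip, jitter, grayscale), not the authors' released code; the ImageNet variant would additionally swap in the stronger jitter and a Gaussian blur.

```python
from torchvision import transforms


# Sketch of the CIFAR-10/100 and Tiny ImageNet augmentation described above:
# crop scale 0.2-1.0, aspect ratio 3:4-4:3, flip p=0.5, jitter (0.4, 0.4, 0.4, 0.1) with p=0.8, grayscale p=0.1.
def small_resolution_augmentation(image_size: int) -> transforms.Compose:
    return transforms.Compose([
        transforms.RandomResizedCrop(image_size, scale=(0.2, 1.0), ratio=(3 / 4, 4 / 3)),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
        transforms.RandomGrayscale(p=0.1),
        transforms.ToTensor(),
    ])


# Each image X is transformed twice to obtain the two views X^1, X^2 used by the contrastive loss.
aug = small_resolution_augmentation(32)
# x1, x2 = aug(img), aug(img)
```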
As for the evaluation protocol, we first freeze $f$ after the training phase and then train a supervised linear classifier on top of it. The linear classifier is a fully-connected layer followed by a softmax. We train the linear classifier for 500 epochs with the Adam optimizer. We also report the accuracy of a k-nearest-neighbors classifier (k-nn) with $k = 5$. For the toy experiments, we run four kinds of methods on the COCO dataset (Lin et al., 2014): SimCLR, BYOL, SimCLR+MSR, and BYOL+MSR. During training, we eliminate the samples containing multiple targets in the COCO dataset to ensure that each sample in the training set contains only one target. Meanwhile, we observe that some classes in the COCO dataset contain only a few or no samples. Therefore, we select only 30 categories: {airplane: 0, banana: 1, bear: 2, bed: 3, bench: 4, bird: 5, boat: 6, broccoli: 7, bus: 8, cat: 9, clock: 10, cow: 11, dog: 12, elephant: 13, fire hydrant: 14, giraffe: 15, horse: 16, motorcycle: 17, person: 18, pizza: 19, scissors: 20, sink: 21, stop sign: 22, teddy bear: 23, toilet: 24, traffic light: 25, train: 26, truck: 27, vase: 28, zebra: 29}. The COCO dataset provides ground-truth bounding boxes of the objects in the images, so we take the area inside the bounding box of each image as the foreground and the area outside the bounding box as the background.

### 6.3. Evaluation Results

Figure 4. The experimental results for two kinds of ICL-MSR models. Setting 1 represents training and testing on full images; setting 2 represents training on full images and testing on foreground images.

Figure 4 shows the experimental results of two kinds of proposed ICL-MSR, namely SimCLR+MSR and BYOL+MSR. When training ICL-MSR on full images, we observe that the testing results on full images are comparable with those on foreground images. Moreover, the testing results on full images have increased compared with those in Figure 1. This indicates that the proposed backdoor adjustment-based regularization method is effective and that the background information is indeed a confounding factor that can interfere with the learning process of the feature extractor.

Table 1 shows the experimental results (linear and 5-nn) of the compared methods with a ResNet-18 feature extractor on the small- and medium-sized datasets. The compared methods include SimCLR, BYOL, Barlow Twins, W-MSE, ReSSL, LMCL, SSL-HSIC, and RELIC.
We incorporate the Meta Semantic Regularizer into four models, resulting in four kinds of ICL-MSR (SimCLR+MSR, BYOL+MSR, LMCL+MSR, ReSSL+MSR). The hyper-parameters $\lambda$ and $\gamma$ are set to 1 and 1 for SimCLR+MSR, 0.1 and 1 for BYOL+MSR and LMCL+MSR, and 0.01 and 1 for ReSSL+MSR, respectively.

Table 2 shows the experimental results (Top 1 and Top 5) of the compared methods with a ResNet-50 feature extractor on ImageNet-100. The compared methods include SimCLR, MoCo, CMC, SwAV, DCL, and LMCL. We incorporate the Meta Semantic Regularizer into four models, resulting in four kinds of ICL-MSR (SimCLR+MSR, CMC+MSR, DCL+MSR, LMCL+MSR). The hyper-parameters $\lambda$ and $\gamma$ are set to 0.1 and 1 for SimCLR+MSR, 0.1 and 0.1 for CMC+MSR, and 1 and 1 for DCL+MSR and LMCL+MSR, respectively.

Table 2. Classification accuracy (top 1 and top 5) of a linear classifier for methods with the ResNet-50 feature extractor and negative sample size 4096 on ImageNet-100.

| Method | Top 1 | Top 5 |
|---|---|---|
| SimCLR (Chen et al., 2020a) | 70.15 | 89.75 |
| MoCo (He et al., 2020) | 72.80 | 91.64 |
| CMC (Tian et al., 2020a) | 73.58 | 92.06 |
| SwAV (Caron et al., 2020) | 75.78 | 92.86 |
| DCL (Chuang et al., 2020) | 74.60 | 92.08 |
| LMCL (Chen et al., 2021a) | 75.89 | 92.89 |
| ICL-MSR (SimCLR + MSR) | 72.08 | 91.81 |
| ICL-MSR (CMC + MSR) | 74.60 | 92.87 |
| ICL-MSR (CMC + SwAV) | 75.91 | 92.88 |
| ICL-MSR (DCL + MSR) | 75.68 | 93.17 |
| ICL-MSR (LMCL + MSR) | 76.45 | 93.88 |

Table 3 shows the experimental results (Top 1 and Top 5) of the compared methods with a ResNet-50 feature extractor on ImageNet. The compared methods include SimCLR, MoCo, CMC, CPC, InfoMin Aug., SwAV, BYOL, SSL-HSIC, and RELIC. We incorporate the Meta Semantic Regularizer into two models, resulting in two kinds of ICL-MSR (SimCLR+MSR, BYOL+MSR). The hyper-parameters $\lambda$ and $\gamma$ are set to 0.1 and 1, respectively.

Table 3. Classification accuracy (top 1 and top 5) of a linear classifier for methods with the ResNet-50 feature extractor on ImageNet.

| Method | Top 1 | Top 5 |
|---|---|---|
| SimCLR (Chen et al., 2020a) | 69.3 | 89.0 |
| MoCo (He et al., 2020) | 71.1 | - |
| CMC (Tian et al., 2020a) | 66.2 | 87.0 |
| CPC (Henaff, 2020) | 63.8 | 85.3 |
| InfoMin Aug. (Tian et al., 2020b) | 73.0 | 91.1 |
| SwAV (Caron et al., 2020) | 75.3 | - |
| BYOL (Grill et al., 2020) | 74.3 | 91.6 |
| RELIC (Mitrovic et al., 2021) | 74.8 | 92.2 |
| SSL-HSIC (Li et al., 2021) | 72.2 | 90.7 |
| ICL-MSR (SimCLR + MSR) | 70.9 | 90.5 |
| ICL-MSR (SwAV + MSR) | 75.5 | - |
| ICL-MSR (BYOL + MSR) | 75.4 | 92.6 |

From the three tables, we can observe that the performance of our proposed ICL-MSR is consistently better than that of the corresponding baselines. For Table 1, the best results always appear in BYOL+MSR and LMCL+MSR. For Table 2, the best results are obtained by LMCL+MSR. For Table 3, the ICL-MSR variants (SwAV+MSR and BYOL+MSR) outperform all compared methods. This indicates the effectiveness of the proposed ICL-MSR. Note that RELIC is a causality-related method that focuses on different augmentations. We can observe that, in Table 1, three of the four proposed methods outperform RELIC, and in Table 3, BYOL+MSR outperforms RELIC. This indicates that background information is indeed a confounding factor for learning a good feature extractor. We summarize two possible reasons why the improvement is less than 1% in most cases. The first is that ICL-MSR is more suitable for image data with larger resolutions: improvements of less than 1% mostly occur on datasets with small resolutions, e.g., the 32x32 CIFAR-10 and CIFAR-100 datasets, and an image with a smaller resolution already contains less background information. The second is that ICL-MSR may be sensitive to the hyperparameter $n$; to reduce computational complexity, we set $n = 6$ for all datasets.
We evaluate the performance obtained when fine-tuning the representation learned by ICL-MSR on a classification task with a small subset of ImageNet's training dataset. We follow the semi-supervised protocol of (Chen et al., 2020a; Chuang et al., 2020) and use the same fixed splits of 1% and 10% of the labeled ImageNet training dataset, respectively. We report both top-1 and top-5 accuracies on the test dataset in Table 4. We set $n = 14$. We observe that the proposed ICL-MSR outperforms the compared methods in all cases, which indicates that the proposed ICL-MSR is also effective for fine-tuning tasks.

Table 4. Semi-supervised training with a fraction of ImageNet labels (1% and 10%). Classification accuracy (top 1 and top 5) of a linear classifier for methods with the ResNet-50 feature extractor.

| Method | 1% Top 1 | 1% Top 5 | 10% Top 1 | 10% Top 5 |
|---|---|---|---|---|
| SimCLR (Chen et al., 2020a) | 48.3 | 75.5 | 65.6 | 87.8 |
| BYOL (Grill et al., 2020) | 53.2 | 78.4 | 68.8 | 89.0 |
| ICL-MSR (SimCLR + MSR) | 50.7 | 77.1 | 66.9 | 89.6 |
| ICL-MSR (BYOL + MSR) | 55.5 | 80.6 | 70.5 | 90.7 |

### 6.4. Hyperparametric Analysis

Figure 5. Impacts of the hyperparameters (a) $\lambda$, (b) $\gamma$, and (c) the number of semantic weight vectors $n$.

To understand the impact of the hyper-parameters, we use SimCLR+MSR on the CIFAR-10 dataset. Specifically, $\lambda$ controls the impact of the MSR, $\gamma$ controls the impact of the uniformity loss, and $n$ is the number of learned weight vectors. We first set $\gamma = 1$, $n = 6$ and select $\lambda$ from the range $\{10^{-3}, 10^{-2}, \dots, 10^{3}\}$. The results are shown in Figure 5(a). We observe that ICL-MSR achieves the best accuracy when $\lambda = 1$, which illustrates that the proposed MSR is effective. Then, we set $\lambda = 1$, $n = 6$ and select $\gamma$ from the range $\{10^{-3}, 10^{-2}, \dots, 10^{3}\}$. From Figure 5(b), the best result corresponds to $\gamma = 1$, which indicates that constraining the distribution of the weight matrix to a uniform distribution is effective. Finally, we set $\lambda = \gamma = 1$ and select $n$ from the range $\{2, 4, \dots, 12\}$. From Figure 5(c), a proper number of semantic weight vectors can promote the performance of ICL-MSR. We also note that the accuracy is the lowest when $\gamma = 10^{3}$, which indicates that an improper weighting of the uniformity constraint can discard semantic information.

## 7. Conclusions

In this paper, based on toy experiments on the COCO dataset with four experimental settings, we find two contradictory conclusions. We then build a Structural Causal Model (SCM) to explain this contradiction and propose to regard the background as a confounder. To tackle this problem, we propose a regularization method based on backdoor adjustment. Our method can be easily incorporated into most existing CL methods. We demonstrate the effectiveness of the proposed method both theoretically and empirically.

## Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments. This work is supported in part by National Natural Science Foundation of China No. 61976206 and No. 61832017, Key Special Project for Introduced Talents Team of Southern Marine Science and Engineering Guangdong Laboratory (Guangzhou) No. GML2019ZD0603, National Key Research and Development Program of China No. 2019YFB1405100, Beijing Outstanding Young Scientist Program No. BJJWZYJH012019100020098, Beijing Academy of Artificial Intelligence (BAAI), the Fundamental Research Funds for the Central Universities, the Research Funds of Renmin University of China 21XNLG05, and Public Computing Cloud, Renmin University of China. This work is also supported in part by Intelligent Social Governance Platform, Major Innovation & Planning Interdisciplinary Platform for the Double-First Class Initiative, Renmin University of China, and Public Policy and Decision-making Research Lab of Renmin University of China.

## References

Arora, S., Khandeparkar, H., Khodak, M., Plevrakis, O., and Saunshi, N. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019.

Borodachov, S. V., Hardin, D. P., and Saff, E. B. Discrete energy on rectifiable sets. Springer, 2019.
Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., and Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882, 2020.

Chen, S., Niu, G., Gong, C., Li, J., Yang, J., and Sugiyama, M. Large-margin contrastive learning with distance polarization regularizer. In International Conference on Machine Learning, pp. 1673-1683. PMLR, 2021a.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597-1607. PMLR, 2020a.

Chen, T., Kornblith, S., Swersky, K., Norouzi, M., and Hinton, G. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020b.

Chen, X. and He, K. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750-15758, 2021.

Chen, X., Fan, H., Girshick, R., and He, K. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020c.

Chen, X., Xie, S., and He, K. An empirical study of training self-supervised vision transformers. arXiv preprint arXiv:2104.02057, 2021b.

Chuang, C.-Y., Robinson, J., Yen-Chen, L., Torralba, A., and Jegelka, S. Debiased contrastive learning. Advances in Neural Information Processing Systems, 2020.

Coates, A., Ng, A., and Lee, H. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 215-223. JMLR Workshop and Conference Proceedings, 2011.

Cohn, H. and Kumar, A. Universally optimal distribution of points on spheres. Journal of the American Mathematical Society, 20(1):99-148, 2007.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009.

Ermolov, A., Siarohin, A., Sangineto, E., and Sebe, N. Whitening for self-supervised representation learning. In International Conference on Machine Learning, pp. 3015-3024. PMLR, 2021.

Glymour, M., Pearl, J., and Jewell, N. P. Causal inference in statistics: A primer. John Wiley & Sons, 2016.

Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al. Bootstrap your own latent: a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271-21284, 2020.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729-9738, 2020.

Henaff, O. Data-efficient image recognition with contrastive predictive coding. In International Conference on Machine Learning, pp. 4182-4192. PMLR, 2020.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Le, Y. and Yang, X. Tiny ImageNet visual recognition challenge. CS 231N, 7(7):3, 2015.

Li, J., Zhou, P., Xiong, C., and Hoi, S. C. Prototypical contrastive learning of unsupervised representations. arXiv preprint arXiv:2005.04966, 2020.

Li, Y., Pogodin, R., Sutherland, D. J., and Gretton, A. Self-supervised learning with kernel dependence maximization. arXiv preprint arXiv:2106.08320, 2021.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pp. 740-755. Springer, 2014.

Mitrovic, J., McWilliams, B., Walker, J., Buesing, L., and Blundell, C. Representation learning via invariant causal mechanisms. In International Conference on Learning Representations (ICLR), 2021.

Nozawa, K. and Sato, I. Understanding negative samples in instance discriminative self-supervised representation learning. arXiv preprint arXiv:2102.06866, 2021.

Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Sordoni, A., Dziri, N., Schulz, H., Gordon, G., Bachman, P., and Des Combes, R. T. Decomposed mutual information estimation for contrastive representation learning. In International Conference on Machine Learning, pp. 9859-9869. PMLR, 2021.

Tian, Y., Krishnan, D., and Isola, P. Contrastive multiview coding. In Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XI, pp. 776-794. Springer, 2020a.

Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C., and Isola, P. What makes for good views for contrastive learning? arXiv preprint arXiv:2005.10243, 2020b.

Tian, Y., Chen, X., and Ganguli, S. Understanding self-supervised learning dynamics without contrastive pairs. arXiv preprint arXiv:2102.06810, 2021.

Von Kügelgen, J., Sharma, Y., Gresele, L., Brendel, W., Schölkopf, B., Besserve, M., and Locatello, F. Self-supervised learning with data augmentations provably isolates content from style. Advances in Neural Information Processing Systems, 34, 2021.

Wang, T. and Isola, P. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pp. 9929-9939. PMLR, 2020.

Wen, Z. and Li, Y. Toward understanding the feature learning process of self-supervised contrastive learning. arXiv preprint arXiv:2105.15134, 2021.

Zbontar, J., Jing, L., Misra, I., LeCun, Y., and Deny, S. Barlow twins: Self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230, 2021.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818-833. Springer, 2014.

Zhang, Z. and Sabuncu, M. R. Generalized cross entropy loss for training deep neural networks with noisy labels. In 32nd Conference on Neural Information Processing Systems (NeurIPS), 2018.

Zheng, M., You, S., Wang, F., Qian, C., Zhang, C., Wang, X., and Xu, C. ReSSL: Relational self-supervised learning with weak augmentation. Advances in Neural Information Processing Systems, 34, 2021.

Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921-2929, 2016.
## Appendix A. The Training Process

Algorithm 1: ICL-MSR

Input: $N$: mini-batch size; $f$: encoder; $f_{msr}$: meta semantic regularizer; $f_{ph}$: projection head; $\lambda, \gamma, n, t, \tau$: hyper-parameters; $\alpha, \beta$: learning rates.

1: repeat
2: for each training iteration do
3: Sample a mini-batch $X_{tr} = \{X_i\}_{i=1}^{N}$.
4: # regular contrastive training step
5: $f \leftarrow f - \alpha \nabla_f \mathcal{L}_{to}$
6: $f_{ph} \leftarrow f_{ph} - \alpha \nabla_{f_{ph}} \mathcal{L}_{to}$
7: end for
8: for each training iteration do
9: Sample a mini-batch $X_{tr} = \{X_i\}_{i=1}^{N}$.
10: # compute fast weights
11: # retain the computational graph
12: $f^1 \leftarrow f - \alpha \nabla_f \mathcal{L}_{to}$
13: $f_{ph}^1 \leftarrow f_{ph} - \alpha \nabla_{f_{ph}} \mathcal{L}_{to}$
14: # meta training step using second derivatives
15: $f_{msr} \leftarrow f_{msr} - \beta \nabla_{f_{msr}} \left[\mathcal{L}_{ct}\!\left(f^1, f_{ph}^1\right) + \gamma \mathcal{L}_{uni}\right]$
16: end for
17: until $f$, $f_{ph}$, and $f_{msr}$ converge.

## Appendix B. Derivation of Equation (2)

We first give definitions of a path, d-separation, and the backdoor criterion. From (Glymour et al., 2016), we have:

Definition B.1 (Path). A path consists of three kinds of components: the chain structure ($A \rightarrow B \rightarrow C$ or $A \leftarrow B \leftarrow C$), the bifurcating (fork) structure ($A \leftarrow B \rightarrow C$), and the collision (collider) structure ($A \rightarrow B \leftarrow C$).

Definition B.2 (d-separation). A path $p$ is blocked by a set of nodes $Z$ if and only if: 1. $p$ contains a chain of nodes $A \rightarrow B \rightarrow C$ or a fork $A \leftarrow B \rightarrow C$ such that the middle node $B$ is in $Z$ (i.e., $B$ is conditioned on), or 2. $p$ contains a collider $A \rightarrow B \leftarrow C$ such that the collision node $B$ is not in $Z$, and no descendant of $B$ is in $Z$. If $Z$ blocks every path between two nodes $X$ and $Y$, then $X$ and $Y$ are d-separated conditional on $Z$, and thus are independent conditional on $Z$.

Definition B.3 (The Backdoor Criterion). Given an ordered pair of variables $(X, Y)$ in a directed acyclic graph $G$, a set of variables $Z$ satisfies the backdoor criterion relative to $(X, Y)$ if no node in $Z$ is a descendant of $X$, and $Z$ blocks every path between $X$ and $Y$ that contains an arrow into $X$. If a set of variables $Z$ satisfies the backdoor criterion for $X$ and $Y$, then the causal effect of $X$ on $Y$ is given by the formula:

$$
P(Y = y \,|\, do(X = x)) = \sum_{z} P(Y = y \,|\, X = x, Z = z)\,P(Z = z). \tag{11}
$$

In the proposed SCM, $X_s$ satisfies the backdoor criterion relative to $(X_i^{3-j}, Y(X_i^{3-j}))$, so applying equation (11) with the confounder $X_s$ stratified as $\{Z_s^t\}_{t=1}^{n}$ yields equation (2).

## Appendix C. Proof for Theorem 5.1

We first state the following lemma.

Lemma C.1 ((Arora et al., 2019)). Assume that $f^* \in \arg\min_f \mathcal{L}_{cl} + \lambda \mathcal{L}_{msr}$. Then with probability at least $1 - \delta$ over the training data $X = \{X_1, X_2, \dots, X_M\}$, for any $f \in \mathcal{H}$,

$$
L_{SM}^{T}(f^*) \le \mathcal{L}_{cl}(f) + O\!\left(Q_1\,\mathfrak{R}_{\mathcal{H}}(\lambda) + \sqrt{Q_2/N}\right),
$$

where $N$ is the number of negative pairs, $Q_1 = \sqrt{1 + 1/M}$, $Q_2 = \log(1/\delta)\log^2(M)$, the Rademacher complexity is defined as $\mathfrak{R}_{\mathcal{H}}(\lambda) = \mathbb{E}_{\sigma \in \{\pm 1\}^{3dN}}\!\left[\sup_{f \in \mathcal{H}(\lambda)} \langle \sigma, f \rangle\right]$, $d$ is the dimension of the learned feature representation, and the restricted hypothesis space is defined as $\mathcal{H}(\lambda) = \{f \,|\, f \in \mathcal{H} \text{ and } R_1(f) \le 4/\lambda\}$.

Theorem C.2. Let $f^* \in \arg\min_f \mathcal{L}_{cl} + \lambda \mathcal{L}_{msr}$. Then with probability at least $1 - \delta$,

$$
L_{SM}^{T}(f^*) - \mathcal{L}_{cl}(f^*) \le O\!\left(Q_1\,\mathfrak{R}_{\mathcal{H}}(\lambda) + \sqrt{Q_2/N}\right),
$$

where $M$ is the total number of training samples, $N$ is the negative pair size, $Q_1 = \sqrt{1 + 1/N}$, and $\mathfrak{R}_{\mathcal{H}}(\lambda)$ is the Rademacher complexity. Moreover, $\mathfrak{R}_{\mathcal{H}}(\lambda)$ is monotonically decreasing w.r.t. $\lambda$.

Proof. For the traditional cross-entropy loss $L_{SM}^{T}(f)$, we have

$$
\begin{aligned}
L_{SM}^{T}(f) &= \mathbb{E}_X\!\left[\inf_{W} L_{CE}(W \circ f, T)\right] \\
&\le \mathbb{E}\!\left[-\log \frac{e^{f(X)^{\top} u_{c^+}}}{e^{f(X)^{\top} u_{c^+}} + \sum_{c} e^{f(X)^{\top} u_{c}}}\right] \\
&\le \mathbb{E}\!\left[-\log \frac{e^{f(X)^{\top} \mathbb{E}_{X^+}[f(X^+)]}}{e^{f(X)^{\top} \mathbb{E}_{X^+}[f(X^+)]} + M\,\mathbb{E}_X\!\left[e^{f(X)^{\top} \mathbb{E}_{X^-}[f(X^-)]}\right]}\right].
\end{aligned}
$$

Then, combining the above with Lemma C.1, we can obtain

$$
L_{SM}^{T}(f^*) - \mathcal{L}_{cl}(f) \le \mathcal{L}_{cl}(f^*) - \mathcal{L}_{cl}(f) + O\!\left(Q_1\,\mathfrak{R}_{\mathcal{H}}(\lambda) + \sqrt{Q_2/N}\right).
$$

Therefore, we have

$$
L_{SM}^{T}(f^*) - \mathcal{L}_{cl}(f^*) \le O\!\left(Q_1\,\mathfrak{R}_{\mathcal{H}}(\lambda) + \sqrt{Q_2/N}\right).
$$