# Rethinking Weak Supervision in Helping Contrastive Learning

Jingyi Cui 1 * Weiran Huang 2 3 * Yifei Wang 4 * Yisen Wang 1 5

*Equal contribution. 1National Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University. 2Qing Yuan Research Institute, Shanghai Jiao Tong University. 3Huawei Noah's Ark Lab. 4School of Mathematical Sciences, Peking University. 5Institute for Artificial Intelligence, Peking University. Correspondence to: Yisen Wang. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Contrastive learning has shown outstanding performances in both supervised and unsupervised learning, and has recently been introduced to solve weakly supervised learning problems such as semi-supervised learning and noisy label learning. Despite the empirical evidence showing that semi-supervised labels improve the representations of contrastive learning, it remains unknown whether noisy supervised information can be directly used in training instead of after manual denoising. Therefore, to explore the mechanical differences between semi-supervised and noisy-labeled information in helping contrastive learning, we establish a unified theoretical framework of contrastive learning under weak supervision. Specifically, we investigate the most intuitive paradigm of jointly training supervised and unsupervised contrastive losses. By translating the weakly supervised information into a similarity graph under the framework of spectral clustering based on the posterior probability of weak labels, we establish the downstream classification error bound. We prove that semi-supervised labels improve the downstream error bound, whereas noisy labels have limited effects under such a paradigm. Our theoretical findings here provide new insights for the community to rethink the role of weak supervision in helping contrastive learning.

1. Introduction

Contrastive learning has shown state-of-the-art empirical performances in unsupervised representation learning (Chen et al., 2020; He et al., 2020; Chen & He, 2021; Wang et al., 2021a). It learns good representations of high-dimensional observations from a large amount of unlabeled data by pulling together an anchor and its augmented views in the embedding space. On the other hand, supervised contrastive learning (Khosla et al., 2020) uses same-class examples and their corresponding augmentations as positive samples, and achieves significantly better performance than both unsupervised contrastive learning and state-of-the-art classification losses, e.g., the cross-entropy loss. While accurate supervised signals are not always available, a large amount of weak supervision accompanies the data. This makes us wonder: can weak supervision help contrastive learning? There are two major types of weak supervision. The first is semi-supervised information, where supervised labels are only available on a small fraction of samples. The second is noisy-labeled information, where labels are available for all samples but unreliable, i.e., the labels can possibly be wrong. Empirical evidence has shown that semi-supervised information can be directly used in positive sample selection to help improve the representations of contrastive learning by jointly training the supervised and unsupervised contrastive losses (Assran et al., 2020; Acharya et al., 2022).
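To make the two paradigms concrete, the sketch below (not the authors' code; all names are illustrative) shows how positives are selected in unsupervised contrastive learning versus supervised contrastive learning for a batch of 2N augmented views, two views per image.

```python
# Minimal sketch: positive selection in the two paradigms discussed above,
# for a batch of 2N augmented views (two views per image).
import torch

def simclr_positive_mask(batch_size: int) -> torch.Tensor:
    """Unsupervised: the only positive of view i is the other augmented view
    of the same image (indices i and i + batch_size)."""
    n = 2 * batch_size
    mask = torch.zeros(n, n, dtype=torch.bool)
    idx = torch.arange(batch_size)
    mask[idx, idx + batch_size] = True
    mask[idx + batch_size, idx] = True
    return mask

def supcon_positive_mask(labels: torch.Tensor) -> torch.Tensor:
    """Supervised: every other view carrying the same (possibly weak) label is a positive."""
    labels = labels.repeat(2)                      # each image contributes two views
    mask = labels.unsqueeze(0) == labels.unsqueeze(1)
    mask.fill_diagonal_(False)                     # a view is not its own positive
    return mask
```

With noisy labels, `supcon_positive_mask(noisy_labels)` silently pulls together samples whose noisy labels agree, which is exactly the effect studied in the rest of the paper.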
By contrast, for noisy-labeled information, most methodological studies use contrastive learning as a tool to select confident samples based on the learned representations (Yao et al., 2021; Ortego et al., 2021; Li et al., 2022; Zhang et al., 2022a), whereas none of the existing literature demonstrates whether noisy label information can help improve the representations of contrastive learning. Therefore, we wonder whether the conclusion on noisy-labeled information is the same as that on semi-supervised information. If not, what are the differences between semi-supervised and noisy-labeled information in helping contrastive learning?

In this paper, we show the awkward fact that noisy labels have a limited effect on the representations of contrastive learning under the joint training paradigm, and that the winner of supervised and unsupervised contrastive learning itself serves as an embarrassingly strong baseline. As a preview of results, in Figure 1 we plot the results of unsupervised contrastive learning trained without labels (SimCLR (Chen et al., 2020)), supervised contrastive learning trained with noisy labels (SupCon (Khosla et al., 2020)), the winner of SupCon and SimCLR (Max), and joint training of SimCLR and SupCon under label noise (Joint Training).

[Figure 1: The winner of SupCon and SimCLR serves as a strong baseline of joint training under label noise on the CIFAR-10 dataset. The x-axis is the noise rate (0.0-0.6); curves show SimCLR, SupCon, Joint Training, and Max(SimCLR, SupCon).]

We conduct experiments under symmetric label noise with noise rates ranging from 0% to 60% in steps of 10%. We evaluate the trained representations by linear probing on the clean test data. The weight parameter θ of Joint Training is tuned in {0, 0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0}, and the best linear probing accuracy under the optimal θ is reported. We show that supervised training outperforms unsupervised training when the noise rate is low, and the situation reverses when the noise rate is high. Nonetheless, the joint training of SimCLR and SupCon (with finely tuned weights) has only a limited advantage over the winner of SupCon and SimCLR. Moreover, its performance coincides with that of SupCon when the noise rate is 0, and converges to that of SimCLR as the noise rate increases. This indicates that noisy labels provide limited help to contrastive representation learning under the paradigm of joint training, and that the winner of SupCon and SimCLR itself serves as a strong baseline.

To explain the above phenomena in depth and explore the mechanical differences between semi-supervised and noisy-labeled information in helping contrastive learning, we establish a unified theoretical framework of weakly supervised contrastive learning that applies to both semi-supervised and noisy-labeled settings. We investigate the joint training of supervised and unsupervised contrastive losses, and take spectral contrastive learning (HaoChen et al., 2021) as a performance proxy for (standard) contrastive learning to conduct the theoretical analysis. Based on the posterior probability of labeled samples, we translate the weakly supervised information (under the symmetric label noise assumption) into a similarity graph under the framework of spectral clustering. This enables us to analyze the effect of the label information on the augmentation graph, and consequently on the derived error bound.
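The joint training paradigm just described is formalized as Eq. (9) in Section 4.3; a minimal sketch is given below. Here `simclr_loss` and `supcon_loss` are placeholders for existing loss implementations (e.g., InfoNCE and the SupCon loss), not a specific API.

```python
# Minimal sketch of the joint objective analyzed in this paper:
# L_joint = (1 - theta) * L_unsup + theta * L_sup (cf. Eq. (9) in Section 4.3).
import torch

def joint_contrastive_loss(features: torch.Tensor,
                           noisy_labels: torch.Tensor,
                           theta: float,
                           simclr_loss, supcon_loss) -> torch.Tensor:
    """features: embeddings of 2N augmented views; noisy_labels: N (possibly noisy) labels."""
    l_unsup = simclr_loss(features)                 # positives = augmentations of the anchor
    l_sup = supcon_loss(features, noisy_labels)     # positives = samples with the same (noisy) label
    return (1.0 - theta) * l_unsup + theta * l_sup
```

Setting theta = 0 recovers SimCLR and theta = 1 recovers SupCon; Figure 1 reports the best linear-probing accuracy over a grid of theta values for each noise rate.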
Accordingly, we prove that under the semi-supervised setting, the label information helps improve the downstream error bound, whereas under the noisy-labeled setting, the joint training is no better than the winner of supervised and unsupervised contrastive learning in terms of an error bound. The contributions of this paper are summarized as follows. We for the first time establish a theoretical framework for contrastive learning under weak supervision, including noisy label learning and semi-supervised learning. By formulating the label information into a similarity graph based on the posterior probability of labels, we derive the downstream error bound of jointly trained contrastive learning losses. We prove that semisupervised labels improve the downstream error bound compared with unsupervised learning, whereas under the noisy-labeled setting, joint training fails to improve the error bound compared with the winner of supervised and unsupervised contrastive learning. We empirically verify that noisy labels have only limited help to contrastive representation learning under the paradigm of joint training. Thus, complex designs such as label denoising are still required for leveraging noisy labeled information in improving supervised contrastive learning. 2. Related Works Theoretical Frameworks of Contrastive Learning. The theoretical frameworks of unsupervised contrastive learning can be divided into two major categories. The first category is devoted to building the relationship between unsupervised contrastive learning and supervised downstream classification. Arora et al. (2019) first introduce the concept of latent classes, hypothesizes that semantically similar points are sampled from the same latent class, and proves that the unsupervised contrastive loss serves as an upper bound of downstream supervised learning loss. Nozawa & Sato (2021); Ash et al. (2022); Bao et al. (2022) further investigate the effect of negative samples, and establish surrogate bounds for the downstream classification loss that better match the empirical observations on the negative sample size. However, studies in this category have to assume the existence of supervised latent classes, and that the positive pairs are conditionally independently drawn from the same latent class. This assumption fails to distinguish between supervised and unsupervised contrastive learning, and therefore cannot be used to analyze the weakly supervised setting. Another major approach is to analyze contrastive learning by modeling the feature similarity. Hao Chen et al. (2021) first introduce the concept of the augmentation graph to represent the feature similarity of the augmented samples, and analyzes contrastive learning from the perspective of spectral clustering. Shen et al. (2022) use a stochastic block model to analyze spectral contrastive learning for the problem of unsupervised domain adaption. Similarly, Wang et al. (2021b) propose the concept of augmentation overlap to Rethinking Weak Supervision in Helping Contrastive Learning formulate how the positive samples are aligned. Moreover, contrastive learning is also understood through other existing theoretical frameworks of unsupervised learning, such as nonlinear independent component analysis (Zimmermann et al., 2021), neighborhood component analysis (Ko et al., 2022), variational autoencoder (Aitchison, 2021), stochastic neighbor embedding (Hu et al., 2023), geometric analysis of embedding space (Huang et al., 2023), and message passing (Wang et al., 2023). 
In this paper, we follow the second category of contrastive learning approaches, and formulate the weakly supervised information into a similarity graph based on both label and feature information. Contrastive Learning for Noisy Label Learning. Ghosh & Lan (2021) first find that pretraining with contrastive learning improves robustness to label noise through empirical evidence. Many methodological studies are carried out for noisy label learning with the help of contrastive learning. Yao et al. (2021); Ortego et al. (2021); Li et al. (2022) use representations learned from unsupervised contrastive learning to filter out confident samples from all noisy ones, and in turn use the confident samples to conduct supervised contrastive learning to generate better representations. Navaneet et al. (2022) introduce additional semantically similar supervision to contrastive representation learning by incorporating nearest neighbors under certain constraints as additional positive samples, which also adapts to noisy label learning. By contrast, Yan et al. (2022) follow the idea of negative learning (Kim et al., 2019; 2021b), and leverages the negative correlations from the noisy data to avoid same-class negatives in contrastive learning. Chuang et al. (2022) propose a robust contrastive loss function inspired by the symmetric losses that are proved to be noise tolerant. Very recently, Zhang et al. (2022a) use contrastive learning to handle noisy labels of long-tailed data. For theoretical studies, Cheng et al. (2021) analyze the robustness of crossentropy with SSL features, and Xue et al. (2022) prove the robustness of downstream classifier in contrastive learning. Contrastive Learning for Semi-supervised Learning. Lee et al. (2022); Yang et al. (2022) use contrastive regularization to enhance the reliability of pseudo-labeling in semisupervised learning. Kim et al. (2021a) introduce a semisupervised learning method that combines self-supervised contrastive pre-training and semi-supervised fine-tuning based on augmentation consistency regularization. Zhang et al. (2022b) use contrastive loss to model pairwise similarities among samples, generates pseudo labels from the cross entropy loss, and in turn calibrates the prediction distribution of the two branches. To conclude, the existing studies of contrastive learning under weak supervision mainly focus on using contrastive learning as a tool to improve the weakly supervised learning performance, whereas to the best of our knowledge, none of the previous works reveals how weak supervision helps contrastive learning. To fill in the blank, in this paper, we establish a theoretical framework for contrastive learning under weak supervision, and show the effects of semi-supervised and noisy-labeled information on the error bounds of contrastive learning. 3. Preliminaries Notations. Suppose that random variables X X := Rd, and Y [r] := {1, . . . , r}. Let the input natural data {( xi, yi)}i [N] be i.i.d. sampled from the joint distribution P( X, Y ). Given a natural data x X, we use A( | x) to denote the distribution of its augmentations and use X to denote the set of all augmented data, which is assumed to be finite but exponentially large. Denote n = |X|. 3.1. Spectral Contrastive Learning In Hao Chen et al. (2021), an augmentation graph G is used to describe the distribution of augmented samples, where the edge weight wxx := E x P[A(x| x)A(x | x)] denotes the marginal probability of generating augmented views x and x from the same natural data. 
Due to the total probability mass, P x,x X wxx = 1. The adjacent matrix of the augmentation graph is denoted as A := (wxx )x,x X Rn n, and the normalized adjacent matrix is denoted as A := D 1/2AD 1/2, where D := diag(wx)x X , and wx := P x X wxx . In this paper, we consider the spectral contrastive loss L(f) proposed by Hao Chen et al. (2021), that is, for an embedding function f : X Rk, 2 Ex,x+[f(x) f(x+)] + Ex,x h f(x) f(x ) 2i . (1) Spectral contrastive loss is proved to be equivalent to the matrix factorization loss, i.e. for F Rn k := (ux)x X , ux := w1/2 x f(x), Lmf(F) := A FF 2 F = L(f) + const. (2) 3.2. Noisy Label Learning Recall that we denote the true label of a given instance x X is y. One common assumption of the generation procedure of label noise is as follows. Given the true labels, the noisy label is randomly flipped to another label y with some probability. In this paper, we take the widely adopted symmetric label noise assumption as an example. For notational simplicity, we write the symmetric label noise assumption in matrix form. Denote Y := (ηj(xi))i [n],j [r], ηj(x) = P(Y = j|x), as the posterior Rethinking Weak Supervision in Helping Contrastive Learning probability matrix of the clean label distribution, and denote Y := ( ηj(xi))i [n],j [r], ηj(x) = P( Y = j|x), as the noisy label distribution. In Assumption 3.1, we assume that the flipping probability is conditional independent of the input data, and that the flipping probability to all other classes is uniformly at random. Assumption 3.1. For symmetric label noise with noise rate γ (0, 1), we denote the transition matrix T = (ti,j)i [r],j [r], where ti,i = 1 γ, and ti,j = γ r 1, for j = i. (3) Then the noisy label posterior distribution is assumed to be Y = Y T . (4) Under Assumption 3.1, T is symmetric. Specifically, when γ = 0, T degenerates to the identity matrix Ir r. Moreover, to guarantee PAC-learnability, we usually assume the true label is the dominating class, i.e. γ < r 1 3.3. Semi-Supervised Learning For j [r], let nj be the number of labeled samples of Class j. Let n L = P j [r] n L,j be the number of all labeled samples, and n U be the number of unlabeled samples. Obviously, we have n L + n U = n. Usually, the number of labeled samples is much smaller than that of the unlabeled ones because human annotation is costly and labor-intensive. That is, we can naturally assume n L n U. In the following parts of the paper, we analyze the settings of noisy label learning and semi-supervised learning in a unified framework. Without loss of generality, we assume (x1, . . . , xn L) is labeled with noise rate γ [0, r 1 r ), and denote the corresponding clean and noisy posterior probability matrices as Y L and Y L, respectively. Then we have Y L = Y LT . Specifically, when γ = 0, our analyzing framework degenerates to the standard setting of semi-supervised learning, and when n L = n, our analyzing framework reduces to the standard noisy label learning. 4. Mathematical Formulations We mention that our formulation of similarity graph is not a distributional assumption on the underlying similarity among data, but to formulate a possible probability of drawing positive samples in contrastive learning that takes both label and feature information into consideration. Specifically, in Sections 4.1 and 4.2, we only discuss the similarity graph induced by the weakly supervised labels and neglected feature similarity. 
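Before building these similarity graphs, note that Assumption 3.1 has a simple closed form that is used repeatedly below. The following NumPy check (toy values only, not from the paper) constructs the transition matrix T, the noisy posterior Ỹ = Y T, and verifies the identity T² = αI + β·11ᵀ with α = (1 − rγ/(r−1))² and β = (γ/(r−1))(2 − rγ/(r−1)), which appears in the proof of Lemma 4.1.

```python
# Numerical check of Assumption 3.1 (symmetric label noise) with toy values.
import numpy as np

r, gamma = 4, 0.3                                   # 4 classes, 30% noise (illustrative)
T = (1 - gamma) * np.eye(r) + gamma / (r - 1) * (np.ones((r, r)) - np.eye(r))

alpha = (1 - r * gamma / (r - 1)) ** 2
beta = gamma / (r - 1) * (2 - r * gamma / (r - 1))
assert np.allclose(T @ T, alpha * np.eye(r) + beta * np.ones((r, r)))

# Noisy posteriors of a few labeled samples: rows still sum to one after flipping.
Y = np.eye(r)[np.array([0, 1, 2, 3, 0, 1])]          # deterministic clean posteriors
Y_tilde = Y @ T
assert np.allclose(Y_tilde.sum(axis=1), 1.0)
```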
Note that in Section 4.2 we investigate the setting of semi-supervised noisy labels, so as to include semi-supervised learning and noisy label learning in a unified framework. Then in Section 4.3, we take both label and feature similarity into consideration through the convex combination to describe the joint training loss. 4.1. Similarity Graph Describing Noisy Labels To leverage the labeled information in the form of a similarity graph, we first consider a simple example where noise rate γ = 0 and the label distribution is deterministic, i.e. for a sample x with true label y, the posterior probability ηy(x) = 1 and ηj(x) = 0 for j = y. In this case, we can naturally assume that in the label similarity graph, the intra-class vertices are fully connected and the inter-class vertices are disconnected. That is, wxx = 1 if x and x has the same label and otherwise wxx = 0. Then we consider the more general stochastic label scenario. Recall that for unsupervised spectral contrastive learning, the edge weight wxx in an augmentation graph G describes the marginal probability of generating x and x from the same natural data. That is, wxx describes the joint probability of a pair of positive samples. Similarly, since the positive samples for supervised contrastive learning (Khosla et al., 2020) are selected as all same-class samples, we can naturally define the edge weight wxx as the probability of two views x and x generating from the same class, i.e. wxx = P j [r] ηj(x)ηj(x ), and therefore AL := Y LY L. Moreover, we denote A as the normalized adjacent matrix. For the simplicity of notations, we consider the case where the data is class-balanced, i.e. n1 = . . . = nr = n L/r. Then we have A = r n L A. Next, we add label noise to our mathematical formulations. To be specific, when performing supervised contrastive learning based on noisy labeled data, we naturally select positive samples as the samples with the same noisy labeled data. According to Assumption 3.1, we have Y L = Y LT , where T is symmetric. Then the adjacent matrix of the similarity graph describing noisy labels is formulated as A L : = Y L Y L = Y LT (Y LT ) = Y LT T Y L = Y LT 2Y L. (5) Similarly, when data is class balanced, we have the normalized adjacent matrix A L = n L 4.2. Similarity Graph Describing Semi-Supervised Noisy Labels Under the setting of semi-supervised learning, we have no prior knowledge about the label information of the unlabeled samples. Therefore, from the perspective of unsupervised contrastive learning, the unlabeled samples can be viewed as having unique class labels. Therefore, to construct the similarity graph, we attach sample-specific labels to the unlabeled samples. Thus, the posterior probability matrix of unlabeled samples Y U is an identity matrix In U n U . Note Rethinking Weak Supervision in Helping Contrastive Learning that here we only discuss the similarity graph of supervised information, so the feature similarity between samples is not included in the similarity graph. Combining both labeled and unlabeled samples, the posterior probability matrix of all semi-supervised samples can be denoted as Y = Y L 0 0 Y U = Y LT 0 0 In U n U Therefore, the similarity graph of samples with n L noisy labels can be denoted as A = Y Y = Y LT 2Y L 0 0 In U n U . In Lemma 4.1 we present the influence of symmetric label noise with noise rate γ on the similarity graph A . Lemma 4.1. Under Assumption 3.1, if the data is class balanced, i.e. n1 = . . . 
= nr = n L r , then there holds A = α(γ) AL + β(γ) r n L 1n L 1 n L 0 0 In U n U where α(γ) := 1 r r 1γ 2 and β(γ) := γ r 1 2 r r 1γ . Note that without label noise, i.e. γ = 0, we have α(γ) = 1 and β(γ) = 0. For the sake of simplicity, in the following, we write α and β instead of α(γ) and β(γ) when no ambiguity is aroused. In Lemma 4.1, we show that the effect of symmetric label noise is to add a uniform weight to the edges between all labeled samples. This uniform weight increases the confusion between intraand inter-class similarities. For example, under the deterministic label scenario, we have AL = In L n L, and under label noise, the original intraclass similarity is uniformly shrunk from 1 to α and the inter-class similarity increases from 0 to β. Moreover, as the noise rate γ increases, α decreases and β increases, which results in severer confusion between the intraand inter-class similarities. Intuitively, as the similarity graph describes the sampling probability of positive pairs, α and β measure how noisy that similarity is. (The sampling probability of negative samples remains unaffected because they are assumed to be uniformly sampled regardless of labels.) Then under label noise, the probability of sameclass samples being selected as positive pairs reduces from 1 to α + r n L β, and the probability of different-class samples being selected as positive pairs raises from 0 to r n L β. Note that our mathematical formulations can also be extended to generalized label noise assumptions of the noise transition matrix T (other than symmetric label noise). Specifically, under generalized assumptions, the term r n L 1n L 1 n L in (8) will become a real symmetric matrix which depends on the specific form of the label noise assumption. 4.3. Similarity Graph Describing Joint Training Recall that we want to investigate the effect of noisy labels in the joint training of supervised and unsupervised contrastive losses. Specifically, for θ (0, 1), we have the loss for joint training as LJoint Training := (1 θ)Lunsup + θLsup, (9) where in Lunsup, the positive samples are selected as augmentations of the anchor sample, and in Lsup, the positive samples are selected as (possibly noisy) same-class samples. The specific algorithms of joint training for noisy label learning and semi-supervised learning are shown in Appendix B.1. According to (2), the spectral contrastive loss is equivalent to the matrix factorization loss. Therefore, we can write (9) as a convex combination of matrix factorization losses with similarity graphs under label and feature information, i.e. (1 θ) A0 FF 2 F + θ A FF 2 F , (10) where we denote A0 as the augmentation graph of arbitrary unlabeled samples describing feature information, and A is denoted in (7) describing the similarity graph induced by the semi-supervised noisy labeled (augmented) samples. Note that (10) can be rewritten as ((1 θ) A0 + θ A) FF 2 F + c0(θ), (11) where c0(θ) := (1 θ) A0 2 F + θ A 2 F (1 θ) A0 + θ A 2 F is independent of F. As c0(θ) does not affect the training procedure, optimizing (10) and (11) results in the same optimal F. Therefore, in the following, to analyze the joint training loss (9), we investigate the properties of the mixed similarity graph Aθ,γ,n L := (1 θ) A0 + θ A . (12) 5. Theoretical Results In this section, we first compute eigenvalues of the similarity graph induced by both label and feature information, which plays a key role in deriving the error bound of contrastive learning in Section 5.1. 
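As a concrete handle on these objects, the toy NumPy sketch below (illustrative sizes and values only, not from the paper) builds the noisy label graph of Eq. (5), a random stand-in for the normalized augmentation graph, and the mixed graph of Eq. (12), and then inspects the spectrum that the following propositions characterize.

```python
# Toy illustration of the mixed similarity graph in Eq. (12) and its eigenvalues.
import numpy as np

rng = np.random.default_rng(0)
r, n_per_class, gamma, theta = 3, 20, 0.2, 0.5
n = r * n_per_class                                  # all samples labeled here (n_U = 0)

# Label similarity graph under symmetric noise: A'_L = Y_L T^2 Y_L^T (Eq. (5)),
# normalized by r / n_L in the class-balanced case.
y = np.repeat(np.arange(r), n_per_class)
Y_L = np.eye(r)[y]
T = (1 - gamma) * np.eye(r) + gamma / (r - 1) * (np.ones((r, r)) - np.eye(r))
A_label = (r / n) * (Y_L @ T @ T @ Y_L.T)

# A stand-in for the normalized augmentation graph D^{-1/2} A0 D^{-1/2}.
A0 = rng.random((n, n)); A0 = (A0 + A0.T) / 2
d = A0.sum(axis=1)
A0_bar = A0 / np.sqrt(np.outer(d, d))

A_mix = (1 - theta) * A0_bar + theta * A_label       # Eq. (12) with n_U = 0
eigs = np.sort(np.linalg.eigvalsh(A_mix))[::-1]
print(eigs[:5])                                       # the top eigenvalues enter the error bound
```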
Then in Section 5.2, we show that (clean) label information in the semi-supervised setting can help improve the error bound, whereas in Section 5.3, we prove that the joint training of supervised and unsupervised contrastive learning fails to improve the error bound compared with purely supervised or purely unsupervised contrastive learning. Rethinking Weak Supervision in Helping Contrastive Learning 5.1. Eigenvalues of Similarity Graph Describing Joint Training We first compute the eigenvalues of the similarity graph describing the weak labels (without describing feature information). Proposition 5.1. For arbitrary Y , assume that the labeled data is class-balanced, i.e. P i [n L] ηj(xi) = n L/r for j [r]. Assume that the eigenvalues of AL are µ1, . . . , µn (in descending order). Then under Assumption 3.1, the eigenvalues of A are µ1 = . . . = µn U+1 = 1, (13) µj = µjα, for j = n U + 2, . . . , n, (14) where α = µj 1 r r 1γ 2 . In Proposition 5.1, we show that the eigenvalues of A rely on the eigenvalues of A and consequently rely on the posterior probabilities of clean labels. Specifically, if the true label has a higher posterior probability, i.e. maxj [r] P(Y = j|x) is larger, then the eigenvalues of A are larger. On the other hand, the existence of label noise uniformly shrinks the eigenvalues of A except for the largest ones, and larger noise rate γ results in smaller α and thus leads to smaller eigenvalues of A . Moreover, the number of largest eigenvalues is decided by the number of unlabeled samples. Note that rank(A ) rank(Y L) + n U n U + r, and therefore we have µn U+r+1 = . . . = µn = 0. Specifically, under the deterministic label scenario, we have µn U+2 = . . . = µn U+r = 1. Then the eigenvalues of A become µ1 = . . . = µn U+1 = 1, (15) µn U+2 = . . . = µn U+r = α = 1 r r 1γ 2 , (16) µn U+r+1 = . . . = µn = 0. (17) Then in the following proposition, we discuss the eigenvalues of the mixed similarity graph Aθ,γ,n L describing both weak labels and feature information. Proposition 5.2. Denote λ1, . . . , λn as the eigenvalues of Aθ,γ,n L. Then given the eigenvalues of A0, i.e. ν1, . . . , νn and the eigenvalues of AL, i.e. µ1, . . . , µn L(in descending order), under the deterministic scenarios, when k n U, there holds max θ + (1 θ)νn L+k, (1 θ)νk+1, θα + (1 θ)νn L+k r+1 λk+1 θ + (1 θ)νk+1, and for k n U + r, (1 θ)νk+1 λk+1 min{θ + (1 θ)νk+1, θα + (1 θ)νk n U , (1 θ)νk+1 r n U }. In Proposition 5.2, we derive both the upper and lower bounds for the k + 1-th largest eigenvalue of Aθ,γ,n L. We see that under the deterministic scenario, the upper bound of the k + 1-th largest eigenvalue of Aθ,γ,n L depends on at most three specific eigenvalues of the unsupervised augmentation graph A0. The value of λk+1 is also affected by the weighting parameter θ. However, the specific dependence relies on the relative magnitudes of νk+1, νk n U , and νk+1 n U r. A perhaps somehow anti-intuitive conclusion is that when k is smaller than n U, the upper bound of λk+1 is unaffected by the noise rate. Similarly, we notice that for k n U + r, the lower bound of λk+1 is unaffected by the noise rate. 5.2. Error Bound of Joint Training under Semi-supervised Setting Recall that the goal of contrastive representation learning is to learn an embedding function f : X Rk. The quality of the learned embedding is often evaluated through linear evaluation. 
To be specific, denote B Rk r as the weights of the downstream linear classifier, and the linear predictor is denoted as gf,B( x) = arg maxi [r] Px A( | x)(gf,B(x) = i). In this paper, we focus on analyzing the error bound of the best possible downstream linear classifier gf pop,B , where f pop arg minf:X Rk is the minimizer of the population spectral contrastive loss L(f) defined in (1), and B is the optimal weight for the downstream linear classifier. Following Hao Chen et al. (2021), we assume that the labels are recoverable from augmentations, i.e. we assume there exists a classifier g that can predict y(x) given x with error at most δ (0, 1). Assumption 5.3. Assume that for some δu, δs > 0, there holds E x P X,x A( | x)1[ˆy(xi) = y( x)] δu. (18) ℓ [r] ηℓ(xi)1[ˆy(xi) = ℓ] δs. (19) Compared with Assumption 3.5 in Hao Chen et al. (2021), Assumption 5.3 additionally assumes the recoverable of labels taking expectation under the posterior probability distribution. Intuitively, δu represents the error under unsupervised learning, and δs represents the error under supervised learning (with clean posterior distributions). Therefore, it is reasonable to assume that δs δu. Note that Assumption 5.3 is a minor revision of the original assumption. The additional assumption (19) does not change the nature of the original idea of label recovery, and will be used to bound the error term of learning from weakly supervised labels. Rethinking Weak Supervision in Helping Contrastive Learning Then we derive the error bound of downstream linear evaluation learned by contrastive learning under semi-supervised setting. Theorem 5.4. Assume the assumptions in Theorem 5.5 hold. Then if B F 1/ max n (1 θ)νk, θ + (1 θ)νn L+k r o , E 2 2δu + [(1 + ρ)δs 2δu]θ 1 θ (1 θ)νk+1 + 8δu. (20) By Theorem 5.4, the error bound of linear probing is larger when the label recovery error δu and δs gets larger. The error bound in Theorem 5.4 attains the minimum when θ = 1 if ρ 2δu/δs 1. And therefore we have E 2(1 + ρ)δs + 8δu 2(1 + ρ)δu + 8δu. Since νk+1 [0, 1], then when ρ (1 + νk+1)/(1 νk+1), 2(1 + ρ)δu + 8δu 4δu 1 νk+1 + 8δu. Recall that in Hao Chen et al. (2021), the error bound of purely unsupervised contrastive learning is 4δu 1 νk+1 + 8δu. Our result indicates that semi-supervised information improves the error bound compared with purely unsupervised contrastive learning since by using all labeled samples. This theoretical point can also be verified by existing experimental research about semisupervised contrastive learning, e.g. Assran et al. (2020). 5.3. Error Bound of Joint Training under Noisy-labeled Setting In Theorem 5.5, we present the error bound of joint training contrastive learning under noisy label setting. Theorem 5.5. For arbitrary Y , assume that the labeled data is class-balanced, i.e. P i [n L] ηj(xi) = n L/r for j [r]. Denote ν1, . . . , νn as the eigenvalues of A0 (in descending order). Denote E := P x P X,x A( | x) gf pop,B (x) = y( x) as the linear evaluation error, where B Rr k with norm B F 1/(1 θ)νk. Assume there exists ρ > 0, such that wi/wj < ρ, for i, j [n], and k > r. Then under the deterministic scenario and Assumptions 3.1 and 5.3, there holds E 2 2δu + [α(1 + ρ)δs 2δu + (1 α)]θ 1 λ(ν; θ, α) + 8δu, (21) λ(ν; θ, α) = min{θ + (1 θ)νk+1, θα + (1 θ)νk, (1 θ)νk+1 r}. (22) and α := 1 r r 1γ 2. The bound in Theorem 5.5 gets larger when the noise rate γ and the label recovery error δu and δs gets larger. It is worth noting that Theorem 5.5 shows that joint training fails to improve the error bound. 
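Before unpacking why, the claim can be illustrated numerically. The sketch below evaluates the bound of Theorem 5.5 over a grid of θ values using assumed toy constants (δu, δs, ρ, and the eigenvalues ν are not taken from the paper); in such settings the minimum sits at an endpoint of [0, 1], in line with the discussion that follows.

```python
# Toy numerical illustration of the bound in Theorem 5.5 as a function of theta.
import numpy as np

delta_u, delta_s, rho = 0.05, 0.02, 1.5       # assumed label-recovery errors and weight ratio
r, gamma = 10, 0.3
alpha = (1 - r * gamma / (r - 1)) ** 2
nu_k1, nu_k, nu_k1_r = 0.30, 0.35, 0.60       # assumed eigenvalues of the augmentation graph

def bound(theta: float) -> float:
    lam = min(theta + (1 - theta) * nu_k1,
              theta * alpha + (1 - theta) * nu_k,
              (1 - theta) * nu_k1_r)
    num = 2 * delta_u + (alpha * (1 + rho) * delta_s - 2 * delta_u + (1 - alpha)) * theta
    return 2 * num / (1 - lam) + 8 * delta_u

thetas = np.linspace(0.0, 1.0, 101)
values = np.array([bound(t) for t in thetas])
print(thetas[values.argmin()])                # lands at 0.0 or 1.0 for such settings
```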
Specifically, since the numerator is positive, this bound is equivalent to the minimum of the following three terms 2 2δu + [α(1 + ρ)δs 2δu + (1 α)]θ 1 θ (1 θ)νk+1 + 8δu, (23) 2 2δu + [α(1 + ρ)δs 2δu + (1 α)]θ 1 θα (1 θ)νk + 8δu, (24) 2 2δu + [α(1 + ρ)δs 2δu + (1 α)]θ 1 (1 θ)νk+1 r + 8δu. (25) Since each of the three terms is a monotonic function with respect to θ, the minimum must be attained at the end of the range of θ, i.e. θ = 0 or θ = 1. Therefore, the minimum of E must be attained at θ = 0 or θ = 1, too. Specifically, when the noise rate is relatively high, i.e. γ > γthreshold := r 1 1 νk+1 2δu (1 (1+ρ)δs)(1 νk+1) , E achieves its minimum at θ = 0, whereas when the noise rate is relatively low, i.e. γ < γthreshold, E achieves its minimum at θ = 1. Either way, joint training of supervised and unsupervised contrastive learning (θ (0, 1)) does not improve the error bound compared with purely supervised or purely unsupervised contrastive learning. Then we derive the finite sample bound for the linear probing error for contrastive learning under label noise, which explicitly depends on the training set size n. Theorem 5.6. Denote b E := P x P X,x A( | x) g ˆ f, b B(x) = y( x) as the generalization bound for the linear probing error. For any labeling function ˆy : X [r], there exists a linear probe b B Rr k such that with probability at least 1 ε over the randomness of data, we have b E min 1 k k 2 2δu + [α(1 + ρ)δs 2δu + (1 α)]θ 1 λ(ν; θ, α, k + 1) + 4k h c1 b Rn/2(F) + c2 q (1 θ)νk λ(ν; θ, α, k + 1) 2 + 8δu, (26) where b Rn/2(F) is the maximal possible empirical Rademacher complexity of F over n/2 data, λ(ν; θ, α, k + 1) = min{θ+(1 θ)νk+1, θα+(1 θ)νk, (1 θ)νk+1 r}, α := 1 r r 1γ 2, c1 k2κ2 + kκ and c2 kκ2 + k2κ4. Note that in Theorem 5.6, as the training size n , the sample error term (second term) approximates 0, and therefore the bound in Theorem 5.6 degenerates to that in Theorem 5.5. For a given sample size n < , we observe a trade-off in the choice of k. Specifically, as k increases, the approximation Rethinking Weak Supervision in Helping Contrastive Learning error (1st term in (26)) decreases, whereas the sample error (2nd term in (26)) increases. It will lead to the following two cases: 1) when k is small, the approximation error could be very large since λk+1 is large; 2) when k is large, the eigen gap λk λk+1 0 since both λk and λk+1 are very small, and accordingly the sample error goes to infinity. This suggests that we should choose a moderate feature dimension k, so that both approximation and sample error terms are relatively small. 5.4. Discussions By comparing Theorems 5.4 and 5.5, we show that clean semi-supervised labels help improve the downstream linear error bound of contrastive representation learning, whereas jointly training supervised and unsupervised contrastive losses fails to improve the error bound under noisy labels. In other words, the winner of supervised and unsupervised contrastive learning itself serves as a strong baseline for noisy label contrastive learning. This theoretical finding partly explains why the intuitive joint-training method is not investigated by the community, and why complex algorithmic design such as label denoising is a popular approach to leveraging noisy labels in contrastive learning. For technical contributions, although the theoretical analysis is based on Hao Chen et al. 
(2021), our analysis is essentially different from existing works in the following aspects: 1) We for the first time establish a theoretical framework for weakly supervised contrastive learning, where we translate the label information into a similarity graph, whereas existing works analyzed pure unsupervised contrastive learning; 2) The main technical difficulty of our analysis is to discuss the eigenvalues of the mixed similarity graph containing both label and feature information (Proposition 5.2), rather than to utilize existing results about self-supervised contrastive learning. The joint training of Sim CLR and Sup Con is also discussed in previous works (Islam et al., 2021; Chen et al., 2022), which focus on the empirical improvement of transfer performances whereas we focus on the theoretical properties of joint training. More details can be found in Appendix B.2. 6. Experiments Recall that it is already empirically verified by Assran et al. (2020) that clean semi-supervised labels help improve over unsupervised contrastive learning. Therefore, in this section, we only empirically verify our theoretical results that noisy labels have limited effects in improving the performance of contrastive learning, and show that the winner of Sup Con and Sim CLR itself serves as a strong baseline for contrastive learning with noisy labels. Based on this, we discuss that complex designs are imperative for improving contrastive representation learning with noisy labels. 6.1. Experimental Setups We conduct numerical comparisons on the CIFAR-10 and Tiny Image Net-200 benchmark datasets. Because the standard supervised contrastive learning algorithm Sup Con (Khosla et al., 2020) is adapted from the self-supervised contrastive learning framework Sim CLR (Chen et al., 2020), for fair comparisons, we use Sup Con as the supervised contrastive loss, and Sim CLR as the self-supervised contrastive loss. We argue that we can to a large extent verify the theoretical insights discussed in the previous section, even if the theoretical parts consider the spectral contrastive loss. First of all, as shown in the original paper, spectral contrastive loss has comparative empirical performances with respect to that of Sim CLR. Besides, the theoretical evidence can be found in Johnson et al. (2023), which proves that by interpreting the exponentiated dot product ef(x) f(x ) as the similarity and treating the exponential and temperature term τ as part of the model instead of part of the objective, Info NCE loss and Spectral contrastive loss share the same population minimum. That means, by adopting similar kernel deriviations, our work also has the potential to extend to other contrastive losses including standard Info NCE. We follow the experimental setting of Sim CLR and Sup Con. Specifically, we use Res Net-50 as the encoder and a 2-layer MLP as the projection head. We set the batch size as 1024. We use 1000 epochs for training representations. We use the SGD optimizer with the learning rate 0.5 decayed at the 700-th, 800-th, and 900-th epochs with a weight decay 0.1. We run experiments on 4 NVIDIA Tesla V100 32GB GPUs. The data augmentations we use are random crop and resize (with random flip), color distortion, and color dropping. We evaluate the self-supervised learned representation by linear evaluation protocol, where a linear classifier is trained on the top of the encoder, and regard its test accuracy as the performance of the encoder. 100 epochs are used for linear probing on the clean data. 
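For reference, a minimal sketch of this linear evaluation protocol (not the authors' training script; hyperparameters are illustrative) is given below: the pretrained encoder is frozen and a single linear classifier is trained on top of its features.

```python
# Minimal sketch of linear probing on top of a frozen pretrained encoder.
import torch
import torch.nn as nn

def linear_probe(encoder: nn.Module, train_loader, num_classes: int,
                 epochs: int = 100, feat_dim: int = 2048, device: str = "cuda"):
    encoder.eval()                                        # freeze the encoder
    for p in encoder.parameters():
        p.requires_grad_(False)
    classifier = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = encoder(images)                   # frozen features
            loss = criterion(classifier(feats), labels)
            opt.zero_grad(); loss.backward(); opt.step()
    return classifier
```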
The symmetric noisy labels are generated by flipping the labels of a given proportion of training samples uniformly to one of the other class labels. For the CIFAR-10 dataset, we run experiments with noise rates {0.1, 0.2, 0.3, 0.4, 0.5, 0.6}, and for Tiny-ImageNet-200, we run experiments with noise rates {0.2, 0.4, 0.6, 0.8}.

6.2. Parameter Analysis of θ

We first conduct an analysis of the parameter θ for the joint training of SimCLR and SupCon. In Figure 2(a), we plot the optimal θ for L_Joint Training on the CIFAR-10 dataset across various noise rates. We show that when the noise rate is relatively high, the optimal θ is relatively small and L_Joint Training relies more on unsupervised learning, whereas when the noise rate is relatively low, the optimal θ is relatively large and L_Joint Training relies more on supervised learning. Moreover, the optimal θ lies very close to either 0 or 1, which indicates that only one of the two losses mainly contributes to the joint training. On the other hand, in Figure 2(b), we plot the performance change of L_Joint Training with respect to the parameter θ under noise rates 0.4 and 0.8 on the Tiny-ImageNet-200 dataset. We show that under both noise rates, as θ increases, the accuracy of L_Joint Training decreases. Moreover, the performance drop of L_Joint Training under the higher noise rate is more significant than that under the lower noise rate, indicating that too much noisy label information hurts the performance of joint training, especially when the noise rate is high.

[Figure 2: Parameter analysis of θ. (a) Optimal θ for various noise rates on CIFAR-10. (b) Performance change w.r.t. θ on Tiny-ImageNet-200 under noise rates 0.4 and 0.8.]

6.3. Parameter Analysis of k

We empirically verify the discussion following Theorem 5.6 about the feature dimension k. We train the joint objective (12) with θ = 0.1 using ResNet-50 on the CIFAR-10 dataset under 30% label noise with the feature dimension varying in {128, 256, 512, 1024, 2048}, and report the linear probing accuracy in Table 1.

Table 1: Parameter analysis of k.
k        128     256     512     1024    2048
Acc (%)  91.89   92.18   92.12   91.77   90.9

We observe that as k increases, the linear probing accuracy first increases and then decreases, which validates the theoretical insight from Theorem 5.6 that we should choose a moderate feature dimension k.

6.4. The Winner of SupCon and SimCLR Serves as a Strong Baseline

In this section, we show that under label noise, joint training of SupCon and SimCLR has limited improvement over a simple but strong baseline, namely the winner of SupCon and SimCLR, on the Tiny-ImageNet-200 dataset. Specifically, in Figure 3, the box plots show the linear probing performances of L_Joint Training with θ ∈ {0.1, 0.2, 0.4, 0.6} under noise rates {0.2, 0.4, 0.6, 0.8}, respectively, and the curves show the performance of SimCLR, SupCon, and the winner of the two.

[Figure 3: The winner of SupCon and SimCLR serves as a strong baseline of joint training under label noise on the Tiny-ImageNet-200 dataset. The x-axis is the noise rate (0.0-0.8); curves show SimCLR, SupCon, and Max(SimCLR, SupCon), with box plots for Joint Training.]

We can see that the Max performance of SupCon and SimCLR lies within the box plots, indicating that the winner of SupCon and SimCLR has comparable or sometimes better performance compared with the jointly trained model. Additionally, we observe the same trends on tasks such as detection, segmentation, and fine-tuning in Appendix B.3.
These experimental results verify the theoretical result in Theorem 5.5 that joint training does not improve the error bound. 7. Conclusion In this paper, we establish a theoretical framework for weakly supervised contrastive learning, which is compatible with the settings of both noisy label learning and semisupervised learning. We take spectral contrastive learning as a proxy for theoretical analysis. By formulating a mixed similarity graph induced by both weakly supervised label information and unsupervised feature information, we analyze the weakly supervised spectral contrastive learning based on the framework of spectral clustering, and derive the downstream linear evaluation error bound. Our theoretical results show that semi-supervised information improves the downstream error bound, whereas, under the setting of symmetric label noise, we prove that jointly training supervised and unsupervised contrastive losses fail to improve the error bound. Our theoretical findings here provide new insights for the community to rethink the role of weak supervision in helping the representation of contrastive learning. For future works, we will investigate the effect of more complex weak supervision, such as active learning and label-dependent label noise, on contrastive learning. Acknowledgements Yisen Wang is partially supported by the National Key R&D Program of China (2022ZD0160304), the National Natural Science Foundation of China (62006153), Open Research Projects of Zhejiang Lab (No. 2022RC0AB05), and Huawei Technologies Inc. Weiran Huang is funded by MSRA. Rethinking Weak Supervision in Helping Contrastive Learning Acharya, A., Sanghavi, S., Jing, L., Bhushanam, B., Choudhary, D., Rabbat, M., and Dhillon, I. Positive unlabeled contrastive learning. ar Xiv preprint ar Xiv:2206.01206, 2022. Aitchison, L. Infonce is a variational autoencoder. ar Xiv preprint ar Xiv:2107.02495, 2021. Arora, S., Khandeparkar, H., Khodak, M., Plevrakis, O., and Saunshi, N. A theoretical analysis of contrastive unsupervised representation learning. In ICML, 2019. Ash, J., Goel, S., Krishnamurthy, A., and Misra, D. Investigating the role of negatives in contrastive representation learning. In AISTATS, 2022. Assran, M., Ballas, N., Castrejon, L., and Rabbat, M. Supervision accelerates pre-training in contrastive semisupervised learning of visual representations. ar Xiv preprint ar Xiv:2006.10803, 2020. Bao, H., Nagano, Y., and Nozawa, K. On the surrogate gap between contrastive and supervised losses. In ICML, 2022. Chen, M., Fu, D. Y., Narayan, A., Zhang, M., Song, Z., Fatahalian, K., and R e, C. Perfectly balanced: Improving transfer and robustness of supervised contrastive learning. In ICML, 2022. Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In ICML, 2020. Chen, X. and He, K. Exploring simple siamese representation learning. In CVPR, 2021. Cheng, H., Zhu, Z., Sun, X., and Liu, Y. Demystifying how self-supervised features improve training from noisy labels. ar Xiv preprint ar Xiv:2110.09022, 2021. Chuang, C.-Y., Hjelm, R. D., Wang, X., Vineet, V., Joshi, N., Torralba, A., Jegelka, S., and Song, Y. Robust contrastive learning against noisy views. In CVPR, 2022. Fulton, W. Eigenvalues, invariant factors, highest weights, and schubert calculus. Bulletin of the American Mathematical Society, 37(3):209 249, 2000. Ghosh, A. and Lan, A. Contrastive learning improves model robustness under label noise. In CVPR, 2021. Hao Chen, J. 
Z., Wei, C., Gaidon, A., and Ma, T. Provable guarantees for self-supervised deep learning with spectral contrastive loss. In Neur IPS, 2021. He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020. Hu, T., Liu, Z., Zhou, F., Wang, W., and Huang, W. Your contrastive learning is secretly doing stochastic neighbor embedding. In ICLR, 2023. Huang, W., Yi, M., Zhao, X., and Jiang, Z. Towards the generalization of contrastive self-supervised learning. In ICLR, 2023. Islam, A., Chen, C.-F. R., Panda, R., Karlinsky, L., Radke, R., and Feris, R. A broad study on the transferability of visual representations with contrastive learning. In ICCV, pp. 8845 8855, 2021. Johnson, D. D., Hanchi, A. E., and Maddison, C. J. Contrastive learning can find an optimal basis for approximately view-invariant functions. In ICLR, 2023. Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., and Krishnan, D. Supervised contrastive learning. In Neur IPS, 2020. Kim, B., Choo, J., Kwon, Y.-D., Joe, S., Min, S., and Gwon, Y. Selfmatch: Combining contrastive self-supervision and consistency for semi-supervised learning. ar Xiv preprint ar Xiv:2101.06480, 2021a. Kim, Y., Yim, J., Yun, J., and Kim, J. Nlnl: Negative learning for noisy labels. In ICCV, 2019. Kim, Y., Yun, J., Shon, H., and Kim, J. Joint negative and positive learning for noisy labels. In CVPR, 2021b. Ko, C.-Y., Mohapatra, J., Liu, S., Chen, P.-Y., Daniel, L., and Weng, L. Revisiting contrastive learning through the lens of neighborhood component analysis: an integrated framework. In ICML, 2022. Lee, D., Kim, S., Kim, I., Cheon, Y., Cho, M., and Han, W.- S. Contrastive regularization for semi-supervised learning. In CVPR, 2022. Li, S., Xia, X., Ge, S., and Liu, T. Selective-supervised contrastive learning with noisy labels. In CVPR, 2022. Navaneet, K., Abbasi Koohpayegani, S., Tejankar, A., Pourahmadi, K., Subramanya, A., and Pirsiavash, H. Constrained mean shift using distant yet related neighbors for representation learning. In ECCV 2022, 2022. Nozawa, K. and Sato, I. Understanding negative samples in instance discriminative self-supervised representation learning. In Neur IPS, 2021. Ortego, D., Arazo, E., Albert, P., O Connor, N. E., and Mc Guinness, K. Multi-objective interpolation training for robustness to label noise. In CVPR, 2021. Rethinking Weak Supervision in Helping Contrastive Learning Shen, K., Jones, R. M., Kumar, A., Xie, S. M., Hao Chen, J. Z., Ma, T., and Liang, P. Connect, not collapse: Explaining contrastive learning for unsupervised domain adaptation. In ICML, 2022. Wang, Y., Geng, Z., Jiang, F., Li, C., Wang, Y., Yang, J., and Lin, Z. Residual relaxation for multi-view representation learning. In Neur IPS, 2021a. Wang, Y., Zhang, Q., Wang, Y., Yang, J., and Lin, Z. Chaos is a ladder: A new theoretical understanding of contrastive learning via augmentation overlap. In ICLR, 2021b. Wang, Y., Zhang, Q., Du, T., Yang, J., Lin, Z., and Wang, Y. A message passing perspective on learning dynamics of contrastive learning. In ICLR, 2023. Xue, Y., Whitecross, K., and Mirzasoleiman, B. Investigating why contrastive learning benefits robustness against label noise. In ICML, 2022. Yan, J., Luo, L., Xu, C., Deng, C., and Huang, H. Noise is also useful: Negative correlation-steered latent contrastive learning. In CVPR, 2022. Yang, F., Wu, K., Zhang, S., Jiang, G., Liu, Y., Zheng, F., Zhang, W., Wang, C., and Zeng, L. 
Class-aware contrastive semi-supervised learning. In CVPR, 2022. Yao, Y., Sun, Z., Zhang, C., Shen, F., Wu, Q., Zhang, J., and Tang, Z. Jo-src: A contrastive approach for combating noisy labels. In CVPR, 2021. Zhang, M., Yuan, C., Yao, J., and Huang, W. Learning with noisily-labeled class-imbalanced data. ar Xiv preprint ar Xiv:2211.10955, 2022a. Zhang, Y., Zhang, X., Li, J., Qiu, R., Xu, H., and Tian, Q. Semi-supervised contrastive learning with similarity co-calibration. IEEE Transactions on Multimedia, 2022b. Zimmermann, R. S., Sharma, Y., Schneider, S., Bethge, M., and Brendel, W. Contrastive learning inverts the data generating process. In ICML, 2021. Rethinking Weak Supervision in Helping Contrastive Learning The appendix consists of the proofs for lemmas and theorems (Section A) and additional experimental results (Section B). A.1. Proof of Lemma 4.1 Proof. Under Assumption 3.1, we have ( (1 γ)2 + γ2/(r 1), i = j 2γ(1 γ)/(r 1) + (r 2)γ2/(r 1)2, i = j (1 γ)2 + γ2/(r 1), i = j γ r 1 2 r r 1γ , i = j. (27) That is, we have T 2 = h (1 γ)2 + γ2/(r 1) γ r 1 2 r r 1γ i Ir r 2 r r 1γ 1r 1 r = 1 r r 1γ 2 Ir r + γ r 1 2 r r 1γ 1r 1 r := αIr r + β 1r 1 r . (28) Given γ [0, 1), we have A L = Y LT 2Y L = Y L αIr r + β 1r 1 r Y L = αY LY L + βY L 1r 1 r Y L = αAL + β 1n L 1 n L, (29) where the last equality holds because P j ηj(xi) = 1 for i [n]. and the normalized augmentation graph is A = D 1/2A D 1/2, (30) D = DL 0 0 In U n U DL = diag(di), (32) j [n L] A i,j = α X ℓ [r] ηℓ(xi)ηℓ(xj) + n Lβ ℓ [r] ηℓ(xi) X j [n L] ηℓ(xj) + n Lβ = α X ℓ [r] ηℓ(xi)nℓ+ n Lβ (33) Specifically, when the labeled data is class-balanced, i.e. n1 = . . . = nr = n L/r. Then we have ℓ [r] ηℓ(xi) + n Lβ = n L r α + n Lβ = n L Rethinking Weak Supervision in Helping Contrastive Learning n L AL + β r n L 1n L 1 n L 0 0 In U n U A.2. Proof of Proposition 5.1 Proof. We first prove that v1 = 1 n L 1n L is an eigenvector of AL := r n L AL with eigenvalue µ1 = 1. To be specific, AL 1 n L 1n L = 1 n L r n L AL 1n L n L Y LY L 1n L n L Y L n L = 1 n L 1n L, (36) where the second last equality is due to class balance, i.e. P i [n L] ηj(xi) = n L/r for j [r], and the last equality holds because P j [r] ηj(xi) = 1 for i [n L]. Therefore, we can rewrite AL as AL = h 1 n L 1n L, v2, . . . , vn L i 1 0 . . . 0 0 µ2 . . . 0 ... ... ... 0 0 . . . µn L 1 n L 1 n L v 2 ... v n L Note that 1 n L 1n L 1 n L can be decomposed as 1 n L 1n L 1 n L = 1 n L 1n L 1 n L 1n L = h 1 n L 1n L, v2, . . . , vn L i 1 0 . . . 0 0 0 . . . 0 ... ... ... 0 0 . . . 0 1 n L 1 n L v 2 ... v n L Then we have n L AL + rβ 1 n L 1n L 1 n L = h 1 n L 1n L, v2, . . . , vn L i α 0 . . . 0 0 αµ2 . . . 0 ... ... ... 0 0 . . . αµn L 1 n L 1 n L v 2 ... v n L h 1 n L 1n L, v2, . . . , vn L i rβ 0 . . . 0 0 0 . . . 0 ... ... ... 0 0 . . . 0 1 n L 1 n L v 2 ... v n L = h 1 n L 1n L, v2, . . . , vn L i α + rβ 0 . . . 0 0 αµ2 . . . 0 ... ... ... 0 0 . . . αµn L 1 n L 1 n L v 2 ... v n L Rethinking Weak Supervision in Helping Contrastive Learning Since α + rβ = 1, the eigenvalues of A L are 1, αµ2, . . . , αµn L. Thus the eigenvalues of A = A L 0 0 In U n U µ1 = . . . = µn U+1 = 1, (41) µj = αµj, for j = n U + 2, . . . , n. (42) A.3. Proof of Proposition 5.2 Proof. By equation 13 in Fulton (2000), for two real symmetric n by n matrix (1 θ) A0 and θ A , the k + 1-th largest eigenvalue of Aθ,λ,n L := (1 θ) A0 + θ A can take any value in the interval max i+j=n+k+1(1 θ)νi + θ µj λk+1 min i+j=k+2(1 θ)νi + θ µj. (43) By Proposition 5.1, we have 1, j = 1, . . . 
, n U + 1; αµj, j = n U + 2, . . . , n U + r; 0, j = n U + r + 1, . . . , n. (44) Therefore, we have max i+j=n+k+1(1 θ)νi + θ µj = max 1 i n+k+1(1 θ)νi + θ µn+k+1 i θ + (1 θ)νi, i = n L + k, . . . , n; θαµn+k+1 i + (1 θ)νi, i = n L + k r + 1, . . . , n L + k 1; (1 θ)νi, i = k + 1, . . . , n L + k r θ + (1 θ)νn L+k θαµn+k+1 i + (1 θ)νi, i = n L + k r + 1, . . . , n L + k 1; (1 θ)νk+1, (45) where the last equality holds because {νi}i [n] is ranked in descending order. Then when k n U, λk+1 max θ + (1 θ)νn L+k, max i=n L+k r+1,...,n L+k 1{θαµn+k+1 i + (1 θ)νi}, (1 θ)νk+1 , (46) when n U < k < n U + r, λk+1 max (1 θ)νk+1, max i=n L+k r+1,...,n L+k 1{θαµn+k+1 i + (1 θ)νi} , (47) and when k n U + r, λk+1 (1 θ)νk+1. (48) On the other hand, we have min i+j=k+2(1 θ)νi + θ µj = min 1 i k+2(1 θ)νi + θ µk+2 i Rethinking Weak Supervision in Helping Contrastive Learning θ + (1 θ)νi, i = k + 1 n U, . . . , k + 1; θαµk+2 i + (1 θ)νi, i = k + 2 r n U, . . . , k n U; (1 θ)νi, i = k + 2 n, . . . , k + 1 r n U. θ + (1 θ)νk+1; θαµk+2 i + (1 θ)νi, i = k + 2 r n U, . . . , k n U; (1 θ)νk+1 r n U . (49) Then when k n U, there holds λk+1 θ + (1 θ)νk+1, (50) when n U < k < n U + r, there holds λk+1 min n θ + (1 θ)νk+1, min i=k+2 r n U,...,k n U{θαµk+2 i + (1 θ)νi} o , (51) and when k n U + r, there holds λk+1 min n θ + (1 θ)νk+1, min i=k+2 r n U,...,k n U{θαµk+2 i + (1 θ)νi}, (1 θ)νk+1 r n U o . (52) A.4. Proof of Theorem 5.4 To prove Theorem 5.4, we first prove Proposition A.1. Proposition A.1. For arbitrary Y , assume that the labeled data is class-balanced, i.e. P i [n L] ηj(xi) = n L/r for j [r]. Denote ν1, . . . , νn as the eigenvalues of A0 (in descending order). Denote E := P x P X,x A( | x) gf pop,B (x) = y( x) as the linear evaluation error, where B Rr k with norm B F 1/λk. Assume there exists ρ > 0, such that wi/wj < ρ, for i, j [n]. Then under the deterministic scenario and Assumptions 3.1 and 5.3, for k n U, there holds E 2 2δu + [α(1 + ρ)δs 2δu + (1 α)]θ 1 θ (1 θ)νk+1 + 8δu, (53) for n U + 1 k n U + r 1, there holds E 2 2δu + [α(1 + ρ)δs 2δu + (1 α)]θ 1 min{θ + (1 θ)νk+1, θα + (1 θ)νk n U } + 8δu, (54) and for k n U + r, there holds E 2 2δu + [α(1 + ρ)δs 2δu + (1 α)]θ 1 λ(ν; θ, α) + 8δu, (55) λ(ν; θ, α) = min{θ + (1 θ)νk+1, θα + (1 θ)νk n U , (1 θ)νk+1 r n U } (56) and α := 1 r r 1γ 2. Proof of Proposition A.1. By Lemma B.3 of Hao Chen et al. (2021), for any labeling function y : X [r], there exists a linear probe B Rr k with norm B F 1/λk such that P x P X,x A( | x) gf pop,B (x) = y( x) 2ϕˆy 1 λk+1 + 8δu, (57) Rethinking Weak Supervision in Helping Contrastive Learning where according to the definition of Aθ,λ,n L, (1 θ) wij wiwj + θ A i,j 1[ˆy(xi) = ˆy(xj)] i,j [n] wi,j1[ˆy(xi) = ˆy(xj)] + θ X wiwj A i,j1[ˆy(xi) = ˆy(xj)] . (58) We investigate the RHS of (58) respectively. The first term is X i,j [n] wi,j1[ˆy(xi) = ˆy(xj)] i,j [n] E x P X A(xi| x)A(xj| x)1[ˆy(xi) = ˆy(xj)] i,j [n] E x P X A(xi| x)A(xj| x) 1[ˆy(xi) = ˆy( x)] + 1[ˆy(xj) = ˆy( x)] i [n] E x P X A(xi| x)1[ˆy(xi) = ˆy( x)] The second term is X wiwj A i,j1[ˆy(xi) = ˆy(xj)] wiwj( A L)i,j1[ˆy(xi) = ˆy(xj)] + X wiwj1[ˆy(xi) = ˆy(xi)] i n L,j>n L wiwj A i,j1[ˆy(xi) = ˆy(xi)]. (60) According to the definition of A , the last two terms are equal to 0. 
Then by Lemma 4.1, the second term on the RHS of (58) becomes X wiwj A i,j1[ˆy(xi) = ˆy(xj)] wiwj α Ai,j + β r 1[ˆy(xi) = ˆy(xj)] wiwjηℓ(xi)ηℓ(xj) 1[ˆy(xi) = ℓ] + 1[ˆy(xj) = ℓ] 1 2(wi + wj)ηℓ(xi)ηℓ(xj) 1[ˆy(xi) = ℓ] + 1[ˆy(xj) = ℓ] 1 2(wi + wj) i,j [n L] wiηℓ(xi)ηℓ(xj)1[ˆy(xi) = ℓ] i,j [n L] wiηℓ(xi)ηℓ(xj)1[ˆy(xj) = ℓ] Rethinking Weak Supervision in Helping Contrastive Learning i,j [n L] wjηℓ(xi)ηℓ(xj)1[ˆy(xi) = ℓ] i,j [n L] wjηℓ(xi)ηℓ(xj)1[ˆy(xj) = ℓ] + (1 α) i,j [n L] ρwjηℓ(xi)ηℓ(xj)1[ˆy(xi) = ℓ] i,j [n L] wiηℓ(xi)ηℓ(xj)1[ˆy(xj) = ℓ] i,j [n L] wjηℓ(xi)ηℓ(xj)1[ˆy(xi) = ℓ] i,j [n L] ρwiηℓ(xi)ηℓ(xj)1[ˆy(xj) = ℓ] + (1 α) i [n L] ρπℓηℓ(xi)1[ˆy(xi) = ℓ] j [n L] πℓηℓ(xj)1[ˆy(xj) = ℓ] i [n L] πℓηℓ(xi)1[ˆy(xi) = ℓ] j [n L] ρπℓηℓ(xj)1[ˆy(xj) = ℓ] + (1 α) = α(1 + ρ) 1 i [n L] ηℓ(xi)1[ˆy(xi) = ℓ] + (1 α) α(1 + ρ)δs + (1 α), (61) where we denote πℓ= P(Y = ℓ) and by class-balance, πℓ= 1 r. Then combining (58), (59) and (61), we have ϕˆy 2(1 θ)δu + θα(1 + ρ)δs + θ(1 α) = 2δu + [α(1 + ρ)δs 2δu + (1 α)]θ. (62) Therefore, by (57), we have E := P x P X,x A( | x) gf pop,B (x) = y( x) 2 2δu + [α(1 + ρ)δs 2δu + (1 α)]θ 1 λk+1 + 8δu, (63) Combined with Proposition 5.2, we have for k n U, there holds E 2 2δu + [α(1 + ρ)δs 2δu + (1 α)]θ 1 θ (1 θ)νk+1 + 8δu, (64) for n U + 1 k n U + r 1, there holds E 2 2δu + [α(1 + ρ)δs 2δu + (1 α)]θ 1 min{θ + (1 θ)νk+1, θα + (1 θ)νk n U } + 8δu, (65) and for k n U + r, there holds E 2 2δu + [α(1 + ρ)δs 2δu + (1 α)]θ 1 min{θ + (1 θ)νk+1, θα + (1 θ)νk n U , (1 θ)νk+1 r n U } + 8δu. (66) Rethinking Weak Supervision in Helping Contrastive Learning Proof. By taking γ = 0, i.e. α = 1, and k n U in Proposition A.1, we reach the bound in (20). Besides, by Proposition 5.2, we have 1/ max n θ + (1 θ)νn L+k 1, (1 θ)νk, θα + (1 θ)νn L+k r o 1/ max n θ + (1 θ)νn L+k 1, (1 θ)νk, θ + (1 θ)νn L+k r o 1/ max n (1 θ)νk, θ + (1 θ)νn L+k r o . A.5. Proof of Theorem 5.5 Proof. By taking n U = 0 and k > r in Proposition A.1, we reach the bound in (21). Besides, by Proposition 5.2, we have B F 1/λk 1/(1 θ)νk. A.6. Proof of Theorem 5.6 To prove Theorem 5.6, we first derive the following generalization bound for spectral contrastive pretraining under label noise inspired by the proofs of Theorem 4.1 in Hao Chen et al. (2021). We note that the form seems to be the same as that of Theorem 4.1 in Hao Chen et al. (2021), but the underlying distribution of positive samples is different since we include the label similarity into consideration. Proposition A.2. For some κ > 0, assume f κ for all f F and x X. Let f pop F be a minimizer of the population loss L(f). Given a random dataset of size n, let ˆfemp F be a minimizer of empirical loss b Ln(f). Then, with probability at least 1 ε over the randomness of data, we have L( ˆfemp) L(f pop) + c1 b Rn/2(F) + c2 r n + ε , (67) where constants c1 k2κ2 + kκ and c2 kκ2 + k2κ4. Proof. Inspired by the proofs of Theorem D.8 in Hao Chen et al. (2021), for ˆf arg minf:X Rk b Ln(f) such that L( ˆf) minf:X Rk L(f) + ϵ0, there holds b E min 1 k k 1 λk +1 + 4k ϵ0 λk λk+1 2 + 8δu. (68) According to Proposition 5.2, we have for n U = 0 and k r, (1 θ)νk+1 λk+1 min{θ + (1 θ)νk+1, θα + (1 θ)νk, (1 θ)νk+1 r}. (69) By k k, there holds λk > λk+1, and thus we have λk λk+1 (1 θ)νk min{θ + (1 θ)νk+1, θα + (1 θ)νk, (1 θ)νk+1 r} (70) Here we without loss of generality assume 1 k k + 1 r to ensure that (70) 0. Then pluging (62), (69), and (70) into (68), we finish the proof. Rethinking Weak Supervision in Helping Contrastive Learning B. Additional Experiments B.1. 
B. Additional Experiments

B.1. Algorithms for Joint Training

Joint Training for Semi-supervised Learning. For semi-supervised contrastive learning, both labeled and unlabeled data are inputs to L_SimCLR; that is, the unsupervised contrastive loss L_SimCLR is calculated on all samples (denoted as {x_i}_{i=1}^{n}). The supervised contrastive loss L_SupCon is calculated only on the labeled samples (without loss of generality, we denote the labeled samples as {(x_i, y_i)}_{i=1}^{n_L}). We show the procedure for semi-supervised contrastive learning in Algorithm 1.

Algorithm 1 Joint Training for Semi-supervised Learning
  Input: Labeled data {(x_i, y_i)}_{i=1}^{n_L}; unlabeled data {x_i}_{i=n_L+1}^{n}; parameter θ.
  Initialize: Encoder f.
  repeat
    Compute the SupCon loss on the labeled data {(x_i, y_i)}_{i=1}^{n_L};
    Compute the SimCLR loss on all data {x_i}_{i=1}^{n} without using labels;
    Update the encoder f according to the joint training loss
      (1−θ)·(1/n_L) Σ_{i=1}^{n_L} L_SupCon(f(x_i), y_i) + θ·(1/n) Σ_{i=1}^{n} L_SimCLR(f(x_i));
  until convergence.
  Output: Encoder f.

Note that in Section 4.2 it may seem that the unlabeled data is used in computing the SupCon loss, but this is not the case. We include the unlabeled samples in the label similarity graph only for the convenience of the mathematical formulation. In fact, since we view the unlabeled samples as having unique class labels, this formulation does not affect the selection of positive pairs and therefore does not affect the calculation of the SupCon loss.

Joint Training for Noisy Label Learning. For joint training of contrastive learning under label noise, the L_SupCon loss is calculated on the noisy-labeled samples {(x_i, y_i)}_{i=1}^{n}, and the L_SimCLR loss is calculated on all samples without using their labels. We show the procedure for joint training contrastive learning under label noise in Algorithm 2.

Algorithm 2 Joint Training for Noisy Label Learning
  Input: Noisy-labeled data {(x_i, y_i)}_{i=1}^{n}; parameter θ.
  Initialize: Encoder f.
  repeat
    Compute the SupCon loss on the (noisy) labeled data {(x_i, y_i)}_{i=1}^{n};
    Compute the SimCLR loss on all data {x_i}_{i=1}^{n} without using labels;
    Update the encoder f according to the joint training loss
      (1−θ)·(1/n) Σ_{i=1}^{n} L_SupCon(f(x_i), y_i) + θ·(1/n) Σ_{i=1}^{n} L_SimCLR(f(x_i));
  until convergence.
  Output: Encoder f.
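For concreteness, the following is a minimal PyTorch-style sketch of the joint objective used in Algorithms 1 and 2. It is not the authors' released code: the function names, the temperature value, and the convention of marking unlabeled samples with label −1 are illustrative assumptions; `z1` and `z2` denote L2-normalized embeddings of two augmented views of the same batch.

```python
import torch
import torch.nn.functional as F

def simclr_loss(z1, z2, tau=0.5):
    """SimCLR (NT-Xent) loss; the positive of each sample is its other view."""
    z = torch.cat([z1, z2], dim=0)                      # (2B, d), assumed L2-normalized
    sim = z @ z.t() / tau                               # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # exclude self-similarity
    B = z1.shape[0]
    pos = torch.arange(2 * B, device=z.device).roll(B)  # index of the paired view
    return F.cross_entropy(sim, pos)

def supcon_loss(z1, z2, labels, tau=0.5):
    """Supervised contrastive loss; all same-label samples are positives.
    Samples with label -1 (unlabeled, semi-supervised case) are dropped."""
    z = torch.cat([z1, z2], dim=0)
    y = torch.cat([labels, labels], dim=0)
    keep = y >= 0
    if not keep.any():                                  # fully unlabeled batch
        return z1.new_zeros(())
    z, y = z[keep], y[keep]
    sim = z @ z.t() / tau
    self_mask = torch.eye(len(y), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))
    pos_mask = (y[:, None] == y[None, :]) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    denom = pos_mask.sum(dim=1).clamp(min=1)            # number of positives per anchor
    return -(log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / denom).mean()

def joint_loss(z1, z2, labels, theta):
    """(1 - theta) * L_SupCon + theta * L_SimCLR, as in Algorithms 1 and 2."""
    return (1 - theta) * supcon_loss(z1, z2, labels) + theta * simclr_loss(z1, z2)
```

In the semi-supervised setting of Algorithm 1 one would pass −1 for the unlabeled part of the batch, so that L_SupCon is averaged only over the labeled samples while L_SimCLR sees the whole batch; in the noisy-label setting of Algorithm 2 every sample carries a (possibly corrupted) label.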
B.2. Algorithmic Comparisons with Similar Methods

Islam et al. (2021) and Chen et al. (2022) adopt methods similar to ours but focus on different aspects. Specifically, Islam et al. (2021) and Chen et al. (2022) empirically investigate the transferability of the joint training of SimCLR and SupCon, whereas we focus on theoretically analyzing the performance of their joint training under linear probing. We discuss the connections and differences as follows. Islam et al. (2021) find that the combination of SimCLR and SupCon significantly improves transfer learning performance over cross-entropy. Chen et al. (2022) investigate the problem of coarse-to-fine transfer learning by adding a weighted class-conditional InfoNCE loss and a class-conditional autoencoder to SupCon. Similar to our results, according to Tables 2 and 3 of Islam et al. (2021) and Table 3 of Chen et al. (2022), the transfer learning performance of the combination of SupCon and InfoNCE (or class-conditional InfoNCE) is comparable to, but shows no significant improvement over, the winner of SupCon and SimCLR. In addition, the robustness investigated in these two papers is conceptually different from ours. The robustness in Islam et al. (2021) is to image corruptions, and the robustness in Chen et al. (2022) measures how well an algorithm can recover hidden subgroups in an unsupervised setting, whereas our manuscript investigates robustness to label corruptions (label noise).

B.3. Transfer Learning Performances

We conduct additional experiments to evaluate representation quality through transfer learning performance on downstream tasks, including detection, segmentation, and fine-tuning. We evaluate the transfer learning performance of models pretrained on TinyImageNet-200 with SimCLR, SupCon, and Mix. All models are pretrained under the settings of Section 6.1. For Mix, we set θ = 0.2. For SupCon and Mix, the models are pretrained under noise rates γ ∈ {20%, 40%, 60%, 80%}. The results are listed below.

Detection. Object detection is fine-tuned on the PASCAL VOC 07+12 dataset with a Faster R-CNN detector. All models are fine-tuned for 12 epochs. We evaluate the models by the default VOC metric of AP50.

Table 2: Performance comparisons of object detection (AP50).
γ         20%     40%     60%     80%
SimCLR    65.3    65.3    65.3    65.3
SupCon    68.5    66.9    66.2    67.0
Mix       66.9    65.5    63.8    63.2

Segmentation. Segmentation is fine-tuned on the PASCAL VOC 2012 dataset with a DeepLabV3 model. All models are fine-tuned for 20,000 iterations. We evaluate the models by mean IoU.

Table 3: Performance comparisons of segmentation (mean IoU).
γ         20%     40%     60%     80%
SimCLR    36.61   36.61   36.61   36.61
SupCon    49.07   47.58   46.57   42.47
Mix       40.33   38.14   33.66   34.33

Fine-tuning. We fine-tune the pretrained models on the labeled CIFAR-10 data. The weights of the linear classifier used to fine-tune the encoder network are initialized to zero; a minimal sketch of this setup is given at the end of this section. On CIFAR-10, models are fine-tuned for 90 epochs. All results are reported on the standard CIFAR-10 test set and are summarized in the following table.

Table 4: Performance comparisons of fine-tuning (CIFAR-10 test set).
γ         20%     40%     60%     80%
SimCLR    68.01   68.01   68.01   68.01
SupCon    71.35   67.46   61.62   49.03
Mix       70.01   66.89   63.86   55.38

As shown in Tables 2, 3, and 4, the performance of joint training (Mix) is no better than the winner of SimCLR and SupCon across all noise rates, which is similar to the conclusion on linear probing discussed in Section 6.4.
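As a point of reference, the fine-tuning evaluation reported in Table 4 can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' script: the optimizer, learning rate, and data-loading details are assumptions; only the zero-initialized linear classifier and the 90-epoch schedule come from the description above.

```python
import torch
import torch.nn as nn

def build_finetune_model(encoder: nn.Module, feat_dim: int, num_classes: int = 10) -> nn.Module:
    """Attach a zero-initialized linear classifier on top of a pretrained encoder."""
    head = nn.Linear(feat_dim, num_classes)
    nn.init.zeros_(head.weight)   # classifier weights initialized to zero
    nn.init.zeros_(head.bias)
    return nn.Sequential(encoder, head)

def finetune(model: nn.Module, loader, epochs: int = 90, lr: float = 0.01, device: str = "cuda"):
    """Fine-tune the whole network (encoder + classifier) on labeled CIFAR-10."""
    model = model.to(device)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)  # assumed optimizer / lr
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            criterion(model(x), y).backward()
            opt.step()
    return model
```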