# Generalization Analysis for Contrastive Representation Learning

Yunwen Lei¹, Tianbao Yang², Yiming Ying³, Ding-Xuan Zhou⁴

Recently, contrastive learning has found impressive success in advancing the state of the art in solving various machine learning tasks. However, the existing generalization analysis is very limited or even not meaningful. In particular, the existing generalization error bounds depend linearly on the number $k$ of negative examples, while it was widely shown in practice that choosing a large $k$ is necessary to guarantee good generalization of contrastive learning in downstream tasks. In this paper, we establish novel generalization bounds for contrastive learning which do not depend on $k$, up to logarithmic terms. Our analysis uses structural results on empirical covering numbers and Rademacher complexities to exploit the Lipschitz continuity of loss functions. For self-bounding Lipschitz loss functions, we further improve our results by developing optimistic bounds which imply fast rates in a low noise condition. We apply our results to learning with both linear representation and nonlinear representation by deep neural networks, for both of which we derive Rademacher complexity bounds to get improved generalization bounds.

¹Department of Mathematics, The University of Hong Kong. ²Department of Computer Science and Engineering, Texas A&M University. ³Department of Mathematics and Statistics, State University of New York at Albany. ⁴School of Mathematics and Statistics, University of Sydney. Correspondence to: Yiming Ying. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

## 1. Introduction

The performance of machine learning (ML) models often depends largely on the representation of data, which motivates a resurgence of contrastive representation learning (CRL) to learn a representation function $f : \mathcal{X} \mapsto \mathbb{R}^d$ from unsupervised data (Chen et al., 2020; Khosla et al., 2020; He et al., 2020). The basic idea is to pull together similar pairs $(x, x^+)$ and push apart dissimilar pairs $(x, x^-)$ in an embedding space, which can be formulated as minimizing the following objective (Chen et al., 2020; Oord et al., 2018)

$$\mathbb{E}_{x,\, x^+,\, \{x_i^-\}_{i=1}^k}\Big[\log\Big(1 + \sum_{i=1}^k \exp\big(-f(x)^\top\big(f(x^+) - f(x_i^-)\big)\big)\Big)\Big],$$

where $k$ is the number of negative examples. The hope is that the learned representation $f(x)$ would capture the latent structure and be beneficial to other downstream learning tasks (Arora et al., 2019; Tosh et al., 2021a).
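To make the objective above concrete, here is a minimal NumPy sketch (ours, not from the paper) of its empirical counterpart: the logistic contrastive loss evaluated on $n$ anchor/positive pairs, each paired with $k$ negatives. The function name, array shapes, and the random toy data are illustrative assumptions; the normalization step simply enforces the boundedness condition $\|f(x)\|_2 \le R$ used later in the analysis, here with $R = 1$.

```python
import numpy as np

def contrastive_logistic_loss(f_x, f_pos, f_negs):
    """Empirical unsupervised error with the logistic contrastive loss.

    f_x, f_pos : (n, d) arrays holding the representations f(x_j) and f(x_j^+).
    f_negs     : (n, k, d) array holding the representations f(x_{ji}^-) of the
                 k negative examples attached to the j-th anchor.
    """
    # v[j, i] = f(x_j)^T (f(x_j^+) - f(x_{ji}^-)), an (n, k) array of score gaps
    v = np.einsum("nd,nkd->nk", f_x, f_pos[:, None, :] - f_negs)
    # logistic loss log(1 + sum_i exp(-v_i)), averaged over the n training tuples
    return np.log1p(np.exp(-v).sum(axis=1)).mean()


# toy usage with random unit-norm representations (so ||f(x)||_2 <= R = 1)
rng = np.random.default_rng(0)
n, k, d = 32, 8, 16

def unit(a):
    # project rows onto the Euclidean unit sphere
    return a / np.linalg.norm(a, axis=-1, keepdims=True)

loss = contrastive_logistic_loss(unit(rng.normal(size=(n, d))),
                                 unit(rng.normal(size=(n, d))),
                                 unit(rng.normal(size=(n, k, d))))
print(f"empirical unsupervised error: {loss:.4f}")
```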
CRL has achieved impressive empirical performance in advancing the state of the art in various domains such as computer vision (He et al., 2020; Caron et al., 2020; Chen et al., 2020) and natural language processing (Brown et al., 2020; Gao et al., 2021; Radford et al., 2021).

The empirical success of CRL motivates a natural question on theoretically understanding how the learned representation adapts to the downstream tasks, i.e., how would the generalization behavior of downstream ML models benefit from the representation function built from positive and negative pairs? Especially, how would the number of negative examples affect the learning performance?

Arora et al. (2019) provided an attempt to answer the above questions by developing a theoretical framework to study CRL. They first gave generalization bounds for a learned representation function in terms of Rademacher complexities. Then, they showed that this generalization behavior, measured by an unsupervised loss, guarantees the generalization behavior of a linear classifier in the downstream classification task. However, the generalization bounds there have a linear dependency on $k$, which would not be effective if $k$ is large. Moreover, this is not consistent with many studies which show that a large number of negative examples (Chen et al., 2020; Tian et al., 2020a; Hénaff et al., 2020; Khosla et al., 2020) is necessary for good generalization performance. For example, He et al. (2020) used 65536 negative examples in unsupervised visual representation learning, for which the existing analysis requires $n \gtrsim (65536)^2 d$ training examples to get non-vacuous bounds (Arora et al., 2019). Yuan et al. (2022) demonstrated the benefit of using all negative data for each anchor datum in CRL and proposed an efficient algorithm for optimizing a global contrastive loss. Therefore, the existing analysis does not fully explain the superior performance of CRL on downstream tasks already observed in many applications.

Table 1. Comparison between our generalization bounds and those in Arora et al. (2019) for the logistic loss. Here $d$ is the number of learned features. The notation $\widetilde{O}$ ignores log factors.

In this paper, we aim to further deepen our understanding of CRL by fully exploiting the Lipschitz continuity of loss functions. Our contributions are listed as follows.

1. We develop generalization error bounds for CRL. We consider three types of loss functions: $\ell_2$-Lipschitz loss, $\ell_\infty$-Lipschitz loss and self-bounding Lipschitz loss. For $\ell_2$-Lipschitz loss, we develop a generalization bound with a square-root dependency on $k$ by two applications of vector-contraction lemmas on Rademacher complexities, which improves the existing bound by a factor of $\sqrt{k}$ (Arora et al., 2019). For $\ell_\infty$-Lipschitz loss, we develop generalization bounds which do not depend on $k$, up to some logarithmic terms, by approximating the arguments of loss functions via expanding the original dataset by a factor of $k$ to fully exploit the Lipschitz continuity. For self-bounding Lipschitz loss, we develop optimistic bounds involving the training errors, which can imply fast rates under a low noise setting. All of our generalization bounds involve Rademacher complexities of feature classes, which preserve the coupling among different features.

2. We then apply our general results to two unsupervised representation learning problems: learning with linear features and learning with nonlinear features via deep neural networks (DNNs). For learning with linear features, we consider two regularization schemes, i.e., the $p$-norm regularizer and the Schatten-norm regularizer. For learning with nonlinear features, we develop Rademacher complexity and generalization bounds with a square-root dependency on the depth of DNNs. To this aim, we adapt the technique in Golowich et al. (2018) by using a different moment generating function to capture the coupling among different features.

3. Finally, we apply our results on representation learning to the generalization analysis of downstream classification problems, which outperforms the existing results by a factor of $k$ (ignoring a log factor).

The remaining parts of the paper are organized as follows. Section 2 reviews the related work, and Section 3 provides the problem formulation.
We give generalization bounds for CRL in Section 4 for three types of loss functions, which are then applied to learning with both linear and nonlinear features in Section 5. Conclusions are given in Section 6. 2. Related Work The most related work is the generalization analysis of CRL in Arora et al. (2019), where the authors developed generalization bounds for unsupervised errors in terms of Rademacher complexity of representation function classes. Based on this, they further studied the performance of linear classifiers on the learned features. In particular, they considered the mean classifier where the weight for a class label is the mean of the representation of corresponding inputs. A major result in Arora et al. (2019) is to show that the classification errors of the mean classifier can be bounded by the unsupervised errors of learned representation functions. This shows that the downstream classification task can benefit from a learned representation function with a low unsupervised error. The above work motivates several interesting theoretical study of CRL. Nozawa et al. (2020) studied CRL in a PACBayesian setting, which aims to learn a posterior distribution of representation functions. Nozawa et al. (2020) derived PAC-Bayesian bounds for the posterior distribution and applied it to get PAC-Bayesian bounds for the mean-classifier, which relaxes the i.i.d. assumption. Negative examples in the framework (Arora et al., 2019) are typically taken to be randomly sampled datapoints, which may actually have the same label of the point of interest. This introduces a bias in the objective function of CRL, which leads to performance drops in practice. Motivated by this, Chuang et al. (2020) introduced a debiased CRL algorithm by building an approximation of unbiased error in CRL, and developed generalization guarantees for the downstream classification. Ash et al. (2022) refines the connection in Arora et al. (2019) by removing the collision probability in the denominator, which motivates the discussion on selecting the optimal number of negative examples. Nozawa & Sato (2021) improves the analysis Arora et al. (2019) by giving a generic transfer theorem between unsupervised loss and supervised loss, which exhibits a coverage-collision trade-off due to the number of negative examples. Several researchers studied CRL from other perspectives. Lee et al. (2021) proposed to learn a representation function f to minimize E(X1,X2)[ X2 f(X1) 2 2], where X1, X2 are unlabeled input and pretext target. Under an approximate conditional independency assumption, the authors showed that a linear function based on the learned representation approximates the true predictor on downstream problems. In a generative modeling setup, Tosh et al. (2021b) proposed to learn representation functions by a landmark embedding procedure, which can reveal the underlying topic posterior information. Tosh et al. (2021a) studied CRL in a multiview setting with two views available for each datum. Under Generalization Analysis for Contrastive Representation Learning an assumption on the redundancy between the two views, the authors showed that low-dimensional representation can achieve near optimal downstream performance with linear models. Hao Chen et al. (2021) studied self-supervised learning from the perspective of spectral clustering based on a population augmentation graph, and proposed a spectral contrastive loss. 
They further developed generalization bounds for both representation learning and the downstream classification. This result is improved in a recent work by developing a guarantee to incorporate the representation function class (Saunshi et al., 2022). Wang et al. (2022) removes the assumption on the conditional independency, and proposes a guarantee on the downstream performance from the viewpoint of augmentation overlap. Bao et al. (2022) shows that the contrastive loss can be viewed as a surrogate objective of the downstream loss by building upper and lower bounds for downstream classification errors. There are also recent work on theoretical analysis of representation learning via gradient-descent dynamics (Lee et al., 2021; Tian et al., 2020b), mutual information (Tsai et al., 2020), alignment of representations (Wang & Isola, 2020), and causality (Mitrovic et al., 2020). CRL is related to metric learning, for which generalization bounds have been studied in the literature (Cao et al., 2016). 3. Problem Formulation Let X denote the space of all possible datapoints. In CRL, we are given several similar data in the form of pairs (x, x+) drawn from a distribution Dsim and negative data x 1 , x 2 , . . . , x k drawn from a distribution Dneg unrelated to x. Our aim is to learn a feature map f : X 7 Rd from a class of representation functions F = f : f( ) 2 R for some R > 0, where 2 denotes the Euclidean norm. Here d N denotes the number of features. We follow the framework in Arora et al. (2019) to define the distribution Dsim and the distribution Dneg. Let C denote the set of all latent classes and for each class c C we assume there is a probability distribution Dc over X, which quantifies the relevance of x to the class c. We assume there is a probability distribution ρ defined over C. Then we define Dsim(x, x+) and Dneg(x ) as follows Dsim(x, x+) = Ec ρ Dc(x)Dc(x+) , Dneg(x ) = Ec ρ Dc(x ) . Intuitively, Dsim(x, x+) measures the probability of x and x+ being drawn from the same class c ρ, while Dneg(x ) measures the probability of drawing an un-relevant x . Let (xj, x+ j ) Dsim and (x j1, . . . , x jk) Dneg, j [n] := {1, . . . , n}, where k denotes the number of negative exam- ples. We collect these training examples into a dataset S = n (x1, x+ 1 , x 11, . . . , x 1k), (x2, x+ 2 , x 21, . . . , x 2k), . . . , (xn, x+ n , x n1, . . . , x nk) o . (3.1) Given a representation function f, we can measure its performance by building a classifier based on this representation and computing the accuracy of the classifier. To this aim, we define a (K + 1)-way supervised task T consisting of distinct classes {c1, . . . , c K+1} C. The examples for this supervised task are drawn by the following process: We first draw a label c T = {c1, . . . , c K+1} from a distribution DT over T , after which we draw an example x from Dc. This defines the following distribution over labeled pairs (x, c): DT (x, c) = Dc(x)DT (c). Since there is a label for each example, we can build a multi-class classifier g : X 7 RK+1 for T , where gc(x) measures the likelihood of assigning the class label c to the example x. The loss of g on a point (x, y) X T can be measured by ℓs g(x)y g(x)y }y =y , where ℓs : RK 7 R+. We quantify the performance of a classifier g on the task T by the supervised loss. By minimizing the supervised loss, we want to build a classifier whose component associated to the correct label is largest. Definition 3.1 (Supervised loss). Let g : X 7 RK+1 be a multi-class classifier. 
The supervised loss of g is defined as Lsup(T , g) := E(x,c) DT h ℓs g(x)c g(x)c For CRL, we often consider g as a linear classifier based on the learned representation f, i.e., g(x) = Wf(x), where W R(K+1) d. Then the performance of the representation function f(x) can be quantified by the accuracy of the best linear classifier on the representation f(x): Lsup(T , f) = min W R(K+1) d Lsup(T , Wf). To find a good representation f based on unsupervised dataset S, we need to introduce the concept of unsupervised loss functions. Let ℓ: Rk 7 R+ be a loss function for which popular choices include the hinge loss ℓ(v) = max 0, 1 + max i [k]{ vi} (3.2) and the logistic loss ℓ(v) = log 1 + X i [k] exp( vi) . (3.3) Let f(x) denote the transpose of f(x). Definition 3.2 (Unsupervised error). The population unsupervised error is defined as Lun(f) := E ℓ f(x) (f(x+) f(x i )) k i=1 . Generalization Analysis for Contrastive Representation Learning The empirical unsupervised error with S is defined as ˆLun(f) := 1 j=1 ℓ f(xj) (f(x+ j ) f(x ji)) k i=1 . A natural algorithm is to find among F the function with the minimal empirical unsupervised loss, i.e., ˆf := arg minf F ˆLun(f). This function can then be used for the downstream supervised learning task, e.g., to find a linear classifier g(x) = Wf(x) indexed by W R(K+1) d. 4. Generalization Error Bounds In this paper, we are interested in the performance of ˆf on testing, i.e., how the empirical behavior of ˆf on S would generalize well to testing examples. Specifically, we will control Lun( ˆf) ˆLun( ˆf). Since ˆf depends on the dataset S, we need to control the uniform deviation between population unsupervised error and empirical unsupervised error over the function class F, which depends on the complexity of F. In this paper, we will use Rademacher complexity to quantify the complexity of F (Bartlett & Mendelson, 2002). Definition 4.1 (Rademacher Complexity). Let e F be a class of real-valued functions over a space Z and e S = {zi}n i=1 Z. The empirical Rademacher complexity of e F with respect to (w.r.t.) e S is defined as Re S( e F) = Eϵ supf e F 1 n P i [n] ϵif(zi) , where ϵ = (ϵi)i [n] { 1}n are independent Rademacher variables. We define the worst-case Rademacher complexity as RZ,n( e F) = supe S Z:|e S|=n Re S( e F), where |e S| is the cardinality of e S. For any f F, we introduce gf : X k+2 7 R as follows gf(x, x+, x 1 , . . . , x k ) = ℓ f(x) f(x+) f(x i ) k i=1 . It is then clear that Lun(f) ˆLun(f) = Ex,x+,x 1 ,...,x k gf(x, x+, x 1 , . . . , x k ) j [n] gf(xj, x+ j , x j1, . . . , x jk). Results in learning theory show that we can bound Lun( ˆf) ˆLun( ˆf) by RS(G) (Bartlett & Mendelson, 2002), where G = (x, x+, x 1 , . . . , x k ) 7 gf(x, x+, x 1 , . . . , x k ) : f F . Note functions in G involve the nonlinear function ℓ: Rk 7 R+, which introduces difficulties in the complexity analysis. Our key idea is to use the Lipschitz continuity of ℓto reduce the complexity of G to the complexity of another function class without ℓ. Since the arguments in ℓare vectors, we can have different definition of Lipschitz continuity w.r.t. different norms (Lei et al., 2015; Tewari & Chaudhuri, 2015; Lei et al., 2019; Foster & Rakhlin, 2019; Mustafa et al., 2022). For any a = (a1, . . . , ak) Rk and p 1, we define the ℓp-norm as a p = Pn i=1 |ai|p 1 Definition 4.2 (Lipschitz continuity). We say ℓ: Rk 7 R+ is G-Lipschitz w.r.t. the ℓp-norm iff |ℓ(a) ℓ(a )| G a a p, a, a Rk. In this paper, we are particularly interested in the Lipschitz continuity w.r.t. 
either the ℓ2-norm or the ℓ -norm. According to Proposition G.1, the loss functions defined in Eq. (3.2) and Eq. (3.3) are 1-Lipschitz continuous w.r.t. , and 1-Lipschitz continuous w.r.t. 2 (Lei et al., 2019). Note each component of the arguments in ℓare of the form f(x) (f(x+) f(x )). This motivates the definition of the following function class H = n hf(x, x+, x ) = f(x) f(x+) f(x ) : f F o . As we will see in the analysis, the complexity of G is closely related to that of H. Therefore, we first show how to control the complexity of H. In the following lemma, we provide Rademacher complexity bounds of H w.r.t. a general dataset S of cardinality n. We will use a vector-contraction lemma to prove it (Maurer, 2016). The basic idea is to notice the Lipschitz continuity of the map (x, x+, x ) 7 x (x+ x ) w.r.t. 2 on X 3. The proof is given in Section B. Lemma 4.3. Let n N and S = {(xj, x+ j , x j ) : j [n]}. Assume f(x) 2 R for any f F and x S . Then n Eϵ { 1}n { 1}d { 1}3 h sup f F ϵj,t,1ft(xj)+ϵj,t,2ft(x+ j )+ϵj,t,3ft(x j ) i , where ft(x) is the t-th component of f(x) Rd. Remark 4.4. We compare Lemma 4.3 with the following Rademacher complexity bound in Hao Chen et al. (2021) Eϵ { 1}n h sup f F j [n] ϵjf(xj) f(x+ j ) i d max t [d] Eϵ { 1}n h sup ft Ft j [n] ϵjft(xj)ft(x+ j ) i , (4.1) where Ft = x 7 ft(x) : f F . As a comparison, our analysis in Lemma 4.3 can imply the following bound Eϵ { 1}n h sup f F j [n] ϵjf(xj) f(x+ j ) i 2REϵ { 1}2nd h sup f F ϵj,t,1ft(xj)+ϵj,t,2ft(x+ j ) i . Eq. (4.1) decouples the relationship among different features since the maximization over t [d] is outside of the Generalization Analysis for Contrastive Representation Learning expectation operator. As a comparison, our result preserves this coupling since the summation over t [d] is inside the supermum over f F. This preservation of coupling has an effect on the bound. Indeed, it is expected that Eϵ { 1}2nd h sup f F ϵj,t,1ft(xj)+ϵj,t,2ft(x+ j ) i = d Eϵ { 1}2n h sup ft Ft ϵj,1ft(xj)+ϵj,2ft(x+ j ) i . In this case, our result implies a bound with a better dependency on d as compared to Eq. (4.1) (the factor of d in Eq. (4.1) is replaced by d here). We can plug our bound into the analysis in Hao Chen et al. (2021) to improve their results. In Section A we will give a specific example where our bound can outperform Eq. (4.1) by a factor of d. Remark 4.5. Lemma 4.3 requires an assumption f(x) 2 R. This assumption can be achieved by adding a projection operator as f(x) = PR( f(x)) for f F, where PR denotes the projection operator onto the Euclidean ball with radius R around the zero point. According to the inequality PR( f(x)) PR( f (x)) 2 f(x) f (x) 2, the arguments in the proof indeed show the following inequality with H = h f(x, x+, x ) = PR( f(x)) PR( f(x+)) PR( f(x )) : f F : n Eϵ { 1}3nd h sup f F t [d] ϵj,t,1 ft(xj) + ϵj,t,2 ft(x+ j ) + ϵj,t,3 ft(x j ) i . That is, we can add a projection operator over F to remove the assumption f(x) 2 R. Loss Arora et al. 19 Ours 1-ℓ2-Lipschitz A n 1-ℓ -Lipschitz S.B. 1-Lipschitz n 1+n 2k 1C2+(n 1 Table 2. Comparison between our generalization bounds and those in Arora et al. (2019). The notation means we ignore log factors. S.B. means self-bounding. The notations A, B and C are defined in Eq. (4.2), (4.4) and (4.5), which are typically of the same order. Then, our results improve the bounds in Arora et al. (2019) by a factor of k for ℓ2-Lipschitz loss, and by a factor of k for ℓ - Lipscthiz loss. For self-bounding loss, we get optimistic bounds. 4.1. 
ℓ2 Lipschitz Loss We first consider the ℓ2 Lipschitz loss. The following theorem to be proved in Section B gives Rademacher complexity and generalization error bounds for unsupervised loss function classes. We always assume ℓ f(x) (f(x+) f(x i )) k i=1 B for any f F in this paper. Theorem 4.6 (Generalization bound: ℓ2-Lipschitz loss). Assume f(x) 2 R for any f F and x X. Let S be defined as in Eq. (3.1). If ℓ: Rk 7 R+ is G2-Lipschitz w.r.t. the ℓ2-norm, then RS(G) A = E{ϵ} { 1}3nkd E h sup f F ϵj,i,t,1ft(xj) + ϵj,i,t,2ft(x+ j ) + ϵj,i,t,3ft(x ji) i . (4.2) Furthermore, for any δ (0, 1), with probability at least 1 δ the following inequality holds for any f F Lun(f) ˆLun(f) 4 Remark 4.7. Under the same Lipschitz continuity w.r.t. 2, the following bound was established in Arora et al. (2019) Lun(f) = ˆLun(f) + O G2R (4.3) where B = Eϵ { 1}n { 1}d { 1}k+2 h sup f F t [d] (4.4) ϵj,t,k+1ft(xj) + ϵj,t,k+2ft(x+ j ) + X i [k] ϵj,t,kft(x ji) i . Note A and B are of the same order. Indeed, the dominating term in the braces of the above equation is P i [k] ϵj,t,kft(x ji) and therefore we have B Eϵ h sup f F i [k] ϵj,t,kft(x ji) i . Furthermore, A grows also in this order since ϵj,i,t,1ft(xj)+ ϵj,i,t,2ft(x+ j ) + ϵj,i,t,3ft(x ji) is of the same order of ϵj,i,t,3ft(x ji). Typically, A B nkd since there are O(nkd) terms in the summation inside the supremum. In this case, Theorem 4.6 implies a bound O( p kd/n), while Eq. (4.3) gives a bound O(k d/ n). It is clear our bound improves the bound in Arora et al. (2019) by a factor of 4.2. ℓ Lipschitz Loss We now turn to the analysis for the setting with ℓ Lipschitz continuity assumption, which is more challenging. The following theorem controls the Rademacher complexity of G w.r.t. the dataset S in terms of the worst-case Rademacher Generalization Analysis for Contrastive Representation Learning complexity of H defined on the set SH, where SH = n (x1, x+ 1 , x 11), (x1, x+ 1 , x 12), . . . , (x1, x+ 1 , x 1k) | {z } induced by the first example (x2, x+ 2 , x 21), (x2, x+ 2 , x 22), . . . , (x2, x+ 2 , x 2k) | {z } induced by the second example (xn, x+ n , x n1), (xn, x+ n , x n2), . . . , (xn, x+ n , x nk) | {z } induced by the last example As compared to G, the function class H removes the loss function ℓand is easier to handle. Our basic idea is to exploit the Lipschitz continuity of ℓw.r.t. : to approximate the function class {ℓ(v1(y), . . . , vk(y))}, it suffices to approximate each component vj(y), j [k]. This explains why we expand the set S of cardinality n to the set SH of cardinality nk. The proof is given in Section C. Theorem 4.8 (Complexity bound: ℓ -Lipschitz loss). Assume f(x) 2 R for any f F and x X. Let S be defined as in Eq. (3.1). If ℓ: Rk 7 R+ is G-Lipschitz w.r.t. the ℓ -norm, then RS(G) 24G(R2 + 1)n 1 k RSH,nk(H) 1 + log(4R2n 3 2 k) l log2 R2 n where RSH,nk(H) = max ( xj, x+ j , x j ) j [nk] SH Eϵ { 1}nk j [nk] ϵjf( xj) (f( x+ j ) f( x j )) i . Note in RSH,nk(H) we restrict the domain of functions in H to SH, and allow an element in SH to be chosen several times in the above maximization. We can use Lemma 4.3 to control RSH,nk(H) in Theorem 4.8, and derive the following generalization error bound. The proof is given in Section C. Theorem 4.9 (Generalization bound: ℓ -Lipschitz loss). Let ℓ: Rk 7 R+ be G-Lipschitz continuous w.r.t. . Assume f(x) 2 R, δ (0, 1). 
Then with probability at least 1 δ over S for all f F we have Lun(f) ˆLun(f)+3B 2n +48G(R2+1)n 1 1 + log(4R2n 3 2 k) l log2 R2 n C = max {( xj, x+ j , x j )}nk j=1 SH Eϵ { 1}nk { 1}d { 1}3 (4.5) ϵj,t,1ft( xj) + ϵj,t,2ft( x+ j ) + ϵj,t,3ft( x j ) i . Remark 4.10. We now compare our bound with Eq. (4.3) developed in Arora et al. (2019). It is reasonable to assume C and B are of the same order. 1 Then, our bound becomes Lun(f) = ˆLun(f)+O GRB log2(n Rk) (4.6) We know if ℓis G2-Lipschitz continuous w.r.t. 2, it is also k G2-Lipschitz continuous w.r.t. . Therefore, in the extreme case we have G = k G2. Even in this extreme case, our bound is of the order Lun(f) = ˆLun(f)+ O G2RB log2(n Rk) n , which improves Eq. (4.3) by a factor of k up to a logarithmic factor. For popular loss functions defined in Eq. (3.2) and Eq. (3.3), we have G = G2 = 1 and in this case, our bound in Eq. (4.6) improves Eq. (4.3) by a factor of k if we ignore a logarithmic factor. Remark 4.11. we now give the intuition of our improvements. Arora et al. (2019) considers the 1-Lipschitz continuity of ℓ: Rk 7 R w.r.t. 2, while we use the 1-Lipschitz continuity of ℓw.r.t. . Note that 1-Lipschitz continuity w.r.t. 2 is a weaker condition as compared to the 1Lipschitz continuity w.r.t. . Indeed, if ℓis 1-Lipschitz continuous w.r.t. 2, then it is also k-Lipschitz continuous w.r.t. with a much larger Lipschitz constant. In our problem, the contrastive loss is 1-Lipschitz continuous w.r.t. both 2 and . Therefore, we can use the stronger assumption on the 1-Lipschitz continuity w.r.t. to save a factor of k. Furthermore, Arora et al. (2019) use the inequality J 2 J F , where J Rk (k+2)d, 2 is the spectral norm and F is the Frobenius norm. The inequality introduces an additional factor of k since J F can be as large as k J 2. As a comparison, our analysis based on Lipschitz continuity w.r.t. does not introduce any loss in the factor of k (up to a log term), and outperforms Arora et al. (2019) by a factor of k. 4.3. Self-bounding Lipschitz Loss Finally, we consider a self-bounding Lipschitz condition where the Lipschitz constant depends on the loss function values. This definition was given in Reeve & Kaban (2020). Definition 4.12 (Self-bounding Lipschitz Continuity). A loss function ℓ: Rk 7 R+ is said to be Gs-self-bounding Lipschitz continuous w.r.t. ℓ norm if for any a, a Rk ℓ(a) ℓ(a ) Gs max ℓ(a), ℓ(a ) 1 It was shown that the logistic loss given in Eq. (3.3) satisfies the self-bounding Lipschtiz continuity with Gs = 2 (Reeve 1Indeed, under a typical behavior of Rademacher complexity as Eϵ { 1}n supa A Rn ϵiai = O( n) (Bartlett & Mendelson, 2002), we have C = O( nkd) and B = O( Generalization Analysis for Contrastive Representation Learning & Kaban, 2020). In the following theorem, we give generalization bounds for learning with self-bounding Lipschitz loss functions. The basic idea is to replace the Lipschitz constant G in Theorem 4.9 with empirical errors by using the self-bounding property. We use e O to hide logarithmic factors. The proof is given in Section D. Theorem 4.13 (Generalization bound: self-bounding Lipschitz loss). Let ℓ: Rk 7 R+ be Gs-self-bounding Lipschitz continuous w.r.t. . Assume f(x) 2 R, δ (0, 1). Then with probability at least 1 δ over S we have the following inequality uniformly for all f F Lun(f)= ˆLun(f)+e O (B+G2 s R4)n 1+G2 s R2n 2k 1C2 B+Gs R2)n 1 2 +Gs Rn 1k 1 1 2un(f) log 1 2 (1/δ). Remark 4.14. 
Theorem 4.13 gives optimistic generalization bounds in the sense that the upper bounds depend on empirical errors (Srebro et al., 2010). Therefore, the generalization bounds for Lun(f) ˆLun(f) would benefit from low training errors. In particular, if ˆL 1 2un(f) = 0, Theorem 4.13 implies generalization bounds Lun(f) = ˆLun(f)+e O Bn 1+G2 s R4n 1+G2 s R2n 2k 1C2 . Typically, we have C = O( nk) and in this case Lun(f) = ˆLun(f) + e O Bn 1 + G2 s R4n 1 . In other words, we get fast-decaying error bounds in an interpolating setting. 5. Applications To apply Theorem 4.6 and Theorem 4.9, we need to control the term A or C, which is related to the Rademacher complexity of a function class. In this section, we will show how to control C for features of the form x 7 Uv(x), where U Rd d is a matrix and v : X 7 Rd . Here v maps the original data x X to an intermediate feature in Rd , which is used for all the final features. If v is the identity map, then we get linear features. If v is a neural network, then we get nonlinear features. For a norm on a matrix, we denote by its dual norm. The following lemmas to be proved in Section E give general results on Rademacher complexities. Lemma 5.1 gives upper bounds, while Lemma 5.2 gives lower bounds. It is immediate to extend our analysis to control A. For brevity we ignore such a discussion. Lemma 5.1 (Upper bound). Let d, d N. Let V be a class of functions from X to Rd . Let F = {f(x) = Uv(x) : U U, v V}, where U = U = (u1, . . . , ud) Rd d : U Λ and f(x) = Uv(x) = (u1, . . . , ud) v(x). Then Eϵ { 1}nd sup f F j [n] ϵj,tft(xj) ΛEϵ { 1}nd sup v V j [n] ϵ1,jv(xj), . . . , X j [n] ϵd,jv(xj) . Lemma 5.2 (Lower bound). If F is symmetric in the sense that f F implies f F, then we have 2 sup f F max ( x, x+, x ) SH f( x) 2 2+ f( x+) 2 2+ f( x ) 2 2 1 Note in our definition of F, we ignore the projection operator, i.e., the feature function class should be of the form f(x) = PR Uv(x)) to satisfy the assumption f(x) 2 R. According to Remark 4.5, it is easy to extend our analysis here to the case with including the projection operator in the definition of feature function class. 5.1. Linear Features We first apply Lemma 5.1 to derive Rademacher complexity bounds for learning with linear features. For any p 1 and a matrix W = (w1, . . . , wd ) Rd d , the ℓ2,p norm of W is defined as W 2,p = P i [d ] wi p 2 1 p . If p = 2, this becomes the Frobenius norm W F . For any p 1, the Schatten-p norm of a matrix W Rd d is defined as the ℓp-norm of the vector of singular values (σ1(W), . . . , σmin{d,d }(W)) (the singular values are assumed to be sorted in non-increasing order), i.e., W Sp := σ(W) p. Let p be the number satisfying 1/p+1/p = 1. The following proposition to be proved in Section E.1 gives complexity bounds for learning with linear features. Proposition 5.3 (Linear representation). Consider the feature map defined in Lemma 5.1 with v(x) = x. (a) If = 2,p, then Eϵ sup f F j [n] ϵt,jft(xj) n Λd1/q max( p q 1, 1) o X j [n] xj 2 2 1 (b) If = Sp with p 2, then j [n] ϵt,jft(xj) Λ2 1 4 min q [p,2] j [n] xjx j 1 2 Sq , d1/q X j [n] xj 2 2 1/2o . We now plug the above Rademacher complexity bounds into Theorem 4.9 to give generalization error bounds for learning with unsupervised loss. Let Bx = max{ xj 2, x+ j 2, x jt 2 : j [n], t [k]}. 
Note P j [nk] xj 2 2 1 nk Bx for xj in the definition of C, from which and Proposition 5.3 we get the following bound for the case v(x) = x (the definition of C involves nk examples, while in Proposition 5.3 we consider n examples): nk Bx min q p Λd1/q max( p The following corollary then follows from Theorem 4.9. Generalization Analysis for Contrastive Representation Learning Corollary 5.4. Consider the feature map in Proposition 5.3 with = 2,p. Let ℓbe the logistic loss and δ (0, 1). Then with probability at least 1 δ, for all f F Lun(f) ˆLun(f) = B log 1 2 (1/δ) n + e O GRBx minq p Λd1/q max( q 1, 1) It is also possible to give generalization bounds for learning with ℓ2-Lipschitz loss functions, and optimistic generalization bounds for learning with self-bounding Lipschitz loss functions. We omit the discussion for brevity. 5.2. Nonlinear Features We now consider Rademacher complexity for learning with nonlinear features by DNNs. The following lemma to be proved in Section E.2 gives Rademacher complexity bounds for learning with features by DNNs. We say an activation σ : R 7 R is positive-homogeneous if σ(ax) = aσ(x) for a 0, contractive if |σ(x) σ(x )| |x x |. The Re LU activation function σ(x) = max{x, 0} is both positivehomogeneous and contractive. Proposition 5.5 (Nonlinear representation). Consider the feature map defined in Lemma 5.1 with = F and V = n x 7 v(x) = σ VLσ VL 1 σ(V1x) : Vl F Bl, l [L] o , where σ is positive-homogeneous, contractive and σ(0) = 0, and L is the number of layers. Then j [n] ϵt,jft(xj) dΛBLBL 1 B1 1 i 0 and p 1, the empirical ℓp-norm covering number Np(ϵ, e F, e S) with respect to e S is defined as the smallest number m of a collection of vectors v1, . . . , vm {(f(z1), . . . , f(zn)) : f e F} such that sup f e F min j [m] i [n] |f(zi) vj i |p 1 where vj i is the i-th component of the vector vj. In this case, we call {v1, . . . , vm} an (ϵ, ℓp)-cover of e F with respect to e S. Definition C.2 (Fat-Shattering Dimension). Let e F be a class of real-valued functions defined over a space e Z. We define the fat-shattering dimension fatϵ( e F) at scale ϵ > 0 as the largest D N such that there exist D points z1, . . . , z D Z and witnesses s1, . . . , s D R satisfying: for any δ1, . . . , δD { 1} there exists f e F with δi(f(zi) si) ϵ/2, i [D]. To prove Theorem 4.8, we need to introduce the following lemma on Rademacher complexity, fat-shattering dimension and covering numbers. Part (a) shows that the covering number can be bounded by fat-shattering dimension (see, e.g., Theorem 12.8 in Anthony & Bartlett (1999)). Part (b) shows that the fat-shattering dimension can be controlled by the worst-case Rademacher complexity, which was developed in Srebro et al. (2010). Part (c) is a discretization of the chain integral to control Rademacher complexity by covering numbers (Srebro et al., 2010), which can be found in Guermeur (2017). Let e be the base of the natural logarithms. Lemma C.3. Let e S := {z1, . . . , zn} Z. Let e F be a class of real-valued functions defined over a space e Z. (a) If functions in e F take values in [ B, B], then for any ϵ > 0 with fatϵ( e F) < n we have log N (ϵ, e F, e S) 1 + fatϵ/4( e F) log2 4e Bn ϵfatϵ/4( e F) (b) For any ϵ > R e Z,n( e F), we have fatϵ( e F) < 4n ϵ2 R2 e Z,n( e F). (c) Let (ϵj) j=0 be a monotone sequence decreasing to 0 and any (a1, . . . , an) Rn. If ϵ0 q n 1 supf e F Pn i=1 f(zi) ai 2, then for any non-negative integer N we have Re S( e F) 2 log N (ϵj, e F, e S) n + ϵN. 
(C.1) According to Part (a) of Lemma C.3, the following inequality holds for any ϵ (0, 2B] (the case fatϵ/4( e F) = 0 is trivial since in this case we have N (ϵ, e F, e S) = 1, and otherwise we have fatϵ/4( e F) 1) log N (ϵ, e F, e S) 1 + fatϵ/4( e F) log2 2 8e B2|e S| We follow the arguments in Lei et al. (2019) to prove Theorem 4.8. Proof of Theorem 4.8. We first relate the empirical ℓ -covering number of F w.r.t. S = {(xj, x+ j , x j1, x j2, . . . , x jk) : j [n]} to the empirical ℓ -covering number of H w.r.t. SH. Let n rm = rm 1,1, rm 1,2, . . . , rm 1,k, . . . , rm n,1, rm n,2, . . . , rm n,k : m [N] o be an (ϵ/G, ℓ )-cover of H w.r.t. SH. Recall that hf(x, x+, x ) = f(x) f(x+) f(x ) . (C.3) Then, by the definition of ℓ -cover we know for any f F we can find m [N] such that max j [n] max i [k] hf(xj, x+ j , x ji) rm j,i ϵ/G. Generalization Analysis for Contrastive Representation Learning By the Lipschitz continuity of ℓ, we then get ℓ f(xj) f(x+ j ) f(x ji) k i=1 ℓ {rm j,i}k i=1 G f(xj) f(x+ j ) f(x ji) k i=1 rm j,i k i=1 = G hf(xj, x+ j , x ji) k i=1 rm j,i k i=1 Gϵ/G = ϵ. This shows that ℓ {rm 1,i}k i=1 , ℓ {rm 2,i}k i=1 , . . . , ℓ {rm n,i}k i=1 : m [N] is an (ϵ, ℓ )-cover of G w.r.t. S and therefore N (ϵ, G, S) N (ϵ/G, H, SH). (C.4) Since we consider empirical covering number of F w.r.t. S, we can assume functions in H are defined over SH. For simplicity, we denote Rnk(H) := RSH,nk(H). We now control N (ϵ/G, H, SH) by Rademacher complexities of H. For any ϵ > 2Rnk(H), it follows from Part (b) of Lemma C.3 that fatϵ(H) 4nk ϵ2 R2 SH,nk(H) nk. (C.5) Note for any f F, we have f(x) (f(x+) f(x )) [ 2R2, 2R2]. It then follows from Eq. (C.2) and Eq. (C.5) that (replace B by 2R2) log N (ϵ, H, SH) 1 + fatϵ/4(H) log2(32e R4nk/ϵ2) 1 + 64nk R2 nk(H) ϵ2 log2(32e R4nk/ϵ2), ϵ (0, 4R2]. We can combine the above inequality and Eq. (C.4) to derive the following inequality for any 2GRnk(H) ϵ 4GR2 log N (ϵ, G, S) 1 + 64nk G2R2 nk(H) ϵ2 log2(32e R4G2nk/ϵ2). (C.6) Let ϵN = 24G max k Rnk(H), n 1 ϵj = 2N jϵN, j = 0, . . . , N 1, N = log2 2GR2 k Rnk(H), n 1 It is clear from the definition that ϵ0 2GR2 ϵ0/2. The Lipschitz continuity of ℓimplies ℓ(({hf(x, x+, x i )}i [k])) ℓ((0, 0, . . . , 0)) G hf(x, x+, x ) 2R2G. According to the above inequality and Part (c) of Lemma C.3, we know (note ϵN 2GRnk(H) and therefore Eq. (C.6) holds for ϵ = ϵj, j = 1, . . . , N) j=1 (ϵj + ϵj 1) log N (ϵj, G, S) j=1 (ϵj + ϵj 1) + 16G nk Rnk(H) n (ϵj + ϵj 1) log(32e R4G2nk/ϵ2 j) ϵj + ϵN 2 + ϵN + 48G nk Rnk(H) n j=1 log(32e R4G2nk/ϵ2 j) 2 + ϵN + 48GN k Rnk(H) log(32e R4G2nk) + 48G j=1 log(1/ϵ2 j). Generalization Analysis for Contrastive Representation Learning According to the definition of ϵk, we know j=1 log(1/ϵ2 j) = j=1 log(22j/ϵ2 0) = j=1 log(1/ϵ2 0) + log 4 j=1 j = N log(1/ϵ2 0) + N(N + 1) log 4 = N log 1/ϵ2 0 + (N + 1) log 2 = N log 2N+1/ϵ2 0 = N log 1 = N log n 24G2R2 . We can combine the above two inequalities together to get RS(G) 24GR2n 1 2 + ϵN + 48NG k Rnk(H) log(32e R4G2nk) + log n 24G2R2 2 + ϵN + 48G k Rnk(H) log(32e R4G2nk) + log n 24G2R2 l log2 2GR2 n 2 + ϵN + 48G k Rnk(H) log(4R2n 3 2 k) l log2 R2 n where we have used the definition of N and 32e/24 4. The proof is completed. Proof of Theorem 4.9. 
Applying Lemma B.2, with probability at least 1 δ the following inequality holds with probability at least 1 δ Lun(f) ˆLun(f) + 2RS(G) + 3B According to Theorem 4.8 and Lemma 4.3, we know RS(G) 48G(R2 + 1)n 1 k Rnk(H) 1 + log(R2n 3 2 k) l log2 R2 n 48G(R2 + 1)n 1 1 + log(R2n 3 2 k) l log2 R2 n We can combine the above two inequalities together and derive the stated bound. The proof is completed. D. Proof of Theorem 4.13 To prove Theorem 4.13, we introduce the following lemma on generalization error bounds in terms of local Rademacher complexities (Reeve & Kaban, 2020). Lemma D.1 (Reeve & Kaban 2020). Consider a function class G of functions mapping Z to [0, b]. For any e S = {zi : i [n]} and g G, let ˆEe S[g] = 1 i [n] g(zi). Assume for any e S Zn and r > 0, we have Re S {g G : ˆEe S[g] r} ϕn(r), where ϕn : R+ 7 R+ is non-decreasing and ϕn(r)/ r is non-increasing. Let ˆrn be the largest solution of the equation ϕn(r) = r. For any δ (0, 1), with probability at least 1 δ the following inequality holds uniformly for all g G Ez[g(z)] ˆEe S[g] + 90(ˆrn + r0) + 4 q ˆEe S[g](ˆrn + r0), where r0 = b log(1/δ) + 6 log log n /n. Proof of Theorem 4.13. For any r > 0, we define Fr as a subset of F with the empirical error less than or equal to r Fr = n f F : 1 j [n] gf(xj, x+ j , x j1, . . . , x jk) r o . Let n rm = rm 1,1, rm 1,2, . . . , rm 1,k, . . . , rm n,1, rm n,2, . . . , rm n,k : m [N] o Generalization Analysis for Contrastive Representation Learning 2r Gs), ℓ )-cover of Hr := hf H : f Fr w.r.t. SH. Then, by the definition of ℓ -cover we know for any f Fr we can find m [N] such that max j [n] max i [k] hf(xj, x+ j , x ji) rm j,i ϵ/( According to the self-bounding Lipschitz continuity of ℓ, we know ℓ f(xj) f(x+ j ) f(x ji) k i=1 ℓ {rm j,i}k i=1 2 j [n] max n ℓ f(xj) f(x+ j ) f(x ji) k i=1 , ℓ {rm j,i}k i=1 o f(xj) f(x+ j ) f(x ji) k i=1 rm j,i k i=1 2 ℓ f(xj) f(x+ j ) f(x ji) k i=1 + ℓ {rm j,i}k i=1 hf(xj, x+ j , x ji) k i=1 rm j,i k i=1 2 2G2 srϵ2/(2r G2 s) = ϵ2, where we have used the following inequalities due to the definition of Fr j [n] ℓ f(xj) f(x+ j ) f(x ji) k i=1 r, 1 n j [n] ℓ {rm j,i}k i=1 r. Therefore, we have N2(ϵ, Gr, S) N (ϵ/( 2r Gs), Hr, SH), where Gr = {gf G : f Fr}. Analyzing analogously to the proof of Theorem 4.8, we get (replacing G there by 2r Gs(R2 + 1)n 1 k RSH,nk(H) 1 + log(4R2n 3 2 k) l log2 R2 n m := ψn(r). Let ˆrn be the point satisfying ˆrn = ψn(ˆrn): 2ˆrn Gs(R2 + 1)n 1 k RSH,nk(H) 1 + log(4R2n 3 2 k) l log2 R2 n from which we get ˆrn = e O G2 s R4n 1 + G2 sk R2 SH,nk(H) . We can apply Lemma D.1 to get the following inequality with probability at least 1 δ uniformly for all f F Lun(f) = ˆLun(f)+ e O Bn 1+G2 s R4n 1+G2 sk R2 SH,nk(H) + e O 2 +Gs R2n 1 k RSH,nk(H) ˆL We can apply Lemma 4.3 to control RSH,nk(H) and derive the following bound Lun(f) = ˆLun(f) + e O Bn 1 + G2 s R4n 1 + G2 s R2n 2k 1C2 + e O 2 + Gs R2n 1 2 + Gs Rn 1k 1 The proof is completed. E. Proof on Rademacher Complexities We first prove Rademacher complexity bounds for feature spaces in Lemma 5.1, and then give lower bounds. Finally, we will apply it to prove Proposition 5.3 and Proposition 5.5. Proof of Lemma 5.1. Let U = (u1, . . . , ud) and VS = (v(x1), . . . , v(xn)). Then it is clear u 1 v(x1) u 1 v(xn) ... ... ... u d v(x1) u d v(xn) Generalization Analysis for Contrastive Representation Learning ϵ1,1 ϵ1,n ... ... ... ϵd,1 ϵd,n Then we have X j [n] ϵt,ju t v(xj) = Mϵ, UVS = trace(M ϵ UVS) = trace(UVSM ϵ ) = U , VSM ϵ U VSM ϵ , where trace denotes the trace of a matrix. 
Therefore, we have j [n] ϵt,jft(xj) = sup U U,v V j [n] ϵt,ju t v(xj) Λ sup v V VSM ϵ = Λ sup v V j [n] ϵ1,jv(xj), . . . , X j [n] ϵd,jv(xj) . The proof is completed. Proof of Lemma 5.2. Note X ϵj,t,1ft( xj) + ϵj,t,2ft( x+ j ) + ϵj,t,3ft( x j ) = ϵj,t,1ft( xj)+ϵj,t,2ft( x+ j )+ϵj,t,3ft( x j ) , X ϵj,t,1ft( xj)+ϵj,t,2ft( x+ j )+ϵj,t,3ft( x j ) o . According to the symmetry of F we know C = max {( xj, x+ j , x j )}nk j=1 SH Eϵ { 1}nk { 1}d { 1}3 h sup f F ϵj,t,1ft( xj) + ϵj,t,2ft( x+ j ) + ϵj,t,3ft( x j ) i sup f F max {( xj, x+ j , x j )}nk j=1 SH Eϵ { 1}nk { 1}d { 1}3 h X ϵj,t,1ft( xj) + ϵj,t,2ft( x+ j ) + ϵj,t,3ft( x j ) i , where we have used the Jensen s inequality in the last step. Since we take maximization over {( xj, x+ j , x j )}nk j=1 SH, we can choose ( xj, x+ j , x j ) = ( x, x+, x ) for any ( x, x+, x ) SH. Then we get C sup f F max ( x, x+, x ) SH Eϵ { 1}nk { 1}d { 1}3 h X ϵj,t,1ft( x) + ϵj,t,2ft( x+) + ϵj,t,3ft( x ) i 2 sup f F max ( x, x+, x ) SH f 2 t ( x) + f 2 t ( x+) + f 2 t ( x ) 1 2 sup f F max ( x, x+, x ) SH f( x) 2 2 + f( x+) 2 2 + f( x ) 2 2 1 2 1nk sup f F max ( x, x+, x ) SH f( x) 2 2 + f( x+) 2 2 + f( x ) 2 2 1 where we have used the following Khitchine-Kahane inequality (De la Pena & Gin e, 2012) i=1 ϵiti 2 1 i=1 |ti|2 1 2 , t1, . . . , tn R, (E.1) The proof is completed. Generalization Analysis for Contrastive Representation Learning Remark E.1. The analysis in the proof implies a lower bound for Re S( e F) for a symmetric e F and e S = {z1, . . . , zn} Re S( e F) 1 2n sup f e F f(z1), . . . , f(zn) 2. Indeed, by the symmetry of F, the Jensen inequality and Eq. (E.1), we have Re S( e F) = Eϵ sup f e F i [n] ϵif(zi) = Eϵ sup f e F i [n] ϵif(zi) n sup f e F Eϵ X i [n] ϵif(zi) 1 2n sup f e F i [n] f 2(zi) 1 E.1. Proof of Proposition 5.3 The following Khintchine-Kahane inequality (De la Pena & Gin e, 2012; Lust-Piquard & Pisier, 1991) is very useful for us to estimate Rademacher complexities. Lemma E.2. Let ϵ1, . . . , ϵn be a sequence of independent Rademacher variables. (a) Let v1, . . . , vn H, where H is a Hilbert space with being the associated norm. Then, for any p 1 there holds i=1 ϵivi p 1 p 1, 1) n X (b) Let X1, . . . , Xn be a set of matrices of the same dimension. For all q 2, i=1 ϵi Xi q Sq e max n n X i=1 X i Xi 1 i=1 Xi X i 1 Proof of Proposition G.1. Let q p. It is clear q p . The dual norm of 2,p is 2,p . Therefore, according to Lemma 5.1 and p q we know j [n] ϵt,jft(xj) ΛEϵ X j [n] ϵt,jxj p j [n] ϵt,jxj q j [n] ϵt,jxj q i 1/q = Λ d Eϵ X j [n] ϵjxj q where we have used Jense s inequality and the concavity of x 7 x1/q . By Lemma E.2, we know j [n] ϵjxj q j [n] xj 2 2 q It then follows that j [n] ϵt,jft(xj) Λd1/q max( p j [n] xj 2 2 1 Note the above inequality holds for any q p. This proves Part (a). We now prove Part (b). Since the dual norm of Sp is Sp , by Lemma 5.1 we know j [n] ϵt,jft(xj) ΛEϵ X j [n] ϵ1,jxj, . . . , X j [n] ϵd,jxj Sp . For any t [d] and j [n], define e Xt,j = 0 0 xj 0 0 , Generalization Analysis for Contrastive Representation Learning i.e., the t-th column of e Xt,j = xj, and other columns are zero vectors. This implies that X j [n] ϵ1,jxj, . . . , X j [n] ϵd,jxj = X j [n] ϵt,j e Xt,j. It is clear that e Xt,j e X t,j = xjx j and e X t,j e Xt,j = 0 0 ... . . . 0 ... 0 x j xj 0 0 . . . . = x j xjdiag(0, . . . , 0 | {z } t 1 , 1, 0 . . . , 0), where diag(a1, . . . , an) denotes the diagonal matrix with elements a1, . . . , an. 
Therefore, we have X j [n] e Xt,j e X t,j = d X j [n] xjx j j [n] e X t,j e Xt,j = X j [n] x j xj Id d, where Id d denotes the identity matrix in Rd d. Therefore, we can apply Lemma E.2 to show that j [n] ϵt,jft(xj) Λ Eϵ X j [n] ϵ1,jxj, . . . , X j [n] ϵd,jxj q e max n d X j [n] xjx j 1 2 Sq , d1/q X j [d] xj 2 2 1 The proof is completed. E.2. Proof of Proposition 5.5 For convenience we introduce the following sequence of function spaces Vk = n x 7 σk Vkσ Vk 1 σ(V1x) : Vj F Bj o , k [L]. To prove Proposition 5.5, we need to introduce several lemmas. The following lemma shows how the supremum over a matrix can be transferred to a supremum over a vector. It is an extension of Lemma 1 in Golowich et al. (2018) from d = 1 to d N, and can be proved exactly by the arguments in Golowich et al. (2018). Lemma E.3. Let σ : R 7 R be a 1-Lipschitz continuous, positive-homogeneous activation function which is applied elementwise. Then for any vector-valued function class e F sup f e F,V Rh h : V F B j [n] ϵt,jσ(V f(xj)) 2 2 B2 sup f e F, v Rh : v 2 1 sup v 2 1 j [n] ϵt,jσ( v f(xj)) 2 . Proof. Let v 1 , . . . , v h be rows of matrix V , i.e., V = v1, . . . , vh . Then by the positive-homogeneous property of activation function we have j [n] ϵt,jσ(V f(xi)) 2 j [n] ϵt,jσ(v 1 f(xj)) ... P j [n] ϵt,jσ(v h f(xj)) j [n] ϵt,jσ(v r f(xj)) r [h] vr 2 2 X j [n] ϵt,jσ v r vr 2 f(xj) 2 r [h] vr 2 2 max r [h] j [n] ϵt,jσ v r vr 2 f(xj) 2 B2 sup v 2 1 j [n] ϵt,jσ( v f(xj)) 2 . Generalization Analysis for Contrastive Representation Learning The proof is completed. The following lemma gives a general contraction lemma for Rademacher complexities. It allows us to remove a nonlinear function ψ, which is very useful for us to handle the activation function in DNNs. Lemma E.4 (Contraction Lemma, Thm 11.6 in Boucheron et al. (2013)). Let τ : R+ 7 R+ be convex and nondecreasing. Suppose ψ : R 7 R is contractive in the sense |ψ(t) ψ( t)| |t t| and ψ(0) = 0. Then the following inequality holds for any e F Eϵ τ sup f e F i=1 ϵiψ f(xi) Eϵ τ sup f e F i=1 ϵif(xi) . The following lemma gives bounds of MGFs for a random variable Z = P 1 i 0 such that 0 min n sup f F i=1 ϵi gi(f(xi)), G j=1 ϵi,jfj(xi) o max n sup f F i=1 ϵi gi(f(xi)), G j=1 ϵi,jfj(xi) o B for all ϵ { 1}n. Let t R be an arbitrary number. Define gt : F 7 R by gt(f) = 0 for any f = 0 and gt(0) = t. It is clear that Eϵ { 1}n sup f F i=1 ϵi gi(f(xi)) = Eϵ { 1}n max n sup f F:f =0 i=1 ϵi gi(f(xi)) , t o Eϵ { 1}nd sup f F j=1 ϵi,jfj(xi) = Eϵ { 1}nd max n G 2 sup f F:f =0 j=1 ϵi,jfj(xi) , t o . Plugging the above identities into (F.2) with g = gt gives Eϵ { 1}n max n sup f F:f =0 i=1 ϵi gi(f(xi)) , t o Eϵ { 1}nd max n G 2 sup f F:f =0 j=1 ϵi,jfj(xi) , t o . If t 0, the above inequality is equivalent to Eϵ { 1}n max n sup f F i=1 ϵi gi(f(xi)) , t o Eϵ { 1}nd max n G j=1 ϵi,jfj(xi) , t o (F.5) by noting gi(0) = 0 for all i Nn. If t < 0, it follows from (F.2) with g(f) = 0 that Eϵ { 1}n max n sup f F i=1 ϵi gi(f(xi)) , t o = Eϵ { 1}n sup f F i=1 ϵi gi(f(xi)) Eϵ { 1}nd sup f F j=1 ϵi,jfj(xi) i = Eϵ { 1}nd max n G j=1 ϵi,jfj(xi) , t o , where we have used gi(0) = 0 for all i Nn in the first identity. That is, (F.5) holds for all t R. Subtracting t from both sides of Eq. (F.5) gives Eϵ { 1}n sup f F i=1 ϵi gi(f(xi)) t + Eϵ { 1}nd G j=1 ϵi,jfj(xi) t +, t R, (F.6) Generalization Analysis for Contrastive Representation Learning from which we know Eϵ { 1}n τ sup f F i=1 ϵi gi(f(xi)) Eϵ { 1}nd τ G j=1 ϵi,jfj(xi) , τ H[0,B]. 
According to Lemma F.4, we know $\tau : [0, B] \mapsto \mathbb{R}_+$ belongs to the closure of $H_{[0,B]}$. Therefore, Eq. (F.1) holds. The proof is completed.

## G. Lipschitz Continuity of Loss Functions

The following proposition is known in the literature (Lei et al., 2019). We prove it for completeness.

Proposition G.1. (a) Let $\ell$ be defined as in Eq. (3.2). Then $\ell$ is 1-Lipschitz continuous w.r.t. $\|\cdot\|_\infty$ and 1-Lipschitz continuous w.r.t. $\|\cdot\|_2$. (b) Let $\ell$ be defined as in Eq. (3.3). Then $\ell$ is 1-Lipschitz continuous w.r.t. $\|\cdot\|_\infty$ and 1-Lipschitz continuous w.r.t. $\|\cdot\|_2$.

Proof. We first prove Part (a). For any $v$ and $v'$, we have

$$|\ell(v)-\ell(v')| = \Big|\max\big\{0,\,1+\max_{i\in[k]}\{-v_i\}\big\} - \max\big\{0,\,1+\max_{i\in[k]}\{-v'_i\}\big\}\Big| \le \Big|\max_{i\in[k]}\{-v_i\}-\max_{i\in[k]}\{-v'_i\}\Big| \le \max_{i\in[k]}|v_i-v'_i| = \|v-v'\|_\infty,$$

where we have used the elementary inequality $|\max_{i\in[k]} a_i - \max_{i\in[k]} b_i| \le \max_{i\in[k]}|a_i-b_i|$. Since $\|v-v'\|_\infty \le \|v-v'\|_2$, this proves Part (a).

We now prove Part (b). It is clear that

$$\frac{\partial \ell(v)}{\partial v_i} = \frac{-\exp(-v_i)}{1+\sum_{i'\in[k]}\exp(-v_{i'})}.$$

Therefore, the $\ell_1$ norm of the gradient can be bounded as follows

$$\|\nabla \ell(v)\|_1 = \frac{1}{1+\sum_{i\in[k]}\exp(-v_i)}\sum_{i\in[k]}\exp(-v_i) \le 1.$$

It follows that $|\ell(v)-\ell(v')| \le \|v-v'\|_\infty$ for all $v, v'$, and since $\|\nabla\ell(v)\|_2 \le \|\nabla\ell(v)\|_1 \le 1$ we also have $|\ell(v)-\ell(v')| \le \|v-v'\|_2$. This proves Part (b). The proof is completed.
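As a quick numerical illustration of Proposition G.1 (a sketch of ours, not part of the paper), the following snippet samples random score vectors and checks that the hinge loss of Eq. (3.2) and the logistic loss of Eq. (3.3) change by no more than the $\ell_\infty$- and $\ell_2$-distance between their arguments. The loss definitions follow the equations above; the sample sizes and the choice of $k$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def hinge(v):
    # Eq. (3.2): max(0, 1 + max_i(-v_i))
    return max(0.0, 1.0 + np.max(-v))

def logistic(v):
    # Eq. (3.3): log(1 + sum_i exp(-v_i))
    return np.log1p(np.sum(np.exp(-v)))

k = 8
for loss in (hinge, logistic):
    for _ in range(2000):
        v, w = rng.normal(size=k), rng.normal(size=k)
        gap = abs(loss(v) - loss(w))
        # Proposition G.1: 1-Lipschitz w.r.t. both the sup-norm and the 2-norm
        assert gap <= np.linalg.norm(v - w, np.inf) + 1e-9
        assert gap <= np.linalg.norm(v - w, 2) + 1e-9

print("Lipschitz bounds of Proposition G.1 hold on all sampled pairs.")
```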