# Generalization Analysis for Contrastive Representation Learning

Yunwen Lei¹, Tianbao Yang², Yiming Ying³, Ding-Xuan Zhou⁴

Recently, contrastive learning has found impressive success in advancing the state of the art in solving various machine learning tasks. However, the existing generalization analysis is very limited or even not meaningful. In particular, the existing generalization error bounds depend linearly on the number $k$ of negative examples, while it was widely shown in practice that choosing a large $k$ is necessary to guarantee good generalization of contrastive learning in downstream tasks. In this paper, we establish novel generalization bounds for contrastive learning which do not depend on $k$, up to logarithmic terms. Our analysis uses structural results on empirical covering numbers and Rademacher complexities to exploit the Lipschitz continuity of loss functions. For self-bounding Lipschitz loss functions, we further improve our results by developing optimistic bounds which imply fast rates in a low noise condition. We apply our results to learning with both linear representation and nonlinear representation by deep neural networks, for both of which we derive Rademacher complexity bounds to get improved generalization bounds.

¹Department of Mathematics, The University of Hong Kong. ²Department of Computer Science and Engineering, Texas A&M University. ³Department of Mathematics and Statistics, State University of New York at Albany. ⁴School of Mathematics and Statistics, University of Sydney. Correspondence to: Yiming Ying. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

## 1. Introduction

The performance of machine learning (ML) models often depends largely on the representation of data, which motivates a resurgence of contrastive representation learning (CRL) to learn a representation function $f : \mathcal{X} \mapsto \mathbb{R}^d$ from unsupervised data (Chen et al., 2020; Khosla et al., 2020; He et al., 2020). The basic idea is to pull together similar pairs $(x, x^+)$ and push apart dissimilar pairs $(x, x^-)$ in an embedding space, which can be formulated as minimizing the following objective (Chen et al., 2020; Oord et al., 2018)

$$\mathbb{E}_{x,\, x^+,\, \{x_i^-\}_{i=1}^k}\Big[\log\Big(1 + \sum_{i=1}^k \exp\big(-f(x)^\top\big(f(x^+) - f(x_i^-)\big)\big)\Big)\Big],$$

where $k$ is the number of negative examples. The hope is that the learned representation $f(x)$ would capture the latent structure and be beneficial to other downstream learning tasks (Arora et al., 2019; Tosh et al., 2021a).
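To make the objective above concrete, here is a minimal NumPy sketch (ours, not from the paper) of its empirical counterpart: the logistic contrastive loss evaluated on $n$ anchor/positive pairs, each paired with $k$ negatives. The function name, array shapes, and the random toy data are illustrative assumptions; the normalization step simply enforces the boundedness condition $\|f(x)\|_2 \le R$ used later in the analysis, here with $R = 1$.

```python
import numpy as np

def contrastive_logistic_loss(f_x, f_pos, f_negs):
    """Empirical unsupervised error with the logistic contrastive loss.

    f_x, f_pos : (n, d) arrays holding the representations f(x_j) and f(x_j^+).
    f_negs     : (n, k, d) array holding the representations f(x_{ji}^-) of the
                 k negative examples attached to the j-th anchor.
    """
    # v[j, i] = f(x_j)^T (f(x_j^+) - f(x_{ji}^-)), an (n, k) array of score gaps
    v = np.einsum("nd,nkd->nk", f_x, f_pos[:, None, :] - f_negs)
    # logistic loss log(1 + sum_i exp(-v_i)), averaged over the n training tuples
    return np.log1p(np.exp(-v).sum(axis=1)).mean()


# toy usage with random unit-norm representations (so ||f(x)||_2 <= R = 1)
rng = np.random.default_rng(0)
n, k, d = 32, 8, 16

def unit(a):
    # project rows onto the Euclidean unit sphere
    return a / np.linalg.norm(a, axis=-1, keepdims=True)

loss = contrastive_logistic_loss(unit(rng.normal(size=(n, d))),
                                 unit(rng.normal(size=(n, d))),
                                 unit(rng.normal(size=(n, k, d))))
print(f"empirical unsupervised error: {loss:.4f}")
```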
CRL has achieved impressive empirical performance in advancing the state of the art in various domains such as computer vision (He et al., 2020; Caron et al., 2020; Chen et al., 2020) and natural language processing (Brown et al., 2020; Gao et al., 2021; Radford et al., 2021).

The empirical success of CRL motivates a natural question on theoretically understanding how the learned representation adapts to the downstream tasks, i.e., how would the generalization behavior of downstream ML models benefit from the representation function built from positive and negative pairs? Especially, how would the number of negative examples affect the learning performance?

Arora et al. (2019) provided an attempt to answer the above questions by developing a theoretical framework to study CRL. They first gave generalization bounds for a learned representation function in terms of Rademacher complexities. Then, they showed that this generalization behavior, measured by an unsupervised loss, guarantees the generalization behavior of a linear classifier in the downstream classification task. However, the generalization bounds there have a linear dependency on $k$, which would not be effective if $k$ is large. Moreover, this is not consistent with many studies which show that a large number of negative examples (Chen et al., 2020; Tian et al., 2020a; Hénaff et al., 2020; Khosla et al., 2020) is necessary for good generalization performance. For example, He et al. (2020) used 65536 negative examples in unsupervised visual representation learning, for which the existing analysis requires $n \gtrsim (65536)^2 d$ training examples to get non-vacuous bounds (Arora et al., 2019). Yuan et al. (2022) demonstrated the benefit of using all negative data for each anchor datum in CRL and proposed an efficient algorithm for optimizing a global contrastive loss. Therefore, the existing analysis does not fully explain the superior performance of CRL on downstream tasks already observed in many applications.

Table 1. Comparison between our generalization bounds and those in Arora et al. (2019) for the logistic loss. Here $d$ is the number of learned features. The notation $\widetilde{O}$ ignores log factors.

In this paper, we aim to further deepen our understanding of CRL by fully exploiting the Lipschitz continuity of loss functions. Our contributions are listed as follows.

1. We develop generalization error bounds for CRL. We consider three types of loss functions: $\ell_2$-Lipschitz loss, $\ell_\infty$-Lipschitz loss and self-bounding Lipschitz loss. For $\ell_2$-Lipschitz loss, we develop a generalization bound with a square-root dependency on $k$ by two applications of vector-contraction lemmas on Rademacher complexities, which improves the existing bound by a factor of $\sqrt{k}$ (Arora et al., 2019). For $\ell_\infty$-Lipschitz loss, we develop generalization bounds which do not depend on $k$, up to some logarithmic terms, by approximating the arguments of loss functions via expanding the original dataset by a factor of $k$ to fully exploit the Lipschitz continuity. For self-bounding Lipschitz loss, we develop optimistic bounds involving the training errors, which can imply fast rates under a low noise setting. All of our generalization bounds involve Rademacher complexities of feature classes, which preserve the coupling among different features.

2. We then apply our general results to two unsupervised representation learning problems: learning with linear features and learning with nonlinear features via deep neural networks (DNNs). For learning with linear features, we consider two regularization schemes, i.e., the $p$-norm regularizer and the Schatten-norm regularizer. For learning with nonlinear features, we develop Rademacher complexity and generalization bounds with a square-root dependency on the depth of DNNs. To this aim, we adapt the technique in Golowich et al. (2018) by using a different moment generating function to capture the coupling among different features.

3. Finally, we apply our results on representation learning to the generalization analysis of downstream classification problems, which outperforms the existing results by a factor of $k$ (ignoring a log factor).

The remaining parts of the paper are organized as follows. Section 2 reviews the related work, and Section 3 provides the problem formulation.
We give generalization bounds for CRL in Section 4 for three types of loss functions, which are then applied to learning with both linear and nonlinear features in Section 5. Conclusions are given in Section 6. 2. Related Work The most related work is the generalization analysis of CRL in Arora et al. (2019), where the authors developed generalization bounds for unsupervised errors in terms of Rademacher complexity of representation function classes. Based on this, they further studied the performance of linear classifiers on the learned features. In particular, they considered the mean classifier where the weight for a class label is the mean of the representation of corresponding inputs. A major result in Arora et al. (2019) is to show that the classification errors of the mean classifier can be bounded by the unsupervised errors of learned representation functions. This shows that the downstream classification task can benefit from a learned representation function with a low unsupervised error. The above work motivates several interesting theoretical study of CRL. Nozawa et al. (2020) studied CRL in a PACBayesian setting, which aims to learn a posterior distribution of representation functions. Nozawa et al. (2020) derived PAC-Bayesian bounds for the posterior distribution and applied it to get PAC-Bayesian bounds for the mean-classifier, which relaxes the i.i.d. assumption. Negative examples in the framework (Arora et al., 2019) are typically taken to be randomly sampled datapoints, which may actually have the same label of the point of interest. This introduces a bias in the objective function of CRL, which leads to performance drops in practice. Motivated by this, Chuang et al. (2020) introduced a debiased CRL algorithm by building an approximation of unbiased error in CRL, and developed generalization guarantees for the downstream classification. Ash et al. (2022) refines the connection in Arora et al. (2019) by removing the collision probability in the denominator, which motivates the discussion on selecting the optimal number of negative examples. Nozawa & Sato (2021) improves the analysis Arora et al. (2019) by giving a generic transfer theorem between unsupervised loss and supervised loss, which exhibits a coverage-collision trade-off due to the number of negative examples. Several researchers studied CRL from other perspectives. Lee et al. (2021) proposed to learn a representation function f to minimize E(X1,X2)[ X2 f(X1) 2 2], where X1, X2 are unlabeled input and pretext target. Under an approximate conditional independency assumption, the authors showed that a linear function based on the learned representation approximates the true predictor on downstream problems. In a generative modeling setup, Tosh et al. (2021b) proposed to learn representation functions by a landmark embedding procedure, which can reveal the underlying topic posterior information. Tosh et al. (2021a) studied CRL in a multiview setting with two views available for each datum. Under Generalization Analysis for Contrastive Representation Learning an assumption on the redundancy between the two views, the authors showed that low-dimensional representation can achieve near optimal downstream performance with linear models. Hao Chen et al. (2021) studied self-supervised learning from the perspective of spectral clustering based on a population augmentation graph, and proposed a spectral contrastive loss. 
They further developed generalization bounds for both representation learning and the downstream classification. This result is improved in a recent work by developing a guarantee to incorporate the representation function class (Saunshi et al., 2022). Wang et al. (2022) removes the assumption on the conditional independency, and proposes a guarantee on the downstream performance from the viewpoint of augmentation overlap. Bao et al. (2022) shows that the contrastive loss can be viewed as a surrogate objective of the downstream loss by building upper and lower bounds for downstream classification errors. There are also recent work on theoretical analysis of representation learning via gradient-descent dynamics (Lee et al., 2021; Tian et al., 2020b), mutual information (Tsai et al., 2020), alignment of representations (Wang & Isola, 2020), and causality (Mitrovic et al., 2020). CRL is related to metric learning, for which generalization bounds have been studied in the literature (Cao et al., 2016). 3. Problem Formulation Let X denote the space of all possible datapoints. In CRL, we are given several similar data in the form of pairs (x, x+) drawn from a distribution Dsim and negative data x 1 , x 2 , . . . , x k drawn from a distribution Dneg unrelated to x. Our aim is to learn a feature map f : X 7 Rd from a class of representation functions F = f : f( ) 2 R for some R > 0, where 2 denotes the Euclidean norm. Here d N denotes the number of features. We follow the framework in Arora et al. (2019) to define the distribution Dsim and the distribution Dneg. Let C denote the set of all latent classes and for each class c C we assume there is a probability distribution Dc over X, which quantifies the relevance of x to the class c. We assume there is a probability distribution ρ defined over C. Then we define Dsim(x, x+) and Dneg(x ) as follows Dsim(x, x+) = Ec ρ Dc(x)Dc(x+) , Dneg(x ) = Ec ρ Dc(x ) . Intuitively, Dsim(x, x+) measures the probability of x and x+ being drawn from the same class c ρ, while Dneg(x ) measures the probability of drawing an un-relevant x . Let (xj, x+ j ) Dsim and (x j1, . . . , x jk) Dneg, j [n] := {1, . . . , n}, where k denotes the number of negative exam- ples. We collect these training examples into a dataset S = n (x1, x+ 1 , x 11, . . . , x 1k), (x2, x+ 2 , x 21, . . . , x 2k), . . . , (xn, x+ n , x n1, . . . , x nk) o . (3.1) Given a representation function f, we can measure its performance by building a classifier based on this representation and computing the accuracy of the classifier. To this aim, we define a (K + 1)-way supervised task T consisting of distinct classes {c1, . . . , c K+1} C. The examples for this supervised task are drawn by the following process: We first draw a label c T = {c1, . . . , c K+1} from a distribution DT over T , after which we draw an example x from Dc. This defines the following distribution over labeled pairs (x, c): DT (x, c) = Dc(x)DT (c). Since there is a label for each example, we can build a multi-class classifier g : X 7 RK+1 for T , where gc(x) measures the likelihood of assigning the class label c to the example x. The loss of g on a point (x, y) X T can be measured by ℓs g(x)y g(x)y }y =y , where ℓs : RK 7 R+. We quantify the performance of a classifier g on the task T by the supervised loss. By minimizing the supervised loss, we want to build a classifier whose component associated to the correct label is largest. Definition 3.1 (Supervised loss). Let g : X 7 RK+1 be a multi-class classifier. 
The supervised loss of g is defined as Lsup(T , g) := E(x,c) DT h ℓs g(x)c g(x)c For CRL, we often consider g as a linear classifier based on the learned representation f, i.e., g(x) = Wf(x), where W R(K+1) d. Then the performance of the representation function f(x) can be quantified by the accuracy of the best linear classifier on the representation f(x): Lsup(T , f) = min W R(K+1) d Lsup(T , Wf). To find a good representation f based on unsupervised dataset S, we need to introduce the concept of unsupervised loss functions. Let ℓ: Rk 7 R+ be a loss function for which popular choices include the hinge loss ℓ(v) = max 0, 1 + max i [k]{ vi} (3.2) and the logistic loss ℓ(v) = log 1 + X i [k] exp( vi) . (3.3) Let f(x) denote the transpose of f(x). Definition 3.2 (Unsupervised error). The population unsupervised error is defined as Lun(f) := E ℓ f(x) (f(x+) f(x i )) k i=1 . Generalization Analysis for Contrastive Representation Learning The empirical unsupervised error with S is defined as ˆLun(f) := 1 j=1 ℓ f(xj) (f(x+ j ) f(x ji)) k i=1 . A natural algorithm is to find among F the function with the minimal empirical unsupervised loss, i.e., ˆf := arg minf F ˆLun(f). This function can then be used for the downstream supervised learning task, e.g., to find a linear classifier g(x) = Wf(x) indexed by W R(K+1) d. 4. Generalization Error Bounds In this paper, we are interested in the performance of ˆf on testing, i.e., how the empirical behavior of ˆf on S would generalize well to testing examples. Specifically, we will control Lun( ˆf) ˆLun( ˆf). Since ˆf depends on the dataset S, we need to control the uniform deviation between population unsupervised error and empirical unsupervised error over the function class F, which depends on the complexity of F. In this paper, we will use Rademacher complexity to quantify the complexity of F (Bartlett & Mendelson, 2002). Definition 4.1 (Rademacher Complexity). Let e F be a class of real-valued functions over a space Z and e S = {zi}n i=1 Z. The empirical Rademacher complexity of e F with respect to (w.r.t.) e S is defined as Re S( e F) = Eϵ supf e F 1 n P i [n] ϵif(zi) , where ϵ = (ϵi)i [n] { 1}n are independent Rademacher variables. We define the worst-case Rademacher complexity as RZ,n( e F) = supe S Z:|e S|=n Re S( e F), where |e S| is the cardinality of e S. For any f F, we introduce gf : X k+2 7 R as follows gf(x, x+, x 1 , . . . , x k ) = ℓ f(x) f(x+) f(x i ) k i=1 . It is then clear that Lun(f) ˆLun(f) = Ex,x+,x 1 ,...,x k gf(x, x+, x 1 , . . . , x k ) j [n] gf(xj, x+ j , x j1, . . . , x jk). Results in learning theory show that we can bound Lun( ˆf) ˆLun( ˆf) by RS(G) (Bartlett & Mendelson, 2002), where G = (x, x+, x 1 , . . . , x k ) 7 gf(x, x+, x 1 , . . . , x k ) : f F . Note functions in G involve the nonlinear function ℓ: Rk 7 R+, which introduces difficulties in the complexity analysis. Our key idea is to use the Lipschitz continuity of ℓto reduce the complexity of G to the complexity of another function class without ℓ. Since the arguments in ℓare vectors, we can have different definition of Lipschitz continuity w.r.t. different norms (Lei et al., 2015; Tewari & Chaudhuri, 2015; Lei et al., 2019; Foster & Rakhlin, 2019; Mustafa et al., 2022). For any a = (a1, . . . , ak) Rk and p 1, we define the ℓp-norm as a p = Pn i=1 |ai|p 1 Definition 4.2 (Lipschitz continuity). We say ℓ: Rk 7 R+ is G-Lipschitz w.r.t. the ℓp-norm iff |ℓ(a) ℓ(a )| G a a p, a, a Rk. In this paper, we are particularly interested in the Lipschitz continuity w.r.t. 
either the ℓ2-norm or the ℓ -norm. According to Proposition G.1, the loss functions defined in Eq. (3.2) and Eq. (3.3) are 1-Lipschitz continuous w.r.t. , and 1-Lipschitz continuous w.r.t. 2 (Lei et al., 2019). Note each component of the arguments in ℓare of the form f(x) (f(x+) f(x )). This motivates the definition of the following function class H = n hf(x, x+, x ) = f(x) f(x+) f(x ) : f F o . As we will see in the analysis, the complexity of G is closely related to that of H. Therefore, we first show how to control the complexity of H. In the following lemma, we provide Rademacher complexity bounds of H w.r.t. a general dataset S of cardinality n. We will use a vector-contraction lemma to prove it (Maurer, 2016). The basic idea is to notice the Lipschitz continuity of the map (x, x+, x ) 7 x (x+ x ) w.r.t. 2 on X 3. The proof is given in Section B. Lemma 4.3. Let n N and S = {(xj, x+ j , x j ) : j [n]}. Assume f(x) 2 R for any f F and x S . Then n Eϵ { 1}n { 1}d { 1}3 h sup f F ϵj,t,1ft(xj)+ϵj,t,2ft(x+ j )+ϵj,t,3ft(x j ) i , where ft(x) is the t-th component of f(x) Rd. Remark 4.4. We compare Lemma 4.3 with the following Rademacher complexity bound in Hao Chen et al. (2021) Eϵ { 1}n h sup f F j [n] ϵjf(xj) f(x+ j ) i d max t [d] Eϵ { 1}n h sup ft Ft j [n] ϵjft(xj)ft(x+ j ) i , (4.1) where Ft = x 7 ft(x) : f F . As a comparison, our analysis in Lemma 4.3 can imply the following bound Eϵ { 1}n h sup f F j [n] ϵjf(xj) f(x+ j ) i 2REϵ { 1}2nd h sup f F ϵj,t,1ft(xj)+ϵj,t,2ft(x+ j ) i . Eq. (4.1) decouples the relationship among different features since the maximization over t [d] is outside of the Generalization Analysis for Contrastive Representation Learning expectation operator. As a comparison, our result preserves this coupling since the summation over t [d] is inside the supermum over f F. This preservation of coupling has an effect on the bound. Indeed, it is expected that Eϵ { 1}2nd h sup f F ϵj,t,1ft(xj)+ϵj,t,2ft(x+ j ) i = d Eϵ { 1}2n h sup ft Ft ϵj,1ft(xj)+ϵj,2ft(x+ j ) i . In this case, our result implies a bound with a better dependency on d as compared to Eq. (4.1) (the factor of d in Eq. (4.1) is replaced by d here). We can plug our bound into the analysis in Hao Chen et al. (2021) to improve their results. In Section A we will give a specific example where our bound can outperform Eq. (4.1) by a factor of d. Remark 4.5. Lemma 4.3 requires an assumption f(x) 2 R. This assumption can be achieved by adding a projection operator as f(x) = PR( f(x)) for f F, where PR denotes the projection operator onto the Euclidean ball with radius R around the zero point. According to the inequality PR( f(x)) PR( f (x)) 2 f(x) f (x) 2, the arguments in the proof indeed show the following inequality with H = h f(x, x+, x ) = PR( f(x)) PR( f(x+)) PR( f(x )) : f F : n Eϵ { 1}3nd h sup f F t [d] ϵj,t,1 ft(xj) + ϵj,t,2 ft(x+ j ) + ϵj,t,3 ft(x j ) i . That is, we can add a projection operator over F to remove the assumption f(x) 2 R. Loss Arora et al. 19 Ours 1-ℓ2-Lipschitz A n 1-ℓ -Lipschitz S.B. 1-Lipschitz n 1+n 2k 1C2+(n 1 Table 2. Comparison between our generalization bounds and those in Arora et al. (2019). The notation means we ignore log factors. S.B. means self-bounding. The notations A, B and C are defined in Eq. (4.2), (4.4) and (4.5), which are typically of the same order. Then, our results improve the bounds in Arora et al. (2019) by a factor of k for ℓ2-Lipschitz loss, and by a factor of k for ℓ - Lipscthiz loss. For self-bounding loss, we get optimistic bounds. 4.1. 
ℓ2 Lipschitz Loss We first consider the ℓ2 Lipschitz loss. The following theorem to be proved in Section B gives Rademacher complexity and generalization error bounds for unsupervised loss function classes. We always assume ℓ f(x) (f(x+) f(x i )) k i=1 B for any f F in this paper. Theorem 4.6 (Generalization bound: ℓ2-Lipschitz loss). Assume f(x) 2 R for any f F and x X. Let S be defined as in Eq. (3.1). If ℓ: Rk 7 R+ is G2-Lipschitz w.r.t. the ℓ2-norm, then RS(G) A = E{ϵ} { 1}3nkd E h sup f F ϵj,i,t,1ft(xj) + ϵj,i,t,2ft(x+ j ) + ϵj,i,t,3ft(x ji) i . (4.2) Furthermore, for any δ (0, 1), with probability at least 1 δ the following inequality holds for any f F Lun(f) ˆLun(f) 4 Remark 4.7. Under the same Lipschitz continuity w.r.t. 2, the following bound was established in Arora et al. (2019) Lun(f) = ˆLun(f) + O G2R (4.3) where B = Eϵ { 1}n { 1}d { 1}k+2 h sup f F t [d] (4.4) ϵj,t,k+1ft(xj) + ϵj,t,k+2ft(x+ j ) + X i [k] ϵj,t,kft(x ji) i . Note A and B are of the same order. Indeed, the dominating term in the braces of the above equation is P i [k] ϵj,t,kft(x ji) and therefore we have B Eϵ h sup f F i [k] ϵj,t,kft(x ji) i . Furthermore, A grows also in this order since ϵj,i,t,1ft(xj)+ ϵj,i,t,2ft(x+ j ) + ϵj,i,t,3ft(x ji) is of the same order of ϵj,i,t,3ft(x ji). Typically, A B nkd since there are O(nkd) terms in the summation inside the supremum. In this case, Theorem 4.6 implies a bound O( p kd/n), while Eq. (4.3) gives a bound O(k d/ n). It is clear our bound improves the bound in Arora et al. (2019) by a factor of 4.2. ℓ Lipschitz Loss We now turn to the analysis for the setting with ℓ Lipschitz continuity assumption, which is more challenging. The following theorem controls the Rademacher complexity of G w.r.t. the dataset S in terms of the worst-case Rademacher Generalization Analysis for Contrastive Representation Learning complexity of H defined on the set SH, where SH = n (x1, x+ 1 , x 11), (x1, x+ 1 , x 12), . . . , (x1, x+ 1 , x 1k) | {z } induced by the first example (x2, x+ 2 , x 21), (x2, x+ 2 , x 22), . . . , (x2, x+ 2 , x 2k) | {z } induced by the second example (xn, x+ n , x n1), (xn, x+ n , x n2), . . . , (xn, x+ n , x nk) | {z } induced by the last example As compared to G, the function class H removes the loss function ℓand is easier to handle. Our basic idea is to exploit the Lipschitz continuity of ℓw.r.t. : to approximate the function class {ℓ(v1(y), . . . , vk(y))}, it suffices to approximate each component vj(y), j [k]. This explains why we expand the set S of cardinality n to the set SH of cardinality nk. The proof is given in Section C. Theorem 4.8 (Complexity bound: ℓ -Lipschitz loss). Assume f(x) 2 R for any f F and x X. Let S be defined as in Eq. (3.1). If ℓ: Rk 7 R+ is G-Lipschitz w.r.t. the ℓ -norm, then RS(G) 24G(R2 + 1)n 1 k RSH,nk(H) 1 + log(4R2n 3 2 k) l log2 R2 n where RSH,nk(H) = max ( xj, x+ j , x j ) j [nk] SH Eϵ { 1}nk j [nk] ϵjf( xj) (f( x+ j ) f( x j )) i . Note in RSH,nk(H) we restrict the domain of functions in H to SH, and allow an element in SH to be chosen several times in the above maximization. We can use Lemma 4.3 to control RSH,nk(H) in Theorem 4.8, and derive the following generalization error bound. The proof is given in Section C. Theorem 4.9 (Generalization bound: ℓ -Lipschitz loss). Let ℓ: Rk 7 R+ be G-Lipschitz continuous w.r.t. . Assume f(x) 2 R, δ (0, 1). 
Then with probability at least 1 δ over S for all f F we have Lun(f) ˆLun(f)+3B 2n +48G(R2+1)n 1 1 + log(4R2n 3 2 k) l log2 R2 n C = max {( xj, x+ j , x j )}nk j=1 SH Eϵ { 1}nk { 1}d { 1}3 (4.5) ϵj,t,1ft( xj) + ϵj,t,2ft( x+ j ) + ϵj,t,3ft( x j ) i . Remark 4.10. We now compare our bound with Eq. (4.3) developed in Arora et al. (2019). It is reasonable to assume C and B are of the same order. 1 Then, our bound becomes Lun(f) = ˆLun(f)+O GRB log2(n Rk) (4.6) We know if ℓis G2-Lipschitz continuous w.r.t. 2, it is also k G2-Lipschitz continuous w.r.t. . Therefore, in the extreme case we have G = k G2. Even in this extreme case, our bound is of the order Lun(f) = ˆLun(f)+ O G2RB log2(n Rk) n , which improves Eq. (4.3) by a factor of k up to a logarithmic factor. For popular loss functions defined in Eq. (3.2) and Eq. (3.3), we have G = G2 = 1 and in this case, our bound in Eq. (4.6) improves Eq. (4.3) by a factor of k if we ignore a logarithmic factor. Remark 4.11. we now give the intuition of our improvements. Arora et al. (2019) considers the 1-Lipschitz continuity of ℓ: Rk 7 R w.r.t. 2, while we use the 1-Lipschitz continuity of ℓw.r.t. . Note that 1-Lipschitz continuity w.r.t. 2 is a weaker condition as compared to the 1Lipschitz continuity w.r.t. . Indeed, if ℓis 1-Lipschitz continuous w.r.t. 2, then it is also k-Lipschitz continuous w.r.t. with a much larger Lipschitz constant. In our problem, the contrastive loss is 1-Lipschitz continuous w.r.t. both 2 and . Therefore, we can use the stronger assumption on the 1-Lipschitz continuity w.r.t. to save a factor of k. Furthermore, Arora et al. (2019) use the inequality J 2 J F , where J Rk (k+2)d, 2 is the spectral norm and F is the Frobenius norm. The inequality introduces an additional factor of k since J F can be as large as k J 2. As a comparison, our analysis based on Lipschitz continuity w.r.t. does not introduce any loss in the factor of k (up to a log term), and outperforms Arora et al. (2019) by a factor of k. 4.3. Self-bounding Lipschitz Loss Finally, we consider a self-bounding Lipschitz condition where the Lipschitz constant depends on the loss function values. This definition was given in Reeve & Kaban (2020). Definition 4.12 (Self-bounding Lipschitz Continuity). A loss function ℓ: Rk 7 R+ is said to be Gs-self-bounding Lipschitz continuous w.r.t. ℓ norm if for any a, a Rk ℓ(a) ℓ(a ) Gs max ℓ(a), ℓ(a ) 1 It was shown that the logistic loss given in Eq. (3.3) satisfies the self-bounding Lipschtiz continuity with Gs = 2 (Reeve 1Indeed, under a typical behavior of Rademacher complexity as Eϵ { 1}n supa A Rn ϵiai = O( n) (Bartlett & Mendelson, 2002), we have C = O( nkd) and B = O( Generalization Analysis for Contrastive Representation Learning & Kaban, 2020). In the following theorem, we give generalization bounds for learning with self-bounding Lipschitz loss functions. The basic idea is to replace the Lipschitz constant G in Theorem 4.9 with empirical errors by using the self-bounding property. We use e O to hide logarithmic factors. The proof is given in Section D. Theorem 4.13 (Generalization bound: self-bounding Lipschitz loss). Let ℓ: Rk 7 R+ be Gs-self-bounding Lipschitz continuous w.r.t. . Assume f(x) 2 R, δ (0, 1). Then with probability at least 1 δ over S we have the following inequality uniformly for all f F Lun(f)= ˆLun(f)+e O (B+G2 s R4)n 1+G2 s R2n 2k 1C2 B+Gs R2)n 1 2 +Gs Rn 1k 1 1 2un(f) log 1 2 (1/δ). Remark 4.14. 
Theorem 4.13 gives optimistic generalization bounds in the sense that the upper bounds depend on empirical errors (Srebro et al., 2010). Therefore, the generalization bounds for Lun(f) ˆLun(f) would benefit from low training errors. In particular, if ˆL 1 2un(f) = 0, Theorem 4.13 implies generalization bounds Lun(f) = ˆLun(f)+e O Bn 1+G2 s R4n 1+G2 s R2n 2k 1C2 . Typically, we have C = O( nk) and in this case Lun(f) = ˆLun(f) + e O Bn 1 + G2 s R4n 1 . In other words, we get fast-decaying error bounds in an interpolating setting. 5. Applications To apply Theorem 4.6 and Theorem 4.9, we need to control the term A or C, which is related to the Rademacher complexity of a function class. In this section, we will show how to control C for features of the form x 7 Uv(x), where U Rd d is a matrix and v : X 7 Rd . Here v maps the original data x X to an intermediate feature in Rd , which is used for all the final features. If v is the identity map, then we get linear features. If v is a neural network, then we get nonlinear features. For a norm on a matrix, we denote by its dual norm. The following lemmas to be proved in Section E give general results on Rademacher complexities. Lemma 5.1 gives upper bounds, while Lemma 5.2 gives lower bounds. It is immediate to extend our analysis to control A. For brevity we ignore such a discussion. Lemma 5.1 (Upper bound). Let d, d N. Let V be a class of functions from X to Rd . Let F = {f(x) = Uv(x) : U U, v V}, where U = U = (u1, . . . , ud) Rd d : U Λ and f(x) = Uv(x) = (u1, . . . , ud) v(x). Then Eϵ { 1}nd sup f F j [n] ϵj,tft(xj) ΛEϵ { 1}nd sup v V j [n] ϵ1,jv(xj), . . . , X j [n] ϵd,jv(xj) . Lemma 5.2 (Lower bound). If F is symmetric in the sense that f F implies f F, then we have 2 sup f F max ( x, x+, x ) SH f( x) 2 2+ f( x+) 2 2+ f( x ) 2 2 1 Note in our definition of F, we ignore the projection operator, i.e., the feature function class should be of the form f(x) = PR Uv(x)) to satisfy the assumption f(x) 2 R. According to Remark 4.5, it is easy to extend our analysis here to the case with including the projection operator in the definition of feature function class. 5.1. Linear Features We first apply Lemma 5.1 to derive Rademacher complexity bounds for learning with linear features. For any p 1 and a matrix W = (w1, . . . , wd ) Rd d , the ℓ2,p norm of W is defined as W 2,p = P i [d ] wi p 2 1 p . If p = 2, this becomes the Frobenius norm W F . For any p 1, the Schatten-p norm of a matrix W Rd d is defined as the ℓp-norm of the vector of singular values (σ1(W), . . . , σmin{d,d }(W)) (the singular values are assumed to be sorted in non-increasing order), i.e., W Sp := σ(W) p. Let p be the number satisfying 1/p+1/p = 1. The following proposition to be proved in Section E.1 gives complexity bounds for learning with linear features. Proposition 5.3 (Linear representation). Consider the feature map defined in Lemma 5.1 with v(x) = x. (a) If = 2,p, then Eϵ sup f F j [n] ϵt,jft(xj) n Λd1/q max( p q 1, 1) o X j [n] xj 2 2 1 (b) If = Sp with p 2, then j [n] ϵt,jft(xj) Λ2 1 4 min q [p,2] j [n] xjx j 1 2 Sq , d1/q X j [n] xj 2 2 1/2o . We now plug the above Rademacher complexity bounds into Theorem 4.9 to give generalization error bounds for learning with unsupervised loss. Let Bx = max{ xj 2, x+ j 2, x jt 2 : j [n], t [k]}. 
Note P j [nk] xj 2 2 1 nk Bx for xj in the definition of C, from which and Proposition 5.3 we get the following bound for the case v(x) = x (the definition of C involves nk examples, while in Proposition 5.3 we consider n examples): nk Bx min q p Λd1/q max( p The following corollary then follows from Theorem 4.9. Generalization Analysis for Contrastive Representation Learning Corollary 5.4. Consider the feature map in Proposition 5.3 with = 2,p. Let ℓbe the logistic loss and δ (0, 1). Then with probability at least 1 δ, for all f F Lun(f) ˆLun(f) = B log 1 2 (1/δ) n + e O GRBx minq p Λd1/q max( q 1, 1) It is also possible to give generalization bounds for learning with ℓ2-Lipschitz loss functions, and optimistic generalization bounds for learning with self-bounding Lipschitz loss functions. We omit the discussion for brevity. 5.2. Nonlinear Features We now consider Rademacher complexity for learning with nonlinear features by DNNs. The following lemma to be proved in Section E.2 gives Rademacher complexity bounds for learning with features by DNNs. We say an activation σ : R 7 R is positive-homogeneous if σ(ax) = aσ(x) for a 0, contractive if |σ(x) σ(x )| |x x |. The Re LU activation function σ(x) = max{x, 0} is both positivehomogeneous and contractive. Proposition 5.5 (Nonlinear representation). Consider the feature map defined in Lemma 5.1 with = F and V = n x 7 v(x) = σ VLσ VL 1 σ(V1x) : Vl F Bl, l [L] o , where σ is positive-homogeneous, contractive and σ(0) = 0, and L is the number of layers. Then j [n] ϵt,jft(xj) dΛBLBL 1 B1 1 i 0 and p 1, the empirical ℓp-norm covering number Np(ϵ, e F, e S) with respect to e S is defined as the smallest number m of a collection of vectors v1, . . . , vm {(f(z1), . . . , f(zn)) : f e F} such that sup f e F min j [m] i [n] |f(zi) vj i |p 1 where vj i is the i-th component of the vector vj. In this case, we call {v1, . . . , vm} an (ϵ, ℓp)-cover of e F with respect to e S. Definition C.2 (Fat-Shattering Dimension). Let e F be a class of real-valued functions defined over a space e Z. We define the fat-shattering dimension fatϵ( e F) at scale ϵ > 0 as the largest D N such that there exist D points z1, . . . , z D Z and witnesses s1, . . . , s D R satisfying: for any δ1, . . . , δD { 1} there exists f e F with δi(f(zi) si) ϵ/2, i [D]. To prove Theorem 4.8, we need to introduce the following lemma on Rademacher complexity, fat-shattering dimension and covering numbers. Part (a) shows that the covering number can be bounded by fat-shattering dimension (see, e.g., Theorem 12.8 in Anthony & Bartlett (1999)). Part (b) shows that the fat-shattering dimension can be controlled by the worst-case Rademacher complexity, which was developed in Srebro et al. (2010). Part (c) is a discretization of the chain integral to control Rademacher complexity by covering numbers (Srebro et al., 2010), which can be found in Guermeur (2017). Let e be the base of the natural logarithms. Lemma C.3. Let e S := {z1, . . . , zn} Z. Let e F be a class of real-valued functions defined over a space e Z. (a) If functions in e F take values in [ B, B], then for any ϵ > 0 with fatϵ( e F) < n we have log N (ϵ, e F, e S) 1 + fatϵ/4( e F) log2 4e Bn ϵfatϵ/4( e F) (b) For any ϵ > R e Z,n( e F), we have fatϵ( e F) < 4n ϵ2 R2 e Z,n( e F). (c) Let (ϵj) j=0 be a monotone sequence decreasing to 0 and any (a1, . . . , an) Rn. If ϵ0 q n 1 supf e F Pn i=1 f(zi) ai 2, then for any non-negative integer N we have Re S( e F) 2 log N (ϵj, e F, e S) n + ϵN. 
(C.1) According to Part (a) of Lemma C.3, the following inequality holds for any ϵ (0, 2B] (the case fatϵ/4( e F) = 0 is trivial since in this case we have N (ϵ, e F, e S) = 1, and otherwise we have fatϵ/4( e F) 1) log N (ϵ, e F, e S) 1 + fatϵ/4( e F) log2 2 8e B2|e S| We follow the arguments in Lei et al. (2019) to prove Theorem 4.8. Proof of Theorem 4.8. We first relate the empirical ℓ -covering number of F w.r.t. S = {(xj, x+ j , x j1, x j2, . . . , x jk) : j [n]} to the empirical ℓ -covering number of H w.r.t. SH. Let n rm = rm 1,1, rm 1,2, . . . , rm 1,k, . . . , rm n,1, rm n,2, . . . , rm n,k : m [N] o be an (ϵ/G, ℓ )-cover of H w.r.t. SH. Recall that hf(x, x+, x ) = f(x) f(x+) f(x ) . (C.3) Then, by the definition of ℓ -cover we know for any f F we can find m [N] such that max j [n] max i [k] hf(xj, x+ j , x ji) rm j,i ϵ/G. Generalization Analysis for Contrastive Representation Learning By the Lipschitz continuity of ℓ, we then get ℓ f(xj) f(x+ j ) f(x ji) k i=1 ℓ {rm j,i}k i=1 G f(xj) f(x+ j ) f(x ji) k i=1 rm j,i k i=1 = G hf(xj, x+ j , x ji) k i=1 rm j,i k i=1 Gϵ/G = ϵ. This shows that ℓ {rm 1,i}k i=1 , ℓ {rm 2,i}k i=1 , . . . , ℓ {rm n,i}k i=1 : m [N] is an (ϵ, ℓ )-cover of G w.r.t. S and therefore N (ϵ, G, S) N (ϵ/G, H, SH). (C.4) Since we consider empirical covering number of F w.r.t. S, we can assume functions in H are defined over SH. For simplicity, we denote Rnk(H) := RSH,nk(H). We now control N (ϵ/G, H, SH) by Rademacher complexities of H. For any ϵ > 2Rnk(H), it follows from Part (b) of Lemma C.3 that fatϵ(H) 4nk ϵ2 R2 SH,nk(H) nk. (C.5) Note for any f F, we have f(x) (f(x+) f(x )) [ 2R2, 2R2]. It then follows from Eq. (C.2) and Eq. (C.5) that (replace B by 2R2) log N (ϵ, H, SH) 1 + fatϵ/4(H) log2(32e R4nk/ϵ2) 1 + 64nk R2 nk(H) ϵ2 log2(32e R4nk/ϵ2), ϵ (0, 4R2]. We can combine the above inequality and Eq. (C.4) to derive the following inequality for any 2GRnk(H) ϵ 4GR2 log N (ϵ, G, S) 1 + 64nk G2R2 nk(H) ϵ2 log2(32e R4G2nk/ϵ2). (C.6) Let ϵN = 24G max k Rnk(H), n 1 ϵj = 2N jϵN, j = 0, . . . , N 1, N = log2 2GR2 k Rnk(H), n 1 It is clear from the definition that ϵ0 2GR2 ϵ0/2. The Lipschitz continuity of ℓimplies ℓ(({hf(x, x+, x i )}i [k])) ℓ((0, 0, . . . , 0)) G hf(x, x+, x ) 2R2G. According to the above inequality and Part (c) of Lemma C.3, we know (note ϵN 2GRnk(H) and therefore Eq. (C.6) holds for ϵ = ϵj, j = 1, . . . , N) j=1 (ϵj + ϵj 1) log N (ϵj, G, S) j=1 (ϵj + ϵj 1) + 16G nk Rnk(H) n (ϵj + ϵj 1) log(32e R4G2nk/ϵ2 j) ϵj + ϵN 2 + ϵN + 48G nk Rnk(H) n j=1 log(32e R4G2nk/ϵ2 j) 2 + ϵN + 48GN k Rnk(H) log(32e R4G2nk) + 48G j=1 log(1/ϵ2 j). Generalization Analysis for Contrastive Representation Learning According to the definition of ϵk, we know j=1 log(1/ϵ2 j) = j=1 log(22j/ϵ2 0) = j=1 log(1/ϵ2 0) + log 4 j=1 j = N log(1/ϵ2 0) + N(N + 1) log 4 = N log 1/ϵ2 0 + (N + 1) log 2 = N log 2N+1/ϵ2 0 = N log 1 = N log n 24G2R2 . We can combine the above two inequalities together to get RS(G) 24GR2n 1 2 + ϵN + 48NG k Rnk(H) log(32e R4G2nk) + log n 24G2R2 2 + ϵN + 48G k Rnk(H) log(32e R4G2nk) + log n 24G2R2 l log2 2GR2 n 2 + ϵN + 48G k Rnk(H) log(4R2n 3 2 k) l log2 R2 n where we have used the definition of N and 32e/24 4. The proof is completed. Proof of Theorem 4.9. 
Applying Lemma B.2, with probability at least 1 δ the following inequality holds with probability at least 1 δ Lun(f) ˆLun(f) + 2RS(G) + 3B According to Theorem 4.8 and Lemma 4.3, we know RS(G) 48G(R2 + 1)n 1 k Rnk(H) 1 + log(R2n 3 2 k) l log2 R2 n 48G(R2 + 1)n 1 1 + log(R2n 3 2 k) l log2 R2 n We can combine the above two inequalities together and derive the stated bound. The proof is completed. D. Proof of Theorem 4.13 To prove Theorem 4.13, we introduce the following lemma on generalization error bounds in terms of local Rademacher complexities (Reeve & Kaban, 2020). Lemma D.1 (Reeve & Kaban 2020). Consider a function class G of functions mapping Z to [0, b]. For any e S = {zi : i [n]} and g G, let ˆEe S[g] = 1 i [n] g(zi). Assume for any e S Zn and r > 0, we have Re S {g G : ˆEe S[g] r} ϕn(r), where ϕn : R+ 7 R+ is non-decreasing and ϕn(r)/ r is non-increasing. Let ˆrn be the largest solution of the equation ϕn(r) = r. For any δ (0, 1), with probability at least 1 δ the following inequality holds uniformly for all g G Ez[g(z)] ˆEe S[g] + 90(ˆrn + r0) + 4 q ˆEe S[g](ˆrn + r0), where r0 = b log(1/δ) + 6 log log n /n. Proof of Theorem 4.13. For any r > 0, we define Fr as a subset of F with the empirical error less than or equal to r Fr = n f F : 1 j [n] gf(xj, x+ j , x j1, . . . , x jk) r o . Let n rm = rm 1,1, rm 1,2, . . . , rm 1,k, . . . , rm n,1, rm n,2, . . . , rm n,k : m [N] o Generalization Analysis for Contrastive Representation Learning 2r Gs), ℓ )-cover of Hr := hf H : f Fr w.r.t. SH. Then, by the definition of ℓ -cover we know for any f Fr we can find m [N] such that max j [n] max i [k] hf(xj, x+ j , x ji) rm j,i ϵ/( According to the self-bounding Lipschitz continuity of ℓ, we know ℓ f(xj) f(x+ j ) f(x ji) k i=1 ℓ {rm j,i}k i=1 2 j [n] max n ℓ f(xj) f(x+ j ) f(x ji) k i=1 , ℓ {rm j,i}k i=1 o f(xj) f(x+ j ) f(x ji) k i=1 rm j,i k i=1 2 ℓ f(xj) f(x+ j ) f(x ji) k i=1 + ℓ {rm j,i}k i=1 hf(xj, x+ j , x ji) k i=1 rm j,i k i=1 2 2G2 srϵ2/(2r G2 s) = ϵ2, where we have used the following inequalities due to the definition of Fr j [n] ℓ f(xj) f(x+ j ) f(x ji) k i=1 r, 1 n j [n] ℓ {rm j,i}k i=1 r. Therefore, we have N2(ϵ, Gr, S) N (ϵ/( 2r Gs), Hr, SH), where Gr = {gf G : f Fr}. Analyzing analogously to the proof of Theorem 4.8, we get (replacing G there by 2r Gs(R2 + 1)n 1 k RSH,nk(H) 1 + log(4R2n 3 2 k) l log2 R2 n m := ψn(r). Let ˆrn be the point satisfying ˆrn = ψn(ˆrn): 2ˆrn Gs(R2 + 1)n 1 k RSH,nk(H) 1 + log(4R2n 3 2 k) l log2 R2 n from which we get ˆrn = e O G2 s R4n 1 + G2 sk R2 SH,nk(H) . We can apply Lemma D.1 to get the following inequality with probability at least 1 δ uniformly for all f F Lun(f) = ˆLun(f)+ e O Bn 1+G2 s R4n 1+G2 sk R2 SH,nk(H) + e O 2 +Gs R2n 1 k RSH,nk(H) ˆL We can apply Lemma 4.3 to control RSH,nk(H) and derive the following bound Lun(f) = ˆLun(f) + e O Bn 1 + G2 s R4n 1 + G2 s R2n 2k 1C2 + e O 2 + Gs R2n 1 2 + Gs Rn 1k 1 The proof is completed. E. Proof on Rademacher Complexities We first prove Rademacher complexity bounds for feature spaces in Lemma 5.1, and then give lower bounds. Finally, we will apply it to prove Proposition 5.3 and Proposition 5.5. Proof of Lemma 5.1. Let U = (u1, . . . , ud) and VS = (v(x1), . . . , v(xn)). Then it is clear u 1 v(x1) u 1 v(xn) ... ... ... u d v(x1) u d v(xn) Generalization Analysis for Contrastive Representation Learning ϵ1,1 ϵ1,n ... ... ... ϵd,1 ϵd,n Then we have X j [n] ϵt,ju t v(xj) = Mϵ, UVS = trace(M ϵ UVS) = trace(UVSM ϵ ) = U , VSM ϵ U VSM ϵ , where trace denotes the trace of a matrix. 
Therefore, we have j [n] ϵt,jft(xj) = sup U U,v V j [n] ϵt,ju t v(xj) Λ sup v V VSM ϵ = Λ sup v V j [n] ϵ1,jv(xj), . . . , X j [n] ϵd,jv(xj) . The proof is completed. Proof of Lemma 5.2. Note X ϵj,t,1ft( xj) + ϵj,t,2ft( x+ j ) + ϵj,t,3ft( x j ) = ϵj,t,1ft( xj)+ϵj,t,2ft( x+ j )+ϵj,t,3ft( x j ) , X ϵj,t,1ft( xj)+ϵj,t,2ft( x+ j )+ϵj,t,3ft( x j ) o . According to the symmetry of F we know C = max {( xj, x+ j , x j )}nk j=1 SH Eϵ { 1}nk { 1}d { 1}3 h sup f F ϵj,t,1ft( xj) + ϵj,t,2ft( x+ j ) + ϵj,t,3ft( x j ) i sup f F max {( xj, x+ j , x j )}nk j=1 SH Eϵ { 1}nk { 1}d { 1}3 h X ϵj,t,1ft( xj) + ϵj,t,2ft( x+ j ) + ϵj,t,3ft( x j ) i , where we have used the Jensen s inequality in the last step. Since we take maximization over {( xj, x+ j , x j )}nk j=1 SH, we can choose ( xj, x+ j , x j ) = ( x, x+, x ) for any ( x, x+, x ) SH. Then we get C sup f F max ( x, x+, x ) SH Eϵ { 1}nk { 1}d { 1}3 h X ϵj,t,1ft( x) + ϵj,t,2ft( x+) + ϵj,t,3ft( x ) i 2 sup f F max ( x, x+, x ) SH f 2 t ( x) + f 2 t ( x+) + f 2 t ( x ) 1 2 sup f F max ( x, x+, x ) SH f( x) 2 2 + f( x+) 2 2 + f( x ) 2 2 1 2 1nk sup f F max ( x, x+, x ) SH f( x) 2 2 + f( x+) 2 2 + f( x ) 2 2 1 where we have used the following Khitchine-Kahane inequality (De la Pena & Gin e, 2012) i=1 ϵiti 2 1 i=1 |ti|2 1 2 , t1, . . . , tn R, (E.1) The proof is completed. Generalization Analysis for Contrastive Representation Learning Remark E.1. The analysis in the proof implies a lower bound for Re S( e F) for a symmetric e F and e S = {z1, . . . , zn} Re S( e F) 1 2n sup f e F f(z1), . . . , f(zn) 2. Indeed, by the symmetry of F, the Jensen inequality and Eq. (E.1), we have Re S( e F) = Eϵ sup f e F i [n] ϵif(zi) = Eϵ sup f e F i [n] ϵif(zi) n sup f e F Eϵ X i [n] ϵif(zi) 1 2n sup f e F i [n] f 2(zi) 1 E.1. Proof of Proposition 5.3 The following Khintchine-Kahane inequality (De la Pena & Gin e, 2012; Lust-Piquard & Pisier, 1991) is very useful for us to estimate Rademacher complexities. Lemma E.2. Let ϵ1, . . . , ϵn be a sequence of independent Rademacher variables. (a) Let v1, . . . , vn H, where H is a Hilbert space with being the associated norm. Then, for any p 1 there holds i=1 ϵivi p 1 p 1, 1) n X (b) Let X1, . . . , Xn be a set of matrices of the same dimension. For all q 2, i=1 ϵi Xi q Sq e max n n X i=1 X i Xi 1 i=1 Xi X i 1 Proof of Proposition G.1. Let q p. It is clear q p . The dual norm of 2,p is 2,p . Therefore, according to Lemma 5.1 and p q we know j [n] ϵt,jft(xj) ΛEϵ X j [n] ϵt,jxj p j [n] ϵt,jxj q j [n] ϵt,jxj q i 1/q = Λ d Eϵ X j [n] ϵjxj q where we have used Jense s inequality and the concavity of x 7 x1/q . By Lemma E.2, we know j [n] ϵjxj q j [n] xj 2 2 q It then follows that j [n] ϵt,jft(xj) Λd1/q max( p j [n] xj 2 2 1 Note the above inequality holds for any q p. This proves Part (a). We now prove Part (b). Since the dual norm of Sp is Sp , by Lemma 5.1 we know j [n] ϵt,jft(xj) ΛEϵ X j [n] ϵ1,jxj, . . . , X j [n] ϵd,jxj Sp . For any t [d] and j [n], define e Xt,j = 0 0 xj 0 0 , Generalization Analysis for Contrastive Representation Learning i.e., the t-th column of e Xt,j = xj, and other columns are zero vectors. This implies that X j [n] ϵ1,jxj, . . . , X j [n] ϵd,jxj = X j [n] ϵt,j e Xt,j. It is clear that e Xt,j e X t,j = xjx j and e X t,j e Xt,j = 0 0 ... . . . 0 ... 0 x j xj 0 0 . . . . = x j xjdiag(0, . . . , 0 | {z } t 1 , 1, 0 . . . , 0), where diag(a1, . . . , an) denotes the diagonal matrix with elements a1, . . . , an. 
Therefore, we have X j [n] e Xt,j e X t,j = d X j [n] xjx j j [n] e X t,j e Xt,j = X j [n] x j xj Id d, where Id d denotes the identity matrix in Rd d. Therefore, we can apply Lemma E.2 to show that j [n] ϵt,jft(xj) Λ Eϵ X j [n] ϵ1,jxj, . . . , X j [n] ϵd,jxj q e max n d X j [n] xjx j 1 2 Sq , d1/q X j [d] xj 2 2 1 The proof is completed. E.2. Proof of Proposition 5.5 For convenience we introduce the following sequence of function spaces Vk = n x 7 σk Vkσ Vk 1 σ(V1x) : Vj F Bj o , k [L]. To prove Proposition 5.5, we need to introduce several lemmas. The following lemma shows how the supremum over a matrix can be transferred to a supremum over a vector. It is an extension of Lemma 1 in Golowich et al. (2018) from d = 1 to d N, and can be proved exactly by the arguments in Golowich et al. (2018). Lemma E.3. Let σ : R 7 R be a 1-Lipschitz continuous, positive-homogeneous activation function which is applied elementwise. Then for any vector-valued function class e F sup f e F,V Rh h : V F B j [n] ϵt,jσ(V f(xj)) 2 2 B2 sup f e F, v Rh : v 2 1 sup v 2 1 j [n] ϵt,jσ( v f(xj)) 2 . Proof. Let v 1 , . . . , v h be rows of matrix V , i.e., V = v1, . . . , vh . Then by the positive-homogeneous property of activation function we have j [n] ϵt,jσ(V f(xi)) 2 j [n] ϵt,jσ(v 1 f(xj)) ... P j [n] ϵt,jσ(v h f(xj)) j [n] ϵt,jσ(v r f(xj)) r [h] vr 2 2 X j [n] ϵt,jσ v r vr 2 f(xj) 2 r [h] vr 2 2 max r [h] j [n] ϵt,jσ v r vr 2 f(xj) 2 B2 sup v 2 1 j [n] ϵt,jσ( v f(xj)) 2 . Generalization Analysis for Contrastive Representation Learning The proof is completed. The following lemma gives a general contraction lemma for Rademacher complexities. It allows us to remove a nonlinear function ψ, which is very useful for us to handle the activation function in DNNs. Lemma E.4 (Contraction Lemma, Thm 11.6 in Boucheron et al. (2013)). Let τ : R+ 7 R+ be convex and nondecreasing. Suppose ψ : R 7 R is contractive in the sense |ψ(t) ψ( t)| |t t| and ψ(0) = 0. Then the following inequality holds for any e F Eϵ τ sup f e F i=1 ϵiψ f(xi) Eϵ τ sup f e F i=1 ϵif(xi) . The following lemma gives bounds of MGFs for a random variable Z = P 1 i 0 such that 0 min n sup f F i=1 ϵi gi(f(xi)), G j=1 ϵi,jfj(xi) o max n sup f F i=1 ϵi gi(f(xi)), G j=1 ϵi,jfj(xi) o B for all ϵ { 1}n. Let t R be an arbitrary number. Define gt : F 7 R by gt(f) = 0 for any f = 0 and gt(0) = t. It is clear that Eϵ { 1}n sup f F i=1 ϵi gi(f(xi)) = Eϵ { 1}n max n sup f F:f =0 i=1 ϵi gi(f(xi)) , t o Eϵ { 1}nd sup f F j=1 ϵi,jfj(xi) = Eϵ { 1}nd max n G 2 sup f F:f =0 j=1 ϵi,jfj(xi) , t o . Plugging the above identities into (F.2) with g = gt gives Eϵ { 1}n max n sup f F:f =0 i=1 ϵi gi(f(xi)) , t o Eϵ { 1}nd max n G 2 sup f F:f =0 j=1 ϵi,jfj(xi) , t o . If t 0, the above inequality is equivalent to Eϵ { 1}n max n sup f F i=1 ϵi gi(f(xi)) , t o Eϵ { 1}nd max n G j=1 ϵi,jfj(xi) , t o (F.5) by noting gi(0) = 0 for all i Nn. If t < 0, it follows from (F.2) with g(f) = 0 that Eϵ { 1}n max n sup f F i=1 ϵi gi(f(xi)) , t o = Eϵ { 1}n sup f F i=1 ϵi gi(f(xi)) Eϵ { 1}nd sup f F j=1 ϵi,jfj(xi) i = Eϵ { 1}nd max n G j=1 ϵi,jfj(xi) , t o , where we have used gi(0) = 0 for all i Nn in the first identity. That is, (F.5) holds for all t R. Subtracting t from both sides of Eq. (F.5) gives Eϵ { 1}n sup f F i=1 ϵi gi(f(xi)) t + Eϵ { 1}nd G j=1 ϵi,jfj(xi) t +, t R, (F.6) Generalization Analysis for Contrastive Representation Learning from which we know Eϵ { 1}n τ sup f F i=1 ϵi gi(f(xi)) Eϵ { 1}nd τ G j=1 ϵi,jfj(xi) , τ H[0,B]. 
According to Lemma F.4, we know $\tau : [0, B] \mapsto \mathbb{R}_+$ belongs to the closure of $H_{[0,B]}$. Therefore, Eq. (F.1) holds. The proof is completed.

## G. Lipschitz Continuity of Loss Functions

The following proposition is known in the literature (Lei et al., 2019). We prove it for completeness.

Proposition G.1. (a) Let $\ell$ be defined as in Eq. (3.2). Then $\ell$ is 1-Lipschitz continuous w.r.t. $\|\cdot\|_\infty$ and 1-Lipschitz continuous w.r.t. $\|\cdot\|_2$. (b) Let $\ell$ be defined as in Eq. (3.3). Then $\ell$ is 1-Lipschitz continuous w.r.t. $\|\cdot\|_\infty$ and 1-Lipschitz continuous w.r.t. $\|\cdot\|_2$.

Proof. We first prove Part (a). For any $v$ and $v'$, we have

$$|\ell(v)-\ell(v')| = \Big|\max\big\{0,\,1+\max_{i\in[k]}\{-v_i\}\big\} - \max\big\{0,\,1+\max_{i\in[k]}\{-v'_i\}\big\}\Big| \le \Big|\max_{i\in[k]}\{-v_i\}-\max_{i\in[k]}\{-v'_i\}\Big| \le \max_{i\in[k]}|v_i-v'_i| = \|v-v'\|_\infty,$$

where we have used the elementary inequality $|\max_{i\in[k]} a_i - \max_{i\in[k]} b_i| \le \max_{i\in[k]}|a_i-b_i|$. Since $\|v-v'\|_\infty \le \|v-v'\|_2$, this proves Part (a).

We now prove Part (b). It is clear that

$$\frac{\partial \ell(v)}{\partial v_i} = \frac{-\exp(-v_i)}{1+\sum_{i'\in[k]}\exp(-v_{i'})}.$$

Therefore, the $\ell_1$ norm of the gradient can be bounded as follows

$$\|\nabla \ell(v)\|_1 = \frac{1}{1+\sum_{i\in[k]}\exp(-v_i)}\sum_{i\in[k]}\exp(-v_i) \le 1.$$

It follows that $|\ell(v)-\ell(v')| \le \|v-v'\|_\infty$ for all $v, v'$, and since $\|\nabla\ell(v)\|_2 \le \|\nabla\ell(v)\|_1 \le 1$ we also have $|\ell(v)-\ell(v')| \le \|v-v'\|_2$. This proves Part (b). The proof is completed.
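As a quick numerical illustration of Proposition G.1 (a sketch of ours, not part of the paper), the following snippet samples random score vectors and checks that the hinge loss of Eq. (3.2) and the logistic loss of Eq. (3.3) change by no more than the $\ell_\infty$- and $\ell_2$-distance between their arguments. The loss definitions follow the equations above; the sample sizes and the choice of $k$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def hinge(v):
    # Eq. (3.2): max(0, 1 + max_i(-v_i))
    return max(0.0, 1.0 + np.max(-v))

def logistic(v):
    # Eq. (3.3): log(1 + sum_i exp(-v_i))
    return np.log1p(np.sum(np.exp(-v)))

k = 8
for loss in (hinge, logistic):
    for _ in range(2000):
        v, w = rng.normal(size=k), rng.normal(size=k)
        gap = abs(loss(v) - loss(w))
        # Proposition G.1: 1-Lipschitz w.r.t. both the sup-norm and the 2-norm
        assert gap <= np.linalg.norm(v - w, np.inf) + 1e-9
        assert gap <= np.linalg.norm(v - w, 2) + 1e-9

print("Lipschitz bounds of Proposition G.1 hold on all sampled pairs.")
```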