# generalization_bounds_for_adversarial_contrastive_learning__a1bbc19f.pdf Journal of Machine Learning Research 24 (2023) 1-54 Submitted 7/22; Revised 12/22; Published 2/23 Generalization Bounds for Adversarial Contrastive Learning Xin Zou zouxin2021@gmail.com School of Computer Science Wuhan University Wuhan, Hubei, China Weiwei Liu liuweiwei863@gmail.com School of Computer Science Wuhan University Wuhan, Hubei, China Editor: Joan Bruna Deep networks are well-known to be fragile to adversarial attacks, and adversarial training is one of the most popular methods used to train a robust model. To take advantage of unlabeled data, recent works have applied adversarial training to contrastive learning (Adversarial Contrastive Learning; ACL for short) and obtain promising robust performance. However, the theory of ACL is not well understood. To fill this gap, we leverage the Rademacher complexity to analyze the generalization performance of ACL, with a particular focus on linear models and multi-layer neural networks under ℓp attack (p 1). Our theory shows that the average adversarial risk of the downstream tasks can be upper bounded by the adversarial unsupervised risk of the upstream task. The experimental results validate our theory. Keywords: Robustness, Adversarial learning, Contrastive learning, Rademacher complexity, Generalization bound. 1. Introduction Deep neural networks (DNNs) have achieved state-of-the-art performance in many fields. However, several prior works (Szegedy et al., 2014; Goodfellow et al., 2015) have shown that DNNs may be vulnerable to imperceptibly changed adversarial examples, which causes a lot of focus on the robustness of the models (Madry et al., 2018; Mao et al., 2020; Li et al., 2022). One of the most popular approaches to achieving adversarial robustness is adversarial training, which involves training the model with samples perturbed to maximize the loss on the target model (Goodfellow et al., 2015; Madry et al., 2018; Zhang et al., 2019). Schmidt et al. (2018) show that adversarial robust generalization requires a larger amount of data, while Yin et al. (2019); Awasthi et al. (2020) show that the Rademacher Complexity of adversarial training is strictly larger in theory than that of natural training, which implies that we need more data for adversarial training. . Corresponding Author. c 2023 Xin Zou and Weiwei Liu. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-0866.html. Zou and Liu Since the labeled data is limited and expensive to obtain, one option would be to use large-scale unlabeled data and apply self-supervised learning (Gidaris et al., 2018; Noroozi and Favaro, 2016), an approach that trains the model on unlabeled data in a supervised manner by utilizing self-generated labels from the data itself. Contrastive Learning (CL) (Chen et al., 2020; He et al., 2020), which aims to maximize the feature similarity of similar pairs and minimize the feature similarity of dissimilar ones, is a popular self-supervised learning technique. Recently, Kim et al. (2020); Ho and Vasconcelos (2020); Jiang et al. (2020) apply adversarial training in CL and achieve state-of-the-art model robustness. They find that if adversarial training is conducted on the upstream contrastive task, the trained model will be robust on the downstream supervised task. However, their results are all empirical and lack theoretical analysis. To fill this gap, we here present a theoretical analysis of adversarial contrastive learning through the lens of Rademacher complexity, with a particular focus on linear models and multi-layer neural networks. Our theoretical results show that the average adversarial risk of the downstream tasks can be upper bounded by the adversarial unsupervised risk of the upstream task; this implies that if we train a robust feature extractor on the upstream task, we can obtain a model that is robust on downstream tasks. The remainder of this article is structured as follows: 2 introduces some related works in this field. 3 provides some basic definitions and settings that will be used in the following sections. 4 presents our first part s main results, which show a connection between the robust risk for the upstream task and the robust risk for the downstream tasks. 5 shows our second part s results, it outlines our bounds for linear models and multi-layer neural networks by bounding the Rademacher complexity of each hypothesis class. 6 shows some experimental results that verify our theory; finally, the conclusions are presented in the last section. 2. Related Work Adversarial Robustness. Szegedy et al. (2014) show that DNNs are fragile to imperceptible distortions in the input space. Subsequently, Goodfellow et al. (2015) propose the fast gradient sign method (FGSM), which perturbs a target sample towards its gradient direction to increase the loss and then uses the generated sample to train the model in order to improve the robustness. Following this line of research, Madry et al. (2018); Moosavi-Dezfooli et al. (2016); Kurakin et al. (2017); Carlini and Wagner (2017) propose iterative variants of the gradient attack with improved adversarial learning frameworks. Besides, Ma et al. (2022) analyze the trade-offbetween robustness and fairness, Li and Liu (2023) study the worst-class adversarial robustness in adversarial training. For the theoretical perspective, Montasser et al. (2019) study the PAC learnability of adversarial robust learning, Xu and Liu (2022) extend the work of Montasser et al. (2019) to multiclass case and Yin et al. (2019); Awasthi et al. (2020) give theoretical analysises to adversarial training by standard uniform convergence argumentation and giving a bound of the Rademacher complexity. Our work is quite different from Yin et al. (2019); Awasthi et al. (2020), firstly, they just analyze the linear models and two-layer neural networks, but we consider linear models and multi-layer deep neural networks; secondly, they consider the classification loss, which is much easier to analyze than our contrastive loss. Generalization Bounds for Adversarial Contrastive Learning Contrastive Learning. Contrastive Learning is a popular self-supervised learning paradigm that attempts to learn a good feature representation by minimizing the feature distance of similar pairs and maximizing the feature distance of dissimilar pairs (Chen et al., 2020; Wang and Liu, 2022). Sim CLR (Chen et al., 2020) learns representations by maximizing the agreement between differently augmented views of the same data, while Mo Co (He et al., 2020) builds large and consistent dictionaries for unsupervised learning with a contrastive loss. From the theoretical perspective, Saunshi et al. (2019) first presents a framework to analyze CL. We generalize their framework to the adversarial CL. The key challenging issues in this work are how to define and rigorously analyze the adversarial CL losses. Moreover, we further analyze the Rademacher complexity of linear models and multi-layer deep neural networks composited with a complex adversarial contrastive loss. Neyshabur et al. (2015) show an upper bound of the Rademacher complexity for neural networks under group norm regularization. In fact, the Frobenius norm and the ℓ1, -norm in our theoretical analysis are also group norms, while the settings and proof techniques between Neyshabur et al. (2015) and our work are quite different: (1) they consider the standard setting while we consider a more difficult adversarial setting; (2) they consider the Rademacher complexity of neural networks with size 1 output layer, while we consider the case that the neural network is composited with a complex contrastive learning loss; (3) they prove their results by reduction with respect to the number of the layers, while motivated by the technique of Gao and Wang (2021), we use the covering number of the neural network to upper bound the Rademacher complexity in the adversarial case. Adversarial Contrastive Learning. Several recent works (Kim et al., 2020; Ho and Vasconcelos, 2020; Jiang et al., 2020) apply adversarial training in the contrastive pretraining stage to improve the robustness of the models on the downstream tasks, achieving good robust performance in their experiments. Our work attempts to provide a theoretical explanation as to why models robustly trained on the upstream task can be robust on downstream tasks. 3. Problem Setup We introduce the problem setups in this section. 3.1 Basic Contrastive Learning Settings We first set up some notations and describe the contrastive learning framework. Let X Rm be the domain of all possible data points, this paper assumes that X is bounded. Contrastive learning assumes that we get the similar data in the form of pairs (x, x+), which is drawn from a distribution Dsim on X 2, and k independent and identically distributed (i.i.d.) negative samples x 1 , x 2 , . . . , x k drawn from a distribution Dneg on X. Given the training set S = {(xi, x+ i , x i1, . . . , x ik)}M i=1, we aim to learn a representation f from F that maps similar pairs (x, x+) into similar points (f(x), f(x+)), while at the same time keeping f(x i ), , f(x k ) away from f(x), where F is a class of representation functions f : X Rn. Latent Classes. Let C denote the set of all latent classes (Saunshi et al., 2019) that are all possible classes for points in X; for each class c C , moreover, the probability Dc Zou and Liu over X captures the probability that a point belongs to class c. The distribution on C is denoted by ρ. To formalize the similarity among data in X, assume we obtain i.i.d. similar data points x, x+ from the same distribution Dc, where c is randomly selected according to the distribution ρ on latent classes C. We can then define Dsim and Dneg as follows: Definition 1 (Dsim and Dneg, Saunshi et al., 2019) For unsupervised tasks, we define the distribution of sampling similar samples Dsim(x, x+) and negative sample Dneg(x ) as follows: Dsim(x, x+) = E c ρDc(x)Dc(x+), Dneg(x ) = E c ρDc(x ). Supervised Tasks. For supervised tasks, we focus on the tasks in which a representation function f will be tested on a (k + 1)-way supervised task T consisting of distinct classes {c1, c2, . . . , ck+1} C, while the labeled data set of task T consists of M i.i.d. examples drawn according to the following process: A label c {c1, . . . , ck+1} is selected according to a distribution DT , and a sample x is drawn from Dc. The distribution of the labeled pair (x, c) is defined as: DT (x, c) = DT (c)Dc(x). 3.2 Evaluation Metric for Representations We evaluate the quality of a representation function f with reference to its performance on a multi-class classification task T using a linear classifier. Consider a task T = {c1, . . . , ck+1}. A multi-class classifier for T is a function g : X Rk+1, the output coordinates of which are indexed by the classes c in task T . Let {g(x)y g(x)y }y =y be a k-dimensional vector of differences in the coordinates of the output of the classifier g. The loss function of g on a point (x, y) X T is defined as ℓ({g(x)y g(x)y }y =y). For example, one often considers the standard hinge loss ℓ(v) = max{0, 1 + maxi{ vi}} and the logistic loss ℓ(v) = log2(1 + P i exp( vi)) for v Rk. The supervised risk of classifier g is defined as follows: Lsup(T , g) := E (x,c) DT h ℓ {g(x)c g(x)c }c =c i . The risk Lsup(T , g) of classifier g on task T measures the quality of the outputs of g, take the hinge loss as an example, our goal is to get a classifier that has much higher confidence for the true class (i.e., the value g(x)c) than others (i.e., the values g(x)c for c = c), so we want all the differences g(x)c g(x)c for c = c to be as large as possible. If g wrongly classifies x, i.e., arg max c g(x)c = c, of course, ℓon (x, c) is not smaller than 1; however, even thought g correctly classifies x, the loss value can not decrease to zero unless g(x)c g(x)c 1 for all c = c. Let {ci}k+1 i=1 = {c1, . . . , ck+1} be a set of classes from C. Given the matrix W R(k+1) n, we have g(x) = Wf(x) as a classifier that composite the feature extractor g and linear classifier W. The supervised risk of f is defined as the risk of g when the best W is chosen: Lsup(T , f) := inf W R(k+1) n Lsup(T , Wf). Generalization Bounds for Adversarial Contrastive Learning When training a feature extractor f on the upstream task, we do not make predictions on the examples, so we can t define the risk of the feature extractor f as we do for the classifier g. Our goal on the upstream task is to train a feature extractor that can perform well on the downstream tasks, so we consider the potential of the feature extractor here, i.e., the minimal possible classification risk of a linear classifier on the feature extracted by f. So we define the risk of the feature extractor f as above by taking the infimum over all linear classifiers. Definition 2 (Mean Classifier, Saunshi et al., 2019) For a function f and task T = {c1, . . . , ck+1}, the mean classifier is W µ, whose cth row is the mean µc := E x Dc[f(x)] and we define Lµ sup(T , f) := Lsup(T , W µf). Since Lsup(T , f) involves taking infimum over all possible linear classifiers, it is difficult to analyze Lsup(T , f) and establish connections between the risk of the unsupervised upstream task and Lsup(T , f). We introduce the mean classifier to bridge them by bounding average Lµ sup(T , f) over the tasks with the risk of the unsupervised upstream risk (for more details, please refer to 4). Definition 3 (Average Supervised Risk) The average supervised risk for a function f on (k + 1)-way tasks is defined as: Lsup(f) := E {ci}k+1 i=1 ρk+1[Lsup({ci}k+1 i=1 , f)|ci = cj] Given the mean classifier, the average supervised risk for a function f is defined as follows: Lµ sup(f) := E {ci}k+1 i=1 ρk+1[Lµ sup({ci}k+1 i=1 , f)|ci = cj]. In contrastive learning, the feature extractor is trained as a pretrained model to be used on downstream tasks. There may be many different downstream tasks such as binary classification tasks with classes {c1, c2} C and many other multi-class classification tasks, so here we consider the average error of Lµ sup(T , f) over all possible tasks T sampled from ρk+1 as the final performance measure of a feature extractor f. 3.3 Contrastive Learning Algorithm We denote k as the number of negative samples used for training and (x, x+) Dsim, (x 1 , . . . , x k) Dk neg. Definition 4 (Unsupervised Risk) The unsupervised risk is defined as follows: Lun(f) := E[ℓ({f(x)T (f(x+) f(x i ))}k i=1)]. Given M samples n (xj, x+ j , x j1, . . . , x jk) o M j=1 from Dsim Dk neg, the empirical counterpart of unsupervised risk is defined as follows: b Lun(f) := 1 j=1 ℓ({f(xj)T (f(x+ j ) f(x ji))}k i=1). Zou and Liu In the contrastive learning upstream task, we want to learn a feature extractor f such that f maps examples from the same class to similar features and makes the features of examples from different classes far away from each other. So for the examples (x, x+, x 1 , . . . , x k ), we want f(x)T f(x+) to be as large as possible while f(x)T f(x i ), i = 1, . . . , k to be as small as possible. The loss ℓwe defined before elegantly captures the aim of contrastive learning, if we set vi = f(x)T f(x+) f(x)T f(x i ), minimizing ℓ({f(x)T (f(x+) f(x i ))}k i=1) will yield large f(x)T f(x+) and small f(x)T f(x i ), i = 1, . . . , k. So optimizing over Lun(f) can reach our goal for contrastive learning. Following Saunshi et al. (2019), the unsupervised risk can be described by the following equation: Lun(f) = E c+,c i ρk+1 E x,x+ D2 c+,x i Dc i [ℓ({f(x)T (f(x+) f(x i ))}k i=1)]. The Empirical Risk Minimization (ERM) algorithm is used to find a function bf ERM arg min f F b Lun(f) that minimizes the empirical unsupervised risk. The function bf ERM can be subsequently used for supervised linear classification tasks. 3.4 Adversarial Setup A key question in adversarial contrastive learning is that of how to define the adversary sample in contrastive learning. From a representation perspective, one can find a point ex that is close to x and keeps the feature f(ex) both as far from f(x+) as possible and as close to the feature of some negative sample f(x ) as possible. Inspired by this intuition, we define the Contrastive Adversary Sample as follows: Definition 5 (U-Contrastive Adversary Sample) Given a neighborhood U(x) of x, x+ Dc+, x i Dc i for i = 1, . . . , k, we define the U-Contrastive Adversary Sample of x as follows: ex := arg sup x U(x) E x+ Dc+,x i Dc i [ℓ({f(x )T (f(x+) f(x i ))}k i=1)]. In this article, we suppose that the loss function ℓis convex. By the subadditivity of sup, we have, f F: sup x U(x) E x+ Dc+ [ℓ({f(x )T(f(x+) f(x i ))}k i=1)] E x+ Dc+ [ sup x U(x) ℓ({f(x )T(f(x+) f(x i ))}k i=1)]. (1) In the following sections, we analyze the theoretical properties of the right-hand side of (1), which can be easily optimized in practice by Adversarial Empirical Risk Minimization (AERM). The U-Adversarial Unsupervised Risk and its surrogate risk can be defined as below. Definition 6 (U-Adversarial Unsupervised Risk) Given a neighborhood U(x) of x, the U-Adversarial Unsupervised Risk of a presentation function f is defined as follows: e Lun(f) := E c+,c i ρk+1 E x Dc+[ sup x U(x) E x+ Dc+ [ℓ({f(x )T (f(x+) f(x i ))}k i=1)]]. (2) Generalization Bounds for Adversarial Contrastive Learning Moreover, the surrogate risk of (2) is as follows: e Lsun(f) = E c+,c i ρk+1 E x,x+ D2 c+ sup x U(x) ℓ({f(x )T (f(x+) f(x i ))}k i=1). By (1), we have e Lun(f) e Lsun(f) for any f F. The Adversarial Supervised Risk of a classifier g for a task T is defined as follows: e Lsup(T , g) := E (x,c) DT [ sup x U(x) ℓ({g(x)c g(x)c }c =c)]. The Adversarial Supervised Risk of a representation function f for a task T is defined as follows: e Lsup(T , f) := inf W R(k+1) n e Lsup(T , Wf). (3) For the mean classifier, we define: e Lµ sup(T , f) := e Lsup(T , W µf) The Average Adversarial Supervised Risk for a representation function is as defined below. Definition 7 (Average U-Adversarial Supervised Risk) e Lsup(f) := E {ci}k+1 i=1 ρk+1[e Lsup({ci}k+1 i=1 , f)|ci = cj]. For the mean classifier, we have the following: e Lµ sup(f) := E {ci}k+1 i=1 ρk+1[e Lµ sup({ci}k+1 i=1 , f)|ci = cj]. 4. Theoretical Analysis for Adversarial Contrastive Learning This section presents some theoretical results for adversarial contrastive learning. 4.1 One Negative Sample Case Let τ = P c,c ρ2[c = c ], σ be an M-dimensional Rademacher random vector with i.i.d. en- tries, define (gf)|S = (gf(z1), . . . , gf(z M)) and RS(G) := E σ { 1}M σ, (gf)|S # G := {gf(x, x+, x 1 , . . . , x k ) = sup x U(x) ℓ {f(x )T (f(x+) f(x i ))}k i=1 |f F}, let bf arg min f F be Lsun(f) where be Lsun(f) is the empirical counterpart of e Lsun(f), we have: Zou and Liu Theorem 8 (The proof can be found in the Appendix A.1) Let ℓ: Rk R be bounded by B. Then, for any δ (0, 1),with a probability of at least 1 δ over the choice of the training set S = {(xj, x+ j , x j )}M j=1, for any f F: e Lsup( bf) e Lµ sup( bf) 1 1 τ (e Lsun(f) τℓ(0)) + 1 1 τ AGM, AGM = O(RS(G) Remark 9 Theorem 8 shows that when the hypothesis class F is rich enough to contain some f with low surrogate adversarial unsupervised risk, the empirical minimizer of the surrogate adversarial unsupervised risk will then obtain good robustness on the supervised downstream task. Note that we can take f = bf in the upper bound of Theorem 8 and get a bound e Lsup( bf) 1 1 τ (e Lsun( bf) τℓ(0)) + 1 1 τ AGM. Then we can see that if the output of AERM (i.e., bf) achieves small unsupervised adversarial risk, then bf is a robust feature extractor such that it can achieve good robustness after fine-tuning on the downstream tasks. Theorem 8 gives a relationship between the robustness of the contrastive (upstream) task and the robustness of the downstream classification tasks and explains why adversarial contrastive learning can help improve the robustness of the downstream task, as shown empirically in Kim et al. (2020); Ho and Vasconcelos (2020); Jiang et al. (2020). 4.2 Blocks of Similar Points Saunshi et al. (2019) show a refined method that operates by using blocks of similar data and determine that the method achieves promising performance both theoretically and empirically. We adapt this method to adversarial contrastive learning. Specifically, we sample (b + 1) i.i.d. similar samples x, x+ 1 , . . . , x+ b from c+ ρ and b negative i.i.d. samples from c ρ. The block adversarial contrastive learning risk is as follows: e Lblock sun (f) := E sup x U(x) ℓ f(x )T Pb i=1 f(x+ i ) b Pb i=1 f(x i ) b and its empirical counterpart is as follows: sun (f) := 1 sup x U(xi) ℓ f(x )T Pb j=1 f(x+ ij) Pb j=1 f(x ij) Theorem 10 (The proof can be found in the Appendix A.2) For any f F, we have: e Lsup(f) 1 1 τ e Lblock sun (f) τℓ(0) 1 1 τ e Lsun(f) τℓ(0) . Remark 11 Theorem 10 shows that using blocks of similar data yields a tighter upper bound for the adversarial supervised risk than in the case for pairs of similar data. Theorem 10 implies that using the blocks in adversarial contrastive learning may improve the robust performance of the downstream tasks; this will be verified by the empirical results in 6. Generalization Bounds for Adversarial Contrastive Learning 4.3 Multiple Negative Sample Case This subsection extends our results to k negative samples. To achieve this, more definitions are required. Let [k] denote the set {1, 2, . . . , k}. Definition 12 We define a distribution D over the supervised tasks as follows: First, sample k + 1 classes (allow repetition) c+, c 1 , . . . , c k ρk+1, conditioned on the event that c i = c+, i [k]. Then, set the task T as the set of distinct classes in {c+, c 1 , . . . , c k }. Definition 13 The Average U-Adversarial Supervised Risk of a representation function f F over D is defined as follows: e Lsup(f) := E T D h e Lsup(T , f) i . Let Edistinct be the event such that {c+, c 1 , . . . , c k } is distinct and p = P (c+,c 1 ,...,c k ) D [Edistinct]. For any f F, we have (The proof can be found in the Appendix A.3): e Lsup(f) e Lsup(f) From (5), we can turn to analyze the relation between e Lsup(f) and e Lsun(f) in the multiple negative sample case. Our Theorem 16 handles e Lsup(f) instead of e Lsup(f). Assumption 14 Assume that I1, I2 [d] such that I1 I2 = [d], ℓsatisfies the following inequations: ℓ({vi}i I1) ℓ({vi}i [d]) ℓ({vi}i I1) + ℓ({vi}i I2), (6) ℓ({vi}i I2) ℓ({vi}i [d]) ℓ({vi}i I1) + ℓ({vi}i I2). (7) Proposition 15 (The proof can be found in the Appendix A.4) The hinge loss and the logistic loss satisfy Assumption 14. If C is finite, we obtain a simple (and informal) bound of e Lsup( bf), for the more complex case that allows infinite C and the formal form of Theorem 16 (Theorem 32), please refer to the Appendix A.5. Theorem 16 (The proof can be found in the Appendix A.6) Suppose C is finite, for any c C, ρ(c) > 0, and ℓsatisfies Assumption 14. Then, with a probability of at least 1 δ over the choice of the training set S, f F: e Lsup( bf) α(ρ) e Lsun(f) + AGM β. Zou and Liu 5. Generalization Bounds for Example Hypothesis Classes This section presents a concrete analysis of the Rademacher complexity for linear hypothesis class and multi-layer neural networks based on covering number (Wainwright, 2019, Definition 5.1). We first introduce some definitions and required lemmas. Lemma 17 (Wainwright, 2019, Lemma 5.7, volume ratios and metric entropy) Consider a pair of norms and on Rd, and let B and B be their corresponding unit balls (i.e. B = {θ Rd| θ 1}, with B similarly defined). The δ-covering number of B in the -norm then obeys the following bounds (Wainwright, 2019, Lemma 5.7): 1 vol(B ) N(δ; B, ) vol(2 δB + B ) vol(B ) , where we define the Minkowski sum A+B := {a+b : a A, b B}, vol(B) := R 1{x B}dx is the volume of B based on the Lebesgue measure, and N(δ; B, ) is the δ-covering number of B with respect to the norm . Lemma 18 (The proof can be found in the Appendix A.7) Let Bp(r) be the p-norm ball in Rd with radius r. The δ-covering number of Bp(r) with respect to p thus obeys the following bound: N(δ; Bp(r), p) 1 + 2r Definition 19 (Wainwright, 2019, Definition 5.16, sub-Gaussian process) A collection of zero-mean random variables {Xθ, θ T} is a sub-Gaussian process with respect to a metric ρX on T if: E[eλ(Xθ Xeθ)] e λ2ρ2 X (θ,eθ) 2 , θ, eθ T, λ R. It is easy to prove that the Rademacher process (Wainwright, 2019, 5.2) satisfies the condition in Definition 19 with respect to the ℓ2-norm. Lemma 20 (Wainwright, 2019, the Dudley s entropy integral bound) Let {Xθ, θ T} be a zero-mean sub-Gaussian process with respect to the induced pseudometric ρX from Definition 19. Then, for any δ [0, D], we have: sup θ,eθ T (Xθ Xeθ) sup γ,γ T ρX (γ,γ ) δ + 32J (δ/4; D). (8) Here, D = supθ,eθ T ρX(θ, eθ) and J (a; b) = R b a p ln NX(u; T)du, where NX(u; T) is the u-covering number of T in the ρX-metric. Remark 21 Given θ0 T, since E [Xθ0] := E [ θ0, σ ] = 0, we have: E sup θ T Xθ = E sup θ T (Xθ Xθ0) E sup θ,eθ T (Xθ Xeθ) Generalization Bounds for Adversarial Contrastive Learning Combining (8) with (9), we have: E sup θ T Xθ sup γ,γ T ρX (γ,γ ) δ + 32J (δ/4; D), which can be used to draw the upper bound of RS(G) by establishing a connection between the Rademacher process and the Rademacher complexity of the hypothesis classes when proper norm is chosen. For more details, please refer to the Appendix, details are in the proof of Theorem 22, Theorem 24 and Theorem 26. To make Theorem 8 and Theorem 16 concrete, we need to upper bound RS(G). Assume that loss ℓis a non-increasing function; for example, hinge loss and logistic loss satisfy this assumption. Let G = {gf(x, x+, x ) = sup x U(x) ℓ f(x )T (f(x+) f(x )) |f F}. Since ℓis non-increasing, we have: G = ℓ min x U(x) f(x )T f(x+) f(x ) |f F . Let H = min x U(x) f(x )T (f(x+) f(x )) |f F . Suppose ℓis η-Lipschitz. By the Ledoux-Talagrand contraction inequality (Ledoux and Talagrand, 2013), we have: RS(G) ηRS(H). (10) Thus, we only need to upper bound RS(H). Let A a,b be the ℓb-norm of the ℓa-norm of the rows of A. Consider the training set S = {(xi, x+ i , x i )}M i=1. Let X be a matrix whose ith row is xi. We define X+ and X in a similar way. It is easy to see that p 1, i = 1, . . . , M, xi p X p, . 5.1 Linear Hypothesis Class Let F ={f : x Wx|W Rn m, |||W|||p w}. To simplify notations, p 1 and 1 P =max X p, , X+ p, , X p, , P =max X p , , X+ p , , X p , . (11) We then have p 1, i = 1, . . . , M, xi p, x+ i p, x i p P. We now present the upper bound of RS(H) under r attack. Theorem 22 (RS(H) under r attack for linear models) Consider the ℓr attack, i.e. let U(x) = {x | x x r ϵ}. We then have: RS(H) = O [PP + ϵR s(r , p, m)] h ms(p , p, m)w2 where s(p, q, n) := nmax n 1 p 1 r = 1, and R is defined similarly to (11). Zou and Liu The proof can be found in the Appendix A.8. Remark 23 Combining Theorem 22 with (4), we have: [PP + ϵR s(r , p, m)] mηs(p , p, m)w2 + B q 5.2 Multi-layer Neural Network In this section, we analyze fully connected multi-layer neural networks. Suppose that X Rm. Let F = {Wdσ(Wd 1σ( σ(W1x))) | |||Wl||| Ml, l = 1, . . . , d}, where ||| ||| is the norm of the matrix and σ( ) is an elementwise L-Lipschitz function with σ(0) = 0 and Wl Rhl hl 1, where hd = n, h0 = m. Assume ℓis η-Lipschitz and nonincreasing. From (10), we need only to bound the Rademacher complexity of H. We here consider two cases of the matrix norm ||| |||. Let U(x) = {x | x x p ϵ} for some p 1. 5.2.1 Frobenius-norm Case. We first consider using the Frobenius-Norm in the definition of the multi-layer neural networks hypothesis class F. Theorem 24 (RS(H) under p attack for NNs under ||| |||F constraint) Let U(x) = {x | x x p ϵ} (i.e. consider the ℓp attack), σ(0) = 0 with Lipschitz constant L and let F = Wdσ(Wd 1σ( σ(W1x))) |||Wl|||F MF l , l = 1, . . . , d . We then have: l=1 hlhl 1K where K = 2BF X,ϵ BF X+ + BF X , BF X,ϵ =Ld 1 d Y l=1 MF l max n 1, m 1 2 1 p o ( X p, +ϵ) , BF X =Ld 1 d Y l=1 MF l max n 1, m 1 2 1 The proof can be found in the Appendix A.9. Remark 25 Combining Theorem 24 with (4), we have: d Pd l=1 hlhl 1 Generalization Bounds for Adversarial Contrastive Learning 5.2.2 ℓ1, -norm Case We consider the 1, norm constraint. Theorem 26 (RS(H) under p attack for NNs under 1, constraint) Let U(x) = {x | x x p ϵ} (i.e. consider the ℓp attack), σ(0) = 0 with Lipschitz constant L; moreover, let F = n Wdσ(Wd 1σ( σ(W1x))) Wl 1, M1, l , l=1, . . . ,d o . We then have: l=1 hlhl 1 p where K0 = 2B1, X,ϵ B X+ + B X , K1 = K0 2 + B X,ϵ B1, X+ + B1, X , B X,ϵ = Ld 1 d Y l=1 hl M1, l m1 1 p ( X p, + ϵ) , B X = Ld 1 d Y l=1 hl M1, l m1 1 B1, X,ϵ = Ld 1 d Y l=1 M1, l ( X p, + ϵ) , B1, X = Ld 1 d Y l=1 M1, l X p, . The proof can be found in the Appendix A.10. Remark 27 Combining Theorem 26 with (4), we have: d K0K1 Pd l=1 hlhl 1 M + B Remark 28 Our bound has important implications for the design of regularizers for adversarial contrastive learning. To achieve superior robust performance on the downstream tasks, the usual approach is to make X p, small. For example, Pytorch scales the images to tensors with entries within the range [0, 1]. Moreover, Theorem 24 shows that we can take the norms of the layers as the regularizers to reduce the adversarial supervised risk. In our analysis for the Rademacher complexity, we consider models with norm-constrained weights, which means that the hypothesis class is uniformly Lipschitz, although the Lipschitz constant may be large (the product of maximal weight norms for the layers). One may wonder what will happen if we remove the constrains on the norm of the weights. For simplicity, let s consider a hypothesis class H { 1}X for binary classification, we have: RS(H) = E σ i=1 σih(xi) where σ1, . . . , σn { 1} are i.i.d. uniform random variables. We can regard σ1, . . . , σn as random labels that we need to fit by hypothesis from H, so we can interpret RS(H) as the Zou and Liu Attack ϵ Type λ 0 0.002 0.05 0.2 PGD 0.01 Clean 75.73 74.8 74.5 75.67 Adv 67.67 69.25 68.59 67.65 0.02 Clean 53.11 55.72 55.73 55.72 Adv 46.71 48.17 48.17 48.16 FGSM 0.01 Clean 74.42 76.13 76.12 76.11 Adv 67.28 68.79 68.8 68.8 0.02 Clean 54.28 54.29 54.28 66.68 Adv 45.93 45.94 45.92 55.64 Table 1: Results of experiments on the regularizer. In this table, we list the clean accuracy (Clean) and adversarial accuracy (Adv) of the mean classifier under the PGD and FGSM attack with ϵ = 0.01 and ϵ = 0.02. λ is chosen from {0, 0.002, 0.05, 0.2}, and λ = 0 indicates no regularizer. ability of H to fit random 1 binary labels. Now let H be the neural network, if we do not constrain the norm of the weights, theoretically, the universal approximation theorem (Maiorov and Pinkus, 1999) tells us that neural networks can fit any continuous function on a bounded input space, which means that in this case RS(H) 1, leading to vacuous bounds in the binary classification case; experimentally, Zhang et al. (2017) show that deep neural networks easily fit random labels. From another perspective, to derivate an upper bound for RS(H) by covering number, we need to find a δ-covering set for H under some metric ρ( , ). If the weights of the layers are not bounded, we can not cover H by a finite subset of H, the δ-covering number of H under ρ( , ) will be infinite, which means that Lemma 20 does not hold. So it is difficult to go beyond the Lipschitz network. 6. Experiments In this section, we conduct several experiments to support our theory. We emphasize that we are not proposing a method to try to get better robustness on the downstream tasks, we do the experiments to verify our two claims from our theoretical results: (1) As shown in Remark 11, using the blocks may improve the robust performance; (2) As shown in Remark 28, using the norms of the layers of the neural networks as the regularizer may help improve the robust performance. Data sets. We use two data sets (Krizhevsky and Hinton, 2009) in our experiments: (1) the CIFAR-10 data set and (2) the CIFAR-100 data set. CIFAR-10 contains 50000/10000 train/test images with size 32 32, which are categorized into 10 classes. CIFAR-100 contains 50000/10000 train/test images with size 32 32, which are categorized into 100 classes. Model. We use a neural network with two convolutional layers and one fully connected layer. Following He et al. (2020), we use the Stochastic Gradient Descent (SGD) optimizer Generalization Bounds for Adversarial Contrastive Learning 1 2 4 6 8 10 block size clean accuracy(%) pgd-0.01 pgd-0.02 fgsm-0.01 fgsm-0.02 (a) Influence on clean accuracy 1 2 4 6 8 10 block size adv accuracy(%) pgd-0.01 pgd-0.02 fgsm-0.01 fgsm-0.02 (b) Influence on adversarial accuracy Figure 1: The effect of block size on the accuracy. In the figure, we show the clean accuracy and the adversarial accuracy of the mean classifier under PGD and FGSM attack with ϵ = 0.01 and ϵ = 0.02. The block size is choosen from {1, 2, 4, 6, 8, 10}. (a) The influence on the clean accuracy; (b) The influence on the adversarial accuracy. with momentum 0.9 but set the weight decay to be 5 10 4 and the learning rate to be 0.001. Evaluation of robustness. For representation f, we first calculate buc = 1 nc Pnc i=1 f(xi) to estimate the c-th row of the mean classifier, where x1, . . . , xnc are the data points with label c in our training set. Denote c W µ as the estimator of W, we use the robustness of the classifier c W µf as an evaluation of the robustness of f on the downstream task. We show the results for CIFAR-10 here; the results for CIFAR-100 can be found in the Appendix B. 6.1 Improvement from the regularizer Inspired by our bound (13) in Theorem 24 and Theorem 8, the adversarial supervised risk can be upper bounded by the sum of the adversarial unsupervised loss and AGM, which is related to the maximal Frobenius-norm of the network layers. We choose to simultaneously optimize the contrastive upstream pre-train risk and the Frobenius norm of the parameters of the model. We set the norm of the parameters for the layers as a regularizer and test the performance of the mean classifier; here, W µ is calculated by averaging all features of the training data set as done in Nozawa et al. (2020). We use a hyper-parameter λ to balance the trade-offof the the contrastive upstream pre-train risk and the Frobenius norm of the parameters of the model. We choose to minimize the following regularized empirical risk: L(f) = be Lsun(f) + λN(f) (14) where N(f) is a regularizer that constrains the Frobenius norm of the parameters of the model f, here we choose N(f) = Pd l=1 |||Wl|||F where |||Wl|||F is the Frobenius norm of the Zou and Liu parameters for the l-th layer of f and: be Lsun(f)= 1 j=1 max x j U(xj)ℓ({f(x j)T(f(x+ j ) f(x ji))}k i=1). More details about our algorithm are in Algorithm 1. Algorithm 1: The AERM algorithm for adversarial contrastive learning Input : The training data set S = n (xj, x+ j , x j1, . . . , x jk) o M j=1 sampled from Dsim Dk neg; the hyper-parameter λ in (14); the adversarial perturbation U; learning rate α; the total iteration number T; 1 initialize θ0 to be randomized parameters; 3 while t < T do 4 randomly sample a batch of data with size N: B S ; 6 for (x, x+, x 1 , . . . , x k ) in B do 7 calculate adversarial example x arg sup x U(x) ℓ fθt(x )T fθt(x+) fθt(x i ) k i=1 8 e B e B {( x, x+, x 1 , . . . , x k )}; 10 L(fθt) = 1 ( x,x+,x 1 ,...,x k ) e B ℓ fθt( x)T fθt(x+) fθt(x i ) k i=1 11 θt+1 θt α θt L(fθt); 12 t t + 1; 13 end Output : The feature extractor fθT F that tries to minimize (14); The results are shown in Table 1. From Table 1, we can see that the F-norm regularizer can improve the adversarial accuracy (the prediction performance of a model on adversarial examples generated by attacker) of the mean classifier, which is in line with our Theorem 24. 6.2 Effect of block size To verify Theorem 10, we analyze the effect of block size on the adversarial accuracy of the mean classifier. Figure 1 presents the results for clean accuracy and adversarial accuracy, respectively. From Figure 1, we can see that a larger block size will yield better adversarial accuracy. The results are consistent with Theorem 10: as the block size grows, Theorem 10 shows that we are optimizing a tighter bound, which leads to better performance as shown in Figure 1. Generalization Bounds for Adversarial Contrastive Learning 7. Conclusion This paper studies the generalization performance of adversarial contrastive learning. We first extend the contrastive learning framework to the adversarial case, then we upper bound the average adversarial risk of the downstream tasks with the adversarial unsupervised risk of the upstream task and an adversarial Rademacher complexity term. Furthermore, we provide the upper bound of the adversarial Rademacher complexity for linear models and multi-layer neural networks. Finally, we conduct several experiments and the experimental results are consistent with our theory. Acknowledgments This work is supported by the National Natural Science Foundation of China under Grant 61976161. Zou and Liu Appendix A. Proofs In this section, we display the proofs of our theorems, lemmas and corollaries. For reading convenience, we will restate the theorem before proving. A.1 Proof of Theorem 8 In the below, we present some useful lemmas that will be used in the proofs of our main theorems. Lemma 29 For any f F, we have: e Lsup(f) e Lµ sup(f) 1 1 τ (e Lun(f) τℓ(0)) 1 1 τ (e Lsun(f) τℓ(0)). (A.1) Remark 30 be Lsun(f) denotes the empirical e Lsun(f) and bf arg min f F be Lsun(f). Applying Lemma 29 to bf shows that, if we can train a robust feature extractor with low surrogate adversarial unsupervised risk, we can obtain a robust classifier with low adversarial supervised risk on the downstream task. Lemma 31 Let ℓ: Rk R be bounded by B. Then, for any δ (0, 1), with a probability of at least 1 δ over the choice of the training set S = {(xj, x+ j , x j1, . . . , x jk)}M j=1 = {zj}M j=1, for any f F: e Lsun( bf) e Lsun(f) + AGM, where AGM = O(RS(G) δ M ), RS(G) := E σ { 1}M σ, (gf)|S # {gf(x, x+, x 1 , . . . , x k ) = sup x U(x) ℓ {f(x )T (f(x+) f(x i ))}k i=1 |f F}, where σ is an M- dimensional Rademacher random vector with i.i.d. entries and (gf)|S = (gf(z1), . . . , gf(z M)). Proof [Proof of Lemma 29] By the definition of e Lsup(T , f), i.e., (3), it s obvious that e Lsup(f) e Lµ sup(f), so we only need to prove the second part of (A.1). From definition (2), we have, f F: e Lun(f) = E c+,c ρ2 E x+ Dc+ x Dc ℓ f(x )T f(x+) f(x ) (i) E c+,c ρ2 sup x U(x) ℓ f(x )T (µc+ µc ) #) (ii) = (1 τ) E c+,c ρ2 sup x U(x) ℓ f(x )T (µc+ µc ) # c+ = c ) (A.2) where (i) comes from the convexity of ℓand Jensen s Inequality, and (ii) comes from the property of conditional expectations. Then we have: Generalization Bounds for Adversarial Contrastive Learning sup x U(x) ℓ f(x )T (µc+ µc ) # c+ = c ) (i)= E c+,c ρ2 DT (c+) E x Dc+ sup x U(x) ℓ f(x )T (µc+ µc ) # + DT (c ) E x Dc sup x U(x) ℓ f(x )T (µc µc+) # c+ = c ) (ii) = E c+,c ρ2 DT (c+) E x Dc+ sup x U(x) ℓ g(x )c+ g(x )c # + DT (c ) E x Dc sup x U(x) ℓ g(x )c g(x )c+ # c+ = c ) where T = {c+, c }, g(x) = = W µf(x) and (i) comes from the symmetry of c+, c ; (ii) is directly from some linear algebras. From (A.3) we know that: sup x U(x) ℓ f(x )T (µc+ µc ) # c+ = c ) (i)= E c+,c ρ2 E c DT E x Dc sup x U(x) ℓ g(x )c g(x )c (ii) = E c+,c ρ2 " e Lsup c+, c , W µf c+ = c # (iii) = E c+,c ρ2 " e Lµ sup c+, c , f c+ = c # (iv) = e Lµ sup(f), where (i) is due to the tower property of expectation;(ii) is obvious by the definition of e Lsup(T , g); (iii) comes from the definition of e Lµ sup(T , f) and (iv) is from the definition of e Lµ sup(f). Combine (A.2) with (A.4), we conclude that: e Lun(f) (1 τ)e Lµ sup(f) + τℓ(0), f F. So we have: (1 τ)e Lsup(f) + τℓ(0) (1 τ)e Lµ sup(f) + τℓ(0) e Lun(f), f F. (A.5) Rearranging (A.5) yields (A.1). Zou and Liu Proof [Proof of Lemma 31] Denote (x, x+, x 1 , , x k ) by z, then by the Theorem 3.3 in Mohri et al. (2012), we have: With probability at least 1 δ 2 over the choice of the training set S, 1 B gf(zi) + 2 which is equivalent to: e Lsun(f) be Lsun(f) + 2 M RS(G) + 3B δ M , f F. (A.6) Let f = arg min f F e Lsun(f), since E S be Lsun(f) = e Lsun(f) and ℓis bounded by B, Hoeffding s inequality tells us that: f F, t R: P be Lsun(f) e Lsun(f) t e 2Mt2 Set t = B q δ 2M , we have: f F, with probability at least 1 δ be Lsun(f) e Lsun(f) B δ 2M . (A.7) Combine (A.6) with (A.7), the union bound tells us that: f F, with probability at least 1 δ over the choice of the training set S, e Lsun( bf) (i) be Lsun( bf) + O (ii) be Lsun(f ) + O (iii) e Lsun(f ) + O (iv) e Lsun(f) + O where (i) comes from (A.6);(ii) is directly from the fact that be Lsun( bf) be Lsun(f ), which is from the definition of bf;(iii) is a result of (A.7) and (iv) is obvious by the definition of f . Theorem 8 Let ℓ: Rk R be bounded by B. Then, for any δ (0, 1),with a probability of at least 1 δ over the choice of the training set S = {(xj, x+ j , x j )}M j=1, for any f F: e Lsup( bf) e Lµ sup( bf) 1 1 τ (e Lsun(f) τℓ(0)) + 1 1 τ AGM. Proof From Lemma 29 we know that: e Lsup( bf) e Lµ sup( bf) 1 1 τ (e Lun( bf) τℓ(0)) 1 1 τ (e Lsun( bf) τℓ(0)), Then Lemma 31 directly yields the result we need. Generalization Bounds for Adversarial Contrastive Learning A.2 Proof of Theorem 10 Theorem 10 For any f F, we have: e Lsup(f) 1 1 τ e Lblock sun (f) τℓ(0) 1 1 τ e Lsun(f) τℓ(0) . Proof By the convexity of ℓand Jensen s inequality, we have: x , x+ i , x i : f(x )T Pb i=1 f(x+ i ) b Pb i=1 f(x i ) b i=1 f(x )T f(x+ i ) f(x i ) ! i=1 ℓ f(x )T f(x+ i ) f(x i ) . Take maximization about x both sides, we have: sup x U(x) ℓ f(x )T Pb i=1 f(x+ i ) b Pb i=1 f(x i ) b b sup x U(x) i=1 ℓ f(x )T f(x+ i ) f(x i ) # sup x U(x) ℓ f(x )T f(x+ i ) f(x i ) # Taking expectations both sides yields: e Lblock sun (f) = E x,x+ i ,x i sup x U(x) ℓ f(x )T Pb i=1 f(x+ i ) b Pb i=1 f(x i ) b E x,x+ i ,x i sup x U(x) ℓ f(x )T f(x+ i ) f(x i ) #) sup x U(x) ℓ f(x )T f(x+) f(x ) # = e Lsun(f), which proves the second inequality. For the first inequality, we have, f F: Zou and Liu e Lblock sun (f) = E c+,c ρ2 E x,x+ i Db+1 c+ sup x U(x) ℓ f(x )T Pb i=1 f(x+ i ) b Pb i=1 f(x i ) b (i) E c+,c ρ2 sup x U(x) E x+ i Db c+ f(x )T Pb i=1 f(x+ i ) b Pb i=1 f(x i ) b (ii) E c+,c ρ2 sup x U(x) ℓ E x+ i Db c+ f(x )T Pb i=1 f(x+ i ) b Pb i=1 f(x i ) b (iii) = E c+,c ρ2 sup x U(x) ℓ f(x )T (µc+ µc ) #) (iv) = (1 τ)e Lµ sup(f) + τℓ(0), where (i) and (ii) are directs result of Jensen s Inequality and convexity of maximization function and ℓ; (iii) is from the linearity of expectation and the last equality follows the same argumentation as in (A.4), which proves the Theorem. A.3 Proof of (5) Proof For any f F, we have: e Lsup(f)= E T D h e Lsup(T , f) i (i)=p E T D h e Lsup(T , f) Edistinct i +(1 p) E T D h e Lsup(T , f) Edistinct i h e Lsup(T , f) Edistinct i (ii) = p e Lsup(f), where Edistinct is the event that {c+, c 1 , . . . , c k } is distinct and p = P (c+,c 1 ,...,c k ) D [Edistinct] and (i) is from the property of conditional expectation and (ii) comes from the definition of e Lsup(f). A.4 Proof of Proposition 15 Proposition 15 The hinge loss and the logistic loss satisfy Assumption 14. Proof Since I1 and I2 are symmetric, We only need to prove (6). For the Hinge Loss: ℓ({vi}i I1) = max 0, 1 + max i I1 { vi} (i) max 0, 1 + max i [d]{ vi} = ℓ {vi}i [d] , Generalization Bounds for Adversarial Contrastive Learning where (i) is from the fact that I1 [d], so the first inequality is proved, for the second one: ℓ {vi}i [d] = max 0, 1 + max i [d]{ vi} = max 0, 1 + max i I1 I2{ vi} , where the last equality is directly from the definition of I1 and I2. 1. if 1 + max i I1 I2{ vi} 0, then ℓ({vi}i I1 I2) = 0, since Hinge Loss is non-negative, we ℓ({vi}i I1 I2) = 0 + 0 ℓ({vi}i I1) + ℓ({vi}i I2) . 2. if 1 + max i I1 I2{ vi} > 0 (a) if 1 + max i I1 { vi} 0, then ℓ({vi}i I1) = max 0, 1 + max i I1 { vi} = 0. So we ℓ({vi}i I1 I2) = max 0, 1 + max i I1 I2{ vi} = max 0, 1 + max i I2 { vi} = 0 + ℓ({vi}i I2) = ℓ({vi}i I1) + ℓ({vi}i I2) . (b) if 1 + max i I2 { vi} 0, by the same discussion of (a), we have: ℓ({vi}i I1 I2) ℓ({vi}i I1) + ℓ({vi}i I2) . (c) if 1 + max i I1 { vi} > 0 and 1 + max i I2 { vi} > 0, ℓ({vi}i I1 I2)=max 0, 1+ max i I1 I2{ vi} max 0, 1+max i I1 { vi}+1+max i I2 { vi} max 0, 1 + max i I1 { vi} + max 0, 1 + max i I2 { vi} = ℓ({vi}i I1) + ℓ({vi}i I2) . So the second inequality is proved. For the Logistic Loss: ℓ({vi}i I1) = log2 = ℓ {vi}i [d] . Zou and Liu So the first inequality is proved. For the second one: ℓ({vi}i I1 I2) = log2 i I1 I2 e vi i I1 e vi + X i I1 e vi + X i I2 e vi + = ℓ({vi}i I1) + ℓ({vi}i I2) , which proves the second inequality. A.5 Proof of Theorem 32 We here introduce some further notations, which will be used in our main results. For a tuple (c+, c 1 , . . . , c k ), we define I+ := {i [k]|c i = c+} and Q as the set of distinct classes in (c+, c 1 , . . . , c k ). We define pmax(T ) := maxc DT (c), τk := P c+,c i ρk+1[I+ = ], while ℓN( 0) is defined as the loss of the N-dimensional zero vector. Let T be a task sample from distribution D and ρ+(T ) be a distribution of c+ when (c+, c 1 , . . . , c k ) are sampled from ρk+1 conditioned on Q = T and I+ = and ρ+ min(T ) := min c T ρ+(T )(c). Theorem 32 Assume that ℓsatisfies Assumption 14. With a probability of at least 1 δ over the choice of the training set S, for any f F, we have: ρ+ min(T ) pmax(T ) e Lsup(T , bf) E T D ρ+ min(T ) pmax(T ) e Lµ sup(T , bf) e Lsun(f) + AGM τk 1 τk E c+,c i ρk+1 h ℓ|I+|( 0)|I+ = i , (A.8) where |I+| is the cardinality of set I+. Before proceeding with the proof of Theorem 32, we introduce some useful lemmas. Lemma 33 For any T sampled from D, we have: ρ+(T )(c) ρ+ min(T ) pmax(T )DT (c), c T . Lemma 34 Assume that f satisfies Assumption 14. For any f F, we have: (1 τk) E T D ρ+ min(T ) pmax(T ) e Lsup(T , f) (1 τk) E T D ρ+ min(T ) pmax(T ) e Lµ sup(T , f) e Lsun(f) τk E c+,c i ρk+1 h ℓ|I+|( 0)|I+ = i . Generalization Bounds for Adversarial Contrastive Learning Proof [Proof of Lemma 33] By the definition of ρ+(T ) and ρ+ min(T ), we can find that: c T , ρ+ min(T ) ρ+(T )(c). So we have: c T , ρ+(T )(c) ρ+ min(T ) 1. (A.9) By the definition of pmax(T ), we have: c T , DT (c) pmax(T ) 1. (A.10) Combine (A.9) and (A.10), we have: c T , ρ+(T )(c) ρ+ min(T ) pmax(T )DT (c). Proof [Proof of Lemma 34] By the definition of e Lsun(f), we have: e Lsun(f) = E c+,c i ρk+1 x,x+ D2 c+;x i Dc i sup x U(x) ℓ f(x )T f(x+) f(x i ) k i=1 = E c+,c i ρk+1 sup x U(x) ℓ f(x )T f(x+) f(x i ) k i=1 (i) E c+,c i ρk+1 ℓ f(x )T f(x+) f(x i ) k i=1 (ii) E c+,c i ρk+1 sup x U(x) ℓ n f(x )T µc+ µc i where (i), (ii) is from the Jensen s inequality and convexity of ℓ. Then we analyze lower bound of R1: Zou and Liu R1 (i)= (1 τk) E c+,c i ρk+1 sup x U(x) ℓ n f(x )T µc+ µc i + τk E c+,c i ρk+1 sup x U(x) ℓ n f(x )T µc+ µc i (ii) (1 τk) E c+,c i ρk+1 sup x U(x) ℓ f(x )T (µc+ µc) c Q,c =c+ I+ = + τk E c+,c i ρk+1 sup x U(x) ℓ n f(x )T µc+ µc i (iii) (1 τk) E c+,c i ρk+1 sup x U(x) ℓ f(x )T (µc+ µc) c Q,c =c+ I+ = + τk E c+,c i ρk+1 ℓ|I+|( 0) I+ = , where (i) comes from the property of conditional expectation, (ii) is a result of the fact that Q [k] and ℓsatisfies Assumption 14 and (iii) is from the fact that [|I+|] [k] and ℓsatisfies Assumption 14. Let R2 = E c+,c i ρk+1 sup x U(x) ℓ f(x )T (µc+ µc) c Q,c =c+ I+ = R2 = E c+,c i ρk+1 sup x U(x) ℓ f(x )T (µc+ µc) c Q,c =c+ I+ = E c+,c i ρk+1 sup x U(x) ℓ f(x )T (µc+ µc) c Q,c =c+ Q = T , I+ = (ii) = E T D E c+ ρ+(T ) x Dc+ sup x U(x) ℓ f(x )T (µc+ µc) c T ,c =c+ # where (i) is from the tower property of expectation and (ii) is directly obtained by the definition of ρ+(T ). Let R3 = E c+ ρ+(T ) x Dc+ sup x U(x) ℓ f(x )T (µc+ µc) c T ,c =c+ # Generalization Bounds for Adversarial Contrastive Learning R3 = E c+ ρ+(T ) x Dc+ sup x U(x) ℓ f(x )T (µc+ µc) c T ,c =c+ # (i) ρ+ min(T ) pmax(T ) E c+ DT x Dc+ sup x U(x) ℓ f(x )T (µc+ µc) c T ,c =c+ # (ii) = ρ+ min(T ) pmax(T ) e Lµ sup(T , f) (iii) ρ+ min(T ) pmax(T ) e Lsup(T , f), where (i) is directly from Lemma 33;(ii) and (iii) are from the definition of e Lµ sup(T , f) and e Lsup(T , f), respectively. Combing (A.11) (A.12) (A.13) and (A.14) yields: (1 τk) E T D ρ+ min(T ) pmax(T ) e Lsup(T , f) (1 τk) E T D ρ+ min(T ) pmax(T ) e Lµ sup(T , f) e Lsun(f) τk E c+,c i ρk+1 h ℓ|I+|( 0)|I+ = i . Equipped with the above lemmas, now we can turn to the proof of Theorem 32. Proof [Proof of Theorem 32] From Lemma 31 we know that with probability at least 1 δ over the choice of the training set S, f F: e Lsun( bf) e Lsun(f) + AGM. Combing this with Lemma 34 directly yields (A.8). A.6 Proof of Theorem 16 Theorem 16 Suppose C is finite. For any c C, ρ(c) > 0, and ℓsatisfies Assumption 14. Then, with a probability of at least 1 δ over the choice of the training set S, f F: e Lsup( bf) α(ρ) e Lsun(f) + AGM β, where α(ρ) = 1 1 τk max |T | k+1 Tdistinct ρ+ min(T ) is a positive constant depending on ρ and β = α(ρ)τk E c+,c i ρk+1 h ℓ|I+|( 0)|I+ = i . Proof Since C is finite and ρ(c) > 0 for any c C, we can see that: 1 α(1 τk) = min |T | k+1,T distinct ρ+ min(T ) pmax(T ) > 0. (A.15) Zou and Liu By Theorem 32 and the definition of e Lsup(f), we have: f F, with probability at least 1 δ over the choice of the training set S, 1 α(1 τk) e Lsup( bf) (i)= E T D 1 α(1 τk) e Lsup(T , bf) (ii) E T D ρ+ min(T ) pmax(T ) e Lsup(T , bf) (iii) 1 1 τk e Lsun(f) + AGM τk 1 τk E c+,c i ρk+1 h ℓ|I+|( 0)|I+ = i , where (i) is from the definition of e Lsup( bf);(ii) is from (A.15) and (iii) is from Theorem 32. Rearranging yields the result. A.7 Proof of Lemma 18 Lemma 18 Let Bp(r) be the p-norm ball in Rd with radius r. The δ-covering number of Bp(r) with respect to p thus obeys the following bound: N(δ; Bp(r), p) 1 + 2r where N(δ; B, ) is the δ-covering number of B with respect to the norm . Proof Set , in Lemma 17 to be p, we have: N(δ; Bp(1), p) vol 2 δBp(1) + Bp(1) vol (Bp(1)) = vol 1 + 2 vol (Bp(1)) d vol (Bp(1)) vol (Bp(1)) = 1 + 2 where (i) is true because Bp(1) Rd. Now suppose that {x1, , x N} is the minimal δ-covering of Bp(1), then: x Bp(1), xi {x1, , x N} s.t. x xi p δ. So we have: rx Bp(r), rxi {rx1, , rx N} s.t. rx rxi p rδ. So {rx1, , rx N} is a rδ-covering of Bp(r), so we have: N(rδ; Bp(r), p) N(δ; Bp(1), p) 1 + 2 r take place of δ, we have N(δ; Bp(r), p) 1 + 2r Generalization Bounds for Adversarial Contrastive Learning A.8 Proof of Theorem 22 Theorem 22 Let U(x) = {x | x x r ϵ} (i.e. consider the ℓr attack). We then have : RS(H) = O [PP + ϵR s(r , p, m)] h ms(p , p, m)w2 where s(p, q, n) := nmax n 1 p 1 r = 1, and R is defined similarly to (11). Before giving the proof of the theorem, we introduce some lemmas will be used in our proof. Lemma 35 For any x Rn,0 < p2 p1, we have: x p1 x p2 n Proof [Proof of Lemma 35] Firstly, we prove x p1 x p2. For any x Rn, suppose ai = |xi|, lep f(p) = (Pn i=1 ap i ) 1 p . In order to prove x p1 x p2, it suffices to prove that: ai 0, i = 1, 2, . . . , n, f(p) is non-increasing on (0, 1]. 1. if ai = 0, i, f is a constant function, so f is non-increasing. 2. if i, ai = 0, suppose {ai|ai = 0} has K elements, without loss of generality, suppose {ai|ai = 0} = {a1, a2, , a K} and a1 = max 1 k K{ak}, then we have: 0, k = 1, 2, . . . , K. (A.16) We can write f(p) = (Pn i=1 ap i ) 1 p = a1 PK k=1 ak a1 p = a1exp{ 1 pln PK k=1 ak a1 let g(p) = 1 pln PK k=1 ak a1 p , by the monotone property of composite functions, to prove f(p) is non-increasing, it suffices to prove g(p) is non-increasing.Taking derivation of g yields: PK k=1 h ak a1 p PK k=1 ak a1 p ln PK k=1 ak a1 From (A.16) we know that PK k=1 h ak p PK k=1 ak p 0 and ln PK k=1 ak p2 0, so we have g (p) 0, so g is non-increasing, which means that x p1 x p2, x Rn. Nextly, we prove that x p2 n p1 x p1. By the definition of p, we have: x p1 p2 = (Pn i=1 |xi|p2) n Pn i=1 |xi|p2 p1 p1 p2 .Since p2 p1, i.e. p1 p2 1, we know the function p1 p2 is convex. By Jensen s Inequality we know that: λi s.t. λi 0 and i=1 λi = 1, ti R : h i=1 λih(ti). (A.17) Zou and Liu Set λ1 = λ2 = = λn = 1 n, ti = |xi|p2, i = 1, 2, . . . , n in (A.17), we have: i=1 |xi|p2 ! p1 i=1 (|xi|p2) i=1 |xi|p1. Multiplying both sides by n p1 p2 yields 1 n Pn i=1 |xi|p2 p1 n Pn i=1 |xi|p1 n p1 p2 , i.e. p1 p2 1 x p1 p1. Taking both sides the p1th power yields x p2 n Lemma 36 Suppose that A, B R+ s.t. A x p x q B x p (p, q 1) for any x Rn, then W Rm n: |||W|||p B Proof [Proof of Lemma 36] If we have A x p x q B x p, then: |||W|||p = max x p 1 Wx p (i) max x p B Wx p (ii) max x p B 1 A Wx q = max x p 1 B A Wx q = B where (i) is from the fact that x q B x p, which means that {x| x p 1} {x| x q 1} and (ii) is from the condition that A x p x q. Lemma 37 For any p, q 1, then W Rm n: |||W|||q nmax{ 1 p}|||W|||p. Remark 38 If we denote nmax{ 1 p} by s(p, q, n), then we have: W Rm n, |||W|||q s(p, q, n)|||W|||p. Proof [Proof of Lemma 37] We discuss it in two cases. 1. If p q, by Lemma 35, we have: x Rn, x q x p n 1 p 1 q x q. By Lemma 36, we know that W Rm n, |||W|||q n 1 p 1 q |||W|||p. 2. If p q, by Lemma 35, we have: x Rn, x p x q n 1 q 1 p x p, i.e. q x q x p x q. By Lemma 36, we know that W Rm n, |||W|||q n 1 q 1 p |||W|||p. So we conclude that W Rm n, |||W|||q nmax{ 1 p}|||W|||p. Now we turn to bound the Rademacher complexity of H, where H = h(x, x+, x ) = min x U(x) f(x )T f(x+) f(x ) |f F , F = {f : x Wx|W Rn m, |||W|||p w}. Generalization Bounds for Adversarial Contrastive Learning Let H0 = h(x, x+, x ) = f(x)T f(x+) f(x ) |f F = n (x, x+, x ) x T W T W(x+ x ) W Rn m, |||W|||p w o . Proof [Proof of Theorem 22] Firstly, we drop the complicated and unwieldy min operate by directly solving the minimization problem. h(x, x+, x ) = min x x r ϵf(x )T f(x+) f(x ) = min δ r ϵ (W(x + δ))T Wx+ Wx = x T W T W(x+ x ) + min δ r ϵδT W T W(x+ x ) (i)= x T W T W(x+ x ) ϵ W T W(x+ x ) r , where (i) is from Holder s Inequality and taking the value of δ that can get equality. Define: H1 = n (x, x+, x ) W T W(x+ x ) r W Rn m, |||W|||p w o , Hw 0 = n (x, x+, x ) x T A(x+ x ) A Rm m, |||A|||p w o , Hw 1 = n (x, x+, x ) A(x+ x ) r A Rm m, |||A|||p w o . Since W Rn m s.t. |||W|||p w, we have: (i) W T p |||W|||p = |||W|||p |||W|||p (ii) s(p , p, m)|||W|||2 p s(p , p, m)w2, p = 1 and (i) comes from the Submultiplicativity of matrix norm and (ii) is from Lemma 37. So we have: n W |||W|||p w o n W W T W p s(p , p, m)w2o . And since W T W is positive semi-definite, it s easy to see that: n W T W W T W p s(p , p, m)w2o n A |||A|||p s(p , p, m)w2o . So we have: H0 = n (x, x+, x ) x T W T W(x+ x ) W Rn m, |||W|||p w o n (x, x+, x ) x T W T W(x+ x ) W Rn m, W T W p s(p , p, m)w2o n (x, x+, x ) x T A(x+ x ) A Rm m, |||A|||p s(p , p, m)w2o = Hs(p ,p,m)w2 0 . Zou and Liu Similarly, H1 Hs(p ,p,m)w2 1 . Given the training set S = {(xi, x+ i , x i )}M i=1, we have: RS(H) = E σ sup |||W|||p w i=1 σi x T i W T W(x+ i x i ) ϵ W T W(x+ i x i ) r ) sup |||W|||p w i=1 σi x T i W T W(x+ i x i ) ) sup |||W|||p w σi W T W(x+ i x i ) r ) sup |||W|||p w i=1 σi x T i W T W(x+ i x i ) ) sup |||W|||p w σi W T W(x+ i x i ) r ) (iii) = RS(H0) + ϵ RS(H1) (iv) RS(Hs(p ,p,m)w2 0 ) + ϵ RS(Hs(p ,p,m)w2 1 ), where σ is a Random Vector whose elements are i.i.d. Rademacher Random Variables and (i) comes from the subadditivity of sup function;(ii) is from the fact that σi has the same distribution as σi, i = 1, 2, . . . , M; (iii) is by the definition of RS(H0) and RS(H1) and (iv) is from the monotone property of Rademacher Complexity and the fact that H0 Hs(p ,p,m)w2 0 and H1 Hs(p ,p,m)w2 1 . Secondly, we upper bound the Rademacher Complexity of Hw 0 and Hw 1 . For RS(Hw 0 ): Given the training set S = {(xi, x+ i , x i )}M i=1 = {(zi)}M i=1, define the ℓ2-norm for a function in Hw 0 as: h Hw 0 , h 2 := i=1 [h(zi)]2, Define Hw 0 (S) = n (h(z1), h(z2), . . . , h(z M)) h Hw 0 o , we have that for any h Hw 0 and vh = (h(z1), h(z2), . . . , h(z M)) be the corresponding vector in Hw 0 (S), we have: h 2 = vh 2. So we know that any δ-covering of Hw 0 ( h1, h2, , h N ) with respect to ℓ2 norm in the functional space, corresponds to a δ-covering of Hw 0 (S) with respect to the ℓ2 norm in the Euclidean Space, i.e. h1(z1) h1(z2) ... h1(z M) h2(z1) h2(z2) ... h2(z M) h N(z1) h N(z2) ... h N(z M) Generalization Bounds for Adversarial Contrastive Learning So we have: N (δ; Hw 0 (S), 2) = N (δ; Hw 0 , 2) . (A.19) By the definition of Rademacher Complexity, we know that RS(Hw 0 ) is just the expectation of the Rademacher Process with respect to Hw 0 (S), which is E[ sup θ Hw 0 (S) Xθ]. To use Lemma 20, we must show that Rademacher Process is a sub-Gaussian Process with respect to some metric ρX. Denote the Euclidean metric by ρ2, we have: for Rademacher Process {Xθ, θ T}, θ, eθ T and λ R: E h eλ(Xθ Xeθ)i (i)= E h eλ( σ,θ σ,eθ )i = E h eλ( σ,θ eθ )i (ii) h eλσi(θi eθi)i i=1 e λ2(θi eθi)2 2 PM i=1(θi eθi)2 = e λ2 2 ρ2 X(θ,eθ), where σ is a Random Vector whose elements are i.i.d. Rademacher Random Variables and (i) is from the definition of Rademacher Process; (ii) is from the expectation property of i.i.d. random variables and (iii) is from Example 2.3 in Wainwright (2019). So we proved that the Rademacher Process is a sub-Gaussian Process with respect to the Euclidean metric ρ2. So by Lemma 20 and (9), we know that: δ (0, D], RS(Hw 0 )=E[ sup θ Hw 0 (S) Xθ] E sup θ,eθ Hw 0 (S) (Xθ Xeθ) sup γ,γ Hw 0 (S) ρX (γ,γ ) δ +32J (δ/4; D), D = sup θ,θ Hw 0 (S) θ θ 2 2 sup θ Hw 0 (S) θ 2 = 2 sup h Hw 0 h 2 = 2 sup h Hw 0 i=1 [h(zi)]2 M sup |||A|||p w |x T i A(x+ i x i )| (ii) 2 M sup |||A|||p w xi p A(x+ i x i ) p M sup |||A|||p w xi p |||A|||p x+ i x i p (iv) 4 and J (a; b) = R b a p ln N (u; Hw 0 (S), 2)du = R b a p ln N (u; Hw 0 , 2)du. Where (i) is from the definition of f; (ii) is from the Holder s Inequality; (iii) is a result of properties of matrix norm and (iv) is from the definition of P and P. Similar to the discussion of upper bound Zou and Liu for D, for all h1, h2 Hw 0 , we have: i=1 [h1(zi) h2(zi)]2 M sup 1 i M |h1(zi) h2(zi)| M sup 1 i M |x T i (A1 A2)(x+ i x i )| M sup 1 i M xi p |||A1 A2|||p x+ i x i p (ii) 2 MP P|||A1 A2|||p, where A1, A2 are the matrices corresponding to h1, h2, respectively and (i) is from the same argument as (ii), (iii) in (A.20) and (ii) is from the definition of P and P. Suppose NA = A1, , AN is a δ 2PP M -covering of SA = n A |||A|||p w o with respect to ||| |||p , i.e.: A SA , Aj NA, s.t. A Aj p δ 2PP Combine this with (A.21), let A be the matrix corresponding to h and Aj be the matrix corresponding to hj and let Nh = h1, h2, , h N , we have: h Hw 0 , hj Nh, s.t. h hj 2 2 So Nh is a δ-covering of Hw 0 with respect to 2 , so we have: N(δ; Hw 0 , 2) N( δ 2PP M ; SA, ||| |||p). By Lemma 18, we know that: N(δ; Hw 0 , 2) N( δ 2PP M ; SA, ||| |||p) So we have: J (0; D) = Z D ln N (u; Hw 0 , 2)du (i) Z D v u u tm2ln M u du = 2m M (iii) 8m PP w where (i) is from (A.22); (ii) is because that for any x 0, ln(1 + x) x and (iii) is from (A.20). So take δ 0+, we have: RS(Hw 0 ) 32J (0; D) 256m PP w Generalization Bounds for Adversarial Contrastive Learning For RS(Hw 1 ): The same as RS(Hw 0 ), we consider ℓ2 norm for a function h Hw 1 and define Hw 1 (S) = n (h(z1), h(z2), . . . , h(z M)) h Hw 1 o , by similar argument in (A.19), we have: N (δ; Hw 1 (S), 2) = N (δ; Hw 1 , 2) . By the definition of Rademacher Complexity, we know that RS(Hw 1 ) is just the expectation of the Rademacher Process with respect to Hw 1 (S), which is E[ sup θ Hw 1 (S) Xθ]. By Lemma 20 and (9), we know that: δ (0, D], RS(Hw 1 )=E[ sup θ Hw 1 (S) Xθ] E sup θ,eθ Hw 1 (S) (Xθ Xeθ) sup γ,γ Hw 1 (S) ρX (γ,γ ) δ +32J (δ/4; D), D = sup θ,θ Hw 1 (S) θ θ 2 2 sup θ Hw 1 (S) θ 2 = 2 sup h Hw 1 h 2 = 2 sup h Hw 1 i=1 [h(zi)]2 M sup |||A|||p w A(x+ i x i ) r (ii) 2 M sup |||A|||p w |||A|||r x+ i x i r MR sup |||A|||p w |||A|||r (iv) 4 MR sup |||A|||p w s(r , p, m)|||A|||p 4w R s(r , p, m) (A.23) and J (a; b) = R b a p ln N (u; Hw 1 (S), 2)du = R b a p ln N (u; Hw 1 , 2)du. Where (i) is from the definition of f; (ii) is a result of the properties of matrix norm; (iii) is from the definition of R and (iv) comes from Lemma 37. Similar to the discussion of upper bound for D, for all h1, h2 Hw 1 , we have: i=1 [h1(zi) h2(zi)]2 M sup 1 i M |h1(zi) h2(zi)| M sup 1 i M (A1 A2)(x+ i x i ) r (i) M sup 1 i M |||A1 A2|||r x+ i x i r MR |||A1 A2|||r (iii) 2R s(r , p, m) M |||A1 A2|||p, (A.24) where A1, A2 are the matrices corresponding to h1, h2, respectively and (i) from the properties of matrix norm and (ii) is from the definition of R and (iii) comes from Lemma 37. Suppose NA = A1, , AN is a δ 2R s(r ,p,m) M -covering of SA = n A |||A|||p w o with respect to ||| |||p , i.e.: A SA , Aj NA, s.t. A Aj p δ 2R s(r , p, m) Zou and Liu Combine this with (A.24), let A be the matrix corresponding to h and Aj be the matrix corresponding to hj and let Nh = h1, h2, , h N , we have: h Hw 1 , hj Nh, s.t. h hj 2 2R s(r , p, m) M δ 2R s(r , p, m) So Nh is a δ-covering of Hw 1 with respect to 2 , so we have: N(δ; Hw 1 , 2) N( δ 2R s(r , p, m) M ; SA, ||| |||p). By Lemma 18, we know that: N(δ; Hw 1 , 2) N( δ 2R s(r , p, m) M ; SA, ||| |||p) 1 + 4R s(r , p, m)w (A.25) So we have: J (0; D) = Z D ln N (u; Hw 1 , 2)du (i) Z D v u u tm2ln 1 + 4R s(r , p, m)w 4R s(r , p, m)w M u du = 2m p R s(r , p, m)w 4 R s(r , p, m)w D 4 M (iii) 8m R s(r , p, m)w where (i) is from (A.25); (ii) is because that for any x 0, ln(1 + x) x and (iii) is from (A.23). So take δ 0+, we have: RS(Hw 1 ) 32J (0; D) 256m R s(r , p, m)w Combine upper bounds of RS(Hw 0 ) and RS(Hw 1 ) with (A.18), we have: RS(H) RS(Hs(p ,p,m)w2 0 ) + ϵ RS(Hs(p ,p,m)w2 1 ) 256m PP s(p , p, m)w2 M + ϵ 256m R s(r , p, m)s(p , p, m)w2 = 256m s(p , p, m)w2 M (PP + ϵR s(r , p, m)) = O (PP + ϵR s(r , p, m)) m s(p , p, m)w2 Generalization Bounds for Adversarial Contrastive Learning A.9 Proof of Theorem 24 Theorem 24 Let U(x) = {x | x x p ϵ} (i.e. consider the ℓp attack), σ(0) = 0 with Lipschitz constant L and let F = Wdσ(Wd 1σ( σ(W1x))) |||Wl|||F MF l , l =1, . . . , d . We then have: l=1 hlhl 1K where K = 2BF X,ϵ BF X+ + BF X , where BF X,ϵ =Ld 1 d Y l=1 MF l max n 1, m 1 2 1 p o ( X p, +ϵ) , BF X =Ld 1 d Y l=1 MF l max n 1, m 1 2 1 Before giving the proof, we firstly introduce some useful lemmas. Lemma 39 If x i U(xi) = x i xi x i p ϵ , then for 1 r + 1 r = 1, we have: x i r max n 1, m1 1 p o ( X p, + ϵ) . Proof [Proof of Lemma 39] We divide it into two cases. 1. If p r , then by Holder s Inequality with 1 r = 1 s, we have: x i r sup 1 s x i p = 1 s x i p = m 1 s x i p = m1 1 where the equality holds when all the entries are equal. 2. If p < r , by Lemma 35, we have x i r x i p, where the equality holds when one of the entries of x i equals to one and the others equal to zero. Then we have: x i r max n 1, m1 1 p o x i p max n 1, m1 1 p o ( xi p + xi x i p) max n 1, m1 1 p o ( X p, + ϵ) . Lemma 40 Let A Rm n, b Rn, then we have: A b 2 |||A|||F b 2. Proof [Proof of Lemma 40] Let Ai be the rows of A, i = 1, 2, . . . , m, we have: i=1 (Aib)2 (i) i=1 Ai 2 2 b 2 2 = i=1 Ai 2 2 q b 2 2 = |||A|||F b 2, where (i) is from Holder s Inequality. Zou and Liu Lemma 41 Suppose σ is a L-Lipschitz function, then the elementwise vector map corresponding to σ is also L-Lipschitz with respect to 2. Proof [Proof of Lemma 41] σ(x) σ(y) 2 = i=1 (σ(x)i σ(y)i)2 = i=1 (σ(xi) σ(yi))2 i=1 L2(xi yi)2 = L i=1 (xi yi)2 = L x y 2, where (i) is because σ is L-Lipschitz. Now we can turn to the proof of Theorem 24. Proof [Proof of Theorem 24] In this case, let U(x) = x x x p ϵ , we have: F = x Wdσ(Wd 1σ( σ(W1x))) |||Wl|||F MF l , l = 1, . . . , d , H = h(x, x+, x ) = min x U(x) f(x )T f(x+) f(x ) |f F . Let SF l = Wl Rhl hl 1 |||Wl|||F MF l , l = 1, 2, . . . , d. Let CF l be the δl-covering of SF l and define: Fc = n fc : x W c dσ W c d 1σ ( σ(W c 1x)) W c l CF l , l = 1, 2, . . . , d o F, Hc = hc(x, x+, x ) = min x U(x) f(x )T f(x+) f(x ) |f Fc H. Similar to the proof of Theorem 22, we know that the Rademacher Process is a sub-Gaussian Process with respect to the Euclidean metric, which induces the ℓ2 norm. Given the training set S = {(xi, x+ i , x i )}M i=1 {zi}M i=1, define the ℓ2-norm for a function in H as: h H, h 2 := i=1 [h(zi)]2. Define H(S) = n (h(z1), h(z2), . . . , h(z M)) h H o , we have that for any h H and vh = (h(z1), h(z2), . . . , h(z M)) be the corresponding vector in H(S), we have h 2 = vh 2. So we know that any δ-covering of H ( h1, h2, , h N ) with respect to ℓ2 norm in the functional space, corresponds to a δ-covering of H(S) with respect to the ℓ2 norm in the Euclidean Space, i.e. h1(z1) h1(z2) ... h1(z M) h2(z1) h2(z2) ... h2(z M) h N(z1) h N(z2) ... h N(z M) Generalization Bounds for Adversarial Contrastive Learning So we have: N (δ; H(S), 2) = N (δ; H, 2) . By the definition of Rademacher Complexity, we know that RS(H) is just the expectation of the Rademacher Process with respect to H(S), which is E[ sup θ H(S) Xθ]. So by Lemma 20 and (9), we know that: δ (0, D]: RS(H)=E[ sup θ H(S) Xθ] E sup θ,eθ H(S) (Xθ Xeθ) sup γ,γ H(S) γ γ 2 δ + 32J (δ/4; D), D= sup θ,θ H(S) θ θ 2 2 sup θ H(S) θ 2 =2sup h H h 2 =2sup h H i=1 [h(zi)]2 2 M sup h H 1 i M (A.26) and J (a; b) = R b a p ln N (u; H(S), 2)du = R b a p ln N (u; H, 2)du. For any f F, x X, let xl be the output of x passing through the first l 1 layers, we have: f(x) 2 = Wdσ(Wd 1xd 1) 2 (i) |||Wd|||F σ(Wd 1xd 1) 2 (ii) = |||Wd|||F σ(Wd 1xd 1) σ(0) 2 (iii) LMF d Wd 1xd 1 2 Ld 1 d Y l=2 MF l W1x 2 Ld 1 d Y l=1 MF l x 2 (iv) Ld 1 d Y l=1 MF l max n 1, m 1 2 1 p o x p Ld 1 d Y l=1 MF l max n 1, m 1 2 1 where (i) is from Lemma 40; (ii) is from the fact that σ(0) = 0; (iii) comes from the assumption that σ is L-Lipschitz and |||Wd|||F MF d and (iv) is attained by setting r = r = 2 in the proof of Lemma 39. To simplify the notations, we define: BF X,ϵ =Ld 1 d Y l=1 MF l max n 1, m 1 2 1 p o ( X p, +ϵ) , BF X =Ld 1 d Y l=1 MF l max n 1, m 1 2 1 So we have: x X, f F, f(x) 2 BF X. (A.27) Similarly, we have: x X, f F, x x p ϵ, f(x ) 2 BF X,ϵ. (A.28) Zou and Liu For any h H, z X 3, let x = arg min x x p ϵ f(x )T (f(x+) f(x )) and let xl be the output of x passing through the first l 1 layers, we have: |h(z)| = | inf x x p ϵf(x )T f(x+) f(x ) | = |f(x )T f(x+) f(x ) | f(x ) 2 f(x+) f(x ) 2 (i) BF X,ϵ (BF X+ + BF X ), where (i) is from (A.27) and (A.28). So we have: M sup h H 1 i M MBF X,ϵ (BF X+ + BF X ) where (i) is from (A.26). Now, we need to find the smallest distance between H and Hc, i.e. sup h H inf hc Hc h hc 2. By the discussion in (A.26), we have h hc 2 M max 1 i M|h(zi) hc(zi)|. For any zi = (xi, x+ i , x i ), i = 1, 2, . . . , M, given h and hc such that |||Wl W c l |||F δl, l = 1, 2, . . . , d, we have: |h(zi) hc(zi)| = inf x i xi p ϵf(x i)T f(x+ i ) f(x i ) inf x i xi p ϵfc(x i)T fc(x+ i ) fc(x i ) . Let x i = arg inf x i xi p ϵ f(x i)T f(x+ i ) f(x i ) and xc i = arg inf x i xi p ϵ fc(x i)T fc(x+ i ) fc(x i ) , yi = xc i f(x i )T f(x+ i ) f(x i ) fc(xc i)T fc(x+ i ) fc(x i ) x i otherwise . Then we have: |h(zi) hc(zi)| = |f(x i )T f(x+ i ) f(x i ) fc(xc i)T fc(x+ i ) fc(x i ) | (i) |f(yi)T f(x+ i ) f(x i ) fc(yi)T fc(x+ i ) fc(x i ) | = |f(yi)T f(x+ i ) f(x i ) fc(yi)T f(x+ i ) f(x i ) + fc(yi)T f(x+ i ) f(x i ) fc(yi)T fc(x+ i ) fc(x i ) | (ii) | (f(yi) fc(yi))T f(x+ i ) f(x i ) | + |fc(yi)T f(x+ i ) fc(x+ i ) | + |fc(yi)T f(x i ) fc(x i ) | (iii) (BF X+ + BF X ) f(yi) fc(yi) 2 + BF X,ϵ f(x+ i ) fc(x+ i ) 2 + BF X,ϵ f(x i ) fc(x i ) 2, (A.30) Generalization Bounds for Adversarial Contrastive Learning where (i) is easily verified by the definition of yi; (ii) is from the triangle inequality and (iii) is from (A.27) and (A.28). Define ga b ( ) as: ga b (y) = Wbσ (Wb 1σ ( Wa+1σ (W c a σ(W c 1y)))) . Then we have: f(yi) fc(yi) 2 = g0 d(yi) gd d(yi) 2 = g0 d(yi) g1 d(yi) + g1 d(yi) g2 d(yi) + + gd 1 d (yi) gd d(yi) 2 (i) g0 d(yi) g1 d(yi) 2 + + gd 1 d (yi) gd d(yi) 2, where (i) is from the triangle inequality. Then we calculate gl 1 d (yi) gl d(yi) 2, l = 1, 2, . . . , d: gl 1 d (yi) gl d(yi) 2 = Wdσ gl 1 d 1(yi) Wdσ gl d 1(yi) 2 (i) |||Wd|||F σ gl 1 d 1(yi) σ gl d 1(yi) 2 (ii) LMF d gl 1 d 1(yi) gl d 1(yi) 2 j=l+1 MF j Wlσ gl 1 l 1(yi) W c l σ gl 1 l 1(yi) 2 j=l+1 MF j (Wl W c l ) σ gl 1 l 1(yi) 2 j=l+1 MF j δl σ gl 1 l 1(yi) 2, where (i) is from Lemma 40; (ii) comes from the assumption that σ is L-Lipschitz and |||Wd|||F MF d ; (iii) is from the definition of ga b ( ) and (iv) is from Lemma 40 and the choice of hc when h is fixed, which means that |||Wl W c l |||F δl. Next, we upper bound σ gl 1 l 1(yi) 2: σ gl 1 l 1(yi) 2 = σ gl 1 l 1(yi) σ(0) 2 (i) L gl 1 l 1(yi) 2 = L W c l 1σ gl 2 l 2(yi) 2 (ii) L W c l 1 F σ gl 2 l 2(yi) 2 (iii) L MF l 1 σ gl 2 l 2(yi) 2 j=1 MF j yi 2, Zou and Liu where (i) is because σ is L-Lipschitz; (ii) is from Lemma 40 and (iii) is because W c l 1 F MF l 1. From (A.31) and (A.32) we have: gl 1 d (yi) gl d(yi) 2 Ld 1 Qd j=1 MF j MF l δl yi 2 (i) Ld 1 Qd j=1 MF j MF l δl max n 1, m 1 2 1 p o ( X p, + ϵ) = BF X,ϵ δl MF l , where (i) is from Lemma 39. Similarly: gl 1 d (x+ i ) gl d(x+ i ) 2 Ld 1 Qd j=1 MF j MF l δl x+ i 2 Ld 1 Qd j=1 MF j MF l δl max n 1, m 1 2 1 p o X+ p, = BF X+ δl MF l , gl 1 d (x i ) gl d(x i ) 2 Ld 1 Qd j=1 MF j MF l δl x i 2 Ld 1 Qd j=1 MF j MF l δl max n 1, m 1 2 1 p o X p, = BF X δl MF l . Combining the above with (A.30) yields: |h(zi) hc(zi)| BF X+ + BF X g0 d(yi) g1 d(yi) 2 + + gd 1 d (yi) gd d(yi) 2 + BF X,ϵ g0 d(x+ i ) g1 d(x+ i ) 2 + + gd 1 d (x+ i ) gd d(x+ i ) 2 + BF X,ϵ g0 d(x i ) g1 d(x i ) 2 + + gd 1 d (x i ) gd d(x i ) 2 = (BF X+ + BF X )BF X,ϵ δl MF l + BF X,ϵBF X+ δl MF l + BF X,ϵBF X = 2BF X,ϵ(BF X+ + BF X ) δl MF l = K So h hc 2 = M max 1 i M|h(zi) hc(zi)| M Pd l=1 Kδl MF l . Let δl = MF l δ d K M , we have: K MF l MF l δ Then: h H, hc Hc s.t. h hc 2 δ, which means that sup h H inf hc Hc h hc 2 δ when choosing δl = MF l δ d K Generalization Bounds for Adversarial Contrastive Learning So Hc is a δ-covering of H, and N(δ; H, 2) |Hc| = Qd l=1 |CF l |. By Lemma 18 we know that |CF l | = N( MF l δ d K M ; SF l , ||| |||F ) 1 + 2d K M δ hl hl 1. So we have: N(δ; H, 2) |Hc| = l=1 |CF l | !Pd l=1 hl hl 1 . (A.33) So we can conclude that: J (0; D) = Z D ln N (u; H, 2)du (i) Z D l=1 hl hl 1 l=1 hl hl 1 v u u t2d K l=1 hl hl 1 4 v u u t2d KD l=1 hl hl 1 4 l=1 hl hl 1 where (i) is from (A.33); (ii) comes from the fact that ln(1 + x) x, x 0 and (iii) comes from (A.29). Since we shows that RS(H) 2E sup γ,γ H(S) γ γ 2 δ + 32J (δ/4; D) before, take δ 0+, we have: RS(H) 32J (0; D) 64 l=1 hl hl 1 l=1 hl hl 1 A.10 Proof of Theorem 26 Theorem 26 Let U(x) = {x | x x p ϵ} (i.e. consider the ℓp attack), σ(0) = 0 with Lip- schitz constant L; moreover, let F = n Wdσ(Wd 1σ( σ(W1x))) Wl 1, M1, l , l=1, . . . ,d o . We then have: l=1 hlhl 1 p K0 = 2B1, X,ϵ B X+ + B X , K1 = K0 2 + B X,ϵ B1, X+ + B1, X , Zou and Liu B X,ϵ = Ld 1 d Y l=1 hl M1, l m1 1 p ( X p, + ϵ) , B X = Ld 1 d Y l=1 hl M1, l m1 1 B1, X,ϵ = Ld 1 d Y l=1 M1, l ( X p, + ϵ) , B1, X = Ld 1 d Y l=1 M1, l X p, . Before giving the proof, we firstly introduce some useful lemmas. Lemma 42 Let A Rm n, b Rn, then we have: A b A 1, b . Proof [Proof of Lemma 42] Let Ai be the rows of A, i = 1, 2, . . . , m, we have: A b = max 1 i m|Aib| (i) max 1 i m ( Ai 1 b ) = A 1, b , where (i) is from the Holder s Inequality. Lemma 43 Let A Rm n, b Rn, then we have: A b 1 A ,1 b 1 m A 1, b 1. Proof [Proof of Lemma 43] Let Ai be the rows of A, i = 1, 2, . . . , m, we have: i=1 |Aib| (i) i=1 ( Ai b 1) = A ,1 b 1, where (i) is from the Holder s Inequality. And we have: A ,1 = ( A1 , , Am ) 1 (i) ( A1 1, , Am 1) 1 (ii) m ( A1 1, , Am 1) = m A 1, , where (i) is from the fact that x x 1 and (ii) is the from the fact that for all x Rm, x 1 m x . So we have: A b 1 A ,1 b 1 m A 1, b 1. Lemma 44 Suppose σ is a L-Lipschitz function, then the elementwise vector map corresponding to σ is also L-Lipschitz with respect to . Proof [Proof of Lemma 44] σ(x) σ(y) = max 1 i n|σ(x)i σ(y)i| = max 1 i n|σ(xi) σ(yi)| (i) max 1 i n L|xi yi| = L x y , Generalization Bounds for Adversarial Contrastive Learning where (i) is because σ is L-Lipschitz. Now we can turn to the proof of Theorem 26. Proof [Proof of Theorem 26] In this case, let U(x) = x x x p ϵ , we have: F = n x Wdσ(Wd 1σ( σ(W1x))) Wl 1, M1, l , l = 1, . . . , d o , H = h(x, x+, x ) = min x U(x) f(x )T f(x+) f(x ) |f F . Let S1, l = n Wl Rhl hl 1 Wl 1, M1, l o , l = 1, 2, . . . , d. Let C1, l be the δl-covering of S1, l and define: Fc = n fc : x W c dσ W c d 1σ ( σ(W c 1x)) W c l C1, l , l = 1, 2, . . . , d o F, Hc = hc(x, x+, x ) = min x U(x) f(x )T f(x+) f(x ) |f Fc H. Similar to the proof of Theorem 22, we know that the Rademacher Process is a sub-Gaussian Process with respect to the Euclidean metric, which induces the ℓ2 norm. Similar to the proof of Theorem 24, given the training set S = {(xi, x+ i , x i )}M i=1 {zi}M i=1, define the ℓ2-norm for a function in H as: h H, h 2 := i=1 [h(zi)]2. Define H(S) = n (h(z1), h(z2), . . . , h(z M)) h H o , with the same argument as in proof of Theorem 24, we have: N (δ; H(S), 2) = N (δ; H, 2) , and RS(H) = E[ sup θ H(S) Xθ]. So by Lemma 20 and (9), we know that: δ (0, D]: RS(H)=E[ sup θ H(S) Xθ] E sup θ,eθ H(S) (Xθ Xeθ) sup γ,γ H(S) γ γ 2 δ +32J (δ/4; D), D= sup θ,θ H(S) θ θ 2 2 sup θ H(S) θ 2 =2sup h H h 2 =2sup h H i=1 [h(zi)]2 2 M sup h H 1 i M (A.34) and J (a; b) = R b a p ln N (u; H(S), 2)du = R b a p ln N (u; H, 2)du Zou and Liu Then for any f F, x X, let xl be the output of x passing through the first l 1 layers, we have: f(x) = Wdσ(Wd 1xd 1) (i) Wd 1, σ(Wd 1xd 1) (ii) = Wd 1, σ(Wd 1xd 1) σ(0) (iii) LM1, d Wd 1xd 1 l=2 M1, l W1x Ld 1 d Y l=1 M1, l x (iv) Ld 1 d Y l=1 M1, l max 1, m p x p l=1 M1, l x p Ld 1 d Y l=1 M1, l X p, , where (i) is from Lemma 42; (ii) is from the fact that σ(0) = 0; (iii) comes from the assumption that σ is L-Lipschitz and |||Wd|||1, M1, d and (iv) is attained by setting r = 1, r = in the proof of Lemma 39. To simplify the notations, we define: B1, X = Ld 1 l=1 M1, l X p, , B X = Ld 1 l=1 hl M1, l B1, X,ϵ = Ld 1 l=1 M1, l ( X p, + ϵ) , B X,ϵ = Ld 1 l=1 hl M1, l p ( X p, + ϵ) . So: x X, f F, f(x) B1, X . Similarly, we have: x X, f F, x x p ϵ, f(x ) B1, X,ϵ . (A.35) f(x) 1 = Wdσ(Wd 1xd 1) 1 (i) hd Wd 1, σ(Wd 1xd 1) 1 (ii) = hd Wd 1, σ(Wd 1xd 1) σ(0) 1 (iii) hd L M1, d Wd 1xd 1 1 l=2 hl M1, l W1x 1 Ld 1 d Y l=1 hl M1, l x 1 (iv) Ld 1 d Y l=1 hl M1, l max n 1, m1 1 l=1 hl M1, l p x p Ld 1 d Y l=1 hl M1, l Generalization Bounds for Adversarial Contrastive Learning where (i) is from Lemma 43; (ii) is from the fact that σ(0) = 0; (iii) comes from the assumption that σ is L-Lipschitz and |||Wd|||1, M1, d and (iv) is attained by setting r = , r = 1 in the proof of Lemma 39. So we know that: x X, f F, f(x) 1 B X. (A.36) Similarly, we have: x X, f F, x x p ϵ, f(x ) 1 B X,ϵ. (A.37) For any h H, z X 3, let x = arg min x x p ϵ f(x )T (f(x+) f(x )) and let xl be the output of x passing through the first l 1 layers, we have: |h(z)| = | inf x x p ϵf(x )T f(x+) f(x ) | = |f(x )T f(x+) f(x ) | f(x ) f(x+) f(x ) 1 (i) B1, X,ϵ (B X+ + B X ), where (i) is from (A.35) and (A.36). So we get: M sup h H 1 i M MB1, X,ϵ (B X+ + B X ) MK0, (A.38) where (i) is from (A.34). Now, we need to find the smallest distance between H and Hc, i.e. sup h H inf hc Hc h hc 2. By the discussion in (A.34), we have h hc 2 M max 1 i M|h(zi) hc(zi)|. For any zi = (xi, x+ i , x i ), i = 1, 2, . . . , M, given h and hc such that Wl W c l 1, δl, l = 1, 2, . . . , d, we have: |h(zi) hc(zi)| = inf x i xi p ϵf(x i)T f(x+ i ) f(x i ) inf x i xi p ϵfc(x i)T fc(x+ i ) fc(x i ) . Let x i = arg inf x i xi p ϵ f(x i)T f(x+ i ) f(x i ) and xc i = arg inf x i xi p ϵ fc(x i)T fc(x+ i ) fc(x i ) , yi = xc i f(x i )T f(x+ i ) f(x i ) fc(xc i)T fc(x+ i ) fc(x i ) x i otherwise . Zou and Liu Then we have: |h(zi) hc(zi)| = |f(x i )T f(x+ i ) f(x i ) fc(xc i)T fc(x+ i ) fc(x i ) | (i) |f(yi)T f(x+ i ) f(x i ) fc(yi)T fc(x+ i ) fc(x i ) | = |f(yi)T f(x+ i ) f(x i ) fc(yi)T f(x+ i ) f(x i ) + fc(yi)T f(x+ i ) f(x i ) fc(yi)T fc(x+ i ) fc(x i ) | (ii) | (f(yi) fc(yi))T f(x+ i ) f(x i ) | + |fc(yi)T f(x+ i ) fc(x+ i ) | + |fc(yi)T f(x i ) fc(x i ) | (iii) (B X+ + B X ) f(yi) fc(yi) + B X,ϵ f(x+ i ) fc(x+ i ) + B X,ϵ f(x i ) fc(x i ) , (A.39) where (i) is easily verified by the definition of yi; (ii) is from the triangle inequality and (iii) is from (A.36) and (A.37). Again we define ga b ( ) as: ga b (y) = Wbσ (Wb 1σ ( Wa+1σ (W c a σ(W c 1y)))) . f(yi) fc(yi) = g0 d(yi) gd d(yi) = g0 d(yi) g1 d(yi) + g1 d(yi) g2 d(yi) + + gd 1 d (yi) gd d(yi) (i) g0 d(yi) g1 d(yi) + + gd 1 d (yi) gd d(yi) , where (i) is from the triangle inequality. Then we calculate gl 1 d (yi) gl d(yi) , l = 1, 2, . . . , d: gl 1 d (yi) gl d(yi) = Wdσ gl 1 d 1(yi) Wdσ gl d 1(yi) (i) Wd 1, σ gl 1 d 1(yi) σ gl d 1(yi) (ii) LM1, d gl 1 d 1(yi) gl d 1(yi) j=l+1 M1, j Wlσ gl 1 l 1(yi) W c l σ gl 1 l 1(yi) j=l+1 M1, j (Wl W c l ) σ gl 1 l 1(yi) j=l+1 M1, j δl σ gl 1 l 1(yi) , Generalization Bounds for Adversarial Contrastive Learning where (i) is from Lemma 42; (ii) comes from the assumption that σ is L-Lipschitz and Wd 1, M1, d ; (iii) is from the definition of ga b ( ) and (iv) is from Lemma 42 and the choice of hc when h is fixed, which means that Wl W c l 1, δl. Next, we upper bound σ gl 1 l 1(yi) : σ gl 1 l 1(yi) = σ gl 1 l 1(yi) σ(0) (i) L gl 1 l 1(yi) = L W c l 1σ gl 2 l 2(yi) (ii) L W c l 1 1, σ gl 2 l 2(yi) (iii) L M1, l 1 σ gl 2 l 2(yi) Ll 1 j=1 M1, j yi , where (i) is because σ is L-Lipschitz; (ii) is from Lemma 42 and (iii) is because W c l 1 1, M1, l 1 . (A.40) and (A.41) show that: gl 1 d (yi) gl d(yi) Ld 1 Qd j=1 M1, j M1, l δl yi (i) Ld 1 Qd j=1 M1, j M1, l δl ( X p, + ϵ) = B1, X,ϵ δl M1, l , where (i) is from Lemma 39. Similarly, we have: gl 1 d (x+ i ) gl d(x+ i ) Ld 1 Qd j=1 M1, j M1, l δl x+ i Ld 1 Qd j=1 M1, j M1, l δl X+ p, = B1, X+ δl M1, l , gl 1 d (x i ) gl d(x i ) Ld 1 Qd j=1 M1, j M1, l δl x i Ld 1 Qd j=1 M1, j M1, l δl X p, = B1, X δl M1, l . Zou and Liu Combine the above with (A.39): |h(zi) hc(zi)| B X+ + B X g0 d(yi) g1 d(yi) + + gd 1 d (yi) gd d(yi) + B X,ϵ g0 d(x+ i ) g1 d(x+ i ) + + gd 1 d (x+ i ) gd d(x+ i ) + B X,ϵ g0 d(x i ) g1 d(x i ) + + gd 1 d (x i ) gd d(x i ) = (B X+ + B X )B1, X,ϵ δl M1, l + B X,ϵB1, X+ δl M1, l + B X,ϵB1, X = h B1, X,ϵ (B X+ + B X ) + B X,ϵ(B1, X+ + B1, X ) i d X δl M1, l K1 So h hc 2 = M max 1 i M|h(zi) hc(zi)| M Pd l=1 K1δl M1, l . Let δl = M1, l δ d K1 K1 M1, l M1, l δ which means that: h H, hc Hc s.t. h hc 2 δ, so we have sup h H inf hc Hc h hc 2 δ when choosing δl = M1, l δ d K1 So Hc is a δ-covering of H, and N(δ; H, 2) |Hc| = Qd l=1 |C1, l |. By Lemma 18 we know that: |C1, l | = N( M1, l δ d K1 M ; S1, l , 1, ) 1 + 2d K1 M δ hl hl 1. This means: N(δ; H, 2) |Hc| = l=1 |C1, l | !Pd l=1 hl hl 1 . (A.42) So we can conclude that: J (0; D) = Z D ln N (u; H, 2)du (i) Z D l=1 hl hl 1 l=1 hl hl 1 v u u t2d K1 l=1 hl hl 1 4 v u u t2d K1D l=1 hl hl 1 4 l=1 hl hl 1 p where (i) is from (A.42); (ii) comes from the fact that ln(1 + x) x, x 0 and (iii) comes from (A.38). Generalization Bounds for Adversarial Contrastive Learning 1 2 4 6 8 10 block size clean accuracy(%) pgd-0.01 pgd-0.02 fgsm-0.01 fgsm-0.02 (a) Influence on clean accuracy 1 2 4 6 8 10 block size adv accuracy(%) pgd-0.01 pgd-0.02 fgsm-0.01 fgsm-0.02 (b) Influence on adversarial accuracy Figure B.1: The effect of block size on the accuracy. In the figure, we show the clean accuracy and the adversarial accuracy of the mean classifier under PGD and FGSM attack with ϵ = 0.01 and ϵ = 0.02. The block size is choosen from {1, 2, 4, 6, 8, 10}. (a) The influence on the clean accuracy; (b) The influence on the adversarial accuracy. Since we shows that RS(H) 2E sup γ,γ H(S) γ γ 2 δ + 32J (δ/4; D) before, take δ 0+, we have: RS(H) 32J (0; D) 64 l=1 hl hl 1 p l=1 hl hl 1 p Appendix B. Extra Experimental Results In this section, we present our experimental results for CIFAR-100. The basic settings are the same as 6. B.1 Improvement from Regularizer Table B.1 shows that the F-norm regularizer helps to improve the adversarial robustness of the model, which agrees with our Theorem 24. Zou and Liu Attack ϵ Type λ 0 0.002 0.005 0.01 0.02 PGD 0.01 Clean 84.07 85.65 85.48 85.52 58.53 Adv 78.49 79.55 79.39 79.41 79.44 0.02 Clean 81.87 81.91 81.93 82.08 81.97 Adv 73.01 73.05 73.07 73.16 73.07 FGSM 0.01 Clean 84.61 84.53 84.95 84.58 84.55 Adv 78.70 78.64 79.18 78.70 78.63 0.02 Clean 80.47 80.48 80.39 80.44 82.09 Adv 72.45 72.45 72.36 72.41 73.87 Table B.1: Results of experiments on the regularizer on data set CIFAR-100. In this table, we list the clean accuracy (Clean) and adversarial accuracy (Adv) of the mean classifier under the PGD and FGSM attack with ϵ = 0.01 and ϵ = 0.02. λ is chosen from {0, 0.002, 0.005, 0.01, 0.02}, and λ = 0 indicates no regularizer. B.2 Effect of Block Size Figure B.1 records the influence of block size on the clean accuracy and the adversarial accuracy of the model, from which we can see that a larger block size will yield better adversarial accuracy. Pranjal Awasthi, Natalie Frank, and Mehryar Mohri. Adversarial learning guarantees for linear hypotheses and neural networks. In ICML, volume 119, pages 431 441, 2020. Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In SP, pages 39 57, 2017. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In ICML, volume 119, pages 1597 1607, 2020. Qingyi Gao and Xiao Wang. Theoretical investigation of generalization bounds for adversarial learning of deep neural networks. Journal of Statistical Theory and Practice, 15(2):51, 2021. Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018. Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015. Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9726 9735, 2020. Generalization Bounds for Adversarial Contrastive Learning Chih-Hui Ho and Nuno Vasconcelos. Contrastive learning with adversarial examples. In Neur IPS, 2020. Ziyu Jiang, Tianlong Chen, Ting Chen, and Zhangyang Wang. Robust pre-training by adversarial contrastive learning. In Neur IPS, 2020. Minseon Kim, Jihoon Tack, and Sung Ju Hwang. Adversarial self-supervised contrastive learning. In Neur IPS, 2020. Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009. Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In ICLR, 2017. Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, 2013. Boqi Li and Weiwei Liu. Wat: Improve the worst-class robustness in adversarial training, 2023. Xiyuan Li, Xin Zou, and Weiwei Liu. Defending against adversarial attacks via neural dynamic system. In Neur IPS, 2022. Xinsong Ma, Zekai Wang, and Weiwei Liu. On the tradeoffbetween robustness and fairness. In Neur IPS, 2022. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018. Vitaly Maiorov and Allan Pinkus. Lower bounds for approximation by MLP neural networks. Neurocomputing, 25(1-3):81 91, 1999. Chengzhi Mao, Amogh Gupta, Vikram Nitin, Baishakhi Ray, Shuran Song, Junfeng Yang, and Carl Vondrick. Multitask learning strengthens adversarial robustness. In ECCV, pages 158 174, 2020. Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. Adaptive computation and machine learning. MIT Press, 2012. Omar Montasser, Steve Hanneke, and Nathan Srebro. VC classes are adversarially robustly learnable, but only improperly. In COLT, pages 2512 2530, 2019. Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In CVPR, pages 2574 2582, 2016. Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In COLT, volume 40, pages 1376 1401, 2015. Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, volume 9910, pages 69 84, 2016. Zou and Liu Kento Nozawa, Pascal Germain, and Benjamin Guedj. Pac-bayesian contrastive unsupervised representation learning. In UAI, volume 124, pages 21 30, 2020. Nikunj Saunshi, Orestis Plevrakis, Sanjeev Arora, Mikhail Khodak, and Hrishikesh Khandeparkar. A theoretical analysis of contrastive unsupervised representation learning. In ICML, volume 97, pages 5628 5637, 2019. Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In Neur IPS, pages 5019 5031, 2018. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014. Martin J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019. Zekai Wang and Weiwei Liu. Robustness verification for contrastive learning. In ICML, pages 22865 22883, 2022. Jingyuan Xu and Weiwei Liu. On robust multiclass learnability. In Neur IPS, 2022. Dong Yin, Kannan Ramchandran, and Peter L. Bartlett. Rademacher complexity for adversarially robust generalization. In ICML, volume 97, pages 7085 7094, 2019. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017. Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P. Xing, Laurent El Ghaoui, and Michael I. Jordan. Theoretically principled trade-offbetween robustness and accuracy. In ICML, volume 97, pages 7472 7482, 2019.