# generalization_bounds_for_adversarial_contrastive_learning__a1bbc19f.pdf

Journal of Machine Learning Research 24 (2023) 1-54 Submitted 7/22; Revised 12/22; Published 2/23

Generalization Bounds for Adversarial Contrastive Learning

Xin Zou zouxin2021@gmail.com School of Computer Science Wuhan University Wuhan, Hubei, China

Weiwei Liu liuweiwei863@gmail.com School of Computer Science Wuhan University Wuhan, Hubei, China

Editor: Joan Bruna

Deep networks are well-known to be fragile to adversarial attacks, and adversarial training is one of the most popular methods used to train a robust model. To take advantage of unlabeled data, recent works have applied adversarial training to contrastive learning (Adversarial Contrastive Learning; ACL for short) and obtain promising robust performance. However, the theory of ACL is not well understood. To ﬁll this gap, we leverage the Rademacher complexity to analyze the generalization performance of ACL, with a particular focus on linear models and multi-layer neural networks under ℓp attack (p 1). Our theory shows that the average adversarial risk of the downstream tasks can be upper bounded by the adversarial unsupervised risk of the upstream task. The experimental results validate our theory.

Keywords: Robustness, Adversarial learning, Contrastive learning, Rademacher complexity, Generalization bound.

1. Introduction

Deep neural networks (DNNs) have achieved state-of-the-art performance in many ﬁelds. However, several prior works (Szegedy et al., 2014; Goodfellow et al., 2015) have shown that DNNs may be vulnerable to imperceptibly changed adversarial examples, which causes a lot of focus on the robustness of the models (Madry et al., 2018; Mao et al., 2020; Li et al., 2022). One of the most popular approaches to achieving adversarial robustness is adversarial training, which involves training the model with samples perturbed to maximize the loss on the target model (Goodfellow et al., 2015; Madry et al., 2018; Zhang et al., 2019). Schmidt et al. (2018) show that adversarial robust generalization requires a larger amount of data, while Yin et al. (2019); Awasthi et al. (2020) show that the Rademacher Complexity of adversarial training is strictly larger in theory than that of natural training, which implies that we need more data for adversarial training.

. Corresponding Author.

c 2023 Xin Zou and Weiwei Liu.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-0866.html.

Zou and Liu

Since the labeled data is limited and expensive to obtain, one option would be to use large-scale unlabeled data and apply self-supervised learning (Gidaris et al., 2018; Noroozi and Favaro, 2016), an approach that trains the model on unlabeled data in a supervised manner by utilizing self-generated labels from the data itself. Contrastive Learning (CL) (Chen et al., 2020; He et al., 2020), which aims to maximize the feature similarity of similar pairs and minimize the feature similarity of dissimilar ones, is a popular self-supervised learning technique. Recently, Kim et al. (2020); Ho and Vasconcelos (2020); Jiang et al. (2020) apply adversarial training in CL and achieve state-of-the-art model robustness. They ﬁnd that if adversarial training is conducted on the upstream contrastive task, the trained model will be robust on the downstream supervised task. However, their results are all empirical and lack theoretical analysis. To ﬁll this gap, we here present a theoretical analysis of adversarial contrastive learning through the lens of Rademacher complexity, with a particular focus on linear models and multi-layer neural networks. Our theoretical results show that the average adversarial risk of the downstream tasks can be upper bounded by the adversarial unsupervised risk of the upstream task; this implies that if we train a robust feature extractor on the upstream task, we can obtain a model that is robust on downstream tasks. The remainder of this article is structured as follows: 2 introduces some related works in this ﬁeld. 3 provides some basic deﬁnitions and settings that will be used in the following sections. 4 presents our ﬁrst part s main results, which show a connection between the robust risk for the upstream task and the robust risk for the downstream tasks. 5 shows our second part s results, it outlines our bounds for linear models and multi-layer neural networks by bounding the Rademacher complexity of each hypothesis class. 6 shows some experimental results that verify our theory; ﬁnally, the conclusions are presented in the last section.

2. Related Work

Adversarial Robustness. Szegedy et al. (2014) show that DNNs are fragile to imperceptible distortions in the input space. Subsequently, Goodfellow et al. (2015) propose the fast gradient sign method (FGSM), which perturbs a target sample towards its gradient direction to increase the loss and then uses the generated sample to train the model in order to improve the robustness. Following this line of research, Madry et al. (2018); Moosavi-Dezfooli et al. (2016); Kurakin et al. (2017); Carlini and Wagner (2017) propose iterative variants of the gradient attack with improved adversarial learning frameworks. Besides, Ma et al. (2022) analyze the trade-oﬀbetween robustness and fairness, Li and Liu (2023) study the worst-class adversarial robustness in adversarial training. For the theoretical perspective, Montasser et al. (2019) study the PAC learnability of adversarial robust learning, Xu and Liu (2022) extend the work of Montasser et al. (2019) to multiclass case and Yin et al. (2019); Awasthi et al. (2020) give theoretical analysises to adversarial training by standard uniform convergence argumentation and giving a bound of the Rademacher complexity. Our work is quite diﬀerent from Yin et al. (2019); Awasthi et al. (2020), ﬁrstly, they just analyze the linear models and two-layer neural networks, but we consider linear models and multi-layer deep neural networks; secondly, they consider the classiﬁcation loss, which is much easier to analyze than our contrastive loss.

Generalization Bounds for Adversarial Contrastive Learning

Contrastive Learning. Contrastive Learning is a popular self-supervised learning paradigm that attempts to learn a good feature representation by minimizing the feature distance of similar pairs and maximizing the feature distance of dissimilar pairs (Chen et al., 2020; Wang and Liu, 2022). Sim CLR (Chen et al., 2020) learns representations by maximizing the agreement between diﬀerently augmented views of the same data, while Mo Co (He et al., 2020) builds large and consistent dictionaries for unsupervised learning with a contrastive loss. From the theoretical perspective, Saunshi et al. (2019) ﬁrst presents a framework to analyze CL. We generalize their framework to the adversarial CL. The key challenging issues in this work are how to deﬁne and rigorously analyze the adversarial CL losses. Moreover, we further analyze the Rademacher complexity of linear models and multi-layer deep neural networks composited with a complex adversarial contrastive loss. Neyshabur et al. (2015) show an upper bound of the Rademacher complexity for neural networks under group norm regularization. In fact, the Frobenius norm and the ℓ1, -norm in our theoretical analysis are also group norms, while the settings and proof techniques between Neyshabur et al. (2015) and our work are quite diﬀerent: (1) they consider the standard setting while we consider a more diﬃcult adversarial setting; (2) they consider the Rademacher complexity of neural networks with size 1 output layer, while we consider the case that the neural network is composited with a complex contrastive learning loss; (3) they prove their results by reduction with respect to the number of the layers, while motivated by the technique of Gao and Wang (2021), we use the covering number of the neural network to upper bound the Rademacher complexity in the adversarial case. Adversarial Contrastive Learning. Several recent works (Kim et al., 2020; Ho and Vasconcelos, 2020; Jiang et al., 2020) apply adversarial training in the contrastive pretraining stage to improve the robustness of the models on the downstream tasks, achieving good robust performance in their experiments. Our work attempts to provide a theoretical explanation as to why models robustly trained on the upstream task can be robust on downstream tasks.

3. Problem Setup

We introduce the problem setups in this section.

3.1 Basic Contrastive Learning Settings

We ﬁrst set up some notations and describe the contrastive learning framework. Let X Rm be the domain of all possible data points, this paper assumes that X is bounded. Contrastive learning assumes that we get the similar data in the form of pairs (x, x+), which is drawn from a distribution Dsim on X 2, and k independent and identically distributed (i.i.d.) negative samples x 1 , x 2 , . . . , x k drawn from a distribution Dneg on X. Given the training set S = {(xi, x+ i , x i1, . . . , x ik)}M i=1, we aim to learn a representation f from F that maps similar pairs (x, x+) into similar points (f(x), f(x+)), while at the same time keeping f(x i ), , f(x k ) away from f(x), where F is a class of representation functions f : X Rn. Latent Classes. Let C denote the set of all latent classes (Saunshi et al., 2019) that are all possible classes for points in X; for each class c C , moreover, the probability Dc

Zou and Liu

over X captures the probability that a point belongs to class c. The distribution on C is denoted by ρ. To formalize the similarity among data in X, assume we obtain i.i.d. similar data points x, x+ from the same distribution Dc, where c is randomly selected according to the distribution ρ on latent classes C. We can then deﬁne Dsim and Dneg as follows:

Deﬁnition 1 (Dsim and Dneg, Saunshi et al., 2019) For unsupervised tasks, we deﬁne the distribution of sampling similar samples Dsim(x, x+) and negative sample Dneg(x ) as follows: Dsim(x, x+) = E c ρDc(x)Dc(x+), Dneg(x ) = E c ρDc(x ).

Supervised Tasks. For supervised tasks, we focus on the tasks in which a representation function f will be tested on a (k + 1)-way supervised task T consisting of distinct classes {c1, c2, . . . , ck+1} C, while the labeled data set of task T consists of M i.i.d. examples drawn according to the following process: A label c {c1, . . . , ck+1} is selected according to a distribution DT , and a sample x is drawn from Dc. The distribution of the labeled pair (x, c) is deﬁned as: DT (x, c) = DT (c)Dc(x).

3.2 Evaluation Metric for Representations

We evaluate the quality of a representation function f with reference to its performance on a multi-class classiﬁcation task T using a linear classiﬁer. Consider a task T = {c1, . . . , ck+1}. A multi-class classiﬁer for T is a function g : X Rk+1, the output coordinates of which are indexed by the classes c in task T . Let {g(x)y g(x)y }y =y be a k-dimensional vector of diﬀerences in the coordinates of the output of the classiﬁer g. The loss function of g on a point (x, y) X T is deﬁned as ℓ({g(x)y g(x)y }y =y). For example, one often considers the standard hinge loss ℓ(v) = max{0, 1 + maxi{ vi}} and the logistic loss ℓ(v) = log2(1 + P i exp( vi)) for v Rk. The supervised risk of classiﬁer g is deﬁned as follows:

Lsup(T , g) := E (x,c) DT

h ℓ {g(x)c g(x)c }c =c i .

The risk Lsup(T , g) of classiﬁer g on task T measures the quality of the outputs of g, take the hinge loss as an example, our goal is to get a classiﬁer that has much higher conﬁdence for the true class (i.e., the value g(x)c) than others (i.e., the values g(x)c for c = c), so we want all the diﬀerences g(x)c g(x)c for c = c to be as large as possible. If g wrongly classiﬁes x, i.e., arg max c g(x)c = c, of course, ℓon (x, c) is not smaller than 1;

however, even thought g correctly classiﬁes x, the loss value can not decrease to zero unless g(x)c g(x)c 1 for all c = c. Let {ci}k+1 i=1 = {c1, . . . , ck+1} be a set of classes from C. Given the matrix W R(k+1) n, we have g(x) = Wf(x) as a classiﬁer that composite the feature extractor g and linear classiﬁer W. The supervised risk of f is deﬁned as the risk of g when the best W is chosen:

Lsup(T , f) := inf W R(k+1) n Lsup(T , Wf).

Generalization Bounds for Adversarial Contrastive Learning

When training a feature extractor f on the upstream task, we do not make predictions on the examples, so we can t deﬁne the risk of the feature extractor f as we do for the classiﬁer g. Our goal on the upstream task is to train a feature extractor that can perform well on the downstream tasks, so we consider the potential of the feature extractor here, i.e., the minimal possible classiﬁcation risk of a linear classiﬁer on the feature extracted by f. So we deﬁne the risk of the feature extractor f as above by taking the inﬁmum over all linear classiﬁers.

Deﬁnition 2 (Mean Classiﬁer, Saunshi et al., 2019) For a function f and task T = {c1, . . . , ck+1}, the mean classiﬁer is W µ, whose cth row is the mean µc := E x Dc[f(x)] and

we deﬁne Lµ sup(T , f) := Lsup(T , W µf).

Since Lsup(T , f) involves taking inﬁmum over all possible linear classiﬁers, it is diﬃcult to analyze Lsup(T , f) and establish connections between the risk of the unsupervised upstream task and Lsup(T , f). We introduce the mean classiﬁer to bridge them by bounding average Lµ sup(T , f) over the tasks with the risk of the unsupervised upstream risk (for more details, please refer to 4).

Deﬁnition 3 (Average Supervised Risk) The average supervised risk for a function f on (k + 1)-way tasks is deﬁned as:

Lsup(f) := E {ci}k+1 i=1 ρk+1[Lsup({ci}k+1 i=1 , f)|ci = cj]

Given the mean classiﬁer, the average supervised risk for a function f is deﬁned as follows:

Lµ sup(f) := E {ci}k+1 i=1 ρk+1[Lµ sup({ci}k+1 i=1 , f)|ci = cj].

In contrastive learning, the feature extractor is trained as a pretrained model to be used on downstream tasks. There may be many diﬀerent downstream tasks such as binary classiﬁcation tasks with classes {c1, c2} C and many other multi-class classiﬁcation tasks, so here we consider the average error of Lµ sup(T , f) over all possible tasks T sampled from ρk+1 as the ﬁnal performance measure of a feature extractor f.

3.3 Contrastive Learning Algorithm

We denote k as the number of negative samples used for training and (x, x+) Dsim, (x 1 , . . . , x k) Dk neg.

Deﬁnition 4 (Unsupervised Risk) The unsupervised risk is deﬁned as follows:

Lun(f) := E[ℓ({f(x)T (f(x+) f(x i ))}k i=1)].

Given M samples n (xj, x+ j , x j1, . . . , x jk) o M

j=1 from Dsim Dk neg, the empirical counterpart

of unsupervised risk is deﬁned as follows:

b Lun(f) := 1

j=1 ℓ({f(xj)T (f(x+ j ) f(x ji))}k i=1).

Zou and Liu

In the contrastive learning upstream task, we want to learn a feature extractor f such that f maps examples from the same class to similar features and makes the features of examples from diﬀerent classes far away from each other. So for the examples (x, x+, x 1 , . . . , x k ), we want f(x)T f(x+) to be as large as possible while f(x)T f(x i ), i = 1, . . . , k to be as small as possible. The loss ℓwe deﬁned before elegantly captures the aim of contrastive learning, if we set vi = f(x)T f(x+) f(x)T f(x i ), minimizing ℓ({f(x)T (f(x+) f(x i ))}k i=1) will yield large f(x)T f(x+) and small f(x)T f(x i ), i = 1, . . . , k. So optimizing over Lun(f) can reach our goal for contrastive learning. Following Saunshi et al. (2019), the unsupervised risk can be described by the following equation:

Lun(f) = E c+,c i ρk+1 E x,x+ D2 c+,x i Dc i [ℓ({f(x)T (f(x+) f(x i ))}k i=1)].

The Empirical Risk Minimization (ERM) algorithm is used to ﬁnd a function bf ERM arg min f F b Lun(f) that minimizes the empirical unsupervised risk. The function bf ERM can be

subsequently used for supervised linear classiﬁcation tasks.

3.4 Adversarial Setup

A key question in adversarial contrastive learning is that of how to deﬁne the adversary sample in contrastive learning. From a representation perspective, one can ﬁnd a point ex that is close to x and keeps the feature f(ex) both as far from f(x+) as possible and as close to the feature of some negative sample f(x ) as possible. Inspired by this intuition, we deﬁne the Contrastive Adversary Sample as follows:

Deﬁnition 5 (U-Contrastive Adversary Sample) Given a neighborhood U(x) of x, x+ Dc+, x i Dc i for i = 1, . . . , k, we deﬁne the U-Contrastive Adversary Sample of x as follows: ex := arg sup x U(x) E x+ Dc+,x i Dc i [ℓ({f(x )T (f(x+) f(x i ))}k i=1)].

In this article, we suppose that the loss function ℓis convex. By the subadditivity of sup, we have, f F:

sup x U(x) E x+ Dc+

[ℓ({f(x )T(f(x+) f(x i ))}k i=1)] E x+ Dc+

[ sup x U(x) ℓ({f(x )T(f(x+) f(x i ))}k i=1)].

(1) In the following sections, we analyze the theoretical properties of the right-hand side of (1), which can be easily optimized in practice by Adversarial Empirical Risk Minimization (AERM). The U-Adversarial Unsupervised Risk and its surrogate risk can be deﬁned as below.

Deﬁnition 6 (U-Adversarial Unsupervised Risk) Given a neighborhood U(x) of x, the U-Adversarial Unsupervised Risk of a presentation function f is deﬁned as follows: e Lun(f) := E c+,c i ρk+1

E x Dc+[ sup x U(x) E x+ Dc+

[ℓ({f(x )T (f(x+) f(x i ))}k i=1)]]. (2)

Generalization Bounds for Adversarial Contrastive Learning

Moreover, the surrogate risk of (2) is as follows:

e Lsun(f) = E c+,c i ρk+1

E x,x+ D2 c+

sup x U(x) ℓ({f(x )T (f(x+) f(x i ))}k i=1).

By (1), we have e Lun(f) e Lsun(f) for any f F. The Adversarial Supervised Risk of a classiﬁer g for a task T is deﬁned as follows:

e Lsup(T , g) := E (x,c) DT [ sup x U(x) ℓ({g(x)c g(x)c }c =c)].

The Adversarial Supervised Risk of a representation function f for a task T is deﬁned as follows: e Lsup(T , f) := inf W R(k+1) n e Lsup(T , Wf). (3)

For the mean classiﬁer, we deﬁne:

e Lµ sup(T , f) := e Lsup(T , W µf)

The Average Adversarial Supervised Risk for a representation function is as deﬁned below.

Deﬁnition 7 (Average U-Adversarial Supervised Risk)

e Lsup(f) := E {ci}k+1 i=1 ρk+1[e Lsup({ci}k+1 i=1 , f)|ci = cj].

For the mean classiﬁer, we have the following:

e Lµ sup(f) := E {ci}k+1 i=1 ρk+1[e Lµ sup({ci}k+1 i=1 , f)|ci = cj].

4. Theoretical Analysis for Adversarial Contrastive Learning

This section presents some theoretical results for adversarial contrastive learning.

4.1 One Negative Sample Case

Let τ = P c,c ρ2[c = c ], σ be an M-dimensional Rademacher random vector with i.i.d. en-

tries, deﬁne (gf)|S = (gf(z1), . . . , gf(z M)) and RS(G) := E σ { 1}M

σ, (gf)|S #

G := {gf(x, x+, x 1 , . . . , x k ) = sup x U(x) ℓ {f(x )T (f(x+) f(x i ))}k i=1 |f F}, let bf

arg min f F be Lsun(f) where be Lsun(f) is the empirical counterpart of e Lsun(f), we have:

Zou and Liu

Theorem 8 (The proof can be found in the Appendix A.1) Let ℓ: Rk R be bounded by B. Then, for any δ (0, 1),with a probability of at least 1 δ over the choice of the training set S = {(xj, x+ j , x j )}M j=1, for any f F:

e Lsup( bf) e Lµ sup( bf) 1 1 τ (e Lsun(f) τℓ(0)) + 1 1 τ AGM,

AGM = O(RS(G)

Remark 9 Theorem 8 shows that when the hypothesis class F is rich enough to contain some f with low surrogate adversarial unsupervised risk, the empirical minimizer of the surrogate adversarial unsupervised risk will then obtain good robustness on the supervised downstream task. Note that we can take f = bf in the upper bound of Theorem 8 and get a bound e Lsup( bf) 1 1 τ (e Lsun( bf) τℓ(0)) + 1 1 τ AGM. Then we can see that if the output of AERM (i.e., bf)

achieves small unsupervised adversarial risk, then bf is a robust feature extractor such that it can achieve good robustness after ﬁne-tuning on the downstream tasks. Theorem 8 gives a relationship between the robustness of the contrastive (upstream) task and the robustness of the downstream classiﬁcation tasks and explains why adversarial contrastive learning can help improve the robustness of the downstream task, as shown empirically in Kim et al. (2020); Ho and Vasconcelos (2020); Jiang et al. (2020).

4.2 Blocks of Similar Points

Saunshi et al. (2019) show a reﬁned method that operates by using blocks of similar data and determine that the method achieves promising performance both theoretically and empirically. We adapt this method to adversarial contrastive learning. Speciﬁcally, we sample (b + 1) i.i.d. similar samples x, x+ 1 , . . . , x+ b from c+ ρ and b negative i.i.d. samples from c ρ. The block adversarial contrastive learning risk is as follows:

e Lblock sun (f) := E

sup x U(x) ℓ

f(x )T Pb i=1 f(x+ i ) b Pb i=1 f(x i ) b

and its empirical counterpart is as follows:

sun (f) := 1

sup x U(xi) ℓ

f(x )T Pb j=1 f(x+ ij)

Pb j=1 f(x ij)

Theorem 10 (The proof can be found in the Appendix A.2) For any f F, we have: e Lsup(f) 1 1 τ

e Lblock sun (f) τℓ(0) 1 1 τ

e Lsun(f) τℓ(0) .

Remark 11 Theorem 10 shows that using blocks of similar data yields a tighter upper bound for the adversarial supervised risk than in the case for pairs of similar data. Theorem 10 implies that using the blocks in adversarial contrastive learning may improve the robust performance of the downstream tasks; this will be veriﬁed by the empirical results in 6.

Generalization Bounds for Adversarial Contrastive Learning

4.3 Multiple Negative Sample Case

This subsection extends our results to k negative samples. To achieve this, more deﬁnitions are required. Let [k] denote the set {1, 2, . . . , k}.

Deﬁnition 12 We deﬁne a distribution D over the supervised tasks as follows: First, sample k + 1 classes (allow repetition) c+, c 1 , . . . , c k ρk+1, conditioned on the event that c i = c+, i [k]. Then, set the task T as the set of distinct classes in {c+, c 1 , . . . , c k }.

Deﬁnition 13 The Average U-Adversarial Supervised Risk of a representation function f F over D is deﬁned as follows:

e Lsup(f) := E T D

h e Lsup(T , f) i .

Let Edistinct be the event such that {c+, c 1 , . . . , c k } is distinct and p = P (c+,c 1 ,...,c k ) D [Edistinct].

For any f F, we have (The proof can be found in the Appendix A.3):

e Lsup(f) e Lsup(f)

From (5), we can turn to analyze the relation between e Lsup(f) and e Lsun(f) in the multiple negative sample case. Our Theorem 16 handles e Lsup(f) instead of e Lsup(f).

Assumption 14 Assume that I1, I2 [d] such that I1 I2 = [d], ℓsatisﬁes the following inequations:

ℓ({vi}i I1) ℓ({vi}i [d]) ℓ({vi}i I1) + ℓ({vi}i I2), (6)

ℓ({vi}i I2) ℓ({vi}i [d]) ℓ({vi}i I1) + ℓ({vi}i I2). (7)

Proposition 15 (The proof can be found in the Appendix A.4) The hinge loss and the logistic loss satisfy Assumption 14.

If C is ﬁnite, we obtain a simple (and informal) bound of e Lsup( bf), for the more complex case that allows inﬁnite C and the formal form of Theorem 16 (Theorem 32), please refer to the Appendix A.5.

Theorem 16 (The proof can be found in the Appendix A.6) Suppose C is ﬁnite, for any c C, ρ(c) > 0, and ℓsatisﬁes Assumption 14. Then, with a probability of at least 1 δ over the choice of the training set S, f F:

e Lsup( bf) α(ρ) e Lsun(f) + AGM β.

Zou and Liu

5. Generalization Bounds for Example Hypothesis Classes

This section presents a concrete analysis of the Rademacher complexity for linear hypothesis class and multi-layer neural networks based on covering number (Wainwright, 2019, Deﬁnition 5.1). We ﬁrst introduce some deﬁnitions and required lemmas.

Lemma 17 (Wainwright, 2019, Lemma 5.7, volume ratios and metric entropy) Consider a pair of norms and on Rd, and let B and B be their corresponding unit balls (i.e. B = {θ Rd| θ 1}, with B similarly deﬁned). The δ-covering number of B in the -norm then obeys the following bounds (Wainwright, 2019, Lemma 5.7): 1

vol(B ) N(δ; B, ) vol(2

δB + B ) vol(B ) ,

where we deﬁne the Minkowski sum A+B := {a+b : a A, b B}, vol(B) := R 1{x B}dx is the volume of B based on the Lebesgue measure, and N(δ; B, ) is the δ-covering number of B with respect to the norm .

Lemma 18 (The proof can be found in the Appendix A.7) Let Bp(r) be the p-norm ball in Rd with radius r. The δ-covering number of Bp(r) with respect to p thus obeys the following bound:

N(δ; Bp(r), p) 1 + 2r

Deﬁnition 19 (Wainwright, 2019, Deﬁnition 5.16, sub-Gaussian process) A collection of zero-mean random variables {Xθ, θ T} is a sub-Gaussian process with respect to a metric ρX on T if:

E[eλ(Xθ Xeθ)] e

λ2ρ2 X (θ,eθ)

2 , θ, eθ T, λ R.

It is easy to prove that the Rademacher process (Wainwright, 2019, 5.2) satisﬁes the condition in Deﬁnition 19 with respect to the ℓ2-norm.

Lemma 20 (Wainwright, 2019, the Dudley s entropy integral bound) Let {Xθ, θ T} be a zero-mean sub-Gaussian process with respect to the induced pseudometric ρX from Deﬁnition 19. Then, for any δ [0, D], we have:

sup θ,eθ T (Xθ Xeθ)

sup γ,γ T ρX (γ,γ ) δ

+ 32J (δ/4; D). (8)

Here, D = supθ,eθ T ρX(θ, eθ) and J (a; b) = R b a p

ln NX(u; T)du, where NX(u; T) is the u-covering number of T in the ρX-metric.

Remark 21 Given θ0 T, since E [Xθ0] := E [ θ0, σ ] = 0, we have:

E sup θ T Xθ

= E sup θ T (Xθ Xθ0) E

sup θ,eθ T (Xθ Xeθ)

Generalization Bounds for Adversarial Contrastive Learning

Combining (8) with (9), we have:

E sup θ T Xθ

sup γ,γ T ρX (γ,γ ) δ

+ 32J (δ/4; D),

which can be used to draw the upper bound of RS(G) by establishing a connection between the Rademacher process and the Rademacher complexity of the hypothesis classes when proper norm is chosen. For more details, please refer to the Appendix, details are in the proof of Theorem 22, Theorem 24 and Theorem 26.

To make Theorem 8 and Theorem 16 concrete, we need to upper bound RS(G). Assume that loss ℓis a non-increasing function; for example, hinge loss and logistic loss satisfy this assumption. Let G = {gf(x, x+, x ) = sup x U(x) ℓ f(x )T (f(x+) f(x )) |f F}. Since ℓis

non-increasing, we have:

G = ℓ min x U(x)

f(x )T f(x+) f(x ) |f F .

Let H = min x U(x)

f(x )T (f(x+) f(x )) |f F . Suppose ℓis η-Lipschitz. By the

Ledoux-Talagrand contraction inequality (Ledoux and Talagrand, 2013), we have:

RS(G) ηRS(H). (10)

Thus, we only need to upper bound RS(H). Let A a,b be the ℓb-norm of the ℓa-norm of the rows of A. Consider the training set S = {(xi, x+ i , x i )}M i=1. Let X be a matrix whose ith row is xi. We deﬁne X+ and X in a similar way. It is easy to see that p 1, i = 1, . . . , M, xi p X p, .

5.1 Linear Hypothesis Class

Let F ={f : x Wx|W Rn m, |||W|||p w}. To simplify notations, p 1 and 1

P =max X p, , X+ p, , X p, , P =max X p , , X+ p , , X p , . (11)

We then have p 1,

i = 1, . . . , M, xi p, x+ i p, x i p P.

We now present the upper bound of RS(H) under r attack.

Theorem 22 (RS(H) under r attack for linear models) Consider the ℓr attack, i.e. let U(x) = {x | x x r ϵ}. We then have:

RS(H) = O [PP + ϵR s(r , p, m)] h ms(p , p, m)w2

where s(p, q, n) := nmax n 1 p 1

r = 1, and R is deﬁned similarly to (11).

Zou and Liu

The proof can be found in the Appendix A.8.

Remark 23 Combining Theorem 22 with (4), we have:

[PP + ϵR s(r , p, m)] mηs(p , p, m)w2 + B q

5.2 Multi-layer Neural Network

In this section, we analyze fully connected multi-layer neural networks. Suppose that X Rm. Let F = {Wdσ(Wd 1σ( σ(W1x))) | |||Wl||| Ml, l = 1, . . . , d}, where ||| ||| is the norm of the matrix and σ( ) is an elementwise L-Lipschitz function with σ(0) = 0 and Wl Rhl hl 1, where hd = n, h0 = m. Assume ℓis η-Lipschitz and nonincreasing. From (10), we need only to bound the Rademacher complexity of H. We here consider two cases of the matrix norm ||| |||. Let U(x) = {x | x x p ϵ} for some p 1.

5.2.1 Frobenius-norm Case.

We ﬁrst consider using the Frobenius-Norm in the deﬁnition of the multi-layer neural networks hypothesis class F.

Theorem 24 (RS(H) under p attack for NNs under ||| |||F constraint) Let U(x) = {x | x x p ϵ} (i.e. consider the ℓp attack), σ(0) = 0 with Lipschitz constant L and let F = Wdσ(Wd 1σ( σ(W1x))) |||Wl|||F MF l , l = 1, . . . , d . We then have:

l=1 hlhl 1K

where K = 2BF X,ϵ BF X+ + BF X ,

BF X,ϵ =Ld 1 d Y

l=1 MF l max n 1, m 1 2 1

p o ( X p, +ϵ) , BF X =Ld 1 d Y

l=1 MF l max n 1, m 1 2 1

The proof can be found in the Appendix A.9.

Remark 25 Combining Theorem 24 with (4), we have:

d Pd l=1 hlhl 1

Generalization Bounds for Adversarial Contrastive Learning

5.2.2 ℓ1, -norm Case

We consider the 1, norm constraint.

Theorem 26 (RS(H) under p attack for NNs under 1, constraint) Let U(x) = {x | x x p ϵ} (i.e. consider the ℓp attack), σ(0) = 0 with Lipschitz constant L; moreover,

let F = n Wdσ(Wd 1σ( σ(W1x))) Wl 1, M1, l , l=1, . . . ,d o . We then have:

l=1 hlhl 1 p

where K0 = 2B1, X,ϵ B X+ + B X , K1 = K0

2 + B X,ϵ B1, X+ + B1, X ,

B X,ϵ = Ld 1 d Y

l=1 hl M1, l m1 1

p ( X p, + ϵ) , B X = Ld 1 d Y

l=1 hl M1, l m1 1

B1, X,ϵ = Ld 1 d Y

l=1 M1, l ( X p, + ϵ) , B1, X = Ld 1 d Y

l=1 M1, l X p, .

The proof can be found in the Appendix A.10.

Remark 27 Combining Theorem 26 with (4), we have:

d K0K1 Pd l=1 hlhl 1 M + B

Remark 28 Our bound has important implications for the design of regularizers for adversarial contrastive learning. To achieve superior robust performance on the downstream tasks, the usual approach is to make X p, small. For example, Pytorch scales the images to tensors with entries within the range [0, 1]. Moreover, Theorem 24 shows that we can take the norms of the layers as the regularizers to reduce the adversarial supervised risk.

In our analysis for the Rademacher complexity, we consider models with norm-constrained weights, which means that the hypothesis class is uniformly Lipschitz, although the Lipschitz constant may be large (the product of maximal weight norms for the layers). One may wonder what will happen if we remove the constrains on the norm of the weights. For simplicity, let s consider a hypothesis class H { 1}X for binary classiﬁcation, we have:

RS(H) = E σ

i=1 σih(xi)

where σ1, . . . , σn { 1} are i.i.d. uniform random variables. We can regard σ1, . . . , σn as random labels that we need to ﬁt by hypothesis from H, so we can interpret RS(H) as the

Zou and Liu

Attack ϵ Type λ 0 0.002 0.05 0.2

PGD 0.01 Clean 75.73 74.8 74.5 75.67 Adv 67.67 69.25 68.59 67.65

0.02 Clean 53.11 55.72 55.73 55.72 Adv 46.71 48.17 48.17 48.16

FGSM 0.01 Clean 74.42 76.13 76.12 76.11 Adv 67.28 68.79 68.8 68.8

0.02 Clean 54.28 54.29 54.28 66.68 Adv 45.93 45.94 45.92 55.64

Table 1: Results of experiments on the regularizer. In this table, we list the clean accuracy (Clean) and adversarial accuracy (Adv) of the mean classiﬁer under the PGD and FGSM attack with ϵ = 0.01 and ϵ = 0.02. λ is chosen from {0, 0.002, 0.05, 0.2}, and λ = 0 indicates no regularizer.

ability of H to ﬁt random 1 binary labels. Now let H be the neural network, if we do not constrain the norm of the weights, theoretically, the universal approximation theorem (Maiorov and Pinkus, 1999) tells us that neural networks can ﬁt any continuous function on a bounded input space, which means that in this case RS(H) 1, leading to vacuous bounds in the binary classiﬁcation case; experimentally, Zhang et al. (2017) show that deep neural networks easily ﬁt random labels. From another perspective, to derivate an upper bound for RS(H) by covering number, we need to ﬁnd a δ-covering set for H under some metric ρ( , ). If the weights of the layers are not bounded, we can not cover H by a ﬁnite subset of H, the δ-covering number of H under ρ( , ) will be inﬁnite, which means that Lemma 20 does not hold. So it is diﬃcult to go beyond the Lipschitz network.

6. Experiments

In this section, we conduct several experiments to support our theory. We emphasize that we are not proposing a method to try to get better robustness on the downstream tasks, we do the experiments to verify our two claims from our theoretical results: (1) As shown in Remark 11, using the blocks may improve the robust performance; (2) As shown in Remark 28, using the norms of the layers of the neural networks as the regularizer may help improve the robust performance. Data sets. We use two data sets (Krizhevsky and Hinton, 2009) in our experiments: (1) the CIFAR-10 data set and (2) the CIFAR-100 data set. CIFAR-10 contains 50000/10000 train/test images with size 32 32, which are categorized into 10 classes. CIFAR-100 contains 50000/10000 train/test images with size 32 32, which are categorized into 100 classes. Model. We use a neural network with two convolutional layers and one fully connected layer. Following He et al. (2020), we use the Stochastic Gradient Descent (SGD) optimizer

Generalization Bounds for Adversarial Contrastive Learning

1 2 4 6 8 10 block size

clean accuracy(%)

pgd-0.01 pgd-0.02 fgsm-0.01 fgsm-0.02

(a) Inﬂuence on clean accuracy

1 2 4 6 8 10 block size

adv accuracy(%)

pgd-0.01 pgd-0.02 fgsm-0.01 fgsm-0.02

(b) Inﬂuence on adversarial accuracy

Figure 1: The eﬀect of block size on the accuracy. In the ﬁgure, we show the clean accuracy and the adversarial accuracy of the mean classiﬁer under PGD and FGSM attack with ϵ = 0.01 and ϵ = 0.02. The block size is choosen from {1, 2, 4, 6, 8, 10}. (a) The inﬂuence on the clean accuracy; (b) The inﬂuence on the adversarial accuracy.

with momentum 0.9 but set the weight decay to be 5 10 4 and the learning rate to be 0.001. Evaluation of robustness. For representation f, we ﬁrst calculate buc = 1 nc Pnc i=1 f(xi) to estimate the c-th row of the mean classiﬁer, where x1, . . . , xnc are the data points with label c in our training set. Denote c W µ as the estimator of W, we use the robustness of the classiﬁer c W µf as an evaluation of the robustness of f on the downstream task. We show the results for CIFAR-10 here; the results for CIFAR-100 can be found in the Appendix B.

6.1 Improvement from the regularizer

Inspired by our bound (13) in Theorem 24 and Theorem 8, the adversarial supervised risk can be upper bounded by the sum of the adversarial unsupervised loss and AGM, which is related to the maximal Frobenius-norm of the network layers. We choose to simultaneously optimize the contrastive upstream pre-train risk and the Frobenius norm of the parameters of the model. We set the norm of the parameters for the layers as a regularizer and test the performance of the mean classiﬁer; here, W µ is calculated by averaging all features of the training data set as done in Nozawa et al. (2020). We use a hyper-parameter λ to balance the trade-oﬀof the the contrastive upstream pre-train risk and the Frobenius norm of the parameters of the model. We choose to minimize the following regularized empirical risk:

L(f) = be Lsun(f) + λN(f) (14)

where N(f) is a regularizer that constrains the Frobenius norm of the parameters of the model f, here we choose N(f) = Pd l=1 |||Wl|||F where |||Wl|||F is the Frobenius norm of the

Zou and Liu

parameters for the l-th layer of f and:

be Lsun(f)= 1

j=1 max x j U(xj)ℓ({f(x j)T(f(x+ j ) f(x ji))}k i=1).

More details about our algorithm are in Algorithm 1.

Algorithm 1: The AERM algorithm for adversarial contrastive learning

Input : The training data set S = n (xj, x+ j , x j1, . . . , x jk) o M

j=1 sampled from

Dsim Dk neg; the hyper-parameter λ in (14); the adversarial perturbation U; learning rate α; the total iteration number T;

1 initialize θ0 to be randomized parameters;

3 while t < T do

4 randomly sample a batch of data with size N: B S ;

6 for (x, x+, x 1 , . . . , x k ) in B do

7 calculate adversarial example

x arg sup x U(x) ℓ fθt(x )T fθt(x+) fθt(x i ) k i=1

8 e B e B {( x, x+, x 1 , . . . , x k )};

10 L(fθt) = 1

( x,x+,x 1 ,...,x k ) e B ℓ fθt( x)T fθt(x+) fθt(x i ) k i=1

11 θt+1 θt α θt L(fθt);

12 t t + 1;

13 end Output : The feature extractor fθT F that tries to minimize (14);

The results are shown in Table 1. From Table 1, we can see that the F-norm regularizer can improve the adversarial accuracy (the prediction performance of a model on adversarial examples generated by attacker) of the mean classiﬁer, which is in line with our Theorem 24.

6.2 Eﬀect of block size

To verify Theorem 10, we analyze the eﬀect of block size on the adversarial accuracy of the mean classiﬁer. Figure 1 presents the results for clean accuracy and adversarial accuracy, respectively. From Figure 1, we can see that a larger block size will yield better adversarial accuracy. The results are consistent with Theorem 10: as the block size grows, Theorem 10 shows that we are optimizing a tighter bound, which leads to better performance as shown in Figure 1.

Generalization Bounds for Adversarial Contrastive Learning

7. Conclusion

This paper studies the generalization performance of adversarial contrastive learning. We ﬁrst extend the contrastive learning framework to the adversarial case, then we upper bound the average adversarial risk of the downstream tasks with the adversarial unsupervised risk of the upstream task and an adversarial Rademacher complexity term. Furthermore, we provide the upper bound of the adversarial Rademacher complexity for linear models and multi-layer neural networks. Finally, we conduct several experiments and the experimental results are consistent with our theory.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant 61976161.

Zou and Liu

Appendix A. Proofs

In this section, we display the proofs of our theorems, lemmas and corollaries. For reading convenience, we will restate the theorem before proving.

A.1 Proof of Theorem 8

In the below, we present some useful lemmas that will be used in the proofs of our main theorems.

Lemma 29 For any f F, we have:

e Lsup(f) e Lµ sup(f) 1 1 τ (e Lun(f) τℓ(0)) 1 1 τ (e Lsun(f) τℓ(0)). (A.1)

Remark 30 be Lsun(f) denotes the empirical e Lsun(f) and bf arg min f F be Lsun(f). Applying

Lemma 29 to bf shows that, if we can train a robust feature extractor with low surrogate adversarial unsupervised risk, we can obtain a robust classiﬁer with low adversarial supervised risk on the downstream task.

Lemma 31 Let ℓ: Rk R be bounded by B. Then, for any δ (0, 1), with a probability of at least 1 δ over the choice of the training set S = {(xj, x+ j , x j1, . . . , x jk)}M j=1 = {zj}M j=1, for any f F: e Lsun( bf) e Lsun(f) + AGM,

where AGM = O(RS(G)

δ M ), RS(G) := E σ { 1}M

σ, (gf)|S #

{gf(x, x+, x 1 , . . . , x k ) = sup x U(x) ℓ {f(x )T (f(x+) f(x i ))}k i=1 |f F}, where σ is an M-

dimensional Rademacher random vector with i.i.d. entries and (gf)|S = (gf(z1), . . . , gf(z M)).

Proof [Proof of Lemma 29] By the deﬁnition of e Lsup(T , f), i.e., (3), it s obvious that e Lsup(f) e Lµ sup(f), so we only need to prove the second part of (A.1). From deﬁnition (2), we have, f F:

e Lun(f) = E c+,c ρ2

E x+ Dc+ x Dc

ℓ f(x )T f(x+) f(x )

(i) E c+,c ρ2

sup x U(x) ℓ f(x )T (µc+ µc ) #)

(ii) = (1 τ) E c+,c ρ2

sup x U(x) ℓ f(x )T (µc+ µc ) # c+ = c )

(A.2) where (i) comes from the convexity of ℓand Jensen s Inequality, and (ii) comes from the property of conditional expectations. Then we have:

Generalization Bounds for Adversarial Contrastive Learning

sup x U(x) ℓ f(x )T (µc+ µc ) # c+ = c )

(i)= E c+,c ρ2

DT (c+) E x Dc+

sup x U(x) ℓ f(x )T (µc+ µc ) #

+ DT (c ) E x Dc

sup x U(x) ℓ f(x )T (µc µc+) # c+ = c )

(ii) = E c+,c ρ2

DT (c+) E x Dc+

sup x U(x) ℓ g(x )c+ g(x )c #

+ DT (c ) E x Dc

sup x U(x) ℓ g(x )c g(x )c+ # c+ = c )

where T = {c+, c }, g(x) =

= W µf(x) and (i) comes from the symmetry of

c+, c ; (ii) is directly from some linear algebras. From (A.3) we know that:

sup x U(x) ℓ f(x )T (µc+ µc ) # c+ = c )

(i)= E c+,c ρ2

E c DT E x Dc

sup x U(x) ℓ g(x )c g(x )c

(ii) = E c+,c ρ2

" e Lsup c+, c , W µf c+ = c #

(iii) = E c+,c ρ2

" e Lµ sup c+, c , f c+ = c # (iv) = e Lµ sup(f),

where (i) is due to the tower property of expectation;(ii) is obvious by the deﬁnition of e Lsup(T , g); (iii) comes from the deﬁnition of e Lµ sup(T , f) and (iv) is from the deﬁnition of e Lµ sup(f). Combine (A.2) with (A.4), we conclude that:

e Lun(f) (1 τ)e Lµ sup(f) + τℓ(0), f F.

So we have:

(1 τ)e Lsup(f) + τℓ(0) (1 τ)e Lµ sup(f) + τℓ(0) e Lun(f), f F. (A.5)

Rearranging (A.5) yields (A.1).

Zou and Liu

Proof [Proof of Lemma 31] Denote (x, x+, x 1 , , x k ) by z, then by the Theorem 3.3 in Mohri et al. (2012), we have: With probability at least 1 δ

2 over the choice of the training set S,

1 B gf(zi) + 2

which is equivalent to:

e Lsun(f) be Lsun(f) + 2

M RS(G) + 3B

δ M , f F. (A.6)

Let f = arg min f F e Lsun(f), since E S

be Lsun(f) = e Lsun(f) and ℓis bounded by B, Hoeﬀding s

inequality tells us that: f F, t R:

P be Lsun(f) e Lsun(f) t e 2Mt2

Set t = B q

δ 2M , we have: f F, with probability at least 1 δ

be Lsun(f) e Lsun(f) B

δ 2M . (A.7)

Combine (A.6) with (A.7), the union bound tells us that: f F, with probability at least 1 δ over the choice of the training set S,

e Lsun( bf) (i) be Lsun( bf) + O

(ii) be Lsun(f ) + O

(iii) e Lsun(f ) + O

(iv) e Lsun(f) + O

where (i) comes from (A.6);(ii) is directly from the fact that be Lsun( bf) be Lsun(f ), which is from the deﬁnition of bf;(iii) is a result of (A.7) and (iv) is obvious by the deﬁnition of f .

Theorem 8 Let ℓ: Rk R be bounded by B. Then, for any δ (0, 1),with a probability of at least 1 δ over the choice of the training set S = {(xj, x+ j , x j )}M j=1, for any f F:

e Lsup( bf) e Lµ sup( bf) 1 1 τ (e Lsun(f) τℓ(0)) + 1 1 τ AGM.

Proof From Lemma 29 we know that:

e Lsup( bf) e Lµ sup( bf) 1 1 τ (e Lun( bf) τℓ(0)) 1 1 τ (e Lsun( bf) τℓ(0)),

Then Lemma 31 directly yields the result we need.

Generalization Bounds for Adversarial Contrastive Learning

A.2 Proof of Theorem 10

Theorem 10 For any f F, we have:

e Lsup(f) 1 1 τ

e Lblock sun (f) τℓ(0) 1 1 τ

e Lsun(f) τℓ(0) .

Proof By the convexity of ℓand Jensen s inequality, we have: x , x+ i , x i :

f(x )T Pb i=1 f(x+ i ) b Pb i=1 f(x i ) b

i=1 f(x )T f(x+ i ) f(x i ) !

i=1 ℓ f(x )T f(x+ i ) f(x i ) .

Take maximization about x both sides, we have:

sup x U(x) ℓ

f(x )T Pb i=1 f(x+ i ) b Pb i=1 f(x i ) b

b sup x U(x)

i=1 ℓ f(x )T f(x+ i ) f(x i ) #

sup x U(x) ℓ f(x )T f(x+ i ) f(x i ) #

Taking expectations both sides yields:

e Lblock sun (f) = E x,x+ i ,x i

sup x U(x) ℓ

f(x )T Pb i=1 f(x+ i ) b Pb i=1 f(x i ) b

E x,x+ i ,x i

sup x U(x) ℓ f(x )T f(x+ i ) f(x i ) #)

sup x U(x) ℓ f(x )T f(x+) f(x ) #

= e Lsun(f),

which proves the second inequality. For the ﬁrst inequality, we have, f F:

Zou and Liu

e Lblock sun (f) = E c+,c ρ2

E x,x+ i Db+1 c+

sup x U(x) ℓ

f(x )T Pb i=1 f(x+ i ) b Pb i=1 f(x i ) b

(i) E c+,c ρ2

sup x U(x) E x+ i Db c+

f(x )T Pb i=1 f(x+ i ) b Pb i=1 f(x i ) b

(ii) E c+,c ρ2

sup x U(x) ℓ

E x+ i Db c+

f(x )T Pb i=1 f(x+ i ) b Pb i=1 f(x i ) b

(iii) = E c+,c ρ2

sup x U(x) ℓ f(x )T (µc+ µc ) #) (iv) = (1 τ)e Lµ sup(f) + τℓ(0),

where (i) and (ii) are directs result of Jensen s Inequality and convexity of maximization function and ℓ; (iii) is from the linearity of expectation and the last equality follows the same argumentation as in (A.4), which proves the Theorem.

A.3 Proof of (5)

Proof For any f F, we have:

e Lsup(f)= E T D

h e Lsup(T , f) i (i)=p E T D

h e Lsup(T , f) Edistinct i +(1 p) E T D

h e Lsup(T , f) Edistinct i

h e Lsup(T , f) Edistinct i (ii) = p e Lsup(f),

where Edistinct is the event that {c+, c 1 , . . . , c k } is distinct and p = P (c+,c 1 ,...,c k ) D [Edistinct]

and (i) is from the property of conditional expectation and (ii) comes from the deﬁnition of e Lsup(f).

A.4 Proof of Proposition 15

Proposition 15 The hinge loss and the logistic loss satisfy Assumption 14.

Proof Since I1 and I2 are symmetric, We only need to prove (6). For the Hinge Loss:

ℓ({vi}i I1) = max 0, 1 + max i I1 { vi} (i) max 0, 1 + max i [d]{ vi} = ℓ {vi}i [d] ,

Generalization Bounds for Adversarial Contrastive Learning

where (i) is from the fact that I1 [d], so the ﬁrst inequality is proved, for the second one:

ℓ {vi}i [d] = max 0, 1 + max i [d]{ vi} = max 0, 1 + max i I1 I2{ vi} ,

where the last equality is directly from the deﬁnition of I1 and I2.

1. if 1 + max i I1 I2{ vi} 0, then ℓ({vi}i I1 I2) = 0, since Hinge Loss is non-negative, we

ℓ({vi}i I1 I2) = 0 + 0 ℓ({vi}i I1) + ℓ({vi}i I2) .

2. if 1 + max i I1 I2{ vi} > 0

(a) if 1 + max i I1 { vi} 0, then ℓ({vi}i I1) = max 0, 1 + max i I1 { vi} = 0. So we

ℓ({vi}i I1 I2) = max 0, 1 + max i I1 I2{ vi} = max 0, 1 + max i I2 { vi}

= 0 + ℓ({vi}i I2) = ℓ({vi}i I1) + ℓ({vi}i I2) .

(b) if 1 + max i I2 { vi} 0, by the same discussion of (a), we have:

ℓ({vi}i I1 I2) ℓ({vi}i I1) + ℓ({vi}i I2) .

(c) if 1 + max i I1 { vi} > 0 and 1 + max i I2 { vi} > 0,

ℓ({vi}i I1 I2)=max 0, 1+ max i I1 I2{ vi} max 0, 1+max i I1 { vi}+1+max i I2 { vi}

max 0, 1 + max i I1 { vi} + max 0, 1 + max i I2 { vi}

= ℓ({vi}i I1) + ℓ({vi}i I2) .

So the second inequality is proved. For the Logistic Loss:

ℓ({vi}i I1) = log2

= ℓ {vi}i [d] .

Zou and Liu

So the ﬁrst inequality is proved. For the second one:

ℓ({vi}i I1 I2) = log2

i I1 I2 e vi

i I1 e vi + X

i I1 e vi + X

i I2 e vi +

= ℓ({vi}i I1) + ℓ({vi}i I2) ,

which proves the second inequality.

A.5 Proof of Theorem 32

We here introduce some further notations, which will be used in our main results. For a tuple (c+, c 1 , . . . , c k ), we deﬁne I+ := {i [k]|c i = c+} and Q as the set of distinct classes in (c+, c 1 , . . . , c k ). We deﬁne pmax(T ) := maxc DT (c), τk := P c+,c i ρk+1[I+ = ], while ℓN( 0)

is deﬁned as the loss of the N-dimensional zero vector. Let T be a task sample from distribution D and ρ+(T ) be a distribution of c+ when (c+, c 1 , . . . , c k ) are sampled from ρk+1 conditioned on Q = T and I+ = and ρ+ min(T ) := min c T ρ+(T )(c).

Theorem 32 Assume that ℓsatisﬁes Assumption 14. With a probability of at least 1 δ over the choice of the training set S, for any f F, we have:

ρ+ min(T ) pmax(T ) e Lsup(T , bf) E T D

ρ+ min(T ) pmax(T ) e Lµ sup(T , bf)

e Lsun(f) + AGM τk 1 τk E c+,c i ρk+1

h ℓ|I+|( 0)|I+ = i , (A.8)

where |I+| is the cardinality of set I+.

Before proceeding with the proof of Theorem 32, we introduce some useful lemmas.

Lemma 33 For any T sampled from D, we have: ρ+(T )(c) ρ+ min(T ) pmax(T )DT (c), c T .

Lemma 34 Assume that f satisﬁes Assumption 14. For any f F, we have:

(1 τk) E T D

ρ+ min(T ) pmax(T ) e Lsup(T , f) (1 τk) E T D

ρ+ min(T ) pmax(T ) e Lµ sup(T , f)

e Lsun(f) τk E c+,c i ρk+1

h ℓ|I+|( 0)|I+ = i .

Generalization Bounds for Adversarial Contrastive Learning

Proof [Proof of Lemma 33] By the deﬁnition of ρ+(T ) and ρ+ min(T ), we can ﬁnd that:

c T , ρ+ min(T ) ρ+(T )(c).

So we have:

c T , ρ+(T )(c)

ρ+ min(T ) 1. (A.9)

By the deﬁnition of pmax(T ), we have:

c T , DT (c)

pmax(T ) 1. (A.10)

Combine (A.9) and (A.10), we have:

c T , ρ+(T )(c) ρ+ min(T ) pmax(T )DT (c).

Proof [Proof of Lemma 34] By the deﬁnition of e Lsun(f), we have:

e Lsun(f) = E c+,c i ρk+1

x,x+ D2 c+;x i Dc i

sup x U(x) ℓ f(x )T f(x+) f(x i ) k i=1

= E c+,c i ρk+1

sup x U(x) ℓ f(x )T f(x+) f(x i ) k i=1

(i) E c+,c i ρk+1

ℓ f(x )T f(x+) f(x i ) k i=1

(ii) E c+,c i ρk+1

sup x U(x) ℓ n f(x )T µc+ µc i

where (i), (ii) is from the Jensen s inequality and convexity of ℓ. Then we analyze lower bound of R1:

Zou and Liu

R1 (i)= (1 τk) E c+,c i ρk+1

sup x U(x) ℓ n f(x )T µc+ µc i

+ τk E c+,c i ρk+1

sup x U(x) ℓ n f(x )T µc+ µc i

(ii) (1 τk) E c+,c i ρk+1

sup x U(x) ℓ f(x )T (µc+ µc)

c Q,c =c+ I+ =

+ τk E c+,c i ρk+1

sup x U(x) ℓ n f(x )T µc+ µc i

(iii) (1 τk) E c+,c i ρk+1

sup x U(x) ℓ f(x )T (µc+ µc)

c Q,c =c+ I+ =

+ τk E c+,c i ρk+1

ℓ|I+|( 0) I+ = ,

where (i) comes from the property of conditional expectation, (ii) is a result of the fact that Q [k] and ℓsatisﬁes Assumption 14 and (iii) is from the fact that [|I+|] [k] and ℓsatisﬁes

Assumption 14. Let R2 = E c+,c i ρk+1

sup x U(x) ℓ f(x )T (µc+ µc)

c Q,c =c+ I+ =

R2 = E c+,c i ρk+1

sup x U(x) ℓ f(x )T (µc+ µc)

c Q,c =c+ I+ =

E c+,c i ρk+1

sup x U(x) ℓ f(x )T (µc+ µc)

c Q,c =c+ Q = T , I+ =

(ii) = E T D

E c+ ρ+(T ) x Dc+

sup x U(x) ℓ f(x )T (µc+ µc)

c T ,c =c+ #

where (i) is from the tower property of expectation and (ii) is directly obtained by the

deﬁnition of ρ+(T ). Let R3 = E c+ ρ+(T ) x Dc+

sup x U(x) ℓ f(x )T (µc+ µc)

c T ,c =c+ #

Generalization Bounds for Adversarial Contrastive Learning

R3 = E c+ ρ+(T ) x Dc+

sup x U(x) ℓ f(x )T (µc+ µc)

c T ,c =c+ #

(i) ρ+ min(T ) pmax(T ) E c+ DT x Dc+

sup x U(x) ℓ f(x )T (µc+ µc)

c T ,c =c+ #

(ii) = ρ+ min(T ) pmax(T ) e Lµ sup(T , f) (iii) ρ+ min(T ) pmax(T ) e Lsup(T , f),

where (i) is directly from Lemma 33;(ii) and (iii) are from the deﬁnition of e Lµ sup(T , f) and e Lsup(T , f), respectively. Combing (A.11) (A.12) (A.13) and (A.14) yields:

(1 τk) E T D

ρ+ min(T ) pmax(T ) e Lsup(T , f) (1 τk) E T D

ρ+ min(T ) pmax(T ) e Lµ sup(T , f)

e Lsun(f) τk E c+,c i ρk+1

h ℓ|I+|( 0)|I+ = i .

Equipped with the above lemmas, now we can turn to the proof of Theorem 32. Proof [Proof of Theorem 32] From Lemma 31 we know that with probability at least 1 δ over the choice of the training set S, f F:

e Lsun( bf) e Lsun(f) + AGM.

Combing this with Lemma 34 directly yields (A.8).

A.6 Proof of Theorem 16

Theorem 16 Suppose C is ﬁnite. For any c C, ρ(c) > 0, and ℓsatisﬁes Assumption 14. Then, with a probability of at least 1 δ over the choice of the training set S, f F:

e Lsup( bf) α(ρ) e Lsun(f) + AGM β,

where α(ρ) = 1 1 τk max |T | k+1 Tdistinct

ρ+ min(T ) is a positive constant depending on ρ and

β = α(ρ)τk E c+,c i ρk+1

h ℓ|I+|( 0)|I+ = i .

Proof Since C is ﬁnite and ρ(c) > 0 for any c C, we can see that:

1 α(1 τk) = min |T | k+1,T distinct

ρ+ min(T ) pmax(T )

> 0. (A.15)

Zou and Liu

By Theorem 32 and the deﬁnition of e Lsup(f), we have: f F, with probability at least 1 δ over the choice of the training set S,

1 α(1 τk) e Lsup( bf) (i)= E T D

1 α(1 τk) e Lsup(T , bf) (ii) E T D

ρ+ min(T ) pmax(T ) e Lsup(T , bf)

(iii) 1 1 τk

e Lsun(f) + AGM τk 1 τk E c+,c i ρk+1

h ℓ|I+|( 0)|I+ = i ,

where (i) is from the deﬁnition of e Lsup( bf);(ii) is from (A.15) and (iii) is from Theorem 32. Rearranging yields the result.

A.7 Proof of Lemma 18

Lemma 18 Let Bp(r) be the p-norm ball in Rd with radius r. The δ-covering number of Bp(r) with respect to p thus obeys the following bound:

N(δ; Bp(r), p) 1 + 2r

where N(δ; B, ) is the δ-covering number of B with respect to the norm .

Proof Set , in Lemma 17 to be p, we have:

N(δ; Bp(1), p) vol 2

δBp(1) + Bp(1)

vol (Bp(1)) = vol 1 + 2

vol (Bp(1))

d vol (Bp(1))

vol (Bp(1)) = 1 + 2

where (i) is true because Bp(1) Rd. Now suppose that {x1, , x N} is the minimal δ-covering of Bp(1), then:

x Bp(1), xi {x1, , x N} s.t. x xi p δ.

So we have: rx Bp(r), rxi {rx1, , rx N} s.t. rx rxi p rδ.

So {rx1, , rx N} is a rδ-covering of Bp(r), so we have:

N(rδ; Bp(r), p) N(δ; Bp(1), p) 1 + 2

r take place of δ, we have N(δ; Bp(r), p) 1 + 2r

Generalization Bounds for Adversarial Contrastive Learning

A.8 Proof of Theorem 22

Theorem 22 Let U(x) = {x | x x r ϵ} (i.e. consider the ℓr attack). We then have :

RS(H) = O [PP + ϵR s(r , p, m)] h ms(p , p, m)w2

where s(p, q, n) := nmax n 1 p 1

r = 1, and R is deﬁned similarly to (11).

Before giving the proof of the theorem, we introduce some lemmas will be used in our proof.

Lemma 35 For any x Rn,0 < p2 p1, we have: x p1 x p2 n

Proof [Proof of Lemma 35] Firstly, we prove x p1 x p2. For any x Rn, suppose

ai = |xi|, lep f(p) = (Pn i=1 ap i ) 1 p . In order to prove x p1 x p2, it suﬃces to prove that: ai 0, i = 1, 2, . . . , n, f(p) is non-increasing on (0, 1].

1. if ai = 0, i, f is a constant function, so f is non-increasing.

2. if i, ai = 0, suppose {ai|ai = 0} has K elements, without loss of generality, suppose {ai|ai = 0} = {a1, a2, , a K} and a1 = max 1 k K{ak}, then we have:

0, k = 1, 2, . . . , K. (A.16)

We can write f(p) = (Pn i=1 ap i ) 1 p = a1 PK k=1 ak a1

p = a1exp{ 1

pln PK k=1 ak a1

let g(p) = 1

pln PK k=1 ak a1

p , by the monotone property of composite functions, to

prove f(p) is non-increasing, it suﬃces to prove g(p) is non-increasing.Taking derivation of g yields:

PK k=1 h ak a1

p PK k=1 ak a1

p ln PK k=1 ak a1

From (A.16) we know that

PK k=1 h ak

p PK k=1 ak

p 0 and ln PK k=1 ak

p2 0, so we have

g (p) 0, so g is non-increasing, which means that x p1 x p2, x Rn.

Nextly, we prove that x p2 n

p1 x p1. By the deﬁnition of p, we have: x p1 p2 =

(Pn i=1 |xi|p2)

n Pn i=1 |xi|p2 p1

p1 p2 .Since p2 p1, i.e. p1 p2 1, we know the function

p1 p2 is convex. By Jensen s Inequality we know that:

λi s.t. λi 0 and

i=1 λi = 1, ti R : h

i=1 λih(ti). (A.17)

Zou and Liu

Set λ1 = λ2 = = λn = 1

n, ti = |xi|p2, i = 1, 2, . . . , n in (A.17), we have:

i=1 |xi|p2 ! p1

i=1 (|xi|p2)

i=1 |xi|p1.

Multiplying both sides by n

p1 p2 yields 1

n Pn i=1 |xi|p2 p1

n Pn i=1 |xi|p1 n

p1 p2 , i.e.

p1 p2 1 x p1 p1.

Taking both sides the p1th power yields x p2 n

Lemma 36 Suppose that A, B R+ s.t. A x p x q B x p (p, q 1) for any x Rn, then W Rm n: |||W|||p B

Proof [Proof of Lemma 36] If we have A x p x q B x p, then:

|||W|||p = max x p 1 Wx p (i) max x p B Wx p (ii) max x p B 1 A Wx q = max x p 1 B A Wx q = B

where (i) is from the fact that x q B x p, which means that {x| x p 1} {x| x q 1} and (ii) is from the condition that A x p x q.

Lemma 37 For any p, q 1, then W Rm n: |||W|||q nmax{ 1

p}|||W|||p.

Remark 38 If we denote nmax{ 1

p} by s(p, q, n), then we have: W Rm n, |||W|||q s(p, q, n)|||W|||p.

Proof [Proof of Lemma 37] We discuss it in two cases.

1. If p q, by Lemma 35, we have: x Rn, x q x p n 1 p 1

q x q. By Lemma 36,

we know that W Rm n, |||W|||q n 1 p 1

q |||W|||p.

2. If p q, by Lemma 35, we have: x Rn, x p x q n 1 q 1

p x p, i.e.

q x q x p x q.

By Lemma 36, we know that W Rm n, |||W|||q n 1 q 1

p |||W|||p.

So we conclude that W Rm n, |||W|||q nmax{ 1

p}|||W|||p.

Now we turn to bound the Rademacher complexity of H, where

H = h(x, x+, x ) = min x U(x)

f(x )T f(x+) f(x ) |f F ,

F = {f : x Wx|W Rn m, |||W|||p w}.

Generalization Bounds for Adversarial Contrastive Learning

Let H0 = h(x, x+, x ) = f(x)T f(x+) f(x ) |f F

= n (x, x+, x ) x T W T W(x+ x ) W Rn m, |||W|||p w o .

Proof [Proof of Theorem 22] Firstly, we drop the complicated and unwieldy min operate by directly solving the minimization problem.

h(x, x+, x ) = min x x r ϵf(x )T f(x+) f(x ) = min δ r ϵ (W(x + δ))T Wx+ Wx

= x T W T W(x+ x ) + min δ r ϵδT W T W(x+ x )

(i)= x T W T W(x+ x ) ϵ W T W(x+ x ) r ,

where (i) is from Holder s Inequality and taking the value of δ that can get equality. Deﬁne:

H1 = n (x, x+, x ) W T W(x+ x ) r W Rn m, |||W|||p w o ,

Hw 0 = n (x, x+, x ) x T A(x+ x ) A Rm m, |||A|||p w o ,

Hw 1 = n (x, x+, x ) A(x+ x ) r A Rm m, |||A|||p w o .

Since W Rn m s.t. |||W|||p w, we have:

(i) W T p |||W|||p = |||W|||p |||W|||p

(ii) s(p , p, m)|||W|||2 p s(p , p, m)w2,

p = 1 and (i) comes from the Submultiplicativity of matrix norm and (ii) is from Lemma 37. So we have:

n W |||W|||p w o n W W T W p s(p , p, m)w2o .

And since W T W is positive semi-deﬁnite, it s easy to see that:

n W T W W T W p s(p , p, m)w2o n A |||A|||p s(p , p, m)w2o .

So we have:

H0 = n (x, x+, x ) x T W T W(x+ x ) W Rn m, |||W|||p w o

n (x, x+, x ) x T W T W(x+ x ) W Rn m, W T W p s(p , p, m)w2o

n (x, x+, x ) x T A(x+ x ) A Rm m, |||A|||p s(p , p, m)w2o = Hs(p ,p,m)w2 0 .

Zou and Liu

Similarly, H1 Hs(p ,p,m)w2 1 . Given the training set S = {(xi, x+ i , x i )}M i=1, we have:

RS(H) = E σ

sup |||W|||p w

i=1 σi x T i W T W(x+ i x i ) ϵ W T W(x+ i x i ) r )

sup |||W|||p w

i=1 σi x T i W T W(x+ i x i ) )

sup |||W|||p w

σi W T W(x+ i x i ) r )

sup |||W|||p w

i=1 σi x T i W T W(x+ i x i ) )

sup |||W|||p w

σi W T W(x+ i x i ) r )

(iii) = RS(H0) + ϵ RS(H1) (iv) RS(Hs(p ,p,m)w2 0 ) + ϵ RS(Hs(p ,p,m)w2 1 ),

where σ is a Random Vector whose elements are i.i.d. Rademacher Random Variables and (i) comes from the subadditivity of sup function;(ii) is from the fact that σi has the same distribution as σi, i = 1, 2, . . . , M; (iii) is by the deﬁnition of RS(H0) and RS(H1) and (iv) is from the monotone property of Rademacher Complexity and the fact that H0 Hs(p ,p,m)w2 0 and H1 Hs(p ,p,m)w2 1 . Secondly, we upper bound the Rademacher Complexity of Hw 0 and Hw 1 . For RS(Hw 0 ): Given the training set S = {(xi, x+ i , x i )}M i=1 = {(zi)}M i=1, deﬁne the ℓ2-norm for a function in Hw 0 as:

h Hw 0 , h 2 :=

i=1 [h(zi)]2,

Deﬁne Hw 0 (S) = n (h(z1), h(z2), . . . , h(z M)) h Hw 0 o , we have that for any h Hw 0 and vh = (h(z1), h(z2), . . . , h(z M)) be the corresponding vector in Hw 0 (S), we have:

h 2 = vh 2.

So we know that any δ-covering of Hw 0 ( h1, h2, , h N ) with respect to ℓ2 norm in the functional space, corresponds to a δ-covering of Hw 0 (S) with respect to the ℓ2 norm in the Euclidean Space, i.e.

h1(z1) h1(z2) ... h1(z M)

h2(z1) h2(z2) ... h2(z M)

h N(z1) h N(z2) ... h N(z M)

Generalization Bounds for Adversarial Contrastive Learning

So we have:

N (δ; Hw 0 (S), 2) = N (δ; Hw 0 , 2) . (A.19)

By the deﬁnition of Rademacher Complexity, we know that RS(Hw 0 ) is just the expectation of the Rademacher Process with respect to Hw 0 (S), which is E[ sup θ Hw 0 (S) Xθ]. To use Lemma

20, we must show that Rademacher Process is a sub-Gaussian Process with respect to some metric ρX. Denote the Euclidean metric by ρ2, we have: for Rademacher Process {Xθ, θ T}, θ, eθ T and λ R:

E h eλ(Xθ Xeθ)i (i)= E h eλ( σ,θ σ,eθ )i = E h eλ( σ,θ eθ )i (ii)

h eλσi(θi eθi)i

i=1 e λ2(θi eθi)2

2 PM i=1(θi eθi)2 = e λ2

2 ρ2 X(θ,eθ),

where σ is a Random Vector whose elements are i.i.d. Rademacher Random Variables and (i) is from the deﬁnition of Rademacher Process; (ii) is from the expectation property of i.i.d. random variables and (iii) is from Example 2.3 in Wainwright (2019). So we proved that the Rademacher Process is a sub-Gaussian Process with respect to the Euclidean metric ρ2. So by Lemma 20 and (9), we know that: δ (0, D],

RS(Hw 0 )=E[ sup θ Hw 0 (S) Xθ] E

sup θ,eθ Hw 0 (S) (Xθ Xeθ)

sup γ,γ Hw 0 (S) ρX (γ,γ ) δ

+32J (δ/4; D),

D = sup θ,θ Hw 0 (S) θ θ 2 2 sup θ Hw 0 (S) θ 2 = 2 sup h Hw 0 h 2 = 2 sup h Hw 0

i=1 [h(zi)]2

M sup |||A|||p w

|x T i A(x+ i x i )| (ii) 2

M sup |||A|||p w

xi p A(x+ i x i ) p

M sup |||A|||p w

xi p |||A|||p x+ i x i p (iv) 4

and J (a; b) = R b a p

ln N (u; Hw 0 (S), 2)du = R b a p

ln N (u; Hw 0 , 2)du. Where (i) is from the deﬁnition of f; (ii) is from the Holder s Inequality; (iii) is a result of properties of matrix norm and (iv) is from the deﬁnition of P and P. Similar to the discussion of upper bound

Zou and Liu

for D, for all h1, h2 Hw 0 , we have:

i=1 [h1(zi) h2(zi)]2

M sup 1 i M |h1(zi) h2(zi)|

M sup 1 i M |x T i (A1 A2)(x+ i x i )|

M sup 1 i M xi p |||A1 A2|||p x+ i x i p (ii) 2

MP P|||A1 A2|||p,

where A1, A2 are the matrices corresponding to h1, h2, respectively and (i) is from the same argument as (ii), (iii) in (A.20) and (ii) is from the deﬁnition of P and P. Suppose NA = A1, , AN is a δ 2PP

M -covering of SA = n A |||A|||p w o with respect to ||| |||p , i.e.: A SA , Aj NA, s.t. A Aj p δ 2PP

Combine this with (A.21), let A be the matrix corresponding to h and Aj be the matrix corresponding to hj and let Nh = h1, h2, , h N , we have:

h Hw 0 , hj Nh, s.t. h hj 2 2

So Nh is a δ-covering of Hw 0 with respect to 2 , so we have:

N(δ; Hw 0 , 2) N( δ 2PP

M ; SA, ||| |||p).

By Lemma 18, we know that:

N(δ; Hw 0 , 2) N( δ 2PP

M ; SA, ||| |||p)

So we have:

J (0; D) = Z D

ln N (u; Hw 0 , 2)du (i) Z D

v u u tm2ln

M u du = 2m

M (iii) 8m PP w

where (i) is from (A.22); (ii) is because that for any x 0, ln(1 + x) x and (iii) is from (A.20). So take δ 0+, we have:

RS(Hw 0 ) 32J (0; D) 256m PP w

Generalization Bounds for Adversarial Contrastive Learning

For RS(Hw 1 ): The same as RS(Hw 0 ), we consider ℓ2 norm for a function h Hw 1 and deﬁne Hw 1 (S) = n (h(z1), h(z2), . . . , h(z M)) h Hw 1 o , by similar argument in (A.19), we have: N (δ; Hw 1 (S), 2) = N (δ; Hw 1 , 2) .

By the deﬁnition of Rademacher Complexity, we know that RS(Hw 1 ) is just the expectation of the Rademacher Process with respect to Hw 1 (S), which is E[ sup θ Hw 1 (S) Xθ]. By Lemma 20

and (9), we know that: δ (0, D],

RS(Hw 1 )=E[ sup θ Hw 1 (S) Xθ] E

sup θ,eθ Hw 1 (S) (Xθ Xeθ)

sup γ,γ Hw 1 (S) ρX (γ,γ ) δ

+32J (δ/4; D),

D = sup θ,θ Hw 1 (S) θ θ 2 2 sup θ Hw 1 (S) θ 2 = 2 sup h Hw 1 h 2 = 2 sup h Hw 1

i=1 [h(zi)]2

M sup |||A|||p w

A(x+ i x i ) r (ii) 2

M sup |||A|||p w

|||A|||r x+ i x i r

MR sup |||A|||p w |||A|||r (iv) 4

MR sup |||A|||p w s(r , p, m)|||A|||p 4w R s(r , p, m)

(A.23) and J (a; b) = R b a p

ln N (u; Hw 1 (S), 2)du = R b a p

ln N (u; Hw 1 , 2)du. Where (i) is from the deﬁnition of f; (ii) is a result of the properties of matrix norm; (iii) is from the deﬁnition of R and (iv) comes from Lemma 37. Similar to the discussion of upper bound for D, for all h1, h2 Hw 1 , we have:

i=1 [h1(zi) h2(zi)]2

M sup 1 i M |h1(zi) h2(zi)|

M sup 1 i M (A1 A2)(x+ i x i ) r (i)

M sup 1 i M |||A1 A2|||r x+ i x i r

MR |||A1 A2|||r (iii) 2R s(r , p, m)

M |||A1 A2|||p, (A.24) where A1, A2 are the matrices corresponding to h1, h2, respectively and (i) from the properties of matrix norm and (ii) is from the deﬁnition of R and (iii) comes from Lemma 37. Suppose NA = A1, , AN is a δ 2R s(r ,p,m)

M -covering of SA = n A |||A|||p w o with respect to

||| |||p , i.e.:

A SA , Aj NA, s.t. A Aj p δ 2R s(r , p, m)

Zou and Liu

Combine this with (A.24), let A be the matrix corresponding to h and Aj be the matrix corresponding to hj and let Nh = h1, h2, , h N , we have:

h Hw 1 , hj Nh, s.t. h hj 2 2R s(r , p, m)

M δ 2R s(r , p, m)

So Nh is a δ-covering of Hw 1 with respect to 2 , so we have:

N(δ; Hw 1 , 2) N( δ 2R s(r , p, m)

M ; SA, ||| |||p).

By Lemma 18, we know that:

N(δ; Hw 1 , 2) N( δ 2R s(r , p, m)

M ; SA, ||| |||p)

1 + 4R s(r , p, m)w

(A.25) So we have:

J (0; D) = Z D

ln N (u; Hw 1 , 2)du (i) Z D

v u u tm2ln

1 + 4R s(r , p, m)w

4R s(r , p, m)w

M u du = 2m p

R s(r , p, m)w 4

R s(r , p, m)w D 4

M (iii) 8m R s(r , p, m)w

where (i) is from (A.25); (ii) is because that for any x 0, ln(1 + x) x and (iii) is from (A.23). So take δ 0+, we have:

RS(Hw 1 ) 32J (0; D) 256m R s(r , p, m)w

Combine upper bounds of RS(Hw 0 ) and RS(Hw 1 ) with (A.18), we have:

RS(H) RS(Hs(p ,p,m)w2 0 ) + ϵ RS(Hs(p ,p,m)w2 1 )

256m PP s(p , p, m)w2

M + ϵ 256m R s(r , p, m)s(p , p, m)w2

= 256m s(p , p, m)w2

M (PP + ϵR s(r , p, m))

= O (PP + ϵR s(r , p, m)) m s(p , p, m)w2

Generalization Bounds for Adversarial Contrastive Learning

A.9 Proof of Theorem 24

Theorem 24 Let U(x) = {x | x x p ϵ} (i.e. consider the ℓp attack), σ(0) = 0 with Lipschitz constant L and let F = Wdσ(Wd 1σ( σ(W1x))) |||Wl|||F MF l , l =1, . . . , d . We then have:

l=1 hlhl 1K

where K = 2BF X,ϵ BF X+ + BF X , where

BF X,ϵ =Ld 1 d Y

l=1 MF l max n 1, m 1 2 1

p o ( X p, +ϵ) , BF X =Ld 1 d Y

l=1 MF l max n 1, m 1 2 1

Before giving the proof, we ﬁrstly introduce some useful lemmas.

Lemma 39 If x i U(xi) = x i xi x i p ϵ , then for 1 r + 1

r = 1, we have:

x i r max n 1, m1 1

p o ( X p, + ϵ) .

Proof [Proof of Lemma 39] We divide it into two cases.

1. If p r , then by Holder s Inequality with 1 r = 1

s, we have:

x i r sup 1 s x i p = 1 s x i p = m 1 s x i p = m1 1

where the equality holds when all the entries are equal.

2. If p < r , by Lemma 35, we have x i r x i p, where the equality holds when one of the entries of x i equals to one and the others equal to zero.

Then we have:

x i r max n 1, m1 1

p o x i p max n 1, m1 1

p o ( xi p + xi x i p)

max n 1, m1 1

p o ( X p, + ϵ) .

Lemma 40 Let A Rm n, b Rn, then we have: A b 2 |||A|||F b 2.

Proof [Proof of Lemma 40] Let Ai be the rows of A, i = 1, 2, . . . , m, we have:

i=1 (Aib)2 (i)

i=1 Ai 2 2 b 2 2 =

i=1 Ai 2 2 q

b 2 2 = |||A|||F b 2,

where (i) is from Holder s Inequality.

Zou and Liu

Lemma 41 Suppose σ is a L-Lipschitz function, then the elementwise vector map corresponding to σ is also L-Lipschitz with respect to 2.

Proof [Proof of Lemma 41]

σ(x) σ(y) 2 =

i=1 (σ(x)i σ(y)i)2 =

i=1 (σ(xi) σ(yi))2

i=1 L2(xi yi)2 = L

i=1 (xi yi)2 = L x y 2,

where (i) is because σ is L-Lipschitz.

Now we can turn to the proof of Theorem 24. Proof [Proof of Theorem 24] In this case, let U(x) = x x x p ϵ , we have:

F = x Wdσ(Wd 1σ( σ(W1x))) |||Wl|||F MF l , l = 1, . . . , d ,

H = h(x, x+, x ) = min x U(x)

f(x )T f(x+) f(x ) |f F .

Let SF l = Wl Rhl hl 1 |||Wl|||F MF l , l = 1, 2, . . . , d. Let CF l be the δl-covering of SF l and deﬁne:

Fc = n fc : x W c dσ W c d 1σ ( σ(W c 1x)) W c l CF l , l = 1, 2, . . . , d o F,

Hc = hc(x, x+, x ) = min x U(x)

f(x )T f(x+) f(x ) |f Fc H.

Similar to the proof of Theorem 22, we know that the Rademacher Process is a sub-Gaussian Process with respect to the Euclidean metric, which induces the ℓ2 norm. Given the training set S = {(xi, x+ i , x i )}M i=1 {zi}M i=1, deﬁne the ℓ2-norm for a function in H as:

h H, h 2 :=

i=1 [h(zi)]2.

Deﬁne H(S) = n (h(z1), h(z2), . . . , h(z M)) h H o , we have that for any h H and vh = (h(z1), h(z2), . . . , h(z M)) be the corresponding vector in H(S), we have h 2 = vh 2. So we know that any δ-covering of H ( h1, h2, , h N ) with respect to ℓ2 norm in the functional space, corresponds to a δ-covering of H(S) with respect to the ℓ2 norm in the Euclidean Space, i.e.

h1(z1) h1(z2) ... h1(z M)

h2(z1) h2(z2) ... h2(z M)

h N(z1) h N(z2) ... h N(z M)

Generalization Bounds for Adversarial Contrastive Learning

So we have:

N (δ; H(S), 2) = N (δ; H, 2) .

By the deﬁnition of Rademacher Complexity, we know that RS(H) is just the expectation of the Rademacher Process with respect to H(S), which is E[ sup θ H(S) Xθ].

So by Lemma 20 and (9), we know that: δ (0, D]:

RS(H)=E[ sup θ H(S) Xθ] E

sup θ,eθ H(S) (Xθ Xeθ)

sup γ,γ H(S) γ γ 2 δ

+ 32J (δ/4; D),

D= sup θ,θ H(S) θ θ 2 2 sup θ H(S) θ 2 =2sup h H h 2 =2sup h H

i=1 [h(zi)]2 2

M sup h H 1 i M

(A.26) and J (a; b) = R b a p

ln N (u; H(S), 2)du = R b a p

ln N (u; H, 2)du. For any f F, x X, let xl be the output of x passing through the ﬁrst l 1 layers, we have:

f(x) 2 = Wdσ(Wd 1xd 1) 2 (i) |||Wd|||F σ(Wd 1xd 1) 2 (ii) = |||Wd|||F σ(Wd 1xd 1) σ(0) 2

(iii) LMF d Wd 1xd 1 2 Ld 1 d Y

l=2 MF l W1x 2 Ld 1 d Y

l=1 MF l x 2

(iv) Ld 1 d Y

l=1 MF l max n 1, m 1 2 1

p o x p Ld 1 d Y

l=1 MF l max n 1, m 1 2 1

where (i) is from Lemma 40; (ii) is from the fact that σ(0) = 0; (iii) comes from the assumption that σ is L-Lipschitz and |||Wd|||F MF d and (iv) is attained by setting r = r = 2 in the proof of Lemma 39. To simplify the notations, we deﬁne:

BF X,ϵ =Ld 1 d Y

l=1 MF l max n 1, m 1 2 1

p o ( X p, +ϵ) , BF X =Ld 1 d Y

l=1 MF l max n 1, m 1 2 1

So we have:

x X, f F, f(x) 2 BF X. (A.27)

Similarly, we have:

x X, f F, x x p ϵ, f(x ) 2 BF X,ϵ. (A.28)

Zou and Liu

For any h H, z X 3, let x = arg min x x p ϵ f(x )T (f(x+) f(x )) and let xl be the output

of x passing through the ﬁrst l 1 layers, we have:

|h(z)| = | inf x x p ϵf(x )T f(x+) f(x ) | = |f(x )T f(x+) f(x ) |

f(x ) 2 f(x+) f(x ) 2 (i) BF X,ϵ (BF X+ + BF X ),

where (i) is from (A.27) and (A.28). So we have:

M sup h H 1 i M

MBF X,ϵ (BF X+ + BF X )

where (i) is from (A.26). Now, we need to ﬁnd the smallest distance between H and Hc, i.e.

sup h H inf hc Hc h hc 2.

By the discussion in (A.26), we have h hc 2

M max 1 i M|h(zi) hc(zi)|. For any

zi = (xi, x+ i , x i ), i = 1, 2, . . . , M, given h and hc such that |||Wl W c l |||F δl, l = 1, 2, . . . , d, we have:

|h(zi) hc(zi)| = inf x i xi p ϵf(x i)T f(x+ i ) f(x i ) inf x i xi p ϵfc(x i)T fc(x+ i ) fc(x i ) .

Let x i = arg inf x i xi p ϵ f(x i)T f(x+ i ) f(x i ) and xc i = arg inf x i xi p ϵ fc(x i)T fc(x+ i ) fc(x i ) ,

yi = xc i f(x i )T f(x+ i ) f(x i ) fc(xc i)T fc(x+ i ) fc(x i )

x i otherwise .

Then we have:

|h(zi) hc(zi)| = |f(x i )T f(x+ i ) f(x i ) fc(xc i)T fc(x+ i ) fc(x i ) |

(i) |f(yi)T f(x+ i ) f(x i ) fc(yi)T fc(x+ i ) fc(x i ) |

= |f(yi)T f(x+ i ) f(x i ) fc(yi)T f(x+ i ) f(x i )

+ fc(yi)T f(x+ i ) f(x i ) fc(yi)T fc(x+ i ) fc(x i ) |

(ii) | (f(yi) fc(yi))T f(x+ i ) f(x i ) |

+ |fc(yi)T f(x+ i ) fc(x+ i ) | + |fc(yi)T f(x i ) fc(x i ) |

(iii) (BF X+ + BF X ) f(yi) fc(yi) 2 + BF X,ϵ f(x+ i ) fc(x+ i ) 2

+ BF X,ϵ f(x i ) fc(x i ) 2, (A.30)

Generalization Bounds for Adversarial Contrastive Learning

where (i) is easily veriﬁed by the deﬁnition of yi; (ii) is from the triangle inequality and (iii) is from (A.27) and (A.28). Deﬁne ga b ( ) as:

ga b (y) = Wbσ (Wb 1σ ( Wa+1σ (W c a σ(W c 1y)))) .

Then we have:

f(yi) fc(yi) 2 = g0 d(yi) gd d(yi) 2

= g0 d(yi) g1 d(yi) + g1 d(yi) g2 d(yi) + + gd 1 d (yi) gd d(yi) 2

(i) g0 d(yi) g1 d(yi) 2 + + gd 1 d (yi) gd d(yi) 2,

where (i) is from the triangle inequality. Then we calculate gl 1 d (yi) gl d(yi) 2, l = 1, 2, . . . , d:

gl 1 d (yi) gl d(yi) 2 = Wdσ gl 1 d 1(yi) Wdσ gl d 1(yi) 2

(i) |||Wd|||F σ gl 1 d 1(yi) σ gl d 1(yi) 2

(ii) LMF d gl 1 d 1(yi) gl d 1(yi) 2

j=l+1 MF j Wlσ gl 1 l 1(yi) W c l σ gl 1 l 1(yi) 2

j=l+1 MF j (Wl W c l ) σ gl 1 l 1(yi) 2

j=l+1 MF j δl σ gl 1 l 1(yi) 2,

where (i) is from Lemma 40; (ii) comes from the assumption that σ is L-Lipschitz and |||Wd|||F MF d ; (iii) is from the deﬁnition of ga b ( ) and (iv) is from Lemma 40 and the choice of hc when h is ﬁxed, which means that |||Wl W c l |||F δl.

Next, we upper bound σ gl 1 l 1(yi) 2:

σ gl 1 l 1(yi) 2 = σ gl 1 l 1(yi) σ(0) 2 (i) L gl 1 l 1(yi) 2 = L W c l 1σ gl 2 l 2(yi) 2

(ii) L W c l 1 F σ gl 2 l 2(yi) 2 (iii) L MF l 1 σ gl 2 l 2(yi) 2

j=1 MF j yi 2,

Zou and Liu

where (i) is because σ is L-Lipschitz; (ii) is from Lemma 40 and (iii) is because W c l 1 F MF l 1. From (A.31) and (A.32) we have:

gl 1 d (yi) gl d(yi) 2 Ld 1 Qd j=1 MF j MF l δl yi 2

(i) Ld 1 Qd j=1 MF j MF l δl max n 1, m 1 2 1

p o ( X p, + ϵ) = BF X,ϵ δl MF l ,

where (i) is from Lemma 39. Similarly:

gl 1 d (x+ i ) gl d(x+ i ) 2 Ld 1 Qd j=1 MF j MF l δl x+ i 2

Ld 1 Qd j=1 MF j MF l δl max n 1, m 1 2 1

p o X+ p, = BF X+ δl MF l ,

gl 1 d (x i ) gl d(x i ) 2 Ld 1 Qd j=1 MF j MF l δl x i 2

Ld 1 Qd j=1 MF j MF l δl max n 1, m 1 2 1

p o X p, = BF X δl MF l .

Combining the above with (A.30) yields:

|h(zi) hc(zi)| BF X+ + BF X g0 d(yi) g1 d(yi) 2 + + gd 1 d (yi) gd d(yi) 2

+ BF X,ϵ g0 d(x+ i ) g1 d(x+ i ) 2 + + gd 1 d (x+ i ) gd d(x+ i ) 2

+ BF X,ϵ g0 d(x i ) g1 d(x i ) 2 + + gd 1 d (x i ) gd d(x i ) 2

= (BF X+ + BF X )BF X,ϵ

δl MF l + BF X,ϵBF X+

δl MF l + BF X,ϵBF X

= 2BF X,ϵ(BF X+ + BF X )

δl MF l = K

So h hc 2 =

M max 1 i M|h(zi) hc(zi)|

M Pd l=1 Kδl MF l . Let δl = MF l δ d K

M , we have:

K MF l MF l δ

Then: h H, hc Hc s.t. h hc 2 δ, which means that sup h H inf hc Hc h hc 2 δ when

choosing δl = MF l δ d K

Generalization Bounds for Adversarial Contrastive Learning

So Hc is a δ-covering of H, and N(δ; H, 2) |Hc| = Qd l=1 |CF l |. By Lemma 18 we

know that |CF l | = N( MF l δ d K

M ; SF l , ||| |||F ) 1 + 2d K

M δ hl hl 1. So we have:

N(δ; H, 2) |Hc| =

l=1 |CF l |

!Pd l=1 hl hl 1 . (A.33)

So we can conclude that:

J (0; D) = Z D

ln N (u; H, 2)du (i) Z D

l=1 hl hl 1

l=1 hl hl 1

v u u t2d K

l=1 hl hl 1 4

v u u t2d KD

l=1 hl hl 1 4

l=1 hl hl 1

where (i) is from (A.33); (ii) comes from the fact that ln(1 + x) x, x 0 and (iii) comes from (A.29).

Since we shows that RS(H) 2E

sup γ,γ H(S) γ γ 2 δ

+ 32J (δ/4; D) before, take

δ 0+, we have:

RS(H) 32J (0; D) 64

l=1 hl hl 1

l=1 hl hl 1

A.10 Proof of Theorem 26

Theorem 26 Let U(x) = {x | x x p ϵ} (i.e. consider the ℓp attack), σ(0) = 0 with Lip-

schitz constant L; moreover, let F = n Wdσ(Wd 1σ( σ(W1x))) Wl 1, M1, l , l=1, . . . ,d o . We then have:

l=1 hlhl 1 p

K0 = 2B1, X,ϵ B X+ + B X , K1 = K0

2 + B X,ϵ B1, X+ + B1, X ,

Zou and Liu

B X,ϵ = Ld 1 d Y

l=1 hl M1, l m1 1

p ( X p, + ϵ) , B X = Ld 1 d Y

l=1 hl M1, l m1 1

B1, X,ϵ = Ld 1 d Y

l=1 M1, l ( X p, + ϵ) , B1, X = Ld 1 d Y

l=1 M1, l X p, .

Before giving the proof, we ﬁrstly introduce some useful lemmas.

Lemma 42 Let A Rm n, b Rn, then we have: A b A 1, b .

Proof [Proof of Lemma 42] Let Ai be the rows of A, i = 1, 2, . . . , m, we have:

A b = max 1 i m|Aib| (i) max 1 i m ( Ai 1 b ) = A 1, b ,

where (i) is from the Holder s Inequality.

Lemma 43 Let A Rm n, b Rn, then we have:

A b 1 A ,1 b 1 m A 1, b 1.

Proof [Proof of Lemma 43] Let Ai be the rows of A, i = 1, 2, . . . , m, we have:

i=1 |Aib| (i)

i=1 ( Ai b 1) = A ,1 b 1,

where (i) is from the Holder s Inequality. And we have:

A ,1 = ( A1 , , Am ) 1 (i) ( A1 1, , Am 1) 1

(ii) m ( A1 1, , Am 1) = m A 1, ,

where (i) is from the fact that x x 1 and (ii) is the from the fact that for all x Rm, x 1 m x . So we have: A b 1 A ,1 b 1 m A 1, b 1.

Lemma 44 Suppose σ is a L-Lipschitz function, then the elementwise vector map corresponding to σ is also L-Lipschitz with respect to .

Proof [Proof of Lemma 44]

σ(x) σ(y) = max 1 i n|σ(x)i σ(y)i| = max 1 i n|σ(xi) σ(yi)|

(i) max 1 i n L|xi yi| = L x y ,

Generalization Bounds for Adversarial Contrastive Learning

where (i) is because σ is L-Lipschitz.

Now we can turn to the proof of Theorem 26. Proof [Proof of Theorem 26] In this case, let U(x) = x x x p ϵ , we have:

F = n x Wdσ(Wd 1σ( σ(W1x))) Wl 1, M1, l , l = 1, . . . , d o ,

H = h(x, x+, x ) = min x U(x)

f(x )T f(x+) f(x ) |f F .

Let S1, l = n Wl Rhl hl 1 Wl 1, M1, l o , l = 1, 2, . . . , d. Let C1, l be the δl-covering

of S1, l and deﬁne:

Fc = n fc : x W c dσ W c d 1σ ( σ(W c 1x)) W c l C1, l , l = 1, 2, . . . , d o F,

Hc = hc(x, x+, x ) = min x U(x)

f(x )T f(x+) f(x ) |f Fc H.

Similar to the proof of Theorem 22, we know that the Rademacher Process is a sub-Gaussian Process with respect to the Euclidean metric, which induces the ℓ2 norm. Similar to the proof of Theorem 24, given the training set S = {(xi, x+ i , x i )}M i=1 {zi}M i=1, deﬁne the ℓ2-norm for a function in H as:

h H, h 2 :=

i=1 [h(zi)]2.

Deﬁne H(S) = n (h(z1), h(z2), . . . , h(z M)) h H o , with the same argument as in proof of Theorem 24, we have: N (δ; H(S), 2) = N (δ; H, 2) ,

and RS(H) = E[ sup θ H(S) Xθ].

So by Lemma 20 and (9), we know that: δ (0, D]:

RS(H)=E[ sup θ H(S) Xθ] E

sup θ,eθ H(S) (Xθ Xeθ)

sup γ,γ H(S) γ γ 2 δ

+32J (δ/4; D),

D= sup θ,θ H(S) θ θ 2 2 sup θ H(S) θ 2 =2sup h H h 2 =2sup h H

i=1 [h(zi)]2 2

M sup h H 1 i M

(A.34) and J (a; b) = R b a p

ln N (u; H(S), 2)du = R b a p

ln N (u; H, 2)du

Zou and Liu

Then for any f F, x X, let xl be the output of x passing through the ﬁrst l 1 layers, we have:

f(x) = Wdσ(Wd 1xd 1) (i) Wd 1, σ(Wd 1xd 1)

(ii) = Wd 1, σ(Wd 1xd 1) σ(0) (iii) LM1, d Wd 1xd 1

l=2 M1, l W1x Ld 1 d Y

l=1 M1, l x (iv) Ld 1 d Y

l=1 M1, l max 1, m p x p

l=1 M1, l x p Ld 1 d Y

l=1 M1, l X p, ,

where (i) is from Lemma 42; (ii) is from the fact that σ(0) = 0; (iii) comes from the assumption that σ is L-Lipschitz and |||Wd|||1, M1, d and (iv) is attained by setting r = 1, r = in the proof of Lemma 39. To simplify the notations, we deﬁne:

B1, X = Ld 1

l=1 M1, l X p, , B X = Ld 1

l=1 hl M1, l

B1, X,ϵ = Ld 1

l=1 M1, l ( X p, + ϵ) , B X,ϵ = Ld 1

l=1 hl M1, l

p ( X p, + ϵ) .

So: x X, f F, f(x) B1, X .

Similarly, we have:

x X, f F, x x p ϵ, f(x ) B1, X,ϵ . (A.35)

f(x) 1 = Wdσ(Wd 1xd 1) 1 (i) hd Wd 1, σ(Wd 1xd 1) 1

(ii) = hd Wd 1, σ(Wd 1xd 1) σ(0) 1 (iii) hd L M1, d Wd 1xd 1 1

l=2 hl M1, l W1x 1 Ld 1 d Y

l=1 hl M1, l x 1

(iv) Ld 1 d Y

l=1 hl M1, l

max n 1, m1 1

l=1 hl M1, l

p x p Ld 1 d Y

l=1 hl M1, l

Generalization Bounds for Adversarial Contrastive Learning

where (i) is from Lemma 43; (ii) is from the fact that σ(0) = 0; (iii) comes from the assumption that σ is L-Lipschitz and |||Wd|||1, M1, d and (iv) is attained by setting r = , r = 1 in the proof of Lemma 39. So we know that:

x X, f F, f(x) 1 B X. (A.36)

Similarly, we have:

x X, f F, x x p ϵ, f(x ) 1 B X,ϵ. (A.37)

For any h H, z X 3, let x = arg min x x p ϵ f(x )T (f(x+) f(x )) and let xl be the output

of x passing through the ﬁrst l 1 layers, we have:

|h(z)| = | inf x x p ϵf(x )T f(x+) f(x ) | = |f(x )T f(x+) f(x ) |

f(x ) f(x+) f(x ) 1 (i) B1, X,ϵ (B X+ + B X ),

where (i) is from (A.35) and (A.36). So we get:

M sup h H 1 i M

MB1, X,ϵ (B X+ + B X )

MK0, (A.38)

where (i) is from (A.34). Now, we need to ﬁnd the smallest distance between H and Hc, i.e.

sup h H inf hc Hc h hc 2.

By the discussion in (A.34), we have h hc 2

M max 1 i M|h(zi) hc(zi)|. For any

zi = (xi, x+ i , x i ), i = 1, 2, . . . , M, given h and hc such that Wl W c l 1, δl, l = 1, 2, . . . , d, we have:

|h(zi) hc(zi)| = inf x i xi p ϵf(x i)T f(x+ i ) f(x i ) inf x i xi p ϵfc(x i)T fc(x+ i ) fc(x i ) .

Let x i = arg inf x i xi p ϵ f(x i)T f(x+ i ) f(x i ) and xc i = arg inf x i xi p ϵ fc(x i)T fc(x+ i ) fc(x i ) ,

yi = xc i f(x i )T f(x+ i ) f(x i ) fc(xc i)T fc(x+ i ) fc(x i )

x i otherwise .

Zou and Liu

Then we have:

|h(zi) hc(zi)| = |f(x i )T f(x+ i ) f(x i ) fc(xc i)T fc(x+ i ) fc(x i ) |

(i) |f(yi)T f(x+ i ) f(x i ) fc(yi)T fc(x+ i ) fc(x i ) |

= |f(yi)T f(x+ i ) f(x i ) fc(yi)T f(x+ i ) f(x i )

+ fc(yi)T f(x+ i ) f(x i ) fc(yi)T fc(x+ i ) fc(x i ) |

(ii) | (f(yi) fc(yi))T f(x+ i ) f(x i ) |

+ |fc(yi)T f(x+ i ) fc(x+ i ) | + |fc(yi)T f(x i ) fc(x i ) |

(iii) (B X+ + B X ) f(yi) fc(yi) + B X,ϵ f(x+ i ) fc(x+ i )

+ B X,ϵ f(x i ) fc(x i ) , (A.39) where (i) is easily veriﬁed by the deﬁnition of yi; (ii) is from the triangle inequality and (iii) is from (A.36) and (A.37). Again we deﬁne ga b ( ) as:

ga b (y) = Wbσ (Wb 1σ ( Wa+1σ (W c a σ(W c 1y)))) .

f(yi) fc(yi) = g0 d(yi) gd d(yi)

= g0 d(yi) g1 d(yi) + g1 d(yi) g2 d(yi) + + gd 1 d (yi) gd d(yi)

(i) g0 d(yi) g1 d(yi) + + gd 1 d (yi) gd d(yi) ,

where (i) is from the triangle inequality. Then we calculate gl 1 d (yi) gl d(yi) , l = 1, 2, . . . , d:

gl 1 d (yi) gl d(yi) = Wdσ gl 1 d 1(yi) Wdσ gl d 1(yi)

(i) Wd 1, σ gl 1 d 1(yi) σ gl d 1(yi)

(ii) LM1, d gl 1 d 1(yi) gl d 1(yi)

j=l+1 M1, j Wlσ gl 1 l 1(yi) W c l σ gl 1 l 1(yi)

j=l+1 M1, j (Wl W c l ) σ gl 1 l 1(yi)

j=l+1 M1, j δl σ gl 1 l 1(yi) ,

Generalization Bounds for Adversarial Contrastive Learning

where (i) is from Lemma 42; (ii) comes from the assumption that σ is L-Lipschitz and Wd 1, M1, d ; (iii) is from the deﬁnition of ga b ( ) and (iv) is from Lemma 42 and the choice of hc when h is ﬁxed, which means that Wl W c l 1, δl.

Next, we upper bound σ gl 1 l 1(yi) :

σ gl 1 l 1(yi) = σ gl 1 l 1(yi) σ(0) (i) L gl 1 l 1(yi)

= L W c l 1σ gl 2 l 2(yi) (ii) L W c l 1 1, σ gl 2 l 2(yi)

(iii) L M1, l 1 σ gl 2 l 2(yi) Ll 1

j=1 M1, j yi ,

where (i) is because σ is L-Lipschitz; (ii) is from Lemma 42 and (iii) is because W c l 1 1, M1, l 1 .

(A.40) and (A.41) show that:

gl 1 d (yi) gl d(yi) Ld 1 Qd j=1 M1, j M1, l δl yi

(i) Ld 1 Qd j=1 M1, j M1, l δl ( X p, + ϵ) = B1, X,ϵ δl M1, l ,

where (i) is from Lemma 39.

Similarly, we have:

gl 1 d (x+ i ) gl d(x+ i ) Ld 1 Qd j=1 M1, j M1, l δl x+ i

Ld 1 Qd j=1 M1, j M1, l δl X+ p, = B1, X+ δl M1, l ,

gl 1 d (x i ) gl d(x i ) Ld 1 Qd j=1 M1, j M1, l δl x i

Ld 1 Qd j=1 M1, j M1, l δl X p, = B1, X δl M1, l .

Zou and Liu

Combine the above with (A.39):

|h(zi) hc(zi)| B X+ + B X g0 d(yi) g1 d(yi) + + gd 1 d (yi) gd d(yi)

+ B X,ϵ g0 d(x+ i ) g1 d(x+ i ) + + gd 1 d (x+ i ) gd d(x+ i )

+ B X,ϵ g0 d(x i ) g1 d(x i ) + + gd 1 d (x i ) gd d(x i )

= (B X+ + B X )B1, X,ϵ

δl M1, l + B X,ϵB1, X+

δl M1, l + B X,ϵB1, X

= h B1, X,ϵ (B X+ + B X ) + B X,ϵ(B1, X+ + B1, X ) i d X

δl M1, l K1

So h hc 2 =

M max 1 i M|h(zi) hc(zi)|

M Pd l=1 K1δl M1, l . Let δl = M1, l δ d K1

K1 M1, l M1, l δ

which means that: h H, hc Hc s.t. h hc 2 δ, so we have sup h H inf hc Hc h hc 2 δ

when choosing δl = M1, l δ d K1

So Hc is a δ-covering of H, and N(δ; H, 2) |Hc| = Qd l=1 |C1, l |. By Lemma 18 we

know that: |C1, l | = N( M1, l δ d K1

M ; S1, l , 1, ) 1 + 2d K1

M δ hl hl 1. This means:

N(δ; H, 2) |Hc| =

l=1 |C1, l |

!Pd l=1 hl hl 1 . (A.42)

So we can conclude that:

J (0; D) = Z D

ln N (u; H, 2)du (i) Z D

l=1 hl hl 1

l=1 hl hl 1

v u u t2d K1

l=1 hl hl 1 4

v u u t2d K1D

l=1 hl hl 1 4

l=1 hl hl 1 p

where (i) is from (A.42); (ii) comes from the fact that ln(1 + x) x, x 0 and (iii) comes from (A.38).

Generalization Bounds for Adversarial Contrastive Learning

1 2 4 6 8 10 block size

clean accuracy(%)

pgd-0.01 pgd-0.02 fgsm-0.01 fgsm-0.02

(a) Inﬂuence on clean accuracy

1 2 4 6 8 10 block size

adv accuracy(%)

pgd-0.01 pgd-0.02 fgsm-0.01 fgsm-0.02

(b) Inﬂuence on adversarial accuracy

Figure B.1: The eﬀect of block size on the accuracy. In the ﬁgure, we show the clean accuracy and the adversarial accuracy of the mean classiﬁer under PGD and FGSM attack with ϵ = 0.01 and ϵ = 0.02. The block size is choosen from {1, 2, 4, 6, 8, 10}. (a) The inﬂuence on the clean accuracy; (b) The inﬂuence on the adversarial accuracy.

Since we shows that RS(H) 2E

sup γ,γ H(S) γ γ 2 δ

+ 32J (δ/4; D) before, take

δ 0+, we have:

RS(H) 32J (0; D) 64

l=1 hl hl 1 p

l=1 hl hl 1 p

Appendix B. Extra Experimental Results

In this section, we present our experimental results for CIFAR-100. The basic settings are the same as 6.

B.1 Improvement from Regularizer

Table B.1 shows that the F-norm regularizer helps to improve the adversarial robustness of the model, which agrees with our Theorem 24.

Zou and Liu

Attack ϵ Type λ 0 0.002 0.005 0.01 0.02

PGD 0.01 Clean 84.07 85.65 85.48 85.52 58.53 Adv 78.49 79.55 79.39 79.41 79.44

0.02 Clean 81.87 81.91 81.93 82.08 81.97 Adv 73.01 73.05 73.07 73.16 73.07

FGSM 0.01 Clean 84.61 84.53 84.95 84.58 84.55 Adv 78.70 78.64 79.18 78.70 78.63

0.02 Clean 80.47 80.48 80.39 80.44 82.09 Adv 72.45 72.45 72.36 72.41 73.87

Table B.1: Results of experiments on the regularizer on data set CIFAR-100. In this table, we list the clean accuracy (Clean) and adversarial accuracy (Adv) of the mean classiﬁer under the PGD and FGSM attack with ϵ = 0.01 and ϵ = 0.02. λ is chosen from {0, 0.002, 0.005, 0.01, 0.02}, and λ = 0 indicates no regularizer.

B.2 Eﬀect of Block Size

Figure B.1 records the inﬂuence of block size on the clean accuracy and the adversarial accuracy of the model, from which we can see that a larger block size will yield better adversarial accuracy.

Pranjal Awasthi, Natalie Frank, and Mehryar Mohri. Adversarial learning guarantees for linear hypotheses and neural networks. In ICML, volume 119, pages 431 441, 2020.

Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In SP, pages 39 57, 2017.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoﬀrey E. Hinton. A simple framework for contrastive learning of visual representations. In ICML, volume 119, pages 1597 1607, 2020.

Qingyi Gao and Xiao Wang. Theoretical investigation of generalization bounds for adversarial learning of deep neural networks. Journal of Statistical Theory and Practice, 15(2):51, 2021.

Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9726 9735, 2020.

Generalization Bounds for Adversarial Contrastive Learning

Chih-Hui Ho and Nuno Vasconcelos. Contrastive learning with adversarial examples. In

Neur IPS, 2020.

Ziyu Jiang, Tianlong Chen, Ting Chen, and Zhangyang Wang. Robust pre-training by adversarial contrastive learning. In Neur IPS, 2020.

Minseon Kim, Jihoon Tack, and Sung Ju Hwang. Adversarial self-supervised contrastive learning. In Neur IPS, 2020.

Alex Krizhevsky and Geoﬀrey Hinton. Learning multiple layers of features from tiny images. 2009.

Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In ICLR, 2017.

Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and

processes. Springer Science & Business Media, 2013.

Boqi Li and Weiwei Liu. Wat: Improve the worst-class robustness in adversarial training, 2023.

Xiyuan Li, Xin Zou, and Weiwei Liu. Defending against adversarial attacks via neural dynamic system. In Neur IPS, 2022.

Xinsong Ma, Zekai Wang, and Weiwei Liu. On the tradeoﬀbetween robustness and fairness. In Neur IPS, 2022.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.

Vitaly Maiorov and Allan Pinkus. Lower bounds for approximation by MLP neural networks.

Neurocomputing, 25(1-3):81 91, 1999.

Chengzhi Mao, Amogh Gupta, Vikram Nitin, Baishakhi Ray, Shuran Song, Junfeng Yang, and Carl Vondrick. Multitask learning strengthens adversarial robustness. In ECCV, pages 158 174, 2020.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine

Learning. Adaptive computation and machine learning. MIT Press, 2012.

Omar Montasser, Steve Hanneke, and Nathan Srebro. VC classes are adversarially robustly learnable, but only improperly. In COLT, pages 2512 2530, 2019.

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In CVPR, pages 2574 2582, 2016.

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In COLT, volume 40, pages 1376 1401, 2015.

Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, volume 9910, pages 69 84, 2016.

Zou and Liu

Kento Nozawa, Pascal Germain, and Benjamin Guedj. Pac-bayesian contrastive unsupervised representation learning. In UAI, volume 124, pages 21 30, 2020.

Nikunj Saunshi, Orestis Plevrakis, Sanjeev Arora, Mikhail Khodak, and Hrishikesh Khandeparkar. A theoretical analysis of contrastive unsupervised representation learning. In ICML, volume 97, pages 5628 5637, 2019.

Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In Neur IPS, pages 5019 5031, 2018.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014.

Martin J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.

Zekai Wang and Weiwei Liu. Robustness veriﬁcation for contrastive learning. In ICML, pages 22865 22883, 2022.

Jingyuan Xu and Weiwei Liu. On robust multiclass learnability. In Neur IPS, 2022.

Dong Yin, Kannan Ramchandran, and Peter L. Bartlett. Rademacher complexity for adversarially robust generalization. In ICML, volume 97, pages 7085 7094, 2019.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.

Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P. Xing, Laurent El Ghaoui, and Michael I. Jordan. Theoretically principled trade-oﬀbetween robustness and accuracy. In ICML, volume 97, pages 7472 7482, 2019.