# On the Generalization Analysis of Adversarial Learning

Waleed Mustafa 1, Yunwen Lei 2, Marius Kloft 1

Many recent studies have highlighted the susceptibility of virtually all machine-learning models to adversarial attacks. Adversarial attacks are imperceptible changes to an input example of a given prediction model. Such changes are carefully designed to alter the otherwise correct prediction of the model. In this paper, we study the generalization properties of adversarial learning. In particular, we derive high-probability generalization bounds on the adversarial risk in terms of the empirical adversarial risk, the complexity of the function class, and the adversarial noise set. Our bounds are generally applicable to many models, losses, and adversaries. We showcase their applicability by deriving adversarial generalization bounds for the multi-class classification setting and various prediction models (including linear models and Deep Neural Networks). We also derive optimistic adversarial generalization bounds for the case of smooth losses. To the best of our knowledge, these are the first fast-rate bounds valid for adversarial deep learning.

1. Introduction

Machine learning has been shown to be susceptible to a large number of security threats (Barreno et al., 2010; Papernot et al., 2016). One such threat is adversarial examples (Szegedy et al., 2013; Biggio et al., 2013). Adversarial examples are perturbed inputs carefully designed to alter a model's prediction while being imperceptible to humans. This paper studies the generalization properties of models trained to withstand such adversarial attacks. While much work has been conducted on the algorithmic design of attacks (Carlini et al., 2017; Brendel et al., 2017; Awasthi et al., 2021; Engstrom et al., 2019) and defenses (Madry et al., 2017; Finlay & Oberman, 2019; Kannan et al., 2018), there is a lack of theoretical understanding of the generalization properties of adversarial learning.

1 Department of Computer Science, University of Kaiserslautern, Germany. 2 School of Computer Science, University of Birmingham, United Kingdom. Correspondence to: Yunwen Lei.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

The current state of the art in the generalization analysis of adversarial learning has, to our knowledge, the following limitations. First, all previous papers on adversarial generalization apply solely to additive adversaries (in which the attacker adds a small perturbation to the input sample). However, there is a recent trend in using non-additive adversaries (Hendrycks & Dietterich, 2018; Engstrom et al., 2019; Wong & Kolter, 2020; Awasthi et al., 2021). Previous generalization analyses are inapplicable in this setting. Second, all previous papers consider restricted setups based on simple models and architectures (linear models or one-hidden-layer neural networks; Yin et al., 2019; Awasthi et al., 2020) or unrealistic assumptions (Dan et al., 2020; Schmidt et al., 2018; Farnia et al., 2018; Gao & Wang, 2021), or they use surrogate losses (Khim & Loh, 2018; Yin et al., 2019). However, a surrogate upper bound does not necessarily lead to a meaningful generalization bound on the original (adversarial) loss (Awasthi et al., 2020).
Next, there exists no unified analysis applying to a large variety of models. Moreover, there are no fast-rate bounds (Srebro et al., 2010) for adversarial learning using neural networks. Finally, all previous results scale at least O(√K) in the number of label classes K and are thus inapplicable to extreme classification or structured prediction (Prabhu & Varma, 2014). We derive generalization bounds that do not suffer from the limitations mentioned above. Our contributions can be summarized as follows:

- We derive the first generalization bounds for adversarial learning valid for general (possibly non-additive) noise functions (thus covering a wide array of attacks).
- Our bounds are modular and general. They can be applied to many models, from linear models over kernel machines to neural networks. Extending them to new models is easy: it requires simply computing the ℓ∞-covering number of the model.
- We show the first generalization bounds for adversarial learning of Deep Neural Networks applying to the adversarial loss directly, not a surrogate.
- Our bounds scale O(log(K)) in the number of label classes, making them suitable for multi-label learning and structured prediction in adversarial environments.
- Finally, we show the first fast-rate bounds for adversarial deep learning.

The rest of the paper is organized as follows. We discuss related work in Section 2. In Section 3, we introduce the notation and the problem setup. Section 4 is dedicated to the main results of this paper. We apply our approach to several models and adversarial attacks in Section 5. Finally, we provide fast-rate bounds in Section 6 and discuss our findings in Section 7.

2. Related Work

In this section, we first give an overview of popular adversarial attacks and defenses and then review the related work on the generalization analysis of adversarial learning.

Adversarial attacks. Adversarial attacks are usually categorized as white-box (Carlini & Wagner, 2017) or black-box (Brendel et al., 2017), depending on the information available to the attacker. Most commonly, the attacker is constrained to alter the input by additive noise from an ℓp-ball. Recently, further (non-additive) attack models have been considered, in which the adversary manipulates the input by a non-linear transformation, either in the input space (e.g., rotation of an input image; Engstrom et al., 2019) or in a semantic representation space (e.g., in the frequency domain of an image; Awasthi et al., 2021).

Defenses. In response to such attacks, several defense mechanisms have been developed, for instance, based on regularizing the model's Lipschitz constant (Bietti et al., 2018; Cissé et al., 2017), input gradient (Hein & Andriushchenko, 2017; Ross & Doshi-Velez, 2018), or input Hessian (Mustafa et al., 2020) at training time. The most widely used defense mechanism against adversarial attacks is adversarial training (Madry et al., 2017) and its variants (Kannan et al., 2018; Zhang et al., 2019). Its key idea is to replace clean training samples with their adversarial counterparts while maintaining their correct labels. Systematic studies have shown that the resulting models are robust and can withstand a large number of attacks (Athalye et al., 2018).

Generalization analysis of adversarial learning. We now discuss the existing generalization bounds for adversarial learning. Dan et al. (2020) and Schmidt et al.
(2018) showed bounds valid only in the idealized binary classification scenario where the data is sampled from two Gaussians. The prediction function is linear in both papers, and the bounds scale linearly in the number of dimensions. Dan et al. (2020) also showed a matching lower bound. Attias et al. (2019) showed a bound based on the VC-dimension of the function class, which can be very large for many models. Their study considers attacks with a finite adversarial noise set. However, virtually all practical attacks use an uncountably infinite noise set. Gao & Wang (2021) and Farnia et al. (2018) showed bounds assuming that the attack is a priori known to the learner, which is a strong assumption since the attacker could utilize any attack available in practice. Xing et al. (2021) leveraged algorithmic stability to study the generalization of adversarial learning. Several authors have used the Rademacher complexity to study the generalization of ℓp-additive-perturbation attacks (Khim & Loh, 2018; Yin et al., 2019; Awasthi et al., 2020; Xiao et al., 2021). In contrast, our analysis applies to a much wider range of attacks. Khim & Loh (2018) introduced the tree-transform, in which the supremum over the adversarial noise set is propagated through the network layers to establish an upper bound on the adversarial loss. This upper bound, however, can be vacuous for Deep Neural Networks since its looseness grows exponentially with the depth of the network. Since our approach applies directly to the loss, it does not suffer from this problem. In addition, their bound grows as O(K) in the number of classes K, while ours is O(log(K)). Yin et al. (2019) showed generalization bounds for linear models and one-hidden-layer neural networks based on the surrogate loss introduced in Raghunathan et al. (2018) under ℓ∞-additive perturbation. Their bound does not apply to (deep) neural networks with two or more hidden layers. Our approach applies to neural networks with arbitrarily many layers (Deep Neural Networks) and is directly based on the adversarial loss, not on a surrogate. Awasthi et al. (2020) also derived bounds directly based on the adversarial loss, but only for linear models and one-hidden-layer networks. Their bound scales as O(√m) in the number of neurons m, while ours is O(log(m)).

3. Problem Setup

We now define the formal setup of adversarial learning. We start by defining general statistical learning, after which we introduce the adversary. Let Z = X × Y be an input-output space, where X ⊆ R^d is an input/feature space and Y = [K] := {1, . . . , K} is an output/label space. We further assume that there is an unknown probability measure D defined over Z. The goal of supervised learning is to find a function g : X → Y such that, for a given loss function ℓ_c : Y × Y → R_+, the expected loss E_{(x,y)∼D}[ℓ_c(g(x), y)] is minimized. We are interested in parameterized scoring-based classifiers built on multivariate model functions f : X → R^K. The classification function g is obtained by g(x) = arg max_{k∈Y} f(x)_k. Since the distribution D is usually unknown, we utilize a sample from it to learn g. Let {z_i = (x_i, y_i)}_{i=1}^n be an i.i.d. sample drawn from D. The classification function is selected from the hypothesis class F_W := {x ↦ f(x, w) : x ∈ X, w ∈ W}, parameterized by w ∈ W, where W is some parameter space.
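To make this setup concrete, here is a minimal sketch (our own illustration, not the authors' code; helper names such as `scores`, `predict`, and `empirical_risk` are ours) of a scoring-based classifier with a linear model f(x, w) = Wx and its empirical zero-one risk on an i.i.d. sample.

```python
import numpy as np

# Minimal sketch of the Section-3 setup: a K-class scoring function, the induced
# classifier g(x) = argmax_k f(x, w)_k, and the empirical (clean) risk.

def scores(x, W):
    """Multivariate model function f(x, w): here a linear model with W in R^{K x d}."""
    return W @ x

def predict(x, W):
    """Scoring-based classifier g(x) = argmax_{k in Y} f(x, w)_k."""
    return int(np.argmax(scores(x, W)))

def empirical_risk(W, X, Y):
    """Empirical zero-one risk (1/n) sum_i l_c(g(x_i), y_i) with l_c the 0-1 loss."""
    return float(np.mean([predict(x, W) != y for x, y in zip(X, Y)]))

# usage on synthetic data
rng = np.random.default_rng(0)
n, d, K = 100, 5, 3
X, Y = rng.normal(size=(n, d)), rng.integers(0, K, size=n)
W = rng.normal(size=(K, d))
print(empirical_risk(W, X, Y))
```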
We are interested in empirical risk minimization, with the parameter ŵ defined by

ŵ = arg min_{w∈W} (1/n) ∑_{i=1}^n ℓ(f(x_i, w), y_i),

where ℓ : R^K × Y → R_+ is a given loss function. An adversary utilizes a noise application function A : X × B → X, where B is a noise space, to modify an input of a classifier with the goal of altering its prediction. To our knowledge, all previous work on adversarial learning theory considered only the special case where A is the additive noise A(x, δ) = x + δ and B is the ℓp-ball {δ : ‖δ‖_p ≤ ε}. In contrast, we consider a more general A (see Footnote 1), and therefore our results can be applied to a broader array of attacks (e.g., Engstrom et al., 2019; Awasthi et al., 2021). Given an input example x and a learned parameter setting w, the adversary selects the noise parameter δ by

δ = arg max_{δ∈B} ℓ(f(A(x, δ); w), y).

A common strategy to train a robust model is adversarial training (Madry et al., 2017). The training is achieved by solving the min-max problem ŵ_adv = arg min_{w∈W} R̂_adv(w), where

R̂_adv(w) := (1/n) ∑_{i=1}^n max_{δ∈B} ℓ(f(A(x_i, δ); w), y_i)

measures the empirical risk of the model on the training examples subject to adversarial noise. We are interested in the adversarial generalization behavior of ŵ_adv as measured by the population risk in the adversarial setting:

R_adv(ŵ_adv) := E_D[ max_{δ∈B} ℓ(f(A(x, δ); ŵ_adv), y) ].

We refer to the difference between the population risk R_adv(w) and the empirical risk R̂_adv(w) as the generalization error of w.

Footnote 1: Since any sample x′ can be obtained from another sample x by adding δ = x′ − x, the key limitation of additive attacks lies in the restriction ‖δ‖_p ≤ ε. For example, while a rotation of an image by a small angle is considered imperceptible, it may result in ‖x − x′‖_∞ ≈ 1, where pixels are in [0, 1].

4. Main Results

In this section, we introduce our main result. Our primary tool is based on covering numbers, defined below. Roughly speaking, covering numbers measure the complexity of a function class F in terms of the number of balls required to approximate the class to a prescribed accuracy under the metric D.

Definition 4.1 (Covering number). Let (A, D) be a pseudo-metric space. We say that C ⊆ A is an (ε, D)-cover of A if

sup_{a∈A} inf_{c∈C} D(a, c) ≤ ε.

The covering number of A at precision ε, denoted N(ε, A), is the size of the minimal-cardinality set that covers A. For a data set S := {z_i}_{i=1}^n with z_i ∈ Z and a function class F with its elements taking values in a (possibly infinite-dimensional) real vector space V, the (ε, D)-empirical covering number of F on S, denoted N_D(ε, F, S), is the (ε, D)-covering number of the set F|_S = {(f(z_1), . . . , f(z_n)) : f ∈ F}. The ℓ_p pseudo-metric on V^n is

D_p(x, y) = ( (1/n) ∑_{i=1}^n ‖x_i − y_i‖^p )^{1/p},

where ‖·‖ is a general norm on V. If p = ∞ we obtain D_∞(x, y) := max_{i∈[n]} ‖x_i − y_i‖. We denote by N_p(ε, F, ‖·‖, S) and N_∞(ε, F, ‖·‖, S) the covering numbers w.r.t. the D_p and D_∞ metrics, respectively.

Our main approach is to view adversarial generalization as multi-label classification, which allows us to utilize recent advances in the generalization analysis of vector-valued learning to study adversarial generalization (Wu et al., 2021). The loss function class of interest is

G_adv := {z ↦ max_{δ∈B} ℓ(f(A(x, δ)), y) : f ∈ F}.    (1)

We define a new function class as follows. For each function f ∈ F, we construct a functional g : Z → R^B as g(z) := ℓ(f(A(x, ·)), y). That is, g receives an input-output vector z and outputs a function B → R. The corresponding function class is then

G := {z ↦ ℓ(f(A(x, ·)), y) : f ∈ F}.
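The following sketch (our own illustration, not the paper's algorithm; the helpers `margin_loss`, `A_additive`, `adversarial_loss`, and the use of a finite random subset of B are our assumptions) instantiates the noise application function A for the additive ℓ∞ attack, an element of the loss class G_adv in (1), i.e., z ↦ max_{δ∈B} ℓ(f(A(x, δ)), y), and the empirical adversarial risk R̂_adv(w). The inner maximum is approximated over a finite subset of B, foreshadowing the discretization used in Lemma 4.4.

```python
import numpy as np

# Sketch of the adversarial quantities for a linear model under an additive
# l_inf attack; the inner max over B is approximated over a finite subset of B.

def margin_loss(t, y, rho=1.0):
    """Multi-class margin loss l_rho (as used in Section 5.1), bounded in [0, 1]."""
    margin = t[y] - np.max(np.delete(t, y))
    return float(np.clip(1.0 - margin / rho, 0.0, 1.0))

def A_additive(x, delta):
    """Noise application function A(x, delta) = x + delta."""
    return x + delta

def adversarial_loss(W, x, y, B_subset, rho=1.0):
    """Approximation of max_{delta in B} l(f(A(x, delta); w), y) over a finite subset of B."""
    return max(margin_loss(W @ A_additive(x, delta), y, rho) for delta in B_subset)

def empirical_adversarial_risk(W, X, Y, B_subset, rho=1.0):
    """\\hat{R}_adv(w) = (1/n) sum_i max_{delta in B} l(f(A(x_i, delta); w), y_i)."""
    return float(np.mean([adversarial_loss(W, x, y, B_subset, rho) for x, y in zip(X, Y)]))

# usage: B = {delta : ||delta||_inf <= beta}, approximated by random points of B
rng = np.random.default_rng(0)
n, d, K, beta = 50, 5, 3, 0.1
X, Y = rng.normal(size=(n, d)), rng.integers(0, K, size=n)
W = rng.normal(size=(K, d))
B_subset = rng.uniform(-beta, beta, size=(500, d))
print(empirical_adversarial_risk(W, X, Y, B_subset))
```

In the following lemma, we introduce our first main result.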
The lemma states that the covering number of the function class G_adv is bounded by the covering number of the function class G. Therefore, the covering number of G can be used to control the adversarial generalization.

Lemma 4.2. Let G_adv and G be defined as above. It holds that

N_∞(ε, G_adv, |·|, S) ≤ N_∞(ε, G, ‖·‖_∞, S),

where ‖·‖_∞ is the supremum norm over B, i.e., ‖h‖_∞ = sup_{δ∈B} |h(δ)| for the function-valued outputs h of elements of G.

The proof of this lemma is deferred to the appendix. The lemma allows us to control the ℓ∞-covering number of the adversarial loss class by the ℓ∞-covering number of the class G. While Lemma 4.2 provides an essential tool to control the complexity of the adversarial loss class, deriving an upper bound on N_∞(ε, G, ‖·‖_∞, S) is not simple. The main challenge is that the functions in G take values in an infinite-dimensional vector space. Our approach to this problem is to approximate such a space by a finite discretization. To that end, we introduce a Lipschitzness assumption on the functions δ ↦ ℓ(f(A(x, δ)), y) for f ∈ F.

Definition 4.3 (Lipschitz continuity). Let f : V_1 → V_2 be a map from vector space V_1 to V_2. Let V_1 and V_2 be endowed with norms ‖·‖_{l_1} and ‖·‖_{l_2}, respectively. We say that f is (‖·‖_{l_2}, ‖·‖_{l_1})-Lipschitz with constant τ if, for any δ, δ′ ∈ V_1, we have

‖f(δ) − f(δ′)‖_{l_2} ≤ τ ‖δ − δ′‖_{l_1}.

When V_2 = R, then ‖·‖_{l_2} is the absolute value and the notation is simplified to ‖·‖_{l_1}-Lipschitz.

We now introduce our main result to relate the ℓ∞-covering number of the class G_adv to the covering number of the discretized version G̃_adv defined below.

Lemma 4.4. Let δ ↦ ℓ(f(A(x, δ)), y) be ‖·‖-Lipschitz with constant L for some norm ‖·‖ on B, for all x ∈ X, y ∈ Y, and f ∈ F. Let C_B(ε/2L) be an (ε/2L, ‖·‖)-cover of B. We define the loss class

G̃_adv = {(z, δ) ↦ ℓ(f(A(x, δ)), y) : f ∈ F}    (2)

and the extended training set

S̃ = {(x_i, δ, y_i) : i ∈ [n], δ ∈ C_B(ε/2L)}.    (3)

Then we have

N_∞(ε, G_adv, |·|, S) ≤ N_∞(ε, G, ‖·‖_∞, S) ≤ N_∞(ε/2, G̃_adv, |·|, S̃).

Note that functions in G_adv involve the maximum over B due to adversarial learning, which is removed for functions in G̃_adv. This is achieved by incorporating δ into the argument. Also, note that each element of G_adv and G̃_adv is a scalar function. Therefore we use |·| in the definition of covering numbers. For comparison, each element in G is a functional mapping z to a function on B. Therefore we use ‖·‖_∞ in the involved covering number. For brevity of notation, we often omit either |·| or ‖·‖_∞ when mentioning covering numbers. The proof of this lemma is deferred to the appendix.

Remark 4.5. We note that Lemma 4.4 controls the complexity of the adversarial loss class by the complexity of its non-adversarial counterpart. Therefore, it can be readily applied to a wide array of models where a covering number bound exists. This is in contrast to virtually all approaches in the literature (Yin et al., 2019; Khim & Loh, 2018; Xiao et al., 2021; Awasthi et al., 2020; Farnia et al., 2018), in which a function-class-specific approach is used.

Remark 4.6. The Lipschitzness condition on the function δ ↦ ℓ(f(A(x, δ)), y) is necessary in Lemma 4.4. It is a mild condition that most attacks fulfill (e.g., Engstrom et al., 2019; Awasthi et al., 2021; Madry et al., 2017).

Remark 4.7. The size of the extended training set S̃ grows linearly in the size of the set C_B(ε/2L). In principle, the size of C_B(ε/2L) can grow exponentially in the dimensionality of B. However, the dependence of the generalization performance on |S̃| is of the order O(log^{1/2}(|S̃|)) for many function classes (e.g., Bartlett et al., 2017; Zhang, 2002; Mustafa et al., 2021).
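To illustrate Lemma 4.4 and Remark 4.7, the sketch below (ours; the grid construction and the function names are assumptions made for illustration) builds an explicit (ε, ‖·‖_∞)-cover C_B of the ball B = {δ : ‖δ‖_∞ ≤ β} and the extended training set S̃ of (3).

```python
import itertools
import numpy as np

# Sketch: an axis-aligned grid that is an (eps, l_inf)-cover of
# B = {delta : ||delta||_inf <= beta}, and the extended set \tilde{S} from (3).
# The grid has (2*ceil(beta/eps) + 1)**dim(B) points, i.e., exponentially many
# in dim(B), as noted in Remark 4.7.

def linf_ball_cover(beta, dim, eps):
    """Grid with spacing beta/ceil(beta/eps) <= eps, so every delta in B is within eps in l_inf."""
    k = max(int(np.ceil(beta / eps)), 1)
    centers = (np.arange(-k, k + 1) / k) * beta
    return [np.array(p) for p in itertools.product(centers, repeat=dim)]

def extended_training_set(S, cover):
    """\\tilde{S} = {(x_i, delta, y_i) : i in [n], delta in C_B(eps/2L)}."""
    return [(x, delta, y) for (x, y) in S for delta in cover]

# usage: a 2-dimensional noise set of radius 0.1 covered at precision eps/2L = 0.05
cover = linf_ball_cover(beta=0.1, dim=2, eps=0.05)
S = [(np.zeros(4), 0), (np.ones(4), 1)]
S_tilde = extended_training_set(S, cover)
print(len(cover), len(S_tilde))   # |C_B| = 5**2 = 25, |S_tilde| = n * |C_B| = 50
```

Since |S̃| = n·|C_B| and |C_B| can grow exponentially in dim(B), an O(log^{1/2}(|S̃|)) dependence of the generalization bound enters only as a factor of roughly √(dim(B)), up to logarithmic terms.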
These lead to generalization bounds with a square-root dependency on the dimensionality of B. We now state our main generalization bound, which controls the generalization errors by an integral of covering numbers on Gadv. This result is modular: one needs to plug a covering-number bound into it to obtain a generalization bound for adversarial learning. Theorem 4.8. Let δ (0, 1). Suppose that the loss ℓis bounded by 1. Let G and S be defined as (2) and (3). With probability at least 1 δ over the training data S, for all w W, we have Radv(w) ˆRadv(w) + 3 log N (ϵ/2, Gadv, S)dϵ . 5. Applications In this section, we present several applications of Theorem 4.8. Our first example is multi-class linear classifiers with additive adversary. 5.1. Regularized Multi-class Linear Classifiers We consider the K-class linear classifiers with the following hypothesis class: F := n x 7 Wx : W RK d, W 2,2 Λ o . (4) We further assume that maxx X x Ψ. For a given hypothesis f F, we carry out prediction for an input x X by x 7 arg maxi [K] fi(x). The quality of prediction is measured by the multi-class margin loss defined as 1, if M(t, y) 0, 1 M(t, y)/ρ, if 0 < M(t, y) < ρ, 0, if t ρ, (5) On the Generalization Analysis of Adversarial Learning where M(t, y) = ty maxy =y ty . The margin loss (5) is an upper bound on the zero-one loss and is -Lipschitz with constant 2 ρ in the first argument for all y Y (Bartlett et al., 2017). ℓ -additive perturbation We first consider the ℓ attack (Goodfellow et al., 2014; Kannan et al., 2018), in which the attacker utilizes an additive noise application function A(x, δ) = x + δ, where x X, and the noise set is the ℓ -ball B = {δ : δ β} Rd. To apply Lemma 4.4, we first show the Lipschitzness of the function δ 7 ℓρ(W(x + δ), y). Lemma 5.1. Consider the function g W (z, δ) = ℓ(W(x + δ), y) and assume W 1, Λ1. Then, for any z, the function δ 7 g W (z, δ) is -Lipschitz with constant 2Λ1 Based on the Lipschitzness of the adversarial loss function in the above lemma, we can bound the covering number of the adversarial class in the lemma below. Lemma 5.2. Let F be the multi-class linear hypothesis class defined in (4) and ℓρ the multi-class margin loss (5). Further, let Gadv and Gadv be defined as in (1) and (2). Then, for ϵ > 0 and S defined in (3), we have log N (ϵ/2, Gadv, S) CΛ2(Ψ + dβ)2Llog ϵ2ρ2 , where C is an absolute constant, Ψ = Ψ + Llog := log ϵρ + 2 n K 12βΛ1 If we plug the above lemma back into Theorem 4.8, we get the following corollary. Corollary 5.3. With the notation above, with probability at least 1 δ over the draw of the training data, for all w W, we have Radv(w) ˆRadv(w) + 8 dβ) Llog nρ , (6) where C is an absolute constant, Ψ = Ψ + Llog =log 1 2 ρ +1 n K 12βΛ1n Remark 5.4. We note that the dependence d on d in the term Ψ + dβ is due to the mismatch between the norm on the input x and the norm in the ball B. Indeed, we have used the inequality δ 2 d δ . This dependence on d vanishes if the bound on the norms of x and δ matches (e.g., if we consider the attack where B := {δ : δ 2 β}). Remark 5.5. The term Llog incurs also a square root dependence on the dimension d. Such a dependence is attributed to the complexity of the adversarial noise set B. For example, if B is contained in a low dimensional manifold d < d, the dependence is reduced to O( d ). This motivates projecting the input onto a low-dimensional manifold to reduce the effective dimensionality of the adversarial noise. Remark 5.6. We now compare the bound (6) to the bounds in Yin et al. 
(2019); Xiao et al. (2021), Khim & Loh (2018), and Awasthi et al. (2020). The dependence of the bound (6) on the number of classes is of the order O(log(K)), while the bounds in Yin et al. (2019); Xiao et al. (2021) and Khim & Loh (2018) are of the order O(K K) and O(K), respectively. Therefore, the bound (6) is favourable in the classification setting with a large K. The dependence of the bound (6) on the input dimension is of the order O(d). On the other hand, the dependence on d in Yin et al. (2019), Awasthi et al. (2020), and Khim & Loh (2018) is of the order O( d), and thus our bound incurs an extra d term. Similar to our bound the cost of d in their bounds is due to the mismatch of the norm constraint on x and δ. Unlike the bound (6), their bounds do not incur the d resulting from the complexity of the adversarial noise set. This is because for linear models with ℓp-additive perturbation the function class transformation in their analysis is tight, and thus effectively the set B is a singleton with dimension 0. Hence, the bounds (Yin et al., 2019; Awasthi et al., 2020; Khim & Loh, 2018) are favourable for multi-class linear models with ℓp-additive perturbation when K < Adversarial spatial transformation We now consider the adversarial spatial transformation in Engstrom et al. (2019). It is based on the spatial transformer network (Jaderberg et al., 2015), which introduced a parameterized spatial transformation that is Lipschitz in the transformation parameters. For simplicity of presentation, we consider square images (x R d being an integer) and linear spatial transformations (including rotating and shearing). Therefore, the noise parameter is a matrix δ R2 2. The adversarial transformation function A is defined in the following steps: first, the indices of the image are transformed by us = δ11 δ12 δ21 δ22 where ut, vt [ d] are the target indices of the pixel at the source indices us, vs. Let U s [ d]d be the aggregation of all the us indices in one vector. Define V s, U t, V t similarly. Denote by (U s, V s) = S(U t, V t, δ) the transformation (7). On the Generalization Analysis of Adversarial Learning The output image x is then obtained by setting the value at the index (U t i , V t i ) for i [d] according to l=1 xkl max(0, 1 |V s i k|) max(0, 1 |U s i l|). (8) Let x = T (x, (U s, V s)) denote the transformation (8). The noise application function for the attack A is then defined as A(x, δ) = T (x, S(U t, V t, δ)). (9) The following lemma establishes the Lipschitzness of the functions A and δ 7 ℓρ(WA(x, δ), y). Lemma 5.7. Let A(x, δ) be the noise application function defined above. For all x 1 Ψ1, the function δ 7 A(x, δ) is ( , )-Lipschitz with constant 4Ψ1 d. Further, the function δ 7 ℓρ(WA(x, δ), y) is -Lipschitz with constant 8Ψ1Λ1 We can now derive an upper bound on the ℓ -covering number of the adversarial loss. The result is summarized in the following lemma. Lemma 5.8. Let F be the multi-class linear hypothesis class (4) and ℓρ the multi-class margin loss (5). Let Gadv and Gadv be defined as (1) and (2), respectively. Then, for ϵ > 0, δ β and S defined in (3), we have log N (ϵ/2, Gadv, S) CΛ2Ψ2Llog where C is an absolute constant, Llog := log ϵρ + 2 n K a a = C1βΛ1Ψ1 d, and C1 is a constant. Applying Lemma 5.8 and Theorem 4.8, we get the following immediate corollary. Corollary 5.9. 
With the notation above, with probability at least 1 δ over the draw of the training data, we have, for all W W, Radv(W) ˆRadv(W) + 3 n + CΛΨ Llog nρ , where C is an absolute constant and Llog := log 1 2 ρ + 1 n K an Remark 5.10. Note that unlike the bound (6), there is no direct dependence on the input dimension d outside of the log terms. This is because spatial transformations do not alter the image norm. Furthermore, the complexity of the noise set B is drastically reduced (B is four-dimensional) in comparison to the additive noise attack. Thus the important factor is the complexity of the adversarial noise set and not the input dimension. Remark 5.11. To the best of our knowledge, this is the first adversarial generalization bound valid for an attack other than the ℓp-additive attack. It is unclear how to adapt the existing approaches in the literature to general attacks. 5.2. Neural Network We now turn our attention to feed-forward neural networks under different attacks. Recall that a feed-forward network is defined as the composition of a set of L-layers. Each layer l [L] consists of a linear transformation parameterized by the matrix W l Rml ml 1, where ml is the width of layer l, followed by an element-wise non-linear Lipschitz activation function σ( ). We have m0 = d (the input dimension) and m L = K (the number of classes). Therefore the network function NW(x) is evaluated as NW(x) = W Lσ(W L 1σ( σ(W 1x)) )), (10) where W = (W 1, . . . , W L). We consider norm-bounded networks with the following hypothesis class F := {x 7 NW(x) : W W}. (11) The quality of classification is measured by the margin loss function (5). In the following, we summarize the assumptions used throughout this section. Assumption 5.12. Let W W be the weight of the network (11). Suppose that W is such that, for all W W, it is W l 2 al and W l σ sl for all l [L 1]. Further, suppose that, for all W W, it is W L 2 a L, W L 2, s L and W 1 1, s 1. ℓ -additive perturbation We now consider the ℓ - additive perturbation applied to multi-layer neural networks. As with the linear case, we first establish the - Lipschitzness of the function δ 7 ℓρ(NW(x + δ), y). The following lemma summarizes the result. Lemma 5.13. Let NW be the neural network function defined in (10). Further let ℓρ be the loss function (5). With Assumption 5.12, the function g(δ) = ℓρ(NW(x + δ), y) is -Lipschitz in δ with constant 2 ρ(QL l=2 sl)s 1 m1 for all (x, y) X Y and W W. The following lemma introduces an upper bound of the ℓ - covering number of the neural network adversarial class. Lemma 5.14. Let F be the multi-class neural network hypothesis class (11) and ℓρ be the margin loss (5). Let On the Generalization Analysis of Adversarial Learning B := {δ : δ β}. Further let Gadv and Gadv be defined as (1) and (2), respectively. Assume Assumption 5.12 holds and x 2 Ψ. Then, for S defined in (3) and ϵ > 0, we have log(N (ϵ/2, Gadv, S)) CL2Ψ 2 Llog := log (C1Ψ Γ/(ϵρ) + C2 m) n 6βλ dβ), Γ = maxl [L](QL i=1 si)alml/sl, λ = 2 ρ(QL l=2 sl)s 1 m1, m = maxl [L] ml, and C, C1, C2 are universal constants. The following corollary follows directly from the above Lemma and Theorem 4.8. Corollary 5.15. With the notation above, with probability at least 1 δ over the draw of the training data, for all W W, we have Radv(W) ˆRadv(W) + 8 where C is an absolute constant, Ψ = Ψ + Llog =log 1 2 ρ +C2 m n 6βλn Remark 5.16. Note that similar to the linear case, the bound has two d dependence. 
The first is in Ψ , which arises from the mismatch of norms and can be mitigated by controlling the appropriate norm as discussed above. The second d term in Llog is due to the complexity of the adversarial noise set B. As discussed in the linear case, a projection on a low-dimensional manifold can help reduce such dependency. Remark 5.17. The generalization bound in Yin et al. (2019) applies only to a one-hidden-layer neural network and is based on the surrogate upper bound introduced in Raghunathan et al. (2018). This is in contrast to our bound, which directly applies to multi-layer networks and the adversarial loss. While the bound in Khim & Loh (2018) applies to multi-layer networks, it is based on a harsh surrogate upper bound, which pushes the maximization through each layer, thus multiplying the bound slack. This can lead to a vacuous bound for Deep Neural Networks. Furthermore, their bound is of the order O(K d) compared to our bound O(log(K) d) (using compatible norms on x and δ). While the result in Awasthi et al. (2020) applies directly to the adversarial loss, it applies only to one-hidden-layer neural networks and binary classification. Their bound is of the order O( d m) while ours is O( d log( m)), where m is the width of the hidden layer. Adversarial spatial transformation We consider the adversarial attack based on the spatial transformation in (9). We begin by establishing the Lipschtizness of δ 7 ℓρ(NW(A(x, δ)), y). Lemma 5.18. Let g(δ) = ℓρ(NW(A(x, δ)), y) and Assumption 5.12 hold. Then for all W W, (x, y) X Y, g is -Lipschitz with constant 8 ρ(QL l=2 sl)s 1 m1Ψ1 Lemma 5.19. Let F be the multi-class neural network hypothesis class (11) and ℓρ be the margin loss (5). For the adversarial spatial attack defined above, let Gadv and Gadv be defined as in (1) and (2), respectively. Suppose that Assumption 5.12 holds, and x 1 Ψ1, x 2 Ψ, and δ β, for all δ B. Then for S defined in (3) and ϵ > 0, we have log(N ( Gadv, S, ϵ/2)) CL2Ψ2 Llog := log (C1ΨΓ/(ϵρ) + C2 m) n C3βλ C, C1, C2, C3 are universal constants, m = maxl [L] ml, Γ = maxl [L](QL i=1 si)alml/sl, and λ = (QL l=2 sl)s 1 m1(Ψ1 We can plug the above complexity bound back into Theorem 4.8 and derive the following generalization error bounds. Corollary 5.20. With the notation above, with probability at least 1 δ over the draw of the training data, for all W W, we have Radv(W) ˆRadv(W) + 8 where C is an absolute constant, and Llog =log 1 2 ρ +C2 m n 6βλn Remark 5.21. The bound has no direct dependence on d outside of log terms as with the linear case. This is due to the low complexity of the adversarial spatial transform without altering the image norm. On the Generalization Analysis of Adversarial Learning 6. Optimistic Bounds and Fast Rates In this section, we aim to derive optimistic bounds for the generalization of adversarial learning in the sense of incorporating the training errors into the generalization bounds. Optimistic bounds have been studied before in the binary-classification settings (Srebro et al., 2010) and multiclassification settings (Reeve & Kaban, 2020), where they have resulted in fast-rate bounds for smooth losses under low-noise conditions. We aim to extend these approaches to the case of adversarial examples. Our results are based on the local Rademacher complexity (Bartlett et al., 2005). Definition 6.1 (Rademacher complexity). 
The empirical Rademacher complexity of a function class F of real-valued functions with respect to a set S = {zi}n i=1 is defined as RS(F) = Eσ h sup f F i=1 σif(zi) i , (12) where {σi} are random variables with equal probability of being either +1 or 1. Our result applies loss functions ℓwith a certain smoothness condition defined by the following robust-self-bounding property. Definition 6.2. Let ℓ: RK Y R be a loss function. We say that ℓis robust-self-bounding-Lipschitz with respect to a set B if, for all y Y and all measurable functions ν : B RK and µ : B RK, we have | max δ B ℓ(ν(δ), y) max δ B ℓ(µ(δ), y)| λ max δ B ν(δ) µ(δ) max{max δ B ℓ(ν(δ), y), max δ B ℓ(µ(δ), y)}θ. Remark 6.3. The robust self-bounding property is inspired by the self-bounding property introduced in Reeve & Kaban (2020) as follows |ℓ(ν, y) ℓ(µ, y)| λ max{ℓ(ν, y), ℓ(µ, y)}θ ν µ . The key difference is the introduction of the adversarial max operator maxδ B to adapt to adversarial learning. Remark 6.4. The robust-self-bounding property is maintained by several realistic losses. For instance, consider the smooth margin loss Lρ(t, y) = 1 if M(t, y) 0 2 (M(t, y)/ρ)3 3 (M(t, y)/ρ)2+1 if 0 < M(t, y) < ρ 0 if M(t, y) ρ, defined in Reeve & Kaban (2020). This function is an upper bound on the zero-one loss, therefore, it can be considered as a surrogate loss to classification scenarios. It is further a robust-self-bounding-Lipschitz with θ = 1/2 and λ = 4 6/ρ as shown in appendix E. We now define the local hypothesis class and the local loss class. For a given hypothesis class Fadv := (x, δ) 7 f(A(x, δ), w) : w W , we define the local hypothesis class Fadv|r Fadv as the set of functions f Fadv with empirical adversarial training errors at most r. That is Fadv|r := n (x, δ, y) 7 f(A(x, δ), w) : w W, ˆRadv(w) r o . Similarly, define the local loss class Gadv|r := n (x, y) 7 max δ B ℓ(f(A(x, δ), w), y) : w W, ˆRadv(w) r o . We first introduce a structural result on covering numbers. Lemma 6.5. Let Gadv be defined as above. Assume that the loss ℓis robust-self-bounding with parameters λ > 0, θ (0, 1/2]. Further let δ 7 ℓ(f(A(x, δ), w), y) be -Lipschitz with constant L. Let Fadv := {(x, δ, y) 7 f(A(x, δ), w)y : w W} and ˆS := {(xi, δ, y) : i [n], δ CB( ϵ λ(2r)θ2L), y Y}, where CB(ϵ/2L) is an (ϵ/2L, )-cover of B. Then, we have N2(ϵ, Gadv|r, S) N ϵ 4λ(2r)θ , Fadv, ˆS . Lemma 6.5 establishes a bound on the ℓ2-covering number of the local loss class by the ℓ -covering number of the extended hypothesis class Fadv. This serves as a key step in the proof of the next lemma, which establishes a bound on the local Rademacher complexity by a sub-root function of r, a key step in developing optimistic bounds (Bartlett et al., 2005; Srebro et al., 2010). Lemma 6.6. With the notation and assumptions of Lemma 6.5, suppose that, for all w W, f(x, w) B and ℓ is bounded by b. Suppose that log N ϵ, Fadv, ˆS Rb1 for ϵ [b1, b2] and Rb1 R that does not depend on n and ϵ. Then, 24+θ + 40 q R 1 n log b1 θ n Lemma 6.6 establishes a bound on the Rademacher complexity of the loss class Gadv|r in terms of a bound on the empirical risk r. Note that we obtain a sub-root bound on the local Rademacher complexity for θ = 1/2. On the Generalization Analysis of Adversarial Learning Remark 6.7. The condition log N ϵ, Fadv, ˆS Rb1 ϵ2 is satisfied by many of the typical function calsses. 
For instance, it is satified by the neural network class (see Lemma C.1) with Rb1 = CL2Ψ 2 L Y Llog := log (C1Ψ Γ/b1) + C2 m) n 6βλ The next theorem presents the main result of this section, namely an optimistic generalization bounds for adversarial learning with a smooth loss ℓ. Theorem 6.8. With the notation and assumption of Lemmas 6.6 and 6.5, with probability at least 1 δ over the draw of the training set S, for all w W, we have Radv(w) is bounded by Radv(w) ˆRadv(w) + 106r + q ˆRadv(w) (8r + L) n (log(1/δ) log(log(n))), where L = 4b n (log(1/δ) + log(log(n))) and R 1 n log b1 θ n Remark 6.9. Note that the second term grows as O( R1/ n n ). For the majority of function classes, R1/ n is O(1), where O hides log terms, for example for linear models (Lei et al., 2019)), kernel methods (Bartlett & Mendelson, 2003), DNNs (Bartlett et al., 2017), and structured output prediction (Mustafa et al., 2021). The second term would then grow as O( 1 n). The fourth term grows at the usual O( 1 n) rate. However, if ˆRadv(w) = 0, the fourth term vanishes, thus leading to a O( 1 n) generalization bound. Remark 6.10. In Dan et al. (2020), the excess risk bound of the order O( d n) was shown under an assumption on the adversarial signal-to-noise ratio. However, it applies to linear classes in the idealized case that the true data distribution is Gaussian. On the other hand, our bound applies to any function class where the covering number or the Rademacher complexity can be bounded. Bhattacharjee et al. (2021) presented bounds in expectation of the order O(1/n) for linear and kernel-based models under the distributional assumptions of separable data. In comparison, we develop high-probability bounds applicable to any function class under milder distributional assumptions (zero empirical risk). 7. Conclusion We presented a general generalization analysis of adversarial learning. Our analysis applies to a wide range of attacks and models. To our knowledge, this is the first analysis for non-additive noise. Our approach is modular and easily applicable to a large number of models. We showcased our general results for linear models and neural networks with additive noise or the spatial transformation attack. Our analysis emphasized the importance of the complexity of the adversarial noise set B rather than the input space dimension. We further extended our analysis to the case of smooth losses, where we derived fast-rates under zero adversarial empirical risk. In future work, we will investigate mitigating the dependence on the dimension of the noise set B. Our hypothesis is that the complexity can be mitigated by carefully considering the interplay between the noise set and the function class (e.g., an additive attack on a model first reducing the input dimension to d < d should yield bounds with a dependence on d , rather than d). Acknowledgements We thank the reviewers for their constructive feedback that helped improve the paper. MK and WM acknowledge support by the Carl-Zeiss Foundation, the DFG awards KL 2698/2-1 and KL 2698/5-1, and the BMBF awards 01|S18051A, 03|B0770E, and 01|S21010C. Athalye, A., Carlini, N., and Wagner, D. A. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, 2018. Attias, I., Kontorovich, A., and Mansour, Y. Improved generalization bounds for robust learning. In Algorithmic Learning Theory, pp. 162 183. PMLR, 2019. Awasthi, P., Frank, N., and Mohri, M. 
Adversarial learning guarantees for linear hypotheses and neural networks. In International Conference on Machine Learning, pp. 431 441. PMLR, 2020. Awasthi, P., Yu, G., Ferng, C.-S., Tomkins, A., and Juan, D.- C. Adversarial robustness across representation spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7608 7616, 2021. Barreno, M., Nelson, B., Joseph, A. D., and Tygar, J. D. The security of machine learning. Machine Learning, 81 (2):121 148, 2010. Bartlett, P., Foster, D., and Telgarsky, M. Spectrallynormalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30: 6241 6250, 2017. On the Generalization Analysis of Adversarial Learning Bartlett, P. L. and Mendelson, S. Rademacher and gaussian complexities: Risk bounds and structural results. 3(null): 463 482, mar 2003. ISSN 1532-4435. Bartlett, P. L., Bousquet, O., and Mendelson, S. Local rademacher complexities. The Annals of Statistics, 33(4): 1497 1537, 2005. Bhattacharjee, R., Jha, S., and Chaudhuri, K. Sample complexity of robust linear classification on separated data. In International Conference on Machine Learning, pp. 884 893. PMLR, 2021. Bietti, A., Mialon, G., Chen, D., and Mairal, J. A kernel perspective for regularizing deep neural networks. ar Xiv preprint ar Xiv:1810.00363, 2018. Biggio, B., Corona, I., Maiorca, D., Nelson, B., ˇSrndi c, N., Laskov, P., Giacinto, G., and Roli, F. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 387 402. Springer, 2013. Boucheron, S., Lugosi, G., and Bousquet, O. Concentration inequalities. In Summer school on machine learning, pp. 208 240. Springer, 2003. Brendel, W., Rauber, J., and Bethge, M. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. ar Xiv preprint ar Xiv:1712.04248, 2017. Carlini, N. and Wagner, D. A. Towards evaluating the robustness of neural networks. 2017 IEEE Symposium on Security and Privacy (SP), pp. 39 57, 2017. Carlini, N., Katz, G., Barrett, C., and Dill, D. L. Groundtruth adversarial examples. Co RR, abs/1709.10207, 2017. URL http://arxiv.org/abs/1709.10207. Ciss e, M., Bojanowski, P., Grave, E., Dauphin, Y., and Usunier, N. Parseval networks: Improving robustness to adversarial examples. In ICML, 2017. Dan, C., Wei, Y., and Ravikumar, P. Sharp statistical guaratees for adversarially robust gaussian classification. In International Conference on Machine Learning, pp. 2345 2355. PMLR, 2020. Engstrom, L., Tran, B., Tsipras, D., Schmidt, L., and Madry, A. Exploring the landscape of spatial robustness. In International Conference on Machine Learning, pp. 1802 1811. PMLR, 2019. Farnia, F., Zhang, J., and Tse, D. Generalizable adversarial training via spectral normalization. In International Conference on Learning Representations, 2018. Finlay, C. and Oberman, A. M. Scaleable input gradient regularization for adversarial robustness. ar Xiv preprint ar Xiv:1905.11468, 2019. Gao, Q. and Wang, X. Theoretical investigation of generalization bounds for adversarial learning of deep neural networks. Journal of Statistical Theory and Practice, 15 (2):1 28, 2021. Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. ar Xiv preprint ar Xiv:1412.6572, 2014. Hein, M. and Andriushchenko, M. Formal guarantees on the robustness of a classifier against adversarial manipulation. In NIPS, 2017. Hendrycks, D. 
and Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2018. Jaderberg, M., Simonyan, K., Zisserman, A., et al. Spatial transformer networks. Advances in Neural Information Processing Systems, 28:2017 2025, 2015. Kannan, H., Kurakin, A., and Goodfellow, I. Adversarial logit pairing. ar Xiv preprint ar Xiv:1803.06373, 2018. Khim, J. and Loh, P.-L. Adversarial risk bounds via function transformation. ar Xiv preprint ar Xiv:1810.09519, 2018. Ledent, A., Alves, R., Lei, Y., and Kloft, M. Finegrained generalization analysis of inductive matrix completion. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 25540 25552. Curran Associates, Inc., 2021a. URL https://proceedings. neurips.cc/paper/2021/file/ d6428eecbe0f7dff83fc607c5044b2b9-Paper. pdf. Ledent, A., Mustafa, W., Lei, Y., and Kloft, M. Norm-based generalisation bounds for deep multi-class convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, number 9 in 35, pp. 8279 8287, 2021b. Lei, Y., Dogan, U., Zhou, D.-X., and Kloft, M. Datadependent generalization bounds for multi-class classification. IEEE Transactions on Information Theory, 65(5): 2995 3021, 2019. Long, P. M. and Sedghi, H. Size-free generalization bounds for convolutional neural networks. In ICLR, 2020. On the Generalization Analysis of Adversarial Learning Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. Co RR, abs/1706.06083, 2017. Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of machine learning. MIT press, 2018. Mustafa, W., Vandermeulen, R. A., and Kloft, M. Input hessian regularization of neural networks. In Workshop on Beyond first-order methods in ML systems at the 37th International Conference on Machine Learning, 2020. Mustafa, W., Lei, Y., Ledent, A., and Kloft, M. Fine-grained generalization analysis of structured output prediction. In IJCAI 2021, 2021. Papernot, N., Mc Daniel, P., Jha, S., Fredrikson, M., Celik, Z. B., and Swami, A. The limitations of deep learning in adversarial settings. In 2016 IEEE European symposium on security and privacy (Euro S&P), pp. 372 387. IEEE, 2016. Prabhu, Y. and Varma, M. Fastxml: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 14, pp. 263 272, New York, NY, USA, 2014. Raghunathan, A., Steinhardt, J., and Liang, P. Certified defenses against adversarial examples. Co RR, abs/1801.09344, 2018. Reeve, H. W. and Kaban, A. Optimistic bounds for multioutput prediction. In 37th International Conference on Machine Learning, 2020. Ross, A. S. and Doshi-Velez, F. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In Thirty-second AAAI conference on artificial intelligence, 2018. Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. In Neur IPS, 2018. Srebro, N., Sridharan, K., and Tewari, A. Smoothness, low noise and fast rates. Advances in Neural Information Processing Systems, 23, 2010. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. Intriguing properties of neural networks. 
Co RR, abs/1312.6199, 2013. Wong, E. and Kolter, J. Z. Learning perturbation sets for robust machine learning. In International Conference on Learning Representations, 2020. Wu, L., Ledent, A., Lei, Y., and Kloft, M. Fine-grained generalization analysis of vector-valued learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 10338 10346, 2021. Xiao, J., Fan, Y., Sun, R., and Luo, Z.-Q. Adversarial Rademacher complexity of deep neural networks. 2021. Xing, Y., Song, Q., and Cheng, G. On the algorithmic stability of adversarial training. Advances in Neural Information Processing Systems, 34, 2021. Yin, D., Kannan, R., and Bartlett, P. Rademacher complexity for adversarially robust generalization. In International Conference on Machine Learning, pp. 7085 7094. PMLR, 2019. Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. ar Xiv preprint ar Xiv:1901.08573, 2019. Zhang, T. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2(Mar):527 550, 2002. On the Generalization Analysis of Adversarial Learning A. Proofs of the Main Result in Section 4 In this section, we present the proofs of our main result (Theorem 4.8). As discussed in the main text, our main tool is the ℓ -covering numbers (see Definition 4.1). We note that the main hardness in deriving bounds over the adversarial loss is the maximization over the adversarial noise set. Our main strategy is to utilize the properties of the ℓ -covering numbers to control the complexity of the adversarial loss class with a high function dimensional class, thus alleviating the maximization operator. Our first step is to bound the covering number of the adversarial class with an extended function class where each element in the adversarial loss class is replaced with a full-function that takes the adversarial noise as its argument as summarized in Lemma 4.2. We now present the proof of Lemma 4.2. Proof of Lemma 4.2. Our idea is to construct a cover for the adversarial class Gadv on the training set S from the elements of cover for the class G on the same training set. To avoid clutter let g(z, δ, w) := ℓ(fw(A(x, δ), y). Recall from Definition 4.1 that N (ϵ, Gadv, S) is the cardinality of the smallest cover for the set Gadv|S := max δ B g(z1, δ, w), . . . , max δ B g(zn, δ, w)) : w W Rn. The first step of our proof is the observation that it is possible to construct an ℓ -cover for Gadv|S by utilizing an ℓ -cover constructed for the set G|S = {(g(z1, , w), . . . , g(zn, , w))) : w W} (RB)n. Note that each element of g is a vector of functions g(z, , w) as compared to only the scalar maxδ B g(z, δ, w) in Gadv. Our aim now is that given a cover for G; we construct a cover for Gadv. To that extent, let CG = c1 i ( ), . . . , cn i ( ) : i [m] (RB)n be a (ϵ, ℓ )-cover of G. We now claim that the set CGadv = c1 i := max δ B c1 i (δ), . . . , cn i := max δ B cn i (δ) : i [m] covers the set Gadv. Indeed, given w W, there exist by definition j(w) such that: max i [n] max δ B |g(zi, δ, w) ci j(w)(δ)| ϵ. Therefore, we have max i [n] | max δ B g(zi, δ, w) ci j(w)| = max i [n] | max δ B g(zi, δ, w) max δ B ci j(w)(δ)| max i [n] max δ B |g(zi, δ, w) ci j(w)(δ)| The first equality follows from the construction of CGadv. The first inequality follows from the following inequality: for real-valued functions f, and g, we have | maxx f(x) maxx g(x)| maxx |f(x) g(x)|. 
It then follows that CGadv is an (ϵ, ℓ )-cover to Gadv. Since the cardinality of CG and CGadv are equal, we have N (ϵ, Gadv, S) N (ϵ, G, S). (13) The proof is completed. Next, we present the proof of Lemma 4.4. On the Generalization Analysis of Adversarial Learning Proof of Lemma 4.4. Our goal is to control the ℓ -covering number of the infinite-dimensional class G with a finitedimensional counterpart. The core idea is to approximate the functions g G with a discrete version of it. We, therefore, require the functions in G to be Lipschitz with respect to some norm and a cover of B with respect to the same norm. To that end, let CB(ϵ/2L, ) be an ϵ/2L-cover w.r.t. a general norm for the set B. Let MB be the size of CB(ϵ/2L, ). That is, CB := CB(ϵ/2L, ) = n δi, i [MB] o B. Now recall that N (ϵ, Gadv, S) is the smallest cardinality of the set covering Gdisc|S defined according to the set CB(ϵ/2L, ) as follows, g(z1, δ1, w) g(z1, δ2, w) g(z1, δMB, w) ... ... ... ... g(zn, δ1, w) g(zn, δ2, w) g(zn, δMB, w) Our goal now is to construct an (ϵ, ℓ )-cover of G|S by utilizing an (ϵ/2, ℓ )-cover of Gdisc|S. To that extent, let the set ˆc1 i ( δ1) ˆc1 i ( δ2) ˆc1 i ( δMB) ... ... ... ... ˆcn i ( δ1) ˆcn i ( δ2) ˆcn i ( δMB) : i [MGdisc|S] be an ϵ/2-cover for Gdisc|S of size MBdisc|S. We now construct a cover of G|S. The key idea here is to construct functions ci( ), i [MGdisc|S] to be piece-wise constant around each δj, for j [MB]. Consider the set c1 i (δ) := ˆc1 i (arg minδ CB δ δ ) ... cn i (δ) := ˆcn i (arg minδ CB δ δ ) : i [MGdisc] We claim that it is an ϵ-cover for G. Indeed, for any w W, by construction of CGdisc|S, there exist j(w), such that max i max δ CB(ϵ/2L) |g(zi, δ, w) ˆci j(w)( δ)| ϵ/2. (14) max i max δ B |g(zi, δ, w) ci j(w)(δ)| = max i max δ B |g(zi, δ, w) g(zi, δ (δ), w) + g(zi, δ (δ), w) ci j(w)(δ)| max i max δ B |g(zi, δ, w) g(zi, δ (δ), w)| + |g(zi, δ (δ), w) ci j(w)(δ)| , where δ (δ) := arg minδ CB(ϵ/2L) δ δ . The inequality follows from triangle inequality. Since by construction of CG we have ci j(w)(δ) = ˆci j(w)(δ (δ))2, we have max i max δ B |g(zi, δ, w) g(zi, δ (δ), w)|+|g(zi, δ (δ), w) ci j(w)(δ)| = max i max δ B |g(zi, δ, w) g(zi, δ (δ), w)| + |g(zi, δ (δ), w) ˆci j(w)(δ (δ))| max i max δ B |g(zi, δ, w) g(zi, δ (δ), w)| + max i max δ CB |g(zi, δ, w) ˆci j(w)( δ)| max i max δ B L δ δ (δ) + max i max δ CB |g(zi, δ, w) ˆci j(w)( δ)| Lϵ/2L + ϵ/2 = ϵ 2Ties are resolved in arbitrary but fixed manner On the Generalization Analysis of Adversarial Learning The second inequality follows from the Lipschtizness of g with respect to δ and the -norm. The third inequality is due to the construction of δ ( ) and equation (14). Since CGdisc|S and CG have equal size, we conclude that N (ϵ, G, S) N (ϵ, Gdisc, S). (15) By combining inequalities (13) and (15), the result follows. Now we proceed to prove Theorem 4.8. We first present the Rademacher theorem, which controls the generalization of learning algorithms by the Rademacher complexity. Theorem A.1 (Mohri et al. 2018). Let S = {zi}m i=1 be i.i.d. random sample from a distribution D defined over Z. Further let F [0, 1]Z be a loss class parameterized by the set W. Then for all δ (0, 1), we have with probability at least 1 δ over the draw of the sample S, for all w W that R(w) ˆR(w) + 2RS(F) + 3 The theorem states that the generalization of the function class F can be controlled, with high probability, by the empirical risk and the Rademacher complexity. 
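As a quick illustration of the definition in (12) and of Theorem A.1 (ours, not part of the paper; `empirical_rademacher` and the representation of a finite class by its value matrix are assumptions made for the sketch), the empirical Rademacher complexity can be estimated by Monte-Carlo sampling of the Rademacher variables σ.

```python
import numpy as np

# Sketch: Monte-Carlo estimate of the empirical Rademacher complexity
#   R_S(F) = E_sigma[ sup_{f in F} (1/n) sum_i sigma_i f(z_i) ]
# for a finite class F given by its value matrix F_vals[f, i] = f(z_i).

def empirical_rademacher(F_vals, n_draws=2000, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    n = F_vals.shape[1]
    sups = []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)    # Rademacher variables
        sups.append(np.max(F_vals @ sigma) / n)    # sup over f of (1/n) <sigma, f(S)>
    return float(np.mean(sups))

# usage: 5 random [0, 1]-valued "loss functions" evaluated on n = 100 points
rng = np.random.default_rng(1)
F_vals = rng.uniform(0.0, 1.0, size=(5, 100))
print(empirical_rademacher(F_vals, rng=rng))       # roughly O(1/sqrt(n)) for a small class
```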
Our approach relies, however, on another complexity measure, namely ℓ -covering numbers. The following classical result of Dudley s entropy integral (Boucheron et al., 2003; Bartlett et al., 2017; Ledent et al., 2021a; Srebro et al., 2010) gives a relationship between the Rademacher complexity and ℓ -covering number. We apply the version by Srebro et al. (2010). Theorem A.2 (Srebro et al. 2010). Let F be a class of functions mapping from a space Z and taking values in [0, b], and assume that 0 F. Let S be a finite sample of size m and ˆE[f(z)2] = 1 m Pm i=1 f(zi)2 We have the following relationship between the empirical Rademacher complexity RS(F) and the covering number N2(ϵ, F, S). R(F) inf α>0 log N2(ϵ, F, S)dϵ We are now ready to present the proof of Theorem 4.8. Proof of Theorem 4.8. The proof is a direct application of Theorems A.2 and A.1 combined with Lemma 4.2 and Lemma 4.4. For δ (0, 1), we have with probability at least 1 δ, for all w W Radv(w) ˆRadv(w) + 2RS(Gadv) + 3 ˆRadv(w) + inf α>0 log N (ϵ, Gadv, S)dϵ + 3 ˆRadv(w) + inf α>0 log N (ϵ/2, Gadv, S)dϵ + 3 The first inequality is due to Theorem A.1 while the second is derived from Theorem A.2. The third inequality follows from Lemmas 4.2 and 4.4. B. Proofs of Results on Linear Models in Section 5.1 In this section, we present the omitted proof of section 5.1. The first step of our approach is to show that the loss function is ℓ -Lipschtiz with respect to the noise parameter δ. We then derive a bound on the size of the set CB(ϵ/2L). We finally bound the ℓ -covering number of the extended class Gadv on the extended data set S. Throughout the paper, we will require upper bounds on the covering number of bounded balls in Rd. We begin by reviewing the following result deriving an upper bound on the size of the set CB(ϵ) defined for the general norm . On the Generalization Analysis of Adversarial Learning Lemma B.1 (Long & Sedghi 2020). Let d be a positive integer, be a norm, ρ be the metric induced by it, and κ, ϵ > 0. A ball of radius κ in Rd w.r.t. ρ can be covered by ( 3κ ϵ )d balls of radius ϵ. Our approach relies on ℓ -covering numbers for the loss classes. We now review the upper bounds on the ℓ -covering numbers of linear models. The bound used in our work is a combination of the approach in Lei et al. (2019) and the covering number bound in Zhang (2002). Lemma B.2 (Lei et al. 2019). Let F = {x 7 (w 1 x, w 2 x, . . . , x Kx) : w = (w1, . . . , w K) RKd, w 2 Λ} be the linear model hypothesis class, S = {(xi, yi)}n i=1 be a given data set. Consider an -Lipschitz loss ℓwith constant L, for all y [K]. Let S be an extended data set defined as S = {φj(xi) : i [n], j [K]} where φj( ) is defined as φj(x) := 0, . . . , 0 | {z } j 1 , x, 0, . . . , 0 | {z } K j Further let F := {x 7 w, x , x S, w RKd, w 2 Λ}. For all ϵ > 0, we have the following inequality N (ϵ, ℓ F, S) N (ϵ/L, F, S). The above lemma allows for controlling the ℓ -covering number of multi-class loss classes by a real-valued function class F on an extended dataset S. The following lemma gives a covering number bound for real-valued linear function classes. Lemma B.3 (Zhang 2002). Let L be a class of linear functions on a set of size n. That is, L = { w, x , x, w RN}. If x q b and w p a, where 2 q < and 1/p + 1/q = 1, then ϵ > 0, log N (ϵ, L, n) 36(q 1)a2b2 ϵ2 log[2 4ab/ϵ + 2 n + 1], where N (ϵ, L, n) is the worst case covering number of the class L on a dataset of size n. The above result controls the covering numbers by norms of the data and weights. 
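To see how such covering-number bounds enter the analysis, the following sketch (ours; the constants c1 and c2 are placeholders rather than the actual constants of Theorem A.2, and `dudley_bound` is a hypothetical helper) numerically evaluates a Dudley-type entropy-integral bound for a covering bound of the Lemma B.3 form log N(ε) ≤ (R/ε²)·log(a/ε + n).

```python
import numpy as np

# Sketch: evaluate inf_{alpha > 0} ( c1*alpha + (c2/sqrt(n)) * int_alpha^1 sqrt(log N(eps)) d eps )
# on a grid of alpha values, given a covering-number bound log_N(eps).
# c1, c2 are placeholder constants standing in for those of Theorem A.2.

def dudley_bound(log_N, n, c1=4.0, c2=12.0, grid=200):
    best = np.inf
    for alpha in np.linspace(1e-3, 1.0, grid):
        eps = np.linspace(alpha, 1.0, grid)
        vals = np.sqrt(log_N(eps))
        integral = np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(eps))  # trapezoidal rule
        best = min(best, c1 * alpha + c2 / np.sqrt(n) * integral)
    return best

# usage: a Zhang/Lemma B.3-type bound with R = 2.0 and a = 10 on n = 10_000 samples
n = 10_000
log_N = lambda eps: (2.0 / eps**2) * np.log(10.0 / eps + n)
print(dudley_bound(log_N, n))
```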
A direction application of Lemmas B.3 and B.2 gives the following corollary. Corollary B.4. Let F be the linear multi-class linear hypothesis class, ℓρ be the loss (5). Let S = {(xi, yi)}n i=1 be a given dataset with x 2 Ψ, for all x X, and W 2,2 Λ, then for all ϵ > 0, we have log N (ϵ, ℓρ F, S) C Ψ2Λ2 ρ2ϵ2 log (2 8ΨΛ/ϵρ + 2 n K + 1) . Proof. We apply Lemmas B.2 and B.3 noting that ℓρ is -Lipschitz with constant 2 ρ-norm, for all y Y. B.1. ℓ -additive Perturbation Attack We are now ready to prove the bounds of the ℓ -additive attacks applied to linear models. Our first step is to derive the -Lipschitz constant of the function δ 7 ℓρ(W(x + δ)). The following is the poof of Lemma 5.1. Proof of Lemma 5.1. The proof is a direct derivation. Observe the following, for all (x, y) Z and W 1, Λ1, and δ, δ B, we have |ℓρ(W(x + δ), y) ℓρ(W(x + δ ), y)| 2 ρ max i |Wi,.δ Wi,.δ | ρ max i Wi,. 1 δ δ 2Λ1 where Wi, denotes the i-th row of W. The first inequality is derived from that ℓρ is -Lipschitz with constant 2 In the following we present the proof of Lemma 5.2. On the Generalization Analysis of Adversarial Learning Proof of Lemma 5.2. First consider the ℓ -norm on the set B. By Lemma 5.1, we have the function δ 7 ℓρ(W(x + δ)y) is -Lipschitz with constant 2Λ1 ρ . Consider the set CB(ϵρ/4Λ1). By Lemma B.1, and that δ β, for all δ B, we have |CB(ϵρ/4Λ1)| 12Λ1β | S| = n 12Λ1β and for ( x, y) S with x = (x, δ), x 2 x 2 + δ 2 Ψ + Thus, the result follows from Corollary B.4. In the following, we present the proof of Corollary 5.3. Proof of Corollary 5.3. The proof is a direct application of Theorem 4.8 by setting α to 1 n. Thus, consider the following integral log N (ϵ/2, Gadv, S)dϵ Z 1 ϵ2ρ2 Llogdϵ C Llog log(n) ΛΨ C Llog log(n) ΛΨ ρ [log(ϵ)]1 1 n C Llog log(n) ΛΨ The first inequality is due to the monotone property of integral. The second is by noticing that replacing ϵ by 1 n in Llog can only increase its value. Plugging this in Theorem 4.8 gives the result. B.2. Spatial Adversarial Attack In this section we present the proof of adversarial generalization bound in Corollary 5.9. The first step is to show that the transformation δ 7 ℓ(W(A(x, δ), y) is -Lipschitz as summarized in Lemma B.5. We first show that δ 7 A(x, δ) is ( , )-Lipschitz. The following lemma states the result. Lemma B.5. Let A(x, δ) be defined as in (9). For all x 1 Ψ1, and δ, δ B, we have, A(x, δ) A(x, δ ) 4 Proof. Let U s and V s be the indexes corresponding to δ and U s and V s be indexes corresponding to δ . Further let a = (U s, V s) (U s , V s ) . Then by the definition of the ℓ -norm and the transformation A, we have, A(x, δ) A(x, δ ) = max i [d] l=1 xkl(max(0, 1 |V s i k|) max(0, 1 |U s i l|) max(0, 1 |V s i k|) max(0, 1 |U s i l|)) l=1 |xkl| (max(0, 1 |V s i k|) max(0, 1 |U s i l|) max(0, 1 |V s i k|) max(0, 1 |U s i l|)) , On the Generalization Analysis of Adversarial Learning where the inequality follows from the triangular inequality. For arbitrary k, l, i [ d], let a = max(0, 1 |V s i k|), b = max(0, 1 |U s i l|), a = max(0, 1 |V s i l|), and b = max(0, 1 |U s i l|). Consider the following inequality |ab a b | |ab ab | + |ab a b | |a||b b | + |b ||a a | |b b | + |a a |. The first inequality is due to the triangular inequality and the last follows by |a|, |b | 1. 
B.2. Spatial Adversarial Attack

In this section, we present the proof of the adversarial generalization bound in Corollary 5.9. The first step is to show that the transformation $\delta \mapsto \ell_\rho(WA(x, \delta), y)$ is $\ell_\infty$-Lipschitz. We first show that $\delta \mapsto A(x, \delta)$ is $(\ell_\infty, \ell_\infty)$-Lipschitz; the following lemma states the result.

Lemma B.5. Let $A(x, \delta)$ be defined as in (9). For all $x$ with $\|x\|_1 \le \Psi_1$ and all $\delta, \delta' \in B$, we have
\[
\|A(x, \delta) - A(x, \delta')\|_\infty \le 4\Psi_1\sqrt{d}\,\|\delta - \delta'\|_\infty.
\]

Proof. Let $U^s$ and $V^s$ be the sampling coordinates corresponding to $\delta$, and $U^{s'}$ and $V^{s'}$ be the sampling coordinates corresponding to $\delta'$. By the definition of the $\ell_\infty$-norm and the transformation $A$, we have
\begin{align*}
\|A(x, \delta) - A(x, \delta')\|_\infty
&= \max_{i\in[d]}\Big|\sum_{k,l=1}^{\sqrt{d}} x_{kl}\big(\max(0, 1-|V^s_i - k|)\max(0, 1-|U^s_i - l|) - \max(0, 1-|V^{s'}_i - k|)\max(0, 1-|U^{s'}_i - l|)\big)\Big|\\
&\le \max_{i\in[d]}\sum_{k,l=1}^{\sqrt{d}} |x_{kl}|\,\big|\max(0, 1-|V^s_i - k|)\max(0, 1-|U^s_i - l|) - \max(0, 1-|V^{s'}_i - k|)\max(0, 1-|U^{s'}_i - l|)\big|,
\end{align*}
where the inequality follows from the triangle inequality. For arbitrary $k, l \in [\sqrt{d}]$ and $i \in [d]$, let $a = \max(0, 1-|V^s_i - k|)$, $b = \max(0, 1-|U^s_i - l|)$, $a' = \max(0, 1-|V^{s'}_i - k|)$, and $b' = \max(0, 1-|U^{s'}_i - l|)$. Consider the following inequality:
\[
|ab - a'b'| \le |ab - ab'| + |ab' - a'b'| \le |a||b - b'| + |b'||a - a'| \le |b - b'| + |a - a'|.
\]
The first inequality is due to the triangle inequality and the last follows from $|a|, |b'| \le 1$. Thus,
\begin{align*}
\|A(x, \delta) - A(x, \delta')\|_\infty
&\le \max_{i\in[d]}\sum_{k,l=1}^{\sqrt{d}} |x_{kl}|\big(\big|\max(0, 1-|U^s_i - l|) - \max(0, 1-|U^{s'}_i - l|)\big| + \big|\max(0, 1-|V^s_i - k|) - \max(0, 1-|V^{s'}_i - k|)\big|\big)\\
&\le \max_{i\in[d]}\sum_{k,l=1}^{\sqrt{d}} |x_{kl}|\big(\big||U^s_i - l| - |U^{s'}_i - l|\big| + \big||V^s_i - k| - |V^{s'}_i - k|\big|\big)\\
&\le \max_{i\in[d]}\sum_{k,l=1}^{\sqrt{d}} |x_{kl}|\big(|U^s_i - U^{s'}_i| + |V^s_i - V^{s'}_i|\big)\\
&\le \sum_{k,l=1}^{\sqrt{d}} |x_{kl}|\; 2\,\|(U^s, V^s) - (U^{s'}, V^{s'})\|_\infty
\le 2\Psi_1\,\|(U^s, V^s) - (U^{s'}, V^{s'})\|_\infty.
\end{align*}
The first inequality combines the previous display with the bound on $|ab - a'b'|$, the second follows from the 1-Lipschitzness of the function $x \mapsto \max(0, x)$, and the third is due to the reverse triangle inequality (i.e., $||c| - |d|| \le |c - d|$). It remains to bound $\|(U^s, V^s) - (U^{s'}, V^{s'})\|_\infty$ by $\|\delta - \delta'\|_\infty$. Observe the following:
\begin{align*}
\|(U^s, V^s) - (U^{s'}, V^{s'})\|_\infty
&= \max_{i\in[d]}\max\big(|V^s_i - V^{s'}_i|,\, |U^s_i - U^{s'}_i|\big)\\
&= \max_{i\in[d]}\max\big(|V^t_i(\delta_{22} - \delta'_{22}) - U^t_i(\delta_{21} - \delta'_{21})|,\, |V^t_i(\delta_{12} - \delta'_{12}) - U^t_i(\delta_{11} - \delta'_{11})|\big)\\
&\le \max_{i\in[d]}\max\big(|V^t_i||\delta_{22} - \delta'_{22}| + |U^t_i||\delta_{21} - \delta'_{21}|,\, |V^t_i||\delta_{12} - \delta'_{12}| + |U^t_i||\delta_{11} - \delta'_{11}|\big)\\
&\le 2\sqrt{d}\,\|\delta - \delta'\|_\infty.
\end{align*}
The second equality is due to the definition of the sampling grid and (7). The first inequality follows from the triangle inequality, and the last inequality from the fact that $\max_{i\in[d]}|V^t_i| = \max_{i\in[d]}|U^t_i| = \sqrt{d}$. Combining the two inequalities, we get
\[
\|A(x, \delta) - A(x, \delta')\|_\infty \le 4\Psi_1\sqrt{d}\,\|\delta - \delta'\|_\infty.
\]
The proof is completed.

We now utilize this result to prove Lemma 5.7.

Proof of Lemma 5.7. For $\delta, \delta' \in B$, we have
\[
|\ell_\rho(WA(x,\delta), y) - \ell_\rho(WA(x,\delta'), y)| \le \frac{2}{\rho}\|WA(x,\delta) - WA(x,\delta')\|_\infty = \frac{2}{\rho}\max_i |\langle W_{i,\cdot},\, A(x,\delta) - A(x,\delta')\rangle| \le \frac{2}{\rho}\max_i \|W_{i,\cdot}\|_1\,\|A(x,\delta) - A(x,\delta')\|_\infty \le \frac{8\Lambda_1\Psi_1\sqrt{d}}{\rho}\|\delta - \delta'\|_\infty,
\]
where the first inequality follows from the Lipschitzness of the loss $\ell_\rho$, the equality from the definition of the $\ell_\infty$-norm, the second inequality from Hölder's inequality, and the final one from Lemma B.5.

Proof of Lemma 5.8. We consider the $\ell_\infty$-covering number of the set $B$. Recall that, by Lemma 5.7, the function $\delta \mapsto \ell_\rho(WA(x,\delta), y)$ is $\ell_\infty$-Lipschitz with Lipschitz constant $8\Lambda_1\Psi_1\sqrt{d}/\rho$. We now aim to control the size of the set $C_B(\epsilon\rho/16\Lambda_1\Psi_1\sqrt{d})$. By Lemma B.1, for the ball $B = \{\delta : \|\delta\|_\infty \le \beta\}$ with $\delta = (\delta_{11}, \delta_{12}, \delta_{21}, \delta_{22})$, we have
\[
\big|C_B\big(\epsilon\rho/16\Lambda_1\Psi_1\sqrt{d}\big)\big| \le \Big(\frac{48\Lambda_1\Psi_1\sqrt{d}\,\beta}{\epsilon\rho}\Big)^{4}.
\]
By the construction of $\tilde{S}$, we then have $|\tilde{S}| = n\big(48\Lambda_1\Psi_1\sqrt{d}\,\beta/(\epsilon\rho)\big)^{4}$. Further note that $\|A(x,\delta)\|_2 \le 4\|x\|_2 \le 4\Psi$ for all $x \in X$, since the spatial transformation does not alter the norm of the input except for the factor of 4 due to the bilinear interpolation. Thus, the result follows from Corollary B.4.
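The quantitative core of Lemma B.5 is that the bilinear-interpolation map is $2\Psi_1$-Lipschitz in the sampling coordinates w.r.t. the $\ell_\infty$-norm. The sketch below checks exactly this step on a random image; it is illustrative only, and the grid parameterization of (7) and (9) (and hence the remaining factor $2\sqrt{d}$ relating the coordinates to $\delta$) is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                   # image side length, pixels x[k, l]

def bilinear_sample(x, u, v):
    # One output of the transformation in the proof of Lemma B.5:
    # sum_{k,l} x[k, l] * max(0, 1 - |v - k|) * max(0, 1 - |u - l|).
    k = np.arange(x.shape[0])
    l = np.arange(x.shape[1])
    wv = np.maximum(0.0, 1.0 - np.abs(v - k))   # row weights
    wu = np.maximum(0.0, 1.0 - np.abs(u - l))   # column weights
    return float(wv @ x @ wu)

x = rng.normal(size=(D, D))
Psi1 = np.abs(x).sum()                          # ||x||_1

# The resampled value is 2 * ||x||_1-Lipschitz in the sampling coordinates
# (u, v) w.r.t. the l_inf norm, which is the key step of Lemma B.5.
worst = 0.0
for _ in range(2000):
    u, v = rng.uniform(0, D - 1, size=2)
    du, dv = rng.uniform(-0.5, 0.5, size=2)
    num = abs(bilinear_sample(x, u, v) - bilinear_sample(x, u + du, v + dv))
    worst = max(worst, num / max(abs(du), abs(dv)))
print(worst, "<=", 2 * Psi1)
```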
C. Proofs of Applications to Neural Networks in Section 5.2

In this section, we present the omitted proofs of Section 5.2. The technique is similar to the linear case. We first establish the Lipschitzness of the functions $\delta \mapsto \ell_\rho(N_W(A(x,\delta)), y)$. We then extend the data set and apply $\ell_\infty$-covering number results for the neural network function class. We first review the following lemma (Ledent et al., 2021b), which establishes a bound on the $\ell_\infty$-covering numbers of norm-bounded neural network function classes.

Lemma C.1 (Ledent et al. 2021b). Let $F$ be the class of neural networks, that is, $F = \{x \mapsto N_W(x)\}$, where $W = (W^1, \ldots, W^L)$ is a set of weights and $N_W(x) = W^L\sigma(W^{L-1}\sigma(\cdots W^1 x))$ with 1-Lipschitz element-wise non-linearities $\sigma$. Define the loss class $\mathcal{L} = \ell_\rho \circ F$, where $\ell_\rho$ is defined as in (5). Suppose that $\|W^l\|_{2,1} \le a_l$ and $\|W^l\|_\sigma \le s_l$ for all $l \in [L-1]$, $\|W^L\|_2 \le a_L$, $\|W^L\|_{2,\infty} \le s_L$, $\|x\|_2 \le b$, and let $m_l$ be the width of the $l$-th layer. Then, given a data set $S$ with $n$ elements and $\epsilon > 0$,
\[
\log N_\infty(\epsilon, \mathcal{L}, S) \le \frac{C L^2 b^2 \Gamma^2}{\rho^2\epsilon^2}\log\big(\big(\lceil C_1 b\Gamma/(\epsilon\rho) + C_2\rceil\sqrt{\bar{m}}\big)n + 1\big),
\]
where $\Gamma = \max_{l\in[L]}\big(\prod_{i=1}^{L} s_i\big)a_l m_l/s_l$, $\bar{m} = \max_{l\in[L]} m_l$, and $C, C_1, C_2$ are universal constants.

C.1. $\ell_\infty$-additive Perturbation Attack

The first step is to derive the $\ell_\infty$-Lipschitz constant of the loss function as a function of the noise parameter. We start with the following lemma on the $\ell_\infty$-Lipschitzness of the neural network as a function of its input.

Lemma C.2. Consider the neural network function $N_W(x)$ defined as in (10). Given $x, x' \in X$, for all $y \in Y$,
\[
|\ell_\rho(N_W(x), y) - \ell_\rho(N_W(x'), y)| \le \frac{2}{\rho}\Big(\prod_{l=2}^{L} s_l\Big)\sqrt{m_1}\,s'_1\,\|x - x'\|_\infty,
\]
where $s'_1 := \max_{i\in[m_1]}\|W^1_{i,\cdot}\|_1$.

Proof. Let $N^l_W$ denote the output of the $l$-th layer of the network $N_W$. Consider the following:
\begin{align*}
|\ell_\rho(N_W(x), y) - \ell_\rho(N_W(x'), y)|
&\le \frac{2}{\rho}\|N_W(x) - N_W(x')\|_\infty
= \frac{2}{\rho}\big\|W^L\big(N^{L-1}_W(x) - N^{L-1}_W(x')\big)\big\|_\infty\\
&\le \frac{2}{\rho}\max_{i\in[m_L]}\|W^L_{i,\cdot}\|_2\,\big\|N^{L-1}_W(x) - N^{L-1}_W(x')\big\|_2
\le \frac{2}{\rho}\prod_{l=2}^{L} s_l\,\|W^1 x - W^1 x'\|_2\\
&\le \frac{2}{\rho}\prod_{l=2}^{L} s_l\,\sqrt{m_1}\,\|W^1 x - W^1 x'\|_\infty
\le \frac{2}{\rho}\prod_{l=2}^{L} s_l\,\sqrt{m_1}\max_{i\in[m_1]}\|W^1_{i,\cdot}\|_1\,\|x - x'\|_\infty
= \frac{2}{\rho}\prod_{l=2}^{L} s_l\,\sqrt{m_1}\,s'_1\,\|x - x'\|_\infty.
\end{align*}
The second inequality follows from the definition of the $\ell_\infty$-norm and Hölder's inequality, the third from the fact that the non-linearity is 1-Lipschitz and by induction over the layers, and the fourth from the fact that $\|x\|_2 \le \sqrt{d}\,\|x\|_\infty$ for all $x \in \mathbb{R}^d$.

We now present the proof of Lemma 5.13.

Proof of Lemma 5.13. The result follows directly from Lemma C.2. Let $\delta, \delta' \in B$; then for all $(x, y) \in Z$,
\[
|\ell_\rho(N_W(x+\delta), y) - \ell_\rho(N_W(x+\delta'), y)| \le \frac{2}{\rho}\prod_{l=2}^{L} s_l\,\sqrt{m_1}\,s'_1\,\|(x+\delta) - (x+\delta')\|_\infty = \frac{2}{\rho}\prod_{l=2}^{L} s_l\,\sqrt{m_1}\,s'_1\,\|\delta - \delta'\|_\infty.
\]
The proof is completed.

Proof of Lemma 5.14. We consider the $\ell_\infty$-covering number of the set $B$. By Lemma 5.13, the function $\delta \mapsto \ell_\rho(N_W(x+\delta), y)$ is $\ell_\infty$-Lipschitz with Lipschitz constant $\frac{2}{\rho}\prod_{l=2}^{L} s_l\sqrt{m_1}s'_1$. By Lemma 4.4, we now aim to bound the size of the set $C_B\big(\epsilon\rho/(4\prod_{l=2}^{L} s_l\sqrt{m_1}s'_1)\big)$. By Lemma B.1, and since $\|\delta\|_\infty \le \beta$, we have
\[
\Big|C_B\Big(\frac{\epsilon\rho}{4\prod_{l=2}^{L} s_l\sqrt{m_1}s'_1}\Big)\Big| \le \Big(\frac{12\prod_{l=2}^{L} s_l\sqrt{m_1}s'_1\,\beta}{\epsilon\rho}\Big)^{d}, \qquad |\tilde{S}| = n\Big(\frac{12\prod_{l=2}^{L} s_l\sqrt{m_1}s'_1\,\beta}{\epsilon\rho}\Big)^{d},
\]
and for $(\tilde{x}, y) \in \tilde{S}$ with $\tilde{x} = x + \delta$, we have $\|\tilde{x}\|_2 \le \|x\|_2 + \|\delta\|_2 \le \Psi + \beta\sqrt{d}$. Thus, the result follows from Lemma C.1.

C.2. Spatial Adversarial Attack

Proof of Lemma 5.18. The proof follows directly from Lemmas C.2 and B.5. Let $\delta, \delta' \in B$; then for all $W \in \mathcal{W}$ and $(x, y) \in Z$,
\[
|\ell_\rho(N_W(A(x,\delta)), y) - \ell_\rho(N_W(A(x,\delta')), y)| \le \frac{2}{\rho}\prod_{l=2}^{L} s_l\,\sqrt{m_1}\,s'_1\,\|A(x,\delta) - A(x,\delta')\|_\infty \le \frac{8}{\rho}\prod_{l=2}^{L} s_l\,\sqrt{m_1}\,s'_1\,\Psi_1\sqrt{d}\,\|\delta - \delta'\|_\infty.
\]
The proof is completed.

Now we present the proof of Lemma 5.19.

Proof of Lemma 5.19. The function $\delta \mapsto \ell_\rho(N_W(A(x,\delta)), y)$ is $\ell_\infty$-Lipschitz with Lipschitz constant $\frac{8}{\rho}\prod_{l=2}^{L} s_l\sqrt{m_1}s'_1\Psi_1\sqrt{d}$. By Lemma 4.4, we now aim to bound the size of the set $C_B\big(\epsilon\rho/(16\prod_{l=2}^{L} s_l\sqrt{m_1}s'_1\Psi_1\sqrt{d})\big)$. By Lemma B.1, and since $\|\delta\|_\infty \le \beta$, we have
\[
\Big|C_B\Big(\frac{\epsilon\rho}{16\prod_{l=2}^{L} s_l\sqrt{m_1}s'_1\Psi_1\sqrt{d}}\Big)\Big| \le \Big(\frac{48\prod_{l=2}^{L} s_l\sqrt{m_1}s'_1\Psi_1\sqrt{d}\,\beta}{\epsilon\rho}\Big)^{4}.
\]
Therefore, we have $|\tilde{S}| = n\big(48\prod_{l=2}^{L} s_l\sqrt{m_1}s'_1\Psi_1\sqrt{d}\,\beta/(\epsilon\rho)\big)^{4}$. We again note that $\|A(x,\delta)\|_2 \le 4\|x\|_2$, since the spatial transformation does not alter the input norm except for the factor of 4 from the bilinear interpolation. The result thus follows from Lemma C.1.
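As an illustration of Lemma C.2 (again not part of the proof), the sketch below estimates the $\ell_\infty$-Lipschitz constant of $x \mapsto \ell_\rho(N_W(x), y)$ for a small random ReLU network with $L = 3$ and compares it to $\frac{2}{\rho}\big(\prod_{l=2}^{L}s_l\big)\sqrt{m_1}\,s'_1$. The margin loss is replaced by the same hypothetical ramp-loss stand-in as before, and the architecture and weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m1, m2, K, rho = 10, 16, 16, 5, 1.0
W1 = rng.normal(size=(m1, d)) / np.sqrt(d)
W2 = rng.normal(size=(m2, m1)) / np.sqrt(m1)
W3 = rng.normal(size=(K, m2)) / np.sqrt(m2)

def net(x):
    # N_W(x) = W^3 sigma(W^2 sigma(W^1 x)) with ReLU non-linearities.
    return W3 @ np.maximum(0.0, W2 @ np.maximum(0.0, W1 @ x))

def ramp_loss(t, y, rho):
    # Hypothetical stand-in for the margin loss (5).
    margin = t[y] - np.max(np.delete(t, y))
    return float(np.clip(1.0 - margin / rho, 0.0, 1.0))

# Lipschitz constant of Lemma C.2 for L = 3: (2/rho) * s_3 * s_2 * sqrt(m1) * s'_1,
# with s_2 the spectral norm of W^2, s_3 the largest row l2-norm of W^3,
# and s'_1 the largest row l1-norm of W^1.
s2 = np.linalg.norm(W2, 2)
s3 = np.max(np.linalg.norm(W3, axis=1))
s1p = np.max(np.abs(W1).sum(axis=1))
bound = (2.0 / rho) * s3 * s2 * np.sqrt(m1) * s1p

worst = 0.0
for _ in range(2000):
    x, xp = rng.normal(size=d), rng.normal(size=d)
    y = int(rng.integers(K))
    diff = abs(ramp_loss(net(x), y, rho) - ramp_loss(net(xp), y, rho))
    worst = max(worst, diff / np.max(np.abs(x - xp)))
print(worst, "<=", bound)
```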
D. Proofs of Optimistic Bounds

Proof of Lemma 6.5. The proof strategy is to show that the covering number of the set $G_{\mathrm{adv}}|^r_S$ is controlled by the covering number of a set of functions of the adversarial noise. To that end, define the sets
\[
H_S|^r = \Big\{\big(f(A(x_1,\cdot), w)_1, \ldots, f(A(x_1,\cdot), w)_K, \ldots, f(A(x_n,\cdot), w)_1, \ldots, f(A(x_n,\cdot), w)_K\big) : w \in W,\ \hat{R}_{\mathrm{adv}}(w) \le r\Big\} \subseteq (\mathbb{R}^{nK})^B,
\]
\[
H_S = \Big\{\big(f(A(x_1,\cdot), w)_1, \ldots, f(A(x_1,\cdot), w)_K, \ldots, f(A(x_n,\cdot), w)_1, \ldots, f(A(x_n,\cdot), w)_K\big) : w \in W\Big\} \subseteq (\mathbb{R}^{nK})^B.
\]
Let
\[
C_{H_S|^r} = \Big\{\big(c^j_1(\cdot)_1, \ldots, c^j_1(\cdot)_K, \ldots, c^j_n(\cdot)_1, \ldots, c^j_n(\cdot)_K\big) : j\Big\} \subseteq (\mathbb{R}^{nK})^B
\]
be an $\big(\frac{\epsilon}{\lambda(2r)^\theta}, \ell_\infty\big)$-cover of $H_S|^r$. Further suppose that it is a proper cover, that is, $C_{H_S|^r} \subseteq H_S|^r$. Note that for any $w \in W$ there exists a $j(w)$ such that
\[
\max_{i\in[n]}\max_{\delta\in B}\max_{k\in[K]}\big|f(A(x_i,\delta), w)_k - c^{j(w)}_i(\delta)_k\big| \le \frac{\epsilon}{\lambda(2r)^\theta}.
\]
We now claim that $C_{H_S|^r}$ can be used to cover the set $G_{\mathrm{adv}}|^r_S$ at resolution $\epsilon$, as measured by the empirical $\ell_2$-norm. In other words, we aim to show that
\[
\Big(\frac{1}{n}\sum_{i=1}^{n}\Big(\max_{\delta\in B}\ell(f(A(x_i,\delta), w), y_i) - \max_{\delta\in B}\ell(c^{j(w)}_i(\delta), y_i)\Big)^2\Big)^{1/2} \le \epsilon.
\]
Observe the following:
\begin{align*}
&\frac{1}{n}\sum_{i=1}^{n}\Big(\max_{\delta\in B}\ell(f(A(x_i,\delta), w), y_i) - \max_{\delta\in B}\ell(c^{j(w)}_i(\delta), y_i)\Big)^2\\
&\quad\le \frac{1}{n}\sum_{i=1}^{n}\lambda^2\max\Big\{\max_{\delta\in B}\ell(f(A(x_i,\delta), w), y_i),\, \max_{\delta\in B}\ell(c^{j(w)}_i(\delta), y_i)\Big\}^{2\theta}\max_{\delta\in B}\big\|f(A(x_i,\delta), w) - c^{j(w)}_i(\delta)\big\|_\infty^2\\
&\quad\le \frac{1}{n}\sum_{i=1}^{n}\lambda^2\max\Big\{\max_{\delta\in B}\ell(f(A(x_i,\delta), w), y_i),\, \max_{\delta\in B}\ell(c^{j(w)}_i(\delta), y_i)\Big\}^{2\theta}\max_{k\in[K]}\max_{i\in[n]}\max_{\delta\in B}\big|f(A(x_i,\delta), w)_k - c^{j(w)}_i(\delta)_k\big|^2\\
&\quad\le \Big(\frac{\epsilon\lambda}{\lambda(2r)^\theta}\Big)^2\frac{1}{n}\sum_{i=1}^{n}\max\Big\{\max_{\delta\in B}\ell(f(A(x_i,\delta), w), y_i),\, \max_{\delta\in B}\ell(c^{j(w)}_i(\delta), y_i)\Big\}^{2\theta}\\
&\quad\le \frac{\epsilon^2}{(2r)^{2\theta}}\frac{1}{n}\sum_{i=1}^{n}\Big(\max_{\delta\in B}\ell(f(A(x_i,\delta), w), y_i) + \max_{\delta\in B}\ell(c^{j(w)}_i(\delta), y_i)\Big)^{2\theta}\\
&\quad\le \frac{\epsilon^2}{(2r)^{2\theta}}\Big(\frac{1}{n}\sum_{i=1}^{n}\max_{\delta\in B}\ell(f(A(x_i,\delta), w), y_i) + \frac{1}{n}\sum_{i=1}^{n}\max_{\delta\in B}\ell(c^{j(w)}_i(\delta), y_i)\Big)^{2\theta}
\le \frac{\epsilon^2}{(2r)^{2\theta}}(2r)^{2\theta} = \epsilon^2.
\end{align*}
Here the first inequality uses the robust-self-bounding Lipschitz property of the adversarial loss, the third uses the covering property of $C_{H_S|^r}$, the fifth uses Jensen's inequality (since $2\theta \le 1$), and the last step uses that both $w$ and the (proper) cover element have empirical adversarial risk at most $r$. Since we can construct a proper cover at precision $\epsilon$ from a general cover at precision $\epsilon/2$, we conclude that
\[
N_2(\epsilon, G_{\mathrm{adv}}|^r, S) = N_2(\epsilon, G_{\mathrm{adv}}|^r_S) \le N_\infty\Big(\frac{\epsilon}{2\lambda(2r)^\theta}, H_S|^r\Big).
\]
Furthermore, since $H_S|^r \subseteq H_S$, we have
\[
N_2(\epsilon, G_{\mathrm{adv}}|^r, S) \le N_\infty\Big(\frac{\epsilon}{2\lambda(2r)^\theta}, H_S|^r\Big) \le N_\infty\Big(\frac{\epsilon}{2\lambda(2r)^\theta}, H_S\Big).
\]
The final step is to show that $N_\infty\big(\frac{\epsilon}{2\lambda(2r)^\theta}, H_S\big)$ can be controlled by a cover of a discretized version of $H_S$. This holds by Lemma 4.4 and hence completes the proof.

Proof of Lemma 6.6. Note that, by the definition of $M_\epsilon$, we have $|\hat{S}| = nKM_\epsilon$. Our goal is to bound the local Rademacher complexity of the adversarial class $R_S(G_{\mathrm{adv}}|^r)$. By Theorem A.2, we have
\[
R_S(G_{\mathrm{adv}}|^r) \le \inf_{\alpha>0}\bigg(4\alpha + 10\int_{\alpha}^{\sqrt{br}}\sqrt{\frac{\log N_2(\epsilon, G_{\mathrm{adv}}|^r, S)}{n}}\,d\epsilon\bigg).
\]
Select $\alpha = 4\lambda(2r)^\theta/\sqrt{n}$; then
\[
R_S(G_{\mathrm{adv}}|^r) \le \frac{16\lambda(2r)^\theta}{\sqrt{n}} + \frac{10}{\sqrt{n}}\int_{4\lambda(2r)^\theta/\sqrt{n}}^{\sqrt{br}}\sqrt{\log N_\infty\Big(\frac{\epsilon}{4\lambda(2r)^\theta}, F_{\mathrm{adv}}, \hat{S}\Big)}\,d\epsilon.
\]
By a change of variable, we get
\[
R_S(G_{\mathrm{adv}}|^r) \le \frac{16\lambda(2r)^\theta}{\sqrt{n}} + \frac{40\lambda(2r)^\theta}{\sqrt{n}}\int_{1/\sqrt{n}}^{b^{1/2}r^{1/2-\theta}/(2^{2+\theta}\lambda)}\sqrt{\log N_\infty(\epsilon, F_{\mathrm{adv}}, \hat{S})}\,d\epsilon
\le \frac{16\lambda(2r)^\theta}{\sqrt{n}} + \frac{40\lambda(2r)^\theta}{\sqrt{n}}\int_{1/\sqrt{n}}^{b^{1-\theta}/(2^{2+\theta}\lambda)}\sqrt{\log N_\infty(\epsilon, F_{\mathrm{adv}}, \hat{S})}\,d\epsilon,
\]
where the second inequality is due to the fact that $b \ge r$. By the assumption $\log N_\infty(\epsilon, F_{\mathrm{adv}}, \hat{S}) \le R_a/\epsilon^2$ for $\epsilon \in [a, b]$, applied with $a = 1/\sqrt{n}$, we have
\begin{align*}
R_S(G_{\mathrm{adv}}|^r) &\le \frac{16\lambda(2r)^\theta}{\sqrt{n}} + \frac{40\lambda(2r)^\theta}{\sqrt{n}}\int_{1/\sqrt{n}}^{b^{1-\theta}/(2^{2+\theta}\lambda)}\frac{\sqrt{R_{1/\sqrt{n}}}}{\epsilon}\,d\epsilon\\
&= \frac{16\lambda(2r)^\theta}{\sqrt{n}} + \frac{40\lambda(2r)^\theta}{\sqrt{n}}\sqrt{R_{1/\sqrt{n}}}\,\log\Big(\frac{b^{1-\theta}\sqrt{n}}{2^{2+\theta}\lambda}\Big)
= \frac{\lambda(2r)^\theta}{\sqrt{n}}\bigg(16 + 40\sqrt{R_{1/\sqrt{n}}}\,\log\Big(\frac{b^{1-\theta}\sqrt{n}}{2^{2+\theta}\lambda}\Big)\bigg).
\end{align*}
The proof is completed.

To prove Theorem 6.8, we require the following lemma, which gives optimistic bounds for learning with smooth loss functions. We say $\phi$ is sub-root if $r \mapsto \phi(r)/\sqrt{r}$ is non-increasing. We say $r^*$ is a fixed point of $\phi$ if $r^* = \phi(r^*)$.

Lemma D.1 (Srebro et al. 2010). Consider a real hypothesis class $F$ and assume that $|f(x)| \le B$ for all $f \in F$. Let $\ell$ be a loss function bounded by $b$, and let $S$ be sampled i.i.d. from some distribution $D$. Suppose that the local Rademacher complexity satisfies $R_n(\ell \circ F|^r) \le \phi(r)$, where $\phi$ is a sub-root function with fixed point $r^*$. Then, with probability at least $1-\delta$, for all $f \in F$,
\[
R(f) \le \hat{R}(f) + 106\, r^* + \frac{48 b}{n}\big(\log(1/\delta) + \log(\log(n))\big) + \sqrt{\hat{R}(f)\Big(8 r^* + \frac{4b}{n}\big(\log(1/\delta) + \log(\log(n))\big)\Big)}.
\]

Proof of Theorem 6.8. By Lemma D.1, with probability at least $1-\delta$ we have the following inequality for all $w \in W$:
\[
R_{\mathrm{adv}}(w) \le \hat{R}_{\mathrm{adv}}(w) + 106\, r^* + \frac{48b}{n}\big(\log(1/\delta) + \log(\log(n))\big) + \sqrt{\hat{R}_{\mathrm{adv}}(w)\Big(8 r^* + \frac{4b}{n}\big(\log(1/\delta) + \log(\log(n))\big)\Big)},
\]
where $r^*$ is the fixed point of a sub-root function $\phi(r)$ satisfying $R_n(G_{\mathrm{adv}}|^r) \le \phi(r)$. Observe also that, by Lemma 6.6 and since $\theta = 1/2$, a good candidate for $\phi$ is
\[
\phi(r) = \frac{\lambda\sqrt{2r}}{\sqrt{n}}\bigg(16 + 40\sqrt{R_{1/\sqrt{n}}}\,\log\Big(\frac{b^{1-\theta}\sqrt{n}}{2^{2+\theta}\lambda}\Big)\bigg).
\]
Therefore,
\[
r^* = \frac{2\lambda^2}{n}\bigg(16 + 40\sqrt{R_{1/\sqrt{n}}}\,\log\Big(\frac{b^{1-\theta}\sqrt{n}}{2^{2+\theta}\lambda}\Big)\bigg)^2.
\]
It follows that, with probability at least $1-\delta$, for all $w \in W$,
\[
R_{\mathrm{adv}}(w) \le \hat{R}_{\mathrm{adv}}(w) + \frac{212\lambda^2}{n}\bigg(16 + 40\sqrt{R_{1/\sqrt{n}}}\,\log\Big(\frac{b^{1-\theta}\sqrt{n}}{2^{2+\theta}\lambda}\Big)\bigg)^2 + \frac{48b}{n}\big(\log(1/\delta) + \log(\log(n))\big) + \sqrt{\hat{R}_{\mathrm{adv}}(w)\bigg(\frac{16\lambda^2}{n}\bigg(16 + 40\sqrt{R_{1/\sqrt{n}}}\,\log\Big(\frac{b^{1-\theta}\sqrt{n}}{2^{2+\theta}\lambda}\Big)\bigg)^2 + \frac{4b}{n}\big(\log(1/\delta) + \log(\log(n))\big)\bigg)}.
\]
The proof is completed.
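For $\theta = 1/2$, the function $\phi$ used in the proof of Theorem 6.8 has the sub-root form $\phi(r) = c\sqrt{r}$, whose fixed point is $r^* = c^2$. The sketch below illustrates this with a simple fixed-point iteration; the constant $c$ only mimics the shape of the bound of Lemma 6.6, and all numerical values (including the logarithmic factor) are placeholders rather than quantities taken from the paper.

```python
import numpy as np

# Placeholder values: lam plays the role of lambda, R of the entropy constant,
# and log_factor of the logarithmic term appearing in Lemma 6.6.
lam, n, R, log_factor = 0.5, 10_000, 2.0, 5.0
c = lam * np.sqrt(2.0 / n) * (16 + 40 * np.sqrt(R) * log_factor)

def phi(r):
    # phi is sub-root: phi(r) / sqrt(r) = c is constant, hence non-increasing.
    return c * np.sqrt(r)

# Fixed-point iteration r <- phi(r) converges to r* = c^2 from any r > 0.
r = 1.0
for _ in range(50):
    r = phi(r)
print(r, "vs closed form", c ** 2)
```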
It was shown in Reeve & Kaban (2020) that for t, t RK |Lρ(t, y) ℓ(t , y)| 2 6/ρ max{ℓ(t, y), ℓ(t , y)} 1 2 t t . (16) Now consider the function ν : B RK and µ : B RK. Then | max δ B Lρ(µ(δ), y) max δ B Lρ(ν(δ), y)| max δ B |Lρ(µ(δ), y) Lρ(ν(δ), y)| 6/ρ max{Lρ(µ(δ), y), Lρ(ν(δ), y)} 1 2 µ(δ) ν(δ) 6/ρ max{max δ B Lρ(µ(σ), y), max δ B Lρ(ν(σ), y)} 1 2 max δ B µ(δ) ν(δ) , where the second inequality is due to Eq. (16).