ON ACHIEVING OPTIMAL ADVERSARIAL TEST ERROR

Justin D. Li & Matus Telgarsky
University of Illinois, Urbana-Champaign
{jdli3,mjt}@illinois.edu

ABSTRACT

We first elucidate various fundamental properties of optimal adversarial predictors: the structure of optimal adversarial convex predictors in terms of optimal adversarial zero-one predictors, bounds relating the adversarial convex loss to the adversarial zero-one loss, and the fact that continuous predictors can get arbitrarily close to the optimal adversarial error for both convex and zero-one losses. Applying these results along with new Rademacher complexity bounds for adversarial training near initialization, we prove that for general data distributions and perturbation sets, adversarial training on shallow networks with early stopping and an idealized optimal adversary is able to achieve optimal adversarial test error. By contrast, prior theoretical work either considered specialized data distributions or only provided training error guarantees.

1 INTRODUCTION

Imperceptibly altering the input data in a malicious fashion can dramatically decrease the accuracy of neural networks (Szegedy et al., 2014). To defend against such adversarial attacks, maliciously altered training examples can be incorporated into the training process, encouraging robustness in the final neural network. Differing types of attacks used during this adversarial training, such as FGSM (Goodfellow et al., 2015), PGD (Madry et al., 2019), and the C&W attack (Carlini & Wagner, 2016), which are optimization-based procedures that try to find bad perturbations around the inputs, have been shown to help with robustness. While many other defenses have been proposed (Guo et al., 2017; Dhillon et al., 2018; Xie et al., 2017), adversarial training is the standard approach (Athalye et al., 2018).

Despite many advances, a large gap still persists between the accuracies we are able to achieve on non-adversarial and adversarial test sets. For instance, in Madry et al. (2019), a wide ResNet model was able to achieve 95% accuracy on CIFAR-10 with standard training, but only 46% accuracy on CIFAR-10 images with perturbations arising from PGD bounded by 8/255 in each coordinate, even with the benefit of adversarial training.

In this work we seek to better understand the optimal adversarial predictors we are trying to achieve, as well as how adversarial training can help us get there. While several recent works have analyzed properties of optimal adversarial zero-one classifiers (Bhagoji et al., 2019; Pydi & Jog, 2020; Awasthi et al., 2021b), in the present work we build on these analyses to characterize optimal adversarial convex surrogate loss classifiers. Even though some prior works have suggested shifting away from the use of convex losses in the adversarial setting because they are not adversarially calibrated (Bao et al., 2020; Awasthi et al., 2021a;c; 2022a;b), we show the use of convex losses is not an issue as long as a threshold is appropriately chosen. We will also show that under idealized settings adversarial training can achieve the optimal adversarial test error. In prior work, guarantees on the adversarial test error have been elusive, except in the specialized case of linear regression (Donhauser et al., 2021; Javanmard et al., 2020; Hassani & Javanmard, 2022).
Our analysis is in the Neural Tangent Kernel (NTK) or near-initialization regime, where recent work has shown analyzing gradient descent can be more tractable (Jacot et al., 2018; Du et al., 2018). Of many such works our analysis is closest to Ji et al. (2021), which provides a general test error analysis, but for standard (non-adversarial) training.

A recent work (Rice et al., 2020) suggests that early stopping helps with adversarial training, as otherwise the network enters a robust overfitting phase in which the adversarial test error quickly rises while the adversarial training error continues to decrease. The present work uses a form of early stopping, and so is in the earlier regime where there is little to no overfitting.

Figure 1: A plot of the (robust/standard) zero-one (training/test) loss throughout training for an adversarially trained network. We ran Rice et al.'s code, using a constant step size of 0.01. [Plot omitted: the horizontal axis is training epochs on a log scale ($10^0$ to $10^3$), the vertical axis is the zero-one loss, the curves are the robust and standard losses on the training and test sets, and the plot is annotated with an "early phase (this paper)" and a "late phase (not this paper)".]

The present work is set within the early phase of training, where we can get arbitrarily close to the optimal adversarial test error. In fact, due to technical reasons our analysis will be further restricted to an even earlier portion of this phase, as we remain within the near-initialization/NTK regime. As noted in prior work, adversarial training, as compared with standard training, seems to have more fragile test-time performance and quickly enters a phase of severe overfitting, but we do not consider this issue here.

1.1 OUR CONTRIBUTIONS

In this work, we prove structural results on the nature of predictors that are close to, or even achieve, optimal adversarial test error. In addition, we prove adversarial training on shallow ReLU networks can get arbitrarily close to the optimal adversarial test error over all measurable functions. This theoretical guarantee requires the use of optimal adversarial attacks during training, meaning we have access to an oracle that gives, within an allowed set of perturbations, the data point which maximizes the loss. We also use early stopping so that we remain in the near-initialization regime and ensure low model complexity. The main technical contributions are as follows.

1. Optimal adversarial predictor structure (Section 3). We prove fundamental results about optimal adversarial predictors by relating the global adversarial convex loss to global adversarial zero-one losses (cf. Lemma 3.1). We show that optimal adversarial convex loss predictors are directly related to optimal adversarial zero-one loss predictors (cf. Lemma 3.2). In addition, for a predictor whose adversarial convex loss is almost optimal, we show that when an appropriate threshold is chosen, its adversarial zero-one loss is also almost optimal (cf. Theorem 3.3). This theorem translates bounds on adversarial convex losses, such as those in Section 4, into bounds on adversarial zero-one losses when optimal thresholds are chosen. Using our structural results on optimal adversarial predictors, we prove that continuous functions can get arbitrarily close to the optimal test error given by measurable functions (cf. Lemma 3.4).

2. Adversarial training (Section 4). Under idealized settings, we show adversarial training leads to optimal adversarial predictors.

(a) Generalization bound. We prove a near-initialization generalization bound for adversarial risk (cf. Lemma 4.4).
To do so, we provide a Rademacher complexity bound for linearized functions around initialization (cf. Lemma 4.5). The overall bound scales directly with the parameters' distance from initialization and with $1/\sqrt{n}$, where n is the number of training points. Included in the bound is a perturbation term which depends on the width of the network, and in the worst case scales like $\tau^{1/4}$, where τ bounds the ℓ2 norm of the perturbations.

(b) Optimization bound. We show that using an optimal adversarial attack during gradient descent training results in a network which is adversarially robust on the training set, in the sense that it is not much worse compared to an arbitrary reference network (cf. Lemma 4.6). Comparing to a reference network instead of just ensuring low training error (as in prior work) will be key to obtaining a good generalization analysis, as the optimal adversarial test error may be high.

(c) Optimal test error. As the generalization and optimization bounds are both in a near-initialization setting, these two bounds can be used in conjunction. We first bound the test error of our trained network in terms of its training error using the generalization bound, and then apply the optimization bound to compare against the training error of an arbitrary reference network. Another application of our generalization bound then allows us to compare against the test error of an arbitrary reference network (cf. Theorem 4.1). Applying approximation bounds and Lemma 3.4 then lets us bound our trained network's test error in terms of the optimal test error over all measurable functions (cf. Corollary 4.2).

2 RELATED WORK

We highlight several papers in the adversarial and near-initialization communities that are relevant to this work.

Optimal adversarial predictors. Several works study the properties of optimal adversarial predictors when considering the zero-one loss (Bhagoji et al., 2019; Pydi & Jog, 2020; Awasthi et al., 2021b). In this work, we are able to understand optimal adversarial predictors under convex losses in terms of those under zero-one losses, although we will not make use of any properties of optimal zero-one adversarial predictors other than the fact that they exist. Other works study the inherent tradeoff between robust and standard accuracy (Tsipras et al., 2019; Zhang et al., 2019), but these are complementary to this work as we only focus on the adversarial setting.

Convex losses. Several works explore the relationship between convex losses and zero-one losses in the non-adversarial setting (Zhang, 2004; Bartlett et al., 2006). Whereas the optimal predictor in the non-adversarial setting can be understood locally at individual points in the input domain, it is difficult to do so in the adversarial setting due to the possibility of overlapping perturbation sets. As a result, our analysis will be focused on the global structure of optimal adversarial predictors. Convex losses written as an integral over reweighted zero-one losses have appeared before (Savage, 1971; Schervish, 1989; Hernández-Orallo et al., 2012), and we will adapt and make use of this representation in the adversarial setting.

Adversarial surrogate losses. Several works have suggested convex losses are inappropriate in the adversarial setting because they are not calibrated, and instead propose using non-convex surrogate losses (Bao et al., 2020; Awasthi et al., 2021a;c; 2022a;b).
In this work, we show that with appropriate thresholding convex losses are calibrated, and so are an appropriate choice for the adversarial setting.

Near-initialization. Several works utilize the properties of networks in the near-initialization regime to obtain bounds on the test error when using gradient descent (Li & Liang, 2018; Arora et al., 2019; Cao & Gu, 2019; Nitanda et al., 2020; Ji & Telgarsky, 2019; Chen et al., 2019; Ji et al., 2021). In particular, this paper most directly builds upon the prior work of Ji et al. (2021), which showed that shallow neural networks could learn to predict arbitrarily well. We adapt their analysis to the adversarial setting.

Adversarial training techniques. Adversarial training initially used FGSM (Goodfellow et al., 2015) to find adversarial examples. Numerous improvements have since been proposed, such as iterated FGSM (Kurakin et al., 2016) and PGD (Madry et al., 2019), which strives to find even stronger adversarial examples. These works are complementary to ours, because here we assume that we have an optimal adversarial attack, and show that with such an algorithm we can get optimal adversarial test error. Some of these alterations (Zhang et al., 2019; Wang et al., 2021; Miyato et al., 2018; Kannan et al., 2018) do not strictly attempt to find a maximal adversarial attack at every iteration, but instead use some other criteria. However, Rice et al. (2020) proposes that many of the advancements to adversarial training since PGD can be matched with early stopping. Our work corroborates the power of early stopping in adversarial training, as we use it in our analysis.

Adversarial training error bounds. Several works are able to show convergence of the adversarial training error. Gao et al. (2019) did so for networks with smooth activations, but is unable to handle constant-sized perturbations as the width increases. Meanwhile, Zhang et al. (2020) uses ReLU activations, but imposes a strong separability condition on the training data. Our training error bounds use ReLU activations, and in contrast to these previous works simultaneously hold for constant-sized perturbations and consider general data distributions. However, we note that the ultimate goals of these works differ, as we focus on adversarial test error.

Adversarial generalization bounds. There are several works providing adversarial generalization bounds. They are not tailored to the near-initialization setting, and so they are either looser or require assumptions that are not satisfied here. These other approaches include SDP relaxation based bounds (Yin et al., 2018), tree transforms (Khim & Loh, 2019), and covering arguments (Tu et al., 2019; Awasthi et al., 2020; Balda et al., 2019). Our generalization bound also uses a covering argument, but pairs this with a near-initialization decoupling.

There are a few works that are able to achieve adversarial test error bounds in specialized cases. These settings have been analyzed when the data distribution is linear, both when the model is linear too (Donhauser et al., 2021; Javanmard et al., 2020) and for random features (Hassani & Javanmard, 2022). In the present work, we are able to handle general data distributions.

3 PROPERTIES OF OPTIMAL ADVERSARIAL PREDICTORS

This section builds towards Theorem 3.3, relating zero-one losses to convex surrogate losses.

3.1 SETTING

We consider a distribution D with associated measure µ that is Borel measurable over $X \times Y$, where $X \subseteq \mathbb{R}^d$ is compact and $Y = \{-1, +1\}$.
For simplicity, throughout we will take $X = B_1$ to be the closed Euclidean ball of radius 1 centered at the origin. We allow arbitrary $P(y = 1 \mid x) \in [0, 1]$; that is, the true labels may be noisy.

We will consider general adversarial perturbations. For $x \in B_1$, let $P(x)$ be the closed set of allowed perturbations. That is, an adversarial attack is allowed to change the input $x$ to any $x' \in P(x)$. We will impose the natural restrictions that $\emptyset \neq P(x) \subseteq B_1$ for all $x \in B_1$. That is, there always exists at least one perturbed input, and perturbations cannot exceed the natural domain of the problem. In addition, we will assume the set-valued function $P$ is upper hemicontinuous. That is, for any $x \in X$ and any open set $U \supseteq P(x)$, there exists an open set $V \ni x$ such that $P(V) = \bigcup_{v \in V} P(v) \subseteq U$. For the commonly used ℓ∞ perturbations, as well as many other commonly used perturbation sets (Yang et al., 2020), these assumptions hold. As an example, in the above notation we would write ℓ∞ perturbations as $P(x) = \{x' \in B_1 : \|x' - x\|_\infty \leq \tau\}$.

Let $f : B_1 \to \mathbb{R}$ be a predictor. We will let $\ell_c$ be any nonincreasing convex loss with continuous derivative. The adversarial loss is $\ell_A(x, y, f) = \max_{x' \in P(x)} \ell_c(y f(x'))$, and the adversarial convex risk is $R_A(f) = \int \ell_A(x, y, f) \, d\mu(x, y)$. For convenience, define $f^+(x) = \sup_{x' \in P(x)} f(x')$ and $f^-(x) = \inf_{x' \in P(x)} f(x')$, the worst-case values for perturbations in $P(x)$ when $y = +1$ and $y = -1$, respectively. Then we can write the adversarial zero-one risk as
$$R_{AZ}(f) := \int \mathbb{1}[y = +1]\,\mathbb{1}[f^-(x) < 0] + \mathbb{1}[y = -1]\,\mathbb{1}[f^+(x) \geq 0] \, d\mu(x, y).$$

To relate the adversarial convex and zero-one risks, we will use reweighted adversarial zero-one risks $R^t_{AZ}(f)$ as an intermediate quantity, defined as follows. The adversarial zero-one risk when the +1 labels have weight $(-\ell_c'(t))$ and the −1 labels have weight $(-\ell_c'(-t))$ is
$$R^t_{AZ}(f) := \int \mathbb{1}[y = +1]\,\mathbb{1}[f^-(x) < 0]\,(-\ell_c'(t)) + \mathbb{1}[y = -1]\,\mathbb{1}[f^+(x) \geq 0]\,(-\ell_c'(-t)) \, d\mu(x, y).$$

3.2 RESULTS

We present a number of structural properties of optimal adversarial predictors. The key insight will be to write the global adversarial convex loss in terms of global adversarial zero-one losses, as follows.

Lemma 3.1. For any predictor f, $R_A(f) = \int R^t_{AZ}(f - t) \, dt$.

$R^t_{AZ}(f - t)$ is an intuitive quantity to consider for the following reason. In the non-adversarial setting, a predictor outputting a value of $f(x)$ corresponds to a prediction of $\frac{-\ell_c'(-f(x))}{(-\ell_c'(f(x))) + (-\ell_c'(-f(x)))}$ of the labels being +1 at that point. If +1 labels are given weight $(-\ell_c'(t))$ and −1 labels weight $(-\ell_c'(-t))$, then $f(x)$ would predict at least half the labels being +1 if and only if
$$\frac{-\ell_c'(-f(x))}{(-\ell_c'(f(x))) + (-\ell_c'(-f(x)))} \geq \frac{-\ell_c'(-t)}{(-\ell_c'(t)) + (-\ell_c'(-t))}.$$
As a result, the optimal non-adversarial predictor is also an optimal non-adversarial zero-one classifier at threshold t for the corresponding reweighting of +1 and −1 labels. Even though we won't be able to rely on the same local analysis as our intuition above, it turns out the same thing is globally true in the adversarial setting.

Lemma 3.2. There exists a predictor $g : X \to \mathbb{R}$ such that $R_A(g)$ is minimal. For any such predictor, $R^t_{AZ}(g - t) = \inf_f R^t_{AZ}(f)$ for all $t \in \mathbb{R}$.

Note that the predictor in Lemma 3.2 is not necessarily unique. For instance, the predictor's value at a particular point x might not matter to the adversarial risk because there are no points in the underlying distribution whose perturbation sets include x. To prove Lemma 3.2, we will use optimal adversarial zero-one predictors to construct an optimal adversarial convex loss predictor g.
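As a concrete illustration of Lemma 3.1, the following minimal sketch numerically compares the adversarial logistic risk $R_A(f)$ with a grid approximation of $\int R^t_{AZ}(f - t)\,dt$ on a toy finite distribution; the points, labels, perturbation sets, and predictor values below are hypothetical choices of ours, not taken from the paper.

```python
# Minimal numerical check of Lemma 3.1 (R_A(f) = integral over t of R^t_AZ(f - t))
# for the logistic loss on a toy finite distribution. All data here is made up.
import numpy as np

def ell(z):                       # logistic loss ln(1 + e^{-z})
    return np.logaddexp(0.0, -z)

def neg_ell_prime(t):             # -ell'(t) = 1 / (1 + e^{t})
    return 1.0 / (1.0 + np.exp(t))

f = {0: -0.7, 1: 0.2, 2: 1.5}                      # predictor values on a finite domain
P = {0: {0, 1}, 1: {0, 1, 2}, 2: {1, 2}}           # finite perturbation sets P(x)
D = [(0, -1, 0.25), (1, +1, 0.25), (2, +1, 0.5)]   # (x, y, probability)

f_minus = {x: min(f[xp] for xp in P[x]) for x in P}
f_plus  = {x: max(f[xp] for xp in P[x]) for x in P}

# Adversarial convex risk: worst-case logistic loss over the perturbation set.
R_A = sum(p * (ell(f_minus[x]) if y == +1 else ell(-f_plus[x])) for x, y, p in D)

def R_AZ_t(t):  # reweighted adversarial zero-one risk of the shifted predictor f - t
    out = 0.0
    for x, y, p in D:
        if y == +1:
            out += p * (f_minus[x] < t) * neg_ell_prime(t)
        else:
            out += p * (f_plus[x] >= t) * neg_ell_prime(-t)
    return out

ts = np.linspace(-40.0, 40.0, 80001)               # the integrand decays exponentially
integral = sum(R_AZ_t(t) for t in ts) * (ts[1] - ts[0])
print(f"R_A(f) = {R_A:.4f}, grid approximation of the integral = {integral:.4f}")
```

With the logistic loss the identity holds exactly; the two printed values should agree up to the discretization of the t grid.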
Conversely, we can also use an optimal adversarial convex loss predictor to construct optimal adversarial zero-one predictors.

In general, the predictors we find will not exactly have the optimal adversarial convex loss. For these predictors we have the following error gap.

Theorem 3.3. Suppose there exist $s \geq 1$ and $c > 0$ such that
$$G_{\ell_c}(p) := \ell_c(0) - \inf_{z \in \mathbb{R}} \big( p\,\ell_c(z) + (1 - p)\,\ell_c(-z) \big) \geq \frac{|2p - 1|^s}{c}.$$
Then for any predictor g,
$$\inf_t R_{AZ}(g - t) - \inf_{h\ \mathrm{meas.}} R_{AZ}(h) \leq 3^{1 - 1/s} \Big( c \big( R_A(g) - \inf_{h\ \mathrm{meas.}} R_A(h) \big) \Big)^{1/s}.$$

The idea of using $G_{\ell_c}(p)$ comes from Zhang (2004). When we explicitly set $\ell_c$ to be the logistic loss in Section 4, we can choose s = 2 and c = 2 (Zhang, 2004).

While similar bounds exist in the non-adversarial case with $R_{AZ}(g)$ in place of $\inf_t R_{AZ}(g - t)$ (Zhang, 2004; Bartlett et al., 2006), the analogue with $R_{AZ}(g) - \inf_{h\ \mathrm{meas.}} R_{AZ}(h)$ appearing on the left-hand side is false in the adversarial setting, which can be seen as follows. Consider a uniform distribution of (x, y) pairs over $\{(-1, -1), (+1, +1)\}$, and suppose $\{-1, +1\} \subseteq P(-1) \cap P(+1)$. Then the optimal adversarial convex risk is $\ell_c(0)$, given by f(x) = 0, and the optimal adversarial zero-one risk is 1/2, given by sgn(f(x)) = +1. However, for ϵ > 0 the predictor $g_\epsilon$ with $g_\epsilon(+1) = \epsilon$, $g_\epsilon(-1) = -\epsilon$, and $g_\epsilon(x) \in [-\epsilon, \epsilon]$ everywhere else gives adversarial convex risk $\tfrac{1}{2}\big(\ell_c(\epsilon) + \ell_c(-\epsilon)\big)$ and adversarial zero-one risk 1. This results in
$$R_{AZ}(g_\epsilon) - \inf_{h\ \mathrm{meas.}} R_{AZ}(h) \xrightarrow{\epsilon \to 0} 1/2, \qquad R_A(g_\epsilon) - \inf_{h\ \mathrm{meas.}} R_A(h) \xrightarrow{\epsilon \to 0} 0,$$
demonstrating the necessity of some change compared to the analogous non-adversarial bound.

As this example shows, getting arbitrarily close to the optimal adversarial convex risk does not guarantee getting arbitrarily close to the optimal adversarial zero-one risk. This inadequacy of convex losses in the adversarial setting has been noted in prior work (Bao et al., 2020; Awasthi et al., 2021a;c; 2022a;b), leading them to suggest the use of non-convex losses. However, as Theorem 3.3 shows, we can circumvent this inadequacy if we allow the choice of an optimal (possibly nonzero) threshold.

While we have compared against optimal measurable predictors here, in Section 4 we will use continuous predictors. This presents a potential problem, as there may be a gap between the adversarial risks achievable by measurable and continuous predictors. It turns out this is not the case, as the following lemma shows.

Lemma 3.4. For the adversarial risk, comparing against all continuous functions is equivalent to comparing against all measurable functions. That is, $\inf_{g\ \mathrm{cts.}} R_A(g) = \inf_{h\ \mathrm{meas.}} R_A(h)$.

In the next section, we will use Lemma 3.4 to compare trained continuous predictors against all measurable functions.
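To connect Theorem 3.3 to practice before turning to training, here is a small hypothetical sketch of the thresholding it suggests: given any predictor g and a labeled sample whose perturbation sets can be enumerated, scan candidate thresholds t and keep the one minimizing the empirical adversarial zero-one risk of g − t. Everything here (the predictor, the discretized perturbation sets, and the toy labels) is an illustrative assumption of ours rather than the paper's procedure.

```python
# Hypothetical sketch of the optimal-threshold selection motivated by Theorem 3.3.
# The perturbation sets are finite/discretized here, so the inner worst case is an
# exact enumeration; with continuous sets an attack oracle would be substituted.
import numpy as np

def adv_zero_one_risk(g, P, sample, t):
    """Empirical adversarial zero-one risk of the shifted predictor g - t."""
    errors = 0
    for x, y in sample:
        vals = [g(xp) for xp in P(x)]
        if y == +1:
            errors += (min(vals) - t < 0)    # some perturbation drops the margin below 0
        else:
            errors += (max(vals) - t >= 0)   # some perturbation keeps the output above 0
    return errors / len(sample)

def best_threshold(g, P, sample):
    # The empirical risk only changes at perturbed predictor values seen in the
    # sample, so (up to boundary ties) it suffices to scan those values and 0.
    cands = {0.0}
    for x, _ in sample:
        cands.update(float(g(xp)) for xp in P(x))
    return min(cands, key=lambda t: adv_zero_one_risk(g, P, sample, t))

# Toy usage: 1-D inputs, interval perturbations discretized to 11 candidate points.
g = lambda x: np.tanh(3.0 * x) + 0.4                  # a deliberately biased predictor
P = lambda x: np.linspace(x - 0.1, x + 0.1, 11)
rng = np.random.default_rng(0)
sample = [(x, +1 if x > 0 else -1) for x in rng.uniform(-1, 1, size=200)]
t_star = best_threshold(g, P, sample)
print("chosen threshold:", t_star,
      "adversarial zero-one risk:", adv_zero_one_risk(g, P, sample, t_star))
```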
4 ADVERSARIAL TRAINING

Theorem 3.3 shows that with optimally chosen thresholds, to achieve nearly optimal adversarial zero-one risk it suffices to achieve nearly optimal adversarial convex risk. However, it is unclear how to find such a predictor. In this section we remedy this issue, proving bounds on the adversarial convex risk when adversarial training is used on shallow ReLU networks. In particular, we show that with appropriately chosen parameters we can achieve adversarial convex risk that is arbitrarily close to optimal. Unlike Section 3, our results here will be specific to the logistic loss.

4.1 SETTING

Training points $(x_k, y_k)_{k=1}^n$ are drawn from the distribution D. Note that $\|x_k\| \leq 1$, where by default we use $\|\cdot\|$ to denote the ℓ2 norm. We will let $\tau = \sup\{\|x' - x\|_2 : x' \in P(x),\ x \in B_1\}$ be the maximum ℓ2 norm of the adversarial perturbations. By our restrictions on the perturbation sets, we have $0 \leq \tau \leq 2$. Throughout this section we will set $\ell_c$ to be the logistic loss $\ell(z) = \ln(1 + e^{-z})$. The empirical adversarial loss and risk are $\ell_{A,k}(f) = \ell_A(x_k, y_k, f)$ and $\hat R_A(f) = \frac{1}{n}\sum_{k=1}^n \ell_{A,k}(f)$.

The predictors will be shallow ReLU networks of the form $f(W; x) = \frac{\rho}{\sqrt m}\sum_{j=1}^m a_j\,\sigma(w_j^\top x)$, where W is an m × d matrix, $w_j^\top$ is the jth row of W, $\sigma(z) = \max(0, z)$ is the ReLU, $a_j \in \{\pm 1\}$ are initialized uniformly at random, and ρ is a temperature parameter that we can set. Out of all of these parameters, only W will be trained. The initial parameters $W_0$ will have entries initialized from standard Gaussians with variance 1, which we then train to get future iterates $W_i$. We will frequently use the features of $W_i$ for other parameters W; that is, we will consider $f^{(i)}(W; x) = \langle \nabla f(W_i; x), W \rangle$, where the gradient is taken with respect to the matrix, not the input. Note that f is not differentiable at all points. When this is the case, by $\nabla f$ we mean some choice of element of $\partial f$, the Clarke differential. For notational convenience we define $\hat R_A(W) = \hat R_A(f(W; \cdot))$, $R_A(W) = R_A(f(W; \cdot))$, $\hat R^{(i)}_A(W) = \hat R_A(f^{(i)}(W; \cdot))$, and $R^{(i)}_A(W) = R_A(f^{(i)}(W; \cdot))$.

Our adversarial training will be as follows. To get the next iterate $W_{i+1}$ from $W_i$ for $i \geq 0$, we will use gradient descent with $W_{i+1} = W_i - \eta \nabla \hat R_A(W_i)$. Normally adversarial training occurs in two steps:

1. For each k, find $x_k' \in P(x_k)$ such that $\ell(y_k f(W_i; x_k'))$ is maximized.
2. Perform a gradient descent update using the adversarial inputs found in the previous step: $W_{i+1} = W_i - \eta \nabla\big(\tfrac{1}{n}\sum_{k=1}^n \ell(y_k f(W_i; x_k'))\big)$.

Step 1 is an adversarial attack, in practice done with a method such as PGD that does not necessarily find an optimal attack. However, we will assume the idealized scenario where we are able to find an optimal attack. Our goal will be to find a network that has low risk with respect to optimal adversarial attacks. That is, we want to find f such that $R_A(f)$ is as small as possible.

4.2 RESULTS

Our adversarial training theorem will compare the risk we obtain to that of arbitrary reference parameters $Z \in \mathbb{R}^{m \times d}$, which we will choose appropriately when we apply this theorem in Corollaries 4.2 and 4.3. To get near-optimal risk, we will apply our early stopping criterion of running gradient descent until $\|W_i - W_0\| > 2R_Z$, where $R_Z \geq \max\{1, \eta\rho, \|Z - W_0\|\}$ is a quantity we assume knowledge of, at which point we will stop. It is possible this may never occur; in that case, we will stop at some time step t, which is a parameter we are free to choose. We just need to choose t sufficiently large to allow for enough training to occur. The iterate we choose as our final model will be the one with the best training risk.

Theorem 4.1. Let $m \geq \ln(emd)$ and $\eta\rho^2 < 2$. For any $Z \in \mathbb{R}^{m \times d}$, let $R_Z \geq \max\{1, \eta\rho, \|Z - W_0\|\}$ and $\bar W_t = \arg\min\{\hat R_A(W_i) : 0 \leq i \leq t,\ \|W_j - W_0\| \leq 2R_Z\ \forall j \leq i\}$. Then with probability at least 1 − 12δ,
$$R_A(\bar W_t) \leq \frac{2}{2 - \eta\rho^2}\, R^{(0)}_A(Z) + \widetilde O\!\left( \frac{R_Z^2}{\eta t} + \frac{\rho R_Z\, d + \sqrt{\tau m}}{\sqrt n} + \frac{\rho R_Z^{4/3}\, d^{1/3}}{m^{1/6}} \right),$$
where $\widetilde O$ suppresses $\ln(n)$, $\ln(m)$, $\ln(d)$, $\ln(1/\delta)$, $\ln(1/\tau)$ terms.

In Corollary 4.2 we will show we can set parameters so that all error terms are arbitrarily small. The early stopping criterion $\|W_i - W_0\| > 2R_Z$ will allow us to get a good generalization bound, as we will show that all iterates then have $\|W_i - W_0\| \leq 2R_Z + \eta\rho$. When the early stopping criterion is met, we will be able to get a good optimization bound. When it is not, choosing t large enough allows us to do so.
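The training procedure above can be summarized in a short sketch. The following minimal numpy implementation is ours, not the authors' code: the attack oracle is passed in as a black box (the analysis assumes it returns exact maximizers of the loss, whereas in practice one would plug in PGD), and names such as `attack`, `R_Z`, and `t_max` are illustrative assumptions.

```python
# Minimal sketch (assuming an attack oracle) of the adversarial training analyzed in
# Section 4: shallow ReLU network f(W; x) = (rho/sqrt(m)) * sum_j a_j * relu(w_j^T x),
# gradient descent on the empirical adversarial logistic risk, and early stopping
# once ||W_i - W_0||_F > 2 R_Z, returning the iterate with the best training risk.
import numpy as np

def forward(W, a, rho, X):                         # X has shape (n, d)
    return (rho / np.sqrt(W.shape[0])) * np.maximum(X @ W.T, 0.0) @ a

def adv_risk_and_grad(W, a, rho, X_adv, y):
    m = W.shape[0]
    z = y * forward(W, a, rho, X_adv)              # margins at the attacked inputs
    risk = np.logaddexp(0.0, -z).mean()            # empirical adversarial logistic risk
    dloss = -1.0 / (1.0 + np.exp(z))               # derivative of the logistic loss
    gates = (X_adv @ W.T > 0.0).astype(X_adv.dtype)        # ReLU indicators, (n, m)
    coef = (dloss * y)[:, None] * gates * a[None, :] * (rho / np.sqrt(m))
    return risk, coef.T @ X_adv / len(y)           # gradient with respect to W, (m, d)

def adversarial_train(X, y, attack, m, rho, eta, R_Z, t_max, seed=0):
    rng = np.random.default_rng(seed)
    W0 = rng.standard_normal((m, X.shape[1]))      # standard Gaussian initialization
    a = rng.choice([-1.0, 1.0], size=m)            # random signs, never trained
    W, best_risk, best_W = W0.copy(), np.inf, W0.copy()
    for _ in range(t_max):
        if np.linalg.norm(W - W0) > 2.0 * R_Z:     # early stopping criterion
            break
        # Step 1: attack oracle returns (ideally optimal) perturbed inputs in P(x_k).
        X_adv = attack(lambda Z: forward(W, a, rho, Z), X, y)
        # Step 2: gradient step on the loss evaluated at the attacked inputs.
        risk, grad = adv_risk_and_grad(W, a, rho, X_adv, y)
        if risk < best_risk:
            best_risk, best_W = risk, W.copy()     # keep iterate with best training risk
        W = W - eta * grad
    return best_W
```

In practice the oracle would be replaced by PGD, in which case the guarantee of Theorem 4.1 no longer directly applies; see the discussion of weaker attacks in Section 5.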
It may be concerning that we require knowledge of $R_Z$, as otherwise the algorithm changes depending on which reference parameters we use. However, in practice we could instead use a validation set: rather than choosing the model with the best training risk, we could choose the model with the best validation risk, which would remove the need for knowing $R_Z$. Ultimately, our assumption of knowing $R_Z$ is there to simplify the analysis and highlight other aspects of the problem, and we leave dropping this assumption to future work.

To compare against all continuous functions, we will use the universal approximation of infinite-width neural networks (Barron, 1993). We use a specific form that adapts Barron's arguments to give an estimate of the complexity of the infinite-width neural network (Ji et al., 2019). We consider infinite-width networks of the form $f(x; U) := \int \langle U(v), x \rangle \mathbb{1}[v^\top x \geq 0] \, dN(v)$, where $U : \mathbb{R}^d \to \mathbb{R}^d$ parameterizes the network, and N is a standard d-dimensional Gaussian distribution. We will choose an infinite-width network $f(\cdot; U_\epsilon)$ with a finite complexity measure $\sup_x \|U_\epsilon(x)\|$ that is ϵ-close to a near-optimal continuous function. Letting $R_\epsilon := \max\{\rho, \eta\rho^2, \sup_x \|U_\epsilon(x)\|\}$, with high probability we can extract a finite-width network Z close to the infinite-width network whose distance from $W_0$ is at most $R_Z = R_\epsilon/\rho$. Note that our assumed knowledge of $R_Z$ is equivalent to assuming knowledge of $R_\epsilon$.

From Lemma 3.4 we know continuous functions can get arbitrarily close to the optimal adversarial risk over all measurable functions, so we can instead compare against all measurable functions. We immediately have an issue with our formulation: the comparator in Theorem 4.1 is homogeneous. To have any hope of predicting general functions, we need biases. We simulate biases by adding a dummy dimension to the input and then normalizing; that is, we transform the input $x \mapsto \frac{1}{\sqrt 2}(x; 1)$. The dummy dimension, while part of the input to the network, is not truly a part of the input, and so we do not allow adversarial perturbations to affect this dummy dimension.
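As a small illustration of the bias trick just described, the sketch below appends the dummy coordinate and keeps adversarial perturbations away from it; the ℓ2 projection radius and helper names are our own assumptions.

```python
# Hypothetical sketch of the dummy-dimension bias trick: map x to (x, 1)/sqrt(2)
# and forbid the adversary from touching the appended coordinate.
import numpy as np

def add_dummy(X):
    """Map each row x (with ||x|| <= 1) to (x, 1)/sqrt(2)."""
    ones = np.ones((X.shape[0], 1), dtype=X.dtype)
    return np.hstack([X, ones]) / np.sqrt(2.0)

def perturb_preserving_dummy(X_aug, delta, tau):
    """Add a perturbation of l2 norm at most tau to the non-dummy coordinates only."""
    delta = np.hstack([delta, np.zeros((delta.shape[0], 1))])   # dummy coordinate frozen
    norms = np.linalg.norm(delta, axis=1, keepdims=True)
    delta = delta * np.minimum(1.0, tau / np.maximum(norms, 1e-12))  # project to the tau-ball
    return X_aug + delta
```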
Corollary 4.2. Let ϵ > 0. Then there exists a finite $R_\epsilon \geq \max\{\rho, \eta\rho^2\}$ representing the complexity measure of an infinite-width network that is within ϵ of the optimal adversarial risk. Then with probability at least 1 − δ, setting $\rho = \widetilde\Theta(\epsilon)$ and $\eta = \widetilde\Theta(1/\epsilon)$, taking t sufficiently large, and requiring n to satisfy $n = \widetilde\Omega\big(\max(1, \tau m)\, R_\epsilon^2/\epsilon^2\big)$, where $\widetilde\Theta, \widetilde\Omega$ suppress $\ln(R_\epsilon)$, $\ln(1/\epsilon)$, $\ln(1/\delta)$ terms, we have
$$R_A(\bar W_t) \leq \inf\{R_A(g) : g\ \text{measurable}\} + O(\epsilon).$$

Once again, it may be concerning that we require knowledge of the complexity of the data distribution for Corollary 4.2. However, the following result demonstrates that we are effectively guaranteed to converge to the optimal risk as $n \to \infty$, as long as parameters are set appropriately.

Corollary 4.3. If we set $\rho(n) = n^{-1/6}$, $\eta(n) = n^{1/6}$, $t(n) = n$, and $m(n) = n^{1/2}$, then $R_A(\bar W^{(n)}_{t}) \to \inf\{R_A(g) : g\ \text{measurable}\}$ almost surely as $n \to \infty$.

4.3 PROOF SKETCH OF THEOREM 4.1

The proof has two main components: a generalization bound and an optimization bound. We describe both in further detail below.

4.3.1 GENERALIZATION

Define $\bar\tau := \big(2\tau + \sqrt{32\,\tau\ln(n/\delta)}\,\big)^{1/2}$. We prove a new near-initialization generalization bound.

Lemma 4.4. If $B \geq 1$ and $m \geq \ln(emd)$, then with probability at least 1 − 5δ,
$$R^{(0)}_A(V) - \hat R^{(0)}_A(V) \leq \frac{2\rho B}{\sqrt n} + \frac{2\rho B\,\bar\tau\,\sqrt m}{\sqrt n} + \frac{77\,\rho B d\,\ln^{3/2}(4em^2d^3/\delta)}{\sqrt n}.$$

The key term to focus on is the middle term. Note that $\bar\tau$ is a quantity that grows with the perturbation radius τ, and importantly is 0 when τ is 0. When we are in the non-adversarial setting (τ = 0), the middle term is dropped and we recover a Rademacher bound qualitatively similar to Lemma A.8 in (Ji et al., 2021). In the general setting when τ is a constant, so is $\bar\tau$, resulting in an additional dependence on the width of the network.

Lemma 4.4 will easily follow from the following Rademacher complexity bound.

Lemma 4.5. Define $\mathcal V = \{V : \|V - W_0\| \leq B\}$, and let $\mathcal F = \big\{x_k \mapsto \min_{x_k' \in P(x_k)} y_k \langle \nabla f(x_k'; W_0), V \rangle : V \in \mathcal V\big\}$. Then with probability at least 1 − 3δ,
$$\mathrm{Rad}(\mathcal F) \leq \frac{\rho B}{\sqrt n} + \frac{\rho B\,\bar\tau\,\big(1 + \sqrt{\cdots}\,\big)}{\sqrt n}.$$

In the setting of linear predictors with normed ball perturbations, an exact characterization of the perturbations can be obtained, leading to some of the Rademacher bounds in Yin et al. (2018); Khim & Loh (2019); Awasthi et al. (2020). In our setting, it is unclear how to get a similar exact characterization. Instead, we prove Lemma 4.5 by decomposing the adversarially perturbed network into its nonperturbed term and the value added by the perturbation. The nonperturbed term can then be handled by a prior Rademacher bound for standard networks. The difficult part is in bounding the complexity added by the perturbation. A naive argument would worst-case the adversarial perturbations, resulting in a perturbation term that scales linearly with m and does not decrease with n. Obtaining the Rademacher bound that appears here requires a better understanding of the adversarial perturbations.

In comparison to simple linear models, we use a more sophisticated approach, utilizing our understanding of linearized models. Because we are using the features of the initial network, for a particular perturbation at a particular point, the same features are used across all networks. As the parameter distance between the networks is small, the same perturbation achieves similar effects. We also control the change in features caused by the perturbations, which is given by Lemma A.9. Having bounded the effect of the perturbation term, we then apply a covering argument over the parameter space to get the Rademacher bound.

4.3.2 OPTIMIZATION

Our optimization bound is as follows.

Lemma 4.6. Let $R_{gd} := 2R_Z + \eta\rho$, with $\eta\rho^2 < 2$. Then with probability at least 1 − δ,
$$\hat R_A(\bar W_t) \leq \frac{2}{2 - \eta\rho^2}\,\hat R^{(0)}_A(Z) + \frac{1}{t}\cdot\frac{R_Z^2}{2\eta - \eta^2\rho^2} + \frac{52\,\rho R_{gd}^{4/3}\, d^{1/4}\,\ln(ed^2m^2/\delta)^{1/4}}{m^{1/6}} + \frac{178\,\rho R_{gd}\, d^{1/3}\,\ln(ed^3m^2/\delta)^{1/3}\,\sqrt{\ln(1/\delta)}}{\sqrt{d}\, m^{5/6}}.$$

The main difference between the adversarial case here and the non-adversarial case in Ji et al. (2021) is in appropriately bounding $\|\nabla \hat R_A(W)\|$, which is Lemma A.10. In order to do so, we utilize a relation between adversarial and non-adversarial losses. The rest of the adversarial optimization proofs follow similarly to the non-adversarial case, although simplified because we assume that $R_Z$ is known.

5 DISCUSSION AND OPEN PROBLEMS

This paper leaves open many potential avenues for future work, several of which we highlight below.

Early stopping. Early stopping played a key technical role in our proofs, allowing us to take advantage of properties that hold in a near-initialization regime, as well as to achieve a good generalization bound. However, is early stopping necessary? The necessity of early stopping is suggested by the phenomenon of robust overfitting (Rice et al., 2020), in which the adversarial training error continues to decrease, but the adversarial test error dramatically increases after a certain point. Early stopping is one method that allows adversarial training to avoid this phase, and achieve better adversarial test error as a result.
However, it should be noted that early stopping is necessary in this work to stay in the near-initialization regime, which likely occurs much earlier than the robust overfitting phase.

Underparameterization. Our generalization bound increases with the width. As a result, to get our generalization bound to converge to 0 we required the width to be sublinear in the number of training points. Is it possible to remove this dependence on width? Recent works suggest that some sort of dependence on the width may be necessary. In the setting of linear regression, overparameterization has been shown to hurt adversarial generalization for specific types of networks (Hassani & Javanmard, 2022; Javanmard et al., 2020; Donhauser et al., 2021). Note that in this simple data setting a simple network can fit the data, so underparameterization does not hurt the approximation capabilities of the network. However, in a more complicated data setting, Madry et al. (2019) notes that increased width helps with the adversarial test error. One explanation is that networks need to be sufficiently large to approximate an optimal robust predictor, which may be more complicated than optimal nonrobust predictors. Indeed, they note that smaller networks, under adversarial training, would converge to the trivial classifier of predicting a single class. Interestingly, they also note that width helps more when the adversarial perturbation radius is small. This observation is reflected in our generalization bound, since the dependence on width is tied to the perturbation term. If the perturbation radius is small, then a large width is less harmful to our generalization bound. We propose further empirical exploration into how the width affects generalization and approximation error, and how other factors influence this relationship. This includes investigating whether a larger perturbation radius causes larger widths to be more harmful to the generalization bound, the effect of early stopping on these relationships, and how the approximation error for a given width changes with the perturbation radius.

Using weaker attacks. Within our proof we used the assumption that we had access to an optimal adversarial attack. In turn, we got a guarantee against optimal adversarial attacks. However, in practice we do not know of a computationally efficient algorithm for generating optimal attacks. Could we prove a similar theorem, getting a guarantee against optimal attacks, using a weaker attack like PGD in our training algorithm? If this were the case, then we would be using a weaker attack to successfully defend against a stronger attack. Perhaps this is too much to ask for; could we get a guarantee against PGD attacks instead?

Transferability to other settings. We have only considered the binary classification setting here. A natural extension would be to consider the same questions in the multiclass setting, where there are three (or more) possible labels. In addition, our results in Section 4 only hold for shallow ReLU networks in a near-initialization regime. To what extent do the relationships observed here transfer to other settings, such as training state-of-the-art networks? For instance, does excessive overparameterization beyond the need to capture the complexity of the data hurt adversarial robustness in practice?

ACKNOWLEDGMENTS

The authors are grateful for support from the NSF under grant IIS-1750051.

REFERENCES

Charalambos D. Aliprantis and Kim C. Border.
Infinite Dimensional Analysis: A Hitchhiker's Guide. Springer, 3rd edition, 2006.
Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1st edition, 2009.
Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. CoRR, abs/1901.08584, 2019.
Anish Athalye, Nicholas Carlini, and David A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. CoRR, abs/1802.00420, 2018.
Pranjal Awasthi, Natalie Frank, and Mehryar Mohri. Adversarial learning guarantees for linear hypotheses and neural networks. CoRR, abs/2004.13617, 2020.
Pranjal Awasthi, Natalie Frank, Anqi Mao, Mehryar Mohri, and Yutao Zhong. Calibration and consistency of adversarial surrogate losses. CoRR, abs/2104.09658, 2021a.
Pranjal Awasthi, Natalie S. Frank, and Mehryar Mohri. On the existence of the adversarial bayes classifier (extended version). CoRR, abs/2112.01694, 2021b.
Pranjal Awasthi, Anqi Mao, Mehryar Mohri, and Yutao Zhong. A finer calibration analysis for adversarial robustness. CoRR, abs/2105.01550, 2021c.
Pranjal Awasthi, Anqi Mao, Mehryar Mohri, and Yutao Zhong. H-consistency bounds for surrogate loss minimizers. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 1117–1174. PMLR, 17–23 Jul 2022a.
Pranjal Awasthi, Anqi Mao, Mehryar Mohri, and Yutao Zhong. Multi-class H-consistency bounds. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022b.
Emilio Rafael Balda, Arash Behboodi, Niklas Koep, and Rudolf Mathar. Adversarial risk bounds for neural networks through sparsity based compression. CoRR, abs/1906.00698, 2019.
Han Bao, Clay Scott, and Masashi Sugiyama. Calibrated surrogate losses for adversarially robust classification. In Jacob Abernethy and Shivani Agarwal (eds.), Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pp. 408–451. PMLR, 09–12 Jul 2020.
A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993. doi: 10.1109/18.256500.
Peter L. Bartlett, Michael I. Jordan, and Jon D. Mcauliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006. ISSN 01621459.
Arjun Nitin Bhagoji, Daniel Cullina, and Prateek Mittal. Lower bounds on adversarial robustness from optimal transport. CoRR, abs/1909.12272, 2019.
Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks, 2019.
Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. CoRR, abs/1608.04644, 2016.
Zixiang Chen, Yuan Cao, Difan Zou, and Quanquan Gu. How much over-parameterization is sufficient to learn deep relu networks? CoRR, abs/1911.12360, 2019.
Guneet S. Dhillon, Kamyar Azizzadenesheli, Zachary C. Lipton, Jeremy Bernstein, Jean Kossaifi, Aran Khanna, and Anima Anandkumar. Stochastic activation pruning for robust adversarial defense. CoRR, abs/1803.01442, 2018.
Konstantin Donhauser, Alexandru Tifrea, Michael Aerni, Reinhard Heckel, and Fanny Yang. Interpolation can hurt robust generalization even when there is no noise, 2021.
Simon S. Du, Xiyu Zhai, Barnabás Póczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. CoRR, abs/1810.02054, 2018.
Gerald B. Folland. Real Analysis: Modern Techniques and Their Applications. Wiley Interscience, 2nd edition, 1999.
Ruiqi Gao, Tianle Cai, Haochuan Li, Liwei Wang, Cho-Jui Hsieh, and Jason D. Lee. Convergence of adversarial training in overparametrized networks. CoRR, abs/1906.07916, 2019.
Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples, 2015.
Chuan Guo, Mayank Rana, Moustapha Cissé, and Laurens van der Maaten. Countering adversarial images using input transformations. CoRR, abs/1711.00117, 2017.
Hamed Hassani and Adel Javanmard. The curse of overparametrization in adversarial training: Precise analysis of robust generalization for random features regression, 2022.
José Hernández-Orallo, Peter Flach, and Cèsar Ferri. A unified view of performance metrics: Translating threshold choice into expected classification loss. Journal of Machine Learning Research, 13(91):2813–2869, 2012.
Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
Adel Javanmard, Mahdi Soltanolkotabi, and Hamed Hassani. Precise tradeoffs in adversarial training for linear regression. CoRR, abs/2002.10477, 2020.
Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300v2, 2018.
Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks. CoRR, abs/1909.12292, 2019.
Ziwei Ji, Matus Telgarsky, and Ruicheng Xian. Neural tangent kernels, transportation mappings, and universal approximation. CoRR, abs/1910.06956, 2019.
Ziwei Ji, Justin D. Li, and Matus Telgarsky. Early-stopped neural networks are consistent. CoRR, abs/2106.05932, 2021.
Harini Kannan, Alexey Kurakin, and Ian J. Goodfellow. Adversarial logit pairing. CoRR, abs/1803.06373, 2018.
Justin Khim and Po-Ling Loh. Adversarial risk bounds via function transformation, 2019.
Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016.
Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. CoRR, abs/1808.01204, 2018.
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks, 2019.
Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: A regularization method for supervised and semi-supervised learning, 2018.
Atsushi Nitanda, Geoffrey Chinot, and Taiji Suzuki. Gradient descent can learn less overparameterized two-layer neural networks on classification problems, 2020.
Muni Sreenivas Pydi and Varun Jog. Adversarial risk via optimal transport and optimal couplings, 2020.
Leslie Rice, Eric Wong, and J. Zico Kolter. Overfitting in adversarially robust deep learning.
Co RR, abs/2002.11569, 2020. Leonard J. Savage. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783 801, 1971. ISSN 01621459. Mark J. Schervish. A general method for comparing probability assessors. The Annals of Statistics, 17(4):1856 1879, 1989. ISSN 00905364. Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In Yoshua Bengio and Yann Le Cun (eds.), 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014. Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy, 2019. Zhuozhuo Tu, Jingwei Zhang, and Dacheng Tao. Theoretical analysis of adversarial learning: A minimax approach, 2019. Yisen Wang, Xingjun Ma, James Bailey, Jinfeng Yi, Bowen Zhou, and Quanquan Gu. On the convergence and robustness of adversarial training. Co RR, abs/2112.08304, 2021. Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, and Alan L. Yuille. Mitigating adversarial effects through randomization. Co RR, abs/1711.01991, 2017. Greg Yang, Tony Duan, J. Edward Hu, Hadi Salman, Ilya Razenshteyn, and Jerry Li. Randomized smoothing of all shapes and sizes, 2020. Dong Yin, Kannan Ramchandran, and Peter L. Bartlett. Rademacher complexity for adversarially robust generalization. Co RR, abs/1810.11914, 2018. Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P. Xing, Laurent El Ghaoui, and Michael I. Jordan. Theoretically principled trade-off between robustness and accuracy. Co RR, abs/1901.08573, 2019. Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56 85, 2004. ISSN 00905364. Yi Zhang, Orestis Plevrakis, Simon S. Du, Xingguo Li, Zhao Song, and Sanjeev Arora. Overparameterized adversarial training: An analysis overcoming the curse of dimensionality. Co RR, abs/2002.06668, 2020. A.1 OPTIMAL ADVERSARIAL PREDICTOR AND APPROXIMATION PROOFS In this section we prove various properties of optimal adversarial predictors. First we need to handle some basic limits. Published as a conference paper at ICLR 2023 Recall the definition of Rt AZ(f): Rt AZ(f) := Z 1[y = +1]1[f (x) < 0]( ℓ c(t)) + 1[y = 1]1[f +(x) 0]( ℓ c( t)) dµ(x, y). The following lemmas are various forms of continuity for Rt AZ(f), making use of the continuity of ℓ c. Lemma A.1. For any infinite family of predictors gs : X { 1, +1} indexed by s R, and any t R, lim s t|Rt AZ(gs) Rs AZ(gs)| = 0. Proof. By the Dominated Convergence Theorem (Folland, 1999, Theorem 2.24), lim s t|Rt AZ(gs) Rs AZ(gs)| Z 1[y = +1]1[g s (x) < 0] ( ℓ c(t)) ( ℓ c(s)) + 1[y = 1]1[g+ s (x) 0] ( ℓ c( t)) ( ℓ c( s)) dµ(x, y) Z 1[y = +1] ( ℓ c(t)) ( ℓ c(s)) + 1[y = 1] ( ℓ c( t)) ( ℓ c( s)) dµ(x, y) Z 1[y = +1] lim s t ( ℓ c(t)) ( ℓ c(s)) + 1[y = 1] lim s t ( ℓ c( t)) ( ℓ c( s)) dµ(x, y) The following lemma allows us to switch the order of the limit and adversarial risk. Lemma A.2. For any predictors gs : X { 1, +1} indexed by s R, and any t R, lim s t Rs AZ(gs) = Rt AZ when lims t gs exists. In particular, when gs = f for all s R, then lims t Rs AZ(f) = Rt AZ (f). Proof. 
By the Dominated Convergence Theorem (Folland, 1999, Theorem 2.24), lim s t Rs AZ(gs) Z 1[y = +1]1[g s (x) < 0]( ℓ c(s)) + 1[y = 1]1[g+ s (x) 0]( ℓ c( s)) dµ(x, y) = Z lim s t 1[y = +1]1[g s (x) < 0]( ℓ c(s)) + 1[y = 1]1[g+ s (x) 0]( ℓ c( s)) dµ(x, y) = Z 1[y = +1]1 lim s t g s (x) < 0 ( ℓ c(t)) + 1[y = 1]1 lim s t g+ s (x) 0 ( ℓ c( t)) dµ(x, y) Published as a conference paper at ICLR 2023 For the rest of this section, let fr : X { 1, +1} be optimal adversarial zero-one predictors when the 1 labels are given weight ( ℓ c( r)) and the +1 labels are given weight ( ℓ c(r)) (fr minimizes Rr AZ(fr)), for all r R. These predictors exist by Theorem 1 of Bhagoji et al. (2019), although they may not be unique. The following lemma states that Rr AZ(fr) is continuous as a function of r. Lemma A.3. For any t R, lim s t Rs AZ(fs) = Rt AZ(ft). Proof. By Lemma A.1, lim sup s t Rs AZ(fs) Rt AZ(ft) = lim sup s t Rs AZ(fs) Rs AZ(ft) 0, lim inf s t Rs AZ(fs) Rt AZ(ft) = lim inf s t Rt AZ(fs) Rt AZ(ft) 0. Together, we get lim s t Rs AZ(fs) = Rt AZ(ft). The following lemma gives some structure to optimal adversarial zero-one predictors. Lemma A.4. For any s t, Rs AZ(max(ft, fs)) = Rs AZ(fs) and Rt AZ(min(ft, fs)) = Rt AZ(fs). Proof. Let A := {x : fs(x) < ft(x)}, and define A+ s = {(x, +1) : P(x) A = , fs(P(x) \ A) {+1}}, A s = {(x, 1) : P(x) A = , fs(P(x) \ A) { 1}}, A+ t = {(x, +1) : P(x) A = , ft(P(x) \ A) {+1}}, A t = {(x, 1) : P(x) A = , ft(P(x) \ A) { 1}}. Let µs(x, y) = 1[y = +1]( ℓ c(s))µ(x, +1) + 1[y = 1]( ℓ c( s))µ(x, 1) be the associated measures when the +1 labels have weight ( ℓ c(s)) and the 1 labels have weight ( ℓ c( s)), and define µt similarly. Then µs(A+ s ) = ( ℓ c(s))µ(A+ s ) ( ℓ c(t))µ(A+ s ) ( ℓ c(t))µ(A+ t ) = µt(A+ t ), µs(A s ) = ( ℓ c( s))µ(A s ) ( ℓ c( t))µ(A s ) ( ℓ c( t))µ(A t ) = µt(A t ). As a result, µs(A+ s ) µs(A s ) µt(A+ t ) µt(A t ). The reweighted adversarial zero-one loss can be written in terms of the reweighted measures, as follows. Rt AZ(f) = Z 1[y = +1]1[f (x) < 0]( ℓ c(t)) + 1[y = 1]1[f +(x) 0]( ℓ c( t)) dµ(x, y) = Z 1[y = +1]1[f (x) < 0] + 1[y = 1]1[f +(x) 0] dµt(x, y). The following lower bound can then be computed. Rs AZ(max(ft, fs)) Rs AZ(fs) = Z 1[y = +1] 1[max(ft, fs) (x) < 0] 1[f (x) < 0] + 1[y = 1] 1[max(ft, fs)+(x) 0] 1[f +(x) 0] dµs(x, y) = µs(A+ s ) + µs(A s ) 0. Similarly, Rt AZ(min(ft, fs)) Rt AZ(fs) = Z 1[y = +1] 1[min(ft, fs) (x) < 0] 1[f (x) < 0] + 1[y = 1] 1[min(ft, fs)+(x) 0] 1[f +(x) 0] dµt(x, y) = µt(A+ t ) µt(A t ) 0. Published as a conference paper at ICLR 2023 As 0 µs(A+ s ) µs(A s ) µt(A+ t ) µt(A t ) 0 we must have equality everywhere, giving the desired result. Using our understanding of optimal adversarial zero-one predictors, we can construct a predictor that is optimal at all thresholds. Lemma A.5. There exists f : X R such that Rt AZ(f t) is the minimum possible value for all t R. Proof. Define f(x) = sup{r Q : fr(x) = +1}. To prove Rt AZ(f t) is minimal for all t R, we first do so for all t Q, and then for all t R \ Q. For any t Q, by definition sgn(f t) ft. Let q1, q2, . . . be an enumeration of Q (t, ). Let g0 = ft, and recursively define gi+1 = max(gi, fqi) for all i 0. Note that limi gi = sgn(f t). Inductively applying Lemma A.4 results in Rt AZ(gi) = Rt AZ(ft) for all i 0. By the Dominated Convergence Theorem (Folland, 1999, Theorem 2.24), Rt AZ(f t) = Rt AZ(sgn(f t)) = lim i Rt AZ(gi) = lim i Rt AZ(ft) = Rt AZ(ft). For any t R \ Q, note that sgn(f t) = lim s t s Q sgn(f s). 
By the Dominated Convergence Theorem (Folland, 1999, Theorem 2.24), Lemma A.2, and Lemma A.3, Rt AZ(f t) = Rt AZ(sgn(f t)) = Rt AZ(lim s t s Q sgn(f s)) = lim s t s Q Rs AZ(sgn(f s)) = lim s t s Q Rs AZ(fs) = Rt AZ(ft). For any predictor, its adversarial convex loss can be written as a weighted sum of adversarial zero-one losses across thresholds. This then implies the function defined in Lemma A.5 has optimal adversarial convex loss. Proof of Lemma 3.1. Recall the definition of RA(f): RA(f) = Z 1[y = +1]ℓc(f (x)) + 1[y = 1]ℓc( f +(x)) dµ(x, y). Published as a conference paper at ICLR 2023 Applying the fundamental theorem of calculus and rearranging, RA(f) = Z " 1[y = +1] Z f (x) ( ℓ c(t)) dt + 1[y = 1] Z f +(x) ( ℓ c(t)) dt 1[y = +1] Z 1[f (x) < t]( ℓ c(t)) dt + 1[y = 1] Z 1[f +(x) t]( ℓ c( t)) dt 1[y = +1]1[f (x) < t]( ℓ c(t)) + 1[y = 1]1[f +(x) t]( ℓ c( t)) dt As the integrand is nonnegative, Tonelli s theorem (Folland, 1999, Theorem 2.37) gives " Z 1[y = +1]1[f (x) < t]( ℓ c(t)) + 1[y = 1]1[f +(x) t]( ℓ c( t)) dµ(x, y) Rt AZ(f t) dt. While optimal adversarial zero-one predictors were used to construct a predictor with optimal adversarial convex loss in Lemma A.5, the reverse is also possible: using a predictor with optimal adversarial convex loss to construct optimal adversarial zero-one predictors. Proof of Lemma 3.2. Let f be the optimal predictor defined in Lemma A.5, which gives existence. Suppose, towards contradiction, that there exists g : X R with minimal RA(g) and t R such that Rt AZ(g t) > Rt AZ(f t). By Lemma A.2 Rt AZ(g t) is left continuous as a function of t, and by Lemma A.3 Rt AZ(f t) is continuous as a function of t. As a result, there exists δ, ϵ > 0 such that for all s [t δ, t], Rs AZ(g s) Rs AZ(f s) ϵ. RA(g) RA(f) = Z Rs AZ(g s) Rs AZ(f s) ds t δ Rs AZ(g s) Rs AZ(f s) ds t δ ϵ ds = δϵ > 0, contradicting the assumption that RA(g) was minimal. So we must have Rt AZ(g t) minimal for all t R. The next two lemmas bound the zero-one loss at different thresholds in terms of the zero-one loss at threshold 0. Lemma A.6. Let f be the optimal predictor defined in Lemma A.5. Then Rt AZ(f t) ( ℓ c( |t|))RAZ(f). Published as a conference paper at ICLR 2023 = Z 1[y = +1]1[f (x) < t]( ℓ c(t)) + 1[y = 1]1[f +(x) t]( ℓ c( t)) dµ(x, y) Z 1[y = +1]1[f (x) < 0]( ℓ c(t)) + 1[y = 1]1[f +(x) 0]( ℓ c( t)) dµ(x, y) Z 1[y = +1]1[f (x) < 0]( ℓ c( |t|)) + 1[y = 1]1[f +(x) 0]( ℓ c( |t|)) dµ(x, y) = ( ℓ c( |t|))RAZ(f). In contrast to the previous lemma, the following bound works for any predictor. Lemma A.7. Let g be any predictor. Then Rt AZ(g t) ( ℓ c(|t|))RAZ(g t). = Z 1[y = +1]1[g (x) < t]( ℓ c(t)) + 1[y = 1]1[g+(x) t]( ℓ c( t)) dµ(x, y) Z 1[y = +1]1[g (x) < t]( ℓ c(|t|)) + 1[y = 1]1[g+(x) t]( ℓ c(|t|)) dµ(x, y) = ( ℓ c(|t|))RAZ(g t). Now we can relate proximity to the optimal adversarial zero-one loss and proximity to the optimal adversarial convex loss. Proof of Theorem 3.3. Let f be the optimal predictor defined in Lemma A.5. Then RA(g) inf h meas. RA(h) = RA(g) RA(f) Ru AZ(g u) Ru AZ(f u) du max(0, ( ℓ c(|u|))RAZ(g u) ( ℓ c( |u|))RAZ(f)) du max(0, ( ℓ c(|u|)) inf t RAZ(g t) ( ℓ c( |u|))RAZ(f)) du 0 max(0, ( ℓ c(u)) inf t RAZ(g t) ( ℓ c( u))RAZ(f)) du. As ( ℓ c(u)) inft RAZ(g t) ( ℓ c( u))RAZ(f) is continuous and nonincreasing as a function of u, as well as nonnegative when u = 0, there exists some r [0, ] such that Z 0 max(0, ( ℓ c(u)) inf t RAZ(g t) ( ℓ c( u))RAZ(f)) du 0 ( ℓ c(u)) inf t RAZ(g t) ( ℓ c( u))RAZ(f) du. 
Published as a conference paper at ICLR 2023 Letting p := inft RAZ(g t) RAZ(f)+inft RAZ(g t), we find that this occurs when ( ℓ c(r)) inf t RAZ(g t) ( ℓ c( r))RAZ(f) = 0 ℓ c(r) ℓ c( r) = RAZ(f) inft RAZ(g t) ℓ c(r) + ℓ c( r) ℓ c( r) = RAZ(f) + inft RAZ(g t) inft RAZ(g t) ℓ c( r) ℓ c(r) + ℓ c( r) = inft RAZ(g t) RAZ(f) + inft RAZ(g t) ℓ c( r) ℓ c(r) + ℓ c( r) = p ℓ c( r) = ℓ c(r) + ℓ c( r) p pℓ c(r) (1 p)ℓ c( r) = 0. Since ℓc is convex, this implies r minimizes pℓc(r) + (1 p)ℓc( r). The integral can then be computed exactly as follows. RA(g) inf h meas. RA(h) 0 ( ℓ c(u)) inf t RAZ(g t) ( ℓ c( u))RAZ(f) du = 2 ℓc(0) ℓc(r) inf t RAZ(g t) ℓc( r) ℓc(0) RAZ(f) ℓc(0) RAZ(f) + inf t RAZ(g t) ℓc(r) inf t RAZ(g t) ℓc( r)RAZ(f) ℓc(0) RAZ(f) + inf t RAZ(g t) ℓc(r) inft RAZ(g t) RAZ(f) + inft RAZ(g t) + ℓc( r) RAZ(f) RAZ(f) + inft RAZ(g t) RAZ(f) + inf t RAZ(g t) # ℓc(0) RAZ(f) + inf t RAZ(g t) ℓc(r)p + ℓc( r)(1 p) RAZ(f) + inf t RAZ(g t) # = 2 RAZ(f) + inf t RAZ(g t) h ℓc(0) pℓc(r) + (1 p)ℓc( r) i = 2 RAZ(f) + inf t RAZ(g t) Gℓc(p). Using our assumption of a lower bound on Gℓc(p) results in RA(g) inf h meas. RA(h) 2 RAZ(f) + inf t RAZ(g t) |2p 1| RAZ(f) + inf t RAZ(g t) inft RAZ(g t) RAZ(f) RAZ(f) + inft RAZ(g t) inft RAZ(g t) RAZ(f) s RAZ(f) + inft RAZ(g t) s 1 . Published as a conference paper at ICLR 2023 Finally, since RAZ(f) 1/2 and inft RAZ(g t) 1, RA(g) inf h meas. RA(h) 2s inf t RAZ(g t) RAZ(f) s inf t RAZ(g t) inf h meas. RAZ(h) s . Rearranging then gives the desired inequality. The following lemmas states that continuous predictors can get arbitrarily close to the optimal adversarial zero-one risk, even if we require the predictors to output the exact label. Lemma A.8. Define REAZ(f) = Z 1[yf(P(x)) = {+1}] dµ(x, y), the adversarial zero-one risk when we require f to exactly output the right label over the entire perturbation set. Then for any measurable f : X { 1, +1} and any ϵ > 0, there exists continuous g : X [ 1, +1] such that µ({(x, y) : yg(P(x)) = {+1} and yf(P(x)) = {+1}}) < ϵ. In particular, this implies infg cts. REAZ(g) = infh meas. REAZ(h) = infg cts. RAZ(g) = infh meas. RAZ(h). A = {x : f +(x) = 1}, P(A) = x AP(x), B = {x : f (x) = +1}, P(B) = x BP(x). By the inner regularity of µx (Folland, 1999, Theorem 7.8), there exist compact sets K A, L B such that µx(A) µx(K) < ϵ/2, µx(B) µx(L) < ϵ/2. As P is upper hemicontinuous and K and L are compact, both P(K) and P(L) are also compact (Aliprantis & Border, 2006, Lemma 17.8). Note that they are also disjoint as P(K) P(L) P(A) P(B) = . By Urysohn s Lemma (Folland, 1999, Lemma 4.32), there exists a continuous function g : X [0, 1] such that gt(x) = 0 for all x P(Kt) and gt(x) = 1 for all x P(Lt). The continuous function 2g 1 : X [ 1, 1] then satisfies µ({(x, y) : y(2g 1)(P(x)) = {+1} and yf(P(x)) = {+1}}) µx(A \ K) + µx(B \ L) < ϵ. Note that we also have REAZ(2g 1) REAZ(f) + µx(A \ K) µx(B \ L) < REAZ(f) + ϵ. As f and ϵ > 0 were arbitrary, infg cts. REAZ(g) infh meas. REAZ(h). To get the implication, note that infg cts. RAZ(g) infg cts. REAZ(g) infh meas. REAZ(h) = infh meas. RAZ(h) infg cts. RAZ(g), so we must have equality everywhere. While the optimal adversarial predictor may be discontinuous, continuous predictors can get arbitrarily close to the optimal adversarial convex risk. Proof of Lemma 3.4. We have infg cts. RA(g) infh meas. RA(h), so it suffices to show infg cts. RA(g) infh meas. RA(h). Let f be the optimal predictor defined in Lemma A.5. Then RA(f) = infh meas. RA(h), so we want to show infg cts. RA(g) RA(f). 
Published as a conference paper at ICLR 2023 Let ϵ > 0. Choose M > 0 large enough so that RA(min(max(f, M), M)) < RA(f) + ϵ/3, and let f = min(max(f, M), M). As ℓc is continuous, there exists a finite-sized partition P = {p0, p1, p2, . . . , pr} with p0 = M and pr = M such that ℓc(pi) ℓc(pi 1) ϵ/3 and ℓc( pi) ℓc( pi 1) ϵ/3 for all 1 i r. By Lemma A.8, for every pi there exists continuous gpi : X [ 1, +1] such that µ({(x, y) : ygpi(P(x)) = {+1} and ysgn(f pi)(P(x)) = {+1}}) < ϵ 3r(ℓc( M) ℓc(M)). Consider the continuous function gϵ = M + Pr i=1(pi pi 1) gpi+1 2 , which will be shown to have adversarial risk within ϵ of the optimal. Define Di := {(x, y) : ygpi(P(x)) = {+1} and ysgn(f pi)(P(x)) = {+1}} i, E := {(x, y) : ℓA(x, y, gϵ) > ℓA(x, y, f) + ϵ/3}. We will now show that E r i=1Di. Let (x, y) r i=1Di. Then ygpi(P(x)) ysgn( f pi)(P(x)) for all 1 i r. Let i = arg maxi{sgn( f pi) = +1}. Then ygϵ(P(x)) min{ypi , ypmax{i +1,r}} and yf(P(x)) max{ypi , ypmax{i +1,r}}, so ℓA(x, y, gϵ) max{ℓc(ypi ), ℓc(ypmax{i +1,r})} min{ℓc(ypi ), ℓc(ypmax{i +1,r})} + ϵ/3 ℓA(x, y, f) + ϵ/3, which implies (x, y) E. Consequently, i=1 µ(Di) < ϵ 3r ℓc( M) ℓc(M) = ϵ 3 ℓc( M) ℓc(M) . Combining the bounds results in inf g cts. RA(g) RA(gϵ) RA( f) + ϵ/3 + µ(E) ℓc( M) ℓc(M) < RA(f) + ϵ/3 + ϵ/3 + ϵ/3 = RA(f) + ϵ. As this holds for all ϵ > 0, we have infg cts. RA(g) RA(f), completing the proof. A.2 GENERALIZATION PROOFS The following lemma controls the difference in features for nearby points, which will be useful when proving our Rademacher complexity bound. Lemma A.9. With probability at least 1 3nδ, 1 ρ max δk τ f(xk; W0) f(xk + δk; W0) 2τ + 32τ ln(1/δ) Proof. As in Lemma A.2 of Ji et al. (2021), with probability at least 1 3nδ, for any xk = 0, X j 1[|w T 0,jxk| τk xk ] mτk + p 8mτk ln(1/δ), and henceforth assume the failure event does not hold. With τk = τ xk we have, for any xk = 0, j 1[|w T 0,jxk| τ] mτ 8mτ ln(1/δ) Published as a conference paper at ICLR 2023 As such, define the set Sk := n j : δk τ, sgn(w T 0,jxk) = sgn(w T 0,j(xk + δk)) o , where the preceding concentration inequality implies |Sk| mτ xk + q 8mτ ln(1/δ) xk for all xk = 0. Then for any xk (including xk = 0) and any δk τ, 1 ρ2 f(xk; W0) f(xk + δk; W0) 2 xk1[w T 0,jxk 0] (xk + δk)1[w T 0,j(xk + δk) 0] 2 xk1[w T 0,jxk 0] xk1[w T 0,j(xk + δk) 0] 2 xk1[w T 0,j(xk + δk) 0] (xk + δk)1[w T 0,j(xk + δk) 0] 2 . As Sk is exactly the set of indices j over which 1[w T 0,jxk 0] could possibly differ from 1[w T 0,j(xk + δk) 0], we can restrict the sum in the first term to these indices, resulting in 1 ρ2 f(xk; W0) f(xk + δk; W0) 2 j Sk xk 2 1[w T 0,jxk 0] 1[w T 0,j(xk + δk) 0] 2 δk1[w T 0,j(xk + δk) 0] 2 32τ ln(1/δ) where we used xk 1 in the last step. Taking the square root of both sides gives us f(xk; W0) f(xk + δk; W0) 2τ + 32τ ln(1/δ) completing the proof. We will now prove our Rademacher complexity bound. Published as a conference paper at ICLR 2023 Proof of Lemma 4.5. We have n Rad(F) = Eϵ sup V V k=1 ϵk min x k P(xk) yk f(x k; W0), V = Eϵ sup V V min x k P(xk) yk f(x k; W0), V yk f(xk; W0), V ! + ϵkyk f(xk; W0), V ! min x k P(xk) yk f(x k; W0) f(xk; W0), V ! + Eϵ sup U V k=1 ϵkyk f(xk; W0), U min x k P(xk) yk f(x k; W0) f(xk; W0), V ! where in the last step we use the Rademacher bound provided in the proof of Lemma A.8 from Ji et al. (2021). We now focus on bounding Eϵ sup V V min x k P(xk) yk f(x k; W0) f(xk; W0), V ! 
We will now prove our Rademacher complexity bound.

Proof of Lemma 4.5. We have
$$\begin{aligned}
n\,\mathrm{Rad}(\mathcal{F})
&= \mathbb{E}_\epsilon \sup_{V\in\mathcal{V}} \sum_{k=1}^n \epsilon_k \min_{x_k'\in P(x_k)} y_k\big\langle \nabla f(x_k';W_0),\, V\big\rangle \\
&= \mathbb{E}_\epsilon \sup_{V\in\mathcal{V}} \sum_{k=1}^n \epsilon_k\bigg( \min_{x_k'\in P(x_k)} \Big( y_k\big\langle \nabla f(x_k';W_0), V\big\rangle - y_k\big\langle \nabla f(x_k;W_0), V\big\rangle \Big) + y_k\big\langle \nabla f(x_k;W_0), V\big\rangle \bigg) \\
&\leq \mathbb{E}_\epsilon \sup_{V\in\mathcal{V}} \sum_{k=1}^n \epsilon_k \min_{x_k'\in P(x_k)} y_k\big\langle \nabla f(x_k';W_0) - \nabla f(x_k;W_0),\, V\big\rangle
+ \mathbb{E}_\epsilon \sup_{U\in\mathcal{V}} \sum_{k=1}^n \epsilon_k y_k\big\langle \nabla f(x_k;W_0),\, U\big\rangle \\
&\leq \mathbb{E}_\epsilon \sup_{V\in\mathcal{V}} \sum_{k=1}^n \epsilon_k \min_{x_k'\in P(x_k)} y_k\big\langle \nabla f(x_k';W_0) - \nabla f(x_k;W_0),\, V\big\rangle + \rho B\sqrt{n},
\end{aligned}$$
where in the last step we use the Rademacher bound provided in the proof of Lemma A.8 from Ji et al. (2021). We now focus on bounding
$$\mathbb{E}_\epsilon \sup_{V\in\mathcal{V}} \sum_{k=1}^n \epsilon_k \min_{x_k'\in P(x_k)} y_k\big\langle \nabla f(x_k';W_0) - \nabla f(x_k;W_0),\, V\big\rangle.$$
For notational simplicity let
$$D_k(V) := \min_{x_k'\in P(x_k)} y_k\big\langle \nabla f(x_k';W_0) - \nabla f(x_k;W_0),\, V\big\rangle,$$
so that the quantity we want to bound can be rewritten as $\mathbb{E}_\epsilon \sup_{V\in\mathcal{V}} \sum_{k=1}^n \epsilon_k D_k(V)$. As Rademacher complexity is invariant under constant shifts, we can subtract the constant $C_k$ from $D_k(V)$, where
$$C_k := \frac{\sup_{V\in\mathcal{V}} D_k(V) + \inf_{V\in\mathcal{V}} D_k(V)}{2}.$$
With this constant shift, the expression becomes
$$\mathbb{E}_\epsilon \sup_{V\in\mathcal{V}} \sum_{k=1}^n \epsilon_k \big(D_k(V) - C_k\big) = n\,\mathrm{Rad}(\mathcal{G}),
\qquad\text{where}\qquad
\mathcal{G} = \Big\{ x_k \mapsto \min_{x_k'\in P(x_k)} y_k\big\langle \nabla f(x_k';W_0) - \nabla f(x_k;W_0), V\big\rangle - C_k : V\in\mathcal{V} \Big\}.$$
We will now bound $n\,\mathrm{Rad}(\mathcal{G})$ using a covering argument in the parameter space $\mathcal{V}$. Instead of directly finding a covering for the ball of radius $B$, we first find a covering for a cube with side length $2B$ containing the ball; projecting the cube onto the ball then yields a proper covering of the ball, as this mapping is non-expansive. To ensure every point of the cube is at most $\epsilon$ away from a point of the cover, we use a grid with scale $2\epsilon/\sqrt m$, which results in $\big(\tfrac{B\sqrt m}{\epsilon}\big)^m$ points in the cover. This cover $C_\epsilon$ has the property that for every $V\in\mathcal{V}$, there is some $U\in C_\epsilon$ such that $\|V-U\|\leq\epsilon$, since every coordinate of $V$ is $\epsilon/\sqrt m$-close to a coordinate in $C_\epsilon$, and
$$\|V-U\| = \sqrt{\sum_{i=1}^m (V_i - U_i)^2} \leq \sqrt{\sum_{i=1}^m (\epsilon/\sqrt m)^2} = \epsilon.$$
Due to the non-expansive projection mapping from the cube to the ball, the $\epsilon$-cover for the cube projects to an $\epsilon$-cover for the ball. As a result, we have an $\epsilon$-cover for the ball of radius $B$ with $\big(\tfrac{B\sqrt m}{\epsilon}\big)^m$ points.

A geometric $\epsilon$-cover yields a $\rho\sqrt\tau\,\epsilon$-cover in the function space, since for any $V$ and $U$ with $\|V-U\|\leq\epsilon$, and any $x_k$,
$$\begin{aligned}
&\Big(\min_{x_V'\in P(x_k)} y_k\big\langle\nabla f(x_V';W_0)-\nabla f(x_k;W_0),V\big\rangle - C_k\Big) - \Big(\min_{x_U'\in P(x_k)} y_k\big\langle\nabla f(x_U';W_0)-\nabla f(x_k;W_0),U\big\rangle - C_k\Big) \\
&\qquad\leq \sup_{x_U'\in P(x_k)} y_k\big\langle\nabla f(x_U';W_0)-\nabla f(x_k;W_0),\,V-U\big\rangle \\
&\qquad\leq \sup_{x_U'\in P(x_k)} \big\|\nabla f(x_U';W_0)-\nabla f(x_k;W_0)\big\|\,\|V-U\| \\
&\qquad\leq \sup_{\|\delta\|\leq\tau} \big\|\nabla f(x_k+\delta;W_0)-\nabla f(x_k;W_0)\big\|\,\|V-U\| \leq \rho\sqrt\tau\,\epsilon,
\end{aligned}$$
where the last step follows with probability at least $1-3\delta$ by Lemma A.9. As a result, we can get an $\epsilon$-cover in the function space with just $\big(\tfrac{\rho B\sqrt\tau\sqrt m}{\epsilon}\big)^m$ points. This bound on the covering number, $N(\mathcal{G},\epsilon,u) \leq \big(\tfrac{\rho B\sqrt\tau\sqrt m}{\epsilon}\big)^m$ in the uniform norm $u$, then implies $N(\mathcal{G},\epsilon,2) \leq N(\mathcal{G},\epsilon/\sqrt n,u)$, which we can use in a standard parameter-based covering argument (Anthony & Bartlett, 2009) to get
$$n\,\mathrm{Rad}(\mathcal{G}) \leq \inf_{\alpha>0}\bigg( \alpha\sqrt n + \sup_{U\in\mathcal{G}}\|U\|_2\,\sqrt{2\ln N(\mathcal{G},\alpha,\|\cdot\|_2)} \bigg).$$
To calculate $\sup_{U\in\mathcal{G}}\|U\|_2$, note that each of the $n$ entries is bounded in absolute value by
$$\sup_{V\in\mathcal{V}} D_k(V) - C_k = \max\Big\{\sup_{V\in\mathcal{V}} D_k(V) - C_k,\ C_k - \inf_{U\in\mathcal{V}} D_k(U)\Big\} = \frac{\sup_{V\in\mathcal{V}} D_k(V) - \inf_{U\in\mathcal{V}} D_k(U)}{2} \leq \rho B\sqrt\tau,$$
so $\sup_{U\in\mathcal{G}}\|U\|_2 \leq \rho B\sqrt\tau\sqrt n$. Setting $\alpha = \rho B\sqrt\tau$ (let $\alpha \to 0$ if $\tau = 0$) we get
$$n\,\mathrm{Rad}(\mathcal{G}) \leq \rho B\sqrt\tau\sqrt n + \rho B\sqrt\tau\sqrt n\,\sqrt{m\ln(mn)}.$$
Putting this together with our previous bound results in
$$n\,\mathrm{Rad}(\mathcal{F}) \leq \rho B\sqrt n + n\,\mathrm{Rad}(\mathcal{G}) \leq \rho B\sqrt n + \rho B\sqrt\tau\sqrt n\,\Big(1 + \sqrt{m\ln(mn)}\Big),$$
and dividing by $n$ then completes the proof.
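The non-adversarial term in the decomposition above, $\mathbb{E}_\epsilon\sup_{\|V\|\le B}\sum_k\epsilon_k y_k\langle\nabla f(x_k;W_0),V\rangle$, equals $B\,\mathbb{E}_\epsilon\|\sum_k\epsilon_k y_k\nabla f(x_k;W_0)\|$ by Cauchy-Schwarz, and a quick Monte Carlo estimate can be compared against the $\rho B\sqrt n$ bound. The data, the network parameterization, and the centering of the parameter ball at the origin are assumptions of this sketch, not part of the proof; the adversarial deviation term $D_k$ is what the covering argument handles.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m, rho, B = 128, 10, 2048, 1.0, 1.0

W0 = rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m)
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)   # ||x_k|| = 1
y = rng.choice([-1.0, 1.0], size=n)

def feat_grad(x):
    # row j of grad_W f(x;W0) is (rho/sqrt(m)) a_j x 1[w_{0,j}^T x >= 0]
    act = (W0 @ x >= 0).astype(float)
    return (rho / np.sqrt(m)) * (a * act)[:, None] * x[None, :]

# signed feature gradients y_k * grad f(x_k; W0), stacked as an (n, m, d) array
G = np.stack([y[k] * feat_grad(X[k]) for k in range(n)])

# sup over ||V||_F <= B of sum_k eps_k y_k <grad f(x_k;W0), V> equals B * ||sum_k eps_k y_k grad f(x_k;W0)||_F
sups = []
for _ in range(200):
    eps = rng.choice([-1.0, 1.0], size=n)
    sups.append(B * np.linalg.norm(np.tensordot(eps, G, axes=1)))

print("Monte Carlo estimate of the expectation:", np.mean(sups))
print("rho * B * sqrt(n) upper bound          :", rho * B * np.sqrt(n))
```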
With our Rademacher complexity bound we can prove our generalization bound, as follows.

Proof of Lemma 4.4. By Lemma A.6 part 2 of Ji et al. (2021), with probability at least $1-\delta$,
$$\sup_{\|V-W_0\|\leq B}\ \sup_{\|x\|\leq 1} \big|\langle\nabla f(x;W_0), V\rangle\big| \leq 18\rho B\ln\big(emd(1+3(md^{3/2})^d)/\delta\big) \leq 18\rho Bd\ln(4em^2d^3/\delta).$$
By the decreasing monotonicity of the logistic loss, for all $\|V-W_0\|\leq B$ and $\|x\|\leq 1$,
$$\ell_{\mathrm{A}}\big(x,y,\langle\nabla f(\cdot\,;W_0),V\rangle\big) \in \Big[\ell\big(18\rho Bd\ln(4em^2d^3/\delta)\big),\ \ell\big({-18\rho Bd\ln(4em^2d^3/\delta)}\big)\Big] \subseteq \Big[\ln(2) - 18\rho Bd\ln(4em^2d^3/\delta),\ \ln(2) + 18\rho Bd\ln(4em^2d^3/\delta)\Big].$$
With a bound on the range from above that holds with probability at least $1-\delta$, and a bound on the Rademacher complexity from Lemma 4.5 that holds with probability at least $1-3\delta$, we can now apply a standard Rademacher bound (Shalev-Shwartz & Ben-David, 2014) that holds with probability at least $1-\delta$, to get that altogether, with probability at least $1-5\delta$,
$$\mathcal{R}^{(0)}_{\mathrm{A}}(V) - \widehat{\mathcal{R}}^{(0)}_{\mathrm{A}}(V) \leq 2\,\mathrm{Rad}\big(\ell_{\mathrm{A}}(\mathcal{F})\big) + 3\big(36\rho Bd\ln(4em^2d^3/\delta)\big)\sqrt{\frac{\ln(2/\delta)}{2n}}
\leq \frac{2\rho B}{\sqrt n} + \frac{2\rho B\sqrt\tau\big(1+\sqrt{m\ln(mn)}\big)}{\sqrt n} + \frac{77\rho Bd\ln^{3/2}(4em^2d^3/\delta)}{\sqrt n},$$
using that the logistic loss is 1-Lipschitz.

A.3 OPTIMIZATION PROOFS

The following lemma bounds the gradient of the adversarial risk.

Lemma A.10. For any matrix $W\in\mathbb{R}^{m\times d}$, $\big\|\nabla\widehat{\mathcal{R}}_{\mathrm{A}}(W)\big\| \leq \rho\min\big\{1,\ \widehat{\mathcal{R}}_{\mathrm{A}}(W)\big\}$.

Proof. For each $k$ let $x_k^*$ denote a minimizer of $y_kf(W;\cdot)$ over $P(x_k)$ (equivalently, a maximizer of the loss). By properties of the logistic loss (Ji & Telgarsky, 2018),
$$\big\|\nabla\widehat{\mathcal{R}}_{\mathrm{A}}(W)\big\|
= \bigg\|\frac{1}{n}\sum_{k=1}^n \ell'\Big(\min_{x_k'\in P(x_k)} y_kf(W;x_k')\Big)\,y_k\nabla_W f(W;x_k^*)\bigg\|
\leq \frac{1}{n}\sum_{k=1}^n \Big|\ell'\Big(\min_{x_k'\in P(x_k)} y_kf(W;x_k')\Big)\Big|\,\big\|\nabla_W f(W;x_k^*)\big\|
\leq \frac{\rho}{n}\sum_{k=1}^n \min\big\{1,\ \ell_{\mathrm{A},k}(f)\big\}
\leq \rho\min\big\{1,\ \widehat{\mathcal{R}}_{\mathrm{A}}(W)\big\}.$$

We have the following guarantee when using adversarial training.

Lemma A.11. When adversarially training with step size $\eta$, for any iterate $t$ and any reference parameters $Z\in\mathbb{R}^{m\times d}$,
$$\|W_t - Z\|^2 + (2\eta - \eta^2\rho^2)\sum_{i<t}\widehat{\mathcal{R}}_{\mathrm{A}}(W_i) \leq \|W_0 - Z\|^2 + 2\eta\sum_{i<t}\widehat{\mathcal{R}}^{(i)}_{\mathrm{A}}(Z).$$

The next lemma shows that the network value changes little under a small perturbation of the input, uniformly over parameters near a reference matrix $W$.

Lemma A.12. For any $\|z\| \leq 1$ and $R_V \geq 1$, with probability at least $1-\delta$,
$$\sup_{\substack{\|x-z\|\leq\epsilon,\ \|x\|\leq 1\\ \|V-W\|\leq R_V}} \big|f(x;V) - f(z;V)\big| \leq 7\rho R_V^{1/3}m^{-1/6}\ln(em/\delta)^{1/6} + 12\rho d^{1/6}\epsilon^{1/3}\ln(em/\delta)^{1/3} + 2\rho\epsilon + \frac{15\rho\ln(edm/\delta)}{\sqrt m}.$$

Proof. We consider two cases, depending on whether or not $\|x\| \leq (k+1)\epsilon$, for some $k \geq 1$ to be determined later.

Case 1: $\|x\| \leq (k+1)\epsilon$. Then
$$f(x;V) - f(x;W) \leq \frac{\rho}{\sqrt m}\cdot\sqrt m\,(k+1)\epsilon = (k+1)\rho\epsilon.$$

Case 2: $\|x\| > (k+1)\epsilon$. Then $\|z\| > k\epsilon$. With probability at least $1-m\delta$, $\|w_j\| \leq \sqrt{2\ln(1/\delta)}$ for all $1\leq j\leq m$. Define
$$S_1 := \big\{j\in[m] : |w_j^\top z| \leq q\|z\|\big\},\qquad
S_2 := \big\{j\in[m] : |w_j^\top x| \leq r\|x\|\big\},\qquad
S_3 := \big\{j\in[m] : \|v_j - w_j\| \geq r\big\},\qquad
S := S_2 \cup S_3,$$
where $q := 2\big(r + \tfrac1k\sqrt{2\ln(1/\delta)}\big)$, with $r$ a parameter we will choose later. Note that $S_2 \subseteq S_1$, since if $|w_j^\top x| \leq r\|x\|$, then
$$|w_j^\top z| \leq |w_j^\top x| + \|w_j\|\,\epsilon \leq r\|x\| + \tfrac1k\|w_j\|\,\|x\| = \big(r + \tfrac1k\|w_j\|\big)\|x\| \leq 2\big(r + \tfrac1k\sqrt{2\ln(1/\delta)}\big)\|z\| = q\|z\|.$$
By Lemma A.2 part 1 of Ji et al. (2021) we have that with probability at least $1-3\delta$,
$$|S_2| \leq qm + \sqrt{8qm\ln(1/\delta)}.$$
Within the proof of Lemma A.7 part 1 of Ji et al. (2021) it is shown that $|S_3| \leq R_V^2/r^2$. So altogether, with probability at least $1-(m+3)\delta$,
$$|S| \leq |S_2| + |S_3| \leq qm + \sqrt{8qm\ln(1/\delta)} + \frac{R_V^2}{r^2}.$$
Setting $r := R_V^{2/3}m^{-1/3}\ln(e/\delta)^{-1/6}$ we get
$$|S| \leq 7R_V^{2/3}m^{2/3}\ln(e/\delta)^{1/3} + \frac{18m}{k}\sqrt{d\ln(e/\delta)}.$$
Substituting this upper bound on $|S|$ results in
$$f(x;V) - f(x;W) \leq \frac{\rho}{\sqrt m}\Big(3R_V^{1/3}m^{1/3}\ln(e/\delta)^{1/6} + 5m^{1/2}d^{1/4}k^{-1/2}\ln(e/\delta)^{1/4}\Big)
= 3\rho R_V^{1/3}m^{-1/6}\ln(e/\delta)^{1/6} + 5\rho d^{1/4}k^{-1/2}\ln(e/\delta)^{1/4}.$$
Combining the two cases results in
$$f(x;V) - f(x;W) \leq \max\Big\{(k+1)\rho\epsilon,\ 3\rho R_V^{1/3}m^{-1/6}\ln(e/\delta)^{1/6} + 5\rho d^{1/4}k^{-1/2}\ln(e/\delta)^{1/4}\Big\}.$$
After setting $k := d^{1/6}\epsilon^{-2/3}\ln(e/\delta)^{1/3}$ to balance the terms,
$$f(x;V) - f(x;W) \leq \max\Big\{\rho d^{1/6}\epsilon^{1/3}\ln(e/\delta)^{1/3} + \rho\epsilon,\ 3\rho R_V^{1/3}m^{-1/6}\ln(e/\delta)^{1/6} + 5\rho d^{1/6}\epsilon^{1/3}\ln(e/\delta)^{1/3}\Big\}
\leq 3\rho R_V^{1/3}m^{-1/6}\ln(e/\delta)^{1/6} + 5\rho d^{1/6}\epsilon^{1/3}\ln(e/\delta)^{1/3} + \rho\epsilon.$$
So with probability at least $1-(m+3)\delta$,
$$\sup_{\substack{\|x-z\|\leq\epsilon,\ \|x\|\leq 1\\ \|V-W\|\leq R_V}} \big|f(x;V) - f(z;V)\big| \leq 6\rho R_V^{1/3}m^{-1/6}\ln(e/\delta)^{1/6} + 10\rho d^{1/6}\epsilon^{1/3}\ln(e/\delta)^{1/3} + 2\rho\epsilon + \frac{11\rho\ln(edm/\delta)}{\sqrt m}.$$
Rescaling the probability of failure, we get that with probability at least $1-\delta$,
$$\sup_{\substack{\|x-z\|\leq\epsilon,\ \|x\|\leq 1\\ \|V-W\|\leq R_V}} \big|f(x;V) - f(z;V)\big| \leq 7\rho R_V^{1/3}m^{-1/6}\ln(em/\delta)^{1/6} + 12\rho d^{1/6}\epsilon^{1/3}\ln(em/\delta)^{1/3} + 2\rho\epsilon + \frac{15\rho\ln(edm/\delta)}{\sqrt m}.$$
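The gradient bound of Lemma A.10 only uses $|\ell'|\le\min\{1,\ell\}$ for the logistic loss and $\|\nabla_W f(x;W)\|\le\rho$ for $\|x\|\le 1$, so it can be checked numerically on any fixed set of perturbed inputs. In the sketch below the "adversary" is a crude random search standing in for the idealized optimal adversary assumed by the analysis, and the data, width, and radii are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m, rho, eps_ball = 64, 8, 512, 1.0, 0.05

W = rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m)
X = rng.normal(size=(n, d))
X = (1.0 - eps_ball) * X / np.linalg.norm(X, axis=1, keepdims=True)   # so that ||x + delta|| <= 1
y = rng.choice([-1.0, 1.0], size=n)

def f(x, W):
    return (rho / np.sqrt(m)) * a @ np.maximum(W @ x, 0.0)

def grad_f(x, W):
    act = (W @ x > 0).astype(float)
    return (rho / np.sqrt(m)) * (a * act)[:, None] * x[None, :]

def loss(z):   # logistic loss ln(1 + exp(-z))
    return np.logaddexp(0.0, -z)

# crude adversary: worst of a few random perturbations in the l2 ball of radius eps_ball
# (the inequality holds for any fixed choice of perturbed inputs)
X_adv = X.copy()
for k in range(n):
    cands = [X[k]] + [X[k] + eps_ball * v / np.linalg.norm(v) for v in rng.normal(size=(20, d))]
    X_adv[k] = max(cands, key=lambda xp: loss(y[k] * f(xp, W)))

risk = np.mean([loss(y[k] * f(X_adv[k], W)) for k in range(n)])
grad = np.zeros_like(W)
for k in range(n):
    z = y[k] * f(X_adv[k], W)
    grad += (-1.0 / (1.0 + np.exp(z))) * y[k] * grad_f(X_adv[k], W) / n   # d/dW of the k-th loss
print("||grad R_A||_F     :", np.linalg.norm(grad))
print("rho * min(1, R_A)  :", rho * min(1.0, risk))
```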
Using a sphere covering argument, we bound $\widehat{\mathcal{R}}^{(i)}_{\mathrm{A}}(Z)$ in terms of $\widehat{\mathcal{R}}^{(0)}_{\mathrm{A}}(Z)$, as well as $\mathcal{R}^{(i)}(Z)$ in terms of $\mathcal{R}^{(0)}(Z)$.

Lemma A.13. 1. For any $R_V \geq 1$ and $R_B \geq 0$, with probability at least $1-\delta$,
$$\sup_{\substack{\|x\|\leq 1\\ \|V-W_0\|\leq R_V,\ \|B-W_0\|\leq R_B}} \big\langle \nabla f(x;V) - \nabla f(x;W_0),\, B\big\rangle \leq \frac{26\rho(R_B+R_V)R_V^{1/3}d^{1/4}\ln(ed^2m^2/\delta)^{1/4}}{m^{1/6}} + \frac{89\rho(R_B+R_V)d^{1/3}\ln(ed^3m^2/\delta)^{1/3}}{m^{1/4}} + 5\rho\sqrt{\frac{\ln(1/\delta)}{dm}}.$$

2. With probability at least $1-\delta$, simultaneously
$$\sup_{\substack{\|W_i-W_0\|\leq R_V\\ \|B-W_0\|\leq R_B}} \big|\widehat{\mathcal{R}}^{(i)}_{\mathrm{A}}(B) - \widehat{\mathcal{R}}^{(0)}_{\mathrm{A}}(B)\big| \leq \frac{26\rho(R_B+R_V)R_V^{1/3}d^{1/4}\ln(ed^2m^2/\delta)^{1/4}}{m^{1/6}} + \frac{89\rho(R_B+R_V)d^{1/3}\ln(ed^3m^2/\delta)^{1/3}}{m^{1/4}} + 5\rho\sqrt{\frac{\ln(1/\delta)}{dm}}$$
and
$$\sup_{\substack{\|W_i-W_0\|\leq R_V\\ \|B-W_0\|\leq R_B}} \big|\mathcal{R}^{(i)}(B) - \mathcal{R}^{(0)}(B)\big| \leq \frac{26\rho(R_B+R_V)R_V^{1/3}d^{1/4}\ln(ed^2m^2/\delta)^{1/4}}{m^{1/6}} + \frac{89\rho(R_B+R_V)d^{1/3}\ln(ed^3m^2/\delta)^{1/3}}{m^{1/4}} + 5\rho\sqrt{\frac{\ln(1/\delta)}{dm}}.$$

Proof. 1. For notational convenience let $W := W_0$. For some $0 < \epsilon \leq 1/(dm)$ which we will choose later, instantiate a cover $C$ at scale $\epsilon/\sqrt d$, with $|C| \leq (\sqrt d/\epsilon)^d$. For a given point $x$, let $z\in C$ denote the closest point in the cover. Then by the triangle inequality,
$$\sup_{\substack{\|x\|\leq 1\\ \|V-W\|\leq R_V,\ \|B-W\|\leq R_B}} \big\langle \nabla f(x;V) - \nabla f(x;W), B\big\rangle
\leq \sup\Big( \big\langle \nabla f(x;V) - \nabla f(z;V), B\big\rangle + \big\langle \nabla f(x;W) - \nabla f(z;W), B\big\rangle + \big\langle \nabla f(z;V) - \nabla f(z;W), B\big\rangle \Big).$$
Upper bounding this expression by taking the supremum separately over each of the terms, and noticing the second term is bounded above by the first term, results in
$$\sup\, \big\langle \nabla f(x;V) - \nabla f(x;W), B\big\rangle
\leq \sup\, 2\big\langle \nabla f(x;V) - \nabla f(z;V), B\big\rangle + \sup\, \big\langle \nabla f(z;V) - \nabla f(z;W), B\big\rangle.$$
Writing $B = (B-V) + V$ and using $\langle\nabla f(x;V),V\rangle = f(x;V)$ by homogeneity of the ReLU,
$$2\big\langle \nabla f(x;V) - \nabla f(z;V), B\big\rangle
= 2\big\langle \nabla f(x;V) - \nabla f(z;V), B - V\big\rangle + 2\big(f(x;V) - f(z;V)\big)
\leq 2\big\|\nabla f(x;V) - \nabla f(z;V)\big\|\,(R_B + R_V) + 2\big(f(x;V) - f(z;V)\big).$$
Instantiating Lemma A.7 part 1 of Ji et al. (2021) for all $z\in C$, we get that with probability at least $1-3(\sqrt d/\epsilon)^d\delta$,
$$\big\langle \nabla f(z;V) - \nabla f(z;W), B\big\rangle \leq \frac{3\rho(R_B+2R_V)R_V^{1/3}\ln(e/\delta)^{1/4}}{m^{1/6}}.$$
By Lemma A.2 part 2 of Ji et al. (2021), with probability at least $1-(\sqrt d/\epsilon)^d\delta$ we may bound $\|W\|$; assuming this holds, then for all $\|x-z\|\leq\epsilon$,
$$\big|f(x;V) - f(z;V)\big| = \bigg|\frac{\rho}{\sqrt m}\sum_{j=1}^m a_j\big(\sigma(v_j^\top x) - \sigma(v_j^\top z)\big)\bigg|
\leq \frac{\rho}{\sqrt m}\sum_{j=1}^m \big|v_j^\top x - v_j^\top z\big|
\leq \frac{\rho}{\sqrt m}\cdot\sqrt m\,\|V\|\,\|x-z\|
\leq \rho\big(R_V + \|W\|\big)\epsilon.$$
Instantiating Lemma A.12 for all $z\in C$, we get that with probability at least $1-(\sqrt d/\epsilon)^d\delta$,
$$\sup_{\substack{\|x-z\|\leq\epsilon,\ \|x\|\leq 1\\ \|V-W\|\leq R_V}} \big|f(x;V) - f(z;V)\big| \leq 7\rho R_V^{1/3}m^{-1/6}\ln(em/\delta)^{1/6} + 12\rho d^{1/6}\epsilon^{1/3}\ln(em/\delta)^{1/3} + 2\rho\epsilon + \frac{15\rho\ln(edm/\delta)}{\sqrt m}.$$
Altogether, with probability at least $1-5(\sqrt d/\epsilon)^d\delta$,
$$\sup\, \big\langle \nabla f(x;V) - \nabla f(x;W), B\big\rangle
\leq 2(R_B+R_V)\Big(7\rho R_V^{1/3}m^{-1/6}\ln(em/\delta)^{1/6} + 12\rho d^{1/6}\epsilon^{1/3}\ln(em/\delta)^{1/3} + 2\rho\epsilon + \frac{15\rho\ln(edm/\delta)}{\sqrt m}\Big) + 2\rho\big(R_V + \|W\|\big)\epsilon + \frac{3\rho(R_B+2R_V)R_V^{1/3}\ln(e/\delta)^{1/4}}{m^{1/6}}.$$
Setting $\epsilon = 1/(dm)$ we get with probability at least $1-5(d^{3/2}m)^d\delta$,
$$\sup\, \big\langle \nabla f(x;V) - \nabla f(x;W), B\big\rangle \leq \frac{20\rho(R_B+R_V)R_V^{1/3}\ln(em/\delta)^{1/4}}{m^{1/6}} + \frac{64\rho(R_B+R_V)\ln(edm/\delta)^{1/3}}{m^{1/4}} + 5\rho\sqrt{\frac{\ln(1/\delta)}{dm}}.$$
Rescaling $\delta$ we get that with probability at least $1-\delta$,
$$\sup\, \big\langle \nabla f(x;V) - \nabla f(x;W), B\big\rangle
\leq \frac{20\rho(R_B+R_V)R_V^{1/3}d^{1/4}\ln(5ed^2m^2/\delta)^{1/4}}{m^{1/6}} + \frac{64\rho(R_B+R_V)d^{1/3}\ln(5ed^3m^2/\delta)^{1/3}}{m^{1/4}} + 5\rho\sqrt{\frac{\ln(1/\delta)}{dm}}
\leq \frac{26\rho(R_B+R_V)R_V^{1/3}d^{1/4}\ln(ed^2m^2/\delta)^{1/4}}{m^{1/6}} + \frac{89\rho(R_B+R_V)d^{1/3}\ln(ed^3m^2/\delta)^{1/3}}{m^{1/4}} + 5\rho\sqrt{\frac{\ln(1/\delta)}{dm}}.$$

2. With probability at least $1-\delta$, the previous part holds. This part is then an immediate consequence, since the logistic loss is 1-Lipschitz.
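Lemma A.13 controls how much the linearized predictor $x\mapsto\langle\nabla f(x;\cdot),B\rangle$ changes when the features are taken at a nearby parameter $V$ instead of at $W_0$. The sketch below simply measures $\langle\nabla f(x;V)-\nabla f(x;W_0),B\rangle$ over random inputs and random $V$, $B$ in Frobenius balls; the sampling is only a lower estimate of the supremum, and the parameterization, widths, and radii are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, rho, R_B = 10, 8192, 1.0, 1.0

W0 = rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m)

def grad_f(x, W):
    # row j of grad_W f(x;W) is (rho/sqrt(m)) a_j x 1[w_j^T x >= 0]
    act = (W @ x >= 0).astype(float)
    return (rho / np.sqrt(m)) * (a * act)[:, None] * x[None, :]

Bdir = rng.normal(size=(m, d))
B = W0 + R_B * Bdir / np.linalg.norm(Bdir)        # ||B - W0||_F = R_B

for R_V in [0.1, 1.0, 10.0]:
    worst = 0.0
    for _ in range(100):
        Vdir = rng.normal(size=(m, d))
        V = W0 + R_V * Vdir / np.linalg.norm(Vdir)  # ||V - W0||_F = R_V
        x = rng.normal(size=d); x /= np.linalg.norm(x)
        worst = max(worst, abs(np.sum((grad_f(x, V) - grad_f(x, W0)) * B)))
    print(f"R_V={R_V:5.1f}  sampled sup of <grad f(x;V) - grad f(x;W0), B> = {worst:.5f}")
```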
We now prove the main optimization lemma.

Proof of Lemma 4.6. We will use the fact that it does not matter much which feature we use, because the function values on the domain will differ by a small amount, and hence the resulting difference in risk will be small. This is encapsulated by Lemma A.13.

We stop training once we reach $t$ iterations, or once the parameter distance from initialization exceeds $2R_Z$. That is, we stop at iteration $T$, where
$$T = \min\Big\{t,\ \inf\{i : \|W_i - W_0\| > 2R_Z\}\Big\}.$$
By definition $\|W_i - W_0\| \leq 2R_Z \leq R_{\mathrm{gd}}$ for all $i < T$, and
$$\|W_T - W_0\| \leq \|W_{T-1} - W_0\| + \eta\big\|\nabla\widehat{\mathcal{R}}_{\mathrm{A}}(W_{T-1})\big\| \leq 2R_Z + \eta\rho = R_{\mathrm{gd}}.$$
Then by Lemma A.13 part 2, we have that with probability at least $1-\delta$,
$$\sup_{\substack{\|W_i-W_0\|\leq R_{\mathrm{gd}}\\ \|B-W_0\|\leq R_{\mathrm{gd}}}} \big|\widehat{\mathcal{R}}^{(i)}_{\mathrm{A}}(B) - \widehat{\mathcal{R}}^{(0)}_{\mathrm{A}}(B)\big| \leq \frac{52\rho R_{\mathrm{gd}}^{4/3}d^{1/4}\ln(ed^2m^2/\delta)^{1/4}}{m^{1/6}} + \frac{178\rho R_{\mathrm{gd}}d^{1/3}\ln(ed^3m^2/\delta)^{1/3}}{m^{1/4}} + 5\rho\sqrt{\frac{\ln(1/\delta)}{dm}} =: \kappa_1.$$
Note that this holds for all iterations of interest, as well as when $B$ represents the reference parameters. By Lemma A.11 and the interchangeability of features, we get
$$\|W_T - Z\|^2 + (2\eta - \eta^2\rho^2)\sum_{i<T}\widehat{\mathcal{R}}_{\mathrm{A}}(W_i) \leq \|W_0 - Z\|^2 + 2\eta\sum_{i<T}\big(\widehat{\mathcal{R}}^{(0)}_{\mathrm{A}}(Z) + \kappa_1\big).$$

For any $\epsilon > 0$, there exists $n_\epsilon$ such that for all $n \geq n_\epsilon$, with probability at least $1-1/n^2$,
$$\mathcal{R}_{\mathrm{A}}\big(W_t^{(n)}\big) \leq \inf_{g\ \mathrm{meas.}}\{\mathcal{R}_{\mathrm{A}}(g)\} + \epsilon.$$
Since $\sum_{n\geq n_\epsilon} 1/n^2 < \infty$, by the Borel-Cantelli lemma we have
$$\limsup_{n\to\infty}\ \mathcal{R}_{\mathrm{A}}\big(W_t^{(n)}\big) \leq \inf_{g\ \mathrm{meas.}}\{\mathcal{R}_{\mathrm{A}}(g)\} + \epsilon \quad\text{almost surely}.$$
As $\epsilon > 0$ was arbitrary, we get Corollary 4.3.
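Finally, the optimization argument above can be mirrored by a small end-to-end sketch: gradient descent on the empirical adversarial logistic risk of a shallow ReLU network, with the early-stopping rule $T=\min\{t,\inf\{i:\|W_i-W_0\|>2R_Z\}\}$ from the proof of Lemma 4.6. The random-search attack below is only a stand-in for the idealized optimal adversary assumed by the analysis, and the data, step size, and radius $R_Z$ are arbitrary illustrative choices rather than the quantities appearing in the theorems.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, m, rho = 64, 8, 1024, 1.0
eps_ball, eta, t_max, R_Z = 0.05, 0.5, 50, 2.0

# toy linearly separable data, scaled so that ||x + delta|| <= 1 inside the perturbation ball
w_star = rng.normal(size=d); w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(n, d))
X = (1.0 - eps_ball) * X / np.linalg.norm(X, axis=1, keepdims=True)
y = np.sign(X @ w_star); y[y == 0] = 1.0

W0 = rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m)

def f(x, W):   # shallow ReLU network (assumed parameterization)
    return (rho / np.sqrt(m)) * a @ np.maximum(W @ x, 0.0)

def attack(x, yk, W, tries=20):
    # crude random-search adversary over the l2 ball of radius eps_ball: maximizes the loss
    # by minimizing the margin yk * f; a stand-in for the idealized optimal adversary
    cands = [x] + [x + eps_ball * v / np.linalg.norm(v) for v in rng.normal(size=(tries, d))]
    return min(cands, key=lambda xp: yk * f(xp, W))

def adv_risk_and_grad(W):
    risk, grad = 0.0, np.zeros_like(W)
    for k in range(n):
        xp = attack(X[k], y[k], W)
        z = y[k] * f(xp, W)
        risk += np.logaddexp(0.0, -z) / n
        act = (W @ xp > 0).astype(float)
        grad += (-1.0 / (1.0 + np.exp(z))) * y[k] * (rho / np.sqrt(m)) * (a * act)[:, None] * xp[None, :] / n
    return risk, grad

W, risk, T = W0.copy(), None, 0
for t in range(t_max):
    if np.linalg.norm(W - W0) > 2 * R_Z:    # early-stopping rule from the proof of Lemma 4.6
        break
    risk, grad = adv_risk_and_grad(W)
    W -= eta * grad
    T = t + 1
print(f"updates taken: {T}, ||W - W0|| = {np.linalg.norm(W - W0):.3f}, adversarial risk = {risk:.4f}")
```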