# Consistent Adversarially Robust Linear Classification: Non-Parametric Setting

Elvis Dohmatob¹

¹Meta FAIR. Correspondence to: Elvis Dohmatob.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

For binary classification in $d$ dimensions, it is known that with a sample size of $n$, an excess adversarial risk of $O(d/n)$ is achievable under strong parametric assumptions about the underlying data distribution (e.g., assuming a Gaussian mixture model). In the case of well-separated distributions, this rate can be further refined to $O(1/n)$. Our work studies the non-parametric setting, where very little is known. With only mild regularity conditions on the conditional distribution of the features, we examine adversarial attacks with respect to arbitrary norms and introduce a straightforward yet effective estimator with provable consistency w.r.t. adversarial risk. Our estimator is given by minimizing a series of smoothed versions of the robust 0/1 loss, with a smoothing bandwidth that adapts to both $n$ and $d$. Furthermore, we demonstrate that our estimator can achieve the minimax excess adversarial risk of $\widetilde{O}(\sqrt{d/n})$ for linear classifiers, at the cost of solving possibly rougher optimization problems.

1. Introduction

Supervised machine-learning models like (generalized) linear models, kernel methods, and neural networks have enjoyed tremendous practical success in many applications involving high-dimensional data, such as images. Yet, these models are highly sensitive to small perturbations known as adversarial examples (Szegedy et al., 2013; Tsipras et al., 2019; Su et al., 2017; Schmidt et al., 2018), which are often imperceptible to humans. While various strategies such as adversarial training (Madry et al., 2018) can mitigate this empirically, lack of robustness remains highly problematic for many safety-critical applications (e.g., autonomous vehicles), and motivates a better understanding of the phenomena at play. In the era of so-called large language models (LLMs) like ChatGPT and their rapid adoption by the general public for all kinds of duties (image generation, text generation, chatbots, encyclopedic knowledge assistants, etc.), the impact of adversarial examples (in the form of adversarial prompts) will be even more devastating. For example, see (Carlini et al., 2023) and the references therein.

In this work, we consider the statistical problem of computing a Bayes-optimal robust linear classifier for a binary classification problem on $\mathbb{R}^d$ without any parametric assumptions on the underlying distribution of the data, beyond the requirement of some regularity of the conditional density of the distribution of the features. Efficiently computable consistent estimators in the adversarial setting are a big deal because it is known that, unlike in ordinary classification, consistent convex surrogates do not exist in the adversarial case (Bao et al., 2020). Our contributions can be summarized as follows.

Risk-Consistency. We construct a sequence of linear classifiers (indexed by the sample size $n$ and input dimension $d$) whose adversarial risk / classification error approaches the optimal value over all linear models, in the limit $n \to \infty$ such that $d \log(n)/n \to 0$. See Theorem 4.1. Our estimator relies on Gaussian smoothing of the data, wherein the smoothing bandwidth $h_n$ gracefully adapts to the sample size $n$ and the input dimension $d$.
In the special case of Euclidean-norm attacks, our proposed estimator can be efficiently computed via smooth optimization methods on the Euclidean unit sphere (a smooth manifold) (Boumal et al., 2018). The bandwidth parameter $h_n$ trades off ease of optimization against the rate of statistical convergence of the adversarial risk of the estimator to the Bayes-optimal value.

Minimax Optimality. Under a stronger smoothness condition on the conditional distribution of the features, we recover in Theorem 4.2 rates which are uniform over all attack strengths and match the minimax optimal rate of classification in the absence of adversaries.

2. Preliminaries and Problem Setting

2.1. Linear Classification and Adversarial Attacks

Consider a binary classification problem in $\mathbb{R}^d$, consisting of learning the label $y \in \{\pm 1\}$ of a random data point with feature vector $x \in \mathbb{R}^d$. Let $P$ be the unknown joint distribution of the pair $(x, y)$. We are given access to an iid sample $D_n = \{(x_1, y_1), \ldots, (x_n, y_n)\} \subseteq \mathbb{R}^d \times \{\pm 1\}$ from $P$. For any $w \in \mathbb{R}^d$, consider the linear function $f_w : \mathbb{R}^d \to \mathbb{R}$ defined by $f_w(x) := x^\top w$. This induces a linear classifier (aka half-space)

$$C_f(x) := \operatorname{sign}(f(x)) = \begin{cases} 1, & \text{if } f(x) \ge 0, \\ -1, & \text{else.} \end{cases} \tag{1}$$

Let $\|\cdot\|$ be any norm on $\mathbb{R}^d$ used to measure the strength of the attacker, and let $\|\cdot\|_\star$ be the dual norm, defined by

$$\|w\|_\star := \sup_{z \in \mathbb{R}^d,\ \|z\| \le 1} z^\top w. \tag{2}$$

Popular examples in the literature include the $\ell_p$-norms $\|\cdot\|_p$ with $p \in [1, \infty]$, especially $p = 1$ (sparse attacks), $p = \infty$ (entry-wise bounded attacks), and $p = 2$. Note that if $p' \in [1, \infty]$ is the harmonic conjugate of $p$, i.e., if $1/p + 1/p' = 1$, then the dual norm of $\|\cdot\|_p$ is $\|\cdot\|_{p'}$. For example, the $\ell_2$-norm is self-dual, whilst the $\ell_1$-norm and the $\ell_\infty$-norm are duals of one another. If $w \ne 0$, we can always normalize it so that $\|w\|_\star = 1$ without modifying the induced classifier $C_{f_w}$. We therefore consider the following restricted subset of linear models:

$$\mathcal{F}_{\mathrm{lin}} := \{f_w \mid w \in \mathbb{R}^d,\ \|w\|_\star = 1\}. \tag{3}$$

Note that the same function space was considered in (Bao et al., 2020) in the specific case of Euclidean-norm attacks. Though the scope of our work is linear models, our results are applicable to the case of models with frozen representations (i.e., pre-training), since we do not make any structural / parametric assumptions on the distribution of the features. Equivalently, this case corresponds to scenarios where the attacker can inject perturbations into the feature map $\phi(x)$ of an input $x$ at test time (Bietti & Mairal, 2019).

2.2. Adversarial Risk and Adversarial Regret

Definition 2.1 (Adversarial Risk). Given an attack budget $\epsilon \ge 0$, the adversarial risk (aka robust classification error) $E_\epsilon(f) = E_\epsilon(f; P)$ of a model $f \in \mathcal{F}_{\mathrm{lin}}$ is defined by

$$E_\epsilon(f; P) := \mathbb{P}\big(\exists\, \delta \in \mathbb{R}^d \text{ with } \|\delta\| \le \epsilon \text{ such that } C_f(x + \delta) \ne y\big), \tag{4}$$

where $(x, y)$ is a labelled random test example from $P$. Note that $E_0(f)$ is the non-adversarial risk (aka ordinary classification error) of $f$. It is clear that $E_\epsilon(f) \ge E_0(f)$ for any $\epsilon \ge 0$. More generally, the function $\epsilon \mapsto E_\epsilon(f)$ is non-decreasing.

Definition 2.2 (Adversarial Regret). Given a linear model $f \in \mathcal{F}_{\mathrm{lin}}$, its adversarial regret is the quantity

$$R_\epsilon(f) := E_\epsilon(f) - \inf_{g \in \mathcal{F}_{\mathrm{lin}}} E_\epsilon(g), \tag{5}$$

i.e., the excess adversarial risk of $f$ relative to the optimal value over the function class $\mathcal{F}_{\mathrm{lin}}$ of linear classifiers. Note that, by design, $\inf_{f \in \mathcal{F}_{\mathrm{lin}}} R_\epsilon(f) = 0$ for all $\epsilon \ge 0$.

Observe that, for any $f \in \mathcal{F}_{\mathrm{lin}}$,

$$E_\epsilon(f) = \mathbb{P}(y f(x) \le \epsilon) = \mathbb{E}_{x,y}\,[\theta_\epsilon(y f(x))], \tag{6}$$

where $\theta_\epsilon : \mathbb{R} \to \{0, 1\}$ is the unit-step function shifted to the right by an amount $\epsilon$, i.e.,

$$\theta_\epsilon(t) := \theta_0(t - \epsilon) = \begin{cases} 0, & \text{if } t > \epsilon, \\ 1, & \text{otherwise.} \end{cases} \tag{7}$$
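Before justifying (6), note that it already gives a direct way of evaluating the empirical adversarial risk of a linear classifier from its margins. The following is a minimal NumPy sketch of this computation (the function and variable names are ours, not from the paper):

```python
import numpy as np

def adversarial_risk_01(w, X, y, eps, norm_dual=None):
    """Empirical adversarial 0/1 risk of f_w(x) = x @ w under ||delta|| <= eps.

    By the margin identity below, the point (x_i, y_i) is robustly correct
    iff y_i * x_i @ w > eps * ||w||_*  (dual norm of the attack norm).
    """
    if norm_dual is None:            # Euclidean attacks: the l2-norm is self-dual
        norm_dual = np.linalg.norm(w)
    margins = y * (X @ w)            # m_i(w) = y_i x_i^T w
    return np.mean(margins <= eps * norm_dual)
```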
Indeed, for any $w \in \mathbb{R}^d$ with $\|w\|_\star = 1$, one computes

$$\inf_{\|\delta\| \le \epsilon} y\,(x + \delta)^\top w = y x^\top w + \inf_{\|\delta\| \le \epsilon} y\, \delta^\top w = y x^\top w - \epsilon \|w\|_\star = y x^\top w - \epsilon.$$

Also see Proposition 3 of (Bao et al., 2020). The quantity $m_w(x, y) := y f_w(x) = y x^\top w$ has a geometric meaning as the margin of the data point $(x, y)$ w.r.t. the decision boundary (a hyperplane) induced by $f_w$. Indeed, in the Euclidean setting, $|m_w(x, y)| / \|w\|_2$ is exactly the distance of $x$ to the decision boundary of the classifier $f_w$. Moreover, $m_w(x, y) > 0$ for points which are correctly classified by $f_w$, whilst $m_w(x, y) \le 0$ for misclassified points. Thus, (6) demands that, in the adversarial setting, even correctly classified points whose margin is not greater than $\epsilon$ be counted as incorrectly classified.

Given the training dataset $D_n$ and the attack budget $\epsilon$, the task of the machine learner is to output a binary linear classifier $\widehat{f}_n \in \mathcal{F}_{\mathrm{lin}}$.

2.3. The Adversarial 0/1 Loss

Our goal is to find the linear model $f \in \mathcal{F}_{\mathrm{lin}}$ with the least possible adversarial risk $E_\epsilon(f)$. Since $E_\epsilon(f)$ depends on the unknown distribution $P$ of $(x, y)$, this cannot be done directly. Vanilla empirical risk minimization (ERM) corresponds to computing $\widehat{f}_n \in \mathcal{F}_{\mathrm{lin}}$ which minimizes the empirical version of $E_\epsilon(f)$, namely

$$\widehat{E}_{n,\epsilon}(f) = E_\epsilon(f; \widehat{P}_n) = \frac{1}{n} \sum_{i=1}^n \theta_\epsilon(y_i f(x_i)) = \frac{1}{n} \#\{i \in [n] \mid y_i f(x_i) \le \epsilon\} \in [0, 1], \tag{8}$$

where $\widehat{P}_n := (1/n) \sum_{i=1}^n \delta_{(x_i, y_i)}$ is the empirical distribution corresponding to the random dataset $D_n$. Because of the discontinuity of the margin loss $t \mapsto \theta_\epsilon(t)$ at $t = \epsilon$, the empirical risk $\widehat{E}_{n,\epsilon}(f)$ cannot be optimized directly. Moreover, as established in (Bao et al., 2020; Dan et al., 2020), there is no convex surrogate for this loss which is Bayes-consistent. Our idea is to replace $\theta_\epsilon$ by a smoothed surrogate which is computationally tractable.

Definition 2.3. An estimator $\widehat{f}_{n,\epsilon} \in \mathcal{F}_{\mathrm{lin}}$ computed on the training dataset $D_n$ is said to be consistent if it has vanishing adversarial regret, i.e., $R_\epsilon(\widehat{f}_{n,\epsilon}) \to 0$ as $n \to \infty$, for all $\epsilon \ge 0$. We stress that our definition of consistency is relative to linear models, which are the main focus of this paper.

2.4. Notations

The maximum of two real numbers $a$ and $b$ will be denoted $a \vee b$. The maximum of $a$ and $0$ is denoted $a_+$. As usual, $\Phi$ (resp. $\varphi$) denotes the Gaussian cumulative distribution (resp. probability density) function. Given a real matrix $A \in \mathbb{R}^{d \times m}$, its singular values $\sigma_1(A) \ge \sigma_2(A) \ge \ldots \ge \sigma_d(A)$ correspond to the sorted list of nonnegative square-roots of the eigenvalues of the positive-semidefinite matrix $A A^\top$. The operator norm of $A$, denoted $\|A\|_{op}$, corresponds to $\sigma_1(A)$, while the Frobenius (aka Hilbert–Schmidt) norm of $A$, denoted $\|A\|_F$, corresponds to $\sqrt{\sum_{j=1}^d \sigma_j(A)^2}$. The number of nonzero singular values of $A$ corresponds to its rank, denoted $\operatorname{rank}(A)$. The effective rank of $A$, denoted $r(A)$, is defined by $r(A) := \|A\|_F^2 / \|A\|_{op}^2$. Note that if $A$ is a nonzero matrix, then $1 \le r(A) \le \operatorname{rank}(A) \le \min(m, d)$. $O(n)$, $\Omega(n)$, $o(n)$, etc., are standard notations, while the variants $\widetilde{O}(n)$, $\widetilde{\Omega}(n)$, etc., mean that there are hidden factors which are at most poly-logarithmic in $n$. All asymptotics will be w.r.t. the limit $n \to \infty$. An event occurs with high probability (w.h.p) if its occurrence probability is at least $1 - o(1)$ in the limit $n \to \infty$.

3. The Proposed Estimator

3.1. Smooth Surrogate for Adversarial Loss

Let $Q : \mathbb{R} \to [0, 1]$ be a function satisfying the following conditions: (1) $Q$ is strictly decreasing; (2) the tail of $Q$ behaves like $\lim_{t \to \infty} Q(t) = 0$. For example, the survival function of any random variable with everywhere-positive density is such a function.
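Two admissible choices which recur later in the paper are the Gaussian survival function $Q_G$ and the sigmoid-based $Q_S := 1 - \sigma$. A minimal sketch (our own code, not the authors'):

```python
import numpy as np
from scipy.stats import norm

def Q_gaussian(t):
    # Survival function of a standard Gaussian: strictly decreasing, -> 0 at +inf
    return norm.sf(t)

def Q_sigmoid(t):
    # Q_S = 1 - sigma, with sigma(t) = 1/(1 + exp(-t)) the logistic sigmoid;
    # algebraically, 1 - sigma(t) = 1/(1 + exp(t))
    return 1.0 / (1.0 + np.exp(t))
```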
Given a bandwidth parameter $h > 0$, define a function $\theta_{\epsilon,h}$ by

$$\theta_{\epsilon,h}(u) := Q_h(u - \epsilon), \quad \text{where } Q_h(u) := Q(u/h). \tag{9}$$

Note that if in addition $Q$ is continuous, then $\theta_{\epsilon,h}$ converges pointwise to $\theta_\epsilon$ in the limit $h \to 0^+$. Let $\ell_{\epsilon,h}$ be the loss function induced by $\theta_{\epsilon,h}$, i.e., for $\epsilon \ge 0$ and $h > 0$,

$$\ell_{\epsilon,h}(y, y') := \theta_{\epsilon,h}(y y'). \tag{10}$$

This leads to a smoothed version (11) of the empirical adversarial 0/1 risk (8), namely

$$\widehat{E}_{n,\epsilon,h}(f) := \frac{1}{n} \sum_{i=1}^n \ell_{\epsilon,h}(y_i, f(x_i)) = \frac{1}{n} \sum_{i=1}^n \theta_{\epsilon,h}(y_i f(x_i)) \in [0, 1]. \tag{11}$$

[Figure 1: plot of the smoothed margin loss as a function of a parameter coordinate $w_d$, for bandwidths $h \in \{0.01, 1.0, 2.0, 5.0, 10.0\}$.]

Figure 1. The Effect of Smoothing. Mild smoothing (small $h$) makes the underlying optimization problem computationally tractable without changing the value of the optimal empirical adversarial risk; refer to Lemma 3.1. Extreme smoothing (large $h$) makes the optimization problem very easy but essentially destroys the information contained in the training dataset $D_n$.

Note that in the non-robust case $\epsilon = 0$, choosing $Q = Q_G$, the Gaussian survival function, corresponds to the probit loss considered in (Keshet et al., 2011) in another context.

3.2. Link with Kernel Density Estimation

An important example is when $Q$ is the survival function of a random variable with a density. In this case, one can alternatively write the smoothed margin function (9) as a convolution product $\theta_{\epsilon,h} = \theta_\epsilon * s_Q(\cdot/h)/h$, with $s_Q := -Q'$. The smoothed loss then corresponds to replacing the empirical marginal distribution of the data points $x_1, \ldots, x_n \in \mathbb{R}^d$ by its kernel density estimate (KDE), constructed via the smoothing function $s_Q$, i.e.,

$$\widehat{E}_{n,\epsilon,h}(f) = E_{n,\epsilon,h}(f; \widehat{P}_n) = E_\epsilon\big(f;\ \underbrace{\widehat{P}_n * s_Q(\cdot/h)/h}_{\text{KDE}}\big), \tag{12}$$

where $*$ denotes convolution. Because of the way the adversarial risk functional $E_\epsilon$ is defined, the RHS of the above can be seen to correspond to a one-dimensional kernel density estimate of the distribution of the margins $m_f(x, y) := y f(x)$, with $(x, y) \sim P$.

3.3. The Estimator

For an appropriate choice of smoothing bandwidth $h = h_n$, our proposed estimator $\widehat{f}_{n,\epsilon,h}$ is defined as any minimizer of the smoothed empirical adversarial risk functional $\widehat{E}_{n,\epsilon,h}$ over the linear function class $\mathcal{F}_{\mathrm{lin}}$, that is

$$\widehat{f}_{n,\epsilon,h} \in \operatorname*{arg\,min}_{f \in \mathcal{F}_{\mathrm{lin}}} \widehat{E}_{n,\epsilon,h}(f). \tag{13}$$

The rest of this section will be spent discussing algorithmic aspects of the above estimator. We begin with the following lemma, which establishes an explicit formula for the gradient of the smoothed empirical risk $\widehat{E}_{n,\epsilon,h}(f_w)$.

Lemma 3.1. For any $w \in \mathbb{R}^d$, let $m_i(w) := y_i x_i^\top w$ be the margin of the data point $(x_i, y_i)$ w.r.t. the linear model $f_w$. Then, the smoothed empirical adversarial risk functional $w \mapsto \widehat{E}_{n,\epsilon,h}(f_w)$ is differentiable on $\mathbb{R}^d$ with gradient

$$\nabla_w \widehat{E}_{n,\epsilon,h}(f_w) = \frac{1}{nh} \sum_{i=1}^n Q'\big((m_i(w) - \epsilon)/h\big)\, y_i x_i. \tag{14}$$

Moreover, if $Q$ is twice differentiable with bounded second derivative, i.e., $\|Q''\|_\infty \le \alpha$ for some $\alpha \ge 0$, then for any $\epsilon \ge 0$, the smoothed empirical adversarial risk functional $w \mapsto \widehat{E}_{n,\epsilon,h}(f_w)$ is $L$-smooth on $\mathbb{R}^d$ with $L = \alpha \|\Sigma_n\|_{op} / h^2$, and its Hessian is given by

$$\nabla^2_w \widehat{E}_{n,\epsilon,h}(f_w) = \frac{1}{nh^2}\, X^\top D_\epsilon(w) X, \tag{15}$$

where $X$ is the $n \times d$ matrix with rows $x_1, \ldots, x_n$ (i.e., the design matrix), $\Sigma_n := X^\top X / n$ is the empirical covariance matrix (aka the normalized Gram matrix), and $D_\epsilon(w)$ is the $n \times n$ diagonal matrix given by $D_\epsilon(w)_{ii} := Q''\big((m_i(w) - \epsilon)/h\big)$ for all $i \in [n]$.

For example, in the case of the Gaussian survival function $Q = Q_G$, it is easy to see that the hypotheses of the lemma are verified with $\alpha = \|Q_G''\|_\infty = 1/\sqrt{2\pi e}$. Likewise, in the case where $Q = Q_S := 1 - \sigma$, with $\sigma(t) := 1/(1 + e^{-t})$ being the sigmoid, the hypothesis holds with $\alpha = \|\sigma''\|_\infty \le 1$.
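To make (11) and (14) concrete, here is a minimal NumPy sketch for the Gaussian choice $Q = Q_G$ (so $Q' = -\varphi$); the function and variable names are ours:

```python
import numpy as np
from scipy.stats import norm

def smoothed_risk_and_grad(w, X, y, eps, h):
    """Smoothed empirical adversarial risk (11) and its gradient (14),
    specialized to the Gaussian survival function Q = Q_G."""
    m = y * (X @ w)                  # margins m_i(w) = y_i x_i^T w
    u = (m - eps) / h
    risk = np.mean(norm.sf(u))       # (1/n) sum_i Q((m_i - eps)/h)
    # Q_G'(t) = -phi(t), so grad = -(1/(n h)) sum_i phi(u_i) y_i x_i
    grad = -(norm.pdf(u) * y) @ X / (len(y) * h)
    return risk, grad
```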
Proof of Lemma 3.1. Indeed, one can write

$$\widehat{E}_{n,\epsilon,h}(f_w) = \frac{1}{n} \sum_{i=1}^n \theta_{\epsilon,h}(m_i(w)) = \frac{1}{n} \sum_{i=1}^n Q\big((m_i(w) - \epsilon)/h\big),$$

and so differentiating w.r.t. $w$ and noting that $\nabla_w m_i(w) = y_i x_i$ gives the first part of the result. Now, differentiating (14) w.r.t. $w$ gives

$$nh^2\, \nabla^2_w \widehat{E}_{n,\epsilon,h}(f_w) = \sum_{i=1}^n Q''\Big(\frac{m_i(w) - \epsilon}{h}\Big)\, x_i x_i^\top = X^\top D_\epsilon(w) X.$$

It follows that, uniformly in $w \in \mathbb{R}^d$, the operator norm of the Hessian is upper-bounded as

$$\big\|\nabla^2_w \widehat{E}_{n,\epsilon,h}(f_w)\big\|_{op} \le \frac{\|D_\epsilon(w)\|_{op}\, \|X\|_{op}^2}{nh^2} = \frac{\|D_\epsilon(w)\|_{op}\, \|\Sigma_n\|_{op}}{h^2} \le \frac{\alpha\, \|\Sigma_n\|_{op}}{h^2},$$

which completes the proof.

3.4. Algorithm for Computing the Estimator

In the case where the attack is measured w.r.t. the Euclidean norm, or more generally Mahalanobis norms, (13) corresponds to a smooth optimization problem on a smooth sphere-like Riemannian manifold. This is a standard problem, and there are many efficient algorithms (Boumal et al., 2018). We use the trust-region methods (Absil et al., 2007) implemented in the Manopt library (Boumal et al., 2014). In the case of general $\ell_p$-norms with $p \notin \{2, \infty\}$, the problem (13) can be tackled using the methods developed in (Sato, 2023). A minimal sketch for the Euclidean case is given after the following remark.

Remark 3.1. Note that we are only able to guarantee convergence to a stationary point of $\widehat{E}_{n,\epsilon,h}$. However, in all our experiments, we observe that the numerically obtained stationary point is also the global optimum. A rigorous study of this aspect will be done in future work, possibly leveraging ideas from (Diakonikolas et al., 2019).
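For Euclidean attacks the constraint set is the unit sphere $\|w\|_2 = 1$. The paper itself uses trust-region methods from Manopt; the plain Riemannian gradient-descent loop below (with a fixed, arbitrarily chosen step size) is only our own illustrative sketch of optimization on this manifold:

```python
import numpy as np
from scipy.stats import norm

def riemannian_gd_sphere(X, y, eps, h, steps=500, lr=0.5, seed=0):
    """Minimize the smoothed empirical adversarial risk (11), Q = Q_G,
    over the unit l2-sphere by Riemannian gradient descent (a sketch;
    the paper uses trust-region methods from the Manopt toolbox)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(steps):
        u = (y * (X @ w) - eps) / h
        g = -(norm.pdf(u) * y) @ X / (len(y) * h)   # Euclidean gradient (Lemma 3.1)
        g_tan = g - (g @ w) * w                     # project onto the tangent space at w
        w = w - lr * g_tan
        w /= np.linalg.norm(w)                      # retraction back onto the sphere
    return w
```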
4. Statistical Analysis

We now establish a deviation bound which proves that, for an appropriate sample-size-dependent sequence of bandwidths $h_n \to 0$, our proposed estimator $\widehat{f}_{n,\epsilon,h_n} \in \mathcal{F}_{\mathrm{lin}}$ defined in (13) is Bayes-consistent over the linear function class $\mathcal{F}_{\mathrm{lin}}$, in the sense that its adversarial risk converges for large $n$ to that of the optimal robust linear classifier.

4.1. Assumptions

We will consider the regime where the sample size $n$ and the input dimension $d$ scale like so:

$$n \to \infty \ \text{ such that } \ (d/n) \log n \to 0. \tag{16}$$

This includes the fixed-dimensional regime where $d$ is bounded. The above condition is of course more general, as $d$ is allowed to grow with $n$, as long as the condition $(d/n) \log n \to 0$ is respected.

Definition 4.1. Given a random variable $z$, its Lévy concentration function (LCF) is defined for any $\delta \ge 0$ by

$$\operatorname{LCF}(z, \delta) := \sup_{u \in \mathbb{R}} \mathbb{P}(|z - u| \le \delta). \tag{17}$$

LCFs control how much mass a random variable can concentrate around any given point. They are encountered in convex geometry, problems of random matrix theory, and concentration of random polynomials, just to name a few. We refer the reader to (Rudelson & Vershynin, 2014) and the numerous references therein.

For the statistical analysis of the adversarial regret of our proposed estimator (13), we will make the following mild regularity assumption on the distribution $P$ of the data.

Assumption 4.1 (Small-Ball Assumption). For a random labelled test data point $(x, y) \sim P$, assume that

$$\sup_{\|w\|_\star = 1} \operatorname{LCF}(y x^\top w, \delta) \le \eta(\delta), \tag{18}$$

for some increasing function $\eta$ verifying $\lim_{\delta \to 0} \eta(\delta) = 0$.

Assumption 4.1 is very mild, and holds under rather general conditions. We will see that the assumption holds if the distribution of the feature vector $x$ conditioned on $y = l$, for any fixed value $l \in \{\pm 1\}$ of the label, has a density (w.r.t. the Lebesgue measure on $\mathbb{R}^d$). The assumption will allow us to control the variations of the margin loss $\theta_\epsilon : t \mapsto 1_{t \le \epsilon}$ in the vicinity of the discontinuity at $t = \epsilon$, uniformly over the linear function class $\mathcal{F}_{\mathrm{lin}}$ and the attack strength $\epsilon > 0$.

Remark 4.1. In a completely different context, a condition similar to Assumption 4.1 has been used in the analysis of benign overfitting in linear neural network regression (Chatterji et al., 2022) (see Assumption A.4 therein).

It turns out that the following condition is sufficient for Assumption 4.1 to hold.

Condition 4.1. For any label $l \in \{\pm 1\}$, the distribution of the feature vector $x$ conditioned on the event $y = l$ admits a density (w.r.t. the Lebesgue measure on $\mathbb{R}^d$).

Proposition 4.1. If Condition 4.1 holds, then Assumption 4.1 prevails.

4.2. Risk-Consistency of Our Proposed Estimator

The following is one of our main theoretical results. It establishes that the adversarial risk of the linear classifier $\widehat{f}_{n,\epsilon,h_n}$ (13) converges to that of the optimal linear classifier. Note that $\widehat{f}_{n,\epsilon,h_n}$ is a random function, as it depends on the random training data $D_n := \{(x_1, y_1), \ldots, (x_n, y_n)\}$.

Theorem 4.1. Let Assumption 4.1 be in order. In the limit (16), if $(h_n)_n$ is a sequence of bandwidths tending to $0$, then w.h.p over the training data $D_n$, it holds that

$$\sup_{\epsilon \ge 0} R_\epsilon(\widehat{f}_{n,\epsilon,h_n}) = O\Big(\lambda(h_n) + \sqrt{(d/n)\log n}\Big), \tag{19}$$

where $\lambda(h_n) := \eta\big(h_n \sqrt{\log(1/h_n)}\big) + Q\big(\sqrt{\log(1/h_n)}\big)$ and the function $\eta$ is as in Assumption 4.1.

4.3. Minimax Optimality

We now obtain quantitative rates for the excess adversarial risk under a strengthened form of Assumption 4.1. We do this via an appropriate choice of bandwidths $h_n$. The following is an important consequence of Theorem 4.1.

Theorem 4.2. Suppose Assumption 4.1 is in order, with $\eta(\delta) = O(\delta)$. Let the smoothing function $Q$ have an exponential tail (e.g., the Gaussian survival function $Q_G$). Consider the choice of bandwidth

$$h_n \asymp \big(\sqrt{(d/n) \log n}\,\big)^{1 + \Omega(1)}. \tag{20}$$

Then, in the limit (16), it holds w.h.p over $D_n$ that

$$\sup_{\epsilon \ge 0} R_\epsilon(\widehat{f}_{n,\epsilon,h_n}) = O\Big(\sqrt{(d/n)\log n}\Big). \tag{21}$$

Note that in Theorem 4.2, the rate $\sqrt{d/n}$ (ignoring log factors) matches the minimax rate for non-parametric classification. Let us exemplify Theorem 4.2 in the case where the class-conditional distributions of the features are multivariate Gaussians.

Corollary 4.1. Suppose that for $(x, y) \sim P$, the distribution of $yx$ is a mixture of (multivariate) Gaussians, and take the smoothing function $Q_G$ and bandwidth $h_n \asymp \big(\sqrt{(d/n)\log n}\,\big)^{1+\Omega(1)}$. Then, w.h.p over $D_n$,

$$\sup_{\epsilon \ge 0} R_\epsilon(\widehat{f}_{n,\epsilon,h_n}) = O\Big(\sqrt{(d/n)\log n}\Big). \tag{22}$$

The result follows from Theorem 4.2 above and the following lemma, by which $\eta(\delta) = O(\delta)$ for Gaussian mixtures.

Lemma 4.2. Suppose there exists an absolute constant $b \ge 0$ such that for all $w \in \mathbb{R}^d$ with $\|w\|_\star = 1$, the random variable $y x^\top w$ has density bounded by $b$. Then, Assumption 4.1 holds with $\eta(\delta) = 2b\delta$. In particular, this is the case when the distribution of $z := yx$ is a mixture of (multivariate) Gaussians.

Proof. Indeed, if $f_w \le b$ is the density of $y x^\top w$, then

$$\operatorname{LCF}(y x^\top w, \delta) := \sup_{u \in \mathbb{R}} \mathbb{P}(|y x^\top w - u| \le \delta) = \sup_{u \in \mathbb{R}} \int_{u-\delta}^{u+\delta} f_w(z)\, dz \le b \cdot 2\delta = 2b\delta,$$

which proves the first part of the claim. In particular, if the distribution of $z$ is $\sum_k \pi_k N(\mu_k, \Sigma)$ (a setting considered in (Dan et al., 2020)), then for every $w \in \mathbb{R}^d$, the random variable $z^\top w$ has distribution $\sum_k \pi_k N(\mu_k^\top w, \|w\|_\Sigma^2)$, which has density bounded by $O(1/\|w\|_\Sigma) = O(1)$ over the sphere $\|w\|_\star = 1$. For example, in the case of Euclidean-norm attacks, this bound reduces to $O(\|\Sigma^{-1}\|_{op}^{1/2})$.
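As a quick numerical sanity check of Lemma 4.2 (our own illustration, not from the paper), one can estimate the LCF of the margin $z^\top w$ by Monte Carlo for a single-component Gaussian "mixture" and compare it with the bound $2b\delta$:

```python
import numpy as np

def lcf_estimate(z_samples, delta, grid):
    """Monte Carlo estimate of LCF(z, delta) = sup_u P(|z - u| <= delta),
    with the sup approximated over a finite grid of centers u."""
    return max(np.mean(np.abs(z_samples - u) <= delta) for u in grid)

rng = np.random.default_rng(0)
d, n = 5, 200_000
mu = np.ones(d) / np.sqrt(d)
w = rng.standard_normal(d); w /= np.linalg.norm(w)
z = (rng.standard_normal((n, d)) + mu) @ w   # margin z^T w ~ N(mu^T w, 1)
for delta in [0.01, 0.05, 0.1]:
    est = lcf_estimate(z, delta, np.linspace(-3, 3, 61))
    # density of N(mu^T w, 1) is bounded by b = 1/sqrt(2*pi), so LCF <= 2*b*delta
    print(delta, est, 2 * delta / np.sqrt(2 * np.pi))
```

The estimates sit at (or just below) the bound $2b\delta$ for small $\delta$, as the proof of Lemma 4.2 predicts.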
5. Related Work

There has been sustained interest in characterizing the statistical hardness of adversarial learning. In this section, we review the works which are most relevant to ours.

5.1. Non-Existence of Consistent Convex Surrogates

Consistency and calibration results (Bartlett et al., 2005) provide reassurance that optimizing a surrogate does not ultimately hinder the search for a function achieving the optimal risk, and thus allow such a search to proceed within the scope of computationally efficient algorithms. In the case of ordinary learning, corresponding to the 0/1 loss $\theta_0 : t \mapsto 1_{t \le 0}$, it has been shown in (Bartlett et al., 2005) that consistent convex loss functions exist. In fact, all the classically used convex loss functions (logistic loss, etc.) are consistent for classification. However, Bao et al. (2020) recently established negative results which expose a stark separation between the tractability of learning half-spaces in the aforementioned ordinary regime and the case of robust / adversarial classification. More precisely, they show that, in contrast to the case of ordinary learning, where calibrated and consistent convex surrogates to the 0/1 loss $\theta_0 : t \mapsto 1_{t \le 0}$ exist (Bartlett et al., 2006), no consistent convex loss functions exist for the adversarial 0/1 loss $\theta_\epsilon : t \mapsto 1_{t \le \epsilon}$ (with $\epsilon > 0$) over linear models $f \in \mathcal{F}_{\mathrm{lin}}$. The issue is the discontinuity at the point $t = \epsilon$. In the case of ordinary / non-adversarial classification, where $\epsilon = 0$, this discontinuity can be handled by using a convex loss function. In the case of adversarial classification, where $\epsilon > 0$, such a convexification of $\theta_\epsilon$ is not possible (Bao et al., 2020). In particular, this implies that regularization-based heuristics for training would-be robust models are neither calibrated nor consistent for robustness, in general! An exception, studied in (Awasthi et al., 2021), is the realizable case, where the data distribution is separable in the general sense (23). Our results don't contradict the observations of (Bao et al., 2020; Awasthi et al., 2021), since our surrogate losses are in fact non-convex.

5.2. Known Positive Results

Well-Separated and Parametric Settings. A number of works have proposed linear classifiers which provably achieve consistency with quantitative rates in favorable settings. Notably, in the case of Gaussian and Bernoulli mixture models, which are reminiscent of a parametric assumption, Schmidt et al. (2018) established a minimax lower-bound of order $d/n$ on the adversarial regret (5), together with matching upper-bounds for these data distributions; the $d/n$ rate has been shown to be tight in general (Schmidt et al., 2018; Dan et al., 2020; Bhattacharjee et al., 2021). Similar results were obtained in (Dan et al., 2020) in the case of Gaussian mixtures. In the case of well-separated data, (Bhattacharjee et al., 2021) proposed a modified max-margin algorithm which achieves an $O(1/n)$ bound on the adversarial risk. Such so-called fast rates are reminiscent of large-margin assumptions, which can be fully captured by the so-called Massart noise condition (Massart & Nédélec, 2006)

$$|\mathbb{P}(y = 1 \mid x) - 1/2| \ge \gamma, \tag{23}$$

for some $\gamma \in (0, 1/2]$ and almost all $x \in \mathbb{R}^d$. In contrast to the aforementioned works, we consider the case where no such parametric or large-margin assumption on the underlying distribution is made, and establish a non-parametric error rate of $\sqrt{d/n}$ (ignoring log factors) (Vapnik & Chervonenkis, 1971; Massart & Nédélec, 2006). Moreover, this rate is optimal, since it is well-known that $\sqrt{d/n}$ is optimal in classical / non-robust classification without further distributional assumptions like (23).
The General Setting. Departing from the aforementioned parametric or well-separatedness assumptions, (Bhattacharjee & Chaudhuri, 2020) proposed a two-stage estimator which is provably consistent. The first stage consists in pruning the dataset $D_n$ so as to only keep a maximal $\epsilon'$-separated subset, for some $\epsilon' > \epsilon$. The second stage amounts to running a weighted nearest-neighbors classifier on the pruned dataset. Let us mention some weaknesses of the above method. First, though quite general, the resulting estimator suffers a notable irreducible computational drawback due to the first stage. Indeed, computing a maximal $\epsilon'$-separated subset is reminiscent of finding a maximal independent set in a bipartite graph with $n$ vertices. Due to NP-hardness (Feige, 2002), the complexity of this problem is impractical in the large-$n$ limit. Furthermore, the consistency results in (Bhattacharjee & Chaudhuri, 2020) for their proposed estimator are non-quantitative (i.e., no rates).

Since the setting of the method proposed in (Bhattacharjee & Chaudhuri, 2020) described above is most similar to the setting considered in our work, we now provide a detailed comparison with our proposed estimator (13). First, from a computational standpoint, observe from Lemma 3.1 that our proposed estimator corresponds to a non-convex but $L$-smooth optimization problem with $L = O(\|\Sigma_n\|_{op} / h_n^2)$, where we have the choice of the smoothing bandwidth $h_n \to 0$. As in Lemma 3.1, $\Sigma_n := X^\top X / n$ is the empirical covariance matrix of the features. For concreteness, if we take $\|\Sigma_n\|_{op} = O(1)$, then $L = O(1/h_n^2)$, which controls the computational complexity associated with computing our estimator. On the other hand, the adversarial regret of our estimator is of order $\lambda(h_n) + \sqrt{d \log(n)/n}$ (where $\lambda(h_n) \to 0$ as $h_n \to 0$, at a rate depending on $\eta$ in Assumption 4.1) thanks to Theorem 4.1, allowing us to trade off computational complexity / optimization difficulty (small $h_n$, hence large $L$) against the rate of statistical convergence by appropriately tuning $h_n$. In particular, if we are only interested in consistency (and no quantitative rates), then our proposed estimator can be made to run in linear time (i.e., fast optimization) by tuning the smoothing bandwidth $h_n$ so that $h_n \to 0$ only very slowly.

[Figure 2: two log-scale panels of adversarial risk vs. sample size $n \in [10^1, 10^4]$, for attack strengths $\epsilon \in \{0.0, 0.3, 0.7, 1.0\}$.]

Figure 2. Gaussian Experiment. Showing the adversarial risk (6) of our proposed estimator $\widehat{f}_{n,\epsilon,h_n}$ defined in (13) as a function of the sample size $n$ and the Euclidean-norm attack strength $\epsilon$. Left: isotropic features (i.e., $\Sigma = I_d$). Right: non-isotropic features. The horizontal broken lines correspond to the Bayes-optimal adversarial risk (25) for the given value of $\epsilon$. Error bars correspond to 10 runs of computing our estimator, each one corresponding to a different draw of the training dataset $D_n$. As the sample size $n$ is increased, notice how the adversarial risk of our estimator approaches the Bayes-optimal adversarial risk, for each level of the attack strength $\epsilon$, in accordance with Theorem 4.2 and Corollary 4.1.

[Figure 3: same layout as Figure 2.]

Figure 3. Non-Gaussian Experiment. As in Figure 2, error bars correspond to 10 runs of computing our estimator $\widehat{f}_{n,\epsilon,h_n}$ (13), each one corresponding to a different draw of the training dataset $D_n$. Left: isotropic features (i.e., $\Sigma = I_d$). Right: non-isotropic features. As the sample size $n$ is increased, notice how the adversarial risk of our estimator approaches the Bayes-optimal adversarial risk, for each level of the attack strength $\epsilon$, in accordance with Theorem 4.2.
5.3. The Case of Regression

In the setting of linear regression, (Xing et al., 2021) studied Euclidean-norm attacks with general covariance matrices. They showed that the optimal robust model is a ridge regression whose ridge parameter depends implicitly on the strength of the attacks. (Javanmard et al., 2020) studied tradeoffs between ordinary and adversarial risk in linear regression, and computed exact Pareto-optimal curves in the case of Euclidean-norm attacks on isotropic features. Their results show a tradeoff between ordinary and adversarial risk for adversarial training. (Javanmard & Mehrabi, 2021) also revisited this tradeoff for latent models and showed that it is mitigated when the data enjoys a low-dimensional structure. Recently, the study of robustness in linear regression for general norms and feature covariance matrices has been initiated in (Scetbon & Dohmatob, 2023; Dohmatob & Scetbon, 2023) in a student-teacher setup. Finally, let us mention (Hassani & Javanmard, 2022), which established tradeoffs between accuracy and robustness to Euclidean-norm attacks on two-layer neural networks in the random features regime. (Dohmatob & Bietti, 2022) extended this analysis to other learning regimes, including stochastic gradient descent (SGD) and the so-called lazy training regime (Chizat et al., 2019).

6. Empirical Validation

In this section, we present some empirical verification of our theoretical results. All experiments were run on a single modern CPU laptop.

Gaussian Experiment. Consider the distribution

$$\mathbb{P}(y = 1) = 1/2, \quad x \mid y \sim N(y\mu, \Sigma), \tag{24}$$

where $\mu \in \mathbb{R}^d$ with $\|\mu\|_\Sigma = 1$ is a fixed vector and the $d \times d$ positive-definite matrix $\Sigma$ is the feature covariance matrix. We try both $\Sigma = I_d$ and $\Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_d^2)$ with $\sigma_k^2 = k^{-2}$ for all $k \in [d]$. This data model corresponds to the setting of Schmidt et al. (2018); Dobriban & Wager (2018); Dan et al. (2020), and the Bayes-optimal robust linear classifier is known analytically. Note that for this problem, it is well-known (Bhagoji et al., 2019; Dobriban et al., 2020) that the optimal value of the adversarial risk is attained over linear models and is explicitly given by

$$E^{\mathrm{opt}}_\epsilon = \Phi\big({-(1 - \epsilon/\|\mu\|_2)_+}\cdot \mathrm{SNR}\big), \tag{25}$$

where $\mathrm{SNR} := \|\mu\|_2 / \sigma_d$ is a measure of signal-to-noise ratio. We set the input dimension to $d = 20$ for this experiment. For each value of the sample size $n \in \{100, 200, \ldots, 1000, 2000, 3000, \ldots, 10000\}$, we generate a dataset $D_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ of $n$ iid samples, and then compute the estimator $\widehat{f}_{n,\epsilon,h_n}$ described in Section 3.3, where $h_n$ is the bandwidth parameter, taken as $h_n = \sqrt{(d/n) \log n}$, in accordance with the choice in Corollary 4.1. We consider Euclidean-norm attacks of strength $\epsilon$ ranging in $\{0.1, 0.2, \ldots, 0.9, 1\}$. The results of this experiment are shown in Figure 2.

Non-Gaussian Experiment. Next we turn from the Gaussian mixture setting and consider a setup where

$$z \mid y \sim N(y\mu, I_d), \quad x = \max(z, 0) \ \text{(entry-wise)}. \tag{26}$$

This setup captures the structure of a two-layer neural network with ReLU activation in the so-called random features regime, where only the parameters of the output layer are learned. A minimal data-generation sketch is given below.
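The following sketch (our own code, with hypothetical function names) samples from model (26):

```python
import numpy as np

def sample_relu_mixture(n, d, mu, seed=0):
    """Draw n iid samples from model (26): z | y ~ N(y*mu, I_d),
    x = max(z, 0) entry-wise (a sketch, not the authors' code)."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n)                        # P(y = 1) = 1/2
    z = y[:, None] * mu[None, :] + rng.standard_normal((n, d))
    x = np.maximum(z, 0.0)                                 # entry-wise ReLU
    return x, y

# e.g., X, y = sample_relu_mixture(n=1000, d=20, mu=np.ones(20) / np.sqrt(20))
```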
The rest of the experimental setup is as in the previous experiment (the Gaussian Experiment). The results of this experiment are shown in Figure 3.

Comparison with (Bao et al., 2020). It should be noted that (Bao et al., 2020) doesn't attempt any analysis of consistency, and therefore contains no regret analysis (rates). In particular, minimizing their proposed surrogate losses (sigmoid, etc.) does not provably converge to the Bayes-optimal adversarial risk. In contrast, our work principally explores the problem of consistency, and we show that for a class of non-convex smooth loss functions (e.g., sigmoid, Gaussian, etc.), the resulting estimator is consistent w.r.t. the adversarial 0/1 risk; we also establish an explicit regret bound with an $\widetilde{O}(1/\sqrt{n})$ rate of convergence. A key idea in our work is to not consider static surrogate losses as in (Bao et al., 2020), but to adapt a smoothing bandwidth $h = h_n$ so that it vanishes with the sample size $n$ at a precise rate (20). Notwithstanding, Figure 4 (Right) shows the result of a small experiment comparing our proposed estimator with that proposed in (Bao et al., 2020). The dataset is a 2D Gaussian dataset given by $x \mid y \sim N(y\mu, \Sigma)$ and $\mathbb{P}(y = 1) = 1/2$, where $\Sigma = \operatorname{diag}(\sigma_1^2, \sigma_2^2)$, with $\sigma_1 = 1$, $\sigma_2 = 0.1$. This is a modification of the dataset "twonorm" described in Section 10 of (Bao et al., 2020), where they had $\sigma_1 = \sigma_2 = 0.1$ (left plot). For a smoothing bandwidth parameter $h$, we use the surrogate loss function $\theta_{\epsilon,h}(u) = Q((u - \epsilon)/h)$ with $Q = Q_S$. The case $h = 1$ corresponds to the estimator proposed in (Bao et al., 2020), while $h = h_n$ as in (20) corresponds to the estimator proposed in our work. As we can see (right plot), the estimator from (Bao et al., 2020) fails to be consistent here, while our adaptive estimator (left plot) is consistent, as predicted by Theorem 4.2. Similar results are observed with the choice $Q = Q_G$.

[Figure 4: two log-scale panels of adversarial regret vs. sample size $n \in [10^1, 10^4]$.]

Figure 4. Inconsistency of non-adaptive smoothing bandwidth. Left: Gaussian mixture problem with $\sigma_1 = \sigma_2 = 0.1$. Right: $\sigma_1 = 1$, $\sigma_2 = 0.1$. The non-adaptive bandwidth $h = 1$ corresponds to (Bao et al., 2020); unlike our proposed estimator (adaptive bandwidth $h_n$), it fails to achieve the optimal adversarial risk.

7. Conclusion

In this work, we have considered the problem of consistent adversarially robust classification in a setting where no parametric or well-separatedness assumptions are made, in contrast with previous works. We have proposed an estimator which simultaneously enjoys low computational complexity and statistical guarantees in the form of quantitative rates of convergence to the Bayes-optimal adversarial risk. Our estimator is based on smoothing the adversarial 0/1 loss, with a smoothing bandwidth which effectively adapts to the sample size and input dimension.

Non-linear Models. Our work can be extended in a number of directions. Notably, our work has focused on linear models. Even though the theoretical results developed here are already very interesting in this regime, analyzing neural networks (at least in their so-called linearized and kernel regimes) would be an interesting direction to pursue.

Statistics vs. Optimization. Finally, the theoretical analysis in this work has focused on the statistical properties of the proposed estimator. The optimization-related aspects of computing our proposed estimator (provable convergence to a global optimum instead of just stationary points) will be pursued in a future extension of our work.
Impact Statement

As we usher in the era of large language models (LLMs) and ChatGPT, the challenge of adversarial examples gains unprecedented importance, magnified by the widespread distribution of unsanitized and untraceable content and models across the internet. Our research delves into theoretical and algorithmic analyses of robust classification within a non-parametric framework, introducing the first estimator proven to be optimally robust in such a complex environment. This paper aims to contribute significantly to the advancement of Machine Learning, meticulously examining the societal impacts of our findings. We are confident that our work, which carefully navigates the intricacies of adversarial robustness, offers exclusively positive implications for the field and society at large.

References

Absil, P.-A., Baker, C. G., and Gallivan, K. A. Trust-region methods on Riemannian manifolds. Foundations of Computational Mathematics, 7(3), Jul 2007.

Awasthi, P., Frank, N., Mao, A., Mohri, M., and Zhong, Y. Calibration and consistency of adversarial surrogate losses. In Advances in Neural Information Processing Systems, volume 34, pp. 9804–9815. Curran Associates, Inc., 2021.

Bao, H., Scott, C., and Sugiyama, M. Calibrated surrogate losses for adversarially robust classification. In Abernethy, J. and Agarwal, S. (eds.), Proceedings of the Thirty-Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pp. 408–451. PMLR, 09–12 Jul 2020.

Bartlett, P., Jordan, M., and McAuliffe, J. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138–156, 02 2006. doi: 10.1198/016214505000000907.

Bartlett, P. L., Bousquet, O., and Mendelson, S. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.

Bhagoji, A. N., Cullina, D., and Mittal, P. Lower bounds on adversarial robustness from optimal transport, 2019.

Bhattacharjee, R. and Chaudhuri, K. When are non-parametric methods robust? In ICML, pp. 832–841. PMLR, 2020.

Bhattacharjee, R., Jha, S., and Chaudhuri, K. Sample complexity of robust linear classification on separated data. In Proceedings of the 38th International Conference on Machine Learning (ICML). PMLR, 2021.

Bietti, A. and Mairal, J. Group invariance, stability to deformations, and complexity of deep convolutional representations. Journal of Machine Learning Research, 20(25):1–49, 2019.

Boumal, N., Mishra, B., Absil, P.-A., and Sepulchre, R. Manopt, a Matlab toolbox for optimization on manifolds. Journal of Machine Learning Research, 15(42):1455–1459, 2014.

Boumal, N., Absil, P.-A., and Cartis, C. Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis, 2018.

Carlini, N., Nasr, M., Choquette-Choo, C. A., Jagielski, M., Gao, I., Awadalla, A., Koh, P. W., Ippolito, D., Lee, K., Tramèr, F., and Schmidt, L. Are aligned neural networks adversarially aligned? arXiv, abs/2306.15447, 2023.

Chatterji, N. S., Long, P. M., and Bartlett, P. L. The interplay between implicit bias and benign overfitting in two-layer linear networks. Journal of Machine Learning Research, 23(263):1–48, 2022.

Chizat, L., Oyallon, E., and Bach, F. On lazy training in differentiable programming. 2019.

Dan, C., Wei, Y., and Ravikumar, P. Sharp statistical guarantees for adversarially robust Gaussian classification.
In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 2345–2355. PMLR, 13–18 Jul 2020.

Diakonikolas, I., Kane, D., and Manurangsi, P. Nearly tight bounds for robust proper learning of halfspaces with a margin. In Advances in Neural Information Processing Systems, 2019.

Dobriban, E. and Wager, S. High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247–279, 2018.

Dobriban, E., Hassani, H., Hong, D., and Robey, A. Provable tradeoffs in adversarially robust classification. arXiv preprint arXiv:2006.05161, 2020.

Dohmatob, E. and Bietti, A. On the (non-)robustness of two-layer neural networks in different learning regimes, 2022.

Dohmatob, E. and Scetbon, M. Robust linear regression: Phase-transitions and precise tradeoffs for general norms. arXiv, abs/2308.00556, 2023.

Feige, U. Relations between average case complexity and approximation complexity. In STOC. Association for Computing Machinery, 2002.

Hassani, H. and Javanmard, A. The curse of overparametrization in adversarial training: Precise analysis of robust generalization for random features regression. arXiv preprint arXiv:2201.05149, 2022.

Javanmard, A. and Mehrabi, M. Adversarial robustness for latent models: Revisiting the robust-standard accuracies tradeoff. arXiv preprint arXiv:2110.11950, 2021.

Javanmard, A., Soltanolkotabi, M., and Hassani, H. Precise tradeoffs in adversarial training for linear regression. In COLT, 2020.

Keshet, J., McAllester, D. A., and Hazan, T. PAC-Bayesian approach for minimization of phoneme error rate. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2224–2227, 2011.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. 2018.

Massart, P. and Nédélec, É. Risk bounds for statistical learning. The Annals of Statistics, 34, 2006.

Rudelson, M. and Vershynin, R. Small ball probabilities for linear images of high-dimensional distributions. International Mathematics Research Notices, 2015(19):9594–9617, 12 2014.

Sato, H. Riemannian optimization on unit sphere with p-norm and its applications. Computational Optimization and Applications, 2023.

Scetbon, M. and Dohmatob, E. Robust linear regression: Gradient-descent, early-stopping, and beyond. In AISTATS, volume 206 of Proceedings of Machine Learning Research. PMLR, 2023.

Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. CoRR, abs/1804.11285, 2018.

Su, J., Vargas, D. V., and Sakurai, K. One pixel attack for fooling deep neural networks. CoRR, abs/1710.08864, 2017.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. In International Conference on Learning Representations (ICLR), volume abs/1805.12152, 2019.

van der Vaart, A. W. and Wellner, J. A. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, 1996.

Vapnik, V. N. The Nature of Statistical Learning Theory. New York, NY: Springer, 2nd ed. edition, 2000. ISBN 0-387-98780-0.

Vapnik, V. N. and Chervonenkis, A. Y.
On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2), 1971.

Xing, Y., Zhang, R., and Cheng, G. Adversarially robust estimate and risk analysis in linear regression. In AISTATS, volume 130, 2021.

Supplementary Material

A. An Oracle Bound: Proof of Theorem 4.1
B. Controlling the Bias in Adversarial Risk Due to Smoothing
B.1. Proof of Theorem 4.2
C. VC Dimension Computations
D. Proof of Proposition 4.1

A. An Oracle Bound: Proof of Theorem 4.1

Recall the definitions of the adversarial risk $E_\epsilon(f)$ and the excess adversarial risk $R_\epsilon(f)$ from Section 2.2.

Theorem 4.1. Let Assumption 4.1 be in order. In the limit (16), if $(h_n)_n$ is a sequence of bandwidths tending to $0$, then w.h.p over the training data $D_n$, it holds that

$$\sup_{\epsilon \ge 0} R_\epsilon(\widehat{f}_{n,\epsilon,h_n}) = O\Big(\lambda(h_n) + \sqrt{(d/n)\log n}\Big),$$

where $\lambda(h_n) := \eta\big(h_n \sqrt{\log(1/h_n)}\big) + Q\big(\sqrt{\log(1/h_n)}\big)$ and the function $\eta$ is as in Assumption 4.1.

Proof. Let $\widehat{f} = \widehat{f}_{n,\epsilon,h}$ be a minimizer of $\widehat{E}_{n,\epsilon,h}$ over the linear function class $\mathcal{F}_{\mathrm{lin}}$, and let $f^\star$ be a minimizer of $E_\epsilon$ over $\mathcal{F}_{\mathrm{lin}}$. Then, one has

$$0 \le E_\epsilon(\widehat{f}) - E_\epsilon(f^\star) = \big(E_\epsilon(\widehat{f}) - \widehat{E}_{n,\epsilon,h}(\widehat{f})\big) + \underbrace{\big(\widehat{E}_{n,\epsilon,h}(\widehat{f}) - \widehat{E}_{n,\epsilon,h}(f^\star)\big)}_{\le 0} + \big(\widehat{E}_{n,\epsilon,h}(f^\star) - E_\epsilon(f^\star)\big) \le 2\,\|E_\epsilon - \widehat{E}_{n,\epsilon,h}\|_{\mathcal{F}_{\mathrm{lin}}},$$

where $\|U - V\|_{\mathcal{F}} := \sup_{f \in \mathcal{F}} |U(f) - V(f)|$. We deduce that

$$R_\epsilon(\widehat{f}) := E_\epsilon(\widehat{f}) - \inf_{f \in \mathcal{F}_{\mathrm{lin}}} E_\epsilon(f) \le 2\,\|E_\epsilon - \widehat{E}_{n,\epsilon,h}\|_{\mathcal{F}_{\mathrm{lin}}}. \tag{28}$$

On the other hand, we can further decompose

$$\|E_\epsilon - \widehat{E}_{n,\epsilon,h}\|_{\mathcal{F}_{\mathrm{lin}}} \le \underbrace{\|E_\epsilon - \widehat{E}_{n,\epsilon}\|_{\mathcal{F}_{\mathrm{lin}}}}_{\text{gap due to ERM on margin loss}} + \underbrace{\|\widehat{E}_{n,\epsilon} - \widehat{E}_{n,\epsilon,h}\|_{\mathcal{F}_{\mathrm{lin}}}}_{\text{gap due to smoothing of margin loss}}. \tag{29}$$

Combining (28) and (29), we get

$$E_\epsilon(\widehat{f}_{n,\epsilon,h}) - \inf_{f \in \mathcal{F}_{\mathrm{lin}}} E_\epsilon(f) \le 2\Big(\underbrace{\|E_\epsilon - \widehat{E}_{n,\epsilon}\|_{\mathcal{F}_{\mathrm{lin}}}}_{\text{usual ERM error}} + \underbrace{\|\widehat{E}_{n,\epsilon} - \widehat{E}_{n,\epsilon,h}\|_{\mathcal{F}_{\mathrm{lin}}}}_{\text{empirical error due to smoothing}}\Big). \tag{30}$$

We now bound the two terms on the RHS of (30) separately.

Bounding the first term in (30). Consider the binary function class $\mathcal{G}_\epsilon$ on $\mathbb{R}^d \times \{\pm 1\}$ given by

$$\mathcal{G}_\epsilon := \{(x, y) \mapsto \theta_\epsilon(y f(x)) \mid f \in \mathcal{F}_{\mathrm{lin}}\}. \tag{31}$$

Thanks to Lemma C.2, we know that the VC dimension of $\mathcal{G}_\epsilon$ is at most $4d$. Then, applying Vapnik–Chervonenkis (VC) uniform convergence theory (Vapnik & Chervonenkis, 1971) gives

$$\|E_\epsilon - \widehat{E}_{n,\epsilon}\|_{\mathcal{F}_{\mathrm{lin}}} = O\Big(\sqrt{(d/n)\log n}\Big) \quad \text{w.h.p over the training data } D_n. \tag{32}$$

Bounding the second term in (30). In Proposition B.1, it is established that for small $h$ and large $n$, the following oracle bound holds w.h.p:

$$\|\widehat{E}_{n,\epsilon,h} - \widehat{E}_{n,\epsilon}\|_{\mathcal{F}_{\mathrm{lin}}} = O\Big(\lambda(h) + \sqrt{(d/n)\log n}\Big). \tag{33}$$

Combining with (30) and (32), we obtain that, w.h.p,

$$E_\epsilon(\widehat{f}) - \inf_{f \in \mathcal{F}_{\mathrm{lin}}} E_\epsilon(f) = O\Big(\lambda(h) + \sqrt{(d/n)\log n}\Big),$$

which completes the proof.

B. Controlling the Bias in Adversarial Risk Due to Smoothing

We now prove inequality (33), which was instrumental in the proof of Theorem 4.1.

Proposition B.1. With the same conditions and notations as in Theorem 4.1, it holds w.h.p over the training data $D_n$ that

$$\|\widehat{E}_{n,\epsilon,h} - \widehat{E}_{n,\epsilon}\|_{\mathcal{F}_{\mathrm{lin}}} = O\Big(\lambda(h) + \sqrt{(d/n)\log n}\Big). \tag{34}$$

Proof. Vary $h = h_n$ and $t = t_n > 0$ such that

$$h \to 0^+, \quad t \to \infty, \quad ht \to 0^+. \tag{35}$$

For each $f \in \mathcal{F}_{\mathrm{lin}}$ and $i \in [n]$, consider the binary random variable

$$z_{t,i}(f) := 1_{|y_i f(x_i) - \epsilon| \ge ht} = \begin{cases} 1, & \text{if } |y_i f(x_i) - \epsilon| \ge ht, \\ 0, & \text{else.} \end{cases} \tag{36}$$

Note that $z_{t,1}(f), \ldots, z_{t,n}(f)$ is an iid sequence of Bernoulli random variables with mean $\alpha_t(f) \in [0, 1]$ given by

$$\alpha_t(f) := \mathbb{E}[z_{t,i}(f)] = \mathbb{P}(|y_1 f(x_1) - \epsilon| \ge ht). \tag{37}$$
Recall the functions $\theta_\epsilon(u) := 1_{u \le \epsilon}$ and $\theta_{\epsilon,h}(u) := Q((u - \epsilon)/h)$ for $u \in \mathbb{R}$, introduced in (7) and (9) respectively, and observe that if $|u - \epsilon| \ge ht$, then

$$|\theta_{\epsilon,h}(u) - \theta_\epsilon(u)| = \begin{cases} |1 - Q((u - \epsilon)/h)| = Q(-(u - \epsilon)/h) = Q(|u - \epsilon|/h), & \text{if } u - \epsilon \le -ht, \\ |Q((u - \epsilon)/h) - 0| = Q((u - \epsilon)/h) = Q(|u - \epsilon|/h), & \text{if } u - \epsilon \ge ht, \end{cases} \tag{38}$$

where we have used the property that $Q$ is non-increasing with $1 - Q(-a) = Q(a)$ for all $a \ge 0$ (as holds for the survival function of a symmetric random variable). We deduce that, for every $f$ and $i$,

$$|\theta_{\epsilon,h}(y_i f(x_i)) - \theta_\epsilon(y_i f(x_i))| \le 1 - (1 - Q(t))\, z_{t,i}(f). \tag{39}$$

Now, consider the $\{0, 1\}$-valued function class on $\mathbb{R}^d \times \{\pm 1\}$ given by

$$\mathcal{H}_t := \{(x, y) \mapsto 1_{|y f(x) - \epsilon| \ge ht} \mid f \in \mathcal{F}_{\mathrm{lin}}\}. \tag{40}$$

Thanks to Lemma C.1, we know that the VC dimension of $\mathcal{H}_t$ is at most $4d$, and so by (Vapnik & Chervonenkis, 1971), it holds w.h.p over the training data $D_n$ that¹

$$\sup_{f \in \mathcal{F}_{\mathrm{lin}}} \Big|\frac{1}{n}\sum_{i=1}^n z_{t,i}(f) - \alpha_t(f)\Big| = \widetilde{O}\big(\sqrt{d/n}\big), \tag{41}$$

where all the hidden constants are independent of $d$, $n$, $h$, and $t$. Combining with (39), we deduce that, for large $n$ and small $h$, the following chain of inequalities holds w.h.p over the training data $D_n$:

$$\frac{1}{n}\sum_{i=1}^n |\theta_{\epsilon,h}(y_i f(x_i)) - \theta_\epsilon(y_i f(x_i))| \le 1 - (1 - Q(t))\,\frac{1}{n}\sum_{i=1}^n z_{t,i}(f) = \frac{1}{n}\sum_{i=1}^n (1 - z_{t,i}(f)) + Q(t)\,\frac{1}{n}\sum_{i=1}^n z_{t,i}(f) \le 1 - \alpha_t(f) + Q(t)\,\alpha_t(f) + \widetilde{O}\big(\sqrt{d/n}\big) = \beta_t(f) + \widetilde{O}\big(\sqrt{d/n}\big), \tag{42}$$

where $\beta_t(f)$ is defined by

$$\beta_t(f) := 1 - \alpha_t(f) + Q(t)\,\alpha_t(f). \tag{43}$$

Taking $t = \sqrt{k \log(1/h)}$ and applying Lemma B.1 ensures that, for any fixed positive $k$ and for $h \to 0^+$,

$$\beta_t(f) \le \eta\big(h\sqrt{k \log(1/h)}\big) + Q\big(\sqrt{k \log(1/h)}\big), \quad \text{for all } f \in \mathcal{F}_{\mathrm{lin}}. \tag{44}$$

We deduce that, for large $n$ and small $h$, it holds w.h.p over the training data $D_n$ that

$$\|\widehat{E}_{n,\epsilon,h} - \widehat{E}_{n,\epsilon}\|_{\mathcal{F}_{\mathrm{lin}}} = \sup_{f \in \mathcal{F}_{\mathrm{lin}}} \frac{1}{n}\sum_{i=1}^n |\theta_{\epsilon,h}(y_i f(x_i)) - \theta_\epsilon(y_i f(x_i))| \le \eta\big(h\sqrt{k \log(1/h)}\big) + Q\big(\sqrt{k \log(1/h)}\big) + \widetilde{O}\big(\sqrt{d/n}\big) = O\Big(\lambda(h) + \sqrt{(d/n)\log n}\Big), \tag{45}$$

which completes the proof.

Lemma B.1. For any fixed positive $k$, in the limit $h \to 0^+$ with $t = \sqrt{k \log(1/h)}$, it holds that

$$\sup_{f \in \mathcal{F}_{\mathrm{lin}}} \beta_t(f) \le \eta\big(h\sqrt{k \log(1/h)}\big) + Q\big(\sqrt{k \log(1/h)}\big).$$

Proof. For small $ht > 0$ with $ht \to 0^+$, one computes

$$\beta_t(f) = 1 - \alpha_t(f) + Q(t)\,\alpha_t(f) \le 1 - \alpha_t(f) + Q(t) \le \eta(ht) + Q(t).$$

The first inequality uses the fact that $\alpha_t(f) \le 1$. The second inequality invokes Assumption 4.1 to get $1 - \alpha_t(f) = \mathbb{P}(|y f(x) - \epsilon| \le ht) \le \eta(ht)$ for $(x, y) \sim P$. Finally, the choice $t = \sqrt{k \log(1/h)}$ gives $ht = h\sqrt{k \log(1/h)}$, and so $\beta_t(f) \le \eta\big(h\sqrt{k \log(1/h)}\big) + Q\big(\sqrt{k \log(1/h)}\big)$, as claimed.

¹ Note that the $\widetilde{O}(1/\sqrt{n})$ rate in (41) cannot be improved to $o(1/\sqrt{n})$, even for a single $f \in \mathcal{F}_{\mathrm{lin}}$, since this would contradict the CLT.

B.1. Proof of Theorem 4.2

Let $\widehat{f} = \widehat{f}_{n,\epsilon,h}$ be a minimizer of $\widehat{E}_{n,\epsilon,h}$ over the function class $\mathcal{F}_{\mathrm{lin}}$. For any $f \in \mathcal{F}_{\mathrm{lin}}$, one computes

$$R_\epsilon(\widehat{f}) \le E_\epsilon(\widehat{f}) - E_\epsilon(f) = \big(E_\epsilon(\widehat{f}) - \widehat{E}_{n,\epsilon,h}(\widehat{f})\big) + \underbrace{\big(\widehat{E}_{n,\epsilon,h}(\widehat{f}) - \widehat{E}_{n,\epsilon,h}(f)\big)}_{\le 0} + \big(\widehat{E}_{n,\epsilon,h}(f) - E_\epsilon(f)\big) \le 2 \sup_{f \in \mathcal{F}_{\mathrm{lin}}} |E_\epsilon(f) - \widehat{E}_{n,\epsilon,h}(f)|.$$

Thus, defining $\|E_\epsilon - \widehat{E}_{n,\epsilon,h}\|_{\mathcal{F}_{\mathrm{lin}}} := \sup_{f \in \mathcal{F}_{\mathrm{lin}}} |E_\epsilon(f) - \widehat{E}_{n,\epsilon,h}(f)|$, we deduce that

$$R_\epsilon(\widehat{f}) \le 2\,\|E_\epsilon - \widehat{E}_{n,\epsilon,h}\|_{\mathcal{F}_{\mathrm{lin}}}. \tag{47}$$

On the other hand, we can further decompose

$$\|E_\epsilon - \widehat{E}_{n,\epsilon,h}\|_{\mathcal{F}_{\mathrm{lin}}} \le \underbrace{\|E_\epsilon - \widehat{E}_{n,\epsilon}\|_{\mathcal{F}_{\mathrm{lin}}}}_{\text{regret due to ERM}} + \underbrace{\|\widehat{E}_{n,\epsilon} - \widehat{E}_{n,\epsilon,h}\|_{\mathcal{F}_{\mathrm{lin}}}}_{\text{regret due to smoothing}}. \tag{48}$$

The first term in the above decomposition is controlled using classical uniform convergence arguments (Vapnik & Chervonenkis, 1971; Vapnik, 2000); it is of order $O(\sqrt{(d/n)\log n})$ since the VC pseudo-dimension of $\mathcal{F}_{\mathrm{lin}}$ is at most $d$. Thanks to Proposition B.1, we know that the second term is of order $\lambda(h) + \sqrt{(d/n)\log n}$ w.h.p. Combining with (47) then gives the bound (19) of Theorem 4.1. Finally, taking the bandwidth $h_n$ as in (20) balances both terms in (19), and we obtain (21).

C. VC Dimension Computations

Let $F$ be a real-valued function class on an abstract set $\mathcal{X}$.
Lemma C.1. Let $\alpha, \beta \in \mathbb{R}$ with $\beta \ge 0$, and define a binary-valued function class

$$\mathcal{H} := \{(x, y) \mapsto 1_{|y f(x) - \alpha| \ge \beta} \mid f \in F\}.$$

Then, we have the upper bound

$$\operatorname{VCdim}(\mathcal{H}) \le 4 \operatorname{VCpdim}(F). \tag{49}$$

In particular, if $F$ is the linear function class on $\mathbb{R}^d$, then $\operatorname{VCdim}(\mathcal{H}) \le 4d$.

Proof. First observe that, in terms of VC dimension, the function class $\mathcal{H}$ can be equivalently viewed as the collection of sets

$$\mathcal{H} = \{T_{\alpha,\beta}(f) \mid f \in F\}, \quad \text{where } T_{\alpha,\beta}(f) := \{(x, y) \in \mathcal{X} \times \{\pm 1\} \mid |y f(x) - \alpha| \ge \beta\}. \tag{50}$$

Now, for any $t \in \mathbb{R}$ and $f \in F$, define $S_t(f) := \{(x, y) \in \mathcal{X} \times \{\pm 1\} \mid y f(x) \le t\}$, and consider the collection of sets

$$S_t(F) := \{S_t(f) \mid f \in F\}. \tag{51}$$

Since $|y f(x) - \alpha| \ge \beta$ iff $y f(x) \le \alpha - \beta$ or $y f(x) \ge \alpha + \beta$, it is clear that $(x, y) \in T_{\alpha,\beta}(f)$ iff $(x, y) \in S_{\alpha-\beta}(f) \cup S_{-\alpha-\beta}(-f)$, and so $T_{\alpha,\beta}(f) = S_{\alpha-\beta}(f) \cup S_{-\alpha-\beta}(-f)$. It follows that

$$\mathcal{H} \subseteq \{A \cup B \mid A \in S_{\alpha-\beta}(F),\ B \in S_{-\alpha-\beta}(-F)\}, \tag{52}$$

where $-F := \{-f \mid f \in F\}$. Thanks to van der Vaart & Wellner (1996, Lemma 2.6.17, part (iii)), we deduce that

$$\operatorname{VCdim}(\mathcal{H}) \le \operatorname{VCdim}(S_{\alpha-\beta}(F)) + \operatorname{VCdim}(S_{-\alpha-\beta}(-F)) = \operatorname{VCdim}(S_{\alpha-\beta}(F)) + \operatorname{VCdim}(S_{\alpha+\beta}(F)), \tag{53}$$

where we have used the fact that $(x, y) \in S_{-\alpha-\beta}(-f)$ iff $y f(x) \ge \alpha + \beta$, so that $S_{-\alpha-\beta}(-f)$ is the complement (up to the boundary) of $S_{\alpha+\beta}(f)$, together with the fact that $\operatorname{VCdim}(\{\complement S \mid S \in \mathcal{S}\}) = \operatorname{VCdim}(\mathcal{S})$ for any collection of sets $\mathcal{S}$. Finally, thanks to Lemma C.2, we have

$$\operatorname{VCdim}(S_t(F)) \le 2 \operatorname{VCpdim}(F) \quad \text{for any } t \in \mathbb{R}. \tag{54}$$

Combining with (53) then gives $\operatorname{VCdim}(\mathcal{H}) \le 4 \operatorname{VCpdim}(F)$. In particular, for the half-space function class $F$ on $\mathbb{R}^d$, we have $\operatorname{VCpdim}(F) \le d$, which then gives $\operatorname{VCdim}(\mathcal{H}) \le 4d$, and the proof is complete.

Lemma C.2. Let $\epsilon \in \mathbb{R}$ be fixed and define a collection of subsets of $\mathcal{X} \times \{\pm 1\}$ by

$$\mathcal{H} := \{\Lambda(f) \mid f \in F\}, \quad \text{where } \Lambda(f) := \{(x, y) \in \mathcal{X} \times \{\pm 1\} \mid y f(x) \le \epsilon\}. \tag{55}$$

Then, $\operatorname{VCdim}(\mathcal{H}) \le 2 \operatorname{VCpdim}(F)$.

Proof. For any $f \in F$ and $y \in \{\pm 1\}$, define $f_y : \mathcal{X} \to \mathbb{R}$ by $f_y(x) := y f(x) - \epsilon + y$, and let $F_y := \{f_y \mid f \in F\}$; thus, $F_y$ is an affine transformation of $F$. Then, one computes

$$\Lambda(f) = \bigcup_{y \in \{\pm 1\}} \{(x, y) \mid x \in \mathcal{X},\ y f(x) \le \epsilon\} = \bigcup_{y \in \{\pm 1\}} \{(x, y) \mid x \in \mathcal{X},\ f_y(x) \le y\} = \bigcup_{y \in \{\pm 1\}} \big(\{(x, t) \in \mathcal{X} \times \mathbb{R} \mid f_y(x) \le t\} \cap (\mathcal{X} \times \{y\})\big) = \bigcup_{y \in \{\pm 1\}} \operatorname{SG}(f_y) \cap (\mathcal{X} \times \{y\}) =: \bigcup_{y \in \{\pm 1\}} \Lambda_y(f), \tag{56}$$

where $\Lambda_y(f) := \operatorname{SG}(f_y) \cap (\mathcal{X} \times \{y\})$ and $\operatorname{SG}$ denotes the subgraph of a function. For every $y$, let $\Lambda_y(F) := \{\Lambda_y(f) \mid f \in F\}$. We deduce from the previous display that

$$\mathcal{H} = \Big\{\bigcup_{y \in \{\pm 1\}} \Lambda_y(f) \;\Big|\; f \in F\Big\} \subseteq \{A \cup B \mid A \in \Lambda_{+1}(F),\ B \in \Lambda_{-1}(F)\}, \tag{57}$$

and so, thanks to van der Vaart & Wellner (1996, Lemma 2.6.17, part (iii)), we obtain

$$\operatorname{VCdim}(\mathcal{H}) \le \sum_{y \in \{\pm 1\}} \operatorname{VCdim}(\Lambda_y(F)). \tag{58}$$

Now, observe that $\Lambda_y(F) = \{A \cap (\mathcal{X} \times \{y\}) \mid A \in \operatorname{SG}(F_y)\}$, and so by van der Vaart & Wellner (1996, Lemma 2.6.17, part (ii)), we get

$$\operatorname{VCdim}(\Lambda_y(F)) \le \operatorname{VCdim}(\operatorname{SG}(F_y)). \tag{59}$$

Now, since the transformation $f \mapsto f_y$ is obviously a bijection between $F$ and $F_y$, the VC pseudo-dimensions of $F$ and $F_y$ are equal. We deduce from (59) that

$$\operatorname{VCdim}(\Lambda_y(F)) \le \operatorname{VCdim}(\operatorname{SG}(F_y)) = \operatorname{VCdim}(\operatorname{SG}(F)) =: \operatorname{VCpdim}(F). \tag{60}$$

Combining with (58) gives the claimed result.

D. Proof of Proposition 4.1

Proof. Under Condition 4.1, it is clear that for $(x, y) \sim P$, the random vector $z := yx$ has a density. Indeed, if $N \subseteq \mathbb{R}^d$ is a null-set w.r.t. the Lebesgue measure on $\mathbb{R}^d$, then one computes

$$\mathbb{P}(z \in N) = \mathbb{P}(y = 1)\,\mathbb{P}(x \in N \mid y = 1) + \mathbb{P}(y = -1)\,\mathbb{P}(-x \in N \mid y = -1) = 0 + 0 = 0, \tag{61}$$

since $-N := \{-v \mid v \in N\}$ is also a null-set. It follows that every nonzero linear functional of $z$ also has a density, and thus a continuous CDF. In particular, for every $w \in \mathbb{R}^d \setminus \{0\}$, the margin random variable $m(x, y; w) := y x^\top w = z^\top w$ has a continuous CDF.

Now, let $R$ be a large positive number. By the preceding argument and the compactness of the set $S_R := \{(w, u) \in \mathbb{R}^{d+1} \mid \|w\|_\star = 1,\ |u| \le R\}$, the function $\delta \mapsto \sup_{(w,u) \in S_R} \mathbb{P}(|z^\top w - u| \le \delta)$ is continuous on $(0, \infty)$, and so

$$\lim_{\delta \to 0^+} \sup_{(w,u) \in S_R} \mathbb{P}(|z^\top w - u| \le \delta) = \sup_{(w,u) \in S_R} \lim_{\delta \to 0^+} \mathbb{P}(|z^\top w - u| \le \delta) = \sup_{(w,u) \in S_R} \mathbb{P}(z^\top w = u) = 0. \tag{62}$$
On the other hand, if $|u| > R$, then $|z^\top w - u| \ge |u| - |z^\top w| > R - |z^\top w|$, and so

$$\sup_{\|w\|_\star = 1,\ |u| > R} \mathbb{P}(|z^\top w - u| \le \delta) \le \sup_{\|w\|_\star = 1} \mathbb{P}(R - |z^\top w| \le \delta) = \sup_{\|w\|_\star = 1} \mathbb{P}(|z^\top w| \ge R - \delta) \le \mathbb{P}\Big(\sup_{\|w\|_\star = 1} |z^\top w| \ge R - \delta\Big) = \mathbb{P}(\|z\| \ge R - \delta) \xrightarrow{\delta \to 0^+} \mathbb{P}(\|z\| \ge R) \xrightarrow{R \to \infty} 0. \tag{63}$$

In the last two steps, we have used duality ($\sup_{\|w\|_\star = 1} |z^\top w| = \|z\|$) and the fact that the random variable $\|z\|$ has a continuous distribution, since $z$ has a density and every norm on $\mathbb{R}^d$ is continuous. Combining (62) and (63) completes the proof of the claim.