# A Law of Robustness beyond Isoperimetry

Yihan Wu 1, Heng Huang 1, Hongyang Zhang 2

1 Department of Computer Science, University of Maryland at College Park. 2 School of Computer Science, University of Waterloo. Correspondence to: Yihan Wu, Heng Huang, Hongyang Zhang.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

We study the robust interpolation problem of arbitrary data distributions supported on a bounded space and propose a two-fold law of robustness. Robust interpolation refers to the problem of interpolating $n$ noisy training data points in $\mathbb{R}^d$ by a Lipschitz function. Although this problem has been well understood when the samples are drawn from an isoperimetry distribution, much remains unknown concerning its performance under generic or even the worst-case distributions. We prove a Lipschitzness lower bound $\Omega(\sqrt{n/p})$ of the interpolating neural network with $p$ parameters on arbitrary data distributions. With this result, we validate the law of robustness conjecture in prior work by Bubeck, Li, and Nagaraj on two-layer neural networks with polynomial weights. We then extend our result to arbitrary interpolating approximators and prove a Lipschitzness lower bound $\Omega(n^{1/d})$ for robust interpolation. Our results demonstrate a two-fold law of robustness: i) we show the potential benefit of overparametrization for smooth data interpolation when $n = \mathrm{poly}(d)$, and ii) we disprove the potential existence of an $O(1)$-Lipschitz robust interpolating function when $n = \exp(\omega(d))$.

1. Introduction

Robustness has been a central research topic in machine learning (Szegedy et al., 2014; Goodfellow et al., 2014), statistics (Huber, 2004), operations research (Ben-Tal et al., 2009), and many other domains. In machine learning, the study of adversarial robustness has led to significant advances in defending against adversarial attacks, where test inputs with slight modifications can lead to problematic prediction results. In statistics and operations research, robustness is a desirable property for optimization problems under uncertainty, which can be represented as deterministic or random variability in the values of the optimization parameters. This is known as robust statistics or robust optimization. In both cases, the problem can be stated as follows: given a deterministic labeling function $g : \mathbb{R}^d \to [-1, 1]$, (approximately) interpolate the training data $\{(x_i, g(x_i))\}_{i=1}^n$ or its noisy counterpart by a function with a small Lipschitz constant.

The focus of this paper is on the latter setting, known as the robust interpolation problem (Bubeck & Sellke, 2023). That is, given noisy training data $\{(x_i, g(x_i) + z_i)\}_{i=1}^n$ of size $n$, where $x_1, \dots, x_n$ are restricted to a unit ball and $z_1, \dots, z_n$ have variance $\sigma^2 > 0$, how many network parameters and training samples are needed for robust interpolation, provided that the functions in the class can (approximately) interpolate the noisy training data with Lipschitz constant $L$? There are several reasons to study the noisy setting (Bubeck & Sellke, 2023): 1) Real-world data are noisy. For example, it has been shown that around 3.3% of the data in the most-cited datasets is inaccurate or mislabeled (Northcutt et al., 2021). 2) The noise assumption is necessary from a theoretical point of view, as otherwise there could exist a Lipschitz function which perfectly fits the training data for any large $n$.
Despite progress on the robust interpolation problem (Bubeck & Sellke, 2023; Bubeck et al., 2021), many fundamental questions remain unresolved. In modern learning theory, it was commonly believed that 1) big data (Schmidt et al., 2018), 2) low dimensionality of the input (Blum et al., 2020; Yang et al., 2020a; Kumar et al., 2020), and 3) overparametrization (Bubeck & Sellke, 2023; Bubeck et al., 2021) improve robustness. We view the robustness problem from the perspective of Lipschitzness and ask the following question: Are big data and large models a remedy for robustness?

In fact, there is significant empirical evidence indicating that enlarging the model size (overparametrization) improves robustness when $n$ is moderately large (e.g., when $n = \mathrm{poly}(d)$, see Madry et al. (2017); Schmidt et al. (2018)). Our work verifies the benefit of overparametrization for fitting a neural network with $p$ parameters below the noise level by proving that such neural networks must have a Lipschitzness lower bound $\Omega(\sqrt{n/p})$. On the other hand, big data and large models may not be a remedy for robustness if $n$ grows even larger. We show that for any approximator, no matter how many parameters it contains, its Lipschitzness is of order $\Omega(n^{1/d})$. In particular, our result disproves the possibility of learning an $O(1)$-Lipschitz function when $n = \exp(\omega(d))$. Besides, by showing that for any learning algorithm there exists a joint data distribution such that one needs at least $n = \exp(\Omega(d))$ samples to learn an $O(1)$-Lipschitz function with good population error, we demonstrate that big data are also necessary for robust interpolation in some special cases.

The robust interpolation problem becomes more challenging when no assumptions are made on the distribution of covariates. Due to the well-separated nature of the data, most positive results for obtaining good Lipschitzness lower bounds have focused on isoperimetry distributions (Bubeck & Sellke, 2023). A probability measure $\mu$ on $\mathbb{R}^d$ satisfies $c$-isoperimetry if for any bounded $L$-Lipschitz $f : \mathbb{R}^d \to \mathbb{R}$ and any $t \ge 0$,
$$\Pr(|f(x) - \mathbb{E}[f(x)]| \ge t) \le 2\exp\!\left(-\frac{dt^2}{2c L^2}\right).$$
Isoperimetry states that the output of any Lipschitz function is $O(1)$-subgaussian under suitable rescaling. Special cases of isoperimetry include high-dimensional Gaussians $\mathcal{N}(0, \frac{1}{d}I_d)$ and uniform distributions on spheres and hypercubes of diameter 1. However, real-world data might not follow the isoperimetry assumption. In contrast, our results in Theorem 3.4 go beyond isoperimetry and provide a lower bound on robustness for functions with $p$ parameters under arbitrary distributions on a bounded space. Our results in Theorem 3.9 go even further by providing a universal lower bound on robustness for any model class, including the class of neural networks with arbitrary architecture.

Notations. We will use $X$ to represent the instance space, $F = \{f : X \to [-1, 1]\}$ to represent the hypothesis/function space, $x \in X$ to represent the sample instance, $y \in [-1, 1]$ to represent the target, and $z$ to represent the target noise. For errors, denote by $l(f(x), y)$ the loss of $f$ on instance $x$ and target $y$; in our work we use the mean squared error as in Bubeck & Sellke (2023), i.e., $l(f(x), y) = (f(x) - y)^2$. Let $L_D(f) := \mathbb{E}_{(x,y)\sim D}[l(f(x), y)]$ be the population error, and let $L_S(f) := \frac{1}{|S|}\sum_{(x,y)\in S} l(f(x), y)$ be the empirical error. Denote by $f : X \to [-1, 1]$ the prediction function which maps an instance to its predicted target. It can be parameterized, e.g., by deep neural networks.
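To make the notation concrete, the minimal sketch below computes the squared loss, the empirical error $L_S(f)$, and a Monte Carlo estimate of the population error $L_D(f)$ under label noise. This is our own illustration, not code from the paper; the distribution, the ground truth $g$, and the predictor $f$ are arbitrary choices.

```python
# Numeric illustration of the notation: squared loss, empirical error L_S(f),
# and (Monte Carlo) population error L_D(f) under label noise.
# The distribution, g, and f below are illustrative assumptions, not the paper's.
import numpy as np

loss = lambda pred, target: (pred - target) ** 2          # l(f(x), y) = (f(x) - y)^2

def sample(n, d, sigma, rng):
    x = rng.uniform(-1, 1, (n, d)) / np.sqrt(d)            # covariates in the unit ball
    g = np.tanh(x[:, 0])                                    # ground-truth labels in [-1, 1]
    y = np.clip(g + sigma * rng.standard_normal(n), -1, 1)  # noisy targets
    return x, y

rng = np.random.default_rng(0)
d, sigma = 10, 0.3
f = lambda x: np.tanh(x[:, 0])                              # predictor being evaluated

xs, ys = sample(500, d, sigma, rng)                         # training set S
L_S = np.mean(loss(f(xs), ys))                              # empirical error
xp, yp = sample(200_000, d, sigma, rng)                     # fresh draw from D
L_D = np.mean(loss(f(xp), yp))                              # Monte Carlo population error
print(f"L_S(f) = {L_S:.3f}, L_D(f) = {L_D:.3f}  (both near the noise level {sigma**2:.3f})")
```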
For norms, we denote by $\|x\|$ a generic norm. Examples of norms include $\|x\|_\infty$, the infinity norm, and $\|x\|_2$, the $\ell_2$ norm. We will frequently use $(X, \|\cdot\|)$ to represent the normed linear space of $X$ with norm $\|\cdot\|$. Define $\mathrm{diam}(X)$ as the diameter of $X$ w.r.t. the norm $\|\cdot\|$. For a given score function $f$, we denote by $\mathrm{Lip}_{\|\cdot\|}(f)$ (or sometimes $\mathrm{Lip}(f)$ for simplicity) the Lipschitz constant of $f$ w.r.t. the norm $\|\cdot\|$. Let $\lceil\cdot\rceil$ represent the ceiling operator. We will use $O(\cdot)$, $\Theta(\cdot)$, $o(\cdot)$, and $\Omega(\cdot)$ to express sample complexity and Lipschitzness.

1.1. Our results

Our law of robustness is two-fold: a) overparametrization can potentially help robust interpolation when $n = \mathrm{poly}(d)$ (Section 3.1), and b) there exists no robust interpolation when $n = \exp(\omega(d))$ (Section 3.2).

Lipschitzness (or local Lipschitzness) is an important characterization of adversarial robustness for learning algorithms (Yang et al., 2020b; Zhang et al., 2019; Wu et al., 2022b;c). The popular randomized smoothing approaches (Cohen et al., 2019; Li et al., 2019; Wu et al., 2022d) can provide robustness guarantees through Lipschitzness but suffer from the curse of dimensionality (Wu et al., 2021). Thus, studying Lipschitzness is crucial for understanding robustness. For a given score function $f$, we denote by $\mathrm{Lip}_{\|\cdot\|}(f)$ the Lipschitz constant of $f$ w.r.t. the norm $\|\cdot\|$. That is, for any $x_1, x_2$ in the input space, $|f(x_1) - f(x_2)| \le \mathrm{Lip}_{\|\cdot\|}(f)\,\|x_1 - x_2\|$.

Our results show lower bounds on the Lipschitzness of learned functions when the training error is slightly smaller than the noise level (i.e., in the case of overfitting), but without assumptions on the distribution of covariates except that they are restricted to the bounded space $X := \{x : \|x\| \le 1\}$. We are interested in the bounded-space assumption because: 1) most applications of machine learning focus on the case where the data lie in a bounded space; for example, images and videos are considered to lie in $[-1, 1]^d$. 2) The discussion of Lipschitzness is closely related to how large the input space is. For example, for images restricted to $[-1, 1]^d$, special attention is paid to the $\ell_\infty$ robust radius of 0.031 or 0.062 (Zhang et al., 2019; Madry et al., 2017), which corresponds to a (local) Lipschitz constant of $O(1)$ for the classifier.

Overparametrization may benefit robust interpolation. The universal law of robustness by Bubeck & Sellke (2023) provides an $\Omega(\sqrt{nd/p})$ Lipschitzness lower bound for interpolating functions when the underlying distribution is isoperimetric (see Theorem 2.1). Our first result goes beyond the isoperimetry assumption and provides an $\Omega(\sqrt{n/p})$ Lipschitzness lower bound for interpolating functions under arbitrary distributions. We note that the $\sqrt{d}$ difference between the two Lipschitzness lower bounds is due to the special property of the isoperimetry assumption (see Remark 3.5). Our result predicts the potential existence of an $O(1)$-Lipschitz function that fits the data below the noise level when $p = \Omega(n)$. The following informal theorem illustrates the result (the detailed theorems are introduced in later sections):

Theorem A (informal version of Theorem 3.4). Let $F$ be any class of functions from $\mathbb{R}^d \to [-1, 1]$ and let $\{(x_i, y_i)\}_{i=1}^n$ be i.i.d. input-output pairs in $\{x : \|x\| \le 1\} \times [-1, 1]$ for any given norm $\|\cdot\|$. Assume that:

1. The expected conditional variance of the output (i.e., the noise level) is strictly positive, denoted by $\sigma^2 := \mathbb{E}[\mathrm{Var}[y|x]] > 0$.

2. $F$ admits a $J$-Lipschitz parametrization by $p$ real parameters, each of size at most $\mathrm{poly}(n, d)$.
Then, with high probability over the sampling of the data, one has simultaneously for all $f \in F$:
$$\frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 \le \sigma^2 - \epsilon \;\Longrightarrow\; \mathrm{Lip}_{\|\cdot\|}(f) \ge \Omega\!\left(\epsilon\sqrt{\frac{n}{p}}\right).$$

Remark 1.1. Our theorem takes a further step toward proving Conjecture 1 in Bubeck et al. (2021), which conjectures that for generic data sets, with high probability, any $f$ in the collection of two-layer networks with $p$ parameters fitting the data must also satisfy $\mathrm{Lip}_{\|\cdot\|}(f) \ge \Omega(\sqrt{n/p})$. We validate the conjecture under the polynomial weights assumption, whereas Bubeck & Sellke (2023) validate Conjecture 1 under both the polynomial weights assumption and the isoperimetry assumption.

Remark 1.2 (Strong overparametrization is not necessary for robust interpolation). The Lipschitzness lower bound of Bubeck & Sellke (2023) suggests that strong overparametrization, i.e., $p = \Omega(nd)$, is required for robust interpolation under the isoperimetry assumption. Our theorem shows that strong overparametrization may not be a necessary condition for robust interpolation on a general distribution; moderate overparametrization with $p = \Omega(n)$ may also be enough for robust interpolation. Our results are consistent with the empirical observations that CIFAR10 (50,000 images) can be robustly fitted by a model with $p \approx 10^6$ parameters, and ImageNet ($10^7$ images) can be robustly fitted by a model with $p \approx 10^7$–$10^8$ parameters.

Big data hurts robust interpolation. Under the assumptions of an isoperimetry distribution and $J$-Lipschitz parameterized functions, the universal law of robustness by Bubeck & Sellke (2023) predicts the potential existence of an $O(1)$-Lipschitz function that fits the data below the noise level when $p = \Omega(nd)$. Our result goes beyond the two assumptions and disproves the existence of such $O(1)$-Lipschitz functions in the big-data scenario when $n = \exp(\omega(d))$, for arbitrary distributions:

Theorem B (informal version of Theorem 3.9). Let $F$ be any class of functions from $\mathbb{R}^d \to [-1, 1]$ and let $\{(x_i, y_i)\}_{i=1}^n$ be i.i.d. input-output pairs in $\{x : \|x\| \le 1\} \times [-1, 1]$ for any given norm $\|\cdot\|$. Assume that:

1. The expected conditional variance of the output (i.e., the noise level) is strictly positive, denoted by $\sigma^2 := \mathbb{E}[\mathrm{Var}[y|x]] > 0$.

Then, with high probability over the sampling of the data, one has simultaneously for all $f \in F$:
$$\frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 \le \sigma^2 - \epsilon \;\Longrightarrow\; \mathrm{Lip}_{\|\cdot\|}(f) \ge \Omega(\epsilon\, n^{1/d}).$$

Difference between our results and Bubeck & Sellke (2023). Bubeck & Sellke (2023) proposed a universal law of robustness for a general class of functions (see Theorem 2.1). Our Theorem 3.4 and Theorem 3.9 share the same setting as Theorem 2.1, while making much weaker assumptions: 1) neither Theorem 3.4 nor Theorem 3.9 requires an isoperimetry assumption on the input distribution; 2) Theorem 3.9 makes no assumption on the Lipschitzness or size of the model parametrization. Moreover, while Theorem 2.1 predicts the potential existence of an $O(1)$-Lipschitz robust interpolating function when $p = \Omega(nd)$, Theorem 3.9 disproves this hypothesis in the big-data scenario when $n = \exp(\omega(d))$ for arbitrary distributions on the bounded space. Besides, our bounds work for all $\ell_p$ ($p \ge 1$) norms, while the bound in Bubeck & Sellke (2023) only focuses on the $\ell_2$ norm.

Practical implications. Our analysis provides important implications for practical settings. When selecting a model for learning on a certain dataset, ideally the number of parameters in the selected model should be of the same (or slightly larger) scale as the dataset in order to obtain good robust performance.
When the size of the dataset is too large compared to the dimension of the data, in order to achieve good robustness, it may be beneficial to either reduce the size of the training data or scatter the data in a higher-dimensional space by padding special covariates. This approach can help mitigate the negative effects of the curse of big data and improve model robustness, particularly when dealing with large datasets in practical applications (Wu et al., 2022a; 2023; Sun et al., 2021; 2022).

2. Related Work

Robust interpolation problem. Bubeck et al. (2021) provided the first guarantee on the law of robustness for two-layer neural networks, which was later extended by Bubeck & Sellke (2023) to a universal law of robustness for general classes of functions under isoperimetry distributions. A probability measure $\mu$ on $\mathbb{R}^d$ satisfies $c$-isoperimetry if for any bounded $L$-Lipschitz $f : \mathbb{R}^d \to \mathbb{R}$ and any $t \ge 0$, $\Pr(|f(x) - \mathbb{E}[f(x)]| \ge t) \le 2\exp(-\frac{dt^2}{2cL^2})$.

Theorem 2.1 (Theorem 1 of Bubeck & Sellke (2023)). Let $F$ be a class of functions from $\mathbb{R}^d \to [-1, 1]$ and let $\{(x_i, y_i)\}_{i=1}^n$ be i.i.d. input-output pairs in $\mathbb{R}^d \times [-1, 1]$. Assume that:

1. The expected conditional variance of the output (i.e., the noise level) is strictly positive, denoted by $\sigma^2 := \mathbb{E}[\mathrm{Var}[y|x]] > 0$.

2. $F$ admits a $J$-Lipschitz parametrization by $p$ real parameters, each of size at most $\mathrm{poly}(n, d)$.

3. The distribution $\mu$ of the input $x_i$ satisfies isoperimetry (or is a mixture thereof).

Then, with high probability over the sampling of the data, one has simultaneously for all $f \in F$:
$$\frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 \le \sigma^2 - \epsilon \;\Longrightarrow\; \mathrm{Lip}_{\ell_2}(f) \ge \Omega\!\left(\epsilon\sqrt{\frac{nd}{p}}\right).$$

Our work extends the result of Bubeck & Sellke (2023) by successively removing the third assumption (see Theorem 3.4) and the second assumption (see Theorem 3.9).

Sample complexity of robust learning. The sample complexity of robust learning for benign distributions and certain function classes has been extensively studied in recent years. In particular, Bhattacharjee et al. (2021) considered the sample complexity of robust linear classification on separated data. Yin et al. (2019) studied the adversarially robust generalization problem through the lens of Rademacher complexity. Cullina et al. (2018) extended the PAC-learning framework to account for the presence of an adversary. Montasser et al. (2019) showed that any hypothesis class with finite VC dimension is robustly PAC learnable with an improper learning rule; they also showed that the requirement of being improper is necessary. Schmidt et al. (2018) showed an $\Omega(\sqrt{d})$-factor gap between the standard and robust sample complexity for a mixture of Gaussian distributions under $\ell_\infty$ robustness, which was later extended to the case of $\ell_p$ robustness with a tight bound by Bhagoji et al. (2019); Dobriban et al. (2020); Dan et al. (2020). Different from the prior work, ours is the first to characterize the sample complexity of robust learning for arbitrary function classes and learning algorithms.

3. A Two-fold Law of Robustness

In this section, we present our main theoretical analysis, which constitutes our two-fold law of robustness. All missing proofs can be found in the appendix.

Robust interpolation problem. We first introduce our problem setting. Given noisy training data $\{(x_i, y_i := g(x_i) + z_i)\}_{i=1}^n$ of size $n$, where $x_1, \dots, x_n$ are training samples, $g(x_1), \dots, g(x_n)$ the ground truth, and $z_1, \dots, z_n$ have variance $\sigma^2 > 0$, we say a model $f$ robustly interpolates (or fits below the noise level) the training data if and only if
$$\frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 \le \sigma^2 - \epsilon.$$
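The short sketch below illustrates this criterion and its tension with Lipschitzness: a smooth predictor fails to fit below the noise level, while an exact interpolator of the noise does, at the price of a large pairwise-ratio lower bound on its Lipschitz constant. This is our own illustration under a toy distribution with $g \equiv 0$; the function names are not from the paper.

```python
# Minimal sketch of the robust interpolation criterion and an empirical
# Lipschitz lower bound (illustrative; names and toy data are our assumptions).
import numpy as np

def fits_below_noise_level(preds, y, sigma2, eps):
    """True iff (1/n) * sum (y_i - f(x_i))^2 <= sigma^2 - eps."""
    return np.mean((y - preds) ** 2) <= sigma2 - eps

def lipschitz_lower_bound(preds, X):
    """max_{i != j} |f(x_i) - f(x_j)| / ||x_i - x_j||, a lower bound on Lip(f)."""
    num = np.abs(preds[:, None] - preds[None, :])
    den = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    mask = ~np.eye(len(X), dtype=bool)
    return np.max(num[mask] / den[mask])

rng = np.random.default_rng(0)
d, n, sigma2, eps = 5, 200, 0.25, 0.05
X = rng.uniform(-1, 1, (n, d)) / np.sqrt(d)          # samples in the unit ball
y = np.sqrt(sigma2) * rng.choice([-1.0, 1.0], n)     # g(x) = 0 plus noise, Var[y|x] = sigma^2

smooth_preds = np.zeros(n)                           # the O(1)-Lipschitz predictor f = 0
interp_preds = y.copy()                              # any exact interpolator of the noise
print(fits_below_noise_level(smooth_preds, y, sigma2, eps),  # False: does not overfit
      lipschitz_lower_bound(smooth_preds, X))                # 0.0
print(fits_below_noise_level(interp_preds, y, sigma2, eps),  # True: fits below noise level...
      lipschitz_lower_bound(interp_preds, X))                # ...with a large Lip(f) lower bound
```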
Our two-fold law of robustness. a) Overparametrization can potentially help robust interpolation when $n = \mathrm{poly}(d)$ (Section 3.1); b) there exists no robust interpolation when $n = \exp(\omega(d))$ (Section 3.2).

3.1. A Lipschitz lower bound beyond the isoperimetry assumption

In this part, we show the first part of our two-fold law of robustness: overparametrization can potentially help robust interpolation when $n = \mathrm{poly}(d)$. Note that we say "potentially help" because overparametrization is only a necessary, not a sufficient, condition for robust interpolation.

Motivation. We notice that the proof of Theorem 2.1 (Bubeck & Sellke, 2023) depends heavily on the definition of an isoperimetry distribution, i.e., $\Pr(|f(x) - \mathbb{E}[f(x)]| \ge t) \le 2\exp(-\frac{dt^2}{2cL^2})$ for $L$-Lipschitz $f : \mathbb{R}^d \to \mathbb{R}$. This formula captures the high-concentration property of isoperimetry distributions through the $\exp(-d)$ dependency of $\Pr(|f(x) - \mathbb{E}[f(x)]| \ge t)$. The $\exp(-d)$ dependency is also the reason that the Lipschitzness lower bound of Bubeck & Sellke (2023) is $\Omega(\sqrt{nd/p})$ instead of the $\Omega(\sqrt{n/p})$ lower bound we derive.

Challenge. One may naturally come up with the idea of deriving a bound on $\Pr(|f(x) - \mathbb{E}[f(x)]| \ge t)$ for arbitrary distributions and thereby going beyond the isoperimetry distribution. However, the challenge is that, unlike a regular concentration bound on $\Pr(|x - \mathbb{E}[x]| \ge t)$, we are dealing with a more complicated case, where the random variable is $f(x)$ for an arbitrary $L$-Lipschitz $f$. To solve this problem, we apply Azuma's inequality:

Lemma 3.1 (Azuma's inequality (Azuma, 1967)). Suppose $\{X_k : k = 0, 1, 2, 3, \dots\}$ is a martingale and $|X_k - X_{k-1}| \le c_k$ almost surely. Then for all positive integers $N$ and $\epsilon > 0$,
$$\Pr(|X_N - X_0| \ge \epsilon) \le 2\exp\!\left(-\frac{\epsilon^2}{2\sum_{k=1}^N c_k^2}\right).$$

Azuma's inequality gives a concentration bound for the values of martingales with bounded differences. With this lemma, we are able to derive the following concentration bound for arbitrary distributions on a bounded space.

Lemma 3.2. Given an arbitrary probability measure $\mu$ on the bounded space $X \subseteq \mathbb{R}^d$, for any $L$-Lipschitz $f : \mathbb{R}^d \to \mathbb{R}$ and any $t \ge 0$,
$$\Pr(|f(x) - \mathbb{E}[f(x)]| \ge t) \le 2\exp\!\left(-\frac{t^2}{2\,\mathrm{diam}(X)^2 L^2}\right).$$

Compared with the $2\exp(-\frac{dt^2}{2cL^2})$ bound for isoperimetry distributions, our bound for arbitrary distributions only lacks a factor of $d$ in the numerator of the term inside the exponential. In order to achieve the same concentration bound as for isoperimetry distributions, one would need $\mathrm{diam}(X) = \Theta(1/\sqrt{d})$, which means the inputs would have to be located in a $\Theta(1/\sqrt{d})$-diameter space. As real-world datasets are usually supported on a $\Theta(1)$-diameter space, matching the isoperimetry bound for all distributions is empirically meaningless.

With Lemma 3.2, we can start to calculate the Lipschitzness lower bound using the following lemma on finite function classes.

Lemma 3.3. Let $F$ be a finite class of $L$-Lipschitz functions from $\mathbb{R}^d \to [-1, 1]$ and let $\{(x_i, y_i)\}_{i=1}^n$ be i.i.d. input-output pairs in $\{x : \|x\| \le 1\} \times [-1, 1]$ for any given norm $\|\cdot\|$. Assume that the expected conditional variance of the output (i.e., the noise level) is strictly positive, denoted by $\sigma^2 := \mathbb{E}[\mathrm{Var}[y|x]] > 0$. Then
$$\Pr\!\left(\exists f \in F: \frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 \le \sigma^2 - \epsilon\right) \le 4\exp\!\left(-\frac{n\epsilon^2}{8^3}\right) + |F|\exp\!\left(-\frac{n\epsilon^2}{2^{10} L^2}\right).$$

Lemma 3.3 shows the connection between the robust interpolation problem and the Lipschitzness of the underlying functions. Notice that the bound on the probability that some $f \in F$ satisfies $\frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 \le \sigma^2 - \epsilon$ shrinks as $L$ decreases, which indicates that we need a large enough $L$ to make sure that there exists an $f$ satisfying the condition of the robust interpolation problem.
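Before moving on, here is a quick numerical sanity check of Lemma 3.2 (ours, not from the paper). The bimodal distribution, the choice $f(x) = x_1$, and the constant $c = 1$ in the isoperimetry-style comparison are our illustrative assumptions; the point is that the dimension-free bound of Lemma 3.2 holds for this non-isoperimetric distribution, while an isoperimetry-style bound with the extra factor of $d$ would not.

```python
# Monte Carlo sanity check of the Lemma 3.2 tail bound (illustrative choices:
# bimodal distribution, f(x) = x_1, and c = 1 are ours, not the paper's).
import numpy as np

rng = np.random.default_rng(1)
d, m, L, diam = 20, 100_000, 1.0, 2.0            # unit l2 ball => diam(X) = 2

# A non-isoperimetric distribution: two tight clusters near +/- 0.9 * e_1.
e1 = np.eye(d)[0]
x = rng.choice([-0.9, 0.9], m)[:, None] * e1 + 0.05 * rng.standard_normal((m, d))
x /= np.maximum(1.0, np.linalg.norm(x, axis=1, keepdims=True))   # keep ||x||_2 <= 1

f = x[:, 0]                                      # f(x) = x_1 is 1-Lipschitz w.r.t. l2
dev = np.abs(f - f.mean())

t = 0.85
print("empirical tail      :", np.mean(dev >= t))                        # ~0.8
print("Lemma 3.2 bound     :", 2 * np.exp(-t**2 / (2 * diam**2 * L**2))) # ~1.8 (holds, loose)
print("isoperimetry-style  :", 2 * np.exp(-d * t**2 / (2 * L**2)))       # ~1.5e-3 (violated here)
```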
With this intuition, we can derive the following Lipschitzness lower bound for the robust interpolation problem without the isoperimetry assumption.

Theorem 3.4. Let $F$ be any class of functions from $\mathbb{R}^d \to [-1, 1]$ and let $\{(x_i, y_i)\}_{i=1}^n$ be i.i.d. input-output pairs in $\{x : \|x\| \le 1\} \times [-1, 1]$ for any given norm $\|\cdot\|$. Assume that:

1. The expected conditional variance of the output (i.e., the noise level) is strictly positive, denoted by $\sigma^2 := \mathbb{E}[\mathrm{Var}[y|x]] > 0$.

2. $J$-Lipschitz parametrization: $F = \{f_w, w \in \mathcal{W}\}$ with $\mathcal{W} \subseteq \mathbb{R}^p$, $\mathrm{diam}(\mathcal{W}) \le W$, and for any $w_1, w_2 \in \mathcal{W}$, $\|f_{w_1} - f_{w_2}\|_F \le J\|w_1 - w_2\|$.

Then, with probability at least $1 - \delta$, one has simultaneously for all $f \in F$:
$$\frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 \le \sigma^2 - \epsilon \;\Longrightarrow\; \mathrm{Lip}_{\|\cdot\|}(f) \ge \frac{\epsilon}{32}\sqrt{\frac{n}{p\ln(36WJ\epsilon^{-1}) + \ln(2/\delta)}}.$$

The crucial part of the proof is to find a finite $\epsilon/(6J)$-covering of $F$ using the $J$-Lipschitz parametrization assumption. We can then apply Lemma 3.3 to this finite covering set and obtain the Lipschitzness lower bound. We will show in the next section that, without the $J$-Lipschitz parametrization assumption, one can hardly use a similar proof technique to derive the Lipschitzness lower bound. Bubeck & Sellke (2023) showed that under neural network settings, $J$ is always of polynomial order in the diameter of the weight space. Thus, if the weights are only polynomially large w.r.t. $d$ and $n$, the term $\ln(36WJ\epsilon^{-1})$ does not affect the Lipschitzness bound too much and we may neglect it in the asymptotic approximation. We therefore have a Lipschitzness lower bound of order $\Omega(\epsilon\sqrt{n/p})$ for the robust interpolation problem. Our theorem validates the first part of our law of robustness, i.e., the potential existence of robust interpolating functions in the overparametrization scenario when $n = \mathrm{poly}(d)$ (see Remark 3.10).

Tightness of our bound. When $n = \mathrm{poly}(d)$, Theorem 4 of Bubeck et al. (2021) has already demonstrated the existence of an at most $O(\sqrt{n/p})$-Lipschitz two-layer network which fits generic data below the noise level. Thus, our Lipschitzness lower bound is tight.

Remark 3.5 (Difference in $d$-dependency between Theorem 2.1 and Theorem 3.4). Compared to the $\Omega(\epsilon\sqrt{nd/p})$ Lipschitzness lower bound in Theorem 2.1, our bound does not depend on the dimension $d$. This difference, as we discussed after Lemma 3.2, is due to the isoperimetry assumption. Bubeck et al. (2021) also showed that the tight Lipschitzness lower bound for two-layer networks is of order $\Omega(\epsilon\sqrt{n/p})$, which is consistent with our results.

3.2. A Lipschitz lower bound beyond the J-Lipschitz parametrization assumption

In this part, we show the second part of our two-fold law of robustness. We demonstrate an intriguing observation that huge data hurts robust interpolation. Our analysis leads to a universal lower bound on Lipschitzness for the robust interpolation problem, which goes beyond the isoperimetry and $J$-Lipschitz parametrization assumptions. Our analysis is based on the relation between the Rademacher complexity and the generalization gap between the population error $L_D(f)$ and the training error $L_S(f)$.

Motivation. The $J$-Lipschitz parametrization assumption provides a simple way to find a covering of the function space $F$. Although the Lipschitzness lower bound in Bubeck & Sellke (2023) has only logarithmic dependency on $J$, it may still affect the Lipschitzness lower bound when the weights of the neural network are exponentially large w.r.t. $d$, or when the number of layers is polynomial w.r.t. $d$. Thus, we seek to derive a Lipschitzness lower bound beyond the $J$-Lipschitz parametrization assumption.

Challenge.
Without the $J$-Lipschitz parametrization assumption, the covering number of the function class $F$ has a more complicated dependency on the Lipschitz constant $L$ (see Lemma 3.6). In this case, calculating the Lipschitzness lower bound from Lemma 3.6 and Lemma 3.3 requires solving an inequality of the form $L^{d}\ln L + L^{2} \ge C$, which has no closed-form solution when $d \ge 3$. Thus, we need other techniques to deal with this case.

Recall that the objective of the robust interpolation problem is $\frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 \le \sigma^2 - \epsilon$; the left-hand side is exactly the training error under the mean squared loss, $L_S(f)$. Under the label-noise setting, we have $L_D(f) = \mathbb{E}_D[(f(x) - y)^2] \ge \mathbb{E}_x[\mathrm{Var}(y|x)] = \sigma^2$, which yields
$$\frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 \le \sigma^2 - \epsilon \;\Longrightarrow\; L_S(f) \le L_D(f) - \epsilon.$$
Therefore, if one can derive $L_S(f) \le L_D(f) - \epsilon \Rightarrow \mathrm{Lip}_{\|\cdot\|}(f) \ge \Omega(\epsilon n^{1/d})$, a natural corollary is that
$$\frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 \le \sigma^2 - \epsilon \;\Longrightarrow\; L_S(f) \le L_D(f) - \epsilon \;\Longrightarrow\; \mathrm{Lip}_{\|\cdot\|}(f) \ge \Omega(\epsilon n^{1/d}).$$
In this way, we convert the robust interpolation problem into a generalization problem between the empirical error and the population error under the mean squared loss, which can be handled by statistical learning techniques, e.g., VC dimension and Rademacher complexity. We focus on Rademacher complexity in this part.

Rademacher complexity. We start with the definition of Rademacher complexity, which measures the richness of a function class. For a set $A \subseteq \mathbb{R}^n$, the Rademacher complexity is defined as
$$R(A) := \frac{1}{n}\,\mathbb{E}_{\sigma_1,\dots,\sigma_n \sim \{-1,1\}}\left[\sup_{a \in A} \sum_{i=1}^n \sigma_i a_i\right].$$
Given a loss function $l$, a hypothesis class $F$, and a training set $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$, denote by $l \circ F := \{l(f(\cdot), \cdot) : f \in F\}$ and $l \circ F \circ S := \{(l(f(x_1), y_1), \dots, l(f(x_n), y_n)) : f \in F\}$. The Rademacher complexity of the set $l \circ F \circ S$ is given by
$$R(l \circ F \circ S) := \frac{1}{n}\,\mathbb{E}_{\sigma_1,\dots,\sigma_n \sim \{-1,1\}}\left[\sup_{f \in F} \sum_{i=1}^n \sigma_i\, l(f(x_i), y_i)\right].$$
For every function $f \in F$, the generalization gap between $L_D(f)$ and $L_S(f)$ is bounded by the Rademacher complexity of the set $l \circ F \circ S$. More formally, assume that $\forall f \in F$, $\forall x \in X$, $|l(f(x), y)| \le a$. Then with probability at least $1 - \delta$, for all $f \in F$,
$$L_D(f) - L_S(f) \le 2\,\mathbb{E}_{S \sim D^n}[R(l \circ F \circ S)] + a\sqrt{\frac{2\ln(2/\delta)}{n}}. \qquad (1)$$
From Equation (1), we can see that given a lower bound on the generalization gap, $L_D(f) - L_S(f) \ge \epsilon$, one immediately has
$$\mathbb{E}_{S \sim D^n}[R(l \circ F \circ S)] \ge \frac{1}{2}\left(\epsilon - a\sqrt{\frac{2\ln(2/\delta)}{n}}\right).$$
Therefore, if we can relate the Rademacher complexity of $l \circ F$ to the Lipschitzness of the functions in $F$, we are able to derive a constraint on the Lipschitz constant of $F$. The contraction lemma for Rademacher complexity (Lemma 26.9 of Shalev-Shwartz & Ben-David (2014)) states that for a given set $A$ and an $L$-Lipschitz function $h$ on $A$, we have $R(h \circ A) \le L\, R(A)$. Thus, if the loss function $l(f(x), y)$ is $C$-Lipschitz w.r.t. $f(x)$ for arbitrary $y \in [-1, 1]$,
$$R(l \circ F \circ S) \le C\, R(F \circ S). \qquad (2)$$
It has been shown (von Luxburg & Bousquet, 2004) that the Rademacher complexity of a set is directly related to its covering numbers. So the first step in calculating the Rademacher complexity of $F \circ S$ is to find the covering number of this function space. Given a space $(X, \|\cdot\|)$ and a covering radius $\eta$, let $N(X, \eta, \|\cdot\|)$, a.k.a. the $\eta$-covering number, be the minimum number of $\eta$-balls that cover $X$. For a given function space $F$, define $\|f - f'\|_F := \sup_{x \in X} |f(x) - f'(x)|$. We have the following upper bound on the covering number of $F$:

Lemma 3.6 (Covering number of the $L$-Lipschitz function space). For a bounded and connected space $(X, \|\cdot\|)$, let $B_L$ be the set of functions $f$ such that $\mathrm{Lip}_{\|\cdot\|}(f) \le L$. If $X$ is connected and centered, we have for every $\epsilon > 0$,
$$N(B_L, \epsilon, \|\cdot\|_F) \le \left\lceil \frac{2L\,\mathrm{diam}(X)}{\epsilon} \right\rceil \cdot 2^{\,N\left(X,\, \frac{\epsilon}{2L},\, \|\cdot\|\right)}.$$
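To make the quantity $R(B_L \circ S)$ concrete, the sketch below estimates it numerically for the class of $L$-Lipschitz functions with values in $[-1, 1]$: for each draw of Rademacher signs, the supremum over $f \in B_L$ depends only on the values $t_i = f(x_i)$, which are constrained by $|t_i - t_j| \le L\|x_i - x_j\|$ and $|t_i| \le 1$ (any such feasible vector extends to an $L$-Lipschitz function on the whole space), so it can be computed with a linear program. This is our own illustration using scipy, not a computation from the paper.

```python
# Estimate R(B_L o S) for L-Lipschitz functions into [-1, 1] via one LP per sign draw.
# Illustrative sketch (ours, not the paper's); requires numpy and scipy.
import numpy as np
from scipy.optimize import linprog

def rademacher_lipschitz(X, L, n_draws=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    rows, rhs = [], []
    for i in range(n):                         # constraints t_i - t_j <= L * ||x_i - x_j||
        for j in range(n):
            if i != j:
                r = np.zeros(n); r[i], r[j] = 1.0, -1.0
                rows.append(r); rhs.append(L * D[i, j])
    A_ub, b_ub = np.array(rows), np.array(rhs)
    vals = []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], n)
        res = linprog(-sigma / n, A_ub=A_ub, b_ub=b_ub,
                      bounds=[(-1, 1)] * n, method="highs")
        vals.append(-res.fun)                  # sup over f of (1/n) sum_i sigma_i f(x_i)
    return float(np.mean(vals))

rng = np.random.default_rng(1)
d, n, L = 3, 40, 4.0
X = rng.standard_normal((n, d))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))   # points in the unit ball
print(rademacher_lipschitz(X, L))   # grows with L and decays only slowly with n
```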
Dudley's integral provides the relation between the covering number of a function class and its Rademacher complexity. Using Dudley's integral, von Luxburg & Bousquet (2004) showed that for every $\epsilon > 0$,
$$\mathbb{E}_{S\sim D^n}[R(B_L \circ S)] \le 2\epsilon + \frac{4\sqrt{2}}{\sqrt{n}}\int_{\epsilon/4}^{\infty} \sqrt{\ln N(B_L, u, \|\cdot\|_F)}\, du. \qquad (3)$$
Notice that when $u > 2L\,\mathrm{diam}(X)$, the $u$-covering number is 1 and $\ln N(B_L, u, \|\cdot\|_F) = 0$. Combining this with Lemma 3.6 yields the following lemma:

Lemma 3.7. Let $(X, \|\cdot\|)$ be a bounded and connected space and $B_L$ be the set of all functions $f \in F$ with $\mathrm{Lip}_{\|\cdot\|}(f) \le L$. Let $n = |S|$. If $X$ is connected and centered, then for any $\epsilon > 0$,
$$\mathbb{E}_{S\sim D^n}[R(B_L \circ S)] \le 2\epsilon + \frac{4\sqrt{2}}{\sqrt{n}}\int_{\epsilon/4}^{2L\,\mathrm{diam}(X)} \sqrt{N\!\left(X, \frac{u}{2L}, \|\cdot\|\right)\ln 2 + \ln\!\left\lceil\frac{2L\,\mathrm{diam}(X)}{u}\right\rceil}\, du.$$
As all the quantities in Lemma 3.7 are known, by evaluating the integral one can derive an upper bound on the Rademacher complexity $\mathbb{E}_{S\sim D^n}[R(B_L \circ S)]$:

Lemma 3.8. If $\mathrm{diam}(X) = 2$ w.r.t. $\|\cdot\|$ and $d \ge 3$, we have
$$\mathbb{E}_{S\sim D^n}[R(B_L \circ S)] \le \left(96 + \frac{96\sqrt{2\ln 2}}{d-2}\right)\frac{L}{n^{1/d}} + \frac{16\sqrt{2}\,L}{\sqrt{n}}\sqrt{\ln\!\left(\frac{n^{1/d}}{3}+1\right)}.$$
According to Equation 1 in Mendelson & Vershynin (2003), when $u \le 2L\,\mathrm{diam}(X)$, $N(X, \frac{u}{2L}, \|\cdot\|) \le (\frac{6L\,\mathrm{diam}(X)}{u})^d$ if $X \subseteq \mathbb{R}^d$. The integrand is then at most $\sqrt{(\frac{6L\,\mathrm{diam}(X)}{u})^d \ln 2 + \ln\lceil\frac{2L\,\mathrm{diam}(X)}{u}\rceil}$, which is no more than $\sqrt{(\frac{6L\,\mathrm{diam}(X)}{u})^d \ln 2} + \sqrt{\ln(\frac{2L\,\mathrm{diam}(X)}{u} + 1)}$. Taking $\epsilon = \Theta(\frac{L}{n^{1/d}})$, the integral is bounded by $\Theta(L n^{1/2-1/d})$. Thus
$$\mathbb{E}_{S\sim D^n}[R(B_L \circ S)] \le \Theta\!\left(\frac{L}{n^{1/d}}\right) + \frac{4\sqrt{2}}{\sqrt{n}}\,\Theta(L n^{1/2-1/d}) = \Theta\!\left(\frac{L}{n^{1/d}}\right).$$
In our setting, we are interested in the squared $\ell_2$ loss $l(f(x), y) = (f(x) - y)^2$. We have $|\frac{\partial}{\partial f(x)} l(f(x), y)| = 2|f(x) - y| \le 2(|f(x)| + |y|) \le 4$, i.e., $l(f(x), y)$ is 4-Lipschitz w.r.t. $f(x)$ for arbitrary $y \in [-1, 1]$. Thus,
$$\mathbb{E}_{S\sim D^n}[R(l \circ B_L \circ S)] \le 4\,\mathbb{E}_{S\sim D^n}[R(B_L \circ S)] = O\!\left(\frac{L}{n^{1/d}}\right).$$
Combining this result with Equation (1) yields the main theorem of our paper:

Theorem 3.9 (Lipschitzness lower bound beyond the $J$-Lipschitz parametrization assumption). Let $F$ be any class of functions from $\mathbb{R}^d \to [-1, 1]$ and let $\{(x_i, y_i)\}_{i=1}^n$ be i.i.d. input-output pairs in $\{x : \|x\| \le 1\} \times [-1, 1]$ for any given norm $\|\cdot\|$. Assume that:

1. The expected conditional variance of the output (i.e., the noise level) is strictly positive, denoted by $\sigma^2 := \mathbb{E}[\mathrm{Var}[y|x]] > 0$.

Then with probability at least $1 - \delta$, for all $f \in F$:
$$\frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 \le \sigma^2 - \epsilon \;\Longrightarrow\; \mathrm{Lip}_{\|\cdot\|}(f) \ge \frac{\left(\epsilon - 4\sqrt{2\ln(2/\delta)/n}\right) n^{1/d}}{8K},$$
where $K = 96 + \frac{96\sqrt{2\ln 2}}{d-2} + \frac{16\sqrt{2}}{n^{1/2-1/d}}\sqrt{\ln\!\left(\frac{n^{1/d}}{3}+1\right)}$.

Theorem 3.9 states that, for every data distribution $D$ with label noise of variance $\sigma^2$ and every function $f : X \to [-1, 1]$, overfitting, i.e., $\frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 \le \sigma^2 - \epsilon$, implies $\mathrm{Lip}_{\|\cdot\|}(f) \ge \Omega(\epsilon n^{1/d})$, which validates the second part of our law of robustness, i.e., achieving good robust interpolation is impossible when $n = \exp(\omega(d))$.

Remark 3.10. Theorem 3.9 disproves the existence of robust interpolating functions when $n = \exp(\omega(d))$. Thus, the first part of our law of robustness holds only when $n = \mathrm{poly}(d)$.

Tightness of our bound. Intuitively, the Lipschitzness of the interpolating function is inversely proportional to the distance between the closest pair of training data. Given $n$ training data points in a $d$-dimensional bounded space, one can scatter the data evenly in the space, so that the distance between any training pair is as large as $\Theta(1/n^{1/d})$. Inspired by this, we complement Theorem 3.9 with a matching Lipschitzness upper bound of $O(n^{1/d})$, which shows that the Lipschitzness lower bound in Theorem 3.9 is achievable by a certain function and training set:

Theorem 3.11 (Tightness of our bound). For any distribution $D$ supported on $\{x \in \mathbb{R}^d : \|x\| \le 1\}$, there exist $n$ training samples $\{x_1, \dots, x_n\}$ such that $\forall i, j$, $i \ne j$, $\|x_i - x_j\| \ge \frac{1}{n^{1/d}}$. Denote by $\{y_1, \dots, y_n\}$ the observed targets.
We design a function $f^*$ which first perfectly fits the training samples, i.e., $f^*(x_i) = y_i$ for all $i \in [n]$, and then uses linear interpolation between neighbouring training points as the prediction for other samples. This function is at most $2n^{1/d}$-Lipschitz. Theorem 3.11 shows that there exist $n$ samples such that a function which perfectly fits the training samples is $O(n^{1/d})$-Lipschitz.

3.2.1. Our (counter-intuitive) implications

It was widely believed that 1) big data (Schmidt et al., 2018), 2) low dimensionality of input (Blum et al., 2020), and 3) overparametrization (Bubeck & Sellke, 2023; Bubeck et al., 2021; Gao et al., 2019) improve robustness. Our main result, Theorem 3.9, challenges these common beliefs and shows that these hypotheses may not hold in the robust interpolation problem. Our results shed light on the theoretical understanding of robustness beyond the isoperimetry assumption.

The curse of big data. Our Lipschitzness lower bound in Theorem 3.9 is increasing w.r.t. the sample size $n$. The intuition is that, as one has more training data, those data are squeezed into the bounded space with smaller margins. Thus, to fit the data well, the Lipschitz constant of the interpolating functions cannot be small. Perhaps surprisingly, our results contradict the common belief that more data always improve model robustness.

The blessing of dimensionality. It is known that high dimensionality of the input space strengthens the power of the adversary. For example, in the $\ell_\infty$ threat model, an adversary can change every pixel of a given image by 8 or 16 intensity levels; higher dimensionality means that the adversary can modify more pixels. However, we show that our Lipschitzness lower bound in Theorem 3.9 is decreasing w.r.t. $d$. The intuition is that an input space of higher dimension has more room to scatter the data. So the data can be well separated, and thus the Lipschitz constant of the interpolating functions can be small.

4. Small Data May Hurt Performance and Robustness

In Section 3, we mainly focused on the robust interpolation problem on the training samples. The lower bound given by Theorem 3.9 implies that one can use at most $\exp(O(d))$ training samples if one wants to obtain an $O(1)$-Lipschitz function in the robust interpolation problem. In this section, we show that $n = \exp(\Omega(d))$ is also a necessary condition for obtaining a good population error with any $O(1)$-Lipschitz learning algorithm.

We now provide a complementary result to Section 3.2. We first prove that for learning algorithms on binary classification tasks, if the number of training samples is less than half of the number of all samples, there exists a distribution with label noise such that the average error of every learning algorithm is greater than a constant. As a distribution for binary classification is naturally a distribution for a regression task, we can find such a distribution for regression tasks similarly.

Lemma 4.1. Let $A(S) : X \to \mathbb{R}$ be any learning algorithm with respect to the squared $\ell_2$ loss over a domain $X$ and samples $S$. Assume there is label noise with $\mathbb{E}[\mathrm{Var}[y|x]] = \sigma^2$. Let $m$ be any number smaller than $|X|/2$, representing the size of a training set. Then, for any $a > 0$ there exists a distribution $D$ (with label noise) over $X \times \{-a, a\}$ such that
$$\mathbb{E}_{S\sim D^m}[L_D(A(S))] \ge \frac{1}{2}(a^2 + \sigma^2).$$
In the next lemma, we show a no-free-lunch theorem for regression tasks and algorithms that output an $L$-Lipschitz function; the packing and minimum-separation idea it relies on, which also underlies Theorem 3.11, is illustrated in the sketch below.
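The toy example below is our own stand-in for the constructions above, not the paper's exact argument: $n = k^d$ grid points inside the $\ell_\infty$ unit ball have pairwise separation on the order of $1/n^{1/d}$ (the packing side used in Lemma 4.2), and any function that exactly interpolates $\pm 1$ noisy labels on such points must have a Lipschitz constant of order $n^{1/d}$ (the matching-rate intuition behind Theorem 3.11).

```python
# Packing / minimum-separation illustration for Theorem 3.11 and Lemma 4.2
# (our toy grid construction, not the paper's exact one).
import itertools
import numpy as np

rng = np.random.default_rng(0)
d, k = 3, 6
grid_1d = np.linspace(-1.0, 1.0, k)
X = np.array(list(itertools.product(grid_1d, repeat=d)))   # n = k^d points, ||x||_inf <= 1
n = len(X)
y = rng.choice([-1.0, 1.0], size=n)                         # pure label noise, Var = 1

D = np.max(np.abs(X[:, None, :] - X[None, :, :]), axis=-1)  # pairwise l_inf distances
off_diag = ~np.eye(n, dtype=bool)
print("n =", n, " min pairwise distance =", D[off_diag].min(),
      " vs 2 / n^{1/d} =", 2 / n ** (1 / d))

# Any f with f(x_i) = y_i satisfies Lip(f) >= max_{i != j} |y_i - y_j| / ||x_i - x_j||.
lip_lb = np.max(np.abs(y[:, None] - y[None, :])[off_diag] / D[off_diag])
print("Lipschitz lower bound for any exact interpolant:", lip_lb, " = Theta(n^{1/d})")
```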
The intuition is to consider the minimum distance between two points in the support of the distribution $D$. On the one hand, if the minimum distance is less than $\epsilon$, we can assign the two samples that achieve the minimum distance the labels $1$ and $-1$, respectively. As the algorithm $A$ is $L$-Lipschitz, the maximum difference between the predicted labels of the two selected points is $L\epsilon$. Thus, the error of $A$ will be at least $1 - L\epsilon$. On the other hand, if the minimum distance is larger than $\epsilon$, the maximum number of points in the support of $D$ is at most the $\epsilon$-packing number of the input space $X$. By Lemma 4.1, there exists a distribution such that if the number of training samples is less than half of the $\epsilon$-packing number of the input space, the average error of every learning algorithm is at least a constant. More formally, we have the following lemma:

Lemma 4.2 (No-free-lunch theorem for $L$-Lipschitz algorithms). Let $A(S) : X \to [-1, 1]$ be any algorithm that returns an $L$-Lipschitz function (w.r.t. the norm $\|\cdot\|$) for regression w.r.t. the squared $\ell_2$ loss over a domain $(X, \|\cdot\|)$ and samples $S$. Let $n$ be the size of the training set, i.e., $n = |S|$. Assume that the label noise has variance $\sigma^2 := \mathbb{E}_D[\mathrm{Var}(y|x)] \le 1/2$. Then, there exists a distribution $D$ over $X \times [-1, 1]$ with noisy labels such that for every $L$-Lipschitz (w.r.t. norm $\|\cdot\|$) learning algorithm and any $\epsilon \in [0, \frac{1}{2L}]$:
$$n < M(X, \epsilon, \|\cdot\|)/2 \;\Longrightarrow\; \mathbb{E}_{S\sim D^n}[L_D(A(S))] \ge \min\left\{1 - L\epsilon,\ \frac{1+\sigma^2}{2}\right\},$$
where $M(X, \epsilon, \|\cdot\|)$ is the $\epsilon$-packing number of $(X, \|\cdot\|)$.

Now we are ready to prove our main theorem.

Theorem 4.3. Let $S = \{(x_i, y_i)\}_{i=1}^n$ be i.i.d. training pairs in $\{x : \|x\| \le 1\} \times [-1, 1]$ for any given norm $\|\cdot\|$. Denote by $L_D(f) := \mathbb{E}_D[(f(x) - y)^2]$ the population error under the squared $\ell_2$ loss. Assume that the expected conditional variance of the output (i.e., the noise level) is strictly positive and bounded by $1/2$, denoted by $\sigma^2 := \mathbb{E}[\mathrm{Var}[y|x]]$. Let $A(S) : X \to \mathbb{R}$ be any $L$-Lipschitz learning algorithm over a training set $S$. Then there exists a distribution $D$ of $(x, y)$ such that for any $\epsilon \in (0, 1/2)$,
$$n < \frac{1}{2}\left(\frac{2L}{1-2\epsilon}\right)^d \;\Longrightarrow\; \mathbb{E}_S[L_D(A(S))] \ge \min\left\{\frac{1}{2}+\epsilon,\ \frac{1+\sigma^2}{2}\right\}.$$

Proof. Consider $X = \{x \in \mathbb{R}^d : \|x\| \le 1\}$. We have $M(X, \eta, \|\cdot\|) \ge (1/\eta)^d$. Thus by Lemma 4.2, there exists a distribution $D$ such that if $\sigma^2 \le 0.5$,
$$n < M(X, \eta, \|\cdot\|)/2 \;\Longrightarrow\; \mathbb{E}_{S\sim D^n}[L_D(A(S))] \ge \min\left\{1 - L\eta,\ \frac{1+\sigma^2}{2}\right\}.$$
Taking $\eta = \frac{1/2 - \epsilon}{L}$, where $\epsilon \in (0, 1/2)$, we have that $n < \frac{1}{2}\left(\frac{2L}{1-2\epsilon}\right)^d$ implies $\mathbb{E}_{S\sim D^n}[L_D(A(S))] \ge \min\{\frac{1}{2}+\epsilon, \frac{1+\sigma^2}{2}\}$. Thus, in the worst case, $n$ has to be at least $\exp(\Omega(d))$ if one wants to achieve good astuteness with any learning algorithm that returns an $O(1)$-Lipschitz function. This completes the proof of Theorem 4.3.

Theorem 4.3 states that for certain distributions, $n$ has to be at least $\exp(\Omega(d))$ if one wants to achieve a good population error with any $O(1)$-Lipschitz learning algorithm. This is not restricted to algorithms that perfectly fit the training data. The sample complexity lower bound matches the upper bound given in Theorem 3.9.

5. Conclusions

In this work, we study the robust interpolation problem beyond the isoperimetry assumption and propose a two-fold law of robustness. We show the potential benefit of overparametrization for smooth data interpolation when $n = \mathrm{poly}(d)$, and disprove the potential existence of an $O(1)$-Lipschitz robust interpolating function when $n = \exp(\omega(d))$. Besides, we also prove that small data ($n = \exp(O(d))$) may hurt performance and robustness on certain distributions. Perhaps surprisingly, the results shed light on the curse of big data and the blessing of dimensionality regarding robustness.

Acknowledgement

Hongyang Zhang is supported by NSERC Discovery Grant RGPIN-2022-03215, DGECR-2022-00357.
Yihan Wu and Heng Huang were partially supported by NSF IIS 1838627, 1837956, 1956002, 2211492, CNS 2213701, CCF 2217003, DBI 2225775.

References

Azuma, K. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, Second Series, 19(3):357–367, 1967.

Ben-Tal, A., El Ghaoui, L., and Nemirovski, A. Robust optimization. Princeton University Press, 2009.

Bhagoji, A. N., Cullina, D., and Mittal, P. Lower bounds on adversarial robustness from optimal transport. In Advances in Neural Information Processing Systems, 2019.

Bhattacharjee, R., Jha, S., and Chaudhuri, K. Sample complexity of robust linear classification on separated data. In International Conference on Machine Learning, pp. 884–893, 2021.

Blum, A., Dick, T., Manoj, N., and Zhang, H. Random smoothing might be unable to certify ℓ∞ robustness for high-dimensional images. Journal of Machine Learning Research, 21:1–21, 2020.

Bubeck, S. and Sellke, M. A universal law of robustness via isoperimetry. Journal of the ACM, 70(2):1–18, 2023.

Bubeck, S., Li, Y., and Nagaraj, D. M. A law of robustness for two-layers neural networks. In Annual Conference on Learning Theory, volume 134, pp. 804–820, 2021.

Case, B. M., Gallagher, C., and Gao, S. A note on sub-Gaussian random variables. Cryptology ePrint Archive, 2019.

Cohen, J. M., Rosenfeld, E., and Kolter, J. Z. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, 2019.

Cullina, D., Bhagoji, A. N., and Mittal, P. PAC-learning in the presence of evasion adversaries. In Advances in Neural Information Processing Systems, pp. 230–241, 2018.

Dan, C., Wei, Y., and Ravikumar, P. Sharp statistical guarantees for adversarially robust Gaussian classification. In International Conference on Machine Learning, pp. 2345–2355, 2020.

Dobriban, E., Hassani, H., Hong, D., and Robey, A. Provable tradeoffs in adversarially robust classification. arXiv preprint arXiv:2006.05161, 2020.

Gao, R., Cai, T., Li, H., Hsieh, C.-J., Wang, L., and Lee, J. D. Convergence of adversarial training in overparametrized neural networks. Advances in Neural Information Processing Systems, 32:13029–13040, 2019.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2014.

Huber, P. J. Robust statistics, volume 523. John Wiley & Sons, 2004.

Kumar, A., Levine, A., Goldstein, T., and Feizi, S. Curse of dimensionality on randomized smoothing for certifiable robustness. In International Conference on Machine Learning, pp. 5458–5467, 2020.

Li, B., Chen, C., Wang, W., and Carin, L. Certified adversarial robustness with additive noise. In Advances in Neural Information Processing Systems, pp. 9464–9474, 2019.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2017.

Mendelson, S. and Vershynin, R. Entropy and the combinatorial dimension. Inventiones Mathematicae, 152(1):37–55, 2003.

Montasser, O., Hanneke, S., and Srebro, N. VC classes are adversarially robustly learnable, but only improperly. In Annual Conference on Learning Theory, pp. 2512–2530, 2019.

Northcutt, C. G., Athalye, A., and Mueller, J. Pervasive label errors in test sets destabilize machine learning benchmarks. In NeurIPS 2021 Datasets and Benchmarks Track, 2021.

Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data.
In Advances in Neural Information Processing Systems, 2018.

Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.

Sun, J., Yang, Y., Xun, G., and Zhang, A. A stagewise hyperparameter scheduler to improve generalization. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 1530–1540, 2021.

Sun, J., Huai, M., Jha, K., and Zhang, A. Demystify hyperparameters for stochastic optimization with transferable representations. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1706–1716, 2022.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.

von Luxburg, U. and Bousquet, O. Distance-based classification with Lipschitz functions. Journal of Machine Learning Research, 5:669–695, 2004.

Wu, X., Huang, F., Hu, Z., and Huang, H. Faster adaptive federated learning. arXiv preprint arXiv:2212.00974, 2022a.

Wu, X., Hu, Z., and Huang, H. Decentralized Riemannian algorithm for nonconvex minimax problems. arXiv preprint arXiv:2302.03825, 2023.

Wu, Y., Bojchevski, A., Kuvshinov, A., and Günnemann, S. Completing the picture: Randomized smoothing suffers from the curse of dimensionality for a large family of distributions. In International Conference on Artificial Intelligence and Statistics, pp. 3763–3771. PMLR, 2021.

Wu, Y., Bojchevski, A., and Huang, H. Adversarial weight perturbation improves generalization in graph neural networks. arXiv preprint arXiv:2212.04983, 2022b.

Wu, Y., Li, X., Kerschbaum, F., Huang, H., and Zhang, H. Towards robust dataset learning. arXiv preprint arXiv:2211.10752, 2022c.

Wu, Y., Zhang, H., and Huang, H. RetrievalGuard: Provably robust 1-nearest neighbor image retrieval. In International Conference on Machine Learning, pp. 24266–24279. PMLR, 2022d.

Yang, G., Duan, T., Hu, J. E., Salman, H., Razenshteyn, I., and Li, J. Randomized smoothing of all shapes and sizes. In International Conference on Machine Learning, pp. 10693–10705, 2020a.

Yang, Y.-Y., Rashtchian, C., Zhang, H., Salakhutdinov, R., and Chaudhuri, K. A closer look at accuracy vs. robustness. In Advances in Neural Information Processing Systems, 2020b.

Yin, D., Kannan, R., and Bartlett, P. Rademacher complexity for adversarially robust generalization. In International Conference on Machine Learning, pp. 7085–7094, 2019.

Zhang, H., Yu, Y., Jiao, J., Xing, E., El Ghaoui, L., and Jordan, M. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, pp. 7472–7482, 2019.

A. Missing proofs

A.1. Proof of Lemma 3.2

Proof. Denote by $X$ the random variable with law $\mu$ on the bounded space $\mathcal{X}$. Consider $Z_1 = f(X)$ and $Z_0 = \mathbb{E}[f(X)]$. Since
$$|Z_1 - Z_0| = |f(X) - \mathbb{E}[f(X)]| = |\mathbb{E}_{X'}[f(X) - f(X')]| \le L \sup_{x, x' \in \mathcal{X}} \|x - x'\| = L\,\mathrm{diam}(\mathcal{X}),$$
where $X'$ has the same distribution as $X$, and because $\mathbb{E}[Z_1] = Z_0$, the pair $\{Z_0, Z_1\}$ is a martingale with bounded differences. Thus, by Azuma's inequality (Lemma 3.1), we have
$$\Pr(|f(x) - \mathbb{E}[f(x)]| \ge t) = \Pr(|Z_1 - Z_0| \ge t) \le 2\exp\!\left(-\frac{t^2}{2\,\mathrm{diam}(\mathcal{X})^2 L^2}\right).$$

A.2. Proof of Lemma 3.3

Proof. We use a similar proof technique as in Bubeck & Sellke (2023). Our proof depends on the following lemma.

Lemma A.1 (Lemma 2.1 of Bubeck & Sellke (2023)). Let $F$ be any class of functions from $\mathbb{R}^d \to [-1, 1]$. Let $\{(x_i, y_i)\}_{i=1}^n$ be i.i.d.
input-output pairs in $\mathbb{R}^d \times [-1, 1]$ for any given norm $\|\cdot\|$. Assume that the expected conditional variance of the output (i.e., the noise level) is strictly positive, denoted by $\sigma^2 := \mathbb{E}[\mathrm{Var}[y|x]] > 0$. Then
$$\Pr\!\left(\exists f \in F: \frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 \le \sigma^2 - \epsilon\right) \le \Pr\!\left(\exists f \in F: \frac{1}{n}\sum_{i=1}^n f(x_i) z_i \ge \frac{\epsilon}{4}\right) + 2\exp\!\left(-\frac{n\epsilon^2}{8^3}\right).$$

We now bound the term $\Pr(\exists f \in F : \frac{1}{n}\sum_{i=1}^n f(x_i)z_i \ge \frac{\epsilon}{4})$. As $x_i$ is sampled from the input distribution and $\mathrm{diam}(X) = 2$, Lemma 3.2 gives
$$\Pr(|f(x_i) - \mathbb{E}[f(x)]| \ge t) \le 2\exp\!\left(-\frac{t^2}{8L^2}\right),$$
which indicates that $f(x_i) - \mathbb{E}[f(x)]$ is $8L^2$-subgaussian. Because $|z_i| = |y_i - g(x_i)| \le 2$, we know $(f(x_i) - \mathbb{E}[f(x)])z_i$ is $32L^2$-subgaussian. By Property 1 in Case et al. (2019), $\frac{1}{n}\sum_{i=1}^n (f(x_i) - \mathbb{E}[f(x)])z_i$ is $32L^2/n$-subgaussian. Since $\mathbb{E}[(f(x_i) - \mathbb{E}[f(x)])z_i] = 0$, a subgaussian tail bound together with a union bound over $F$ yields
$$\Pr\!\left(\exists f \in F: \frac{1}{n}\sum_{i=1}^n (f(x_i) - \mathbb{E}[f(x)])z_i \ge \frac{\epsilon}{8}\right) \le |F|\exp\!\left(-\frac{n\epsilon^2}{2^{10}L^2}\right).$$
Since the range of the functions is in $[-1, 1]$, we have $\mathbb{E}[f(x)] \in [-1, 1]$, and hence, by Hoeffding's inequality,
$$\Pr\!\left(\frac{1}{n}\sum_{i=1}^n \mathbb{E}[f(x)]z_i \ge \frac{\epsilon}{8}\right) \le 2\exp\!\left(-\frac{n\epsilon^2}{8^3}\right).$$
Thus we obtain, with a union bound,
$$\Pr\!\left(\exists f \in F: \frac{1}{n}\sum_{i=1}^n f(x_i)z_i \ge \frac{\epsilon}{4}\right) \le |F|\exp\!\left(-\frac{n\epsilon^2}{2^{10}L^2}\right) + 2\exp\!\left(-\frac{n\epsilon^2}{8^3}\right).$$
Together with Lemma A.1, we have
$$\Pr\!\left(\exists f \in F: \frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 \le \sigma^2 - \epsilon\right) \le 4\exp\!\left(-\frac{n\epsilon^2}{8^3}\right) + |F|\exp\!\left(-\frac{n\epsilon^2}{2^{10}L^2}\right),$$
which proves the lemma.

A.3. Proof of Theorem 3.4

Proof. We use a similar proof technique as in Bubeck & Sellke (2023). We argue that the $\eta$-covering number of the function space $F$ is upper bounded by the $\eta/J$-covering number of the parameter space $\mathcal{W}$. To see this, select the centers $\mathcal{W}_c = \{w_i^c\}$ of an $\eta/J$-covering of $\mathcal{W}$ and cover $F$ with $\eta$-balls centered at $f_{w_i^c}$: for every $f_w \in F$ we can find $w' \in \mathcal{W}_c$ such that $\|w - w'\| \le \eta/J$, and by the definition of the $J$-Lipschitz parametrization, $\|f_w - f_{w'}\|_F \le J\|w - w'\| \le \eta$; thus $F$ can be covered by $N(\mathcal{W}, \eta/J, \|\cdot\|)$ balls. So we have
$$N(F, \eta, \|\cdot\|_F) \le N(\mathcal{W}, \eta/J, \|\cdot\|) \le (6JW/\eta)^p.$$
Take $\eta = \epsilon/6$ and denote by $\mathcal{W}_\epsilon$ the $\epsilon/(6J)$-covering of $\mathcal{W}$. Applying Lemma 3.3 to $F_w = \{f_w : w \in \mathcal{W}_\epsilon\}$, we have
$$\Pr\!\left(\exists f \in F_w: \frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 \le \sigma^2 - \frac{\epsilon}{2} \text{ and } \mathrm{Lip}_{\|\cdot\|}(f) \le L\right) \le 4\exp\!\left(-\frac{n\epsilon^2}{8^3}\right) + \exp\!\left(p\ln(36JW\epsilon^{-1}) - \frac{n\epsilon^2}{2^{10}L^2}\right).$$
For every $f \in F$, we can find $f_w \in F_w$ such that $\|f - f_w\|_F \le \epsilon/6$. One can then derive
$$\frac{1}{n}\sum_{i=1}^n (y_i - f_w(x_i))^2 \le \frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 + \frac{\epsilon}{2} \le \sigma^2 - \frac{\epsilon}{2} \quad \text{whenever} \quad \frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 \le \sigma^2 - \epsilon.$$
Thus, if $n$ is large enough that $\exp(-n\epsilon^2/8^3) \le \delta/8$ and if $L \le \frac{\epsilon}{32}\sqrt{\frac{n}{p\ln(36WJ\epsilon^{-1}) + \ln(2/\delta)}}$, then
$$\Pr\!\left(\exists f \in F: \frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 \le \sigma^2 - \epsilon \text{ and } \mathrm{Lip}_{\|\cdot\|}(f) \le L\right) \le \delta,$$
which yields that, with probability at least $1 - \delta$,
$$\frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 \le \sigma^2 - \epsilon \;\Longrightarrow\; \mathrm{Lip}_{\|\cdot\|}(f) \ge \frac{\epsilon}{32}\sqrt{\frac{n}{p\ln(36WJ\epsilon^{-1}) + \ln(2/\delta)}}.$$

A.4. Proof of Lemma 3.6

Proof. We consider the Lipschitz function class $B_L := \{f : \mathrm{Lip}_{\|\cdot\|}(f) \le L\}$. In order to bound the covering number of $B_L$, we consider an $\frac{\epsilon}{2L}$-covering of the input space $X$ consisting of $N = N(X, \frac{\epsilon}{2L}, \|\cdot\|)$ plates $U_1, U_2, \dots, U_N$ centered at $s_1, s_2, \dots, s_N$. The fact that $X$ is connected enables one to join any two sets $U_i$ and $U_j$ by a chain of intersecting $U_k$. For any function $f \in B_L$, we construct an approximating function $\tilde{f}$ by taking its value on $U_1$ to be an $\epsilon/2$-approximation of $f(s_1)$. As the range of $f$ has diameter at most $L\,\mathrm{diam}(X)$, there are at most $\lceil 2L\,\mathrm{diam}(X)/\epsilon \rceil$ such approximations. On the other hand, note that the $N$ plates are chained. By Lipschitzness, the values of $f$ at $s_1$ and $s_2$ differ by at most $\epsilon/2$, so $f(s_2)$ differs by at most $\epsilon$ from $\tilde{f}(s_1)$ by the triangle inequality. It implies that to construct an $\epsilon$-approximation of $f(s_2)$ on $U_2$, we only need to choose either $\tilde{f}(s_1) - \epsilon/2$ or $\tilde{f}(s_1) + \epsilon/2$. Repeating the same argument $N$ times, we can bound the $\epsilon$-covering number of $B_L$ by $\lceil 2L\,\mathrm{diam}(X)/\epsilon \rceil \cdot 2^N$.

A.5. Proof of Lemma 3.7

Proof. The proof of this lemma is straightforward. Notice that when $u > 2L\,\mathrm{diam}(X)$, the $u$-covering number of $B_L$ is 1 and $\ln N(B_L, u, \|\cdot\|_F) = 0$. Combining Equation (3) with Lemma 3.6 yields the lemma.

A.6. Proof of Lemma 3.8

Proof.
As $u \le 2L\,\mathrm{diam}(X) = 4L$, we have $N(X, \frac{u}{2L}, \|\cdot\|) \le (\frac{12L}{u})^d$. Therefore
$$\mathbb{E}_{S\sim D^n}[R(B_L \circ S)] \le 2\epsilon + \frac{4\sqrt{2}}{\sqrt{n}}\int_{\epsilon/4}^{4L} \sqrt{\left(\frac{12L}{u}\right)^d \ln 2 + \ln\!\left\lceil\frac{4L}{u}\right\rceil}\, du \le 2\epsilon + \frac{4\sqrt{2}}{\sqrt{n}}\left[\sqrt{\ln 2}\int_{\epsilon/4}^{4L}\left(\frac{12L}{u}\right)^{d/2} du + 4L\sqrt{\ln\!\left(\frac{16L}{\epsilon}+1\right)}\right].$$
Switching the integration variable from $u$ to $v = u/(12L)$, we have
$$\sqrt{\ln 2}\int_{\epsilon/4}^{4L}\left(\frac{12L}{u}\right)^{d/2} du = 12L\sqrt{\ln 2}\int_{\epsilon/(48L)}^{1/3} v^{-d/2}\, dv = 12L\sqrt{\ln 2}\left[\frac{v^{-d/2+1}}{-d/2+1}\right]_{\epsilon/(48L)}^{1/3} \le \frac{24L\sqrt{\ln 2}}{d-2}\left(\frac{48L}{\epsilon}\right)^{d/2-1}.$$
Based on the calculation above, we have
$$\mathbb{E}_{S\sim D^n}[R(B_L \circ S)] \le 2\epsilon + \frac{96\sqrt{2\ln 2}\,L}{\sqrt{n}(d-2)}\left(\frac{48L}{\epsilon}\right)^{d/2-1} + \frac{16\sqrt{2}\,L}{\sqrt{n}}\sqrt{\ln\!\left(\frac{16L}{\epsilon}+1\right)}.$$
As this inequality holds for arbitrary $\epsilon > 0$, we can take $\epsilon = 48L/n^{1/d}$ and obtain
$$\mathbb{E}_{S\sim D^n}[R(B_L \circ S)] \le \frac{L}{n^{1/d}}\left(96 + \frac{96\sqrt{2\ln 2}}{d-2} + \frac{16\sqrt{2}}{n^{1/2-1/d}}\sqrt{\ln\!\left(\frac{n^{1/d}}{3}+1\right)}\right) = O\!\left(\frac{L}{n^{1/d}}\right).$$

A.7. Proof of Theorem 3.9

Proof. According to Equation (1),
$$L_D(f) - L_S(f) \le 2\,\mathbb{E}_{S\sim D^n}[R(l \circ F \circ S)] + a\sqrt{\frac{2\ln(2/\delta)}{n}},$$
where $a := \max_{(x,y)} l(f(x), y) \le 4$. According to Equation (2) and $|\frac{\partial}{\partial f(x)} l(f(x), y)| \le 4$, we have $\mathbb{E}_{S\sim D^n}[R(l \circ F \circ S)] \le 4\,\mathbb{E}_{S\sim D^n}[R(F \circ S)]$. Thus,
$$\mathbb{E}_{S\sim D^n}[R(F \circ S)] \ge \frac{1}{8}\left(L_D(f) - L_S(f) - 4\sqrt{\frac{2\ln(2/\delta)}{n}}\right).$$
Under the label-noise setting, we have
$$L_D(f) = \mathbb{E}_D[(f(x) - y)^2] = \mathbb{E}_{x,y}[(f(x) - \mathbb{E}_y[y|x])^2 + (y - \mathbb{E}_y[y|x])^2] \ge \mathbb{E}_x[\mathrm{Var}(y|x)] = \sigma^2.$$
So, with the overfitting assumption $L_S(f) \le \sigma^2 - \epsilon$, we have
$$\mathbb{E}_{S\sim D^n}[R(F \circ S)] \ge \frac{1}{8}\left(L_D(f) - L_S(f) - 4\sqrt{\frac{2\ln(2/\delta)}{n}}\right) \ge \frac{1}{8}\left(\epsilon - 4\sqrt{\frac{2\ln(2/\delta)}{n}}\right).$$
Consider $B_L = \{f \in F : \mathrm{Lip}_{\|\cdot\|}(f) \le L\}$. According to Lemma 3.8, we have
$$\frac{KL}{n^{1/d}} \ge \mathbb{E}_{S\sim D^n}[R(B_L \circ S)],$$
where $K = 96 + \frac{96\sqrt{2\ln 2}}{d-2} + \frac{16\sqrt{2}}{n^{1/2-1/d}}\sqrt{\ln(\frac{n^{1/d}}{3}+1)} = \Theta(1)$. Thus, if there were $f_0 \in F$ such that $L_S(f_0) \le \sigma^2 - \epsilon$ and $\mathrm{Lip}_{\|\cdot\|}(f_0) < \frac{(\epsilon - 4\sqrt{2\ln(2/\delta)/n})\,n^{1/d}}{8K}$, we would have
$$\frac{K\,\mathrm{Lip}_{\|\cdot\|}(f_0)}{n^{1/d}} \ge \mathbb{E}_{S\sim D^n}[R(B_{\mathrm{Lip}(f_0)} \circ S)] \ge \frac{1}{8}\left(\epsilon - 4\sqrt{\frac{2\ln(2/\delta)}{n}}\right),$$
which yields a contradiction. Therefore, for all $f \in F$,
$$L_S(f) \le \sigma^2 - \epsilon \;\Longrightarrow\; \mathrm{Lip}_{\|\cdot\|}(f) \ge \frac{(\epsilon - 4\sqrt{2\ln(2/\delta)/n})\,n^{1/d}}{8K}.$$
Taking $X = \{x \in \mathbb{R}^d : \|x\| \le 1\}$, we have $\mathrm{diam}(X) = 2$, which yields Theorem 3.9.

A.8. Proof of Theorem 3.11

Proof. First, we show that we can find $n$ training samples $\{x_1, \dots, x_n\}$ such that $\forall i \ne j$, $\|x_i - x_j\| \ge \frac{1}{n^{1/d}}$. Consider a $\frac{1}{n^{1/d}}$-packing of the space $\{x : \|x\| \le 1\}$; the packing number is at least the $\frac{1}{n^{1/d}}$-covering number of the same space, which is at least $(1/\frac{1}{n^{1/d}})^d = n$. We then choose $\{x_1, \dots, x_n\}$ from the $\frac{1}{n^{1/d}}$-packing, so the minimum pairwise distance is at least $\frac{1}{n^{1/d}}$. Next, we show that $f^*$ is at most $2n^{1/d}$-Lipschitz: as $f^*$ is the linear interpolation between neighbouring training points, the worst-case Lipschitz constant is $\max_{i\ne j}\frac{|y_i - y_j|}{\|x_i - x_j\|} \le 2n^{1/d}$.

A.9. Proof of Lemma 4.1

Proof. Our proof is partly based on Theorem 5.1 of Shalev-Shwartz & Ben-David (2014). Let $C$ be a subset of $X$ of size $2m$. There exist $T = 2^{2m}$ possible labeling functions from $C$ to $\{-a, a\}$; denote these functions by $f_1, \dots, f_T$. We then define a distribution $D_i$ w.r.t. $f_i$ by
$$D_i(\{(x, y)\}) = \begin{cases} p/|C|, & \text{if } y = f_i(x); \\ (1-p)/|C|, & \text{if } y = -f_i(x), \end{cases}$$
where $p > 1/2$ satisfies $\mathrm{Var}(y|x) = \sigma^2 = 4a^2p(1-p)$ (note that since $f_i(x)$ can only be $a$ or $-a$, $p$ is the same for all $f_i(x)$'s). In this way, $D_i$ satisfies the noisy-label setting. We will show that for every algorithm $A$ that receives a training set of size $m$ from $C \times \{-a, a\}$ and returns a function $A(S) : C \to \mathbb{R}$, it holds that
$$\max_{i \in [T]} \mathbb{E}_{S\sim D_i^m}[L_{D_i}(A(S))] \ge \frac{a^2 + \sigma^2}{2}.$$
There are $k = (2m)^m$ possible sequences of $m$ instances from $C$. Denote these sequences by $S_1, \dots, S_k$. Also, if $S_j = (x_1, \dots, x_m)$, we denote by $S_j^i$ the sequence containing the instances in $S_j$ labeled by the function $f_i$ with noise, namely, $S_j^i = ((x_1, a_1 f_i(x_1)), \dots, (x_m, a_m f_i(x_m)))$, where $\Pr(a_l = 1) = p$, $\Pr(a_l = -1) = 1-p$, and $a_1, \dots, a_m$ are i.i.d. for all $S_j^i$, given that $p$ is the same for all $f_i(x)$'s. If the distribution is $D_i$, then the possible training sets that algorithm $A$ receives are $S_1^i, \dots, S_k^i$, and all these training sets have the same probability of being sampled. Therefore,
$$\mathbb{E}_{S\sim D_i^m}[L_{D_i}(A(S))] = \frac{1}{k}\sum_{j=1}^k L_{D_i}(A(S_j^i)).$$
Using the facts that "maximum is larger than average" and "average is larger than minimum", we have
$$\max_{i \in [T]} \frac{1}{k}\sum_{j=1}^k L_{D_i}(A(S_j^i)) \ge \frac{1}{T}\sum_{i=1}^T \frac{1}{k}\sum_{j=1}^k L_{D_i}(A(S_j^i)) = \frac{1}{k}\sum_{j=1}^k \frac{1}{T}\sum_{i=1}^T L_{D_i}(A(S_j^i)) \ge \min_{j \in [k]} \frac{1}{T}\sum_{i=1}^T L_{D_i}(A(S_j^i)).$$
Next, fix some $j \in [k]$. Denote $S_j := (x_1, \dots, x_m)$ and let $v_1, \dots, v_q$ be the instances in $C$ that do not appear in $S_j$. Clearly, $q \ge m$. Therefore, for every function $h : C \to \mathbb{R}$ and every $i$ we have
$$L_{D_i}(h) = \frac{1}{2m}\sum_{x \in C}\left[p(h(x) - f_i(x))^2 + (1-p)(h(x) + f_i(x))^2\right] = \frac{1}{2m}\sum_{x \in C}\left[(h(x) - (2p-1)f_i(x))^2 + 4p(1-p)f_i(x)^2\right]$$
$$= \sigma^2 + \frac{1}{2m}\sum_{x \in C}(h(x) - (2p-1)f_i(x))^2 \ge \sigma^2 + \frac{1}{2m}\sum_{r=1}^q (h(v_r) - (2p-1)f_i(v_r))^2 \ge \sigma^2 + \frac{1}{2}\cdot\frac{1}{q}\sum_{r=1}^q (h(v_r) - (2p-1)f_i(v_r))^2.$$
Hence,
$$\frac{1}{T}\sum_{i=1}^T L_{D_i}(A(S_j^i)) \ge \sigma^2 + \frac{1}{2}\cdot\frac{1}{T}\sum_{i=1}^T \mathbb{E}_{a\sim\{-1,1\}^m}\!\left[\frac{1}{q}\sum_{r=1}^q (A(S_j^i(a))(v_r) - (2p-1)f_i(v_r))^2\right]$$
$$\ge \sigma^2 + \frac{1}{2}\min_{r \in [q]} \frac{1}{T}\sum_{i=1}^T \mathbb{E}_{a\sim\{-1,1\}^m}[(A(S_j^i(a))(v_r) - (2p-1)f_i(v_r))^2].$$
Next, fix some $r \in [q]$. We can partition the functions $f_1, \dots, f_T$ into $T/2$ disjoint pairs, where for a pair $(f_i, f_{i'})$ we have that for every $c \in C$, $f_i(c) \ne f_{i'}(c)$ if and only if $c = v_r$. Note that for such a pair and the same $a$, we must have $S_j^i(a) = S_j^{i'}(a)$ and, for every $a \in \{-1,1\}^m$, $\Pr(a|S_j^i) = \Pr(a|S_j^{i'})$. It follows that
$$\mathbb{E}_a[(A(S_j^i)(v_r) - (2p-1)f_i(v_r))^2] + \mathbb{E}_a[(A(S_j^{i'})(v_r) - (2p-1)f_{i'}(v_r))^2]$$
$$= \mathbb{E}_a[(A(S_j^i)(v_r) - (2p-1)f_i(v_r))^2 + (A(S_j^{i'})(v_r) - (2p-1)f_{i'}(v_r))^2] \ge \mathbb{E}_a\!\left[\frac{1}{2}(2p-1)^2(f_{i'}(v_r) - f_i(v_r))^2\right] = 2(2p-1)^2 a^2,$$
which yields
$$\frac{1}{T}\sum_{i=1}^T \mathbb{E}_{a\sim\{-1,1\}^m}[(A(S_j^i(a))(v_r) - (2p-1)f_i(v_r))^2] \ge (2p-1)^2 a^2.$$
Combining the discussion above, we have
$$\max_{i \in [T]} \mathbb{E}_{S\sim D_i^m}[L_{D_i}(A(S))] \ge \min_{j \in [k]} \frac{1}{T}\sum_{i=1}^T L_{D_i}(A(S_j^i)) \ge \sigma^2 + \frac{1}{2}(2p-1)^2 a^2 = \frac{a^2 + \sigma^2}{2}.$$

A.10. Proof of Lemma 4.2

Proof. Consider an arbitrary finite set $C \subseteq X$. Denote $d(C) := \min_{(a,b)\in C\times C,\, a\ne b}\|a - b\|$. We consider two cases: a) $d(C) < \epsilon$ and b) $d(C) \ge \epsilon$, and show that the conclusion holds in both cases.

Case a): $d(C) < \epsilon$. Denote $(x_1, x_2) = \arg\min_{(a,b)\in C\times C,\, a\ne b}\|a - b\|$. We select $D$ such that $D(\{(x_1, 1)\}) = \frac{p}{2}$, $D(\{(x_1, -1)\}) = \frac{1-p}{2}$, $D(\{(x_2, -1)\}) = \frac{p}{2}$, $D(\{(x_2, 1)\}) = \frac{1-p}{2}$, where $4p(1-p) = \sigma^2$ and $p > 1/2$. Consider an $L$-Lipschitz learning algorithm $A(S) : C \to \mathbb{R}$:
$$\mathbb{E}_{S\sim D^n}[L_D(A(S))] = \mathbb{E}_S\!\left[\frac{p}{2}(A(S)(x_1) - 1)^2 + \frac{1-p}{2}(A(S)(x_1) + 1)^2 + \frac{p}{2}(A(S)(x_2) + 1)^2 + \frac{1-p}{2}(A(S)(x_2) - 1)^2\right]$$
$$\ge \min_S\left[1 - (2p-1)|A(S)(x_1) - A(S)(x_2)|\right] \ge 1 - L(2p-1)\|x_1 - x_2\| \ge 1 - L\epsilon.$$

Case b): $d(C) \ge \epsilon$. We reduce the regression problem to a binary classification problem with targets $\{-1, 1\}$ by considering a distribution $D$ supported only on $C \times \{-1, 1\}$. Then by Lemma 4.1 (with $a = 1$), for every $A(S) : X \to \mathbb{R}$ and every $C \subseteq X$ there exists $D$ such that
$$n < \frac{|C|}{2} \;\Longrightarrow\; \mathbb{E}_{S\sim D^n}[L_D(A(S))] \ge \frac{1 + \sigma^2}{2}.$$
Notice that $C \subseteq X$ can be chosen arbitrarily. Thus we have
$$n < \max_{C\subseteq X,\, d(C)\ge\epsilon} \frac{|C|}{2} \;\Longrightarrow\; \mathbb{E}_{S\sim D^n}[L_D(A(S))] \ge \frac{1 + \sigma^2}{2}.$$
Denote the $\epsilon$-packing number of the space $(X, \|\cdot\|)$ by $M(X, \epsilon, \|\cdot\|)$. We have $\max_{C\subseteq X,\, d(C)\ge\epsilon}|C|/2 = M(X, \epsilon, \|\cdot\|)/2$, and hence
$$n < M(X, \epsilon, \|\cdot\|)/2 \;\Longrightarrow\; \mathbb{E}_{S\sim D^n}[L_D(A(S))] \ge \frac{1 + \sigma^2}{2}.$$
Combining cases a) and b) yields the conclusion.

A.11. Proof of Theorem 4.3

Proof. Consider $X = \{x \in \mathbb{R}^d : \|x\| \le 1\}$. We have $M(X, \eta, \|\cdot\|) \ge (1/\eta)^d$, and thus by Lemma 4.2 there exists a distribution $D$ such that if $\sigma^2 \le 0.5$,
$$n < M(X, \eta, \|\cdot\|)/2 \;\Longrightarrow\; \mathbb{E}_{S\sim D^n}[L_D(A(S))] \ge \min\left\{1 - L\eta,\ \frac{1+\sigma^2}{2}\right\}.$$
Taking $\eta = \frac{1/2 - \epsilon}{L}$, where $\epsilon \in (0, 1/2)$, we have
$$n < \frac{1}{2}\left(\frac{2L}{1-2\epsilon}\right)^d \;\Longrightarrow\; \mathbb{E}_{S\sim D^n}[L_D(A(S))] \ge \min\left\{\frac{1}{2}+\epsilon,\ \frac{1+\sigma^2}{2}\right\}.$$
Thus, in the worst case, $n$ has to be at least $\exp(\Omega(d))$ if one wants to achieve good astuteness with any $O(1)$-Lipschitz learning algorithm, which completes the proof.

B. Some basic concepts of Rademacher complexity

Definition B.1 (Representativeness of $S$). $\mathrm{Rep}_D(l, F, S) := \sup_{f \in F}(L_D(f) - L_S(f))$.

Definition B.2 (Rademacher complexity). For $A \subseteq \mathbb{R}^n$,
$$R(A) := \frac{1}{n}\,\mathbb{E}_{\sigma_1,\dots,\sigma_n \sim \{-1,1\}}\left[\sup_{a \in A}\sum_{i=1}^n \sigma_i a_i\right].$$

Lemma B.3. Assume that $\forall f \in F$, $\forall x \in X$, $|l(f, x)| \le c$.
Then with probability at least $1 - \delta$, for all $f \in F$,
$$L_D(f) - L_S(f) \le \mathbb{E}_{S'\sim D^n}[\mathrm{Rep}_D(l, F, S')] + c\sqrt{\frac{2\ln(2/\delta)}{n}}.$$

Lemma B.4 (Lemma 26.2 in Shalev-Shwartz & Ben-David (2014)). $\mathbb{E}_{S\sim D^n}[\mathrm{Rep}_D(l, F, S)] \le 2\,\mathbb{E}_{S\sim D^n}[R(l \circ F \circ S)]$, where $S = \{x_1, \dots, x_n\}$ and $l \circ F \circ S = \{(l(f(x_1), y_1), \dots, l(f(x_n), y_n)) : f \in F\} \subseteq \mathbb{R}^n$.

Lemma B.5 (Theorem 26.5 in Shalev-Shwartz & Ben-David (2014)). Assume $\forall f \in F$, $\forall x \in X$, $|l(f, x)| \le a$. Then with probability at least $1 - \delta$, for all $f \in F$,
$$L_D(f) - L_S(f) \le 2\,\mathbb{E}_{S'\sim D^n}[R(l \circ F \circ S')] + a\sqrt{\frac{2\ln(2/\delta)}{n}}.$$

Lemma B.6 (Lemma 26.9 in Shalev-Shwartz & Ben-David (2014)). If $l(f(x), y)$ is $C$-Lipschitz w.r.t. $f(x)$ for arbitrary $y \in [-1, 1]$, then $R(l \circ F \circ S) \le C\,R(F \circ S)$.
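As a small numerical companion to Definition B.2 and Lemma B.6 (our own check, not part of the paper), the sketch below computes the exact empirical Rademacher complexity of a tiny finite prediction class and of its image under the squared loss, and verifies $R(l \circ F \circ S) \le 4\,R(F \circ S)$, since the squared loss with predictions and targets in $[-1, 1]$ is 4-Lipschitz in the prediction.

```python
# Check R(l o F o S) <= 4 * R(F o S) for the squared loss on a toy finite class.
# Illustrative only; the expectation over signs is computed exactly (small n).
import itertools
import numpy as np

def rademacher(A):
    """R(A) = (1/n) E_sigma sup_{a in A} sum_i sigma_i a_i, with A an (m, n) array."""
    m, n = A.shape
    total = 0.0
    for sigma in itertools.product([-1.0, 1.0], repeat=n):
        total += np.max(A @ np.array(sigma))
    return total / (n * 2 ** n)

rng = np.random.default_rng(0)
n, m = 8, 5                                    # n sample points, m functions in F
F_on_S = rng.uniform(-1.0, 1.0, (m, n))         # rows: (f(x_1), ..., f(x_n)) for f in F
y = rng.uniform(-1.0, 1.0, n)                   # targets in [-1, 1]
loss_on_S = (F_on_S - y) ** 2                   # rows: (l(f(x_i), y_i))_i

r_F, r_lF = rademacher(F_on_S), rademacher(loss_on_S)
print(f"R(F o S) = {r_F:.4f},  R(l o F o S) = {r_lF:.4f},  4*R(F o S) = {4 * r_F:.4f}")
assert r_lF <= 4 * r_F + 1e-9                   # contraction lemma (Lemma B.6) with C = 4
```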