# A Universal Law of Robustness via Isoperimetry

Sébastien Bubeck (Microsoft Research, sebubeck@microsoft.com) and Mark Sellke (Stanford University, msellke@stanford.edu)

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Abstract. Classically, data interpolation with a parametrized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied. A puzzling phenomenon in deep learning is that models are trained with many more parameters than what this classical theory would suggest. We propose a theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparametrization is necessary if one wants to interpolate the data smoothly. Namely we show that smooth interpolation requires d times more parameters than mere interpolation, where d is the ambient data dimension. We prove this universal law of robustness for any smoothly parametrized function class with polynomial size weights, and any covariate distribution verifying isoperimetry (or a mixture thereof). In the case of two-layer neural networks and Gaussian covariates, this law was conjectured in prior work by Bubeck, Li and Nagaraj. We also give an interpretation of our result as an improved generalization bound for model classes consisting of smooth functions.

1 Introduction

Solving n equations generically requires only n unknowns¹. However, the revolutionary deep learning methodology revolves around highly overparametrized models, with many more than n parameters to learn from n training data points. We propose an explanation for this enigmatic phenomenon, showing in great generality that finding a smooth function to fit d-dimensional data requires at least nd parameters. In other words, overparametrization by a factor of d is necessary for smooth interpolation, suggesting that perhaps the large size of the models used in deep learning is a necessity rather than a weakness of the framework. Another way to phrase the result is as a tradeoff between the size of a model (as measured by the number of parameters) and its robustness (as measured by its Lipschitz constant): either one has a small model (with n parameters) which must then be non-robust, or one has a robust model (constant Lipschitz) but then it must be very large (with nd parameters). Such a tradeoff was conjectured for the specific case of two-layer neural networks and Gaussian data in [BLN21]. Our result shows that in fact it is a universal phenomenon, which applies to essentially any parametrized function class (including in particular deep neural networks) as well as a much broader class of data distributions. As in [BLN21] we obtain an entire tradeoff curve between size and robustness: our universal law of robustness states that, for any function class smoothly parametrized by p parameters, and for any d-dimensional dataset satisfying mild regularity conditions, any function in this class that fits the data below the noise level must have its (Euclidean) Lipschitz constant larger than roughly $\sqrt{nd/p}$.

¹ As in, for instance, the inverse function theorem in analysis or Bézout's theorem in algebraic geometry. See also [YSJ19, BELM20] for versions of this claim with neural networks.

Theorem 1 (Informal version of Theorem 3). Let $\mathcal{F}$ be a class of functions from $\mathbb{R}^d$ to $\mathbb{R}$ and let $(x_i, y_i)_{i=1}^n$ be i.i.d. input-output pairs in $\mathbb{R}^d \times [-1, 1]$. Assume that:

1. $\mathcal{F}$ admits a Lipschitz parametrization by p real parameters, each of size at most poly(n, d).
2. The distribution $\mu$ of the covariates $x_i$ satisfies isoperimetry (or is a mixture thereof).

3. The expected conditional variance of the output (i.e., the "noise level") is strictly positive, denoted $\sigma^2 := \mathbb{E}_{\mu}[\mathrm{Var}[y|x]] > 0$.

Then, with high probability over the sampling of the data, one has simultaneously for all $f \in \mathcal{F}$:

$$\frac{1}{n}\sum_{i=1}^{n}(f(x_i) - y_i)^2 \le \sigma^2 - \epsilon \;\Longrightarrow\; \mathrm{Lip}(f) \ge \widetilde{\Omega}\left(\epsilon\sqrt{\frac{nd}{p}}\right).$$

Remark 1.1. For the distributions $\mu$ we have in mind, for instance uniform on the unit sphere, there exists with high probability some $O(1)$-Lipschitz function $f : \mathbb{R}^d \to \mathbb{R}$ satisfying $f(x_i) = y_i$ for all $i$. Indeed, with probability $1 - e^{-\Omega(d)}$ we have $\|x_i - x_j\| \ge 1$ for all $1 \le i \ne j \le n$ so long as $n \le \mathrm{poly}(d)$. In this case we may apply the Kirszbraun extension theorem to find a suitable $f$ regardless of the labels $y_i$. More explicitly, we may fix a smooth bump function $g : \mathbb{R}_+ \to \mathbb{R}$ with $g(0) = 1$ and $g(x) = 0$ for $x \ge 1$, and then interpolate using the sum of radial basis functions $f(x) = \sum_{i=1}^{n} g(\|x - x_i\|)\, y_i$. In fact this construction requires only $p = n(d+1)$ parameters to specify the values $(x_i, y_i)_{i \in [n]}$ and thus determine the function $f$. Hence $p = n(d+1)$ parameters suffice for robust interpolation, i.e., Theorem 3 is essentially best possible for $L = O(1)$. A similar construction shows the same conclusion for any $p \in [\widetilde{\Omega}(n), nd]$, essentially tracing the entire tradeoff curve. This is because one can first project onto a fixed subspace of dimension $d' = p/n$, and the projected inputs $x_i'$ now have pairwise distances at least $\Omega(\sqrt{d'/d})$ with high probability. The analogous construction on the projected points now requires only $p = d'n$ parameters and has Lipschitz constant $L = O(\sqrt{d/d'}) = O(\sqrt{nd/p})$.
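The radial basis construction of Remark 1.1 is simple enough to write down explicitly. Below is a minimal numerical sketch of it (ours, not from the paper): it assumes covariates drawn uniformly from the unit sphere and uses the piecewise-polynomial bump $g(t) = \max(0, 1-t)^2$ as a stand-in for a smooth bump. The only point is that the $n(d+1)$ stored numbers $(x_i, y_i)$ determine an $O(1)$-Lipschitz interpolator once the $x_i$ are pairwise separated by distance 1.

```python
import numpy as np

# Minimal sketch of the radial-basis interpolator from Remark 1.1.
# Illustrative assumptions: x_i uniform on the unit sphere, and
# g(t) = max(0, 1 - t)^2 playing the role of a bump with g(0) = 1 and
# g(t) = 0 for t >= 1.  The function is determined by the n(d+1) numbers
# (x_i, y_i), which is the parameter count discussed in the remark.
rng = np.random.default_rng(0)
n, d = 200, 500

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)            # covariates on the sphere
y = rng.choice([-1.0, 1.0], size=n)                       # arbitrary labels

def g(t):
    return np.maximum(0.0, 1.0 - t) ** 2                  # bump: g(0)=1, g(t>=1)=0

def f(x):
    return np.sum(g(np.linalg.norm(X - x, axis=1)) * y)   # sum of radial bumps

# With high probability all pairwise distances exceed 1 (they concentrate
# near sqrt(2)), so at each x_i only its own bump is active and f interpolates.
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print("min pairwise distance:", dists[~np.eye(n, dtype=bool)].min())
print("max interpolation error:", max(abs(f(X[i]) - y[i]) for i in range(n)))
```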
1.1 Speculative implication for real data

To put Theorem 1 in context, we compare to the empirical results presented in [MMS+18]. In the latter work, they consider the MNIST dataset, which consists of $n = 6 \times 10^4$ images in dimension $28^2 = 784$. They robustly trained different architectures, and reported in their Figure 4 the size of the architecture versus the obtained robust test accuracy (third plot from the left). One can see a sharp transition from roughly 10% accuracy to roughly 90% accuracy at around $2 \times 10^5$ parameters (capacity scale 4 in their notation). Moreover, the robust accuracy keeps climbing with more parameters, to roughly 95% accuracy at roughly $3 \times 10^6$ parameters. How can we compare these numbers to the law of robustness? There are a number of difficulties that we discuss below, and we emphasize that this discussion is highly speculative in nature, though we find that, with a few leaps of faith, our universal law of robustness sheds light on the potential parameter regimes of interest for robust deep learning.

The first difficulty is to evaluate the correct dimension of the problem. Certainly the number of pixels per image gives an upper bound, however one expects that the data lies on something like a lower-dimensional submanifold. Optimistically, we hope that Theorem 1 will continue to apply for an appropriate effective dimension which may be rather smaller than the literal number of pixels. This hope is partially justified by the fact that isoperimetry holds in many less-than-picturesque situations, some of which are stated in Section 1.3.

The next difficulty is to estimate/interpret the noise value $\sigma^2$. From a theoretical point of view, this noise assumption is necessary, for otherwise there could exist a smooth classifier with perfect accuracy in $\mathcal{F}$, defeating the point of any lower bound on the size of $\mathcal{F}$. We tentatively would like to think of $\sigma^2$ as capturing the contribution of the difficult part of the learning problem; that is, $\sigma^2$ could be thought of as the non-robust generalization error of reasonably good models, so a couple of percent of error in the case of MNIST. With that interpretation, one gets below the noise level in MNIST with a training error of a couple of percent. We believe that versions of the law of robustness might hold without noise; these would need to go beyond representational power and consider the dynamics of learning algorithms.

Finally, another subtlety in interpreting the empirical results of [MMS+18] is that there is a mismatch between what they measure and our quantities of interest. Namely, the law of robustness talks about two things: the training error and the worst-case robustness (i.e., the Lipschitz constant). On the other hand, [MMS+18] measures the robust generalization error. Understanding the interplay between those three quantities is a fantastic open problem. Here we take the perspective that a small robust generalization error should imply a small training error and a small Lipschitz constant. Another important mismatch is that we stated our universal law of robustness for Lipschitzness in $\ell_2$, while the experiments in [MMS+18] are for robustness in $\ell_\infty$. We believe that a variant of the law of robustness remains true for $\ell_\infty$, a belief again partially justified by how broadly isoperimetry holds (see Section 1.3).

With all the caveats described above, we can now look at the numbers as follows: in the [MMS+18] experiments, smooth models with accuracy below the noise level are attained with a number of parameters somewhere in the range $2 \times 10^5$ to $3 \times 10^6$ (possibly even larger depending on the interpretation of the noise level), while the law of robustness predicts that any such model must have at least nd parameters, and this latter quantity should be somewhere in the range $10^6$ to $10^7$ (corresponding to an effective dimension between 15 and 150). While far from perfect, the law of robustness prediction is far more accurate than the classical rule of thumb "# parameters ≈ # equations" (which here would predict a number of parameters of order $10^4$). Perhaps more interestingly, one could apply a similar reasoning to the ImageNet dataset, which consists of $1.4 \times 10^7$ images of size roughly $2 \times 10^5$. Estimating that the effective dimension is a couple of orders of magnitude smaller than this size, the law of robustness predicts that to obtain good robust models on ImageNet one would need at least $10^{10}$ to $10^{11}$ parameters. This is larger than the size of current neural networks trained robustly for this task, which have between $10^8$ and $10^9$ parameters. Thus we arrive at the tantalizing possibility that robust models for ImageNet do not exist yet simply because we are a couple of orders of magnitude off in the current scale of neural networks trained for this task.

1.2 Related work

Theorem 1 is a direct follow-up to the conjectured law of robustness in [BLN21] for (arbitrarily weighted) two-layer neural networks with Gaussian data. Our result does not actually prove their conjecture, because we assume here polynomially bounded weights. While this assumption is reasonable from a practical perspective, it remains mathematically interesting to prove the full conjecture for the two-layer case. We prove however in Section A that the polynomial weights assumption is necessary as soon as one considers three-layer neural networks.
Let us also mention [GCL+19, Theorem 6.1], which showed a lower bound of $\Omega(nd)$ on the VC dimension of any function class that can robustly interpolate arbitrary labels on all well-separated input sets $(x_1, \dots, x_n)$. We also note that a relation between high-dimensional phenomena such as concentration and adversarial examples has been hypothesized before, for example in [GMF+18].

In addition to [MMS+18], several recent works have experimentally studied the relationship between a neural network's scale and its achieved robustness, see e.g. [NBA+18, XY20, GQU+20]. It has been consistently reported that larger networks help tremendously for robustness, beyond what is typically seen for classical non-robust accuracy. We view our universal law of robustness as putting this empirical observation on a more solid footing: scale is actually necessary to achieve robustness.

The law of robustness setting is closely related to the interpolation setting: in the former case one considers models optimizing "beyond the noise level", while in the latter case one studies models with a perfect fit on the training data. The study of generalization in this interpolation regime has been a central focus of learning theory in the last few years (see e.g. [BHMM19, MM19, BLLT20, NKB+20]), as it seemingly contradicts classical theory about regularization. More broadly though, generalization remains a mysterious phenomenon in deep learning, and the exact interplay between the law of robustness setting (interpolation regime/worst-case robustness) and (robust) generalization error is a fantastic open problem. Interestingly, we note that one could potentially avoid the conclusion of the law of robustness (that is, that large models are necessary for robustness) with early stopping methods that halt the optimization once the noise level is reached. In fact, this theoretically motivated suggestion has already been empirically tested and confirmed in the recent work [RWK20], showing again a close tie between the conclusions one can draw from the law of robustness and actual practical settings.

Classical lower bounds on the gradient of a function include Poincaré-type inequalities, but they are of a qualitatively different nature compared to the law of robustness lower bound. We recall that a measure $\mu$ on $\mathbb{R}^d$ satisfies a Poincaré inequality if for any function $f$ one has $\mathbb{E}_\mu[\|\nabla f\|^2] \ge C\, \mathrm{Var}_\mu(f)$ (for some constant $C > 0$). In our context, such a lower bound for an interpolating function $f$ has essentially no consequence, since the variance of $f$ could be exponentially small. In fact this is tight, as one can easily use constructions similar to those in [BLN21] to show that one can interpolate with an exponentially small expected squared norm of the gradient (in particular, it is crucial in the law of robustness to consider the Lipschitz constant, i.e., the supremum of the norm of the gradient). On the other hand, our isoperimetry assumption is related to a certain strengthening of the Poincaré inequality known as the log-Sobolev inequality (see e.g. [Led01]). If the covariate measure satisfies only a Poincaré inequality, then we could prove a weaker law of robustness of the form $\mathrm{Lip} \gtrsim n\sqrt{d}/p$ (using for example the concentration result obtained in [BL97]). For the case of two-layer neural networks there is another natural notion of smoothness (different from $\ell_p$ norms of the gradient) that can be considered, known as the Barron norm.
In [BELM20] it is shown that for such a notion of smoothness there is no tradeoff à la the law of robustness, namely one can simultaneously be optimal both in terms of Barron norm and in terms of network size. More generally, it is an interesting challenge to understand for which notions of smoothness there is a tradeoff with size.

1.3 Isoperimetry

Concentration of measure and isoperimetry are perhaps the most ubiquitous features of high-dimensional geometry. In short, they assert in many cases that Lipschitz functions on high-dimensional space concentrate tightly around their mean. Our result assumes that the distribution $\mu$ of the covariates $x_i$ satisfies such an inequality in the following sense.

Definition 1.1. A probability measure $\mu$ on $\mathbb{R}^d$ satisfies c-isoperimetry if for any bounded L-Lipschitz $f : \mathbb{R}^d \to \mathbb{R}$ and any $t \ge 0$,

$$\mathbb{P}\left[|f(x) - \mathbb{E}[f]| \ge t\right] \le 2 e^{-\frac{d t^2}{2 c L^2}}. \quad (1.1)$$

In general, if a scalar random variable $X$ satisfies $\mathbb{P}[|X| \ge t] \le 2 e^{-t^2/C}$ then we say $X$ is C-subgaussian. Hence isoperimetry states that the output of any Lipschitz function is $O(1)$-subgaussian under suitable rescaling. Distributions satisfying $O(1)$-isoperimetry include high-dimensional Gaussians $\mu = \mathcal{N}\left(0, \frac{I_d}{d}\right)$ and uniform distributions on spheres and hypercubes (normalized to have diameter 1). Isoperimetry also holds for mild perturbations of these idealized scenarios, including²:

- The sum of a Gaussian and an independent random vector of small norm [CCNW21].
- Strongly log-concave measures in any normed space [BL00, Proposition 3.1].
- Manifolds with positive Ricci curvature [Gro86, Theorem 2.2].

Due to the last condition above, we believe our results are realistic even under the manifold hypothesis that high-dimensional data tends to lie on a lower-dimensional submanifold. This viewpoint on learning has been studied for decades, see e.g. [HS89, KL93, RS00, TDSL00, NM10, FMN16].

² The first two examples satisfy a logarithmic Sobolev inequality, which implies isoperimetry [Led99, Proposition 2.3].

We also note that our formal theorem (Theorem 3) actually applies to distributions that can be written as a mixture of distributions satisfying isoperimetry. Let us also point out that, from a technical perspective, our proof is not tied to the Euclidean norm and applies essentially whenever Definition 1.1 holds. The main difficulty in extending the law of robustness to, e.g., the earth-mover distance seems to be identifying realistic cases which satisfy isoperimetry.

Our proofs will repeatedly use the following simple fact:

Proposition 1.2 ([Ver18, Proposition 2.6.1], [vH14, Exercise 3.1]). If $X_1, \dots, X_n$ are independent, mean-zero, and C-subgaussian, then $\frac{1}{\sqrt{n}}\sum_{i=1}^{n} X_i$ is 18C-subgaussian.
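Before moving on, here is a minimal numerical look at Definition 1.1 (ours, for illustration only). It assumes the uniform distribution on the unit sphere in $\mathbb{R}^d$, one of the standard $O(1)$-isoperimetric examples above (up to the exact normalization, which only affects the constant c), and uses the 1-Lipschitz test function $f(x) = \|x - e_1\|$; the choice $c = 1$ below is an illustrative assumption, not a value taken from the paper.

```python
import numpy as np

# Empirical check of the tail bound (1.1) for x uniform on the unit sphere
# and the 1-Lipschitz function f(x) = ||x - e_1||.  Fluctuations of f should
# be of order 1/sqrt(d), i.e. subgaussian at scale L/sqrt(d) with L = 1.
rng = np.random.default_rng(1)
t, c, L, n_samples = 0.05, 1.0, 1.0, 20_000

for d in (64, 256, 1024):
    x = rng.standard_normal((n_samples, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)      # uniform on the unit sphere
    e1 = np.zeros(d); e1[0] = 1.0
    f = np.linalg.norm(x - e1, axis=1)                 # 1-Lipschitz in x
    tail = np.mean(np.abs(f - f.mean()) >= t)          # empirical deviation probability
    bound = 2 * np.exp(-d * t**2 / (2 * c * L**2))     # right-hand side of (1.1)
    print(f"d={d:5d}  std(f)={f.std():.3f}  P[dev>={t}]={tail:.3f}  bound={bound:.3f}")
```

The standard deviation shrinks like $1/\sqrt{d}$ and the empirical tails stay below the bound, which is essentially the only property of the covariate distribution that the proofs below use.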
2 A finite approach to the law of robustness

For the function class of two-layer neural networks, [BLN21] investigated several approaches to prove the law of robustness. At a high level, the proof strategies there relied on various ways to measure how large the set of two-layer neural networks can be (specifically, they tried a geometric approach based on relating to multi-index models, a statistical approach based on the Rademacher complexity, and an algebraic approach for the case of polynomial activations). In this work we take a different route: we shift the focus from the function class $\mathcal{F}$ to an individual function $f \in \mathcal{F}$. Namely, our proof starts by asking the following question: for a fixed function $f$, what is the probability that it would give a good approximate fit on the (random) data?

For simplicity, consider for a moment the case where we require $f$ to actually interpolate the data (i.e., perfect fit), and say that the $y_i$ are random $\pm 1$ labels. The key insight is that isoperimetry implies that either the $+1$ level set of $f$ or the $-1$ level set of $f$ must have probability smaller than $\exp(-d/\mathrm{Lip}(f)^2)$. Thus, the probability that $f$ fits all $n$ points is at most $\exp(-nd/\mathrm{Lip}(f)^2)$, so long as both labels $y_i \in \{-1, 1\}$ actually appear a constant fraction of the time. In particular, using a union bound³, for a finite function class $\mathcal{F}$ of size $N$ consisting of L-Lipschitz functions, the probability that there exists a function $f \in \mathcal{F}$ fitting the data is at most

$$N \exp\left(-\frac{nd}{L^2}\right) = \exp\left(\log(N) - \frac{nd}{L^2}\right).$$

Thus we see that, if $L \ll \sqrt{\frac{nd}{\log(N)}}$, then the probability of finding a fitting function in $\mathcal{F}$ is very small. This basically concludes the proof, since via a standard discretization argument, for a smoothly parametrized family with p (bounded) parameters one expects $\log(N) = O(p)$.

³ In this informal argument we ignore the possibility that the labels $y_i$ are not well-balanced. Note that the probability of this rare event is not amplified by a union bound over $f \in \mathcal{F}$.

We now give the formal proof, which applies in particular to approximate fit rather than the exact fit of the argument above. The only difference is that we will identify a well-chosen subgaussian random variable in the problem. We start with the finite function class case:

Theorem 2. Let $(x_i, y_i)_{i=1}^n$ be i.i.d. input-output pairs in $\mathbb{R}^d \times [-1, 1]$ such that:

1. The distribution $\mu$ of the covariates $x_i$ can be written as $\mu = \sum_{\ell=1}^{k} \alpha_\ell \mu_\ell$, where each $\mu_\ell$ satisfies c-isoperimetry and $\alpha_\ell \ge 0$, $\sum_{\ell=1}^{k} \alpha_\ell = 1$.

2. The expected conditional variance of the output is strictly positive, denoted $\sigma^2 := \mathbb{E}_\mu[\mathrm{Var}[y|x]] > 0$.

Then, for any finite class $\mathcal{F}$ of functions from $\mathbb{R}^d$ to $\mathbb{R}$, any $L > 0$, and any $\epsilon \in (0, 1)$, one has:

$$\mathbb{P}\left[\exists f \in \mathcal{F} :\ \frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2 \le \sigma^2 - \epsilon \ \text{ and } \ \mathrm{Lip}(f) \le L\right] \le 6k \exp\left(-\frac{n\epsilon^2}{8^3 k}\right) + 2\exp\left(\log(|\mathcal{F}|) - \frac{\epsilon^2 n d}{9^4 c L^2}\right).$$

We start with a lemma showing that, to optimize beyond the noise level, one must necessarily correlate with the noise part of the labels. In what follows we denote $g(x) = \mathbb{E}[y|x]$ for the target function, and $z_i = y_i - g(x_i)$ for the noise part of the observed labels (namely, $y_i$ is the sum of the target value $g(x_i)$ and the noise term $z_i$).

Lemma 2.1. One has:

$$\mathbb{P}\left[\exists f \in \mathcal{F} :\ \frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2 \le \sigma^2 - \epsilon \ \text{ and } \ \frac{1}{n}\sum_{i=1}^{n} f(x_i)\, z_i \le \frac{\epsilon}{4}\right] \le 4\exp\left(-\frac{n\epsilon^2}{8^3}\right).$$

Proof. The sequence $(z_i^2)$ is i.i.d., with mean $\sigma^2$, and such that $|z_i|^2 \le 4$. Thus Hoeffding's inequality yields:

$$\mathbb{P}\left[\frac{1}{n}\sum_{i=1}^{n} z_i^2 \le \sigma^2 - \frac{\epsilon}{6}\right] \le 2\exp\left(-\frac{n\epsilon^2}{8^3}\right). \quad (2.1)$$

On the other hand, the sequence $(z_i\, g(x_i))$ is i.i.d., with mean 0 (since $\mathbb{E}[z_i|x_i] = 0$), and such that $|z_i\, g(x_i)| \le 2$. Thus Hoeffding's inequality yields:

$$\mathbb{P}\left[\left|\frac{1}{n}\sum_{i=1}^{n} z_i\, g(x_i)\right| \ge \frac{\epsilon}{6}\right] \le 2\exp\left(-\frac{n\epsilon^2}{8^3}\right). \quad (2.2)$$

Let us write $Z = \frac{1}{\sqrt{n}}(z_1, \dots, z_n)$, $G = \frac{1}{\sqrt{n}}(g(x_1), \dots, g(x_n))$, and $F = \frac{1}{\sqrt{n}}(f(x_1), \dots, f(x_n))$, so that $\frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2 = \|G + Z - F\|^2$ and $\frac{1}{n}\sum_{i=1}^{n} f(x_i)\, z_i = \langle F, Z\rangle$. We claim that if $\|Z\|^2 \ge \sigma^2 - \frac{\epsilon}{6}$ and $|\langle Z, G\rangle| \le \frac{\epsilon}{6}$, then for any $f \in \mathcal{F}$ one has

$$\|G + Z - F\|^2 \le \sigma^2 - \epsilon \;\Longrightarrow\; \langle F, Z\rangle \ge \frac{\epsilon}{4}.$$

This claim together with (2.1) and (2.2) concludes the proof. The claim itself directly follows from:

$$\sigma^2 - \epsilon \ge \|G + Z - F\|^2 = \|Z\|^2 + 2\langle Z, G - F\rangle + \|G - F\|^2 \ge \sigma^2 - \frac{\epsilon}{6} - \frac{\epsilon}{3} - 2\langle F, Z\rangle.$$

We can now proceed to the proof of Theorem 2:

Proof. First note that without loss of generality we can assume that the range of any function in $\mathcal{F}$ is included in $[-1, 1]$ (indeed, clipping the values improves both the fit to any $y \in [-1, 1]$ and the Lipschitz constant). We also assume without loss of generality that all functions in $\mathcal{F}$ are L-Lipschitz. For clarity let us start with the case $k = 1$. By the isoperimetry assumption we have that $\sqrt{\frac{d}{c}}\, \frac{f(x_i) - \mathbb{E}[f]}{L}$ is 1-subgaussian. Since $|z_i| \le 2$, we also have that $\sqrt{\frac{d}{c}}\, \frac{(f(x_i) - \mathbb{E}[f])\, z_i}{L}$ is 4-subgaussian. Moreover, the latter random variable has zero mean since $\mathbb{E}[z|x] = 0$. Thus by Proposition 1.2 we have:

$$\mathbb{P}\left[\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \sqrt{\frac{d}{c}}\, \frac{(f(x_i) - \mathbb{E}[f])\, z_i}{L} \ge t\right] \le 2\exp\left(-(t/9)^2\right).$$
We rewrite the above as:

$$\mathbb{P}\left[\frac{1}{n}\sum_{i=1}^{n}(f(x_i) - \mathbb{E}[f])\, z_i \ge \frac{\epsilon}{8}\right] \le 2\exp\left(-\frac{\epsilon^2 n d}{9^4 c L^2}\right). \quad (2.3)$$

Since we assumed that the range of the functions is included in $[-1, 1]$ we have $\mathbb{E}[f] \in [-1, 1]$, and hence:

$$\mathbb{P}\left[\exists f \in \mathcal{F} :\ \left|\frac{1}{n}\sum_{i=1}^{n} \mathbb{E}[f]\, z_i\right| \ge \frac{\epsilon}{8}\right] \le \mathbb{P}\left[\left|\frac{1}{n}\sum_{i=1}^{n} z_i\right| \ge \frac{\epsilon}{8}\right]. \quad (2.4)$$

(This step is the analog of requiring the labels $y_i$ to be well-balanced in the example of perfect interpolation.) By Hoeffding's inequality, the right-hand side of (2.4) is smaller than $2\exp(-n\epsilon^2/8^3)$ (recall that $|z_i| \le 2$). Thus we obtain with a union bound:

$$\mathbb{P}\left[\exists f \in \mathcal{F} :\ \frac{1}{n}\sum_{i=1}^{n} f(x_i)\, z_i \ge \frac{\epsilon}{4}\right] \le \mathbb{P}\left[\exists f \in \mathcal{F} :\ \frac{1}{n}\sum_{i=1}^{n}(f(x_i) - \mathbb{E}[f])\, z_i \ge \frac{\epsilon}{8}\right] + \mathbb{P}\left[\exists f \in \mathcal{F} :\ \left|\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[f]\, z_i\right| \ge \frac{\epsilon}{8}\right] \le 2|\mathcal{F}| \exp\left(-\frac{\epsilon^2 n d}{9^4 c L^2}\right) + 2\exp\left(-\frac{n\epsilon^2}{8^3}\right).$$

Together with Lemma 2.1 this concludes the proof for $k = 1$.

We now turn to the case $k > 1$. We first sample the mixture component $\ell_i \in [k]$ for each data point $i \in [n]$, and we now reason conditionally on these mixture components. Let $S_\ell \subseteq [n]$ be the set of data points sampled from mixture component $\ell \in [k]$; that is, for $i \in S_\ell$, $x_i$ is i.i.d. from $\mu_\ell$. We now have that $\sqrt{\frac{d}{c}}\, \frac{f(x_i) - \mathbb{E}_{\mu_{\ell_i}}[f]}{L}$ is 1-subgaussian (notice that the only difference is that we now need to center by $\mathbb{E}_{\mu_{\ell_i}}[f]$, which depends on the mixture component). In particular, using the same reasoning as for (2.3), we obtain (crucially note that Proposition 1.2 does not require the random variables to be identically distributed):

$$\mathbb{P}\left[\frac{1}{n}\sum_{i=1}^{n}(f(x_i) - \mathbb{E}_{\mu_{\ell_i}}[f])\, z_i \ge \frac{\epsilon}{8}\right] \le 2\exp\left(-\frac{\epsilon^2 n d}{9^4 c L^2}\right).$$

Next we want to appropriately modify (2.4). To do so note that:

$$\max_{m_1, \dots, m_k \in [-1, 1]} \frac{1}{n}\sum_{i=1}^{n} m_{\ell_i} z_i = \frac{1}{n}\sum_{\ell=1}^{k}\left|\sum_{i \in S_\ell} z_i\right|,$$

so that we can rewrite (2.4) as:

$$\mathbb{P}\left[\exists f \in \mathcal{F} :\ \left|\frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{\mu_{\ell_i}}[f]\, z_i\right| \ge \frac{\epsilon}{8}\right] \le \mathbb{P}\left[\frac{1}{n}\sum_{\ell=1}^{k}\left|\sum_{i \in S_\ell} z_i\right| \ge \frac{\epsilon}{8}\right].$$

Now note that $\sum_{\ell=1}^{k}\sqrt{|S_\ell|} \le \sqrt{nk}$, and thus we have:

$$\frac{1}{n}\sum_{\ell=1}^{k}\left|\sum_{i \in S_\ell} z_i\right| \ge \frac{\epsilon}{8} \;\Longrightarrow\; \exists\, \ell \in [k] :\ \left|\sum_{i \in S_\ell} z_i\right| \ge \frac{\epsilon}{8}\sqrt{\frac{n\,|S_\ell|}{k}}.$$

Finally, by Hoeffding's inequality we have for any $\ell \in [k]$, $\mathbb{P}\left[\left|\sum_{i \in S_\ell} z_i\right| \ge \frac{\epsilon}{8}\sqrt{\frac{n\,|S_\ell|}{k}}\right] \le 2\exp\left(-\frac{n\epsilon^2}{8^3 k}\right)$, and thus the last display is bounded from above by $2k\exp\left(-\frac{n\epsilon^2}{8^3 k}\right)$. The proof can now be concluded as in the case $k = 1$.

Finally, we can now state and prove the formal version of the informal Theorem 1 from the introduction.

Theorem 3. Let $\mathcal{F}$ be a class of functions from $\mathbb{R}^d$ to $\mathbb{R}$ and let $(x_i, y_i)_{i=1}^n$ be i.i.d. input-output pairs in $\mathbb{R}^d \times [-1, 1]$. Fix $\epsilon, \delta \in (0, 1)$. Assume that:

1. The function class can be written as $\mathcal{F} = \{f_w,\ w \in \mathcal{W}\}$ with $\mathcal{W} \subseteq \mathbb{R}^p$, $\mathrm{diam}(\mathcal{W}) \le W$, and for any $w_1, w_2 \in \mathcal{W}$, $\|f_{w_1} - f_{w_2}\|_\infty \le J\,\|w_1 - w_2\|$.

2. The distribution $\mu$ of the covariates $x_i$ can be written as $\mu = \sum_{\ell=1}^{k} \alpha_\ell \mu_\ell$, where each $\mu_\ell$ satisfies c-isoperimetry, $\alpha_\ell \ge 0$, $\sum_{\ell=1}^{k} \alpha_\ell = 1$, and $k$ is such that $9^4\, k \log(8k/\delta) \le n\epsilon^2$.

3. The expected conditional variance of the output is strictly positive, denoted $\sigma^2 := \mathbb{E}_\mu[\mathrm{Var}[y|x]] > 0$.

Then, with probability at least $1 - \delta$ with respect to the sampling of the data, one has simultaneously for all $f \in \mathcal{F}$:

$$\frac{1}{n}\sum_{i=1}^{n}(f(x_i) - y_i)^2 \le \sigma^2 - \epsilon \;\Longrightarrow\; \mathrm{Lip}(f) \ge \frac{\epsilon}{2^9 \sqrt{c}}\sqrt{\frac{nd}{p\,\log(60 W J \epsilon^{-1}) + \log(4/\delta)}}.$$

Proof. Define $\mathcal{W}_L \subseteq \mathcal{W}$ by $\mathcal{W}_L = \{w \in \mathcal{W} : \mathrm{Lip}(f_w) \le L\}$, and denote $\mathcal{W}_{L,\epsilon}$ for an $\frac{\epsilon}{6J}$-net of $\mathcal{W}_L$. We have in particular $|\mathcal{W}_{L,\epsilon}| \le (60 W J \epsilon^{-1})^p$. We apply Theorem 2 to $\mathcal{F}_{L,\epsilon} = \{f_w,\ w \in \mathcal{W}_{L,\epsilon}\}$:

$$\mathbb{P}\left[\exists f \in \mathcal{F}_{L,\epsilon} :\ \frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2 \le \sigma^2 - \frac{\epsilon}{2} \ \text{ and } \ \mathrm{Lip}(f) \le L\right] \le 6k\exp\left(-\frac{n\epsilon^2}{4 \cdot 8^3 k}\right) + 2\exp\left(p\log(60 W J \epsilon^{-1}) - \frac{\epsilon^2 n d}{4 \cdot 9^4 c L^2}\right).$$

Observe that if $\|f - g\|_\infty \le \frac{\epsilon}{6}$ and $\|y\|_\infty, \|f\|_\infty, \|g\|_\infty \le 1$, then $\frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2 \le \frac{\epsilon}{2} + \frac{1}{n}\sum_{i=1}^{n}(y_i - g(x_i))^2$. (We may again assume without loss of generality that all functions in $\mathcal{F}$ map to $[-1, 1]$.) Thus we obtain for any $L > 0$:

$$\mathbb{P}\left[\exists f \in \mathcal{F} :\ \frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2 \le \sigma^2 - \epsilon \ \text{ and } \ \mathrm{Lip}(f) \le L\right] \le 6k\exp\left(-\frac{n\epsilon^2}{4 \cdot 8^3 k}\right) + 2\exp\left(p\log(60 W J \epsilon^{-1}) - \frac{\epsilon^2 n d}{4 \cdot 9^4 c L^2}\right).$$

The first assumption ensures that for any $w \in \mathcal{W}_L$ there is $w' \in \mathcal{W}_{L,\epsilon}$ with $\|f_w - f_{w'}\|_\infty \le \frac{\epsilon}{6}$. The second assumption shows that the probability just above is at most $\delta$ when $L = \frac{\epsilon}{2^9 \sqrt{c}}\sqrt{\frac{nd}{p\,\log(60 W J \epsilon^{-1}) + \log(4/\delta)}}$. This concludes the proof.
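To connect Theorem 3 back to the back-of-the-envelope discussion in Section 1.1, the following minimal sketch (ours; every number in it is an illustrative assumption) evaluates the informal bound $\mathrm{Lip}(f) \gtrsim \epsilon\sqrt{nd/p}$ of Theorem 1, dropping the constants and logarithmic factors that appear in Theorem 3.

```python
from math import sqrt

# Illustrative only: plug hypothetical (n, effective dimension, p, eps) into
# the informal lower bound Lip >= eps * sqrt(n * d / p) from Theorem 1,
# ignoring the constants and log factors of Theorem 3.
def lip_lower_bound(n, d_eff, p, eps=0.1):
    return eps * sqrt(n * d_eff / p)

settings = [
    # (description, n, assumed effective dimension, number of parameters)
    ("MNIST-like,    p = 2e5 ", 6e4,   50,   2e5),
    ("MNIST-like,    p = 3e6 ", 6e4,   50,   3e6),
    ("ImageNet-like, p = 1e8 ", 1.4e7, 1000, 1e8),
    ("ImageNet-like, p = 1e11", 1.4e7, 1000, 1e11),
]
for name, n, d_eff, p in settings:
    print(f"{name}  ->  Lip >~ {lip_lower_bound(n, d_eff, p):.3f}")
```

The qualitative message matches Section 1.1: under these assumed effective dimensions, the lower bound on the Lipschitz constant only drops to the constant scale once the number of parameters approaches nd.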
3 Deep neural networks

We now specialize the law of robustness (Theorem 3) to multi-layer neural networks. We consider a rather general class of depth-D neural networks described as follows. First, we require that the neurons are partitioned into layers $L_1, \dots, L_D$, and that all connections are from $L_i$ to $L_j$ for some $i < j$. This includes the basic feed-forward case, in which only connections from $L_i$ to $L_{i+1}$ are used, as well as more general skip connections. We specify (in the natural way) a neural network by matrices $W_j$ of shape $|L_j| \times \sum_{i < j} |L_i|$
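To fix ideas, here is a minimal sketch (ours, not the paper's) of the layered architecture just described: neurons are grouped into layers $L_1, \dots, L_D$, the input plays the role of $L_1$, and layer $j$ may receive connections from every earlier layer, so its weight matrix $W_j$ has shape $|L_j| \times \sum_{i<j}|L_i|$. The 1-Lipschitz ReLU activation and the linear output layer are illustrative assumptions, not requirements stated in the text.

```python
import numpy as np

# Minimal sketch of a depth-D network with arbitrary forward (skip) connections:
# layer j acts on the concatenation of the activations of all earlier layers,
# so W_j has shape |L_j| x sum_{i<j} |L_i|.
def forward(x, weights, sigma=lambda z: np.maximum(z, 0.0)):   # ReLU, assumed
    acts = [x]                                   # activations of L_1, ..., L_{j-1}
    for j, W in enumerate(weights):
        z = W @ np.concatenate(acts)             # W_j sees every previous layer
        acts.append(z if j == len(weights) - 1 else sigma(z))   # linear output layer
    return acts[-1]

# Example: layer sizes (d, 32, 16, 1) with full skip connectivity.
rng = np.random.default_rng(0)
sizes = [8, 32, 16, 1]
weights = [rng.standard_normal((sizes[j], sum(sizes[:j]))) / np.sqrt(sum(sizes[:j]))
           for j in range(1, len(sizes))]
print(forward(rng.standard_normal(sizes[0]), weights))
```

Under this parametrization, the total number of matrix entries, $\sum_{j} |L_j| \sum_{i<j} |L_i|$, is the natural parameter count p entering Theorem 3 (up to any additional bias or activation parameters).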