Journal of Machine Learning Research 23 (2022) 1-12. Submitted 10/21; Revised 3/22; Published 4/22.

Foolish Crowds Support Benign Overfitting

Niladri S. Chatterji (niladri@cs.stanford.edu), Computer Science Department, Stanford University, 353 Jane Stanford Way, Stanford, CA 94305.
Philip M. Long (plong@google.com), Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043.

Editor: Ohad Shamir

Abstract

We prove a lower bound on the excess risk of sparse interpolating procedures for linear regression with Gaussian data in the overparameterized regime. We apply this result to obtain a lower bound for basis pursuit (the minimum $\ell_1$-norm interpolant) showing that its excess risk can converge at an exponentially slower rate than OLS (the minimum $\ell_2$-norm interpolant), even when the ground truth is sparse. Our analysis exposes the benefit of an effect analogous to "the wisdom of the crowd," except that here the harm arising from fitting the noise is ameliorated by spreading it among many directions: the variance reduction arises from a foolish crowd.

Keywords: generalization, benign overfitting, interpolation, sparsity, lower bounds, regression

1. Introduction

Recently, there has been a surge of interest in benign overfitting, where a learning algorithm generalizes well despite interpolating noisy data (see, e.g., Zhang et al., 2021; Belkin et al., 2019; Bartlett et al., 2020; Belkin, 2021). Arguably the most basic setting in which this has been analyzed theoretically is linear regression with Gaussian data, where upper and nearly matching lower bounds have been obtained for the ordinary least squares (OLS) estimator, which chooses a parameter vector $\hat\theta$ that minimizes $\|\hat\theta\|_2$ from among interpolating models (Bartlett et al., 2020; Negrea et al., 2020; Tsigler and Bartlett, 2020). Bounds have also been obtained for basis pursuit (Chen et al., 2001), which minimizes $\|\hat\theta\|_1$ from among interpolating models (Muthukumar et al., 2020; Ju et al., 2020; Chinot et al., 2020; Koehler et al., 2021).

The upper bounds for the OLS estimator show that it rapidly approaches the Bayes risk when the structure of the covariance matrix $\Sigma$ of the inputs is favorable. Informally, the following are necessary and sufficient: the sum of all of the eigenvalues of $\Sigma$ is not too big, and, after excluding a few of the largest eigenvalues, there are many small eigenvalues of roughly equal magnitude. A canonical example of such a benign covariance matrix is
$$\Sigma_{k,\varepsilon} := \mathrm{diag}(\underbrace{1, \ldots, 1}_{k}, \underbrace{\varepsilon, \ldots, \varepsilon}_{p-k}).$$
Consider a case where $k \ll p$, the rows of $X$ are $n$ i.i.d. draws from $N(0, \Sigma_{k,\varepsilon})$, and, for independent noise $\xi \sim N(0, I)$ and a unit-length $\theta^*$, $y = X\theta^* + \xi$. In this case a high probability upper bound on the excess risk of OLS, together with a nearly matching lower bound, is known (Bartlett et al., 2020; Negrea et al., 2020; Tsigler and Bartlett, 2020); for example, if $p = n^2$, $\varepsilon = 1/n^2$ and $k$ is a constant, these bounds are proportional to $1/n$. If, in addition, $\theta^*_i = 0$ for $i > k$, upper bounds are also known for basis pursuit in this setting (Koehler et al., 2021). On the other hand, they are much worse, scaling with $p$ like $\frac{1}{\log p}$, and requiring that $p$ be an exponentially large function of $n$ to converge.
In this paper, we prove a lower bound for basis pursuit in this setting that scales with $p$ like $\frac{1}{\log^2 p}$. Our lower bound requires that $\sigma \ge c\|\theta^*\|_2$ for an arbitrarily small positive constant $c$, along with a few mild technical conditions, including that $p > n$, so that interpolation is possible, and that $n \gg k$ (see Theorem 2 for the details). Note that OLS converges much faster than basis pursuit in this setting despite the fact that $\theta^*$ is sparse.

The lower bound is a special case of a more general bound, Theorem 1, which can be paraphrased as follows. Under the same conditions as Theorem 2, including $\sigma \ge c\|\theta^*\|_2$, any interpolating procedure that, with high probability, outputs an $s$-sparse model $\hat\theta$ must suffer excess risk proportional to $\frac{\sigma^2 n}{s \log^2 p}$. We then get Theorem 2 by showing that, in this setting, basis pursuit almost surely outputs an $n$-sparse model.

This analysis sheds light on why the overfitting of OLS is so benign. Because OLS interpolates, we can think of the parameters of its output as storing the noise; OLS benefits from spreading the noise evenly among many parameters, where each small fragment of noise has a tiny effect. This phenomenon is akin to the reduction in variance arising from prediction using a weighted average of many covariates that is commonly referred to as the wisdom of the crowd (Surowiecki, 2005). Here, by spreading the harm arising from fitting the noise among many parameters, the algorithm benefits from a foolish crowd.

Muthukumar et al. (2020) established the same lower bound for basis pursuit in a setting with isotropic covariates (see also Ju et al., 2020) in the case that $\theta^* = 0$. Accommodating the possibility that $\theta^* \neq 0$ complicates the argument a bit, but our main contribution is to demonstrate slow convergence for basis pursuit in settings where OLS enjoys fast convergence. Chinot et al. (2020) proved an upper bound on the risk of basis pursuit, but, as pointed out by Koehler et al. (2021), it does not imply a bound on the excess risk. Limitations of algorithms that output sparse linear classifiers have also been studied previously (Helmbold and Long, 2012).

After a preliminary version of this work was posted on arXiv (Chatterji and Long, 2021), some related work was published (Wang et al., 2021; Li and Wei, 2021; Donhauser et al., 2022) that established upper bounds on the excess risk of the minimum $\ell_1$-norm interpolator. In particular, Wang et al. (2021) showed an upper bound on the excess risk that, in the case with isotropic covariates, scales as $\sigma^2/\log(p)$, almost matching our lower bound.

2. Preliminaries

For $p, n \in \mathbb{N}$, an example is a member of $\mathbb{R}^p \times \mathbb{R}$, and a linear regression algorithm takes as input $n$ examples and outputs $\hat\theta \in \mathbb{R}^p$. For a joint probability distribution $P$ over $\mathbb{R}^p \times \mathbb{R}$, the excess risk of $\hat\theta$ with respect to $P$ is
$$R(\hat\theta) := \mathbb{E}_{(x,y)\sim P}\big[(\hat\theta \cdot x - y)^2\big] - \inf_{\theta} \mathbb{E}_{(x,y)\sim P}\big[(\theta \cdot x - y)^2\big].$$

We refer to the following as the $(k, p, n, \varepsilon, \sigma)$-scenario: $X \in \mathbb{R}^{n \times p}$ is a matrix whose rows are i.i.d. draws from $N(0, \mathrm{diag}(\underbrace{1, \ldots, 1}_{k}, \underbrace{\varepsilon, \ldots, \varepsilon}_{p-k}))$, and, for $\theta^* \in \mathbb{R}^p$ with $(\theta^*_{k+1}, \ldots, \theta^*_p) = 0$ and $\xi \sim N(0, \sigma^2 I)$, $y = X\theta^* + \xi$.

For $\delta > 0$, $s, n \in \mathbb{N}$ and a joint probability distribution $P$ over $\mathbb{R}^p \times \mathbb{R}$, we say that a regression algorithm $A$ is an $(s, \delta)$-sparse interpolator for $P$ and $n \in \mathbb{N}$ if it satisfies the following: with probability $1 - \delta$ over the independent draw of the samples $(x_1, y_1), \ldots, (x_n, y_n) \sim P$, the output $\hat\theta$ of $A$ interpolates the data (that is, satisfies $\hat\theta \cdot x_1 = y_1, \ldots, \hat\theta \cdot x_n = y_n$) and has at most $s$ non-zero components.
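To make the $(k, p, n, \varepsilon, \sigma)$-scenario concrete, here is a minimal simulation sketch (ours, not part of the paper); it assumes NumPy, and the function and variable names are our own. It draws $(X, y, \theta^*)$ exactly as in the definition above.

```python
import numpy as np

def sample_scenario(k, p, n, eps, sigma, rng):
    """Draw (X, y, theta_star) from the (k, p, n, eps, sigma)-scenario.

    Rows of X are i.i.d. N(0, diag(1,...,1, eps,...,eps)) with k ones,
    theta_star is supported on the first k coordinates, and
    y = X theta_star + xi with xi ~ N(0, sigma^2 I).
    """
    scales = np.concatenate([np.ones(k), np.sqrt(eps) * np.ones(p - k)])
    X = rng.standard_normal((n, p)) * scales      # covariance diag(1,..,1,eps,..,eps)
    theta_star = np.zeros(p)
    theta_star[:k] = rng.standard_normal(k)
    theta_star /= np.linalg.norm(theta_star)      # unit-length ground truth, as in the introduction
    y = X @ theta_star + sigma * rng.standard_normal(n)
    return X, y, theta_star

rng = np.random.default_rng(0)
X, y, theta_star = sample_scenario(k=5, p=400, n=20, eps=1 / 400, sigma=1.0, rng=rng)
print(X.shape, y.shape)
```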
Given $X \in \mathbb{R}^{n \times p}$ and $y \in \mathbb{R}^n$, the basis pursuit algorithm (minimum $\ell_1$-norm interpolant) $A_{BP}$ outputs
$$\arg\min_{\theta} \|\theta\|_1 \quad \text{s.t.}\quad X\theta = y,$$
if there is such a $\theta$, and otherwise behaves arbitrarily (say, outputting $0$).

For any $j \in \mathbb{N}$, we denote the set $\{1, \ldots, j\}$ by $[j]$. Given a vector $v$, let $\|v\|_2$ denote its Euclidean norm and $\|v\|_1$ denote its $\ell_1$-norm. Given a matrix $M$, let $\|M\|_{op}$ denote its operator norm. For $z \in \mathbb{R}$, we denote $\max\{z, 0\}$ by $[z]_+$.

3. Main Results

We are ready to present our main result, a high probability lower bound on the excess risk of any $(s, \delta)$-sparse interpolator.

Theorem 1. For any $0 < c_1 \le 1$, there are absolute positive constants $c_2, c_3$ such that the following holds. For any $0 < \delta \le c_2$, for any $(k, p, n, \varepsilon, \sigma)$ such that $\sigma \ge c_1 \|\theta^*\|_2$, $p \ge n + k$, and $n \ge \log^2(1/\delta) + k^{1 + c_1}$, for any $n \le s \le p - k$, and for any regression algorithm $A$ that is an $(s, \delta)$-sparse interpolator for the $(k, p, n, \varepsilon, \sigma)$-scenario $P$, with probability $1 - 4\delta$ over $n$ random draws from $P$, the output $\hat\theta$ of $A$ satisfies
$$R(\hat\theta) \ge \frac{c_3 \sigma^2 n}{s \log^2(3p/s)}.$$

This theorem shows that a sparse interpolating predictor suffers large excess risk. Intuitively, the proof follows since an $s$-sparse interpolating predictor needs to hide the energy of the noise, which roughly scales like $\sigma^2 n$, in just $s$ coordinates. If it attempts to hide it in the first $k$ coordinates, then it suffers large bias. If it hides it in the tail, then it suffers large variance.

Next, we state our result for basis pursuit, the minimum $\ell_1$-norm interpolator.

Theorem 2. For any $0 < c_1 \le 1$, there are absolute positive constants $c_2, c_3$ such that the following holds. For any $0 < \delta \le c_2$ and for any $(k, p, n, \varepsilon, \sigma)$ such that $\sigma \ge c_1 \|\theta^*\|_2$, $p \ge n + k$, and $n \ge \log^2(1/\delta) + k^{1 + c_1}$, with probability $1 - 4\delta$ over $n$ random draws from the $(k, p, n, \varepsilon, \sigma)$-scenario $P$, the output $\hat\theta$ of $A_{BP}$ satisfies
$$R(\hat\theta) \ge \frac{c_3 \sigma^2}{\log^2(3p/n)}.$$

This theorem is proved by showing that the output of basis pursuit is always $n$-sparse and then simply invoking the previous general result.

Theorem 2 implies that the excess risk of $A_{BP}$ can be much worse than that of OLS. For example, if $k = 5$, $p = n^2$, $\varepsilon = 1/n^2$, and $\sigma^2 = \|\theta^*\|_2^2 = 1$, then Theorem 2 implies an $\Omega\!\left(\frac{1}{\log^2 n}\right)$ lower bound for $A_{BP}$, whereas an $O\!\left(\frac{1}{n}\right)$ upper bound holds for OLS (Bartlett et al., 2020; Negrea et al., 2020; Tsigler and Bartlett, 2020). If instead $k = 5$, $p = n^2$, $\varepsilon = 1/n^2$, and $\sigma^2 = \|\theta^*\|_2^2 = \log^2 n$, then the excess risk of OLS goes to zero at a $O\!\left(\frac{\log^2 n}{n}\right)$ rate, but Theorem 2 implies that the excess risk of $A_{BP}$ is bounded below by a constant.

When $\varepsilon = 1$, that is, when the covariates are isotropic, our lower bound coincides with the lower bound derived previously by Muthukumar et al. (2020).
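The contrast in the example following Theorem 2 can be probed numerically. Below is an illustrative sketch (ours, not the paper's experiment) that computes the minimum $\ell_2$-norm interpolant via least squares and the minimum $\ell_1$-norm interpolant via the standard linear-programming reformulation, and then evaluates the excess risk, which for this data model equals $(\hat\theta - \theta^*)^\top \Sigma_{k,\varepsilon} (\hat\theta - \theta^*)$. It assumes NumPy and SciPy; all names are ours.

```python
import numpy as np
from scipy.optimize import linprog

def min_l2_interpolant(X, y):
    # Minimum Euclidean-norm solution of X theta = y (what min-norm OLS returns when p > n).
    return np.linalg.lstsq(X, y, rcond=None)[0]

def min_l1_interpolant(X, y):
    # Basis pursuit: min ||theta||_1 s.t. X theta = y, via theta = u - v with u, v >= 0.
    n, p = X.shape
    res = linprog(np.ones(2 * p), A_eq=np.hstack([X, -X]), b_eq=y,
                  bounds=(0, None), method="highs")
    assert res.success
    return res.x[:p] - res.x[p:]

def excess_risk(theta_hat, theta_star, k, eps):
    d = theta_hat - theta_star
    return d[:k] @ d[:k] + eps * (d[k:] @ d[k:])  # (theta_hat - theta*)^T Sigma_{k,eps} (theta_hat - theta*)

rng = np.random.default_rng(1)
n, k = 40, 5
p, eps, sigma = n ** 2, 1.0 / n ** 2, 1.0
scales = np.concatenate([np.ones(k), np.sqrt(eps) * np.ones(p - k)])
X = rng.standard_normal((n, p)) * scales
theta_star = np.zeros(p)
theta_star[:k] = 1.0 / np.sqrt(k)                 # unit-norm, k-sparse ground truth
y = X @ theta_star + sigma * rng.standard_normal(n)

for name, fit in [("min-l2 (OLS)", min_l2_interpolant), ("min-l1 (basis pursuit)", min_l1_interpolant)]:
    print(name, excess_risk(fit(X, y), theta_star, k, eps))
```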
4. Proof of Theorem 1

This section is devoted to proving Theorem 1, so the assumptions of Theorem 1 are in scope throughout this section. Our proof proceeds through a series of lemmas.

Definition 3. For any $v \in \mathbb{R}^p$ and any $S \subseteq [p]$, let $v_S$ be the vector obtained by selecting the components of $S$ from $v$ in order. For $X \in \mathbb{R}^{n \times p}$, define $X_S$ similarly, except selecting columns from $X$. Let $H := \{1, \ldots, k\}$ and $T := \{k+1, \ldots, p\}$, so that $v_H = (v_1, \ldots, v_k)$ and $v_T = (v_{k+1}, \ldots, v_p)$.

The first step is to break up the excess risk into contributions from the head $H$ and the tail $T$.

Lemma 4. The excess risk of any parameter vector $\hat\theta$ satisfies
$$R(\hat\theta) = \|\theta^*_H - \hat\theta_H\|_2^2 + \varepsilon \|\hat\theta_T\|_2^2.$$

Proof. By the projection lemma,
$$R(\hat\theta) = \mathbb{E}_{(x,y)\sim P}\big[(\hat\theta \cdot x - y)^2\big] - \mathbb{E}_{(x,y)\sim P}\big[(\theta^* \cdot x - y)^2\big] = \mathbb{E}_{(x,y)\sim P}\big[(\hat\theta \cdot x - \theta^* \cdot x)^2\big] + \sigma^2 - \sigma^2 = \|\theta^*_H - \hat\theta_H\|_2^2 + \varepsilon \|\hat\theta_T\|_2^2,$$
since $\theta^*_T = 0$ and the distribution of $x$ has covariance $\mathrm{diag}(\underbrace{1, \ldots, 1}_{k}, \underbrace{\varepsilon, \ldots, \varepsilon}_{p-k})$.

Lemma 4 leads to the subproblem of establishing a lower bound on $\|\hat\theta_T\|_2^2$. The following lemma is an easy step in this direction.

Lemma 5. Given any estimator $\hat\theta$ such that $X\hat\theta = y$, we have $\|X_T \hat\theta_T\|_2 = \|y - X_H \hat\theta_H\|_2$.

Proof. The lemma follows from the fact that $y = X\hat\theta = X_H \hat\theta_H + X_T \hat\theta_T$.

Lemma 5 provides a means to establish a lower bound on $\|X_T \hat\theta_T\|_2$. This in turn can lead to a lower bound on $\|\hat\theta_T\|_2$ if we can show that the linear operator associated with $X_T$ does not blow up $\hat\theta_T$. It turns out that, when $\hat\theta$ (and thus $\hat\theta_T$) is sparse, a random $X_T$ is especially unlikely to blow up $\hat\theta_T$, as reflected in the following lemma. It is an immediate consequence of (Adamczak et al., 2012, Theorem 4.2).

Lemma 6. There exists a constant $c$ such that for any $t \ge 1$, we have
$$\mathbb{P}\left(\max_{S \subseteq T : |S| \le s} \|X_S\|_{op} \ge c\sqrt{\varepsilon}\left(\sqrt{s}\,\log\frac{3(p-k)}{s} + \sqrt{n} + t\right)\right) \le e^{-t}.$$

Lemma 6 implies a lower bound on $\|\theta_T\|_2$ for any $s$-sparse $\theta$.

Lemma 7. There exists a constant $c$ such that, with probability at least $1 - \delta$, any $s$-sparse $\theta$ has
$$\|\theta_T\|_2 \ge \frac{\|X_T \theta_T\|_2}{c\sqrt{\varepsilon}\left(\sqrt{s}\,\log\frac{3(p-k)}{s} + \sqrt{n} + \log(1/\delta)\right)}. \qquad (1)$$

Proof. For any $s$-sparse $\theta$, if $S = T \cap \{i : \theta_i \neq 0\}$, we have $|S| \le s$, so for any $X$, we have $\|X_T \theta_T\|_2 = \|X_S \theta_S\|_2$. Applying Lemma 6 with $t = \log(1/\delta)$, with probability at least $1 - \delta$, (1) holds for all such $\theta$.

Since $\hat\theta$ is likely to be $s$-sparse, Lemma 7 implies a high-probability lower bound on $\|\hat\theta_T\|_2$, the contribution of the tail to the excess risk. This bound is in terms of $\|X_T \hat\theta_T\|_2 = \|y - X_H \hat\theta_H\|_2$. We will bound this by proving a high-probability lower bound on $\|y\|_2$ and a high-probability upper bound on $\|X_H \hat\theta_H\|_2$. We start with the lower bound on $\|y\|_2$.

Lemma 8. With probability $1 - \delta$,
$$\|y\|_2^2 \ge (\sigma^2 + \|\theta^*\|_2^2)\, n \left(1 - 2\sqrt{\tfrac{\log(1/\delta)}{n}}\right).$$

Proof. We have $y = X\theta^* + \xi$. That is, for each sample $i \in [n]$, $y_i = x_i \cdot \theta^* + \xi_i$. Observe that $x_i \cdot \theta^* \sim N(0, \|\theta^*\|_2^2)$, $\xi_i \sim N(0, \sigma^2)$, and $x_i$ and $\xi_i$ are independent. Therefore, $y_i \sim N(0, \sigma^2 + \|\theta^*\|_2^2)$. Thus
$$\sum_{i=1}^{n} |y_i|^2 = (\sigma^2 + \|\theta^*\|_2^2)\, q, \qquad (2)$$
where $q$ is a random variable with a $\chi^2(n)$ distribution. Applying Lemma 1 from (Laurent and Massart, 2000), we have
$$\mathbb{P}\left(q \ge n - 2\sqrt{tn}\right) \ge 1 - \exp(-t).$$
If we set $t = \log(1/\delta)$, then with probability at least $1 - \delta$, $q \ge n - 2\sqrt{n \log(1/\delta)}$. This combined with (2) completes the proof.

Recall that we also want an upper bound on $\|X_H \hat\theta_H\|_2$; we will use a bound on $\|X_H\|_{op}$ that is an immediate consequence of (Vershynin, 2010, Corollary 5.35).

Lemma 9. With probability $1 - \delta$,
$$\|X_H\|_{op} \le \sqrt{n} + \sqrt{k} + \sqrt{2\log(2/\delta)}.$$
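As a quick, informal sanity check (ours, not part of the paper's argument), the two concentration claims above are easy to probe by simulation. The sketch below, assuming NumPy, estimates the probability that the events in Lemmas 8 and 9, as stated above, fail; these empirical failure rates should not exceed $\delta$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, sigma, delta, trials = 200, 5, 1.0, 0.05, 2000
theta_norm = 1.0                      # ||theta*||_2; only the norm matters for ||y||_2

lemma8_failures = 0
lemma9_failures = 0
for _ in range(trials):
    # Lemma 8: ||y||_2^2 where y_i ~ N(0, sigma^2 + ||theta*||_2^2) i.i.d., as in its proof.
    y = np.sqrt(sigma ** 2 + theta_norm ** 2) * rng.standard_normal(n)
    thresh8 = (sigma ** 2 + theta_norm ** 2) * (n - 2 * np.sqrt(n * np.log(1 / delta)))
    lemma8_failures += int((y @ y) < thresh8)

    # Lemma 9: operator norm of an n x k standard Gaussian matrix.
    XH = rng.standard_normal((n, k))
    thresh9 = np.sqrt(n) + np.sqrt(k) + np.sqrt(2 * np.log(2 / delta))
    lemma9_failures += int(np.linalg.norm(XH, 2) > thresh9)

print("Lemma 8 empirical failure rate:", lemma8_failures / trials, "vs delta =", delta)
print("Lemma 9 empirical failure rate:", lemma9_failures / trials, "vs delta =", delta)
```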
Armed with these lemmas, we can now prove Theorem 1.

Proof of Theorem 1. With foresight, set
$$\zeta = \sqrt{\frac{c_4 \sigma^2 n}{s \log^2(3p/s)}}$$
for a constant $c_4$ that will be determined by the analysis.

Case 1 ($\|\hat\theta_H\|_2 \ge \|\theta^*\|_2 + \zeta$). Recall that $\|\theta^*\|_2 = \|\theta^*_H\|_2$, since $\theta^*$ is zero in all entries after the $k$th coordinate. By Lemma 4 we have
$$R(\hat\theta) \ge \|\hat\theta_H - \theta^*_H\|_2^2 \ge \left(\|\hat\theta_H\|_2 - \|\theta^*\|_2\right)^2 \ge \zeta^2 = \frac{c_4 \sigma^2 n}{s \log^2(3p/s)}.$$

Case 2 ($\|\hat\theta_H\|_2 < \|\theta^*\|_2 + \zeta$). By Lemma 4, we have $R(\hat\theta) \ge \varepsilon \|\hat\theta_T\|_2^2$. The estimator $\hat\theta$ is $s$-sparse with probability at least $1 - \delta$. Hence, combining Lemmas 5 and 7 and taking a union bound (and using $(a+b+c)^2 \le 3(a^2+b^2+c^2)$), we get that, for an absolute positive constant $c$, with probability $1 - 2\delta$,
$$R(\hat\theta) \ge \frac{c\, \|y - X_H \hat\theta_H\|_2^2}{s \log^2\frac{3(p-k)}{s} + n + \log^2(1/\delta)} \ge \frac{c\, \big[\|y\|_2 - \|X_H \hat\theta_H\|_2\big]_+^2}{s \log^2\frac{3(p-k)}{s} + n + \log^2(1/\delta)}.$$

Applying Lemma 8, we find that with probability at least $1 - 3\delta$,
$$R(\hat\theta) \ge \frac{c\left[\sqrt{(\sigma^2 + \|\theta^*\|_2^2)\, n \left(1 - 2\sqrt{\tfrac{\log(1/\delta)}{n}}\right)} - \|X_H \hat\theta_H\|_2\right]_+^2}{s \log^2\frac{3(p-k)}{s} + n + \log^2(1/\delta)}.$$

Next, by applying Lemma 9 and the assumption of this case, with probability at least $1 - 4\delta$,
$$R(\hat\theta) \ge \frac{c\left[\sqrt{(\sigma^2 + \|\theta^*\|_2^2)\, n \left(1 - 2\sqrt{\tfrac{\log(1/\delta)}{n}}\right)} - \left(\sqrt{n} + \sqrt{k} + \sqrt{2\log(2/\delta)}\right)\|\hat\theta_H\|_2\right]_+^2}{s \log^2\frac{3(p-k)}{s} + n + \log^2(1/\delta)}$$
$$\ge \frac{c\left[\sqrt{(\sigma^2 + \|\theta^*\|_2^2)\, n \left(1 - 2\sqrt{\tfrac{\log(1/\delta)}{n}}\right)} - \left(\sqrt{n} + \sqrt{k} + \sqrt{2\log(2/\delta)}\right)\left(\|\theta^*\|_2 + \sqrt{\tfrac{c_4\sigma^2 n}{s\log^2(3p/s)}}\right)\right]_+^2}{s \log^2\frac{3(p-k)}{s} + n + \log^2(1/\delta)}.$$

By choosing $c_2 > 0$ to be small enough, we can force $n$ to be as large as desired, since $n \ge \log^2(1/\delta) \ge \log^2(1/c_2)$. Recall also that, as $n \to \infty$, both $k = o(n)$ and $\log(2/\delta) = o(n)$, by the assumptions $n \ge k^{1+c_1}$ and $n \ge \log^2(1/\delta)$. Therefore, for $n$ large enough,
$$\sqrt{n} + \sqrt{k} + \sqrt{2\log(2/\delta)} \le \left(1 + \tfrac{c_1^2}{8}\right)\sqrt{n\left(1 - 2\sqrt{\tfrac{\log(1/\delta)}{n}}\right)} \qquad \text{and} \qquad 1 - 2\sqrt{\tfrac{\log(1/\delta)}{n}} \ge \tfrac{1}{2}.$$
Thus, with probability at least $1 - 4\delta$,
$$R(\hat\theta) \ge \frac{(c/2)\, n \left[\sqrt{\sigma^2 + \|\theta^*\|_2^2} - \left(1 + \tfrac{c_1^2}{8}\right)\|\theta^*\|_2\left(1 + \sqrt{\tfrac{c_4\sigma^2 n}{\|\theta^*\|_2^2\, s \log^2(3p/s)}}\right)\right]_+^2}{s \log^2\frac{3(p-k)}{s} + n + \log^2(1/\delta)}.$$

Since $s \ge n \ge \log^2(1/\delta)$ and $\log\frac{3(p-k)}{s} \le \log\frac{3p}{s}$, the denominator is at most $3 s \log^2(3p/s)$, and $\frac{c_4\sigma^2 n}{s\log^2(3p/s)} \le \frac{c_4\sigma^2}{\log^2(3p/s)}$, so
$$R(\hat\theta) \ge \frac{(c/6)\, \sigma^2 n \left[\sqrt{1 + \tfrac{\|\theta^*\|_2^2}{\sigma^2}} - \left(1 + \tfrac{c_1^2}{8}\right)\tfrac{\|\theta^*\|_2}{\sigma}\left(1 + \sqrt{\tfrac{c_4\sigma^2}{\|\theta^*\|_2^2 \log^2(3p/s)}}\right)\right]_+^2}{s \log^2(3p/s)}.$$

Defining $r := \|\theta^*\|_2/\sigma$ and simplifying, we have
$$R(\hat\theta) \ge \frac{(c/6)\, \sigma^2 n \left[\sqrt{1 + r^2} - \left(1 + \tfrac{c_1^2}{8}\right) r - \left(1 + \tfrac{c_1^2}{8}\right)\frac{\sqrt{c_4}}{\log(3p/s)}\right]_+^2}{s \log^2(3p/s)}.$$

Since $\sqrt{1 + r^2} - (1 + \tfrac{c_1^2}{8}) r$ is a decreasing function of $r$, and, by assumption, $r = \|\theta^*\|_2/\sigma \le 1/c_1$, and $0 < c_1 \le 1$ gives $\sqrt{1 + 1/c_1^2} - (1 + \tfrac{c_1^2}{8})/c_1 \ge \tfrac{c_1}{8}$, we have
$$R(\hat\theta) \ge \frac{(c/6)\, \sigma^2 n \left[\tfrac{c_1}{8} - \left(1 + \tfrac{c_1^2}{8}\right)\frac{\sqrt{c_4}}{\log(3p/s)}\right]_+^2}{s \log^2(3p/s)}.$$
Since $s \le p$, we have $\log(3p/s) \ge 1$, so
$$R(\hat\theta) \ge \frac{(c/6)\, \sigma^2 n \left[\tfrac{c_1}{8} - \left(1 + \tfrac{c_1^2}{8}\right)\sqrt{c_4}\right]_+^2}{s \log^2(3p/s)}.$$
Recall that $0 < c_1 \le 1$. Thus, if $c_4$ is chosen to be a sufficiently small positive constant, then
$$R(\hat\theta) \ge \frac{c_3 \sigma^2 n}{s \log^2(3p/s)},$$
completing the proof.

5. Proof of Theorem 2

This section is devoted to proving Theorem 2, so the assumptions of Theorem 2 are in scope throughout this section. As in Section 4, our proof proceeds through a series of lemmas. The first lemma is an immediate consequence of (Schneider and Tardivel, 2020, Proposition 1).

Lemma 10. Almost surely, there is a unique minimizer of $\|\theta\|_1$ subject to $X\theta = y$.

The following lemma appears to be known (Chen et al., 2001); we include a proof in Appendix A because we did not find one that applies in our setting.

Lemma 11. Almost surely, the output $\hat\theta$ of $A_{BP}$ is $n$-sparse.

Proof of Theorem 2. By Lemma 11, for any $\delta > 0$, $A_{BP}$ is an $(n, \delta)$-sparse interpolator for the $(k, p, n, \varepsilon, \sigma)$-scenario $P$. Applying Theorem 1 with $s = n$ completes the proof.

6. Discussion

We have demonstrated that, for interpolating linear regression with Gaussian data, outputting a sparse parameter vector can be harmful, even when learning a sparse target.

Our proofs only use a few of the properties of Gaussian distributions, so, in the case that the covariance is $\Sigma_{k,\varepsilon}$, our results should generalize to sub-Gaussian and log-concave distributions. We chose to analyze $\Sigma_{k,\varepsilon}$ because it is arguably the canonical case where OLS enjoys benign overfitting, and it leads to clean and interpretable bounds. Handling a wider variety of covariance matrices is another very natural future direction. A starting point would be to generalize (Adamczak et al., 2012, Theorem 4.2).

Recent research has shown that parameterizing linear models by simple two-layer linear neural networks with diagonal weight matrices leads to implicit regularization that interpolates between the $\ell_1$-norm used by basis pursuit and the $\ell_2$-norm used by OLS (Woodworth et al., 2020; Azulay et al., 2021). The connection of this work to neural networks, together with the stark difference between $\ell_1$ and $\ell_2$ regularization in the context of benign overfitting highlighted in this paper, motivates the study of benign overfitting with these models. (We thank Olivier Bousquet for suggesting this last problem.)
Appendix A. Proof of Lemma 11

By Lemma 10, we may assume without loss of generality that there is a unique minimizer of $\|\theta\|_1$ subject to $X\theta = y$. Assume for the sake of contradiction that $\hat\theta$ has $\|\hat\theta\|_0 = s > n$. Let $I := \{i \in [p] : \hat\theta_i \neq 0\}$.

Since $|I| > n$, the columns in $\{X_i : i \in I\}$ are linearly dependent. That is, there exists a set of weights $\{\lambda_i : i \in I\}$, at least one of which is nonzero, such that
$$\sum_{i \in I} \lambda_i X_i = 0. \qquad (3)$$
Let the vector $\lambda \in \mathbb{R}^p$ be obtained by filling in $\lambda_i = 0$ for $i \notin I$. From here, we divide our analysis into cases.

Case 1 ($\sum_{i \in I} \lambda_i\,\mathrm{sign}(\hat\theta_i) > 0$). We will prove by contradiction that this case cannot happen. For an $\eta > 0$ to be set later, consider
$$v := \hat\theta - \eta \sum_{i \in I} \lambda_i e_i = \hat\theta - \eta\lambda.$$
First, note that
$$Xv = X\hat\theta - \eta \sum_{i \in I} \lambda_i X_i = y - 0 = y.$$
We will now prove the claim that, for a small enough $\eta$, $\|v\|_1 < \|\hat\theta\|_1$. This will lead to the desired contradiction. To establish this claim, it suffices to prove that $-\lambda/\|\lambda\|_2$ is a descent direction for $\|\cdot\|_1$ at $\hat\theta$. Toward this end, consider an arbitrary member $z$ of the subgradient of $\|\cdot\|_1$ at $\hat\theta$. Recalling that $\lambda_i = 0$ when $\hat\theta_i = 0$, we have
$$\lambda \cdot z = \sum_{i \in I} \lambda_i\,\mathrm{sign}(\hat\theta_i) > 0,$$
by the assumption of this case. This implies that $-\lambda/\|\lambda\|_2$ is indeed a descent direction, so that, for a small enough $\eta$, $\|v\|_1 < \|\hat\theta\|_1$. Recalling that $Xv = y$ then yields a contradiction.

Case 2 ($\sum_{i \in I} \lambda_i\,\mathrm{sign}(\hat\theta_i) < 0$). This case leads to a contradiction symmetrically to the proof in Case 1, using $v = \hat\theta + \eta\lambda$.

Case 3 ($\sum_{i \in I} \lambda_i\,\mathrm{sign}(\hat\theta_i) = 0$). Choose $i_0$ arbitrarily from among those $i \in I$ with $\lambda_i \neq 0$ that have the minimum value of $|\hat\theta_i/\lambda_i|$, that is, $i_0 \in \arg\min_{i \in [p]}\{|\hat\theta_i/\lambda_i| : \lambda_i \neq 0\}$. As in the first case, suppose that $\hat\theta_{i_0}/\lambda_{i_0} > 0$ (the other case can be handled symmetrically, replacing $\lambda$ by $-\lambda$). Set $\bar\eta = \hat\theta_{i_0}/\lambda_{i_0}$, and once again consider the vector
$$v := \hat\theta - \bar\eta\lambda.$$
As before, for all $\eta' \in [0, \bar\eta]$, $X(\hat\theta - \eta'\lambda) = y$. Furthermore, since $i_0$ minimizes $|\hat\theta_i/\lambda_i|$ among $i$ with $\lambda_i \neq 0$, for each such $\eta'$ and every $i$ with $(\hat\theta - \eta'\lambda)_i \neq 0$, we have $\mathrm{sign}((\hat\theta - \eta'\lambda)_i) = \mathrm{sign}(\hat\theta_i)$. This means that, along the path from $\hat\theta$ to $v$, any subgradient $z$ of the $\ell_1$-norm satisfies $z_i = \mathrm{sign}(\hat\theta_i)$ for all $i \in I$ with $(\hat\theta - \eta'\lambda)_i \neq 0$. But this means that, throughout this path, $\lambda$ is orthogonal to such a subgradient, which in turn means that the $\ell_1$-norm is unchanged. When $\eta' = \bar\eta$, we have an interpolator with the same $\ell_1$-norm as $\hat\theta$ but one fewer nonzero component, a contradiction.
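Lemma 11's conclusion is also easy to observe numerically. The following sketch (ours; it assumes NumPy and SciPy, and reuses the same linear-programming reformulation of basis pursuit sketched in Section 3) solves the basis pursuit LP on random instances with $p > n$ and counts the coordinates of the solution that exceed a numerical tolerance; this count should be at most $n$.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    # min ||theta||_1 s.t. X theta = y, as an LP over theta = u - v with u, v >= 0.
    n, p = X.shape
    res = linprog(np.ones(2 * p), A_eq=np.hstack([X, -X]), b_eq=y,
                  bounds=(0, None), method="highs")
    assert res.success
    return res.x[:p] - res.x[p:]

rng = np.random.default_rng(3)
n, p, k, eps, sigma = 25, 500, 5, 1.0 / 500, 1.0
for trial in range(5):
    scales = np.concatenate([np.ones(k), np.sqrt(eps) * np.ones(p - k)])
    X = rng.standard_normal((n, p)) * scales
    theta_star = np.zeros(p)
    theta_star[:k] = 1.0 / np.sqrt(k)
    y = X @ theta_star + sigma * rng.standard_normal(n)
    nnz = int(np.sum(np.abs(basis_pursuit(X, y)) > 1e-8))   # entries above a small tolerance
    print(f"trial {trial}: {nnz} nonzero coordinates (n = {n})")
```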
References

Radosław Adamczak, Rafał Latała, Alexander Litvak, Alain Pajor, and Nicole Tomczak-Jaegermann. Chevet type inequality and norms of submatrices. Studia Mathematica, 210(1):35-56, 2012.

Shahar Azulay, Edward Moroshko, Mor Shpigel Nacson, Blake E. Woodworth, Nathan Srebro, Amir Globerson, and Daniel Soudry. On the implicit bias of initialization shape: beyond infinitesimal mirror descent. In International Conference on Machine Learning, pages 468-477, 2021.

Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063-30070, 2020.

Mikhail Belkin. Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation. arXiv preprint arXiv:2105.14368, 2021.

Mikhail Belkin, Daniel J. Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849-15854, 2019.

Niladri S. Chatterji and Philip M. Long. Foolish crowds support benign overfitting. arXiv preprint arXiv:2110.02914, 2021.

Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129-159, 2001.

Geoffrey Chinot, Matthias Löffler, and Sara van de Geer. On the robustness of minimum-norm interpolators. arXiv preprint arXiv:2012.00807, 2020.

Konstantin Donhauser, Nicolo Ruggeri, Stefan Stojanovic, and Fanny Yang. Fast rates for noisy interpolation require rethinking the effects of inductive bias. arXiv preprint arXiv:2203.03597, 2022.

David P. Helmbold and Philip M. Long. On the necessity of irrelevant variables. Journal of Machine Learning Research, 13(1):2145-2170, 2012.

Peizhong Ju, Xiaojun Lin, and Jia Liu. Overfitting can be harmless for basis pursuit, but only to a degree. In Advances in Neural Information Processing Systems, volume 33, 2020.

Frederic Koehler, Lijia Zhou, Danica J. Sutherland, and Nathan Srebro. Uniform convergence of interpolators: Gaussian width, norm bounds and benign overfitting. In Advances in Neural Information Processing Systems, 2021.

Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, pages 1302-1338, 2000.

Yue Li and Yuting Wei. Minimum ℓ1-norm interpolators: Precise asymptotics and multiple descent. arXiv preprint arXiv:2110.09502, 2021.

Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian, and Anant Sahai. Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory, 1(1):67-83, 2020.

Jeffrey Negrea, Gintare Karolina Dziugaite, and Daniel Roy. In defense of uniform convergence: Generalization via derandomization with an application to interpolating predictors. In International Conference on Machine Learning, pages 7263-7272, 2020.

Ulrike Schneider and Patrick Tardivel. The geometry of uniqueness, sparsity and clustering in penalized estimation. arXiv preprint arXiv:2004.09106, 2020.

James Surowiecki. The Wisdom of Crowds. Anchor, 2005.

Alexander Tsigler and Peter L. Bartlett. Benign overfitting in ridge regression. arXiv preprint arXiv:2009.14286, 2020.

Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

Guillaume Wang, Konstantin Donhauser, and Fanny Yang. Tight bounds for minimum ℓ1-norm interpolation of noisy data. arXiv preprint arXiv:2111.05987, 2021.

Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and rich regimes in overparametrized models. In Conference on Learning Theory, pages 3635-3673, 2020.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107-115, 2021.