# Optimal Regularization Can Mitigate Double Descent

Published as a conference paper at ICLR 2021

Preetum Nakkiran, Harvard University (preetum@cs.harvard.edu)
Prayaag Venkat, Harvard University (pvenkat@g.harvard.edu)
Sham Kakade, Microsoft Research & University of Washington (sham@cs.washington.edu)
Tengyu Ma, Stanford University (tengyuma@stanford.edu)

ABSTRACT

Recent empirical and theoretical studies have shown that many learning algorithms, from linear regression to neural networks, can have test performance that is non-monotonic in quantities such as the sample size and model size. This striking phenomenon, often referred to as double descent, has raised the question of whether we need to rethink our current understanding of generalization. In this work, we study whether the double-descent phenomenon can be avoided by using optimal regularization. Theoretically, we prove that for certain linear regression models with isotropic data distribution, optimally-tuned ℓ2 regularization achieves monotonic test performance as we grow either the sample size or the model size. We also demonstrate empirically that optimally-tuned ℓ2 regularization can mitigate double descent for more general models, including neural networks. Our results suggest that it may also be informative to study the test risk scalings of various algorithms in the context of appropriately tuned regularization.

1 INTRODUCTION

Recent works have demonstrated a ubiquitous double descent phenomenon present in a range of machine learning models, including decision trees, random features, linear regression, and deep neural networks (Opper, 1995; 2001; Advani & Saxe, 2017; Spigler et al., 2018; Belkin et al., 2018; Geiger et al., 2019b; Nakkiran et al., 2020; Belkin et al., 2019; Hastie et al., 2019; Bartlett et al., 2019; Muthukumar et al., 2019; Bibas et al., 2019; Mitra, 2019; Mei & Montanari, 2019; Liang & Rakhlin, 2018; Liang et al., 2019; Xu & Hsu, 2019; Dereziński et al., 2019; Lampinen & Ganguli, 2018; Deng et al., 2019; Nakkiran, 2019). The phenomenon is that models exhibit a peak of high test risk when they are just barely able to fit the train set, that is, to interpolate. For example, as we increase the size of models, test risk first decreases, then increases to a peak around when the effective model size is close to the training data size, and then decreases again in the overparameterized regime. Also surprising is that Nakkiran et al. (2020) observe a double descent as we increase sample size, i.e., for a fixed model, training the model with more data can hurt test performance.

These striking observations highlight a potential gap in our understanding of generalization and an opportunity for improved methods. Ideally, we seek to use learning algorithms which robustly improve performance as the data or model size grow, and do not exhibit such unexpected non-monotonic behaviors. In other words, we aim to improve the test performance in situations which would otherwise exhibit high test risk due to double descent. Here, a natural strategy would be to use a regularizer and tune its strength on a validation set. This motivates the central question of this work:

When does optimally tuned regularization mitigate or remove the double-descent phenomenon?

Another motivation is the fact that double descent is largely observed for unregularized or under-regularized models in practice.
As an example, Figure 1 shows a simple linear ridge regression setting in which the unregularized estimator exhibits double descent, but an optimally-tuned regularizer has monotonic test performance.

Figure 1: Test Risk vs. Num. Samples for Isotropic Ridge Regression in d = 500 dimensions. Unregularized regression is non-monotonic in samples, but optimally-regularized regression (λ = λopt) is monotonic. In this setting, the optimal regularizer λopt does not depend on the number of samples n (Lemma 2), but this is not always true; see Figure 2.

Our Contributions: We study this question from both a theoretical and empirical perspective. Theoretically, we start with the setting of high-dimensional linear regression. Linear regression is a sensible starting point to study these questions, since it already exhibits many of the qualitative features of double descent in more complex models (e.g. Belkin et al. (2019); Hastie et al. (2019) and further related works in Section 1.1). Our work shows that optimally-tuned ridge regression can achieve both sample-wise monotonicity and model-size-wise monotonicity under certain assumptions. Concretely, we show:

1. Sample-wise monotonicity: In the setting of well-specified linear regression with isotropic features/covariates (Figure 1), we prove that optimally-tuned ridge regression yields monotonic test performance with increasing samples. That is, more data never hurts for optimally-tuned ridge regression. (See Theorem 1.)

2. Model-wise monotonicity: We consider a setting where the input/covariate lives in a high-dimensional ambient space with isotropic covariance. Given a fixed model size d (which might be much smaller than the ambient dimension), we consider the family of models which first project the input to a random d-dimensional subspace, and then compute a linear function in this projected feature space. (This is nearly identical to the models of double descent considered in Hastie et al. (2019, Section 5.1).) We prove that in this setting, as we grow the model size, optimally-tuned ridge regression over the projected features has monotone test performance. That is, with optimal regularization, bigger models are always better or the same. (See Theorem 3.)

3. Monotonicity in the real world: We also demonstrate several richer empirical settings where optimal ℓ2 regularization induces monotonicity, including random feature classifiers and convolutional neural networks. This suggests that the mitigating effect of optimal regularization may hold more generally in broad machine learning contexts. (See Section 5.)

A few remarks are in order:

Problem-specific vs. Minimax and Bayesian. It is worth noting that our results hold for all linear ground-truths, rather than holding for only the worst-case ground-truth or a random ground-truth. Indeed, the minimax optimal estimator and the Bayes optimal estimator are both trivially sample-wise and model-wise monotonic with respect to the minimax risk or the Bayes risk. However, they do not guarantee monotonicity of the risk itself for a given fixed problem. In particular, there exist minimax optimal estimators which are not sample-monotonic in the sense we desire.

Universal vs. Asymptotic. We also remark that our analysis is not only non-asymptotic but also works for all possible input dimensions, model sizes, and sample sizes. To our knowledge, the results herein are the first non-asymptotic sample-wise and model-wise monotonicity results for linear regression.
(See the discussion of related works Hastie et al. (2019); Mei & Montanari (2019) for related results in the asymptotic setting.) Our work reveals aspects of the problem that were not present in prior asymptotic works. For example, we empirically show that optimal regularization can eliminate even triple descent in ridge regression (Figure 2). Moreover, we show that for non-Gaussian covariates, optimally-tuned ridge regression is not always sample-monotonic: we give a counterexample in Section 4.

Towards a more general characterization. Our theoretical results crucially rely on the covariance of the data being isotropic. A natural next question is if and when the same results can hold more generally. A full answer to this question is beyond the scope of this paper, though we give the following results:

1. Optimally-tuned ridge regression is not always sample-monotonic: we show a counterexample for a certain non-Gaussian data distribution with heteroscedastic noise. We are not aware of prior work pointing out this fact. (See Section 4 for the counterexample and intuitions.)

2. For non-isotropic Gaussian covariates, we can achieve sample-wise monotonicity with a regularizer that depends on the population covariance matrix of the data. This suggests unlabeled data might also help mitigate double descent in some settings, because the population covariance can be estimated from unlabeled data. (See Appendix B.)

3. For non-isotropic Gaussian covariates, we conjecture that optimally-tuned ridge regression is sample-monotonic even with a standard ℓ2 regularizer (as in Figure 2). We derive a sufficient condition for this conjecture. Because current random matrix theory may be insufficient to prove this conjecture, we verify it numerically on a wide variety of cases. (See Appendix B for details.)

The last two results above highlight the importance of the form of the regularizer, which leads to the open question: How do we design good regularizers which mitigate or remove double descent? We hope that our results can motivate future work on mitigating the double descent phenomenon, and allow us to train high-performance models which do not exhibit non-monotonic behaviors.

1.1 RELATED WORKS

The study of non-monotonicity in learning algorithms existed prior to double descent, with a long history going back to (at least) Trunk (1979) and LeCun et al. (1991), where the former was largely empirical observations and the latter studied the sample non-monotonicity of unregularized linear regression in terms of the eigenspectrum of the covariance matrix; the difference from our work is that we study this in the context of optimal regularization. Related observations appear in Duin (1995; 2000); Opper (2001); Loog & Duin (2012). Loog et al. (2019) introduce the same notion of risk monotonicity which we consider, and study several examples of monotonic and non-monotonic procedures.

Double descent of test risk as a function of model size was considered recently in more generality by Belkin et al. (2018). Similar behavior was observed empirically in earlier work in somewhat more restricted settings (Trunk (1979); Opper (1995; 2001); Skurichina & Duin (2002); LeCun et al. (1991)) and more recently in Advani & Saxe (2017); Geiger et al. (2019a); Spigler et al. (2018); Neal et al. (2018). Recently, Nakkiran et al. (2020) demonstrated a generalized double descent phenomenon on modern deep networks, and highlighted sample non-monotonicity as an aspect of double descent.
A recent stream of theoretical works considers model-wise double descent in simplified settings, often via linear models for regression or classification. This also connects to works on high-dimensional regression in the statistics literature. A partial list of works in these areas includes Belkin et al. (2019); Hastie et al. (2019); Bartlett et al. (2019); Muthukumar et al. (2019); Bibas et al. (2019); Mitra (2019); Mei & Montanari (2019); Liang & Rakhlin (2018); Liang et al. (2019); Xu & Hsu (2019); Dereziński et al. (2019); Lampinen & Ganguli (2018); Deng et al. (2019); Nakkiran (2019); Mahdaviyeh & Naulet (2019); Dobriban et al. (2018); Dobriban & Sheng (2019); Kobak et al. (2018). Of these, the most closely related to our work are Hastie et al. (2019); Dobriban et al. (2018); Mei & Montanari (2019). Specifically, Hastie et al. (2019) consider the risk of unregularized and regularized linear regression in an asymptotic regime, where the dimension d and the number of samples n scale to infinity together, at a constant ratio d/n. In contrast, we show non-asymptotic results, and are able to consider increasing the number of samples for a fixed model, without scaling both together. Mei & Montanari (2019) derive similar results for unregularized and regularized random features, also in an asymptotic limit. The non-asymptotic versions of the settings considered in Hastie et al. (2019) are almost identical to ours; for example, our projection model in Section 3 is nearly identical to the model in Hastie et al. (2019, Section 5.1). Finally, subsequent to our work, d'Ascoli et al. (2020) identified triple descent in an asymptotic setting.

2 SAMPLE MONOTONICITY IN RIDGE REGRESSION

In this section, we prove that optimally-regularized ridge regression has test risk that is monotonic in samples, for isotropic Gaussian covariates and a linear response. This confirms the behavior empirically observed in Figure 1. We also show that this monotonicity is not fragile: using larger-than-optimal regularization is still sample-monotonic (consistent with Figure 1).

Formally, we consider the following linear regression problem in $d$ dimensions. The input/covariate $x \in \mathbb{R}^d$ is generated from $\mathcal{N}(0, I_d)$, and the output/response is generated by $y = \langle x, \beta^*\rangle + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, for some unknown parameter $\beta^* \in \mathbb{R}^d$. We denote the joint distribution of $(x, y)$ by $\mathcal{D}$. We are given $n$ training examples $\{(x_i, y_i)\}_{i=1}^n$ sampled i.i.d. from $\mathcal{D}$. We aim to learn a linear model $f_\beta(x) = \langle x, \beta\rangle$ with small population risk $R(\beta) := \mathbb{E}_{(x,y)\sim\mathcal{D}}[(\langle x, \beta\rangle - y)^2]$.

For simplicity, let $X \in \mathbb{R}^{n\times d}$ be the data matrix that contains the $x_i$'s as rows, and let $y \in \mathbb{R}^n$ be the column vector that contains the responses $y_i$ as entries. For any estimator $\hat\beta_n(X, y)$, as a function of the $n$ samples, define the expected risk of the estimator as

$$R(\hat\beta_n) := \mathbb{E}_{X, y \sim \mathcal{D}^n}\big[R(\hat\beta_n(X, y))\big] \tag{1}$$

We consider the regularized least-squares estimator, also known as the ridge regression estimator. For a given $\lambda > 0$, define

$$\hat\beta_{n,\lambda} := \arg\min_\beta \|X\beta - y\|_2^2 + \lambda\|\beta\|_2^2 = (X^T X + \lambda I_d)^{-1} X^T y \tag{2}$$

Here $I_d$ denotes the $d$-dimensional identity matrix. Let $\lambda^{opt}_n$ be the optimal ridge parameter (that achieves the minimum expected risk) given $n$ samples: $\lambda^{opt}_n := \arg\min_{\lambda \ge 0} R(\hat\beta_{n,\lambda})$. Let $\hat\beta^{opt}_n$ be the estimator that corresponds to $\lambda^{opt}_n$, that is, $\hat\beta^{opt}_n := \arg\min_\beta \|X\beta - y\|_2^2 + \lambda^{opt}_n\|\beta\|_2^2$.
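To make the setup concrete, here is a minimal numpy sketch (not the authors' code; the dimension, ground truth, noise level, λ grid, and trial counts are arbitrary choices of ours) that estimates the expected risk $R(\hat\beta_{n,\lambda})$ by Monte Carlo and compares near-unregularized ridge regression with a grid-tuned λ as $n$ grows, in the spirit of Figure 1.

```python
# Minimal sketch of the isotropic setting above (assumed parameters, not the paper's code):
# Monte Carlo estimate of R(beta_hat_{n,lambda}) = E||beta_hat - beta*||^2 + sigma^2,
# comparing near-unregularized ridge with a lambda tuned over a grid, as n grows.
import numpy as np

def expected_risk(n, d, beta_star, sigma, lam, trials=200, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        X = rng.standard_normal((n, d))
        y = X @ beta_star + sigma * rng.standard_normal(n)
        beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
        total += np.sum((beta_hat - beta_star) ** 2) + sigma**2
    return total / trials

d, sigma = 20, 0.5
beta_star = np.ones(d) / np.sqrt(d)           # ||beta*||_2 = 1, an arbitrary choice
lams = np.logspace(-3, 3, 25)                 # grid standing in for the argmin over lambda
for n in [5, 10, 15, 20, 25, 40, 80]:
    unreg = expected_risk(n, d, beta_star, sigma, lam=1e-8)   # ~lambda=0; peaks near n=d
    tuned = min(expected_risk(n, d, beta_star, sigma, lam=l) for l in lams)
    print(f"n={n:3d}   lambda~0: {unreg:10.2f}   optimally tuned: {tuned:6.3f}")
```

Up to Monte Carlo noise, the tuned column should be non-increasing in $n$, and (per Lemma 2 below) printing the minimizing λ from the grid should show it staying near $d\sigma^2/\|\beta^*\|_2^2 = 5$ for this choice of parameters.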
Our main theorem in this section shows that the expected risk of $\hat\beta^{opt}_n$ monotonically decreases as $n$ increases.

Theorem 1. In the setting above, the expected test risk of optimally-regularized, well-specified, isotropic linear regression is monotonic in samples. That is, for all $\beta^* \in \mathbb{R}^d$ and all $d \in \mathbb{N}$, $n \in \mathbb{N}$, $\sigma > 0$,
$$R(\hat\beta^{opt}_{n+1}) \le R(\hat\beta^{opt}_{n})$$

The above theorem shows a strong form of monotonicity, since it holds for every fixed ground-truth $\beta^*$, and does not require averaging over any prior on ground-truths. Moreover, it holds non-asymptotically, for every fixed $n, d \in \mathbb{N}$.

Obtaining such non-asymptotic results is nontrivial, since we cannot rely on concentration properties of the involved random variables. In particular, evaluating $R(\hat\beta^{opt}_n)$ as a function of the problem parameters ($n$, $\sigma$, $\beta^*$, and $d$) is technically challenging. In fact, we suspect that a simple closed-form expression does not exist. The key idea towards proving the theorem is to derive a partial evaluation: the following lemmas show that we can write $R(\hat\beta^{opt}_n)$ in the form $\mathbb{E}[g(\gamma, \sigma, n, d, \beta^*)]$, where $\gamma \in \mathbb{R}^d$ contains the singular values of $X$. We will then couple the randomness of the data matrices obtained by adding a single sample, and use singular value interlacing to compare their singular values.

Lemma 1. In the setting of Theorem 1, let $\gamma = (\gamma_1, \ldots, \gamma_d)$ be the singular values of the data matrix $X \in \mathbb{R}^{n\times d}$. (If $n < d$, we pad with $\gamma_i = 0$ for $i > n$.) Let $\Gamma_n$ be the distribution of $\gamma$. Then, the expected test risk is
$$R(\hat\beta_{n,\lambda}) = \mathbb{E}_{(\gamma_1,\ldots,\gamma_d)\sim\Gamma_n}\left[\sum_{i=1}^{d} \frac{\|\beta^*\|_2^2\lambda^2/d + \sigma^2\gamma_i^2}{(\gamma_i^2 + \lambda)^2}\right] + \sigma^2 \tag{3}$$

From Lemma 1, the lemma below follows directly by taking derivatives to find the optimal $\lambda$.

Lemma 2. In the setting of Theorem 1, the optimal ridge parameter is constant for all $n$: $\lambda^{opt}_n = d\sigma^2/\|\beta^*\|_2^2$. Moreover, the optimal expected test risk can be written as
$$R(\hat\beta^{opt}_n) = \sigma^2 + \sigma^2\,\mathbb{E}_{(\gamma_1,\ldots,\gamma_d)\sim\Gamma_n}\left[\sum_{i=1}^{d} \frac{1}{\gamma_i^2 + d\sigma^2/\|\beta^*\|_2^2}\right] \tag{4}$$

Proofs of Lemmas 1 and 2 are deferred to Appendix Section A.1. Now we are ready to prove Theorem 1.

Proof of Theorem 1. Let $\widetilde X \in \mathbb{R}^{(n+1)\times d}$ and $X \in \mathbb{R}^{n\times d}$ be any two matrices which differ only by the last row of $\widetilde X$. By the Cauchy interlacing theorem (Theorem 4.3.4 of Horn et al. (1990); c.f. Lemma 3.4 of Marcus et al. (2014)), the singular values of $X$ and $\widetilde X$ are interlaced:
$$\forall i: \quad \gamma_{i-1}(X) \ge \gamma_i(\widetilde X) \ge \gamma_i(X)$$
where $\gamma_i(\cdot)$ is the $i$-th singular value (with the convention $\gamma_0(X) = \infty$). If we couple $\widetilde X$ and $X$, this induces a coupling $\Pi$ between the distributions $\Gamma_{n+1}$ and $\Gamma_n$ of the singular values of the data matrix for $n+1$ and $n$ samples. This coupling satisfies $\widetilde\gamma_i \ge \gamma_i$ with probability 1 for $(\{\widetilde\gamma_i\}, \{\gamma_i\}) \sim \Pi$. Now, expand the test risk using Lemma 2, and observe that each term in the sum in Equation (4) is monotone decreasing in $\gamma_i$. Thus:
$$R(\hat\beta^{opt}_n) = \sigma^2 + \sigma^2\,\mathbb{E}_{(\gamma_1,\ldots,\gamma_d)\sim\Gamma_n}\left[\sum_{i=1}^{d} \frac{1}{\gamma_i^2 + d\sigma^2/\|\beta^*\|_2^2}\right] \ge \sigma^2 + \sigma^2\,\mathbb{E}_{(\widetilde\gamma_1,\ldots,\widetilde\gamma_d)\sim\Gamma_{n+1}}\left[\sum_{i=1}^{d} \frac{1}{\widetilde\gamma_i^2 + d\sigma^2/\|\beta^*\|_2^2}\right] = R(\hat\beta^{opt}_{n+1}) \tag{6}$$

By similar techniques, we can also prove that over-regularization, that is, using ridge parameters $\lambda$ larger than the optimal value, is still monotonic. This proves the behavior empirically observed in Figure 1.

Theorem 2. In the same setting as Theorem 1, over-regularized regression is also monotonic in samples. That is, for all $d \in \mathbb{N}$, $n \in \mathbb{N}$, $\sigma > 0$, $\beta^* \in \mathbb{R}^d$, the following holds with $\lambda^* = d\sigma^2/\|\beta^*\|_2^2$:
$$\forall \lambda \ge \lambda^*: \quad R(\hat\beta_{n+1,\lambda}) \le R(\hat\beta_{n,\lambda})$$

Proof. In Section A.1.

3 MODEL-WISE MONOTONICITY IN RIDGE REGRESSION

In this section, we show that for a certain family of linear models, optimal regularization prevents model-wise double descent. That is, for a fixed number of samples, larger models are not worse than smaller models.
We consider the following learning problem. Informally, covariates live in a $p$-dimensional ambient space, and we consider models which first linearly project down to a random $d$-dimensional subspace, then perform ridge regression in that subspace, for some $d \le p$.

Formally, the covariate $x \in \mathbb{R}^p$ is generated from $\mathcal{N}(0, I_p)$, and the response is generated by $y = \langle x, \theta\rangle + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, for some unknown parameter $\theta \in \mathbb{R}^p$. Next, $n$ examples $\{(x_i, y_i)\}_{i=1}^n$ are sampled i.i.d. from this distribution. For a given model size $d \le p$, we first sample a random orthonormal matrix $P \in \mathbb{R}^{d\times p}$, which specifies our model. We then consider models which operate on $(\widetilde x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$, where $\widetilde x_i = P x_i$. We denote the joint distribution of $(\widetilde x, y)$ by $\widetilde{\mathcal{D}}$. Here, we emphasize that $p$ is some large ambient dimension and $d \le p$ is the size of the model we learn.

For a fixed $P$, we want to learn a linear model $f_{\hat\beta}(\widetilde x) = \langle \widetilde x, \hat\beta\rangle$ for estimating $y$, with small mean squared error on the distribution: $R_P(\hat\beta) := \mathbb{E}_{(\widetilde x, y)\sim\widetilde{\mathcal{D}}}[(\langle \widetilde x, \hat\beta\rangle - y)^2]$. For $n$ samples $(x_i, y_i)$, let $X \in \mathbb{R}^{n\times p}$ be the data matrix, $\widetilde X = XP^T \in \mathbb{R}^{n\times d}$ be the projected data matrix, and $y \in \mathbb{R}^n$ be the responses. For any estimator $\hat\beta(\widetilde X, y)$, as a function of the observed samples, define the expected risk of the estimator as
$$R(\hat\beta) := \mathbb{E}_P\, \mathbb{E}_{\widetilde X, y \sim \widetilde{\mathcal{D}}^n}\big[R_P(\hat\beta(\widetilde X, y))\big] \tag{7}$$
We consider the regularized least-squares estimator. For a given $\lambda > 0$, define
$$\hat\beta_{d,\lambda} := \arg\min_\beta \|\widetilde X\beta - y\|_2^2 + \lambda\|\beta\|_2^2 = (\widetilde X^T \widetilde X + \lambda I_d)^{-1}\widetilde X^T y \tag{8}$$
Let $\lambda^{opt}_d$ be the optimal ridge parameter (that achieves the minimum expected risk) for a model of size $d$, with $n$ samples: $\lambda^{opt}_d := \arg\min_{\lambda \ge 0} R(\hat\beta_{d,\lambda})$. Let $\hat\beta^{opt}_d$ be the estimator that corresponds to $\lambda^{opt}_d$, that is, $\hat\beta^{opt}_d := \arg\min_\beta \|\widetilde X\beta - y\|_2^2 + \lambda^{opt}_d\|\beta\|_2^2$.

Now, our main theorem in this setting shows that with optimal ℓ2 regularization, test performance is monotonic in model size.

Theorem 3. In the setting above, the expected test risk of the optimally-regularized model is monotonic in the model size $d$. That is, for all $p \in \mathbb{N}$, $\theta \in \mathbb{R}^p$, $d \le p$, $n \in \mathbb{N}$, $\sigma > 0$, we have
$$R(\hat\beta^{opt}_{d+1}) \le R(\hat\beta^{opt}_{d})$$

The proof of Theorem 3 is in Appendix A.2, and follows closely the proof of Theorem 1.

4 COUNTEREXAMPLES TO MONOTONICITY

In this section, we show that optimally-regularized ridge regression is not always monotonic in samples. We give a numeric counterexample in $d = 2$ dimensions, with non-Gaussian covariates and heteroscedastic noise. This does not contradict our main theorem in Section 2, since this distribution is not jointly Gaussian with isotropic marginals.

Counterexample. Here we give an example of a distribution over $(x, y)$ for which the expected error of optimally-regularized ridge regression with $n = 2$ samples is worse than with $n = 1$ samples. This counterexample is most intuitive to understand when the ridge parameter $\lambda$ is allowed to depend on the specific sample instance $(X, y)$ as well as on $n$.¹ We sketch the intuition for this below.

Consider the following distribution on $(x, y)$ in $d = 2$ dimensions. The distribution has one clean coordinate and one noisy coordinate: $(x, y) = (e_1, 1)$ with probability 1/2, and $(x, y) = (e_2, \pm 10)$ with probability 1/2, where the sign $\pm$ is uniformly random independent noise. This distribution is well-specified in that the optimal predictor is linear in $x$: $\mathbb{E}[y\,|\,x] = \langle \beta^*, x\rangle$ for $\beta^* = [1, 0]$. However, the noise is heteroscedastic. For $n = 1$ samples, the estimator can decide whether to use small $\lambda$ or large $\lambda$ depending on whether the sampled coordinate is the clean or the noisy one.
Specifically, for the sample $(x, y)$: if $x = e_1$, then the optimal ridge parameter is $\lambda = 0$; if $x = e_2$, then the optimal parameter is $\lambda = \infty$. For $n = 2$ samples, with probability 1/2 the two samples will hit both coordinates. In this case, the estimator must choose a single value of $\lambda$ for both coordinates. This leads to a suboptimal tradeoff, since the noisy coordinate demands large regularization, but this hurts estimation on the clean coordinate.

It turns out that a slight modification of the above also serves as a counterexample to monotonicity when the regularization parameter $\lambda$ is chosen depending only on $n$ (and not on the instance $X, y$). The distribution is: $(x, y) = (e_1, 1)$ w.p. 0.98 and $(x, y) = (e_2, \pm 20)$ w.p. 0.02. This distribution has the following property.

Theorem 4. There exists a distribution $\mathcal{D}$ over $(x, y)$ for $x \in \mathbb{R}^2$, $y \in \mathbb{R}$ with the following properties. Let $\hat\beta^{opt}_n$ be the optimally-regularized ridge regression solution for $n$ samples $(X, y)$ from $\mathcal{D}$. Then:

1. $\mathcal{D}$ is well-specified in that $\mathbb{E}_{\mathcal{D}}[y\,|\,x]$ is a linear function of $x$,
2. The expected test risk increases as a function of $n$, between $n = 1$ and $n = 2$. Specifically,
$$R(\hat\beta^{opt}_{n=1}) < R(\hat\beta^{opt}_{n=2})$$

¹ Recall that our model of optimal ridge regularization from Section 2 only allows $\lambda$ to depend on $n$ (not on $X, y$).

Proof. For $n = 1$ samples, it can be confirmed analytically that the expected risk $R(\hat\beta^{opt}_{n=1}) < 8.157$. This is achieved with $\lambda = 400/2401 \approx 0.166597$. For $n = 2$ samples, it can be confirmed numerically (via Mathematica) that the expected risk $R(\hat\beta^{opt}_{n=2}) > 8.179$. This is achieved with $\lambda = 0.642525$.

5 EXPERIMENTS

We now experimentally demonstrate that optimal ℓ2 regularization can mitigate double descent, in more general settings than Theorems 1 and 3.

5.1 SAMPLE MONOTONICITY

Here we show various settings where optimal ℓ2 regularization empirically induces sample-monotonic performance.

Non-isotropic Regression. We first consider the setting of Theorem 1, but with non-isotropic covariates $x$. That is, we perform ridge regression on samples $(x, y)$, where the covariate $x \in \mathbb{R}^d$ is generated from $\mathcal{N}(0, \Sigma)$ for $\Sigma \neq I_d$. As before, the response is generated by $y = \langle x, \beta^*\rangle + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, for some unknown parameter $\beta^* \in \mathbb{R}^d$. We consider the same ridge regression estimator, $\hat\beta_{n,\lambda} := \arg\min_\beta \|X\beta - y\|_2^2 + \lambda\|\beta\|_2^2$.

Figure 2: Test Risk vs. Num. Samples for Non-Isotropic Ridge Regression in d = 30 dimensions. Unregularized regression is non-monotonic in samples, but optimally-regularized regression is monotonic. Note that the optimal regularization λ depends on the number of samples n.

Figure 2 shows one instance of this, for a particular choice of $\Sigma$ and $\beta^*$. The covariance $\Sigma$ is diagonal, with $\Sigma_{i,i} = 10$ for $i \le 15$ and $\Sigma_{i,i} = 1$ for $i > 15$. That is, the covariance has one large eigenspace and one small eigenspace. The ground-truth $\beta^* = 0.1\, e_1 + e_{30}$, which lies almost entirely within the small eigenspace of $\Sigma$. The noise parameter is $\sigma = 0.5$. We see that unregularized regression ($\lambda = 0$) actually undergoes triple descent² in this setting, with the first peak around $n = 15$ samples due to the 15-dimensional large eigenspace, and the second peak at $n = d$. In this setting, optimally-regularized ridge regression is empirically monotonic in samples (Figure 2). Unlike the isotropic setting of Section 2, the optimal ridge parameter is no longer a constant, but varies with the number of samples $n$.

² See also the multiple descent behavior of kernel interpolants in Liang et al. (2020).
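The setting of Figure 2 is straightforward to simulate. Below is a small numpy sketch (our own reconstruction with an arbitrary λ grid and trial count, not the authors' experimental code) of this non-isotropic instance: Σ has a 15-dimensional eigenvalue-10 block and a 15-dimensional eigenvalue-1 block, β* = 0.1 e₁ + e₃₀, and σ = 0.5. Unregularized regression uses the minimum-norm solution; the tuned column takes the best λ from the grid.

```python
# Sketch of the Figure 2 setting (assumed lambda grid / trial counts, not the paper's code).
import numpy as np

d, sigma = 30, 0.5
Sigma_diag = np.array([10.0] * 15 + [1.0] * 15)        # one large and one small eigenspace
beta_star = np.zeros(d); beta_star[0] = 0.1; beta_star[-1] = 1.0   # 0.1*e1 + e30
rng = np.random.default_rng(0)

def population_risk(beta_hat):
    # E[(<x, beta_hat> - y)^2] = (beta_hat - beta*)^T Sigma (beta_hat - beta*) + sigma^2
    diff = beta_hat - beta_star
    return float(diff @ (Sigma_diag * diff) + sigma**2)

def ridge(X, y, lam):
    if lam == 0.0:                                     # minimum-norm least squares
        return np.linalg.pinv(X) @ y
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

lams = [0.0] + list(np.logspace(-2, 3, 30))
for n in [5, 15, 25, 30, 35, 60, 120]:
    avg = {lam: 0.0 for lam in lams}
    for _ in range(200):
        X = rng.standard_normal((n, d)) * np.sqrt(Sigma_diag)    # rows ~ N(0, Sigma)
        y = X @ beta_star + sigma * rng.standard_normal(n)
        for lam in lams:
            avg[lam] += population_risk(ridge(X, y, lam)) / 200
    print(f"n={n:3d}   lambda=0: {avg[0.0]:10.2f}   optimally tuned: {min(avg[l] for l in lams[1:]):6.3f}")
```

Up to Monte Carlo noise, the λ = 0 column should show peaks near n = 15 and n = 30 (the triple descent described above), while the tuned column should be roughly monotone.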
Random ReLU Features. We consider random ReLU features, in the random features framework of Rahimi & Recht (2008). For a given number of features $D$ and number of samples $n$, the random feature classifier is obtained by performing regularized linear regression on the embedding $\widetilde x := \mathrm{ReLU}(Wx)$, where $W \in \mathbb{R}^{D\times d}$ is a matrix with each entry sampled i.i.d. $\mathcal{N}(0, 1/d)$ and the ReLU is applied pointwise. This is equivalent to a 2-layer fully-connected neural network with a frozen (randomly-initialized) first layer, trained with ℓ2 loss and weight decay. In Appendix A.4, we apply random features to Fashion-MNIST (Xiao et al., 2017). From Appendix Figure 4a, we see that under-regularized models are non-monotonic, but optimal ℓ2 regularization is monotonic in samples. Moreover, the optimal ridge parameter $\lambda$ appears to be constant for all $n$, similar to our results for the isotropic setting in Theorem 1.

5.2 MODEL-SIZE MONOTONICITY

Here we empirically show that optimal ℓ2 regularization can mitigate model-wise double descent.

Random ReLU Features. We consider the same experimental setup as in Section 5.1, but now fix the number of samples $n$ and vary the number of random features $D$. This corresponds to varying the width of the corresponding 2-layer neural network. Figure 4b in Appendix A.4 shows the test error of the random features classifier, for $n = 500$ train samples and a varying number of random features. We see that under-regularized models undergo model-wise double descent, but optimal ℓ2 regularization prevents double descent.

Figure 3: Test Error vs. Model Size for 5-layer CNNs on CIFAR-100, with ℓ2 regularization (weight decay). Note that the optimal regularization λ varies with the model size.

Convolutional Neural Networks. We follow the experimental setup of Nakkiran et al. (2020) for model-wise double descent, and add varying amounts of ℓ2 regularization (weight decay). We chose the following setting from Nakkiran et al. (2020) because it exhibits double descent even with no added label noise. We consider the same family of 5-layer convolutional neural networks (CNNs) from Nakkiran et al. (2020), consisting of 4 convolutional layers of widths $[k, 2k, 4k, 8k]$ for varying $k \in \mathbb{N}$. We train and test on CIFAR-100 (Krizhevsky et al., 2009), an image classification problem with 100 classes. Inputs are normalized to $[-1, 1]^d$, and we use standard data augmentation of random horizontal flip and random crop with 4-pixel padding. All models are trained using Stochastic Gradient Descent (SGD) on the cross-entropy loss, with step size $0.1/\sqrt{T/512 + 1}$ at step $T$. We train for $10^6$ gradient steps, and use weight decay $\lambda$ for varying $\lambda$. Due to optimization instabilities for large $\lambda$, we use the model with the minimum train loss among the last 5K gradient steps. Figure 3 shows the test error of these models on CIFAR-100. Although unregularized and under-regularized models exhibit double descent, the test error of optimally-regularized models is largely monotonic. Note that the optimal regularization $\lambda$ varies with the model size; no single regularization value is optimal for all models.

6 DISCUSSION AND CONCLUSION

In this work, we study the double descent phenomenon in the context of optimal regularization. We show that, while unregularized or under-regularized models often have non-monotonic behavior, appropriate regularization can eliminate this effect.
Theoretically, we prove that for certain linear regression models with isotropic covariates, optimally-tuned ℓ2 regularization achieves monotonic test performance as we grow either the sample size or the model size. These are the first non-asymptotic monotonicity results we are aware of in linear regression. We also demonstrate empirically that optimally-tuned ℓ2 regularization can mitigate double descent for more general models, including neural networks. We hope that our results can motivate future work on mitigating the double descent phenomenon, and allow us to train high-performance models which do not exhibit unexpected non-monotonic behaviors.

Open Questions. Our work suggests a number of natural open questions. First, it is open to prove (or disprove) that optimal ridge regression is sample-monotonic for non-isotropic Gaussian covariates. We conjecture that it is, and outline a potential route to proving this (via Conjectures 1 and 2 in the Appendix). Second, more broadly, it is open to prove sample-wise or model-wise monotonicity for more general (non-linear) models with appropriate regularizers. Finally, it is open to understand why large neural networks in practice are often sample-monotonic in realistic regimes of sample sizes, even without careful choice of regularization.

REFERENCES

Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.

Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. arXiv preprint arXiv:1906.11300, 2019.

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.

Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. arXiv preprint arXiv:1903.07571, 2019.

Koby Bibas, Yaniv Fogel, and Meir Feder. A new look at an old problem: A universal learning approach to linear regression. arXiv preprint arXiv:1905.04708, 2019.

Stéphane d'Ascoli, Levent Sagun, and Giulio Biroli. Triple descent and the two kinds of overfitting: Where & why do they appear? arXiv preprint arXiv:2006.03509, 2020.

Zeyu Deng, Abla Kammoun, and Christos Thrampoulidis. A model of double descent for high-dimensional binary linear classification. arXiv preprint arXiv:1911.05822, 2019.

Michał Dereziński, Feynman Liang, and Michael W. Mahoney. Exact expressions for double descent and implicit regularization via surrogate random design, 2019.

Edgar Dobriban and Yue Sheng. Wonder: Weighted one-shot distributed ridge regression in high dimensions. arXiv preprint arXiv:1903.09321, 2019.

Edgar Dobriban, Stefan Wager, et al. High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247-279, 2018.

Robert PW Duin. Small sample size generalization. In Proceedings of the Scandinavian Conference on Image Analysis, volume 2, pp. 957-964, 1995.

Robert PW Duin. Classifiers in almost empty spaces. In Proceedings 15th International Conference on Pattern Recognition (ICPR-2000), volume 2, pp. 1-7. IEEE, 2000.

Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d'Ascoli, Giulio Biroli, Clément Hongler, and Matthieu Wyart. Scaling description of generalization with number of parameters in deep learning. arXiv preprint arXiv:1901.01608, 2019a.
Mario Geiger, Stefano Spigler, Stéphane d'Ascoli, Levent Sagun, Marco Baity-Jesi, Giulio Biroli, and Matthieu Wyart. Jamming transition as a paradigm to understand the loss landscape of deep neural networks. Physical Review E, 100(1):012115, 2019b.

Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation, 2019.

Roger A Horn and Charles R Johnson. Matrix Analysis. Cambridge University Press, 1990.

Dmitry Kobak, Jonathan Lomond, and Benoit Sanchez. Optimal ridge penalty for real-world high-dimensional data can be zero or negative due to the implicit ridge regularization. arXiv preprint arXiv:1805.10939, 2018.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

Andrew K Lampinen and Surya Ganguli. An analytic theory of generalization dynamics and transfer learning in deep linear networks. arXiv preprint arXiv:1809.10374, 2018.

Yann LeCun, Ido Kanter, and Sara A Solla. Eigenvalues of covariance matrices: Application to neural-network learning. Physical Review Letters, 66(18):2396, 1991.

Yann LeCun, Ido Kanter, and Sara A Solla. Second order properties of error surfaces: Learning time and generalization. In Advances in Neural Information Processing Systems, pp. 918-924, 1991.

Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel ridgeless regression can generalize. arXiv preprint arXiv:1808.00387, 2018.

Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai. On the risk of minimum-norm interpolants and restricted lower isometry of kernels. arXiv preprint arXiv:1908.10292, 2019.

Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai. On the multiple descent of minimum-norm interpolants and restricted lower isometry of kernels. 2020.

Marco Loog and Robert PW Duin. The dipping phenomenon. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pp. 310-317. Springer, 2012.

Marco Loog, Tom Viering, and Alexander Mey. Minimizers of the empirical risk and risk monotonicity. In Advances in Neural Information Processing Systems, pp. 7476-7485, 2019.

Yasaman Mahdaviyeh and Zacharie Naulet. Asymptotic risk of least squares minimum norm estimator under the spike covariance model. arXiv preprint arXiv:1912.13421, 2019.

Adam W Marcus, Daniel A Spielman, and Nikhil Srivastava. Ramanujan graphs and the solution of the Kadison-Singer problem. arXiv preprint arXiv:1408.4421, 2014.

Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.

Partha P. Mitra. Understanding overfitting peaks in generalization error: Analytical risk curves for l2 and l1 penalized interpolation. arXiv, abs/1906.03667, 2019.

Vidya Muthukumar, Kailas Vodrahalli, and Anant Sahai. Harmless interpolation of noisy data in regression. arXiv preprint arXiv:1903.09139, 2019.

Preetum Nakkiran. More data can hurt for linear regression: Sample-wise double descent. arXiv preprint arXiv:1912.07242, 2019.

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=B1g5sA4twr.
Brady Neal, Sarthak Mittal, Aristide Baratin, Vinayak Tantia, Matthew Scicluna, Simon Lacoste-Julien, and Ioannis Mitliagkas. A modern take on the bias-variance tradeoff in neural networks. arXiv preprint arXiv:1810.08591, 2018.

Manfred Opper. Statistical mechanics of learning: Generalization. The Handbook of Brain Theory and Neural Networks, pp. 922-925, 1995.

Manfred Opper. Learning to generalize. Frontiers of Life, 3(part 2), pp. 763-775, 2001.

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177-1184, 2008.

Marina Skurichina and Robert PW Duin. Bagging, boosting and the random subspace method for linear classifiers. Pattern Analysis & Applications, 5(2):121-135, 2002.

Stefano Spigler, Mario Geiger, Stéphane d'Ascoli, Levent Sagun, Giulio Biroli, and Matthieu Wyart. A jamming transition from under- to over-parametrization affects loss landscape and generalization. arXiv preprint arXiv:1810.09665, 2018.

Gerard V Trunk. A problem of dimensionality: A simple example. IEEE Transactions on Pattern Analysis and Machine Intelligence, (3):306-307, 1979.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Ji Xu and Daniel J Hsu. On the number of variables to use in principal component regression. In Advances in Neural Information Processing Systems, pp. 5095-5104, 2019.

In Sections A.1 and A.2 we provide the proofs for sample monotonicity and model-size monotonicity. In Section A.4 we include additional and omitted plots. In Section B we investigate whether monotonicity provably holds in more general models, and present a monotonicity conjecture for non-isotropic covariates.

A.1 SAMPLE MONOTONICITY PROOFS

First we prove Lemma 1.

Proof of Lemma 1. For isotropic $x$, the test risk is related to the parameter error as:
$$R(\hat\beta) := \mathbb{E}_{(x,y)\sim\mathcal{D}}[(\langle x, \hat\beta\rangle - y)^2] = \mathbb{E}_{x\sim\mathcal{N}(0, I_d),\,\eta\sim\mathcal{N}(0,\sigma^2)}[(\langle x, \hat\beta - \beta^*\rangle - \eta)^2] = \|\hat\beta - \beta^*\|_2^2 + \sigma^2$$
Plugging in the form of $\hat\beta_{n,\lambda}$ and expanding:
$$\begin{aligned}
R(\hat\beta_{n,\lambda}) &:= \mathbb{E}_{X,y\sim\mathcal{D}^n}[R(\hat\beta_{n,\lambda})] = \mathbb{E}[\|\hat\beta_{n,\lambda} - \beta^*\|_2^2] + \sigma^2 \\
&= \mathbb{E}_{X,y}[\|(X^TX + \lambda I)^{-1}X^T y - \beta^*\|_2^2] + \sigma^2 \\
&= \mathbb{E}_{X,\,\eta\sim\mathcal{N}(0,\sigma^2 I_n)}[\|(X^TX + \lambda I)^{-1}X^T(X\beta^* + \eta) - \beta^*\|_2^2] + \sigma^2 \\
&= \mathbb{E}_{X}[\|(X^TX + \lambda I)^{-1}X^TX\beta^* - \beta^*\|_2^2] + \mathbb{E}_{X,\eta}[\|(X^TX + \lambda I)^{-1}X^T\eta\|_2^2] + \sigma^2 \\
&= \mathbb{E}_{X}[\|(X^TX + \lambda I)^{-1}X^TX\beta^* - \beta^*\|_2^2] + \sigma^2\,\mathbb{E}_{X}[\|(X^TX + \lambda I)^{-1}X^T\|_F^2] + \sigma^2
\end{aligned}$$
Now let $X = U\Sigma V^T$ be the full singular value decomposition of $X$, with $U \in \mathbb{R}^{n\times n}$, $\Sigma \in \mathbb{R}^{n\times d}$, $V \in \mathbb{R}^{d\times d}$. Let $(\gamma_1, \ldots, \gamma_d)$ denote the singular values, defining $\gamma_i = 0$ for $i > \min(n, d)$. Then, continuing:
$$\begin{aligned}
R(\hat\beta_{n,\lambda}) &= \mathbb{E}_{V,\Sigma}\Big[\big\|\mathrm{diag}\big(\tfrac{\lambda}{\gamma_i^2+\lambda}\big)V^T\beta^*\big\|_2^2\Big] + \sigma^2\,\mathbb{E}_{\Sigma}\Big[\sum_i \tfrac{\gamma_i^2}{(\gamma_i^2+\lambda)^2}\Big] + \sigma^2 && (9)\\
&= \mathbb{E}_{z\sim\mathrm{Unif}(\|\beta^*\|_2 S^{d-1}),\,\Sigma}\Big[\big\|\mathrm{diag}\big(\tfrac{\lambda}{\gamma_i^2+\lambda}\big)z\big\|_2^2\Big] + \sigma^2\,\mathbb{E}_{\Sigma}\Big[\sum_i \tfrac{\gamma_i^2}{(\gamma_i^2+\lambda)^2}\Big] + \sigma^2 && (10)\\
&= \frac{\|\beta^*\|_2^2}{d}\,\mathbb{E}_{\Sigma}\Big[\sum_i \tfrac{\lambda^2}{(\gamma_i^2+\lambda)^2}\Big] + \sigma^2\,\mathbb{E}_{\Sigma}\Big[\sum_i \tfrac{\gamma_i^2}{(\gamma_i^2+\lambda)^2}\Big] + \sigma^2 && (11)\\
&= \mathbb{E}_{\Sigma}\Big[\sum_i \tfrac{\|\beta^*\|_2^2\lambda^2/d + \sigma^2\gamma_i^2}{(\gamma_i^2+\lambda)^2}\Big] + \sigma^2 && (12)
\end{aligned}$$
Line (10) follows because, by symmetry, $V$ is distributed as a uniformly random orthonormal matrix, and $\Sigma$ is independent of $V$. Thus, $z := V^T\beta^*$ is distributed as a uniformly random point on the sphere of radius $\|\beta^*\|_2$.

Next we prove Lemma 2.

Proof of Lemma 2. First, we determine the optimal ridge parameter.
Using Lemma 1, we have
$$\frac{\partial}{\partial\lambda}R(\hat\beta_{n,\lambda}) = \frac{\partial}{\partial\lambda}\,\mathbb{E}_{(\gamma_1,\ldots,\gamma_d)\sim\Gamma_n}\Big[\sum_i \frac{\|\beta^*\|_2^2\lambda^2/d + \sigma^2\gamma_i^2}{(\gamma_i^2+\lambda)^2}\Big] = 2\big(\|\beta^*\|_2^2\lambda/d - \sigma^2\big)\,\mathbb{E}_{(\gamma_1,\ldots,\gamma_d)\sim\Gamma_n}\Big[\sum_i \frac{\gamma_i^2}{(\gamma_i^2+\lambda)^3}\Big]$$
Thus, $\frac{\partial}{\partial\lambda}R(\hat\beta_{n,\lambda}) = 0 \iff \lambda = d\sigma^2/\|\beta^*\|_2^2$, and we conclude that $\lambda^{opt}_n = d\sigma^2/\|\beta^*\|_2^2$. For this optimal parameter, the test risk follows from Lemma 1 as
$$R(\hat\beta^{opt}_n) = R(\hat\beta_{n,\lambda^{opt}_n}) = \sigma^2 + \sigma^2\,\mathbb{E}_{(\gamma_1,\ldots,\gamma_d)\sim\Gamma_n}\Big[\sum_{i=1}^{d}\frac{1}{\gamma_i^2 + d\sigma^2/\|\beta^*\|_2^2}\Big] \tag{13}$$

Proof of Theorem 2. We follow a similar proof strategy as in Theorem 1: we invoke singular value interlacing ($\widetilde\gamma_i \ge \gamma_i$) for the data matrix when adding a single sample. We then apply Lemma 1 to argue that the test risk varies monotonically with the singular values. Write
$$R(\hat\beta_{n,\lambda}) = \mathbb{E}_{(\gamma_1,\ldots,\gamma_d)\sim\Gamma_n}\Big[\sum_i \underbrace{\frac{\|\beta^*\|_2^2\lambda^2/d + \sigma^2\gamma_i^2}{(\gamma_i^2+\lambda)^2}}_{S(\gamma_i)}\Big] + \sigma^2$$
and compute how each term in the sum varies with $\gamma_i$:
$$\frac{\partial}{\partial\gamma_i} S(\gamma_i) = -\frac{2\gamma_i}{d}\cdot\frac{2\|\beta^*\|_2^2\lambda^2 + d\sigma^2(\gamma_i^2 - \lambda)}{(\gamma_i^2+\lambda)^3}$$
Thus, for all $\lambda \ge d\sigma^2/\|\beta^*\|_2^2$, we have
$$\frac{\partial}{\partial\gamma_i} S(\gamma_i) \le 0 \tag{15}$$
By the coupling argument in Theorem 1, this implies that the test risk is monotonic:
$$R(\hat\beta_{n+1,\lambda}) - R(\hat\beta_{n,\lambda}) = \mathbb{E}_{(\widetilde\gamma_1,\ldots,\widetilde\gamma_d)\sim\Gamma_{n+1}}\Big[\sum_{i=1}^d S(\widetilde\gamma_i)\Big] - \mathbb{E}_{(\gamma_1,\ldots,\gamma_d)\sim\Gamma_{n}}\Big[\sum_{i=1}^d S(\gamma_i)\Big] = \mathbb{E}_{(\{\widetilde\gamma_i\},\{\gamma_i\})\sim\Pi}\Big[\sum_{i=1}^d S(\widetilde\gamma_i) - S(\gamma_i)\Big] \le 0 \tag{17}$$
where $\Pi$ is the coupling. Line (17) follows from Equation (15), and the fact that the coupling obeys $\widetilde\gamma_i \ge \gamma_i$.

A.2 PROJECTION MODEL PROOFS

Lemma 3. For all $\theta \in \mathbb{R}^p$, $d, n \in \mathbb{N}$, and $\lambda > 0$, let $X \in \mathbb{R}^{n\times p}$ be a matrix with i.i.d. $\mathcal{N}(0,1)$ entries. Let $P \in \mathbb{R}^{d\times p}$ be a random orthonormal matrix. Define $\widetilde X := XP^T$. Let $(\gamma_1, \ldots, \gamma_m)$ be the singular values of the projected data matrix $\widetilde X \in \mathbb{R}^{n\times d}$, for $m := \max(n, d)$ (with $\gamma_i = 0$ for $i > \min(n, d)$). Let $\Gamma_d$ be the distribution of the singular values $(\gamma_1, \ldots, \gamma_m)$. Then, defining $\widetilde\sigma^2 := \sigma^2 + \frac{p-d}{p}\|\theta\|_2^2$, the optimal ridge parameter is the same for every $n$:
$$\lambda^{opt}_d = \frac{p^2\widetilde\sigma^2}{d\|\theta\|_2^2}$$
Moreover, the optimal expected test risk can be written as
$$R(\hat\beta^{opt}_d) = \widetilde\sigma^2 + \widetilde\sigma^2\,\mathbb{E}_{(\gamma_1,\ldots,\gamma_m)\sim\Gamma_d}\Big[\sum_i \frac{1}{\gamma_i^2 + \widetilde\sigma^2 p^2/(d\|\theta\|_2^2)}\Big]$$

Proof. This proof follows exactly analogously to the proof of Lemma 2 from Lemma 1, in Section A.1.

Lemma 4. For all $\theta \in \mathbb{R}^p$, $d, n \in \mathbb{N}$, and $\lambda > 0$, let $X \in \mathbb{R}^{n\times p}$ be a matrix with i.i.d. $\mathcal{N}(0,1)$ entries. Let $P \in \mathbb{R}^{d\times p}$ be a random orthonormal matrix. Define $\widetilde X := XP^T$ and $\beta^* := P\theta$. Let $(\gamma_1, \ldots, \gamma_m)$ be the singular values of the projected data matrix $\widetilde X \in \mathbb{R}^{n\times d}$, for $m := \max(n, d)$ (with $\gamma_i = 0$ for $i > \min(n, d)$). Let $\Gamma_d$ be the distribution of the singular values $(\gamma_1, \ldots, \gamma_m)$. Then, the expected test risk is
$$R(\hat\beta_{d,\lambda}) := \mathbb{E}_P\,\mathbb{E}_{\widetilde X, y\sim\widetilde{\mathcal{D}}^n}\big[R_P(\hat\beta_{d,\lambda}(\widetilde X, y))\big] = \sigma^2 + \Big(1 - \frac{d}{p}\Big)\|\theta\|_2^2 + \mathbb{E}_{(\gamma_1,\ldots,\gamma_m)\sim\Gamma_d}\Big[\sum_i \frac{\big(\sigma^2 + \frac{p-d}{p}\|\theta\|_2^2\big)\gamma_i^2 + \frac{d}{p^2}\|\theta\|_2^2\lambda^2}{(\gamma_i^2+\lambda)^2}\Big]$$

Proof of Lemma 4. We first define the parameter that minimizes the population risk. It follows directly that
$$\beta^*_P := \arg\min_{\beta\in\mathbb{R}^d} R_P(\beta) = P\theta$$
First, we can expand the risk as
$$\begin{aligned}
R_P(\hat\beta) &= \mathbb{E}_{(\widetilde x, y)\sim\widetilde{\mathcal{D}}}[(\langle Px, \hat\beta\rangle - y)^2] && (18)\\
&= \mathbb{E}_{(x,y)\sim\mathcal{D},\,\eta\sim\mathcal{N}(0,\sigma^2)}[(\langle x, P^T\hat\beta - \theta\rangle - \eta)^2] && (19)\\
&= \sigma^2 + \|\theta - P^T\hat\beta\|_2^2 && (20)\\
&= \sigma^2 + \|\theta - P^T\beta^*\|_2^2 + \|P^T\beta^* - P^T\hat\beta\|_2^2 + 2\,\langle \theta - P^T\beta^*,\; P^T\beta^* - P^T\hat\beta\rangle && (21\text{-}22)\\
&= \sigma^2 + \|\theta - P^T\beta^*\|_2^2 + \|P^T\beta^* - P^T\hat\beta\|_2^2 && (23)\\
&= \sigma^2 + \|\theta - P^TP\theta\|_2^2 + \|\beta^* - \hat\beta\|_2^2 && (24)
\end{aligned}$$
The cross term in Line (22) vanishes because the first-order optimality condition for $\beta^*$ implies that $\beta^*$ satisfies $P(\theta - P^T\beta^*) = 0$. We now simplify each of the two remaining terms. First, we have that
$$\mathbb{E}_P\,\|\theta - P^TP\theta\|_2^2 = \Big(1 - \frac{d}{p}\Big)\|\theta\|_2^2 \tag{25}$$
since $P^TP$ is an orthogonal projection onto a random $d$-dimensional subspace. Now, recall we have $y = X\theta + \eta$ where $\eta \sim \mathcal{N}(0, \sigma^2 I_n)$. Expand this as
$$y = X\theta + \eta = XP^TP\theta + X(I - P^TP)\theta + \eta = \widetilde X\beta^* + \epsilon + \eta \tag{26-28}$$
where $\epsilon := X(I - P^TP)\theta$.
Note that, conditioned on $P$, the three terms $\widetilde X$, $\epsilon$, and $\eta$ are conditionally independent, since $P^TP$ and $(I - P^TP)$ project $X$ onto orthogonal subspaces. Further, $\epsilon \sim \mathcal{N}(0, \|(I - P^TP)\theta\|_2^2\, I_n)$. We have
$$\begin{aligned}
\mathbb{E}_P\,\mathbb{E}_{\widetilde X, y}\,\|\hat\beta - \beta^*\|_2^2 &= \mathbb{E}_P\,\mathbb{E}_{\widetilde X, y}\,\|(\widetilde X^T\widetilde X + \lambda I)^{-1}\widetilde X^T y - \beta^*\|_2^2 && (30)\\
&= \mathbb{E}_P\,\mathbb{E}_{\widetilde X, \epsilon, \eta}\,\|(\widetilde X^T\widetilde X + \lambda I)^{-1}\widetilde X^T(\widetilde X\beta^* + \epsilon + \eta) - \beta^*\|_2^2 && (31)\\
&= \mathbb{E}_P\,\mathbb{E}_{\widetilde X, \epsilon, \eta}\Big[\|(\widetilde X^T\widetilde X + \lambda I)^{-1}\widetilde X^T\widetilde X\beta^* - \beta^*\|_2^2 && (32)\\
&\qquad + \|(\widetilde X^T\widetilde X + \lambda I)^{-1}\widetilde X^T\epsilon\|_2^2 + \|(\widetilde X^T\widetilde X + \lambda I)^{-1}\widetilde X^T\eta\|_2^2\Big] && (33)
\end{aligned}$$
Now, since $\widetilde X$ is conditionally independent of $\epsilon$ conditioned on $P$,
$$\begin{aligned}
\mathbb{E}_P\,\mathbb{E}_{\widetilde X, \epsilon | P}\,\|(\widetilde X^T\widetilde X + \lambda I)^{-1}\widetilde X^T\epsilon\|_2^2 &= \mathbb{E}_P\Big[\mathbb{E}_{\widetilde X|P}\big[\|(\widetilde X^T\widetilde X + \lambda I)^{-1}\widetilde X^T\|_F^2\big]\cdot\|(I - P^TP)\theta\|_2^2\Big] && (36)\\
&= \mathbb{E}_{\widetilde X}\big[\|(\widetilde X^T\widetilde X + \lambda I)^{-1}\widetilde X^T\|_F^2\big]\cdot\mathbb{E}_{P}\big[\|(I - P^TP)\theta\|_2^2\big] && \text{(by definition of }\epsilon\text{)}\\
&= \mathbb{E}_{\widetilde X}\big[\|(\widetilde X^T\widetilde X + \lambda I)^{-1}\widetilde X^T\|_F^2\big]\cdot\frac{p-d}{p}\|\theta\|_2^2 && (38)
\end{aligned}$$
where the middle line holds because the marginal distribution of $\widetilde X$ does not depend on $P$. Similarly,
$$\mathbb{E}_P\,\mathbb{E}_{\widetilde X, \eta | P}\,\|(\widetilde X^T\widetilde X + \lambda I)^{-1}\widetilde X^T\eta\|_2^2 = \sigma^2\,\mathbb{E}_{\widetilde X}\big[\|(\widetilde X^T\widetilde X + \lambda I)^{-1}\widetilde X^T\|_F^2\big] \tag{40}$$
Now let $\widetilde X = U\Sigma V^T$ be the full singular value decomposition of $\widetilde X$, with $U \in \mathbb{R}^{n\times n}$, $\Sigma \in \mathbb{R}^{n\times d}$, $V \in \mathbb{R}^{d\times d}$. Let $(\gamma_1, \ldots, \gamma_m)$ denote the singular values, where $m = \max(n, d)$ and $\gamma_i = 0$ for $i > \min(n, d)$. Observe that by symmetry, $\widetilde X = XP^T$ and $P$ are independent, because the joint distribution of $(\widetilde X, P)$ is equivalent to that of $(\widetilde X, PQ)$ for a random orthonormal $Q \in \mathbb{R}^{p\times p}$. Thus $\widetilde X$ and $\beta^* = P\theta$ are also independent, and we have:
$$\begin{aligned}
\mathbb{E}_P\,\mathbb{E}_{\widetilde X}\,\|(\widetilde X^T\widetilde X + \lambda I)^{-1}\widetilde X^T\widetilde X\beta^* - \beta^*\|_2^2 &= \mathbb{E}_{\beta^*}\,\mathbb{E}_{V,\Sigma}\Big[\big\|\mathrm{diag}\big(\tfrac{\lambda}{\gamma_i^2+\lambda}\big)V^T\beta^*\big\|_2^2\Big] && (42)\\
&= \mathbb{E}_{\beta^*}\,\mathbb{E}_{z\sim\mathrm{Unif}(\|\beta^*\|_2 S^{d-1}),\,\Sigma}\Big[\big\|\mathrm{diag}\big(\tfrac{\lambda}{\gamma_i^2+\lambda}\big)z\big\|_2^2\Big] && (44)\\
&= \mathbb{E}_{\beta^*}\Big[\frac{\|\beta^*\|_2^2}{p}\,\mathbb{E}_{\Sigma}\Big[\sum_i \tfrac{\lambda^2}{(\gamma_i^2+\lambda)^2}\Big]\Big] && (45)\\
&= \frac{1}{p}\,\mathbb{E}_P[\|P\theta\|_2^2]\;\mathbb{E}_{\Sigma}\Big[\sum_i \tfrac{\lambda^2}{(\gamma_i^2+\lambda)^2}\Big] && (46)\\
&= \frac{d}{p^2}\,\|\theta\|_2^2\;\mathbb{E}_{\Sigma}\Big[\sum_i \tfrac{\lambda^2}{(\gamma_i^2+\lambda)^2}\Big] && (47)
\end{aligned}$$
Finally, continuing from Line (33), we can use Lines (38), (40), and (47) to write:
$$\begin{aligned}
\mathbb{E}_P\,\mathbb{E}_{\widetilde X, y}\,\|\hat\beta - \beta^*\|_2^2 &= \Big(\sigma^2 + \frac{p-d}{p}\|\theta\|_2^2\Big)\,\mathbb{E}_{\widetilde X}\big[\|(\widetilde X^T\widetilde X + \lambda I)^{-1}\widetilde X^T\|_F^2\big] + \frac{d}{p^2}\|\theta\|_2^2\,\mathbb{E}_{\Sigma}\Big[\sum_i \tfrac{\lambda^2}{(\gamma_i^2+\lambda)^2}\Big] && (49\text{-}50)\\
&= \Big(\sigma^2 + \frac{p-d}{p}\|\theta\|_2^2\Big)\,\mathbb{E}_{\Sigma}\Big[\sum_i \tfrac{\gamma_i^2}{(\gamma_i^2+\lambda)^2}\Big] + \frac{d}{p^2}\|\theta\|_2^2\,\mathbb{E}_{\Sigma}\Big[\sum_i \tfrac{\lambda^2}{(\gamma_i^2+\lambda)^2}\Big] && (51\text{-}52)\\
&= \mathbb{E}_{\Sigma}\Big[\sum_i \tfrac{\big(\sigma^2 + \frac{p-d}{p}\|\theta\|_2^2\big)\gamma_i^2 + \frac{d}{p^2}\|\theta\|_2^2\lambda^2}{(\gamma_i^2+\lambda)^2}\Big] && (53)
\end{aligned}$$
Now, we can continue from Line (24) and apply Line (25) to conclude:
$$\mathbb{E}[R_P(\hat\beta)] = \sigma^2 + \mathbb{E}[\|\theta - P^TP\theta\|_2^2] + \mathbb{E}[\|\beta^* - \hat\beta\|_2^2] = \sigma^2 + \Big(1 - \frac{d}{p}\Big)\|\theta\|_2^2 + \mathbb{E}_{(\gamma_1,\ldots,\gamma_m)\sim\Gamma_d}\Big[\sum_i \tfrac{\big(\sigma^2 + \frac{p-d}{p}\|\theta\|_2^2\big)\gamma_i^2 + \frac{d}{p^2}\|\theta\|_2^2\lambda^2}{(\gamma_i^2+\lambda)^2}\Big]$$

Proof of Theorem 3. This follows analogously to the proof of Theorem 1. Let $\widetilde X_d$ and $\widetilde X_{d+1}$ be the observed data matrices for model sizes $d$ and $d+1$. As in Theorem 1, there exists a coupling $\Pi$ between the distributions $\Gamma_d$ and $\Gamma_{d+1}$ of the singular values of $\widetilde X_d$ and $\widetilde X_{d+1}$ such that these singular values are interlaced. Thus, by Lemma 3,
$$R(\hat\beta^{opt}_d) = \widetilde\sigma^2 + \widetilde\sigma^2\,\mathbb{E}_{(\gamma_1,\ldots,\gamma_m)\sim\Gamma_d}\Big[\sum_i \frac{1}{\gamma_i^2 + \widetilde\sigma^2 p^2/(d\|\theta\|_2^2)}\Big] \;\ge\; \widetilde\sigma^2 + \widetilde\sigma^2\,\mathbb{E}_{(\widetilde\gamma_1,\ldots,\widetilde\gamma_m)\sim\Gamma_{d+1}}\Big[\sum_i \frac{1}{\widetilde\gamma_i^2 + \widetilde\sigma^2 p^2/(d\|\theta\|_2^2)}\Big] \;\ge\; R(\hat\beta^{opt}_{d+1})$$

A.3 NON-ISOTROPIC REDUCTION

Here we observe that the results on isotropic regression in Section 2 also imply that ridge regression can be made sample-monotonic even for non-isotropic covariates, if an appropriate regularizer is applied. Specifically, the regularizer depends on the covariance of the inputs. This follows from a general equivalence between the non-isotropic and isotropic problems.

Lemma 5. For all $n \in \mathbb{N}$, $d \in \mathbb{N}$, $\lambda \in \mathbb{R}$, $\sigma \in \mathbb{R}$, covariance $\Sigma \in \mathbb{R}^{d\times d}$, PSD matrix $M \in \mathbb{R}^{d\times d}$, and ground-truth $\beta^* \in \mathbb{R}^d$, the following holds. Consider the following two problems:
1. Regularized regression on isotropic covariates, with an $M$-regularizer. That is, suppose $n$ samples $(x, y)$ are drawn with covariates $x \sim \mathcal{N}(0, I_d)$ and response $y = \langle\beta^*, x\rangle + \mathcal{N}(0, \sigma^2)$. Let $X \in \mathbb{R}^{n\times d}$ be the matrix of covariates, and $y$ the vector of responses. Consider
$$\hat\beta_\lambda := \arg\min_\beta \|X\beta - y\|_2^2 + \lambda\|\beta\|_M^2 \tag{54}$$
Let $R = \mathbb{E}_{X,y}[\|\hat\beta_\lambda - \beta^*\|_2^2] + \sigma^2$ be the expected test risk of the above estimator.

2. Regularized regression with covariance $\Sigma$, with a $(\Sigma^{1/2}M\Sigma^{1/2})$-regularizer. That is, suppose $n$ samples $(\widetilde x, y)$ are drawn with covariates $\widetilde x \sim \mathcal{N}(0, \Sigma)$ and response $y = \langle z^*, \widetilde x\rangle + \mathcal{N}(0, \sigma^2)$, for $z^* = \Sigma^{-1/2}\beta^*$. Let $\widetilde X \in \mathbb{R}^{n\times d}$ be the matrix of covariates, and $y$ the vector of responses. Consider
$$\hat z_\lambda := \arg\min_z \|\widetilde Xz - y\|_2^2 + \lambda\|z\|_{\Sigma^{1/2}M\Sigma^{1/2}}^2 \tag{55}$$
Let $\widetilde R = \mathbb{E}_{\widetilde X,y}[\|\hat z_\lambda - z^*\|_\Sigma^2] + \sigma^2$ be the expected test risk of the above estimator.

Then, the expected test risks of the above two problems are identical: $R = \widetilde R$.

Proof of Lemma 5. The distribution of $\widetilde X$ in Problem 2 is equivalent to $X\Sigma^{1/2}$, where $X$ is as in Problem 1. Thus, the two settings are equivalent by the change of variables $\beta = \Sigma^{1/2}z$. Specifically,
$$\begin{aligned}
\hat z_\lambda &:= \arg\min_z \|\widetilde Xz - y\|_2^2 + \lambda\|z\|_{\Sigma^{1/2}M\Sigma^{1/2}}^2 && (56)\\
&= \arg\min_z \|X\Sigma^{1/2}z - y\|_2^2 + \lambda z^T\Sigma^{1/2}M\Sigma^{1/2}z && (57)\\
&= \Sigma^{-1/2}\arg\min_{\beta = \Sigma^{1/2}z} \|X\beta - y\|_2^2 + \lambda\beta^T M\beta && (59)
\end{aligned}$$
Further, the responses agree, $\langle z^*, \widetilde x\rangle = \langle\beta^*, x\rangle$, and the test risk transforms identically:
$$\widetilde R = \mathbb{E}_{\widetilde X, y}[\|\hat z_\lambda - z^*\|_\Sigma^2] + \sigma^2 = \mathbb{E}_{X,y}[\|\hat\beta_\lambda - \beta^*\|_2^2] + \sigma^2 = R \tag{60-61}$$

This implies that if the covariance $\Sigma$ is known, then ridge regression with a $\Sigma^{-1}$ regularizer is sample-monotonic.

Theorem 5. For all $n \in \mathbb{N}$, $d \in \mathbb{N}$, $\sigma \in \mathbb{R}$, covariance $\Sigma \in \mathbb{R}^{d\times d}$, and ground-truth $\beta^* \in \mathbb{R}^d$, the following holds. Suppose $n$ samples $(x, y)$ are drawn with covariates $x \sim \mathcal{N}(0, \Sigma)$ and response $y = \langle\beta^*, x\rangle + \mathcal{N}(0, \sigma^2)$. Let $X \in \mathbb{R}^{n\times d}$ be the matrix of covariates, and $y$ the vector of responses. For $\lambda > 0$, consider the ridge regression estimator with $\Sigma^{-1}$-regularizer:
$$\hat\beta_{n,\lambda} := \arg\min_\beta \|X\beta - y\|_2^2 + \lambda\|\beta\|_{\Sigma^{-1}}^2 \tag{63}$$
Let $R(\hat\beta_{n,\lambda}) := \mathbb{E}_{\hat\beta}\,\|\hat\beta - \beta^*\|_\Sigma^2 + \sigma^2$ be the expected test risk of the above estimator. Let $\lambda^{opt}_n$ be the optimal ridge parameter (that achieves the minimum expected risk) given $n$ samples:
$$\lambda^{opt}_n := \arg\min_\lambda R(\hat\beta_{n,\lambda}) \tag{64}$$
And let $\hat\beta^{opt}_n$ be the estimator that corresponds to $\lambda^{opt}_n$. Then, the expected test risk of optimally-regularized linear regression is monotonic in samples:
$$R(\hat\beta^{opt}_{n+1}) \le R(\hat\beta^{opt}_{n})$$

Proof. This follows directly by applying the reduction in Lemma 5 with $M = I_d$ to reduce to the isotropic case, and then applying the monotonicity of isotropic regression from Theorem 1.

A.4 ADDITIONAL PLOTS

We apply random features to Fashion-MNIST (Xiao et al., 2017), an image classification problem with 10 classes. Input images $x \in \mathbb{R}^d$ are normalized and flattened to $[-1, 1]^d$ for $d = 784$. Class labels are encoded as one-hot vectors $y \in \{e_1, \ldots, e_{10}\} \subset \mathbb{R}^{10}$.

B TOWARDS MONOTONICITY WITH GENERAL COVARIATES

Here we investigate whether monotonicity provably holds in more general models, inspired by the experimental results. As a first step, we consider Gaussian (but not isotropic) covariates and homoscedastic noise. That is, we consider ridge regression in the setting of Section 2, but with $x \sim \mathcal{N}(0, \Sigma)$ and $y = \langle x, \beta^*\rangle + \mathcal{N}(0, \sigma^2)$. In this section, we observe that ridge regression can be made sample-monotonic with a modified regularizer. We also conjecture that ridge regression is sample-monotonic without modifying the regularizer, and we outline a potential proof strategy along with numerical evidence.

B.1 ADAPTIVE REGULARIZATION

The results on isotropic regression in Section 2 imply that ridge regression can be made sample-monotonic even for non-isotropic covariates, if an appropriate regularizer is applied. Specifically, the appropriate regularizer depends on the covariance of the inputs. For $x \sim \mathcal{N}(0, \Sigma)$, the following estimator is sample-monotonic for optimally-tuned $\lambda$:
$$\hat\beta_{n,\lambda} := \arg\min_\beta \|X\beta - y\|_2^2 + \lambda\|\beta\|_{\Sigma^{-1}}^2$$
This follows directly from Theorem 1 by applying a change of variables; full details of this equivalence are in Section A.3. Note that if the population covariance $\Sigma$ is not known, it can potentially be estimated from unlabeled data.
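As an illustration of the adaptive regularizer above, here is a brief numpy sketch (our own toy setup: Σ, β*, the λ grid, and the trial counts are assumptions, not taken from the paper). It uses the closed form $\hat\beta_{n,\lambda} = (X^TX + \lambda\Sigma^{-1})^{-1}X^Ty$ and estimates the optimally-tuned risk, measured in the $\|\cdot\|_\Sigma$ norm as in Theorem 5, for a few sample sizes.

```python
# Sketch of the Sigma^{-1}-regularized ("adaptive") ridge estimator (assumed toy parameters).
import numpy as np

d, sigma = 10, 0.5
Sigma = np.diag(np.linspace(10.0, 1.0, d))      # an arbitrary non-isotropic covariance
Sigma_inv = np.linalg.inv(Sigma)
rng = np.random.default_rng(1)
beta_star = rng.standard_normal(d)

def adaptive_ridge(X, y, lam):
    # argmin_beta ||X beta - y||^2 + lam * beta^T Sigma^{-1} beta
    return np.linalg.solve(X.T @ X + lam * Sigma_inv, X.T @ y)

def tuned_risk(n, lams, trials=300):
    # Expected test risk E||beta_hat - beta*||_Sigma^2 + sigma^2, minimized over the lambda grid.
    totals = np.zeros(len(lams))
    for _ in range(trials):
        X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
        y = X @ beta_star + sigma * rng.standard_normal(n)
        for j, lam in enumerate(lams):
            diff = adaptive_ridge(X, y, lam) - beta_star
            totals[j] += diff @ Sigma @ diff + sigma**2
    return (totals / trials).min()

lams = np.logspace(-3, 2, 20)
print([round(tuned_risk(n, lams), 3) for n in (4, 8, 12, 16)])
# Per Theorem 5, these values should be non-increasing in n (up to Monte Carlo noise
# and the finite lambda grid).
```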
Figure 4: Double descent for Random ReLU Features. Test classification error as a function of model size and sample size for Random ReLU Features on Fashion-MNIST. (a) Test classification error vs. number of training samples, with D = 500 features. (b) Test classification error vs. model size (number of random features), with n = 500 samples. See Figures 7 and 8 for the corresponding test mean squared error. See Appendix D of Nakkiran et al. (2020) for the performance of these unregularized models plotted across Num. Samples × Model Size simultaneously.

Figure 5: Train Error vs. Model Size for 5-layer CNNs on CIFAR-100, with ℓ2 regularization (weight decay).

B.2 TOWARDS PROVING MONOTONICITY

We conjecture that optimally-regularized ridge regression is sample-monotonic for non-isotropic covariates, even without modifying the regularizer (as suggested by the experiment in Figure 2). We derive a sufficient condition for monotonicity, which we have numerically verified in a variety of instances. Specifically, we conjecture the following.

Conjecture 1. For all $d \in \mathbb{N}$ and all PSD covariances $\Sigma \in \mathbb{R}^{d\times d}$, consider the distribution on $(x, y)$ where $x \sim \mathcal{N}(0, \Sigma)$ and $y = \langle x, \beta^*\rangle + \mathcal{N}(0, \sigma^2)$. Then, we conjecture that the expected test risk of the ridge regression estimator $\hat\beta_{n,\lambda} := \arg\min_\beta \|X\beta - y\|_2^2 + \lambda\|\beta\|_2^2$, for optimally-tuned $\lambda \ge 0$, is monotone non-increasing in the number of samples $n$. That is, for all $n \in \mathbb{N}$,
$$\inf_{\lambda \ge 0} R(\hat\beta_{n+1,\lambda}) \le \inf_{\lambda \ge 0} R(\hat\beta_{n,\lambda}) \tag{65}$$
where we define $\hat\beta_{n,0} := \lim_{\lambda\to 0^+}\hat\beta_{n,\lambda} = X^\dagger y$.

Figure 6: Train MSE vs. Num. Samples for Non-Isotropic Ridge Regression in d = 30 dimensions, in the setting of Figure 2. Plotting train MSE: $\frac{1}{n}\|X\hat\beta - y\|_2^2$.

Figure 7: Test Mean Squared Error vs. Num. Train Samples for Random ReLU Features on Fashion-MNIST, with D = 500 features.

In Appendix B.3 we present a technical conjecture in random matrix theory (Conjecture 2) which suffices to prove Conjecture 1. Proving Conjecture 2 presents a number of technical challenges, but we have numerically verified it in a variety of cases. It can also be shown that Conjecture 2 is true when $Q = I$, corresponding to isotropic covariates. We prove the reduction from Conjecture 2 to Conjecture 1 in Appendix B.3.

B.3 MONOTONICITY CONJECTURE PROOFS

In order to establish Conjecture 1, it is sufficient to prove the following technical conjecture.

Figure 8: Test Mean Squared Error vs. Num. Features for Random ReLU Features on Fashion-MNIST, with n = 500 samples.

Conjecture 2. For all $n \in \mathbb{N}$, $d \ge n$, $\lambda > 0$, and symmetric positive definite matrices $Q \in \mathbb{R}^{d\times d}$, the following holds. Define
$$G^n_\lambda := \lambda^2\,\mathbb{E}_X\big[(X^TX + \lambda Q)^{-2}\big]$$
where $X \in \mathbb{R}^{n\times d}$ is sampled with each entry i.i.d. $\mathcal{N}(0, 1)$.
Similarly, define
$$H^n_\lambda := \mathbb{E}_X\big[\|(X^TX + \lambda Q)^{-1}X^T\|_F^2\big]$$
The expected test risk for $n$ samples can be expressed as
$$R(\hat\beta_{n,\lambda}) = (\beta^*)^T G^n_\lambda \beta^* + \sigma^2 H^n_\lambda + \sigma^2 \tag{66}$$
Then, we conjecture that the following two conditions hold:
$$G^n_\lambda \succeq G^{n+1}_\lambda \tag{67}$$
$$\big(G^n_\lambda - G^{n+1}_\lambda\big) - \frac{H^n_\lambda - H^{n+1}_\lambda}{\,dH^n_\lambda/d\lambda\,}\cdot\frac{dG^n_\lambda}{d\lambda} \succeq 0 \tag{68}$$

Lemma 6. Conjecture 2 implies Conjecture 1.

Proof. By the reduction in Section A.3, showing monotonicity for non-isotropic regression with an isotropic regularizer is equivalent to showing monotonicity for isotropic regression with a non-isotropic regularizer. Thus, we consider the latter. Specifically, Conjecture 1 is equivalent to showing monotonicity for the estimator
$$\hat\beta_{n,\lambda} := \arg\min_\beta \|X\beta - y\|_2^2 + \lambda\|\beta\|_{\Sigma^{-1}}^2 = (X^TX + \lambda\Sigma^{-1})^{-1}X^Ty \tag{69-70}$$
where $x \sim \mathcal{N}(0, I)$ is isotropic and $y = \langle x, \beta^*\rangle + \mathcal{N}(0, \sigma^2)$. Now, letting $Q := \Sigma^{-1}$, the expected test risk of this estimator for $n$ samples is:
$$\begin{aligned}
R(\hat\beta_{n,\lambda}) &= \mathbb{E}_{X,y}[\|\hat\beta_{n,\lambda} - \beta^*\|_2^2] + \sigma^2 \\
&= \mathbb{E}_{X,y}[\|(X^TX + \lambda Q)^{-1}X^Ty - \beta^*\|_2^2] + \sigma^2 \\
&= \mathbb{E}_{X,\,\eta\sim\mathcal{N}(0,\sigma^2 I_n)}[\|(X^TX + \lambda Q)^{-1}X^T(X\beta^* + \eta) - \beta^*\|_2^2] + \sigma^2 \\
&= \mathbb{E}_{X}[\|(X^TX + \lambda Q)^{-1}X^TX\beta^* - \beta^*\|_2^2] + \sigma^2\,\mathbb{E}_X[\|(X^TX + \lambda Q)^{-1}X^T\|_F^2] + \sigma^2 \\
&= \mathbb{E}_{X}[\|(X^TX + \lambda Q)^{-1}(X^TX + \lambda Q - \lambda Q)\beta^* - \beta^*\|_2^2] + \sigma^2\,\mathbb{E}_X[\|(X^TX + \lambda Q)^{-1}X^T\|_F^2] + \sigma^2 \\
&= \lambda^2\,\mathbb{E}_{X}[\|(X^TX + \lambda Q)^{-1}Q\beta^*\|_2^2] + \sigma^2\,\mathbb{E}_X[\|(X^TX + \lambda Q)^{-1}X^T\|_F^2] + \sigma^2 \\
&= (\beta^*)^T G^n_\lambda \beta^* + \sigma^2 H^n_\lambda + \sigma^2
\end{aligned}$$
Consider the infimum
$$\inf_{\lambda \ge 0} R(\hat\beta_{n,\lambda}) \tag{71}$$
We consider several cases below.

Case (1). Suppose the infimum in Equation (71) is achieved in the limit $\lambda \to +\infty$. In this case, monotonicity trivially holds, since
$$\lim_{\lambda\to\infty} R(\hat\beta_{n,\lambda}) = R(\vec 0) = \lim_{\lambda\to\infty} R(\hat\beta_{n+1,\lambda})$$

Case (2). Suppose the infimum in Equation (71) is achieved by some $\lambda = \lambda^{opt}_n$ in the interior of the set $(0, \infty)$. Because $R(\hat\beta_{n,\lambda})$ is continuous and differentiable in $\lambda$ for all $\lambda \in (0, \infty)$, $\lambda^{opt}_n$ must satisfy the following first-order optimality condition:
$$\frac{d}{d\lambda}R(\hat\beta_{n,\lambda})\Big|_{\lambda=\lambda^{opt}_n} = 0 \;\Longleftrightarrow\; \Big[(\beta^*)^T\frac{dG^n_\lambda}{d\lambda}\beta^* + \sigma^2\frac{dH^n_\lambda}{d\lambda}\Big]_{\lambda=\lambda^{opt}_n} = 0 \tag{72-73}$$
We will later use this condition to show monotonicity.

Case (3). Suppose the infimum in Equation (71) is achieved at $\lambda^{opt}_n = 0$. Recall, we define $\hat\beta_{n,0} := \lim_{\lambda\to 0^+}\hat\beta_{n,\lambda}$. This means that
$$\frac{d}{d\lambda}R(\hat\beta_{n,\lambda})\Big|_{\lambda=0} = \Big[(\beta^*)^T\frac{dG^n_\lambda}{d\lambda}\beta^* + \sigma^2\frac{dH^n_\lambda}{d\lambda}\Big]_{\lambda=0} \ge 0 \tag{74}$$
Note that since $\frac{dH^n_\lambda}{d\lambda} \le 0$, Equation (73) in Case (2) and Equation (74) in Case (3) both imply that
$$\sigma^2 \le -\frac{(\beta^*)^T\big(\frac{dG^n_\lambda}{d\lambda}\big)\beta^*}{\,dH^n_\lambda/d\lambda\,}\bigg|_{\lambda=\lambda^{opt}_n} \tag{75}$$
Now, assuming Conjecture 2, we will show that the choice $\lambda^{opt}_n$ in Cases (2) and (3) has non-increasing test risk for $n+1$ samples. That is,
$$R(\hat\beta_{n,\lambda^{opt}_n}) \ge R(\hat\beta_{n+1,\lambda^{opt}_n})$$
This implies the desired monotonicity, since $R(\hat\beta_{n+1,\lambda^{opt}_n}) \ge R(\hat\beta_{n+1,\lambda^{opt}_{n+1}})$.

We first consider the case $H^n_\lambda - H^{n+1}_\lambda\big|_{\lambda=\lambda^{opt}_n} \ge 0$. In this case, because $G^n_\lambda - G^{n+1}_\lambda \succeq 0$ by assumption, we have
$$R(\hat\beta_{n,\lambda^{opt}_n}) - R(\hat\beta_{n+1,\lambda^{opt}_n}) = \Big[(\beta^*)^T(G^n_\lambda - G^{n+1}_\lambda)\beta^* + \sigma^2(H^n_\lambda - H^{n+1}_\lambda)\Big]_{\lambda=\lambda^{opt}_n} \ge 0 \tag{76}$$
Otherwise, assume $H^n_\lambda - H^{n+1}_\lambda\big|_{\lambda=\lambda^{opt}_n} \le 0$. Then we have:
$$\begin{aligned}
R(\hat\beta_{n,\lambda^{opt}_n}) - R(\hat\beta_{n+1,\lambda^{opt}_n}) &= \Big[(\beta^*)^T(G^n_\lambda - G^{n+1}_\lambda)\beta^* + \sigma^2(H^n_\lambda - H^{n+1}_\lambda)\Big]_{\lambda=\lambda^{opt}_n} && (78)\\
&\ge \Big[(\beta^*)^T(G^n_\lambda - G^{n+1}_\lambda)\beta^* - (\beta^*)^T\Big(\frac{dG^n_\lambda}{d\lambda}\Big)\beta^*\cdot\frac{H^n_\lambda - H^{n+1}_\lambda}{dH^n_\lambda/d\lambda}\Big]_{\lambda=\lambda^{opt}_n} && \text{(by Eq. (75), and } H^n_\lambda - H^{n+1}_\lambda \le 0\text{)}\\
&= (\beta^*)^T\underbrace{\Big[\big(G^n_\lambda - G^{n+1}_\lambda\big) - \frac{H^n_\lambda - H^{n+1}_\lambda}{dH^n_\lambda/d\lambda}\cdot\frac{dG^n_\lambda}{d\lambda}\Big]}_{\succeq\,0\text{ by Conjecture 2}}\beta^*\;\Bigg|_{\lambda=\lambda^{opt}_n} \;\ge\; 0
\end{aligned}$$
as desired.
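For reference, here is a small numpy sketch of the kind of numerical check referred to above (our own choices of Q, d, λ, n, and trial counts; not the authors' verification code). It estimates $G^n_\lambda = \lambda^2\,\mathbb{E}[(X^TX + \lambda Q)^{-2}]$ by coupled Monte Carlo, with the $n$-sample matrix taken as the top $n$ rows of the $(n+1)$-sample matrix, and checks that $G^n_\lambda - G^{n+1}_\lambda$ is numerically PSD, i.e. condition (67).

```python
# Numerical sanity check of condition (67) (assumed toy parameters, not the authors' verification).
import numpy as np

def G_pair(n, d, lam, Q, trials=2000, seed=0):
    """Coupled Monte Carlo estimates of (G^n_lambda, G^{n+1}_lambda), where
    G^n_lambda = lam^2 * E[(X^T X + lam*Q)^{-2}] and the n-sample X is the
    top n rows of the (n+1)-sample X (coupling reduces Monte Carlo variance)."""
    rng = np.random.default_rng(seed)
    acc_n, acc_np1 = np.zeros((d, d)), np.zeros((d, d))
    for _ in range(trials):
        X = rng.standard_normal((n + 1, d))
        for acc, Xk in ((acc_n, X[:n]), (acc_np1, X)):
            M = np.linalg.inv(Xk.T @ Xk + lam * Q)
            acc += M @ M
    return lam**2 * acc_n / trials, lam**2 * acc_np1 / trials

d, lam = 12, 2.0
rng = np.random.default_rng(3)
A = rng.standard_normal((d, d))
Q = A @ A.T + d * np.eye(d)                  # an arbitrary positive-definite Q
for n in [4, 6, 9, 11]:
    Gn, Gnp1 = G_pair(n, d, lam, Q)
    # Minimum eigenvalue of the difference: nonnegative (up to Monte Carlo error)
    # whenever condition (67) holds for this instance.
    print(n, np.linalg.eigvalsh(Gn - Gnp1).min())
```

Checking condition (68) in the same spirit would additionally require Monte Carlo estimates of $H^n_\lambda$, $H^{n+1}_\lambda$ and finite-difference approximations of the λ-derivatives of $G^n_\lambda$ and $H^n_\lambda$; we omit that here.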