# Generalized equivalences between subsampling and ridge regularization

Pratik Patil
Department of Statistics, University of California, Berkeley, CA 94720, USA
pratikpatil@berkeley.edu

Jin-Hong Du
Department of Statistics and Data Science & Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
jinhongd@andrew.cmu.edu

We establish precise structural and risk equivalences between subsampling and ridge regularization for ensemble ridge estimators. Specifically, we prove that linear and quadratic functionals of subsample ridge estimators, when fitted with different ridge regularization levels λ and subsample aspect ratios ψ, are asymptotically equivalent along specific paths in the (λ, ψ)-plane (where ψ is the ratio of the feature dimension to the subsample size). Our results only require bounded moment assumptions on feature and response distributions and allow for arbitrary joint distributions. Furthermore, we provide a data-dependent method to determine the equivalent paths of (λ, ψ). An indirect implication of our equivalences is that optimally tuned ridge regression exhibits a monotonic prediction risk in the data aspect ratio. This resolves a recent open problem raised by Nakkiran et al. [1] for general data distributions under proportional asymptotics, assuming a mild regularity condition that maintains regression hardness through linearized signal-to-noise ratios.

1 Introduction

Ensemble methods, such as bagging [2, 3], are powerful tools that combine weak predictors to improve predictive stability and accuracy. This paper focuses on sampling-based ensembles, which exhibit an implicit regularization effect [4–6]. Specifically, we investigate subsample ridge ensembles, where the ridge predictors are fitted on independently subsampled datasets [7–9]. Recent work has demonstrated that a full ensemble (that is, fitted on all possible subsampled datasets) of ridgeless [10] predictors achieves the same squared prediction risk as a ridge predictor fitted on the full data [11–13].

To be precise, let ϕ be the limiting dataset aspect ratio p/n, where p is the feature dimension and n is the number of observations. For a given ϕ, the limiting prediction risk of subsample ridge ensembles is parameterized by (λ, ψ), where λ ≥ 0 is the ridge regularization parameter and ψ ≥ ϕ is the limiting subsample aspect ratio p/k, with k being the subsample size [11, 12]. Under isotropic features and a well-specified linear model, [11, 12] show that the squared prediction risk at (0, ψ*) (the optimal ridgeless ensemble) is the same as the risk at (λ*, ϕ) (the optimal ridge), where ψ* and λ* are the optimal subsample aspect ratio and ridge penalty, respectively. Furthermore, this equivalence of prediction risk between subsampling and ridge regularization is extended in [13] to anisotropic linear models. As an application, [13] also demonstrates how generalized cross-validation for ridge regression can be naturally transferred to subsample ridge ensembles. In essence, these results suggest that subsampling a smaller number of observations has the same effect as adding an appropriately larger level of ridge penalty.

These findings prompt two important open questions:

(a) The extent of equivalences. The previous works all focus on equivalences of the squared prediction risk when the test distribution matches the training distribution.
In real-world scenarios, however, there are often covariate and label shifts, making it crucial to consider prediction risks under such shifts. In addition, other risk measures, such as training error, estimation risk, coefficient errors, and more, are also of interest in various inferential tasks. A natural question is then whether similar equivalences hold for such generalized risk measures. At a basic level, the question boils down to whether any equivalences exist at the estimator level. Answering this question would establish a connection between the ensemble and the ridge estimators, facilitating the exchange of various downstream inferential statements between the two sets of estimators.

Table 1: Comparison with related work. The marker indicates a partial equivalence result connecting the optimal prediction risk of the ridge predictor and the full ridgeless ensemble predictor.

| | pred. risk | gen. risk | estimator | response | feature | lim. spectrum |
|---|---|---|---|---|---|---|
| Le Jeune et al. [11] | | | | linear | isotropic Gaussian | exists |
| Patil et al. [12] | | | | linear | isotropic RMT | exists |
| Du et al. [13] | | | | linear | anisotropic RMT | exists |
| This work | | | | arbitrary | anisotropic RMT | need not exist |

(b) The role of the linear model. All previous works assume a well-specified linear model between the responses and the features, which rarely holds in practical applications. A natural question is whether the equivalences hold for arbitrary joint distributions of the response and the features, or whether they are merely an artifact of the linear model. Addressing this question broadens the applicability of such equivalences beyond simplistic models to real-world scenarios, where the relationship between the response and the features is typically intricate and unknown.

We provide answers to the questions raised in both directions. We demonstrate that the equivalences hold for the generalized squared risks in the full ensemble. Further, these equivalences fundamentally occur at the estimator level for any arbitrary ensemble size. Importantly, these results are not limited to linear models of the data-generating process. Below we provide a summary of our main results.

1.1 Summary of contributions

(1) Risk equivalences. We establish asymptotic equivalences of the full-ensemble ridge estimators at different ridge penalties λ and subsample ratios ψ along specific paths in the (λ, ψ)-plane for a variety of generalized risk functionals. This class of functionals includes commonly used risk measures (see Table 2), and our results hold for both independent and dependent coefficient matrices that define these risk functionals (see Theorem 1 and Proposition 2). In addition, we demonstrate that the equivalence path remains unchanged across all the functionals examined.

(2) Structural equivalences. We establish structural equivalences in the form of linear functionals of the ensemble ridge estimators, which hold for arbitrary ensemble sizes (see Theorem 3). Our proofs for both structural and risk equivalences exploit properties of certain fixed-point equations that arise in our analysis, enabling us to explicitly characterize paths that yield equivalent estimators in the (λ, ψ)-plane (see Equation (5)). In addition, we provide an entirely data-driven construction of this path using certain companion Stieltjes transform relations of random matrices (see Proposition 4).

(3) Equivalence implications.
As an implication of our equivalences, we show that the prediction risk of an optimally tuned ridge estimator is monotonically increasing in the data aspect ratio under mild regularity conditions (see Theorem 6). This is an indirect consequence of our general equivalence results that leverages the provable monotonicity of the subsample-optimized estimator. Under proportional asymptotics, our result settles a recent open question raised by Nakkiran et al. [1, Conjecture 1] concerning the monotonicity of optimal ridge regression under anisotropic features and general data models, while maintaining a regularity condition that preserves the linearized signal-to-noise ratios across regression problems.

(4) Generality of equivalences. Our main results apply to arbitrary responses with bounded 4 + µ moments for some µ > 0, as well as features with similar bounded moments and general covariance structures. We demonstrate the practical implications of our findings on real-world datasets (see Section 6). On the technical side, we extend the tools developed in [14] to handle the dependency between the linear and the nonlinear components and obtain model-free equivalences. Furthermore, we extend our analysis to include generalized ridge regression (see Corollary 5). Through experiments, we demonstrate the universality of the equivalences for random and kernel features and conjecture the associated data-driven prediction risk equivalence paths (see Section 6).

1.2 Related work

Ensemble learning. Ensemble methods yield strong predictors by combining weak predictors [15]. The most common strategy for building ensembles is based on subsampling observations, which includes bagging [2, 3], random forests [16], and neural network ensembles [17–19], among others. The effect of sampling-based ensembles has been studied in the statistical physics literature [20–22] as well as the statistical learning literature [2, 3]. Recent works have attempted to explain the success of ensemble learning by suggesting that subsampling induces a regularizing effect [4, 6]. Under proportional asymptotics, the effect of ensemble random feature models has been investigated in [23–25]. There has been growing interest in connecting the effect of ensemble learning to explicit regularization: see [26] for related experimental evidence under the umbrella of mini-patch learning. Towards making these connections precise, some work has been done in the context of ridge and ridgeless ensembles: [11–13] investigate the relationship between subsampling and regularization for ridge ensembles in the overparameterized regime. We delve into these works in detail next.

Ensembles and ridge equivalences. In the study of ridge ensembles, Le Jeune et al. [11] show that the optimal ridge predictor has the same prediction risk as the ridgeless ensemble under the Gaussian isotropic design, which has been extended in [12] to RMT features (see Assumption 2 for the definition). Specifically, these works establish the prediction risk equivalences for a specific pair (λ*, ϕ) and (0, ψ*). The results are then extended to the entire range of (λ, ψ) using deterministic asymptotic risk formulas [13], assuming a well-specified linear model, RMT features, and the existence of a fixed limiting spectral distribution of the covariance matrix. Our work significantly broadens the scope of results connecting subsampling and ridge regularization.
In particular, we allow for arbitrary joint distributions of the data, anisotropic features with only bounded moments, and do not assume convergence of the spectrum of the population covariance matrix. Furthermore, we expand the applicability of the equivalences to encompass both the estimators and various generalized squared risk functionals. See Table 1 for an explicit comparison and Section 3 for additional related comparisons.

Other ridge connections. Beyond the connections between implicit regularization induced by subsampling and explicit ridge regularization examined in this paper, ridge regression shares ties with various other forms of regularization. For example, ridge regression has been linked to dropout regularization [27, 28], variable splitting in random forests [2], noisy training [29, 30], random sketched regression [31–33], various forms of data augmentation [34], feature augmentation [9, 35], and early stopping in gradient descent and its variants [36–39], among others. These connections highlight the pervasiveness and the significance of understanding various facets of ridge regularization. In that direction, our work contributes by expanding the equivalences between subsampling and ridge regularization.

2 Notation and preliminaries

Ensemble estimators. Let D_n = {(x_i, y_i) : i ∈ [n]} be a dataset containing i.i.d. samples in ℝ^p × ℝ. Let X ∈ ℝ^{n×p} and y ∈ ℝ^n be the feature matrix and response vector with x_i^⊤ and y_i in the i-th rows. For an index set I ⊆ [n] of size k, let D_I = {(x_i, y_i) : i ∈ I} be the associated subsampled dataset. Let L_I ∈ ℝ^{n×n} be a diagonal matrix whose i-th diagonal entry is 1 if i ∈ I and 0 otherwise. Note that the feature matrix and response vector associated with D_I are L_I X and L_I y, respectively. Given a ridge penalty λ > 0, the ridge estimator fitted on D_I consisting of k samples is given by:

$$\hat\beta^\lambda_k(\mathcal{D}_I) = \operatorname*{argmin}_{\beta \in \mathbb{R}^p}\; \frac{1}{k} \sum_{i \in I} (y_i - x_i^\top \beta)^2 + \lambda \|\beta\|_2^2 = \left(\frac{X^\top L_I X}{k} + \lambda I_p\right)^{-1} \frac{X^\top L_I y}{k}. \tag{1}$$

The ridgeless estimator β̂⁰_k(D_I) = (X^⊤ L_I X/k)^+ X^⊤ L_I y/k is obtained by sending λ → 0⁺, where M^+ denotes the Moore–Penrose inverse of a matrix M.

Table 2: Summary of various generalized risks and their corresponding statistical learning problems. The asymptotic equivalences between subsampling and ridge regularization are analyzed in Theorem 1 and Proposition 2 for generalized risks (3) defined through functionals L_{A,b} for various A and b.

| Statistical learning problem | L_{A,b}(β̂ − β₀) | A | b | nrow(A) |
|---|---|---|---|---|
| vector coefficient estimation | β̂ − β₀ | I_p | 0 | p |
| projected coefficient estimation | a^⊤(β̂ − β₀) | a^⊤ | 0 | 1 |
| training error estimation | Xβ̂ − y | X | −f_NL | n |
| in-sample prediction | X(β̂ − β₀) | X | 0 | n |
| out-of-sample prediction | x₀^⊤β̂ − y₀ | x₀^⊤ | −ε₀ | 1 |

To define the ensemble estimator, denote the set of all subsets of [n] of k distinct elements by I_k = {{i₁, i₂, …, i_k} : 1 ≤ i₁ < i₂ < … < i_k ≤ n}. For λ ≥ 0, the M-ensemble and the full-ensemble estimators are respectively defined as follows:

$$\hat\beta^\lambda_{k,M}(\mathcal{D}_n; \{I_\ell\}_{\ell=1}^{M}) = \frac{1}{M}\sum_{\ell=1}^{M} \hat\beta^\lambda_k(\mathcal{D}_{I_\ell}), \qquad \text{and} \qquad \hat\beta^\lambda_{k,\infty}(\mathcal{D}_n) = \mathbb{E}\big[\hat\beta^\lambda_k(\mathcal{D}_I) \mid \mathcal{D}_n\big]. \tag{2}$$

Here {I_ℓ}_{ℓ=1}^M and I are simple random samples from I_k. In (2), the full-ensemble ridge estimator β̂^λ_{k,∞}(D_n) is the average of predictors fitted on all possible subsampled datasets. It is shown in Lemma A.1 of [13] that β̂^λ_{k,∞}(D_n) is also asymptotically equivalent to the limit of β̂^λ_{k,M}(D_n; {I_ℓ}_{ℓ=1}^M) as the ensemble size M → ∞ (conditioning on the full dataset D_n). This also justifies using ∞ in the subscript to denote the full-ensemble estimator. For brevity, we simply write the estimators as β̂^λ_{k,M} and β̂^λ_{k,∞}, and drop the dependency on D_n and {I_ℓ}_{ℓ=1}^M.
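To make the definitions in (1) and (2) concrete, here is a minimal NumPy sketch of the subsample ridge and M-ensemble estimators; the function names and the random-number handling are our own choices and are not taken from the paper's released code.

```python
import numpy as np

def ridge_on_subsample(X, y, idx, lam):
    """Ridge estimator of Eq. (1) fitted on the subsample D_I indexed by idx."""
    Xk, yk = X[idx], y[idx]
    k, p = Xk.shape
    gram = Xk.T @ Xk / k
    target = Xk.T @ yk / k
    if lam > 0:
        return np.linalg.solve(gram + lam * np.eye(p), target)
    # Ridgeless case (lambda -> 0+): Moore-Penrose pseudoinverse, as in Section 2.
    return np.linalg.pinv(gram) @ target

def ensemble_ridge(X, y, k, lam, M, rng=None):
    """M-ensemble estimator of Eq. (2): average of M ridge fits on
    independently drawn simple random subsamples of size k."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    fits = [ridge_on_subsample(X, y, rng.choice(n, size=k, replace=False), lam)
            for _ in range(M)]
    return np.mean(fits, axis=0)
```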
We will show certain structural equivalences, in terms of linear projections, in the family of estimators β̂^λ_{k,M} along certain paths in the (λ, p/k)-plane for arbitrary ensemble size M ∈ ℕ ∪ {∞} (in Theorem 3). Apart from connecting the estimators, we will also show equivalences of various notions of risks (in Theorem 1), which we describe next.

Generalized risks. Since the ensemble ridge estimators are linear estimators, we evaluate their performance relative to the oracle parameter β₀ = E[xx^⊤]^{-1} E[xy], which is the best (population) linear projection of y onto x and minimizes the linear regression error (see, e.g., [40–42]). Note that we can decompose any response y as y = f_LI(x) + f_NL(x), where f_LI(x) = β₀^⊤ x is the oracle linear predictor and f_NL(x) = y − f_LI(x) is the nonlinear component that is not explained by f_LI(x). The best linear projection has the useful property that f_LI(x) is (linearly) uncorrelated with f_NL(x), although they are generally dependent. It is worth mentioning that this does not imply that y and x follow a linear regression model. Indeed, our framework allows any nonlinear dependence structure between them and is model-free for the joint distribution of (x, y). Relative to β₀, we measure the performance of an estimator β̂ via the generalized mean squared risk defined compactly as follows:

$$R(\hat\beta; A, b, \beta_0) = \frac{1}{\operatorname{nrow}(A)} \big\| L_{A,b}(\hat\beta - \beta_0) \big\|_2^2, \tag{3}$$

where L_{A,b}(β) = Aβ + b is a linear functional. Note that A and b can potentially depend on the data, and their dimensions can vary depending on the statistical learning problem at hand. This framework includes various important statistical learning problems, as summarized in Table 2.

Observe that the framework also includes various notions of prediction risks. One can use the test error formulation in Table 2 to obtain the prediction risk at any test point (x₀, y₀), where y₀ = x₀^⊤ β₀ + ε₀, and ε₀ may depend on x₀ and have non-zero mean. This also permits the test point to be drawn from a distribution that differs from the training distribution. Specifically, when ε₀ = f_NL(x₀) but x₀ has a distribution different from x, we obtain the prediction error under covariate shift. Similarly, when ε₀ ≠ f_NL(x₀) but x₀ and x have the same distribution, we get the prediction error under label shift.

Asymptotic equivalence. For our theoretical analysis, we consider the proportional asymptotics regime, where the ratio of the feature size p to the sample size n tends to a fixed limiting data aspect ratio ϕ ∈ (0, ∞). To concisely present our results, we use the elegant framework of asymptotic equivalence [12, 43, 44]. Let A_p and B_p be sequences of (additively) conformable matrices of arbitrary dimensions (including vectors and scalars). We say that A_p and B_p are asymptotically equivalent, denoted as A_p ≃ B_p, if lim_{p→∞} |tr[C_p(A_p − B_p)]| = 0 almost surely for any sequence of random matrices C_p with bounded trace norm that are (multiplicatively) conformable and independent of A_p and B_p. Note that for sequences of scalar random variables, the definition simply reduces to the typical almost sure convergence of the sequences of random variables involved.
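As a concrete companion to (3) and Table 2, the following small helper (our own naming, not from the paper's code) evaluates the generalized risk for a given coefficient pair (A, b); the commented examples assume the relevant quantities (beta_hat, beta_0, X, f_nl, x0, eps0) are available from a surrounding experiment.

```python
import numpy as np

def generalized_risk(beta_hat, beta_0, A, b):
    """Generalized squared risk of Eq. (3): ||A (beta_hat - beta_0) + b||_2^2 / nrow(A)."""
    A = np.atleast_2d(A)
    resid = A @ (beta_hat - beta_0) + b
    return np.sum(resid ** 2) / A.shape[0]

# Instances from Table 2:
# estimation risk:       generalized_risk(beta_hat, beta_0, np.eye(len(beta_0)), 0.0)
# in-sample prediction:  generalized_risk(beta_hat, beta_0, X, 0.0)
# training error:        generalized_risk(beta_hat, beta_0, X, -f_nl)
# out-of-sample (test):  generalized_risk(beta_hat, beta_0, x0[None, :], -eps0)
```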
Figure 1: Heat map of various generalized risks (estimation risk, training error, prediction risk, out-of-distribution (OOD) prediction risk) of full-ensemble ridge estimators (approximated with M = 100), for varying ridge penalties λ and subsample aspect ratios ψ = p/k on the log-log scale. The data model is described in Appendix F.2 with p = 500, n = 5000, and ϕ = p/n = 0.1. Observe that along the same path the values match well for all risks, in line with Theorem 1 and Proposition 2.

3 Generalized risk equivalences of ensemble estimators

We begin by examining the risk equivalences among different ensemble ridge estimators for various generalized risks defined as in (3). To prepare for our upcoming results, we impose two structural and moment assumptions on the distributions of the response variable and the feature vector.

Assumption 1 (Response variable distribution). Each response variable y_i for i ∈ [n] has mean 0 and satisfies E[|y_i|^{4+µ}] ≤ M_µ < ∞ for some µ > 0 and a constant M_µ.

Assumption 2 (Feature vector distribution). Each feature vector x_i for i ∈ [n] can be decomposed as x_i = Σ^{1/2} z_i, where z_i ∈ ℝ^p contains i.i.d. entries z_{ij} for j ∈ [p] with mean 0, variance 1, and bounded moments E[|z_{ij}|^{4+ν}] ≤ M_ν < ∞ for some ν > 0 and a constant M_ν, and Σ ∈ ℝ^{p×p} is a deterministic and symmetric matrix with eigenvalues uniformly bounded between constants r_min > 0 and r_max < ∞.

Note that we do not impose any specific model assumptions on the response variable y in relation to the feature vector x. We only assume bounded moments as stated in Assumption 1, making all of our subsequent results model-free. The zero-mean assumption for y is for simplicity since, in practice, centering can always be done by subtracting the sample mean. The bounded moment condition can also be satisfied if one imposes a stronger distributional assumption (e.g., sub-Gaussianity). Assumption 2 on the feature vector is common in the study of random matrix theory [45, 46] and the analysis of ridge and ridgeless regression [10, 47–49], which we refer to as RMT features for brevity.

Given a limiting data aspect ratio ϕ ∈ (0, ∞) and a limiting subsample aspect ratio ψ ∈ [ϕ, ∞], our statement of equivalences between different ensemble estimators is defined through certain paths characterized by two endpoints (0, ψ) and (λ̄, ϕ). These endpoints correspond to the subsample ridgeless ensemble and the (non-ensemble) ridge predictor, respectively, with λ̄ being the ridge penalty to be defined next. For that, let H_p be the empirical spectral distribution of Σ: H_p(r) = p^{-1} Σ_{i=1}^p 1{r_i ≤ r}, where the r_i's are the eigenvalues of Σ. Consider the following system of equations in λ̄ and v:

$$\frac{1}{v} = \bar\lambda + \phi \int \frac{r}{1 + v r}\, \mathrm{d}H_p(r), \qquad \text{and} \qquad \frac{1}{v} = \psi \int \frac{r}{1 + v r}\, \mathrm{d}H_p(r). \tag{4}$$

Existence and uniqueness of the solution (λ̄, v) ∈ [0, ∞]² to the above equations are guaranteed by Corollary E.4. Now, define a path P(λ̄; ϕ, ψ) that passes through the endpoints (0, ψ) and (λ̄, ϕ):

$$\mathcal{P}(\bar\lambda; \phi, \psi) = \big\{ (1-\theta)\cdot(\bar\lambda, \phi) + \theta\cdot(0, \psi) \;\big|\; \theta \in [0, 1] \big\}. \tag{5}$$

We are ready to state our results. We consider two cases for the generalized risk (3), depending on the relationship between (A, b) and X. In the first case, when both A and b are independent of the data, Theorem 1 shows that the generalized risks are equivalent along the path (5).

Theorem 1 (Risk equivalences when (A, b) ⊥⊥ (X, y)). Suppose Assumptions 1 and 2 hold. Let (A, b) be independent of (X, y) such that nrow(A)^{-1/2}‖A‖_op and ‖b‖₂ are almost surely bounded.
Let n, p → ∞ such that p/n → ϕ ∈ (0, ∞). For any ψ ∈ [ϕ, +∞], let λ̄ be as defined in (4). Then, for any pair (λ₁, ψ₁) and (λ₂, ψ₂) on the path P(λ̄; ϕ, ψ) as defined in (5), the generalized risk functionals (3) of the full-ensemble estimator are asymptotically equivalent:

$$R\big(\hat\beta^{\lambda_1}_{p/\psi_1, \infty}; A, b, \beta_0\big) \simeq R\big(\hat\beta^{\lambda_2}_{p/\psi_2, \infty}; A, b, \beta_0\big). \tag{6}$$

In other words, Theorem 1 establishes the equivalences of subsampling and ridge regularization in terms of generalized risk for many statistical learning problems, encompassing coefficient estimation, coefficient confidence intervals, and test error estimation. All of these problems fall in the category where (A, b) is independent of X. However, there are other statistical learning problems not covered by Theorem 1, such as the in-sample prediction risk and the training error, which correspond to the case when A = X. Fortunately, the equivalences also apply to them, as summarized in Proposition 2.

Proposition 2 (Risk equivalences when A = X). Under Assumptions 1 and 2, when A = X, the conclusion of Theorem 1 continues to hold for the cases of in-sample prediction risk (b = 0) and training error (b = −f_NL).

Both Theorem 1 and Proposition 2 provide specific second-order equivalences for the full-ensemble ridge estimators, in the sense that the quadratic functionals of the estimators associated with (A, b) are asymptotically the same. See Figure 1 for visual illustrations of both equivalences. These statements are presented separately because their proofs differ, with the proof for the dependent case being more intricate. We note that by combining the risk functionals in Table 2, it is possible to further extend the equivalences to other types of functionals not directly covered by the statements above, using composition and continuous mapping. For example, it is not difficult to show that similar equivalences hold for generalized cross-validation in the full ensemble [13]. This follows by combining the result for the training error from Proposition 2 with the denominator concentration proved in Lemma 3.4 of [13].

Theorem 1 generalizes the existing equivalence results for the prediction risk [11–13] to general risk functionals under milder assumptions. As summarized in Table 1, we allow for a general response model and feature covariance without assuming convergence of the spectral distributions H_p to a fixed distribution H. This flexibility is achieved by showing equivalences of the underlying resolvents in the estimators rather than deriving the limiting functionals as done in previous works. To extend the results to allow a general response, we generalize certain concentration results by leveraging tools from [14] to handle the dependency between the linear and nonlinear components.

As alluded to earlier, it is worth noting that the generalized risk equivalences only hold in the full ensemble. However, by using the risk decomposition obtained from Lemma S.1.1 of [12] for finite ensembles, one can obtain the following relationship along the path in Theorem 3:

$$R\big(\hat\beta^{\lambda_1}_{p/\psi_1, M}; A, b, \beta_0\big) - R\big(\hat\beta^{\lambda_2}_{p/\psi_2, M}; A, b, \beta_0\big) = \Delta / M,$$

for some Δ (independent of M) that is eventually almost surely bounded. In other words, the difference between any two estimators on the same path scales as 1/M when n is sufficiently large, as numerically verified in Figure F5.
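For intuition, the path in (5) can be computed directly from the fixed-point system (4) when the eigenvalues of Σ are known. The sketch below does this numerically; it is our own helper (not part of the paper's code) and, for simplicity, assumes ψ > 1 so that the solution v is finite.

```python
import numpy as np
from scipy.optimize import brentq

def equivalence_endpoint(eigs, phi, psi):
    """Solve the fixed-point system (4) for (lam_bar, v) given the eigenvalues of
    Sigma and aspect ratios (phi, psi). Assumes psi > 1 so that v is finite."""
    # Second equation of (4): 1/v = psi * int r / (1 + v r) dH_p(r).
    g = lambda v: 1.0 / v - psi * np.mean(eigs / (1.0 + v * eigs))
    v = brentq(g, 1e-12, 1e12)
    # Subtracting the two equations of (4) gives lam_bar = (1 - phi/psi) / v.
    lam_bar = (1.0 - phi / psi) / v
    return lam_bar, v

def equivalence_path(lam_bar, phi, psi, num=5):
    """Points on the path P(lam_bar; phi, psi) of Eq. (5): the segment
    joining (lam_bar, phi) and (0, psi)."""
    thetas = np.linspace(0.0, 1.0, num)
    return [((1.0 - t) * lam_bar, (1.0 - t) * phi + t * psi) for t in thetas]
```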
4 Structural equivalences of ensemble estimators

The equivalences established in the previous section only hold for the full-ensemble estimators in a second-order sense. We can go a step further and ask if there exist any equivalences for finite ensembles, and if there are any equivalences at the estimator coordinate level between the estimators. Specifically, we aim to inspect whether each coordinate of the p-dimensional estimated coefficients asymptotically coincides. This section establishes structural relationships between the estimators in a first-order sense: any bounded linear functional of them is asymptotically the same.

Theorem 3 (Structural equivalences). Suppose Assumptions 1 and 2 hold. Let n, p → ∞ with p/n → ϕ ∈ (0, ∞). For any ψ ∈ [ϕ, +∞], let λ̄ be as in (4). Then, for any M ∈ ℕ ∪ {∞} and any pair (λ₁, ψ₁) and (λ₂, ψ₂) on the path (5), the M-ensemble estimators are asymptotically equivalent:

$$\hat\beta^{\lambda_1}_{p/\psi_1, M} \simeq \hat\beta^{\lambda_2}_{p/\psi_2, M}. \tag{7}$$

Put another way, Theorem 3 implies that for any fixed ensemble size M, the M-ensemble ridgeless estimator at the limiting aspect ratio ψ is equivalent to ridge regression at the regularization level λ̄. This equivalence is visually illustrated in Figure 2, where the data are simulated from an anisotropic and nonlinear model (see Appendix F.2 for more details). Furthermore, the contour path P(λ̄; ϕ, ψ) connecting the endpoints (0, ψ) and (λ̄, ϕ) is a straight line, although it may have varying slopes. Along any path, all M-ensemble estimators at limiting aspect ratio ψ and ridge penalty λ are equivalent for all (λ, ψ) ∈ P(λ̄; ϕ, ψ). The equivalences here hold for arbitrary linear combinations of the estimators. In particular, this implies that the predicted values (or even any continuous function applied to them, due to the continuous mapping theorem) at any test point will eventually be the same, almost surely, with respect to the training data. Finally, we remark that in the statement of Theorem 3, the equivalence is defined on the extended reals in order to incorporate the ridgeless case when ψ = 1.

Figure 2: Heat map of the values of various linear functionals of ensemble ridge estimators a^⊤β̂^λ_{k,M}, for varying ridge penalties λ, subsample aspect ratios ψ = p/k, and ensemble sizes M on the log-log scale. The data model is described in Appendix F.2 with p = 1000, n = 10000, and ϕ = p/n = 0.1. (a) The values of the uniformly weighted projection for varying ensemble sizes (M = 1, 5, and 50). (b) The values at M = 100 for varying projection vectors (uniformly weighted, random Gaussian, and random Student's t). The black dashed line is estimated based on Proposition 4 with ψ = 2. Observe that the values along the data-dependent paths are indeed very similar, in line with Proposition 4.

The paths (5) in Theorem 3 are defined via the spectral distribution H_p, which requires knowledge of Σ. This is often difficult to obtain in practice from the observed data. Fortunately, we can provide an alternative characterization of the path (5) solely through the data, as summarized in Proposition 4.

Proposition 4 (Data-dependent equivalence path characterization). Suppose Assumptions 1 and 2 hold. Define ϕ_n = p/n. Let k ≤ n be the subsample size and denote ψ_n = p/k. For any M ∈ ℕ ∪ {∞}, let λ̄_n be the value that satisfies the following equation in the ensemble ridgeless and ridge gram matrices:

$$\frac{1}{M}\sum_{\ell=1}^{M} \frac{1}{n}\operatorname{tr}\!\left[\left(\frac{1}{k} L_{I_\ell} X X^\top L_{I_\ell}\right)^{+}\right] = \frac{1}{n}\operatorname{tr}\!\left[\left(\frac{1}{n} X X^\top + \bar\lambda_n I_n\right)^{-1}\right]. \tag{8}$$

Define the data-dependent path P_n = P(λ̄_n; ϕ_n, ψ_n). Then, the conclusion in Theorem 3 continues to hold if we replace the (population) path P by the (data-dependent) path P_n.
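A small sketch of solving (8) numerically is given below; the bracketing interval and helper name are our own choices, and the routine assumes a root lies inside the search interval, as discussed after the proposition. Given the returned λ̄_n, the equivalence path P_n follows from (5) exactly as before.

```python
import numpy as np
from scipy.optimize import brentq

def data_dependent_lambda(X, subsample_indices):
    """Solve Eq. (8): match the averaged trace of the pseudoinverted subsample gram
    matrices (ridgeless side) to the trace of the full-data ridge-gram inverse."""
    n = X.shape[0]
    # Left-hand side of (8), averaged over the supplied subsamples.
    lhs = 0.0
    for idx in subsample_indices:
        k = len(idx)
        G_sub = X[idx] @ X[idx].T / k          # k x k subsample gram matrix
        lhs += np.trace(np.linalg.pinv(G_sub)) / n
    lhs /= len(subsample_indices)
    # Right-hand side of (8) as a function of lam, via eigenvalues of X X^T / n.
    evals = np.clip(np.linalg.eigvalsh(X @ X.T / n), 0.0, None)
    rhs = lambda lam: np.mean(1.0 / (evals + lam))
    # rhs is strictly decreasing in lam; a root is assumed to lie in the bracket.
    return brentq(lambda lam: rhs(lam) - lhs, 1e-12, 1e12)
```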
The term data-dependent path signifies that we can estimate the level of implicit regularization induced by subsampling solely based on the observed data by solving (8). We remark that (8) always has at least one solution for a given triplet (n, k, p). This is because the right-hand side of (8) is monotonically decreasing in λ, and the left-hand side always lies within the range of the right-hand side. The powerful implication of the characterization in Proposition 4 is that it enables practical computation of the path using real-world data. In Section 6, we demonstrate this approach on various real-world datasets, allowing us to predict an equivalent amount of explicit regularization matching the implicit regularization due to subsampling.

One natural extension of the results is to consider the generalized ridge estimator as the base estimator in (1):

$$\hat\beta^{\lambda, G}_k(\mathcal{D}_I) = \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\; \frac{1}{k}\sum_{i\in I} (y_i - x_i^\top\beta)^2 + \lambda \big\|G^{-1/2}\beta\big\|_2^2, \tag{9}$$

where G ∈ ℝ^{p×p} is a positive definite matrix. The equivalences still hold, albeit along different paths:

Corollary 5 (Equivalences for generalized ridge regression). Suppose Assumptions 1 and 2 hold. Let G ∈ ℝ^{p×p} be a deterministic and symmetric matrix with eigenvalues uniformly bounded between constants g_min > 0 and g_max < ∞. Let n, p → ∞ such that p/n → ϕ ∈ (0, ∞). For any ψ ∈ [ϕ, +∞], let λ̄ be as defined in (4) with H_p replaced by H̃_p, the empirical spectral distribution of G^{1/2} Σ G^{1/2}. Then, the conclusions in Theorems 1 and 3 continue to hold for the generalized ridge ensemble predictors.
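Because the penalty in (9) only rescales the coefficients through G, one convenient way to compute the estimator (a sketch under the stated assumptions; the helper name is ours) is to fit ordinary ridge on the transformed features X G^{1/2} and map the solution back; this reparameterization is also why the path in Corollary 5 is governed by the spectrum of G^{1/2} Σ G^{1/2}.

```python
import numpy as np
from scipy.linalg import sqrtm

def generalized_ridge_on_subsample(X, y, idx, lam, G):
    """Generalized ridge of Eq. (9) via the reparameterization eta = G^{-1/2} beta:
    fit ordinary ridge on the transformed features X G^{1/2} and map back."""
    G_half = np.real(sqrtm(G))            # principal square root of G
    Xt = X[idx] @ G_half                  # transformed subsample features
    k, p = Xt.shape
    eta = np.linalg.solve(Xt.T @ Xt / k + lam * np.eye(p), Xt.T @ y[idx] / k)
    return G_half @ eta                   # beta_hat = G^{1/2} eta_hat
```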
Another extension of our results is to incorporate subquadratic risk functionals in Theorem 1 to derive equivalences of the cumulative distribution functions of the predictive error distributions associated with the estimators along the equivalence paths. One can use such equivalences for downstream inference questions. Finally, while our statements are asymptotic in nature, we expect that an analysis similar to that in [50] can yield finite-sample statements. We leave these extensions for the future.

5 Implications of equivalences

The generalized risk equivalences establish a connection between the risks of different ridge ensembles, including the ridge predictors on the full data and the full-ensemble ridgeless predictors. These connections enable us to transfer properties from one estimator to the other. For example, by examining the impact of subsampling on the estimator's performance, we can better understand how to optimize the ridge penalty. We next provide an example of this.

Many common methods, such as ridgeless or lassoless predictors, have recently been shown to exhibit non-monotonic behavior in the sample size or the limiting aspect ratio. An open problem raised by Nakkiran et al. [1, Conjecture 1] asks whether the prediction risk of ridge regression with optimal ridge penalty λ is monotonically increasing in the data aspect ratio ϕ = p/n. Such non-monotonic risk behavior implies that more data can hurt performance, and it is important to investigate whether optimal ridge regression also suffers from this issue. Our equivalence results provide a surprising (even to us) indirect way to answer this question affirmatively under proportional asymptotics.

To analyze the monotonicity of the risk, we need to impose two additional regularity assumptions to guarantee the convergence of the prediction risk. Let Σ = Σ_{j=1}^p r_j w_j w_j^⊤ denote the eigenvalue decomposition of Σ, where the (r_j, w_j)'s are pairs of associated eigenvalues and normalized eigenvectors. The following assumptions ensure that the hardness of the underlying regression problems across different limiting data aspect ratios is comparable.

Assumption 3 (Spectral convergence). We assume there exists a deterministic distribution H such that the empirical spectral distribution of Σ, H_p(r) = p^{-1} Σ_{i=1}^p 1{r_i ≤ r}, weakly converges to H, almost surely (with respect to x).

Assumption 4 (Limiting signal and noise energies). We assume there exists a deterministic distribution G such that the empirical distribution of β₀'s (squared) projections onto Σ's eigenspace, G_p(r) = ‖β₀‖₂^{-2} Σ_{i=1}^p (β₀^⊤ w_i)² 1{r_i ≤ r}, weakly converges to G. As n, p → ∞ and p/n → ϕ ∈ (0, ∞), the limiting linearized energy ρ² = lim ‖β₀‖₂² and the limiting nonlinearized energy σ² = lim ‖f_NL‖²_{L₂} are finite.

Assumption 3 is commonly used in random matrix theory and overparameterized learning [10, 12, 13] and ensures that the limiting spectral distribution of the sample covariance matrix converges to a fixed distribution. Under these assumptions, we show in Appendix C that there exists a deterministic function ℛ(λ; ϕ, ψ) such that R(β̂^λ_{k,∞}; A, b, β₀) ≃ ℛ(λ; ϕ, ψ). Notably, the deterministic profile on the right-hand side is a monotonically increasing and continuous function of the first parameter ϕ when the limiting values of ‖β₀‖₂² and ‖f_NL‖²_{L₂} are fixed (i.e., when Assumption 4 holds). Moreover, the deterministic risk equivalent of the optimal ridge predictor matches that of the optimal full-ensemble ridgeless predictor; in other words, min_{λ≥0} ℛ(λ; ϕ, ϕ) = min_{ψ≥ϕ} ℛ(0; ϕ, ψ). The same monotonicity property of the two optimized functions leads to the following result.

Theorem 6 (Monotonicity of prediction risk with optimal ridge regularization). Suppose the conditions of Theorem 1 and Assumptions 3 and 4 hold. Let k, n, p → ∞ such that p/n → ϕ ∈ (0, ∞) and p/k → ψ ∈ [ϕ, ∞]. Then, for A = Σ^{1/2} and b = 0, the optimal risk of the ridgeless ensemble, min_{ψ≥ϕ} ℛ(0; ϕ, ψ), is monotonically increasing in ϕ. Consequently, the optimal risk of the ridge predictor, min_{λ≥0} ℛ(λ; ϕ, ϕ), is also monotonically increasing in ϕ.

Under Assumptions 3 and 4, the linearized signal-to-noise ratio (SNR) is maintained at the same level across the varying distributions as ϕ changes. The key message of Theorem 6 is then that, for a sequence of problems with the same SNR (indicating the same level of regression hardness), the asymptotic prediction risk of optimally tuned ridge gets monotonically worse as the aspect ratio of the problem increases. This is intuitive because a smaller ϕ corresponds to more samples than features. In this sense, Theorem 6 certifies that optimal ridge uses the available data effectively, avoiding sample-wise non-monotonicity. Moreover, as the null risk is finite, this also shows that the optimal ridge estimator mitigates the double or multiple descent behaviors in the generalization error [1, 5, 51].

[Figure 3 panels: "Linear functionals" (uniform, Gaussian, and Student's t projections) and "Generalized risk functionals" (estimation risk, prediction risk, prediction risk (OOD)); y-axis: relative value; paths through (λ̄, ϕ) for λ_ψ = 0.01, 0.05, 0.1, 1.]

Figure 3: Comparison of various linear functionals (Theorem 3) and generalized risk functionals (Theorem 1) on different paths evaluated on CIFAR-10. The aspect ratios ϕ = p/n and ψ = 4ϕ are estimated from the dataset and fixed. For different values of λ_ψ (different colors), we estimate λ̄ from Proposition 4, which gives a path between (λ_ψ, ψ) and (λ̄, ϕ). For each path, we uniformly sample 5 points and compute the functionals of the ridge ensemble using M = 100. The values on different paths are then normalized by subtracting the mean value on the first path and dividing by the mean difference of values on the last and first paths.
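As a crude finite-sample illustration of Theorem 6 (a simulation sketch only, not the asymptotic statement; the AR(0.5) design, signal normalization, and tuning grid are our own choices), one can hold the signal and noise energies fixed, tune ridge by cross-validation at several aspect ratios, and check that the tuned test risk trends upward in ϕ.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def tuned_ridge_risk(phi, n=2000, n_test=2000, seed=0):
    """Test risk of ridge with a cross-validation-tuned penalty at aspect ratio phi.
    Toy AR(0.5) design; the linear signal energy beta_0' Sigma beta_0 and the
    nonlinear/noise energies are held fixed as phi varies."""
    rng = np.random.default_rng(seed)
    p = max(int(phi * n), 1)
    L = np.linalg.cholesky(0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p))))
    beta_0 = rng.standard_normal(p)
    beta_0 *= np.sqrt(2.0) / np.linalg.norm(L.T @ beta_0)   # beta_0' Sigma beta_0 = 2

    def sample(m):
        X = rng.standard_normal((m, p)) @ L.T
        y = X @ beta_0 + np.tanh(X[:, 0]) + rng.standard_normal(m)
        return X, y

    X, y = sample(n)
    X_test, y_test = sample(n_test)
    model = RidgeCV(alphas=np.logspace(-4, 4, 25)).fit(X, y)
    return np.mean((model.predict(X_test) - y_test) ** 2)

# In line with Theorem 6, the tuned risk should trend upward in phi
# (up to finite-sample noise), e.g.:
# [tuned_ridge_risk(phi) for phi in (0.1, 0.5, 1.0, 2.0)]
```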
Attempting to prove Theorem 6 directly is challenging due to the lack of a closed-form expression for the optimal risk of ridge regression in general. Moreover, the risk profile of ridge regression, ℛ(λ; ϕ, ϕ), does not exhibit any particular structure as a function of λ. On the other hand, the risk profile of the full-ensemble ridgeless estimator, ℛ(0; ϕ, ψ), has a very nice structure in terms of ψ. This, coupled with our equivalence result, allows one to prove quite non-trivial behavior of optimal ridge. See Figure F6 for illustrations.

6 Discussion

Motivated by the recent subsampling and ridge equivalences for the prediction risk [11–13], this paper establishes generalized risk equivalences (Section 3) and structural equivalences (Section 4) within the family of ensemble ridge estimators. Our results precisely link the implicit regularization of subsampling to explicit ridge penalization via a path P defined in (5), which connects the two endpoints (0, ψ) and (λ̄, ϕ) regardless of the risk functional. Furthermore, we provide a data-dependent method (Proposition 4) to estimate this path. Our results do not assume any specific relationship between the response variable y and the feature vector x. We next explore some extensions of these equivalences.

Real data. While our theoretical results assume RMT features (Assumption 2), we anticipate that the equivalences will very likely hold under more general features [50]. To verify this, we examine the equivalences in Proposition 4 with M = 100 on real-world datasets. Figure 3 depicts both the linear functionals in Theorem 3 and the generalized quadratic functionals in Theorem 1 computed along four paths, shown in different colors. We observe that the variation within each path is small, while different paths have different functional values. This suggests that the theoretical findings in the previous sections also hold quite well on real-world datasets. We investigate this further by varying ψ on three image datasets: CIFAR-10, MNIST, and USPS. Figure F7 shows similar values and trends at the two points. These experiments demonstrate that the amount of explicit regularization induced by subsampling can indeed be accurately predicted based on the observed data using Proposition 4.

Random features. Closely related to two-layer neural networks [52], we can consider the random feature model f(x; β, F) = β^⊤ φ(F x), where F ∈ ℝ^{d×p} is a randomly initialized weight matrix and φ : ℝ → ℝ is a nonlinear activation function applied element-wise to F x. Since, by universality/invariance arguments, certain random feature models are asymptotically equivalent to a surrogate linear Gaussian model with a matching covariance matrix [53], we expect the theoretical results in Theorems 1 and 3 to likely hold as well, although the relationship (4) will now depend on the nonlinear activation function. Empirically, we indeed observe similar equivalence phenomena for the prediction risks of random features ridge regression using sigmoid, ReLU, and tanh activation functions, as shown in Figure 4.
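A minimal sketch for generating such random features and plugging them into the earlier routines is below; the N(0, 1/p) scaling of F and the tanh default are assumptions of this sketch, not specifications from the paper.

```python
import numpy as np

def random_features(X, d, rng, activation=np.tanh):
    """Random feature map phi(X F^T) with F having i.i.d. N(0, 1/p) entries."""
    p = X.shape[1]
    F = rng.standard_normal((d, p)) / np.sqrt(p)
    return activation(X @ F.T)

# Usage: build the features once, then reuse the routines from Section 2 on (Z, y),
# e.g. the ensemble fit and the data-dependent path estimate of Proposition 4:
# rng = np.random.default_rng(0)
# Z = random_features(X, d=2 * X.shape[1], rng=rng)
# beta_rf = ensemble_ridge(Z, y, k=Z.shape[0] // 2, lam=0.0, M=50, rng=rng)
```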
Analogous to Proposition 4, we conjecture a similar data-driven relationship between (λ̄_n, ϕ_n) and (0, ψ_n) for the random features φ(X F^⊤):

Conjecture 7 (Data-dependent equivalences for random features (informal)). Suppose Assumptions 1 and 2 hold. Define ϕ_n = p/n. Let k ≤ n be the subsample size and denote ψ_n = p/k. Suppose φ satisfies certain regularity conditions. For any M ∈ ℕ ∪ {∞}, let λ̄_n be the value that satisfies

$$\frac{1}{M}\sum_{\ell=1}^{M} \frac{1}{n}\operatorname{tr}\!\left[\left(\frac{1}{k}\, \varphi(L_{I_\ell} X F^\top)\, \varphi(L_{I_\ell} X F^\top)^\top\right)^{+}\right] = \frac{1}{n}\operatorname{tr}\!\left[\left(\frac{1}{n}\, \varphi(X F^\top)\, \varphi(X F^\top)^\top + \bar\lambda_n I_n\right)^{-1}\right].$$

Define the data-dependent path P_n = P(λ̄_n; ϕ_n, ψ_n). Then, the conclusions in Theorem 1, Proposition 2, and Theorem 3 continue to hold on (φ(X F^⊤), y) with P replaced by P_n.

Figure 4: Heat map of prediction risks of full-ensemble ridge estimators (approximated with M = 100) using random features, for varying ridge penalties λ and subsample aspect ratios ψ = d/k on the log-log scale. We consider random features φ(F x_i) ∈ ℝ^d, where φ is an activation function (sigmoid, ReLU, or tanh). The data model is described in Appendix F.6 with p = 250, d = 500, n = 5000, and ϕ = d/n = 0.1. As in Theorem 1, we see clear risk equivalence paths across activations.

Kernel features. We can also consider kernel ridge regression. For a given feature map Φ : ℝ^p → ℝ^d, the kernel ridge estimator (in the primal form) is defined as:

$$\hat\beta^\lambda_k(\mathcal{D}_I) = \operatorname*{argmin}_{\beta\in\mathbb{R}^d}\; \sum_{i\in I} \big(k^{-1/2} y_i - k^{-1/2}\Phi(x_i)^\top \beta\big)^2 + \lambda \|\beta\|_2^2 = \operatorname*{argmin}_{\beta\in\mathbb{R}^d}\; \sum_{i\in I} \big(y_i - \Phi(x_i)^\top \beta\big)^2 + k\lambda \|\beta\|_2^2.$$

Leveraging the kernel trick, the preceding optimization problem translates to solving the following problem (in the dual domain):

$$\hat\alpha^\lambda_k(\mathcal{D}_I) = \operatorname*{argmin}_{\alpha\in\mathbb{R}^k}\; \alpha^\top (K_I + k\lambda I_k)\, \alpha - 2\,\alpha^\top y_I,$$

where K_I = Φ_I Φ_I^⊤ ∈ ℝ^{k×k} is the kernel matrix and Φ_I = (Φ(x_i)^⊤)_{i∈I} ∈ ℝ^{k×d} is the feature matrix. The correspondence between the dual and primal solutions is simply given by β̂^λ_k(D_I) = Φ_I^⊤ α̂^λ_k(D_I). Figure F8 illustrates results for kernel ridge regression using the same data-generating process as in the previous subsection. The figure shows that the prediction risks of kernel ridge ensembles for polynomial, Gaussian, and Laplace kernels exhibit similar equivalence patterns, leading us to formulate an analogous conjecture for kernel ridge regression (which, when Φ(x) = x, recovers Proposition 4 with an appropriate rescaling of the features):

Conjecture 8 (Data-dependent equivalences for kernel features (informal)). Suppose Assumptions 1 and 2 hold. Define ϕ_n = p/n. Suppose the kernel K satisfies certain regularity conditions. Let k ≤ n be the subsample size and denote ψ_n = p/k. For any M ∈ ℕ ∪ {∞}, let λ̄_n be a solution to

$$\frac{1}{M}\sum_{\ell=1}^{M} \frac{1}{n}\operatorname{tr}\!\left[\left(\frac{1}{k} K_{I_\ell}\right)^{+}\right] = \frac{1}{n}\operatorname{tr}\!\left[\left(\frac{1}{n} K + \bar\lambda_n I_n\right)^{-1}\right].$$

Define the data-dependent path P_n = P(λ̄_n; ϕ_n, ψ_n). Then, the conclusions in Theorem 1, Proposition 2, and Theorem 3 continue to hold for kernel ridge ensembles with P replaced by P_n.

The empirical evidence here strongly suggests that the relationship between subsampling and ridge regularization, which we have proved for ridge regression, holds true at least when the gram matrix linearizes in the sense of asymptotic equivalence [14, 54, 55]. This intriguing observation opens up a whole host of avenues towards fully understanding the universality of this relationship and establishing precise connections for a broader range of models [17, 18, 20, 21]. We hope to share more on these compelling directions in the near future.

Tying other implicit regularizations. Finally, as noted in the related work, the equivalences between ridge regularization and subsampling established in this paper naturally offer opportunities to interpret and understand other implicit regularization effects, such as dropout regularization [27, 28] and early stopping in gradient descent variants [36, 38, 39].
Furthermore, these equivalences provide avenues to understand the combined effects of these forms of implicit regularization, such as the effects of both subsample aggregating and gradient descent in mini-batch gradient descent, and to contrast them with explicit regularization. Whether the explicit regularization can always be cleanly expressed as ridge-like regularization or takes a more generic form is an intriguing question. Exploring this is exciting, and we hope our readers also share this excitement!

Acknowledgments and Disclosure of Funding

We are indebted to Ryan Tibshirani, Alessandro Rinaldo, Arun Kuchibhotla, Yuting Wei, Matey Neykov, Edgar Dobriban, Daniel Le Jeune, Shamindra Shrotriya, and Yuchen Wu for many insightful conversations. We are also grateful to the anonymous reviewers of the precursor [13] for encouraging us to work on several follow-up directions that make up parts of this successor.

On a more personal note: The lack of a proof of risk monotonicity of optimally tuned ridge in general had been monotone bugging (in a good way) the first author, ever since plotting Figure 8 (and other similar plots) in [51]. The resolution in Theorem 6 is of great personal significance to him, and he is heartily appreciative of all the collaborators on related threads, especially the second author, for being a dedicated climbing partner on this technical ascent. At the risk of over-analogizing, the metaphor of letting a nut soak in water instead of forcibly cracking it open with a hammer, from Grothendieck², may perhaps describe the proof of Theorem 6. Establishing subsampling and ridge equivalences first and using them to demonstrate ridge monotonicity, in retrospect, seems a much less arduous route than a direct brute-force attempt (with calculus, say). But we will let the readers decide!

This work used the Bridges2 system at the Pittsburgh Supercomputing Center (PSC) through allocation MTH230020 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program. The code for reproducing the results of this paper can be found at https://jaydu1.github.io/overparameterized-ensembling/equiv.

References

[1] Preetum Nakkiran, Prayaag Venkat, Sham M. Kakade, and Tengyu Ma. Optimal regularization can mitigate double descent. In International Conference on Learning Representations, 2021.
[2] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[3] Peter Bühlmann and Bin Yu. Analyzing bagging. The Annals of Statistics, 30(4):927–961, 2002.
[4] Abraham J. Wyner, Matthew Olson, Justin Bleich, and David Mease. Explaining the success of AdaBoost and random forests as interpolating classifiers. The Journal of Machine Learning Research, 18(1):1558–1590, 2017.
[5] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
[6] Lucas Mentch and Siyu Zhou. Randomization as regularization: A degrees of freedom explanation for random forest success. The Journal of Machine Learning Research, 21(1):6918–6953, 2020.
[7] Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
[8] Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Applications to nonorthogonal problems. Technometrics, 12(1):69–82, 1970.
[9] Trevor Hastie. Ridge regularization: An essential concept in data science. Technometrics, 62(4):426–433, 2020.
[10] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in highdimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2):949 986, 2022. [11] Daniel Le Jeune, Hamid Javadi, and Richard Baraniuk. The implicit regularization of ordinary least squares ensembles. In International Conference on Artificial Intelligence and Statistics, 2020. 2http://math.stanford.edu/~vakil/216blog/FOAGmay2523public.pdf, page 9. [12] Pratik Patil, Jin-Hong Du, and Arun Kumar Kuchibhotla. Bagging in overparameterized learning: Risk characterization and risk monotonization. ar Xiv preprint ar Xiv:2210.11445, 2022. [13] Jin-Hong Du, Pratik Patil, and Arun Kumar Kuchibhotla. Subsample ridge ensembles: Equivalences and generalized cross-validation. In International Conference on Machine Learning, 2023. [14] Peter L. Bartlett, Andrea Montanari, and Alexander Rakhlin. Deep learning: A statistical viewpoint. Acta numerica, 30:87 201, 2021. [15] Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009. Second edition. [16] Leo Breiman. Random forests. Machine learning, 45(1):5 32, 2001. [17] L.K. Hansen and P. Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993 1001, 1990. [18] Anders Krogh and Jesper Vedelsby. Neural network ensembles, cross validation, and active learning. Advances in neural information processing systems, 1994. [19] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017. [20] Michael Perrone. Putting it all together: Methods for combining neural networks. In Advances in Neural Information Processing Systems, 1993. [21] Anders Krogh and Peter Sollich. Learning with ensembles: How overfitting can be useful. In Advances in Neural Information Processing Systems, 1995. [22] Anders Krogh and Peter Sollich. Statistical mechanics of ensemble learning. Physical Review E, 55(6):811, 1997. [23] Stéphane d Ascoli, Maria Refinetti, Giulio Biroli, and Florent Krzakala. Double trouble in double descent: Bias and variance (s) in the lazy regime. In International Conference on Machine Learning, pages 2280 2290. PMLR, 2020. [24] Ben Adlam and Jeffrey Pennington. Understanding double descent requires a fine-grained bias-variance decomposition. In Advances in Neural Information Processing Systems, 2020. [25] Bruno Loureiro, Cedric Gerbelot, Maria Refinetti, Gabriele Sicuro, and Florent Krzakala. Fluctuations, bias, variance & ensemble of learners: Exact asymptotics for convex losses in high-dimension. In International Conference on Machine Learning, 2022. [26] Tianyi Yao, Daniel Le Jeune, Hamid Javadi, Richard G. Baraniuk, and Genevera I. Allen. Minipatch learning as implicit ridge-like regularization. In International Conference on Big Data and Smart Computing, 2021. [27] Stefan Wager, Sida Wang, and Percy S. Liang. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems, 2013. [28] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929 1958, 2014. [29] Andrew R. Webb. Functional approximation by feed-forward networks: A least-squares approach to generalization. 
IEEE Transactions on Neural Networks, 5(3):363 371, 1994. [30] Chris M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural computation, 7(1):108 116, 1995. [31] Gian-Andrea Thanei, Christina Heinze, and Nicolai Meinshausen. Random projections for large-scale regression. Big and Complex Data Analysis: Methodologies and Applications, pages 51 68, 2017. [32] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, volume 28, 2015. [33] Daniel Le Jeune, Pratik Patil, Hamid Javadi, Richard G Baraniuk, and Ryan J. Tibshirani. Asymptotics of the sketched pseudoinverse. ar Xiv preprint ar Xiv:2211.03751, 2022. [34] Chi-Heng Lin, Chiraag Kaushik, Eva L. Dyer, and Vidya Muthukumar. The good, the bad and the ugly sides of data augmentation: An implicit spectral regularization perspective. ar Xiv preprint ar Xiv:2210.05021, 2022. [35] Marina Skurichina and Robert P. W. Duin. Regularization by adding redundant features. In Advances in Pattern Recognition: Joint IAPR International Workshops SSPR 98 and SPR 98 Sydney, Australia, August 11 13, 1998 Proceedings, pages 564 572. Springer, 1998. [36] Gergely Neu and Lorenzo Rosasco. Iterate averaging as regularization for stochastic gradient descent. In Conference On Learning Theory, pages 3222 3242. PMLR, 2018. [37] Arun Suggala, Adarsh Prasad, and Pradeep K. Ravikumar. Connecting optimization and regularization paths. In Advances in Neural Information Processing Systems, 2018. [38] Alnur Ali, Edgar Dobriban, and Ryan J. Tibshirani. The implicit regularization of stochastic gradient flow for least squares. In International conference on machine learning, 2020. [39] Alnur Ali, J. Zico Kolter, and Ryan J. Tibshirani. A continuous-time view of early stopping for least squares regression. In International Conference on Artificial Intelligence and Statistics, 2019. [40] A. Buja, R. Berk, L. Brown, E. George, E. Pitkin, M. Traskin, K. Zhang, and L. Zhao. A conspiracy of random predictors and model violations against classical inference in regression. ar Xiv preprint ar Xiv:1404.1578, 2014. [41] Andreas Buja, Lawrence Brown, Richard Berk, Edward George, Emil Pitkin, Mikhail Traskin, Kai Zhang, and Linda Zhao. Models as approximations I. Statistical Science, 34(4):523 544, 2019. [42] Andreas Buja, Lawrence Brown, Arun Kumar Kuchibhotla, Richard Berk, Edward George, and Linda Zhao. Models as approximations II. Statistical Science, 34(4):545 565, 2019. [43] Edgar Dobriban and Yue Sheng. Wonder: Weighted one-shot distributed ridge regression in high dimensions. Journal of Machine Learning Research, 21(66):1 52, 2020. [44] Edgar Dobriban and Yue Sheng. Distributed linear regression by averaging. The Annals of Statistics, 49(2):918 943, 2021. [45] Zhidong Bai and Jack W. Silverstein. Spectral Analysis of Large Dimensional Random Matrices. Springer, 2010. Second edition. [46] Noureddine El Karoui. The spectrum of kernel random matrices. The Annals of Statistics, 38 (1):1 50, 2010. [47] Noureddine El Karoui and Holger Kösters. Geometric sensitivity of random matrix results: Consequences for shrinkage estimators of covariance and related statistical methods. ar Xiv preprint ar Xiv:1105.1404, 2011. [48] Noureddine El Karoui. Asymptotic behavior of unregularized and ridge-regularized highdimensional robust regression estimators: rigorous results. ar Xiv preprint ar Xiv:1311.2445, 2013. [49] Edgar Dobriban and Stefan Wager. 
High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247 279, 2018. [50] Chen Cheng and Andrea Montanari. Dimension free ridge regression. ar Xiv preprint ar Xiv:2210.08571, 2022. [51] Pratik Patil, Arun Kumar Kuchibhotla, Yuting Wei, and Alessandro Rinaldo. Mitigating multiple descents: A model-agnostic framework for risk monotonization. ar Xiv preprint ar Xiv:2205.12937, 2022. [52] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667 766, 2022. [53] Hong Hu and Yue M. Lu. Universality laws for high-dimensional learning with random features. IEEE Transactions on Information Theory, 69(3):1932 1964, 2023. [54] Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel ridgeless regression can generalize. The Annals of Statistics, 2020. [55] Mojtaba Sahraee-Ardakan, Melikasadat Emami, Parthe Pandit, Sundeep Rangan, and Alyson K Fletcher. Kernel methods and multi-layer perceptrons learn linear models in high dimensions. ar Xiv preprint ar Xiv:2201.08082, 2022. [56] Antti Knowles and Jun Yin. Anisotropic local laws for random matrices. Probability Theory and Related Fields, 169:257 352, 2017. [57] László Erd os and Horng-Tzer Yau. A Dynamical Approach to Random Matrix Theory. American Mathematical Society, 2017. [58] Francisco Rubio and Xavier Mestre. Spectral convergence for a general class of random matrices. Statistics & probability letters, 81(5):592 602, 2011. [59] Torch Vision. Torchvision: Py Torch s computer vision library. https://github.com/ pytorch/vision, 2016. [60] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011. Supplementary material for Generalized equivalences between subsampling and ridge regularization This document serves as a supplement to the paper Generalized equivalences betweensubsampling and ridge regularization. The initial (unnumbered) section of the supplement provides a summary of the notation used in both the paper and supplement, followed by an organization for the supplement. Notation. Below we provide an overview of the notation used throughout. General notation: Notation Description Non-bold Denotes univariates (scalars, functions, distributions, etc.) (e.g., λ, f, H). Bold lower case Denotes vectors (e.g., x, β). Bold upper case Denotes matrices (e.g., X). Calligraphic font Denotes sets (e.g., D). Script font Denotes certain limiting functions (e.g., R in (27)). N Set of positive integers. R, R+ Set of real numbers and positive real numbers. C, C+ Set of complex numbers and complex numbers with positive imaginary part. [n] Set {1, . . . , n} for natural number n. (x)+ The positive part of real number x. x , x The floor and ceiling of real number x. f L2 The L2 norm of function f, f L2 = Ex[f 2(x)]. β 2 The ℓ2 norm of vector β. v, w The inner product of vectors v and w. X , X+ The transpose and Moore-Penrose inverse of matrix X Rn p. tr[A], A 1 The trace and inverse (if invertible) of a square matrix A Rp p. Σ1/2 The principal square root of positive semi-definite matrix Σ. Ip or I The p p identity matrix. X op The operator norm (spectral norm) of real matrix X. 
X tr The trace norm (nuclear norm) tr[(X X)1/2] of real matrix X. f(A) The matrix V f(R)V 1 where A = V RV 1 is eigenvalue decomposition, and f(R) is f applied (elementwise) to the diagonals of R. A B The Loewner ordering for symmetric matrices A and B. p , a.s. , d Almost sure convergence, convergence in probability, and weak convergence. Specific notation: Notation Description (x, y) The population random vector in Rp R. Dn A dataset containing i.i.d. samples of (x, y): Dn = {(xi, yi) : i [n]}. X, y The feature matrix in Rn p and the response vector in Rn DI A subsampled dataset DI = {(xi, yi) : i I}. bβλ k(DI) A ridge estimator fitted on DI with |I| = k observations and ridge penalty λ. bβλ k,M(Dn; {Iℓ}M ℓ=1) An M-ensemble ridge estimator fitted on {DIℓ}M ℓ=1 with subsample size k and ridge penalty λ. β0 The best (population) linear projection coefficient of y onto x: E[xx ] 1E[xy]. f LI(x) The best (population) linear projection of y onto x: x β0. f NL(x) The component of y that is not explained by x: y x β0. f LI, f NL f LI = Xβ0, f NL = [f NL(xi)]i [n]. A note on indexing of sequences: In the subsequent sections, we will prove the results for n, k, p being a sequence of integers {nm} m=1, {km} m=1, {pm} m=1. Alternatively, one can also view k and p as sequences kn and pn that are indexed by n. For notational brevity, we drop the subscript when it is clear from the context. Organization. Below we outline the structure of the rest of the supplement. The technical lemmas refer to the main ingredients that we use to prove results in the corresponding sections. Table 3: Outline of the supplement. Section Description Technical lemmas Appendix A Proof of Theorem 1, Proposition 2 (from Section 3) Lemma A.1, Lemma A.2, Lemma A.3 Appendix B Proof of Theorem 3, Proposition 4, Corollary 5 (from Section 4) Lemma B.1, Lemma B.2 Appendix C Proof of Theorem 6 (from Section 5) Lemma C.1 Appendix D New asymptotic equivalents, concentration results and other useful lemmas (used in Appendices A C) Lemma D.1, Lemma D.2, Lemma D.3, Lemma D.5 Appendix E Asymptotic equivalents: background and known results (used in Appendices A D) Appendix F Additional details for all the experiments A Proofs of results in Section 3 A.1 Proof of Theorem 1 Given an observation (x, y), recall the decomposition y = f LI(x)+f NL(x) explained in Section 2. For n i.i.d. samples from the same distribution as (x, y), we define analogously the vector decomposition: y = f LI + f NL, (10) where f LI = Xβ0 and f NL = [f NL(xi)]i [n]. Let n A = nrow(A). Note that R( bβλ k, ; A, b, β0) = R( bβλ k, ; A, 0, β0) + 2n 1 A b A( bβλ k, β0) + n 1 A b 2 2. By Theorem 3, the cross term vanishes, i.e., n 1 A b A( bβλ k, β0) a.s. 0. We then have |R( bβλ1 k1, ; A, b, β0) R( bβλ2 k2, ; A, b, β0)| a.s. |R( bβλ1 k1, ; A, 0, β0) R( bβλ2 k2, ; A, 0, β0)|. It suffices to analyze R( bβλ k, ; A, 0, β0). For simplicity, we treat A as the normalized matrix n 1/2 A A to avoid the notation of nrow(A). Note that y = Xβ0 + f NL + ε, where β0 is the best linear projection of y on X, f NL is the nonlinear residual and ε is the independent noise. Let Aλ k = EI Ik[(X LIX/k + λIp) 1X LI/k] and Bλ k = Ip Aλ k X = EI Ik[λ(X LIX/k + λIp) 1]. 
We begin by decomposing the generalized risk for arbitrary (λ, k) into different terms: R( bβλ k, ; A, 0, β0) = A( bβλ k, β0) 2 2 = A(Aλ ky β0) 2 2 = AAλ k(Xβ0 + f NL) Aβ0 2 2 = β 0 (Aλ k X Ip) A A(Aλ k X Ip)β0 + f NL (Aλ k) A AAλ kf NL + 2f NL (Aλ k) A A(Aλ k X Ip)β0 = β 0 Bλ k A ABλ kβ0 + f NL (Aλ k) A AAλ kf NL 2f NL (Aλ k) A ABλ kβ0. When λ > 0, from Lemma A.3, we know that the cross terms 2f NL (Aλ k) A ABλ kβ0 vanishes as p tends to infinity. Further, by Lemma A.1 and Lemma A.2, it follows that when A (X, y), as p/n ϕ and p/k ψ, |R( bβλ k, ; A, 0, β0) Rp(λ; ϕ, ψ)| a.s. 0, where the function Rp is defined as Rp(λ; ϕ, ψ) = ecp( λ; ϕ, ψ, A A) + f NL 2 L2evp( λ; ϕ, ψ, A A), (11) and the nonnegative constants ecp( λ; ϕ, ψ, A A) and evp( λ; ϕ, ψ, A A) are as defined in Lemma A.1. When λ = 0, as in the proof of Theorem 3, we again show that Pn,λ Qn,λ is equivcontinuous over Λ = [0, λmax] for any λmax (0, ) fixed, where we define Pn,λ = R( bβλ k, ; A, 0, β0), and Qn,λ = Rp(λ; ϕ, ψ). When ψ = 1, it can be verified that |Pn,λ|, | Pn,λ/ λ|, |Qn,λ|, | Qn,λ/ λ| are bounded almost surely. Thus, by the Moore-Osgood theorem, it follows that, with probability one, lim n |R( bβ0 k, ; A, 0, β0) Rp(0; ϕ, ψ)| = lim λ 0+ lim n |R( bβλ k, ; A, 0, β0) Rp(λ; ϕ, ψ)| = 0. Note that ecp( λ; ϕ, ψ, A A) and evp( λ; ϕ, ψ, A A) are functions of the fixed-point solution vp( λ; ψ). By Lemma C.1 we have that for ψ [ϕ, ] there exists a segment P such that for all (λ1, ψ1), (λ2, ψ2) P, it holds that v( λ1; ψ1) = v( λ2; ψ2) as p/k1 ψ1 and p/k2 ψ2. This implies that Rp(λ1; ϕ, ψ1) = Rp(λ2; ϕ, ψ2). Thus, by triangle inequality, we have |R( bβλ1 k1, ; A, 0, β0) R( bβλ2 k2, ; A, 0, β0)| a.s. 0, which completes the proof. A.2 Proof of Proposition 2 When A = X and b = f NL, the generalized risk reduces to R( bβλ k, ; A, b, β0) = 1 n X( bβλ k, β0) f NL 2 2 n X bβλ k, (Xβ0 + f NL) 2 2 n XAλ k(Xβ0 + f NL) (Xβ0 + f NL) 2 2 = β 0 Bλ k bΣBλ kβ0 + 1 nf NL (Ip XAλ k) (Ip XAλ k)f NL nf NL (Ip XAλ k) XBλ kβ0. The proof then follows analogously as in Theorem 1 by involving Lemma A.1 and Lemma A.2 with A = n 1/2X. When A = X and b = 0, we have R( bβλ k, ; A, b, β0) = β 0 Bλ k bΣBλ kβ0 + f NL (Aλ k) bΣAλ kf NL 2f NL (Aλ k) bΣBλ kβ0. Invoking Lemma A.2 for f NL (Aλ k) bΣAλ kf NL instead of f NL (Ip XAλ k) (Ip XAλ k)f NL, the proof follows analogously. A.3 Technical lemmas Lemma A.1 (Bias of generalized risk). Suppose the same assumptions in Theorem 1 hold with λ > 0. Let Bλ k = Ip Aλ k X = EI Ik[λ(X LIX/k + λIp) 1]. For any A with lim sup A op bounded almost surely, the following statements hold: (1) If A (X, y), then β 0 Bλ k A ABλ kβ0 ecp( λ; ϕ, ψ, A A) a.s. 0. (2) If A = n 1/2X, then β 0 Bλ k A ABλ kβ0 1 ϕ Z vp( λ; ψ)r 1 + vp( λ; ψ)r d Hp(r) 2 ecp( λ; ϕ, ψ, Σ) a.s. 0. Here the nonnegative constants vp( λ; ψ), evp( λ; ϕ, ψ, A A) and ecp( λ; ϕ, ψ, A A) are defined through the following equations: 1 vp( λ; ψ) = λ + ψ Z r 1 + vp( λ; ψ)r d Hp(r), evp( λ; ϕ, ψ, A A) = ϕ tr[A AΣ(vp( λ; ψ)Σ + Ip) 2]/p vp( λ; ψ) 2 ϕ Z r2 (1 + vp( λ; ψ)r)2 d Hp(r) , ecp( λ; ϕ, ψ, A A) = β 0 (vp( λ; ψ)Σ + Ip) 1(evp( λ; ϕ, ψ, A A)Σ + A A)(vp( λ; ψ)Σ + Ip) 1β0. Proof. Note that β0 is independent of Bλ k A ABλ k = λ2EI1,I2 Ik[M1A AM2], where bΣ1 2 = X LI1 I2X/|I1 I2| and Mj = (X LI1X/k+λIp) 1 for j = 1, 2. We analyze the deterministic equivalents of the latter for the two cases. (1) A (X, y). From Lemma D.1 (2), we know that when A is independent to (X, y), it follows that λ2EI1,I2 Ik[M1A AM2] (vp( λ; ψ)Σ + Ip) 1 (evp( λ; ϕ, ψ, A A)Σ + A A) (vp( λ; ψ)Σ + Ip) 1 . 
Then, by the trace property of deterministic equivalents in Lemma E.3 (4), we have β 0 Bλ k A ABλ kβ0 a.s. ===β 0 (vp( λ; ψ)Σ + Ip) 1 (evp( λ; ϕ, ψ, A A)Σ + A A) (vp( λ; ψ)Σ + Ip) 1 β0 a.s. ===ecp( λ; ϕ, ψ, A A). (2) A = n 1/2X. From Lemma D.1 (4), it follows that λ2EI1,I2 Ik[M1 bΣM2] 1 ϕ Z vp( λ; ψ)r 1 + vp( λ; ψ)r d Hp(r) (1 + evp( λ; ϕ, ψ))(vp( λ; ψ)Σ + Ip) 2Σ. Then, by the trace property of deterministic equivalents in Lemma E.3 (4), we have β 0 Bλ k bΣBλ kβ0 a.s. === 1 ϕ Z vp( λ; ψ)r 1 + vp( λ; ψ)r d Hp(r) (1 + evp( λ; ϕ, ψ)) β 0 (vp( λ; ψ)Σ + Ip) 1 Σ (vp( λ; ψ)Σ + Ip) 1 β0 a.s. === 1 ϕ Z vp( λ; ψ)r 1 + vp( λ; ψ)r d Hp(r) 2 ecp( λ; ϕ, ψ, Σ). Lemma A.2 (Variance term of generalized risk). Suppose the same assumptions in Theorem 1 hold with λ > 0. Let Aλ k = EI Ik[(X LIX/k + λIp) 1X LI/k] and Bλ k = Ip Aλ k X = EI Ik[λ(X LIX/k + λIp) 1]. For any A with lim sup A op bounded almost surely, the following statements hold: (1) If A (X, y), then f NL (Aλ k) A AAλ kf NL f NL 2 L2(1 + evp( λ; ϕ, ψ, Σ)) a.s. 0. (2) If A = n 1/2X, then f NL (Aλ k) A AAλ kf NL a.s. f NL 2 L2 1 ϕ Z vp( λ; ψ)r 1 + vp( λ; ψ)r d Hp(r) 2 (1 + evp( λ; ϕ, ψ, Σ)) + f NL 2 L2 2ϕ Z vp( λ; ψ)r 1 + vp( λ; ψ)r d Hp(r) 1 . 1 nf NL (XAλ k In) (XAλ k In)f NL a.s. f NL 2 L2 1 ϕ Z vp( λ; ψ)r 1 + vp( λ; ψ)r d Hp(r) 2 σ2(1 + evp( λ; ϕ, ψ, Σ)), Here the nonnegative constants vp( λ; ψ) and evp( λ; ϕ, ψ, A A) are defined through the following equations: 1 vp( λ; ψ) = λ + ψ Z r 1 + vp( λ; ψ)r d Hp(r), evp( λ; ϕ, ψ, A A) = ϕ tr[A AΣ(vp( λ; ψ)Σ + Ip) 2]/p vp( λ; ψ) 2 ϕ Z r2 (1 + vp( λ; ψ)r)2 d Hp(r) . Proof. The two cases are treated separately below. (1) A (X, y). We split the proof into three different parts. Part (1) Intersection concentration. Let f NL = LI1 I2f NL + LI1\I2f NL + LI2\I1f NL =: f0 + f1 + f2. Note that f NL Aλ1 k A AAλ1 k f NL = EI1,I2 Ik (f0 + f1) X k M1A AM2 X k (f0 + f2) . where Mj = (X LI2X/k + λIp) 1. By conditional independence and Lemma S.8.5 of [51], the cross terms vanish k M1A AM2 X L2 k f NL a.s. 0, and f NL L1X k M1A AM2 X k f2 a.s. 0. It remains to analyze the quadratic term of f0: f 0 LI1 I2X k M1A AM2 X LI1 I2 By conditioning on LI1 I2X, from Lemma S.7.10 (1) of [12], we have Mj M det := k i0 (bΣ0 + λIp + λC) 1, where bΣ0 = X 0 X0/i0, X0 = LI1 I2X, C = (k i0)/i0 (v( λ; γ1, ΣC1)Σ + Ip), C1 = i0(λ(k i0)) 1(bΣ0 + λIp) and i0 = |I1 I2|. Then we have k M1A AM2 X 0 k k2 k M det A AM det X 0 k X 0 k M A M X 0 k (13) where X 0 = X0(Ip + C) 1/2, A = (Ip + C) 1/2A A(Ip + C) 1/2, and M = ((Ip + C) 1/2 bΣ0(Ip + C) 1/2 + λIp) 1. Part (2) Diagonal concentration. We next use a similar strategy as in the proof of Lemma A.16 in [14] by showing that the off-diagonal summation vanishes. Let Σ = (Ip + C) 1/2Σ(Ip + C) 1/2. Since X0 = Z0Σ1/2, we have X 0 = Z0Σ 1/2. Then, the quadratic form becomes: 1 k2 f 0 X 0M A M X 0 f0 Z0Σ Z 0 i0 + λIn 1 Z0 i0 Σ 1 Z0Σ Z 0 i0 + λIn k2 f 0 B 1 1 B2B 1 1 f0. (14) Note that from Lemma D.4, we have for any t > 0, B 1 1 B2B 1 1 = 1 t (B 1 1 (B1 + t B2) 1) + t B 1 1 B2(B1 + t B2) 1B2B 1 1 . Let U Rn n with Uij = [f0]i[f0]j 1{i = j}. We then have 1 i =j n [B 1 1 B2B 1 1 ]ij[f0]i[f0]j = | B 1 1 B2B 1 1 , U | t | B 1 1 , U | 1 t | (B1 + t B2) 1, U | + t B 1 1 2 op B2 2 op (B1 + t B2) 1 op U tr. For the first two terms, Lemma D.3 implies that 1 k | B 1 1 , U | a.s. 0, and 1 k | (B1 + t B2) 1, U | a.s. 0. For the last term, note that B2 op A 2 op bΣ0 op, where A op is almost surely bounded as assumed, and bΣ0 op rmax(1 + p ψ2/ϕ)2 almost surely as k, n, p and p/n ϕ (0, ), p/k ψ [ϕ, ] (see, e.g., [45]). 
Also, U 2 f NL 2 2 a.s. 2 f NL 2 L2 < from the strong law of large numbers, Lemma D.5, and B1 op λ 1. Thus, the last term is almost surely bounded. It then follows that k 1| B 1 1 B2B 1 1 , U | a.s. 0. Therefore, 1 k f 0 B 1 1 B2B 1 1 f0 1 i=1 [B 1 1 B2B 1 1 ]ii[f0]2 i Part (3) Trace concentration. From the results in [56], it holds that [B 1 1 B2B 1 1 ]ii 1 n tr[B 1 1 B2B 1 1 ] a.s. 0. Further, n 1 Pn i=1 f 2 i a.s. f NL 2 L2 by strong law of large number and Lemma D.5. Therefore, we have 1 k |f B 1 1 B2B 1 1 f tr[B 1 1 B2B 1 1 ] f NL 2 L2| a.s. 0. (16) Combining (12), (13), and (16) yields that f (Aλ k) A AAλ kf tr[(Aλ k) A AAλ k] f NL 2 L2 a.s. 0. By the trace property tr[(Aλ k) A AAλ k] = tr[Aλ k(Aλ k) A A], it remains to derive the deterministic equivalents of Aλ k(Aλ k) A A. Note that Aλ k(Aλ k) A A = |I1 I2| k2 EI1,I2 Ik[M1 bΣ1 2M2]A A where bΣ1 2 = X LI1 I2X/|I1 I2| and Mj = (X LI1X/k + λIp) 1 for j = 1, 2. The above quantity is well-defined almost surely because |I1 I2| converges to some positive quantity almost surely. Next, we analyze the trace term for the two cases. From Lemma D.1 (3) we know that when A is independent to (X, y), it follows that EI1,I2 Ik[M1 bΣ1 2M2]A A ϕ 1evv( λ; ϕ, ψ)(vp( λ; ψ)Σ + Ip) 2ΣA A. We now have f NL (Aλ k) A AAλ kf NL a.s. === f NL 2 L2 p p tr[EI1,I2 Ik[M1 bΣ1 2M2]A A] a.s. === f NL 2 L2ψ ϕ ϕevp( λ; ϕ, ψ, A A) = f NL 2 L2evp( λ; ϕ, ψ, A A), where the second convergence is from Lemma S.8.3 of [12] and the trace property in Lemma E.3 (4). (2) A = n 1/2X. Instead of working on (13), we use the following decomposition: M1 bΣM2 = M1 bΣ1M2 + bΣ2M2 M1 bΣ1 2M2 + M1 bΣ(1 2)c M2 j=1 Mj 2λM1M2 M1 bΣ1 2M2 + M1 bΣ(1 2)c M2 Then, repeating Part (2) and (3) as above, it suffices to derive the deterministic equivalents of (Aλ k) bΣAλ k. We decompose this term into: (Aλ k) A AAλ k = 1 n(XAλ k In) (XAλ k In) + 1 n(XAλ k + X (Aλ k) ) 1 For the last two terms of the right hand side, following the similar argument as in Part (1), it holds that 2 n tr[XAλ k] 2ϕ Z vp( λ; ψ)r 1 + vp( λ; ψ)r d Hp(r) a.s. 0, and 1 n tr[In] = 1. (18) For the first term of the right hand side, notice that it is the variance term of the mean squared error computed on all samples, which is the variance term of the numerator of the generalized cross-validation (GCV) estimator in [13]. It has been shown in Proposition 3.6 of [13] that in the full ensemble, the GCV estimator is consistent with the prediction risk, which is also true for the variance term: 1 n tr[(XAλ k In) (XAλ k In)] (19) a.s. 1 ϕ Z vp( λ; ψ)r 1 + vp( λ; ψ)r d Hp(r) 2 σ2(1 + evp( λ; ϕ, ψ, Σ))., (20) Finally, combining (14)-(16) with the above finishes the proof for the first convergence result of A = n 1/2X. Analogously, from (20), we also have 1 nf NL (XAλ k In) (XAλ k In)f NL a.s. f NL 2 L2 1 ϕ Z vp( λ; ψ)r 1 + vp( λ; ψ)r d Hp(r) 2 σ2(1 + evp( λ; ϕ, ψ, Σ)), which finishes the proof for the second convergence result of A = n 1/2X. Lemma A.3 (Cross terms of generalized risk). Suppose the same assumptions in Theorem 1 hold with λ > 0. Let Aλ k = EI Ik[(X LIX/k + λIp) 1X LI/k] and Bλ k = Ip Aλ k X = EI Ik[λ(X LIX/k+λIp) 1]. For any A with lim sup A op bounded almost surely, it holds that f NL (Aλ k) A ABλ kβ0 a.s. 0. Proof. Note that n (Aλ k) A ABλ kβ0 2 2 n k Aλ k 2 op A 4 op Bλ k 2 op β0 2 2 k EI Ik[ bΣI 1 2op λ(X LIX/k + λIp) 1 op]2 Bλ k 2 op A 4 op β0 2 2 k EI Ik[ bΣI 1 2op]2 A 4 op β0 2 2. Let s2 j be the singular value of bΣj. 
From the results in [45], we have lim sup bΣI op lim sup max1 i p s2 i rmax(1 + ψ)2 almost surely as k, p and p/k ψ (0, ]. On the other hand, n/k a.s. ψ/ϕ and A 2 is uniformly bounded as assumed. Thus, we have lim sup n (Aλ k) A ABλ kβ0 2 2 is bounded almost surely. Similar to the proof of Theorem 1, we decompose f NL as f NL = LI1 I2f NL + LI1\I2f NL + LI2\I1f NL =: f0 + f1 + f2. We then have f NL LI1 I2X k M1A AM2 X LI1 I2X a.s. f 0 LI1 I2X k M1A AM2 X LI1 I2X k f 0 LI1 I2 k2 X 0 k M A M bΣ 0β0 k f 0 LI1 I2 k2 X 0 k M A β0 λi0 k f 0 LI1 I2 k2 X 0 k M A M β0, where the first convergence is from Lemma S.8.5 of [51] and the second convergence is from (13), with X 0 = X0(Ip + C) 1/2, A = (Ip + C) 1/2A A(Ip + C) 1/2, and M = (bΣ 0 + λIp) 1 and bΣ 0 = (Ip + C) 1/2 bΣ0(Ip + C) 1/2. From Lemma D.2, the first term in the above display vanishes. Analogous to (15), the second term also vanishes by splitting and applying Lemma D.2. Thus, we have for I1, I2 Ik, f NL LI1 I2X k M1A AM2 X LI1 I2X k β0 a.s. 0. Finally, by Lemma G.5 (2) of [13], the conclusion follows. B Proofs of results in Section 4 B.1 Proof of Theorem 3 We consider two cases. (1) λ1, λ2 > 0. We begin to prove for the full ensemble when M = . For j = 1, 2 and Ij Ikj, define bΣj = X LIj X/n and Mj = (X LIj X/k + λj Ip)+. Recall that y = Xβ0 + f NL and bβλj kj, = EIj Ikj kj X LIj X + λj Ip = EIj Ikj [Mj bΣj]β0 + EIj Ikj [Mj X LIj/kj]f NL. Then, for a Rp with bounded l2 norm, we have a ( bβλ1 k1, bβλ2 k2, ) = a (EI1 Ik1[M1 bΣ1] EI2 Ik2[M2 bΣ2])β0 | {z } T1 + a (EI1 Ik1[M1X LI1]/k1 EI2 Ik2[M2X LI]/k2)f NL | {z } T2 Next, we analyze the three terms separately. For the first term, from Lemma D.1 (1) we have for λj > 0, EIj Ikj [Mj bΣj] = Ip EIj Ikj [λj Mj] Ip (v( λj; ψj)Σ + Ip) 1, where v( λj; ψj) is as defined in (40). By Lemma C.1 we have that for ψ [ϕ, ] there exists a segment P such that for all (λ1, ψ1), (λ2, ψ2) P, it holds that v( λ1; ψ1) = v( λ1; ψ2) as p/k1 ψ1 and p/k2 ψ2. By the definition of deterministic equivalents (Definition E.1), it follows that T1 = tr[β0a EI1 Ik1[M1 bΣ1]] tr[β0a EI2 Ik2[M2 bΣ2]] a.s. lim p tr[β0a (Ip (v( λ1; ψ1)Σ + Ip) 1)] tr[β0a (Ip (v( λ2; ψ2)Σ + Ip) 1)] For the second term, notice that by Lemma D.2, 1 kj a Mj X LIkj f NL = a ZΣ 1 2 kj 1 f NL a.s. 0. From Lemma G.5 (2) of [13], it follows that a EIj Ikj [Mj X LIj]/kjf NL a.s. 0, j = 1, 2, and T2 a.s. 0. Combining the above results and applying triangle inequality on (21) yields that |a ( bβλ1 k1, bβλ2 k2, )| |T1| + |T2| a.s. 0, which completes the proof when M = . When M N, replacing the above expectation by the average over M simple random samples I1, I2 Ik completes the proof. (2) λ1λ2 = 0. When λ1 = λ2 = 0, it is trivially true as k1 = k2. Otherwise, without loss of generality, we assume λ1 = 0 and λ2 > 0. We use the same decomposition (21) and first analyze T1. From part one we have that for λ > 0, Pn,λ Qn,λ a.s. 0 where Pn,λ : = tr[β0a EI1 Ik1[(X LI1X/k1 + λIp)+ bΣ1]], Qn,λ : = tr[β0a (Ip (v( λ; ψ1)Σ + Ip) 1)]. We next show that lim n lim λ 0+ Pn,λ = lim λ 0+ lim n Pn,λ = 0, (22) by proving that the function Pn,λ is equicontinuous family in λ over Λ = [0, λmax] for any λmax (0, ) fixed. Note that for all λ Λ, |Pn,λ| EI1 Ik1[ (X LI1X/k1 + λIp)+ bΣ1 op] β0a tr β0 2 2 a 2 2 and its derivative λPn,λ = EI1 Ik1[tr[(X LI1X/k1 + λIp) 2 bΣ1β0a ]] EI1 Ik1[ (X LI1X/k1 + λIp) 2 bΣ1 op] β0a tr β0 2 2 a 2 2 are uniformly bounded in λ almost surely. 
In the above inequalities, the operator norm is bounded because (X LI1X/k1 + λIp) 1 bΣ1 op si/(si + λ) 1 and (X LI1X/k1 + λIp) 2 bΣ1 op si/(si + λ)2 1/(si + λ) 1, by noting that lim sup bΣj op lim sup max1 i p s2 i rmax(1 + ψ1)2 and lim inf bΣj op lim inf min1 i p s2 i rmin(1 ψ1)2 almost surely as k1, p and p/k1 ψ1 (0, ) \ {1} [45]. On the other hand, since Qn,λ is a continuous function of v( λ; ψ1), we have |Qn,λ| v( λ1; ψ1)Σ(v( λ1; ψ1)Σ + Ip) 1 op β0a tr β0 2 2 a 2 2 λQn,λ = tr[β0a (v( λ; ψ1)Σ + Ip) 2Σ] v( λ; ψ1) Z r (1 + v( λ; ψ1)r)2 d Hp(r). When ψ1 > 1, v( λ; ψ1)/ λ is bounded over Λ [13, Lemma E.10 and Lemma E.11] and thus | Qn,λ/ λ| is also bounded almost surely. When ψ1 < 1, from Lemma E.12 of [13] we have v( λ; ψ1) Z r (1 + v( λ; ψ1)r)2 d Hp(r) = v( λ; ψ1) Z r (1 + v( λ; ψ1)r)2 d Hp(r) Z r (1 + v( λ; ψ1)r)2 d Hp(r) 1 v( λ; ψ1)2 ψ1 (1 + v( λ; ψ1)r)2 d Hp(r) since v( λ; ψ1) v(0; ψ1) = + . Therefore, | Qn,λ/ λ| is uniformly bounded over Λ for ψ1 (0, ) \ {1}. Thus, by the Moore-Osgood theorem, the convergence is uniform in λ and (22) follows. Finally, since T2 is bounded analogously as in Part (1), the equivalence holds when λ1 = 0. B.2 Proof of Proposition 4 From Lemma B.2, it follows that v( λ; ϕn) 1 By the continuity of function ϕ 7 v( λ; ϕ) from Lemma F.10 and F.11 of [13], we have v( λ; ϕn) v( λ; ϕ) as ϕn = p/n ϕ. From Lemma C.1, there exists λn such that v(0; ψn) = v( λn; ϕn), as ψn ψ and ϕn ϕ. Involving (23) on the both sides yields that n XX + λn Ip where {I1, . . . , IM} is a simple random sample from Ik. B.3 Proof of Corollary 5 We will use a structural equivalence (much more direct than the first-order equivalence considered in the paper) between generalized ridge predictor and the isotropic ridge predictor to prove the result. The generalized ridge predictor (9), trained on the subsampled dataset DI, can be expressed as: bβλ,G k (DI) = 1 k X LIX + λG 1 X LIy Observe that we can equivalently manipulate the generalized ridge estimator into: bβλ,G k (DI) = G 1/2 1 k G 1/2X LIXG 1/2 + λIp 1 G 1/2 X LIy Recalling from Assumption 2 that X = ZΣ1/2, where Z Rn p contains z i in the i-th row for i [n], we obtain bβλ,G k (DI) = G 1/2 1 k G 1/2Σ1/2Z LIZΣ1/2G 1/2 + λIp 1 G 1/2Σ1/2Z LIy We now define ΣG = G 1/2ΣG 1/2 and consider a transformed feature matrix XG = ZΣ1/2 G . Denote the transformed dataset corresponding to the new feature matrix by DG (where we keep the response vector as is). Using (24), we can write relate the generalized ridge estimator fitted on the dataset D to the standard ridge estimator fitted on the data DG (both at the same scalar regularization level λ) by: bβλ,G k (DI) = G 1/2 bβλ k(DG I ). Observe that the transformed dataset DG satisfies Assumption 2 as the eigenvalues of G are bounded away from 0 and . Further, note that both Theorem 1 and Theorem 3 are invariant to linear transformations of the estimator. Thus, we conclude that the results of both the theorems continue to hold. The equivalence path now use the modified ΣG, which in turn changes the spectral distribution Hp to that of ΣG, in defining the end points via (4), given by: i=1 1{eri r}, where eri for i [p] are eigenvalues of ΣG. This completes the proof. B.4 Technical lemmas Lemma B.1 (Relationship between bm and bv). For X Rn p and ϕn = p/n, define bm(z; ϕn) = 1 , and bv(z; ϕn) = 1 It holds that ϕnz bm(z; ϕn) + ϕn 1 = zbv(z; ϕn). (25) Proof. Let r be the rank of the matrix X. Denote by si, i = 1, . . . , r, the non-zero eigenvalues of the matrix X X. 
Note that these are the same non-zero eigenvalues of the matrix XX . Define function S such that for z = 0, For z = 0, write out bv and bm in terms of S(z) as: bv(z; ϕn) = 1 bm(z; ϕn) = 1 We now expand the left-hand side of (25): ϕnz bm(z; ϕn) + ϕn 1 = p nz S(z) p r nz S(z) n r = zbv(z; ϕn) + n p = zbv(z; ϕn), which finishes the proof. Lemma B.2 (Equivalence of bv and v). Suppose Assumptions 1 2 hold. Then it holds that v( λ; ϕn) 1 Proof. From Lemma B.1, we have zbv(z; ϕn) + (1 ϕn) = ϕnz bm(z; ϕn). Substituting z = λ yields that + (ϕn 1) = ϕnλ1 n X X + λIp From Corollary E.4, we have 1 v( λ; ϕn) = λ + ϕ tr[Σ(v( λ; ϕn)Σ + Ip) 1]/p, 1 p tr[(v( λ; ϕn)Σ + Ip) 1] 1 n X X + λIp This implies 1 = λv( λ; ϕn) + ϕn tr[v( λ; ϕn)Σ(v( λ; ϕn)Σ + Ip) 1]/p = λv( λ; ϕn) + ϕn ϕn tr[(v( λ; ϕn)Σ + Ip) 1])/p λv( λ; ϕn) + ϕn ϕnλ tr n X X + λIp = 1 + λv( λ; ϕn) 1 where the last equality follows from (26). This concludes the proof. C Proof of results in Section 5 C.1 Proof of Theorem 6 By Assumptions 3 4, for A = Σ1/2 and b = 0, Rp(λ; ϕ, ψ), as defined in (11), converges to Rp(λ; ϕ, ψ) a.s. ρ2ec( λ; ϕ, ψ) + σ2ev( λ; ϕ, ψ) =: R(λ; ϕ, ψ) (27) where the nonnegative constants ec( λ; ϕ, ψ) and ev( λ; ϕ, ψ) are defined through the following equations: ev( λ; ϕ, ψ) = ϕ Z r2 (1 + v( λ; ψ)r)2 d H(r) v( λ; ψ) 2 ϕ Z r2 (1 + v( λ; ψ)r)2 d H(r) , ec( λ; ϕ, ψ) = (ev( λ; ϕ, ψ) + 1) Z r 1 + v( λ; ψ)r d G(r), 1 v( λ; ψ) = λ + ψ Z r 1 + v( λ; ψ)r d H(r). From the proof of Theorem 1, we also have that R(λ; ϕ, ψ) Rp(λ; ϕ, ψ) R( bβλ k, ; Σ, 0, β0). Since R(0; ϕ, ψ) is a continuous function of ϕ and v(0; ψ) and is increasing in ϕ for any fixed ψ, it follows that for 0 < ϕ1 ϕ2 < , min ψ ϕ1 R(0; ϕ1, ψ) min ψ ϕ2 R(0; ϕ1, ψ) min ψ ϕ2 R(0; ϕ2, ψ), where the first inequality follows because {ψ : ψ ϕ1} {ψ : ψ ϕ2}, and the second inequality follows because R(0; ϕ, ψ) is increasing in ϕ for a fixed ψ. Thus, minψ ϕ R(0; ϕ, ψ) is a continuous and monotonically increasing function in ϕ. Finally, note that from Lemma C.1, for any λ, there exists ψ such that v(0; ϕ, ψ) = v( λ; ϕ, ϕ). This implies that R(0; ϕ, ψ) = R(λ; ϕ, ϕ). Then we have minψ ϕ R(0; ϕ, ψ) minλ 0 R(λ; ϕ, ϕ). Conversely, since for any ψ, there exists λ such that v(0; ϕ, ψ) = v( λ; ϕ, ϕ), we also have minψ ϕ R(0; ϕ, ψ) minλ 0 R(λ; ϕ, ϕ). Combining the two inequalities yields the conclusion. C.2 Technical lemmas In this section, we gather results on certain analytic properties of the fixed-point solution v( λ; ϕ) defined in (40). Lemma C.1 (Contour of fixed-point solutions). As n, p such that p/n ϕ (0, ), for any ψ [ϕ, + ], there exists a unique value λ 0 (or conversely for λ [0, ], there exists a unique value ψ [ϕ 1, ]) such that for all (λ, ψ) on the path P = {(1 θ) (λ, ϕ) + θ (0, ψ) | θ [0, 1]}, it holds that vp( λ; ψ) = v( λ; ϕ) = v( 0; ψ). where vp( λ; ψ) is as defined in (40). Proof. Since when ϕ < 1, v(0; ψ) = + for all ψ [ϕ, 1], we can restrict ourselves to ψ [ϕ 1, + ]. From Lemma E.11 (1) of [13], the function ψ 7 v(0; ψ) is strictly decreasing over ψ [ϕ 1, ] with range v(0; ϕ 1) = v(0; ϕ), ϕ (1, ) limψ 1+ v(0; ψ) = + , ϕ (0, 1] , v(0; + ) := lim ψ + v(0; ψ) = 0. From Lemma E.12 (3) of [13], the function λ 7 v( λ; ϕ) is strictly decreasing over λ [0, ] with range v(0; ϕ) = v(0; ϕ), ϕ (1, ) limλ 0+ v( λ; ϕ) = + , ϕ (0, 1] , v( ; ϕ) := lim λ + v( λ; ϕ) = 0. Note that v(0; ϕ 1) = v(0; ϕ). For ψ [ϕ 1, ], by the intermediate value theorem, there exists unique λ [0, ] such that v( λ; ϕ) = v(0; ψ). Furthermore, when ψ 1, λ = 0 is also the unique value such that v(0; ψ) = v( λ; ϕ). 
Conversely, for λ [0, ], there also exists ψ [ϕ 1, ] such that v( λ; ϕ) = v(0; ψ). Based on the definition of fixed-point solutions, it follows that 1 v( λ; ϕ) = λ + ϕ Z r 1 + v( λ; ϕ)r d Hp(r) = ψ Z r 1 + v(0; ψ)r d Hp(r) = 1 v(0; ψ). Then, for any (λ, ψ) = (1 θ)(λ, ϕ) + θ(0, ψ) on the path P, we have 1 v( λ; ϕ) = (1 θ) 1 v( λ; ϕ) + θ 1 v(0; ψ) = (1 θ)λ + (1 θ)ϕ Z r 1 + v( λ; ϕ)r d Hp(r) + θψ Z r 1 + v(0; ψ)r d Hp(r) = λ + ψ Z r 1 + v( λ; ϕ)r d Hp(r). Because vp( λ; ψ) is the unique solution to the fixed-point equation: 1 vp( λ; ψ) = λ + ψ Z r 1 + vp( λ; ψ) d Hp(r), it then follows that vp( λ; ψ) = v( λ; ϕ) = v(0; ψ). D Asymptotic equivalents, concentration results, and other useful lemmas D.1 Full-ensemble resolvents Lemma D.1 (Full-ensemble resolvents). Let bΣ = X X/n, bΣj = X LIj X/k where Ij Ik. Let Σ1 2 = X LI1 I2X/|I1 I2| and C Rp p with bounded operator norm almost surely. As n, p, k , p/n ϕ, p/k ψ, the following asymptotic equivalences hold: (1) Basic ridge resolvent: λEI1 Ik[(bΣ1 + λIp) 1] (vp( λ; ψ)Σ + Ip) 1. (2) Bias resolvent with C X: EI1,I2 Ik[λ2(bΣ1 + λIp) 1C(bΣ2 + λIp) 1] (vp( λ; ψ)Σ + Ip) 1 (evp( λ; ϕ, ψ, C)Σ + C) (vp( λ; ψ)Σ + Ip) 1 . (3) Variance resolvent with C X: EI1,I2 Ik h (bΣ1 + λIp) 1 bΣ1 2(bΣ2 + λIp) 1C i ϕ 1evv( λ; ϕ, ψ)(vp( λ; ψ)Σ + Ip) 2ΣC. (4) Bias resolvent with C = bΣ: EI1,I2 Ik h λ2(bΣ1 + λIp) 1 bΣ(bΣ2 + λIp) 1i ed( λ; ϕ, ψ)(1 + evp( λ; ϕ, ψ))(vp( λ; ψ)Σ + Ip) 2Σ. Here v( λ; ϕ) is as defined in (41), and we let evp( λ; ϕ, ψ) = evp( λ; ϕ, ψ, Σ), evp( λ; ϕ, ψ, C) = lim k,n,p ϕ tr[CΣ(vp( λ; ψ)Σ + Ip) 2]/p vp( λ; ψ) 2 ϕ Z r2 (1 + vp( λ; ψ)r)2 d Hp(r) , evv( λ; ϕ, ψ) = 1 vp( λ; ψ) 2 ϕ Z r2 (1 + vp( λ; ψ)r)2 d Hp(r) , ed( λ; ϕ, ψ) = 1 ϕ Z vp( λ; ψ)r 1 + vp( λ; ψ)r d Hp(r) 2 . The empirical distribution Hn of eigenvalues can be replaced by the limiting distribution H whenever it exists. Proof. The proofs for different parts is separated below. (1) Basic ridge resolvent. From Definition E.2, we know that λ(bΣj + λIp) 1 (vp( λ; ψ)Σ + Ip) 1. By the definition of deterministic equivalents in Definition E.1, for any A Rp p that has bounded trace norm and is independent to X, we have tr[A(λ(bΣj + λIp) 1 (vp( λ; ψ)Σ + Ip) 1)] a.s. 0. From Lemma G.5 (2) of [13], it follows that tr[A(λEI1 Ik[(bΣj + λIp) 1] (vp( λ; ψ)Σ + Ip) 1)] = EI1 Ik[tr[A(λ(bΣj + λIp) 1 (vp( λ; ψ)Σ + Ip) 1)]] a.s. 0, (28) which implies that λEI1 Ik[(bΣ1 + λIp) 1] (vp( λ; ψ)Σ + Ip) 1. (2) Bias resolvent with C X. Denote Mj = (bΣj + λIp) 1 for j = 1, 2. From Part (c) of the proof for Lemma S.2.4 in [12], it follows that for I1, I2 Ik, λ2M1CM2 (vp( λ; ψ)Σ + Ip) 1 (evp( λ; ϕ, ψ, C)Σ + C) (vp( λ; ψ)Σ + Ip) 1 . By the same argument as in (28), the conclusion follows. (3) Variance resolvent with C X. From Lemma E.8 (3) of [13], it follows that for I1, I2 Ik, M1 bΣ1 2M2C ϕ 1evv( λ; ϕ, ψ)(vp( λ; ψ)Σ + Ip) 2ΣC. By the same argument as in (28), the conclusion follows. (4) Bias resolvent with C = bΣ. We begin by decomposing the object inside expectation into two terms: λ2M1 bΣM2 = |I1 I2| n λ2M1 bΣ1 2M2 + n |I1 I2| n λ2M1 bΣ(1 2)c M2 (29) where bΣ1 2 = X LI1 I2X/|I1 I2| and bΣ(1 2)c = X (I LI1 I2)X/(n |I1 I2|). Next, we analyze each of them. For the first term, from Lemma D.6 (3) of [13] we have M1 bΣ1 2M2 vp( λ; ψ)2(1 + evp( λ; ϕ, ψ)) 2(ψ ϕ) 2ψ ϕ 1 λvp( λ; ψ) + ϕ 2ψ ϕ (vp( λ; ψ)Σ + Ip) 2Σ (30) where evp( λ; ϕ, ψ) = evp( λ; ϕ, ψ, Σ). For the second term, from Lemma E.8 (1) of [13] and the product rule of calculus in Lemma E.3 (3), we have λ2M1 bΣ(1 2)c M2 λ2M1ΣM2. 
From Part (2) and Lemma E.3 (1), it follows that λ2M1 bΣ(1 2)c M2 λ2M1ΣM2 (1 + evp( λ; ϕ, ψ)) (vp( λ; ψ)Σ + Ip) 2 Σ. (31) Finally, for the coefficients of the two terms, from Lemma S.8.3 of [12], we have that n = |I1| + |I2| |I1 I2| a.s. ϕ(2ψ ϕ) ψ2 , n |I1 I2| a.s. (ψ ϕ)2 Combining (29)-(32) yields that ψ2 2(ψ ϕ)ϕλvp( λ; ψ) + ϕ2λ2vp( λ; ψ)2 + (ψ ϕ)2 (1 + evp( λ; ϕ, ψ))(vp( λ; ψ)Σ + Ip) 2Σ = (ϕλvp( λ; ψ) + ψ ϕ)2 ψ2 (1 + evp( λ; ϕ, ψ))(vp( λ; ψ)Σ + Ip) 2Σ = 1 ϕ Z vp( λ; ψ)r 1 + vp( λ; ψ)r d Hp(r) 2 (1 + evp( λ; ϕ, ψ))(vp( λ; ψ)Σ + Ip) 2Σ where the last equality follows by substituting λ = vp( λ; ψ) 1 ψ Z r(1 + vp( λ; ψ)r) 1 d Hp(r) based on the fixed-point equation. D.2 Convergence of random linear and quadratic forms Lemma D.2 (Concentration of linear form with independent components and varying coefficients). Let zi Rp for i = 1, . . . , n be a sequence of random vectors with i.i.d. entries zij, j = 1, . . . , p such that for each i, j, E[zij] = 0, E[z2 ij] = 1, E[|zij|4+α] Mα for some α > 0 and constant Mα < . Let Z = [z1, . . . , zn] Rn p be the random matrix formed by concatenating zi s. Let g : Rp R be any measurable function such that E[g(zi)] = 0 and E[zig(zi)] = 0 for i = 1, . . . , n. Let ap Rp be a sequence of random vectors independent of zp such that lim supp ap 2/p M0 almost surely for a constant M0 < . Let D be a positive semidefinite matrix such that lim sup D op M0 almost surely as p for some constant M0 < . Then, as n, p such that p/n ϕ (0, ), we have a D 1 2 Z ZD 1 2 n + λIp a.s. 0. (33) Proof. We will use the standard leave-one-out trick to break the dependence between the i-th component multiplier in the summation in (33) and g(zi) for each i = 1, . . . , n. To that end, using the Woodbury matrix identity, first observe that D 1 2 Z ZD 1 2 n + λIp D 1 2 Z i Z i D 1 2 n + D 1 2 ziz i D 1 2 n + λIp D 1 2 Z i Z i D 1 2 D 1 2 Z i Z i D 1 2 ! 1 D 1 2 ziz i D 1 2 n D 1 2 Z i Z i D 1 2 1 + z i D 1 2 D 1 2 Z i Z i D 1 2 n + λIp Plugging (34) back into (33), we expand the desired sum into: a D 1 2 Z ZD 1 2 n + λIp i=1 a D 1 2 Z ZD 1 2 n + λIp D 1 2 zig(zi) i=1 a D 1 2 Z i Z i D 1 2 D 1 2 zig(zi) a D 1 2 Z i Z i D 1 2 ! 1 D 1 2 ziz i D 1 2 n D 1 2 Z i Z i D 1 2 D 1 2 zig(zi) 1 + z i D 1 2 D 1 2 Z i Z i D 1 2 n + λIp i=1 b i zig(zi) 1 b i dizig(zi) 1 + di (35) i=1 (1 + di)b i zig(zi), (36) where in step (35), we denote by: D 1 2 Z i Z i D 1 2 D 1 2 Z i Z i D 1 2 and step (36) follows since di 0. It is easy to check that lim supp bi 2 2/p C1 < , and di a.s. C2 < for some constants C1 and C2. Appealing to Lemma S.8.5 of [51], (36) almost surely converges to 0. This finishes the proof. Lemma D.3 (Concentration of sum of quadratic forms with independent components and independent varying inner matrices). Let zp Rp be a sequence of random vector with i.i.d. entries zpi, i = 1, . . . , p such that for each i, E[zpi] = 0, E[z2 pi] = 1, E[|zpi|4+α] Mα for some α > 0 and constant Mα < . Let Z = [z1, . . . , zn] Rn p be the design matrix. Let g : Rp R be any measurable function such that E[zig(zi)] = 0 and E[g(zi)] = 1. Let D be a positive semidefinite matrix such that lim sup D op M0 almost surely as p for some constant M0 < . Then, as n, p such that p/n ϕ (0, ), ij g(zi)g(zj) a.s. 0. (37) Proof. The strategy in the proof is to express each of the ij-the entry of the resolvent in (37) such that the dependence on (zi, zj) and the rest of (zk : k = i, j) is separated. 
One is then able to use the uncorrelatedness of z and g(z) along with standard concentration of a quadratic form with respect to an independent matrix. Similar strategy has been used in [14] when analyzing kernel ridge regression under proportional asymptotics. Denote Y = (ZDZ /n+λIn) 1. For each pair of i, j [n] and i = j, we let Z (ij) R(n 2) p be the matrix comprising the n 2 rows of Z excluding the ith and jth rows, and U Rp 2 be the matrix with columns Uijei = zi, Uej = zj. We finally define the matrices R (ij) Rp p as follows: R (ij) = λD1/2 D1/2Z (ij)Z (ij)D1/2/n + λIp 1 D1/2, and let e Yij = {Ym,ℓ}m,ℓ {i,j} be the submatrix of Y . Then, using the block matrix inversion formula (see, e.g., Appendix A.1.4 of [45]) and the Woodbury matrix identity, one can show that e Yij = U ij R (ij)Uij/n + λI2 1 , zi, R (ij)zj /n dij , where dij is given by: dij = λ + zi, R (ij)zj /n λ + zi, R (ij)zj /n zi, R (ij)zj 2 /n2. Since zi, zj, and R (ij) are mutually independent, by Lemma S.8.5 of [51], we have zi, R (ij)zj /n a.s. 0 and the denominator concentrates on λ2. By Lemma G.5 (1) of [13], it follows that max1 i =j n dij a.s. λ2. Thus, there exists N0 N such that for all n N0, max1 i =j n dij λ2/2 almost surely. Then, it follows that for n N0, 1 i =j n Yijg(zi)g(zj) 1 i =j n | zi, R (ij)zj g(zi)g(zj)| 1 i =j n (zig(zi)) R (ij)(zjg(zj)) where the last convergence is from Lemma S.8.5 of [51] and Lemma G.5 (2) of [13]. Lemma D.4. For any two conforming matrices A and B and any t = 0, we have A 1BA 1 = (A 1 (A + t B) 1)/t + t A 1B(A + t B) 1BA 1. Proof. We recall the Woodbury matrix identity: (A + UCV ) 1 = A 1 A 1U(C 1 + V A 1U) 1V A 1. This holds for any conforming matrices A, U, C, and V . We will need to apply the Woodbury matrix identity twice below. 1. Applying the Woodbury identity for the first time with A = A, U = t, C = B, and V = 1, we get (A + t B) 1 = A 1 t A 1(B 1 + t A 1) 1A 1. Rearranging, this yields (A 1 (A + t B) 1)/t = A 1(B 1 + t A 1) 1A 1 = A 1(t A 1 + B 1) 1A 1. (38) 2. Applying the Woodbury identity the second time with A = t B, U = 1, C = A, and V = 1, we get (A + t B) 1 = t 1B 1 t 1B 1(A 1 + t 1B 1) 1t 1B 1. Multiplying by t B from the left yields t B(A + t B) 1 = I (A 1 + t 1B 1) 1t 1B 1 = I (t A 1 + B 1) 1B 1. Now, multiplying by B on the right, notice that t B(A + t B) 1B = B (t A 1 + B 1) 1. (39) From (38), observe that A 1BA 1 (A 1 (A + t B) 1)/t = A 1BA 1 A 1(t A 1 + B 1) 1A 1 = A 1(B (t A 1 + B 1) 1)A 1. Using (39), we then have A 1BA 1 (A 1 (A + t B) 1)/t = t A 1B(A + t B) 1BA 1. Rearranging, we arrive at the desired equality: A 1BA 1 = (A 1 (A + t B) 1)/t + t A 1B(A + t B) 1BA 1. D.3 Concentration of energy of linear and nonlinear components Lemma D.5. Under Assumptions 1 2, the best linear estimator β0 and the nonlinear component f NL as defined in (10) satisfies that β0 2 and f NL L4+δ are bounded almost surely. Proof. By Jensen s inequality and Assumption 1, we have E[y2] E[y4+δ]2/(4+δ) C2/(4+δ) 0 for some constant C0 > 0. By the orthogonality, we have that E[y2] = E[(x β0)2] + f NL 2 L2, which implies that E[(x β0)2] and f NL 2 L2 is bounded. Since by Assumption 2, the eigenvalues of Σ is lower bounded by rmin > 0, we have that E[(x β0)2] = β0Σβ0 rmin β0 2 2 and thus β0 2 is bounded. Since f NL = y x β0, by triangle inequality we have that f NL L4+δ y L4+δ + x β0 L4+δ y L4+δ + C β0 2) for some constant C > 0. The second inequality of the above display is from Lemma 7.8 of [57]. Since y L4+δ and β0 2 are bounded, it follows that f NL L4+δ is also bounded. 
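Since the identity in Lemma D.4 is purely algebraic, it can be sanity-checked numerically. The snippet below is a small sketch (not part of the original supplement) that verifies the identity on random positive definite matrices; the matrix size, the value of $t$, and the random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p, t = 50, 0.3

# Random symmetric positive definite A and B (so all inverses below exist).
A = rng.standard_normal((p, p)); A = A @ A.T + p * np.eye(p)
B = rng.standard_normal((p, p)); B = B @ B.T + np.eye(p)

Ainv = np.linalg.inv(A)
AtB_inv = np.linalg.inv(A + t * B)

# Lemma D.4: A^{-1} B A^{-1} = (A^{-1} - (A + tB)^{-1}) / t + t A^{-1} B (A + tB)^{-1} B A^{-1}.
lhs = Ainv @ B @ Ainv
rhs = (Ainv - AtB_inv) / t + t * Ainv @ B @ AtB_inv @ B @ Ainv

# The two sides agree up to floating-point error (close to machine precision).
print(np.max(np.abs(lhs - rhs)))
```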
E Asymptotic equivalents: background and known results Preliminary background. In several proofs, we employ the concept of asymptotic equivalence of sequences of random matrices; see [12, 44, 49, 51]. This section provides a brief overview of the associated terminology and calculus principles. Definition E.1 (Asymptotic equivalence). Consider sequences {Ap}p 1 and {Bp}p 1 of (random or deterministic) matrices of growing dimensions. We say that Ap and Bp are equivalent and write Ap Bp if limp | tr[Cp(Ap Bp)]| = 0 almost surely for every sequence of random matrices Cp independent to Ap and Bp, with bounded trace norm such that lim supp Cp tr < almost surely. The notion of asymptotic equivalence of two sequences of random matrices above can be further extended to incorporate conditioning on another sequence of random matrices; see [12] for more details. Definition E.2 (Conditional asymptotic equivalence). Consider sequences {Ap}p 1, {Bp}p 1 and {Dp}p 1 of (random or deterministic) matrices of growing dimensions. We say that Ap and Bp are equivalent given Dp and write Ap Bp | Dp if limp | tr[Cp(Ap Bp)]| = 0 almost surely conditional on {Dp}p 1, i.e., P lim p | tr[Cp(Ap Bp)]| = 0 {Dp}p 1 for any sequence of random matrices Cp independent to Ap and Bp conditional on Dp, with bounded trace norm such that lim sup Cp tr < as p . Below we summarize the calculus rules for conditional asymptotic equivalence in Definition E.2. These are adapted from Lemma S.7.4 and S.7.6 of [12]. Lemma E.3 (Calculus of asymptotic equivalents). Let Ap, Bp, Cp and Dp be sequences of random matrices. The calculus of asymptotic equivalents satisfies the following properties: (1) Equivalence: The relation is an equivalence relation. (2) Sum: If Ap Bp | Ep and Cp Dp | Ep, then Ap + Cp Bp + Dp | Ep. (3) Product: If Ap has bounded operator norms such that lim supp Ap op < , Ap is conditional independent to Bp and Cp given Ep for p 1, and Bp Cp | Ep, then Ap Bp Ap Cp | Ep. (4) Trace: If Ap Bp | Ep, then tr[Ap]/p tr[Bp]/p 0 almost surely when conditioning on Ep. (5) Differentiation: Suppose f(z, Ap) g(z, Bp) | Ep where the entries of f and g are analytic functions in z S and S is an open connected subset of C. Suppose for any sequence Cp of deterministic matrices with bounded trace norm we have | tr[Cp(f(z, Ap) g(z, Bp))]| M for every p and z S. Then, we have f (z, Ap) g (z, Bp) | Ep for every z S, where the derivatives are taken entrywise with respect to z. (6) Unconditioning: If Ap Bp | Ep, then Ap Bp. (7) Substitution: Let v : Rp p R and f(v(C), C) : Rp p Rp p be a matrix function for matrix C Rp p and p N, that is continuous in the first augment with respect to operator norm. If v(C) a.s. = v(D) such that C is independent to D, then f(v(C), C) f(v(D), C) | C. Standard ridge resolvents and various extensions. In this section, we collect various asymptotic equivalents. The following corollary is a simple consequence of Theorem 1 in [58]. It provides deterministic equivalent for the scaled ridge resolvent. Corollary E.4 (Deterministic equivalent for scaled ridge resolvent). Suppose xi Rp, 1 i n, are i.i.d. random vectors such that each xi = ziΣ1/2, where zi is a random vector consisting of i.i.d. entries zij, 1 j p, satisfying E[zij] = 0, E[z2 ij] = 1, and E[|zij|8+α] Mα for some constants α > 0 and Mα < , and Σ Rp p is a positive semidefinite matrix satisfying 0 Σ rmax Ip for some constant rmax < (independent of p). Let X Rn p the concatenated matrix with x i , 1 i n, as rows, and let bΣ Rp p denote the random matrix X X/n. 
Let γ = p/n. Then, for z C+, as n, p such that 0 < lim inf γ lim sup γ < , and λ > 0, we have λ(bΣ + λIp) 1 (v( λ; γ)Σ + Ip) 1, where v( λ; γ) > 0 is the unique solution to the fixed-point equation 1 v( λ; γ) = λ + γ Z r 1 + v( λ; γ)r d Hp(r). (40) Here Hn is the empirical distribution (supported on R 0) of the eigenvalues of Σ. Side remark: The parameter v( λ; γ) in Corollary E.4 is also the companion Stieltjes transform of the spectral distribution of the sample covariance matrix bΣ. It is also the Stieltjes transform of the spectral distribution of the gram matrix XX /n. This is essentially the characterization we use in proving Proposition 4. The following lemma uses Corollary E.4 along with calculus of deterministic equivalents (from Lemma E.3), and provides deterministic equivalents for resolvents needed to obtain asymptotic bias and variance of standard ridge regression. It is adapted from Lemma S.6.10 of [51]. Lemma E.5 (Deterministic equivalents for ridge resolvents associated with generalized bias and variance). Suppose xi Rp, 1 i n, are i.i.d. random vectors with each xi = ziΣ1/2, where zi Rp is a random vector that contains i.i.d. random variables zij, 1 j p, each with E[zij] = 0, E[z2 ij] = 1, and E[|zij|4+α] Mα for some constants α > 0 and Mα < , and Σ Rp p is a positive semidefinite matrix with rmin Ip Σ rmax Ip for some constants rmin > 0 and rmax < (independent of p). Let X Rn p be the concatenated random matrix with xi, 1 i n, as its rows, and define bΣ = X X/n Rp p. Let γ = p/n. Then, for λ > 0, as n, p with 0 < lim inf γ lim sup γ < , the following statements hold: (1) Bias of ridge regression: λ2(bΣ + λIp) 1A(bΣ + λIp) 1 (v( λ; γ, Σ)Σ + Ip) 1(evb( λ; γ, Σ, A)Σ + A)(v( λ; γ, Σ)Σ + Ip) 1. (2) Variance of ridge regression: (bΣ + λIp) 2 bΣA evv( λ; γ, Σ)(v( λ; γ, Σ)Σ + Ip) 2ΣA. Here v( λ; γ, Σ) > 0 is the unique solution to the fixed-point equation 1 v( λ; γ, Σ) = λ + Z γr 1 + v( λ; γ, Σ)r d Hn(r; Σ), (41) and evb( λ; γ, Σ) and evv( λ; γ, Σ) are defined through v( λ; γ, Σ) by the following equations: evb( λ; γ, Σ, A) = γ tr[AΣ(v( λ; γ, Σ)Σ + Ip) 2]/p v( λ; γ, Σ) 2 Z γr2(1 + v( λ; γ, Σ)r) 2 d Hn(r; Σ) , (42) evv( λ; γ, Σ) 1 = v( λ; γ, Σ) 2 Z γr2(1 + v( λ; γ, Σ)r) 2 d Hn(r; Σ), (43) where Hn( ; Σ) is the empirical distribution (supported on [rmin, rmax]) of the eigenvalues of Σ. Though Lemma E.5 states the dependency explicitly, we will simply write Hp(r), v( λ; γ), evb( λ; γ, A), and evv( λ; γ) to denote Hn(r; Σ), v( λ; γ, Σ), evb( λ; γ, Σ, A), and evv( λ; γ, Σ), respectively, for simplifying various notations when it is clear from the context. When A = Σ, we simply write evb( λ; γ) = evb( λ; γ, A). F Experiment details F.1 Reproducibility and compute details The source code for generating all of our figures is included with the supplementary material. The source code also includes details about the computational resources used to run the code and other timing details. F.2 Simulation details The covariance matrix of an auto-regressive process of order 1 (AR(1)) is given by Σar1, where (Σar1)ij = ρ|i j| ar1 for some parameter ρar1 (0, 1). Define β0 := 1 5 P5 j=1 w(j) where w(j) is the eigenvector of Σar1 associated with the top jth eigenvalue r(j). We generated data (xi, yi) for i = 1, . . . , n from a nonlinear model: yi = x i β0 + 1 p( xi 2 2 tr[Σar1]) + εi, xi = Σ 1 2 ar1zi, zij iid t5 σ5 , (M-AR1) where σ5 = p 5/3 is the standard deviation of t5 distribution. 
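For concreteness, the following is a minimal sketch of how data from the model (M-AR1) above could be generated; it is not the authors' simulation code. We assume standard normal noise $\varepsilon_i$ and a $1/\sqrt{5}$ normalization for $\beta_0$ (neither is fully specified in the extracted text), with the AR(1) parameter set to the value used in the simulations.

```python
import numpy as np

def generate_m_ar1(n, p, rho=0.5, seed=0):
    """Generate (X, y) from the nonlinear AR(1) model (M-AR1) described above.

    Assumptions in this sketch: epsilon_i ~ N(0, 1) (noise law not restated here) and
    beta_0 = sum of the top-5 eigenvectors of Sigma_ar1 scaled by 1/sqrt(5) (assumed).
    """
    rng = np.random.default_rng(seed)

    # AR(1) covariance: (Sigma)_{ij} = rho^{|i - j|}.
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])

    # beta_0 built from the top-5 eigenvectors of Sigma (eigh returns ascending order).
    evals, evecs = np.linalg.eigh(Sigma)
    top5 = evecs[:, np.argsort(evals)[::-1][:5]]
    beta0 = top5.sum(axis=1) / np.sqrt(5)  # assumed normalization

    # Features: x_i = Sigma^{1/2} z_i with z_ij ~ t_5 / sigma_5 (unit variance entries).
    sigma5 = np.sqrt(5.0 / 3.0)            # standard deviation of the t_5 distribution
    Z = rng.standard_t(df=5, size=(n, p)) / sigma5
    sqrt_Sigma = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
    X = Z @ sqrt_Sigma

    # Response: linear part + purely nonlinear part + noise.
    f_nl = (np.sum(X ** 2, axis=1) - np.trace(Sigma)) / np.sqrt(p)
    y = X @ beta0 + f_nl + rng.standard_normal(n)
    return X, y, beta0

X, y, beta0 = generate_m_ar1(n=1000, p=200)
```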
The benefit of using the above nonlinear model is that we can clearly separate the linear and the nonlinear components and compute the quantities of interest, because $\beta_0$ happens to be the best linear projection. For the simulations, we set $\rho_{\mathrm{ar1}} = 0.5$. For finite ensembles, the risks are averaged across 50 simulations.

F.3 Risk comparisons along equivalence paths for finite ensembles

Figure F5 verifies the following relationship (decaying linearly in $1/M$) along the path in Theorem 3 for finite ensembles:
$$ \bigl| R(\widehat{\beta}^{\lambda_1}_{\lfloor p/\psi_1 \rfloor, M}; A, b, \beta_0) - R(\widehat{\beta}^{\lambda_2}_{\lfloor p/\psi_2 \rfloor, M}; A, b, \beta_0) \bigr| \le \Delta / M, $$
for some $\Delta$ (independent of $M$) that is eventually almost surely bounded.

Figure F5: The range (the difference between the maximum and the minimum along the path) of the finite-sample generalized risk of $M$-ensemble ridge estimators on the path through $(\lambda, \psi) = (0, 2)$, for varying ensemble size $M$, under the same setting as in Figure 1. The generalized risks include the estimation risk, the training error, the prediction risk, and the out-of-distribution (OOD) prediction risk.

F.4 Illustration of risk monotonicity of optimal ridge

Figure F6: Illustration of risk monotonicity of optimal subsampled ridgeless regression versus optimal regularized ridge regression. The underlying model has a non-isotropic covariance as described in Appendix F.2, with a linearized signal-to-noise ratio of 0.6 and $\|f_{\mathrm{NL}}\|_{L_2} = 1$. This results in a null risk of 1.6, which is kept the same across all regression problems. The left panel shows the limiting risk of the full-ensemble ridgeless regression at various data and subsample aspect ratios $(\phi, \psi)$. The right panel shows the limiting risk of the ridge predictor (on the full data) at various data aspect ratios and regularization penalties $(\phi, \lambda)$. Optimal risks for each data aspect ratio are highlighted using slender red lines in both panels. Observe that the optimal risk in both cases is increasing as a function of $\phi$. In the left panel, in addition, for every subsample aspect ratio $\psi$, the risk of subsampled ridgeless regression is increasing in the data aspect ratio $\phi$. This in turn implies the risk monotonicity claim for optimal ridge regression in Theorem 6 through a slick triangle inequality argument (see the proof in Appendix C for more details).

F.5 Real-world datasets

We conduct experiments on real-world datasets to examine the equivalences in a more general setting. We utilized three image datasets for our experimental analysis: CIFAR-10, MNIST, and USPS [59]. For the CIFAR-10 dataset, we use the subset of images labeled "dog" and "cat". For the other datasets, we use the subset of images labeled "3" and "8". We then treat these as binary labels $y \in \{0, 1\}$ and use the flattened image as the feature vector $x$. The training sample sizes, feature dimensions, and test sample sizes $(n, p, n_{\mathrm{te}})$ are $(10000, 3072, 2000)$, $(12873, 784, 2145)$, and $(22358, 3072, 7981)$ for the three datasets, respectively. For the first experiment in Figure 3, we fix $\phi = p/n$ and $\psi = 4\phi$ using the training set of CIFAR-10. For $\lambda_\psi \in \{0.01, 0.05, 0.1, 1\}$, we compute the data-dependent value of $\lambda$ based on Proposition 4. Each value of $\lambda_\psi$ gives a path between $(\lambda_\psi, \psi)$ and $(\lambda, \phi)$.
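To illustrate how such a data-dependent $\lambda$ might be computed, the sketch below matches the empirical Stieltjes transform of the gram matrix (the characterization noted in the side remark after Corollary E.4) between the subsampled endpoint $(\lambda_\psi, \psi)$ and the full-data endpoint $(\lambda, \phi)$. This is a hypothetical rendering of the matching step behind Proposition 4, not the authors' implementation; the helper names, the number of subsamples `M_sub`, the root-finding bracket, and the use of `scipy.optimize.brentq` are our own choices.

```python
import numpy as np
from scipy.optimize import brentq

def vhat(X, lam):
    """Empirical gram-matrix Stieltjes transform (1/n) tr[(X X^T / n + lam I_n)^{-1}]."""
    n = X.shape[0]
    evals = np.linalg.eigvalsh(X @ X.T / n)
    return np.mean(1.0 / (evals + lam))

def data_dependent_lambda(X, lam_psi, k, M_sub=10, lam_max=1e3, seed=0):
    """Hypothetical sketch: find lam on the full data whose vhat matches the
    average subsampled vhat (subsample size k, penalty lam_psi)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    target = np.mean([
        vhat(X[rng.choice(n, size=k, replace=False)], lam_psi) for _ in range(M_sub)
    ])
    # vhat is decreasing in lam, so a sign change over the bracket is expected
    # whenever the target lies between the two endpoint values.
    return brentq(lambda lam: vhat(X, lam) - target, 1e-8, lam_max)
```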
For the second experiment in Figure F7, by varying the value of the subsample aspect ratio $\psi$, we compare a random Gaussian linear projection of the estimators and the prediction risk at the two endpoints $(\lambda, \phi)$ and $(0, \psi)$ on the CIFAR-10, MNIST, and USPS datasets.

Figure F7: The functional equivalences of subsampling and ridge regularization on different datasets. For each dataset with an aspect ratio $\phi = p/n$ and each value of $\psi$, the corresponding ridge penalty $\lambda$ is estimated by Proposition 4. The two models with ridge penalty and subsample aspect ratio $(\lambda, \phi)$ and $(0, \psi)$ are compared. Note that the model corresponding to $(\lambda, \phi)$ is ridge regression at ridge penalty $\lambda$ without subsampling, and the model corresponding to $(0, \psi)$ is the full-ensemble ridgeless estimator at subsample aspect ratio $\psi$.

F.6 Random features model

The data $(x_i, y_i)$ are generated from the nonlinear model
$$ y_i = x_i^\top \beta_0 + \tfrac{1}{\sqrt{p}}\bigl(\|x_i\|_2^2 - p\bigr) + \varepsilon_i, $$
where $x_{ij} \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, 1)$ and $\varepsilon_i \sim \mathcal{N}(0, 1)$. The prediction risk of the ridge ensemble ($M = 100$) is computed based on the random features $\varphi(F x_i)$ and the response $y_i$, where $F \in \mathbb{R}^{d \times p}$ is a random weight matrix with $F_{ij} \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, p^{-1})$. Here, $\varphi$ is a nonlinear activation function (sigmoid, ReLU, or tanh) applied entrywise to $F x_i$. For the experiment, we set $p = 250$, $d = 500$, and $\phi = d/n = 0.1$.

F.7 Kernel ridge regression

For a given feature map $\Phi : \mathbb{R}^p \to \mathbb{R}^d$, the kernel ridge estimator is defined as:
$$ \widehat{\beta}^{\lambda}_{k}(\mathcal{D}_I) = \operatorname*{argmin}_{\beta \in \mathbb{R}^d} \sum_{i \in I} \bigl(k^{-1/2} y_i - k^{-1/2} \Phi(x_i)^\top \beta\bigr)^2 + \lambda \|\beta\|_2^2 = \operatorname*{argmin}_{\beta \in \mathbb{R}^d} \sum_{i \in I} \bigl(y_i - \Phi(x_i)^\top \beta\bigr)^2 + k \lambda \|\beta\|_2^2. $$
By using the kernel trick, the above optimization problem is equivalent to solving:
$$ \widehat{\alpha}^{\lambda}_{k}(\mathcal{D}_I) = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \; \alpha^\top (K_I + k \lambda I_k) \alpha - 2 \alpha^\top y_I, $$
where $K_I = \Phi_I \Phi_I^\top \in \mathbb{R}^{k \times k}$ is the kernel matrix and $\Phi_I = (\Phi(x_i)^\top)_{i \in I} \in \mathbb{R}^{k \times d}$ is the feature matrix. A simple calculation shows that $\widehat{\beta}^{\lambda}_{k}(\mathcal{D}_I) = \Phi_I^\top \widehat{\alpha}^{\lambda}_{k}(\mathcal{D}_I)$. Using the same data-generating process as in the previous subsection, the results on kernel ridge regression are illustrated in Figure F8.

Figure F8: Heat map of the empirical prediction risk of full-ensemble kernel ridge estimators, for varying ridge penalties $\lambda$ and subsample aspect ratios $\psi = p/d$ on the log-log scale. The prediction risk of the ridge ensemble ($M = 100$) is computed using polynomial, Gaussian, and Laplacian kernels, using the default parameters as in the Python package scikit-learn v1.2.2 [60] without intercept terms. We scale $\lambda$ by $k/p$ so that we can obtain non-null estimators for the polynomial inner-product kernel (of degree 3).
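As a usage illustration of the construction above, here is a minimal sketch (not the authors' code) of a subsampled kernel ridge ensemble with a Gaussian (RBF) kernel: each ensemble member solves $(K_I + k\lambda I_k)\alpha = y_I$ on a random subsample of size $k$, and the predictions are averaged. The function name, kernel bandwidth, ensemble size, and synthetic data are placeholders.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_ridge_ensemble_predict(X, y, X_test, k, lam, M=10, gamma=None, seed=0):
    """Average the predictions of M kernel ridge estimators, each fitted on a random
    subsample of size k with the penalty scaled as k * lam (as in the display above)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    preds = np.zeros(X_test.shape[0])
    for _ in range(M):
        I = rng.choice(n, size=k, replace=False)
        K_I = rbf_kernel(X[I], X[I], gamma=gamma)                   # k x k kernel matrix
        alpha = np.linalg.solve(K_I + k * lam * np.eye(k), y[I])    # (K_I + k*lam*I)^{-1} y_I
        preds += rbf_kernel(X_test, X[I], gamma=gamma) @ alpha      # K(x, X_I) alpha
    return preds / M

# Example usage on synthetic placeholder data.
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 20)); y = rng.standard_normal(500)
yhat = kernel_ridge_ensemble_predict(X, y, X[:10], k=100, lam=0.1)
```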