# Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation

Jin-Hong Du*¹, Pratik Patil*², Arun Kumar Kuchibhotla¹

*Equal contribution. ¹Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA. ²Department of Statistics, University of California, Berkeley, CA 94720, USA. Correspondence to: Jin-Hong Du, Pratik Patil.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract. We study subsampling-based ridge ensembles in the proportional asymptotics regime, where the feature size grows proportionally with the sample size such that their ratio converges to a constant. By analyzing the squared prediction risk of ridge ensembles as a function of the explicit penalty λ and the limiting subsample aspect ratio $\phi_s$ (the ratio of the feature size to the subsample size), we characterize contours in the $(\lambda, \phi_s)$-plane at any achievable risk. As a consequence, we prove that the risk of the optimal full ridgeless ensemble (fitted on all possible subsamples) matches that of the optimal ridge predictor. In addition, we prove strong uniform consistency of generalized cross-validation (GCV) over the subsample sizes for estimating the prediction risk of ridge ensembles. This allows for GCV-based tuning of full ridgeless ensembles without sample splitting and yields a predictor whose risk matches the optimal ridge risk.

1. Introduction

Ensemble methods (Breiman, 1996) are widely used in various real-world applications in statistics and machine learning. They combine a collection of weak predictors to produce more stable and accurate predictions. One notable example of an ensemble method is bagging (bootstrap aggregating) (Breiman, 1996; Bühlmann & Yu, 2002). Bagging averages base predictors that are fitted on different subsampled datasets and has been shown to stabilize the prediction and reduce the predictive variance (Bühlmann & Yu, 2002). In this paper, we study such a class of ensemble methods that fit each base predictor independently using a different subsampled dataset of the full training data. As a prototypical base predictor, we focus on ridge regression (Hoerl & Kennard, 1970a;b), one of the most popular statistical methods. We refer readers to the ridgefest by Hastie (2020) for the history and review of ridge regression.

Figure 1. Heat map of the asymptotic prediction risk landscape of full ridge ensembles as the number of observations n, the subsample size k, and the feature dimension p tend to infinity, for varying regularization parameter λ and limiting subsample aspect ratio $\phi_s = \lim p/k$. The data $(x, y) \in \mathbb{R}^p \times \mathbb{R}$ is generated from a non-isotropic linear model $y = x^\top \beta_0 + \epsilon$ with $\phi = \lim p/n = 0.1$, where the features, the coefficients, and the residuals are distributed as $x \sim \mathcal{N}(0, \Sigma_{\mathrm{AR1}})$, $\beta_0 = \tfrac{1}{\sqrt{5}} \sum_{j=1}^{5} w_{(j)}$, and $\epsilon \sim \mathcal{N}(0, 1)$, respectively. Here, the covariance matrix satisfies $(\Sigma_{\mathrm{AR1}})_{ij} = 0.5^{|i-j|}$, and $w_{(j)}$ is the eigenvector of $\Sigma_{\mathrm{AR1}}$ associated with its j-th largest eigenvalue. The green and blue stars denote the risk of the optimal full-ensemble ridgeless predictor and the optimal ridge predictor without subsampling, respectively. The black dashed line denotes the set of $(\lambda, \phi_s)$ pairs that yield the same risk as $(\lambda^*, \phi)$ and $(0, \phi_s^*)$, while the gray dashed lines indicate sets of pairs that all result in the same sub-optimal risk.
Ridge regression has recently attracted renewed interest, particularly the limiting case of zero regularization (where the regularization parameter tends to zero), termed ridgeless regression. In the underparameterized regime, the ridgeless predictor is ordinary least squares. In the overparameterized regime, however, it interpolates the training data and exhibits a peculiar risk behavior (Belkin et al., 2020; Bartlett et al., 2020; Hastie et al., 2022; Muthukumar et al., 2020). Le Jeune et al. (2020) and Patil et al. (2022a) have recently analyzed the statistical properties of ensemble ridge and ridgeless predictors under proportional asymptotics. Under a linear model with an isotropic Gaussian covariate distribution, Le Jeune et al. (2020) prove that the full ensemble (the ensemble fitted on all possible subsampled datasets) of least squares predictors with the optimal subsample size has the same risk as that of the ridge predictor with optimal regularization. Under a more general but still isotropic covariate distribution, Patil et al. (2022a) prove a similar risk equivalence between the optimized full ridgeless ensemble and the optimized ridge predictor. These findings inspire two natural avenues to investigate.

(1) Understanding the extent of risk equivalences. As a curious experiment, one can empirically observe that a phenomenon similar to the one just mentioned appears to hold under quite general non-isotropic data models, as illustrated in Figure 1. We observe that the optimal ridgeless predictor in the full ensemble (the green star) has the same prediction risk as the optimal ridge predictor on the full data (the blue star). Furthermore, any pair $(\lambda, \phi_s)$ on the black line achieves the same optimal risk. Such a relationship also extends to any other attainable risk value; see, for example, the gray lines for $(\lambda, \phi_s)$ pairs that all achieve the same sub-optimal risk. This inspires our first investigation: to establish risk equivalences between subsampling and ridge regression under general settings.

(2) Overcoming limitations of split cross-validation. Apart from their theoretical interest, the risk equivalences also suggest an alternative practical way to tune the ridge regularization parameter, namely by tuning the subsample size. In terms of practical tuning of ridge and ridgeless ensembles, Patil et al. (2022a) provide a split cross-validation method to estimate the prediction risk of ensembles with a fixed (finite) number of ensemble sizes and further prove that split cross-validation consistently selects the best subsample size. The split cross-validation procedure has two disadvantages: (a) sample splitting introduces additional external randomness into the predictor; and (b) the reduced sample size, although asymptotically negligible, has significant finite-sample effects, especially near the interpolation thresholds. This inspires our second investigation: to address these limitations by considering generalized cross-validation (GCV), which does not require any sample splitting. The consideration of GCV as a viable estimator of the prediction risk for ridge ensembles stems from the observation that ridge ensembles are in fact linear smoothers.

1.1. Summary of Contributions

Below we provide a brief overview of our main results.

General risk equivalences. We establish general equivalences between the subsample-optimized ridgeless ensemble, the optimal ridge predictor, and the optimal subsample ridge ensemble (see Theorem 2.3).
In addition, for any τ ≥ 0, we provide an exact characterization of the set $\mathcal{C}_\tau$ of pairs $(\lambda, \phi_s)$ (the regularization parameter and the limiting subsample aspect ratio) such that the risk of the full ridge ensemble with ridge regularization λ and subsample aspect ratio $\phi_s$ equals the risk of the ridge predictor with ridge regularization τ. In essence, this amounts to showing that the implicit regularization of subsampling acts the same as additional explicit ridge regularization.

Uniform consistency of GCV. We establish the uniform consistency of GCV across all possible subsample sizes for full ridge ensembles with a fixed regularization parameter (see Theorem 3.1). Notably, this result also applies to zero explicit regularization and thus covers ridgeless regression. This finding enables tuning over the subsample size in a data-dependent manner, and in conjunction with Theorem 2.3, it implies that GCV tuning leads to a predictor with the same risk as the optimal ridge predictor (see Corollary 3.2).

Finite-ensemble surprises. Even though GCV is consistent for the non-ensemble ridge and full-ensemble ridge predictors, we show, perhaps surprisingly, that GCV can be inconsistent for ridge ensembles even when the ensemble size is two (see Proposition 3.3); to our knowledge, this is the first such result. This finding is in contrast to other known consistency results for GCV for ridge regression (see Section 1.2 for more details). Nevertheless, experiments on synthetic data and real-world single-cell multiomic datasets demonstrate the applicability of GCV for tuning subsample sizes, even with moderate ensemble sizes (roughly of order 10).

1.2. Related Work

Ensembles and risk analysis. Ensemble methods are effective in combining weak predictors to build strong predictors in both regression and classification settings (Hastie et al., 2009). Early work on ensemble methods includes the classical papers by Breiman (1996) and Bühlmann & Yu (2002). There has been further work on ensembles of smooth weak predictors (Buja & Stuetzle, 2006; Friedman & Hall, 2007), non-parametric estimators (Bühlmann & Yu, 2002; Loureiro et al., 2022), and classifiers (Hall & Samworth, 2005; Samworth, 2012). Under proportional asymptotics, d'Ascoli et al. (2020), Adlam & Pennington (2020a), and Loureiro et al. (2022) study ensemble learning under random feature models. For ridge ensembles, Sollich & Krogh (1995) and Krogh & Sollich (1997) derive risk asymptotics under Gaussian features. Le Jeune et al. (2020) consider least squares ensembles obtained by subsampling such that the subsampled datasets have more observations than the number of features. The asymptotic risk characterization for general data models has been derived by Patil et al. (2022a). Both of these works show the equivalence between the subsample-optimized full ridgeless ensemble and the optimal ridge predictor under isotropic models. Our work significantly extends the scope of these results by characterizing risk equivalences for both optimal and suboptimal risks and for arbitrary feature covariance and signal structures. See the remarks after Theorem 2.3 for a detailed comparison.

Cross-validation and consistency. Cross-validation (CV) is arguably the most popular class of methods for model assessment and selection. Classical work on CV includes Allen (1974), Stone (1974; 1977), and Geisser (1975), among others. We refer the reader to Arlot & Celisse (2010) and Zhang & Yang (2015) for comprehensive surveys of different CV variants.
In practice, k-fold CV is widely used with typical k being 5 or 10 (Hastie et al., 2009; Györfi et al., 2006), but such small values of k suffer from bias in high dimensions (Rad & Maleki, 2020). The extreme case of leave-one-out cross-validation (LOOCV) (when k = n) alleviates the bias issues in risk estimation, and various statistical consistency properties of LOOCV have been analyzed in recent years; see, e.g., Kale et al. (2011); Kumar et al. (2013); Obuchi & Kabashima (2016); Rad et al. (2020). Except in special cases, LOOCV is computationally expensive, and consequently, various approximations and their theoretical properties have been studied; see, e.g., Wang et al. (2018); Rad & Maleki (2020); Rad et al. (2020); Xu et al. (2019). Generalized cross-validation (GCV) can be viewed as an approximation to the shortcut leave-one-out formula (Hastie et al., 2009); it was originally studied in the fixed-X design setting for linear smoothers by Golub et al. (1979) and Craven & Wahba (1979). The consistency of GCV in that setting has been investigated by Li (1985; 1986; 1987). More recently, in the random-X setting, GCV has received considerable attention. In particular, consistency of GCV for ridge regression has been established in Adlam & Pennington (2020b); Hastie (2020); Patil et al. (2021; 2022c); Wei et al. (2022) under various data settings. Our work contributes to this body of work by analyzing GCV for subsampled ensemble ridge regression.

2. Subsample and Ridge Equivalences

We consider the standard supervised regression setting. Let $\mathcal{D}_n = \{(x_j, y_j) : j \in [n]\}$ denote a dataset containing i.i.d. random vectors in $\mathbb{R}^p \times \mathbb{R}$, let $X \in \mathbb{R}^{n \times p}$ denote the feature matrix whose j-th row is $x_j^\top$, and let $y \in \mathbb{R}^n$ denote the response vector whose j-th entry is $y_j$. For an index set $I \subseteq [n]$ of size k, let $\mathcal{D}_I = \{(x_j, y_j) : j \in I\}$ be a subsampled dataset and let $L_I \in \mathbb{R}^{n \times n}$ denote the diagonal matrix whose j-th diagonal entry is 1 if $j \in I$ and 0 otherwise. Noting that the feature matrix and response vector associated with $\mathcal{D}_I$ are $L_I X$ and $L_I y$, respectively, the ridge estimator $\hat{\beta}^\lambda_k(\mathcal{D}_I)$ fitted on $\mathcal{D}_I$ (containing k samples) with regularization parameter λ > 0 can be expressed as:
$$\hat{\beta}^\lambda_k(\mathcal{D}_I) = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \; \sum_{j \in I} (y_j - x_j^\top \beta)^2 / k + \lambda \|\beta\|_2^2 = (X^\top L_I X / k + \lambda I_p)^{-1} X^\top L_I y / k. \tag{1}$$
Letting λ → 0+, $\hat{\beta}^0_k(\mathcal{D}_I) := (X^\top L_I X / k)^+ X^\top L_I y / k$ becomes the so-called ridgeless estimator, where $A^+$ denotes the Moore-Penrose inverse of a matrix A.

Ensemble estimator. To introduce the ensemble estimator, it helps to define the collection of all subsets of [n] with k distinct elements, $\mathcal{I}_k := \{\{i_1, i_2, \ldots, i_k\} : 1 \le i_1 < i_2 < \ldots < i_k \le n\}$. Note that the cardinality of $\mathcal{I}_k$ is $\binom{n}{k}$. For λ ≥ 0, the ensemble estimator is then defined as:
$$\tilde{\beta}^\lambda_{k,M}(\mathcal{D}_n; \{I_\ell\}_{\ell=1}^M) := \frac{1}{M} \sum_{\ell=1}^{M} \hat{\beta}^\lambda_k(\mathcal{D}_{I_\ell}), \tag{2}$$
where $I_1, \ldots, I_M$ are simple random samples from $\mathcal{I}_k$. The full-ensemble ridge estimator is the average of the predictors fitted on all possible subsampled datasets:
$$\tilde{\beta}^\lambda_{k,\infty}(\mathcal{D}_n) := \frac{1}{|\mathcal{I}_k|} \sum_{I \in \mathcal{I}_k} \hat{\beta}^\lambda_k(\mathcal{D}_I) = \mathbb{E}\big[\hat{\beta}^\lambda_k(\mathcal{D}_I) \mid \mathcal{D}_n\big], \tag{3}$$
where the conditional expectation is taken with respect to the randomness of sampling from $\mathcal{I}_k$. Lemma A.1 shows that $\tilde{\beta}^\lambda_{k,\infty}(\mathcal{D}_n)$ is also almost surely equivalent to letting the ensemble size M tend to infinity in (2) conditionally on the full dataset $\mathcal{D}_n$, thus justifying the notation in (3). For simplicity, we drop the dependency on $\mathcal{D}_n$ and $\{I_\ell\}_{\ell=1}^M$ and only write $\tilde{\beta}^\lambda_{k,M}$ and $\tilde{\beta}^\lambda_{k,\infty}$ when it is clear from the context.
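To make these definitions concrete, the following minimal sketch (not from the paper; all sizes and names are illustrative) fits the subsampled ridge estimator in (1) and averages M of them as in (2) on a synthetic Gaussian design.

```python
import numpy as np

def ridge_on_subsample(X, y, idx, lam):
    """Ridge estimator (1) fitted on the subsample D_I indexed by idx, with the
    1/k scaling used in the paper; lam = 0 gives the ridgeless estimator via the
    Moore-Penrose pseudoinverse."""
    k, p = len(idx), X.shape[1]
    XI, yI = X[idx], y[idx]
    return np.linalg.pinv(XI.T @ XI / k + lam * np.eye(p)) @ (XI.T @ yI / k)

def ensemble_ridge(X, y, k, M, lam, rng):
    """M-ensemble estimator (2): average of ridge estimators fitted on M
    subsamples of size k drawn without replacement from [n]."""
    n = X.shape[0]
    betas = [ridge_on_subsample(X, y, rng.choice(n, size=k, replace=False), lam)
             for _ in range(M)]
    return np.mean(betas, axis=0)

# Toy usage on synthetic data (illustrative sizes only).
rng = np.random.default_rng(0)
n, p, k, M, lam = 300, 150, 100, 20, 0.1
X = rng.standard_normal((n, p))
beta0 = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta0 + rng.standard_normal(n)
beta_ens = ensemble_ridge(X, y, k, M, lam, rng)
print(beta_ens.shape)  # (p,)
```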
Prediction risk. We assess the performance of an M-ensemble predictor via its conditional squared prediction risk:
$$R^\lambda_{k,M} := \mathbb{E}_{(x,y)}\big[(y - x^\top \tilde{\beta}^\lambda_{k,M})^2 \mid \mathcal{D}_n, \{I_\ell\}_{\ell=1}^M\big], \tag{4}$$
where $(x, y)$ is an independent test point sampled from the same distribution as $\mathcal{D}_n$. Note that the conditional risk $R^\lambda_{k,M}$ is a random variable that depends on both the dataset $\mathcal{D}_n$ and the random samples $I_\ell$, $\ell = 1, \ldots, M$. For the full-ensemble estimator $\tilde{\beta}^\lambda_{k,\infty}$, the conditional prediction risk is defined analogously, except that the risk now depends only on $\mathcal{D}_n$:
$$R^\lambda_{k,\infty} := \mathbb{E}_{(x,y)}\big[(y - x^\top \tilde{\beta}^\lambda_{k,\infty})^2 \mid \mathcal{D}_n\big]. \tag{5}$$

2.1. Data Assumptions

For our theoretical results, we work under a proportional asymptotics regime, in which the original data aspect ratio p/n converges to ϕ ∈ (0, ∞) as n, p → ∞, and the subsample aspect ratio p/k converges to $\phi_s$ as k, p → ∞. Note that because k ≤ n, $\phi_s$ always lies in [ϕ, ∞]. In addition, we impose two structural assumptions on the feature matrix and the response vector, summarized in Assumptions 2.1 and 2.2, respectively.

Assumption 2.1 (Feature model). The feature matrix decomposes as $X = Z \Sigma^{1/2}$, where $Z \in \mathbb{R}^{n \times p}$ contains i.i.d. entries with mean 0, variance 1, and bounded moments of order 4 + δ for some δ > 0, and $\Sigma \in \mathbb{R}^{p \times p}$ is deterministic and symmetric with eigenvalues uniformly bounded between $r_{\min} > 0$ and $r_{\max} < \infty$. Let $\Sigma = \sum_{j=1}^{p} r_j w_j w_j^\top$ denote the eigenvalue decomposition, where $(r_j, w_j)$, $j \in [p]$, are the pairs of associated eigenvalues and normalized eigenvectors. We assume there exists a deterministic distribution H such that the empirical spectral distribution of Σ, $H_p(r) := p^{-1} \sum_{i=1}^{p} \mathbf{1}\{r_i \le r\}$, weakly converges to H, almost surely (with respect to X).

Assumption 2.2 (Response model). The response vector decomposes as $y = X \beta_0 + \epsilon$, where $\beta_0 \in \mathbb{R}^p$ is an unknown signal vector with uniformly bounded ℓ2-norm and $\lim_{p \to \infty} \|\beta_0\|_2^2 = \rho^2$, and ϵ is an unobserved error vector independent of X with mean 0, variance σ², and bounded moment of order 4 + δ for some δ > 0. We assume there exists a deterministic distribution G such that the empirical distribution of $\beta_0$'s (squared) projections onto Σ's eigenspace, $G_p(r) := \|\beta_0\|_2^{-2} \sum_{i=1}^{p} (\beta_0^\top w_i)^2 \mathbf{1}\{r_i \le r\}$, weakly converges to G, almost surely (with respect to X).

Assumptions 2.1 and 2.2 are standard in the study of ridge and ridgeless regression under proportional asymptotics; see, e.g., Hastie et al. (2022); Patil et al. (2022b;a). It is possible to further relax both of these assumptions. Specifically, one can incorporate other feature models, e.g., random features (Mei & Montanari, 2022), and can allow for certain non-linearities in the regression function (Bartlett et al., 2021) for the response model. We leave these extensions for future work.

2.2. Risk Equivalences

Under the above assumptions, Lemma A.2 from Patil et al. (2022a) implies that, for every M ≥ 1, the prediction risk $R^\lambda_{k,M}$ of the ridge and ridgeless ensemble predictors converges to a deterministic limit $R^\lambda_M(\phi, \phi_s)$ as $k, n, p \to \infty$ with $p/n \to \phi$ and $p/k \to \phi_s$. When $\phi_s = \phi$ (e.g., k = n), the asymptotic risk $R^\lambda_M(\phi, \phi)$ is equal to the risk $R^\lambda_1(\phi, \phi)$ of the ridge predictor on the full dataset $\mathcal{D}_n$ for all M ≥ 1, and we denote this risk simply by $R^\lambda(\phi, \phi)$. To facilitate our discussion, Table 1 provides pointers to the definitions of all important quantities used in the paper. From a practical point of view, it is important to understand the least risk attainable in the full ensemble.
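As a quick illustration of the conditional risk in (4), the sketch below approximates it by Monte Carlo on fresh test data drawn from the same model, using an AR(1) covariance in the spirit of Figure 1; the signal vector here is a toy placeholder rather than the paper's choice, and the ensemble coefficients would come from a fit such as the one sketched after (3).

```python
import numpy as np

def conditional_risk_mc(beta_ens, beta0, Sigma, sigma2, n_test, rng):
    """Monte Carlo approximation of the conditional prediction risk (4):
    draw fresh (x, y) from the same model and average the squared errors,
    conditioning on the trained ensemble coefficients beta_ens."""
    L = np.linalg.cholesky(Sigma)
    X_test = rng.standard_normal((n_test, len(beta0))) @ L.T
    y_test = X_test @ beta0 + np.sqrt(sigma2) * rng.standard_normal(n_test)
    return float(np.mean((y_test - X_test @ beta_ens) ** 2))

# AR(1) covariance as in Figure 1: (Sigma)_{ij} = 0.5^{|i-j|}.
rng = np.random.default_rng(1)
p = 100
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
beta0 = np.zeros(p); beta0[0] = 1.0  # toy signal, not the paper's choice
beta_ens = np.zeros(p)               # placeholder; use ensemble_ridge(...) in practice
print(conditional_risk_mc(beta_ens, beta0, Sigma, 1.0, 10_000, rng))
```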
For the full ridge ensemble, we find that explicit ridge regularization is unnecessary when the subsample size is optimized: the implicit regularization of ridgeless regression combined with subsampling suffices. The result below formalizes this empirical observation.

Theorem 2.3 (Optimal ridgeless ensemble vs. optimal ridge). Under Assumptions 2.1 and 2.2, for all ϕ ∈ (0, ∞), we have
$$\underbrace{\min_{\phi_s \ge \phi} R^0_\infty(\phi, \phi_s)}_{\text{opt. ensemble and no ridge}} \;\overset{(a)}{=}\; \underbrace{\min_{\lambda \ge 0} R^\lambda(\phi, \phi)}_{\text{no ensemble and opt. ridge}} \;\overset{(b)}{=}\; \underbrace{\min_{\phi_s \ge \phi,\, \lambda \ge 0} R^\lambda_\infty(\phi, \phi_s)}_{\text{opt. ensemble and opt. ridge}}.$$
Further, if $\phi_s^*$ is the optimal subsample aspect ratio for ridgeless regression, and $\lambda^*$ is the optimal ridge regularization with no subsampling, then for any $\theta \in [0, \lambda^*]$, the full ridge ensemble with penalty parameter $\lambda = \lambda^* - \theta$ and subsample aspect ratio $\phi_s = \phi + \theta(\phi_s^* - \phi)/\lambda^*$ also attains the optimal prediction risk.

In words, Theorem 2.3 says that optimizing the subsample size (i.e., k) in the full ridgeless ensemble attains the same prediction risk as optimizing only the explicit regularization parameter (i.e., λ) of the ridge predictor. Further, both are the same as optimizing over both k and λ. If one uses a smaller ridge penalty than needed for optimal prediction (i.e., uses λ < λ*), then a full ensemble at the specific subsample aspect ratio $\phi_s = \phi + (1 - \lambda/\lambda^*)(\phi_s^* - \phi) > \phi$ recovers the remaining ridge regularization. In this sense, the implicit regularization provided by the ensemble amounts to adding more explicit ridge regularization. Similarly, one can supplement the sub-optimal implicit regularization of subsampling by adding explicit ridge regularization.

A special case of the equivalence (a) in Theorem 2.3 was previously formalized by Le Jeune et al. (2020) and Patil et al. (2022a) for isotropic covariates. Working with an isotropic design helps their proofs significantly, as the spectral distributions are the same for all p and n, and a closed-form expression for the asymptotic prediction risk can be derived analytically. In a general non-isotropic design, however, the asymptotic risk does not admit a closed-form expression, and one needs to account for this carefully.

General risk equivalences. Theorem 2.3 proves the risk equivalence of the ridge predictor and the full ridgeless ensemble when they attain their minimum risks. Appendix A.3 shows a further risk equivalence over the full range: for any $\phi_s \in [\phi, +\infty]$, there exists $\bar{\lambda} \ge 0$ such that $R^0_\infty(\phi, \phi_s) = R^{\bar{\lambda}}(\phi, \phi)$. Further, $R^\lambda_\infty(\phi, \phi_s)$ remains constant as $(\lambda, \phi_s)$ varies along the line segment $(1 - \theta)\,(\bar{\lambda}, \phi) + \theta\,(0, \phi_s)$, for all θ ∈ [0, 1].

A remarkable implication of Theorem 2.3 is that, for a fixed dataset $\mathcal{D}_n$, one does not need to tune both the subsample size (i.e., k) and the ridge regularization parameter (i.e., λ); it suffices, for example, to fix λ = 0 and only tune $\phi_s$. Alternatively, one can fix k = n and tune λ ≥ 0, which was considered in Patil et al. (2021). Tuning over λ ≥ 0 requires discretizing an infinite interval, while tuning the subsample size for a fixed λ only requires searching over a finite grid ranging from k = 1 to k = n.
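The following small sketch (illustrative only; the optimal values λ* and $\phi_s^*$ are hypothetical placeholders, not computed from data) enumerates a few points on the optimal-risk path of Theorem 2.3, namely $\lambda = \lambda^* - \theta$ and $\phi_s = \phi + \theta(\phi_s^* - \phi)/\lambda^*$ for θ ∈ [0, λ*].

```python
import numpy as np

def equivalence_path(lam_star, phi_s_star, phi, n_points=6):
    """(lambda, phi_s) pairs that, by Theorem 2.3, all attain the optimal risk:
    lambda = lam_star - theta and phi_s = phi + theta*(phi_s_star - phi)/lam_star
    for theta ranging over [0, lam_star]."""
    thetas = np.linspace(0.0, lam_star, n_points)
    return [(lam_star - t, phi + t * (phi_s_star - phi) / lam_star) for t in thetas]

# Hypothetical optimal values, for illustration only.
lam_star, phi_s_star, phi = 2.0, 1.5, 0.1
for lam, phi_s in equivalence_path(lam_star, phi_s_star, phi):
    print(f"lambda = {lam:.2f}, phi_s = {phi_s:.3f}")
# theta = 0 recovers (lam_star, phi): optimal ridge with no subsampling;
# theta = lam_star recovers (0, phi_s_star): optimal ridgeless full ensemble.
```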
For this reason, we fix λ and focus on tuning over k in this paper. In the next section, we investigate the problem of tuning the subsample size in the full ensemble to achieve the minimum oracle risk via generalized cross-validation.

| Variable | M-ensemble (finite-sample) | M-ensemble (asymptotic) | Full ensemble (finite-sample) | Full ensemble (asymptotic) |
| --- | --- | --- | --- | --- |
| Prediction risk | $R^\lambda_{k,M}$ (4) | $R^\lambda_M(\phi, \phi_s)$ (21) | $R^\lambda_{k,\infty}$ (5) | $R^\lambda_\infty(\phi, \phi_s)$ (21) |
| Test error | $\widetilde{R}^\lambda_{k,M}$ (7) | $R^\lambda_M(\phi, \phi_s)$ (21) | | |
| Training error | $T^\lambda_{k,M}$ (6) | $T^\lambda_M(\phi, \phi_s)$ (32) | $T^\lambda_{k,\infty}$ (8) | $T^\lambda_\infty(\phi, \phi_s)$ (15) |
| GCV denominator | $D^\lambda_{k,M}$ (12) | | $D^\lambda_{k,\infty}$ (13) | $D^\lambda(\phi, \phi_s)$ (26) |
| GCV estimator | $\mathrm{gcv}^\lambda_{k,M}$ (11) | | $\mathrm{gcv}^\lambda_{k,\infty}$ (11) | $G^\lambda(\phi, \phi_s)$ (16) |

Table 1. Summary of notation and pointers to the definitions of the important empirical quantities used in this paper and their asymptotic limits.

3. Generalized Cross-Validation

Suppose $\hat{f}(\,\cdot\,; \mathcal{D}_n) : \mathbb{R}^p \to \mathbb{R}$ is a predictor trained on $\mathcal{D}_n$. We call $\hat{f}(\,\cdot\,; \mathcal{D}_n)$ a linear smoother if $\hat{f}(x; \mathcal{D}_n) = a_x^\top y$ for some vector $a_x$ that depends only on the design X (and x). Define the smoothing matrix $S \in \mathbb{R}^{n \times n}$ with rows $a_{x_1}^\top, \ldots, a_{x_n}^\top$, which in turn is only a function of X. For any linear smoother, the generalized cross-validation (GCV) estimator of the prediction risk is defined to be $n^{-1} \|y - S y\|_2^2 / (1 - n^{-1} \operatorname{tr}(S))^2$; see, e.g., Wasserman (2006, Section 5.3). The numerator of GCV is the training error, which is typically biased downwards, and the denominator attempts to account for this optimism of the predictor.

Ensemble GCV. Before we analyze GCV for the ridge ensemble, we first introduce some notation. Let $I_{1:M} := \cup_{\ell=1}^{M} I_\ell$ and $I^c_{1:M} := [n] \setminus I_{1:M}$. We define the in-sample training error and the out-of-sample test error of $\tilde{\beta}^\lambda_{k,M}$ as:
$$T^\lambda_{k,M} := \frac{1}{|I_{1:M}|} \sum_{i \in I_{1:M}} (y_i - x_i^\top \tilde{\beta}^\lambda_{k,M})^2, \tag{6}$$
$$\widetilde{R}^\lambda_{k,M} := \frac{1}{|I^c_{1:M}|} \sum_{i \in I^c_{1:M}} (y_i - x_i^\top \tilde{\beta}^\lambda_{k,M})^2. \tag{7}$$
Since the full-ensemble estimator $\tilde{\beta}^\lambda_{k,\infty}$ uses all the data $\mathcal{D}_n$, its training error, denoted by $T^\lambda_{k,\infty}$, is simply:
$$T^\lambda_{k,\infty} := \frac{1}{n} \sum_{i \in [n]} (y_i - x_i^\top \tilde{\beta}^\lambda_{k,\infty})^2. \tag{8}$$
Since $I_{1:M} \to [n]$ almost surely for any $n \in \mathbb{N}$ as $M \to \infty$, the notation $T^\lambda_{k,\infty}$ in (8) is justified as a limiting case of (6) (see Appendix A.1 for more details). Now, observe that a ridge ensemble is a linear smoother because $X_{I_{1:M}} \tilde{\beta}^\lambda_{k,M} = (S^\lambda_{k,M} y)_{I_{1:M}}$, where the smoothing matrix $S^\lambda_{k,M}$ (whose columns outside $I_{1:M}$ are zero) is given by:
$$S^\lambda_{k,M} = \frac{1}{M} \sum_{\ell=1}^{M} X (X^\top L_{I_\ell} X / k + \lambda I_p)^+ X^\top L_{I_\ell} / k. \tag{9}$$
Analogously, the smoothing matrix for $\tilde{\beta}^\lambda_{k,\infty}$ is given by:
$$S^\lambda_{k,\infty} = \frac{1}{|\mathcal{I}_k|} \sum_{I \in \mathcal{I}_k} X (X^\top L_I X / k + \lambda I_p)^+ X^\top L_I / k. \tag{10}$$
Thus, the GCV estimates for ridge predictors in the finite- and full-ensemble cases are respectively given by:
$$\mathrm{gcv}^\lambda_{k,M} = \frac{T^\lambda_{k,M}}{D^\lambda_{k,M}}, \qquad \mathrm{gcv}^\lambda_{k,\infty} = \frac{T^\lambda_{k,\infty}}{D^\lambda_{k,\infty}}, \tag{11}$$
where the denominators $D^\lambda_{k,M}$ and $D^\lambda_{k,\infty}$ are as follows:
$$D^\lambda_{k,M} := \big(1 - |I_{1:M}|^{-1} \operatorname{tr}(S^\lambda_{k,M})\big)^2, \tag{12}$$
$$D^\lambda_{k,\infty} := \big(1 - n^{-1} \operatorname{tr}(S^\lambda_{k,\infty})\big)^2. \tag{13}$$

3.1. Full-Ensemble Uniform Consistency

Let $\mathcal{K}_n \subseteq \{0, 1, \ldots, n\}$ be a grid of subsample sizes that covers the full range of [0, n] asymptotically, in the sense that $\{k/n : k \in \mathcal{K}_n\}$ converges to the set [0, 1] as n → ∞. One simple choice is to set $\mathcal{K}_n = \{0, k_0, 2 k_0, \ldots, \lfloor n / k_0 \rfloor k_0\}$, where the increment is $k_0 = \lfloor n^\nu \rfloor$ for some ν ∈ (0, 1). Here, we adopt the convention that when k = 0, the predictor reduces to the null predictor that always returns zero. Based on the definition above, we now present the uniform consistency result for the GCV estimator (11) for full ensembles when the ridge regularization parameter λ is fixed.

Theorem 3.1 (Uniform consistency of GCV). Suppose Assumptions 2.1 and 2.2 hold. Then, for all λ ≥ 0, we have
$$\max_{k \in \mathcal{K}_n} \big| \mathrm{gcv}^\lambda_{k,\infty} - R^\lambda_{k,\infty} \big| \xrightarrow{a.s.} 0,$$
as $n, p \to \infty$ such that $p/n \to \phi \in (0, \infty)$.
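The sketch below (illustrative, not the authors' implementation) computes the finite-M GCV estimate in (11) directly from (6), (9), and (12); the full-ensemble quantity in (10) would average over all of $\mathcal{I}_k$, which one can approximate in practice by taking M large.

```python
import numpy as np

def gcv_ridge_ensemble(X, y, subsamples, lam):
    """GCV estimate (11) for an M-ensemble ridge predictor: training error (6)
    over the union of subsamples divided by the squared trace correction (12).
    Each term of the smoothing matrix follows (9), written with the selection
    matrix L_I as X (X^T L_I X / k + lam I_p)^+ X^T L_I / k."""
    n, p = X.shape
    M = len(subsamples)
    S = np.zeros((n, n))
    beta_bar = np.zeros(p)
    for idx in subsamples:
        k = len(idx)
        L = np.zeros((n, n)); L[idx, idx] = 1.0
        G = np.linalg.pinv(X.T @ L @ X / k + lam * np.eye(p))
        S += X @ G @ X.T @ L / k
        beta_bar += G @ (X.T @ L @ y / k)
    S /= M
    beta_bar /= M
    union = np.unique(np.concatenate(subsamples))              # I_{1:M}
    train_err = np.mean((y[union] - X[union] @ beta_bar) ** 2)  # (6)
    denom = (1.0 - np.trace(S) / len(union)) ** 2               # (12)
    return train_err / denom

# Toy usage; for the full-ensemble GCV one would take M large
# (or, when k = n, use the single full-data smoother).
rng = np.random.default_rng(2)
n, p, k, M, lam = 200, 120, 80, 25, 0.1
X = rng.standard_normal((n, p))
y = X @ (rng.standard_normal(p) / np.sqrt(p)) + rng.standard_normal(n)
subs = [rng.choice(n, size=k, replace=False) for _ in range(M)]
print(gcv_ridge_ensemble(X, y, subs, lam))
```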
Figure 2. Asymptotic risk and GCV curves for full ridge ensembles under model (M-AR1) with $\rho_{\mathrm{AR1}} = 0.5$ and σ² = 1, for varying regularization parameters λ ∈ {0, 0.1, 1} and subsample sizes $k = \lfloor p/\phi_s \rfloor$. The points denote finite-sample risks averaged over 50 dataset repetitions with an ensemble size of M = 500, with $n = \lfloor p/\phi \rfloor$ and p = 500. The left and right panels illustrate the underparameterized (p < n) and overparameterized (p > n) cases with limiting data aspect ratios ϕ = 0.1 and ϕ = 1.1, respectively.

Figure 3. Asymptotic prediction risk curves with optimally tuned parameters λ* and $\phi_s^*$ under model (M-AR1) with $\rho_{\mathrm{AR1}} = 0.5$ and σ² = 1, for varying data aspect ratio ϕ. The curves represent the theoretical asymptotic GCV estimate in the full ensemble and the asymptotic risk of the optimal ridge predictors. The points represent the finite-sample risks of the best 500-ensemble ridgeless predictor and the best ridge predictor averaged over 50 dataset repetitions, with $n = \lfloor p/\phi \rfloor$ and p = 500.

Theorem 3.1 establishes the uniform consistency of GCV in the full ensemble over the subsample sizes k for a fixed ridge regularization parameter λ. The almost sure qualification in Theorem 3.1 is with respect to the entire training data (X, y). An implication of Theorem 3.1 is that one can select the optimal subsample size in a data-dependent manner: selecting $\hat{k}^\lambda \in \operatorname*{argmin}_{k \in \mathcal{K}_n} \mathrm{gcv}^\lambda_{k,\infty}$ is guaranteed to track the minimum prediction risk $\min_{k \in [n]} R^\lambda_{k,\infty}$ asymptotically.

We first provide numerical illustrations for Theorem 3.1 under the non-isotropic AR(1) data model, which is the same as the one used for Figure 1; see Appendix I for model details. Figure 2 shows both the GCV estimate and the asymptotic risk for the full ridge ensemble. We observe a close match between the theoretical curves and the GCV estimates.

Combining Theorem 2.3 and Theorem 3.1, we obtain the following corollary regarding GCV subsample tuning.

Corollary 3.2 (Ridge tuning by GCV subsample tuning). Suppose Assumptions 2.1 and 2.2 hold. Then, we have
$$\mathrm{gcv}^0_{\hat{k}^0, \infty} \xrightarrow{a.s.} \min_{\phi_s \ge \phi,\, \lambda \ge 0} R^\lambda_\infty(\phi, \phi_s),$$
as $k, n, p \to \infty$ such that $p/n \to \phi \in (0, \infty)$.

Corollary 3.2 certifies the validity of GCV tuning for achieving the optimal risk over all possible regularization parameters and subsample sizes. In practice, tuning the ridge parameter λ requires one to specify a grid of λ values for cross-validation. However, the maximum value of the grid is generally chosen by some ad hoc criterion. For example, there is no default maximum value for ridge tuning in the widely-used package glmnet (Friedman et al., 2010). From Theorem 2.3, when the signal-to-noise ratio ρ²/σ² is small, the subsample size should be small (so that $\phi_s$ is large), and the range of the λ grid needs to be large enough to cover its optimal value. In contrast, the GCV-based method does not need such an upper bound for the grid $\mathcal{K}_n$ of subsample sizes, because the sample size provides a natural finite grid, informed by the dataset. In Corollary 3.2, we fix the ridge regularization parameter λ to be zero, but one can also use any other value λ < λ*, and a similar statement still holds with $\mathrm{gcv}^0_{\hat{k}^0, \infty}$ replaced by $\mathrm{gcv}^\lambda_{\hat{k}^\lambda, \infty}$, based on Theorem 2.3.
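As a concrete illustration of this tuning recipe, the sketch below builds the grid $\mathcal{K}_n$ described in Section 3.1 and selects $\hat{k}$ by minimizing a large-M surrogate of the full-ensemble GCV; it reuses the hypothetical helper gcv_ridge_ensemble from the previous sketch and is illustrative rather than the authors' implementation.

```python
import numpy as np

def subsample_grid(n, nu=0.5):
    """Grid K_n = {0, k0, 2*k0, ..., floor(n/k0)*k0} with k0 = floor(n**nu)."""
    k0 = int(np.floor(n ** nu))
    return [0] + list(range(k0, n + 1, k0))

def tune_subsample_by_gcv(X, y, lam, M, rng, nu=0.5):
    """Pick k_hat minimizing a large-M surrogate of the full-ensemble GCV;
    k = 0 is the null predictor, whose GCV value is just the mean of y^2."""
    n = X.shape[0]
    best_k, best_gcv = 0, float(np.mean(y ** 2))
    for k in subsample_grid(n, nu)[1:]:
        subs = [rng.choice(n, size=k, replace=False) for _ in range(M)]
        val = gcv_ridge_ensemble(X, y, subs, lam)  # helper from previous sketch
        if val < best_gcv:
            best_k, best_gcv = k, val
    return best_k, best_gcv

# Usage (with gcv_ridge_ensemble defined as in the previous sketch):
# k_hat, gcv_hat = tune_subsample_by_gcv(X, y, lam=0.0, M=50,
#                                        rng=np.random.default_rng(3))
```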
Furthermore, one can construct an estimator of λ* as $\hat{\lambda} = \lambda (n - \hat{k}^0) / (\hat{k}^\lambda - \hat{k}^0)$ by extrapolating the line segment between $(0, \hat{k}^0)$ and $(\lambda, \hat{k}^\lambda)$ to k = n. In Figure 3, we numerically compare the optimal subsampled ridgeless ensemble with the optimal ridge predictor to verify Corollary 3.2. As we can see, their theoretical curves match exactly, and the empirical estimates in finite samples are also close to their asymptotic limits.

Figure 4. Asymptotic prediction risk curves of ridgeless ensembles under model (M-AR1) with $\rho_{\mathrm{AR1}} = 0.5$, σ² = 1, and ϕ = 0.1. The points denote the finite-sample GCV estimates of ridgeless ensembles for varying ensemble sizes M ∈ {1, 10, 20}, averaged over 50 dataset repetitions, with $n = \lfloor p/\phi \rfloor$ and p = 500.

3.2. A Finite-Ensemble Inconsistency Result

While deriving the GCV asymptotics for the full ensemble in the proof of Theorem 3.1, we also obtain, as a byproduct, the asymptotic limit of the GCV estimate for finite ensembles. From related work (Patil et al., 2021) and Theorem 3.1, we already know that $\mathrm{gcv}^\lambda_{k,1}$ and $\mathrm{gcv}^\lambda_{k,\infty}$ are consistent estimators of the non-ensemble risk $R^\lambda_{k,1}$ and the full-ensemble risk $R^\lambda_{k,\infty}$, respectively. However, $\mathrm{gcv}^\lambda_{k,M}$ for 1 < M < ∞ may not be consistent, which is somewhat surprising. As an example, GCV for M = 2 is not a consistent estimator of the prediction risk $R^\lambda_{k,2}$, as shown in the following proposition.

Proposition 3.3 (GCV inconsistency for ridgeless, M = 2). Suppose Assumptions 2.1 and 2.2 hold with ρ², σ² ∈ (0, ∞). Then, for any ϕ ∈ (0, ∞), we have
$$\big| \mathrm{gcv}^0_{k,2} - R^0_{k,2} \big| \;\overset{p}{\nrightarrow}\; 0,$$
as $k, n, p \to \infty$, $p/n \to \phi$, $p/k \to \phi_s \in (1, \infty) \cap (\phi, \infty)$.

Intuitively, the inconsistency happens in large part because, for a finite M, the residuals computed using the bagged predictor contain non-negligible fractions of both out-of-sample and in-sample observations, and all of them are treated equally. As a result, the GCV estimate for finite ensembles indirectly relates to the original data through the aspect ratios $(\phi, \phi_s)$, even though the GCV estimate is computed using only the training observations. See Section 5 for possible approaches to a corrected GCV estimate for arbitrary ensemble sizes. In practice, though, the correction may not be crucial for moderate M, because the GCV estimate is already close to the underlying target, as shown in Figure 4.

3.3. Proof Outline of Theorem 3.1

Three key steps are involved in proving Theorem 3.1: (1) deriving the asymptotic limit of the prediction risk $R^\lambda_{k,M}$; (2) deriving the asymptotic limit of the GCV estimate $\mathrm{gcv}^\lambda_{k,\infty}$; and (3) showing pointwise consistency in k by matching the two limits and then lifting it to uniform convergence in k. We briefly explain the key ideas behind these three steps below.

(1) Asymptotic limit of risk. We build upon prior results on the risk analysis of ridge ensembles. Under Assumptions 2.1 and 2.2, Lemma A.2, adapted from Patil et al. (2022a), implies that the conditional prediction risks converge under proportional asymptotics to deterministic limits:
$$R^\lambda_{k,M} \xrightarrow{a.s.} R^\lambda_M(\phi, \phi_s), \qquad R^\lambda_{k,\infty} \xrightarrow{a.s.} R^\lambda_\infty(\phi, \phi_s), \tag{14}$$
where $R^\lambda_M(\phi, \phi_s)$ and $R^\lambda_\infty(\phi, \phi_s)$ are as defined in (21).

(2) Asymptotic limit of GCV. To analyze the asymptotic behavior of the GCV estimates, we obtain the asymptotics of the denominator and the numerator of GCV separately. We first treat the regular cases when $\phi_s < \infty$ and λ > 0, and then incorporate the boundary cases $\phi_s = \infty$ and λ = 0.
Our analysis begins with the following lemma, which provides the asymptotics of the denominator $D^\lambda_{k,\infty}$ (as in (13)) of GCV.

Lemma 3.4 (Asymptotics of the GCV denominator). Suppose Assumption 2.1 holds. Then, for all λ > 0,
$$D^\lambda_{k,\infty} \xrightarrow{a.s.} D^\lambda(\phi, \phi_s),$$
as $k, n, p \to \infty$, $p/n \to \phi \in (0, \infty)$, $p/k \to \phi_s \in [\phi, \infty)$.

It is worth noting that Lemma 3.4 does not require Assumption 2.2 because the smoothing matrix only involves the design matrix X and does not depend on the response y.

Towards obtaining the asymptotics of the numerator $T^\lambda_{k,\infty}$ (as in (8)) of GCV, we first decompose $T^\lambda_{k,\infty}$ into simpler components via Lemma D.1. Specifically, the full mean squared training error admits the following decomposition:
$$T^\lambda_{k,\infty} \;-\; \sum_{m=1}^{2} (-1)^m m \big( c_m T^\lambda_{k,m} + (1 - c_m) \widetilde{R}^\lambda_{k,m} \big) \;\xrightarrow{a.s.}\; 0,$$
where $c_1 = \phi/\phi_s$ and $c_2 = \phi(2\phi_s - \phi)/\phi_s^2$ are the limiting fractions of in-sample observations for the 1-ensemble and the 2-ensemble. Here, $T^\lambda_{k,m}$ and $\widetilde{R}^\lambda_{k,m}$ are the in-sample training and out-of-sample test errors of the m-ensemble for m = 1 and 2, as defined in (6) and (7). This decomposition implies that the full training error is asymptotically simply a linear combination of training and test errors. Therefore, it suffices to obtain the asymptotics of each of these components. As analyzed in Lemma D.2, it is easy to show that the test errors converge to $R^\lambda_m$ for m = 1, 2. It is more challenging, on the other hand, to derive the asymptotic limits of the training errors $T^\lambda_{k,m}$. We first split $T^\lambda_{k,m}$ into finer components via a bias-variance decomposition. By developing novel deterministic equivalents of the resolvents arising from the decomposition of the in-sample errors in Lemma F.8, we are able to show the convergence of the bias and variance components (see Lemma D.3). Combining Lemmas D.1 to D.3 yields the convergence of the training errors $T^\lambda_{k,m}$ to deterministic limits $T^\lambda_m$, and hence that of $T^\lambda_{k,\infty}$, as summarized in the following lemma.

Lemma 3.5 (Asymptotics of the GCV numerator). Suppose Assumptions 2.1 and 2.2 hold. Then, for all λ > 0,
$$T^\lambda_{k,\infty} \xrightarrow{a.s.} T^\lambda_\infty(\phi, \phi_s) = \sum_{m=1}^{2} (-1)^m m \big( c_m T^\lambda_m + (1 - c_m) R^\lambda_m \big), \tag{15}$$
as $k, n, p \to \infty$, $p/n \to \phi \in (0, \infty)$, $p/k \to \phi_s \in [\phi, \infty)$, where $c_1 = \phi/\phi_s$ and $c_2 = \phi(2\phi_s - \phi)/\phi_s^2$.

Finally, the boundary cases $\phi_s = +\infty$ and λ = 0 are taken care of in succession by Proposition E.1 and Proposition E.2, respectively. Combining the above results provides the asymptotics of the GCV estimate in the full ensemble.

Proposition 3.6 (Asymptotics of GCV for the full ensemble). Suppose Assumptions 2.1 and 2.2 hold. Then, for all λ ≥ 0,
$$\mathrm{gcv}^\lambda_{k,\infty} \xrightarrow{a.s.} G^\lambda(\phi, \phi_s) := \frac{T^\lambda_\infty(\phi, \phi_s)}{D^\lambda(\phi, \phi_s)}, \tag{16}$$
as $k, n, p \to \infty$, $p/n \to \phi \in (0, \infty)$, $p/k \to \phi_s \in [\phi, \infty]$.

(3) Asymptotics matching and uniform convergence. The asymptotic limits obtained in the first two steps can be shown to match each other by algebraic manipulations, which establishes the pointwise consistency in Theorem 3.1. The uniform convergence then follows by applying a certain Cesàro-type mean convergence lemma (see Lemma G.5).

4. Real Data Example: Single-Cell Multiomics

We compare tuning the subsample size in the full ridgeless ensemble with tuning the ridge parameter on the full data in a real-world data example from multiomics. The single-cell CITE-seq dataset from Hao et al. (2021) consists of 50,781 human peripheral blood mononuclear cells (PBMCs) originating from eight volunteers post-vaccination (day 3) of an HIV vaccine, and simultaneously measures 20,729 genes and 228 proteins in individual cells.
We follow the standard preprocessing procedure in single-cell data analysis (Hao et al., 2021; Du et al., 2022) to select the top 5,000 highly variable genes and the top 50 highly abundant surface proteins, which exhibit high cell-to-cell variation in the dataset. The gene expression and protein abundance counts for each cell are divided by the total counts for that cell, multiplied by 10⁴, and log-normalized. We randomly hold out half of the cells in each cell type as a test set. The top 500 principal components of the standardized gene expressions are used as features to predict the protein abundances.

Figure 5. Violin plots of mean squared errors on the randomly held-out test sets for the different tuning methods when predicting the abundances of 50 proteins in the single-cell CITE-seq dataset (cell types: CD8 T, n = 4424; Mono, n = 7156; CD4 T, n = 7864; methods: subsample tuning with M = 5, subsample tuning with M = 10, and ridge tuning). The sizes of the test sets are the same as the sizes of the training sets.

The results of using ensembles with subsample size (k) tuning and ridge (λ) tuning without subsampling, both based on GCV estimates, are compared in Figure 5. For the former, we search over a grid of 25 values of k from $n^\nu$ to n spaced evenly on the log scale, with ν = 0.5 and sample sizes n ranging from 516 to 7864 across cell types. For the latter, we search over a grid of 100 values of λ from 10⁻² to 10² spaced evenly on the log scale. Since different cell types have different sample sizes, this results in different data aspect ratios, presented in increasing order in Figure 5.

From Figure 5, we see that using a moderate ensemble size (M = 5 or 10) and tuning the subsample size performs very similarly to tuning only the ridge regularization parameter on the full dataset. This suggests that the conclusion of Corollary 3.2 also holds on real data across different data aspect ratios. As discussed after Corollary 3.2, subsample tuning is easier to implement because the dataset provides natural lower and upper bounds for the subsample size, whereas ridge tuning requires one to heuristically pick the upper regularization threshold for the search grid.

5. Discussion and Future Directions

In this work, we provide a risk characterization for the full ridge ensemble and establish oracle risk equivalences between the full ridgeless ensemble and ridge regression. At a high level, these equivalences show that the implicit regularization induced by subsampling matches explicit ridge regularization; that is, a subsampled ridge predictor with penalty λ₁ has the same risk as another ridge predictor with penalty λ₂ ≥ λ₁. Additionally, we prove the uniform consistency of generalized cross-validation for full ridge ensembles, which implies the validity of GCV tuning (which does not require sample splitting) for optimal predictive performance. We describe next some avenues for future work.

Bias correction for finite ensembles. In Proposition 3.3, we show that the GCV estimate can be inconsistent for finite ridge and ridgeless ensembles. The inconsistency for M = 2 occurs because sampling from the whole dataset induces extra randomness beyond that of the training observations used to compute the GCV estimates. Our analysis of GCV for the full ensemble suggests a way to correct the bias of the GCV estimate. In Appendix H, we outline a possible correction strategy for finite ensembles based on out-of-bag estimates.
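The correction itself is deferred to Appendix H and is not reproduced here; purely as an indication of the kind of quantity such a correction can be built from, the sketch below computes a generic out-of-bag squared-error estimate for a finite ensemble, in which each observation is predicted only by the base estimators whose subsamples exclude it (function and variable names are illustrative).

```python
import numpy as np

def oob_error(X, y, subsamples, betas):
    """Generic out-of-bag squared-error estimate for a finite ensemble:
    each observation is predicted by averaging only those base estimators
    whose subsample does not contain it."""
    n = X.shape[0]
    index_sets = [set(map(int, idx)) for idx in subsamples]
    errs = []
    for i in range(n):
        oob = [b for s, b in zip(index_sets, betas) if i not in s]
        if not oob:
            continue  # skip observations that appear in every subsample
        pred = X[i] @ np.mean(oob, axis=0)
        errs.append((y[i] - pred) ** 2)
    return float(np.mean(errs))

# Here betas would be the per-subsample ridge estimators (e.g., from
# ridge_on_subsample in the earlier sketch) and subsamples their index sets.
```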
An intriguing next research direction is to investigate the implementation and uniform consistency of the corrected GCV for finite ensembles in detail. Extensions to other error metrics. In this paper, we focus on the in-distribution squared prediction risk. It is of interest to extend the equivalences for other error metrics, such as squared estimation risk, general prediction risks, and other functionals of the out-of-sample error distribution, like the quantiles of the error distribution. Additionally, for the purposes of tuning, it is also of interest to extend the GCV analysis to estimate such functionals of the out-of-sample error distribution. Such functional estimation could be valuable in constructing prediction intervals for the unknown response. The technical tools introduced in Patil et al. (2022c) involving leave-one-out perturbation techniques could prove useful for such an extension. Furthermore, this extension would also allow for extending the results presented in this paper to hold under a general non-linear response model. Extensions to other base predictors. Finally, the focus of this paper is the base ridge predictor. A natural extension of the current work is to consider kernel ridge regression. Going further, it is of much interest to consider other regularized predictors, such as lasso. Whether optimal subsampled lassoless regression still matches with the optimal lasso is an interesting question. There is already empirical evidence along the lines of Figure 1 for such a connection. The results proved in the current paper make us believe that there is a general story quantifying the effect of implicit regularization by subsampling and that provided by explicit regularization. Whether the general story unfolds as neatly as presented here for ridge regression remains an exciting next question! Acknowledgements We are grateful to Ryan Tibshirani, Alessandro Rinaldo, Yuting Wei, Matey Neykov, Daniel Le Jeune, Shamindra Shrotriya for many helpful conversations surrounding ridge regression, subsampling, and generalized cross-validation. Many thanks are also due to the anonymous reviewers for their encouraging remarks and insightful questions that have informed several new directions for follow-up future work. In particular, special thanks to the reviewer 2t75 for a wonderful review and highlighting other related works on subsampling that have made their way into the manuscript. Adlam, B. and Pennington, J. Understanding double descent requires a fine-grained bias-variance decomposition. Advances in neural information processing systems, 33: 11022 11032, 2020a. Adlam, B. and Pennington, J. The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. In International Conference on Machine Learning. PMLR, 2020b. Allen, D. M. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16(1):125 127, 1974. Arlot, S. and Celisse, A. A survey of cross-validation procedures for model selection. Statistics surveys, 4:40 79, 2010. Bai, Z. and Silverstein, J. W. Spectral Analysis of Large Dimensional Random Matrices. Springer, 2010. Second edition. Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063 30070, 2020. Bartlett, P. L., Montanari, A., and Rakhlin, A. Deep learning: a statistical viewpoint. Acta numerica, 30:87 201, 2021. Belkin, M., Hsu, D., and Xu, J. 
Two models of double descent for weak features. SIAM Journal on Mathematics of Data Science, 2(4):1167 1180, 2020. Bloemendal, A., Knowles, A., Yau, H.-T., and Yin, J. On the principal components of sample covariance matrices. Probability theory and Related Fields, 164(1):459 552, 2016. Breiman, L. Bagging predictors. Machine Learning, 24(2): 123 140, 1996. B uhlmann, P. and Yu, B. Analyzing bagging. The Annals of Statistics, 30(4):927 961, 2002. Buja, A. and Stuetzle, W. Observations on bagging. Statistica Sinica, pp. 323 351, 2006. Craven, P. and Wahba, G. Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31:377 403, 1979. Dobriban, E. and Sheng, Y. Distributed linear regression by averaging. The Annals of Statistics, 49(2):918 943, 2021. Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation Dobriban, E. and Wager, S. High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247 279, 2018. Du, J.-H., Cai, Z., and Roeder, K. Robust probabilistic modeling for single-cell multimodal mosaic integration and imputation via scvaeit. Proceedings of the National Academy of Sciences, 119(49):e2214414119, 2022. d Ascoli, S., Refinetti, M., Biroli, G., and Krzakala, F. Double trouble in double descent: Bias and variance (s) in the lazy regime. In International Conference on Machine Learning, pp. 2280 2290. PMLR, 2020. El Karoui, N. Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: rigorous results. ar Xiv preprint ar Xiv:1311.2445, 2013. El Karoui, N. On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators. Probability Theory and Related Fields, 170(1):95 175, 2018. Erd os, L. and Yau, H.-T. A Dynamical Approach to Random Matrix Theory. American Mathematical Society, 2017. Friedman, J., Hastie, T., and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1):1, 2010. Friedman, J. H. and Hall, P. On bagging and nonlinear estimation. Journal of Statistical Planning and Inference, 137(3):669 683, 2007. Geisser, S. The predictive sample reuse method with applications. Journal of the American statistical Association, 70(350):320 328, 1975. Golub, G. H., Heath, M., and Wahba, G. Generalized crossvalidation as a method for choosing a good ridge parameter. Technometrics, 21(2):215 223, 1979. Greene, E. and Wellner, J. A. Exponential bounds for the hypergeometric distribution. Bernoulli, 23(3):1911, 2017. Grenander, U. and Szeg o, G. Toeplitz Forms and Their Applications. University of California Press, 1958. First edition. Gut, A. Probability: A Graduate Course. Springer, New York, 2005. Gy orfi, L., Kohler, M., Krzyzak, A., and Walk, H. A Distribution-free Theory of Nonparametric Regression. Springer Science & Business Media, 2006. Hall, P. and Samworth, R. J. Properties of bagged nearest neighbour classifiers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(3):363 379, 2005. Hao, Y., Hao, S., Andersen-Nissen, E., Mauck III, W. M., Zheng, S., Butler, A., Lee, M. J., Wilk, A. J., Darby, C., Zager, M., et al. Integrated analysis of multimodal single-cell data. Cell, 2021. Hastie, T. Ridge regularization: An essential concept in data science. Technometrics, 62(4):426 433, 2020. Hastie, T., Tibshirani, R., and Friedman, J. 
The Elements of Statistical Learning. Springer Series in Statistics, 2009. Second edition. Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2):949 986, 2022. Hoerl, A. E. and Kennard, R. W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55 67, 1970a. Hoerl, A. E. and Kennard, R. W. Ridge regression: applications to nonorthogonal problems. Technometrics, 12(1): 69 82, 1970b. Kale, S., Kumar, R., and Vassilvitskii, S. Cross-validation and mean-square stability. In In Proceedings of the Second Symposium on Innovations in Computer Science, 2011. Krogh, A. and Sollich, P. Statistical mechanics of ensemble learning. Physical Review E, 55(1):811, 1997. Kumar, R., Lokshtanov, D., Vassilvitskii, S., and Vattani, A. Near-optimal bounds for cross-validation via loss stability. In International Conference on Machine Learning, 2013. Le Jeune, D., Javadi, H., and Baraniuk, R. The implicit regularization of ordinary least squares ensembles. In International Conference on Artificial Intelligence and Statistics, 2020. Li, K.-C. From Stein s unbiased risk estimates to the method of generalized cross validation. The Annals of Statistics, pp. 1352 1377, 1985. Li, K.-C. Asymptotic optimality of cl and generalized crossvalidation in ridge regression with application to spline smoothing. The Annals of Statistics, 14(3):1101 1112, 1986. Li, K.-C. Asymptotic optimality for cp, cl, cross-validation and generalized cross-validation: Discrete index set. The Annals of Statistics, 15(3):958 975, 1987. Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation Loureiro, B., Gerbelot, C., Refinetti, M., Sicuro, G., and Krzakala, F. Fluctuations, bias, variance & ensemble of learners: Exact asymptotics for convex losses in highdimension. In International Conference on Machine Learning, pp. 14283 14314. PMLR, 2022. Mei, S. and Montanari, A. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667 766, 2022. Miolane, L. and Montanari, A. The distribution of the lasso: Uniform control over sparse balls and adaptive parameter tuning. The Annals of Statistics, 49(4):2313 2335, 2021. Muthukumar, V., Vodrahalli, K., Subramanian, V., and Sahai, A. Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory, 1 (1):67 83, 2020. Obuchi, T. and Kabashima, Y. Cross validation in LASSO and its acceleration. Journal of Statistical Mechanics: Theory and Experiment, 2016. Patil, P., Wei, Y., Rinaldo, A., and Tibshirani, R. Uniform consistency of cross-validation estimators for highdimensional ridge regression. In International Conference on Artificial Intelligence and Statistics. PMLR, 2021. Patil, P., Du, J.-H., and Kuchibhotla, A. K. Bagging in overparameterized learning: Risk characterization and risk monotonization. ar Xiv preprint ar Xiv:2210.11445, 2022a. Patil, P., Kuchibhotla, A. K., Wei, Y., and Rinaldo, A. Mitigating multiple descents: A model-agnostic framework for risk monotonization. ar Xiv preprint ar Xiv:2205.12937, 2022b. Patil, P., Rinaldo, A., and Tibshirani, R. Estimating functionals of the out-of-sample error distribution in highdimensional ridge regression. In International Conference on Artificial Intelligence and Statistics. PMLR, 2022c. Rad, K. R. and Maleki, A. 
A scalable estimate of the outof-sample prediction error via approximate leave-one-out cross-validation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(4):965 996, 2020. Rad, K. R., Zhou, W., and Maleki, A. Error bounds in estimating the out-of-sample prediction error using leaveone-out cross validation in high-dimensions. In International Conference on Artificial Intelligence and Statistics. PMLR, 2020. Rubio, F. and Mestre, X. Spectral convergence for a general class of random matrices. Statistics & probability letters, 81(5):592 602, 2011. Rudin, W. Principles of Mathematical Analysis. Mc Graw Hill New York, 1976. Samworth, R. J. Optimal weighted nearest neighbour classifiers. The Annals of Statistics, 40(5):2733 2763, 2012. Sollich, P. and Krogh, A. Learning with ensembles: How overfitting can be useful. Advances in neural information processing systems, 8, 1995. Stone, M. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B, 36(2):111 133, 1974. Stone, M. Asymptotics for and against cross-validation. Biometrika, 64(1):29 35, 1977. Sur, P., Chen, Y., and Cand es, E. J. The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square. Probability Theory and Related Fields, 175(1):487 558, 2019. Thrampoulidis, C., Oymak, S., and Hassibi, B. Regularized linear regression: A precise analysis of the estimation error. In Conference on Learning Theory, pp. 1683 1709. PMLR, 2015. Thrampoulidis, C., Abbasi, E., and Hassibi, B. Precise error analysis of regularized M-estimators in high dimensions. IEEE Transactions on Information Theory, 64(8):5592 5628, 2018. Wang, S., Zhou, W., Lu, H., Maleki, A., and Mirrokni, V. Approximate leave-one-out for fast parameter tuning in high dimensions. ar Xiv preprint ar Xiv:1807.02694, 2018. Wasserman, L. Olive Nonparametric Statistics. Springer, 2006. Wei, A., Hu, W., and Steinhardt, J. More than a toy: Random matrix models predict how real-world neural representations generalize. ar Xiv preprint ar Xiv:2203.06176, 2022. Xu, J., Maleki, A., and Rad, K. R. Consistent risk estimation in high-dimensional linear regression. ar Xiv preprint ar Xiv:1902.01753, 2019. Zhang, Y. and Yang, Y. Cross-validation for selecting a model selection procedure. Journal of Econometrics, 187 (1):95 112, 2015. Title in Wasserman (2006) may seem like a typo, but it is not. If you have a moment to chuckle, peek at column 2 of page 266! Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation This serves as an appendix to the paper Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation. Below we provide an outline for the appendix along with a summary of the notation used in the main paper and the appendix. Organization The content of the appendix is organized as follows. Appendix A presents proofs of results in Section 2. Appendix A.1 relates the ridge estimator in the full ensemble to the M-ensemble ridge estimator as the ensemble size M tends to infinity. Specifically, we provide proof of the fact [mentioned in Section 2] that the estimator eβλ k, as defined in (3) is almost surely equivalent to letting the ensemble size M for the estimator eβλ k,M as defined in (2). The main ingredients are results in Appendix G. Appendix A.2 gathers known results from Patil et al. (2022a) in the form of Lemma A.2 that characterize the asymptotic prediction risks of ridge ensembles used in the remaining sections. 
Appendix A.3 proves Theorem 2.3. The main ingredients are Lemma A.2 and results in Appendix F. See Figure 6. Lemma A.2 Appendix F Theorem 2.3 Figure 6. Schematic for the proof of Theorem 2.3. Appendix B presents proofs of results in Section 3. Appendix B.1 proves Theorem 3.1. The main ingredients are Lemma B.1 (proved in Appendix B) and results in Appendix G. The main ingredients that prove Lemma B.1 are Proposition 3.6 (proved in Appendix E) and Lemma A.2. See Figure 7. Theorem 3.1 Lemma B.1 Appendix G Proposition 3.6 Lemma A.2 Figure 7. Schematic for the proof of Theorem 3.1. Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation Appendix B.2 proves Corollary 3.2. The main ingredient in Theorem 3.1. Appendix B.3 proves Proposition 3.3. The main ingredients are Lemma 3.4 and results in Appendix G. See Figure 8. Proposition 3.3 Lemma 3.4 Appendix G Figure 8. Schematic for the proof of Proposition 3.3. Appendix C proves Lemma 3.4. The main ingredient is Lemma C.1 (proved in Appendix C). See Figure 9. Figure 9. Schematic for the proof of Lemma 3.4. Appendix D proves Lemma 3.5. The main ingredients are a series of lemmas, Lemmas D.1 to D.3 (proved in Appendices D.1 to D.3). These lemmas provide structural decompositions for the ensemble train error and obtain the limiting behaviors of the terms in the decompositions. See Figure 10. Appendix D.1 proves Lemma D.1 that shows a certain decomposition of the ensemble train error into in-sample train and out-of-sample test error components. Appendix D.2 proves Lemma D.2 on the convergence of out-of-sample test error components. The main ingredient is Lemma A.2. Appendix D.3 proves Lemma D.3 on the convergence of in-sample train error components. The main ingredients are Lemmas D.4 and D.5 (proved in Appendix D.4) and Lemmas D.6 and D.7 (proved in Appendix D.5). Appendix D.4 proves Lemmas D.4 and D.5 on component concentrations of certain cross and variance terms arising in the aforementioned decompositions. The main ingredients are results in Appendix G. Appendix D.5 proves Lemmas D.6 and D.7 on component deterministic approximations for the concentrated bias and variance functionals in the steps above. The main ingredients are results in Appendix F. Appendix E proves Proposition 3.6. The main ingredients are components proved in Lemma 3.4 and Lemma 3.5, Propositions E.1 and E.2 that handle certain boundary cases not covered by Lemma 3.4 and Lemma 3.5 (proved in Appendices E.1 and E.2), and results in Appendix G. See Figure 11. Appendix E.1 proves Proposition E.1 that considers the boundary case as the subsampling ratio ϕs for ridge regression (λ > 0). Proposition E.2 proves Proposition E.2 that handles the limiting case of ridgeless regression (λ = 0), by justifying and taking a suitable limit as λ 0+ of the corresponding results for ridge regression. Appendix F summarizes auxiliary asymptotic equivalency results used in the proofs throughout. Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation Lemma D.1 Lemma D.2 Lemma D.3 Lemma A.2 Lemmas D.4 and D.5 Lemmas D.6 and D.7 Appendix G Appendix F Figure 10. Schematic for the proof of Lemma 3.5. Proposition 3.6 Lemma 3.4 Lemma 3.5 Propositions E.1 and E.2 Appendix G Figure 11. Schematic for the proof of Proposition 3.6. Appendix F.1 provides background on the notion of asymptotic matrix equivalents and various calculus rules that such notion of equivalency obeys. 
Appendix F.2 gathers some known asymptotic matrix equivalents and derives some novel asymptotic matrix equivalents that arise in our analysis. Appendix F.3 gathers various known analytic properties of certain fixed-point equations and proves some additional properties that arise in our analysis. Appendix G collects several helper concentration results used in the proofs throughout. Appendix G.1 provides lemmas deriving the asymptotic proportion of shared observations when subsampling. Appendix G.2 provides lemmas establishing concentrations for linear and quadratic forms of random vectors. Appendix G.3 provides lemmas for lifting original convergences to converges of Ce saro-type mean and max for triangular arrays. Appendix H discusses the bias correction of GCV for finite ensembles [mentioned in Section 5]. Appendix I describes additional numerical details for experiments [mentioned in Section 3]. Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation An overview of some general notation used in the main paper and the appendix is as follows. 1. General notation: We denote scalars in non-bold lower or upper case (e.g., n, λ, C), vectors in bold lower case (e.g., x, β), and matrices in bold upper case (e.g., X). For a real number x, (x)+ denotes its positive part, x its floor, and x its ceiling. For a vector β, β 2 denotes its ℓ2 norm. For a pair of vectors v and w, v, w denotes their inner product. For an event A, 1A denotes the associated indicator random variable. We denote convergence in probability by p , almost sure convergence by a.s. , and convergence in distribution by d . 2. Set notation: We denote sets using calligraphic letters (e.g., D), and use blackboard letters to denote some special sets: N denotes the set of positive integers, R denotes the set of real numbers, R 0 denotes the set of non-negative real numbers, R>0 denotes the set of positive real numbers, C denotes the set of complex numbers, C+ denotes the set of complex numbers with positive imaginary part, and C denotes the set of complex numbers with negative imaginary part. For a natural number n, we use [n] to denote the set {1, . . . , n}. 3. Matrix notation: For a matrix X Rn p, X Rp n denotes its transpose, and X+ Rp n denote its Moore Penrose inverse. For a square matrix A Rp p, tr[A] denotes its trace, and A 1 Rp p denotes its inverse, provided it is invertible. For a positive semidefinite matrix Σ, Σ1/2 denotes its principal square root. A p p identity matrix is denoted Ip, or simply by I, when it is clear from the context. For a real matrix X, its operator norm (or spectral norm) with respect to ℓ2 vector norm is denoted by X op, and its trace norm (or nuclear norm) is denoted by X tr (recall that X tr = tr[(X X)1/2]). For a positive semidefinite matrix A Rp p with eigenvalue decomposition A = V RV 1 for an orthonormal matrix V Rp p and a diagonal matrix R Rp p with non-negative entries, and a function f : R 0 R 0, we denote by f(A) the p p positive semidefinite matrix V f(R)V 1. Here, f(R) is a p p diagonal matrix obtained by applying the function f to each diagonal entry of R. For symmetric matrices A and B, A B denotes the Loewner ordering. For sequences of matrices An and Bn, An Bn denotes a certain notion of asymptotic equivalence (see Appendix F). Finally, in what follows, we will prove the results for n, k, p being a sequence of integers {nm} m=1, {km} m=1, {pm} m=1. One can also view k and p as sequences kn and pn that are indexed by n. 
For simplicity, we drop the subscript when it is clear from the context. A. Proofs of results in Section 2 A.1. Full-ensemble versus limiting M-ensemble Lemma A.1 (Almost sure equivalence of full-ensemble and limiting M-ensemble). For k, n fixed, for the ensemble estimator defined in (2), it holds that eβλ k,M(Dn; {Iℓ}M ℓ=1) a.s. E[ bβλ k(DI) | Dn] = 1 |Ik| Proof of Lemma A.1. Note that for k, n fixed, the cardinality of Ik is n k . Thus, we have eβλ k,M(Dn; {Iℓ}M ℓ=1) = X M bβλ k(DI) for random variables n M,i s. Since when sampling with replacement n M,I Binomial(M, 1/ n k ) with mean M/ n k , from the strong law of large numbers, we have that as M , a.s. 1 n k , I Ik. (17) Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation For sampling without replacement, n M,I Hypergeometric(M, 1, n k ) for M n k (see Definition G.1) with mean M/ n k . When M = n k , n M,I/M = 1/ n k . In both cases, we have eβλ k,M(Dn; {Iℓ} ℓ=1) := lim M eβλ k,M(Dn; {Iℓ}M ℓ=1) a.s. = 1 n k X which concludes the proof. A.2. Risk characterization of ridge ensembles In the study of ridge ensembles under proportional asymptotics, a key quantity that appears is the solution of a fixed-point equation. For any λ > 0 and θ > 0, define v( λ; θ) as the unique nonnegative solution to the fixed-point equation v( λ; θ) 1 = λ + θ Z r(1 + v( λ; θ)r) 1 d H(r). (18) For λ = 0, define v(0; θ) := limλ 0+ v( λ; θ) for θ > 1 and + otherwise. Such a fixed-point equation has appeared in the literature before. See, e.g., Dobriban & Wager (2018); Hastie et al. (2022); Mei & Montanari (2022) in the context of ridge regression; and more generally, for other M-estimators, see, e.g., El Karoui (2013; 2018); Thrampoulidis et al. (2015; 2018); Sur et al. (2019); Miolane & Montanari (2021), among others. The fact that the fixed-point equation (18) has a unique nonnegative solution follows from Patil et al. (2022b, Lemma S.6.14). We then define the nonnegative constants ev( λ; ϑ, θ), and ec( λ; θ) via the following equations: ev( λ; ϑ, θ) = ϑ Z r2(1 + v( λ; θ)r) 2 d H(r) v( λ; θ) 2 ϑ Z r2(1 + v( λ; θ)r) 2 d H(r) , ec( λ; θ) = Z r(1 + v( λ; θ))r) 2 d G(r). (19) Lemma A.2 (Risk characterization of ridge ensembles, adapted from Patil et al. (2022a)). Suppose Assumptions 2.1-2.2 hold for the dataset Dn. Then, as k, n, p such that p/n ϕ (0, ) and p/k ϕs [ϕ, ] (and ϕs = 1 if λ = 0), there exist deterministic functions Rλ M(ϕ, ϕs) for M N, such that for I1, . . . , IM SRS Ik, sup M N |R( ef M,Ik; Dn, {Iℓ}M ℓ=1) Rλ M(ϕ, ϕs)| p 0. (20) Furthermore, the function Rλ M(ϕ, ϕs) decomposes as Rλ M(ϕ, ϕs) = σ2 + Bλ M(ϕ, ϕs) + V λ M(ϕ, ϕs), (21) where the bias and variance terms are given by Bλ M(ϕ, ϕs) = M 1Bλ(ϕs, ϕs) + (1 M 1)Bλ(ϕ, ϕs), (22) V λ M(ϕ, ϕs) = M 1Vλ(ϕs, ϕs) + (1 M 1)Vλ(ϕ, ϕs), (23) and the functions Bλ( , ) and Vλ( , ) are defined as Bλ(ϑ, θ) = ρ2(1 + ev( λ; ϑ, θ))ec( λ; θ), Vλ(ϑ, θ) = σ2ev( λ; ϑ, θ), θ (0, ], ϑ θ. (24) A.3. Proof of Theorem 2.3 Proof of Theorem 2.3. Define ϕ s(ϕ) := argminϕs ϕ Rλ, (ϕ, ϕs) and λ (ϕ) := argminλ 0 Rλ 1 (ϕ, ϕ). We will write ϕ s and λ for simplicity and split the proof into different cases. Part (1) Case of SNR > 0 (ρ2 > 0, σ2 > 0): From Patil et al. (2022a, Proposition 5.7) we have that ϕ s (ϕ 1, ). From Lemma F.11 (1), the function ϕs 7 v(0; ϕs) is strictly decreasing over ϕs [1, ] with range v(0; ϕs 1) = ( v(0; ϕs), ϕ (1, ) limϕs 1+ v(0; ϕs) = + , ϕ (0, 1] , v(0; + ) := lim ϕs + v(0; ϕs) = 0. 
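To make the fixed-point quantity v(−λ; θ) from (18) and the monotonicity facts used in this argument concrete, here is a minimal numerical sketch. The solver below, its bisection scheme, and the discrete test spectra are our own illustrative choices rather than the paper's implementation; in particular, the printed examples assume an isotropic spectrum H = δ1, for which v(0; θ) = 1/(θ − 1) when θ > 1.

```python
import numpy as np

def v_fixed_point(lam, theta, spectrum, weights=None, tol=1e-12):
    """Solve 1 / v = lam + theta * E_H[ r / (1 + v r) ] for v = v(-lam; theta).

    `spectrum` holds eigenvalues r of Sigma (the support of H) and `weights`
    their probabilities; both are illustrative inputs for this sketch.
    """
    r = np.asarray(spectrum, dtype=float)
    w = np.full_like(r, 1.0 / r.size) if weights is None else np.asarray(weights, float)

    def f(v):  # f(v) > 0 below the root and f(v) < 0 above it (cf. Lemma F.9)
        return 1.0 / v - lam - theta * np.sum(w * r / (1.0 + v * r))

    lo, hi = 1e-12, 1.0
    while f(hi) > 0:         # expand the bracket until the sign flips
        hi *= 2.0
        if hi > 1e12:        # v(0; theta) = +infinity when lam = 0 and theta <= 1
            return np.inf
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

# Monotonicity used in the proof, under an isotropic H = delta_1:
# phi_s -> v(0; phi_s) is strictly decreasing on (1, infinity), and
# lam -> v(-lam; phi) is strictly decreasing on [0, infinity).
print([round(v_fixed_point(0.0, t, [1.0]), 4) for t in (1.5, 2.0, 4.0, 10.0)])
print([round(v_fixed_point(l, 0.5, [1.0]), 4) for l in (0.1, 0.5, 1.0, 5.0)])
# For H = delta_1 and lam = 0, the closed form is v(0; theta) = 1 / (theta - 1).
```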
Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation From Lemma F.12 (1), the function λ 7 v( λ; ϕ) is strictly decreasing over λ [0, ] with range v(0; ϕ 1) = ( v(0; ϕ), ϕ (1, ) limλ 0+ v( λ; ϕ) = + , ϕ (0, 1] , v( ; ϕ) := lim λ + v( λ; ϕ) = 0. By the intermediate value theorem, there exists unique λ0 (0, ) such that v( λ0; ϕ) = v(0; ϕ s). Then we also have ec( λ0; ϕ) = ec(0; ϕ s) and ev( λ0; ϕ, ϕ) = ev(0; ϕ s). Substituting this into the optimal ensemble risk, we have min ϕs ϕ R0 (ϕ, ϕs) = R0 (ϕ, ϕ s) = (σ2 + ρ2ec(0; ϕ s))(1 + ev(0; ϕ, ϕ s)) = (σ2 + ρ2ec( λ0; ϕ))(1 + ev( λ0; ϕ, ϕ)) = Rλ0 1 (ϕ, ϕ) min λ 0 Rλ 1 (ϕ, ϕ). On the other hand, there exists unique ϕ0 [1, ) such that v( λ ; ϕ) = v(0; ϕ0), and thus, we have min λ 0 Rλ 1 (ϕ, ϕ) = Rλ 1 (ϕ, ϕ) = (σ2 + ρ2ec( λ ; ϕ))(1 + ev( λ ; ϕ, ϕ)) = (σ2 + ρ2ec(0; ϕ0))(1 + ev(0; ϕ, ϕ0)) = R0 (ϕ, ϕ0) min ϕs ϕ R0 (ϕ, ϕs). Combining the above two inequalities, we have that minϕs ϕ R0 (ϕ, ϕs) = minλ 0 Rλ 1 (ϕ, ϕ). Part (2) Case of SNR = 0 (ρ2 = 0, σ2 > 0): From Patil et al. (2022a, Proposition 5.7) we have that ϕ s = + , which implies that v(0; ϕ s) = 0. Then, from Lemma F.11 (1) we have v(0; + ) := limϕs + v(0; ϕs) = 0, lim ϕs + ev(0; ϕ, + ) = lim ϕs + ϕ Z r2(1 + v(0; ϕs)r) 2 d H(r) v(0; ϕs) 2 ϕ Z r2(1 + v(0; ϕs)r) 2 d H(r) ϕ Z (v(0; ϕs)r)2(1 + v(0; ϕs)r) 2 d H(r) 1 ϕ Z (v(0; ϕs)r)2(1 + v(0; ϕs)r) 2 d H(r) = ϕ Z (v(0; + )r)2(1 + v(0; + )r) 2 d H(r) 1 ϕ Z (v(0; + )r)2(1 + v(0; + )r) 2 d H(r) min ϕs ϕ R0 (ϕ, ϕs) = R0 (ϕ, ) = σ2(1 + ev(0; ϕ, + )) = σ2. On the other hand, min λ 0 Rλ 1 (ϕ, ϕ) = Rλ 1 (ϕ, ϕ) = σ2ev( λ ; ϕ, ϕ) σ2 where the equality holds when λ = + because ev( λ ; ϕ, ϕ) 0 from Lemma F.10 (4). Thus, the optimal parameters to the two optimization problems are given by ϕ s = λ = + , with v(0; ϕ s) = v( λ ; ϕ) = 0. Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation Part (3) Case of SNR = (ρ2 > 0, σ2 = 0): When ϕ 1, from Patil et al. (2022a, Proposition 5.7) we have that any ϕ s [ϕ, 1] minimizes minϕs ϕ R0 (ϕ, ϕs) and the minimum is 0, which is also the smallest possible prediction risk. As Rλ 1 (ϕ, ϕ) = 0 for λ = 0, the conclusion still holds. When ϕ (1, ), we know that ϕ s (1, ) from Patil et al. (2022a, Proposition 5.7). Analogous to Part (1), we have that minϕs ϕ R0 (ϕ, ϕs) = minλ 0 Rλ 1 (ϕ, ϕ). Part (4) Relationship between ϕ and λ : Each pair of the optimal solution (ϕ , λ ) satisfies that v(0; ϕ s) = v( λ ; ϕ) =: v , where v(0; ϕ s) and v( λ ; ϕ) are non-negative solutions to the following fixed-point equations: 1 v(0; ϕ s) = ϕ s Z r 1 + v(0; ϕ s)r d H(r), 1 v( λ ; ϕ) = λ + ϕ Z r 1 + v( λ ; ϕ)r d H(r) From the previous parts, if SNR = 0, then λ = ϕ s = + and v = 0. Otherwise, we have Z r 1 + v r d H(r) = λ + ϕ Z r 1 + v r d H(r), which yields that λ = (ϕ s ϕ) Z r 1 + v r d H(r). Part (5) Individual and joint optimization: Note that from Lemma F.10 (2) and Lemma F.11 (1), the function ϕs 7 v( λ; ϕs) is decreasing with the range [0, λ 1] for λ [0, ]. Then the function (λ, ϕs) 7 v( λ; ϕs) has the range [0, + ], which is the same as v(0; ϕs). It follows that minϕs ϕ R0 (ϕ, ϕs) = minϕs ϕ,λ 0 Rλ 1 (ϕ, ϕ) by the analogous argument in Part (1)-(3). When λ = 0, the curve reduces to a singleton, which is a trivial case. When λ > 0, for any t [0, λ ], let λ = λ t and ϕs = ϕ + t(ϕ s ϕ)/λ . Note that 1 v(λ; ϕs) = λ + ϕs Z r 1 + v r d H(r) = λ t + (ϕ + t(ϕ s ϕ)/λ ) Z r 1 + v(0; ϕ s)r d H(r) = λ + ϕ Z r 1 + v(0; ϕ s)r d H(r) + t λ (ϕ s ϕ λ) Z r 1 + v(0; ϕ s)r d H(r) which implies that v(λ; ϕs) = v . 
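As a quick numerical check of this invariance (a sketch only, assuming an isotropic H = δ1 so that v(−λ; θ) has a closed form; the values of ϕ and ϕs below are arbitrary illustrative choices, with λ defined from ϕs through the relation in Part (4), and the invariance along the path holds for any such matched pair, as formalized in Part (6) below):

```python
import numpy as np

def v_isotropic(lam, theta):
    """v(-lam; theta) when H = delta_1: root of lam*v^2 + (lam + theta - 1)*v - 1 = 0."""
    if lam == 0.0:
        return 1.0 / (theta - 1.0) if theta > 1.0 else np.inf
    b = lam + theta - 1.0
    return (-b + np.sqrt(b * b + 4.0 * lam)) / (2.0 * lam)

# Illustrative endpoints (not taken from the paper): a data aspect ratio phi, a
# target subsample ratio phi_s_star, and the matched ridge penalty from Part (4),
# lam_star = (phi_s_star - phi) * int r / (1 + v* r) dH(r) = (phi_s_star - phi) / (1 + v*).
phi, phi_s_star = 0.3, 2.5
v_star = v_isotropic(0.0, phi_s_star)
lam_star = (phi_s_star - phi) / (1.0 + v_star)

# Along the path (lam, phi_s) = (lam_star - t, phi + t * (phi_s_star - phi) / lam_star),
# the fixed-point value should remain v(-lam; phi_s) = v_star for every t in [0, lam_star].
for t in np.linspace(0.0, lam_star, 6):
    lam = lam_star - t
    phi_s = phi + t * (phi_s_star - phi) / lam_star
    print(f"t={t:.3f}  lam={lam:.3f}  phi_s={phi_s:.3f}  v={v_isotropic(lam, phi_s):.6f}")
# Every printed v equals v_star = 1 / (phi_s_star - 1).
```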
Then, we have ec( λ; ϕs) = Z r (1 + v( λ; ϕs))r)2 d G(r) = ec( λ ; ϕ) = ec(0; ϕ s). ev( λ; ϕ, ϕs) = ϕ Z r2(1 + v( λ; ϕs)r) 2 d H(r) v( λ; ϕs) 2 ϕ Z r2(1 + v( λ; ϕs)r) 2 d H(r) = ev( λ ; ϕ, ϕ) = ev(0; ϕ, ϕ s). It then follows that Rλ (ϕ, ϕs) = Rλ (ϕ, ϕ) = R0 (ϕ, ϕ s), which completes the proof for Theorem 2.3. Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation Part (6) Extension of risk equivalence: Here we extend the results in Theorem 2.3 to a more general equivalence of (λ, ϕs), as indicated following Theorem 2.3 towards the end of Section 2.2. For any ϕs [ϕ, + ], let λ = ( ϕs ϕ) R r(1 + v(0; ϕs)r) 1 d H(r) 0. Then, we have 1 v(0; ϕs) = ϕs Z r 1 + v(0; ϕs)r d H(r) = λ + ϕ Z r 1 + v(0; ϕs)r d H(r), It follows that v( λ; ϕ) = v(0; ϕs), and consequently, R0 (ϕ, ϕs) = R λ (ϕ, ϕ). B. Proofs of results in Section 3 B.1. Proof of Theorem 3.1 To prove Theorem 3.1, we first prove pointwise convergence (over k and λ) as stated in Lemma B.1, which is based on Proposition 3.6 proved in Appendix E. Lemma B.1 (Consistency of GCV in full ensemble). Under Assumptions 2.1-2.2, as k, n, p , p/n ϕ (0, ) and p/k ϕs [ϕ, + ], for λ 0, it holds that |gcvλ k Rλ k, | a.s. 0. (25) Proof of Lemma B.1. We will first show that proof for λ > 0 and ϕs < , and then extend the results to these boundary cases. Recall that from Proposition 3.6, we have G λ (ϕ, ϕs) = ϕ2s T λ 2 (ϕ, ϕs) + 2(ϕs ϕ)2 ϕ2s Rλ 2 (ϕ, ϕs) ϕ ϕs T λ 1 (ϕ, ϕs) ϕs ϕ ϕs Rλ 1 (ϕ, ϕs) Dλ (ϕ, ϕs) . We next simplify the expression of the numerator: ϕ2s T λ 2 (ϕ, ϕs) + 2(ϕs ϕ)2 ϕ2s Rλ 2 (ϕ, ϕs) ϕ ϕs T λ 1 (ϕ, ϕs) ϕs ϕ ϕs Rλ 1 (ϕ, ϕs) ϕ2s Rλ 1 (ϕ, ϕs) + Dλ(ϕs, ϕs) ϕ ϕs Rλ 1 (ϕ, ϕs) + 2ϕ(ϕs ϕ) 1 λv( λ; ϕs) + ϕ2 ϕ2s Rλ 2 (ϕ, ϕs) ϕ ϕs Dλ(ϕs, ϕs)Rλ 1 (ϕ, ϕs) ϕs ϕ ϕs Rλ 1 (ϕ, ϕs) ϕ2s λv( λ; ϕs) + ϕ2 ϕ2s λ2v( λ; ϕs)2 Rλ (ϕ, ϕs) + 2(ϕs ϕ)2 ϕ2s (Rλ 2 (ϕ, ϕs) Rλ 1 (ϕ, ϕs)) ϕ2s + 2ϕ(ϕs ϕ) ϕ2s λv( λ; ϕs) + ϕ2 ϕ2s λ2v( λ; ϕs)2 Rλ (ϕ, ϕs) = Dλ (ϕ, ϕs)Rλ (ϕ, ϕs). Then, it follows that G λ (ϕ, ϕs) = Rλ (ϕ, ϕs). From Lemma A.2 and Proposition 3.6, we have that gcvλ k a.s. G λ (ϕ, ϕs) and Rλ k, a.s. Rλ (ϕ, ϕs), which finishes the proof. We are now ready to prove Theorem 3.1. Proof of Theorem 3.1. Let Rn,k = gcvλ k, Rλ k, for n N and k Kn. From Lemma B.1 we have that Rn,k a.s. 0 as k, n, p , p/n ϕ (0, ) and p/k ϕs [ϕ, ]. Here we view k and p as kn and pn that are indexed by n. Then from Lemma G.5 (1) the conclusion follows. Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation B.2. Proof of Corollary 3.2 Proof of Corollary 3.2. From Theorem 3.1, we have max k Kn |gcvλ k, Rλ k, | a.s. 0. This implies that min k Kn gcv0 k = min k Kn R0 (p/n, p/k) 1 + gcv0 k R0 (p/n, p/k) R0 (p/n, p/k) min k Kn R0 (p/n, p/k) 1 max k Kn gcv0 k R0 (p/n, p/k) R0 (p/n, p/k) min k Kn R0 (p/n, p/k) 1 1 σ2 max k Kn |gcv0 k R0 (p/n, p/k)| a.s. inf ϕs ϕ R0 (ϕ, ϕs) = inf ϕs ϕ,λ 0 Rλ (ϕ, ϕs), where the last equality is from Theorem 2.3. This finishes the proof. B.3. Proof of Proposition 3.3 Proof of Proposition 3.3. From the proof of Lemma 3.4, we have 1 k tr (MmΣm) a.s. (1 λv( λ; ϕs)). Then, as k, n, p , p/n ϕ and p/k ϕs, we have 1 1 |I1 I2| 1 2 m=1 tr (MmΣm) 1 k |I1 I2| 1 k 1 2 m=1 tr (MmΣm) a.s. 1 ϕs 2ϕs ϕ(1 λv( λ; ϕs)) 2 =: Dλ 2 (ϕ, ϕs). where the convergence of k/|I1 I2| is from Lemma G.2. It then follows that gcvλ k,2 = T λ k,2 Dλ k,2 a.s. G λ 2 (ϕ, ϕs) := T λ 2 Dλ 2 , where T λ 2 defined in (32) has the following expression: T λ 2 (ϕ, ϕs) = 1 2 ϕs ϕ 2ϕs ϕRλ 1 (ϕ, ϕs) + 1 2Dλ(ϕs, ϕs) ϕs 2ϕs ϕRλ 1 (ϕ, ϕs) + 2(ϕs ϕ) 2ϕs ϕ 1 λv( λ; ϕs) + ϕ 2ϕs ϕ Rλ (ϕ, ϕs) . 
On the other hand, we have Rλ 2 (ϕ, ϕs) = 1 2Rλ 1 (ϕ, ϕs) + 1 2Rλ (ϕ, ϕs). Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation Note that Rλ 1 (ϕ, ϕs) > Rλ (ϕ, ϕs) > σ2 when ϕ < ϕs < because ρ2 > 0. When λ = 0 and ϕs > 1, we have λv( λ; ϕs) = 0 and G 0 2 (ϕ, ϕs) = ϕs ϕ 2ϕs ϕRλ 1 (ϕ, ϕs) + 2(ϕs ϕ) 2ϕs ϕ Rλ (ϕ, ϕs) 2 1 ϕs 2ϕs ϕ ϕs ϕ Rλ 1 (ϕ, ϕs) + 2Rλ (ϕ, ϕs) . It follows that G 0 2 (ϕ, ϕs) Rλ 2 (ϕ, ϕs) = 1 2(ϕs ϕ) ϕs Rλ 1 (ϕ, ϕs) + (3ϕs ϕ)Rλ (ϕ, ϕs) =: c, and gcvλ k,2 Rλ 2 (ϕ, ϕs) p c > 0, which completes the proof. C. Proof of Lemma 3.4 (convergence of the GCV denominator functional) Proof of Lemma 3.4. By the definition, the smooth matrix for M = is given by Sλ, ({DIℓ}M ℓ=1) := lim M 1 M ℓ=1 Sλ(DIℓ). For the denominator, note that for any fixed n N, as M , I1:M a.s. [n]. Then from Lemma C.1 (stated and proved below), 1 |I1:M| tr(Sλ(DIℓ)) a.s. 1 n tr XMℓ X Lℓ n tr (MℓΣℓ) a.s. ϕ ϕs (1 λv( λ; ϕs)). By continuous mapping theorem, we have 1 1 n tr(Sλ, ) 2 a.s. Dλ (ϕ, ϕs) := ϕs ϕ ϕs λv( λ; ϕs) 2 . (26) Lemma C.1 (Deterministic approximation of the denominator functional). Under Assumption 2.1, for all m [M] and Im Ik, let bΣm = X Lm X/k, Lm Rn n be a diagonal matrix with (Lm)ll = 1 if l Im and 0 otherwise, and Mm = (X Lm X/k + λIp) 1. Then, it holds that for all m [M] and Im Ik, 1 n tr Mm bΣm a.s. ϕ ϕs (1 λv( λ; ϕs)), as n, k, p , p/n ϕ (0, ), and p/k ϕs [ϕ, ), where the nonnegative constant ev(λ; ϕ, ϕs) is as defined in (19). Proof of Lemma C.1. Note that Mm bΣm = Ip λMm. From Corollary F.5, we have that λMm (v( λ; ϕs)Σ + Ip) 1. Then by Lemma F.3 (4), it follows that 1 n tr (MmΣm) a.s. ϕ lim p 1 p tr Ip (v( λ; ϕs)Σ + Ip) 1 1 Z 1 1 + v( λ; ϕs)r d Hp(r) = ϕ Z v( λ; ϕs)r 1 + v( λ; ϕs)r d H(r) ϕs (1 λv( λ; ϕs)) where in the second last line, we used the fact that Hp and H have compact supports and Assumption 2.2 and the last equality is due to the definition of v( λ; ϕs) in (18). Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation D. Proof of Lemma 3.5 (convergence of the GCV numerator functional) Proof of Lemma 3.5. For any m [M], let Im be a sample from Ik, and LIm Rn n be a diagonal matrix with (LIm)ll = 1 if l Im and 0 otherwise. The ingredient estimator takes the form: eβλ k,M({DIℓ}M ℓ=1) = 1 m=1 bβλ(DIm) m=1 (X LIm X/k + λIp) 1(X LImy/k) k β0 + X LIm X We will write eβλ,M = eβλ k,M and Lm = LIm for simplicity when they are clear from the context. The set operation will be propagated to such notations, e.g., Lm l = LIm Il, Lm l = LIm Il, Lm\l = LIm\Il, etc. Let Mm = (X LIm X/k + λIp) 1 for m [M], we have m=1 (Ip λMm)β0 + 1 m=1 Mm(X Lm/k)ϵ. (27) The proof follows by combing the squared error decomposition in Lemma D.1, with the component convergence of test errors in Lemma D.2 and of train errors in Lemma D.3. To prove Lemma D.3, we further make of the component concentration results presented in Appendices D.4 and D.5, and component deterministic approximation results presented in Appendix D.5. D.1. Decomposition of the mean squared error (Lemma D.1) Lemma D.1 (Decomposition of the mean squared error for the M-ensemble estimator). For a dataset Dn, let X Rn p and y Rn be the design matrix and response vector. Let LI Rn n be a diagonal matrix with (LI)ll = 1 if l I and 0 otherwise. 
Then the mean squared error evaluated on Dn decomposes as y X eβλ k 2 2 = E h Errtrain( bβλ k(DI)) + Errtest( bβλ k({DI))}) | Dn i + 2E h Errtrain( eβλ k,2({DI, DJ})) + Errtest( eβλ k,2({DI, DJ})) | Dn i (28) where the training and test errors are defined by Errtrain( bβλ k(DIℓ)) = LIℓ(y X bβλ k(DIℓ)) 2 2 Errtest( bβλ k(DIℓ)) = LIc ℓ(y X bβλ k(DIℓ)) 2 2 Errtrain( eβλ k,2({DIm, DIℓ})) = LIm Lℓ(y X eβλ k,2({DIm, DIℓ})) 2 2 Errtest( eβλ k,2({DIm, DIℓ})) = LIcm Lc ℓ(y X eβλ k,2({DIm, DIℓ})) 2 2. From Lemma D.1, the numerator of the GCV estimate for a M-ensemble estimator decomposes into a linear combination of the training and test error of all possible 1-ensemble and 2-ensemble estimators. Then the asymptotics of the numerator can be obtained, if we can show that the limits of all components exist and their linear combination remains invariable when M goes off to infinity. Proof of Lemma D.1. We first decompose the training error into the linear combination of the mean squared errors (evaluated on Dn) for 1-ensemble and 2-ensemble estimators: 1 n y X eβλ k,M 2 2 Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation ℓ=1 (yi xi bβλ k(DIℓ) ℓ=1 (yi xi bβλ k(DIℓ))2 + 1 m,ℓ [M] i =j (yi xi bβλ k(DIm))(yi xi bβλ k(DIℓ)) ℓ=1 y X bβλ k(DIℓ) 2 2 m,ℓ [M] i =j 1 2[4(yi xi eβλ,2({DIm, DIℓ}))2 (yi xi bβλ k(DIm))2 (yi xi bβλ k(DIℓ))2] 1 n y X bβλ k(DIℓ) 2 2 + 2 M 2 X m,ℓ [M] i =j 1 n y X eβλ,2({DIm, DIℓ}) 2 2. Next, we further decompose the MSE into training and test errors for 1-ensemble and 2-ensemble estimators: 1 n y X bβλ k(DIℓ) 2 2 = 1 n LIℓ(y X bβλ k(DIℓ)) 2 2 + 1 n LIc ℓ(y X bβλ k(DIℓ)) 2 2, 1 n y X eβλ,2({DIm, DIℓ}) 2 2 = 1 n LIm Lℓ(y X eβλ,2({DIm, DIℓ})) 2 2 + 1 n LIc m Lc ℓ(y X eβλ,2({DIm, DIℓ})) 2 2 The conclusion then readily follows. D.2. Convergence of test errors (Lemma D.2) Lemma D.2 (Convergence of test errors). Under Assumptions 2.1-2.2, for the test error defined in (29) with I1, I2 SRS Ik, we have that as k, n, p , p/n ϕ (0, ) and p/k ϕs [ϕ, + ], Errtest( bβλ k(DIℓ)) n k a.s. Rλ 1 (ϕ, ϕs) (30) Errtest( eβλ k,2({DIm, DIℓ})) |Icm Ic ℓ| a.s. Rλ 2 (ϕ, ϕs), (31) where the deterministic functions RM is defined in Lemma A.2. Proof of Lemma D.2. From the strong law of large numbers, we have 1 k Errtest( bβλ k(DIℓ)) a.s. E h (y0 x 0 bβλ k(DI1))2 Dn i 1 |Icm Ic ℓ|Errtest( bβλ k({DIm, DIℓ})) a.s. E h (y0 x 0 bβλ k({DIm, DIℓ}))2 Dn i . From Lemma A.2 (Patil et al., 2022a, Theorem 4.1), the condition prediction risks converge in the sense that E h (y0 x 0 bβλ k(DI1))2 Dn i a.s. R1(ϕ, ϕs) E h (y0 x 0 eβλ k,2({DIm, DIℓ}))2 Dn i a.s. R2(ϕ, ϕs), and the conclusions follow. Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation D.3. Convergence of train errors (Lemma D.3) Lemma D.3 (Convergence of train errors). Under Assumptions 2.1-2.2, for the train error defined in (29) with I1, I2 SRS Ik, we have that as k, n, p , p/n ϕ (0, ) and p/k ϕs [ϕ, + ), 1 k Errtrain( bβλ k(DIℓ)) a.s. T λ 1 (ϕ, ϕs) := Dλ(ϕs, ϕs)Rλ 1 (ϕ, ϕs) 1 |Im Iℓ|Errtrain( eβλ k,1({DIm, DIℓ})) a.s. T λ 2 (ϕ, ϕs) := 1 2 ϕs ϕ 2ϕs ϕRλ 1 (ϕ, ϕs) + 1 2Dλ(ϕs, ϕs) ϕs 2ϕs ϕRλ 1 (ϕ, ϕs) + 2(ϕs ϕ) 2ϕs ϕ 1 λv( λ; ϕs) + ϕ 2ϕs ϕ (2Rλ 2 (ϕ, ϕs) Rλ 1 (ϕ, ϕs) , where the deterministic function RM is defined in Lemma A.2. Proof of Lemma D.3. From (27), we have β0 eβλ,M = 1 m=1 λMmβ0 1 m=1 Mm(X Lm/k)ϵ. 
(33) Part (1) Case of M = 1: From (33), the train error can be decomposed as follows: 1 k Errtrain( bβλ k(DIℓ)) = Lℓϵ + LℓX(β0 bβλ k(DIℓ)) 2 2/k = (Lℓ LℓXMℓX Lℓ/k)ϵ + λLℓXMℓβ0 2 2/k = TC + TB + TV , where the constant term TC, bias term TB, and the variance term TV are given by In XMℓ X Lℓ LℓXMℓβ0, (34) TB = λ2β 0 MℓbΣℓMℓβ0, (35) In XMℓ X Lℓ In XMℓ X Lℓ Next, we analyze the three terms separately. From Lemmas D.4 and D.5 with n = k, we have that TC a.s. 0, and TV a.s. σ2 1 2 k tr(Mm bΣm) + 1 k tr(Mm bΣm Mm bΣm) := TV T . Thus, it remains to obtain the deterministic equivalent for the bias term TB and the trace term TV T . From Lemma D.6 and Lemma D.7, we have that for all I1 Ik, TB a.s. ρ2Dλ(ϕs, ϕs)(1 + ev( λ; ϕs, ϕs))ec( λ; ϕs) TV T a.s. σ2Dλ(ϕs, ϕs)(1 + ev( λ; ϕs, ϕs)). Then, we have 1 k Errtrain( bβλ k(DIℓ)) a.s. Rλ,1(ϕ, ϕs)Dλ(ϕs, ϕs) where Rλ,1 and Dλ are defined in (21) and (26), respectively. Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation Part (2) Case of M = 2: From (33), the train error can be analogously decomposed as follows: 1 |Im Il|Errtrain( eβλ k,2({DI1, DI2)}) = 1 |Im Il| Lm l(ϵ + X(β0 bβλ,2)) 2 2 = T C + T B + T V , where the constant term TC, bias term TB, and the variance term TV are given by T C = λ 2|Im Il| i,j {m,l} ϵ In XMi X Li Lm l XMjβ0, (37) i,j {m,l} β 0 Mi bΣm l Mjβ0 = λ2k 4|Im Il| i {m,l} β 0 Mi bΣi Miβ0 + λ2 i {m,l} |Im+l i \ Ii|β 0 Mi bΣ(m+l i)\i Miβ0 i {m,l} β 0 Mi bΣm l Mm+l iβ0, (38) T V = 1 4|Im Il| i,j {m,l} ϵ In XMi X Li In XMj X Lj Next, we analyze the three terms separately. From Lemmas D.4 and D.5, we have that TC a.s. 0, and T V a.s. σ2 1 2 |Im Il| tr(Mi bΣi) + 1 k tr(Mi bΣi Mi bΣm l) 1 1 |Im Il| j {m,l} tr(Mj bΣj) + 1 n tr(Ml bΣm l Ml bΣm l) = ϕs 2ϕs ϕ TV T 4 ϕs ϕ 2ϕs ϕ i {m,l} tr(Mi bΣi Mi bΣ(m+l i)\i) 1 1 |Im Il| j {m,l} tr(Mj bΣj) + 1 n tr(Ml bΣm l Mm bΣm l) Thus, it remains to obtain the deterministic equivalent for the bias term T B and the trace term T V T . From Lemma D.6 and Lemma D.7, and the convergence of the cardinality from Lemma G.2, we have that for all Im, Il SRS Ik, T B a.s. ρ2 2 et(ϕ, ϕs)ec( λ; ϕs), and T V a.s. σ2 2 et(ϕ, ϕs), et(ϕ, ϕs) = ϕs 2ϕs ϕDλ(ϕs, ϕs)(1 + ev( λ; ϕs, ϕs)) + ϕs ϕ 2ϕs ϕ(1 + ev( λ; ϕs, ϕs)) + Dλ(ϕs, ϕs) 2(ϕs ϕ) 2ϕs ϕ 1 λv( λ; ϕs) + ϕ 2ϕs ϕ (1 + ev( λ; ϕ, ϕs)). Then, we have 1 |Im Il|Errtrain( eβλ k,2({DI1, DI2)}) Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation 2 et(ϕ, ϕs)ec( λ; ϕs) + σ2 2 et(ϕ, ϕs) 2Dλ(ϕs, ϕs) ϕs 2ϕs ϕRλ,1(ϕ, ϕs) + 2(ϕs ϕ) 2ϕs ϕ 1 λv( λ; ϕs) + ϕ 2ϕs ϕ (2Rλ,2(ϕ, ϕs) Rλ,2(ϕ, ϕs)) 2 ϕs ϕ 2ϕs ϕR1(ϕ, ϕs), which finishes the proof. D.4. Component concentrations In this subsection, we will show that the cross-term TC converges to zero and the variance term TV converges to its corresponding trace expectation. D.4.1. CONVERGENCE OF THE CROSS TERM Lemma D.4 (Convergence of the cross term). Under Assumptions 2.1-2.2, for TC and T C as defined in (34) and (37)), we have TC a.s. 0 and T C a.s. 0 as k, p and p/k ϕs. Proof of Lemma D.4. We first prove the result for T C. Note that M 2 1 |I1 I2| In X Mm X Lm m=1 Mmβ0, ϵ We next bound the squared norm: In XMm X Lm (Mj X Lj) bΣm l Mlβ0 2 2 + 1 4|I1 I2| Lj XMlβ0 2 2 Ml bΣm l Mj X Lj XMj bΣm l Ml op + k |I1 I2| bΣj op Ml op k Ml 2 op bΣl 2 Mj(X Lj X/k)Mj op + k |I1 I2| Ml op bΣl op k Ml 2 op bΣl 2 op Mj op Ip λMj op + k |I1 I2| Ml op bΣl op kλ2 + k |I1 I2| where the last inequality is due to the fact that Mj op 1/λ and Ip λMj op 1. By Assumption 2.2, β0 2 2 is uniformly bounded in p. 
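The operator-norm bound from Bai & Silverstein (2010) invoked in the next display can also be checked by simulation; the sketch below (isotropic Gaussian features with rmax = 1 and illustrative dimensions of our own choosing) compares the empirical operator norm of the sample covariance with rmax(1 + √ϕs)².

```python
import numpy as np

rng = np.random.default_rng(1)

# Empirical check of limsup ||Sigma_hat||_op <= r_max (1 + sqrt(phi_s))^2 for
# isotropic Gaussian features (r_max = 1); dimensions are illustrative only.
for k, p in [(500, 250), (500, 1000), (2000, 1000)]:
    X = rng.standard_normal((k, p))
    s_max = np.linalg.norm(X, ord=2)            # largest singular value of X
    op_norm = s_max**2 / k                      # ||Sigma_hat||_op = s_max^2 / k
    bound = (1.0 + np.sqrt(p / k)) ** 2
    print(f"k={k}, p={p}: ||Sigma_hat||_op = {op_norm:.3f}, (1 + sqrt(p/k))^2 = {bound:.3f}")
```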
From Bai & Silverstein (2010), we have lim sup bΣ op lim sup max1 i p s2 i rmax(1 + ϕs)2 almost surely as k, p and p/k ϕs (0, ). From Lemma G.2, we have that k/|I1 I2| a.s. k/(2k k2/n), which is ϕs/(2ϕs ϕ) almost surely. Then we have that the square norm is almost surely upper bounded by some constant. Applying Lemma G.3, we thus have that T C a.s. 0. Note that when I1 = I2, T C reduces to TC, and thus the conclusion for TC also holds. D.4.2. CONVERGENCE OF THE VARIANCE TERM Lemma D.5 (Convergence of the variance term). Under Assumptions 2.1-2.2, let M N and bΣ = X X/n. For all m [M] and Im Ik, let bΣm = X Lm X/k, Lm Rn n be a diagonal matrix with (Lm)ll = 1 if l Im and 0 Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation otherwise, and Mm = (X Lm X/k + λIp) 1. Then, for all m, l [M] and m = l, it holds that 1 k ϵ In XMm X Lm In XMm X Lm k tr(Mm bΣm) + 1 k tr(Mm bΣm Mm bΣm) a.s. 0, (40) 1 |Im Il|ϵ In XMi X Li In XMi X Li σ2 1 2 |Im Il| tr(Mi bΣi) + 1 k tr(Mi bΣi Mi bΣm l) a.s. 0, (41) 1 |Im Il|ϵ In XMm X Lm In XMl X Ll 1 1 |Im Il| ℓ {i,j} tr(MℓbΣℓ) + 1 n tr(Mi bΣi j Mj bΣm l) a.s. 0, (42) where i, j {m, l}, i = j, and bΣm l = X Lm l X/|Im Il|, as n, k, p , p/n ϕ (0, ), and p/k ϕs [ϕ, ). Proof of Lemma D.5. We first prove the last convergence result. Note that In XMm X Lm In XMl X Ll k Lm XMm X Lm l 1 k Lm l XMl X Ll + 1 k2 Lm XMm X Lm l XMl X Ll op Mj op bΣm l op + |Im Il| op Mm op bΣm l op Ml op bΣl Now, we have lim sup bΣ op lim sup max1 i p s2 i rmax(1 + ϕ)2 almost surely as n, p and p/n ϕ (0, ) from Bai & Silverstein (2010). Similarly, lim sup bΣm op rmax(1 + ϕs)2 almost surely. From Lemma G.2, |Im Il|/k a.s. (2ϕs ϕ)/ϕs. Then the above quantity is asymptotically upper bounded by some constant as n, k, p , p/n ϕ (0, ) and p/k ϕs [ϕ, ). From Lemma G.4, it follows that 1 |Im Il|ϵ In XMi X Li In XMj X Lj In XMi X Li In XMj X Lj Expanding the trace term above, we have In XMi X Li In XMj X Lj 1 1 |Im Il| ℓ {i,j} tr(MℓbΣℓ) + |Ii Ij| k2 tr(Mi bΣi j Mj bΣm l) Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation Since Im, Il SRS Ik, from Lemma G.2 we have that |Im Il|/k a.s. k/n. Then, we have 1 |Im Il|ϵ In XMi X Li In XMj X Lj 1 1 |Im Il| ℓ {i,j} tr(MℓbΣℓ) + 1 n tr(Mi bΣi j Mj bΣm l) and thus (42) follows. Setting i = j in (43) yields (41). Finally, setting i = j = l = m in (43) finishes the proof for (40). D.5. Component deterministic approximations D.5.1. DETERMINISTIC APPROXIMATION OF THE BIAS FUNCTIONAL Lemma D.6 (Deterministic approximation of the bias functional). Under Assumptions 2.1-2.2, for all m [M] and Im Ik, let bΣm = X Lm X/k, Lm Rn n be a diagonal matrix with (Lm)ll = 1 if l Im and 0 otherwise, and Mm = (X Lm X/k + λIp) 1. Then, it holds that 1. For all m [M], λ2β 0 Mm bΣm Mmβ0 a.s. ρ2λ2v( λ; ϕs)2(1 + ev( λ; ϕs, ϕs))ec( λ; ϕs). 2. For all m, l [M], m = l and Im, Il SRSWR Ik, λ2β 0 Ml bΣm\l Mlβ0 a.s. ρ2(1 + ev( λ; ϕs, ϕs))ec( λ; ϕs). (44) 3. For all m, l [M], m = l and Im, Il SRSWR Ik, λ2β 0 Mm bΣm l Mlβ0 a.s. ρ2λ2 2(ϕs ϕ) 2ϕs ϕ 1 λv( λ; ϕs) + ϕ 2ϕs ϕ v( λ; ϕs)2(1 + ev( λ; ϕ, ϕs))ec( λ; ϕs), where bΣm l = |Im Il| 1X Lm l X, as n, k, p , p/n ϕ (0, ), and p/k ϕs [ϕ, ), where ϕ0 = ϕ2 s/ϕ, TB is as defined in (35), and the nonnegative constants ev( λ; ϕ, ϕs) and ec( λ; ϕs) are as defined in (19). Proof of Lemma D.6. We split the proof into different parts. Part (1) From Lemma F.6 (2) (with A = Ip), we have that λ2Mm bΣm Mm v( λ; ϕs)2(1 + ev( λ; ϕs, ϕs)) (v( λ; ϕs)Σ + Ip) 1Σ(v( λ; ϕs)Σ + Ip) 1. 
(46) By the definition of deterministic equivalent, we have λ2β 0 Mm bΣm Mmβ0 a.s. lim p v( λ; ϕs)2(1 + ev( λ; ϕs, ϕs)) ri (1 + riv( λ; ϕs))2 (β 0 wi)2 = lim p β0 2 2v( λ; ϕs)2(1 + ev( λ; ϕs, ϕs)) Z r (1 + v( λ; ϕs)r)2 d Gp(r) = ρ2v( λ; ϕs)2(1 + ev( λ; ϕs, ϕs)) Z r (1 + v( λ; ϕs)r)2 d G(r), (47) where the last equality holds since Gp and G have compact supports and invoking Assumption 2.2. Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation Part (2) From Lemma F.8 (1), we have Ml bΣm\l Ml MlΣMl. Then, from Patil et al. (2022a, Lemma S.2.4), the conclusion follows. Part (3) For the cross term, it suffices to derive the deterministic equivalent of β 0 M1 bΣ1 2M2β0. We begin with analyzing the deterministic equivalent of M1 bΣ1 2M2. Let i0 = tr(L1L2) be the number of shared samples between DI1 and DI2, we use the decomposition k (bΣ0 + λIp) + k i0 k (bΣind j + λIp), j = 1, 2, where bΣ0 = X L1L2X/i0 and bΣind j = X (Lj L1L2)X/(k i0) are the common and individual covariance estimators of the two datasets. Let N0 = (bΣ0 + λIp) 1 and Nj = (bΣind j + λIp) 1 for j = 1, 2. Then k N 1 0 + k i0 1 , j = 1, 2, (48) where the equalities hold because N0 is invertible when λ > 0. Note that bΣ1 2 = k 2k i0 bΣ1 + k 2k i0 bΣ2 i0 2k i0 bΣ0, j=1 (M 1 j λIp) i0 2k i0 bΣ0. We have that λ2M1 bΣ1 2M2 = k 2k i0 λ2 2 X j=1 Mj 2k 2k i0 λ3M1M2 i0 2k i0 λ2M1 bΣ0M2. (49) Next, we derive the deterministic equivalents for the three terms in (49). From Corollary F.5, the first term admits λMj (v( λ; ϕs)Σ + Ip) 1. (50) λ2M1M2 (v( λ; ϕs)Σ + Ip) 1 (ev( λ; ϕ, ϕs, Ip)Σ + Ip) (v( λ; ϕs)Σ + Ip) 1 (51) (v( λ; ϕs)Σ + Ip) 2 (ev( λ; ϕ, ϕs, Ip)Σ + Ip), (52) ev( λ; ϕ, ϕs, Ip) = ϕ Z r (1 + v( λ; ϕs)r)2 d H(r) v( λ; ϕs) 2 ϕ Z r2 (1 + v( λ; ϕs)r)2 d H(r) . For the third term, M1 bΣ0M2 evv( λ; ϕ, ϕs)(v( λ; ϕs)Σ + Ip) 2Σ, (53) evv( λ; ϕ, ϕs) := 1 v( λ; ϕs) 2 ϕ Z r2 (1 + v( λ; ϕs)r)2 d H(r) . Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation Combining (50)-(53), we get λ2M1 bΣ1 2M2 2ϕs 2ϕs ϕ(v( λ; ϕs) ev( λ; ϕ, ϕs, Ip)) ϕ 2ϕs ϕλevv( λ; ϕ, ϕs) λ(v( λ; ϕs)Σ + Ip) 2Σ = λ2v( λ; ϕs)2(1 + ev( λ; ϕ, ϕs)) 2(ϕs ϕ) 2ϕs ϕ 1 λv( λ; ϕs) + ϕ 2ϕs ϕ (v( λ; ϕs)Σ + Ip) 2Σ. The last conclusion then follows analogously as in (47). D.5.2. DETERMINISTIC APPROXIMATION OF THE VARIANCE FUNCTIONAL Lemma D.7 (Deterministic approximation of the variance functional). Under Assumptions 2.1-2.2, for all m [M] and Im Ik, let bΣm = X Lm X/k, Lm Rn n be a diagonal matrix with (Lm)ll = 1 if l Im and 0 otherwise, and Mm = (X Lm X/k + λIp) 1. Then, it holds that 1. For all m [M] and Im Ik, k tr(Mm bΣm) + 1 k tr(Mm bΣm Mm bΣm) a.s. λ2v( λ; ϕs)2(1 + ev( λ; ϕs, ϕs)). (55) 2. For all m, l [M], m = l and Im, Il SRSWR Ik, 1 k tr(Mm bΣm Mm bΣl\m) a.s. ev( λ; ϕs, ϕs). (56) 3. For all m, l [M], m = l and Im, Il SRSWR Ik, 1 1 |Im Il| j {m,l} tr(Mj bΣj) + 1 n tr(Ml bΣm l Mm bΣm l) a.s. Dλ(ϕs, ϕs) 2(ϕs ϕ) 2ϕs ϕ 1 λv( λ; ϕs) + ϕ 2ϕs ϕ (1 + ev( λ; ϕ, ϕs)), (57) where bΣl\m = |Il \ Im| 1X Ll\m X and bΣm l = |Im Il| 1X Lm l X, as n, k, p , p/n ϕ (0, ), and p/k ϕs [ϕ, ), where the nonnegative constant ev(λ; ϕ, ϕs) is as defined in (19). Proof of Lemma D.7. We split the proof into different parts. Part (1) Note that tr(Mm bΣm Mm bΣm) = tr(Mm bΣm) λ tr(M 2 n bΣm). We now have k tr(Mm bΣm) + 1 k tr(Mm bΣm Mm bΣm) = 1 1 k tr(Mm bΣm) λ k tr(M 2 m bΣm) k tr(M 2 m bΣm). From Corollary F.5 we have that λMm (v( λ; ϕs)Σ + Ip) 1. From Lemma F.6 (2) (with A = I), we have that for j [M], Mm bΣm Mm evv( λ; ϕs)(v( λ; ϕs)Σ + Ip) 2Σ. 
(58) Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation By the trace rule Lemma F.3 (4) , we have λ k tr(Mm) λ k tr(M 2 m bΣm) a.s. lim p p k 1 p tr((v( λ; ϕs)Σ + Ip) 1) tr(λevv( λ; ϕs)(v( λ; ϕs)Σ + Ip) 2Σ) = ϕs lim p 1 p 1 + v( λ; ϕs)ri λevv( λ; ϕs)ri (v( λ; ϕs)ri + 1)2 Z 1 + v( λ; ϕs)r λevv( λ; ϕs)r (1 + v( λ; ϕs)r)2 d Hp(r) Z 1 + v( λ; ϕs)r λevv( λ; ϕs)r (1 + v( λ; ϕs)r)2 d H(r), j = 1, 2, (59) where in the last line we used the fact that Hp and H have compact supports and Assumption 2.2. Then, we have k tr(Mm bΣm) + 1 k tr(Mm bΣm Mm bΣm) a.s. 1 ϕs + ϕs Z 1 + v( λ; ϕs)r λevv( λ; ϕs)r (1 + v( λ; ϕs)r)2 d H(r) Z v( λ; ϕs)r 1 + v( λ; ϕs)r d H(r) ϕs Z λevv( λ; ϕs)r (1 + v( λ; ϕs)r)2 d H(r) = λv( λ; ϕs) ϕs Z λevv( λ; ϕs)r (1 + v( λ; ϕs)r)2 d H(r) = λevv( λ; ϕs) v( λ; ϕs) 1 ϕs Z v( λ; ϕs)r2 (1 + v( λ; ϕs)r)2 d H(r) ϕs Z r (1 + v( λ; ϕs)r)2 d H(r) = λevv( λ; ϕs) v( λ; ϕs) 1 ϕs Z r 1 + v( λ; ϕs)r d H(r) = λ2evv( λ; ϕs) = λ2v( λ; ϕs)2(1 + ev( λ; ϕs, ϕs)), and thus, (55) follows. Part (2) Since Mm bΣm Mm and bΣl\m are independent, from Lemma F.8 (1), we have Mm bΣm Mm bΣl\m Mm bΣm MmΣ. Then, by the definition of deterministic equivalents, it follows that 1 k tr(Mm bΣm Mm bΣl\m) = 1 k tr(Mm bΣm MmΣ) a.s. ev( λ; ϕs, ϕs), where the convergence is due to Patil et al. (2022a, Lemma S.2.5). Part (3) Let i0 = |Im Il|. The first two terms in (59) satisfy that 1 1 |Im Il| j {m,l} tr(Mj bΣj) = 1 p 2k i0 1 p tr(Mj bΣj) a.s. 1 2ϕ2 s 2ϕs ϕ Z v( λ; ϕs)r 1 + v( λ; ϕs)r d H(r), where the convergence is from Part 1. The last term can be further decomposed because tr(Ml bΣm l Mm bΣm l) = k 2k i0 j {m,l} tr(Ml bΣm l Mm bΣj) i0 2k i0 tr(Ml bΣm l Mm bΣm l) j {m,l} [tr(Mj bΣm l) λ tr(Mm bΣm l Ml)] i0 2k i0 tr(Ml bΣm l Mm bΣm l). Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation From Lemma F.8 (4), (3), and (5), we have Mj bΣm l Ip (v( λ; ϕs)Σ + Ip) 1 Mm bΣm l Ml evv( λ; ϕ, ϕs)(v( λ; ϕs)Σ + Ip) 2Σ Ml bΣm l Mm bΣm l ϕs ϕ v( λ; ϕs) ϕs ϕ ϕ λevv( λ; ϕ, ϕs) (v( λ; ϕs)Σ + Ip) 1Σ λevv( λ; ϕ, ϕs)(v( λ; ϕs)Σ + Ip) 2Σ. Combining the above terms and by Assumption 2.2, we have 1 1 |Im Il| j {m,l} tr(Mj bΣj) + 1 n tr(Ml bΣm l Mm bΣm l) a.s. 1 2ϕ2 s 2ϕs ϕ Z v( λ; ϕs)r 1 + v( λ; ϕs)r d H(r) + ϕ 2ϕs 2ϕs ϕ Z v( λ; ϕs)r 1 + v( λ; ϕs)r d H(r) λevv( λ; ϕ, ϕs) Z r (1 + v( λ; ϕs)r)2 d H(r) Z r 1 + v( λ; ϕs)r d H(r) ϕs ϕ ϕ λevv( λ; ϕ, ϕs) Z r 1 + v( λ; ϕs)r d H(r) λevv( λ; ϕ, ϕs) Z r (1 + v( λ; ϕs)r)2 d H(r) Z v( λ; ϕs)r 1 + v( λ; ϕs)r d H(r) + ϕ(ϕs ϕ) 2ϕs ϕ λevv( λ; ϕ, ϕs) Z r 1 + v( λ; ϕs)r d H(r) ϕλevv( λ; ϕ, ϕs) Z r (1 + v( λ; ϕs)r)2 d H(r) = λv( λ; ϕs) + ϕ(ϕs ϕ) Z λevv( λ; ϕ, ϕs)r 1 + v( λ; ϕs)r d H(r) ϕ Z λevv( λ; ϕ, ϕs)r (1 + v( λ; ϕs)r)2 d H(r) = λ2evv( λ; ϕ, ϕs) 2ϕs ϕ λv( λ; ϕs) ϕ(2ϕs ϕ) Z v( λ; ϕs)r2 (1 + v( λ; ϕs)r)2 d H(r) Z r 1 + v( λ; ϕs)r d H(r) ϕ Z r (1 + v( λ; ϕs)r)2 d H(r) = λ2evv( λ; ϕ, ϕs) 2ϕs ϕ λv( λ; ϕs) ϕ(2ϕs ϕ) Z r 1 + v( λ; ϕs)r d H(r) Z r 1 + v( λ; ϕs)r d H(r) = λ2evv( λ; ϕ, ϕs)2 2(ϕs ϕ) 2ϕs ϕ 1 λv( λ; ϕs) + ϕ 2ϕs ϕ = Dλ(ϕs, ϕs) 2(ϕs ϕ) 2ϕs ϕ 1 λv( λ; ϕs) + ϕ 2ϕs ϕ (1 + ev( λ; ϕ, ϕs)). E. Proof of Proposition 3.6 Proof of Proposition 3.6. From Lemma D.1, we have 1 n y X eβλ k 2 2 = 1 n EI SRS Ik h Errtrain( bβλ k(DI)) + Errtest( bβλ k({DI})) i n E(Im,Iℓ) SRS Ik h Errtrain( eβλ k({DIm, DIℓ})) + Errtest( eβλ k({DIm, DIℓ})) i . Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation Since by Lemma D.2, Lemma D.3 and Lemma G.6, each expectation converges, we have that 1 n y X eβλ k 2 2 a.s. 
2ϕ(2ϕs ϕ) ϕ2s T λ 2 (ϕ, ϕs) + 2(ϕs ϕ)2 ϕ2s Rλ 2 (ϕ, ϕs) ϕ ϕs T λ 1 (ϕ, ϕs) ϕs ϕ ϕs Rλ 1 (ϕ, ϕs), where the convergence of the averages is from Lemma G.5 and the convergence of coefficients is from Lemma G.2. Since the denominator converges from Lemma 3.4, we further have gcvλ k, a.s. G λ (ϕ, ϕs) = ϕ2s T λ 2 (ϕ, ϕs) + 2(ϕs ϕ)2 ϕ2s Rλ 2 (ϕ, ϕs) ϕ ϕs T λ 1 (ϕ, ϕs) ϕs ϕ ϕs Rλ 1 (ϕ, ϕs) Dλ (ϕ, ϕs) , for λ > 0 and ϕs [ϕ, + ). For the boundary case when λ > 0 but ϕs = + , we require Proposition E.1; for the boundary case when λ = 0, we require Proposition E.2. Applying Proposition E.1 and Proposition E.2 finishes the proof. E.1. Boundary case: diverging subsample aspect ratio for the ridge predictor Proposition E.1 (Risk approximation when ϕs + ). Under Assumptions 2.1-2.2, it holds for all λ > 0, gcvλ k a.s. G λ (ϕ, ), as k, n, p , p/n ϕ (0, ) and p/k , where G λ ( , ) is defined in Proposition 3.6. Proof of Proposition E.1. Recall that 1 n y X eβλ k 2 2 = lim M 1 n y X eβλ k,M 2 2 = (β0 eβλ k) bΣ(β0 eβλ k) + 1 n(β0 eβλ k) bΣϵ From Lemma G.3 and Lemma G.4, we have that β 0 X ϵ/n a.s. 0 and ϵ ϵ/n a.s. σ2 as n . For the other term, note that for any (I1, . . . , IM) SRS Ik, eβλ k 2 lim M E(I1,...,IM) SRS Ik m=1 (X Lm X/k + λIp) 1(X Lmy/k) 2 lim M E(I1,...,IM) SRS Ik m=1 (X Lm X/k + λIp) 1X Lm/ ρ2 + σ2 max Im Ik (X Lm X/k + λIp) 1X Lm/ where the last inequality holds eventually almost surely since Assumptions 2.1-2.2 imply that the entries of y have bounded 4-th moment, and thus from the strong law of large numbers, Lmy/ k 2 is eventually almost surely bounded above by C p E[y2 1] = C p ρ2 + σ2 for some constant C. Observe that operator norm of the matrix (X Lm X/k+λIp) 1XLm/ k is upper bounded maxi si/(s2 i + λ) 1/smin where si s are the singular values of X and smin is the smallest nonzero singular value. As k, p such that p/k , smin almost surely (e.g., from results of Bloemendal et al. (2016)) and therefore, eβλ k 2 0 almost surely. Because bΣ op is upper bounded almost surely, we further have eβ k X ϵ/n a.s. 0. Consequently we have (β0 eβk) X ϵ/n a.s. 0 and 1 n y X eβλ k 2 2 a.s. β 0 bΣβ0 + σ2. Finally, from Lemma F.8 (1) β 0 bΣβ0 a.s. β 0 Σβ0 and from Assumption 2.2, we have 1 n y X eβλ k 2 2 a.s. σ2 + ρ2 Z r d G(r). Since Sλ k = X eβλ k, we have that tr(Sλ k )/n a.s. 0. So the denominator converges to 1 almost surely. From Lemma F.10, we have G λ (ϕ, ) := limϕs + G λ (ϕ, ϕs) = σ2 + ρ2 R r d G(r), which is also the limit of the GCV estimate. Thus, G λ (ϕ, ) is well defined and G λ (ϕ, ϕs) is right continuous at ϕs = + . Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation E.2. Boundary case: the ridgeless predictor Proposition E.2 (Risk approximation when λ = 0). Under Assumptions 2.1-2.2, suppose that the conclusion of Proposition 3.6 holds for λ > 0. Then it holds that, gcv0 k a.s. G 0 (ϕ, ϕs) := lim λ 0+ G λ (ϕ, ϕs), as k, n, p , p/n ϕ (0, ) and p/k [ϕ, + ], where G λ ( , ) is defined in Proposition 3.6. Proof of Proposition E.2. We analyze the numerator and the denominator separately. Part (1) For the denominator, note that Pn,λ := (1 tr(Sλ k )/n)2 = lim M (1 tr(Sλ k,M)/n)2, where Sλ k = lim M Sλ k,M is the smoothing matrix. Since Sλ k 0n n and Sλ k op max Im Ik X(X Lm X/k + λIp) 1X Lm/ which is also upper bounded almost surely from the proof in Proposition E.1 (when λ = 0, the inverse in the above display is replaced by pseudo-inverse). Thus, we have Pn,λ is almost surely upper bounded λ Λ := [0, λmax] for any λmax (0, ) fixed. 
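For intuition, the boundedness of Pn,λ over Λ is already visible in finite samples; the following sketch computes Pn,λ = (1 − tr(Sλk,M)/n)² for a small ridge ensemble, including the ridgeless end λ = 0 via the pseudo-inverse. The dimensions n, p, k, the ensemble size M, and the isotropic Gaussian design are illustrative assumptions of ours, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Finite-sample sketch of P_{n,lambda} = (1 - tr(S^lambda_{k,M}) / n)^2 for a
# small subsample ridge ensemble; all sizes below are illustrative.
n, p, k, M = 300, 450, 150, 20
X = rng.standard_normal((n, p))

def ensemble_smoother_trace(lam):
    """Average over M subsamples of tr(S^lam(D_I)), S = X (X'L X/k + lam I)^{-1} X'L / k."""
    total = 0.0
    for _ in range(M):
        idx = rng.choice(n, size=k, replace=False)   # a simple random sample I of size k
        Xi = X[idx]                                  # rows of X indexed by I
        G = Xi.T @ Xi / k + lam * np.eye(p)
        A = np.linalg.pinv(G) if lam == 0 else np.linalg.inv(G)
        # tr(S) = tr(A X_I^T X_I) / k = tr(X_I A X_I^T) / k.
        total += np.trace(Xi @ A @ Xi.T) / k
    return total / M

for lam in [0.0, 0.01, 0.1, 1.0, 10.0]:
    P = (1.0 - ensemble_smoother_trace(lam) / n) ** 2
    print(f"lambda = {lam:5.2f}:  P_n_lambda = {P:.4f}")
# P_{n,lambda} stays bounded over [0, lambda_max], including the ridgeless end lambda = 0.
```

In this configuration (p > k, so ϕs > 1), each ridgeless base smoother has trace equal to the rank k of the subsampled Gram matrix, so Pn,0 = (1 − k/n)², consistent with the limit Dλ∞(ϕ, ϕs) from (26) evaluated at λ = 0.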
Next we inspect the boundedness of the derivative of Pn,λ: λ lim M lim M (1 tr(Sλ k,M)/n)2 =: λ lim M QM,λ. We claim that λ lim M QM,λ = lim M λQM,λ. To see this, we need to show that QM,λ is equicontinuous in λ over Λ. First we know that QM,λ is differentiable in λ. From (60), we have that QM,λ is uniformly upper bounded over λ Λ almost surely. Note that λQM,λ = 2(1 tr(Sλ k,M)) tr where λSλ k,M = 1 m=1 X X Lm X k + λI 2 X Lm By the similar arguments as in Proposition E.1, we have that Sλ k,M/ λ op, and Sλ k,M 2 2 are uniformly upper bounded almost surely over Λ, the equicontinuity conclusion follows. Then by Moore-Osgood theorem, we have λPn,λ = lim M 2(1 tr(Sλ k,M)) tr is uniformly upper bounded almost surely over [0, + ] independent of λ and M. Therefore, we conclude that | Pn,λ/ λ| is upper bounded almost surely over λ Λ. On the other hand, we know that Pn,λ a.s. Dλ (ϕ, ϕs) for λ > 0. Define D0(ϕ, ϕs) := limλ 0+ Dλ (ϕ, ϕs). When λ = 0 and ϕs > 1, we know that D0(ϕ, ϕs) is well-defined because v( λ; ϕs) is finite and continuous from Lemma F.12. When λ = 0 and ϕs (0, 1], from the definition of fixed-point solution (18), we have 1 = v( λ; ϕs)λ + ϕs Z v( λ; ϕs)r 1 + v( λ; ϕs)r d H(r). Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation In this case, v(0; ϕs) = + from Lemma F.12. Let λ 0+, we have 1 = lim λ 0+ v( λ; ϕs)λ + ϕs lim λ 0+ Z v( λ; ϕs)r 1 + v( λ; ϕs)r d H(r) = lim λ 0+ v( λ; ϕs)λ + ϕs. Then we have limλ 0+ v( λ; ϕs)λ = 1 ϕs and Dλ (ϕ, ϕs) = (1 ϕs)2. Thus, D0(ϕ, ϕs) is always well-defined. From Lemma F.12, there exists M > 0 such that the magnitudes of v( λ; ϕs) and its derivative with respect to λ are continuous and bounded by M for all λ [0, + ]. It follows that |Dλ (ϕ, ϕs)| and | Dλ (ϕ, ϕs)/ λ| are uniformly upper bounded almost surely. From Moore-Osgood theorem and the continuity property from Lemma F.12, we have lim n lim λ 0+ Pn,λ = lim λ 0+ lim n Pn,λ = lim λ 0+ Dλ (ϕ, ϕs) = D0(ϕ, ϕs). Part (2) For the numerator, note that n y X eβλ k 2 2 = 1 n (In Sλ k )y 2 2. Assumptions 2.1-2.2 imply that the entries of y have bounded 4-th moment, and thus from the strong law of large numbers, y/ n 2 is eventually almost surely bounded above by C p E[y2 1] = C p ρ2 + σ2 for some constant C. On the other hand, Sλ k 0n n and Sλ k op is also upper bounded almost surely from Part (1). Thus, we have P n,λ is almost surely upper bounded λ Λ. Next we inspect the boundedness of the derivative of P n,λ: n(y X eβλ k) λ lim M 2 n(y X eβλ k,M) Sλ k,My λ lim M 2 ny (In Sλ k,M)Sλ k,My =: λ lim M Q M,λ. We claim that λ lim M Q M,λ = lim M λQ M,λ. To see this, we need to show that Q M,λ is equicontinuous in λ over Λ. First we know that Q M,λ is differentiable in λ. From (60), we have that Q M,λ is uniformly upper bounded over λ Λ almost surely. Similarly, we have ny (In 2Sλ k,M) In 2Sλ k,M op and the equicontinuity conclusion follows analogously as in Part (1). Therefore, we conclude that | P n,λ/ λ| is upper bounded almost surely over λ [0, + ]. On the other hand, we know that P n,λ a.s. G λ (ϕ, ϕs) for λ > 0. Define D0(ϕ, ϕs)G 0 (ϕ, ϕs) := limλ 0+(Dλ (ϕ, ϕs)G λ (ϕ, ϕs)) = D0(ϕ, ϕs)R0 M(ϕ, ϕs), which is well defined from Part (1) and Lemma A.2. From Lemma F.12, there exists M > 0 such that the magnitudes of v( λ; ϕs), ev(λ; ϕs, ϕ) and ec(λ; ϕs), and their derivatives with respect to λ are continuous and bounded by M for all λ [0, + ]. It follows that |G λ (ϕ, ϕs)| is upper bounded almost surely. 
Analogously, we have that | (Dλ (ϕ, ϕs)G λ (ϕ, ϕs))/ λ| is also upper bounded almost surely on λ Λ. From Moore-Osgood theorem and the continuity property from Lemma F.12, we have lim n lim λ 0+ P n,λ = lim λ 0+ lim n P n,λ = lim λ 0+ G λ (ϕ, ϕs) = G 0 (ϕ, ϕs). Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation F. Auxiliary results on asymptotic equivalents F.1. Preliminary background We use the notion of asymptotic equivalence of sequences of random matrices in various proofs. In this section, we provide a basic review of the related definitions and corresponding calculus rules. See Dobriban & Wager (2018); Dobriban & Sheng (2021); Patil et al. (2022b;a) for more details. Definition F.1 (Asymptotic equivalence). Consider sequences {Ap}p 1 and {Bp}p 1 of (random or deterministic) matrices of growing dimensions. We say that Ap and Bp are equivalent and write Ap Bp if limp | tr[Cp(Ap Bp)]| = 0 almost surely for any sequence of random matrices Cp independent to Ap and Bp, with bounded trace norm such that lim supp Cp tr < almost surely. The notion of asymptotic equivalence of two sequences of random matrices from Definition F.1 can be further extended to incorporate conditioning on another sequence of random matrices. Definition F.2 (Conditional asymptotic equivalence). Consider sequences {Ap}p 1, {Bp}p 1 and {Dp}p 1 of (random or deterministic) matrices of growing dimensions. We say that Ap and Bp are equivalent given Dp and write Ap Bp | Dp if limp | tr[Cp(Ap Bp)]| = 0 almost surely conditional on {Dp}p 1, i.e., P lim p | tr[Cp(Ap Bp)]| = 0 {Dp}p 1 for any sequence of random matrices Cp independent to Ap and Bp conditional on Dp, with bounded trace norm such that lim sup Cp tr < as p . Below we summarize the calculus rules for conditional asymptotic equivalence Definition F.2 adapted from Patil et al. (2022a, Lemma S.7.4 and S.7.6). Lemma F.3 (Calculus of deterministic equivalents). Let Ap, Bp, Cp and Dp be sequences of random matrices. The calculus of deterministic equivalents ( D and R) satisfies the following properties: (1) Equivalence: The relation is an equivalence relation. (2) Sum: If Ap Bp | Ep and Cp Dp | Ep, then Ap + Cp Bp + Dp | Ep. (3) Product: If Ap has bounded operator norms such that lim supp Ap op < , Ap is conditional independent to Bp and Cp given Ep for p 1, and Bp Cp | Ep, then Ap Bp Ap Cp | Ep. (4) Trace: If Ap Bp | Ep, then tr[Ap]/p tr[Bp]/p 0 almost surely when conditioning on Ep. (5) Differentiation: Suppose f(z, Ap) g(z, Bp) | Ep where the entries of f and g are analytic functions in z S and S is an open connected subset of C. Suppose for any sequence Cp of deterministic matrices with bounded trace norm we have | tr[Cp(f(z, Ap) g(z, Bp))]| M for every p and z S. Then we have f (z, Ap) g (z, Bp) | Ep for every z S, where the derivatives are taken entrywise with respect to z. (6) Unconditioning: If Ap Bp | Ep, then Ap Bp. (7) Substitution: Let v : Rp p R and f(v(C), C) : Rp p Rp p be a matrix function for matrix C Rp p and p N, that is continuous in the first augment with respect to operator norm. If v(C) a.s. = v(D) such that C is independent to D, then f(v(C), C) f(v(D), C) | C. F.2. Standard ridge resolvents and various extensions In this section, we collect various asymptotic equivalents. Appendix F.2.1 introduces the basic concepts and definitions. The extended equivalents developed in the work of Patil et al. (2022a) are summarized in Appendix F.2.2. 
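Before turning to the specific resolvent results, here is a small simulation of what Definition F.1 and the trace rule in Lemma F.3 (4) assert in practice, using the scaled ridge resolvent equivalence λ(Σ̂ + λIp)^{-1} ≃ (v(−λ; γ)Σ + Ip)^{-1} recorded in Corollary F.5 below. The isotropic design Σ = Ip, the dimensions, and the test matrices C are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulation of lam * (Sigma_hat + lam I)^{-1} ~ (v(-lam; gamma) Sigma + I)^{-1}
# for an isotropic design (Sigma = I), where v solves 1/v = lam + gamma / (1 + v).
n, p, lam = 2000, 800, 0.5
gamma = p / n
b = lam + gamma - 1.0
v = (-b + np.sqrt(b * b + 4.0 * lam)) / (2.0 * lam)       # closed form when H = delta_1

X = rng.standard_normal((n, p))
A = lam * np.linalg.inv(X.T @ X / n + lam * np.eye(p))    # scaled ridge resolvent
B = np.eye(p) / (1.0 + v)                                 # deterministic equivalent (Sigma = I)

# Definition F.1: tr[C (A - B)] -> 0 for any C with bounded trace norm.
C1 = np.eye(p) / p                                        # ||C1||_tr = 1, gives the trace rule
u = rng.standard_normal(p)
C2 = np.outer(u, u) / (u @ u)                             # rank-one test matrix, ||C2||_tr = 1
print(abs(np.trace(C1 @ (A - B))))                        # small, and shrinks as n, p grow
print(abs(np.trace(C2 @ (A - B))))
```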
Based on results in Appendices F.2.1 and F.2.2, we prove some useful deterministic equivalent relations in Appendix F.2.3, which are subsequently used in the proof of Lemma D.3 (Lemmas D.6 and D.7). Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation F.2.1. STANDARD RIDGE RESOLVENTS The following lemma provides a deterministic equivalent for the standard ridge resolvent and implies Corollary F.5. It is adapted from Theorem 1 of Rubio & Mestre (2011). See also Theorem 3 of Dobriban & Sheng (2021). Lemma F.4 (Deterministic equivalent for standard ridge resolvent). Suppose xi Rp, 1 i n, are i.i.d. random vectors such that each xi = ziΣ1/2, where zi is a random vector consisting of i.i.d. entries zij, 1 j p, satisfying E[zij] = 0, E[z2 ij] = 1, and E[|zij|8+α] Mα for some constants α > 0 and Mα < , and Σ Rp p is a positive semidefinite matrix satisfying 0 Σ rmax Ip for some constant rmax < (independent of p). Let X Rn p the concatenated matrix with x i , 1 i n, as rows, and let bΣ Rp p denote the random matrix X X/n. Let γ := p/n. Then, for z C+, as n, p such that 0 < lim inf γ lim sup γ < , we have (bΣ z Ip) 1 (c(e(z; γ))Σ z Ip) 1, (61) where the scalar c(e(z; γ)) is defined in terms of another scalar e(z; γ) by the equation c(e(z; γ)) = 1 1 + γe(z; γ), (62) and e(z; γ) is the unique solution in C+ to the fixed-point equation e(z; γ) = tr[Σ(c(e(z; γ))Σ z Ip) 1]/p. (63) Note that both the scalars c(e(z; γ)) and e(z; γ) also implicitly depend on Σ. For notation brevity, we do not always explicitly indicate this dependence. However, we will be explicit in such dependence for certain extensions to follow. Additionally, observe that one can eliminate e(z; γ) in the statement of Lemma F.4 by combining (62) and (63) so that for z C+, one has (bΣ z Ip) 1 (c(z; γ)Σ z Ip) 1, where c(z) is the unique solution in C to the fixed-point equation 1 c(z; γ) = 1 + γ tr[Σ(c(z; γ)Σ z Ip) 1]/p. The following corollary is a simple consequence of Lemma F.4, which supplies a deterministic equivalent for the (regularization) scaled ridge resolvent. We will work with a real regularization parameter λ from here on. Corollary F.5 (Deterministic equivalent for scaled ridge resolvent). Assume the setting of Lemma F.4. For λ > 0, we have λ(bΣ + λIp) 1 (v( λ; γ)Σ + Ip) 1, where v( λ; γ) > 0 is the unique solution to the fixed-point equation 1 v( λ; γ) = λ + γ Z r 1 + v( λ; γ)r d Hn(r). (64) Here Hn is the empirical distribution (supported on R 0) of the eigenvalues of Σ. As a side note, the parameter v( λ; γ) in Corollary F.5 is also the companion Stieltjes transform of the spectral distribution of the sample covariance matrix bΣ, which is also the Stieltjes transform of the spectral distribution of the gram matrix XX /n. The following lemma uses Corollary F.5 along with calculus of deterministic equivalents (from Lemma F.3), and provides deterministic equivalents for resolvents needed to obtain asymptotic bias and variance of standard ridge regression. It is adapted from Lemma S.6.10 of Patil et al. (2022b). Lemma F.6 (Deterministic equivalents for ridge resolvents associated with generalized bias and variance). Suppose xi Rp, 1 i n, are i.i.d. random vectors with each xi = ziΣ1/2, where zi Rp is a random vector that contains i.i.d. random variables zij, 1 j p, each with E[zij] = 0, E[z2 ij] = 1, and E[|zij|8+α] Mα for some constants α > 0 and Mα < , and Σ Rp p is a positive semidefinite matrix with rmin Ip Σ rmax Ip for some constants rmin > 0 and rmax < (independent of p). 
Let X Rn p be the concatenated random matrix with xi, 1 i n, as its rows, and define bΣ := X X/n Rp p. Let γ := p/n. Then, for λ > 0, as n, p with 0 < lim inf γ lim sup γ < , the following statements hold: Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation (1) Bias of ridge regression: λ2(bΣ + λIp) 1A(bΣ + λIp) 1 (v( λ; γ, Σ)Σ + Ip) 1(evb( λ; γ, Σ, A)Σ + A)(v( λ; γ, Σ)Σ + Ip) 1. (65) (2) Variance of ridge regression: (bΣ + λIp) 2 bΣA evv( λ; γ, Σ)(v( λ; γ, Σ)Σ + Ip) 2ΣA. (66) Here v( λ; γ, Σ) > 0 is the unique solution to the fixed-point equation 1 v( λ; γ, Σ) = λ + Z γr 1 + v( λ; γ, Σ)r d Hn(r; Σ), (67) and evb( λ; γ, Σ) and evv( λ; γ, Σ) are defined through v( λ; γ, Σ) by the following equations: evb( λ; γ, Σ, A) = γ tr[AΣ(v( λ; γ, Σ)Σ + Ip) 2]/p v( λ; γ, Σ) 2 Z γr2(1 + v( λ; γ, Σ)r) 2 d Hn(r; Σ) , (68) evv( λ; γ, Σ) 1 = v( λ; γ, Σ) 2 Z γr2(1 + v( λ; γ, Σ)r) 2 d Hn(r; Σ), (69) where Hn( ; Σ) is the empirical distribution (supported on [rmin, rmax]) of the eigenvalues of Σ. Though Lemma F.6 states the dependency explicitly, we will simply write Hn(r), v( λ; γ), evb( λ; γ, A), and evv( λ; γ) to denote Hn(r; Σ), v( λ; γ, Σ), evb( λ; γ, Σ, A), and evv( λ; γ, Σ), respectively, for simplifying notations when it is clear from the context. When A = Σ, we simply write evb( λ; γ) = evb( λ; γ, A). The moment assumption of order 8 + α for some α > 0 in the above lemma can be relaxed to only requiring the existence of moments of order 4 + α by a truncation argument as in the proof of Theorem 6 of Hastie et al. (2022) (in Appendix A.4 therein). We omit the details and refer the readers to Hastie et al. (2022). F.2.2. EXTENDED RIDGE RESOLVENTS The lemma below extends the deterministic equivalents of the ridge resolvents in Lemma F.6 to provide deterministic equivalents for Tikhonov resolvents, where the regularization matrix λIp is replaced with λ(Ip + C) and C Rp p is an arbitrary positive semidefinite random matrix. Lemma F.7 (Tikhonov resolvents, adapted from Patil et al. (2022a)). Suppose the conditions in Lemma F.6 holds. Let C Rp p be any symmetric and positive semidefinite random matrix with uniformly bounded operator norm in p that is independent to X for all n, p N, and let N = (bΣ + λIp) 1. Then the following statements hold: (1) Tikhonov resolvent: λ(N 1 + λC) 1 eΣ 1 C . (70) (2) Bias of Tikhonov regression: λ2(N 1 + λC) 1Σ(N 1 + λC) 1 eΣ 1 C (evb( λ; γ, ΣC)Σ + Σ)eΣ 1 C . (71) (3) Variance of Tikhonov regression: (N 1 + λC) 1 bΣ(N 1 + λC) 1Σ evv( λ; γ, ΣC)eΣ 1 C ΣeΣ 1 C Σ, (72) where ΣC = (Ip + C) 1 2 Σ(Ip + C) 1 2 , eΣC = v( λ; γ, ΣC)Σ + Ip + C. Here, v( λ; γ, ΣC), evb( λ; γ, ΣC), and evv( λ; γ, ΣC) defined by (67)-(69) simplify to 1 v( λ; γ, ΣC) = λ + γ tr[(v( λ; γ, ΣC)Σ + Ip + C) 1Σ]/p, (73) 1 evv( λ; γ, ΣC) = 1 v( λ; γ, ΣC)2 γ tr[(v( λ; γ, ΣC)Σ + Ip + C) 2Σ2]/p, (74) evb( λ; γ, ΣC) = γ tr[(v( λ; γ, ΣC)Σ + Ip + C) 2Σ2]/p evv( λ; γ, ΣC). (75) If γ ϕ (0, ), then γ in (1)-(3) can be replaced by ϕ, with the empirical distribution Hn of eigenvalues replaced by the limiting distribution H. Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation F.2.3. RESOLVENTS FOR TRAINING ERROR The following lemma concerns the deterministic equivalents of quantities that arise in the proof for Lemma D.3. Lemma F.8 (Resolvents for in-sample error). Suppose the conditions in Lemma F.6 holds. Let C Rp p be any symmetric and positive semidefinite random matrix with uniformly bounded operator norm in p that is independent to X for all n, p N. 
Let I1, I2 SRS Ik and bΣj be the sample covariance matrix computed using k observations of X indexed by Ij (j = 0, 1). For j = 1, 2, let Mj = (bΣj + λIp) 1 be the resolvent for bΣj. Then, as k, n, p such that p/n ϕ (0, ) and p/k ϕs [ϕ, ), the following statements hold: (1) Independent product with sample covariance: (2) Bias term 1: λ2M1CM2 (v( λ; ϕs)Σ + Ip) 1 (ev( λ; ϕ, ϕs, C)Σ + C) (v( λ; ϕs)Σ + Ip) 1 . (76) (3) Bias term 2: M1 bΣ1 2M2C evv( λ; ϕ, ϕs)(v( λ; ϕs)Σ + Ip) 2ΣC, (77) (4) Variance term 1: M1 bΣ1 2 Ip (v( λ; ϕs)Σ + Ip) 1, (78) (5) Variance term 2: M1 bΣ1 2M2 bΣ1 2 ϕs v( λ; ϕs) ϕs ϕ ϕs λevv( λ; ϕ, ϕs) (v( λ; ϕs)Σ + Ip) 1Σ λevv( λ; ϕ, ϕs)(v( λ; ϕs)Σ + Ip) 2Σ, (79) ev( λ; ϕ, ϕs, C) = lim k,n,p ϕ tr[CΣ(v( λ; ϕs)Σ + Ip) 2]/p v( λ; ϕs) 2 ϕ Z r2 (1 + v( λ; ϕs)r)2 d H(r) , evv( λ; ϕ, ϕs) := 1 v( λ; ϕs) 2 ϕ Z r2 (1 + v( λ; ϕs)r)2 d H(r) . Proof of Lemma F.8. We split the proof into different parts. Part (1) Note that tr(bΣj) = P i Ij xi 2 2/k and xi = z i Σzi. By Lemma G.4, we have that tr(bΣj)/p tr(Σ)/p a.s. 0. Since C op is uniformly upper bounded and 1 p tr(C bΣj) 1 p| tr(C(bΣj Σ))| 1 p C op | tr(bΣj Σ)|, it follows that 1 p tr(C bΣj) 1 p tr(CΣ) a.s. 0, which implies that C bΣj CΣ Part (2) This is a direct consequence of Patil et al. (2022a, Part (c) of the proof for Lemma S.24). Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation Part (3) This is a direct consequence of Patil et al. (2022a, Part (c) of the proof for Lemma S.25). Part (4) Let i0 = |I1 I2|. Conditioning on bΣ1 2 and i0, from Definition F.2 and Lemma F.7 (1) we have λM1 M det M1 2,i0 := k k i0 (v1Σ + Ip + C1) 1 i0, where v1 = v( λ; γ1, ΣC1), ΣC1 = (Ip + C1) 1 2 Σ(Ip + C1) 1 2 , C1 = i0(λ(k i0)) 1M 1 1 2, and γ1 = p/(k i0). Here the subscripts of v1 and C1 are related to the aspect ratio γ1. Because lim sup bΣ1 2 op rmax(1 + p almost surely as k, n, p such that p/n ϕ and p/k ϕs, by Lemma F.3 (3), we have M1 bΣ1 2 λ 1M det M1 2,i0 bΣ1 2 | i0. i0 (M 1 1 2 + λC0) 1 bΣ1 2 | i0, where C0 = (k i0)/i0 (v1Σ + Ip). Define ΣC0 = (I + C0) 1 2 Σ(I + C0) 1 2 . Conditioning on i0, by Lemma F.7 (1), we have tr[ΣC1(v1ΣC1 + Ip) 1] = tr[Σ(v1Σ + Ip + C1) 1] Σ M 1 1 2 + λ(k i0) i0 (v1Σ + Ip) 1# a.s. = k i0 Σ v0Σ + Ip + k i0 i0 (v1Σ + Ip) 1# Σ i0 k i0 v0 + v1 Σ + k k i0 Ip where v0 = v( λ; γ0, ΣC0)and γ0 = p/i0. Note that the fixed-point solution v0 depends on v1. The fixed-point equations reduce to 1 v0 = λ + γ0 tr[ΣC0(v0ΣC0 + Ip) 1]/p = λ + p k v0 + k i0 1 v1 = λ + γ1 tr[ΣC1(v1ΣC1 + Ip) 1]/p = λ + p k v0 + k i0 almost surely. Note that the solution (v0, v1) to the above equations is a pair of positive numbers and does not depend on samples. If (v0, v1) is a solution to the above system, then (v1, v0) is also a solution. Thus, any solution to the above equations must be unique. On the other hand, since v0 = v1 = v( λ; p/k) satisfies the above equations, it is the unique solution. By Lemma F.3 (7), we can replace v( λ; γ1, ΣC1) by the solution v0 = v1 = v( λ; p/k) of the above system, which does not depend on samples. Thus, M1 bΣ1 2 = k i0 (M 1 1 2 + λC ) 1 bΣ1 2 | i0, (80) where C = (k i0)/i0 (v( λ; p/k)Σ + Ip). Again from Lemma F.7 (1) we have (M 1 1 2 + λC ) 1 bΣ1 2 = Ip λ(M 1 1 2 + λC ) 1(Ip + C ) Ip (v( λ; p/k)Σ + Ip + C ) 1(Ip + C ) | i0 k (Ip (v( λ; p/k)Σ + Ip) 1). Finally, from Lemma F.3 (6), we have M1 bΣ1 2 Ip (v( λ; p/k)Σ + Ip) 1 Ip (v( λ; ϕs)Σ + Ip) 1. Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation Part (5) From Patil et al. 
(2022a, Part (c) of the proof for Lemma S.2.5), we have that M1 bΣ1 2M2 bΣ1 2 k2 i2 0 (M 1 1 2 + λC ) 1 bΣ1 2(M 1 1 2 + λC ) 1 bΣ1 2, where M1 2 = (bΣ1 2 + λIp) 1 and C = (k i0)/i0(v( λ; ϕs)Σ + Ip). Since (M 1 1 2 + λC ) 1 bΣ1 2 = Ip λ(M 1 1 2 + λC ) 1(Ip + C ), M1 bΣ1 2M2 bΣ1 2 i2 0 (M 1 1 2 + λC ) 1 bΣ1 2 λk2 i2 0 (M 1 1 2 + λC ) 1 bΣ1 2(M 1 1 2 + λC ) 1(Ip + C ) i2 0 (Ip λ(M 1 1 2 + λC ) 1(Ip + C )) λk2 i2 0 (M 1 1 2 + λC ) 1 bΣ1 2(M 1 1 2 + λC ) 1(Ip + C ) (81) From Lemma F.7 (1) and (3), we have that λ(M 1 1 2 + λC ) 1 ϕ ϕs (v( λ; ϕs)Σ + Ip) 1 (M 1 1 2 + λC ) 1 bΣ1 2(M 1 1 2 + λC ) 1 ϕ2 ϕ2s evv( λ; ϕ, ϕs)(v( λ; ϕs)Σ + Ip) 2Σ. Combing the above two equivalents, the expression in (81) can be further simplified as: M1 bΣ1 2M2 bΣ1 2 ϕs v( λ; ϕs) ϕs ϕ ϕs λevv( λ; ϕ, ϕs) (v( λ; ϕs)Σ + Ip) 1Σ λevv( λ; ϕ, ϕs)(v( λ; ϕs)Σ + Ip) 2Σ. F.3. Analytic properties of associated fixed-point equations In this section, we gather results on the properties of the fixed-point solution v( λ; ϕ) defined in (64). The following lemma provides the existence and uniqueness of the solution v( λ; ϕ). The properties of the derivatives in Lemma F.9 are related to the properties of evv( λ; ϕ) defined in Lemma F.10, which equals f (x), where the function f is defined in (82). Lemma F.9 (Properties of the solution to the fixed-point equation, adapted from Patil et al. (2022a)). Let λ, ϕ, a > 0 and b < be real numbers. Let P be a probability measure supported on [a, b]. Define the function f such that x ϕ Z r 1 + rx d P(r) λ. (82) Then, the following properties hold: (1) For λ = 0 and ϕ (1, ), there is a unique x0 (0, ) such that f(x0) = 0. The function f is positive and strictly decreasing over (0, x0) and negative over (x0, ), with limx 0+ f(x) = and limx f(x) = 0. (2) For λ > 0 and ϕ (0, ), there is a unique xλ 0 (0, ) such that f(xλ 0) = 0. The function f is positive and strictly decreasing over (0, xλ 0) and negative over (xλ 0, ), with limx 0+ f(x) = and limx f(x) = λ. (3) For λ = 0 and ϕ (1, ), f is differentiable on (0, ) and its derivative f is strictly increasing over (0, x0), with limx 0+ f (x) = and f (x0) < 0. (4) For λ > 0 and ϕ (0, ), f is differentiable on (0, ) and its derivative f is strictly increasing over (0, ), with limx 0+ f (x) = and f (xλ 0) < 0. Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation The continuity and limiting behavior of the function ϕ 7 v( λ; ϕ) is given for ridge regression (λ > 0) in Lemma F.10 and for ridgeless regression (λ = 0) in Lemma F.11. Lemma F.10 (Continuity in the aspect ratio for ridge regression, adapted from Patil et al. (2022a)). Let λ, a > 0 and b < be real numbers. Let P be a probability measure supported on [a, b]. Consider the function v( λ; ) : ϕ 7 v( λ; ϕ), over (0, ), where v( λ; ϕ) > 0 is the unique solution to the fixed-point equation 1 v( λ; ϕ) = λ + ϕ Z r 1 + rv( λ; ϕ) d P(r) (83) Then, the following properties hold: (1) The range of the function v( λ; ) is a subset of (0, λ 1). (2) The function v( λ; ) is continuous and strictly decreasing over (0, ). Furthermore, limϕ 0+ v( λ; ϕ) = λ 1, and limϕ v( λ; ϕ) = 0. (3) The function evv( λ; ) : ϕ 7 evv( λ; ϕ), where evv( λ; ϕ) = v( λ; ϕ) 2 Z ϕr2(1 + rv( λ; ϕ)) 2 d P(r) 1 , is positive and continuous over (0, ). Furthermore, limϕ 0+ evv( λ; ϕ) = λ 2, and limϕ evv( λ; ϕ) = 0. (4) The function evb( λ; ) : ϕ 7 evb( λ; ϕ), where evb( λ; ϕ) = evv( λ; ϕ) Z ϕr2(1 + v( λ; ϕ)r) 2 d P(r), is positive and continuous over (0, ). Furthermore, limϕ 0+ evb( λ; ϕ) = limϕ evb( λ; ϕ) = 0. 
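The limiting behavior described in Lemma F.10 can be visualized with a short computation. The sketch below assumes P = δ1 (so the fixed point reduces to a quadratic with a closed-form root) and evaluates v(−λ; ϕ), ṽv(−λ; ϕ), and ṽb(−λ; ϕ) on a grid of ϕ; the grid and the helper name are our own illustrative choices.

```python
import numpy as np

def ridge_functionals(lam, phi):
    """v(-lam; phi), tilde-v_v(-lam; phi), tilde-v_b(-lam; phi) for P = delta_1.

    Under P = delta_1 the fixed point reduces to lam*v^2 + (lam + phi - 1)*v - 1 = 0.
    """
    b = lam + phi - 1.0
    v = (-b + np.sqrt(b * b + 4.0 * lam)) / (2.0 * lam)
    evv = 1.0 / (1.0 / v**2 - phi / (1.0 + v) ** 2)   # item (3) in Lemma F.10
    evb = evv * phi / (1.0 + v) ** 2                  # item (4) in Lemma F.10
    return v, evv, evb

lam = 1.0
for phi in [1e-3, 0.1, 1.0, 10.0, 1e3]:
    v, evv, evb = ridge_functionals(lam, phi)
    print(f"phi={phi:8.3f}:  v={v:.4f}  vv={evv:.4f}  vb={evb:.4f}")
# As phi -> 0+:  v -> 1/lam, tilde-v_v -> 1/lam^2, tilde-v_b -> 0.
# As phi -> oo:  v -> 0,     tilde-v_v -> 0,       tilde-v_b -> 0,
# and v is strictly decreasing in phi, as stated in Lemma F.10.
```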
Lemma F.11 (Continuity in the aspect ratio for ridgeless regression, adapted from Patil et al. (2022b)). Let a > 0 and b < be real numbers. Let P be a probability measure supported on [a, b]. Consider the function v(0; ) : ϕ 7 v(0; ϕ), over (1, ), where v(0; ϕ) > 0 is the unique solution to the fixed-point equation 1 ϕ = Z v(0; ϕ)r 1 + v(0; ϕ)r d P(r). (84) Then, the following properties hold: (1) The function v(0; ) is continuous and strictly decreasing over (1, ). Furthermore, limϕ 1+ v(0; ϕ) = , and limϕ v(0; ϕ) = 0. (2) The function ϕ 7 (ϕv(0; ϕ)) 1 is strictly increasing over (1, ). Furthermore, limϕ 1+(ϕv(0; ϕ)) 1 = 0 and limϕ (ϕv(0; ϕ)) 1 = 1. (3) The function evv(0; ) : ϕ 7 evv(0; ϕ), where evv(0; ϕ) = v(0; ϕ) 2 ϕ Z r2(1 + rv(0; ϕ)) 2 d P(r) 1 , is positive and continuous over (1, ). Furthermore, limϕ 1+ evv(0; ϕ) = , and limϕ evv(0; ϕ) = 0. (4) The function evb(0; ) : ϕ 7 evb(0; ϕ), where evb(0; ϕ) = evv(0; ϕ) Z r2(1 + v(0; ϕ)r) 2 d P(r), is positive and continuous over (1, ). Furthermore, limϕ 1+ evb(0; ϕ) = , and limϕ evb(0; ϕ) = 0. Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation The continuity and differentiabilty of the function λ 7 v( λ; ϕ) on a closed interval [0, λmax] for some constant λmax is given for ϕ (1, ) in Lemma F.12 adapted from Patil et al. (2022b). This ensures that v(0; ϕ) = limλ 0+ v( λ; ϕ) is well-defined for ϕ > 1 and also implies that related functions are bounded. Lemma F.12 (Differentiability in the regularization parameter). Let 0 < a b < be real numbers. Let P be a probability measure supported on [a, b]. Let ϕ > 0 be a real number. Let Λ = [0, λmax] for some constant λmax (0, ). For λ Λ, let v( λ; ϕ) > 0 denote the solution to the fixed-point equation 1 v( λ; ϕ) = λ + ϕ Z r v( λ; ϕ)r + 1 d P(r). When λ = 0 and ϕ (0, 1], v( λ; ϕ) := + . Then, the following properties hold: (1) (Monotonicity) For ϕ (0, ), the function λ 7 v( λ; ϕ) is strictly decreasing in λ [0, ). (2) (Differentiability) For ϕ (1, ), the function λ 7 v( λ; ϕ) is twice differentiable over Λ. (3) (Boundedness of the second derivative) For ϕ (1, ), v( λ; ϕ), / λ[v( λ; ϕ)], and 2/ λ2[v( λ; ϕ)] are bounded over Λ. Proof of Lemma F.12. Start by re-writing the fixed-point equation as λ = 1 v( λ; ϕ) ϕ Z r v( λ; ϕ)r + 1 d P(r). Define a function f by x ϕ Z r xr + 1 d P(r). Observe that v( λ; ϕ) = f 1(λ). We next study various properties of f and prove the different parts in the statment. Part (1) Properties of f and f 1: Observe that x ϕ Z r xr + 1 d P(r) = 1 1 ϕ Z xr xr + 1 d P(r) . The function g : x 7 1/x is positive and strictly decreasing over (0, ) with limx 0+ g(x) = and limx g(x) = 0, while the function h : x 7 1 ϕ Z xr xr + 1 d P(r) is strictly decreasing over (0, ) with h(0) = 1 and limx h(x) = 1 ϕ. Thus, there is a unique 0 < x0 < when ϕ > 1 such that h(x0) = 0, and consequently f(x0) = 0; and x0 = + when ϕ (0, 1] such that g(x0) = 0, and consequently f(x0) = 0. Because h and g are positive over [0, x0), f, a product of two positive strictly decreasing functions, is strictly decreasing over (0, x0), with limx 0+ f(x) = and f(x0) = 0. Because f is strictly decreasing over (0, x0), f 1 is strictly decreasing (see, e.g., Problem 2, Chapter 5 of Rudin (1976)). Since f(x0) = 0, f 1(0) = x0, and since limx 0+ f(x) = , limy f 1(y) = 0. Hence, f 1 is strictly decreasing over [0, ) for all ϕ > 0 and bounded above by x0 < for all ϕ > 1. Parts (2) and (3) We will prove the remaining two parts together. 
Proof of Lemma F.12. Start by re-writing the fixed-point equation as

λ = 1/v(−λ; ϕ) − ϕ ∫ r/(v(−λ; ϕ)r + 1) dP(r).

Define a function f by

f(x) = 1/x − ϕ ∫ r/(xr + 1) dP(r).

Observe that v(−λ; ϕ) = f^{-1}(λ). We next study various properties of f and prove the different parts in the statement.

Part (1). Properties of f and f^{-1}: Observe that

f(x) = 1/x − ϕ ∫ r/(xr + 1) dP(r) = (1/x) (1 − ϕ ∫ xr/(xr + 1) dP(r)).

The function g : x ↦ 1/x is positive and strictly decreasing over (0, ∞) with lim_{x→0+} g(x) = ∞ and lim_{x→∞} g(x) = 0, while the function

h : x ↦ 1 − ϕ ∫ xr/(xr + 1) dP(r)

is strictly decreasing over (0, ∞) with h(0) = 1 and lim_{x→∞} h(x) = 1 − ϕ. Thus, when ϕ > 1, there is a unique 0 < x_0 < ∞ such that h(x_0) = 0, and consequently f(x_0) = 0; and when ϕ ∈ (0, 1], x_0 = +∞ in the sense that g(x_0) = 0, and consequently f(x_0) = 0. Because h and g are positive over [0, x_0), f, a product of two positive strictly decreasing functions, is strictly decreasing over (0, x_0), with lim_{x→0+} f(x) = ∞ and f(x_0) = 0. Because f is strictly decreasing over (0, x_0), f^{-1} is strictly decreasing (see, e.g., Problem 2, Chapter 5 of Rudin (1976)). Since f(x_0) = 0, f^{-1}(0) = x_0, and since lim_{x→0+} f(x) = ∞, lim_{y→∞} f^{-1}(y) = 0. Hence, f^{-1} is strictly decreasing over [0, ∞) for all ϕ > 0 and bounded above by x_0 < ∞ for all ϕ > 1.

Parts (2) and (3). We will prove the remaining two parts together.

Properties of f′ and (f^{-1})′: The derivative f′ at x is given by

f′(x) = −1/x² + ϕ ∫ r²/(xr + 1)² dP(r) = −(1/x²) (1 − ϕ ∫ (xr/(xr + 1))² dP(r)).

The function x ↦ 1/x² is positive and strictly decreasing over (0, ∞) with limits ∞ at 0+ and 0 at ∞. On the other hand, the function

x ↦ 1 − ϕ ∫ (xr/(xr + 1))² dP(r)

is strictly decreasing over (0, ∞), equals 1 at x = 0, and is strictly positive at x_0. This follows because for x ∈ [0, x_0],

ϕ ∫ (xr/(xr + 1))² dP(r) ≤ (x_0 b/(x_0 b + 1)) ϕ ∫ xr/(xr + 1) dP(r) < ϕ ∫ xr/(xr + 1) dP(r) ≤ ϕ ∫ x_0 r/(x_0 r + 1) dP(r) = 1, (85)

where the first inequality in the chain above follows as the support of P is [a, b], and the last equality follows since f(x_0) = 0 and x_0 > 0, which implies that 1/x_0 = ϕ ∫ r/(x_0 r + 1) dP(r), or equivalently that 1 = ϕ ∫ x_0 r/(x_0 r + 1) dP(r). Thus, −f′, a product of two positive strictly decreasing functions over (0, x_0), is strictly decreasing there, and in turn, f′ is strictly increasing. Moreover, lim_{x→0+} f′(x) = −∞; when ϕ > 1, f′(x_0) < 0, and when ϕ ∈ (0, 1], f′(x) approaches zero from below as x → +∞. When ϕ > 1, because f′(x) ≠ 0 over (0, x_0), by the inverse function theorem, (f^{-1})′ exists and we have

|(f^{-1})′(f(x))| = 1/|f′(x)| ≤ 1/|f′(x_0)| = x_0² / (1 − ϕ ∫ (x_0 r/(x_0 r + 1))² dP(r)) ≤ x_0² (x_0 b + 1),

where the first inequality uses the fact that |f′(x_0)| < |f′(x)| for x ∈ (0, x_0] (as f′ is strictly increasing and negative there), and the last inequality uses the bound from (85).

Properties of f″ and (f^{-1})″: The second derivative f″ at x is given by

f″(x) = 2/x³ − 2ϕ ∫ r³/(xr + 1)³ dP(r) = (2/x³) (1 − ϕ ∫ (xr/(xr + 1))³ dP(r)).

The rest of the arguments are similar to those in Part (2). The function x ↦ 2/x³ is positive and strictly decreasing over (0, ∞) with limits ∞ at 0+ and 0 at ∞, while the function

x ↦ 1 − ϕ ∫ (xr/(xr + 1))³ dP(r)

is strictly decreasing over (0, ∞), equals 1 at x = 0, and is strictly positive at x_0, as

ϕ ∫ (xr/(xr + 1))³ dP(r) ≤ (x_0 b/(x_0 b + 1))² ϕ ∫ xr/(xr + 1) dP(r) < ϕ ∫ xr/(xr + 1) dP(r) ≤ ϕ ∫ x_0 r/(x_0 r + 1) dP(r) = 1. (86)

It then follows that f″ is positive and strictly decreasing, with lim_{x→0+} f″(x) = ∞; when ϕ > 1, f″(x_0) > 0, and when ϕ ∈ (0, 1], f″(x) approaches zero from above as x → +∞. When ϕ > 1, by the inverse function theorem, we have

(f^{-1})″(f(x)) = −f″(x)/f′(x)³ = 2x³ (1 − ϕ ∫ (xr/(xr + 1))³ dP(r)) / (1 − ϕ ∫ (xr/(xr + 1))² dP(r))³,

which, for x ∈ (0, x_0], is bounded by a finite constant depending only on x_0 and b, using the bounds from (85) and (86). This finishes all the parts and concludes the proof.

G. Helper concentration results

G.1. Size of the intersection of randomly sampled datasets

In this section, we collect various helper results concerned with concentrations and convergences. Below we recall the definition of a hypergeometric random variable, along with its mean and variance. See, e.g., Greene & Wellner (2017) for more related details.

Definition G.1 (Hypergeometric random variable). A random variable X follows the hypergeometric distribution X ∼ Hypergeometric(n, K, N) if its probability mass function is given by

P(X = k) = C(K, k) C(N − K, n − k) / C(N, n), max{0, n + K − N} ≤ k ≤ min{n, K},

where C(·, ·) denotes the binomial coefficient. The expectation and variance of X are given by

E[X] = nK/N, Var(X) = nK(N − K)(N − n) / (N²(N − 1)).
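The overlap |I_1 ∩ I_2| of two independent size-k simple random samples (without replacement) from n observations is exactly the hypergeometric variable of Definition G.1 with K = k and N = n, which is how the next lemma is established. The short simulation below, with our own illustrative sample sizes, checks the mean formula and previews the concentration of the overlap fraction around k/n.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, reps = 100_000, 20_000, 1_000

# |I1 ∩ I2| for two independent size-k simple random samples (without
# replacement) from {1, ..., n} is hypergeometric with population size N = n,
# K = k "good" elements, and k draws; in numpy's parameterization this is
# ngood = k, nbad = n - k, nsample = k.
i0 = rng.hypergeometric(ngood=k, nbad=n - k, nsample=k, size=reps)

print(i0.mean(), k * k / n)           # E[X] = nK/N from Definition G.1
print(np.abs(i0 / k - k / n).max())   # overlap fraction concentrates around k/n
```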
The following lemma characterizes the limiting proportion of shared observations in two simple random samples under proportional asymptotics, when both the subsample size and the full data size tend to infinity; it is adapted from Patil et al. (2022a).

Lemma G.2 (Asymptotic proportions of shared observations). For n ∈ ℕ and k ≤ n, define I_k := {{i_1, i_2, ..., i_k} : 1 ≤ i_1 < i_2 < ... < i_k ≤ n}. Let I_1, I_2 be sampled from I_k by simple random sampling with replacement (SRSWR), define the random variable i_0^{SRSWR} := |I_1 ∩ I_2| to be the number of shared samples, and define i_0^{SRSWOR} accordingly when I_1, I_2 are sampled by simple random sampling without replacement (SRSWOR). Let {k_m}_{m=1}^∞ and {n_m}_{m=1}^∞ be two sequences of positive integers such that n_m is strictly increasing in m and n_m^ν ≤ k_m ≤ n_m for some constant ν ∈ (0, 1). Then,

i_0^{SRSWR}/k_m − k_m/n_m → 0 almost surely, and i_0^{SRSWOR}/k_m − k_m/n_m → 0 almost surely.

G.2. Convergence of random linear and quadratic forms

In this section, we collect helper lemmas on the concentration of linear and quadratic forms of random vectors. The following lemma provides the concentration of a linear form of a random vector with independent components. It follows from a moment bound in Lemma 7.8 of Erdős & Yau (2017), along with the Borel–Cantelli lemma, and is adapted from Lemma S.8.5 of Patil et al. (2022b).

Lemma G.3 (Concentration of linear form with independent components). Let z_p ∈ ℝ^p be a sequence of random vectors with i.i.d. entries z_{pi}, i = 1, ..., p, such that for each i, E[z_{pi}] = 0, E[z_{pi}²] = 1, and E[|z_{pi}|^{4+α}] ≤ M_α for some α > 0 and constant M_α < ∞. Let a_p ∈ ℝ^p be a sequence of random vectors independent of z_p such that lim sup_p ‖a_p‖²_2/p ≤ M_0 almost surely for a constant M_0 < ∞. Then, a_p^⊤ z_p / p → 0 almost surely as p → ∞.

The following lemma provides the concentration of a quadratic form of a random vector with independent components. It follows from a moment bound in Lemma B.26 of Bai & Silverstein (2010), along with the Borel–Cantelli lemma, and is adapted from Lemma S.8.6 of Patil et al. (2022b).

Lemma G.4 (Concentration of quadratic form with independent components). Let z_p ∈ ℝ^p be a sequence of random vectors with i.i.d. entries z_{pi}, i = 1, ..., p, such that for each i, E[z_{pi}] = 0, E[z_{pi}²] = 1, and E[|z_{pi}|^{4+α}] ≤ M_α for some α > 0 and constant M_α < ∞. Let D_p ∈ ℝ^{p×p} be a sequence of random matrices independent of z_p such that lim sup_p ‖D_p‖_op ≤ M_0 almost surely for some constant M_0 < ∞. Then, z_p^⊤ D_p z_p / p − tr[D_p]/p → 0 almost surely as p → ∞.
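A quick Monte Carlo check of Lemma G.4 (with our own choice of test matrices; the almost-sure statement is, of course, about the p → ∞ limit): for a random vector with standardized i.i.d. entries and an independent matrix with bounded operator norm, the centered quadratic form z^⊤ D z / p − tr[D]/p shrinks as p grows.

```python
import numpy as np

rng = np.random.default_rng(3)

def quad_form_gap(p):
    """Return |z' D z / p - tr(D) / p| for z with i.i.d. N(0, 1) entries and an
    independent symmetric D with spectrum in [0, 2] (so ||D||_op <= 2)."""
    z = rng.standard_normal(p)
    Q, _ = np.linalg.qr(rng.standard_normal((p, p)))   # random orthogonal basis
    D = Q @ np.diag(rng.uniform(0.0, 2.0, p)) @ Q.T
    return abs(z @ D @ z / p - np.trace(D) / p)

for p in [200, 800, 3200]:
    print(p, quad_form_gap(p))    # the gap decays, roughly at the rate p^{-1/2}
```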
G.3. Convergence of Cesàro-type mean and max for triangular arrays

In this section, we collect a helper lemma on deducing almost sure convergence of a Cesàro-type mean from almost sure convergence of the original sequence, which is adapted from Patil et al. (2022a).

Lemma G.5 (Convergence of conditional expectation). For n ∈ ℕ, suppose {R_{n,ℓ}}_{ℓ=1}^{N_n} is a set of N_n random variables defined over the probability space (Ω, F, P), with 1 < N_n < ∞ almost surely. If there exists a constant c such that R_{n,p_n} → c almost surely for all deterministic sequences {p_n ∈ [N_n]}_{n=1}^∞, then the following statements hold:

(1) max_{ℓ∈[N_n]} |R_{n,ℓ} − c| → 0 almost surely.

(2) N_n^{-1} Σ_{ℓ=1}^{N_n} R_{n,ℓ} → c almost surely.

Lemma G.6 (Convergence of conditional expectation over simple random sampling). For n ∈ ℕ and k = k_n ∈ K_n, let M_n = |I_k|, and suppose {R_{n,1}(I_ℓ)}_{ℓ∈[M_n]} and {R_{n,2}(I_m, I_ℓ)}_{m,ℓ∈[M_n], m≠ℓ} are sets of M_n and M_n(M_n − 1) random variables, such that R_{n,2}(I_m, I_ℓ) ≤ (R_{n,1}(I_m) + R_{n,1}(I_ℓ))/2. Then, the following holds:

(1) If there exists a constant c_1 such that R_{n,1}(I_{ℓ_n}) → c_1 almost surely for all deterministic sequences {ℓ_n ∈ [M_n]}_{n=1}^∞, then max_{ℓ∈[M_n]} |R_{n,1}(I_ℓ) − c_1| → 0 almost surely and E_{I_ℓ ∼ SRS(I_k)}[|R_{n,1}(I_ℓ) − c_1|] → 0 almost surely.

(2) Further, if there exists a constant c_2 such that R_{n,2}(I_{m_n}, I_{ℓ_n}) → c_2 almost surely for all sequences of simple random samples {(I_{m_n}, I_{ℓ_n}) ∼ SRS(I_{k_n})}_{n=1}^∞, then max_{(I_m, I_ℓ)} |R_{n,2}(I_m, I_ℓ) − c_2| → 0 almost surely and E_{(I_m, I_ℓ) ∼ SRS(I_k)}[|R_{n,2}(I_m, I_ℓ) − c_2|] → 0 almost surely.

Proof of Lemma G.6. We split the proof into two parts.

Part (1): The conclusion directly follows from Lemma G.5.

Part (2): Observe that

R_{n,2}(I_m, I_ℓ) ≤ (R_{n,1}(I_m) + R_{n,1}(I_ℓ))/2. (87)

From Part (1), we have that E_I[R_{n,1}(I)] → c_1 almost surely, where the expectation is taken with respect to the uniform distribution over I_k. From the condition, we have R_{n,2}(I_m, I_ℓ) → c_2 almost surely for any I_m, I_ℓ ∼ SRS(I_k). Then, by Pratt's lemma (see, e.g., Gut, 2005, Theorem 5.5), the conclusion follows.

H. GCV correction for arbitrary M

Note that the asymptotic limit of the training error for arbitrary M ∈ ℕ is given by

T^λ_M = 2E^λ_{k,2} − E^λ_{k,1} + (2/M)(E^λ_{k,1} − E^λ_{k,2}), where E^λ_{k,j} = c_{k,M,j} T^λ_{k,j} + (1 − c_{k,M,j}) R^λ_{k,j}.

Here, c_{k,M,j} is the limiting proportion of the number of distinct observations appearing in j simple random samples to the number of distinct observations appearing in M simple random samples of size k. Roughly speaking, the proportion of observations unseen across M simple random samples of size k is (n − k)^M/n^M, and thus

c_{k,M,j} = lim (1 − (n − k)^j/n^j) / (1 − (n − k)^M/n^M) = (1 − (1 − ϕ/ϕs)^j) / (1 − (1 − ϕ/ϕs)^M).

From this expression, it is clear that the GCV asymptotics will not, in general, match the risk of the estimator. The form of the expression also suggests an approach for correcting the GCV estimator for general M, which we discuss below. What we prove in Theorem 3.1 is that the difference between the two asymptotics vanishes as M → ∞. We expect the difference to scale as 1/M. The explicit analysis of the finite-ensemble effect requires carefully analyzing the coefficients c_{k,M,j}, and even for the isotropic design, the expression for the GCV asymptotics for general M appears to be very involved. It is in principle possible to perform this analysis, but we did not pursue it further in the paper given our primary focus on the full-ensemble estimator. Numerically, from Figure 4, we observe that the bias is small for moderate M (e.g., M = 10) and a reasonable data model with moderate SNR (SNR = 0.6). In general, we expect this to be the case for either moderate k or M and typical real-world SNR ranges. We will consider adding more numerical illustrations of the finite-ensemble effect under different settings in the revision.

We aim to define the corrected GCV as

gcv^λ_{k,M} := (a_1 T^λ_{k,M} + a_2 R^λ_{k,M}) / D^λ_{k,M},

where a_1 and a_2 are two unknown parameters to be determined. To determine the unknown parameters, we need to match the limiting GCV with the true risk. Since

T^λ_1(ϕ, ϕs) = D^λ_1(ϕ, ϕs) R^λ_1(ϕ, ϕs), and T^λ_2(ϕ, ϕs) = b_1 R^λ_1(ϕ, ϕs) + b_2 R^λ_2(ϕ, ϕs),

for some known constants b_1 and b_2 that can be derived in the proof of Proposition 3.3, substituting these relations and the coefficients c_{k,M,j} into the limit of the corrected GCV, and requiring it to equal the limiting risk of the M-ensemble (a combination of R^λ_1(ϕ, ϕs) and R^λ_2(ϕ, ϕs)), yields the conditions

a_1 (c_{k,M,1} D^λ_1(ϕ, ϕs) + 1 − c_{k,M,1}) + a_2 D^λ_M(ϕ, ϕs) = D^λ_M(ϕ, ϕs),
a_1 (c_{k,M,2} b_1 R^λ_1(ϕ, ϕs) + (1 − c_{k,M,2} + b_2) R^λ_2(ϕ, ϕs)) + a_2 D^λ_M(ϕ, ϕs) R^λ_2(ϕ, ϕs) = D^λ_M(ϕ, ϕs) R^λ_2(ϕ, ϕs).

Solving the above linear system for a_1 > 0 and a_2 ∈ ℝ gives the correct weights for defining a consistent GCV estimate. The solutions depend on D^λ_M(ϕ, ϕs), R^λ_1(ϕ, ϕs), and R^λ_2(ϕ, ϕs). For the denominator, (12) is a consistent estimate of D^λ_M(ϕ, ϕs). For the prediction risks at M = 1 and M = 2, we can use out-of-bag observations to estimate R^λ_1(ϕ, ϕs) and R^λ_2(ϕ, ϕs).
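The following sketch makes the correction above concrete: it computes the coefficients c_{k,M,j} and solves the 2×2 linear system for (a_1, a_2) with numpy. The numerical values of D^λ_1, D^λ_M, R^λ_1, R^λ_2, b_1, b_2 below are illustrative placeholders only (in practice they would come from (12) and out-of-bag estimates, as described above), and the function names are ours, not the paper's.

```python
import numpy as np

def c_kMj(phi, phi_s, M, j):
    """Limiting proportion c_{k,M,j} = (1 - (1 - phi/phi_s)^j) / (1 - (1 - phi/phi_s)^M)."""
    q = 1.0 - phi / phi_s
    return (1.0 - q ** j) / (1.0 - q ** M)

def gcv_correction_weights(phi, phi_s, M, D1, DM, R1, R2, b1, b2):
    """Solve the 2x2 system in Appendix H for (a1, a2).  The inputs D1, DM, R1,
    R2, b1, b2 must be supplied externally (estimated or known constants)."""
    c1, c2 = c_kMj(phi, phi_s, M, 1), c_kMj(phi, phi_s, M, 2)
    A = np.array([
        [c1 * D1 + 1.0 - c1,                   DM],
        [c2 * b1 * R1 + (1.0 - c2 + b2) * R2,  DM * R2],
    ])
    rhs = np.array([DM, DM * R2])
    return np.linalg.solve(A, rhs)            # returns (a1, a2)

# Placeholder values only, chosen so that the system is well-posed.
a1, a2 = gcv_correction_weights(phi=0.1, phi_s=0.5, M=10,
                                D1=0.8, DM=0.7, R1=1.4, R2=1.2, b1=0.5, b2=0.3)
print(a1, a2)
```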
I. Additional details for numerical experiments

The covariance matrix of an auto-regressive process of order 1 (AR(1)) is given by Σ_ar1, where (Σ_ar1)_{ij} = ρ_AR1^{|i−j|} for some parameter ρ_AR1 ∈ (0, 1), and the AR(1) data model is defined as:

y_i = x_i^⊤ β_0 + ϵ_i, x_i ∼ N(0, Σ_ar1), β_0 = (1/5) Σ_{j=1}^5 w^{(j)}, ϵ_i ∼ N(0, σ²), (M-AR1)

where w^{(j)} is the eigenvector of Σ_ar1 associated with the top j-th eigenvalue r^{(j)}. From Grenander & Szegő (1958, pp. 69–70), the top j-th eigenvalue can be written as r^{(j)} = (1 − ρ²_AR1)/(1 − 2ρ_AR1 cos θ_jp + ρ²_AR1) for some θ_jp ∈ ((j − 1)π/(p + 1), jπ/(p + 1)). Then, under model (M-AR1), the signal strength ρ² defined in Assumption 2.2 is 5^{-1}(1 − ρ²_AR1)/(1 − ρ_AR1)², which is the limit of 25^{-1} Σ_{j=1}^5 r^{(j)}. Thus, model (M-AR1), parameterized by the two parameters ρ_AR1 and σ², satisfies Assumptions 2.1–2.2.
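For completeness, here is a short data-generation sketch for model (M-AR1). The function names are ours; the scaling β_0 = (1/5) Σ_{j≤5} w^{(j)} follows the description above, so that β_0^⊤ Σ_ar1 β_0 ≈ (1/5)(1 − ρ²_AR1)/(1 − ρ_AR1)² for large p (for example, 0.6 when ρ_AR1 = 0.5).

```python
import numpy as np

def ar1_cov(p, rho):
    """(Sigma_ar1)_{ij} = rho^{|i - j|}."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def sample_ar1_model(n, p, rho=0.5, sigma=1.0, seed=0):
    """Draw (X, y) from model (M-AR1): x ~ N(0, Sigma_ar1), beta_0 is the
    average of the top-5 eigenvectors of Sigma_ar1, y = x' beta_0 + eps."""
    rng = np.random.default_rng(seed)
    Sigma = ar1_cov(p, rho)
    evals, evecs = np.linalg.eigh(Sigma)        # eigenvalues in ascending order
    W = evecs[:, ::-1][:, :5]                   # top-5 eigenvectors w^(1), ..., w^(5)
    beta0 = W.sum(axis=1) / 5.0                 # beta_0 = (1/5) * sum_j w^(j)
    X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T
    y = X @ beta0 + sigma * rng.standard_normal(n)
    return X, y, beta0, Sigma

X, y, beta0, Sigma = sample_ar1_model(n=500, p=400, rho=0.5)
print(beta0 @ Sigma @ beta0)   # close to (1/5)(1 - 0.25)/(0.25) = 0.6 for large p
```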