# Implicit Regularization Paths of Weighted Neural Representations

Jin-Hong Du (Carnegie Mellon University, jinhongd@andrew.cmu.edu)
Pratik Patil (University of California, Berkeley, pratikpatil@berkeley.edu)

We study the implicit regularization effects induced by (observation) weighting of pretrained features. For weight and feature matrices of bounded operator norms that are infinitesimally free with respect to (normalized) trace functionals, we derive equivalence paths connecting different weighting matrices and ridge regularization levels. Specifically, we show that ridge estimators trained on weighted features along the same path are asymptotically equivalent when evaluated against test vectors of bounded norms. These paths can be interpreted as matching the effective degrees of freedom of ridge estimators fitted with weighted features. For the special case of subsampling without replacement, our results apply to independently sampled random features and kernel features and confirm recent conjectures of [50] (Conjectures 7 and 8 therein) on the existence of such paths. We also present an additive risk decomposition for ensembles of weighted estimators and show that the risks are equivalent along the paths when the ensemble size goes to infinity. As a practical consequence of the path equivalences, we develop an efficient cross-validation method for tuning and apply it to subsampled pretrained representations across several models (e.g., ResNet-50) and datasets (e.g., CIFAR-100).

1 Introduction

In recent years, neural networks have become state-of-the-art models for tasks in computer vision and natural language processing by learning rich representations from large datasets. Pretrained neural networks, such as ResNet, which are trained on massive datasets like ImageNet, serve as valuable resources for new, smaller datasets [32]. These pretrained models reduce computational burden and generalize well in tasks such as image classification and object detection due to their rich feature space [32, 69]. Furthermore, pretrained features or neural embeddings, such as the neural tangent kernel, extracted from these models serve as valuable representations of diverse data [33, 66]. However, despite their usefulness, fitting models based on pretrained features on large datasets can be challenging due to computational and memory constraints. When dealing with high-dimensional pretrained features and large sample sizes, direct application of even simple linear regression may be computationally infeasible or memory-prohibitive [23, 44].

To address this issue, subsampling has emerged as a practical solution that reduces the dataset size, thereby alleviating the computational and memory burden. Subsampling involves creating smaller datasets by randomly selecting a subset of the original data points. Beyond these computational and memory advantages, subagging can also greatly improve predictive performance in overparameterized regimes, especially near model interpolation thresholds [53]. Moreover, through distributed learning, models fitted on multiple subsampled datasets can be aggregated as an ensemble to provide more stable predictions [20, 21, 51]. There has been growing interest in understanding the effects of subsampling (without replacement) [16, 25, 37, 50, 51].
These works relate subsampling to explicit ridge regularization, assuming either Gaussian features ϕ ∼ N(0_p, Σ) or linearly decomposable features (referred to as linear features in this paper) ϕ = Σ^{1/2} z, where Σ ∈ R^{p×p} is the covariance matrix and z ∈ R^p contains i.i.d. entries with zero means and bounded 4 + δ moments for some δ > 0. Specifically, [50] establish a connection between implicit regularization induced by subsampling and explicit ridge regularization through a path defined by the tuple (k/n, λ), where k and n are the subsample size and the full sample size, respectively, and λ is the ridge regularization level. Along this path, any subsample estimator with the corresponding ridge regularization exhibits the same first-order (or estimator equivalence) and second-order (or risk equivalence) asymptotic limits. Moreover, the endpoints of all such paths along the two axes of k = n (no subsampling) and λ = 0 (no regularization) span the same range. Although these results have been demonstrated for linear features, [50] also numerically observe similar equivalence behavior in more realistic contexts and propose conjectures for random features and kernel features based on heuristic universality justifications. However, extending these results to encompass more general feature structures and other sampling schemes remains an open question.

Table 1: Overview of related work on the equivalence of implicit regularization and explicit ridge regularization.

| Main analysis | Feature structure | Weight structure | Reference |
|---|---|---|---|
| Risk characterization | Gaussian | subsampling | [37] |
| | linear | subsampling | [51] |
| | Gaussian | bootstrapping | [5, 16, 17] |
| Estimator equivalence | linear | subsampling | [50] |
| | general | general | Theorem 1 |
| | general | subsampling | Theorem 2 |
| | linear, random, kernel | subsampling | Propositions 3–5 |
| Risk equivalence | linear | subsampling | [50] |
| | general | general | Theorem 6 |

Towards answering this question, in this paper, we view subsampling as a weighted regression problem [67]. This perspective allows us to study the equivalence in its most general form, considering arbitrary feature structures and weight structures. The general weight matrix approach used in this study encompasses various applications, including subsampling, bootstrapping, variance-adaptive weighting, survey weighting, and importance weighting, among others. By interpreting subsampling as a weighted regression problem, we leverage recent tools from free probability theory, which have been developed to analyze feature sketching [39, 42, 54]. Building on these theoretical tools, we establish implicit regularization paths for general weighting and feature structures. We summarize our main results below and provide an overview of our results in the context of recent related work in Table 1.

1.1 Summary of results and paper outline

We summarize our main results and provide an outline for the paper below.

Paths of weighted representations. In Section 3, we demonstrate that general weighted models exhibit first-order equivalence along a path (Theorem 1) when the weight matrices are asymptotically independent of the data matrices. This path of equivalence can be computed directly from the data using the formula provided in Equation (2). Furthermore, we provide a novel interpretation of this path in terms of matching effective degrees of freedom of models along the path for general feature structures when the weights correspond to those arising from subsampling (Theorem 2).
Paths of subsampled representations. We further specialize our general result in Theorem 2, for the weights induced by subsampling without replacement, to structured features in Section 3.2. These include results for linear features, nonlinear random features, and kernel features, as shown in Propositions 3–5. The latter two results also resolve Conjectures 7 and 8 raised by [50] regarding subsampling regularization paths for random and kernel features, respectively.

Risk equivalences and tuning. In Section 4, we demonstrate that an ensemble of weighted models has general quadratic risk equivalence on the path, with an error term that decays at rate 1/M in the ensemble size M (Theorem 6). The risk equivalence holds for both in-distribution and out-of-distribution settings. For subsampling of general features, we derive an upper bound on the optimal subsample size (Proposition 7) and propose a cross-validation method to tune the subsample and ensemble sizes (Algorithm 1), validated on real datasets in Section 4.3.

This level of generality is achievable because we do not analyze the risk of either the full model or the weighted models in isolation. Instead, we relate these two sets of models, allowing us to maintain weak assumptions about the features. The key assumption underlying our results is the asymptotic freeness of the weight matrices with respect to the data matrices. While directly testing this assumption is generally challenging, we verify its validity through its consequences on real datasets in Section 4.3.

1.2 Related literature

We provide a brief account of other related work below to place our work in a better context.

Linear features. Despite being overparameterized, neural networks generalize well in practice [70, 71]. Recent work has used high-dimensional linearized networks to investigate various phenomena that arise in deep learning, such as double descent [12, 46, 48], benign overfitting [10, 35, 45], and scaling laws [7, 19, 66]. This literature analyzes linear regression using statistical physics [14, 60] and random matrix theory [22, 30]. Risk approximations hold under random matrix theory assumptions [6, 30, 66] in theory and apply empirically to a variety of natural data distributions [43, 60, 66].

Random and kernel features. Random feature regression, initially introduced in [56] as a way to scale kernel methods, has recently been used for theoretical analysis of neural networks and of double descent trends in deep networks [1, 46]. The generalization of kernel ridge regression has been studied in [11, 40, 57]. The risks of kernel ridge regression are also analyzed in [9, 19, 29]. The neural representations we study are motivated by the neural tangent kernel (NTK) and related theoretical work on ultra-wide neural networks and their relationships to NTKs [34, 68].

Resampling analysis. Resampling and weighted models are popular in distributed learning to provide more stable predictions and handle large datasets [20, 21, 51]. Historically, for ridge ensembles, [36, 61] derived risk asymptotics under Gaussian features. Recently, there has been growing interest in analyzing the effect of subsampling in high-dimensional settings. [37] considered least squares ensembles obtained by subsampling, where the final subsampled dataset has more observations than the number of features. For linear models in the underparameterized regime, [59] also provide certain equivalences between subsampling and iterative least squares approaches.
The asymptotic risk characterization for general data models has been derived by [51]. [25, 50] extended the scope of these results by characterizing risk equivalences for both optimal and suboptimal risks and for arbitrary feature covariance and signal structures. Very recently, different resampling strategies for high-dimensional supervised regression tasks have been analyzed by [17] under isotropic Gaussian features. Cross-validation methods for tuning the ensemble of ridge estimators and other penalized estimators are discussed in [13, 25, 26]. Our work adds to this literature by considering ensembles of models with general weighting and feature structures.

2 Preliminaries

In this section, we formally define our weighted estimator and state the main assumption on the weight matrix. Let f_nn : R^d → R^p be a pretrained model. Let {(x_i, y_i) : i = 1, . . . , n} in R^d × R be the given dataset. Applying f_nn to the raw dataset, we obtain the pretrained features ϕ_i = f_nn(x_i) for i = 1, . . . , n as the resulting neural representations or neural embeddings. In matrix notation, we denote the pretrained feature matrix by Φ = [ϕ_1, . . . , ϕ_n]⊤ ∈ R^{n×p}. Let W ∈ R^{n×n} be a general weight matrix used for weighting the observations. The weight matrix W is allowed to be asymmetric, in general. We consider fitting ridge regression on the weighted dataset (WΦ, Wy). Given a ridge penalty λ, the ridge estimator fitted on the weighted dataset is given by:

β̂_{W,λ} := argmin_{β ∈ R^p} ‖Wy − WΦβ‖²₂/n + λ‖β‖²₂ = (Φ⊤W⊤WΦ + nλ I_p)† Φ⊤W⊤W y.  (1)

In the definition above, we allow for λ = 0, in which case the corresponding ridgeless estimator is defined as the limit λ → 0⁺. For λ < 0, we use the Moore–Penrose pseudoinverse. An important special case is where W is a diagonal matrix, in which case the above estimator reduces to weighted ridge regression. This type of weight matrix encompasses various applications, such as resampling, bootstrapping, and variance weighting. Our main application in this paper will be subsampling.

For our theoretical results, we assume that the weight matrix W preserves some spectral structure of the feature matrix Φ. This assumption is captured by the condition of asymptotic freeness between W⊤W and the feature Gram matrix ΦΦ⊤. Asymptotic freeness is a concept from free probability theory [64].

Assumption A (Weight structure). Let W⊤W and ΦΦ⊤/n converge almost surely to bounded operators that are infinitesimally free with respect to (n⁻¹ tr[·], tr[C(·)]) for any C independent of W with ‖C‖_tr uniformly bounded. Additionally, let W⊤W have a limiting S-transform that is analytic on the lower half of the complex plane.

At a high level, Assumption A captures the notion of independence but is adapted for non-commutative random variables of matrices. We provide background on free probability theory and asymptotic freeness in Appendix A.3. Here, we briefly list a series of invertible transformations from free probability to help define the S-transform [47]. The Cauchy transform is given by G_A(z) = tr̄[(zI − A)^{-1}]. The moment generating series is given by M_A(z) = z^{-1} G_A(z^{-1}) − 1. The S-transform is given by S_A(w) = (1 + w^{-1}) M_A^{-1}(w). These are the Cauchy transform (negative of the Stieltjes transform), moment generating series, and S-transform of A, respectively. Here, M_A^{-1} denotes the inverse of M_A under composition. The notation tr̄[A] denotes the average trace tr[A]/p of A ∈ R^{p×p}.
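Before turning to examples of freeness, the following minimal numpy sketch (our illustration; the sizes, weights, and data are arbitrary placeholders) shows the weighted ridge estimator in (1) in both its primal form and its dual form, which is the form used later in the proofs:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 200, 50, 0.1

Phi = rng.standard_normal((n, p))        # stand-in for pretrained features f_nn(x_i)
y = rng.standard_normal(n)               # responses
w = rng.binomial(1, 0.6, n).astype(float)
W = np.diag(w)                           # a diagonal (subsampling-type) weight matrix

# Primal form of (1): (Phi^T W^T W Phi + n*lam*I_p)^{-1} Phi^T W^T W y
A = Phi.T @ W.T @ W @ Phi + n * lam * np.eye(p)
beta_primal = np.linalg.solve(A, Phi.T @ W.T @ W @ y)

# Dual form via the push-through (Woodbury) identity:
# Phi^T W^T (W Phi Phi^T W^T + n*lam*I_n)^{-1} W y
B = W @ Phi @ Phi.T @ W.T + n * lam * np.eye(n)
beta_dual = Phi.T @ W.T @ np.linalg.solve(B, W @ y)

print(np.allclose(beta_primal, beta_dual))   # True: the two forms coincide for lam > 0
```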
The freeness of a pair of matrices A and B means that the eigenvectors of one are completely unaligned or incoherent with those of the other. For example, if A = URU∗ for a uniformly random unitary matrix U drawn independently of the positive semidefinite B and R, then A and B are almost surely asymptotically infinitesimally free [15]. Other well-known examples include Wigner matrices, which are asymptotically free with respect to deterministic matrices [4, Theorem 5.4.5], and Gaussian matrices, for which the Gram matrix G = ΦΦ⊤/n = U(V⊤V/n)U⊤ and any deterministic S are almost surely asymptotically free [47, Chapter 4, Theorem 9]. Although not proven in full generality, it is expected that diagonal matrices are asymptotically free from data Gram matrices constructed using i.i.d. data. In Section 3.2, we will provide additional examples of feature matrices, such as random and kernel features from machine learning, for which our results apply.

Our results involve the notion of degrees of freedom from statistical optimism theory [27, 28]. Degrees of freedom in statistics count the number of dimensions in which a statistical model may vary, which is simply the number of variables for ordinary linear regression. To account for regularization, this notion has been extended to effective degrees of freedom (Chapter 3 of [31]). Under some regularity conditions, from Stein's relation [63], the degrees of freedom of a predictor f̂ are measured by the trace of the derivative operator y ↦ (∂/∂y) f̂(Φ). For the ridge estimator β̂_{I,µ} fitted on (Φ, y) with penalty µ, the degrees of freedom is consequently the trace of its prediction operator y ↦ Φ(Φ⊤Φ + nµ I_p)† Φ⊤ y, which is also referred to as the ridge smoother matrix. That is, df(β̂_{I,µ}) = tr[Φ⊤Φ(Φ⊤Φ + nµ I_p)†]. We write df(·)/n for the normalized degrees of freedom and note that df(β̂_{I,µ})/n ≤ min{n, p}/n ≤ 1.

Finally, we express our asymptotic results using the asymptotic equivalence relation. Consider sequences {A_n}_{n≥1} and {B_n}_{n≥1} of (random or deterministic) matrices (which include vectors and scalars). We say that A_n and B_n are equivalent, and write A_n ≍ B_n, if lim_{n→∞} |tr[C_n(A_n − B_n)]| = 0 almost surely for any sequence C_n of matrices with bounded trace norm such that lim sup ‖C_n‖_tr < ∞ as n → ∞. Our forthcoming results apply to a sequence of problems indexed by n. For notational simplicity, we omit the explicit dependence on n in our statements.

3 Implicit regularization paths

We begin by characterizing the implicit regularization induced by weighted pretrained features. We will show that the degrees of freedom of the unweighted estimator β̂_{I,µ} on the full data (Φ, y) with regularization parameter µ are equal to the degrees of freedom of the weighted estimator β̂_{W,λ} for some regularization parameter λ. For estimator equivalence, our data-dependent set of weighted ridge estimators (W, λ) that connect to the unweighted ridge estimator (I, µ) is defined in terms of matching effective degrees of freedom of the component estimators in the set. To state the upcoming result, denote the Gram matrix of the weighted data as G_W = WΦΦ⊤W⊤/n and the Gram matrix of the unweighted data as G_I = ΦΦ⊤/n. Furthermore, let λ⁺_min(A) denote the minimum positive eigenvalue of a symmetric matrix A.

Theorem 1 (Implicit regularization of weighted representations). For G_I ∈ R^{n×n}, suppose that the weight matrix W ∈ R^{n×n} satisfies Assumption A and lim sup ‖y‖²₂/n < ∞ as n → ∞. For any µ > −lim inf_n λ⁺_min(G_I), let λ > −λ⁺_min(G_W) be given by the following equation:

λ = µ / S_{W⊤W}(−df(β̂_{I,µ})/n),  (2)

where S_{W⊤W} is the S-transform of the operator W⊤W. Then, as n → ∞, it holds that:

df(β̂_{W,λ}) ≍ df(β̂_{I,µ}) and β̂_{W,λ} ≍ β̂_{I,µ}.  (3)
In other words, to achieve a target regularization of µ on the unweighted data, Theorem 1 provides a method to compute the regularization penalty λ for given weights W from the available data using (2). The weighted estimator then has asymptotically the same degrees of freedom as the unweighted estimator. This means that the level of effective regularization of the two estimators is the same. Moreover, the estimators themselves are structurally equivalent; that is, c⊤(β̂_{W,λ} − β̂_{I,µ}) → 0 almost surely for every constant vector c with bounded norm. The estimator equivalence in Theorem 1 is a first-order result, while we will also characterize the second-order effects in Section 4.

The notable aspect of Theorem 1 is its generality. The equivalence results hold for a wide range of weight matrices and allow for negative values of the regularization levels. Furthermore, we have not made any direct assumptions about the feature matrix Φ, the weight matrix W, or the response vector y (other than mild bounded-norm conditions). The main underlying ingredient is the asymptotic freeness between W and Φ, which we then exploit using tools developed in [39] in the context of feature sketching. We discuss special cases of interest for W and Φ in the upcoming Sections 3.1 and 3.2.

3.1 Examples of weight matrices

There are two classes of weighting matrices that are of practical interest:

Non-diagonal weighting matrices. One can consider observation sketching, which takes random linear combinations of the rows of the data matrix. Such observation sketching is beneficial for privacy, as it scrambles the rows of the data matrix, which may contain identifiable information about individuals. It also helps in reducing the effect of non-i.i.d. data that arise in time series or spatial data, where one wants to smooth away the impact of irregularities or non-stationarity.

Diagonal weighting matrices. When observations are individually weighted, W is a diagonal matrix, which includes scenarios such as resampling, bootstrapping, and subsampling. Note that even with subsampling, one can have a non-binary diagonal weighting matrix. For example, one can consider sampling with replacement or sampling with a particular distribution, which yields non-binary diagonal weighting matrices. Other examples of non-binary diagonal weighting matrices include inverse-variance weighting to mitigate the effects of heterogeneous variances when the responses have different variances for different units.

In general, the set of equivalent weighted estimators depends on the corresponding S-transform as in (2), and it can be numerically evaluated. When focusing on subsampling without replacement, the data-dependent path for equivalent estimators with associated subsampling and regularization levels can be explicitly characterized in the following result by analyzing the S-transform of subsampling operators.

Figure 1 (two heatmap panels plotted over subsample ratio k/n versus ridge penalty λ): Equivalence under subsampling. The left panel shows the heatmap of degrees of freedom, and the right panel shows the random projection E_W[a⊤β̂_{W,λ}], where a ∼ N(0_p, I_p/p). In both heatmaps, the red lines indicate the predicted paths using Equation (4), and the black dashed lines indicate the empirical paths obtained by matching empirical degrees of freedom. The data is generated according to Appendix F.1 with n = 10000 and p = 1000, and the results are averaged over M = 100 random weight matrices W.
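The following small simulation, in the spirit of Figure 1, is a sketch we add for illustration (features, sizes, and the target µ are arbitrary); it fits the full-data ridge at a target penalty µ, solves for λ on the path, and checks that the subsampled Gram matrix yields approximately the same degrees of freedom:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, mu = 500, 150, 0.5

Phi = rng.standard_normal((n, p))            # illustrative i.i.d. features
G_I = Phi @ Phi.T / n                        # Gram matrix of the unweighted data

def dof(G, reg):
    """Effective degrees of freedom tr[G (G + reg * I)^{-1}]."""
    return np.trace(G @ np.linalg.inv(G + reg * np.eye(G.shape[0])))

df_full = dof(G_I, mu)                       # df of the full-data ridge estimator at penalty mu

k = 300                                      # subsample size, chosen with k/n > df/n so that lam >= 0
lam = mu * (k / n - df_full / n) / (1 - df_full / n)   # penalty on the path, cf. (4) below

df_sub = []
for _ in range(10):                          # average over independent subsampling matrices
    keep = rng.choice(n, size=k, replace=False)
    w = np.zeros(n); w[keep] = 1.0
    G_W = (w[:, None] * Phi) @ (w[:, None] * Phi).T / n
    df_sub.append(dof(G_W, lam))

print(df_full, np.mean(df_sub))              # approximately equal; the match sharpens as n, p grow
```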
Theorem 2 (Regularization paths due to subsampling). For a subsampling matrix W^(k) consisting of k unit diagonal entries, the path (2) in terms of (k, λ) simplifies to:

(1 − df/n) (1 − λ/µ) = (1 − k/n),  (4)

where we denote df = df(β̂_{I,µ}) = df(β̂_{W^(k),λ}) for notational simplicity.

The relation (4) is remarkably simple, yet quite general! It provides an interplay between the normalized target complexity df/n, the regularization inflation λ/µ, and the subsample fraction k/n:

(1 − normalized complexity) × (1 − regularization inflation) = (1 − subsample fraction).  (5)

Since the normalized target complexity and the subsample fraction are no greater than one, (5) also implies that the regularization level λ for the subsample estimator is always lower than the regularization level µ for the full estimator. In other words, subsampling induces (positive) implicit regularization, reducing the need for explicit ridge regularization. This is verified numerically in Figure 1. For a fixed target regularization amount µ, the degrees of freedom df(β̂_{I,µ}) of the ridge estimator on the full data is fixed. Thus, we can observe that the path in the (k/n, λ)-plane is a line. There are two extreme cases: (1) when the subsample size k is close to n, we have µ ≈ λ; and (2) when the subsample size is near 0, we have µ → ∞ (for any fixed λ). When λ = 0, the subsample size k is such that df(β̂_{W^(k),0}) = df(β̂_{I,µ}) = k, which we find to be a neat relation!
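To make the connection between (2) and (4) explicit, here is a short derivation that we add for the reader. It uses the standard free-probability fact that a diagonal projection with a fraction k/n of unit entries has limiting S-transform S_{W⊤W}(w) = (1 + w)/(k/n + w) (the transforms involved are reviewed in Appendix A.2):

```latex
% Path (4) from the general path (2), assuming S_{W^\top W}(w) = (1+w)/(k/n+w)
% for a subsampling projection; write \bar{d} := \mathrm{df}(\hat\beta_{I,\mu})/n.
\lambda
  = \frac{\mu}{S_{W^\top W}(-\bar{d})}
  = \mu \cdot \frac{k/n - \bar{d}}{1 - \bar{d}}
\;\;\Longrightarrow\;\;
1 - \frac{\lambda}{\mu} = \frac{1 - k/n}{1 - \bar{d}}
\;\;\Longrightarrow\;\;
\bigl(1 - \bar{d}\bigr)\Bigl(1 - \frac{\lambda}{\mu}\Bigr) = 1 - \frac{k}{n}.
```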
Beyond subsampling without replacement, one can also consider other subsampling matrices. For example, for bootstrapping k entries, we observe a similar equivalence path in Figure 5. Additionally, for random sample reweighting, as shown in Figure 6, we also observe similar equivalence behavior of the degrees of freedom. This indicates that Theorem 1 also applies to more general weighting schemes.

3.2 Examples of feature matrices

As mentioned in Section 2, when the feature matrix Φ consists of i.i.d. Gaussian features, any deterministic matrix W satisfies the condition stated in Assumption A. However, our results are not limited to Gaussian features. In this section, we consider more general families of features commonly analyzed in machine learning and demonstrate the applicability of our results to them.

Figure 2 (three heatmap panels plotted over subsample ratio k/n versus ridge penalty λ): Equivalence of degrees of freedom for various feature structures under subsampling. The three panels correspond to linear features, random features with ReLU activation function (2-layer), and kernel features (polynomial kernel of degree 3, without intercept), respectively. In all heatmaps, the red lines indicate the predicted paths using Equation (4), and the black dashed lines indicate the empirical paths obtained by matching the empirical degrees of freedom. The data is generated according to Appendix F.1 with n = 5000 and p = 500, and the results are averaged over M = 100 random weight matrices W.

(1) Linear features. As a first example, we consider linear features obtained by (multiplicatively) transforming i.i.d. entries with sufficiently bounded moments by a deterministic covariance matrix.

Proposition 3 (Regularization paths with linear features). Suppose the feature vector ϕ can be decomposed as ϕ = Σ^{1/2} z, where z ∈ R^p contains i.i.d. entries z_i for i = 1, . . . , p with mean 0, variance 1, and E[|z_i|^{4+µ}] ≤ M_µ < ∞ for some µ > 0 and a constant M_µ, and Σ ∈ R^{p×p} is a deterministic symmetric matrix with eigenvalues uniformly bounded between constants r_min > 0 and r_max < ∞. Then, as n, p → ∞ such that p/n → γ > 0, the equivalences in (3) hold along the path (4).

Features of this type are common in random matrix theory [8] and in a wide range of applications, including statistical physics [14, 60], high-dimensional statistics [22, 55, 58], and machine learning [18], among others. The general path (2), specialized to subsampling in Theorem 2, recovers the path in Proposition 4 of [50], although the technique in this paper is quite different and more general than that of [50].

(2) Kernel features. As the second example, Theorem 2 also applies to kernel features. Kernel features generalize linear features and lift the input feature space to a high- or infinite-dimensional feature space by applying a feature map x ↦ ϕ(x). Kernel methods use the kernel function K(x_i, x_j) = ⟨ϕ(x_i), ϕ(x_j)⟩ to compute the inner product in the lifted space.

Proposition 4 (Regularization paths with kernel features). Suppose the same conditions as in Proposition 3 hold and the kernel function is of the form K(x_i, x_j) = g(‖x_i‖²₂/p, ⟨x_i, x_j⟩/p, ‖x_j‖²₂/p), where g is C¹ around (τ, τ, τ) and C³ around (τ, 0, τ), with τ := lim_{p→∞} tr[Σ]/d. Then, as n → ∞, the equivalences in (3) hold in probability along the path (4).

The assumptions in Proposition 4 are commonly used in the risk analysis of kernel ridge regression [9, 19, 29, 57], among others. Here, C^k denotes the class of functions that are k-times continuously differentiable. The setting includes neural tangent kernels (NTKs) as a special case. Proposition 4 confirms Conjecture 8 of [50] for these types of kernel functions.

(3) Random features. Finally, we consider random features, which were introduced by [56] as a way to scale kernel methods to large datasets. Closely linked to two-layer neural networks [46], the random feature model has f_nn(x) = σ(F⊤x), where F ∈ R^{d×p} is a randomly initialized weight matrix and σ : R → R is a nonlinear activation function applied element-wise to F⊤x.

Proposition 5 (Regularization paths with random features). Suppose x_i ∼ N(0, Σ) and the activation function σ : R → R is differentiable almost everywhere, and there are constants c₀ and c₁ such that |σ(x)|, |σ′(x)| ≤ c₀ e^{c₁ x} whenever σ′(x) exists. Then, as n, p, d → ∞ such that p/n → γ > 0 and d/n → ξ > 0, the equivalences in (3) hold in probability along the path (4).

As mentioned in the related work, random feature models have recently been used as a standard theoretical model to study various generalization phenomena observed in neural networks [1, 46]. Proposition 5 resolves Conjecture 7 of [50] under mild regularity conditions on the activation function.
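As a concrete illustration of the random feature model assumed in Proposition 5 (a sketch we add; the sizes, the scaling of F, and the ReLU activation are placeholders), one can generate such features and feed them into the same degrees-of-freedom check used for Figures 1 and 2:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, p = 1000, 100, 300

X = rng.standard_normal((n, d))               # raw inputs x_i (here N(0, I_d) for simplicity)
F = rng.standard_normal((d, p)) / np.sqrt(d)  # fixed, randomly initialized first-layer weights
Phi = np.maximum(X @ F, 0.0)                  # ReLU random features phi_i = sigma(F^T x_i)

# Phi now plays the role of the pretrained feature matrix; Proposition 5 predicts that
# the subsampling path (4) continues to hold for these features.
```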
It is worth noting that the prior works mentioned above, including [50], have focused on first characterizing the risk asymptotics in terms of various population quantities for each of the cases above. In contrast, our work deviates from these approaches by not expressing the risk in population quantities but rather by directly relating the estimators at different regularization levels. In the next section, we explore the relationship between their squared prediction risks.

4 Prediction risk asymptotics and risk estimation

The results in the previous section provide first-order equivalences of the estimators, which are related to the bias of the estimators. In practice, we are also interested in the predictive performance of the estimators. In this section, we investigate the second-order equivalence of weighting and ridge regularization through ensembling. Specifically, we show that aggregating estimators fitted on different weighted datasets also reduces the additional variance. Furthermore, the prediction risks of the full-ensemble weighted estimator and the unweighted estimator also match along the path.

Before presenting our risk equivalence result, we first introduce some additional notation. Assume there are M i.i.d. weight matrices W₁, . . . , W_M ∈ R^{n×n}. The M-ensemble estimator is defined as:

β̂_{W_{1:M},λ} = M^{-1} Σ_{m=1}^{M} β̂_{W_m,λ},  (6)

and its performance is quantified by the conditional squared prediction risk, given by:

R(β̂_{W_{1:M},λ}) = E_{x₀,y₀}[(y₀ − ϕ₀⊤ β̂_{W_{1:M},λ})² | Φ, y, {W_m}_{m=1}^{M}],  (7)

where (x₀, y₀) is a test point sampled independently from some distribution P_{x₀,y₀} that may be different from the training distribution P_{x,y}, and ϕ₀ = f_nn(x₀) is the pretrained feature vector at the test point. The covariance matrix of the test features ϕ₀ is denoted by Σ₀. When P_{x₀,y₀} = P_{x,y}, we refer to (7) as the in-distribution risk. On the other hand, when P_{x₀,y₀} differs from P_{x,y}, we refer to it as the out-of-distribution risk. Note that the conditional risk R(β̂_{W_{1:M},λ}) is a scalar random variable that depends on both the dataset (Φ, y) and the weight matrices W_m for m ∈ [M]. Our goal in this section is to analyze the prediction risk of the ensemble estimator (6) for any ensemble size M.

Theorem 6 (Risk equivalence along the path). Under the setting of Theorem 1, assume that the operator norm of Σ₀ is uniformly bounded in p and that each response variable y_i for i = 1, . . . , n has mean 0 and satisfies E[|y_i|^{4+µ}] ≤ M_µ < ∞ for some µ, M_µ > 0. Then, along the path (4),

R(β̂_{W_{1:M},λ}) ≍ R(β̂_{I,µ}) + (C/M) tr[(G_I + µI_n)† y y⊤ (G_I + µI_n)†],  (8)

where the constant C is given by:

C = (∂µ/∂λ) λ² S′_{W⊤W}(−df(β̂_{I,µ})/n) tr[(G_I + µI_n)† (ΦΣ₀Φ⊤/n) (G_I + µI_n)†].  (9)

At a high level, Theorem 6 provides a bias-variance-like decomposition for the squared risks of weighted ensembles. The risk of the weighted predictor is equal to the risk of the unweighted, equivalent implicit ridge regressor (bias), plus a term due to the randomness from weighting (variance). The inflation factor C controls the magnitude of this term, and the term decreases at a rate of 1/M as the ensemble size M increases (see Figure 7 for a numerical verification of this rate). Therefore, by using a resample ensemble with a sufficiently large size M, we can retain the statistical properties of the full ridge regression while reducing memory usage and increasing parallelization.

Theorem 6 extends the risk equivalence results in [50, 52]. Compared to previous results, Theorem 6 provides a broader risk equivalence that holds for general weight and feature matrices, as well as an arbitrary ensemble size M. It is important to note that Theorem 6 holds even when the test distribution differs from the training distribution, making it applicable to out-of-distribution risks. Furthermore, our results do not rely on any specific distributional assumptions for the response vector, making them applicable in a model-free setting. The key idea behind this result is to exploit asymptotic freeness between the subsample and data matrices.
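The following sketch (our illustration; the helper names are ours) builds the M-ensemble estimator in (6) from independent subsampling matrices; per (8), its risk approaches that of the equivalent full-data ridge estimator as M grows, with the excess term decaying at rate 1/M:

```python
import numpy as np

def weighted_ridge(Phi, y, w, lam):
    """Ridge fit (1) on row-weighted data, for a diagonal weight matrix W = diag(w)."""
    n, p = Phi.shape
    Pw, yw = w[:, None] * Phi, w * y
    return np.linalg.solve(Pw.T @ Pw + n * lam * np.eye(p), Pw.T @ yw)

def subsample_ensemble(Phi, y, k, lam, M, rng):
    """M-ensemble estimator (6): average of M independent subsample ridge fits."""
    n = Phi.shape[0]
    betas = []
    for _ in range(M):
        w = np.zeros(n)
        w[rng.choice(n, size=k, replace=False)] = 1.0
        betas.append(weighted_ridge(Phi, y, w, lam))
    return np.mean(betas, axis=0)
```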
Next, we address the question of optimal tuning.

4.1 Optimal oracle tuning

As in Theorem 2, we next analyze various properties related to optimal subsampling weights and their implications for the risk of optimal ridge regression. Recall that the subsampling matrix W^(k) is a diagonal matrix with k ∈ {1, . . . , n} nonzero diagonal entries, parameterized by the subsample size k. Note that the optimal regularization parameter µ* for the full data (W^(k) = I, or k = n) is a function of the distribution of the pretrained data and the test point. Based on the risk equivalence in Theorem 6, there exists an optimal path of (k, λ) with the corresponding full-ensemble estimator β̂_{W^(k)_{1:∞},λ} := lim_{M→∞} β̂_{W^(k)_{1:M},λ} that achieves the optimal predictive performance at (n, µ*). In particular, the ridgeless ensemble with λ = 0 happens to lie on this path. From previous work [25, 50], the optimal subsample size k* for λ = 0 satisfies k* ≤ p (asymptotically) under linear features. We show in the following that this property extends to general features.

Figure 3 (three panels plotted over subsample ratio k/n versus ridge penalty: degrees of freedom, training error, and prediction error): Equivalence in pretrained features of a pretrained ResNet-50 on the Flowers-102 dataset.

Proposition 7 (Optimal subsample ratio). Assume the subsampling matrix W^(k) as defined in Theorem 2. Let µ* = argmin_{µ≥0} R(β̂_{W^(k)_{1:∞},µ}). Then the corresponding optimal subsample size satisfies:

k* = df(β̂_{W^(k*)_{1:∞},µ*}) ≤ rank(G_I).  (10)

The optimal subsample size k* obtained from Proposition 7 is asymptotically optimal. For linear features, in the underparameterized regime where n > p, [25, 50] show that the optimal subsample size k* is asymptotically no larger than p. This result is covered by Proposition 7 by noting that rank(G_I) ≤ p under linear features. It is interesting and somewhat surprising to note that in the underparameterized regime (when p ≤ n), we do not need more than p observations to achieve the optimal risk. In this sense, the optimal subsampled dataset is always overparameterized.

When the limiting risk profiles R(γ, ψ, µ) := lim_{p/n→γ, p/k→ψ} R(β̂_{W^(k)_{1:∞},µ}) exist for subsample ensembles, the limiting risk of the optimal ridge predictor inf_{µ≥0} R(γ, γ, µ) is monotonically decreasing in the limiting sample aspect ratio γ [50]. This also (provably) confirms the sample-wise monotonicity of the optimally tuned risk for general features in an asymptotic sense [48]. Due to the risk equivalence in Theorem 6, for any µ > 0, there exists ψ such that R(γ, γ, µ) = R(γ, ψ, 0). This implies that inf_{µ≥0} R(γ, γ, µ) = inf_{ψ≥γ} R(γ, ψ, 0). In other words, tuning over subsample sizes with sufficiently large ensembles is equivalent to tuning over the ridge penalty on the full data.

4.2 Data-dependent tuning

As suggested by Proposition 7, the optimal subsample size is smaller than the rank of the Gram matrix. This result has important implications for real-world datasets where the number of observations (n) is much larger than the number of features (p). In such cases, instead of using the entire dataset, we can efficiently build small ensembles with a subsample size k ≤ p. This approach is particularly beneficial when n is significantly larger than p, for example, when n = 1000p. By fitting ensembles with only M = 100 base predictors, we can potentially reduce the computational burden while still achieving optimal predictive performance. Furthermore, this technique can be especially valuable in scenarios where computational resources are limited or when dealing with massive datasets that cannot be easily processed in their entirety.

In the following, we propose a method to determine the optimal value of the regularization parameter µ* for the full ridge regression, as well as the corresponding subsample size k* and ensemble size M*. According to Theorem 6, the optimal value of M is theoretically infinite.
However, in practice, the prediction risk of the M-ensemble predictor decreases at a rate of 1/M as M increases. Therefore, it is important to select a suitable value of M that achieves the desired level of performance while respecting computational constraints and the specified error budget. By carefully choosing an appropriate M, we can strike a balance between model accuracy and efficiency, ensuring that the subsampled neural representations are effectively used in downstream tasks.

Consider a grid of subsample sizes K_n ⊆ {1, . . . , n}; for instance, K_n = {0, k₀, 2k₀, . . . , n}, where k₀ is a subsample size unit. For a prespecified subsample size k ∈ K_n and ensemble size M₀ ∈ N, suppose we have multiple risk estimates R̂_m of R_m for m = 1, . . . , M₀. The squared risk decomposition [51, Eq. (7)], together with the equivalence (8) along the path, implies that

R_m = m^{-1} R₁ + (1 − m^{-1}) R_∞, for m = 1, . . . , M₀.

Summing these equations yields Σ_{m=1}^{M₀} R_m = Σ_{m=1}^{M₀} m^{-1} R₁ + Σ_{m=1}^{M₀} (1 − m^{-1}) R_∞. Thus, we can estimate R_∞ by:

R̂_∞ := ( Σ_{m=1}^{M₀} R̂_m − Σ_{m=1}^{M₀} m^{-1} R̂₁ ) / Σ_{m=1}^{M₀} (1 − m^{-1}).  (11)

Then, the extrapolated risk estimates R̂_m (for m > M₀) are defined as:

R̂_m := m^{-1} R̂₁ + (1 − m^{-1}) R̂_∞, for m > M₀.  (12)

The meta-algorithm that implements the above cross-validation procedure is provided in Algorithm 1. To efficiently tune the parameters of ridge ensembles, we use and combine the corrected generalized cross-validation (CGCV) method [13] and the extrapolated cross-validation (ECV) method [26]. The improved CV method is implemented in the Python library sklearn_ensemble_cv [24].

Algorithm 1: Meta-algorithm for tuning ensemble sizes and subsample matrices.
Input: A dataset D_n = {(x_i, y_i) ∈ R^p × R : 1 ≤ i ≤ n}, a regularization parameter λ, a class of subsampling matrix distributions P_n = {P_k}_{k∈K_n}, an ensemble size M₀ ≥ 2 for risk estimation, and an optimality tolerance parameter δ.
1: Build ensembles β̂_{W^(k)_{1:M₀},λ} with M₀ base estimators, where W^(k)_1, . . . , W^(k)_{M₀} are drawn i.i.d. from P_k, for each k ∈ K_n.
2: Estimate the prediction risk of β̂_{W^(k)_{1:M₀},λ} with R̂_{m,k} by CV methods such as CGCV [13], for k ∈ K_n and m = 1, . . . , M₀.
3: Extrapolate the risk estimates R̂_{m,k} for m > M₀ using (11) and (12).
4: Select a subsample size k̂ ∈ argmin_{k∈K_n} R̂_{∞,k} that minimizes the extrapolated estimates.
5: Select an ensemble size M̂ ∈ argmin_{m∈N} 1{R̂_{m,k̂} > R̂_{∞,k̂} + δ} for the δ-optimal risk.
6: If M̂ > M₀, fit an M̂-ensemble estimator β̂_{W^(k̂)_{1:M̂},λ}.
Output: Return the tuned estimator β̂_{W^(k̂)_{1:M̂},λ} and the risk estimators R̂_{M,k} for all M, k.
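A direct transcription of the extrapolation rules (11) and (12) might look as follows (function and variable names are ours); in Algorithm 1, this step would be applied once per subsample size k to the CGCV estimates R̂_{1,k}, . . . , R̂_{M₀,k}:

```python
import numpy as np

def extrapolate_risks(risk_estimates):
    """Given estimates R_hat[m-1] of the m-ensemble risk for m = 1, ..., M0 (M0 >= 2),
    return the full-ensemble risk estimate R_inf via (11) and a function that
    extrapolates R_hat_m to any ensemble size m via (12)."""
    R = np.asarray(risk_estimates, dtype=float)
    M0 = len(R)
    m = np.arange(1, M0 + 1)
    # (11): solve sum_m R_m = R_1 * sum_m (1/m) + R_inf * sum_m (1 - 1/m) for R_inf.
    R_inf = (R.sum() - R[0] * np.sum(1.0 / m)) / np.sum(1.0 - 1.0 / m)
    # (12): R_m ~= R_1 / m + (1 - 1/m) * R_inf for m > M0.
    def extrapolate(m_new):
        return R[0] / m_new + (1.0 - 1.0 / m_new) * R_inf
    return R_inf, extrapolate
```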
4.3 Validation on real-world datasets

In this section, we present numerical experiments to validate our theoretical results on real-world datasets. Figure 3 provides evidence supporting Assumption A on pretrained features extracted from commonly used neural networks applied to real-world datasets. The first panel of the figure demonstrates the equivalence of degrees of freedom for these pretrained features. Furthermore, we also observe consistent behavior across different neural network architectures and different datasets (see Figures 8 and 9). Remarkably, the path of equivalence can be accurately predicted, offering valuable insight into the underlying dynamics of these models. This observation suggests that pretrained features from widely used neural networks exhibit similar properties when applied to real-world data, regardless of the specific architecture employed. The ability to predict the equivalence path opens up new possibilities for optimizing the performance of these models in practical applications.

One implication of the equivalence results explored in Theorems 1 and 6 is that instead of tuning the full ridge penalty µ on a large dataset, we can fix a small value of the ridge penalty λ, fit subsample ridge ensembles, and tune for an optimal subsample size k. To illustrate the validity of the tuning procedure described in Algorithm 1, we present both the actual prediction errors and their estimates obtained by Algorithm 1 in Figure 4. We observe that the risk estimates closely match the prediction risks at different ensemble sizes across different datasets. Even with a subsampling ratio k/n of 0.01 and a sufficiently large M, the risk estimate is close to the optimal risk. A smaller subsample size can even yield smaller prediction risk on certain datasets.

Figure 4 (four panels plotted over subsample ratio k/n, for Fashion-MNIST / ResNet-18 init., CIFAR-10 / ResNet-18 pretr., CIFAR-100 / ResNet-50 pretr., and Food-101 / ResNet-101 pretr.; curves show prediction risk and estimated risk for M = 1, 2, 5, 10, 100): Risk estimation by corrected and extrapolated generalized cross-validation. The risk estimates are computed based on M₀ = 25 base estimators using Algorithm 1 with λ = 10⁻³.

5 Limitations and outlook

While our results are quite general in terms of applying to a wide variety of pretrained features, they are limited in that they only apply to ridge regression fitted on the pretrained features. The key challenge in extending the analysis based on Assumption A to general estimators beyond ridge regression is characterizing the effect of subsampling on general resolvents as additional ridge regularization. To extend to generalized linear models, one approach is to view the optimization as iteratively reweighted least squares [38] in combination with the current results. Another approach is to combine our results with the techniques in [41] to obtain deterministic equivalents for the Hessian, enabling an understanding of implicit regularization due to subsampling beyond linear models.

Beyond implicit regularization due to subsampling, there are other forms of implicit regularization, such as algorithmic regularization due to early stopping in gradient descent [2, 3, 49] and dropout regularization [62, 65], among others. In some applications, multiple forms of implicit regularization are present simultaneously. For instance, during a mini-batch gradient step, implicit regularization arises from both the iterative method and mini-batch subsampling. The results presented in this paper may help make explicit the combined effect of various forms of implicit regularization.

Acknowledgments and Disclosure of Funding

We thank Benson Au, Daniel LeJeune, Ryan Tibshirani, and Alex Wei for the helpful conversations surrounding this work. We also thank the anonymous reviewers for their valuable feedback and suggestions. We acknowledge the computing support of the ACCESS allocation MTH230020 provided for some of the experiments performed on the Bridges-2 system at the Pittsburgh Supercomputing Center. The code for reproducing the results of this paper can be found at https://jaydu1.github.io/overparameterized-ensembling/weighted-neural.

References

[1] Adlam, B., Levinson, J. A., and Pennington, J. (2022). A random matrix perspective on mixtures of nonlinearities in high dimensions. In International Conference on Artificial Intelligence and Statistics.
[2] Ali, A., Dobriban, E., and Tibshirani, R. J. (2020). The implicit regularization of stochastic gradient flow for least squares. In International conference on machine learning. [3] Ali, A., Kolter, J. Z., and Tibshirani, R. J. (2019). A continuous-time view of early stopping for least squares regression. In International Conference on Artificial Intelligence and Statistics. [4] Anderson, G. W., Guionnet, A., and Zeitouni, O. (2010). An Introduction to Random Matrices. Cambridge University Press. [5] Ando, R. and Komaki, F. (2023). On high-dimensional asymptotic properties of model averaging estimators. ar Xiv preprint ar Xiv:2308.09476. [6] Bach, F. (2023). High-dimensional analysis of double descent for linear regression with random projections. ar Xiv:2303.01372. [7] Bahri, Y., Dyer, E., Kaplan, J., Lee, J., and Sharma, U. (2021). Explaining neural scaling laws. ar Xiv:2102.06701. [8] Bai, Z. and Silverstein, J. W. (2010). Spectral Analysis of Large Dimensional Random Matrices. Springer. Second edition. [9] Barthelmé, S., Amblard, P.-O., Tremblay, N., and Usevich, K. (2023). Gaussian process regression in the flat limit. The Annals of Statistics, 51(6):2471 2505. [10] Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. (2020). Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063 30070. [11] Bartlett, P. L., Montanari, A., and Rakhlin, A. (2021). Deep learning: A statistical viewpoint. Acta Numerica, 30:87 201. [12] Belkin, M., Hsu, D., and Xu, J. (2020). Two models of double descent for weak features. SIAM Journal on Mathematics of Data Science, 2(4):1167 1180. [13] Bellec, P., Du, J.-H., Koriyama, T., Patil, P., and Tan, K. (2023). Corrected generalized cross-validation for finite ensembles of penalized estimators. ar Xiv preprint ar Xiv:2310.01374. [14] Bordelon, B., Canatar, A., and Pehlevan, C. (2020). Spectrum dependent learning curves in kernel regression and wide neural networks. In International Conference on Machine Learning. [15] Cébron, G., Dahlqvist, A., and Gabriel, F. (2022). Freeness of type B and conditional freeness for random matrices. ar Xiv preprint ar Xiv:2205.01926. [16] Chen, X., Zeng, Y., Yang, S., and Sun, Q. (2023). Sketched ridgeless linear regression: The role of downsampling. In International Conference on Machine Learning. [17] Clarté, L., Vandenbroucque, A., Dalle, G., Loureiro, B., Krzakala, F., and Zdeborová, L. (2024). Analysis of bootstrap and subsampling in high-dimensional regularized regression. ar Xiv preprint ar Xiv:2402.13622. [18] Couillet, R. and Liao, Z. (2022). Random Matrix Methods for Machine Learning. Cambridge University Press. [19] Cui, H., Loureiro, B., Krzakala, F., and Zdeborová, L. (2021). Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime. Advances in Neural Information Processing Systems, 34:10131 10143. [20] Dobriban, E. and Sheng, Y. (2020). Wonder: Weighted one-shot distributed ridge regression in high dimensions. Journal of Machine Learning Research, 21(66):1 52. [21] Dobriban, E. and Sheng, Y. (2021). Distributed linear regression by averaging. The Annals of Statistics, 49(2):918 943. [22] Dobriban, E. and Wager, S. (2018). High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247 279. [23] Drineas, P., Mahoney, M. W., and Muthukrishnan, S. (2006). Sampling algorithms for ℓ2 regression and applications. Proceedings of the ACM-SIAM Symposium on Discrete Algorithm. 
[24] Du, J.-H. and Patil, P. (2023). Python package sklearn_ensemble_cv v0.2.1. Py PI. [25] Du, J.-H., Patil, P., and Kuchibhotla, A. K. (2023). Subsample ridge ensembles: Equivalences and generalized cross-validation. In International Conference on Machine Learning. [26] Du, J.-H., Patil, P., Roeder, K., and Kuchibhotla, A. K. (2024). Extrapolated cross-validation for randomized ensembles. Journal of Computational and Graphical Statistics. [27] Efron, B. (1983). Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association, 78(382):316 331. [28] Efron, B. (1986). How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association, 81(394):461 470. [29] Ghorbani, B., Mei, S., Misiakiewicz, T., and Montanari, A. (2020). When do neural networks outperform kernel methods? Advances in Neural Information Processing Systems. [30] Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2022). Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2):949 986. [31] Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman & Hall. [32] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR). [33] Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems. [34] Jacot, A., Simsek, B., Spadaro, F., Hongler, C., and Gabriel, F. (2020). Kernel alignment risk estimator: Risk prediction from training data. Advances in Neural Information Processing Systems. [35] Koehler, F., Zhou, L., Sutherland, D. J., and Srebro, N. (2021). Uniform convergence of interpolators: Gaussian width, norm bounds and benign overfitting. Advances in Neural Information Processing Systems. [36] Krogh, A. and Sollich, P. (1997). Statistical mechanics of ensemble learning. Physical Review E, 55(6):811. [37] Le Jeune, D., Javadi, H., and Baraniuk, R. (2020). The implicit regularization of ordinary least squares ensembles. In International Conference on Artificial Intelligence and Statistics. [38] Le Jeune, D., Javadi, H., and Baraniuk, R. G. (2021). The flip side of the reweighted coin: Duality of adaptive dropout and regularization. In Advances in Neural Information Processing Systems. [39] Le Jeune, D., Patil, P., Javadi, H., Baraniuk, R. G., and Tibshirani, R. J. (2024). Asymptotics of the sketched pseudoinverse. SIAM Journal on Mathematics of Data Science, 6(1):199 225. [40] Liang, T. and Rakhlin, A. (2020). Just interpolate: Kernel ridgeless regression can generalize. The Annals of Statistics. [41] Liao, Z. and Mahoney, M. W. (2021). Hessian eigenspectra of more realistic nonlinear models. In Advances in Neural Information Processing Systems, volume 34. [42] Liu, S. and Dobriban, E. (2020). Ridge regression: Structure, cross-validation, and sketching. In International Conference on Learning Representations. [43] Loureiro, B., Gerbelot, C., Cui, H., Goldt, S., Krzakala, F., Mezard, M., and Zdeborova, L. (2021). Learning curves of generic features maps for realistic datasets with a teacher-student model. In Advances in Neural Information Processing Systems. [44] Mahoney, M. W. (2011). Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123 224. [45] Mallinar, N., Simon, J. 
B., Abedsoltan, A., Pandit, P., Belkin, M., and Nakkiran, P. (2022). Benign, tempered, or catastrophic: A taxonomy of overfitting. ar Xiv:2207.06569. [46] Mei, S. and Montanari, A. (2022). The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667 766. [47] Mingo, J. A. and Speicher, R. (2017). Free Probability and Random Matrices, volume 35. Springer. [48] Nakkiran, P., Venkat, P., Kakade, S. M., and Ma, T. (2021). Optimal regularization can mitigate double descent. In International Conference on Learning Representations. [49] Neu, G. and Rosasco, L. (2018). Iterate averaging as regularization for stochastic gradient descent. In Conference On Learning Theory. [50] Patil, P. and Du, J.-H. (2023). Generalized equivalences between subsampling and ridge regularization. Advances in Neural Information Processing Systems. [51] Patil, P., Du, J.-H., and Kuchibhotla, A. K. (2023). Bagging in overparameterized learning: Risk characterization and risk monotonization. Journal of Machine Learning Research, 24(319):1 113. [52] Patil, P., Du, J.-H., and Tibshirani, R. J. (2024). Optimal ridge regularization for out-ofdistribution prediction. ar Xiv preprint ar Xiv:2404.01233. [53] Patil, P., Kuchibhotla, A. K., Wei, Y., and Rinaldo, A. (2022). Mitigating multiple descents: A model-agnostic framework for risk monotonization. ar Xiv preprint ar Xiv:2205.12937. [54] Patil, P. and Le Jeune, D. (2024). Asymptotically free sketched ridge ensembles: Risks, crossvalidation, and tuning. In International Conference on Learning Representations. [55] Paul, D. and Aue, A. (2014). Random matrix theory in statistics: A review. Journal of Statistical Planning and Inference, 150:1 29. [56] Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. Advances in neural information processing systems, 20. [57] Sahraee-Ardakan, M., Emami, M., Pandit, P., Rangan, S., and Fletcher, A. K. (2022). Kernel methods and multi-layer perceptrons learn linear models in high dimensions. ar Xiv preprint ar Xiv:2201.08082. [58] Serdobolskii, V. I. (2007). Multiparametric Statistics. Elsevier. [59] Slagel, J. T., Chung, J., Chung, M., Kozak, D., and Tenorio, L. (2019). Sampled tikhonov regularization for large linear inverse problems. Inverse Problems, 35(11):114008. [60] Sollich, P. (2001). Gaussian process regression with mismatched models. Advances in Neural Information Processing Systems, 14. [61] Sollich, P. and Krogh, A. (1995). Learning with ensembles: How overfitting can be useful. In Advances in Neural Information Processing Systems. [62] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929 1958. [63] Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, pages 1135 1151. [64] Voiculescu, D. V. (1997). Free Probability Theory. American Mathematical Society. [65] Wager, S., Wang, S., and Liang, P. S. (2013). Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems. [66] Wei, A., Hu, W., and Steinhardt, J. (2022). More than a toy: Random matrix models predict how real-world neural representations generalize. In International Conference on Machine Learning. [67] Weisberg, S. (2005). Applied Linear Regression. John Wiley & Sons. [68] Yang, G. (2019). 
Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760. [69] Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems. [70] Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations. [71] Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115.

This serves as an appendix to the paper Implicit Regularization Paths of Weighted Neural Representations. The beginning (unlabeled) section of the appendix provides an organization for the appendix, followed by a summary of the general notation used in both the paper and the appendix. Any other specific notation is explained inline where it is first used.

Organization. In Appendix A, we provide a brief technical background on free probability theory and the various transforms that we need, and collect known asymptotic ridge equivalents that we use in our proofs. In Appendix B, we present proofs of the theoretical results in Section 3 (Theorems 1 and 2 and Propositions 3–5). In Appendix C, we present proofs of the theoretical results in Section 4 (Theorem 6 and Proposition 7). In Appendix D, we provide additional illustrations for the results in Section 3 (Figures 5 and 6). In Appendix E, we provide additional illustrations for the results in Section 4 (Figures 7–9), including our meta-algorithm for tuning (Algorithm 1). In Appendix F, we provide additional details on the experiments in both Section 3 and Section 4.

General notation. We use blackboard letters to denote some special sets: N denotes the set of natural numbers, R the set of real numbers, R₊ the set of positive real numbers, C the set of complex numbers, C₊ the set of complex numbers with positive imaginary part, and C₋ the set of complex numbers with negative imaginary part. We use [n] to denote the index set {1, 2, . . . , n}. We denote scalars and vectors using lower-case letters and matrices using upper-case letters. For a vector β, β⊤ denotes its transpose and ‖β‖₂ denotes its ℓ₂ norm. For a pair of vectors u and v, ⟨u, v⟩ denotes their inner product. For a matrix X ∈ R^{n×p}, X⊤ ∈ R^{p×n} denotes its transpose and X† ∈ R^{p×n} denotes its Moore–Penrose inverse. For a square matrix A ∈ R^{p×p}, tr[A] denotes its trace, tr̄[A] denotes its average trace tr[A]/p, and A^{-1} denotes its inverse, provided that A is invertible. For a symmetric matrix A, λ⁺_min(A) denotes its minimum nonzero eigenvalue. For a positive semidefinite matrix G, G^{1/2} denotes its principal square root. For a matrix X, we denote by ‖X‖_op its operator norm with respect to the ℓ₂ vector norm; it is also the spectral norm of X. For a matrix X, we denote by ‖X‖_tr its trace norm, given by tr[(X⊤X)^{1/2}], which is also the nuclear norm of X. We denote the p × p identity matrix by I_p, or simply by I when it is clear from the context. For symmetric matrices A and B, we use A ⪰ B to denote the Loewner ordering, meaning that A − B is a positive semidefinite matrix.
For two sequences of matrices A_p and B_p, we use A_p ≍ B_p to denote a certain asymptotic equivalence; see Appendix A.3 for a precise definition.

A Technical background

A.1 Basics of free probability theory

In this section, we briefly review definitions from free probability theory and its applications to random matrices. This review will help set the stage by introducing the various mathematical structures and spaces we are working with. It will also introduce some of the notation used throughout the text. Free probability is a mathematical framework that deals with non-commutative random variables [19]. The use of free probability theory has appeared in various recent works in statistical machine learning, including [1, 2, 11, 12, 17]. Good references on free probability theory include [3, 13], from which we borrow some basic definitions in the following. All the material in this section is standard in free probability theory and mainly serves to keep the definitions self-contained.

Definition 8 (Non-commutative algebra). A set A is called a (complex) algebra (over the field of complex numbers C) if it is a vector space (over C, with addition +) equipped with a bilinear multiplication · such that, for all x, y, z ∈ A and α ∈ C: (1) x·(y·z) = (x·y)·z; (2) (x + y)·z = x·z + y·z; (3) x·(y + z) = x·y + x·z; (4) α(x·y) = (αx)·y = x·(αy). In addition, an algebra is called unital if a multiplicative identity element exists. We will use 1_A to denote this identity element. We will drop the symbol · to denote multiplication over the algebra.

Definition 9 (Non-commutative probability space). Let A over C be a unital algebra with identity 1_A. Let φ : A → C be a linear functional which is unital (that is, φ(1_A) = 1). Then (A, φ) is called a non-commutative probability space, and φ is called a state. A state φ is said to be tracial if φ(xy) = φ(yx) for all x, y ∈ A.

Definition 10 (Moments). Let (A, φ) be a non-commutative probability space. The numbers {φ(x^k)}_{k≥1} are called the moments of the variable x ∈ A.

Definition 11 (∗-algebra). An algebra A is called a ∗-algebra if there exists a mapping x ↦ x∗ from A to A such that, for all x, y ∈ A and α ∈ C: (1) (x + y)∗ = x∗ + y∗; (2) (αx)∗ = ᾱx∗; (3) (xy)∗ = y∗x∗; (4) (x∗)∗ = x. A variable x of a ∗-algebra is called self-adjoint if x∗ = x. A unital linear functional φ on a ∗-algebra is said to be positive if φ(x∗x) ≥ 0 for all x ∈ A.

Definition 12 (∗-probability space). Let A be a unital ∗-algebra with a positive state φ. Then (A, φ) is called a ∗-probability space.

Example 1. Denote by M_p(C) the collection of all p × p matrices with complex entries. Let the multiplication and addition operations be defined in the usual way. The ∗-operation is the same as taking the conjugate transpose. Let tr̄ : M_p(C) → C be the normalized trace defined by tr̄[A] = tr[A]/p. The state tr̄ is tracial and positive.

Definition 13 (Free independence). Suppose (A, φ) is a ∗-probability space. Then, the ∗-subalgebras {A_i}_{i∈I} of A are said to be ∗-freely independent (or simply ∗-free) if, for all n ≥ 2 and all x₁, x₂, . . . , x_n from {A_i}_{i∈I}, κ_n(x₁, x₂, . . . , x_n) = 0 whenever at least two of the x_i are from different A_i (here κ_n denotes the nth free cumulant). In particular, any collection of variables is said to be ∗-free if the sub-algebras generated by these variables are ∗-free.

Lemma 14. Suppose (A, φ) is a ∗-probability space. If x and y are free in (A, φ), then for all non-negative integers n and m, φ(x^n y^m) = φ(x^n)φ(y^m) = φ(y^m x^n). In other words, elements of the algebra are considered free if any alternating product of centered polynomials is also centered.
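As a quick numerical illustration of Lemma 14 (our own example, not from the paper), one can check that mixed moments of a randomly rotated matrix and a fixed matrix approximately factorize for large p:

```python
import numpy as np

rng = np.random.default_rng(3)
p = 2000

R = np.diag(rng.uniform(0.0, 2.0, p))            # spectrum of the rotated element
B = np.diag(rng.uniform(0.0, 2.0, p))            # a fixed (deterministic) matrix
Q, Rq = np.linalg.qr(rng.standard_normal((p, p)))
Q = Q * np.sign(np.diag(Rq))                     # Haar-distributed orthogonal matrix
A = Q @ R @ Q.T                                  # A is asymptotically free from B

trbar = lambda M: np.trace(M) / p                # normalized trace, playing the role of phi
print(trbar(A @ A @ B @ B), trbar(A @ A) * trbar(B @ B))   # nearly equal for large p
```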
In this work, we will consider $\varphi$ to be the normalized trace. The normalized trace is the generalization of $\frac{1}{p}\operatorname{tr}[A]$ for $A \in \mathbb{C}^{p \times p}$ to elements of a $C^*$-algebra $\mathcal{A}$. Specifically, for any self-adjoint $a \in \mathcal{A}$ and any polynomial $p$, we have $\varphi(p(a)) = \int p(z)\, d\mu_a(z)$, where $\mu_a$ is the probability measure that characterizes the spectral distribution of $a$.

Definition 15 (Convergence in spectral distribution). Let $(\mathcal{A}, \varphi)$ be a $C^*$-probability space. We say that $A_1, \ldots, A_m \in \mathbb{C}^{p \times p}$ converge in spectral distribution to elements $a_1, \ldots, a_m \in \mathcal{A}$ if, for all $1 \le \ell < \infty$ and $1 \le i_j \le m$ for $1 \le j \le \ell$, we have $\frac{1}{p}\operatorname{tr}[A_{i_1} \cdots A_{i_\ell}] \to \varphi(a_{i_1} \cdots a_{i_\ell})$. Then, with slight abuse of notation, two matrices $A, B \in \mathbb{R}^{p \times p}$ are said to be free if $\overline{\operatorname{tr}}\big[\prod_{\ell=1}^{L} \mathrm{poly}^A_\ell(A)\, \mathrm{poly}^B_\ell(B)\big] = 0$ for all $L \ge 1$ and all centered polynomials, that is, polynomials satisfying $\overline{\operatorname{tr}}[\mathrm{poly}^A_\ell(A)] = \overline{\operatorname{tr}}[\mathrm{poly}^B_\ell(B)] = 0$. This is an abuse of notation because finite matrices cannot satisfy this condition exactly. However, they can satisfy it asymptotically as $p \to \infty$, and in this case, we say that $A$ and $B$ are asymptotically free.

Note: With some abuse of notation, we will let matrices in boldface denote both the finite matrix and the limiting element in the free probability space. The limiting element can be understood, for example, as a bounded linear operator on a Hilbert space. We also remark that all the notions we need are well-defined in this limit as well, as long as they are appropriately normalized.

A.2 Useful transforms and their relationships

In this section, we review the key transforms used in free probability theory and their interrelationships.

Definition 16 (Cauchy transform). Let $a$ be an element of a $*$-probability space $(\mathcal{A}, \varphi)$. Suppose there exists some $C > 0$ such that $|\varphi(a^n)| \le C^n$ for all $n \in \mathbb{N}$. Then the Cauchy transform of $a$ is defined as $G_a(z) = \sum_{n=0}^{\infty} \frac{\varphi(a^n)}{z^{n+1}}$ for all $z \in \mathbb{C}$ with $|z| > C$. Note that the Cauchy transform is the negative of the Stieltjes transform. In this paper, we will focus only on the Cauchy transform. Recall that for a probability measure $\nu$ on $\mathbb{R}$ and for $z \notin \mathbb{R}$, the Cauchy transform of $\nu$ is defined as $G_\nu(z) = \int \frac{1}{z - x}\, d\nu(x)$. The definition above is motivated by the following property of the Cauchy transform of a measure. Suppose $\nu$ is a probability measure whose support is contained in $[-C, C]$ for some $C > 0$ and which has moments $\{m_k(\nu)\}_{k=0}^{\infty}$. Then the Cauchy transform of $\nu$ admits, for $z \in \mathbb{C}$ with $|z| > C$, the expansion $G_\nu(z) = \sum_{k=0}^{\infty} \frac{m_k(\nu)}{z^{k+1}}$.

Definition 17 (Moment generating function). Let $a$ be an element of a $*$-probability space $(\mathcal{A}, \varphi)$. The moment generating function of $a$ is defined as $M_a(z) = \sum_{k=1}^{\infty} \varphi(a^k) z^k$ for $z \in \mathbb{C}$ such that $|z| < r_a$. Here, $r_a$ is the radius of convergence of the series. For a probability measure $\nu$, the moment generating function is defined analogously. (Note: The definition above is not to be confused with the moment generating function of a random variable in probability theory.) The Cauchy transform is related to the moment series via $G_a(z) = \frac{1}{z}\big(M_a(1/z) + 1\big)$. In the other direction, we have:
$M_a(z) = \frac{1}{z} G_a(1/z) - 1$. (14)

Definition 18 (S-transform). For $a$ with moment generating series $M_a(z) = \sum_{m=1}^{\infty} \varphi(a^m) z^m$, we define the S-transform of $a$ by:
$S_a(w) = \frac{1 + w}{w} M_a^{-1}(w)$, (15)
where $M_a^{-1}$ denotes the inverse under composition of $M_a$.

Finally, in terms of an operator $A$, we summarize the series of invertible transformations between the various transforms introduced in this section. Cauchy transform: $G_A(z) = \overline{\operatorname{tr}}[(z I - A)^{-1}]$. Moment generating series: $M_A(z) = \sum_{k=1}^{\infty} \overline{\operatorname{tr}}[A^k] z^k$. S-transform: $S_A(w) = \frac{1 + w}{w} M_A^{-1}(w)$, where $M_A^{-1}$ denotes the inverse under composition of $M_A$, and $\overline{\operatorname{tr}}[A]$ denotes the average trace $\operatorname{tr}[A]/p$ of a matrix $A \in \mathbb{R}^{p \times p}$.
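As a quick illustration of this transform chain, the following sketch (our own; the matrix sizes and root-finding bracket are arbitrary choices) computes the Cauchy transform, moment series, and S-transform numerically for the subsampling spectrum with $k$ eigenvalues equal to one and $n - k$ equal to zero, and compares against the closed form $S(w) = (1 + w)/(w + k/n)$ derived in Appendix B.2.

```python
# Numerically chain Cauchy transform -> moment series -> S-transform for the
# subsampling weight Gram matrix W^T W and compare to the closed form of B.2.
import numpy as np
from scipy.optimize import brentq

n, k = 1000, 300
eigs = np.concatenate([np.ones(k), np.zeros(n - k)])  # spectrum of W^T W

def cauchy(z):
    """G(z) = average trace of (z I - W^T W)^{-1}."""
    return np.mean(1.0 / (z - eigs))

def moment_series(z):
    """M(z) = (1/z) G(1/z) - 1, cf. Equation (14)."""
    return cauchy(1.0 / z) / z - 1.0

def s_transform(w):
    """S(w) = (1 + w)/w * M^{-1}(w), with M^{-1} found by root-finding."""
    m_inv = brentq(lambda z: moment_series(z) - w, -50.0, -1e-6)  # negative branch
    return (1.0 + w) / w * m_inv

for w in [-0.05, -0.1, -0.2]:
    print(w, s_transform(w), (1.0 + w) / (w + k / n))  # numeric vs closed form
```

The two printed columns agree, matching $M_{W^\top W}(z) = (k/n)\, z/(1 - z)$ and the resulting S-transform derived later.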
A.3 Asymptotic ridge resolvents

In this section, we provide a brief background on the language of asymptotic equivalents used in the proofs throughout the paper. We state the definition of asymptotic equivalents and point to useful calculus rules. For more details, see [16, Appendix S.7]. To concisely present our results, we use the framework of asymptotic equivalence [5, 6, 16], defined as follows. Let $A_p$ and $B_p$ be sequences of matrices of arbitrary dimensions (including vectors and scalars). We say that $A_p$ and $B_p$ are asymptotically equivalent, denoted $A_p \simeq B_p$, if $\lim_{p \to \infty} |\operatorname{tr}[C_p(A_p - B_p)]| = 0$ almost surely for any sequence of random matrices $C_p$ with bounded trace norm that are independent of $A_p$ and $B_p$. Note that for sequences of scalar random variables, the definition simply reduces to the usual almost sure convergence of the sequences of random variables involved. The notion of deterministic equivalents obeys various calculus rules such as sum, product, differentiation, conditioning, and substitution. We refer the reader to [16] for a comprehensive list of these calculus rules, their proofs, and other related details. Next, we collect first- and second-order asymptotic equivalents for sketched ridge resolvents from [11, 17], which will be useful for our extensions to weighted ridge resolvents.

Assumption B (Sketch structure). Let $S \in \mathbb{R}^{p \times q}$ be the feature sketching matrix and $X \in \mathbb{R}^{n \times p}$ be the data matrix. Let $SS^\top$ and $\frac{1}{n} X^\top X$ converge almost surely to bounded operators that are infinitesimally free with respect to $\big(\frac{1}{p}\operatorname{tr}[\,\cdot\,],\ \operatorname{tr}[\Theta(\,\cdot\,)]\big)$ for any $\Theta$ independent of $S$ with $\|\Theta\|_{\mathrm{tr}}$ uniformly bounded. Additionally, let $SS^\top$ have a limiting S-transform that is analytic on the lower half of the complex plane.

For the statement to follow, define $\widehat{\Sigma} := \frac{1}{n} X^\top X$ and $\widetilde{\lambda}_0 := -\liminf_p \lambda_{\min}^{+}(S^\top \widehat{\Sigma} S)$. Here, recall that $\lambda_{\min}^{+}(A)$ denotes the minimum nonzero eigenvalue of a symmetric matrix $A$.

Theorem 19 (Free sketching equivalence; [11], Theorem 7.2). Under Assumption B, for all $\lambda > \widetilde{\lambda}_0$,
$S(S^\top \widehat{\Sigma} S + \lambda I_q)^{\dagger} S^\top \simeq (\widehat{\Sigma} + \nu I_p)^{\dagger}$, (16)
where $\nu > -\lambda_{\min}^{+}(\widehat{\Sigma})$ is increasing in $\lambda > \widetilde{\lambda}_0$ and satisfies:
$\nu = \lambda\, S_{SS^\top}\big({-\overline{\operatorname{tr}}[\widehat{\Sigma} S(S^\top \widehat{\Sigma} S + \lambda I_q)^{\dagger} S^\top]}\big) = \lambda\, S_{SS^\top}\big({-\overline{\operatorname{tr}}[\widehat{\Sigma}(\widehat{\Sigma} + \nu I_p)^{\dagger}]}\big)$. (17)

Lemma 20 (Second-order equivalence for sketched ridge resolvents; [17], Lemma 15). Under the settings of Theorem 19, for any positive semidefinite $\Psi$ with uniformly bounded operator norm, for all $\lambda > \widetilde{\lambda}_0$,
$S(S^\top \widehat{\Sigma} S + \lambda I_q)^{\dagger} S^\top \Psi S(S^\top \widehat{\Sigma} S + \lambda I_q)^{\dagger} S^\top \simeq (\widehat{\Sigma} + \nu I_p)^{\dagger}(\Psi + \nu_\Psi I_p)(\widehat{\Sigma} + \nu I_p)^{\dagger}$, (18)
where $\nu_\Psi \ge 0$ is given by:
$\nu_\Psi = \lambda^2\, S'_{SS^\top}\big({-\overline{\operatorname{tr}}[\widehat{\Sigma}(\widehat{\Sigma} + \nu I_p)^{\dagger}]}\big)\, \overline{\operatorname{tr}}\big[(\widehat{\Sigma} + \nu I_p)^{\dagger} \Psi (\widehat{\Sigma} + \nu I_p)^{\dagger}\big]$. (19)

B Proofs in Section 3

B.1 Proof of Theorem 1

Our main ingredient in the proof is Lemma 21. We first show the estimator equivalence and then the degrees of freedom equivalence.

Estimator equivalence. Recall from (1) that the ridge estimator on the weighted data is:
$\widehat{\beta}_{W,\lambda} = (\Phi^\top W^\top W \Phi/n + \lambda I_p)^{\dagger}\, \Phi^\top W^\top W y/n$.
This is the primal form of the ridge estimator. Using the Woodbury matrix identity, we first rewrite the estimator in its dual form:
$\widehat{\beta}_{W,\lambda} = \Phi^\top W^\top (W \Phi\Phi^\top W^\top/n + \lambda I_n)^{\dagger} W y/n = \Phi^\top W^\top (G_W + \lambda I_n)^{\dagger} W y/n$,
where $G_W = W\Phi\Phi^\top W^\top/n$ and $G_I = \Phi\Phi^\top/n$. Now, we can apply the first part of Lemma 21 to the matrix $W^\top (G_W + \lambda I_n)^{\dagger} W$. From (31), we then have the following equivalence:
$\widehat{\beta}_{W,\lambda} \simeq \Phi^\top (G_I + \mu I_n)^{\dagger} y/n = \Phi^\top (\Phi\Phi^\top/n + \mu I_n)^{\dagger} y/n = (\Phi^\top\Phi/n + \mu I_p)^{\dagger} \Phi^\top y/n = \widehat{\beta}_{I,\mu}$,
where $\mu$ satisfies the equation:
$\mu = \lambda\, S_{W^\top W}\big({-\overline{\operatorname{tr}}[G_I(G_I + \mu I_n)^{\dagger}]}\big) = \lambda\, S_{W^\top W}\big({-\mathrm{df}(\widehat{\beta}_{I,\mu})/n}\big)$.
Note that in the simplification, we used the Woodbury identity again to go back from the dual form to the primal form of the ridge estimator based on the full data. Rearranging, we obtain the desired estimator equivalence.
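The primal-dual rewriting used above is easy to check numerically. The following minimal sketch (our own; the dimensions and the subsampling weight matrix are illustrative choices) verifies that the two forms of the weighted ridge estimator coincide.

```python
# Check: (Phi^T W^T W Phi/n + lam I_p)^{-1} Phi^T W^T W y / n
#      = Phi^T W^T (W Phi Phi^T W^T/n + lam I_n)^{-1} W y / n.
import numpy as np

rng = np.random.default_rng(1)
n, p, k, lam = 80, 120, 40, 0.3
Phi = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Subsampling weight matrix: diagonal with k ones and n - k zeros.
W = np.zeros((n, n))
idx = rng.choice(n, size=k, replace=False)
W[idx, idx] = 1.0

primal = np.linalg.solve(Phi.T @ W.T @ W @ Phi / n + lam * np.eye(p),
                         Phi.T @ W.T @ W @ y / n)
dual = Phi.T @ W.T @ np.linalg.solve(W @ Phi @ Phi.T @ W.T / n + lam * np.eye(n),
                                     W @ y) / n
print(np.max(np.abs(primal - dual)))  # numerically ~0
```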
We next move on to showing the degrees of freedom equivalence.

Degrees of freedom equivalence. For the full ridge estimator $\widehat{\beta}_{I,\mu}$, the effective degrees of freedom is given by:
$\mathrm{df}(\widehat{\beta}_{I,\mu}) = \operatorname{tr}[\Phi^\top\Phi/n\,(\Phi^\top\Phi/n + \mu I_p)^{\dagger}] = \operatorname{tr}[\Phi(\Phi^\top\Phi/n + \mu I_p)^{\dagger}\Phi^\top/n] = \operatorname{tr}[(\Phi\Phi^\top/n + \mu I_n)^{\dagger}\Phi\Phi^\top/n]$.
The second equality above follows from the push-through identity $\Phi(\Phi^\top\Phi/n + \mu I_p)^{\dagger}\Phi^\top = (\Phi\Phi^\top/n + \mu I_n)^{\dagger}\Phi\Phi^\top$. Recognizing the quantity inside the trace functional of the path equation as the degrees of freedom of the full ridge estimator, we have $\mu = \lambda\, S_{W^\top W}\big(-\mathrm{df}(\widehat{\beta}_{I,\mu})/n\big)$. We can equivalently write the equation above as $S_{W^\top W}^{-1}(\mu/\lambda) = -\mathrm{df}(\widehat{\beta}_{I,\mu})/n$, or $\mu/\lambda = S_{W^\top W}\big(-\mathrm{df}(\widehat{\beta}_{I,\mu})/n\big)$. Rearranging the display above provides the desired degrees of freedom equivalence and finishes the proof.

B.2 Proof of Theorem 2

We apply Theorem 1 to the subsampling weight matrix $W$. The main ingredient that we need is the S-transform of the spectrum of the matrix $W^\top W$. As summarized in Appendix A.2, one approach to computing the S-transform is to go through the following chain of transforms: first compute the Cauchy transform, then the moment generating series, and finally take the inverse to obtain the S-transform. We do this in the following steps.

Cauchy transform. Recall that the Cauchy transform from Definition 16 can be computed as:
$G_{W^\top W}(z) = \overline{\operatorname{tr}}[(z I_n - W^\top W)^{-1}]$.

Moment generating series. We can then compute the moment series from Definition 17 using (14) as follows:
$M_{W^\top W}(z) = \tfrac{1}{z} G_{W^\top W}(1/z) - 1 = \overline{\operatorname{tr}}[(I_n - z W^\top W)^{-1}] - \overline{\operatorname{tr}}[I_n] = \overline{\operatorname{tr}}[(z W^\top W - I_n + I_n)(I_n - z W^\top W)^{-1}] = \overline{\operatorname{tr}}[z W^\top W (I_n - z W^\top W)^{-1}]$.
We now note that the matrix $W^\top W$ has $k$ eigenvalues equal to $1$ and $n - k$ eigenvalues equal to $0$. Therefore, we have
$M_{W^\top W}(z) = \overline{\operatorname{tr}}[z W^\top W (I_n - z W^\top W)^{-1}] = \frac{k}{n} \cdot \frac{z}{1 - z}$. (20)

S-transform. The inverse under composition of the map $z \mapsto M_{W^\top W}(z)$ from (20) is:
$M_{W^\top W}^{-1}(w) = \frac{w}{w + k/n}$. (21)
Therefore, from Definition 18 and using (21), we have
$S_{W^\top W}(w) = \frac{1 + w}{w} \cdot \frac{w}{w + k/n} = \frac{1 + w}{w + k/n}$. (22)

Now, we are ready to apply Theorem 1 to the subsampling matrix $W$. Substituting (22) into (2), we get
$\frac{\mu}{\lambda} = S_{W^\top W}\big({-\mathrm{df}(\widehat{\beta}_{I,\mu})/n}\big) = \frac{1 - \mathrm{df}(\widehat{\beta}_{I,\mu})/n}{k/n - \mathrm{df}(\widehat{\beta}_{I,\mu})/n}$.
Rearranging, we obtain $\frac{\mathrm{df}(\widehat{\beta}_{I,\mu})}{n}(\mu - \lambda) = \mu\,\frac{k}{n} - \lambda$. Thus, we get
$\frac{\mathrm{df}(\widehat{\beta}_{I,\mu})}{n} = \frac{\mu\, k/n - \lambda}{\mu - \lambda}$.
In other words, we have
$1 - \frac{\mathrm{df}(\widehat{\beta}_{I,\mu})}{n} = \frac{\mu (1 - k/n)}{\mu - \lambda}$.
Multiplying both sides by $(1 - \lambda/\mu)$, we arrive at the desired relation
$\Big(1 - \frac{\mathrm{df}(\widehat{\beta}_{I,\mu})}{n}\Big)\Big(1 - \frac{\lambda}{\mu}\Big) = 1 - \frac{k}{n}$.
This completes the proof.

B.3 Proof of Proposition 3

We prove this by matching the path (4) with the one in [15]. Let $\gamma = p/n$, $\psi = p/k$, and let $H_p$ be the spectral distribution of $\widehat{\Sigma} = X^\top X/n$. The path from Equation (5) of [15] is given by the following equation:
$\mu = (\psi - \gamma) \int \frac{r}{1 + v(\mu, \gamma) r}\, dH_p(r)$, (23)
where $v(\mu, \gamma)$ is the unique solution to the fixed-point equation:
$\frac{1}{v(\mu, \gamma)} = \mu + \gamma \int \frac{r}{1 + v(\mu, \gamma) r}\, dH_p(r) = \psi \int \frac{r}{1 + v(\mu, \gamma) r}\, dH_p(r)$. (24)
For given $\gamma$ and $\mu$, we show that the $\psi$ that solves (23) gives rise to $k = p/\psi$ that also solves (4) with $\lambda = 0$:
$\operatorname{tr}\big[\tfrac{1}{n} X X^\top\big(\tfrac{1}{n} X X^\top + \mu I_n\big)^{\dagger}\big] = k$.
Rearranging the above equation yields:
$\frac{k}{n} = 1 - \mu\, \overline{\operatorname{tr}}\big[\big(\tfrac{1}{n} X X^\top + \mu I_n\big)^{\dagger}\big] = 1 - \mu\, v(\mu, \gamma)$,
where the second equality is from Lemma B.2 of [15]. This implies that $\frac{1}{v(\mu,\gamma)} = \psi \int \frac{r}{1 + v(\mu, \gamma) r}\, dH_p(r)$, and hence
$\mu = \frac{1}{v(\mu,\gamma)} - \gamma \int \frac{r}{1 + v(\mu, \gamma) r}\, dH_p(r) = (\psi - \gamma) \int \frac{r}{1 + v(\mu, \gamma) r}\, dH_p(r)$,
where the second equality follows from (24). The above is the same as the path (23) of [15]. This finishes the proof.
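For subsampling, the path of Theorem 2 derived above reduces to the degrees-of-freedom matching relation $(1 - \mathrm{df}(\widehat{\beta}_{I,\mu})/n)(1 - \lambda/\mu) = 1 - k/n$. The sketch below (our own; the feature model and tolerances are illustrative) solves this relation numerically for the matched full-data ridge level $\mu$ given a pair $(k/n, \lambda)$; both factors on the left are increasing in $\mu$, so the root is unique.

```python
# Solve the subsampling path (1 - df(mu)/n)(1 - lam/mu) = 1 - k/n for mu.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
n, p = 500, 300
Phi = rng.standard_normal((n, p)) @ np.diag(np.linspace(0.5, 2.0, p) ** 0.5)
svals = np.linalg.eigvalsh(Phi.T @ Phi / n)  # eigenvalues of Phi^T Phi / n

def df(mu):
    """Degrees of freedom of the full ridge estimator at level mu."""
    return np.sum(svals / (svals + mu))

def path_mu(lam, k_over_n):
    """Return mu > lam matching the subsample pair (k/n, lam) on the path."""
    g = lambda mu: (1.0 - df(mu) / n) * (1.0 - lam / mu) - (1.0 - k_over_n)
    return brentq(g, lam + 1e-10, 1e6)

lam, k_over_n = 0.1, 0.5
mu = path_mu(lam, k_over_n)
print("matched mu:", mu, "df(mu)/n:", df(mu) / n)
```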
B.4 Proof of Proposition 4

We first describe the setup for the kernel ridge regression formulation and then show the desired equivalence.

Setup. Let $K(\cdot, \cdot) : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a kernel function, and let $\mathcal{H}$ denote the reproducing kernel Hilbert space associated with kernel $K$. Kernel ridge regression with the subsampling matrix $W$ solves the following problem with tuning parameter $\lambda \ge 0$:
$\widehat{f}_{W,\lambda} = \operatorname*{argmin}_{f \in \mathcal{H}}\ \|W y - W f(X)\|_2^2/n + \lambda \|f\|_{\mathcal{H}}^2$,
where $f(X) = [f(x_1), \ldots, f(x_n)]^\top$. Kernel ridge regression predictions have a closed-form expression:
$\widehat{f}_{W,\lambda}(x) = K(x, X)^\top W^\top (W K(X, X) W^\top + \lambda I_n)^{\dagger} W y$.
Here, $K(x, X) \in \mathbb{R}^n$ with $i$-th entry $K(x, x_i)$, and $K(X, X) \in \mathbb{R}^{n \times n}$ with $(i, j)$-th entry $K(x_i, x_j)$. The predicted values on the training data $X$ are given by
$\widehat{f}_{W,\lambda}(X) = K(X, X) W^\top (W K(X, X) W^\top + \lambda I_n)^{\dagger} W y$.
Here, the matrix $K(X, X) W^\top (W K(X, X) W^\top + \lambda I_n)^{\dagger} W$ is the smoothing matrix. Define $G_I = K(X, X)$ and $G_W = W K(X, X) W^\top$. Leveraging the kernel trick, the preceding optimization problem translates into solving the following problem (in the dual domain):
$\widehat{\alpha}_{W,\lambda} = \operatorname*{argmin}_{\alpha \in \mathbb{R}^n}\ \alpha^\top (G_W + \lambda I_n)\alpha - 2\alpha^\top W y$,
where the dual solution is given by $\widehat{\alpha}_{W,\lambda} = (G_W + \lambda I_n)^{\dagger} W y$. The correspondence between the dual and primal solutions is simply given by $\widehat{\beta}_{W,\lambda} = \Phi^\top W^\top \widehat{\alpha}_{W,\lambda}$, where $\Phi = [\phi(x_1), \ldots, \phi(x_n)]^\top$ is the feature matrix and $\phi: \mathbb{R}^d \to \mathcal{H}$ is the feature map of the Hilbert space $\mathcal{H}$ with kernel $K$. Thus,
$W \widehat{f}_{W,\lambda}(X) = W\Phi\, \widehat{\beta}_{W,\lambda} = W\Phi\Phi^\top W^\top \widehat{\alpha}_{W,\lambda} = G_W (G_W + \lambda I_n)^{\dagger} W y$,
and the degrees of freedom is given by $\mathrm{df}(\widehat{f}_{W,\lambda}) = \operatorname{tr}[G_W (G_W + \lambda I_n)^{\dagger}]$.

Next, we show that (3) holds. Alternatively, one can also show that $\widehat{\alpha}_{W,\lambda} \simeq \widehat{\alpha}_{I,\mu}$ and $\widehat{f}_{W,\lambda}(x_0) \simeq \widehat{f}_{I,\mu}(x_0)$, which we omit due to similarity. Our proof strategy consists of two steps. We first show that it suffices to establish the desired result for the linearized version, and we then show that we can suitably adapt our result to the linearized version.

Linearization of kernels. In what follows, we show that for $\mu \ge 0$,
$W^\top (G_W + \lambda I_n)^{\dagger} W \simeq (G_I + \mu I_n)^{\dagger}$ and $\overline{\operatorname{tr}}[\lambda (G_W + \lambda I_n)^{\dagger}] \simeq \overline{\operatorname{tr}}[\mu (G_I + \mu I_n)^{\dagger}]$,
where $W$ and $\lambda \ge 0$ satisfy
$\mu = \lambda\, S_{W^\top W}\big({-\tfrac{1}{n}\operatorname{tr}[G_I(G_I + \mu I_n)^{\dagger}]}\big) = \lambda\, S_{W^\top W}\big({-\tfrac{1}{n}\operatorname{tr}[G_W(G_W + \lambda I_n)^{\dagger}]}\big)$. (25)
Using the assumptions of Proposition 3 and the assumption in Proposition 4, by [18, Proposition 5.1] (Assumption A1 of [18] requires finite $5 + \delta$ moments, which can be relaxed to finite $4 + \delta$ moments, as in the assumption of Proposition 3, by a truncation argument as in the proof of Theorem 6 of [7, Appendix A.4]), we have
$\|G_I - G^{\mathrm{lin}}_I\|_{\mathrm{op}} \xrightarrow{p} 0$, where $G^{\mathrm{lin}}_I = c_0 I_n + c_1 1_n 1_n^\top + c_2 X X^\top$,
and $(c_0, c_1, c_2)$, associated with the function $g$ in Proposition 4 and $\tau = \lim_p \operatorname{tr}[\Sigma]/p$, are defined as
$c_0 = g(\tau, \tau, \tau) - g(\tau, 0, \tau) - c_2 \operatorname{tr}[\Sigma]$, (26)
$c_1 = g(\tau, 0, \tau) + g''(\tau, 0, \tau)\operatorname{tr}[\Sigma^2]$, (27)
$c_2 = g'(\tau, 0, \tau)$. (28)
Assume $C_n$ is a sequence of random matrices with bounded trace norm. Note that
$\big|\operatorname{tr}\big[C_n\big((G_I + \mu I_n)^{\dagger} - (G^{\mathrm{lin}}_I + \mu I_n)^{\dagger}\big)\big]\big| \le \|C_n\|_{\mathrm{tr}}\, \big\|(G_I + \mu I_n)^{\dagger} - (G^{\mathrm{lin}}_I + \mu I_n)^{\dagger}\big\|_{\mathrm{op}} \le \|C_n\|_{\mathrm{tr}}\, \|(G_I + \mu I_n)^{\dagger}\|_{\mathrm{op}}\, \|(G^{\mathrm{lin}}_I + \mu I_n)^{\dagger}\|_{\mathrm{op}}\, \|G_I - G^{\mathrm{lin}}_I\|_{\mathrm{op}} \xrightarrow{\text{a.s.}} 0$,
where in the last inequality we use the matrix identity $A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1}$ for two invertible matrices $A$ and $B$. Thus, we have
$(G_I + \mu I_n)^{\dagger} \simeq (G^{\mathrm{lin}}_I + \mu I_n)^{\dagger}$. (29)
Hence, combining (29) and the transitive property of asymptotic equivalence [15, Lemma S.7.4 (1)], it suffices to show
$W^\top (G^{\mathrm{lin}}_W + \lambda I_n)^{\dagger} W \simeq (G^{\mathrm{lin}}_I + \mu I_n)^{\dagger}$, where $G^{\mathrm{lin}}_W = W G^{\mathrm{lin}}_I W^\top$,
and $\lambda$ and $\mu$ satisfy (25). Similarly, we can also show that the path (25) is asymptotically equivalent to
$\mu = \lambda\, S_{W^\top W}\big({-\tfrac{1}{n}\operatorname{tr}[G^{\mathrm{lin}}_I(G^{\mathrm{lin}}_I + \mu I_n)^{\dagger}]}\big) = \lambda\, S_{W^\top W}\big({-\tfrac{1}{n}\operatorname{tr}[G^{\mathrm{lin}}_W(G^{\mathrm{lin}}_W + \lambda I_n)^{\dagger}]}\big)$. (30)

Equivalence for linearized kernels. We next show that the resolvent equivalence result holds for the linearized kernel matrix $G^{\mathrm{lin}}_I$. This follows from additional manipulations building on Lemma 21.
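As a concrete companion to the kernel setup above, the following sketch (our own; the RBF kernel, its bandwidth, and the data are illustrative choices, not the paper's experimental configuration) computes the subsampled kernel ridge dual solution, a test prediction, and the degrees of freedom of the subsampled smoother.

```python
# Subsampled kernel ridge regression in dual form:
#   alpha = (G_W + lam I)^{-1} W y,  f(x) = K(x, X)^T W^T alpha,
#   df = tr[G_W (G_W + lam I)^{-1}].
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
n, d, k, lam = 200, 5, 80, 1e-2
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

W = np.zeros((n, n))                          # subsampling matrix with k ones
idx = rng.choice(n, size=k, replace=False)
W[idx, idx] = 1.0

G_I = rbf_kernel(X, X)
G_W = W @ G_I @ W.T
alpha = np.linalg.solve(G_W + lam * np.eye(n), W @ y)   # dual solution
x0 = rng.standard_normal((1, d))
f_x0 = rbf_kernel(x0, X) @ W.T @ alpha                  # prediction at a test point
df_W = np.trace(G_W @ np.linalg.inv(G_W + lam * np.eye(n)))
print(f_x0.item(), df_W)
```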
B.5 Proof of Proposition 5

In what follows, we show that for $\mu \ge 0$,
$W^\top (W \Phi\Phi^\top W^\top/n + \lambda I_n)^{\dagger} W \simeq (\Phi\Phi^\top/n + \mu I_n)^{\dagger}$ and $\overline{\operatorname{tr}}[\lambda (W \Phi\Phi^\top W^\top/n + \lambda I_n)^{\dagger}] \simeq \overline{\operatorname{tr}}[\mu (\Phi\Phi^\top/n + \mu I_n)^{\dagger}]$.
Under the assumptions of Proposition 5, the linearized features take the form $\Phi^{\mathrm{lin}} = \rho_s X F^\top/\sqrt{d} + \omega_s U$, where the constants $\rho_s$ and $\omega_s$ are given in Proposition 5 and $U \in \mathbb{R}^{n \times p}$ has i.i.d. standard normal entries. From Claim A.13 of [10], the linear functionals of the estimators $\widehat{\beta}_{W,\lambda}$ and $\widehat{\beta}_{I,\mu}$ with random features $\Phi$ and $\Phi^{\mathrm{lin}}$ are asymptotically equivalent. Now, following the proof of Proposition 4, we apply Lemma 21 to $\Phi^{\mathrm{lin}}$ to yield the desired result.

B.6 Technical lemmas

In preparation for the forthcoming statement, define $\lambda_0 = -\liminf_n \lambda_{\min}^{+}(G_W)$. Recall the Gram matrices $G = \Phi\Phi^\top/n$ and $G_W = W \Phi\Phi^\top W^\top/n$.

Lemma 21 (General first-order equivalence for freely subsampled ridge resolvents). For $W \in \mathbb{R}^{n \times n}$, suppose Assumption A holds for $W^\top W$. Then, for all $\lambda > \lambda_0$,
$W^\top (G_W + \lambda I_n)^{\dagger} W \simeq (G + \mu I_n)^{\dagger}$, (31)
$\overline{\operatorname{tr}}[\lambda (G_W + \lambda I_n)^{\dagger}] \simeq \overline{\operatorname{tr}}[\mu (G + \mu I_n)^{\dagger}]$, (32)
where $\mu > -\lambda_{\min}^{+}(G)$ solves the equation:
$\mu = \lambda\, S_{W^\top W}\big({-\overline{\operatorname{tr}}[G(G + \mu I_n)^{\dagger}]}\big) = \lambda\, S_{W^\top W}\big({-\overline{\operatorname{tr}}[G_W(G_W + \lambda I_n)^{\dagger}]}\big)$. (33)

Proof of Lemma 21. The first result follows from Theorem 19 by suitably changing the roles of $X$ and $\Phi$. In particular, we set $X$ to be $\Phi^\top$ and $S$ to be $W^\top$ and apply Theorem 19 to obtain
$W^\top (W \Phi\Phi^\top W^\top/n + \lambda I_n)^{\dagger} W \simeq (\Phi\Phi^\top/n + \mu I_n)^{\dagger}$. (34)
Writing this in terms of $G$ and $G_W$ proves the first part (31). For the second part, we use the result (34) from the first part and multiply both sides by $\Phi\Phi^\top/n$ to get
$(\Phi\Phi^\top/n)\, W^\top (W \Phi\Phi^\top W^\top/n + \lambda I_n)^{\dagger} W \simeq (\Phi\Phi^\top/n)(\Phi\Phi^\top/n + \mu I_n)^{\dagger}$.
Using the trace property of asymptotic equivalence [15, Lemma S.7.4 (4)], we have
$\overline{\operatorname{tr}}[(\Phi\Phi^\top/n)\, W^\top (W \Phi\Phi^\top W^\top/n + \lambda I_n)^{\dagger} W] \simeq \overline{\operatorname{tr}}[(\Phi\Phi^\top/n)(\Phi\Phi^\top/n + \mu I_n)^{\dagger}]$.
Using the cyclic property of the trace operator yields
$\overline{\operatorname{tr}}[(W \Phi\Phi^\top W^\top/n)(W \Phi\Phi^\top W^\top/n + \lambda I_n)^{\dagger}] \simeq \overline{\operatorname{tr}}[(\Phi\Phi^\top/n)(\Phi\Phi^\top/n + \mu I_n)^{\dagger}]$.
In terms of $G$ and $G_W$, this is the same as
$\overline{\operatorname{tr}}[G_W (G_W + \lambda I_n)^{\dagger}] \simeq \overline{\operatorname{tr}}[G (G + \mu I_n)^{\dagger}]$.
Adding and subtracting $\lambda I_n$ and $\mu I_n$ in the left- and right-hand resolvents, we arrive at the second part (32). This completes the proof.

C Proofs in Section 4

C.1 Proof of Theorem 6

The main ingredients of the proof are Lemmas 21 and 22. We begin by decomposing the unknown response $y_0$ into its linear predictor and residual. Specifically, let $\beta_0$ be the optimal projection parameter given by $\beta_0 = \Sigma_0^{-1}\mathbb{E}[\phi_0 y_0]$. Then, we can express the response as the sum of its best linear predictor, $\phi_0^\top \beta_0$, and the residual, $y_0 - \phi_0^\top \beta_0$. Denote the variance of this residual by $\sigma_0^2 = \mathbb{E}[(y_0 - \phi_0^\top \beta_0)^2]$. It is easy to see that the risk decomposes as follows:
$R(\widehat{\beta}_{W_{1:M},\lambda}) = \mathbb{E}\big[(y_0 - \phi_0^\top \widehat{\beta}_{W_{1:M},\lambda})^2 \mid \Phi, y, \{W_m\}_{m=1}^M\big] = (\widehat{\beta}_{W_{1:M},\lambda} - \beta_0)^\top \Sigma_0 (\widehat{\beta}_{W_{1:M},\lambda} - \beta_0) + \sigma_0^2$.
Here, we used the fact that $(y_0 - \phi_0^\top \beta_0)$ is uncorrelated with $\phi_0$, that is, $\mathbb{E}[\phi_0(y_0 - \phi_0^\top \beta_0)] = 0_p$. We note that $\|\beta_0\|_2 < \infty$ and $\Sigma_0$ has uniformly bounded operator norm. Observe that
$R(\widehat{\beta}_{W_{1:M},\lambda}) = (\widehat{\beta}_{W_{1:M},\lambda} - \beta_0)^\top \Sigma_0 (\widehat{\beta}_{W_{1:M},\lambda} - \beta_0) + \sigma_0^2$
$= \Big(\tfrac{1}{M}\sum_{m=1}^M \widehat{\beta}_{W_m,\lambda} - \beta_0\Big)^\top \Sigma_0 \Big(\tfrac{1}{M}\sum_{m=1}^M \widehat{\beta}_{W_m,\lambda} - \beta_0\Big) + \sigma_0^2$
$= \tfrac{1}{M^2}\sum_{k,\ell=1}^M \widehat{\beta}_{W_k,\lambda}^\top \Sigma_0 \widehat{\beta}_{W_\ell,\lambda} - \tfrac{2}{M}\sum_{m=1}^M \beta_0^\top \Sigma_0 \widehat{\beta}_{W_m,\lambda} + \beta_0^\top \Sigma_0 \beta_0 + \sigma_0^2$
$= \tfrac{1}{M^2}\sum_{k,\ell=1}^M \big(\widehat{\beta}_{W_k,\lambda}^\top \Sigma_0 \widehat{\beta}_{W_\ell,\lambda} - \widehat{\beta}_{I,\mu}^\top \Sigma_0 \widehat{\beta}_{I,\mu}\big) + \widehat{\beta}_{I,\mu}^\top \Sigma_0 \widehat{\beta}_{I,\mu} - \tfrac{2}{M}\sum_{m=1}^M \beta_0^\top \Sigma_0 \widehat{\beta}_{W_m,\lambda} + \beta_0^\top \Sigma_0 \beta_0 + \sigma_0^2$.
By Lemma 21, note that $\tfrac{1}{M}\sum_{m=1}^M \widehat{\beta}_{W_m,\lambda} \simeq \widehat{\beta}_{I,\mu}$. Thus, we have
$R(\widehat{\beta}_{W_{1:M},\lambda}) \simeq \tfrac{1}{M^2}\sum_{k,\ell=1}^M \big(\widehat{\beta}_{W_k,\lambda}^\top \Sigma_0 \widehat{\beta}_{W_\ell,\lambda} - \widehat{\beta}_{I,\mu}^\top \Sigma_0 \widehat{\beta}_{I,\mu}\big) + \widehat{\beta}_{I,\mu}^\top \Sigma_0 \widehat{\beta}_{I,\mu} - 2\beta_0^\top \Sigma_0 \widehat{\beta}_{I,\mu} + \beta_0^\top \Sigma_0 \beta_0 + \sigma_0^2$.
Now, by two applications of Lemma 21, we know that $\widehat{\beta}_{W_k,\lambda}^\top \Sigma_0 \widehat{\beta}_{W_\ell,\lambda} - \widehat{\beta}_{I,\mu}^\top \Sigma_0 \widehat{\beta}_{I,\mu} \xrightarrow{\text{a.s.}} 0$ when $k \ne \ell$, since $W_k$ and $W_\ell$ are independent. Hence, we have
$R(\widehat{\beta}_{W_{1:M},\lambda}) \simeq \tfrac{1}{M^2}\sum_{m=1}^M \big(\widehat{\beta}_{W_m,\lambda}^\top \Sigma_0 \widehat{\beta}_{W_m,\lambda} - \widehat{\beta}_{I,\mu}^\top \Sigma_0 \widehat{\beta}_{I,\mu}\big) + (\widehat{\beta}_{I,\mu} - \beta_0)^\top \Sigma_0 (\widehat{\beta}_{I,\mu} - \beta_0) + \sigma_0^2$
$\simeq \tfrac{1}{M}\big(\widehat{\beta}_{W,\lambda}^\top \Sigma_0 \widehat{\beta}_{W,\lambda} - \widehat{\beta}_{I,\mu}^\top \Sigma_0 \widehat{\beta}_{I,\mu}\big) + (\widehat{\beta}_{I,\mu} - \beta_0)^\top \Sigma_0 (\widehat{\beta}_{I,\mu} - \beta_0) + \sigma_0^2$
$= \tfrac{1}{M}\big(\widehat{\beta}_{W,\lambda}^\top \Sigma_0 \widehat{\beta}_{W,\lambda} - \widehat{\beta}_{I,\mu}^\top \Sigma_0 \widehat{\beta}_{I,\mu}\big) + R(\widehat{\beta}_{I,\mu})$, (35)
where we used the fact that the $M$ terms with $k = \ell$ converge identically in the second-to-last line, and a risk decomposition similar to that for $\widehat{\beta}_{W_{1:M},\lambda}$ in the last line.
Thus, it suffices to evaluate the difference $\widehat{\beta}_{W,\lambda}^\top \Sigma_0 \widehat{\beta}_{W,\lambda} - \widehat{\beta}_{I,\mu}^\top \Sigma_0 \widehat{\beta}_{I,\mu}$ to finish the proof. We have
$\widehat{\beta}_{W,\lambda}^\top \Sigma_0 \widehat{\beta}_{W,\lambda} - \widehat{\beta}_{I,\mu}^\top \Sigma_0 \widehat{\beta}_{I,\mu}$
$= (y^\top W^\top/n) W\Phi(\Phi^\top W^\top W\Phi/n + \lambda I_p)^{\dagger} \Sigma_0 (\Phi^\top W^\top W\Phi/n + \lambda I_p)^{\dagger} \Phi^\top W^\top (W y/n) - (y^\top \Phi/n)(\Phi^\top\Phi/n + \mu I_p)^{\dagger} \Sigma_0 (\Phi^\top\Phi/n + \mu I_p)^{\dagger} (\Phi^\top y/n)$
$= \overline{\operatorname{tr}}\big[W^\top W\Phi\big(\tfrac{1}{n}\Phi^\top W^\top W\Phi + \lambda I_p\big)^{\dagger} \Sigma_0 \big(\tfrac{1}{n}\Phi^\top W^\top W\Phi + \lambda I_p\big)^{\dagger} \Phi^\top W^\top W/n\, (yy^\top)\big] - \overline{\operatorname{tr}}\big[\Phi(\Phi^\top\Phi/n + \mu I_p)^{\dagger} \Sigma_0 (\Phi^\top\Phi/n + \mu I_p)^{\dagger} \Phi^\top/n\, (yy^\top)\big]$
$\simeq \overline{\operatorname{tr}}\big[(\Phi\Phi^\top/n + \mu I_n)^{\dagger}(\Phi\Sigma_0\Phi^\top/n + \mu_{\Sigma_0} I_n)(\Phi\Phi^\top/n + \mu I_n)^{\dagger} (yy^\top)\big] - \overline{\operatorname{tr}}\big[\Phi(\Phi^\top\Phi/n + \mu I_p)^{\dagger} \Sigma_0 (\Phi^\top\Phi/n + \mu I_p)^{\dagger} \Phi^\top/n\, (yy^\top)\big]$
$= \overline{\operatorname{tr}}\big[(\Phi\Phi^\top/n + \mu I_n)^{\dagger}(\Phi\Sigma_0\Phi^\top/n)(\Phi\Phi^\top/n + \mu I_n)^{\dagger} (yy^\top)\big] + \mu_{\Sigma_0}\,\overline{\operatorname{tr}}\big[(\Phi\Phi^\top/n + \mu I_n)^{\dagger} (yy^\top)(\Phi\Phi^\top/n + \mu I_n)^{\dagger}\big] - \overline{\operatorname{tr}}\big[(\Phi\Phi^\top/n + \mu I_n)^{\dagger}(\Phi\Sigma_0\Phi^\top/n)(\Phi\Phi^\top/n + \mu I_n)^{\dagger} (yy^\top)\big]$
$= \mu_{\Sigma_0}\,\overline{\operatorname{tr}}\big[(\Phi\Phi^\top/n + \mu I_n)^{\dagger} (yy^\top)(\Phi\Phi^\top/n + \mu I_n)^{\dagger}\big]$, (36)
where in the third line we used the second-order equivalence for freely weighted ridge resolvents from Lemma 22, and in the fourth line we employed the push-through identity multiple times. Substituting the expression for $\mu_{\Sigma_0}$ from Lemma 22 into (36), and substituting this back into (35), we arrive at the desired decomposition. This completes the proof.

C.2 Proof of Proposition 7

We use the path (4) with $k^*$ and $\mu^*$, setting $\lambda = 0$:
$1 - \frac{\mathrm{df}(\widehat{\beta}_{I,\mu^*})}{n} = 1 - \frac{k^*}{n}$.
This suggests that $k^* = \mathrm{df}(\widehat{\beta}_{I,\mu^*})$. Note that $r := \operatorname{rank}(X^\top X) = \operatorname{rank}(G_I)$. By the definition of degrees of freedom, it follows that
$\mathrm{df}(\widehat{\beta}_{I,\mu^*}) = \operatorname{tr}[X^\top X(X^\top X + \mu^* I_p)^{\dagger}] = \sum_{i=1}^{r} \frac{s_i}{s_i + \mu^*} \le r = \operatorname{rank}(G_I)$,
where $s_1, \ldots, s_r$ are the nonzero eigenvalues of $X^\top X$. This finishes the proof.

C.3 Technical lemmas

Recall from Appendix B.6 that we define $\lambda_0 = -\liminf_n \lambda_{\min}^{+}(W\Phi\Phi^\top W^\top/n)$.

Lemma 22 (General second-order equivalence for freely weighted ridge resolvents). Under the settings of Lemma 21, for any positive semidefinite $\Sigma_0$ with uniformly bounded operator norm, for all $\lambda > \lambda_0$,
$\frac{1}{n} W^\top W\Phi\big(\tfrac{1}{n}\Phi^\top W^\top W\Phi + \lambda I_p\big)^{\dagger}\Sigma_0\big(\tfrac{1}{n}\Phi^\top W^\top W\Phi + \lambda I_p\big)^{\dagger}\Phi^\top W^\top W \simeq \big(\tfrac{1}{n}\Phi\Phi^\top + \mu I_n\big)^{\dagger}\big(\tfrac{1}{n}\Phi\Sigma_0\Phi^\top + \mu_{\Sigma_0} I_n\big)\big(\tfrac{1}{n}\Phi\Phi^\top + \mu I_n\big)^{\dagger}$, (37)
where $\mu_{\Sigma_0} \ge 0$ is given by:
$\mu_{\Sigma_0} = \lambda^2\, S'_{W^\top W}\big({-\overline{\operatorname{tr}}\big[\tfrac{1}{n}\Phi\Phi^\top\big(\tfrac{1}{n}\Phi\Phi^\top + \mu I_n\big)^{\dagger}\big]}\big)\, \overline{\operatorname{tr}}\big[\big(\tfrac{1}{n}\Phi\Phi^\top + \mu I_n\big)^{\dagger}\,\tfrac{1}{n}\Phi\Sigma_0\Phi^\top\,\big(\tfrac{1}{n}\Phi\Phi^\top + \mu I_n\big)^{\dagger}\big]$. (38)

Proof. We use the Woodbury matrix identity to write
$\frac{1}{n} W^\top W\Phi\big(\tfrac{1}{n}\Phi^\top W^\top W\Phi + \lambda I_p\big)^{\dagger}\Sigma_0\big(\tfrac{1}{n}\Phi^\top W^\top W\Phi + \lambda I_p\big)^{\dagger}\Phi^\top W^\top W = \frac{1}{n} W^\top\big(\tfrac{1}{n}W\Phi\Phi^\top W^\top + \lambda I_n\big)^{\dagger} W\Phi\Sigma_0\Phi^\top W^\top\big(\tfrac{1}{n}W\Phi\Phi^\top W^\top + \lambda I_n\big)^{\dagger} W$.
The equivalence in (37) and the inflation parameter in (38) now follow from the second-order result for feature sketching by substituting $W^\top$ for $S$, $\Phi^\top$ for $X$, and $\frac{1}{n}\Phi\Sigma_0\Phi^\top$ for $\Psi$ in (18).

D Additional illustrations for Section 3

D.1 Implicit regularization paths for bootstrapping

Figure 5: Equivalence under bootstrapping (heatmaps over the subsample ratio $k/n$ and the ridge penalty $\lambda$). The left panel shows the heatmap of degrees of freedom, and the right panel shows the random projection $\mathbb{E}_W[a^\top \widehat{\beta}_{W,\lambda}]$, where $a \sim \mathcal{N}(0_p, I_p/p)$. In both heatmaps, the red lines indicate the predicted paths using Equation (4), and the black dashed lines indicate the empirical paths obtained by matching empirical degrees of freedom. Despite the complexity of the theoretical path for bootstrapping, we observe that the empirical paths closely resemble it. Therefore, the theoretical path for sampling without replacement from (4) serves as a good approximation.

D.2 Implicit regularization paths with non-uniform weights

Figure 6: Equivalence under non-uniform weighting (heatmaps over the subsample ratio $k/n$ and the ridge penalty $\lambda$). The left panel shows the heatmap of degrees of freedom, and the right panel shows the random projection $\mathbb{E}_W[a^\top \widehat{\beta}_{W,\lambda}]$, where $a \sim \mathcal{N}(0_p, I_p/p)$. The weights $\operatorname{diag}(W)$ for the observations are generated as follows: initial weights $(9/10)^i$ for $i = 0, \ldots, n-1$ are assigned, $k$ entries are subsampled from $\{1, \ldots, n\}$, the other $n - k$ entries are zeroed out, and the resulting weight vector is normalized to have norm $\sqrt{k}$.
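For reference, the non-uniform weighting scheme used in Figure 6 can be generated as follows (a sketch of our own; we read the normalization in the caption as scaling the weight vector to $\ell_2$ norm $\sqrt{k}$, matching the subsampling case where $W^\top W$ has $k$ unit eigenvalues).

```python
# Construct the non-uniform diagonal weight matrix described in Figure 6.
import numpy as np

def nonuniform_weight_matrix(n, k, rng):
    base = (9.0 / 10.0) ** np.arange(n)          # initial weights (9/10)^i
    idx = rng.choice(n, size=k, replace=False)   # keep k entries, zero out the rest
    w = np.zeros(n)
    w[idx] = base[idx]
    w *= np.sqrt(k) / np.linalg.norm(w)          # rescale to ell_2 norm sqrt(k)
    return np.diag(w)

rng = np.random.default_rng(0)
W = nonuniform_weight_matrix(n=1000, k=300, rng=rng)
print(np.linalg.norm(np.diag(W)), np.count_nonzero(np.diag(W)))  # ~sqrt(300), 300
```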
The black dashed lines indicate the empirical paths obtained by matching the empirical degrees of freedom.

E Additional illustrations for Section 4

E.1 Rate illustration for ensemble risk against ensemble size

Figure 7: Risk equivalence for random feature structures when sampling without replacement (panels indexed by ensemble size $M = 1, 20, 100$; legends show the prediction risk, its estimate, and a $1 - 1/M$ reference). The solid lines represent the prediction risks and their estimates for the subsample ridge ensemble, and the red dashed lines indicate the prediction error of the full ridge predictor. The data and random features with the ReLU activation function are generated according to Appendix F.1 with $n = 5000$ and $p = 500$. The regularization level for the full ridge is set as $\mu = 1$, and each subsampled ridge ensemble is fitted with $M = 100$ randomly sampled subsampling matrices. For each value of $\lambda$, the subsample ratio is determined by solving Equation (4).

E.2 Real data illustrations for implicit regularization paths

Figure 8: Equivalence in pretrained features of ResNet-18 on the CIFAR-10 dataset (heatmaps of degrees of freedom, training error, and prediction error over the subsample ratio $k/n$ and the ridge penalty $\lambda$).

Figure 9: Equivalence in features of a randomly initialized ResNet-18 on the Fashion-MNIST dataset (heatmaps of degrees of freedom, training error, and prediction error over the subsample ratio $k/n$ and the ridge penalty $\lambda$).

F Details of experiments

F.1 Simulation details

The simulation settings are as follows.

Covariance model. The covariance matrix of an autoregressive process of order 1 (AR(1)) is given by $\Sigma_{\mathrm{ar1}} \in \mathbb{R}^{d \times d}$, where $(\Sigma_{\mathrm{ar1}})_{ij} = \rho_{\mathrm{ar1}}^{|i - j|}$ for some parameter $\rho_{\mathrm{ar1}} \in (0, 1)$. For the simulations, we set $\rho_{\mathrm{ar1}} = 0.25$.

Signal model. Define $\beta_0 = \frac{1}{5}\sum_{j=1}^{5} w^{(j)}$, where $w^{(j)}$ is the eigenvector of $\Sigma_{\mathrm{ar1}}$ associated with the top $j$-th eigenvalue $r^{(j)}$.

Response model. We generated data $(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$ for $i = 1, \ldots, n$ from a nonlinear model:
$y_i = x_i^\top \beta_0 + \frac{1}{p}\big(\|x_i\|_2^2 - \operatorname{tr}[\Sigma_{\mathrm{ar1}}]\big) + \varepsilon_i, \qquad x_i = \Sigma_{\mathrm{ar1}}^{1/2} z_i, \qquad z_{ij} \overset{\text{iid}}{\sim} t_5/\sigma_5$, (M-AR1)
where $\sigma_5 = \sqrt{5/3}$ is the standard deviation of the $t_5$ distribution. The benefit of using the above nonlinear model is that we can clearly separate the linear and nonlinear components and compute the quantities of interest, because $\beta_0$ happens to be the best linear projection.

The linear, random, and kernel features are generated as follows.

Linear features. For a given feature dimension $p$, we use $d = p$ raw features from (M-AR1) as linear features.

Random features. For generating random features, we use $d = 2p$ raw features from (M-AR1) and sample a randomly initialized weight matrix $F \in \mathbb{R}^{p \times d}$ whose entries are i.i.d. samples from $\mathcal{N}(0, d^{-1/2})$. The transformed feature is then given by $\widetilde{x}_i = \phi(F x_i) \in \mathbb{R}^p$, where $\phi$ is a nonlinear activation, set to the ReLU function in our experiments.

Kernel features. For kernel features, we use $d = p$ raw features from (M-AR1) to construct the kernel matrix.

In the simulations, the estimates are averaged across 20 simulations with different random seeds.
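The following sketch (our own; the $1/d$ scaling of the nonlinear term, the Cholesky factor in place of the symmetric square root, and reading $\mathcal{N}(0, d^{-1/2})$ as specifying the standard deviation of the entries of $F$ are assumptions made for illustration) mirrors the simulation setup described above.

```python
# Illustrative data-generating process and ReLU random-feature map for F.1.
import numpy as np

def make_data(n, d, rho=0.25, seed=0):
    rng = np.random.default_rng(seed)
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))  # AR(1)
    evals, evecs = np.linalg.eigh(Sigma)
    beta0 = evecs[:, -5:].sum(axis=1) / 5             # average of top-5 eigenvectors
    sigma5 = np.sqrt(5.0 / 3.0)                       # std of the t_5 distribution
    Z = rng.standard_t(df=5, size=(n, d)) / sigma5    # unit-variance t_5 entries
    X = Z @ np.linalg.cholesky(Sigma).T               # rows with covariance Sigma
    eps = rng.standard_normal(n)
    y = X @ beta0 + ((X ** 2).sum(1) - np.trace(Sigma)) / d + eps
    return X, y, beta0

def relu_random_features(X, p, seed=1):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    F = rng.normal(0.0, d ** (-0.5), size=(p, d))     # random first-layer weights
    return np.maximum(X @ F.T, 0.0)                   # ReLU features, n x p

X, y, beta0 = make_data(n=1000, d=500)
Phi = relu_random_features(X, p=250)
print(X.shape, Phi.shape)
```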
F.2 Experimental details in Section 4.3

Following a similar experimental setup to [20], we use residual networks to extract features on several computer vision datasets, both at random initialization and after pretraining. More specifically, we consider ResNet-{18, 34, 50, 101} applied to the CIFAR-{10, 100} [9], Fashion-MNIST [21], Flowers-102 [14], and Food-101 [4] datasets. All random initializations follow [8]; pretrained networks (obtained from PyTorch) were pretrained on ImageNet, and the outputs of the last pretrained layer on each dataset mentioned above were used as the embedding features $\Phi$. After obtaining the embedding features from the last layer of the neural network model, we further normalize each row of the pretrained features to have norm $\sqrt{p}$, and center the one-hot labels to have zero means. To reduce the computational burden, we only consider the first 10 one-hot labels of all datasets. For datasets with different data aspect ratios, we stratify 10% of the training samples as the training set for the CIFAR-100 dataset. The training and prediction errors are the mean squared errors on the training and test sets, respectively, aggregated over all the labels.

Table 2: Summary of pretrained features from different real datasets.
Dataset | Model | Number of train samples | Number of test samples | Number of pretrained features
Fashion-MNIST | ResNet-18 (init.) | 60000 | 10000 | 512
CIFAR-10 | ResNet-18 (pretr.) | 50000 | 10000 | 512
CIFAR-100 (subset) | ResNet-50 (pretr.) | 5000 | 10000 | 2048
Flowers-102 | ResNet-50 (pretr.) | 2040 | 6149 | 2048
Food-101 | ResNet-101 (pretr.) | 75750 | 25250 | 2048
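For the feature-extraction pipeline described above, a minimal sketch looks as follows (our own; it assumes a recent torchvision, takes the pooled features by replacing the classification head with an identity, and is not the authors' released script).

```python
# Extract last-layer embeddings from a pretrained ResNet-18 on CIFAR-10 and
# scale each feature row to norm sqrt(p), with centered one-hot labels.
import torch
import torchvision

weights = torchvision.models.ResNet18_Weights.IMAGENET1K_V1
model = torchvision.models.resnet18(weights=weights)
model.fc = torch.nn.Identity()     # drop the classification head; output is 512-dim
model.eval()

preprocess = weights.transforms()  # standard ImageNet resizing/normalization
dataset = torchvision.datasets.CIFAR10(root="./data", train=True, download=True,
                                       transform=preprocess)
loader = torch.utils.data.DataLoader(dataset, batch_size=256, shuffle=False)

feats, labels = [], []
with torch.no_grad():
    for xb, yb in loader:
        feats.append(model(xb))
        labels.append(yb)
Phi = torch.cat(feats)                                              # n x p embeddings
Phi = Phi * (Phi.shape[1] ** 0.5) / Phi.norm(dim=1, keepdim=True)   # rows to norm sqrt(p)
Y = torch.nn.functional.one_hot(torch.cat(labels)).float()
Y = Y - Y.mean(dim=0)                                               # center one-hot labels
print(Phi.shape, Y.shape)
```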
[1] Adlam, B., Levinson, J. A., and Pennington, J. (2022). A random matrix perspective on mixtures of nonlinearities in high dimensions. In International Conference on Artificial Intelligence and Statistics.
[2] Adlam, B. and Pennington, J. (2020). The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. In International Conference on Machine Learning.
[3] Bose, A. (2021). Random Matrices and Non-Commutative Probability. CRC Press.
[4] Bossard, L., Guillaumin, M., and Van Gool, L. (2014). Food-101: Mining discriminative components with random forests. In Computer Vision - ECCV 2014: 13th European Conference. Springer.
[5] Dobriban, E. and Sheng, Y. (2020). WONDER: Weighted one-shot distributed ridge regression in high dimensions. Journal of Machine Learning Research, 21(66):1-52.
[6] Dobriban, E. and Sheng, Y. (2021). Distributed linear regression by averaging. The Annals of Statistics, 49(2):918-943.
[7] Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2022). Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2):949-986.
[8] He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In International Conference on Computer Vision.
[9] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images.
[10] Lee, D., Moniri, B., Huang, X., Dobriban, E., and Hassani, H. (2023). Demystifying disagreement-on-the-line in high dimensions. In International Conference on Machine Learning.
[11] LeJeune, D., Patil, P., Javadi, H., Baraniuk, R. G., and Tibshirani, R. J. (2024). Asymptotics of the sketched pseudoinverse. SIAM Journal on Mathematics of Data Science, 6(1):199-225.
[12] Mel, G. and Pennington, J. (2021). Anisotropic random feature regression in high dimensions. In International Conference on Learning Representations.
[13] Mingo, J. A. and Speicher, R. (2017). Free Probability and Random Matrices, volume 35. Springer.
[14] Nilsback, M.-E. and Zisserman, A. (2008). Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics and Image Processing.
[15] Patil, P. and Du, J.-H. (2023). Generalized equivalences between subsampling and ridge regularization. In Advances in Neural Information Processing Systems.
[16] Patil, P., Du, J.-H., and Kuchibhotla, A. K. (2023). Bagging in overparameterized learning: Risk characterization and risk monotonization. Journal of Machine Learning Research, 24(319):1-113.
[17] Patil, P. and LeJeune, D. (2024). Asymptotically free sketched ridge ensembles: Risks, cross-validation, and tuning. In International Conference on Learning Representations.
[18] Sahraee-Ardakan, M., Emami, M., Pandit, P., Rangan, S., and Fletcher, A. K. (2022). Kernel methods and multi-layer perceptrons learn linear models in high dimensions. arXiv preprint arXiv:2201.08082.
[19] Voiculescu, D. V. (1997). Free Probability Theory. American Mathematical Society.
[20] Wei, A., Hu, W., and Steinhardt, J. (2022). More than a toy: Random matrix models predict how real-world neural representations generalize. In International Conference on Machine Learning.
[21] Xiao, H., Rasul, K., and Vollgraf, R. (2017). Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: All the claims made in the abstract are justified by both theoretical and experimental results in Sections 3 and 4 and the appendix.
Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.
2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: The assumptions are discussed and explained after their first mention (either in Section 2 or right after the theoretical result that uses them). The main limitations of the paper are discussed in the last section (Section 5).
Guidelines: The answer NA means that the paper has no limitations, while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach.
For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: The assumptions for each of the results are included in the main text, and the complete proofs for each of the results are included in the appendix. The beginning of the appendix provides an organization for the proofs of all the results mentioned in the main text.
Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced.
4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: The code and instructions for reproducing experimental results in this paper are included in the supplementary materials.
Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general,
releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example: (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We use publicly available data. The code and instructions for reproducing experimental results in this paper are included in the supplementary materials.
Guidelines: The answer NA means that the paper does not include experiments requiring code. Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.
6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: The experimental details are included in the appendix, and the source code is provided in the supplementary materials.
Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: The standard errors across multiple random seeds are included for Figure 7. Note that for the heatmaps, we only report the mean statistics because of visual constraints. However, the standard errors for the heatmaps are small enough not to impact the regularization paths indicated.
Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.). The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: The compute resources are described in the README file of the submitted code in the supplementary materials.
Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).
9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: We conform with the NeurIPS Code of Ethics.
Guidelines: The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: The paper provides a theoretical analysis and does not have immediate societal impacts.
Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The paper does not pose any risks that require safeguards.
Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.
12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: The paper correctly cites the papers of the related assets.
Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset's creators.
13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: The paper does not release new assets.
Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not involve crowdsourcing or research with human subjects.
Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not involve crowdsourcing or research with human subjects.
Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.