Published as a conference paper at ICLR 2020

RIDGE REGRESSION: STRUCTURE, CROSS-VALIDATION, AND SKETCHING

Sifan Liu, Department of Statistics, Stanford University, Stanford, CA 94305, USA. sfliu@stanford.edu
Edgar Dobriban, Department of Statistics, University of Pennsylvania, Philadelphia, PA 19104, USA. dobriban@wharton.upenn.edu

We study the following three fundamental problems about ridge regression: (1) what is the structure of the estimator? (2) how to correctly use cross-validation to choose the regularization parameter? and (3) how to accelerate computation without losing too much accuracy? We consider the three problems in a unified large-data linear model. We give a precise representation of ridge regression as a covariance matrix-dependent linear combination of the true parameter and the noise. We study the bias of K-fold cross-validation for choosing the regularization parameter, and propose a simple bias-correction. We analyze the accuracy of primal and dual sketching for ridge regression, showing they are surprisingly accurate. Our results are illustrated by simulations and by analyzing empirical data.

1 INTRODUCTION

Ridge or ℓ2-regularized regression is a widely used method for prediction and estimation when the data dimension p is large compared to the number of datapoints n. This is especially so in problems with many good features, where sparsity assumptions may not be justified. A great deal is known about ridge regression. It is Bayes optimal for any quadratic loss in a Bayesian linear model where the parameters and noise are Gaussian. The asymptotic properties of ridge have been widely studied (e.g., Tulino & Verdú, 2004; Serdobolskii, 2007; Couillet & Debbah, 2011; Dicker, 2016; Dobriban & Wager, 2018, etc.). For choosing the regularization parameter in practice, cross-validation (CV) is widely used. In addition, there is an exact shortcut (e.g., Hastie et al., 2009, p. 243), which has good consistency properties (Hastie et al., 2019). There is also a lot of work on fast approximate algorithms for ridge, e.g., using sketching methods (e.g., el Alaoui & Mahoney, 2015; Chen et al., 2015; Wang et al., 2018; Chowdhury et al., 2018, among others).

Here we seek to develop a deeper understanding of ridge regression, going beyond existing work in multiple aspects. We work in linear models under a popular asymptotic regime where n, p → ∞ at the same rate (Marchenko & Pastur, 1967; Serdobolskii, 2007; Couillet & Debbah, 2011; Yao et al., 2015). In this framework, we develop a fundamental representation for ridge regression, which shows that it is well approximated by a linear scaling of the true parameters perturbed by noise. The scaling matrices are functions of the population-level covariance of the features. As a consequence, we derive formulas for the training error and bias-variance tradeoff of ridge.

Second, we study commonly used methods for choosing the regularization parameter. Inspired by the observation that CV has a bias for estimating the error rate (e.g., Hastie et al., 2009, p. 243), we study the bias of CV for selecting the regularization parameter. We discover a surprisingly simple form for the bias, and propose a downward scaling bias correction procedure.

Third, we study the accuracy loss of a class of randomized sketching algorithms for ridge regression. These algorithms approximate the sample covariance matrix by sketching or random projection.
We show they can be surprisingly accurate, e.g., they can sometimes cut computational cost in half while only incurring 5% extra error. Even more, they can sometimes improve the MSE if a suboptimal regularization parameter is originally used.

Our work leverages recent results from asymptotic random matrix theory and free probability theory. One challenge in our analysis is to find the limit of the trace $\operatorname{tr}[(\Sigma_1 + \Sigma_2^{-1})^{-1}]/p$, where $\Sigma_1$ and $\Sigma_2$ are $p \times p$ independent sample covariance matrices of Gaussian random vectors. The calculation requires nontrivial aspects of free additive convolutions (e.g., Voiculescu et al., 1992; Nica & Speicher, 2006).

Our work is connected to prior works on ridge regression in high-dimensional statistics (Serdobolskii, 2007) and wireless communications (Tulino & Verdú, 2004; Couillet & Debbah, 2011). Among other related works, El Karoui & Kösters (2011) discuss the implications of the geometric sensitivity of random matrix theory for ridge regression, without considering our problems. El Karoui (2018) and Dicker (2016) study ridge regression estimators, but focus only on the risk for identity covariance. Hastie et al. (2019) study ridgeless regression, where the regularization parameter tends to zero.

Sketching is an increasingly popular research topic; see Vempala (2005); Halko et al. (2011); Mahoney (2011); Woodruff (2014); Drineas & Mahoney (2017) and references therein. For sketched ridge regression, Zhang et al. (2013a;b) study the dual problem in a complementary finite-sample setting, so their results are hard to compare with ours. Chen et al. (2015) propose an algorithm combining sparse embedding and the subsampled randomized Hadamard transform (SRHT), proving relative approximation bounds. Wang et al. (2017) study iterative sketching algorithms from an optimization point of view, for both the primal and the dual problems. Dobriban & Liu (2018) study sketching using asymptotic random matrix theory, but only for unregularized linear regression. Chowdhury et al. (2018) propose a data-dependent algorithm in light of the ridge leverage scores. Other related works include Sarlos (2006); Ailon & Chazelle (2006); Drineas et al. (2006; 2011); Dhillon et al. (2013); Ma et al. (2015); Raskutti & Mahoney (2016); Gonen et al. (2016); Thanei et al. (2017); Ahfock et al. (2017); Lopes et al. (2018); Huang (2018).

The structure of the paper is as follows: We state our results on representation, risk, and bias-variance tradeoff in Section 2. We study the bias of cross-validation for choosing the regularization parameter in Section 3. We study the accuracy of randomized primal and dual sketching for both orthogonal and Gaussian sketches in Section 4. We provide proofs and additional simulations in the Appendix. Code reproducing the experiments in the paper is available at https://github.com/liusf15/RidgeRegression.

2 RIDGE REGRESSION

We work in the usual linear regression model $Y = X\beta + \varepsilon$, where each row $x_i$ of $X \in \mathbb{R}^{n\times p}$ is a datapoint in $p$ dimensions, so there are $p$ features. The corresponding entry $y_i$ of $Y \in \mathbb{R}^n$ is its continuous response (or outcome). We assume mean-zero uncorrelated noise, so $E\varepsilon = 0$ and $\operatorname{Cov}[\varepsilon] = \sigma^2 I_n$. We estimate the coefficient vector $\beta \in \mathbb{R}^p$ by ridge regression, solving the optimization problem
$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{n}\|Y - X\beta\|_2^2 + \lambda \|\beta\|_2^2,$$
where $\lambda > 0$ is a regularization parameter. The solution has the closed form
$$\hat\beta = \left(X^\top X/n + \lambda I_p\right)^{-1} X^\top Y / n. \qquad (1)$$
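To make equation (1) concrete, here is a minimal sketch (ours, not taken from the paper's code release) that computes the ridge estimator directly from the closed form. The dimensions, the value of λ, and the noise level are arbitrary illustrative choices.

```python
import numpy as np

def ridge(X, Y, lam):
    """Closed-form ridge estimator (X^T X / n + lam I)^{-1} X^T Y / n, as in (1)."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ Y / n)

rng = np.random.default_rng(0)
n, p, lam, sigma = 200, 100, 0.5, 1.0            # illustrative sizes only
beta = rng.normal(size=p) / np.sqrt(p)           # signal with squared norm close to 1
X = rng.normal(size=(n, p))
Y = X @ beta + sigma * rng.normal(size=n)
beta_hat = ridge(X, Y, lam)
print("squared estimation error:", np.sum((beta_hat - beta) ** 2))
```

Solving one linear system of size p, rather than forming an explicit inverse, is the standard way to evaluate (1).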
We work in a "big data" asymptotic limit, where both the dimension $p$ and the sample size $n$ tend to infinity, and their aspect ratio converges to a constant, $p/n \to \gamma \in (0, \infty)$. Our results can be interpreted for any $n$ and $p$, using $\gamma = p/n$ as an approximation.

We recall that the empirical spectral distribution (ESD) of a $p \times p$ symmetric matrix $\Sigma$ is the distribution $\frac{1}{p}\sum_{i=1}^{p} \delta_{\lambda_i}$, where $\lambda_i$, $i = 1, \ldots, p$, are the eigenvalues of $\Sigma$, and $\delta_x$ is the point mass at $x$. This is a standard notion in random matrix theory, see e.g., Marchenko & Pastur (1967); Tulino & Verdú (2004); Couillet & Debbah (2011); Yao et al. (2015). The ESD is a convenient tool to summarize all information obtainable from the eigenvalues of a matrix. For instance, the trace of $\Sigma$ is proportional to the mean of the distribution, while the condition number is related to the range of the support. As is common, we will work in models where there is a sequence of covariance matrices $\Sigma = \Sigma_p$, and their ESDs converge in distribution to a limiting probability distribution. The results become simpler, because they depend only on the limit.

By extension, we say that the ESD of the $n \times p$ matrix $X$ is the ESD of $X^\top X/n$. We will consider some very specific models for the data, assuming it is of the form $X = U\Sigma^{1/2}$, where $U$ has iid entries of zero mean and unit variance. This means that the datapoints, i.e., the rows of $X$, have the form $x_i = \Sigma^{1/2} u_i$, $i = 1, \ldots, n$, where the $u_i$ have iid entries. Then $\Sigma$ is the true covariance matrix of the features, which is typically not observed. These types of models for the data are very common in random matrix theory, see the references mentioned above.

Under these models, it is possible to characterize precisely the deviations between the empirical covariance matrix $\hat\Sigma = n^{-1}X^\top X$ and the population covariance matrix $\Sigma$, dating back to the well-known classical Marchenko-Pastur law for eigenvalues (Marchenko & Pastur, 1967), extended to more general models and made more precise, including results for eigenvectors (see e.g., Tulino & Verdú, 2004; Couillet & Debbah, 2011; Yao et al., 2015, and references therein). This has been used to study methods for estimating the true covariance matrix, with several applications (e.g., Paul & Aue, 2014; Bun et al., 2017). More recently, such models have been used to study high-dimensional statistical learning problems, including classification and regression (e.g., Zollanvari & Genton, 2013; Dobriban & Wager, 2018). Our work falls in this line.

We start by finding a precise representation of the ridge estimator. For random vectors $u_n$, $v_n$ of growing dimension, we say $u_n$ and $v_n$ are deterministic equivalents if, for any sequence of fixed (or random and independent of $u_n, v_n$) vectors $w_n$ such that $\limsup \|w_n\|_2 < \infty$ almost surely, we have $|w_n^\top (u_n - v_n)| \to 0$ almost surely. We denote this by $u_n \asymp v_n$. Thus linear combinations of $u_n$ are well approximated by those of $v_n$. This is a somewhat non-standard definition, but it turns out that it is precisely the one we need to use prior results from random matrix theory such as those of Rubio & Mestre (2011).

We extend scalar functions $f: \mathbb{R} \to \mathbb{R}$ to matrices in the usual way by functional calculus, applying them to the eigenvalues and keeping the eigenvectors. If $M = V\Lambda V^\top$ is a spectral decomposition of $M$, then we define $f(M) := V f(\Lambda) V^\top$, where $f(\Lambda)$ is the diagonal matrix with entries $f(\Lambda_{ii})$.
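As a quick illustration of the sample-versus-population deviations discussed above, the following sketch (ours, with an arbitrary population spectrum) generates data under the model $X = U\Sigma^{1/2}$ and compares the eigenvalues of $\hat\Sigma = X^\top X/n$ with those of $\Sigma$ when $p$ is comparable to $n$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 600, 300                                  # aspect ratio gamma = p/n = 0.5
pop_eigs = np.linspace(1.0, 3.0, p)              # arbitrary population spectrum of Sigma
X = rng.normal(size=(n, p)) @ np.diag(np.sqrt(pop_eigs))   # X = U Sigma^{1/2}
sample_eigs = np.linalg.eigvalsh(X.T @ X / n)    # eigenvalues defining the ESD of X

# Even though E[X^T X / n] = Sigma, the sample spectrum spreads out:
# its extreme eigenvalues overshoot and undershoot the population ones.
print("population eigenvalue range:", pop_eigs.min(), pop_eigs.max())
print("sample eigenvalue range:    ", sample_eigs.min(), sample_eigs.max())
```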
For a fixed design matrix $X$, we can write the estimator as
$$\hat\beta = (\hat\Sigma + \lambda I_p)^{-1}\hat\Sigma\,\beta + (\hat\Sigma + \lambda I_p)^{-1}\,\frac{X^\top\varepsilon}{n}.$$
However, for a random design, we can find a representation that depends on the true covariance $\Sigma$, which may be simpler when $\Sigma$ is simple, e.g., when $\Sigma = I_p$ is isotropic.

Theorem 2.1 (Representation of ridge estimator). Suppose the data matrix has the form $X = U\Sigma^{1/2}$, where $U \in \mathbb{R}^{n\times p}$ has iid entries of zero mean, unit variance and finite $8+c$-th moment for some $c > 0$, and $\Sigma = \Sigma_p \in \mathbb{R}^{p\times p}$ is a deterministic positive definite matrix. Suppose that $n, p \to \infty$ with $p/n \to \gamma > 0$. Suppose the ESD of the sequence of $\Sigma$s converges in distribution to a probability measure with compact support bounded away from the origin. Suppose that the noise is Gaussian, and that $\beta = \beta_p$ is an arbitrary sequence of deterministic vectors such that $\limsup \|\beta\|_2 < \infty$. Then the ridge regression estimator is asymptotically equivalent to a random vector with the following representation:
$$\hat\beta(\lambda) \asymp A(\Sigma, \lambda)\,\beta + B(\Sigma, \lambda)\,\frac{\sigma Z}{p^{1/2}}.$$
Here $Z \sim \mathcal{N}(0, I_p)$ is a random vector that is stochastically dependent only on the noise $\varepsilon$, and $A, B$ are deterministic matrices defined by applying the scalar functions below to $\Sigma$:
$$A(x, \lambda) = (c_p x + \lambda)^{-2}(c_p + c_p')\,x, \qquad B(x, \lambda) = (c_p x + \lambda)^{-1} c_p x.$$
Here $c_p := c(n, p, \Sigma, \lambda)$ is the unique positive solution of the fixed point equation
$$1 - c_p = \frac{c_p}{n}\operatorname{tr}\left[\Sigma\,(c_p\Sigma + \lambda I)^{-1}\right]. \qquad (2)$$

This result gives a precise representation of the ridge regression estimator. It is a sum of two terms: the true coefficient vector $\beta$ scaled by the matrix $A(\Sigma, \lambda)$, and the noise vector $Z$ scaled by the matrix $B(\Sigma, \lambda)$. The first term captures to what extent ridge regression recovers the signal. Moreover, the noise term $Z$ is directly coupled with the noise in the original regression problem, and thus also with the estimator. The result would not hold for an independent noise vector $Z$. However, the coefficients are not fully explicit, as they depend on the unknown population covariance matrix $\Sigma$, as well as on the fixed-point variable $c_p$.

Figure 1: Ridge regression bias-variance tradeoff. Left: $\gamma = p/n = 0.2$; right: $\gamma = 2$. The data matrix $X$ has iid Gaussian entries. The coefficient vector has distribution $\beta \sim \mathcal{N}(0, I_p/p)$, while the noise is $\varepsilon \sim \mathcal{N}(0, I)$.

Some comments are in order:

1. Structure of the proof. The proof is quite non-elementary and relies on random matrix theory. Specifically, it uses the language of the recently developed calculus of deterministic equivalents (Dobriban & Sheng, 2018), and results by Rubio & Mestre (2011). A general takeaway is that for $n$ not much larger than $p$, the empirical covariance matrix $\hat\Sigma$ is not a good estimator of the true covariance matrix $\Sigma$. However, the deviation of linear functionals of $\hat\Sigma$ can be quantified. In particular, we have $(\hat\Sigma + \lambda I)^{-1} \asymp (c_p\Sigma + \lambda I)^{-1}$, in the sense that linear combinations of the entries of the two matrices are close (see the proof for more details).

2. Understanding the resolvent bias factor $c_p$. Thus, $c_p$ can be viewed as a resolvent bias factor, which tells us by what factor $\Sigma$ is multiplied when evaluating the resolvent $(\hat\Sigma + \lambda I)^{-1}$, and comparing it to its naive counterpart $(\Sigma + \lambda I)^{-1}$. It is known that $c_p$ is well defined, and this follows by a simple monotonicity argument, see Hachem et al. (2007); Rubio & Mestre (2011). Specifically, the left-hand side of (2) is decreasing in $c_p$, while the right-hand side is increasing in $c_p$. Also, $c_p'$ is the derivative of $c_p$, when viewing it as a function of $z := -\lambda$. An explicit expression is provided in the proof in Section A.1, but is not crucial right now.
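As a numerical companion to Theorem 2.1, the sketch below (ours, not the paper's code) solves the fixed-point equation (2), in the form reconstructed above, by bisection, and checks the deterministic equivalent $(\hat\Sigma+\lambda I)^{-1} \asymp (c_p\Sigma+\lambda I)^{-1}$ on its normalized trace for the simple case $\Sigma = I_p$. The helper name and parameter values are illustrative assumptions.

```python
import numpy as np

def resolvent_bias_factor(pop_eigs, n, lam, tol=1e-12):
    """Solve 1 - c = (c/n) tr[Sigma (c Sigma + lam I)^{-1}] for c by bisection.
    The left side decreases in c and the right side increases, so the root
    in (0, 1) is unique."""
    def gap(c):
        return (1.0 - c) - (c / n) * np.sum(pop_eigs / (c * pop_eigs + lam))
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if gap(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

rng = np.random.default_rng(2)
n, p, lam = 2000, 1000, 1.0
pop_eigs = np.ones(p)                            # Sigma = I_p for a simple check
c_p = resolvent_bias_factor(pop_eigs, n, lam)
X = rng.normal(size=(n, p))
emp = np.trace(np.linalg.inv(X.T @ X / n + lam * np.eye(p))) / p
print("c_p:", c_p, " deterministic:", 1.0 / (c_p + lam), " empirical:", emp)
```

For isotropic covariance the deterministic equivalent of the normalized trace is $1/(c_p+\lambda)$, so the last two printed numbers should agree closely for large $n, p$.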
Here we discuss some implications of this representation. For uncorrelated features, $\Sigma = I_p$, both $A$ and $B$ reduce to multiplication by scalars. Hence, each coordinate of the ridge regression estimator is simply a scalar multiple of the corresponding coordinate of $\beta$. One can use this to find the bias in each individual coordinate.

Training error and optimal regularization parameter. This theorem has implications for understanding the training error and the optimal regularization parameter of ridge regression. As it stands, the theorem itself only characterizes the behavior of linear combinations of the coordinates of the estimator. Thus, it can be directly applied to study the bias $E\hat\beta(\lambda) - \beta$ of the estimator. However, it cannot directly be used to study the variance, as that would require understanding quadratic functionals of the estimator. This seems to require significant advances in random matrix theory, going beyond the results of Rubio & Mestre (2011). However, we show below that with additional assumptions on the structure of the parameter $\beta$, we can derive the MSE of the estimator in other ways.

We work in a random-effects model, where the $p$-dimensional regression parameter $\beta$ is random, each coefficient has zero mean $E\beta_i = 0$, and is normalized so that $\operatorname{Var}[\beta_i] = \alpha^2/p$. This ensures that the signal strength $E\|\beta\|^2 = \alpha^2$ is fixed for any $p$. The asymptotically optimal $\lambda$ in this setting is always $\lambda^* = \gamma\sigma^2/\alpha^2$; see e.g., Tulino & Verdú (2004); Dicker (2016); Dobriban & Wager (2018). The ridge regression estimator with $\lambda^* = p\sigma^2/(n\alpha^2)$ is the posterior mean of $\beta$ when $\beta$ and $\varepsilon$ are normal random variables.

For a distribution $F_\gamma$, we define the quantities
$$\theta_i(\lambda) = \int \frac{1}{(x+\lambda)^i}\, dF_\gamma(x), \qquad i = 1, 2, \ldots.$$
These are the moments of the resolvent and its derivatives (up to constants). We use the following loss functions: the mean squared estimation error $M(\hat\beta) = E\|\hat\beta - \beta\|_2^2$, and the residual or training error $R(\hat\beta) = n^{-1} E\|Y - X\hat\beta\|_2^2$.

Theorem 2.2 (MSE and training error of ridge). Suppose $\beta$ has iid entries with $E\beta_i = 0$, $\operatorname{Var}[\beta_i] = \alpha^2/p$, $i = 1, \ldots, p$, and $\beta$ is independent of $X$ and $\varepsilon$. Suppose $X$ is an arbitrary $n\times p$ matrix depending on $n$ and $p$, and the ESD of $X$ converges weakly to a deterministic distribution $F_\gamma$ as $n, p \to \infty$ and $p/n \to \gamma$. Then the asymptotic MSE and residual error of the ridge regression estimator $\hat\beta(\lambda)$ have the form
$$\lim_{n\to\infty} M(\hat\beta(\lambda)) = \alpha^2\lambda^2\theta_2 + \gamma\sigma^2[\theta_1 - \lambda\theta_2], \qquad (3)$$
$$\lim_{n\to\infty} R(\hat\beta(\lambda)) = \alpha^2\lambda^2[\theta_1 - \lambda\theta_2] + \sigma^2\left[1 - \gamma(1 - \lambda^2\theta_2)\right], \qquad (4)$$
where $\theta_i = \theta_i(\lambda)$ are computed under $F_\gamma$.

Bias-variance tradeoff. Building on this, we can also study the bias-variance tradeoff of ridge regression. Qualitatively, large $\lambda$ leads to more regularization and decreases the variance; however, it also increases the bias. Our theory allows us to find explicit formulas for the bias and variance as a function of $\lambda$. See Figure 1 for a plot and Section A.3 for the details. As far as we know, this is one of the few examples of high-dimensional asymptotic problems where the precise form of the bias and variance can be evaluated.

Bias-variance tradeoff at the optimal $\lambda^* = \gamma\sigma^2/\alpha^2$ (see Figure 6). This can be viewed as the pure effect of dimensionality on the problem, keeping all other parameters fixed, and has intriguing properties. The variance first increases, then decreases with $\gamma$. In the classical low-dimensional case, most of the risk is due to variance, while in the modern high-dimensional case, most of it is due to bias. This is consistent with other phenomena in proportional-limit asymptotics, e.g., that the map between population and sample eigenvalue distributions is asymptotically deterministic (Marchenko & Pastur, 1967).
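The limits in Theorem 2.2 are easy to check numerically. The sketch below (ours; isotropic design and arbitrary parameter values) approximates $\theta_1, \theta_2$ by the eigenvalues of one large design matrix and compares the predicted MSE (3) with a Monte Carlo average over draws of $\beta$ and $\varepsilon$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam, alpha, sigma = 2000, 1000, 0.7, 2.0, 1.0
gamma = p / n
X = rng.normal(size=(n, p))
eigs = np.linalg.eigvalsh(X.T @ X / n)                 # ESD of X approximates F_gamma
theta1 = np.mean(1.0 / (eigs + lam))
theta2 = np.mean(1.0 / (eigs + lam) ** 2)
mse_theory = alpha**2 * lam**2 * theta2 + gamma * sigma**2 * (theta1 - lam * theta2)

mse_mc, reps = 0.0, 20
for _ in range(reps):
    beta = rng.normal(scale=alpha / np.sqrt(p), size=p)
    Y = X @ beta + sigma * rng.normal(size=n)
    beta_hat = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ Y / n)
    mse_mc += np.sum((beta_hat - beta) ** 2) / reps
print("theory:", mse_theory, " simulation:", mse_mc)   # should match for large n, p
```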
Future applications. This fundamental representation may have applications to important statistical inference questions. For instance, inference on the regression coefficient $\beta$ and the noise variance $\sigma^2$ are important and challenging problems. Can we use our representation to develop debiasing techniques for this task? This will be interesting to explore in future work.

3 CROSS-VALIDATION

How can we choose the regularization parameter? In practice, cross-validation (CV) is the most popular approach. However, it is well known that CV has a bias for estimating the error rate, because it uses a smaller number of samples than the full data size (e.g., Hastie et al., 2009, p. 243). In this section, we study related questions, proposing a bias-correction method for the optimal regularization parameter. This is closely connected to the previous section, because it relies on the same random-effects theoretical framework. In fact, our conclusions here are a direct consequence of the properties of that framework.

Setup. Suppose we split the $n$ datapoints (samples) into $K$ equal-sized subsets, each containing $n_0 = n/K$ samples. We use the $k$-th subset $(X_k, Y_k)$ as the validation set and the other $K-1$ subsets $(X_{-k}, Y_{-k})$, with total sample size $n_1 = (K-1)n/K$, as the training set. We find the ridge regression estimator $\hat\beta_{-k}$, i.e.,
$$\hat\beta_{-k}(\lambda) = \left(X_{-k}^\top X_{-k} + n_1\lambda I_p\right)^{-1} X_{-k}^\top Y_{-k}.$$
The expected cross-validation error is, for isotropic covariance, i.e., $\Sigma = I$,
$$CV(\lambda) = E\,\widehat{CV}(\lambda) = E\left[\frac{1}{K}\sum_{k=1}^{K} \|Y_k - X_k\hat\beta_{-k}(\lambda)\|_2^2 / n_0\right] = \sigma^2 + E\left[\|\hat\beta_{-k} - \beta\|_2^2\right].$$

Bias in CV. When $n, p$ tend to infinity so that $p/n \to \gamma > 0$, and in the random-effects model with $E\beta_i = 0$, $\operatorname{Var}[\beta_i] = \alpha^2/p$ described above, the minimizer of $CV(\lambda)$ tends to $\lambda^*_{-k} = \tilde\gamma\sigma^2/\alpha^2$, where $\tilde\gamma$ is the limiting aspect ratio of $X_{-k}$, i.e., $\tilde\gamma = \gamma K/(K-1)$. Since the aspect ratios of $X_{-k}$ and $X$ differ, the limiting minimizer of the cross-validation estimator of the test error is biased for the limiting minimizer of the actual test error, which is $\lambda^* = \gamma\sigma^2/\alpha^2$.

Bias-correction. Suppose we have found $\hat\lambda^*_{-k}$, the minimizer of $\widehat{CV}(\lambda)$. Afterwards, we usually refit ridge regression on the entire dataset, i.e., find
$$\hat\beta(\hat\lambda^*) = (X^\top X + \hat\lambda^* n I)^{-1} X^\top Y.$$
Based on our bias calculation, we propose to use the bias-corrected parameter
$$\hat\lambda^* := \hat\lambda^*_{-k}\cdot\frac{K-1}{K}.$$
So if we use 5 folds, we should multiply the CV-optimal $\lambda$ by 0.8. We find it surprising that this theoretically justified bias-correction does not depend on any unknown parameters, such as $\beta$, $\alpha^2$, $\sigma^2$. While the bias of CV is widely known, we are not aware that this bias-correction for the regularization parameter has been proposed before.
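A minimal simulation sketch of the procedure (ours; the λ grid, the number of folds, and the problem sizes are arbitrary choices) is given below: pick the CV-optimal λ over a grid, then multiply it by $(K-1)/K$ before refitting on the full data.

```python
import numpy as np

def kfold_cv_lambda(X, Y, lams, K=5, rng=None):
    """Select lambda by K-fold CV over a grid, and return both the raw minimizer
    and the (K-1)/K bias-corrected value proposed above."""
    n, p = X.shape
    rng = rng if rng is not None else np.random.default_rng()
    folds = np.array_split(rng.permutation(n), K)
    cv_err = np.zeros(len(lams))
    for k in range(K):
        test = folds[k]
        train = np.concatenate(folds[:k] + folds[k + 1:])
        G = X[train].T @ X[train] / len(train)       # reused across the lambda grid
        b_xy = X[train].T @ Y[train] / len(train)
        for j, lam in enumerate(lams):
            b = np.linalg.solve(G + lam * np.eye(p), b_xy)
            cv_err[j] += np.mean((Y[test] - X[test] @ b) ** 2) / K
    lam_cv = lams[np.argmin(cv_err)]
    return lam_cv, lam_cv * (K - 1) / K              # raw and bias-corrected choices

rng = np.random.default_rng(4)
n, p, alpha, sigma = 1000, 700, 1.0, 1.0
beta = rng.normal(scale=alpha / np.sqrt(p), size=p)
X = rng.normal(size=(n, p))
Y = X @ beta + sigma * rng.normal(size=n)
lam_cv, lam_debiased = kfold_cv_lambda(X, Y, np.linspace(0.1, 2.5, 30), K=5, rng=rng)
print(lam_cv, lam_debiased, "target lambda* = gamma*sigma^2/alpha^2 =", p / n)
```

The debiased value should track the target $\gamma\sigma^2/\alpha^2$ more closely than the raw CV minimizer, which tends to be too large.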
Figure 2: Left: Cross-validation on the Million Song Dataset (MSD; Bertin-Mahieux et al., 2011). For the error bars, we take $n = 1000$, $p = 90$, $K = 5$, and average over 90 different sub-datasets. For the test error, we train on 1000 training datapoints and evaluate on 9000 test datapoints. The debiased $\lambda$ reduces the test error by 0.00024, and the minimal test error is 0.8480. Right: Cross-validation on the flights dataset (Wickham, 2018). For the error bars, we take $n = 300$, $p = 21$, $K = 5$, and average over 180 different sub-datasets. For the test error, we train on 300 datapoints and evaluate on 27000 test datapoints. The debiased $\lambda$ reduces the test error by 0.0022, and the minimal test error is 0.1353.

Numerical examples. Figure 2 shows on two empirical data examples that the debiased estimator gets closer to the optimal $\lambda$ than the original minimizer of the CV. However, in these cases it does not significantly improve the test error. Simulation results in Section A.4 also show that the bias-correction correctly shrinks the regularization parameter and decreases the test error. We also consider examples where $p \gg n$ (i.e., $\gamma \gg 1$), because this is a setting where it is known that the bias of CV can be large (Tibshirani & Tibshirani, 2009). However, in this case, we do not see a significant improvement.

Extensions. The same bias-correction idea also applies to train-test validation. In addition, there is a special fast shortcut for leave-one-out cross-validation in ridge regression (e.g., Hastie et al., 2009, p. 243), which has the same cost as one ridge regression. Its minimizer converges to $\lambda^*$ (Hastie et al., 2019). However, we think that the bias-correction idea is still valuable, as the idea applies beyond ridge regression: CV selects regularization parameters that are too large. See Section A.5 for more details and experiments comparing different ways of choosing the regularization parameter.

4 SKETCHING

A final important question about ridge regression is how to compute it in practice. In this section, we study that problem in the same high-dimensional model used throughout our paper. The computational complexity of ridge regression, $O(np\min(n,p))$, can be prohibitive in modern large-scale data analysis. Sketching is a popular approach to reducing the time complexity by reducing the sample size and/or dimension, usually by random projection or sampling (e.g., Mahoney, 2011; Woodruff, 2014; Drineas & Mahoney, 2016). Specifically, primal sketching approximates the sample covariance matrix $X^\top X/n$ by $X^\top L^\top L X/n$, where $L$ is an $m\times n$ sketching matrix and $m < n$. If $L$ is chosen as a suitable random matrix, then this can still approximate the original sample covariance matrix. The primal sketched ridge regression estimator is then
$$\hat\beta_p = \left(X^\top L^\top L X/n + \lambda I_p\right)^{-1} X^\top Y / n. \qquad (5)$$
Dual sketching reduces $p$ instead. An equivalent expression for ridge regression is $\hat\beta = n^{-1}X^\top\left(XX^\top/n + \lambda I_n\right)^{-1} Y$. Dual sketched ridge regression reduces the computation cost of the Gram matrix $XX^\top$, approximating it by $XRR^\top X^\top$ for another sketching matrix $R \in \mathbb{R}^{p\times d}$ ($d < p$), so
$$\hat\beta_d = X^\top\left(XRR^\top X^\top/n + \lambda I_n\right)^{-1} Y / n. \qquad (6)$$
The sketching matrices $R$ and $L$ are usually chosen as random matrices with iid entries (e.g., Gaussian ones) or as orthogonal matrices. In this section, we study the asymptotic MSE for both orthogonal (Section 4.1) and Gaussian sketching (Section 4.2). We also mention full sketching, which performs ridge after projecting down both $X$ and $Y$; in Section A.11 we find its MSE. However, the other two methods have better tradeoffs, and we can empirically get better results for the same computational cost.
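Here is a small sketch (ours, not the paper's implementation) of the estimators (5) and (6), using the simplest orthogonal sketches, namely subsampling matrices: $L$ keeps $m$ of the $n$ rows and $R$ keeps $d$ of the $p$ columns. All sizes below are illustrative.

```python
import numpy as np

def primal_sketched_ridge(X, Y, lam, m, rng):
    """Estimator (5): replace X^T X / n by X^T L^T L X / n, with L a row-subsampling
    (hence orthogonal, L L^T = I_m) m x n sketching matrix."""
    n, p = X.shape
    rows = rng.choice(n, size=m, replace=False)
    cov = X[rows].T @ X[rows] / n
    return np.linalg.solve(cov + lam * np.eye(p), X.T @ Y / n)

def dual_sketched_ridge(X, Y, lam, d, rng):
    """Estimator (6): replace the Gram matrix X X^T by X R R^T X^T, with R a
    column-subsampling p x d sketching matrix."""
    n, p = X.shape
    cols = rng.choice(p, size=d, replace=False)
    gram = X[:, cols] @ X[:, cols].T / n
    return X.T @ np.linalg.solve(gram + lam * np.eye(n), Y) / n

rng = np.random.default_rng(5)
n, p, lam = 600, 300, 1.0
beta = rng.normal(scale=1.0 / np.sqrt(p), size=p)
X = rng.normal(size=(n, p))
Y = X @ beta + rng.normal(size=n)
bp = primal_sketched_ridge(X, Y, lam, m=n // 2, rng=rng)   # half the sample size
bd = dual_sketched_ridge(X, Y, lam, d=p // 2, rng=rng)     # half the dimension
print(np.sum((bp - beta) ** 2), np.sum((bd - beta) ** 2))
```

Primal sketching shrinks the cost of forming the $p\times p$ covariance; dual sketching shrinks the cost of the $n\times n$ Gram matrix, which is the relevant saving when $p > n$.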
4.1 ORTHOGONAL SKETCHING

First we consider primal sketching with orthogonal projections. These can be implemented by subsampling, Haar-distributed matrices, or subsampled randomized Hadamard transforms (Sarlos, 2006). We recall that the standard Marchenko-Pastur (MP) law is the probability distribution which is the limit of the ESD of $X^\top X/n$ when the $n\times p$ matrix $X$ has iid standard Gaussian entries and $n, p \to\infty$ so that $p/n \to \gamma > 0$; it has an explicit density (Marchenko & Pastur, 1967; Bai & Silverstein, 2010).

Theorem 4.1 (Primal orthogonal sketching). Suppose $\beta$ has iid entries with $E\beta_i = 0$, $\operatorname{Var}[\beta_i] = \alpha^2/p$, $i = 1, \ldots, p$, and $\beta$ is independent of $X$ and $\varepsilon$. Suppose $X$ has iid standard normal entries. We compute primal sketched ridge regression (5) with an $m\times n$ orthogonal matrix $L$ ($m < n$, $LL^\top = I_m$). Let $n$, $p$ and $m$ tend to infinity with $p/n \to \gamma \in (0, \infty)$ and $m/n \to \xi \in (0, 1)$. Then the MSE of $\hat\beta_p(\lambda)$ has the limit
$$M(\lambda) = \alpha^2\left[(\lambda + \xi - 1)^2 + \gamma(1-\xi)\right]\frac{\theta_2\!\left(\tfrac{\gamma}{\xi}, \tfrac{\lambda}{\xi}\right)}{\xi^2} + \gamma\sigma^2\,\frac{\xi\,\theta_1\!\left(\tfrac{\gamma}{\xi}, \tfrac{\lambda}{\xi}\right) - (\lambda + \xi - 1)\,\theta_2\!\left(\tfrac{\gamma}{\xi}, \tfrac{\lambda}{\xi}\right)}{\xi^2}, \qquad (7)$$
where $\theta_i(\gamma, \lambda) = \int (x+\lambda)^{-i}\, dF_\gamma(x)$ and $F_\gamma$ is the standard Marchenko-Pastur law with aspect ratio $\gamma$.

Structure of the proof. The proof is in Section A.6, with explicit formulas in Section A.6.1. The $\theta_i$ are related to the resolvent of the MP law and its derivatives. In the proof, we decompose the MSE as the sum of variance and squared bias, both of which further reduce to the traces of certain random matrices, whose limits are determined by the MP law $F_\gamma$ and $\lambda$. The two terms on the RHS of Equation (7) are the limits of the squared bias and variance, respectively. There is an additional key step in the proof, which introduces the orthogonal complement $L_1$ of the matrix $L$, such that $L^\top L + L_1^\top L_1 = I_n$; this leads to some Gaussian random variables appearing in the proof, and simplifies calculations.

Figure 3: Primal orthogonal sketching with $n = 500$, $\gamma = 5$, $\lambda = 1.5$, $\alpha = 3$, $\sigma = 1$. Left: MSE of primal sketching normalized by the MSE of ridge regression. The error bar is the standard deviation over 10 repetitions. Right: Bias and variance of primal sketching normalized by the bias and variance of ridge regression, respectively.

Simulations. A simulation in Figure 3 (left) shows a good match with our theory. It also shows that sketching does not increase the MSE too much. In this case, by reducing the sample size to half the original one, we only increase the MSE by a factor of 1.05. This shows sketching can be very effective. We also see in Figure 3 (right) that the variance is compromised much more than the bias.

Robustness to the tuning parameter. The reader may wonder how strongly this depends on the choice of the regularization parameter $\lambda$. Perhaps ridge regression works poorly with this $\lambda$, so sketching cannot worsen it too much? What happens if we take the optimal $\lambda$ instead of a fixed one? In experiments in Section A.12 we show that the behavior is quite robust to the choice of regularization parameter.

The next theorem states a result for dual orthogonal sketching.

Theorem 4.2 (Dual orthogonal sketching). Under the conditions of Theorem 4.1, we compute dual sketched ridge regression with an orthogonal $p\times d$ sketching matrix $R$ ($d \le p$, $R^\top R = I_d$). Let $n$, $p$ and $d$ go to infinity with $p/n \to \gamma \in (0, \infty)$ and $d/n \to \zeta \in (0, \gamma)$. Then the MSE of $\hat\beta_d(\lambda)$ has the limit
$$\frac{\alpha^2}{\gamma}\left[\gamma - 1 + (\lambda - \gamma + \zeta)^2\,\bar\theta_2(\zeta, \lambda) + (\gamma - \zeta)\,\bar\theta_1^2(\zeta, \lambda)\right] + \sigma^2\left[\bar\theta_1(\zeta, \lambda) - (\lambda + \zeta - \gamma)\,\bar\theta_2(\zeta, \lambda)\right],$$
where $\bar\theta_i(\zeta, \lambda) = (1-\zeta)/\lambda^i + \zeta\int (x+\lambda)^{-i}\, dF_\zeta(x)$, and $F_\zeta$ is the standard Marchenko-Pastur law.

Proof structure and simulations. The proof in Section A.7 follows a path similar to the previous one. Here the $\bar\theta_i$ arise from the companion Stieltjes transform of the MP law. The simulation results shown in Figure 11 agree well with our theory. They are similar to the ones before: sketching has favorable properties, and the bias increases less than the variance.

Optimal tuning parameters. For both primal and dual sketching, the optimal regularization parameter minimizing the MSE seems analytically intractable. Instead, we use a numerical approach in our experiments, based on a binary search. Since this is a one-dimensional problem, there are no numerical issues. See Figure 13 in Section A.12.3.
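The limiting formulas above are straightforward to evaluate numerically. The sketch below (ours) approximates the Marchenko-Pastur integrals by the eigenvalues of one large simulated matrix, evaluates the primal-sketching MSE of Theorem 4.1 as reconstructed above, and locates the minimizing λ by a simple grid search (the paper uses a binary search); treat the output as an approximation under these assumptions.

```python
import numpy as np

def mp_eigs(gamma, p=2000, seed=0):
    """Eigenvalues whose distribution approximates the standard MP law F_gamma."""
    rng = np.random.default_rng(seed)
    n = max(int(p / gamma), 1)
    Z = rng.normal(size=(n, p))
    return np.linalg.eigvalsh(Z.T @ Z / n)

def primal_sketch_mse(lam, gamma, xi, alpha, sigma, eigs):
    """Limiting MSE of primal orthogonal sketching (Theorem 4.1), with
    theta_i(gamma/xi, lam/xi) approximated by averages over `eigs`."""
    t1 = np.mean(1.0 / (eigs + lam / xi))
    t2 = np.mean(1.0 / (eigs + lam / xi) ** 2)
    bias2 = alpha**2 * ((lam + xi - 1) ** 2 + gamma * (1 - xi)) * t2 / xi**2
    var = gamma * sigma**2 * (xi * t1 - (lam + xi - 1) * t2) / xi**2
    return bias2 + var

gamma, xi, alpha, sigma = 5.0, 0.5, 3.0, 1.0      # illustrative parameter values
eigs = mp_eigs(gamma / xi)                        # ESD with aspect ratio gamma/xi
grid = np.linspace(0.05, 10.0, 200)
mse = [primal_sketch_mse(lam, gamma, xi, alpha, sigma, eigs) for lam in grid]
print("approximately optimal lambda:", grid[int(np.argmin(mse))])
```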
Figure 4: Left: Ratio of the optimal MSE of marginal regression to that of optimally tuned ridge regression, for three values of $\gamma = p/n$, as a function of the SNR $\alpha^2/\sigma^2$. Right: Gaussian dual sketch when there is no noise; $\gamma = 0.4$, $\alpha = 1$, $\lambda = 1$ (both for the original and the sketched estimator). Standard error over 50 experiments.

4.1.1 EXTREME PROJECTION: MARGINAL REGRESSION

It is of special interest to investigate extreme projections, where the sketching dimension is much reduced compared to the sample size, so $m \ll n$. This corresponds to $\xi = 0$. This can also be viewed as a scaled marginal regression estimator, i.e., $\hat\beta \propto X^\top Y$. For dual sketching, the same case can be recovered with $\zeta = 0$. Another interest of studying this special case is that the formula for the MSE simplifies a lot.

Theorem 4.3 (Marginal regression). Under the same assumptions as in Theorem 4.1, let $\xi = 0$. Then the MSE takes the form
$$M(\lambda) = \left[\alpha^2\left((\lambda - 1)^2 + \gamma\right) + \sigma^2\gamma\right]/\lambda^2.$$
Moreover, the optimal $\lambda$ that minimizes this equals $\lambda^* = \gamma\sigma^2/\alpha^2 + 1 + \gamma$, and the optimal MSE is
$$M(\lambda^*) = \alpha^2\left(1 - \frac{\alpha^2}{\alpha^2(1+\gamma) + \gamma\sigma^2}\right).$$

The proof is in Section A.8. When is the optimal MSE of marginal regression small? Compared to the MSE of the zero estimator, $\alpha^2$, it is smaller by $\alpha^2/[\gamma(\sigma^2/\alpha^2 + 1) + 1]$, which is small when $\gamma(\sigma^2/\alpha^2 + 1) + 1$ is large. In Figure 4 (left), we compare marginal and ridge regression for different aspect ratios and SNRs. When the signal-to-noise ratio (SNR) $\alpha^2/\sigma^2$ is small or the aspect ratio $\gamma$ is large, marginal regression does not increase the MSE much. As a concrete example, if we take $\alpha^2 = \sigma^2 = 1$ and $\gamma = 0.7$, the marginal MSE is $1 - 1/2.4 \approx 0.58$. The optimal ridge MSE is about 0.52, so their ratio is only ca. $0.58/0.52 \approx 1.1$. It seems quite surprising that a simple-minded method like marginal regression can work so well. However, the reason is that when the SNR is small, we cannot expect ridge regression to have good performance. Large $\gamma$ can also be interpreted as small SNR, where ridge regression works poorly and sketching does not harm performance too much.

4.2 GAUSSIAN SKETCHING

In this section, we study Gaussian sketching. The following theorem states the bias of dual Gaussian sketching. The bias is enough to characterize the performance in the high-SNR regime where $\alpha/\sigma \to \infty$, and we discuss the extension to low SNR after the proof.

Theorem 4.4 (Bias of dual Gaussian sketch). Suppose $X$ is an $n\times p$ standard Gaussian random matrix. Suppose also that $R$ is a $p\times d$ matrix with i.i.d. $\mathcal{N}(0, 1/d)$ entries. Then the bias of the dual sketch has the expression
$$\mathrm{Bias}^2(\hat\beta_d) = \alpha^2 + \frac{\alpha^2}{\gamma}\left[m'(z) - 2m(z)\right]\Big|_{z=0},$$
where $m$ is a function described below, and $m'(z)$ denotes the derivative of $m$ with respect to $z$. Below, we use the branch of the square root with positive imaginary part. The function $m$ is characterized by its inverse function, which has the explicit formula
$$m^{-1}(z) = \frac{1}{1 + z/\zeta} - \frac{\gamma + 1 - \sqrt{(\gamma-1)^2 + 4\lambda z}}{2z}$$
for complex $z$ with positive imaginary part.

About the proof. The proof is in Section A.9. We mention that the same result holds when the matrices involved have iid non-Gaussian entries, but the proof is more technical. The current proof is based on free probability theory (e.g., Voiculescu et al., 1992; Hiai & Petz, 2006; Couillet & Debbah, 2011). The function $m$ is the Stieltjes transform of the free additive convolution of a standard MP law $F_{1/\xi}$ and a scaled inverse MP law $\frac{\lambda}{\gamma}F^{-1}_{1/\gamma}$ (see the proof).

Numerics. To evaluate the formula, we note that $m^{-1}(m(0)) = 0$, so $m(0)$ is a root of $m^{-1}$.
Also, dm(0)/dz equals 1/(dm 1(y)/dy|y=m(0)), the reciprocal of the derivative of m 1 evaluated at m(0). We use binary search to find the numerical solution. The theoretical result agrees with the simulation quite well, see Figure 4. Somewhat unexpectedly, the MSE of dual sketching can be below the MSE of ridge regression, see Figure 4. This can happen when the original regularization parameter is suboptimal. As d grows, the MSE of Gaussian dual sketching converges to that of ridge regression. We have also found the bias of primal Gaussian sketching. However, stating the result requires free probability theory, and so we present it in the Appendix, see Theorem A.1. To further validate our results, we present additional simulations in Sec. A.12, for both fixed and optimal regularization parameters after sketching. A detailed study of the computational cost for sketching in Sec. A.13 concludes, as expected, that primal sketching can reduce cost when p < n, while dual sketching can reduce it when p > n; and also provides a more detailed analysis. ACKNOWLEDGMENTS The authors thank Ken Clarkson for helpful discussions and for providing the reference Chen et al. (2015). ED was partially supported by NSF BIGDATA grant IIS 1837992. SL was partially supported by a Tsinghua University Summer Research award. A version of our manuscript is available on arxiv at https://arxiv.org/abs/1910.02373. Daniel Ahfock, William J Astle, and Sylvia Richardson. Statistical properties of sketching algorithms. ar Xiv preprint ar Xiv:1706.03665, 2017. Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast johnson-lindenstrauss transform. In Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, pp. 557 563. ACM, 2006. Greg W Anderson, Alice Guionnet, and Ofer Zeitouni. An Introduction to Random Matrices. Number 118. Cambridge University Press, 2010. Theodore W Anderson. An Introduction to Multivariate Statistical Analysis. Wiley New York, 2003. Zhidong Bai and Jack W Silverstein. Spectral analysis of large dimensional random matrices. Springer Series in Statistics. Springer, New York, 2nd edition, 2010. Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 2011. Jo el Bun, Jean-Philippe Bouchaud, and Marc Potters. Cleaning large correlation matrices: tools from random matrix theory. Physics Reports, 666:1 109, 2017. Shouyuan Chen, Yang Liu, Michael R Lyu, Irwin King, and Shengyu Zhang. Fast relative-error approximation algorithm for ridge regression. In UAI, pp. 201 210, 2015. Agniva Chowdhury, Jiasen Yang, and Petros Drineas. An iterative, sketching-based framework for ridge regression. In International Conference on Machine Learning, pp. 988 997, 2018. Romain Couillet and Merouane Debbah. Random Matrix Methods for Wireless Communications. Cambridge University Press, 2011. Paramveer Dhillon, Yichao Lu, Dean P Foster, and Lyle Ungar. New subsampling algorithms for fast least squares regression. In Advances in neural information processing systems, pp. 360 368, 2013. Lee H Dicker. Ridge regression and asymptotic minimax estimation over spheres of growing dimension. Bernoulli, 22(1):1 37, 2016. Edgar Dobriban and Sifan Liu. A new theory for sketching in linear regression. ar Xiv preprint ar Xiv:1810.06089, Neur IPS 2019, 2018. Edgar Dobriban and Yue Sheng. Distributed linear regression by averaging. 
ar Xiv preprint arxiv:1810.00412, 2018. Edgar Dobriban and Yue Sheng. One-shot distributed ridge regression in high dimensions. ar Xiv preprint ar Xiv:1903.09321, 2019. Edgar Dobriban and Stefan Wager. High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247 279, 2018. Published as a conference paper at ICLR 2020 Petros Drineas and Michael W Mahoney. Rand NLA: randomized numerical linear algebra. Communications of the ACM, 59(6):80 90, 2016. Petros Drineas and Michael W Mahoney. Lectures on randomized numerical linear algebra. ar Xiv preprint ar Xiv:1712.08880, 2017. Petros Drineas, Michael W Mahoney, and S Muthukrishnan. Sampling algorithms for l 2 regression and applications. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pp. 1127 1136. Society for Industrial and Applied Mathematics, 2006. Petros Drineas, Michael W Mahoney, S Muthukrishnan, and Tam as Sarl os. Faster least squares approximation. Numerische mathematik, 117(2):219 249, 2011. Ahmed el Alaoui and Michael W Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In Advances in Neural Information Processing Systems, pp. 775 783, 2015. Noureddine El Karoui. On the impact of predictor geometry on the performance on high-dimensional ridgeregularized generalized robust regression estimators. Probability Theory and Related Fields, 170(1-2):95 175, 2018. Noureddine El Karoui and Holger K osters. Geometric sensitivity of random matrix results: consequences for shrinkage estimators of covariance and related statistical methods. ar Xiv preprint ar Xiv:1105.1404, 2011. Alon Gonen, Francesco Orabona, and Shai Shalev-Shwartz. Solving ridge regression using sketched preconditioned svrg. In International Conference on Machine Learning, pp. 1397 1405, 2016. Walid Hachem, Philippe Loubaton, and Jamal Najim. Deterministic equivalents for certain functionals of large random matrices. The Annals of Applied Probability, 17(3):875 930, 2007. Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217 288, 2011. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning. Springer series in statistics, 2009. Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. ar Xiv preprint ar Xiv:1903.08560, 2019. Fumio Hiai and D enes Petz. The semicircle law, free random variables and entropy. Number 77. American Mathematical Soc., 2006. Zengfeng Huang. Near optimal frequent directions for sketching dense and sparse matrices. In International Conference on Machine Learning, pp. 2053 2062, 2018. Miles E Lopes, Shusen Wang, and Michael W Mahoney. Error estimation for randomized least-squares algorithms via the bootstrap. ar Xiv preprint ar Xiv:1803.08021, 2018. Ping Ma, Michael W Mahoney, and Bin Yu. A statistical perspective on algorithmic leveraging. The Journal of Machine Learning Research, 16(1):861 911, 2015. Michael W Mahoney. Randomized algorithms for matrices and data. Foundations and Trends R in Machine Learning, 3(2):123 224, 2011. Vladimir A Marchenko and Leonid A Pastur. Distribution of eigenvalues for some sets of random matrices. Mat. Sb., 114(4):507 536, 1967. Robb J Muirhead. Aspects of multivariate statistical theory, volume 197. John Wiley & Sons, 2009. 
Alexandru Nica and Roland Speicher. Lectures on the combinatorics of free probability, volume 13. Cambridge University Press, 2006. Debashis Paul and Alexander Aue. Random matrix theory in statistics: A review. Journal of Statistical Planning and Inference, 150:1 29, 2014. Garvesh Raskutti and Michael W Mahoney. A statistical perspective on randomized sketching for ordinary least-squares. The Journal of Machine Learning Research, 17(1):7508 7538, 2016. Francisco Rubio and Xavier Mestre. Spectral convergence for a general class of random matrices. Statistics & Probability Letters, 81(5):592 602, 2011. Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In Foundations of Computer Science, 2006. FOCS 06. 47th Annual IEEE Symposium on, pp. 143 152. IEEE, 2006. Vadim Ivanovich Serdobolskii. Multiparametric Statistics. Elsevier, 2007. Gian-Andrea Thanei, Christina Heinze, and Nicolai Meinshausen. Random projections for large-scale regression. In Big and complex data analysis, pp. 51 68. Springer, 2017. Ryan J Tibshirani and Robert Tibshirani. A bias correction for the minimum error rate in cross-validation. The Annals of Applied Statistics, pp. 822 829, 2009. Antonio M Tulino and Sergio Verd u. Random matrix theory and wireless communications. Communications and Information theory, 1(1):1 182, 2004. Santosh S Vempala. The random projection method, volume 65. American Mathematical Soc., 2005. Dan V Voiculescu, Ken J Dykema, and Alexandru Nica. Free random variables. Number 1. American Mathematical Soc., 1992. Jialei Wang, Jason D Lee, Mehrdad Mahdavi, Mladen Kolar, and Nathan Srebro. Sketching meets random projection in the dual: A provable recovery algorithm for big and high-dimensional data. Electronic Journal of Statistics, 11(2):4896 4944, 2017. Shusen Wang, Alex Gittens, and Michael W Mahoney. Sketched ridge regression: Optimization perspective, statistical perspective, and model averaging. Journal of Machine Learning Research, 18:1 50, 2018. Published as a conference paper at ICLR 2020 Hadley Wickham. nycflights13: Flights that Departed NYC in 2013, 2018. URL https://CRAN. R-project.org/package=nycflights13. R package version 1.0.0. David P Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends R in Theoretical Computer Science, 10(1 2):1 157, 2014. Jianfeng Yao, Zhidong Bai, and Shurong Zheng. Large Sample Covariance Matrices and High-Dimensional Data Analysis. Cambridge University Press, New York, 2015. Lijun Zhang, Mehrdad Mahdavi, and Rong Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, pp. 980 988, 2013a. Lijun Zhang, Mehrdad Mahdavi, Rong Jin, Tianbao Yang, and Shenghuo Zhu. Recovering the optimal solution by dual random projection. In Conference on Learning Theory, pp. 135 157, 2013b. Amin Zollanvari and Marc G Genton. On Kolmogorov asymptotics of estimators of the misclassification error rate in linear discriminant analysis. Sankhya A, 75(2):300 326, 2013. A.1 PROOF OF THEOREM 2.1 If p/n γ and the spectral distribution of Σ converges to H, we have by the general Marchenko Pastur (MP) theorem of Rubio and Mestre (Rubio & Mestre, 2011), that (bΣ + λI) 1 (cpΣ + λI) 1, where cp := c(n, p, Σ, λ) is the unique positive solution of the fixed point equation n tr Σ(cpΣ + λI) 1 . 
Here, using the terminology of the calculus of deterministic equivalents (Dobriban & Sheng, 2018), two sequences of (not necessarily symmetric) n n matrices An, Bn of growing dimensions are equivalent, and we write An Bn if limn tr [Cn(An Bn)] = 0 almost surely, for any sequence Cn of (not necessarily symmetric) n n deterministic matrices with bounded trace norm, i.e., such that lim sup Cn tr < (Dobriban & Sheng, 2018). Informally, linear combinations of the entries of An can be approximated by the entries of Bn. We start with ˆβ = X X/n + λIp 1 X Y/n = X X/n + λIp 1 X (Xβ + ε) = (bΣ + λIp) 1bΣβ + (bΣ + λIp) 1 X ε Then, by the general MP law written in the language of the calculus of deterministic equivalents (bΣ + λIp) 1bΣ = Ip λ(bΣ + λIp) 1 Ip λ(cpΣ + λI) 1 = cpΣ(cpΣ + λI) 1. By the definition of equivalence for vectors, (bΣ + λIp) 1bΣβ cpΣ(cpΣ + λI) 1β. We note a subtle point here. The rank of the matrix M := (bΣ + λIp) 1bΣ is at most n, and so it is not a full rank matrix when n < p. In contrast, cpΣ(cpΣ + λI) 1 can be a full rank matrix. Therefore, for the vectors β in the null space of bΣ, which is also the null space of X, we certainly have that the two sides are not equal. However, here we assumed that the matrix X is random, and so its null space is a random max(p n, 0) dimensional linear space. Therefore, for any fixed vector β, the random matrix M will not contain it in its null space with high probability, and so there is no contradiction. We should also derive an asymptotic equivalent for (bΣ + λIp) 1 X ε Published as a conference paper at ICLR 2020 Figure 5: Simulation for ridge regression. We take n = 1000, λ = 0.3. Also, X has iid N(0, 1) entries, βi iid N(0, α2/p), εi iid N(0, σ2), with α = 3, σ = 1. The standard deviations are over 50 repetitions. The theoretical lines are plotted according to Theorem 2.2. The MSE is normalized by the norm of β. Suppose we have Gaussian noise, and let Z N(0, Ip). Then we can write (bΣ + λIp) 1 X ε n =d (bΣ + λIp) 1bΣ1/2 σZ So the question reduces to finding a deterministic equivalent for h(bΣ), where h(x) = (x + λ) 2x. Note that h(x) = (x + λ) 2x = (x + λ) 2(x + λ λ) = (x + λ) 1 λ(x + λ) 2. By the calculus of determinstic equivalents: (bΣ + λ) 1 (cpΣ + λI) 1. Moreover, fortunately the limit of the second part was recently calculated in (Dobriban & Sheng, 2019). This used the so-called differentiation rule of the calculus of deterministic equivalents to find (bΣ + λ) 2 (cpΣ + λI) 2(I c pΣ). The derivative c p = dcp/dz has been found in Dobriban & Sheng (2019), in the proof of Theorem 3.1, part 2b. The result is (with γp = p/n, Hp the spectral distribution of Σ, and T a random variable distributed according to Hp) c p = γp EHp cp T (cp T z)2 1 + γpz EHp T (cp T z)2 . (8) So, we find the final answer (bΣ + λIp) 1bΣ1/2 A(Σ, λ) := (cpΣ + λI) 1 λ(cpΣ + λI) 2(I c pΣ). A.2 RISK ANALYSIS Figure 5 shows a simulation result. We see a good match between theory and simulation. A.2.1 PROOF OF THEOREM 2.2 Proof. The MSE of ˆβ has the form E ˆβ β 2 = bias2 + δ2, bias2 = E X X/n + λIp 1 X X/nβ β 2 δ2 = σ2E X X/n + λIp 1 n 1X 2 Published as a conference paper at ICLR 2020 We assume that X has iid entries of zero mean and unit variance, and that Eβ = 0, Var [β] = α2/p Ip. As p/n γ as n goes to infinity, the ESD of 1 n X X converges to the MP law Fγ. So we have bias2 = E λ X X/n + λIp 1 β 2 p tr[ X X/n + λIp 2] α2λ2 Z 1 (x + λ)2 d Fγ(x), n2 E h tr[ X X/n + λIp 2 X X] i n E h tr[ X X/n + λIp 1 λ X X/n + λIp 2] i σ2γ Z 1 x + λd Fγ(x) λ Z 1 (x + λ)2 d Fγ(x) . 
Denoting θi(γ, λ) = R 1 (x+λ)i d Fγ(x), then AMSE(ˆβ) = α2λ2θ2 + γσ2[θ1 λθ2]. (9) For the standard Marchenko-Pastur law (i.e., when Σ = Ip), we have the explicit forms of θ1 and θ2. Specifically, θ1 = Z 1 x + λd Fγ(x) = 1 λγ + 2 γλz2 ( γ + 1 + λ γ ) + ( γ + 1 + λ γ )2 4 It is known that the limiting Stieltjes transform m Fγ := mγ of bΣ has the explicit form (Marchenko & Pastur, 1967): mγ(z) = (z + γ 1) + p (z + γ 1)2 4zγ 2zγ . As usual in the area, we use the principal branch of the square root of complex numbers. Hence θ1 = ( λ+γ 1)+ ( λ+γ 1)2+4λγ 2λγ . Also θ2(γ, λ) = Z 1 (x + λ)2 d Fγ(x) = Z d dλ 1 x + λd Fγ(x) γλ2 + 1 γ d dλ z2 γλ2 + γ + 1 2γλ2 1 2 γ [ λ + γ + 1 ( γ + 1+λ γ )2 4 ( γ + 1+λ γ )2 4 For the residual, n Y X ˆβ 2 2|X = α2λ2 1 p tr[ X X/n + λIp 1 λ X X/n + λIp 2] n[tr(In) 2 tr X X/n + λIp 1 X X/n + tr X X/n + λIp 1 X X/n 2 ]. Published as a conference paper at ICLR 2020 p tr[ X X/n + λIp 1 X X/n 2 ] = E 1 p tr[ Ip λ X X/n + λIp 1 2 ] 1 2λθ1 + λ2θ2. n Y X ˆβ 2 2 α2λ2[θ1 λθ2] + σ2 1 2γ(1 λθ1) + γ(1 2λθ1 + λ2θ2) = α2λ2[θ1 λθ2] + σ2 1 γ(1 + λθ1 λ2θ2) . A.3 BIAS-VARIANCE TRADEOFF The limiting MSE decomposes into a limiting squared bias and variance. The specific forms of these are bias2 = α2 Z λ2 (x + λ)2 d Fγ(x), var = γσ2 Z x (x + λ)2 d Fγ(x). See Figure 1 for a plot. We can make several observations. 1. The bias increases with λ, starting out at zero for λ = 0 (linear regression), and increasing to α2 as λ (zero estimator). 2. The variance decreases with λ, from γσ2 R x 1d Fγ(x) to zero. 3. In the setting plotted in the figure, when α2 and σ2 are roughly comparable, there are additional qualitative properties we can investigate. When γ is small, the regularization parameter λ influences the bias more strongly than the variance (i.e., the derivative of the normalized quantities in the range plotted is generally larger for the normalized squared bias). In contrast when γ is large, the variance is influenced more. Next we consider how bias and variance change with γ at the optimal λ = γσ2/α2. This can be viewed as the pure effects of dimensionality on the problem, keeping all other parameters fixed. Ineed, α2/σ2 can be viewed as the signal-to-noise ratio (SNR), and is fixed. This analysis allows us to study for the best possible estimator (ridge regression, a Bayes estimator), behaves with the dimension. We refer to Figure 6, where we make some specific choices of α and σ. 1. Clearly the overall risk increases, as the problem becomes harder with increasing dimension. This is in line with our intuition. 2. The classical bias-variance tradeoff can be summarized by the equation bias2(λ) + var(λ) M (α, γ), where we made explicit the dependence of the bias and variance on λ, and where M (α, γ) is the minimum MSE achievable, also known as the Bayes error, for which there are explicit formulas available (Tulino & Verd u, 2004; Dobriban & Wager, 2018). 3. The variance first increases, then decreases with γ. This shows that in the classical low-dimensional case, most of the risk is due to variance, while in the modern highdimensional case, most of it is due to bias. This observation is consistent with other phenomena in proportional-limit asymptotics, for instance that the map between population and sample eigenvalue distributions is asymptotically deterministic (Marchenko & Pastur, 1967; Bai & Silverstein, 2010). A.4 SIMULATIONS WITH CROSS-VALIDATION See Figure 7. We consider both small and large γ. 
Our bias-correction procedure shrinks the λ to the correct direction and decreases the test error. It is also shown that the one-standard-error rule (e.g., Hastie et al., 2009) does not perform well here. Published as a conference paper at ICLR 2020 Figure 6: Bias-variance tradeoff at optimal λ = γσ2/α2, when α = 3, σ = 1. Figure 7: Left: we generate a training set (n = 1000, p = 700, γ = 0.7, α = σ = 1) and a test set (ntest = 500) from the same distribution. We split the training set into K = 5 equally sized folds and do cross-validation. The blue error bars plot the mean and standard error of the K test errors. The red dotted line indicates the one-standard-error location. The green dashed line indicates the optimal λ CV obtained by k-fold cross-validation, while the red dashed-dotted line indicates the debiased version K 1 K λ CV . The orange line plots the test error when training on the whole training set and fit on the whole test set, and the purple dashed-dotted line indicates the minimal λ test. The test error is 1.513 at λ CV and 1.510 at K 1 K λ CV . So the bias-correction decreases the test error by about 0.003. Right: we take n = 200, p = 1000, γ = 5, α = 3, σ = 1. The bias-correction decreases the test error from 8.92 to 8.89, so it decreases by 0.03. Published as a conference paper at ICLR 2020 A.5 CHOOSING THE REGULARIZATION PARAMETERADDITIONAL DETAILS Another possible prediction method is to use the average of the ridge estimators computed during cross-validation. Here it is also natural to use the CV-optimal regularization parameters, averaging ˆβ k(ˆλ k), i.e. ˆβavg(ˆλ k) = 1 k=1 ˆβ k(ˆλ k). This has the advantage that it does not require refitting the ridge regression estimator, and also that we use the optimal regularization parameter. A.5.1 TRAIN-TEST VALIDATION The same bias in the regularization parameter also applies to train-test validation. Since the number of samples is changed when restricting to the training set, the optimal λ chosen by train-test validation is also biased for the true regularization parameter minimizing the test error. We will later see in simulations (Figure 8) that retraining the ridge regression estimator on the whole data will still significantly improve the performance (this is expected based on our results on CV). For prediction, here we can also use ridge regression on the training set. This effectively reduces sample size n ntrain, where ntrain is the sample size of the training set. However, if the training set grows such that n/ntrain 1 while ntrain , the train-test split has asymptotically optimal performance. A.5.2 LEAVE-ONE-OUT There is a special short-cut for leave-one-out in ridge regression, which saves us from burdensome computation. Write loo(λ) for the leave-one-out estimator of prediction error with parameter λ. Instead of doing ridge regression n times, we can calculate the error explicitly as " Yi X i ˆβ(λ) 1 Sii(λ) where S(λ) = X(X X + nλI) 1X . The minimizer of loo(λ) is asymptotically optimal, i.e., it converges to λ (Hastie et al., 2019). However, the computational cost of this shortcut is the same as that of a train-test split. Therefore, the method described above has the same asymptotic performance. Figure 8: Comparing different ways of doing cross-validation. We take n = 500, p = 550, α = 20, σ = 1, K = 5. As for train-test validation, we take 80% of samples to be training set and the rest 20% be test set. The error bars are the mean and standard deviation over 20 repetitions. 
Simulations: Figure 8 shows simulation results comparing different cross-validation methods: 1. kf k-fold cross-validation by taking the average of the ridge estimators at the CV-optimal regularization parameter. Published as a conference paper at ICLR 2020 2. kf refit k-fold cross-validation by refitting ridge regression on the whole dataset using the CV-optimal regularization parameter. 3. kf bic k-fold cross-validation by refitting ridge regression on the whole dataset using the CV-optimal regularization parameter, with bias correction. 4. tt train-test validation, by using the ridge estimator computed on the train data, at the validation-optimal regularization parameter. Note: we expect this to be similar, but worse than the kf estimator. 5. tt refit train-test validation by refitting ridge regression on the whole dataset, using the validation-optimal regularization parameter. Note: we expect this to be similar, but slightly worse than the kf refit estimator. 6. tt bic train-test validation by refitting ridge regression on the whole dataset using the CV-optimal regularization parameter, with bias correction. 7. loo leave-one-out Figure 8 shows that the naive estimators (kf and tt) can be quite inaccurate without refitting or bias correction. However, if we either refit or bias-correct, the accuracy improves. In this case, there seems to be no significant difference between the various methods. A.6 PROOF OF THEOREM 4.1 Proof. Suppose m/n ξ as n goes to infinity. For ˆβp, we have bias2 = E X L LX/n + λIp 1 X X/nβ β 2 δ2 = σ2E X L LX/n + λIp 1 n 1X 2 Denote M = X L LX/n + λIp 1, the resolvent of the sketched matrix. We further assume that X has iid N(0, 1) entries and LL = Im. Let L1 be an orthogonal complementary matrix of L, such that L L + L 1 L1 = In. We also denote N = X L 1 L1X n . Then MX X/n =M X L LX + X L 1 L1X n = Ip λM + MN. Therefore, using that Cov [β] = α2/p Ip, we find the bias as p E tr(M Ip)(M Ip) λ2E tr[M 2] + E tr M 2 (X L 1 L1X)2 2λE tr M 2N . By the properties of Wishart matrices (e.g., Anderson, 2003; Muirhead, 2009), we have E [N] = n m n2 E Wishart(Ip, n m)2 = 1 n2 [n m + p(n m) + (n m)2]Ip. Recalling that m, n such that m/n ξ, and that θi(γ, λ) = R (x + λ) id Fγ(x), λ2 + n m + p(n m) + (n m)2 α2[(λ + ξ 1)2 + γ(1 ξ)]θ2(γ, ξ, λ). n2 E tr[M 2X X] n E [tr[M]] λE tr[M 2] + E tr[M 2N] γσ2[θ1(γ, ξ, λ) λθ2(γ, ξ, λ) + (1 ξ)θ2(γ, ξ, λ)]. Published as a conference paper at ICLR 2020 Here we used the additional definitions θi(γ, ξ, λ) = Z 1 (ξx + λ)i d Fγ/ξ(x) θi(γ, λ) = θi(γ, ξ = 1, λ). Note that these can be connected to the previous definitions by θ1(γ, ξ, λ) = 1 Z 1 x + λ/ξ d Fγ/ξ(x) = 1 θ2(γ, ξ, λ) = 1 Therefore the AMSE of ˆβp is AMSE(ˆβp) = α2[(λ + ξ 1)2 + γ(1 ξ)]θ2(γ, ξ, λ) + γσ2[θ1(γ, ξ, λ) (λ + ξ 1)θ2(γ, ξ, λ)] = α2[(λ + ξ 1)2 + γ(1 ξ)] 1 (λ + ξ 1) 1 A.6.1 ISOTROPIC CASE Consider the special case where Γ = I, that is, X has iid N(0, 1) entries. Then Fγ is the standard MP law, and we have the explicit forms for θi = θi(γ, λ) = R 1 (x+λ)i d Fγ: θ1(γ, λ) = 1 + λ γλ + 1 2 γλ[ γ + 1 + λ γ + ( γ + 1 + λ γ )2 4], θ2(γ, λ) = 1 γλ2 + γ + 1 2γλ2 1 2 γ (λ + 1 ( γ + 1+λ γ )2 4 + 1 2 γ ( γ + 1 + λ γ )2 4 1 θ1(ζ, λ) = ζθ1(ζ, λ) + 1 ζ θ2(ζ, λ) = ζθ2(ζ, λ) + 1 ζ The results are obtained by the contour integral formula Z f(x)d Fγ(x) = 1 f(|1 + γz|2)(1 z2)2 z2(1 + γz)(z + γ)dz. See Proposition 2.10 of Yao et al. (2015). A.7 PROOF OF THEOREM 4.2 Proof. Suppose d/p ζ as n goes to infinity. 
For ˆβd, we have bias2 = E n 1X XRR X /n + λIn 1 Xβ β 2 δ2 = σ2 tr[ XRR X /n + λIn 2 XX Denote M = XRR X /n + λIn 1. Note that, using that Cov [β] = α2/p Ip p E tr[MXX /n]2 2α2 p E tr[MXX /n] + α2 Published as a conference paper at ICLR 2020 Moreover, letting R1 to be an orthogonal complementary matrix of R, such that RR +R1R 1 = In, and N = XR1R 1 X p tr[MXX /n] = 1 p tr[In λE [tr[M]] + E [MN]] Z 1 x + λd Fζ(x) + γ ζ Z 1 x + λd Fζ(x), where Fζ is the companion MP law, that is, Fζ = (1 γ)δ0 + γFζ. The third term calculated by using that XR and XR1 are independent for a Gaussian random matrix X, so that M, N are independent, and that E [N] = p d p tr[MXX /n] 1 λ + ζθ1(ζ, λ) . p tr[MXX /n]2 = 1 p E tr[In + λ2M 2 + MNMN 2λM + 2MN λM 2N λMNM . E [MNMN|M] = M[(p d)(M + tr(M)In) + (p d)2M]/n2 = p d + (p d)2 n2 M 2 + p d p tr[MXX /n]2 1 γ [1 + (λ2 2λ(γ ζ) + (γ ζ)2) θ2(ζ, λ) + 2(γ ζ λ) θ1(ζ, λ) + (γ ζ) θ2 1(ζ, λ)]. Thus we find the following exprssion for the limiting squared bias: γ [γ 1 + (λ γ + ζ)2 θ2 + (γ ζ) θ2 1]. With similar calculations (that we omit for brevity), we can find δ2 σ2( θ1(ζ, λ) (λ + ζ γ) θ2(ζ, λ)). Therefore the AMSE of ˆβd is γ [γ 1 + (λ γ + ζ)2 θ2 + (γ ζ) θ2 1] + σ2[ θ1(ζ, λ) (λ + ζ γ) θ2(ζ, λ)]. A.8 PROOF OF THEOREM 4.3 Proof. Recall that we have m, n , such that m/n ξ. Then we need to take ξ 0. However, we find it more convenient to do the calculation directly from the finite sample results as m, n, p with m/n 0, p/n γ, It is not hard to check that computing the results in the other way (i.e., interchanging the limits), leads to the same results. Starting from our bias formula for primal sketching, we first get λ2 + n m + p(n m) + (n m)2 E h tr[ X L LX/n + λIp 2] i α2[(λ 1)2 + γ]/λ2. Published as a conference paper at ICLR 2020 The limit of the trace term is not entirely trivial, but it can be calculated by (1) observing that the m p sketched data matrix P = LX has iid normal entries (2) thus the operator norm of P P/n vanishes, (3) and so by a simple matrix perturbation argument the trace concentrates around p/λ2. This gives the rough steps of finding the above limit. Moreover, n2 E h tr[ X L LX/n + λIp 2 X X] i γσ2/λ2 EFγX2 = γσ2/λ2 So the MSE is M(λ) = α2[(λ 1)2 + γ]/λ2 + σ2 γ/λ2. From this it is elementary to find the optimal λ and its objective value. A.9 PROOF OF THEOREM 4.4 Proof. Note that the bias can be written as p E h tr[ XRR X /n + λIn 1 XX /n] i + α2. Write G = XX . Since RR Wp(Ip, d), we have XRR X Wn(G, d). So XRR X d= G1/2WG1/2, where W Wn(In, d). = E h tr[(G1/2WG1/2/d + nλIn) 1G] i n ) 1) 1] . So we need to find the law of W p ) 1. Suppose first that G = XX Wn(In, p). Then W and G 1 are asymptotically freely independent. The l.s.d. of W/d is the MP law F1/ξ while the l.s.d. of G/p is the MP law F1/γ. We need to find the additive free convolution W G, where G = λ Recall that the R-transform of a distribution F is defined by RF (z) = m 1 F ( z) 1 where m 1 F (z) is the inverse function of the Stieltjes transform of F (e.g., Voiculescu et al., 1992; Hiai & Petz, 2006; Couillet & Debbah, 2011). We can find the R-transform by solving m F (RF (z) + 1 Note that the R-transform of W/d is RW (z) = 1 1 z/ξ . The Stieltjes transform of G 1 is m G 1(z) = Z 1 1/x z d F1/γ(x) = 1 Published as a conference paper at ICLR 2020 Figure 9: Dual Gaussian sketch improves MSE. 
A.9 PROOF OF THEOREM 4.4

Proof. As in the proof of Theorem 4.2, the squared bias can be written in terms of
\[
-\frac{2\alpha^2}{p}\,\mathbb{E}\,\mathrm{tr}\!\left[\left(\frac{XRR^\top X^\top}{n} + \lambda I_n\right)^{-1}\frac{XX^\top}{n}\right] + \alpha^2,
\]
together with a second-order trace term. Write G = XX^⊤. Since RR^⊤ ∼ W_p(I_p, d), we have XRR^⊤X^⊤ ∼ W_n(G, d), so XRR^⊤X^⊤ =_d G^{1/2}WG^{1/2}, where W ∼ W_n(I_n, d). Therefore the trace term can be written as
\[
\mathbb{E}\,\mathrm{tr}\!\left[\left(\frac{G^{1/2}WG^{1/2}}{d} + n\lambda I_n\right)^{-1}G\right]
= \mathbb{E}\,\mathrm{tr}\!\left[\left(\frac{W}{d} + n\lambda G^{-1}\right)^{-1}\right].
\]
So we need to find the law of W/d + nλG^{-1}. Suppose first that G = XX^⊤ ∼ W_n(I_n, p). Then W and G^{-1} are asymptotically freely independent. The l.s.d. of W/d is the MP law F_{1/ξ}, while the l.s.d. of G/p is the MP law F_{1/γ}. We need to find the free additive convolution of W/d and G̃, where G̃ = (λ/γ)(G/p)^{-1} is the asymptotic form of nλG^{-1}.

Recall that the R-transform of a distribution F is defined by R_F(z) = m_F^{-1}(−z) − 1/z, where m_F^{-1}(z) is the inverse function of the Stieltjes transform of F (e.g., Voiculescu et al., 1992; Hiai & Petz, 2006; Couillet & Debbah, 2011). We can find the R-transform by solving m_F(R_F(z) + 1/z) = −z. Note that the R-transform of W/d is R_W(z) = 1/(1 − z/ξ). The Stieltjes transform of G^{-1} is
\[
m_{G^{-1}}(z) = \int \frac{1}{1/x - z}\,dF_{1/\gamma}(x).
\]

Figure 9: Dual Gaussian sketch improves MSE.

Then the R-transform of G^{-1} is
\[
R_{G^{-1}}(z) = -\frac{1}{z} + \frac{\gamma + 1 - \sqrt{(\gamma+1)^2 - 4\gamma(z+1)}}{2z}.
\]
Since we have the scaling property R_{aμ}(z) = aR_μ(az),
\[
R_{\tilde G}(z) = \frac{\lambda}{\gamma}\,R_{G^{-1}}\!\left(\frac{\lambda}{\gamma}z\right) = \frac{\gamma - 1 - \sqrt{(\gamma-1)^2 - 4\lambda z}}{2z}.
\]
Hence we have
\[
R_{W \boxplus \tilde G}(z) = R_W(z) + R_{\tilde G}(z) = \frac{1}{1 - z/\xi} + \frac{\gamma - 1 - \sqrt{(\gamma-1)^2 - 4\lambda z}}{2z}.
\]
Moreover, the Stieltjes transform of μ = W ⊞ G̃ satisfies
\[
m_\mu^{-1}(z) = R_\mu(-z) - \frac{1}{z}
= \frac{1}{1 + z/\xi} + \frac{\sqrt{(\gamma-1)^2 + 4\lambda z} - (\gamma-1)}{2z} - \frac{1}{z}.
\]
The quantities appearing in the theorem are (1/γ) lim_{z→0} m(z) and (1/γ) lim_{z→0} (d/dz) m(z), so it suffices to find m(z) and (d/dz) m(z) evaluated at zero.

This result can characterize the performance of sketching in the high-SNR regime, where α ≫ σ. To understand the lower-SNR regime, we need to study the variance, and thus we need to calculate
\[
\delta^2 = \sigma^2\,\mathbb{E}\,\mathrm{tr}\!\left[\left(\frac{W}{d} + n\lambda G^{-1}\right)^{-2}G^{-1}\right],
\]
where G = XX^⊤ ∼ W_n(I_n, p) is a Wishart matrix and XRR^⊤X^⊤ =_d G^{1/2}WG^{1/2}, with W ∼ W_n(I_n, d). This seems to be quite challenging, and we leave it to future work.
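As a rough numerical illustration (our own sketch, not from the paper's code), the limiting quantity m(0) = τ[(a+b)^{-1}] can be approximated by plugging large independent Wishart matrices into the finite-n expression, under the identification a ≈ W/d and b ≈ (λ/γ)(G/p)^{-1} used in the proof above. The sizes and parameter values below are arbitrary, and γ > 1 is assumed so that G is invertible.

```python
# Monte Carlo approximation of tau[(a + b)^{-1}] via large Wishart matrices.
import numpy as np

rng = np.random.default_rng(0)
n, gamma, xi, lam = 1500, 1.5, 0.3, 1.0     # illustrative values; gamma > 1 so G is invertible
p, d = int(gamma * n), int(xi * n)

Z = rng.standard_normal((n, p))
Y = rng.standard_normal((n, d))
G = Z @ Z.T                                  # G ~ Wishart_n(I_n, p)
W = Y @ Y.T                                  # W ~ Wishart_n(I_n, d)
A = W / d                                    # matrix model for a (MP law F_{1/xi})
B = (lam / gamma) * np.linalg.inv(G / p)     # matrix model for b = (lambda/gamma) * bbar^{-1}
print(np.trace(np.linalg.inv(A + B)) / n)    # approximates tau[(a + b)^{-1}] = m(0)
```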
A.10 RESULTS FOR PRIMAL GAUSSIAN SKETCHING

The statement requires some notions from free probability; see, e.g., Voiculescu et al. (1992); Hiai & Petz (2006); Nica & Speicher (2006); Anderson et al. (2010); Couillet & Debbah (2011) for references.

Theorem A.1 (Bias of primal Gaussian sketch). Suppose X is an n × p standard Gaussian random matrix. Suppose also that L is a d × n matrix with i.i.d. N(0, 1/d) entries. Then the squared bias of the primal sketch estimator has the expression
\[
\mathrm{bias}^2(\hat\beta_p) = \alpha^2 + \frac{\alpha^2}{\gamma}\Big[\tau\big((a+b)^{-1}b(a+b)^{-1}b^{-1}\big) - 2\tau\big((a+b)^{-1}\big)\Big],
\]
where a and b are two random variables that are freely independent in a non-commutative probability space, and τ is the trace. Specifically, the law of a is the MP law F_{1/ξ}, and b = (λ/γ)\bar b^{-1}, where the law of \bar b is the MP law F_{1/γ}.

Proof of Theorem A.1. Note that
\[
\mathrm{bias}^2 = \mathbb{E}\left\|\left(\frac{X^\top L^\top LX}{nd} + \lambda I_p\right)^{-1}\frac{X^\top X}{n}\,\beta - \beta\right\|^2,
\qquad
\left(\frac{X^\top L^\top LX}{nd} + \lambda I_p\right)^{-1}X^\top = X^\top\left(\frac{L^\top LXX^\top}{nd} + \lambda I_n\right)^{-1}.
\]
Thus
\[
\mathrm{bias}^2 = \alpha^2 + \frac{\alpha^2}{p}\,\mathbb{E}\,\mathrm{tr}\!\left[\left(\frac{L^\top LXX^\top}{nd} + \lambda I_n\right)^{-1}\frac{XX^\top}{n}\cdot\frac{XX^\top}{n}\left(\frac{L^\top LXX^\top}{nd} + \lambda I_n\right)^{-1}\right]
- \frac{2\alpha^2}{p}\,\mathbb{E}\,\mathrm{tr}\!\left[\left(\frac{L^\top LXX^\top}{nd} + \lambda I_n\right)^{-1}\frac{XX^\top}{n}\right].
\]
First we find the l.s.d. of (L^⊤LXX^⊤/(nd) + λI_n)^{-1}XX^⊤/n. Write W = L^⊤L and G = XX^⊤. Then
\[
\left(\frac{WG}{nd} + \lambda I_n\right)^{-1}\frac{G}{n}
\]
is similar to (W/d + nλG^{-1})^{-1}, so it suffices to find the l.s.d. of W/d + nλG^{-1}. By definition, W ∼ W_n(I_n, d) and G ∼ W_n(I_n, p); therefore the l.s.d. of W/d converges to the MP law F_{1/ξ} and the l.s.d. of G/p converges to the MP law F_{1/γ}. Also note that
\[
\mathrm{tr}\!\left[\left(\frac{L^\top LXX^\top}{nd} + \lambda I_n\right)^{-1}\frac{XX^\top}{n}\cdot\frac{XX^\top}{n}\left(\frac{L^\top LXX^\top}{nd} + \lambda I_n\right)^{-1}\right]
= \mathrm{tr}\!\left[\left(\frac{W}{d} + \lambda nG^{-1}\right)^{-1}G^{-1}\left(\frac{W}{d} + \lambda nG^{-1}\right)^{-1}G\right].
\]
We write A = W/d and B = λnG^{-1}. Then it suffices to find
\[
\frac{1}{p}\,\mathbb{E}\,\mathrm{tr}\big[(A+B)^{-1}B(A+B)^{-1}B^{-1}\big].
\]
We will find an expression for this using free probability, by way of a series expansion. There are two cases, depending on whether the operator norm of BA^{-1} is less than or greater than unity, leading to different series expansions. We work out the first case below; the second case is similar and leads to the same answer. We have
\[
\mathrm{tr}\big[(A+B)^{-1}B(A+B)^{-1}B^{-1}\big]
= \mathrm{tr}\big[A^{-1}(I + BA^{-1})^{-1}BA^{-1}(I + BA^{-1})^{-1}B^{-1}\big].
\]
Since the operator norm of BA^{-1} is less than unity, we have the von Neumann series expansion
\[
(I + BA^{-1})^{-1} = \sum_{i\ge 0}(-BA^{-1})^i,
\]
and therefore
\[
\mathrm{tr}\big[(A+B)^{-1}B(A+B)^{-1}B^{-1}\big]
= \sum_{i,j\ge 0}(-1)^{i+j}\,\mathrm{tr}\big[(BA^{-1})^{i+j+1}B^{-1}A^{-1}\big]
= \sum_{i,j\ge 0}(-1)^{i+j}\,\mathrm{tr}\big[(A^{-1}B)^{i+j+1}A^{-1}B^{-1}\big].
\]
Since A and B are asymptotically freely independent in the free probability space arising in the limit (e.g., Voiculescu et al., 1992; Hiai & Petz, 2006; Couillet & Debbah, 2011), and the polynomial (a^{-1}b)^{i+j+1}a^{-1}b^{-1} involves an alternating sequence of a and b, we have
\[
\frac{1}{n}\,\mathrm{tr}\big[(A^{-1}B)^{i+j+1}A^{-1}B^{-1}\big] \;\to\; \tau\big[(a^{-1}b)^{i+j+1}a^{-1}b^{-1}\big],
\]
where a and b are free random variables and τ is their law. Specifically, a is a free random variable with the MP law F_{1/ξ}, and b = (λ/γ)\bar b^{-1}, where \bar b is a free random variable with MP law F_{1/γ}; moreover, they are freely independent. Hence, we have
\[
\frac{1}{n}\,\mathrm{tr}\big[(A+B)^{-1}B(A+B)^{-1}B^{-1}\big]
\;\to\; \tau\Big[\sum_{i\ge 0}(-1)^i(a^{-1}b)^{i+1}\sum_{j\ge 0}(-1)^j(a^{-1}b)^j\,a^{-1}b^{-1}\Big]
= \tau\big[(a^{-1}b)(1+a^{-1}b)^{-1}(1+a^{-1}b)^{-1}a^{-1}b^{-1}\big]
= \tau\big[(a+b)^{-1}b(a+b)^{-1}b^{-1}\big].
\]
Therefore
\[
\mathrm{bias}^2 = \alpha^2 + \frac{\alpha^2}{p}\,\mathbb{E}\Big\{\mathrm{tr}\big[(A+B)^{-1}B(A+B)^{-1}B^{-1}\big] - 2\,\mathrm{tr}\big[(A+B)^{-1}\big]\Big\}
\;\to\; \alpha^2 + \frac{\alpha^2}{\gamma}\Big[\tau\big((a+b)^{-1}b(a+b)^{-1}b^{-1}\big) - 2\tau\big((a+b)^{-1}\big)\Big].
\]

A.11 RESULTS FOR FULL SKETCHING

The full sketch estimator projects down the entire dataset and then performs ridge regression on the sketched data. It has the form
\[
\hat\beta_f = \left(\frac{X^\top L^\top LX}{n} + \lambda I_p\right)^{-1}\frac{X^\top L^\top LY}{n}.
\]
Its asymptotic squared bias and variance are
\[
\mathrm{bias}^2 = \alpha^2\lambda^2\int\frac{1}{(\xi x+\lambda)^2}\,dF_{\gamma/\xi}(x) = \alpha^2\lambda^2\,\theta_2(\gamma,\xi,\lambda),
\qquad
\mathrm{var} = \sigma^2\gamma\left[\int\frac{dF_{\gamma/\xi}(x)}{\xi x+\lambda} - \lambda\int\frac{dF_{\gamma/\xi}(x)}{(\xi x+\lambda)^2}\right],
\]
so that
\[
\mathrm{AMSE}(\hat\beta_f) = \alpha^2\lambda^2\,\theta_2(\gamma,\xi,\lambda) + \sigma^2\gamma\big[\theta_1(\gamma,\xi,\lambda) - \lambda\theta_2(\gamma,\xi,\lambda)\big].
\]
The optimal λ for the full sketch is always λ* = γσ²/α², the same as for ridge regression. Some simulation results are shown in Figure 10, and they show the expected shape (e.g., the errors decrease with ξ).

Figure 10: Simulation results for full sketch, with n = 1000, γ = 0.1. The simulation results are averaged over 30 independent experiments.

Figure 11: Dual orthogonal sketching with γ = 1.5, λ = 1, α = 3, σ = 1. Left: MSE of dual sketching normalized by the MSE of ridge regression. The standard deviation is over 50 repetitions. Right: Bias and variance of dual sketching normalized by the bias and variance of ridge regression, respectively.

A.12 NUMERICAL RESULTS

A.12.1 DUAL ORTHOGONAL SKETCHING

See Figure 11 for additional simulation results for dual orthogonal sketching.

A.12.2 PERFORMANCE AT A FIXED REGULARIZATION PARAMETER

First we fix the regularization parameter at the value that is optimal for the original ridge regression. The results are visualized in Figure 12. On the x axis, we plot the reduction in sample size m/n for primal sketching, and the reduction in dimension d/p for dual sketching. In this case, primal and dual sketching both increase the bias and the variance, and empirically, in the current setting, dual sketching increases them more. So in this particular case, primal sketching is preferred; a small simulation sketch in this spirit is given at the end of this section.

Figure 12: Fixed regularization parameter λ = 0.7, optimal for original ridge, in a setting where γ = 0.7 and α² = σ².

A.12.3 PERFORMANCE AT THE OPTIMAL REGULARIZATION PARAMETER

We find the optimal regularization parameter λ separately for primal and dual orthogonal sketching, and then use the optimal value in each setting; see Figure 13. Both primal and dual sketching increase the bias but decrease the variance. It is interesting to note that, for equal parameters ξ and ζ, and in our particular case, dual sketching has a smaller variance but a larger bias. So primal sketching is preferred when the bias or the MSE is important, but dual sketching is more desirable when one wants a smaller variance. All in all, dual sketching has a larger MSE than primal sketching in the current setting. It can also be seen that, in this specific example, the optimal λ for primal sketching is smaller than that for dual sketching. However, these results are hard to interpret, because there is no natural correspondence between the two parameters ξ and ζ.

Figure 13: Primal and dual sketch at the optimal λ. We take γ = 0.7 and let ξ range between 0.001 and 1, where for primal sketch ξ = r/n while for dual sketch ξ = d/p.
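To accompany these comparisons, here is a minimal simulation sketch (our own illustration, not the paper's released code; all constants are arbitrary) that estimates the MSE of plain ridge, primal orthogonal sketching, and dual orthogonal sketching at a fixed λ, using the estimator formulas recalled in Section A.13 below.

```python
# Compare plain ridge, primal orthogonal sketch, and dual orthogonal sketch at a fixed lambda,
# in the spirit of Figure 12: X has iid N(0,1) entries and beta ~ N(0, alpha^2/p I).
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 350                     # gamma = p/n = 0.7
alpha2 = sigma2 = 1.0
lam = 0.7                           # lambda fixed at the value optimal for plain ridge
m, d = 250, 175                     # sketch sizes: m/n = 0.5 and d/p = 0.5

def mse(reps=50):
    out = np.zeros(3)
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        beta = rng.standard_normal(p) * np.sqrt(alpha2 / p)
        y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
        # plain ridge
        b_r = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)
        # primal orthogonal sketch: L is m x n with orthonormal rows (L L' = I_m)
        L = np.linalg.qr(rng.standard_normal((n, m)))[0].T
        LX = L @ X
        b_p = np.linalg.solve(LX.T @ LX / n + lam * np.eye(p), X.T @ y / n)
        # dual orthogonal sketch: R is p x d with orthonormal columns (R'R = I_d)
        R = np.linalg.qr(rng.standard_normal((p, d)))[0]
        XR = X @ R
        b_d = X.T @ np.linalg.solve(XR @ XR.T / n + lam * np.eye(n), y) / n
        out += [np.sum((b - beta) ** 2) for b in (b_r, b_p, b_d)]
    return out / reps

print(dict(zip(["ridge", "primal", "dual"], mse())))
```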
A.13 COMPUTATIONAL COMPLEXITY

Since sketching is a method to reduce computational complexity, it is important to discuss how much computational efficiency we gain. Recall our three estimators:
\[
\hat\beta = \left(\frac{X^\top X}{n} + \lambda I_p\right)^{-1}\frac{X^\top Y}{n} = \frac{X^\top}{n}\left(\frac{XX^\top}{n} + \lambda I_n\right)^{-1}Y,
\]
\[
\hat\beta_p = \left(\frac{X^\top L^\top LX}{n} + \lambda I_p\right)^{-1}\frac{X^\top Y}{n},
\qquad
\hat\beta_d = \frac{X^\top}{n}\left(\frac{XRR^\top X^\top}{n} + \lambda I_n\right)^{-1}Y.
\]
Their computational complexity, when computed in the usual way, is:

No sketch (standard ridge): if p < n, computing X^⊤Y and X^⊤X requires O(np) and O(np²) flops, then solving the linear equation (X^⊤X/n + λI_p)β̂ = X^⊤Y/n requires O(p³) flops by the LU decomposition. It is O(np²) flops in total. If p > n, we use the second formula for β̂, and the total flops is O(pn²).

Primal sketch: for the Hadamard sketch (and other sketches based on the FFT), computing LX by FFT requires mp log n flops, and computing (LX)^⊤LX requires mp² flops, so the total flops is O(p³ + mp(log n + p)). So the primal sketch can reduce the computation cost only when p < n.

Dual sketch: computing XRR^⊤X^⊤ requires nd(log p + n) flops by FFT, solving (XRR^⊤X^⊤/n + λI_n)^{-1}Y requires O(n³) flops, and the matrix-vector multiplication of X^⊤ and (XRR^⊤X^⊤/n + λI_n)^{-1}Y requires O(np) flops, so the total flops is O(n³ + nd(log p + n)). Dual sketching can reduce the computation cost only when p > n.
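As a rough timing illustration of the p < n case (our own sketch, not part of the paper's code), the following compares forming and solving the standard ridge system against the primal-sketched system with m = n/2, using a row-subsampling L, which satisfies LL^⊤ = I_m and makes LX essentially free to form.

```python
# Timing sketch: standard ridge vs. primal (subsampled) sketch, for p < n.
import time
import numpy as np

rng = np.random.default_rng(0)
n, p, m, lam = 8000, 1000, 4000, 1.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

t0 = time.perf_counter()
b_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)       # O(n p^2 + p^3)
t1 = time.perf_counter()
rows = rng.choice(n, size=m, replace=False)
LX = X[rows]                                                                 # row-subsampling sketch, O(m p)
b_primal = np.linalg.solve(LX.T @ LX / n + lam * np.eye(p), X.T @ y / n)     # O(m p^2 + p^3)
t2 = time.perf_counter()
print(f"standard ridge: {t1 - t0:.2f}s   primal (subsampled) sketch: {t2 - t1:.2f}s")
```

Because the subsampling sketch makes LX essentially free, this comparison isolates the O(np²) versus O(mp²) cost of forming the Gram matrix, which is the dominant term when p < n; FFT-based sketches such as the Hadamard sketch discussed above would change the cost of forming LX but not this overall picture.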