# Solving Ridge Regression using Sketched Preconditioned SVRG

Alon Gonen (ALONGNN@CS.HUJI.AC.IL), The Hebrew University
Francesco Orabona (FRANCESCO@ORABONA.COM), Yahoo Research, 229 West 43rd Street, 10036 New York, NY, USA
Shai Shalev-Shwartz (SHAIS@CS.HUJI.AC.IL), The Hebrew University

Abstract

We develop a novel preconditioning method for ridge regression, based on recent linear sketching methods. By equipping Stochastic Variance Reduced Gradient (SVRG) with this preconditioning process, we obtain a significant speed-up relative to fast stochastic methods such as SVRG, SDCA and SAG.

1. Introduction

Consider the ridge regression problem:

$$\min_{w \in \mathbb{R}^d} L(w) := \frac{1}{n}\sum_{i=1}^n \frac{1}{2}(w^\top x_i - y_i)^2 + \frac{\lambda}{2}\|w\|^2, \qquad (1)$$

where $\lambda > 0$ is a regularization parameter and $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$, for $i = 1, \ldots, n$, are the training data. We focus on the large-scale regime, where both $n$ and $d$ are large. In this setting, stochastic iterative methods such as SDCA (Shalev-Shwartz & Zhang, 2013), SVRG (Johnson & Zhang, 2013), and SAG (Roux et al., 2012) have become a standard choice for minimizing the objective $L$. Specifically, the overall complexity of a recent improved variant of SVRG due to Xiao & Zhang (2014) depends on the average condition number, which is defined as follows. Denote the empirical correlation matrix and its eigenvalue decomposition by

$$C = \frac{1}{n}\sum_{i=1}^n x_i x_i^\top = \sum_{i=1}^d \lambda_i u_i u_i^\top.$$

The average condition number of $C + \lambda I$ is defined as the ratio between the trace of the Hessian of $L$ and its minimal eigenvalue:

$$\hat\kappa := \hat\kappa(C + \lambda I) = \frac{\mathrm{tr}(C + \lambda I)}{\lambda_d(C + \lambda I)} = \frac{\sum_{i=1}^d \lambda_i + d\lambda}{\lambda_d + \lambda}.$$

The mentioned variant of SVRG finds an $\epsilon$-approximate minimizer of $L$ in time $O((\hat\kappa + n)\,d\log(1/\epsilon))$. Namely, the output of the algorithm, denoted $\hat w$, satisfies $\mathbb{E}[L(\hat w)] - L(w^\star) \le \epsilon$, where the expectation is over the randomness of the algorithm. For an accelerated version of the algorithm, we can replace $\hat\kappa$ by $\sqrt{n\hat\kappa}$ (Shalev-Shwartz & Zhang, 2014; Lin et al., 2015).

The regularization parameter $\lambda$ increases the smallest eigenvalue of $C + \lambda I$ to at least $\lambda$, and thus improves the condition number and makes the optimization problem easier. However, to control the underfitting/overfitting tradeoff, $\lambda$ has to decrease as $n$ increases (Shalev-Shwartz & Ben-David, 2014). Moreover, in many machine learning applications $\lambda_d$ approaches zero and is usually smaller than the value of $\lambda$. Overall, this yields a large condition number in most of the interesting cases.

A well-known approach for reducing the average condition number is preconditioning. Concretely, for a (symmetric) positive definite (pd) matrix $P \in \mathbb{R}^{d \times d}$, we define the preconditioned optimization problem as

$$\min_{\tilde w \in \mathbb{R}^d} \tilde L(\tilde w) := L(P^{-1/2}\tilde w). \qquad (2)$$

Note that $\tilde w$ is an $\epsilon$-approximate minimizer of $\tilde L$ if and only if $w = P^{-1/2}\tilde w$ is an $\epsilon$-approximate minimizer of $L$. Hence, we can minimize (2) rather than (1). As we shall see, the structure of the objective allows us to apply the preconditioning directly to the data (as a preprocessing step) and consequently rewrite the preconditioned objective as a ridge regression problem with respect to the preconditioned data (see Section 5.1). For a suitable choice of the matrix $P$, the average condition number is significantly reduced. Precisely, as will be apparent from the analysis, the pd matrix that minimizes the average condition number is $P = C + \lambda I$, and the corresponding average condition number is $d$.
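As a concrete numerical illustration of these definitions (our own sketch, not part of the paper), the following numpy snippet computes the average condition number of $C + \lambda I$ on synthetic data with an assumed fast-decaying spectrum, and checks that whitening with the optimal (but expensive) preconditioner $P^{-1/2} = (C + \lambda I)^{-1/2}$ brings it down to about $d$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 2000, 50, 1e-3

# Synthetic data with a fast-decaying spectrum (illustrative choice).
X = rng.standard_normal((d, n)) * (1.0 / np.arange(1, d + 1))[:, None]
C = X @ X.T / n                        # empirical correlation matrix
H = C + lam * np.eye(d)                # Hessian of L

# Average condition number: trace divided by the smallest eigenvalue.
print("kappa_hat(C + lam*I):", np.trace(H) / np.linalg.eigvalsh(H).min())

# Whitening with P^{-1/2} = (C + lam*I)^{-1/2}, the optimal but expensive preconditioner.
vals, vecs = np.linalg.eigh(H)
P_inv_half = (vecs / np.sqrt(vals)) @ vecs.T
Hp = P_inv_half @ H @ P_inv_half
print("after whitening (should be about d):", np.trace(Hp) / np.linalg.eigvalsh(Hp).min())
```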
However, such a preconditioning process would require both the computation of $P^{-1/2}$ and the computation of $P^{-1/2}x_i$ for each $i \in [n]$. By first-order conditions, computing $(C + \lambda I)^{-1/2}$ is essentially equivalent to solving the original problem (1), rendering this optimal preconditioner useless. Yet, the optimal preconditioner might not be needed in many cases. In fact, a common empirical observation (see Section 6) is that (high-dimensional) machine learning problems tend to have few dominant features, while the other coordinates are strongly correlated with the dominant ones. As a result, the spectrum of the correlation matrix decays very fast. Hence, it is natural to expect to gain a lot from devising preconditioning methods that focus on the stronger directions of the data.

Our contributions are as follows. We develop a relatively cheap preconditioning method that, coupled with SVRG, is guaranteed to speed up convergence in practical applications while having a computational cost comparable to SVRG alone. In order to approximately extract the stronger directions while incurring a low computational cost, we rely on a variant of the Block Lanczos method due to Musco & Musco (2015) to compute an approximate truncated SVD (Singular Value Decomposition) of the correlation matrix $C$. Finally, by equipping SVRG with this preconditioner, we obtain our main result.

2. Main Result

Theorem 1. Let $k \in [d]$ be a given parameter and assume that the regularization parameter $\lambda$ is larger than $\lambda_d$. Our preconditioning process runs in time $O(ndk\log(n))$. By equipping the SVRG of Xiao & Zhang (2014) with this preconditioner, we find an $\epsilon$-approximate minimizer of (1) (with probability at least 9/10) in additional runtime of $O((\tilde\kappa + n + d)\,d\log(1/\epsilon))$, where

$$\tilde\kappa = \frac{k\lambda_k + \sum_{i>k}\lambda_i}{\lambda}, \quad \text{or} \quad \tilde\kappa = \left(\frac{n\,(k\lambda_k + \sum_{i>k}\lambda_i)}{\lambda}\right)^{1/2} \text{ if we use accelerated SVRG.}$$

When the runtimes of both the (accelerated) SVRG and our preconditioned (accelerated) SVRG are controlled by the average condition number (and both runtimes dominate $ndk$), then, ignoring logarithmic dependencies, we obtain a speed-up of order

$$\text{ratio} = \frac{\sum_{i=1}^d \lambda_i}{k\lambda_k + \sum_{i>k}\lambda_i} = \frac{\sum_{i=1}^k \lambda_i + \sum_{i>k}\lambda_i}{k\lambda_k + \sum_{i>k}\lambda_i} \qquad (3)$$

over SVRG (or $\sqrt{\sum_{i=1}^d \lambda_i / (k\lambda_k + \sum_{i>k}\lambda_i)}$ if acceleration is used). If the spectrum decays fast, then $k\lambda_k \ll \sum_{i=1}^k \lambda_i$ and $\sum_{i>k}\lambda_i \ll k\lambda_k$, and the ratio is large. Indeed, as we show in the experimental section, this ratio is often huge for relatively small $k$.

2.1. Main challenges and perspective

While the idea of developing a preconditioner that focuses on the stronger directions of the data matrix sounds plausible, several difficulties have to be solved. First, since a preconditioner must correspond to an invertible transformation, it is not clear how to form a preconditioner based on a low-rank approximation and, in particular, how we should treat the non-leading components. One of the main technical challenges in our work is to translate the approximation guarantees of the Lanczos method into a guarantee on the resulting average condition number. The standard measures of success for low-rank approximation are based on either Frobenius norm or spectral norm errors. As will be apparent from the analysis (see Section 5.4), such bounds do not suffice for our needs. Our analysis relies on the stronger per-vector error guarantees (6) due to Musco & Musco (2015).
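To make the speed-up ratio (3) above concrete, here is a small sketch of ours (the power-law spectrum is an assumed example, not one of the paper's datasets) that evaluates the ratio, and its square root for the accelerated variant, for a few values of $k$:

```python
import numpy as np

def speedup_ratio(lam_vals, k):
    """Ratio (3): sum_i lambda_i / (k*lambda_k + sum_{i>k} lambda_i)."""
    lam_vals = np.sort(lam_vals)[::-1]                 # eigenvalues in descending order
    return lam_vals.sum() / (k * lam_vals[k - 1] + lam_vals[k:].sum())

d = 10_000
spectrum = 1.0 / np.arange(1, d + 1) ** 2              # assumed quadratically decaying spectrum
for k in (10, 30, 100):
    r = speedup_ratio(spectrum, k)
    print(k, r, np.sqrt(r))                            # plain and accelerated speed-up
```

For such a rapidly decaying spectrum the ratio is already large at small $k$, which is the regime the method targets.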
It should be emphasized that while we use a variant of SVRG due to Xiao & Zhang (2014), we could equally use a variant of SDCA (Shalev-Shwartz, 2016) or develop such a variant for SAG or SAGA. Furthermore, while we focus on the quadratic case, we believe that our ideas can be lifted to more general settings. For example, when applied to self-concordant functions, each step of Newton's method requires the minimization of a quadratic objective. Therefore, it is natural to ask whether we can benefit from applying our method to approximate the Newton step.

2.2. Bias-complexity tradeoff

As mentioned above, $\lambda$ controls a tradeoff between underfitting and overfitting. In this view, we can interpret our result as follows. Assuming for simplicity that $n \ge d$ and ignoring logarithmic dependencies, we note that if

$$\lambda = \frac{k\lambda_k + \sum_{i>k}\lambda_i}{nk}, \qquad (4)$$

then the runtime of our preconditioned SVRG is $O(ndk)$. For comparison, the runtime of (unconditioned) SVRG is $O(ndk)$ if

$$\lambda = \frac{\sum_{i=1}^d \lambda_i}{nk}. \qquad (5)$$

The ratio between the RHS of (5) and (4) is the ratio given in (3). Hence, for a given runtime budget of order $O(ndk)$, we can set the regularization parameter of the preconditioned SVRG to be smaller by this ratio. A similar interpretation holds for the accelerated versions.

3. Related Work

Existing algorithms and their complexities: Since minimizing (1) is equivalent to solving the linear system $(C + \lambda I)w = \frac{1}{n}\sum_{i=1}^n y_i x_i$, standard numerical linear algebra solvers such as Gaussian elimination can be used to solve the problem in time $O(nd^2)$. Iterative deterministic methods, such as Gradient Descent (GD), find an $\epsilon$-approximate minimizer in time $O(nd\,\kappa\log(1/\epsilon))$, where $\kappa = \frac{\lambda_1(C + \lambda I)}{\lambda_d(C + \lambda I)}$ is the condition number of $C + \lambda I$ (see Theorem 2.1.15 in Nesterov (2004)). The Kaczmarz algorithm (Kaczmarz, 1937) has an identical complexity. Both the Conjugate Gradient (CG) method (Hestenes & Stiefel, 1952) and the Accelerated Gradient Descent (AGD) algorithm of Nesterov (1983) enjoy a better runtime of $O(nd\sqrt{\kappa}\log(1/\epsilon))$. In fact, CG admits a more delicate analysis (see Corollary 16.7 in Vishnoi (2012)): if all but $c \in [d]$ eigenvalues of $C + \lambda I$ are contained in a range $[a, b]$, then the runtime of CG is at most $O(nd(c + \sqrt{b/a}\,\log(1/\epsilon)))$. In particular, CG's runtime is at most $O(nd^2)$. Furthermore, following the interpretation of our main result in Section 2.2, we note that for a runtime budget of $O(ndk)$, we can set the regularization parameter of CG to be of order $\lambda_k/k^2$ (which is usually much greater than the RHS of (4)).

Linear sketching: Several recently developed methods in numerical linear algebra are based on the so-called sketch-and-solve approach, which essentially suggests that, given a matrix $A$, we first replace it with a smaller random matrix $AS$, and then perform the computation on $AS$ (Woodruff, 2014; Clarkson & Woodruff, 2013; Sarlos, 2006). For example, it is known that if the entries of $S$ are i.i.d. standard normal variables and $S$ has $p = \Omega(k/\epsilon)$ columns, then with high probability, the column space of $AS$ contains a $(1 + \epsilon)$ rank-$k$ approximation to $A$ with respect to the Frobenius norm. This immediately yields a fast PCA algorithm (see Section 4.1 in Woodruff (2014)). While the sketch-and-solve approach sounds promising for our purpose, our analysis reveals that controlling the Frobenius norm error does not suffice for our needs. We need spectral norm bounds, which are known to be more challenging (Witten & Candès, 2013).
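For intuition, the sketch-and-solve recipe described above can be written in a few lines of numpy. This is a generic illustration of ours (with an arbitrary choice of $p = 2k$ sketch columns and a synthetic nearly low-rank matrix), not the method adopted in this paper, precisely because its guarantee is of Frobenius type:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 300, 2000, 10
A = rng.standard_normal((d, k)) @ rng.standard_normal((k, n)) \
    + 0.01 * rng.standard_normal((d, n))               # matrix that is close to rank k

S = rng.standard_normal((n, 2 * k))                    # Gaussian sketching matrix
Q, _ = np.linalg.qr(A @ S)                             # basis of the sketched column space
Ub, sb, Vbt = np.linalg.svd(Q.T @ A, full_matrices=False)
A_k_sketch = Q @ (Ub[:, :k] * sb[:k]) @ Vbt[:k]        # rank-k approximation inside range(A S)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k_best = (U[:, :k] * s[:k]) @ Vt[:k]                 # exact best rank-k approximation
err_ratio = (np.linalg.norm(A - A_k_sketch, "fro")
             / np.linalg.norm(A - A_k_best, "fro"))
print(err_ratio)                                       # close to 1: the (1 + eps) factor
```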
Furthermore, as mentioned above, the success of our conditioning method heavily depends on the stronger per-vector error guarantees (6) obtained by Musco & Musco (2015), which are not provided by simpler linear sketching methods.

Sketched preconditioning: Recently, subspace embedding methods were used to develop cheap preconditioners for linear regression with respect to the squared loss (Woodruff, 2014). Precisely, Clarkson & Woodruff (2013) considered the case $\lambda = 0$ (i.e., standard least squares) and developed a preconditioning method that reduces the average condition number to a constant. Thereafter, they suggest applying a basic solver such as CG. The overall running time is dominated by the preconditioning process, which runs in time $O(d^3 + nd)$. Hence, a significant improvement over standard solvers is obtained when $n \gg d$. The main shortcoming of this method is that it does not scale well to large dimensions. Indeed, when $d$ is very large, the overhead resulting from the preconditioning process cannot be afforded.

Efficient preconditioning based on random sampling: While we focus on reducing the dependence on the dimensionality of the data, other work has investigated the gain from using only a random subset of the data points to form the preconditioner (Yang et al., 2014). The theoretical gain of this approach has been established under coherence assumptions (Yang et al., 2014).

4. Preliminaries

4.1. Additional notation and definitions

Any matrix $B \in \mathbb{R}^{d \times n}$ of rank $r$ can be written in (thin) SVD form as $B = U\Sigma V^\top = \sum_{i=1}^r \sigma_i(B)\,u_i v_i^\top$, where the singular values are ordered in descending order. The spectral norm of $B$ is defined by $\|B\| = \sigma_1(B)$. The spectral norm is submultiplicative, i.e., $\|AB\| \le \|A\|\,\|B\|$ for all $A$ and $B$. Furthermore, the spectral norm is unitarily invariant, i.e., for all $A$ and $U$ such that the columns of $U$ are orthonormal, $\|UA\| = \|A\|$. For any $k \in [r]$, it is well known that the truncated SVD of $B$, $B_k := U_k\Sigma_k V_k^\top = \sum_{i=1}^k \sigma_i(B)\,u_i v_i^\top$, is the best rank-$k$ approximation of $B$ w.r.t. the spectral norm (Trefethen & Bau III, 1997).

A twice continuously differentiable function $f: \mathbb{R}^d \to \mathbb{R}$ is said to be $\beta$-smooth if $\|\nabla^2 f(w)\| \le \beta$ for all $w$, where $\nabla^2 f(w)$ is the Hessian of $f$ at $w$. $f$ is said to be $\alpha$-strongly convex if $\lambda_d(\nabla^2 f(w)) \ge \alpha$ for all $w$. If $g$ is convex and $f$ is $\alpha$-strongly convex, then $f + g$ is $\alpha$-strongly convex.

4.2. Stochastic Variance Reduced Gradient (SVRG)

We consider a variant of the Stochastic Variance Reduced Gradient (SVRG) algorithm of Johnson & Zhang (2013) due to Xiao & Zhang (2014). The algorithm is an epoch-based iterative method for minimizing an average, $F(w) = \frac{1}{N}\sum_{i=1}^N f_i(w)$, of smooth functions. It is assumed that each $f_i: \mathbb{R}^d \to \mathbb{R}$ is convex and $\beta_i$-smooth, and that the entire function $F$ is $\alpha$-strongly convex. The algorithm is detailed in Algorithm 1.

Algorithm 1 SVRG (Xiao & Zhang, 2014)
1: Input: functions $f_1, \ldots, f_N$ and smoothness parameters $\beta_1, \ldots, \beta_N$
2: Parameters: $w_0 \in \mathbb{R}^d$, $m$, $\eta$, $S \in \mathbb{N}$
3: for $s = 1, 2, \ldots, S$ do
4:   $\tilde w = w_{s-1}$
5:   $\tilde v = \nabla F(\tilde w)$
6:   $w_0 = \tilde w$
7:   for $t = 1, \ldots, m$ do {new epoch}
8:     pick $i_t \in [N]$ with probability $q_{i_t} = \beta_{i_t} / \sum_j \beta_j$
9:     $v_t = (\nabla f_{i_t}(w_{t-1}) - \nabla f_{i_t}(\tilde w)) / (N q_{i_t}) + \tilde v$
10:    $w_t = w_{t-1} - \eta\, v_t$
11:  end for
12:  $w_s = \frac{1}{m}\sum_{t=1}^m w_t$
13: end for
14: Output: the vector $w_S$

Its convergence rate depends on the average smoothness of the individual functions and on the average condition number of $F$, defined as

$$\hat\beta = \frac{1}{N}\sum_{i=1}^N \beta_i, \qquad \hat\kappa_F = \frac{\hat\beta}{\alpha}.$$
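The following numpy sketch is our own paraphrase of Algorithm 1 for squared-loss components: it uses the smoothness-proportional sampling and the unbiased correction of line 9, but returns the last iterate of each epoch rather than the average and fixes $m = N$, which are practical simplifications rather than the authors' choices:

```python
import numpy as np

def svrg(A, y, lam, epochs=30, eta=None, seed=0):
    """SVRG with smoothness-proportional sampling for
    F(w) = (1/N) sum_i [ 0.5*(a_i^T w - y_i)^2 + 0.5*lam*||w||^2 ]."""
    N, d = A.shape
    beta = (A ** 2).sum(axis=1) + lam              # smoothness of each component f_i
    q = beta / beta.sum()                          # sampling probabilities q_i proportional to beta_i
    if eta is None:
        eta = 0.1 / beta.mean()                    # eta = 0.1 / beta_hat, as in Theorem 2 below
    m = N                                          # epoch length (practical simplification)
    rng = np.random.default_rng(seed)
    w_snap = np.zeros(d)
    for _ in range(epochs):
        full_grad = A.T @ (A @ w_snap - y) / N + lam * w_snap      # full gradient at the snapshot
        w = w_snap.copy()
        for i in rng.choice(N, size=m, p=q):
            gw = (A[i] @ w - y[i]) * A[i] + lam * w                # gradient of f_i at w
            gs = (A[i] @ w_snap - y[i]) * A[i] + lam * w_snap      # gradient of f_i at the snapshot
            w -= eta * ((gw - gs) / (N * q[i]) + full_grad)        # unbiased variance-reduced step
        w_snap = w                                 # last iterate (the pseudocode averages the epoch)
    return w_snap

# Tiny usage example: the gradient norm at the output should be small.
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 20)); y = rng.standard_normal(500)
w = svrg(A, y, lam=0.1)
print(np.linalg.norm(A.T @ (A @ w - y) / 500 + 0.1 * w))
```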
Theorem 2 (Xiao & Zhang, 2014). Fix $\epsilon > 0$. Running SVRG (Algorithm 1) with any $w_0$, $S \ge \log((F(w_0) - \min_{w \in \mathbb{R}^d} F(w))/\epsilon)$, $m = \hat\kappa_F$, and $\eta = 0.1/\hat\beta$ yields an $\epsilon$-approximate minimizer of $F$. Furthermore, assuming that each single gradient $\nabla f_i(w)$ can be computed in time $O(d)$, the overall runtime is $O((\hat\kappa_F + N)\,d\log(\epsilon_0/\epsilon))$, where $\epsilon_0 = F(w_0) - \min_{w \in \mathbb{R}^d} F(w)$.

In the original definition of SVRG (Johnson & Zhang, 2013), the indices $i_t$ are chosen uniformly at random from $[N]$, rather than proportionally to $\beta_i$. As a result, the convergence rate depends on the maximal smoothness, $\max_i \beta_i$, rather than on the average, $\hat\beta$. It will be apparent from our analysis (see Theorem 4) that in our case $\max_i \beta_i$ is proportional to the maximal squared norm of a preconditioned $x_i$. Since we rely on the improved variant of Xiao & Zhang (2014), our bound depends on the average of the $\beta_i$'s, which scales with the average squared norm of the preconditioned $x_i$'s. To simplify the presentation, in the sequel we refer to Algorithm 1 as SVRG.

4.3. Randomized Block Lanczos

A randomized variant of the Block Lanczos method due to Musco & Musco (2015) is detailed in Algorithm 2.¹ Note that the matrix $\tilde U_k \tilde\Sigma_k \tilde V_k^\top$ forms an SVD of the matrix $\tilde A_k := Q(Q^\top A)_k = \tilde U_k \tilde U_k^\top A$.

¹ More precisely, Algorithm 2 in Musco & Musco (2015) returns the projection matrix $\tilde U_k \tilde U_k^\top$, while we also compute the SVD of $\tilde U_k \tilde U_k^\top A$. The additional runtime is negligible.

Algorithm 2 Block Lanczos method (Musco & Musco, 2015)
1: Input: $A \in \mathbb{R}^{d \times n}$, $k \le d$, $\epsilon \in (0, 1)$
2: $q = \Theta\!\left(\frac{\log n}{\sqrt{\epsilon}}\right)$, $p = qk$, $\Pi \sim N(0, 1)^{n \times k}$
3: Compute $K = [A\Pi, (AA^\top)A\Pi, \ldots, (AA^\top)^{q-1}A\Pi]$
4: Orthonormalize the columns of $K$ to obtain $Q \in \mathbb{R}^{d \times qk}$
5: Compute the truncated SVD $(Q^\top A)_k = \tilde W_k \tilde\Sigma_k \tilde V_k^\top$
6: Compute $\tilde U_k = Q\tilde W_k$
7: Output: the matrices $\tilde U_k$, $\tilde\Sigma_k$, $\tilde V_k$

Theorem 3 (Musco & Musco, 2015). Consider the run of Algorithm 2 and denote $\tilde A_k = \tilde U_k \tilde\Sigma_k \tilde V_k^\top = \sum_{i=1}^k \tilde\sigma_i \tilde u_i \tilde v_i^\top$. Denote the SVD of $A$ by $A = \sum_{i=1}^d \sigma_i u_i v_i^\top$. The following bounds hold with probability at least 9/10:

$$\|A - \tilde A_k\| \le (1 + \epsilon)\|A - A_k\| = (1 + \epsilon)\,\sigma_{k+1},$$
$$\forall i \in [k], \quad |\tilde u_i^\top A A^\top \tilde u_i - u_i^\top A A^\top u_i| = |\tilde\sigma_i^2 - \sigma_i^2| \le \epsilon\,\sigma_{k+1}^2. \qquad (6)$$

The runtime of the algorithm is $O\!\left(\frac{ndk\log n}{\sqrt{\epsilon}} + \frac{k^2(n + d)}{\epsilon}\right)$.
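A compact numpy rendering of Algorithm 2 (our own sketch): it builds the block Krylov basis, re-orthonormalizes after every multiplication for numerical stability (which the pseudocode leaves implicit), and uses a fixed number of iterations $q$ instead of $\Theta(\log n/\sqrt{\epsilon})$:

```python
import numpy as np

def block_lanczos_svd(A, k, q=8, seed=0):
    """Approximate top-k SVD of A (d x n) via a randomized block Krylov subspace,
    in the spirit of Algorithm 2 (Musco & Musco, 2015)."""
    d, n = A.shape
    rng = np.random.default_rng(seed)
    Z = A @ rng.standard_normal((n, k))         # first block A @ Pi
    blocks = []
    for _ in range(q):
        Z, _ = np.linalg.qr(Z)                  # re-orthonormalize each block for stability
        blocks.append(Z)
        Z = A @ (A.T @ Z)                       # next block: (A A^T) times the previous one
    Q, _ = np.linalg.qr(np.hstack(blocks))      # orthonormal basis of the Krylov subspace
    W, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return Q @ W[:, :k], s[:k], Vt[:k]          # approximate U_k, sigma_1..k, V_k^T

# Sanity check on a small low-rank matrix.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 40)) @ rng.standard_normal((40, 1000))
U_k, s_k, _ = block_lanczos_svd(A, k=10)
print(np.allclose(s_k, np.linalg.svd(A, compute_uv=False)[:10], rtol=1e-3))  # True
```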
5. Sketched Conditioned SVRG

In this section we develop our sketched conditioning method. By analyzing the properties of this preconditioner and combining it with SVRG, we will conclude Theorem 1. Recall that we aim at devising cheaper preconditioners that lead to a significant reduction of the condition number. Specifically, given a parameter $k \in [d]$, we will consider only preconditioners $P^{-1/2}$ for which both the computation of $P^{-1/2}$ itself and the computation of the set $\{P^{-1/2}x_1, \ldots, P^{-1/2}x_n\}$ can be carried out in time $O(ndk)$. We will soon elaborate on the considerations behind the choice of the preconditioner, but first we address some important implementation issues.

5.1. Preconditioned regularization

In order to implement the preconditioning scheme suggested above, we should be able to find a simple form for the function $\tilde L$. In particular, since we would like to use SVRG, we should write $\tilde L$ as an average of components whose gradients can be easily computed. Denote $\tilde x_i = P^{-1/2}x_i$ for all $i \in [n]$. Since for every $i \in [n]$, $((P^{-1/2}\tilde w)^\top x_i - y_i)^2 = (\tilde w^\top \tilde x_i - y_i)^2$, it seems natural to write $\tilde L(\tilde w) = L(P^{-1/2}\tilde w)$ as follows:

$$\tilde L(\tilde w) = \frac{1}{n}\sum_{i=1}^n \underbrace{\frac{1}{2}(\tilde w^\top \tilde x_i - y_i)^2}_{=: \ \tilde\ell_i(\tilde w)} + \frac{\lambda}{2}\|P^{-1/2}\tilde w\|^2.$$

Assume momentarily that $\lambda = 0$. The gradient of $\tilde\ell_i$ at any point $\tilde w$ is given by $\nabla\tilde\ell_i(\tilde w) = (\tilde w^\top \tilde x_i - y_i)\,\tilde x_i$. Hence, by computing all the $\tilde x_i$'s in advance, we are able to apply SVRG directly to the preconditioned function and compute the stochastic gradients in time $O(d)$.

When $\lambda > 0$, the computation of the gradient at some point $\tilde w$ also involves the computation of $P^{-1}\tilde w$. We would like to avoid this overhead. To this end, we decompose the regularization function as follows. Denote the standard basis of $\mathbb{R}^d$ by $e_1, \ldots, e_d$. Note that the function $L$ can be rewritten as

$$L(w) = \frac{1}{n+d}\sum_{i=1}^{n+d} \ell_i(w),$$

where $\ell_i(w) = \frac{n+d}{n}\cdot\frac{1}{2}(w^\top x_i - y_i)^2$ for $i = 1, \ldots, n$ and $\ell_{n+i}(w) = \lambda(n+d)\cdot\frac{1}{2}(w^\top e_i)^2$ for $i = 1, \ldots, d$. Finally, denoting $b_i = P^{-1/2}e_i$ for all $i$, we can rewrite the preconditioned function $\tilde L$ as

$$\tilde L(\tilde w) = \frac{1}{n+d}\sum_{i=1}^{n+d} \tilde\ell_i(\tilde w),$$

where $\tilde\ell_i(\tilde w) = \frac{n+d}{n}\cdot\frac{1}{2}(\tilde w^\top \tilde x_i - y_i)^2$ for $i = 1, \ldots, n$ and $\tilde\ell_{n+i}(\tilde w) = \lambda(n+d)\cdot\frac{1}{2}(\tilde w^\top b_i)^2$ for $i = 1, \ldots, d$. By computing the $\tilde x_i$'s and the $b_i$'s in advance, we are able to apply SVRG while computing stochastic gradients in time $O(d)$.

5.2. The effect of conditioning

We are now in position to address the following fundamental question: how does the choice of the preconditioner $P^{-1/2}$ affect the resulting average condition number of the function $\tilde L$ (as defined in Section 4.2)? The following theorem upper bounds $\hat\kappa_{\tilde L}$ by the average condition number of the matrix $P^{-1/2}(C + \lambda I)P^{-1/2}$, which we denote by $\tilde\kappa$ (when the identity of the matrix $P$ is understood).

Theorem 4. Let $P^{-1/2}$ be a preconditioner. Then the average condition number of $\tilde L$ is upper bounded by

$$\hat\kappa_{\tilde L} \le \tilde\kappa = \frac{\mathrm{tr}(P^{-1/2}(C + \lambda I)P^{-1/2})}{\lambda_d(P^{-1/2}(C + \lambda I)P^{-1/2})}.$$

The proof is in the appendix. Note that an optimal bound of $O(d)$ is attained by the whitening matrix $P^{-1/2} = (C + \lambda I)^{-1/2}$.

5.3. Exact sketched conditioning

Our sketched preconditioner is based on a randomized approximation of the best rank-$k$ approximation of the data matrix. It will be instructive to first consider a preconditioner that is based on an exact rank-$k$ approximation. Let $X \in \mathbb{R}^{d \times n}$ be the matrix whose $i$-th column is $x_i$ and let $\bar X = n^{-1/2}X$. Denote by $\bar X = \sum_{i=1}^{\mathrm{rank}(\bar X)} \sigma_i u_i v_i^\top = U\Sigma V^\top$ the SVD of $\bar X$ and recall that $\bar X_k = \sum_{i=1}^k \sigma_i u_i v_i^\top$ is the best rank-$k$ approximation of $\bar X$. Note that $\bar X\bar X^\top = C$ and therefore $\sigma_i^2 = \lambda_i(C) = \lambda_i$. Furthermore, the leading left singular vectors of $\bar X$, namely $u_1, \ldots, u_k$, coincide with the $k$ leading eigenvectors of the matrix $C$. Consider the preconditioner

$$P^{-1/2} = \sum_{i=1}^k \frac{u_i u_i^\top}{\sqrt{\lambda_i + \lambda}} + \frac{I - \sum_{i=1}^k u_i u_i^\top}{\sqrt{\lambda_k + \lambda}},$$

where $I - \sum_{i=1}^k u_i u_i^\top = \sum_{i=k+1}^d u_i u_i^\top$ and $u_{k+1}, \ldots, u_d$ are obtained from a completion of $u_1, \ldots, u_k$ to an orthonormal basis.

Lemma 1. Let $k \in [d]$ be a parameter and assume that the regularization parameter $\lambda$ is larger than $\lambda_d$. Using the exact sketched preconditioner, we obtain

$$\hat\kappa_{\tilde L} \le \frac{k\lambda_k + \sum_{i>k}\lambda_i}{\lambda} + d. \qquad (7)$$

Proof. A simple calculation shows that for $i = 1, \ldots, k$,

$$\lambda_i(P^{-1/2}(C + \lambda I)P^{-1/2}) = \frac{\lambda_i + \lambda}{\lambda_i + \lambda} = 1.$$

Similarly, for $i = k+1, \ldots, d$,

$$\lambda_i(P^{-1/2}(C + \lambda I)P^{-1/2}) = \frac{\lambda_i + \lambda}{\lambda_k + \lambda}, \qquad \text{and in particular} \qquad \lambda_d(P^{-1/2}(C + \lambda I)P^{-1/2}) \ge \frac{\lambda}{\lambda_k + \lambda}.$$

Combining the above with Theorem 4, we obtain

$$\hat\kappa_{\tilde L} \le \frac{\mathrm{tr}(P^{-1/2}(C + \lambda I)P^{-1/2})}{\lambda_d(P^{-1/2}(C + \lambda I)P^{-1/2})} \le \frac{k\lambda_k + \sum_{i>k}\lambda_i}{\lambda} + d.$$

5.4. Sketched conditioning

An exact computation of the SVD of the matrix $\bar X$ takes $O(nd^2)$ time. Instead, we use the Block Lanczos method to approximate the truncated SVD of $\bar X$. Specifically, given a parameter $k \in [d]$, we invoke the Block Lanczos method with the parameters $\bar X$, $k$ and $\epsilon = 1/2$. Recall that the output has the form $\tilde X_k = \tilde U_k \tilde\Sigma_k \tilde V_k^\top = \sum_{i=1}^k \tilde\sigma_i \tilde u_i \tilde v_i^\top$. Analogously to the exact sketched preconditioner, we define our sketched preconditioner by

$$P^{-1/2} = \sum_{i=1}^k \frac{\tilde u_i \tilde u_i^\top}{\sqrt{\tilde\sigma_i^2 + \lambda}} + \frac{I - \sum_{i=1}^k \tilde u_i \tilde u_i^\top}{\sqrt{\tilde\sigma_k^2 + \lambda}}. \qquad (8)$$
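Note that the preconditioner (8) never has to be formed as a dense $d \times d$ matrix: applying $P^{-1/2}$ to a vector only requires $\tilde U_k$ and the $\tilde\sigma_i$'s. The following sketch (ours, assuming $\tilde U_k$ and $\tilde\sigma$ come from a routine such as Algorithm 2) applies it to a batch of vectors:

```python
import numpy as np

def apply_precond(U_k, sig, lam, V):
    """Apply the sketched preconditioner P^{-1/2} of (8) to the columns of V.
    U_k (d x k): approximate top-k left singular vectors of X_bar;
    sig (k,): the corresponding sigma-tilde values, in descending order."""
    scale_top = 1.0 / np.sqrt(sig ** 2 + lam)        # 1/sqrt(sigma_i^2 + lam) for i <= k
    scale_rest = 1.0 / np.sqrt(sig[-1] ** 2 + lam)   # 1/sqrt(sigma_k^2 + lam) on the complement
    coords = U_k.T @ V                               # components along u_1, ..., u_k
    return scale_rest * (V - U_k @ coords) + U_k @ (scale_top[:, None] * coords)

# Preconditioning the data and the regularizer basis (cf. Section 5.1), assuming X is d x n:
# X_tilde = apply_precond(U_k, sig, lam, X)              # columns are x_tilde_i = P^{-1/2} x_i
# B = apply_precond(U_k, sig, lam, np.eye(X.shape[0]))   # columns are b_i = P^{-1/2} e_i
```

For very large $d$ one would apply the map to $e_i$ lazily rather than materialize all $b_i$; the commented lines only mirror the pseudocode.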
Theorem 5. Let $k \in [d]$ be a parameter and assume that the regularization parameter $\lambda$ is larger than $\lambda_d$. Using the sketched preconditioner defined in (8), the bound (7) on the average condition number holds, up to a multiplicative constant, with probability at least 9/10.

The rest of this section is devoted to the proof of Theorem 5. We follow along the lines of the proof of Lemma 1 and derive, up to a multiplicative constant, the same upper and lower bounds on the eigenvalues of $P^{-1/2}(C + \lambda I)P^{-1/2}$. From now on, we assume that the bounds in Theorem 3 (with $\epsilon = 1/2$) hold; this assumption is valid with probability at least 9/10. We next introduce some notation. We can rewrite $P^{-1/2} = \tilde U(\bar\Sigma^2 + \lambda I)^{-1/2}\tilde U^\top$, where $\bar\Sigma$ is a diagonal $d \times d$ matrix with $\bar\Sigma_{i,i} = \tilde\sigma_i$ if $i \le k$ and $\bar\Sigma_{i,i} = \tilde\sigma_k$ if $i > k$, and the columns of $\tilde U$ are a completion of $\tilde u_1, \ldots, \tilde u_k$ to an orthonormal basis. Recall that the SVD of $\bar X$ is denoted by $\bar X = \sum_{i=1}^d \sigma_i u_i v_i^\top = U\Sigma V^\top$.

Lemma 2 (Upper bound on the leading eigenvalue). We have $\lambda_1(P^{-1/2}(C + \lambda I)P^{-1/2}) \le 17$.

Proof. Since

$$\lambda_1(P^{-1/2}(C + \lambda I)P^{-1/2}) = \|P^{-1/2}(C + \lambda I)P^{-1/2}\| = \|P^{-1/2}CP^{-1/2} + \lambda P^{-1}\|,$$

the triangle inequality yields

$$\lambda_1(P^{-1/2}(C + \lambda I)P^{-1/2}) \le \|P^{-1/2}CP^{-1/2}\| + \lambda\|P^{-1}\|.$$

By the definition of $P$ we have $\|P^{-1}\| = \frac{1}{\tilde\sigma_k^2 + \lambda}$, and therefore the second summand on the right-hand side is at most $\frac{\lambda}{\tilde\sigma_k^2 + \lambda} \le 1$. As to the first summand, recall that $C = \bar X\bar X^\top$ and therefore $\|P^{-1/2}CP^{-1/2}\| = \|\bar X^\top P^{-1/2}\|^2$. We will show that $\|\bar X^\top P^{-1/2}\| \le 4$, which implies $\|P^{-1/2}CP^{-1/2}\| \le 16$. To do so, we first apply the triangle inequality:

$$\|\bar X^\top P^{-1/2}\| = \|(\tilde X_k + (\bar X - \tilde X_k))^\top P^{-1/2}\| \le \|\tilde X_k^\top P^{-1/2}\| + \|(\bar X - \tilde X_k)^\top P^{-1/2}\|.$$

Let us consider one term at a time. Recall that $\tilde X_k = \tilde U_k\tilde\Sigma_k\tilde V_k^\top$. Since $\tilde U_k^\top \tilde U \in \mathbb{R}^{k \times d}$ is a diagonal matrix with ones on the diagonal, and since the spectral norm is unitarily invariant, we obtain

$$\|\tilde X_k^\top P^{-1/2}\| = \|\tilde V_k\tilde\Sigma_k\tilde U_k^\top \tilde U(\bar\Sigma^2 + \lambda I)^{-1/2}\tilde U^\top\| = \|\tilde\Sigma_k\tilde U_k^\top \tilde U(\bar\Sigma^2 + \lambda I)^{-1/2}\| = \max_{i \in [k]} \frac{\tilde\sigma_i}{\sqrt{\tilde\sigma_i^2 + \lambda}} \le 1.$$

Next, by the submultiplicativity of the spectral norm,

$$\|(\bar X - \tilde X_k)^\top P^{-1/2}\| \le \|\bar X - \tilde X_k\|\,\|P^{-1/2}\|.$$

Theorem 3 implies that $\|\bar X - \tilde X_k\| \le \frac{3}{2}\sigma_{k+1}$, while

$$\|P^{-1/2}\| = \frac{1}{\sqrt{\tilde\sigma_k^2 + \lambda}} \le \frac{1}{\sqrt{\sigma_k^2 - \frac{1}{2}\sigma_{k+1}^2}} \le \frac{\sqrt{2}}{\sigma_{k+1}}.$$

Hence, $\|\bar X - \tilde X_k\|\,\|P^{-1/2}\| \le 3$. Combining all of the above bounds concludes the proof.

Lemma 3 (Refined upper bound on the last $d - k$ eigenvalues). For any $i \in \{k+1, \ldots, d\}$,

$$\lambda_i(P^{-1/2}(C + \lambda I)P^{-1/2}) \le \frac{2(\lambda_i + \lambda)}{\lambda_k + \lambda}.$$

Proof. Using the Courant minimax principle (Bhatia, 2013), we obtain, for all $i \in \{k+1, \ldots, d\}$,

$$\lambda_i(P^{-1/2}(C + \lambda I)P^{-1/2}) = \max_{M \subseteq \mathbb{R}^d:\ \dim(M)=i}\ \min_{x \in M:\ x \ne 0}\ \frac{x^\top P^{-1/2}(C + \lambda I)P^{-1/2}x}{\|x\|^2}$$
$$= \max_{M \subseteq \mathbb{R}^d:\ \dim(M)=i}\ \min_{x \in M:\ x \ne 0}\ \frac{x^\top P^{-1/2}(C + \lambda I)P^{-1/2}x}{\|P^{-1/2}x\|^2}\cdot\frac{\|P^{-1/2}x\|^2}{\|x\|^2}$$
$$\le \left(\max_{M \subseteq \mathbb{R}^d:\ \dim(M)=i}\ \min_{x \in M:\ x \ne 0}\ \frac{x^\top P^{-1/2}(C + \lambda I)P^{-1/2}x}{\|P^{-1/2}x\|^2}\right)\left(\max_{x \in \mathbb{R}^d:\ x \ne 0}\ \frac{\|P^{-1/2}x\|^2}{\|x\|^2}\right)$$
$$= \lambda_i(C + \lambda I)\,\lambda_1(P^{-1}) = (\lambda_i + \lambda)\,(\tilde\sigma_k^2 + \lambda)^{-1}.$$

Finally, using Theorem 3 we have $\tilde\sigma_k^2 \ge \sigma_k^2 - \frac{1}{2}\sigma_{k+1}^2 \ge \frac{1}{2}\sigma_k^2 = \frac{1}{2}\lambda_k$, and therefore

$$(\tilde\sigma_k^2 + \lambda)^{-1} \le \left(\tfrac{1}{2}\lambda_k + \lambda\right)^{-1} \le 2\,(\lambda_k + \lambda)^{-1}.$$

Lemma 4 (Lower bound on the smallest eigenvalue).

$$\lambda_d(P^{-1/2}(C + \lambda I)P^{-1/2}) \ge \frac{\lambda}{19(\lambda_k + \lambda)}.$$

Proof. Note that

$$\lambda_d(P^{-1/2}(C + \lambda I)P^{-1/2}) = \frac{1}{\|P^{1/2}(C + \lambda I)^{-1}P^{1/2}\|}, \qquad (9)$$

so it suffices to derive an upper bound on $\|P^{1/2}(C + \lambda I)^{-1}P^{1/2}\|$. Consider an arbitrary completion of $\tilde v_1, \ldots, \tilde v_k$ to an orthonormal set $\tilde v_1, \ldots, \tilde v_d \in \mathbb{R}^n$, and let $\tilde V \in \mathbb{R}^{n \times d}$ be the matrix whose $i$-th column is $\tilde v_i$. Since the spectral norm is unitarily invariant and both $\tilde U$ and $\tilde V$ have orthonormal columns,

$$\|P^{1/2}(C + \lambda I)^{-1}P^{1/2}\| = \|\tilde U(\bar\Sigma^2 + \lambda I)^{1/2}\tilde U^\top(C + \lambda I)^{-1}\tilde U(\bar\Sigma^2 + \lambda I)^{1/2}\tilde U^\top\| = \|\tilde V(\bar\Sigma^2 + \lambda I)^{1/2}\tilde U^\top(C + \lambda I)^{-1}\tilde U(\bar\Sigma^2 + \lambda I)^{1/2}\tilde V^\top\|.$$

Denote $\tilde Z = \tilde U(\bar\Sigma^2 + \lambda I)^{1/2}\tilde V^\top$. By the triangle inequality and the submultiplicativity of the spectral norm,

$$\|\tilde Z^\top(C + \lambda I)^{-1}\tilde Z\| \le \|\bar X^\top(C + \lambda I)^{-1}\bar X\| + \|(\tilde Z - \bar X)^\top(C + \lambda I)^{-1}(\tilde Z - \bar X)\| \le \|\bar X^\top(C + \lambda I)^{-1}\bar X\| + \|\tilde Z - \bar X\|^2\,\|(C + \lambda I)^{-1}\|. \qquad (10)$$
To bound the first summand of (10), we use unitary invariance to obtain

$$\|\bar X^\top(C + \lambda I)^{-1}\bar X\| = \|V\Sigma U^\top U(\Sigma^2 + \lambda I)^{-1}U^\top U\Sigma V^\top\| = \|\Sigma(\Sigma^2 + \lambda I)^{-1}\Sigma\| = \max_i \frac{\lambda_i}{\lambda_i + \lambda} \le 1.$$

For the second summand of (10), note that $\|(C + \lambda I)^{-1}\| = \frac{1}{\lambda_d + \lambda}$ and that, by the triangle inequality,

$$\|\tilde Z - \bar X\| = \|(\tilde U\bar\Sigma\tilde V^\top - \bar X) + (\tilde Z - \tilde U\bar\Sigma\tilde V^\top)\| \le \|\tilde U\bar\Sigma\tilde V^\top - \bar X\| + \|\tilde U((\bar\Sigma^2 + \lambda I)^{1/2} - \bar\Sigma)\tilde V^\top\|.$$

Using unitary invariance together with the inequality $\sqrt{\tilde\sigma_i^2 + \lambda} - \tilde\sigma_i \le \sqrt{\lambda}$ (which holds for every $i$), we get

$$\|\tilde U((\bar\Sigma^2 + \lambda I)^{1/2} - \bar\Sigma)\tilde V^\top\| = \|(\bar\Sigma^2 + \lambda I)^{1/2} - \bar\Sigma\| \le \sqrt{\lambda}.$$

Hence, using the inequality $(x + y)^2 \le 2x^2 + 2y^2$, we obtain

$$\|\tilde Z - \bar X\|^2 \le 2\|\tilde U\bar\Sigma\tilde V^\top - \bar X\|^2 + 2\lambda.$$

We next derive an upper bound on $\|\tilde U\bar\Sigma\tilde V^\top - \bar X\|$. Since $\tilde U\bar\Sigma\tilde V^\top = \tilde X_k + \tilde\sigma_k\sum_{i=k+1}^d \tilde u_i\tilde v_i^\top$,

$$\|\tilde U\bar\Sigma\tilde V^\top - \bar X\| \le \|\tilde X_k - \bar X\| + \tilde\sigma_k\Big\|\sum_{i=k+1}^d \tilde u_i\tilde v_i^\top\Big\|.$$

Using Theorem 3 we know that $\|\tilde X_k - \bar X\| \le 1.5\,\sigma_{k+1}$ and $\sigma_{k+1} \le \sqrt{\tilde\sigma_k^2 + 0.5\,\sigma_{k+1}^2} \le 1.5\,\tilde\sigma_k$. Combining this with the fact that $\|\sum_{i=k+1}^d \tilde u_i\tilde v_i^\top\| = 1$, we obtain $\|\tilde U\bar\Sigma\tilde V^\top - \bar X\| \le 3\tilde\sigma_k$. Combining the above inequalities, we obtain

$$\|P^{1/2}(C + \lambda I)^{-1}P^{1/2}\| \le 1 + \frac{2(3\tilde\sigma_k)^2 + 2\lambda}{\lambda_d + \lambda},$$

and using (9), together with the assumption $\lambda \ge \lambda_d$, we conclude the proof.

Proof (of Theorem 5). The three lemmas above imply that the inequalities derived in the proof of Lemma 1 remain intact up to a multiplicative constant. Therefore, the bound (7) on the condition number also holds up to a multiplicative constant. This completes the proof.

5.5. Sketched Preconditioned SVRG

By equipping SVRG with the sketched preconditioner (8), we obtain the Sketched Preconditioned SVRG (see Algorithm 3).

Proof (of Theorem 1). The theorem follows from Theorem 5 and Theorem 2.

Algorithm 3 Sketched Preconditioned SVRG
1: Input: $x_1, \ldots, x_n \in \mathbb{R}^d$, $y_1, \ldots, y_n \in \mathbb{R}$, $\epsilon > 0$
2: Parameters: $\lambda > 0$, $k \in [d]$
3: Let $\bar X \in \mathbb{R}^{d \times n}$ be the matrix whose $i$-th column is $(1/\sqrt{n})\,x_i$
4: Run the Block Lanczos method (Algorithm 2) with the input $\bar X$, $k$, $\epsilon = 1/2$ to obtain $\tilde X_k = \tilde U_k\tilde\Sigma_k\tilde V_k^\top$
5: Let $\tilde u_i$ be the columns of $\tilde U_k$ and $\tilde\sigma_i$ the diagonal elements of $\tilde\Sigma_k$
6: Form the preconditioner $P^{-1/2}$ according to (8)
7: Compute $\tilde x_i = P^{-1/2}x_i$ for $i \in [n]$ and $b_i = P^{-1/2}e_i$ for $i \in [d]$
8: Let $\tilde\ell_i(w) = \frac{n+d}{n}\cdot\frac{1}{2}(w^\top\tilde x_i - y_i)^2$ for $i = 1, \ldots, n$ and $\tilde\ell_{n+i}(w) = \lambda(n+d)\cdot\frac{1}{2}(w^\top b_i)^2$ for $i = 1, \ldots, d$
9: Let $\beta_i = \frac{n+d}{n}\|\tilde x_i\|^2$ for $i = 1, \ldots, n$ and $\beta_{n+i} = \lambda(n+d)\|b_i\|^2$ for $i = 1, \ldots, d$; let $\hat\beta = \frac{1}{n+d}\sum_{i=1}^{n+d}\beta_i$
10: Run SVRG (Algorithm 1) on $\tilde\ell_1, \ldots, \tilde\ell_{n+d}$ to obtain $\tilde w$
11: Return $\hat w = P^{-1/2}\tilde w$
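Putting the pieces together, here is a minimal end-to-end sketch of Algorithm 3. It is our own illustration, reusing the hypothetical `block_lanczos_svd`, `apply_precond`, and `svrg` helpers sketched in the previous sections (so it is not standalone), and it rewrites the preconditioned objective as an average of $n + d$ squared-loss components exactly as in Section 5.1:

```python
import numpy as np

def sketched_preconditioned_svrg(X, y, lam, k, epochs=30):
    """End-to-end sketch of Algorithm 3 (illustrative only).
    X is d x n with columns x_i; reuses block_lanczos_svd, apply_precond and svrg from above."""
    d, n = X.shape
    X_bar = X / np.sqrt(n)
    U_k, sig, _ = block_lanczos_svd(X_bar, k)                 # approximate top-k SVD (Algorithm 2)
    X_tilde = apply_precond(U_k, sig, lam, X)                 # x_tilde_i = P^{-1/2} x_i
    B = apply_precond(U_k, sig, lam, np.eye(d))               # b_i = P^{-1/2} e_i
    # Write L_tilde as an average of n + d squared-loss components (Section 5.1):
    rows = np.vstack([np.sqrt((n + d) / n) * X_tilde.T,
                      np.sqrt(lam * (n + d)) * B.T])
    targets = np.concatenate([np.sqrt((n + d) / n) * y, np.zeros(d)])
    w_tilde = svrg(rows, targets, lam=0.0, epochs=epochs)     # SVRG on the preconditioned average
    return apply_precond(U_k, sig, lam, w_tilde[:, None])[:, 0]   # w_hat = P^{-1/2} w_tilde

# Example call on hypothetical data X, y:
# w_hat = sketched_preconditioned_svrg(X, y, lam=1e-8, k=30)
```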
6. The Empirical Gain of Sketched Preconditioning

In this section we empirically demonstrate the gain of our method. We consider both regression problems and binary classification tasks, where the squared loss serves as a surrogate for the zero-one loss. We use the following datasets:

Synthetic: We draw two random 5000 x 20000 matrices, X(1) and X(2), whose singular vectors are drawn uniformly at random and whose q-th singular value is 1/q and 1/q^2, respectively. We then normalize the columns. For each X = X(j), we consider a regression problem where the labels are generated as follows: we first draw a vector $w \sim N(0, 1)^{5000}$ and then set $y_i = w^\top X_{:,i} + z_i$, where $z_i \sim N(0, 0.1)$.

MNIST (http://yann.lecun.com/exdb/mnist/): A subset of MNIST corresponding to the digits 4 and 7, where the task is to distinguish between the two digits. Here, n = 12107, d = 784.

RCV1 (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/): The Reuters RCV1 collection. Here, n = 20242, d = 47236, and we consider a standard binary document classification task.

CIFAR-10 (http://www.cs.toronto.edu/~kriz/cifar.html): Here, n = 50000, d = 3072. Following Frostig et al. (2015), the classification task is to distinguish the animal categories from the automotive ones.

real-sim (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/): Here, n = 72309, d = 20958, and we consider a standard binary document classification task.

[Figure 1: plot of the ratio (3) as a function of k for (a) the MNIST dataset, (b) the CIFAR-10 dataset, (c) the RCV1 dataset, and (d) the Real-sim dataset.]

6.1. Inspecting our theoretical speed-up

Recall that the ratio (3) quantifies our theoretical speed-up. Hence, we first empirically inspect the prefixes of the corresponding quantities (as a function of k) for each of the datasets (see Figure 1). We can see that while in MNIST and CIFAR-10 the ratio is large for small values of k, in RCV1 and real-sim the ratio increases very slowly (note that for the former two datasets we use a logarithmic scale).

[Figure 2: convergence of Sketched Preconditioned SVRG (SCSVRG, k = 30) vs. SVRG on (a) the synthetic problem with linear decay, (b) the synthetic problem with quadratic decay, (c) MNIST, (d) CIFAR-10, (e) RCV1, and (f) real-sim. The x-axis is the number of epochs and the y-axis is the suboptimality, $L(w_t) - \min_{w \in \mathbb{R}^d} L(w)$, in logarithmic scale.]

6.2. Empirical advantage of Sketched Preconditioned SVRG

We now evaluate Algorithm 3 and compare it to the SVRG algorithm of Xiao & Zhang (2014). To minimally affect the inherent condition number, we added only a slight amount of regularization, namely $\lambda = 10^{-8}$. The loss used is the squared loss. The step size $\eta$ is optimally tuned for each method. Similarly to previous work on SVRG (Xiao & Zhang, 2014; Johnson & Zhang, 2013), the size of each epoch, m, is proportional to the number of points, n. We minimally preprocessed the data by average normalization: each instance vector is divided by the average l2-norm of the instances. The number of epochs is up to 60. Note that in all cases we choose a small preconditioning parameter, namely k = 30, so that the preprocessing time of Algorithm 3 is negligible. There is a clear correspondence between the ratios depicted in Figure 1 and the actual speed-up (see Figure 2). In other words, the empirical results strongly affirm our theoretical results.

Acknowledgments

We thank Edo Liberty for helpful discussions. The work is supported by ICRI-CI and by the European Research Council (TheoryDL project).

References

Bhatia, Rajendra. Matrix Analysis, volume 169. Springer Science & Business Media, 2013.

Clarkson, Kenneth L. and Woodruff, David P. Low rank approximation and regression in input sparsity time. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pp. 81-90. ACM, 2013.

Frostig, Roy, Ge, Rong, Kakade, Sham M., and Sidford, Aaron. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. arXiv preprint arXiv:1506.07512, 2015.

Hestenes, Magnus Rudolph and Stiefel, Eduard. Methods of conjugate gradients for solving linear systems. NBS, 1952.

Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315-323, 2013.
Kaczmarz, Stefan. Angenäherte Auflösung von Systemen linearer Gleichungen. Bulletin International de l'Académie Polonaise des Sciences et des Lettres, 35:355-357, 1937.

Lin, Hongzhou, Mairal, Julien, and Harchaoui, Zaid. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pp. 3366-3374, 2015.

Musco, Cameron and Musco, Christopher. Randomized block Krylov methods for stronger and faster approximate singular value decomposition. In Advances in Neural Information Processing Systems, pp. 1396-1404, 2015.

Nesterov, Yurii. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pp. 372-376, 1983.

Nesterov, Yurii. Introductory Lectures on Convex Optimization, volume 87. Springer Science & Business Media, 2004.

Roux, Nicolas L., Schmidt, Mark, and Bach, Francis R. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pp. 2663-2671, 2012.

Sarlos, Tamas. Improved approximation algorithms for large matrices via random projections. In Foundations of Computer Science, 2006 (FOCS '06), 47th Annual IEEE Symposium on, pp. 143-152. IEEE, 2006.

Shalev-Shwartz, Shai. SDCA without duality, regularization, and individual convexity. arXiv preprint arXiv:1602.01582, 2016.

Shalev-Shwartz, Shai and Ben-David, Shai. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss minimization. The Journal of Machine Learning Research, 14(1):567-599, 2013.

Shalev-Shwartz, Shai and Zhang, Tong. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, pp. 1-41, 2014.

Trefethen, Lloyd N. and Bau III, David. Numerical Linear Algebra, volume 50. SIAM, 1997.

Vishnoi, Nisheeth K. Laplacian solvers and their algorithmic applications. Foundations and Trends in Theoretical Computer Science, 8(1-2):1-141, 2012.

Witten, Rafi and Candès, Emmanuel. Randomized algorithms for low-rank matrix factorizations: sharp performance bounds. Algorithmica, 72(1):264-281, 2013.

Woodruff, David P. Sketching as a tool for numerical linear algebra. arXiv preprint arXiv:1411.4357, 2014.

Xiao, Lin and Zhang, Tong. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057-2075, 2014.

Yang, Tianbao, Jin, Rong, Zhu, Shenghuo, and Lin, Qihang. On data preconditioning for regularized loss minimization. Machine Learning, pp. 1-23, 2014.