# Principal Component Projection Without Principal Component Analysis

Roy Frostig (RF@CS.STANFORD.EDU), Stanford University
Cameron Musco (CNMUSCO@MIT.EDU), MIT
Christopher Musco (CPMUSCO@MIT.EDU), MIT
Aaron Sidford (ASID@MICROSOFT.COM), Microsoft Research, New England

Abstract

We show how to efficiently project a vector onto the top principal components of a matrix, without explicitly computing these components. Specifically, we introduce an iterative algorithm that provably computes the projection using few calls to any black-box routine for ridge regression. By avoiding explicit principal component analysis (PCA), our algorithm is the first with no runtime dependence on the number of top principal components. We show that it can be used to give a fast iterative method for the popular principal component regression problem, giving the first major runtime improvement over the naive method of combining PCA with regression. To achieve our results, we first observe that ridge regression can be used to obtain a smooth projection onto the top principal components. We then sharpen this approximation to true projection using a low-degree polynomial approximation to the matrix step function. Step function approximation is a topic of long-term interest in scientific computing. We extend prior theory by constructing polynomials with simple iterative structure and rigorously analyzing their behavior under limited precision.

1. Introduction

In machine learning and statistics, it is common, and often essential, to represent data in a concise form that decreases noise and increases efficiency in downstream tasks. Perhaps the most widespread method for doing so is to project data onto the linear subspace spanned by its directions of highest variance, that is, onto the span of the top components given by principal component analysis (PCA). Computing principal components can be an expensive task, a challenge that prompts a basic algorithmic question:

Can we project a vector onto the span of a matrix's top principal components without performing principal component analysis?

This paper answers that question in the affirmative, demonstrating that projection is much easier than PCA itself. We show that it can be solved using a simple iterative algorithm based on black-box calls to a ridge regression routine. The algorithm's runtime does not depend on the number of top principal components chosen for projection, a cost inherent to any algorithm for PCA, or even to algorithms that just compute an orthogonal span for the top components.

1.1. Motivation: principal component regression

To motivate our projection problem, consider one of the most basic downstream applications for PCA: linear regression. Combined, PCA and regression comprise the principal component regression (PCR) problem:

Definition 1.1 (Principal component regression (PCR)). Let $A \in \mathbb{R}^{n \times d}$ be a design matrix whose rows are data points and let $b \in \mathbb{R}^n$ be a vector of data labels. Let $A_\lambda \in \mathbb{R}^{n \times d}$ denote the result of projecting each row of $A$ onto the span of the top principal components of $A$, i.e. the eigenvectors of the covariance matrix $\frac{1}{n}A^TA$ whose corresponding eigenvalue exceeds a threshold $\lambda$. The task of PCR is to find a minimizer of the squared loss $\|A_\lambda x - b\|_2^2$.
In other words, the goal is to compute $A_\lambda^+ b$, where $A_\lambda^+$ is the Moore-Penrose pseudoinverse of $A_\lambda$. PCR is a key regularization method in statistics, numerical linear algebra, and scientific disciplines including chemometrics (Hotelling, 1957; Hansen, 1987; Frank & Friedman, 1993). It models the assumption that small principal components represent noise rather than data signal. PCR is typically solved by first using PCA to compute $A_\lambda$ and then applying linear regression. The PCA step dominates the algorithm's cost, especially if many principal components have variance above the threshold $\lambda$.

We remedy this issue by showing that our principal component projection algorithm yields a fast algorithm for regression. Specifically, full access to $A_\lambda$ is unnecessary for PCR: $A_\lambda^+ b$ can be computed efficiently given only an approximate projection of the vector $A^Tb$ onto $A$'s top principal components. By solving projection without PCA we obtain the first PCA-free algorithm for PCR.

1.2. A first approximation: ridge regression

Our approach to efficient principal component projection is actually based on a common alternative to PCR: ridge regression. This ubiquitous method computes a minimizer of $\|Ax - b\|_2^2 + \lambda\|x\|_2^2$ for some regularization parameter $\lambda$ (Tikhonov, 1963). The advantage of ridge regression is that it is a simple convex optimization problem that can be solved efficiently using many techniques (see Lemma 2.1).

Solving ridge regression is equivalent to applying the matrix $(A^TA + \lambda I)^{-1}A^T$ to $b$, an operation that can be viewed as a smooth relaxation of PCR. Adding the $\ell_2$ norm penalty (i.e. $\lambda I$) effectively washes out $A$'s small principal components in comparison to its large ones and achieves an effect similar to PCR at the extreme ends of $A$'s spectrum.

Accordingly, ridge regression gives access to a smooth projection operator, $(A^TA + \lambda I)^{-1}A^TA$, which approximates $P_{A_\lambda}$, the projection onto the span of $A$'s top principal components. Both have the same singular vectors, but $P_{A_\lambda}$ has a singular value of 1 for each squared singular value $\sigma_i^2 \ge \lambda$ in $A$ and a singular value of 0 for each $\sigma_i^2 < \lambda$, whereas $(A^TA + \lambda I)^{-1}A^TA$ has singular values equal to $\frac{\sigma_i^2}{\sigma_i^2 + \lambda}$. This function approaches 1 when $\sigma_i^2$ is much greater than $\lambda$ and 0 when it is much smaller. Figure 1 shows the comparison.

[Figure 1: Singular values of the projection matrix $P_{A_\lambda}$ vs. those of the smooth projection operator $(A^TA + \lambda I)^{-1}A^TA$ obtained from ridge regression.]

Unfortunately, in many settings, ridge regression is a very crude approximation to PCR and projection and may perform significantly worse in certain data analysis applications (Dhillon et al., 2013). In short, while ridge regression algorithms are valuable tools, it has been unclear how to wield them for tasks like projection or PCR.

1.3. Main result: from ridge regression to projection

We show that it is possible to sharpen the weak approximation given by ridge regression. Specifically, there exists a low degree polynomial $p(\cdot)$ such that $p\big((A^TA + \lambda I)^{-1}A^TA\big)\,y$ provides a very accurate approximation to $P_{A_\lambda}y$ for any $y \in \mathbb{R}^d$. Moreover, the polynomial can be evaluated as a recurrence, which translates into a simple iterative algorithm: we can apply the sharpened approximation to a vector by repeatedly applying any ridge regression routine a small number of times.
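As a concrete illustration of the gap that this sharpening closes, the following numpy snippet compares the spectrum of the hard projection $P_{A_\lambda}$ with that of the smooth ridge operator, in the spirit of Figure 1. The random matrix and the median-based threshold are arbitrary toy choices, not the paper's experimental setup.

```python
import numpy as np

# Toy comparison (cf. Figure 1): 0/1 spectrum of the hard projection P_{A_lambda}
# versus the smooth spectrum sigma_i^2 / (sigma_i^2 + lambda) of the ridge
# operator (A^T A + lambda*I)^{-1} A^T A.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
sigma2 = np.linalg.svd(A, compute_uv=False) ** 2
lam = np.median(sigma2)                    # arbitrary toy threshold

hard_step = (sigma2 >= lam).astype(float)  # spectrum of P_{A_lambda}
smooth_step = sigma2 / (sigma2 + lam)      # spectrum of the ridge operator

# Near the threshold the smooth step sits close to 1/2 rather than 0 or 1;
# this is exactly the error that the sharpening polynomial corrects.
print(np.round(np.abs(hard_step - smooth_step).max(), 3))
```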
Theorem 1.2 (Principal component projection without PCA). Given $A \in \mathbb{R}^{n \times d}$ and $y \in \mathbb{R}^d$, Algorithm 1 uses $O(\gamma^{-2}\log(1/\epsilon))$ approximate applications of $(A^TA + \lambda I)^{-1}$ and returns $x$ with $\|x - P_{A_\lambda}y\|_2 \le \epsilon\|y\|_2$.

Like all iterative PCA algorithms, our running time scales inversely with $\gamma$, the spectral gap around $\lambda$.[1] Notably, it does not depend on the number of principal components in $A_\lambda$, a cost incurred by any method that applies the projection $P_{A_\lambda}$ directly, either by explicitly computing the top principal components of $A$, or even by just computing an orthogonal span for these components.

[1] See Section 3.2 for a discussion of this gap dependence. Aside from a full SVD requiring $O(nd^2)$ time, any PCA algorithm giving the guarantee of Theorem 1.2 will have a dependence both on $\gamma$ and on the number of principal components in $A_\lambda$. However, the $\gamma$ dependence can be better ($\gamma^{-1/2}$ for Krylov methods (Musco & Musco, 2015)), giving a runtime tradeoff.

As mentioned, the above theorem also yields an algorithm for principal component regression that computes $A_\lambda^+ b$ without finding $A_\lambda$. We achieve this result by introducing a robust reduction from PCR to projection that again relies on ridge regression as a computational primitive.

Corollary 1.3 (Principal component regression without PCA). Given $A \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^n$, Algorithm 2 uses $O(\gamma^{-2}\log(1/\epsilon))$ approximate applications of $(A^TA + \lambda I)^{-1}$ and returns $x$ with $\|x - A_\lambda^+ b\|_{A^TA} \le \epsilon\|b\|_2$.

Corollary 1.3 gives the first known algorithm for PCR that avoids the cost of principal component analysis.

1.4. Related work

A number of papers attempt to alleviate the high cost of principal component analysis when solving PCR. It has been shown that an approximation to $A_\lambda$ suffices for solving the regression problem (Chan & Hansen, 1990; Boutsidis & Magdon-Ismail, 2014). Unfortunately, even the fastest approximations are much slower than routines for ridge regression and inherently incur a linear dependence on the number of principal components above $\lambda$.

More closely related to our approach is work on the matrix sign function, an important operation in control theory, quantum chromodynamics, and scientific computing in general. Approximating the sign function often involves matrix polynomials similar to our sharpening polynomial that converts ridge regression to principal component projection. Significant effort addresses Krylov methods for applying such operators without computing them explicitly (van den Eshof et al., 2002; Frommer & Simoncini, 2008). Our work differs from these methods in an important way: since we only assume access to an approximate ridge regression algorithm, it is essential that our sharpening step is robust to noise. Our iterative polynomial construction allows for a complete and rigorous noise analysis that is not available for Krylov methods, while at the same time eliminating space and post-processing costs. Iterative approximations to the matrix sign function have been proposed, but lack rigorous noise analysis (Higham, 2008).

1.5. Paper layout

Section 2: Mathematical and algorithmic preliminaries.
Section 3: Develop a PCA-free algorithm for principal component projection based on a ridge regression subroutine.
Section 4: Show how our approximate projection algorithm can be used to solve PCR, again without PCA.
Section 5: Detail our iterative approach to sharpening the smooth ridge regression projection towards true projection via a low degree sharpening polynomial.
Section 6: Empirical evaluation of our principal component projection and regression algorithms.
2. Preliminaries

Singular value decomposition. Any matrix $A \in \mathbb{R}^{n \times d}$ of rank $r$ has a singular value decomposition (SVD) $A = U\Sigma V^T$, where $U \in \mathbb{R}^{n \times r}$ and $V \in \mathbb{R}^{d \times r}$ both have orthonormal columns and $\Sigma \in \mathbb{R}^{r \times r}$ is a diagonal matrix. The columns of $U$ and $V$ are the left and right singular vectors of $A$. Moreover, $\Sigma = \mathrm{diag}(\sigma_1(A), \ldots, \sigma_r(A))$, where $\sigma_1(A) \ge \sigma_2(A) \ge \ldots \ge \sigma_r(A) > 0$ are the singular values of $A$ in decreasing order. The columns of $V$ are the eigenvectors of the covariance matrix $A^TA$, i.e. the principal components of the data, and the eigenvalues of the covariance matrix are the squares of the singular values $\sigma_1, \ldots, \sigma_r$.

Functions of matrices. If $f : \mathbb{R} \to \mathbb{R}$ is a scalar function and $S = \mathrm{diag}(s_1, \ldots, s_n)$ is a diagonal matrix, we define $f(S) \stackrel{\mathrm{def}}{=} \mathrm{diag}(f(s_1), \ldots, f(s_n))$, the entrywise application of $f$ to the diagonal. For a non-diagonal matrix $A$ with SVD $A = U\Sigma V^T$ we define $f(A) \stackrel{\mathrm{def}}{=} U f(\Sigma) V^T$.

Matrix pseudoinverse. We define the pseudoinverse of $A$ as $A^+ = f(A)^T$ where $f(x) = 1/x$. The pseudoinverse is essential in the context of regression, as the vector $A^+ b$ minimizes the squared error $\|Ax - b\|_2^2$.

Principal component projection. Given a threshold $\lambda > 0$, let $k$ be the largest index with $\sigma_k(A)^2 \ge \lambda$ and define
$$A_\lambda \stackrel{\mathrm{def}}{=} U\,\mathrm{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0)\,V^T.$$
The matrix $A_\lambda$ contains $A$'s rows projected to the span of all principal components having squared singular value at least $\lambda$. We sometimes write $A_\lambda = A P_{A_\lambda}$, where $P_{A_\lambda} \in \mathbb{R}^{d \times d}$ is the projection onto these top components. Here $P_{A_\lambda} = f(A^TA)$ where $f(x)$ is a step function: $0$ if $x < \lambda$ and $1$ if $x \ge \lambda$.

Miscellaneous notation. For any positive semidefinite $M, N \in \mathbb{R}^{d \times d}$ we use $N \preceq M$ to denote that $M - N$ is positive semidefinite. For any $x \in \mathbb{R}^d$, $\|x\|_M \stackrel{\mathrm{def}}{=} \sqrt{x^T M x}$.

Ridge regression. Ridge regression is the problem of computing, given a regularization parameter $\lambda > 0$:
$$x_\lambda^* \stackrel{\mathrm{def}}{=} \underset{x \in \mathbb{R}^d}{\arg\min}\; \|Ax - b\|_2^2 + \lambda\|x\|_2^2. \qquad (1)$$
The solution to (1) is given by $x_\lambda^* = (A^TA + \lambda I)^{-1}A^Tb$. Applying the matrix $(A^TA + \lambda I)^{-1}$ to $A^Tb$ (or to any vector $y$) is equivalent to solving the convex minimization problem
$$\underset{x \in \mathbb{R}^d}{\arg\min}\; \tfrac{1}{2}x^TA^TAx - y^Tx + \tfrac{\lambda}{2}\|x\|_2^2.$$
A vast literature studies solving problems of this form via (accelerated) gradient descent, stochastic variants, and random sketching (Nesterov, 1983; Nelson & Nguyễn, 2013; Shalev-Shwartz & Zhang, 2014; Lin et al., 2014; Frostig et al., 2015; Cohen et al., 2015). We summarize a few, now standard, runtimes achievable by these iterative methods:

Lemma 2.1 (Ridge regression runtimes). Given $y \in \mathbb{R}^d$, let $x^* = (A^TA + \lambda I)^{-1}y$. There is an algorithm, RIDGE$(A, \lambda, y, \epsilon)$, that, for any $\epsilon > 0$, returns $\tilde{x}$ such that $\|\tilde{x} - x^*\|_{A^TA + \lambda I} \le \epsilon\|y\|_{(A^TA + \lambda I)^{-1}}$. It runs in time
$$T_{\mathrm{RIDGE}}(A, \lambda, \epsilon) = O\big(\mathrm{nnz}(A)\sqrt{\kappa_\lambda}\log(1/\epsilon)\big),$$
where $\kappa_\lambda = \sigma_1(A)^2/\lambda$ is the condition number of the regularized system and $\mathrm{nnz}(A)$ is the number of nonzero entries in $A$. There is also a stochastic algorithm that, for any $\delta > 0$, gives the same guarantee with probability $1 - \delta$ in time
$$T_{\mathrm{RIDGE}}(A, \lambda, \epsilon, \delta) = O\big((\mathrm{nnz}(A) + d\,\mathrm{sr}(A)\kappa_\lambda)\log(1/(\delta\epsilon))\big),$$
where $\mathrm{sr}(A) = \|A\|_F^2/\|A\|_2^2$ is $A$'s stable rank. When $\mathrm{nnz}(A) \ge d\,\mathrm{sr}(A)\kappa_\lambda$ the runtime can be improved to $T_{\mathrm{RIDGE}}(A, \lambda, \epsilon, \delta) = \tilde{O}\big(\sqrt{\mathrm{nnz}(A)\,d\,\mathrm{sr}(A)\kappa_\lambda}\log(1/(\delta\epsilon))\big)$, where the $\tilde{O}$ hides an additional logarithmic factor.

Typically, the regularized condition number $\kappa_\lambda$ will be significantly smaller than the full condition number of $A^TA$.
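The sketches in the remainder of this document treat the RIDGE primitive of Lemma 2.1 as a black box. For concreteness, here is a minimal stand-in, assuming a dense direct solve rather than the fast iterative or stochastic solvers the lemma describes; the name `ridge_apply` is ours, and the error and failure parameters are dropped because this solve is exact.

```python
import numpy as np

def ridge_apply(A, lam, y):
    """Return (A^T A + lam*I)^{-1} y exactly.

    Stand-in for RIDGE(A, lam, y, eps) from Lemma 2.1. A real implementation
    would use an iterative solver with accuracy eps; a dense solve keeps the
    later sketches short and deterministic.
    """
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), y)

# The ridge regression solution of (1) is then ridge_apply(A, lam, A.T @ b).
```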
3. From ridge regression to principal component projection

We now describe how to approximately apply $P_{A_\lambda}$ using any black-box ridge regression routine. The key idea is to first compute a soft step function of $A^TA$ via ridge regression, and then to sharpen this step to approximate $P_{A_\lambda}$.

Let $Bx = (A^TA + \lambda I)^{-1}(A^TA)x$ be the result of applying ridge regression to $(A^TA)x$. In the language of functions of matrices, we have $B = r(A^TA)$, where
$$r(x) \stackrel{\mathrm{def}}{=} \frac{x}{x + \lambda}.$$
The function $r(x)$ is a smooth step about $\lambda$ (see Figure 1). It primarily serves to map the eigenvalues of $A^TA$ to the range $[0, 1]$, mapping those exceeding the threshold $\lambda$ to a value above $1/2$ and the rest to a value below $1/2$. To approximate the projection $P_{A_\lambda}$, it would now suffice to apply a simple symmetric step function:
$$s(x) = \begin{cases} 0 & \text{if } x < 1/2, \\ 1 & \text{if } x \ge 1/2. \end{cases}$$
It is easy to see that $s(B) = s(r(A^TA)) = P_{A_\lambda}$. For $x \ge \lambda$, $r(x) \ge 1/2$ and so $s(r(x)) = 1$. Similarly, for $x < \lambda$, $r(x) < 1/2$ and hence $s(r(x)) = 0$. That is, the symmetric step function exactly converts our smooth ridge regression step to the true projection operator.

3.1. Polynomial approximation to the step function

While computing $s(B)$ directly is expensive, requiring the SVD of $B$, we show how to approximate this function with a low-degree polynomial. We also show how to apply this polynomial efficiently and stably using a simple iterative algorithm. Our main result, proven in Section 5, is:

Lemma 3.1 (Step function algorithm). Let $S \in \mathbb{R}^{d \times d}$ be symmetric with every eigenvalue $\sigma$ satisfying $\sigma \in [0, 1]$ and $|\sigma - 1/2| \ge \gamma$. Let $\mathcal{A}$ denote a procedure that on $x \in \mathbb{R}^d$ produces $\mathcal{A}(x)$ with $\|\mathcal{A}(x) - Sx\|_2 = O(\epsilon^2\gamma^2)\|x\|_2$. Given $y \in \mathbb{R}^d$ set $s_0 := \mathcal{A}(y)$, $w_0 := s_0 - \frac{1}{2}y$, and for $k \ge 0$ set
$$w_{k+1} := \frac{4(2k+1)}{2k+2}\,\mathcal{A}\big(w_k - \mathcal{A}(w_k)\big) \quad \text{and} \quad s_{k+1} := s_k + w_{k+1}.$$
If all arithmetic operations are performed with $\Omega(\log(d/\epsilon\gamma))$ bits of precision then $\|s_q - s(S)y\|_2 = O(\epsilon)\|y\|_2$ for $q = \Theta(\gamma^{-2}\log(1/\epsilon))$.

Note that the output $s_q$ is an approximation to a $2q$ degree polynomial of $S$ applied to $y$. In Algorithm 1, we give pseudocode for combining the procedure with ridge regression to solve principal component projection. Set $S = B$ and let $\mathcal{A}$ be an algorithm that approximately applies $B$ to any $x$ by applying approximate ridge regression to $A^TAx$. As long as $B$ has no eigenvalues falling within $\gamma$ of $1/2$, the lemma ensures $\|s_q - P_{A_\lambda}y\|_2 = O(\epsilon)\|y\|_2$. This requires $\gamma$ on the order of the spectral gap $1 - \sigma_{k+1}^2(A)/\sigma_k^2(A)$, where $k$ is the largest index with $\sigma_k^2(A) \ge \lambda$.

Algorithm 1 (PC-PROJ) Principal component projection
input: $A \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^d$, error $\epsilon$, failure rate $\delta$, threshold $\lambda$, gap $\gamma \in (0, 1)$
  $q := c_1\gamma^{-2}\log(1/\epsilon)$,  $\epsilon_0 := c_2^{-1}\epsilon^2\gamma^2/\sqrt{\kappa_\lambda}$,  $\delta_0 := \delta/(2q)$
  $s := \mathrm{RIDGE}(A, \lambda, A^TAy, \epsilon_0, \delta_0)$
  $w := s - \frac{1}{2}y$
  for $k = 0, \ldots, q - 1$ do
    $t := w - \mathrm{RIDGE}(A, \lambda, A^TAw, \epsilon_0, \delta_0)$
    $w := \frac{4(2k+1)}{2k+2}\cdot\mathrm{RIDGE}(A, \lambda, A^TAt, \epsilon_0, \delta_0)$
    $s := s + w$
  end for
  return $s$
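The following is a minimal dense-matrix sketch of the PC-PROJ iteration, assuming exact ridge solves in place of the approximate RIDGE calls (so the $\epsilon_0$, $\delta_0$ arguments are dropped); the recurrence coefficient follows Lemma 3.1, and the diagonal test matrix is a toy choice whose spectrum is well separated around $\lambda = 0.5$.

```python
import numpy as np

def pc_proj(A, lam, y, q):
    """Sketch of PC-PROJ (Algorithm 1) with exact ridge solves.

    Approximates P_{A_lambda} y with q iterations; per Theorem 3.2, q should
    be on the order of gamma^{-2} log(1/eps).
    """
    d = A.shape[1]
    M = A.T @ A + lam * np.eye(d)
    ridge = lambda v: np.linalg.solve(M, v)    # stand-in for RIDGE(A, lam, v, ...)
    apply_B = lambda v: ridge(A.T @ (A @ v))   # B v = (A^T A + lam*I)^{-1} A^T A v

    s = apply_B(y)
    w = s - 0.5 * y
    for k in range(q):
        t = w - apply_B(w)
        w = (4.0 * (2 * k + 1) / (2 * k + 2)) * apply_B(t)
        s = s + w
    return s

# Toy check on a diagonal matrix, so P_{A_lambda} simply zeroes the coordinates
# whose squared singular value falls below lam:
sig = np.array([1.2, 1.1, 1.0, 0.9, 0.4, 0.3, 0.2, 0.1])
A, lam = np.diag(sig), 0.5
y = np.random.default_rng(1).standard_normal(8)
exact = np.where(sig ** 2 >= lam, y, 0.0)                 # P_{A_lambda} y
print(np.linalg.norm(pc_proj(A, lam, y, q=300) - exact))  # small; decays geometrically in q
```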
Theorem 3.2. If $\frac{1}{1 - 4\gamma}\sigma_{k+1}(A)^2 \le \lambda \le (1 - 4\gamma)\sigma_k(A)^2$ and $c_1, c_2$ are sufficiently large constants, PC-PROJ (Algorithm 1) returns $s$ such that, with probability $1 - \delta$, $\|s - P_{A_\lambda}y\|_2 \le \epsilon\|y\|_2$. The algorithm requires $O(\gamma^{-2}\log(1/\epsilon))$ ridge regression calls, each costing $T_{\mathrm{RIDGE}}(A, \lambda, \epsilon_0, \delta_0)$. Lemma 2.1 yields total cost (with no failure probability)
$$O\big(\mathrm{nnz}(A)\sqrt{\kappa_\lambda}\,\gamma^{-2}\log(1/\epsilon)\log(\kappa_\lambda/(\epsilon\gamma))\big)$$
or, via stochastic methods,
$$O\big((\mathrm{nnz}(A) + d\,\mathrm{sr}(A)\kappa_\lambda)\,\gamma^{-2}\log(1/\epsilon)\log(\kappa_\lambda/(\epsilon\gamma\delta))\big),$$
with acceleration possible when $\mathrm{nnz}(A) > d\,\mathrm{sr}(A)\kappa_\lambda$.

Proof. We instantiate Lemma 3.1. Let $S = B = (A^TA + \lambda I)^{-1}A^TA$. As discussed, $B = r(A^TA)$ and hence all its eigenvalues fall in $[0, 1]$. Specifically, $\sigma_i(B) = \frac{\sigma_i(A)^2}{\sigma_i(A)^2 + \lambda}$. Now,
$$\sigma_k(B) \ge \frac{\lambda/(1 - 4\gamma)}{\lambda/(1 - 4\gamma) + \lambda} = \frac{1}{2 - 4\gamma} \ge \frac{1}{2} + \gamma,$$
and similarly
$$\sigma_{k+1}(B) \le \frac{\lambda(1 - 4\gamma)}{\lambda(1 - 4\gamma) + \lambda} = \frac{1 - 4\gamma}{2 - 4\gamma} \le \frac{1}{2} - \gamma,$$
so all eigenvalues of $B$ are at least $\gamma$ far from $1/2$. By Lemma 2.1, for any $x$, with probability $1 - \delta_0$:
$$\|\mathrm{RIDGE}(A, \lambda, A^TAx, \epsilon_0, \delta_0) - Bx\|_{A^TA + \lambda I} \le \epsilon_0\|A^TAx\|_{(A^TA + \lambda I)^{-1}} \le \sigma_1(A)\,\epsilon_0\|x\|_2.$$
Since the minimum eigenvalue of $A^TA + \lambda I$ is at least $\lambda$:
$$\|\mathrm{RIDGE}(A, \lambda, A^TAx, \epsilon_0, \delta_0) - Bx\|_2 \le \frac{\sigma_1(A)\,\epsilon_0}{\sqrt{\lambda}}\|x\|_2 = \sqrt{\kappa_\lambda}\,\epsilon_0\|x\|_2 = O(\epsilon^2\gamma^2)\|x\|_2.$$
Union bounding over all $2q$ calls of RIDGE, this bound holds for all calls with probability $1 - 2q\delta_0 = 1 - \delta$. Overall, by Lemma 3.1, $\|s - s(B)y\|_2 = O(\epsilon)\|y\|_2$ with probability $1 - \delta$. As discussed, $s(B) = P_{A_\lambda}$. Adjusting constants on $\epsilon$ (via $c_1, c_2$) completes the proof.

Note that the runtime of Theorem 3.2 depends on $\sqrt{\kappa_\lambda}$. In performing principal component projection, PC-PROJ applies an asymmetric step function to $A^TA$. The degree of the best polynomial approximation to this step also has a $\sqrt{\kappa_\lambda}$ dependence (Eremenko & Yuditskii, 2011), so our reduction from projection to ridge regression is optimal in this regard.

3.2. Choosing λ and γ

Theorem 3.2 requires $\frac{\sigma_{k+1}(A)^2}{1 - 4\gamma} \le \lambda \le (1 - 4\gamma)\sigma_k(A)^2$. If $\lambda$ is chosen approximately equidistant from the two eigenvalues, we need $\gamma = O\big(1 - \sigma_{k+1}^2(A)/\sigma_k^2(A)\big)$. In practice, however, it is unnecessary to explicitly specify $\gamma$ or to choose $\lambda$ so precisely. With $q = O(\gamma^{-2}\log(1/\epsilon))$ our projection will be approximately correct on all singular values outside the range $[(1 - \gamma)\lambda, (1 + \gamma)\lambda]$. If there are any intermediate singular values in this range, as shown in Section 5, the approximate step function applied by Lemma 3.1 will map these values to $[0, 1]$ via a monotonically increasing soft step. That is, Algorithm 1 gives a slightly softened projection, removing any principal directions with value below $(1 - \gamma)\lambda$, keeping any with value above $(1 + \gamma)\lambda$, and partially projecting any in between.

4. From principal component projection to principal component regression

A major motivation for an efficient, PCA-free method for projecting a vector onto the span of top principal components is principal component regression (PCR). Recall that PCR solves the following problem:
$$A_\lambda^+ b = \underset{x \in \mathbb{R}^d}{\arg\min}\; \|A_\lambda x - b\|_2^2.$$
In exact arithmetic, $A_\lambda^+ b$ is equal to $(A^TA)^{-1}P_{A_\lambda}A^Tb$. This identity suggests a method for computing the PCR solution without finding $A_\lambda$ explicitly: first apply a principal component projection algorithm to $A^Tb$ and then solve a linear system to apply $(A^TA)^{-1}$. Unfortunately, this approach is disastrously unstable, not only when $P_{A_\lambda}$ is applied approximately, but in any finite precision environment. We instead present a modified method for obtaining PCA-free regression from projection.

4.1. Stable inversion via ridge regression

Let $y = P_{A_\lambda}A^Tb$ and suppose we have $\tilde{y} \approx y$ (e.g. from Algorithm 1). The issue with the above approach is that $(A^TA)^{-1}$ can have a very large maximum eigenvalue, so we cannot guarantee $(A^TA)^{-1}\tilde{y} \approx (A^TA)^{-1}y$. On the other hand, applying the ridge regression operator $(A^TA + \lambda I)^{-1}$ is much more stable: it has maximum eigenvalue $1/\lambda$, so $(A^TA + \lambda I)^{-1}\tilde{y}$ will approximate $(A^TA + \lambda I)^{-1}y$ well.

In short, it is more stable to apply $(A^TA + \lambda I)^{-1}y = f(A^TA)y$, where $f(x) = \frac{1}{x + \lambda}$, but the goal in PCR is to apply $(A^TA)^{-1} = h(A^TA)$ where $h(x) = 1/x$. So, in order to go from one function to the other, we use a correction function $g(x) = \frac{x}{1 - \lambda x}$. By simple calculation,
$$A_\lambda^+ b = (A^TA)^{-1}y = g\big((A^TA + \lambda I)^{-1}\big)\,y.$$
Additionally, we can stably approximate $g(x)$ with an iteratively computed low degree polynomial! Specifically, we use a truncation of the series $g(x) = \sum_{i=1}^{\infty}\lambda^{i-1}x^i$. Exact computation of $g(x)$ would exactly apply $(A^TA)^{-1}$, which as discussed, is unstable due to large eigenvalues (corresponding to small eigenvalues of $A^TA$). Our approximation to $g(x)$ is accurate on the large eigenvalues of $A^TA$ but inaccurate on the small eigenvalues. This turns out to be the key to stability: by not fully inverting these eigenvalues, we avoid the instability of applying $(A^TA)^{-1}$. We give a full analysis in Appendix B, yielding:
Lemma 4.1 (PCR approximation algorithm). Let $\mathcal{A}$ be a procedure that, given $x \in \mathbb{R}^d$, produces $\mathcal{A}(x)$ with $\|\mathcal{A}(x) - (A^TA + \lambda I)^{-1}x\|_{A^TA + \lambda I} = O\big(\frac{\epsilon}{q^2\sigma_1(A)}\big)\|x\|_2$. Let $\mathcal{B}$ be a procedure that, given $x \in \mathbb{R}^d$, produces $\mathcal{B}(x)$ with $\|\mathcal{B}(x) - P_{A_\lambda}x\|_2 = O\big(\frac{\epsilon}{q^2\sqrt{\kappa_\lambda}}\big)\|x\|_2$. Given $b \in \mathbb{R}^n$ set $s_0 := \mathcal{B}(A^Tb)$ and $s_1 := \mathcal{A}(s_0)$. For $k \ge 1$ set
$$s_{k+1} := s_1 + \lambda\,\mathcal{A}(s_k).$$
If all arithmetic operations are performed with $\Omega(\log(d/q\epsilon))$ bits of precision then $\|s_q - A_\lambda^+ b\|_{A^TA} = O(\epsilon)\|b\|_2$ for $q = \Theta(\log(\kappa_\lambda/\epsilon))$.

We instantiate the iterative procedure above in Algorithm 2. PC-PROJ$(A, \lambda, y, \gamma, \epsilon, \delta)$ denotes a call to Algorithm 1.

Algorithm 2 (RIDGE-PCR) Ridge regression-based PCR
input: $A \in \mathbb{R}^{n \times d}$, $b \in \mathbb{R}^n$, error $\epsilon$, failure rate $\delta$, threshold $\lambda$, gap $\gamma \in (0, 1)$
  $q := c_1\log(\kappa_\lambda/\epsilon)$,  $\epsilon_0 := c_2^{-1}\epsilon/(q^2\sqrt{\kappa_\lambda})$,  $\delta_0 := \delta/(2(q + 1))$
  $\tilde{y} := \text{PC-PROJ}(A, \lambda, A^Tb, \gamma, \epsilon_0, \delta/2)$
  $s_0 := \mathrm{RIDGE}(A, \lambda, \tilde{y}, \epsilon_0, \delta_0)$,  $s := s_0$
  for $k = 1, \ldots, q$ do
    $s := s_0 + \lambda\cdot\mathrm{RIDGE}(A, \lambda, s, \epsilon_0, \delta_0)$
  end for
  return $s$

Theorem 4.2. If $\frac{1}{1 - 4\gamma}\sigma_{k+1}(A)^2 \le \lambda \le (1 - 4\gamma)\sigma_k(A)^2$ and $c_1, c_2$ are sufficiently large constants, RIDGE-PCR (Algorithm 2) returns $s$ such that, with probability $1 - \delta$, $\|s - A_\lambda^+ b\|_{A^TA} \le \epsilon\|b\|_2$. The algorithm makes one call to PC-PROJ and $O(\log(\kappa_\lambda/\epsilon))$ calls to ridge regression, each of which costs $T_{\mathrm{RIDGE}}(A, \lambda, \epsilon_0, \delta_0)$, so Lemma 2.1 and Theorem 3.2 imply a total runtime of
$$\tilde{O}\big(\mathrm{nnz}(A)\sqrt{\kappa_\lambda}\,\gamma^{-2}\log^2(\kappa_\lambda/(\epsilon\gamma))\big),$$
where $\tilde{O}$ hides a $\log\log(1/\epsilon)$ factor, or, with stochastic methods,
$$\tilde{O}\big((\mathrm{nnz}(A) + d\,\mathrm{sr}(A)\kappa_\lambda)\,\gamma^{-2}\log^2(\kappa_\lambda/(\epsilon\gamma\delta))\big).$$

Proof. We apply Lemma 4.1; $\mathcal{A}$ is given by $\mathrm{RIDGE}(A, \lambda, x, \epsilon_0, \delta_0)$. Since $\|(A^TA + \lambda I)^{-1}\|_2 \le 1/\lambda$, Lemma 2.1 states that with probability $1 - \delta_0$,
$$\|\mathcal{A}(x) - (A^TA + \lambda I)^{-1}x\|_{A^TA + \lambda I} \le \epsilon_0\|x\|_{(A^TA + \lambda I)^{-1}} \le \frac{\epsilon}{c_2 q^2\sqrt{\kappa_\lambda\lambda}}\|x\|_2 = \frac{\epsilon}{c_2 q^2\sigma_1(A)}\|x\|_2.$$
Now, $\mathcal{B}$ is given by $\text{PC-PROJ}(A, \lambda, x, \gamma, \epsilon_0, \delta/2)$. With probability $1 - \delta/2$, if $\frac{1}{1 - 4\gamma}\sigma_{k+1}(A)^2 \le \lambda \le (1 - 4\gamma)\sigma_k(A)^2$, then by Theorem 3.2, $\|\mathcal{B}(x) - P_{A_\lambda}x\|_2 \le \epsilon_0\|x\|_2 = \frac{\epsilon}{c_2 q^2\sqrt{\kappa_\lambda}}\|x\|_2$. Applying the union bound over $q + 1$ calls to $\mathcal{A}$ and a single call to $\mathcal{B}$, these bounds hold on every call with probability $1 - \delta$. Adjusting constants on $\epsilon$ (via $c_1$ and $c_2$) proves the theorem.
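A minimal sketch of the RIDGE-PCR recurrence under the same simplifications (exact ridge solves, and an exact projection oracle standing in for PC-PROJ); the function and parameter names are ours.

```python
import numpy as np

def ridge_pcr(A, b, lam, q, proj_fn):
    """Sketch of RIDGE-PCR (Algorithm 2) with exact ridge solves.

    proj_fn(v) should approximate P_{A_lambda} v (in the full algorithm this
    is PC-PROJ). q should be on the order of log(kappa_lambda / eps).
    """
    d = A.shape[1]
    M = A.T @ A + lam * np.eye(d)
    ridge = lambda v: np.linalg.solve(M, v)   # stand-in for RIDGE(A, lam, v, ...)

    y = proj_fn(A.T @ b)          # approximately P_{A_lambda} A^T b
    s0 = ridge(y)                 # first term of the truncated series for g
    s = s0
    for _ in range(q):
        s = s0 + lam * ridge(s)   # s_{k+1} = s_1 + lam * (A^T A + lam*I)^{-1} s_k
    return s

# Toy check on a diagonal matrix, where A_lambda^+ b and P_{A_lambda} are explicit:
sig = np.array([1.2, 1.1, 1.0, 0.9, 0.4, 0.3, 0.2, 0.1])
A, lam = np.diag(sig), 0.5
b = np.random.default_rng(2).standard_normal(8)
proj = lambda v: np.where(sig ** 2 >= lam, v, 0.0)     # exact P_{A_lambda}
exact = np.where(sig ** 2 >= lam, b / sig, 0.0)        # A_lambda^+ b
print(np.linalg.norm(ridge_pcr(A, b, lam, q=60, proj_fn=proj) - exact))
```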
5. Approximating the matrix step function

We now return to proving our underlying result on iterative polynomial approximation of the matrix step function:

Lemma 3.1 (Step function algorithm). Let $S \in \mathbb{R}^{d \times d}$ be symmetric with every eigenvalue $\sigma$ satisfying $\sigma \in [0, 1]$ and $|\sigma - 1/2| \ge \gamma$. Let $\mathcal{A}$ denote a procedure that on $x \in \mathbb{R}^d$ produces $\mathcal{A}(x)$ with $\|\mathcal{A}(x) - Sx\|_2 = O(\epsilon^2\gamma^2)\|x\|_2$. Given $y \in \mathbb{R}^d$ set $s_0 := \mathcal{A}(y)$, $w_0 := s_0 - \frac{1}{2}y$, and for $k \ge 0$ set $w_{k+1} := \frac{4(2k+1)}{2k+2}\,\mathcal{A}\big(w_k - \mathcal{A}(w_k)\big)$ and $s_{k+1} := s_k + w_{k+1}$. If all arithmetic operations are performed with $\Omega(\log(d/\epsilon\gamma))$ bits of precision and if $q = \Theta(\gamma^{-2}\log(1/\epsilon))$ then $\|s_q - s(S)y\|_2 = O(\epsilon)\|y\|_2$.

The derivation and proof of Lemma 3.1 is split into 3 parts. In Section 5.1 we derive a simple low degree polynomial approximation to the sign function:
$$\mathrm{sgn}(x) \stackrel{\mathrm{def}}{=} \begin{cases} 1 & \text{if } x > 0, \\ 0 & \text{if } x = 0, \\ -1 & \text{if } x < 0. \end{cases}$$
In Section 5.2 we show how this polynomial can be computed with a stable iterative procedure. In Section 5.3 we use these pieces and the fact that the step function is simply a shifted and scaled sign function to prove Lemma 3.1. Along the way we give complementary views of Lemma 3.1 and show that there exist more efficient polynomial approximations. All proofs are in Appendix A.

5.1. Polynomial approximation to the sign function

We show that for sufficiently large $k$, the following polynomial is uniformly close to $\mathrm{sgn}(x)$ on $[-1, 1]$:
$$p_k(x) = \sum_{i=0}^{k} x(1 - x^2)^i \prod_{j=1}^{i} \frac{2j - 1}{2j}.$$
$p_k(x)$ can be derived in several ways. One follows from observing that $\mathrm{sgn}(x)$ is odd and so $\mathrm{sgn}(x)/x = 1/|x|$ is even. So, a good polynomial approximation for $\mathrm{sgn}(x)$ should be odd and, when divided by $x$, should be even (i.e. a function of $x^2$). Specifically, given an approximation $q(x)$ to $1/\sqrt{x}$ on $(0, 1]$ we can approximate $\mathrm{sgn}(x)$ using $x\,q(x^2)$. Letting $q$ be the $k$-th order Taylor approximation to $1/\sqrt{x}$ at $x = 1$ yields $p_k(x)$. We first show:

Lemma 5.1. $\mathrm{sgn}(x) = \lim_{k\to\infty} p_k(x)$ for all $x \in [-1, 1]$.

Proof. Let $f(x) = x^{-1/2}$. By induction on $k$ it is straightforward to show that the $k$-th derivative of $f$ at $x > 0$ is $f^{(k)}(x) = (-1)^k\big(\prod_{j=1}^{k}\frac{2j-1}{2}\big)\,x^{-\frac{1+2k}{2}}$. Since $(-1)^i(x - 1)^i = (1 - x)^i$, the degree $k$ Taylor approximation to $f(x)$ at $x = 1$ is $q_k(x) = \sum_{i=0}^{k}(1 - x)^i\prod_{j=1}^{i}\frac{2j-1}{2j}$. The integral form of the Taylor series remainder gives
$$|f(x) - q_k(x)| = \left|\frac{1}{k!}\int_1^x f^{(k+1)}(t)(x - t)^k\,dt\right|,$$
which is bounded by a quantity that vanishes as $k \to \infty$ for any fixed $x \in (0, 1]$; consequently, the Taylor approximation converges, i.e. $\lim_{k\to\infty} q_k(x) = 1/\sqrt{x}$, for $x \in (0, 1]$. Since $p_k(x) = x\,q_k(x^2)$, $\lim_{k\to\infty} p_k(x) = x/\sqrt{x^2} = \mathrm{sgn}(x)$ for $x \ne 0$ in $[-1, 1]$. Since $p_k(0) = 0 = \mathrm{sgn}(0)$, the result follows.

Alternatively, to derive $p_k(x)$ we can consider $(1 - x^2)^k$, which is relatively large near 0 and small on the rest of $[-1, 1]$. Integrating this function from 0 to $x$ and normalizing yields a good step function. In Appendix A we prove:

Lemma 5.2. For all $x \in \mathbb{R}$,
$$p_k(x) = \frac{\int_0^x (1 - y^2)^k\,dy}{\int_0^1 (1 - y^2)^k\,dy}.$$

Next, we bound the rate of convergence of $p_k(x)$ to $\mathrm{sgn}(x)$:

Lemma 5.3. For $k \ge 1$, if $x \in (0, 1]$ then $p_k(x) > 0$ and
$$\mathrm{sgn}(x) - (x\sqrt{k})^{-1}e^{-kx^2} \le p_k(x) \le \mathrm{sgn}(x). \qquad (2)$$
If $x \in [-1, 0)$ then $p_k(x) < 0$ and
$$\mathrm{sgn}(x) \le p_k(x) \le \mathrm{sgn}(x) + (|x|\sqrt{k})^{-1}e^{-kx^2}.$$

Proof. The claim is trivial for $x = 0$. Since $p_k(x)$ is odd it suffices to consider $x \in (0, 1]$. It is direct that $p_k(x) > 0$, and $p_k(x) \le \mathrm{sgn}(x)$ follows since $p_k(x)$ increases monotonically with $k$ and $\lim_{k\to\infty} p_k(x) = \mathrm{sgn}(x)$ by Lemma 5.1. All that remains is the left-side inequality of (2). Using the Taylor series expansion of Lemma 5.1,
$$\mathrm{sgn}(x) - p_k(x) = x\sum_{i=k+1}^{\infty}(1 - x^2)^i\prod_{j=1}^{i}\frac{2j-1}{2j}.$$
Now, since $1 + x \le e^x$ for all $x$ and $\sum_{j=1}^{i}\frac{1}{j} \ge \ln(i+1)$, each product satisfies $\prod_{j=1}^{i}\frac{2j-1}{2j} \le e^{-\frac{1}{2}\sum_{j=1}^{i}1/j} \le \frac{1}{\sqrt{i+1}} \le \frac{1}{\sqrt{k}}$ for $i > k$, so the tail above is at most $\frac{x}{\sqrt{k}}(1 - x^2)^{k+1}\sum_{i=0}^{\infty}(1 - x^2)^i$. Combining with $\sum_{i=0}^{\infty}(1 - x^2)^i = x^{-2}$ and again that $1 + x \le e^x$ proves the left hand side of (2).

The lemma directly implies that $p_k(x)$ is a high quality approximation to $\mathrm{sgn}(x)$ for $x$ bounded away from 0.

Corollary 5.4. If $x \in [-1, 1]$ with $|x| \ge \gamma > 0$ and $k = \gamma^{-2}\ln(1/\epsilon)$, then $|\mathrm{sgn}(x) - p_k(x)| \le \epsilon$.

We conclude by noting that this proof in fact implies the existence of a lower-degree polynomial approximation to $\mathrm{sgn}(x)$. Since the sum of coefficients in our expansion is small, we can replace each $(1 - x^2)^q$ with Chebyshev polynomials of lower degree. In Appendix A, we prove:

Lemma 5.5. There exists an $O(\gamma^{-1}\log(1/\epsilon\gamma))$ degree polynomial $q(x)$ such that $|\mathrm{sgn}(x) - q(x)| \le \epsilon$ for all $x \in [-1, 1]$ with $|x| \ge \gamma > 0$.

Lemma 5.5 achieves, up to an additive $\log(1/\gamma)/\gamma$ term, the optimal tradeoff between degree and approximation of $\mathrm{sgn}(x)$ (Eremenko & Yuditskii, 2007). We have preliminary progress toward making this near-optimal polynomial algorithmic, a topic we leave to explore in future work.
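The polynomial is easy to evaluate through the term recurrence $t_{i+1}(x) = t_i(x)(1 - x^2)\frac{2i+1}{2i+2}$ used in Section 5.2 below, which also makes the rate of Lemma 5.3 easy to check numerically; the grid and degrees in this sketch are arbitrary illustrative choices.

```python
import numpy as np

def p_k(x, k):
    """Evaluate p_k(x) = sum_{i=0}^{k} t_i(x), with t_0(x) = x and
    t_{i+1}(x) = t_i(x) * (1 - x^2) * (2i + 1) / (2i + 2)."""
    x = np.asarray(x, dtype=float)
    t = x.copy()
    p = x.copy()
    for i in range(k):
        t = t * (1.0 - x ** 2) * (2 * i + 1) / (2 * i + 2)
        p = p + t
    return p

# Uniform error on |x| >= gamma decays roughly like exp(-k*gamma^2)/(gamma*sqrt(k)).
gamma = 0.2
xs = np.concatenate([np.linspace(-1, -gamma, 200), np.linspace(gamma, 1, 200)])
for k in (25, 100, 400):
    print(k, np.abs(p_k(xs, k) - np.sign(xs)).max())
```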
5.2. Stable iterative algorithm for the sign function

We now provide an iterative algorithm for computing $p_k(x)$ that works when applied with limited precision. Our formula is obtained by considering each term of $p_k(x)$. Let
$$t_k(x) \stackrel{\mathrm{def}}{=} x(1 - x^2)^k\prod_{j=1}^{k}\frac{2j-1}{2j}.$$
Clearly $t_{k+1}(x) = t_k(x)(1 - x^2)\frac{2k+1}{2k+2}$ and therefore we can compute the $t_k$ iteratively. Since $p_k(x) = \sum_{i=0}^{k} t_i(x)$ we can compute $p_k(x)$ iteratively as well. We show this procedure works when applied to matrices, even if all operations are performed with limited precision:

Lemma 5.6. Let $B \in \mathbb{R}^{d \times d}$ be symmetric with $\|B\|_2 \le 1$. Let $\mathcal{C}$ be a procedure that given $x \in \mathbb{R}^d$ produces $\mathcal{C}(x)$ with $\|\mathcal{C}(x) - (I - B^2)x\|_2 \le \epsilon\|x\|_2$. Given $y \in \mathbb{R}^d$ suppose that we have $t_0$ and $p_0$ such that $\|t_0 - By\|_2 \le \epsilon\|y\|_2$ and $\|p_0 - By\|_2 \le \epsilon\|y\|_2$. For all $k \ge 0$ set
$$t_{k+1} := \frac{2k+1}{2k+2}\,\mathcal{C}(t_k) \quad \text{and} \quad p_{k+1} := p_k + t_{k+1}.$$
Then if operations are carried out with $\Omega(\log(d/\epsilon))$ bits of precision, for $1 \le k \le 1/(7\epsilon)$:
$$\|t_k(B)y - t_k\|_2 \le 7k\epsilon\|y\|_2 \quad \text{and} \quad \|p_k(B)y - p_k\|_2 \le 7k\epsilon\|y\|_2.$$

Proof. Let $t_k^* \stackrel{\mathrm{def}}{=} t_k(B)y$, $p_k^* \stackrel{\mathrm{def}}{=} p_k(B)y$, and $C \stackrel{\mathrm{def}}{=} I - B^2$. Since $t_0^* = p_0^* = By$ and $\|B\|_2 \le 1$, we see that even if $t_0$ and $p_0$ are truncated to the given bit precision we still have $\|t_0 - t_0^*\|_2 \le \epsilon\|y\|_2$ and $\|p_0 - p_0^*\|_2 \le \epsilon\|y\|_2$.

Now suppose that $\|t_k - t_k^*\|_2 \le \Delta\|y\|_2$ for some $\Delta \le 1$. Since $|t_k(x)| \le |p_k(x)| \le |\mathrm{sgn}(x)| \le 1$ for $x \in [-1, 1]$ and $-I \preceq B \preceq I$, we know that $\|t_k^*\|_2 \le \|y\|_2$ and by the reverse triangle inequality $\|t_k\|_2 \le (1 + \Delta)\|y\|_2$. Using our assumption on $\mathcal{C}$ and applying the triangle inequality yields
$$\|\mathcal{C}(t_k) - Ct_k^*\|_2 \le \|\mathcal{C}(t_k) - Ct_k\|_2 + \|C(t_k - t_k^*)\|_2 \le \epsilon\|t_k\|_2 + \|C\|_2\|t_k - t_k^*\|_2 \le (\epsilon(1 + \Delta) + \Delta)\|y\|_2 \le (2\epsilon + \Delta)\|y\|_2.$$
In the last line we used $\|C\|_2 \le 1$ since $0 \preceq B^2 \preceq I$. Again, this gives $\|Ct_k^*\|_2 \le \|y\|_2$ and by the reverse triangle inequality $\|\mathcal{C}(t_k)\|_2 \le (1 + 2\epsilon + \Delta)\|y\|_2$. Using $\mathcal{C}(t_k)$ to compute $t_{k+1}$ with bounded precision will then introduce an additional additive error of $\epsilon(1 + 2\epsilon + \Delta)\|y\|_2 \le 4\epsilon\|y\|_2$. Putting all this together, $\|t_k - t_k^*\|_2$ grows by at most an additive $6\epsilon\|y\|_2$ every time $k$ increases, and by the same argument so does $\|p_k - p_k^*\|_2$. Including our initial error of $\epsilon\|y\|_2$ on $t_0$ and $p_0$, we conclude that $\|t_k - t_k^*\|_2$ and $\|p_k - p_k^*\|_2$ are both bounded by $(6k + 1)\epsilon\|y\|_2 \le 7k\epsilon\|y\|_2$.

5.3. Approximating the step function

We finally apply the results of Section 5.1 and Section 5.2 to approximate the step function and prove Lemma 3.1. We simply use $s(x) = \frac{1}{2}(1 + \mathrm{sgn}(2x - 1))$, first showing how to compute $\frac{1}{2}(1 + p_k(2x - 1))$ via Lemma 5.6.

Lemma 5.7. Let $S \in \mathbb{R}^{d \times d}$ be symmetric with $0 \preceq S \preceq I$. Let $\mathcal{A}$ be a procedure that on $x \in \mathbb{R}^d$ produces $\mathcal{A}(x)$ with $\|\mathcal{A}(x) - Sx\|_2 \le \epsilon\|x\|_2$. Given arbitrary $y \in \mathbb{R}^d$ set $s_0 := \mathcal{A}(y)$, $w_0 := s_0 - \frac{1}{2}y$, and for all $k \ge 0$ set
$$w_{k+1} := \frac{4(2k+1)}{2k+2}\,\mathcal{A}\big(w_k - \mathcal{A}(w_k)\big) \quad \text{and} \quad s_{k+1} := s_k + w_{k+1}.$$
If arithmetic operations are performed with $\Omega(\log(d/\epsilon))$ bits of precision and $k = O(1/\epsilon)$ then $\|\frac{1}{2}(I + p_k(2S - I))y - s_k\|_2 = O(k\epsilon)\|y\|_2$.

Proof. Since $M \stackrel{\mathrm{def}}{=} I - (2S - I)^2 = 4S(I - S)$, $w_k$ is the same as $\frac{1}{2}t_k$ in Lemma 5.6 with $B = 2S - I$ and $\mathcal{C}(x) = 4\mathcal{A}(x - \mathcal{A}(x))$, and $s_k = s_0 + \sum_{i=1}^{k} w_i = \frac{1}{2}y + \sum_{i=0}^{k}\frac{1}{2}t_i = \frac{1}{2}(p_k + y)$. Multiplying by $1/2$ everywhere does not increase error and $\|2S - I\|_2 \le 1$, so we can invoke Lemma 5.6 to yield the result provided we can show $\|4\mathcal{A}(x - \mathcal{A}(x)) - Mx\|_2 = O(\epsilon)\|x\|_2$. Computing $\mathcal{A}(x)$ and subtracting from $x$ introduces at most additive error $2\epsilon\|x\|_2$. Consequently, by the error guarantee of $\mathcal{A}$, $\|4\mathcal{A}(x - \mathcal{A}(x)) - Mx\|_2 = O(\epsilon)\|x\|_2$ as desired.

Using Lemma 5.7 and Corollary 5.4 we finally have:

Proof of Lemma 3.1. By assumption, $0 \preceq S \preceq I$ and $\epsilon\gamma^2 q = O(1)$. Invoking Lemma 5.7 with error $\epsilon_0 = \epsilon^2\gamma^2$, and letting $a_q \stackrel{\mathrm{def}}{=} \frac{1}{2}(I + p_q(2S - I))y$, we have
$$\|a_q - s_q\|_2 = O(\gamma^2\epsilon^2 q)\|y\|_2 = O(\epsilon)\|y\|_2. \qquad (3)$$
Now, since $s(S) = \frac{1}{2}(I + \mathrm{sgn}(2S - I))$ and every eigenvalue of $2S - I$ has magnitude in $[\gamma, 1]$ by assumption on $S$, we can invoke Corollary 5.4, yielding
$$\|a_q - s(S)y\|_2 \le \frac{1}{2}\|p_q(2S - I) - \mathrm{sgn}(2S - I)\|_2\|y\|_2 \le \frac{\epsilon}{2}\|y\|_2.$$
The result follows from combining with (3) via the triangle inequality.
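Putting Lemma 5.7 together in exact arithmetic (so $\mathcal{A}(x) = Sx$ is computed directly, rather than with the limited precision the lemma allows), a short check that the recurrence indeed applies the matrix step function; the test matrix is a toy choice.

```python
import numpy as np

def step_apply(S, y, q):
    """Apply an approximation of s(S) to y via the recurrence of Lemma 5.7,
    with exact matrix-vector products standing in for the procedure A."""
    s = S @ y
    w = s - 0.5 * y
    for k in range(q):
        w = (4.0 * (2 * k + 1) / (2 * k + 2)) * (S @ (w - S @ w))
        s = s + w
    return s

# Toy check: symmetric S with eigenvalues bounded away from 1/2.
rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
eigs = np.array([0.9, 0.8, 0.75, 0.2, 0.15, 0.05])
S = Q @ np.diag(eigs) @ Q.T
y = rng.standard_normal(6)
exact = Q @ np.diag((eigs >= 0.5).astype(float)) @ Q.T @ y   # s(S) y
print(np.linalg.norm(step_apply(S, y, q=120) - exact))       # near machine precision here
```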
6. Empirical evaluation

We conclude with an empirical evaluation of PC-PROJ and RIDGE-PCR (Algorithms 1 and 2). Since PCR has already been justified as a statistical technique, we focus on showing that, with few iterations, our algorithm recovers an accurate approximation to $A_\lambda^+ b$ and $P_{A_\lambda}y$.

[Figure 2: Relative log error of RIDGE-PCR (left) and PC-PROJ (right) for synthetically generated data, for gaps $\gamma = .1$, $.02$, $.01$, plotted against the number of iterations.]

[Figure 3: Extended log error plot on synthetic data with gap $\gamma = .1$ (left). Relative log error of RIDGE-PCR and PC-PROJ for an MNIST-based regression problem (right).]

We begin with synthetic data, which lets us control the spectral gap $\gamma$ that dominates our iteration bounds (see Theorem 3.2). Data is generated randomly by drawing top singular values uniformly from the range $[.5(1 + \gamma), 1]$ and tail singular values from $[0, .5(1 - \gamma)]$. $\lambda$ is set to $.5$ and $A$ (500 rows, 200 columns) is formed via the SVD $U\Sigma V^T$, where $U$ and $V$ are random bases and $\Sigma$ contains our random singular values. To model a typical application, $b$ is generated by adding noise to the response $Ax$ of a random $x$ that correlates with $A$'s top principal components.

As apparent in Figure 2, our algorithm performs very well for regression, even for small $\gamma$. Error is measured via the natural $A^TA$-norm and we plot $\|\text{RIDGE-PCR}(A, b, \lambda) - A_\lambda^+ b\|_{A^TA}$. The figure shows similar convergence for projection (given with respect to the more natural 2-norm), although we do notice a stronger effect of a small gap $\gamma$ in this case. Both plots confirm the linear convergence predicted by our analysis (Theorems 3.2 and 4.2). To illustrate stability, we include an extended plot for the $\gamma = .1$ data which shows arbitrarily high accuracy as iterations increase (Figure 3).
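A sketch of the synthetic data generation just described, under the assumption that the sampled values are the squared singular values (so that $\lambda = .5$ separates the two groups); the number of top components and the noise level are our own illustrative choices.

```python
import numpy as np

def synthetic_gap_matrix(n, d, k, gamma, rng):
    """Generate A (n x d) with a controlled gap around the threshold lam = 0.5:
    top k squared singular values drawn from [.5*(1+gamma), 1], the rest from
    [0, .5*(1-gamma)]. Returns A and its right singular vectors V."""
    top = rng.uniform(0.5 * (1 + gamma), 1.0, size=k)
    tail = rng.uniform(0.0, 0.5 * (1 - gamma), size=d - k)
    sig = np.sqrt(np.sort(np.concatenate([top, tail]))[::-1])
    U, _ = np.linalg.qr(rng.standard_normal((n, d)))   # random orthonormal bases
    V, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return U @ np.diag(sig) @ V.T, V

rng = np.random.default_rng(0)
A, V = synthetic_gap_matrix(n=500, d=200, k=50, gamma=0.1, rng=rng)
# Labels: noisy response of an x lying in the span of the top principal components.
x_true = V[:, :50] @ rng.standard_normal(50)
b = A @ x_true + 0.1 * rng.standard_normal(500)
```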
Finally, we consider a 60K-point regression problem constructed from MNIST classification data (Le Cun et al., 2015), with the goal of distinguishing handwritten digits {1, 2, 4, 5, 7} from the rest. Input is normalized and 1000 random Fourier features are generated according to a unit RBF kernel (Rahimi & Recht, 2007). Our final data set is both of larger scale and condition number than the original. The MNIST principal component regression was run with $\lambda = .01\sigma_1^2$. Although the gap $\gamma$ is very small around this cutoff point (just .006), we see fast convergence for PCR. Convergence for projection is slowed more notably by the gap, but it is still possible to obtain 0.01 relative error with only 20 iterations (i.e. invocations of ridge regression).

References

Boutsidis, Christos and Magdon-Ismail, Malik. Faster SVD-truncated regularized least-squares. In Proceedings of the 2014 IEEE International Symposium on Information Theory (ISIT), pp. 1321-1325, 2014.

Chan, Tony F. and Hansen, Per Christian. Computing truncated singular value decomposition least squares solutions by rank revealing QR-factorizations. SIAM Journal on Scientific and Statistical Computing, 11(3):519-530, 1990.

Cohen, Michael B., Lee, Yin Tat, Musco, Cameron, Musco, Christopher, Peng, Richard, and Sidford, Aaron. Uniform sampling for matrix approximation. In Proceedings of the 6th Conference on Innovations in Theoretical Computer Science (ITCS), pp. 181-190, 2015.

Dhillon, Paramveer S., Foster, Dean P., Kakade, Sham M., and Ungar, Lyle H. A risk comparison of ordinary least squares vs ridge regression. The Journal of Machine Learning Research, 14(1):1505-1511, 2013.

Eremenko, Alexandre and Yuditskii, Peter. Uniform approximation of sgn(x) by polynomials and entire functions. Journal d'Analyse Mathématique, 101(1):313-324, 2007.

Eremenko, Alexandre and Yuditskii, Peter. Polynomials of the best uniform approximation to sgn(x) on two intervals. Journal d'Analyse Mathématique, 114(1):285-315, 2011.

Frank, Ildiko E. and Friedman, Jerome H. A statistical view of some chemometrics regression tools. Technometrics, 35(2):109-135, 1993.

Frommer, Andreas and Simoncini, Valeria. Model Order Reduction: Theory, Research Aspects and Applications, chapter Matrix Functions, pp. 275-303. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.

Frostig, Roy, Ge, Rong, Kakade, Sham M., and Sidford, Aaron. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.

Hansen, Per Christian. The truncated SVD as a method for regularization. BIT Numerical Mathematics, 27(4):534-553, 1987.

Higham, Nicholas J. Functions of Matrices: Theory and Computation. Society for Industrial and Applied Mathematics, 2008.

Hotelling, Harold. The relations of the newer multivariate statistical methods to factor analysis. British Journal of Statistical Psychology, 10(2):69-79, 1957.

Le Cun, Yann, Cortes, Corinna, and Burges, Christopher J.C. MNIST handwritten digit database. 2015. URL http://yann.lecun.com/exdb/mnist/.

Lin, Qihang, Lu, Zhaosong, and Xiao, Lin. An accelerated proximal coordinate gradient method. In Advances in Neural Information Processing Systems 27 (NIPS), pp. 3059-3067, 2014.

Musco, Cameron and Musco, Christopher. Randomized block Krylov methods for stronger and faster approximate singular value decomposition. In Advances in Neural Information Processing Systems 28 (NIPS), pp. 1396-1404, 2015.

Nelson, Jelani and Nguyễn, Huy L. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 117-126, 2013.

Nesterov, Yurii. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). In Soviet Mathematics Doklady, volume 27, pp. 372-376, 1983.

Rahimi, Ali and Recht, Benjamin. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20 (NIPS), pp. 1177-1184, 2007.

Sachdeva, Sushant and Vishnoi, Nisheeth K. Faster algorithms via approximation theory. Foundations and Trends in Theoretical Computer Science, 9(2):125-210, 2014.

Shalev-Shwartz, Shai and Zhang, Tong. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, pp. 1-41, 2014.

Tikhonov, Andrey. Solution of incorrectly formulated problems and the regularization method. In Soviet Mathematics Doklady, volume 4, pp. 1035-1038, 1963.

van den Eshof, Jasper, Frommer, Andreas, Lippert, Thomas, Schilling, Klaus, and van der Vorst, Henk A. Numerical methods for the QCD overlap operator I: Sign-function and error bounds. Computer Physics Communications, 146(2):203-224, 2002.