Faster Eigenvector Computation via Shift-and-Invert Preconditioning

Dan Garber (DGARBER@TTIC.EDU), Toyota Technological Institute at Chicago
Elad Hazan (EHAZAN@CS.PRINCETON.EDU), Princeton University
Chi Jin (CHIJIN@BERKELEY.EDU), University of California, Berkeley
Sham M. Kakade (SHAM@CS.WASHINGTON.EDU), University of Washington
Cameron Musco (CNMUSCO@MIT.EDU), Massachusetts Institute of Technology
Praneeth Netrapalli (PRANEETH@MICROSOFT.COM), Microsoft Research, New England
Aaron Sidford (ASID@MICROSOFT.COM), Microsoft Research, New England

We give faster algorithms and improved sample complexities for the fundamental problem of estimating the top eigenvector. Given an explicit matrix $A \in \mathbb{R}^{n \times d}$, we show how to compute an $\epsilon$ approximate top eigenvector of $A^\top A$ in time $\tilde{O}\left(\left[\mathrm{nnz}(A) + \frac{d\,\mathrm{sr}(A)}{\mathrm{gap}^2}\right]\log(1/\epsilon)\right)$. Here $\mathrm{nnz}(A)$ is the number of nonzeros in $A$, $\mathrm{sr}(A)$ is the stable rank, and gap is the relative eigengap. We also consider an online setting in which, given a stream of i.i.d. samples from a distribution $\mathcal{D}$ with covariance matrix $\Sigma$ and a vector $x_0$ which is an $O(\mathrm{gap})$ approximate top eigenvector for $\Sigma$, we show how to refine $x_0$ to an $\epsilon$ approximation using $O\left(\frac{v(\mathcal{D})}{\mathrm{gap}\cdot\epsilon}\right)$ samples from $\mathcal{D}$. Here $v(\mathcal{D})$ is a natural notion of variance. Combining our algorithm with previous work to initialize $x_0$, we obtain improved sample complexities and runtimes under a variety of assumptions on $\mathcal{D}$. We achieve our results via a robust analysis of the classic shift-and-invert preconditioning method. This technique lets us reduce eigenvector computation to approximately solving a series of linear systems with fast stochastic gradient methods.

1. Introduction

Given $A \in \mathbb{R}^{n \times d}$, computing the top eigenvector of $A^\top A$ is a fundamental problem in numerical linear algebra, applicable to principal component analysis (Jolliffe, 2002), spectral clustering and learning (Ng et al., 2002; Vempala & Wang, 2004), PageRank computation, and many other graph computations (Page et al., 1999; Koren, 2003; Spielman, 2007). For instance, a rank-$k$ principal component analysis amounts to performing $k$ leading eigenvector computations. Given the ever-growing size of modern datasets, it is thus a key challenge to come up with more efficient algorithms for this basic computational primitive.

In this work we provide improved algorithms for computing the top eigenvector, both in the offline case, when $A$ is given explicitly, and in the online or statistical case, where we access samples from a distribution $\mathcal{D}$ over $\mathbb{R}^d$ and wish to estimate the top eigenvector of the covariance matrix $\mathbb{E}_{a\sim\mathcal{D}}[aa^\top]$. In the offline case, our algorithms are the fastest to date in a wide and meaningful regime of parameters. Notably, while the running time of most popular methods for eigenvector computation is a product of the size of the dataset (i.e. the number of non-zeros in $A$) and certain spectral characteristics of $A$, both of which can be quite large in practice, we present running times that actually split the dependency between these two quantities, and as a result may yield significant speedups. In the online case, our results yield improved sample complexity bounds and allow for very efficient streaming implementations with memory and processing-time requirements that are proportional to the size of a single sample.
On a high level, our algorithms are based on a robust analysis of the classic idea of shift-and-invert preconditioning (Saad, 1992), which allows us to efficiently reduce eigenvector computation to approximately solving a short sequence of well-conditioned linear systems in $\lambda I - A^\top A$ for some shift $\lambda \approx \lambda_1$. We apply state-of-the-art stochastic gradient methods to approximately solve these linear systems. We believe our results suggest the general effectiveness of shift-and-invert based approaches and imply that further computational gains may be reaped in practice.

1.1. Our Approach

The well known power method for computing the top eigenvector of $A^\top A$ starts with an initial vector $x$ and repeatedly multiplies by $A^\top A$, eventually causing $x$ to converge to the top eigenvector. For a random start vector, convergence requires $O(\log(d/\epsilon)/\mathrm{gap})$ iterations, where $\mathrm{gap} = (\lambda_1-\lambda_2)/\lambda_1$, $\lambda_i$ denotes the $i$th largest eigenvalue of $A^\top A$, and we assume a high-accuracy regime where $\epsilon < \mathrm{gap}$. The dependence on this gap ensures that the largest eigenvalue is significantly amplified in comparison to the remaining values.

If the eigenvalue gap is small, one approach is to replace $A^\top A$ with a preconditioned matrix, i.e. a matrix with the same top eigenvector but a much larger gap. Specifically, let $B = \lambda I - A^\top A$ for some shift parameter $\lambda$. If $\lambda > \lambda_1$, we can see that the smallest eigenvector of $B$ (the largest eigenvector of $B^{-1}$) is equal to the largest eigenvector of $A^\top A$. Additionally, if $\lambda$ is close to $\lambda_1$, there will be a constant gap between the largest and second largest eigenvalues of $B^{-1}$. For example, if $\lambda = (1+\mathrm{gap})\lambda_1$, then we will have $\lambda_1(B^{-1}) = \frac{1}{\lambda-\lambda_1} = \frac{1}{\mathrm{gap}\cdot\lambda_1}$ and $\lambda_2(B^{-1}) = \frac{1}{\lambda-\lambda_2} = \frac{1}{2\,\mathrm{gap}\cdot\lambda_1}$. This constant factor gap ensures that the power method applied to $B^{-1}$ converges to the top eigenvector of $A^\top A$ in just $O(\log(d/\epsilon))$ iterations.

Of course, there is a catch: each iteration of this shifted-and-inverted power method must solve a linear system in $B$, whose condition number is proportional to $\frac{1}{\mathrm{gap}}$. For small gap, solving this system via iterative methods is more expensive. Fortunately, linear system solvers are incredibly well studied and there are many efficient iterative algorithms we can adapt to apply $B^{-1}$ approximately. In particular, we show how to accelerate the iterations of the shifted-and-inverted power method using variants of Stochastic Variance Reduced Gradient (SVRG) (Johnson & Zhang, 2013). Due to the condition number of $B$, we will not entirely avoid a $\frac{1}{\mathrm{gap}}$ dependence; however, we can separate this dependence from the input size $\mathrm{nnz}(A)$.

Typically, stochastic gradient methods are used to optimize convex functions that are given as the sum of many convex components. To solve a linear system $(M^\top M)x = b$ we minimize the convex function $f(x) = \frac{1}{2}x^\top (M^\top M)x - b^\top x$ with components $\psi_i(x) = \frac{1}{2}x^\top m_i m_i^\top x - \frac{1}{n}b^\top x$, where $m_i$ is the $i$th row of $M$. Such an approach can be used to solve systems in $A^\top A$; however, solving systems in $B = \lambda I - A^\top A$ requires more care. We require an analysis of SVRG that guarantees convergence even when some of our components are non-convex. We give a simple analysis for this setting, generalizing recent work in the area (Shalev-Shwartz, 2015; Csiba & Richtárik, 2015).

Given fast approximate solvers for $B$, the second main piece of our algorithmic framework is a new error bound for the shifted-and-inverted power method, showing that it is robust to approximate linear system solvers, such as SVRG.
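To fix ideas before the analysis, here is a minimal NumPy sketch of the shifted-and-inverted power iteration described above. It assumes a suitable shift $\lambda$ slightly above $\lambda_1$ is already known and, purely for illustration, applies $B^{-1}$ with an exact dense solve; the paper's contribution is to replace this exact solve with cheap approximate stochastic solves. All function and variable names here are ours, not the paper's.

```python
import numpy as np

def shift_invert_power_method(A, lam, iters=30, seed=0):
    """Sketch: power iteration on B^{-1}, where B = lam * I - A^T A.

    Assumes lam is slightly larger than lambda_1(A^T A), e.g.
    lam ~ (1 + gap) * lambda_1. The exact np.linalg.solve call is for
    illustration only; the paper replaces it with approximate SVRG solves.
    """
    d = A.shape[1]
    B = lam * np.eye(d) - A.T @ A              # positive definite when lam > lambda_1
    x = np.random.default_rng(seed).standard_normal(d)
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x = np.linalg.solve(B, x)              # one shifted-and-inverted power step
        x /= np.linalg.norm(x)
    return x                                   # approximate top eigenvector of A^T A

# Hypothetical usage:
# A = np.random.randn(1000, 50)
# lam1 = np.linalg.eigvalsh(A.T @ A)[-1]
# x = shift_invert_power_method(A, lam=1.01 * lam1)
```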
We give a general analysis, showing exactly to what accuracy each system must be solved, allowing for faster implementations using linear solvers with weaker guarantees. Our proofs center around the potential function
$$G(x) \stackrel{\mathrm{def}}{=} \frac{\|P_{v_1^\perp} x\|_B}{\|P_{v_1} x\|_B},$$
where $P_{v_1}$ and $P_{v_1^\perp}$ are the projections onto the top eigenvector and its orthogonal complement respectively. This function resembles tangent based potential functions used in previous work (Hardt & Price, 2014), except that we use the $B$ norm rather than the $\ell_2$ norm. For the exact power method, this is irrelevant: progress is identical in both norms (see Lemma 37 of our full version). However, the $B$ norm is a natural norm for measuring the progress of linear system solvers for $B$, so our potential function makes it possible to extend the analysis to the case when $B^{-1}x$ is computed up to an error $\xi$ with bounded $\|\xi\|_B$.

1.2. Our Results

Our algorithmic framework described above offers several advantages. We obtain improved running times for computing the top eigenvector in the offline model. In Theorem 16 we give an algorithm running in time
$$O\left(\left[\mathrm{nnz}(A) + \frac{d\,\mathrm{sr}(A)}{\mathrm{gap}^2}\right]\left[\log\frac{1}{\epsilon} + \log^2\frac{d}{\mathrm{gap}}\right]\right),$$
where $\mathrm{sr}(A) = \|A\|_F^2/\|A\|_2^2 \le \mathrm{rank}(A)$ is the stable rank and $\mathrm{nnz}(A)$ is the number of non-zero entries. Up to log factors, our runtime is in many settings proportional to the input size $\mathrm{nnz}(A)$, and so is very efficient for large matrices. In the case when $\mathrm{nnz}(A) \le \frac{d\,\mathrm{sr}(A)}{\mathrm{gap}^2}$ we apply the results of (Frostig et al., 2015b; Lin et al., 2015) to provide an accelerated runtime of
$$O\left(\frac{\mathrm{nnz}(A)^{3/4}\left(d\,\mathrm{sr}(A)\right)^{1/4}}{\sqrt{\mathrm{gap}}}\left[\log\frac{d}{\mathrm{gap}}\log\frac{1}{\epsilon} + \log^3\frac{d}{\mathrm{gap}}\right]\right).$$
Finally, in the case when $\epsilon > \mathrm{gap}$, our results easily extend to give gap-free bounds (Theorems 34 and 35 of our full paper), identical to those shown above but with gap replaced by $\epsilon$.

Note that our offline results hold for any $A$ and require no initial knowledge of the top eigenvector. In Section 6 we discuss how to estimate the parameters $\lambda_1$ and gap, with modest additional runtime cost. Our algorithms return an approximate top eigenvector $x$ with $x^\top A^\top A x \ge (1-\epsilon)\lambda_1$. By choosing error $\epsilon\cdot\mathrm{gap}$, we can ensure that $x$ is actually close to $v_1$, i.e. that $|x^\top v_1| \ge 1-\epsilon$. Further, we obtain the same asymptotic runtime since $O\left(\log\frac{1}{\epsilon\,\mathrm{gap}} + \log^2\frac{d}{\mathrm{gap}}\right) = O\left(\log\frac{1}{\epsilon} + \log^2\frac{d}{\mathrm{gap}}\right)$. We compare our runtimes with previous work in Table 1.

In the online case, in Theorem 25, we show how to improve an $O(\mathrm{gap})$ approximation to the top eigenvector to an $\epsilon$ approximation with constant probability using $O\left(\frac{v(\mathcal{D})}{\mathrm{gap}\cdot\epsilon}\right)$ samples, where $v(\mathcal{D})$ is a natural variance measure. Our algorithm is based on the streaming SVRG algorithm of (Frostig et al., 2015a). It requires just $O(d)$ amortized time per sample, uses just $O(d)$ space, and is easily parallelized. We can apply our result in a variety of regimes, using existing algorithms to obtain the initial $O(\mathrm{gap})$ approximation and our algorithm to refine this solution. As shown in Table 2, this gives improved runtimes and sample complexities over existing work. Notably, we give better asymptotic sample complexity than known matrix concentration results for general distributions, and give the first streaming algorithm that is asymptotically optimal in the popular Gaussian spike model.

Our robust shifted-and-inverted power method analysis provides new understanding of this widely implemented technique. It gives a means of obtaining provably accurate results when each iteration is implemented using solvers with weak accuracy guarantees.
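As a concrete illustration of the potential function above, here is a small sketch that evaluates $G(x)$ directly. Note that it needs the true top eigenvector $v_1$ and the shifted matrix $B$, so it is a diagnostic quantity for analysis or experiments, not something the algorithm itself computes; the function name and interface are ours.

```python
import numpy as np

def potential_G(x, v1, B):
    """Sketch of G(x) = ||P_{v1-perp} x||_B / ||P_{v1} x||_B, where
    ||y||_B = sqrt(y^T B y). Requires the true top eigenvector v1, so this
    is only useful for analysis/diagnostics on synthetic problems."""
    B_norm = lambda y: np.sqrt(y @ B @ y)
    proj_v1 = (v1 @ x) * v1        # component of x along v1
    proj_perp = x - proj_v1        # component orthogonal to v1
    return B_norm(proj_perp) / B_norm(proj_v1)
```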
In practice, the reduction between approximate linear system solving and eigenvector computation shows that optimized regression libraries can be leveraged for faster eigenvector computation in many cases. Furthermore, in theory we believe the reduction suggests computational limits inherent in eigenvector computation, as seen through the often easier-to-analyze problem of linear system solving. Indeed, in Section 7 of our full paper we provide evidence that in certain regimes our statistical results are optimal.

1.3. Previous Work

OFFLINE EIGENVECTOR COMPUTATION

Due to its universal applicability, eigenvector computation in the offline case is extremely well studied. Classical methods, such as the QR algorithm, take roughly $O(nd^2)$ time to compute a full eigendecomposition. They can be accelerated using fast matrix multiplication (Williams, 2012; Le Gall, 2014), however they remain prohibitively expensive for large matrices. Hence, faster iterative methods are often employed, especially when only the top eigenvector (or a few of the top eigenvectors) is desired.

As discussed, the popular power method requires $O(\log(d/\epsilon)/\mathrm{gap})$ iterations to converge to an $\epsilon$ approximate top eigenvector. Using Chebyshev iteration or the Lanczos method, this bound can be improved to $O\left(\log(d/\epsilon)/\sqrt{\mathrm{gap}}\right)$ iterations (Saad, 1992), giving a total runtime of $O\left(\mathrm{nnz}(A)\log(d/\epsilon)/\sqrt{\mathrm{gap}}\right)$. When $\epsilon > \mathrm{gap}$, the gap terms in these runtimes can be replaced by $\epsilon$. We focus on the high-precision regime when $\epsilon < \mathrm{gap}$, but also give gap-free bounds in Section 8 of our full paper.

Unfortunately, if $\mathrm{nnz}(A)$ is very large and gap is small, the above runtimes can still be quite expensive, and there is a natural desire to separate the $1/\sqrt{\mathrm{gap}}$ dependence from the $\mathrm{nnz}(A)$ term. One approach is to use random subspace embedding matrices (Ailon & Chazelle, 2009; Clarkson & Woodruff, 2013) or fast row sampling algorithms (Cohen et al., 2015), which can be applied in $O(\mathrm{nnz}(A))$ time and yield a matrix $\tilde{A}$ which is a good spectral approximation to the original. The number of rows in $\tilde{A}$ depends only on the stable rank of $A$ and the error of the embedding. Applying such a subspace embedding and then computing the top eigenvector of $\tilde{A}^\top\tilde{A}$ requires runtime $O\left(\mathrm{nnz}(A) + \mathrm{poly}(\mathrm{sr}(A), \epsilon, \mathrm{gap})\right)$, achieving the goal of reducing the runtime dependence on the input size $\mathrm{nnz}(A)$. Unfortunately, the dependence on $\epsilon$ is significantly suboptimal: such an approach cannot be used to obtain a linearly convergent algorithm. Further, the technique does not extend naturally to the online setting.

Another approach, which we follow more closely, is to apply stochastic optimization techniques, which iteratively update an estimate of the top eigenvector, considering a random row of $A$ with each update step. Such algorithms naturally extend to the online setting and have led to improved dependence on the input size for a variety of problems (Bottou, 2010). Using variance-reduced stochastic gradient techniques, (Shamir, 2015c) achieves a runtime bound assuming an upper bound on the squared row norms of $A$. In the best case, when row norms are uniform, this runtime can be simplified to $O\left(\left[\mathrm{nnz}(A) + \frac{d\,\mathrm{sr}(A)^2}{\mathrm{gap}^2}\right]\log(1/\epsilon)\log\log(1/\epsilon)\right)$. This result makes an important contribution in separating input size and gap dependencies using stochastic optimization techniques. Unfortunately, the algorithm requires an approximation to the eigenvalue gap and a starting vector that has a constant dot product with the top eigenvector.
Initializing with a random vector loses polynomial factors in $d$ (Shamir, 2015b), on top of the already suboptimal dependencies on $\mathrm{sr}(A)$ and $\epsilon$.

Table 1. Comparison to previous work on Offline Eigenvector Estimation. We give runtimes for computing a unit vector $x$ such that $x^\top A^\top A x \ge (1-\epsilon)\lambda_1$ in the regime $\epsilon = O(\mathrm{gap})$.
- Power Method: $O\left(\frac{\mathrm{nnz}(A)\log(d/\epsilon)}{\mathrm{gap}}\right)$
- Lanczos Method: $O\left(\frac{\mathrm{nnz}(A)\log(d/\epsilon)}{\sqrt{\mathrm{gap}}}\right)$
- Fast Subspace Embeddings (Clarkson & Woodruff, 2013) + Lanczos: $O\left(\mathrm{nnz}(A) + \frac{d\,\mathrm{sr}(A)}{\max\{\mathrm{gap}^{2.5}\epsilon,\ \epsilon^{2.5}\}}\right)$
- SVRG (Shamir, 2015c) (assuming bounded row norms and warm start): $O\left(\left[\mathrm{nnz}(A) + \frac{d\,\mathrm{sr}(A)^2}{\mathrm{gap}^2}\right]\log(1/\epsilon)\log\log(1/\epsilon)\right)$
- Theorem 16: $O\left(\left[\mathrm{nnz}(A) + \frac{d\,\mathrm{sr}(A)}{\mathrm{gap}^2}\right]\left[\log\frac{1}{\epsilon} + \log^2\frac{d}{\mathrm{gap}}\right]\right)$
- Theorem 17 (full paper): $O\left(\left[\frac{\mathrm{nnz}(A)^{3/4}(d\,\mathrm{sr}(A))^{1/4}}{\sqrt{\mathrm{gap}}}\right]\left[\log\frac{d}{\mathrm{gap}}\log\frac{1}{\epsilon} + \log^3\frac{d}{\mathrm{gap}}\right]\right)$

ONLINE EIGENVECTOR COMPUTATION

In the online, or statistical, setting, research often looks beyond runtime. One focus is on minimizing the sample size required to achieve a given accuracy. Another common focus is on obtaining streaming algorithms, which use just $O(d)$ space, proportional to the size of a single sample. In this section, in order to more easily compare to previous work, we normalize $\lambda_1 = 1$ and assume we have the row norm bound $\|a\|_2^2 \le O(d)$, which then gives us the variance bound $\left\|\mathbb{E}_{a\sim\mathcal{D}}(aa^\top)^2\right\|_2 = O(d)$. Additionally, we compare runtimes for computing some $x$ such that $|x^\top v_1| \ge 1-\epsilon$, as this is the most popular guarantee studied in the literature. Theorem 25 is easily extended to this setting, as obtaining $x$ with $x^\top\Sigma x \ge (1-\epsilon\cdot\mathrm{gap})\lambda_1$ ensures $|x^\top v_1| \ge 1-\epsilon$. Our algorithm requires $O(d/(\mathrm{gap}^2\epsilon))$ samples to find such a vector under the above assumptions.

The simplest algorithm in this setting is to take $n$ samples from $\mathcal{D}$ and compute the leading eigenvector of the empirical estimate $\widehat{\mathbb{E}}[aa^\top] = \frac{1}{n}\sum_{i=1}^n a_i a_i^\top$. By a matrix Bernstein bound (Tropp, 2015), $O\left(\frac{d\log d}{\mathrm{gap}^2\epsilon}\right)$ samples are enough to ensure $\left\|\widehat{\mathbb{E}}[aa^\top] - \mathbb{E}[aa^\top]\right\|_2 \le \sqrt{\epsilon}\cdot\mathrm{gap}$. By Lemma 36 in our full version, this ensures that, if $x$ is set to the top eigenvector of $\widehat{\mathbb{E}}[aa^\top]$, it will satisfy $|x^\top v_1| \ge 1-\epsilon$.

A large body of work focuses on improving this simple algorithm. In Table 2 we give a sampling of results, all of which rely on distributional assumptions at least as strong as those given above. Note that, in each setting, we can use the cited algorithm to first compute an $O(\mathrm{gap})$ approximate eigenvector, and then refine this approximation with our streaming algorithm from Theorem 25 using $O\left(\frac{d}{\mathrm{gap}^2\epsilon}\right)$ samples. This gives us improved runtime and sample complexity results. Notably, by the lower bound in Section 7 of our full paper, in all settings considered in Table 2, we achieve optimal asymptotic sample complexity: as our sample size grows large, $\epsilon$ decreases at an optimal rate. To save space, we do not show our improved runtime bounds, but they are easy to derive by adding the runtime required by the given algorithm to achieve $O(\mathrm{gap})$ accuracy to $O\left(\frac{d^2}{\mathrm{gap}^2\epsilon}\right)$, the runtime of our streaming algorithm.

The bounds given for the simple matrix Bernstein based algorithm described above, Krasulina/Oja's algorithm (Balsubramani et al., 2013), and SGD (Shamir, 2015a) require no additional assumptions, aside from those given at the beginning of this section. The streaming results cited for (Mitliagkas et al., 2013) and (Hardt & Price, 2014) assume $a$ is generated from a Gaussian spike model, where $a_i = \sqrt{\lambda_1}\,\gamma_i v_1 + Z_i$ with $\gamma_i \sim N(0,1)$ and $Z_i \sim N(0, I_d)$.
We note that under this model, the matrix Bernstein results improve by a $\log d$ factor and so match our results in achieving an asymptotically optimal convergence rate. The results of (Mitliagkas et al., 2013) and (Hardt & Price, 2014) sacrifice this optimality in order to operate under the streaming model. Our work gives the best of both worlds: a streaming algorithm giving asymptotically optimal results. (Sa et al., 2015) assumes $\left\|\mathbb{E}\left[aa^\top W aa^\top\right]\right\|_2 \le O(1)\cdot\mathrm{tr}(W)$ for any symmetric $W$ that commutes with $\mathbb{E}[aa^\top]$. This is much stronger than the assumption above that $\left\|\mathbb{E}_{a\sim\mathcal{D}}(aa^\top)^2\right\|_2 = O(d)$, and there are easy examples where the above assumption holds while theirs does not.

1.4. Organization

While our general approach is simple, our proofs are quite involved, and hence most are omitted from the main paper. They can be found in our full paper. In Section 2 we review problem definitions and parameters. In Section 3 we describe the shifted-and-inverted power method and show how it can be implemented using approximate system solvers. In Section 4 we instantiate this framework, showing how to apply SVRG to solve systems in our shifted matrix and giving our main offline results. In Section 5 we discuss extending these techniques to the online setting. Finally, in Section 6 we discuss how to efficiently estimate the shift parameters required by our algorithms.

Table 2. Summary of existing Online Eigenvector Estimation results. Bounds are for computing a unit vector $x$ with $|x^\top v_1| \ge 1-\epsilon$. For each result, we obtain improved bounds by running the algorithm to compute an $O(\mathrm{gap})$ approximate eigenvector, and then using our algorithm to obtain an $\epsilon$ approximation using an additional $O\left(\frac{d}{\mathrm{gap}^2\epsilon}\right)$ samples, $O(d)$ space, and $O(d)$ work per sample.
- Matrix Bernstein + Lanczos (explicitly forming matrix): sample size $O\left(\frac{d\log d}{\mathrm{gap}^2\epsilon}\right)$; not streaming; our sample complexity $O\left(\frac{d\log d}{\mathrm{gap}^3} + \frac{d}{\mathrm{gap}^2\epsilon}\right)$.
- Matrix Bernstein + Lanczos (iteratively applying matrix): sample size $O\left(\frac{d\log d}{\mathrm{gap}^2\epsilon}\right)$; runtime $O\left(\frac{d^2\log d\,\log(d/\epsilon)}{\mathrm{gap}^{2.5}\epsilon}\right)$; not streaming; our sample complexity $O\left(\frac{d\log d}{\mathrm{gap}^3} + \frac{d}{\mathrm{gap}^2\epsilon}\right)$.
- Memory-efficient PCA (Mitliagkas et al., 2013), (Hardt & Price, 2014): sample size $O\left(\frac{d\log(d/\epsilon)}{\mathrm{gap}^3\epsilon}\right)$; runtime $O\left(\frac{d^2\log(d/\epsilon)}{\mathrm{gap}^3\epsilon}\right)$; streaming; our sample complexity $O\left(\frac{d\log(d/\mathrm{gap})}{\mathrm{gap}^4} + \frac{d}{\mathrm{gap}^2\epsilon}\right)$.
- Alecton (Sa et al., 2015): sample size $O\left(\frac{d\log(d/\epsilon)}{\mathrm{gap}^2\epsilon}\right)$; runtime $O\left(\frac{d^2\log(d/\epsilon)}{\mathrm{gap}^2\epsilon}\right)$; streaming; our sample complexity $O\left(\frac{d\log(d/\mathrm{gap})}{\mathrm{gap}^3} + \frac{d}{\mathrm{gap}^2\epsilon}\right)$.
- Krasulina / Oja's Algorithm (Balsubramani et al., 2013): sample size $O\left(\frac{d^{c_1}}{\mathrm{gap}^2\epsilon^{c_2}}\right)$; streaming; our sample complexity $O\left(\frac{d^{c_1}}{\mathrm{gap}^{2+c_2}} + \frac{d}{\mathrm{gap}^2\epsilon}\right)$.
- SGD (Shamir, 2015a): sample size $O\left(\frac{d^3\log(d/\epsilon)}{\epsilon^2}\right)$; runtime $O\left(\frac{d^4\log(d/\epsilon)}{\epsilon^2}\right)$; streaming; our sample complexity $O\left(\frac{d^3\log(d/\mathrm{gap})}{\mathrm{gap}^2} + \frac{d}{\mathrm{gap}^2\epsilon}\right)$.

2. Preliminaries

In the Offline Setting we are given $A \in \mathbb{R}^{n\times d}$ with rows $a^{(1)}, \ldots, a^{(n)}$ and wish to compute an approximation to the top eigenvector of $\Sigma \stackrel{\mathrm{def}}{=} A^\top A$. Specifically, for error $\epsilon$ we want a unit vector $x$ with $x^\top\Sigma x \ge (1-\epsilon)\lambda_1(\Sigma)$.

In the Online Setting we access an oracle returning independent samples from a distribution $\mathcal{D}$ on $\mathbb{R}^d$. We wish to approximate the top eigenvector of $\Sigma \stackrel{\mathrm{def}}{=} \mathbb{E}_{a\sim\mathcal{D}}[aa^\top]$; specifically, to find a unit vector $x$ with $x^\top\Sigma x \ge (1-\epsilon)\lambda_1(\Sigma)$.

2.1. Problem Parameters

Let $\lambda_1, \ldots, \lambda_d$ denote the eigenvalues of $\Sigma$ and $v_1, \ldots, v_d$ denote their corresponding eigenvectors. We define the eigenvalue gap by $\mathrm{gap} \stackrel{\mathrm{def}}{=} (\lambda_1-\lambda_2)/\lambda_1$. Our bounds also employ the following parameters.

In the Offline Setting: Let $\mathrm{sr}(A) \stackrel{\mathrm{def}}{=} \|A\|_F^2/\|A\|_2^2$ denote the stable rank of $A$. Note that we always have $\mathrm{sr}(A) \le \mathrm{rank}(A)$. Let $\mathrm{nnz}(A)$ denote the number of nonzero entries in $A$.

In the Online Setting: Let
$$v(\mathcal{D}) \stackrel{\mathrm{def}}{=} \frac{\left\|\mathbb{E}_{a\sim\mathcal{D}}\left[(aa^\top)^2\right]\right\|_2}{\left\|\mathbb{E}_{a\sim\mathcal{D}}(aa^\top)\right\|_2^2} = \frac{\left\|\mathbb{E}_{a\sim\mathcal{D}}\left[(aa^\top)^2\right]\right\|_2}{\lambda_1^2}$$
denote a natural upper bound on the variance of $\mathcal{D}$ in various settings. Note that $v(\mathcal{D}) \ge 1$.
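The offline parameters above are straightforward to compute for a small explicit matrix. The following sketch does so with a dense eigendecomposition, purely as a reference for experiments; avoiding exactly this kind of computation on large inputs is the point of the algorithms that follow, and the function name is ours.

```python
import numpy as np

def offline_problem_parameters(A):
    """Sketch: compute lambda_1, gap, sr(A), and nnz(A) for an explicit A.

    Uses a dense eigendecomposition of A^T A for illustration only; this is
    exactly the expensive computation the paper's algorithms avoid.
    """
    eigvals = np.linalg.eigvalsh(A.T @ A)                 # ascending eigenvalues of Sigma
    lam1, lam2 = eigvals[-1], eigvals[-2]
    gap = (lam1 - lam2) / lam1                            # relative eigengap
    sr = np.linalg.norm(A, 'fro') ** 2 / np.linalg.norm(A, 2) ** 2   # stable rank
    return {'lambda_1': lam1, 'gap': gap, 'sr': sr, 'nnz': int(np.count_nonzero(A))}
```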
3. Algorithmic Framework

Here we overview our robust shift-and-invert framework, focusing on intuition, with proofs relegated to our full paper. We let $B \stackrel{\mathrm{def}}{=} \lambda I - \Sigma$, and assume that we have a crude estimate of $\lambda_1$ and gap so that we can set $\lambda$ to satisfy
$$\left(1 + \frac{\mathrm{gap}}{150}\right)\lambda_1 \le \lambda \le \left(1 + \frac{\mathrm{gap}}{100}\right)\lambda_1.$$
(See Section 6 for how we can compute such a $\lambda$.) Note that $\lambda_i(B^{-1}) = \frac{1}{\lambda_i(B)} = \frac{1}{\lambda - \lambda_i}$ and so
$$\frac{\lambda_1(B^{-1})}{\lambda_2(B^{-1})} = \frac{\lambda - \lambda_2}{\lambda - \lambda_1} \ge \frac{\mathrm{gap}}{\mathrm{gap}/100} = 100.$$
This large gap ensures that, assuming the ability to apply $B^{-1}$, the power method will converge very quickly.

3.1. Potential Function

Our eigenvector algorithms aim to maximize the Rayleigh quotient $x^\top\Sigma x$ for unit $x$. However, to track the progress of our algorithm we use a different potential function. We define for $x \neq 0$:
$$G(x) \stackrel{\mathrm{def}}{=} \frac{\|P_{v_1^\perp}x\|_B}{\|P_{v_1}x\|_B} = \frac{\sqrt{\sum_{i\ge 2}\alpha_i^2/\lambda_i(B^{-1})}}{\sqrt{\alpha_1^2/\lambda_1(B^{-1})}}, \qquad (1)$$
where $P_{v_1}$ and $P_{v_1^\perp}$ denote the projections onto $v_1$ and the subspace orthogonal to $v_1$, and $\alpha_i = v_i^\top x$. When the Rayleigh quotient error $\epsilon = \lambda_1 - x^\top\Sigma x$ is small, we can show that $G(x)$ closely tracks $\epsilon$, so we can analyze our algorithms exclusively in terms of $G(x)$ and then convert the resulting bounds to Rayleigh quotient error (see Lemma 3 of our full paper).

3.2. Power Iteration

It is easy to see that the shifted-and-inverted power iteration makes progress with respect to our objective function given an exact linear system solver for $B$.

Theorem 4. Let $x$ be a unit vector with $\langle x, v_1\rangle \neq 0$ and let $\tilde{x} = B^{-1}x$, i.e. the power method update of $B^{-1}$ on $x$. Then, under our assumption on $\lambda$, we have:
$$G(\tilde{x}) \le \frac{\lambda_2(B^{-1})}{\lambda_1(B^{-1})}\,G(x) \le \frac{1}{100}\,G(x).$$

Proof. Writing $x$ in the eigenbasis, we have $x = \sum_i \alpha_i v_i$ and $\tilde{x} = \sum_i \alpha_i\lambda_i(B^{-1})v_i$. Since $\langle x, v_1\rangle \neq 0$, $\alpha_1 \neq 0$, and by the equivalent formulation of $G(x)$ given in (1):
$$G(\tilde{x}) = \frac{\sqrt{\sum_{i\ge 2}\alpha_i^2\,\lambda_i(B^{-1})}}{\sqrt{\alpha_1^2\,\lambda_1(B^{-1})}} \le \frac{\lambda_2(B^{-1})}{\lambda_1(B^{-1})}\cdot\frac{\sqrt{\sum_{i\ge 2}\alpha_i^2/\lambda_i(B^{-1})}}{\sqrt{\alpha_1^2/\lambda_1(B^{-1})}} = \frac{\lambda_2(B^{-1})}{\lambda_1(B^{-1})}\,G(x).$$
Recalling that $\lambda_1(B^{-1})/\lambda_2(B^{-1}) \ge \frac{\mathrm{gap}}{\mathrm{gap}/100} = 100$ yields the result.

In the next section we show that Theorem 4 can be made robust: we still make progress on our objective function even if we only approximate $B^{-1}x$ using a fast linear system solver.

3.3. Approximate Power Iteration

We can show that each iteration of the shifted-and-inverted power method makes constant expected progress on $G(x)$ assuming we:
1. Start with a sufficiently good $x$ and an approximation of $\lambda_1$.
2. Can apply $B^{-1}$ approximately using a system solver such that the function error (i.e. the distance to $B^{-1}x$ in the $B$ norm) is sufficiently small in expectation.
3. Can estimate Rayleigh quotients over $\Sigma$ well enough to only accept updates that do not hurt progress on the objective function too much.

Note that the second assumption is very weak. An expected progress bound allows, for example, the solver to occasionally return a solution that is entirely orthogonal to $v_1$, causing us to make unbounded backwards progress. The third assumption allows us to reject possibly harmful updates and ensure that we still make progress in expectation. In the offline setting, we can access $A$ and are able to compute Rayleigh quotients trivially. However, we only assume the ability to estimate quotients, since in the online setting we only have access to $\Sigma$ through samples from $\mathcal{D}$.

Theorem 5 (Approximate Shifted-and-Inverted Power Iteration, Warm Start). Let $x = \sum_i \alpha_i v_i$ be a unit vector such that $G(x) \le \frac{1}{\sqrt{10}}$. Suppose we have an estimate $\hat\lambda_1$ of $\lambda_1$ such that $\frac{10}{11}(\lambda - \lambda_1) \le \lambda - \hat\lambda_1 \le \lambda - \lambda_1$.
Furthermore, suppose we have a subroutine $\mathrm{solve}(\cdot)$ such that on any input $x$,
$$\mathbb{E}\left\|\mathrm{solve}(x) - B^{-1}x\right\|_B \le \frac{c_1}{1000}\sqrt{\lambda_1(B^{-1})}$$
for some $c_1 < 1$, and a subroutine $\widehat{\mathrm{quot}}(\cdot)$ that on any input $x \neq 0$ satisfies
$$\left|\widehat{\mathrm{quot}}(x) - \mathrm{quot}(x)\right| \le \frac{1}{30}(\lambda - \lambda_1), \quad\text{where } \mathrm{quot}(x) \stackrel{\mathrm{def}}{=} \frac{x^\top\Sigma x}{x^\top x}.$$
Let $\hat{x} = \mathrm{solve}(x)$. Then the update procedure
$$\tilde{x} = \begin{cases} \hat{x}/\|\hat{x}\|_2 & \text{if } \widehat{\mathrm{quot}}(\hat{x}) \ge \hat\lambda_1 - \frac{1}{6}(\lambda - \hat\lambda_1) \text{ and } \|\hat{x}\|_2 \ge \frac{2}{3}\cdot\frac{1}{\lambda - \hat\lambda_1}, \\ x & \text{otherwise,}\end{cases}$$
satisfies $G(\tilde{x}) \le \frac{1}{\sqrt{10}}$ and $\mathbb{E}[G(\tilde{x})] \le \frac{3}{25}G(x) + \frac{c_1}{500}$.

That is, not only do we decrease our potential function by a constant factor in expectation, but we are guaranteed that the potential function will never increase beyond $\frac{1}{\sqrt{10}}$. Roughly, the proof of Theorem 5 shows that, conditioning on accepting our iterative step, $\mathbb{E}[G(\hat{x})]$ is bounded by the ratio $\frac{\|P_{v_1^\perp}B^{-1}x\|_B}{\|P_{v_1}B^{-1}x\|_B}$ achieved by an exact update, plus an additive term proportional to $\mathbb{E}\,\|\hat{x} - B^{-1}x\|_B$. That is, the potential function decreases as in the exact case (Theorem 4), with additional additive error due to the inexact linear system solve.

Theorem 5 assumes that we can solve linear systems to some absolute accuracy in expectation. However, system solvers typically only guarantee relative progress bounds with respect to an initial estimate of $B^{-1}x$. Fortunately, we can show that approximating $B^{-1}x$ with $\frac{1}{x^\top Bx}x$, and applying a linear system solver that improves this estimate by a constant factor in expectation, gives small enough error to make progress in each power iteration:

Corollary 6 (Relative Error Linear System Solvers). For any unit vector $x$, we have
$$\left\|\frac{1}{x^\top Bx}x - B^{-1}x\right\|_B \le \alpha_1\sqrt{\lambda_1(B^{-1})}\,G(x),$$
so instantiating Theorem 5 with $c_1 = \alpha_1 G(x)$ gives $\mathbb{E}[G(\tilde{x})] \le \frac{4}{25}G(x)$ as long as
$$\mathbb{E}\left\|\mathrm{solve}(x) - B^{-1}x\right\|_B \le \frac{1}{1000}\left\|\frac{1}{\lambda - x^\top\Sigma x}x - B^{-1}x\right\|_B.$$

3.4. Initialization

Theorem 5 and Corollary 6 show that, given a good enough approximation to $v_1$, we can rapidly refine this approximation by applying the shifted-and-inverted power method with very coarse approximate linear system solves. To obtain the initial approximation, we rely on a burn-in period in which we solve each linear system to higher accuracy. During burn-in, we may have a very small component of $x$ in the direction of $v_1$, and so require higher accuracy to ensure that we do not lose this component.

Theorem 8 (Approximate Shifted-and-Inverted Power Method, Burn-In). Suppose we initialize $x_0 \sim N(0, I)$ and suppose we have access to a subroutine $\mathrm{solve}(\cdot)$ such that
$$\mathbb{E}\left\|\mathrm{solve}(x) - B^{-1}x\right\|_B \le \frac{1}{3000\,\kappa(B^{-1})\,d^{21}}\left\|\frac{1}{\lambda - x^\top\Sigma x}x - B^{-1}x\right\|_B,$$
where $\kappa(B^{-1}) = \frac{\lambda_1(B^{-1})}{\lambda_d(B^{-1})} = O\left(\frac{1}{\mathrm{gap}}\right)$. Then, iteratively computing
$$x_t = \frac{\mathrm{solve}(x_{t-1})}{\|\mathrm{solve}(x_{t-1})\|_2}$$
for $T = O(\log(d/\mathrm{gap}))$ iterations, we have $G(x_T) \le \frac{1}{\sqrt{10}}$ with probability $1 - O(1/d^{10})$.

4. Offline Eigenvector Computation

We now discuss how to instantiate the framework of Section 3 in the offline setting. We adapt Stochastic Variance Reduced Gradient (SVRG) (Johnson & Zhang, 2013) to solve linear systems in $B$. To solve a system in the positive definite matrix $A^\top A$, one optimizes the objective function $f(x) = \frac{1}{2}x^\top A^\top Ax - b^\top x$. This function is the sum of $n$ convex components, $\psi_i(x) = \frac{1}{2}x^\top a_i a_i^\top x - \frac{1}{n}b^\top x$. In each iteration of traditional gradient descent, one computes the full gradient of $f$ at the current iterate and takes a step in that direction. In stochastic gradient methods, at each iteration a single component is sampled, and the step direction is based only on the gradient of the sampled component. Hence, a full gradient computation is avoided at each iteration, leading to runtime gains. Unfortunately, while we have access to the rows of $A$ and so can solve systems in $A^\top A$, it is less clear how to solve systems in $B = \lambda I - A^\top A$.
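To make the shifted solver concrete, here is a simplified SVRG sketch based on the component splitting described in the next paragraph, with the simplest choice of uniform weights $w_i = \lambda/n$, uniform sampling, and a heuristic fixed step size. The paper's actual solver uses non-uniform sampling and carefully tuned parameters to obtain its stated runtime, so this is illustration only; all names are ours.

```python
import numpy as np

def svrg_shifted_solver(A, b, lam, epochs=20, inner_steps=None, eta=None, seed=0):
    """Simplified SVRG sketch for solving B x = b with B = lam * I - A^T A.

    Minimizes f(x) = 0.5 x^T B x - b^T x written as the sum of components
    psi_i(x) = 0.5 x^T (w_i I - a_i a_i^T) x - (1/n) b^T x with w_i = lam / n;
    individual components may be non-convex even though f is convex.
    Uniform sampling and the step size below are heuristics for illustration,
    not the paper's tuned variant.
    """
    n, d = A.shape
    rng = np.random.default_rng(seed)
    inner_steps = inner_steps or 2 * n
    if eta is None:
        # conservative step size based on the largest (scaled) component smoothness
        eta = 1.0 / (2.0 * (lam + n * (np.linalg.norm(A, axis=1) ** 2).max()))
    w = lam / n
    x = np.zeros(d)
    for _ in range(epochs):
        y = x.copy()
        full_grad = lam * y - A.T @ (A @ y) - b        # exact gradient B y - b at the anchor
        for _ in range(inner_steps):
            i = rng.integers(n)
            a_i = A[i]
            diff = x - y
            # variance-reduced gradient estimate; unbiased for B x - b
            g = n * (w * diff - a_i * (a_i @ diff)) + full_grad
            x = x - eta * g
    return x
```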
To solve systems in $B$, we split our function into components of the form $\psi_i(x) = \frac{1}{2}x^\top\left(w_i I - a_i a_i^\top\right)x - \frac{1}{n}b^\top x$ for some set of weights $w_i$ with $\sum_{i\in[n]} w_i = \lambda$. Importantly, $w_i I - a_i a_i^\top$ may not be positive semidefinite. We are minimizing a sum of functions which is convex, but consists of non-convex components. While recent results for minimizing such functions could be applied directly (Shalev-Shwartz, 2015; Csiba & Richtárik, 2015), in our full paper we obtain stronger results by using a more general form of SVRG and analyzing the specific properties of our function.

Our analysis shows that we can make constant factor progress in solving linear systems in $B$ in time $O\left(\mathrm{nnz}(A) + \frac{d\,\mathrm{sr}(A)}{\mathrm{gap}^2}\right)$. If $\frac{d\,\mathrm{sr}(A)}{\mathrm{gap}^2} \le \mathrm{nnz}(A)$ this gives a runtime proportional to the input size, the best we could hope for. If not, we show that it is possible to accelerate our system solver using the results of (Frostig et al., 2015b; Lin et al., 2015), achieving runtime $O\left(\frac{\mathrm{nnz}(A)^{3/4}(d\,\mathrm{sr}(A))^{1/4}}{\sqrt{\mathrm{gap}}}\log\frac{d}{\mathrm{gap}}\right)$. We show how to use the unaccelerated system solvers to obtain our main offline result. The analysis is identical in the accelerated case.

Theorem 16 (Shifted-and-Inverted Power Method with SVRG). Let $B = \lambda I - A^\top A$ for $\left(1+\frac{\mathrm{gap}}{150}\right)\lambda_1 \le \lambda \le \left(1+\frac{\mathrm{gap}}{100}\right)\lambda_1$ and let $x_0 \sim N(0, I)$ be a random initial vector. Running the shifted-and-inverted power method on $B$ initialized with $x_0$, using the SVRG solver from Theorem 12 (in our full paper) to approximately apply $B^{-1}$ at each step, returns $x$ such that, with probability $1 - O\left(\frac{1}{d^{10}}\right)$, $x^\top\Sigma x \ge (1-\epsilon)\lambda_1$, in total time
$$O\left(\left[\mathrm{nnz}(A) + \frac{d\,\mathrm{sr}(A)}{\mathrm{gap}^2}\right]\left[\log\frac{1}{\epsilon} + \log^2\frac{d}{\mathrm{gap}}\right]\right).$$

Instantiating the theorem with error $\epsilon\cdot\mathrm{gap}$, we can find a unit vector $x$ with $|v_1^\top x| \ge 1-\epsilon$ in the same asymptotic running time (an extra $\log(1/\mathrm{gap})$ term is absorbed into the $\log^2(d/\mathrm{gap})$ term).

Proof. By Theorem 8, if we start with $x_0 \sim N(0, I)$ we can run $O\left(\log\frac{d}{\mathrm{gap}}\right)$ iterations of the shifted-and-inverted power method to obtain $x_1$ with $G(x_1) \le 1/\sqrt{10}$ with probability $1 - O(1/d^{10})$. Each iteration requires applying a linear solver that decreases the initial error in expectation by a factor of $\frac{1}{\mathrm{poly}(d, 1/\mathrm{gap})}$. Such a solver is given by applying our constant factor solver $O(\log(d/\mathrm{gap}))$ times. So overall, in order to find $x_1$ with $G(x_1) \le 1/\sqrt{10}$, we require time $O\left(\left[\mathrm{nnz}(A) + \frac{d\,\mathrm{sr}(A)}{\mathrm{gap}^2}\right]\log^2\frac{d}{\mathrm{gap}}\right)$.

After this initial burn-in period we can apply Corollary 6 of Theorem 5. In each iteration, we only need to use a solver that decreases the initial error by a constant factor in expectation, so each iteration requires time $O\left(\mathrm{nnz}(A) + \frac{d\,\mathrm{sr}(A)}{\mathrm{gap}^2}\right)$. With $O(\log(d/\epsilon))$ iterations, we can obtain $x$ with $\mathbb{E}[G(x)^2] = O(\epsilon/d^{10})$, which is sufficient for the result. Note that the second stage requires $O\left(\log\frac{d}{\epsilon}\right) = O(\log d + \log(1/\epsilon))$ iterations to achieve the high probability bound. However, the $O(\log d)$ term is smaller than the $O\left(\log^2\frac{d}{\mathrm{gap}}\right)$ term, so it is absorbed into the asymptotic notation.

5. Online Eigenvector Computation

We now discuss how to extend our results to the online setting. This setting is somewhat more difficult since there is no canonical $A$; we only have access to the distribution $\mathcal{D}$ through samples. In order to apply Theorem 5 we must show how to both estimate the Rayleigh quotient as well as solve the requisite linear systems in expectation. Our Rayleigh quotient estimation procedure is standard: we first approximate the Rayleigh quotient by its empirical value on a batch of $k$ samples and prove, using Chebyshev's inequality, that the error on this sample is small with constant probability. We then repeat this procedure $O(\log(1/p))$ times and output the median, obtaining a good estimate with probability $1-p$.
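A minimal sketch of this median-of-means estimator, which Theorem 18 below formalizes; `sample_fn` and the other names are ours, and the constants in the usage example are placeholders rather than the theorem's exact settings.

```python
import numpy as np

def estimate_rayleigh_quotient(x, sample_fn, k, m):
    """Median-of-means sketch of the Rayleigh quotient estimator.

    x         : unit vector whose quotient x^T Sigma x we want to estimate
    sample_fn : callable returning one fresh sample a ~ D as a length-d array
    k         : batch size per mean (Theorem 18 uses k = 4 v(D) / eps^2)
    m         : number of batches, m = O(log(1/p)) for failure probability p
    """
    batch_means = []
    for _ in range(m):
        vals = [(x @ sample_fn()) ** 2 for _ in range(k)]  # R_{i,j} = (a^T x)^2
        batch_means.append(np.mean(vals))
    return float(np.median(batch_means))                   # median of the batch means

# Hypothetical usage with an isotropic Gaussian D:
# rng = np.random.default_rng(0); d = 20
# z = estimate_rayleigh_quotient(np.eye(d)[0], lambda: rng.standard_normal(d), k=400, m=9)
```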
Theorem 18 (Online Rayleigh Quotient Estimation). Given $\epsilon \in (0, 1]$, $p \in [0, 1]$, and a unit vector $x$, set $k = 4v(\mathcal{D})\epsilon^{-2}$ and $m = O(\log(1/p))$. For all $i \in [k]$ and $j \in [m]$ let $a_i^{(j)}$ be drawn independently from $\mathcal{D}$ and set $R_{i,j} = x^\top a_i^{(j)}(a_i^{(j)})^\top x$ and $R_j = \frac{1}{k}\sum_{i\in[k]} R_{i,j}$. If we let $z$ be the median value of the $R_j$, then with probability $1-p$ we have $|z - x^\top\Sigma x| \le \epsilon\lambda_1$.

The next challenge is to solve linear systems in $B$ in the streaming setting. We follow the general strategy of the offline algorithms given in Section 4, replacing traditional SVRG with the streaming SVRG algorithm of (Frostig et al., 2015a). Whereas in the offline case we could ensure that our initial error $\|x_0 - x_{\mathrm{opt}}\|_B^2$ is small by simply scaling by the Rayleigh quotient (Corollary 6), in the online case estimating the Rayleigh quotient to sufficient accuracy would require too many samples. Instead, we simply show how to use streaming SVRG to solve the desired linear systems to absolute accuracy without an initial point. Ultimately, due to the different error dependences in the online case, this guarantee suffices and the lack of an initial point is not a bottleneck.

Corollary 24 (Streaming SVRG Solver). Given a linear system $Bx = b$ with unit vector $b$, there is a streaming algorithm that returns $x$ with $\mathbb{E}\,\|x - x_{\mathrm{opt}}\|_B^2 \le 10c\,\lambda_1(B^{-1})$ using $O\left(\frac{v(\mathcal{D})}{\mathrm{gap}^2\,c}\right)$ samples from $\mathcal{D}$.

Note that convergence for this streaming algorithm is significantly worse than for offline SVRG algorithms. The number of samples we take is proportional to $1/c$, so, if we wanted to, for example, apply Theorem 8 we would need $\mathrm{poly}(1/\mathrm{gap}, d)$ samples (compared with just $\log(d/\mathrm{gap})$ iterations in the offline case). This is why we only give a warm-start algorithm in the online case: one that operates in the regime where coarse linear solves are sufficient. Our main theorem is:

Theorem 25 (Online Shifted-and-Inverted Power Method, Warm Start). Let $B = \lambda I - \Sigma$ for $\left(1+\frac{\mathrm{gap}}{150}\right)\lambda_1 \le \lambda \le \left(1+\frac{\mathrm{gap}}{100}\right)\lambda_1$ and let $x_0$ be some vector with $G(x_0) \le \frac{1}{\sqrt{10}}$. Running the shifted-and-inverted power method on $B$ initialized with $x_0$, using the streaming SVRG solver of Corollary 24 to approximately apply $B^{-1}$ at each step, returns $x$ such that $x^\top\Sigma x \ge (1-\epsilon)\lambda_1$ with constant probability, for any target $\epsilon < \mathrm{gap}$. The algorithm uses $O\left(\frac{v(\mathcal{D})}{\mathrm{gap}\cdot\epsilon}\right)$ samples and amortized $O(d)$ time per sample.

6. Parameter Estimation for Offline Eigenvector Computation

In Section 4, in order to invoke Theorems 5 and 8, we assumed knowledge of some $\lambda$ with $(1+\mathrm{gap}/150)\lambda_1 \le \lambda \le (1+\mathrm{gap}/100)\lambda_1$. Here we mention that it is possible to efficiently estimate this parameter, incurring a modest additional runtime cost. Algorithm 1 of our full paper uses the gap-free eigenvalue estimation algorithm of (Musco & Musco, 2015), applying the shifted-and-inverted power method with the SVRG based solver of Section 4 to two vectors simultaneously to compute estimates of both $\lambda_1$ and $\lambda_2$. Using the gap between these estimates, the algorithm iteratively refines its approximation of gap. Overall:

Theorem 26. There is an algorithm that, with probability $1 - O(1/d^{10})$, returns $\lambda$ with $(1+\mathrm{gap}/150)\lambda_1 \le \lambda \le (1+\mathrm{gap}/100)\lambda_1$ in time $O\left(\left[\mathrm{nnz}(A) + \frac{d\,\mathrm{sr}(A)}{\mathrm{gap}^2}\right]\log^3\frac{d}{\mathrm{gap}}\right)$.

Note that, by using the accelerated solver discussed in Section 4, we can also accelerate this to $O\left(\frac{\mathrm{nnz}(A)^{3/4}(d\,\mathrm{sr}(A))^{1/4}}{\sqrt{\mathrm{gap}}}\right)$. The runtime of Theorem 26 is within an $O(\log(d/\mathrm{gap}))$ factor of our runtimes that assume knowledge of $\lambda$. Additionally, note that this extra cost is separated from the $\epsilon$ dependencies in the runtimes.

References

Ailon, Nir and Chazelle, Bernard. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors.
SIAM Journal on Computing, 39(1):302-322, 2009.

Balsubramani, Akshay, Dasgupta, Sanjoy, and Freund, Yoav. The fast convergence of incremental PCA. In Advances in Neural Information Processing Systems 26 (NIPS), pp. 3174-3182, 2013.

Bottou, Léon. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT, pp. 177-186. Springer, 2010.

Clarkson, Kenneth L and Woodruff, David P. Low rank approximation and regression in input sparsity time. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC), pp. 81-90, 2013.

Cohen, Michael B, Lee, Yin Tat, Musco, Cameron, Musco, Christopher, Peng, Richard, and Sidford, Aaron. Uniform sampling for matrix approximation. In Proceedings of the 6th Conference on Innovations in Theoretical Computer Science (ITCS), pp. 181-190, 2015.

Csiba, Dominik and Richtárik, Peter. Primal method for ERM with flexible mini-batching schemes and non-convex losses. arXiv:1506.02227, 2015.

Frostig, Roy, Ge, Rong, Kakade, Sham M, and Sidford, Aaron. Competing with the empirical risk minimizer in a single pass. In Proceedings of the 28th Annual Conference on Computational Learning Theory (COLT), pp. 728-763, 2015a.

Frostig, Roy, Ge, Rong, Kakade, Sham M, and Sidford, Aaron. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015b.

Hardt, Moritz and Price, Eric. The noisy power method: A meta algorithm with applications. In Advances in Neural Information Processing Systems 27 (NIPS), pp. 2861-2869, 2014.

Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26 (NIPS), pp. 315-323, 2013.

Jolliffe, Ian. Principal component analysis. Wiley Online Library, 2002.

Koren, Yehuda. On spectral graph drawing. In Computing and Combinatorics, pp. 496-508. Springer, 2003.

Le Gall, François. Powers of tensors and fast matrix multiplication. In Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation, pp. 296-303. ACM, 2014.

Lin, Hongzhou, Mairal, Julien, and Harchaoui, Zaid. A universal catalyst for first-order optimization. arXiv:1506.02186, 2015.

Mitliagkas, Ioannis, Caramanis, Constantine, and Jain, Prateek. Memory limited, streaming PCA. In Advances in Neural Information Processing Systems 26 (NIPS), pp. 2886-2894, 2013.

Musco, Cameron and Musco, Christopher. Randomized block Krylov methods for stronger and faster approximate singular value decomposition. In Advances in Neural Information Processing Systems 28 (NIPS), 2015.

Ng, Andrew Y, Jordan, Michael I, and Weiss, Yair. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 15 (NIPS), pp. 849-856, 2002.

Page, Lawrence, Brin, Sergey, Motwani, Rajeev, and Winograd, Terry. The PageRank citation ranking: bringing order to the Web. 1999.

Sa, Christopher D, Ré, Christopher, and Olukotun, Kunle. Global convergence of stochastic gradient descent for some non-convex matrix problems. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 2332-2341, 2015.

Saad, Yousef. Numerical methods for large eigenvalue problems. SIAM, 1992.

Shalev-Shwartz, Shai. SDCA without duality. arXiv:1502.06177, 2015.

Shamir, Ohad. Convergence of stochastic gradient descent for PCA. arXiv:1509.09002, 2015a.
Shamir, Ohad. Fast stochastic algorithms for SVD and PCA: Convergence properties and convexity. arXiv:1507.08788, 2015b.

Shamir, Ohad. A stochastic PCA and SVD algorithm with an exponential convergence rate. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 144-152, 2015c.

Spielman, Daniel A. Spectral graph theory and its applications. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 29-38, 2007.

Tropp, Joel A. An introduction to matrix concentration inequalities. arXiv:1501.01571, 2015.

Vempala, Santosh and Wang, Grant. A spectral algorithm for learning mixture models. Journal of Computer and System Sciences, 68(4):841-860, 2004.

Williams, Virginia Vassilevska. Multiplying matrices faster than Coppersmith-Winograd. In Proceedings of the 44th Annual ACM Symposium on Theory of Computing (STOC), pp. 887-898, 2012.