# online_learning_of_eigenvectors__bebe2e06.pdf Online Learning of Eigenvectors Dan Garber DANGAR@TX.TECHNION.AC.IL Technion - Israel Institute of Technology Elad Hazan EHAZAN@CS.PRINCETON.EDU Princeton University Tengyu Ma TENGYUL@CS.PRINCETON.EDU Princeton University Computing the leading eigenvector of a symmetric real matrix is a fundamental primitive of numerical linear algebra with numerous applications. We consider a natural online extension of the leading eigenvector problem: a sequence of matrices is presented and the goal is to predict for each matrix a unit vector, with the overall goal of competing with the leading eigenvector of the cumulative matrix. Existing regretminimization algorithms for this problem either require to compute an eigen decompostion every iteration, or suffer from a large dependency of the regret bound on the dimension. In both cases the algorithms are not practical for large scale applications. In this paper we present new algorithms that avoid both issues. On one hand they do not require any expensive matrix decompositions and on the other, they guarantee regret rates with a mild dependence on the dimension at most. In contrast to previous algorithms, our algorithms also admit implementations that enable to leverage sparsity in the data to further reduce computation. We extend our results to also handle non-symmetric matrices. 1. Introduction Computing the leading eigenvector of a symmetric real matrix is one of the most important problems in numerical linear algebra and an important primitive in many algorithms. Perhaps the best known application of this problem is the Proceedings of the 32 nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s). Principal Component Analysis problem, in which, roughly speaking, given a set of high-dimensional vectors, the problem is to find a low-dimensional subspace such that the projection of the vectors onto this low-dimensional subspace is close on average to the original vectors. It is well known that the optimal solution to this problem is to project the high-dimensional vectors over the several leading eigenvectors of the covariance matrix of the data. In this paper we consider an online learning problem that is a natural extension of the leading eigenvector problem. A decision maker observes a sequence of matrices. Before a new matrix is revealed, the decision maker must commit to a unit vector. Once the matrix is revealed the decision maker gains the quadratic product of the selected unit vector with the revealed matrix, and his overall goal is to maximize the total reward. As standard in such settings, the performance of the decision maker is measured via the regret which is given by the difference between the total reward of the best fixed unit vector in hindsight and the total reward of the decision maker. Indeed the best fixed unit vector in hindsight is simply given by the leading eigenvector of the sum of revealed matrices and the associated total reward is the corresponding leading eigenvalue. This problem captures as a special case the problem known as Online Principal Component Analysis that was studied in (Warmuth & Kuzmin, 2006a), (Warmuth & Kuzmin, 2006b), (Nie et al., 2013), in case of a single principal component. From an optimization theory point of view, the offline leading eigenvector problem is a non-convex quadratic optimization problem, since w.l.o.g. it requires to maximize a convex function over a non convex set, i.e. the unit sphere. However, it could be solved to high precision via eigendecomposition which takes O(n3) time where n is the size of the matrix, or via iterative approximation algorithms such as the Power or Lanczos algorithms whose running time for well-conditioned matrices is roughly O(nnz) where nnz denotes the number of non-zeros entries in the input matrix. Online Learning of Eigenvectors When considering the online problem, three different approaches come to mind which we now detail. The Convexification Approach The first approach is to convexify the problem by lifiting the decision variable from a unit vector to a matrix, more specifically a positive semidefinite matrix with unit trace. This approach corresponds to the problem of Online Linear Optimization over the Spectrahedron 1. This is also the approach taken in previous works on Online PCA (Warmuth & Kuzmin, 2006a), (Warmuth & Kuzmin, 2006b), (Nie et al., 2013). While this approach leads to theoretically efficient algorithms with nearly optimal regret bounds, such as the Matrix Multiplicative Weights algorithm (Tsuda et al., 2005; Arora & Kale, 2007), their major drawback is that they require super-linear computation per iteration, i.e. they require to compute a full eigendcomposition which amounts to O(n3) arithmetic operations per iteration . The latter is true even if the support of the sequence of matrices is sparse. The Oracle-based Approach A natural approach is to try to reduce the online problem to the offline one. That is, assume that we are given an oracle for the offline problem that given a query matrix returns its leading eigenvector, and derive an online algorithm that is based on making queries to the oracle. Such a reduction is possible either via the Follow the Perturbed Leader meta-algorithm (FPL) (Kalai & Vempala, 2005) or by the recent Online Frank Wolfe algorithm presented in (Hazan & Kale, 2012). Both of these algorithms require on each iteration of the online problem to make a single query to the eigenvector oracle which could be implemented using the Power or Lancsoz algorithms mentioned above with complexity of roughly O(nnz) arithmetic operations. While the per-iteration complexity of these methods is potentially much more favorable than the convexification approach and despite the fact that the regret bound guaranteed by FPL is optimal in terms of the length of the sequence, it comes with the price of a large dependence on the dimension. Indeed when designing online learning algorithms, usually the primary goal is optimal dependence on the sequence length, however favorable dependence on the dimension is crucial in order for the proposed method to be of any practical significance. The Iterative Approach A third approach is to design online algorithms that directly tackle the non-convex optimization problem, i.e. online analogues of iterative algorithms for the offline problem such as the Power Method with an update step that roughly amounts to computing a single matrix-vector product, which is much more efficient than both previous approaches. Such an approach is remi- 1formally defined as {X 2 Rn n | X 0, Tr(X) = 1}. niscent of the Online Gradient Decent method presented in (Zinkevich, 2003) which is an online analogue of the gradient descent method for offline convex optimization. For the specific problem of Stochastic Principal Component Analysis, such algorithms with provable guarantees exist (Balsubramani et al., 2013), (Shamir, 2014), however we are not aware of any such method for the online setting considered here. Our interest in this work is to study algorithms for the online eigenvector problem that may be of use for large scale instances. Towards this end we part from the convexification approach that was the main approach studied in previous related problems and requires super-linear computations, and focus on the oracle-based and iterative approaches which allow for more efficient implementations and may leverage sparsity in the data. 1.1. Our Results Our main result is an online algorithm that takes the so called oracle approach and is based on the Follow the Perturbed Leader meta-algorithm. The algorithm requires on each iteration to perform only a single call to an offline eigenvector oracle and attains near-optimal regret in terms of the sequence length. In contrast to previous such algorithms, the dependence of the obtained regret bound on the dimension is much more favorable. Moreover, as opposed to previous oracle-based approaches, our algorithm admits an implementation that may leverage sparsity in the data to further reduce computation. On the technical side, our algorithm is based on a novel analysis of FPL. While previous approaches to analyzing the FPL algorithm are geometric in nature, which seems to inevitably introduce a large dependence on the dimension, our approach exploits the specific structure of the problem at hand and is algebraic in nature. More precisely, we study the spectrum of symmetric real matrices under Gaussian perturbations and apply tools from matrix perturbation theory to derive the regret bound. We also consider a somewhat easier stochastic setting, in which we assume that the sequence of matrices is sampled from a fixed and unknown distribution. We present an algorithm that takes the so called iterative approach and is analogues to the Power algorithm for the offline setting, i.e. it computes a single matrix-vector product on each iteration. The regret of the algorithm is nearly optimal in terms of the sequence length and depends on the dimension only through a logarithmic factor. The analysis of the algorithm is especially accessible and requires only a black-box application of the offline Power method and a, by now standard, matrix concentration inequality. A comparison of our results to previous related work is detailed in Table 1. Online Learning of Eigenvectors Method Regret bound Iteration complexity Matrix Mul. Weights (Tsuda et al., 2005; Arora & Kale, 2007) T n3 (SVD) Follow the Perturbed Leader (Kalai & Vempala, 2005) (see Subsection 4.1) n5/4p T EV Online Frank-Wolfe (Hazan & Kale, 2012) n1/2T 3/4 EV This paper, online setting (see Section 5) n T EV This paper, stochastic setting (see Section 3) Table 1. Comparison between different algorithms for the online eigenvector problem. We denote by EV the computation of the leading eigenvector of a given matrix. We omit the dependence on constants and logarithmic factors in n, T. A performance measure that seems natural for comparing between online algorithms with different regret bounds and iteration complexity is the worst case overall time complexity to achieve average (expected) regret. This measure is important for instance when considering the application of online learning algorithms to saddle-point optimization problems (Grigoriadis & Khachiyan, 1995; Clarkson et al., 2012; Garber & Hazan, 2011). The overall complexity required for the Matrix Mul. Weights method to achieve average regret is O(n3 2). Our online algorithm on the other hand admits an implementation with overall running time O(n3/2 7/2nnz), where nnz denotes the jointsparsity of observed matrices (see subsection 5.1 for details). Hence in case 3/2nnz = O(n3/2), it is overall faster. Finally, we extend our results to handle non-symmetric matrices as well. 2. Notation and Problem Setting We denote by Sn the linear space of all n n real symmetric matrices. We denote by S the Euclidean sphere in Rn, that is S = {x 2 Rn | kxk2 = 1}. Given a matrix A 2 Sn we denote its eigenvalues by λ1(A) λ2(A) ...λn(A). We also refer to λ1(A) as λmax which is given by λmax = maxx2S x>Ax. We denote by δ(A) the eigengap of the matrix A which is given by δ(A) = λ1(A) λ2(A). Unless specified else, given a vector x 2 Rn we denote by kxk its standard Euclidean norm and given a matrix A 2 Sn we denote by k Ak its spectral norm. We also denote by k Ak F and k Ak the Frobenius and nuclear norms of A respectively. Recall that k Ak, k Ak are dual norms and thus according to Holder s inequality, it holds for any two matrices A, B 2 Sn that A B k Ak k Bk where denotes the standard inner product for matrices, that is A B = P i,j Ai,j Bi,j. In this work we consider the following repeated game: an adversary chooses a sequence of matrices A1, A2, ..., AT 2 Sn. Then, for T rounds, the player is required on each round t 2 [T] to choose a vector xt 2 S. After making his choice, the matrix At is revealed and the player gains the profit x> t Atxt. Such an adversary is referred to in the liter- ature as oblivious since he chooses the sequence of matrices without any knowledge of the actual actions of the player. A stronger type of adversary, known as adaptive adversary, need not commit in advance to the entire sequence of matrices, but may choose on time t the matrix At to depend on the entire history of the game, that is on A1, ..., At 1 and x1, ..., xt 1. Throughout the paper (especially in Sections 4, 5) we consider only an oblivious adversary. In the full version of the paper we detail how our results could be easily extended to also handle an adaptive adversary. We measure the overall performance of the player according to the regret which is given by. regret T = max In case the decision maker uses randomization to choose is actions it also makes sense to consider the expected regret which is given by E[regret T ] = max where the expectation is taken over the randomness introduced by the decision maker. We assume without losing generality that k Atk 1 and that At is positive definite (note that adding a multiplicity of the identity matrix to At does not change the regret). 2.1. The Asymmetric Case It makes sense to also consider the asymmetric case in which the input matrices A1, ..., AT are not necessarily symmetric but are m n real matrices for fixed m, n. In this case a prediction is a rank-one matrix uv> where u 2 Rm, v 2 Rn and both are unit vectors. In this case the Online Learning of Eigenvectors regret is given by regret T = max u 2 Rm, kuk = 1 v 2 Rn, kvk = 1 where σmax( ) denotes the largest singular value. We now show how given a low-regret algorithm for the symmetric problem we can use it to achieve low-regret on the asymmetric problem via a randomized conversion. The algorithm is given below. The following lemma bounds the Algorithm 1 Asymmetric to Symmetric Conversion Algorithm 1: Input: Algorithm A for the online eigenvector problem in dimension m n. 2: for t = 1, ..., T do 3: Receive prediction xt 2 Rm+n from A. 4: Decompose xt into xt = ( ut, vt) for ut 2 Rm, vt 2 Rn. 5: With probability 2k utkk vtk set (ut, vt) = ut k utk, vt k vtk . and with remaining probability set (ut, vt) = (u, v), for uniformly chosen unit vectors u 2 Rm, v 2 Rn. 6: Observe At. 7: Feed the matrix At = regret of Algorithm 1 in terms of the regret of an algorithm for the symmetric problem 2. The proof is given in the appendix. Lemma 1. Assume that for all t 2 [T] it holds that k Atk 1. Then it holds for all t 2 [T] that At is symmetric, k Atk 1 and where the expectation is taken over the randomness in choosing ut, vt. 2Note that the matrix At defined in the algorithm is not positive definite as we assumed in the eigenvector problem, however this could be easily fixed by adding a multiplicity of the identity matrix and scaling accordingly to keep the unit upper bound on the spectral norm. 3. The Stochastic Setting In this section we consider a stochastic setting which is somewhat easier than the online adversarial setting. In the stochastic setting we assume that all matrices A1, A2, ..., AT are sampled i.i.d. from a fixed but unknown distribution D over matrices in Sn with spectral norm bounded by one and w.l.o.g. we assume that these matrices are positive definite . We denote the distribution mean by A = EM D[M]. Our algorithm titled Epoch Power Method for the stochastic setting is given below. It works by dividing the sequence into disjoint epochs, each of length which is a parameter that will be determined in the analysis. The algorithm predicts using a single unit vector throughout each epoch, while applying Power method update steps in order to compute the prediction for the following epoch. We refer to an epoch by and denote by x( ) the point played throughout epoch . We also denote A!( ) = t=1 At, that is the empirical mean of all matrices observed until the end of epoch . In this section we prove the following theorem. Theorem 1. Let x1, x2, ..., x T denote the unit vectors played by Algorithm 2 throughout rounds 1, 2, ..., T. The following guarantees hold. 1. Given δ > 0, choosing block length = δ e guarantees that with probability at least 1 δ 2. Choosing block length = d 1 p2T log 2n Te guarantees that t Atxt] = O We note that both results in the theorem are optimal up to log factors. In what follows we always refer to x( ) as the final value of this vector, that is, its value at the end of epoch 1. In order to prove Theorem 1 we need the following two lemmas. The following lemma is a straightforward application of a Bernstein concentration inequality for symmetric matrices (see (Tropp, 2012), theorem 1.4). Lemma 2. For any block and > 0 it holds that 2 min{ , T} Online Learning of Eigenvectors Algorithm 2 Epoch Power Method 1: Input: block length parameter . 2: Let v be a unit vector chosen uniformly at random from the sphere. 3: for t = 1, ..., do 4: Play arbitrarily & observe At. 5: end for 6: x(2) v. 7: for = 2...d T/ e do 8: x( +1) v. 9: Let A!( 1) = 1 ( 1) 10: for t = ( 1) + 1... min{ , T} do 11: Play x( ) & observe At. 12: x( +1) A!( 1)x( +1)/k A!( 1)x( +1)k. 13: end for 14: end for The following lemma is based on an analysis of the Power Method for computing the leading eigenvector of a positive definite matrix and gives a guarantee on the quality of the vectors x( ). For details see Theorem 4.1 in (Kuczy nski & Wo zniakowski, 1992). Lemma 3. For any b T c 2 and > 0, it holds with probability at least 1 n exp( ) that ( +1) A!( 1)x( +1) (1 )λmax( A!( 1)). We can now prove Theorem 1. Proof. We first prove part 1 of the theorem, part 2 follows as a corollary. Fix a block number 3 and an error tolerance . Using Lemmas 2, 3 we have that with probability at least 1 n exp( ) n exp x( )Ax( ) x( ) A!( 2)x( ) λmax( A!( 2)) 2 λmax(A) 3 , where the first and last inequalities follow from Lemma 2 and the second inequity follows from Lemma 3 and the observation that λmax( A!( 2)) 1. Setting = 4 δ ( 2) we have that with probability at least it holds that x( )Ax( ) λmax(A) 12 Thus setting the block length to = d 1 have that with probability at least 1 δ T it holds that x( )Ax( ) λmax(A) 24 Summing over 3 we have that with probability at least 1 δ, ( )Ax( ) λmax(A) !1/4 d T/ e X Since each block accounts for iterations and in worst case the algorithm suffers loss of 1 on all first 2 iterations we have that with probability at least 1 δ We now prove part 2 of the theorem. Since on any time t, xt is independent of At we have that x2S x>(T A)x + k E[T λmax(A) (At A)k]. (2) Let us denote by end the index of the last block. Note that k PT t=1 At Ak = T k A!( end) Ak. By applying Lemma 2 we have that with probability at least 1 δ it holds that At Ak = k A!( end) Ak 4 Online Learning of Eigenvectors Plugging Eq. (1) and Eq. (3) into Eq. (2) we have that (1 2δ) (4 + 13/2) Thus setting δ = T 1, which corresponds to setting the block length to = d 1 p2T log 2n Te, gives the second part of the theorem. 4. The FPL Meta-Algorithm for the Online In this section we overview the Follow the Perturbed Leader meta-algorithm for online learning and its application to the problem of online learning of eigenvectors. The meta-algorithm is given below. The algorithm relies on the availability of an oracle for the offline eigenvector problem, that is an oracle that given a matrix A, returns a leading eigenvector of A. We denote a call to this oracle by EV(A). Additionally, the algorithm relies on the availability of a distribution D over Sn from which it is possible to sample a perturbation matrix (efficiently). Different such distributions give rise to different instances of the algorithm with different regret guarantees. Algorithm 3 Follow the Perturbed Leader 1: Input: distribution D over Sn. 2: Sample a matrix N D. 3: x1 EV(N). 4: for t = 1, 2, ... do 5: Play xt & observe At. Theorem 2. The expected regret of algorithm 3 is upper bounded as follows. E[regret T (FPL(D))] t+1Atxt+1 x> 1 Nx1 x >Nx ] where x = arg maxx2S x> PT The proof is given in the appendix. We now turn to survey two choices for the perturbationgenerating distribution D and their corresponding regret bounds. We show that even though these distributions give rise to optimal algorithms in terms of the sequence length T, they suffer from a large dependence on the problem s dimension n. 4.1. Entry-wise Uniform Perturbation Following (Kalai & Vempala, 2005) we consider the following entry-wise uniform distribution Duni. Each coordinate i j in the perturbation matrix N is sampled U[0, 1/ ] and for each coordinate i < j we set Ni,j Nj,i (in order for the resulting perturbation to be symmetric). In (Kalai & Vempala, 2005) it was shown that with such noise distribution, one can bound the expected regret of a single round t as follows (see proof of Theorem 1.1. in (Kalai & Vempala, 2005)). t+1Atxt+1 x> t Atxt] k Atk1, where k Ak1 = P i,j |Ai,j|. Furthermore, since the sampled perturbation is bounded in 1, using Holder s inequality we have that 1 Nx1 x >Nx ] EN Duni[kx1x> 1 x x >k1 k Nk1] D1 where D1 denotes the 1 diameter of the set {xx> | x 2 S}. In order to derive the precise regret bound we need to bound both quantities k Atk1, D1. The proof of the following lemma is given in the appendix. Lemma 4. It holds that D1 = O(n) and for all t it holds that k Atk1 = O(n3/2). These bounds are also tight. Plugging the result of the lemma into Theorem 2 and optimizing over we have that E[regret T (FPL(Duni))] = O 4.2. Exponentially-distributed Perturbation A second distribution that we consider, Dexp, samples from Sn according to the following density. dµ(N) / exp( k Nk), (5) for appropriately chosen parameter . A similar distribution was used for the problem of learning rotations (Hazan et al., 2010a) (altough in (Hazan et al., 2010a) the density was proportional to the nuclear norm and not the spectral as in Eq. (5)). For more details on how to sample from the distribution specified by Eq. (5) as well as a proof of the following lemma, the reader is referred to (Hazan et al., 2010b). Lemma 5. It holds that k Nk Gamma(n2, 1/ ) and in particular E[k Nk] = n2 The following lemma upper bounds the expected regret on a single round t. Online Learning of Eigenvectors Lemma 6. On any time t 2 [T] it holds that EN Dexp[(x> t+1Atxt+1 x> The proof is given in the appendix. Plugging Lemmas 5, 6 into Theorem 2 and optimizing over we have that E[regret T (FPL(Dexp))] = O 5. New Perturbation and Analysis for FPL via Matrix Perturbation Theory In this section we present our main result - a new noise distribution for the FPL algorithm and a corresponding analysis. In contrast to the analysis used in order to derive the regret bounds in Subsections 4.1, 4.2, which relies on geometric considerations and for which a large dependence of the regret bound on the problem s dimension n seems unavoidable, here we use a new analysis idea that is algebraic in nature and relies on tools from matrix perturbation theory which results in a much more moderate dependence on the dimension. The new distribution, denoted Dnew, is based on a single parameter c and sampling from it is done as follows. We draw a vector v 2 Rn whose entries are i.i.d. N(0, 1) random variables and set the perturbation matrix to N = c vv>. We prove the following theorem. Theorem 3. Let c = T n max{1, ln(T/n)}. Then E[regret T (FPL(Dnew))] = O( n T max{1, ln (T/n)}). Aside from the important improvement in the dependence on the dimension (pn in Theorem 3 vs. at least n in Section 4), a key difference between the perturbations is in the efficiency of the resulting implementations. The key feature of the FPL algorithm for the online eigenvector problem is that the EV( ) oracle could be implemented using iterative methods such as the Power or Lanczos methods, that only require to compute matrix-vector products, to run in time that is typically O(nnz) where nnz denotes the sparsity of the input matrix. However, since the perturbations considered in Subsections 4.1. 4.2 are dense with high probability, even in case the support of all matrices At is sparse, the call to the oracle EV in Algorithm 3 will be with a dense matrix In contrast, the perturbation considered in this section is rank-one. Hence, while the perturbation matrix N = cvv> is still dense with high probability, computing the product of N with a vector requires only O(n) time which allows for an oracle implementation that could still benefit computationally from sparsity in the data. We now turn to prove Theorem 3. The following classic result in matrix perturbation theory is known as the Davis-Kahan Sine Theroem, see (Davis & Kahan, 1970). For ease of presentation, we restate the theorem and give a self-contained proof in the appendix. Theorem 4 (Davis Kahan sine theorem). Let A, B, E 2 Sn such that B = A+E and assume that δ(A) > 0. Denote by u A the top eigenvector of A and by u B the top eigenvector of B. It holds that The following theorem constitutes the key technical ingredient in the proof of Theorem 3. It upper bounds the cumulative distribution function of the eigengap of the perturbed matrix A + cvv> for a given matrix A. Theorem 5. Let A 2 Sn and let v be a vector of independent N(0, 1) random variables and let c be a positive scalar. Denote B = A + cvv>. Then for any > 0 Pr(δ(B) ) min{2 The proof is based on anti-concentration results for the leading eigenvalue of the perturbed matrix. Due to its length and technical detail it is deferred to Appendix C.3. Here for some intuition, we prove a weaker version of Theorem 5, which captures some of the key ideas. Lemma 7. [Weaker version of Theorem 5] Let A 2 Sn and let v be a vector of independent N(0, 1) random variables and let c be a positive scalar. Denote B = A+cvv>. Then for any > 0 Proof. Weyl s eigenvalues inequality states that for any two matrices X, Y 2 Sn with eigenvalues λ1(X) λ2(X) ... λn(X) and λ1(Y ) λ2(Y ) ... λn(Y ) it holds for any i 2 [n] that λi(X) + λn(Y ) λi(X + Y ) λi(X) + λ1(Y ). (7) Applying Eq. (7) with X = cvv>, Y = A and i = 2 gives us that λ2(B) λ2(cvv>) + λ1(A) = λ1(A), where the equality follows since vv> is rank-one. Thus we have that δ(B) = λ1(B) λ2(B) λ1(A + cvv>) λ1(A). Online Learning of Eigenvectors Let us denote by u1 the eigenvector of A that corresponds to eigenvalue λ1(A). We continue to lower bound δ(B) as follows. u1 λ1(A) = c(u> Since v is a vector of independent N(0, 1) random variables we have that (u> 1 v) is also distributed as a N(0, 1) random variable. Thus (u> 1 v)2 is a Chi-squared random variable with a single degree of freedom, also denoted as χ2 1. If R is a χ2 1 random variable then it holds that for any > 0 Thus we have that for any > 0 it holds that Pr(δ(B) ) Pr The following lemma is a consequence of Theorem 5. The proof is given in the appendix. Lemma 8. Given A 2 Sn let v be a random vector whose entries are independent N(0, 1) random variables and let c be a positive scalar. Denote δ = δ(A + cvv>). Define the random variable X = min(a, δ 1) where a is a given positive constant. Then we have E[X] 2 2 c max{ln( eac We are now ready to prove Theorem 3. Proof. Starting from Theorem 2 we have that E[regret T (FPL(Dnew))] t Atxt] + E[x> 1 Nx1 x >Nx ] t k k Atk] + E[k Nk] t k F ] + c E[kvk2], where the second inequality follows from Holder s inequality and the third follows from the fact that for any matrix in Sn with k non-zero eigenvalues λ1, λ2, ..., λk it holds that , and from our choice of the noise matrix N. Denote St = Pt 1 =1 A . Since for any t, xt is the leading eigenvector of the matrix (St + N) and xt+1 is the leading eigenvector of the matrix (St+1 + N) = (St + N) + At, we have by applying Theorem 4 that t k F ] E[min{ 2 k Atk δ(St + N)}] 2, 1 δ(St + N)}], where the min term is used since obviously kxtx> t+1k is upper bounded by 2. The second inequality follows since k Atk 1. Since all entries of v are N(0, 1) random variables it holds that E[kvk2] = n and thus E[regret T (FPL(Dnew))] 2, 1 δ(St + cvv>)}] + cn. Applying the result of Lemma 8 with a = 1 2 for every t gives E[regret T (FPL(Dnew))] 2 c max{ln( ec 2), 1} + cn. Thus setting c = T n max{ln(T/n), 1} yields the theorem. 5.1. Using Approximate Eigenvector Computations So far we have assumed that the eigenvector oracle used in Algorithm 3 finds an exact leading eigenvector. In practice, it is much more efficient to use iterative methods such as the Power and Lanczos algorithms to find an approximate eigenvector. The following theorem states that indeed Algorithm 3 admits such an efficient implementation, without sacrificing the regret guarantee given in Theorem 3. The proof is given in the appendix. Theorem 6. Algorithm 3, instantiated with noise distribution Dnew, admits an implementation such that the statement of Theorem 3 holds, and the worst-case time complexity of each iteration is O(n 1/4T 3/4nnz), where nnz denotes the joint-sparsity of all observed matrices A1, ..., AT . Acknowledgments The research leading to these results has received funding from the European Unions Seventh Framework Programme (FP7/2007-2013) under grant agreement n 336078 ERCSUBLRN. Online Learning of Eigenvectors Arora, Sanjeev and Kale, Satyen. A combinatorial, primal- dual approach to semidefinite programs. In Proceedings of the 39th Annual ACM Symposium on Theory of Computing, STOC, 2007. Balsubramani, Akshay, Dasgupta, Sanjoy, and Freund, Yoav. The fast convergence of incremental PCA. In 27th Annual Conference on Neural Information Processing Systems, NIPS, 2013. Cesa-Bianchi, Nicol o and Lugosi, G abor. Prediction, learning, and games. Cambridge University Press, 2006. Clarkson, Kenneth L., Hazan, Elad, and Woodruff, David P. Sublinear optimization for machine learning. Journal of the ACM, 59(5):23, 2012. Davis, C. and Kahan, W. M. The rotation of eigenvectors by a perturbation, III. SIAM J. Numer. Anal., 7, March 1970. Garber, Dan and Hazan, Elad. Approximating semidefi- nite programs in sublinear time. In 25th Annual Conference on Neural Information Processing Systems 201, pp. 1080 1088, 2011. Grigoriadis, Michael D. and Khachiyan, Leonid G. A sublinear-time randomized approximation algorithm for matrix games. Oper. Res. Lett., 18(2):53 58, 1995. Hazan, Elad and Kale, Satyen. Projection-free online learn- ing. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012. Hazan, Elad, Kale, Satyen, and Warmuth, Manfred K. Learning rotations with little regret. In 23rd Conference on Learning Theory, COLT, 2010a. Hazan, Elad, Kale, Satyen, and Warmuth, Manfred K. Cor- rigendum to learning rotations with little regret. 2010b. URL http://ie.technion.ac.il/ ehazan/ papers/rotfix.pdf. Kalai, Adam Tauman and Vempala, Santosh. Efficient al- gorithms for online decision problems. J. Comput. Syst. Sci., 71(3):291 307, 2005. Kuczy nski, J. and Wo zniakowski, H. Estimating the largest eigenvalues by the power and lanczos algorithms with a random start. SIAM J. Matrix Anal. Appl., 13:1094 1122, October 1992. Nie, Jiazhong, Kotlowski, Wojciech, and Warmuth, Man- fred K. Online PCA with optimal regrets. In 24th International Conference on Algorithmic Learning Theory, ALT, 2013. Shamir, Ohad. A stochastic PCA algorithm with an expo- nential convergence rate. Co RR, abs/1409.2848, 2014. URL http://arxiv.org/abs/1409.2848. Sylvester, J.J. On the relation between the minor determi- nants of linearly equivalent quadratic functions. Philosophical Magazine Series 4, 1(4):295 305, 1851. Tropp, Joel A. User-friendly tail bounds for sums of ran- dom matrices. Foundations of Computational Mathematics, 12(4):389 434, 2012. Tsuda, Koji, R atsch, Gunnar, and Warmuth, Manfred K. Matrix exponentiated gradient updates for on-line learning and bregman projection. Journal of Machine Learning Research, 6:995 1018, 2005. Warmuth, Manfred K. and Kuzmin, Dima. Online variance minimization. In 19th Annual Conference on Learning Theory, COLT, 2006a. Warmuth, Manfred K. and Kuzmin, Dima. Randomized PCA algorithms with regret bounds that are logarithmic in the dimension. In Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, NIPS, 2006b. Zinkevich, Martin. Online convex programming and gen- eralized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, ICML, 2003.