# online_learning_of_eigenvectors__bebe2e06.pdf

Online Learning of Eigenvectors

Dan Garber DANGAR@TX.TECHNION.AC.IL Technion - Israel Institute of Technology

Elad Hazan EHAZAN@CS.PRINCETON.EDU Princeton University

Tengyu Ma TENGYUL@CS.PRINCETON.EDU Princeton University

Computing the leading eigenvector of a symmetric real matrix is a fundamental primitive of numerical linear algebra with numerous applications. We consider a natural online extension of the leading eigenvector problem: a sequence of matrices is presented and the goal is to predict for each matrix a unit vector, with the overall goal of competing with the leading eigenvector of the cumulative matrix. Existing regretminimization algorithms for this problem either require to compute an eigen decompostion every iteration, or suffer from a large dependency of the regret bound on the dimension. In both cases the algorithms are not practical for large scale applications.

In this paper we present new algorithms that avoid both issues. On one hand they do not require any expensive matrix decompositions and on the other, they guarantee regret rates with a mild dependence on the dimension at most. In contrast to previous algorithms, our algorithms also admit implementations that enable to leverage sparsity in the data to further reduce computation. We extend our results to also handle non-symmetric matrices.

1. Introduction

Computing the leading eigenvector of a symmetric real matrix is one of the most important problems in numerical linear algebra and an important primitive in many algorithms. Perhaps the best known application of this problem is the

Proceedings of the 32 nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

Principal Component Analysis problem, in which, roughly speaking, given a set of high-dimensional vectors, the problem is to ﬁnd a low-dimensional subspace such that the projection of the vectors onto this low-dimensional subspace is close on average to the original vectors. It is well known that the optimal solution to this problem is to project the high-dimensional vectors over the several leading eigenvectors of the covariance matrix of the data. In this paper we consider an online learning problem that is a natural extension of the leading eigenvector problem. A decision maker observes a sequence of matrices. Before a new matrix is revealed, the decision maker must commit to a unit vector. Once the matrix is revealed the decision maker gains the quadratic product of the selected unit vector with the revealed matrix, and his overall goal is to maximize the total reward. As standard in such settings, the performance of the decision maker is measured via the regret which is given by the difference between the total reward of the best ﬁxed unit vector in hindsight and the total reward of the decision maker. Indeed the best ﬁxed unit vector in hindsight is simply given by the leading eigenvector of the sum of revealed matrices and the associated total reward is the corresponding leading eigenvalue. This problem captures as a special case the problem known as Online Principal Component Analysis that was studied in (Warmuth & Kuzmin, 2006a), (Warmuth & Kuzmin, 2006b), (Nie et al., 2013), in case of a single principal component.

From an optimization theory point of view, the ofﬂine leading eigenvector problem is a non-convex quadratic optimization problem, since w.l.o.g. it requires to maximize a convex function over a non convex set, i.e. the unit sphere. However, it could be solved to high precision via eigendecomposition which takes O(n3) time where n is the size of the matrix, or via iterative approximation algorithms such as the Power or Lanczos algorithms whose running time for well-conditioned matrices is roughly O(nnz) where nnz denotes the number of non-zeros entries in the input matrix.

Online Learning of Eigenvectors

When considering the online problem, three different approaches come to mind which we now detail.

The Convexiﬁcation Approach The ﬁrst approach is to convexify the problem by liﬁting the decision variable from a unit vector to a matrix, more speciﬁcally a positive semideﬁnite matrix with unit trace. This approach corresponds to the problem of Online Linear Optimization over the Spectrahedron 1. This is also the approach taken in previous works on Online PCA (Warmuth & Kuzmin, 2006a), (Warmuth & Kuzmin, 2006b), (Nie et al., 2013). While this approach leads to theoretically efﬁcient algorithms with nearly optimal regret bounds, such as the Matrix Multiplicative Weights algorithm (Tsuda et al., 2005; Arora & Kale, 2007), their major drawback is that they require super-linear computation per iteration, i.e. they require to compute a full eigendcomposition which amounts to O(n3) arithmetic operations per iteration . The latter is true even if the support of the sequence of matrices is sparse.

The Oracle-based Approach A natural approach is to try to reduce the online problem to the ofﬂine one. That is, assume that we are given an oracle for the ofﬂine problem that given a query matrix returns its leading eigenvector, and derive an online algorithm that is based on making queries to the oracle. Such a reduction is possible either via the Follow the Perturbed Leader meta-algorithm (FPL) (Kalai & Vempala, 2005) or by the recent Online Frank Wolfe algorithm presented in (Hazan & Kale, 2012). Both of these algorithms require on each iteration of the online problem to make a single query to the eigenvector oracle which could be implemented using the Power or Lancsoz algorithms mentioned above with complexity of roughly O(nnz) arithmetic operations. While the per-iteration complexity of these methods is potentially much more favorable than the convexiﬁcation approach and despite the fact that the regret bound guaranteed by FPL is optimal in terms of the length of the sequence, it comes with the price of a large dependence on the dimension. Indeed when designing online learning algorithms, usually the primary goal is optimal dependence on the sequence length, however favorable dependence on the dimension is crucial in order for the proposed method to be of any practical signiﬁcance.

The Iterative Approach A third approach is to design online algorithms that directly tackle the non-convex optimization problem, i.e. online analogues of iterative algorithms for the ofﬂine problem such as the Power Method with an update step that roughly amounts to computing a single matrix-vector product, which is much more efﬁcient than both previous approaches. Such an approach is remi-

1formally deﬁned as {X 2 Rn n | X 0, Tr(X) = 1}.

niscent of the Online Gradient Decent method presented in (Zinkevich, 2003) which is an online analogue of the gradient descent method for ofﬂine convex optimization. For the speciﬁc problem of Stochastic Principal Component Analysis, such algorithms with provable guarantees exist (Balsubramani et al., 2013), (Shamir, 2014), however we are not aware of any such method for the online setting considered here.

Our interest in this work is to study algorithms for the online eigenvector problem that may be of use for large scale instances. Towards this end we part from the convexiﬁcation approach that was the main approach studied in previous related problems and requires super-linear computations, and focus on the oracle-based and iterative approaches which allow for more efﬁcient implementations and may leverage sparsity in the data.

1.1. Our Results

Our main result is an online algorithm that takes the so called oracle approach and is based on the Follow the Perturbed Leader meta-algorithm. The algorithm requires on each iteration to perform only a single call to an ofﬂine eigenvector oracle and attains near-optimal regret in terms of the sequence length. In contrast to previous such algorithms, the dependence of the obtained regret bound on the dimension is much more favorable. Moreover, as opposed to previous oracle-based approaches, our algorithm admits an implementation that may leverage sparsity in the data to further reduce computation. On the technical side, our algorithm is based on a novel analysis of FPL. While previous approaches to analyzing the FPL algorithm are geometric in nature, which seems to inevitably introduce a large dependence on the dimension, our approach exploits the speciﬁc structure of the problem at hand and is algebraic in nature. More precisely, we study the spectrum of symmetric real matrices under Gaussian perturbations and apply tools from matrix perturbation theory to derive the regret bound.

We also consider a somewhat easier stochastic setting, in which we assume that the sequence of matrices is sampled from a ﬁxed and unknown distribution. We present an algorithm that takes the so called iterative approach and is analogues to the Power algorithm for the ofﬂine setting, i.e. it computes a single matrix-vector product on each iteration. The regret of the algorithm is nearly optimal in terms of the sequence length and depends on the dimension only through a logarithmic factor. The analysis of the algorithm is especially accessible and requires only a black-box application of the ofﬂine Power method and a, by now standard, matrix concentration inequality.

A comparison of our results to previous related work is detailed in Table 1.

Online Learning of Eigenvectors

Method Regret bound Iteration complexity Matrix Mul. Weights (Tsuda et al., 2005; Arora & Kale, 2007)

T n3 (SVD) Follow the Perturbed Leader (Kalai & Vempala, 2005) (see Subsection 4.1) n5/4p

T EV Online Frank-Wolfe (Hazan & Kale, 2012) n1/2T 3/4 EV This paper, online setting (see Section 5)

n T EV This paper, stochastic setting (see Section 3)

Table 1. Comparison between different algorithms for the online eigenvector problem. We denote by EV the computation of the leading eigenvector of a given matrix. We omit the dependence on constants and logarithmic factors in n, T.

A performance measure that seems natural for comparing between online algorithms with different regret bounds and iteration complexity is the worst case overall time complexity to achieve average (expected) regret. This measure is important for instance when considering the application of online learning algorithms to saddle-point optimization problems (Grigoriadis & Khachiyan, 1995; Clarkson et al., 2012; Garber & Hazan, 2011). The overall complexity required for the Matrix Mul. Weights method to achieve average regret is O(n3 2). Our online algorithm on the other hand admits an implementation with overall running time O(n3/2 7/2nnz), where nnz denotes the jointsparsity of observed matrices (see subsection 5.1 for details). Hence in case 3/2nnz = O(n3/2), it is overall faster.

Finally, we extend our results to handle non-symmetric matrices as well.

2. Notation and Problem Setting

We denote by Sn the linear space of all n n real symmetric matrices. We denote by S the Euclidean sphere in Rn, that is S = {x 2 Rn | kxk2 = 1}. Given a matrix A 2 Sn we denote its eigenvalues by λ1(A) λ2(A) ...λn(A). We also refer to λ1(A) as λmax which is given by λmax = maxx2S x>Ax. We denote by δ(A) the eigengap of the matrix A which is given by δ(A) = λ1(A) λ2(A). Unless speciﬁed else, given a vector x 2 Rn we denote by kxk its standard Euclidean norm and given a matrix A 2 Sn we denote by k Ak its spectral norm. We also denote by k Ak F and k Ak the Frobenius and nuclear norms of A respectively. Recall that k Ak, k Ak are dual norms and thus according to Holder s inequality, it holds for any two matrices A, B 2 Sn that A B k Ak k Bk where denotes the standard inner product for matrices, that is A B = P

i,j Ai,j Bi,j.

In this work we consider the following repeated game: an adversary chooses a sequence of matrices A1, A2, ..., AT 2 Sn. Then, for T rounds, the player is required on each round t 2 [T] to choose a vector xt 2 S. After making his choice, the matrix At is revealed and the player gains the proﬁt x>

t Atxt. Such an adversary is referred to in the liter-

ature as oblivious since he chooses the sequence of matrices without any knowledge of the actual actions of the player. A stronger type of adversary, known as adaptive adversary, need not commit in advance to the entire sequence of matrices, but may choose on time t the matrix At to depend on the entire history of the game, that is on A1, ..., At 1 and x1, ..., xt 1. Throughout the paper (especially in Sections 4, 5) we consider only an oblivious adversary. In the full version of the paper we detail how our results could be easily extended to also handle an adaptive adversary.

We measure the overall performance of the player according to the regret which is given by.

regret T = max

In case the decision maker uses randomization to choose is actions it also makes sense to consider the expected regret which is given by

E[regret T ] = max

where the expectation is taken over the randomness introduced by the decision maker.

We assume without losing generality that k Atk 1 and that At is positive deﬁnite (note that adding a multiplicity of the identity matrix to At does not change the regret).

2.1. The Asymmetric Case

It makes sense to also consider the asymmetric case in which the input matrices A1, ..., AT are not necessarily symmetric but are m n real matrices for ﬁxed m, n. In this case a prediction is a rank-one matrix uv> where u 2 Rm, v 2 Rn and both are unit vectors. In this case the

Online Learning of Eigenvectors

regret is given by

regret T = max u 2 Rm, kuk = 1

v 2 Rn, kvk = 1

where σmax( ) denotes the largest singular value.

We now show how given a low-regret algorithm for the symmetric problem we can use it to achieve low-regret on the asymmetric problem via a randomized conversion. The algorithm is given below. The following lemma bounds the

Algorithm 1 Asymmetric to Symmetric Conversion Algorithm

1: Input: Algorithm A for the online eigenvector problem

in dimension m n. 2: for t = 1, ..., T do 3: Receive prediction xt 2 Rm+n from A. 4: Decompose xt into xt = ( ut, vt) for ut 2 Rm, vt 2 Rn. 5: With probability 2k utkk vtk set (ut, vt) =

ut k utk, vt k vtk

. and with remaining probability

set (ut, vt) = (u, v), for uniformly chosen unit vectors u 2 Rm, v 2 Rn. 6: Observe At.

7: Feed the matrix At =

regret of Algorithm 1 in terms of the regret of an algorithm for the symmetric problem 2. The proof is given in the appendix.

Lemma 1. Assume that for all t 2 [T] it holds that k Atk 1. Then it holds for all t 2 [T] that At is symmetric, k Atk 1 and

where the expectation is taken over the randomness in choosing ut, vt.

2Note that the matrix At deﬁned in the algorithm is not positive deﬁnite as we assumed in the eigenvector problem, however this could be easily ﬁxed by adding a multiplicity of the identity matrix and scaling accordingly to keep the unit upper bound on the spectral norm.

3. The Stochastic Setting

In this section we consider a stochastic setting which is somewhat easier than the online adversarial setting. In the stochastic setting we assume that all matrices A1, A2, ..., AT are sampled i.i.d. from a ﬁxed but unknown distribution D over matrices in Sn with spectral norm bounded by one and w.l.o.g. we assume that these matrices are positive deﬁnite . We denote the distribution mean by A = EM D[M].

Our algorithm titled Epoch Power Method for the stochastic setting is given below. It works by dividing the sequence into disjoint epochs, each of length which is a parameter that will be determined in the analysis. The algorithm predicts using a single unit vector throughout each epoch, while applying Power method update steps in order to compute the prediction for the following epoch.

We refer to an epoch by and denote by x( ) the point played throughout epoch . We also denote A!( ) =

t=1 At, that is the empirical mean of all matrices observed until the end of epoch .

In this section we prove the following theorem. Theorem 1. Let x1, x2, ..., x T denote the unit vectors played by Algorithm 2 throughout rounds 1, 2, ..., T. The following guarantees hold.

1. Given δ > 0, choosing block length =

δ e guarantees that with probability at least 1 δ

2. Choosing block length = d 1

p2T log 2n Te guarantees that

t Atxt] = O

We note that both results in the theorem are optimal up to log factors.

In what follows we always refer to x( ) as the ﬁnal value of this vector, that is, its value at the end of epoch 1.

In order to prove Theorem 1 we need the following two lemmas.

The following lemma is a straightforward application of a Bernstein concentration inequality for symmetric matrices (see (Tropp, 2012), theorem 1.4). Lemma 2. For any block and > 0 it holds that

2 min{ , T}

Online Learning of Eigenvectors

Algorithm 2 Epoch Power Method

1: Input: block length parameter . 2: Let v be a unit vector chosen uniformly at random from

the sphere. 3: for t = 1, ..., do 4: Play arbitrarily & observe At. 5: end for 6: x(2) v. 7: for = 2...d T/ e do 8: x( +1) v.

9: Let A!( 1) = 1 ( 1)

10: for t = ( 1) + 1... min{ , T} do 11: Play x( ) & observe At. 12: x( +1) A!( 1)x( +1)/k A!( 1)x( +1)k. 13: end for 14: end for

The following lemma is based on an analysis of the Power Method for computing the leading eigenvector of a positive deﬁnite matrix and gives a guarantee on the quality of the vectors x( ). For details see Theorem 4.1 in (Kuczy nski & Wo zniakowski, 1992).

Lemma 3. For any b T

c 2 and > 0, it holds with probability at least 1 n exp( ) that

( +1) A!( 1)x( +1) (1 )λmax( A!( 1)).

We can now prove Theorem 1.

Proof. We ﬁrst prove part 1 of the theorem, part 2 follows as a corollary.

Fix a block number 3 and an error tolerance . Using Lemmas 2, 3 we have that with probability at least 1 n exp( ) n exp

x( )Ax( ) x( ) A!( 2)x( ) λmax( A!( 2)) 2 λmax(A) 3 ,

where the ﬁrst and last inequalities follow from Lemma 2 and the second inequity follows from Lemma 3 and the observation that λmax( A!( 2)) 1.

Setting = 4

δ ( 2) we have that with probability at least

it holds that

x( )Ax( ) λmax(A) 12

Thus setting the block length to = d 1

have that with probability at least 1 δ

T it holds that

x( )Ax( ) λmax(A) 24

Summing over 3 we have that with probability at least 1 δ,

( )Ax( ) λmax(A)

!1/4 d T/ e X

Since each block accounts for iterations and in worst case the algorithm suffers loss of 1 on all ﬁrst 2 iterations we have that with probability at least 1 δ

We now prove part 2 of the theorem.

Since on any time t, xt is independent of At we have that

x2S x>(T A)x + k

E[T λmax(A)

(At A)k]. (2)

Let us denote by end the index of the last block. Note that k PT

t=1 At Ak = T k A!( end) Ak. By applying Lemma 2 we have that with probability at least 1 δ it holds that

At Ak = k A!( end) Ak 4

Online Learning of Eigenvectors

Plugging Eq. (1) and Eq. (3) into Eq. (2) we have that

(1 2δ) (4 + 13/2)

Thus setting δ = T 1, which corresponds to setting the block length to = d 1

p2T log 2n Te, gives the second part of the theorem.

4. The FPL Meta-Algorithm for the Online

In this section we overview the Follow the Perturbed Leader meta-algorithm for online learning and its application to the problem of online learning of eigenvectors. The meta-algorithm is given below. The algorithm relies on the availability of an oracle for the ofﬂine eigenvector problem, that is an oracle that given a matrix A, returns a leading eigenvector of A. We denote a call to this oracle by EV(A). Additionally, the algorithm relies on the availability of a distribution D over Sn from which it is possible to sample a perturbation matrix (efﬁciently). Different such distributions give rise to different instances of the algorithm with different regret guarantees.

Algorithm 3 Follow the Perturbed Leader

1: Input: distribution D over Sn. 2: Sample a matrix N D. 3: x1 EV(N). 4: for t = 1, 2, ... do 5: Play xt & observe At.

Theorem 2. The expected regret of algorithm 3 is upper bounded as follows.

E[regret T (FPL(D))]

t+1Atxt+1 x>

1 Nx1 x >Nx ]

where x = arg maxx2S x> PT

The proof is given in the appendix.

We now turn to survey two choices for the perturbationgenerating distribution D and their corresponding regret bounds. We show that even though these distributions give rise to optimal algorithms in terms of the sequence length T, they suffer from a large dependence on the problem s dimension n.

4.1. Entry-wise Uniform Perturbation

Following (Kalai & Vempala, 2005) we consider the following entry-wise uniform distribution Duni. Each coordinate i j in the perturbation matrix N is sampled U[0, 1/ ] and for each coordinate i < j we set Ni,j Nj,i (in order for the resulting perturbation to be symmetric).

In (Kalai & Vempala, 2005) it was shown that with such noise distribution, one can bound the expected regret of a single round t as follows (see proof of Theorem 1.1. in (Kalai & Vempala, 2005)).

t+1Atxt+1 x>

t Atxt] k Atk1,

where k Ak1 = P

i,j |Ai,j|.

Furthermore, since the sampled perturbation is bounded in 1, using Holder s inequality we have that

1 Nx1 x >Nx ]

EN Duni[kx1x>

1 x x >k1 k Nk1] D1

where D1 denotes the 1 diameter of the set {xx> | x 2 S}.

In order to derive the precise regret bound we need to bound both quantities k Atk1, D1. The proof of the following lemma is given in the appendix.

Lemma 4. It holds that D1 = O(n) and for all t it holds that k Atk1 = O(n3/2). These bounds are also tight.

Plugging the result of the lemma into Theorem 2 and optimizing over we have that

E[regret T (FPL(Duni))] = O

4.2. Exponentially-distributed Perturbation

A second distribution that we consider, Dexp, samples from Sn according to the following density.

dµ(N) / exp( k Nk), (5)

for appropriately chosen parameter .

A similar distribution was used for the problem of learning rotations (Hazan et al., 2010a) (altough in (Hazan et al., 2010a) the density was proportional to the nuclear norm and not the spectral as in Eq. (5)). For more details on how to sample from the distribution speciﬁed by Eq. (5) as well as a proof of the following lemma, the reader is referred to (Hazan et al., 2010b).

Lemma 5. It holds that k Nk Gamma(n2, 1/ ) and in particular E[k Nk] = n2

The following lemma upper bounds the expected regret on a single round t.

Online Learning of Eigenvectors

Lemma 6. On any time t 2 [T] it holds that

EN Dexp[(x>

t+1Atxt+1 x>

The proof is given in the appendix.

Plugging Lemmas 5, 6 into Theorem 2 and optimizing over we have that

E[regret T (FPL(Dexp))] = O

5. New Perturbation and Analysis for FPL via

Matrix Perturbation Theory

In this section we present our main result - a new noise distribution for the FPL algorithm and a corresponding analysis. In contrast to the analysis used in order to derive the regret bounds in Subsections 4.1, 4.2, which relies on geometric considerations and for which a large dependence of the regret bound on the problem s dimension n seems unavoidable, here we use a new analysis idea that is algebraic in nature and relies on tools from matrix perturbation theory which results in a much more moderate dependence on the dimension.

The new distribution, denoted Dnew, is based on a single parameter c and sampling from it is done as follows. We draw a vector v 2 Rn whose entries are i.i.d. N(0, 1) random variables and set the perturbation matrix to N = c vv>.

We prove the following theorem.

Theorem 3. Let c =

T n max{1, ln(T/n)}. Then

E[regret T (FPL(Dnew))] = O(

n T max{1, ln (T/n)}).

Aside from the important improvement in the dependence on the dimension (pn in Theorem 3 vs. at least n in Section 4), a key difference between the perturbations is in the efﬁciency of the resulting implementations. The key feature of the FPL algorithm for the online eigenvector problem is that the EV( ) oracle could be implemented using iterative methods such as the Power or Lanczos methods, that only require to compute matrix-vector products, to run in time that is typically O(nnz) where nnz denotes the sparsity of the input matrix. However, since the perturbations considered in Subsections 4.1. 4.2 are dense with high probability, even in case the support of all matrices At is sparse, the call to the oracle EV in Algorithm 3 will be with a dense matrix In contrast, the perturbation considered in this section is rank-one. Hence, while the perturbation matrix N = cvv> is still dense with high probability, computing the product of N with a vector requires only O(n) time which allows for an oracle implementation that could still beneﬁt computationally from sparsity in the data.

We now turn to prove Theorem 3.

The following classic result in matrix perturbation theory is known as the Davis-Kahan Sine Theroem, see (Davis & Kahan, 1970). For ease of presentation, we restate the theorem and give a self-contained proof in the appendix.

Theorem 4 (Davis Kahan sine theorem). Let A, B, E 2 Sn such that B = A+E and assume that δ(A) > 0. Denote by u A the top eigenvector of A and by u B the top eigenvector of B. It holds that

The following theorem constitutes the key technical ingredient in the proof of Theorem 3. It upper bounds the cumulative distribution function of the eigengap of the perturbed matrix A + cvv> for a given matrix A.

Theorem 5. Let A 2 Sn and let v be a vector of independent N(0, 1) random variables and let c be a positive scalar. Denote B = A + cvv>. Then for any > 0

Pr(δ(B) ) min{2

The proof is based on anti-concentration results for the leading eigenvalue of the perturbed matrix. Due to its length and technical detail it is deferred to Appendix C.3. Here for some intuition, we prove a weaker version of Theorem 5, which captures some of the key ideas.

Lemma 7. [Weaker version of Theorem 5] Let A 2 Sn and let v be a vector of independent N(0, 1) random variables and let c be a positive scalar. Denote B = A+cvv>. Then for any > 0

Proof. Weyl s eigenvalues inequality states that for any two matrices X, Y 2 Sn with eigenvalues λ1(X) λ2(X) ... λn(X) and λ1(Y ) λ2(Y ) ... λn(Y ) it holds for any i 2 [n] that

λi(X) + λn(Y ) λi(X + Y ) λi(X) + λ1(Y ). (7)

Applying Eq. (7) with X = cvv>, Y = A and i = 2 gives us that

λ2(B) λ2(cvv>) + λ1(A) = λ1(A),

where the equality follows since vv> is rank-one.

Thus we have that

δ(B) = λ1(B) λ2(B) λ1(A + cvv>) λ1(A).

Online Learning of Eigenvectors

Let us denote by u1 the eigenvector of A that corresponds to eigenvalue λ1(A). We continue to lower bound δ(B) as follows.

u1 λ1(A) = c(u>

Since v is a vector of independent N(0, 1) random variables we have that (u>

1 v) is also distributed as a N(0, 1) random variable. Thus (u>

1 v)2 is a Chi-squared random variable with a single degree of freedom, also denoted as χ2

1. If R is a χ2

1 random variable then it holds that for any > 0

Thus we have that for any > 0 it holds that

Pr(δ(B) ) Pr

The following lemma is a consequence of Theorem 5. The proof is given in the appendix.

Lemma 8. Given A 2 Sn let v be a random vector whose entries are independent N(0, 1) random variables and let c be a positive scalar. Denote δ = δ(A + cvv>). Deﬁne the random variable X = min(a, δ 1) where a is a given positive constant. Then we have E[X] 2

2 c max{ln( eac

We are now ready to prove Theorem 3.

Proof. Starting from Theorem 2 we have that

E[regret T (FPL(Dnew))]

t Atxt] + E[x>

1 Nx1 x >Nx ]

t k k Atk] + E[k Nk]

t k F ] + c E[kvk2],

where the second inequality follows from Holder s inequality and the third follows from the fact that for any matrix in Sn with k non-zero eigenvalues λ1, λ2, ..., λk it holds that

, and from our choice of the noise matrix N.

Denote St = Pt 1

=1 A . Since for any t, xt is the leading eigenvector of the matrix (St + N) and xt+1 is the leading

eigenvector of the matrix (St+1 + N) = (St + N) + At, we have by applying Theorem 4 that

t k F ] E[min{

2 k Atk δ(St + N)}]

2, 1 δ(St + N)}],

where the min term is used since obviously kxtx>

t+1k is upper bounded by

2. The second inequality follows since k Atk 1.

Since all entries of v are N(0, 1) random variables it holds that E[kvk2] = n and thus

E[regret T (FPL(Dnew))]

2, 1 δ(St + cvv>)}] + cn.

Applying the result of Lemma 8 with a = 1 2 for every t gives

E[regret T (FPL(Dnew))]

2 c max{ln( ec

2), 1} + cn.

Thus setting c =

T n max{ln(T/n), 1} yields the theorem.

5.1. Using Approximate Eigenvector Computations

So far we have assumed that the eigenvector oracle used in Algorithm 3 ﬁnds an exact leading eigenvector. In practice, it is much more efﬁcient to use iterative methods such as the Power and Lanczos algorithms to ﬁnd an approximate eigenvector. The following theorem states that indeed Algorithm 3 admits such an efﬁcient implementation, without sacriﬁcing the regret guarantee given in Theorem 3. The proof is given in the appendix.

Theorem 6. Algorithm 3, instantiated with noise distribution Dnew, admits an implementation such that the statement of Theorem 3 holds, and the worst-case time complexity of each iteration is O(n 1/4T 3/4nnz), where nnz denotes the joint-sparsity of all observed matrices A1, ..., AT .

Acknowledgments

The research leading to these results has received funding from the European Unions Seventh Framework Programme (FP7/2007-2013) under grant agreement n 336078 ERCSUBLRN.

Online Learning of Eigenvectors

Arora, Sanjeev and Kale, Satyen. A combinatorial, primal-

dual approach to semideﬁnite programs. In Proceedings of the 39th Annual ACM Symposium on Theory of Computing, STOC, 2007.

Balsubramani, Akshay, Dasgupta, Sanjoy, and Freund,

Yoav. The fast convergence of incremental PCA. In 27th Annual Conference on Neural Information Processing Systems, NIPS, 2013.

Cesa-Bianchi, Nicol o and Lugosi, G abor. Prediction, learning, and games. Cambridge University Press, 2006.

Clarkson, Kenneth L., Hazan, Elad, and Woodruff, David P.

Sublinear optimization for machine learning. Journal of the ACM, 59(5):23, 2012.

Davis, C. and Kahan, W. M. The rotation of eigenvectors

by a perturbation, III. SIAM J. Numer. Anal., 7, March 1970.

Garber, Dan and Hazan, Elad. Approximating semideﬁ-

nite programs in sublinear time. In 25th Annual Conference on Neural Information Processing Systems 201, pp. 1080 1088, 2011.

Grigoriadis, Michael D. and Khachiyan, Leonid G. A sublinear-time randomized approximation algorithm for matrix games. Oper. Res. Lett., 18(2):53 58, 1995.

Hazan, Elad and Kale, Satyen. Projection-free online learn-

ing. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012.

Hazan, Elad, Kale, Satyen, and Warmuth, Manfred K.

Learning rotations with little regret. In 23rd Conference on Learning Theory, COLT, 2010a.

Hazan, Elad, Kale, Satyen, and Warmuth, Manfred K. Cor-

rigendum to learning rotations with little regret. 2010b. URL http://ie.technion.ac.il/ ehazan/ papers/rotfix.pdf.

Kalai, Adam Tauman and Vempala, Santosh. Efﬁcient al-

gorithms for online decision problems. J. Comput. Syst. Sci., 71(3):291 307, 2005.

Kuczy nski, J. and Wo zniakowski, H. Estimating the largest

eigenvalues by the power and lanczos algorithms with a random start. SIAM J. Matrix Anal. Appl., 13:1094 1122, October 1992.

Nie, Jiazhong, Kotlowski, Wojciech, and Warmuth, Man-

fred K. Online PCA with optimal regrets. In 24th International Conference on Algorithmic Learning Theory, ALT, 2013.

Shamir, Ohad. A stochastic PCA algorithm with an expo-

nential convergence rate. Co RR, abs/1409.2848, 2014. URL http://arxiv.org/abs/1409.2848.

Sylvester, J.J. On the relation between the minor determi-

nants of linearly equivalent quadratic functions. Philosophical Magazine Series 4, 1(4):295 305, 1851.

Tropp, Joel A. User-friendly tail bounds for sums of ran-

dom matrices. Foundations of Computational Mathematics, 12(4):389 434, 2012.

Tsuda, Koji, R atsch, Gunnar, and Warmuth, Manfred K.

Matrix exponentiated gradient updates for on-line learning and bregman projection. Journal of Machine Learning Research, 6:995 1018, 2005.

Warmuth, Manfred K. and Kuzmin, Dima. Online variance

minimization. In 19th Annual Conference on Learning Theory, COLT, 2006a.

Warmuth, Manfred K. and Kuzmin, Dima. Randomized

PCA algorithms with regret bounds that are logarithmic in the dimension. In Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, NIPS, 2006b.

Zinkevich, Martin. Online convex programming and gen-

eralized inﬁnitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, ICML, 2003.