# Matrix Eigen-decomposition via Doubly Stochastic Riemannian Optimization

Zhiqiang Xu (XUZQ@I2R.A-STAR.EDU.SG), Peilin Zhao (ZHAOP@I2R.A-STAR.EDU.SG), Jianneng Cao (CAOJN@I2R.A-STAR.EDU.SG), Xiaoli Li (XLLI@I2R.A-STAR.EDU.SG)
Institute for Infocomm Research, A*STAR, Singapore

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Abstract

Matrix eigen-decomposition is a classic and long-standing problem that plays a fundamental role in scientific computing and machine learning. Despite some existing algorithms for this inherently non-convex problem, the study remains inadequate for the needs of today's large data. To address this gap, we propose a Doubly Stochastic Riemannian Gradient EIGen Solver, DSRG-EIGS, where the double stochasticity comes from generalizing stochastic Euclidean gradient ascent and stochastic Euclidean coordinate ascent to Riemannian manifolds. As a result, it induces a greatly reduced complexity per iteration, enables the algorithm to completely avoid matrix inversion, and consequently makes it well-suited to large-scale applications. We theoretically analyze its convergence properties and empirically validate it on real-world datasets. Encouraging experimental results demonstrate its advantages over the deterministic counterpart.

1. Introduction

Matrix eigen-decomposition, which seeks a group of top eigenvectors of a given matrix (Golub & Van Loan, 1996), has found widespread applications in many areas of scientific and engineering computing, e.g., numerical computation (Press et al., 2007) and structural analysis (Torbjorn Ringertz, 1997). In particular, it plays a fundamental role in many machine learning tasks, such as spectral clustering (Ng et al., 2002), dimensionality reduction (Jolliffe, 2002), and kernel approximation (Drineas & Mahoney, 2005). Despite the great importance of this problem, existing solutions, i.e., eigensolvers, have been relatively lacking. Among them, the power method (Golub & Van Loan, 1996) and the (block) Lanczos algorithm (Parlett, 1998) are well-known eigensolvers, while randomized SVD (Halko et al., 2011) and online learning of eigenvectors (Garber et al., 2015) were proposed recently. In addition, matrix eigen-decomposition can be formulated as a quadratically constrained quadratic program, and thus can be addressed from the optimization perspective, for example by trace-penalty minimization (Wen et al., 2013). Notably, its non-convex constraint set constitutes a Riemannian manifold, or more precisely a Stiefel manifold, which turns it into a Riemannian optimization problem that can be tackled by methods of optimization on manifolds (Edelman et al., 1999; Absil et al., 2008; Wen & Yin, 2013). However, most existing eigensolvers perform batch learning, i.e., they use the entire dataset at each update step, and thus are not suitable for large-scale matrices, especially those that cannot fit completely into memory. To address this issue, we can resort to stochastic optimization, which enables an algorithm to work with access to only a subset of the data at a time. Moreover, stochastic algorithms often converge faster than their batch counterparts even when no memory issue arises.
To overcome the limitations of existing batch-learning eigensolvers, we propose a doubly stochastic Riemannian gradient method and obtain the DSRG-EIGS algorithm, a new eigensolver. The method simultaneously generalizes stochastic gradient ascent (SGA) and stochastic coordinate ascent (SCA) (Nesterov, 2012) from the Euclidean space to the Riemannian space, and arrives at a combination of their Riemannian counterparts: stochastic Riemannian gradient ascent (SRGA) and stochastic Riemannian coordinate ascent (SRCA). Specifically, SRGA works by sampling data sub-matrices, while SRCA proceeds by sampling column blocks of Riemannian gradient coordinates in our problem. Both keep the iterates on the manifold and the stochastic Riemannian gradients in the tangent space during iterations. They greatly reduce the complexity per iteration of the algorithm, especially for dense matrices. Meanwhile, the algorithm is able to completely avoid the matrix inversion required in its deterministic version, and thus works effectively when a large number of eigenvectors is desired. Furthermore, we provide a progressive analysis of the theoretical convergence properties of DSRG-EIGS, which shows that the algorithm converges to global solutions at a sub-linear rate in expectation and that it can take advantage of importance sampling (Zhao & Zhang, 2014) to improve the convergence rate.

The rest of the paper is organized as follows. In Section 2, we review some basics of matrix eigen-decomposition and Riemannian optimization. We present our doubly stochastic Riemannian gradient eigensolver, abbreviated as DSRG-EIGS, in Section 3, followed by the progressive theoretical analysis in Section 4. Experimental results are shown in Section 5. Section 6 discusses related work. Finally, Section 7 concludes the paper.

2. Preliminaries

2.1. Matrix Eigen-decomposition

The eigen-decomposition of a symmetric matrix $A \in \mathbb{R}^{n\times n}$ (the given matrix $A$ is assumed to be symmetric throughout the paper, i.e., $A^\top = A$) says that $A = U\Lambda U^\top$, where $U = [u_1, \ldots, u_n]$ ($u_j$ denotes the $j$-th column of $U$) is an orthogonal matrix, i.e., $U^\top U = UU^\top = I$ ($I$ denotes the identity matrix of appropriate size), $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ is a diagonal matrix, and $u_j$ is called the eigenvector corresponding to the eigenvalue $\lambda_j$, i.e., $Au_j = \lambda_j u_j$. For convenience in the sequel, we assume that $\lambda_1 \ge \cdots \ge \lambda_n$ and define $V \triangleq [u_1, \ldots, u_q]$ and $V_\perp \triangleq [u_{q+1}, \ldots, u_n]$, $\Sigma \triangleq \mathrm{diag}(\lambda_1, \ldots, \lambda_q)$ and $\Sigma_\perp \triangleq \mathrm{diag}(\lambda_{q+1}, \ldots, \lambda_n)$, where $q$ is the number of top eigenvectors to be sought.

From the optimization point of view, matrix eigen-decomposition can in practice be defined by the following non-convex quadratically constrained quadratic program:

$$\max_{X \in \mathbb{R}^{n\times q}:\, X^\top X = I} \ \tfrac{1}{2}\,\mathrm{tr}(X^\top A X), \qquad (1)$$

where $q < n$ and $\mathrm{tr}(\cdot)$ denotes the trace of a square matrix, i.e., the sum of its diagonal entries. It can be easily verified that $X = V$ maximizes the trace value at $\tfrac{1}{2}\sum_{i=1}^{q}\lambda_i$.
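As a quick illustration of problem (1) (our own sketch, not part of the original paper), the following minimal NumPy snippet builds a random symmetric matrix, takes its top-$q$ eigenvectors as $V$, and checks that the objective at $X = V$ equals $\tfrac{1}{2}\sum_{i=1}^{q}\lambda_i$ and dominates the value at a random feasible point.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 200, 5

# Random symmetric test matrix A.
M = rng.standard_normal((n, n))
A = (M + M.T) / 2

# Eigen-decomposition; eigh returns eigenvalues in ascending order.
lam, U = np.linalg.eigh(A)
V = U[:, ::-1][:, :q]            # top-q eigenvectors u_1, ..., u_q
top_lam = lam[::-1][:q]

def objective(X):
    """Objective of problem (1): (1/2) tr(X^T A X)."""
    return 0.5 * np.trace(X.T @ A @ X)

# A random feasible point (orthonormal columns) for comparison.
X_rand, _ = np.linalg.qr(rng.standard_normal((n, q)))

print(np.isclose(objective(V), 0.5 * top_lam.sum()))   # True: value is (1/2) sum of top-q eigenvalues
print(objective(V) >= objective(X_rand))               # True: V maximizes the trace objective
```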
2.2. Riemannian Gradient

Given a Riemannian manifold (Lee, 2012) $\mathcal{M}$, its tangent space at a point $X \in \mathcal{M}$, denoted $T_X\mathcal{M}$, is a Euclidean space that locally linearizes $\mathcal{M}$ around $X$. Analogous to the Euclidean case, one iterate of a first-order optimization method on $\mathcal{M}$ takes the form (Absil et al., 2008):

$$X^{(t+1)} = R_{X^{(t)}}(\alpha_t\, \xi_{X^{(t)}}), \qquad (2)$$

where $\xi_{X^{(t)}} \in T_{X^{(t)}}\mathcal{M}$ (namely, $\xi_{X^{(t)}}$ is a tangent vector of $\mathcal{M}$ at $X^{(t)}$) is the search direction, $\alpha_t$ is the step size, and $R_{X^{(t)}}(\cdot)$ is the retraction at $X^{(t)}$, which maps a tangent vector $\xi \in T_{X^{(t)}}\mathcal{M}$ to a point on $\mathcal{M}$. Tangent vectors serving as search directions are generally gradient-related. The gradient of a function $f(X)$ defined on $\mathcal{M}$, denoted $\mathrm{Grad}\, f(X)$, depends on the Riemannian metric, which is a family of smoothly varying inner products on tangent spaces, i.e., $\langle \xi, \eta\rangle_X$ with $\xi, \eta \in T_X\mathcal{M}$, for any $X \in \mathcal{M}$. The Riemannian gradient $\mathrm{Grad}\, f(X) \in T_X\mathcal{M}$ is the unique tangent vector that satisfies

$$\langle \mathrm{Grad}\, f(X), \xi\rangle_X = Df(X)[\xi], \qquad (3)$$

for any $\xi \in T_X\mathcal{M}$, where $Df(X)[\xi]$ denotes the directional derivative of $f(X)$ in the tangent direction $\xi$.

2.2.1. EIGS via Riemannian Gradient

The constraint set in problem (1) constitutes a Stiefel manifold, i.e., $\mathrm{St}(n, q) = \{X \in \mathbb{R}^{n\times q} : X^\top X = I\}$, which turns the problem into a Riemannian one:

$$\max_{X \in \mathrm{St}(n,q)} f(X), \quad \text{where } f(X) \triangleq \tfrac{1}{2}\,\mathrm{tr}(X^\top A X).$$

Under the canonical metric $\langle \xi, \eta\rangle_X = \mathrm{tr}\big(\xi^\top (I - \tfrac{1}{2}XX^\top)\eta\big)$ and by (3), the Riemannian gradient of $f(X)$ is

$$\mathrm{Grad}\, f(X) = (I - XX^\top)AX.$$

Furthermore, we use the retraction based on the Cayley transformation (Wen & Yin, 2013):

$$R_X(\xi) = \big(I - \tfrac{1}{2}S(\xi)\big)^{-1}\big(I + \tfrac{1}{2}S(\xi)\big)X, \qquad (4)$$

for any $\xi \in T_X\mathrm{St}(n, q)$, where $S(\xi) = (P_X\xi)X^\top - X(P_X\xi)^\top$ and $P_X = I - \tfrac{1}{2}XX^\top$. Given a line-search method for determining the step size, such as the Armijo-Wolfe conditions (Nocedal & Wright, 2006) or non-monotone line search (Wen & Yin, 2013), we arrive at the Riemannian Gradient EIGen Solver (RG-EIGS):

$$X^{(t+1)} = R_{X^{(t)}}\big(\alpha_t\, \mathrm{Grad}\, f(X^{(t)})\big).$$
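To make the RG-EIGS iteration concrete, here is a small NumPy sketch (our own illustration, not the authors' Matlab code) of the canonical-metric Riemannian gradient, the Cayley retraction (4), and one deterministic update step; the fixed step size is an arbitrary choice for illustration.

```python
import numpy as np

def riem_grad(A, X):
    """Riemannian gradient of f(X) = 0.5*tr(X^T A X) on St(n, q): (I - X X^T) A X."""
    AX = A @ X
    return AX - X @ (X.T @ AX)

def cayley_retraction(X, xi, alpha):
    """R_X(alpha*xi) from Eq. (4), with S(xi) = (P_X xi) X^T - X (P_X xi)^T."""
    n = X.shape[0]
    PXxi = xi - 0.5 * X @ (X.T @ xi)        # P_X xi = (I - 0.5 X X^T) xi
    S = PXxi @ X.T - X @ PXxi.T             # skew-symmetric, so the Cayley factor is orthogonal
    I = np.eye(n)
    return np.linalg.solve(I - 0.5 * alpha * S, (I + 0.5 * alpha * S) @ X)

def rg_eigs_step(A, X, alpha=0.01):
    """One RG-EIGS iterate: X <- R_X(alpha * Grad f(X))."""
    return cayley_retraction(X, riem_grad(A, X), alpha)

# Tiny demo: the iterate stays feasible; the objective increases for a small step.
rng = np.random.default_rng(1)
n, q = 100, 4
M = rng.standard_normal((n, n)); A = (M + M.T) / 2
X, _ = np.linalg.qr(rng.standard_normal((n, q)))
X_new = rg_eigs_step(A, X)
print(np.linalg.norm(X_new.T @ X_new - np.eye(q)))                          # ~1e-15
print(0.5 * np.trace(X_new.T @ A @ X_new) - 0.5 * np.trace(X.T @ A @ X))    # > 0
```

Note that this naive deterministic step forms and solves an $n \times n$ system; the paper's RG-EIGS instead applies the Sherman-Morrison-Woodbury formula, and the doubly stochastic variant of Section 3 reduces the per-iteration cost further.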
3. Doubly Stochastic Riemannian Gradient

In this section, we propose a doubly stochastic Riemannian gradient eigensolver, denoted DSRG-EIGS, which generalizes Euclidean SGA and SCA to Stiefel manifolds and meanwhile extends RG-EIGS to the doubly stochastic optimization setting. One update of stochastic Riemannian gradient ascent takes the form (Bonnabel, 2013):

$$X^{(t+1)} = R_{X^{(t)}}\big(\alpha_t\, G(y_t, X^{(t)})\big),$$

where $\alpha_t > 0$, $y_t$ is an observation of a random variable $y$ that follows some distribution and satisfies $\mathbb{E}[f(y, X)] = f(X)$, and $G(y, X) \in T_X\mathrm{St}(n, q)$ is a stochastic Riemannian gradient such that $\mathbb{E}[G(y, X)] = \mathrm{Grad}\, f(X)$.

3.1. Sampling over Data

We first consider the stochastic Riemannian gradient based on sampling over data. The given matrix $A \in \mathbb{R}^{n\times n}$ can be written as a matrix summation. Although this summation could be made quite general, in our case it is based, for simplicity, on the following partitioning of $A$ into a block matrix of size $n_r \times n_c$:

$$A = \begin{pmatrix} A_{11} & \cdots & A_{1n_c} \\ \vdots & & \vdots \\ A_{n_r1} & \cdots & A_{n_rn_c} \end{pmatrix} = \sum_{k=1}^{n_r}\sum_{l=1}^{n_c} E_{kl} \circ A,$$

where $E_{kl} \in \{0,1\}^{n\times n}$ is the element indicator of $A_{kl}$ in $A$ and $\circ$ denotes the entrywise product. Define

$$f(s, X) \triangleq \tfrac{1}{2}\,p_s^{-1}\,\mathrm{tr}\big(X^\top (E_s \circ A)X\big) \quad \text{and} \quad G(s, X) \triangleq p_s^{-1}(I - XX^\top)(E_s \circ A)X,$$

where $s$ is a random variable taking pair values from $\{(k, l) : k = 1, \ldots, n_r,\ l = 1, \ldots, n_c\}$ with respective probabilities $p_s > 0$ subject to $\sum_s p_s = 1$. It holds that $\mathbb{E}[f(s, X)] = f(X)$ and $\mathbb{E}[G(s, X)] = \mathrm{Grad}\, f(X)$. We thus obtain the stochastic Riemannian gradient $G(s, X)$ by sampling over data, which greatly reduces the per-iteration complexity of data scanning, from a full scan, $O(n^2q)$ for dense matrices, to a partial scan. However, the per-iteration complexity of updating the variable $X$ remains the same as that of the batch version RG-EIGS, i.e., $O(nq^2) + O(q^3)$. Hence, when $q$ is large, it is still computationally cumbersome.

3.2. Sampling over Riemannian Gradient Coordinates

To further reduce the complexity per iteration, we now consider the stochastic Riemannian gradient based on sampling over the Riemannian gradient coordinates $\{[\mathrm{Grad}\, f(X)]_{ij} : i = 1, \ldots, n,\ j = 1, \ldots, q\}$. This is exactly the idea of stochastic coordinate ascent (Nesterov, 2012). However, SCA is intended for unconstrained or separably constrained convex problems, and thus is not suitable for ours, an inherently non-convex problem. In fact, the variable space $\mathrm{St}(n, q)$ (i.e., the Stiefel manifold) and the gradient space $T_X\mathrm{St}(n, q)$ (i.e., a Euclidean space) are not the same. Hence, a direct application of this method to our problem may not be well-defined, because a partial update of coordinates could make either $X^{(t)}$ drift off the manifold, i.e., $X^{(t)} \notin \mathrm{St}(n, q)$, or $\xi_{X^{(t)}}$ step out of the tangent space, i.e., $\xi_{X^{(t)}} \notin T_{X^{(t)}}\mathrm{St}(n, q)$.

To tackle this issue, we propose to sample intrinsic coordinates of Riemannian gradients in the tangent space. Note that the tangent space of the Stiefel manifold (Absil et al., 2008) at $X$ can be explicitly represented as

$$T_X\mathrm{St}(n, q) = \{X\Omega + X_\perp K : \Omega = -\Omega^\top \in \mathbb{R}^{q\times q},\ K \in \mathbb{R}^{(n-q)\times q}\},$$

where $X_\perp \in \mathbb{R}^{n\times(n-q)}$ is the orthonormal complement of $X$ in $\mathbb{R}^{n\times n}$, such that $(X\ X_\perp)$ is orthogonal. By this representation, we can identify the intrinsic coordinates of a tangent vector $\xi_X$ with the corresponding $\Omega$ and $K$. We can also see that the dimensionality of $\mathrm{St}(n, q)$ is $\tfrac{1}{2}q(q-1) + (n-q)q$. Recall that our Riemannian gradient is $\mathrm{Grad}\, f(X) = (I - XX^\top)AX$, which can be rewritten as $\mathrm{Grad}\, f(X) = X_\perp X_\perp^\top AX$. Hence, its intrinsic coordinates are $\Omega = 0$ and $K = X_\perp^\top AX$, and we only need to sample coordinates from $K$. To gain advantages as with SCA, we sample columns of $K$, which is equivalent to sampling columns of $X$. To this end, assume $X$ is partitioned into a block matrix of size $1 \times q_c$ (i.e., a column block matrix):

$$X = (X_{\cdot 1}, X_{\cdot 2}, \ldots, X_{\cdot q_c}) = \sum_{m=1}^{q_c} E_{\cdot m} \circ X,$$

where $E_{\cdot m} \in \{0,1\}^{n\times q}$ similarly is the element indicator of $X_{\cdot m}$ in $X$. Define

$$f(r, X) \triangleq \tfrac{1}{2}\,p_r^{-1}\,\mathrm{tr}\big(X^\top A(E_{\cdot r} \circ X)\big) \quad \text{and} \quad G(r, X) \triangleq p_r^{-1}(I - XX^\top)A(E_{\cdot r} \circ X),$$

where $r$ is a random variable taking values $1, \ldots, q_c$ with respective probabilities $p_r > 0$ subject to $\sum_r p_r = 1$. It holds that $\mathbb{E}[f(r, X)] = f(X)$ and $\mathbb{E}[G(r, X)] = \mathrm{Grad}\, f(X)$. As we will see shortly, only one column block of $X$ needs to be updated at each step. We thus obtain the stochastic Riemannian gradient $G(r, X)$ by sampling over intrinsic Riemannian coordinates. It keeps $X$ on the manifold and $G(r, X)$ in the tangent space, and meanwhile the update step works like a Euclidean SCA step.

3.3. Doubly Stochastic Riemannian Gradient (DSRG)

By sampling over both data and intrinsic Riemannian gradient coordinates, we arrive at our doubly stochastic Riemannian gradient $G(s, r, X) \in T_X\mathrm{St}(n, q)$:

$$G(s, r, X) = p_s^{-1}p_r^{-1}(I - XX^\top)(E_s \circ A)(E_{\cdot r} \circ X).$$

It is easy to see that this is an unbiased estimate of the true Riemannian gradient, i.e., $\mathbb{E}[G(s, r, X)] = \mathrm{Grad}\, f(X)$. We then arrive at our DSRG ascent method:

$$X^{(t+1)} = R_{X^{(t)}}\big(\alpha_t\, G(s_t, r_t, X^{(t)})\big). \qquad (5)$$

To simplify the above update, first let $g \triangleq G(s, r, X)$, $U \triangleq (g, X)$ and $V \triangleq (X, -g)$. Since $g \in T_X\mathrm{St}(n, q)$, we have $g^\top X = 0$, which implies that $P_X(g) = (I - \tfrac{1}{2}XX^\top)g = g$ and thus $S(g) = UV^\top$. We can then write

$$R_X(\alpha g) = \big(I - \tfrac{\alpha}{2}S(g)\big)^{-1}\big(I + \tfrac{\alpha}{2}S(g)\big)X = X + \alpha U\big(I - \tfrac{\alpha}{2}V^\top U\big)^{-1}V^\top X,$$

by the Sherman-Morrison-Woodbury formula (Press et al., 2007). Note that

$$V^\top X = \begin{pmatrix} I \\ 0 \end{pmatrix} \quad \text{and} \quad V^\top U = \begin{pmatrix} 0 & I \\ -g^\top g & 0 \end{pmatrix}.$$

Accordingly,

$$\big(I - \tfrac{\alpha}{2}V^\top U\big)^{-1} = \begin{pmatrix} W & \tfrac{\alpha}{2}W \\ -\tfrac{\alpha}{2}g^\top g\,W & I - \tfrac{\alpha^2}{4}g^\top g\,W \end{pmatrix},$$

where $W = \big(I + \tfrac{\alpha^2}{4}g^\top g\big)^{-1}$. We then get

$$R_X(\alpha g) = X + \alpha\big(I - \tfrac{\alpha}{2}Xg^\top\big)gW = (\alpha g + 2X)W - X. \qquad (6)$$
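The following NumPy sketch (an illustration we add here, not the authors' code) assembles the doubly stochastic gradient $G(s, r, X)$ from one sampled data block $A_{kl}$ and one sampled column block of $X$, and numerically confirms the unbiasedness $\mathbb{E}[G(s, r, X)] = \mathrm{Grad}\, f(X)$ by averaging over all block pairs with their probabilities; the uniform sampling distributions and block grid are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, q = 60, 6
nr, nc, qc = 3, 3, 2                       # block grid for A and column blocks for X (assumed)
rows = np.array_split(np.arange(n), nr)    # row-block index sets of A
cols = np.array_split(np.arange(n), nc)    # column-block index sets of A
xcols = np.array_split(np.arange(q), qc)   # column-block index sets of X

M = rng.standard_normal((n, n)); A = (M + M.T) / 2
X, _ = np.linalg.qr(rng.standard_normal((n, q)))

def G_sr(A, X, k, l, r, p_s, p_r):
    """Doubly stochastic Riemannian gradient for s = (k, l) and column block r."""
    EsA = np.zeros_like(A); EsA[np.ix_(rows[k], cols[l])] = A[np.ix_(rows[k], cols[l])]
    ErX = np.zeros_like(X); ErX[:, xcols[r]] = X[:, xcols[r]]
    Y = EsA @ ErX
    return (Y - X @ (X.T @ Y)) / (p_s * p_r)   # p_s^{-1} p_r^{-1} (I - X X^T)(E_s o A)(E_r o X)

grad = A @ X - X @ (X.T @ (A @ X))             # full Riemannian gradient (I - X X^T) A X

# Unbiasedness check: sum over (s, r) of p_s p_r G(s, r, X) equals Grad f(X).
p_s, p_r = 1.0 / (nr * nc), 1.0 / qc           # placeholder: uniform sampling
mean_G = sum(p_s * p_r * G_sr(A, X, k, l, r, p_s, p_r)
             for k in range(nr) for l in range(nc) for r in range(qc))
print(np.allclose(mean_G, grad))               # True
```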
To see the properties of this method, let us focus on $W$. Note that $g^\top g = \mathrm{diag}(0, \ldots, 0, C, 0, \ldots, 0)$, where $C = p_s^{-2}p_r^{-2}D$ is the $r$-th diagonal block of $g^\top g$ and

$$D = X_{\cdot r}^\top(E_s \circ A)^\top(I - XX^\top)(E_s \circ A)X_{\cdot r} = (A_{kl}X_{lr})^\top(A_{kl}X_{lr}) - (X_{k\cdot}^\top A_{kl}X_{lr})^\top(X_{k\cdot}^\top A_{kl}X_{lr}),$$

supposing $s = (k, l)$ (note that the subscripts $k$, $l$, $r$ are all block indices, so that $X_{lr}$ is the $(l, r)$ block of $X$ and $X_{k\cdot}$ is its $k$-th row block). Therefore, we get $W = \mathrm{diag}(I, \ldots, I, B^{-1}, I, \ldots, I)$, where $B = I + \tfrac{\alpha^2}{4}C$. We can now see that in (5) only the $r$-th column block of $X$ needs to be updated, while the remaining ones stay unchanged:

$$X^{(t+1)}_{\cdot m} = \begin{cases} \big(\alpha p_s^{-1}p_r^{-1}(H - XX_{k\cdot}^\top)A_{kl}X_{lr} + 2X_{\cdot m}\big)B^{-1} - X_{\cdot m}, & m = r, \\ X_{\cdot m}, & m \neq r, \end{cases}$$

where $H = (0, \ldots, 0, I, 0, \ldots, 0)^\top$ is the block column matrix whose $k$-th block is the identity and whose other blocks are zero. Our DSRG-EIGS algorithm is summarized in Algorithm 1. It enjoys the following advantages over RG-EIGS, thanks to the double stochasticity: 1) it achieves a greatly reduced complexity per iteration, $O(nq)$, which matters especially when $A$ is dense or $q$ is large; 2) only a size-controlled small matrix $B$ needs to be inverted; 3) there is no need for matrix inversion at all when $q_c = q$, i.e., under single-column sampling over $X$.

Algorithm 1 DSRG-EIGS
Input: $A$, $T$, $\eta > 0$, $\zeta > 0$
Output: $X^{(T)}$
1: Initialize $X^{(0)}$ and $p_s = \|A_s\|_F / \sum_{s'} \|A_{s'}\|_F$ for every $s \in S = \{(k, l) : k = 1, \ldots, n_r,\ l = 1, \ldots, n_c\}$.
2: for $t = 1, 2, \ldots, T$ do
3:   Sample $s_t = (k_t, l_t)$ from $S$ according to $\{p_s\}$.
4:   Sample $r_t$ from $\{1, 2, \ldots, q_c\}$ uniformly.
5:   Set $\alpha_t = \frac{\eta}{1 + \zeta t}$.
6:   Update $X^{(t+1)}_{\cdot r_t} = \big(\alpha_t q_c p_{s_t}^{-1}(H^{(t)} - X^{(t)}(X^{(t)}_{k_t\cdot})^\top)A_{k_tl_t}X^{(t)}_{l_tr_t} + 2X^{(t)}_{\cdot r_t}\big)(B^{(t)})^{-1} - X^{(t)}_{\cdot r_t}$ and $X^{(t+1)}_{\cdot r} = X^{(t)}_{\cdot r}$ for $r \neq r_t$, where $B^{(t)}$ and $H^{(t)}$ are as defined in Section 3.3.
7: end for
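As a companion to Algorithm 1, the snippet below is a sketch under our own assumptions (not the authors' Matlab implementation) that runs the DSRG-EIGS loop with the closed-form block update of Section 3.3; the block sizes, random initialization, and the value of $\eta$ are illustrative choices ($\zeta = 2$ follows the experimental setting in Section 5).

```python
import numpy as np

def dsrg_eigs(A, q, T=2000, nr=4, nc=4, qc=2, eta=0.5, zeta=2.0, seed=0):
    """DSRG-EIGS sketch (Algorithm 1): doubly stochastic Riemannian gradient ascent."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    rows = np.array_split(np.arange(n), nr)
    cols = np.array_split(np.arange(n), nc)
    xcols = np.array_split(np.arange(q), qc)

    # Importance sampling over data blocks: p_s proportional to ||A_s||_F.
    pF = np.array([[np.linalg.norm(A[np.ix_(rk, cl)]) for cl in cols] for rk in rows])
    p = (pF / pF.sum()).ravel()

    X, _ = np.linalg.qr(rng.standard_normal((n, q)))   # random feasible X^(0)
    for t in range(1, T + 1):
        s = rng.choice(nr * nc, p=p)                   # sample s_t = (k_t, l_t)
        k, l = divmod(s, nc)
        r = rng.integers(qc)                           # sample r_t uniformly
        alpha = eta / (1.0 + zeta * t)                 # step size alpha_t

        # g_r = q_c p_s^{-1} (I - X X^T)(E_s o A)(E_r o X), restricted to column block r.
        AXlr = np.zeros((n, len(xcols[r])))
        AXlr[rows[k]] = A[np.ix_(rows[k], cols[l])] @ X[np.ix_(cols[l], xcols[r])]
        g_r = (qc / p[s]) * (AXlr - X @ (X.T @ AXlr))

        # Closed-form Cayley step touching only column block r (Section 3.3).
        B = np.eye(g_r.shape[1]) + (alpha**2 / 4.0) * (g_r.T @ g_r)
        X[:, xcols[r]] = (alpha * g_r + 2.0 * X[:, xcols[r]]) @ np.linalg.inv(B) - X[:, xcols[r]]
    return X

# Usage: feasibility stays small (the Cayley step keeps iterates on the manifold up to
# round-off); with enough iterations and a tuned step size, the stochastic objective
# approaches the ideal value 0.5 * sum of the top-q eigenvalues.
rng = np.random.default_rng(3)
n, q = 300, 4
M = rng.standard_normal((n, n)); A = (M + M.T) / 2
X = dsrg_eigs(A, q)
print(np.linalg.norm(X.T @ X - np.eye(q)))
print(0.5 * np.trace(X.T @ A @ X), 0.5 * np.linalg.eigvalsh(A)[-q:].sum())
```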
4. Theoretical Analysis

We analyze the convergence properties of Algorithm 1 in this section.

4.1. Local Convergence

We note the following facts. The Stiefel manifold is smooth, connected and compact, with a positive global injectivity radius (Lee, 2012; Bonnabel, 2013). In addition, the function $f(X)$ to be maximized in our problem is three times continuously differentiable, the retraction (4) is twice continuously differentiable, and the stochastic Riemannian gradient in (5) is unbiased and bounded, since both $A$ and $X$ are bounded. According to (Bonnabel, 2013), we have

Theorem 4.1. If the step sizes satisfy $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$, then for Algorithm 1, $f(X^{(t)})$ converges almost surely and $\mathrm{Grad}\, f(X^{(t)})$ converges to $0$ almost surely as $t \to \infty$.

Note that only convergence to a local solution is guaranteed by Theorem 4.1.

4.2. Global Convergence

In fact, Theorem 4.1 can be strengthened to a global convergence result for our problem. Specifically, we investigate the squared cosine value of the principal angle between $X_t$ and the ground truth $V$ (for notational convenience, we place iteration indices $t$ as subscripts hereafter), which is defined as

$$\cos^2\langle X_t, V\rangle \triangleq \lambda_{\min}(X_t^\top VV^\top X_t) = \min_{y \neq 0} \|V^\top X_t y\|_2^2 / \|y\|_2^2.$$

Note that if $\cos^2\langle X_t, V\rangle = 1$, then $X_t = V$ up to a $q \times q$ orthogonal matrix, that is, our goal is achieved. Actually, we have the following strengthened theorem:

Theorem 4.2. Define $\Theta_t = 1 - \mathbb{E}[\cos^2\langle X_t, V\rangle]$. Assume that $A$ has a positive eigen-gap, i.e., $\tau = \lambda_q - \lambda_{q+1} > 0$, that $\alpha_t = \frac{c}{t}$ with $c > \frac{2}{\tau}$, and that $\cos^2\langle X_s, V\rangle \ge \frac{1}{2}$ for some $s \ge 0$. Then $\Theta_t = O\big(\frac{1}{t}\big)$ for $t \ge s$.

Theorem 4.2 shows that our DSRG-EIGS algorithm converges to a global solution at a sub-linear rate in expectation. We note that the requirement on the initialization $X_0$, namely that $\cos^2\langle X_s, V\rangle \ge \frac{1}{2}$ at a certain iteration $s \ge 0$, is theoretically non-trivial. Empirically, however, a random initialization works well, as we will observe in our experiments. Hence, Theorem 4.2 amounts to a convergence analysis of a later stage of the algorithm, starting from $t_0 = s$ instead of $t_0 = 0$.

Similar to (Balsubramani et al., 2013), we have the following theorem, which gives a concrete convergence rate for our algorithm. Before stating it, we define some stochastic quantities: $A_t \triangleq p_{s_t}^{-1}(E_{s_t} \circ A)$, $Y_t \triangleq p_{r_t}^{-1}(E_{\cdot r_t} \circ X_t)$, and $\tilde{Z}_t \triangleq Z_t(Z_t^\top Z_t)^{-1/2}$, where $Z_t = X_t + \alpha_t A_t Y_t$.

Theorem 4.3. Under the conditions of Theorem 4.2, let $a = c\tau$ and $b = \gamma\, \mathbb{E}[\|A_t\|_2^2]\,\mathbb{E}[\|Y_t\|_2^2]$ with $\gamma > 9$, and let $t > s \ge 1$. Then it holds that

$$\Theta_{t+1} \le \Theta_s\Big(\frac{s}{t+1}\Big)^a + \frac{4b}{a-1}\Big(1 + \frac{1}{s+1}\Big)^{a-1}\frac{1}{t+1}.$$

To prove Theorems 4.2 and 4.3, we need the following lemmas.

Lemma 4.4. Assume $\cos^2\langle X_t, V\rangle \ge \frac{1}{2}$ and let $\beta_t = \mathbb{E}[\|A_t\|_2^2]\,\mathbb{E}[\|Y_t\|_2^2]$. Then
$$1 - \mathbb{E}[\cos^2\langle \tilde{Z}_t, V\rangle] \le (1 - \alpha_t\tau)\big(1 - \mathbb{E}[\cos^2\langle X_t, V\rangle]\big) + 5\beta_t\alpha_t^2 + O(\alpha_t^3).$$

Lemma 4.5. It holds that
$$\cos^2\langle X_{t+1}, V\rangle \ge \cos^2\langle \tilde{Z}_t, V\rangle - 4\alpha_t^2\|A_t\|_2^2\|Y_t\|_2^2 - O(\alpha_t^3).$$

Lemma 4.6. Assume the constant $\gamma > 9$. Then $\Theta_{t+1} \le (1 - \alpha_t\tau)\Theta_t + \gamma\beta_t\alpha_t^2$.

All the proofs are given in the supplementary material. We note that Theorem 4.3 and Lemma 4.6 take a convenient form that enables us to leverage the sampling distributions to improve the convergence rate.

4.3. Accelerated Global Convergence

Since the inequalities in Theorem 4.3 and Lemma 4.6 hold for general sampling distributions, we can improve the convergence rate by optimizing the sampling distributions over data or gradient coordinates, i.e., by importance sampling (Zhao & Zhang, 2014). We only need to minimize $\beta_t$ with respect to the two sampling distributions, $\{p_s\}$ and $\{p_r\}$, which is equivalent to two independent problems:

$$\min_{\sum_s p_s = 1} \mathbb{E}[\|A_t\|_2^2] \quad \text{and} \quad \min_{\sum_r p_r = 1} \mathbb{E}[\|Y_t\|_2^2].$$

Let $h(\{p_s\}, \eta)$ be the Lagrange function of the first constrained optimization problem. Then

$$h(\{p_s\}, \eta) = \mathbb{E}[\|A_t\|_2^2] + \eta\Big(\sum_{k,l} p_{kl} - 1\Big) = \sum_{k,l} p_{kl}^{-1}\|A_{kl}\|_2^2 + \eta\Big(\sum_{k,l} p_{kl} - 1\Big),$$

$$\frac{\partial h}{\partial p_{kl}} = -\frac{\|A_{kl}\|_2^2}{p_{kl}^2} + \eta, \qquad \frac{\partial^2 h}{\partial p_{kl}^2} = \frac{2\|A_{kl}\|_2^2}{p_{kl}^3} > 0.$$

Setting $\frac{\partial h}{\partial p_{kl}} = 0$, followed by normalization, yields the solution $p^*_{kl} = \|A_{kl}\|_2 / \sum_{k,l}\|A_{kl}\|_2$. However, the spectral norm of a matrix is not easy to compute. We can relax the problem using an easy-to-compute upper bound on the spectral norm (a tighter upper bound, $\|A\|_2 \le \sqrt{\|A\|_1\|A\|_\infty}$, could actually be used),

$$\min_{\sum_s p_s = 1} \mathbb{E}[\|A_t\|_2^2] \le \min_{\sum_s p_s = 1} \mathbb{E}[\|A_t\|_F^2],$$

and work on the latter problem. Likewise, we find that $p^*_{kl} = \|A_{kl}\|_F / \sum_{k,l}\|A_{kl}\|_F$. This says that blocks with a larger norm should be sampled with a higher probability, which is a quite useful property for sparse matrix eigen-decomposition in that it avoids frequent use of less informative blocks, especially zero or nearly zero blocks. On the other hand, it similarly holds that $p^*_r = \|X_{\cdot r}\|_2 / \sum_r \|X_{\cdot r}\|_2$. Since $\|X_{\cdot r}\|_2 = 1$ always, we get $p^*_r = q_c^{-1}$, which says that the optimal sampling over gradient coordinates turns out to be uniform sampling. These two optimal sampling distributions are the ones used in Algorithm 1.
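The Frobenius-norm importance sampling distribution $p^*_{kl} = \|A_{kl}\|_F / \sum_{k,l}\|A_{kl}\|_F$ is cheap to precompute; the short sketch below (our illustration, with a toy block-sparse matrix) builds it for a block partition and shows that near-zero blocks receive correspondingly small sampling probability.

```python
import numpy as np

def block_sampling_probs(A, nr, nc):
    """Importance sampling over data blocks: p*_{kl} proportional to ||A_{kl}||_F (Section 4.3)."""
    rows = np.array_split(np.arange(A.shape[0]), nr)
    cols = np.array_split(np.arange(A.shape[1]), nc)
    norms = np.array([[np.linalg.norm(A[np.ix_(rk, cl)]) for cl in cols] for rk in rows])
    return norms / norms.sum()            # nr x nc matrix of probabilities, sums to 1

# Toy example: only the top-left quarter is informative; zero blocks get probability ~0
# (they would contribute nothing to the stochastic gradient anyway).
rng = np.random.default_rng(4)
A = np.zeros((100, 100))
A[:50, :50] = rng.standard_normal((50, 50))
A = (A + A.T) / 2
print(block_sampling_probs(A, nr=2, nc=2).round(3))   # mass concentrates on the nonzero block
```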
5. Experimental Results

In this section, we empirically validate the effectiveness of our proposed doubly stochastic Riemannian gradient method for matrix eigen-decomposition, DSRG-EIGS, by comparing it with its deterministic counterpart, RG-EIGS (Wen & Yin, 2013), and using Matlab's EIGS function for benchmarking. Both DSRG-EIGS and RG-EIGS were implemented in Matlab on a machine with Windows OS and 8 GB of RAM.

5.1. Experimental Setting

We detail the experimental settings in this subsection, including the RG-EIGS implementation, the initialization of $X$ at $t = 0$, the step size $\alpha_t$, and the quality measures for performance evaluation. We adopt the original authors' implementation of RG-EIGS (optman.blogs.rice.edu/). It uses non-monotone line search with the well-known Barzilai-Borwein step size, which significantly reduces the iteration number and performs well in practice. Both RG-EIGS and DSRG-EIGS are fed the same initial value of $X$, where each entry is sampled from the standard normal distribution $N(0, 1)$ and the resulting matrix is then orthogonalized as a whole. For DSRG-EIGS we set $\alpha_t = \frac{\eta}{1 + \zeta t}$, where $\zeta$ is fixed to 2 throughout the experiments and $\eta$ is tuned.

The performance of the different algorithms is evaluated using three quality measures: the feasibility $\|X_t^\top X_t - I\|_F$, the objective function value $\tfrac{1}{2}\mathrm{tr}(X_t^\top AX_t)$, and the squared cosine value of the principal angle between each iterate $X_t$ and the ground truth $V$, i.e., $\cos^2\langle X_t, V\rangle$. Lower feasibility values are better, while larger objective function and squared cosine values are better. The output of EIGS is taken as the ground truth. We report the convergence curves of these measures, from which the empirical convergence rate of each algorithm in terms of objective function values or squared cosine values can be observed.
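For completeness, here is a small NumPy helper (ours, not from the paper) that computes the three quality measures used in the evaluation, with the squared cosine of the principal angle computed as $\lambda_{\min}(X_t^\top VV^\top X_t)$.

```python
import numpy as np

def quality_measures(A, X, V):
    """Feasibility, objective value, and squared cosine of the principal angle to V."""
    q = X.shape[1]
    feasibility = np.linalg.norm(X.T @ X - np.eye(q))      # || X^T X - I ||_F
    objective = 0.5 * np.trace(X.T @ A @ X)                # (1/2) tr(X^T A X)
    VtX = V.T @ X
    cos2 = np.linalg.eigvalsh(VtX.T @ VtX).min()           # lambda_min(X^T V V^T X)
    return feasibility, objective, cos2

# Example: at X = V the measures are (~0, 0.5 * sum of top-q eigenvalues, ~1).
rng = np.random.default_rng(5)
n, q = 150, 3
M = rng.standard_normal((n, n)); A = (M + M.T) / 2
lam, U = np.linalg.eigh(A)
V = U[:, -q:]                                              # ground truth (as returned by EIGS)
print(quality_measures(A, V, V))
```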
5.2. Performance on Sparse Matrices

We first examine the performance of the algorithms on sparse matrices, downloaded from the University of Florida sparse matrix collection (www.cise.ufl.edu/research/sparse/matrices/). Their statistics are given in Table 1. Each of them is uniformly partitioned into a block matrix of size $m_r \times m_c$ as given in Table 1. We use $q = 100$ and uniformly partition $X$ into a block matrix of size $1 \times q_c$ with $q_c = q/2$.

Table 1. Sparse matrices.
  dataset      n         nnz(A)      m_r   m_c
  hangGlider   10,260    92,703      10    1
  indef        64,810    565,996     50    1
  IBMNA        169,422   1,279,274   150   1

The convergence curves of the three quality measures for RG-EIGS and DSRG-EIGS on the sparse matrices are shown in Figure 1, with one row of plots per matrix and one column of plots per measure. Each point on a convergence curve of RG-EIGS corresponds to one batch step (the decreasing steps visible in the objective curves are caused by the non-monotone step size used in the RG-EIGS implementation), while each point spans a fixed number of stochastic steps for DSRG-EIGS. We tested four different step sizes for DSRG-EIGS on each dataset. In each plot, the output of Matlab's EIGS function is shown as a single reference point, represented by a red pentagram.

Figure 1. Performance on sparse matrices: feasibility $\|X^\top X - I\|_F$, objective function value, and cosine value versus running time (sec.) for hangGlider, indef, and IBMNA, comparing EIGS, RG-EIGS, and DSRG-EIGS with four values of $\eta$ per dataset.

These results show that DSRG-EIGS consistently and significantly outperforms RG-EIGS in terms of every quality measure. Specifically, DSRG-EIGS converges faster than RG-EIGS in terms of objective function values, as shown in the middle column of plots, which clearly demonstrates the effectiveness and superiority of our algorithm over its deterministic version. Similar conclusions can be drawn from the right column of plots in terms of squared cosine values. Moreover, we observe that the feasibility of RG-EIGS deteriorates in a manner similar to a step function. This is because RG-EIGS relies heavily on the Sherman-Morrison-Woodbury formula, which suffers from numerical instability, and the resulting error accumulates over iterations. In contrast, our DSRG-EIGS achieves better feasibility, especially on the first two sparse matrices, indicating that this issue is mitigated.

5.3. Performance on Dense Matrices

We now report the performance of the algorithms on dense matrices, which are RBF kernels generated from the feature datasets citeseer and pubmed (linqs.cs.umd.edu/projects/projects/lbc), and usps, news20 and a8a (www.csie.ntu.edu.tw/~cjlin/libsvmtools). The statistics of the resulting dense matrices are shown in Table 2, including their block sizes under uniform partitioning. We use $q = 10$ here, and $X$ is again uniformly partitioned into $q_c = q/2$ column blocks.

Table 2. Dense matrices.
  dataset    n        nnz(A)          m_r   m_c
  citeseer   3,312    10,969,344      10    10
  usps       9,298    86,452,804      20    20
  pubmed     19,717   388,760,089     40    40
  news20     19,928   397,125,184     40    40
  a8a        32,561   1,060,218,721   40    40

Figure 2 shows the convergence curves of the different measures on two dense matrices (results on the remaining dense matrices are placed in the supplementary material).

Figure 2. Performance on dense matrices: feasibility, objective function value, and cosine value versus running time (sec.) for news20 (EIGS, RG-EIGS, and DSRG-EIGS with $\eta \in \{8, 9, 10, 11\}$) and for a8a (DSRG-EIGS with $\eta = 1.0$).

As we can see, the performance of DSRG-EIGS relative to RG-EIGS is similar to the sparse-matrix case. On the a8a dataset in particular, RG-EIGS fails to run because it runs out of memory, while our DSRG-EIGS still works (the ground-truth results for a8a were computed on a machine with the same CPU but more RAM). Therefore, the effectiveness and superiority of our algorithm are validated on dense matrices as well.

The experimental studies on both sparse and dense matrices demonstrate that the proposed algorithm is broadly effective and can be superior to its deterministic version. The advantages can be more pronounced in some cases. If memory cannot hold the input matrix, for example a full matrix of size 32,000 x 32,000 like a8a, RG-EIGS clearly fails to run. In some real applications of matrix eigen-decomposition, when suboptimal solutions suffice to achieve satisfactory results in terms of third-party or domain-specific quality measures, such as modularity for spectral clustering, DSRG-EIGS would be a better choice than RG-EIGS.
6. Related Work

Typical existing approaches to matrix eigen-decomposition include the power method, the Lanczos algorithms, and Riemannian methods. The power method (Golub & Van Loan, 1996), which finds the leading eigenpair (i.e., the eigenvalue with the largest absolute value and its eigenvector), starts from some initial vector and then repeatedly alternates matrix-vector multiplication and vector normalization. Although it can be used on large sparse matrices, it may be slow and can even diverge. Instead of discarding the information from previous iterations as the power method does, the Lanczos algorithm (Cullum & Willoughby, 2002) utilizes it to iteratively construct a basis of the Krylov subspace for eigen-decomposition. Riemannian methods address the problem from the Riemannian optimization perspective, for example via optimization on Stiefel or Grassmann manifolds (Torbjorn Ringertz, 1997; Absil et al., 2008). One recently proposed method, randomized SVD (Halko et al., 2011), finds a truncated SVD via random projections. All these methods perform batch learning, while our focus in this paper is on stochastic algorithms. Another recent method, called MSEIGS (Si et al., 2014), tries to utilize graph cluster structure to speed up eigen-decomposition, whereas we consider more general matrices.

The work most related to ours includes online learning of eigenvectors (Garber et al., 2015), which only targets the leading eigenvector, i.e., $q = 1$, and coordinate descent on orthogonal matrices (Shalit & Chechik, 2014), which correspond to a special case of Stiefel manifolds. (Garber et al., 2015) is based on the power method and provides a regret analysis without empirical validation, while we address the problem from a stochastic Riemannian optimization perspective. In (Shalit & Chechik, 2014), stochastic coordinate descent is realized through Givens rotations with only local convergence guaranteed, while we work on general Stiefel manifolds with global convergence guaranteed. On the other hand, doubly stochastic gradients have been used for scaling up kernel methods (Dai et al., 2014) and nonlinear component analysis (Xie et al., 2015), which rely on primal feature data in vector form, as do other PCA algorithms (Mitliagkas et al., 2013; Boutsidis et al., 2015), instead of the relational data in square matrices that we target. In addition, importance sampling (Zhao & Zhang, 2014) has been considered for convex problems, while we extend its use to a non-convex problem in this paper.
7. Conclusion

We proposed a doubly stochastic Riemannian gradient ascent algorithm for matrix eigen-decomposition (DSRG-EIGS), i.e., a new eigensolver, which generalizes Euclidean stochastic gradient ascent and Euclidean stochastic coordinate ascent to the Riemannian setting, or more precisely, to Stiefel manifolds. The algorithm enjoys the advantages of both to achieve a greatly reduced complexity per iteration and to avoid matrix inversion. We conducted a progressive convergence analysis, which shows that DSRG-EIGS converges to a global solution at a sub-linear rate in expectation and that the convergence rate can be improved by leveraging the sampling distributions. Its effectiveness and superiority are verified on both sparse and dense matrices. For future work, we may address the limitations of DSRG-EIGS, including the non-trivial initialization and the dependence on a positive eigen-gap. We may also conduct more empirical investigations of the algorithm.

Acknowledgments

We would like to thank Dr. Yiping Ke for reading the draft and offering useful suggestions, and the anonymous reviewers for their valuable comments, which significantly improved the quality of the paper in different ways.

References

Absil, P.-A., Mahony, Robert, and Sepulchre, Rodolphe. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.

Balsubramani, Akshay, Dasgupta, Sanjoy, and Freund, Yoav. The fast convergence of incremental PCA. In Advances in Neural Information Processing Systems 26, December 5-8, 2013, Lake Tahoe, Nevada, United States, pp. 3174-3182, 2013.

Bonnabel, Silvere. Stochastic gradient descent on Riemannian manifolds. IEEE Trans. Automat. Contr., 58(9):2217-2229, 2013. doi: 10.1109/TAC.2013.2254619.

Boutsidis, Christos, Garber, Dan, Karnin, Zohar Shay, and Liberty, Edo. Online principal components analysis. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015, pp. 887-901, 2015.

Cullum, Jane K. and Willoughby, Ralph A. Lanczos Algorithms for Large Symmetric Eigenvalue Computations, Vol. 1. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2002. ISBN 0898715237.

Dai, Bo, Xie, Bo, He, Niao, Liang, Yingyu, Raj, Anant, Balcan, Maria-Florina, and Song, Le. Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems 27, December 8-13, 2014, Montreal, Quebec, Canada, pp. 3041-3049, 2014.

Drineas, Petros and Mahoney, Michael W. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. J. Mach. Learn. Res., 6:2153-2175, December 2005. ISSN 1532-4435.

Edelman, Alan, Arias, Tomás A., and Smith, Steven T. The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl., 20(2):303-353, April 1999. ISSN 0895-4798. doi: 10.1137/S0895479895290954.

Garber, Dan, Hazan, Elad, and Ma, Tengyu. Online learning of eigenvectors. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 560-568, 2015.

Golub, Gene H. and Van Loan, Charles F. Matrix Computations (3rd Ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0-8018-5414-8.

Halko, N., Martinsson, P. G., and Tropp, J. A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev., 53(2):217-288, May 2011. ISSN 0036-1445. doi: 10.1137/090771806.
Jolliffe, I. T. Principal Component Analysis. Springer, 2nd edition, October 2002.

Lee, John M. Introduction to Smooth Manifolds. Springer, 2012.

Mitliagkas, Ioannis, Caramanis, Constantine, and Jain, Prateek. Memory limited, streaming PCA. In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 26, pp. 2886-2894. Curran Associates, Inc., 2013.

Nesterov, Yurii. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012. doi: 10.1137/100802001.

Ng, Andrew Y., Jordan, Michael I., and Weiss, Yair. On spectral clustering: Analysis and an algorithm. In Dietterich, T. G., Becker, S., and Ghahramani, Z. (eds.), Advances in Neural Information Processing Systems 14, pp. 849-856. MIT Press, 2002.

Nocedal, J. and Wright, S. J. Numerical Optimization. Springer, New York, 2nd edition, 2006.

Parlett, Beresford N. The Symmetric Eigenvalue Problem. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1998. ISBN 0-89871-402-8.

Press, William H., Teukolsky, Saul A., Vetterling, William T., and Flannery, Brian P. Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, New York, NY, USA, 3rd edition, 2007. ISBN 0521880688, 9780521880688.

Shalit, Uri and Chechik, Gal. Coordinate-descent for learning orthogonal matrices through Givens rotations. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pp. 548-556, 2014.

Si, Si, Shin, Donghyuk, Dhillon, Inderjit S., and Parlett, Beresford N. Multi-scale spectral decomposition of massive graphs. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 2798-2806. Curran Associates, Inc., 2014.

Torbjorn Ringertz, U. Eigenvalues in optimum structural design. Institute for Mathematics and Its Applications, 92:135, 1997.

Wen, Zaiwen and Yin, Wotao. A feasible method for optimization with orthogonality constraints. Math. Program., 142(1-2):397-434, 2013. doi: 10.1007/s10107-012-0584-1.

Wen, Zaiwen, Yang, Chao, Liu, Xin, and Zhang, Yin. Trace-penalty minimization for large-scale eigenspace computation. Technical report, Rice University, Department of Computational and Applied Mathematics, Houston, TX, 2013.

Xie, Bo, Liang, Yingyu, and Song, Le. Scale up nonlinear component analysis with doubly stochastic gradients. CoRR, abs/1504.03655, 2015.

Zhao, Peilin and Zhang, Tong. Stochastic optimization with importance sampling. CoRR, abs/1401.2753, 2014.