# Riemannian Pursuit for Big Matrix Recovery

Mingkui Tan (1), TANMINGKUI@GMAIL.COM
Ivor W. Tsang (2), IVOR.TSANG@GMAIL.COM
Li Wang (3), LIW022@UCSD.EDU
Bart Vandereycken (4), BARTV@PRINCETON.EDU
Sinno Jialin Pan (5), JSPAN@I2R.A-STAR.EDU.SG

(1) School of Computer Science, The University of Adelaide, Ingkarni Wardli, North Terrace Campus 5005, Australia
(2) Center for Quantum Computation & Intelligent Systems, University of Technology Sydney, Australia
(3) Department of Mathematics, University of California, San Diego, USA
(4) Department of Mathematics, Princeton University, Fine Hall, Washington Road, Princeton NJ 08544-1000, USA
(5) Institute for Infocomm Research, 1 Fusionopolis Way, #21-01 Connexis (South), 138632, Singapore

Abstract

Low-rank matrix recovery is a fundamental task in many real-world applications. The performance of existing methods, however, deteriorates significantly when applied to ill-conditioned or large-scale matrices. In this paper, we therefore propose an efficient method, called Riemannian Pursuit (RP), that aims to address these two problems simultaneously. Our method consists of a sequence of fixed-rank optimization problems. Each subproblem, solved by a nonlinear Riemannian conjugate gradient method, aims to correct the solution in the most important subspace of increasing size. Theoretically, RP converges linearly under mild conditions, and experimental results show that it substantially outperforms existing methods when applied to large-scale and ill-conditioned matrices.

1. Introduction

Matrix recovery (MR) has attracted a lot of attention from various research communities, such as statistical machine learning, collaborative filtering, and image and signal processing (Candès & Recht, 2009; Negahban & Wainwright, 2012). With the fast development of Web 2.0 over the last decade, big MR problems have arisen in many practical applications, leading to great computational challenges. For instance, in collaborative filtering, the Netflix Prize problem involves roughly $10^8$ ratings of 480,189 users on 17,770 movies (KDDCup, 2007). The MR problem is defined as follows:

Definition 1. Given a linear operator $\mathcal{A}: \mathbb{R}^{m \times n} \rightarrow \mathbb{R}^{l}$, let $b = \mathcal{A}(\widehat{X}) + e$ be $l$ linear measurements of an unknown rank-$\widehat{r}$ matrix $\widehat{X} \in \mathbb{R}^{m \times n}$, where $e$ denotes noise. The task of MR is to recover $\widehat{X}$ by solving

$$\min_{X} f(X), \quad \text{s.t. } \operatorname{rank}(X) \leq r, \qquad (1)$$

where $l \ll mn$, $r \geq \widehat{r}$, and $f(X) = \frac{1}{2}\|b - \mathcal{A}(X)\|_2^2$.

The definition of $\mathcal{A}$ depends on the application context, such as matrix completion, quantum state tomography, or matrix factorization; see, e.g., (Recht et al., 2010; Candès & Plan, 2010a; Recht, 2011; Keshavan et al., 2010b; Laue, 2012). Although our derivation is valid for any $\mathcal{A}$ that allows for MR, in the numerical experiments we focus on matrix completion (MC) as a specific application. In this case, $\mathcal{A}(X)$ is defined as the element-wise restriction of $X$ to $\Xi$, a subset of the complete set of entries of $X$.

Problem (1) is known to be NP-hard. To address it, many researchers have proposed to solve the nuclear-norm convex relaxation (Fazel, 2002; Hazan, 2008; Recht et al., 2010):

$$\min_{X} \|X\|_*, \quad \text{s.t. } \mathcal{A}(X) = b,$$

where $\|\cdot\|_*$ denotes the matrix nuclear norm. A number of algorithms have been proposed to solve this relaxation, such as singular value thresholding (SVT) (Cai et al., 2010) and the augmented Lagrangian method (ALM) (Lin et al., 2010; 2011; Yang & Yuan, 2013).
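Since MC is the running application in this paper, the following minimal sketch (illustrative Python/NumPy, not the authors' code) shows one way to realize the element-wise restriction operator $\mathcal{A}$ and its adjoint $\mathcal{A}^*$ for an index set $\Xi$ given by parallel row/column index arrays; the function and variable names are ours.

```python
import numpy as np

def make_mc_operator(rows, cols, m, n):
    """Element-wise restriction A(X) = X restricted to Xi, and its adjoint A*(y).
    Xi is given by the parallel index arrays (rows, cols)."""
    def A(X):
        # Collect the observed entries of X into a vector of length l = len(rows)
        return X[rows, cols]

    def A_adj(y):
        # Scatter the vector y back onto the observed positions,
        # leaving all unobserved entries equal to zero
        Z = np.zeros((m, n))
        Z[rows, cols] = y
        return Z

    return A, A_adj

# Usage: b = A(X_true) gives l measurements; A_adj(A(X) - b) is the Euclidean
# gradient of f(X) = 0.5 * ||A(X) - b||_2^2 for this choice of A.
```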
In practice, the following matrix lasso problem is also often studied (Toh & Yun, 2010):

$$\min_{X} \frac{1}{2}\|b - \mathcal{A}(X)\|_2^2 + \lambda \|X\|_*,$$

where $\lambda$ is a regularization parameter. For this problem, the accelerated proximal gradient (APG) method has been shown to be effective (Toh & Yun, 2010; Mishra et al., 2013). While nuclear-norm based methods have had some success in practice, their applicability to large-scale problems is rather limited because they must compute high-dimensional SVDs. Recently, Mishra et al. (2013) proposed a method that solves the matrix lasso problem while avoiding high-dimensional SVDs; empirically, however, it still performs similarly to APG.

To improve the scalability of MR, a different approach is to relax problem (1) to a fixed-rank optimization problem (Mitra et al., 2010; Keshavan et al., 2010a):

$$\min_{X} f(X), \quad \text{s.t. } \operatorname{rank}(X) = r, \qquad (2)$$

where $r$ is an estimated rank. This problem is nonconvex, but it can be solved efficiently by local-search methods. In particular, since the fixed-rank matrices form a smooth matrix manifold, many efficient methods based on manifold optimization have been proposed (Meyer et al., 2011; Boumal & Absil, 2012; Vandereycken, 2013). By exploiting the smooth geometry of fixed-rank matrices, these methods have shown superior scalability compared with nuclear-norm based methods (Boumal & Absil, 2012; Mishra et al., 2012; Vandereycken, 2013).

However, existing fixed-rank methods have two deficiencies. Firstly, since the ground-truth rank of the matrix to be recovered is usually unknown, it is nontrivial to set the value of $r$ in (2). Secondly, as observed in (Ngo & Saad, 2012; Boumal & Absil, 2012), when solving (2) with an ill-conditioned $\widehat{X}$, existing fixed-rank methods may converge slowly.

To address these issues, we develop an efficient and scalable algorithm for MR that iteratively increases the rank of the matrix to be recovered by a fixed integer $\rho$. The main contributions of this paper are as follows:

1) We propose the Riemannian Pursuit (RP) method, which solves a sequence of fixed-rank minimization problems using a nonlinear Riemannian conjugate gradient (NRCG) method. The convergence of NRCG is guaranteed by the use of a strong Wolfe line search.

2) We prove that RP converges linearly under mild conditions. Compared with other fixed-rank methods, the proposed optimization scheme effectively addresses the convergence issues that occur with ill-conditioned and large-rank problems.

3) RP automatically estimates the rank under proper stopping conditions, which avoids the difficulty of rank estimation faced by most existing fixed-rank methods.

2. Notations and Preliminaries

Throughout the paper, we denote by the superscript $^T$ the transpose of a vector or matrix, by $\mathbf{0}$ a vector or matrix of all zeros, by $\operatorname{diag}(v)$ a diagonal matrix with diagonal entries equal to $v$, and by $\|v\|_p$ the $\ell_p$-norm of a vector $v$. We denote by $[n]$ the list $\{1, \dots, n\}$. Given a linear operator $\mathcal{A}$, its adjoint is denoted by $\mathcal{A}^*$. Let $A \odot B$ and $\langle A, B \rangle = \operatorname{tr}(AB^T)$ be the element-wise product and the inner product of $A$ and $B$, respectively. Denote the SVD of $X \in \mathbb{R}^{m \times n}$ by $X = U \operatorname{diag}(\sigma) V^T = \sum_{i=1}^{q} \sigma_i u_i v_i^T$, where $q = \min\{m, n\}$ and the $\sigma_i$ are arranged in descending order. The nuclear norm of $X$ is $\|X\|_* = \|\sigma\|_1 = \sum_i |\sigma_i|$, and the Frobenius norm of $X$ is $\|X\|_F = \|\sigma\|_2$. The condition number $\kappa_r(X)$ of $X$ w.r.t. a given number $r$ is defined as $\kappa_r(X) = \sigma_1 / \sigma_r$.
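As a concrete illustration of this notation, the short sketch below (illustrative NumPy only) computes $\|X\|_*$, $\|X\|_F$, and $\kappa_r(X)$ from the singular values of a matrix.

```python
import numpy as np

def spectral_quantities(X, r):
    """Nuclear norm, Frobenius norm, and condition number kappa_r(X) = sigma_1 / sigma_r,
    all computed from the singular values of X."""
    sigma = np.linalg.svd(X, compute_uv=False)  # singular values, descending order
    nuclear_norm = sigma.sum()                  # ||X||_* = ||sigma||_1
    frobenius_norm = np.linalg.norm(sigma)      # ||X||_F = ||sigma||_2
    kappa_r = sigma[0] / sigma[r - 1]           # sigma_1 / sigma_r
    return nuclear_norm, frobenius_norm, kappa_r
```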
2.1. Matrix RIP Condition

To discuss convergence, we introduce the matrix restricted isometry property (RIP) (Recht et al., 2010). Specifically, the matrix RIP constant of a linear operator $\mathcal{A}$ is the smallest number $\gamma_r$ such that

$$(1 - \gamma_r)\|X\|_F^2 \leq \|\mathcal{A}(X)\|_F^2 \leq (1 + \gamma_r)\|X\|_F^2 \qquad (3)$$

holds for all matrices of rank at most $r$. Observe that the RIP condition does not hold for MC. To study the exact recovery condition of MC, the incoherence of matrices is introduced (Candès & Recht, 2009; Candès & Plan, 2010b). Specifically, a matrix $X$ of rank $r$ with SVD $X = U\operatorname{diag}(\sigma)V^T$ is $\mu$-incoherent ($\mu \geq 1$) if

$$\|u_i\|_\infty \leq \sqrt{\mu/m} \quad \text{and} \quad \|v_i\|_\infty \leq \sqrt{\mu/n}, \quad \forall i \in [r]. \qquad (4)$$

Under these conditions, the RIP condition of MR has been extended to MC (Meka et al., 2009a).

Proposition 1. In MC, suppose the observed entry set $\Xi$ is sampled according to the Bernoulli model, with each entry $(i, j) \in \Xi$ being drawn independently with probability $p$. There exists a constant $C > 0$ such that, for all $\gamma_r \in (0, 1)$, $\mu \geq 1$, and $n \geq m \geq 3$, if $p \geq C\mu^2 r^2 \log(n) / (\gamma_r^2 m)$, the following RIP condition holds:

$$(1 - \gamma_r)\,p\,\|X\|_F^2 \leq \|\mathcal{A}(X)\|_F^2 \leq (1 + \gamma_r)\,p\,\|X\|_F^2 \qquad (5)$$

for any $\mu$-incoherent matrix $X \in \mathbb{R}^{m \times n}$ of rank at most $r$, with probability at least $1 - \exp(-n \log n)$.

2.2. Differential Geometry of Fixed-Rank Matrices

Given a positive integer $r$, consider the smooth submanifold of fixed-rank-$r$ matrices,

$$\mathcal{M}_r = \{X \in \mathbb{R}^{m \times n} : \operatorname{rank}(X) = r\} = \{U\operatorname{diag}(\sigma)V^T : U \in \operatorname{St}_r^m, V \in \operatorname{St}_r^n, \|\sigma\|_0 = r\},$$

with $\operatorname{St}_r^m = \{U \in \mathbb{R}^{m \times r} : U^T U = I\}$ the Stiefel manifold of $m \times r$ real orthonormal matrices. The tangent space $T_X\mathcal{M}_r$ of $\mathcal{M}_r$ at $X = U\operatorname{diag}(\sigma)V^T \in \mathbb{R}^{m \times n}$ is given by

$$T_X\mathcal{M}_r = \{UMV^T + U_p V^T + UV_p^T : M \in \mathbb{R}^{r \times r}, U_p \in \mathbb{R}^{m \times r}, U_p^T U = 0, V_p \in \mathbb{R}^{n \times r}, V_p^T V = 0\}.$$

Define a metric $g_X(A, B) = \langle A, B \rangle$, where $X \in \mathcal{M}_r$ and $A, B \in T_X\mathcal{M}_r$; then $\mathcal{M}_r$ is a Riemannian manifold by restricting $\langle A, B \rangle$ to the tangent bundle. Here the tangent bundle is defined as the disjoint union of all tangent spaces,

$$T\mathcal{M}_r = \bigcup_{X \in \mathcal{M}_r} \{X\} \times T_X\mathcal{M}_r = \{(X, E) \in \mathbb{R}^{m \times n} \times \mathbb{R}^{m \times n} : X \in \mathcal{M}_r, E \in T_X\mathcal{M}_r\}.$$

By restricting $f(X) = \frac{1}{2}\|b - \mathcal{A}(X)\|_2^2$ to $\mathcal{M}_r$, we obtain a smooth function on $\mathcal{M}_r$. Its Riemannian gradient is given by the orthogonal projection of the gradient of $f$ onto the tangent space. Define $P_U = UU^T$ and $P_U^{\perp} = I - UU^T$ for any $U \in \operatorname{St}_r^m$. The orthogonal projection of any $Z \in \mathbb{R}^{m \times n}$ onto the tangent space at $X = U\operatorname{diag}(\sigma)V^T$ is defined as

$$P_{T_X\mathcal{M}_r}(Z) : Z \mapsto P_U Z P_V + P_U^{\perp} Z P_V + P_U Z P_V^{\perp}. \qquad (6)$$

Let $G = \mathcal{A}^*(\mathcal{A}(X) - b)$. Then the Riemannian gradient of $f(X)$ on $\mathcal{M}_r$ can be calculated as

$$\operatorname{grad} f(X) = P_{T_X\mathcal{M}_r}(G). \qquad (7)$$

For convenience, we define $P_{T_0\mathcal{M}_r}(Z) = 0$ when $X = 0$. Moreover, the retraction mapping on $\mathcal{M}_r$, which maps an element of the tangent space back to the manifold, can be computed in closed form as

$$R_X(E) = P_{\mathcal{M}_r}(X + E) = \sum_{i=1}^{r} \sigma_i p_i q_i^T, \qquad (8)$$

where $\sum_{i=1}^{r} \sigma_i p_i q_i^T$ denotes a best rank-$r$ approximation of $X + E$. The norm of a tangent vector $\zeta_X \in T_X\mathcal{M}_r$ at $X$ is defined as $\|\zeta_X\| = \sqrt{\langle \zeta_X, \zeta_X \rangle}$. We refer to (Vandereycken, 2013) and the references therein for more details on the geometry of $\mathcal{M}_r$.
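The two operations that drive everything in the sequel are the tangent-space projection (6) and the retraction (8). The following dense sketch (illustrative Python; actual implementations such as LRGeomCG operate on low-rank factors and never form full $m \times n$ matrices) shows both.

```python
import numpy as np

def project_tangent(U, V, Z):
    """Orthogonal projection of Z onto the tangent space at X = U diag(sigma) V^T,
    following Eq. (6): P(Z) = P_U Z P_V + P_U_perp Z P_V + P_U Z P_V_perp."""
    ZV = Z @ V               # m x r
    UtZ = U.T @ Z            # r x n
    M = U.T @ ZV             # r x r core block
    Up = ZV - U @ M          # part of Z V orthogonal to range(U)
    Vp = UtZ.T - V @ M.T     # part of Z^T U orthogonal to range(V)
    return U @ M @ V.T + Up @ V.T + U @ Vp.T

def retract(Y, r):
    """Retraction (8): best rank-r approximation of Y = X + E via truncated SVD."""
    P, s, Qt = np.linalg.svd(Y, full_matrices=False)
    return P[:, :r] @ np.diag(s[:r]) @ Qt[:r, :]
```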
3. Riemannian Pursuit

Solving problem (1) is hard because little is known about the rank of the matrix to be recovered; if the rank were known, (1) would reduce to a fixed-rank minimization problem. Since fixed-rank methods have been very successful at solving big MR problems when the rank is known explicitly, we propose to iteratively increase the rank by a fixed integer $\rho$ and solve a series of fixed-rank minimization subproblems until a proper stopping condition is reached. Once this procedure is finished, the final rank returned is our rank estimate, and we can perform a final, more accurate fixed-rank optimization step.

Based on this motivation, the proposed method is presented in Algorithm 1. Since Riemannian optimization is a core element of the algorithm, we refer to it as Riemannian Pursuit (RP) in the sequel. The parameter $\rho$ in RP is crucial for both rank estimation and convergence; for ease of presentation, we defer the detailed setting of $\rho$ to Section 3.2. In Algorithm 1, $\xi_t = b - \mathcal{A}(X_t)$ represents the residual at iteration $t$. Starting from $X_0 = 0$, RP iterates over two main steps: 1) identifying the most-active subspace through an active-subspace search in Step 2, and 2) solving the master problem, a fixed-rank minimization problem, in Step 3. In Algorithm 1 and in the rest of the paper, we use the notation $P_t := P_{T_{X_t}\mathcal{M}_{t\rho}}$. The parameter $\epsilon_{\text{out}}$ is a tolerance for the stopping condition of Algorithm 1 and will be detailed in Section 3.1.

Algorithm 1 RP: Riemannian Pursuit for MR.
Require: Rank increase $\rho$; inner and outer iteration tolerances $\epsilon_{\text{in}}$ and $\epsilon_{\text{out}}$.
1: Initialize $X_0 = 0$, $\xi_0 = b$, $G = \mathcal{A}^*(\xi_0)$, and $t = 1$.
2: Perform an active-subspace search:
   2a: Compute $Q = G - P_{t-1}(G)$.
   2b: Compute a best rank-$\rho$ approximation of $Q$: $H_2^{t-1} = U_\rho \operatorname{diag}(\sigma_\rho) V_\rho^T$.
3: Let $H_1^{t-1} = P_{t-1}(G)$ and $H^{t-1} = H_1^{t-1} + H_2^{t-1}$.
   3a: Choose a proper step size $\tau_t$ from (10) and set $X_{\text{initial}} = R_{X_{t-1}}(\tau_t H^{t-1})$. (Warm start)
   3b: Using $X_{\text{initial}}$ as the initial guess, call $X_t = \text{NRCG}(X_{\text{initial}}, \epsilon_{\text{in}})$.
4: Update $\xi_t = b - \mathcal{A}(X_t)$ and $G = \mathcal{A}^*(\xi_t)$.
5: Quit if the stopping condition on $\epsilon_{\text{out}}$ is met; otherwise, let $t := t + 1$ and go to Step 2.

In Step 2, the active-subspace search determines the most-active subspace that is orthogonal to $X_{t-1}$ from $G = \mathcal{A}^*(\xi_{t-1})$. Such a subspace is obtained by computing the top-$\rho$ singular values and vectors of $Q = G - P_{t-1}(G)$, which is orthogonal to $T_{X_{t-1}}\mathcal{M}_{(t-1)\rho}$ and thus also to $X_{t-1}$. Remark that, for computational reasons, the master problem in Step 3b is not solved exactly, so $P_{t-1}(G)$ is not necessarily zero (or negligible). After the active-subspace search, we have $\operatorname{rank}(X_t) = t\rho$, namely $X_t \in \mathcal{M}_{t\rho}$.

The master problem optimization in Step 3 solves the fixed-rank problem

$$\min_{X} f(X), \quad \text{s.t. } \operatorname{rank}(X) = t\rho, \qquad (9)$$

where $f(X) = \frac{1}{2}\|\mathcal{A}(X) - b\|_2^2$ is a smooth function. In particular, we solve it with LRGeomCG from (Vandereycken, 2013), which is a nonlinear Riemannian conjugate gradient (NRCG) method. As shown in Algorithm 1, NRCG involves two parameters, namely the initial point $X_{\text{initial}}$ and its stopping tolerance $\epsilon_{\text{in}}$. The rank-$t\rho$ point $X_{\text{initial}}$ from Step 3a is used as a warm start for NRCG, which is important for the overall efficiency of the algorithm. To be more specific, we use $R_{X_{t-1}}(\tau_t H^{t-1})$ as the initial point for NRCG, where the step size $\tau_t$ is obtained by a line search on the condition

$$f(R_{X_{t-1}}(\tau_t H^{t-1})) \leq f(X_{t-1}) - \frac{\tau_t}{2} \langle H^{t-1}, H^{t-1} \rangle. \qquad (10)$$
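To make the structure of Algorithm 1 concrete, here is a heavily simplified sketch of the RP outer loop (illustrative Python, not the authors' implementation). It uses dense SVDs for the rank-$\rho$ active-subspace search and the retraction, a plain backtracking search for the warm-start condition (10), and treats the inner fixed-rank solver `nrcg` as a black box; `A` and `A_adj` are the measurement operator and its adjoint (e.g., as in the MC sketch above), and all names are ours.

```python
import numpy as np

def svd_factors(X, k):
    """Top-k singular factors of X (dense, for illustration only)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], Vt[:k, :].T

def tangent_project(U, V, Z):
    """Projection onto the tangent space at X = U diag(s) V^T, Eq. (6)."""
    return U @ (U.T @ Z) + Z @ (V @ V.T) - U @ (U.T @ Z) @ (V @ V.T)

def best_rank(Z, k):
    """Best rank-k approximation of Z (used in Step 2b and the retraction (8))."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

def riemannian_pursuit(A, A_adj, b, shape, rho, nrcg, eps_out=1e-5, max_outer=50):
    """Simplified sketch of the RP outer loop (Algorithm 1).
    `nrcg(X_init, rank)` is assumed to (approximately) solve the fixed-rank
    problem (9) and return an iterate of the same rank.  The relative-residual
    test (12) from Section 3.1 is omitted for brevity."""
    m, n = shape
    X = np.zeros((m, n))
    f_prev = 0.5 * np.linalg.norm(b) ** 2            # f(X_0) for X_0 = 0
    for t in range(1, max_outer + 1):
        G = A_adj(b - A(X))                          # G = A*(xi_{t-1})
        if t == 1:
            H1 = np.zeros((m, n))                    # P_{T_0}(G) := 0 by convention
        else:
            U, V = svd_factors(X, (t - 1) * rho)
            H1 = tangent_project(U, V, G)            # H_1^{t-1} = P_{t-1}(G)
        H2 = best_rank(G - H1, rho)                  # Step 2: active-subspace search
        H = H1 + H2
        # Step 3a: warm start; simple backtracking until condition (10) holds
        tau = 1.0
        while 0.5 * np.linalg.norm(b - A(best_rank(X + tau * H, t * rho))) ** 2 \
                > f_prev - 0.5 * tau * np.sum(H * H) and tau > 1e-12:
            tau *= 0.5
        X_init = best_rank(X + tau * H, t * rho)     # retraction R_{X_{t-1}}(tau H)
        X = nrcg(X_init, t * rho)                    # Step 3b: inner fixed-rank solver
        f_cur = 0.5 * np.linalg.norm(b - A(X)) ** 2
        # Stopping condition from Section 3.1 (difference of objective values)
        if 2.0 * (f_prev - f_cur) / (rho * np.linalg.norm(b) ** 2) <= eps_out:
            break
        f_prev = f_cur
    return X
```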
Main Theoretical Results: Before presenting the details of NRCG, we summarize two major theoretical results regarding the convergence of Algorithm 1. Firstly, let $\{X_t\}$ be the sequence generated by RP; then $f(X_t)$ decreases monotonically w.r.t. $t$.

Lemma 1. Let $\{X_t\}$ be the sequence generated by RP. Then

$$f(X_t) \leq f(X_{t-1}) - \frac{\tau_t}{2}\|H_2^{t-1}\|_F^2, \qquad (11)$$

where $\tau_t$ satisfies condition (10).

Secondly, let $\widehat{X}$ be the ground-truth low-rank matrix and $e$ the additive noise. The following theorem indicates that $f(X_t)$ decreases linearly as long as $f(X_t) > f(\widehat{X}) = \frac{1}{2}\|e\|_2^2$.

Theorem 1. Let $\{X_t\}$ be the sequence generated by RP and $\zeta = \min\{\tau_1, \dots, \tau_\iota\}$. As long as $f(X_t) \geq \frac{C}{2}\|e\|_2^2$ (where $C > 1$) and there exists an integer $\iota > 0$ such that $\gamma_{\widehat{r}+2\iota\rho} < \frac{1}{2}$, then RP decreases linearly in objective value when $t < \iota$, namely $f(X_{t+1}) \leq \nu f(X_t)$, where the factor $\nu < 1$ depends on $\zeta$, $C$, and $\gamma_{\widehat{r}+2\iota\rho}$ through the quantities $C(1 - 2\gamma_{\widehat{r}+2\iota\rho})^2$ and $(\sqrt{C} + 1)^2(1 - \gamma_{\widehat{r}+2\iota\rho})$.

This theorem characterizes the convergence speed of RP under the RIP condition $\gamma_{\widehat{r}+2\iota\rho} < \frac{1}{2}$. To apply it to the matrix completion problem, we need to adopt the variant of the RIP condition in (5) and assume that the iterates $X_t$ are incoherent, uniformly in $t$.

3.1. Stopping Conditions for RP

According to Lemma 1, RP monotonically decreases the objective value. Without proper stopping conditions, RP may keep increasing the rank until $t\rho$ reaches $\min(m, n)$, where $\xi_t = b - \mathcal{A}(X_t) = 0$ and the solution will likely be overfitted. To avoid this, one can terminate on a small relative residual,

$$\|\xi_t\|_F / \|b\|_F \leq \lambda_F. \qquad (12)$$

In real-world problems, the matrix to be recovered is not exactly low-rank, and then (12) may not be adequate. Since RP decreases the objective value monotonically, we propose to use a difference in function values as the stopping condition,

$$2\big(f(X_{t-1}) - f(X_t)\big) \,/\, \big(\rho\,\|b\|_F^2\big) \leq \epsilon_{\text{out}},$$

where $\epsilon_{\text{out}}$ is a predefined tolerance. This condition is based on the assumption that over-fitting will occur once increasing the rank no longer significantly decreases the objective value. In practice, $\epsilon_{\text{out}} = 10^{-5}$ is usually a good choice.

3.2. Parameter Selection on $\rho$

As shown in Theorem 1, RP with a larger $\rho$ converges faster. However, a small $\rho$ is required in order to estimate the rank accurately. In particular, when dealing with ill-conditioned problems, $\rho$ should be small enough. We present a simple and effective method to set an appropriate $\rho$. Let $\sigma$ be the vector of singular values of $\mathcal{A}^*(b)$, with $\sigma_i$ arranged in descending order. Motivated by the thresholding strategy in StOMP for sparse recovery (Donoho et al., 2012), we choose $\rho$ such that, for a given $0 < \eta \leq 1$,

$$\sigma_i \geq \eta\,\sigma_1, \quad \forall i \leq \rho. \qquad (13)$$

In other words, $\rho$ is the number of sufficiently large singular values of $\mathcal{A}^*(b)$. In general, a smaller $\eta$ leads to a larger $\rho$. When setting $\eta = 1$, we trivially have $\rho = 1$, which is not efficient when the exact rank $\widehat{r}$ is large (in practice, we suggest setting $\eta$ around 0.60). We refer to the Supplementary Materials for an efficient computational strategy to compute $\rho$ given $\eta$.
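Rule (13) is straightforward to implement; an illustrative sketch follows (dense SVD here purely for clarity; the paper's supplementary material describes a more efficient strategy for computing $\rho$ given $\eta$).

```python
import numpy as np

def choose_rho(A_adj, b, eta=0.65):
    """Pick the rank increment rho by rule (13): the number of singular values
    of A*(b) that are at least eta times the largest one."""
    sigma = np.linalg.svd(A_adj(b), compute_uv=False)   # descending singular values
    return max(int(np.sum(sigma >= eta * sigma[0])), 1)
```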
3.3. Nonlinear Riemannian Conjugate Gradient

In this section, we detail the NRCG method for solving the fixed-rank problem (9) in Step 3b of RP. To distinguish it from the outer iteration variable $X_t$ of RP, we write $X_k$ for the inner iteration index $k$ of NRCG. In Euclidean space, the search direction $\zeta_k$ of nonlinear CG is calculated as $\zeta_k = -\operatorname{grad} f(X_k) + \beta_k \zeta_{k-1}$, where $\beta_k$ is calculated from, for example, the Fletcher-Reeves (FR) rule:

$$\beta_k = \frac{\langle \operatorname{grad} f(X_k), \operatorname{grad} f(X_k) \rangle}{\langle \operatorname{grad} f(X_{k-1}), \operatorname{grad} f(X_{k-1}) \rangle}. \qquad (14)$$

Unlike in Euclidean space, search directions on a manifold are adapted to follow a path on the manifold (Absil et al., 2008). In particular, since $\operatorname{grad} f(X_k) \in T_{X_k}\mathcal{M}_r$, $\operatorname{grad} f(X_{k-1}) \in T_{X_{k-1}}\mathcal{M}_r$, and $\zeta_{k-1}$ live in different tangent spaces of the manifold, the two equations above are not directly applicable on Riemannian manifolds. To extend nonlinear CG from Euclidean space to Riemannian manifolds, two additional operators, namely a retraction and a vector transport, are necessary. With the retraction mapping defined in (8), one can move a point along the direction of a tangent vector while staying on the manifold. A vector transport $\mathcal{T}$ on a manifold $\mathcal{M}_r$ is a smooth map that transports tangent vectors from one tangent space to another (Absil et al., 2008). Denoting such a vector transport by $\mathcal{T}_{X \to Y} : T_X\mathcal{M}_r \to T_Y\mathcal{M}_r$, the conjugate direction can be calculated as

$$\zeta_k = -E_k + \beta_k \,\mathcal{T}_{X_{k-1} \to X_k}(\zeta_{k-1}), \qquad (15)$$

where $E_k = \operatorname{grad} f(X_k)$ and $\beta_k$ is determined from (14). The NRCG method is presented in Algorithm 2, which comprises two major steps: 1) calculating the conjugate search direction in Step 3, and 2) updating $X_{k+1}$ by retraction, namely $X_{k+1} = R_{X_k}(\theta_k \zeta_k)$, where $\theta_k$ denotes a step size satisfying the strong Wolfe conditions. Specifically, given a descent direction $\zeta_k \in T_{X_k}\mathcal{M}_r$, $\theta_k$ is determined such that

$$f(R_{X_k}(\theta_k \zeta_k)) \leq f(X_k) + c_1 \theta_k \langle \operatorname{grad} f(X_k), \zeta_k \rangle, \qquad (16)$$

$$|\langle \operatorname{grad} f(R_{X_k}(\theta_k \zeta_k)), \mathcal{T}_{X_k \to R_{X_k}(\theta_k \zeta_k)}(\zeta_k) \rangle| \leq c_2 |\langle \operatorname{grad} f(X_k), \zeta_k \rangle|, \qquad (17)$$

where $0 < c_1 < c_2 < 1/2$. Two choices for the vector transport are the orthogonal projection

$$\mathcal{T}_{X \to Y}(\zeta) = P_{T_Y\mathcal{M}_r}(\zeta) \qquad (18)$$

and the scaled differentiated retraction

$$\mathcal{T}_{X \to Y}(\zeta) = \alpha \left. \frac{d}{dt} R_X(\nu + t\zeta) \right|_{t=0}, \qquad (19)$$

where $\nu = R_X^{-1}(Y)$ and $\alpha$ is chosen such that $\|\mathcal{T}_{X \to Y}(\zeta)\|_F = \|\zeta\|_F$; see (Sato & Iwai, 2013). The standard choice (18) from (Vandereycken, 2013) is cheaper to evaluate, but (19) is required for proving the convergence of NRCG.

Proposition 2 (Sato & Iwai (2013)). Given the retraction (8) and the vector transport (19) on $\mathcal{M}_r$, there exists a step size $\theta_k$ that satisfies the strong Wolfe conditions (16) and (17).

Lemma 2 (Sato & Iwai (2013)). The search direction $\zeta_k$ generated by NRCG using the vector transport (19) and the strong Wolfe conditions (16) and (17) satisfies

$$-\frac{1}{1 - c_2} \leq \frac{\langle \operatorname{grad} f(X_k), \zeta_k \rangle}{\langle \operatorname{grad} f(X_k), \operatorname{grad} f(X_k) \rangle} \leq \frac{2c_2 - 1}{1 - c_2}. \qquad (20)$$

Using (Dieci & Eirola, 1999), (19) can be implemented efficiently, even though it is more expensive than (18). One can show that the difference between (18) and (19) is $O(\|\zeta\|^2)$, hence they behave identically near an optimizer, where $\zeta \to 0$. For convenience, we therefore use (18) in the numerical experiments.

Global convergence of NRCG is obtained if the functions $f(R_{X_k}(\theta \zeta_k))$ are Lipschitz continuously differentiable in $\theta$. Since the manifold $\mathcal{M}_{t\rho}$ is not closed (matrices of rank less than $t\rho$ lie in its closure), such a condition cannot hold uniformly on $\mathcal{M}_{t\rho}$. In (Vandereycken, 2013), the additional term $\frac{1}{2}\mu^2(\|X^{\dagger}\|_F^2 + \|X\|_F^2)$ was added to the objective function to penalize rank drops. For simplicity, we assume that $f$ satisfies the necessary Lipschitz conditions throughout the iteration.

Assumption 1. Let $\{X_k\}$ and $\{\zeta_k\}$ be the sequences of iterates and search directions generated by the NRCG method in Step 3b. We assume that $\theta \mapsto f(R_{X_k}(\theta \zeta_k))$ is Lipschitz continuously differentiable with a uniform Lipschitz constant $L > 0$.

Theorem 2. Let $\{X_k\}$ be the sequence generated by the NRCG method in Step 3b of Algorithm 1 with the strong Wolfe line search, where $0 < c_1 < c_2 < 1/2$. Then $\liminf_{k \to \infty} \|\operatorname{grad} f(X_k)\| = 0$.

Proof: Combine Lemma 2 and Assumption 1 as in (Sato & Iwai, 2013).

Algorithm 2 NRCG($X_{\text{initial}}$, $r$, $\epsilon_{\text{in}}$) for solving (2).
1: Initialize $X_1 = X_{\text{initial}}$ and $\zeta_0 = 0$. Let $k = 1$.
2: Compute the gradient $E_k = \operatorname{grad} f(X_k)$ by (7).
3: Compute the conjugate direction $\zeta_k$ according to (15).
4: Choose a step size $\theta_k$ satisfying the strong Wolfe conditions (16) and (17), and set $X_{k+1} = R_{X_k}(\theta_k \zeta_k)$.
5: Terminate and output $X_{k+1}$ if the stopping conditions are achieved; otherwise, let $k := k + 1$ and go to Step 2.
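The sketch below ties the pieces of Algorithm 2 together in illustrative Python (dense operations only, not the authors' LRGeomCG-based implementation): the FR rule (14), the projection-based vector transport (18), and, for simplicity, a plain Armijo backtracking search in place of the strong Wolfe search (16)-(17) used in the paper.

```python
import numpy as np

def nrcg_sketch(A, A_adj, b, X0, r, eps_in=1e-6, max_iter=100):
    """Illustrative dense sketch of the inner NRCG solver (Algorithm 2)."""
    def factors(X):                         # rank-r factors of the current iterate
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return U[:, :r], Vt[:r, :].T

    def proj(U, V, Z):                      # tangent-space projection, Eq. (6)
        return U @ (U.T @ Z) + Z @ (V @ V.T) - U @ (U.T @ Z) @ (V @ V.T)

    def retract(Z):                         # retraction (8): best rank-r approximation
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

    f = lambda X: 0.5 * np.linalg.norm(A(X) - b) ** 2
    X = X0
    U, V = factors(X)
    grad = proj(U, V, A_adj(A(X) - b))      # Riemannian gradient, Eq. (7)
    zeta = -grad                            # first direction: steepest descent
    for _ in range(max_iter):
        if np.linalg.norm(grad) <= eps_in:
            break
        slope = np.sum(grad * zeta)
        if slope >= 0:                      # safeguard: restart with steepest descent
            zeta, slope = -grad, -np.sum(grad * grad)
        theta = 1.0                         # Armijo backtracking (condition (16) only)
        while f(retract(X + theta * zeta)) > f(X) + 1e-4 * theta * slope and theta > 1e-12:
            theta *= 0.5
        X_new = retract(X + theta * zeta)
        U, V = factors(X_new)
        grad_new = proj(U, V, A_adj(A(X_new) - b))
        beta = np.sum(grad_new * grad_new) / np.sum(grad * grad)     # FR rule (14)
        zeta = -grad_new + beta * proj(U, V, zeta)   # transport (18) of the old direction
        X, grad = X_new, grad_new
    return X

# A callable of the form expected by the outer-loop sketch above can be obtained as
# nrcg = lambda X_init, rank: nrcg_sketch(A, A_adj, b, X_init, rank).
```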
3.4. Computational Advantages of RP

The proposed RP algorithm has several computational advantages. First, it is useful for rank detection: under the stopping conditions in Section 3.1, RP automatically estimates the rank as $r = t\rho$. Second, RP converges well on ill-conditioned problems. Even when the condition number $\kappa_r(X)$ is very large, $\kappa_{t\rho}(X)$ remains small for small $t$. Thus NRCG can converge well and, consequently, the overall convergence of RP is improved significantly. Third, RP scales well to large problems. For example, the only SVD calculations required are the best rank-$t\rho$ approximation of $X_t + \xi$ for the retraction and the best rank-$\rho$ approximation of the matrix $Q$ in Step 2b. In all cases, the matrices involved are highly structured. Specifically, $X_k + \theta_k \zeta_k$ is a matrix of rank at most $2t\rho$, so its full SVD can be computed very efficiently; see (Vandereycken, 2013). For $Q = G - P_t(G)$, we can use PROPACK (Larsen, 2004) or randomized low-rank approximation (Halko et al., 2011) at a cost of $O((|G| + t\rho n)\rho)$, where $|G|$ is the cost of one matrix-vector product with $G$. For MC, for example, $G$ is highly sparse and $|G| = O(\widehat{r} n \log^2 n)$. Finally, thanks to the warm start, each application of NRCG requires only a few iterations to reach a sufficiently accurate solution.

4. Related Studies

Fixed-rank methods such as the low-rank geometric conjugate gradient method (LRGeomCG) (Vandereycken, 2013), the quotient geometric matrix completion method (qGeomMC) (Mishra et al., 2012), and the method of scaled gradients on Grassmann manifolds for matrix completion (ScGrassMC) (Ngo & Saad, 2012) have shown promising performance. Gradient and stochastic gradient methods have also been developed for fixed-rank problems (Jaggi & Sulovsky, 2010; Wen et al., 2012). All these methods require an estimate of $r$. Greedy-like algorithms have also been proposed to solve the fixed-rank problem, such as Singular Value Projection (SVP) (Meka et al., 2009b), Atomic Decomposition for Minimum Rank Approximation (ADMiRA) (Lee & Bresler, 2010), and SpaRCS (Waters et al., 2011); these methods are guaranteed to converge only under restrictive conditions. Shwartz et al. (2011) proposed the Greedy Efficient Component Optimization (GECO) algorithm to solve a convex relaxation of the rank-constrained problem by iteratively increasing the rank by 1. Jaggi & Sulovsky (2010) proposed an approximation algorithm based on a sparse approximate SDP model (Hazan, 2008). Laue (2012) proposed a hybrid strategy that solves the MR problem by iteratively increasing the rank by 1. Different from these methods, RP solves a sequence of fixed-rank problems and increases the rank by $\rho \geq 1$ per iteration. In practice, it is crucial to use a large $\rho$ to accelerate convergence on big matrices of large rank. Moreover, unlike RP, GECO solves much more expensive regression subproblems, and Laue's method solves its nonlinear master problems with a BFGS method, which is memory-inefficient on large-scale problems. Finally, these methods do not address the importance of stopping conditions for rank estimation.

5. Numerical Experiments

5.1. Baseline Methods and Performance Metric

The following state-of-the-art methods are adopted as baselines:

- SVP (Meka et al., 2009b);
- APG (Toh & Yun, 2010), which uses a homotopy strategy to solve the matrix lasso problem;
- GECO (Shwartz et al., 2011);
- LMaFit (Wen et al., 2012) and LMaFit-A (Wen et al., 2012), the latter of which extends LMaFit by automatically estimating the rank;
- ScGrassMC (Ngo & Saad, 2012), a fixed-rank method claimed to alleviate the convergence issue on ill-conditioned problems;
- LRGeomCG (Vandereycken, 2013), a fixed-rank method that uses a nonlinear Riemannian conjugate gradient with an Armijo line search;
- qGeomMC (Mishra et al., 2012), a fixed-rank method based on manifold optimization.

The implementations of LMaFit, ScGrassMC, qGeomMC and LRGeomCG are from http://www.montefiore.ulg.ac.be/~mishra/fixedrank/fixedrank.html; APG, SVP and GECO are from http://www.math.nus.edu.sg/~mattohkc/NNLS.html, http://www.cs.utexas.edu/~pjain/svp/, and www.cs.huji.ac.il/~shais/code/index.html, respectively. Note that some related methods, such as IALM and OptSpace, are not considered as baselines in this paper, since the baselines above have been shown to be state-of-the-art for MR (Wen et al., 2012; Vandereycken, 2013). For RP, we adopt the stopping criteria discussed in Section 3.1.

[Figure 1. Convergence of the compared methods on $s_g$ and $s_{\chi^2}$, where $\zeta_{os} = 3$ and $\widehat{r} = 50$. Both panels plot the relative objective value against time (in seconds). Panel (a): Gaussian singular values $s_g$, where $\rho$ w.r.t. the different $\eta$'s is 1, 8, 14, and 18, respectively. Panel (b): $\chi^2$ singular values $s_{\chi^2}$, where $\rho$ w.r.t. the different $\eta$'s is 1, 4, 6, and 12, respectively. GECO cannot converge within 1 hour on $s_g$, and we omit its results on $s_{\chi^2}$. ScGrassMC encounters numerical problems after 50 iterations on $s_{\chi^2}$.]

The relative testing error (RTE) is adopted as the comparison metric in the synthetic experiments: $\mathrm{RTE} = \|P_{\Xi}(\widehat{X} - X^*)\|_F / \|\widehat{X}_{\Xi}\|_F$. The testing root-mean-square error (RMSE) is used as the comparison metric in the real-world applications: $\mathrm{RMSE} = \|P_{\Xi}(\widehat{X} - X^*)\|_F / \sqrt{|\Xi|}$. Here $\widehat{X}$ denotes the observed matrix (with missing entries filled with 0), $X^*$ denotes the recovered matrix, and $\Xi$ denotes the index set of observed entries. All experiments (except for GECO) are conducted in Matlab on a PC with a 64-bit operating system, an Intel(R) Core(TM) i7 CPU (2.80 GHz, single-thread mode), and 24 GB of memory.

5.2. Synthetic Problem Generation

In the synthetic experiments, we focus on matrices with large condition numbers. Following (Ngo & Saad, 2012), we generate ground-truth low-rank matrices $\widehat{X} = \widehat{U}\operatorname{diag}(\widehat{\sigma})\widehat{V}^T + e$, where $\widehat{\sigma}$ is an $\widehat{r}$-sparse vector, $\widehat{U} \in \operatorname{St}_{\widehat{r}}^m$, and $\widehat{V} \in \operatorname{St}_{\widehat{r}}^n$. Two types of singular values are studied: 1) Gaussian sparse singular values $s_g$, with each nonzero entry sampled from the Gaussian distribution $N(0, 1000)$; and 2) $\chi^2$ sparse singular values $s_{\chi^2}$, where each entry is the square of the corresponding entry of $s_g$. Both types of singular values decay quickly, and the resulting condition numbers $\kappa_{\widehat{r}}(\widehat{X})$ are very large. Once $\widehat{X}$ is generated, we sample $l = r(m + n - r)\,\zeta_{os}$ entries from $\widehat{X}$ uniformly to produce $b$, where $\zeta_{os}$ is the oversampling factor (Lin et al., 2010). In the noisy cases, each sampled entry is disturbed by Gaussian noise of strength $\epsilon\,\|b\|_2/\|n\|_2$, where $\epsilon$ is a strength factor and $n \in \mathbb{R}^l$ is generated by n = randn(l, 1) in Matlab.
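For concreteness, the following sketch generates such a synthetic instance under the assumptions stated above (illustrative Python rather than the Matlab used in the paper; the interpretation of $N(0, 1000)$ as variance 1000 and the exact noise convention are our reading, not a verbatim reproduction of the authors' setup).

```python
import numpy as np

def generate_problem(m, n, r, zeta_os, kind="gaussian", eps=0.0, seed=0):
    """Synthetic MC instance in the spirit of Section 5.2.
    kind = "gaussian": singular values drawn from N(0, 1000);
    kind = "chi2":     the squares of such draws (even faster decay)."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((m, r)))   # random orthonormal factors
    V, _ = np.linalg.qr(rng.standard_normal((n, r)))
    s = rng.normal(0.0, np.sqrt(1000.0), size=r)       # variance-1000 reading of N(0, 1000)
    if kind == "chi2":
        s = s ** 2
    s = np.sort(np.abs(s))[::-1]                       # nonnegative, descending spectrum
    X_true = (U * s) @ V.T                             # rank-r ground truth
    l = int(r * (m + n - r) * zeta_os)                 # oversampling factor zeta_os
    idx = rng.choice(m * n, size=l, replace=False)     # uniform sampling of entries
    rows, cols = np.unravel_index(idx, (m, n))
    b = X_true[rows, cols]
    if eps > 0:                                        # additive Gaussian noise on b
        noise = rng.standard_normal(l)
        b = b + eps * np.linalg.norm(b) / np.linalg.norm(noise) * noise
    return X_true, rows, cols, b
```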
5.3. Performance Comparison in Noiseless Cases

We compare the convergence of RP on $s_g$ and $s_{\chi^2}$ with several baseline methods. To show the impact of $\eta$ from (13) (and thus of $\rho$) on the convergence of RP, we test $\eta = 1.00$, $0.75$, $0.65$, and $0.55$. For simplicity, we set the rank parameter for the fixed-rank methods to the ground-truth rank, namely $r = \widehat{r} = 50$, which is the best choice for $r$ in the noiseless case. The relative objective value w.r.t. the computational time is shown in Fig. 1. As can be seen from Fig. 1, RP with different $\eta$ generally converges faster than the other methods on both types of matrices. The other methods converge slowly because of the large condition numbers of $s_g$ and $s_{\chi^2}$. These observations confirm that RP converges well on ill-conditioned problems, for the reasons discussed in Section 3.4.

5.4. Performance Comparison in Noisy Cases

In this section, we study the performance of the compared methods on noisy problems of medium size (i.e., $m = n = 5{,}000$). We report the training time and RTE of the various methods. For the medium-sized problems, we generate matrices with ground-truth rank $\widehat{r} = 50$ and produce the observations with noise strength factor $\epsilon = 0.01$ under oversampling rates $\zeta_{os} \in \{2, 2.3, 2.5, 2.8, 3.0, 3.3, 3.5, 3.8, 4.0\}$. We compare RP with SVP, APG, LMaFit, LMaFit-A, qGeomMC and LRGeomCG. Note that LMaFit-A can automatically adjust the rank. We set $\lambda_F = 0.01$ and $\epsilon_{\text{out}} = 10^{-5}$ for the stopping conditions and $\eta = 0.65$ for RP. For APG, we set the trade-off parameter $\lambda = 10^{-3}\sigma_{\max}$, where $\sigma_{\max}$ is the largest singular value of $\mathcal{A}^*(b)$. For all the fixed-rank methods, we set $r = \widehat{r} = 50$. We use default settings for the other parameters of the baselines. For each oversampling rate, we run 10 independent experiments. The averaged computational time, RTE and estimated ranks are shown in Fig. 2(a), Fig. 2(b) and Fig. 2(c), respectively.

From Fig. 2(a) and Fig. 2(b), under the various oversampling factors, RP generally shows the smallest relative testing error and the best optimization efficiency among all methods. More importantly, from Fig. 2(c), only RP estimates the target rank correctly under all oversampling factors, while APG detects the rank correctly only when there are enough observations.

5.5. Performance Comparison on Big Matrices

In the experiments on large-scale matrix completion problems, we vary $\widehat{r}$ in $\{50, 100\}$ and set $m = n = 20{,}000$, $\zeta_{os} = 4$ and $\epsilon = 0.05$. Here, we only compare the scalability of RP with that of LRGeomCG and qGeomMC, since these two methods have shown better efficiency than the other baselines. We set $\lambda_F = 0.05$, $\epsilon_{\text{out}} = 10^{-5}$ and $\eta = 0.65$ for RP. To demonstrate the superiority of RP over the baselines in terms of computational speed, we terminate the baselines once they reach the same objective value as RP or a maximum of 400 iterations. In addition, we set $r = \widehat{r}$ for these two fixed-rank methods. The other parameters are the same as in the experiments on the medium-sized problems. We run 10 independent experiments and record the averaged results. The averaged computational time and RTE are listed in Table 1 and Table 2, respectively.

Table 1. Averaged training time (seconds) on big matrices.

| Method | $s_g$, $\widehat{r}=50$ | $s_g$, $\widehat{r}=100$ | $s_{\chi^2}$, $\widehat{r}=50$ | $s_{\chi^2}$, $\widehat{r}=100$ |
|---|---|---|---|---|
| LRGeomCG | 316.8 | 992.2 | 564.3 | 1018.8 |
| qGeomMC | 216.1 | 1091.5 | 415.1 | 455.3 |
| RP | 57.5 (48) | 205.6 (102) | 75.4 (48) | 150.8 (85) |

*The number in brackets is the rank estimated by RP. LRGeomCG and qGeomMC use the ground-truth rank ($r = \widehat{r}$).

As can be seen from Table 1, the computational time of RP is in general about 4 times smaller than that of LRGeomCG and about 3 times smaller than that of qGeomMC. In addition, as can be seen from Table 2, with the same training error, RP is also slightly better than the other two baseline methods in terms of the relative testing error.
Notice that here we choose $r = \widehat{r}$ for LRGeomCG and qGeomMC, which is the optimal choice for rank estimation.

Table 2. Averaged relative testing error on big matrices.

| Method | $s_g$, $\widehat{r}=50$ | $s_g$, $\widehat{r}=100$ | $s_{\chi^2}$, $\widehat{r}=50$ | $s_{\chi^2}$, $\widehat{r}=100$ |
|---|---|---|---|---|
| LRGeomCG | 0.0374 | 0.0388 | 0.074 | 0.0695 |
| qGeomMC | 0.0351 | 0.0316 | 0.120 | 0.0347 |
| RP | 0.0316 | 0.0306 | 0.031 | 0.0275 |

[Figure 2. Comparison on medium-sized problems of rank $\widehat{r} = 50$, averaged over 10 independent trials. Panels: (a) relative testing error (log scale), (b) training time (seconds, log scale), and (c) estimated rank, each plotted against the oversampling factor $\zeta_{os}$.]

5.6. Real-World Experiments

In this section, we compare RP with the baseline methods on two real-world large-scale collaborative filtering datasets: MovieLens with 10M ratings (denoted by Movie-10M) (Herlocker et al., 1999) and the Netflix Prize dataset (KDDCup, 2007). Movie-10M contains 10M ratings given by 71,567 users on 10,681 movies, while Netflix Prize contains 100,480,507 ratings given by 480,189 users on 17,770 movies. The baseline methods include APG, LRGeomCG, qGeomMC, LMaFit, LMaFit-A, GECO, Jaggi's method and Laue's method.

In general, collaborative filtering data are very noisy, so the singular values of the matrices tend to be long-tailed. Therefore, we need to set larger stopping tolerances to alleviate over-fitting. For this experiment, we set $\lambda_F = 0.2$ and $\epsilon_{\text{out}} = 10^{-4}$. We set $\eta = 0.65$, and the ranks detected by RP are used as the rank estimates for the fixed-rank methods, namely LRGeomCG, qGeomMC and LMaFit. Finally, we constrain the maximum rank for all methods to 100. For comparison, we report the testing RMSE of the different methods over 10 random 80/20 train/test partitions, as explained in (Laue, 2012). Comparison results are shown in Table 3. From the table, we observe that RP performs the best among all the methods in terms of RMSE and computational efficiency. It is worth mentioning that we use the rank detected by RP as the rank estimate for LRGeomCG and qGeomMC; therefore, RP would be even faster relative to these two methods if the cost of model selection were taken into account.

Table 3. Experimental results on the real-world datasets.

| Method | Movie-10M RMSE | Movie-10M Time (s) | Netflix RMSE | Netflix Time (s) |
|---|---|---|---|---|
| APG | 1.096 | 1048 ± 17 | - | - |
| LRGeomCG | 0.824 | 338 ± 11 | 0.867 | 3128 ± 35 |
| qGeomMC | 0.850 | 189 ± 7 | 0.880 | 3965 ± 74 |
| LMaFit | 0.837 | 307 ± 1 | 0.875 | 3798 ± 50 |
| LMaFit-A | 0.969 | 421 ± 16 | 0.962 | 5286 ± 165 |
| RP | 0.817 | 81 ± 1 | 0.859 | 1332 ± 27 |

*The result of APG on Netflix is absent due to an out-of-memory issue. The standard deviations of RMSE are not reported since they are not significant. The average ranks estimated by APG, LMaFit-A and RP on Movie-10M are 100, 77 and 10, respectively. The average ranks estimated by LMaFit-A and RP on Netflix are 81 and 12, respectively.

Due to the absence of source code, we record the published results of GECO (Shwartz et al., 2011), Jaggi's method (Jaggi & Sulovsky, 2010) and Laue's method (Laue, 2012) on the Movie-10M dataset. The experimental settings of these methods reported in the literature are similar to ours, so the comparison is fair. In addition, we list the training time, the speedup, the RMSE and the CPU details in Table 4 for reference.
Table 4. Performance comparison on the Movie-10M dataset.

| Method | Time (seconds) | Speed-up | RMSE | CPU (GHz) |
|---|---|---|---|---|
| GECO | 784,941 | 9,000x | 0.821 | 2.5 |
| Laue | 2,663 | 30x | 0.815 | 2.5 |
| Jaggi | 3,120 | 38x | 0.862 | 2.4 |
| RP | 81 | - | 0.817 | 2.8 |

From Table 4, we observe that on the Movie-10M dataset RP obtains comparable or better RMSE than the baseline methods while achieving large speedups on similar CPUs. In particular, RP is orders of magnitude faster than all the other methods. From these comparisons, we conclude that RP trains much faster than the compared methods.

6. Conclusion

We propose the Riemannian Pursuit (RP) method for tackling big MR problems. In contrast to nuclear-norm based methods, RP only needs to compute a rank-$\rho$ truncated SVD with a very small $\rho$ per iteration, as opposed to APG, which may require hundreds of high-dimensional SVDs. By exploiting the Riemannian geometry of the fixed-rank manifold, RP uses a more efficient master solver. Moreover, RP increases the rank of the matrix by $\rho > 1$ per iteration, so it exhibits good scalability for big MR problems with large ranks. Finally, RP automatically detects the rank with appropriate stopping conditions and performs well on ill-conditioned problems. Extensive experimental results show that RP achieves superb scalability and maintains similar or better MR performance compared with state-of-the-art methods.

Acknowledgments

This research was in part supported by the Singapore NTU A*SERC under Grant 112 172 0013, and by the Australian Research Council Future Fellowship FT130100746.

References

Absil, P.-A., Mahony, R., and Sepulchre, R. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.

Boumal, N. and Absil, P.-A. RTRMC: A Riemannian trust-region method for low-rank matrix completion. In NIPS, 2012.

Cai, J., Candès, E. J., and Shen, Z. A singular value thresholding algorithm for matrix completion. SIAM J. on Optim., 20(4):1956-1982, 2010.

Candès, E. J. and Plan, Y. Tight oracle bounds for low-rank matrix recovery from a minimal number of random measurements. IEEE Trans. on Inform. Theory, 57(4):2342-2359, 2010a.

Candès, E. J. and Plan, Y. Matrix completion with noise. Proceedings of the IEEE, 98(6):925-936, 2010b.

Candès, E. J. and Recht, B. Exact matrix completion via convex optimization. Found. Comput. Math., 9:717-772, 2009.

Dieci, L. and Eirola, T. On smooth decompositions of matrices. SIAM J. Numer. Anal., 20(3):800-819, 1999.

Donoho, D. L., Tsaig, Y., Drori, I., and Starck, J. L. Sparse solution of underdetermined systems of linear equations by stagewise orthogonal matching pursuit. IEEE Trans. Info. Theory, 58(2):1094-1121, 2012.

Fazel, M. Matrix Rank Minimization with Applications. PhD thesis, Stanford University, 2002.

Golub, G. H. and Van Loan, C. F. Matrix Computations. JHU Press, 3rd edition, 1996.

Halko, N., Martinsson, P. G., and Tropp, J. A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, 2011.

Hazan, E. Sparse approximate solutions to semidefinite programs. In LATIN, pp. 306-316, 2008.

Herlocker, J. L., Konstan, J. A., Borchers, A., and Riedl, J. An algorithmic framework for performing collaborative filtering. In SIGIR, 1999.

Jaggi, M. and Sulovsky, M. A simple algorithm for nuclear norm regularized problems. In ICML, 2010.

KDDCup. ACM SIGKDD and Netflix. In Proceedings of KDD Cup and Workshop, 2007.
Keshavan, R. H., Montanari, A., and Oh, S. Matrix completion from a few entries. IEEE Trans. on Info. Theory, 56:2980-2998, 2010a.

Keshavan, R. H., Montanari, A., and Oh, S. Matrix completion from noisy entries. JMLR, 99:2057-2078, 2010b.

Laue, S. A hybrid algorithm for convex semidefinite optimization. In ICML, 2012.

Lee, K. and Bresler, Y. ADMiRA: Atomic decomposition for minimum rank approximation. IEEE Trans. on Inform. Theory, 56(9):4402-4416, 2010.

Lin, Z., Chen, M., and Ma, Y. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. Technical report, UIUC, 2010.

Lin, Z., Liu, R., and Su, Z. Linearized alternating direction method with adaptive penalty for low-rank representation. arXiv preprint arXiv:1109.0367, 2011.

Meka, R., Jain, P., and Dhillon, I. S. Guaranteed rank minimization via singular value projection. Technical report, 2009a.

Meka, R., Jain, P., and Dhillon, I. S. Guaranteed rank minimization via singular value projection. In NIPS, 2009b.

Meyer, G., Bonnabel, S., and Sepulchre, R. Linear regression under fixed-rank constraints: A Riemannian approach. In ICML, 2011.

Mishra, B., Apuroop, K. A., and Sepulchre, R. A Riemannian geometry for low-rank matrix completion. Technical report, 2012.

Mishra, B., Meyer, G., Bach, F., and Sepulchre, R. Low-rank optimization with trace norm penalty. SIAM J. Optim., 23(4):2124-2149, 2013.

Mitra, K., Sheorey, S., and Chellappa, R. Large-scale matrix factorization with missing data under additional constraints. In NIPS, 2010.

Negahban, S. and Wainwright, M. J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. JMLR, 13:1665-1697, 2012.

Ngo, T. T. and Saad, Y. Scaled gradients on Grassmann manifolds for matrix completion. In NIPS, 2012.

Larsen, R. M. PROPACK: Software for large and sparse SVD calculations. http://soi.stanford.edu/~rmunk/PROPACK, 2004.

Recht, B. A simpler approach to matrix completion. JMLR, pp. 3413-3430, 2011.

Recht, B., Fazel, M., and Parrilo, P. A. Guaranteed minimum rank solutions to linear matrix equations via nuclear norm minimization. SIAM Rev., 52(3), 2010.

Ring, W. and Wirth, B. Optimization methods on Riemannian manifolds and their application to shape space. SIAM J. Optim., 22(2):596-627, 2012.

Sato, H. and Iwai, T. A new, globally convergent Riemannian conjugate gradient method. Optimization: A Journal of Mathematical Programming and Operations Research, (ahead-of-print):1-21, 2013.

Selvan, S. E., Amato, U., Gallivan, K. A., Qi, Ch., Carfora, M. F., Larobina, M., and Alfano, B. Descent algorithms on oblique manifold for source-adaptive ICA contrast. IEEE Trans. Neural Netw. Learning Syst., 23(12):1930-1947, 2012.

Shalit, U., Weinshall, D., and Chechik, G. Online learning in the embedded manifold of low-rank matrices. JMLR, 13:429-458, 2012.

Shwartz, S. S., Gonen, A., and Shamir, O. Large-scale convex minimization with a low-rank constraint. In ICML, 2011.

Toh, K.-C. and Yun, S. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pac. J. Optim., 6:615-640, 2010.

Vandereycken, B. Low-rank matrix completion by Riemannian optimization. SIAM J. Optim., 23(2):1214-1236, 2013.

Waters, A. E., Sankaranarayanan, A. C., and Baraniuk, R. G. SpaRCS: Recovering low-rank and sparse matrices from compressive measurements. In NIPS, 2011.

Wen, Z., Yin, W., and Zhang, Y. Solving a low-rank factorization model for matrix completion by a non-linear successive over-relaxation algorithm. Math. Program. Comput., 4(4):333-361, 2012.
Yang, J. and Yuan, X. Linearized augmented Lagrangian and alternating direction methods for nuclear norm minimization. Mathematics of Computation, 82(281):301-329, 2013.