# Rank-One Matrix Pursuit for Matrix Completion

Zheng Wang ZHENGWANG@ASU.EDU
Ming-Jun Lai MJLAI@MATH.UGA.EDU
Zhaosong Lu ZHAOSONG@SFU.CA
Wei Fan DAVID.FANWEI@HUAWEI.COM
Hasan Davulcu HASANDAVULCU@ASU.EDU
Jieping Ye JIEPING.YE@ASU.EDU

The Biodesign Institute, Arizona State University, Tempe, AZ 85287, USA
Department of Mathematics, University of Georgia, Athens, GA 30602, USA
Department of Mathematics, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada
Huawei Noah's Ark Lab, Hong Kong Science Park, Shatin, Hong Kong
School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ 85281, USA

Low rank matrix completion has been applied successfully in a wide range of machine learning applications, such as collaborative filtering, image inpainting and microarray data imputation. However, many existing algorithms are not scalable to large-scale problems, as they involve computing a singular value decomposition. In this paper, we present an efficient and scalable algorithm for matrix completion. The key idea is to extend the well-known orthogonal matching pursuit from the vector case to the matrix case. In each iteration, we pursue a rank-one matrix basis generated by the top singular vector pair of the current approximation residual and update the weights for all rank-one matrices obtained up to the current iteration. We further propose a novel weight updating rule to reduce the time and storage complexity, making the proposed algorithm scalable to large matrices. We establish the linear convergence of the proposed algorithm. The fast convergence is achieved due to the proposed construction of matrix bases and the estimation of the weights. We empirically evaluate the proposed algorithm on many real-world large-scale datasets. Results show that our algorithm is much more efficient than state-of-the-art matrix completion algorithms while achieving similar or better prediction performance.

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

1. Introduction

Low rank matrix learning has attracted significant attention in the machine learning community due to its wide range of applications, such as collaborative filtering (Koren et al., 2009; Srebro et al., 2005), compressed sensing (Candès & Recht, 2009), multi-class learning and multi-task learning (Argyriou et al., 2008; Negahban & Wainwright, 2010; Dudík et al., 2012). In this paper, we consider the general form of low rank matrix completion: given a partially observed real-valued matrix $Y \in \Re^{n \times m}$, the low rank matrix completion problem is to find a matrix $X \in \Re^{n \times m}$ with minimum rank such that $P_\Omega(X) = P_\Omega(Y)$, where $\Omega$ includes the index pairs $(i, j)$ of all observed entries, and $P_\Omega$ is the orthogonal projector onto the span of matrices vanishing outside of $\Omega$. As it is intractable to minimize the matrix rank exactly in the general case, the trace norm or nuclear norm is widely used as a convex relaxation of the matrix rank (Candès & Recht, 2009). It is defined by the Schatten p-norm with $p = 1$. For a matrix $X$ with rank $r$, its Schatten p-norm is defined by $(\sum_{i=1}^{r} \sigma_i^p)^{1/p}$, where $\{\sigma_i\}$ are the singular values of $X$. Thus, the trace norm of $X$ is the $\ell_1$ norm of the matrix spectrum, $\|X\|_* = \sum_{i=1}^{r} |\sigma_i|$.
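To make the two operators above concrete, the following is a minimal NumPy sketch (ours, not the authors' code; the toy matrix and observation mask are made up purely for illustration) of the projection $P_\Omega$ and of the trace norm as the $\ell_1$ norm of the singular value spectrum.

```python
import numpy as np

def proj_omega(Y, mask):
    """P_Omega(Y): keep the observed entries (mask == True), zero elsewhere."""
    return np.where(mask, Y, 0.0)

def trace_norm(X):
    """Nuclear (trace) norm: the l1 norm of the singular value spectrum."""
    return np.linalg.svd(X, compute_uv=False).sum()

# Toy example: a rank-2 matrix observed on roughly 50% of its entries.
rng = np.random.default_rng(0)
Y = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 6))
mask = rng.random(Y.shape) < 0.5
print(proj_omega(Y, mask))
print(trace_norm(Y))
```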
Solving the standard low rank or trace norm problem is computationally expensive for large matrices, as it involves computing the singular value decomposition (SVD). How to solve these problems efficiently and accurately has attracted much attention in recent years (Avron et al., 2012; Srebro et al., 2005; Cai et al., 2010; Balzano et al., 2010; Keshavan & Oh, 2009; Toh & Yun, 2010; Ji & Ye, 2009; Ma et al., 2011; Mazumder et al., 2010; Mishra et al., 2011; Wen et al., 2010; Lee & Bresler, 2010; Recht & Ré, 2013). Most of these methods still involve computing an SVD or truncated SVD iteratively, which is not scalable to large-scale problems (Cai et al., 2010; Keshavan & Oh, 2009; Toh & Yun, 2010; Ma et al., 2011; Mazumder et al., 2010; Lee & Bresler, 2010). Several methods approximate the trace norm using its variational characterizations (Mishra et al., 2011; Srebro et al., 2005; Wen et al., 2010; Recht & Ré, 2013) and proceed by alternating optimization. A linear convergence rate has been established theoretically for a properly designed alternating optimization algorithm under appropriate initialization (Jain et al., 2013). However, its computational complexity depends on the square of the rank of the estimated matrix. Thus in practical problems, especially for large matrices, it requires the rank of the estimated matrix to be very small, which sacrifices estimation accuracy.

Recently, the coordinate gradient descent method has been shown to be efficient for solving sparse learning problems in the vector case (Friedman et al., 2010; Shalev-Shwartz & Tewari, 2009). The key idea is to solve a very simple one-dimensional problem (for one coordinate) in each iteration. A natural question is whether and how such a method can be applied to solve the matrix completion problem. Some progress has been made recently along this direction (Jaggi & Sulovský, 2010; Tewari et al., 2011; Shalev-Shwartz et al., 2011; Dudík et al., 2012; Zhang et al., 2012). These algorithms proceed in two main steps in each iteration. The first step computes the top singular vector pair, and the second step refines the weights of the rank-one matrices formed by all top singular vector pairs obtained up to the current iteration. The main differences among these algorithms lie in how they refine the weights. Jaggi's algorithm (JS) (Jaggi & Sulovský, 2010) directly applies Hazan's algorithm, which adapts the Frank-Wolfe algorithm to the matrix case (Hazan, 2008). It updates the weights with a small step size and does not consider further refinement. It does not use all the information available in each step, which leads to a slow convergence rate. Similar to JS, Tewari et al. (2011) use a small update step size for a general structurally constrained problem. A more efficient Frank-Wolfe type algorithm fully refines the weights, which is claimed to be equivalent to orthogonal matching pursuit (OMP) for a wide range of $\ell_1$-ball constrained convex optimization problems (Jaggi, 2013). The greedy efficient component optimization (GECO) algorithm (Shalev-Shwartz et al., 2011) applies a similar approach, optimizing the weights by solving another time-consuming optimization problem. It empirically reduces the number of iterations, but without theoretical guarantees, and the sophisticated weight refinement leads to a higher total computational cost. The lifted coordinate gradient descent algorithm (Lifted) (Dudík et al., 2012) updates the rank-one matrix basis with a constant weight in each iteration and runs a lasso-type algorithm (Tibshirani, 1994) to fully correct the weights.
The weights for the basis update are difficult to tune: a large value leads to divergence, while a small value makes the algorithm slow (Zhang et al., 2012). The matrix norm boosting approach (Boost) (Zhang et al., 2012) learns the update weights and designs a local refinement step via a non-convex optimization problem solved by alternating optimization. It has a sub-linear convergence rate.

In this paper, we present a simple and efficient algorithm to solve the low rank matrix completion problem. The key idea is to extend the orthogonal matching pursuit procedure (Pati et al., 1993) from the vector case to the matrix case. In each iteration, a rank-one basis matrix is generated by the top left and right singular vectors of the current approximation residual. In the standard algorithm, we fully update the weights for all rank-one matrices in the current basis set at the end of each iteration by performing an orthogonal projection of the observation matrix onto their spanning subspace. The most time-consuming step of the proposed algorithm is calculating the top singular vector pair of a sparse matrix, which costs $O(|\Omega|)$ operations in each iteration. An appealing feature of the proposed algorithm is that it has a linear convergence rate. This is quite different from traditional orthogonal matching pursuit or weak orthogonal greedy algorithms, whose convergence rate for sparse vector recovery is sub-linear, as shown in (Liu & Temlyakov, 2012); see also (Tropp, 2004) for an extensive study of various greedy algorithms. With this rate of convergence, we only need $O(\log(1/\epsilon))$ iterations to achieve an $\epsilon$-accuracy solution.

One drawback of the standard algorithm is that it needs to store all rank-one matrices in the current basis set for the full weight update, which amounts to $r|\Omega|$ elements in the r-th iteration. This makes the storage complexity of the algorithm depend on the number of iterations, which restricts the approximation rank, especially for large matrices. To tackle this problem, we propose an economic weight updating rule for this algorithm. In this economic algorithm, we only track two matrices in each iteration: the current estimated matrix and the pursued rank-one matrix. When restricted to the observations in $\Omega$, each has $|\Omega|$ nonzero elements. Thus the storage requirement, i.e., $2|\Omega|$, remains the same across iterations, matching the greedy algorithms (Jaggi & Sulovský, 2010; Tewari et al., 2011). Interestingly, we show that with this economic updating rule we still retain the linear convergence rate. To the best of our knowledge, our proposed algorithms are the fastest among all related methods. We verify the efficiency of our algorithms empirically on large-scale matrix completion problems.

The main contributions of our paper are:
- We propose a computationally efficient and scalable algorithm for matrix completion, which extends orthogonal matching pursuit from the vector case to the matrix case.
- We theoretically prove the linear convergence rate of our algorithm. As a result, we only need $O(\log(1/\epsilon))$ steps to obtain an $\epsilon$-accuracy solution, and in each step we only need to compute the top singular vector pair, which can be done efficiently.
- We further reduce the storage complexity of our algorithm based on an economic weight updating rule while retaining the linear convergence rate.
  This algorithm has constant storage complexity, which is independent of the approximation rank and is more practical for large-scale problems.
- Our proposed algorithm is free of tuning parameters, except for the accuracy of the solution, and it is guaranteed to converge, i.e., there is no risk of divergence.

Notations: Let $Y = (y_1, \ldots, y_m) \in \Re^{n \times m}$ be an $n \times m$ real matrix, and let $\Omega \subseteq \{1, \ldots, n\} \times \{1, \ldots, m\}$ denote the indices of the observed entries of $Y$. $P_\Omega$ is the projection operator onto the space spanned by the matrices vanishing outside of $\Omega$, so that the $(i, j)$-th component of $P_\Omega(Y)$ equals $Y_{i,j}$ for $(i, j) \in \Omega$ and zero otherwise. The Frobenius norm of $Y$ is defined as $\|Y\|_F = \sqrt{\sum_{i,j} Y_{i,j}^2}$. Let $\mathrm{vec}(Y) = (y_1^T, \ldots, y_m^T)^T$ denote the vector reshaped from the matrix $Y$ by concatenating all its column vectors, and let $y = \mathrm{vec}(P_\Omega(Y))$ be the vector obtained by concatenating all observed entries of $Y$. The inner product of two matrices $X$ and $Y$ is defined as $\langle X, Y \rangle = \langle \mathrm{vec}(X), \mathrm{vec}(Y) \rangle$. Given a matrix $A \in \Re^{n \times m}$, we denote $P_\Omega(A)$ by $A_\Omega$. For any two matrices $A, B \in \Re^{n \times m}$, we define $\langle A, B \rangle_\Omega = \langle A_\Omega, B_\Omega \rangle$, $\|A\|_\Omega = \sqrt{\langle A, A \rangle_\Omega}$ and $\|A\| = \sqrt{\langle A, A \rangle}$.

2. Rank-One Matrix Pursuit

It is well-known that any matrix $X \in \Re^{n \times m}$ can be written as a linear combination of rank-one matrices, that is,
$$X = M(\theta) = \sum_{i \in I} \theta_i M_i, \qquad (1)$$
where $\{M_i : i \in I\}$ is the set of all $n \times m$ rank-one matrices with unit Frobenius norm. Clearly, $\theta$ is an infinite dimensional vector. Such a representation can be obtained from the standard SVD of $X$. The original low rank matrix approximation problem aims to minimize the zero-norm of $\theta$ subject to the observation constraint:
$$\min_\theta \|\theta\|_0 \quad \text{s.t.} \quad P_\Omega(M(\theta)) = P_\Omega(Y), \qquad (2)$$
where $\|\theta\|_0$ denotes the number of nonzero elements of $\theta$. If we reformulate the problem as
$$\min_\theta \|P_\Omega(M(\theta)) - P_\Omega(Y)\|_F^2 \quad \text{s.t.} \quad \|\theta\|_0 \le r, \qquad (3)$$
we can solve it by an orthogonal matching pursuit type greedy algorithm using rank-one matrices as the basis. If the dictionary $\{M_i : i \in I\}$ is known and finite, this is equivalent to the compressed sensing problem. However, in our formulation, the size of the dictionary is infinite and the bases are to be constructed during the basis pursuit process. In particular, we are to find a suitable subset of over-complete rank-one matrix coordinates and learn the weight for each coordinate. This is achieved by executing two steps alternately: one constructs the basis, and the other learns the weights of the basis.

Suppose that after the (k-1)-th iteration, the rank-one basis matrices $M_1, \ldots, M_{k-1}$ and their current weights $\theta^{k-1}$ have already been computed. In the k-th iteration, we pursue a new rank-one basis matrix $M_k$ with unit Frobenius norm that is most correlated with the current observed regression residual $R_k = P_\Omega(Y) - X_{k-1}$, where $X_{k-1} = (M(\theta^{k-1}))_\Omega = \sum_{i=1}^{k-1} \theta_i^{k-1} (M_i)_\Omega$. Therefore, $M_k$ can be chosen as an optimal solution of the following problem:
$$\max_M \{\langle M, R_k \rangle : \mathrm{rank}(M) = 1, \|M\|_F = 1\}. \qquad (4)$$
Notice that each rank-one matrix $M$ with unit Frobenius norm can be written as the product of two unit vectors, namely, $M = uv^T$ for some $u \in \Re^n$ and $v \in \Re^m$ with $\|u\| = \|v\| = 1$. We then see that problem (4) can be equivalently reformulated as
$$\max_{u, v} \{u^T R_k v : \|u\| = \|v\| = 1\}. \qquad (5)$$
Clearly, the optimal solution $(u_*, v_*)$ of problem (5) is a pair of top left and right singular vectors of $R_k$. It can be computed efficiently by the power method (Jaggi & Sulovský, 2010; Dudík et al., 2012). The new rank-one basis matrix $M_k$ is then readily available by setting $M_k = u_* v_*^T$.
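As an illustration of how the top singular vector pair in (5) can be obtained without a full SVD, the sketch below runs a basic power iteration on a sparse residual. It is a stand-in for the power method cited above, not the authors' implementation: the function name, fixed iteration count and random start are our own assumptions, and a truncated SVD routine such as scipy.sparse.linalg.svds could be used instead.

```python
import numpy as np
import scipy.sparse as sp

def top_singular_pair(R, n_iter=50, seed=0):
    """Approximate the top left/right singular vectors of a sparse residual R
    by power iteration; each pass costs O(|Omega|) work on the nonzeros of R."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(R.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = R @ v
        u /= np.linalg.norm(u)
        v = R.T @ u
        v /= np.linalg.norm(v)
    return u, v  # M_k = outer(u, v) is the new rank-one basis matrix

# Example on a sparse residual with |Omega| nonzero entries.
R = sp.random(1000, 800, density=0.01, random_state=0, format='csr')
u, v = top_singular_pair(R)
print(u @ (R @ v))  # approximates the top singular value sigma_max(R)
```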
After finding the new rank-one basis matrix $M_k$, we update the weights $\theta^k$ for all currently available basis matrices $\{M_1, \ldots, M_k\}$ by solving the following least squares regression problem:
$$\min_{\theta \in \Re^k} \left\|\sum_{i=1}^{k} \theta_i M_i - Y\right\|_\Omega^2. \qquad (6)$$
By reshaping the matrices $(Y)_\Omega$ and $(M_i)_\Omega$ into vectors $y$ and $\bar{m}_i$, we can easily see that the optimal solution $\theta^k$ of (6) is given by
$$\theta^k = (\bar{M}_k^T \bar{M}_k)^{-1} \bar{M}_k^T y, \qquad (7)$$
where $\bar{M}_k = [\bar{m}_1, \ldots, \bar{m}_k]$ is the matrix formed by all reshaped basis vectors. The row size of $\bar{M}_k$ is the total number of observed entries. It is computationally expensive to calculate the matrix multiplication directly; an incremental update rule can be applied to solve this step efficiently (Wang et al., 2014).

We run the above two steps iteratively until some desired stopping condition is satisfied. We can terminate the method based on the rank of the estimated matrix or the approximation residual. In particular, one can choose a preferred rank of the approximate solution matrix. Alternatively, one can stop the method once the residual $\|R_k\|$ is less than a tolerance parameter $\varepsilon$. The main steps of Rank-One Matrix Pursuit (R1MP) are given in Algorithm 1.

Remark. In our algorithm, we adapt orthogonal matching pursuit to the observed part of the matrix. This is similar to the GECO algorithm. However, GECO constructs the estimated matrix by projecting the observation matrix onto a much larger subspace, which is a product of two subspaces spanned by all left singular vectors and all right singular vectors obtained up to the current iteration, so it has a much higher computational complexity. Lee et al. (Lee & Bresler, 2010) recently proposed the ADMiRA algorithm, which is also a greedy approach. In each step it first chooses 2r components by a top-2r truncated SVD and then uses another top-r truncated SVD to obtain a rank-r matrix. Thus, the ADMiRA algorithm is computationally more expensive than the proposed algorithm. The main difference between the proposed algorithm and ADMiRA is somewhat similar to the difference between OMP (Pati et al., 1993) for learning sparse vectors and CoSaMP (Needell & Tropp, 2010). In addition, the performance guarantees (including the recovery guarantee and convergence property) of ADMiRA rely on strong assumptions, i.e., the matrix involved in the loss function satisfies a rank-restricted isometry property, which is not satisfied in matrix completion (Lee & Bresler, 2010). Lee et al. sketch a similar idea to the standard version of our algorithm in their Remark 2.3 without any further analysis, and their theoretical results cannot be easily extended to our algorithm. Another contribution of our work is that we further propose an economic version of the algorithm and analyze its convergence property.

3. Convergence Analysis

In this section, we show that our proposed rank-one matrix pursuit algorithm achieves a linear convergence rate. This main result is given in the following theorem.

Theorem 3.1. The rank-one matrix pursuit algorithm satisfies $\|R_k\| \le \gamma^{k-1} \|Y\|_\Omega$ for all $k \ge 1$, where $\gamma$ is a constant in $[0, 1)$.

Algorithm 1 Rank-One Matrix Pursuit (R1MP)
Input: $Y_\Omega$ and stopping criterion.
Initialize: Set $X_0 = 0$ and $k = 1$.
repeat
  Step 1: Find a pair of top left and right singular vectors $(u_k, v_k)$ of the observed residual matrix $R_k = Y_\Omega - X_{k-1}$ and set $M_k = u_k v_k^T$.
  Step 2: Compute the weights $\theta^k$ using the closed form least squares solution $\theta^k = (\bar{M}_k^T \bar{M}_k)^{-1} \bar{M}_k^T y$.
  Step 3: Set $X_k = \sum_{i=1}^{k} \theta_i^k (M_i)_\Omega$ and $k \leftarrow k + 1$.
until stopping criterion is satisfied
Output: Constructed matrix $\hat{Y} = \sum_{i=1}^{k} \theta_i^k M_i$.
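The sketch below mirrors the structure of Algorithm 1 in Python/NumPy for small dense problems. It is only an illustration under our own assumptions, not the authors' implementation (which is in MATLAB and uses the incremental weight update and a power method): here the top singular vector pair comes from a full SVD of the residual and the weights are refit by a plain least squares solve.

```python
import numpy as np

def r1mp(Y, mask, rank):
    """Rank-One Matrix Pursuit (Algorithm 1), stopped after `rank` iterations."""
    Y_obs = np.where(mask, Y, 0.0)               # Y_Omega
    X = np.zeros_like(Y_obs)                      # X_{k-1}, supported on Omega
    bases, y = [], Y_obs[mask]                    # reshaped observed entries y
    for _ in range(rank):
        R = Y_obs - X                             # residual R_k
        # Step 1: top singular vector pair of the residual.
        U, _, Vt = np.linalg.svd(R, full_matrices=False)
        bases.append(np.outer(U[:, 0], Vt[0, :]))  # rank-one basis M_k
        # Step 2: fully refine all weights by least squares restricted to Omega.
        A = np.column_stack([M[mask] for M in bases])
        theta, *_ = np.linalg.lstsq(A, y, rcond=None)
        # Step 3: rebuild the estimate on Omega from all bases and current weights.
        X = np.where(mask, sum(t * M for t, M in zip(theta, bases)), 0.0)
    return sum(t * M for t, M in zip(theta, bases))   # Y_hat = sum_i theta_i M_i

# Usage: recover a rank-3 matrix from half of its entries.
rng = np.random.default_rng(1)
Y = rng.standard_normal((60, 3)) @ rng.standard_normal((3, 40))
mask = rng.random(Y.shape) < 0.5
Y_hat = r1mp(Y, mask, rank=10)
print(np.abs(Y_hat - Y)[~mask].mean())            # error on the unobserved entries
```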
Before proving Theorem 3.1, we establish some useful preparatory properties of Algorithm 1. The first property says that $R_{k+1}$ is perpendicular to all previously generated $M_i$ for $i = 1, \ldots, k$.

Property 3.2. $\langle R_{k+1}, M_i \rangle = 0$ for $i = 1, \ldots, k$.

Proof. Recall that $\theta^k$ is the optimal solution of problem (6). By the first-order optimality condition, one has $\langle Y - \sum_{j=1}^{k} \theta_j^k M_j, M_i \rangle_\Omega = 0$ for $i = 1, \ldots, k$, which together with $R_{k+1} = Y_\Omega - X_k$ and $X_k = \sum_{j=1}^{k} \theta_j^k (M_j)_\Omega$ implies that $\langle R_{k+1}, M_i \rangle = 0$ for $i = 1, \ldots, k$.

The following property shows that as the number of rank-one basis matrices $M_i$ increases during the learning process, the residual $\|R_k\|$ does not increase.

Property 3.3. $\|R_{k+1}\| \le \|R_k\|$ for all $k \ge 1$.

Proof. We observe that for all $k \ge 1$,
$$\|R_{k+1}\|^2 = \min_{\theta \in \Re^k} \left\|Y - \sum_{i=1}^{k} \theta_i M_i\right\|_\Omega^2 \le \min_{\theta \in \Re^{k-1}} \left\|Y - \sum_{i=1}^{k-1} \theta_i M_i\right\|_\Omega^2 = \|R_k\|^2,$$
and hence the conclusion holds.

We next establish that $\{(M_i)_\Omega\}_{i=1}^{k}$ is linearly independent unless $R_k = 0$. It follows that formula (7) is well-defined and hence $\theta^k$ is uniquely defined before the algorithm stops.

Property 3.4. Suppose that $R_k \ne 0$ for some $k \ge 1$. Then $\bar{M}_i$ has full column rank for all $i \le k$.

Proof. Using Property 3.3 and the assumption $R_k \ne 0$ for some $k \ge 1$, we see that $R_i \ne 0$ for all $i \le k$. We now prove the statement by induction on $i$. Since $R_1 \ne 0$, we clearly have $\bar{M}_1 \ne 0$, so the conclusion holds for $i = 1$. We now assume that it holds for $i - 1 < k$ and show that it also holds for $i \le k$. By the induction hypothesis, $\bar{M}_{i-1}$ has full column rank. Suppose for contradiction that $\bar{M}_i$ does not have full column rank. Then there exists $\alpha \in \Re^{i-1}$ such that $(M_i)_\Omega = \sum_{j=1}^{i-1} \alpha_j (M_j)_\Omega$, which together with Property 3.2 implies that $\langle R_i, M_i \rangle = 0$. It follows that $\sigma_{\max}(R_i) = u_i^T R_i v_i = \langle R_i, M_i \rangle = 0$, and hence $R_i = 0$, which contradicts the fact that $R_j \ne 0$ for all $j \le k$. Therefore, $\bar{M}_i$ has full column rank and the conclusion holds.

We next build a relationship between two consecutive residuals $R_{k+1}$ and $R_k$. For convenience, define $\theta_k^{k-1} = 0$ and let $\theta^k = \theta^{k-1} + \eta^k$. In view of (6), one can observe that
$$\eta^k = \arg\min_{\eta} \left\|\sum_{i=1}^{k} \eta_i M_i - R_k\right\|_\Omega^2, \qquad (8)$$
and define
$$L_k = \sum_{i=1}^{k} \eta_i^k (M_i)_\Omega. \qquad (9)$$
By the definition of $X_k$, one can also observe that $X_k = X_{k-1} + L_k$ and $R_{k+1} = R_k - L_k$.

Property 3.5. $\|R_{k+1}\|^2 = \|R_k\|^2 - \|L_k\|^2$ and $\|L_k\|^2 \ge \langle M_k, R_k \rangle^2$, where $L_k$ is defined in (9).

Proof. Since $L_k = \sum_{i \le k} \eta_i^k (M_i)_\Omega$, it follows from Property 3.2 that $\langle R_{k+1}, L_k \rangle = 0$. Thus,
$$\|R_{k+1}\|^2 = \|R_k - L_k\|^2 = \|R_k\|^2 - 2\langle R_k, L_k \rangle + \|L_k\|^2 = \|R_k\|^2 - 2\langle R_{k+1} + L_k, L_k \rangle + \|L_k\|^2 = \|R_k\|^2 - 2\langle L_k, L_k \rangle + \|L_k\|^2 = \|R_k\|^2 - \|L_k\|^2.$$
We next bound $\|L_k\|^2$ from below. If $R_k = 0$, then $\|L_k\|^2 \ge \langle M_k, R_k \rangle^2$ clearly holds. We now suppose throughout the remaining proof that $R_k \ne 0$. It then follows from Property 3.4 that $\bar{M}_k$ has full column rank. Using this fact and (8), we have $\eta^k = (\bar{M}_k^T \bar{M}_k)^{-1} \bar{M}_k^T r_k$, where $r_k$ is the reshaped residual vector of $R_k$. Invoking that $L_k = \sum_{i \le k} \eta_i^k (M_i)_\Omega$, we then obtain
$$\|L_k\|^2 = r_k^T \bar{M}_k (\bar{M}_k^T \bar{M}_k)^{-1} \bar{M}_k^T r_k. \qquad (10)$$
Let $\bar{M}_k = QU$ be the QR factorization of $\bar{M}_k$, where $Q^T Q = I$ and $U$ is a $k \times k$ nonsingular upper triangular matrix. One can observe that $(\bar{M}_k)_k = \bar{m}_k$, where $(\bar{M}_k)_k$ denotes the k-th column of $\bar{M}_k$ and $\bar{m}_k$ is the reshaped vector of $(M_k)_\Omega$. Recall that $\|M_k\|_F = \|u_k v_k^T\|_F = 1$. Hence, $\|(\bar{M}_k)_k\| \le 1$. Due to $Q^T Q = I$, $\bar{M}_k = QU$ and the definition of $U$, we have $0 < |U_{kk}| \le \|U_k\| = \|(\bar{M}_k)_k\| \le 1$, where $U_k$ denotes the k-th column of $U$. In addition, by Property 3.2, we have
$$\bar{M}_k^T r_k = [0, \ldots, 0, \langle M_k, R_k \rangle]^T. \qquad (11)$$
Substituting $\bar{M}_k = QU$ into (10) and using $Q^T Q = I$ and (11), we obtain
$$\|L_k\|^2 = r_k^T \bar{M}_k (U^T U)^{-1} \bar{M}_k^T r_k = [0, \ldots, 0, \langle M_k, R_k \rangle]\, U^{-1} U^{-T} [0, \ldots, 0, \langle M_k, R_k \rangle]^T = \langle M_k, R_k \rangle^2 / U_{kk}^2 \ge \langle M_k, R_k \rangle^2,$$
where the last equality follows since $U$ is upper triangular, and the last inequality is due to $|U_{kk}| \le 1$.

We are now ready to prove Theorem 3.1.

Proof. Using the definition of $M_k$, we have $\langle M_k, R_k \rangle = \langle u_k v_k^T, R_k \rangle = \sigma_{\max}(R_k)$, where $\sigma_{\max}(R_k)$ is the maximum singular value of the residual matrix $R_k$. Using this relation and Property 3.5, we obtain
$$\|R_{k+1}\|^2 = \|R_k\|^2 - \|L_k\|^2 \le \|R_k\|^2 - \langle M_k, R_k \rangle^2 = \left(1 - \frac{\sigma_{\max}^2(R_k)}{\|R_k\|^2}\right) \|R_k\|^2.$$
In view of this relation and the fact that $\|R_1\| = \|Y\|_\Omega$, and since in each step we have $0 < \frac{1}{\mathrm{rank}(R_i)} \le \frac{\sigma_{\max}^2(R_i)}{\|R_i\|^2} \le 1$, there must exist a constant $0 \le \gamma < 1$ satisfying $\|R_k\| \le \gamma^{k-1} \|Y\|_\Omega$. This completes the proof.

Remark. In practice, the value of $\|R_i\|^2 / \sigma_{\max}^2(R_i)$, which controls the convergence speed, is much less than $\min(m, n)$. We verify this empirically in the experiments.

Remark. If $\Omega$ is the entire index set $\{(i, j) : i = 1, \ldots, n,\ j = 1, \ldots, m\}$, our rank-one matrix pursuit algorithm is equivalent to the standard SVD computed by the power method.

Remark. This convergence result is obtained for the optimization residual in the low rank matrix completion problem. We further extend our algorithm to solve the more general matrix sensing problem and analyze the corresponding statistical convergence behavior under mild conditions, such as the rank-restricted isometry property (Lee & Bresler, 2010; Jain et al., 2013). Details are provided in the longer version of this paper (Wang et al., 2014).

4. Economic Rank-One Matrix Pursuit

The proposed R1MP algorithm has to track all pursued bases and save them in memory. It demands $O(r|\Omega|)$ storage to obtain a rank-r estimated matrix. For large-scale problems, such a storage requirement is not negligible and restricts the rank of the matrix to be estimated. To adapt our algorithm to large-scale problems with a large approximation rank, we simplify the orthogonal projection step by tracking only the estimated matrix $X_{k-1}$ and the rank-one update matrix $M_k$. In this case, we only need to estimate the weights for these two matrices in Step 2 of our algorithm by solving the following least squares problem:
$$\alpha^k = \arg\min_{\alpha = \{\alpha_1, \alpha_2\}} \|\alpha_1 X_{k-1} + \alpha_2 M_k - Y\|_\Omega^2. \qquad (12)$$
This still corrects all weights of the existing bases, though the correction is sub-optimal. If we write the estimated matrix as a linear combination of the bases, we have $X_k = \sum_{i=1}^{k} \theta_i^k (M_i)_\Omega$ with $\theta_k^k = \alpha_2^k$ and $\theta_i^k = \theta_i^{k-1} \alpha_1^k$ for $i < k$. The detailed procedure of this simplified method is given in Algorithm 2.

Algorithm 2 Economic Rank-One Matrix Pursuit (ER1MP)
Input: $Y_\Omega$ and stopping criterion.
Initialize: Set $X_0 = 0$ and $k = 1$.
repeat
  Step 1: Find a pair of top left and right singular vectors $(u_k, v_k)$ of the observed residual matrix $R_k = Y_\Omega - X_{k-1}$ and set $M_k = u_k v_k^T$.
  Step 2: Compute the optimal weights $\alpha^k$ for $X_{k-1}$ and $M_k$ by solving $\arg\min_\alpha \|\alpha_1 X_{k-1} + \alpha_2 (M_k)_\Omega - Y_\Omega\|^2$.
  Step 3: Set $X_k = \alpha_1^k X_{k-1} + \alpha_2^k (M_k)_\Omega$; $\theta_k^k = \alpha_2^k$ and $\theta_i^k = \theta_i^{k-1} \alpha_1^k$ for $i < k$; $k \leftarrow k + 1$.
until stopping criterion is satisfied
Output: Constructed matrix $\hat{Y} = \sum_{i=1}^{k} \theta_i^k M_i$.

The proposed economic rank-one matrix pursuit algorithm (ER1MP) uses the same amount of storage as the greedy algorithms (Jaggi & Sulovský, 2010; Tewari et al., 2011), which is significantly smaller than that required by the R1MP algorithm.
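To make the update in (12) and Step 2 of Algorithm 2 concrete, here is a minimal NumPy sketch of one economic weight update (variable names are our own, not from the paper): it solves the two-variable least squares problem on the observed entries in closed form and keeps only the current estimate and the new rank-one matrix in memory, i.e., $2|\Omega|$ stored values. The coefficient a1 implicitly rescales all previously accumulated weights, which is exactly the sub-optimal but constant-storage correction described above.

```python
import numpy as np

def er1mp_step(X_prev, M_new, Y_obs, mask):
    """One ER1MP weight update: solve min_a ||a1*X_prev + a2*M_new - Y||_Omega^2."""
    x = X_prev[mask]                  # current estimate restricted to Omega
    m = M_new[mask]                   # new rank-one basis restricted to Omega
    y = Y_obs[mask]
    A = np.column_stack([x, m])       # |Omega| x 2 design matrix
    (a1, a2), *_ = np.linalg.lstsq(A, y, rcond=None)
    X_new = np.where(mask, a1 * X_prev + a2 * M_new, 0.0)
    return X_new, a1, a2              # a1 rescales all previous weights, a2 is theta_k^k
```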
Interestingly, we can show that the ER1MP algorithm still retains the linear convergence rate. The main result is given in the following theorem, and the proof is provided in the long version of this paper (Wang et al., 2014).

Theorem 4.1. The economic rank-one matrix pursuit algorithm satisfies $\|R_k\| \le \gamma^{k-1} \|Y\|_\Omega$ for all $k \ge 1$, where $\gamma$ is a constant in $[0, 1)$.

5. Experiments

In this section, we compare our rank-one matrix pursuit algorithms R1MP and ER1MP with state-of-the-art matrix completion algorithms. The competing algorithms include: singular value projection (SVP) (Jain et al., 2010), singular value thresholding (SVT) (Candès & Recht, 2009), Jaggi's fast algorithm for the trace norm constraint (JS) (Jaggi & Sulovský, 2010), the spectral regularization algorithm (SoftImpute) (Mazumder et al., 2010), low rank matrix fitting (LMaFit) (Wen et al., 2010), alternating minimization (AltMin) (Jain et al., 2013), the boosting type accelerated matrix-norm penalized solver (Boost) (Zhang et al., 2012) and greedy efficient component optimization (GECO) (Shalev-Shwartz et al., 2011). The general greedy method (Tewari et al., 2011) is not included in our comparison, as it includes JS and GECO (both in our comparison) as special cases for matrix completion. The lifted coordinate descent method (Lifted) (Dudík et al., 2012) is not included either, as it is similar to Boost (Zhang et al., 2012) but more sensitive to its parameters. The code for most of these algorithms is available online:

SVP: http://www.cs.utexas.edu/~pjain/svp/
SVT: http://svt.stanford.edu/
SoftImpute: http://www-stat.stanford.edu/~rahulm/software.html
LMaFit: http://lmafit.blogs.rice.edu/
Boost: http://webdocs.cs.ualberta.ca/~xinhua2/boosting.zip
GECO: http://www.cs.huji.ac.il/~shais/code/geco.zip

We compare these algorithms on two problems: image recovery and collaborative filtering. The data size for image recovery is relatively small, while the recommendation problem is large-scale. In the experiments, we follow the recommended parameter settings for the competing algorithms. If no recommended value is available, we choose the best one from a candidate set using cross validation. Our R1MP and ER1MP algorithms only need a stopping criterion; for simplicity, we stop them after r iterations, so that we approximate the ground truth using a rank-r matrix. We report the experimental results using the root-mean-square error (RMSE) (Jaggi & Sulovský, 2010; Shalev-Shwartz et al., 2011). The experiments are implemented in MATLAB (GECO is written in C++, and we call its executable file from MATLAB); they call external packages for fast computation of the SVD (PROPACK, an efficient SVD package used by SVP, SVT, SoftImpute and Boost, available from http://soi.stanford.edu/~rmunk/PROPACK/) and for sparse matrix computations. The experiments are run on a PC with Windows 7, an Intel 4-core 3.4 GHz CPU and 8 GB RAM.

Table 1. Image recovery results measured in terms of the RMSE: the reported value is the actual value times 100 (mean ± std).

| Image | SVT | SVP | SoftImpute | LMaFit | AltMin | JS | R1MP | ER1MP |
|---|---|---|---|---|---|---|---|---|
| Lenna | 3.86 ± 0.02 | 5.31 ± 0.14 | 4.60 ± 0.02 | 7.45 ± 0.63 | 4.47 ± 0.10 | 5.48 ± 0.72 | 3.90 ± 0.02 | 3.97 ± 0.02 |
| Barbara | 4.48 ± 0.02 | 5.60 ± 0.08 | 5.22 ± 0.01 | 5.16 ± 0.28 | 5.05 ± 0.06 | 6.52 ± 0.88 | 4.63 ± 0.01 | 4.73 ± 0.02 |
| Clown | 3.72 ± 0.03 | 10.97 ± 0.17 | 4.48 ± 0.03 | 4.65 ± 0.67 | 5.49 ± 0.46 | 7.30 ± 2.32 | 3.85 ± 0.03 | 3.91 ± 0.03 |
| Crowd | 4.48 ± 0.02 | 7.62 ± 0.13 | 5.35 ± 0.02 | 4.91 ± 0.05 | 4.87 ± 0.02 | 7.38 ± 1.41 | 4.89 ± 0.03 | 4.96 ± 0.03 |
| Girl | 3.36 ± 0.02 | 4.45 ± 0.16 | 4.10 ± 0.01 | 4.12 ± 0.48 | 5.07 ± 0.50 | 4.42 ± 0.46 | 3.09 ± 0.02 | 3.12 ± 0.02 |
| Man | 4.49 ± 0.03 | 5.52 ± 0.10 | 5.16 ± 0.03 | 5.31 ± 0.13 | 5.19 ± 0.11 | 6.25 ± 0.54 | 4.66 ± 0.03 | 4.76 ± 0.03 |
5.1. Image Recovery

In the image recovery experiments, we use the following benchmark test images: Lenna, Barbara, Clown, Crowd, Girl, and Man (downloaded from http://www.utdallas.edu/~cxc123730/mh_bcs_spl.html). The size of each image is 512 × 512. For each experiment, we report the average RMSE and the corresponding standard deviation over 10 different runs for each competing algorithm. In each run, we randomly exclude 50% of the pixels in the image, and the remaining ones are used as the observations. As the image matrix is not guaranteed to be low rank, we use rank 200 for the estimated matrix in each experiment. The JS algorithm does not explicitly control the rank, so we fix its number of iterations to 2000. The numerical results are listed in Table 1. The results show that SVT, our R1MP and our ER1MP achieve the best numerical performance. However, our algorithms are much faster and more stable than SVT: for each image, ER1MP takes around 3.5 seconds, while SVT consumes around 400 seconds. Image recovery needs a relatively high approximation rank; GECO and Boost fail to find a good recovery in some cases, so we do not include them in the table.

5.2. Recommendation

In the following experiments, we compare the different matrix completion algorithms on large recommendation datasets: Jester (Goldberg et al., 2001) and MovieLens (Miller et al., 2003). We use six datasets: Jester1, Jester2, Jester3, MovieLens100K, MovieLens1M, and MovieLens10M. The statistics of these datasets are given in Table 2. The Jester datasets were collected from a joke recommendation system. They contain anonymous ratings of 100 jokes from the users; the ratings are real values ranging from -10.00 to +10.00. The MovieLens datasets were collected from the MovieLens website (http://movielens.umn.edu). They contain anonymous ratings of the movies on this website made by its users. For MovieLens100K and MovieLens1M, there are 5 rating scores (1-5), and for MovieLens10M there are 10 levels of scores with a step size of 0.5 in the range 0.5 to 5. In the following experiments, we randomly split the ratings into training and test sets, each containing 50% of the ratings, and compare the prediction results of the different methods. In the experiments, we use 100 iterations for the JS algorithm, and for the other algorithms we use the same rank for the estimated matrices; the values of the rank are {10, 10, 5, 10, 10, 20} for the six corresponding datasets. The results in terms of the RMSE are given in Table 3. We also show the running time of the different methods in Table 4. We can observe from these experiments that our ER1MP algorithm is the fastest among all competing methods to obtain satisfactory results.

Table 2. Characteristics of the recommendation datasets.

| Dataset | # rows | # columns | # ratings |
|---|---|---|---|
| Jester1 | 24983 | 100 | 10^6 |
| Jester2 | 23500 | 100 | 10^6 |
| Jester3 | 24938 | 100 | 6 × 10^5 |
| MovieLens100K | 943 | 1682 | 10^5 |
| MovieLens1M | 6040 | 3706 | 10^6 |
| MovieLens10M | 69878 | 10677 | 10^7 |

Table 3. Recommendation results measured in terms of the RMSE. Boost fails on MovieLens10M.

| Dataset | SVP | SoftImpute | LMaFit | AltMin | Boost | JS | GECO | R1MP | ER1MP |
|---|---|---|---|---|---|---|---|---|---|
| Jester1 | 4.7311 | 5.1113 | 4.7623 | 4.8572 | 5.1746 | 4.4713 | 4.3680 | 4.3418 | 4.3384 |
| Jester2 | 4.7608 | 5.1646 | 4.7500 | 4.8616 | 5.2319 | 4.5102 | 4.3967 | 4.3649 | 4.3546 |
| Jester3 | 8.6958 | 5.4348 | 9.4275 | 9.7482 | 5.3982 | 4.6866 | 5.1790 | 4.9783 | 5.0145 |
| MovieLens100K | 0.9683 | 1.0354 | 1.2308 | 1.0042 | 1.1244 | 1.0146 | 1.0243 | 1.0168 | 1.0261 |
| MovieLens1M | 0.9085 | 0.8989 | 0.9232 | 0.9382 | 1.0850 | 1.0439 | 0.9290 | 0.9595 | 0.9462 |
| MovieLens10M | 0.8611 | 0.8534 | 0.8625 | 0.9007 | - | 0.8728 | 0.8668 | 0.8621 | 0.8692 |

Table 4. The running time (in seconds) of all methods on all recommendation datasets.

| Dataset | SVP | SoftImpute | LMaFit | AltMin | Boost | JS | GECO | R1MP | ER1MP |
|---|---|---|---|---|---|---|---|---|---|
| Jester1 | 18.35 | 161.49 | 3.68 | 11.14 | 93.91 | 29.68 | > 10^4 | 1.83 | 0.99 |
| Jester2 | 16.85 | 152.96 | 2.42 | 10.47 | 261.70 | 28.52 | > 10^4 | 1.68 | 0.91 |
| Jester3 | 16.58 | 10.55 | 8.45 | 12.23 | 245.79 | 12.94 | > 10^3 | 0.93 | 0.34 |
| MovieLens100K | 1.32 | 128.07 | 2.76 | 3.23 | 2.87 | 2.86 | 10.83 | 0.04 | 0.04 |
| MovieLens1M | 18.90 | 59.56 | 30.55 | 68.77 | 93.91 | 13.10 | > 10^4 | 0.87 | 0.54 |
| MovieLens10M | > 10^3 | > 10^3 | 154.38 | 310.82 | - | 130.13 | > 10^5 | 23.05 | 13.79 |

5.3. Convergence and Efficiency

We present the residual curves on the Lenna image in logarithmic scale for our R1MP and ER1MP algorithms in Figure 1. The results show that our algorithms reduce the approximation error at a linear rate. This is consistent with our theoretical analysis, and the empirical results verify the linear convergence property of our proposed algorithms.

Figure 1. Illustration of the linear convergence of the proposed rank-one matrix pursuit algorithms on the Lenna image: the x-axis is the iteration, and the y-axis is the RMSE in logarithmic scale. The two panels show the results for R1MP and ER1MP, respectively.
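As a rough, self-contained way to reproduce the qualitative behavior in Figure 1 on synthetic data, the sketch below tracks $\|R_k\|$ per iteration of the ER1MP loop and plots it on a log scale, where an approximately straight line corresponds to the linear rate of Theorems 3.1 and 4.1. This is our own NumPy/Matplotlib illustration (synthetic low rank matrix, full SVD instead of a power method), not the MATLAB pipeline behind the reported experiments.

```python
import numpy as np
import matplotlib.pyplot as plt

def er1mp_residuals(Y, mask, rank):
    """Economic Rank-One Matrix Pursuit (Algorithm 2), returning ||R_k|| per iteration."""
    Y_obs = np.where(mask, Y, 0.0)
    X, res = np.zeros_like(Y_obs), []
    for _ in range(rank):
        R = Y_obs - X
        res.append(np.linalg.norm(R))
        U, _, Vt = np.linalg.svd(R, full_matrices=False)
        M = np.where(mask, np.outer(U[:, 0], Vt[0, :]), 0.0)   # (M_k)_Omega
        A = np.column_stack([X[mask], M[mask]])                 # two-column design
        (a1, a2), *_ = np.linalg.lstsq(A, Y_obs[mask], rcond=None)
        X = a1 * X + a2 * M                                      # economic update
    return res

rng = np.random.default_rng(3)
Y = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 80))
mask = rng.random(Y.shape) < 0.5
plt.semilogy(er1mp_residuals(Y, mask, 30))   # roughly straight line => linear rate
plt.xlabel("iteration"); plt.ylabel("residual norm (log scale)")
plt.show()
```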
6. Conclusion

In this paper, we propose an efficient and scalable low rank matrix completion algorithm. The key idea is to extend the orthogonal matching pursuit method from the vector case to the matrix case. We also propose a novel weight updating rule under this framework to reduce the storage complexity and make it independent of the approximation rank. Our algorithms are computationally inexpensive in each matrix pursuit iteration and find satisfactory results in a few iterations. Another advantage of our proposed algorithms is that they have only one tunable parameter, the rank, which is easy to understand and to use; this becomes especially important in large-scale learning problems. In addition, we rigorously show that both algorithms achieve a linear convergence rate, which is significantly better than the previously known results (a sub-linear convergence rate). We also empirically compare the proposed algorithms with state-of-the-art matrix completion algorithms, and the results show that the proposed algorithms are more efficient than the competing algorithms while achieving similar or better prediction performance. We plan to generalize our theoretical and empirical analysis to other loss functions in the future.

7. Acknowledgments

This work was supported in part by the China 973 Fundamental R&D Program (No. 2014CB340304), NIH (LM010730), and NSF (IIS-0953662, CCF-1025177).

References

Argyriou, A., Evgeniou, T., and Pontil, M. Convex multi-task feature learning. Machine Learning, 73(3):243-272, 2008.

Avron, H., Kale, S., Kasiviswanathan, S., and Sindhwani, V. Efficient and practical stochastic subgradient descent for nuclear norm regularization. In ICML, 2012.

Balzano, L., Nowak, R., and Recht, B. Online identification and tracking of subspaces from highly incomplete information. In Allerton, 2010.

Cai, J., Candès, E. J., and Shen, Z. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956-1982, 2010.

Candès, E. J. and Recht, B.
Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717-772, 2009.

Dudík, M., Harchaoui, Z., and Malick, J. Lifted coordinate descent for learning with trace-norm regularization. In AISTATS, 2012.

Friedman, J. H., Hastie, T., and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1-22, 2010.

Goldberg, K., Roeder, T., Gupta, D., and Perkins, C. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2):133-151, 2001.

Hazan, E. Sparse approximate solutions to semidefinite programs. In LATIN, 2008.

Jaggi, M. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, 2013.

Jaggi, M. and Sulovský, M. A simple algorithm for nuclear norm regularized problems. In ICML, 2010.

Jain, P., Meka, R., and Dhillon, I. S. Guaranteed rank minimization via singular value projection. In NIPS, 2010.

Jain, P., Netrapalli, P., and Sanghavi, S. Low-rank matrix completion using alternating minimization. In STOC, 2013.

Ji, S. and Ye, J. An accelerated gradient method for trace norm minimization. In ICML, 2009.

Keshavan, R. and Oh, S. OptSpace: A gradient descent algorithm on Grassmann manifold for matrix completion. http://arxiv.org/abs/0910.5260, 2009.

Koren, Y., Bell, R., and Volinsky, C. Matrix factorization techniques for recommender systems. Computer, 2009.

Lee, K. and Bresler, Y. ADMiRA: Atomic decomposition for minimum rank approximation. IEEE Transactions on Information Theory, 56(9):4402-4416, 2010.

Liu, E. and Temlyakov, V. N. The orthogonal super greedy algorithm and applications in compressed sensing. IEEE Transactions on Information Theory, 58:2040-2047, 2012.

Ma, S., Goldfarb, D., and Chen, L. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming, 128(1-2):321-353, 2011.

Mazumder, R., Hastie, T., and Tibshirani, R. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 99:2287-2322, August 2010.

Miller, B. N., Albert, I., Lam, S. K., Konstan, J. A., and Riedl, J. MovieLens unplugged: Experiences with an occasionally connected recommender system. In IUI, 2003.

Mishra, B., Meyer, G., Bach, F., and Sepulchre, R. Low-rank optimization with trace norm penalty. http://arxiv.org/abs/1112.2318, 2011.

Needell, D. and Tropp, J. A. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Communications of the ACM, 53(12):93-100, 2010.

Negahban, S. and Wainwright, M. J. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. In ICML, 2010.

Pati, Y. C., Rezaiifar, R., and Krishnaprasad, P. S. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Asilomar SSC, 1993.

Recht, B. and Ré, C. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2):201-226, 2013.

Shalev-Shwartz, S. and Tewari, A. Stochastic methods for ℓ1 regularized loss minimization. In ICML, 2009.

Shalev-Shwartz, S., Gonen, A., and Shamir, O. Large-scale convex minimization with a low-rank constraint. In ICML, 2011.

Srebro, N., Rennie, J., and Jaakkola, T. Maximum margin matrix factorizations. In NIPS, 2005.

Tewari, A., Ravikumar, P., and Dhillon, I. S. Greedy algorithms for structurally constrained high dimensional problems. In NIPS, 2011.

Tibshirani, R.
Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1994.

Toh, K. and Yun, S. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pacific Journal of Optimization, 6:615-640, 2010.

Tropp, J. A. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50:2231-2242, 2004.

Wang, Z., Lai, M., Lu, Z., Fan, W., Davulcu, H., and Ye, J. Orthogonal rank-one matrix pursuit for low rank matrix completion. http://arxiv.org/abs/1404.1377, 2014.

Wen, Z., Yin, W., and Zhang, Y. Low-rank factorization model for matrix completion by a non-linear successive over-relaxation algorithm. Rice CAAM Tech Report 1007, Rice University, 2010.

Zhang, X., Yu, Y., and Schuurmans, D. Accelerated training for matrix-norm regularization: A boosting approach. In NIPS, 2012.