Coherent Matrix Completion

Yudong Chen YDCHEN@UTEXAS.EDU
University of California, Berkeley, CA 94720, USA

Srinadh Bhojanapalli BSRINADH@UTEXAS.EDU
Sujay Sanghavi SANGHAVI@MAIL.UTEXAS.EDU
Rachel Ward RWARD@MATH.UTEXAS.EDU
The University of Texas at Austin, Austin, TX 78712, USA

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Matrix completion concerns the recovery of a low-rank matrix from a subset of its revealed entries, and nuclear norm minimization has emerged as an effective surrogate for this combinatorial problem. Here, we show that nuclear norm minimization can recover an arbitrary n × n matrix of rank r from O(nr log²(n)) revealed entries, provided that the revealed entries are drawn proportionally to the local row and column coherences (closely related to leverage scores) of the underlying matrix. Our results are order-optimal up to logarithmic factors, and extend existing results for nuclear norm minimization, which require strong incoherence conditions on the types of matrices that can be recovered because the revealed entries are assumed to be uniformly distributed. We further provide extensive numerical evidence that a proposed two-phase sampling algorithm can perform nearly as well as local-coherence sampling, without requiring a priori knowledge of the matrix coherence structure. Finally, we apply our results to quantify how weighted nuclear norm minimization can improve on unweighted minimization given an arbitrary set of sampled entries.

1. Introduction

Low-rank matrix completion has been the subject of much recent study due to its application in myriad tasks: collaborative filtering, dimensionality reduction, clustering, and localization in sensor networks. Clearly, the problem is ill-posed in general; correspondingly, analytical work on the subject has focused on the joint development of algorithms, and sufficient conditions under which such algorithms are able to recover the matrix. While they differ in scaling/constant factors, all existing sufficient conditions (Candès & Recht, 2009; Recht, 2009; Keshavan et al., 2010; Gross, 2011; Jain et al., 2012; Negahban & Wainwright, 2012), with a couple of exceptions we describe in Section 2, require that (a) the subset of observed elements be chosen uniformly at random, independent of the values of the matrix elements, and (b) the low-rank matrix be incoherent or not "spiky", i.e., its row and column spaces should be diffuse, having low inner products with the standard basis vectors. Under these conditions, the matrix has been shown to be provably recoverable via methods based on convex optimization (Candès & Recht, 2009; Recht, 2009; Gross, 2011), alternating minimization (Jain et al., 2012), iterative thresholding (Cai et al., 2010), etc., using as few as Θ(nr log n) observed elements for an n × n matrix of rank r.

Indeed, the incoherence assumption is required because of the uniform sampling: coherent matrices are those which have most of their mass in a relatively small number of elements. By sampling entries uniformly and independently at random, most of the mass of a coherent low-rank matrix will be missed; this could (and does) throw off all existing recovery methods. One could imagine that if the sampling is dependent on the matrix, roughly in a way that elements with more mass are more likely to be observed, then it may be possible for existing methods to recover the full matrix.
In this paper, we show that the incoherence requirement can be eliminated completely, provided the sampling distribution is adapted to the matrix to be recovered in the right way. Specifically, we have the following results.

1. If the probability of an element being observed is dependent on the sum of the corresponding row and column leverage scores (local coherence parameters) of the underlying matrix, then an arbitrary rank-r matrix can be exactly recovered from Θ(nr log² n) observed elements with high probability, using nuclear norm minimization. In case all leverage scores are roughly equal, our results reduce to existing guarantees for incoherent matrices under uniform sampling. Our sample complexity bound Θ(nr log² n) is optimal up to a single factor of log²(n), since the number of degrees of freedom of an n × n matrix of rank r is 2nr. Our bounds are also near-optimal with respect to the local coherence parameters of the matrix.

2. We provide numerical evidence that an adaptive sampling strategy, which assumes no prior knowledge about the local coherences of the underlying matrix, can perform on par with the optimal sampling strategy in completing coherent matrices, and significantly outperforms uniform sampling. Specifically, we consider a two-phase sampling strategy whereby, given a fixed budget of m samples, we first draw a fixed proportion of samples uniformly at random, and then draw the remaining samples according to the local coherence structure of the resulting sampled matrix.

3. Using our theoretical results, we quantify the benefit of weighted nuclear norm minimization over standard nuclear norm minimization, and provide a strategy for choosing the weights given non-uniformly distributed samples so as to reduce the sampling complexity of weighted nuclear norm minimization to that of standard nuclear norm minimization. Our results give the first exact recovery guarantee for the weighted nuclear norm minimization of (Salakhutdinov & Srebro, 2010; Negahban & Wainwright, 2012; Foygel et al., 2011), thus providing theoretical justification for its good empirical performance.

Our main theoretical results are achieved by a new analysis based on bounds involving the weighted ℓ∞,2 matrix norm, defined as the maximum of the appropriately weighted row and column norms of the matrix. This differs from previous approaches that use ℓ∞ or unweighted ℓ∞,2 bounds (Gross, 2011; Chen, 2013). In some sense, weighted ℓ∞,2-type bounds are natural for the analysis of low-rank matrices, because the rank is a property of the rows and columns of the matrix rather than of its individual entries, and the weighted norm captures the relative importance of the rows/columns. It would therefore be interesting to see whether the techniques in this paper are relevant more generally, beyond the specific settings and algorithms considered here.

2. Related work

There is now a vast body of literature on matrix completion, and an even bigger body of literature on matrix approximations more generally; we restrict our review here to the papers most directly related.

Exact Completion, Incoherent Matrices, Random Samples: The first algorithm and theoretical guarantees for the exact recovery of a low-rank matrix from a subset of its elements appeared in (Candès & Recht, 2009); there it was shown that the nuclear norm minimization program (1) below works when the low-rank matrix is incoherent and the sampling is uniformly random and independent of the matrix.
Subsequent works have refined provable completion results for incoherent matrices under the uniform random sampling model, both via nuclear norm minimization (Candès & Tao, 2010; Recht, 2009; Gross, 2011; Chen, 2013) and via other methods such as SVD followed by local descent (Keshavan et al., 2010) and alternating minimization (Jain et al., 2012), and also in the presence of both sparse errors and additive noise (Candès & Plan, 2010; Chen et al., 2013; Chandrasekaran et al., 2011).

Matrix approximations via sub-sampling: Weighted sampling methods have been widely considered in the related context of matrix sparsification, where one aims to approximate a large dense matrix with a sparse matrix. The strategy of element-wise matrix sparsification was introduced in (Achlioptas & McSherry, 2007), who propose and provide bounds for the L2 entrywise sampling model, in which entries of the matrix are sampled with probability proportional to their squared magnitude. These bounds were later refined in (Drineas & Zouzias, 2011). Alternatively, (Arora et al., 2006) proposed the L1 entrywise sampling model, where entries are sampled with probabilities proportional to their magnitude. This model was further investigated in (Achlioptas et al., 2013) and argued to be almost always preferable to L2 sampling.

Closely related to the matrix sparsification problem is the column subset selection problem, where one aims to find the best k-column subset of a matrix to use as an approximation. State-of-the-art algorithms for column subset selection (Boutsidis et al., 2009; Mahoney, 2011) involve randomized sampling strategies whereby columns are selected proportionally to their statistical leverage scores, the squared Euclidean norms of the projections of the canonical unit vectors onto the column singular subspaces. The statistical leverage scores of a matrix can be approximated efficiently, faster than the time needed to compute an SVD (Drineas et al., 2012). Statistical leverage scores have been used extensively in statistical regression analysis for outlier detection (Chatterjee & Hadi, 1986). More recently, they were used in the context of graph sparsification under the name of effective resistances (Spielman & Srivastava, 2011). The sampling distribution we use for matrix completion in this paper is based on statistical leverage scores. As shown in Section 4.1, such sampling outperforms both L1 and L2 entrywise sampling, at least in the context of matrix completion.

Weighted sampling in compressed sensing: This paper is similar in spirit to recent work in compressed sensing which shows that sparse recovery guarantees traditionally requiring mutual incoherence can be extended to systems which are only weakly incoherent, without any loss of approximation power, provided measurements from the sensing basis are subsampled according to their coherence with the sparsity basis. This notion of local coherence sampling seems to have originated in (Rauhut & Ward, 2012) in the context of sparse orthogonal polynomial expansions, and has found applications in uncertainty quantification (Yang & Karniadakis, 2013), interpolation with spherical harmonics (Burq et al., 2012), and MRI compressive imaging (Krahmer & Ward, 2012).

Finally, closely related to our paper is the recent work of (Krishnamurthy & Singh, 2013), which considers matrix completion where only the row space is allowed to be coherent.
Their proposed algorithm selects columns to observe in their entirety, and requires a total of O(r²n log r) observed entries, which is quadratic in r.

2.1. Organization

We present our main results for coherent matrix completion in Section 3. In Section 4 we propose a two-phase algorithm that requires no prior knowledge about the underlying matrix coherence structure. In Section 5 we provide guarantees for weighted nuclear norm minimization. We provide the proof of the main theorem in the appendix.

3. Main Results

The results in this paper hold for what is arguably the most popular approach to matrix completion: nuclear norm minimization. If the true matrix is M with entries M_ij, and the set of observed elements is Ω, this method guesses as the completion the optimum of the convex program

$$\min_X \ \|X\|_* \quad \text{s.t.} \quad X_{ij} = M_{ij} \ \text{for} \ (i,j) \in \Omega, \tag{1}$$

where the nuclear norm ‖X‖_* of a matrix is the sum of its singular values.¹ Throughout, we use the standard notation f(n) = Θ(g(n)) to mean that cg(n) ≤ f(n) ≤ Cg(n) for some positive constants c, C, where n := max{n1, n2}.

We focus on the setting where matrix entries are revealed from an underlying probability distribution. To introduce the distribution of interest, we first need a definition.

Definition 3.1. For an n1 × n2 real-valued matrix M of rank r with SVD given by UΣV⊤, the local coherences² µ_i for any row i, and ν_j for any column j, are defined by the following relations:

$$\|U^\top e_i\| = \sqrt{\frac{\mu_i r}{n_1}}, \quad i = 1, \ldots, n_1; \qquad \|V^\top e_j\| = \sqrt{\frac{\nu_j r}{n_2}}, \quad j = 1, \ldots, n_2. \tag{2}$$

Note that the µ_i, ν_j are non-negative, and since U and V have orthonormal columns we always have Σ_i µ_i r/n1 = Σ_j ν_j r/n2 = r.

We are ready to state our main result, the theorem below.

Theorem 3.2. Let M = (M_ij) be an n1 × n2 matrix with local coherence parameters {µ_i, ν_j}, and suppose that its entries M_ij are observed only over a subset of elements Ω ⊆ [n1] × [n2]. There are universal constants c0, c1, c2 > 0 such that if each element (i, j) is independently observed with probability p_ij, and p_ij satisfies

$$p_{ij} \ge \min\left\{ c_0 \, \frac{(\mu_i + \nu_j)\, r \log^2(n_1 + n_2)}{\min\{n_1, n_2\}}, \ 1 \right\}, \qquad p_{ij} \ge \frac{1}{\min\{n_1, n_2\}^{10}}, \tag{3}$$

then M is the unique optimal solution to the nuclear norm minimization problem (1) with probability at least 1 − c1(n1 + n2)^{−c2}.

We will refer to the sampling strategy (3) as local coherence sampling. Note that the expected number of observed entries is Σ_{i,j} p_ij, and this satisfies

$$\sum_{i,j} p_{ij} \ \ge \ c_0 \sum_{i,j} \frac{(\mu_i + \nu_j)\, r \log^2(n_1 + n_2)}{\min\{n_1, n_2\}} \ = \ 2 c_0 \max\{n_1, n_2\}\, r \log^2(n_1 + n_2),$$

independent of the coherence, or indeed any other property, of the matrix. Hoeffding's inequality implies that the actual number of observed entries sharply concentrates around its expectation, leading to the following corollary:

Corollary 3.3. Let M = (M_ij) be an n1 × n2 matrix with local coherence parameters {µ_i, ν_j}, and draw a subset Ω of its entries by local coherence sampling according to the procedure described in Theorem 3.2. There are universal constants c1', c2' > 0 such that the following holds with probability at least 1 − c1'(n1 + n2)^{−c2'}: the number m of revealed entries is bounded by

$$m \ \le \ 3 c_0 \max\{n_1, n_2\}\, r \log^2(n_1 + n_2),$$

and M is the unique optimal solution to the nuclear norm minimization program (1).

¹This becomes the trace norm for positive-definite matrices. It is now well-recognized to be a convex surrogate for rank minimization.
²In the matrix sparsification literature (Drineas et al., 2012; Boutsidis et al., 2009) and beyond, the quantities ‖U⊤e_i‖² and ‖V⊤e_j‖² are referred to as the leverage scores of M.
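To make Definition 3.1 and the sampling rule (3) concrete, the following is a minimal numerical sketch in Python/NumPy (the paper itself prescribes no implementation); the value c0 = 2 is an arbitrary illustrative stand-in for the unspecified universal constant in Theorem 3.2.

```python
import numpy as np

def local_coherences(M, r):
    """Local coherences (Definition 3.1) from the rank-r SVD of M."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    U, V = U[:, :r], Vt[:r, :].T
    n1, n2 = M.shape
    mu = (n1 / r) * np.sum(U**2, axis=1)   # mu_i = (n1/r) * ||U^T e_i||^2
    nu = (n2 / r) * np.sum(V**2, axis=1)   # nu_j = (n2/r) * ||V^T e_j||^2
    return mu, nu

def coherence_sampling_probs(mu, nu, r, c0=2.0):
    """Sampling probabilities satisfying condition (3) of Theorem 3.2.

    c0 is illustrative; the theorem only asserts some universal constant works.
    """
    n1, n2 = len(mu), len(nu)
    P = c0 * (mu[:, None] + nu[None, :]) * r * np.log(n1 + n2)**2 / min(n1, n2)
    # Enforce the floor 1/min(n1,n2)^10 and cap probabilities at 1.
    return np.minimum(np.maximum(P, 1.0 / min(n1, n2)**10), 1.0)

# Usage: reveal each entry (i, j) independently with probability P[i, j].
# mu, nu = local_coherences(M, r)
# P = coherence_sampling_probs(mu, nu, r)
# Omega = np.random.rand(*P.shape) < P   # boolean mask of observed entries
```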
We now provide comments and discussion.

(A) Roughly speaking, the condition given in (3) ensures that entries in important rows/columns (indicated by large local coherences µ_i and ν_j) of the matrix should be observed more often. Note that Theorem 3.2 only stipulates that an inequality relation hold between p_ij and {µ_i, ν_j}. This allows for some discrepancy between the sampling distribution and the local coherences. It also has the natural interpretation that the more the sampling distribution {p_ij} is aligned with the local coherence pattern of the matrix, the fewer observations are needed.

(B) Sampling based on local coherences requires close to the minimal number of sampled elements needed for exact recovery under any sampling distribution. In particular, assume n1 = n2 = n and recall that the number of degrees of freedom of an n × n matrix with rank r is 2nr(1 − r/(2n)). Hence, regardless of how the entries are sampled, a minimum of Θ(nr) entries is required to recover the matrix. Theorem 3.2 matches this lower bound up to an additional O(log²(n)) factor.

(C) Our work improves on existing results even in the case of uniform sampling and uniform incoherence. Recall that the original work of (Candès & Recht, 2009) and subsequent works (Candès & Tao, 2010; Recht, 2009; Gross, 2011) give recovery guarantees based on two parameters of the matrix M: a global incoherence parameter µ0, which is a uniform bound on the (above-defined) local coherences, i.e., every µ_i ≤ µ0 and every ν_j ≤ µ0, and a joint incoherence parameter µ_str defined by

$$\|UV^\top\|_\infty = \sqrt{\frac{\mu_{str}\, r}{n_1 n_2}}.$$

With these definitions, the current state of the art states that if the uniform sampling probability satisfies

$$p_{ij} \equiv p \ \ge \ c \, \frac{\max\{\mu_0, \mu_{str}\}\, r \log^2 n}{n}$$

when n1 = n2 = n, where c is a constant, then M will be the unique optimum of (1) with high probability. A direct corollary of our work improves on this result by removing the need for extra constraints on the joint incoherence; in particular, it is easy to see that our theorem implies that a uniform sampling probability of p ≥ c µ0 r log² n / n, with no dependence on µ_str, guarantees recovery of M with high probability. Note that in general µ_str can be as large as µ0 r; our corollary thus removes this sub-optimal dependence on the rank and on the joint incoherence. This improvement was also recently observed in (Chen, 2013).

Remark 3.4. Suppose n1 = n2 = n. If the column space of M is incoherent with max_i µ_i ≤ µ0 and the row space is arbitrary, then one can randomly pick Θ(µ0 r log n) rows of M, observe all their entries, and compute the local coherences of the space spanned by these rows. These parameters will equal the ν_j of M with high probability. Based on these values, we can perform non-uniform sampling according to (3) and exactly recover M. This procedure does not require any prior knowledge about the local coherences of M, and it uses a total of Θ(µ0 r n log² n) samples. This improves on the Θ(µ0² r² n log r) sample complexity of (Krishnamurthy & Singh, 2013), which is quadratic in µ0 r. We prove this remark in the supplement.
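A minimal sketch of the procedure behind Remark 3.4, under the remark's assumption that the column space is µ0-incoherent and the selected rows are observed in full; the factor 10 standing in for the Θ(·) constant is an illustrative choice, not dictated by the theory.

```python
import numpy as np

def estimate_column_coherences(M, r, mu0, rng=np.random.default_rng(0)):
    """Estimate the column local coherences nu_j from a few fully observed rows.

    Sketch of Remark 3.4: sample Theta(mu0 * r * log n) rows uniformly at
    random; w.h.p. their span equals the row space of M, so the leverage
    scores of that span give the nu_j of M.
    """
    n1, n2 = M.shape
    d = min(n1, int(np.ceil(10 * mu0 * r * np.log(n1))))  # illustrative constant
    rows = rng.choice(n1, size=d, replace=False)
    # Orthonormal basis for the span of the sampled (fully observed) rows.
    _, _, Vt = np.linalg.svd(M[rows, :], full_matrices=False)
    V = Vt[:r, :].T
    return (n2 / r) * np.sum(V**2, axis=1)   # estimates of nu_j
```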
4. A two-phase sampling procedure

We have seen that one can exactly recover an arbitrary n × n low-rank matrix using Θ(nr log²(n)) entries if they are sampled in accordance with the local coherences. In practical applications of matrix completion, even when the user is free to choose how to sample the matrix entries, she likely will not be privy to the local coherence parameters {µ_i, ν_j}. In this section we propose a two-phase sampling procedure, described below and in Algorithm 1, which assumes no a priori knowledge about the matrix coherence structure, yet is observed to be competitive with the oracle local coherence distribution (3).

Consider a total budget of m samples, and a set of sampled indices Ω such that |Ω| = βm, where β ∈ [0, 1]. Let P_Ω(·) be the sampling operator which maps the matrix entries not in Ω to 0. The first step of the algorithm is to take the rank-r SVD of P_Ω(M), written ŪΣ̄V̄⊤, where Ū, V̄ ∈ R^{n×r} and Σ̄ ∈ R^{r×r}. We use the local coherences µ̄_i, ν̄_j of Ū, V̄, respectively, as estimates for the local coherences of M, and let

$$p_{ij} = \min\left\{ c_0 \, \frac{(\bar{\mu}_i + \bar{\nu}_j)\, r \log^2(n_1 + n_2)}{\min\{n_1, n_2\}}, \ 1 \right\}. \tag{4}$$

Now generate the remaining (1 − β)m samples of the matrix M according to the distribution (4), and let Ω' denote this new set of samples. Using the combined set of samples P_{Ω∪Ω'}(M) as constraints, run the nuclear norm minimization program (1), and let M̂ be its optimum.

Algorithm 1 Two-phase sampling for coherent matrix completion
input Sampled matrix P_Ω(M), rank parameter r, and m, β such that |Ω| = βm.
1: Compute the rank-r SVD of P_Ω(M): ŪΣ̄V̄⊤.
2: Estimate the local coherences by µ̄_i = (n1/r)‖Ū⊤e_i‖² and ν̄_j = (n2/r)‖V̄⊤e_j‖².
3: Generate a set Ω' of (1 − β)m new samples distributed as in (4).
4: M̂ = arg min ‖X‖_* s.t. P_{Ω∪Ω'}(X) = P_{Ω∪Ω'}(M).
output Completed matrix M̂.

To understand the performance of the two-phase algorithm, assume that the initial set of m1 = βm samples P_Ω(M) is generated uniformly at random. If the underlying matrix M is incoherent, then the algorithm will already recover M if m1 = Θ(max{n1, n2} r log²(n1 + n2)). On the other hand, if M is highly coherent, having almost all of its energy concentrated on just a few entries, then the local coherences estimated via (4) from uniform samples will be poor, and hence the recovery algorithm suffers. Between these two extremes, there is reason to believe that the two-phase sampling procedure will provide a better estimate of the underlying matrix than if all m entries were sampled uniformly. Indeed, numerical experiments suggest that the two-phase procedure can significantly outperform uniform sampling for completing coherent matrices.

4.1. Numerical experiments

We now analyze the performance of the two-phase sampling procedure outlined in Algorithm 1 through numerical experiments. For this, we consider rank-5 matrices of size 500 × 500 of the form M = DUV⊤D, where the entries of the matrices U and V are i.i.d. Gaussian N(0, 1) and D is a diagonal matrix with power-law decay, D_ii = i^{−α}, 1 ≤ i ≤ 500. We refer to such constructions as power-law matrices. The parameter α adjusts the coherence of the matrix, with α = 0 being incoherent and α = 1 corresponding to maximal coherence µ0 = Θ(n). We normalize M so that ‖M‖_F = 1. Figure 1 plots the number of samples required for successful recovery (y-axis) for different values of α (x-axis) using Algorithm 1 with initial samples taken i.i.d. uniformly. Successful recovery is defined as at least 95% of trials having relative error in the Frobenius norm not exceeding 0.01. To put the results in perspective, Figure 1 also shows the performance of pure uniform sampling, as well as of other popular sampling distributions from the matrix sparsification literature (Achlioptas & McSherry, 2007; Achlioptas et al., 2013; Arora et al., 2006; Drineas & Zouzias, 2011): in step 3 of the algorithm, sampling proportional to the entry magnitude (p_ij ∝ |M_ij|) and sampling proportional to the entry squared (p_ij ∝ M_ij²), as opposed to sampling from the distribution (4).
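As a concrete reference for these experiments, here is a compact sketch of Algorithm 1. The solver choice (cvxpy, assumed available; any nuclear norm solver works) and the rescaling of the estimated distribution (4) so that the expected number of new samples matches the remaining budget (1 − β)m are our illustrative assumptions, not specified by the algorithm statement.

```python
import numpy as np
import cvxpy as cp  # assumed available for step 4

def two_phase_complete(M, Omega, r, m, beta, c0=2.0, rng=np.random.default_rng(0)):
    """Sketch of Algorithm 1. Omega is a boolean mask with |Omega| = beta * m."""
    n1, n2 = M.shape
    # Steps 1-2: rank-r SVD of P_Omega(M), then estimated local coherences.
    PM = np.where(Omega, M, 0.0)
    U, _, Vt = np.linalg.svd(PM, full_matrices=False)
    mu = (n1 / r) * np.sum(U[:, :r]**2, axis=1)
    nu = (n2 / r) * np.sum(Vt[:r, :].T**2, axis=1)
    # Step 3: draw roughly (1 - beta) * m new samples from distribution (4).
    P = c0 * (mu[:, None] + nu[None, :]) * r * np.log(n1 + n2)**2 / min(n1, n2)
    P = np.minimum(P, 1.0)
    P *= (1 - beta) * m / P.sum()          # rescale to the remaining budget
    Omega2 = Omega | (rng.random((n1, n2)) < np.minimum(P, 1.0))
    # Step 4: nuclear norm minimization over the combined samples.
    X = cp.Variable((n1, n2))
    mask = Omega2.astype(float)
    prob = cp.Problem(cp.Minimize(cp.norm(X, "nuc")),
                      [cp.multiply(mask, X) == cp.multiply(mask, M)])
    prob.solve()
    return X.value
```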
In all cases, the estimated matrix used for the coherence estimates is constructed from the rank-r SVD of P_Ω(M), i.e., M̄ = ŪΣ̄V̄⊤. The performance of nuclear norm minimization using samples generated according to the oracle distribution (3) serves as a baseline for the best possible recovery, as theoretically justified by Theorem 3.2. We use an Augmented Lagrangian Method (ALM) based solver by (Chen & Ganesh, 2009) to solve the convex optimization program (1).

Figure 1 suggests that the two-phase algorithm performs comparably to the theoretically optimal coherence-based distribution (3), despite not having access to the underlying local coherences, in the regime of mild to moderate coherence α ≤ 0.7. While the entrywise sampling strategies perform comparably for low values of α, the number of samples they require for successful recovery increases for α > 0.6. Completion from purely uniformly sampled entries requires significantly more samples at higher values of α.

Choosing β: Recall that the parameter β in Algorithm 1 is the fraction of uniform samples out of the total number of samples. Figure 2(a) plots the number of samples required for successful recovery (y-axis) as β (x-axis) varies from 0.1 to 1 for different values of α. The choice β = 1 reduces to purely uniform sampling, while for small values of β the local coherences estimated in (4) will be far from the actual local coherences. As expected, the sample complexity therefore goes up both for β near 0 and for β = 1. We find that setting β ≈ 2/3 results in the lowest sample complexity. Surprisingly, even taking β = 0.9, as opposed to pure uniform sampling (β = 1), results in a significant decrease in the sample complexity (Figure 2(b)). That is, even budgeting just a small fraction of samples to be drawn from the estimated local coherences can significantly improve the success rate of low-rank matrix recovery, as long as the underlying matrix is not completely coherent. In applications like collaborative filtering, this implies (assuming moderate local coherences, α ≤ 0.5) that incentivizing just a small fraction of users to rate a few selected movies, according to the estimated local coherence distribution obtained from previous samples, has the potential to greatly improve the quality of the recovered matrix of preferences.

In Figure 3 we compare the performance of the two-phase algorithm for different values of the matrix dimension n, and for each n we observe a phase transition occurring at Θ(n log(n)) samples.

In Figure 4 we consider the scenario where the samples are noisy, and compare the performance of Algorithm 1 to uniform sampling and to the theoretically optimal local coherence sampling from Theorem 3.2. Specifically, we assume that the samples are generated from M + Z, where Z is a Gaussian noise matrix. We consider two values of the noise level σ := ‖Z‖_F / ‖M‖_F, namely σ = 0.1 and σ = 0.2. The figures plot the error in Frobenius norm ‖M − M̂‖_F (y-axis) versus the total number of samples m (x-axis). These plots demonstrate the robustness of the algorithm to noise, and once again show that sampling with estimated coherences can be as good as sampling with exact coherences for matrix recovery via nuclear norm minimization, for α ≤ 0.7.

5. Weighted Nuclear Norm Minimization

Theorem 3.2 suggests that the more a set of observed entries is aligned with the local coherences of a matrix, the better the performance of nuclear norm minimization will be.
Interestingly, Theorem 3.2 can also be used in the reverse direction: one may adjust the local coherences to align with a given set of observations. Here we demonstrate an application of this idea in quantifying the benefit of weighted nuclear norm minimization for non-uniform sampling.

Figure 1. Performance of Algorithm 1 for power-law matrices (x-axis: α; y-axis: samples/(n log(n)); curves: two-phase uniform/local-coherence, two-phase uniform/entrywise, two-phase uniform/squared-entrywise, local coherence sampling (oracle), and uniform sampling). We consider rank-5 matrices of the form M = DUV⊤D, where the entries of the matrices U and V are generated from a Gaussian distribution N(0, 1) and D is a diagonal matrix with D_ii = 1/i^α. Higher values of α correspond to higher coherence. The above simulations are run with two-phase parameter β = 2/3. Sampling from (3) gives the best results, with successful recovery using 10 n log(n) samples for all values of α, in accordance with Theorem 3.2. Surprisingly, sampling according to (4) with estimated local coherences has almost the same sample complexity for α ≤ 0.7. Sampling proportional to the entry and to the entry squared performs equally well for low values of α, but the sample complexity of these strategies increases quickly for α > 0.6.

In many applications of matrix completion, the revealed entries are given to us and are distributed non-uniformly among the rows and columns. As observed by (Salakhutdinov & Srebro, 2010), standard unweighted nuclear norm minimization (1) is inefficient in this setting. They propose to instead use weighted nuclear norm minimization:

$$\hat{X} = \arg\min_X \ \|RXC\|_* \quad \text{s.t.} \quad X_{ij} = M_{ij} \ \text{for} \ (i,j) \in \Omega, \tag{5}$$

where R = diag(R1, R2, ..., R_{n1}) and C = diag(C1, ..., C_{n2}) are diagonal weight matrices with positive diagonal entries. We now provide a theoretical guarantee for this method and quantify its advantage over unweighted nuclear norm minimization. Suppose M satisfies the standard incoherence condition max_{i,j} {µ_i, ν_j} ≤ µ0, and let ⌊x⌋ denote the largest integer not exceeding x. Under this setting, we have the following (proved in the supplementary materials):

Theorem 5.1. Without loss of generality, assume R1 ≥ R2 ≥ ··· ≥ R_{n1} and C1 ≥ C2 ≥ ··· ≥ C_{n2}. There exist universal constants c0, c1, c2 such that M is the unique optimum to (5) with probability at least 1 − c1(n1 + n2)^{−c2}, provided p_ij ≥ 1/min{n1, n2}^{10} and

$$p_{ij} \ \ge \ c_0 \left( \frac{R_i^2}{\sum_{i'=1}^{\lfloor n_1/(\mu_0 r)\rfloor} R_{i'}^2} + \frac{C_j^2}{\sum_{j'=1}^{\lfloor n_2/(\mu_0 r)\rfloor} C_{j'}^2} \right) \log^2(n_1 + n_2). \tag{6}$$

We prove this theorem by drawing a connection between the weighted nuclear norm and the local coherence parameters (2). Define the scaled matrix M̄ := RMC, and observe that the program (5) is equivalent to first solving the following unweighted problem with scaled observations,

$$\bar{X} = \arg\min_X \ \|X\|_* \quad \text{s.t.} \quad X_{ij} = \bar{M}_{ij} \ \text{for} \ (i,j) \in \Omega, \tag{7}$$

and then setting X̂ = R^{−1} X̄ C^{−1}. In other words, through the weighted nuclear norm we convert the problem of completing M into that of completing M̄. Therefore, if we can choose the weights R and C such that the local coherence parameters of M̄, denoted {µ̄_i, ν̄_j}, are aligned with the non-uniform observations in a way that roughly satisfies condition (3), then we gain in sample complexity compared to the unweighted approach. We now quantify this more precisely for a particular class of matrix completion problems.

Comparison to unweighted nuclear norm. Suppose n1 = n2 = n and the observation probabilities have a product form, p_ij = p_i^r p_j^c. If we choose R_i = √(p_i^r) and C_j = √(p_j^c) (which is suggested by the condition (6)), Theorem 5.1 asserts that the following is sufficient for recovery of M:

$$p_j^c \sum_{i=1}^{\lfloor n/(\mu_0 r)\rfloor} p_i^r \ \gtrsim \ \log^2 n, \ \forall j; \qquad p_i^r \sum_{j=1}^{\lfloor n/(\mu_0 r)\rfloor} p_j^c \ \gtrsim \ \log^2 n, \ \forall i. \tag{8}$$
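To illustrate the reduction, here is a short sketch of the weight choice and the scaling argument above; `nuclear_norm_complete` is a hypothetical stand-in for any solver of program (1), such as the cvxpy sketch following Algorithm 1.

```python
import numpy as np

def marginal_weights(p_row, p_col):
    """Weights for product-form sampling p_ij = p_row[i] * p_col[j]:
    the square roots of the marginals, as suggested by Theorem 5.1."""
    return np.sqrt(p_row), np.sqrt(p_col)

# The weighted program (5) reduces to unweighted completion (7) of the
# scaled matrix Mbar = R M C, followed by unscaling:
#
# R, C = marginal_weights(p_row, p_col)
# Mbar_hat = nuclear_norm_complete(np.diag(R) @ M @ np.diag(C), Omega)
# M_hat = Mbar_hat / np.outer(R, C)   # equals R^{-1} Mbar_hat C^{-1}
```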
Figure 2. We consider power-law matrices with parameter α = 0.5 and α = 0.7. (a): This plot (x-axis: β; y-axis: samples/(n log(n))) shows that Algorithm 1 successfully recovers coherent low-rank matrices with the fewest samples (≈ 10 n log(n)) when the proportion of initial samples drawn from the uniform distribution is in the range β ∈ [0.5, 0.8]. In particular, the sample complexity is significantly lower than that for uniform sampling (β = 1). (b): Even by drawing 90% of the samples uniformly and using the estimated local coherences to sample the remaining 10% (two-phase sampling at α = 0.9 versus uniform sampling; x-axis: samples/(n log(n)); y-axis: fraction of successful recovery), one observes a marked improvement in the rate of recovery.

Figure 3. (a) & (b): Scaling of the sample complexity of Algorithm 1 with n (curves: n = 100, 200, 300, 400, 500; x-axis: samples/(n log(n)); y-axis: fraction of successful recovery). We consider power-law matrices with α = 0.5 in plot (a) and α = 0.7 in plot (b). The plots suggest that the sample complexity of Algorithm 1 scales roughly as Θ(n log(n)).

Figure 4. (a) & (b): Performance of Algorithm 1 with noisy samples (x-axis: samples/(n log(n)); y-axis: relative error). We consider power-law matrices (with α = 0.5 in plot (a) and α = 0.7 in plot (b)), perturbed by a Gaussian noise matrix Z with ‖Z‖_F/‖M‖_F = σ. The plots consider two noise levels, σ = 0.1 and σ = 0.2. We compare two-phase sampling (Algorithm 1) with β = 2/3, sampling from the exact local coherences, and uniform sampling. Algorithm 1 achieves error almost as low as local-coherence sampling without requiring any a priori knowledge of the low-rank matrix, while uniform sampling suffers dramatically.

We compare condition (8) to that required by unweighted nuclear norm minimization: by Theorem 3.2, the latter requires

$$p_i^r p_j^c \ \gtrsim \ \frac{\mu_0 r \log^2 n}{n}, \quad \forall i, j.$$

That is, the weighted approach succeeds under much less restrictive conditions. In particular, the unweighted approach imposes a condition on the least sampled row and column, whereas condition (8) shows that the weighted approach can use the heavily sampled rows/columns to assist the less sampled ones. This benefit is most significant precisely when the observations are very non-uniform.

The weighted nuclear norm approach was shown to be empirically successful in (Salakhutdinov & Srebro, 2010). There, the authors propose to weight the rows (resp. columns) by the square roots of the corresponding row (resp. column) marginals, which coincides with the R and C chosen according to our theory in the last paragraph. We remark that Theorem 5.1 is the first exact recovery guarantee for weighted nuclear norm minimization. It provides an explanation, complementary to those in (Salakhutdinov & Srebro, 2010; Foygel et al., 2011; Negahban & Wainwright, 2012), for why the weighted approach is advantageous over the unweighted approach for non-uniform observations.
It also serves as a testament to the power of Theorem 3.2 as a general result on the relationship between sampling and local coherence.

6. Proof Outline for Theorem 3.2

The proof proceeds by constructing a dual certificate Y that obeys certain sub-gradient optimality conditions and thereby certifies the optimality of M for (1). One of the major differences between our proof and existing ones lies in validating one of these optimality conditions, namely, that the spectral norm of P_{T⊥}Y is small. In previous work, this is done by bounding the spectral norm of a certain matrix Y' in terms of its entrywise ℓ∞ norm, which eventually leads to the standard incoherence conditions. Here, we instead derive a new bound using the weighted ℓ∞,2 norm of Y', which is the maximum of the weighted row and column norms of Y', with the weights depending on the local coherences µ_i and ν_j. We turn to the details below.

Define the projections P_T Z := UU⊤Z + ZVV⊤ − UU⊤ZVV⊤ and P_{T⊥} Z := Z − P_T Z, and let R_Ω Z be the matrix with (R_Ω Z)_ij = Z_ij / p_ij if (i, j) ∈ Ω and zero otherwise. As usual, ‖Z‖_F and ‖Z‖ are the Frobenius norm and spectral norm of the matrix Z, and ‖A‖_op is the operator norm of the mapping A. Using standard convex analysis, we show that M is the unique optimum to (1) if

1. ‖P_T R_Ω P_T − P_T‖_op ≤ 1/2, and
2. there exists some Y obeying
(a) Y_ij = 0 for all (i, j) ∉ Ω,
(b) ‖P_T Y − UV⊤‖_F ≤ 1/(4n⁵), and
(c) ‖P_{T⊥} Y‖ ≤ 1/2.

We proceed to show that condition 1 above holds with high probability (w.h.p.) assuming only the local bounds (3) on sampling and incoherence. We then construct Y using the Golfing Scheme (Gross, 2011), setting W_0 := 0,

$$W_k := W_{k-1} + R_{\Omega_k} P_T\left(UV^\top - P_T W_{k-1}\right), \quad k \in [k_0],$$

and Y = W_{k_0}, where the Ω_k are k_0 := 20 log n i.i.d. random index sets with P((i, j) ∈ Ω_k) = 1 − (1 − p_ij)^{1/k_0}, and R_{Ω_k} is defined analogously to R_Ω. This Y satisfies condition 2(a) above. Setting Δ_k := UV⊤ − P_T W_k, we verify

$$\|\Delta_k\|_F \ \le \ \|P_T - P_T R_{\Omega_k} P_T\|_{op} \, \|\Delta_{k-1}\|_F,$$

which implies condition 2(b) using condition 1.

It remains to validate condition 2(c), which is the most innovative part of our proof. We need the following definitions of the weighted ℓ∞,2 and ℓ∞ norms:

$$\|Z\|_{\mu(\infty,2)} := \max\left\{ \max_i \sqrt{\frac{n_1}{\mu_i r}}\, \|Z_{i\cdot}\|, \ \max_j \sqrt{\frac{n_2}{\nu_j r}}\, \|Z_{\cdot j}\| \right\}, \qquad \|Z\|_{\mu(\infty)} := \max_{i,j} \ |Z_{ij}| \sqrt{\frac{n_1 n_2}{\mu_i \nu_j r^2}}.$$

We show, crucially, that these norms satisfy the following concentration properties, which hold w.h.p. for a fixed Z:

$$\|(R_\Omega - I) Z\| \ \le \ \frac{c}{\sqrt{c_0}} \left( \|Z\|_{\mu(\infty)} + \|Z\|_{\mu(\infty,2)} \right),$$
$$\|(P_T R_\Omega P_T - P_T) Z\|_{\mu(\infty,2)} \ \le \ \frac{1}{2} \left( \|Z\|_{\mu(\infty)} + \|Z\|_{\mu(\infty,2)} \right),$$
$$\|(P_T R_\Omega P_T - P_T) Z\|_{\mu(\infty)} \ \le \ \frac{1}{2} \|Z\|_{\mu(\infty)}.$$

Using the first inequality above, we can obtain

$$\|P_{T^\perp}(Y)\| \ \le \ \frac{c}{\sqrt{c_0}} \sum_k \left( \|\Delta_{k-1}\|_{\mu(\infty)} + \|\Delta_{k-1}\|_{\mu(\infty,2)} \right).$$

We then apply the next two inequalities to show, for each k,

$$\|\Delta_k\|_{\mu(\infty)} \le (1/2)^k \|UV^\top\|_{\mu(\infty)}, \qquad \|\Delta_k\|_{\mu(\infty,2)} \le (1/2)^k \left( \|UV^\top\|_{\mu(\infty)} + \|UV^\top\|_{\mu(\infty,2)} \right).$$

The theorem follows from combining the last three displays and expressing ‖UV⊤‖_{µ(∞,2)} and ‖UV⊤‖_{µ(∞)} in terms of {µ_i, ν_j}.

Acknowledgements

We would like to thank Petros Drineas, Michael Mahoney and Aarti Singh for helpful discussions. R. Ward was supported by an NSF CAREER award, an AFOSR Young Investigator Program award, and ONR Grant N00014-12-1-0743.

References

Achlioptas, D. and McSherry, F. Fast computation of low-rank matrix approximations. J. ACM, 54(2):9, 2007.

Achlioptas, D., Karnin, Z., and Liberty, E. Matrix entry-wise sampling: Simple is best. http://cs-www.cs.yale.edu/homes/el327/papers/matrixSampling.pdf, 2013.

Arora, S., Hazan, E., and Kale, S. A fast random sampling algorithm for sparsifying matrices. In Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques, pp. 272-279. Springer, 2006.

Boutsidis, C., Mahoney, M., and Drineas, P. An improved approximation algorithm for the column subset selection problem. In SODA, pp. 968-977, 2009.
Burq, N., Dyatlov, S., Ward, R., and Zworski, M. Weighted eigenfunction estimates with applications to compressed sensing. SIAM J. Math. Anal., 44(5):3481-3501, 2012.

Cai, J., Candès, E., and Shen, Z. A singular value thresholding algorithm for matrix completion. SIAM J. Optimiz., 20(4):1956-1982, 2010.

Candès, E. and Plan, Y. Matrix completion with noise. Proceedings of the IEEE, 98(6):925-936, 2010.

Candès, E. and Recht, B. Exact matrix completion via convex optimization. Found. Comput. Math., 9, 2009.

Candès, E. and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053-2080, 2010.

Candès, E., Li, X., Ma, Y., and Wright, J. Robust principal component analysis? J. ACM, 58(3):11, 2011.

Chandrasekaran, V., Sanghavi, S., Parrilo, P., and Willsky, A. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572-596, 2011.

Chatterjee, S. and Hadi, A. Influential observations, high leverage points, and outliers in linear regression. Statistical Science, 1(3):379-393, 1986.

Chen, M. and Ganesh, A. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices, October 2009. URL http://perception.csl.illinois.edu/matrix-rank/sample_code.html.

Chen, Y. Incoherence-optimal matrix completion. arXiv preprint arXiv:1310.0154, 2013.

Chen, Y., Jalali, A., Sanghavi, S., and Caramanis, C. Low-rank matrix recovery from errors and erasures. IEEE Transactions on Information Theory, 59(7), 2013.

Drineas, P. and Zouzias, A. A note on element-wise matrix sparsification via a matrix-valued Bernstein inequality. Information Processing Letters, 111(8):385-389, 2011.

Drineas, P., Magdon-Ismail, M., Mahoney, M., and Woodruff, D. Fast approximation of matrix coherence and statistical leverage. JMLR, 13:3475-3506, 2012.

Foygel, R., Salakhutdinov, R., Shamir, O., and Srebro, N. Learning with the weighted trace-norm under arbitrary sampling distributions. arXiv preprint arXiv:1106.4251, 2011.

Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548-1566, 2011.

Jain, P., Netrapalli, P., and Sanghavi, S. Low-rank matrix completion using alternating minimization. arXiv preprint arXiv:1212.0467, 2012.

Keshavan, R. H., Montanari, A., and Oh, S. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980-2998, 2010.

Krahmer, F. and Ward, R. Beyond incoherence: Stable and robust sampling strategies for compressive imaging. arXiv preprint arXiv:1210.2380, 2012.

Krishnamurthy, A. and Singh, A. Low-rank matrix and tensor completion via adaptive sampling. arXiv preprint arXiv:1304.4672v2, 2013.

Mahoney, M. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2), 2011.

Negahban, S. and Wainwright, M. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. JMLR, 13:1665-1697, 2012.

Rauhut, H. and Ward, R. Sparse Legendre expansions via l1-minimization. Journal of Approximation Theory, 164(5):517-533, 2012.

Recht, B. A simpler approach to matrix completion. arXiv preprint arXiv:0910.0651, 2009.

Salakhutdinov, R. and Srebro, N. Collaborative filtering in a non-uniform world: Learning with the weighted trace norm. arXiv preprint arXiv:1002.2780, 2010.

Spielman, D. and Srivastava, N. Graph sparsification by effective resistances. SIAM J. Comp., 40(6):1913-1926, 2011.
Tropp, J. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., 12(4):389-434, 2012.

Yang, X. and Karniadakis, G. Reweighted l1 minimization method for stochastic elliptic differential equations. Journal of Computational Physics, 2013.