# Completing Any Low-rank Matrix, Provably

Journal of Machine Learning Research 16 (2015) 2999-3034. Submitted 6/14; Revised 4/15; Published 7/15.

Yudong Chen (yudong.chen@eecs.berkeley.edu), Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94704, USA

Srinadh Bhojanapalli (bsrinadh@utexas.edu) and Sujay Sanghavi (sanghavi@mail.utexas.edu), Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712, USA

Rachel Ward (rward@math.utexas.edu), Department of Mathematics and ICES, The University of Texas at Austin, Austin, TX 78712, USA

Editor: Tong Zhang

Abstract. Matrix completion, i.e., the exact and provable recovery of a low-rank matrix from a small subset of its elements, is currently only known to be possible if the matrix satisfies a restrictive structural constraint, known as incoherence, on its row and column spaces. In these cases, the subset of elements is assumed to be sampled uniformly at random. In this paper, we show that any rank-r n-by-n matrix can be exactly recovered from as few as O(nr log² n) randomly chosen elements, provided this random choice is made according to a specific biased distribution suitably dependent on the coherence structure of the matrix: the probability of any element being sampled should be at least a constant times the sum of the leverage scores of the corresponding row and column. Moreover, we prove that this specific form of sampling is nearly necessary, in a natural precise sense; this implies that many other, perhaps more intuitive, sampling schemes fail. We further establish three ways to use the above result for the setting when leverage scores are not known a priori. (a) We describe a provably correct sampling strategy for the case when only the column space is incoherent; no assumption on, or knowledge of, the row space is required.
(b) We propose a two-phase sampling procedure for general matrices that first samples to estimate leverage scores, followed by sampling for exact recovery. These two approaches assume control over the sampling procedure. (c) By using our main theorem in a reverse direction, we provide an analysis showing the advantages of the (empirically successful) weighted nuclear/trace-norm minimization approach over the vanilla un-weighted formulation, given non-uniformly distributed observed elements. This approach does not require controlled sampling or knowledge of the leverage scores.

Keywords: matrix completion, coherence, leverage score, nuclear norm, weighted nuclear norm

Partial preliminary results appeared at the International Conference on Machine Learning (ICML) 2014 under the title "Coherent Matrix Completion". ©2015 Yudong Chen, Srinadh Bhojanapalli, Sujay Sanghavi and Rachel Ward.

1. Introduction

Low-rank matrix completion has been the subject of much recent study due to its application in myriad tasks: collaborative filtering, dimensionality reduction, clustering, nonnegative matrix factorization and localization in sensor networks. Clearly, the problem is ill-posed in general; correspondingly, analytical work on the subject has focused on the joint development of algorithms, and of sufficient conditions under which such algorithms are able to recover the matrix.
While they differ in scaling/constant factors, all existing sufficient conditions (Candès and Recht, 2009; Candès and Tao, 2010; Recht, 2011; Keshavan et al., 2010; Gross, 2011; Jain et al., 2013; Negahban and Wainwright, 2012), with a couple of exceptions we describe in Section 2, require that (a) the subset of observed elements should be uniformly randomly chosen, independent of the values of the matrix elements, and (b) the low-rank matrix be incoherent or not spiky, i.e., its row and column spaces should be diffuse, having low inner products with the standard basis vectors. Under these conditions, the matrix has been shown to be provably recoverable via methods based on convex optimization (Candès and Recht, 2009), alternating minimization (Jain et al., 2013), iterative thresholding (Cai et al., 2010), etc., using as few as Θ(nr log n) observed elements for an n × n matrix of rank r. Actually, the incoherence assumption is required because of the uniform sampling: coherent matrices are those which have most of their mass in a relatively small number of elements. By sampling entries uniformly and independently at random, most of the mass of a coherent low-rank matrix will be missed; this could (and does) throw off most existing methods for exact matrix completion. One could imagine that if the sampling is adapted to the matrix, roughly in a way that ensures that elements with more mass are more likely to be observed, then it may be possible for existing methods to recover the full matrix. In this paper, we show that the incoherence requirement can be eliminated completely, provided the sampling distribution is dependent on the matrix to be recovered in the right way. Specifically, we have the following results.

1.
If the probability of an element being observed is proportional to the sum of the corresponding row and column leverage scores (which are local versions of the standard incoherence parameter) of the underlying matrix, then an arbitrary rank-r matrix can be exactly recovered from Θ(nr log² n) observed elements with high probability, using nuclear norm minimization (Theorem 2 and Corollary 3). In the case when all leverage scores are uniformly bounded from above, our results reduce to existing guarantees for incoherent matrices using uniform sampling. Our sample complexity bound Θ(nr log² n) is optimal up to a single factor of log² n, since the number of degrees of freedom of an n × n matrix of rank r is in general on the order of nr. Moreover, we show that to complete a coherent matrix, it is necessary (in a certain precise sense) to sample according to the leverage scores as above (Theorem 6).

2. For a matrix whose column space is incoherent and row space is arbitrarily coherent, our results immediately lead to a provably correct sampling scheme which requires no prior knowledge of the leverage scores of the underlying matrix and has near-optimal sample complexity (Corollary 4).

3. We provide numerical evidence that a two-phase adaptive sampling strategy, which assumes no prior knowledge about the leverage scores of the underlying matrix, can perform on par with the optimal sampling strategy in completing coherent matrices, and significantly outperforms uniform sampling (Section 4). Specifically, we consider a two-phase sampling strategy whereby, given a fixed budget of m samples, we first draw a fixed proportion of samples uniformly at random, and then draw the remaining samples according to the leverage scores of the resulting sampled matrix.

4.
As a corollary of our main theorem, we obtain the first exact recovery guarantee for the weighted nuclear norm minimization approach, which can be viewed as adjusting the leverage scores to align with the given sampling distribution. Our results provide a strategy for choosing the weights when non-uniformly distributed samples are given, so as to order-wise reduce the sample complexity of the weighted approach to that of the standard unweighted formulation (Theorem 7). Our theorem quantifies the benefit of the weighted approach, thus providing theoretical justification for its good empirical performance observed in Srebro and Salakhutdinov (2010); Foygel et al. (2011); Negahban and Wainwright (2012).

These results provide a deeper and more general theoretical understanding of the relation between the sampling procedure and the matrix coherence/leverage-score structure, and of how they affect recovery performance. While in practice one may not have complete control over the sampling procedure, or exact knowledge of the matrix leverage scores, partial control and knowledge are often possible, and we believe our theory provides useful approximations and insights. We expect that the ideas and results in this paper will serve as the foundation for developing algorithms for more general settings and applications. Our theoretical results are achieved by a new analysis based on concentration bounds involving the weighted ℓ∞,2 matrix norm, defined as the maximum of the appropriately weighted row and column norms of the matrix. This differs from previous approaches that use ℓ∞ or unweighted ℓ∞,2 norm bounds (Gross, 2011; Recht, 2011; Chen, 2015).
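To make the definition concrete, the weighted ℓ∞,2 norm just described can be computed in a few lines. The weight vectors below are generic placeholders of our choosing; the precise weights used in the proofs are specified in the appendix.

```python
import numpy as np

def weighted_linf2_norm(Z, w_row, w_col):
    """Weighted l_{inf,2} norm: the maximum over all (weighted) row and
    column Euclidean norms of Z. w_row and w_col are placeholder weight
    vectors, one entry per row and per column respectively."""
    row_norms = w_row * np.linalg.norm(Z, axis=1)  # one weighted norm per row
    col_norms = w_col * np.linalg.norm(Z, axis=0)  # one weighted norm per column
    return max(row_norms.max(), col_norms.max())
```

With all weights equal to one this reduces to the unweighted ℓ∞,2 norm used in earlier analyses.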
In some sense, using weighted ℓ∞,2-type bounds is natural for the analysis of low-rank matrix recovery/approximation when the observations are entries of rows/columns of the matrix, because the rank is a property of the rows and columns of the matrix rather than of its individual elements, and the weighted norm captures the relative importance of the rows/columns. Therefore, our techniques based on the ℓ∞,2 norm might be of independent interest beyond the specific settings and algorithms considered here.

1.1 Organization

In Section 2 we briefly survey the relevant literature. We present our main results for coherent matrix completion in Section 3. In Section 4 we propose a two-phase algorithm that requires no prior knowledge about the underlying matrix's leverage scores. In Section 5 we provide guarantees for weighted nuclear norm minimization. The paper concludes with a discussion of future work in Section 6. We provide the proofs of the main theorems in the appendix.

2. Related Work

There is now a vast body of literature on matrix completion, and an even bigger body of literature on matrix approximations; we restrict our literature review here to the papers that are most directly related.

Completion of incoherent and row-coherent matrices: The first algorithm and theoretical guarantees for exact low-rank matrix completion appeared in Candès and Recht (2009); there it was shown that nuclear norm minimization works when the low-rank matrix is incoherent and the sampling is uniformly random and independent of the matrix. Subsequent works have refined provable completion results for incoherent matrices under the uniform random sampling model, both via nuclear norm minimization (Candès and Tao, 2010; Recht, 2011; Gross, 2011; Chen, 2015) and via other methods like SVD followed by local descent (Keshavan et al., 2010) and alternating minimization (Jain et al., 2013), etc.
The setting with sparse errors and additive noise has also been considered (Candès and Plan, 2010; Chandrasekaran et al., 2011; Chen et al., 2013; Candès et al., 2011; Negahban and Wainwright, 2012). The recent work of Krishnamurthy and Singh (2013) considers matrix completion when the row space is allowed to be coherent but the column space is still required to be incoherent with parameter µ0. Their proposed adaptive sampling algorithm selects columns to observe in their entirety and requires a total of O(µ0 r^(3/2) n log(2r/δ)) observed elements for a success probability of 1 − δ, which is superlinear in r. A corollary of our results guarantees a sample complexity that is linear in r in this row-coherent setting. The sample complexity was recently improved to O(µ0 r n log²(r²/δ)) in Krishnamurthy and Singh (2014).

Matrix approximations via sub-sampling: Weighted sampling methods have been widely considered in the related context of matrix sparsification, where one aims to approximate a given large dense matrix with a sparse matrix. The strategy of element-wise matrix sparsification was introduced in Achlioptas and McSherry (2007). They propose and provide bounds for the ℓ2 element-wise sampling model, where elements of the matrix are sampled with probability proportional to their squared magnitude. These bounds were later refined in Drineas and Zouzias (2011). Alternatively, Arora et al. (2006) propose the ℓ1 element-wise sampling model, where elements are sampled with probabilities proportional to their magnitude. This model was further investigated in Achlioptas et al. (2013) and argued to be almost always preferable to ℓ2 sampling. Closely related to the matrix sparsification problem is the matrix column selection problem, where one aims to find the best k-column subset of a matrix to use as an approximation.
State-of-the-art algorithms for column subset selection (Boutsidis et al., 2009; Mahoney, 2011) involve randomized sampling strategies whereby columns are selected proportionally to their statistical leverage scores, the squared Euclidean norms of the projections of the canonical unit vectors onto the column subspace. The statistical leverage scores of a matrix can be approximated efficiently, faster than the time needed to compute an SVD (Drineas et al., 2012). Statistical leverage scores are also used extensively in statistical regression analysis for outlier detection (Chatterjee and Hadi, 1986). More recently, statistical leverage scores were used in the context of graph sparsification under the name of graph resistance (Spielman and Srivastava, 2011). The sampling distribution we use for the matrix completion guarantees of this paper is element-wise and based on statistical leverage scores. As shown both theoretically (Theorem 6) and empirically (Section 4.1), such sampling outperforms both ℓ1 and ℓ2 element-wise sampling, at least in the context of matrix completion.

Weighted sampling in compressed sensing: This paper is similar in spirit to recent work in compressed sensing which shows that sparse recovery guarantees traditionally requiring mutual incoherence can be extended to systems which are only weakly incoherent, without any loss of approximation power, provided measurements from the sensing basis are subsampled according to their coherence with the sparsity basis. This notion of local coherence sampling seems to have originated in Rauhut and Ward (2012) in the context of sparse orthogonal polynomial expansions, and has found applications in uncertainty quantification (Yang and Karniadakis, 2013), interpolation with spherical harmonics (Burq et al., 2012), and MRI compressive imaging (Krahmer and Ward, 2014).

3.
Main Results

The results in this paper hold for what is arguably the most popular approach to matrix completion: nuclear norm minimization. If the true matrix is M, with its (i, j)-th element denoted by Mij, and the set of observed elements is Ω, this method estimates M via the optimum of the convex program

    min_X ‖X‖_*  s.t.  Xij = Mij for (i, j) ∈ Ω,    (1)

where the nuclear norm ‖X‖_* of a matrix is the sum of its singular values. (The nuclear norm becomes the trace norm for positive-definite matrices, and is now well-recognized to be a convex surrogate for the rank function (Fazel, 2002).) We focus on the setting where matrix elements are revealed according to an underlying probability distribution. To introduce the distribution of interest, we first need a definition.

Definition 1 (Leverage Scores) For an n1 × n2 real-valued matrix M of rank r whose rank-r SVD is given by UΣVᵀ, its (normalized) leverage scores, µi(M) for any row i and νj(M) for any column j, are defined as

    µi(M) := (n1/r) ‖Uᵀ ei‖²,  i = 1, 2, . . . , n1,
    νj(M) := (n2/r) ‖Vᵀ ej‖²,  j = 1, 2, . . . , n2,    (2)

where ei denotes the i-th standard basis element with appropriate dimension. (In the matrix sparsification literature (Drineas et al., 2012; Boutsidis et al., 2009) and beyond, the leverage scores of M often refer to the un-normalized quantities ‖Uᵀ ei‖² and ‖Vᵀ ej‖².)

Note that the leverage scores are non-negative, and are functions of the column and row spaces of the matrix M. Since U and V have orthonormal columns, we always have the relationship Σ_i µi(M) r/n1 = Σ_j νj(M) r/n2 = r. The standard incoherence parameter µ0 of M used in the previous literature corresponds to a global upper bound on the leverage scores: µ0 ≥ max_{i,j} {µi(M), νj(M)}. Therefore, the leverage scores can be considered as localized versions of the standard incoherence parameter. We are now ready to state our main result.
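As a quick computational aside, Definition 1 can be read directly off a rank-r SVD; the sketch below (plain NumPy, with function names of our choosing) also makes the normalization identity Σ_i µi(M) r/n1 = r easy to verify.

```python
import numpy as np

def leverage_scores(M, r):
    """Normalized leverage scores of Definition 1 from the rank-r SVD of M:
    mu_i = (n1/r) * ||U^T e_i||^2 and nu_j = (n2/r) * ||V^T e_j||^2."""
    n1, n2 = M.shape
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    U, Vt = U[:, :r], Vt[:r, :]            # keep the top-r singular vectors
    mu = (n1 / r) * np.sum(U**2, axis=1)   # row leverage scores
    nu = (n2 / r) * np.sum(Vt**2, axis=0)  # column leverage scores
    return mu, nu
```

Since U and V have orthonormal columns, the returned scores satisfy Σ_i mu[i] · r/n1 = Σ_j nu[j] · r/n2 = r.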
Theorem 2 Let M = (Mij) be an n1 × n2 matrix of rank r, and suppose that its elements Mij are observed only over a subset of elements Ω ⊆ [n1] × [n2]. There is a universal constant c0 > 0 such that, if each element (i, j) is independently observed with probability pij, and pij satisfies

    pij ≥ min{ c0 (µi(M) + νj(M)) r log²(n1 + n2) / min{n1, n2}, 1 },    (3)
    pij ≥ 1 / min{n1, n2}^10,

then M is the unique optimal solution to the nuclear norm minimization problem (1) with probability at least 1 − 5(n1 + n2)^(-10).

We will refer to the sampling strategy (3) as leveraged sampling. Note that the expected number of observed elements is Σ_{i,j} pij, and this satisfies

    Σ_{i,j} pij ≤ max{ (c0 r log²(n1 + n2) / min{n1, n2}) Σ_{i,j} (µi(M) + νj(M)), Σ_{i,j} 1/min{n1, n2}^10 } = 2 c0 max{n1, n2} r log²(n1 + n2),

which is independent of the leverage scores, or indeed any other property of the matrix. Hoeffding's inequality implies that the actual number of observed elements sharply concentrates around its expectation, leading to the following corollary:

Corollary 3 Let M = (Mij) be an n1 × n2 matrix of rank r. Draw a subset Ω of its elements by leveraged sampling according to the procedure described in Theorem 2. There is a universal constant c0 > 0 such that the following holds with probability at least 1 − 10(n1 + n2)^(-10): the number of revealed elements is bounded by |Ω| ≤ 3 c0 max{n1, n2} r log²(n1 + n2), and M is the unique optimal solution to the nuclear norm minimization program (1).

We now provide comments and discussion.

(A) Roughly speaking, the condition given in (3) ensures that elements in important rows/columns (indicated by large leverage scores µi and νj) of the matrix are observed more often. Note that Theorem 2 only stipulates that an inequality relation hold between pij and {µi(M), νj(M)}. This allows for some discrepancy between the sampling distribution and the leverage scores.
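A direct way to instantiate condition (3) is to set each pij to the smallest value the theorem allows. In the sketch below, c0 = 1.0 is merely a placeholder for the unspecified universal constant.

```python
import numpy as np

def leveraged_sampling_probs(mu, nu, r, c0=1.0):
    """Element-wise observation probabilities following condition (3).
    c0 stands in for the (unspecified) universal constant."""
    n1, n2 = len(mu), len(nu)
    n_min = min(n1, n2)
    scale = c0 * r * np.log(n1 + n2) ** 2 / n_min
    p = np.minimum(scale * np.add.outer(mu, nu), 1.0)  # first bound in (3)
    return np.maximum(p, n_min ** -10.0)               # floor: p_ij >= 1/min{n1,n2}^10

def sample_mask(p, rng):
    """Observe each (i, j) independently with probability p[i, j]."""
    return rng.random(p.shape) < p
```

Summing the resulting probabilities recovers the expected sample count of order max{n1, n2} r log²(n1 + n2) computed above.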
It also has the natural interpretation that the more the sampling distribution {pij} is aligned with the leverage score pattern of the matrix, the fewer observations are needed.

(B) Sampling based on leverage scores provides close to the optimal number of sampled elements required for exact recovery (under any sampling distribution). In particular, recall that the number of degrees of freedom of an n × n matrix of rank r is 2nr(1 − r/(2n)), and knowing the leverage scores of the matrix reduces the degrees of freedom by 2n in the worst case. Hence, regardless of how the elements are sampled, a minimum of Θ(nr) elements is required to recover the matrix. Theorem 2 matches this lower bound up to an additional O(log²(n)) factor.

(C) Our work improves on existing results even in the case of uniform sampling and uniform incoherence. Recall that the original work of Candès and Recht (2009), and subsequent works (Candès and Tao, 2010; Recht, 2011; Gross, 2011), give recovery guarantees based on two parameters of the matrix M ∈ R^{n×n} (assuming its SVD is UΣVᵀ): (a) the (above-defined) incoherence parameter µ0, which is a uniform bound on the leverage scores, and (b) a joint incoherence parameter µstr defined by ‖UVᵀ‖_∞ = √(µstr r)/n. With these definitions, the current state of the art states that if the sampling probability is uniform and satisfies pij ≡ p ≥ c max{µ0, µstr} r log² n / n, where c is a constant, then M will be the unique optimum of (1) with high probability. A direct corollary of our work improves on this result by removing the need for extra constraints on the joint incoherence; in particular, it is easy to see that our theorem implies that a uniform sampling probability of p ≥ c µ0 r log² n / n (that is, with no µstr) guarantees recovery of M with high probability. Note that µstr can be as high as µ0 r, for example, in the case when M is positive semi-definite; our corollary thus removes this sub-optimal dependence on the rank and on the incoherence parameter.
This improvement was recently observed in Chen (2015).

3.1 Knowledge-Free Completion for Row-Coherent Matrices

Theorem 2 immediately yields a useful result in scenarios where only the row space of a matrix is coherent and one has control over the sampling of the matrix. This setting is considered by Krishnamurthy and Singh (2013). Suppose the column space of M ∈ R^{n×n} is incoherent with max_i µi(M) ≤ µ0, and the row space is arbitrary (we consider a square matrix for simplicity). For a number 0 < δ < 1 to be prescribed by the user, we choose each row of M independently with probability 10 µ0 r log(2n/δ)/n, and observe all the elements of the chosen rows. We then compute the leverage scores {ν̃j} of the space spanned by these rows, and use them as estimates for νj(M), the leverage scores of M. Based on these estimates, we can perform leveraged sampling according to (3) and then use nuclear norm minimization to recover M. Note that this procedure does not require any prior knowledge about the leverage scores of M. The following corollary shows that the procedure is provably correct and exactly recovers M with high probability, using a near-optimal number of samples.

Corollary 4 For any number 0 < δ < 1 and some universal constants c0, c1 > 0, the following holds. With probability at least 1 − δ, the above procedure computes the column leverage scores of M exactly, i.e., ν̃j = νj(M) for all j ∈ [n]. If we set δ = 4n^(-10), and further sample a set Ω of elements of M with probabilities

    pij = min{ c0 (µ0 + ν̃j) r log² n / n, 1 } for all i, j,

then with probability at least 1 − 10 n^(-10), M is the unique optimal solution to the nuclear norm minimization program (1), and we use a total of at most c1 µ0 r n log² n samples.

The algorithm proposed in Krishnamurthy and Singh (2013) requires a sample complexity of O(µ0 r^(3/2) n log(2r/δ)) (and guarantees a success probability of 1 − δ). Our result in the corollary above removes the sub-optimal r^(3/2) factor in the sample complexity.
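The row-sampling step of this procedure is easy to sketch: observe a few full rows, and read the column leverage scores off the row space they span. The sampling rate q and the helper name below are our own placeholders, standing in for the row-sampling probability prescribed above.

```python
import numpy as np

def estimate_column_leverage(M, r, q, rng):
    """Sketch of the Section 3.1 procedure: keep each row of M with
    probability q (a stand-in for the 10*mu0*r*log(2n/delta)/n rate in the
    text), observe those rows in their entirety, and compute column leverage
    scores from the subspace the observed rows span."""
    n = M.shape[0]
    rows = rng.random(n) < q                 # rows observed in full
    R = M[rows, :]
    _, _, Vt = np.linalg.svd(R, full_matrices=False)
    Vr = Vt[:r, :]                           # basis of the estimated row space
    return (n / r) * np.sum(Vr**2, axis=0)   # estimates nu_tilde_j
```

When the observed rows span the true row space, these estimates are exact, which is what Corollary 4 asserts holds with probability 1 − δ.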
Very recently, Krishnamurthy and Singh (2014) provide a new sample complexity bound of O(µ0 r n log²(r²/δ)) using the same algorithm as their previous paper. We note that our sampling strategy is different from theirs: we sample entire rows of M, whereas they sample entire columns.

3.2 Necessity of Leveraged Sampling

In this subsection, we show that the leveraged sampling (3) is necessary for completing a coherent matrix, in a certain precise sense. For simplicity, we restrict ourselves to square matrices in R^{n×n}. Suppose each element (i, j) is observed independently with probability pij. We consider a family of sampling probabilities {pij} with the following property.

Definition 5 (Location Invariance) {pij} is said to be location-invariant with respect to the matrix M if the following are satisfied: (1) for any two rows i ≠ i′ that are identical, i.e., Mij = Mi′j for all j, we have pij = pi′j for all j; (2) for any two columns j ≠ j′ that are identical, i.e., Mij = Mij′ for all i, we have pij = pij′ for all i.

In other words, {pij} is location-invariant with respect to M if identical rows (or columns) of M have identical sampling probabilities. We consider this assumption very mild, and it covers leveraged sampling as well as many other typical sampling schemes, including: uniform sampling, where pij ≡ p; element-wise magnitude sampling, where pij ∝ |Mij| (ℓ1 sampling) or pij ∝ M²ij (ℓ2 sampling); and row/column-wise magnitude sampling, where pij = f(‖Mi·‖₂, ‖M·j‖₂) for some (usually coordinate-wise non-decreasing) function f : R²₊ → [0, 1]. Given two n-dimensional vectors µ̄ = (µ̄1, . . . , µ̄n) and ν̄ = (ν̄1, . . . , ν̄n), we use Mr(µ̄, ν̄) to denote the set of rank-r matrices whose leverage scores are bounded by µ̄ and ν̄; that is,

    Mr(µ̄, ν̄) := { M ∈ R^{n×n} : rank(M) = r; µi(M) ≤ µ̄i, νj(M) ≤ ν̄j, for all i, j }.

We have the following results.

Theorem 6 Suppose n ≥ r ≥ 2. Given any 2r numbers a1, . . . , ar and b1, . . .
, br with r/4 ≤ Σ_{k=1}^r 1/ak, Σ_{k=1}^r 1/bk ≤ r and 2/r ≤ ak, bk ≤ 2n/r for all k ∈ [r], there exist two n-dimensional vectors µ̄ and ν̄, and the corresponding set Mr(µ̄, ν̄), with the following properties:

1. For each i, j ∈ [n], µ̄i = ak and ν̄j = bk′ for some k, k′ ∈ [r]. That is, the values of the leverage scores are given by {ak} and {bk′}.

2. There exists a matrix M(0) ∈ Mr(µ̄, ν̄) for which the following holds. If {pij} is location-invariant w.r.t. M(0), and for some (i0, j0)

    pi0j0 ≤ ((µ̄i0 + ν̄j0) r / (4n)) log( 2n / ((µ̄i0 ∨ ν̄j0) r) ),    (4)

where a ∨ b := max{a, b}, then with probability at least 1/4 the following conclusion holds: there are infinitely many matrices M(1) ≠ M(0) in Mr(µ̄, ν̄) such that {pij} is location-invariant w.r.t. M(1), and M(0)ij = M(1)ij for all (i, j) ∈ Ω.

3. If we replace the condition (4) with

    pi0j0 ≤ (µ̄i0 + ν̄j0) r log(n) / (4n),    (5)

then the conclusion above holds with probability at least 1/n.

In other words, if (4) holds, then with probability at least 1/4 no method can distinguish between M(0) and M(1); similarly, if (5) holds, then with probability at least 1/n no method succeeds. We shall compare these results with Theorem 2, which guarantees that if we use leveraged sampling,

    pij ≥ c0 (µ̄i + ν̄j) r log² n / n for all i, j,

for some universal constant c0, then for any matrix M(0) in Mr(µ̄, ν̄), the nuclear norm minimization approach (1) recovers M(0) from its observed elements with failure probability no more than 1/n. Therefore, under the setting of Theorem 6, leveraged sampling is sufficient and necessary for matrix completion up to one logarithmic factor for a target failure probability of 1/n (or up to two logarithmic factors for a target failure probability of 1/4). Admittedly, the setting covered by Theorem 6 places several restrictions on the sampling distributions and the values of the leverage scores. Nevertheless, we believe this result captures some essential difficulties in recovering general coherent matrices, and highlights how the sampling probabilities should relate in a specific way to the leverage-score structure of the underlying object.

4.
A Two-Phase Sampling Procedure

We have seen that one can exactly recover an arbitrary n × n rank-r matrix using Θ(nr log² n) elements if sampled in accordance with the leverage scores. In practical applications of matrix completion, even when the user is free to choose how to sample the matrix elements, she may not be privy to the leverage scores {µi(M), νj(M)}. In this section we propose a two-phase sampling procedure, described below and in Algorithm 1, which assumes no a priori knowledge about the matrix leverage scores, yet is observed to be competitive with the oracle leveraged sampling distribution (3).

Suppose we are given a total budget of m samples. The first step of the algorithm is to use the first β fraction of the budget to estimate the leverage scores of the underlying matrix, where β ∈ [0, 1]. Specifically, take a set of indices Ω sampled uniformly without replacement such that |Ω| = βm, and let PΩ(·) be the sampling operator which maps the matrix elements not in Ω to 0. Take the rank-r SVD of PΩ(M), ŨΣ̃Ṽᵀ, where Ũ, Ṽ ∈ R^{n×r} and Σ̃ ∈ R^{r×r}, and then use the leverage scores µ̂i := µi(ŨΣ̃Ṽᵀ) and ν̂j := νj(ŨΣ̃Ṽᵀ) as estimates for the row and column leverage scores of M. Now, as the second step, generate the remaining (1 − β)m samples of the matrix M by sampling without replacement with distribution

    pij ∝ min{ (µ̂i + ν̂j) r log²(2n) / n, 1 }.    (6)

Let Ω′ denote the new set of samples. Using the combined set of samples P_{Ω∪Ω′}(M) as constraints, run the nuclear norm minimization program (1), and let M̂ be its optimum.

This approach of adjusting the sampling distribution based on leverage scores is relevant whenever we have some freedom in choosing the observed entries. For example, many recommendation systems do actively solicit users' opinions on some items chosen by the system, e.g., by asking them to fill out a survey or to choose from a list of items.
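The two-phase procedure just described can be sketched end to end as follows. Here `solve` stands in for any nuclear norm minimization solver (the experiments of Section 4.1 use an ALM-based one), and the phase-2 distribution follows (6) only up to its normalizing constant.

```python
import numpy as np

def two_phase_complete(M, r, m, beta, rng, solve):
    """Sketch of Algorithm 1. `solve` is a placeholder for a nuclear norm
    minimization solver taking (mask, observed matrix)."""
    n1, n2 = M.shape
    flat = np.zeros(n1 * n2, dtype=bool)
    # Phase 1: beta*m entries sampled uniformly without replacement.
    m1 = int(beta * m)
    flat[rng.choice(n1 * n2, size=m1, replace=False)] = True
    mask = flat.reshape(n1, n2)
    # Rank-r SVD of the zero-filled observations gives estimated leverage scores.
    U, _, Vt = np.linalg.svd(np.where(mask, M, 0.0), full_matrices=False)
    mu_hat = (n1 / r) * np.sum(U[:, :r] ** 2, axis=1)
    nu_hat = (n2 / r) * np.sum(Vt[:r, :] ** 2, axis=0)
    # Phase 2: the remaining (1-beta)*m samples, drawn without replacement
    # with weight proportional to mu_hat_i + nu_hat_j over unseen entries.
    w = np.add.outer(mu_hat, nu_hat).ravel()
    w[flat] = 0.0  # never resample an already observed entry
    flat[rng.choice(n1 * n2, size=m - m1, replace=False, p=w / w.sum())] = True
    return solve(mask, np.where(mask, M, 0.0))
```

Passing a trivial `solve` (e.g. one that just counts observations) is a convenient way to unit-test the sampling logic before plugging in a real solver.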
While our assumptions are not strictly satisfied in practice, they are useful approximations and provide guidance for designing/analyzing practical systems. For example, in many systems there exist popular items that are viewed/rated by a large number of users, and heavy users that view/rate a large number of items. Our row-wise sampling procedure discussed in Section 3.1 can be viewed as an approximation of such settings.

Algorithm 1 Two-phase sampling for coherent matrix completion
Input: rank parameter r, sample budget m, and parameter β ∈ [0, 1].
Step 1: Obtain the initial set Ω by sampling uniformly without replacement such that |Ω| = βm. Compute the best rank-r approximation ŨΣ̃Ṽᵀ to PΩ(M), and its leverage scores {µ̂i} and {ν̂j}.
Step 2: Generate the set Ω′ of (1 − β)m new samples by sampling without replacement with distribution (6). Set M̂ = arg min_X ‖X‖_* s.t. P_{Ω∪Ω′}(X) = P_{Ω∪Ω′}(M).
Output: completed matrix M̂.

To understand the performance of the two-phase algorithm, assume that the initial set of m1 = βm samples PΩ(M) is generated uniformly at random. (Note that sampling without replacement has a smaller failure probability than the equivalent binomial sampling with replacement (Recht, 2011).) If the underlying matrix

Figure 1: Performance of Algorithm 1 for power-law matrices (y-axis: samples/(n log(n)); curves: two-phase uniform/leverage scores, two-phase uniform/element-wise, two-phase uniform/squared element-wise, leveraged sampling (oracle), and uniform sampling). We consider rank-5 matrices of the form M = DUVᵀD, where the elements of the matrices U and V are generated independently from a Gaussian distribution N(0, 1) and D is a diagonal matrix with Dii = 1/i^α. Higher values of α correspond to more non-uniform leverage scores and less incoherent matrices. The above simulations are run with two-phase parameter β = 2/3.
Leveraged sampling (3) gives the best results, with successful recovery using roughly 10 n log(n) samples for all values of α, in accordance with Theorem 2. Surprisingly, sampling according to (6) with estimated leverage scores has almost the same sample complexity for α ≤ 0.7. Uniform sampling and sampling proportional to the element and to the element squared perform well for low values of α, but their performance degrades quickly for α > 0.6.

M is incoherent, then the algorithm will already recover M if m1 = Θ(nr log²(2n)). On the other hand, if M is highly coherent, having almost all of its energy concentrated on just a few elements, then the leverage scores (6) estimated from uniform sampling in the first step will be poor, and hence the recovery algorithm suffers. Between these two extremes, there is reason to believe that the two-phase sampling procedure will provide a better estimate of the underlying matrix than if all m elements were sampled uniformly. Indeed, numerical experiments suggest that the two-phase procedure can significantly outperform uniform sampling for completing coherent matrices.

4.1 Numerical Experiments

We now study the performance of the two-phase sampling procedure outlined in Algorithm 1 through numerical experiments. For this, we consider rank-5 matrices of size 500 × 500 of the form M = DUVᵀD, where the elements of the matrices U and V are i.i.d. Gaussian N(0, 1) and D is a diagonal matrix with power-law decay, Dii = i^(-α), 1 ≤ i ≤ 500. We refer to such constructions as power-law matrices. The parameter α adjusts the leverage scores (and hence the coherence level) of M, with α = 0 giving maximal incoherence (µ0 = Θ(1)) and α = 1 corresponding to maximal coherence (µ0 = Θ(n)).
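The power-law test matrices are straightforward to generate; this sketch follows the construction M = DUVᵀD verbatim, with the function name being ours.

```python
import numpy as np

def power_law_matrix(n, r, alpha, rng):
    """Power-law test matrix of Section 4.1: M = D U V^T D, with U, V having
    i.i.d. N(0, 1) entries and D = diag(1, 2^-alpha, ..., n^-alpha).
    alpha = 0 gives an incoherent matrix; larger alpha increases coherence."""
    U = rng.standard_normal((n, r))
    V = rng.standard_normal((n, r))
    d = np.arange(1, n + 1) ** (-float(alpha))
    return (d[:, None] * U) @ (V.T * d[None, :])   # D U V^T D
```

Feeding these matrices to `leverage_scores`-style code makes the increasing non-uniformity of the scores with α easy to see directly.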
Figure 1 plots the number of samples required for successful recovery (y-axis) for different values of α (x-axis), with β = 2/3, using Algorithm 1 with the initial samples Ω taken uniformly at random. Successful recovery is defined as at least 95% of trials having relative error in the Frobenius norm, ‖M − M̂‖_F / ‖M‖_F, not exceeding 0.01. To put the results in perspective, we plot them in Figure 1 against the performance of pure uniform sampling, as well as of other popular sampling distributions from the matrix sparsification literature (Achlioptas and McSherry, 2007; Achlioptas et al., 2013; Arora et al., 2006; Drineas and Zouzias, 2011): namely, in Step 2 of the algorithm, sampling proportional to the element (pij ∝ |M̃ij|) and sampling proportional to the element squared (pij ∝ M̃²ij), as opposed to sampling from the distribution (6). In all cases, the estimated matrix M̃ is constructed from the rank-r SVD of PΩ(M), M̃ = ŨΣ̃Ṽᵀ. The performance of nuclear norm minimization using samples generated according to the oracle distribution (3) serves as the baseline for the best possible recovery, as theoretically justified by Theorem 2.

Figure 2: We consider power-law matrices with parameters α = 0.5 and α = 0.7. (a) This panel (y-axis: samples/(n log(n)); x-axis: β) shows that Algorithm 1 successfully recovers coherent low-rank matrices with the fewest samples (≈ 10 n log(n)) when the proportion of initial samples drawn from the uniform distribution is in the range β ∈ [0.5, 0.8]. In particular, the sample complexity is significantly lower than that for uniform sampling (β = 1). Note that the x-axis starts at 0.1. (b) This panel (y-axis: fraction of successful recovery; x-axis: samples/(n log(n)); curves: two-phase sampling with β = 0.9, and uniform sampling) shows that even by drawing 90% of the samples uniformly and using the estimated leverage scores to sample the remaining 10%, one observes a marked improvement in the rate of recovery.
We use the Augmented Lagrangian Method (ALM) based solver of Lin et al. (2009) to solve the convex program (1). Figure 1 suggests that, in the regime of mild to moderate coherence, the two-phase algorithm performs comparably to the theoretically optimal leverage-score-based distribution (3), despite not having access to the underlying leverage scores. While the element-wise sampling strategies perform comparably for low values of $\alpha$, their number of samples for successful recovery increases quickly for $\alpha > 0.6$. Completion from purely uniformly sampled elements requires significantly more samples at higher values of $\alpha$.

Choosing $\beta$: Recall that the parameter $\beta$ in Algorithm 1 is the fraction of uniform samples used to estimate the leverage scores. Figure 2(a) plots the number of samples required for successful recovery (y-axis) as $\beta$ (x-axis) varies from 0 to 1, for different values of $\alpha$. Setting $\beta = 1$ reduces to purely uniform sampling, while for small values of $\beta$ the leverage scores estimated in (6) will be far from the actual leverage scores. As expected, the sample complexity therefore goes up both for $\beta$ near 0 and for $\beta = 1$. We find the algorithm performs well for a wide range of $\beta$, with $\beta \approx 2/3$ resulting in the lowest sample complexity. Surprisingly, even taking $\beta = 0.9$, as opposed to pure uniform sampling ($\beta = 1$), results in a significant decrease in the sample complexity; see Figure 2(b) for more details.

Figure 3: Scaling of the sample complexity of Algorithm 1 with $n$ (fraction of successful recovery versus samples/$n\log(n)$, for $n = 100, 200, 300, 400, 500$). We consider power-law matrices with $\alpha = 0.5$ in plot (a) and $\alpha = 0.7$ in plot (b), and set $\beta = 2/3$ for this set of simulations. The plots suggest that the sample complexity of Algorithm 1 scales roughly as $\Theta(n\log(n))$.
That is, budgeting even a small fraction of samples to be drawn from the estimated leverage scores can significantly improve the success rate of low-rank matrix recovery, as long as the underlying matrix is not completely coherent. In applications like collaborative filtering, this implies that incentivizing just a small fraction of users to rate a few selected movies, chosen according to the leverage score distribution estimated from previous samples, has the potential to greatly improve the quality of the recovered matrix of preferences. In Figure 3 we compare the performance of the two-phase algorithm for different values of the matrix dimension $n$, and observe for each $n$ a phase transition occurring at $\Theta(n\log(n))$ samples. In Figure 4 we consider the scenario where the samples are noisy and compare the performance of Algorithm 1 to uniform sampling and to the theoretically optimal leveraged sampling from Theorem 2. Specifically, we assume that the samples are generated from $M + Z$, where $Z$ is a Gaussian noise matrix. We consider two values for the noise level $\sigma := \|Z\|_F / \|M\|_F$: $\sigma = 0.1$ and $\sigma = 0.2$. The figures plot the relative error in Frobenius norm (y-axis) versus the total number of samples $m$ (x-axis). These plots demonstrate the robustness of the algorithm to noise and once again show that, for $\alpha \le 0.7$, sampling with estimated leverage scores can be as good as sampling with exact leverage scores for matrix recovery via nuclear norm minimization. The empirical results in this section demonstrate the advantage of the two-phase algorithm over uniform sampling. It is an interesting future problem to provide a rigorous analysis of the sample complexity of the algorithm.
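The estimation step at the heart of the two-phase idea can be sketched as follows: take uniform samples, form the rank-$r$ SVD of the zero-filled, inverse-probability-scaled observations, and read off estimated leverage scores to bias the remaining budget. This is an illustrative version; the exact normalization of the paper's distribution (6) is not reproduced here, and `estimated_leverage_distribution` is a hypothetical helper name.

```python
import numpy as np

def estimated_leverage_distribution(M_obs, mask, r):
    """Estimate a (mu_i + nu_j)-proportional sampling distribution from
    uniformly observed entries (phase 1 of a two-phase scheme)."""
    n1, n2 = M_obs.shape
    p_hat = mask.mean()                                 # empirical sampling rate
    filled = np.where(mask, M_obs, 0.0) / p_hat          # unbiased zero-filled proxy
    U, _, Vt = np.linalg.svd(filled, full_matrices=False)
    mu = (n1 / r) * np.sum(U[:, :r] ** 2, axis=1)        # estimated row leverage
    nu = (n2 / r) * np.sum(Vt[:r, :] ** 2, axis=0)       # estimated column leverage
    P = mu[:, None] + nu[None, :]
    return P / P.sum()                                   # normalized distribution

rng = np.random.default_rng(1)
n, r = 200, 3
M = np.diag(np.arange(1, n + 1) ** -0.7) @ rng.standard_normal((n, r)) \
    @ rng.standard_normal((r, n))
mask = rng.random((n, n)) < 0.3                          # phase 1: uniform samples
P = estimated_leverage_distribution(M, mask, r)
# Phase 2 would draw the remaining sample budget from P, which puts more
# weight on the high-leverage (early) rows of this power-law matrix.
```

The zero-filled SVD is only a crude spectral estimate, but as the experiments above indicate, it captures enough of the leverage structure to guide the second sampling phase.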
We note that there is an $\Omega(n^2)$ lower bound on the sample complexity of algorithms using passive sampling when the underlying matrix is maximally coherent (Krishnamurthy and Singh, 2014).

Figure 4: Performance of Algorithm 1 with noisy samples (relative error versus samples/$n\log(n)$). We consider power-law matrices (with $\alpha = 0.5$ in plot (a) and $\alpha = 0.7$ in plot (b)), perturbed by a Gaussian noise matrix $Z$ with $\|Z\|_F / \|M\|_F = \sigma$, at two noise levels $\sigma = 0.1$ and $\sigma = 0.2$. We compare two-phase sampling (Algorithm 1) with $\beta = 2/3$, sampling from the exact leverage scores (local coherence sampling), and uniform sampling. Algorithm 1 achieves relative error almost as low as leveraged sampling without requiring any a priori knowledge of the low-rank matrix, while uniform sampling suffers dramatically.

5. Weighted Nuclear Norm Minimization

Theorem 2 suggests that nuclear norm minimization performs better when the set of observed elements is aligned with the leverage scores of the matrix. Interestingly, Theorem 2 can also be used in the reverse direction: one may adjust the leverage scores to align with a given set of observed elements. Here we demonstrate an application of this idea, quantifying the benefit of weighted nuclear norm minimization when the revealed elements are distributed non-uniformly. Suppose the underlying matrix of interest is incoherent. In many applications, we do not have the freedom to choose which elements to observe.
Instead, the revealed elements are given to us, and they may be distributed non-uniformly among the rows and columns. As observed by Srebro and Salakhutdinov (2010), standard unweighted nuclear norm minimization (1) is inefficient in this setting. They propose to instead use weighted nuclear norm minimization for low-rank matrix completion:

$$\hat{X} = \arg\min_{X \in \mathbb{R}^{n_1 \times n_2}} \|RXC\|_* \quad \text{s.t.} \quad X_{ij} = M_{ij} \text{ for } (i,j) \in \Omega, \qquad (7)$$

where $R = \mathrm{diag}(R_1, R_2, \ldots, R_{n_1}) \in \mathbb{R}^{n_1 \times n_1}$ and $C = \mathrm{diag}(C_1, C_2, \ldots, C_{n_2}) \in \mathbb{R}^{n_2 \times n_2}$ are user-specified diagonal weight matrices with positive diagonal elements. We now provide a theoretical guarantee for this method and quantify its advantage over unweighted nuclear norm minimization. Our analysis is based on the observation that weighted nuclear norm minimization can be viewed as a way of scaling the rows and columns of the underlying matrix so that its leverage scores are adjusted to reflect the given non-uniform sampling distribution. Suppose $M \in \mathbb{R}^{n_1 \times n_2}$ has rank $r$ and satisfies the standard incoherence condition $\max_{i,j}\{\mu_i(M), \nu_j(M)\} \le \mu_0$. Let $\lfloor x \rfloor$ denote the largest integer not exceeding $x$. Under this setting, we can apply Theorem 2 to establish the following:

Theorem 7 Without loss of generality, assume $R_1 \le R_2 \le \cdots \le R_{n_1}$ and $C_1 \le C_2 \le \cdots \le C_{n_2}$. There exists a universal constant $c_0$ such that $M$ is the unique optimum to (7) with probability at least $1 - 5(n_1 + n_2)^{-10}$, provided that for all $(i,j)$, $p_{ij} \ge \frac{1}{\min\{n_1, n_2\}^{10}}$ and

$$p_{ij} \ge c_0 \left( \frac{R_i^2}{\sum_{i'=1}^{\lfloor n_1/(\mu_0 r)\rfloor} R_{i'}^2} + \frac{C_j^2}{\sum_{j'=1}^{\lfloor n_2/(\mu_0 r)\rfloor} C_{j'}^2} \right) \log^2 n. \qquad (8)$$

This theorem is proved by drawing a connection between the weighted nuclear norm formulation (7) and the leverage scores (2) of the target matrix. Define the scaled matrix $\bar{M} := RMC$. Observe that the weighted program (7) is equivalent to first solving the following unweighted problem with scaled observations,

$$\bar{X} = \arg\min_{X} \|X\|_* \quad \text{s.t.} \quad X_{ij} = \bar{M}_{ij} \text{ for } (i,j) \in \Omega, \qquad (9)$$

and then rescaling the solution $\bar{X}$ to return $\hat{X} = R^{-1}\bar{X}C^{-1}$.
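The mechanism behind the equivalence of (7) and (9) is that the weights reshape the leverage scores of the scaled matrix $\bar{M} = RMC$. A small numerical check of that mechanism (with an illustrative `leverage` helper; the particular choice of $R$ below is an arbitrary example, not the paper's prescription):

```python
import numpy as np

def leverage(M, r):
    """Row and column leverage scores of a rank-r matrix."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    n1, n2 = M.shape
    mu = (n1 / r) * np.sum(U[:, :r] ** 2, axis=1)
    nu = (n2 / r) * np.sum(Vt[:r, :].T ** 2, axis=1)
    return mu, nu

rng = np.random.default_rng(0)
n, r = 100, 2
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # incoherent-ish
R = np.diag(np.linspace(2.0, 0.1, n))   # up-weight the early rows (example)
C = np.eye(n)
mu, _ = leverage(M, r)
mu_bar, _ = leverage(R @ M @ C, r)
# Scaling by R pushes leverage toward the heavily weighted (early) rows,
# while the total leverage mass stays fixed at n in both cases.
```

This is exactly the degree of freedom Theorem 7 exploits: the weights move leverage mass toward the rows and columns that happen to be well sampled.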
In other words, through the use of the weighted nuclear norm, we convert the problem of completing $M$ into that of completing the scaled matrix $\bar{M}$. This leads to the following observation, which underlies the proof of Theorem 7: if we can choose the weights $R$ and $C$ in such a way that the leverage scores of the scaled matrix $\bar{M}$, denoted $\bar\mu_i := \mu_i(\bar{M})$ and $\bar\nu_j := \nu_j(\bar{M})$ for $i, j \in [n]$, are aligned with the given non-uniform observations so as to roughly satisfy relation (3), then we gain in sample complexity compared to the unweighted approach. We now quantify this observation more precisely for a particular class of matrix completion problems.

5.1 Comparison to Unweighted Nuclear Norm

Assume for simplicity that $n_1 = n_2 = n$ and that $n/(\mu_0 r)$ is an integer. Suppose the sampling probabilities have a product form, $p_{ij} = p^r_i p^c_j$, with $p^r_1 \le p^r_2 \le \cdots \le p^r_n$ and $p^c_1 \le p^c_2 \le \cdots \le p^c_n$. If we choose

$$R_i = \sqrt{\frac{1}{n}\, p^r_i \sum_j p^c_j} \quad \text{and} \quad C_j = \sqrt{\frac{1}{n}\, p^c_j \sum_i p^r_i},$$

as suggested by the condition (8), then Theorem 7 asserts that the following set of conditions is sufficient for recovery of $M$ with high probability:

$$p^c_j \cdot \frac{\mu_0 r}{n} \sum_{i'=1}^{n/(\mu_0 r)} p^r_{i'} \gtrsim \frac{\mu_0 r}{n} \log^2 n, \ \forall j; \qquad p^r_i \cdot \frac{\mu_0 r}{n} \sum_{j'=1}^{n/(\mu_0 r)} p^c_{j'} \gtrsim \frac{\mu_0 r}{n} \log^2 n, \ \forall i. \qquad (10)$$

We can compare this with the condition for the unweighted approach: by Theorem 2, the unweighted nuclear norm minimization formulation (1) recovers $M$ if

$$p^r_i p^c_j \gtrsim \frac{\mu_0 r}{n} \log^2 n, \ \forall i, j. \qquad (11)$$

Therefore, the weighted nuclear norm approach succeeds under less restrictive conditions: the condition (11) for the unweighted approach requires a lower bound on the minimum sampling probability over the rows and columns, whereas the condition (10) for the weighted approach involves only the average sampling probability of the $n/(\mu_0 r)$ least sampled rows/columns. This benefit is most significant precisely when the observed samples are very non-uniformly distributed. We provide a concrete example of the gain from weighting in Section E. Our theoretical results are consistent with the empirical studies in Srebro and Salakhutdinov (2010) and Foygel et al.
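The gap between a minimum-based condition and an average-based condition is easy to see numerically. A purely illustrative example with made-up numbers: a single badly sampled row barely moves the average over the least sampled rows, but it drives the minimum (and hence the unweighted requirement) down by orders of magnitude.

```python
import numpy as np

n, mu0, r = 1000, 2.0, 5
k = int(n / (mu0 * r))            # number of least sampled rows in the average

p_row = np.full(n, 0.1)           # almost uniform row marginals ...
p_row[0] = 1e-4                   # ... except one badly sampled row

p_sorted = np.sort(p_row)         # ascending: least sampled rows first
min_prob = p_sorted[0]            # this drives the unweighted condition (11)
avg_tail = p_sorted[:k].mean()    # this drives the weighted condition (10)
# avg_tail is nearly 0.1, while min_prob is 1e-4: the weighted condition
# is weaker by roughly three orders of magnitude in this example.
```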
(2011), which demonstrate the advantage of the weighted approach with the weights $R$ and $C$ chosen as above (using the empirical sampling distribution). We remark that Theorem 7 is the first exact recovery guarantee for weighted nuclear norm minimization. It provides a theoretical explanation, complementary to those in Srebro and Salakhutdinov (2010), Foygel et al. (2011) and Negahban and Wainwright (2012), for why the weighted approach is advantageous over the unweighted approach under non-uniform observations. It also serves as a testament to the power of Theorem 2 as a general result on the relationship between sampling and the coherence/leverage score structure of a matrix. In Theorem 7 and the discussion above, we assumed the underlying matrix $M$ is incoherent. One can still use the weighted nuclear norm approach when $M$ is coherent: as long as the weights are chosen such that the leverage scores of the scaled matrix $\bar{M}$ are aligned with the distribution of the revealed entries, Theorem 2 can be applied, and we expect improvements in recovery performance from the weighted approach. How to choose the weights in this setting, and how this choice affects performance, are left to future work.

6. Conclusion

In this paper we studied the problem of matrix completion with no assumptions on the incoherence of the underlying matrix. We showed that if the sampling of entries suitably depends on the leverage scores of the matrix, then the matrix can be recovered from $O(nr\log^2 n)$ entries using nuclear norm minimization. We further established the necessity of leverage score sampling within the class of location-invariant sampling distributions. Based on these results, we presented a new two-phase sampling algorithm that does not require knowledge of the underlying structure of the matrix, and provided simulation results to verify its performance.
As a corollary of our main theorem, we provide exact recovery guarantees for the weighted nuclear norm minimization approach when the observed entries are given and distributed non-uniformly. It is an interesting open problem to provide a rigorous theoretical analysis of the number of samples needed by the two-phase sampling algorithm. It is also of interest to develop and analyze algorithms that sample in more stages and iteratively improve the leverage score estimates. More generally, it would be useful to develop and study other methods for estimating or adjusting the leverage scores and tuning the sampling procedure. Extending the results in this paper to other low-rank recovery settings and applications will be of great value.

Acknowledgments

We would like to thank Petros Drineas, Michael Mahoney and Aarti Singh for helpful discussions, and the anonymous reviewers for their insightful comments and suggestions. Y. Chen was supported by NSF grant CIF-31712-23800 and ONR MURI grant N00014-11-1-0688. R. Ward was supported in part by an NSF CAREER award, an AFOSR Young Investigator Program award, and ONR Grant N00014-12-1-0743. S. Sanghavi would like to acknowledge NSF grants 1302435, 1320175 and 0954059 for supporting this work.

Appendix A. Proof of Theorem 2

We prove our main result, Theorem 2, in this section. The overall outline of the proof is a standard convex duality argument. The main difference in establishing our result is that, while previous proofs relied on bounding the $\ell_\infty$ norm of certain random matrices, we instead bound the weighted $\ell_{\infty,2}$ norm (to be defined). The high-level road map of the proof is a standard one: by convex analysis, to show that $M$ is the unique optimal solution to (1), it suffices to construct a dual certificate $Y$ obeying certain sub-gradient optimality conditions. One of the conditions requires the spectral norm $\|P_{T^\perp}(Y)\|$ to be small.
Previous work bounds this spectral norm via the $\ell_\infty$ norm $\|Y'\|_\infty := \max_{i,j} |Y'_{ij}|$ of a certain matrix $Y'$, which gives rise to the standard and joint incoherence conditions involving uniform bounds by $\mu_0$ and $\mu_{\mathrm{str}}$. Here, we derive a new bound using the weighted $\ell_{\infty,2}$ norm of $Y'$, which is the maximum of the weighted row and column norms of $Y'$. These bounds lead to a tighter bound on the spectral norm, and hence to less restrictive conditions for matrix completion.

We now turn to the details. To simplify the notation, we prove the results for square matrices ($n_1 = n_2 = n$); the results for non-square matrices are proved in exactly the same fashion. In the sequel, by "with high probability" (w.h.p.) we mean with probability at least $1 - n^{-20}$. The proof below involves no more than $5n^{10}$ random events, each of which will be shown to hold with high probability; it follows from the union bound that all the events hold simultaneously with probability at least $1 - 5n^{-10}$, which is the success probability in the statement of Theorem 2. A few additional notations are needed. We drop the dependence of $\mu_i(M)$ and $\nu_j(M)$ on $M$ and simply write $\mu_i$ and $\nu_j$. We use $c$ and its derivatives ($c'$, $c_0$, etc.) for universal positive constants, which may differ from place to place. The inner product between two matrices is given by $\langle Y, Z \rangle = \mathrm{trace}(Y^\top Z)$. Recall that $U$ and $V$ are the left and right singular vectors of the underlying matrix $M$. We need several standard projection operators for matrices. The projections $P_T$ and $P_{T^\perp}$ are given by

$$P_T(Z) := UU^\top Z + ZVV^\top - UU^\top Z VV^\top \quad \text{and} \quad P_{T^\perp}(Z) := Z - P_T(Z).$$

$P_\Omega(Z)$ is the matrix with $(P_\Omega(Z))_{ij} = Z_{ij}$ if $(i,j) \in \Omega$ and zero otherwise, and $P_{\Omega^c}(Z) := Z - P_\Omega(Z)$. As usual, $\|z\|_2$ is the $\ell_2$ norm of the vector $z$, and $\|Z\|_F$ and $\|Z\|$ are the Frobenius norm and spectral norm of the matrix $Z$, respectively. For a linear operator $\mathcal{R}$ on matrices, its operator norm is defined as $\|\mathcal{R}\|_{\mathrm{op}} = \sup_{X \in \mathbb{R}^{n \times n}} \|\mathcal{R}(X)\|_F / \|X\|_F$. For each $1 \le i, j \le n$, we define the random variable $\delta_{ij} := \mathbb{I}((i,j) \in \Omega)$, where $\mathbb{I}(\cdot)$ is the indicator function.
The matrix operator $R_\Omega: \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$ is defined as

$$R_\Omega(Z) := \sum_{i,j} \frac{1}{p_{ij}}\, \delta_{ij} \left\langle e_i e_j^\top, Z \right\rangle e_i e_j^\top. \qquad (12)$$

A.1 Optimality Condition

Following our proof road map, we now state a sufficient condition for $M$ to be the unique optimal solution to the optimization problem (1). This is the content of Proposition 8 below (proved in Section A.7 to follow).

Proposition 8 Suppose $p_{ij} \ge \frac{1}{n^{10}}$. The matrix $M$ is the unique optimal solution to (1) if the following conditions hold:
1. $\|P_T R_\Omega P_T - P_T\|_{\mathrm{op}} \le \frac{1}{2}$;
2. there exists a dual certificate $Y \in \mathbb{R}^{n \times n}$ which satisfies $P_\Omega(Y) = Y$ and
(a) $\|P_T(Y) - UV^\top\|_F \le \frac{1}{4n^5}$,
(b) $\|P_{T^\perp}(Y)\| \le \frac{1}{4}$.

A.2 Validating the Optimality Condition

We begin by proving that Condition 1 in Proposition 8 is satisfied under the conditions of Theorem 2. This is done in the following lemma, proved in Section A.8 to follow, which shows that $R_\Omega$ is close to the identity operator on $T$.

Lemma 9 If $p_{ij} \ge \min\{c_0 \frac{(\mu_i + \nu_j) r}{n} \log n,\ 1\}$ for all $(i,j)$ and a sufficiently large $c_0$, then w.h.p. $\|P_T R_\Omega P_T - P_T\|_{\mathrm{op}} \le \frac{1}{2}$.

A.3 Constructing the Dual Certificate

It remains to construct a matrix $Y$ (the dual certificate) that satisfies Condition 2 in Proposition 8. We do this using the golfing scheme (Gross, 2011; Candès et al., 2011). Set $k_0 := 20 \log n$. For each $k = 1, \ldots, k_0$, let $\Omega_k \subseteq [n] \times [n]$ be a random set of matrix elements such that for each $(i,j)$, $\mathbb{P}[(i,j) \in \Omega_k] = q_{ij} := 1 - (1 - p_{ij})^{1/k_0}$, independently of all others. We may assume that the set $\Omega$ of observed elements is generated as $\Omega = \bigcup_{k=1}^{k_0} \Omega_k$, which is equivalent to the original Bernoulli sampling model. Let $W_0 := 0$ and, for $k = 1, \ldots, k_0$,

$$W_k := W_{k-1} + R_{\Omega_k} P_T \big(UV^\top - P_T W_{k-1}\big), \qquad (14)$$

where the operator $R_{\Omega_k}$ is given by

$$R_{\Omega_k}(Z) := \sum_{i,j} \frac{1}{q_{ij}}\, \mathbb{I}\big((i,j) \in \Omega_k\big) \left\langle e_i e_j^\top, Z \right\rangle e_i e_j^\top.$$

The dual certificate is given by $Y := W_{k_0}$. Clearly $P_\Omega(Y) = Y$ by construction. The proof of Theorem 2 is complete if we show that, under the condition of the theorem, $Y$ satisfies Conditions 2(a) and 2(b) in Proposition 8 w.h.p.
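The two operators above have direct numerical counterparts, which makes their basic properties easy to verify: $P_T$ is an orthogonal projection (idempotent, with $P_{T^\perp}$ orthogonal to its range), and $R_\Omega$ is an unbiased estimator of the identity, since $\mathbb{E}[\delta_{ij}]/p_{ij} = 1$. A minimal sketch (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 50, 3
U, _ = np.linalg.qr(rng.standard_normal((n, r)))   # orthonormal column basis
V, _ = np.linalg.qr(rng.standard_normal((n, r)))   # orthonormal row basis

def P_T(Z):
    """P_T(Z) = U U^T Z + Z V V^T - U U^T Z V V^T."""
    return U @ U.T @ Z + Z @ V @ V.T - U @ U.T @ Z @ V @ V.T

def R_Omega(Z, delta, p):
    """Inverse-probability-weighted sampling operator (12):
    delta[i,j] = 1 if (i,j) observed, p[i,j] = sampling probability."""
    return (delta / p) * Z

Z = rng.standard_normal((n, n))
p = np.full((n, n), 0.3)
# Monte Carlo check that R_Omega is unbiased: the average of many
# independent draws of R_Omega(Z) approaches Z entrywise.
avg = np.mean(
    [R_Omega(Z, rng.random((n, n)) < p, p) for _ in range(500)], axis=0
)
```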
A.4 Concentration Properties

The key step in our proof is to show that $Y$ satisfies Condition 2(b) in Proposition 8; that is, we need to bound $\|P_{T^\perp}(Y)\|$. Here our proof departs from existing ones, as we establish concentration bounds on this quantity in terms of an appropriately weighted version of the $\ell_{\infty,2}$ norm, which we now define. The $\mu(\infty,2)$-norm of a matrix $Z \in \mathbb{R}^{n \times n}$ is defined as

$$\|Z\|_{\mu(\infty,2)} := \max\left\{ \max_i \sqrt{\frac{n}{\mu_i r}} \Big(\sum_b Z_{ib}^2\Big)^{1/2},\ \max_j \sqrt{\frac{n}{\nu_j r}} \Big(\sum_a Z_{aj}^2\Big)^{1/2} \right\},$$

which is the maximum of the weighted row and column norms of $Z$. We also need the $\mu(\infty)$-norm of $Z$, a weighted version of the matrix $\ell_\infty$ norm, given as

$$\|Z\|_{\mu(\infty)} := \max_{i,j} |Z_{ij}| \sqrt{\frac{n}{\mu_i r}} \sqrt{\frac{n}{\nu_j r}},$$

which is the weighted element-wise magnitude of $Z$. We now state three new lemmas concerning the concentration properties of these norms. The first lemma is crucial to our proof; it bounds the spectral norm of $(R_\Omega - \mathcal{I})Z$ in terms of the $\mu(\infty,2)$ and $\mu(\infty)$ norms of $Z$. This obviates the intermediate lemmas required by previous approaches (Candès and Tao, 2010; Gross, 2011; Recht, 2011; Keshavan et al., 2010), which use the $\ell_\infty$ norm of $Z$.

Lemma 10 Suppose $Z$ is a fixed $n \times n$ matrix. For some universal constant $c > 1$, we have w.h.p.

$$\|(R_\Omega - \mathcal{I})Z\| \le c \log n \max_{i,j} \frac{|Z_{ij}|}{p_{ij}} + c \sqrt{\log n} \sqrt{\max\left\{ \max_i \sum_j \frac{Z_{ij}^2}{p_{ij}},\ \max_j \sum_i \frac{Z_{ij}^2}{p_{ij}} \right\}}.$$

If $p_{ij} \ge \min\{c_0 \frac{(\mu_i + \nu_j) r}{n} \log n,\ 1\}$ for all $(i,j)$ and a sufficiently large constant $c_0$, then we further have w.h.p.

$$\|(R_\Omega - \mathcal{I})Z\| \le \frac{c}{\sqrt{c_0}} \left( \|Z\|_{\mu(\infty)} + \|Z\|_{\mu(\infty,2)} \right).$$

The next two lemmas further control the $\mu(\infty,2)$ and $\mu(\infty)$ norms of a matrix after certain random transformations.

Lemma 11 Suppose $Z$ is a fixed $n \times n$ matrix. If $p_{ij} \ge \min\{c_0 \frac{(\mu_i + \nu_j) r}{n} \log n,\ 1\}$ for all $(i,j)$ and a sufficiently large constant $c_0$, then w.h.p.

$$\|(P_T R_\Omega P_T - P_T)Z\|_{\mu(\infty,2)} \le \frac{1}{2} \left( \|Z\|_{\mu(\infty)} + \|Z\|_{\mu(\infty,2)} \right).$$

Lemma 12 Suppose $Z$ is a fixed $n \times n$ matrix. If $p_{ij} \ge \min\{c_0 \frac{(\mu_i + \nu_j) r}{n} \log n,\ 1\}$ for all $(i,j)$ and a sufficiently large constant $c_0$, then w.h.p.

$$\|(P_T R_\Omega P_T - P_T)Z\|_{\mu(\infty)} \le \frac{1}{2} \|Z\|_{\mu(\infty)}.$$

We prove Lemmas 10–12 in Section A.8. Equipped with the three lemmas above, we are now ready to verify that $Y$ satisfies Condition 2 in Proposition 8.
A.5 Validating Condition 2(a)

Set $\Delta_k = UV^\top - P_T(W_k)$ for $k = 1, \ldots, k_0$; note that $\Delta_{k_0} = UV^\top - P_T(Y)$. By the definition of $W_k$, we have

$$\Delta_k = (P_T - P_T R_{\Omega_k} P_T)\, \Delta_{k-1}. \qquad (15)$$

Note that $\Omega_k$ is independent of $\Delta_{k-1}$, and that $q_{ij} \ge p_{ij}/k_0 \ge c'_0 (\mu_i + \nu_j) r \log(n)/n$ under the condition of Theorem 2. Applying Lemma 9 with $\Omega$ replaced by $\Omega_k$, we obtain that w.h.p.

$$\|\Delta_k\|_F \le \|P_T - P_T R_{\Omega_k} P_T\|_{\mathrm{op}}\, \|\Delta_{k-1}\|_F \le \frac{1}{2}\|\Delta_{k-1}\|_F.$$

Applying the above inequality recursively for $k = k_0, k_0 - 1, \ldots, 1$ gives

$$\|P_T(Y) - UV^\top\|_F = \|\Delta_{k_0}\|_F \le \Big(\frac{1}{2}\Big)^{k_0} \|UV^\top\|_F \le \frac{1}{4n^6}\sqrt{r} \le \frac{1}{4n^5},$$

where we use our definition of $k_0$ and $\|UV^\top\|_F = \sqrt{r}$ in the second inequality.

A.6 Validating Condition 2(b)

By definition, $Y$ can be rewritten as $Y = \sum_{k=1}^{k_0} R_{\Omega_k} P_T \Delta_{k-1}$. It follows that

$$\|P_{T^\perp}(Y)\| \le \sum_{k=1}^{k_0} \big\|P_{T^\perp}(R_{\Omega_k} P_T - P_T)\Delta_{k-1}\big\| \le \sum_{k=1}^{k_0} \big\|(R_{\Omega_k} - \mathcal{I})\Delta_{k-1}\big\|.$$

We apply Lemma 10, with $\Omega$ replaced by $\Omega_k$, to each summand on the last right-hand side to obtain w.h.p.

$$\|P_{T^\perp}(Y)\| \le \frac{c}{\sqrt{c_0}} \sum_{k=1}^{k_0} \|\Delta_{k-1}\|_{\mu(\infty)} + \frac{c}{\sqrt{c_0}} \sum_{k=1}^{k_0} \|\Delta_{k-1}\|_{\mu(\infty,2)}. \qquad (16)$$

We bound each summand. Applying (15) $(k-1)$ times together with Lemma 12 (with $\Omega$ replaced by $\Omega_k$), we have w.h.p.

$$\|\Delta_{k-1}\|_{\mu(\infty)} = \|(P_T - P_T R_{\Omega_{k-1}} P_T)\Delta_{k-2}\|_{\mu(\infty)} \le \cdots \le \Big(\frac{1}{2}\Big)^{k-1} \|UV^\top\|_{\mu(\infty)}$$

for each $k$. Similarly, repeatedly applying (15), Lemma 11 and the inequality just proved, we obtain w.h.p.

$$\begin{aligned} \|\Delta_{k-1}\|_{\mu(\infty,2)} &= \|(P_T - P_T R_{\Omega_{k-1}} P_T)\Delta_{k-2}\|_{\mu(\infty,2)} \\ &\le \frac{1}{2}\|\Delta_{k-2}\|_{\mu(\infty)} + \frac{1}{2}\|\Delta_{k-2}\|_{\mu(\infty,2)} \\ &\le \Big(\frac{1}{2}\Big)^{k-1}\|UV^\top\|_{\mu(\infty)} + \frac{1}{2}\|\Delta_{k-2}\|_{\mu(\infty,2)} \\ &\le \cdots \le (k-1)\Big(\frac{1}{2}\Big)^{k-1}\|UV^\top\|_{\mu(\infty)} + \Big(\frac{1}{2}\Big)^{k-1}\|UV^\top\|_{\mu(\infty,2)}. \end{aligned} \qquad (17\text{–}21)$$

It follows that w.h.p.

$$\|P_{T^\perp}(Y)\| \le \frac{c}{\sqrt{c_0}} \sum_{k=1}^{k_0} (k+1)\Big(\frac{1}{2}\Big)^{k-1}\|UV^\top\|_{\mu(\infty)} + \frac{c}{\sqrt{c_0}} \sum_{k=1}^{k_0} \Big(\frac{1}{2}\Big)^{k-1}\|UV^\top\|_{\mu(\infty,2)} \qquad (22)$$
$$\le \frac{6c}{\sqrt{c_0}}\|UV^\top\|_{\mu(\infty)} + \frac{2c}{\sqrt{c_0}}\|UV^\top\|_{\mu(\infty,2)}. \qquad (23)$$

Note that for all $(i,j)$, we have $|(UV^\top)_{ij}| = |e_i^\top UV^\top e_j| \le \sqrt{\frac{\mu_i r}{n}}\sqrt{\frac{\nu_j r}{n}}$, $\|e_i^\top UV^\top\|_2 = \sqrt{\frac{\mu_i r}{n}}$ and $\|UV^\top e_j\|_2 = \sqrt{\frac{\nu_j r}{n}}$. Hence $\|UV^\top\|_{\mu(\infty)} \le 1$ and $\|UV^\top\|_{\mu(\infty,2)} = 1$. We conclude that

$$\|P_{T^\perp}(Y)\| \le \frac{6c}{\sqrt{c_0}} + \frac{2c}{\sqrt{c_0}} \le \frac{1}{4},$$

provided that the constant $c_0$ in Theorem 2 is sufficiently large. This completes the proof of Theorem 2.

A.7 Proof of Proposition 8 (Optimality Condition)

Proof Consider any feasible solution $X$ to (1) with $P_\Omega(X) = P_\Omega(M)$. Let $G$ be an $n \times n$ matrix which satisfies $\|P_{T^\perp} G\| \le 1$ and $\langle P_{T^\perp} G, P_{T^\perp}(X - M) \rangle = \|P_{T^\perp}(X - M)\|_*$. Such a $G$ always exists by the duality between the nuclear norm and the spectral norm.
Because $UV^\top + P_{T^\perp} G$ is a sub-gradient of the function $f(Z) = \|Z\|_*$ at $Z = M$, we have

$$\|X\|_* - \|M\|_* \ge \left\langle UV^\top + P_{T^\perp} G,\ X - M \right\rangle. \qquad (24)$$

But $\langle Y, X - M \rangle = \langle P_\Omega(Y), P_\Omega(X - M) \rangle = 0$, since $P_\Omega(Y) = Y$ and $P_\Omega(X - M) = 0$. It follows that

$$\begin{aligned} \|X\|_* - \|M\|_* &\ge \left\langle UV^\top + P_{T^\perp} G - Y,\ X - M \right\rangle \\ &= \|P_{T^\perp}(X - M)\|_* + \left\langle UV^\top - P_T Y,\ X - M \right\rangle - \left\langle P_{T^\perp} Y,\ X - M \right\rangle \\ &\ge \|P_{T^\perp}(X - M)\|_* - \|UV^\top - P_T Y\|_F \|P_T(X - M)\|_F - \|P_{T^\perp} Y\|\, \|P_{T^\perp}(X - M)\|_* \\ &\ge \frac{3}{4}\|P_{T^\perp}(X - M)\|_* - \frac{1}{4n^5}\|P_T(X - M)\|_F, \end{aligned}$$

where in the last inequality we use Conditions 2(a) and 2(b) in the proposition. Using Lemma 13 below, we obtain

$$\|X\|_* - \|M\|_* \ge \frac{3}{4}\|P_{T^\perp}(X - M)\|_* - \frac{1}{4n^5}\cdot 2n^5\, \|P_{T^\perp}(X - M)\|_* = \frac{1}{4}\|P_{T^\perp}(X - M)\|_* > \frac{1}{8}\|P_{T^\perp}(X - M)\|_*$$

whenever $P_{T^\perp}(X - M) \ne 0$. The right-hand side is thus strictly positive for all feasible $X \ne M$ with $P_{T^\perp}(X - M) \ne 0$. Otherwise, we must have $P_T(X - M) = X - M$ and $P_T R_\Omega P_T(X - M) = 0$, contradicting the assumption $\|P_T R_\Omega P_T - P_T\|_{\mathrm{op}} \le \frac{1}{2}$ unless $X = M$. This proves that $M$ is the unique optimum.

Lemma 13 If $p_{ij} \ge \frac{1}{n^{10}}$ for all $(i,j)$ and $\|P_T R_\Omega P_T - P_T\|_{\mathrm{op}} \le \frac{1}{2}$, then we have

$$\|P_T(Z)\|_F \le 2n^5 \|P_{T^\perp}(Z)\|_*, \quad \forall Z \in \{Z' : P_\Omega(Z') = 0\}. \qquad (25)$$

Proof Define the operator $R_\Omega^{1/2}: \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$ by

$$R_\Omega^{1/2}(Z) := \sum_{i,j} \frac{1}{\sqrt{p_{ij}}}\, \delta_{ij} \left\langle e_i e_j^\top, Z \right\rangle e_i e_j^\top.$$

Note that $R_\Omega^{1/2}$ is self-adjoint and satisfies $R_\Omega^{1/2} R_\Omega^{1/2} = R_\Omega$. Hence we have

$$\|R_\Omega^{1/2} P_T(Z)\|_F^2 = \langle P_T R_\Omega P_T Z,\ P_T Z \rangle \ge \langle P_T(Z), P_T(Z) \rangle - \big|\langle (P_T R_\Omega P_T - P_T)Z,\ P_T(Z) \rangle\big| \ge \|P_T(Z)\|_F^2 - \frac{1}{2}\|P_T(Z)\|_F^2 = \frac{1}{2}\|P_T(Z)\|_F^2,$$

where the last inequality follows from the assumption $\|P_T R_\Omega P_T - P_T\|_{\mathrm{op}} \le \frac{1}{2}$. On the other hand, $P_\Omega(Z) = 0$ implies $0 = R_\Omega^{1/2}(Z) = R_\Omega^{1/2} P_T(Z) + R_\Omega^{1/2} P_{T^\perp}(Z)$, and thus

$$\|R_\Omega^{1/2} P_T(Z)\|_F = \|R_\Omega^{1/2} P_{T^\perp}(Z)\|_F \le \max_{i,j} \frac{1}{\sqrt{p_{ij}}} \|P_{T^\perp}(Z)\|_F \le n^5 \|P_{T^\perp}(Z)\|_F.$$

Combining the last two display equations gives $\|P_T(Z)\|_F \le \sqrt{2}\, n^5 \|P_{T^\perp}(Z)\|_F \le 2n^5 \|P_{T^\perp}(Z)\|_*$.

A.8 Proof of Technical Lemmas

We now prove the four technical lemmas used in the proof of our main theorem. The proofs use the matrix Bernstein inequality, given as Theorem 16 in Section F. We also make frequent use of the following facts: for all $i$ and $j$,

$$\|U^\top e_i\|_2^2 = \frac{\mu_i r}{n}, \qquad \|V^\top e_j\|_2^2 = \frac{\nu_j r}{n}, \qquad \max\Big\{\frac{\mu_i r}{n}, \frac{\nu_j r}{n}\Big\} \le \|P_T(e_i e_j^\top)\|_F^2 \le \frac{(\mu_i + \nu_j) r}{n}. \qquad (26)$$

We also use the shorthand $a \wedge b := \min\{a, b\}$.

A.8.1 Proof of Lemma 9

For any matrix $Z$, we can write

$$(P_T R_\Omega P_T - P_T)(Z) = \sum_{i,j} \Big(\frac{1}{p_{ij}}\delta_{ij} - 1\Big) \left\langle e_i e_j^\top, P_T(Z) \right\rangle P_T(e_i e_j^\top) =: \sum_{i,j} S_{ij}(Z).$$
Note that $\mathbb{E}[S_{ij}] = 0$ and the $S_{ij}$'s are independent of each other. For all $Z$ and $(i,j)$, we have $S_{ij} = 0$ if $p_{ij} = 1$. On the other hand, when $p_{ij} \ge c_0 \frac{(\mu_i + \nu_j) r \log n}{n}$, it follows from (26) that

$$\|S_{ij}(Z)\|_F \le \frac{1}{p_{ij}}\|P_T(e_i e_j^\top)\|_F^2\, \|Z\|_F \le \max_{i,j} \frac{n}{c_0 (\mu_i + \nu_j) r \log n} \cdot \frac{(\mu_i + \nu_j) r}{n}\, \|Z\|_F = \frac{1}{c_0 \log n}\|Z\|_F.$$

Putting the two cases together, we have $\|S_{ij}\|_{\mathrm{op}} \le \frac{1}{c_0 \log n}$ under the condition of the lemma. On the other hand,

$$\Big\|\sum_{i,j} \mathbb{E}\big[S_{ij}^2\big](Z)\Big\|_F = \Big\|\sum_{i,j} \mathbb{E}\Big[\Big(\frac{1}{p_{ij}}\delta_{ij} - 1\Big)^2\Big] \big\langle e_i e_j^\top, P_T(Z) \big\rangle \big\langle e_i e_j^\top, P_T(e_i e_j^\top) \big\rangle\, P_T(e_i e_j^\top)\Big\|_F \le \max_{i,j}\Big\{\frac{1}{p_{ij}}\|P_T(e_i e_j^\top)\|_F^2\Big\} \cdot \Big\|\sum_{i,j} \big\langle e_i e_j^\top, P_T(Z) \big\rangle P_T(e_i e_j^\top)\Big\|_F \le \frac{1}{c_0 \log n}\|Z\|_F,$$

where the last step uses $\sum_{i,j}\langle e_i e_j^\top, P_T(Z)\rangle P_T(e_i e_j^\top) = P_T(Z)$. This implies $\|\sum_{i,j}\mathbb{E}[S_{ij}^2]\|_{\mathrm{op}} \le \frac{1}{c_0 \log n}$ under the condition of the lemma. Applying the matrix Bernstein inequality (Theorem 16), we obtain $\|P_T R_\Omega P_T - P_T\|_{\mathrm{op}} = \|\sum_{i,j} S_{ij}\|_{\mathrm{op}} \le \frac{1}{2}$ w.h.p. for sufficiently large $c_0$.

A.8.2 Proof of Lemma 10

We can write $(R_\Omega - \mathcal{I})Z$ as the sum of independent matrices:

$$(R_\Omega - \mathcal{I})Z = \sum_{i,j} \Big(\frac{1}{p_{ij}}\delta_{ij} - 1\Big) Z_{ij}\, e_i e_j^\top =: \sum_{i,j} S_{ij}.$$

Note that $\mathbb{E}[S_{ij}] = 0$. For all $(i,j)$, we have $S_{ij} = 0$ if $p_{ij} = 1$, and otherwise $\|S_{ij}\| \le \frac{1}{p_{ij}}|Z_{ij}|$. Moreover, we have

$$\Big\|\mathbb{E}\Big[\sum_{i,j} S_{ij}^\top S_{ij}\Big]\Big\| = \Big\|\sum_{i,j} Z_{ij}^2\, e_j e_j^\top\, \mathbb{E}\Big[\Big(\frac{1}{p_{ij}}\delta_{ij} - 1\Big)^2\Big]\Big\| = \max_j \sum_{i=1}^n \frac{(1 - p_{ij}) Z_{ij}^2}{p_{ij}}.$$

The quantity $\|\mathbb{E}[\sum_{i,j} S_{ij} S_{ij}^\top]\|$ is bounded by $\max_i \sum_{j=1}^n (1 - p_{ij}) Z_{ij}^2 / p_{ij}$ in a similar way. The first part of the lemma then follows from the matrix Bernstein inequality (Theorem 16). If $p_{ij} \ge c_0 \frac{(\mu_i + \nu_j) r \log n}{n} \wedge 1$, we have for all $i$ and $j$,

$$\|S_{ij}\| \log n \le \big(1 - \mathbb{I}(p_{ij} = 1)\big)\, \frac{\log n}{p_{ij}}\, |Z_{ij}| \le \frac{1}{c_0}\|Z\|_{\mu(\infty)},$$
$$\log n \sum_j \frac{Z_{ij}^2}{p_{ij}} \le \frac{1}{c_0}\|Z\|_{\mu(\infty,2)}^2, \qquad \log n \sum_i \frac{Z_{ij}^2}{p_{ij}} \le \frac{1}{c_0}\|Z\|_{\mu(\infty,2)}^2.$$

The second part of the lemma follows by again applying the matrix Bernstein inequality.

A.8.3 Proof of Lemma 11

Let $X = (P_T R_\Omega P_T - P_T)Z$. By definition we have

$$\|X\|_{\mu(\infty,2)} = \max_{a,b}\left\{ \sqrt{\frac{n}{\mu_a r}}\, \|X_{a\cdot}\|_2,\ \sqrt{\frac{n}{\nu_b r}}\, \|X_{\cdot b}\|_2 \right\},$$

where $X_{a\cdot}$ and $X_{\cdot b}$ are the $a$-th row and $b$-th column of $X$, respectively. We bound each term in the maximum. Observe that $\sqrt{\frac{n}{\nu_b r}}\, X_{\cdot b}$ can be written as the sum of independent column vectors:

$$\sqrt{\frac{n}{\nu_b r}}\, X_{\cdot b} = \sum_{i,j} \Big(\frac{1}{p_{ij}}\delta_{ij} - 1\Big) Z_{ij}\, P_T(e_i e_j^\top) e_b\, \sqrt{\frac{n}{\nu_b r}} =: \sum_{i,j} S_{ij},$$

where $\mathbb{E}[S_{ij}] = 0$. To control $\|S_{ij}\|_2$ and $\|\mathbb{E}[\sum_{i,j} S_{ij}^\top S_{ij}]\|$, we first need a bound on $\|P_T(e_i e_j^\top) e_b\|_2$.
If $j = b$, we have

$$\|P_T(e_i e_b^\top) e_b\|_2 = \big\|UU^\top e_i + (e_b^\top VV^\top e_b)\,(I - UU^\top) e_i\big\|_2 \le \sqrt{\frac{\mu_i r}{n}} + \frac{\nu_b r}{n} \le \sqrt{\frac{\mu_i r}{n}} + \sqrt{\frac{\nu_b r}{n}}, \qquad (27)$$

where we use the triangle inequality and the definitions of $\mu_i$ and $\nu_b$. Similarly, if $j \ne b$, we have

$$\|P_T(e_i e_j^\top) e_b\|_2 = \big\|(e_j^\top VV^\top e_b)\,(I - UU^\top) e_i\big\|_2 \le \big|e_j^\top VV^\top e_b\big|. \qquad (28)$$

Now note that $\|S_{ij}\|_2 \le (1 - \mathbb{I}(p_{ij} = 1))\, \frac{1}{p_{ij}}\, |Z_{ij}|\, \sqrt{\frac{n}{\nu_b r}}\, \|P_T(e_i e_j^\top) e_b\|_2$. Using the bounds (27) and (28), we obtain for $j = b$

$$\|S_{ib}\|_2 \le (1 - \mathbb{I}(p_{ib} = 1))\, \frac{1}{p_{ib}}\, |Z_{ib}|\, \sqrt{\frac{n}{\nu_b r}}\left(\sqrt{\frac{\mu_i r}{n}} + \sqrt{\frac{\nu_b r}{n}}\right) \le \frac{2}{c_0 \log n}\|Z\|_{\mu(\infty)},$$

where we use $p_{ib} \ge c_0 \frac{\nu_b r \log n}{n} \wedge 1$ and $p_{ib} \ge c_0 \sqrt{\frac{\mu_i r}{n}}\sqrt{\frac{\nu_b r}{n}} \log n \wedge 1$ in the second inequality. For $j \ne b$, we have

$$\|S_{ij}\|_2 \le (1 - \mathbb{I}(p_{ij} = 1))\, \frac{1}{p_{ij}}\, |Z_{ij}|\, \sqrt{\frac{n}{\nu_b r}}\, \big|e_j^\top VV^\top e_b\big| \le \frac{2}{c_0 \log n}\|Z\|_{\mu(\infty)},$$

where we use $|e_j^\top VV^\top e_b| \le \sqrt{\frac{\nu_j r}{n}}\sqrt{\frac{\nu_b r}{n}}$ and $p_{ij} \ge c_0 \sqrt{\frac{\mu_i r}{n}}\sqrt{\frac{\nu_j r}{n}} \log n \wedge 1$. We thus obtain $\|S_{ij}\|_2 \le \frac{2}{c_0 \log n}\|Z\|_{\mu(\infty)}$ for all $(i,j)$. On the other hand, note that

$$\Big\|\mathbb{E}\Big[\sum_{i,j} S_{ij}^\top S_{ij}\Big]\Big\| = \sum_{i,j} \mathbb{E}\Big[\Big(\frac{1}{p_{ij}}\delta_{ij} - 1\Big)^2\Big] Z_{ij}^2\, \frac{n}{\nu_b r}\, \|P_T(e_i e_j^\top) e_b\|_2^2 = \Big(\sum_{j = b,\, i} + \sum_{j \ne b,\, i}\Big) \frac{1 - p_{ij}}{p_{ij}}\, Z_{ij}^2\, \frac{n}{\nu_b r}\, \|P_T(e_i e_j^\top) e_b\|_2^2.$$

Applying (27), we can bound the first sum by

$$\sum_i \frac{Z_{ib}^2}{p_{ib}} \cdot \frac{n}{\nu_b r} \cdot 2\,\frac{(\mu_i + \nu_b) r}{n} \le \frac{2}{c_0 \log n} \cdot \frac{n}{\nu_b r}\, \|Z_{\cdot b}\|_2^2 \le \frac{2}{c_0 \log n}\|Z\|_{\mu(\infty,2)}^2,$$

where we use $p_{ib} \ge c_0 \frac{(\mu_i + \nu_b) r}{n} \log n \wedge 1$ in the second inequality. The second sum can be bounded using (28):

$$\sum_{j \ne b} \frac{n}{\nu_b r}\, \big(e_j^\top VV^\top e_b\big)^2 \sum_i \frac{Z_{ij}^2}{p_{ij}} \overset{(a)}{\le} \frac{n}{\nu_b r} \sum_{j \ne b} \big(e_j^\top VV^\top e_b\big)^2 \cdot \frac{1}{c_0 \log n}\, \frac{n}{\nu_j r} \sum_i Z_{ij}^2 \cdot \frac{\nu_j r}{n} \le \frac{1}{c_0 \log n}\|Z\|_{\mu(\infty,2)}^2 \cdot \frac{n}{\nu_b r} \sum_{j \ne b} \big(e_j^\top VV^\top e_b\big)^2 \overset{(b)}{\le} \frac{1}{c_0 \log n}\|Z\|_{\mu(\infty,2)}^2,$$

where we use $p_{ij} \ge c_0 \frac{\nu_j r \log n}{n} \wedge 1$ in (a), and $\sum_{j \ne b}(e_j^\top VV^\top e_b)^2 \le \|VV^\top e_b\|_2^2 = \frac{\nu_b r}{n}$ in (b). Combining the bounds for the two sums, we obtain $\|\mathbb{E}[\sum_{i,j} S_{ij}^\top S_{ij}]\| \le \frac{3}{c_0 \log n}\|Z\|_{\mu(\infty,2)}^2$. We can bound $\|\mathbb{E}[\sum_{i,j} S_{ij} S_{ij}^\top]\|$ in a similar way. Applying the matrix Bernstein inequality (Theorem 16), we have w.h.p.

$$\sqrt{\frac{n}{\nu_b r}}\, \|X_{\cdot b}\|_2 = \Big\|\sum_{i,j} S_{ij}\Big\|_2 \le \frac{1}{2}\big(\|Z\|_{\mu(\infty)} + \|Z\|_{\mu(\infty,2)}\big)$$

for $c_0$ sufficiently large. Similarly we can bound $\sqrt{\frac{n}{\mu_a r}}\, \|X_{a\cdot}\|_2$ by the same quantity. We take a union bound over all $a$ and $b$ to obtain the desired result.

A.8.4 Proof of Lemma 12

Fix a matrix index $(a,b)$ and let $w_{ab} := \sqrt{\frac{\mu_a r}{n}}\sqrt{\frac{\nu_b r}{n}}$. We can write

$$\frac{1}{w_{ab}}\big[(P_T R_\Omega P_T - P_T)Z\big]_{ab} = \sum_{i,j} \Big(\frac{1}{p_{ij}}\delta_{ij} - 1\Big) Z_{ij} \left\langle e_i e_j^\top, P_T(e_a e_b^\top) \right\rangle \frac{1}{w_{ab}} =: \sum_{i,j} s_{ij},$$

which is the sum of independent zero-mean variables.
We first compute the following bound:

$$\big|\langle e_i e_j^\top, P_T(e_a e_b^\top)\rangle\big| = \big|e_i^\top UU^\top e_a\, e_b^\top e_j + e_i^\top (I - UU^\top) e_a\, e_b^\top VV^\top e_j\big| \le \begin{cases} \frac{\mu_a r}{n} + \frac{\nu_b r}{n}, & i = a,\ j = b, \\ \big|e_b^\top VV^\top e_j\big|, & i = a,\ j \ne b, \\ \big|e_i^\top UU^\top e_a\big|, & i \ne a,\ j = b, \\ \big|e_i^\top UU^\top e_a\big|\,\big|e_b^\top VV^\top e_j\big|, & i \ne a,\ j \ne b, \end{cases} \qquad (29)$$

where for $i \ne a$ we write $e_i^\top(I - UU^\top)e_a = -e_i^\top UU^\top e_a$, and we use the fact that the matrices $I - UU^\top$ and $I - VV^\top$ have spectral norm at most 1. We proceed to bound $|s_{ij}|$. Note that

$$|s_{ij}| \le \big(1 - \mathbb{I}(p_{ij} = 1)\big)\, \frac{1}{p_{ij}}\, |Z_{ij}|\, \big|\langle e_i e_j^\top, P_T(e_a e_b^\top)\rangle\big|\, \frac{1}{w_{ab}}.$$

We distinguish four cases. When $i = a$ and $j = b$, we use (29) and $p_{ab} \ge c_0 \frac{(\mu_a + \nu_b) r \log n}{n} \wedge 1$ to obtain $|s_{ab}| \le |Z_{ab}| / (w_{ab}\, c_0 \log n) \le \|Z\|_{\mu(\infty)} / (c_0 \log n)$. When $i = a$ and $j \ne b$, we apply (29) to get

$$|s_{aj}| \le \big(1 - \mathbb{I}(p_{aj} = 1)\big)\, \frac{|Z_{aj}|}{p_{aj}}\, \frac{1}{w_{ab}} \sqrt{\frac{\nu_j r}{n}}\sqrt{\frac{\nu_b r}{n}} \overset{(a)}{\le} \frac{1}{c_0 \log n}\, |Z_{aj}| \sqrt{\frac{n}{\mu_a r}}\sqrt{\frac{n}{\nu_j r}} \le \frac{1}{c_0 \log n}\|Z\|_{\mu(\infty)},$$

where (a) follows from $p_{aj} \ge \min\{c_0 \frac{\nu_j r \log n}{n},\ 1\}$. In a similar fashion, the same bound holds when $i \ne a$ and $j = b$. When $i \ne a$ and $j \ne b$, we use (29) to get

$$|s_{ij}| \le \big(1 - \mathbb{I}(p_{ij} = 1)\big)\, \frac{|Z_{ij}|}{p_{ij}}\, \frac{1}{w_{ab}} \sqrt{\frac{\mu_i r}{n}}\sqrt{\frac{\mu_a r}{n}}\sqrt{\frac{\nu_j r}{n}}\sqrt{\frac{\nu_b r}{n}} \overset{(b)}{\le} \frac{1}{c_0 \log n}\|Z\|_{\mu(\infty)},$$

where (b) follows from $p_{ij} \ge c_0 \sqrt{\frac{\mu_i r}{n}}\sqrt{\frac{\nu_j r}{n}} \log n \wedge 1$ and $\max\big\{\frac{\mu_i r}{n}, \frac{\nu_j r}{n}\big\} \le 1$. We conclude that $|s_{ij}| \le \|Z\|_{\mu(\infty)} / (c_0 \log n)$ for all $(i,j)$. On the other hand, note that

$$\sum_{i,j} \mathbb{E}\big[s_{ij}^2\big] = \sum_{i,j} \mathbb{E}\Big[\Big(\frac{1}{p_{ij}}\delta_{ij} - 1\Big)^2\Big] \frac{Z_{ij}^2}{w_{ab}^2}\, \big\langle e_i e_j^\top, P_T(e_a e_b^\top)\big\rangle^2 = \sum_{i = a, j = b} + \sum_{i = a, j \ne b} + \sum_{i \ne a, j = b} + \sum_{i \ne a, j \ne b}.$$

We bound each of the four sums. By (29) and $p_{ab} \ge c_0 \frac{(\mu_a + \nu_b) r \log n}{n} \ge c_0 \frac{(\mu_a + \nu_b)^2 r^2 \log n}{2n^2}$, we have

$$\sum_{i = a, j = b} \le \frac{1 - p_{ab}}{p_{ab}\, w_{ab}^2}\, Z_{ab}^2 \Big(\frac{\mu_a r}{n} + \frac{\nu_b r}{n}\Big)^2 \le \frac{2\,\|Z\|_{\mu(\infty)}^2}{c_0 \log n}.$$

By (29) and $p_{aj} w_{ab}^2 \ge c_0\, w_{aj}^2\, \frac{\nu_b r}{n} \log n$, we have

$$\frac{Z_{aj}^2}{p_{aj} w_{ab}^2}\, \big(e_b^\top VV^\top e_j\big)^2 \le \frac{\|Z\|_{\mu(\infty)}^2}{c_0 \log n} \cdot \frac{n}{\nu_b r}\, \big(e_b^\top VV^\top e_j\big)^2,$$

which, after summing over $j \ne b$ and using $\sum_{j \ne b}(e_b^\top VV^\top e_j)^2 \le \frac{\nu_b r}{n}$, implies $\sum_{i = a, j \ne b} \le \|Z\|_{\mu(\infty)}^2 / (c_0 \log n)$. Similarly we can bound $\sum_{i \ne a, j = b}$ by the same quantity. Finally, by (29) and $p_{ij} \ge c_0 \sqrt{\frac{\mu_i r}{n}}\sqrt{\frac{\nu_j r}{n}} \log n \wedge 1$, we have

$$\sum_{i \ne a, j \ne b} \le \frac{\|Z\|_{\mu(\infty)}^2}{c_0 \log n} \cdot \frac{1}{w_{ab}^2} \sum_{i \ne a} \big(e_i^\top UU^\top e_a\big)^2 \sum_{j \ne b} \big(e_b^\top VV^\top e_j\big)^2 \le \frac{\|Z\|_{\mu(\infty)}^2}{c_0 \log n},$$

using $\sum_{i \ne a}(e_i^\top UU^\top e_a)^2 \le \frac{\mu_a r}{n}$ and $\sum_{j \ne b}(e_b^\top VV^\top e_j)^2 \le \frac{\nu_b r}{n}$. Combining the pieces, we obtain $\sum_{i,j}\mathbb{E}[s_{ij}^2] \le 5\|Z\|_{\mu(\infty)}^2 / (c_0 \log n)$.
Applying the Bernstein inequality (Theorem 16), we conclude that

$$\frac{1}{w_{ab}}\Big|\big[(P_T R_\Omega P_T - P_T)Z\big]_{ab}\Big| \le \frac{1}{2}\|Z\|_{\mu(\infty)}$$

w.h.p. for $c_0$ sufficiently large. The desired result follows from a union bound over all $(a,b)$.

Appendix B. Proof of Corollary 4

Recall the setting: each row of $M$ is picked with some probability $p$, and all of its elements are observed. We need a simple lemma. Let $J \subseteq [n]$ be the (random) set of indices of the rows picked, and let $P_J(Z)$ be the matrix obtained from $Z$ by zeroing out the rows outside $J$. Recall that $U\Sigma V^\top$ is the SVD of $M$.

Lemma 14 If $\mu_i(M) := \frac{n}{r}\|U^\top e_i\|_2^2 \le \mu_0$ for all $i \in [n]$, and $p \ge 10\,\frac{\mu_0 r}{n}\log\frac{2r}{\delta}$, then with probability at least $1 - \delta$,

$$\Big\|\frac{1}{p}\, U^\top P_J(U) - I_{r \times r}\Big\| \le \frac{1}{2},$$

where $I_{r \times r}$ is the identity matrix in $\mathbb{R}^{r \times r}$.

Proof Define the random variables $\eta_i := \mathbb{I}(i \in J)$ for $i = 1, 2, \ldots, n$, where $\mathbb{I}(\cdot)$ is the indicator function. Note that

$$\frac{1}{p}\, U^\top P_J(U) - I_{r \times r} = \frac{1}{p}\, U^\top P_J(U) - U^\top U = \sum_{i=1}^n S^{(i)}, \qquad S^{(i)} := \Big(\frac{\eta_i}{p} - 1\Big) U^\top e_i e_i^\top U.$$

The matrices $\{S^{(i)}\}$ are mutually independent and satisfy $\mathbb{E}[S^{(i)}] = 0$ and

$$\|S^{(i)}\| \le \frac{1}{p}\|U^\top e_i\|_2^2 \le \frac{\mu_0 r}{pn}.$$

Moreover,

$$\Big\|\sum_{i=1}^n \mathbb{E}\big[S^{(i)} S^{(i)\top}\big]\Big\| \le \frac{1}{p} \max_i \|U^\top e_i\|_2^2 \cdot \Big\|\sum_{i=1}^n U^\top e_i e_i^\top U\Big\| \le \frac{\mu_0 r}{pn},$$

with the same bound for $\|\sum_{i=1}^n \mathbb{E}[S^{(i)\top} S^{(i)}]\|$. Since the $S^{(i)}$ are $r \times r$ matrices, it follows from the matrix Bernstein inequality (Theorem 16) that, when $p \ge 10\,\frac{\mu_0 r}{n}\log\frac{2r}{\delta}$,

$$\mathbb{P}\left[\Big\|\frac{1}{p}\, U^\top P_J(U) - I_{r \times r}\Big\| > \frac{1}{2}\right] \le 2r \exp\left(-\frac{1/8}{\frac{\mu_0 r}{pn} + \frac{\mu_0 r}{6pn}}\right) \le \delta.$$

Note that $\|\frac{1}{p}U^\top P_J(U) - I_{r \times r}\| \le \frac{1}{2}$ implies that $U^\top P_J(U)$ is invertible, which further implies that $P_J(U) \in \mathbb{R}^{n \times r}$ has rank $r$. The rows picked are $P_J(M) = P_J(U)\Sigma V^\top$, which thus have full rank $r$, and their row space must be the same as the row space of $M$. Therefore, the leverage scores $\{\tilde\nu_j\}$ of these rows are the same as the row leverage scores $\{\nu_j(M)\}$ of $M$. Also note that we must have $\mu_0 \ge 1$. Choosing $\delta$ and sampling $\Omega$ as described in the corollary, and applying Theorem 2, we are guaranteed to recover $M$ exactly with probability at least $1 - 9n^{-10}$.
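The mechanism behind Corollary 4 is easy to verify numerically: once the picked rows have full rank $r$, their row space coincides with that of $M$, so the column leverage scores computed from the picked rows alone are exactly those of $M$. A direct check (the fixed random row subset below is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 100, 4
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
J = rng.choice(n, size=30, replace=False)   # the picked rows
MJ = M[J, :]

# Column (row-space) leverage scores from M and from the picked rows only.
_, _, Vt_full = np.linalg.svd(M)
_, _, Vt_sub = np.linalg.svd(MJ)
nu_full = (n / r) * np.sum(Vt_full[:r, :] ** 2, axis=0)
nu_sub = (n / r) * np.sum(Vt_sub[:r, :] ** 2, axis=0)
# Since row-space projections are basis-independent, nu_full == nu_sub
# whenever MJ has rank r.
```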
The total number of elements observed in expectation is

$$\sum_{i,j} p_{ij} \le 10\mu_0 r n \log\big(2r \cdot 4n^{10}\big) + c_0(\mu_0 r n + r n)\log^2 n \le c_1 \mu_0 r n \log^2 n$$

for some sufficiently large universal constant $c_1$, and by Hoeffding's inequality the actual number of observations is at most twice the expectation with probability at least $1 - n^{-10}$, provided $c_0$ is sufficiently large. The corollary follows from the union bound.

Appendix C. Proof of Theorem 6

We prove the theorem assuming $\sum_{k=1}^r \frac{1}{a_k} = \sum_{k=1}^r \frac{1}{b_k} = \frac{r}{2}$; the extension to the general setting in the theorem statement only affects the pre-constant in (4) by a factor of at most 2. For each $k \in [r]$, let

$$s_k := \frac{2n}{a_k r}, \qquad t_k := \frac{2n}{b_k r}.$$

We assume the $s_k$'s and $t_k$'s are all integers. Under the assumptions on $a_k$ and $b_k$, we have $1 \le s_k, t_k \le n$ and $\sum_{k=1}^r s_k = \sum_{k=1}^r t_k = n$. Define the index sets

$$I_k := \Big\{\textstyle\sum_{l=1}^{k-1} s_l + i : i \in [s_k]\Big\} \quad \text{and} \quad J_k := \Big\{\textstyle\sum_{l=1}^{k-1} t_l + j : j \in [t_k]\Big\};$$

note that $\bigcup_{k=1}^r I_k = \bigcup_{k=1}^r J_k = [n]$. The vectors $\bar\mu$ and $\bar\nu$ are given by

$$\bar\mu_i = a_k, \quad k \in [r],\ i \in I_k; \qquad \bar\nu_j = b_k, \quad k \in [r],\ j \in J_k.$$

It is clear that $\bar\mu$ and $\bar\nu$ satisfy property 1 in the statement of the theorem. Let the matrix $M^{(0)}$ be given by $M^{(0)} = AB^\top$, where $A, B \in \mathbb{R}^{n \times r}$ are specified below. For each $k \in [r]$, we set $A_{ik} = \frac{1}{\sqrt{s_k}}$ for all $i \in I_k$; all other elements of $A$ are set to zero. Therefore, the $k$-th column of $A$ has $s_k$ non-zero elements, each equal to $\frac{1}{\sqrt{s_k}}$, and the columns of $A$ have disjoint supports. Similarly, for each $k \in [r]$, we set $B_{jk} = \frac{1}{\sqrt{t_k}}$ for all $j \in J_k$; all other elements of $B$ are set to zero. Observe that $A$ is an orthonormal matrix, so

$$\mu_i\big(M^{(0)}\big) = \frac{n}{r}\|A^\top e_i\|_2^2 = \frac{n}{r s_k} = \frac{a_k}{2} \le \bar\mu_i, \quad k \in [r],\ i \in I_k.$$

A similar argument shows that $\nu_j(M^{(0)}) \le \bar\nu_j$ for all $j \in [n]$. Hence $M^{(0)} \in \mathcal{M}_r(\bar\mu, \bar\nu)$. We note that $M^{(0)}$ is a block diagonal matrix with $r$ blocks, where the $k$-th block has size $s_k \times t_k$, and $\|M^{(0)}\|_F = \sqrt{r}$. Consider the $i_0$ and $j_0$ in the statement of the theorem. There must exist some $k_1, k_2 \in [r]$ such that $i_0 \in I_{k_1}$ and $j_0 \in J_{k_2}$. Assume w.l.o.g. that $s_{k_1} \ge t_{k_2}$.
By the condition on $\{p_{ij}\}$ in the theorem, we then have
$$p_{i_0 j_0} \le \frac{r \log(1/\eta)}{8n}\big(\bar{\mu}_{i_0} + \bar{\nu}_{j_0}\big) = \frac{r \log(1/\eta)}{8n}\big(a_{k_1} + b_{k_2}\big) = \frac{\log(1/\eta)}{4 s_{k_1}} + \frac{\log(1/\eta)}{4 t_{k_2}} \le \frac{\log(1/\eta)}{2 t_{k_2}},$$
where $\eta = \frac{\bar{\mu}_{i_0} r}{2n} = \frac{1}{s_{k_1}}$ in part 2 of the theorem and $\eta = \frac{2}{n}$ in part 3. Because $\{p_{ij}\}$ is location-invariant w.r.t. $M^{(0)}$, we have
$$p_{ij} = p_{i_0 j_0} \le \frac{\log(1/\eta)}{2 t_{k_2}}, \quad i \in I_{k_1}, \; j \in J_{k_2}.$$
Let $W_i := |(\{i\} \times J_{k_2}) \cap \Omega|$ be the number of observed elements in $\{i\} \times J_{k_2}$. Note that for each $i \in I_{k_1}$, we have
$$\mathbb{P}[W_i = 0] = \prod_{j \in J_{k_2}} (1 - p_{ij}) \ge \left(1 - \frac{\log(1/\eta)}{2 t_{k_2}}\right)^{t_{k_2}} \ge \exp(\log \eta) = \eta,$$
where we use $1 - x \ge e^{-2x}$ for $0 \le x \le \frac{1}{2}$ in the second inequality. Therefore, there exists some $i^* \in I_{k_1}$ for which there is no observed element in $\{i^*\} \times J_{k_2}$ with probability
$$\mathbb{P}\big[\exists i \in I_{k_1} : W_i = 0\big] = 1 - \mathbb{P}\big[W_i \ge 1, \forall i \in I_{k_1}\big] \ge 1 - (1 - \eta)^{s_{k_1}} \ge 1 - e^{-\eta s_{k_1}} \ge \begin{cases} \frac{1}{2}, & \eta = \frac{\bar{\mu}_{i_0} r}{2n}, \\ \frac{1}{n}, & \eta = \frac{2}{n}. \end{cases}$$
These are the probabilities that appear in parts 2 and 3 of the theorem statement, respectively. Now choose a number $s \ge s_{k_1}$. Let $M^{(1)} = \tilde{A} B^\top$, where $B$ is the same as before and $\tilde{A}$ is given by
$$\tilde{A}_{ik} = \begin{cases} \sqrt{1/s}, & i = i^*, \; k = k_2, \\ A_{ik}, & \text{otherwise}. \end{cases}$$
By varying $s$ we can construct infinitely many such $M^{(1)}$. Clearly $M^{(1)}$ has rank $r$. Observe that $M^{(1)}$ differs from $M^{(0)}$ only in the elements with indices in $\{i^*\} \times J_{k_2}$, which are not observed, so $M^{(0)}_{ij} = M^{(1)}_{ij}$ for all $(i, j) \in \Omega$. Also observe that any $\{p_{ij}\}$ that is location-invariant w.r.t. $M^{(0)}$ is also location-invariant w.r.t. $M^{(1)}$. The following lemma guarantees that $M^{(1)} \in \mathcal{M}_r(\bar{\mu}, \bar{\nu})$, which completes the proof of the theorem.

Lemma 15 The matrix $M^{(1)}$ constructed above satisfies
$$\mu_i\big(M^{(1)}\big) \le 2 \mu_i\big(M^{(0)}\big), \; i \in [n], \qquad \nu_j\big(M^{(1)}\big) = \nu_j\big(M^{(0)}\big), \; j \in [n].$$

Proof Note that by definition, the leverage scores of a rank-$r$ matrix $M$ with SVD $M = U \Sigma V^\top$ can be expressed as $\mu_i(M) = \frac{n}{r} \|\mathcal{P}_{\mathrm{col}(M)}(e_i)\|_2^2$, where $\mathrm{col}(M)$ denotes the column space of $M$ and $\mathcal{P}_{\mathrm{col}(M)}(\cdot)$ is the Euclidean projection onto it. A similar relation holds for the row leverage scores and the row space of $M$. In other words, the column/row leverage scores of a matrix are determined by its column/row space.
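The construction of $M^{(1)}$ and the claims of Lemma 15 can be sanity-checked numerically; the sketch below uses hypothetical sizes ($r = 2$, $s = (6, 4)$, $t = (4, 6)$, $i^* = 1 \in I_1$, $k_2 = 2$, spike parameter $s = 20 \ge s_1$) of our own choosing.

```python
import numpy as np

n, r = 10, 2
s, t, s_spike = [6, 4], [4, 6], 20

def factor(sizes):
    F = np.zeros((n, r)); pos = 0
    for k, sz in enumerate(sizes):
        F[pos:pos + sz, k] = 1 / np.sqrt(sz); pos += sz
    return F

A, B = factor(s), factor(t)
A1 = A.copy()
A1[0, 1] = 1 / np.sqrt(s_spike)   # spike at (i*, k2) = (row 0, column 2)
M0, M1 = A @ B.T, A1 @ B.T

# M(0) and M(1) agree outside {i*} x J_{k2}  (here: row 0, columns 4..9).
diff = M1 - M0

def leverages(M):
    U, sv, Vt = np.linalg.svd(M)
    return ((n / r) * np.sum(U[:, :r]**2, axis=1),
            (n / r) * np.sum(Vt[:r, :].T**2, axis=1))

mu_0, nu_0 = leverages(M0)
mu_1, nu_1 = leverages(M1)
print(mu_1 / mu_0)   # each ratio should be at most 2 (Lemma 15)
```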
Because $M^{(0)}$ and $M^{(1)}$ have the same row space (namely, the span of the columns of $B$), the second set of equalities in the lemma holds. It remains to prove the first set of inequalities for the column leverage scores. If $k_1 = k_2$, the columns of $\tilde{A}$ still have disjoint supports and are orthogonal to each other; using the above expression for the leverage scores and $s \ge s_{k_1}$, a direct calculation gives $\mu_i(M^{(1)}) \le 2 \mu_i(M^{(0)})$. If $k_1 \ne k_2$, we may assume without loss of generality that $k_1 = 1$, $k_2 = 2$ and $i^* = 1$. In the sequel we use $\tilde{A}_k$ to denote the $k$-th column of $\tilde{A}$. We now construct two vectors $\alpha$ and $\beta$ which have the same span as $\tilde{A}_1$ and $\tilde{A}_2$. Define $\alpha, \beta \in \mathbb{R}^n$ such that the first $s_1$ elements of $\alpha$ and the $\{s_1+1, \ldots, s_1+s_2\}$-th elements of $\beta$ are one, the first element of $\beta$ is $\sqrt{s_2/s}$, and all other elements of $\alpha$ and $\beta$ are zero. Clearly $\alpha = \sqrt{s_1}\,\tilde{A}_1$ and $\beta = \sqrt{s_2}\,\tilde{A}_2$, so $\mathrm{span}(\alpha, \beta) = \mathrm{span}(\tilde{A}_1, \tilde{A}_2)$. We next orthogonalize $\alpha$ and $\beta$ by letting $\tilde{\alpha} = \alpha$ and
$$\tilde{\beta} = \beta - \frac{\langle \beta, \alpha \rangle}{\|\alpha\|_2^2}\,\alpha = \beta - \frac{1}{s_1}\sqrt{\frac{s_2}{s}}\,\alpha, \qquad \tilde{\beta}_i = \begin{cases} \dfrac{s_1 - 1}{s_1}\sqrt{\dfrac{s_2}{s}}, & i = 1, \\[4pt] -\dfrac{1}{s_1}\sqrt{\dfrac{s_2}{s}}, & i = 2, \ldots, s_1, \\[4pt] 1, & i = s_1+1, \ldots, s_1+s_2, \\[4pt] 0, & i = s_1+s_2+1, \ldots, n. \end{cases}$$
Note that $\mathrm{span}(\tilde{\alpha}, \tilde{\beta}) = \mathrm{span}(\alpha, \beta)$ and $\langle \tilde{\alpha}, \tilde{\beta} \rangle = 0$. A simple calculation shows that $\|\tilde{\alpha}\|_2^2 = s_1$ and $\|\tilde{\beta}\|_2^2 = \big(\frac{s_1 - 1}{s_1 s} + 1\big) s_2$. Finally, we normalize $\tilde{\alpha}$ and $\tilde{\beta}$ by letting $\bar{\alpha} = \tilde{\alpha}/\|\tilde{\alpha}\|_2$ and $\bar{\beta} = \tilde{\beta}/\|\tilde{\beta}\|_2$. It is clear that $\mathrm{span}(\bar{\alpha}, \bar{\beta}) = \mathrm{span}(\tilde{A}_1, \tilde{A}_2)$, and $\langle \bar{\alpha}, \tilde{A}_k \rangle = \langle \bar{\beta}, \tilde{A}_k \rangle = 0$ for $k = 3, \ldots, r$. Now consider the matrix $\bar{A} \in \mathbb{R}^{n \times r}$ obtained from $\tilde{A}$ by replacing its first two columns with $\bar{\alpha}$ and $\bar{\beta}$, respectively. Because $\mathrm{col}(\bar{A}) = \mathrm{col}(\tilde{A}) = \mathrm{col}(M^{(1)})$, we have $\mu_i(M^{(1)}) = \frac{n}{r}\|\mathcal{P}_{\mathrm{col}(\bar{A})}(e_i)\|_2^2$. But the columns of $\bar{A}$ have unit norms and are orthogonal to each other. It follows that
$$\mu_i\big(M^{(1)}\big) = \frac{n}{r}\big\|\bar{A}\bar{A}^\top e_i\big\|_2^2 = \frac{n}{r}\big\|\bar{A}^\top e_i\big\|_2^2.$$
For $s_1 + s_2 < i \le n$, we have $\|\bar{A}^\top e_i\|_2^2 = \|\tilde{A}^\top e_i\|_2^2 = \|A^\top e_i\|_2^2$, so $\mu_i(M^{(1)}) = \mu_i(M^{(0)})$. For $i \in [s_1 + s_2]$, using $s \ge s_1$ we have
$$\big\|\bar{A}^\top e_i\big\|_2^2 = \bar{\alpha}_i^2 + \bar{\beta}_i^2 = \begin{cases} \dfrac{1}{s_1} + \dfrac{(s_1-1)^2}{s_1(s_1-1) + s_1^2 s} \le \dfrac{1}{s_1} + \dfrac{1}{s} \le \dfrac{2}{s_1} = 2\|A^\top e_i\|_2^2, & i = 1, \\[6pt] \dfrac{1}{s_1} + \dfrac{1}{s_1(s_1-1) + s_1^2 s} \le \dfrac{1}{s_1} + \dfrac{1}{s} \le \dfrac{2}{s_1} = 2\|A^\top e_i\|_2^2, & i = 2, \ldots, s_1, \\[6pt] \dfrac{s_1 s}{(s_1 - 1 + s_1 s)\, s_2} \le \dfrac{1}{s_2} = \|A^\top e_i\|_2^2, & i = s_1+1, \ldots, s_1+s_2. \end{cases}$$
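The orthogonalization identities used above admit a quick numerical check; the sizes below ($s_1 = 6$, $s_2 = 4$, $s = 20$, $n = 15$) are hypothetical.

```python
import numpy as np

s1, s2, s, n = 6, 4, 20, 15

# alpha: ones on the first s1 coordinates; beta: sqrt(s2/s) at coordinate 1
# and ones on coordinates s1+1, ..., s1+s2.
alpha = np.zeros(n); alpha[:s1] = 1.0
beta = np.zeros(n); beta[s1:s1 + s2] = 1.0; beta[0] = np.sqrt(s2 / s)

# One Gram-Schmidt step: beta_t = beta - (<beta, alpha>/||alpha||^2) alpha.
beta_t = beta - (beta @ alpha) / (alpha @ alpha) * alpha
print(alpha @ alpha, beta_t @ beta_t, beta_t @ alpha)
```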
This means
$$\mu_i\big(M^{(1)}\big) \le \frac{2n}{r}\|A^\top e_i\|_2^2 = 2\mu_i\big(M^{(0)}\big), \quad i \in [s_1+s_2],$$
which completes the proof of the lemma.

Appendix D. Proof of Theorem 7

Suppose the rank-$r$ SVD of $RMC$ is $\bar{U}\bar{\Sigma}\bar{V}^\top$; so $\bar{U}\bar{\Sigma}\bar{V}^\top = RMC = RU\Sigma V^\top C$. By definition, we have
$$\frac{\bar{\mu}_i r}{n} = \big\|\mathcal{P}_{\bar{U}}(e_i)\big\|_2^2,$$
where $\mathcal{P}_{\bar{U}}(\cdot)$ denotes the projection onto the column space of $\bar{U}$, which is the same as the column space of $RU$. This projection has the explicit form $\mathcal{P}_{\bar{U}}(e_i) = RU(U^\top R^2 U)^{-1} U^\top R e_i$. It follows that
$$\frac{\bar{\mu}_i r}{n} = \big\| RU(U^\top R^2 U)^{-1} U^\top R e_i \big\|_2^2 = R_i^2\, e_i^\top U \big(U^\top R^2 U\big)^{-1} U^\top e_i \le R_i^2\,[\sigma_r(RU)]^{-2}\,\|U^\top e_i\|_2^2 \le \frac{\mu_0 r R_i^2}{n\,[\sigma_r(RU)]^2}, \qquad (30)$$
where $\sigma_r(\cdot)$ denotes the $r$-th singular value and the last inequality follows from the standard incoherence assumption $\max_{i,j}\{\mu_i, \nu_j\} \le \mu_0$. We now bound $\sigma_r(RU)$. Since $RU$ has rank $r$, we have
$$\sigma_r^2(RU) = \min_{\|x\|=1} \|RUx\|_2^2 = \min_{\|x\|=1} \sum_{i=1}^{n} R_i^2 \big(e_i^\top U x\big)^2. \qquad (31)$$
If we let $z_i := \big(e_i^\top U x\big)^2$ for each $i \in [n]$, then the $z_i$ satisfy $\sum_{i=1}^{n} z_i = \|Ux\|_2^2 = \|x\|_2^2 = 1$ and, by the standard incoherence assumption, $z_i \le \|U^\top e_i\|_2^2 \|x\|_2^2 \le \frac{\mu_0 r}{n}$. Therefore, the value of the minimization (31) is lower-bounded by the optimal value of the following program:
$$\min_z \; \sum_{i=1}^{n} R_i^2 z_i \quad \text{s.t.} \quad \sum_{i=1}^{n} z_i = 1, \;\; 0 \le z_i \le \frac{\mu_0 r}{n}, \; i = 1, \ldots, n. \qquad (32)$$
From the theory of linear programming, we know the minimum is achieved at an extreme point $z^*$ of the feasible set. Such an extreme point satisfies $z_i^* \ge 0$ for all $i$ and the $n$ linear equalities
$$\sum_{i=1}^{n} z_i^* = 1, \qquad z_i^* = 0 \text{ for } i \in \mathcal{I}_1, \qquad z_i^* = \frac{\mu_0 r}{n} \text{ for } i \in \mathcal{I}_2,$$
for some index sets $\mathcal{I}_1$ and $\mathcal{I}_2$ with $\mathcal{I}_1 \cap \mathcal{I}_2 = \emptyset$ and $|\mathcal{I}_1| + |\mathcal{I}_2| = n - 1$. It is easy to see that we must have $|\mathcal{I}_2| = \lfloor \frac{n}{\mu_0 r} \rfloor$. Assuming w.l.o.g. that $R_1 \le R_2 \le \cdots \le R_n$, the minimizer $z^*$ has the form
$$z_i^* = \begin{cases} \frac{\mu_0 r}{n}, & i = 1, \ldots, \big\lfloor \frac{n}{\mu_0 r} \big\rfloor, \\[2pt] 1 - \big\lfloor \frac{n}{\mu_0 r} \big\rfloor \frac{\mu_0 r}{n}, & i = \big\lfloor \frac{n}{\mu_0 r} \big\rfloor + 1, \\[2pt] 0, & i = \big\lfloor \frac{n}{\mu_0 r} \big\rfloor + 2, \ldots, n, \end{cases}$$
and the value of the minimization (32) is at least $\frac{\mu_0 r}{n} \sum_{i=1}^{\lfloor n/(\mu_0 r) \rfloor} R_i^2$. This proves that $\sigma_r^2(RU) \ge \frac{\mu_0 r}{n} \sum_{i=1}^{\lfloor n/(\mu_0 r) \rfloor} R_i^2$. Combining with (30), we obtain
$$\frac{\bar{\mu}_i r}{n} \le \frac{R_i^2}{\sum_{i'=1}^{\lfloor n/(\mu_0 r) \rfloor} R_{i'}^2}, \qquad \frac{\bar{\nu}_j r}{n} \le \frac{C_j^2}{\sum_{j'=1}^{\lfloor n/(\mu_0 r) \rfloor} C_{j'}^2};$$
the proof for $\bar{\nu}_j$ is similar.
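The lower bound on $\sigma_r(RU)$ obtained from the linear program (32) can be sanity-checked numerically; a minimal sketch with hypothetical parameters (random orthonormal $U$, random weights sorted in ascending order):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 500, 4
U, _ = np.linalg.qr(rng.standard_normal((n, r)))   # orthonormal columns
mu0 = (n / r) * np.max(np.sum(U**2, axis=1))       # incoherence parameter

R = np.sort(rng.random(n) + 0.1)                   # weights R_1 <= ... <= R_n
sigma_r = np.linalg.svd(R[:, None] * U, compute_uv=False)[r - 1]

m = int(n / (mu0 * r))                             # floor(n / (mu0 * r))
lp_bound = (mu0 * r / n) * np.sum(R[:m] ** 2)      # LP lower bound on sigma_r^2
print(sigma_r**2, lp_bound)                        # first value dominates
```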
Applying Theorem 2 to the equivalent problem (9) with the above bounds on $\bar{\mu}_i$ and $\bar{\nu}_j$ proves the theorem.

Appendix E. Weighted vs. Unweighted Nuclear Norm Minimization for Non-uniform Sampling

In this section we provide a concrete example of the gain from weighting under the setting of Section 5.1, where the observed entries are given and distributed non-uniformly. Suppose $M$ is an $n$-by-$n$ matrix with rank $r$, and its incoherence parameter satisfies $\mu_0 r = c$, where $c$ is a numerical constant. We assume the sampling probabilities have the form
$$p_i^r = p_i^c = \min\Big\{ \gamma\,\frac{i^{0.15}\log n}{n^{0.65}},\; 1 \Big\} \quad \text{for } i = 1, 2, \ldots, n;$$
here the minimum ensures that $p_i^r p_j^c$ is a probability. Note that the parameter $\gamma$ determines the expected number of samples $\sum_{i,j} p_i^r p_j^c$. For the condition (11) for the unweighted approach to hold, we need $\gamma^2 \gtrsim n^{0.3}$, and thus the expected number of samples is at least
$$\sum_{i,j} p_i^r p_j^c \gtrsim \sum_{i,j} \gamma\,\frac{i^{0.15}}{n^{0.65}} \cdot \gamma\,\frac{j^{0.15}}{n^{0.65}} = \Omega\big(n^{1.3}\big),$$
where we use the estimate $\sum_{i=1}^{n} i^{0.15} = \Theta(n^{1.15})$. On the other hand, the condition (10) for the weighted approach is satisfied as long as $\gamma^2 \gtrsim n^{0.15}$, so the expected number of samples satisfies
$$\sum_{i,j} p_i^r p_j^c \lesssim \sum_{i,j} \gamma\,\frac{i^{0.15}}{n^{0.65}} \cdot \gamma\,\frac{j^{0.15}}{n^{0.65}} \log^2 n = O\big(n^{1.15}\log^2 n\big)$$
when $\gamma^2 = \Theta(n^{0.15})$. Therefore, the number of samples required by the condition (10) for the weighted approach is order-wise smaller than for the unweighted counterpart (11). Note that the conditions (10) and (11) are the best known sufficient conditions for exact matrix completion using the weighted and unweighted approaches, respectively, so the comparison above suggests a significant gain in sample complexity from the weighted approach.

Appendix F. Matrix Bernstein Inequality

Theorem 16 (Tropp 2012) Let $X_1, \ldots, X_N \in \mathbb{R}^{n_1 \times n_2}$ be independent zero-mean random matrices. Suppose
$$\sigma^2 := \max\Big\{ \Big\|\sum_{k=1}^{N} \mathbb{E}\big[X_k X_k^\top\big]\Big\|, \; \Big\|\sum_{k=1}^{N} \mathbb{E}\big[X_k^\top X_k\big]\Big\| \Big\}$$
and $\|X_k\| \le B$ almost surely for all $k$. Then for any $t > 0$, we have
$$\mathbb{P}\Big[ \Big\|\sum_{k=1}^{N} X_k\Big\| \ge t \Big] \le (n_1 + n_2) \exp\Big( \frac{-t^2/2}{\sigma^2 + Bt/3} \Big).$$
As a consequence, for any $c > 0$, we have
$$\Big\|\sum_{k=1}^{N} X_k\Big\| \le \sqrt{2c\,\sigma^2 \log(n_1 + n_2)} + c\,B \log(n_1 + n_2)$$
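The two sample-count estimates above can be reproduced numerically; a sketch with a hypothetical problem size (the exponents 0.15 and 0.65 come from the construction above, while the choice of $n$ and the constants inside $\gamma$ are illustrative):

```python
import numpy as np

def expected_samples(n, gamma):
    """Expected |Omega| = (sum_i p_i)^2 for p_i = min(gamma * i^0.15 * log n / n^0.65, 1)."""
    i = np.arange(1, n + 1)
    p = np.minimum(gamma * i**0.15 * np.log(n) / n**0.65, 1.0)
    return p.sum() ** 2

n = 2000
unweighted = expected_samples(n, n**0.15)   # gamma^2 = n^0.3  (condition (11))
weighted = expected_samples(n, n**0.075)    # gamma^2 = n^0.15 (condition (10))
print(unweighted, weighted)   # the weighted approach needs far fewer samples
```

When no probability is capped at 1, the two counts differ by exactly the factor $(\gamma_{\text{unw}}/\gamma_{\text{w}})^2 = n^{0.15}$, matching the order-wise gap claimed above.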
(33) with probability at least $1 - (n_1 + n_2)^{-(c-1)}$.

References

D. Achlioptas and F. McSherry. Fast computation of low-rank matrix approximations. Journal of the ACM, 54(2):9, 2007.
D. Achlioptas, Z. Karnin, and E. Liberty. Matrix entry-wise sampling: simple is best. http://cs-www.cs.yale.edu/homes/el327/papers/matrixSampling.pdf, 2013.
S. Arora, E. Hazan, and S. Kale. A fast random sampling algorithm for sparsifying matrices. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 272–279. Springer, 2006.
C. Boutsidis, M. Mahoney, and P. Drineas. An improved approximation algorithm for the column subset selection problem. In Proceedings of the Symposium on Discrete Algorithms, pages 968–977, 2009.
N. Burq, S. Dyatlov, R. Ward, and M. Zworski. Weighted eigenfunction estimates with applications to compressed sensing. SIAM Journal on Mathematical Analysis, 44(5):3481–3501, 2012.
J. Cai, E. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
E. Candès and Y. Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.
E. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
E. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
E. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58(3):11, 2011.
V. Chandrasekaran, S. Sanghavi, P. Parrilo, and A. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011.
S. Chatterjee and A. Hadi. Influential observations, high leverage points, and outliers in linear regression. Statistical Science, 1(3):379–393, 1986.
Y. Chen. Incoherence-optimal matrix completion.
IEEE Transactions on Information Theory, 61(5):2909–2923, 2015.
Y. Chen, A. Jalali, S. Sanghavi, and C. Caramanis. Low-rank matrix recovery from errors and erasures. IEEE Transactions on Information Theory, 59(7):4324–4337, 2013.
P. Drineas and A. Zouzias. A note on element-wise matrix sparsification via a matrix-valued Bernstein inequality. Information Processing Letters, 111(8):385–389, 2011.
P. Drineas, M. Magdon-Ismail, M. Mahoney, and D. Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13:3475–3506, 2012.
M. Fazel. Matrix Rank Minimization with Applications. PhD thesis, Stanford University, 2002.
R. Foygel, O. Shamir, N. Srebro, and R. Salakhutdinov. Learning with the weighted trace-norm under arbitrary sampling distributions. In Advances in Neural Information Processing Systems 24, pages 2133–2141, 2011.
D. Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548–1566, 2011.
P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, pages 665–674. ACM, 2013.
R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.
F. Krahmer and R. Ward. Stable and robust sampling strategies for compressive imaging. IEEE Transactions on Image Processing, 23(2):612–622, 2014.
A. Krishnamurthy and A. Singh. Low-rank matrix and tensor completion via adaptive sampling. In Advances in Neural Information Processing Systems 26, pages 836–844, 2013.
A. Krishnamurthy and A. Singh. On the power of adaptivity in matrix completion and approximation. arXiv preprint arXiv:1407.3619, 2014.
Z. Lin, M. Chen, L. Wu, and Y. Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices.
UIUC Technical Report UILU-ENG-09-2215, 2009.
M. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011.
S. Negahban and M. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, 13:1665–1697, 2012.
H. Rauhut and R. Ward. Sparse Legendre expansions via ℓ1-minimization. Journal of Approximation Theory, 164(5):517–533, 2012.
B. Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12:3413–3430, 2011.
D. Spielman and N. Srivastava. Graph sparsification by effective resistances. SIAM Journal on Computing, 40(6):1913–1926, 2011.
N. Srebro and R. Salakhutdinov. Collaborative filtering in a non-uniform world: Learning with the weighted trace norm. In Advances in Neural Information Processing Systems, pages 2056–2064, 2010.
J. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
X. Yang and G. Karniadakis. Reweighted ℓ1 minimization method for stochastic elliptic differential equations. Journal of Computational Physics, 248:87–108, 2013.