Journal of Machine Learning Research 24 (2023) 1-40. Submitted 3/22; Revised 4/23; Published 9/23.

Near-Optimal Weighted Matrix Completion

Oscar López (lopezo@fau.edu)
Harbor Branch Oceanographic Institute, Florida Atlantic University, Fort Pierce, FL 34946, USA

Editor: Moritz Hardt

Abstract

Recent work in the matrix completion literature has shown that prior knowledge of a matrix's row and column spaces can be successfully incorporated into reconstruction programs to substantially benefit matrix recovery. This paper proposes a novel methodology that exploits more general forms of known matrix structure in terms of subspaces. The work derives reconstruction error bounds that are informative in practice, providing insight into previous approaches in the literature while introducing novel programs with reduced sample complexities. The main result shows that a family of weighted nuclear norm minimization programs incorporating an $M_1 r$-dimensional subspace of $n \times n$ matrices (where $M_1 \ge 1$ conveys structural properties of the subspace) allows accurate approximation of a rank $r$ matrix aligned with the subspace from a near-optimal number of observed entries (within a logarithmic factor of $M_1 r$). The result is robust: the error is proportional to the measurement noise, applies to full rank matrices, and reflects degraded output when erroneous prior information is imposed. Numerical experiments are presented that validate the theoretical behavior derived for several example weighted programs.

Keywords: matrix completion, weighted matrix completion, low rank matrices, incoherence, nuclear norm minimization, convex optimization

1. Introduction

Over the past two decades, matrix completion has evolved from an academic curiosity to a common industrial tool (Recht et al. 2010; Candès and Recht 2009; Srebro and Jaakkola 2004; Foygel and Srebro 2011). Its utility includes seismic data acquisition (Aravkin et al. 2014; López et al. 2016; Yang et al. 2013), machine learning (Nguyen et al. 2018; Yi et al. 2012; Luo et al. 2015), collaborative filtering (Srebro and Salakhutdinov 2010), computer vision (Tomasi and Kanade 1992), gene expression analysis (Troyanskaya et al. 2001) and MRI (Zhao et al. 2010). In these applications, practitioners wish to estimate a data matrix of interest $D \in \mathbb{C}^{n_1 \times n_2}$ from a fraction of revealed noisy entries. The success of recent approaches hinges on the underlying assumption that $D$ can be well approximated by a rank $r$ matrix with $r \ll \min\{n_1, n_2\}$, that is, $D$ has low rank structure. This data model is common in smooth signals, but pervasive simply by the sheer nature of large scale data (Udell and Townsend 2019).

To elaborate, let $\Omega \subset \{1, \dots, n_1\} \times \{1, \dots, n_2\}$ be a subset of size $m \ll n_1 n_2$ and $\mathcal{P}_\Omega : \mathbb{C}^{n_1 \times n_2} \to \mathbb{C}^m$ the corresponding sampling operator that extracts from an input matrix the $m$ values at the entries specified by $\Omega$ (with $|\Omega| = m$). Given $\mathcal{P}_\Omega(D) + d \in \mathbb{C}^m$, where $d \in \mathbb{C}^m$ encompasses measurement noise, the goal of matrix completion is to recover $D$ as accurately as possible. Under the low rank assumption, a well-studied method to estimate the data matrix is via the nuclear norm minimization program (Fazel et al. 2001; Recht et al.
2010; Candès and Recht 2009), which estimates $D$ via

$$\hat D_1 := \operatorname*{argmin}_{X \in \mathbb{C}^{n_1 \times n_2}} \|X\|_* \quad \text{subject to} \quad \|\mathcal{P}_\Omega(D) + d - \mathcal{P}_\Omega(X)\|_2 \le \eta, \qquad (1)$$

where $\|X\|_* = \sum_{k=1}^{\min(n_1,n_2)} \sigma_k(X)$ is the nuclear norm, $\sigma_k(X)$ is the $k$-th largest singular value of $X$, and $\eta$ is a program parameter chosen according to the noise level. The nuclear norm penalty provides a convex surrogate for the rank objective function, in order to output a low rank matrix that is viable for the noisy observations in a tractable manner.

As in previous work, a notion of incoherence is needed to quantify how evenly distributed the information is throughout the data matrix. Ultimately, incoherence conditions ensure that a set of observed entries $\Omega$ chosen uniformly at random provides an appropriate sampling scheme for the array of interest.

Definition 1 Given $D \in \mathbb{C}^{n_1\times n_2}$ and $r \le \min\{n_1, n_2\}$, consider the singular value decomposition (SVD) $D = U\Sigma V^*$. The $r$-incoherence parameters of $D$ are defined as the smallest $\mu_0, \mu_1 > 0$ such that

$$\max_{1 \le k \le n_1} \|U_{k\cdot}\|_2^2 \le \frac{\mu_0 r}{n_1}, \qquad \max_{1 \le \ell \le n_2} \|V_{\ell\cdot}\|_2^2 \le \frac{\mu_0 r}{n_2}, \qquad (2)$$

$$\max_{k,\ell} \sum_{j=1}^{r} |U_{kj}|^2 |V_{\ell j}|^2 \le \frac{\mu_1^2 r}{n_1 n_2}. \qquad (3)$$

Definition (2) is common in the literature, known as the standard incoherence condition (Chen 2015). The parameter $\mu_1$ is unique to this work, but similar to the joint incoherence condition introduced in Recht 2011. This novel parameter will be elaborated in Section 3.2. Intuitively, small parameters (for example, $\mu_0, \mu_1 \lesssim \log(\max\{n_1, n_2\})$) correspond to data matrices whose information is not concentrated on a small set of entries. Such metrics of spikiness are necessary when the observations are chosen without regard to the matrix structure, in order to guarantee that the probed entries supply a substantial amount of information. However, incoherence conditions can be avoided if the sampling scheme is modified according to prior knowledge of the matrix's leverage scores (Chen et al. 2015, 2014; Eftekhari et al. 2018a).

The following result states the sample complexity and resulting error bound for program (1), where without loss of generality it is henceforth assumed that $n_1 \ge n_2$.

Theorem 2 Let $D \in \mathbb{C}^{n_1\times n_2}$ have $r$-incoherence parameters $\mu_0, \mu_1$ and suppose $\Omega \subset [n_1]\times[n_2]$ is generated by selecting a subset of size $m \le n_1 n_2$ uniformly at random from all subsets of size $m$. Define $\hat D_1$ as in (1) with $\|d\|_2 \le \eta$. There exist universal constants $c_0, c_1, c_2 > 0$ such that if

$$m \ge c_0 \max\{\mu_0, \mu_1\}\, n_1 r \log^2(n_1),$$

then with high probability

$$\|D - \hat D_1\|_F \le c_1 \sqrt{\frac{n_1 n_2 \log(n_1)}{m}} \sum_{k=r+1}^{n_2} \sigma_k(D) + c_2 \sqrt{\frac{n_1 n_2\, r \log(n_1)}{m}}\, \eta.$$

The result matches previous work in terms of the sample complexity required for exact matrix completion (Chen 2015; Eftekhari et al. 2018a,b). However, a fair comparison is difficult to make due to the dependence here on $\mu_1$, whereas other authors only require $\mu_0$. Section 3.2 will provide data matrices with $\mu_1 < \mu_0$ as well as examples where $\mu_1 \ge \mu_0$ holds. Therefore, there is no strict relationship between these parameters, and Theorem 2 is arguably on par with incoherence-optimal conditions (Chen 2015). These results state that, when a practitioner is oblivious to the data matrix's structure, $\asymp n_1 r \log^2(n_1)$ observed entries are needed to robustly reconstruct an incoherent $n_1 \times n_2$ rank $r$ matrix.
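For concreteness, the following is a minimal numerical sketch of program (1). It is illustrative only: the dimensions, noise level, and generic convex solver below are assumptions of this sketch, not the paper's implementation (the experiments in Section 4 use the LR-BPDN solver of Aravkin et al. 2014).

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n1, n2, r = 60, 50, 3
m = 1500                                   # number of revealed entries

# Rank-r ground truth and a uniformly random sampling set Omega of size m.
D = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
flat = rng.choice(n1 * n2, size=m, replace=False)
mask = np.zeros((n1, n2)); mask.ravel()[flat] = 1.0

noise = 0.01 * rng.standard_normal((n1, n2)) * mask   # noise on observed entries only
Y = mask * D + noise                                  # observations, zero off Omega
eta = np.linalg.norm(noise)                           # parameter matched to ||d||_2

# Program (1): minimize ||X||_* subject to ||P_Omega(X) - P_Omega(D) - d||_2 <= eta.
X = cp.Variable((n1, n2))
prob = cp.Problem(cp.Minimize(cp.normNuc(X)),
                  [cp.norm(cp.multiply(mask, X) - Y, "fro") <= eta])
prob.solve()
print("relative error:", np.linalg.norm(X.value - D, "fro") / np.linalg.norm(D, "fro"))
```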
1.1 Matrix Completion with Prior Knowledge: Approach and Overview of the Main Results

In many applications, prior knowledge of the data's structure is available. Incorporating this information appropriately into a reconstruction program has been shown to significantly improve the success of matrix recovery (Aravkin et al. 2014; Zhang et al. 2019, 2020; Abernethy et al. 2009; Bayat and Daei 2020; Chiang et al. 2015; Eftekhari et al. 2018b; Chen 2015; Xu et al. 2013; Yi et al. 2013; Jain and Dhillon 2013; Chen et al. 2015, 2014). Inspired by these approaches, this paper proposes and analyzes a matrix reconstruction framework that exploits prior knowledge of subspaces that align well with the matrix of interest. This section introduces the approach and summarizes the results and novelties of this paper.

The contribution of this work is in the generality of the proposed framework and analysis. The main result provides the ability to derive near-optimal sample complexities and informative error bounds that express a trade-off when incorporating distinct subspaces enforcing known matrix structure. The result applies to a variety of weighted nuclear norm minimization programs, providing novel insight into previous approaches in the literature and proposing new programs.

To introduce the approach and summarize the results, let $T \subset \mathbb{C}^{n_1\times n_2}$ be a linear subspace of matrices with orthogonal complement $T^\perp$ and respective orthogonal projections $\mathcal{P}_T$ and $\mathcal{P}_{T^\perp}$. With weight parameter $0 \le \omega \le 1$, the family of weighted nuclear norm minimization programs proposed here approximates the data matrix via the following modified version of (1):

$$\hat D_\omega := \operatorname*{argmin}_{X \in \mathbb{C}^{n_1\times n_2}} \|\omega\,\mathcal{P}_T(X) + \mathcal{P}_{T^\perp}(X)\|_* \quad \text{subject to} \quad \|\mathcal{P}_\Omega(D) + d - \mathcal{P}_\Omega(X)\|_2 \le \eta. \qquad (4)$$

When $\omega < 1$, program (4) favors matrices that align with the estimate subspace $T$, while the case $\omega = 1$ reduces the program to unbiased nuclear norm minimization (1). The weight parameter toggles how severely one wishes to penalize matrices that do not agree with the prior information, thereby capturing the user's confidence in $T$. The main novelty of program (4) in contrast to previous approaches is the ability to incorporate subspaces with general structure and the flexibility of weight selection. The theoretical contributions of this paper can be summarized as follows:

- Theorem 4 analyzes program (4) in a general sense, applying to any subspace $T$ with elements of maximal rank $r$. The main result states that one can accurately reconstruct a matrix nearly lying in $T$ from $m \asymp r M_1\,\mathrm{polylog}(n_1)$ observed entries, where $1 \le M_1 \le n_1 n_2/r$ captures crucial dimensional and incoherence-based properties of $T$. The result is robust, with error bound proportional to the measurement noise level $\eta$, the error of the best rank $r$ approximation $\sum_{k=r+1}^{n_2}\sigma_k(D)$, and a term that quantifies the accuracy of $T$.

- Specific choices of $T$ are presented in Section 2.1, demonstrating the applicability of Theorem 4. The results derived therein showcase a variety of sample complexities including $m \asymp r\,\mathrm{polylog}(n_1)$, $m \asymp r^2\,\mathrm{polylog}(n_1)$, and $m \asymp (n_1 r - r^2)\,\mathrm{polylog}(n_1)$. Furthermore, the derived error bounds express an informative trade-off between sample complexity and sensitivity to the accuracy of $T$. In other words, programs incorporating subspaces $T$ that require fewer samples will in general exhibit error terms that are more susceptible to an inaccurate $T$ (quantified via the principal angles between subspaces; Ji-guang 1987). This behavior is validated numerically in Section 4.

- Some examples in Section 2.1 are related to approaches previously studied in the literature: Yi et al. 2013; Bayat and Daei 2020; Chiang et al. 2015; Eftekhari et al. 2018b; Chen 2015; Xu et al. 2013; Jain and Dhillon 2013. Most of the methodologies or results from these citations only apply in a high fidelity scenario, when $T$ is error-free or aligns sufficiently well with the data matrix. In contrast, the main contribution here is the robustness of program (4) and the theoretical results.
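Before proceeding, a hedged sketch of the family (4) may help fix ideas. The projection $P_T$ is passed as a function acting linearly on the optimization variable; the subspace, weight, and generic solver below are illustrative assumptions (the paper instead reformulates (4) as program (26) and solves it with LR-BPDN, see Section 4 and A.3).

```python
import numpy as np
import cvxpy as cp

def solve_weighted(Y, mask, P_T, omega, eta):
    """Program (4): minimize ||omega P_T(X) + P_Tperp(X)||_* subject to data
    fidelity on the observed entries; P_T must act linearly on a cvxpy expression."""
    X = cp.Variable(Y.shape)
    PT_X = P_T(X)
    objective = cp.normNuc(omega * PT_X + (X - PT_X))     # P_Tperp(X) = X - P_T(X)
    constraints = [cp.norm(cp.multiply(mask, X) - Y, "fro") <= eta]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return X.value

# Example subspace built from estimates U_bar, V_bar with orthonormal columns.
rng = np.random.default_rng(1)
n1, n2, r = 40, 40, 2
U_bar, _ = np.linalg.qr(rng.standard_normal((n1, r)))
V_bar, _ = np.linalg.qr(rng.standard_normal((n2, r)))
P_T = lambda X: U_bar @ (U_bar.T @ X @ V_bar) @ V_bar.T   # orthogonal projection onto T
```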
The error bounds derived here provide novel insight into previous approaches, allowing inexact prior information while roughly matching the sample complexity of related results in the literature. See Section 3 for further discussion.

1.2 Organization and Notation

The remainder of the paper is organized as follows: Section 2 discusses a foundational weighted matrix completion approach from the literature in order to elaborate on the inspiration for this work, its main result, and the improvements provided relative to the literature. Section 2.1 applies the main result to example subspaces $T$, deriving a variety of sample complexities and error bounds. Section 3 discusses related work in the literature and the introduced incoherence parameter $\mu_1$ in order to fairly compare this work with other results. Section 4 conducts numerical experiments, comparing example programs to the original weighted program discussed in Section 2. The paper concludes with a discussion of future work in Section 5, followed by the proofs in the Appendix.

Notation: for any integer $n \in \mathbb{N}$, $[n]$ denotes the set $\{\ell \in \mathbb{N} : 1 \le \ell \le n\}$ and $I_n$ is the $n \times n$ identity matrix. For $k, \ell \in \mathbb{N}$, $b_k$ indicates the $k$-th entry of the vector $b$, $X_{k\ell}$ denotes the $(k,\ell)$ entry of the matrix $X$, and $X_{k\cdot}$ (resp. $X_{\cdot\ell}$) denotes its $k$-th row (resp. $\ell$-th column). For vectors, $\|b\|_2$ is the Euclidean norm. For matrices, $\sigma_k(X)$ denotes the $k$-th largest singular value of $X$, $\|X\| := \sigma_1(X)$ is the operator norm, $\|X\|_F := \langle X, X\rangle^{1/2}$ is the Frobenius norm, $\|X\|_* := \sum_k \sigma_k(X)$ is the nuclear norm, and $\|X\|_\infty$ is the largest entry of $X$ in absolute value. $S$ and $S_{op}$ are the closed unit balls in $\mathbb{C}^{n_1\times n_2}$ with respect to the Frobenius and operator norms, respectively. The adjoint of a linear operator $\mathcal{A}$ is denoted by $\mathcal{A}^*$, while $X^*$ denotes the conjugate transpose of a matrix $X$. As previously mentioned, for matrices $X \in \mathbb{C}^{n_1\times n_2}$ it is assumed that $n_1 \ge n_2$ without loss of generality.

2. Weighted Matrix Completion

To the author's best knowledge, the first version of a weighted nuclear norm minimization program was proposed by Aravkin et al. 2014. To elaborate on this original approach, given $r \le n_2$, consider the SVD and decompose the data matrix of interest as

$$D = U\Sigma V^* = U_r\Sigma_r V_r^* + U_+\Sigma_+ V_+^*,$$

where $U_r\Sigma_r V_r^*$ pertains to the largest $r$ singular values of $D$ along with the corresponding singular vectors. Assume that $\bar U \in \mathbb{C}^{n_1\times r}$ and $\bar V \in \mathbb{C}^{n_2\times r}$ with orthonormal columns are available, containing information about the ranges of $U_r \in \mathbb{C}^{n_1\times r}$ and $V_r \in \mathbb{C}^{n_2\times r}$, and define

$$Q_{\omega_1} := \omega_1 \bar U\bar U^* + (I_{n_1} - \bar U\bar U^*), \qquad W_{\omega_2} := \omega_2 \bar V\bar V^* + (I_{n_2} - \bar V\bar V^*),$$

where $\omega_1, \omega_2 \in [0,1]$ are chosen weights. Notice that $Q_1, W_1$ are identity matrices; otherwise, when $\omega_1, \omega_2 < 1$, these linear operators skew toward the orthogonal complements of $\mathrm{range}(\bar U)$ and $\mathrm{range}(\bar V)$, respectively. The original weighted nuclear norm minimization program approximates the data matrix via

$$\hat D_{\omega_1,\omega_2} := \operatorname*{argmin}_{X \in \mathbb{C}^{n_1\times n_2}} \|Q_{\omega_1} X W_{\omega_2}\|_* \quad \text{subject to} \quad \|\mathcal{P}_\Omega(D) + d - \mathcal{P}_\Omega(X)\|_2 \le \eta. \qquad (5)$$

Analogous to this paper's approach, with $\omega_1, \omega_2 < 1$ program (5) favors matrices that match a certain structure, while $\omega_1 = \omega_2 = 1$ is unbiased nuclear norm minimization (1). Notice that (5) is nearly of the form (4), but the choice of two weights in the original formulation impedes the results here from being directly applicable. However, a program of the form (4) that is closely related to (5) will be considered in Section 2.1. Spurring from the original program, similar methodologies have been proposed in the literature (Eftekhari et al.
2018b; Zhang et al. 2019, 2020; Bayat and Daei 2020), but distinct approaches have also been considered (Xu et al. 2013; Chiang et al. 2015; Yi et al. 2013; Jain and Dhillon 2013; Abernethy et al. 2009; Chen 2015). To produce a result that provides some level of insight for many of these variations, a more general notion of incoherence that applies to an entire subspace is required.

Definition 3 For a subspace $T \subset \mathbb{C}^{n_1\times n_2}$ and $\rho \le n_2$, the subspace $\rho$-incoherence and joint incoherence parameters of $T$ are defined respectively as

$$M_0 := \max_{X \in T\cap S_{op}} \frac{n_2}{\rho}\|X\|_{\infty,2}^2 \qquad (6)$$

and

$$M_1 := \max_{X \in T\cap S} \frac{n_1 n_2}{\rho}\max_{k,\ell}|X_{k\ell}|^2, \qquad (7)$$

where $\|X\|_{\infty,2}$ is the maximum of the row and column norms of $X$, see (24). The ensemble will be referred to as the $\rho$-subspace incoherence parameters or condition.

These parameters will provide crucial dimensional information about a given subspace $T$. Section 2.1 will explore these parameters via example programs, with results that demonstrate their role in deriving near-optimal sample complexities. Henceforth, let $\rho$ be defined as

$$\rho = \max_{X \in T} \operatorname{rank}(X). \qquad (8)$$

The main result of the paper can now be presented:

Theorem 4 Let $T \subset \mathbb{C}^{n_1\times n_2}$ be a subspace with $\rho$-subspace incoherence parameters $M_0, M_1$. Suppose $\Omega \subset [n_1]\times[n_2]$ is generated by selecting a subset of size $m \le n_1 n_2$ uniformly at random from all subsets of size $m$. Let $D \in \mathbb{C}^{n_1\times n_2}$ and define $\hat D_\omega$ as in (4) with $\|d\|_2 \le \eta$. There exist universal constants $c_0, c_1, c_2, c_3, c_4 > 0$ such that if

$$m \ge c_0 M_1 \rho \log^3(n_1) \quad\text{and}\quad 0 < \omega \le \sqrt{\frac{M_1\log(n_1+n_2)}{n_1 n_2}} \qquad (9)$$

or

$$m \ge c_0 \max\big\{M_0, \sqrt{M_1}\big\}\,\omega\, n_1 \rho \log^2(n_1) \quad\text{and}\quad \frac{1}{2\sqrt{\rho\log(n_1)}} \le \omega \le 1, \qquad (10)$$

then

$$\|D - \hat D_\omega\|_F \le c_1 f(\omega)\sqrt{\frac{n_1 n_2\log(n_1)}{m}}\sum_{k=\rho+1}^{n_2}\sigma_k(D) + c_2 f(\omega)\left(\sqrt{\frac{n_1 n_2\log(n_1)}{m}} + \sqrt{\frac{n_1 n_2\,\rho\log(n_1)}{m}}\right)\eta + c_3 f(\omega)\sqrt{\frac{n_1 n_2\log(n_1)}{m}}\,\big\|\mathcal{P}_{T^\perp}(U_\rho\Sigma_\rho V_\rho^*)\big\|_* \qquad (11)$$

holds with probability greater than $1 - c_4(n_1+n_2)^{-1}$, where

$$f(\omega) := \min\left\{\frac{1}{\omega},\; 1 + \omega\sqrt{\frac{n_1 n_2\log(n_1)}{m}}\right\}.$$

As in the beginning of the section, $U_\rho\Sigma_\rho V_\rho^*$ in (11) denotes the best rank $\rho$ approximation of $D$ via the SVD. The proof is postponed until Section A.3. The theorem implies that faithful prior subspace knowledge (with elements of maximal rank $\rho$) can be used to robustly approximate a nearly rank $\rho$ matrix in the subspace from as few as $\asymp M_1\rho\log^3(n_1)$ observed entries. The result applies to two intervals for $\omega$, with small and large weights ($\omega \approx 0$ and $\omega \approx 1$, respectively). These ranges offer distinct sample complexities and error bounds. The main focus of this paper is the regime of small weights, which produces a substantial reduction in sample complexity. However, the interval of larger weights (10) also generates interesting results (see the end of Section 2.1).

The result is robust to inexact subspaces, expressed in the term $\|\mathcal{P}_{T^\perp}(U_\rho\Sigma_\rho V_\rho^*)\|_*$ appearing in (11). This term quantifies the inaccuracy of the estimate subspace relative to the best rank $\rho$ approximation of the data, where $\|\mathcal{P}_{T^\perp}(U_\rho\Sigma_\rho V_\rho^*)\|_* \approx 0$ implies that $T$ was chosen appropriately. Arguably, this term is quite raw, and the result can be made more informative if a more general metric is used. The main convenience in presenting Theorem 4 with error bound (11) is that no restrictions are imposed on $T$. To provide a more informative error bound (at the cost of restricting $T$), the accuracy of the estimate subspace can instead be quantified via the principal angle between subspaces (PABS; Drmač 2000):

$$\sin(\theta_T) := \|\mathcal{P}_{T^\perp}\mathcal{P}_{T_\rho}\|_{F\to F}. \qquad (12)$$
In (12), $\|\cdot\|_{F\to F}$ is the spectral norm for operators acting on matrices, and $\mathcal{P}_{T_\rho}$ is the orthogonal projection onto a $\rho$-dimensional subspace of matrices to which $U_\rho\Sigma_\rho V_\rho^*$ belongs:

$$T_\rho := \operatorname{span}\big\{(U_\rho)_{\cdot k}(V_\rho)_{\cdot k}^*\big\}_{k=1}^{\rho}.$$

Notice that the PABS lies in $[0,1]$, where $\sin(\theta_T) \approx 0$ implies an accurate subspace estimate and $\sin(\theta_T) \approx 1$ states that $U_\rho\Sigma_\rho V_\rho^*$ largely lies in $T^\perp$. Since $\sin(\theta_T) < 1$ is possible only when $\dim(T) \ge \dim(T_\rho)$ (see Drmač 2000), this restriction on $T$ must now be enforced in order to obtain the following modified version of Theorem 4 involving the PABS:

Corollary 5 Under the same conditions of Theorem 4, if $\dim(T) \ge \rho$ then

$$\|D - \hat D_\omega\|_F \le c_1 f(\omega)\sqrt{\frac{n_1 n_2\log(n_1)}{m}}\sum_{k=\rho+1}^{n_2}\sigma_k(D) + c_2 f(\omega)\left(\sqrt{\frac{n_1 n_2\log(n_1)}{m}} + \sqrt{\frac{n_1 n_2\,\rho\log(n_1)}{m}}\right)\eta + c_3 f(\omega)\sin(\theta_T)\sqrt{\frac{n_2\, n_1 n_2\log(n_1)}{m}}\Big(\sum_{k=1}^{\rho}\sigma_k^2(D)\Big)^{1/2}$$

holds with probability greater than $1 - c_4(n_1+n_2)^{-1}$.

Corollary 5 provides a more standard error bound, since the PABS are well understood metrics of signal alignment in the literature. However, aside from imposing $\dim(T) \ge \rho$, notice that the last term in the error bound above now includes the avoidable factor $\sqrt{n_2}$. For this reason, the example results of Section 2.1 will apply Theorem 4 rather than Corollary 5, since the latter seems to produce pessimistic error bounds in general.

To further understand the implications of the main results and demonstrate the crucial role of the subspace incoherence parameters, it is instructive to consider specific choices of $T$. This is the purpose of Section 2.1, where sample complexities will be supplied for various estimate subspaces. The proofs of these results, which are corollaries of Theorem 4, consist of lower and upper bounding $M_1$. The lower bounds (presented in Appendix B) enforce that the sample complexity upholds the dimensionality of $T$. These examples illustrate the importance of $M_1$ and the near optimality of the main result.

2.1 Matrix Completion with Prior Knowledge: Example Subspaces

This section provides example programs that enforce specific subspace structure on the output estimate data matrix. Theorem 4 will be applied in the small weight regime (9) to derive informative reconstruction error bounds and nearly optimal sample complexities. In particular, the results will showcase a trade-off when incorporating distinct choices of $T$. The examples will illustrate that subspaces allowing exact matrix completion from relatively few samples will in general exhibit increased sensitivity to inaccurate prior information.

To elaborate in a concrete setting, decompose as before $D = U\Sigma V^* = U_r\Sigma_r V_r^* + U_+\Sigma_+ V_+^*$, where $U_r\Sigma_r V_r^*$ is the best rank $r$ approximation of $D$. Assume that $\bar U \in \mathbb{C}^{n_1\times r}$ and $\bar V \in \mathbb{C}^{n_2\times r}$ with orthonormal columns are available, containing information about $U_r, V_r$. In this section, $\bar U$ and $\bar V$ will specify $T$. The accuracy of the prior information will be mainly quantified via the principal angles between subspaces (PABS) with respect to the ranges of the left and right singular vectors. From the SVD, notice that the columns of $U_+$ (resp. $V_+$) form an orthonormal basis for $\mathrm{range}(U_r)^\perp$ (resp. $\mathrm{range}(V_r)^\perp$). The PABS here are defined via the canonical correlation coefficients (Ji-guang 1987)

$$\sin(\theta_u) := \|\bar U^* U_+\| \quad\text{and}\quad \sin(\theta_v) := \|\bar V^* V_+\|, \qquad (13)$$

where $\|X\| = \sigma_1(X)$ denotes the operator norm of a matrix $X$. The PABS lie in $[0,1]$ and provide a measure of the degree of alignment between the data's main rank $r$ component and the estimates specified in some sense by $\bar U$ and $\bar V$. The case $\sin(\theta_u), \sin(\theta_v) \approx 0$ holds with accurate knowledge, while $\sin(\theta_u), \sin(\theta_v) \approx 1$ implies inappropriate prior information was supplied.
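The PABS (13) are straightforward to compute. The following sketch (with hypothetical dimensions and perturbation level) evaluates $\sin(\theta_u)$ both via the complement, as in (13), and via the canonical-angle identity from the SVD of $\bar U^* U_r$; the two agree up to numerical precision.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, r = 200, 10
Ur, _ = np.linalg.qr(rng.standard_normal((n1, r)))               # true left factors
U_bar, _ = np.linalg.qr(Ur + 0.1 * rng.standard_normal((n1, r))) # perturbed estimate

# U_plus: orthonormal basis of range(Ur)^perp, from a complete QR factorization.
U_plus = np.linalg.qr(Ur, mode="complete")[0][:, r:]

sin_u = np.linalg.norm(U_bar.T @ U_plus, 2)                 # definition (13)
cosines = np.linalg.svd(U_bar.T @ Ur, compute_uv=False)
assert np.isclose(sin_u, np.sqrt(1.0 - cosines.min()**2))   # canonical-angle identity
print("sin(theta_u) =", sin_u)
```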
With this in mind, simplified results for four programs of the form (4) with specific choices of $T$ are provided. Henceforth, program (4) incorporating a subspace $T$ will be referred to as program (4, $T$). The following examples are presented in order of most to least sensitive error bounds with respect to the inaccuracy of $T$, respectively exhibiting a trend of increasing sample complexities.

Theorem 6 Let $D \in \mathbb{C}^{n_1\times n_2}$ have rank $r$, and consider estimates $\bar U \in \mathbb{C}^{n_1\times r}$ and $\bar V \in \mathbb{C}^{n_2\times r}$ where $\bar U\bar V^*$ has $r$-incoherence parameter $\mu_1$. Suppose $\Omega \subset [n_1]\times[n_2]$ is generated by selecting a subset of size $m \le n_1 n_2$ uniformly at random from all subsets of size $m$. Define $\hat D_\omega$ as in (4) with $\|d\|_2 = \eta = 0$ and estimate subspace

$$T_1 := \operatorname{span}\big\{\bar U_{\cdot k}\bar V_{\cdot k}^*\big\}_{k=1}^{r}.$$

There exist universal constants $c_0, c_3 > 0$ such that if

$$m \ge c_0\,\mu_1^2\, r\log^3(n_1) \quad\text{and}\quad 0 < \omega \le \mu_1\sqrt{\frac{\log(n_1+n_2)}{n_1 n_2}},$$

then with high probability

$$\|D - \hat D_\omega\|_F \le c_3\sqrt{\frac{n_1 n_2\, r\log(n_1)}{m}}\left(\sin(\theta_u) + \sin(\theta_v) + \Big(\sum_{k\neq\ell}\big\|U_r^*\bar U_{\cdot k}\bar V_{\cdot\ell}^* V_r\big\|_F^2\Big)^{1/2}\right)\|D\|_F. \qquad (14)$$

To the author's best knowledge, program (4, $T_1$) provides a novel method by which to exploit prior information about the data's rank one components. The result states that, with an appropriate weight and an accurate projection onto the subspace spanned by the data's rank one components, program (4, $T_1$) can faithfully estimate a rank $r$ matrix with $O(r\log^3(n_1))$ random samples (when $\mu_1$ is of constant order). In a high fidelity scenario, there are roughly $r$ degrees of freedom to approximate $U_r\Sigma_r V_r^*$. Therefore the derived sample complexity is within a quadratic logarithmic factor of the optimal rate $O(r\log(n_1))$, where the logarithmic factor in the optimal rate is unavoidable in matrix completion under random sampling models (see, for example, Theorem 1.7 of Candès and Tao 2010). On the other hand, in the case of unreliable information, the right hand side of (14) showcases high sensitivity to an inaccurate $T_1$. Several elaborated remarks are in order:

- Incorporating an accurate subspace $T_1$ allows for frugal completion of matrices that do not satisfy typical incoherence conditions. For example, with severe standard parameter $\mu_0 \approx \sqrt{n_1}/r$, since $\mu_1^2 \le \mu_0^2 r$, Theorem 6 guarantees reconstruction from $O(n_1\log^3(n_1))$ random samples. Without prior information, such data matrices require a significant number of additional samples to be recovered according to Theorem 2 and similar results in the literature.

- Program (4, $T_1$) allows reconstruction of general full rank matrices in an underdetermined scenario with prior information. With $n_2$ trustworthy rank one components (or an orthogonal projection onto their span), the result allows an accurate estimate of a full rank matrix from $O(n_2\log^3(n_1))$ observed entries.

- In contrast to the other approaches that will be considered, program (4, $T_1$) is the most sensitive to inaccurate prior information. This observation is due to the final term in (14) involving the sum over all $k \neq \ell$, which is not present in the remaining error bounds. This term requires that each matrix $\bar U_{\cdot k}\bar V_{\cdot\ell}^*$ with $k \neq \ell$ be unaligned with the elements of $\operatorname{span}\{(U_r)_{\cdot k}(V_r)_{\cdot\ell}^*\}_{k,\ell\in[r]}$. This requisite is quite strict, since even a single inaccurate rank one component included as prior knowledge can cause a significant error according to (14).
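As an illustration of $T_1$, the sketch below builds the orthogonal projection onto $T_1$ (the rank one matrices $\bar U_{\cdot k}\bar V_{\cdot k}^*$ are orthonormal, so projection reduces to extracting $r$ coefficients) and evaluates $M_1(T_1)$ via the closed form derived in Appendix B. Dimensions are hypothetical and the real case is assumed for simplicity.

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2, r = 100, 80, 5
U_bar, _ = np.linalg.qr(rng.standard_normal((n1, r)))
V_bar, _ = np.linalg.qr(rng.standard_normal((n2, r)))

def proj_T1(X):
    """Orthogonal projection onto T1 = span{u_k v_k^T}: extract the r
    coefficients <u_k v_k^T, X> and reassemble (real case)."""
    coeffs = np.einsum("ik,ij,jk->k", U_bar, X, V_bar)
    return (U_bar * coeffs) @ V_bar.T

# M1(T1) in closed form (Appendix B): (n1 n2 / r) * max_{p,q} sum_k U_pk^2 V_qk^2.
M1_T1 = (n1 * n2 / r) * ((U_bar**2) @ (V_bar**2).T).max()
print("M1(T1) =", M1_T1)
```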
The next result involves a methodology related to many previously proposed in the literature (Chen 2015; Xu et al. 2013; Chiang et al. 2015; Yi et al. 2013; Jain and Dhillon 2013; Abernethy et al. 2009). The result renders a less sensitive methodology at the cost of a higher sample complexity.

Theorem 7 Under the same setup as Theorem 6, let $\bar U\bar U^*$ and $\bar V\bar V^*$ have $r$-standard incoherence parameters $\mu_L$ and $\mu_R$ respectively. Define $\hat D_\omega$ as in (4) with $\eta = 0$ and subspace estimate

$$T_2 := \operatorname{span}\big\{\bar U_{\cdot k}\bar V_{\cdot\ell}^*\big\}_{(k,\ell)\in[r]\times[r]}.$$

There exist universal constants $c_0, c_3 > 0$ such that if

$$m \ge c_0\,\mu_L\mu_R\, r^2\log^3(n_1) \quad\text{and}\quad 0 < \omega \le \sqrt{\frac{\mu_L\mu_R\, r\log(n_1+n_2)}{n_1 n_2}},$$

then with high probability

$$\|D - \hat D_\omega\|_F \le c_3\,\big(\sin(\theta_u) + \sin(\theta_v)\big)\sqrt{\frac{n_1 n_2\, r\log(n_1)}{m}}\,\|D\|_F. \qquad (15)$$

This result considers a robust approach in contrast to program (4, $T_1$), with the more lenient requisite that the columns of $U_r$ and $V_r$ lie in the ranges of $\bar U$ and $\bar V$ respectively. This comparison is evident in the produced error bound (15), which is similar to (14) but no longer exhibits the sensitive term that sums over $k \neq \ell$. However, the trade-off is an increased sample complexity of $O(r^2\log^3(n_1))$. When $\omega \approx 0$, this methodology and result roughly agree with the work by Yi et al. 2013; Chen 2015; Xu et al. 2013; Jain and Dhillon 2013, but program (4, $T_2$) is more flexible since it introduces a choice of weight and allows for erroneous row and column subspaces (for further discussion on related work see Section 3.1).

Theorems 6 and 7 help illustrate the crucial role of the $\rho$-subspace incoherence parameters from Definition 3. Consider program (4, $T_2$), where $\rho = r$. To obtain Theorem 7 from Theorem 4, it will be shown in Appendix B that $M_1(T_2) = \mu_L\mu_R r$. Since $T_2$ is an $r^2$-dimensional space, one should not expect exact matrix recovery via program (4, $T_2$) with fewer than $\asymp r^2\log(n_1)$ randomly sampled entries. This sample complexity is enforced by the lower bound $\mu_L\mu_R r \ge r$, which illustrates the near optimality of the main result. Similarly, for program (4, $T_1$) it holds that $M_1(T_1) = \mu_1^2 \ge 1$, which upholds the dimensionality of $T_1$.

The next example involves incorporating only row or column span information.

Theorem 8 Under the same setup as Theorem 6, let $\bar U\bar U^*$ have $r$-standard incoherence parameter $\mu_L$. Define $\hat D_\omega$ as in (4) with $\eta = 0$ and subspace estimate

$$T_3 := \big\{X \in \mathbb{C}^{n_1\times n_2} \;\big|\; X = \bar U\bar U^* X\big\}.$$

There exist universal constants $c_0, c_3 > 0$ such that if

$$m \ge c_0\,\mu_L\, n_2\, r\log^3(n_1) \quad\text{and}\quad 0 < \omega \le \sqrt{\frac{\mu_L\log(n_1+n_2)}{n_1}},$$

then with high probability

$$\|D - \hat D_\omega\|_F \le c_3\sin(\theta_u)\sqrt{\frac{n_1 n_2\, r\log(n_1)}{m}}\,\|D\|_F. \qquad (16)$$

In contrast to unbiased nuclear norm minimization (1), the result demonstrates that one-sided information can reduce the sample complexity for tall rectangular matrices with $n_2 \ll n_1$. Analogously, for wide matrices one can incorporate row space information instead. Comparing to the previous examples (incorporating $T_1$ and $T_2$), notice that error bound (16) is less sensitive to inaccurate subspaces, as it only involves a single PABS term.
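For comparison, the projections onto $T_2$ and $T_3$ have simple closed forms (standard consequences of $\bar U, \bar V$ having orthonormal columns). A brief sketch with hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(4)
n1, n2, r = 100, 80, 5
U_bar, _ = np.linalg.qr(rng.standard_normal((n1, r)))
V_bar, _ = np.linalg.qr(rng.standard_normal((n2, r)))

proj_T2 = lambda X: U_bar @ (U_bar.T @ X @ V_bar) @ V_bar.T  # row and column spaces
proj_T3 = lambda X: U_bar @ (U_bar.T @ X)                    # column space only

X = rng.standard_normal((n1, n2))
for P in (proj_T2, proj_T3):
    assert np.allclose(P(P(X)), P(X))                   # idempotent, as a projection must be
    assert np.isclose(np.sum(P(X) * (X - P(X))), 0.0)   # residual orthogonal to the subspace
```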
The final example attempts to provide some intuition for the original weighted nuclear norm approach. Although Theorem 4 is not directly applicable to program (5) and its variations, some intuition may be provided by considering the program of the form (4) with estimate subspace

$$T_4 := \big\{X \in \mathbb{C}^{n_1\times n_2} \;\big|\; X = \bar U\bar U^* X\bar V\bar V^* + \bar U\bar U^* X\bar V^\perp(\bar V^\perp)^* + \bar U^\perp(\bar U^\perp)^* X\bar V\bar V^*\big\}, \qquad (17)$$

where the orthonormal columns of $\bar U^\perp$ (resp. $\bar V^\perp$) span $\mathrm{range}(\bar U)^\perp$ (resp. $\mathrm{range}(\bar V)^\perp$). Arguably, among all programs considered here, this approach is most related to the original weighted program. To gain insight, a new notion of incoherence is needed.

Definition 9 Given $D \in \mathbb{C}^{n_1\times n_2}$ and $r \le \min\{n_1,n_2\}$, consider the singular value decomposition (SVD) $D = U\Sigma V^*$. The $r$-complementary incoherence parameter of $D$ is defined as the largest $\mu_2 \ge 0$ such that

$$\min_{1\le k\le n_1}\|U_{k\cdot}\|_2^2 \ge \frac{\mu_2 r}{n_1}, \qquad \min_{1\le\ell\le n_2}\|V_{\ell\cdot}\|_2^2 \ge \frac{\mu_2 r}{n_2}. \qquad (18)$$

This incoherence condition is strongly related to the standard incoherence parameter. Notice that $1 = \|U_{p\cdot}\|_2^2 + \|U^\perp_{p\cdot}\|_2^2$ for any $p \in [n_1]$ (where $U^\perp$ completes $U$ to a unitary matrix), and the same observation holds for $V$. Therefore $0 \le \mu_2 \le 1 \le \mu_0$, and for this reason the new condition is referred to as complementary. This new parameter also quantifies how spiky a data matrix is, where $\mu_2 \approx 0$ holds for matrices with entire rows or columns containing very little information and $\mu_2 \approx 1$ implies an even distribution of nonzero entries throughout the matrix. The result for the weighted program incorporating (17) now follows.

Theorem 10 Under the same setup as Theorem 6, let $\bar U\bar V^*$ have $r$-standard and $r$-complementary incoherence parameters $\mu_0$ and $\mu_2$ (resp.). Define $\hat D_\omega$ as in (4) with $\eta = 0$ and estimate subspace $T_4$ from (17). There exist universal constants $c_0, c_3 > 0$ such that if

$$m \ge c_0\,\mu_0\max\{\mu_0 r,\, n_1 - \mu_2 r\}\, r\log^3(n_1) \quad\text{and}\quad 0 < \omega \le \sqrt{\frac{(n_1 - \mu_2 r)\log(n_1+n_2)}{n_1 n_2}},$$

then with high probability

$$\|D - \hat D_\omega\|_F \le c_3\sin(\theta_u)\sin(\theta_v)\sqrt{\frac{n_1 n_2\, r\log(n_1)}{m}}\,\|D\|_F. \qquad (19)$$

Under lenient conditions, it holds that $\max\{\mu_0 r, n_1 - \mu_2 r\} = n_1 - \mu_2 r$, and therefore an $n_1\times n_2$ rank $r$ matrix can be recovered from $\asymp \mu_0(n_1 r - \mu_2 r^2)\log^3(n_1)$ observed entries with prior information. Notice that $T_4$ is an $(n_1 r + n_2 r - r^2)$-dimensional subspace, which is respected by the complexity of the result. In contrast to unbiased nuclear norm minimization, this approach provides a slight reduction in sample complexity, requiring $\asymp \mu_0\mu_2 r^2\log^3(n_1)$ fewer samples. However, the current result exhibits a larger logarithmic dependency while removing the incoherence condition related to $\mu_1$, so a fair comparison is difficult to make.

Relative to the other approaches considered in this section, program (4, $T_4$) is most robust to inaccurate prior information, since it imposes less of a constraint on the search space. This is exhibited in (19) via the product $\sin(\theta_u)\sin(\theta_v)$, the smallest error coefficient among the examples thus far. However, this reduction in sensitivity produces one of the largest sampling requisites of this section. The trend expressed by these error bounds will be explored numerically in Section 4, where decreasing susceptibility to inaccurate prior information in general requires an increasing number of observed entries for exact matrix completion.

The results in this section all apply Theorem 4 in the regime of small weights (9). The context of larger weights (10) can also supply interesting results. For example, program (4, $T_1$) with $\omega = (r\log(n_1))^{-1/2}$ can be shown to require $\asymp \max\{\mu_0,\mu_1\}\, n_1\sqrt{r}\log^{3/2}(n_1)$ samples for accurate matrix estimation. Similarly, program (4, $T_2$) would need $\asymp \mu_0 n_1 r\log^{3/2}(n_1)$ observed entries, mildly reducing the sample complexity in contrast to (1) by a logarithmic factor. Intuitively, such larger weight choices would be most appropriate for less reliable prior information while still reducing the number of samples.

3. Discussion

This section elaborates on the main result and the considered weighted programs. Comparison to previous work is conducted in Section 3.1, and a discussion of the novel incoherence parameters is provided in Section 3.2.

3.1 Related Work

Many authors have considered how to efficiently include prior information in matrix reconstruction problems. The papers by Aravkin et al. 2014; Zhang et al. 2019, 2020 focus on applications to seismology and numerical aspects of the problem, adopting an approach that spurred from program (5), first proposed by Aravkin et al. 2014. A distinct approach is considered by Chen et al. 2015, 2014; Eftekhari et al. 2018a, where the prior information is used to bias the sampling scheme according to the array's leverage scores.
Therein, the authors show that $O(n_1 r\log^2(n_1))$ revealed entries provide exact (noiseless) completion of a rank $r$ matrix with a more lenient dependency on the standard incoherence parameter $\mu_0$.

Most relevant to the context of this paper are the results by Yi et al. 2013; Bayat and Daei 2020; Chiang et al. 2015; Eftekhari et al. 2018b; Chen 2015; Xu et al. 2013; Jain and Dhillon 2013, which provide theoretical analysis related to program (5) or to program (4, $T_2$) in Theorem 7. The work by Eftekhari et al. 2018b applies directly to program (5) in the matrix completion case and in the general matrix sensing scenario. In the matrix completion setting, the authors obtain sample complexity $|\Omega| \gtrsim \mu_0 n_1 r\log(n_1)$ under inexact prior information, thereby reducing the number of samples by a logarithmic factor in contrast to unbiased nuclear norm minimization. The sample complexity and error bounds produced therein depend on the PABS (13), imposing a fidelity requisite on $T$ for the results to be applicable. The results in this paper are not directly comparable. Arguably, the result most related in this work is Theorem 10, where program (4, $T_4$) requires $|\Omega| \gtrsim \mu_0(n_1 - \mu_2 r)r\log^3(n_1)$ samples. In contrast to the work by Eftekhari et al. 2018b, the result here does not require PABS-based prior knowledge or requisites for applicability, and the resulting error bounds are more lenient in terms of the dependence on $\sin(\theta_u)$ and $\sin(\theta_v)$.

The authors Bayat and Daei 2020 consider a more flexible program than the original approach, where four weight choices are allowed. Their results only apply to the matrix sensing scenario, where the authors demonstrate an $O(nr)$ sample complexity. The remaining citations (Yi et al. 2013; Chen 2015; Xu et al. 2013; Jain and Dhillon 2013; Abernethy et al. 2009) consider approaches arguably similar in nature to program (4, $T_2$) with $\omega = 0$ and exact prior information, while Chiang et al. 2015 allows for noisy side information. Among these, the smallest sample complexity is $O(\mu_0 r^2\log(n_1)\log(r))$, provable in the setting of exact prior knowledge (in the analogous scenario where $\rho = r$ for simplicity). In this context, Theorem 7 presented here with small weights allows for accurate estimation from $O(\mu_0^2 r^2\log^3(n_1))$ sampled entries. This sampling condition is slightly worse than what is derived by Yi et al. 2013; Chen 2015. However, the approach considered here is a more flexible methodology with weight selection that allows for improved reconstruction output (see Section 4 for numerical behavior based on weight selection). Furthermore, Theorem 7 applies to cases with inexact prior knowledge and derives error bounds that provide insight when inaccurate estimates $\bar U$ and $\bar V$ are incorporated in different ways.

3.2 Incoherence Parameter $\mu_1$

This section discusses the parameter $\mu_1$ defined in (3), comparing it to the standard and joint incoherence parameters from the literature. To the best of the author's knowledge, this definition of incoherence has not appeared in the matrix completion literature. Previous optimal results have sample complexity requiring only linear dependence on the standard parameter $\mu_0$. The results here also depend on this parameter, but additionally introduce $\mu_1$ with sub-linear and linear dependence. Arguably, $\mu_1$ is reminiscent of the joint incoherence (or strong incoherence) condition introduced by Recht 2011; Gross 2011. The joint incoherence parameter will be denoted here as $\bar\mu_1$.
Given $r \le n_2$ and recalling the SVD of $D$, the joint incoherence parameter of $D$ is defined as the smallest $\bar\mu_1 > 0$ such that

$$\max_{(k,\ell)\in[n_1]\times[n_2]}\Big|\sum_{j=1}^{r}U_{kj}\overline{V_{\ell j}}\Big|^2 \le \frac{\bar\mu_1 r}{n_1 n_2}.$$

In particular, $\mu_1$ in definition (3) also depends jointly on the right and left singular vectors. Furthermore, $\mu_1^2 \le \mu_0^2 r$, which is also a tight upper bound for the joint incoherence parameter (Chen 2015). However, it is important to note that Recht 2011 and other authors derived sample complexity $m \gtrsim \bar\mu_1 n_1 r\log^2(n_1)$. In contrast, the work here requires $m \gtrsim \mu_1 n_1 r\log^2(n_1)$ in Theorem 2 and $m \gtrsim \mu_1^2 r\log^3(n_1)$ in Theorem 6. Though it is difficult to provide a fair comparison, this section will argue that the results here impose relatively lenient incoherence conditions in a joint sense.

The author Chen 2015 discusses the exorbitant nature of the joint incoherence parameter, since it intuitively requires the rows of $U_r$ and $V_r$ to be unaligned, a requisite with no reasonable explanation. In the current work, all results would also hold if $\mu_1$ were defined as the smallest number such that

$$\max_{1\le k\le n_1}\Big(\sum_{j=1}^{r}|U_{kj}|^4\Big)^{1/2}\,\max_{1\le\ell\le n_2}\Big(\sum_{j=1}^{r}|V_{\ell j}|^4\Big)^{1/2} \le \frac{\mu_1^2 r}{n_1 n_2}. \qquad (20)$$

This alternative definition alleviates the joint nature of the original definition. It is now arguable that this condition requires the $\ell_4$ norms of the rows of $U_r$ and $V_r$ to both be small, thereby imposing an additional non-spikiness condition analogous to the requisite of a small $\mu_0$ parameter with respect to the $\ell_2$ norms. However, in contrast to the adopted definition (3), the incoherence parameter given by (20) is pessimistic (see the second example below), and for this reason the original definition is kept.

Notice that Theorem 2 only has sub-linear dependence on the introduced parameter $\mu_1$. This observation is crucial to properly compare some of the work here with previous incoherence-optimal results. To elaborate, specific data matrices are produced to compute $\mu_0$, $\mu_1$, and $\bar\mu_1$.

Case $\mu_0 \approx \mu_1$: consider the random orthogonal model and the incoherent basis model of Candès and Recht 2009. Therein, the authors show that a rank $r$ matrix $M = U\Sigma V^*$ generated from the random orthogonal model obeys $\max_{k,\ell}|U_{k\ell}|^2 \le 10\log(n_1)/n_1$ and $\max_{k,\ell}|V_{k\ell}|^2 \le 10\log(n_2)/n_2$ with high probability. From this, it is easy to see that $\mu_0, \mu_1 \lesssim \log(n_1)$. A similar conclusion holds trivially for the incoherent basis model and any singular vectors obeying the size property 1.12 in the same reference. Joint incoherence: in this example, $\bar\mu_1 \lesssim \log^2(n_1)$ (see Candès and Recht 2009).

Case $\mu_0 > \mu_1$: let $r < n_2$ and $M = UV^*$, where the columns of $U$ and $V$ consist of any $r$ columns of $I_{n_1}$ (the $n_1\times n_1$ identity matrix) and any $r$ columns of the unitary 1D Fourier transform $F : \mathbb{C}^{n_2} \to \mathbb{C}^{n_2}$, respectively. Then the $r$-incoherence parameters satisfy $\mu_0 = n_1/r$ and $\mu_1 = \sqrt{n_1/r}$. Note that definition (20) would instead obtain $\mu_1 = \sqrt{n_1/\sqrt{r}}$ in this example, which demonstrates the improvement gained by the chosen definition (3). Joint incoherence: here $\bar\mu_1 = n_1/r$.

Case $\mu_0 < \mu_1$: let $M = UV^*$, where the columns of $U$ and $V$ consist of any $r$ columns of $I_{n_1}$ and any $r$ columns of $I_{n_2}$, respectively. This example gives the worst case $r$-incoherence parameters $\mu_0 = n_1/r$ and $\mu_1 = \sqrt{n_1 n_2/r}$. In general, using (20) shows that $\mu_1 \le \mu_0\sqrt{r}$, which is sharp according to this example when $n_1 = n_2$. Joint incoherence: $\bar\mu_1 = n_1 n_2/r$.

From these examples, it is clear that there is no strict relationship between $\mu_0$ and $\mu_1$. Moreover, typical data matrices of interest from the literature seem to largely lie in the regime where $\mu_1 \lesssim \mu_0$. In these cases, Theorem 2 intuitively reduces to depend solely on $\mu_0$.
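The identity/Fourier example above is easy to verify numerically. The sketch below evaluates $\mu_0$, $\mu_1$ (with the normalization of (3)), and $\bar\mu_1$ directly from their definitions; the sizes are arbitrary choices of this sketch.

```python
import numpy as np

n1 = n2 = 256
r = 16

U = np.eye(n1)[:, :r]                        # r columns of the identity
F = np.fft.fft(np.eye(n2)) / np.sqrt(n2)     # unitary 1D Fourier transform
V = F[:, :r]                                 # r columns of the DFT matrix

mu0 = max((np.abs(U)**2).sum(1).max() * n1 / r,
          (np.abs(V)**2).sum(1).max() * n2 / r)
mu1 = np.sqrt((np.abs(U)**2 @ (np.abs(V)**2).T).max() * n1 * n2 / r)
mu1_bar = (np.abs(U @ V.conj().T)**2).max() * n1 * n2 / r

print(mu0, n1 / r)               # mu0 = n1/r = 16
print(mu1, np.sqrt(n1 / r))      # mu1 = sqrt(n1/r) = 4
print(mu1_bar, n1 / r)           # joint incoherence = n1/r = 16
```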
Moreover, the examples reveal that $\mu_1 \le \sqrt{\bar\mu_1}$ holds intuitively, though a proof of such a statement is not provided in this work. The lenient joint incoherence conditions of this paper are due to the sub-linear dependence on $\bar\mu_1$ (through $\mu_1$) in Theorem 2, which is most comparable to the work of Recht 2011. Notice that $\mu_1 = \sqrt{\bar\mu_1}$ holds in the examples above. However, Theorem 6 does require linear dependence on $\mu_1^2$ (but removes the linear dependence on $n_1$). Since $\mu_1^2$ may be as large as $r\mu_0^2$, this may lead to the requisite $m \gtrsim r^2\mu_0^2\log^3(n_1)$, which still offers a severe reduction in the number of observed entries, comparable to the result in Theorem 7.

4. Numerical Experiments

This section numerically explores programs (4, $T_1$) and (4, $T_2$), comparing them to the original weighted nuclear norm minimization program (5). The goal of this section is to numerically validate the error bounds of Section 2.1. The experiments will reveal the practicality of the derived analysis, agreeing with the theoretical observation that weighted programs incorporating subspaces that require fewer samples will in general exhibit increased sensitivity to inaccurate prior information.

The setup of Eftekhari et al. 2018b is adopted to generate a data matrix and subspace information. Let $D = U_r\Sigma_r V_r^T \in \mathbb{R}^{n_1\times n_2}$, where $U_r \in \mathbb{R}^{n_1\times r}$ and $V_r \in \mathbb{R}^{n_2\times r}$ are constructed by orthogonalizing the columns of standard random Gaussian matrices with $r$ columns, and $D$ is normalized so that $\|D\|_F = 1$. To obtain prior knowledge, a perturbed matrix $\tilde D = D + N$ is generated, where the entries of $N \in \mathbb{R}^{n_1\times n_2}$ are i.i.d. Gaussian random variables with variance $\sigma^2$ that is toggled to select a desired PABS. Then $\bar U \in \mathbb{R}^{n_1\times r}$ and $\bar V \in \mathbb{R}^{n_2\times r}$ are the leading $r$ left and right singular vectors of $\tilde D$. The dimensions are set to $n_1 = n_2 = 500$ and $r = 50$. The set of observed matrix entries is selected uniformly at random from all subsets of the same cardinality $|\Omega| = \lambda n_1 n_2$, where $\lambda \in [0,1]$ is varied to specify a desired sampling percentage. In each experiment, $D$, $N$ and $\Omega$ are generated independently, and programs (4) and (5) are solved with $\omega = \omega_1\omega_2$ varying in $(0,1]$ (setting $\omega_1 = \omega_2$). The plots below present the average relative errors of 100 independent trials with trustworthy and with relatively inaccurate subspace estimates.

Programs (4) and (5) are solved using the LR-BPDN implementation introduced by Aravkin et al. 2014, which combines the Pareto curve methodology (van den Berg and Friedlander 2009) with a matrix factorization approach. With $\omega > 0$, (26) is solved in lieu of (4); this is an equivalent formulation that trades off the objective function with a modified projection operator in the constraint (see A.3). This allows LR-BPDN to be directly applicable to (26), with output $\tilde D_\omega = \omega\mathcal{P}_T(\hat D_\omega) + \mathcal{P}_{T^\perp}(\hat D_\omega)$, which gives the desired solution as $\hat D_\omega = \omega^{-1}\mathcal{P}_T(\tilde D_\omega) + \mathcal{P}_{T^\perp}(\tilde D_\omega)$. An analogous trick is used to solve (5) when $\omega_1\omega_2 > 0$.
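The data and prior-knowledge generation is reproducible in a few lines. Below is a hedged sketch of this setup; the singular value profile (unspecified in the text) and the value of $\sigma$ are assumptions of this sketch, and the LR-BPDN solve itself is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(5)
n1 = n2 = 500
r = 50
sigma = 0.01       # toggled to hit a target PABS (an assumed value, not a paper value)
lam = 0.15         # sampling percentage |Omega| / (n1 n2)

# D = Ur Sr Vr^T, Gaussian factors orthogonalized, normalized so ||D||_F = 1.
Ur, _ = np.linalg.qr(rng.standard_normal((n1, r)))
Vr, _ = np.linalg.qr(rng.standard_normal((n2, r)))
Sr = np.diag(rng.random(r))        # spectrum unspecified in the text; uniform is a guess
D = Ur @ Sr @ Vr.T
D /= np.linalg.norm(D, "fro")

# Prior knowledge: leading r singular vectors of the perturbed matrix D + N.
N = sigma * rng.standard_normal((n1, n2))
Uh, _, Vh_t = np.linalg.svd(D + N)
U_bar, V_bar = Uh[:, :r], Vh_t[:r, :].T

# Observed entries: a uniform subset of cardinality lam * n1 * n2 (noiseless).
m = int(lam * n1 * n2)
mask = np.zeros(n1 * n2, dtype=bool)
mask[rng.choice(n1 * n2, size=m, replace=False)] = True
mask = mask.reshape(n1, n2)
y = D[mask]
```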
Varying weights: noiseless numerical results with varying weights $\omega, \omega_1\omega_2 \in (0,1]$ are shown in Figure 1. Two plots are provided, corresponding to reliable prior information with $|\Omega|/(n_1 n_2) = .01$ (left plot) and less accurate prior knowledge with $|\Omega|/(n_1 n_2) = .15$ (right plot). In these plots, the variance of $N$ is chosen to provide PABS $\sin(\theta_u), \sin(\theta_v) \approx .1$ and $\sin(\theta_u), \sin(\theta_v) \approx .2$, respectively. In the case of good subspace estimates (left plot), program (4, $T_1$) clearly outperforms the other approaches, obtaining a relative error $\approx .1$ with only 1% of observed entries. However, this approach is demonstrated to be relatively sensitive when less accurate prior information is supplied. With relatively inaccurate prior information, the original weighted program (5) exhibits the best reconstruction error in the right plot of Figure 1. The numerical behavior illustrated in Figure 1 agrees with the derived error bounds and the discussion provided in Section 2.1, where Theorem 10 provides some insight into the robust behavior of program (5) at the cost of higher sample complexity.

Figure 1: Plots of weight choice versus average relative error of matrix reconstruction via noiseless weighted nuclear norm minimization programs. The left plot applies reliable prior information with $\sin(\theta_u), \sin(\theta_v) \approx .1$ and 1% of observed matrix entries. The right plot was obtained with $\sin(\theta_u), \sin(\theta_v) \approx .2$, where 15% of the entries are observed.

Varying sampling percentage: numerical results with varying percentages of observed noiseless entries are shown in Figure 2. Applying the weight choices from Figure 1 that give the smallest reconstruction error for each program, two plots are shown with reliable subspaces (left plot) and less accurate prior knowledge (right plot). In the case of accurate estimate subspaces, program (4, $T_1$) obtains the smallest relative errors at all shown sampling percentages. However, observe that programs (4, $T_2$) and (5) obtain comparable relative errors from as little as 8% of observed entries. As in Figure 1, in the case of less accurate subspaces, the original weighted program exhibits the most flexibility toward untrustworthy prior information. Therefore, Figure 2 also reflects the trade-off between sample complexity and sensitivity to inaccurate prior information that agrees with the theoretical conclusions of Section 2.1.

Figure 2: Plots of sampling percentage versus average relative error of matrix reconstruction via noiseless weighted nuclear norm minimization programs. The left plot applies reliable prior information with $\sin(\theta_u), \sin(\theta_v) \approx .1$ and weight choices of $\omega_1\omega_2 = .01$ for program (5), $\omega = .02$ for program (4, $T_2$), and $\omega = .01$ for program (4, $T_1$).
The right plot was obtained with $\sin(\theta_u), \sin(\theta_v) \approx .2$ and weight choices of $\omega_1\omega_2 = .06$ for program (5), $\omega = .3$ for program (4, $T_2$), and $\omega = .2$ for program (4, $T_1$).

5. Conclusion

In this paper, a family of weighted matrix completion programs is proposed as a means to incorporate prior knowledge into the matrix completion problem. The main result establishes the nearly optimal sampling rate $O(M_1 r\log^3(n))$ by which an $n\times n$ rank $r$ matrix can be accurately approximated when a subspace of rank $r$ matrices is enforced (where $M_1$ captures dimensional and incoherence-based properties of the subspace). The analysis allows for robust matrix approximation in the case of inexact subspaces, measurement noise, and full rank data matrices. The work provides novel intuition for previous methodologies in the literature, while introducing novel approaches. Finally, the results and numerical experiments showcase an insightful trade-off caused by incorporating distinct subspaces. In general, it is observed that subspaces requiring fewer samples for exact matrix completion are more susceptible to inaccurate prior information.

Several important limitations and potential improvements need to be elaborated and explored as future work. The subspace incoherence parameter $M_1$ may extract the parameter $\mu_1$, which is distinct in comparison to results that only require the standard incoherence condition. A modified analysis that depends only on the standard parameter $\mu_0$ would be most informative and would align best with the existing literature. Furthermore, the main theoretical result does not seem to provide any insight into weight selection according to the user's confidence in the estimate subspace or any other variables. An error bound that specifies the optimal weight choice in different scenarios would be of great value to practitioners. As future work, it would be of interest to adapt the proof strategy and tools of this work to further comprehend how to set the program parameters of the proposed method and of previously considered weighted approaches.

Another avenue to extend this work is to consider numerically efficient alternatives to the nuclear norm approach presented here. Strong prior information may be sufficient to alleviate sample complexity, potentially rendering the low rank assumption unnecessary for matrix completion. For example, a least squares-based weighted program might arguably produce comparable estimates while reducing computational complexity by avoiding rank penalization terms. The tools developed here may be useful for analyzing such approaches, and may help better understand the independent roles that prior information and data rank play in the sample complexity.

Appendix A. Proof of the Main Results

This section provides proofs for the main results, only stating the required lemmas. These lemmas will be proven in Appendix C.

A.1 Dual Certificate

The first lemma establishes dual certificate conditions relative to $T$ for recovery error bounds. It is stated in general form, applicable to any linear operator. For the statement, it is important to notice that for any $X \in \mathbb{C}^{n_1\times n_2}$ there exists a $G \in T\cap S_{op}$ such that $\|\mathcal{P}_T(X)\|_* = \langle X, G\rangle$, by characterization of the nuclear norm via its dual norm (the operator norm). Furthermore, for $\mathcal{A} : \mathbb{C}^{n_1\times n_2} \to \mathbb{C}^m$ define $\|\mathcal{A}\|_{F\to 2} := \max_{X\in S}\|\mathcal{A}(X)\|_2$.

Lemma 11 Let $\mathcal{A} : \mathbb{C}^{n_1\times n_2} \to \mathbb{C}^m$ be a linear operator with

$$\inf_{X\in T\cap S}\|\mathcal{A}(X)\|_2 \ge \beta_1 > 0, \qquad \|\mathcal{A}\mathcal{P}_{T^\perp}\|_{F\to 2} \le \beta_2. \qquad (21)$$

Given $D \in \mathbb{C}^{n_1\times n_2}$ with $\|\mathcal{P}_T(D)\|_* = \langle D, G\rangle$ for some $G \in T\cap S_{op}$, assume that there exist $Y, Z \in \mathbb{C}^{n_1\times n_2}$ with $Y = \mathcal{A}^*\mathcal{A}(Z)$ satisfying

$$\|\mathcal{P}_T(Y) - G\|_F \le \beta_3, \qquad \|\mathcal{P}_{T^\perp}(Y)\| \le \beta_4, \qquad \|\mathcal{A}(Z)\|_2 \le \beta_5, \qquad (22)$$

and $\frac{\beta_2\beta_3}{\beta_1} + \beta_4 < 1$. Let $d \in \mathbb{C}^m$ with $\|d\|_2 \le \eta$ and

$$\hat D := \operatorname*{argmin}_{X\in\mathbb{C}^{n_1\times n_2}} \|X\|_* \quad \text{s.t.} \quad \|\mathcal{A}(X) - \mathcal{A}(D) - d\|_2 \le \eta.$$

Then $\|\hat D - D\|_F \le C_1\|\mathcal{P}_{T^\perp}(D)\|_* + C_2\eta$, where $C_1, C_2$ depend on the $\beta_k$'s.

The proof of this lemma is postponed until C.1, where the dependence of $C_1$ and $C_2$ on the $\beta_k$'s is specified. The main result will be obtained by establishing (21) and (22) for the matrix completion sampling operator $\mathcal{P}_\Omega$. With this in mind, the proof of Theorem 4 is provided, from which the remaining theorems are corollaries (see Appendix B). The proof considers the sampling with replacement model, discussed in detail in the next section.

A.2 Sampling Model

As in previous work, the workload will be simplified by considering the uniform sampling with replacement model. In other words, let $\bar\Omega$ be generated by choosing $m$ entries independently and uniformly at random from $[n_1]\times[n_2]$ (this allows $\bar\Omega$ to have repeated entries). Recht 2011 and Gross 2011 show that any upper bound on the probability of failure for exact (noiseless) matrix completion via $\bar\Omega$ is also valid for uniform sampling without replacement. This strategy will apply in the scenario of this work, proceeding in a different manner to include noisy observations, full rank matrices, and inexact subspace estimates.
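Before proceeding, note that the forthcoming normalization of the sampling operator is chosen so that it is the identity in expectation. The following small sketch checks this empirically under sampling with replacement; the dimensions and trial count are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
n1, n2, m, trials = 20, 15, 120, 20000

X = rng.standard_normal((n1, n2))
acc = np.zeros_like(X)
for _ in range(trials):
    ks = rng.integers(0, n1, size=m)          # m entries drawn with replacement
    ls = rng.integers(0, n2, size=m)
    PP_X = np.zeros_like(X)
    np.add.at(PP_X, (ks, ls), X[ks, ls])      # P'_Omega P_Omega, with multiplicities
    acc += (n1 * n2 / m) * PP_X               # one realization of bar(A)* bar(A)(X)
acc /= trials
print("max deviation from X:", np.abs(acc - X).max())   # -> 0 as trials grow
```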
With $\bar\Omega$ generated as above, let $\Omega \subset \bar\Omega$ consist of the $|\Omega| \le m$ distinct samples in $\bar\Omega$, and define the normalized operators

$$\mathcal{A} = \sqrt{\frac{n_1 n_2}{m}}\,\mathcal{P}_\Omega \quad\text{and}\quad \bar{\mathcal{A}} = \sqrt{\frac{n_1 n_2}{m}}\,\mathcal{P}_{\bar\Omega},$$

which extract and scale an input matrix's values at the entries specified by the subset in the subscript. As shown by Gross 2011 (Section II), the distribution of $\Omega$ is the same as the distribution of sampling $|\Omega|$ entries uniformly without replacement. Therefore, assume $\Omega \subset \bar\Omega$ WLOG, so that generating $\bar\Omega$ with $m$ entries generates $\Omega$ uniformly at random, and any lower bound requisite on $|\Omega|$ is also satisfied by $m$ (since $m \ge |\Omega|$). Notice that for any $X \in \mathbb{C}^{n_1\times n_2}$ and $(k,\ell) \in [n_1]\times[n_2]$,

$$\big(\mathcal{P}'_\Omega\mathcal{P}_\Omega(X)\big)_{k\ell} = \begin{cases} X_{k\ell} & \text{if } (k,\ell)\in\Omega \\ 0 & \text{otherwise}\end{cases} \quad\text{and}\quad \big(\mathcal{P}'_{\bar\Omega}\mathcal{P}_{\bar\Omega}(X)\big)_{k\ell} = \begin{cases} \mathrm{mult}_{\bar\Omega}(k,\ell)\,X_{k\ell} & \text{if } (k,\ell)\in\bar\Omega \\ 0 & \text{otherwise,}\end{cases}$$

where $\mathrm{mult}_{\bar\Omega}(k,\ell) \in \mathbb{N}$ is the multiplicity of $(k,\ell)$ in $\bar\Omega$. Let $\mathrm{mult}_{\bar\Omega}(k,\ell) \le \tau$ for all $(k,\ell) \in \bar\Omega$. The quantity $\tau$ will be bounded using Proposition 3.3 of Recht 2011 with parameter $\beta = 3/2$ therein. For $n_1 \ge 9$, the proposition gives $\tau \le 4\log(n_1)$ with probability exceeding $1 - n_1^{-1}$. Then for any $X \in \mathbb{C}^{n_1\times n_2}$,

$$\|\bar{\mathcal{A}}(X)\|_2 \le \sqrt{\tau}\,\|\mathcal{A}(X)\|_2 \le 2\sqrt{\log(n_1)}\,\|\mathcal{A}(X)\|_2. \qquad (23)$$

The remainder will operate in this scenario, so that (23) holds with high probability. Therefore, using the lower bound $\inf_{X\in T\cap S}\|\bar{\mathcal{A}}(X)\|_2 \ge \bar\beta_1$ for some $\bar\beta_1 > 0$ allows choosing the parameter $\beta_1 = \bar\beta_1/(2\sqrt{\log(n_1)}) > 0$ from Lemma 11. The following lemma is crucial for this purpose, and will also be used to compute parameters $\beta_3$ and $\beta_5$ with an appropriate choice of dual certificate.

Lemma 12 Let $T \subset \mathbb{C}^{n_1\times n_2}$ be a subspace with subspace joint incoherence parameter $M_1$ and $\rho$ defined as in (8). With $\bar{\mathcal{A}}$ as above and $0 < \delta \le 1$, if

$$m \ge C\,\delta^{-1} M_1\rho\log^{3/2}(n_1+n_2),$$

where $C > 0$ is an absolute constant, then

$$\sup_{X\in T\cap S}\big|\langle(\bar{\mathcal{A}}^*\bar{\mathcal{A}} - \mathcal{I})(X), X\rangle\big| < 2\delta$$

with probability exceeding $1 - (n_1+n_2)^{-1}$.

The proof can be found in C.2, which generalizes the approach of Rauhut 2008; Rudelson and Vershynin 2008 (the Rudelson–Vershynin lemma via Dudley's inequality) as done by Liu 2011. The following lemma by Chen 2015 is needed to compute $\beta_4$.

Lemma 13 Suppose $Z \in \mathbb{C}^{n_1\times n_2}$ is a fixed matrix. Then with probability greater than $1 - (n_1+n_2)^{-1}$,

$$\big\|(\bar{\mathcal{A}}^*\bar{\mathcal{A}} - \mathcal{I})(Z)\big\| \le \frac{4 n_1 n_2\log(n_1+n_2)}{m}\|Z\|_\infty + \sqrt{\frac{2 n_1 n_2\log(n_1+n_2)}{m}}\,\|Z\|_{\infty,2}.$$

The concentration inequality uses the norm

$$\|Z\|_{\infty,2} := \max\Big\{\max_k\Big(\sum_j|Z_{kj}|^2\Big)^{1/2},\; \max_\ell\Big(\sum_j|Z_{j\ell}|^2\Big)^{1/2}\Big\}, \qquad (24)$$

which is the maximum of the row and column norms of $Z$. The result appears as Lemma 2 in the work of Chen 2015; the proof is adapted here to the sampling with replacement model (see C.3). The parameters in (21) and (22) may now be computed to establish the weighted matrix completion error bounds.

A.3 Proof of Theorem 4: Weighted Matrix Completion

With respect to the normalized operators, the output of (4) is equivalently given as

$$\hat D_\omega := \operatorname*{argmin}_{X\in\mathbb{C}^{n_1\times n_2}} \|\omega\mathcal{P}_T(X) + \mathcal{P}_{T^\perp}(X)\|_* \quad\text{s.t.}\quad \Big\|\mathcal{A}(D) + \sqrt{\tfrac{n_1 n_2}{m}}\,d - \mathcal{A}(X)\Big\|_2 \le \sqrt{\tfrac{n_1 n_2}{m}}\,\eta. \qquad (25)$$

Further, note that

$$\omega\mathcal{P}_T(\hat D_\omega) + \mathcal{P}_{T^\perp}(\hat D_\omega) = \operatorname*{argmin}_{X} \|X\|_* \quad\text{s.t.}\quad \Big\|\mathcal{A}(D) + \sqrt{\tfrac{n_1 n_2}{m}}\,d - \mathcal{A}\big(\omega^{-1}\mathcal{P}_T(X) + \mathcal{P}_{T^\perp}(X)\big)\Big\|_2 \le \sqrt{\tfrac{n_1 n_2}{m}}\,\eta,$$

where equality holds since $\omega\mathcal{P}_T(\cdot) + \mathcal{P}_{T^\perp}(\cdot)$ is invertible when $\omega > 0$, with inverse operator $\omega^{-1}\mathcal{P}_T(\cdot) + \mathcal{P}_{T^\perp}(\cdot)$. Let $\mathcal{P}_\omega(\cdot) = \mathcal{P}_T(\cdot) + \omega\mathcal{P}_{T^\perp}(\cdot)$. Multiplying both sides of the constraint in (25) by $\omega$ gives

$$\omega\mathcal{P}_T(\hat D_\omega) + \mathcal{P}_{T^\perp}(\hat D_\omega) = \operatorname*{argmin}_{X} \|X\|_* \;\text{s.t.}\; \Big\|\omega\mathcal{A}(D) + \omega\sqrt{\tfrac{n_1 n_2}{m}}\,d - \mathcal{A}\mathcal{P}_\omega(X)\Big\|_2 \le \omega\sqrt{\tfrac{n_1 n_2}{m}}\,\eta$$
$$= \operatorname*{argmin}_{X} \|X\|_* \;\text{s.t.}\; \Big\|\mathcal{A}\mathcal{P}_\omega(D_T) + \omega\sqrt{\tfrac{n_1 n_2}{m}}\,d - \mathcal{A}\mathcal{P}_\omega(X)\Big\|_2 \le \omega\sqrt{\tfrac{n_1 n_2}{m}}\,\eta, \qquad (26)$$

where the last line defines $D_T := \omega\mathcal{P}_T(D) + \mathcal{P}_{T^\perp}(D)$. The dual certificate of Lemma 11 will be produced with respect to program (26) with sampling operator $\mathcal{A}_T := \mathcal{A}\mathcal{P}_\omega$, to obtain an upper bound for $\|D_T - \omega\mathcal{P}_T(\hat D_\omega) - \mathcal{P}_{T^\perp}(\hat D_\omega)\|_F$, which will in turn give an appropriate bound for $\|D - \hat D_\omega\|_F$.

Proof [Proof of Theorem 4] Using the notation above, notice that

$$\inf_{X\in T\cap S}\|\mathcal{A}_T(X)\|_2 = \inf_{X\in T\cap S}\|\mathcal{A}(X)\|_2. \qquad (27)$$
The parameters in (21) and (22) from Lemma 11, with respect to $\mathcal{A}_T$ and $D_T$, will now be bounded. Let $G \in T\cap S_{op}$ be such that $\|\mathcal{P}_T(D_T)\|_* = \langle D_T, G\rangle$. For the dual certificate, choose $Y := \mathcal{A}_T^*\mathcal{A}_T(Z)$, where $Z := \mathcal{P}_\omega^{-1}\big(\mathcal{P}'_{\bar\Omega}\mathcal{P}_{\bar\Omega}(G)\big)$. Notice that $Y = \mathcal{A}_T^*\big(\mathcal{A}(\mathcal{P}'_{\bar\Omega}\mathcal{P}_{\bar\Omega}(G))\big)$. After bounding the $\beta_k$'s, the proof will finish by applying Lemma 11.

Parameter $\beta_1$: using (27) and (23) will give $\beta_1 = (2\sqrt{2\log(n_1)})^{-1}$. To show this, Lemma 12 will be applied in two different ways below with $\delta \le 1/4$, so that for any $X \in T\cap S$,

$$\big|\langle(\bar{\mathcal{A}}^*\bar{\mathcal{A}} - \mathcal{I})(X), X\rangle\big| < 2\delta \le \frac{1}{2}, \qquad (28)$$

and consequently $\|\bar{\mathcal{A}}(X)\|_2^2 \ge 1/2$, which gives $\beta_1 \ge (2\sqrt{2\log(n_1)})^{-1}$ with high probability.

Parameter $\beta_2$: this parameter is $\|\mathcal{A}_T\mathcal{P}_{T^\perp}\|_{F\to 2} = \omega\|\mathcal{A}\mathcal{P}_{T^\perp}\|_{F\to 2}$. With the chosen dual certificate, use the calculation

$$\omega\|\mathcal{A}\mathcal{P}_{T^\perp}\|_{F\to 2} \le \omega\sqrt{\frac{n_1 n_2}{m}}\,\|\mathcal{P}_\Omega\|_{F\to 2}\,\|\mathcal{P}_{T^\perp}\|_{F\to F} \le \omega\sqrt{\frac{n_1 n_2}{m}},$$

where $\|\mathcal{P}_\Omega\|_{F\to 2} \le 1$ holds since $\Omega$ has no repeated entries.

Parameter $\beta_3$: this parameter can be computed using (28). Note that $\mathcal{P}_T(Y) = \mathcal{P}_T\big(\mathcal{A}^*\mathcal{A}(\mathcal{P}'_{\bar\Omega}\mathcal{P}_{\bar\Omega}(G))\big) = \mathcal{P}_T\big(\bar{\mathcal{A}}^*\bar{\mathcal{A}}(G)\big)$. Then

$$\|\mathcal{P}_T(Y) - G\|_F = \big\|\mathcal{P}_T(\bar{\mathcal{A}}^*\bar{\mathcal{A}} - \mathcal{I})\mathcal{P}_T(G)\big\|_F \le 2\delta\sqrt{\rho} =: \beta_3,$$

since $\|G\| = 1$ gives $\|G\|_F \le \sqrt{\operatorname{rank}(G)} \le \sqrt{\rho}$, and (28) implies $\|\mathcal{P}_T(\bar{\mathcal{A}}^*\bar{\mathcal{A}} - \mathcal{I})\mathcal{P}_T\|_{F\to F} \le 2\delta$. Later, $\delta$ will be chosen in two different ways (always satisfying $\delta \le 1/4$), which will change the value of $\beta_3$ in each scenario.

Parameter $\beta_5$: under the scenario $\mathrm{mult}_{\bar\Omega}(k,\ell) \le \tau \le 4\log(n_1)$ (see the discussion in A.2), it can be shown that $\beta_5 \le \sqrt{6\rho\log(n_1)}$ as follows:

$$\big\|\mathcal{A}(\mathcal{P}'_{\bar\Omega}\mathcal{P}_{\bar\Omega}(G))\big\|_2 = \sqrt{\frac{n_1 n_2}{m}}\,\big\|\mathcal{P}_\Omega\big(\mathcal{P}'_{\bar\Omega}\mathcal{P}_{\bar\Omega}(G)\big)\big\|_2 \le \sqrt{\tau}\,\big\|\bar{\mathcal{A}}(G)\big\|_2 \le \sqrt{6\log(n_1)}\,\|G\|_F \le \sqrt{6\rho\log(n_1)}.$$

The second-to-last inequality follows from (28) with $2\delta \le 1/2$, since $G \in T$.

The remaining parameter $\beta_4$ is bounded by considering distinct ranges for $\omega$ and choices for $\delta \le 1/4$ (which will also determine $\beta_3$).

Case $\omega \le \sqrt{M_1\log(n_1+n_2)/(n_1 n_2)}$: in this scenario, apply Lemma 12 with $\delta = 1/4$, which holds if

$$m \ge C M_1\rho\log^{3/2}(n_1+n_2). \qquad (29)$$

With appropriate $C$, the probability of success exceeds $1 - (n_1+n_2)^{-1}$. Notice that this case gives $\beta_3 = \sqrt{\rho}/2$.

Parameter $\beta_4$: to bound $\|\mathcal{P}_{T^\perp}(Y)\|$, notice that

$$\|\mathcal{P}_{T^\perp}(Y)\| = \omega\big\|\mathcal{P}_{T^\perp}\big(\bar{\mathcal{A}}^*\bar{\mathcal{A}}(G)\big)\big\| = \omega\big\|\mathcal{P}_{T^\perp}\big(\bar{\mathcal{A}}^*\bar{\mathcal{A}}(G) - G\big)\big\| \le \omega\big\|\bar{\mathcal{A}}^*\bar{\mathcal{A}}(G) - G\big\|,$$

where the inequality holds since $\mathcal{P}_{T^\perp}$ is an orthogonal projection. Using the bound above and Lemma 13 gives

$$\|\mathcal{P}_{T^\perp}(Y)\| \le \frac{4\omega n_1 n_2\log(n_1+n_2)}{m}\|G\|_\infty + \omega\sqrt{\frac{2 n_1 n_2\log(n_1+n_2)}{m}}\,\|G\|_{\infty,2}$$

with probability at least $1 - (n_1+n_2)^{-1}$. Using the subspace incoherence condition, it is clear that

$$\|G\|_{\infty,2} \le \sqrt{\frac{M_0\rho}{n_2}} \quad\text{and}\quad \|G\|_\infty \le \|G\|_F\sqrt{\frac{M_1\rho}{n_1 n_2}} \le \frac{\sqrt{M_1}\,\rho}{\sqrt{n_1 n_2}}. \qquad (30)$$

Using (29) and (30), both resulting terms are bounded by small constants under the assumed range of $\omega$, defining $\beta_4$. Moreover,

$$\frac{\beta_2\beta_3}{\beta_1} = \omega\sqrt{\frac{2 n_1 n_2\,\rho\log(n_1)}{m}} \le \sqrt{\frac{2}{C}}.$$

Note that $M_0 \le n_2$, and therefore $\frac{\beta_2\beta_3}{\beta_1} + \beta_4 < 1$ whenever $\omega \le \sqrt{M_1\log(n_1+n_2)/(n_1 n_2)}$, and Lemma 11 may now be applied.

Case $(2\sqrt{\rho\log(n_1+n_2)})^{-1} \le \omega \le 1$: apply Lemma 12 with $\delta = m/\big(8\omega n_1\sqrt{2\rho\log(n_1+n_2)}\big)$. Using $m \le n_1^2$ and the assumption on $\omega$ gives $\delta \le 1/4$ as desired, provided that

$$C\max\big\{M_0, \sqrt{M_1}\big\}\,\rho\log^{3/2}(n_1+n_2)\cdot\frac{8\omega n_1\sqrt{2\rho\log(n_1+n_2)}}{m} \le m, \qquad (31)$$

which holds under (10). With an appropriate choice of $C$, the probability of success exceeds $1 - (n_1+n_2)^{-1}$. This case gives $\beta_3 = m/\big(4\omega n_1\sqrt{2\log(n_1+n_2)}\big)$.

Parameter $\beta_4$: as in the previous range for $\omega$, Lemma 13 along with (30) and (31) gives that with high probability $\|\mathcal{P}_{T^\perp}(Y)\| \le 1/4 =: \beta_4$, where the bound holds with $\omega \le 1$ and an appropriate choice of $C$. Note that $\frac{\beta_2\beta_3}{\beta_1} \le 1/2$ under the stated conditions, and therefore $\frac{\beta_2\beta_3}{\beta_1} + \beta_4 \le 3/4$, so that Lemma 11 can be applied in this case as well. This concludes the bounds for all $\beta_k$ parameters in (21) and (22) from Lemma 11.
In both considered weight ranges, Lemma 11 gives constants (considering only dominating terms; see the proof in C.1)

$$C_1 \asymp \sqrt{\frac{n_1 n_2\log(n_1)}{m}}, \qquad \Big(\frac{1}{\beta_1} + \beta_5\Big)\sqrt{\frac{n_1 n_2}{m}} \asymp \sqrt{\frac{n_1 n_2\,\rho\log(n_1)}{m}},$$

and error bound

$$\|D_T - \omega\mathcal{P}_T(\hat D_\omega) - \mathcal{P}_{T^\perp}(\hat D_\omega)\|_F \le C_1\|\mathcal{P}_{T^\perp}(D_T)\|_* + C_2\,\omega\sqrt{\tfrac{n_1 n_2}{m}}\,\eta = \omega\Big(C_1\|\mathcal{P}_{T^\perp}(D)\|_* + C_2\sqrt{\tfrac{n_1 n_2}{m}}\,\eta\Big). \qquad (32)$$

The desired error term $\|D - \hat D_\omega\|_F$ will be bounded in two different ways, to introduce $f(\omega)$ in (11) as the minimum of both bounds.

Bound 1 for $\|D - \hat D_\omega\|_F$: notice that

$$\|D_T - \omega\mathcal{P}_T(\hat D_\omega) - \mathcal{P}_{T^\perp}(\hat D_\omega)\|_F^2 = \omega^2\|\mathcal{P}_T(D - \hat D_\omega)\|_F^2 + \|\mathcal{P}_{T^\perp}(D - \hat D_\omega)\|_F^2 \ge \omega^2\|D - \hat D_\omega\|_F^2,$$

so that, for some absolute constant $c$,

$$\|D - \hat D_\omega\|_F \le c\left(\sqrt{\frac{n_1 n_2\log(n_1)}{m}}\,\|\mathcal{P}_{T^\perp}(D)\|_* + \sqrt{\frac{n_1 n_2\,\rho\log(n_1)}{m}}\,\eta\right).$$

Bound 2 for $\|D - \hat D_\omega\|_F$: use the derived properties of $\bar{\mathcal{A}}$ to obtain

$$\|D - \hat D_\omega\|_F^2 = \|\mathcal{P}_T(D - \hat D_\omega)\|_F^2 + \|\mathcal{P}_{T^\perp}(D - \hat D_\omega)\|_F^2 \le \frac{1}{\beta_1^2}\Big(2\sqrt{\tfrac{n_1 n_2}{m}}\,\eta + \sqrt{\tfrac{n_1 n_2}{m}}\,\|\mathcal{P}_{T^\perp}(D - \hat D_\omega)\|_F\Big)^2 + \|\mathcal{P}_{T^\perp}(D - \hat D_\omega)\|_F^2,$$

where the last inequality holds by feasibility of $\hat D_\omega$ for (4) and since $\mathcal{A} = \sqrt{n_1 n_2/m}\,\mathcal{P}_\Omega$ contains no repeated entries. The term in the final line can be bounded by (32), since

$$\|\mathcal{P}_{T^\perp}(D - \hat D_\omega)\|_F^2 \le \omega^2\|\mathcal{P}_T(D - \hat D_\omega)\|_F^2 + \|\mathcal{P}_{T^\perp}(D - \hat D_\omega)\|_F^2 = \|D_T - \omega\mathcal{P}_T(\hat D_\omega) - \mathcal{P}_{T^\perp}(\hat D_\omega)\|_F^2.$$

This approach gives, for some absolute constant $c$,

$$\|D - \hat D_\omega\|_F \le c\left(1 + \omega\sqrt{\frac{n_1 n_2\log(n_1)}{m}}\right)\left(\sqrt{\frac{n_1 n_2\log(n_1)}{m}}\,\|\mathcal{P}_{T^\perp}(D)\|_* + \sqrt{\frac{n_1 n_2\,\rho\log(n_1)}{m}}\,\eta\right) + c\sqrt{\frac{n_1 n_2\log(n_1)}{m}}\,\eta.$$

Notice that, in contrast to bound 1, bound 2 includes the multiplicative term $\omega\sqrt{n_1 n_2\log(n_1)/m}$ and the additive term $\sqrt{n_1 n_2\log(n_1)/m}$ in the noise term. Factoring out the multiplicative term and choosing the minimum of the two bounds defines $f(\omega)$. To finish, bound $\|\mathcal{P}_{T^\perp}(D)\|_*$ as follows:

$$\|\mathcal{P}_{T^\perp}(D)\|_* \le \|\mathcal{P}_{T^\perp}(U_\rho\Sigma_\rho V_\rho^*)\|_* + \|\mathcal{P}_{T^\perp}(U_+\Sigma_+V_+^*)\|_* \le \|\mathcal{P}_{T^\perp}(U_\rho\Sigma_\rho V_\rho^*)\|_* + \sum_{k=\rho+1}^{n_2}\sigma_k(D).$$

To establish Corollary 5 from the proof above, it suffices to bound $\|\mathcal{P}_{T^\perp}(U_\rho\Sigma_\rho V_\rho^*)\|_*$ as follows:

$$\|\mathcal{P}_{T^\perp}\mathcal{P}_{T_\rho}(U_\rho\Sigma_\rho V_\rho^*)\|_* \le \sqrt{n_2}\,\|\mathcal{P}_{T^\perp}\mathcal{P}_{T_\rho}(U_\rho\Sigma_\rho V_\rho^*)\|_F \le \sqrt{n_2}\,\|\mathcal{P}_{T^\perp}\mathcal{P}_{T_\rho}\|_{F\to F}\,\|U_\rho\Sigma_\rho V_\rho^*\|_F.$$

Appendix B. Proof of Theorems 2, 6, 7, 8, and 10

Theorems 6, 7, 8, and 10 are all corollaries of Theorem 4 with small $\omega$, while Theorem 2 applies $\omega = 1$. To obtain these results, it is sufficient to bound the $\rho$-subspace incoherence parameters for the respective estimate subspaces (all with $\rho = r$) and the term $\|\mathcal{P}_{T^\perp}(U_r\Sigma_r V_r^*)\|_*$ in (11). For Theorem 2, notice that with $\omega = 1$ program (4) becomes the nuclear norm minimization program (1). This makes $T$ irrelevant, allowing the choice of subspace as in Theorem 6 with $\bar U = U_r$ and $\bar V = V_r$ to provide $\|\mathcal{P}_{T^\perp}(U_r\Sigma_r V_r^*)\|_* = 0$. It only remains to upper bound $M_0$ in (6) and $M_1$ in (7) in this case.

Theorem 7, program (4, $T_2$): applying Theorem 4 with small weights requires upper and lower bounds on the subspace joint incoherence parameter $M_1$. Recall that any $X \in T_2\cap S$ satisfies $X = \bar U\bar U^* X\bar V\bar V^*$, so $X$ can be written as $X = \bar U W^*$ with $\mathrm{range}(W) \subset \mathrm{range}(\bar V)$, and as $X = Z\bar V^*$ with $\mathrm{range}(Z) \subset \mathrm{range}(\bar U)$. To upper bound the subspace joint incoherence, write $X = \bar U W^*$ and notice that $W = \bar V\alpha^*$ where $\alpha \in \mathbb{C}^{r\times r}$ and $\|\alpha\|_F = 1$. Then

$$|X_{pq}| = \Big|\sum_{k,j}\bar U_{pk}\alpha_{kj}\overline{\bar V_{qj}}\Big| \le \Big(\sum_{k,j}|\bar U_{pk}|^2|\bar V_{qj}|^2\Big)^{1/2} \le \frac{r\sqrt{\mu_0(\bar U\bar U^*)\mu_0(\bar V\bar V^*)}}{\sqrt{n_1 n_2}}.$$

Therefore $M_1(T_2) \le \mu_0(\bar U\bar U^*)\mu_0(\bar V\bar V^*)\,r$. The lower bound will be obtained by a proper selection of $\alpha$. Let $\bar p \in [n_1]$ attain the maximum row norm of $\bar U$, and likewise $\bar q \in [n_2]$ for $\bar V$. Then taking $\alpha_{jk} = c\,\overline{\bar U_{\bar p k}}\,\bar V_{\bar q j}$, where $c$ is a normalization constant (so that $\|\alpha\|_F = 1$), gives

$$|(\bar U W^*)_{\bar p\bar q}| = \|\bar U_{\bar p\cdot}\|_2\,\|\bar V_{\bar q\cdot}\|_2 = \frac{r}{\sqrt{n_1 n_2}}\sqrt{\mu_0(\bar U\bar U^*)\mu_0(\bar V\bar V^*)},$$

to obtain $M_1(T_2) = \mu_0(\bar U\bar U^*)\mu_0(\bar V\bar V^*)\,r$. Finally, bounding $\|\mathcal{P}_{T_2^\perp}(X)\|_*$ with $X = U_r\Sigma_r V_r^*$ gives

$$\|\mathcal{P}_{T_2^\perp}(X)\|_* = \big\|\bar U\bar U^* X(I - \bar V\bar V^*) + (I - \bar U\bar U^*)X\big\|_* \le \sqrt{r}\,\big\|\bar U\bar U^* X(I - \bar V\bar V^*)\big\|_F + \sqrt{r}\,\big\|(I - \bar U\bar U^*)X\big\|_F \le \sqrt{r}\,\big(\sin(\theta_v) + \sin(\theta_u)\big)\|X\|_F.$$

Theorem 8, program (4, $T_3$): notice that any $X \in T_3\cap S$ can be written as $X = \bar U W^*$, where the columns of $W$ are arbitrary. Then $|(\bar U W^*)_{pq}| \le \|\bar U_{p\cdot}\|_2\|W_{q\cdot}\|_2$ and therefore $M_1 \le \mu_0(\bar U\bar U^*)\,n_2$.
Theorem 8, program (4, $T_3$): notice that any $X \in T_3 \cap S$ can be written as $X = \bar UW^*$, where the columns of $W$ are arbitrary and $\|W\|_F = 1$. Then $|(\bar UW^*)_{pq}| \le \|\bar U_p\|_2\|W_q\|_2 \le \|\bar U_p\|_2$, and therefore $M_1(T_3) \le \mu_0(\bar U\bar U^*)\,n_2$. For a lower bound, let $\bar p \in [n_1]$ attain the maximum row norm of $\bar U$ and choose $W$ supported on a single row, $W_{\bar q} = c\,\bar U_{\bar p}$, where $c$ is a proper normalizing constant achieving $\|W\|_F = 1$. Then
$$\big|(\bar UW^*)_{\bar p\bar q}\big|^2 = \|\bar U_{\bar p}\|_2^2 = \frac{\mu_0(\bar U\bar U^*)\,r}{n_1},$$
which yields $M_1(T_3) = \mu_0(\bar U\bar U^*)\,n_2$. The term $\|\mathcal{P}_{T_3^\perp}(U_r\Sigma_rV_r^*)\|_*$ can be bounded as
$$\|\mathcal{P}_{T_3^\perp}(U_r\Sigma_rV_r^*)\|_* = \big\|(\mathcal{I}-\bar U\bar U^*)U_r\Sigma_rV_r^*\big\|_* \le \sqrt{r}\,\big\|(\mathcal{I}-\bar U\bar U^*)U_r\Sigma_rV_r^*\big\|_F \le \sqrt{r}\,\sin(\theta_u)\,\|U_r\Sigma_rV_r^*\|_F.$$

Theorem 10, program (4, $T_4$): any $X \in T_4 \cap S_{\mathrm{op}}$ can be written as $X = X_1 + X_2 + X_3$, where $X_1 \in T_2$, $X_2 = \bar UW^*$ with $\operatorname{range}(W) \perp \operatorname{range}(\bar V)$, and $X_3 = Z\bar V^*$ with $\operatorname{range}(Z) \perp \operatorname{range}(\bar U)$. As before, the largest entry of any such $X_1$ is bounded by $\mu_0 r/\sqrt{n_1n_2}$. For matrices of the form $X_3$, write $Z = \bar U_\perp\alpha$, where $\alpha \in \mathbb{C}^{(n_1-r)\times r}$ has orthogonal columns and $\|\alpha\|_F \le 1$. Then
$$\big|(Z\bar V^*)_{pq}\big| \le \|Z_p\|_2\,\|\bar V_q\|_2 \le \|\alpha\|\,\|(\bar U_\perp)_p\|_2\,\|\bar V_q\|_2 \le \sqrt{\frac{\mu_0 r\,(n_1-\mu_2 r)}{n_1n_2}}.$$
The last inequality holds since $1 = \|\bar U_p\|_2^2 + \|(\bar U_\perp)_p\|_2^2$ and by definition (18). An analogous argument for $X_2$ and the triangle inequality give
$$M_1(T_4) \le \mu_0^2 r + \mu_0(n_1-\mu_2 r) + \mu_0(n_2-\mu_2 r).$$
For the lower bound, consider matrices of the form $X_3$ as before. Choose $\alpha$ with $\|\alpha\|_F = 1$ in a manner analogous to the proof of Theorem 7. Then
$$M_1(T_4) \ge \mu_0(\bar V\bar V^*)\,(n_1-\mu_2 r) \ge n_1-\mu_2 r.$$
For $\|\mathcal{P}_{T_4^\perp}(U_r\Sigma_rV_r^*)\|_*$, it follows that
$$\|\mathcal{P}_{T_4^\perp}(U_r\Sigma_rV_r^*)\|_* = \big\|(\mathcal{I}-\bar U\bar U^*)\,U_r\Sigma_rV_r^*\,(\mathcal{I}-\bar V\bar V^*)\big\|_* \qquad(33)$$
$$\le \sqrt{r}\,\big\|(\mathcal{I}-\bar U\bar U^*)U_r\Sigma_rV_r^*(\mathcal{I}-\bar V\bar V^*)\big\|_F \le \sqrt{r}\,\sin(\theta_u)\sin(\theta_v)\,\|U_r\Sigma_rV_r^*\|_F.$$

Theorem 6, program (4, $T_1$): recall that $T_1 = \operatorname{span}\{\bar U_k\bar V_k^*\}_{k\in[r]}$, so every $X \in T_1 \cap S$ can be written as $X = \bar U\tilde\Sigma\bar V^*$ for some diagonal matrix $\tilde\Sigma$ with $\|\tilde\Sigma\|_F \le 1$. For the upper bound, it holds that for any $p \in [n_1]$ and $q \in [n_2]$,
$$\big|(\bar U\tilde\Sigma\bar V^*)_{pq}\big| = \Big|\sum_k\tilde\sigma_k\,\bar U_{pk}\bar V^*_{qk}\Big| \le \Big(\sum_k|\bar U_{pk}|^2|\bar V_{qk}|^2\Big)^{1/2}.$$
This gives $M_1(T_1) \le \mu_1(\bar U\bar V^*)$. For the lower bound, notice that the singular values can be freely chosen. Let $(\bar p,\bar q) \in [n_1]\times[n_2]$ be such that
$$\Big(\sum_k|\bar U_{\bar pk}|^2|\bar V_{\bar qk}|^2\Big)^{1/2} = \sqrt{\frac{\mu_1(\bar U\bar V^*)\,r}{n_1n_2}}.$$
Then choosing $\tilde\sigma_k = c\,\bar U^*_{\bar pk}\bar V_{\bar qk}$ (possibly complex valued, which still lies in $T_1$), where $c$ is a normalization constant (so $\|\tilde\Sigma\|_F = 1$), gives
$$\big|(\bar U\tilde\Sigma\bar V^*)_{\bar p\bar q}\big| = \Big(\sum_k|\bar U_{\bar pk}|^2|\bar V_{\bar qk}|^2\Big)^{1/2},$$
and therefore $M_1(T_1) = \mu_1(\bar U\bar V^*)$, since $\mu_1$ is defined via the maximum over all entries. Finally, to bound $\|\mathcal{P}_{T_1^\perp}(U_r\Sigma_rV_r^*)\|_*$ with $X = U_r\Sigma_rV_r^*$, notice that
$$\mathcal{P}_{T_1^\perp}(X) = \mathcal{P}_{T_2^\perp}(X) + \sum_{k\ne\ell}\bar U_k\bar V_\ell^*\,\langle\bar U_k\bar V_\ell^*, X\rangle,$$
where $T_2$ is as in Theorem 7 and the term $\|\mathcal{P}_{T_2^\perp}(X)\|_*$ can be bounded as in the proof therein via a triangle inequality. To bound the remaining term, which has rank at most $r$, we obtain
$$\Big\|\sum_{k\ne\ell}\bar U_k\bar V_\ell^*\langle\bar U_k\bar V_\ell^*, X\rangle\Big\|_* \le \sqrt{r}\Big(\sum_{k\ne\ell}\big|\langle\bar U_k\bar V_\ell^*, X\rangle\big|^2\Big)^{1/2} = \sqrt{r}\Big(\sum_{k\ne\ell}\big|\langle U_r^*\bar U_k\bar V_\ell^*V_r, \Sigma_r\rangle\big|^2\Big)^{1/2} \le \sqrt{r}\,\|\Sigma_r\|_F\Big(\sum_{k\ne\ell}\big\|U_r^*\bar U_k\bar V_\ell^*V_r\big\|_F^2\Big)^{1/2}.$$

Theorem 2, program (1): as discussed, choose $T$ as $T_1$ from Theorem 6 but with $\bar U = U_r$ and $\bar V = V_r$. Then $\|\mathcal{P}_{T_1^\perp}(U_r\Sigma_rV_r^*)\|_* = 0$ and, as in the previous case, $M_1 = \mu_1(D)$. To upper bound $M_0$, every $X \in T \cap S_{\mathrm{op}}$ can be written as $X = U_r\tilde\Sigma V_r^*$ for some diagonal matrix $\tilde\Sigma$ with $\|\tilde\Sigma\| \le 1$. Then $M_0(T) \le \mu_0(D)$, since
$$\|X\|_{\infty,2} = \max_{k,\ell}\big\{\|X_k\|_2,\,\|X^\ell\|_2\big\} = \max_{k,\ell}\big\{\|(U_r)_k\tilde\Sigma V_r^*\|_2,\,\|U_r\tilde\Sigma(V_r^*)^\ell\|_2\big\} \le \max_{k,\ell}\big\{\|(U_r)_k\|_2\,\|\tilde\Sigma V_r^*\|,\,\|U_r\tilde\Sigma\|\,\|(V_r)_\ell\|_2\big\} \le \max_{k,\ell}\big\{\|(U_r)_k\|_2,\,\|(V_r)_\ell\|_2\big\},$$
where the second inequality holds since $U_r$ and $V_r$ have orthonormal columns and $\|\tilde\Sigma\| \le 1$, so that $\|U_r\tilde\Sigma\|$ and $\|\tilde\Sigma V_r^*\|$ are bounded by 1.

Appendix C. Proof of Required Lemmas

This section proves the main lemmas required for the proofs in Appendix A: Lemmas 11, 12, and 13.

C.1 Proof of Lemma 11

Lemma 11 is essentially a generalization of dual certificate guarantees for sparse vector recovery to the low-rank matrix recovery case (see Theorem 4.33 by Foucart and Rauhut 2013), and the proof proceeds along similar lines.

Proof [Proof of Lemma 11] Denote $W = D^\sharp - D$, where $D^\sharp$ is the output of the program; the goal is to bound $\|W\|_F$. Since $D^\sharp$ is feasible,
$$\|\mathcal{A}(W)\|_2 \le \|\mathcal{A}(D^\sharp) - \mathcal{A}(D) - d\|_2 + \|d\|_2 \le 2\eta.$$
This will be applied throughout the proof. Let $Q \in T^\perp$ be such that $\langle D+W, Q\rangle = \|\mathcal{P}_{T^\perp}(D+W)\|_*$ and $G \in T$ be such that $\langle D, G\rangle = \|\mathcal{P}_T(D)\|_*$. By optimality of $D^\sharp$ and feasibility of $D$,
$$\|D\|_* \ge \|D+W\|_* \ge \big|\langle D+W, G+Q\rangle\big| = \langle D+W, G\rangle + \|\mathcal{P}_{T^\perp}(D+W)\|_* \ge \|\mathcal{P}_T(D)\|_* - \big|\langle\mathcal{P}_T(W), G\rangle\big| + \|\mathcal{P}_{T^\perp}(W)\|_* - \|\mathcal{P}_{T^\perp}(D)\|_*,$$
where the second inequality holds by the variational characterization of the nuclear norm, $\|X\|_* = \sup_{\|Y\|\le 1}\langle X, Y\rangle$.
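The variational characterization just invoked is easy to confirm numerically. The following minimal NumPy sketch (arbitrary dimensions, not code from the paper) checks that $\langle X, UV^*\rangle$ attains the nuclear norm and that random spectral-norm contractions never exceed it.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 15))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
nuc = s.sum()                            # nuclear norm ||X||_*
G = U @ Vt                               # maximizer: <X, G> = ||X||_*
assert np.isclose(np.vdot(X, G), nuc)

# Any Y with spectral norm <= 1 satisfies <X, Y> <= ||X||_*.
for _ in range(5):
    Y = rng.standard_normal(X.shape)
    Y /= max(np.linalg.norm(Y, 2), 1.0)  # enforce ||Y||_2 <= 1
    assert np.vdot(X, Y) <= nuc + 1e-9
print("variational characterization verified")
```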
Using $\|D\|_* \le \|\mathcal{P}_T(D)\|_* + \|\mathcal{P}_{T^\perp}(D)\|_*$ and rearranging gives
$$\|\mathcal{P}_{T^\perp}(W)\|_* \le 2\|\mathcal{P}_{T^\perp}(D)\|_* + \big|\langle\mathcal{P}_T(W), G\rangle\big|. \qquad(34)$$
Introducing $Y = \mathcal{A}^*\mathcal{A}(Z)$, the last term in (34) can be bounded as
$$\big|\langle\mathcal{P}_T(W), G\rangle\big| \le \big|\langle\mathcal{P}_T(W), G - \mathcal{P}_T(Y)\rangle\big| + \big|\langle\mathcal{P}_T(W), \mathcal{P}_T(Y)\rangle\big| \le \beta_3\|\mathcal{P}_T(W)\|_F + \big|\langle\mathcal{P}_T(W), \mathcal{P}_T(Y)\rangle\big|$$
$$= \beta_3\|\mathcal{P}_T(W)\|_F + \big|\langle W - \mathcal{P}_{T^\perp}(W),\, Y - \mathcal{P}_{T^\perp}(Y)\rangle\big| = \beta_3\|\mathcal{P}_T(W)\|_F + \big|\langle W, Y\rangle - \langle\mathcal{P}_{T^\perp}(W), \mathcal{P}_{T^\perp}(Y)\rangle\big|$$
$$\le \beta_3\|\mathcal{P}_T(W)\|_F + \big|\langle W, Y\rangle\big| + \big|\langle\mathcal{P}_{T^\perp}(W), \mathcal{P}_{T^\perp}(Y)\rangle\big|. \qquad(35)$$
The second equality applies $\langle W, \mathcal{P}_{T^\perp}(Y)\rangle = \langle\mathcal{P}_{T^\perp}(W), \mathcal{P}_{T^\perp}(Y)\rangle = \langle\mathcal{P}_{T^\perp}(W), Y\rangle$. The three terms in (35) are bounded next, beginning with $\|\mathcal{P}_T(W)\|_F$. The assumed inequalities (21) give $\|\mathcal{A}(B)\|_2 \ge \beta_1\|B\|_F$ for any $B \in T$ and $\|\mathcal{A}(H)\|_2 \le \beta_2\|H\|_F$ for any $H \in T^\perp$. Therefore,
$$\|\mathcal{P}_T(W)\|_F \le \frac{1}{\beta_1}\big\|\mathcal{A}(\mathcal{P}_T(W))\big\|_2 \le \frac{1}{\beta_1}\|\mathcal{A}(W)\|_2 + \frac{1}{\beta_1}\big\|\mathcal{A}(\mathcal{P}_{T^\perp}(W))\big\|_2 \le \frac{2\eta}{\beta_1} + \frac{\beta_2}{\beta_1}\|\mathcal{P}_{T^\perp}(W)\|_F. \qquad(36)$$
The remaining terms in (35), $|\langle\mathcal{P}_{T^\perp}(W), \mathcal{P}_{T^\perp}(Y)\rangle|$ and $|\langle W, Y\rangle|$, can be bounded by assumptions (22):
$$\big|\langle\mathcal{P}_{T^\perp}(W), \mathcal{P}_{T^\perp}(Y)\rangle\big| \le \|\mathcal{P}_{T^\perp}(W)\|_*\,\|\mathcal{P}_{T^\perp}(Y)\| \le \beta_4\|\mathcal{P}_{T^\perp}(W)\|_*$$
and
$$\big|\langle W, Y\rangle\big| = \big|\langle W, \mathcal{A}^*\mathcal{A}(Z)\rangle\big| = \big|\langle\mathcal{A}(W), \mathcal{A}(Z)\rangle\big| \le \|\mathcal{A}(W)\|_2\,\|\mathcal{A}(Z)\|_2 \le 2\eta\beta_5.$$
Using these inequalities to bound $|\langle\mathcal{P}_T(W), G\rangle|$ in (34) gives
$$\|\mathcal{P}_{T^\perp}(W)\|_* \le 2\|\mathcal{P}_{T^\perp}(D)\|_* + \frac{2\eta\beta_3}{\beta_1} + \frac{\beta_2\beta_3}{\beta_1}\|\mathcal{P}_{T^\perp}(W)\|_* + \beta_4\|\mathcal{P}_{T^\perp}(W)\|_* + 2\eta\beta_5 = 2\|\mathcal{P}_{T^\perp}(D)\|_* + \Big(\frac{\beta_2\beta_3}{\beta_1}+\beta_4\Big)\|\mathcal{P}_{T^\perp}(W)\|_* + 2\Big(\frac{\beta_3}{\beta_1}+\beta_5\Big)\eta.$$
Since by assumption $\varrho := \frac{\beta_2\beta_3}{\beta_1} + \beta_4 < 1$, rearrange to obtain
$$\|\mathcal{P}_{T^\perp}(W)\|_* \le \frac{2}{1-\varrho}\Big(\|\mathcal{P}_{T^\perp}(D)\|_* + \Big(\frac{\beta_3}{\beta_1}+\beta_5\Big)\eta\Big).$$
From the previous calculation (36), $\|\mathcal{P}_T(W)\|_F \le \frac{2\eta}{\beta_1} + \frac{\beta_2}{\beta_1}\|\mathcal{P}_{T^\perp}(W)\|_*$, so that both of these inequalities give
$$\|W\|_F \le \|\mathcal{P}_T(W)\|_F + \|\mathcal{P}_{T^\perp}(W)\|_F \le \|\mathcal{P}_T(W)\|_F + \|\mathcal{P}_{T^\perp}(W)\|_* \le C_1\|\mathcal{P}_{T^\perp}(D)\|_* + 2C_2\eta,$$
with constants given as
$$C_1 = \frac{2}{1-\varrho}\Big(\frac{\beta_2}{\beta_1}+1\Big), \qquad C_2 = \frac{1}{\beta_1} + \frac{1}{1-\varrho}\Big(\frac{\beta_2}{\beta_1}+1\Big)\Big(\frac{\beta_3}{\beta_1}+\beta_5\Big).$$

C.2 Proof of Lemma 12

The proof of Lemma 12 requires two lemmas, stated here and proven in Section C.3. Before continuing, some useful notation and observations are established for the sampling operators. Recall that $M_1$ is the $\rho$-subspace incoherence parameter of $T$ and that the operator that samples with replacement has been normalized as
$$\mathcal{A} := \sqrt{\tfrac{n_1n_2}{m}}\,\mathcal{P}_{\Omega'},$$
where $m$ and $\Omega'$ (with possible repetitions) are defined as in Section A.2. The operator $\mathcal{A}$ can be viewed as an ensemble of matrices $\{A^k\}_{k\in[m]} \subset \mathbb{C}^{n_1\times n_2}$, where the superscripts order the matrices so that the action on $X \in \mathbb{C}^{n_1\times n_2}$ is given entry-wise by $\mathcal{A}(X)_k = \langle A^k, X\rangle$ for $k \in [m]$. The scaling ensures that $\mathcal{A}^*\mathcal{A}$ forms an isotropic ensemble; that is, for any $X \in \mathbb{C}^{n_1\times n_2}$,
$$\sum_{k=1}^m\mathbb{E}\big[A^k\langle A^k, X\rangle\big] = X, \qquad(37)$$
where $\{M^{p,q}\}_{(p,q)\in[n_1]\times[n_2]}$ is the canonical $n_1\times n_2$ matrix basis and each $A^k$ is a random matrix that equals $\sqrt{n_1n_2/m}\,M^{p,q}$ with probability $(n_1n_2)^{-1}$.
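The isotropy property (37) can be checked by Monte Carlo simulation. A minimal NumPy sketch follows (small illustrative dimensions chosen here, not from the paper): it draws $m$ indices uniformly with replacement and compares $\sum_k A^k\langle A^k, X\rangle$ with $X$.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, m = 6, 5, 200_000

X = rng.standard_normal((n1, n2))

# Each A^k = sqrt(n1*n2/m) * M_{p,q} for a uniformly random entry (p, q),
# so sum_k A^k <A^k, X> has expectation X (isotropy of the ensemble).
idx = rng.integers(0, n1 * n2, size=m)
acc = np.zeros(n1 * n2)
np.add.at(acc, idx, (n1 * n2 / m) * X.reshape(-1)[idx])

rel_err = np.linalg.norm(acc.reshape(n1, n2) - X) / np.linalg.norm(X)
print(rel_err)  # shrinks roughly like sqrt(n1*n2/m)
```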
Next is a lemma that will be useful in establishing Lemma 12, in order to apply a concentration inequality. The proof is postponed to Section C.3 and is straightforward from the subspace incoherence assumptions.

Lemma 14 Define $\mathcal{A}$ as above. Then for $Z \in T$ and all $k \in [m]$,
$$\big|\langle A^k, Z\rangle\big| \le \|Z\|_F\sqrt{\tfrac{M_1}{m}} \quad\text{and}\quad \sum_{k=1}^m\mathbb{E}\big|\langle A^k, Z\rangle\big|^4 \le \|Z\|_F^4\,\frac{M_1\rho}{m}, \qquad(38)$$
where $\rho$ is defined as in (8).

The next lemma can be considered a generalization of Lemma 3.6 by Rudelson and Vershynin (2008). The argument is due to Liu (2011), but has been tailored to the current setting with a tighter bound in terms of the logarithm degree. Adopting the notation therein, for a matrix $A \in \mathbb{C}^{n_1\times n_2}$, denote by $|A)(A|$ the operator that maps $X \mapsto A\langle A, X\rangle$.

Lemma 15 Let $m \le n_1n_2$ and $\epsilon_1, \dots, \epsilon_m$ be i.i.d. Rademacher random variables. Then
$$\mathbb{E}_\epsilon\sup_{X\in T\cap S}\Big|\sum_{k=1}^m\epsilon_k\big\langle|A^k)(A^k|(X), X\big\rangle\Big| \le C\sqrt{\frac{M_1\rho\log(n_1+n_2)\log^{1/2}(m)}{m}}\;\sup_{X\in T\cap S}\Big(\sum_{k=1}^m\big\langle|A^k)(A^k|(X), X\big\rangle\Big)^{1/2},$$
where $C > 0$ is an absolute constant. See Section C.3 for the proof. The proof of Lemma 12 follows.

Proof [Proof of Lemma 12] Define the random variable
$$\bar X := \sup_{X\in T\cap S}\big\langle(\mathcal{A}^*\mathcal{A}-\mathcal{I})(X), X\big\rangle = \sup_{X\in T\cap S}\big|\big\langle(\mathcal{A}^*\mathcal{A}-\mathcal{I})(X), X\big\rangle\big|,$$
where the last equality holds since $\mathcal{A}^*\mathcal{A}-\mathcal{I}$ is a Hermitian operator. The goal is to show $\bar X \le 2\delta$, which is achieved proceeding along the lines of Liu (2011), Rauhut (2008), and Rudelson and Vershynin (2008). Adopting the notation from Lemma 15, write
$$\bar X = \Big\|\sum_k|A^k)(A^k| - \mathcal{I}\Big\|_{T\cap S} = \Big\|\sum_k\Big(|A^k)(A^k| - \mathbb{E}|A^k)(A^k|\Big)\Big\|_{T\cap S},$$
where the last equality holds due to isotropy of the ensemble (37) with i.i.d. samples. $\mathbb{E}\bar X$ will be bounded first, and then a concentration inequality will show that this random variable concentrates around its mean. Symmetrization (as in equation (42) of Liu 2011, which uses Lemma 6.3 by Ledoux and Talagrand 1991) gives
$$\mathbb{E}\bar X \le 2\,\mathbb{E}\sup_{X\in T\cap S}\Big|\sum_{k=1}^m\epsilon_k\big\langle|A^k)(A^k|(X), X\big\rangle\Big|,$$
where the $\epsilon_k$ are Rademacher random variables. Applying Lemma 15, which requires $m \le n_1n_2$, gives
$$\mathbb{E}\bar X \le 2C_1\,\mathbb{E}\sup_{X\in T\cap S}\Big(\sum_k\big\langle|A^k)(A^k|(X), X\big\rangle\Big)^{1/2}, \qquad C_1 := C\sqrt{\frac{M_1\rho\log(n_1+n_2)\log^{1/2}(m)}{m}},$$
where $C > 0$ is the absolute constant given in Lemma 15. Summarizing and continuing these calculations (via Jensen's inequality),
$$\mathbb{E}\bar X \le 2C_1\Big(\mathbb{E}\Big\|\sum_k|A^k)(A^k|\Big\|_{T\cap S}\Big)^{1/2} \le 2C_1\big(\mathbb{E}\bar X + 1\big)^{1/2} \le \big(\mathbb{E}\bar X + 1\big)^{1/2}\cdot 2\sqrt{2}\,C\sqrt{\frac{M_1\rho\log^{3/2}(n_1+n_2)}{m}},$$
where the last inequality holds since $m \le (n_1+n_2)^2$ by assumption. Given $\delta > 0$, $\mathbb{E}\bar X \le \delta$ if
$$2\sqrt{2}\,C\sqrt{\frac{M_1\rho\log^{3/2}(n_1+n_2)}{m}} \le \frac{\delta}{\sqrt{\delta+1}}. \qquad(39)$$
A concentration inequality will now show that $\bar X$ is close to its expected value with high probability.

Theorem 16 (Theorem 8.42 in Foucart and Rauhut 2013) Let $\mathcal{F}$ be a countable set of functions $f: \mathbb{C}^{n_1\times n_2} \to \mathbb{R}$. Let $Y_1, \dots, Y_m$ be independent random matrices in $\mathbb{C}^{n_1\times n_2}$ such that $\mathbb{E}f(Y_k) = 0$ and $f(Y_k) \le K$ almost surely for all $k \in [m]$ and all $f \in \mathcal{F}$. Define the random variable
$$Z = \sup_{f\in\mathcal{F}}\sum_{k=1}^m f(Y_k).$$
Let $\sigma^2 > 0$ be such that $\mathbb{E}\sum_{k=1}^m f(Y_k)^2 \le \sigma^2$ for all $f \in \mathcal{F}$. Then for all $\delta \ge 0$,
$$\mathbb{P}(Z \ge \mathbb{E}Z + \delta) \le \exp\Big(-\frac{\delta^2}{2\sigma^2 + 4K\,\mathbb{E}Z + 2\delta K/3}\Big).$$

To apply the theorem, let each $X \in T\cap S$ generate a function $f_X: \mathbb{C}^{n_1\times n_2} \to \mathbb{R}$ via
$$f_X(Z') := |\langle Z', X\rangle|^2 - \tfrac{1}{m},$$
and notice that
$$\Big\|\sum_k|A^k)(A^k| - \mathcal{I}\Big\|_{T\cap S} = \sup_{X\in\tilde T}\sum_k f_X(A^k),$$
where $\tilde T$ is a dense countable subset of $T\cap S$. For all $k \in [m]$ and $X \in \tilde T$, by the first part of Lemma 14,
$$f_X(A^k) \le |\langle A^k, X\rangle|^2 \le \frac{M_1\rho}{m},$$
so take $K = \frac{M_1\rho}{m}$ in Theorem 16. Now for $\sigma^2$, apply the second part of Lemma 14 and isotropy of the ensemble to obtain
$$\mathbb{E}\sum_k f_X(A^k)^2 \le \mathbb{E}\sum_k|\langle A^k, X\rangle|^4 \le \frac{M_1\rho}{m} =: \sigma^2.$$
To finish, assuming $m \gtrsim M_1\rho\log^{3/2}(n_1+n_2)$ gives $\mathbb{E}\bar X \le \delta$ according to (39), and by Theorem 16, $\bar X \le \mathbb{E}\bar X + \delta \le 2\delta$ with probability of failure not exceeding
$$\exp\Big(-\frac{\delta^2 m}{2M_1\rho + 4M_1\rho\,\delta + 2M_1\rho\,\delta/3}\Big) \le \exp\Big(-\frac{\delta^2 m}{10\,M_1\rho}\Big). \qquad(40)$$
The last inequality holds assuming $\delta \le \frac14$, under which the failure probability in (40) is at most $(n_1+n_2)^{-1}$ if $m \ge 10\,C\,\delta^{-2}M_1\rho\log^{3/2}(n_1+n_2)$, where the statement of the lemma absorbs all absolute constants into $C$.

C.3 Proof of Additional Lemmas

This section supplies the proofs of Lemmas 13, 14, and 15.

Proof [Proof of Lemma 13] The main ingredient is a matrix Bernstein inequality (Theorem 1.6 by Tropp 2012). As in Section C.2, expand
$$\mathcal{A}^*\mathcal{A}(Z) - Z = \sum_{k=1}^m\Big(A^k\langle A^k, Z\rangle - \tfrac{1}{m}Z\Big),$$
which is a sum of independent and centered random matrices. In order to apply the Bernstein inequality, the operator norms of each summand and the matrix variance statistic need to be bounded. Notice that for any $k \in [m]$,
$$\Big\|A^k\langle A^k, Z\rangle - \tfrac{1}{m}Z\Big\| \le \frac{n_1n_2}{m}\|Z\|_\infty + \frac{1}{m}\|Z\| =: R,$$
and
$$\Big\|\sum_k\mathbb{E}\Big[\big(A^k\langle A^k, Z\rangle - \tfrac1m Z\big)\big(A^k\langle A^k, Z\rangle - \tfrac1m Z\big)^*\Big]\Big\| \le \frac{n_1n_2}{m}\|\Delta_1\| + \frac{1}{m}\|ZZ^*\| \le \frac{n_1n_2}{m}\max_{1\le k\le n_1}\|Z_k\|_2^2 + \frac{1}{m}\|ZZ^*\| \le \frac{2n_1n_2}{m}\|Z\|_{\infty,2}^2 =: \sigma^2,$$
where $\Delta_1 \in \mathbb{C}^{n_1\times n_1}$ is the diagonal matrix whose entries are the diagonal elements of $ZZ^*$. The last inequality holds by the Gershgorin circle theorem, which gives that for some $k \in [n_1]$,
$$\|ZZ^*\| \le |(ZZ^*)_{kk}| + \sum_{\ell\ne k}|(ZZ^*)_{k\ell}| = \|Z_k\|_2^2 + \sum_{\ell\ne k}|\langle Z_k, Z_\ell\rangle| \le n_1\|Z\|_{\infty,2}^2.$$
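The Gershgorin step above is simple to sanity-check numerically. A minimal NumPy sketch (random $Z$, not from the paper) verifies the chain $\|ZZ^*\| \le \max_k\sum_\ell|(ZZ^*)_{k\ell}| \le n_1\|Z\|_{\infty,2}^2$:

```python
import numpy as np

rng = np.random.default_rng(5)
Z = rng.standard_normal((12, 8))

G = Z @ Z.T
op = np.linalg.norm(G, 2)                  # operator norm ||Z Z^*||

# Gershgorin: every eigenvalue lies in a disc of center G_kk and radius
# sum_{l != k} |G_kl|, so ||G|| <= max_k sum_l |G_kl|.
gersh = np.max(np.abs(G).sum(axis=1))
row_bound = Z.shape[0] * np.max(np.sum(Z**2, axis=1))  # n1 * ||Z||_{inf,2}^2

print(op <= gersh + 1e-9, gersh <= row_bound + 1e-9)   # True True
```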
Analogously, the column version is bounded as
$$\Big\|\sum_k\mathbb{E}\Big[\big(A^k\langle A^k, Z\rangle - \tfrac1m Z\big)^*\big(A^k\langle A^k, Z\rangle - \tfrac1m Z\big)\Big]\Big\| \le \frac{2n_1n_2}{m}\|Z\|_{\infty,2}^2.$$
Apply Theorem 1.6 by Tropp (2012) with $R$ and $\sigma^2$ above and
$$t = \frac{4n_1n_2\log(n_1+n_2)\|Z\|_\infty/3 + \sqrt{16n_1^2n_2^2\log^2(n_1+n_2)\|Z\|_\infty^2/9 + 32mn_1n_2\log(n_1+n_2)\|Z\|_{\infty,2}^2}}{2m}$$
to obtain the desired probability of success. The statement of the lemma simplifies the resulting upper bound on the operator norm by noting that
$$t \le \frac{4n_1n_2\log(n_1+n_2)\|Z\|_\infty/3 + 2\sqrt{2mn_1n_2\log(n_1+n_2)}\,\|Z\|_{\infty,2}}{m}.$$

Next is Lemma 14, which admits a straightforward proof.

Proof [Proof of Lemma 14] The inequality (38) is straightforward by definition of $\mathcal{A}$ and the subspace incoherence scaling. For the remaining claim, if $Z \in T$, then by (38),
$$\sum_{k=1}^m\mathbb{E}\big|\langle A^k, Z\rangle\big|^4 = \frac{n_1n_2}{m}\sum_{p,q}|Z_{pq}|^4 \le \frac{n_1n_2}{m}\max_{(p,q)\in[n_1]\times[n_2]}|Z_{pq}|^2\sum_{p,q}|Z_{pq}|^2 \le \|Z\|_F^4\,\frac{M_1\rho}{m}.$$

The proof of Lemma 15 is due to Liu (2011), tailored here to fit the specific setting; this modified argument results in a tighter bound in terms of the logarithmic dependency. Adopting that author's notation, in what follows, for a matrix $A \in \mathbb{C}^{n_1\times n_2}$, $|A)(A|$ denotes the operator that maps $X \mapsto A\langle A, X\rangle$.

Proof [Proof of Lemma 15] The argument relies on the work of Liu (2011) for brevity, referring the reader to the proof of Lemma 3.1 in Section A therein. With $U_2$ defined by Liu (2011), notice that $T\cap S \subseteq U_2$, since every matrix in $T$ has rank at most $\rho$ (where $\rho$ is defined in (8)). The result here is obtained in a similar manner, but considering non-square matrices and the linear subspace $T$ as a Banach space in its own right, in order to compute its covering number. To this end, it is important to notice that $T$ equipped with the Frobenius norm is a Banach space and that $\{A^k\}_{k=1}^m \subset T^*$, where $T^*$ denotes the dual space of $T$, have dual norm bounded as
$$\|A^k\|_{T^*} = \sup_{X\in T\cap S}\big|\langle A^k, X\rangle\big| \le \sqrt{\rho}\,K, \qquad k\in[m], \qquad(41)$$
by Lemma 14, where $K := \sqrt{M_1/m}$. With this in mind, proceed as in the proof of Lemma 3.1 by Liu (2011) (with $T\cap S$ replacing $U_2$) up to equation (15) therein, which is obtained via a comparison principle to a Gaussian process and Dudley's inequality. Combined with bound (18) therein and a change of variables, this shows
$$\mathbb{E}_\epsilon\sup_{X\in T\cap S}\Big|\sum_{k=1}^m\epsilon_k\big\langle|A^k)(A^k|(X), X\big\rangle\Big| \le 48\Big(\int_0^\infty\log^{1/2}N\big(\tfrac{1}{\sqrt\rho}(T\cap S), \|\cdot\|_X, \epsilon\big)\,d\epsilon\Big)\sup_{X\in T\cap S}\Big(\sum_{k=1}^m\big\langle|A^k)(A^k|(X), X\big\rangle\Big)^{1/2},$$
where $N(B, \|\cdot\|, \epsilon)$ is the number of balls of radius $\epsilon$ in the given metric needed to cover a set $B$, and $\|\cdot\|_X$ is the semi-norm defined for $Y \in \mathbb{C}^{n_1\times n_2}$ as $\|Y\|_X = \max_{k\in[m]}|\langle A^k, Y\rangle|$.

To bound the integral, bound $N\big(\tfrac{1}{\sqrt\rho}(T\cap S), \|\cdot\|_X, \epsilon\big)$ in two different ways. For $Y \in \tfrac{1}{\sqrt\rho}(T\cap S)$, notice that $\|Y\|_X \le K$ by (41), so that
$$N\big(\tfrac{1}{\sqrt\rho}(T\cap S), \|\cdot\|_X, \epsilon\big) \le N(K B_X, \|\cdot\|_X, \epsilon), \qquad(42)$$
where $B_X$ is the unit ball of $\|\cdot\|_X$. For small $\epsilon$, and with $n_1 \ge n_2$, use (42) and equation (20) by Liu (2011) to obtain
$$N\big(\tfrac{1}{\sqrt\rho}(T\cap S), \|\cdot\|_X, \epsilon\big) \le \Big(1 + \frac{2K}{\epsilon}\Big)^{2n_1^2}. \qquad(43)$$
For large $\epsilon$, apply Lemma 3.2 by Liu (2011) (Lemma 1 by Guédon et al. 2008) with $E = T$ equipped with the Frobenius norm to obtain
$$N\big(\tfrac{1}{\sqrt\rho}(T\cap S), \|\cdot\|_X, \epsilon\big) = N\big(T\cap S, \|\cdot\|_X, \epsilon\sqrt\rho\big) \le \exp\Big(\frac{C_1^2K^2\log(m)}{\epsilon^2}\Big), \qquad(44)$$
where $C_1$ is an absolute constant given by Maurey's empirical method. The inequality holds by (41) and since $T$ has modulus of convexity of power type 2 with constant $\lambda(T) = 1$ and dual space type-2 constant $T_2(T^*) \le 1$ due to the Frobenius norm (see Theorems A3 and A4 in the appendix of Aubrun 2009).

To bound the integral, split it at $A := K/n_1$ and use (43) for small $\epsilon$ to obtain
$$\int_0^A\log^{1/2}N\big(\tfrac{1}{\sqrt\rho}(T\cap S), \|\cdot\|_X, \epsilon\big)\,d\epsilon \le \sqrt{2}\,n_1\int_0^A\log^{1/2}\Big(1+\frac{2K}{\epsilon}\Big)d\epsilon \le \sqrt{2}\,n_1\int_{1/A}^\infty\frac{1+\log(1+2Ky)}{y^2}\,dy \le \sqrt{2}\,n_1\int_{1/A}^\infty\frac{1+\log((A+2K)y)}{y^2}\,dy \le 2K\big(2+\log(1+2n_1)\big),$$
where the substitution $\epsilon = 1/y$ was applied and the final bound holds by integrating $\log((A+2K)y)/y^2$ by parts. Consider the remaining part of the integral only up to $K$, since for $\epsilon > K$, (42) implies $N\big(\tfrac{1}{\sqrt\rho}(T\cap S), \|\cdot\|_X, \epsilon\big) = 1$. Using (44) gives
$$\int_A^K\log^{1/2}N\big(\tfrac{1}{\sqrt\rho}(T\cap S), \|\cdot\|_X, \epsilon\big)\,d\epsilon \le \int_A^K\frac{C_1K\log^{1/2}(m)}{\epsilon}\,d\epsilon = C_1K\log^{1/2}(m)\log(n_1).$$
In conclusion,
$$\mathbb{E}_\epsilon\sup_{X\in T\cap S}\Big|\sum_{k=1}^m\epsilon_k\big\langle|A^k)(A^k|(X), X\big\rangle\Big| \le 48\Big(2K\big(2+\log(1+2n_1)\big) + C_1K\log^{1/2}(m)\log(n_1)\Big)\sup_{X\in T\cap S}\Big(\sum_{k=1}^m\big\langle|A^k)(A^k|(X), X\big\rangle\Big)^{1/2}$$
$$\le C\sqrt{\frac{M_1\rho\log(n_1+n_2)\log^{1/2}(m)}{m}}\;\sup_{X\in T\cap S}\Big(\sum_{k=1}^m\big\langle|A^k)(A^k|(X), X\big\rangle\Big)^{1/2}$$
for some absolute constant $C$, which is the statement of Lemma 15.

The main difference in this proof, as opposed to the proof of Lemma 3.1 by Liu (2011), is that the containment of $\tfrac{1}{\sqrt\rho}(T\cap S)$ in the nuclear norm ball $B_1$ is not considered here. Rather, the linear subspace $T$ is viewed as a Banach space with unit ball $T\cap S$. This allows for a direct application of Lemma A.3, reducing the logarithmic dependency by a factor of $\log^{3/2}(n_1)$. The same gain does not seem straightforward for the universal result of Liu (2011). Otherwise, using the arguments here, the sample complexity of Pauli measurements can be reduced to $O(n_1\rho\log^{3/2}(n_1))$ when universality is not imposed (for example, via a dual certificate instead of the restricted isometry property).

References

Jacob Abernethy, Francis Bach, Theodoros Evgeniou, and Jean-Philippe Vert. A new approach to collaborative filtering: Operator estimation with spectral regularization. Journal of Machine Learning Research, 10(29):803–826, 2009. URL http://jmlr.org/papers/v10/abernethy09a.html.

Aleksandr Aravkin, Rajiv Kumar, Hassan Mansour, Ben Recht, and Felix J. Herrmann. Fast methods for denoising matrix completion formulations, with applications to robust seismic data interpolation. SIAM Journal on Scientific Computing, 36(5):S237–S266, 2014. doi: 10.1137/130919210.

Guillaume Aubrun. On almost randomizing channels with a short Kraus decomposition. Communications in Mathematical Physics, 288(3):1103–1116, 2009. doi: 10.1007/s00220-008-0695-y.

Siavash Bayat and Sajad Daei. An optimal hybrid nuclear norm regularization for matrix sensing with subspace prior information. IEEE Access, 8:130937–130946, 2020. doi: 10.1109/ACCESS.2020.3009688.

Emmanuel Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9:717–772, 2009. doi: 10.1007/s10208-009-9045-5.

Emmanuel J. Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010. doi: 10.1109/TIT.2010.2044061.

Yudong Chen. Incoherence-optimal matrix completion. IEEE Transactions on Information Theory, 61(5):2909–2923, 2015. doi: 10.1109/TIT.2015.2415195.

Yudong Chen, Srinadh Bhojanapalli, Sujay Sanghavi, and Rachel Ward. Coherent matrix completion. In Proceedings of the 31st International Conference on Machine Learning, ICML'14, pages I-674–I-682. JMLR.org, 2014.

Yudong Chen, Srinadh Bhojanapalli, Sujay Sanghavi, and Rachel Ward. Completing any low-rank matrix, provably. Journal of Machine Learning Research, 16(94):2999–3034, 2015. URL http://jmlr.org/papers/v16/chen15b.html.

Kai-Yang Chiang, Cho-Jui Hsieh, and Inderjit S. Dhillon. Matrix completion with noisy side information. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/0609154fa35b3194026346c9cac2a248-Paper.pdf.

Zlatko Drmac. On principal angles between subspaces of Euclidean space. SIAM Journal on Matrix Analysis and Applications, 22(1):173–194, 2000. doi: 10.1137/S0895479897320824.
Armin Eftekhari, Michael B. Wakin, and Rachel A. Ward. MC2: a two-phase algorithm for leveraged matrix completion. Information and Inference: A Journal of the IMA, 7(3):581–604, 2018a. doi: 10.1093/imaiai/iax020.

Armin Eftekhari, Dehui Yang, and Michael B. Wakin. Weighted matrix completion and recovery with prior subspace information. IEEE Transactions on Information Theory, 64(6):4044–4071, 2018b. doi: 10.1109/TIT.2018.2816685.

M. Fazel, H. Hindi, and S. P. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the 2001 American Control Conference, volume 6, pages 4734–4739, 2001. doi: 10.1109/ACC.2001.945730.

Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing. Birkhäuser, 2013. ISBN 0817649476.

Rina Foygel and Nathan Srebro. Concentration-based guarantees for low-rank matrix reconstruction. In Sham M. Kakade and Ulrike von Luxburg, editors, Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of Proceedings of Machine Learning Research, pages 315–340, Budapest, Hungary, 2011. PMLR. URL https://proceedings.mlr.press/v19/foygel11a.html.

David Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548–1566, 2011. doi: 10.1109/TIT.2011.2104999.

Olivier Guédon, Shahar Mendelson, Alain Pajor, and Nicole Tomczak-Jaegermann. Majorizing measures and proportional subsets of bounded orthonormal systems. Revista Matemática Iberoamericana, 24(3):1075–1095, 2008.

Prateek Jain and Inderjit S. Dhillon. Provable inductive matrix completion. CoRR, abs/1306.0626, 2013. URL http://arxiv.org/abs/1306.0626.

Sun Ji-guang. Perturbation of angles between linear subspaces. Journal of Computational Mathematics, 5(1):58–61, 1987. URL http://www.jstor.org/stable/43692289.

M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes, volume 23. Springer-Verlag, 1991.

Yi-Kai Liu. Universal low-rank matrix recovery from Pauli measurements. In NIPS'11: Proceedings of the 24th International Conference on Neural Information Processing Systems, pages 1638–1646, 2011.

Yong Luo, Tongliang Liu, Dacheng Tao, and Chao Xu. Multiview matrix completion for multilabel image classification. IEEE Transactions on Image Processing, 24(8):2355–2368, 2015. doi: 10.1109/TIP.2015.2421309.

Oscar López, Rajiv Kumar, Özgür Yılmaz, and Felix J. Herrmann. Off-the-grid low-rank matrix recovery and seismic data reconstruction. IEEE Journal of Selected Topics in Signal Processing, 10(4):658–671, 2016. doi: 10.1109/JSTSP.2016.2555482.

Duc Minh Nguyen, Evaggelia Tsiligianni, and Nikos Deligiannis. Extendable neural matrix completion. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6328–6332, 2018. doi: 10.1109/ICASSP.2018.8462164.

Holger Rauhut. Stability results for random sampling of sparse trigonometric polynomials. IEEE Transactions on Information Theory, 54(12):5661–5670, 2008. doi: 10.1109/TIT.2008.2006382.

Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12(104):3413–3430, 2011. URL http://jmlr.org/papers/v12/recht11a.html.

Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010. doi: 10.1137/070697835.
Mark Rudelson and Roman Vershynin. On sparse reconstruction from Fourier and Gaussian measurements. Communications on Pure and Applied Mathematics, 61(8):1025–1045, 2008. doi: 10.1002/cpa.20227.

Nathan Srebro and Tommi S. Jaakkola. Learning with Matrix Factorizations. PhD thesis, Massachusetts Institute of Technology, 2004. AAI0807530.

Nathan Srebro and Russ R. Salakhutdinov. Collaborative filtering in a non-uniform world: Learning with the weighted trace norm. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems, volume 23. Curran Associates, Inc., 2010. URL https://proceedings.neurips.cc/paper/2010/file/67d96d458abdef21792e6d8e590244e7-Paper.pdf.

Carlo Tomasi and Takeo Kanade. Shape and motion from image streams under orthography: a factorization method. International Journal of Computer Vision, 9(2):137–154, 1992.

Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012. doi: 10.1007/s10208-011-9099-z.

Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B. Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520–525, 2001. doi: 10.1093/bioinformatics/17.6.520.

Madeleine Udell and Alex Townsend. Why are big data matrices approximately low rank? SIAM Journal on Mathematics of Data Science, 1(1):144–160, 2019. doi: 10.1137/18M1183480.

Ewout van den Berg and Michael P. Friedlander. Probing the Pareto frontier for basis pursuit solutions. SIAM Journal on Scientific Computing, 31(2):890–912, 2009. doi: 10.1137/080714488.

Miao Xu, Rong Jin, and Zhi-Hua Zhou. Speedup matrix completion with side information: Application to multi-label learning. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper/2013/file/e58cc5ca94270acaceed13bc82dfedf7-Paper.pdf.

Yi Yang, Jianwei Ma, and Stanley Osher. Seismic data reconstruction via matrix completion. Inverse Problems and Imaging, 7(4):1379–1392, 2013.

Jinfeng Yi, Tianbao Yang, Rong Jin, Anil K. Jain, and Mehrdad Mahdavi. Robust ensemble clustering by matrix completion. In 2012 IEEE 12th International Conference on Data Mining, pages 1176–1181, 2012. doi: 10.1109/ICDM.2012.123.

Jinfeng Yi, Lijun Zhang, Rong Jin, Qi Qian, and Anil K. Jain. Semi-supervised clustering by input pattern assisted pairwise similarity matrix completion. In Proceedings of the 30th International Conference on Machine Learning, ICML'13, pages III-1400–III-1408. JMLR.org, 2013.

Yijun Zhang, Shashin Sharan, and Felix J. Herrmann. High-frequency wavefield recovery with weighted matrix factorizations. In SEG Technical Program Expanded Abstracts, pages 3959–3963, 2019. doi: 10.1190/segam2019-3215103.1. URL https://library.seg.org/doi/abs/10.1190/segam2019-3215103.1.
Yijun Zhang, Shashin Sharan, Oscar López, and Felix J. Herrmann. Wavefield recovery with limited-subspace weighted matrix factorizations. In SEG Technical Program Expanded Abstracts, pages 2858–2862, 2020. doi: 10.1190/segam2020-3428365.1. URL https://library.seg.org/doi/abs/10.1190/segam2020-3428365.1.

Bo Zhao, Justin P. Haldar, Cornelius Brinegar, and Zhi-Pei Liang. Low-rank matrix recovery for real-time cardiac MRI. In 2010 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, pages 996–999, 2010. doi: 10.1109/ISBI.2010.5490156.