# symmetric_matrix_completion_with_relu_sampling__c5043edb.pdf Symmetric Matrix Completion with Re LU Sampling Huikang Liu * 1 Peng Wang * 2 Longxiu Huang 3 Qing Qu 2 Laura Balzano 2 We study the problem of symmetric positive semidefinite low-rank matrix completion (MC) with deterministic entry-dependent sampling. In particular, we consider rectified linear unit (Re LU) sampling, where only positive entries are observed, as well as a generalization to threshold-based sampling. We first empirically demonstrate that the landscape of this MC problem is not globally benign: Gradient descent (GD) with random initialization will generally converge to stationary points that are not globally optimal. Nevertheless, we prove that when the matrix factor with a small rank satisfies mild assumptions, the nonconvex objective function is geodesically strongly convex on the quotient manifold in a neighborhood of a planted low-rank matrix. Moreover, we show that our assumptions are satisfied by a matrix factor with i.i.d. Gaussian entries. Finally, we develop a tailor-designed initialization for GD to solve our studied formulation, which empirically always achieves convergence to the global minima. We also conduct extensive experiments and compare MC methods, investigating convergence and completion performance with respect to initialization, noise level, dimension, and rank. 1. Introduction Low-rank matrix completion (MC) refers to the task of filling in the missing entries of a partially observed low-rank matrix. It has found applications in diverse fields, such as recommendation systems (Koren et al., 2009), sequential bioinformatics (Zheng et al., 2013), and computer vision (Ji et al., 2010), to name a few. In particular, symmetric *Equal contribution 1Antai College of Economics and Management, Shanghai Jiao Tong University 2Department of Electrical Engineering and Computer Science, University of Michigan 3Department of Computational Mathematics, Science and Engineering & Department of Mathematics, Michigan State University. Correspondence to: Laura Balzano . Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s). positive semi-definite (PSD) low-rank MC has applications in covariance matrix completion (Candes & Plan, 2010; Hosseini & Sebt, 2017), Hankel matrix completion (Chen & Chi, 2013; Usevich & Comon, 2016; Cai et al., 2023), and Euclidean distance matrix completion (Laurent, 2001; Al Homidan & Wolkowicz, 2005; Dokmanic et al., 2015). An extensive body of literature investigates the statistical and optimization properties of the low-rank MC problem using different approaches, such as nuclear-norm minimization (Recht, 2011; Candes & Plan, 2010; Candes & Recht, 2012), spectral method (Keshavan et al., 2010; Chatterjee, 2015), and matrix factorization (Sun & Luo, 2016), among others. To analyze the MC problem, most of these approaches rely on the following assumptions: (1) the underlying matrix is low-rank and incoherent, and (2) the entries are observed independent of the matrix entries according to a probabilistic mechanism, e.g., uniformly at random. Based on these assumptions, it is possible to prove how many observed entries are required so that the missing entries can be exactly or approximately completed. The majority of existing work assumes that the entries are observed uniformly at random, independent of the underlying matrix values. However, this assumption is strict and often violated in practice. 
In real-world applications where data are collected from measurements, such as distance matrices, missing entries tend to be those that are harder to collect. When data are collected from participants, such as online shopping or surveys, missing answers are typically highly correlated with the question and the value of the true answer. For example, surveys about sensitive topics will have missing entries on any culturally or morally problematic answers. Moreover, online ratings tend to be skewed to the high end; for example, most people do not read a book or watch a movie that they might dislike. Even in sensor systems, missing data are not likely to be independent and completely random. Sensors may saturate at a certain value or break down based on environmental conditions that also affect the sensor value. In all these applications, the probability of missing entries in a matrix is dependent on the underlying values, sometimes deterministically. Despite being highly relevant for applications, the problem of recovering missing entries when the sampling mechanism depends on the entries remains a challenging and relatively under-explored area of research. Existing papers along these Matrix Completion with Re LU Sampling lines give impractical results, such as those where the recovery metric provides no guarantees for recovering entries that are never observed (Foygel et al., 2011; Foucart et al., 2020), or high-dimensional consistency results that do not admit finite sample guarantees or make strong assumptions on the sampling probability functions, for example that they are Lipschitz in the matrix entry value and nonzero everywhere (Bhattacharya & Chatterjee, 2022). Research works on MC with deterministic sampling focused on understanding fundamental properties that the sampling patterns must exhibit (Lee et al., 2023). However, none of these works provide clear and practical guidelines for what MC problems can be solved in their given settings and for what algorithms will successfully complete them. 1.1. Our Contributions Our work aims to advance the understanding of MC with deterministic sampling by focusing on symmetric PSD lowrank MC with Re LU sampling, where only the positive entries of a matrix are observed. In this setting, under relatively mild assumptions, we first show that the globally optimal solutions of the low-rank MC problem with a known rank can exactly (resp. approximately) complete the missing entries in the noiseless (resp. noisy) setting. Moreover, we prove that the objective function is geodesically strongly convex on the quotient manifold around the planted low-rank matrix, i.e., the underlying true low-rank matrix. Therefore, with an initial point close enough to the planted low-rank matrix, GD will converge to this desired matrix. This motivates us to tailor an initialization method for GD. We empirically demonstrate that GD with our initialization always converges to the planted low-rank matrix. It is worth mentioning that this objective function landscape result is the first of its kind within the literature on dependent or deterministic sampling for MC. While Re LU sampling is a specific setting, it has the potential to be generalized to a broad class of common missing data problems, e.g., where only a range of values is observed. In this direction, we have also provided more general assumptions for a broader class of sampling functions. 
For example, a threshold sample where we observe all entries above a positive threshold will also lead to a strongly convex objective around the planted low-rank matrix. This more general setting requires stronger assumptions, but it gives us more general results that also apply to the noisy setting. We note that this thresholding sampling setting generally means that a constant fraction of entries are observed. This is an interesting sampling regime when observations are entry-dependent, and a useful one for many sensor data applications where it may be feasible to collect a moderate number of matrix entries. Even in the best case of our theorems, empirical results (shown in Section 4) drastically outperform the theory, leaving an exciting open question as to whether better assumptions and proof techniques can deliver theory that matches empirical observation. 1.2. Literature Review Due to space constraints, we defer detailed related work to Appendix B and focus here on the most related works. Bishop & Yu (2014) prove PSD matrix recovery for their proposed algorithm with deterministic sampling of principal submatrices. Our assumptions are different and our results showing the benign local landscape of the problem are applicable to any gradient-based algorithm. Bhattacharya & Chatterjee (2022) assumes that each entry of a low-rank matrix M is observed with some probability f(M ), where f is applied entry-wise and is essentially Lipschitz and nonzero. Their proofs apply to the high-dimensional regime where the dimensions grow but rank is uniformly bounded, as in (Chatterjee, 2020). In an empirical work that studies Re LU sampling among others (Naik et al., 2022), the authors conclude that convex methods generally do not work as well as nonconvex in the dependent-sampling setting, motivating our focus on the nonconvex objective function. Finally, the work in (Saul, 2022; Seraghiti et al., 2023) both seek a non-negative lowrank matrix M such that a sparse M = f(M ) and f is a given nonlinearity, such as Re LU. They provide multiple algorithms for learning M (such as EM, alternating block descent) but no theoretical guarantees for when this lowrank approximation exists. Notation. Given a matrix A Rm n, we use σmax(A) or A to denote its largest singular value (i.e., spectral norm), σmin(A) its smallest non-zero singular value, A F its Frobenius norm, and aij its (i, j)-th element. Given a vector a Rn, we denote its Euclidean norm by a and the i-th entry by ai. Given a positive integer n, we denote by [n] the set {1, . . . , n}. Let Om n = Q Rm n : QT Q = In denote the set of all m n orthonormal matrices. In particular, let Om = {Q Rm m : QT Q = Im} denote the set of all m m orthogonal matrices. Given an integer n, we define [n] := {1, . . . , n}. 2. Problem Formulation MC with Re LU sampling. In this work, we consider a noisy symmetric PSD MC completion problem. Specifically, let M := U U T Rn n be a symmetric PSD matrix, where U Rn r. In addition, let M Rn n be a noisy version of M generated by M := M + , (1) Matrix Completion with Re LU Sampling 0 200 400 600 800 1000 10-10 106 Uniform Sampling 0 500 1000 1500 2000 2500 106 Re Lu Sampling Figure 1. Recovery and convergence performance of GD for solving the MC problem with the uniform (p = 0.2) and Re LU sampling in the noiseless case. We apply GD with Gaussian random initialization for solving Problem (3) with the uniform and Re LU sampling, respectively. 
Then, we plot the gradient norm (i.e., ‖∇F(U^(t))‖_F) and the completion error (i.e., ‖U^(t)(U^(t))^T − M^*‖_F / ‖M^*‖_F) against the number of iterations.

where Δ ∈ R^{n×n} is a noise matrix. It is worth noting that M̃ is non-symmetric when the noise matrix is not symmetric. There are many applications where it is common to observe only a partial set of the entries of M̃ (Nguyen et al., 2019). This could be due to data collection, experimental constraints, or inherent missing information (Hu et al., 2008). In this work, we consider a setting where the missingness pattern of the matrix depends on the underlying values in the matrix and is deterministic given the matrix entries. Specifically, we suppose that only the non-negative entries of M̃ can be observed, i.e., the observed set is

$$\Omega = \left\{(i, j) \in [n] \times [n] : \tilde{m}_{ij} \ge 0\right\}. \tag{2}$$

Notably, this sampling regime is commonly referred to as ReLU sampling in the literature (see, e.g., Naik et al. (2022); Mazumdar & Rawat (2018)), as it utilizes the function f(x) = max{0, x}, known as the rectified linear unit (ReLU) function in deep learning. (See Nair & Hinton (2010) for early use of the phrase rectified linear unit, but it had been in use for neural nets long before, referred to by the name linear threshold unit (Wersing et al., 2001) or positive part (Jarrett et al., 2009), among others.) Then, our goal is to complete the missing entries of M̃ from the observed entries in M̃_Ω.

Before proceeding, we make some remarks on this MC problem. First, existing works (Naik et al., 2022; Saul, 2022; Seraghiti et al., 2023) have empirically studied the low-rank MC problem with ReLU sampling and focused on proposing efficient algorithms to address this problem. In particular, Mazumdar & Rawat (2018) have theoretically studied the recovery performance of the ReLU-based representation learning problem under a probabilistic model. This differs from our work, which studies the global optimality and optimization landscape of the MC problem. Second, this sampling is merely a specific instance within a broader set of general deterministic sampling schemes. We prove results on one such generalization, where we observe values above or below a threshold. We believe our work will be a springboard for the study of more general practical entry-dependent sampling schemes.

Optimization formulation. Leveraging the low-rank structure in Equation (1), we consider the following nonconvex MC problem to complete the missing entries of M̃:

$$\min_{U \in \mathbb{R}^{n \times r}} \; F(U) := \frac{1}{4}\left\|\left(UU^T - \tilde{M}\right)_{\Omega}\right\|_F^2. \tag{3}$$

Note that the rank of the optimization variable U is exactly the rank of the planted low-rank matrix M^*. In the noiseless setting with uniform sampling and incoherence assumptions, Ge et al. (2016) have shown that Problem (3) has a benign global optimization landscape in the sense that it has no spurious local minima: all local minima must also be global minima. This, together with the result in (Lee et al., 2016), implies that gradient descent (GD) with random initialization converges with high probability to globally optimal solutions that achieve exact completion; see Figure 1(a). One may then conjecture that Problem (3) in the ReLU sampling setting also has such a benign optimization landscape. However, this conjecture is refuted by the empirical evidence in Figure 1(b). This result is typical in the ReLU sampling setting, illustrating that GD with random initialization may converge to a spurious critical point that is not globally optimal. The next question is whether it is even possible to recover the missing entries by applying GD in this setting.
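To make the setup of (1)–(3) and this failure mode concrete, the following MATLAB sketch generates a planted instance with ReLU sampling and runs plain GD from a random start. The problem sizes, noise level, iteration count, and step size are illustrative choices made here, not values prescribed by the paper.

```matlab
n = 200; r = 5; sigma = 0;            % illustrative problem size and noise level
Ustar  = randn(n, r);                 % planted factor U*
Mstar  = Ustar * Ustar';              % M* = U*(U*)'
Mtilde = Mstar + sigma * randn(n, n); % noisy model (1)
Omega  = (Mtilde >= 0);               % ReLU sampling (2): observe nonnegative entries

% Objective (3) and its Euclidean gradient, restricted to the observed entries.
F     = @(U) 0.25 * norm(Omega .* (U*U' - Mtilde), 'fro')^2;
gradF = @(U) 0.5 * ((Omega .* (U*U' - Mtilde)) + (Omega .* (U*U' - Mtilde))') * U;

% Plain GD from a random start.
U = randn(n, r); eta = 5e-4;          % step size is an assumed value; may need tuning
for t = 1:2000
    U = U - eta * gradF(U);
end
fprintf('completion error: %.2e\n', norm(U*U' - Mstar, 'fro') / norm(Mstar, 'fro'));
```

With a random start, this run typically stalls at a non-global stationary point, mirroring Figure 1(b); the initialization introduced in Section 4.1 is designed to avoid this.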
In this work, we answer this question in the affirmative. 3. Main Results In this section, we first present our theoretical result on symmetric PSD low-rank MC with Re LU sampling in the noiseless case under mild assumptions in Section 3.1. In Section 3.2, we extend these results to the noisy case under more general sampling assumptions. In Section 3.3, we show that all the introduced assumptions hold with high probability when the entries of U are i.i.d. sampled from the standard Gaussian distribution. 3.1. Noiseless Case and Re LU Sampling In this subsection, we consider the noiseless case, i.e., = 0, and Re LU sampling in (2). We start by introducing some assumptions on the underlying matrix U and the sampling set Ω. Noting that the r-dimensional space has 2r orthants, we denote by Ci the i-th orthant of the r-dimensional space Matrix Completion with Re LU Sampling Figure 2. An illustrative figure on the partition of rows of U Rn 2. We rearrange the rows of U and partition them into 4 blocks, each belonging to different orthants. for each i [2r]. For example, there are 4 orthants (i.e., quadrants) in the 2-dimensional space and 8 orthants (i.e., octants) in the 3-dimensional space. Here, Ci for each i [2r] is ordered such that the signs of components of a vector belonging to orthant Ci differ from those in orthant Ci+1 in only one component. For ease of exposition, without loss of generality, we assume the rows of U Rn r are partitioned into the following blocks U = U T 1 U T 2 . . . U T 2r T Rn r, (4) where each row of U i Rni r belongs to the i-th orthant Ci for each i [2r] and P2r i=1 ni = n. For example, when r = 2, the rows of different blocks of U i takes the signs as shown in Figure 2. Based on the above setup, we make the following assumption. Assumption 3.1. For each i [2r], rank(U i ) = r. We remark that if r is much smaller than log n and the entries of U are i.i.d. sampled from a distribution symmetric about zero, such as the standard normal distribution, then the generated submatrices are full rank with high probability. Next, we discuss the second assumption that will lead to unique completion. This assumption is illustrated in Figure 3. For any pair of indices (i, j) [2r] [2r], we denote by M (i,j) Rni nj the (i, j)-th block of M, i.e., M (i,j) = U i U T j . Moreover, we denote the set of observations in the (i, j) block by Ωi,j = {(k, l) [ni] [nj] : m(i,j) kl 0} where m(i,j) kl is the (k, l)-th entry of M (i,j). Let a column vector u i,k Rr denote the k-th row of U i for each k [ni]. Then, we have m(i,j) kl = u T i,ku j,l. It is obvious that Ωi,i = [ni] [ni] under the Re LU sampling since u T i,ku i,l 0 always holds for all k, l [ni] due to the fact that the rows of U i have the same sign. Besides, since M (i+1,i) = U i+1U T i and the signs of components in each row of U i+1 and those of U i only differ in one component, one can image that there are enough observations in Ωi+1,i with Re LU sampling. More precisely, we can formalize the above observation as follows: Assumption 3.2. For any i [2r 1], we have |Ωi+1,i| r2 and the matrix space spanned by {u i+1,ku T i,l : (k, l) Ωi+1,i} is the whole space Rr r. Figure 3. An illustrative figure on Assumption 3.2 when r = 2. Orange pixels denote observed entries, while white pixels denote missing entries. Assumptions 3.1 and 3.2 guarantee that there are enough positive (observed) entries for the uniqueness of the completion. 
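To illustrate how the orthant partition in (4) can be examined numerically when the planted factor is available, here is a small MATLAB sketch; the dimensions are assumed values, and it checks only the rank condition of Assumption 3.1, not the off-diagonal condition of Assumption 3.2. (The next paragraph notes how the assumptions can instead be checked from the observed pattern.)

```matlab
n = 200; r = 3;                                % illustrative sizes with r < (log2 n)/2
Ustar = randn(n, r);                           % i.i.d. Gaussian planted factor
signs = double(Ustar >= 0);                    % sign pattern of each row
orthant = signs * (2.^(0:r-1))' + 1;           % orthant index in {1, ..., 2^r} per row
ok = true;
for i = 1:2^r
    Ui = Ustar(orthant == i, :);               % rows of U* lying in the i-th orthant
    if size(Ui, 1) < r || rank(Ui) < r         % Assumption 3.1: rank(U_i*) = r
        ok = false;
    end
end
fprintf('Assumption 3.1 holds for this draw: %d\n', ok);
```

For r well below (log₂ n)/2, all 2^r groups are typically nonempty and of full rank, consistent with the guarantees given later in Section 3.3.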
We note that these assumptions can be checked from the observed matrix by finding a permutation of rows and columns such that the diagonal blocks are fully observed and of rank r and the off-diagonal blocks have enough entries. In Section 3.3, we will show that, when U is i.i.d. Gaussian random matrix and r 1 2 log n, Assumptions 3.1 and 3.2 hold with high probability. Characterization of global optimality. Based on the two assumptions, we are ready to characterize the global optimality of the MC problem (3) under Re LU sampling. Theorem 3.3. Suppose that = 0 in (1), the observed set Ωis defined in (2), and Assumptions 3.1 and 3.2 hold. Then, U Rd r is a global optimal solution to Problem (3) if and only if it satisfies UU T = M . We defer the proof of this theorem to Appendix A.1.1. This theorem demonstrates that even under the Re LU sampling regime, the global optimal solutions of Problem (3) still recover the underlying matrix M exactly, just as in the uniform sampling regime. Uniqueness of global solutions on the manifold. Remark that the objective function F( ) of Probelm (3) is invariant under any orthogonal matrices in Rr r, i.e., F(UQ) = F(U) for all Q Or. To characterize this invariance to orthogonal transformations, let denote an equivalent relation on Rn r with the equivalence class [U] := V Rn r : V U iff Q Or, V = UQ . Then, we define the Riemannian quotient manifold M as M := Rn r\ = {[U] : U Rn r}. (5) If we consider Problem (3) on the manifold M, Theorem 3.3 implies that Problem (3) has a unique global optimal solution [U ]. Next, we will show that the objective function F( ) exhibits geodesic strong convexity on the quotient manifold M near the global optimum [U ] . Matrix Completion with Re LU Sampling Main technical ingredient. The proof of Theorem 3.3 relies on the following technical lemma, which could be independent interest. This lemma indicates that the distance between two low-rank matrices on the quotient manifold M is bounded by their subspace distance. Its proof is also deferred to Appendix A.1.3. Lemma 3.4. For arbitrary U, V Rn r with r n, suppose that rank(U) = r. Then, there exists an orthogonal matrix Q Or such that U V Q F σmin(U) + V σ2 min(U) UU T V V T F . Preliminary setup for manifold optimization. As discussed above, we study Problem (3) over the quotient manifold M defined in (5). Toward this end, we introduce some concepts of manifold optimization, such as tangent space, Riemannian gradient, and Riemannian Hessian. Since the formal definition of the tangent space to a quotient manifold is abstract, we describe it informally as follows. We define the vertical space at U Rn r denoted by SU (see Boumal (2023, Definition 9.23, Example 9.25)) as SU := UR Rn r : R + RT = 0, R Rr r . (6) As shown in (14), the orthogonal complement of SU, which is called the horizontal space and denoted by S U, is S U := D Rn r : U T D = DT U . (7) According to Boumal (2023, Definition 9.24)), for any U Rn r, there exists a bijective mapping lift U between any tangent vector ξ T[U]M and any matrix D S U, i.e., lift U : T[U]M 7 S U, lift U(ξ) = D, where T[U]M : M Rn r denotes the tangent space to M at [U]. According to Boumal (2023, Propositions 9.38 and 9.44)), the Riemannian gradient and Hessian at [U] along a direction ξ T[U]M, denoted by grad F([U])[ξ] and Hess F([U])[ξ, ξ], are computed by grad F([U])[ξ] = F(U), lift U(ξ) , Hess F([U])[ξ, ξ] = 2F(U)[lift U(ξ)], lift U(ξ) . (8) Geodesic strong convexity on M. 
Equipped with the above setup, we analyze the optimization landscape of Problem (3) around the global optimal solutions in the Re LU sampling regime. Although Problem (3) does not possess a benign global optimization landscape, we show that it has a favorable local optimization landscape. Theorem 3.5. Under Assumptions 3.1 and 3.2, F( ) is geodesically strongly convex on the quotient manifold M at [U ], i.e., for all ξ T[U ]M, Hess F([U ])[ξ, ξ] γ 2 lift U (ξ) 2 F , (9) γ := min D S U , D F =1 (U DT + DU T )Ω 2 F > 0. (10) We defer the proof of this theorem to Appendix A.1.2. Intuitively, this theorem demonstrates that the objective function F( ) is strongly convex on the manifold M in the neighborhood of U . Consequently, if a tailored initialization in the local neighborhood of U is available, GD is guaranteed to find a global optimal solution. From geodesic strong convexity on M to strong convexity in Euclidean space. Notably, the above geodesic strong convexity on the manifold M implies strong convexity of F( ) along some directions in Euclidean space. Specifically, for any U Rn r, let U T U = P ΣQT be a singular value decomposition of U T U , where P , Q Or and Σ Rr r, and e U := UP QT . Using this and (9), we can show 2F(U )[ e U U , e U U ] γ 2 e U U 2 F . Please refer to Appendix A.1 for detailed proof. 3.2. Noisy Case and General Deterministic Sampling In this subsection, we extend our above results to the noisy case with a general deterministic sampling regime. Toward this goal, we need the following assumption on the sampling pattern, which generalizes Assumptions 3.1 and 3.2. Assumption 3.6. There exists a collection of index set {I1, I2, . . . , IK} such that the following conditions hold: k [K] Ik = [n]; (b) Ik Ik Ωholds for all k [K]; (c) There exists λ > 0 such that U T k U k λIr for each k [K], where U k is the matrix with the rows consisting of u i for all i Ik; (d) For any k = l [K], there exists a path k0 k1 ks such that k0 = k, ks = l and Ikj 1 Ikj Ωfor all j [s]. Now, let us explain these conditions in more detail. Condition (a) guarantees that each row of U belongs to at least one submatrix; Condition (b) guarantees that the diagonal blocks of M index by Ik for all k [K] are fully observed; Condition (c) indicates that there are enough rows in each part such that the submatrix U k is of full row rank; Finally, condition (d) ensures that there are some off-diagonal blocks that are also fully observed and any two of these blocks are connected via a path. We note that the last condition is strict but aids significantly in the proof of Theorem 3.8. With a Matrix Completion with Re LU Sampling slightly more careful analysis, this condition could be weakened so that off-diagonal blocks are only partially observed. In Section 3.3, we show that under the setting where the entries of U are i.i.d. sampled from the standard Gaussian distribution, r O(log n), and the noise is bounded, Assumption 3.6 holds with high probability. Comparison to (Bishop & Yu, 2014). Assumption 3.6 is similar to the assumptions in (Bishop & Yu, 2014), but as far as we know, neither implies the other. More specifically, Bishop & Yu (2014) consider deterministic sampling and assume that a collection of subsets of Ωadmit an ordering such that any two adjacent parts have enough overlap. However, in the Re LU sampling setting, this order is hard to construct, and it is not clear whether there exists such an ordered collection of subsets. 
Moreover, both the analysis and the algorithm proposed in (Bishop & Yu, 2014) heavily rely on these ordered subsets, while we only use our partition for analysis. General sampling regimes. Notably, Assumption 3.6 can be applied to more general sampling regimes. For example, consider a positive-threshold sampling regime, where the entry mij of M is observed if mij η for a constant η 0. In particular, Re LU sampling corresponds to the case η = 0. Figure 4. A figure on the partition of the plane into 12 equal sectors. Example 3.7 (Positive-Threshold Mask). Consider the noiseless case with r = 2 and the entry mij is observed if mij η for a threshold η > 0. We partition the 2D plane into 12 equal sectors, shown in Figure 4. Let us define a collection of index sets as follows: Ik := {i [n] : u i Ck}, k [12]. Obviously, we have [n] = k [K]Ik, and thus condition (a) in Assumption 3.6 holds. For any i, j Ik, the rows u i and u j belong to the same sector Ck, which implies that the angle between u i and u j is less than π/6. Therefore, we have mij = u T i u j = u i u j cos(π/6) = 3 u i u j /2. As long as the threshold η min{ 3 u i u j /2 : i, j Ck, k [12]}, the entry mij can be observed for all i, j Ik, and thus Ik Ik Ω. This implies that condition (b) in Assumption 3.6 holds. Condition (c) in Assumption 3.6 holds as long as each Ik contains at least two elements i = j such that u i is not parallel to u j. Finally, for any i Ik and j Ik+1, one can verify that mij u i u j /2 using the similar argument. Consequently, as long as the threshold η min{ u i u j /2 : i Ik, j Ik+1, k [11]}, we have Ik Ik+1 Ω. That is, all the near-diagonal blocks of M will be fully observed, and thus condition (d) in Assumption 3.6 holds. Global optimality and local landscape analysis. Based on Assumption 3.6, we are ready to characterize the global optimality and local optimization landscape of Problem (3) in the noisy case. Theorem 3.8. Let U Rn r be any global optimal solution to Problem (3). Under Assumption 3.6, we have UU T U U T F c where c > 0 depends on λ, U F and F . We defer the proof of the theorem to Appendix A.2.1. In particular, the dependence of c on λ, U F and F in specified in (39). Notably, this theorem generalizes Theorem 3.3 to the noisy case with general deterministic sampling satisfying Assumption 3.6. Theorem 3.9. Let U Rn r be any global optimal solution to Problem (3). Suppose that the Assumption 3.6 holds and the noise matrix in (1) satisfies λ + γ + U ) , λ γ 4 c0 1 + κ( γ + 2 U ) where γ is provided in (10) in Theorem 3.5, c0 > 0 depends on γ, U , and λ, and κ > 0 only depends on U . Then, F( ) is geodesically strongly convex on M at [U], i.e., for all ξ T[U]M, Hess F([U])[ξ, ξ] γ 8 F lift U(ξ) 2 F . We defer the proof of this theorem to Appendix A.2.2. Intuitively, this theorem demonstrates that the objective function F( ) is strongly convex on the quotient manifold M in the neighborhood of [U]. 3.3. Verification of Assumptions In this subsection, we show that when the entries of U are i.i.d. sampled from the standard Gaussian distribution, Assumptions 3.1, 3.2, and 3.6 all hold with high probability using a concentration argument. Matrix Completion with Re LU Sampling Noiseless setting. We first show that when = 0 in (1), if the entries of U are i.i.d. Gaussian random variables and the rank is small, Assumptions 3.1 and 3.2 hold with a high probability. Proposition 3.10. Suppose that r (log2 n)/2 and the entries of U are i.i.d. sampled from the standard Gaussian distribution. 
Then, Assumptions 3.1 and 3.2 hold with probability at least 1 n exp( n/16). We defer the proof of this proposition to Appendix A.3.1. While the rank assumption is strict, we empirically observe in Section 4.4 that it can be relaxed. This rank requirement arises from our proof technique. It is of great interest to find an alternative approach to improve our analysis technique. Noisy setting. For the noisy case, i.e., = 0 in (1), we construct the collection of index sets mentioned in Assumption 3.6 in a similar manner to the partition shown in Figure 4. However, it is more difficult to characterize it in high dimensions. To address this, we introduce the concept of ϵnet (see Definition A.2) to construct these index sets. Under the Gaussian distribution and Re LU sampling, we prove the following proposition that shows as long as r O(log n) and the noise is small enough, Assumption 3.6 hold with high probability. We defer the proof to Appendix A.3.2. Proposition 3.11. Suppose that r log n/(4 log 3), the entries of U are i.i.d. sampled from the standard Gaussian distribution, and the noise matrix satisfies < min u k 2 : k [n] /2. Then, Assumption 3.6 holds with probability at least 1 1/n. 4. Experimental Results In this section, we conduct numerical experiments to validate our theoretical developments and demonstrate the performance of several state-of-the-art algorithms on the MC problem with Re LU sampling. All of our experiments are implemented in MATLAB R2023a on a PC with 32GM memory and Intel(R) Core(TM) i7-11800H 2.3GHz CPU. Experiments with Euclidean distance matrix completion can be found in Appendix C.3. 4.1. Our Proposed Algorithm A natural approach to solving Problem (3) is to apply GD as follows: Given the current iterate U (t), we generate the next iterate via U (t+1) = U (t) η F(U (t)), t 0, (12) where Z = (UU T )Ω MΩand F(U) = Z + ZT U. According to Theorem 3.5, Problem (3) with the Re LU sampling exhibits a benign local optimization landscape, even though it possesses a complicated global landscape as illustrated in Figure 1. Consequently, GD may not be effective Algorithm 1 GD for MC with Re LU sampling Input: observed set Ω, observed matrix MΩ Generate Q Rn n using (13) Apply a truncated SVD to Q to obtain U (0) On r for t = 0, 1, 2, . . . do U (t+1) = U (t) η F(U (t)) end for for solving Problem (3) unless a proper initial point U (0) is available. This motivates us to propose a tailor-designed initialization as follows: We first randomly generate a matrix Y Rn r, whose entries are i.i.d. sampled from the standard normal distribution. Then, we construct Q = Y Y T and generate a new matrix Q Rn n via ( mij, if (i, j) Ω, |qij|, otherwise. (13) Finally, we apply a truncated SVD to Q to obtain an initial point U (0) On r, where the columns of U (0) consist of right singular vectors of Q corresponding to its top r singular values. Notably, we leverage the low-rank structure of M and the Re LU sampling to design the initialization scheme. According to this and (12), we now summarize our proposed method for solving Problem (3) in Algorithm 1. 4.2. Convergence Behavior and Recovery Performance In this subsection, we study the convergence behavior and recovery performance of GD in Algorithm 1 for solving Problem (3) in the Re LU sampling regime. 
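As a concrete reference for the experiments below, here is a minimal MATLAB sketch of Algorithm 1: the tailored spectral initialization (13) followed by GD (12). The function name, step size, and loop parameters are illustrative; the stopping rule mirrors the criterion used later in this section (gradient norm below a tolerance or a cap on the number of iterations).

```matlab
function U = gd_relu_mc(Mtilde, Omega, r, eta, maxit, tol)
% Sketch of Algorithm 1: tailored spectral initialization (13) + GD (12).
    n = size(Mtilde, 1);
    % Initialization (13): random Gaussian Y, Q = Y*Y'; keep the observed entries
    % of Mtilde and impute |q_ij| on the unobserved ones.
    Y = randn(n, r);
    Q = Y * Y';
    Qbar = abs(Q);
    Qbar(Omega) = Mtilde(Omega);
    % Columns of U^(0) are the top-r right singular vectors of Qbar (Section 4.1).
    [~, ~, V] = svds(Qbar, r);
    U = V;
    % Gradient descent (12) on the observed entries.
    for t = 1:maxit
        Z = Omega .* (U*U' - Mtilde);     % residual on Omega
        G = 0.5 * (Z + Z') * U;           % Euclidean gradient of (3)
        if norm(G, 'fro') < tol
            break;
        end
        U = U - eta * G;
    end
end
```

For example, with M̃ and Ω generated as in the sketch of Section 2, `U = gd_relu_mc(Mtilde, Omega, r, 5e-4, 5000, 1e-6)` mirrors the termination rule used in the experiments below; the step size 5e-4 is an assumed value.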
To measure the convergence and recovery performance of the studied algorithms, we employ the following metrics at the t-th iteration: the norm of the gradient α_t = ‖∇F(U^(t))‖_F, the function value β_t = F(U^(t)), and the completion error γ_t = ‖U^(t)(U^(t))^T − M^*‖_F / ‖M^*‖_F. Obviously, if α_t is smaller, then U^(t) is closer to a critical point. Moreover, if β_t and γ_t are smaller, then U^(t) is closer to the ground truth.

Our tailor-designed vs. random-imputation spectral (RI) and random spectral (RS) initialization. To demonstrate the effectiveness of our tailor-designed initialization, we compare our proposed initialization with a random-imputation spectral (RI) and a random spectral (RS) initialization. Specifically, the RI initialization proceeds as follows: we generate a matrix Y ∈ R^{n×r}, whose entries are i.i.d. sampled from the standard normal distribution. Then, we construct Q = YY^T and generate a matrix Q̄ via setting q̄_ij = m̃_ij if (i, j) ∈ Ω. Finally, we generate U^(0) ∈ O^{n×r} using the truncated SVD of Q̄ as explained in Section 4.1. Moreover, the RS initialization proceeds exactly the same way, but without the imputation scheme q̄_ij = m̃_ij if (i, j) ∈ Ω.

Figure 5. Convergence and recovery performance of GD for MC with ReLU sampling under different initialization schemes. (a) Noiseless case: σ = 0; (b) noisy case: σ = 10⁻⁴; (c) noisy case: σ = 10⁻².

Table 1. Comparison of the norm of gradient, function value, and completion error returned by GD with different initializations (mean ± standard deviation over 20 runs).

| σ = 0 | α_T | β_T | γ_T |
|---|---|---|---|
| Our init | (9.7 ± 0.16)×10⁻⁷ | (3.3 ± 0.36)×10⁻¹⁴ | (7.6 ± 0.7)×10⁻¹¹ |
| RI init | (6.8 ± 0.3)×10⁻³ | (7.9 ± 2.1)×10³ | 0.5 ± 0.13 |
| RS init | 0.1 ± 0.45 | (7.1 ± 3.8)×10³ | 0.45 ± 0.24 |

| σ = 10⁻⁴ | α_T | β_T | γ_T |
|---|---|---|---|
| Our init | (9.7 ± 0.14)×10⁻⁷ | (1.8 ± 0.03)×10⁻⁴ | (4.4 ± 0.13)×10⁻⁵ |
| RI init | (0.39 ± 1.2)×10⁻⁵ | (8.4 ± 2.1)×10³ | 0.51 ± 0.13 |
| RS init | (1.5 ± 6.5)×10⁻³ | (8.4 ± 2.0)×10³ | 0.51 ± 0.13 |

Experimental setup. In our experiments, we set n = 200 and r = 5. We generate the data matrix M̃ according to the model (1) with different noise levels σ ∈ {0, 10⁻⁴, 10⁻²} and sample the observed entries via ReLU sampling, i.e., (2). For each noise level, we generate 20 data matrices and run GD with our proposed, RI, and RS initializations on each data matrix, respectively. In each test, we terminate the algorithm when the Frobenius norm of the gradient at the T-th iteration is less than 10⁻⁶ or T ≥ 5000.

Experimental results. To demonstrate the convergence and recovery performance of our proposed approach, we calculate and report the mean and standard deviation of the norm of gradient α_T, function value β_T, and completion error γ_T averaged over 20 runs in Table 1. An additional result for σ = 10⁻² can be found in Appendix C.1. In addition, we select one run from the 20 runs and plot the norms of gradients {α_t}_{t≥0} and completion errors {γ_t}_{t≥0} against the number of iterations in Figure 5. In the noiseless case, i.e., σ = 0, it is observed that GD with the tailor-designed initialization efficiently finds the ground truth at a linear rate, while GD with the RI or RS initialization converges to a critical point that is not globally optimal. In the noisy case, i.e., σ = 10⁻⁴, we observe that GD with the tailor-designed initialization converges to a critical point within an O(σ)-neighborhood of the ground truth, while GD with the RI or RS initialization converges to a critical point that is significantly distant from the ground truth.
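For reference, the two baseline initializations used in this comparison can be sketched in MATLAB as follows, reusing M̃, Ω, n, and r from the earlier sketches (variable names are illustrative).

```matlab
% Baseline initializations compared in Table 1.
Y = randn(n, r);
Q = Y * Y';

% Random-imputation spectral (RI): overwrite the observed entries with Mtilde,
% but keep q_ij (rather than |q_ij|) on the unobserved ones.
Qri = Q;
Qri(Omega) = Mtilde(Omega);
[~, ~, Vri] = svds(Qri, r);
U0_ri = Vri;

% Random spectral (RS): truncated SVD of Q itself, with no imputation.
[~, ~, Vrs] = svds(Q, r);
U0_rs = Vrs;
```

Each U^(0) is then run through the same GD loop as in the Algorithm 1 sketch above.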
The results in Table 1 and Figure 5 support our theoretical results in Theorem 3.3 and Theorem 3.5. They empirically demonstrate that the optimal solutions to Problem (3) can recover the ground truth, but that this problem has only a locally benign optimization landscape. Additionally, the comparison between our proposed initialization and the RI spectral initialization further highlights the effectiveness of our tailor-designed initialization.

4.3. Comparison to Existing Methods

In this subsection, we compare our proposed method with some state-of-the-art methods for solving MC in the ReLU sampling regime in terms of recovery performance and computational efficiency. This includes the alternating direction method of multipliers (ADMM) (Boyd et al., 2011) for the convex formulation of MC, scaled gradient descent (Scaled GD) for MC studied in (Tong et al., 2021), proximal alternating minimization (PAM) for solving the formulation proposed in (Saul, 2022), momentum proximal alternating minimization (MPAM) for solving the nonlinear matrix decomposition studied in (Seraghiti et al., 2023), and the Gauss-Newton matrix recovery (GNMR) method for low-rank MC (Zilber & Nadler, 2022). To guarantee the performance of Scaled GD, we employ our tailor-designed initialization scheme in Section 4.1 to initialize it. We refer the reader to Appendix C.2 for the problem formulations and implementation details.

Figure 6. Comparison of completion error and CPU time for a variety of MC algorithms in the noiseless case.

Experimental setup. In our experiments, we set n = 1000 and r = 20, and generate the data matrix M̃ according to the model (1) with different noise levels σ ∈ {0, 10⁻²}. For each noise level, we generate 10 data matrices and run the MC algorithms on each data matrix. In each test, we terminate our algorithm when the Frobenius norm of the gradient is less than 10⁻⁴. The termination criteria of the other algorithms are specified in Appendix C.2.

Experimental results. To compare the recovery performance and computational efficiency of the tested methods, we calculate and report the completion errors and CPU times averaged over 10 runs using box plots in Figure 6. See an additional figure for the noisy case in Appendix C.1. It is observed that our proposed method achieves recovery performance comparable to the other tested methods in both the noiseless and noisy cases. Moreover, our proposed method is as fast as MPAM, slightly faster than PAM and Scaled GD, and substantially faster than ADMM and GNMR in terms of computational efficiency. It is notable that all methods compared here have reasonable completion error on this ReLU MC problem. It was shown empirically in (Naik et al., 2022) that other methods, including direct nuclear norm minimization, do not work as well. It is of great interest to develop a broader theory to understand or clarify this discrepancy among algorithms for ReLU sampling and more general entry-dependent MC problems.

4.4. Completion with various ranks

In Figure 7, we investigate completion accuracy in the noiseless setting as we increase the rank.

Figure 7. Completion in the noiseless case as rank varies, with dimension n = 200, 300, 400. The line shows the median normalized recovery error, and the bars show the 25% and 75% error quantiles, over 100 trials.

In all cases, completion is successful well beyond our assumed rank bound of r ≤ (log₂ n)/2.
Consider n = 200; even though log2(200) 7.6, completion only begins to break down around r = 45. This behavior was also reported in (Naik et al., 2022). A deeper understanding of this fundamental limit is an exciting question for future work. 5. Conclusions In this work, we studied the MC problem with Re LU sampling. Under mild assumptions on the underlying matrix, we showed that global optimal solutions of the low-rank formulation of MC recover the underlying matrix in both noiseless and noisy cases. Moreover, we also characterized the local optimization landscape, which is strongly convex on the quotient manifold in the neighborhood of global optimal solutions. Finally, we proposed a tailor-designed initialization for GD to optimize the studied formulation and conducted extensive experiments to showcase the potential of our proposed approach. While our work fills a gap in the literature, there are still limitations that need to be addressed, starting with generalizing from SPSD matrices to rectangular matrices and considering unknown rank r. Weakening Assumption 3.6(d) is of great interest, as mentioned before. Additionally, we point out that our method and initialization have prior knowledge that may not be available in general, e.g., our initialization explicitly uses the Re LU sampling assumption. It would be interesting to empirically explore the sensitivity of MC algorithms to this coupling between sampling scheme and initialization. A final general future direction of study is to extend our landscape analysis to more general deterministic and entry-dependent sampling regimes. Acknowledgements The work of H.L. was supported in part by NSF China under Grant 12301403 and Young Elite Scientists Sponsorship Program by CAST 2023QNRC001. The work of Matrix Completion with Re LU Sampling L.B. and P.W. was supported in part by ARO YIP award W911NF1910027, NSF CAREER award CCF-1845076, and Do E award DE-SC0022186. QQ acknowledges support from NSF CAREER CCF-2143904, NSF CCF-2212066, and NSF CCF-2212326. Impact Statement This paper presents work whose goal is to advance the field of machine learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Agarwal, A., Dahleh, M., Shah, D., and Shen, D. Causal matrix completion. In The Thirty Sixth Annual Conference on Learning Theory, pp. 3821 3826. PMLR, 2023. Al-Homidan, S. and Wolkowicz, H. Approximate and exact completion problems for euclidean distance matrices using semidefinite programming. Linear algebra and its applications, 406:109 141, 2005. Bhattacharya, S. and Chatterjee, S. Matrix completion with data-dependent missingness probabilities. IEEE Transactions on Information Theory, 2022. Bishop, W. E. and Yu, B. M. Deterministic symmetric positive semidefinite matrix completion. Advances in Neural Information Processing Systems, 27, 2014. Boumal, N. An introduction to optimization on smooth manifolds. Cambridge University Press, 2023. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine learning, 3(1):1 122, 2011. Cai, H., Cai, J.-F., and You, J. Structured gradient descent for fast robust low-rank hankel matrix completion. SIAM Journal on Scientific Computing, 45(3):A1172 A1198, 2023. Candes, E. and Recht, B. Exact matrix completion via convex optimization. Communications of the ACM, 55 (6):111 119, 2012. Candes, E. J. 
and Plan, Y. Matrix completion with noise. Proceedings of the IEEE, 98(6):925 936, 2010. Chatterjee, S. Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43(1):177 214, 2015. Chatterjee, S. A deterministic theory of low rank matrix completion. IEEE Transactions on Information Theory, 66(12):8046 8055, 2020. Chen, Y. and Chi, Y. Spectral compressed sensing via structured matrix completion. In International conference on machine learning, pp. 414 422. PMLR, 2013. Dokmanic, I., Parhizkar, R., Ranieri, J., and Vetterli, M. Euclidean distance matrices: essential theory, algorithms, and applications. IEEE Signal Processing Magazine, 32 (6):12 30, 2015. Foucart, S., Needell, D., Pathak, R., Plan, Y., and Wootters, M. Weighted matrix completion from non-random, non-uniform sampling patterns. IEEE Transactions on Information Theory, 67(2):1264 1290, 2020. Foygel, R., Shamir, O., Srebro, N., and Salakhutdinov, R. R. Learning with the weighted trace-norm under arbitrary sampling distributions. Advances in neural information processing systems, 24, 2011. Ganti, R. S., Balzano, L., and Willett, R. Matrix completion under monotonic single index models. Advances in neural information processing systems, 28, 2015. Ge, R., Lee, J. D., and Ma, T. Matrix completion has no spurious local minimum. Advances in neural information processing systems, 29, 2016. G uler, O., Hoffman, A. J., and Rothblum, U. G. Approximations to solutions to systems of linear inequalities. SIAM Journal on Matrix Analysis and Applications, 16 (2):688 696, 1995. Hern andez-Lobato, J. M., Houlsby, N., and Ghahramani, Z. Probabilistic matrix factorization with non-random missing data. In International Conference on Machine Learning, pp. 1512 1520. PMLR, 2014. Hoffman, A. J. On approximate solutions of systems of linear inequalities. In Selected Papers Of Alan J Hoffman: With Commentary, pp. 174 176. World Scientific, 2003. Hosseini, S. M. and Sebt, M. A. Array interpolation using covariance matrix completion of minimum-size virtual array. IEEE Signal Processing Letters, 24(7):1063 1067, 2017. Hu, Y., Koren, Y., and Volinsky, C. Collaborative filtering for implicit feedback datasets. In 2008 Eighth IEEE international conference on data mining, pp. 263 272. Ieee, 2008. Jarrett, K., Kavukcuoglu, K., Ranzato, M., and Le Cun, Y. What is the best multi-stage architecture for object recognition? In 2009 IEEE 12th international conference on computer vision, pp. 2146 2153. IEEE, 2009. Ji, H., Liu, C., Shen, Z., and Xu, Y. Robust video denoising using low rank matrix completion. In 2010 IEEE computer society conference on computer vision and pattern recognition, pp. 1791 1798. IEEE, 2010. Matrix Completion with Re LU Sampling Jin, H., Ma, Y., and Jiang, F. Matrix completion with covariate information and informative missingness. The Journal of Machine Learning Research, 23(1):8115 8176, 2022. Keshavan, R. H., Montanari, A., and Oh, S. Matrix completion from a few entries. IEEE transactions on information theory, 56(6):2980 2998, 2010. Kir aly, F. J., Theran, L., and Tomioka, R. The algebraic combinatorial approach for low-rank matrix completion. J. Mach. Learn. Res., 16(1):1391 1436, 2015. Koren, Y., Bell, R., and Volinsky, C. Matrix factorization techniques for recommender systems. Computer, 42(8): 30 37, 2009. Laurent, M. Polynomial instances of the positive semidefinite and euclidean distance matrix completion problems. SIAM Journal on Matrix Analysis and Applications, 22 (3):874 894, 2001. 
Lee, H., Mazumder, R., Song, Q., and Honorio, J. Matrix completion from general deterministic sampling patterns. ar Xiv preprint ar Xiv:2306.02283, 2023. Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient descent only converges to minimizers. In Conference on learning theory, pp. 1246 1257. PMLR, 2016. Little, R. J. and Rubin, D. B. Statistical analysis with missing data, volume 793. John Wiley & Sons, 3rd edition, 2019. Liu, G., Liu, Q., and Yuan, X. A new theory for matrix completion. Advances in Neural Information Processing Systems, 30, 2017. Liu, G., Liu, Q., Yuan, X.-T., and Wang, M. Matrix completion with deterministic sampling: Theories and methods. IEEE transactions on pattern analysis and machine intelligence, 43(2):549 566, 2019. Ma, C. and Zhang, C. Identifiable generative models for missing not at random data imputation. Advances in Neural Information Processing Systems, 34:27645 27658, 2021. Ma, W. and Chen, G. H. Missing not at random in matrix completion: The effectiveness of estimating missingness probabilities under a low nuclear norm assumption. Advances in Neural Information Processing Systems, 32, 2019. Mazumdar, A. and Rawat, A. S. Representation learning and recovery in the Re LU model. ar Xiv preprint ar Xiv:1803.04304, 2018. Naik, R., Trivedi, N., Tarzanagh, D. A., and Balzano, L. Truncated matrix completion-an empirical study. In 2022 30th European Signal Processing Conference (EUSIPCO), pp. 847 851. IEEE, 2022. Nair, V. and Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807 814, 2010. Nguyen, L. T., Kim, J., and Shim, B. Low-rank matrix completion: A contemporary survey. IEEE Access, 7: 94215 94237, 2019. Ongie, G., Pimentel-Alarcon, D., Nowak, R., Willett, R., and Balzano, L. Tensor methods for nonlinear matrix completion. SIAM journal on mathematics of data science, 2020. Pena, J., Vera, J. C., and Zuluaga, L. F. New characterizations of hoffman constants for systems of linear constraints. Mathematical Programming, 187:79 109, 2021. Pimentel-Alarc on, D. L., Boston, N., and Nowak, R. D. A characterization of deterministic sampling patterns for low-rank matrix completion. IEEE Journal of Selected Topics in Signal Processing, 10(4):623 636, 2016. Recht, B. A simpler approach to matrix completion. Journal of Machine Learning Research, 12(12), 2011. Saul, L. K. A nonlinear matrix decomposition for mining the zeros of sparse data. SIAM Journal on Mathematics of Data Science, 4(2):431 463, 2022. Sengupta, N., Udell, M., Srebro, N., and Evans, J. Sparse data reconstruction, missing value and multiple imputation through matrix factorization. Sociological Methodology, 53(1):72 114, 2023. Seraghiti, G., Awari, A., Vandaele, A., Porcelli, M., and Gillis, N. Accelerated algorithms for nonlinear matrix decomposition with the Re LU function. ar Xiv preprint ar Xiv:2305.08687, 2023. Shamir, O. and Shalev-Shwartz, S. Matrix completion with the trace norm: Learning, bounding, and transducing. The Journal of Machine Learning Research, 15(1):3401 3423, 2014. Shapiro, A., Xie, Y., and Zhang, R. Matrix completion with deterministic pattern: A geometric perspective. IEEE Transactions on Signal Processing, 67(4):1088 1103, 2018. Singer, A. and Cucuringu, M. Uniqueness of low-rank matrix completion by rigidity theory. SIAM Journal on Matrix Analysis and Applications, 31(4):1621 1641, 2010. Matrix Completion with Re LU Sampling Soltanolkotabi, M. 
Learning Re LUs via gradient descent. Advances in neural information processing systems, 30, 2017. Sportisse, A., Boyer, C., and Josse, J. Estimation and imputation in probabilistic principal component analysis with missing not at random data. Advances in Neural Information Processing Systems, 33:7067 7077, 2020a. Sportisse, A., Boyer, C., and Josse, J. Imputation and lowrank estimation with missing not at random data. Statistics and Computing, 30(6):1629 1643, 2020b. Sun, R. and Luo, Z.-Q. Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 62(11):6535 6579, 2016. Szarek, S. J. Metric entropy of homogeneous spaces. ar Xiv preprint math/9701213, 1997. Tong, T., Ma, C., and Chi, Y. Accelerating ill-conditioned low-rank matrix estimation via scaled gradient descent. The Journal of Machine Learning Research, 22(1):6639 6701, 2021. Usevich, K. and Comon, P. Hankel low-rank matrix completion: Performance of the nuclear norm relaxation. IEEE Journal of Selected Topics in Signal Processing, 10(4): 637 646, 2016. Vershynin, R. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018. Wersing, H., Beyn, W.-J., and Ritter, H. Dynamical stability conditions for recurrent neural networks with unsaturating piecewise linear transfer functions. Neural Computation, 13(8):1811 1825, 2001. Yang, C., Ding, L., Wu, Z., and Udell, M. Tenips: Inverse propensity sampling for tensor completion. In International Conference on Artificial Intelligence and Statistics, pp. 3160 3168. PMLR, 2021. Zheng, X., Ding, H., Mamitsuka, H., and Zhu, S. Collaborative matrix factorization with multiple similarities for predicting drug-target interactions. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1025 1033, 2013. Zilber, P. and Nadler, B. GNMR: a provable one-line algorithm for low rank matrix recovery. SIAM Journal on Mathematics of Data Science, 4(2):909 934, 2022. Matrix Completion with Re LU Sampling Supplementary Material The appendix is organized as follows: Appendix A contains all the proofs for the main results. Specifically, in Section A.1, we focus on the noiseless Re LU case and provide proofs for all the theorems presented in Section 3.1. In Section A.2, we shift our attention to the noisy general case and provide proofs for the theorems found in Section 3.2. In Section A.3, we concentrate on the Gaussian and Re LU sampling settings and present all the required proofs for Section 3.3. Appendix B provides the detailed related work. Appendix C provides detailed setups of our experiments and includes additional numerical results to supplement Section 4. A. Proofs of Main Results Before we proceed, let us recall some definitions and notions. Based on the block partition in (4), we denote by M (i,j) Rni nj the (i, j)-th block of M, i.e., M (i,j) = U i U T j for any pair of indices (i, j) [2r] [2r]. Moreover, we denote the set of observations in the (i, j) block by Ωi,j = {(k, l) [ni] [nj] : m(i,j) kl 0} where m(i,j) kl is the (k, l)-th entry of M (i,j). Let a column vector u i,k Rr denote the k-th row of U i for each k [ni]. Then, we have m(i,j) kl = u T i,ku j,l. Since u T i,ku i,l 0 always holds for all k, l [ni] due to the fact that the rows of U i have the same sign, we have Ωi,i = [ni] [ni] for all i [2r] under the Re LU sampling (2). A.1. 
Proofs for Section 3.1: Noiseless Case and Re LU Sampling Firstly, we need to show that the horizontal space S U, which is the orthogonal complement of SU defined in (6), is given by (7). Indeed, for arbitrary D S U, due to the orthogonality, we obtain that D, UR = U T D, R = 0 holds for all skew-symmetric matrix R. This holds if and only if U T D is symmetric. Consequently, S U is given by S U = D Rn r | U T D = DT U . (14) A.1.1. PROOF OF THEOREM 3.3 Proof of Theorem 3.3. Let U Rn r be an arbitrary optimal solution to Problem (3). According to (1) and = 0, we have M = U U T . This, together with the optimality of Problem (3), implies that (UU T )Ω= (U U T )Ω. Let U = U T 1 U T 2 U T 2r T be the same partition as that in (4). Since all diagonal blocks are fully observed, we have Ui U T i = U i U T i . This, together with Assumption 3.1 and Lemma 3.4, yields that there exists Qi Or such that Ui = U i Qi for all i [2r]. Next, we show that the above Qi for all i [2r] are the same. For each tri-diagonal block, it follows from (UU T )Ω= (U U T )Ωthat (Ui+1U T i )Ωi+1,i = (U i+1U T i )Ωi+1,i for all i [2r 1]. By substituting the equality Ui = U i Qi into it, we get (U i+1Qi+1QT i U T i )Ωi+1,i = (U i+1U T i )Ωi+1,i. We write its entry-wise form as follows: u T i+1,k Qi+1QT i u i,l = u T i+1,ku i,l, (k, l) Ωi+1,i. (15) According to Assumption 3.2, {u i+1,ku T i,l : (k, l) Ωi+1,i} spans the whole r-by-r matrix space, so (15) holds if and only if Qi+1QT i = Ir. Therefore, Qi = Qj for all i = j. This yields U = U Q with Q Or for any optimal solution U. Then, we complete the proof. A.1.2. PROOF OF THEOREM 3.5 Before we proceed, let us compute the gradient and Hessian of F(U). Obviously, the gradient is F(U) = (UU T M)ΩU. (16) Given a direction D Rn r, we have F(U + t D) = (UU T + t UDT + t DU T + t2DDT M)Ω(U + t D) = F(U) + t(UDT + DU T )ΩU + t2(DDT )ΩU + t(UU T M)ΩD t2(UDT + DU T )ΩD + t3(DDT )ΩD. Matrix Completion with Re LU Sampling Then, we compute its bilinear form of the Hessian along a direction D Rn r as follows: 2F(U)[D, D] = D, lim t 0 F(U + t D) F(U) = D, (UU T M)ΩD + (UDT + DU T )ΩU = DDT , (UU T M)Ω + DU T , (UDT + DU T )Ω = DDT , (UU T M)Ω + 1 2 (UDT + DU T )Ω 2 F , where the last inequality holds because DU T , (UDT + DU T )Ω = UDT , (UDT + DU T )Ω and DU T , (UDT + DU T )Ω = 1 2 UDT + DU T , (UDT + DU T )Ω = 1 2 (UDT + DU T )Ω, (UDT + DU T )Ω . Proof of Theorem 3.5. Using (17) and M = U U T , we have 2F(U )[D, D] = 1 2 (U DT + DU T )Ω 2 F 0. (18) This implies that the Hessian of F at U is semi-definite positive. Next, we consider these directions D such that 2F(U )[D, D] = 0. We aim to show D = U V , where V + V T = 0. (19) First, for each i [2r], since the i-th diagonal part is fully observed, i.e., Ωi,i = [ni] [ni], we have U i DT i + Di U T i = 0. (20) It follows from Assumption 3.1 that rank(U i ) = r, and thus its pseudo-inverse is U i = (U T i U i ) 1U T i Rr ni. By multiplying U T i on the right hand side of (20), we get Di = U i DT i U T i = U i Vi, where Vi := DT i U T i . Moreover, by multiplying U i and U T i on the left and right sides of (20), we get DT i U T i + U i Di = 0. This implies Vi + V T i = 0, which means Vi is a skew-symmetric matrix. Now, let us consider the (i + 1, i)-th block of (U DT + DU T )Ω. According to (20) and Di = U i Vi for all i [2r], we have 0 = U i+1DT i + Di+1U T i Ωi+1,i = U i+1V T i U T i + U i+1Vi+1U T i Ωi+1,i = U i+1 V T i + Vi+1 U T i Using the same argument in (15), we have V T i + Vi+1 = 0. 
Thus, we have Vi = Vi+1 for all i [2r 1], and thus Vi for all i [2r] are the same. Consequently, the space spanned by all the directions D satisfying F(U )[D, D] = 0 is exactly the vertical space SU defined in (6), i.e., SU = D Rn r : F(U )[D, D] = 0 . Define the constant γ := min D S U , D F =1 (U DT + DU T )Ω 2 F > 0 (21) We claim that γ > 0. Now, we prove this claim by contradiction. Indeed, suppose that γ = 0. This implies that there exists a D S U with D F = 1 such that (U DT + DU T )Ω 2 F = 0. This, together with (19), yields D SU and thus D SU S U = {0}, which contradicts the fact that D F = 1. Then, for any D S U , (18) shows that 2F(U )[D, D] = 1 2 (U DT S U + DS U U T )Ω 2 F γ 2 DS U 2 F = γ 2 D 2 F . (22) where DS U = Proj S U (D) is the projection of D on S U . Finally, substituting D = lift U (ξ) S U into the above inequality for any ξ T[U ]M, together with (8), yields the desired result. Matrix Completion with Re LU Sampling From the geodesic strong convexity on M to the strong convexity in Euclidean space. For any U, V Rn r, let U T U = P ΣQT be an SVD of U T U , where P , Q Or and Σ Rr r. Moreover, let e U := UP QT and D := e U U , then we must have D S U . This is because, for any E SU = {U V : V + V T = 0}, we have E, e U U = U V , UP QT U = V , U T UP QT U T U = V , QΣQT U T U = 0, (23) where the last equality holds because V is a skew-symmetric matrix and QΣQT U T U is symmetric. Therefore, (22) implies that 2F(U )[ e U U , e U U ] γ 2 e U U 2 F . (24) A.1.3. PROOF OF LEMMA 3.4 Proof of Lemma 3.4. According to r = rank(U), we obtain that the pseudo-inverse of U is U := (U T U) 1U T Rr n. Then, one can verify U 1/σmin(U). Next, let Ψ := UU T V V T . Then, we compute ΨU T = U V V T U T , U ΨU T = Ir U V V T U T . U ΨU T F U 2 Ψ F Ψ F σ2 min(U). (25) Let V T U T = P ΣHT be a singular value decomposition (SVD) of V T U T Rr r, where P , H Or and Σ Rr r. Then, we have Ir Σ2 F = Ir HΣ2HT F = Ir U V V T U T F = U ΨU T F Ψ F σ2 min(U), where the inequality follows from (25). This implies Σ Ir F = (Σ + Ir) 1(Σ2 Ir) F Σ2 Ir F Ψ F σ2 min(U). (26) Finally, by choosing Q = P HT , we get ΨU T = U V V T U T = U V Q + V P (Ir Σ)HT . Therefore, we obtain U V Q F ΨU T F + V P (Ir Σ)HT F Ψ F σmin(U) + V Ψ F σ2 min(U) , where the last inequality follows from U 1/σmin(U) and (26). Now, we remove the full rank assumption rank(U) = r and prove the following more general lemma. Lemma A.1. For arbitrary U, V Rn r with r n, there exists an orthogonal matrix Q Or such that U V Q F max σ2 min(U) + V σ2 min(U) , 1 σmin(V ) UU T V V T F , (27) where σmin(U) and σmin(V ) denote the smallest non-zero singular value of U and V , respectively. Proof. For ease of exposition, we denote the rank of U by ˆr = rank(U), where ˆr r. Let U = P1Σ1QT 1 be a compact SVD of U, where P1 On ˆr, Σ1 Rˆr ˆr, and Q1 Oˆr. Notice that UU T V V T 2 F = P1P T 1 (UU T V V T ) + (I P1P T 1 )(UU T V V T ) 2 F = P1P T 1 (UU T V V T ) 2 F + (I P1P T 1 )(UU T V V T ) 2 F = P T 1 (UU T V V T ) 2 F + (I P1P T 1 )V V T 2 F , Matrix Completion with Re LU Sampling where the last equality holds because P T 1 P1 = I and (I P1P T 1 )U = 0. Using the same argument, we have P T 1 (UU T V V T ) 2 F = P T 1 (UU T V V T )P1P T 1 2 F + P T 1 (UU T V V T )(I P1P T 1 ) 2 F = P T 1 (UU T V V T )P1 2 F + P T 1 V V T (I P1P T 1 ) 2 F P T 1 (UU T V V T )P1 2 F = Σ2 1 P T 1 V V T P1 2 F , where the last equality holds because U = P1Σ1QT 1 . Putting the above results together yields UU T V V T 2 F Σ2 1 P T 1 V V T P1 2 F + (I P1P T 1 )V V T 2 F . 
(28) Similarly, we also have U V Q 2 F = P T 1 (U V Q) 2 F + (I P1P T 1 )V Q 2 F . (29) Let P T 1 V = P2Σ2QT 2 be a compact SVD of P T 1 V , where P2 On ˆr, Σ2 Rˆr ˆr, and Q2 Oˆr. By choosing Q such that QT 2 Q = QT 1 , we have P T 1 (U V Q) F = Σ1QT 1 P T 1 V Q F = Σ1 P2Σ2 F σ2 min(U) + V σ2 min(U) Σ2 1 P T 1 V V T P1 2 F , (30) where the second equality follows from P T 1 V = P2Σ2QT 2 and QT 2 Q = QT 1 , and the inequality uses Lemma 3.4. Moreover, we have (I P1P T 1 )V Q 2 F = (I P1P T 1 )V1 2 F 1 σmin(V ) (I P1P T 1 )V1V T 1 2 F . This, together with (28), (29), and (30), implies (27). A.2. Proofs for Section 3.2: Noisy Case with General Sampling A.2.1. PROOF OF THEOREM 3.8 Proof of Theorem 3.8. Using the global optimality of U, (1), and (3), we have F(U) F(U ) = 1 4 2 F . (31) In addition, we have 4 (UU T M)Ω 2 F 1 8 (UU T U U T )Ω 2 F 1 where the inequality follows from (1) and A B 2 F A 2 F /2 B 2 F for all A, B. This, together with (31), implies (UU T U U T )Ω 2 F 4 2 F . (32) Now, we consider the diagonal blocks of M index by Ik Ik. For each k [K], let Ck := Uk U T k U k U T k . According to condition (a) in Assumption 3.6, the block Ik Ik is fully observed, and thus we have for each k [K], Ck F (UU T U U T )Ω F 2 F . (33) Recall that U k = (U T k U k) 1U T k . Let U k Uk = PkΣk HT k be an SVD of U k Uk Rr r, where Pk, Hk Or and Σk Rr r, and Qk := Hk P T k . According to Lemma 3.4 and the proof, we have Uk Qk U k F σmin(U k) + Uk σ2 min(U k) Uk U T k U k U T k F λ + 2 1 2 + U k , (34) Matrix Completion with Re LU Sampling where the second inequality follows from (33) and σmin(U k) λ for all k [K] due to condition (b) in Assumption 3.6, and the last inequality uses Uk = Uk U T k 1 2 = Ck + U k U T k 1 2 Ck 1 2 + U k 2 1 2 + U k . For these off-diagonal blocks satisfying Ij Ik Ω, let Cj,k := Uj U T k U j U T k for all j = k [K]. Similar to (33), we have Cj,k F 2 F and U j Uj(U k Uk)T = Ir + U j Cj,k U T k . Substituting the SVDs U j Uj = PjΣj HT j and U k Uk = PkΣk HT k into the above equality yields Ir + U j Cj,k(U k )T = U j Uj(U k Uk)T = PjΣj HT j HkΣk P T k = Pj(Σj Ir)HT j Hk(Σk Ir)P T k + Pj(Σj Ir)HT j Hk P T k + Pj HT j Hk(Σk Ir)P T k + Pj HT j Hk P T k Note that Qk = Hk P T k for all k [K], and thus Pj HT j Hk P T k = QT j Qk. Then, we have Qj Qk F = Qj QT k Ir F = Pj HT j Hk P T k Ir F = U j Cj,k(U k )T Pj(Σj Ir)HT j Hk(Σk Ir)P T k Pj(Σj Ir)HT j Hk P T k Pj HT j Hk(Σk Ir)P T k F Σj Ir F Σk Ir F + Σj Ir F + Σk Ir F + U j Cj,k U T k F λ2 Cj F Ck F + 1 where the second inequality follows from (26). For each k [K], according to condition (d) in Assumption 3.6, there exists a path k0 k1 ks such that k0 = k, ks = 1 and Ikj 1 Ikj Ωfor all j [s]. Therefore, we bound the term Qk Q1 as follows: j [s] Qkj 1 Qkj F 2K where the last inequality follows from (35) and s K. Therefore, we obtain U U QT 1 2 F = k=1 Uk U k QT 1 2 F 2 Uk U k QT k 2 F + U k(Qk Q1)T 2 F λ + 2 1 2 + U )2 + 2K2 U 2 2 1 2 F + U 2 + K2 U 2 3 + 2 where the second inequality follows from (34) and (36), and the third inequality holds because U k(Qk Q1)T F U k Qk Q1 F . As a result, we get U U QT 1 F c0 λ F , where c0 = 1 2 F + U + K U 3 + 2 Finally, based on the fact that U U T = U QT 1 (U QT 1 )T , we have UU T U U T F = UU T U QT 1 (U QT 1 )T F = (U U QT 1 )(U U QT 1 )T + (U U QT 1 )(U QT 1 )T + U QT 1 (U U QT 1 )T F U U QT 1 F 2 U + U U QT 1 Matrix Completion with Re LU Sampling where the first inequality holds because of the triangle inequality and the second inequality holds because of (37). 
We complete the proof by letting λ F F . (39) A.2.2. PROOF OF THEOREM 3.9 Proof of Theorem 3.9. Recall that U Rn r is a global optimal solution of Problem (3). For ease of exposition, let e := (UU T M)Ω. Obviously, we have e F F . This, together with (17), yields 2F(U)[D, D] 1 2 (UDT + DU T )Ω 2 F DDT F e F 1 2 (UDT + DU T )Ω 2 F F D 2 F . (40) Next, we consider some direction D such that (UDT + DU T )Ω= 0. We claim that D = UV for some skew-symmetric matrix V Rr r. Indeed, for each i [2r], since the i-th diagonal part is fully observed, i.e., Ωi,i = [ni] [ni], we have Ui DT i + Di U T i = 0. (41) Using Weyl s inequality, we obtain σr(Ui) σr(U i ) Ui Qi U i F λ + γ + U 1 where the second inequality follows from the assumption that U T i U i λIr, (34), and F < γ/8 by (11), and the last inequality holds because of F λ λ + γ + U ) by (11). Therefore, the pseudo-inverse U i = (U T i Ui) 1U T i is well-defined for all i [2r]. Multiplying U T i from the right-hand side of (41) yields Di = Ui(U i Di)T = Ui Vi, where Vi = (U i Di)T . Moreover, multiplying b U i and b U T i from the left and right hand sides of (41) yields (U i Di)T + U i Di = 0. This implies Vi + V T i = 0, which means Vi is a skew matrix. Then, consider the (i, j)-th block of (UDT + DU T )Ω. Since Di = Ui Vi for any i [2r], it follows from (41) that 0 = Uj DT i + Dj U T i = Uj V T i U T i + Uj Vj U T i = Ui+1 V T i + Vi+1 U T i . Since Ui and Ui+1 are full-rank, we obtain V T i + Vi+1 = 0. Thus, we have Vi = Vi+1 for all i [2r 1], and thus Vi for all i [2r] are the same. For ease of exposition, let S = SU and S = S U. For any D Rn r, we decompose it as D = DS + DS , where DS = proj S(D), DS = proj S (D). (42) Define the constant bγ := (U b ET + b EU T )Ω 2 F , where b E = argmin E S , E F =1 (UET + EU T )Ω 2 F . (43) According to (40), we compute for any D S , 2F(U)[D, D] 1 2 (UDT S + DS U T )Ω 2 F F D 2 F bγ 2 DS 2 F F D 2 F = bγ We complete the proof by letting D = lift U(ξ) for any ξ T[U]M and showing that bγ γ/4. Then, the rest of the proof is devoted to proving bγ γ/4. Without loss of generality, we assume that Q1 = Ir in (37) and let U = U U , then we have U F c0 F /λ c0 F /λ where λ + γ + U + K U 3 + γ 1 2 F + U + K U 3 + 2 Matrix Completion with Re LU Sampling which holds because of the condition F γ 8 . Thus, we have bγ (U b ET + b EU T )Ω F (U b ET + b EU T )Ω F (U b ET + b EU T )Ω F 2 c0 where the first inequality holds because of the triangle inequality, and the second inequality follows from b E F = 1 and U F c0 F /λ. Recall that γ is defined w.r.t. S U = {D : U T D = DT U }, i.e., the subspace S U is the solution space of the linear system U T D = DT U . Even though b E may not fulfill the linear system requirements, due to the fact b E S U, we can bound U T b E b ET U F = U T b E b ET U F 2 c0 where the first equality holds because of b E S U, i.e., U T b E = b ET U. According to Hoffman s error bound for linear system (Hoffman, 2003; Pena et al., 2021; G uler et al., 1995), there exists some constant κ > 0, which only depends on the linear system of U , and some D S U such that D b E F κ U T b E b ET U F 2κ c0 Then, we have (U b ET + b EU T )Ω F (U DT + DU T )Ω F 2 U D b E F γ D F 2 U |D b E F γ ( γ + 2 U ) D b E F , where the first inequality holds because of the triangle inequality and the fact UD F U D F holds for any matrices U and D, the second inequality follows from the definition of γ, and the last one uses the triangle inequality and b E F = 1. 
Substituting this and (46) into (45) yields bγ (U b ET + b EU T )Ω F 2 c0 λ (1 + κ( γ + 2 U 2)) F . This, together with (11), implies ˆγ γ/2, which completes the proof. A.2.3. DISCUSSION ON THE RELATIONSHIP BETWEEN γ AND λ Note that λ is defined to satisfy U T k U k = P i Ik u i (u i )T λ. Under the setting of Gaussian matrix and noise, we could first compute the conditional expectation E U T k U k | U k Ck = X i Ik E[u i (u i )T | u i Ck] = |Ik|E[u i (u i )T | u i Ck]. Under the noiseless setting where Ck is an octant, a simple computation shows that E u i (u i )T | u i Ck = 1 2 where e Rr is the all-one vector. For the noisy setting, even though it is not easy to get a close form, we could still show that E[u i (u i )T | u i Ck] is positive definite whose least eigenvalue is independent of n and r, under the assumption r = O(log n). Besides, note that all u i , i Ik are still i.i.d. variables under the condition u i Ck. By applying the concentration theory, we could show that λ = Θ(|Ik|) Θ( n), where the last inequality holds because we already shows that |Ik| n/2 under the assumption r = O(log n). Regarding γ, we define the linear map ϕ : Rn r 7 Rn n with ϕ(D) = (U DT + DU T )Ω. The linear map ϕ can be vectorized by introducing a matrix A Rn2 nr, i.e. vec(ϕ(D)) = A vec(D) where A consist of the entries of U . Recall that γ = min ||ϕ(D)||F where D S , ||D||F = 1. Since we already show S is the kernel space of ϕ, we know γ is the smallest nonzero singular value of A. By carefully computation and applying the concentration inequality, one can also show γ = O( n). Then, according to Equation (7), Theorem 3.8 holds when F = O(n 1 4 ), in other words, we need the noise level δ = O(n 3 Matrix Completion with Re LU Sampling A.3. Proofs for Section 3.3: Verification of Assumptions under the Gaussian Distribution A.3.1. PROOF OF PROPOSITION 3.10 Proof of Proposition 3.10. Since u i i.i.d. N(0, Id) for all i [n] and Ck for each k [2r] is an orthant, we have P (u i Ck) = 1 2r , i [n], k [2r]. Note that nk = Pn i=1 1(u i Ck) denotes the number of u i belonging to orthant Ck. Applying the Bernstein inequality (see, e.g., (Vershynin, 2018, Theorem 2.8.4)) to nk, we obtain By choosing t = n/2r+1 n/2 due to r (log2 n)/2, we have for each k [2r], P nk n 2r+1 This, together with the union bound, yields 2 , k [2r] 1 2r exp n Since n/2 r and the entries of U is i.i.d. standard Gaussian random variables, Assumption 3.1 always holds. Now, we consider the near diagonal block Ωk+1,k for each k [2r 1]. Without loss of generality, we assume Ck = {(x1, , xr) : xi 0, i [r]} and Ck+1 = {(x1, , xr) : xi 0, i [r 1], xr 0}. Therefore, for each row u i Ck and u j Ck+1, sgn(u i ) differs with sgn(u j) only at the last entry. Let Dk consist of all the vectors x Ck satisfying xr max{x1, , xr 1}, i.e., Dk = {x Ck : xr max{x1, , xr 1}} . One can verify that the conditional probability P(u i Dk | u i Ck) 1 2, and thus P(u i Dk) = P(u i Dk | u i Ck) P(u i Ck) 1 2r+1 . By applying the concentration inequality, one can show that |Dk| n/4 with high probability. Then, for each u i Dk and any u j Ck+1, we have P u i , u j 0 | u i 1 Note that |Ωk+1,k| = X (i,j) Ωk+1,k 1(u i , u j 0) X j:u j Ck+1 1(u i , u j 0) Conditioned on any u i Dk, 1(u i , u j 0) for all i, j are independent of each other. Therefore, we can apply the concentration inequality to show that X j:u j Ck+1 1(u i , u j 0) n As a result, |Ωk+1,k| n/16 holds with high probability. A.3.2. 
PROOF OF PROPOSITION 3.11 For the noisy case, we need to first introduce the ϵ-net to proceed. Definition A.2 (Net and Covering Number). Let Sr 1 = {u Rr : u = 1} be the unit sphere in Rr and ϵ > 0 be a parameter. A subset Nϵ Sr 1 is said to be an ϵ-net of Sr 1 if for any point u Sr 1, there exists a point a Nϵ such that u a ϵ. The cardinality of the smallest ϵ-net is called the ϵ-covering number, denoted by N(Sr 1, ϵ). Matrix Completion with Re LU Sampling According to Szarek (1997), we know that, for any ϵ > 0, N(Sr 1, ϵ) 1 + 2 4 denote the 1 4-net Sr 1 with the smallest cardinality. Then, (48) implies that K = |N 1 4 | 9r. Let N 1 4 = {a1, a2, . . . , a M}, then we define u Rr : u T ak Since ak N 1 2 Sr 1, the inequality u T ak ( 2) u /4 means the angle between u and ai is less than 12. According to the definition of ϵ-net, we can show the following lemma. Lemma A.3. Rr = S Proof. For any u Rr \ {0} and u/ u Sr 1, there exists some ai N 1 4 such that ai u/ u 1 4, which implies This implies u Ci. We can now define the collection of index set as follows: Ik = {i [n] : u i Ck}, k [K], where u i is the i-th row of U . Lemma A.3 implies that the collection of index sets {I1, I2, . . . , IK} satisfies condition (a) in Assumption 3.6. Proof of Proposition 3.11. First, Lemma A.3 already implies condition (a). Next, for any k [K] and any (i, j) Ik Ik, we know that u i , u j Ck according to the definition of Ik. Owing to the property of Ck, we know that the angle between u i and ak is less than π/12 and so is the angle between u j and ak. This implies that the angle between u i and u j is less than π/6. This further implies u T i u j cos π u i u T j = 3 2 u i u T j . Then, we have mij = u T i u j +δij 0 due to < 1 2 mink [n] u k 2 2. It implies that (i, j) Ωfor all (i, j) Ik Ik, i.e., condition (b) holds. Next, since each row u i is an i.i.d. Gaussian vector, we have u i / u i 2 follows the uniform distribution on the unit sphere Sr 1. One can evaluate the probability that P(u i Ck) in the following way: the area of Sr 1 is equal to 2πr/2 Γ(r/2), while the area of Sr 1 Ck is bigger than volume of (r 1)-dimensional ball with radius equal to sin(π/12), which is π(r 1)/2 Γ((r+1)/2) sinr 1(π/12). This implies π(r 1)/2 Γ((r+1)/2) sinr 1(π/12) 2πr/2 Γ(r/2) = Γ(r/2) 2 πΓ((r + 1)/2) sinr 1(π/12) 1 πr sinr(π/12), where the last inequality holds because Γ((r + 1)/2) = r 1 2 Γ((r 1)/2) r 2Γ(r) holds for all positive integers. Since r log n/(4 log 3), we have P (u i Ck) 1 π(r 1) sinr 1(π/12) 4 log 3 log(sin(π/12)) 4 log 3 > 4 log 3 log n n 1/3. Matrix Completion with Re LU Sampling Similar to (47), one can easily show that, with probability at least 1 1/n, we have |Ik| n/2 for all k [K]. Moreover, since n/2 r and U is a matrix with i.i.d. Gaussian entries, it is easy to see that U T k U k λIr holds for some λ > 0. Finally, for any two Ck and Cl satisfying Ck Cl = and for any u i Ck and u j Cl, let u 0 Ck Cl, then the angle between u i and u 0 is less than π/6 and so is the angle between u j and u 0, which implies the angle between u i and u j is less than π 3 . Thus, we have u T i u j cos π u i u T j = 1 2 u i u T j . so mij = u T i u j + δij 0 given the condition that < mink [n] u k 2/2. It implies that (i, j) Ωfor all (i, j) Ik Il. For any k, l [K], we can find a path k0 k1 ks such that k0 = k, ks = l and Ckt Ckt+1 = . This complete the whole proof. B. 
Literature Review

There are three categories of work we review: deterministic sampling, probabilistic sampling that depends on entry values, and general matrix completion and factorization literature closely related to our work.

Deterministic sampling. Within deterministic sampling, there is work that must make incoherence or genericity assumptions on the underlying matrix, and work that removes almost all assumptions on the matrix but either needs to leverage other properties (i.e., PSD) or ignore problematic parts of the matrix. We will start with work that does not make assumptions on the underlying matrix. Possibly most similar to our work is (Bishop & Yu, 2014), where the authors consider deterministic sampling of PSD matrices. They require sampling of principal submatrices, which is very similar to the assumptions we have made, since under ReLU sampling we observe the diagonal, and under the partition/permutation of (4), we also observe the diagonal blocks. To the best of our knowledge, neither the assumptions in our paper nor those in (Bishop & Yu, 2014) imply the other. More specifically, Bishop & Yu (2014) consider deterministic sampling and assume that a collection of subsets of Ω admits an ordering such that any two adjacent parts have enough overlap. However, in the ReLU sampling setting, this ordering is hard to construct, and it is not clear whether such an ordered collection of subsets exists. Moreover, both the analysis and the algorithm proposed in (Bishop & Yu, 2014) rely heavily on these ordered subsets, while we only use our partition for analysis. They design an algorithm for completion based on their theoretical guarantees, whereas our work simply uses the well-known gradient descent algorithm for completion. Additionally, to the best of our knowledge, it is not possible to extend their approach to non-PSD matrices. Our approach also relies heavily on the synergy of PSD matrices and ReLU sampling, but we believe that the generalization we provide in Assumption 3.6, and further generalizations thereof, can break this dependency.

The work in (Mazumdar & Rawat, 2018) studies the ReLU recovery problem in the context of neural network parameters. They provide theoretical guarantees for estimating a low-rank matrix observed after adding a bias vector and passing through a ReLU. They formulate a likelihood based on the distribution of the bias vector and show that the maximizer of the likelihood is close to the planted low-rank matrix. However, it is not clear how to solve their proposed likelihood. Their work was inspired in part by the related work in (Soltanolkotabi, 2017), which uses gradient descent to estimate vectors that are observed through a known linear operator and then passed through a ReLU, and provides high-probability finite-sample guarantees on the accuracy of the estimates when the known linear measurement operator is random. The matrix completion problem in (Ganti et al., 2015) seeks to estimate a partially observed matrix, which is a low-rank matrix observed through an unknown entry-wise monotonic and Lipschitz nonlinearity (like the ReLU). They estimate both the nonlinearity and the low-rank matrix in an alternating fashion. Their guarantees are for estimating the matrix entries after the nonlinearity, as opposed to the latent low-rank matrix entries directly. (Foucart et al., 2020) studies the non-symmetric matrix completion problem when the observation pattern is deterministic.
It introduces a methodology for deriving a weighting matrix tailored to the specific sampling pattern presented. They then provide an efficient initialization scheme for the matrix completion problem by making good use of the weighting matrix. Their recovery guarantees are with respect to this weighting matrix, so if parts of the true matrix are not recoverable, the corresponding part of the weight matrix will be zero. This allows them to avoid assumptions on the underlying matrix (other than standard assumptions on the maximum value of the matrix). Other works in this direction include (Chatterjee, 2020), which proves necessary asymptotic properties of the sampling patterns for a notion of stable recovery when no assumptions are made on the underlying matrix. The requirements are strong and general, and so the results are somewhat limited; in particular, they imply that no sparse deterministic pattern can guarantee stable recovery.

For works that make assumptions on the underlying matrix, (Pimentel-Alarcón et al., 2016) assumes the matrix columns (or rows) are in general position and gives nearly necessary conditions on the deterministic pattern. (Király et al., 2015) also assumes genericity and makes stronger assumptions. (Liu et al., 2017; 2019) develops a different assumption related to the spark of the sensing matrix or observation pattern and shows that the planted matrix uniquely fits the observed entries. (Shapiro et al., 2018) develops conditions for a low-rank solution to be locally unique, i.e., the unique solution in a neighborhood around that point, and makes several interesting connections to the above literature. Finally, (Singer & Cucuringu, 2010) makes a connection to rigidity theory and provides a method to determine whether a unique completion of a given partially observed matrix at a given rank is possible.

Probabilistic entry-dependent sampling. There is a line of work in matrix completion that considers arbitrary sampling distributions (Foygel et al., 2011; Shamir & Shalev-Shwartz, 2014) and seeks to bound the expected loss, where the expectation is taken with respect to the sampling distribution. Therefore, distinct from our work, if an entry is observed with probability zero, their results provide no guarantees for recovery. In the classical statistics literature, one models the data as a random variable and the missingness mechanism as random, represented by a binary random variable that is potentially dependent on the data. In this context, a commonly used taxonomy of missing data mechanisms follows (Little & Rubin, 2019), which defines MCAR, MAR, and MNAR: (i) MCAR, missing completely at random, where the missing data variable is independent of the data; e.g., entries missing uniformly at random in the classical matrix completion literature fit into this category; (ii) MAR, missing at random, where the missing data variable is independent of the unobserved data when conditioned on that which is observed (data and/or covariates); and (iii) MNAR, missing not at random, where the missing data variable depends on that which is unobserved, even conditioned on that which is observed. Using this probabilistic setup and terminology, (Hernández-Lobato et al., 2014) proposes a Bayesian approach to jointly infer a complete data factorization model and a missing data mechanism, also modeled with matrix factorization. They also provide approximate inference methods for the resulting intractable Bayesian inference problem.
(Sportisse et al., 2020a) considers identifiability of PPCA with MNAR data and provides an estimation algorithm for the principal components in this setting. (Sportisse et al., 2020b) proposes interesting EM-style methods to estimate the joint distribution of the missing data mechanism and the data. (Ma & Zhang, 2021) identifies novel sufficient conditions such that the ground-truth parameters of a parametric data model are identifiable, and maximum likelihood identifies them uniquely, in the MNAR setting. Other papers focused on the MNAR setting include (Yang et al., 2021; Jin et al., 2022; Agarwal et al., 2023). (Sengupta et al., 2023) compares matrix factorization to multiple imputation, the most popular framework in biostatistics and the social sciences for imputing missing entries in the MAR setting. An empirical work that includes ReLU sampling among other MNAR sampling schemes is found in (Naik et al., 2022). The authors conclude that convex methods generally do not work as well as nonconvex methods in the dependent-sampling setting. Closely related to our work is also (Bhattacharya & Chatterjee, 2022), which assumes that each entry of a low-rank matrix M∗ is observed with some probability f(M∗), where f is applied entry-wise and is essentially Lipschitz and non-zero. The authors provide modified singular value thresholding and nuclear norm estimators for this setting and prove consistency in the high-dimensional regime, where the dimensions of M∗ grow but the rank remains uniformly bounded. A similar estimator is proposed in (Ma & Chen, 2019), and under a low-rank assumption on the matrix of observation probabilities, they prove error bounds for estimating the matrix of observation probabilities as well as the underlying matrix.

More general factorization and completion literature. The work in (Saul, 2022) is highly related to our work, though it is not a matrix completion problem per se. This paper seeks a non-negative low-rank matrix X such that the very sparse observed matrix satisfies M = f(X), where f is a given nonlinearity such as the ReLU. They provide an EM algorithm for learning X but no theoretical guarantees for when this low-rank approximation exists. The work in (Seraghiti et al., 2023) studies the same problem and provides an alternating block coordinate descent approach. Finally, the work in (Ongie et al., 2020) studies matrix completion with uniformly missing entries, but the underlying matrix columns are points on a low-dimensional nonlinear variety. The algorithm for completion lifts the partially observed data using a polynomial lifting, which creates a non-uniform sampling structure in the lifted space: an entire deterministic set of entries in the lifted space must be missing if one entry is missing in the original space. The authors show that it is still possible to perform completion with this structured missing pattern under certain genericity conditions.
Table 2. Comparison of the norm of the gradient (α_T), the function value (β_T), and the completion error (γ_T) returned by GD with different initializations.

σ = 0:
  Our init:  α_T = (9.7 ± 0.16) × 10^-7,  β_T = (3.3 ± 0.36) × 10^-14,  γ_T = (7.6 ± 0.7) × 10^-11
  RI init:   α_T = (6.8 ± 0.3) × 10^-3,   β_T = (7.9 ± 2.1) × 10^3,     γ_T = 0.5 ± 0.13
  RS init:   α_T = 0.1 ± 0.45,            β_T = (7.1 ± 3.8) × 10^3,     γ_T = 0.45 ± 0.24

σ = 10^-4:
  Our init:  α_T = (9.7 ± 0.14) × 10^-7,  β_T = (1.8 ± 0.03) × 10^-4,   γ_T = (4.4 ± 0.13) × 10^-5
  RI init:   α_T = (0.39 ± 1.2) × 10^-5,  β_T = (8.4 ± 2.1) × 10^3,     γ_T = 0.51 ± 0.13
  RS init:   α_T = (1.5 ± 6.5) × 10^-3,   β_T = (8.4 ± 2.0) × 10^3,     γ_T = 0.51 ± 0.13

σ = 10^-2:
  Our init:  α_T = (9.7 ± 0.17) × 10^-7,  β_T = 1.8 ± 0.03,             γ_T = (4.4 ± 0.22) × 10^-3
  RI init:   α_T = (8.1 ± 0.35) × 10^-5,  β_T = (7.3 ± 3.3) × 10^3,     γ_T = 0.46 ± 0.2
  RS init:   α_T = (6.4 ± 0.19) × 10^-5,  β_T = (7.9 ± 3.6) × 10^3,     γ_T = 0.418 ± 0.21

Figure 8. Completion error and CPU time of the tested methods for MC with ReLU sampling. (a) Noiseless case: σ = 0; (b) noisy case: σ = 10^-2.

C. Experimental Setups and Results

C.1. Additional results and figures

In this section, we provide the full version of Table 1, which includes the noise level σ = 10^-2; see Table 2. We also provide an additional figure along the lines of Figure 6, also including the noise level σ = 10^-2. In Figure 8, we see that all the algorithms perform very similarly in terms of completion accuracy with σ = 10^-2, and their computation time is similar to that for the noiseless case.

C.2. Experimental Setup for Section 4.3

In this subsection, we give details of the methods studied in Section 4.3.

ADMM for the convex formulation of MC. Since the noisy model (1) is considered in this work, we study the following convex formulation, as studied in (Candes & Plan, 2010), to complete the missing entries:

min_{X ∈ R^{n×n}} ‖X‖_*  s.t.  ‖X_Ω − M_Ω‖_F^2 ≤ δ.  (49)

In our implementation, we simply set δ to the squared Frobenius norm of the noise matrix in (1). It is obvious that the classic formulation in (Recht, 2011) is a special case of the above formulation with δ = 0 in the noiseless case. One can efficiently solve this convex problem using ADMM. By introducing an auxiliary variable Y, we first rewrite Problem (49) as

min_{X, Y ∈ R^{n×n}} ‖Y‖_*  s.t.  X = Y,  ‖X_Ω − M_Ω‖_F^2 ≤ δ.  (50)

By introducing a dual variable Λ ∈ R^{n×n}, one can write the augmented Lagrangian as

L(X, Y; Λ) = ‖Y‖_* + δ_𝒳(X) − ⟨Λ, X − Y⟩ + (ρ/2) ‖X − Y‖_F^2,  (51)

where δ_𝒳(·) is the indicator function of the set 𝒳 = {X ∈ R^{n×n} : ‖X_Ω − M_Ω‖_F^2 ≤ δ}. In the implementation, we randomly generate an initial point (Y^(0), Λ^(0)) whose entries are i.i.d. samples from the standard normal distribution. Given the current iterate (X^(t), Y^(t), Λ^(t)), ADMM generates the next iterate as follows.

To update X, we compute X^(t+1) = argmin_X L(X, Y^(t); Λ^(t)). Letting W^(t) = Y^(t) + Λ^(t)/ρ, this admits the closed-form solution

X^(t+1) = W^(t) if ‖W^(t)_Ω − M_Ω‖_F^2 ≤ δ; otherwise, X^(t+1)_{Ω^c} = W^(t)_{Ω^c} and X^(t+1)_Ω = (W^(t)_Ω + τ_t M_Ω)/(1 + τ_t), where τ_t = (‖W^(t)_Ω − M_Ω‖_F^2/δ)^{1/2} − 1.

To update Y, we compute

Y^(t+1) = argmin_Y L(X^(t+1), Y; Λ^(t)) = argmin_Y (1/ρ) ‖Y‖_* + (1/2) ‖Y − X^(t+1) + Λ^(t)/ρ‖_F^2.

This is the proximal mapping of the nuclear norm, which admits a closed-form solution via singular value soft-thresholding.

To update Λ, we set

Λ^(t+1) = Λ^(t) − ρ (X^(t+1) − Y^(t+1)).  (52)

We terminate the algorithm when ‖X^(t) − Y^(t)‖_F ≤ 10^-4 for some t ≥ 0.
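For concreteness, the following is a minimal NumPy sketch of the ADMM updates above. The function and variable names (admm_mc, svt, Omega, rho) and the default parameter values are our illustrative choices, not the exact settings of our experiments; the sketch is meant only to make the closed-form updates explicit.

```python
import numpy as np

def svt(A, tau):
    # Singular value soft-thresholding: the proximal mapping of tau * ||.||_*.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def admm_mc(M, Omega, delta, rho=1.0, max_iter=2000, tol=1e-4, seed=0):
    """ADMM for min ||X||_* s.t. ||X_Omega - M_Omega||_F^2 <= delta.

    M is the n x n observed matrix (entries outside Omega are ignored) and
    Omega is a boolean mask of the observed entries.
    """
    n = M.shape[0]
    rng = np.random.default_rng(seed)
    Y = rng.standard_normal((n, n))
    Lam = rng.standard_normal((n, n))
    X = Y.copy()
    for _ in range(max_iter):
        # X-update: project W = Y + Lam/rho onto {X : ||X_Omega - M_Omega||_F^2 <= delta}.
        W = Y + Lam / rho
        X = W.copy()
        res = np.linalg.norm(W[Omega] - M[Omega])
        if res ** 2 > delta:
            if delta > 0:
                tau_t = res / np.sqrt(delta) - 1.0
                X[Omega] = (W[Omega] + tau_t * M[Omega]) / (1.0 + tau_t)
            else:
                # Noiseless case (delta = 0): the projection resets the observed entries.
                X[Omega] = M[Omega]
        # Y-update: proximal mapping of the nuclear norm with threshold 1/rho.
        Y = svt(X - Lam / rho, 1.0 / rho)
        # Dual update.
        Lam = Lam - rho * (X - Y)
        if np.linalg.norm(X - Y) <= tol:
            break
    return X
```

Under ReLU sampling, the mask would simply be Omega = (M > 0), and a call such as X_hat = admm_mc(M, Omega, delta) returns the completed matrix.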
Scaled GD for the formulation in (Tong et al., 2021). Following (Tong et al., 2021), one can solve the following formulation for MC:

min_{L, R ∈ R^{n×r}} H(L, R) = (1/4) ‖(LR^T)_Ω − M_Ω‖_F^2.  (53)

Here, we directly use their MATLAB code, downloaded from https://github.com/Titan-Tong/ScaledGD, to implement Scaled GD for solving the above problem. Notably, we employ our tailor-designed initialization from Section 4.1 to initialize Scaled GD, which demonstrates great performance in our experiments. We terminate the algorithm when ‖∇H(L^(t), R^(t))‖_F ≤ 10^-4 for some t ≥ 0.

PAM for the formulation in (Saul, 2022). To complete sparse non-negative matrices with low-dimensional structure, Saul (2022) studied the following formulation:

min_{X, Θ ∈ R^{n×n}} ‖X − Θ‖_F^2  s.t.  rank(Θ) = r,  X_Ω = M_Ω,  X_{Ω^c} ≤ 0.  (54)

Since the variables X and Θ are separable in this problem, we can directly apply proximal alternating minimization (PAM) to solve it. In the implementation, we randomly generate an initial point (X^(0), Θ^(0)) whose entries are i.i.d. samples from the standard normal distribution. Given the current iterate (X^(t), Θ^(t)), PAM generates the next iterate as follows:

X^(t+1) ∈ argmin_X ‖X − Θ^(t)‖_F^2 + (α/2) ‖X − X^(t)‖_F^2  s.t.  X_Ω = M_Ω,  X_{Ω^c} ≤ 0,
Θ^(t+1) ∈ argmin_Θ ‖X^(t+1) − Θ‖_F^2 + (β/2) ‖Θ − Θ^(t)‖_F^2  s.t.  rank(Θ) = r.

Obviously, both subproblems admit closed-form solutions: the X-subproblem is separable and reduces to an entry-wise projection onto the constraints, while the Θ-subproblem is solved by a truncated SVD of a weighted average of X^(t+1) and Θ^(t). We terminate the algorithm when ‖X^(t+1) − X^(t)‖_F + ‖Θ^(t+1) − Θ^(t)‖_F ≤ 10^-4 for some t ≥ 0.

Figure 9. Matrix completion for Euclidean distance matrices where only a fraction of the smallest entries is observed (x-axis: percentage of smallest distances observed, from 0.1 to 0.7; y-axis: normalized final matrix error; curves for rank 10, 25, and 40). The line shows the median error, and the bars show the 25% and 75% error quantiles over 100 trials.

Momentum PAM for the NMD formulation in (Seraghiti et al., 2023). Noting that Θ in Problem (54) has a low-rank structure, one can naturally reformulate this problem as the following nonlinear matrix decomposition (NMD) problem:

min_{X ∈ R^{n×n}, W, H ∈ R^{n×r}} ‖X − WH^T‖_F^2  s.t.  X_Ω = M_Ω,  X_{Ω^c} ≤ 0.

In particular, Seraghiti et al. (2023) proposed a momentum PAM method for solving this problem. Here, we directly use their MATLAB code, downloaded from https://gitlab.com/ngillis/ReLU-NMD, to implement momentum PAM. We terminate the algorithm when the relative error ‖(X^(t) − W^(t)(H^(t))^T)_Ω‖_F / ‖X^(t)‖_F ≤ 10^-4.

Gauss-Newton matrix recovery (GNMR) for low-rank MC (Zilber & Nadler, 2022). Zilber & Nadler (2022) employed a Gauss-Newton linearization to design the GNMR method for solving the nonconvex formulation of MC. Here, we directly use the MATLAB code in (Naik et al., 2022) to implement this method for MC with ReLU sampling.

C.3. Euclidean Distance Matrix Completion

Centered Euclidean distance matrices are low-rank and PSD. In applications, we may only have access to a small number of pairwise distances between points, and we would like to complete the Euclidean distance matrix to know all pairwise distances. In this section, we experiment with the scenario where we observe only the smallest distances in the matrix. We generate Euclidean distance matrices synthetically, with dimension 200, and our observation is a fraction of the smallest entries. As shown in Figure 9, depending on the rank and the fraction of observed entries, Algorithm 1 recovers the matrix exactly. Note that after centering, 50% of the entries are negative, whereas for a real Euclidean distance matrix none of the entries are negative. In a real data setting, we would want to design an algorithm to approximately center the matrix given only the observed entries.
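To make the setup of this experiment concrete, the following is a minimal sketch, in the same NumPy style as above, of how such an instance can be generated: random points in R^r, double-centering of the squared-distance matrix into a rank-r PSD target, and a mask keeping a prescribed fraction of the smallest distances. The function name, the Gaussian point cloud, and the quantile-based thresholding are our illustrative choices and not necessarily the exact data-generation procedure behind Figure 9.

```python
import numpy as np

def edm_instance(n=200, r=10, frac=0.4, seed=0):
    """Generate a centered EDM completion instance with threshold-based sampling.

    Returns the rank-r PSD target G (the double-centered squared-distance
    matrix) and a boolean mask Omega keeping roughly a fraction `frac` of the
    smallest squared distances (the zero diagonal is always observed).
    """
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((n, r))                 # n points in R^r
    sq = np.sum(P ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * P @ P.T   # squared pairwise distances
    J = np.eye(n) - np.ones((n, n)) / n             # centering matrix
    G = -0.5 * J @ D @ J                            # PSD of rank r: Gram matrix of the centered points
    thresh = np.quantile(D, frac)                   # observe only the smallest distances
    Omega = D <= thresh
    return G, Omega
```

The resulting pair (G, Omega) can then be passed, with the known rank r, to the same GD procedure (Algorithm 1 with the tailored initialization) used in the rest of our experiments.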