# semirandom_matrix_completion_via_flowbased_adaptive_reweighting__4444d728.pdf Semi-Random Matrix Completion via Flow-Based Adaptive Reweighting Jonathan A. Kelner Jerry Li Allen Liu Aaron Sidford Kevin Tian We consider the well-studied problem of completing a rank-r, µ-incoherent matrix M Rd d from incomplete observations. We focus on this problem in the semirandom setting where each entry is independently revealed with probability at least p = poly(r,µ,log d) d . Whereas multiple nearly-linear time algorithms have been established in the more specialized fully-random setting where each entry is revealed with probablity exactly p, the only known nearly-linear time algorithm in the semi-random setting is due to [17], whose sample complexity has a polynomial dependence on the inverse accuracy and condition number and thus cannot achieve high-accuracy recovery. Our main result is the first high-accuracy nearly-linear time algorithm for solving semi-random matrix completion, and an extension to the noisy observation setting. Our result builds upon the recent short-flat decomposition framework of [42, 43] and leverages fast algorithms for flow problems on graphs to solve adaptive reweighting subproblems efficiently. 1 Introduction How can we ensure learning algorithms do not overfit to generative assumptions on their input data? Since worst-case statistical inference problems are often intractable (i.e., without distributional assumptions), to develop efficient algorithms, we often introduce a generative model as a proxy for average-case, real-world behavior. However, if these algorithms are to be useful practically, we must ensure they do not depend too strongly on any artificial properties of the generative model. Ideally, they should require only what is statistically and algorithmically necessary, and no more. In this paper, we consider this question in the context of matrix completion, one of the most fundamental and well-studied linear inverse problems. In matrix completion, the goal is to recover a symmetric rank-r matrix M Rd d where r d,6 given a small (typically sublinear) number of entries revealed independently and at random. In a seminal line of work [13, 11, 12, 61], it was shown that under natural structural assumptions on M, one can fully recover M given dr randomly revealed entries in polynomial time, via semidefinite programming (SDP). Subsequently, there has been extensive work attempting to improve upon statistical and computational guarantees for matrix completion. In an oversimplification of the literature (see Section 1.3 for a more detailed and nuanced discussion), this has resulted in two categories of matrix completion algorithms. SDP-based algorithms achieve the nearly-optimal sample complexity, but are based on solving SDPs, which requires time Ω(d3.5) even with state of the art solvers [39, 34]. First-order iterative methods require a slightly higher sample complexity, namely, they require m pd2 revealed entries for an observation probability p = Ω(poly(r, log d)) 1 d. However, MIT, kelner@mit.edu. Microsoft Research, jerrl@microsoft.com. MIT, cliu568@mit.edu. Stanford University, sidford@stanford.edu. University of Texas at Austin, kjtian@cs.utexas.edu. 6Our results also hold for asymmetric, rectangular matrices. See the discussion at the end of Appendix A. 38th Conference on Neural Information Processing Systems (Neur IPS 2024). they use faster optimization primitives and can be implemented in time e O(m poly(r, log d)). 
Throughout the introduction, we will somewhat informally refer to such algorithms as fast.7 Because matrix completion is particularly interesting to study when the target rank r is substantially smaller than the dimension d (as information-theoretically, we require dr observations), it may appear that fast algorithms are able to nearly match the statistical guarantees of the SDP-based methods, while offering substantially lower runtimes. In light of this, naïvely one might expect that first-order iterative methods ought to be always preferred over the alternative. However, it has been noted that in many real-world situations, these first-order iterative methods actually fail to perform reliably, see e.g. [66, 70, 26, 68]. An oft-cited reason for this failure is exactly the tendency for these sorts of methods to demonstrate the type of generative overfitting described above [66, 70, 4]. Indeed, in real-world applications, the distribution of revealed entries is usually very far from uniformly random [4], which can cause iterative methods that rely too heavily on this assumption to fail dramatically. This begs the following natural question: Can we design fast algorithms for matrix completion that are robust to generative overfitting? In light of this discussion, we consider a semi-random variant of matrix completion, introduced in [17]. Recall that in standard matrix completion, we assume every entry (i, j) [d] [d] is revealed independently with probability p. In the p-semi-random model (Definition 2), we only know p (0, 1) such that every entry is revealed independently with an unknown probability pij p. Given that real-world data is unlikely to be generated exactly uniformly, it is reasonable to believe that the semi-random model is a more faithful model of matrix completion in practice. In this model, the algorithm is given strictly more information than in the standard (uniform) matrix completion setup. This is because the semi-random model is equivalent to the setting where we are first given a standard matrix completion instance, and then additional (but potentially adversarially chosen) entries are revealed.8 It is straightforward to show that the SDP-based approaches generalize to succeed in the semi-random setting, since the adversary s revealed entries just translate to additional constraints in the convex program satisfied by the ground truth. However, as demonstrated in [17], Appendix A, previously known first-order iterative methods can fail to converge in this semi-random setting. We view this as strong theoretical evidence that such algorithms exhibit large amounts of generative overfitting. We can now more concretely rephrase the previous question: Can we design fast algorithms for semi-random matrix completion? This paper focuses on the theoretical complexity of the problem; the relationship of the techniques in this work to efficient matrix completion in practice is an interesting direction for future research. Precisely, we seek to design an algorithm for semi-random matrix completion which works for the same (or qualitatively similar) p as in the standard setting, and which runs in time e O(m poly(r, log d))? Here m denotes the total number of revealed entries, a proxy for the input size; note that in the semi-random setting, it could be that m pd2. The main result of [17] is an efficient algorithm for semi-random matrix completion, albeit with some important caveats, as we discuss shortly. 
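To make the observation model concrete, here is a minimal simulation sketch of the p-semi-random reveal process (hypothetical helper names; the formal model is Definition 2): the adversary may raise individual reveal probabilities above p, but never lower them below p.

```python
import numpy as np

def observe_semirandom(M, p, p_adv=None, seed=0):
    """Reveal entry (i, j) independently with probability p_ij >= p.
    p_adv is an optional adversarially chosen probability matrix,
    clipped below at p so the semi-random guarantee p_ij >= p holds."""
    rng = np.random.default_rng(seed)
    d = M.shape[0]
    P = np.full((d, d), float(p)) if p_adv is None else np.maximum(p_adv, p)
    mask = rng.random((d, d)) < P
    return {(i, j): M[i, j] for i, j in zip(*np.nonzero(mask))}

# Fully-random completion corresponds to p_adv = None; a semi-random
# adversary can, e.g., additionally reveal an entire block of entries.
d, p = 100, 0.2
M = np.outer(np.arange(d, dtype=float), np.arange(d, dtype=float))  # rank-1 toy target
p_adv = np.full((d, d), p)
p_adv[:10, :10] = 1.0
observations = observe_semirandom(M, p, p_adv)
```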
Roughly speaking, for ϵ (0, 1), [17] gave an algorithm which, given observations of rank-r M with condition number κ from the p-semi-random model for p = 1 d e O(poly(r, κ, 1 ϵ )), outputs M so that M M 2 F ϵ M 2 F with high probability. Moreover, it runs in nearly-linear time e O(m poly(r, κ, 1 ϵ )), where m is the number of revealed entries. While this is a promising first step towards matrix completion in the semi-random model, it leaves several important questions unanswered. High-accuracy recovery. Both the sample complexity and runtime of the algorithm in [17] depend polynomially on ϵ 1. In particular, the algorithm cannot recover the matrix M to ϵ 1 = poly(d) levels of precision in nearly-linear time. Thus, even in the standard setting where we assume that the bit complexity of the entries of M are polynomially bounded, [17] cannot achieve exact 7Here, and throughout the paper, we use the notation e O(f) := O(f log O(1) f) and similarly define eΩ. 8As clarified in Definition 2, it is important in this equivalence that the additionally revealed entries are not adaptively chosen, i.e., do not depend on which random entries were observed. This manifests in our analysis because an adaptive adversary causes significant difficulties with independent sample splitting (see Lemma 2). There is precedent from the literature on fully-random matrix completion that the inability to sample split causes analysis issues (see, e.g., Section 3 of [61]); extending to an adaptive adversary is a challenging open direction. recovery of the matrix in nearly-linear time. Unfortunately, this dependence seems inherent to the techniques therein (see the discussion in Section 1.2). Condition number dependence. Another important downside of the [17] algorithm s guarantee is its polynomial dependences on κ in both the sample complexity and runtime. No dependence on κ is required information-theoretically, or by the SDP-based methods. This aspect also seems somewhat inherent to the non-convex iterative method used in [17]. Noise tolerance. Finally, the guarantees in [17] require that M is exactly low-rank, i.e. it does not generalize to handle noise in observations. In practice, this is unlikely, and guarantees should ideally hold even when we only assume that the observed M is close to low-rank. 1.1 Our results Our main result is a new fast algorithm for semi-random matrix completion that addresses all three of the aforementioned issues. Specifically, we show the following main result. Theorem 1 (informal, see Theorem 2). Let M Rd d be a symmetric, rank-r, µ-incoherent matrix, let symmetric N Rd d satisfy N , and let c M = M + N. Let ϵ (0, 1). There is an algorithm that takes observations from c M in the p-semi-random model for p = Q d , where Q = poly(r, µ, log( d ϵ )), and outputs M as a rank-Q factorization such that with high probability, M M Q ( + ϵ M ) . The algorithm runs in nearly-linear time O(m Q) where m is the number of revealed entries. Here, refers to the elementwise ℓ norm of a matrix, and F refers to the Frobenius norm. We pause to comment on Theorem 2. First, in the noiseless setting, i.e., when = 0, our algorithm has a polylogarithmic dependence on ϵ 1, and thus achieves high-accuracy recovery in nearly-linear time. In particular, as long as the bit complexity of the underlying matrix is polynomially bounded, we achieve exact recovery in nearly-linear time. Moreover, our rate has no dependence on the condition number κ. 
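In display form, the Theorem 1 guarantee reads roughly as follows, writing $\Delta$ for the entrywise noise bound $\|\mathbf{N}\|_\infty \le \Delta$ (this is a paraphrase of the informal statement above; the formal version is Theorem 2):

$$
p \ge \frac{Q}{d}, \quad Q = \mathrm{poly}\!\left(r, \mu, \log\frac{d}{\epsilon}\right) \;\implies\; \left\|\mathbf{M} - \mathbf{M}^\star\right\|_\infty \le Q\left(\Delta + \epsilon\,\|\mathbf{M}^\star\|_\infty\right) \text{ with high probability,}
$$

in time $\widetilde{O}(m \cdot Q)$, where $m$ is the number of revealed entries.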
In both of these aspects, our result qualitatively improves upon [17], while retaining a nearly-linear runtime and e O(d poly(r)) sample complexity. We also remark that the incoherence assumption in Theorem 2 (or a less standard variant thereof), which is formally defined in Definition 1, is used in all low-rank matrix completion algorithms we are aware of, as the problem is intractable without subspace regularity assumptions (see, e.g., discussion in [43]). Finally, we observe that our guarantee applies to the noisy setting, i.e., > 0. To our knowledge, this is the first such guarantee in the literature for semi-random matrix completion, even for SDP-based methods. We achieve guarantees in terms of the ℓ norm of the noise matrix N; moreover, the overhead is a polynomial of only r log( d ϵ ), a preferable guarantee to the poly(d) overheads known to be achievable in the standard, fully random setting by semidefinite programming [11]. We do note that in the standard matrix completion setting, error guarantees are typically phrased in terms of the Frobenius norm of the noise N [11, 43]. However, in the semi-random setting, some measure of the element-wise noise level is unavoidable, because the semi-random noise could be chosen to only reveal unusually large entries of the noise matrix. We believe it is an interesting open question whether or not one can achieve somewhat more global measures of error such as the Frobenius norm of large entries in the revealed noise, which are more comparable to guarantees recently attained for semi-random sparse linear regression [42]. 1.2 Our techniques We achieve these results via a fundamentally different approach than the prior work of [17], detailed in Section 2. At a (very) high level, [17] relies on finding a single reweighting of the revealed entries of M that is good in a global sense. This roughly corresponds to finding a mask of the revealed entries, so that the masked matrix is guaranteed to be a good spectral approximation of the ground truth matrix. Given such a mask, [17] then shows that a non-convex optimization method will recover the true matrix, with high probability. While this is a very natural approach, this also directly results in a poly( 1 ϵ ) dependence in the runtime and sample complexity of the algorithm. This is because the quality of the spectral approximation directly goes into the final accuracy of their overall algorithm; ultimately, they need a (1 + ϵ)-multiplicative spectral approximation to the ground truth. However, all known techniques for finding such a high-quality spectral approximation require a poly( 1 ϵ ) dependence in both sample complexity and runtime. We instead devise an iterative method that is guaranteed to make progress, assuming that the iterate satisfies a local progress criterion. We then find a local reweighting of the matrix which allows us to find a step which satisfies this criterion. While this local progress criterion is much more involved to state, and this reweighting must now be done every iteration (rather than just once at the beginning of the algorithm), this framework has an important advantage over the global criterion used previously. Critically, now each step of our method need only seek a reweighting which achieves a constant factor of relative progress to the target matrix. Thus, rather than needing a (1 + ϵ)-multiplicative approximation at every step, we only need a constant multiplicative approximation to make progress. 
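As a caricature of this point, the sketch below (hypothetical names, not the paper's Algorithm 8) shows why constant-factor progress per iteration suffices for a polylogarithmic dependence on $1/\epsilon$: roughly $\log(1/\epsilon)$ iterations reach accuracy $\epsilon$.

```python
import math

def outer_loop(initial_error, eps, step):
    """step(err) -> new_err is assumed to satisfy new_err <= 0.5 * err,
    the constant-factor progress certified at each iteration."""
    err, iters = initial_error, 0
    while err > eps:
        err = step(err)
        iters += 1
    return err, iters

# Halving the error each iteration: ~log2(1/eps) iterations total.
err, iters = outer_loop(1.0, 1e-9, lambda e: 0.5 * e)
assert iters == math.ceil(math.log2(1e9))
```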
This is the main insight which allows us to achieve polylogarithmic rates in 1 Our algorithm builds upon the short-flat decomposition-based approach to designing iterative methods for linear inverse problems introduced in [42], and specialized in [43] to matrix completion in the standard setting. However, there are significant technical challenges in adapting this framework to semi-random models. At a high level, the matrix completion problem is not well-conditioned, so a crucial step in the [43] algorithm is dropping rows and columns of a difference matrix (between an iterate and the target) which are estimated as too large. Random observations (held out via sample splitting) are then used in matrix concentration inequalities to estimate the difference matrix on the remaining indices, and dropped indices are later recovered. Unfortunately, all of these steps crucially rely on independent sampling and hence break in the presence of semi-random noise. We instead take a fresh look at what certificates of progress are possible in the semi-random setting. Our key technical contribution, Algorithm 2, is a subroutine which makes bicriteria progress: it either finds an adaptive reweighting which accurately estimates the difference on a large submatrix after denoising, or certifies that a few rows and columns are responsible for much of the difference and hence can be dropped. To achieve our nearly-linear runtime, we crucially exploit graphical structure of feasible reweightings present in the randomly-revealed entries, and use flow-based optimization algorithms to solve our reweighting subproblem. 1.3 Related work Here, we review related work on matrix completion and semi-random models. These lines of research are vast and we focus on the research most relevant to our setting. For a more comprehensive review, we refer the interested reader to surveys on these topics, e.g., [41, 60, 59, 63]. Matrix completion. Matrix completion was first introduced in [62] in the context of collaborative filtering, e.g., for the Netflix Prize [6]. Since then, it has found applications in various areas, such as signal processing [46, 64, 57], social network analysis [47], causal inference [4], and most recently, AI alignment [1]. In a seminal line of work, [13, 11, 61, 12, 16] demonstrated that SDPs based on nuclear norm minimization can solve matrix completion with sublinear sample complexity, and in polynomial time. Another line of work focuses on first-order iterative methods for matrix completion, see e.g., [44, 31, 32, 33, 36, 74, 67, 18, 72, 30]. These works demonstrate that optimization frameworks, such as alternating minimization, can provably and efficiently recover the true matrix. Finally, yet another line of work seeks to understand the non-convex landscape of descent-based methods for matrix completion [65, 21, 28, 40, 73, 40, 73, 71]. In terms of theoretical runtimes, the state of the art e O(dr3+o(1)), is given in [43]. While some of the aforementioned papers have implementations, comprehensively evaluating the practical performance of all of these methods is an interesting direction for future work. As noted in [17], none of these fast methods are known to succeed in the presence of semi-random noise. Semi-random models. Semi-random models were first introduced in a pair of seminal papers [9, 24], in part as a means to test whether learning algorithms which succeed under random models generalize to more realistic generative modeling assumptions. 
Most early work on semi-random models focused on understanding semi-random variants of various constraint satisfaction problems, see e.g. [25, 24, 45, 15, 48, 50]. More recently, they have also been studied in the context of learning theory, in the context of clustering problems [23, 53, 49, 14, 29, 51, 52, 55], sparse and overcomplete linear regression [37, 42], planted clique [54, 10], and more [27, 8]. Of particular interest to us is the recent line of work on developing fast learning algorithms in semi-random models [17, 42, 43, 27, 8]. The closest and most direct comparison to our work is the aforementioned [17], which we quantitatively improve upon in a number of ways. It is worth noting that the specific poly(r) r6 dependence in the sample complexity of [17] is lower than in Theorem 1. We view our result as a promising proof-of-concept of the tractability of semi-random matrix completion via fast algorithms; both our dependence and the dependence of [17] are somewhat large at the moment, limiting immediate practical deployment, but we believe that it is an interest direction for future research to tighten dependencies on problem parameters and seek more practical algorithms. 2 Technical overview At a high level, our algorithm follows the paradigm in [42] for designing fast iterative algorithms that are robust to semi-random models, which used this framework for sparse linear regression. Broadly, the approach is to first find deterministic conditions which guarantee that a step of the iterative method, which descends along a reweighted direction based on semi-random observations, makes progress. We then develop a custom nearly-linear time algorithm for computing such a reweighting, using optimization tools specialized to the regularity of graph-structured polytopes. Short-flat decompositions for progress. We first discuss the verifiable conditions we impose to certify progress of our iterative method. Throughout this section, we let our current iterate be M Rd d and let the ground truth be M Rd d, assume both matrices have rank at most r, and focus on the noiseless setting N = 0d d for simplicity. We also assume for normalization purposes that M M F = 1, and our goal is to take a step M M + D so that (M + D) M F 1 Our conditions are inspired by the approach in [43], which gives improved iterative algorithms for standard matrix completion (without semi-random noise). Let p be the minimum observation probability, let Ωbe the set of observed entries from M , and let Ω be the set of truly random observations, i.e., when each entry is revealed with probability exactly p (we can find a coupling so that Ω Ωalways). Inspired by a similar framework in the sparse recovery setting [42], the key certificate for guaranteeing progress of the form (1) in [43] is that of a short-flat decomposition. We say a candidate step D has a short-flat decomposition if we can write D = S + F where S F 1 and F op cr 1 2 for a small constant c: here we think of S as the short part and F as the flat part of the decomposition. If we can further ensure that enough signal is captured, i.e., D, M M 1 c, then [42, 43] give a short argument based on Hölder s inequality that a step in D, combined with a low-rank projection for denoising, will imply (1). Thus, it suffices to find a matrix D that has these properties; for intuition, we would ideally like to take D = S = M M and F = 0d d, which would complete the matrix in one shot. 
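To spell out the Hölder argument in one place, here is a sketch with illustrative constants, under the normalization $\|\mathbf{M} - \mathbf{M}^\star\|_F = 1$ and using $\mathrm{rank}(\mathbf{M}^\star - \mathbf{M}) \le 2r$:

$$
\left|\langle \mathbf{F}, \mathbf{M}^\star - \mathbf{M}\rangle\right| \le \|\mathbf{F}\|_{\mathrm{op}}\,\|\mathbf{M}^\star - \mathbf{M}\|_{\mathrm{tr}} \le \frac{c}{\sqrt{r}}\cdot\sqrt{2r}\,\|\mathbf{M}^\star - \mathbf{M}\|_F = c\sqrt{2},
$$

so $\langle \mathbf{S}, \mathbf{M}^\star - \mathbf{M}\rangle \ge \langle \mathbf{D}, \mathbf{M}^\star - \mathbf{M}\rangle - c\sqrt{2} \ge 1 - (1+\sqrt{2})c$, and therefore

$$
\|(\mathbf{M} + \mathbf{S}) - \mathbf{M}^\star\|_F^2 = \|\mathbf{M} - \mathbf{M}^\star\|_F^2 - 2\langle \mathbf{S}, \mathbf{M}^\star - \mathbf{M}\rangle + \|\mathbf{S}\|_F^2 \le 2 - 2\left(1 - (1+\sqrt{2})c\right) = 2(1+\sqrt{2})c.
$$

For small $c$ this is much less than $1$, and stepping along $\mathbf{D}$ followed by a rank-$r$ projection (which denoises the flat part $\mathbf{F}$) yields a bound of the form (1).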
Naturally, however, we are limited by the fact that D should be supported on the observations Ω, as otherwise we cannot make any guarantees on D, M M . Naïvely, one would hope to simply take the empirical observations D = 1 p[M M ]Ω (which zeroes entries outside Ω ), an unbiased estimator of M M, and appeal to matrix concentration arguments to show existence of a short-flat decomposition. Unfortunately, this is not the case, as entries, rows, or columns of M M which are too large can arbitrarily hinder matrix concentration bounds, and [43] gives simple examples showing this dependence on the imbalance is unavoidable for uniform reweightings. However, [43] made the important insight that when matrix concentration fails, it must be that M M is overly concentrated on a few heavy rows and columns, and hence dropping a few rows and columns significantly reduces M M F. These rows and columns can later be recovered using the low-rank assumption, once enough progress has been made. The algorithm in [43] then works by repeatedly estimating heavy rows and columns, explicitly dropping them to form a set S [d] of balanced indices, and then taking the step D = 1 p[M M ]Ω (S S). This necessity of dropping certain poorly-behaved indices is a crucial difference between the matrix completion setting and simpler semi-random inverse problems such as sparse linear regression [42], which possess stronger global regularity features such as the restricted isometry property (RIP). Certifying progress in semi-random models. In the semi-random matrix completion problem, we need several modifications to the strategies of [42, 43], to handle the interplay between estimating heavy indices and making certifiable progress. Firstly, we cannot accurately estimate the ℓ2 norm of rows and columns of the true difference matrix M M from observations, because the semi-random adversary can drastically distort our estimates. However, we show in Algorithm 1 that we can use the ℓ norm of observed entries as a proxy for heaviness.9 Specifically, after dropping rows and columns 9This norm distortion is the main source of poly(r) losses in our guarantees compared to the [43] algorithm. with the most observed large entries, we argue that all remaining rows and columns are well-balanced except on few sparse errors, which we exclude from our potential. Next, even if appropriate rows and columns were dropped, we cannot use the empirical observations to construct our step, as semi-random observations are no longer unbiased and the fully-random indices Ω are unknown. Instead, we use Ω existentially to set up a convex program over reweightings w RΩ, to search for a candidate step D := P (i,j) Ωwij[M M ]ij. Roughly, our convex program aims to find w satisfying the following constraints (see Lemma 7 for a formal statement). 1. For all i [d], P j [d]|(i,j) Ωwij[M M ]2 ij = O( 1 2. There exists S, F Rd d with D = S + F, S F 1, and F op = O(r 1 3. For a sufficiently small constant c, D, M M = P (i,j) Ωwij[M M ]2 ij 1 c. Item 1 serves to ensure our step D is sufficiently spread out amongst its rows and columns, which serves two roles: it bounds how much progress is lost when excluding indices, and enforces crucial problem regularity for our reweighting optimization algorithm. On the other hand, Items 2 and 3 are the standard criteria for making iterative progress through the short-flat framework of [42, 43]. 
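Written out with constants suppressed (the per-row bound of $O(1/d)$ is our reading of the normalization $\|\mathbf{M}^\star - \mathbf{M}\|_F = 1$; see Lemma 7 for the formal version), the program searches for $w \in \mathbb{R}^{\Omega}_{\ge 0}$ such that

$$
\text{(1)}\;\; \sum_{j\,:\,(i,j) \in \Omega} w_{ij}\,[\mathbf{M}^\star - \mathbf{M}]_{ij}^2 = O\!\left(\frac{1}{d}\right) \text{ for all } i \in [d], \qquad \text{(2)}\;\; \mathbf{D} = \mathbf{S} + \mathbf{F},\;\; \|\mathbf{S}\|_F \le 1,\;\; \|\mathbf{F}\|_{\mathrm{op}} = O\!\left(r^{-1/2}\right),
$$

$$
\text{(3)}\;\; \langle \mathbf{D}, \mathbf{M}^\star - \mathbf{M}\rangle = \sum_{(i,j) \in \Omega} w_{ij}\,[\mathbf{M}^\star - \mathbf{M}]_{ij}^2 \ge 1 - c.
$$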
It is straightforward to see that the ideal candidate 1 p[M M ]Ω (S S) we mentioned earlier is feasible for these constraints, and we show in Lemma 7 that finding any satisfying reweighting is enough to ensure progress (with some mild technical changes). It remains to discuss how to find a candidate D to ensure rapid convergence through the progress bound (1). Flow-based adaptive reweighting. A key technical contribution of this work is a nearly-linear time algorithm finding a reweighting D(w) := P (i,j) Ωwij[M M ]ij that meets our progress criteria. Due to our focus on fast algorithms, we require this step to run in O(|Ω| poly(r)) time. Our strategy is to form a joint objective over w obeying Item 1, and S F 1, of the form Fprog-flat(w, S) := D(w), M M + C smax(D(w) S). (2) Here, the parameter C is used to trade off the progress objective D(w), M M (giving Item 3) and the flatness objective smax(D(w) S) (giving Item 2), where smax(M) = log Tr exp(M) is a smooth approximation to the operator norm. That is, letting Z := W S be the product space between w RΩsatisfying Item 1 and short matrices S Rd d, we show that finding (w, S) Z with low suboptimality error in Fprog-flat is enough to make progress through Lemma 7. We optimize (2) by carefully applying a Frank-Wolfe method due to [35]. While naïve Frank-Wolfe analysis requires global smoothness of the objective, this provably does not hold for our Fprog-flat (at least, with a smoothness bound of poly(r) required for nearly-linear time algorithms). However, we leverage that the analysis in [35] shows that weaker regularity conditions, namely bounds on the curvature constant, suffice for efficient convergence of an iterative method. Specifically, we use the structure of our constraint set W to show that Fprog-flat is smooth in a restricted set of directions within W which implies suitable bounds on the curvature constant (see, e.g., Lemma 18) to remove a d factor from the naïve smoothness bound. We also observe that W is exactly a rescaling of a bipartite matching polytope, so that we can use flow-based graph algorithms, e.g., the approximate weighted matching algorithm of [22], to efficiently implement the linear optimization oracles over W required by Frank-Wolfe methods. We then use techniques from numerical linear algebra and polynomial approximation to implement linear optimization oracles over S, completing our optimization subroutine for finding adaptive reweightings. One other technical hurdle arises in implementing this framework for finding a reweighting: the existence of w satisfying Items 1, 2, and 3 relies on having a good estimate of the current distance M M F, after excluding certain rows and columns. In general, we can only upper bound this quantity, but we show that we can also use our optimization subroutine for (2) to certify non-existence of such a w . In this latter case, we show that it must have been the case that a few rows and columns were significantly heavier than the rest, and we can run Algorithm 1 to drop them. Our overall iterative method, Algorithm 2, makes bicriteria progress and either terminates with a drop step (when no feasible w exists), or a descent step (when we find a good solution to (2)). Recovering dropped rows and columns. Finally, we briefly discuss how to post-process an estimate M that is close to M on a large submatrix to recover dropped rows and columns. 
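Before turning to that, here is a generic Frank-Wolfe loop of the kind used to optimize (2), shown on a toy problem. This is only an illustrative sketch: the actual subroutine runs such a scheme over the rescaled bipartite matching polytope $\mathcal{W}$ (using the matching-based linear optimization oracle of [22]) jointly with the short matrices $\mathbf{S}$, whereas the oracle below optimizes over a probability simplex as a stand-in.

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, T):
    """Generic Frank-Wolfe: call the linear minimization oracle (LMO) on
    the current gradient and move toward its output with step 2/(t+2)."""
    x = x0
    for t in range(T):
        s = lmo(grad(x))                 # argmin_{s in polytope} <grad, s>
        gamma = 2.0 / (t + 2.0)
        x = (1 - gamma) * x + gamma * s
    return x

# Toy instance: minimize a convex quadratic over the probability simplex.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20)); A = A.T @ A / 20
b = rng.standard_normal(20)
grad = lambda x: A @ x + b               # gradient of 0.5 x'Ax + b'x

def simplex_lmo(g):
    s = np.zeros_like(g)
    s[np.argmin(g)] = 1.0                # best vertex of the simplex
    return s

x = frank_wolfe(grad, simplex_lmo, np.ones(20) / 20, T=200)
```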
The high-level outline is similar to [43], where we aim to find a verified set of rows and columns that we completed well, and use the low-rank assumption to obtain accurate estimates to the remainder via regression. However, there are several key differences. The main difference is that in the semi-random model, we need to track errors in our estimate entrywise instead of in ℓ2, since a semi-randomness could drastically distort the empirical ℓ2 error by adversarially revealing entries. We rely on a structural result using tools from the theory on volumetric spanners [5, 7] (see Lemma 27) showing that large entrywise errors must be localized to a small submatrix. At this point, we set up ℓ regression problems to identify verified indices, solvable in nearly-linear time using recent advances in linear programming [69]. Importantly, the overhead in the error guarantee of this recovery step is only e O(poly(r)) (see Proposition 3), allowing us to achieve a recovery guarantee in Theorem 1 that avoids poly(d) overheads in the presence of noise. 3 Preliminaries We now formally set up preliminaries required for analyzing our solution to the semi-random matrix completion problem. First, we recall the standard definition of incoherence for subspaces. Definition 1 (Incoherence). We say a dimension-r subspace V Rd is µ-incoherent if ΠV ei 2 p µr d for all i [d] (where ei is the ith standard basis vector), and M Rm n is µ-incoherent if M has µ-incoherent row and column spans. We will work with symmetric matrices. We denote the space of d d symmetric matrices by Sd d. In Appendix A, we give a reduction from general matrices to symmetric matrices showing that it suffices to work in the symmetric case, up to constant factor loss in parameters. Next, we define the semi-random observation model. Definition 2 (Semi-random model). For M Sd d, we use Osr p (M) to denote oracle access to M in the p-semi-random model, where we call Osr p a p-semi-random oracle. Specifically, in one call to Osr p with input M, for unknown observation probabilities {pι}ι [d] [d] [p, 1], each ι [d] [d] is independently included in a set I with probability pι. The oracle Osr p then returns the set {[M]ι}ι I. When an algorithm requires the ability to query Osr p (M) for various values of p (specified in the algorithm description), we list the input as Osr [0,1](M). Note that Osr p (M) and Osr [0,1](M) inherit their definitions in [43] when all pι = p, for a fixed p [0, 1]. Now we can formally define the semi-random matrix completion problem we study. Definition 3 (Semi-random matrix completion). In the semi-random matrix completion problem, parameterized by µ, R 0, p [0, 1], d N, and r [d], there is an unknown rank-r µ-incoherent matrix M Sd d. For some N Sd d satisfying N , and M := M + N, we receive the output of one call to Osr p (M), and wish to output an approximation to M . Remark 1. In Appendix A, we give a sample-splitting reduction which lets us use Osr p (M) to simulate p q independent accesses to Osr q (M) for smaller values q p. We will also need a few technical definitions to explain the main ingredients in our algorithm. Entropy and softmax. We let d d := {M Sd d 0 | Tr(M) = 1} denote the d d spectraplex. We let H : d d R denote von Neumann (matrix) entropy, i.e., H(M) := M, log M , and let H : Sd d R denote its convex conjugate, which we call matrix softmax, defined by H (M) := log Tr exp M. More generally, for η > 0, we define H η(M) := η log Tr exp 1 We recall the following standard facts about H η. Fact 1 ([58]). 
The function H η defined in (3) satisfies the following for all η > 0. 1. For all M Sd d, λ1(M) H η(M) λ1(M) + η log d, where λ1(M) is the maximum eigenvalue of M. 2. For all M Sd d, H η(M) = exp( 1 η M) Tr exp( 1 3. H η is twice-differentiable and 1 η-smooth with respect to the norm op. To use H η to enforce operator norm bounds, we define the signed lift of a matrix M Sd d by slift(M) := M 0d d 0d d M Note that slift signs and lifts M so that its maximum eigenvalue is its operator norm. Comparing matrices. We use the following comparison definition from [43]. Definition 4 (Closeness on a submatrix). We say M, M Rm n are -close on a γ-submatrix if there exist subsets A [m], B [n] satisfying |A| m γ min(m, n), |B| n γ min(m, n), and [M M ]A B F . If γ = 0, we say M, M are -close. When M, M are both symmetric, we require that A = B. We note that in Definition 4, A, B are unknown; our analysis only uses this definition existentially. 4 Outline of proof of Theorem 1 In this section, we overview the main components of the proof of Theorem 1, which are developed and proved in full in the appendix sections. Our final matrix completion algorithm, Algorithm 8, alternates between two main subroutines, Algorithm 3 (analyzed in the following Corollary 1) and Algorithm 7 (analyzed in the following Proposition 3). We prove Theorem 2, a formal version of Theorem 1, by combining Corollary 1 and Proposition 3 in Appendix E. We now explain and motivate each of these pieces in the context of our technical overview in Section 2. The first main component of our final algorithm, Algorithm 8, takes as input parameters α and γtot, as well as an input matrix M Sd d satisfying M M F (i.e., M and the target matrix M are -close), where is sufficiently larger than the noise level N . Its goal is to use semi-random observations from c M = M + N to return another matrix M Sd d such that c M and M are α -close on a γtot-submatrix (Definition 4). In other words, we are willing to give up on a γtot fraction of rows and columns to make an α factor progress on the remaining submatrix. To achieve this result, we first provide a helper claim in Proposition 1, an analysis of Algorithm 2 (Drop Or Descent). Proposition 1. Let M Sd d be given as a rank-r factorization, let M Sd d be rank-r , and let U [d]. Suppose we know for γdrop, γsub [0, 1], > 0, that |[d] \ U| γdropd, and MU U, M U U are -close on a γsub-submatrix. Let δ, ϵ (0, 1 10), θ, κ 1, and suppose for ˆr := r + r and an appropriate polynomial, , γsub ϵ 36θ. (4) Given one call to Osr p (M + N) for N ϵγsub 20d , the following holds with probability 1 δ. 1. If Algorithm 2 returns on Line 12, then MU U , M U U are (1 ϵ 288) -close on a γsub + 4 θκ-submatrix, and |U \ U | 40(γsub + ( ˆr θ)1/2)d log d. 2. If Algorithm 2 returns on Line 16, then M U U, M U U are 10ϵ 1 4 -close on a 2γsubsubmatrix, and M is given as a rank-(3r + 2r ) factorization. The runtime of the algorithm is where m d is the number of observed entries upon calling Osr p (M + N). Proposition 1 is used to analyze one step of Algorithm 2, after we have already taken a few iterations and explicitly removed γdropd rows and columns [d] \ U from consideration, so that our remaining submatrices on U U are close on a γsub-submatrix. Our goal will ultimately to have γsub, γdrop γtot 2 throughout the algorithm. 
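As a quick sanity check on the softmax machinery above, the following sketch verifies Fact 1 numerically on the signed lift (dense eigendecompositions are used purely for illustration; the $2d$ inside the logarithm is the dimension of $\mathrm{slift}(\mathbf{M})$).

```python
import numpy as np

def smoothed_softmax(M, eta):
    """H*_eta(M) = eta * log Tr exp(M / eta), via a stabilized log-sum-exp."""
    lam = np.linalg.eigvalsh(M)
    shift = lam.max() / eta
    return eta * (shift + np.log(np.sum(np.exp(lam / eta - shift))))

def slift(M):
    """Block-diagonal signed lift diag(M, -M); its top eigenvalue is ||M||_op."""
    d = M.shape[0]
    Z = np.zeros((2 * d, 2 * d))
    Z[:d, :d] = M
    Z[d:, d:] = -M
    return Z

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 50)); M = (A + A.T) / 2
eta = 0.1
val = smoothed_softmax(slift(M), eta)
op = np.linalg.norm(M, 2)                       # operator norm of symmetric M
assert op <= val + 1e-9 and val <= op + eta * np.log(2 * M.shape[0]) + 1e-9
```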
Algorithm 2 provides bicriteria guarantees: in the drop case of Item 1, it slightly increases the γdrop parameter, increases γsub by an even smaller amount (where the tradeoff is given by a degree of freedom κ), and makes a small amount of progress (decreasing the closeness by 1 O(ϵ)). In the descent case of Item 2, we instead decrease the closeness by a large O(poly(ϵ)) factor, while doubling γsub and roughly tripling the rank. By choosing the parameters appropriately, we show that iterating upon Proposition 1 yields Corollary 1, proven in Appendix B. Corollary 1. Let M, M Sd d be rank-r , and suppose M M F . Let α 250 and δ, γtot (0, 1 10). Algorithm 3 uses one call to Osr p (M + N), where for appropriate polynomials, N d poly( r γtot log( d δ )) α1+o(1) , p = poly( r γtot log( d and computes a rank-r αo(1)-matrix M , such that with probability 1 δ, M and M are α -close on a γtot-submatrix. The runtime of the algorithm is mαo(1) poly( r γtot log( d δ )), where m is the number of observed entries upon calling Osr p (M + N). We now explain the key computational method we develop to implement Proposition 1. We will solve a subproblem, defined in (5), to attempt to find a reweighting which makes progress. The subproblem enforces the conditions on candidate weights w mentioned in Section 2, and is formally stated below: Fprog-flat(x, S) := 1I, x + CH η slift D x where D(w) := X ι I wιDι, vι = [D]2 ι for all ι I, and x v denotes entrywise division. (5) The difference between (5) and our informal sketch in (2) is that we reparameterize x w v where v is the squared observations of the difference matrix D, and I is the set of observations. We also introduce a tradeoff parameter η for controlling additive error bounds via Fact 1. Thus, the component 1I, x = v, w captures the progress condition (Item 3), and the component H η(slift(D( x v ) S)) captures the flatness condition (Item 2). We optimize (5) over S F 1 + ϵ, the set of short matrices, and x belonging to X defined in (22) which we observe is a rescaling of a bipartite matching polytope by a parameter k (Fact 3). This polytope is used to enforce the spreadness condition (Item 1). In the case we find a good solution to the subproblem, we can use the resulting solution to take a descent step and give the guarantee in Item 2 of Proposition 3. Otherwise, we have certified that excluding a few heavy rows and columns removed much of the progress that could be made (which we prove is the only way a good solution does not exist), and hence obtain the guarantee in Item 1. Our main guarantee for solving (5) at a fast rate is the following, proven in Appendix C. Proposition 2. Let ϵ [0, C] and δ (0, 1), and suppose |[D]ι| [ℓ, u] for all ι I. There is an algorithm computing (x, S), an ϵ-approximate minimizer to (5) with probability 1 δ, in time O (|I| + d) T 2 poly Ck Rη for Rη := η 1(2k + ku ℓ2d) and T = 8CηR2 η ϵ , and S is given as a rank-r = O( C2 ϵ2 T) factorization. The last component in the proof of Theorem 1 is Algorithm 7, a rounding" step where we take the output of Corollary 1, say M, which is guaranteed to be close to M on a large submatrix, and then postprocess it to a matrix f M that is guaranteed to be close to M everywhere with some poly(r, µ, log d)-factor loss in the closeness. As long as the progress α made by Corollary 1 is larger than this factor, we can simply alternate Corollary 1 and the rounding step to still make constant-factor progress in each iteration. We prove the following result in Appendix D. Proposition 3. 
Let M Sd d be rank-r and µ-incoherent, and let c M = M + N for N Sd d with N for some 0. Let M Sd d be given as a rank-r decomposition, with r r . Further, for γ (0, 1) suppose M and M are d -close on a γ-submatrix. Finally, assume γ 1 104µr log(d). Then for any δ (0, 1) if p 1 d poly(µr log( d δ )) for an appropriate polynomial, Algorithm 7 uses one call to Op(c M) and with probability 1 δ, outputs f M Sd d such that f M M poly µr log d Also, f M is given as a rank-poly(µr log( d δ )) factorization. The algorithm runs in m poly(µr log( d δ )) time, where m is the number of observed entries upon calling Osr p (c M). Acknowledgments and Disclosure of Funding AL was supported in part by an NSF GRFP and a Hertz Fellowship. AS was supported in part by a Microsoft Research Faculty Fellowship, NSF CAREER Grant CCF-1844855, NSF Grant CCF1955039, and a Pay Pal research award. JK was supported in part by NSF awards CCF-1955217, CCF-1565235, and DMS-2022448. Part of this work was conducted while visiting the Simons Institute for the Theory of Computing. [1] Prizes for matrix completion problems. https://www.alignmentforum.org/posts/ p Jreb DRBj9gf BE8q E/prizes-for-matrix-completion-problems, 2023. [2] Amol Aggarwal and Josh Alman. Optimal-degree polynomial approximations for exponentials and gaussian kernel density estimation. In 37th Computational Complexity Conference, CCC 2022, volume 234 of LIPIcs, pages 22:1 22:23. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022. [3] Zeyuan Allen-Zhu and Lorenzo Orecchia. Nearly linear-time packing and covering LP solvers. Math. Program., 175(1-2):307 353, 2019. [4] Susan Athey, Mohsen Bayati, Nikolay Doudchenko, Guido Imbens, and Khashayar Khosravi. Matrix completion methods for causal panel data models. Journal of the American Statistical Association, 116(536):1716 1730, 2021. [5] Baruch Awerbuch and Robert Kleinberg. Online linear optimization and adaptive routing. Journal of Computer and System Sciences, 74(1):97 114, 2008. [6] James Bennett and Stan Lanning. The netflix prize. 2007. [7] Aditya Bhaskara, Sepideh Mahabadi, and Ali Vakilian. Tight bounds for volumetric spanners and applications. Co RR, abs/2310.00175, 2023. [8] Avrim Blum, Meghal Gupta, Gene Li, Naren Sarayu Manoj, Aadirupa Saha, and Yuanyuan Yang. Dueling optimization with a monotone adversary. In International Conference on Algorithmic Learning Theory, volume 237 of Proceedings of Machine Learning Research, pages 221 243. PMLR, 2024. [9] Avrim Blum and Joel Spencer. Coloring random and semi-random k-colorable graphs. Journal of Algorithms, 19(2):204 234, 1995. [10] Rares-Darius Buhai, Pravesh K Kothari, and David Steurer. Algorithms approaching the threshold for semi-random planted clique. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 1918 1926, 2023. [11] Emmanuel J. Candès and Yaniv Plan. Matrix completion with noise. Proc. IEEE, 98(6):925 936, 2010. [12] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Commun. ACM, 55(6):111 119, 2012. [13] Emmanuel J Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053 2080, 2010. [14] Yudong Chen, Ali Jalali, Sujay Sanghavi, and Huan Xu. Clustering partially observed graphs via convex optimization. The Journal of Machine Learning Research, 15(1):2213 2238, 2014. [15] Yudong Chen, Sujay Sanghavi, and Huan Xu. Clustering sparse graphs. ar Xiv preprint ar Xiv:1210.3335, 2(5), 2012. 
[16] Yuxin Chen, Yuejie Chi, Jianqing Fan, Cong Ma, and Yuling Yan. Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization. SIAM journal on optimization, 30(4):3098 3121, 2020. [17] Yu Cheng and Rong Ge. Non-convex matrix completion against a semi-random adversary. In Conference On Learning Theory, pages 1362 1394. PMLR, 2018. [18] Yeshwanth Cherapanamjeri, Kartik Gupta, and Prateek Jain. Nearly optimal robust matrix completion. In International Conference on Machine Learning, pages 797 805. PMLR, 2017. [19] Michael B. Cohen, Jonathan A. Kelner, John Peebles, Richard Peng, Anup B. Rao, Aaron Sidford, and Adrian Vladu. Almost-linear-time algorithms for markov chains and new spectral primitives for directed graphs. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pages 410 419. ACM, 2017. [20] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of johnson and lindenstrauss. Random Struct. Algorithms, 22(1):60 65, 2003. [21] Christopher De Sa, Christopher Re, and Kunle Olukotun. Global convergence of stochastic gradient descent for some non-convex matrix problems. In International conference on machine learning, pages 2332 2341. PMLR, 2015. [22] Ran Duan and Seth Pettie. Linear-time approximation for maximum weight matching. J. ACM, 61(1):1:1 1:23, 2014. [23] Micha Elsner and Warren Schudy. Bounding and comparing methods for correlation clustering beyond ilp. In Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing, pages 19 27, 2009. [24] Uriel Feige and Joe Kilian. Heuristics for semirandom graph problems. Journal of Computer and System Sciences, 63(4):639 671, 2001. [25] Uriel Feige and Robert Krauthgamer. Finding and certifying a large hidden clique in a semirandom graph. Random Structures & Algorithms, 16(2):195 208, 2000. [26] David Gamarnik and Sidhant Misra. A note on alternating minimization algorithm for the matrix completion problem. IEEE Signal Processing Letters, 23(10):1340 1343, 2016. [27] Xing Gao and Yu Cheng. Robust matrix sensing in the semi-random model. In Proceedings of the 37th Conference on Neural Information Processing Systems (Neur IPS), 2023. [28] Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. Advances in neural information processing systems, 29, 2016. [29] Amir Globerson, Tim Roughgarden, David Sontag, and Cafer Yildirim. Tight error bounds for structured prediction. ar Xiv preprint ar Xiv:1409.5834, 2014. [30] Yuzhou Gu, Zhao Song, Junze Yin, and Lichen Zhang. Low rank matrix completion via robust alternating minimization in nearly linear time. In The Twelfth International Conference on Learning Representations, ICLR 2024. Open Review.net, 2024. [31] Suriya Gunasekar, Ayan Acharya, Neeraj Gaur, and Joydeep Ghosh. Noisy matrix completion using alternating minimization. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2013, volume 8189 of Lecture Notes in Computer Science, pages 194 209. Springer, 2013. [32] Moritz Hardt. Understanding alternating minimization for matrix completion. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 651 660. IEEE, 2014. [33] Moritz Hardt and Mary Wootters. Fast matrix completion without the condition number. In Proceedings of The 27th Conference on Learning Theory, COLT 2014, volume 35 of JMLR Workshop and Conference Proceedings, pages 638 678. JMLR.org, 2014. 
[34] Baihe Huang, Shunhua Jiang, Zhao Song, Runzhou Tao, and Ruizhe Zhang. Solving SDP faster: A robust IPM framework and efficient implementation. In 63rd IEEE Annual Symposium on Foundations of Computer Science, FOCS 2022, pages 233 244. IEEE, 2022. [35] Martin Jaggi. Revisiting frank-wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28 of JMLR Workshop and Conference Proceedings, pages 427 435. JMLR.org, 2013. [36] Prateek Jain and Praneeth Netrapalli. Fast exact matrix completion with finite samples. In Conference on Learning Theory, pages 1007 1034. PMLR, 2015. [37] Arun Jambulapati, Jerry Li, Christopher Musco, Kirankumar Shiragur, Aaron Sidford, and Kevin Tian. Structured semidefinite programming for recovering structured preconditioners. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023. [38] Arun Jambulapati and Kevin Tian. Revisiting area convexity: Faster box-simplex games and spectrahedral generalizations. Co RR, abs/2303.15627, 2023. [39] Haotian Jiang, Tarun Kathuria, Yin Tat Lee, Swati Padmanabhan, and Zhao Song. A faster interior point method for semidefinite programming. In 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, pages 910 918, 2020. [40] Chi Jin, Sham M Kakade, and Praneeth Netrapalli. Provable efficient online matrix completion via non-convex stochastic gradient descent. Advances in Neural Information Processing Systems, 29, 2016. [41] Charles R Johnson. Matrix completion problems: a survey. In Matrix theory and applications, volume 40, pages 171 198, 1990. [42] Jonathan Kelner, Jerry Li, Allen X Liu, Aaron Sidford, and Kevin Tian. Semi-random sparse recovery in nearly-linear time. In The Thirty Sixth Annual Conference on Learning Theory, pages 2352 2398. PMLR, 2023. [43] Jonathan A. Kelner, Jerry Li, Allen Liu, Aaron Sidford, and Kevin Tian. Matrix completion in almost-verification time, 2023. [44] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from noisy entries. In Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009, pages 952 960. Curran Associates, Inc., 2009. [45] Alexandra Kolla, Konstantin Makarychev, and Yury Makarychev. How to play unique games against a semi-random adversary: Study of semi-random models of unique games. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, pages 443 452. IEEE, 2011. [46] Nathan Linial, Eran London, and Yuri Rabinovich. The geometry of graphs and some of its algorithmic applications. Combinatorica, 15:215 245, 1995. [47] Gunjan Mahindre, Anura P. Jayasumana, Kelum Gajamannage, and Randy Paffenroth. On sampling and recovery of topology of directed social networks a low-rank matrix completion based approach. In 2019 IEEE 44th Conference on Local Computer Networks (LCN), pages 324 331, 2019. [48] Konstantin Makarychev, Yury Makarychev, and Aravindan Vijayaraghavan. Approximation algorithms for semi-random partitioning problems. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pages 367 384, 2012. [49] Konstantin Makarychev, Yury Makarychev, and Aravindan Vijayaraghavan. Sorting noisy data with partial information. In Proceedings of the 4th conference on Innovations in Theoretical Computer Science, pages 515 528, 2013. 
[50] Konstantin Makarychev, Yury Makarychev, and Aravindan Vijayaraghavan. Constant factor approximation for balanced cut in the pie model. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 41 49, 2014. [51] Konstantin Makarychev, Yury Makarychev, and Aravindan Vijayaraghavan. Correlation clustering with noisy partial information. In Conference on Learning Theory, pages 1321 1342. PMLR, 2015. [52] Konstantin Makarychev, Yury Makarychev, and Aravindan Vijayaraghavan. Learning communities in the presence of errors. In Conference on learning theory, pages 1258 1291. PMLR, 2016. [53] Claire Mathieu and Warren Schudy. Correlation clustering with noisy input. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pages 712 728. SIAM, 2010. [54] Theo Mc Kenzie, Hermish Mehta, and Luca Trevisan. A new algorithm for the robust semirandom independent set problem. In Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, pages 738 746. SIAM, 2020. [55] Ankur Moitra, William Perry, and Alexander S Wein. How robust are reconstruction thresholds for community detection? In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 828 841, 2016. [56] Cameron Musco and Christopher Musco. Randomized block krylov methods for stronger and faster approximate singular value decomposition. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, pages 1396 1404, 2015. [57] Nagarajan Natarajan and Inderjit S. Dhillon. Inductive matrix completion for predicting gene disease associations. Bioinformatics, 30(12):60 68, 2014. [58] Yurii Nesterov. Smoothing technique and its applications in semidefinite optimization. Mathematical Programming, Series A, 110:245 259, 2007. [59] Luong Trung Nguyen, Junhan Kim, and Byonghyo Shim. Low-rank matrix completion: A contemporary survey. IEEE Access, 7:94215 94237, 2019. [60] Andy Ramlatchan, Mengyun Yang, Quan Liu, Min Li, Jianxin Wang, and Yaohang Li. A survey of matrix completion methods for recommendation systems. Big Data Mining and Analytics, 1(4):308 323, 2018. [61] Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12(12), 2011. [62] Jasson DM Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd international conference on Machine learning, pages 713 719, 2005. [63] Tim Roughgarden. Beyond the Worst-Case Analysis of Algorithms. Cambridge University Press, 2021. [64] Anthony Man-Cho So and Yinyu Ye. Theory of semidefinite programming for sensor network localization. Mathematical Programming, Series B, (109):367 -384, 2007. [65] Ju Sun, Qing Qu, and John Wright. When are nonconvex problems not scary? ar Xiv preprint ar Xiv:1510.06096, 2015. [66] Ruoyu Sun. Matrix completion via nonconvex factorization: Algorithms and theory. Ph D thesis, University of Minnesota, 2015. [67] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 62(11):6535 6579, 2016. [68] Wei Tu, Peng Liu, Jingyu Zhao, Yi Liu, Linglong Kong, Guodong Li, Bei Jiang, Guangjian Tian, and Hengshuai Yao. M-estimation in low-rank matrix factorization: a general framework. In 2019 IEEE international conference on data mining (ICDM), pages 568 577. IEEE, 2019. [69] Jan van den Brand, Yin Tat Lee, Aaron Sidford, and Zhao Song. 
Solving tall dense linear programs in nearly linear time, 2021. [70] Bo Xin, Yizhou Wang, Wen Gao, and David Wipf. Exploring algorithmic limits of matrix rank minimization under affine constraints. IEEE Transactions on Signal Processing, 64(19):4960 4974, 2016. [71] Jialun Zhang, Hong-Ming Chiu, and Richard Y Zhang. Accelerating sgd for highly illconditioned huge-scale online matrix completion. Advances in Neural Information Processing Systems, 35:37549 37562, 2022. [72] Shuai Zhang and Meng Wang. Correction of corrupted columns through fast robust hankel matrix completion. IEEE Transactions on Signal Processing, 67(10):2580 2594, 2019. [73] Xiao Zhang, Simon Du, and Quanquan Gu. Fast and sample efficient inductive matrix completion via multi-phase procrustes flow. In International Conference on Machine Learning, pages 5756 5765. PMLR, 2018. [74] Tuo Zhao, Zhaoran Wang, and Han Liu. A nonconvex optimization framework for low rank matrix estimation. Advances in Neural Information Processing Systems, 28, 2015. A Additional preliminaries In this section, we formalize some additional preliminaries and helpful notation which will be used in the remainder of the appendices. General notation. For n N we denote [n] := {i N | i n}. We say a vector is s-sparse if it has s nonzero entries. When S T and T is clear from context, we let Sc := T \S. The all-zeroes and all-ones vectors of dimension d are denoted 0d and 1d. Applied to a vector, p denotes the ℓp norm for p 1. Applied to a matrix, p denotes the ℓp norm of the vector corresponding to the flattened matrix. We use med(S) to denote the median of a set S. For t R we let sign(t) = 1 if t 0 and sign(t) = 1 otherwise. We let denote the entrywise product of two vectors. Matrices. Matrices are denoted in boldface. The number of nonzero entries of a matrix M is denoted nnz(M), and the set of indices corresponding to these nonzero entries, i.e., the support, is denoted supp(M). We define Tmv(M) as the amount of time it takes to compute Mv for any vector v; note Tmv(M) = O(nnz(M)) if M has explicit entries, and if M Rm n is given as a rank-r factorization then Tmv(M) = O((m + n)r). We equip Rm n with the inner product A, B := Tr(A B), denote the d d identity by Id, and the all-zero m n matrix by 0m n. The ith row and jth column of a matrix M are respectively denoted Mi: and M:j. For a collection of matrices {Ai}i [k] Rm n with an associated operator A (clear from context), we denote A(x) := P i [k] xi Ai for x Rk. We say M is s-row column sparse (RCS) if its rows and columns are s-sparse. For S [m], T [n], we let the submatrix of M Rm n indexed by S, T be denoted MS T . When S is a singleton {i} for i [m], we denote this Mi,T and similarly define MS,j for j [n]. The Frobenius, operator, and trace (nuclear) norms of a matrix are denoted F, op, and tr respectively, corresponding to the 2-norm, -norm, and 1-norm of the singular values of the matrix. We denote the symmetric d d matrices by Sd d and the positive semidefinite cone by Sd d 0 . We say M Rm n has singular value decomposition (SVD) M = UΣV if M is rank-r, Σ Rr r is diagonal, and U Rm r, V Rn r have orthonormal columns. The ordered singular values of M Rm n are denoted {σi(M)}i [min(m,n)] where σ1(M) is largest. Similarly we define the ordered eigenvalues of M Sd d by {λi(M)}i [d]. For M Sd d with eigendecomposition UΛU , we define exp(M) := U exp(Λ)U where on the right-hand side, exp is applied entrywise on the diagonal. 
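As a small illustration of these conventions (a sketch, not code from the paper): matrix functions of a symmetric matrix are applied to its eigenvalues in the eigenbasis, and a rank-$r$ factorization supports matrix-vector products in $O((m+n)r)$ time without forming the dense matrix.

```python
import numpy as np

def expm_sym(M):
    """exp(M) for symmetric M via the eigendecomposition M = U diag(lam) U^T."""
    lam, U = np.linalg.eigh(M)
    return (U * np.exp(lam)) @ U.T

def matvec_factored(U, V, x):
    """(U V^T) x in O((m + n) r) time, never forming the m x n matrix."""
    return U @ (V.T @ x)

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 30)); M = (A + A.T) / 20
assert np.allclose(expm_sym(M) @ expm_sym(-M), np.eye(30), atol=1e-8)

U, V = rng.standard_normal((500, 5)), rng.standard_normal((400, 5))
x = rng.standard_normal(400)
assert np.allclose(matvec_factored(U, V, x), (U @ V.T) @ x)
```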
We also define log M for M Sd d 0 and |M| for M Sd d so the functions are applied to the eigenvalues in the appropriate basis. Finally, we let span(A) denote the column span (or image) of a matrix A, and for a subspace V Rd, we let ΠV denote the orthogonal projection matrix onto V . Index sets. For d N, we let S(d) contain all symmetrized indices of the form: {(i, i)} for i [d], or {(i, j), (j, i)} for i, j [d], i = j. We use (i, j) as shorthand for the singleton {(i, i)} if i = j, and otherwise as shorthand for {(i, j), (j, i)}. When I is a set of indices in [d] [d], we let MI be the d d matrix which zeroes out all entries of M except those in ι. For a symmetrized index ι S(d), we denote the scalar value of the ι entry by [M]ι (note that this is well-defined even when ι contains two indices due to symmetry). The zero-one matrix whose only nonzero entries are ι is denoted by Eι, so that Mι = [M]ιEι for all ι S(d). For I S(d) clear from context, we let E be the operator on w RI which forms the associated d d matrix whose indices on I agree with w and is otherwise zero entrywise, i.e., Optimization. For convex f : X R with X Rd, we let f(x) denote the subgradient set at x X. We let y := sup x 1 x y denote the dual norm to a norm . When M is a d d linear operator we use the notation M[u, v] := Mu, v . This becomes relevant when M is a Hessian of a matrix function (a linear operator on d d matrices), so M[U, V] evaluates this operator on U, V. Reduction to symmetric matrices. We briefly justify our focus on symmetric matrices in Definition 3. We begin by recalling two facts. Fact 2. Let m, n N with m n. There exists k N such that kn [ m Lemma 1. Let m, n, k N and let M Rm n be rank-r with SVD M = UΣV . Suppose that both U := span(U) and V := span(V) are µ-incoherent. Then, letting M Rm kn be defined as M := (M M . . . M M) | {z } k times and letting M have SVD M = UΣ W , then W := span(W) is µ-incoherent. Proof. It is straightforward to check that a valid SVD is given by V V . . . V V | {z } k times Because incoherence is invariant to the choice of SVD, the conclusion holds. Consider an asymmetric variant of the problem in Definition 3, where M Rm n for m n is rank-r and µ-incoherent, and we have p-semi-random observation oracle access (defined analogously to Definition 2) to M . Let k be the integer from Fact 2, and let d M Rm kn be constructed as in Lemma 1 by concatenating M k times. The row and column spans of d M are µ-incoherent, by Lemma 1. Moreover, we can simulate p-semi-random observation oracle access to 0m m d M h d M i 0kn kn S(m+kn) (m+kn) in the symmetric sense of Definition 3, given asymmetric oracle access to M . It is straightforward to check that because d M has dimensions within a constant factor of each other, the incoherence parameter is only affected by a constant in this lifting process. For simplicity in the remainder of the paper, we focus on the case of symmetric matrices, i.e., the problem statement in Definition 3. Sample splitting. Finally, we recall a convenient subsampling claim from [43] that extends to the semi-random model, which lets us use Osr p (M) to simulate access to Osr q (M) for smaller values q p. Lemma 2. Let {pk}k [K] (0, 1) satisfy pk p 1 K for all k [K], and let M Sd d. We can simulate sequential access to Osr pk(M ) for all k [K] with access to Osr Kp(M ). Proof. The claim is immediate from the observation that the proof of Lemma 1, [43] is monotone in p. 
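A minimal sketch of this symmetrization (under our reading of Fact 2, namely that $k$ is chosen with $kn \in [m, 2m]$; helper names are ours):

```python
import numpy as np

def symmetrize_instance(M):
    """Lift a rectangular M (m x n with m >= n) to a symmetric matrix:
    tile k = ceil(m / n) copies so the two dimensions are within a factor
    of two, then form the block matrix [[0, Mhat], [Mhat^T, 0]]."""
    m, n = M.shape
    k = -(-m // n)                          # ceil(m / n), so m <= k*n < 2*m
    Mhat = np.tile(M, (1, k))               # m x (k * n), rank unchanged
    top = np.hstack([np.zeros((m, m)), Mhat])
    bot = np.hstack([Mhat.T, np.zeros((k * n, k * n))])
    return np.vstack([top, bot])

M = np.arange(12.0).reshape(4, 3)           # 4 x 3 example
S = symmetrize_instance(M)
assert np.allclose(S, S.T) and S.shape == (4 + 2 * 3, 4 + 2 * 3)
```

By Lemma 1, the tiling preserves rank and incoherence up to constants, so semi-random observations of the asymmetric instance translate to semi-random observations of the symmetric lift.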
In particular, the same subsampling process in [43] preserves monotonicity of observation probabilities and hence produces valid semi-random observation oracles. B Semi-random partial matrix completion In this section, we develop Algorithm 3, our main subroutine for partial matrix completion in the semi-random model (Definition 2). Algorithm 3 uses recursive calls to Algorithm 2, Drop Or Descent, which implements a single step of our partial completion method. At a high level, one application of Drop Or Descent is parameterized by a current number of dropped rows γdrop, a current number of excluded rows γsub, a subset of undropped rows U = [d]\D satisfying |D| γdropd, and a rank-r matrix M such that MU U is promised to be close to our target M U U on a γsub-submatrix (Definition 4). It then distinguishes between the following cases from semi-random, noisy observations of M , by solving a testing problem via optimization. 1. In the first case, dropping an additional γ fraction of rows and columns decreases the closeness parameter by a small amount, i.e. a small subset of rows or columns are responsible for a nontrivial portion of the mass. We call this case the drop case. 2. In the second case, after removing a few rows and columns, the remaining mass is still large. We show that in this case, by deleting a particular subset of problematic rows or columns, we can use well-conditionedness properties of the remaining submatrix to decrease by a constant factor. We do so by solving an optimization problem, which finds a spread reweighting of observations with a short-flat decomposition. We call this case the descent case. We analyze the descent case in Appendix B.1, where we demonstrate sufficient conditions on a proposed reweighted step which guarantee progress under submatrix closeness. The candidate steps we analyze in this section are all reweightings of truncated observations, which are found by solving an appropriate optimization problem (see (20)). We then analyze the drop case in Appendix B.2, where we show that either our current difference matrix has Frobenius norm which is unstable to dropping few rows and columns, or a good solution exists which is compatible with our descent algorithm. We give an explicit dropping procedure for handling the former case in Appendix B.3. We finally put these cases together and give a win-win analysis of Drop Or Descent in Appendix B.4. B.1 Descent steps In this section, we give our progress analysis in the case where we can find sufficient signal in our observations. Specifically, Lemma 7 gives sufficient conditions on a reweighted, truncated observation matrix such that we can improve our submatrix closeness parameter by a constant factor. For D Sd d, we first define a clipping operation which will be used to control our reweightings. We let clipu(D) be the matrix which clips all entries to the range [ u, u], formally defined by [clipu(D)]ij = med ( u, Dij, u) . (7) We also will use the following claim which follows from straightforward casework. Lemma 3. Let u 0, |a| u, and a := med( u, a + b, u). Then |a a| |b|. Proof. If |a + b| u the claim is obvious. Otherwise, if a = u it must be the case that b 0, and 0 a a b. Similarly, if a = u we must have 0 a a b. We now introduce some additional notation. Let I S(d) be a set of symmetrized indices under consideration (corresponding to observations). For w RI and b D Sd d clear from context, we let b D(w) := X ι I wι b Dι. 
(8) For a parameter ρ, we define the set of spread reweightings of the observations b DI by w RI 0 | max i [d] j [d]|ι=(i,j) I wι[ b D]2 ι ρ Further, we require the following result on low-rank approximation. Lemma 4 (Theorem 1, [56]). Let δ, ϵ (0, 1), let M Sd d, and let q [d]. There is an algorithm which returns Z Rd q with orthonormal columns such that ZZ MZZ M op (1 + ϵ)σq+1(M) with probability 1 δ, in time Tmv(M) q log d Moreover, we have for all i [q], letting zi be the ith column of Z, q z i M2zi [(1 ϵ) σi (M) , (1 + ϵ) σi (M)] . Finally, we provide ways to bound the singular values of an approximately low-rank matrix. Lemma 5. Let M, N Sd d have eigendecompositions M = UΛU and N = VΣV , where Λ = diag (λ), Σ = diag (σ), and λ, σ Rd are in nonincreasing order. Then, M N F λ σ 2. Proof. Since M 2 F = λ 2 2 and N 2 F = σ 2 2, expanding the squares of both sides shows that it suffices to prove that M, N λ, σ , which follows from the von Neumann trace inequality. Lemma 6. Let ϵ [0, 1], let M Sd d be rank-ˆr, and suppose M S F ϵ for S Sd d. Then for any s N, we have σˆr+s+1(S) ϵ s1/2 . Proof. Let λ Rd be the set of eigenvalues of M in nonincreasing order, and similarly let σ Rd be the nonincreasing eigenvalues of S. By assumption, λi = 0 for all i ˆr + 1, and λ σ 2 ϵ by Lemma 5. Thus, at most ˆr + s eigenvalues of S can be larger than ϵ s1/2 , giving the conclusion. We are now ready to give our main progress guarantee in Lemma 7. Specifically, we show that a reweighting of truncated observations which has enough signal and admits a short-flat decomposition is guaranteed to make significant progress, as long as it is sufficiently spread in the sense of (9). Lemma 7. Let ϵ, γ (0, 1 10), ℓ, u 0, and T S U [d] such that |U \ S| γd, |S \ T| γd, and ℓ ϵ d. Suppose that for D, D , N Sd d, we have D = D + N, for rank-ˆr D with D S S F 1, D T T u, and N γℓ. Let b D := clipu(D), and for I S(d) (U U), w RI 0, suppose the following conditions hold for S, F Sd d supported on I (U U) with b D(w) = F + S. 1. w Wρ for ρ = ϵ 2γd, recalling the definition (9). 2. b D, b D(w) 1 ϵ. 3. S F 1 + ϵ. 4. F op ϵ ˆr1/2 . 5. minι I |[ b D]ι| ℓ. Then, letting Z Rd 2ˆr be the output of Lemma 4 with parameters δ (0, 1), ϵ 1 2, M S, and q 2ˆr, we have with probability 1 δ that D ZZ SZZ F 10ϵ 1 4 . Proof. As a starting point, we claim that 13ϵ. (10) To see this, we expand to show that D T T S 2 F 2 + 3ϵ 2 D T T , ST T 2 + 3ϵ 2 D T T , h b D(w) i + 2 D T T tr F op 2 + 5ϵ 2 b DT T , h b D(w) i + 2 h D b D i 2 + 7ϵ 2 b DT T , h b D(w) i The first line used Item 3 and D S S F 1, and the second used the (matrix) Hölder s inequality. The third used Item 4 and that b DT T tr ˆr b DT T F, and the (vector) Hölder s inequality. The last line used D T T u, the definition of b D, and Lemma 3 to conclude h D b D i and also bounded b D(w) 1 via Items 1 and 5, which show b D(w) 1 D b D, b D(w) E max ι I 1 [ b D]ι Finally, since w is supported on U U, applying |U \ T| 2γd, Item 1, and Item 2 yields b DT T , h b D(w) i D b D, b D(w) E 4γρd 1 3ϵ. (12) Then, (10) follows by combining (11) and (12). Next, applying Lemma 6 using (10) shows that Therefore, assuming the algorithm in Lemma 4 succeeds, we have ZZ SZZ | {z } :=S = S + ZZ SZZ S , where F op 6 ϵ At this point, we can write D(w) = S + (F + F ), where by the triangle inequality F + F op 7 ϵ 1 Since S F S F 1 + ϵ, the remainder of the argument is identical to the proof of (10) with the worse flatness parameter, where we use that S T T F S F and ST T F S F. 
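To make the descent update analyzed in Lemma 7 concrete, the toy sketch below (our own simplification, not the paper's implementation) projects a noisy proxy S for the difference matrix onto its top-q eigenspace Z, with q set to twice the rank bound, and uses ZZᵀSZZᵀ as the updated low-rank estimate; a dense eigendecomposition stands in for the nearly-linear-time routine of Lemma 4, and the reweighted short-flat decomposition is replaced by additive noise.

```python
# A simplified numerical sketch of the Lemma 7 update: project the "short" matrix S
# onto its top-q eigenspace and take Z Z^T S Z Z^T as the new rank-<=q estimate.
import numpy as np

def descent_update(S, q):
    lam, V = np.linalg.eigh(S)
    idx = np.argsort(np.abs(lam))[::-1][:q]     # top-q eigenvalues by magnitude
    Z = V[:, idx]                               # d x q, orthonormal columns
    return Z @ (Z.T @ S @ Z) @ Z.T              # rank at most q

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    d, r_hat = 80, 4
    U = rng.standard_normal((d, r_hat)) / np.sqrt(d)
    D_star = U @ U.T                            # rank-r_hat target difference matrix
    S = D_star + 0.02 / np.sqrt(d) * rng.standard_normal((d, d))
    S = (S + S.T) / 2                           # symmetrize the noisy proxy
    D_new = descent_update(S, q=2 * r_hat)
    # the projected estimate should be about as accurate as the proxy itself
    print(np.linalg.norm(D_new - D_star), np.linalg.norm(S - D_star))
```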
B.2 Existence of good solutions In this section, we define a stability condition (14) under which implies a solution exists which satisfies the conditions in Lemma 7. We use this existence claim to certify progress in one of two ways in our final algorithm. In Lemma 11, we show that under (14), there exists a reweighting (with high probability over the semi-random observations) which is compatible with Lemma 7 s requirements. We then show in Lemma 12 that in the absence of (14), one can increase the drop parameter to improve the submatrix closeness bound. We will analyze an explicit dropping procedure we use in this latter case in Appendix B.3. We first require a helper guarantee from prior work on controlling entries of a low-rank matrix after dropping a few problematic rows and columns. Lemma 8 (Lemma 6, [43]). Let A Sd d be a rank-ˆr matrix with A F 1, and let γ [0, 1]. There is S [d] with |S | (1 γ)d, such that ˆr γd . (13) We also require a bound on the sampling error of random observations from a matrix with bounded entries, rows, and columns, which is an application of the matrix Bernstein inequality. Lemma 9 (Lemma 7, [43]). Let p, δ [0, 1 2], ρ, τ 0, and let A Sd d satisfy Ai: 2 ρ for all i [d], and |Aij| τ for all (i, j) [d] [d]. Let Ω S(d) be a random subset where each ι S(d) is included in Ωwith probability p, and let e A := 1 p AΩ. Then with probability at least 1 δ, A e A op 4 max Finally, we need a guarantee on the concentration of empirical estimates of squared ℓ2 norms. Lemma 10 (Lemma 3, [43]). Let p, δ, α (0, 1), let v Rd have v τ, and let v Rd have each vi independently set to vi with probability p, and 0 otherwise. Then with probability 1 δ, v 2 2 1 max α v 2 2 , 3τ 2 log 2 We are now ready to give our main result on sufficient conditions for a good solution to exist. Lemma 11. Let δ, ϵ, γ, p (0, 1 10), and S U [d] such that |U \ S| γd. Let ℓ:= ϵ 20d, u := 2 ˆr γd , p 240ˆr log( 6d δ ) γ3ϵ2d . Suppose that for D, D , N Sd d, we have D = D + N, for rank-ˆr D with D S S F 1 and N γℓ. Further, for b D := clipu(D), let Ω := {ι S(d) (U U) | |[ b D]ι| ℓ}. Finally, let S be the result of applying Lemma 8 with A D S S, γ γ 2 , so D S S u, and let T := S \ L where L is the indices of the γd 2 rows of b DS S with largest ℓ2 norms, so |S \ T| γd. Suppose b DT T F 1 ϵ Letting Ω S(d) have each ι S(d) independently included with probability p, and w {0, 1 have [w ]ι = 1 p if and only if ι (T T) Ω , the following conditions all hold with probability 1 δ over the randomness of Ω, and S , F Sd d supported on (T T) Ω with b D(w ) = F + S . 1. w Wρ for ρ = ϵ 2γd, recalling the definition (9). D b D, b D(w ) E 1 ϵ 3. S F 1 + ϵ. 4. F op γϵ 4ˆr1/2 . Proof. For notational convenience, let L := S(d)\Ω be the set of indices ι S(d) with |[ b D]ι| < ℓ. We define our decomposition of b D(w ) as follows, where we recall w is supported on (T T) Ω : b D(w ) := b D(T T ) Ω | {z } :=S + b D(w ) b D(T T ) Ω | {z } :=F Now, clearly N F d N ϵ 200, so S F DS S F 1 + ϵ 200 where we used that every entry of S is smaller in magnitude than the corresponding entry in DS S (it is zeroed out if it does not belong to (T T) Ω , and otherwise is clipped to a smaller range). This establishes Item 3. The same logic shows that b DS S F 1 + ϵ 200. Combined with (14) and the definition of T, 10 = ϵ 5γd. (15) since P i L [ b DT T ]i: 2 2 b DS S 2 F b DT T 2 F ϵ 10, and L consisted of the largest rows of S by ℓ2 norm. 
Therefore, applying Lemma 9 with A b D(T T ) Ω , τ u, and ρ2 ϵ 5γd establishes Item 4 with probability 1 δ 3 using the lower bound on p. We further have by (14) that b D(T T ) Ω F 1 ϵ 30 b DL F 1 ϵ Therefore, Lemma 10 with v set to the vectorized b D(T T ) Ω , τ u, and α ϵ b D(T T ) Ω 2 F D b D, b D(w ) E max α b D(T T ) Ω 2 F , 3u2 log 6 with probability 1 δ 3; we again used our lower bound on p. Combining the above two displays establishes Item 2. Finally, to see Item 1, fix a row i T. We apply Lemma 10 once more with v set to the vectorized [ b D(T T ) Ω ]i: and α = 1 2 to obtain that with probability 1 δ 3, h b D(w ) i h b D(T T ) Ω i 2 + 6u2 log( 6 δ ) p 2 ϵ 5γd ϵ 2γd, where we used (15) and our lower bound on p. We also provide a helper lemma for making progress if (14) fails to hold. Lemma 12. In the setting of Lemma 11, if (14) is false, D T T F 1 ϵ 36. Proof. This follows from Lemma 3 and the assumption on (14), which lets us bound D T T F b DT T F + h D b D i b DT T F + d N 1 ϵ 30 + ϵ 200 1 ϵ B.3 Drop steps In the previous Appendix B.2, we developed Lemmas 11 and 12, which together say that either there exists a good solution compatible with the progress analysis in Lemma 7 (if (14) holds), or there are an γ fraction of rows and columns which are responsible for an ϵ fraction of the Frobenius norm of the difference matrix (if (14) does not hold). In this section, we show in the latter scenario, there is a procedure which explicitly drops a subset of rows and columns, such that on the remainder, a slightly smaller submatrix makes an ϵ fraction of progress. Importantly, we can ensure that the submatrix parameter (i.e. γ in the sense of Definition 4) grows slowly using a tunable parameter, which lets us iterate drop steps without losing too many rows and columns. We begin by giving a subroutine (Algorithm 1), analogous to Section 4.1 of [43], which takes semirandom observations from an appropriately-bounded difference matrix and explicitly drops a small number of rows and columns, estimated from observations. We show that all remaining rows have few large entries with high probability. Our drop step simply applies this subroutine. Algorithm 1: Sparsify(Osr [0,1](D), U, τ, γ, p, δ) 1 Input: Osr [0,1](D), U [d], τ 0, γ, p, δ (0, 1) 3 tmax 3 log d + 1 4 for 0 t < tmax do 5 Dt Osr p (DSt St) 6 for i St do ri,t |{j St | [Dt]ij was observed, and |[Dt]ij| τ}| 7 St+1 St \ Rt where Rt St corresponds to the γd indices i with largest ri,t 8 end 9 return U Stmax Lemma 13. Let D Sd d, such that there is A U [d] with |U\A| γd, DA A τ. Let s N and p 800 δ ). Then with probability 1 δ, Algorithm 1 returns U U with |U \ U | 4γd log d, and DU U has at most s entries with magnitudes τ per row or column. Proof. Let P [0, 1]d d be the true (symmetric) matrix of reveal probabilities of entries for calls to Osr p , following Definition 2. Note that all entries of P are at least p. We define the potential (i,j) St St |[D t ]ij| τ We also denote the expected number of revealed large entries in each row and column as follows: j St |[D t ]ij| τ Pij for all i St. By a standard Chernoff bound, with probability 1 δ, for all 0 t < tmax and all i St, ri,t r i,t max 1 4r i,t, 50 log d Condition on this event in the rest of the proof. Due to our symmetric observation model (Definition 2), and since A A has no entries τ in magnitude, in any iteration, at least half of the contribution to Φt must be due to remaining rows in Bt := St \ A, i.e. 
we have |Bt| γd such that X i Bt r i,t Φt Letting R i,t denote the γd |Bt| indices in St with largest r i,t, we thus have from (16), i Rt ri,t X i R t ri,t 3 i R t r i,t 50γd log d 8Φt 50γd log d Therefore, if Φt 400γd log( d δ ), we decrease Φt by at least a 1 4 factor, so by our choice of tmax, we have Φtmax 1 400γd(log d 2 . This shows that there are at most γd remaining rows i Stmax 1 with r i,tmax 1 ps 2 . Now, any remaining row i Stmax 1 with at least s entries with magnitude τ has r i,tmax 1 ps, and therefore will be one of the largest γd remaining rows by observed counts via (16). Hence, it will be dropped in the last iteration tmax as claimed. We conclude by showing that the leftover poorly-behaved rows and columns (i.e. those with any large entries) comprise a negligible fraction of the overall indices, using Lemma 8. This shows that by excluding these few indices from our maintained submatrix (after explicitly dropping indices via Algorithm 2), we make an ϵ fraction of the progress if the condition in Lemma 12 is met. Lemma 14. Let δ, ϵ, γ, p (0, 1 10), θ, κ 1, and S U [d] such that |U \S| γd. Suppose that for D, D , N Sd d, we have D = D + N for rank-ˆr D with D S S F 1 and N ϵ d, and that there is a subset T S with D T T F 1 ϵ and |S \ T| ϵd θ . Then, applying Algorithm 1 with inputs Osr [0,1](D), U, δ, and θ 5d , γ γ + 10 ˆr θ, p 800ˆrκ log( d δ ) d , (17) returns a subset U U such that with probability 1 δ, there exists a subset S U with |U \ U | 40 d log d, |U \ S | γ + 4 d, and DS S F 1 ϵ Proof. First, by Lemma 8, there is a subset A S such that |S \ A| 10 q ˆr θ, so that DA A D A A + N Therefore, our application of Algorithm 1 with the parameters in (17) satisfies the hypotheses of Lemma 13. Condition on the conclusion of Lemma 13 holding in the rest of the proof (with probability 1 δ), giving us the first bound in (18). Also, there are at most d ˆrκ entries per row and column of DU U with magnitude larger than τ in (17), so there are at most d ˆrκ entries per row and column of D U U with magnitude larger than τ + ϵ Next, if D S S F 1 ϵ 3, the conclusion is obvious (taking S S U ). Otherwise, for C := S \ T, D C S 2 F 1 D S S 2 F D T T 2 F By Lemma 8, there is a subset S S U with |(S U ) \ S | 4d ˆrθκ 4d . (19) This set satisfies the second bound in (18), so it remains to prove the third, which follows from D S S 2 F D S S 2 F D C S 2 F D (C S ) S 2 2 + D (C S ) S 2 d (2τ)2 + d The third line used that there are at most ϵd θ rows in C, and each such row (restricted to S ) has all but d rκ entries bounded in magnitude by τ, with the remainder bounded as in (19). B.4 Analysis of Drop Or Descent and SRPartial Completion We now combine Lemma 7 and Lemma 14 to give a win-win analysis of a progress step, Drop Or Descent. This progress step either executes a descent step (analyzed via Lemma 7), or a drop step (analyzed via Lemma 14); we use Lemmas 11 and 12 to demonstrate correctness of each of these cases. For the runtime analysis, we rely on Proposition 6, the main result of Appendix C, to solve a relevant optimization problem used by our algorithm, but all other parts of the proof are self-contained. We provide pseudocode for the algorithm corresponding to Proposition 1 in Algorithm 2. 
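Before stating Algorithm 2, which invokes the drop subroutine Sparsify (Algorithm 1) analyzed in Lemma 13, we record a simplified sketch of that subroutine. It is a fully-observed, dense toy (our own simplification; the actual algorithm works from semi-random observations of a rescaled difference matrix): repeatedly count, per surviving row, the entries exceeding a threshold τ, and drop the roughly γd rows with the largest counts.

```python
# A toy, fully-observed sketch of the Sparsify drop step (Algorithm 1).
import numpy as np

def sparsify_sketch(D, tau, gamma, t_max=None):
    d = D.shape[0]
    if t_max is None:
        t_max = int(3 * np.log(d)) + 1
    keep = np.arange(d)                          # surviving row/column indices
    n_drop = max(1, int(gamma * d))
    for _ in range(t_max):
        sub = np.abs(D[np.ix_(keep, keep)])
        counts = (sub > tau).sum(axis=1)         # large entries per surviving row
        worst = np.argsort(counts)[::-1][:n_drop]
        keep = np.delete(keep, worst)            # drop the rows with most large entries
    return keep

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    d = 200
    D = 0.01 * rng.standard_normal((d, d))
    bad = rng.choice(d, size=5, replace=False)   # plant a few heavy rows/columns
    D[bad, :] += 1.0
    D[:, bad] += 1.0
    keep = sparsify_sketch(D, tau=0.5, gamma=0.02)
    print(sorted(set(bad) & set(keep)))          # ideally empty: planted rows are dropped
```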
Algorithm 2: Drop Or Descent(Osr [0,1](M + N), M, U, γdrop, γsub, , r, r , δ, ϵ, θ, κ) 1 Input: Osr [0,1](M + N) where M Sd d is rank-r and N ϵγsub 20d , U [d] with |[d] \ U| γdrop, rank-r M Sd d such that MU U, M U U are -close on a γsub-submatrix, δ, ϵ (0, 1 10), θ, κ 1 3 p1 240ˆr log( 30d δ ) γ3 subϵ2d , p2 800ˆrκ log( 5d δ ) d , u 2 ˆr γsubd, ℓ ϵ 20d 4 I supp(Osr p1(M + N)) \ {ι S(d) | |[M + N M]ι| ℓ } (U U) 5 vι 2 [clipu (M + N M)]2 ι for all ι I 6 (w, S) ϵ 12-approximate minimizer to (20) with failure probability δ 5, computed using Proposition 6 7 z output of Lemma 4 with M b D(w) S, q 1, ϵ γsub 8 V2 1 (z ( b D(w) S)2z) 1 2 9 V v, w + CV2 10 if V > (1 11ϵ 11 U Sparsify(Osr [0,1]( 1 (M + N M)), U, θ 5d , γsub + 10 q ˆr θ, p2, δ 12 Return: ( Drop , U ) 15 Z output of Lemma 4 with M S, q 2ˆr, ϵ 1 5 16 Return: ( Descent , M M + ZZ SZZ ) Proposition 1. Let M Sd d be given as a rank-r factorization, let M Sd d be rank-r , and let U [d]. Suppose we know for γdrop, γsub [0, 1], > 0, that |[d] \ U| γdropd, and MU U, M U U are -close on a γsub-submatrix. Let δ, ϵ (0, 1 10), θ, κ 1, and suppose for ˆr := r + r and an appropriate polynomial, , γsub ϵ 36θ. (4) Given one call to Osr p (M + N) for N ϵγsub 20d , the following holds with probability 1 δ. 1. If Algorithm 2 returns on Line 12, then MU U , M U U are (1 ϵ 288) -close on a γsub + 4 θκ-submatrix, and |U \ U | 40(γsub + ( ˆr θ)1/2)d log d. 2. If Algorithm 2 returns on Line 16, then M U U, M U U are 10ϵ 1 4 -close on a 2γsubsubmatrix, and M is given as a rank-(3r + 2r ) factorization. The runtime of the algorithm is where m d is the number of observed entries upon calling Osr p (M + N). Proof. Throughout the proof, without loss we let = 1, since we can scale M by 1 , take the claimed step, and scale up by (indeed, examination of Algorithm 2 shows it performs this scaling). Further, let ˆr := r + r , D := M M, D := D + N, and fix ˆr γsubd, ℓ:= ϵ 20d, ρ := ϵ 2γsubd. Also, let b D := clipu(D), let Ω := {ι S(d) | |[ b D]ι| ℓ} and let L := S(d) \ Ω . Note that Lemma 2 shows we can simulate p1-semi-random observation oracle access to b D := clipu(D) (as required by Line 4) and the 4 log(d) calls to p2-semi-random observation oracle access to D (as required by Sparsify in Line 11), given access to Osr p (M + N), proving correctness of p in (4). Let I S(d) correspond to the output of Osr p1(M + N) in Line 4, after removing indices in L and outside U U, which we couple to Ω I, the result of calling Op1([M + N]U U) (i.e. a p1-semi-random observation oracle where all probabilities are p1), after removing indices in L . We now describe our algorithm. Define Wρ as in (9), and for w RI, define b D(w) as in (8). Consider the following optimization problem over Wρ S, where S := {S Sd d | S F 1 + ϵ}: min w Wρ,S S F(w, S) := v, w + CH η slift b D(w) S , where v RI has vι = [ b Dι]2 for all ι I, C := ˆr 2γsub , η := ϵ 24C log(2d). (20) We compute (w, S), an ϵ 12-approximate minimizer to F, and V , satisfying |V F(w, S)| ϵ 12. Pseudocode for computing such a V with failure probability 1 δ 5 is provided in Line 9, and the correctness and runtime of this step are analyzed at the end of this proof. 1. If V > (1 11ϵ 12 ), we call Sparsify on (our semi-random access to) the matrix D = M + N M with parameters U, τ θ 5d , γ γsub + 10 q ˆr θ, p p, and δ δ 5. We then return the new row subset U U. We call this case the drop case. 2. Otherwise, we update to M + ZZ SZZ , where Z Rd 2ˆr is the output of Lemma 4 with parameters δ δ 2, M S, and q 2ˆr. 
We call this case the descent case. We first analyze the drop case. We claim that if V > (1 11ϵ 12 ), it could not be the case that (14) held. To see this, suppose (14) held. Then, with probability 1 δ 5 (due to our lower bound (4) on p), Lemma 11 provides us (w , S ) Wρ S satisfying b D(w ) = S + F , v, w 1 ϵ 2, for S F 1 + ϵ, F op γsubϵ In fact, w is only supported on Ω, the fully random coordinates. By using Fact 1, we conclude that F(w , S ) 1 ϵ + C F op + η log(2d) Therefore, any ϵ 12-approximate minimizer (w, S) to F will satisfy F(w, S) (1 5ϵ 6 ). Finally, since we assumed our computed value V satisfies |V F(w, S)| ϵ 12, this contradicts V > (1 11ϵ 12 ) as claimed. Hence, (14) indeed did not hold, so by Lemma 12 and our bounds in (4), the preconditions of Lemma 14 are met with ϵ ϵ 36, proving correctness of the drop case with the stated parameters. Next, suppose we are in the descent case, meaning we have (w, S) Wρ S satisfying F(w, S) (1 ϵ). By nonnegativity of the second summand in the definition of F, this shows D b D, b D(w) E = v, w 1 ϵ. Moreover, because v, w dρ = ϵ 2γ by the definition of Wρ, we have b D(w) S op H η slift b D(w) S F(w, S) + ϵ 2γ C ϵ where we used Fact 1 in the first inequality. Lemma 7 then yields correctness of the descent case. We next bound the runtime, which is the sum of three costs: the optimization solver, the cost of obtaining an approximate value V , and the cost of Lemma 7 in the descent case. It is straightforward to check that in the context of Proposition 6, which bounds the runtime of our optimization solver, k = ϵ 2γsub , Rη = O ˆr log(d) , and T = O ˆr2 log(d) Therefore, by our choices of C, η from before, the cost of Proposition 6 for δ 5 failure probability is O m + dˆr3 log(d) γ11.5 sub ϵ8 log1.5 d δγsub Next, consider the cost of obtaining V satisfying |V F(w, S)|. We can explicitly compute the term v, w , and because ηC log(2d) ϵ 24, it suffices to obtain an ϵ 24C = ϵγsub 12ˆr1/2 -additive approximation to b D(w) S op by Fact 1. The guarantees of Proposition 6 give that b D(w) S op 2ϵ ˆr1/2 (see (21)). Therefore, we may call Lemma 4 with q 1, ϵ γsub 24 , and δ δ 6, in time which does dominate the above, using the rank-O( ˆr2.5 log(d) γ6 subϵ5 ) factorization of S given to us by Proposition 6. Additionally, the runtime of Lemma 7, i.e. the cost of running Lemma 4 with constant ϵ and q = 2ˆr, does not dominate Proposition 6. The claimed runtime follows by combining these bounds. Finally, the failure probability comes from a union bound over Lemma 11, Lemma 14, the subroutine in Proposition 6, and our two calls to Lemma 4, once for the descent step and once to compute the value V . We set each of the failure probabilities of these five randomized steps to δ By iterating on Proposition 1, we provide our full partial completion method, Algorithm 3. Corollary 1. Let M, M Sd d be rank-r , and suppose M M F . Let α 250 and δ, γtot (0, 1 10). Algorithm 3 uses one call to Osr p (M + N), where for appropriate polynomials, N d poly( r γtot log( d δ )) α1+o(1) , p = poly( r γtot log( d and computes a rank-r αo(1)-matrix M , such that with probability 1 δ, M and M are α -close on a γtot-submatrix. The runtime of the algorithm is mαo(1) poly( r γtot log( d δ )), where m is the number of observed entries upon calling Osr p (M + N). Proof. We claim that throughout the course of Algorithm 3, the preconditions of Proposition 1 are always met, i.e. 
γsub is always an upper bound on the number of excluded rows U \ S, γdrop is always an upper bound on the number of dropped rows [d] \ U, and M always has rank ˆr r . To show correctness of the rank bound, the rank r of our iterate only increases on drop steps, where Proposition 1 ensures r + r at most triples, so ˆr r (after the update on Line 11) remains a correct bound on r throughout the algorithm. Also, ˆr ˆrtot throughout, since there are R descent steps. Algorithm 3: SRPartial Completion(Osr [0,1](M + N), M, , r , α, γtot, δ) 1 Input: Osr [0,1](M + N) where M Sd d, rank-r M Sd d such that M M F , α 250, γtot, δ (0, 1 2 Ndrop 0, Ndescent 0, R log α, ϵ exp( 8 log α), K 288 log α ϵ , ˆrtot (2r ) 3R 3 ˆr 2r , U [d], γtemp ϵγ2 tot 106ˆrtot RK log(d), γsub γtemp 160RK log(d) 2 R, γdrop 0 4 θ 25600ˆrtot(RK log(d))2 γ2 tot , κ 16γ2 tot R2K24R ˆrtotγ2 temp 5 while Ndrop < RK and Ndescent < R do 6 if Drop Or Descent(Osr [0,1](M + N), M, U, γdrop, γsub, , ˆr r , r , δ R(K+1), ϵ, θ, κ) = ( Drop , U ) then 7 U U , γdrop γdrop + γtot 2RK , exp( ϵ 288) , Ndrop Ndrop + 1 10 ( Descent , M ) Drop Or Descent(Osr [0,1](M + N), M, U, γdrop, γsub, , ˆr r , r , δ R(K+1), ϵ, θ, κ) 11 ˆr 3ˆr, γsub 2γsub, M M , 10ϵ 1 4 , Ndescent Ndescent + 1 14 Return: M We next prove correctness of the γsub parameter throughout the algorithm. By Proposition 1, this parameter doubles each time Algorithm 2 runs a descent step, which is correctly handled by Line 11. Moreover, the only other time γsub increases is by 4 θκ each time Algorithm 2 runs a drop step, via the first part of Proposition 1. Note that the algorithm terminates once RK drop steps or R descent steps have run, and by our choices of θ, κ, we have 4 θκ RK γtemp 160RK log(d) 2 R, which is our initial setting of γsub in Line 3. Therefore, since the worst-case growth of the γsub parameter over RK drop steps and R descent steps is if we encounter all of the drop steps first (and then repeatedly double R times), the value of γsub is correct throughout the algorithm. Next, we show that the γdrop parameter remains correct throughout the algorithm. Note that the γsub parameter is never larger than γtemp 160RK log(d) γtot 160RK log(d). Therefore, the number of dropped rows in each drop step, by Proposition 1 and our choice of θ, is at most γtot 160RK log d + d log d γtot Therefore the update to γdrop in Line 7 is correct. We have also shown that over the course of the algorithm, we never drop or exclude more than γtotd 160RK log(d) + γtotd 2 γtotd rows in total. Additionally, we verify the second condition in (4) holds with the largest value of γsub: γtemp 160RK log(d) ϵ 36θ. Finally, after RK drop steps Proposition 1 shows that we have made an exp( ϵ 288 RK) = 1 α factor progress on the closeness parameter . Similarly, after R descent steps, since 10ϵ 1 4 ϵ 1 8 by our condition α 250, we have made an ϵ R α factor progress as well. The failure probability comes from union bounding over the at most R(K + 1) calls to Drop Or Descent. For the runtime bound, by observation all of the parameters ˆr, γ 1 sub , ϵ 1 in Proposition 1 are bounded by a polynomial in r αo(1) throughout the algorithm. The correctness of the lower bound on p and the upper bound on N similarly follow from Proposition 1. C Computing adaptive reweightings In this section, we develop an optimization subroutine for solving the subproblems (20) arising in our partial matrix completion method. 
We use an approximate Frank-Wolfe framework to reduce our optimization to a sequence of linear optimization problems over our constraint set, which we characterize as a matching polytope over a bipartite graph. We show that for the vector block of our variable, we can cast these linear optimization problems as computing maximum weight bipartite matchings, as explained in Appendix C.4. To efficiently implement the matrix block steps, we apply standard tools from the numerical linear algebra and approximation theory literature in Appendix C.3. We now define the problem we study in a self-contained way. Throughout this section, we fix a subset I S(d) and parameters k R>0, ϵ (0, 1), and define: x RI 0 | max i [d] j [d]|ι=(i,j) I xι k S := S Sd d | S F 1 + ϵ , Z := X S. In other words, S is the set of short (Frobenius-norm bounded) matrices. Further, if x X is viewed as a symmetric d d matrix, the constraints on X enforce that x has bounded row and column sums. Clearly x X = x 1 k, so X is a subset of an ℓ1 ball of radius k. However, we can more specifically capture X as a scaled bipartite matching polytope. Fact 3. Let I S(d) be identified with a d d bipartite graph G = (L R, E), where for (i, j) [d] [d], we place an edge from the ith vertex in L (denoted ui) to the jth vertex in V (denoted vj) if and only if (i, j) S(d). Let M be the bipartite matching polytope of G, i.e. f RE 0 | max ui L vj R|(ui,vj) E f(ui,vj) 1 k X = M for X in (22), where we map coordinates in I to edges in E in the canonical way. By the characterization in Fact 3, we efficiently implement iterations of our Frank-Wolfe method by calling an approximate maximum weight matching algorithm by [22]. We also use the structure of the matching polytope to prove a key regularity property on our objective in Appendix C.2. We finally require the definition of a bounded submatrix, whose entries lie in a given range. Definition 5 (Bounded submatrix). Let D Sd d and I S(d). We say that (D, I) is an (ℓ, u)-bounded submatrix if for all ι I, |[D]ι| [ℓ, u]. When D and I are clear from context, we let D be the operator on RI such that ι I wιDι. (24) With these definitions in hand, our goal in this section is to optimize the following jointly convex function over (x, S) Z, for an (ℓ, u)-bounded submatrix (D, I) and parameters C, η R>0, named Fprog-flat to denote its decomposition into progress and (approximate) flatness components: Fprog-flat(x, S) := 1I, x + CH η slift D x where vι = [D]2 ι for all ι I, and x v denotes entrywise division. (25) Note that (25) is exactly the problem in (20), up to reparameterizing x w v. It may be helpful to think of the parameter settings for (25) in the context (20), e.g., C r d , ℓ 1 d, k 1 γ , where r and γ are the rank and closeness parameter of a current iterate. C.1 Approach based on the Frank-Wolfe method To efficiently minimize f = Fprog-flat (defined in (25)) over Z (defined in (22)) we apply a variant of Frank-Wolfe or conditional gradient methods due to [35]. This method essentially reduces the problem to approximately solving minz Z f(q), z for different values of q Z. The algorithm is given below as Algorithm 4.10 Below we define the curvature constant, the key quantity which bounds the convergence of the method, as well as the main theorem from [35] which we use. 
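Concretely, the instance of this template that we solve is the following compact restatement of (25) over the product domain from (22), under the soft-max potential H_η and the symmetric lifting slift defined in the body of the paper; the negative sign on the linear term (so that minimization rewards correlation with the observations) matches how (20) is used in the proof of Proposition 1.

```latex
\min_{(x,\mathbf{S}) \in \mathcal{X} \times \mathcal{S}}
  F_{\text{prog-flat}}(x,\mathbf{S})
  := -\langle \mathbf{1}_I,\, x\rangle
     + C\, H_\eta\!\left(\mathrm{slift}\!\left(\mathbf{D}\!\left(\tfrac{x}{v}\right) - \mathbf{S}\right)\right),
\quad \text{where} \quad
\mathcal{X} := \left\{ x \in \mathbb{R}^{I}_{\ge 0} \;\middle|\;
   \max_{i \in [d]} \sum_{j \in [d] \,:\, \iota = (i,j) \in I} x_\iota \le k \right\},
\;\;
\mathcal{S} := \left\{ \mathbf{S} \in \mathbb{S}^{d \times d} \;\middle|\; \|\mathbf{S}\|_F \le 1 + \epsilon \right\}.
```

Here v_ι = [D]_ι² for every ι ∈ I, x/v denotes entrywise division, and Z := X × S, matching (22) and (25).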
Algorithm 4: Approx Frank Wolfe(f, D, T, δ) (Algorithm 2 of [35] restated) 1 Input: Differentiable, convex f : Rd R and compact, convex D Rd such that the curvature constant of f with respect to D is Cf, iteration count K 1 and approximation tolerance δ 0 2 Compute arbitrary z(0) D 3 for t = 0, . . . , T 1 do 4 γ(t) 2 t+2 5 Compute s(t) D such that f(z(t)), s(t) mins D f(z(t)), s + δγ(t)Cf,D 2 6 z(t+1) (1 γ(t))z(t) + γ(t)s(t) 8 return z(T ) Definition 6 (Curvature constant (restated from [35])). The curvature constant Cf,D of convex and differentiable f : Rd R with respect to compact domain D Rd is Cf,D := sup z0,z1 D,γ [0,1],zγ=z0+γ(z1 z0) 2 γ2 (f(zγ) f(z0) f(z0), zγ z0 ) . Proposition 4 (Approximate Frank-Wolfe, Theorem 1 of [35] restated and specialized). Let f : Rd R be convex and differentiable, let D Rd be compact and convex, and let z argminz Df(z). For any K 1 and δ 0, z(T ) = Approx Frank Wolfe(f, D, T, δ) (Algorithm 4) satisfies f(z(T )) f(z ) 2(1 + δ)Cf,D To facilitate applying Proposition 4 to efficiently minimize Fprog-flat (25) over Z (22) we provide two general lemmas. The first is Lemma 15 below, which implies that to implement Line 5 of Algorithm 4 it suffices to compute approximate gradients and and then approximately solve constrained linear optimization problems for these approximate gradients. Lemma 15 (Gradient approximation). For compact D Rd, g, ˆg Rd, and , M 0 suppose that | g ˆg, s | and | g, s | M for all s D If ˆg, ˆs (1 α) minx D ˆg, s + for ˆs D and α [0, 1] then g, ˆs min x D g, s + αM + 3 . Proof. For s argmins D g, s the assumptions imply that g, s = ˆg, s + ˆg g, s (1 α) ˆg, s + + . The result then follows as (1 α) ˆg, s = (1 α) g, s + (1 α) g ˆg, s g, s + αM + . The second lemma is the following straightforward upper bound on the curvature constant (that is perhaps related to the relationship to smoothness given in Lemma 7 of [35]). It shows that to upper bound the curvature constant it suffices to upper bound 2f(z)[w, w] for all w, z D. Lemma 16 (Curvature bound). For convex, twice-differentiable f : Rd R and compact D Rd if 2f(z)[w, w] M for all w, z D then the curvature constant f with respect to D is at most 2M. 10We perform minor changes to the names of the variables, choice of offsets, etc. to facilitate and simplify the use of [35] to obtain our results. Proof. Let z0, z1 Z and γ [0, 1] and define zγ := z0 + γ(z1 z0) for all γ [0, 1]. Note that f(zγ) f(z0) f(z0), zγ z0 = γ f(z0), z1 z0 + Z γ 0 f(zt), z1 z0 dt 0 f(zt) f(z0), z1 z0 dt . Additionally, f(zt) f(z0), z1 z0 = Z t 0 2f(zα)[z1 z0, z1 z0]dα t sup z D 2f(z)[z1 z0, z1 z0] . Now, since f is convex, 2f(x) is positive semidefinite for all x D and so 2f(z)[z1 z0, z1 z0] 2 2f(z)[z1, z1] + 2 2f(z)[z0, z0] 2M . Combining these bounds yields the desired bound of f(zγ) f(z0) f(z0), zγ z0 Z γ 0 2Mtdt = Mγ2 . In the following Appendix C.2 we bound the curvature constant of the particular f = Fprog-flat defined in (24) over Z and provide additional regularity bounds on this f over Z. Then in Appendices C.3 and C.4, we discuss the implementation of a single step of Algorithm 5 (specifically, Line 5 by satisfying the conditions needed to apply Lemma 15). To fix some notation, we let zt := (xt, St) be the tth iterate of Algorithm 5 (z(t) in the code). We also let Yt := exp 1 ηslift D xt Tr exp 1 ηslift D xt be the gradient of H η(slift(D( x v ) S)) before applying the chain rule, see Fact 1. 
Because of the block-separable structure of slift, we have that Yt = 1 Tr(Lt) + Tr(Rt) Lt 0d d 0d d Rt where Lt := exp 1 St , Rt := exp 1 For notational simplicity, we finally define Bt := Lt Rt, Tt := Tr(Lt) + Tr(Rt). We observe that by definition of the objective in (25), we have [ xf(zt)]ι = 1 + C Tt Bt, Eι for all ι I, Sf(zt) = C Finally, gt := (ht, Ht) denotes the blocks of the approximate gradient computed in iteration t and vt := (wt, Wt) denotes the blocks of the approximate linear objective optimizer in iteration t, as suffice for applying Lemma 15. Throughout Appendices C.3 and C.4, we fix a single iteration t and drop t from all subscripts, to decrease notational clutter. C.2 Curvature constants and regularity Here we bound the curvature constant of Fprog-flat as defined in (25) over Z defined in (22). To do so, we first provide a bound, Lemma 17, on the operator norm of D(x) defined in (24), for x X. The proof follows from a previously established fact that the operator norm of any matrix (even if rectangular) is upper bounded by the maximum ℓ1 norm of any its rows or columns (see e.g., Lemma B.4, [19]); however, we directly give a self-contained proof for convenience. Lemma 17. For all x X (defined in (22)), we have D(x) op ku d for D as defined in (24). Proof. For all z Rd, the Cauchy-Schwarz inequality and the fact that D(x) is symmetric imply i,j [d] [D(x)]ij zizj i,j [d] z2 i [D(x)]ij i,j [d] z2 j [D(x)]ij i [d] z2 i X Additionally, for all i [d] since (D, I) is an (ℓ, u)-bounded submatrix and x X, [D(x)]ij | = X j [d]|ι=(i,j) I [D(x)]ij ku Combining shows that |z D(x)z| ku d z 2 2 for all z Rd and yields the result as D(x) is symmetric. Lemma 17 improves upon the trivial bound on D(x) op for x with x 1 k. Note that if x has k on the index corresponding to the largest entry of DI, then D(x) op ku, which is substantially larger than ku d bound given in Lemma 17. The improved bound for x X that we show in Lemma 17 is key to obtaining bounds on the curvature constant of Fprog-flat over Z which are sufficient for obtaining our desired runtimes (i.e., scaling polynomially only in r , not d). We provide the curvature bound as well as different bounds on the regularity of Fprog-flat over Z below. Lemma 18. For all z, z Z with z = (x, S) for x X and S S as defined in (22) and for f = Fprog-flat defined in (25), | X f(z ), x | C ku ℓ2d + k, | Sf(z ), S | 2C, and 2f(z )[z, z] C Consequently, CFprog-flat,Z 2C Proof. Let z, z Z be arbitrary. Let A be the linear operator such that A(x, S) = slift D x v S . Let G be the gradient of H η at Az and let H be the Hessian of H η at Az . Note that the chain rule, Fact 1, and that the trace and operator norm are dual imply that, for z = (x, S), | f(z ), z | = | 1I, x + C G, Az | x 1 + C G tr Az op k + C Az op , 2f(z )[z, z] = CH[Az, Az] C η Az 2 op . However, for any (x, S) Z = X S, op + 1 + ϵ ku ℓ2d + 2, (28) where the first inequality used the triangle inequality and that the operator norm is bounded by the Frobenius norm, and the second used the range of v, that ϵ (0, 1), and Lemma 17. Combining these inequalities yields 2f(z )[z, z] C ℓ2d 2. Applying the same argument with S = 0 and x = 0 yields | X f(z ), x | C ku ℓ2d +k and | Sf(z ), S | 2C. Finally, CFprog-flat,Z 2C follows from Lemma 16. 
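As a self-contained illustration of the Frank-Wolfe template in Algorithm 4 and the role of the curvature bound from Lemma 16, the sketch below (a generic toy instance, not the solver of this section) runs the method with step sizes γ_t = 2/(t+2) over a Frobenius-norm ball, where the linear minimization oracle has the closed form −R·G/‖G‖_F for a gradient G. In our solver the oracle is instead implemented blockwise and only approximately, which is why Proposition 4 tolerates additive error in Line 5.

```python
# A generic Frank-Wolfe sketch in the style of Algorithm 4: gamma_t = 2/(t+2) steps
# with a linear minimization oracle over a Frobenius-norm ball.
import numpy as np

def frank_wolfe_ball(grad, z0, radius, T):
    z = z0.copy()
    for t in range(T):
        gamma = 2.0 / (t + 2)
        G = grad(z)
        s = -radius * G / max(np.linalg.norm(G), 1e-12)   # exact LMO over the F-ball
        z = (1 - gamma) * z + gamma * s                   # convex combination step
    return z

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    d = 30
    A = rng.standard_normal((d, d))
    target = (A + A.T) / 2
    f = lambda Z: 0.5 * np.linalg.norm(Z - target) ** 2
    grad = lambda Z: Z - target
    Z_hat = frank_wolfe_ball(grad, np.zeros((d, d)), radius=np.linalg.norm(target), T=200)
    print(f(Z_hat))   # decreases at the O(1/T) rate toward the minimum (here ~0)
```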
C.3 Matrix step implementation In this section, we discuss how to approximate the S block of Fprog-flat(z) := Fprog-flat(x, S) within the error allowed by Line 5 and Lemma 15, where (as discussed earlier) we denote our approximation by H Sf(x, S) = C T B, for T = Tr(L) + Tr(R), B = L R, following notation in (26). We then give implementation details for computing W min M S H, M , within the error threshold required by Line 5. Our approximately-optimal step matrix W will have the convenient property of coming with an explicit low-rank representation, facilitating faster implementation of future iterations of our subproblem solver. We will give approximation guarantees in a mixed multiplicative-additive format, whose conversion to additive error is facilitated by Lemma 15. The main result of this section is the following. Lemma 19 (S step). There is an algorithm which takes as input x X, S S given as a rank-r factorization, R 1 v ) S) op, [0, C], and α, δ (0, 1), and computes W S as a rank-q factorization satisfying W, H (1 α)argmin W S W , H + , (29) for H Sd d with | H SFprog-flat(x, S), S | for all S S, (30) with probability 1 δ. Moreover, q = O( C2 2 ), and the algorithm runs in time (|I| + dr) R Approximating the gradient. We begin by bounding the complexity of matrix-vector access to a matrix e B B L R, with a given tolerance for approximation error. Ultimately we will use the approximate gradient to implement the update to S in our algorithm. Our access is based on a standard polynomial approximation to the exponential which is well-known in the literature. Lemma 20 ([2]). For α 0 and R 0, there is a polynomial p of degree O(R q α) such that | exp(x) p(x)| α exp(x) for all x [ R, R]. Applying Lemma 20 twice directly yields our approximation e B. Corollary 2. Let x X, S S satisfy 1 v ) S) op R, and let 0. There is a matrix e B such that, for L := exp( 1 v ) S)) and R := exp( 1 v ))), and defining T := Tr(L) + Tr(R), B := L R, we have e B tr (1 + 8C )T, as well as D e B B, S E for all S S, Tmv(e B) = O (|I| + Tmv(S)) R Proof. It suffices to provide matrices e L, e R such that Tmv(e L) + Tmv( e R) = O (|I| + Tmv(S)) R 8C Tr(L), e R R tr Indeed, letting e B := e L e R, the bound on Tmv(e B) is then immediate, and by Hölder s inequality (with S op S F 2) and the triangle inequality, we also have the desired C T D e B B, S E 2C e B B tr 2C e L L tr + e R R tr by the definition of T = Tr(L) + Tr(R). We only discuss e L, as the case of e R is symmetric. We let e L = p( 1 v ) S)), where p is the polynomial from Lemma 20 with α 8C . The bound on Tmv(e L) follows from the degree of p, and the fact that Tmv(D( x v )) = O(|I|). Moreover, letting {λi}i [d] [ R, R] be the eigenvalues of 1 v ) S), we indeed have e L L tr = X i [d] |p(λi) exp(λi)| i [d] exp(λi) = Finally, the claim e B tr (1 + 8C )T follows from the triangle inequality. Approximating the S step. We now build upon the access afforded by Corollary 2 to compute W, our approximate solution to min M S H, M , where H := C T e B is our gradient approximation. Specifically, we use the following standard tool with Lemma 4 to provide our approximate linear optimization implementation over S for any matrix supporting vector multiplication access. Lemma 21. Let α, δ (0, 1) and M Sd d. There is an algorithm which returns F [(1 α) M F , (1 + α) M F] with probability 1 δ in time O(Tmv(M)α 2 log d Proof. Let Q Rk d with independently uniform random unit vector rows in Rd scaled down by 1 k for appropriately large k = Θ(α 2 log d δ ). 
By the Johnson-Lindenstrauss lemma of [20], with probability 1 δ, we have for all i [d] that [MQ ]i: 2 2 h (1 α) Mi: 2 2 , (1 + α) Mi: 2 2 i . Therefore, summing for all i [d] proves that Tr(QM2Q ) = F satisfies the desired bound. We can compute MQ explicitly in time k Tmv(M), and computing the Frobenius norm of MQ Rd k does not dominate the stated running time. Lemma 22. Let α, δ (0, 1), 0, let H Sd d have H tr τ, and let W := argmin W S W, H . There is an algorithm which with probability 1 δ returns W = ZRZ satisfying R Sq q and Z Sd q, with W, H (1 α) W , H + , q = O τ 2 Proof. Note that in closed form, we have W = (1 + ϵ) H H F , and the optimal objective value is W , H = H F. We first compute F using Lemma 21 such that H F F 1 + α within the allotted time and failure probability δ 2. Then, letting W (1 + ϵ) H F S, we have W , H W , H = H 2 F F + (1 + ϵ) H F α 2 H F α W , H . (33) Next, let W ZZ W ZZ where Z is the result of Lemma 4 with q 9τ 2 2, and failure probability δ 2. We output R := Z W Z and Z, which can be computed within the allotted time. The result follows by combining (33) with W, H W , H W W op H tr 3τ 2 σq+1(W ) . Here, we used that W F 2 implies 9τ 2 2 2 singular values of W are larger than 2τ We make one slight extension to Lemma 22 under implicit access to H. Corollary 3. In the setting of Lemma 22, suppose H = 1 A e B for an unknown constant A. There is an algorithm which returns W = ZRZ with R Sq q and Z Sd q satisfying the guarantee in (32) with probability 1 δ in time O Tmv(e B) 1 Proof. We follow the steps of Lemma 22 exactly, except we instead let W (1 + ϵ) e B F where F satisfies e B F F (1 + α 8 ) e B F, again with probability 1 δ 2, using Lemma 21. Because the computation in Lemma 21 is scale-invariant, the distribution of W remains unchanged from Lemma 22, and all remaining steps are the same as before. We combine the developments of this section to prove Lemma 19. Proof of Lemma 19. We define H = C T e B as in (31) (using e B defined in Corollary 2), and take the step given by Lemma 22 and Corollary 3. Recall from Corollary 2 that H op (1 + 8C )C 2C, so we can set τ = 2C in Lemma 22. Moreover, in the context of Corollary 3, Tmv(e B) = O (|I| + dr) R The runtime bound then follows from Corollary 3. Finally, the rank bound q follows from (32). C.4 Vector step implementation In this section, analogously to Appendix C.3, we discuss how to approximate the X block of f(z) with a vector h, and give implementation details for computing w min v X h, v . We crucially use the characterization in Fact 3, which implies this is a weighted matching problem. The main result of this section is the following. Lemma 23 (X step). There is an algorithm which takes as input x X, S S given as a rank-r factorization, R 1 v ) S) op, [0, C], and α, δ (0, 1), and computes y X satisfying y, h (1 α)argminy X y , h , (34) for h RI with | h x Fprog-flat(x, S), x | for all x X, (35) with probability 1 δ. The algorithm runs in time Rk2C2 log1.5( Rdk C δ ) 2 + |I| Approximating the gradient. Recall that [ xf(x, S)]ι = 1 + C T B, Eι for all ι I, where again B = L R and T = Tr(L) + Tr(R) as in Appendix C.3. To produce our approximation h xf(x, S), we use the following helper guarantee. Lemma 24 (Lemma 20, [38]). Let ϵ, ϵ , δ (0, 1), {Ai}i [n] Sd d, R log 1 ϵ , and let M Sd d satisfy M op R. 
There is an algorithm which computes {Vi}i [n] R such that for all i [n], |Vi Ai, exp(M) | ϵ |Ai|, exp(M) + ϵ Tr(|Ai|)Tr exp(M) with probability 1 δ in time R log1.5( Rnd i [n] Tmv(Ai) With Lemma 24 established, our gradient approximation follows straightforwardly, Corollary 4. Let x X, S S satisfy 1 v ) S) op R, and let 0 and δ (0, 1). We can compute h RI in time (|I| + Tmv(S)) Rk2C2 log1.5( Rdk C such that with probability 1 δ, | h xf(x, S), x | for all x X. Proof. Define T as in Corollary 2 throughout the proof. We let h RI satisfy h I = 1 + C(Vι Uι) T for all ι I, for V, U RI to be defined, which we can compute within the allotted time. Since x 1 k for all x X, it suffices to show that h xf(x, S) To this end, throughout this proof let M := 1 v ) S), and define V ι := exp(M), Eι , U ι := exp( M), Eι , for all ι I. By our choice of hι, we have |[h xf(x, S)]ι| = C T (|Vι V ι | + |Uι U ι |) . We take {Vι}ι I to be the approximations given by Lemma 24 with the set of matrices {Eι}ι I and M, δ δ 2, and ϵ, ϵ 4k C ; we define {Uι}ι I identically, except we apply Lemma 24 with M instead of M. Note that these approximations yield C T (|Vι V ι | + |Uι U ι |) C T (ϵ + ϵ )Tr(|Eι|) (Tr exp(M) + Tr exp( M)) as desired. The first inequality used the guarantee in Lemma 24 with Hölder s inequality, and the second used the definition of T = Tr exp(M) + Tr exp( M) and Tr(|Eι|) = 2. Finally, the runtime follows directly from the bound in Lemma 24. Approximating the x step. Given our estimate h from Corollary 4, our approximate linear optimization oracle follows from a known result on approximating maximum weighted matchings. Proposition 5 (Theorem 3.12, [22]). Let G = (L R, E) be a bipartite graph, let w RE be a set of weights, and let m := |E|. There is an algorithm which computes f M, defined in (23), such that w, f (1 α) minf M w, f , in time O( m We remark that Proposition 5 is stated in [3] for maximization of linear objectives in nonnegative weight vectors w, i.e. maximum weight matchings. However, it is straightforward to see that this implies the same result for minimization of linear objectives in arbitrary cost vectors w RE, by first dropping any positive coordinates (which can only increase the objective w, x when x is coordinatewise nonnegative), and then negating the result of [22]. Because our access to our cost vector w of interest is implicit, we make a simple observation which extends Proposition 5. Corollary 5. In the setting of Proposition 5, let C R 0, and suppose w = 1 Av for an unknown constant A > 0, and an explicit vector v Rn. The guarantees of Proposition 5 hold true for inputs (w, G, M) if run on inputs (v, G, C M) instead. Proof. This follows directly from scale-invariance of the guarantees in Proposition 5. We now use these developments to give a short proof of Lemma 23. Proof of Lemma 23. We define hι = 1 + C(Vι Uι) T entrywise (using notation in Corollary 4), and take the step given by Proposition 5 and Corollary 5. The runtime follows immediately from these results. C.5 Runtime analysis for subproblem solver In this section, we put together the pieces and provide a runtime bound for approximately solving (25). We first provide Lemma 25 which provides the bounds for implementing Line 5 for Z and then use this to provide the main result for approximately solving, (25), Proposition 6. Lemma 25. 
There is an algorithm that for input z = (x, S) X S where S is given as a rank-r factorization, R 1 v ) S) op, [0, C], and δ (0, 1) outputs ˆz Z with ˆz = (y, W) X S where W is output as a rank O( C2 2 ) factorization such that with probability 1 δ, Fprog-flat(z), ˆz min z Z SFprog-flat(x, S), z + Rk2C2 log1.5( Rdk C + |I| k + Cku Proof. Applying Lemma 19 and Lemma 23 yields that for any αS, αX (0, 1] and [0, C] (which we set later) we can obtain with probability 1 δ (by a union bound) W S as a rank-q factorization for q = O( C2 2 ) and H Sd d satisfying (Lemma 19) H, W (1 αS) min S S H, S + and | H SFprog-flat(x, S), z | for all S S , and y X and h RI satisfying (Lemma 23) h, y (1 αX ) min x X h, x and | h X Fprog-flat(x, S), x | for all x X , Rk2C2 log1.5( Rdk C Conditioning on these events and applying Lemma 15 and Lemma 18 yields that SFprog-flat(x, S), W min x X SFprog-flat(x, S), x + αS 2C + 3 . X Fprog-flat(x, S), y min S S X Fprog-flat(x, S), S + αX C ku ℓ2d + k + 3 . Picking αS = 8C , αX = min{ ℓ2d ) 1, 1} and = 24 then yields the result. We now prove a more quantitative variant of Proposition 2. Proposition 6. Let ϵ [0, C] and δ (0, 1), and suppose |[D]ι| [ℓ, u] for all ι I. There is an algorithm computing (x, S), an ϵ-approximate minimizer to (5) with probability 1 δ, in time T |I| + C2T Rηk2C2 log1.5( Rηdk C +O T|I| CηRη for Rη := η 1(2k + ku ℓ2d) and T = 8CηR2 η ϵ , and S is given as a rank-r = O( C2 ϵ2 T) factorization. Proof. Throughout the proof let f := Fprog-flat for simplicity. Our algorithm computes z(T ) = Approx Frank Wolfe(f, Z, T, 2CηR2 η Cf,Z ) (Algorithm 4), where we note that Cf,Z need not necessarily be computed. By Lemma 18 we know that Cf,Z 2CηR2 η and consequently Proposition 4 implies that so long as each iteration of Algorithm 4 is implemented to the desired accuracy, f(z(T )) f(z ) 2Cf,Z T + 2 + 4CηR2 η T + 2 8CηR2 η T + 2 ϵ . To implement the method we apply Lemma 25 with ( , δ) in the lemma set to ( CηR2 η T +2 , δ T ). This suffices to have the desired output with the desired probability by union bound. It remains to bound the computational cost of this algorithm and the rank of the output. Note that CηR2 η T = Ω(ϵ 1) and Rη 1 v ) S ) op for all (x , S ) X S (by Lemma 17, see e.g., (28)). Additionally, if we initialize the method with (0d, 0d d) then we see that the rank of the S component of each computed s(t) is O( C2 ϵ2 ) and that each S component of each z(t) can be maintained as a rank O(t C2 ϵ2 ) factorization. The computational cost is then dominated by the cost of each iteration which, by Lemma 25 is bounded as claimed after adjusting by T. D Postprocessing In this section, we give the second main component of our overall matrix completion algorithm in Algorithm 7. This piece takes as input two matrices M, M Sd d which are close on a submatrix (where we only have noisy semi-random observations from M ). It then performs a sequence of postprocessing steps to produce a matrix f M such that f M is close to M on the full matrix, as measured in the ℓ norm. For the first step of Algorithm 7, we simply reuse our earlier subroutine Sparsify (Algorithm 1), whose guarantees we recall here for convenience. Lemma 13. Let D Sd d, such that there is A U [d] with |U\A| γd, DA A τ. Let s N and p 800 δ ). Then with probability 1 δ, Algorithm 1 returns U U with |U \ U | 4γd log d, and DU U has at most s entries with magnitudes τ per row or column. In Appendix D.1, we prove a structural fact about bounded spanning subsets of low-rank matrices. 
We use this to analyze our fixing step in Appendices D.2 and D.3, where we use regression problems on a subset of columns passing an appropriate test to complete the matrix. D.1 Localizing errors via spanning subsets Before giving our next postprocessing steps, we prove a structural property, which can be stated concisely in the language of volumetric spanners [5]. Following definitions in [7], Lemma 26 shows any n-sized set in Rd admits a 2d-approximate ℓ1-volumetric spanner of size d + 1. Lemma 26. Given {vi}i [n] Rd, there exists S [n] with |S| d + 1 such that for all i [n], with coefficients satisfying P j S |aj| 2d. Proof. Consider the largest volume simplex among the {vi}i [d], say given by {vi}i [d+1] without loss of generality. We claim all the other vectors are contained in the simplex with d + 1 vertices, X j [d+1] vj dvi, for all i [d + 1]. To see this, fix i [d + 1], and let the (d 1)-dimensional plane spanned by the other d points be P. If any other vj lies outside the two planes parallel to P at equal distances, such that one contains vi, then including vj would produce a larger-volume simplex. Taking the intersections of the halfspaces defined by these planes produces the simplex described above. From this, we have the desired property by convexity of the ℓ1 norm, applied to the coefficients defining this simplex. Using Lemma 26, we obtain the following corollary on boundedness of large index subsets. Lemma 27. Let D Sd d be rank-r, and assume each row of D has at most s entries with magnitude τ. There exists S [d] with |[d] \ S| (r + 1)s such that [D]S: , [D]:S 2rτ. Proof. Let UΣU be an SVD of D, and let {ui}i [d] Rr be the rows of U. By Lemma 26, we can find a subset T such that all rows i [d] can be written as j T ajuj, with X j T |aj| 2r. This also implies every row of D can be written as a linear combination of the row vectors {Dj:}j T , such that the ℓ1 norm of the linear combination coefficients is at most 2r. Now, let S = [d]\R, where R consists of all indices i [d] such that |[D]ji| τ for some j T; by assumption, |R| (r +1)s, giving the size bound. Take some row Di: for i [d], and let Di,S be the restriction of this row to columns in S. By applying the given linear combination to the rows {Dj,S}j T , all entrywise at most τ in magnitude, the triangle inequality shows Di,S 2rτ as claimed. D.2 Testing columns In this section, we present the second step of our postprocessing algorithm, where we design a test to identify columns with no large errors. This step uses our structural claim in Lemma 27. Algorithm 5: Test(M, j, ϕ, q, δ) 1 Input: M Rd d, j [d], ϕ 0, q, δ (0, 1) 2 tmax 10 log 1 4 for t [tmax] do 5 Sample T [d] by including each element independently with probability q 6 if minv R|T | M:T v M:j + ϕ v 2ϕ then c c + 1 2tmax then Return: true 9 else Return: false To analyze Algorithm 5, we need the following basic claims which are standard in the literature. Lemma 28. Let rank-r M Rd k be µ-incoherent with d k, and let x span(M). Then for all i [d], |xi| ( µr d )1/2 x 2. Proof. Let U Rd r be an orthonormal basis for span(M) with rows {ui}i [d], so that x = Uv with x 2 = v 2. The conclusion follows from |xi| = | ui, v | ui 2 x 2 rµr Lemma 29. Let V Rd be a µ-incoherent subspace of dimension r, and let BV Rd r be an orthonormal basis for V . Let S [d] have |S| (1 1 3µr)d. Then the following properties hold. For any v V , v S 2 q span([BV ]S:), as a subspace of R|S|, is 3µ-incoherent. Proof. Let A := [BV ] S:[BV ]S: = P i S bib i , where {bi}i [d] are rows of BV . 
By the definition of incoherence, we have that Ir A op Tr(Ir A) = P i [d]\S Tr(bib i ) 1 The lower bound then yields the first statement. Now, the matrix M := [BV ]S:A 1 2 has orthonormal columns, and the same span as [BV ]S:. Furthermore, by incoherence and the fact that A 2 3Ir, all of the rows i S of M have 3 2 [BV ]i: 2 Finally, |S| 2d 3 (since µ 1). Combining these claims proves the second statement. We also include the following application of the matrix Bernstein inequality from [43]. Lemma 30 (Lemma 12, [43]). Let δ (0, 1) and let V Rd be a µ-incoherent subspace of dimension r. Let BV be an orthononormal basis for V , with rows {bi}i [d] Rr. Let S [d] have |S| (1 1 3µr)d. Let T S include each element independently with probability p 100µr δ . Then with probability 1 δ, i T bib i 2p Ir. Now, we are ready to analyze our testing algorithm and prove that it has the desired behavior. Lemma 31. Let M Sd d be rank-r and µ-incoherent, and let M Sd d be rank-r for r r . For s, τ > 0, assume M M has at most s entries with magnitudes τ in each row. Then with probability 1 δ, if we run Algorithm 5 with parameters q, ϕ and j [d] such that , q 1 20µsr , ϕ = p Algorithm 5 has the following behavior. If M :j M:j 4r τ µr , Algorithm 5 returns true. If M :j M:j 10τqdµr , Algorithm 5 returns false. Proof. Let the SVD of M be M = U Σ (U ) . We consider the two claims separately. True case. Suppose that M :j M:j 4r τ µr . We will prove that each run of the inner loop of the algorithm succeeds with probability 4 5. By using Lemma 27 with D M M, and applying our upper and lower bounds on q, there exists S [d] with |S| (1 1 3µr )d such that for all i S, M :i M:i 4r τ. Next, by Lemma 30 and incoherence of M , with probability 9 10, (V :T S) V :T S 2 This means there exists v RT S with v 2 q 2qd , such that (V :T S) v = (V :j) , which also means M :T Sv = M :j. We thus have for this vector v, M:T Sv M:j + ϕ v M :T Sv M:T Sv + M :j M:j + ϕ v 4r τ v 1 + 4r τ µr + ϕ v 4r τ v 2 p |T| + 6r τ µr 20rτ µr 2ϕ, where we used that with probability 9 10, |T| 2qd. By a union bound, each independent test thus increments c with probability 4 5 over the randomness of T, so a Chernoff bound gives the claim. False case. Next, suppose that M :j M:j 10τqdµr , and again consider a single run of the algorithm. Let i0 [d] be the entry such that |M i0j Mi0j| 10τqdµr . First, with probability at least 9 10 over the randomness of the inner loop, for all j T, M i0j Mi0j τ, (37) because by assumption, there are at most s indices j where |M i0,j Mi0,j | τ. Since qs 1 20, there is at least a 9 10 probability that none of these indices is selected in T by a Chernoff bound. As before, with probability 9 10, there is a vector v with v 2 q 2qd such that M :T v = M :j. Also, again, with probability 9 10, we have |T| < 2qd. Now let R [d] be the set of rows i such that [M ]i,T {j} [M]i,T {j} τ . By assumption, each row in T {j} can only remove s indices from R, so |R| d s(|T| + 1) d 2qsd 1 1 10µr Consider any solution v to the problem in Line 6 which increments c. For v to be feasible, we must have v 2, so v 1 2|T|. Further, M:T v M:j 2r τ p Hence, for all i R, [M ]i,T (v v ) M ij Mij + ([M]i,T [M ]i,T )v + [M]i,T v Mij τ + v 1 τ + 2r τ p In the first line, we used that M :T v = M :j by definition. Also, note that M :T (v v ) is in the span of M , so by Lemma 29, the fact that |R| (1 1 10µr )d, and (39), M :T (v v ) 2 M R T (v v ) 2 4τqd Therefore, by incoherence of M , Lemma 28 implies that for all i [d], [M ]i,T (v v ) 4τqdµr . 
(40) However, consider the index i0 defined at the beginning of this case. Similarly to (39), M i0j Mi0j [M]i0,T v Mi0j + [M]i0,T [M ]i0,T v + [M ]i0,T (v v ) qdr τ + τ 2|T| + 4τqdµr < 10τqdµr , where we used (37), (38), (40), and |T| 2qd. This contradicts the assumption that M i0j Mi0j 10τqdµr . We conclude that the problem on Line 6 fails to increment c with probability 2 3, by a union bound. Since we repeat the loop 10 log( 1 δ ) times, with 1 δ probability the test will reject, as claimed. Remark 2. From observation, the proof of Lemma 31 continues to hold if the problem on Line 6 is solved up to accuracy 0.1ϕ. That is, the true case actually proves a stronger bound of 1.9ϕ in (36), so it would continue to pass for any error 0.1ϕ. Similarly, even if we assumed the vector v in the false case achieved a value of 2.1ϕ (instead of 2ϕ), we would still have a contradiction. Finally, we include a claim from prior work which bounds the complexity of implementing Line 6. Lemma 32. Following notation in Lemma 31, for any δ (0, 1), assuming q = O( µr δ )), the optimization problem on Line 6 can be solved to accuracy 0.1ϕ with probability 1 δ, in time O d poly µr log d Proof. For shorthand let M := M:T , k := |T|, and u := M:j in this proof, and assume that k 2qd, which occurs with probability 1 δ 2. We create a linear program, min x Rk+2Ax b c x, where x = v α β for α, β R and v Rk. We encode the constraints Mv u α, v β using 2d + 2k linear constraints on x. Finally, we make the objective c x = α + ϕβ. The conclusion follows by applying the solver of [69], Theorem 1, with failure probability δ 2, since k = O(µr log( d δ )) and we have a O(d + k) O(k) constraint matrix. Algorithm 6: Fix(Osr p (c M), M, T, ϕ, q, δ) 1 Input: Osr p (c M), M Sd d, T [d], ϕ 0, q, δ (0, 1) 2 Sample B T by including each element independently with probability q 4 for b B do 5 if Test(M:T , b, ϕ, q, δ 10d2 ) then G G {b} 7 for j [d] do 8 Sj indices in [d] of observed entries in our call to Osr p (c M) 9 vj argminv R|G|ϕ qd v + MSj Gv c MSj,j 11 e V vertical concatenation of {v j }j [d] 12 return f M = M:G e V D.3 Fixing columns via regression We now complete our postprocessing algorithm. Given an estimate M that is entrywise close to M , except for a small number of potentially large errors in each row and column, we first run Algorithm 5 to identify a set of columns containing no large errors. We then set up regression problems to fix the rest of the matrix (which may initially contain large errors on some entries). Lemma 33. Let M Sd d be rank-r and µ-incoherent, and let c M = M + N for N Sd d with N τ. Let M Sd d be rank-r for r r . For s, τ > 0, assume [M M ]:T has at most s entries larger than τ in each row and column. Assume s d 20µr2 , |T| 1 1 10µr Then with probability 1 δ, if we run Algorithm 6 with parameters q, ϕ such that , q 1 20µsr, ϕ = p then the output satisfies f M M 104q2d2(µr)3τ. Proof. First, by Lemma 27 and applying our upper and lower bounds on q, there is a subset T T with |T | (1 1 3µr)d such that for all j T , M:j M :j 4rτ. Next, by Lemma 31, with probability 1 δ 10, we have B T G. Thus, by applying Lemma 30 with S T and p q, we have that with 1 δ 10 probability, for any j [d], there is a vector uj with uj 2 q 3µr 2qd such that M :Guj = M :j. Also, by Lemma 31, we have with probability 1 δ M :G M:G 10τqdµr. Now fix a j [d] and consider the execution of the loop in Lines 7 to 10 for this index j. Consider uj as a candidate solution to the optimization problem. 
We have qd uj + MSj Guj c MSj,j 2 p qdµr1.5τ + MSj Guj M Sj Guj + M Sj,j c MSj,j 2 p qdµr1.5τ + p |G| uj 2 10τqdµr + τ 100τqd(µr)1.5, where we use that with 1 δ 10 probability, |G| 2qd. Thus, the solution vj in Line 9 must have MSj Gvj c MSj,j 100τqd(µr)1.5, vj 100µ1.5 r. (41) This implies that M Sj Gvj M Sj,j M Sj Gvj MSj Gvj + MSj Gvj c MSj,j + c MSj,j M Sj,j |G| vj 10τqdµr + 100τqd(µr)1.5 + τ 2500q2d2(µr)2.5τ. Next, recall that span(M ) is µ-incoherent. Let B Rd r have orthonormal columns with span(B) = span(M ), with rows {bi}i [d]. By Lemma 30, with probability 1 δ 10d2 , Sj contains a subset S j such that |S j| 2pd and i S j bib i p 2Ir |S j| 4d Ir. Thus, for all vectors z span(M ), z 2 q 4d |S j| z S j: 2. Using µ-incoherence of span(M ) M :Gvj M :j M :Gvj M :j 2 M S j:Gvj M S j:j 2 M S j:Gvj M S j:j 5000q2d2(µr)3τ, where the first line used Lemma 28. Finally, recall from (41) that for vj to be an optimal solution, we must have vj 100µ1.5 r. Since |G| 2qd, we have the desired bound for column j [d]: M:Gvj M :j M:Gvj M :Gvj + M :Gvj M :j |G| vj M:G M :G + 5000q2d2(µr)3τ 104q2d2(µr)3τ. Union bounding all of the failure probabilities then completes the proof. Remark 3. Similarly to Remark 2, we note that the proof of Lemma 33 is tolerant to an error of τqd(µr)1.5 for the problem in Line 9, because this implies (41) holds up to a negligible constant factor, which does not affect correctness of the rest of the proof. Line 9 can again be solved to this accuracy in time O(d poly(µr log( d δ ))), with failure probability δ, as described in Lemma 32. Finally, we put everything together and give our full postprocessing procedure. Proposition 3. Let M Sd d be rank-r and µ-incoherent, and let c M = M + N for N Sd d with N for some 0. Let M Sd d be given as a rank-r decomposition, with r r . Further, for γ (0, 1) suppose M and M are d -close on a γ-submatrix. Finally, assume γ 1 104µr log(d). Then for any δ (0, 1) if p 1 d poly(µr log( d δ )) for an appropriate polynomial, Algorithm 7 uses one call to Op(c M) and with probability 1 δ, outputs f M Sd d such that f M M poly µr log d Also, f M is given as a rank-poly(µr log( d δ )) factorization. The algorithm runs in m poly(µr log( d δ )) time, where m is the number of observed entries upon calling Osr p (c M). Proof. Let τ = 1000 µr1.5 log d. By applying Lemma 6, [43], and the assumed bound on γ, there is A [d] with |[d] \ A | d 300µr log d such that [M c M]A A [M M ]A A + τ. Now we run Algorithm 1 with p 108µ2r2 log2( d δ ) d , s d 105µ2r2 log d. These satisfy the conditions in Lemma 13, so with probability 1 δ 10, we obtain S [d] with |S| d(1 1 300µr), and [M c M]S S has s entries with magnitude larger than τ in each row or column, implying [M M ]S S has s entries with magnitude larger than 2τ per row or column. Next, we run Algorithm 6 on the submatrix with rows indexed by S, i.e. we use the observations from c MS: and input MS:. The subset that is input to Algorithm 6 will be T S, and we use q = 3000µr δ ). Note that we have q 1 20µsr. Also by construction, |S| d(1 1 300µr) and by Lemma 29, M S: is 3µ-incoherent. Thus, Lemma 33 yields, with probability 1 δ 10, f M such that [f M M ]S: 104q2d2(µr)3τ 1014(µr)6.5 log2 d Finally, we transpose the matrix and then run Algorithm 6 again to complete the whole matrix, i.e. with inputs M f M and T S. Overall, we compute rank-poly(µr log( d δ )) f M such that f M M poly µr log d Finally, by convexity of norms, we can symmetrize f M 1 2(f M + f M ) and this will only improve the bound. 
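To illustrate the optimization primitive behind Line 6 of Algorithm 5 and Line 9 of Algorithm 6, the sketch below encodes an objective of the form $\min_v \|Mv - u\|_\infty + \phi\|v\|_\infty$ as a linear program, in the spirit of Lemma 32 and Remark 3, and solves it with an off-the-shelf LP solver; the paper's runtime bounds instead rely on the solver of [69]. The function name, the toy dimensions, and the small demonstration that re-synthesizes one corrupted column from a set of trusted columns are ours, and are only meant to convey the shape of the Fix step.

```python
# A minimal sketch (not the paper's implementation) of the inf-norm regression subproblem:
#     min_v ||M v - u||_inf + phi * ||v||_inf,
# written as a linear program and handed to scipy/HiGHS.
import numpy as np
from scipy.optimize import linprog

def linf_regression(M, u, phi):
    """Return (v, objective value) for min_v ||Mv - u||_inf + phi * ||v||_inf."""
    n, k = M.shape
    # Variables x = (v, alpha, beta), with alpha >= ||Mv - u||_inf and beta >= ||v||_inf.
    c = np.concatenate([np.zeros(k), [1.0, phi]])
    A_ub = np.block([
        [ M, -np.ones((n, 1)), np.zeros((n, 1))],            #  (Mv - u)_i <= alpha
        [-M, -np.ones((n, 1)), np.zeros((n, 1))],            # -(Mv - u)_i <= alpha
        [ np.eye(k), np.zeros((k, 1)), -np.ones((k, 1))],    #   v_j <= beta
        [-np.eye(k), np.zeros((k, 1)), -np.ones((k, 1))],    #  -v_j <= beta
    ])
    b_ub = np.concatenate([u, -u, np.zeros(2 * k)])
    bounds = [(None, None)] * k + [(0, None), (0, None)]     # v free, alpha, beta >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:k], res.fun

# Toy usage mirroring the shape of Algorithm 6: re-synthesize a corrupted column from a
# trusted column set G, fitting only on rows S_j where (noiseless) observations exist.
rng = np.random.default_rng(1)
d, r = 200, 3
M_star = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
M = M_star.copy()
j = 7
M[:, j] += 10.0 * (rng.random(d) < 0.05)            # sparse large errors in column j
G = [c_ for c_ in range(d) if c_ != j]              # columns we trust
S_j = np.flatnonzero(rng.random(d) < 0.3)           # observed rows for column j
v, _ = linf_regression(M[np.ix_(S_j, G)], M_star[S_j, j], phi=1e-3)
print("error before fix:", np.abs(M[:, j] - M_star[:, j]).max())
print("error after fix: ", np.abs(M[:, G] @ v - M_star[:, j]).max())
```

As Remarks 2 and 3 point out, these subproblems only need to be solved to moderate additive accuracy, which is what allows fast LP solvers to be used in place of exact minimization.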
Combining over all of the subroutines, the sample complexity and runtime are as claimed, by appropriately applying the linear program solvers as described in Lemma 32 and Remark 3.

We provide pseudocode for the procedure in Proposition 3 in Algorithm 7.

Algorithm 7: Postprocess($\mathcal{O}^{\mathrm{sr}}_{[0,1]}(\widehat{M}), M, \gamma, \Delta, r, r', \mu, \delta$)
1 Input: $\mathcal{O}^{\mathrm{sr}}_{[0,1]}(\widehat{M})$ where $\widehat{M} = M^\star + N \in \mathbb{S}^{d \times d}$, $M^\star$ is rank-$r$ and $\mu$-incoherent, and $\|N\|_\infty \le \Delta$; $M \in \mathbb{S}^{d \times d}$ given as a rank-$r'$ factorization such that $M, M^\star$ are $\Delta d$-close on a $\gamma$-submatrix; $\delta \in (0,1)$
2 $S \leftarrow$ Sparsify($\mathcal{O}^{\mathrm{sr}}_{[0,1]}(\widehat{M} - M)$, $1000\Delta\mu r^{1.5}\log d$, $\frac{1}{300\mu r \log d}$, $\frac{10^8\mu^2 r^2\log^2(d/\delta)}{d}$)
3 $q \leftarrow \frac{3000\mu r \log(d/\delta)}{d}$, $\tau \leftarrow 1000\Delta\mu r^{1.5}\log d$
4 $\widetilde{M} \leftarrow$ Fix($\mathcal{O}^{\mathrm{sr}}_p(\widehat{M}_{S:}), M_{S:}, S, qdr\tau, q, \delta$)
5 $r' \leftarrow \mathrm{rank}(\widetilde{M})$
6 $q \leftarrow \frac{3000\mu r' \log(d/\delta)}{d}$, $\tau \leftarrow 10^{14}(\mu r)^{6.5}\log^2(d/\delta)\,\Delta$
7 $\widetilde{M}' \leftarrow$ Fix($\mathcal{O}^{\mathrm{sr}}_p(\widehat{M}), \widetilde{M}^\top, S, qdr\tau, q, \delta$)
8 Return: $\frac{1}{2}\big(\widetilde{M}' + (\widetilde{M}')^\top\big)$

E Semi-random matrix completion

In this section, we finally combine the two main components of our algorithm, Algorithm 3 and Algorithm 7, to give our final result, Theorem 2. We begin with a preprocessing step to estimate an initial distance bound, which follows straightforwardly from an analogous result in [43].

Lemma 34. There is an algorithm, EstimateFrobNorm, which takes as input $\mathcal{O}^{\mathrm{sr}}_p(\widehat{M})$ and parameters $\Delta \ge 0$ and $\delta \in (0,1)$, where $\widehat{M} = M^\star + N$, $M^\star \in \mathbb{S}^{d \times d}$ is rank-$r^\star$ and $\mu$-incoherent, $\|N\|_\infty \le \Delta$, and $p \ge \frac{120\mu r^\star \log(d/\delta)}{d}$. EstimateFrobNorm runs in time $O(m)$, where $m$ is the number of observed entries, and outputs an estimate $V \in \mathbb{R}$ such that with probability $1-\delta$, $\|M^\star\|_F \le V \le 10d^2(\|M^\star\|_F + \Delta)$.

Proof. Let $\Omega \subseteq [d] \times [d]$ be the set of observed entries. Our estimator for $V$ will be
$$V := \Big(\frac{2}{p}\sum_{(i,j)\in\Omega}\big(|\widehat{M}_{ij}| + \Delta\big)^2\Big)^{1/2}.$$
Now we analyze this estimator. For each row $i \in [d]$, by Lemma 30, with probability at least $1 - \frac{\delta}{d}$, if we let $S_i$ be the set of observed entries in row $i$,
$$\sum_{j \in S_i}(M^\star_{ij})^2 \ge \frac{p}{2}\sum_{j \in [d]}(M^\star_{ij})^2.$$
Summing over all rows, and using the triangle inequality $|M^\star_{ij}| \le |\widehat{M}_{ij}| + \Delta$, we have $V^2 \ge \frac{2}{p}\sum_{(i,j)\in\Omega}(M^\star_{ij})^2 \ge \|M^\star\|_F^2$, which proves one side of the claim. To prove the other direction, by the triangle inequality,
$$V \le \sqrt{\frac{2}{p}}\big(\|\widehat{M}\|_F + d\Delta\big) \le \sqrt{2d}\,\big(\|\widehat{M}\|_F + d\Delta\big) \le 10d^2\big(\|M^\star\|_F + \Delta\big).$$

Now, we present the full algorithm for semi-random matrix completion.

Algorithm 8: SRCompletion($\mathcal{O}^{\mathrm{sr}}_{[0,1]}(\widehat{M}), \Delta, r^\star, \mu, \delta$)
1 Input: $\mathcal{O}^{\mathrm{sr}}_{[0,1]}(\widehat{M})$ where $\widehat{M} = M^\star + N \in \mathbb{S}^{d \times d}$, $M^\star$ is rank-$r^\star$ and $\mu$-incoherent, $\|N\|_\infty \le \Delta$; $\delta \in (0,1)$
3 $p_0 \leftarrow \frac{200\mu r^\star \log(d/\delta)}{d}$
4 $D \leftarrow$ EstimateFrobNorm($\mathcal{O}^{\mathrm{sr}}_{p_0}(\widehat{M}), \Delta, \delta$)
5 $P \leftarrow \mathrm{poly}\big(r^\star, \mu, \log\frac{dD}{\Delta\delta}\big)$
6 $Q \leftarrow \mathrm{poly}(P)$
7 $p \leftarrow \frac{Q}{d}$
8 $K \leftarrow \log\frac{D}{\Delta d Q}$
9 while $D \ge \Delta d Q$ do
10   $M \leftarrow$ SRPartialCompletion($\mathcal{O}^{\mathrm{sr}}_{[0,1]}(M^\star + N), M, D, r^\star, P, \frac{1}{10^6\mu P^2}, \frac{\delta}{4K}$)
11   $\widetilde{M} \leftarrow$ Postprocess($\mathcal{O}^{\mathrm{sr}}_{[0,1]}(M^\star + N), M, \frac{1}{10^6\mu P^2}, \frac{D}{dP}, P^{o(1)}, r^\star, \mu, \frac{\delta}{4K}$)
12   $M \leftarrow \mathrm{PCA}_{r^\star}(\widetilde{M})$
15 Return: $\widetilde{M}$

Remark 4. In Line 12 of Algorithm 8, $\mathrm{PCA}_{r^\star}(\cdot)$ denotes computing the top-$r^\star$ PCA of a matrix. We ensure that the input $\widetilde{M}$ is a rank-$r = \mathrm{poly}(r^\star, \mu, \log\frac{dD}{\Delta\delta})$ matrix given as a low-rank factorization, and then this step can be implemented in time $d \cdot \mathrm{poly}(r)$.

Now we prove our main theorem.

Theorem 2. Let $M^\star \in \mathbb{S}^{d \times d}$ be rank-$r^\star$ and $\mu$-incoherent, let $N \in \mathbb{S}^{d \times d}$ have $\|N\|_\infty \le \Delta$, and let $\delta \in (0,1)$. Let $Q = \mathrm{poly}\big(r^\star, \mu, \log\frac{d\|M^\star\|_F}{\Delta\delta}\big)$ for a sufficiently large polynomial. Algorithm 8 can be implemented with one call to $\mathcal{O}^{\mathrm{sr}}_p(M^\star + N)$ for $p = \frac{Q}{d}$ and outputs $\widetilde{M} \in \mathbb{S}^{d \times d}$ with rank at most $Q$ (represented in terms of its low-rank factorization) such that with probability $1 - \delta$, $\|\widetilde{M} - M^\star\|_\infty \le Q\Delta$. The algorithm runs in time $O(mQ)$, where $m$ is the total number of observed entries.

Proof. We inductively verify that the conditions of Corollary 1 are satisfied whenever SRPartialCompletion is called, and that the conditions of Proposition 3 are satisfied whenever Postprocess is called. The base case for SRPartialCompletion is clear from the guarantees of Lemma 34.
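As an implementation aside on Line 12 of Algorithm 8 and Remark 4, the best rank-$r$ approximation of a matrix given as a rank-$k$ factorization can be computed from two thin QR factorizations and one $k \times k$ SVD, in $O(dk^2 + k^3)$ time, without ever forming a $d \times d$ matrix. The sketch below (ours, not the paper's code) shows one standard way to do this; for the symmetric matrices arising in Algorithm 8, one can symmetrize the output as in Algorithm 7.

```python
# Sketch of a PCA_r step on a factored matrix: given Mtilde = L @ R.T with L, R of width k,
# return its best rank-r approximation as factors, in O(d k^2 + k^3) time. Standard trick;
# variable names are ours.
import numpy as np

def pca_r_of_factorization(L, R, r):
    """Best rank-r approximation of L @ R.T, returned as factors (A, B) with A @ B.T."""
    QL, TL = np.linalg.qr(L)          # L = QL @ TL, QL: d x k with orthonormal columns
    QR_, TR = np.linalg.qr(R)         # R = QR_ @ TR
    core = TL @ TR.T                  # k x k core; L @ R.T = QL @ core @ QR_.T
    U, s, Vt = np.linalg.svd(core)    # SVD of the small core gives an SVD of L @ R.T
    A = QL @ (U[:, :r] * s[:r])       # d x r
    B = QR_ @ Vt[:r].T                # d x r
    return A, B

# Quick check against a dense truncated SVD on a small instance.
rng = np.random.default_rng(2)
d, k, r = 300, 12, 4
L, R = rng.standard_normal((d, k)), rng.standard_normal((d, k))
A, B = pca_r_of_factorization(L, R, r)
U, s, Vt = np.linalg.svd(L @ R.T)
best = (U[:, :r] * s[:r]) @ Vt[:r]
print(np.linalg.norm(A @ B.T - best))   # should be ~1e-10
```

Combined with the rank bound $\mathrm{poly}(r^\star, \mu, \log\frac{dD}{\Delta\delta})$ maintained for the input to this step, this keeps the per-iteration cost of the PCA step at $d \cdot \mathrm{poly}(\cdot)$, as claimed in Remark 4.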
Now, assuming that the hypotheses of Corollary 1 are satisfied when SRPartialCompletion is called, we know the output $M$ has rank $r \le r^\star P^{o(1)}$ and is $\frac{D}{\alpha}$-close to $M^\star$ on a $\frac{1}{10^6\mu P^2}$-submatrix. This implies that the hypotheses of Proposition 3 are satisfied when we call Postprocess, and thus we get that
$$\big\|\widetilde{M} - M^\star\big\|_\infty \le \mathrm{poly}(r^\star)\,P^{o(1)} \cdot \frac{D}{dP} \le \frac{D}{100d}.$$
Since $M^\star$ is rank-$r^\star$, we have
$$\big\|\mathrm{PCA}_{r^\star}(\widetilde{M}) - M^\star\big\|_F \le 2\big\|\widetilde{M} - M^\star\big\|_F \le 2d\big\|\widetilde{M} - M^\star\big\|_\infty \le 0.1D.$$
This now implies that the hypotheses of Corollary 1 are satisfied at the beginning of the next iteration (unless $D \le \Delta d Q$, in which case the loop ends). By the way we set $K$, the loop ends in at most $K$ iterations, and thus the overall failure probability is at most $\delta$. As long as none of the subroutines fails, the above argument implies the output must satisfy $\|\widetilde{M} - M^\star\|_\infty \le Q\Delta$ and that $\widetilde{M}$ has rank at most $Q$. Note that by Lemma 2, we can simulate all of the observations used in the algorithm with a single sample $\mathcal{O}^{\mathrm{sr}}_p(M^\star + N)$ with reveal probability $p = \frac{Q}{d}$ for $Q = \mathrm{poly}\big(r^\star, \mu, \log\frac{d\|M^\star\|_F}{\Delta\delta}\big)$. The overall runtime guarantee follows by combining the runtime guarantees of Corollary 1, Proposition 3, and the PCA step (see Remark 4).

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: As stated in the abstract and introduction, this is a theory paper where we give a new algorithm for semi-random matrix completion and prove that it achieves improved guarantees in terms of runtime, accuracy and noise tolerance.
Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We discuss the modeling assumptions and acknowledge that since this is a theory paper, the modeling assumptions and runtimes of the algorithms may not be practical yet (as mentioned in Section 1.3), but hope that the insights are useful and may have practical implications in the future.
Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach.
For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: The paper (and supplementary material) includes complete proofs of all of the claims.
Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [NA]
Justification: The paper does not include experiments.
Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example: (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [NA]
Justification: The paper does not include experiments.
Guidelines: The answer NA means that the paper does not include experiments requiring code. Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [NA]
Justification: The paper does not include experiments.
Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [NA]
Justification: The paper does not include experiments.
Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.). The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates). If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [NA]
Justification: The paper does not include experiments.
Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: We read the ethics guidelines and believe we meet them.
Guidelines: The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: Our paper is about theoretical results and therefore substantial societal impact would likely come only with further research and work.
However, we do discuss how our paper is motivated by algorithms more robust and less sensitive to distributional assumptions. In this sense, our paper seems to be providing a potential helpful tool; positive or negative societal impacts of this tool would likely come from future work applying our result. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: The paper does not release data or models. Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [NA] Justification: The paper does not use existing assets. Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. 
The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: The paper does not release new assets. Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: The paper does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: The paper does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. 
For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.