# Robust Matrix Sensing in the Semi-Random Model

Xing Gao (University of Illinois at Chicago, xgao53@uic.edu)
Yu Cheng (Brown University, yu_cheng@brown.edu)

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Low-rank matrix recovery is a fundamental problem in machine learning with numerous applications. In practice, the problem can be solved by convex optimization, namely nuclear norm minimization, or by non-convex optimization, as it is well known that for low-rank matrix problems like matrix sensing and matrix completion, all local optima of the natural non-convex objectives are also globally optimal under certain ideal assumptions. In this paper, we relax these assumptions and study new approaches for matrix sensing in a semi-random model where an adversary can add any number of arbitrary sensing matrices. More precisely, the problem is to recover a low-rank matrix $X^\star$ from linear measurements $b_i = \langle A_i, X^\star \rangle$, where an unknown subset of the sensing matrices satisfies the Restricted Isometry Property (RIP) and the rest of the $A_i$'s are chosen adversarially. It is known that in the semi-random model, existing non-convex objectives can have bad local optima. To fix this, we present a descent-style algorithm that provably recovers the ground-truth matrix $X^\star$. For the closely-related problem of semi-random matrix completion, prior work [CG18] showed that all bad local optima can be eliminated by reweighting the input data. However, the analogous approach for matrix sensing requires reweighting a set of matrices to satisfy RIP, which is a condition that is NP-hard to check. Instead, we build on the framework proposed in [KLL+23] for semi-random sparse linear regression, where the algorithm in each iteration reweights the input based on the current solution, and then takes a weighted gradient step that is guaranteed to work well locally. Our analysis crucially exploits the connection between sparsity in vector problems and low-rankness in matrix problems, which may have other applications in obtaining robust algorithms for sparse and low-rank problems.

1 Introduction

Low-rank matrix recovery is a popular inverse problem with many applications in machine learning such as collaborative filtering, image compression, and robust principal component analysis (PCA) [RS05, FCRP08, CLMW11]. In this paper, we study one of the most basic low-rank matrix recovery problems, namely matrix sensing [CRT06, RFP10]. In the matrix sensing problem, we want to reconstruct a low-rank ground-truth matrix $X^\star \in \mathbb{R}^{d_1 \times d_2}$ from a collection of sensing matrices $\{A_i\}_{i=1}^n$ and the corresponding linear measurements $b_i = \langle A_i, X^\star \rangle$. For notational convenience, we define a sensing operator $\mathcal{A}[\cdot] : \mathbb{R}^{d_1 \times d_2} \to \mathbb{R}^n$ such that $\mathcal{A}[X] = b$ with $b_i = \langle A_i, X \rangle$ for $i = 1, \ldots, n$. The goal is to solve the following rank-constrained optimization problem:

$$\min_{X \in \mathbb{R}^{d_1 \times d_2}} \ \|\mathcal{A}[X] - b\|_2^2 \quad \text{subject to} \quad \operatorname{rank}(X) \le r.$$

As optimizing over low-rank matrices is often computationally hard, one common approach is to replace the non-convex low-rank constraint with its convex relaxation, which results in the following nuclear norm minimization problem [RFP10]:

$$\min_{X \in \mathbb{R}^{d_1 \times d_2}} \ \|X\|_* \quad \text{subject to} \quad \mathcal{A}[X] = b. \tag{1}$$

Another widely-used approach in practice is to consider the unconstrained non-convex factorized parametrization [RFP10, GJZ17, BNS16]:

$$\min_{U \in \mathbb{R}^{d_1 \times r},\, V \in \mathbb{R}^{d_2 \times r}} \ \|\mathcal{A}[UV^\top] - b\|_2^2 \tag{2}$$

and solve it with some form of gradient descent or alternating minimization. Existing convex and non-convex approaches all rely on certain assumptions.
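To make the setup concrete, the following NumPy sketch generates a toy instance of the semi-random model studied in this paper: a rank-$r$ ground truth, a hidden set of i.i.d. Gaussian ("good") sensing matrices, and adversarially chosen ("bad") sensing matrices whose measurements remain consistent with $X^\star$. All function names and the particular adversarial construction (noisy rank-one spikes aligned with $X^\star$) are ours and are for illustration only; the model allows arbitrary bad matrices.

```python
import numpy as np

def make_semirandom_instance(d1=30, d2=30, r=3, m=400, n_bad=200, seed=0):
    """Toy semi-random matrix sensing instance.

    Returns the rank-r ground truth X_star, the stacked sensing matrices
    A (shape n x d1 x d2), and measurements b_i = <A_i, X_star>.  The first
    m matrices are i.i.d. Gaussian ("good"); the remaining n_bad are chosen
    adversarially, but their measurements are still consistent with X_star.
    """
    rng = np.random.default_rng(seed)
    X_star = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))

    A_good = rng.standard_normal((m, d1, d2))           # satisfies RIP w.h.p.
    u, s, vt = np.linalg.svd(X_star)
    spike = np.outer(u[:, 0], vt[0, :])                  # one possible adversarial choice
    A_bad = np.repeat(spike[None, :, :], n_bad, axis=0)
    A_bad += 0.1 * rng.standard_normal((n_bad, d1, d2))

    A = np.concatenate([A_good, A_bad], axis=0)
    rng.shuffle(A, axis=0)                               # adversary shuffles the order
    b = np.einsum('nij,ij->n', A, X_star)                # measurements stay consistent
    return X_star, A, b

def sensing_operator(A, X):
    """Apply A[X], returning the vector of inner products <A_i, X>."""
    return np.einsum('nij,ij->n', A, X)
```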
A standard assumption in the literature is that the sensing matrices satisfy the Restricted Isometry Property (RIP), which means that the sensing matrices approximately preserve the norm of low-rank matrices. (Formally, $\frac{1}{L}\|X\|_F^2 \le \frac{1}{n}\sum_{i=1}^n \langle A_i, X\rangle^2 \le L\|X\|_F^2$ for all $X$ with $\operatorname{rank}(X) \le r$, for some parameters $r$ and $L$.) In this paper, we relax the RIP condition on the sensing matrices and study a robust version of the problem, which is often referred to as the semi-random model. More specifically, an adversary corrupts the input data by providing any number of additional sensing matrices $A_i$ that are adversarially chosen, but the corresponding measurements $b_i = \langle A_i, X^\star\rangle$ remain consistent with the ground-truth matrix $X^\star$. Consequently, only a subset of the sensing matrices satisfies the RIP condition and the rest of them are arbitrary. This is an intermediate scenario between the average case and the worst case, and one that arises more frequently in practice. To the best of our knowledge, we are the first to study the matrix sensing problem in this semi-random model.

Formally, we consider the following adversary: suppose that originally there was a collection of RIP sensing matrices $\{A_i\}_{i=1}^m$ ("good" matrices); the adversary then adds some arbitrary matrices $\{A_i\}_{i=m+1}^n$ ("bad" matrices) and shuffles all the sensing matrices. The algorithm is then given the measurements based on the good and bad matrices together. The combined sensing matrices are no longer guaranteed to satisfy the RIP condition, but there exists a subset (indicated by a weight vector $w^\star$) that does, i.e., $\frac{1}{L}\|X\|_F^2 \le \sum_{i=1}^n w^\star_i \langle A_i, X\rangle^2 \le L\|X\|_F^2$, where $w^\star_i = \frac{1}{m}$ on the original good matrices and $w^\star_i = 0$ on the bad matrices added by the adversary. In general, the subset may be replaced by a convex combination, i.e., $w^\star$ may be any vector in the probability simplex. Inspired by the adversary for semi-random vector regression in [KLL+23], we refer to this condition as wRIP (weighted RIP) and formally define it in Definition 2.2.

Since the wRIP condition is a more general assumption than RIP, existing solutions that rely on RIP might fail under the semi-random model with the wRIP condition. As stated in [KLL+23], this type of adversary "does not break the problem from an information-theoretic perspective", but affects the problem computationally. In particular, existing non-convex approaches for matrix sensing (e.g., (2)) may get stuck at bad local minima, as the RIP condition is necessary for proving landscape results regarding the non-convex objective (see, e.g., the counter-examples provided in [BNS16]). The convex-relaxation approach (1) does continue to work in the semi-random model, because the augmented linear measurements are consistent with the ground-truth matrix $X^\star$ and simply provide additional optimization constraints. However, convex approaches are often less desirable in practice and can become computationally prohibitive when $d_1, d_2 > 100$, as pointed out in [RFP10].

1.1 Our Contributions

The limitations of existing algorithms motivate us to pose and study the problem of semi-random matrix sensing in this paper. We summarize our main contributions below:

Pose and study matrix sensing in the semi-random model. We introduce the more general wRIP condition on matrix sensing as a relaxation of the typical RIP assumption, and provide a solution that is more robust to input contamination. Our work will serve as a starting point for the design of more efficient robust algorithms for matrix sensing, as well as other low-rank matrix problems, in the semi-random model.
Design an efficient robust algorithm for semi-random matrix sensing. Our algorithm is guaranteed to converge to a global optimum which improves on the existing non-convex solution [BNS16] that can get stuck in bad local optima in the semi-random model, while achieving a comparable running time as existing convex solution [RFP10], informally stated in Theorem 1.1 below. A formal statement can be found in Theorem 3.1. Adapt a reweighting scheme for semi-random matrix sensing. In contrast to the non-convex approach that failed and the convex approach that avoided the challenge posed by the adversary altogether, we study a new approach that directly targets the semi-random adversary instead. We develop an algorithm using an iterative reweighting approach inspired by [KLL+23]: in each iteration, the algorithm reweights the sensing matrices to combat the effect of the adversary and then takes a weighted gradient step that works well based on the current solution. Exploit the connection between sparsity and low-rankness. Observing a duality between sparse vectors and low-rank matrices, we draw a parallel between linear regression and matrix sensing problems. By exploring the structural similarities and differences between vector and matrix problems, we are able to extend and generalize the work of [KLL+23] on semi-random sparse vector recovery to the higher dimensional problem of semi-random matrix sensing. We emphasize that even though the generalization from vector to matrix problems is natural, the analysis behind the intuition is often nontrivial and involves different mathematical tools. We state a simplified version of our main algorithmic result assuming Gaussian design. The more general result is stated as Theorem 3.1 in Section 3. Theorem 1.1 (Semi-Random Matrix Sensing). Suppose the ground-truth matrix X Rd1 d2 satisfies rank(X ) r and X F poly(d). Let A1, . . . , An be the sensing matrices and let bi = Ai, X be the corresponding measurements. Suppose there exists a (hidden) set of Ω(dr) sensing matrices with i.i.d. standard Gaussian entries, and the remaining sensing matrices are chosen adversarially, where d = max(d1, d2). There exists an algorithm that can output X Rd1 d2 such that X X F ϵ with high probability in time e O(ndω+1r log(1/ϵ)) 1 where ω < 2.373 is the matrix multiplication exponent. 1.2 Overview of Our Techniques Since there exists a subset (or a convex combination in general) of the sensing matrices that satisfy the RIP condition, a natural strategy is to reverse the effect from the adversary by reweighting the sensing matrices so that they satisfy the RIP condition. However, it is NP-hard to verify RIP condition on all low-rank inputs, so it is unclear how to preprocess and fix the input in the beginning and then apply existing solutions to matrix sensing. Instead, we make a trade-off between the frequency of reweighting and the requirement on the weights by adopting an iterative reweighting approach: in each iteration, we only aim to find a set of weights so that the weighted matrices satisfy some desirable properties (not necessarily RIP) with respect to the current estimate X (as opposed to all low-rank matrices). Inspired by the workflow in [KLL+23], our semi-random matrix sensing algorithm (Algorithm 1) repeatedly calls a halving algorithm to reduce the error of our estimate arbitrarily small. The halving algorithm (Algorithm 2) contracts the upper bound on X X F , which is the error between our current estimate X and the ground truth X , each time it is run. 
Inside this algorithm is a gradient descent style loop, where in each iteration we try to minimize a weighted objective function, which is essentially the weighted ℓ2-norm of A[Xt] b (the distance to X measured by the sensing matrices), where the weights are provided by an oracle implemented in Algorithm 3. The algorithm proceeds by taking a step opposite to the gradient direction, and the step is then projected onto a nuclear-norm-bounded ball which is necessary for the weight oracle to continue working in the next step. As we mentioned before, the weights from the oracle need to satisfy some nice properties with respect to the current iteration estimate Xt. Ideally, the property should: firstly, ensure the gradient step makes enough progress towards X ; secondly, can be derived from the w RIP condition so that we know such a requirement is feasible; and lastly, be easily verifiable as opposed to the NP-hard RIP condition. With the first requirement in mind, we define the weight oracle as in Definition 2.5. The oracle output should satisfy two properties, namely the progress and decomposition guarantees, and together they ensure the gradient step makes good enough progress toward X . Intuitively speaking, the progress guarantee ensures the gradient step is large in the direction parallel to the actual deviation X X (as opposed to only reducing the measured deviation A[X] b) and thus will make significant progress, while the decomposition guarantee ensures the gradient step has small contribution and 1Throughout the paper, we write e O(f(n)) for O(f(n) polylog f(n)) and similarly for eΩ( ). effect in other directions thus will not cancel the progress after the projection. While the progress guarantee is quantified as an inner product, we introduce a concept called norm-decomposition (Definition 2.4) to capture the decomposition guarantee, and we will provide more details later. For the second requirement, we can loosely relate the two oracle guarantees to the w RIP condition: the (large) progress guarantee makes use of the lower bound in w RIP condition Pn i=1 w i Ai, X X F 2 1 and the (small) decomposition guarantee makes use of the upper bound Pn i=1 w i Ai, X X F 2 L . We introduce a condition called d RIP (decomposable w RIP defined in Definition 2.3) to formally capture this relation, and we will show that it follows from the w RIP condition thus we can achieve such an oracle. Lastly, we will show that the oracle properties can be easily verified, meeting our third requirement. A formal statement and a road map that leads to our main result can be found in Section 3. 1.3 Related Work Matrix sensing (RIP): There are two main types of existing solutions. The convex-relaxation formulation 1 of the problem can be posed as a semidefinite program via the standard form primaldual pair [RFP10], where the primal problem has a (d1 + d2)2 semidefinite constraint and n linear constraints. State of the art SDP solver [JKL+20] requires running time of e O(nd2.5) where d = max(d1, d2). The other approach uses non-convex formulation 2 to reduce the size of the decision variable from d2 to dr, improving computational efficiency. It is shown in [BNS16] that there are no spurious local minima given RIP sensing matrices and incoherent linear measurements in the non-convex approach, however, it is no longer applicable in the semi-random model. Semi-random model: First introduced by [BS95], the semi-random model has been studied for various graph problems [FK01, PW17, MS10, MMV12]. 
Previously the work of [CG18] applied the semi-random model to the matrix completion problem, and recently [KLL+23] studied sparse vector recovery in this model. Semi-random matrix completion: Low-rank matrix problems such as matrix completion and matrix sensing have similar optimization landscapes [GJZ17], thus development in one often lends insight to another. Prior work [CG18] on the closely-related problem of matrix completion under the semirandom model showed that all bad local optima can be eliminated by reweighting the input data via a preprocessing step. It exploits the connection between the observation data matrix and the Laplacian matrix of a complete bipartite graph, and gives a reweighting algorithm to preprocess the data in a black-box manner. However, the analogous approach for matrix sensing requires reweighting a set of matrices to satisfy RIP, which is a condition that is NP-hard to check, thus is not practical in the matrix sensing problem. Semi-random vector regression: In order to overcome the barrier of the reweighting or preprocessing approach mentioned earlier, we take inspiration from the work of [KLL+23] on sparse vector recovery under the semi-random model. One of their main contributions is the short-flat decomposition , which is a property that can be efficiently verified for a given vector (locally), instead of verifying the RIP condition for all sparse vectors (globally). They provide a projected gradient descent style algorithm, where the rows of the sensing matrix are reweighted differently in each iteration to ensure a short-flat decomposition exists for the gradient. We draw a parallel between the problem of sparse vector regression and low-rank matrix sensing, and extend their work on linear regression of sparse vectors to the more generalized problem of sensing low-rank matrices. 2 Preliminaries 2.1 Notations Throughout the paper, we denote the ground-truth low-rank matrix as X . We assume X Rd1 d2, rank(X ) = r, and d1, d2 have the same order of magnitude. Let d = max(d1, d2). We write [n] for the set of integers {1, ..., n}. We use n for the nonnegative probability simplex in dimension n, and Rn 0 for the set of vectors with nonnegative coordinates in Rn. For a vector x, we denote its ℓ1, ℓ2, and ℓ -norms as x 1, x 2 and x respectively, and write the ith coordinate in x as xi. For a matrix A, we use A , A 2, and A F for the nuclear, spectral (operator), and Frobenius norms of A respectively. For a matrix A, we use A(k) = argminrank(A ) k A A F to denote the best rank-k approximation of A; or equivalently, given the SVD of A = Pr i=1 σiuiv i , we have A(k) = Pk i=1 σiuiv i where σ1, ..., σk are the top k singular values of A. We write tr(A) for the trace of a square matrix A. For matrices A, B Rd1 d2, we write A, B for their entrywise inner product A, B = A, B = tr(A B) = P j,k Ajk Bjk. A symmetric matrix A Rd d is positive semidefinite (PSD) if and only if A = U U for some matrix U, and we write A B if A and B have the same dimension and B A is positive semidefinite. We write exp (A) as the matrix exponential of A; if A is diagonalizable as A = UDU 1 then exp(A) = U exp(D)U 1. 2.2 Definitions We formally define the matrix sensing operator and observation vector below. Definition 2.1 (Matrix Sensing Operator). Given a collection of sensing matrices A = {Ai}n i=1 Rd1 d2, we define the sensing operator A[ ] : Rd1 d2 Rn such that A[X] = b where bi = Ai, X for X Rd1 d2. 
In other words, we have b := Pn i=1 Ai, X ei where ei is the ith standard basis vector in Rn. Throughout the paper, we use either A or {Ai}n i=1 to represent the sensing matrices. To consistently recover a rank-r matrix in general, the number of measurements n should be at least (d1 + d2 r)r [CP11], hence we assume n = eΩ(dr) where eΩsuppresses log factors. In most matrix sensing literature, it is standard to impose the Restricted Isometry Property (RIP) condition on the sensing matrices. The RIP condition states that A[ ] is approximately an isometry on low-rank matrices, which means the ℓ2-norm of the observation vector is close to the Frobenius norm of X . In this paper, we consider a semi-random model and relax the RIP condition as follows: we require that there exist weights {w i }n i=1 (or w n) so that the weighted sensing matrices { p w i Ai}n i=1 satisfy the RIP condition. We call this relaxed assumption the w RIP (weighted RIP) condition. We formally define RIP and w RIP conditions below. Definition 2.2 (RIP and w RIP Conditions). We say a collection of sensing matrices A = {Ai}n i=1 Rd1 d2 satisfies the RIP (Restricted Isometry Property) condition with parameters r, L, and ρ if the following conditions hold for all X Rd1 d2 with rank(X) r: 1. Boundedness: Ai 2 ρ ; 2. Isometry: 1 n Pn i=1 Ai, X 2 L X 2 F . Further, we say A = {Ai}n i=1 satisfies the w RIP (weighted RIP) condition with parameters r, L, ρ, if w n such that the following conditions hold for all X Rd1 d2 with rank(X) r: 1. Boundedness: Ai 2 ρ ; 2. Isometry: 1 L X 2 F Pn i=1 w i Ai, X 2 L X 2 F . Notice that w RIP is a relaxation of the RIP condition, because we can choose w i = 1/n for all i in the standard RIP setting. More importantly, w RIP is strictly weaker. For example, w RIP allows a (possibly majority) fraction of the sensing matrices to be chosen adversarially. We want to emphasize that the algorithm does not know w one of the main challenges of semi-random matrix sensing is that finding w seems computationally hard, because it is NP-Hard to check the RIP condition. For presenting our algorithm and analysis, we introduce a variant of the w RIP condition called d RIP (decomposable-w RIP). Definition 2.3 (d RIP Condition). We say a collection of sensing matrices A = {Ai}n i=1 Rd1 d2 satisfies the d RIP (decomposable w RIP) condition if w n and constants L, K, r, ρ 1, such that for all V Rd1 d2 satisfying V F [ 1 1. Boundedness: Ai 2 ρ ; 2. Isometry: 1 L Pn i=1 w i Ai, V 2 L ; 3. Decomposability: (L, 1 K r)-norm-decomposition of G = Pn i=1 w i Ai, V Ai = Pn i=1 w i ui Ai . Definition 2.4 (Norm Decomposition). We say a matrix G has a (CF , C2)-norm-decomposition if S and E s.t. G = S + E, and S F CF , E 2 C2 . The main difference with w RIP is that d RIP requires the additional decomposition property. Observe that G is the (weighted) gradient at the point V . At a high level, we would like to decompose the gradient into two matrices, one with small Frobenius norm and the other one with small operator norm. Our matrix norm-decomposition is inspired by the short-flat-decomposition for vectors in [KLL+23]. In Section 4, we will explain the motivation behind the norm decomposition as well as how to efficiently verify such a decomposition exists. We will also show that the d RIP condition is closely related to w RIP (by choosing parameters within a constant factor of each other) in Appendix C. 
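As a small illustration of the definitions above, the sketch below evaluates the weighted isometry ratio $\sum_i w_i \langle A_i, X\rangle^2 / \|X\|_F^2$ for one given weight vector and one test matrix. This pointwise check is cheap; the difficulty captured by wRIP is that the bound must hold simultaneously for all matrices of rank at most $r$, for some hidden $w^\star$, which is NP-hard to certify. The helper `make_semirandom_instance` refers to the earlier illustrative sketch and is not part of the paper.

```python
import numpy as np

def weighted_isometry_ratio(A, w, X):
    """Evaluate sum_i w_i <A_i, X>^2 / ||X||_F^2 for a single test matrix X.

    Under the (r, L, rho)-wRIP condition with the hidden weights w*, this
    ratio lies in [1/L, L] for every X with rank(X) <= r.  Checking one X is
    easy; certifying the bound for all low-rank X at once is the hard part.
    """
    measurements = np.einsum('nij,ij->n', A, X)
    return float(np.dot(w, measurements ** 2) / np.sum(X ** 2))

# Illustrative usage: uniform weights over all matrices may violate the
# isometry because of the adversarial matrices, while the hidden w* does not.
# X_star, A, b = make_semirandom_instance()
# w_uniform = np.full(A.shape[0], 1.0 / A.shape[0])
# print(weighted_isometry_ratio(A, w_uniform, X_star))
```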
A crucial component in our algorithm is a weight oracle that produces a nonnegative weight on each sensing matrix (the weights are in general different in each iteration), such that the weighted gradient step moves the current solution closer to X . The oracle should output weights that satisfy certain properties which we term progress and decomposition guarantees. The purpose of these two guarantees is further explained in the proof of Lemma 4.2 in Appendix A. Definition 2.5 (Weight Oracle). We say an algorithm O is a (Cprog, CF )-oracle, if given as input n matrices A = {Ai}n i=1 Rd1 d2 and an vector u = A[V ] Rn where V Rd1 d2, V F [ 1 4, 1], and V 2 2r, the algorithm O(A, u, δ) returns a weight vector w Rn 0 such that the following conditions hold with probability at least 1 δ: 1. Progress guarantee: Pn i=1 wiu2 i Cprog ; 2. Decomposition guarantee: (CF , Cprog 6 r ) norm-decomposition of G = Pn i=1 wiui Ai . Note that the progress guarantee is equivalent to G, V Cprog. Finally we define numerical rank which we use in our analysis. Numerical rank serves as a lower bound for the rank of a matrix based on its nuclear norm and Frobenius norm. That is, we always have Rankn(A) rank(A). Definition 2.6 (Numerical Rank). The numerical rank of A is Rankn(A) = A 2 A 2 F . 3 Semi-Random Matrix Sensing In this section, we present our main algorithm (Algorithm 1) for semi-random matrix sensing. Algorithm 1 with high probability recovers the ground-truth matrix X to arbitrary accuracy. Algorithm 1: Semi Random Matrix Sensing(R0, ϵ, δ, A, b) 1: Input: R0 X F , b = A[X ], ϵ > 0, δ (0, 1) ; 2: Output: Xout s.t. Xout X F ϵ . 3: X0 0, T log R0 T , R R0 ; 4: for 0 t T do 5: Xt+1 Halve Error(Xt, R, O, δ , A, b), R R 2 ; 6: end for 7: Return Xout XT ; The performance guarantee and runtime of Algorithm 1 are formally stated in the following theorem. Theorem 3.1 (Matrix Sensing under w RIP). Suppose the ground-truth matrix X Rd1 d2 satisfies rank(X ) r and X F R0. Suppose the sensing matrices A = (Ai Rd1 d2)n i=1 satisfy (r, L, ρ)-w RIP (as in Definition 2.2). Let b = A[X ] Rn be the corresponding measurements. For any ϵ, δ > 0, Algorithm 1 can output X Rd1 d2 such that X X F ϵ with probability at least 1 δ. Algorithm 1 runs in time O(ndω polylog (d) log ( L ϵ )rρ2L4 log R0 ϵ ) where d = max(d1, d2) and ω < 2.373 is the matrix multiplication exponent. Theorem 1.1 is a direct corollary of Theorem 3.1 under Gaussian design. Proof of Theorem 1.1. When there are Ω(dr) sensing matrices with i.i.d. standard Gaussian entries, the input sensing matrices satisfy (r, L, ρ)-w RIP for L = O(1) and ρ = O(d1/2) with probability at least 1 1 poly(d). This follows from a standard proof for RIP and the fact that we can ignore any sensing matrices with Ai 2 d1/2. We assume that the w RIP condition is satisfied. By Theorem 3.1, when L = O(1), ρ = O(d1/2), R0 = poly(d) and δ = 1 poly(d), Algorithm 1 can output a solution X such that X X F ϵ with high probability. The runtime of Algorithm 1 can be simplified to e O(ndω+1r log(1/ϵ)). We first provide a road map for our analysis for proving Theorem 3.1: Our main algorithm runs a halving subroutine for log R0 ϵ iterations to reduce the error to ϵ. Each call to this subroutine reduces the upper bound on the distance between the current solution and the ground truth X by half. This halving subroutine runs in time O(ndω polylog (d) log ( L ϵ )rρ2L4) according to Lemmas 4.2 and 4.3. In Section 4, we present the halving algorithm (Algorithm 2). 
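Before turning to the halving subroutine, here is a minimal sketch of the outer loop of Algorithm 1. The function `halve_error` stands in for Algorithm 2 of Section 4; its signature and the even splitting of the failure probability across the $T$ calls are assumptions of this sketch, not details fixed by the paper.

```python
import numpy as np

def semi_random_matrix_sensing(A, b, R0, eps, delta, halve_error):
    """Outer loop of Algorithm 1 (sketch).

    Starting from X_0 = 0 with ||X_0 - X*||_F <= R_0, each call to the halving
    subroutine (Algorithm 2) shrinks the error bound R by a factor of 2, so
    after T = ceil(log2(R0/eps)) calls the error is at most eps.
    """
    d1, d2 = A.shape[1], A.shape[2]
    X = np.zeros((d1, d2))
    T = int(np.ceil(np.log2(R0 / eps)))
    R = R0
    for _ in range(T):
        X = halve_error(X, R, A, b, delta / T)   # split failure probability across calls
        R /= 2.0
    return X
```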
Algorithm 2 depends on an $(\Omega(1), O(1))$-oracle, and Lemma 4.1 shows that the oracle guarantees can be easily verified. The algorithm's correctness and running time are analyzed in Lemma 4.2 and Lemma 4.3. In Section 5 we present the weight oracle required by the halving algorithm. We first show in Lemma 5.1 that the wRIP condition implies that the sensing matrices satisfy the dRIP condition tailored to the design of the oracle. Then we present an implementation of the oracle in Algorithm 3 based on the dRIP condition, and analyze its correctness and running time in Lemma 5.3 and Lemma 5.4.

4 Algorithm for Halving the Error

In this section, we present Algorithm 2 (HalveError). Algorithm 2 takes an estimate $X_{\mathrm{in}}$ with $\|X_{\mathrm{in}} - X^\star\|_F \le R$ and outputs $X_{\mathrm{out}}$ such that $\|X_{\mathrm{out}} - X^\star\|_F \le \frac{R}{2}$. This is the matrix version of the HalfRadiusSparse algorithm of [KLL+23] for vectors.

Algorithm 2: HalveError($X_{\mathrm{in}}, R, \mathcal{O}, \delta, \mathcal{A}, b$)
1: Input: rank-$r$ matrix $X_{\mathrm{in}} \in \mathbb{R}^{d_1 \times d_2}$ with $\|X_{\mathrm{in}} - X^\star\|_F \le R$; $\mathcal{O}$ is a $(1, 12L^2)$-oracle for $\mathcal{A}$ with failure probability $\delta \in (0, 1)$; linear measurements $b = \mathcal{A}[X^\star]$.
2: Output: $X_{\mathrm{out}} \in \mathbb{R}^{d_1 \times d_2}$ s.t. $\|X_{\mathrm{out}} - X^\star\|_F \le \frac{R}{2}$ w.p. $1 - \delta$ and $\operatorname{rank}(X_{\mathrm{out}}) \le r$.
3: $X_0 \leftarrow X_{\mathrm{in}}$, $\ \mathcal{X} = \{X \in \mathbb{R}^{d_1 \times d_2} \mid \|X - X_{\mathrm{in}}\|_* \le \sqrt{2r}\,R\}$, $\ \eta = \frac{1}{288 L^4}$, $\ T = \frac{6}{\eta}$.
4: for $0 \le t \le T$ do
5:   $u_t \leftarrow \frac{1}{R}(\mathcal{A}[X_t] - b)$ ;  /* $u_t = \mathcal{A}\big[\frac{X_t - X^\star}{R}\big]$, i.e., $(u_t)_i = \frac{1}{R}\langle A_i, X_t - X^\star\rangle$ */
6:   $w_t \leftarrow \mathcal{O}(\mathcal{A}, u_t, \frac{\delta}{T})$ ;
7:   $G_t \leftarrow \sum_{i=1}^n (w_t)_i (u_t)_i A_i$ ;
8:   if the output of $\mathcal{O}$ satisfies the progress and decomposition guarantees on $u_t$ then
9:     $X_{t+1} \leftarrow \operatorname{argmin}_{X \in \mathcal{X}} \|X - (X_t - \eta R\, G_t)\|_F^2$ ;
10:  else
11:    return $X_{\mathrm{out}} \leftarrow (X_t)_{(r)}$ ;  /* rank-$r$ approximation of $X_t$ */
12:  end if
13: end for
14: return $X_{\mathrm{out}} \leftarrow (X_T)_{(r)}$ ;

A crucial requirement of the algorithm is an $(\Omega(1), O(1))$-oracle for $\mathcal{A}$. In each iteration, the oracle takes the vector $u_t = \frac{\mathcal{A}[X_t] - b}{R}$, which is the (normalized) measured deviation between the current estimate $X_t$ and $X^\star$, and computes a weight vector $w_t$. The algorithm then tries to minimize the weighted objective function by gradient descent:

Objective: $f_t(X) = \frac{1}{2}\sum_{i=1}^n (w_t)_i \big\langle A_i, \frac{X - X^\star}{R}\big\rangle^2$, i.e., $f_t(X_t) = \frac{1}{2}\sum_{i=1}^n (w_t)_i (u_t)_i^2$;
Gradient: $\nabla_X f_t(X) = \frac{1}{R}\sum_{i=1}^n (w_t)_i \big\langle A_i, \frac{X - X^\star}{R}\big\rangle A_i$, and the algorithm uses the rescaled gradient $G_t = R\,\nabla_X f_t(X)\big|_{X_t} = \sum_{i=1}^n (w_t)_i (u_t)_i A_i$.

Ideally, in the next iteration we would like to make a step from $X_t$ in the direction opposite to the gradient $G_t$, with the goal of minimizing the deviation in the next iteration. However, we cannot take a step exactly in the direction of $G_t$: our movement is constrained to a ball of nuclear-norm radius $\sqrt{2r}\,R$ centered at $X_{\mathrm{in}}$, namely the region $\mathcal{X} = \{X \mid \|X - X_{\mathrm{in}}\|_* \le \sqrt{2r}\,R\}$. The nuclear norm is used as a proxy to control the rank and the Frobenius norm of $X_t$ simultaneously throughout the algorithm: firstly, since $\|X_{\mathrm{in}} - X^\star\|_F \le R$, it makes sense that in each iteration $\|X_t - X_{\mathrm{in}}\|_F \lesssim R$ as well; secondly, while trying to minimize the difference between $X_t$ and $X^\star$, we also want to ensure the rank of $X_t$ stays relatively small, i.e., $\operatorname{rank}(X_t) = O(r)$. To tie things together, we use the following relationship between rank and numerical rank: $\operatorname{rank}(X_t - X_{\mathrm{in}}) \ge \operatorname{Rank}_n(X_t - X_{\mathrm{in}}) = \frac{\|X_t - X_{\mathrm{in}}\|_*^2}{\|X_t - X_{\mathrm{in}}\|_F^2}$. Since $\operatorname{rank}(X_t)$ and $\operatorname{rank}(X_t - X_{\mathrm{in}})$ differ by at most $\operatorname{rank}(X_{\mathrm{in}}) \le r$, and $\|X_t - X_{\mathrm{in}}\|_F \lesssim R$ throughout, we get (up to an additive $r$) $\operatorname{rank}(X_t) \gtrsim \frac{\|X_t - X_{\mathrm{in}}\|_*^2}{2R^2}$. Roughly speaking, in order for $\operatorname{rank}(X_t) = O(r)$, it is necessary that $\|X_t - X_{\mathrm{in}}\|_* = O(\sqrt{r}\,R)$, i.e., $X_t$ lies inside some nuclear-norm ball of radius $O(\sqrt{r}\,R)$ centered at $X_{\mathrm{in}}$. We set the radius of $\mathcal{X}$ to be $\sqrt{2r}\,R$ so that $X^\star \in \mathcal{X}$: since $\|X_{\mathrm{in}} - X^\star\|_F \le R$ and $\operatorname{rank}(X_{\mathrm{in}} - X^\star) \le 2r$, we have $\|X^\star - X_{\mathrm{in}}\|_* \le \sqrt{2r}\,R$. Thus we confine our movement to this nuclear-norm ball of radius $\sqrt{2r}\,R$ centered at $X_{\mathrm{in}}$ throughout the algorithm, and take the rank-$r$ approximation of the last $X_t$ to ensure $\operatorname{rank}(X_{\mathrm{out}}) \le r$ upon termination.
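The sketch below shows one gradient step of the loop just described: form the normalized deviation $u_t$, take the weighted gradient $G_t$, and project back onto the nuclear-norm ball around $X_{\mathrm{in}}$. The weights $w_t$ are assumed to come from the weight oracle of Section 5. The projection is implemented here via the standard projection of singular values onto an $\ell_1$ ball, which is equivalent to the SVD-plus-truncation step described in the running-time analysis; all function names are ours.

```python
import numpy as np

def project_singular_values_l1(s, radius):
    """Project a vector of nonnegative singular values onto the l1 ball of the
    given radius (standard sort-based projection)."""
    if s.sum() <= radius:
        return s
    u = np.sort(s)[::-1]
    css = np.cumsum(u)
    ks = np.arange(1, len(u) + 1)
    rho = np.max(np.nonzero(u - (css - radius) / ks > 0)[0]) + 1
    theta = (css[rho - 1] - radius) / rho
    return np.maximum(s - theta, 0.0)

def project_nuclear_ball(Y, center, radius):
    """Euclidean projection of Y onto {X : ||X - center||_* <= radius}."""
    U, s, Vt = np.linalg.svd(Y - center, full_matrices=False)
    return center + (U * project_singular_values_l1(s, radius)) @ Vt

def halve_error_iteration(A, b, X_t, X_in, R, w_t, eta, r):
    """One gradient step of Algorithm 2 (sketch): weighted gradient G_t, then
    projection onto the nuclear-norm ball of radius sqrt(2r)*R around X_in."""
    u_t = (np.einsum('nij,ij->n', A, X_t) - b) / R       # normalized measured deviation
    G_t = np.einsum('n,nij->ij', w_t * u_t, A)            # sum_i (w_t)_i (u_t)_i A_i
    Y = X_t - eta * R * G_t
    return project_nuclear_ball(Y, X_in, np.sqrt(2 * r) * R)
```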
To analyze the algorithm, we first show how to check whether the weight oracle output satisfies the progress and decomposition guarantees. The progress condition $\sum_{i=1}^n w_i u_i^2 \ge 1$ is trivial to verify, and we check whether $G$ is $(C_F, C_2)$-decomposable using Lemma 4.1, with details and proof deferred to Appendix A.

Lemma 4.1 (Verify Norm Decomposition). Given a matrix $G = U\Sigma V^\top = \sum_{i=1}^d \sigma_i u_i v_i^\top$ and $C_2 > 0$, suppose $\sigma_1 \ge \cdots \ge \sigma_k > C_2 \ge \sigma_{k+1} \ge \cdots \ge \sigma_d$. Then for all $E$ with $\|E\|_2 \le C_2$, we have $\|G - E\|_F^2 \ge \sum_{i=1}^k (\sigma_i - C_2)^2$.

The following lemmas analyze the algorithm's correctness, showing that it terminates with the desired distance contraction, and bound its running time. The proofs are deferred to Appendix A.

Lemma 4.2 (Algorithm 2: Halve Error). Given a $(1, 12L^2)$-oracle for $\mathcal{A}$ with failure probability $\delta \in (0,1)$, where $\mathcal{A}$ satisfies the dRIP Condition 2.3, and $b = \mathcal{A}[X^\star]$, Algorithm 2 succeeds with probability at least $1 - \delta$.

Lemma 4.3 (Algorithm 2 Running Time). Algorithm 2 with failure probability $\delta$ runs in time $O(n d^\omega\, \mathrm{polylog}(d)\, \log(\tfrac{L}{\delta})\, r\rho^2 L^4)$.

The crucial part of Lemma 4.2 shows that if the current estimate $X_t$ is sufficiently far from $X^\star$, i.e., $\|X_t - X^\star\|_F \ge \frac{1}{4}R$, then by Lemma 5.3 the weight oracle produces, with high probability, an output satisfying the progress and decomposition guarantees, and each iteration of Algorithm 2 decreases the distance to $X^\star$ by a constant factor: $\|X_{t+1} - X^\star\|_F^2 \le (1 - \frac{\eta}{2})\|X_t - X^\star\|_F^2$. Thus, after a sufficient number of iterations, the distance to $X^\star$ is halved. On the other hand, if the weight oracle fails, then with high probability the current estimate $X_t$ is already sufficiently close to $X^\star$, so the algorithm can terminate early.

5 Oracle for Reweighting the Input

In this section, we present an algorithm (Algorithm 3) that serves as the weight oracle required by the error-halving algorithm (Algorithm 2). Algorithm 3 is the matrix version of the StepOracle algorithm of [KLL+23] for vectors. We first state that, given proper choices of parameters within a constant factor, the wRIP Condition 2.2 implies the dRIP Condition 2.3, which is a more suitable property for our oracle implementation. The proof is deferred to Appendix C.

Lemma 5.1 (wRIP implies dRIP). If $\mathcal{A}$ satisfies the wRIP Condition 2.2 with parameters $r', L', \rho$, then $\mathcal{A}$ satisfies the dRIP Condition 2.3 with parameters $L, K, r, \rho$ such that $L = \Theta(L')$, $r = \Theta(r')$, and some constant $K \ge 1$.

Now we are ready to present an implementation of the weight oracle in Algorithm 3 based on the dRIP condition. This algorithm takes as inputs the dRIP sensing matrices $\mathcal{A}$ and a vector $u$. If $u$ is an applicable input to the oracle, with high probability the algorithm outputs a weight vector $w$ satisfying the progress and decomposition guarantees as in Definition 2.5. First we introduce some potential functions used in the algorithm.

Definition 5.2 (Potential Functions in Algorithm 3). For sensing matrices $\mathcal{A} = \{A_i\}_{i=1}^n$ and input $u \in \mathbb{R}^n$ to the oracle, we define the following potential functions on a weight vector $w \in \mathbb{R}^n$:
Progress potential: $\Phi_{\mathrm{prog}}(w) = \sum_{i=1}^n w_i u_i^2$.
Decomposition potential: $\Phi_{\mathrm{dc}}(w) = \min_{\|S\|_F \le L\|w\|_1} \mu^2 \log\big[F(G_w - S)\big] + \frac{\|w\|_1}{4CLr}$, where $G_w = \sum_{i=1}^n w_i u_i A_i$ and $F(E) = \operatorname{tr}\exp\!\big(\frac{E^\top E}{\mu^2}\big)$.
Overall potential: $\Phi(w) = \Phi_{\mathrm{prog}}(w) - Cr\,\Phi_{\mathrm{dc}}(w)$.

Note that $F(E) = \sum_{j=1}^d \exp\!\big(\frac{\sigma_j^2(E)}{\mu^2}\big)$, where $\sigma_j(E)$ is the $j$th singular value of $E$, due to properties of the exponential of a diagonalizable matrix (see the sketch below). The progress and decomposition potential functions control the progress and decomposition guarantees respectively, and later we will show that the termination condition is implied by the overall potential satisfying $\Phi(w) \ge 0$.
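The potentials of Definition 5.2 are straightforward to evaluate; the sketch below computes $\Phi_{\mathrm{prog}}$ and numerically checks the identity $F(E) = \operatorname{tr}\exp(E^\top E/\mu^2) = \sum_j \exp(\sigma_j^2(E)/\mu^2)$ used above. The inner minimization over $S$ in $\Phi_{\mathrm{dc}}$ is not shown here; it is handled inside Algorithm 3 and its analysis. Function names are ours.

```python
import numpy as np
from scipy.linalg import expm

def progress_potential(w, u):
    """Phi_prog(w) = sum_i w_i * u_i^2."""
    return float(np.dot(w, u ** 2))

def F_trace_exp(E, mu):
    """F(E) = tr exp(E^T E / mu^2), computed via a dense matrix exponential."""
    return float(np.trace(expm(E.T @ E / mu ** 2)))

def F_singular_values(E, mu):
    """Same quantity via F(E) = sum_j exp(sigma_j(E)^2 / mu^2); extra zero
    eigenvalues of E^T E (when E has more columns than rows) contribute exp(0)=1."""
    s = np.linalg.svd(E, compute_uv=False)
    n_zero = E.shape[1] - len(s)
    return float(np.sum(np.exp(s ** 2 / mu ** 2)) + n_zero)

# Quick numerical check of the identity quoted after Definition 5.2:
# rng = np.random.default_rng(0); E = rng.standard_normal((5, 7)); mu = 2.0
# assert np.isclose(F_trace_exp(E, mu), F_singular_values(E, mu))
```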
Consequently, by maximizing the overall potential each round, the algorithm tries to make as much progress as possible while ensuring G is decomposable. Algorithm 3: O(A, u, δ) 1: Input: Sensing operator A satisfying d RIP Condition 2.3, u Rn . 2: Output: w Rn such that the algorithm is a (1, 12L2)-oracle as in Definition 2.5 with probability (1 δ). 3: C 108, µ 1 Cr log d, η 1 Krρ2 log d, N log2 1 δ , N 8Ln 4: for 0 k N do 5: w0 0; 6: for 0 t N do 7: if Φprog(wt) 1 then 8: Return w wt; 9: else 10: Sample i [n] uniformly random; 11: st argmaxs [0,η]Φ(wt + sei) ; 12: wt+1 wt + stei ; 13: end if 14: end for 15: end for 16: Return w 0 ; Lemma 5.3 (Correctness of Algorithm 3). Suppose A satisfies d RIP Condition 2.3 and u is an applicable input to the weight oracle (that is, u = A[V ] Rn for some V Rd1 d2 satisfying V F 4, 1] and V 2 2r). Then, Algorithm 3 is a (1, 12L2)-oracle for A (as in Definition 2.5) with failure probability at most δ. We prove this lemma in two steps: first we show in Lemma B.1 that the output is valid; then in Lemma B.2 we show that the oracle achieves the success probability. Our weight oracle is inspired by the step oracle in [KLL+23]. It is worth noting that Lemma B.3, a key component used in the proof of Lemma B.2, is significantly different in the matrix case compared to the vector case. Lemma B.3 upper bounds the increase in Φdc each round, which is then used to provide a lower bound for the increase in Φ. Combining Lemma B.3 with our earlier remark that the algorithm terminates when Φ 0 gives us the number of iterations needed to terminate with high probability. The running time of Algorithm 3 is stated in the following lemma, and the proof is deferred to Appendix B. Lemma 5.4 (Algorithm 3 Running Time). Algorithm 3 with failure probability δ runs in time O(ndω polylog (d) log 1 6 Conclusion and Future Work In this paper, we pose and study the matrix sensing problem in a natural semi-random model. We relax the standard RIP assumption on the input sensing matrices to a much weaker condition where an unknown subset of the sensing matrices satisfies RIP while the rest are arbitrary. For this semi-random matrix sensing problem, existing non-convex objectives can have bad local optima. In this work, we employ an iterative reweighting approach using a weight oracle to overcome the influence of the semi-random input. Our solution is inspired by previous work on semi-random sparse vector recovery, where we exploit the structural similarities between linear regression on sparse vectors and matrix sensing on low-rank matrices. Looking forward, we believe our approach can serve as a starting point for designing more efficient and robust algorithms for matrix sensing, as well as for other low-rank matrix and sparse vector problems in the semi-random model. Acknowledgement We thank Rong Ge for helpful discussions. Xing Gao is supported in part by NSF awards ECCS2217023 and CCF-2307106. Yu Cheng is supported in part by NSF Award CCF-2307106. [BNS16] Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Global optimality of local search for low rank matrix recovery. Advances in Neural Information Processing Systems, 29, 2016. [BS95] Avrim Blum and Joel Spencer. Coloring random and semi-random k-colorable graphs. Journal of Algorithms, 19(2):204 234, 1995. [CG18] Yu Cheng and Rong Ge. Non-convex matrix completion against a semi-random adversary. In Conference On Learning Theory, pages 1362 1394. PMLR, 2018. [CLMW11] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. 
Robust principal component analysis? Journal of the ACM (JACM), 58(3):1 37, 2011. [CP11] Emmanuel J Candes and Yaniv Plan. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory, 57(4):2342 2359, 2011. [CRT06] Emmanuel J Candes, Justin K Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59(8):1207 1223, 2006. [DDH07] J. Demmel, I. Dumitriu, and O. Holtz. Fast linear algebra is stable. Numerische Mathematik, 108(1):59 91, 2007. [FCRP08] Maryam Fazel, E Candes, Benjamin Recht, and P Parrilo. Compressed sensing and robust recovery of low rank matrices. In 2008 42nd Asilomar Conference on Signals, Systems and Computers, pages 1043 1047. IEEE, 2008. [FK01] Uriel Feige and Joe Kilian. Heuristics for semirandom graph problems. Journal of Computer and System Sciences, 63(4):639 671, 2001. [GJZ17] Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In International Conference on Machine Learning, pages 1233 1242. PMLR, 2017. [JKL+20] Haotian Jiang, Tarun Kathuria, Yin Tat Lee, Swati Padmanabhan, and Zhao Song. A faster interior point method for semidefinite programming. In 2020 IEEE 61st annual symposium on foundations of computer science (FOCS), pages 910 918. IEEE, 2020. [KLL+23] Jonathan Kelner, Jerry Li, Allen X Liu, Aaron Sidford, and Kevin Tian. Semi-random sparse recovery in nearly-linear time. In The Thirty Sixth Annual Conference on Learning Theory, pages 2352 2398. PMLR, 2023. [MMV12] Konstantin Makarychev, Yury Makarychev, and Aravindan Vijayaraghavan. Approximation algorithms for semi-random partitioning problems. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pages 367 384, 2012. [MS10] Claire Mathieu and Warren Schudy. Correlation clustering with noisy input. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pages 712 728. SIAM, 2010. [PW17] Amelia Perry and Alexander S Wein. A semidefinite program for unbalanced multisection in the stochastic block model. In 2017 International Conference on Sampling Theory and Applications (Samp TA), pages 64 67. IEEE, 2017. [RFP10] Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM review, 52(3):471 501, 2010. [RS05] Jasson DM Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd international conference on Machine learning, pages 713 719, 2005. [Tho65] Colin J Thompson. Inequality with applications in statistical mechanics. Journal of Mathematical Physics, 6(11):1812 1813, 1965. [Wil14] Virginia Vassilevska Williams. Multiplying matrices in o (n2. 373) time. preprint, 2014. A Omitted Proofs from Section 4 To check whether G is (CF , C2)-decomposable using Lemma 4.1, we first consider the following construction. Suppose the SVD of G is G = UΣV = Pd i=1 σiuiv i . Let S = Pd i=1 µiuiv i with µi = max(σi C2, 0) and let E = Pd i=1 λiuiv i with λi = min(C2, σi). In other words, suppose σ1 ... σk > C2 σk+1... σd. For i k (i.e., σi > C2), let µi = σi C2 and λi = C2; for i > k (i.e., σi C2), let µi = 0 and λi = σi. We have G = S + E with E 2 C2, and S F = q Pk i=1 (σi C2)2. 
Then by Lemma 4.1, G is (CF , C2)-norm-decomposable if and only if S F CF because: if S F CF , we have a valid (CF , C2)-norm-decomposition for G ; if S F > CF , a valid (CF , C2)-norm-decomposition does not exist for G . Lemma 4.1 (Verify Norm Decomposition). Given a matrix G = UΣV = Pd i=1 σiuiv i and C2 > 0, suppose σ1 ... σk > C2 σk+1... σd, then for all E 2 C2, we have G E 2 F Pk i=1 (σi C2)2. Proof. Fix any E with E 2 C2. Observe that for all 1 i k, we have u i Gvi = σi and C2 u i Evi C2. Consequently, for all i k, we have σi > C2 and u i (G E)vi σi C2 > 0. Let S = G E, we have S 2 F = U SV 2 i=1 (u i Svi)2 i=1 (σi C2)2 . Lemma 4.2 (Algorithm 2: Halve Error). Given a (1, 12L2)-oracle for A with failure probability δ (0, 1), where A satisfies the d RIP Condition 2.3, and b = A[X ], Algorithm 2 succeeds with probability at least 1 δ. Proof. We will first show that the distance to X decreases by a constant factor after each iteration: Xt+1 X 2 F 1 η Consider iteration t in Algorithm 2: Xt+1 = argmin X X X (Xt ηRGt) 2 F . Taking the gradient of X (Xt ηRGt) 2 F at X = Xt+1, we get 2[Xt+1 (Xt ηRGt)]. Since X X and Xt+1 is the local minimizer in X: 2 Xt+1 (Xt ηRGt), Xt+1 X 0 . Rearranging the terms gives 2ηR Gt, Xt+1 X 2 Xt+1 Xt, Xt+1 X . To simplify, let D = Xt+1 Xt, Dt = Xt X , Dt+1 = Xt+1 X . Note that D + Dt = Dt+1 . 2ηR Gt, Dt+1 2 D, Dt+1 = D Dt+1, D Dt+1 D, D Dt+1, Dt+1 = Dt, Dt D, D Dt+1, Dt+1 , Dt 2 F Dt+1 2 F 2ηR Gt, Dt+1 + D, D = 2ηR Gt, Dt + 2ηR Gt, D + D, D . (3) Inequality (3) provides a lower bound on the distance contraction after each iteration. We break the right-hand side into two parts. The first term 2ηR Gt, Dt corresponds to the magnitude of the step Gt in the direction Dt = Xt X , which is the progress made by this step. To lower bound it, we will use the progress guarantee of the (1, 12L2)-oracle. Recall ut = 1 R(A[Xt] b) and consider Vt = 1 R(Xt X ) = 1 RDt so that ut = A[Vt]. Given that the oracle s output satisfies the progress guarantee, which states that Pn i=1 (wt)i(ut)2 i 1, we have: 2ηR Gt, Dt = 2ηR2 Gt, Vt = 2ηR2D n X i=1 (wt)i(ut)i Ai, Vt E = 2ηR2 n X i=1 (wt)i(ut)2 i 2ηR2 . The remaining term 2ηR Gt, D + D, D might be negative and cancel some of our progress. A natural attempt is to try to bound it using 2ηR Gt, D + D, D η2R2 Gt, Gt . However, the w RIP condition of A does not provide any guarantee on Gt 2 F . (In fact, we can derive that G 2 L from w RIP, but the best we can hope for is G 2 F rank(G) G 2 2 where rank(G) d.) This motivates the decomposition property in the d RIP Condition 2.3 and in the weight oracle. The idea is that even though we cannot directly bound Gt F , we can in fact lower bound 2ηR Gt, D + D, D the term by decomposing Gt into a Frobenius-norm-bounded matrix St, and an operator-normbounded matrix Et. Specifically, we will use the decomposition guarantee of the (1, 12L2)-oracle, which states that there exists norm-decomposition of Gt = St + Et where St F 12L2 and Et 2 1 6 r. As our movement is confined in X, D = Xt+1 Xt is nuclear-norm-bounded so the inner product Et, D can be bounded by generalized Holder s inequality. Recall η = 1 288L4 . 
2ηR Gt, D + D, D = 2ηR Et, D + 2ηR St, D + D, D 2ηR Et 2 D η2R2 St, St 2ηR 1 6 r 2 2r R η2R2 144L4 Putting inequalities 3,4 and 5 together: Dt 2 F Dt+1 2 F 2ηR2 3 2 Dt 2 F = Dt+1 2 F 1 η In the case that the algorithm terminates after T = 6 η iterations, XT X 2 F 1 η T Xin X 2 F exp ηT Xout X F Xout XT F + XT X F 2 XT X F 1 The last inequality comes from Xout being the best rank-r approximation of XT . In the case that the algorithm terminates early at Line 11, we can assume with probability at least 1 δ T that the weight oracle would have succeeded given applicable input ut. Then failure to satisfy the progress and decomposition guarantees means that ut is not an applicable input, which means Vt does not satisfy the norm constraint in the weight oracle. Vt 2 2r is guaranteed because Xt X, and Vt F = 1 R Xt X F is decreasing in each round, so we must have Vt F < 1 4, which means Xt X F < 1 4R. By the same argument as above, Xout X F 1 Finally, by a union bound on the failure probability of the weight oracle, the algorithm succeeds with probability at least 1 δ. Lemma 4.3 (Algorithm 2 Running Time). Algorithm 2 with failure probability δ runs in time O(ndω polylog (d) log L Proof. Algorithm 2 has a for-loop that s repeated for T = O(L4) times. Inside the loop, line 5 and 7 takes linear time O(nd2). Computing wt using the oracle (line 6) runs in time O(ndω polylog (d) log L δ rρ2) according to Lemma 5.4. Line 8 through line 12 are all upper bounded by time of SVD, which is on the same order of matrix multiplication O(dω) [DDH07], with current best of O(d2.373)[Wil14]. In particular, verifying the oracle guarantees (line 8) can be solved as an eigenvalue problem. Finding Xt+1 (line 9) is equivalent to projecting X t+1 := Xt ηRGt Xin 2r R onto the unit nuclear norm ball. We first perform SVD on X t+1 then binary search for the largest truncation from its singular values to reach the nuclear norm sphere in time O(d log d), and the entire projection step is dominated by SVD. Finally the output step consists of SVD and matrix multiplication. The overall running time of the algorithm is dominated by the weight oracle, so the total running time is O(ndω polylog (d) log L B Omitted proofs from Section 5 First we state a couple of lower bounds related to the decomposition potential function. Claim B.1. µ2 log[F(E)] µ2 log(d) and µ2 log[F(E)] E 2 2 . Proof. For the first lower bound, exp σ2 j (E) µ2 1 for all j [d] so F(E) = Pd j=1 exp σ2 j (E) For the second lower bound, F(E) = Pd j=1 exp σ2 j (E) µ2 exp σ2 1(E) µ2 = exp E 2 2 µ2 . Lemma B.1 (Correctness of Algorithm 3). If Algorithm 3 terminates from the inner loop, the output satisfies the progress and decomposition guarantees as defined in 2.5. Proof. We start with w0 = 0, which means Φprog(w0) = 0, Φdc(w0) = µ2 log d, and Φ(w0) = 0 Crµ2 log d = 1. At each round, since st is chosen to maximize Φ(wt + stei), in particular if we choose st = 0 then Φ(wt+1) = Φ(wt), so Φ(wt+1) Φ(wt) which is non-decreasing. By definition Φprog(wt) is also non-decreasing, and increases by at most 1 each round, because Φprog(wt+1) Φprog(wt) = stu2 i η( Ai 2 V )2 8ηrρ2 8 K log d 1. Φdc(wt) may not be monotone, but we have Φdc(wt) µ2 log d. Suppose the algorithm terminates at round t during one of the inner loops, which means Φprog(wt 1) < 1 and 1 Φprog(wt) < 2 . Progress guarantee: Φprog(wt) = Pn i=1 (wt)iu2 i 1 is satisfied upon termination. 
Decomposition guarantee: Φ(wt) = Φprog(wt) CrΦdc(wt) Φ(w0) = 1 , = min S F L w 1 µ2 log [F(Gw S)] + wt 1 4CLr = Φdc(wt) Φprog(wt) + 1 min S F L wt 1 µ2 log [F(Gwt S)] 3 Cr = S F L wt 1 s.t. Gwt S 2 2 3 Cr = wt 1 12L . So there exist S F 12L2 and E 2 = G S 2 Cr = 1 6 r which satisfy the decomposition guarantee. Notice that for any round t < t, Φprog(wt ) < 1, we also have Φdc(wt ) Φprog(wt )+1 Cr 2 Cr, so Φdc(wt) 3 Cr throughout the algorithm, which is a fact we will use later in Lemma B.3. Lemma B.2 (Success probability of Algorithm 3). Given A satisfying d RIP Condition and applicable input u, Algorithm 3 terminates from the inner loop with probability at least 1 δ. Proof. We first show the probability that the algorithm terminates from the inner loop is at least 1 2, i.e., Pr[Φprog(wt) 1] 1 2 for some t N. Notice that Φ(wt) = Φprog(wt) CrΦdc(wt) 0 = Φprog(wt) CrΦdc(wt) CrΦdc(w0) = 1, therefore the algorithm starts with Φ(w0) = 1 and will terminate once Φ(wt) 0 . Also notice that throughout the algorithm Φ(wt) < 1 because Φprog(wt) < 2 and CrΦdc(wt) 1 (from proof of Lemma B.1). To prove by contradiction, assume that Pr[Φprog(wt) 1] < 1 2 for all t N, i.e., Pr[continue] 1 2 for all rounds. We will lower bound the expected increase in Φ(wt) each round, and we will show that with sufficiently large N, E[Φ(w N)] 1 contradicting Φ(wt) < 1 for all t N. Recall that Φ(wt) = Φprog(wt) CrΦdc(wt), the lower bound for increase in Φprog(wt) is provided by d RIP condition on A and applicable input u. The upper bound for expected increase in Φdc(wt) is provided by Lemma B.3. Given the algorithm continues at round t N, consider choosing st = ηw i so that w = wt +ηw i ei, then the expected increase in Φ is at least: E[Φ(wt+1) Φ(wt) | continue] = E[Φprog(wt+1) Φprog(wt)] Cr E[Φdc(wt+1) Φdc(wt)] E[Φprog(w ) Φprog(wt)] Cr E[Φdc(w ) Φdc(wt)] i=1 ηw i u2 i Cr E[Φdc(w )] Φdc(wt) Ln Cr η 2CLnr Given the algorithm stops after round t, E[Φ(wt+1) Φ(wt) | stop] = 0. Overall: E[Φ(wt+1) Φ(wt)] = E[Φ(wt+1) Φ(wt) | continue] Pr[continue] + 0 η 2Ln Pr[continue] . By choosing a sufficiently large N = 8Ln E[Φ(w N)] Φ(w0) + ηN 2Ln Pr[continue] 1 + ηN contradicting Φ(wt) < 1 . This means each inner loop of the algorithm will terminate with probability greater than 1 2. Finally, we boost the success probability to (1 δ) using the outer loop with N = log2 1 δ iterations. Lemma B.3 provides a crucial bound used in the proof. Even though it achieves similar result as Lemma 13 [KLL+23] on the potential functions defined for vectors, analyzing the potential function defined for matrices involves very different techniques. Lemma B.3 (Potential Increase Upper Bound). Given w Rn s.t. Φdc(w) 3 Cr, by choosing a sufficiently large value for K, for w = w + ηw i ei , we have: Ei [n][Φdc(w )] Φdc(w) + η 2CLnr . The assumption Φdc(w) 3 Cr is justified in the proof of Lemma B.1. Proof. First we introduce some notation: Denote Φop(w) = min S F L w 1 µ2 log [F(Gw S)] , so that Φdc(w) = Φop(w) + w 1 G = Pn i=1 w i ui Ai, and by d RIP Condition 2.3, we know (L, 1 K r)-norm-decomposition of G = S + E , where S F L and E 2 1 K r. Let G = Pn i=1 wiui Ai, and S = argminΦop(w) so that Φop(w) = µ2 log[F(G S)], and let E = G S. Let G = Pn i=1 w iui Ai. Using these notation and Pn i=1 w i = 1: E i [n][Φdc(w )] = E i [n] Φop(w ) + w 1 Φop(w ) + w 1 + ηw i 4CLr = E i [n][Φop(w )] + w 1 4CLr + η 4CLnr . We need to show Ei [n][Φdc(w )] Φdc(w) + η 2CLnr = Φop(w) + w 1 4CLr + η 2CLnr , equivalently Ei [n][Φop(w )] Φop(w) + η 4CLnr . Consider S = S+ηw i S . 
We have S F S F +ηw i S F L w 1+ηw i L = L w 1, so S is a valid argument for Φop(w ) = min S F L w 1 µ2 log [F(G S)] , therefore Φop(w ) µ2 log[F(G S )]. Let E = G S = G + ηw i ui Ai S ηw i S = E + Z(i) where Z(i) = ηw i ui Ai ηw i S . Using these and the concavity of the log function, we have Ei [n][Φop(w )] 1 i=1 µ2 log[F(G S )] = 1 i=1 µ2 log[F(E + Z(i))] F(E + Z(i)) # It suffices to show F(E + Z(i)) # Φop(w) + η 4CLnr = µ2 log[F(E)] + 1 4CL η nr . Expanding the left hand side: i=1 F(E + Z(i)) i=1 tr exp E E + Z(i) Z(i) + E Z(i) + Z(i) E i=1 tr exp E E exp Z(i) Z(i) + E Z(i) + Z(i) E i=1 exp Z(i) Z(i) + E Z(i) + Z(i) E i=1 exp Z(i) Z(i) + E Z(i) + Z(i) E = µ2 log [F(E)] + µ2 log i=1 exp Z(i) Z(i) + E Z(i) + Z(i) E 2 The first inequality uses Golden-Thompson Inequality (stated as Lemma B.6), and the second inequality follows from Lemma B.4. Finally it suffices to show i=1 exp Z(i) Z(i) + E Z(i) + Z(i) E 2 1 4CL η nr . We will use the approximation exp(X) I + X + X2 for symmetric X with X 2 1. The argument in the exponential satisfies this condition as justified in Claim B.2. We will also use log (1 + x) x x 0. i=1 exp Z(i) Z(i) + E Z(i) + Z(i) E exp Z(i) Z(i) + E Z(i) + Z(i) E I + Z(i) Z(i) + E Z(i) + Z(i) E µ2 + (Z(i) Z(i) + E Z(i) + Z(i) E)2 E Z(i) + Z(i) E (Z(i) Z(i) + E Z(i) + Z(i) E)2 E Z(i) + Z(i) E 2(Z(i) Z(i))2 2(E Z(i) + Z(i) E)2 E Z(i) + Z(i) E (Z(i) Z(i))2 2 + 2µ2 1 n (E Z(i) + Z(i) E)2 i=1 Z(i) Z(i) 2 + i=1 E Z(i) + Z(i) E i=1 (Z(i) Z(i))2 2 + 2 i=1 (E Z(i) + Z(i) E)2 2 These four terms are bounded by Claims B.3,B.4, B.5 and B.6 respectively, notice that the second term dominates the first and the third, and the forth term dominates the second. So finally we have: i=1 exp Z(i) Z(i) + E Z(i) + Z(i) E = 1 4CL η nr , with choice of K = 388CL3 = O(L3) . Lemma B.4. If 0 A, then tr(AB) tr(A) B 2 . Proof. Since 0 A, A = P j σjuju j with σj 0 . tr(AB) = tr j σjuju j B j σj tr(uju j B) = X j σju j Buj = tr(A) B 2 . Lemma B.5. (A + B) (A + B) 2A A + 2B B , and (A + B) (A + B) 2 8(A A)2 + 8(B B)2 . 2A A + 2B B (A + B) (A + B) = A A + B B A B B A = (A B) (A B) 0 . (A + B) (A + B) 2 (2A A + 2B B)2 2[2(A A)]2 + 2[2(B B)]2 8(A A)2 + 8(B B)2 . The following claims were used in Lemma B.3. Recall that A satisfies d RIP Condition 2.3, u = A[V ] Rn for some V Rd1 d2 satisfying V F [ 1 2r, Z(i) = ηw i (ui Ai S ), and Φdc(w) 3 Cr by assumption of Lemma B.3. We have the following: Ai 2 ρ by boundedness property of d RIP Condition 2.3 , |ui| = | Ai, V | Ai 2 V ρ2 2r L rρ assuming 2 S 2 S F L, E 2 1 K r by decomposition property of d RIP Condition 2.3 , E 2 2 Φop(w) Φdc(w) 3 Cr by Claim B.1 . Z(i) Z(i)+E Z(i)+Z(i) E Z(i) Z(i) 2 = η2w 2 i (ui Ai S ) (ui Ai S ) 2 2η2w 2 i u2 i A i Ai 2 + S S 2 (Lemma B.5) 2η2w 2 i (L2rρ4 + L2) 4η2w 2 i L2rρ4 E Z(i) + Z(i) E 2 2 E Z(i) 2 2 E 2 Z(i) 2 = 2ηw i E 2 ui Ai S 2 2ηw i E 2 (|ui| Ai 2 + S 2) Cr (L rρ2 + L) Putting them together: Z(i) Z(i) + E Z(i) + Z(i) E Z(i) Z(i) 2 + E Z(i) + Z(i) E 2 Cr log d (4η2w 2 i L2rρ4 + 8ηw i Lρ2 CL K 1 with sufficiently large K . n Pn i=1 Z(i) Z(i) 2 4L2 K log d η nr . Z(i) Z(i) 2 = i=1 η2w 2 i (ui Ai S ) (ui Ai S ) 2 i=1 η2w 2 i u2 i A i Ai 2 + S S 2 i=1 ηw i (w i u2 i ρ2 + w i L2) 2η 1 Krρ2 log d(Lρ2 + L2) K log d η r i=1 Z(i) Z(i) 2 1 Z(i) Z(i) 2 n Pn i=1 E Z(i) + Z(i) E 2 4 i=1 E Z(i) + Z(i) E n E 2 ηG ηS 2 n E 2 η E 2 Claim B.5. 2 µ2 1 n Pn i=1(Z(i) Z(i))2 2 O CL4 K3rρ2 log2 d η nr . 
i=1 (Z(i) Z(i))2 2 8 i=1 η4w 4 i u4 i (A i Ai)2 2 + (S S )2 2 (Lemma B.5) i=1 η3w 2 i (w 2 i u4 i ρ4 + w 2 i L4) 8η 1 K3r3ρ6 log3 d i=1 w 2 i ρ4u4 i + i=1 w 2 i L4 # 8η 1 K3r3ρ6 log3 d i=1 w i ρ2u2 i i=1 w i L2 !2 8η 1 K3r3ρ6 log3 d(ρ4L2 + L4) K3r2ρ2 log3 d η K3r2ρ2 log3 d i=1 (Z(i) Z(i))2 2 O CL4 K3rρ2 log2 d Claim B.6. 2 µ2 1 n Pn i=1(E Z(i) + Z(i) E)2 2 96L2 i=1 (E Z(i) + Z(i) E)2 2 2 (E Z(i) + Z(i) E)2 2 i=1 4 E Z(i)Z(i) E 2 i=1 4 E 2 2 Z(i) Z(i) 2 Z(i) Z(i) 2 K log d η nr (Claim B.3) nr Lemma B.6 (Golden Thompson Inequality [Tho65]). For two n n Hermitian matrices A and B: tr exp(A + B) tr exp(A) exp(B) . Lemma 5.4 (Algorithm 3 Running Time). Algorithm 3 with failure probability δ runs in time O(ndω polylog (d) log 1 Proof. Algorithm 3 has a nested for loop that s repeated for N N = O(n log d log 1 δ rρ2) times. The major step in the loop is line 11: st argmaxs [0,η]Φ(wt + sei), which is equivalent to argmins [0,η]CrΦop(w+sei)+ s 4L su2 i . Recall that Φop(w) = min S F L w 1 µ2 log [F(Gw S)] where F(E) = tr exp E E µ2 . Note that µ2 log [F(E)] is convex in E. First we show that Φop(w) is convex in w, i.e., given w1, w2, Φop 1 2(w1 + w2) 1 2 Φop(w1) + Φop(w2) . Let w3 = 1 2(w1 + w2), and Gk = Pn i=1 (wk)iui Ai for k = 1, 2, 3. Suppose S1, S2 attain the minimum for Φop(w1), Φop(w2) respectively, i.e., Φop(w1) = µ2 log [F(G1 S1)] and Φop(w2) = µ2 log [F(G2 S2)]. Let S3 = 1 2(S1 + S2). Notice that G3 = 1 2(G1 + G2), so G3 S3 = 1 2 (G1 S1 + G2 S2). Since S3 F 1 2( S1 F + S2 F ) 1 2L( w1 1 + w2 1) = L w3 1, S3 is a valid argument for Φop(w3), therefore: Φop(w3) µ2 log [F(G3 S3)] 1 µ2 log [F(G1 S1)] + µ2 log [F(G2 S2)] 2 Φop(w1) + Φop(w2) . Line 11 is equivalent to minimizing CrΦop(w + sei) + s 4L su2 i , which is convex in s for a fixed w, over a bounded interval [0, η], so the minimization needs to evaluate Φop(w + sei) for O(polylog(d)) different values of s. Evaluating Φop is also a minimization which can be solved by computing SVD on Gw and evaluating F(Gw S) in time O(dω) for polylog(d) various constructions of S. Overall finding the optimal value of s takes time O(dω polylog(d)), and the algorithm s total running time is O(ndω polylog (d) log 1 C Omitted proofs: From w RIP to d RIP condition Here we show that the d RIP Condition 2.3 is implied by the w RIP Condition 2.2, given proper choices of parameters within a constant factor. Notice that in the w RIP condition, we have a low-rank constraint on the input matrix, i.e., rank(X) r, and in d RIP we have a norm constraint instead, i.e., V F [ 1 2r. To make use of the w RIP condition of A, we will decompose V into low-rank matrices, so that w RIP condition applies to each of the low-rank matrices. Though the rank of V is arbitrary, we can still upper bound its numerical rank based on the norm constraint. First we will introduce a low-rank decomposition, and an upper bound on the sum of their Frobenius norms. This is the matrix version of the shelling-decomposition in Lemma 15 for vectors in [KLL+23]. Lemma C.1 (Low-rank Decomposition). Given V Rd1 d2 with Rankn(V ) = V 2 V 2 F = ν, and let V = P σiuiv i be its SVD with σi in descending order. Decompose V into sum of rank-r matrices, i.e., write V = Pℓ=k ℓ=1 V (ℓ) where V (ℓ) = Pi=ℓr i=(ℓ 1)r+1 σiuiv i . Then we have: Pk ℓ=2 V (ℓ) F p ν Proof. Note that V (1) is the rank-r approximation of V , and V (ℓ) s are constructed using disjoint singular values and vectors in groups of size r, and are orthogonal to each other. Denote σi V (ℓ) as the ith largest singular value of V (ℓ). 
V (ℓ+1) F r σ1 V (ℓ+1) r σr V (ℓ) r V 2 V 2 F ν = Now we are ready to prove Lemma 5.1, which states that w RIP implies d RIP condition. The proof uses similar techniques as in the second part of Lemma 17 [KLL+23] for vector recovery. Lemma 5.1 (w RIP = d RIP). If A satisfies w RIP Condition 2.2 with parameters r , L , ρ, then A satisfies the d RIP Condition 2.3 with parameters L, K, r, ρ such that L = Θ(L ), r = Θ(r ), and some constant K 1. Proof. Boundedness property is satisfied by assumption Ai 2 ρ i . Isometry property: Consider V Rd1 d2 s.t. V F [ 1 2r. Need to show 1 L Pn i=1 w i Ai, V 2 L . Let L = 25L , K 1 and r = r 12800L2K2 . ν = Rankn(V ) = V 2 V 2 F 128r. By Lemma C.1, decompose V into rank-r matrices so that we can apply the (r , L )-w RIP property of A. V (ℓ) F r ν r V F 1 10LK V F 1 V F V(r ) F = V (1) F = ℓ=2 V (ℓ) F V F w i Ai, so that Pn i=1 w i Ai, V 2 = Pn i=1 Bi, V 2 = Pn i=1 Bi, V ei 2 2 . Lower bound: i=1 Bi, V ei i=1 Bi, V (1) ei ℓ=2 V (ℓ) ei i=1 Bi, V (1) ei i=1 Bi, V (ℓ) ei i=1 Bi, V (1) 2 i=1 Bi, V (ℓ) 2 1 L V (1) 2 F L V (ℓ) 2 F Taking the square: Pn i=1 w i Ai, V 2 4.482 L V 2 F 4.482 Upper bound: i=1 Bi, V ei i=1 Bi, V (1) ei i=1 Bi, V (ℓ) ei i=1 Bi, V (1) 2 + i=1 Bi, V (ℓ) 2 L V (1) F + L V 2 F + 0.1 L 5 V F + 0.02 Taking the square: Pn i=1 w i Ai, V 2 L 25 V 2 F + 0.022 L V 2 F + 0.008 V 2 F L V 2 F L . Combining the lower bound and upper bound: 1 L Pn i=1 w i Ai, V 2 L . Decomposition property: Let S = G(r ), the rank-r approximation of G = Pn i=1 w i Ai, V Ai. Let E = G S. Suffices to show S F L and E 2 1 K r . We have S 2 F = S, S = G, S = i=1 w i Ai, V Ai, S = i=1 Bi, V Bi, S = i=1 Bi, V Bi, S i=1 Bi, V ei, j=1 Bj, S ej i=1 Bi, V ei i=1 Bi, S ei i=1 Bi, V 2 i=1 Bi, S 2 = i=1 w i Ai, V 2 i=1 w i Ai, S 2 which implies S F L 5 L, and consequently, E 2 = σr +1(G) σr (G) = σr (S) S 2 F r L 5 2LK r 1 K r .
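To illustrate the low-rank (shelling) decomposition of Lemma C.1 used in the proof above, the following sketch splits a matrix into rank-$r$ blocks along its SVD and numerically checks the tail bound $\sum_{\ell \ge 2} \|V^{(\ell)}\|_F \le \sqrt{\nu/r}\,\|V\|_F$ with $\nu = \|V\|_*^2/\|V\|_F^2$. The test matrix is arbitrary and the code is ours, for illustration only.

```python
import numpy as np

def shelling_decomposition(V, r):
    """Decompose V into rank-r blocks V = sum_l V^(l) by grouping singular
    components r at a time, in decreasing order of singular value (Lemma C.1)."""
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    blocks = []
    for start in range(0, len(s), r):
        sl = slice(start, start + r)
        blocks.append((U[:, sl] * s[sl]) @ Vt[sl, :])
    return blocks

# Numerical check of the tail bound in Lemma C.1 on an arbitrary test matrix.
rng = np.random.default_rng(0)
V = rng.standard_normal((40, 8)) @ rng.standard_normal((8, 50))   # rank <= 8
r = 3
blocks = shelling_decomposition(V, r)
s = np.linalg.svd(V, compute_uv=False)
nu = s.sum() ** 2 / (s ** 2).sum()                # numerical rank ||V||_*^2 / ||V||_F^2
tail = sum(np.linalg.norm(B, 'fro') for B in blocks[1:])
assert np.allclose(sum(blocks), V)                 # the blocks reconstruct V
assert tail <= np.sqrt(nu / r) * np.linalg.norm(V, 'fro') + 1e-9
```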