# Geometric Analysis of Matrix Sensing over Graphs

Haixiang Zhang, Department of Mathematics, University of California, Berkeley, Berkeley, CA 94720, haixiang_zhang@berkeley.edu
Ying Chen, Department of IEOR, University of California, Berkeley, Berkeley, CA 94720, ying-chen@berkeley.edu
Javad Lavaei, Department of IEOR, University of California, Berkeley, Berkeley, CA 94720, lavaei@berkeley.edu

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

In this work, we consider the problem of matrix sensing over graphs (MSoG). As a generalization of the matrix completion and matrix sensing problems, the MSoG problem has not been analyzed in the literature, and the existing results cannot be directly applied to it. This work provides the first theoretical results on the optimization landscape of the MSoG problem. More specifically, we propose a new condition, named the Ω-RIP condition, to characterize the optimization complexity of the problem. In addition, with an improved incoherence regularizer, we prove that the strict saddle property holds for the MSoG problem with high probability under the incoherence condition and the Ω-RIP condition, which guarantees the polynomial-time global convergence of saddle-avoiding methods. Compared with state-of-the-art results, the bounds in this work are tight up to a constant. Besides the theoretical guarantees, we numerically illustrate the close relation between the Ω-RIP condition and the optimization complexity.

1 Introduction

In a wide range of problems in the fields of machine learning, signal processing and power systems, an unknown low-rank matrix parameter must be estimated from a few measurements of the matrix. More specifically, given some measurements of an unknown symmetric and positive semi-definite (PSD) matrix $M^* \in \mathbb{R}^{n\times n}$ of rank $r \ll n$, the low-rank matrix optimization problem can be formulated as

$$\min_{M \in \mathbb{R}^{n\times n}} \; f(M; M^*) \quad \text{s.t.} \quad M \succeq 0, \;\; \mathrm{rank}(M) \leq r, \qquad (1)$$

where $f(\cdot\,; M^*)$ is a loss function that penalizes the mismatch between $M$ and $M^*$. The goal is to recover the matrix $M^*$ by finding a global minimizer of problem (1). Applications of this problem include matrix sensing [31, 40, 38], matrix completion [10, 11, 17], phase retrieval [8, 33, 14], and power systems [39, 24]; see the review papers [13, 15] for more applications.

Early attempts to deal with the nonconvex rank constraint of the problem focused on solving a convex relaxation of (1); see [10, 31, 11, 7]. However, the convex relaxation approach usually updates the matrix variable via a Singular Value Decomposition (SVD) in each iteration. This leads to an $O(n^3)$ computational complexity per iteration and an $O(n^2)$ space complexity, which are prohibitively high for large-scale problems; see the numerical comparison in [41]. Similar issues are observed for algorithms based on the Singular Value Projection [20] and for Riemannian optimization algorithms [35, 36, 19, 1, 27]. To improve the computational and memory efficiency, the Burer-Monteiro factorization approach was proposed in [6], which is based on the fact that the mapping $U \mapsto UU^\top$ is surjective onto the manifold of PSD matrices of rank at most $r$, where $U \in \mathbb{R}^{n\times r}$. More concretely, problem (1) is equivalent to

$$\min_{U \in \mathbb{R}^{n\times r}} \; f(UU^\top; M^*). \qquad (2)$$
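To make the factorization approach concrete, the following minimal sketch (not from the paper; the problem sizes, step size, and number of iterations are arbitrary illustrative choices) runs plain gradient descent on the factorized $\ell_2$ loss $f(UU^\top; M^*) = \|UU^\top - M^*\|_F^2$.

```python
# A minimal sketch of the Burer-Monteiro approach: optimize over factors U in R^{n x r}
# instead of PSD matrices M of rank <= r, using the l2 loss ||UU^T - M*||_F^2.
import numpy as np

rng = np.random.default_rng(0)
n, r = 50, 3
U_star = rng.standard_normal((n, r))
M_star = U_star @ U_star.T                    # PSD ground truth of rank r

def loss(U):
    return np.linalg.norm(U @ U.T - M_star, "fro") ** 2

def grad(U):
    # gradient of ||UU^T - M*||_F^2 with respect to U is 4 (UU^T - M*) U
    return 4.0 * (U @ U.T - M_star) @ U

U = 0.1 * rng.standard_normal((n, r))         # small random initialization
for _ in range(2000):
    U -= 1e-3 * grad(U)

print("relative recovery error:",
      np.linalg.norm(U @ U.T - M_star, "fro") / np.linalg.norm(M_star, "fro"))
```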
Due to the non-convexity of the mapping $U \mapsto UU^\top$, problem (2) is an unconstrained non-convex problem and may have spurious second-order critical points (i.e., second-order critical points that do not correspond to the ground truth matrix $M^*$). In general, saddle-avoiding local search methods are only able to find $\epsilon$-approximate second-order critical points¹. As a result, local search methods with a random initialization may get stuck at spurious second-order critical points and fail to converge to the ground truth solution. However, in a variety of real-world applications, simple algorithms such as perturbed gradient descent methods and alternating minimization methods have achieved empirical success on problem (2). Recently, substantial progress has been made on the theoretical explanation of the benign behavior of these algorithms. For example, the alternating minimization algorithm was studied in [21, 28, 29]. The (stochastic) gradient descent algorithm, which is in general easier to implement than the alternating minimization algorithm, was analyzed in [8, 34, 37, 14, 13]. Moreover, the gradient descent algorithm is proved to exhibit an implicit regularization phenomenon in the over-parameterized case [26, 16, 32]. Besides the algorithmic analysis, a large body of literature [17, 33, 42, 38] has focused on the geometric analysis of the landscape of problem (2), which usually relies on the strict saddle property [33]. Intuitively, the strict saddle property states that at any feasible point of problem (2), at least one of the following three properties holds: (i) the point is close to a global solution; (ii) the norm of the gradient is large; (iii) the Hessian matrix has a negative eigenvalue. In the latter two cases, saddle-escaping algorithms [12, 23, 2] are able to find a descent direction and thus converge globally in polynomial time. The formal definition of the strict saddle property is provided in Section 3.1. In the following, we review the state-of-the-art conditions for two special classes of problem (2) that guarantee the strict saddle property; see the survey [15] for other problem classes.

1.1 Matrix sensing and the Restricted Isometry Property (RIP) condition

In the matrix sensing problem, the information about the ground truth matrix $M^*$ is gathered via a measurement operator $\mathcal{A}$. The loss function $f$ is usually chosen to be the negative log-likelihood function. For example, in the classic linear matrix sensing problem, the operator $\mathcal{A}$ is a linear operator defined as

$$\mathcal{A}(M) := \left[\langle A_1, M\rangle, \ldots, \langle A_m, M\rangle\right]^\top, \quad \forall M \in \mathbb{R}^{n\times n}, \qquad (3)$$

where $m$ is the number of measurements and each $A_i \in \mathbb{R}^{n\times n}$ contains independently and identically distributed (i.i.d.) Gaussian random entries, with the $A_i$'s independent of each other. If the measurement noise is Gaussian, the maximum likelihood estimation is equivalent to the following minimization problem:

$$\min_{U \in \mathbb{R}^{n\times r}} \; \left\|\mathcal{A}(UU^\top) - \mathcal{A}(M^*)\right\|_F^2. \qquad (4)$$

More examples of the (non-linear) matrix sensing problem are discussed in [42]. One of the most important conditions that guarantee a benign landscape is the RIP condition:

Definition 1 (RIP Condition [31, 42]). Given natural numbers $r$ and $s$, the function $f(\cdot\,; M^*)$ is said to satisfy the Restricted Isometry Property (RIP) of rank $(2r, 2s)$ for a constant $\delta \in [0, 1)$, denoted as $\delta$-RIP$_{2r,2s}$, if

$$(1-\delta)\|K\|_F^2 \leq \left[\nabla^2 f(M; M^*)\right](K, K) \leq (1+\delta)\|K\|_F^2 \qquad (5)$$

holds for all matrices $M, K \in \mathbb{R}^{n\times n}$ such that $\mathrm{rank}(M) \leq 2r$ and $\mathrm{rank}(K) \leq 2s$, where $[\nabla^2 f(M; M^*)](K, K)$ is the curvature of the Hessian at point $M$ along direction $K$.
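The following sketch illustrates the measurement operator (3) and the loss (4) for Gaussian sensing matrices. The $1/\sqrt{m}$ normalization and all problem sizes are choices made here for illustration; the empirical check probes $\|\mathcal{A}(K)\|^2/\|K\|_F^2$, which for the quadratic loss (4) coincides with the Hessian curvature in Definition 1 up to a constant factor.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, m = 30, 2, 800
A_mats = rng.standard_normal((m, n, n)) / np.sqrt(m)   # i.i.d. Gaussian sensing matrices A_i

def A_op(M):
    # A(M) = [<A_1, M>, ..., <A_m, M>]
    return np.tensordot(A_mats, M, axes=([1, 2], [0, 1]))

U_star = rng.standard_normal((n, r))
M_star = U_star @ U_star.T
y = A_op(M_star)

def sensing_loss(U):
    # objective of problem (4): ||A(UU^T) - A(M*)||^2
    return np.sum((A_op(U @ U.T) - y) ** 2)

# empirical near-isometry check: under this normalization, ||A(K)||^2 concentrates
# around ||K||_F^2 along random rank-2r directions when m is large enough
ratios = []
for _ in range(200):
    V = rng.standard_normal((n, 2 * r))
    K = V @ V.T
    ratios.append(np.sum(A_op(K) ** 2) / np.linalg.norm(K, "fro") ** 2)
print("min / max ratio over random rank-2r directions:", min(ratios), max(ratios))
```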
¹A point $x_0$ is called an $\epsilon$-approximate second-order critical point of the optimization problem $\min_x F(x)$ if $\|\nabla F(x_0)\|_F \leq \epsilon$ and $\lambda_{\min}[\nabla^2 F(x_0)] \geq -\epsilon$.

For the linear matrix sensing problem (4), it is proved in [9] that the $\delta$-RIP$_{2r,2s}$ condition holds with high probability if $m = \Theta(nr\delta^{-2})$. The RIP condition has also been established in many other applications of the matrix sensing problem [42, 4] and is of independent research interest. The constant $\delta$ plays a critical role in characterizing the optimization landscape of problem (2). More specifically, in [5], the authors showed that the strict saddle property holds for problem (2) if the $\delta$-RIP$_{2r,2r}$ condition holds with $\delta < 1/2$. Counterexamples have been constructed in [40, 38] to illustrate that the strict saddle property may fail under the $\delta$-RIP$_{2r,2r}$ condition with $\delta \geq 1/2$.

1.2 Matrix completion and the incoherence condition

In spite of the strong theoretical results under the RIP condition, there exist a large number of applications that do not satisfy the RIP condition. A well-studied class of problems that does not satisfy the RIP condition is the matrix completion problem. For the matrix completion problem, a subset $\Omega \subseteq [n]\times[n]$ of the entries of $M^*$ is observed and the goal is to recover the low-rank ground truth matrix from the observed entries. For every matrix $M \in \mathbb{R}^{n\times n}$, we denote the projection of $M$ onto $\Omega$ as $M_\Omega$, namely, $(M_\Omega)_{ij} = M_{ij}$ if $(i, j) \in \Omega$ and $0$ otherwise. Using the $\ell_2$-loss function, the matrix completion problem is defined as

$$\min_{U \in \mathbb{R}^{n\times r}} \; \left\|(UU^\top)_\Omega - M^*_\Omega\right\|_F^2. \qquad (6)$$

The matrix completion problem (6) is a special case of the linear matrix sensing problem (4), where the sample size $m$ is equal to $|\Omega|$ and each measurement matrix $A_i$ contains exactly one nonzero entry. We note that the $\delta$-RIP$_{2r,2r}$ condition does not hold for problem (6) unless we observe all entries of $M^*$, i.e., unless $\Omega = [n]\times[n]$. As an alternative to the RIP condition, the incoherence of $M^*$ is useful in characterizing the complexity of problem (6).

Definition 2 (Incoherence Condition [10]). Given $\mu \in [1, n]$, the ground truth matrix $M^*$ is said to be $\mu$-incoherent if

$$\left\|e_i^\top V^*\right\|_F \leq \sqrt{\mu r / n}, \quad \forall i \in [n], \qquad (7)$$

where $V^*\Lambda^*(V^*)^\top$ is the truncated SVD of $M^*$ and $e_i$ is the $i$-th standard basis vector of $\mathbb{R}^n$.

Intuitively, the incoherence of $M^*$ measures the sparsity of the low-rank ground truth matrix. If the incoherence is large, the matrix $M^*$ is highly sparse and it is necessary to measure considerably many entries of $M^*$ to observe nonzero entries. On the other hand, a relatively small incoherence of $M^*$ rules out this extreme case. Except for the limited literature on the deterministic matrix completion problem [3, 25, 30], the majority of the matrix completion literature considers the following Bernoulli sampling model:

Definition 3 (Bernoulli Sampling Model). Given a sampling rate $p \in (0, 1]$, each index $(i, j) \in [n]\times[n]$ belongs to the set $\Omega$ independently with probability $p$.

Under the Bernoulli sampling model, the objective function of problem (6) is well-behaved over the set of matrices with a small incoherence. Therefore, a regularizer that penalizes the incoherence of $UU^\top$ is included and we instead solve the following regularized matrix completion problem:

$$\min_{U \in \mathbb{R}^{n\times r}} \; \frac{1}{p}\left\|(UU^\top)_\Omega - M^*_\Omega\right\|_F^2 + \lambda \sum_{i\in[n]} \left(\left\|e_i^\top U\right\|_F - \alpha\right)_+^4, \qquad (8)$$

where $(x)_+ := \max\{x, 0\}$ for all $x \in \mathbb{R}$ and $\alpha, \lambda > 0$ are constants. Intuitively, the coefficient $1/p$ is used to normalize the $\ell_2$-norm. We note that there exists an algorithm that can solve problem (6) without the incoherence regularizer [13]. However, that algorithm relies on a spectral initialization, which requires more computational effort than the factorization approach (2).
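A small sketch of the Bernoulli sampling model and the regularized objective (8) follows. The scaling of $\alpha$ mirrors the order used later in Theorem 3, but the specific constants for $\alpha$ and $\lambda$ are illustrative placeholders rather than the authors' tuned values.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, p = 40, 2, 0.5
U_star = rng.standard_normal((n, r))
M_star = U_star @ U_star.T
Omega = rng.random((n, n)) < p                      # Bernoulli sampling model (Definition 3)

# incoherence of M* (Definition 2): smallest mu with ||e_i^T V*||_F <= sqrt(mu r / n)
V = np.linalg.svd(M_star)[0][:, :r]
mu = n / r * np.max(np.sum(V ** 2, axis=1))

# regularization parameters; alpha^2 ~ mu r sigma_1 / n follows the order in Theorem 3,
# the constants 2 and 10 below are placeholders
sigma1 = np.linalg.norm(M_star, 2)
alpha = np.sqrt(2.0 * mu * r * sigma1 / n)
lam = 10.0

def completion_objective(U):
    # problem (8): (1/p) ||(UU^T - M*)_Omega||_F^2 + lambda * sum_i (||e_i^T U||_F - alpha)_+^4
    data_fit = np.linalg.norm((U @ U.T - M_star) * Omega, "fro") ** 2 / p
    row_norms = np.linalg.norm(U, axis=1)
    return data_fit + lam * np.sum(np.maximum(row_norms - alpha, 0.0) ** 4)

print("incoherence mu =", round(mu, 2),
      "objective at the ground truth =", round(completion_objective(U_star), 6))
```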
It is proved in [18, 17] that if the sampling rate satisfies

$$np \geq \Theta\left[\mu^4 r^6 (\kappa^*)^6 \log n\right], \qquad (9)$$

then problem (8) satisfies the strict saddle property with high probability, where $\kappa^*$ is the condition number of $M^*$. On the other hand, it is proved in [11] that the information-theoretic lower bound $np \geq \Theta(\mu r \log n)$ is necessary for exact completion with high probability.

Figure 1: Example of the problem of matrix sensing over graphs. The vertex set is $V = [6]$ and the edge set is $E = \{(1, 1), (1, 2), (1, 3), (1, 4), (2, 2), (3, 4), (3, 5), (4, 6), (5, 6), (6, 6)\}$.

1.3 Motivating example: Matrix sensing over graphs

Although the matrix sensing problem and the matrix completion problem are well studied in the literature, we show that there exist important applications of (2) that belong to a much broader class of problems than the special classes previously analyzed in the existing works. Hence, the current theoretical results cannot be directly applied to these applications and it remains unknown whether saddle-avoiding algorithms can find the ground truth matrix $M^*$ in polynomial time.

Consider an undirected graph $G = (V, E)$, where $V$ and $E$ are the set of vertices and the set of edges, respectively. For notational simplicity, we assume that $V = [n]$. Then, the edge set $E$ is a subset of $[n]\times[n]$. The goal of this problem is to recover the ground truth matrix $M^*$ via measurements of its entries $\{M^*_{ij} \mid (i, j) \in E\}$. Instead of directly observing each entry $M^*_{ij}$, for each node $i \in V$, we observe a mixture (i.e., the output of a function) of the entries in the set

$$E_i := \{(i, j) \mid j \text{ is adjacent to } i \text{ in } G\}.$$

Denote the loss function for node $i$ as $f_i\!\left(M_{E_i}; M^*_{E_i}\right)$. The total loss function is the sum of the loss functions over all nodes; namely, we consider the problem

$$\min_{U \in \mathbb{R}^{n\times r}} \; \sum_{i\in[n]} f_i\!\left((UU^\top)_{E_i}; M^*_{E_i}\right). \qquad (10)$$

Since $E = \cup_{i\in[n]} E_i$, the above problem is a special case of the general problem

$$\min_{U \in \mathbb{R}^{n\times r}} \; f\!\left((UU^\top)_E; M^*_E\right), \qquad (11)$$

where we define $f\!\left((UU^\top)_E; M^*_E\right) := \sum_{i\in[n]} f_i\!\left((UU^\top)_{E_i}; M^*_{E_i}\right)$. We name problem (11) the matrix sensing over graphs (MSoG) problem. The MSoG problem has a number of applications, including the state estimation problem in power systems [39, 24]. We note that in those applications, the loss function $f_i$ is a quadratic function of the entries of $M_{E_i}$ and $M^*_{E_i}$; see the example at the end of this section and the graph-structured quadratic sensing problem in [24].

To better understand (10) as a special type of MSoG, we consider a toy example where the undirected graph $G$ has $n = 6$ vertices and is plotted in Figure 1. In this example, the objective function of the MSoG problem is

$$f(M_E; M^*_E) = f_1(M_{11}, M_{12}, M_{13}; M^*_{11}, M^*_{12}, M^*_{13}) + f_2(M_{21}, M_{22}; M^*_{21}, M^*_{22}) + f_3(M_{31}, M_{34}, M_{35}; M^*_{31}, M^*_{34}, M^*_{35}) + f_4(M_{41}, M_{43}, M_{46}; M^*_{41}, M^*_{43}, M^*_{46}) + f_5(M_{53}, M_{56}; M^*_{53}, M^*_{56}) + f_6(M_{64}, M_{65}, M_{66}; M^*_{64}, M^*_{65}, M^*_{66}),$$

where we recall that $M$ is factorized into $UU^\top$ in the Burer-Monteiro factorization approach. Hence, only 13 entries of the matrix $M$, which has $n^2 = 36$ entries, appear in the measurements. For example, the entry $M^*_{16}$ does not appear in any measurement. If we directly observe these 13 entries, we can define the loss function $f_1$ as the $\ell_2$-loss function:

$$f_1(M_{11}, M_{12}, M_{13}; M^*_{11}, M^*_{12}, M^*_{13}) = (M_{11} - M^*_{11})^2 + (M_{12} - M^*_{12})^2 + (M_{13} - M^*_{13})^2.$$

The other loss functions $f_2, \ldots, f_6$ can be defined in a similar way.
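The toy example can be written down directly. The sketch below hard-codes the node-wise entry sets from $f_1, \ldots, f_6$ above and evaluates the MSoG loss (10) with $\ell_2$ node losses; the rank $r = 2$ and the random ground truth are illustrative choices not taken from the paper.

```python
import numpy as np

n, r = 6, 2
rng = np.random.default_rng(3)
U_star = rng.standard_normal((n, r))
M_star = U_star @ U_star.T

# entries measured at each node, copied from the toy example f_1, ..., f_6 (1-indexed)
node_entries = {
    1: [(1, 1), (1, 2), (1, 3)],
    2: [(2, 1), (2, 2)],
    3: [(3, 1), (3, 4), (3, 5)],
    4: [(4, 1), (4, 3), (4, 6)],
    5: [(5, 3), (5, 6)],
    6: [(6, 4), (6, 5), (6, 6)],
}

def f_node(M, i):
    # l2 loss on the entries attached to node i, mirroring f_1 in the text
    return sum((M[a - 1, b - 1] - M_star[a - 1, b - 1]) ** 2 for (a, b) in node_entries[i])

def msog_loss(U):
    # problem (10): sum_i f_i((UU^T)_{E_i}; M*_{E_i})
    M = U @ U.T
    return sum(f_node(M, i) for i in node_entries)

print("loss at the ground truth:", msog_loss(U_star))   # zero by construction
```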
Then, the MSoG problem reduces to the matrix completion problem with $\Omega = E$; namely, the objective function becomes $f(M_E; M^*_E) = \|M_E - M^*_E\|_F^2$. Therefore, the matrix completion problem (6) is a special case of the MSoG problem (11), and the existing results for the matrix completion problem cannot be directly applied to problem (11). Similarly, if $G$ is a complete graph, the matrix sensing problem (4) becomes a special case of the MSoG problem. Moreover, since some entries of $M^*$ do not appear in any of the measurements in general, it is easy to verify that the RIP condition (5) does not hold for problem (11). In summary, the existing results based on the RIP condition [38] and the incoherence condition [17, 13], as well as the results for other applications of the low-rank matrix optimization problem, cannot be directly applied to the MSoG problem.

Next, we show that the power system state estimation problem [39, 24] can be formulated as the MSoG problem (11). This is the most important data-analysis problem for power systems and is solved every 5-15 minutes in practice using heuristic methods with little theoretical guarantee. Several major blackouts have been attributed to the failure to solve this problem. An $N$-bus power system can be represented by an undirected graph $G = (V, E)$, where $V = [N]$ and $E \subseteq V\times V$ are the set of buses and the set of lines, respectively. The state of the power system is characterized by the voltage phasors $v \in \mathbb{C}^N$, where $\mathbb{C}^N$ is the $N$-dimensional complex space. Our goal is to recover the voltage phasors via some measurements; without loss of generality, we assume that we measure the voltage magnitude at each bus and the active power flows over each line. Suppose that $v = x + \mathrm{i}y$ for $x, y \in \mathbb{R}^N$, where $\mathrm{i}$ is the imaginary unit, and define $z := [x^\top\; y^\top]^\top \in \mathbb{R}^{2N}$. The following two types of measurements are available:

1. For each bus $k \in V$, the voltage magnitude is defined as $|v_k|^2$, where the magnitude of a complex number $v_k = x_k + \mathrm{i}y_k$ is given by $|v_k|^2 := x_k^2 + y_k^2$. The voltage magnitude can be written in matrix form as $|v_k|^2 = z^\top(e_k e_k^\top + e_{k+N} e_{k+N}^\top)z$, where $e_k$ is the $k$-th standard basis vector of $\mathbb{R}^{2N}$.

2. For each line $(k, \ell) \in E$, let $Y_{k\ell} \in \mathbb{C}$ be the admittance of the line. The active power flow from bus $k$ to bus $\ell$ is $p_{k\ell} = \mathrm{Re}\!\left(v_k(\bar{v}_k - \bar{v}_\ell)\bar{Y}_{k\ell}\right)$, where $\bar{z}$ denotes the conjugate of a complex number $z$. We observe the active power flows $p_{k\ell}$ and $p_{\ell k}$ for all lines $(k, \ell) \in E$. Suppose that $Y_{k\ell} = P_{k\ell} + \mathrm{i}Q_{k\ell}$ with $P_{k\ell}, Q_{k\ell} \in \mathbb{R}$ for all $(k, \ell) \in E$. Then, the active power flow can be written in matrix form as

$$p_{k\ell} = z^\top\left[P_{k\ell}\left(e_k e_k^\top - e_k e_\ell^\top + e_{N+k} e_{N+k}^\top - e_{N+k} e_{N+\ell}^\top\right) + Q_{k\ell}\left(e_k e_{N+\ell}^\top - e_\ell e_{N+k}^\top\right)\right]z.$$

Therefore, these observations can be written as quadratic forms $z^\top M_j z$ for some matrices $M_1, \ldots, M_m \in \mathbb{R}^{2N\times 2N}$, where $m = 2N + |E|$ is the number of observations. For each sensing matrix $M_j$, only the diagonal entries and the $(k, \ell)$, $(N+k, \ell)$, $(k, N+\ell)$, $(N+k, N+\ell)$ entries for $(k, \ell) \in E$ may be nonzero. In addition, we fix $y_1 = z_{N+1} = 0$ since a unitary transformation of $v$ does not change the loss function. As a result, the problem can be written as a $(2N-1)$-dimensional rank-1 MSoG problem:

$$\min_{w \in \mathbb{R}^{2N}} \; \sum_{j\in[m]} \left\langle M_j, ww^\top - zz^\top\right\rangle^2, \quad \text{s.t.} \quad w_{N+1} = 0.$$
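The sketch below assembles the sensing matrices $M_j$ of the power-system example for a hypothetical 4-bus system with made-up admittances, and checks the quadratic-form expression for one active power flow against the direct complex-arithmetic definition. The bus count, line list, and admittance values are all assumptions used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
N = 4
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]                  # hypothetical 4-bus system
Y = {e_: complex(1.5, -0.4) for e_ in edges}              # made-up line admittances

def outer(i, j, dim=2 * N):
    M = np.zeros((dim, dim)); M[i, j] = 1.0
    return M

sensing_matrices = []
# voltage magnitudes: |v_k|^2 = z^T (e_k e_k^T + e_{N+k} e_{N+k}^T) z
for k in range(N):
    sensing_matrices.append(outer(k, k) + outer(N + k, N + k))
# active power flows p_kl and p_lk for every line (k, l)
for (k, l) in edges:
    for (a, b) in [(k, l), (l, k)]:
        P, Q = Y[(k, l)].real, Y[(k, l)].imag
        M = (P * (outer(a, a) - outer(a, b) + outer(N + a, N + a) - outer(N + a, N + b))
             + Q * (outer(a, N + b) - outer(b, N + a)))
        sensing_matrices.append(0.5 * (M + M.T))          # symmetrize the quadratic form

# sanity check of the first flow measurement against the complex-arithmetic definition
v = rng.standard_normal(N) + 1j * rng.standard_normal(N)
z = np.concatenate([v.real, v.imag])
k, l = edges[0]
p_kl = (v[k] * np.conj(v[k] - v[l]) * np.conj(Y[(k, l)])).real
print(round(p_kl, 8), round(z @ sensing_matrices[N] @ z, 8))   # the two values agree
```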
1.4 Problem formulation and contributions

In this work, we propose the first sufficient condition that guarantees a benign landscape for the MSoG problem. More specifically, we consider the case when the graph $G$ is a random graph obeying the Erdős-Rényi model; namely, each pair $(i, j) \in V\times V$ belongs to the edge set independently with probability $p$. Under this random graph model and the assumption that the vertex set $V$ is $[n]$, each entry $M^*_{ij}$ is indirectly observed (i.e., is involved in some measurement) independently with probability $p$. For comparison with the existing results on the matrix completion problem (6), we denote the edge set, which coincides with the set of indices of observed entries, as $\Omega \subseteq [n]\times[n]$. Then, the set $\Omega$ follows the Bernoulli sampling model (Definition 3). Similar to the matrix completion problem, we aim at recovering the ground truth matrix $M^*$ by solving the following problem with an improved regularizer:

$$\min_{U \in \mathbb{R}^{n\times r}} \; \frac{1}{p} f\!\left[(UU^\top)_\Omega; M^*_\Omega\right] + \sum_{i\in[n]} r\!\left(\left\|e_i^\top U\right\|_F\right), \qquad (12)$$

where we define the regularizer

$$r(x) := \lambda \int_{-1}^{1} \left[(x + \alpha y - 10\alpha)_+ + 9\alpha\right]^4 (1 - |y|)\, dy, \quad \forall x \in \mathbb{R}.$$

Note that in problem (12), we use a novel incoherence regularizer that is different from those in the prior literature. The regularizer is the convolution of the function $\left[(x + \alpha y - 10\alpha)_+ + 9\alpha\right]^4$ with the probability density function $1 - |y|$ on $[-1, 1]$; since the points where the derivatives of the inner function are discontinuous form a set of measure zero, the convolution is twice continuously differentiable in $x$. Hence, the regularizer $r\!\left(\|e_i^\top U\|_F\right)$ is twice continuously differentiable in $U$. In addition, we can exchange the convolution and the differentiation with respect to $x$. Since the integrand is a quartic polynomial, the objective value and the derivatives of the regularizer can be evaluated exactly by numerical integration schemes with $O(1)$ computations.

Let $\mathcal{R} \subseteq \mathbb{R}^{n\times r}$ be the set of matrices $U$ such that $\|e_i^\top U\|_F = O(\alpha)$ for all $i \in [n]$. The intuition behind the design of the incoherence regularizer is to ensure that every approximate first-order critical point of problem (11) lies in the set $\mathcal{R}$. More specifically, the regularizer ensures that (i) outside $\mathcal{R}$, the gradient of the regularizer is large enough that any approximate first-order critical point must lie in $\mathcal{R}$, and (ii) inside $\mathcal{R}$, the regularizer and its Hessian matrix make a sufficiently small contribution to the objective function.

Besides the randomness model of $\Omega$, we make the following three assumptions about problem (12).

Assumption 1. The loss function $f$ is twice continuously differentiable.

Assumption 2. The ground truth matrix $M^*$ is PSD and has rank $r$.

Assumption 3. The ground truth matrix $M^*$ is a global minimizer of $f[(\cdot)_\Omega; M^*_\Omega]$.

We note that these assumptions are standard in the low-rank optimization literature. The first assumption is a mild regularity assumption on the loss function and is satisfied by a wide range of loss functions, including the $\ell_2$-loss and the negative log-likelihood functions of various probability distributions. The second assumption reflects prior knowledge about the specific application. The third assumption is necessary for the exact recovery of the ground truth $M^*$. We give an informal statement of the main result in the following theorem. Here, we consider the $(\delta, \Omega)$-RIP$_{2r,2r}$ condition, which is an extension of the classic $\delta$-RIP$_{2r,2r}$ condition. Intuitively, the constant $\delta \in [0, 1)$ measures the similarity between $f$ and the $\ell_2$-loss function. The rigorous definition is provided in Definition 4.

Theorem 1 (Informal). Suppose that the loss function $f$ satisfies the $(1/16, \Omega)$-RIP$_{2r,2r}$ condition with high probability over $\Omega$ and the sampling rate $p$ satisfies

$$np \geq \Theta\left[\mu^2 r^3 (\kappa^*)^2 \log n\right],$$

where $\kappa^*$ is the condition number of $M^*$. Then, with a suitable choice of the parameters $\alpha$ and $\lambda$, problem (12) satisfies the strict saddle property with high probability. Furthermore, there exist algorithms that can find a solution $U_0$ such that $\|U_0U_0^\top - M^*\|_F \leq \epsilon$ in polynomial time with the same probability.
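For completeness, the following sketch evaluates a regularizer of the convolution type used in problem (12) by one-dimensional quadrature. The exact inner function (the $10\alpha$ shift, the $9\alpha$ offset, and the fourth power) follows the reconstruction given above and should be read as an assumption about the precise constants, not as the authors' verbatim choice.

```python
import numpy as np

alpha, lam = 1.0, 10.0

def _trapezoid(values, y):
    w = np.full(y.size, y[1] - y[0]); w[0] *= 0.5; w[-1] *= 0.5
    return np.sum(values * w)

def smoothed_reg(x, num_nodes=201):
    # r(x) = lam * int_{-1}^{1} [ (x + alpha*y - 10*alpha)_+ + 9*alpha ]^4 (1 - |y|) dy,
    # with the constants taken from the reconstruction in the text; trapezoidal
    # quadrature is accurate because the integrand is piecewise polynomial in y.
    y = np.linspace(-1.0, 1.0, num_nodes)
    integrand = (np.maximum(x + alpha * y - 10.0 * alpha, 0.0) + 9.0 * alpha) ** 4 * (1.0 - np.abs(y))
    return lam * _trapezoid(integrand, y)

def smoothed_reg_grad(x, num_nodes=201):
    # differentiation under the integral sign, as noted in the text
    y = np.linspace(-1.0, 1.0, num_nodes)
    active = (x + alpha * y - 10.0 * alpha) > 0
    inner = np.maximum(x + alpha * y - 10.0 * alpha, 0.0) + 9.0 * alpha
    return lam * _trapezoid(4.0 * inner ** 3 * active * (1.0 - np.abs(y)), y)

# flat (zero gradient) well inside the incoherence region, steep past the ~10*alpha threshold
print(smoothed_reg_grad(alpha), smoothed_reg_grad(12.0 * alpha))
```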
The above theorem provides the first theoretical result for the MSoG problem. Basically, Theorem 1 says that a combination of the incoherence condition and the Ω-RIP condition is sufficient to guarantee that problem (12) can be solved in polynomial time with high probability. In addition, the lower bound on the sampling rate is better than the bound in [17] (i.e., the bound in (9)). This is a consequence of our improved regularizer. Finally, the upper bound on the Ω-RIP constant is an absolute constant (namely, 1/16).

Remark 1. We note that since the result of Theorem 1 only holds with high probability, it is not necessary for the Ω-RIP condition to hold for all subsets $\Omega$. Instead, we only require the Ω-RIP condition to hold for a collection of sets $\Omega$ that carries high probability under the distribution of $\Omega$. In the special case when $\Omega$ is a deterministic subset that satisfies benign properties similar to those in the random graph case, we only need the Ω-RIP condition for this deterministic subset $\Omega$.

The next informal theorem shows that our upper bound is optimal up to a constant.

Theorem 2 (Informal). There exists a loss function $f$ such that: (i) $f$ satisfies the $(1/2, \Omega)$-RIP$_{2r,2r}$ condition for all $\Omega \subseteq [n]\times[n]$; (ii) for all $p \in [0, 1]$, problem (12) has a spurious second-order critical point² with probability at least $(3-\sqrt{5})/2$.

Intuitively, Theorem 2 provides a negative result saying that the tightest upper bound on the RIP constant cannot be better than 1/2 if problem (12) is to have a benign landscape with high probability. This is also the first negative result on problem (12).

1.5 Notation

For every natural number $n$, we denote $[n] := \{1, \ldots, n\}$. The operator 2-norm and the Frobenius norm of a matrix $M$ are denoted as $\|M\|_2$ and $\|M\|_F$, respectively. The trace of a matrix $M$ is denoted as $\mathrm{tr}(M)$. The inner product between two matrices is defined as $\langle M, N\rangle := \mathrm{tr}(M^\top N)$. For any matrix $M \in \mathbb{R}^{n\times n}$, we denote its singular values by $\sigma_1(M) \geq \cdots \geq \sigma_n(M)$. Let $\sigma^*_i$ be the singular value $\sigma_i(M^*)$. The condition number of $M^*$ is $\kappa^* = \sigma^*_1/\sigma^*_r$. The minimum eigenvalue of a matrix $M$ is denoted as $\lambda_{\min}(M)$. For any two matrices $A, B \in \mathbb{R}^{n\times m}$, we use $A \otimes B$ to denote the fourth-order tensor whose $(i, j, k, \ell)$ element is $A_{i,j}B_{k,\ell}$. The identity tensor is denoted as $\mathcal{I}$. The notation $M \succeq 0$ means that the matrix $M$ is PSD. The sub-matrix $R_{i:j,k:\ell}$ consists of the $i$-th to the $j$-th rows and the $k$-th to the $\ell$-th columns of matrix $R$. The action of the Hessian $\nabla^2 f(M)$ on any two matrices $K$ and $L$ is given by $[\nabla^2 f(M)](K, L) := \sum_{i,j,k,\ell}[\nabla^2 f(M)]_{i,j,k,\ell}K_{ij}L_{k\ell}$. The notation $f = O(g)$ means that there exists an absolute constant $C > 0$ such that $f \leq Cg$. The notation $f = \Theta(g)$ means that there exist absolute constants $C_1, C_2 > 0$ such that $C_1 g \leq f \leq C_2 g$.

2 Ω-RIP Condition

Since the objective function $f(M_\Omega; M^*_\Omega)$ does not satisfy the classic RIP condition unless all entries of $M$ are observed (i.e., unless $\Omega = [n]\times[n]$), it is necessary to consider a generalization of the RIP condition to the partial observation case.

Definition 4 (Ω-RIP Condition). Given a subset $\Omega \subseteq [n]\times[n]$ and natural numbers $r, s$, the function $f(\cdot\,; M^*)$ is said to satisfy the Ω-Restricted Isometry Property (Ω-RIP) of rank $(2r, 2s)$ for a constant $\delta \in [0, 1)$, denoted as $(\delta, \Omega)$-RIP$_{2r,2s}$, if

$$(1-\delta)\|K_\Omega\|_F^2 \leq \left[\nabla^2 f\!\left(M_\Omega; M^*_\Omega\right)\right](K_\Omega, K_\Omega) \leq (1+\delta)\|K_\Omega\|_F^2 \qquad (13)$$

holds for all matrices $M, K \in \mathbb{R}^{n\times n}$ such that $\mathrm{rank}(M) \leq 2r$ and $\mathrm{rank}(K) \leq 2s$.

Note that the matrix completion problem satisfies the $(0, \Omega)$-RIP$_{2r,2s}$ condition.
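The Ω-RIP constant of a given loss can be probed empirically by evaluating the Hessian quadratic form along random directions of rank at most $2r$ restricted to $\Omega$. The sketch below does this for the plain completion loss, for which the estimate is $0$, matching the $(0, \Omega)$-RIP remark above; note that random probing only yields a lower bound on the true constant, and the factor-of-two normalization of the quadratic loss is a choice made here.

```python
import numpy as np

rng = np.random.default_rng(4)
n, r, p = 12, 2, 0.6
Omega = rng.random((n, n)) < p

def completion_curvature(K_omega):
    # For the l2 completion loss, the Hessian curvature along K_Omega is proportional to
    # ||K_Omega||_F^2; the constant factor is normalized away so that delta = 0 here.
    return np.linalg.norm(K_omega, "fro") ** 2

def estimate_omega_rip(curvature, trials=10000):
    # empirical delta with (1-delta)||K_Omega||^2 <= curvature <= (1+delta)||K_Omega||^2
    lo, hi = np.inf, -np.inf
    for _ in range(trials):
        V = rng.standard_normal((n, 2 * r))
        K_omega = (V @ V.T) * Omega               # random rank-<=2r direction, restricted to Omega
        ratio = curvature(K_omega) / max(np.linalg.norm(K_omega, "fro") ** 2, 1e-12)
        lo, hi = min(lo, ratio), max(hi, ratio)
    return max(1.0 - lo, hi - 1.0)

print("estimated Omega-RIP constant:", estimate_omega_rip(completion_curvature))
```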
In the following, we demonstrate another example for which the Ω-RIP condition holds with a non-zero constant $\delta$.

Example 1 (Linear matrix sensing over graphs). In the linear matrix sensing over graphs problem, the loss function is defined as $f(M_\Omega; M^*_\Omega) := \|\mathcal{A}(M_\Omega) - \mathcal{A}(M^*_\Omega)\|_F^2$, where the linear operator $\mathcal{A}: \mathbb{R}^{n\times n} \to \mathbb{R}^m$ (defined in (3)) is generated by Gaussian measurements and $m$ is the number of measurements. For all subsets $\Omega \subseteq [n]\times[n]$, a proof similar to that of Theorem 2.3 of [9] implies that the $(\delta, \Omega)$-RIP$_{2r,2r}$ condition holds with high probability when $m \geq cnr/\delta^2$ for some constant $c > 0$. The intuition behind the proof is that the $\epsilon$-net constructed for the linear matrix sensing problem is also an $\epsilon$-net for the linear MSoG problem, since $\|M_\Omega - M'_\Omega\|_F \leq \|M - M'\|_F$ for all $M, M' \in \mathbb{R}^{n\times n}$.

Remark 2. In the above example and other examples in practice, the Ω-RIP condition holds with high probability for a fixed subset $\Omega$. Therefore, we can focus on the event that the Ω-RIP condition holds, since otherwise the results only differ by a sufficiently small probability. In Section 4 and Appendix C, we numerically show that the optimization complexity of problem (11) is closely related to the Ω-RIP constant $\delta$ and the sampling rate $p$.

²A point $U_0$ is called a spurious second-order critical point of problem (12) if $U_0$ satisfies the first-order and second-order necessary optimality conditions but $U_0U_0^\top \neq M^*$.

3 Theoretical Results

In this section, we provide strong theoretical results on the MSoG problem (12). We first develop a sufficient condition for the benign landscape of problem (12) and then study the tightness of the developed condition.

3.1 Global landscape: Strict saddle property

First, we develop conditions under which problem (12) does not have any spurious second-order critical points, so that saddle-escaping methods (e.g., [22, 23]) can find an approximate global minimum in polynomial time. To guarantee global convergence, the strict saddle property is commonly considered in the literature:

Definition 5 (Strict Saddle Property [33]). Consider an optimization problem $\min_{x\in\mathcal{X}\subseteq\mathbb{R}^d} F(x)$ and let $\mathcal{X}^*$ denote the set of its global minima. We say that the problem satisfies the $(\theta, \beta, \gamma)$-strict saddle property for $\theta, \beta, \gamma > 0$ if at least one of the following conditions is satisfied for every $x \in \mathcal{X}$: $\mathrm{dist}(x, \mathcal{X}^*) \leq \theta$; $\|\nabla F(x)\|_F \geq \beta$; $\lambda_{\min}[\nabla^2 F(x)] \leq -\gamma$, where $\mathrm{dist}(x, \mathcal{X}^*) := \inf_{x^*\in\mathcal{X}^*}\|x - x^*\|_F$ is the distance between $x$ and $\mathcal{X}^*$.

For the low-rank optimization problem, the distance in the factorized space is equivalent to the distance in the matrix space in the sense that there exist constants $c_1(\mathcal{X}^*), c_2(\mathcal{X}^*) > 0$ such that

$$c_1(\mathcal{X}^*)\,\|U - U^*\|_F \;\leq\; \left\|UU^\top - U^*(U^*)^\top\right\|_F \;\leq\; c_2(\mathcal{X}^*)\,\|U - U^*\|_F$$

holds for all $U^* \in \mathcal{X}^*$ as long as $\|U - U^*\|_F$ is small and $\mathcal{X}^*$ is bounded [34]. Denote the objective function of problem (12) as $\ell(U)$. As an example of saddle-avoiding methods, the accelerated perturbed gradient descent algorithm [23] can find a point $U_0$ such that $\|\nabla\ell(U_0)\|_F = O(\epsilon)$ and $\lambda_{\min}[\nabla^2\ell(U_0)] \geq -O(\sqrt{\epsilon})$ in $O(\epsilon^{-1.75})$ iterations with high probability. If we choose $\epsilon > 0$ small enough, the strict saddle property ensures that the point $U_0$ satisfies $\|U_0U_0^\top - M^*\|_F = O(\theta)$. We note that the Lipschitz continuity of the Hessian of $\ell$ is guaranteed by the regularity Assumption 1 and the boundedness of the trajectories of the algorithm, which can be proved in a similar way to Theorem 8 in [22]. In summary, if the strict saddle property holds, we can apply saddle-avoiding methods to achieve polynomial-time global convergence for problem (12).
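As an illustration of saddle-avoiding methods, here is a minimal perturbed gradient descent sketch in the spirit of [22, 23] (not the accelerated variant used in the paper, and with illustrative step sizes and tolerances). It perturbs the iterate whenever the gradient is small, which is exactly the regime covered by the second and third cases of the strict saddle property.

```python
import numpy as np

def perturbed_gradient_descent(grad, x0, eta=1e-3, steps=5000,
                               grad_tol=1e-4, perturb_radius=1e-3, rng=None):
    # Minimal saddle-escaping scheme: take gradient steps, and add a small random
    # perturbation whenever the gradient is nearly zero (where a strict saddle
    # could otherwise trap the iterates).
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    for _ in range(steps):
        g = grad(x)
        if np.linalg.norm(g) < grad_tol:
            x = x + perturb_radius * rng.standard_normal(x.shape)
        else:
            x = x - eta * g
    return x

# usage on the factorized l2 loss ||UU^T - M*||_F^2, starting from the saddle U = 0
rng = np.random.default_rng(5)
n, r = 20, 2
A = rng.standard_normal((n, r))
M_star = A @ A.T
grad = lambda U: 4.0 * (U @ U.T - M_star) @ U
U_hat = perturbed_gradient_descent(grad, np.zeros((n, r)), rng=rng)
print("final error:", np.linalg.norm(U_hat @ U_hat.T - M_star, "fro"))
```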
Now, we prove that problem (12) satisfies the strict saddle property with high probability under the Ω-RIP$_{2r,2r}$ condition and the incoherence condition. Compared with Theorem 1, we assume, for simplicity of the statement, that the Ω-RIP condition holds for all subsets $\Omega$, since otherwise the result only differs by a small probability.

Theorem 3. Suppose that the loss function $f$ satisfies the $(\delta, \Omega)$-RIP$_{2r,2r}$ condition for all $\Omega \subseteq [n]\times[n]$ and that

$$\alpha^2 = \Theta\!\left(\frac{\mu r\sigma^*_1}{n}\right), \quad \lambda = \Theta\!\left(\frac{(\mu + n\delta)\,n}{\mu r}\right), \quad np \geq \Theta\left[\mu^2 r^3(\kappa^*)^2\log n\right], \quad \delta < \frac{1}{16}.$$

Then, there exists a small constant $\epsilon > 0$ such that, with probability at least $1 - 1/\mathrm{poly}(n)$, the MSoG problem (12) satisfies the $(\theta, \beta, \gamma)$-strict saddle property with

$$\theta = \Theta(\epsilon/\sigma^*_r), \quad \beta = \epsilon, \quad \gamma = \Theta(\epsilon^2/\sigma^*_r).$$

In Theorem 3, we provide the first theoretical result on the MSoG problem (12). Basically, the theorem shows that if the Ω-RIP$_{2r,2r}$ constant is smaller than an absolute constant, then the same sampling rate as for the matrix completion problem is sufficient to guarantee the benign landscape of problem (12). Therefore, the result is a generalization of the results in [17], which considered the case when the Ω-RIP constant $\delta$ is zero and the sampling rate $p$ satisfies the stricter condition (9). On the other hand, if the sampling rate $p$ is equal to 1, Theorem 3 guarantees that the strict saddle property holds when the (regular) RIP condition holds with $\delta < 1/16$. Compared with the state-of-the-art results in [38, 5], the upper bound on $\delta$ in our work may not be optimal but is only worse by an absolute constant. We leave the improvement of the upper bound as future work.

3.2 Tightness of the Ω-RIP condition

To study the tightness of our Ω-RIP condition for problem (12), we construct an instance of problem (12) that satisfies the $(1/2, \Omega)$-RIP condition but has spurious second-order critical points. Note that the existence of spurious second-order critical points negates the strict saddle property. The counterexample is based on a similar idea to those in [38]. More specifically, we assume that $n \geq 2r$ and consider the tensor

$$\mathcal{H} := \sum_{i\in[r]}\Big\{-\tfrac{1}{2}\big[(e_{2i-1}e_{2i-1}^\top)\otimes(e_{2i-1}e_{2i-1}^\top) + (e_{2i}e_{2i}^\top)\otimes(e_{2i}e_{2i}^\top)\big] + \tfrac{1}{2}\big[(e_{2i-1}e_{2i-1}^\top)\otimes(e_{2i}e_{2i}^\top) + (e_{2i}e_{2i}^\top)\otimes(e_{2i-1}e_{2i-1}^\top)\big] - \tfrac{1}{4}\big[(e_{2i-1}e_{2i}^\top)\otimes(e_{2i-1}e_{2i}^\top) + (e_{2i}e_{2i-1}^\top)\otimes(e_{2i}e_{2i-1}^\top)\big] + \tfrac{1}{4}\big[(e_{2i-1}e_{2i}^\top)\otimes(e_{2i}e_{2i-1}^\top) + (e_{2i}e_{2i-1}^\top)\otimes(e_{2i-1}e_{2i}^\top)\big]\Big\},$$

where $e_i \in \mathbb{R}^n$ is the $i$-th standard basis vector of $\mathbb{R}^n$. The rank-$r$ ground truth matrix $M^*$ is constructed as

$$U^* := [e_1\; e_3\; \cdots\; e_{2r-1}], \qquad M^* := U^*(U^*)^\top = \sum_{i\in[r]} e_{2i-1}e_{2i-1}^\top.$$

Then, the loss function is given by

$$f_{3/2}(M_\Omega; M^*_\Omega) := \frac{1}{2}(M_\Omega - M^*_\Omega) : \left(\frac{3}{2}\mathcal{I} + \mathcal{H}\right) : (M_\Omega - M^*_\Omega), \quad \forall M \in \mathbb{R}^{n\times n}.$$

It is proved in [38] that if $\Omega = [n]\times[n]$, the function $f_{3/2}$ satisfies the $1/2$-RIP$_{2r,2r}$ condition and has a spurious second-order critical point. In this work, we generalize these results to the case when the set $\Omega$ is random and the objective function contains a regularizer.

Theorem 4. Suppose that the loss function in problem (12) is chosen to be $f_{3/2}$. Then:
1. The loss function $f_{3/2}$ satisfies the $(1/2, \Omega)$-RIP$_{2r,2r}$ condition for all $\Omega \subseteq [n]\times[n]$;
2. For all $p \in [0, 1]$, problem (12) has a spurious second-order critical point with probability at least $\max\{p^{r(r+1)}, 1 - p^{r(r+1)/2}\} \geq (3-\sqrt{5})/2$.

We note that the results in Theorem 4 hold for all $p \in [0, 1]$. Theorem 4 implies that the upper bound on $\delta$ in Theorem 3 cannot be improved beyond 1/2. In addition, this result shows that our bound in Theorem 3 is optimal up to a constant. This tightness result is consistent with that of the matrix sensing problem in [40, 38].
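The $(1/2, \Omega)$-RIP claim can be checked numerically. The sketch below builds the tensor $\mathcal{H}$ with the coefficients as reconstructed above (which should be treated as an assumption about the exact signs) and verifies that the curvature of $\tfrac{3}{2}\mathcal{I} + \mathcal{H}$ along random symmetric rank-$2r$ directions stays within $[1/2, 3/2]$ times the squared Frobenius norm.

```python
import numpy as np

n, r = 6, 2                                   # small instance with n >= 2r
e = np.eye(n)

def T(A, B):
    # (A (x) B)_{ijkl} = A_{ij} B_{kl}
    return np.einsum("ij,kl->ijkl", A, B)

H = np.zeros((n, n, n, n))
for i in range(r):
    a, b = 2 * i, 2 * i + 1                   # 0-based versions of the indices 2i-1, 2i
    Ea, Eb = np.outer(e[a], e[a]), np.outer(e[b], e[b])
    Eab, Eba = np.outer(e[a], e[b]), np.outer(e[b], e[a])
    H += (-0.5 * (T(Ea, Ea) + T(Eb, Eb))
          + 0.5 * (T(Ea, Eb) + T(Eb, Ea))
          - 0.25 * (T(Eab, Eab) + T(Eba, Eba))
          + 0.25 * (T(Eab, Eba) + T(Eba, Eab)))

def curvature(K):
    # K : (3/2 I + H) : K
    return 1.5 * np.sum(K * K) + np.einsum("ijkl,ij,kl->", H, K, K)

rng = np.random.default_rng(6)
ratios = []
for _ in range(10000):
    V = rng.standard_normal((n, 2 * r))
    K = V @ V.T                               # random symmetric direction of rank <= 2r
    ratios.append(curvature(K) / np.sum(K * K))
print("min / max curvature ratio:", min(ratios), max(ratios))   # expected to lie in [0.5, 1.5]
```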
4 Numerical Illustrations

In this section, we show how the optimization complexity is related to the Ω-RIP constant $\delta$ and the sampling rate $p$ via a numerical example. Here, the optimization complexity refers to the probability that the randomly initialized gradient descent algorithm finds the ground truth matrix. In this example, we choose a random orthogonal matrix $V \in \mathbb{R}^{n\times n}$ and define the loss function to be

$$f_c\!\left[M_\Omega; (VM^*V^\top)_\Omega\right] := \frac{1}{2}\left[M - VM^*V^\top\right]_\Omega : \left(c\,\mathcal{I} + \mathcal{H}\right) : \left[M - VM^*V^\top\right]_\Omega, \quad \forall M \in \mathbb{R}^{n\times n},$$

where $c \in \mathbb{R}$ is a hyper-parameter and the tensor $\mathcal{H}$ and the ground truth $M^*$ are defined in Section 3.2. In addition, by an analysis similar to that in Section 3.2, we can prove that the function $f_c$ satisfies the $(1/2, \Omega)$-RIP$_{2r,2r}$ condition if we choose $c = 3/2$. We also numerically verify this conclusion by checking the curvature $[\nabla^2 f_{3/2}(M_\Omega; M^*_\Omega)](K, K)$ along $10^4$ random directions $K \in \mathbb{R}^{n\times n}$. Since $f_{3/2}$ is a quadratic function, the $(1/2, \Omega)$-RIP$_{2r,2r}$ condition reads

$$\frac{1}{2}\|K_\Omega\|_F^2 \;\leq\; K_\Omega : \left(\frac{3}{2}\mathcal{I} + \mathcal{H}\right) : K_\Omega \;\leq\; \frac{3}{2}\|K_\Omega\|_F^2, \quad \forall K \in \mathbb{R}^{n\times n},$$

where we define the tensor multiplication $K : \mathcal{H} : K := \sum_{i,j,k,\ell\in[n]}\mathcal{H}_{ijk\ell}K_{ij}K_{k\ell}$ for all $K \in \mathbb{R}^{n\times n}$ and fourth-order tensors $\mathcal{H} \in \mathbb{R}^{n\times n\times n\times n}$. In addition, since the identity tensor $\mathcal{I}$ satisfies the $(0, \Omega)$-RIP$_{2r,2r}$ condition, it is straightforward to prove that the function $f_c$ satisfies the $(\delta, \Omega)$-RIP$_{2r,2r}$ condition with $\delta = 1/(2c-1)$ for all $c \geq 1$.

We choose the problem size to be $n = 10$ and $r = 5$. The regularization parameters are $\alpha = 10$ and $\lambda = 100$. The sets of sampling rates and Ω-RIP$_{2r,2r}$ constants are $p \in \{0.7, 0.75, \ldots, 0.95, 1.0\}$ and $\delta \in \{0.2, 0.25, \ldots, 0.75, 0.8\}$. We solve each problem instance by the Burer-Monteiro factorization and the perturbed accelerated gradient descent algorithm [23], where the constant step size is $0.007/c$. We generate 100 independent problem instances and compute the success rate of the gradient descent algorithm with random initialization. We say that the algorithm successfully solves an instance if the recovery error $\|UU^\top - VM^*V^\top\|_F$ is less than $10^{-3}$; if this condition fails, the algorithm is stuck at a spurious local solution. The results are plotted in Figure 2.

Figure 2: Success rate of the algorithm with different sampling rates $p$ and RIP constants $\delta$.

We can see from the figure that the success rate grows when $\delta$ becomes smaller and when $p$ becomes larger. This result shows that the Ω-RIP condition plays an important role in characterizing the optimization complexity of problem (12). More concretely, we expect that an upper bound on the Ω-RIP constant is able to guarantee a benign optimization landscape for problem (12). Moreover, the result is consistent with the results for the matrix sensing problem and the matrix completion problem. Furthermore, we can see that when the Ω-RIP constant is smaller than 1/2, the success rate has a stronger correlation with $p$ than with $\delta$. This observation is also reflected in Theorem 1, where the lower bound on the sampling rate is of the same order as those in the existing works [18, 17] if $\delta$ is upper bounded. For the cases when $p$ is close to 1 and $\delta$ is larger than 0.5, the success rate is better than in the case $p = 1$. One possible explanation for this phenomenon is that the loss function $f_c$ has multiple global minima when the sampling rate is smaller than 1. Instead of converging to a spurious local solution, the implicit regularization [26] makes the perturbed gradient descent algorithm more likely to converge to the global solution with the minimal "complexity", which is likely the ground truth $VM^*V^\top$. If the sampling rate is much smaller than 1, the number of spurious local minima increases and the perturbed gradient descent algorithm may get stuck at a spurious local minimum and fail to converge to the ground truth.
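A compact sketch of the experiment in this section follows. It simplifies the setup in several ways that are assumptions of this illustration, not the paper's protocol: plain gradient descent instead of the perturbed accelerated method of [23], no incoherence regularizer, no random rotation $V$, a smaller problem size, and illustrative step sizes. It therefore reproduces only the qualitative trend of Figure 2.

```python
import numpy as np

rng = np.random.default_rng(7)
n, r = 6, 2

def build_H(n, r):
    # tensor H as reconstructed in Section 3.2 (treated as an assumption here)
    e = np.eye(n)
    T = lambda A, B: np.einsum("ij,kl->ijkl", A, B)
    H = np.zeros((n, n, n, n))
    for i in range(r):
        a, b = 2 * i, 2 * i + 1
        Ea, Eb = np.outer(e[a], e[a]), np.outer(e[b], e[b])
        Eab, Eba = np.outer(e[a], e[b]), np.outer(e[b], e[a])
        H += (-0.5 * (T(Ea, Ea) + T(Eb, Eb)) + 0.5 * (T(Ea, Eb) + T(Eb, Ea))
              - 0.25 * (T(Eab, Eab) + T(Eba, Eba)) + 0.25 * (T(Eab, Eba) + T(Eba, Eab)))
    return H

H = build_H(n, r)
M_star = sum(np.outer(np.eye(n)[2 * i], np.eye(n)[2 * i]) for i in range(r))

def success_rate(c, p, trials=20, steps=3000, eta=0.02):
    wins = 0
    for _ in range(trials):
        Omega = (rng.random((n, n)) < p).astype(float)
        def grad(U):
            D = (U @ U.T - M_star) * Omega                    # residual on observed entries
            G = Omega * (c * D + np.einsum("ijkl,kl->ij", H, D))
            return (G + G.T) @ U                              # chain rule through M = U U^T
        U = 0.3 * rng.standard_normal((n, r))
        for _ in range(steps):
            U -= eta * grad(U)
        wins += np.linalg.norm(U @ U.T - M_star, "fro") < 1e-3
    return wins / trials

for delta in (0.2, 0.5, 0.8):
    c = 0.5 * (1.0 / delta + 1.0)             # so that delta = 1 / (2c - 1)
    print("delta =", delta,
          [round(success_rate(c, p), 2) for p in (0.7, 0.85, 1.0)])
```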
5 Conclusion

In this work, we provide the first theoretical analysis of the MSoG problem. A new notion, dubbed the Ω-RIP condition, is proposed and shown to be useful in characterizing the optimization complexity of the MSoG problem. Using an improved incoherence regularizer, we prove the polynomial-time global convergence of saddle-avoiding methods under the Ω-RIP condition and the incoherence condition. The bounds on the sampling rate and the Ω-RIP constant are state-of-the-art up to a constant. Moreover, we show that our bound on the Ω-RIP constant is tight (up to a constant). Future work includes improving the upper bound on the Ω-RIP constant and the lower bound on the sampling rate.

References

[1] Ahn, K., Suarez, F.: Riemannian perspective on matrix factorization. arXiv preprint arXiv:2102.00937 (2021)
[2] Allen-Zhu, Z., Li, Y.: Neon2: Finding local minima via first-order oracles. Advances in Neural Information Processing Systems 31 (2018)
[3] Bhojanapalli, S., Jain, P.: Universal matrix completion. In: International Conference on Machine Learning, pp. 1881-1889. PMLR (2014)
[4] Bi, Y., Lavaei, J.: On the absence of spurious local minima in nonlinear low-rank matrix recovery problems. In: International Conference on Artificial Intelligence and Statistics, pp. 379-387. PMLR (2021)
[5] Bi, Y., Zhang, H., Lavaei, J.: Local and global linear convergence of general low-rank matrix recovery problems. In: Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI), Vancouver, Canada, pp. 1-9 (2022)
[6] Burer, S., Monteiro, R.D.: A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming 95(2), 329-357 (2003)
[7] Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? Journal of the ACM (JACM) 58(3), 1-37 (2011)
[8] Candès, E.J., Li, X., Soltanolkotabi, M.: Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory 61(4), 1985-2007 (2015)
[9] Candès, E.J., Plan, Y.: Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory 57(4), 2342-2359 (2011)
[10] Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9(6), 717-772 (2009)
[11] Candès, E.J., Tao, T.: The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory 56(5), 2053-2080 (2010)
[12] Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part I: Motivation, convergence and numerical results. Mathematical Programming 127(2), 245-295 (2011)
[13] Chen, J., Liu, D., Li, X.: Nonconvex rectangular matrix completion via gradient descent without ℓ2,∞ regularization. IEEE Transactions on Information Theory 66(9), 5806-5841 (2020)
[14] Chen, Y., Chi, Y., Fan, J., Ma, C.: Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval. Mathematical Programming 176(1), 5-37 (2019)
[15] Chi, Y., Lu, Y.M., Chen, Y.: Nonconvex optimization meets low-rank matrix factorization: An overview.
IEEE Transactions on Signal Processing 67(20), 5239-5269 (2019)
[16] Chou, H.H., Gieshoff, C., Maly, J., Rauhut, H.: Gradient descent for deep matrix factorization: Dynamics and implicit bias towards low rank. arXiv preprint arXiv:2011.13772 (2020)
[17] Ge, R., Jin, C., Zheng, Y.: No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In: International Conference on Machine Learning, pp. 1233-1242. PMLR (2017)
[18] Ge, R., Lee, J.D., Ma, T.: Matrix completion has no spurious local minimum. Advances in Neural Information Processing Systems, pp. 2981-2989 (2016)
[19] Hou, T.Y., Li, Z., Zhang, Z.: Fast global convergence for low-rank matrix recovery via Riemannian gradient descent with random initialization. arXiv preprint arXiv:2012.15467 (2020)
[20] Jain, P., Meka, R., Dhillon, I.: Guaranteed rank minimization via singular value projection. Advances in Neural Information Processing Systems 23 (2010)
[21] Jain, P., Netrapalli, P., Sanghavi, S.: Low-rank matrix completion using alternating minimization. In: Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pp. 665-674 (2013)
[22] Jin, C., Ge, R., Netrapalli, P., Kakade, S.M., Jordan, M.I.: How to escape saddle points efficiently. In: International Conference on Machine Learning, pp. 1724-1732. PMLR (2017)
[23] Jin, C., Netrapalli, P., Jordan, M.I.: Accelerated gradient descent escapes saddle points faster than gradient descent. In: Conference On Learning Theory, pp. 1042-1085. PMLR (2018)
[24] Jin, M., Lavaei, J., Sojoudi, S., Baldick, R.: Boundary defense against cyber threat for power system state estimation. IEEE Transactions on Information Forensics and Security 16, 1752-1767 (2020)
[25] Király, F.J., Theran, L., Tomioka, R.: The algebraic combinatorial approach for low-rank matrix completion. J. Mach. Learn. Res. 16(1), 1391-1436 (2015)
[26] Li, Y., Ma, T., Zhang, H.: Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In: Conference On Learning Theory, pp. 2-47. PMLR (2018)
[27] Luo, Y., Li, X., Zhang, A.R.: Nonconvex factorization and manifold formulations are almost equivalent in low-rank matrix optimization. arXiv preprint arXiv:2108.01772 (2021)
[28] Netrapalli, P., Jain, P., Sanghavi, S.: Phase retrieval using alternating minimization. Advances in Neural Information Processing Systems 26 (2013)
[29] Netrapalli, P., Niranjan, U.N., Sanghavi, S., Anandkumar, A., Jain, P.: Non-convex robust PCA. Advances in Neural Information Processing Systems 27 (2014)
[30] Pimentel-Alarcón, D.L., Boston, N., Nowak, R.D.: A characterization of deterministic sampling patterns for low-rank matrix completion. IEEE Journal of Selected Topics in Signal Processing 10(4), 623-636 (2016)
[31] Recht, B., Fazel, M., Parrilo, P.A.: Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review 52(3), 471-501 (2010)
[32] Stöger, D., Soltanolkotabi, M.: Small random initialization is akin to spectral learning: Optimization and generalization guarantees for overparameterized low-rank matrix reconstruction. Advances in Neural Information Processing Systems 34 (2021)
[33] Sun, J., Qu, Q., Wright, J.: A geometric analysis of phase retrieval. Foundations of Computational Mathematics 18(5), 1131-1198 (2018)
[34] Tu, S., Boczar, R., Simchowitz, M., Soltanolkotabi, M., Recht, B.: Low-rank solutions of linear matrix equations via Procrustes flow. In: International Conference on Machine Learning, pp. 964-973.
PMLR (2016)
[35] Wei, K., Cai, J.F., Chan, T.F., Leung, S.: Guarantees of Riemannian optimization for low rank matrix recovery. SIAM Journal on Matrix Analysis and Applications 37(3), 1198-1222 (2016)
[36] Wei, K., Cai, J.F., Chan, T.F., Leung, S.: Guarantees of Riemannian optimization for low rank matrix completion. Inverse Problems & Imaging 14(2) (2020)
[37] Yi, X., Park, D., Chen, Y., Caramanis, C.: Fast algorithms for robust PCA via gradient descent. Advances in Neural Information Processing Systems 29 (2016)
[38] Zhang, H., Bi, Y., Lavaei, J.: General low-rank matrix optimization: Geometric analysis and sharper bounds. Advances in Neural Information Processing Systems 34 (2021)
[39] Zhang, R.Y., Lavaei, J., Baldick, R.: Spurious local minima in power system state estimation. IEEE Transactions on Control of Network Systems 6(3), 1086-1096 (2019)
[40] Zhang, R.Y., Sojoudi, S., Lavaei, J.: Sharp restricted isometry bounds for the inexistence of spurious local minima in nonconvex matrix recovery. J. Mach. Learn. Res. 20(114), 1-34 (2019)
[41] Zheng, Q., Lafferty, J.: A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements. In: Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 1, pp. 109-117 (2015)
[42] Zhu, Z., Li, Q., Tang, G., Wakin, M.B.: Global optimality in low-rank matrix optimization. IEEE Transactions on Signal Processing 66(13), 3614-3628 (2018)

Acknowledgements

This work was supported by grants from AFOSR, ARO, ONR, and NSF. Haixiang Zhang was supported by the Two Sigma Ph.D. Fellowship.