# Unique Sparse Decomposition of Low Rank Matrices

Dian Jin (Rutgers University, dj370@scarletmail.rutgers.edu) · Xin Bing (Cornell University, xb43@cornell.edu) · Yuqian Zhang (Rutgers University, yqz.zhang@rutgers.edu)

Abstract: The problem of finding a unique low-dimensional decomposition of a given matrix is fundamental and recurrent in many areas. In this paper, we study the problem of seeking a unique decomposition of a low-rank matrix $Y \in \mathbb{R}^{p\times n}$ that admits a sparse representation. Specifically, we consider $Y = AX \in \mathbb{R}^{p\times n}$, where the matrix $A \in \mathbb{R}^{p\times r}$ has full column rank with $r < \min\{n, p\}$, and the matrix $X \in \mathbb{R}^{r\times n}$ is element-wise sparse. We prove that this sparse decomposition of $Y$ can be uniquely identified by recovering the ground-truth $A$ column by column, up to an intrinsic signed permutation. Our approach relies on solving a nonconvex optimization problem constrained over the unit sphere. Our geometric analysis of the nonconvex optimization landscape shows that any strict local solution is close to the ground-truth solution and can be recovered by a simple data-driven initialization followed by any second-order descent algorithm. Finally, we corroborate these theoretical results with numerical experiments.

1 Introduction

Matrix decomposition is a popular and fundamental topic that has been investigated extensively across several disciplines, including signal processing, machine learning, and natural language processing [10, 11, 31, 46, 32, 8]. From a decomposition, one can construct an efficient representation of the original data matrix. However, for any matrix $Y \in \mathbb{R}^{p\times n}$ that can be factorized as a product of two matrices $A \in \mathbb{R}^{p\times r}$ and $X \in \mathbb{R}^{r\times n}$, there exist infinitely many decompositions, simply because one can use any $r\times r$ invertible matrix $Q$ to construct $A' = AQ$ and $X' = Q^{-1}X$ such that $Y = AX = A'X'$, while $A' \ne A$ and $X' \ne X$. Thus, in various applications, additional structures and priors are exploited to find a preferred representation [22, 15]. For example, principal component analysis (PCA) aims to find orthogonal representations that retain as much of the variation in $Y$ as possible [17, 23], whereas independent component analysis (ICA) targets representations of statistically independent non-Gaussian signals [26].

In this paper, we are interested in finding a unique sparse low-dimensional representation of $Y$. To this end, we study the decomposition of a low-rank matrix $Y \in \mathbb{R}^{p\times n}$ that satisfies
$$Y = AX, \tag{1.1}$$
where $A \in \mathbb{R}^{p\times r}$ is an unknown deterministic matrix with $r < \min\{n, p\}$, and $X \in \mathbb{R}^{r\times n}$ is an unknown sparse matrix. Formulation (1.1) is an important model problem in many applications. Since the columns of $Y$ are viewed as linear combinations of the columns of $A$ with $X$ collecting the sparse coefficients, (1.1) can be used to form overlapping clusters of the $n$ columns of $Y$ via the support of $X$, with the columns of $A$ viewed as $r$ cluster centers [12, 7]. When we form a $p\times r$ low-dimensional representation of $Y$ via sparse combinations, the sparsity greatly enhances the interpretability of the resulting representation [28, 21, 4], in the same spirit as sparse PCA, but (1.1) generalizes to the factorization of non-orthogonal matrices.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

To motivate our approach, we first consider the simple case in which $A$ has orthonormal columns, namely $A^\top A = I_r$.¹ Then it is easy to see that the sparse coefficient matrix $X$ is recovered by multiplying $Y$ on the left by $A^\top$:
$$A^\top Y = A^\top A X = X. \tag{1.2}$$
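To make the non-uniqueness discussed above and the identity (1.2) concrete, here is a small illustrative NumPy snippet (ours, not from the paper): any invertible $Q$ yields a second exact factorization of the same $Y$, while for a semi-orthogonal $A$ the product $A^\top Y$ recovers $X$ exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
p, r, n = 20, 4, 50

# A with orthonormal columns (A^T A = I_r) and a sparse coefficient matrix X
A, _ = np.linalg.qr(rng.standard_normal((p, r)))
X = rng.standard_normal((r, n)) * (rng.random((r, n)) < 0.2)
Y = A @ X

# non-uniqueness: (AQ, Q^{-1} X) is another exact factorization of the same Y
Q = rng.standard_normal((r, r))
assert np.allclose(Y, (A @ Q) @ (np.linalg.inv(Q) @ X))

# but when A^T A = I_r, left-multiplying by A^T recovers X exactly, as in (1.2)
assert np.allclose(A.T @ Y, X)
```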
The problem of finding such an orthonormal matrix $A$ boils down to successively finding a unit-norm direction $q$ that renders $q^\top Y$ as sparse as possible [34, 39, 35],
$$\min_q \ \|q^\top Y\|_{\text{sparsity}} \quad \text{s.t. } \|q\|_2 = 1. \tag{1.3}$$
However, the natural choices of sparsity penalty, either $\ell_0$ or $\ell_1$, lead to trivial and meaningless solutions, as there always exists $q$ in the null space of $A^\top$ such that $q^\top Y = 0$. To avoid the null space of $A^\top$, we instead choose to find the unit direction $q$ that maximizes the $\ell_4$ norm of $q^\top Y$:
$$\max_q \ \|q^\top Y\|_4 \quad \text{s.t. } \|q\|_2 = 1. \tag{1.4}$$
This formulation is based on the key observation that the objective is maximized when $q$ coincides with one column of $A$ (see Section 2 for details), while the objective is zero when $q$ lies in the null space of $A^\top$. The $\ell_4$-norm objective and its variants have been adopted as sparsity surrogates in a line of recent works [30, 44, 43, 35, 42]. However, even with this new objective function, the null space of $A^\top$ persists as a challenge for solving the optimization problem: it forms a flat region of saddle points.

This paper characterizes the nonconvex optimization landscape of (1.4) and proposes a guaranteed procedure that avoids the flat region and provably recovers the global solution to (1.4), which corresponds to one column of $A$. More specifically, we demonstrate that, despite the non-convexity, (1.4) still possesses a benign geometric property in the sense that any strict local solution with large objective value is globally optimal and recovers one column of $A$, up to its sign. See Theorem 3.1 in Section 3.1 for the population-level result and Theorem 3.4 for the finite-sample result. We further extend these results to the general case in which $A$ only has full column rank in Theorem 3.6 of Section 3.2. To recover a general $A$ with full column rank, our procedure first applies a preconditioning to $Y$, proposed in Section 2.3, and then solves an optimization problem similar to (1.4). From our analysis of the optimization landscape, the remaining problem boils down to developing algorithms that reach the nontrivial local solutions while avoiding the region with small objective values. We thus propose a simple initialization scheme in Section 4.1 and prove in Theorem 4.3 that such an initialization, followed by any second-order descent algorithm [20, 27], suffices to find the global solution, up to some statistical error. Our theoretical analysis provides the explicit convergence rate of the statistical error and characterizes its dependence on the dimensions $p$, $r$ and $n$, as well as on the sparsity of $X$. Numerical simulation results are provided in Section 5. Due to the space limitation, we defer all proofs, along with our conclusions and a discussion of several future directions of this work, to the appendix.

Notations. Throughout this paper, we use bold lowercase letters, like $a$, to represent vectors and bold uppercase letters, like $A$, to represent matrices. For a matrix $X$, $X_{ij}$ denotes the entry in the $i$-th row and $j$-th column of $X$, with $X_{i\cdot}$ and $X_{\cdot j}$ denoting the $i$-th row and $j$-th column of $X$, respectively. Oftentimes, we write $X_{\cdot j} = X_j$ for simplicity. We use $\operatorname{grad}$ and $\operatorname{Hess}$ to represent the Riemannian gradient and Hessian. For any vector $v\in\mathbb{R}^d$, we use $\|v\|_q$ to denote its $\ell_q$ norm, for $1\le q\le\infty$. The notation $v^q$ stands for the entry-wise power $\{v_i^q\}_i$. For matrices, we use $\|\cdot\|_F$ and $\|\cdot\|_{\rm op}$ to denote the Frobenius norm and the operator norm, respectively. For any positive integer $d$, we write $[d] = \{1, 2, \ldots, d\}$.
The unit sphere in the $d$-dimensional real space $\mathbb{R}^d$ is written as $\mathbb{S}^{d-1}$. For two sequences $a_n$ and $b_n$, we write $a_n\lesssim b_n$ if there exists some constant $C > 0$ such that $a_n\le Cb_n$ for all $n$. Both the uppercase $C$ and the lowercase $c$ are reserved for numerical constants, whose values may vary from line to line.

¹ $I_r$ is the identity matrix of size $r\times r$.

1.1 Related work

Finding the unique factorization of a matrix is an ill-posed problem in general, since there are infinitely many solutions. There exist several strands of work, from different contexts, on finding a unique decomposition of $Y$ by imposing additional structure on $A$ and $X$. We start by reviewing the literature that targets sparse decompositions of $Y$.

Dictionary learning. The problems of dictionary learning (DL) [2, 38, 19, 39] and sparse blind deconvolution or convolutional dictionary learning [13, 29] study the unique decomposition of $Y = AX$ where $X$ is sparse and $A$ has full row rank. In this case, the row space of $Y$ lies in the row space of $X$, which suggests recovering the sparse rows of $X$ by solving
$$\min_q \ \|q^\top Y\|_1 \quad \text{s.t. } q \ne 0. \tag{1.5}$$
Under certain scaling and incoherence conditions on $A$, the objective achieves its minimum value when $q$ equals one column of $A$, in which case $q^\top Y$ recovers one sparse row of $X$. This idea has been studied and modified in a strand of papers for $A$ with full row rank [38, 39, 45, 30, 44, 35, 42, 37, 47]. In our context, the major difference arises from the matrix $A$, which has full column rank rather than full row rank; therefore, minimizing $\|q^\top Y\|_1$ as before only leads to some vector in the null space of $A^\top$, yielding the trivial zero objective value. We note that [35] uses the same objective function as (1.4) to study overcomplete dictionary learning (where $A$ has full row rank); however, the optimization landscape when $A$ has full column rank is significantly different from that in the overcomplete setting. The more complicated optimization landscape in our setting brings additional difficulty to the analysis and requires a proper initialization in our proposed algorithm. We refer to Appendix B for a detailed technical comparison with [35].

Sparse PCA. Sparse principal component analysis (SPCA) is a popular method that recovers a unique decomposition of a low-rank matrix $Y$ by utilizing the sparsity of its singular vectors. However, as noted above, under $Y = AX$, SPCA is applicable only when $X$ coincides with the right singular vectors of $Y$. Indeed, one formulation of SPCA solves
$$\max_{U\in\mathbb{R}^{n\times r}} \ \operatorname{tr}\!\big(U^\top Y^\top Y U\big) - \lambda\|U\|_1 \quad \text{s.t. } U^\top U = I_r, \tag{1.6}$$
which is promising only if $X$ corresponds to the right singular vectors of $Y$. It is worth mentioning that, among the various approaches to SPCA, the following one might be used to recover one sparse row of $X$:
$$\min_{u,v} \ \|Y - uv^\top\|_F^2 + \lambda\|v\|_1 \quad \text{s.t. } \|u\|_2 = 1. \tag{1.7}$$
This procedure was originally proposed by [49] and [36], together with an efficient algorithm that alternates the minimization between $u$ and $v$. However, there is no guarantee that the resulting solution recovers the ground truth.

Factor analysis. Factor analysis is a popular statistical tool for constructing low-rank representations of $Y$ by postulating $Y = AX + E$, where $A\in\mathbb{R}^{p\times r}$ is the so-called loading matrix with $r = \operatorname{rank}(A) < \min\{n, p\}$, $X\in\mathbb{R}^{r\times n}$ contains $n$ realizations of an $r$-dimensional factor, and $E$ is some additive noise. Here only $Y$ is observable.
Factor analysis is mainly used to recover the low-dimensional column space of $A$ or the row space of $X$, rather than to identify and recover the unique decomposition. Recently, [7] studied the unique decomposition of $Y$ when the columns of $X$ are i.i.d. realizations of an $r$-dimensional latent random factor. The unique decomposition is further used for (overlapping) clustering of the rows of $Y$ via the assignment matrix $A$. To uniquely identify $A$, [7] assumes that $A$ contains at least one $r\times r$ identity submatrix, coupled with other scaling conditions on $A$ (we refer to [7] for a detailed discussion of other existing conditions in the factor-model literature that ensure the unique decomposition of $Y$ but require strong prior information on either $A$ or $X$). By contrast, we rely on the sparsity of $X$ instead of conditions on $A$, which is more general than requiring the existence of an $r\times r$ identity submatrix.

NMF and topic models. Such an identity-submatrix condition on either $A$ or $X$ has variants in non-negative matrix factorization (NMF) [14] and topic models [3, 8, 9]; see also the references therein. There, $Y$, $A$ and $X$ all have non-negative entries. Since $Y$, $A$ and $X$ in model (1.1) are allowed to have arbitrary signs in our context, the approaches designed for NMF and topic models are inapplicable.

2 Formulation and Assumptions

The decomposition $Y = AX$ is not unique without further assumptions. To ensure uniqueness of the decomposition, we rely on two assumptions on the matrices $A$ and $X$, stated in Section 2.1 below. Our goal is to uniquely recover $A$ from $Y$, up to a signed permutation. More precisely, we aim to recover the columns of $AP$ for some signed permutation matrix $P\in\mathbb{R}^{r\times r}$. To facilitate understanding and to motivate our approach, in Section 2.2 we first state our procedure for the unique recovery of $A$ when $A$ has orthonormal columns; its theoretical analysis is presented in Section 3. Later, in Section 2.3, we discuss how to extend our results to the case where $A$ is a general full-column-rank matrix satisfying Assumption 2.2. For now, we focus only on the recovery of one column of $A$, as the remaining columns can be recovered via the same procedure after projecting $Y$ onto the complement of the space spanned by the recovered columns of $A$ (see Section D for a detailed discussion).

2.1 Assumptions

We first require the matrix $X\in\mathbb{R}^{r\times n}$ to be element-wise sparse. The sparsity of $X$ is modeled via the Bernoulli–Gaussian distribution, stated in the following assumption.

Assumption 2.1 Assume $X_{ij} = B_{ij}Z_{ij}$ for $i\in[r]$ and $j\in[n]$, where
$$B_{ij}\overset{\text{i.i.d.}}{\sim}\operatorname{Ber}(\theta),\qquad Z_{ij}\overset{\text{i.i.d.}}{\sim}\mathcal{N}(0,\sigma^2). \tag{2.1}$$

The Bernoulli–Gaussian distribution is popular for modeling sparse random matrices [38, 2, 1, 39]. The overall sparsity level of $X$ is controlled by $\theta$, the parameter of the Bernoulli distribution. We remark that Gaussianity is assumed only to simplify the proofs and to obtain more transparent deviation inequalities between quantities related to $X$ and their population counterparts. Both our approach and our analysis can be generalized to the case where the $Z_{ij}$ are centered i.i.d. sub-Gaussian random variables.

We also need a condition on the matrix $A$. To see this, note that even if $A$ were known, recovering $X$ from $Y = AX$ requires $A$ to have full column rank. We state this in the following assumption.

Assumption 2.2 Assume the matrix $A\in\mathbb{R}^{p\times r}$ has $\operatorname{rank}(A) = r$ with $\|A\|_{\rm op} = 1$.

The unit operator norm of $A$ is assumed without loss of generality, as one can always re-scale $\sigma^2$, the variance of $X$, by $\|A\|_{\rm op}$.
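The following sketch (illustrative code under Assumptions 2.1 and 2.2; the function name and defaults are ours) samples a Bernoulli–Gaussian $X$ and a full-column-rank $A$ with unit operator norm. With `orthonormal=True` it takes $A$ to be the left singular vectors of a Gaussian matrix, which also satisfies Assumption 2.3 below and matches the experiment setup of Section 5.

```python
import numpy as np

def generate_data(p, r, n, theta, sigma=1.0, orthonormal=True, seed=None):
    """Sample (A, X, Y) from model (1.1) under Assumptions 2.1 and 2.2."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((p, r))
    if orthonormal:
        # left singular vectors of a Gaussian matrix: A^T A = I_r (Assumption 2.3)
        A, _, _ = np.linalg.svd(R, full_matrices=False)
    else:
        # general full-column-rank A, rescaled so that ||A||_op = 1 (Assumption 2.2)
        A = R / np.linalg.norm(R, ord=2)
    B = rng.random((r, n)) < theta              # Bernoulli(theta) support
    Z = sigma * rng.standard_normal((r, n))     # N(0, sigma^2) values
    X = B * Z                                   # Assumption 2.1: X_ij = B_ij * Z_ij
    return A, X, A @ X

# A, X, Y = generate_data(p=100, r=10, n=5000, theta=0.1, seed=0)
```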
2.2 Recovery of the orthonormal columns of A

In this section, we consider the recovery of one column of $A$ when $A$ is a semi-orthogonal matrix satisfying the following assumption.

Assumption 2.3 Assume $A^\top A = I_r$.

Our approach recovers the columns of $A$ one at a time by adopting $\ell_4$-norm maximization to promote the sparsity of the rows of $X$. Its rationale is based on the following lemma, which assumes orthogonality among the columns of $A$.

Lemma 2.4 Under Assumption 2.3, solving the problem
$$\max_q \ \|A^\top q\|_4^4 \quad \text{s.t. } \|q\|_2 = 1 \tag{2.2}$$
recovers one column of $A$, up to its sign.

Intuitively, under Assumption 2.3, we have $\|A^\top q\|_2\le 1$ for any unit vector $q$. Therefore, criterion (2.2) seeks a vector $A^\top q$ within the unit ball that maximizes the $\ell_4$ norm. When $q$ corresponds to one column of $A$, that is, $q = a_i$ for some $i\in[r]$, the objective attains its largest value $\|A^\top a_i\|_4^4 = 1$. This $\ell_4$-norm maximization approach has been used in several related lines of work, for instance, sparse blind deconvolution [44, 30], complete and over-complete dictionary learning [43, 42, 35], independent component analysis [25, 24] and tensor decomposition [18]. The appealing property of maximizing the $\ell_4$ norm is its benign geometric landscape under the unit-sphere constraint. Indeed, despite the non-convexity of (2.2), our result in Theorem 3.1 implies that any strict local solution to (2.2) is globally optimal. This enables us to use any second-order gradient ascent method to solve (2.2).

Motivated by Lemma 2.4, and since we only have access to $Y\in\mathbb{R}^{p\times n}$, we propose to solve the following problem to recover one column of $A$:
$$\min_q \ F(q) := -\frac{1}{12\theta\sigma^4 n}\,\|Y^\top q\|_4^4 \quad \text{s.t. } \|q\|_2 = 1. \tag{2.3}$$
The scalar $(12\theta\sigma^4 n)^{-1}$ is a normalization constant. The following lemma justifies the use of (2.3) and also highlights the role of the sparsity of $X$.

Lemma 2.5 Under model (1.1) and Assumption 2.1, we have
$$\mathbb{E}\big[F(q)\big] = -\frac14\Big[(1-\theta)\,\|A^\top q\|_4^4 + \theta\,\|A^\top q\|_2^4\Big], \tag{2.4}$$
where the expectation is taken over the randomness of $X$.

Remark 2.6 (Role of the sparsity parameter θ) Lemma 2.5 implies that, for large $n$, solving (2.3) approximately finds the solution to
$$\min_q \ f(q) := -\frac14\Big[(1-\theta)\,\|A^\top q\|_4^4 + \theta\,\|A^\top q\|_2^4\Big] \quad \text{s.t. } \|q\|_2 = 1. \tag{2.5}$$
The objective is (the negative of) a convex combination of $\|A^\top q\|_4^4$ and $\|A^\top q\|_2^4$, with coefficients depending on the magnitude of $\theta$. In view of Lemma 2.4, it is easy to see that solving (2.5) recovers one column of $A$, up to the sign, as long as $\theta < 1$. However, the magnitude of $\theta$ controls how benign the geometric landscape of (2.5) is. When $\theta$ is small, that is, when $X$ is sufficiently sparse, we essentially solve (2.2), which has the most benign landscape. On the other hand, when $\theta\to 1$, the landscape of (2.5) is mostly determined by the eigenvalue problem² that maximizes $\|A^\top q\|_2$ subject to $\|q\|_2 = 1$. We will demonstrate in Section 3 that, when $X$ is sufficiently sparse, a second-order descent algorithm with a simple initialization finds the globally optimal solution to (2.3).

² When $A$ is orthonormal, this eigenvalue problem possesses the worst landscape, as there are infinitely many solutions attaining the same eigenvalue.
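As a minimal illustration of how (2.3) can be attacked in practice (our sketch, not the paper's certified algorithm), note that $\|Y^\top q\|_4^4$ is convex in $q$, so the power-type update $q \leftarrow Y(Y^\top q)^3/\|Y(Y^\top q)^3\|_2$ never decreases it, i.e. it is a monotone first-order scheme for minimizing $F(q)$ over the sphere. The theory in Sections 3 and 4 is stated for second-order descent methods [20, 27], which in addition certify escaping strict saddle points.

```python
import numpy as np

def F(q, Y, theta, sigma):
    """Objective of (2.3): F(q) = -||Y^T q||_4^4 / (12 * theta * sigma^4 * n)."""
    n = Y.shape[1]
    return -np.sum((Y.T @ q) ** 4) / (12.0 * theta * sigma**4 * n)

def l4_power_method(Y, q0, iters=200):
    """Monotone first-order scheme for (2.3): since ||Y^T q||_4^4 is convex,
    each update can only increase it, hence only decrease F(q)."""
    q = q0 / np.linalg.norm(q0)
    for _ in range(iters):
        g = Y @ (Y.T @ q) ** 3      # gradient of ||Y^T q||_4^4, up to the factor 4
        q = g / np.linalg.norm(g)   # retraction back to the unit sphere
    return q

# usage with (A, X, Y) from the generator above (a random q0 avoids R0 almost surely):
# q_hat = l4_power_method(Y, q0=np.random.default_rng(1).standard_normal(Y.shape[0]))
# print(np.max(np.abs(A.T @ q_hat)))   # close to 1 when one column of A is recovered
```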
To precondition Y , we propose to left multiply Y by the following matrix D .= Y Y T +ı1/2 Rp p (2.6) where M + denotes the Moore-Penrose inverse of any matrix M. The resulting preconditioned Y satisfies Y .= DY = A X + E (2.7) with A satisfying Assumption 2.3, X = X/ ? θnσ2 and E being a perturbation matrix with small entries. We refer to Proposition 3.5 below for its precise statements. Analogous to (2.3), we propose to recover one column of A by solving the following problem min q 2=1 Fg(q) .= θn Y T q 4 4 (2.8) Theoretical guarantees of this procedure are provided in Section 3.2. After recovering one column of A, the remaining columns of A can be successively recovered via the procedure in Section D. In the end, A can be recovered by first inverting the preconditioning matrix D as D 1 A and then re-scaling its largest singular value to 1. 2When A is orthonormal, this eigenvalue problem processes the worst landscape as there are infinitely many solutions obtaining the same eigenvalue. 3 Theoretical Guarantees We provide theoretical guarantees for our procedure (2.3) in Section 3.1 when A has orthonormal columns. The theoretical guarantees of (2.8) for recovering a general full column rank A are stated in Section 3.2. 3.1 Theoretical guarantees for semi-orthonormal A In this section, we provide guarantees for our procedure by characterizing the solution to (2.3) when A satisfies Assumption 2.3. As the objective function F(q) in (2.3) concentrates around f(q) in (2.5), it is informative to first analyze the solution to (2.5). Although (2.5) is a nonconvex problem and has multiple local solutions, Theorem 3.1 below guarantees that any strict local solution to (2.5) is globally optimal, in the sense that, it recovers one column of A, up to its sign. We introduce the null region R0 of our objective in (2.5), R0 = q Sp 1 : AT q = 0 . (3.1) Theorem 3.1 (Population case) Under Assumption 2.3, assume θ 1/6. Any local solution q to (2.5), that is not in R0, satisfies a = AP e1 (3.2) for some signed permutation matrix P Rr r. The detailed proof of Theorem 3.1 is deferred to Appendix F.3. We only offer an outline of our analysis below. The proof of Theorem 3.1 relies on analysis of the optimization landscape of (2.5) on disjoint partitions of Sp 1 = {q Rp : q 2= 1}3, defined as R1 .= R1(C ) = q Sp 1 : AT q 2 C , (3.3) R2 = Sp 1 \ p R0 R1q . Here C is any fixed constant between 0 and 1. The upper bound follows from the inequality that AT q = maxk|a T k q| ak 2 q 2= 1 for any q Sp 1. The region R0 can be easily avoided by choosing the initialization such that the objective function f(q) is not equal to zero. For R1 and R2, we are able to show the following results. Let Hess f(q) be the Riemannian Hessian matrix of (2.5) at any point q Sp 1. (1) Optimization landscape for R1: Lemma 3.2 Assume θ < 1. Any local solution q R1(C ) to (2.5) with C > 1 θ 1 θ recovers one column of A, that is, for some signed permutation matrix P Lemma 3.2 shows that any critical point q R1 is either a strict saddle point that there exists a direction along which the Hessian is negative, or the desired local solution q that satisfies the second order optimality condition and is equal to one column of A, up to its sign. (2) Optimization landscape for R2: Lemma 3.3 Assume θ < 1/3. For any point q R2(C ) with C 1 3θ 2 , there exists v such that v T Hess f pqq v < 0. (3.4) Lemma 3.3 implies that any critical point in R2 is a saddle point that can be escaped by negative curvature. 
3 Theoretical Guarantees

We provide theoretical guarantees for our procedure (2.3) in Section 3.1, for the case where $A$ has orthonormal columns. The theoretical guarantees of (2.8) for recovering a general full-column-rank $A$ are stated in Section 3.2.

3.1 Theoretical guarantees for semi-orthonormal A

In this section, we provide guarantees for our procedure by characterizing the solutions to (2.3) when $A$ satisfies Assumption 2.3. As the objective function $F(q)$ in (2.3) concentrates around $f(q)$ in (2.5), it is informative to first analyze the solutions to (2.5). Although (2.5) is a nonconvex problem and has multiple local solutions, Theorem 3.1 below guarantees that any strict local solution to (2.5) is globally optimal, in the sense that it recovers one column of $A$, up to its sign. We introduce the null region $\mathcal{R}_0$ of the objective in (2.5),
$$\mathcal{R}_0 = \big\{q\in\mathbb{S}^{p-1} : A^\top q = 0\big\}. \tag{3.1}$$

Theorem 3.1 (Population case) Under Assumption 2.3, assume $\theta\le 1/6$. Any local solution $\bar q$ to (2.5) that is not in $\mathcal{R}_0$ satisfies
$$\bar q = APe_1 \tag{3.2}$$
for some signed permutation matrix $P\in\mathbb{R}^{r\times r}$.

The detailed proof of Theorem 3.1 is deferred to Appendix F.3; we only offer an outline of the analysis below. The proof relies on an analysis of the optimization landscape of (2.5) on disjoint partitions of $\mathbb{S}^{p-1} = \{q\in\mathbb{R}^p : \|q\|_2 = 1\}$,³ defined as
$$\mathcal{R}_1 := \mathcal{R}_1(C^\star) = \big\{q\in\mathbb{S}^{p-1} : \|A^\top q\|_\infty \ge C^\star\big\}, \qquad \mathcal{R}_2 = \mathbb{S}^{p-1}\setminus\big(\mathcal{R}_0\cup\mathcal{R}_1\big). \tag{3.3}$$
Here $C^\star$ is any fixed constant between 0 and 1; the upper bound follows from the inequality $\|A^\top q\|_\infty = \max_k|a_k^\top q| \le \|a_k\|_2\|q\|_2 = 1$ for any $q\in\mathbb{S}^{p-1}$. The region $\mathcal{R}_0$ can easily be avoided by choosing the initialization such that the objective function $f(q)$ is not equal to zero. For $\mathcal{R}_1$ and $\mathcal{R}_2$, we are able to show the following results. Let $\operatorname{Hess} f(q)$ be the Riemannian Hessian matrix of (2.5) at any point $q\in\mathbb{S}^{p-1}$.

³ A visualization of the partitions on $\mathbb{S}^2$ is available in Appendix A.

(1) Optimization landscape on $\mathcal{R}_1$:

Lemma 3.2 Assume $\theta < 1$. Any local solution $\bar q\in\mathcal{R}_1(C^\star)$ to (2.5) with $C^\star > \sqrt{\theta/(1-\theta)}$ recovers one column of $A$, that is, $\bar q = APe_1$ for some signed permutation matrix $P$.

Lemma 3.2 shows that any critical point $q\in\mathcal{R}_1$ is either a strict saddle point, in the sense that there exists a direction along which the Hessian is negative, or the desired local solution $\bar q$ that satisfies the second-order optimality condition and is equal to one column of $A$, up to its sign.

(2) Optimization landscape on $\mathcal{R}_2$:

Lemma 3.3 Assume $\theta < 1/3$. For any point $q\in\mathcal{R}_2(C^\star)$ with $C^\star \le \sqrt{(1-3\theta)/2}$, there exists $v$ such that
$$v^\top \operatorname{Hess} f(q)\, v < 0. \tag{3.4}$$

Lemma 3.3 implies that any critical point in $\mathcal{R}_2$ is a saddle point that can be escaped via negative curvature. Hence there is no local solution to (2.5) in the region $\mathcal{R}_2$.

Theorem 3.1 thus follows from Lemma 3.2 and Lemma 3.3, provided that
$$\sqrt{\frac{\theta}{1-\theta}} \;<\; \sqrt{\frac{1-3\theta}{2}}. \tag{3.5}$$
Condition (3.5) puts a restriction on the magnitude of $\theta$; it is easy to see that (3.5) holds for any $\theta\le 1/6$. As discussed in Remark 2.6, a smaller $\theta$ leads to a more benign optimization landscape.

In light of Theorem 3.1, we now provide guarantees for the solution to the finite-sample problem (2.3) in the following theorem. Define the sample analogue of the null region $\mathcal{R}_0$ in (3.1) as
$$\mathcal{R}_0'(c^\star) := \big\{q\in\mathbb{S}^{p-1} : \|A^\top q\|_\infty \le c^\star\big\} \tag{3.6}$$
for any given value $c^\star\in[0,1)$.

Theorem 3.4 (Finite sample case) Under Assumptions 2.1 and 2.3, assume $\theta\in(0, 1/9]$ and
$$\theta n \;\ge\; C\max\big\{r\log n,\ \log^2 n\big\} \tag{3.7}$$
for some sufficiently large constant $C > 0$, and let $c^\star\in(0, 1/4]$. Then, with probability at least $1 - cn^{-c'}$, any local solution $\bar q$ to (2.3) that is not in $\mathcal{R}_0'(c^\star)$ satisfies
$$\|\bar q - APe_1\|_2^2 \;\lesssim\; \sqrt{\frac{\theta r^2 + \log^2 n}{\theta n}} \tag{3.8}$$
for some signed permutation matrix $P$. We defer the discussion of technical details and the full proof to Appendix C.

3.2 Theoretical guarantees for general full column rank A

In this section, we provide theoretical guarantees for our procedure for recovering a general full-column-rank matrix $A$ under Assumption 2.2. Recall from Section 2.3 that our approach first preconditions $Y$ using $D$ from (2.6). The following proposition provides guarantees for the preconditioned matrix $\bar Y = DY$; its proof is deferred to Appendix F.5. Write the SVD of $A$ as $A = U_AD_AV_A^\top$, with $U_A\in\mathbb{R}^{p\times r}$ and $V_A\in\mathbb{R}^{r\times r}$ containing, respectively, the left and right singular vectors.

Proposition 3.5 Under Assumptions 2.1 and 2.2, assume $n\ge Cr/\theta^2$ for some sufficiently large constant $C > 0$. With probability greater than $1 - 2e^{-c'r}$, one has
$$\bar Y = \bar A\bar X + E, \tag{3.9}$$
where $\bar A = U_AV_A^\top$, $\bar X = X/\sqrt{\theta n\sigma^2}$ and $E = \bar A\,\Delta\,\bar X$, with the size of the perturbation $\Delta$ controlled as in (3.10). Here $c$ and $c'$ are positive constants.

Proposition 3.5 implies that, when $n\ge Cr/\theta^2$, the preconditioned matrix satisfies
$$\bar Y = \bar A(I_r + \Delta)\bar X \approx \bar A\bar X \tag{3.11}$$
with $\bar A^\top\bar A = I_r$. This naturally leads us to apply the procedure of Section 2.2 to recover the columns of $\bar A$ via (2.8). We formally show in Theorem 3.6 below that any local solution to (2.8) approximately recovers one column of $\bar A$, up to a signed permutation. Similar to (3.6), define
$$\bar{\mathcal{R}}_0(c^\star) := \big\{q\in\mathbb{S}^{p-1} : \|\bar A^\top q\|_\infty \le c^\star\big\} \tag{3.12}$$
for some given value $c^\star\in[0,1)$.

Theorem 3.6 Under Assumptions 2.1 and 2.2, assume $\theta\in(0, 1/9]$ and that $n$ satisfies the sample size condition (3.13), which involves $c^\star$, $\log^3 n$, $r$, $r^2\log n$ and negative powers of $\theta$. Then, with probability at least $1 - cn^{-c'} - 4e^{-c''r}$, any local solution $\bar q$ to (2.8) that is not in the region $\bar{\mathcal{R}}_0(c^\star)$ satisfies
$$\|\bar q - \bar APe_1\|_2^2 \;\lesssim\; \sqrt{\frac{\theta r^2 + \log^2 n}{\theta n}} + \sqrt{\frac{r\log n}{\theta^2 n}}$$
for some signed permutation matrix $P$.

The proof of Theorem 3.6 can be found in Appendix F.6. Due to the preconditioning step, the requirement on the sample size in (3.13) is slightly stronger than (3.7), whereas the estimation error of $\bar q$ only has an additional $\sqrt{r\log n/(\theta^2 n)}$ term compared to (3.8). Theorem 3.6 requires avoiding the null region $\bar{\mathcal{R}}_0(c^\star)$ in (3.12). We provide a simple initialization in the next section that provably avoids $\bar{\mathcal{R}}_0$; furthermore, every iterate of any descent algorithm based on such an initialization is provably not in $\bar{\mathcal{R}}_0$ either.

4 Complete Algorithm and Provable Recovery

In this section, we present a complete pipeline for recovering $A$ from $Y$. So far, we have established that every local solution to (2.8) that is not in $\bar{\mathcal{R}}_0(c^\star)$ approximately recovers one column of $\bar A = U_AV_A^\top$.
To this end, we will discuss: (1) a data-driven initialization in Section 4.1 which, together with Theorem 3.6, provably recovers one column of $\bar A$; and (2) a deflation procedure [38, 39, 35] in Section D that sequentially recovers all remaining columns of $\bar A$. Due to the space limitation, we defer the discussion of the deflation procedure to the appendix.

4.1 Initialization

Our goal is to provide a simple initialization such that solving (2.8) via any second-order descent algorithm provably recovers one column of $\bar A$. According to Theorem 3.6, such an initialization needs to guarantee the following conditions.

Condition I: the initial point $q^{(0)}$ does not fall into the region $\bar{\mathcal{R}}_0(c^\star)$ for some $c^\star$ satisfying (3.13) in Theorem 3.6.

Condition II: the updated iterates $q^{(k)}$, for all $k\ge 1$, stay away from $\bar{\mathcal{R}}_0(c^\star)$ as well.

We propose the following initialization:
$$q^{(0)} = \frac{\bar Y\mathbf{1}_n}{\|\bar Y\mathbf{1}_n\|_2}\in\mathbb{S}^{p-1}. \tag{4.1}$$
The following two lemmas guarantee that both Condition I and Condition II are met by this choice. Their proofs can be found in Appendices F.7 and F.8.

Lemma 4.1 Under Assumptions 2.1 and 2.2, assume $\theta\in(0, 1/9]$ and that
$$n \;\ge\; \frac{C}{\theta}\,\max\Big\{\log^3 n,\ \frac{r\log n}{\theta},\ \frac{r\log^2 n}{\theta},\ r^3\log n\Big\} \tag{4.2}$$
holds. Then, with probability at least $1 - 2e^{-cr}$, the initialization $q^{(0)}$ in (4.1) is not in the region $\bar{\mathcal{R}}_0(c^\star)$ with $c^\star = 1/(2r)$.

Lemma 4.2 Let $q^{(k)}$, for $k\ge 1$, be any iterate obtained from solving (2.8) by a monotonically decreasing algorithm initialized at $q^{(0)}$ in (4.1). If
$$n \;\ge\; \frac{C}{\theta}\,\max\Big\{\log^3 n,\ \frac{r\log n}{\theta},\ \theta^2 r^2\log n\Big\} \tag{4.3}$$
holds, then, with probability at least $1 - cn^{-c'} - 2e^{-c''r}$, one has $q^{(k)}\notin\bar{\mathcal{R}}_0(c^\star)$ for all $k\ge 1$, with $c^\star = 1/(2r)$.

Combining Lemmas 4.1 and 4.2 with Theorem 3.6 readily yields the following theorem.

Theorem 4.3 Under Assumptions 2.1 and 2.2, assume $\theta\in(0, 1/9]$ and that (4.2) holds. Let $\bar q$ be any local solution to (2.8) obtained by any monotonically decreasing second-order algorithm initialized at (4.1). Then, with probability at least $1 - cn^{-c'} - 4e^{-c''r}$, one has
$$\|\bar q - \bar APe_1\|_2^2 \;\lesssim\; \sqrt{\frac{r\log^3 n}{\theta n}}$$
for some signed permutation matrix $P$.

Theorem 4.3 provides guarantees for using any monotonically decreasing second-order algorithm [33, 5] to solve (2.8) with the initialization chosen in (4.1).
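Putting the pieces together, here is an end-to-end sketch of the pipeline of Sections 2.3 and 4 (our illustrative code, not the paper's reference implementation): precondition $Y$ as in (2.6)–(2.7), initialize as in (4.1), and run a monotone scheme on the $\ell_4$ objective (2.8). For brevity it reuses the first-order power-type update from the sketch in Section 2.2, whereas the guarantees of Theorem 4.3 are stated for monotonically decreasing second-order algorithms; the deflation step for the remaining columns (Section D) is only indicated in a comment.

```python
import numpy as np

def recover_one_column(Y, r, iters=300):
    """Recover one column of Abar = U_A V_A^T (up to sign): precondition,
    initialize with (4.1), then descend the l4 objective (2.8) on the sphere."""
    # preconditioning (2.6)-(2.7): Ybar = [(Y Y^T)^+]^(1/2) Y
    evals, evecs = np.linalg.eigh(Y @ Y.T)
    U, s = evecs[:, -r:], evals[-r:]
    Ybar = (U / np.sqrt(s)) @ (U.T @ Y)
    # data-driven initialization (4.1): q0 = Ybar 1_n / ||Ybar 1_n||_2
    q = Ybar.sum(axis=1)
    q /= np.linalg.norm(q)
    # monotone power-type updates: each step increases ||Ybar^T q||_4^4,
    # hence decreases the objective (2.8)
    for _ in range(iters):
        g = Ybar @ (Ybar.T @ q) ** 3
        q = g / np.linalg.norm(g)
    return q, Ybar

# usage with (A, X, Y) from the generator above (when A^T A = I_r, Abar = A):
# q_hat, Ybar = recover_one_column(Y, r=10)
# print(1 - np.max(np.abs(A.T @ q_hat)))   # criterion (5.1); small on success
# remaining columns: project Ybar onto the orthogonal complement of the
# recovered direction(s) and repeat, i.e. the deflation procedure of Section D.
```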
5 Experiments

In this section, we verify the empirical performance of our proposed procedure for recovering $A$ under model (1.1) in different scenarios. Due to the space limit, we defer additional experiments to the appendix.

Experiment setup. To generate the data $Y = AX$, we generate the columns of $A$ as the normalized left singular vectors of $R\in\mathbb{R}^{p\times r}$ with $R_{ij}\overset{\text{i.i.d.}}{\sim}\mathcal{N}(0,1)$. The sparse coefficient matrix $X\in\mathbb{R}^{r\times n}$ is generated as $X_{ij}\overset{\text{i.i.d.}}{\sim}\operatorname{BG}(\theta)$. To evaluate the success of recovering one column vector of $A$ by an estimate $q\in\mathbb{S}^{p-1}$, we use the criterion
$$\operatorname{Err}(q) = \min_{1\le i\le r}\big(1 - |\langle q, a_i\rangle|\big). \tag{5.1}$$
If $\operatorname{Err}(q)\le\rho_e$, we say that the vector $q$ recovers a ground-truth column vector of $A$; we set $\rho_e = 10^{-2}$ in our simulation settings. To evaluate the recovery of the whole matrix $A$, we use the normalized Frobenius distance between an estimate $A_{\rm est}$ and the true $A$:
$$\min_P \ \frac{1}{\sqrt r}\,\|A_{\rm est} - AP\|_F \quad \text{s.t. } P \text{ is a signed permutation matrix.} \tag{5.2}$$
We first evaluate the probability of successfully recovering one column of $A$ in two scenarios: in the first, we vary $\theta$ and $r$ simultaneously, while in the second we vary $n$ and $r$. We then evaluate the performance of our procedure, Algorithm 1 in Section D, for recovering the full matrix $A$.

Recovery probability with varying θ and r. We fix $p = 100$ and $n = 5\times 10^3$ while varying $\theta\in\{0.01, 0.04, \ldots, 0.58\}$ and $r\in\{10, 30, \ldots, 70\}$. For each pair $(\theta, r)$, we repeatedly generate 200 data sets and apply our procedure in (2.8). The averaged recovery probability over the 200 replicates is shown in Figure 1a. The recovery probability gets larger as $r$ decreases, in line with Theorem 3.6. We also note that the recovery probability increases for smaller $\theta$. This is because a smaller $\theta$ renders a nicer geometric landscape of the proposed non-convex problem, as detailed in Remark 2.6. On the other hand, the recovery probability decreases when $\theta$ approaches 0: as suggested by Theorem 3.4, the statistical error of estimating $A$ gets inflated when $\theta$ is too small.

Recovery probability with varying n and r. Here we fix $p = 100$ and the sparsity parameter $\theta = 0.1$. We vary $r\in\{10, 30, \ldots, 70\}$ and $n\in\{2000, 3000, \ldots, 12000\}$. Figure 1b shows the averaged recovery probability of our procedure over 200 replicates in each setting. Our procedure performs increasingly better as $n$ increases, as expected from Theorem 3.4.

Figure 1: (a) Recovery probability versus θ: the averaged probability of successful recovery for different θ and r, with $p = 100$ and $n = 1.2\times 10^4$. (b) Recovery probability versus n: the averaged probability of successful recovery for different n and r, with $p = 100$ and $\theta = 0.1$. [Both panels show curves for $r = 10, 30, 50, 70$.]

6 Conclusion and Future Work

In this paper, we have studied the unique decomposition of a low-rank matrix $Y$ that admits a sparse low-dimensional representation. Under the model $Y = AX$, where $X$ has i.i.d. Bernoulli–Gaussian entries and $A$ has full column rank, we propose a nonconvex procedure that provably recovers $A$, a quantity that can be further used to recover $X$. We provide a complete analysis for recovering one column of $A$, up to its sign, by showing that any second-order descent algorithm provably attains the global solution from a simple, data-driven initialization, despite the nonconvex nature of the proposed procedure. Several directions are certainly worth pursuing further. For instance, a complete analysis of the deflation procedure for recovering the full matrix $A$ is of great interest. It is also worth studying this decomposition problem in the presence of additive errors, that is, $Y = AX + E$. Our current procedure only tolerates $E$ with small entries; how to modify our procedure to accommodate a moderate or large $E$ is an interesting and challenging problem that we leave to future research.

References

[1] Alekh Agarwal, Animashree Anandkumar, Prateek Jain, and Praneeth Netrapalli. Learning sparsely used overcomplete dictionaries via alternating minimization. SIAM Journal on Optimization, 26(4):2775–2799, 2016.
[2] Alekh Agarwal, Animashree Anandkumar, and Praneeth Netrapalli. Exact recovery of sparsely used overcomplete dictionaries. stat, 1050:8–39, 2013.
[3] Sanjeev Arora, Rong Ge, Yonatan Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. A practical algorithm for topic modeling with provable guarantees. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 280–288. PMLR, 2013.
[4] Tom Baden, Philipp Berens, Katrin Franke, Miroslav Román Rosón, Matthias Bethge, and Thomas Euler. The functional diversity of retinal ganglion cells in the mouse. Nature, 529(7586):345–350, 2016.
[5] Roberto Battiti. First- and second-order methods for learning: Between steepest descent and Newton's method. Neural Computation, 4(2):141–166, 1992.
[6] Dimitri P. Bertsekas and John N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. 2003.
[7] Xin Bing, Florentina Bunea, Yang Ning, and Marten Wegkamp. Adaptive estimation in structured factor models with applications to overlapping clustering. Annals of Statistics, 48(4):2055–2081, 2020.
[8] Xin Bing, Florentina Bunea, and Marten Wegkamp. A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics. Bernoulli, 26(3):1765–1796, 2020.
[9] Xin Bing, Florentina Bunea, and Marten Wegkamp. Optimal estimation of sparse topic models. Journal of Machine Learning Research, 21(177):1–45, 2020.
[10] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[11] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):1–37, 2011.
[12] Guanhua Chen, Patrick F. Sullivan, and Michael R. Kosorok. Biclustering with heterogeneous variance. Proceedings of the National Academy of Sciences, 110(30):12253–12258, 2013.
[13] Sky C. Cheung, John Y. Shin, Yenson Lau, Zhengyu Chen, Ju Sun, Yuqian Zhang, Marvin A. Müller, Ilya M. Eremin, John N. Wright, and Abhay N. Pasupathy. Dictionary learning in Fourier-transform scanning tunneling spectroscopy. Nature Communications, 11(1):1–11, 2020.
[14] David Donoho and Victoria Stodden. When does non-negative matrix factorization give a correct decomposition into parts? Advances in Neural Information Processing Systems, 2004.
[15] Jicong Fan, Yuqian Zhang, and Madeleine Udell. Polynomial matrix completion for missing data imputation and transductive learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3842–3849, 2020.
[16] Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing. Birkhäuser Basel, 2013.
[17] Karl Pearson F.R.S. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.
[18] Rong Ge and Tengyu Ma. On the optimization landscape of tensor decompositions. Mathematical Programming, pages 1–47, 2020.
[19] Quan Geng and John Wright. On the local correctness of ℓ1-minimization for dictionary learning. In 2014 IEEE International Symposium on Information Theory, pages 3180–3184. IEEE, 2014.
[20] Donald Goldfarb. Curvilinear path steplength algorithms for minimization which use directions of negative curvature. Mathematical Programming, 18(1):31–40, 1980.
[21] Kelly Gravuer, Jon J. Sullivan, Peter A. Williams, and Richard P. Duncan. Strong human association with plant invasion success for Trifolium introductions to New Zealand. Proceedings of the National Academy of Sciences, 105(17):6344–6349, 2008.
[22] Benjamin Haeffele, Eric Young, and Rene Vidal. Structured low-rank matrix factorization: Optimality, algorithm, and applications to image processing. In International Conference on Machine Learning, pages 2007–2015. PMLR, 2014.
[23] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.
[24] Aapo Hyvärinen. A family of fixed-point algorithms for independent component analysis. In 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 5, pages 3917–3920. IEEE, 1997.
[25] Aapo Hyvärinen and Erkki Oja. A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7):1483–1492, 1997.
[26] Aapo Hyvärinen and Erkki Oja. Independent component analysis: Algorithms and applications. Neural Networks, 13(4–5):411–430, 2000.
[27] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, pages 1724–1732. PMLR, 2017.
[28] Ian T. Jolliffe. Rotation of principal components: Some comments. Journal of Climatology, 7(5):507–510, 1987.
[29] Han-Wen Kuo, Yenson Lau, Yuqian Zhang, and John Wright. Geometry and symmetry in short-and-sparse deconvolution. In International Conference on Machine Learning, pages 3570–3580. PMLR, 2019.
[30] Yanjun Li and Yoram Bresler. Global geometry of multichannel sparse blind deconvolution on the sphere. CoRR, abs/1805.10437, 2018.
[31] Zhouchen Lin. A review on low-rank models in data analysis. Big Data & Information Analytics, 1(2&3):139, 2016.
[32] Cun Mu, Yuqian Zhang, John Wright, and Donald Goldfarb. Scalable robust matrix recovery: Frank–Wolfe meets proximal methods. SIAM Journal on Scientific Computing, 38(5):A3291–A3317, 2016.
[33] Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
[34] Qing Qu, Ju Sun, and John Wright. Finding a sparse vector in a subspace: Linear sparsity using alternating directions. In Advances in Neural Information Processing Systems, pages 3401–3409, 2014.
[35] Qing Qu, Yuexiang Zhai, Xiao Li, Yuqian Zhang, and Zhihui Zhu. Analysis of the optimization landscapes for overcomplete representation learning. arXiv preprint arXiv:1912.02427, 2019.
[36] Haipeng Shen and Jianhua Z. Huang. Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99(6):1015–1034, 2008.
[37] Laixi Shi and Yuejie Chi. Manifold gradient descent solves multi-channel sparse blind deconvolution provably and efficiently. In ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5730–5734. IEEE, 2020.
[38] Daniel A. Spielman, Huan Wang, and John Wright. Exact recovery of sparsely-used dictionaries. In Conference on Learning Theory, 2012.
[39] Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Transactions on Information Theory, 63(2):853–884, 2016.
[40] Paul Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.
[41] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices, pages 210–268. Cambridge University Press, 2012.
[42] Yuexiang Zhai, Hermish Mehta, Zhengyuan Zhou, and Yi Ma. Understanding ℓ4-based dictionary learning: Interpretation, stability, and robustness. In International Conference on Learning Representations, 2020.
[43] Yuexiang Zhai, Zitong Yang, Zhenyu Liao, John Wright, and Yi Ma. Complete dictionary learning via ℓ4-norm maximization over the orthogonal group. Journal of Machine Learning Research, 21:165:1–165:68, 2020.
[44] Yuqian Zhang, Han-Wen Kuo, and John Wright. Structured local optima in sparse blind deconvolution. IEEE Transactions on Information Theory, 66(1):419–452, 2020.
[45] Yuqian Zhang, Yenson Lau, Han-Wen Kuo, Sky Cheung, Abhay Pasupathy, and John Wright. On the global geometry of sphere-constrained sparse blind deconvolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[46] Yuqian Zhang, Cun Mu, Han-Wen Kuo, and John Wright. Toward guaranteed illumination models for non-convex objects. In Proceedings of the IEEE International Conference on Computer Vision, pages 937–944, 2013.
[47] Yuqian Zhang, Qing Qu, and John Wright. From symmetry to geometry: Tractable nonconvex problems. arXiv preprint arXiv:2007.06753, 2020.
[48] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
[49] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] Please see our abstract and Section 1.
(b) Did you describe the limitations of your work? [Yes]
(c) Did you discuss any potential negative societal impacts of your work? [No]
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes] Please see Sections 2 and 3.
(b) Did you include complete proofs of all theoretical results? [Yes] Please see our appendix.
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] Please see Sections 5 and E.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] Please see Sections 5 and E.
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes]
(b) Did you mention the license of the assets? [N/A]
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]