Complete Dictionary Recovery Using Nonconvex Optimization

Ju Sun JS4038@COLUMBIA.EDU
Qing Qu QQ2105@COLUMBIA.EDU
John Wright JW2966@COLUMBIA.EDU
Department of Electrical Engineering, Columbia University, New York, NY, USA

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

Abstract

We consider the problem of recovering a complete (i.e., square and invertible) dictionary $A_0$ from $Y = A_0 X_0$ with $Y \in \mathbb{R}^{n \times p}$. This recovery setting is central to the theoretical understanding of dictionary learning. We give the first efficient algorithm that provably recovers $A_0$ when $X_0$ has $O(n)$ nonzeros per column, under a suitable probability model for $X_0$. Prior results provide recovery guarantees when $X_0$ has only $O(\sqrt{n})$ nonzeros per column. Our algorithm is based on nonconvex optimization with a spherical constraint, and hence is naturally phrased in the language of manifold optimization. Our proofs give a geometric characterization of the high-dimensional objective landscape, which shows that with high probability there are no spurious local minima. Experiments with synthetic data corroborate our theory. The full version of this paper is available online: http://arxiv.org/abs/1504.06785.

1. Introduction

Dictionary learning (DL) is the problem of finding a sparse representation for a collection of input signals. Its applications span classical image processing, visual recognition, compressive signal acquisition, and also recent deep architectures for signal classification. Recent surveys on applications and algorithms of DL include Elad (2010) and Mairal et al. (2014).

Formally, given a data matrix $Y \in \mathbb{R}^{n \times p}$, DL seeks an approximation $Y \approx AX$, where $A$ lies in a certain admissible set $\mathcal{A}$, and $X$ is as sparse as possible. Typical formulations for DL are nonconvex: the admissible set $\mathcal{A}$ is typically nonconvex, and the observation map $(A, X) \mapsto AX$ is bilinear. There is also an intrinsic symmetry in the problem due to the sign-permutation ambiguity, which seems to preclude convexification (Gribonval & Schnass, 2010; Geng & Wright, 2011). Thus, despite many empirical successes, relatively little is known about the theoretical properties of DL algorithms.

Towards theoretical understanding, it is natural to start with the dictionary recovery (DR) problem: suppose that the data matrix $Y$ is generated as $Y = A_0 X_0$, where $A_0 \in \mathbb{R}^{n \times m}$ and $X_0 \in \mathbb{R}^{m \times p}$, and try to recover $A_0$ and $X_0$. One might imagine putting favorable structural assumptions on $A_0$ and $X_0$ to make DR well-posed and amenable to efficient algorithms. However, under natural assumptions on $A_0$ and $X_0$, even proving that the target solution is a local minimum of certain popular practical DL formulations requires nontrivial analysis (Gribonval & Schnass, 2010; Geng & Wright, 2011; Schnass, 2014a;b; 2015). Obtaining global solutions with efficient algorithms is a standing challenge.

Existing results on global dictionary recovery pertain only to highly sparse $X_0$. For example, Spielman et al. (2012) showed that solving a sequence of certain linear programs can recover a complete dictionary $A_0$ from $Y$ when $X_0$ is a sparse random matrix with $O(\sqrt{n})$ nonzeros per column. Agarwal et al. (2013a; 2013b) and Arora et al. (2013; 2015) have subsequently given efficient algorithms for the overcomplete setting ($m \ge n$), based on a combination of clustering or spectral initialization and local refinement.
These algorithms again succeed when $X_0$ has $\widetilde{O}(\sqrt{n})$ nonzeros per column (the $\widetilde{O}$ suppresses logarithmic factors). Barak et al. (2014) provide efficient algorithms based on the sum-of-squares hierarchy that guarantee recovery of complete dictionaries when $X_0$ has $O(n^c)$ nonzeros per column, for any $c < 1$. Giving efficient algorithms that provably succeed in the linear sparsity regime (i.e., $O(n)$ nonzeros per column) is an open problem.

Recent works, including Arora et al. (2014) and Barak et al. (2014), contain guarantees for recovery with linear sparsity, but they run in super-polynomial (quasipolynomial) time. Aside from efficient recovery, other theoretical work on DL includes results on identifiability (Hillar & Sommer, 2011), generalization bounds (Vainsencher et al., 2011; Mehta & Gray, 2013), and noise stability (Gribonval et al., 2014).

Figure 1. Why is dictionary learning over $S^{n-1}$ tractable? Assume the target dictionary $A_0$ is orthogonal. Left: large-sample objective function $\mathbb{E}_{X_0}[f(q)]$. The only local minima are the columns of $A_0$ and their negatives. Center: the same function, visualized as a height above the plane $a_1^{\perp}$ ($a_1$ is the first column of $A_0$). Right: around the optimum, the function exhibits a small region of positive (strongly convex) curvature, a region of large gradient, and finally a region in which the direction away from $a_1$ is a direction of negative curvature.

In this work, we focus on recovering a complete (i.e., invertible) dictionary $A_0$ from $Y = A_0 X_0$. We give the first polynomial-time algorithm that provably recovers $A_0$ when $X_0$ has $O(n)$ nonzeros per column, provided $X_0$ contains i.i.d. Bernoulli-Gaussian entries. We achieve this by formulating complete DR as a nonconvex program with a spherical constraint. Under the probability model for $X_0$, we give a geometric characterization of the high-dimensional objective landscape over the sphere, which shows that with high probability (w.h.p.) there are no spurious local minima. In particular, the geometric structure allows us to design a Riemannian trust-region algorithm over the sphere that provably converges to a local minimizer from an arbitrary initialization, despite the presence of saddle points.

This paper is organized as follows. Sec. 2 motivates our nonconvex formulation of DR and formalizes the setting. Sec. 3 and Sec. 4 introduce the two integral parts of our algorithmic framework: a characterization of the high-dimensional function landscape and a Riemannian trust-region algorithm over the sphere. The main theorems are included in Sec. 5, followed by numerical verification in Sec. 6. Sec. 7 closes the paper by discussing implications of this work. Due to space constraints, we only sketch the high-level ideas of the proofs in this paper. Detailed proofs can be found in the full version (Sun et al., 2015).

2. A Nonconvex Formulation for DR

We assume $Y = A_0 X_0$, where $A_0 \in \mathbb{R}^{n \times n}$ is a complete matrix and $X_0$ follows the Bernoulli-Gaussian (BG) model with rate $\theta$: $[X_0]_{ij} = \Omega_{ij} G_{ij}$, with $\Omega_{ij} \sim \mathrm{Ber}(\theta)$ and $G_{ij} \sim \mathcal{N}(0, 1)$. We write compactly $X_0 \sim_{\mathrm{i.i.d.}} \mathrm{BG}(\theta)$.

Since $Y = A_0 X_0$ and $A_0$ is complete, $\mathrm{row}(Y) = \mathrm{row}(X_0)$ (here $\mathrm{row}(\cdot)$ denotes the row space), and the rows of $X_0$ are sparse vectors in the known subspace $\mathrm{row}(Y)$. Following Spielman et al. (2012), we use this fact to first recover the rows of $X_0$, and subsequently recover $A_0$ by solving a system of linear equations.
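To make the generative model concrete, the following is a minimal sketch (not the authors' code) of sampling $Y = A_0 X_0$ with $X_0 \sim_{\mathrm{i.i.d.}} \mathrm{BG}(\theta)$; the helper name `sample_bg_data` and the random-dictionary default are illustrative assumptions.

```python
# Minimal sketch: sample the Bernoulli-Gaussian model
# Y = A0 X0 with [X0]_ij = Omega_ij * G_ij, Omega_ij ~ Ber(theta), G_ij ~ N(0, 1).
import numpy as np

def sample_bg_data(n, p, theta, A0=None, rng=None):
    """Return (Y, A0, X0) for the model Y = A0 X0 with X0 ~ i.i.d. BG(theta)."""
    rng = np.random.default_rng(rng)
    if A0 is None:
        # Hypothetical default: a random complete (invertible w.h.p.) dictionary.
        A0 = rng.standard_normal((n, n))
    X0 = rng.standard_normal((n, p)) * (rng.random((n, p)) < theta)
    return A0 @ X0, A0, X0

Y, A0, X0 = sample_bg_data(n=50, p=5000, theta=0.2, rng=0)
```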
In fact, for $X_0 \sim_{\mathrm{i.i.d.}} \mathrm{BG}(\theta)$, the rows of $X_0$ are the $n$ sparsest vectors (directions) in $\mathrm{row}(Y)$ w.h.p. (Spielman et al., 2012). Thus one might try to recover them by solving

$$\min_q \; \|q^* Y\|_0 \quad \text{s.t.} \quad q \neq 0, \qquad (2.1)$$

where $(\cdot)^*$ denotes matrix transposition. The objective is discontinuous, and the domain is an open set. Known convex relaxations (Spielman et al., 2012; Demanet & Hand, 2014) provably break down beyond the aforementioned $O(\sqrt{n})$ sparsity level. Instead, we work with a nonconvex alternative (a similar formulation was proposed by Zibulevsky & Pearlmutter (2001) in the context of blind source separation; see also Qu et al. (2014)):

$$\min \; f(q; \widehat{Y}) \doteq \frac{1}{p} \sum_{k=1}^{p} h_\mu\!\left(q^* \widehat{y}_k\right), \quad \text{s.t.} \quad \|q\|_2 = 1, \qquad (2.2)$$

where $\widehat{Y} \in \mathbb{R}^{n \times p}$ is a proxy for $Y$ and $k$ indexes the columns of $\widehat{Y}$. Here $h_\mu(\cdot)$ is chosen to be a convex, smooth approximation to the absolute value function $|\cdot|$, namely

$$h_\mu(z) = \mu \log\cosh(z/\mu), \qquad (2.3)$$

which is infinitely differentiable; $\mu$ controls the smoothing level. The spherical constraint is nonconvex. Hence, a priori, it is unclear whether (2.2) admits efficient algorithms that attain a local optimizer (Murty & Kabadi, 1987). Surprisingly, simple descent algorithms for (2.2) exhibit very striking behavior: on many practical numerical examples (not restricted to the model we assume here for $A_0$ and $X_0$), they appear to produce global solutions. The next section uncovers the interesting geometric structures underlying this phenomenon.

3. High-dimensional Geometry

For the moment, suppose $A_0$ is orthogonal, and take $\widehat{Y} = Y = A_0 X_0$ in (2.2). Figure 1 (left) plots $\mathbb{E}_{X_0}[f(q; Y)]$ over $q \in S^2$ ($n = 3$). Remarkably, $\mathbb{E}_{X_0}[f(q; Y)]$ has no spurious local minima. In fact, every local minimizer $\widehat{q}$ produces a row of $X_0$: $\widehat{q}^* Y = \alpha e_i^* X_0$ for some $\alpha \neq 0$. To better illustrate the point, we take the particular case $A_0 = I$ and project the upper hemisphere above the equatorial plane $e_3^{\perp}$ onto $e_3^{\perp}$. The projection is bijective, and we equivalently obtain a reparameterization $g : e_3^{\perp} \to \mathbb{R}$ of $f$. Figure 1 (center) plots the graph of $g$. Obviously the only local minimizers are $0$, $\pm e_1$, $\pm e_2$, and they are also global minimizers. Moreover, the apparently nonconvex landscape has interesting structure around $0$: when moving away from $0$, one sees successively a strongly convex region, a nonzero-gradient region, and a region in which at each point one can always find a direction of negative curvature, as shown schematically in Figure 1 (right). This geometry implies that at any nonoptimal point there is always at least one direction of descent. Thus, any algorithm that can take advantage of these descent directions will likely converge to a global minimizer, irrespective of initialization.

Two challenges stand out when implementing this idea. For the geometry, one has to show that similar structure exists for general complete $A_0$, in high dimensions ($n \ge 3$), when the number of observations $p$ is finite (vs. the expectation in the experiment above). For the algorithm, we need to be able to take advantage of this structure without knowing $A_0$ ahead of time. In Sec. 4, we describe a Riemannian trust-region method that addresses the latter challenge. Here we focus on the first one.

3.1. Geometry for orthogonal $A_0$

In this case, we take $\widehat{Y} = Y = A_0 X_0$. Since $f(q; A_0 X_0) = f(A_0^* q; X_0)$, the landscape of $f(q; A_0 X_0)$ is simply a rotated version of that of $f(q; X_0)$, i.e., of the case $A_0 = I$. Hence we will focus on the case $A_0 = I$.
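For concreteness, here is a minimal sketch of the smoothed objective (2.2)-(2.3) and its Euclidean gradient, which the Riemannian quantities of Sec. 4 are built from. This is not the authors' code; the helper names are illustrative.

```python
# Minimal sketch of the objective f(q; Yhat) = (1/p) sum_k h_mu(q^T yhat_k)
# with h_mu(z) = mu * log cosh(z / mu), and its Euclidean gradient.
import numpy as np

def h_mu(z, mu):
    # log cosh(t) = logaddexp(t, -t) - log 2, computed stably for large |t|.
    t = z / mu
    return mu * (np.logaddexp(t, -t) - np.log(2.0))

def f_and_grad(q, Yhat, mu):
    """Objective value and Euclidean gradient of (2.2) at q (not necessarily unit-norm)."""
    p = Yhat.shape[1]
    z = Yhat.T @ q                      # inner products q^T yhat_k
    f = h_mu(z, mu).sum() / p
    grad = Yhat @ np.tanh(z / mu) / p   # d/dz [mu log cosh(z/mu)] = tanh(z/mu)
    return f, grad
```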
Among the $2n$ symmetric sections of $S^{n-1}$ centered around the signed basis vectors $\pm e_1, \ldots, \pm e_n$, we work with the section around $e_n$ as an example; the result carries over to all sections by the same argument. We again invoke the projection trick described above, this time onto the equatorial plane $e_n^{\perp}$. This is formally captured by the reparameterization mapping

$$q(w) = \left(w, \sqrt{1 - \|w\|^2}\right), \quad w \in \mathbb{R}^{n-1}, \qquad (3.1)$$

where $w$ is the new variable in $e_n^{\perp}$. We study the composition $g(w; Y) \doteq f(q(w); Y)$ over the set $\Gamma \doteq \{w : \|w\| < \sqrt{(4n-1)/(4n)}\}$. Our next theorem characterizes the properties of $g(w)$. In particular, it shows that the favorable structure we observed for $n = 3$ persists in high dimensions, w.h.p. (in this work, we say an event occurs with high probability when its failure probability is bounded by an inverse polynomial in $n$ and $p$), even when $p$ is large yet finite, for the case where $A_0$ is orthogonal.

Theorem 3.1. Suppose $A_0 = I$ and hence $Y = A_0 X_0 = X_0$. There exist positive constants $c_\star$ and $C$ such that for any $\theta \in (0, 1/2)$ and $\mu < \min\{c_a \theta n^{-1}, c_b n^{-5/4}\}$, whenever $p \ge \frac{C}{\mu^2 \theta^2} n^3 \log \frac{n}{\mu\theta}$, the following hold simultaneously w.h.p.:

$$\nabla^2 g(w; X_0) \succeq \frac{c_\star \theta}{\mu} I \quad \text{for all } \|w\| \le \frac{\mu}{4},$$
$$\frac{w^* \nabla g(w; X_0)}{\|w\|} \ge c_\star \theta \quad \text{for all } \frac{\mu}{4} \le \|w\| \le \frac{1}{20},$$
$$\frac{w^* \nabla^2 g(w; X_0)\, w}{\|w\|^2} \le -c_\star \theta \quad \text{for all } w \in \Gamma \text{ with } \|w\| \ge \frac{1}{20},$$

and the function $g(w; X_0)$ has exactly one local minimizer $w_\star$ over the open set $\Gamma \doteq \{w : \|w\| < \sqrt{(4n-1)/(4n)}\}$, which satisfies $\|w_\star - 0\| = O(\mu)$. In particular, with this choice of $p$, the probability that the claim fails to hold is at most $4np^{-10} + \theta(np)^{-7} + \exp(-0.3\theta np) + c_d \exp(-c_e p \mu^2 \theta^2 / n^2)$. Here $c_a$ to $c_e$ are all positive numerical constants.

In words, when the samples are numerous enough, one sees the strongly convex, nonzero-gradient, and negative-curvature regions successively when moving away from the target solution $0$, and the local (also global) minimizer of $g(w; Y)$ is next to $0$, within a distance of $O(\mu)$.

Note that $q(\Gamma)$ contains all points $q \in S^{n-1}$ such that $q^* e_n = \max_i |q^* e_i|$. We can characterize the graph of the function $f(q; X_0)$ in the vicinity of any other signed basis vector $\pm e_i$ simply by changing the plane $e_n^{\perp}$ to $e_i^{\perp}$. Doing this $2n$ times (and multiplying the failure probability in Theorem 3.1 by $2n$), we obtain a characterization of $f(q)$ over the entirety of $S^{n-1}$.

Corollary 3.2. Suppose $A_0 = I$ and hence $Y = A_0 X_0 = X_0$. There exists a positive constant $C$ such that for any $\theta \in (0, 1/2)$ and $\mu < \min\{c_a \theta n^{-1}, c_b n^{-5/4}\}$, whenever $p \ge \frac{C}{\mu^2 \theta^2} n^3 \log \frac{n}{\mu\theta}$, with probability at least $1 - 8n^2 p^{-10} - \theta(np)^{-7} - \exp(-0.3\theta np) - c_c \exp(-c_d p \mu^2 \theta^2 / n^2)$, the function $f(q; X_0)$ has exactly $2n$ local minimizers over the sphere $S^{n-1}$. In particular, there is a bijective map between these minimizers and the signed basis vectors $\{\pm e_i\}_i$, such that the corresponding local minimizer $q_\star$ and $b \in \{\pm e_i\}_i$ satisfy $\|q_\star - b\| = O(\mu)$. Here $c_a$ to $c_d$ are numerical constants (possibly different from those in the theorem above).

The proof of Theorem 3.1 is conceptually straightforward: one shows that $\mathbb{E}[f(q; X_0)]$ has the claimed properties, and then proves that each of the quantities of interest concentrates uniformly about its expectation. The detailed calculations are nontrivial.

3.2. Geometry for complete $A_0$

For general complete dictionaries $A_0$, we hope that the function $f$ retains the nice geometric structure discussed above. We can ensure this by preconditioning $Y$ so that the output looks as if it were generated from a certain orthogonal matrix, possibly plus a small perturbation. We can then argue that the perturbation does not significantly affect the properties of the graph of the objective function. Write

$$\overline{Y} \doteq \left(\frac{1}{p\theta} Y Y^*\right)^{-1/2} Y. \qquad (3.5)$$

Note that for $X_0 \sim_{\mathrm{i.i.d.}} \mathrm{BG}(\theta)$, $\mathbb{E}[X_0 X_0^*]/(p\theta) = I$.
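As a concrete illustration of the preconditioning step (3.5), the sketch below (not the authors' code) forms $(Y Y^*)^{-1/2}$ via an eigendecomposition, assuming $\theta$ is known; the next paragraph explains why this output behaves like data generated from an orthogonal matrix.

```python
# Minimal sketch of the preconditioning (3.5): Ybar = ((1/(p*theta)) Y Y^T)^{-1/2} Y.
import numpy as np

def precondition(Y, theta):
    n, p = Y.shape
    C = (Y @ Y.T) / (p * theta)           # concentrates around A0 A0^T for X0 ~ BG(theta)
    vals, vecs = np.linalg.eigh(C)        # C is symmetric positive definite w.h.p.
    C_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return C_inv_sqrt @ Y                 # ~ U V^T X0 up to a small perturbation
```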
Since $\mathbb{E}[X_0 X_0^*]/(p\theta) = I$, one expects $\frac{1}{p\theta} Y Y^* = \frac{1}{p\theta} A_0 X_0 X_0^* A_0^*$ to behave roughly like $A_0 A_0^*$, and hence $\overline{Y}$ to behave like

$$(A_0 A_0^*)^{-1/2} A_0 X_0 = U V^* X_0, \qquad (3.6)$$

where we write the SVD of $A_0$ as $A_0 = U \Sigma V^*$. It is easy to see that $U V^*$ is an orthogonal matrix. Hence the preconditioning scheme we have introduced is technically sound. Our analysis shows that $\overline{Y}$ can be written as

$$\overline{Y} = U V^* X_0 + \Xi X_0, \qquad (3.7)$$

where $\Xi$ is a matrix with small magnitude. Combining a perturbation analysis with the above results for the orthogonal case yields:

Theorem 3.3. Suppose $A_0$ is complete with condition number $\kappa(A_0)$. There exist positive constants $c_\star$ and $C$ such that for any $\theta \in (0, 1/2)$ and $\mu < \min\{c_a \theta n^{-1}, c_b n^{-5/4}\}$, when

$$p \ge \frac{C}{c_\star^2 \theta} \max\left\{\frac{n^4}{\mu^4}, \frac{n^5}{\mu^2}\right\} \kappa^8(A_0) \log^4 \frac{\kappa(A_0)\, n}{\mu\theta},$$

$\overline{Y} \doteq \sqrt{p\theta}\, (Y Y^*)^{-1/2} Y$, $U \Sigma V^* = \mathrm{SVD}(A_0)$, and $\widetilde{Y} \doteq V U^* \overline{Y}$, the following hold simultaneously w.h.p.:

$$\nabla^2 g(w; \widetilde{Y}) \succeq \frac{c_\star \theta}{2\mu} I \quad \text{for all } \|w\| \le \frac{\mu}{4},$$
$$\frac{w^* \nabla g(w; \widetilde{Y})}{\|w\|} \ge \frac{c_\star \theta}{2} \quad \text{for all } \frac{\mu}{4} \le \|w\| \le \frac{1}{20},$$
$$\frac{w^* \nabla^2 g(w; \widetilde{Y})\, w}{\|w\|^2} \le -\frac{c_\star \theta}{2} \quad \text{for all } w \in \Gamma \text{ with } \|w\| \ge \frac{1}{20},$$

and the function $g(w; \widetilde{Y})$ has exactly one local minimizer $w_\star$ over the open set $\Gamma \doteq \{w : \|w\| < \sqrt{(4n-1)/(4n)}\}$, which satisfies $\|w_\star - 0\| = O(\mu)$. In particular, with this choice of $p$, the probability that the claim fails to hold is at most $4np^{-10} + \theta(np)^{-7} + \exp(-0.3\theta np) + p^{-8} + c_d \exp(-c_e p \mu^2 \theta^2 / n^2)$. Here $c_a$ to $c_e$ are all positive numerical constants.

Corollary 3.4. Suppose $A_0$ is complete with condition number $\kappa(A_0)$. There exist positive constants $c_\star$ and $C$ such that for any $\theta \in (0, 1/2)$ and $\mu < \min\{c_a \theta n^{-1}, c_b n^{-5/4}\}$, when $p \ge \frac{C}{c_\star^2 \theta} \max\left\{\frac{n^4}{\mu^4}, \frac{n^5}{\mu^2}\right\} \kappa^8(A_0) \log^4 \frac{\kappa(A_0)\, n}{\mu\theta}$ and $\overline{Y} \doteq \sqrt{p\theta}\, (Y Y^*)^{-1/2} Y$, $U \Sigma V^* = \mathrm{SVD}(A_0)$, with probability at least $1 - 8n^2 p^{-10} - \theta(np)^{-7} - \exp(-0.3\theta np) - p^{-8} - c_d \exp(-c_e p \mu^2 \theta^2 / n^2)$, the function $f(q; V U^* \overline{Y})$ has exactly $2n$ local minimizers over the sphere $S^{n-1}$. In particular, there is a bijective map between these minimizers and the signed basis vectors $\{\pm e_i\}_i$, such that the corresponding local minimizer $q_\star$ and $b \in \{\pm e_i\}_i$ satisfy

$$\|q_\star - b\| \le \frac{2\mu}{7}. \qquad (3.9)$$

Here $c_a$ to $c_d$ are numerical constants (possibly different from those in the theorem above).

4. Riemannian Trust Region Algorithm

We do not know $A_0$ ahead of time, so our algorithm needs to take advantage of the structure described above without knowledge of $A_0$. Intuitively, this seems possible, as a descent direction in the $w$ space appears to also be a local descent direction for $f$ over the sphere. Another issue is that although the optimization problem has no spurious local minima, it does have many saddle points (Figure 1). Therefore, some form of second-order information is needed to escape the saddle points. Based on these considerations, we describe a Riemannian trust-region method (TRM) (Absil et al., 2007; 2009) over the sphere for this purpose.

4.1. Trust region method for Euclidean spaces

For a function $f : \mathbb{R}^n \to \mathbb{R}$ and an unconstrained optimization problem

$$\min_{x \in \mathbb{R}^n} f(x), \qquad (4.1)$$

a typical (second-order) TRM proceeds by successively forming second-order approximations to $f$ at the current iterate,

$$\widehat{f}(\delta; x_{k-1}) \doteq f(x_{k-1}) + \nabla f(x_{k-1})^* \delta + \tfrac{1}{2} \delta^* Q(x_{k-1})\, \delta, \qquad (4.2)$$

where $Q(x_{k-1})$ is a proxy for the Hessian matrix $\nabla^2 f(x_{k-1})$, which encodes the second-order geometry. The next iterate is determined by seeking a minimizer of $\widehat{f}(\delta; x_{k-1})$ over a small region, normally an $\ell^2$ ball, commonly known as the trust region. Thus, the well-known trust-region subproblem takes the form

$$\delta_k \doteq \arg\min_{\delta \in \mathbb{R}^n,\; \|\delta\| \le \Delta} \widehat{f}(\delta; x_{k-1}), \qquad (4.3)$$

where $\Delta$ is called the trust-region radius, which controls how far the step can move.
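For concreteness, the sketch below solves the Euclidean subproblem (4.3) by bisection on the Lagrange multiplier; the so-called "hard case" is ignored for brevity. This is not the authors' solver (they use the SDP relaxation (4.11) described later), and the function name is illustrative.

```python
# Minimal sketch of the trust-region subproblem (4.3):
#   minimize  g^T d + 0.5 d^T Q d  subject to  ||d|| <= Delta.
import numpy as np

def trust_region_subproblem(g, Q, Delta, tol=1e-10, iters=200):
    lam_min = np.linalg.eigvalsh(Q)[0]
    lo = max(0.0, -lam_min) + 1e-12           # (Q + lam I) must be positive definite
    if lam_min > 0:                           # try the unconstrained minimizer first
        d = np.linalg.solve(Q, -g)
        if np.linalg.norm(d) <= Delta:
            return d
    step = lambda lam: np.linalg.solve(Q + lam * np.eye(len(g)), -g)
    hi = lo + 1.0
    while np.linalg.norm(step(hi)) > Delta:   # grow lam until the step fits in the ball
        hi *= 2.0
    for _ in range(iters):                    # bisect so that ||d(lam)|| ~= Delta
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(step(mid)) > Delta:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return step(hi)
```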
To decide whether to accept the step, a ratio

$$\rho_k \doteq \frac{f(x_{k-1}) - f(x_{k-1} + \delta_k)}{\widehat{f}(0; x_{k-1}) - \widehat{f}(\delta_k; x_{k-1})} \qquad (4.4)$$

is defined to measure the progress, and the radius $\Delta$ is typically updated dynamically according to $\rho_k$ to adapt to the local function behavior. If the progress is satisfactory, the next iterate is (perhaps after some line-search improvement)

$$x_k = x_{k-1} + \delta_k. \qquad (4.5)$$

A detailed introduction to the classical TRM can be found in the texts of Conn et al. (2000) and Nocedal & Wright (2006).

4.2. Trust region method over the sphere

To generalize the idea to smooth manifolds, one natural choice is to form the approximation over the tangent spaces (Absil et al., 2007; 2009). Specific to our spherical manifold, for which the tangent space at an iterate $q_k \in S^{n-1}$ is $T_{q_k} S^{n-1} \doteq \{v : v^* q_k = 0\}$, we work with the quadratic approximation $\widehat{f} : T_{q_k} S^{n-1} \to \mathbb{R}$ defined as

$$\widehat{f}(q_k, \delta) \doteq f(q_k) + \langle \nabla f(q_k), \delta \rangle + \tfrac{1}{2} \delta^* \left[\nabla^2 f(q_k) - \langle \nabla f(q_k), q_k \rangle I\right] \delta. \qquad (4.6)$$

To interpret this approximation, let $P_{T_{q_k} S^{n-1}} \doteq I - q_k q_k^*$ be the orthoprojector onto $T_{q_k} S^{n-1}$, and rewrite (4.6) in the equivalent form

$$\widehat{f}(q_k, \delta) = f(q_k) + \left\langle P_{T_{q_k} S^{n-1}} \nabla f(q_k), \delta \right\rangle + \tfrac{1}{2} \delta^* P_{T_{q_k} S^{n-1}} \left[\nabla^2 f(q_k) - \langle \nabla f(q_k), q_k \rangle I\right] P_{T_{q_k} S^{n-1}} \delta.$$

The two terms

$$\mathrm{grad}\, f(q_k) \doteq P_{T_{q_k} S^{n-1}} \nabla f(q_k), \qquad \mathrm{Hess}\, f(q_k) \doteq P_{T_{q_k} S^{n-1}} \left[\nabla^2 f(q_k) - \langle \nabla f(q_k), q_k \rangle I\right] P_{T_{q_k} S^{n-1}}$$

are the Riemannian gradient and Riemannian Hessian of $f$ w.r.t. $S^{n-1}$, respectively (Absil et al., 2007; 2009), turning (4.6) into the familiar form of the quadratic approximation (4.2). The Riemannian trust-region subproblem is then

$$\min_{\delta \in T_{q_k} S^{n-1},\; \|\delta\| \le \Delta} \widehat{f}(q_k, \delta), \qquad (4.7)$$

where $\Delta > 0$ is the familiar trust-region parameter. Taking any orthonormal basis $U_{q_k}$ for $T_{q_k} S^{n-1}$, we can transform (4.7) into a classical trust-region subproblem,

$$\min_{\|\xi\| \le \Delta} \widehat{f}(q_k, U_{q_k} \xi), \qquad (4.8)$$

for which very efficient numerical algorithms exist (Moré & Sorensen, 1983; Hazan & Koren, 2014). Once we obtain the minimizer $\xi_\star$, we set $\delta_\star = U_{q_k} \xi_\star$, which solves (4.7).

One additional issue compared to the Euclidean setting is that $\delta_\star$ is a vector in the tangent space, so an additive update would lead to a point outside the manifold. To resolve this, we resort to the natural exponential map

$$q_{k+1} \doteq \exp_{q_k} \delta_\star = q_k \cos\|\delta_\star\| + \frac{\delta_\star}{\|\delta_\star\|} \sin\|\delta_\star\|, \qquad (4.9)$$

which moves the sequence to the next iterate along the direction of $\delta_\star$ while staying on the sphere (technically, along a curve on the manifold whose initial tangent vector is $\delta_\star$, in a certain canonical way).

There are many variants of (Riemannian) TRM that allow one to solve the subproblem (4.8) only approximately while still guaranteeing convergence. For simplicity of analysis, we avoid the extra burden this would cause by solving the subproblem exactly via an SDP relaxation: introduce

$$\widetilde{\xi} = [\xi^*, 1]^*, \qquad \Xi = \widetilde{\xi}\, \widetilde{\xi}^*, \qquad M = \begin{bmatrix} A & b \\ b^* & 0 \end{bmatrix}, \qquad (4.10)$$

where $A = U^* \left[\nabla^2 f(q) - \langle \nabla f(q), q \rangle I\right] U$ and $b = U^* \nabla f(q)$ for any orthobasis $U$ of $T_{q_{k-1}} S^{n-1}$. The subproblem is known to be equivalent to the SDP problem (Fortin & Wolkowicz, 2004)

$$\min_{\Xi} \; \langle M, \Xi \rangle, \quad \text{s.t.} \;\; \mathrm{tr}(\Xi) \le \Delta^2 + 1, \;\; \langle E_{n+1}, \Xi \rangle = 1, \;\; \Xi \succeq 0, \qquad (4.11)$$

where $E_{n+1} = e_{n+1} e_{n+1}^*$. The detailed trust-region algorithm is presented in Algorithm 1.

Algorithm 1 Trust Region Method for Finding a Single Sparse Vector
Input: data matrix $\widehat{Y} \in \mathbb{R}^{n \times p}$, smoothing parameter $\mu$, and parameters $\eta_{vs} = 0.9$, $\eta_s = 0.1$, $\gamma_i = 2$, $\gamma_d = 1/2$, $\Delta_{\max} = 1$, and $\Delta_{\min} = 10^{-16}$.
Output: $\widehat{q} \in S^{n-1}$
1: Initialize $q^{(0)} \in S^{n-1}$, $\Delta^{(0)}$, and $k = 1$.
2: while not converged do
3: Set $U \in \mathbb{R}^{n \times (n-1)}$ to be an orthobasis for $T_{q^{(k-1)}} S^{n-1}$.
4: Solve the trust-region subproblem
$$\widehat{\xi} = \arg\min_{\|\xi\| \le \Delta^{(k-1)}} \widehat{f}\left(q^{(k-1)}, U\xi\right) \qquad (4.12)$$
and set $\widehat{\delta} = U \widehat{\xi}$.
5: Set
$$\widehat{q} \leftarrow q^{(k-1)} \cos\|\widehat{\delta}\| + \frac{\widehat{\delta}}{\|\widehat{\delta}\|} \sin\|\widehat{\delta}\|.$$
6: Compute
$$\rho_k \leftarrow \frac{f(q^{(k-1)}) - f(\widehat{q})}{f(q^{(k-1)}) - \widehat{f}(q^{(k-1)}, \widehat{\delta})}. \qquad (4.13)$$
7: if $\rho_k \ge \eta_{vs}$ then
8: Set $q^{(k)} \leftarrow \widehat{q}$, $\Delta^{(k)} \leftarrow \min(\gamma_i \Delta^{(k-1)}, \Delta_{\max})$.
9: else if $\rho_k \ge \eta_s$ then
10: Set $q^{(k)} \leftarrow \widehat{q}$, $\Delta^{(k)} \leftarrow \Delta^{(k-1)}$.
11: else
12: Set $q^{(k)} \leftarrow q^{(k-1)}$, $\Delta^{(k)} \leftarrow \max(\gamma_d \Delta^{(k-1)}, \Delta_{\min})$.
13: end if
14: Set $k = k + 1$.
15: end while

4.3. Algorithmic results

Using the geometric characterizations in Theorem 3.1 and Theorem 3.3, we prove that when the parameter $\Delta$ is sufficiently small (for simplicity of analysis, we assume $\Delta$ is fixed throughout; in practice, dynamic updates to $\Delta$ tend to lead to faster convergence), (1) the trust-region step induces at least a fixed amount of decrease in the objective value in the negative-curvature and nonzero-gradient regions; and (2) the trust-region iterate sequence eventually moves into and stays in the strongly convex region, converging to a global minimizer at an asymptotically quadratic rate. In particular, the geometry implies that from any initialization, the iterate sequence converges to a close approximation of one local minimizer in a polynomial number of steps. The following two theorems collect these results, for orthogonal and general complete $A_0$, respectively.

Theorem 4.1 (Orthogonal dictionary). Suppose the dictionary $A_0$ is orthogonal. Then there exists a positive constant $C$ such that for all $\theta \in (0, 1/2)$ and $\mu < \min\{c_a \theta n^{-1}, c_b n^{-5/4}\}$, whenever $\exp(n) \ge p \ge C n^3 \log\frac{n}{\mu\theta} / (\mu^2 \theta^2)$, with probability at least $1 - 8n^2 p^{-10} - \theta(np)^{-7} - \exp(-0.3\theta np) - p^{-10} - c_c \exp(-c_d p \mu^2 \theta^2 / n^2)$, the Riemannian trust-region algorithm with input data matrix $\widehat{Y} = Y$, any initialization $q^{(0)}$ on the sphere, and a step size satisfying

$$\Delta \le \min\left\{\frac{c_e c_\star \theta \mu^2}{n^{5/2} \log^{3/2}(np)}, \; \frac{c_f c_\star^3 \theta^3 \mu}{n^{7/2} \log^{7/2}(np)}\right\}$$

returns a solution $\widehat{q} \in S^{n-1}$ that is $\varepsilon$-near to one of the local minimizers $q_\star$ (i.e., $\|\widehat{q} - q_\star\| \le \varepsilon$) in at most

$$\max\left\{\frac{c_g n^6 \log^3(np)}{c_\star^3 \theta^3 \mu^4}, \; \frac{c_h n}{c_\star^2 \theta^2 \Delta^2}\right\} \left(f(q^{(0)}) - f(q_\star)\right) + \log\log\frac{c_i c_\star \theta \mu}{\varepsilon\, n^{3/2} \log^{3/2}(np)}$$

iterations. Here $c_\star$ is as defined in Theorem 3.1, $c_a$, $c_b$ are the same numerical constants as in Theorem 3.1, and $c_c$ to $c_i$ are other positive numerical constants.

The proof for the complete case basically follows from the fact that the slight perturbation of the structure parameters, as summarized in Theorem 3.3 (vs. Theorem 3.1), changes all algorithmic parameters by at most small multiplicative constants.

Theorem 4.2 (Complete dictionary). Suppose the dictionary $A_0$ is complete with condition number $\kappa(A_0)$. There exists a positive constant $C$ such that for all $\theta \in (0, 1/2)$ and $\mu < \min\{c_a \theta n^{-1}, c_b n^{-5/4}\}$, whenever $\exp(n) \ge p \ge \frac{C}{c_\star^2 \theta} \max\left\{\frac{n^4}{\mu^4}, \frac{n^5}{\mu^2}\right\} \kappa^8(A_0) \log^4 \frac{\kappa(A_0)\, n}{\mu\theta}$, with probability at least $1 - 8n^2 p^{-10} - \theta(np)^{-7} - \exp(-0.3\theta np) - 2p^{-8} - c_c \exp(-c_d p \mu^2 \theta^2 / n^2)$, the Riemannian trust-region algorithm with input data matrix $\widehat{Y} = \overline{Y} \doteq \sqrt{p\theta}\, (Y Y^*)^{-1/2} Y$, where $U \Sigma V^* = \mathrm{SVD}(A_0)$, any initialization $q^{(0)}$ on the sphere, and a step size satisfying

$$\Delta \le \min\left\{\frac{c_e c_\star \theta \mu^2}{n^{5/2} \log^{3/2}(np)}, \; \frac{c_f c_\star^3 \theta^3 \mu}{n^{7/2} \log^{7/2}(np)}\right\}$$

returns a solution $\widehat{q} \in S^{n-1}$ that is $\varepsilon$-near to one of the local minimizers $q_\star$ (i.e., $\|\widehat{q} - q_\star\| \le \varepsilon$) in at most

$$\max\left\{\frac{c_g n^6 \log^3(np)}{c_\star^3 \theta^3 \mu^4}, \; \frac{c_h n}{c_\star^2 \theta^2 \Delta^2}\right\} \left(f(q^{(0)}) - f(q_\star)\right) + \log\log\frac{c_i c_\star \theta \mu}{\varepsilon\, n^{3/2} \log^{3/2}(np)}$$

iterations. Here $c_\star$ is as defined in Theorem 3.1, $c_a$, $c_b$ are the same numerical constants as in Theorem 3.1, and $c_c$ to $c_i$ are other positive numerical constants.
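To make Algorithm 1 concrete, here is a compact sketch of a single trust-region iteration. It reuses the hypothetical helpers f_and_grad and trust_region_subproblem from the earlier sketches, builds the Riemannian model (4.6) in tangent coordinates, retracts via the exponential map (4.9), and applies the ratio test (4.13); the subproblem is solved with the bisection sketch rather than the SDP (4.11). This is an illustrative sketch, not the authors' implementation.

```python
# One Riemannian trust-region iteration over the sphere (sketch of Algorithm 1).
import numpy as np

def hessian(q, Yhat, mu):
    """Euclidean Hessian of (2.2): (1/(p*mu)) * Yhat diag(sech^2(z/mu)) Yhat^T."""
    p = Yhat.shape[1]
    z = Yhat.T @ q
    w = 1.0 - np.tanh(z / mu) ** 2
    return (Yhat * w) @ Yhat.T / (p * mu)

def trm_step(q, Delta, Yhat, mu, eta_vs=0.9, eta_s=0.1,
             gamma_i=2.0, gamma_d=0.5, Delta_max=1.0, Delta_min=1e-16):
    n = q.shape[0]
    # Orthonormal basis U of the tangent space T_q S^{n-1}: columns 2..n of the
    # left singular vectors of q span the orthogonal complement of q.
    U = np.linalg.svd(q.reshape(-1, 1))[0][:, 1:]
    f0, g = f_and_grad(q, Yhat, mu)
    H = hessian(q, Yhat, mu)
    b = U.T @ g                                   # Riemannian gradient, coordinates
    A = U.T @ (H - (g @ q) * np.eye(n)) @ U       # Riemannian Hessian model, (4.6)
    xi = trust_region_subproblem(b, A, Delta)
    delta = U @ xi
    nd = np.linalg.norm(delta)
    if nd < 1e-18:                                # model step is (numerically) zero
        return q, Delta
    q_new = q * np.cos(nd) + (delta / nd) * np.sin(nd)   # exponential map (4.9)
    f_new, _ = f_and_grad(q_new, Yhat, mu)
    model_decrease = -(b @ xi + 0.5 * xi @ A @ xi)
    rho = (f0 - f_new) / model_decrease           # acceptance ratio (4.13)
    if rho >= eta_vs:
        return q_new, min(gamma_i * Delta, Delta_max)
    if rho >= eta_s:
        return q_new, Delta
    return q, max(gamma_d * Delta, Delta_min)
```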
5. Main Results

For orthogonal dictionaries, from Theorem 3.1 and its corollary, we know that all the minimizers $\widehat{q}$ are $O(\mu)$ away from their respective nearest targets $q_\star$, where $q_\star^* \widehat{Y} = \alpha e_i^* X_0$ for some $\alpha \neq 0$ and $i \in [n]$; in Theorem 4.1, we have shown that w.h.p. the Riemannian TRM algorithm produces a solution $\widehat{q} \in S^{n-1}$ that is $\varepsilon$-near to one of the minimizers, say $\widehat{q}_\star$. Thus, the $\widehat{q}$ returned by the TRM algorithm is $O(\varepsilon + \mu)$ away from $q_\star$. For exact recovery, we use a simple linear-programming rounding procedure, which is guaranteed to produce the target $q_\star$ exactly. We then use deflation to sequentially recover the other rows of $X_0$. Overall, w.h.p. both the dictionary $A_0$ and the sparse coefficient matrix $X_0$ are exactly recovered up to signed permutation, when $\theta \in \Omega(1)$, for orthogonal dictionaries. The same procedure can be used to recover complete dictionaries, though the analysis is slightly more complicated. Our overall algorithmic pipeline for recovering orthogonal dictionaries is sketched as follows.

1. Estimating one row of $X_0$ by the Riemannian TRM algorithm. By Theorem 3.1 (resp. Theorem 3.3) and Theorem 4.1 (resp. Theorem 4.2), starting from any initialization, when the relevant parameters are set appropriately (say as $\mu_\star$ and $\Delta_\star$), w.h.p. our Riemannian TRM algorithm finds a local minimizer $\widehat{q}$ whose nearest target $q_\star$ exactly recovers one row of $X_0$, with $\|\widehat{q} - q_\star\| \le O(\mu)$ (by setting the target accuracy of the TRM as, say, $\varepsilon = \mu$).

2. Recovering one row of $X_0$ by rounding. To obtain the target solution $q_\star$ and hence recover (up to scale) one row of $X_0$, we solve the following linear program:
$$\min_q \; \|q^* \widehat{Y}\|_1, \quad \text{s.t.} \;\; \langle r, q \rangle = 1, \qquad (5.1)$$
with $r = \widehat{q}$. We show that when $\langle \widehat{q}, q_\star \rangle$ is sufficiently large, as implied by $\mu$ being sufficiently small, w.h.p. the minimizer of (5.1) is exactly $q_\star$, and hence one row of $X_0$ is recovered as $q_\star^* \widehat{Y}$.

3. Recovering all rows of $X_0$ by deflation. Once $\ell$ rows of $X_0$ ($1 \le \ell \le n-2$) have been recovered, say by unit vectors $q_\star^1, \ldots, q_\star^\ell$, one takes an orthonormal basis $U$ for $[\mathrm{span}(q_\star^1, \ldots, q_\star^\ell)]^{\perp}$ and minimizes the new function $h(z) \doteq f(Uz; \widehat{Y})$ over the sphere $S^{n-\ell-1}$ with the Riemannian TRM algorithm (though conservative, one can again set the parameters as $\mu_\star$, $\Delta_\star$ as in Step 1) to produce a $\widehat{z}$. Another row of $X_0$ is then recovered via the LP rounding (5.1) with input $r = U\widehat{z}$ (to produce $q_\star^{\ell+1}$). Finally, by repeating the procedure until depletion, one can recover all the rows of $X_0$.

4. Reconstructing the dictionary $A_0$. By solving the linear system $Y = A X_0$, one obtains the dictionary $A_0 = Y X_0^* (X_0 X_0^*)^{-1}$.

Formally, we have the following results.

Theorem 5.1 (Orthogonal dictionary). Assume the dictionary $A_0$ is orthogonal and take $\widehat{Y} = Y$. Suppose $\theta \in (0, 1/3)$, $\mu_\star < \min\{c_a \theta n^{-1}, c_b n^{-5/4}\}$, and $p \ge C n^3 \log\frac{n}{\mu_\star \theta} / (\mu_\star^2 \theta^2)$. The above algorithmic pipeline with parameter setting

$$\Delta_\star \le \min\left\{\frac{c_c c_\star \theta \mu_\star^2}{n^{5/2} \log^{5/2}(np)}, \; \frac{c_d c_\star^3 \theta^3 \mu_\star}{n^{7/2} \log^{7/2}(np)}\right\}$$

recovers the dictionary $A_0$ and the coefficients $X_0$ in polynomial time, with failure probability bounded by $c_e p^{-6}$. Here $c_\star$ is as defined in Theorem 3.1, and $c_a$ through $c_e$, and $C$, are all positive numerical constants.

Theorem 5.2 (Complete dictionary). Assume the dictionary $A_0$ is complete with condition number $\kappa(A_0)$ and take $\widehat{Y} = \overline{Y}$ as in (3.5). Suppose $\theta \in (0, 1/3)$, $\mu_\star < \min\{c_a \theta n^{-1}, c_b n^{-5/4}\}$, and $p \ge \frac{C}{c_\star^2 \theta} \max\left\{\frac{n^4}{\mu_\star^4}, \frac{n^5}{\mu_\star^2}\right\} \kappa^8(A_0) \log^4 \frac{\kappa(A_0)\, n}{\mu_\star \theta}$. The algorithmic pipeline with parameter setting

$$\Delta_\star \le \min\left\{\frac{c_c c_\star \theta \mu_\star^2}{n^{5/2} \log^{5/2}(np)}, \; \frac{c_d c_\star^3 \theta^3 \mu_\star}{n^{7/2} \log^{7/2}(np)}\right\}$$

recovers the dictionary $A_0$ and the coefficients $X_0$ in polynomial time, with failure probability bounded by $c_e p^{-6}$. Here $c_\star$ is as defined in Theorem 3.1, and $c_a$ through $c_f$, and $C$, are all positive numerical constants.
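For illustration, the LP rounding (5.1) can be written as a standard linear program by splitting the $\ell^1$ objective with auxiliary variables $t$ such that $|\widehat{Y}^* q| \le t$ elementwise. The sketch below uses scipy.optimize.linprog and is not the authors' implementation; the helper name is illustrative.

```python
# Minimal sketch of the rounding LP (5.1): minimize ||Yhat^T q||_1 s.t. <r, q> = 1.
import numpy as np
from scipy.optimize import linprog

def lp_round(r, Yhat):
    n, p = Yhat.shape
    c = np.concatenate([np.zeros(n), np.ones(p)])         # minimize sum(t)
    A_ub = np.block([[Yhat.T, -np.eye(p)],                 #  Yhat^T q - t <= 0
                     [-Yhat.T, -np.eye(p)]])               # -Yhat^T q - t <= 0
    b_ub = np.zeros(2 * p)
    A_eq = np.concatenate([r, np.zeros(p)]).reshape(1, -1) # <r, q> = 1
    b_eq = np.array([1.0])
    bounds = [(None, None)] * n + [(0, None)] * p          # q free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n]
```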
6. Numerical Results

To corroborate our theory, we experiment with dictionary recovery on simulated data (the code is available online: https://github.com/sunju/dl_focm). For simplicity, we focus on recovering orthogonal dictionaries, and we declare success once a single row of the coefficient matrix is recovered. Since the problem is invariant to rotations, w.l.o.g. we set the dictionary as $A_0 = I \in \mathbb{R}^{n \times n}$. We fix $p = 5n^2 \log n$, and each column of the coefficient matrix $X_0 \in \mathbb{R}^{n \times p}$ has exactly $k$ nonzero entries, with support chosen uniformly at random from $\binom{[n]}{k}$. The nonzero entries are i.i.d. standard normal. This is slightly different from the Bernoulli-Gaussian model we assumed for the analysis; for reasonably large $n$, the two models produce similar behavior.

For the sparsity surrogate defined in (2.3), we fix the parameter $\mu = 10^{-2}$. We implement Algorithm 1 with an adaptive step size instead of the fixed step size used in our analysis. To see how the allowable sparsity level varies with the dimension, which is what our theory is primarily about, we vary the dictionary dimension $n$ and the sparsity $k$ both between 1 and 150; for every pair $(k, n)$ we repeat the simulation independently $T = 5$ times. Because the optimal solutions are the signed coordinate vectors $\{\pm e_i\}_{i=1}^{n}$, for a solution $\widehat{q}$ returned by the TRM algorithm we define the reconstruction error (RE) as $\mathrm{RE} = \min_{1 \le i \le n} \min\left(\|\widehat{q} - e_i\|, \|\widehat{q} + e_i\|\right)$. A trial is declared a success once $\mathrm{RE} \le \mu$, with the idea that this indicates $\widehat{q}$ is already very near the target, which can then likely be recovered via the LP rounding described above (not implemented here).

Figure 2. Phase transition for recovering a single sparse vector under the dictionary learning model with $p = 5n^2 \log n$.

Figure 2 shows the phase transition in the $(n, k)$ plane for the orthogonal case. Evidently our TRM algorithm works well into the linear sparsity regime whenever $p \in O(n^2 \log n)$. Our analysis appears to be tight only up to logarithmic factors and the polynomial dependency on $1/\mu$, which under our theory is polynomial in $n$.

7. Discussion

For recovery of complete dictionaries, the LP approach of Spielman et al. (2012), which works with $\theta \le O(1/\sqrt{n})$, only demands $p \ge \Omega(n^2 \log^2 n$); this was recently improved to $p \ge \Omega(n \log^4 n)$ (Luh & Vu, 2015), almost matching the lower bound $\Omega(n \log n)$ (i.e., when $\theta \sim 1/n$). The sample complexity stated in Theorem 5.2 is obviously much higher. It is interesting to ask whether such growth in complexity is intrinsic to working in the linear sparsity regime. Though our experiments seem to suggest that $p \in O(n^2 \log n)$ is necessary even for the orthogonal case, there could be other efficient algorithms that demand much less. Tweaking the following three points will likely improve the complexity: (1) the $\ell^1$ proxy: the derivative and Hessian of the $\log\cosh$ function we adopted involve the $\tanh$ function, which is not amenable to effective approximation and affects the sample complexity; (2) the geometric characterization and algorithm analysis: working directly on the sphere (i.e., in the $q$ space) could simplify and possibly improve certain parts of the analysis; (3) treating the complete case directly, rather than using (pessimistic) bounds to treat it as a perturbation of the orthogonal case.
In particular, general linear transforms may change the space significantly, so that preconditioning and comparing to orthogonal transforms may not be the most efficient way to proceed.

It is possible to extend the current analysis to other dictionary settings. Our geometric structures and algorithms allow plug-and-play noise analysis. Nevertheless, we believe a more stable way of dealing with noise is to directly extract the whole dictionary, i.e., to consider geometry and optimization (and perturbation) over the orthogonal group. This will require additional nontrivial technical work, but it is likely feasible thanks to the relatively complete knowledge of the orthogonal group (Edelman et al., 1998; Absil et al., 2009). A substantial leap forward would be to extend the methodology to recovery of structured overcomplete dictionaries, such as tight frames. Though there is no natural elimination of one variable, one can consider the marginalization of the objective function with respect to the coefficients and work with the resulting hidden functions. (The recent work of Arora et al. (2015) on overcomplete DR uses a similar idea; the marginalization taken there is near the global optimum in one variable, where the function is well-behaved. Studying the global properties of the marginalization may introduce additional challenges.)

Under the i.i.d. BG coefficient model, our recovery problem is also an instance of the ICA problem. It is interesting to ask what is vital in making the problem tractable: sparsity or independence. The full version (Sun et al., 2015) includes an experimental study in this direction, which underlines the importance of the sparsity prior. In fact, the preliminary experiments there suggest that the independence assumption we made here can likely be removed without losing the favorable geometric structure. In addition, the connection to ICA also suggests the possibility of adapting our geometric characterization and algorithms to the ICA problem. This will likely provide new theoretical insights and computational schemes for ICA.

In the recent surge of theoretical work on nonconvex heuristics (Keshavan et al., 2010; Jain et al., 2013; Hardt, 2014; Hardt & Wootters, 2014; Netrapalli et al., 2014; Jain & Netrapalli, 2014; Netrapalli et al., 2013; Candes et al., 2014; Jain & Oh, 2014; Anandkumar et al., 2014; Yi et al., 2013; Lee et al., 2013; Qu et al., 2014; Agarwal et al., 2013a;b; Arora et al., 2013; 2014; 2015), the initialization-plus-local-refinement strategy mostly differs from practice, where random initializations seem to work well, and the analytic techniques developed are mostly fragmented and highly specialized. The analytic and algorithmic framework we developed here holds promise to provide a coherent account of these problems. It is interesting to see to what extent the framework can be streamlined and generalized.

Acknowledgments

This work was partially supported by grants ONR N00014-13-1-0492 and NSF 1343282, and by funding from the Moore and Sloan Foundations and the Wei Family Private Foundation. We thank the area chair and the anonymous reviewers for their painstaking effort in reading our long proofs and for providing insightful feedback. We also thank Cun Mu and Henry Kuo for discussions related to this project.

References

Absil, P.-A., Baker, C. G., and Gallivan, K. A. Trust-region methods on Riemannian manifolds. Foundations of Computational Mathematics, 7(3):303-330, 2007.
Absil, Pierre-Antoine, Mahony, Robert, and Sepulchre, Rodolphe. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.

Agarwal, Alekh, Anandkumar, Animashree, Jain, Prateek, Netrapalli, Praneeth, and Tandon, Rashish. Learning sparsely used overcomplete dictionaries via alternating minimization. arXiv preprint arXiv:1310.7991, 2013a.

Agarwal, Alekh, Anandkumar, Animashree, and Netrapalli, Praneeth. Exact recovery of sparsely used overcomplete dictionaries. arXiv preprint arXiv:1309.1952, 2013b.

Anandkumar, Animashree, Ge, Rong, and Janzamin, Majid. Guaranteed non-orthogonal tensor decomposition via alternating rank-1 updates. arXiv preprint arXiv:1402.5180, 2014.

Arora, Sanjeev, Ge, Rong, and Moitra, Ankur. New algorithms for learning incoherent and overcomplete dictionaries. arXiv preprint arXiv:1308.6273, 2013.

Arora, Sanjeev, Bhaskara, Aditya, Ge, Rong, and Ma, Tengyu. More algorithms for provable dictionary learning. arXiv preprint arXiv:1401.0579, 2014.

Arora, Sanjeev, Ge, Rong, Ma, Tengyu, and Moitra, Ankur. Simple, efficient, and neural algorithms for sparse coding. arXiv preprint arXiv:1503.00778, 2015.

Barak, Boaz, Kelner, Jonathan A., and Steurer, David. Dictionary learning and tensor decomposition via the sum-of-squares method. arXiv preprint arXiv:1407.1543, 2014.

Candes, Emmanuel, Li, Xiaodong, and Soltanolkotabi, Mahdi. Phase retrieval via Wirtinger flow: Theory and algorithms. arXiv preprint arXiv:1407.1065, 2014.

Conn, Andrew R., Gould, Nicholas I. M., and Toint, Philippe L. Trust-Region Methods. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000. ISBN 0-89871-460-5.

Demanet, Laurent and Hand, Paul. Scaling law for recovering the sparsest element in a subspace. Information and Inference, 3(4):295-309, 2014.

Edelman, Alan, Arias, Tomás A., and Smith, Steven T. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303-353, 1998.

Elad, Michael. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer, 2010.

Fortin, Charles and Wolkowicz, Henry. The trust region subproblem and semidefinite programming. Optimization Methods and Software, 19(1):41-67, 2004.

Geng, Quan and Wright, John. On the local correctness of ℓ1-minimization for dictionary learning. Submitted to IEEE Transactions on Information Theory, 2011. Preprint: http://www.columbia.edu/~jw2966.

Gribonval, Rémi and Schnass, Karin. Dictionary identification - sparse matrix-factorization via ℓ1-minimization. IEEE Transactions on Information Theory, 56(7):3523-3539, 2010.

Gribonval, Rémi, Jenatton, Rodolphe, and Bach, Francis. Sparse and spurious: dictionary learning with noise and outliers. arXiv preprint arXiv:1407.5155, 2014.

Hardt, Moritz. Understanding alternating minimization for matrix completion. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pp. 651-660. IEEE, 2014.

Hardt, Moritz and Wootters, Mary. Fast matrix completion without the condition number. In Proceedings of the 27th Conference on Learning Theory, pp. 638-678, 2014.

Hazan, Elad and Koren, Tomer. A linear-time algorithm for trust region problems. arXiv preprint arXiv:1401.6757, 2014.

Hillar, Christopher and Sommer, Friedrich T. When can dictionary learning uniquely recover sparse data from subsamples? arXiv preprint arXiv:1106.3616, 2011.

Jain, Prateek and Netrapalli, Praneeth. Fast exact matrix completion with finite samples. arXiv preprint arXiv:1411.1087, 2014.
Jain, Prateek and Oh, Sewoong. Provable tensor factorization with missing data. In Advances in Neural Information Processing Systems, pp. 1431-1439, 2014.

Jain, Prateek, Netrapalli, Praneeth, and Sanghavi, Sujay. Low-rank matrix completion using alternating minimization. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pp. 665-674. ACM, 2013.

Keshavan, Raghunandan H., Montanari, Andrea, and Oh, Sewoong. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980-2998, 2010.

Lee, Kiryung, Wu, Yihong, and Bresler, Yoram. Near optimal compressed sensing of sparse rank-one matrices via sparse power factorization. arXiv preprint arXiv:1312.0525, 2013.

Luh, Kyle and Vu, Van. Dictionary learning with few samples and matrix concentration. arXiv preprint arXiv:1503.08854, 2015.

Mairal, Julien, Bach, Francis, and Ponce, Jean. Sparse modeling for image and vision processing. Foundations and Trends in Computer Graphics and Vision, 8(2-3), 2014.

Mehta, Nishant and Gray, Alexander G. Sparsity-based generalization bounds for predictive sparse coding. Proceedings of the 30th International Conference on Machine Learning (ICML-13), 28(1):36-44, 2013.

Moré, J. J. and Sorensen, D. C. Computing a trust region step. SIAM Journal on Scientific and Statistical Computing, 4:553-572, 1983.

Murty, Katta G. and Kabadi, Santosh N. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117-129, 1987.

Netrapalli, Praneeth, Jain, Prateek, and Sanghavi, Sujay. Phase retrieval using alternating minimization. In Advances in Neural Information Processing Systems, pp. 2796-2804, 2013.

Netrapalli, Praneeth, Niranjan, U. N., Sanghavi, Sujay, Anandkumar, Animashree, and Jain, Prateek. Non-convex robust PCA. In Advances in Neural Information Processing Systems, pp. 1107-1115, 2014.

Nocedal, Jorge and Wright, Stephen. Numerical Optimization. Springer, 2006.

Qu, Qing, Sun, Ju, and Wright, John. Finding a sparse vector in a subspace: Linear sparsity using alternating directions. In Advances in Neural Information Processing Systems, pp. 3401-3409, 2014.

Schnass, Karin. On the identifiability of overcomplete dictionaries via the minimisation principle underlying K-SVD. Applied and Computational Harmonic Analysis, 37(3):464-491, 2014a.

Schnass, Karin. Local identification of overcomplete dictionaries. arXiv preprint arXiv:1401.6354, 2014b.

Schnass, Karin. Convergence radius and sample complexity of ITKM algorithms for dictionary learning. arXiv preprint arXiv:1503.07027, 2015.

Spielman, Daniel A., Wang, Huan, and Wright, John. Exact recovery of sparsely-used dictionaries. In Proceedings of the 25th Annual Conference on Learning Theory, 2012.

Sun, Ju, Qu, Qing, and Wright, John. Complete dictionary recovery over the sphere. arXiv preprint arXiv:1504.06785, 2015.

Vainsencher, Daniel, Mannor, Shie, and Bruckstein, Alfred M. The sample complexity of dictionary learning. Journal of Machine Learning Research, 12:3259-3281, November 2011. ISSN 1532-4435.

Yi, Xinyang, Caramanis, Constantine, and Sanghavi, Sujay. Alternating minimization for mixed linear regression. arXiv preprint arXiv:1310.3745, 2013.

Zibulevsky, Michael and Pearlmutter, Barak. Blind source separation by sparse decomposition in a signal dictionary. Neural Computation, 13(4):863-882, 2001.