# Structured Local Minima in Sparse Blind Deconvolution

Yuqian Zhang, Han-Wen Kuo, John Wright
Department of Electrical Engineering and Data Science Institute
Columbia University, New York, NY 10027
{yz2409, hk2673, jw2966}@columbia.edu

**Abstract.** Blind deconvolution is the ubiquitous problem of recovering two unknown signals from their convolution. Unfortunately, it is ill-posed in general. This paper focuses on the *short-and-sparse* blind deconvolution problem, in which one unknown signal is short and the other is sparsely and randomly supported. This variant captures the structure of the unknown signals in several important applications. We assume the short signal to have unit $\ell^2$ norm and cast blind deconvolution as a nonconvex optimization problem over the sphere. We demonstrate that (i) in a certain region of the sphere, every local optimum is close to some shift truncation of the ground truth, and (ii) for a generic short signal of length $k$, when the sparsity rate of the activation signal satisfies $\theta \lesssim k^{-2/3}$ and the number of measurements satisfies $m \gtrsim \mathrm{poly}(k)$, a simple initialization method together with a descent algorithm that escapes strict saddle points recovers a near shift truncation of the ground truth kernel.

## 1 Introduction

Blind deconvolution is the problem of recovering two unknown signals $a_0$ and $x_0$ from their convolution $y = a_0 \circledast x_0$. This fundamental problem recurs across several fields, including astronomy, microscopy data processing [1], neural spike sorting [2], and computer vision [3]. However, the problem is ill-posed without further priors on the unknown signals: there are infinitely many pairs $(a, x)$ whose convolution equals a given observation $y$. Fortunately, in practice the target signals $(a_0, x_0)$ are often structured. In particular, a number of practical applications exhibit a common *short-and-sparse* structure:

- **Neural spike sorting.** Neurons in the brain fire brief voltage spikes when stimulated. The spike signatures encode critical features of the neuron, and the occurrences of such spikes are usually sparse and random in time [2, 4].
- **Microscopy data analysis.** Some nanoscale materials are contaminated by randomly and sparsely distributed defects, which change the electronic structure of the material [1].
- **Image deblurring.** Images blurred by camera shake can be modeled as the convolution of a latent sharp image and a kernel capturing the motion of the camera. Although natural images are not sparse, they typically have (approximately) sparse gradients [5, 6].

In these applications, the observation $y \in \mathbb{R}^m$ is generated by convolving a short kernel $a_0 \in \mathbb{R}^k$ ($k \ll m$) with a sparse activation signal $x_0 \in \mathbb{R}^m$ ($\|x_0\|_0 \ll m$). Without loss of generality, we let $y$ denote the circular convolution of $a_0$ and $x_0$,
$$y \;=\; a_0 \circledast x_0 \;=\; \tilde a_0 \circledast x_0, \qquad (1)$$
with $\tilde a_0 = \iota_k a_0 \in \mathbb{R}^m$ denoting the zero-padded $m$-length version of $a_0$. Here, $\iota_k : \mathbb{R}^k \to \mathbb{R}^m$ is the zero-padding operator; its adjoint $\iota_k^* : \mathbb{R}^m \to \mathbb{R}^k$ projects onto the lower-dimensional space by keeping the first $k$ entries.

*Figure 1: Local minimum. Top: observation $y = a_0 \circledast x_0$ and ground truth $a_0$, $x_0$. Bottom: recovered $a \circledast x$, $a$, and $x$ at one local minimum of a natural formulation in [16].*
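To fix ideas, the following short sketch generates synthetic data from the short-and-sparse model (1) under the Bernoulli-Gaussian coefficient model used later in the paper. It is a minimal illustration rather than the authors' code; the function names and default parameters are our own.

```python
import numpy as np

def circular_conv(u, v):
    """Circular convolution of two length-m signals via the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(u) * np.fft.fft(v)))

def generate_observation(k=50, m=10_000, theta=None, seed=0):
    """Sample a generic unit kernel a0 and x0 ~ i.i.d. BG(theta); return y = a0 (*) x0."""
    rng = np.random.default_rng(seed)
    theta = k ** (-2.0 / 3.0) if theta is None else theta     # sparsity rate theta ~ k^{-2/3}
    a0 = rng.standard_normal(k)
    a0 /= np.linalg.norm(a0)                                   # unit-norm kernel on the sphere
    x0 = rng.binomial(1, theta, m) * rng.standard_normal(m)    # Bernoulli-Gaussian activations
    a0_tilde = np.zeros(m)
    a0_tilde[:k] = a0                                          # zero padding: a0_tilde = iota_k a0
    return circular_conv(a0_tilde, x0), a0, x0
```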
The short-and-sparse blind deconvolution problem exhibits a scaled-shift ambiguity, which derives from basic properties of the convolution operator: for any observation $y$, any nonzero scalar $\alpha$, and any integer shift $\tau$,
$$y \;=\; \big(\alpha\, s_\tau[\tilde a_0]\big) \circledast \big(\alpha^{-1} s_{-\tau}[x_0]\big). \qquad (2)$$
Here, $s_\tau[v]$ denotes the cyclic shift of the vector $v$ by $\tau$ entries:
$$s_\tau[v](i) \;=\; v\big([\,i-\tau-1\,]_m + 1\big), \qquad i \in \{1, \dots, m\}. \qquad (3)$$
Clearly, both scaling and cyclic shifts preserve the short-and-sparse structure of $(a_0, x_0)$. This scaled-shift symmetry raises nontrivial challenges for computation, making straightforward convexification approaches ineffective. (A number of works [7, 8, 9, 10, 11] have developed provable methods for blind deconvolution under the assumption that $a_0$ and $x_0$ belong to random subspaces, or are sparse in random dictionaries. These random models exhibit simpler geometry than the short-and-sparse model; because our target signal is sparse in the standard basis, those results do not apply in our setting.)

Nonconvex algorithms for sparse blind deconvolution are well developed and widely practiced, especially in computer vision [12, 13, 14, 15]. Despite their empirical success, little is known about their working mechanisms. Recently, [16] studied the optimization landscape of a natural nonconvex formulation for sparse blind deconvolution, assuming the kernel $a \in \mathbb{R}^k$ to have unit Frobenius norm (denoted $a \in \mathbb{S}^{k-1}$). [16] argues that, under suitable conditions, this problem has well-structured local optima, in the sense that every local optimum is close to some shift truncation of the ground truth (Figure 1). The presence of these local optima can be viewed as a consequence of the shift symmetry of the convolution operator: the shifted and truncated kernel $\iota_k^* s_\tau[\tilde a_0]$ can be convolved with the sparse signal $s_{-\tau}[x_0]$ (shifted in the other direction) to produce a near approximation of the observation, $\iota_k^* s_\tau[\tilde a_0] \circledast s_{-\tau}[x_0] \approx y$. In [16], this geometric insight about local optima is corroborated by extensive experiments, but a rigorous proof is available only in the dilute limit in which the sparse coefficient signal $x_0$ is a single spike.

In this paper, we adopt the unit Frobenius norm constraint for the short convolution kernel $a$ as in [16], but consider a different objective function. We formulate sparse blind deconvolution as the following optimization problem over the sphere:
$$\min_q\; -\big\|\check y \circledast r_y(q)\big\|_4^4 \qquad \text{s.t.} \quad \|q\|_F = 1. \qquad (4)$$
Here, $\check y$ denotes the reversal of $y$ (for $y = [y_1, y_2, \dots, y_{m-1}, y_m]^T$, the reversal is $\check y = [y_1, y_m, y_{m-1}, \dots, y_2]^T$, with $y_1$ fixed), and $r_y(q)$ is a preconditioner which we discuss in detail later. The convolution $\check y \circledast r_y(q)$ approximates the reversed underlying activation signal $x_0$, and $\|\cdot\|_4^4$ serves as the sparsity-promoting term. We demonstrate that even when $x_0$ is relatively dense, any local minimum in a certain region of the sphere is close to a shift truncation $\iota_k^* s_\tau[\tilde a_0]$ of the ground truth. This benign region contains the sub-level set of small objective value. Algorithmically, if initialized at a point with small enough objective value, a descent algorithm always decreases the objective value and hence stays in this region.
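A quick numerical check of the scaled-shift ambiguity (2), reusing the helper functions from the sketch above (so this snippet assumes those definitions are in scope):

```python
import numpy as np

# Shifting the kernel by tau and the activations by -tau, while rescaling by alpha and
# 1/alpha, leaves the observation y unchanged.
y, a0, x0 = generate_observation(k=50, m=2000, seed=1)
m = len(y)
a0_tilde = np.zeros(m)
a0_tilde[:len(a0)] = a0

alpha, tau = 2.5, 7
y_shifted = circular_conv(alpha * np.roll(a0_tilde, tau),       # alpha * s_tau[a0_tilde]
                          np.roll(x0, -tau) / alpha)            # (1/alpha) * s_{-tau}[x0]
assert np.allclose(y, y_shifted)                                # same y, different (a, x)
```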
Specifically, for a generic kernel $a_0 \in \mathbb{S}^{k-1}$ (throughout, a *generic* kernel is one sampled from the uniform distribution over the sphere), if the sparsity rate satisfies $\theta \lesssim k^{-2/3}$ (equivalently, as many as $O(k^{1/3})$ shifts of the kernel may appear in a $k$-length window of the observation) and the number of measurements satisfies $m \gtrsim \mathrm{poly}(k)$, then initializing at some preconditioned $k$ consecutive entries of $y$ and applying any descent method that converges to a local minimizer under a strict saddle hypothesis [17, 18] produces a near shift truncation of the ground truth. ([16] proposes to solve the short-and-sparse blind deconvolution problem with a two-phase algorithm that first recovers a shift truncation and then recovers the ground truth kernel with an annealing procedure. We present additional experimental results on recovery of the ground truth in the supplementary material.)

**Assumptions and notations.** We assume that $x_0 \in \mathbb{R}^m$ follows the Bernoulli-Gaussian (BG) model with sparsity level $\theta$: $x_0(i) = \omega_i g_i$ with $\omega_i \sim \mathrm{Ber}(\theta)$ and $g_i \sim \mathcal{N}(0, 1)$, where all random variables are jointly independent. For brevity, we write $x_0 \sim_{\mathrm{i.i.d.}} \mathrm{BG}(\theta)$. Throughout, a vector $v \in \mathbb{R}^k$ is indexed as $v = [v_1, v_2, \dots, v_k]$, and $[\cdot]_m$ denotes the modulo-$m$ operator. We use $\|\cdot\|_{\mathrm{op}}$, $\|\cdot\|_F$, and $\|\cdot\|_p$ to denote the operator norm, the Frobenius norm, and the entrywise $\ell^p$ norm, respectively. $P_{\mathbb{S}}[\cdot] := \cdot / \|\cdot\|_F$ denotes projection onto the sphere, and $(\cdot)^p$ denotes the entrywise $p$-th power. We use $C, c$ to denote positive constants whose values may change from line to line.

## 2 Problem Formulation

In the short-and-sparse blind deconvolution problem, any $k$ consecutive entries of $y$ depend on only $2k-1$ consecutive entries of $x_0$:
$$y^i \;:=\; \big[y_i, y_{1+[i]_m}, \dots, y_{1+[i+k-2]_m}\big]^T \;=\; \sum_{\tau = -(k-1)}^{k-1} x_{1+[i+\tau-1]_m}\; \iota_k^*\, s_\tau[\tilde a_0] \qquad (5)$$
$$\;=\; \underbrace{\begin{bmatrix}
a_k & a_{k-1} & \cdots & a_1 & 0 & \cdots & 0\\
0 & a_k & \cdots & a_2 & a_1 & \cdots & 0\\
\vdots & & \ddots & & & \ddots & \vdots\\
0 & \cdots & 0 & a_k & a_{k-1} & \cdots & a_1
\end{bmatrix}}_{A_0 \,\in\, \mathbb{R}^{k \times (2k-1)}}
\underbrace{\begin{bmatrix}
x_{1+[i-k]_m}\\ \vdots\\ x_i\\ \vdots\\ x_{1+[i+k-2]_m}
\end{bmatrix}}_{x^i \,\in\, \mathbb{R}^{2k-1}} \qquad (6)$$
Write $Y = [y^1, y^2, \dots, y^m] \in \mathbb{R}^{k \times m}$ and $X_0 = [x^1, \dots, x^m] \in \mathbb{R}^{(2k-1)\times m}$. Using the above expression,
$$Y = A_0 X_0. \qquad (7)$$
Each column $x^i$ of $X_0$ contains only some $2k-1$ entries of $x_0$. The rows of $X_0$ are cyclic shifts of the reversal of $x_0$:
$$X_0 = \begin{bmatrix} s_0[\check x_0] \\ \vdots \\ s_{2k-2}[\check x_0] \end{bmatrix}. \qquad (8)$$
The shifts of $\check x_0$ are sparse vectors in the linear subspace $\mathrm{row}(X_0)$. Note that if we could recover some shift $s_\tau[x_0]$, we could subsequently determine $s_{-\tau}[a_0]$ by solving a linear system of equations, and hence solve the deconvolution problem up to the shift ambiguity. (In related work, [19] considers the multi-channel blind deconvolution problem, in which many independent observations $y_p = a_0 \circledast x_p$ are available, and shows how to formulate it as the search for a sparse vector in a linear subspace. Our approach is also inspired by the idea of looking for a sparse or spiky vector in a subspace, but it pertains to a different problem in which only a single observation is available; the short-and-sparse problem exhibits a more complicated optimization landscape, due to the signed shift ambiguity.)

### 2.1 Finding a Shifted Sparse Signal

In light of the above observations, a natural computational approach to sparse blind deconvolution is to search for a sparse vector in the linear subspace $\mathrm{row}(X_0)$, e.g., by solving an optimization problem
$$\min\; \varrho(v) \qquad \text{s.t.} \quad v \in \mathrm{row}(X_0), \;\; \|v\|_2 = 1, \qquad (9)$$
where the penalty $\varrho(\cdot)$ is chosen to encourage sparsity of the target signal [20, 21, 22, 23]. In sparse blind deconvolution, however, we do not have access to the row space of $X_0$. Instead, we only observe the subspace $\mathrm{row}(Y) \subseteq \mathrm{row}(X_0)$. The subspace $\mathrm{row}(Y)$ does not necessarily contain the desired sparse vector $e_i^T X_0$, but it does contain some approximately sparse vectors.
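The factorization $Y = A_0 X_0$ of (7) is easy to check numerically. The sketch below builds $A_0$ from truncated shifts of the zero-padded kernel and verifies the identity on synthetic data; it uses 0-based indexing, so the bookkeeping differs cosmetically from the paper's 1-based formulas, and it reuses `generate_observation` from the earlier sketch.

```python
import numpy as np

def shift_truncation_matrix(a0):
    """A0 in R^{k x (2k-1)}: columns are the truncated shifts iota_k^* s_tau[a0_tilde],
    tau = -(k-1), ..., k-1 (cf. eq. (6))."""
    k = len(a0)
    a_pad = np.zeros(2 * k - 1)
    a_pad[:k] = a0
    return np.stack([np.roll(a_pad, tau)[:k] for tau in range(-(k - 1), k)], axis=1)

def window_matrix(v, length, offset=0):
    """Matrix whose i-th column holds `length` consecutive (circular) entries of v,
    starting at position i + offset."""
    m = len(v)
    idx = (np.arange(length)[:, None] + np.arange(m)[None, :] + offset) % m
    return v[idx]

y, a0, x0 = generate_observation(k=20, m=500, seed=2)
k = len(a0)
Y = window_matrix(y, k)                              # k x m matrix of observation windows
X0 = window_matrix(x0, 2 * k - 1, offset=-(k - 1))   # (2k-1) x m matrix of x0 windows
A0 = shift_truncation_matrix(a0)
assert np.allclose(Y, A0 @ X0)                       # eq. (7)
```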
In particular, consider the following vector in $\mathrm{row}(Y)$:
$$v \;=\; Y^T a_0 \;=\; \underbrace{\check x_0}_{\text{sparse}} \;+\; \underbrace{\sum_{i \neq 0} \big\langle a_0, s_i[a_0]\big\rangle\, s_i[\check x_0]}_{\text{noise } z}. \qquad (10)$$
The vector $v$ is a superposition of a sparse signal $\check x_0$ and its scaled shifts $\langle a_0, s_i[a_0]\rangle\, s_i[\check x_0]$. If the shift-coherence $|\langle a_0, s_\tau[a_0]\rangle|$ is small (for a generic kernel $a_0$ it is bounded by roughly $1/\sqrt{k}$ for every shift $\tau$) and $x_0$ is sparse enough, $z$ can be viewed as small noise: under the Bernoulli-Gaussian model, $\mathbb{E}[z_j^2] = \theta \sum_{i \neq 0} \langle a_0, s_i[a_0]\rangle^2$ for each $j$. The vector $v$ is not sparse, but it is *spiky*: a few of its entries are much larger than the rest. We deploy the milder sparsity surrogate $\|\cdot\|_4^4$ to recover such a spiky vector, as $\|\cdot\|_4^4$ is very flat around $0$ and insensitive to small noise in the signal. (In comparison, the classical choice $\varrho = \|\cdot\|_1$ is a strict sparsity penalty that essentially encourages all small entries to be exactly $0$.) This gives
$$\min\; -\tfrac14 \|v\|_4^4 \qquad \text{s.t.} \quad v \in \mathrm{row}(Y), \;\; \|v\|_2 = 1. \qquad (11)$$
We can express a generic unit vector $v \in \mathrm{row}(Y)$ as $v = Y^T (YY^T)^{-1/2} q$ with $\|v\|_2 = \|q\|_2$. This leads to the following equivalent optimization problem over the sphere:
$$\min\; \psi(q) \;:=\; -\tfrac14 \big\| Y^T (YY^T)^{-1/2} q \big\|_4^4 \qquad \text{s.t.} \quad \|q\|_2 = 1. \qquad (12)$$

**Interpretation: preconditioned shifts.** The objective $\psi(q)$ can be rewritten as
$$\psi(q) \;=\; -\tfrac14 \big\|\check y \circledast (YY^T)^{-1/2} q\big\|_4^4 \;\approx\; -\tfrac14 \big\|\check x_0 \circledast A_0^T (YY^T)^{-1/2} q\big\|_4^4 \;\approx\; -\tfrac14 \big\|\check x_0 \circledast \zeta\big\|_4^4, \qquad (13)$$
where $\zeta = A_0^T (A_0 A_0^T)^{-1/2} q$. This approximation becomes accurate as $m$ grows, since $\mathbb{E}_{x_0 \sim \mathrm{BG}(\theta)}[YY^T] = \mathbb{E}_{x_0}[A_0 X_0 X_0^T A_0^T] = \theta m\, A_0 A_0^T$. The objective encourages the convolution of $\check x_0$ and $\zeta$ to be as spiky as possible. Reasoning analogous to (10) suggests that $\check x_0 \circledast \zeta$ will be spiky if
$$\zeta \;=\; A_0^T (A_0 A_0^T)^{-1/2} q \;\approx\; \pm e_l, \qquad l \in \{1, \dots, 2k-1\}. \qquad (14)$$
For simplicity, we define the preconditioned convolution matrix
$$A \;:=\; (A_0 A_0^T)^{-1/2} A_0 \;=\; [\,a_1 \;\; a_2 \;\; \cdots \;\; a_{2k-1}\,], \qquad (15)$$
with column coherence (preconditioned shift coherence) $\mu := \max_{i \neq j} |\langle a_i, a_j \rangle|$. Then $\zeta = A^T q$ can be interpreted as measuring the inner products of $q$ with the columns of $A$. Making this intuition rigorous, we will show that minimizing this objective over a certain region of the sphere yields a preconditioned shift truncation $a_l$, from which we can recover a shift truncation of the original kernel $a_0$.

### 2.2 Structured Local Minima

We will show that in a certain region $\mathcal{R}_C \subseteq \mathbb{S}^{k-1}$, the preconditioned shift truncations $a_l$ are the only local minimizers. Moreover, the other critical points in $\mathcal{R}_C$ can be interpreted as resulting from competition between several of these local minima (Figure 2). At any saddle point, there exists strict negative curvature in the direction of a nearby local minimizer, which breaks the balance in favor of some particular $a_l$.

*Figure 2: Saddle points are approximately superpositions of local minima.*

The region $\mathcal{R}_C$ is defined as follows:

**Definition 2.1.** For fixed $C > 0$, letting $\kappa$ denote the condition number of $A_0$ and $\mu := \max_{i \neq j} |\langle a_i, a_j\rangle|$ the column coherence of $A$, we define two regions $\mathcal{R}_C, \widehat{\mathcal{R}}_C \subseteq \mathbb{S}^{k-1}$ as
$$\mathcal{R}_C \;:=\; \Big\{ q \in \mathbb{S}^{k-1} \;\Big|\; \|A^T q\|_4^6 \,\geq\, C \mu \kappa^2\, \|A^T q\|_3^3 \Big\}, \qquad
\widehat{\mathcal{R}}_C \;:=\; \Big\{ q \in \mathbb{S}^{k-1} \;\Big|\; \|A^T q\|_4^6 \,\geq\, C \mu \kappa^2 \Big\} \cap \mathcal{R}_C. \qquad (17)$$

The simpler and smaller region $\widehat{\mathcal{R}}_C$ introduced in Definition 2.1 can be viewed as a sub-level set of the objective: when $m$ is sufficiently large, $\psi(q)$ is approximately proportional to $-\|A^T q\|_4^4$ (see Section 3). Therefore, once initialized within $\widehat{\mathcal{R}}_C$, the iterates produced by a descent algorithm stay in $\widehat{\mathcal{R}}_C$.
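The preconditioned shift matrix $A$ of (15), its coherence $\mu$, and membership in the region of Definition 2.1 can all be computed directly. The sketch below is our own illustration (function names are ours); for example, with `A0 = shift_truncation_matrix(a0)` from the earlier sketch, `kappa = np.linalg.cond(A0)` and `A, mu = preconditioned_shifts(A0)` let one test whether a candidate `q` lies in the benign region.

```python
import numpy as np

def preconditioned_shifts(A0):
    """A = (A0 A0^T)^{-1/2} A0 (eq. (15)) and its column coherence mu = max_{i != j} |<a_i, a_j>|."""
    w, U = np.linalg.eigh(A0 @ A0.T)
    A = U @ np.diag(w ** -0.5) @ U.T @ A0
    G = np.abs(A.T @ A)
    np.fill_diagonal(G, 0.0)
    return A, G.max()

def in_region_RC(q, A, mu, kappa, C=10.0):
    """Membership test for R_C as stated in Definition 2.1 above:
    ||A^T q||_4^6 >= C * mu * kappa^2 * ||A^T q||_3^3."""
    zeta = A.T @ q
    return np.sum(zeta ** 4) ** 1.5 >= C * mu * kappa ** 2 * np.sum(np.abs(zeta) ** 3)
```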
In particular, at any stationary point $q \in \mathcal{R}_{10}$, the local optimization landscape can be characterized in terms of the number of *spikes* in $\zeta$, i.e., entries of nontrivial magnitude (we call an entry $\zeta_l$ nontrivial if $|\zeta_l| \geq 2\mu \|\zeta\|_3^3 / \|\zeta\|_4^4$; the technical reasoning is deferred to Section 3). If there is only one spike in $\zeta$, the stationary point $q$ is a local minimum close to one of the local minimizers; if there are two or more spikes in $\zeta$, the stationary point $q$ is a saddle point. Based on this characterization of stationary points in $\mathcal{R}_C$ with $C \geq 10$, we can deduce that any local minimum is close to $a_l$ for some integer $l$, i.e., to a preconditioned shift truncation of the ground truth $a_0$.

**Theorem 2.2 (Main Result).** Assume the observation $y \in \mathbb{R}^m$ is the circular convolution of $a_0 \in \mathbb{R}^k$ and $x_0 \sim_{\mathrm{i.i.d.}} \mathrm{BG}(\theta) \in \mathbb{R}^m$, that the convolution matrix $A_0$ has minimum singular value $\sigma_{\min} > 0$ and condition number $\kappa \geq 1$, and that $A$ has column coherence $0 \leq \mu < 1$. There exists a positive constant $C$ such that whenever the number of measurements satisfies
$$m \;\geq\; C\, \frac{\min\{\mu^{-4/3},\, \kappa^2 k^2\}}{(1-\theta)^2 \sigma_{\min}^2}\; \kappa^8 k^4 \log^3\!\frac{\kappa k}{(1-\theta)\sigma_{\min}} \qquad (18)$$
and $\theta \geq \log k / k$, then with high probability any local minimum $q \in \widehat{\mathcal{R}}_{2C}$ satisfies
$$\big|\langle q, P_{\mathbb{S}}[a_l]\rangle\big| \;\geq\; 1 - c\,\kappa^{-2} \qquad (19)$$
for some integer $1 \leq l \leq 2k-1$. Here, $C \geq 10$ and $c = 1/C$.

This theorem says that, given polynomially many observations, any local minimum in $\widehat{\mathcal{R}}_{2C}$ is close to some normalized column of $A$. The parameters $\sigma_{\min}$, $\kappa$, and $\mu$ effectively measure the spectral flatness of the ground truth kernel $a_0$ and govern how broadly the result applies. A random-like kernel typically has large $\sigma_{\min}$ and small $\kappa$ and $\mu$, so the result holds over a large sub-level set $\widehat{\mathcal{R}}_{2C}$ even with fewer observations. Hence, as long as the algorithm finds a local minimum in $\widehat{\mathcal{R}}_{2C}$, some shift truncation of the ground truth kernel $a_0$ can be recovered. In other words, if we can find an initialization point with small enough objective value, a descent algorithm minimizing the objective keeps $q$ inside $\widehat{\mathcal{R}}_{2C}$ in all subsequent iterations. Therefore, any descent algorithm that escapes strict saddle points can be applied to find some $a_l$, and hence some shift truncation of $a_0$.

### 2.3 Initialization with a Random Sample

Recall that $y^i = A_0 x^i$ is a sparse superposition of about $2\theta k$ columns of $A_0$; intuitively, it already encodes a preference towards a few preconditioned shift truncations of the ground truth. We therefore choose an index $i$ at random and set the initialization point as
$$q_{\mathrm{init}} = P_{\mathbb{S}}\big[(YY^T)^{-1/2} y^i\big], \qquad \zeta_{\mathrm{init}} = A^T q_{\mathrm{init}} \approx P_{\mathbb{S}}\big[A^T A\, x^i\big], \qquad (20)$$
where the approximation again uses $\mathbb{E}_{x_0 \sim \mathrm{BG}(\theta)}[YY^T] = \theta m\, A_0 A_0^T$. For a generic kernel $a_0 \in \mathbb{S}^{k-1}$, $A^T A$ is close to a diagonal matrix, as the magnitudes of its off-diagonal entries are bounded by the column coherence $\mu$. Hence, the sparsity of $x^i$ is approximately preserved: $P_{\mathbb{S}}[A^T A x^i]$ is a spiky vector with small objective value. By controlling the sparsity level $\theta$, one can ensure that such an initialization point $q_{\mathrm{init}}$ falls in $\widehat{\mathcal{R}}_{2C}$. We therefore propose Algorithm 1 for solving sparse blind deconvolution, with its working conditions stated in Corollary 2.3. For the choice of descent algorithms that escape strict saddle points, several such algorithms are tailored specifically to sphere-constrained optimization problems [24, 25].

**Algorithm 1: Short and Sparse Blind Deconvolution**
**Input:** observation $y \in \mathbb{R}^m$ and kernel size $k$. **Output:** recovered kernel $a$.
1. Draw a random index $i \in \{1, \dots, m\}$ and set $q_{\mathrm{init}} = P_{\mathbb{S}}\big[(YY^T)^{-1/2} y^i\big]$.
2. Solve the nonconvex problem (12) with a descent algorithm that escapes strict saddle points, obtaining a local minimizer $q_\star = \arg\min_{q \in \mathbb{S}^{k-1}} \psi(q)$.
3. Set $a = P_{\mathbb{S}}\big[(YY^T)^{1/2} q_\star\big]$.
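The following is a minimal numerical sketch of Algorithm 1, not the authors' implementation; names and parameters are ours. In place of a dedicated saddle-escaping descent method, it uses a simple power-method-style update $q \leftarrow P_{\mathbb{S}}[\nabla(-\psi)(q)]$, which monotonically increases the convex function $\tfrac14\|Y^T(YY^T)^{-1/2}q\|_4^4$ over the sphere (equivalently, monotonically decreases $\psi$); the paper's guarantees are stated for descent methods that provably escape strict saddle points.

```python
import numpy as np

def deconvolve(y, k, n_iter=200, seed=0):
    """Sketch of Algorithm 1: random preconditioned-window initialization, then a
    power-method-style ascent on ||Y^T (Y Y^T)^{-1/2} q||_4^4 over the sphere."""
    rng = np.random.default_rng(seed)
    m = len(y)
    idx = (np.arange(k)[:, None] + np.arange(m)[None, :]) % m
    Y = y[idx]                                       # k x m matrix of observation windows
    w, U = np.linalg.eigh(Y @ Y.T)
    P_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T        # (Y Y^T)^{-1/2}
    P_sqrt = U @ np.diag(w ** 0.5) @ U.T             # (Y Y^T)^{+1/2}
    B = Y.T @ P_inv_sqrt                             # so that v = B q = Y^T (YY^T)^{-1/2} q
    # Step 1: initialize from a random preconditioned window of y (eq. (20)).
    q = P_inv_sqrt @ Y[:, rng.integers(m)]
    q /= np.linalg.norm(q)
    # Step 2: each update maximizes a linearization of the convex function (1/4)||B q||_4^4
    # over the sphere, so the objective psi(q) = -(1/4)||B q||_4^4 is nonincreasing.
    for _ in range(n_iter):
        g = B.T @ (B @ q) ** 3                       # gradient of (1/4)||B q||_4^4
        q = g / np.linalg.norm(g)
    # Step 3: undo the preconditioning and normalize (recovers a shift truncation of a0).
    a = P_sqrt @ q
    return a / np.linalg.norm(a)
```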
**Corollary 2.3.** Suppose the ground truth kernel $a_0$ has preconditioned shift coherence $0 \leq \mu \leq \frac{1}{8 \cdot 48\, \log^{3/2} k}$ and sparse coefficient $x_0 \sim_{\mathrm{i.i.d.}} \mathrm{BG}(\theta) \in \mathbb{R}^m$. There exist positive constants $C = 2560^4$ and $C'$ such that whenever the sparsity level satisfies
$$\frac{64 \log k}{k} \;\leq\; \theta \;\leq\; \min\left\{ \frac{1}{48^2\, \mu^2\, k \log^2 k},\;\; \frac{1}{4\cdot 640\, C^{1/4}} \Big(\frac{3}{C\mu\kappa^2}\Big)^{2/3} \frac{1}{k\,\big(1 + 36\mu^2 k \log k\big)^{2}} \right\}$$
and the signal length satisfies
$$m \;\geq\; C'\, \max\left\{ \frac{k^3 \big(1 + 36\mu^2 k \log k\big)^4 \log(\kappa k)}{\sigma_{\min}^2},\;\; \frac{\min\{\mu^{-4/3},\, \kappa^2 k^2\}}{(1-\theta)^2 \sigma_{\min}^2}\, \kappa^8 k^4 \log^3\!\frac{\kappa k}{(1-\theta)\sigma_{\min}} \right\},$$
then with high probability Algorithm 1 recovers an $a$ that is close to $P_{\mathbb{S}}\big[\iota_k^* s_\tau[\tilde a_0]\big]$ in the $\ell^2$ norm, for some integer shift $-(k-1) \leq \tau \leq k-1$.

For a generic $a_0 \in \mathbb{S}^{k-1}$, plugging in the numerical estimates of the parameters $\sigma_{\min}$, $\kappa$, and $\mu$ (Figure 3), accurate recovery can be obtained with $m \gtrsim \theta^2 k^6\, \mathrm{poly}\log(k)$ measurements and sparsity level $\theta \lesssim k^{-2/3}\, \mathrm{poly}\log(k)$. For bandpass kernels $a_0$, $\sigma_{\min}$ is smaller and $\kappa, \mu$ are larger, and so our results require the observation to be longer and $x_0$ to be sparser.

## 3 Optimization Function Landscape

We next briefly present the key elements in deriving the main results of this paper. We first investigate the stationary points of the population objective $\mathbb{E}_{x_0}[\psi(q)]$ and demonstrate that any local minimizer in $\mathcal{R}_C$ is close to a signed column of $A$, i.e., a preconditioned shift truncation of $a_0$. We then demonstrate that when $m$ is sufficiently large, the finite-sample objective $\psi(q)$ has similar properties. Using $\mathbb{E}[YY^T] = \theta m\, A_0 A_0^T$ again, the expectation of the objective function $\psi(q)$ can be approximated as
$$\mathbb{E}[\psi(q)] \;=\; \mathbb{E}\Big[-\tfrac14 \big\| Y^T (\theta m\, A_0 A_0^T)^{-1/2} q \big\|_4^4\Big] \;\approx\; -\frac{3(1-\theta)}{4\theta m}\, \|A^T q\|_4^4 \;+\; \mathrm{const}. \qquad (22)$$
This approximation can be made rigorous (see Lemma 2.1 of the supplementary material), allowing us to study the critical points of $\mathbb{E}[\psi]$ via the simpler problem
$$\min_{q \in \mathbb{S}^{k-1}}\; \varphi(q) \;:=\; -\tfrac14 \|A^T q\|_4^4 \;=\; -\tfrac14 \|\zeta\|_4^4. \qquad (23)$$
The Euclidean gradient and the Riemannian gradient [26] of $\varphi$ are
$$\nabla \varphi(q) \;=\; -A \zeta^{3}, \qquad \mathrm{grad}[\varphi](q) \;=\; -A \zeta^{3} + q\, \|\zeta\|_4^4, \qquad (24)$$
where $\zeta^3$ denotes the entrywise cube.

### 3.1 Critical Points of the Population Objective

We wish to argue that every local minimizer of $\varphi$ is close to a preconditioned shift truncation $a_i$. We do this by showing that at any other critical point there is a direction of strict negative curvature. We will show that at any critical point $q \in \mathcal{R}_4$, the correlation $\zeta$ exhibits a very special structure:

(P) The entries $\zeta_i = \langle a_i, q \rangle$ are either close to zero, or have magnitude $|\zeta_i|$ close to $\|\zeta\|_4^2 / \|a_i\|_2$.

We can demonstrate this property directly from the stationarity condition $\mathrm{grad}[\varphi](q) = 0$:
$$A\zeta^{3} - q\,\|\zeta\|_4^4 = 0 \quad\Longrightarrow\quad A^T A \zeta^{3} - A^T q\, \|\zeta\|_4^4 = 0. \qquad (25)$$
The $i$-th entry $\zeta_i$ of the correlation $\zeta$ therefore satisfies the cubic equation
$$\|a_i\|_2^2\, \zeta_i^3 + \sum_{j \neq i} \langle a_i, a_j \rangle\, \zeta_j^3 - \zeta_i\, \|\zeta\|_4^4 = 0
\quad\Longleftrightarrow\quad
\zeta_i^3 - \zeta_i \underbrace{\frac{\|\zeta\|_4^4}{\|a_i\|_2^2}}_{\alpha_i} = - \underbrace{\frac{\sum_{j\neq i}\langle a_i, a_j\rangle\,\zeta_j^3}{\|a_i\|_2^2}}_{\beta_i}. \qquad (26)$$
If $\alpha_i^{3/2} \gg |\beta_i|$, the roots of (26) are either very close to $0$ or very close to $\pm\sqrt{\alpha_i}$. This condition obtains whenever $\|A^T q\|_4^6 \geq 4\mu\, \|A^T q\|_3^3$, and hence on $\mathcal{R}_4$ every critical point satisfies property (P).
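The population objective (23) and its Riemannian gradient (24) are straightforward to evaluate. The sketch below, which reuses the hypothetical helpers `shift_truncation_matrix` and `preconditioned_shifts` from the earlier sketches (the names are ours), illustrates that a normalized column $P_{\mathbb{S}}[a_l]$ of $A$ is an approximate critical point whose correlation vector $\zeta$ is dominated by a single entry; the picture becomes cleaner as $k$ grows and the coherence $\mu$ shrinks.

```python
import numpy as np

def pop_objective_and_rgrad(q, A):
    """phi(q) = -(1/4)||A^T q||_4^4 and its Riemannian gradient -A zeta^3 + q ||zeta||_4^4."""
    zeta = A.T @ q
    return -0.25 * np.sum(zeta ** 4), -A @ zeta ** 3 + q * np.sum(zeta ** 4)

rng = np.random.default_rng(0)
k = 500
a0 = rng.standard_normal(k)
a0 /= np.linalg.norm(a0)
A0 = shift_truncation_matrix(a0)                 # from the earlier sketch
A, mu = preconditioned_shifts(A0)                # from the earlier sketch
l = k - 1                                        # the untruncated shift (middle column of A)
q = A[:, l] / np.linalg.norm(A[:, l])            # P_S[a_l]
phi, rgrad = pop_objective_and_rgrad(q, A)
zeta = np.abs(A.T @ q)
print(np.linalg.norm(rgrad), np.sort(zeta)[-3:]) # small gradient; one clearly dominant entry
```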
### 3.2 Asymptotic Function Landscape on $\mathcal{R}_C$

The local optimization landscape around a stationary point $q$ is characterized by the Riemannian Hessian: if $\mathrm{Hess}[\varphi](q)$ is positive semidefinite, the function is locally convex and $q$ is a local minimum; if $\mathrm{Hess}[\varphi](q)$ has a negative eigenvalue, there exists a direction along which the objective value decreases, and $q$ is a saddle point. Technically, on $\mathcal{R}_C$ with $C \geq 10$, the minimum eigenvalue of the Riemannian Hessian can be controlled through the spikiness of $\zeta$. First, we show that once constrained to $\mathcal{R}_C$ with $C \geq 10$, any stationary point must have a cross-correlation $\zeta$ with entries of nontrivial magnitude; that is, the entries of $\zeta$ cannot all be simultaneously close to $0$. Geometrically, this implies that any stationary point $q \in \mathcal{R}_C$ is "close" to certain preconditioned shift truncations.

**Lemma 3.1.** For any stationary point $q \in \mathcal{R}_C$ with $C \geq 10$, the entries of $\zeta = A^T q$ cannot all be bounded in magnitude by $2\mu \|\zeta\|_3^3 / \|\zeta\|_4^4$.

**Local minima.** If $q$ is a stationary point in $\mathcal{R}_C$ with $C \geq 10$ and $\zeta$ has only one entry $\zeta_l$ with magnitude larger than $2\mu\|\zeta\|_3^3/\|\zeta\|_4^4$, then the Riemannian Hessian $\mathrm{Hess}[\varphi](q)$ is positive definite and the function is locally convex. In addition, $|\langle q, a_l \rangle| > \big(1 - 2C^{-1}\kappa^{-2}\big)\,\|a_l\|_2$, hence such a $q$ is a local minimum near $a_l$.

**Lemma 3.2.** Suppose $q$ is a stationary point in $\mathcal{R}_C$ with $C \geq 10$, and $\zeta = A^T q$ has only one entry $\zeta_l$ of magnitude no smaller than $2\mu\|\zeta\|_3^3/\|\zeta\|_4^4$. Then $q$ is a local minimum near $a_l$, with $|\langle q, P_{\mathbb{S}}[a_l]\rangle| > 1 - 2c\,\kappa^{-2}$ and $c = 1/C$.

**Saddle points.** If $q$ is a stationary point in $\mathcal{R}_C$ with $C \geq 10$ and $\zeta$ has more than one nontrivial entry, then the Riemannian Hessian $\mathrm{Hess}[\varphi](q)$ has negative eigenvalue(s) and $q$ is a saddle point. In particular, for any two nontrivial entries $\zeta_l$ and $\zeta_{l'}$, there is a direction of negative curvature in the span of $a_l$ and $a_{l'}$.

**Lemma 3.3.** Suppose $q$ is a stationary point in $\mathcal{R}_C$ with $C \geq 10$, and $\zeta = A^T q$ has at least two entries $\zeta_l$ and $\zeta_{l'}$ with magnitude larger than $2\mu\|\zeta\|_3^3/\|\zeta\|_4^4$. Then the Riemannian Hessian at $q$ has negative eigenvalue(s) and $q$ is a saddle point.

### 3.3 Finite Sample Concentration

We argue that the critical points of the finite-sample objective $\psi(q)$ are similar to those of the asymptotic objective $\varphi(q)$:

- **Critical points are close.** The Riemannian gradient concentrates, so that there is a bijection between critical points $q_{\mathrm{pop}}$ of $\varphi$ and critical points $q_{\mathrm{fs}}$ of $\psi$ with $\|q_{\mathrm{pop}} - q_{\mathrm{fs}}\|_2$ small.
- **Curvature is preserved.** The Riemannian Hessian concentrates, so that $\mathrm{Hess}[\psi](q_{\mathrm{fs}})$ has a negative eigenvalue if and only if $\mathrm{Hess}[\varphi](q_{\mathrm{pop}})$ does, and $\mathrm{Hess}[\psi](q_{\mathrm{fs}})$ is positive definite if and only if $\mathrm{Hess}[\varphi](q_{\mathrm{pop}})$ is.

This implies that every local minimizer of the finite-sample objective is close to a preconditioned shift truncation. While conceptually straightforward, the proofs of these properties are somewhat involved due to the presence of the preconditioner $(YY^T)^{-1/2}$. We give rigorous versions of all of the above statements, together with complete proofs, in the supplementary appendix.

## 4 Experiments

**Properties of a random kernel.** In our main result, the admissible sparsity rate $\theta$ depends on the condition number $\kappa$ and the induced column coherence $\mu$. Figure 3 plots the average values (over 100 independent simulations) of $\sigma_{\min}$, $\kappa$, and $\mu$ for generic unit kernels of varying dimension $k = 10, 20, \dots, 1000$.

*Figure 3: Coherence of random kernels. Average of $\sigma_{\min}$ (left), $\kappa$ (middle), and $\mu$ (right) over 100 independent trials, for varying kernel length $k$.*

These simulations suggest the following estimates:
$$\sigma_{\min} \sim \log^{-1}(k), \qquad \kappa \sim \log^{4/3}(k), \qquad \mu \sim \sqrt{\log(k)/k}. \qquad (27)$$
Hence, reliable recovery of a shift truncation of a generic kernel can be guaranteed even when the sparse signal is relatively dense ($\theta \sim k^{-2/3}$). On the other hand, if the convolution kernel $a_0$ is lowpass, then $\sigma_{\min}$ decreases and $\kappa, \mu$ increase, so more observations $m$ and a smaller sparsity level $\theta$ are required for the proposed algorithm to perform as desired.
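The quantities plotted in Figure 3 are easy to reproduce by Monte Carlo. The sketch below is our own illustration (names are ours) and reuses `shift_truncation_matrix` and `preconditioned_shifts` from the earlier sketches; its output should roughly track the trends in (27).

```python
import numpy as np

def kernel_statistics(k, n_trials=100, seed=0):
    """Monte-Carlo estimates of (sigma_min, kappa, mu) for generic unit kernels of length k,
    as in Figure 3."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_trials):
        a0 = rng.standard_normal(k)
        a0 /= np.linalg.norm(a0)
        A0 = shift_truncation_matrix(a0)
        s = np.linalg.svd(A0, compute_uv=False)
        _, mu = preconditioned_shifts(A0)
        out.append((s[-1], s[0] / s[-1], mu))        # (sigma_min, kappa, mu)
    return np.mean(out, axis=0)

# e.g. kernel_statistics(100) vs. kernel_statistics(1000): sigma_min decays roughly like
# 1/log k, kappa grows roughly like log^{4/3} k, and mu decays roughly like sqrt(log(k)/k).
```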
**Recovery error of the proposed algorithm.** We present the performance of Algorithm 1 under varying settings. We define the recovery error as $\mathrm{err} = 1 - \max_\tau \big|\langle a, P_{\mathbb{S}}[\iota_k^* s_\tau[\tilde a_0]]\rangle\big|$ and report the average error over 50 independent experiments. The left panel of Figure 4 plots the average error when we fix the kernel size $k = 50$ and vary the signal length $m$ and the sparsity $\theta$ of $x_0$; the horizontal axis is indexed by the overlapping ratio $k\theta$, which indicates how many copies of $a_0$ appear in a $k$-length window of $y$ on average. The right panel plots the average error when we vary the dimensions $k$ and $m$ of both signals, setting the sparsity to $\theta = k^{-2/3}$.

*Figure 4: Recovery error of the shift-truncated kernel by Algorithm 1. Left: average error for $k = 50$, as a function of the signal length $m$ and the overlapping ratio $k\theta$. Right: average error as a function of the kernel size $k$ and the signal length $m$, with sparsity $\theta = k^{-2/3}$.*
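The error metric above is simple to evaluate directly; the sketch below is our own reading of it (the function name and arguments are ours). Combined with `generate_observation` and `deconvolve` from the earlier sketches, it allows a small-scale reproduction of the qualitative behavior in Figure 4.

```python
import numpy as np

def recovery_error(a, a0, m):
    """err = 1 - max_tau |<a, P_S[ iota_k^* s_tau[a0_tilde] ]>| over shifts |tau| <= k-1."""
    k = len(a0)
    a0_tilde = np.zeros(m)
    a0_tilde[:k] = a0
    best = 0.0
    for tau in range(-(k - 1), k):
        trunc = np.roll(a0_tilde, tau)[:k]           # iota_k^* s_tau[a0_tilde]
        trunc /= np.linalg.norm(trunc)               # P_S[.]
        best = max(best, abs(float(np.dot(a, trunc))))
    return 1.0 - best
```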
**Acknowledgement.** The authors gratefully acknowledge support from NSF 1343282, NSF CCF 1527809, NSF CCF 1740833, and NSF IIS 1546411.

## References

[1] Sky Cheung, Yenson Lau, Zhengyu Chen, Ju Sun, Yuqian Zhang, John Wright, and Abhay Pasupathy. Beyond the Fourier transform: A nonconvex optimization approach to microscopy analysis. Submitted, 2017.
[2] M. S. Lewicki. A review of methods for spike sorting: the detection and classification of neural action potentials. Network: Computation in Neural Systems, 9(4):53-78, 1998.
[3] D. Kundur and D. Hatzinakos. Blind image deconvolution. IEEE Signal Processing Magazine, 13(3):43-64, May 1996.
[4] Chaitanya Ekanadham, Daniel Tranchina, and Eero P. Simoncelli. A blind sparse deconvolution method for neural spike identification. In Advances in Neural Information Processing Systems 24, pages 1440-1448, 2011.
[5] T. Chan and C. Wong. Total variation blind deconvolution. IEEE Transactions on Image Processing, 7(3):370-375, March 1998.
[6] A. Levin, Y. Weiss, F. Durand, and W. Freeman. Understanding blind deconvolution algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2354-2367, December 2011.
[7] A. Ahmed, B. Recht, and J. Romberg. Blind deconvolution using convex programming. arXiv preprint arXiv:1211.5608, 2012.
[8] Xiaodong Li, Shuyang Ling, Thomas Strohmer, and Ke Wei. Rapid, robust, and reliable blind deconvolution via nonconvex optimization. Preprint, 2016.
[9] Shuyang Ling and Thomas Strohmer. Self-calibration and biconvex compressive sensing. Inverse Problems, 31(11):115002, 2015.
[10] Shuyang Ling and Thomas Strohmer. Blind deconvolution meets blind demixing: Algorithms and performance bounds. IEEE Transactions on Information Theory, 63(7):4497-4520, 2017.
[11] Yuejie Chi. Guaranteed blind sparse spikes deconvolution via lifting and convex optimization. IEEE Journal of Selected Topics in Signal Processing, 10(4):782-794, June 2016.
[12] A. Benichoux, E. Vincent, and R. Gribonval. A fundamental pitfall in blind deconvolution with sparse and shift-invariant priors. In 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2013.
[13] Daniele Perrone and Paolo Favaro. Total variation blind deconvolution: The devil is in the details. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[14] David Wipf and Haichao Zhang. Revisiting Bayesian blind deconvolution. arXiv preprint arXiv:1305.2362, 2013.
[15] Haichao Zhang, David Wipf, and Yanning Zhang. Multi-image blind deblurring using a coupled adaptive sparse prior. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[16] Yuqian Zhang, Yenson Lau, Han-Wen Kuo, Sky Cheung, Abhay Pasupathy, and John Wright. On the global geometry of sphere-constrained sparse blind deconvolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[17] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887, 2017.
[18] Peng Xu, Farbod Roosta-Khorasani, and Michael W. Mahoney. Second-order optimization for non-convex machine learning: An empirical study. arXiv preprint arXiv:1708.07827, 2017.
[19] L. Wang and Y. Chi. Blind deconvolution from multiple sparse inputs. IEEE Signal Processing Letters, 23(10):1384-1388, October 2016.
[20] Daniel Spielman, Huan Wang, and John Wright. Exact recovery of sparsely-used dictionaries. Preprint, 2012.
[21] Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery over the sphere. Preprint, 2015.
[22] Qing Qu, Ju Sun, and John Wright. Finding a sparse vector in a subspace: Linear sparsity using alternating directions. IEEE Transactions on Information Theory, 2016.
[23] Samuel B. Hopkins, Tselil Schramm, Jonathan Shi, and David Steurer. Fast spectral algorithms from sum-of-squares proofs: Tensor decomposition and planted sparse vectors. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing (STOC), pages 178-191, 2016.
[24] P.-A. Absil, C. G. Baker, and K. A. Gallivan. Trust-region methods on Riemannian manifolds. Foundations of Computational Mathematics, 7(3):303-330, July 2007.
[25] Donald Goldfarb, Zaiwen Wen, and Wotao Yin. A curvilinear search method for p-harmonic flows on spheres. SIAM Journal on Imaging Sciences, 2(1):84-109, 2009.
[26] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ, 2007.