# Geometry and Symmetry in Short-and-Sparse Deconvolution

Han-Wen Kuo 1 2, Yuqian Zhang 3, Yenson Lau 1 2, John Wright 1 2 4

We study the Short-and-Sparse (SaS) deconvolution problem of recovering a short signal a0 and a sparse signal x0 from their convolution. We propose a method based on nonconvex optimization, which under certain conditions recovers the target short and sparse signals, up to a signed shift symmetry which is intrinsic to this model. This symmetry plays a central role in shaping the optimization landscape for deconvolution. We give a regional analysis, which characterizes this landscape geometrically, on a union of subspaces. Our geometric characterization holds when the length-p0 short signal a0 has shift coherence µ and x0 follows a random sparsity model with rate θ ∈ [ c1/p0 , c2/(p0√µ + √p0) ] · 1/log² p0. Based on this geometry, we give a provable method that successfully solves SaS deconvolution with high probability.

1. Introduction

Datasets in a wide range of areas, including neuroscience (Lewicki, 1998), microscopy (Cheung et al., 2017) and astronomy (Saha, 2007), can be modeled as superpositions of translations of a basic motif. Data of this nature can be modeled mathematically as a convolution y = a0 ⊛ x0, between a short signal a0 (the motif) and a longer sparse signal x0, whose nonzero entries indicate where in the sample the motif is present. A very similar structure arises in image deblurring (Chan & Wong, 1998), where y is a blurry image, a0 the blur kernel, and x0 the (edge map of the) target sharp image.

1 Data Science Institute, Columbia University, New York City, NY, USA 2 Department of Electrical Engineering, Columbia University, New York City, NY, USA 3 Department of Computer Science, Cornell University, Ithaca, NY, USA 4 Department of Applied Math and Applied Physics, Columbia University, New York City, NY, USA. Correspondence to: Han-Wen Kuo.
Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Motivated by these and related problems in imaging and scientific data analysis, we study the Short-and-Sparse (SaS) Deconvolution problem of recovering a short signal a0 ∈ R^{p0} and a sparse signal x0 ∈ R^n (n ≫ p0) from their length-n cyclic convolution¹ y = a0 ⊛ x0 ∈ R^n. This SaS model exhibits a basic scaled shift symmetry: for any nonzero scalar α and cyclic shift s_ℓ[·],

  (α s_ℓ[a0]) ⊛ (α⁻¹ s_{−ℓ}[x0]) = y.  (1.1)

Because of this symmetry, we only expect to recover a0 and x0 up to a signed shift (see Figure 1). Our problem of interest can be stated more formally as:

Problem 1.1 (Short-and-Sparse Deconvolution). Given the cyclic convolution y = a0 ⊛ x0 ∈ R^n of a0 ∈ R^{p0} short (p0 ≪ n) and x0 ∈ R^n sparse, recover a0 and x0, up to a scaled shift.

Despite a long history and many applications, until recently very little algorithmic theory was available for SaS deconvolution. Much of this difficulty can be attributed to the scale-shift symmetry: natural convex relaxations fail, and nonconvex formulations exhibit a complicated optimization landscape, with many equivalent global minimizers (scaled shifts of the ground truth), additional local minimizers (scaled shift truncations of the ground truth), and a variety of critical points (Zhang et al., 2017; 2018). Currently available theory guarantees approximate recovery of a truncation² of a shift s_ℓ[a0], rather than guaranteeing recovery of a0 as a whole, and requires certain (complicated) conditions on the convolution matrix associated with a0 (Zhang et al., 2018). In this paper, we describe an algorithm which, under simpler conditions, exactly recovers a scaled shift of the pair (a0, x0). Our algorithm is based on a formulation first introduced in (Zhang et al., 2017), which casts the deconvolution problem as (nonconvex) optimization over the sphere.
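The cyclic convolution model and the scaled shift symmetry (1.1) are easy to check numerically. Below is a minimal NumPy sketch, with illustrative (hypothetical) signal sizes and a Bernoulli-Gaussian sparse signal; the circular convolution is computed via the FFT.

```python
import numpy as np

def cconv(a, x):
    """Length-n cyclic convolution y = a (*) x, with the short a zero-padded."""
    n = len(x)
    a_pad = np.concatenate([a, np.zeros(n - len(a))])
    return np.real(np.fft.ifft(np.fft.fft(a_pad) * np.fft.fft(x)))

rng = np.random.default_rng(0)
p0, n, theta = 16, 256, 0.05
a0 = rng.standard_normal(p0)
a0 /= np.linalg.norm(a0)                                # unit-norm short motif
x0 = (rng.random(n) < theta) * rng.standard_normal(n)   # sparse (Bernoulli-Gaussian)

y = cconv(a0, x0)

# Scaled shift symmetry (1.1): (alpha s_l[a0]) (*) (alpha^{-1} s_{-l}[x0]) = y.
alpha, ell = 2.0, 5
a_shift = alpha * np.roll(np.concatenate([a0, np.zeros(n - p0)]), ell)
x_shift = np.roll(x0, -ell) / alpha
y_shift = np.real(np.fft.ifft(np.fft.fft(a_shift) * np.fft.fft(x_shift)))
assert np.allclose(y, y_shift)
```

The final assertion confirms that the scaled, shifted pair produces exactly the same observation, which is why recovery is only possible up to this symmetry.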
We characterize the geometry of this objective function, and show that near a certain union of subspaces, every local minimizer is very close to a signed shift of a0.

¹Our result applies to direct convolution by zero padding both a0 and x0. ²I.e., the portion of the shifted signal s_ℓ[a0] that falls in the window {0, …, p0 − 1}.

Figure 1. Shift symmetry in Short-and-Sparse deconvolution. An observation y (left), the convolution of a short signal a0 and a sparse signal x0 (top right), can be equivalently expressed as the convolution of s_ℓ[a0] and s_{−ℓ}[x0], where s_ℓ[·] denotes a shift by ℓ samples. The ground truth signals a0 and x0 can only be identified up to a scaled shift.

Based on this geometric analysis, we give provable methods for SaS deconvolution that exactly recover a scaled shift of (a0, x0) whenever a0 is shift-incoherent and x0 is a sufficiently sparse random vector. Our geometric analysis highlights the role of symmetry in shaping the objective landscape for SaS deconvolution.

The remainder of this paper is organized as follows. Section 2 introduces our optimization approach and modeling assumptions. Section 3 introduces our main results, both geometric and algorithmic, and compares them to the literature. In Section 4 we present an experimental result which corroborates our theoretical claims. Finally, Section 5 discusses two main limitations of our analysis and describes directions for future work.

2. Formulation and Assumptions

2.1. Nonconvex SaS over the Sphere

Our starting point is the (natural) formulation

  min_{a,x} ½‖a ⊛ x − y‖₂² + λ‖x‖₁  s.t. ‖a‖₂ = 1.  (2.1)

We term this optimization problem the Bilinear Lasso, for its resemblance to the Lasso estimator in statistics. Indeed, letting

  ϕ_lasso(a) ≡ min_x { ½‖a ⊛ x − y‖₂² + λ‖x‖₁ }  (2.2)

denote the optimal Lasso cost, we see that (2.1) simply optimizes ϕ_lasso with respect to a:

  min_a ϕ_lasso(a)  s.t. ‖a‖₂ = 1.  (2.3)

In (2.1)-(2.3), we constrain a to have unit ℓ2 norm.
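For fixed unit-norm a, the inner minimization in (2.2) is a standard Lasso problem, so the optimal Lasso cost can be approximated with off-the-shelf proximal gradient descent (ISTA). The sketch below assumes the circular-convolution convention above with a zero-padded kernel; it is an illustration of the quantity ϕ_lasso, not the solver analyzed in this paper.

```python
import numpy as np

def phi_lasso(a, y, lam, iters=300):
    """Approximate the optimal Lasso cost (2.2),
    min_x 0.5 ||a (*) x - y||_2^2 + lam ||x||_1, by ISTA for fixed a."""
    n = len(y)
    A = np.fft.fft(np.concatenate([a, np.zeros(n - len(a))]))
    conv = lambda v: np.real(np.fft.ifft(A * np.fft.fft(v)))
    corr = lambda v: np.real(np.fft.ifft(np.conj(A) * np.fft.fft(v)))  # adjoint
    L = np.max(np.abs(A))**2        # Lipschitz constant of the smooth part
    x = np.zeros(n)
    for _ in range(iters):
        x = x - corr(conv(x) - y) / L                              # gradient step
        x = np.sign(x) * np.maximum(np.abs(x) - lam / L, 0.0)      # soft-threshold
    return 0.5 * np.sum((conv(x) - y)**2) + lam * np.sum(np.abs(x)), x
```

Since ISTA with step 1/L is a descent method started at x = 0, the returned cost never exceeds the cost at x = 0, namely ½‖y‖₂².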
This constraint breaks the scale ambiguity between a and x. Moreover, the choice of constraint manifold has surprisingly strong implications for computation: if a is instead constrained to the simplex, the problem admits trivial global minimizers. In contrast, local minima of the sphere-constrained formulation often correspond to shifts (or shift truncations (Zhang et al., 2017)) of the ground truth a0.

The problem (2.3) is defined in terms of the optimal Lasso cost. This function is challenging to analyze, especially far away from a0. (Zhang et al., 2017) analyzes the local minima of a simplification of (2.3), obtained by approximating³ the data fidelity term as

  ½‖a ⊛ x − y‖₂² = ½‖a ⊛ x‖₂² − ⟨a ⊛ x, y⟩ + ½‖y‖₂² ≈ ½‖x‖₂² − ⟨a ⊛ x, y⟩ + ½‖y‖₂².  (2.4)

This yields a simpler objective function

  ϕ_ℓ1(a) = min_x { ½‖x‖₂² − ⟨a ⊛ x, y⟩ + ½‖y‖₂² + λ‖x‖₁ }.  (2.5)

We make one further simplification to this problem, replacing the nondifferentiable penalty ‖·‖₁ with a smooth approximation ρ(x).⁴ Our analysis allows for a variety of smooth sparsity surrogates ρ(x); for concreteness, we state our main results for the particular penalty⁵

  ρ(x) = Σ_i ( x_i² + δ² )^{1/2}.  (2.6)

For δ > 0, this is a smooth function of x; as δ → 0, it approaches ‖x‖₁. Replacing ‖·‖₁ with ρ(·), we obtain the objective function which will be our main object of study,

  ϕ_ρ(a) = min_x { ½‖x‖₂² − ⟨a ⊛ x, y⟩ + ½‖y‖₂² + λρ(x) }.  (2.7)

As in (Zhang et al., 2017), we optimize ϕ_ρ(a) over the sphere S^{p−1}:

  min_a ϕ_ρ(a)  s.t. a ∈ S^{p−1}.  (2.8)

Here, we set p = 3p0 − 2. As we will see, optimizing over this slightly higher dimensional sphere enables us to recover a (full) shift of a0, rather than a truncated shift. Our approach will leverage the following fact: if we view a ∈ S^{p−1} as indexed by coordinates W = {−p0 + 1, …, 2p0 − 1}, then for any shift ℓ ∈ {−p0 + 1, …, p0 − 1}, the support of the ℓ-shifted short signal s_ℓ[a0] is entirely contained in the window W. We will give a provable method which recovers a scaled version of one of these canonical shifts.

2.2.
Analysis Setting and Assumptions

For convenience, we assume that a0 has unit ℓ2 norm, i.e., a0 ∈ S^{p0−1}.⁶ Our analysis makes two main assumptions, on the short motif a0 and the sparse map x0, respectively:

³For a generic a, we have ⟨s_i[a], s_j[a]⟩ ≈ 0 for i ≠ j, and hence ‖a ⊛ x‖₂² ≈ ‖e_0 ⊛ x‖₂² = ‖x‖₂². ⁴The objective ϕ_ℓ1 is not twice differentiable everywhere, and hence cannot be minimized with conventional second order methods. ⁵This surrogate is often named the pseudo-Huber function. ⁶This is purely a technical convenience. Our theory guarantees recovery of a signed shift (±s_ℓ[a0], ±s_{−ℓ}[x0]) of the truth. If ‖a0‖₂ ≠ 1, identical reasoning implies that our method recovers a scaled shift (α s_ℓ[a0], α⁻¹ s_{−ℓ}[x0]) with α = 1/‖a0‖₂.

Figure 2. Sparsity-coherence tradeoff: three families of motifs a0 with varying coherence µ (top), with their maximum allowable sparsity θ and number of copies θp0 within each length-p0 window (bottom): Spiky (µ ≈ 0, θ ≈ p0^{−1/2}, about √p0 events every p0 samples); Generic (µ ≈ p0^{−1/2}, θ ≈ p0^{−3/4}, about p0^{1/4} events every p0 samples); Tapered Generic Lowpass (µ ≈ β, θ ≈ p0^{−1}, about 1 event every p0 samples). When the target motif has smaller shift-coherence µ, our result allows larger θ, and vice versa. This sparsity-coherence tradeoff is made precise in our main result, Theorem 3.1, which, loosely speaking, asserts that when θ ≲ 1/(p0√µ + √p0), our method succeeds.

The first assumption is that distinct shifts of a0 have small inner product. We define the shift coherence µ(a0) to be the largest inner product between distinct shifts:

  µ(a0) = max_{ℓ≠0} |⟨a0, s_ℓ[a0]⟩|.  (2.9)

µ(a0) is bounded between 0 and 1. Our theory allows any µ smaller than some numerical constant. Figure 2 shows three examples of families of a0 that satisfy this assumption:

Spiky. When a0 is close to the Dirac delta δ0, the shift coherence µ(a0) ≈ 0.⁷ Here, the observed signal y consists of a superposition of sharp pulses. This is arguably the easiest instance of SaS deconvolution.

Generic.
If a0 is chosen uniformly at random from the sphere S^{p0−1}, its coherence is bounded as µ(a0) ≲ √(1/p0) with high probability.

Tapered Generic Lowpass. Here, a0 is generated by taking a random conjugate symmetric superposition of the first L length-p0 Discrete Fourier Transform (DFT) basis signals, windowing (e.g., with a Hamming window) and normalizing to unit ℓ2 norm. When L = p0/(1 + β), with high probability µ(a0) ≈ β. In this model, µ does not have to diminish as p0 grows; it can be a fixed constant.⁸

⁷The use of ≈ here suppresses constants and log factors.

Intuitively speaking, problems with smaller µ are easier to solve, a claim which will be made precise in our results.

We assume that x0 is a sparse random vector, following a Bernoulli-Gaussian distribution with rate θ. Concretely speaking, we assume x0i = ωi gi, where ωi ∼ Ber(θ) and gi ∼ N(0, 1), with all random variables jointly independent. We write this as

  x0 ∼ i.i.d. BG(θ).  (2.10)

Here, θ is the probability that a given entry x0i is nonzero. Problems with smaller θ are easier to solve. In the extreme case, when θ ≪ 1/p0, the observation y contains many isolated copies of the motif a0, and a0 can be determined by direct inspection. Our analysis will focus on the nontrivial scenario θ ≳ 1/p0. Our technical results will articulate sparsity-coherence tradeoffs, in which smaller coherence µ enables larger θ, and vice versa. More specifically, in our main theorem, the sparsity-coherence relationship is captured in the form

  θ ≲ 1/(p0√µ + √p0).  (2.11)

When a0 is very shift-incoherent (µ ≈ 0), our method succeeds when each length-p0 window contains about √p0 copies of a0. When µ is larger (as in the generic lowpass model), our method succeeds as long as relatively few copies of a0 overlap in the observed signal. In Figure 2, we illustrate these tradeoffs for the three models described above.

3. Main Results: Geometry and Algorithms

3.1.
Geometry of the Objective ϕ_ρ

The goal in SaS deconvolution is to recover a0 (and x0) up to a signed shift, i.e., we wish to recover some ±s_ℓ[a0]. The shifts s_ℓ[a0] play a key role in shaping the landscape of ϕ_ρ. In particular, we will argue that over a certain subset of the sphere, every local minimum of ϕ_ρ is close to some ±s_ℓ[a0].

To gain intuition into the properties of ϕ_ρ, we first visualize this function in the vicinity of a single shift s_ℓ[a0] of the ground truth a0. In Figure 3-(a), we plot the function value of ϕ_ρ over B_{ℓ2,r}(s_ℓ[a0]) ∩ S^{p−1}, where B_{ℓ2,r}(a) is a ball of radius r around a. We make two observations:

⁸The upper right panel of Figure 2 is generated using random DFT components with frequencies smaller than one-third Nyquist. Such a kernel is incoherent with high probability. Many commonly occurring low-pass kernels have µ(a0) very close to one. One of the most important limitations of our results is that they do not provide guarantees in this highly coherent situation.

Figure 3. Geometry of ϕ_ρ near the span of shift(s) of a0. (i) A portion of the sphere S^{p−1} near s_ℓ[a0] (bottom), colored according to the height of ϕ_ρ (top); ϕ_ρ is strongly convex in this region, and it has a minimizer very close to s_ℓ[a0]. (ii) Each pair of shifts s_ℓ1[a0], s_ℓ2[a0] defines a linear subspace S_{ℓ1,ℓ2} of R^p (left), in which every local minimum of ϕ_ρ near S_{ℓ1,ℓ2} is close to either ±s_ℓ1[a0] or ±s_ℓ2[a0] (right); there is negative curvature in the middle between s_ℓ1[a0] and s_ℓ2[a0], and ϕ_ρ is convex in directions away from S_{ℓ1,ℓ2}. (iii) The subspace S_{ℓ1,ℓ2,ℓ3} is three-dimensional; its intersection with the sphere S^{p−1} is isomorphic to a two-dimensional sphere (left). On this set, ϕ_ρ has local minimizers near each of the ±s_ℓi[a0], and these are the only minimizers near S_{ℓ1,ℓ2,ℓ3} (right).

The objective function ϕ_ρ is strongly convex on this neighborhood of s_ℓ[a0].
There is a local minimizer very close to s_ℓ[a0].

We next visualize the objective function ϕ_ρ near the linear span of two different shifts s_ℓ1[a0] and s_ℓ2[a0]. More precisely, we plot ϕ_ρ near the intersection (Figure 3-(b)) of the sphere S^{p−1} and the linear subspace

  S_{ℓ1,ℓ2} = { α1 s_ℓ1[a0] + α2 s_ℓ2[a0] | α ∈ R² }.

We make three observations:

Again, there is a local minimizer near each shift ±s_ℓi[a0].

These are the only local minimizers in the vicinity of S_{ℓ1,ℓ2}. In particular, the objective function ϕ_ρ exhibits negative curvature along S_{ℓ1,ℓ2} at any superposition α1 s_ℓ1[a0] + α2 s_ℓ2[a0] whose weights α1 and α2 are balanced, i.e., |α1| ≈ |α2|.

Furthermore, the function ϕ_ρ exhibits positive curvature in directions away from the subspace S_{ℓ1,ℓ2}.

Finally, we visualize ϕ_ρ over the intersection (Figure 3-(c)) of the sphere S^{p−1} with the linear span of three shifts s_ℓ1[a0], s_ℓ2[a0], s_ℓ3[a0] of the true kernel a0:

  S_{ℓ1,ℓ2,ℓ3} = { α1 s_ℓ1[a0] + α2 s_ℓ2[a0] + α3 s_ℓ3[a0] | α ∈ R³ }.

Again, there is a local minimizer near each signed shift. At roughly balanced superpositions of shifts, the objective function exhibits negative curvature. As a result, again, the only local minimizers are close to signed shifts.

Our main geometric result will show that these properties obtain on every subspace spanned by a few shifts of a0. Indeed, for each subset

  τ ⊆ {−p0 + 1, …, p0 − 1},  (3.1)

define a linear subspace

  S_τ = { Σ_{ℓ∈τ} α_ℓ s_ℓ[a0] | α ∈ R^{2p0−1} }.  (3.2)

The subspace S_τ is the linear span of the shifts s_ℓ[a0] indexed by ℓ in the set τ. Our geometric theory will show that with high probability the function ϕ_ρ has no spurious local minimizers near any S_τ for which τ is not too large, say, |τ| ≤ 4θp0. Combining all of these subspaces into a single geometric object, define the union of subspaces

  Σ_{4θp0} = ∪_{|τ| ≤ 4θp0} S_τ.  (3.3)

Figure 4 gives a schematic portrait of this set. We claim:

In the neighborhood of Σ_{4θp0}, all local minimizers are near signed shifts.

The value of ϕ_ρ grows in directions away from Σ_{4θp0}.
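These claims concern the landscape of ϕ_ρ, which is itself inexpensive to evaluate: since ⟨a ⊛ x, y⟩ = ⟨x, b⟩ with b the correlation of a with y, the inner problem in (2.7) separates across coordinates of x. The sketch below (assuming circular convolution with a zero-padded kernel, as an illustration only) solves each smooth, strongly convex one-dimensional subproblem by gradient descent with a conservative step size.

```python
import numpy as np

def phi_rho(a, y, lam, delta, iters=500):
    """Evaluate phi_rho(a) of (2.7). With b = (correlation of a with y), the
    inner problem is min_x sum_i [0.5 x_i^2 - b_i x_i + lam sqrt(x_i^2 + delta^2)]
    + 0.5 ||y||^2, which separates coordinate-wise."""
    n = len(y)
    A = np.fft.fft(np.concatenate([a, np.zeros(n - len(a))]))
    b = np.real(np.fft.ifft(np.conj(A) * np.fft.fft(y)))  # correlation of a with y
    x = b.copy()                      # start at the unregularized minimizer
    step = 1.0 / (1.0 + lam / delta)  # 1/L for each 1-D strongly convex subproblem
    for _ in range(iters):
        x -= step * (x - b + lam * x / np.sqrt(x**2 + delta**2))
    return (0.5 * np.sum(x**2) - np.dot(b, x) + 0.5 * np.sum(y**2)
            + lam * np.sum(np.sqrt(x**2 + delta**2))), x
```

Each coordinate objective has curvature between 1 and 1 + λ/δ, so the fixed step 1/(1 + λ/δ) contracts to the unique minimizer; the returned value never exceeds the cost at x = 0, namely ½‖y‖₂² + λnδ.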
Our main result formalizes the above observations, under two key assumptions: first, that the sparsity rate θ is sufficiently small (relative to the shift coherence µ of a0), and, second, that the signal length n is sufficiently large:

Theorem 3.1 (Main Geometric Theorem). Let y = a0 ⊛ x0 with a0 ∈ S^{p0−1} µ-shift coherent and x0 ∼ i.i.d. BG(θ) ∈ R^n with sparsity rate

  θ ∈ [ c1/p0 , c2/(p0√µ + √p0) ] · 1/log² p0.  (3.4)

Choose ρ(x) = Σ_i (x_i² + δ²)^{1/2} and set λ = 0.1/√(p0 θ) in ϕ_ρ. Then there exist δ > 0 and a numerical constant c such that if n ≥ poly(p0), with high probability, every local minimizer ā of ϕ_ρ near Σ_{4θp0} satisfies

  ‖ā − σ s_ℓ[a0]‖₂ ≤ c · max{ µ, p0⁻¹ }

for some signed shift σ s_ℓ[a0] of the true kernel. Above, c1, c2 > 0 are positive numerical constants.

Figure 4. Geometry of ϕ_ρ over the union of subspaces Σ_{4θp0}. We show a schematic representation of the union of subspaces Σ_{4θp0} (left). For each set τ of at most 4θp0 shifts, we have a subspace S_τ near which ϕ_ρ has good geometry (right).

Proof. See Appendix B.

The upper bound on θ in (3.4) yields the tradeoff between coherence and sparsity described in Figure 2. Simply put, when a0 is better conditioned (as a kernel), its coherence µ is smaller and x0 can be denser. At a technical level, our proof of Theorem 3.1 shows that (i) ϕ_ρ(a) is strongly convex in the vicinity of each signed shift, and that at every other point a near Σ_{4θp0} there is either (ii) a nonzero gradient or (iii) a direction of strict negative curvature; furthermore, (iv) the function ϕ_ρ grows away from Σ_{4θp0}. Points (ii)-(iii) imply that near Σ_{4θp0} there are no flat saddles: every saddle point has a direction of strict negative curvature. We will leverage these properties to propose an efficient algorithm for finding a local minimizer near Σ_{4θp0}. Moreover, this minimizer is close enough to a shift (here, ‖ā − s_ℓ[a0]‖₂ ≲ µ) for us to exactly recover s_ℓ[a0]: we will give a refinement algorithm that produces (±s_ℓ[a0], ±s_{−ℓ}[x0]).

3.2.
Provable Algorithm for SaS Deconvolution

The objective function ϕ_ρ has good geometric properties on (and near!) the union of subspaces Σ_{4θp0}. In this section, we use this structure to give an efficient method that exactly recovers a0 and x0, up to shift symmetry. Although our geometric analysis only controls ϕ_ρ near Σ_{4θp0}, we will give a descent method which, with appropriate initialization a(0), produces iterates a(1), …, a(k), … that remain close to Σ_{4θp0} for all k. In short, it is easy to start near Σ_{4θp0} and easy to stay near Σ_{4θp0}. After finding a local minimizer ā, we refine it to produce a signed shift of (a0, x0) using alternating minimization.

Our algorithm starts with an initialization scheme which generates a(0) near the union of subspaces Σ_{4θp0}, which consists of linear combinations of just a few shifts of a0.

Figure 5. Data-driven initialization. We use a piece of the observed data y to generate an initial point that is close to a superposition of shifts s_ℓ[a0] of the ground truth. The data y = a0 ⊛ x0 is a superposition of shifts of the true kernel a0 (top). A length-p0 window of y contains pieces of just a few shifts; starting from this window a(−1), one step of the generalized power method approximately fills in its missing pieces, yielding a(0) ≈ α_i s_i[a0] + α_j s_j[a0], a near superposition of shifts of a0 (bottom).

How can we find a point near this union? Notice that the data y also consists of a linear combination of just a few shifts of a0. Indeed:

  y = a0 ⊛ x0 = Σ_{ℓ ∈ supp(x0)} x0ℓ s_ℓ[a0].  (3.5)

A length-p0 segment of data y_{0,…,p0−1} = [y_0, …, y_{p0−1}]ᵀ captures portions of roughly 2θp0 ≤ 4θp0 shifts s_ℓ[a0]. Many of these copies of a0 are truncated by the restriction to {0, …, p0 − 1}. A relatively simple remedy is as follows: first, we zero-pad y_{0,…,p0−1} to length p = 3p0 − 2, giving

  [ 0_{p0−1} ; y_0 ; … ; y_{p0−1} ; 0_{p0−1} ].  (3.6)

Zero padding provides enough space to accommodate any shift s_ℓ[a0] with ℓ ∈ τ.
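The zero-padded window (3.6), projected onto the sphere, can be formed in a few lines; the subsequent generalized power-method step, which requires the gradient of the objective, is omitted from this sketch.

```python
import numpy as np

def init_window(y, p0):
    """Zero-padded, sphere-normalized length-p0 data window of (3.6):
    [0_{p0-1}; y_0; ...; y_{p0-1}; 0_{p0-1}] in R^{3p0-2}."""
    z = np.concatenate([np.zeros(p0 - 1), y[:p0], np.zeros(p0 - 1)])
    return z / np.linalg.norm(z)
```

For example, with p0 = 4 the result lives in R^{10}, with three zeros of headroom on each side to accommodate shifted copies of the motif.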
We then perform one step of the generalized power method⁹, writing

  a(0) = P_{S^{p−1}}[ −∇ϕ_ℓ1( P_{S^{p−1}}[ 0_{p0−1} ; y_0 ; … ; y_{p0−1} ; 0_{p0−1} ] ) ],  (3.7)

where P_{S^{p−1}} projects onto the sphere. The reasoning behind this construction may seem obscure, but it can be clarified by interpreting the gradient ∇ϕ_ρ in terms of its action on the shifts s_ℓ[a0] (see appendix). For now, we note that this operation has the effect of (approximately) filling in the missing pieces of the truncated shifts s_ℓ[a0]; see Figure 5 for an example. We will prove that with high probability a(0) is indeed close to Σ_{4θp0}.

The next key observation is that the function ϕ_ρ grows as we move away from the subspace S_τ, as shown in Figure 3.

⁹The power method for minimizing a quadratic form ξ(a) = −½ aᵀMa over the sphere consists of the iteration a ↦ P_{S^{p−1}}[Ma]. Notice that in this mapping, Ma = −∇ξ(a). The generalized power method for minimizing a function ϕ over the sphere consists of repeatedly projecting −∇ϕ onto the sphere, giving the iteration a ↦ P_{S^{p−1}}[−∇ϕ(a)]. (3.7) can be interpreted as one step of the generalized power method for the objective function ϕ_ρ.

Because of this, a small-stepping descent method will not move far away from Σ_{4θp0}. For concreteness, we will analyze a variant of the curvilinear search method (Goldfarb, 1980; Goldfarb et al., 2017), which moves in a linear combination of the negative gradient direction −g and a negative curvature direction −v. At the k-th iteration, the algorithm updates a(k+1) as

  a(k+1) ← P_{S^{p−1}}[ a(k) − t g(k) − t² v(k) ]  (3.8)

with an appropriately chosen step size t. The inclusion of a negative curvature direction allows the method to avoid stagnation near saddle points. Indeed, we will prove that starting from the initialization a(0), this method produces a sequence a(1), a(2), … which efficiently converges to a local minimizer ā that is near some signed shift ±s_ℓ[a0] of the ground truth.

The second step of our algorithm rounds the local minimizer ā ≈ σ s_ℓ[a0] to produce an exact solution â = σ s_ℓ[a0].
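One update of the curvilinear search (3.8) can be sketched as follows; the step-size rule here is a plain backtracking simplification (sufficient decrease in the gradient only), not the exact condition used in Algorithm 1.

```python
import numpy as np

def sphere_proj(a):
    return a / np.linalg.norm(a)

def curvilinear_step(a, g, v, phi, t0=0.1, shrink=0.5, max_backtrack=30):
    """Move along -t g - t^2 v (negative gradient plus negative curvature),
    project back to the sphere, and backtrack t until phi decreases."""
    for k in range(max_backtrack):
        t = t0 * shrink**k
        a_next = sphere_proj(a - t * g - t**2 * v)
        if phi(a_next) < phi(a) - 0.5 * t * np.dot(g, g):
            return a_next, t
    return a, 0.0  # no acceptable step found
```

As a usage example, for the quadratic ϕ(a) = ½ aᵀMa on the sphere, one step with the Riemannian gradient g = (I − aaᵀ)Ma (and v = 0) strictly decreases the objective.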
As a byproduct, it also exactly recovers the corresponding signed shift of the true sparse signal, x̂ = σ s_{−ℓ}[x0]. Our rounding algorithm is an alternating minimization scheme, which alternates between minimizing the Lasso cost over a with x fixed, and minimizing the Lasso cost over x with a fixed. We make two modifications to this basic idea, both of which are important for obtaining exact recovery. First, unlike the standard Lasso cost, which penalizes all of the entries of x, we maintain a running estimate I(k) of the support of x0, and only penalize those entries that are not in I(k):

  ½‖a ⊛ x − y‖₂² + λ Σ_{i ∉ I(k)} |x_i|.  (3.9)

This can be viewed as an extreme form of reweighting (Candes et al., 2008). Second, our algorithm gradually decreases the penalty variable λ to 0, so that eventually

  â ⊛ x̂ → y.  (3.10)

This can be viewed as a homotopy or continuation method (Osborne et al., 2000; Efron et al., 2004). For concreteness, at the k-th iteration the algorithm reads:

  Update x:  x(k+1) ← argmin_x ½‖a(k) ⊛ x − y‖₂² + λ(k) Σ_{i ∉ I(k)} |x_i|,
  Update a:  a(k+1) ← P_{S^{p−1}}[ argmin_a ½‖a ⊛ x(k+1) − y‖₂² ],
  Update λ and I:  λ(k+1) ← ½ λ(k),  I(k+1) ← supp(x(k+1)).  (3.11)

We prove that the iterates produced by this sequence of operations converge to the ground truth at a linear rate, as long as the initializer ā is sufficiently nearby. Our overall algorithm is summarized as Algorithm 1. Figure 6 illustrates the main steps of this algorithm.

Figure 6. Local minimization and refinement. The data-driven initialization a(0) consists of a near-superposition of two shifts (left); minimizing ϕ_ρ produces ā, a near shift of a0 (middle). Finally, the rounded solution â obtained using the Lasso is almost identical to a shift of a0 (right).

Our main
algorithmic result states that, under hypotheses closely related to those above, Algorithm 1 produces a signed shift of the ground truth (a0, x0):

Algorithm 1 Short-and-Sparse Deconvolution
input Observation y, motif length p0, sparsity θ, shift-coherence µ, and curvature threshold η_v.
Minimization:
  Initialize a(0) ← P_{S^{p−1}}[ −∇ϕ_ρ( P_{S^{p−1}}[ 0_{p0−1} ; y_0 ; … ; y_{p0−1} ; 0_{p0−1} ] ) ], λ = 0.1/√(p0 θ)¹⁰, and δ > 0 in ϕ_ρ.
  for k = 1 to K1 do
    a(k+1) ← P_{S^{p−1}}[ a(k) − t g(k) − t² v(k) ]
    Here, g(k) is the Riemannian gradient; v(k) is the eigenvector of the smallest Riemannian Hessian eigenvalue if that eigenvalue is less than −η_v, with the sign chosen so that ⟨v(k), g(k)⟩ ≥ 0, and v(k) = 0 otherwise; and t ∈ (0, 0.1/nθ] satisfies ϕ_ρ(a(k+1)) < ϕ_ρ(a(k)) − ½ t ‖g(k)‖₂² − ¼ t⁴ η_v ‖v(k)‖₂²
  end for
  Obtain a near local minimizer ā ← a(K1).
Refinement:
  Initialize a(0) ← ā, λ(0) ← 10(pθ + log n)(µ + 1/p), and I(0) ← supp(S_{λ(0)}[ǎ ⊛ y]), where S_λ denotes soft-thresholding at level λ and ǎ the reversal of ā.
  for k = 1 to K2 do
    x(k+1) ← argmin_x ½‖a(k) ⊛ x − y‖₂² + λ(k) Σ_{i ∉ I(k)} |x_i|
    a(k+1) ← P_{S^{p−1}}[ argmin_a ½‖a ⊛ x(k+1) − y‖₂² ]
    λ(k+1) ← λ(k)/2, I(k+1) ← supp(x(k+1))
  end for
output (â, x̂) ← (a(K2), x(K2))

Theorem 3.2 (Main Algorithmic Theorem). Suppose y = a0 ⊛ x0, where a0 ∈ S^{p0−1} is µ-truncated-shift coherent, i.e., max_{i≠j} |⟨ι_{p0} s_i[a0], ι_{p0} s_j[a0]⟩| ≤ µ, and x0 ∼ i.i.d. BG(θ) ∈ R^n with θ, µ satisfying

  θ ∈ [ c1/p0 , c2/(p0√µ + √p0) ] · 1/log² p0,   µ ≤ c3/log² n  (3.12)

for some constants c1, c2, c3 > 0. If the signal lengths n, p0 satisfy n > poly(p0) and p0 > polylog(n), then there exist δ, η_v > 0 such that with high probability, Algorithm 1 produces (â, x̂) that are equal to the ground truth up to signed shift symmetry:

  ‖ (â, x̂) − σ( s_ℓ[a0], s_{−ℓ}[x0] ) ‖₂ ≤ ε  (3.13)

for some σ ∈ {±1} and ℓ ∈ {−p0 + 1, …, p0 − 1}, provided K1 > poly(n, p0) and K2 > polylog(n, p0, ε⁻¹).

Proof. See Appendix C.

3.3. Relationship to the Literature

Blind deconvolution is a classical problem in signal processing (Stockham et al., 1975; Cannon, 1976), and has been studied under a variety of hypotheses.
In this section, we first discuss the relationship between our results and the existing literature on the short-and-sparse version of this problem, and then briefly discuss other deconvolution variants in the theoretical literature.

The short-and-sparse model arises in a number of applications. One class of applications involves finding basic motifs (repeated patterns) in datasets. This motif discovery problem arises in extracellular spike sorting (Lewicki, 1998; Ekanadham et al., 2011) and calcium imaging (Pnevmatikakis et al., 2016), where the observed signal exhibits repetitive short neuron excitation patterns occurring sparsely across time and/or space. Similarly, electron microscopy images (Cheung et al., 2017) arising in the study of nanomaterials often exhibit repeated motifs.

Another significant application of SaS deconvolution is image deblurring. Typically, the blur kernel is small relative to the image size (short) (Ayers & Dainty, 1988; You & Kaveh, 1996; Carasso, 2001; Levin et al., 2007; 2011). In natural image deblurring, the target image is often assumed to have relatively few sharp edges (Fergus et al., 2006; Joshi et al., 2008; Levin et al., 2011), and hence to have sparse derivatives. In scientific image deblurring, e.g., in astronomy (Lane, 1992; Harmeling et al., 2009; Briers et al., 2013) and geophysics (Kaaresen & Taxt, 1998), the target image is often sparse, either in the spatial or wavelet domain, again leading to variants of the SaS model. The literature on blind image deconvolution is large; see, e.g., (Kundur & Hatzinakos, 1996; Campisi & Egiazarian, 2016) for surveys.

Variants of the SaS deconvolution problem arise in many other areas of engineering as well.
Examples include blind equalization in communications (Sato, 1975; Shalvi & Weinstein, 1990; Johnson et al., 1998), dereverberation in sound engineering (Miyoshi & Kaneda, 1988; Naylor & Gaubitch, 2010) and image super-resolution (Baker & Kanade, 2002; Shtengel et al., 2009; Yang et al., 2010). These applications have motivated a great deal of algorithmic work on variants of the SaS problem (Lane & Bates, 1987; Bones et al., 1995; Bell & Sejnowski, 1995; Kundur & Hatzinakos, 1996; Markham & Conchello, 1999; Campisi & Egiazarian, 2016; Walk et al., 2017). In contrast, relatively little theory is available to explain when and why algorithms succeed.

¹⁰In practice, we suggest setting λ = c_λ/√(p0 θ) with c_λ ∈ [0.5, 0.8].

Our algorithm minimizes ϕ_ρ as an approximation to the Lasso cost over the sphere. Our formulation and results have strong precedent in the literature. Lasso-like objective functions have been widely used in image deblurring (You & Kaveh, 1996; Chan & Wong, 1998; Fergus et al., 2006; Levin et al., 2007; Shan et al., 2008; Xu & Jia, 2010; Dong et al., 2011; Krishnan et al., 2011; Levin et al., 2011; Wipf & Zhang, 2014; Perrone & Favaro, 2014; Zhang et al., 2017). A number of insights have been obtained into the geometry of sparse deconvolution, in particular, into the effect of various constraints on a on the presence or absence of spurious local minimizers. In image deblurring, a simplex constraint (a ≥ 0 and ‖a‖₁ = 1) arises naturally from the physical structure of the problem (You & Kaveh, 1996; Chan & Wong, 1998). Perhaps surprisingly, simplex-constrained deconvolution admits trivial global minimizers, at which the recovered kernel a is a spike, rather than the target blur kernel (Levin et al., 2011; Benichoux et al., 2013). (Wipf & Zhang, 2014) imposes ℓ2 regularization on a and observes that this alternative constraint gives a more reliable algorithm.
(Zhang et al., 2017) studies the geometry of the simplified objective ϕ_ℓ1 over the sphere, and proves that in the dilute limit in which x0 has one nonzero entry, all strict local minima of ϕ_ℓ1 are close to signed shift truncations of a0. By adopting a different objective function (based on ℓ4 maximization) over the sphere, (Zhang et al., 2018) proves that on a certain region of the sphere every local minimum is near a truncated signed shift of a0, i.e., the restriction of s_ℓ[a0] to the window {0, …, p0 − 1}. The analysis of (Zhang et al., 2018) allows the sparse sequence x0 to be denser (θ ≲ p0^{−2/3} for a generic kernel a0, as opposed to θ ≲ p0^{−3/4} in our result). Both (Zhang et al., 2017) and (Zhang et al., 2018) guarantee approximate recovery of a portion of s_ℓ[a0], under complicated conditions on the kernel a0. Our core optimization problem is very similar to (Zhang et al., 2017). However, we obtain exact recovery of both a0 and a relatively dense x0, under the much simpler assumption of shift incoherence.

Other aspects of the SaS problem have been studied theoretically. One basic question is under what circumstances the problem is identifiable, up to the scaled shift ambiguity. (Choudhary & Mitra, 2015) shows that the problem is ill-posed for worst case (a0, x0), in particular, for certain support patterns in which x0 does not have any isolated nonzero entries. This demonstrates that some modeling assumptions on the support of the sparse term are needed. Nevertheless, this worst case structure is unlikely to occur, either under the Bernoulli model, or in practical deconvolution problems.

Motivated by a variety of applications, many low-dimensional deconvolution models have been studied in the theoretical literature.
In communication applications, the signals a0 and x0 either live in known low-dimensional subspaces, or are sparse in some known dictionary (Ahmed et al., 2014; Li et al., 2016; Chi, 2016; Ling & Strohmer, 2015; Li et al., 2017; Ling & Strohmer, 2017; Kech & Krahmer, 2017). These theoretical works assume that the subspace or dictionary is chosen at random. This low-dimensional deconvolution model does not exhibit the signed shift ambiguity; nonconvex formulations for this model exhibit a different structure from that studied here. In fact, the variant in which both signals belong to known subspaces can be solved by convex relaxation (Ahmed et al., 2014). The SaS model does not appear to be amenable to convexification, and exhibits a more complicated nonconvex geometry, due to the shift ambiguity. The main motivation for tackling this model lies in the aforementioned applications in imaging and data analysis.

(Wang & Chi, 2016; Li & Bresler, 2018) study the related multi-instance sparse blind deconvolution problem (MISBD), in which there are K observations y_i = a0 ⊛ x_i, i = 1, …, K, consisting of convolutions of a kernel a0 with different sparse vectors x_i. Both works develop provable algorithms. There are several key differences with our work. First, both the proposed algorithms and their analysis require a0 to be invertible. Second, the SaS model and MISBD are not equivalent, despite the apparent similarity between them. It might seem possible to reduce SaS to MISBD by dividing the single observation y into K pieces; this apparent reduction fails due to boundary effects.

4. Experiments

We demonstrate that the tradeoffs between the motif length p0 and sparsity rate θ produce a transition region for successful SaS deconvolution under generic choices of a0 and x0. For fixed values of θ ∈ [10⁻³, 10⁻²] and p0 ∈ [10³, 10⁴], we draw 50 instances of synthetic data by choosing a0 ∼ Unif(S^{p0−1}) and x0 ∈ R^n with x0 ∼ i.i.d. BG(θ), where n = 5 × 10⁵.
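One instance of this synthetic setup, together with the success criterion of (4.1), can be generated as follows. The sizes here are smaller than in the actual experiment, for illustration, and the shift-embedding convention in `success` (zero-padding shifts of a0 into the longer window) is an assumption of this sketch.

```python
import numpy as np

def sas_instance(p0, n, theta, rng):
    """Draw a0 ~ Unif(S^{p0-1}), x0 ~ i.i.d. BG(theta), and form the circular
    convolution y = a0 (*) x0 (a0 zero-padded to length n)."""
    a0 = rng.standard_normal(p0)
    a0 /= np.linalg.norm(a0)                                # uniform on the sphere
    x0 = (rng.random(n) < theta) * rng.standard_normal(n)   # Bernoulli-Gaussian
    A = np.fft.fft(np.concatenate([a0, np.zeros(n - p0)]))
    y = np.real(np.fft.ifft(A * np.fft.fft(x0)))
    return a0, x0, y

def success(a_min, a0, tol=0.95):
    """Success criterion of (4.1): max_l |<s_l[a0], a_min>| > tol, with the
    shifts of a0 embedded (zero-padded) into the length of a_min."""
    p0, p = len(a0), len(a_min)
    best = 0.0
    for l in range(p - p0 + 1):
        s = np.zeros(p)
        s[l:l + p0] = a0
        best = max(best, abs(np.dot(s, a_min)))
    return best > tol

a0, x0, y = sas_instance(p0=100, n=10_000, theta=0.005,
                         rng=np.random.default_rng(0))
# a perfect recovery (an embedded shift of a0) passes the criterion
a_hat = np.zeros(3 * 100 - 2)
a_hat[40:140] = a0
print(success(a_hat, a0))   # -> True (exact shift, correlation 1)
```

An unstructured vector such as the normalized all-ones vector fails the criterion, since its correlation with any embedded shift of a generic a0 is far below 0.95.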
Note that choosing a0 this way implies µ(a0) ≈ 1/√p0. For each instance, we recover a0 and x0 from y = a0 ⊛ x0 by minimizing problem (2.5). For ease of computation, we modify Algorithm 1 by replacing curvilinear search with an accelerated Riemannian gradient descent method (see Appendix M). In Figure 7, we say the local minimizer amin is sufficiently close to a solution of the SaS deconvolution problem if

success(amin; a0) := { maxℓ |⟨sℓ[a0], amin⟩| > 0.95 }. (4.1)

Figure 7. Success probability of SaS deconvolution under generic a0, x0, with varying kernel length p0 (log scale) and sparsity rate θ (log scale). When the sparsity rate decreases sufficiently with respect to the kernel length, successful recovery becomes very likely (brighter), and vice versa (darker). A transition line is shown with slope ∆log p0 / ∆log θ ≈ −2, implying that our method works with high probability when θ ≲ 1/√p0 in the generic case.

5. Discussion

The main drawback of our proposed method is that it does not succeed when the target motif a0 has shift coherence very close to 1. For instance, a common scenario in image blind deconvolution involves deblurring an image with a smooth, low-pass point spread function (e.g., Gaussian blur). Both our analysis and numerical experiments show that in this situation minimizing ϕρ does not find the generating signal pair (a0, x0) consistently; the minimizer of ϕρ is often spurious and is not close to any particular shift of a0. We do not suggest minimizing ϕρ in this situation. On the other hand, minimizing the bilinear lasso objective ϕlasso over the sphere often succeeds even if the true signal pair (a0, x0) is coherent and dense. In light of the above observations, we view the analysis of the bilinear lasso as the most important direction for future theoretical work on SaS deconvolution.
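For concreteness, the bilinear lasso objective mentioned above, ϕlasso(a) = min_x ½‖y − a ⊛ x‖² + λ‖x‖₁, can be evaluated approximately by running ISTA on x. The sketch below is our own illustration, not the paper's algorithm; λ, the iteration count, and the FFT-based step size are ad hoc choices:

```python
import numpy as np

def cconv(a_pad, x):
    """Cyclic convolution via FFT; a_pad is the kernel zero-padded to length n."""
    return np.real(np.fft.ifft(np.fft.fft(a_pad) * np.fft.fft(x)))

def phi_lasso(a_pad, y, lam=0.1, iters=200):
    """Approximate phi_lasso(a) = min_x 0.5*||y - a (*) x||^2 + lam*||x||_1
    by ISTA on x. Step size 1/L, with L = ||C_a||^2 the squared spectral
    norm of the cyclic convolution operator (max squared FFT magnitude)."""
    n = len(y)
    ahat = np.fft.fft(a_pad)
    t = 1.0 / (np.max(np.abs(ahat)) ** 2)
    x = np.zeros(n)
    for _ in range(iters):
        r = cconv(a_pad, x) - y
        g = np.real(np.fft.ifft(np.conj(ahat) * np.fft.fft(r)))  # adjoint conv
        x = x - t * g
        x = np.sign(x) * np.maximum(np.abs(x) - t * lam, 0.0)    # soft-threshold
    return 0.5 * np.linalg.norm(cconv(a_pad, x) - y) ** 2 + lam * np.abs(x).sum()

# Tiny demo: y generated from an incoherent kernel and a sparse x0.
rng = np.random.default_rng(3)
n, p0 = 256, 16
a0 = rng.standard_normal(p0)
a0 /= np.linalg.norm(a0)
a_pad = np.zeros(n)
a_pad[:p0] = a0
x_true = rng.standard_normal(n) * (rng.random(n) < 0.02)
y = cconv(a_pad, x_true)
val = phi_lasso(a_pad, y)
print(val)
```

Because ISTA with step size 1/L decreases the objective monotonically from the initialization x = 0, the returned value never exceeds ½‖y‖².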
The dropped quadratic formulation studied here has commonalities with the bilinear lasso: both exhibit local minima at signed shifts, and both exhibit negative curvature in symmetry-breaking directions. A major difference (and hence, a major challenge) is that gradient methods for the bilinear lasso do not retract to a union of subspaces; they retract to a more complicated, nonlinear set.

Finally, there are several directions in which our analysis could be improved. Our lower bounds on the length n of the random vector x0 required for success are clearly suboptimal. We also suspect that our sparsity-coherence tradeoff between µ and θ (roughly, θ ≲ 1/(√µ p0)) is suboptimal, even for the ϕρ objective. Articulating optimal sparsity-coherence tradeoffs is another interesting direction for future work.

Acknowledgements

The authors gratefully acknowledge support from NSF 1343282, NSF CCF 1527809, and NSF IIS 1546411.

References

Ahmed, A., Recht, B., and Romberg, J. Blind deconvolution using convex programming. IEEE Transactions on Information Theory, 60(3):1711–1732, 2014.

Ayers, G. and Dainty, J. C. Iterative blind deconvolution method and its applications. Optics Letters, 13(7):547–549, 1988.

Baker, S. and Kanade, T. Limits on super-resolution and how to break them. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9):1167–1183, 2002.

Bell, A. J. and Sejnowski, T. J. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995.

Benichoux, A., Vincent, E., and Gribonval, R. A fundamental pitfall in blind deconvolution with sparse and shift-invariant priors. In ICASSP - 38th International Conference on Acoustics, Speech, and Signal Processing, 2013.

Bones, P., Parker, C., Satherley, B., and Watson, R. Deconvolution and phase retrieval with use of zero sheets. JOSA A, 12(9):1842–1857, 1995.

Briers, D., Duncan, D. D., Hirst, E. R., Kirkpatrick, S. J., Larsson, M., Steenbergen, W., Stromberg, T., and Thompson, O. B.
Laser speckle contrast imaging: theoretical and practical limitations. Journal of Biomedical Optics, 18(6):066018, 2013.

Campisi, P. and Egiazarian, K. Blind Image Deconvolution: Theory and Applications. CRC Press, 2016.

Candes, E. J., Wakin, M. B., and Boyd, S. P. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14(5-6):877–905, 2008.

Cannon, M. Blind deconvolution of spatially invariant image blurs with phase. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(1):58–63, 1976.

Carasso, A. S. Direct blind deconvolution. SIAM Journal on Applied Mathematics, 61(6):1980–2007, 2001.

Chan, T. F. and Wong, C.-K. Total variation blind deconvolution. IEEE Transactions on Image Processing, 7(3):370–375, 1998.

Cheung, S., Lau, Y., Chen, Z., Sun, J., Zhang, Y., Wright, J., and Pasupathy, A. Beyond the Fourier transform: A nonconvex optimization approach to microscopy analysis. Submitted, 2017.

Chi, Y. Guaranteed blind sparse spikes deconvolution via lifting and convex optimization. IEEE Journal of Selected Topics in Signal Processing, 10(4):782–794, June 2016.

Choudhary, S. and Mitra, U. Fundamental limits of blind deconvolution part II: Sparsity-ambiguity trade-offs. arXiv preprint arXiv:1503.03184, 2015.

Dong, W., Zhang, L., Shi, G., and Wu, X. Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization. IEEE Transactions on Image Processing, 20(7):1838–1857, 2011.

Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., et al. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

Ekanadham, C., Tranchina, D., and Simoncelli, E. P. A blind sparse deconvolution method for neural spike identification. In Advances in Neural Information Processing Systems 24, pp. 1440–1448, 2011.

Fergus, R., Singh, B., Hertzmann, A., Roweis, S. T., and Freeman, W. T. Removing camera shake from a single photograph. In ACM Transactions on Graphics (TOG), volume 25, pp.
787–794. ACM, 2006.

Goldfarb, D. Curvilinear path steplength algorithms for minimization which use directions of negative curvature. Mathematical Programming, 18(1):31–40, 1980.

Goldfarb, D., Mu, C., Wright, J., and Zhou, C. Using negative curvature in solving nonlinear programs. Computational Optimization and Applications, 68(3):479–502, 2017.

Harmeling, S., Hirsch, M., Sra, S., and Schölkopf, B. Online blind deconvolution for astronomical imaging. In 2009 IEEE International Conference on Computational Photography (ICCP 2009), pp. 1–7. IEEE, 2009.

Johnson, R., Schniter, P., Endres, T. J., Behm, J. D., Brown, D. R., and Casas, R. A. Blind equalization using the constant modulus criterion: A review. Proceedings of the IEEE, 86(10):1927–1950, 1998.

Joshi, N., Szeliski, R., and Kriegman, D. J. PSF estimation using sharp edge prediction. In Computer Vision and Pattern Recognition (CVPR 2008), pp. 1–8. IEEE, 2008.

Kaaresen, K. F. and Taxt, T. Multichannel blind deconvolution of seismic signals. Geophysics, 63(6):2093–2107, 1998.

Kech, M. and Krahmer, F. Optimal injectivity conditions for bilinear inverse problems with applications to identifiability of deconvolution problems. SIAM Journal on Applied Algebra and Geometry, 1(1):20–37, 2017.

Krishnan, D., Tay, T., and Fergus, R. Blind deconvolution using a normalized sparsity measure. In Computer Vision and Pattern Recognition (CVPR 2011), pp. 233–240. IEEE, 2011.

Kundur, D. and Hatzinakos, D. Blind image deconvolution. IEEE Signal Processing Magazine, 13(3):43–64, 1996.

Lane, R. and Bates, R. Automatic multidimensional deconvolution. JOSA A, 4(1):180–188, 1987.

Lane, R. G. Blind deconvolution of speckle images. JOSA A, 9(9):1508–1514, 1992.

Levin, A., Fergus, R., Durand, F., and Freeman, W. T. Deconvolution using natural image priors. Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory, 3, 2007.
Levin, A., Weiss, Y., Durand, F., and Freeman, W. T. Understanding blind deconvolution algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2354–2367, 2011.

Lewicki, M. S. A review of methods for spike sorting: the detection and classification of neural action potentials. Network: Computation in Neural Systems, 9(4):R53–R78, 1998.

Li, Y. and Bresler, Y. Global geometry of multichannel sparse blind deconvolution on the sphere. arXiv preprint arXiv:1404.4104, 2018.

Li, Y., Lee, K., and Bresler, Y. Identifiability in blind deconvolution with subspace or sparsity constraints. IEEE Transactions on Information Theory, 62(7):4266–4275, 2016.

Li, Y., Lee, K., and Bresler, Y. Identifiability and stability in blind deconvolution under minimal assumptions. IEEE Transactions on Information Theory, 2017.

Ling, S. and Strohmer, T. Self-calibration and biconvex compressive sensing. Inverse Problems, 31(11):115002, 2015.

Ling, S. and Strohmer, T. Blind deconvolution meets blind demixing: Algorithms and performance bounds. IEEE Transactions on Information Theory, 63(7):4497–4520, 2017.

Markham, J. and Conchello, J.-A. Parametric blind deconvolution: a robust method for the simultaneous estimation of image and blur. JOSA A, 16(10):2377–2391, 1999.

Miyoshi, M. and Kaneda, Y. Inverse filtering of room acoustics. IEEE Transactions on Acoustics, Speech, and Signal Processing, 36(2):145–152, 1988.

Naylor, P. A. and Gaubitch, N. D. Speech Dereverberation. Springer Science & Business Media, 2010.

Osborne, M. R., Presnell, B., and Turlach, B. A. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 2000.

Perrone, D. and Favaro, P. Total variation blind deconvolution: The devil is in the details. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2909–2916, 2014.

Pnevmatikakis, E. A., Soudry, D., Gao, Y., Machado, T.
A., Merel, J., Pfau, D., Reardon, T., Mu, Y., Lacefield, C., Yang, W., et al. Simultaneous denoising, deconvolution, and demixing of calcium imaging data. Neuron, 89(2):285–299, 2016.

Saha, S. K. Diffraction-Limited Imaging with Large and Moderate Telescopes. World Scientific, 2007.

Sato, Y. A method of self-recovering equalization for multilevel amplitude-modulation systems. IEEE Transactions on Communications, 23(6):679–682, 1975.

Shalvi, O. and Weinstein, E. New criteria for blind deconvolution of nonminimum phase systems (channels). IEEE Transactions on Information Theory, 36(2):312–321, 1990.

Shan, Q., Jia, J., and Agarwala, A. High-quality motion deblurring from a single image. In ACM Transactions on Graphics (TOG), volume 27, pp. 73. ACM, 2008.

Shtengel, G., Galbraith, J. A., Galbraith, C. G., Lippincott-Schwartz, J., Gillette, J. M., Manley, S., Sougrat, R., Waterman, C. M., Kanchanawong, P., Davidson, M. W., et al. Interferometric fluorescent super-resolution microscopy resolves 3D cellular ultrastructure. Proceedings of the National Academy of Sciences, 106(9):3125–3130, 2009.

Stockham, T. G., Cannon, T. M., and Ingebretsen, R. B. Blind deconvolution through digital signal processing. Proceedings of the IEEE, 63(4):678–692, 1975.

Walk, P., Jung, P., Pfander, G. E., and Hassibi, B. Blind deconvolution with additional autocorrelations via convex programs. arXiv preprint arXiv:1701.04890, 2017.

Wang, L. and Chi, Y. Blind deconvolution from multiple sparse inputs. IEEE Signal Processing Letters, 23(10):1384–1388, 2016.

Wipf, D. and Zhang, H. Revisiting Bayesian blind deconvolution. The Journal of Machine Learning Research, 15(1):3595–3634, 2014.

Xu, L. and Jia, J. Two-phase kernel estimation for robust motion deblurring. In European Conference on Computer Vision, pp. 157–170. Springer, 2010.

Yang, J., Wright, J., Huang, T. S., and Ma, Y. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861–2873, 2010.
You, Y.-L. and Kaveh, M. Anisotropic blind image restoration. In Proceedings of the International Conference on Image Processing, volume 2, pp. 461–464. IEEE, 1996.

Zhang, Y., Lau, Y., Kuo, H.-W., Cheung, S., Pasupathy, A., and Wright, J. On the global geometry of sphere-constrained sparse blind deconvolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4894–4902, 2017.

Zhang, Y., Kuo, H.-W., and Wright, J. Structured local optima in sparse blind deconvolution. arXiv preprint arXiv:1806.00338, 2018.