# Super-Resolution Off the Grid

Qingqing Huang, LIDS, qqh@mit.edu
Sham M. Kakade, University of Washington, Department of Statistics, Computer Science & Engineering, sham@cs.washington.edu

Super-resolution is the problem of recovering a superposition of point sources using bandlimited measurements, which may be corrupted with noise. This signal processing problem arises in numerous imaging problems, ranging from astronomy to biology to spectroscopy, where it is common to take (coarse) Fourier measurements of an object. Of particular interest are estimation procedures which are robust to noise, with the following desirable statistical and computational properties: we seek to use coarse Fourier measurements (bounded by some cutoff frequency); we hope to take a (quantifiably) small number of measurements; we desire our algorithm to run quickly.

Suppose we have k point sources in d dimensions, where the points are separated by at least Δ from each other (in Euclidean distance). This work provides an algorithm with the following favorable guarantees:

- The algorithm uses Fourier measurements whose frequencies are bounded by O(1/Δ) (up to log factors). Previous algorithms require a cutoff frequency which may be as large as Ω(√d/Δ).
- The number of measurements taken by, and the computational complexity of, our algorithm are bounded by a polynomial in both the number of points k and the dimension d, with no dependence on the separation Δ. In contrast, previous algorithms depended inverse polynomially on the minimal separation and exponentially on the dimension for both of these quantities.
- Our estimation procedure itself is simple: we take random bandlimited measurements (as opposed to taking an exponential number of measurements on the hypergrid). Furthermore, our analysis and algorithm are elementary (based on concentration bounds for sampling and the singular value decomposition).

1 Introduction

We follow the standard mathematical abstraction of this problem (Candes & Fernandez-Granda [4, 3]): consider a d-dimensional signal x(t) modeled as a weighted sum of k Dirac measures in R^d:

    x(t) = Σ_{j=1}^{k} w_j δ_{µ^{(j)}},    (1)

where the point sources, the µ^{(j)}'s, are in R^d. Assume that the weights w_j are complex valued, whose absolute values are lower and upper bounded by some positive constant. Assume that we are given k, the number of point sources.¹

¹ An upper bound on the number of point sources suffices.

Define the measurement function f(s) : R^d → C to be the convolution of the point source signal x(t) with a low-pass point spread function e^{iπ⟨t,s⟩}, as below:

    f(s) = ∫_{t ∈ R^d} e^{iπ⟨t,s⟩} x(dt) = Σ_{j=1}^{k} w_j e^{iπ⟨µ^{(j)}, s⟩}.    (2)

In the noisy setting, the measurements are corrupted by a uniformly bounded perturbation z:

    f̃(s) = f(s) + z(s),   |z(s)| ≤ ε_z,  for all s.    (3)

Suppose that we are only allowed to measure the signal x(t) by evaluating the measurement function f̃(s) at any s ∈ R^d, and we want to recover the parameters of the point source signal, i.e., {w_j, µ^{(j)} : j ∈ [k]}. We follow the standard normalization and assume that

    µ^{(j)} ∈ [−1, +1]^d,   |w_j| ∈ [0, 1]  for all j ∈ [k].

Let w_min = min_j |w_j| denote the minimal weight, and let Δ be the minimal separation of the point sources, defined as follows:

    Δ = min_{j ≠ j'} ‖µ^{(j)} − µ^{(j')}‖_2,    (4)

where we use the Euclidean distance between the point sources for ease of exposition.² These quantities are key parameters in our algorithm and analysis. Intuitively, the recovery problem is harder if the minimal separation Δ is small and the minimal weight w_min is small.
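To make the measurement model concrete, the following minimal sketch (our own illustration; the instance sizes and the names `f` and `f_noisy` are hypothetical, not from the paper) synthesizes a small point-source signal and evaluates the noiseless measurement function of equation (2) together with a bounded perturbation as in (3).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance: k point sources in d dimensions with complex weights.
d, k = 3, 4
mu = rng.uniform(-1.0, 1.0, size=(k, d))   # point sources mu^(j) in [-1, 1]^d
w = rng.uniform(0.5, 1.0, size=k) * np.exp(1j * rng.uniform(0, 2 * np.pi, size=k))

def f(s, mu, w):
    """Noiseless measurement f(s) = sum_j w_j exp(i*pi*<mu^(j), s>), eq. (2)."""
    return np.sum(w * np.exp(1j * np.pi * (mu @ s)))

def f_noisy(s, mu, w, eps_z, rng):
    """Measurement corrupted by a perturbation of magnitude at most eps_z, eq. (3)."""
    z = eps_z * np.exp(1j * rng.uniform(0, 2 * np.pi))   # |z(s)| <= eps_z
    return f(s, mu, w) + z

s = rng.normal(size=d)   # an arbitrary frequency s in R^d
print(f(s, mu, w), f_noisy(s, mu, w, 1e-3, rng))
```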
The first question is: given exact measurements, namely ε_z = 0, where and how many measurements should we take so that the original signal x(t) can be exactly recovered?

Definition 1.1 (Exact recovery). In the exact case, i.e., ε_z = 0, we say that an algorithm achieves exact recovery with m measurements of the signal x(t) if, upon input of these m measurements, the algorithm returns the exact set of parameters {w_j, µ^{(j)} : j ∈ [k]}.

Moreover, we want the algorithm to be tolerant to measurement noise, in the sense that in the presence of measurement noise we can still recover good estimates of the point sources.

Definition 1.2 (Stable recovery). In the noisy case, i.e., ε_z > 0, we say that an algorithm achieves stable recovery with m measurements of the signal x(t) if, upon input of these m measurements, the algorithm returns estimates {ŵ_j, µ̂^{(j)} : j ∈ [k]} such that

    min_π max { ‖µ̂^{(j)} − µ^{(π(j))}‖_2 : j ∈ [k] } ≤ poly(d, k) · ε_z,

where the min is over permutations π on [k] and poly(d, k) is a polynomial function in d and k.

By definition, if an algorithm achieves stable recovery with m measurements, it also achieves exact recovery with these m measurements.

The terminology of super-resolution is appropriate due to the following remarkable result (in the noiseless case) of Donoho [9]: suppose we want to accurately recover the point sources to an error of γ, where γ ≪ Δ. Naively, we may expect to require measurements whose frequency depends inversely on the desired accuracy γ. Donoho [9] showed that it suffices to obtain a finite number of measurements, whose frequencies are bounded by O(1/Δ), in order to achieve exact recovery; this resolves the point sources far more accurately than is naively implied by using frequencies of O(1/Δ). Furthermore, the work of Candes & Fernandez-Granda [4, 3] showed that stable recovery, in the univariate case (d = 1), is achievable with a cutoff frequency of O(1/Δ) using a convex program and a number of measurements whose size is polynomial in the relevant quantities.

² Our claims hold without using the wrap-around metric, as in [4, 3], due to our random sampling. Also, it is possible to extend these results to the ℓ_p-norm case.

|      | cutoff freq (d = 1) | measurements (d = 1) | runtime (d = 1) | cutoff freq (general d) | measurements (general d) | runtime (general d) |
|------|---------------------|----------------------|-----------------|-------------------------|--------------------------|---------------------|
| SDP  | 1/Δ_∞ | k log(k) log(1/Δ_∞) | poly(1/Δ_∞, k) | C_d/Δ_∞ | (1/Δ_∞)^d | poly((1/Δ_∞)^d, k) |
| Ours | 1/Δ | (k log(k))² | (k log(k))² | log(kd)/Δ | (k log(k) + d)² | (k log(k) + d)² |

Table 1: See Section 1.2 for a description. See Lemma 2.3 for details about the cutoff frequency. Here, we are implicitly using O(·) notation.

1.1 This work

We are interested in stable recovery procedures with the following desirable statistical and computational properties: we seek to use coarse (low-frequency) measurements; we hope to take a (quantifiably) small number of measurements; and we desire our algorithm to run quickly. Informally, our main result is as follows:

Theorem 1.3 (Informal statement of Theorem 2.2). For a fixed probability of error, the proposed algorithm achieves stable recovery with a number of measurements and a computational runtime that are both on the order of O((k log(k) + d)²). Furthermore, the algorithm makes measurements which are bounded in frequency by O(1/Δ) (ignoring log factors).

Notably, our algorithm and analysis directly handle the multivariate case, with the univariate case as a special case. Importantly, the number of measurements and the computational runtime do not depend on the minimal separation of the point sources.
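The error metric of Definition 1.2 is a minimum over permutations of the point-source labels; for small k it can be computed by enumeration, as in the following sketch (the helper name `recovery_error` and the toy values are ours).

```python
import itertools
import numpy as np

def recovery_error(mu_hat, mu):
    """min over permutations pi of max_j ||mu_hat_j - mu_{pi(j)}||_2, Definition 1.2."""
    k = mu.shape[0]
    best = np.inf
    for perm in itertools.permutations(range(k)):
        err = max(np.linalg.norm(mu_hat[j] - mu[perm[j]]) for j in range(k))
        best = min(best, err)
    return best

mu = np.array([[0.2, -0.5], [-0.7, 0.1], [0.4, 0.9]])
mu_hat = mu[[2, 0, 1]] + 1e-3        # permuted and slightly perturbed estimates
print(recovery_error(mu_hat, mu))    # ~1.4e-3, insensitive to the labeling of the estimates
```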
This independence of the minimal separation may be important even in certain low-dimensional imaging applications where taking physical measurements is costly (indeed, super-resolution is important precisely in settings where Δ is small). Furthermore, our technical contribution of how to decompose a certain tensor constructed from Fourier measurements may be of broader interest to related questions in statistics, signal processing, and machine learning.

1.2 Comparison to related work

Table 1 summarizes the comparison between our algorithm and the existing results. The multidimensional cutoff frequency we refer to in the table is the maximal coordinate-wise entry of any measurement frequency s (i.e., ‖s‖_∞). SDP refers to the semidefinite programming (SDP) based algorithms of Candes & Fernandez-Granda [3, 4]; in the univariate case, the number of measurements can be reduced by the method of Tang et al. [23] (this is reflected in the table). MP refers to the matrix pencil type of methods, studied in [14] and [15] for the univariate case. Here, we define the infinity-norm separation as Δ_∞ = min_{j≠j'} ‖µ^{(j)} − µ^{(j')}‖_∞, which is understood as the wrap-around distance on the unit circle. C_d is a problem-dependent constant (discussed below).

Observe the following differences between our algorithm and prior work:

1) Our minimal separation is measured under the 2-norm instead of the infinity norm, as in the SDP based algorithm. Note that Δ_∞ depends on the coordinate system; in the worst case, it can underestimate the separation by a 1/√d factor, namely Δ_∞ can be as small as Δ/√d.

2) The computational complexity and number of measurements are polynomial in the dimension d and the number of point sources k, and surprisingly do not depend on the minimal separation of the point sources! Intuitively, when the minimal separation between the point sources is small, the problem should be harder; this is reflected only in the sampling range and the cutoff frequency of the measurements in our algorithm.

3) Furthermore, one could project the multivariate signal onto the coordinates and solve multiple univariate problems (such as in [19, 17], which provided only exact recovery results). Naive random projections would lead to a cutoff frequency on the order of √d/Δ.

SDP approaches: The work in [3, 4, 10] formulates the recovery problem as a total-variation minimization problem; the dual problem can then be formulated as an SDP. They focused on the analysis of d = 1 and only explicitly extend the proofs to d = 2. For d ≥ 1, Ingham-type theorems (see [20, 12]) suggest that C_d = O(√d). The number of measurements can be reduced by the method in [23] for the d = 1 case, which is noted in the table. Their method uses sampling "off the grid"; technically, their sampling scheme actually samples random points from the grid, though with far fewer measurements.

Matrix pencil approaches: The matrix pencil method, MUSIC, and Prony's method share essentially the same underlying idea, executed in different ways. The original Prony's method directly attempts to find the roots of a high-degree polynomial, where the root stability has few guarantees. Other methods aim to robustify the algorithm. Recently, Liao & Fannjiang [14] provide a stability analysis of the MUSIC algorithm, and Moitra [15] provides a stability analysis of the univariate matrix pencil method. Moitra [15] studied the optimal relationship between the cutoff frequency and Δ, showing that if the cutoff frequency is less than 1/Δ, then stable recovery is not possible with the matrix pencil method (with high probability).
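As a small numerical aside on point 1) above (a toy configuration of our own, not from the paper): for two points separated along the all-ones direction, the coordinate-wise separation underestimates the Euclidean separation by exactly a √d factor.

```python
import numpy as np

d = 16
delta = 0.3
# Two points separated by delta in l2 norm along the diagonal direction.
mu1 = np.zeros(d)
mu2 = delta * np.ones(d) / np.sqrt(d)

sep_l2 = np.linalg.norm(mu2 - mu1)        # = delta
sep_inf = np.max(np.abs(mu2 - mu1))       # = delta / sqrt(d)
print(sep_l2, sep_inf, sep_l2 / sep_inf)  # the ratio is sqrt(d) = 4
```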
1.3 Notation

Let R, C, and Z denote the real numbers, complex numbers, and integers. For d ∈ Z, [d] denotes the set [d] = {1, ..., d}. For a set S, |S| denotes its cardinality. We use ⊕ to denote the direct sum of sets, namely S_1 ⊕ S_2 = {a + b : a ∈ S_1, b ∈ S_2}. Let e_n denote the n-th standard basis vector in R^d, for n ∈ [d]. Let P^d_{R,2} = {x ∈ R^d : ‖x‖_2 = R} denote the sphere of radius R in d-dimensional Euclidean space. Denote the condition number of a matrix X ∈ R^{m×n} as cond_2(X) = σ_max(X)/σ_min(X), where σ_max(X) and σ_min(X) are the maximal and minimal singular values of X.

We use ⊗ to denote the tensor product. Given matrices A, B, C ∈ C^{m×k}, the tensor product V = A ⊗ B ⊗ C ∈ C^{m×m×m} is given entrywise by V_{i1,i2,i3} = Σ_{n=1}^{k} A_{i1,n} B_{i2,n} C_{i3,n}. Another view of a tensor is that it defines a multilinear mapping: for given dimensions m_A, m_B, m_C, the mapping V(·,·,·) : C^{m×m_A} × C^{m×m_B} × C^{m×m_C} → C^{m_A×m_B×m_C} is defined as

    [V(X_A, X_B, X_C)]_{i1,i2,i3} = Σ_{j1,j2,j3 ∈ [m]} V_{j1,j2,j3} [X_A]_{j1,i1} [X_B]_{j2,i2} [X_C]_{j3,i3}.

In particular, for a ∈ C^m, we use V(I, I, a) to denote the projection of the tensor V along the 3rd dimension. Note that if the tensor admits a decomposition V = A ⊗ B ⊗ C, it is straightforward to verify that V(I, I, a) = A Diag(C^⊤ a) B^⊤. It is well known that if the factors A, B, C have full column rank, then the rank-k decomposition is unique up to rescaling and a common column permutation. Moreover, if the condition numbers of the factors are upper bounded by a positive constant, then one can compute the unique tensor decomposition of V with stability guarantees (see [1] for a review; Lemma 2.5 herein provides an explicit statement).

2 Main Results

2.1 The algorithm

We briefly describe the steps of Algorithm 1 below:

(Take measurements) Given positive numbers m and R, randomly draw a sampling set S = {s^{(1)}, ..., s^{(m)}} of m i.i.d. samples of the Gaussian distribution N(0, R² I_{d×d}). Form the set S' = S ∪ {s^{(m+1)} = e_1, ..., s^{(m+d)} = e_d, s^{(m+d+1)} = 0} ⊂ R^d. Denote m' = m + d + 1. Take another independent random sample v from the unit sphere, and define v^{(1)} = v, v^{(2)} = 2v, and v^{(3)} = 0. Construct the 3rd-order tensor F̃ ∈ C^{m'×m'×3} with the noise-corrupted measurements f̃(s) evaluated at the points in S' ⊕ S' ⊕ {v^{(1)}, v^{(2)}, v^{(3)}}, arranged in the following way:

    F̃_{n1,n2,n3} = f̃(s^{(n1)} + s^{(n2)} + v^{(n3)}),  for all n1, n2 ∈ [m'], n3 ∈ [3].    (5)

(Tensor decomposition) Define the characteristic matrix V_S ∈ C^{m×k} to be

    [V_S]_{n,j} = e^{iπ⟨µ^{(j)}, s^{(n)}⟩},  for n ∈ [m], j ∈ [k],    (6)

and define the matrix V_{S'} ∈ C^{m'×k} to be V_{S'} = [V_S; V_d; 1, ..., 1], where V_d ∈ C^{d×k} is defined in (17). Define

    V_2 = [ e^{iπ⟨µ^{(1)}, v^{(1)}⟩} ... e^{iπ⟨µ^{(k)}, v^{(1)}⟩} ; e^{iπ⟨µ^{(1)}, v^{(2)}⟩} ... e^{iπ⟨µ^{(k)}, v^{(2)}⟩} ; 1 ... 1 ] ∈ C^{3×k}.    (7)

Algorithm 1: General algorithm
  Input: R, m, noisy measurement function f̃(·).
  Output: Estimates {ŵ_j, µ̂^{(j)} : j ∈ [k]}.
  1. Take measurements: Let S = {s^{(1)}, ..., s^{(m)}} be m i.i.d. samples from the Gaussian distribution N(0, R² I_{d×d}). Set s^{(m+n)} = e_n for all n ∈ [d], and s^{(m+d+1)} = 0. Denote m' = m + d + 1. Take another random sample v from the unit sphere, and set v^{(1)} = v, v^{(2)} = 2v, and v^{(3)} = 0. Construct a tensor F̃ ∈ C^{m'×m'×3} with F̃_{n1,n2,n3} = f̃(s^{(n1)} + s^{(n2)} + v^{(n3)}).
  2. Tensor decomposition: Set (V̂_{S'}, D̂_w) = TensorDecomp(F̃). For j = 1, ..., k, set [V̂_{S'}]_j = [V̂_{S'}]_j / [V̂_{S'}]_{m',j}.
  3. Read off estimates: For j = 1, ..., k, set µ̂^{(j)} = Re( log([V̂_{S'}]_{[m+1:m+d], j}) / (iπ) ).
  4. Estimate weights: Set ŵ = argmin_{w ∈ C^k} ‖F̃ − V̂_{S'} ⊗ V̂_{S'} ⊗ (V̂_2 Diag(w))‖_F.
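To make Step 1 concrete, here is a minimal sketch (our own illustration; the sizes and the names `f`, `S_prime`, `offsets`, `V_Sp`, `V_2` are hypothetical). Consistent with the all-ones last row of V_2, the zero offset v^{(3)} = 0 is included as the third slice, and the construction is checked against the rank-k factorization stated in equation (8) below.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical small instance, for illustration only.
d, k, m, R = 2, 3, 12, 4.0
mu = rng.uniform(-1.0, 1.0, size=(k, d))
w = rng.uniform(0.5, 1.0, size=k) * np.exp(1j * rng.uniform(0, 2 * np.pi, size=k))

def f(s):
    """Noiseless measurement f(s) = sum_j w_j exp(i*pi*<mu^(j), s>), eq. (2)."""
    return np.sum(w * np.exp(1j * np.pi * (mu @ s)))

# Step 1: Gaussian samples, then the coordinate vectors e_1..e_d and the origin.
S = rng.normal(scale=R, size=(m, d))
S_prime = np.vstack([S, np.eye(d), np.zeros((1, d))])   # m' = m + d + 1 rows
m_prime = S_prime.shape[0]

v = rng.normal(size=d)
v /= np.linalg.norm(v)                                  # random unit vector
offsets = np.vstack([v, 2 * v, np.zeros(d)])            # v^(1), v^(2), v^(3) = 0

# Fill the m' x m' x 3 measurement tensor of eq. (5).
F = np.empty((m_prime, m_prime, 3), dtype=complex)
for n1 in range(m_prime):
    for n2 in range(m_prime):
        for n3 in range(3):
            F[n1, n2, n3] = f(S_prime[n1] + S_prime[n2] + offsets[n3])

# Sanity check against F = V_S' (x) V_S' (x) (V_2 D_w), as in eq. (8).
V_Sp = np.exp(1j * np.pi * (S_prime @ mu.T))            # m' x k characteristic matrix
V_2 = np.exp(1j * np.pi * (offsets @ mu.T))             # 3 x k, last row is all ones
F_check = np.einsum('ij,kj,lj->ikl', V_Sp, V_Sp, V_2 * w)  # broadcasts D_w into V_2
print(np.max(np.abs(F - F_check)))                      # close to machine precision
```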
Note that in the exact case (ε_z = 0) the tensor F constructed in (5) admits a rank-k decomposition:

    F = V_{S'} ⊗ V_{S'} ⊗ (V_2 D_w),    (8)

where D_w = Diag(w_1, ..., w_k). Assuming that V_{S'} has full column rank, this tensor decomposition is unique up to column permutation and rescaling with very high probability over the randomness of the random unit vector v. Since each element of V_{S'} has unit norm, and we know that the last row of V_{S'} and the last row of V_2 are all ones, there exists a proper scaling so that we can uniquely recover the w_j's and the columns of V_{S'} up to a common permutation. In this paper, we adopt Jennrich's algorithm (see Algorithm 2) for tensor decomposition. Other algorithms, for example the tensor power method ([1]) and recursive projection ([24]), which are possibly more stable than Jennrich's algorithm, can also be applied here.

(Read off estimates) Let log(V_d) denote the element-wise logarithm of V_d. The estimates of the point sources are given by

    [µ̂^{(1)}, µ̂^{(2)}, ..., µ̂^{(k)}] = Re( log(V̂_d) / (iπ) ),

where V̂_d denotes the rows of V̂_{S'} corresponding to the coordinate samples e_1, ..., e_d.

Algorithm 2: TensorDecomp
  Input: Tensor F̃ ∈ C^{m×m×3}, rank k.
  Output: Factor V̂ ∈ C^{m×k}.
  1. Compute the truncated SVD F̃(I, I, e_1) = P̂ Σ̂ P̂^⊤ with the k leading singular values.
  2. Set Ê = F̃(P̂, P̂, I). Set Ê_1 = Ê(I, I, e_1) and Ê_2 = Ê(I, I, e_2).
  3. Let the columns of Û be the eigenvectors of Ê_1 Ê_2^{−1} corresponding to the k eigenvalues with the largest absolute value.
  4. Set V̂ = √m · P̂ Û.

Remark 2.1. In the toy example, the simple algorithm corresponds to using the sampling set S' = {e_1, ..., e_d}. The conventional univariate matrix pencil method corresponds to using the sampling set S' = {0, 1, ..., m}, and the set of measurements S' ⊕ S' ⊕ S' corresponds to the grid [m]³.

2.2 Guarantees

In this section, we discuss how to pick the two parameters m and R, and prove that the proposed algorithm indeed achieves stable recovery in the presence of measurement noise.

Theorem 2.2 (Stable recovery). There exists a universal constant C such that the following holds. Fix ε_x ∈ (0, 1/2) and δ_s, δ_v ∈ (0, 1); pick m as in Lemma 2.8; for d = 1, pick R of order log(1 + 2/ε_x)/Δ, and for d ≥ 2, pick R of order √(log(k/ε_x))/Δ, as in Lemma 2.7. Assume the bounded measurement noise model as in (3), with ε_z at most of order δ_v w_min² (1 − 2ε_x)/(1 + 2ε_x) (up to factors polynomial in k and d). With probability at least 1 − δ_s over the random sampling of S, and with probability at least 1 − δ_v over the random projections in Algorithm 2, the proposed Algorithm 1 returns an estimate of the point source signal x̂(t) = Σ_{j=1}^{k} ŵ_j δ_{µ̂^{(j)}} with accuracy

    min_π max { ‖µ̂^{(j)} − µ^{(π(j))}‖_2 : j ∈ [k] } ≤ poly(d, k) · ε_z,

i.e., stable recovery in the sense of Definition 1.2, where the min is over permutations π on [k]. Moreover, the proposed algorithm has time complexity on the order of O((m')³).

The next lemma shows that, with overwhelming probability, all the frequencies taken concentrate within the hypercube with cutoff frequency R' on each coordinate, where R' is comparable to R.

Lemma 2.3 (The cutoff frequency). For d > 1, with high probability, all of the 2(m')² sampling frequencies in S' ⊕ S' ⊕ {v^{(1)}, v^{(2)}} satisfy

    ‖s^{(j1)} + s^{(j2)} + v^{(j3)}‖_∞ ≤ R',  for all j1, j2 ∈ [m'], j3 ∈ [2],

where the per-coordinate cutoff frequency is given by R' = O(R √(log(md))). For the d = 1 case, the cutoff frequency R' can be made to be on the order of R' = O(1/Δ).

Remark 2.4 (Failure probability). Overall, the failure probability consists of two pieces: δ_v for the random projection v, and δ_s for the random sampling that ensures a bounded condition number of V_S. This may be boosted to an arbitrarily high success probability through repetition.
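The decomposition and read-off steps can be illustrated end to end in the noiseless case. The sketch below is our own self-contained rendering in the spirit of Algorithm 2 (it projects with the conjugate transpose of the top singular vectors, a standard Jennrich-style variant; all names and sizes are hypothetical, and this is not the authors' reference code).

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, m, R = 2, 3, 12, 4.0                 # hypothetical sizes, for illustration only
mu = rng.uniform(-1.0, 1.0, size=(k, d))
w = rng.uniform(0.5, 1.0, size=k)

# Exact (noiseless) measurement tensor F = V_S' (x) V_S' (x) (V_2 D_w), eq. (8).
S_prime = np.vstack([rng.normal(scale=R, size=(m, d)), np.eye(d), np.zeros((1, d))])
v = rng.normal(size=d); v /= np.linalg.norm(v)
offsets = np.vstack([v, 2 * v, np.zeros(d)])
V_Sp = np.exp(1j * np.pi * (S_prime @ mu.T))
V_2 = np.exp(1j * np.pi * (offsets @ mu.T))
F = np.einsum('ij,kj,lj->ikl', V_Sp, V_Sp, V_2 * w)

# Jennrich-style decomposition: whiten with the top-k singular vectors of the first
# slice, then diagonalize E1 * E2^{-1}; its eigenvectors give the factor columns.
P = np.linalg.svd(F[:, :, 0])[0][:, :k]
E1 = P.conj().T @ F[:, :, 0] @ P.conj()
E2 = P.conj().T @ F[:, :, 1] @ P.conj()
_, U = np.linalg.eig(E1 @ np.linalg.inv(E2))
V_hat = P @ U
V_hat = V_hat / V_hat[-1, :]               # rescale so the last row is all ones

# Read off the point sources from the rows corresponding to e_1, ..., e_d.
mu_hat = np.real(np.log(V_hat[m:m + d, :]) / (1j * np.pi)).T   # k x d, up to permutation
err = max(min(np.linalg.norm(mu_hat[i] - mu[j]) for i in range(k)) for j in range(k))
print(err)   # tiny: exact recovery in the noiseless case, up to numerical error
```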
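The role of the two parameters m and R in Theorem 2.2 is easy to probe numerically. The small experiment below (our own; the constants 10 and 2.0 are arbitrary) draws the Gaussian sampling set with standard deviation R proportional to √(log k)/Δ and with m on the order of k log k, and checks that the characteristic matrix V_S is well conditioned, which is what Lemma 2.8 in the next subsection makes precise.

```python
import numpy as np

rng = np.random.default_rng(3)

def min_separation(mu):
    k = mu.shape[0]
    return min(np.linalg.norm(mu[i] - mu[j]) for i in range(k) for j in range(i + 1, k))

d, k = 8, 6
mu = rng.uniform(-1.0, 1.0, size=(k, d))
Delta = min_separation(mu)

m = 10 * k * int(np.ceil(np.log(k)))     # m ~ k log k samples (arbitrary constant)
R = 2.0 * np.sqrt(np.log(k)) / Delta     # R ~ sqrt(log k) / Delta (arbitrary constant)
S = rng.normal(scale=R, size=(m, d))
V_S = np.exp(1j * np.pi * (S @ mu.T))    # m x k characteristic matrix, eq. (6)
print(np.linalg.cond(V_S))               # typically close to 1, regardless of how small Delta is
```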
2.3 Key Lemmas

Stability of tensor decomposition: In this paragraph, we give a brief description and the stability guarantee of the well-known Jennrich's algorithm ([11, 13]) for low-rank 3rd-order tensor decomposition. We only state it for the symmetric tensors that appear in the proposed algorithm. Consider a tensor F = V ⊗ V ⊗ (V_2 D_w) ∈ C^{m×m×3} where the factor V has full column rank k. Then the decomposition is unique up to column permutation and rescaling, and Algorithm 2 finds the factors efficiently. Moreover, the eigen-decomposition is stable if the factor V is well conditioned and the eigenvalues of Ê_1 Ê_2^{−1} are well separated.

Lemma 2.5 (Stability of Jennrich's algorithm). Consider the 3rd-order tensor F = V ⊗ V ⊗ (V_2 D_w) ∈ C^{m×m×3} of rank k ≤ m, constructed as in Step 1 of Algorithm 1. Given a tensor F̃ that is element-wise close to F, namely |F̃_{n1,n2,n3} − F_{n1,n2,n3}| ≤ ε_z for all n1, n2 ∈ [m] and n3 ∈ [3], and assume that the noise is small:

    ε_z ≤ δ_v w_min² / (d k w_max cond_2(V)⁵).

Use F̃ as the input to Algorithm 2. With probability at least 1 − δ_v over the random projections v^{(1)} and v^{(2)}, we can bound the distance between the columns of the output V̂ and those of V by

    min_π max { ‖V̂_j − V_{π(j)}‖_2 : j ∈ [k] } ≤ C · cond_2(V)⁵ · ε_z,    (9)

where C is a universal constant and the min is over permutations π on [k].

Condition number of V_{S'}: The following lemma is helpful.

Lemma 2.6. Let V_{S'} ∈ C^{(m+d+1)×k} be the factor as defined in (7). Recall that V_{S'} = [V_S; V_d; 1], where V_d is defined in (17) and V_S is the characteristic matrix defined in (6). We can bound the condition number of V_{S'} by

    cond_2(V_{S'}) ≤ k · cond_2(V_S).    (10)

Condition number of the characteristic matrix V_S: The stability analysis of the proposed algorithm therefore boils down to understanding the relation between the random sampling set S and the condition number of the characteristic matrix V_S. This is analyzed in Lemma 2.8 (the main technical lemma).

Lemma 2.7. Fix any ε_x ∈ (0, 1/2). Consider a Gaussian vector s with distribution N(0, R² I_{d×d}), where R is of order √(log(k/ε_x))/Δ for d ≥ 2, and of order log(1 + 2/ε_x)/Δ for d = 1. Define the Hermitian random matrix X_s ∈ C^{k×k} as the outer product

    X_s = [e^{−iπ⟨µ^{(1)},s⟩}, ..., e^{−iπ⟨µ^{(k)},s⟩}]^⊤ [e^{iπ⟨µ^{(1)},s⟩}, ..., e^{iπ⟨µ^{(k)},s⟩}].    (11)

We can bound the spectrum of E_s[X_s] by

    (1 − ε_x) I_{k×k} ⪯ E_s[X_s] ⪯ (1 + ε_x) I_{k×k}.    (12)

Lemma 2.8 (Main technical lemma). In the same setting as Lemma 2.7, let S = {s^{(1)}, ..., s^{(m)}} be m independent samples of the Gaussian vector s. For m at least of order k log(k/δ_s)/ε_x², with probability at least 1 − δ_s over the random sampling, the condition number of the factor V_S is bounded by

    cond_2(V_S) ≤ √( (1 + 2ε_x) / (1 − 2ε_x) ).

Acknowledgments

The authors thank Rong Ge and Ankur Moitra for very helpful discussions. Sham Kakade acknowledges funding from the Washington Research Foundation for Innovation in Data-intensive Discovery.

References

[1] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773–2832, 2014.
[2] A. Anandkumar, D. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden Markov models. arXiv preprint arXiv:1203.0683, 2012.
[3] E. J. Candès and C. Fernandez-Granda. Super-resolution from noisy data. Journal of Fourier Analysis and Applications, 19(6):1229–1254, 2013.
[4] E. J. Candès and C. Fernandez-Granda. Towards a mathematical theory of super-resolution. Communications on Pure and Applied Mathematics, 67(6):906–956, 2014.
[5] Y. Chen and Y. Chi. Robust spectral compressed sensing via structured matrix completion. IEEE Transactions on Information Theory, 60(10):6576–6601, 2014.
[6] S. Dasgupta. Learning mixtures of Gaussians.
In 40th Annual Symposium on Foundations of Computer Science, pages 634–644. IEEE, 1999.
[7] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms, 22(1):60–65, 2003.
[8] S. Dasgupta and L. J. Schulman. A two-round variant of EM for Gaussian mixtures. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 152–159. Morgan Kaufmann Publishers Inc., 2000.
[9] D. L. Donoho. Superresolution via sparsity constraints. SIAM Journal on Mathematical Analysis, 23(5):1309–1331, 1992.
[10] C. Fernandez-Granda. A Convex-Programming Framework for Super-Resolution. PhD thesis, Stanford University, 2014.
[11] R. A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an explanatory multi-modal factor analysis. 1970.
[12] V. Komornik and P. Loreti. Fourier Series in Control Theory. Springer Science & Business Media, 2005.
[13] S. Leurgans, R. Ross, and R. Abel. A decomposition for three-way arrays. SIAM Journal on Matrix Analysis and Applications, 14(4):1064–1083, 1993.
[14] W. Liao and A. Fannjiang. MUSIC for single-snapshot spectral estimation: Stability and super-resolution. Applied and Computational Harmonic Analysis, 2014.
[15] A. Moitra. The threshold for super-resolution via extremal functions. arXiv preprint arXiv:1408.1681, 2014.
[16] E. Mossel and S. Roch. Learning nonsingular phylogenies and hidden Markov models. In Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, pages 366–375. ACM, 2005.
[17] S. Nandi, D. Kundu, and R. K. Srivastava. Noise space decomposition method for two-dimensional sinusoidal model. Computational Statistics & Data Analysis, 58:147–161, 2013.
[18] K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A, pages 71–110, 1894.
[19] D. Potts and M. Tasche. Parameter estimation for nonincreasing exponential sums by Prony-like methods. Linear Algebra and its Applications, 439(4):1024–1039, 2013.
[20] D. L. Russell. Controllability and stabilizability theory for linear partial differential equations: Recent progress and open questions. SIAM Review, 20(4):639–739, 1978.
[21] A. Sanjeev and R. Kannan. Learning mixtures of arbitrary Gaussians. In Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, pages 247–257. ACM, 2001.
[22] G. Schiebinger, E. Robeva, and B. Recht. Superresolution without separation. arXiv preprint arXiv:1506.03144, 2015.
[23] G. Tang, B. N. Bhaskar, P. Shah, and B. Recht. Compressed sensing off the grid. IEEE Transactions on Information Theory, 59(11):7465–7490, 2013.
[24] S. S. Vempala and Y. F. Xiao. Max vs min: Independent component analysis with nearly linear sample complexity. arXiv preprint arXiv:1412.2954, 2014.