# A Fourier Approach to Mixture Learning

Mingda Qiao, Stanford University, mqiao@stanford.edu
Guru Guruganesh, Google Research, gurug@google.com
Ankit Singh Rawat, Google Research, ankitsrawat@google.com
Avinava Dubey, Google Research, avinavadubey@google.com
Manzil Zaheer, Google DeepMind, manzilzaheer@google.com

Part of this work was done while working as an intern at Google Research.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

We revisit the problem of learning mixtures of spherical Gaussians. Given samples from the mixture (1/k)·Σ_{j=1}^k N(µⱼ, I_d), the goal is to estimate the means µ₁, µ₂, ..., µk ∈ R^d up to a small error. The hardness of this learning problem can be measured by the separation Δ, defined as the minimum distance between all pairs of means. Regev and Vijayaraghavan [2017] showed that with Δ = Ω(√log k) separation, the means can be learned using poly(k, d) samples, whereas super-polynomially many samples are required if Δ = o(√log k) and d = Ω(log k). This leaves open the low-dimensional regime where d = o(log k).

In this work, we give an algorithm that efficiently learns the means in d = O(log k / log log k) dimensions under separation d/√log k (modulo doubly logarithmic factors). This separation is strictly smaller than √log k, and is also shown to be necessary. Along with the results of Regev and Vijayaraghavan [2017], our work almost pins down the critical separation threshold at which efficient parameter learning becomes possible for spherical Gaussian mixtures. More generally, our algorithm runs in time poly(k)·f(d, Δ, ε), and is thus fixed-parameter tractable in the parameters d, Δ and ε. Our approach is based on estimating the Fourier transform of the mixture at carefully chosen frequencies, and both the algorithm and its analysis are simple and elementary. Our positive results can be easily extended to learning mixtures of non-Gaussian distributions, under a mild condition on the Fourier spectrum of the distribution.

## 1 Introduction

Gaussian mixture models (GMMs) are one of the most well-studied models for a population with different components. A GMM defines a distribution over the d-dimensional Euclidean space as the weighted sum of normal distributions Σ_{i=1}^k wᵢ·N(µᵢ, Σᵢ), which is specified by the following quantities: the number of components k ∈ N, the component means µᵢ ∈ R^d, the component covariances Σᵢ ∈ R^{d×d}, which are positive definite matrices, and the weights wᵢ ≥ 0 that sum up to 1. In this work, we consider the uniform spherical case, where the weights are uniform (wᵢ = 1/k) and each covariance matrix Σᵢ = I_d is the identity matrix. The central problem in this setup is to efficiently estimate the means µ₁, ..., µk. To avoid degenerate cases, such as when some of the means coincide, it is common to parameterize the problem by the separation Δ of the means, which guarantees that ‖µᵢ − µⱼ‖₂ ≥ Δ for all i ≠ j. More precisely, the problem is to estimate the means µ₁, ..., µk ∈ R^d up to an error ε with runtime poly(k, 1/ε, d), under as small a separation as possible. There has been a long line of work on this problem, which we survey in Section 1.3.

Recently, Regev and Vijayaraghavan [2017] showed that a separation of Δ = Ω(√log k) is strictly necessary when the dimension d = Ω(log k). Two natural questions arise immediately. First, whether Δ = Θ(√log k) is sufficient for efficient learning when d = Ω(log k).
Although the original work of Regev and Vijayaraghavan [2017] showed that this is information-theoretically possible, an actual efficient algorithm was only recently developed by Liu and Li [2022] (who show nearly tight results). The second main question is determining the optimal separation in low dimensions, when d = o(log k). Previously, even in O(1) dimensions, the exact separation necessary was unknown. In this paper, we settle the second question and give nearly optimal bounds on the separation necessary in low dimensions (see Figure 1 for more details).

### 1.1 Overview of Results

We begin with a few definitions. A point set {x₁, x₂, ..., xk} is called Δ-separated if ‖xⱼ − xⱼ'‖₂ ≥ Δ for any j ≠ j'. We say that a Gaussian mixture is Δ-separated (or has separation Δ) if the means of its components are Δ-separated. Two point sets {u₁, ..., uk} and {v₁, ..., vk} are ε-close if, for some permutation σ over [k], ‖uⱼ − v_{σ(j)}‖₂ ≤ ε holds for every j.

Our main result is an algorithm that efficiently learns the parameters of a mixture of k spherical Gaussians under separation Δ ≥ (d/√log k)·poly(log log k) in d = O(log k / log log k) dimensions. In the low-dimensional regime, this separation is strictly smaller than min{√log k, d}, the smallest separation under which previous algorithms could provably learn the parameters in poly(k) time.

Theorem 1.1 (Upper bound, informal). Let P be a uniform mixture of k spherical Gaussians in d = O(log k / log log k) dimensions with separation Δ ≥ (d/√log k)·log((log k)/d). There is a poly(k)-time algorithm that, given samples from P, outputs k vectors that are w.h.p. ε-close to the true means, for ε = O(Δ / min{d, log k}).

See Theorem 2.1 and Remark 2.2 for a more formal statement of our algorithmic result, which holds for a wider range of the separation and accuracy parameters. Our learning algorithm is provably correct for arbitrarily small Δ, ε > 0 (possibly with a longer runtime), whereas for most of the previous algorithms there is a breakdown point Δ₀ such that the algorithm is not known to work when Δ < Δ₀. Two exceptions are the algorithms of Moitra and Valiant [2010] and Belkin and Sinha [2010], both of which allow an arbitrarily small separation but run in time exponential in k. We also remark that the runtime of our algorithm scales as O(k³)·f(d, Δ, ε), and is thus fixed-parameter tractable in the parameters d, Δ and ε.²

²While we focus on the uniform-weight case for simplicity, Theorem 1.1 can be easily extended to the setting where each weight is in [1/(Ck), C/k] for some constant C > 1.

We complement Theorem 1.1 with an almost-matching lower bound, showing that a separation of roughly d/√log k is necessary for efficient parameter learning in low dimensions.

Theorem 1.2 (Lower bound, informal). For d = O(log k / log log k) and separation Δ below the d/√log k threshold (up to doubly logarithmic factors), there are two mixtures of k spherical Gaussians in R^d such that: (1) both have separation Δ; (2) their means are not (Δ/2)-close; and (3) the total variation (TV) distance between them is k^{−ω(1)}.

See Theorem E.1 for a more formal version of the lower bound. Theorem 1.1 and Theorem 1.2 together nearly settle the polynomial learnability of spherical Gaussian mixtures in the low-dimensional regime. Up to a doubly logarithmic factor, the critical separation at which efficient learning becomes possible is d/√log k. To the best of our knowledge, this was previously unknown even for d = O(1).³
³An exception is the d = 1 case: A result of Moitra [2015] implies that Δ = Ω(1/√log k) suffices (see Section 1.3), and a matching lower bound was given by Moitra and Valiant [2010].

See Figure 1 below for a plot of our results in the context of prior work. The green regions cover the parameters (Δ, d) such that mixtures of k spherical Gaussians in d dimensions with separation Δ are learnable (up to O(Δ) error) using poly(k) samples.⁴ The red regions contain the parameters under which polynomial-sample learning is provably impossible.

⁴In terms of computational complexity, all the green regions (except a small portion of Region III) admit efficient algorithms. The algorithm of Liu and Li [2022] is efficient when Δ = Ω((log k)^{1/2+c}) for any c > 0 and thus almost covers Region III. For Region IV, Regev and Vijayaraghavan [2017] gave an efficient algorithm only for d = O(1), whereas Theorem 1.1 covers the entire Region IV.

[Figure 1 (plot): a phase diagram with the dimension d on the horizontal axis (marked at log k) and the separation Δ on the vertical axis (marked at 1/√log k and √log k); the green regions are labeled "Learnable with poly(k) samples", the red regions "Not poly(k)-sample learnable", and Regions I–IV together with the area settled by Theorem 1.1 are indicated.]

Figure 1: Region I is a direct corollary of Moitra and Valiant [2010, Proposition 15]. Regions II, III, and IV are shown by Regev and Vijayaraghavan [2017, Theorems 1.2, 1.3, and 1.4], respectively. The upper boundary of Region IV is the curve Δ = d. Theorems 1.1 and 1.2 settle the learnability in the remaining area, by proving that the line Δ = d/√log k is the boundary between polynomial and super-polynomial sample complexities (up to a doubly logarithmic factor).

The algorithm that underlies Theorem 1.1 can be easily extended beyond the spherical Gaussian case. The following more general result states that for any distribution D whose Fourier transform does not decay too fast, we can efficiently learn the parameters of a mixture of k translated copies of D. In the following, let Dµ denote the distribution of X + µ when X is drawn from D.

Theorem 1.3 (Learning more general mixtures, informal). Let P = (1/k)·Σ_{j=1}^k D_{µⱼ} for Δ-separated µ₁, ..., µk ∈ R^d. There is an algorithm that, given ε > 0 and samples from P, runs in time poly(k, Δ/ε, 1/δ, M_max) for some δ = δ(D, ε) and M_max = M_max(k, d, Δ, ε), and outputs µ̂₁, ..., µ̂k that are w.h.p. ε-close to the true parameters.

See Theorem F.1 and Corollary F.2 for a more formal statement of the runtime. Theorem 1.3 applies to many well-known distribution families that are specified by either a location parameter or a scale parameter. Table 1 gives a few examples of applying Theorem 1.3 to mixtures of single-parameter univariate distributions; see Appendix F.2 and Appendix F.3 for more details.

| Distribution | Parameter | Density function | Runtime |
|---|---|---|---|
| Cauchy | µ | 1/(π(1 + (x − µ)²)) | O(k³)·e^{Õ(√(log k)/ε)} |
| Logistic | µ | e^{−(x−µ)}/(1 + e^{−(x−µ)})² | O(k³)·e^{Õ(√(log k)/ε)} |
| Laplace | µ | (1/2)·e^{−\|x−µ\|} | O(k³/ε⁵) |
| Exponential | ln λ | λe^{−λx}·1[x ≥ 0] | O(k³)·e^{Õ(√(log k)/ε)} |

Table 1: Implications of Theorem 1.3 for learning various families of mixtures of univariate distributions beyond Gaussians, assuming that the parameters of different components are Δ-separated. The algorithm outputs k parameters that are O(ε)-close to the true parameters.
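For intuition about the runtimes in Table 1, it helps to recall the characteristic functions (Fourier transforms) of the three location families listed there. These are standard facts, included here for reference only and not taken from the paper:

```latex
% Characteristic functions of the base distributions D in Table 1
% (location parameter set to 0); standard facts, stated for intuition.
\begin{align*}
  \text{Cauchy:}   \quad (\mathcal{F}D)(\xi) &= e^{-|\xi|}, \\
  \text{Logistic:} \quad (\mathcal{F}D)(\xi) &= \frac{\pi\xi}{\sinh(\pi\xi)}, \\
  \text{Laplace:}  \quad (\mathcal{F}D)(\xi) &= \frac{1}{1+\xi^2}.
\end{align*}
```

The exponentially decaying transforms (Cauchy, Logistic) are consistent with the e^{Õ(√(log k)/ε)} factors in Table 1, while the polynomially decaying Laplace transform is consistent with the fully polynomial O(k³/ε⁵) runtime. The exponential distribution has a scale parameter (appearing as ln λ in the table) and is handled separately in Appendix F.3.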
Limitations of our work. The main limitation is that our positive results only apply to the regime where the dimension d is logarithmic in the number of clusters, and where all clusters are translated copies of the same distribution. A concrete future direction would be to extend our results to learning mixtures of general Gaussians, even in one dimension.

### 1.2 Proof Overview

For simplicity, we focus on the one-dimensional case that P = (1/k)·Σ_{j=1}^k N(µⱼ, 1) for Δ-separated means µ₁, ..., µk ∈ R. We index the components such that |µ₁| ≤ |µ₂| ≤ ··· ≤ |µk|, and focus on the following testing problem: Given ε > 0 and samples from P, determine whether µ₁ = 0 or |µ₁| ≥ ε, assuming that one of the two holds. Note that this testing problem is not harder than estimating µ₁, ..., µk up to error ε/3: in the former case that µ₁ = 0, one of the mean estimates would fall into [−ε/3, ε/3], whereas all of them must be outside (−2ε/3, 2ε/3) in the latter case. Conversely, as we will prove in Section 2, an algorithm for the testing version can be used for recovering the means as well.

Examine the Fourier spectrum. We start by examining the Fourier transform of P, (FP)(ξ) := E_{X∼P}[e^{iξX}], more commonly known in the literature as the characteristic function. Since the Fourier transform of a Gaussian is still a Gaussian, and a translation in the time domain shifts the phase in the frequency domain, we have

E_{X∼N(µⱼ,1)}[e^{iξX}] = e^{iµⱼξ − ξ²/2}.    (1)

We will focus on the quantity Aµ(ξ) := Σ_{j=1}^k e^{iµⱼξ}, which is the total phase over the k components of P. Equation (1) essentially states that Aµ(ξ) can be estimated by averaging e^{iξX} over samples X ∼ P. The key observation is that each term e^{iµⱼξ} of Aµ(ξ) behaves quite differently depending on the magnitude of µⱼ: if µⱼ = 0, e^{iµⱼξ} = 1 is a constant, whereas ξ ↦ e^{iµⱼξ} is a high-frequency wave when |µⱼ| is large. This suggests that the two cases (µ₁ = 0 and |µ₁| ≥ ε) can be distinguished by estimating Aµ(ξ) at different frequencies.

The cost of estimating Aµ(ξ), however, depends heavily on the frequency ξ: Equation (1) together with a simple concentration bound shows that O(k²)·e^{O(ξ²)} samples are sufficient for estimating Aµ(ξ) up to a constant error. Therefore, the crux of this approach is to find the minimum M > 0 such that the two cases can be distinguished by estimating Aµ(ξ) over ξ ∈ [−M, M]. (This is known as the super-resolution problem, which we discuss in Section 1.3.) The sample complexity of the testing problem can then be roughly bounded by O(k²)·e^{O(M²)}. In the following, we explore different ways of picking ξ from the range [−M, M].

Choosing ξ randomly. Our overall strategy is to draw ξ randomly from some distribution D over the interval [−M, M] and evaluate E_{ξ∼D}[Aµ(ξ)]. We will argue that this expectation is very close to its first term E_{ξ∼D}[e^{iµ₁ξ}], which takes different values depending on whether µ₁ = 0 or |µ₁| ≥ ε. Then, D needs to be chosen such that: (1) there is a gap in the value of E_{ξ∼D}[e^{iµ₁ξ}] between the two cases; (2) Σ_{j=2}^k |E_{ξ∼D}[e^{iµⱼξ}]| is small enough for the gap in the first term to be easily identified.

As a warmup, let D be the uniform distribution over [0, M]. A simple calculation shows that the gap in the first term E_{ξ∼D}[e^{iµ₁ξ}] between the two cases (µ₁ = 0 or |µ₁| ≥ ε) is lower bounded by Ω(min{Mε, 1}), whereas the contribution from the j-th component satisfies |E_{ξ∼D}[e^{iµⱼξ}]| ≲ 1/(M·|µⱼ|). Furthermore, the Δ-separation between the means implies |µⱼ| = Ω(Δ·j). Thus,

Σ_{j=2}^k |E_{ξ∼D}[e^{iµⱼξ}]| ≲ (1/M)·Σ_{j=2}^k 1/(Δ·j) ≲ (log k)/(M·Δ).

The above is much smaller than min{Mε, 1} if we set M ≳ max{(log k)/Δ, √((log k)/(Δε))}. Unfortunately, even when Δ and ε are constants, we have O(k²)·e^{O(M²)} = k^{O(log k)}, and the resulting sample complexity is already super-polynomial in k.
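As a quick aside, the estimator behind Equation (1) is simple to implement. The following minimal sketch (with illustrative means and sample sizes; this is not the paper's Algorithm 1 or 2) rescales the empirical characteristic function by k·e^{ξ²/2} to estimate Aµ(ξ) at a single frequency, and illustrates why the cost grows like e^{O(ξ²)}.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(mus, n):
    """Draw n samples from the uniform 1-D mixture (1/k) * sum_j N(mu_j, 1)."""
    idx = rng.integers(len(mus), size=n)
    return rng.normal(loc=np.asarray(mus)[idx], scale=1.0)

def estimate_A(samples, k, xi):
    """Estimate A_mu(xi) = sum_j exp(i * mu_j * xi) via Equation (1):
    E_{X~P}[e^{i xi X}] = e^{-xi^2/2} * A_mu(xi) / k."""
    return k * np.exp(xi ** 2 / 2) * np.mean(np.exp(1j * xi * samples))

mus = [0.0, 3.0, -4.5, 7.0]   # illustrative, well-separated means
X = sample_mixture(mus, n=200_000)
for xi in (0.5, 1.0, 2.0):
    true_A = np.sum(np.exp(1j * xi * np.array(mus)))
    est_A = estimate_A(X, len(mus), xi)
    # The rescaling factor e^{xi^2/2} amplifies the sampling noise, which is
    # why roughly k^2 * e^{O(xi^2)} samples are needed for a constant error.
    print(f"xi = {xi:3.1f}   true A = {true_A:.3f}   estimate = {est_A:.3f}")
```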
A better choice of D. It turns out that choosing D to be a truncated Gaussian leads to a much lower sample complexity. For some σ ≤ M, we draw ξ ∼ N(0, σ²) and then truncate it to [−M, M]. Without the truncation, the expectation of e^{iµⱼξ} has a nice closed form:

E_{ξ∼N(0,σ²)}[e^{iµⱼξ}] = e^{−σ²µⱼ²/2},

which is exactly the Fourier weight of N(0, σ²) at frequency µⱼ. Note that this decreases very fast as |µⱼ| grows, compared to the previous rate of 1/(M·|µⱼ|) when D is uniform.

It again follows from a simple calculation that: (1) depending on whether µ₁ = 0 or |µ₁| ≥ ε, the gap in E_{ξ∼N(0,σ²)}[e^{iµ₁ξ}] between the two cases is Ω(min{σ²ε², 1}); (2) the total contribution from j = 2, 3, ..., k is upper bounded by

Σ_{j=2}^k |E_{ξ∼N(0,σ²)}[e^{iµⱼξ}]| = Σ_{j=2}^k e^{−Ω(σ²Δ²j²)} = e^{−Ω(σ²Δ²)},

where the second step applies |µⱼ| = Ω(Δ·j) and the last step holds assuming σ = Ω(1/Δ). In addition, we need to deal with the error incurred by the truncation. The Gaussian tail bounds imply that |ξ| ≥ M happens with probability e^{−Ω(M²/σ²)}. Since |Aµ(ξ)| ≤ k for any ξ, the noise from the truncation is at most k·e^{−Ω(M²/σ²)} in magnitude. It remains to choose M, σ > 0 that satisfy the two inequalities:

e^{−Ω(σ²Δ²)} ≪ min{σ²ε², 1}  and  k·e^{−Ω(M²/σ²)} ≪ min{σ²ε², 1}.

It suffices to choose σ ≍ 1/Δ and M ≍ √(log k)/Δ. For ε = Θ(Δ), the sample complexity O(k²)·e^{O(M²)} reduces to k^{O(1/Δ²)}, which is polynomial in k for any fixed Δ.

We note that in the above derivation, the assumption that each cluster of P is a Gaussian is only applied in Equation (1) (through the Fourier transform). For any distribution D over R and P = (1/k)·Σ_{j=1}^k D_{µⱼ}, the quantity Aµ(ξ) = Σ_{j=1}^k e^{iµⱼξ} can still be read off from the Fourier transform of P at frequency ξ, except that the extra factor e^{−ξ²/2} becomes (FD)(ξ), the Fourier transform of D at frequency ξ. This observation leads to our algorithm for learning a mixture of multiple translated copies of D (Theorem 1.3).

Gaussian truncation. For the spherical Gaussian case, however, the above only gives an algorithm with runtime k^{O(1/Δ²)}, which becomes super-polynomial as soon as Δ = o(1), falling short of achieving the near-optimal separation of roughly 1/√log k (in one dimension) in Theorem 1.1. We further improve our algorithm for the spherical Gaussian case using a Gaussian truncation technique. Intuitively, when deciding whether the mixture P contains a cluster with mean zero, a sample X ∼ P is much more informative when |X| is small. This motivates us to focus on samples with a small magnitude via a truncation. We apply such a truncation in a soft way: weighting each sample X by e^{−X²/2}. This turns out to be sufficiently effective while keeping the entire analysis simple.

Note that the weighting effectively multiplies the density of the mixture P with that of the standard Gaussian N(0, 1) pointwise. The result is still an (un-normalized) mixture of Gaussians: up to a constant factor, the pointwise product is Σ_{j=1}^k e^{−µⱼ²/4}·N(µⱼ/2, 1/2). Consequently, if we repeat the analysis from the previous paragraphs on this weighted mixture, the noise coming from components with large |µⱼ| becomes even smaller. This eventually allows us to learn the mixture efficiently at separation Δ ≈ 1/√log k. We prove Theorem 1.1 via a natural extension of this analysis to the d-dimensional case.
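To make the derivation above concrete, the short sketch below (with illustrative parameters; it omits the e^{−X²/2} weighting and the d-dimensional extension, so it is not the paper's Algorithm 2) estimates E_{ξ∼D}[Aµ(ξ)] with ξ drawn from a truncated Gaussian, and shows the gap between the two cases µ₁ = 0 and |µ₁| ≥ ε.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_mixture(mus, n):
    idx = rng.integers(len(mus), size=n)
    return rng.normal(loc=np.asarray(mus)[idx], scale=1.0)

def test_statistic(samples, k, sigma, M, n_freq=500):
    """Monte-Carlo estimate of E_{xi ~ N(0, sigma^2) truncated to [-M, M]}[A_mu(xi)],
    where A_mu(xi) is read off the empirical characteristic function via Equation (1)."""
    xi = rng.normal(0.0, sigma, size=n_freq)
    xi = xi[np.abs(xi) <= M]                      # truncation to [-M, M]
    vals = []
    for x in xi:
        emp_cf = np.mean(np.exp(1j * x * samples))
        vals.append(k * np.exp(x ** 2 / 2) * emp_cf)
    return np.mean(vals)

delta, eps = 3.0, 1.0                  # separation and testing threshold (illustrative)
sigma, M = 2.0 / delta, 3.0            # frequency scale ~ 1/Delta; M a few sigmas
far_means = [5.0, -6.0, 9.5, -11.0]    # the remaining Delta-separated means

for mu1, label in [(0.0, "mu_1 = 0"), (eps, "|mu_1| >= eps")]:
    X = sample_mixture([mu1] + far_means, n=50_000)
    T = test_statistic(X, k=5, sigma=sigma, M=M)
    # The statistic concentrates around e^{-sigma^2 mu_1^2 / 2} (plus a tiny
    # contribution from the far-away components), so Re(T) separates the cases.
    target = np.exp(-sigma ** 2 * mu1 ** 2 / 2)
    print(f"{label:14s}  Re(T) = {T.real:+.3f}   target ~ {target:.3f}")
```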
Lower bound. Our proof of Theorem 1.2 follows an approach similar to those in the previous lower bound proofs of Moitra and Valiant [2010], Hardt and Price [2015], and Regev and Vijayaraghavan [2017]: First, construct two sets of well-separated points {µ₁^(P), ..., µk^(P)} and {µ₁^(Q), ..., µk^(Q)} with (approximately) matching lower-order moments, i.e., (1/k)·Σ_{j=1}^k (µⱼ^(P))^t ≈ (1/k)·Σ_{j=1}^k (µⱼ^(Q))^t for every small t. Then, show that these matching moments imply that, after a convolution with a Gaussian, the resulting mixtures are close in TV-distance and thus hard to distinguish.

In more detail, similar to the proof of Regev and Vijayaraghavan [2017], we start by choosing N arbitrary points from a small ℓ₂-ball, such that they are Δ-separated. The main difference is that Regev and Vijayaraghavan [2017] pick N ≫ k, and show that among all the possible mixtures, there are at least two mixtures with similar moments via a pigeonhole-type argument. In contrast, we work in the N = k regime, and obtain two point sets with matching lower-order moments by slightly perturbing these N points in opposite directions. The existence of such a good perturbation is shown via a careful application of the Borsuk–Ulam Theorem, which is inspired by a similar application in Hardt and Price [2015].

### 1.3 Related Work

Learning Gaussian mixture models. Most closely related to this paper is the line of work in the theoretical computer science literature on algorithms that provably learn mixtures of Gaussians. The pioneering work of Dasgupta [1999] showed that an Ω̃(√d) separation between the means of different components is sufficient for the samples to be easily clustered. Several subsequent works (e.g., Vempala and Wang [2004], Arora and Kannan [2005], Achlioptas and McSherry [2005], Kannan et al. [2005], Dasgupta and Schulman [2007], Brubaker and Vempala [2008]) further generalized this result, and the separation condition was relaxed to Δ = Ω((min{k, d})^{1/4}·poly(log k, log d)) by Vempala and Wang [2004]. We refer the readers to Regev and Vijayaraghavan [2017] for a more detailed survey of these results.

Moitra and Valiant [2010] and Belkin and Sinha [2010] gave the first algorithms for learning general mixtures of Gaussians (with different unknown covariances) under an arbitrarily small separation between different components. Both algorithms are based on the method of moments, and run in time polynomial in d and the inverse of the minimum TV-distance between different components when k = O(1). For large k, however, the runtime is exponential in k. Regev and Vijayaraghavan [2017] showed that poly(k, d) samples are sufficient to estimate the means of a mixture of spherical Gaussians if the means are separated by Δ = Ω(√log k). Unfortunately, their algorithm involves an exhaustive search and is computationally inefficient. Subsequently, three concurrent works [Diakonikolas et al., 2018, Hopkins and Li, 2018, Kothari et al., 2018] developed algorithms based on the Sum-of-Squares hierarchy that run in time (dk)^{poly(1/γ)} and learn mixtures with separation Δ = Ω(k^γ). In particular, setting γ = Θ(log log k / log k) gives a quasi-polynomial time algorithm that achieves the Ω(√log k) separation of Regev and Vijayaraghavan [2017]. A very recent work of Liu and Li [2022] made further progress towards learning Ω(√log k)-separated mixtures efficiently, by giving a polynomial-time algorithm that succeeds under separation Ω((log k)^{1/2+c}) for any constant c > 0.

Lower bounds. On the lower bound side, Moitra and Valiant [2010] first proved, by explicitly constructing a hard instance, that learning a mixture of k Gaussians with separation Δ = Θ(1/k) requires e^{Ω(k)} samples, even in d = 1 dimension. Hardt and Price [2015] focused on the regime where k = O(1) and the recovery error ε goes to zero, and showed that Ω(ε^{−(6k−2)}) samples are necessary. Anderson et al. [2014] proved a lower bound that strengthens the result of Moitra and Valiant [2010]: separation Δ = 1/poly(k) is insufficient for the mixture to be learnable with polynomially many samples, even when the means of the Gaussians are drawn uniformly at random from [0, 1]^d for d = O(log k / log log k). Regev and Vijayaraghavan [2017] proved that in d = Ω(log k) dimensions, no polynomial-sample algorithm exists if Δ = o(√log k). In particular, even when the means are chosen randomly, the resulting mixture is still hard to learn with high probability.
This lower bound, together with their positive result for the Δ = Ω(√log k) regime, shows that √log k is the critical separation threshold in high dimensions. If we restrict our attention to statistical query (SQ) algorithms, Diakonikolas et al. [2017] proved that any SQ algorithm needs d^{Ω(k)} queries if d ≥ poly(k) and non-spherical components are allowed.

Density estimation. In contrast to the parameter estimation setting that we focus on, the density estimation setting requires the learning algorithm to output (the representation of) a hypothesis distribution P̂ that is close to the mixture P in TV-distance. Ashtiani et al. [2020] proved that the sample complexity of learning Gaussian mixtures up to a TV-distance of ε is Θ̃(d²k/ε²) in general, and Θ̃(dk/ε²) if all components are axis-aligned. Unfortunately, their learning algorithms are not computationally efficient and run in time exponential in k. For the spherical case, Diakonikolas and Kane [2020] gave a more efficient algorithm that runs in quasi-polynomial time poly(d)·(k/ε)^{O(log² k)}. In the proper learning setting (i.e., P̂ must be a Gaussian mixture), Suresh et al. [2014] gave an algorithm that takes poly(dk/ε) samples and runs in time poly(d)·(k/ε)^{O(k²)}.

Super-resolution, and mixture learning using Fourier analysis. Finally, we acknowledge that the Fourier approach that we explore in this work is fairly natural, and similar ideas have appeared in prior work. In particular, our approach is closely related to the super-resolution problem (Donoho [1992], Candès and Fernandez-Granda [2013, 2014]) of recovering k unknown locations µ₁, ..., µk ∈ R^d from (exact or noisy) observations of the form Σ_{j=1}^k e^{i⟨µⱼ,ξ⟩} for ‖ξ‖ ≤ M, where M is called the cutoff frequency and the norm is typically ℓ₂ or ℓ∞. In comparison, our approach (outlined in Section 1.2) differs in two aspects: (1) we focus on a simpler testing version of the problem, namely deciding whether one of the locations is near a given reference point; (2) the Gaussian truncation significantly down-weights the observations coming from the points that are far from the reference point.

For the d = 1 case, Moitra [2015] showed that a cutoff frequency of M = O(1/Δ) suffices if the points are Δ-separated. This implies an algorithm that efficiently learns spherical Gaussian mixtures in one dimension under separation roughly 1/√log k, which is (nearly) recovered by our positive result. For the general case, Huang and Kakade [2015] gave an algorithm that provably works with an ℓ₂ cutoff frequency of M = O((√(d log k) + log k)/Δ). When applied to learning GMMs, however, their algorithm requires separation Δ ≳ d. It is conceivable that the algorithm of Huang and Kakade [2015], equipped with appropriate modifications and a tighter analysis (e.g., the approach of Chen and Moitra [2021]), might give a guarantee similar to ours, but no such analysis is explicit in the literature to the best of our knowledge. Moreover, our approach leads to an arguably simpler algorithm with an elementary analysis. More recently, Chakraborty and Narayanan [2020] gave an algorithm that is similar to ours for learning mixtures of spherical Gaussians via deconvolving the mixture. However, their algorithm is only shown to work when Δ = Ω̃(d). Chen et al. [2020] studied the problem of learning mixtures of linear regressions (MLRs), which can be reduced to estimating the minimum variance in a mixture of zero-mean Gaussians.
They solved this problem by estimating the Fourier moments (the moments of the Fourier transform), and gave the first sub-exponential time algorithm for learning MLRs. Chen and Moitra [2021] studied learning mixtures of Airy disks, a problem that is motivated by the physics of diffraction. Their algorithm also proceeds by first estimating the Fourier transform of the mixture, and then dividing it pointwise by the Fourier spectrum of the base distribution.

### 1.4 Organization of the Paper

In Section 2, we formally state our main algorithmic result as well as our main technical theorem, which addresses a testing version of the problem. We then sketch how our upper bound follows from a simple reduction from parameter learning to testing. In Section 3, we give a simple algorithm that solves this testing version by examining the Fourier transform of the mixture. The pseudocode of our algorithms is presented in Appendix A. We defer a few technical proofs to Appendices B through D. We prove our lower bound result (Theorem 1.2) in Appendix E. Finally, in Appendix F, we formally state and prove the guarantees for learning non-Gaussian mixtures and present a few applications of this result.

## 2 From Learning to Testing

We first state our main positive result, the formal version of Theorem 1.1.

Theorem 2.1. Let P = (1/k)·Σ_{j=1}^k N(µⱼ, I_d) be a uniform mixture of k ≥ 2 spherical Gaussians with Δ-separated means in R^d, and let ε < min{Δ/100, Δ/(32·min{d, ln k})}. There is an algorithm (Algorithm 1) that, given samples from P, outputs k vectors that are ε-close to µ₁, ..., µk with high probability. The runtime (and thus the sample complexity) of the algorithm is upper bounded by

O((Δ/ε)⁴·k³·log² k) · max{(d/ε)^{O(d)}, d^{O(d)}} · e^{Õ(M²)},

where M² ≲ Δ⁻²·(min{d, log k} + log(Δ/ε))·(min{d + log k, d·log(2 + d·Δ⁻²)} + log(Δ/ε)).

Remark 2.2. When d = O(log k / log log k) (and, say, ε = Ω(Δ/d)), the runtime bound simplifies to poly(k)·exp(Õ(d·min{d + log k, d·log(2 + d·Δ⁻²)}/Δ²)), which is poly(k) whenever Δ = Ω̃(d/√log k), as in Theorem 1.1. When d = o(log k), this condition is strictly looser than both Δ = Ω(d) and Δ = Ω(√log k).

We prove Theorem 2.1 by reducing the parameter learning problem to the following testing version: given samples from the mixture P, determine whether P contains a cluster whose mean is close to a given guess µ* ∈ R^d. Formally, we prove the following theorem:

Theorem 2.3. Let P = (1/k)·Σ_{j=1}^k N(µⱼ, I_d) be a uniform mixture of k ≥ 2 spherical Gaussians with Δ-separated means in R^d. Let ε < min{Δ/100, Δ/(32·min{d, ln k})} and µ* ∈ R^d. There is an algorithm (Algorithm 2) that, given samples from P, either accepts or rejects µ*, such that it:
- Accepts with probability at least 2/3 if min_{j∈[k]} ‖µⱼ − µ*‖₂ ≤ ε/2.
- Rejects with probability at least 2/3 if min_{j∈[k]} ‖µⱼ − µ*‖₂ ≥ ε.

The runtime (and thus the sample complexity) of the algorithm is upper bounded by O(k²·(Δ/ε)⁴)·e^{Õ(d+M²)}, where M² ≲ Δ⁻²·(min{d, log k} + log(Δ/ε))·(min{d + log k, d·log(2 + d·Δ⁻²)} + log(Δ/ε)).

Assuming the above theorem, the proof of our main theorem is straightforward and is thus deferred to Appendix C. The proof proceeds by first drawing a few samples X₁, ..., X_N from the mixture, and then running the tester (from Theorem 2.3) to decide whether each Xᵢ is close to one of the mean vectors of the mixture. Then, a simple argument shows that the samples that the tester accepts can be easily clustered to recover the means. Note that if a sample Xᵢ comes from the j-th component N(µⱼ, I_d), there is a decent probability (e.g., Ω(ε) when d = 1) that Xᵢ is ε-close to µⱼ. Thus, we can guarantee that X₁, ..., X_N contain good guesses for all the k means for moderately large N.
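The reduction just sketched is given as Algorithm 1 in Appendix A, which is not reproduced in this text. Purely as an illustration, the following sketch renders the same idea in code, with an abstract `tester` callable standing in for the procedure of Theorem 2.3 and with the success amplification and the choice of N left out.

```python
import numpy as np

def learn_means_from_tester(draw_samples, tester, delta, n_candidates):
    """Sketch of the learning-from-testing reduction described above.

    draw_samples(n): returns an (n, d) array of i.i.d. samples from the mixture P.
    tester(mu_star): returns True iff some component mean is (roughly) eps-close
                     to mu_star, as guaranteed by Theorem 2.3 (assumed noiseless here).
    delta:           a lower bound on the separation of the true means.
    """
    candidates = draw_samples(n_candidates)
    # Keep the candidate points that the tester certifies as close to some mean.
    accepted = [x for x in candidates if tester(x)]

    # Greedy clustering: two accepted points within distance delta / 2 must be
    # guesses for the same mean, because each accepted point is within eps << delta
    # of a true mean and the true means are delta-separated.
    centers = []
    for x in accepted:
        if all(np.linalg.norm(x - c) > delta / 2 for c in centers):
            centers.append(x)
    return centers  # one representative guess per discovered component
```

In the actual algorithm, the tester's 2/3 success probability is amplified by repetition, and N is chosen large enough that, with high probability, every component contributes at least one candidate within ε/2 of its mean.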
## 3 Solving the Testing Problem using the Fourier Transform

Now we solve the testing problem and prove Theorem 2.3. We assume without loss of generality that µ* = 0, since we can reduce to this case by subtracting µ* from each sample from P. We also re-index the unknown means µ₁, ..., µk so that ‖µⱼ‖₂ is non-decreasing in j. The testing problem is then equivalent to deciding whether ‖µ₁‖₂ ≤ ε/2 or ‖µ₁‖₂ ≥ ε.

The following simple lemma, which we prove in Appendix D, gives the Fourier transform of the mixture P when a Gaussian truncation weight of e^{−‖x‖₂²/2} is applied.

Lemma 3.1. For P = (1/k)·Σ_{j=1}^k N(µⱼ, I_d) and any ξ ∈ R^d,

E_{X∼P}[e^{−‖X‖₂²/2}·e^{i⟨ξ,X⟩}] = (e^{−‖ξ‖₂²/4} / (2^{d/2}·k)) · Aµ(ξ),

where we define Aµ(ξ) := Σ_{j=1}^k e^{−‖µⱼ‖₂²/4}·e^{i⟨µⱼ,ξ⟩/2}.

We will show that Aµ(ξ) behaves quite differently depending on whether ‖µ₁‖₂ ≤ ε/2 or ‖µ₁‖₂ ≥ ε. Thus, we can solve the testing problem by estimating Aµ(ξ) for carefully chosen ξ. Concretely, we will draw ξ from N(0, σ²I_d) and then truncate it to B(0, M) := {x ∈ R^d : ‖x‖₂ ≤ M}, for parameters M, σ > 0 to be chosen later. Formally, we focus on the following expectation:

Tµ := E_{ξ∼N(0,σ²I_d)}[Aµ(ξ)·1{‖ξ‖₂ ≤ M}].

The key step of our proof of Theorem 2.3 is to show that Tµ is close to e^{−(σ²/2+1)‖µ₁‖₂²/4}, and is thus helpful for deciding whether ‖µ₁‖₂ ≤ ε/2 or ‖µ₁‖₂ ≥ ε. The following lemma, the proof of which is relegated to Appendix D, helps us bound the difference between Tµ and e^{−(σ²/2+1)‖µ₁‖₂²/4}.

Lemma 3.2. For any M, σ > 0 that satisfy M²/σ² ≥ 5d,

Tµ = Σ_{j=1}^k e^{−(σ²/2+1)‖µⱼ‖₂²/4} + O(e^{−M²/(5σ²)} · Σ_{j=1}^k e^{−‖µⱼ‖₂²/4}),

where the O(x) notation hides a complex number with modulus at most x.

By Lemma 3.2, the difference Tµ − e^{−(σ²/2+1)‖µ₁‖₂²/4} is given by

Σ_{j=2}^k e^{−(σ²/2+1)‖µⱼ‖₂²/4} + O(e^{−M²/(5σ²)} · Σ_{j=1}^k e^{−‖µⱼ‖₂²/4}).

Let S1 := Σ_{j=2}^k e^{−(σ²/2+1)‖µⱼ‖₂²/4} and S2 := Σ_{j=1}^k e^{−‖µⱼ‖₂²/4} denote the two summations above. We have the following upper bounds on S1 and S2, which we prove in Appendix D:

Claim 3.3. Assuming (σ²/2 + 1)·Δ² ≥ 100·min{ln k, d}, we have S1 ≤ 2·e^{−(σ²/2+1)Δ²/64}·min{k, 2^d}.

Claim 3.4. When Δ² < 100d, we have ln S2 = O(d·log(2 + d·Δ⁻²)). Furthermore, S2 ≤ k.
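Before assembling the proof, here is a compact numerical sketch of the test that the proof analyzes (a simplified stand-in for Algorithm 2 in Appendix A, with illustrative parameter values rather than the choices made below): estimate T̂µ by averaging 1{‖ξ‖₂ ≤ M}·2^{d/2}·k·e^{‖ξ‖₂²/4}·e^{−‖X‖₂²/2}·e^{i⟨ξ,X⟩} over independent pairs (X, ξ), which is an unbiased estimate of Tµ by Lemma 3.1, and accept iff Re T̂µ exceeds the threshold τ used in the proof.

```python
import numpy as np

rng = np.random.default_rng(2)

def test_zero_mean(samples, k, eps, sigma, M):
    """Sketch of the tester analyzed in Theorem 2.3 (after centering at mu* = 0)."""
    n, d = samples.shape
    xi = rng.normal(0.0, sigma, size=(n, d))          # one fresh frequency per sample
    keep = np.linalg.norm(xi, axis=1) <= M            # truncation to the ball B(0, M)
    weight = 2 ** (d / 2) * k * np.exp(np.sum(xi ** 2, axis=1) / 4)
    soft = np.exp(-np.sum(samples ** 2, axis=1) / 2)  # Gaussian truncation e^{-||X||^2/2}
    phase = np.exp(1j * np.sum(xi * samples, axis=1))
    T_hat = np.mean(keep * weight * soft * phase)     # unbiased for T_mu (Lemma 3.1)
    s = sigma ** 2 / 2 + 1
    tau = 0.5 * (np.exp(-s * eps ** 2 / 16) + np.exp(-s * eps ** 2 / 4))
    return T_hat.real >= tau                          # accept iff Re(T_hat) >= tau

# Illustrative instance: k = 4 components in d = 2 with separation Delta = 4.
d, k, eps = 2, 4, 1.0
far = np.array([[4.0, 0.0], [0.0, 5.0], [-4.5, -3.0]])
sigma, M = 1.0, 4.0                                   # illustrative (not the proof's choice)

for mu1, label in [(np.zeros(d), "||mu_1|| <= eps/2"), (np.array([eps, 0.0]), "||mu_1|| >= eps")]:
    mus = np.vstack([mu1, far])
    idx = rng.integers(k, size=300_000)
    X = mus[idx] + rng.normal(size=(300_000, d))
    verdict = "accept" if test_zero_mean(X, k, eps, sigma, M) else "reject"
    print(f"{label}: {verdict}")
```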
Now we put all the pieces together and prove Theorem 2.3.

Proof of Theorem 2.3. By Lemma 3.1,

Tµ = E_{ξ∼N(0,σ²I_d)}[Aµ(ξ)·1{‖ξ‖₂ ≤ M}] = E_{X∼P, ξ∼N(0,σ²I_d)}[1{‖ξ‖₂ ≤ M}·2^{d/2}·k·e^{‖ξ‖₂²/4}·e^{−‖X‖₂²/2}·e^{i⟨ξ,X⟩}].

Since the term inside the expectation has modulus at most k·e^{O(d+M²)}, a Chernoff bound implies that we can estimate Tµ up to any additive error γ > 0 using O((k/γ)²)·e^{O(d+M²)} samples from P. By Lemma 3.2 and our definitions of S1 and S2, assuming M²/σ² ≥ 5d,

|Tµ − e^{−(σ²/2+1)‖µ₁‖₂²/4}| ≤ S1 + e^{−M²/(5σ²)}·S2.

In the rest of the proof, we will pick σ and M carefully so that both S1 and e^{−M²/(5σ²)}·S2 are upper bounded by γ := (σ²/2 + 1)·ε²/64. Assuming this, we are done: we can simply estimate Tµ up to an additive error of γ. Let T̂µ denote the estimate. We accept if and only if Re T̂µ ≥ τ, where

τ := (1/2)·(e^{−(σ²/2+1)ε²/16} + e^{−(σ²/2+1)ε²/4}).

Indeed, suppose that ‖µ₁‖₂ ≤ ε/2. Then we have

Re T̂µ ≥ Re Tµ − γ ≥ e^{−(σ²/2+1)‖µ₁‖₂²/4} − (S1 + e^{−M²/(5σ²)}·S2) − γ ≥ e^{−(σ²/2+1)ε²/16} − 3γ.

Similarly, Re T̂µ ≤ e^{−(σ²/2+1)ε²/4} + 3γ if ‖µ₁‖₂ ≥ ε. If we set σ such that (σ²/2 + 1)·ε² ≤ 1, we have

e^{−(σ²/2+1)ε²/16} − e^{−(σ²/2+1)ε²/4} ≥ (1/8)·(σ²/2 + 1)·ε² = 8γ,

which implies Re T̂µ > τ in the former case and Re T̂µ < τ in the latter case. Therefore, our algorithm decides correctly.

Choice of parameters. Claim 3.3 implies that if we set σ²/2 + 1 = 512·Δ⁻²·(min{d, ln k} + ln(Δ/ε)), we can ensure S1 ≤ (σ²/2 + 1)·ε²/64 = γ. Furthermore, this choice of σ and the assumption ε < min{Δ/100, Δ/(32·min{d, ln k})} guarantee the condition (σ²/2 + 1)·ε² ≤ 1 that we need. It remains to pick M such that e^{−M²/(5σ²)}·S2 ≤ γ = (σ²/2 + 1)·ε²/64. We also need M²/σ² ≥ 5d to ensure that Lemma 3.2 can be applied. It suffices to let

M² ≥ 5σ²·(d + ln S2 + ln(64/((σ²/2 + 1)·ε²))).

Our choice of σ guarantees ln(64/((σ²/2 + 1)·ε²)) ≤ 2·ln(Δ/ε), so it is in turn sufficient to pick M such that

M² ≳ Δ⁻²·(min{d, ln k} + ln(Δ/ε))·(d + ln S2 + ln(Δ/ε)).

Applying Claim 3.4 shows that M can be chosen such that

M² ≲ Δ⁻²·(min{d, log k} + log(Δ/ε))·(min{d + log k, d·log(2 + d·Δ⁻²)} + log(Δ/ε)).

Runtime. The runtime of our algorithm is dominated by the number of samples drawn from P, which is O((k/γ)²)·e^{O(d+M²)}, where γ = Θ((σ² + 1)·ε²). Plugging our choice of σ into γ gives 1/γ = O((Δ/ε)²). The runtime can thus be upper bounded by O(k²·(Δ/ε)⁴)·e^{Õ(d+M²)}.

## Acknowledgments

We would like to thank the anonymous reviewers of earlier versions of this paper for their suggestions on the presentation and for pointers to the literature.

## References

D. Achlioptas and F. McSherry. On spectral learning of mixtures of distributions. In Conference on Learning Theory (COLT), pages 458–469, 2005.
J. Anderson, M. Belkin, N. Goyal, L. Rademacher, and J. Voss. The more, the merrier: the blessing of dimensionality for learning large Gaussian mixtures. In Conference on Learning Theory (COLT), pages 1135–1164, 2014.
S. Arora and R. Kannan. Learning mixtures of separated nonspherical Gaussians. The Annals of Applied Probability, 15(1A):69–92, 2005.
H. Ashtiani, S. Ben-David, N. J. Harvey, C. Liaw, A. Mehrabian, and Y. Plan. Near-optimal sample complexity bounds for robust learning of Gaussian mixtures via compression schemes. Journal of the ACM (JACM), 67(6):1–42, 2020.
M. Belkin and K. Sinha. Polynomial learning of distribution families. In Foundations of Computer Science (FOCS), pages 103–112, 2010.
S. C. Brubaker and S. Vempala. Isotropic PCA and affine-invariant clustering. In Foundations of Computer Science (FOCS), pages 551–560, 2008.
E. J. Candès and C. Fernandez-Granda. Super-resolution from noisy data. Journal of Fourier Analysis and Applications, 19(6):1229–1254, 2013.
E. J. Candès and C. Fernandez-Granda. Towards a mathematical theory of super-resolution. Communications on Pure and Applied Mathematics, 67(6):906–956, 2014.
S. Chakraborty and H. Narayanan. Learning mixtures of spherical Gaussians via Fourier analysis. arXiv preprint arXiv:2004.05813, 2020.
S. Chen and A. Moitra. Algorithmic foundations for the diffraction limit. In Symposium on Theory of Computing (STOC), pages 490–503, 2021.
S. Chen, J. Li, and Z. Song. Learning mixtures of linear regressions in subexponential time via Fourier moments. In Symposium on Theory of Computing (STOC), pages 587–600, 2020.
S. Dasgupta. Learning mixtures of Gaussians. In Foundations of Computer Science (FOCS), pages 634–644, 1999.
S. Dasgupta and L. J. Schulman. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. Journal of Machine Learning Research (JMLR), 8:203–226, 2007.
I. Diakonikolas and D. M. Kane. Small covers for near-zero sets of polynomials and learning latent variable models. In Foundations of Computer Science (FOCS), pages 184–195, 2020.
I. Diakonikolas, D. M. Kane, and A. Stewart. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. In Foundations of Computer Science (FOCS), pages 73–84, 2017.
I. Diakonikolas, D. M. Kane, and A. Stewart. List-decodable robust mean estimation and learning mixtures of spherical Gaussians. In Symposium on Theory of Computing (STOC), pages 1047–1060, 2018.
D. L. Donoho. Superresolution via sparsity constraints. SIAM Journal on Mathematical Analysis, 23(5):1309–1331, 1992.
M. Hardt and E. Price. Tight bounds for learning a mixture of two Gaussians. In Symposium on Theory of Computing (STOC), pages 753–760, 2015.
S. B. Hopkins and J. Li. Mixture models, robustness, and sum of squares proofs. In Symposium on Theory of Computing (STOC), pages 1021–1034, 2018.
Q. Huang and S. M. Kakade. Super-resolution off the grid. In Advances in Neural Information Processing Systems (NIPS), pages 2665–2673, 2015.
R. Kannan, H. Salmasian, and S. Vempala. The spectral method for general mixture models. In Conference on Learning Theory (COLT), pages 444–457, 2005.
P. K. Kothari, J. Steinhardt, and D. Steurer. Robust moment estimation and improved clustering via sum of squares. In Symposium on Theory of Computing (STOC), pages 1035–1046, 2018.
B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.
A. Liu and J. Li. Clustering mixtures with almost optimal separation in polynomial time. In Symposium on Theory of Computing (STOC), pages 1248–1261, 2022.
J. Matoušek, A. Björner, and G. M. Ziegler. Using the Borsuk–Ulam theorem: lectures on topological methods in combinatorics and geometry. Springer, 2003.
A. Moitra. Super-resolution, extremal functions and the condition number of Vandermonde matrices. In Symposium on Theory of Computing (STOC), pages 821–830, 2015.
A. Moitra and G. Valiant. Settling the polynomial learnability of mixtures of Gaussians. In Foundations of Computer Science (FOCS), pages 93–102, 2010.
O. Regev and A. Vijayaraghavan. On learning mixtures of well-separated Gaussians. In Foundations of Computer Science (FOCS), pages 85–96, 2017.
D. J. Smith and M. K. Vamanamurthy. How small is a unit ball? Mathematics Magazine, 62(2):101–107, 1989.
A. T. Suresh, A. Orlitsky, J. Acharya, and A. Jafarpour. Near-optimal-sample estimators for spherical Gaussian mixtures. In Advances in Neural Information Processing Systems (NIPS), pages 1395–1403, 2014.
S. Vempala and G. Wang. A spectral algorithm for learning mixture models. Journal of Computer and System Sciences, 68(4):841–860, 2004.