# A Fourier Approach to Mixture Learning

Mingda Qiao, Stanford University, mqiao@stanford.edu
Guru Guruganesh, Google Research, gurug@google.com
Ankit Singh Rawat, Google Research, ankitsrawat@google.com
Avinava Dubey, Google Research, avinavadubey@google.com
Manzil Zaheer, Google DeepMind, manzilzaheer@google.com

Part of this work was done while working as an intern at Google Research.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

We revisit the problem of learning mixtures of spherical Gaussians. Given samples from the mixture (1/k)·Σ_{j=1}^k N(µⱼ, I_d), the goal is to estimate the means µ₁, µ₂, ..., µk ∈ R^d up to a small error. The hardness of this learning problem can be measured by the separation Δ, defined as the minimum distance between all pairs of means. Regev and Vijayaraghavan [2017] showed that with Δ = Ω(√log k) separation, the means can be learned using poly(k, d) samples, whereas super-polynomially many samples are required if Δ = o(√log k) and d = Ω(log k). This leaves open the low-dimensional regime where d = o(log k).

In this work, we give an algorithm that efficiently learns the means in d = O(log k / log log k) dimensions under separation d/√log k (modulo doubly logarithmic factors). This separation is strictly smaller than √log k, and is also shown to be necessary. Along with the results of Regev and Vijayaraghavan [2017], our work almost pins down the critical separation threshold at which efficient parameter learning becomes possible for spherical Gaussian mixtures. More generally, our algorithm runs in time poly(k)·f(d, Δ, ε), and is thus fixed-parameter tractable in the parameters d, Δ and ε. Our approach is based on estimating the Fourier transform of the mixture at carefully chosen frequencies, and both the algorithm and its analysis are simple and elementary. Our positive results can be easily extended to learning mixtures of non-Gaussian distributions, under a mild condition on the Fourier spectrum of the distribution.

## 1 Introduction

Gaussian mixture models (GMMs) are one of the most well-studied models for a population with different components. A GMM defines a distribution over the d-dimensional Euclidean space as the weighted sum of normal distributions Σ_{i=1}^k wᵢ·N(µᵢ, Σᵢ), which is specified by the following quantities: the number of components k ∈ N, the component means µᵢ ∈ R^d, the component covariances Σᵢ ∈ R^{d×d}, which are positive definite matrices, and the weights wᵢ ≥ 0 that sum up to 1. In this work, we consider the uniform spherical case, where the weights are uniform (wᵢ = 1/k) and each covariance matrix Σᵢ = I_d is the identity matrix. The central problem in this setup is to efficiently estimate the means µ₁, ..., µk. To avoid degenerate cases, such as when some of the means coincide, it is common to parameterize the problem by the separation Δ of the means, which guarantees that ‖µᵢ − µⱼ‖₂ ≥ Δ for all i ≠ j. More precisely, the problem is to estimate the means µ₁, ..., µk ∈ R^d up to an error ε with runtime poly(k, 1/ε, d), under as small a separation as possible. There has been a long line of work on this problem, which we survey in Section 1.3.

Recently, Regev and Vijayaraghavan [2017] showed that a separation of Δ = Ω(√log k) is strictly necessary when the dimension d = Ω(log k). Two natural questions arise immediately. First, whether Δ = Θ(√log k) is sufficient for efficient learning when d = Ω(log k).
Although the original work of Regev and Vijayaraghavan [2017] showed that this is information-theoretically possible, an actual efficient algorithm was only recently developed by Liu and Li [2022] (who show nearly tight results). The second main question is determining the optimal separation in low dimensions, when d = o(log k). Previously, even in O(1) dimensions, the exact separation necessary was unknown. In this paper, we settle the second question and give nearly optimal bounds on the separation necessary in low dimensions (see Figure 1 for more details).

### 1.1 Overview of Results

We begin with a few definitions. A point set {x₁, x₂, ..., xk} is called Δ-separated if ‖xⱼ − xⱼ'‖₂ ≥ Δ for any j ≠ j'. We say that a Gaussian mixture is Δ-separated (or has separation Δ) if the means of its components are Δ-separated. Two point sets {u₁, ..., uk} and {v₁, ..., vk} are ε-close if, for some permutation σ over [k], ‖uⱼ − v_{σ(j)}‖₂ ≤ ε holds for every j.

Our main result is an algorithm that efficiently learns the parameters of a mixture of k spherical Gaussians under separation Δ ≥ (d/√log k)·poly(log log k) in d = O(log k / log log k) dimensions. In the low-dimensional regime, this separation is strictly smaller than min{√log k, d}, the smallest separation under which previous algorithms could provably learn the parameters in poly(k) time.

Theorem 1.1 (Upper bound, informal). Let P be a uniform mixture of k spherical Gaussians in d = O(log k / log log k) dimensions with separation Δ ≥ (d/√log k)·log((log k)/d). There is a poly(k)-time algorithm that, given samples from P, outputs k vectors that are w.h.p. ε-close to the true means, for ε = O(Δ / min{d, log k}).

See Theorem 2.1 and Remark 2.2 for a more formal statement of our algorithmic result, which holds for a wider range of the separation and accuracy parameters. Our learning algorithm is provably correct for arbitrarily small Δ, ε > 0 (possibly with a longer runtime), whereas for most of the previous algorithms there is a breakdown point Δ₀ such that the algorithm is not known to work when Δ < Δ₀. Two exceptions are the algorithms of Moitra and Valiant [2010] and Belkin and Sinha [2010], both of which allow an arbitrarily small separation but run in time exponential in k. We also remark that the runtime of our algorithm scales as O(k³)·f(d, Δ, ε), and is thus fixed-parameter tractable in the parameters d, Δ and ε.²

²While we focus on the uniform-weight case for simplicity, Theorem 1.1 can be easily extended to the setting where each weight is in [1/(Ck), C/k] for some constant C > 1.

We complement Theorem 1.1 with an almost-matching lower bound, showing that a separation of roughly d/√log k is necessary for efficient parameter learning in low dimensions.

Theorem 1.2 (Lower bound, informal). For d = O(log k / log log k) and separation Δ below the d/√log k threshold (up to doubly logarithmic factors), there are two mixtures of k spherical Gaussians in R^d such that: (1) both have separation Δ; (2) their means are not (Δ/2)-close; and (3) the total variation (TV) distance between them is k^{−ω(1)}.

See Theorem E.1 for a more formal version of the lower bound. Theorem 1.1 and Theorem 1.2 together nearly settle the polynomial learnability of spherical Gaussian mixtures in the low-dimensional regime. Up to a doubly logarithmic factor, the critical separation at which efficient learning becomes possible is d/√log k. To the best of our knowledge, this was previously unknown even for d = O(1).³
³An exception is the d = 1 case: A result of Moitra [2015] implies that Δ = Ω(1/√log k) suffices (see Section 1.3), and a matching lower bound was given by Moitra and Valiant [2010].

See Figure 1 below for a plot of our results in the context of prior work. The green regions cover the parameters (Δ, d) such that mixtures of k spherical Gaussians in d dimensions with separation Δ are learnable (up to O(Δ) error) using poly(k) samples.⁴ The red regions contain the parameters under which polynomial-sample learning is provably impossible.

⁴In terms of computational complexity, all the green regions (except a small portion of Region III) admit efficient algorithms. The algorithm of Liu and Li [2022] is efficient when Δ = Ω((log k)^{1/2+c}) for any c > 0 and thus almost covers Region III. For Region IV, Regev and Vijayaraghavan [2017] gave an efficient algorithm only for d = O(1), whereas Theorem 1.1 covers the entire Region IV.

[Figure 1 (plot): a phase diagram with the dimension d on the horizontal axis (marked at log k) and the separation Δ on the vertical axis (marked at 1/√log k and √log k); the green regions are labeled "Learnable with poly(k) samples", the red regions "Not poly(k)-sample learnable", and Regions I–IV together with the area settled by Theorem 1.1 are indicated.]

Figure 1: Region I is a direct corollary of Moitra and Valiant [2010, Proposition 15]. Regions II, III, and IV are shown by Regev and Vijayaraghavan [2017, Theorems 1.2, 1.3, and 1.4], respectively. The upper boundary of Region IV is the curve Δ = d. Theorems 1.1 and 1.2 settle the learnability in the remaining area, by proving that the line Δ = d/√log k is the boundary between polynomial and super-polynomial sample complexities (up to a doubly logarithmic factor).

The algorithm that underlies Theorem 1.1 can be easily extended beyond the spherical Gaussian case. The following more general result states that for any distribution D whose Fourier transform does not decay too fast, we can efficiently learn the parameters of a mixture of k translated copies of D. In the following, let Dµ denote the distribution of X + µ when X is drawn from D.

Theorem 1.3 (Learning more general mixtures, informal). Let P = (1/k)·Σ_{j=1}^k D_{µⱼ} for Δ-separated µ₁, ..., µk ∈ R^d. There is an algorithm that, given ε > 0 and samples from P, runs in time poly(k, Δ/ε, 1/δ, M_max) for some δ = δ(D, ε) and M_max = M_max(k, d, Δ, ε), and outputs µ̂₁, ..., µ̂k that are w.h.p. ε-close to the true parameters.

See Theorem F.1 and Corollary F.2 for a more formal statement of the runtime. Theorem 1.3 applies to many well-known distribution families that are specified by either a location parameter or a scale parameter. Table 1 gives a few examples of applying Theorem 1.3 to mixtures of single-parameter univariate distributions; see Appendix F.2 and Appendix F.3 for more details.

| Distribution | Parameter | Density function | Runtime |
|---|---|---|---|
| Cauchy | µ | 1/(π(1 + (x − µ)²)) | O(k³)·e^{Õ(√(log k)/ε)} |
| Logistic | µ | e^{−(x−µ)}/(1 + e^{−(x−µ)})² | O(k³)·e^{Õ(√(log k)/ε)} |
| Laplace | µ | (1/2)·e^{−\|x−µ\|} | O(k³/ε⁵) |
| Exponential | ln λ | λe^{−λx}·1[x ≥ 0] | O(k³)·e^{Õ(√(log k)/ε)} |

Table 1: Implications of Theorem 1.3 for learning various families of mixtures of univariate distributions beyond Gaussians, assuming that the parameters of different components are Δ-separated. The algorithm outputs k parameters that are O(ε)-close to the true parameters.
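For intuition about the runtimes in Table 1, it helps to recall the characteristic functions (Fourier transforms) of the three location families listed there. These are standard facts, included here for reference only and not taken from the paper:

```latex
% Characteristic functions of the base distributions D in Table 1
% (location parameter set to 0); standard facts, stated for intuition.
\begin{align*}
  \text{Cauchy:}   \quad (\mathcal{F}D)(\xi) &= e^{-|\xi|}, \\
  \text{Logistic:} \quad (\mathcal{F}D)(\xi) &= \frac{\pi\xi}{\sinh(\pi\xi)}, \\
  \text{Laplace:}  \quad (\mathcal{F}D)(\xi) &= \frac{1}{1+\xi^2}.
\end{align*}
```

The exponentially decaying transforms (Cauchy, Logistic) are consistent with the e^{Õ(√(log k)/ε)} factors in Table 1, while the polynomially decaying Laplace transform is consistent with the fully polynomial O(k³/ε⁵) runtime. The exponential distribution has a scale parameter (appearing as ln λ in the table) and is handled separately in Appendix F.3.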
Limitations of our work. The main limitation is that our positive results only apply to the regime where the dimension d is logarithmic in the number of clusters, and where all clusters are translated copies of the same distribution. A concrete future direction would be to extend our results to learning mixtures of general Gaussians, even in one dimension.

### 1.2 Proof Overview

For simplicity, we focus on the one-dimensional case that P = (1/k)·Σ_{j=1}^k N(µⱼ, 1) for Δ-separated means µ₁, ..., µk ∈ R. We index the components such that |µ₁| ≤ |µ₂| ≤ ··· ≤ |µk|, and focus on the following testing problem: Given ε > 0 and samples from P, determine whether µ₁ = 0 or |µ₁| ≥ ε, assuming that one of the two holds. Note that this testing problem is not harder than estimating µ₁, ..., µk up to error ε/3: in the former case that µ₁ = 0, one of the mean estimates would fall into [−ε/3, ε/3], whereas all of them must be outside (−2ε/3, 2ε/3) in the latter case. Conversely, as we will prove in Section 2, an algorithm for the testing version can be used for recovering the means as well.

Examine the Fourier spectrum. We start by examining the Fourier transform of P, (FP)(ξ) := E_{X∼P}[e^{iξX}], more commonly known in the literature as the characteristic function. Since the Fourier transform of a Gaussian is still a Gaussian, and a translation in the time domain shifts the phase in the frequency domain, we have

E_{X∼N(µⱼ,1)}[e^{iξX}] = e^{iµⱼξ − ξ²/2}.    (1)

We will focus on the quantity Aµ(ξ) := Σ_{j=1}^k e^{iµⱼξ}, which is the total phase over the k components of P. Equation (1) essentially states that Aµ(ξ) can be estimated by averaging e^{iξX} over samples X ∼ P. The key observation is that each term e^{iµⱼξ} of Aµ(ξ) behaves quite differently depending on the magnitude of µⱼ: if µⱼ = 0, e^{iµⱼξ} = 1 is a constant, whereas ξ ↦ e^{iµⱼξ} is a high-frequency wave when |µⱼ| is large. This suggests that the two cases (µ₁ = 0 and |µ₁| ≥ ε) can be distinguished by estimating Aµ(ξ) at different frequencies.

The cost of estimating Aµ(ξ), however, depends heavily on the frequency ξ: Equation (1) together with a simple concentration bound shows that O(k²)·e^{O(ξ²)} samples are sufficient for estimating Aµ(ξ) up to a constant error. Therefore, the crux of this approach is to find the minimum M > 0 such that the two cases can be distinguished by estimating Aµ(ξ) over ξ ∈ [−M, M]. (This is known as the super-resolution problem, which we discuss in Section 1.3.) The sample complexity of the testing problem can then be roughly bounded by O(k²)·e^{O(M²)}. In the following, we explore different ways of picking ξ from the range [−M, M].

Choosing ξ randomly. Our overall strategy is to draw ξ randomly from some distribution D over the interval [−M, M] and evaluate E_{ξ∼D}[Aµ(ξ)]. We will argue that this expectation is very close to its first term E_{ξ∼D}[e^{iµ₁ξ}], which takes different values depending on whether µ₁ = 0 or |µ₁| ≥ ε. Then, D needs to be chosen such that: (1) there is a gap in the value of E_{ξ∼D}[e^{iµ₁ξ}] between the two cases; (2) Σ_{j=2}^k |E_{ξ∼D}[e^{iµⱼξ}]| is small enough for the gap in the first term to be easily identified.

As a warmup, let D be the uniform distribution over [0, M]. A simple calculation shows that the gap in the first term E_{ξ∼D}[e^{iµ₁ξ}] between the two cases (µ₁ = 0 or |µ₁| ≥ ε) is lower bounded by Ω(min{Mε, 1}), whereas the contribution from the j-th component satisfies |E_{ξ∼D}[e^{iµⱼξ}]| ≲ 1/(M·|µⱼ|). Furthermore, the Δ-separation between the means implies |µⱼ| = Ω(Δ·j). Thus,

Σ_{j=2}^k |E_{ξ∼D}[e^{iµⱼξ}]| ≲ (1/M)·Σ_{j=2}^k 1/(Δ·j) ≲ (log k)/(M·Δ).

The above is much smaller than min{Mε, 1} if we set M ≳ max{(log k)/Δ, √((log k)/(Δε))}. Unfortunately, even when Δ and ε are constants, we have O(k²)·e^{O(M²)} = k^{O(log k)}, and the resulting sample complexity is already super-polynomial in k.
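As a quick aside, the estimator behind Equation (1) is simple to implement. The following minimal sketch (with illustrative means and sample sizes; this is not the paper's Algorithm 1 or 2) rescales the empirical characteristic function by k·e^{ξ²/2} to estimate Aµ(ξ) at a single frequency, and illustrates why the cost grows like e^{O(ξ²)}.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(mus, n):
    """Draw n samples from the uniform 1-D mixture (1/k) * sum_j N(mu_j, 1)."""
    idx = rng.integers(len(mus), size=n)
    return rng.normal(loc=np.asarray(mus)[idx], scale=1.0)

def estimate_A(samples, k, xi):
    """Estimate A_mu(xi) = sum_j exp(i * mu_j * xi) via Equation (1):
    E_{X~P}[e^{i xi X}] = e^{-xi^2/2} * A_mu(xi) / k."""
    return k * np.exp(xi ** 2 / 2) * np.mean(np.exp(1j * xi * samples))

mus = [0.0, 3.0, -4.5, 7.0]   # illustrative, well-separated means
X = sample_mixture(mus, n=200_000)
for xi in (0.5, 1.0, 2.0):
    true_A = np.sum(np.exp(1j * xi * np.array(mus)))
    est_A = estimate_A(X, len(mus), xi)
    # The rescaling factor e^{xi^2/2} amplifies the sampling noise, which is
    # why roughly k^2 * e^{O(xi^2)} samples are needed for a constant error.
    print(f"xi = {xi:3.1f}   true A = {true_A:.3f}   estimate = {est_A:.3f}")
```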
A better choice of D. It turns out that choosing D to be a truncated Gaussian leads to a much lower sample complexity. For some σ ≤ M, we draw ξ ∼ N(0, σ²) and then truncate it to [−M, M]. Without the truncation, the expectation of e^{iµⱼξ} has a nice closed form:

E_{ξ∼N(0,σ²)}[e^{iµⱼξ}] = e^{−σ²µⱼ²/2},

which is exactly the Fourier weight of N(0, σ²) at frequency µⱼ. Note that this decreases very fast as |µⱼ| grows, compared to the previous rate of 1/(M·|µⱼ|) when D is uniform.

It again follows from a simple calculation that: (1) depending on whether µ₁ = 0 or |µ₁| ≥ ε, the gap in E_{ξ∼N(0,σ²)}[e^{iµ₁ξ}] between the two cases is Ω(min{σ²ε², 1}); (2) the total contribution from j = 2, 3, ..., k is upper bounded by

Σ_{j=2}^k |E_{ξ∼N(0,σ²)}[e^{iµⱼξ}]| = Σ_{j=2}^k e^{−Ω(σ²Δ²j²)} = e^{−Ω(σ²Δ²)},

where the second step applies |µⱼ| = Ω(Δ·j) and the last step holds assuming σ = Ω(1/Δ). In addition, we need to deal with the error incurred by the truncation. The Gaussian tail bounds imply that |ξ| ≥ M happens with probability e^{−Ω(M²/σ²)}. Since |Aµ(ξ)| ≤ k for any ξ, the noise from the truncation is at most k·e^{−Ω(M²/σ²)} in magnitude. It remains to choose M, σ > 0 that satisfy the two inequalities:

e^{−Ω(σ²Δ²)} ≪ min{σ²ε², 1}  and  k·e^{−Ω(M²/σ²)} ≪ min{σ²ε², 1}.

It suffices to choose σ ≍ 1/Δ and M ≍ √(log k)/Δ. For ε = Θ(Δ), the sample complexity O(k²)·e^{O(M²)} reduces to k^{O(1/Δ²)}, which is polynomial in k for any fixed Δ.

We note that in the above derivation, the assumption that each cluster of P is a Gaussian is only applied in Equation (1) (through the Fourier transform). For any distribution D over R and P = (1/k)·Σ_{j=1}^k D_{µⱼ}, the quantity Aµ(ξ) = Σ_{j=1}^k e^{iµⱼξ} can still be read off from the Fourier transform of P at frequency ξ, except that the extra factor e^{−ξ²/2} becomes (FD)(ξ), the Fourier transform of D at frequency ξ. This observation leads to our algorithm for learning a mixture of multiple translated copies of D (Theorem 1.3).

Gaussian truncation. For the spherical Gaussian case, however, the above only gives an algorithm with runtime k^{O(1/Δ²)}, which becomes super-polynomial as soon as Δ = o(1), falling short of achieving the near-optimal separation of roughly 1/√log k (in one dimension) in Theorem 1.1. We further improve our algorithm for the spherical Gaussian case using a Gaussian truncation technique. Intuitively, when deciding whether the mixture P contains a cluster with mean zero, a sample X ∼ P is much more informative when |X| is small. This motivates us to focus on samples with a small magnitude via a truncation. We apply such a truncation in a soft way: weighting each sample X by e^{−X²/2}. This turns out to be sufficiently effective while keeping the entire analysis simple.

Note that the weighting effectively multiplies the density of the mixture P with that of the standard Gaussian N(0, 1) pointwise. The result is still an (un-normalized) mixture of Gaussians: up to a constant factor, the pointwise product is Σ_{j=1}^k e^{−µⱼ²/4}·N(µⱼ/2, 1/2). Consequently, if we repeat the analysis from the previous paragraphs on this weighted mixture, the noise coming from components with large |µⱼ| becomes even smaller. This eventually allows us to learn the mixture efficiently at separation Δ ≈ 1/√log k. We prove Theorem 1.1 via a natural extension of this analysis to the d-dimensional case.
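To make the derivation above concrete, the short sketch below (with illustrative parameters; it omits the e^{−X²/2} weighting and the d-dimensional extension, so it is not the paper's Algorithm 2) estimates E_{ξ∼D}[Aµ(ξ)] with ξ drawn from a truncated Gaussian, and shows the gap between the two cases µ₁ = 0 and |µ₁| ≥ ε.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_mixture(mus, n):
    idx = rng.integers(len(mus), size=n)
    return rng.normal(loc=np.asarray(mus)[idx], scale=1.0)

def test_statistic(samples, k, sigma, M, n_freq=500):
    """Monte-Carlo estimate of E_{xi ~ N(0, sigma^2) truncated to [-M, M]}[A_mu(xi)],
    where A_mu(xi) is read off the empirical characteristic function via Equation (1)."""
    xi = rng.normal(0.0, sigma, size=n_freq)
    xi = xi[np.abs(xi) <= M]                      # truncation to [-M, M]
    vals = []
    for x in xi:
        emp_cf = np.mean(np.exp(1j * x * samples))
        vals.append(k * np.exp(x ** 2 / 2) * emp_cf)
    return np.mean(vals)

delta, eps = 3.0, 1.0                  # separation and testing threshold (illustrative)
sigma, M = 2.0 / delta, 3.0            # frequency scale ~ 1/Delta; M a few sigmas
far_means = [5.0, -6.0, 9.5, -11.0]    # the remaining Delta-separated means

for mu1, label in [(0.0, "mu_1 = 0"), (eps, "|mu_1| >= eps")]:
    X = sample_mixture([mu1] + far_means, n=50_000)
    T = test_statistic(X, k=5, sigma=sigma, M=M)
    # The statistic concentrates around e^{-sigma^2 mu_1^2 / 2} (plus a tiny
    # contribution from the far-away components), so Re(T) separates the cases.
    target = np.exp(-sigma ** 2 * mu1 ** 2 / 2)
    print(f"{label:14s}  Re(T) = {T.real:+.3f}   target ~ {target:.3f}")
```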
Lower bound. Our proof of Theorem 1.2 follows an approach similar to those in the previous lower bound proofs of Moitra and Valiant [2010], Hardt and Price [2015], and Regev and Vijayaraghavan [2017]: First, construct two sets of well-separated points {µ₁^(P), ..., µk^(P)} and {µ₁^(Q), ..., µk^(Q)} with (approximately) matching lower-order moments, i.e., (1/k)·Σ_{j=1}^k (µⱼ^(P))^t ≈ (1/k)·Σ_{j=1}^k (µⱼ^(Q))^t for every small t. Then, show that these matching moments imply that, after a convolution with a Gaussian, the resulting mixtures are close in TV-distance and thus hard to distinguish.

In more detail, similar to the proof of Regev and Vijayaraghavan [2017], we start by choosing N arbitrary points from a small ℓ₂-ball, such that they are Δ-separated. The main difference is that Regev and Vijayaraghavan [2017] pick N ≫ k, and show that among all the possible mixtures, there are at least two mixtures with similar moments via a pigeonhole-type argument. In contrast, we work in the N = k regime, and obtain two point sets with matching lower-order moments by slightly perturbing these N points in opposite directions. The existence of such a good perturbation is shown via a careful application of the Borsuk–Ulam Theorem, which is inspired by a similar application in Hardt and Price [2015].

### 1.3 Related Work

Learning Gaussian mixture models. Most closely related to this paper is the line of work in the theoretical computer science literature on algorithms that provably learn mixtures of Gaussians. The pioneering work of Dasgupta [1999] showed that an Ω̃(√d) separation between the means of different components is sufficient for the samples to be easily clustered. Several subsequent works (e.g., Vempala and Wang [2004], Arora and Kannan [2005], Achlioptas and McSherry [2005], Kannan et al. [2005], Dasgupta and Schulman [2007], Brubaker and Vempala [2008]) further generalized this result, and the separation condition was relaxed to Δ = Ω((min{k, d})^{1/4}·poly(log k, log d)) by Vempala and Wang [2004]. We refer the readers to Regev and Vijayaraghavan [2017] for a more detailed survey of these results.

Moitra and Valiant [2010] and Belkin and Sinha [2010] gave the first algorithms for learning general mixtures of Gaussians (with different unknown covariances) under an arbitrarily small separation between different components. Both algorithms are based on the method of moments, and run in time polynomial in d and the inverse of the minimum TV-distance between different components when k = O(1). For large k, however, the runtime is exponential in k. Regev and Vijayaraghavan [2017] showed that poly(k, d) samples are sufficient to estimate the means of a mixture of spherical Gaussians if the means are separated by Δ = Ω(√log k). Unfortunately, their algorithm involves an exhaustive search and is computationally inefficient. Subsequently, three concurrent works [Diakonikolas et al., 2018, Hopkins and Li, 2018, Kothari et al., 2018] developed algorithms based on the Sum-of-Squares hierarchy that run in time (dk)^{poly(1/γ)} and learn mixtures with separation Δ = Ω(k^γ). In particular, setting γ = Θ(log log k / log k) gives a quasi-polynomial time algorithm that achieves the Ω(√log k) separation of Regev and Vijayaraghavan [2017]. A very recent work of Liu and Li [2022] made further progress towards learning Ω(√log k)-separated mixtures efficiently, by giving a polynomial-time algorithm that succeeds under separation Ω((log k)^{1/2+c}) for any constant c > 0.

Lower bounds. On the lower bound side, Moitra and Valiant [2010] first proved, by explicitly constructing a hard instance, that learning a mixture of k Gaussians with separation Δ = Θ(1/k) requires e^{Ω(k)} samples, even in d = 1 dimension. Hardt and Price [2015] focused on the regime where k = O(1) and the recovery error ε goes to zero, and showed that Ω(ε^{−(6k−2)}) samples are necessary. Anderson et al. [2014] proved a lower bound that strengthens the result of Moitra and Valiant [2010]: separation Δ = 1/poly(k) is insufficient for the mixture to be learnable with polynomially many samples, even when the means of the Gaussians are drawn uniformly at random from [0, 1]^d for d = O(log k / log log k). Regev and Vijayaraghavan [2017] proved that in d = Ω(log k) dimensions, no polynomial-sample algorithm exists if Δ = o(√log k). In particular, even when the means are chosen randomly, the resulting mixture is still hard to learn with high probability.
This lower bound, together with their positive result for the Δ = Ω(√log k) regime, shows that √log k is the critical separation threshold in high dimensions. If we restrict our attention to statistical query (SQ) algorithms, Diakonikolas et al. [2017] proved that any SQ algorithm needs d^{Ω(k)} queries if d ≥ poly(k) and non-spherical components are allowed.

Density estimation. In contrast to the parameter estimation setting that we focus on, the density estimation setting requires the learning algorithm to output (the representation of) a hypothesis distribution P̂ that is close to the mixture P in TV-distance. Ashtiani et al. [2020] proved that the sample complexity of learning Gaussian mixtures up to a TV-distance of ε is Θ̃(d²k/ε²) in general, and Θ̃(dk/ε²) if all components are axis-aligned. Unfortunately, their learning algorithms are not computationally efficient and run in time exponential in k. For the spherical case, Diakonikolas and Kane [2020] gave a more efficient algorithm that runs in quasi-polynomial time poly(d)·(k/ε)^{O(log² k)}. In the proper learning setting (i.e., P̂ must be a Gaussian mixture), Suresh et al. [2014] gave an algorithm that takes poly(dk/ε) samples and runs in time poly(d)·(k/ε)^{O(k²)}.

Super-resolution, and mixture learning using Fourier analysis. Finally, we acknowledge that the Fourier approach that we explore in this work is fairly natural, and similar ideas have appeared in prior work. In particular, our approach is closely related to the super-resolution problem (Donoho [1992], Candès and Fernandez-Granda [2013, 2014]) of recovering k unknown locations µ₁, ..., µk ∈ R^d from (exact or noisy) observations of the form Σ_{j=1}^k e^{i⟨µⱼ,ξ⟩} for ‖ξ‖ ≤ M, where M is called the cutoff frequency and the norm is typically ℓ₂ or ℓ∞. In comparison, our approach (outlined in Section 1.2) differs in two aspects: (1) we focus on a simpler testing version of the problem, namely deciding whether one of the locations is near a given reference point; (2) the Gaussian truncation significantly down-weights the observations coming from the points that are far from the reference point.

For the d = 1 case, Moitra [2015] showed that a cutoff frequency of M = O(1/Δ) suffices if the points are Δ-separated. This implies an algorithm that efficiently learns spherical Gaussian mixtures in one dimension under separation roughly 1/√log k, which is (nearly) recovered by our positive result. For the general case, Huang and Kakade [2015] gave an algorithm that provably works with an ℓ₂ cutoff frequency of M = O((√(d log k) + log k)/Δ). When applied to learning GMMs, however, their algorithm requires separation Δ ≳ d. It is conceivable that the algorithm of Huang and Kakade [2015], equipped with appropriate modifications and a tighter analysis (e.g., the approach of Chen and Moitra [2021]), might give a guarantee similar to ours, but no such analysis is explicit in the literature to the best of our knowledge. Moreover, our approach leads to an arguably simpler algorithm with an elementary analysis. More recently, Chakraborty and Narayanan [2020] gave an algorithm that is similar to ours for learning mixtures of spherical Gaussians via deconvolving the mixture. However, their algorithm is only shown to work when Δ = Ω̃(d). Chen et al. [2020] studied the problem of learning mixtures of linear regressions (MLRs), which can be reduced to estimating the minimum variance in a mixture of zero-mean Gaussians.
They solved this problem by estimating the Fourier moments (the moments of the Fourier transform), and gave the first sub-exponential time algorithm for learning MLRs. Chen and Moitra [2021] studied learning mixtures of Airy disks, a problem that is motivated by the physics of diffraction. Their algorithm also proceeds by first estimating the Fourier transform of the mixture, and then dividing it pointwise by the Fourier spectrum of the base distribution.

### 1.4 Organization of the Paper

In Section 2, we formally state our main algorithmic result as well as our main technical theorem, which addresses a testing version of the problem. We then sketch how our upper bound follows from a simple reduction from parameter learning to testing. In Section 3, we give a simple algorithm that solves this testing version by examining the Fourier transform of the mixture. The pseudocode of our algorithms is presented in Appendix A. We defer a few technical proofs to Appendices B through D. We prove our lower bound result (Theorem 1.2) in Appendix E. Finally, in Appendix F, we formally state and prove the guarantees for learning non-Gaussian mixtures and present a few applications of this result.

## 2 From Learning to Testing

We first state our main positive result, the formal version of Theorem 1.1.

Theorem 2.1. Let P = (1/k)·Σ_{j=1}^k N(µⱼ, I_d) be a uniform mixture of k ≥ 2 spherical Gaussians with Δ-separated means in R^d, and let ε < min{Δ/100, Δ/(32·min{d, ln k})}. There is an algorithm (Algorithm 1) that, given samples from P, outputs k vectors that are ε-close to µ₁, ..., µk with high probability. The runtime (and thus the sample complexity) of the algorithm is upper bounded by

O((Δ/ε)⁴·k³·log² k) · max{(d/ε)^{O(d)}, d^{O(d)}} · e^{Õ(M²)},

where M² ≲ Δ⁻²·(min{d, log k} + log(Δ/ε))·(min{d + log k, d·log(2 + d·Δ⁻²)} + log(Δ/ε)).

Remark 2.2. When d = O(log k / log log k) (and, say, ε = Ω(Δ/d)), the runtime bound simplifies to poly(k)·exp(Õ(d·min{d + log k, d·log(2 + d·Δ⁻²)}/Δ²)), which is poly(k) whenever Δ = Ω̃(d/√log k), as in Theorem 1.1. When d = o(log k), this condition is strictly looser than both Δ = Ω(d) and Δ = Ω(√log k).

We prove Theorem 2.1 by reducing the parameter learning problem to the following testing version: given samples from the mixture P, determine whether P contains a cluster whose mean is close to a given guess µ* ∈ R^d. Formally, we prove the following theorem:

Theorem 2.3. Let P = (1/k)·Σ_{j=1}^k N(µⱼ, I_d) be a uniform mixture of k ≥ 2 spherical Gaussians with Δ-separated means in R^d. Let ε < min{Δ/100, Δ/(32·min{d, ln k})} and µ* ∈ R^d. There is an algorithm (Algorithm 2) that, given samples from P, either accepts or rejects µ*, such that it:
- Accepts with probability at least 2/3 if min_{j∈[k]} ‖µⱼ − µ*‖₂ ≤ ε/2.
- Rejects with probability at least 2/3 if min_{j∈[k]} ‖µⱼ − µ*‖₂ ≥ ε.

The runtime (and thus the sample complexity) of the algorithm is upper bounded by O(k²·(Δ/ε)⁴)·e^{Õ(d+M²)}, where M² ≲ Δ⁻²·(min{d, log k} + log(Δ/ε))·(min{d + log k, d·log(2 + d·Δ⁻²)} + log(Δ/ε)).

Assuming the above theorem, the proof of our main theorem is straightforward and is thus deferred to Appendix C. The proof proceeds by first drawing a few samples X₁, ..., X_N from the mixture, and then running the tester (from Theorem 2.3) to decide whether each Xᵢ is close to one of the mean vectors of the mixture. Then, a simple argument shows that the samples that the tester accepts can be easily clustered to recover the means. Note that if a sample Xᵢ comes from the j-th component N(µⱼ, I_d), there is a decent probability (e.g., Ω(ε) when d = 1) that Xᵢ is ε-close to µⱼ. Thus, we can guarantee that X₁, ..., X_N contain good guesses for all the k means for moderately large N.
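The reduction just sketched is given as Algorithm 1 in Appendix A, which is not reproduced in this text. Purely as an illustration, the following sketch renders the same idea in code, with an abstract `tester` callable standing in for the procedure of Theorem 2.3 and with the success amplification and the choice of N left out.

```python
import numpy as np

def learn_means_from_tester(draw_samples, tester, delta, n_candidates):
    """Sketch of the learning-from-testing reduction described above.

    draw_samples(n): returns an (n, d) array of i.i.d. samples from the mixture P.
    tester(mu_star): returns True iff some component mean is (roughly) eps-close
                     to mu_star, as guaranteed by Theorem 2.3 (assumed noiseless here).
    delta:           a lower bound on the separation of the true means.
    """
    candidates = draw_samples(n_candidates)
    # Keep the candidate points that the tester certifies as close to some mean.
    accepted = [x for x in candidates if tester(x)]

    # Greedy clustering: two accepted points within distance delta / 2 must be
    # guesses for the same mean, because each accepted point is within eps << delta
    # of a true mean and the true means are delta-separated.
    centers = []
    for x in accepted:
        if all(np.linalg.norm(x - c) > delta / 2 for c in centers):
            centers.append(x)
    return centers  # one representative guess per discovered component
```

In the actual algorithm, the tester's 2/3 success probability is amplified by repetition, and N is chosen large enough that, with high probability, every component contributes at least one candidate within ε/2 of its mean.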
## 3 Solving the Testing Problem using the Fourier Transform

Now we solve the testing problem and prove Theorem 2.3. We assume without loss of generality that µ* = 0, since we can reduce to this case by subtracting µ* from each sample from P. We also re-index the unknown means µ₁, ..., µk so that ‖µⱼ‖₂ is non-decreasing in j. The testing problem is then equivalent to deciding whether ‖µ₁‖₂ ≤ ε/2 or ‖µ₁‖₂ ≥ ε.

The following simple lemma, which we prove in Appendix D, gives the Fourier transform of the mixture P when a Gaussian truncation weight of e^{−‖x‖₂²/2} is applied.

Lemma 3.1. For P = (1/k)·Σ_{j=1}^k N(µⱼ, I_d) and any ξ ∈ R^d,

E_{X∼P}[e^{−‖X‖₂²/2}·e^{i⟨ξ,X⟩}] = (e^{−‖ξ‖₂²/4} / (2^{d/2}·k)) · Aµ(ξ),

where we define Aµ(ξ) := Σ_{j=1}^k e^{−‖µⱼ‖₂²/4}·e^{i⟨µⱼ,ξ⟩/2}.

We will show that Aµ(ξ) behaves quite differently depending on whether ‖µ₁‖₂ ≤ ε/2 or ‖µ₁‖₂ ≥ ε. Thus, we can solve the testing problem by estimating Aµ(ξ) for carefully chosen ξ. Concretely, we will draw ξ from N(0, σ²I_d) and then truncate it to B(0, M) := {x ∈ R^d : ‖x‖₂ ≤ M}, for parameters M, σ > 0 to be chosen later. Formally, we focus on the following expectation:

Tµ := E_{ξ∼N(0,σ²I_d)}[Aµ(ξ)·1{‖ξ‖₂ ≤ M}].

The key step of our proof of Theorem 2.3 is to show that Tµ is close to e^{−(σ²/2+1)‖µ₁‖₂²/4}, and is thus helpful for deciding whether ‖µ₁‖₂ ≤ ε/2 or ‖µ₁‖₂ ≥ ε. The following lemma, the proof of which is relegated to Appendix D, helps us bound the difference between Tµ and e^{−(σ²/2+1)‖µ₁‖₂²/4}.

Lemma 3.2. For any M, σ > 0 that satisfy M²/σ² ≥ 5d,

Tµ = Σ_{j=1}^k e^{−(σ²/2+1)‖µⱼ‖₂²/4} + O(e^{−M²/(5σ²)} · Σ_{j=1}^k e^{−‖µⱼ‖₂²/4}),

where the O(x) notation hides a complex number with modulus at most x.

By Lemma 3.2, the difference Tµ − e^{−(σ²/2+1)‖µ₁‖₂²/4} is given by

Σ_{j=2}^k e^{−(σ²/2+1)‖µⱼ‖₂²/4} + O(e^{−M²/(5σ²)} · Σ_{j=1}^k e^{−‖µⱼ‖₂²/4}).

Let S1 := Σ_{j=2}^k e^{−(σ²/2+1)‖µⱼ‖₂²/4} and S2 := Σ_{j=1}^k e^{−‖µⱼ‖₂²/4} denote the two summations above. We have the following upper bounds on S1 and S2, which we prove in Appendix D:

Claim 3.3. Assuming (σ²/2 + 1)·Δ² ≥ 100·min{ln k, d}, we have S1 ≤ 2·e^{−(σ²/2+1)Δ²/64}·min{k, 2^d}.

Claim 3.4. When Δ² < 100d, we have ln S2 = O(d·log(2 + d·Δ⁻²)). Furthermore, S2 ≤ k.
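Before assembling the proof, here is a compact numerical sketch of the test that the proof analyzes (a simplified stand-in for Algorithm 2 in Appendix A, with illustrative parameter values rather than the choices made below): estimate T̂µ by averaging 1{‖ξ‖₂ ≤ M}·2^{d/2}·k·e^{‖ξ‖₂²/4}·e^{−‖X‖₂²/2}·e^{i⟨ξ,X⟩} over independent pairs (X, ξ), which is an unbiased estimate of Tµ by Lemma 3.1, and accept iff Re T̂µ exceeds the threshold τ used in the proof.

```python
import numpy as np

rng = np.random.default_rng(2)

def test_zero_mean(samples, k, eps, sigma, M):
    """Sketch of the tester analyzed in Theorem 2.3 (after centering at mu* = 0)."""
    n, d = samples.shape
    xi = rng.normal(0.0, sigma, size=(n, d))          # one fresh frequency per sample
    keep = np.linalg.norm(xi, axis=1) <= M            # truncation to the ball B(0, M)
    weight = 2 ** (d / 2) * k * np.exp(np.sum(xi ** 2, axis=1) / 4)
    soft = np.exp(-np.sum(samples ** 2, axis=1) / 2)  # Gaussian truncation e^{-||X||^2/2}
    phase = np.exp(1j * np.sum(xi * samples, axis=1))
    T_hat = np.mean(keep * weight * soft * phase)     # unbiased for T_mu (Lemma 3.1)
    s = sigma ** 2 / 2 + 1
    tau = 0.5 * (np.exp(-s * eps ** 2 / 16) + np.exp(-s * eps ** 2 / 4))
    return T_hat.real >= tau                          # accept iff Re(T_hat) >= tau

# Illustrative instance: k = 4 components in d = 2 with separation Delta = 4.
d, k, eps = 2, 4, 1.0
far = np.array([[4.0, 0.0], [0.0, 5.0], [-4.5, -3.0]])
sigma, M = 1.0, 4.0                                   # illustrative (not the proof's choice)

for mu1, label in [(np.zeros(d), "||mu_1|| <= eps/2"), (np.array([eps, 0.0]), "||mu_1|| >= eps")]:
    mus = np.vstack([mu1, far])
    idx = rng.integers(k, size=300_000)
    X = mus[idx] + rng.normal(size=(300_000, d))
    verdict = "accept" if test_zero_mean(X, k, eps, sigma, M) else "reject"
    print(f"{label}: {verdict}")
```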
Now we put all the pieces together and prove Theorem 2.3.

Proof of Theorem 2.3. By Lemma 3.1,

Tµ = E_{ξ∼N(0,σ²I_d)}[Aµ(ξ)·1{‖ξ‖₂ ≤ M}] = E_{X∼P, ξ∼N(0,σ²I_d)}[1{‖ξ‖₂ ≤ M}·2^{d/2}·k·e^{‖ξ‖₂²/4}·e^{−‖X‖₂²/2}·e^{i⟨ξ,X⟩}].

Since the term inside the expectation has modulus at most k·e^{O(d+M²)}, a Chernoff bound implies that we can estimate Tµ up to any additive error γ > 0 using O((k/γ)²)·e^{O(d+M²)} samples from P. By Lemma 3.2 and our definitions of S1 and S2, assuming M²/σ² ≥ 5d,

|Tµ − e^{−(σ²/2+1)‖µ₁‖₂²/4}| ≤ S1 + e^{−M²/(5σ²)}·S2.

In the rest of the proof, we will pick σ and M carefully so that both S1 and e^{−M²/(5σ²)}·S2 are upper bounded by γ := (σ²/2 + 1)·ε²/64. Assuming this, we are done: we can simply estimate Tµ up to an additive error of γ. Let T̂µ denote the estimate. We accept if and only if Re T̂µ ≥ τ, where

τ := (1/2)·(e^{−(σ²/2+1)ε²/16} + e^{−(σ²/2+1)ε²/4}).

Indeed, suppose that ‖µ₁‖₂ ≤ ε/2. Then we have

Re T̂µ ≥ Re Tµ − γ ≥ e^{−(σ²/2+1)‖µ₁‖₂²/4} − (S1 + e^{−M²/(5σ²)}·S2) − γ ≥ e^{−(σ²/2+1)ε²/16} − 3γ.

Similarly, Re T̂µ ≤ e^{−(σ²/2+1)ε²/4} + 3γ if ‖µ₁‖₂ ≥ ε. If we set σ such that (σ²/2 + 1)·ε² ≤ 1, we have

e^{−(σ²/2+1)ε²/16} − e^{−(σ²/2+1)ε²/4} ≥ (1/8)·(σ²/2 + 1)·ε² = 8γ,

which implies Re T̂µ > τ in the former case and Re T̂µ < τ in the latter case. Therefore, our algorithm decides correctly.

Choice of parameters. Claim 3.3 implies that if we set σ²/2 + 1 = 512·Δ⁻²·(min{d, ln k} + ln(Δ/ε)), we can ensure S1 ≤ (σ²/2 + 1)·ε²/64 = γ. Furthermore, this choice of σ and the assumption ε < min{Δ/100, Δ/(32·min{d, ln k})} guarantee the condition (σ²/2 + 1)·ε² ≤ 1 that we need. It remains to pick M such that e^{−M²/(5σ²)}·S2 ≤ γ = (σ²/2 + 1)·ε²/64. We also need M²/σ² ≥ 5d to ensure that Lemma 3.2 can be applied. It suffices to let

M² ≥ 5σ²·(d + ln S2 + ln(64/((σ²/2 + 1)·ε²))).

Our choice of σ guarantees ln(64/((σ²/2 + 1)·ε²)) ≤ 2·ln(Δ/ε), so it is in turn sufficient to pick M such that

M² ≳ Δ⁻²·(min{d, ln k} + ln(Δ/ε))·(d + ln S2 + ln(Δ/ε)).

Applying Claim 3.4 shows that M can be chosen such that

M² ≲ Δ⁻²·(min{d, log k} + log(Δ/ε))·(min{d + log k, d·log(2 + d·Δ⁻²)} + log(Δ/ε)).

Runtime. The runtime of our algorithm is dominated by the number of samples drawn from P, which is O((k/γ)²)·e^{O(d+M²)}, where γ = Θ((σ² + 1)·ε²). Plugging our choice of σ into γ gives 1/γ = O((Δ/ε)²). The runtime can thus be upper bounded by O(k²·(Δ/ε)⁴)·e^{Õ(d+M²)}.

## Acknowledgments

We would like to thank the anonymous reviewers of earlier versions of this paper for their suggestions on the presentation and for pointers to the literature.

## References

D. Achlioptas and F. McSherry. On spectral learning of mixtures of distributions. In Conference on Learning Theory (COLT), pages 458–469, 2005.
J. Anderson, M. Belkin, N. Goyal, L. Rademacher, and J. Voss. The more, the merrier: the blessing of dimensionality for learning large Gaussian mixtures. In Conference on Learning Theory (COLT), pages 1135–1164, 2014.
S. Arora and R. Kannan. Learning mixtures of separated nonspherical Gaussians. The Annals of Applied Probability, 15(1A):69–92, 2005.
H. Ashtiani, S. Ben-David, N. J. Harvey, C. Liaw, A. Mehrabian, and Y. Plan. Near-optimal sample complexity bounds for robust learning of Gaussian mixtures via compression schemes. Journal of the ACM (JACM), 67(6):1–42, 2020.
M. Belkin and K. Sinha. Polynomial learning of distribution families. In Foundations of Computer Science (FOCS), pages 103–112, 2010.
S. C. Brubaker and S. Vempala. Isotropic PCA and affine-invariant clustering. In Foundations of Computer Science (FOCS), pages 551–560, 2008.
E. J. Candès and C. Fernandez-Granda. Super-resolution from noisy data. Journal of Fourier Analysis and Applications, 19(6):1229–1254, 2013.
E. J. Candès and C. Fernandez-Granda. Towards a mathematical theory of super-resolution. Communications on Pure and Applied Mathematics, 67(6):906–956, 2014.
S. Chakraborty and H. Narayanan. Learning mixtures of spherical Gaussians via Fourier analysis. arXiv preprint arXiv:2004.05813, 2020.
S. Chen and A. Moitra. Algorithmic foundations for the diffraction limit. In Symposium on Theory of Computing (STOC), pages 490–503, 2021.
S. Chen, J. Li, and Z. Song. Learning mixtures of linear regressions in subexponential time via Fourier moments. In Symposium on Theory of Computing (STOC), pages 587–600, 2020.
S. Dasgupta. Learning mixtures of Gaussians. In Foundations of Computer Science (FOCS), pages 634–644, 1999.
S. Dasgupta and L. J. Schulman. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. Journal of Machine Learning Research (JMLR), 8:203–226, 2007.
I. Diakonikolas and D. M. Kane. Small covers for near-zero sets of polynomials and learning latent variable models. In Foundations of Computer Science (FOCS), pages 184–195, 2020.
I. Diakonikolas, D. M. Kane, and A. Stewart. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. In Foundations of Computer Science (FOCS), pages 73–84, 2017.
I. Diakonikolas, D. M. Kane, and A. Stewart. List-decodable robust mean estimation and learning mixtures of spherical Gaussians. In Symposium on Theory of Computing (STOC), pages 1047–1060, 2018.
D. L. Donoho. Superresolution via sparsity constraints. SIAM Journal on Mathematical Analysis, 23(5):1309–1331, 1992.
M. Hardt and E. Price. Tight bounds for learning a mixture of two Gaussians. In Symposium on Theory of Computing (STOC), pages 753–760, 2015.
S. B. Hopkins and J. Li. Mixture models, robustness, and sum of squares proofs. In Symposium on Theory of Computing (STOC), pages 1021–1034, 2018.
Q. Huang and S. M. Kakade. Super-resolution off the grid. In Advances in Neural Information Processing Systems (NIPS), pages 2665–2673, 2015.
R. Kannan, H. Salmasian, and S. Vempala. The spectral method for general mixture models. In Conference on Learning Theory (COLT), pages 444–457, 2005.
P. K. Kothari, J. Steinhardt, and D. Steurer. Robust moment estimation and improved clustering via sum of squares. In Symposium on Theory of Computing (STOC), pages 1035–1046, 2018.
B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.
A. Liu and J. Li. Clustering mixtures with almost optimal separation in polynomial time. In Symposium on Theory of Computing (STOC), pages 1248–1261, 2022.
J. Matoušek, A. Björner, and G. M. Ziegler. Using the Borsuk–Ulam theorem: lectures on topological methods in combinatorics and geometry. Springer, 2003.
A. Moitra. Super-resolution, extremal functions and the condition number of Vandermonde matrices. In Symposium on Theory of Computing (STOC), pages 821–830, 2015.
A. Moitra and G. Valiant. Settling the polynomial learnability of mixtures of Gaussians. In Foundations of Computer Science (FOCS), pages 93–102, 2010.
O. Regev and A. Vijayaraghavan. On learning mixtures of well-separated Gaussians. In Foundations of Computer Science (FOCS), pages 85–96, 2017.
D. J. Smith and M. K. Vamanamurthy. How small is a unit ball? Mathematics Magazine, 62(2):101–107, 1989.
A. T. Suresh, A. Orlitsky, J. Acharya, and A. Jafarpour. Near-optimal-sample estimators for spherical Gaussian mixtures. In Advances in Neural Information Processing Systems (NIPS), pages 1395–1403, 2014.
S. Vempala and G. Wang. A spectral algorithm for learning mixture models. Journal of Computer and System Sciences, 68(4):841–860, 2004.