# Sample-Efficient Private Learning of Mixtures of Gaussians

Hassan Ashtiani (McMaster University, zokaeiam@mcmaster.ca), Mahbod Majid (MIT, mahbod@mit.edu), Shyam Narayanan (Citadel Securities, shyam.s.narayanan@gmail.com; work done as a student at MIT)

Abstract. We study the problem of learning mixtures of Gaussians with approximate differential privacy. We prove that roughly kd² + k^{1.5}d^{1.75} + k²d samples suffice to learn a mixture of k arbitrary d-dimensional Gaussians up to low total variation distance, with differential privacy. Our work improves over the previous best result [AAL24b] (which required roughly k²d⁴ samples) and is provably optimal when d is much larger than k². Moreover, we give the first optimal bound for privately learning mixtures of k univariate (i.e., 1-dimensional) Gaussians. Importantly, we show that the sample complexity for learning mixtures of univariate Gaussians is linear in the number of components k, whereas the previous best sample complexity [AAL21] was quadratic in k. Our algorithms utilize various techniques, including the inverse sensitivity mechanism [AD20b, AD20a, HKMN23], sample compression for distributions [ABDH+20], and methods for bounding volumes of sumsets.

1 Introduction

Learning Gaussian Mixture Models (GMMs) is one of the most fundamental problems in algorithmic statistics. Gaussianity is a common data assumption, and the setting of Gaussian mixture models is motivated by heterogeneous data that can be split into numerous clusters, where each cluster follows a Gaussian distribution. Learning mixture models is among the most important problems in machine learning [Bis06], and is at the heart of several unsupervised and semi-supervised machine learning models. The study of Gaussian mixture models has had numerous scientific applications dating back to the 1890s [Pea94], and is a crucial tool in modern data analysis techniques in a variety of fields, including bioinformatics [LKWB22], anomaly detection [ZSM+18], and handwriting analysis [Bis06].

In this work, we study the problem of learning a GMM from samples. We focus on the density estimation setting, where the goal is to learn the overall mixture distribution up to low total variation distance. Unlike the parameter estimation setting for GMMs, density estimation can be done even without any boundedness or separation assumptions on the parameters of the components. In fact, it is known that mixtures of k Gaussians in d dimensions can be learned up to total variation distance α using Õ(kd²/α²) samples [ABH+18].

Ensuring data privacy has emerged as an increasingly important challenge in modern data analysis and statistics. Differential privacy (DP) [DMNS06] is a rigorous way of defining privacy, and is considered to be the gold standard both in theory and practice, with deployments by Apple [Tea17], Google [EPK14], Microsoft [DKY17], and the US Census Bureau [DLS+17]. As is the case for many data analysis tasks, standard algorithms for learning GMMs leak potentially sensitive information about the individuals who contributed data. This raises the question of whether we can do density estimation for GMMs under the constraint of differential privacy.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Private density estimation for GMMs with unrestricted Gaussian components is a challenging task.
In fact, privately learning a single unrestricted Gaussian has been the subject of multiple recent studies [AAK21, KMS+22b, AL22, KMV22, AKT+23, HKMN23]. Private learning of GMMs is significantly more challenging, because even without privacy constraints, parameter estimation for GMMs requires exponentially many samples in terms of the number of components [MV10]. Therefore, it is not clear how to use the typical recipe of adding noise to the estimated parameters or privately choosing from the finite-dimensional space of parameters. Consequently, the only known sample complexity bounds for privately learning unrestricted GMMs are loose [AAL24b, AAL21].

Let us first formally define the problem of learning GMMs. We represent a GMM D = ∑_{i=1}^k w_i N(µ_i, Σ_i) by its parameters, namely {(w_i, µ_i, Σ_i)}_{i=1}^k, where w_i ≥ 0, ∑_i w_i = 1, µ_i ∈ R^d, and Σ_i is a positive definite matrix. In the following, a GMM learning algorithm A receives a set of data points in R^d and outputs a (representation of) a GMM. The total variation distance between two distributions is d_TV(D, D̂) = (1/2) ∫_{R^d} |D(x) − D̂(x)| dx. (We slightly abuse notation and use D(x) for the pdf of D at the point x.)

Definition 1.1 (Learning GMMs). For α, β ∈ (0, 1), we say A learns GMMs with n samples up to accuracy α and failure probability β if for every GMM D, given samples X_1, ..., X_n drawn i.i.d. from D, it outputs (a representation of) a GMM D̂ such that d_TV(D, D̂) ≤ α with probability at least 1 − β. Here α and β are called the accuracy and failure probability, respectively. For clarity of presentation, we will typically fix the value of β (e.g., β = 1/3).

The above definition does not enforce the constraint of differential privacy. The following definition formalizes (approximate) differential privacy.

Definition 1.2 (Differential Privacy (DP) [DMNS06, DKM+06]). Let ε, δ ≥ 0. A randomized algorithm A : 𝒳^n → 𝒪 is said to be (ε, δ)-differentially private ((ε, δ)-DP) if for any two neighboring datasets X, X′ ∈ 𝒳^n and any measurable subset O ⊆ 𝒪, P[A(X′) ∈ O] ≤ e^ε · P[A(X) ∈ O] + δ.

If the GMM learner A of Definition 1.1 is (ε, δ)-DP, we say that A privately learns GMMs. Formally, we have the following definition.

Definition 1.3 (Privately learning GMMs). Fix the number of samples n, dimension d, and number of mixture components k. For α, β ∈ (0, 1) and ε, δ ≥ 0, a randomized algorithm A that takes as input X_1, ..., X_n ∈ R^d (ε, δ)-privately learns GMMs up to accuracy α and failure probability β if:

1. For any GMM D that is a mixture of up to k Gaussians in d dimensions, if X = {X_1, ..., X_n} is drawn i.i.d. from D, then A(X_1, ..., X_n) outputs a GMM D̂ such that d_TV(D, D̂) ≤ α with probability at least 1 − β (over the randomness of the data X and the algorithm A).

2. For any neighboring datasets X, X′ ∈ (R^d)^n (not necessarily drawn from any GMM) and any measurable subset O ⊆ 𝒪, P[A(X′) ∈ O] ≤ e^ε · P[A(X) ∈ O] + δ.

Finally, we assume a default value of 1/3 for β, meaning that if not stated, the failure probability β is assumed to equal 1/3. Our main goal in this paper is to understand the number of samples (as a function of the dimension d, the number of mixture components k, the accuracy α, and the privacy parameters ε, δ) that are needed to privately and accurately learn the GMM up to low total variation distance.

1.1 Results

In this work, we provide improved sample complexity bounds for privately learning mixtures of arbitrary Gaussians, improving over previous work of [AAL21, AAL24b]. Moreover, our sample complexity bounds are optimal in certain regimes, namely when the dimension is either 1 or a sufficiently large polynomial in k and log(1/δ).
For general dimension d, we prove the following theorem.

Theorem 1.4. For any α, ε, δ ∈ (0, 1) and k, d ∈ N, there exists an inefficient (ε, δ)-DP algorithm that can learn a mixture of k arbitrary full-dimensional Gaussians in d dimensions up to accuracy α, using the following number of samples:

n = Õ( kd²/α² + (kd² + d^{1.75}k^{1.5}·log^{0.5}(1/δ) + k^{1.5}·log^{1.5}(1/δ))/(αε) + k²d/α ).

Notably, the mixing weights and the means can be arbitrary, and the covariances of the Gaussians can be arbitrarily poorly conditioned, as long as the covariances are non-singular. (For clarity of presentation, we assume the covariance matrices are not singular; extending our results to degenerate matrices is straightforward.)

We remark that we omit the dependence on β (and assume by default a failure probability of 1/3). However, it is well-known that one can obtain failure probability β with only a multiplicative O(log(1/β)) blowup in sample complexity, in a black-box fashion: repeat the procedure T = O(log(1/β)) times on independent groups of n samples each to obtain estimates D̂_1, ..., D̂_T; by a Chernoff bound, at least 51% of the estimates are within total variation distance α of the true mixture D, so any estimate that is within 2α of at least 51% of the estimates is within 3α of D. In fact, our analysis can yield even better dependencies on β in some regimes, though to avoid too much complication, we do not analyze this.

For reasonably large dimension, i.e., d ≥ k² log²(1/δ), the bound simplifies to Õ( kd²/α² + kd²/(αε) ), which is in fact optimal (see Theorem 1.6). Hence, we obtain the optimal sample complexity for sufficiently large dimension. Theorem 1.4 also improves over the previous best sample complexity upper bound of [AAL24b], which uses Õ( (k²d⁴ + kd²·log(1/δ))/(α²ε) + kd·log(1/δ)/(αε) ) samples. Our results provide a polynomial improvement in all parameters, but to simplify the comparison, if we ignore dependencies on the error parameter α and the privacy parameters ε, δ, we improve the sample complexity from k²d⁴ to kd² + k²d + k^{1.5}d^{1.75}: note that our result is quadratic in the dimension whereas [AAL24b] is quartic.

When the dimension is d = 1, we can provide an improved result, which is optimal for learning mixtures of univariate Gaussians (see Theorem 1.6 for a matching lower bound).

Theorem 1.5. For any α, ε, δ ∈ (0, 1) and k ∈ N, there exists an inefficient (ε, δ)-DP algorithm that can learn a mixture of k arbitrary univariate Gaussians (of nonzero variance) up to accuracy α, using the following number of samples:

n = Õ( k/α² + k·log(1/δ)/(αε) ).

For privately learning mixtures of univariate Gaussians, the previous best-known result for arbitrary Gaussians required Õ( k²·log^{3/2}(1/δ)/(α²ε) ) samples [AAL21]. Importantly, we are the first to show that the sample complexity can be linear in the number of components.

Our work purely focuses on sample complexity, and, as noted in Theorems 1.4 and 1.5, our algorithms do not run in polynomial time. We note that the previous works of [AAL21, AAL24b] also do not run in polynomial time. Indeed, there is reason to believe that even non-privately, it is impossible to learn GMMs in polynomial time (in terms of the optimal sample complexity) [DKS17, BRST21, GVV22].

Finally, we prove the following lower bound for learning GMMs in any fixed dimension d.

Theorem 1.6. Fix any dimension d ≥ 1, number of components k ≥ 2, any α, ε at most a sufficiently small constant c, and δ ≤ (αε/d)^{O(1)}. Then, any (ε, δ)-DP algorithm that can learn a mixture of k arbitrary full-dimensional Gaussians in d dimensions up to total variation distance α, with probability at least 2/3, requires at least the following number of samples:

n = Ω( kd²/(αε) + k·log(1/δ)/(αε) ).

Note that for d = 1, this matches the upper bound of Theorem 1.5, thus showing that our univariate result is near-optimal in all parameters α, ε, δ.
Moreover, our lower bound refutes the conjecture of [AAL21], which posits that only Θ( k/α² + k/(αε) + log(1/δ)/ε ) samples are needed in the univariate case and Θ( kd²/(αε) + log(1/δ)/ε ) samples are needed in the d-dimensional case. However, we note that our lower bound asymptotically differs from the conjectured bound in [AAL21] only when δ is extremely small.

1.2 Related work

In the non-private setting, the sample complexity of learning unrestricted GMMs with respect to total variation distance (a.k.a. density estimation) is known to be Θ̃(kd²/α²) [ABM18, ABH+18], where the upper bound is obtained by the so-called distributional compression schemes. In the private setting, the only known sample complexity upper bound for unrestricted GMMs [AAL24b] is roughly k²d⁴·log(1/δ)/(α⁴ε), which exhibits sub-optimal dependence on various parameters (more precisely, the upper bound is Õ( k²d⁴/(α²ε) + kd²·log(1/δ)/(α²ε) + kd·log(1/δ)/(αε) )). This bound is achieved by running multiple non-private list-decoders and then privately aggregating the results. For the special case of axis-aligned GMMs, an upper bound of k²d·log^{3/2}(1/δ)/(α²ε) is known [AAL21]. These are the only known results even for privately learning (unbounded) univariate GMMs. In other words, the best known upper bound for the sample complexity of privately learning univariate GMMs has quadratic dependence on k.

In the related public-private setting [BKS22, BBC+23], it is assumed that the learner has access to some public data. In this setting, [BBC+23] show that unrestricted GMMs can be learned with a moderate amount of public and private data. Assuming the parameters of the Gaussian components (and the condition numbers of the covariance matrices) are bounded, one can create a cover for GMMs and use private hypothesis selection [BSKW19] or the private minimum distance estimator [AAK21] to learn the GMM. On the flip side, [ASZ21] prove a lower bound on the sample complexity of learning GMMs, though their lower bound is weaker than ours and holds only against pure-DP algorithms.

The focus of our work is on density estimation. A related problem is learning the parameters of a GMM, which has received extensive attention in the (non-private) literature (e.g., [Das99, MV10, BS10, LM21, BDJ+22, LL22], among many other papers). To avoid identifiability issues, one has to assume that the Gaussian components are sufficiently separated and have large-enough weights. In the private setting, the early work of [NRS07] demonstrated a privatized version of [VW04] for learning GMMs with fixed (known) covariance matrices. The strong separation assumption (of Ω(k^{1/4})) between the Gaussian components in [NRS07] was later relaxed to a weaker separation assumption [CCAd+23]. A substantially more general result for privately learning GMMs with unknown covariance matrices was established in [KSSU19], based on a privatized version of [AM05].
Yet, this approach also requires a polynomial separation (in terms of k) between the components, as well as a bound on the spectrum of the covariance matrices. [CKM+21] weakened the separation assumption of [KSSU19] and improved over their sample complexity. This result is based on a generic method that learns a GMM using a private learner for Gaussians and a non-private clustering method for GMMs. Finally, [AAL23] designed an efficient reduction from private learning of GMMs to its non-private counterpart, removing the boundedness assumptions on the parameters and achieving minimal separation (e.g., by reducing to [MV10]). Nevertheless, unlike density estimation, parameter estimation for unrestricted GMMs requires exponentially many samples in terms of k [MV10].

A final important question is that of efficient algorithms for learning GMMs. Much of the work on learning GMM parameters focuses on computational efficiency (e.g., [MV10, BS10, LM21, BDJ+22, LL22]), as does some work on density estimation (e.g., [CDSS14, ADLS17]). However, under some standard hardness assumptions, it is known that even non-privately learning mixtures of k d-dimensional Gaussians with respect to total variation distance cannot be done in polynomial time as a function of k and d [DKS17, BRST21, GVV22].

Addendum. In a concurrent submission, [AAL24a] extended the result of [AAL24b] for learning unrestricted GMMs to the agnostic (i.e., robust) setting. In contrast, our algorithm works only in the realizable (non-robust) setting. Moreover, [AAL24a] slightly improved the sample complexity result of [AAL24b] from Õ(log(1/δ)·k²d⁴/(εα⁴)) to Õ(log(1/δ)·k²d⁴/(εα²)). The sample complexity of our approach is still significantly better than that of [AAL24a] in terms of all parameters, similar to the way it improves over [AAL24b].

2 Technical overview and roadmap

We highlight some of our conceptual and technical contributions. We mainly focus on the high-dimensional upper bound, and discuss the univariate upper bound at the end.

2.1 Reducing to crude approximation

Suppose we are promised a bound on the means and covariances of the Gaussian components, i.e., (1/R)·I ⪯ Σ_i ⪯ R·I and ‖µ_i‖₂ ≤ R for all i ∈ [k]. In this case, there is in fact a known algorithm, using private hypothesis selection [BSKW19, AAK21], that can privately learn the distribution using only O( kd²·log R/α² + kd²·log R/(αε) ) samples. Moreover, with a more careful application of the hypothesis selection results (see Appendix D), we can prove that this result holds even if (µ_i, Σ_i) are possibly unbounded, provided we have some very crude approximation. By this, we mean that if for each i ∈ [k] we know some Σ̂_i such that (1/R)·Σ_i ⪯ Σ̂_i ⪯ R·Σ_i, then it suffices to have n = O( kd²·log R/α² + kd²·log R/(αε) ) samples to learn the full GMM in total variation distance.

Our main goal will be to learn every covariance Σ_i with such an approximation, for R = poly(k, d, 1/ε), so that log R can be hidden in the Õ(·) factor. To explain why this goal is sufficient, suppose we can crudely learn every covariance Σ_i with approximation ratio R using n samples. We then need O( kd²·log R/α² + kd²·log R/(αε) ) additional samples to learn the full distribution using hypothesis selection, so the total sample complexity is O( n + kd²·log R/α² + kd²·log R/(αε) ). Hence, we will aim for this easier goal of crudely learning each covariance, for both Theorem 1.4 and Theorem 1.5, using as few samples n as possible.
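As a small worked summary of the reduction just described (our own notation, under the bounds quoted above): if the crude-approximation stage uses n_crude samples and achieves approximation ratio R = poly(k, d, 1/ε), then log R = O(log(kd/ε)) is only polylogarithmic, so

\[
n_{\text{total}} \;=\; n_{\text{crude}} + O\!\left(\frac{kd^2 \log R}{\alpha^2} + \frac{kd^2 \log R}{\alpha\varepsilon}\right)
\;=\; n_{\text{crude}} + \tilde{O}\!\left(\frac{kd^2}{\alpha^2} + \frac{kd^2}{\alpha\varepsilon}\right),
\]

and the hypothesis-selection stage accounts for the first two terms of Theorem 1.4.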
We will also need to approximate each mean µ_i, though for simplicity we will just focus on covariances in this section.

2.2 Overview of Theorem 1.5 for univariate GMMs

The main goal will be to simply provide a rough approximation to the set of standard deviations σ_i = √Σ_i, as we can finish the procedure with hypothesis selection, as discussed above.

Bird's eye view: Say we are given a dataset X = {X_1, ..., X_n}; note that every X_j ∈ R since we are dealing with univariate Gaussians. The main insight is to sort the data in increasing order (i.e., reorder so that X_1 ≤ X_2 ≤ ... ≤ X_n) and consider the unordered multiset of successive differences {X_2 − X_1, X_3 − X_2, ..., X_n − X_{n−1}}. One can show that if a single datapoint X_j is changed, then the set of consecutive differences (up to permutation) does not change in more than 3 locations (see Lemma F.5 for a formal proof). We then apply a standard private histogram approach. Namely, for each integer a ∈ Z, we create a corresponding bucket B_a, and map each X_{j+1} − X_j into B_a if 2^a ≤ X_{j+1} − X_j < 2^{a+1}. If some mixture component i has variance Σ_i = σ_i², we should expect a significant number of the differences X_{j+1} − X_j to at least be crudely close to σ_i, such as for the X_j drawn from the i-th mixture component. So, some corresponding bucket should be reasonably large. Finally, by adding noise to the count of each bucket and taking the largest noisy counts, we will successfully find an approximation to all variances.

In more detail: For each (possibly negative) integer a, let c_a be the number of indices j such that 2^a ≤ X_{j+1} − X_j < 2^{a+1}. We will prove that, if the weight of the i-th component in the mixture is w_i and there are n points, then we should expect at least Ω(w_i·n) indices j to be in a bucket a with 2^a within a poly(n) multiplicative factor of the standard deviation σ_i (see Lemma F.3). The point of this observation is that there are at most O(log n) buckets B_a with 2^a between σ_i/poly(n) and O(σ_i), but there are at least Ω(w_i·n) indices mapping to one of these buckets. So by the pigeonhole principle, one of these buckets has at least Ω(w_i·n/log n) indices, i.e., c_a ≥ Ω(w_i·n/log n) for some a with σ_i/poly(n) ≤ 2^a ≤ O(σ_i). Moreover, we know that if we change a single data point X_j, the multiset of consecutive differences {X_{j+1} − X_j} after sorting changes in at most 3 locations. So, if we change a single X_j, at most 3 of the counts c_a can change, each by at most 3.

Now, for every integer a, draw a noise value from the Truncated Laplace distribution (see Definition A.2 and Lemma A.3), and add it to c_a to get a noisy count c̃_a. The details of the noise distribution are not important, but the idea is that this distribution is always bounded in magnitude by O(log(1/δ)/ε). Moreover, the Truncated Laplace distribution preserves (ε, δ)-DP. This means that the counts {c̃_a}_{a∈Z} will be (O(ε), O(δ))-DP, because the true counts c_a only change minimally across adjacent datasets. Our crude approximation to the set of standard deviations will be the set of 2^a such that c̃_a exceeds some threshold, which is a large multiple of log(1/δ)/ε. If n ≥ O(k·log(1/δ)/(αε)) and the weight w_i ≥ α/k, it is not hard to verify that w_i·n/log n exceeds a large multiple of log(1/δ)/ε. So, for each i ≤ k with weight at least α/k, some corresponding a with σ_i/poly(n) ≤ 2^a ≤ O(σ_i) will have count c_a significantly exceeding the threshold, and thus noisy count c̃_a exceeding the threshold. This will be enough to crudely approximate the values σ_i coming from components of large enough weight.
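The following is a minimal sketch (not the paper's exact algorithm or constants) of the univariate procedure just described: sort the sample, bucket consecutive differences by powers of two, add Truncated Laplace noise to each bucket count, and report the scales 2^a whose noisy counts clear a threshold. The per-bucket sensitivity and the threshold are placeholders; the paper calibrates them using the fact that one changed data point affects at most 3 counts, each by at most 3.

```python
import math, random

def truncated_laplace(sensitivity, eps, delta):
    """Sample TLap(sensitivity, eps, delta): a Laplace(sensitivity/eps) variable
    conditioned on [-A, A], with A = (sensitivity/eps) * log(1 + (e^eps - 1) / (2*delta))."""
    scale = sensitivity / eps
    A = scale * math.log(1.0 + (math.exp(eps) - 1.0) / (2.0 * delta))
    while True:
        x = random.choice((-1.0, 1.0)) * random.expovariate(1.0 / scale)
        if abs(x) <= A:
            return x

def crude_scales(xs, eps, delta, threshold):
    """Noisy histogram of consecutive differences of the sorted sample; returns
    candidate scales 2^a that crudely approximate the standard deviations of
    heavy mixture components."""
    xs = sorted(xs)
    counts = {}
    for prev, cur in zip(xs, xs[1:]):
        gap = cur - prev
        if gap > 0:                          # bucket B_a: 2^a <= gap < 2^(a+1)
            a = math.floor(math.log2(gap))
            counts[a] = counts.get(a, 0) + 1
    # One changed point alters at most 3 consecutive differences, so each count
    # moves by at most 3; we use 3 as a crude per-bucket sensitivity and ignore
    # the (constant-factor) composition over the <= 3 affected buckets.
    noisy = {a: c + truncated_laplace(3.0, eps, delta) for a, c in counts.items()}
    return sorted(2.0 ** a for a, c in noisy.items() if c >= threshold)
```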
We can ignore any component with weight less than α/k: even if all but one of the components have such small weight, together they contribute less than α weight. So, we can ignore them, and it will only cost us α in total variation distance, which we can afford.

Putting everything together: In summary, we needed O(k·log(1/δ)/(αε)) samples to approximate each covariance (of sufficient weight) up to a poly(n) multiplicative factor. By setting R = poly(n) and using the reduction described in Section 2.1, we then need an additional O( k·log n/α² + k·log n/(αε) ) samples. If we set n = O( k/α² + k·log(1/δ)/(αε) ), we will obtain that n ≥ O(k·log(1/δ)/(αε)) + O( k·log n/α² + k·log n/(αε) ), which is sufficient to solve our desired problem in the univariate setting.

Note that this proof relies heavily on the use of private histograms and the order of the data points on the real line. Therefore, it cannot be extended to the high-dimensional setting. We will use a completely different approach to prove Theorem 1.4.

2.3 Overview of Theorem 1.4 for general GMMs

As in the univariate case, we only need rough approximations of the covariances. We will learn the covariances one at a time: in each iteration, we privately identify a single covariance Σ̂_i which crudely approximates some covariance Σ_i in the mixture, with (ε/√(k·log(1/δ)), δ/k)-DP. Using the well-known advanced composition theorem (see Theorem A.1), we will get an overall privacy guarantee of (ε, δ)-DP. However, to keep things simple, we will usually aim for (ε, δ)-DP when learning each covariance, and we can replace ε with ε/√(k·log(1/δ)) and δ with δ/k at the end.

A natural approach for parameter estimation, rather than learning the Σ_i one at a time, is to learn all of the covariances together. However, we believe that this approach would cause the sample complexity to multiply by a factor of k, compared to learning a single covariance. The advantage of learning the covariances one at a time is that we can apply advanced composition: this will cause the sample complexity to multiply by roughly √(k·log(1/δ)) instead. In the rest of this subsection, our main focus is to identify a single crude approximation Σ̂_i.

Applying robustness-to-privacy: The first main insight is to use the robustness-to-privacy conversion of Hopkins et al. [HKMN23] (see also [AUZ23]). Hopkins et al. prove a black-box (but not computationally efficient) approach that can convert robust algorithms into differentially private algorithms, using the exponential mechanism and a well-designed score function. This reduction only works for finite-dimensional parameter estimation and therefore cannot be applied directly to density estimation. On the other hand, parameter estimation for arbitrary GMMs requires exponentially many samples in the number of components [MV10]. However, we will demonstrate that this lower bound does not hold when we only need a very crude estimation of the parameters.

The idea is the following. For a dataset X = {X_1, ..., X_n}, let S = S(Σ̃, X) be a score function, which takes in a dataset X of size n and some candidate covariance Σ̃, and outputs some non-negative integer. At a high level, the score function S(Σ̃, X) will be the smallest number of data points t that we should change in X to get to some new data set X′ with a specific desired property: X′ should look like a sample generated from a mixture distribution with Σ̃ being the covariance of one of its components, namely, a component with a significant mixing weight.
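As a caricature of this recipe (not the actual mechanism of [HKMN23], which works over a continuous candidate set with a volume-normalized base measure and yields approximate DP), one can picture the exponential mechanism run with an inverse-sensitivity-style score over a finite candidate list; the candidate set and the score function below are placeholders supplied by the caller.

```python
import math, random

def exponential_mechanism(candidates, score, dataset, eps):
    """Select a candidate with probability proportional to exp(-eps * score / 2).
    If score(c, X) changes by at most 1 when a single point of X changes (as a
    Hamming-distance-to-a-property score does), this selection is eps-DP over
    the finite candidate set; lower score means higher selection probability."""
    scores = [score(c, dataset) for c in candidates]
    s_min = min(scores)                              # shift by the min for numerical stability
    weights = [math.exp(-0.5 * eps * (s - s_min)) for s in scores]
    r = random.uniform(0.0, sum(weights))
    acc = 0.0
    for cand, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return cand
    return candidates[-1]
```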
One can define "looks like" in different ways, and we will adjust the precise definition later. We remark that this notion of score roughly characterizes robustness, because if the samples in X were truly drawn from a Gaussian mixture model with covariances {Σ_i}_{i=1}^k, we should expect S(Σ_i, X) to be 0 (since X should already satisfy the desired property), but if we altered k data points, the score should be at most k. The high-level choice of score function is somewhat inspired by a version of the exponential mechanism called the inverse sensitivity mechanism [AD20b, AD20a], though the precise way of defining the score function requires significant care and is an important contribution of this paper.

The robustness-to-privacy framework of [HKMN23], tailored to learning a single covariance, implies the following general result, which holds for any score function S following the blueprint above. For now, we state an informal (and slightly incorrect) version.

Theorem 2.1 (informal - see Theorem C.3 for the formal statement). For any η ∈ [0, 1) and any dataset X of size n, define the value V_η(X) to be the volume (i.e., Lebesgue measure) of the set of covariance matrices Σ̃ (where the covariance can be viewed as a vector by flattening) such that S(Σ̃, X) ≤ η·n. Fix a parameter η < 0.1, and suppose that for any dataset X of size n such that V_{η/2}(X) is strictly positive, the ratio of volumes V_η(X)/V_{η/2}(X) is at most some K (which doesn't depend on X). Then, if n ≳ log(K)/(ε·η), there is a differentially private algorithm that can find a covariance Σ̃ of low score (i.e., where S(Σ̃, X) ≤ η·n) using n samples.

Note that Theorem 2.1 does not seem to say anything about whether X comes from a mixture of Gaussians. However, we aim to instantiate this theorem with a score function that is carefully designed for GMMs. Recall that we want S(Σ̃, X) to capture the number of points in X that need to be altered to make it look like a data set that was generated from a mixture, with Σ̃ being the covariance of one of the Gaussian components. In other words, S(Σ̃, X) should be small if (and hopefully only if) a mildly corrupted version of X includes a subset of points that are generated from a Gaussian with covariance Σ̃.

At the heart of designing such a score function, one needs to come up with a form of robust Gaussianity tester that tells whether a given set of data points is generated from a Gaussian distribution. Aside from this challenge, the volume ratio associated with the chosen score function needs to be small for every dataset X; otherwise the above theorem would require a large n (i.e., number of samples). These two challenges are, however, related. If the robust Gaussianity tester has high specificity, i.e., rejects most of the sets that are not generated from a (corrupted) Gaussian, then the volume ratio is likely to be small (i.e., a smaller number of candidates Σ̃ would receive a good/low score).

First Attempt: We first try an approach which resembles that of [HKMN23] for privately learning a single Gaussian. We define S(Σ̃, X) as the smallest integer t satisfying the following property: there exists a subset Y ⊆ X of size n/k such that we can change t data points from Y to get to Y′, where Y′ looks like i.i.d. samples from a Gaussian with some covariance Σ that is similar to Σ̃. The choice of Y having size n/k is motivated by the fact that each mixture component, on average, has n/k data points in X. The notions of "looks like" and "similar to" of course need to be formally defined.
We say Σ is similar to Σ̃ (written Σ ≈ Σ̃) if they are spectrally close, i.e., 0.5·Σ ⪯ Σ̃ ⪯ 2·Σ. We say that Y′ looks like samples from a Gaussian with covariance Σ if some covariance estimation algorithm predicts that Y′ came from a Gaussian with covariance Σ. The choice of covariance estimation algorithm will be quite nontrivial and ends up being a key ingredient in proving Theorem 1.4.

To apply Theorem 2.1, we first set η = c/k for some small constant c. We cannot set a larger value of η, because if we change t ≥ n/k data points, we could in fact create a totally arbitrary new Gaussian component with large weight. Since there is no bound on the eigenvalues of the covariance matrix, this could cause the volume V_η(X) to be infinite.

The main question we must answer is how to bound the volume ratio V_η(X)/V_{η/2}(X). To answer this question, we first need to understand what it means for S(Σ̃, X) ≤ η·n. If S(Σ̃, X) ≤ η·n, then there exists a corresponding set Y ⊆ X of size n/k, and we can change η·n = c·|Y| points from Y to get to some Y′ which looks like samples from a Gaussian with covariance Σ ≈ Σ̃. Thus, Y looks like c-corrupted samples from such a Gaussian (i.e., a c fraction of the data is corrupted). This motivates using a robust covariance estimation algorithm: indeed, robust algorithms can still approximately learn Σ even if a small constant fraction of the data is corrupted, so for any Y ⊆ X, we expect that no matter how we change a c fraction of Y to obtain Y′, the robust algorithm's covariance estimate should not change much. So, for any fixed Y, the set of possible Σ, and thus the set of possible Σ̃, should not be that large.

In summary, to bound V_η(X) versus V_{η/2}(X), there are at most (n choose n/k) choices for Y ⊆ X in the former case, and at least 1 choice in the latter case (since we assume V_{η/2}(X) > 0 in Theorem 2.1). Moreover, for any such Y, the volume of corresponding Σ̃ should be exponential in d² (either for V_η(X) or V_{η/2}(X)), since the dimensionality of the covariance is roughly d². So, this suggests that the overall volume ratio is at most (n choose n/k)·e^{Õ(d²)}. Since log (n choose n/k) ≈ (n/k)·log k, if we plug into Theorem 2.1 it suffices to have n ≳ ((n/k)·log k + d²)/(ε·(c/k)). Unfortunately, this is impossible unless ε ≳ log k.

These ideas will serve as a good starting point, though we need to improve the volume ratio analysis. To do so, we also modify the robust algorithm, by strengthening the assumptions on what it means for samples to look like they came from a Gaussian.

Improvement via Sample Compression: To improve the volume ratio, we draw inspiration from a technique called sample compression, which has been used in previous work on non-private density estimation for mixtures of Gaussians [ABH+18, ABDH+20]. The idea behind sample compression is that one does not need the full set Y to do robust covariance estimation; instead, we look at a smaller set of samples. For instance, if Y ⊆ X looks like c-corrupted samples from a Gaussian of covariance Σ ≈ Σ̃, we expect that a random subset Z of Y also looks like c-corrupted samples from such a Gaussian. Moreover, as long as one uses m ≳ d corrupted samples from a Gaussian, we can still (inefficiently) approximate the covariance. This motivates us to modify the robust algorithm as follows: rather than just checking whether Y looks like c-corrupted samples from a Gaussian of covariance roughly Σ̃, we also test whether an average subset Z ⊆ Y of size m does as well. Therefore, if Σ̃ has low score, there exists a corresponding set Z ⊆ X of size m, and there are only (n choose m) ≤ e^{m·log n} choices for Z.
So, now it suffices to have n ≳ (m·log n + d²)/(ε·(c/k)), which will give us a bound of O(d²k/ε), as long as m ≤ O(d²). Importantly, we still check the robust algorithm on Y of size roughly n/k, which allows us to keep the robustness threshold η at roughly c/k.

There is one important caveat: for each subset Z, there is a distinct corresponding covariance Σ, and the volume of {Σ̃ ≈ Σ} can change drastically as Σ changes. (For instance, the volume of {Σ̃ ≈ T·Σ} is T^{Θ(d²)} times as large as the volume of {Σ̃ ≈ Σ}. Since we have no bounds on the possible covariances, T could be unbounded.) For our volume ratio to actually be bounded by about (n choose m)·e^{Õ(d²)}, we want the volume of {Σ̃ ≈ Σ} to stay invariant with respect to Σ. This can be done by defining a normalized volume, where the normalization is inversely proportional to the determinant (see Appendix C.1 for more details). The robustness-to-privacy conversion (Theorem 2.1) will still hold.

While the bound of O(d²k/ε) seems good, we recall that this bound is merely the sample complexity for an (ε, δ)-DP crude approximation of a single Gaussian component. As discussed at the beginning of this subsection, to learn all k Gaussians, we actually need (ε/√(k·log(1/δ)), δ/k)-DP, rather than (ε, δ)-DP, when crudely approximating a single component. This will still end up leading to a significant improvement over previous work [AAL24b], but we can improve the volume ratio even further and thus obtain even better sample complexity.

Improving Dimension Dependence: Previously, we used the fact that the volume of candidate Σ̃ (corresponding to a fixed Z) was roughly exponential in d² for either V_{η/2}(X) or V_η(X), so the ratio should also be roughly exponential in d². Here, we improve this ratio, which will improve the overall volume ratio.

First, we return to understanding the guarantees of the robust algorithm. It is known that, given m ≳ d samples from a Gaussian of covariance Σ, we can provide an estimate Σ̂ such that (1 − O(√(d/m)))·Σ ⪯ Σ̂ ⪯ (1 + O(√(d/m)))·Σ. As above, we need to solve this even if a c fraction of samples are corrupted. While this can cause the relative spectral error to increase from 1 ± O(√(d/m)) to 1 ± O(c + √(d/m)), for now let us ignore the additional c factor.

If V_{η/2}(X) > 0, then there is some covariance Σ and some set Y of size n/k, where the robust algorithm thinks Y looks like (possibly corrupted) Gaussian samples of covariance Σ. So, every Σ̃ such that 0.5·Σ ⪯ Σ̃ ⪯ 2·Σ has score at most (η/2)·n. This gives us a lower bound on V_{η/2}(X). We now want to upper bound V_η(X). If S(Σ̃, X) < η·n, we still have that the robust algorithm thinks some Y looks like Gaussian samples of covariance Σ, where 0.5·Σ ⪯ Σ̃ ⪯ 2·Σ. But now, we use the additional fact that for some Z ⊆ Y of size m, the robust algorithm on Z finds a covariance Σ̂. By the accuracy of the robust algorithm, Σ and Σ̂ should be similar, i.e., (1 − O(√(d/m)))·Σ ⪯ Σ̂ ⪯ (1 + O(√(d/m)))·Σ (where we ignored the c factor). Thus, there exists some Z ⊆ X of size m and a Σ̂ corresponding to Z, such that 0.5·(1 − O(√(d/m)))·Σ̂ ⪯ Σ̃ ⪯ 2·(1 + O(√(d/m)))·Σ̂.

Therefore, from η/2 to η, we have dilated the candidate set of Σ̃ by a factor of 1 + O(√(d/m)) in the worst case, and we have at least 1 choice in the η/2 case but at most (n choose m) choices in the η case. Thus, the overall volume ratio is at most (n choose m)·(1 + O(√(d/m)))^{d²} = e^{Õ(m·log n + d²·√(d/m))}, since the dimension of Σ̃ is roughly d². Consequently, it now suffices to have n ≳ (m·log n + d²·√(d/m))/(ε·(c/k)): setting m = d^{5/3} gives us an improved bound of O(d^{5/3}·k/ε) for learning a single Σ_i.
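To spell out the balancing step in our notation: the two contributions to the log of the volume ratio are m·log n (from the choice of Z) and d²·√(d/m) (from the dilation), so the requirement reads

\[
n \;\gtrsim\; \frac{m\log n + d^2\sqrt{d/m}}{\varepsilon\cdot(c/k)},
\qquad
m\log n \approx d^2\sqrt{d/m}
\;\Longleftrightarrow\; m^{3/2} \approx d^{5/2}
\;\Longleftrightarrow\; m \approx d^{5/3},
\]

which (up to logarithmic factors) recovers the stated bound of O(d^{5/3}·k/ε) samples for learning a single component.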
There are some issues with this approach: most notably, we ignored the fact that the spectral error is really c + √(d/m) rather than √(d/m). However, the robust algorithm can do better than just estimating up to spectral error c + √(d/m): it can also get an improved Frobenius error. While we will not formally state the guarantees on the robust algorithm here (see Theorem B.3 for the formal statement), the main high-level observation is that if the robust estimator Σ̂ can be 1 ± c times as large as the true covariance Σ in only O(1) directions, then for an average direction the ratio of Σ̂ to Σ will be 1 ± O(√(d/m)). We can utilize this observation to bound the volume ratio, using some careful ε-net arguments (this is executed in Appendix C.3). Our dimension dependence of d^{5/3} will increase to d^{7/4}, though this still improves over the previous d² bound. We will formally define the score function S(Σ̃, X) in Appendix E.1 and fully analyze the application of the robustness-to-privacy conversion, as outlined here, in Appendix E.2.

Accuracy: One important final step that we have so far neglected is ensuring that any Σ̃ of low score must be crudely close to some Σ_i, if X is actually drawn from a GMM. We will just focus on the case where S(Σ̃, X) = 0, so some Y ⊆ X of size n/k looks like a set of samples from a Gaussian with covariance Σ̃. If the samples in Y all came from the i-th component of the GMM, then it will not be difficult to show that Σ̃ is similar to Σ_i. The more difficult case is when the data points in Y are generated from several components. However, if n ≥ 20k²d, then |Y| ≥ 20kd, which means that by the pigeonhole principle, at least 20d points in Y come from the same mixture component (µ_i, Σ_i). We are able to prove that, with high probability over samples drawn from a single Gaussian component N(µ_i, Σ_i), the empirical covariance of any subset of size at least 20d is crudely close to Σ_i (see Corollary B.5). As a result, when verifying that a subset Y looks like i.i.d. Gaussian samples with covariance Σ, we can ensure that the empirical covariance of every subset Z ⊆ Y of size 20d is crudely close to Σ. Thus, if the score S(Σ̃, X) = 0, then Σ̃ is close to Σ, which is crudely close to some Σ_i. We also formally analyze the accuracy in Appendix E.2.

Putting everything together: To crudely learn a single Gaussian component with (ε, δ)-DP, we will need n ≥ O(d^{7/4}·k/ε) samples to find some covariance Σ̃ with low score, and we also need n ≥ O(k²d) so that a covariance Σ̃ of low score is actually a crude approximation to one of the real mixture components. To crudely approximate all components, we learn each Gaussian component with (ε/√(k·log(1/δ)), δ/k)-DP. The advanced composition theorem will imply that repeating this procedure k times on the same data (to learn all k components) is (ε, δ)-DP. Hence, by replacing ε with ε/√(k·log(1/δ)), we get that it suffices for n ≥ O( d^{7/4}·k^{3/2}·√(log(1/δ))/ε + k²d ) if we need to crudely learn all of the covariances. Finally, we can apply the private hypothesis selection technique (recall Section 2.1), which requires an additional O(d²k/(αε)) samples. Combining these terms will give the final sample complexity.

We remark that the sample complexity obtained above is actually better than the complexity in Theorem 1.4. There are two reasons for this. The first is that we have been assuming each component has weight roughly 1/k, meaning it contributes about n/k data points. In reality, the weights may be arbitrary, and thus some components may have much fewer data points.
However, it turns out that one can actually ignore any component with less than α/k weight if we want to solve density estimation up to total variation distance α. This will multiply the sample complexity terms d^{7/4}·k^{3/2}·√(log(1/δ))/ε + k²d, needed for crude approximation, by a factor of 1/α. Finally, the informal Theorem 2.1 is slightly inaccurate, and the accurate version of the theorem will end up adding the additional term k^{3/2}·log^{3/2}(1/δ)/(αε). Along with the final term O(d²k/(αε)) from the private hypothesis selection, these terms will exactly match Theorem 1.4.

2.4 Roadmap

In Appendix A, we note some general preliminary results. In Appendix B, we note some additional preliminaries on robust learning of a single Gaussian. In Appendix C, we discuss the robustness-to-privacy conversion and prove some volume arguments needed for Theorem 1.4. In Appendix D, we explain how to reduce to the crude approximation setting, using private hypothesis selection. In Appendix E, we design and analyze the algorithm for multivariate Gaussians, and prove Theorem 1.4. In Appendix F, we design and analyze the algorithm for univariate Gaussians, and prove Theorem 1.5. In Appendix G, we prove Theorem 1.6. Finally, Appendix H proves some auxiliary results that we state in Appendices B and C.

Limitations

Our results are theoretical guarantees on the sample complexity of privately learning mixtures of Gaussians. We do not provide any efficient or practical algorithms, and focus on statistical guarantees. We also do not discuss how to set the parameters and accuracy guarantees for practical applications; this is a question best left to practitioners. Finally, we assume each sample is drawn i.i.d. from a Gaussian mixture model distribution, though we remark that we use a robustness-to-privacy framework that will automatically make our algorithm robust to a roughly α/k fraction of corruptions.

Acknowledgements. Hassan Ashtiani was supported by an NSERC Discovery grant. Shyam Narayanan was supported by an NSF Graduate Fellowship and a Google Fellowship.

References

[AAK21] Ishaq Aden-Ali, Hassan Ashtiani, and Gautam Kamath. On the sample complexity of privately learning unbounded high-dimensional gaussians. In Algorithmic Learning Theory, pages 185-216. PMLR, 2021. [AAL21] Ishaq Aden-Ali, Hassan Ashtiani, and Christopher Liaw. Privately learning mixtures of axis-aligned gaussians. Advances in Neural Information Processing Systems, 34:3925-3938, 2021. [AAL23] Jamil Arbas, Hassan Ashtiani, and Christopher Liaw. Polynomial time and private learning of unbounded gaussian mixture models. In International Conference on Machine Learning, pages 1018-1040. PMLR, 2023. [AAL24a] Mohammad Afzali, Hassan Ashtiani, and Christopher Liaw. Agnostic private density estimation for GMMs via list global stability. arXiv e-prints, arXiv:2407, 2024. [AAL24b] Mohammad Afzali, Hassan Ashtiani, and Christopher Liaw. Mixtures of gaussians are privately learnable with a polynomial number of samples. In Algorithmic Learning Theory, pages 1-27. PMLR, 2024. [ABDH+20] Hassan Ashtiani, Shai Ben-David, Nicholas J. A. Harvey, Christopher Liaw, Abbas Mehrabian, and Yaniv Plan. Near-optimal sample complexity bounds for robust learning of gaussian mixtures via compression schemes. Journal of the ACM (JACM), 67(6):1-42, 2020. [ABH+18] Hassan Ashtiani, Shai Ben-David, Nicholas Harvey, Christopher Liaw, Abbas Mehrabian, and Yaniv Plan. Nearly tight sample complexity bounds for learning mixtures of gaussians via sample compression schemes.
Advances in Neural Information Processing Systems, 31, 2018. [ABM18] Hassan Ashtiani, Shai Ben-David, and Abbas Mehrabian. Sample-efficient learning of mixtures. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. [AD20a] Hilal Asi and John C Duchi. Instance-optimality in differential privacy via approximate inverse sensitivity mechanisms. Advances in neural information processing systems, 33:14106 14117, 2020. [AD20b] Hilal Asi and John C Duchi. Near instance-optimality in differential privacy. Co RR, abs/2005.10630, 2020. [ADLS17] Jayadev Acharya, Ilias Diakonikolas, Jerry Li, and Ludwig Schmidt. Sample-optimal density estimation in nearly-linear time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1278 1289. SIAM, 2017. [AKT+23] Daniel Alabi, Pravesh K Kothari, Pranay Tankala, Prayaag Venkat, and Fred Zhang. Privately estimating a gaussian: Efficient, robust, and optimal. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 483 496, 2023. [AL22] Hassan Ashtiani and Christopher Liaw. Private and polynomial time algorithms for learning gaussians and beyond. In Conference on Learning Theory, pages 1075 1076. PMLR, 2022. [AM05] Dimitris Achlioptas and Frank Mc Sherry. On spectral learning of mixtures of distributions. In International Conference on Computational Learning Theory, pages 458 469. Springer, 2005. [ASZ21] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. Differentially private assouad, fano, and le cam. In Algorithmic Learning Theory, pages 48 78. PMLR, 2021. [AUZ23] Hilal Asi, Jonathan Ullman, and Lydia Zakynthinou. From robustness to privacy and back. In International Conference on Machine Learning, pages 1121 1146. PMLR, 2023. [BBC+23] Shai Ben-David, Alex Bie, Clément L. Canonne, Gautam Kamath, and Vikrant Singhal. Private distribution learning with public data: The view from sample compression. In Advances in Neural Information Processing Systems (Neur IPS), 2023. [BDJ+22] Ainesh Bakshi, Ilias Diakonikolas, He Jia, Daniel M Kane, Pravesh K Kothari, and Santosh S Vempala. Robustly learning mixtures of k arbitrary gaussians. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, pages 1234 1247, 2022. [Bis06] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. [BKS22] Alex Bie, Gautam Kamath, and Vikrant Singhal. Private estimation with public data. Advances in Neural Information Processing Systems, 2022. [BRST21] Joan Bruna, Oded Regev, Min Jae Song, and Yi Tang. Continuous lwe. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 694 707, 2021. [BS10] Mikhail Belkin and Kaushik Sinha. Polynomial learning of distribution families. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 103 112. IEEE, 2010. [BSKW19] Mark Bun, Thomas Steinke, Gautam Kamath, and Zhiwei Steven Wu. Private hypothesis selection. Advances in Neural Information Processing Systems, 32, 2019. [BUV14] Mark Bun, Jonathan Ullman, and Salil Vadhan. Fingerprinting codes and the price of approximate differential privacy. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 1 10, 2014. [CCAd+23] Hongjie Chen, Vincent Cohen-Addad, Tommaso d Orsi, Alessandro Epasto, Jacob Imola, David Steurer, and Stefan Tiegel. Private estimation algorithms for stochastic block models and mixture models. ar Xiv preprint ar Xiv:2301.04822, 2023. 
[CDSS14] Siu-On Chan, Ilias Diakonikolas, Rocco A Servedio, and Xiaorui Sun. Efficient density estimation via piecewise polynomial approximation. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 604 613, 2014. [CKM+21] Edith Cohen, Haim Kaplan, Yishay Mansour, Uri Stemmer, and Eliad Tsfadia. Differentially-private clustering of easy instances. In International Conference on Machine Learning, pages 2049 2059. PMLR, 2021. [Das99] Sanjoy Dasgupta. Learning mixtures of gaussians. In 40th Annual Symposium on Foundations of Computer Science (Cat. No. 99CB37039), pages 634 644. IEEE, 1999. [DK22] Ilias Diakonikolas and Daniel M. Kane. Algorithmic High-Dimensional Robust Statistics. Cambridge University Press, 2022. [DKM+06] Cynthia Dwork, Krishnaram Kenthapadi, Frank Mc Sherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Advances in Cryptology-EUROCRYPT 2006: 24th Annual International Conference on the Theory and Applications of Cryptographic Techniques, St. Petersburg, Russia, May 28-June 1, 2006. Proceedings 25, pages 486 503. Springer, 2006. [DKS17] Ilias Diakonikolas, Daniel Kane, and Alistair Stewart. Statistical query lower bounds for robust estimation of high-dimensional gaussians and gaussian mixtures. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 17, pages 73 84, 2017. [DKY17] Bolin Ding, Janardhan Kulkarni, and Sergey Yekhanin. Collecting telemetry data privately. In Advances in Neural Information Processing Systems (Neur IPS), pages 3571 3580, 2017. [DLS+17] Aref N. Dajani, Amy D. Lauger, Phyllis E. Singer, Daniel Kifer, Jerome P. Reiter, Ashwin Machanavajjhala, Simson L. Garfinkel, Scot A. Dahl, Matthew Graham, Vishesh Karwa, Hang Kim, Philip Lelerc, Ian M. Schmutte, William N. Sexton, Lars Vilhuber, and John M. Abowd. The modernization of statistical disclosure limitation at the u.s. Census Bureau, 2017. In the September 2017 meeting of the Census Scientific Advisory Committee, 2017. [DMNS06] Cynthia Dwork, Frank Mc Sherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pages 265 284. Springer, 2006. [DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3 4):211 407, 2014. [EPK14] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 1054 1067, 2014. [GDGK20] Quan Geng, Wei Ding, Ruiqi Guo, and Sanjiv Kumar. Tight analysis of privacy and utility tradeoff in approximate differential privacy. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 108 of Proceedings of Machine Learning Research, pages 89 99. PMLR, 2020. [GVV22] Aparna Gupte, Neekon Vafa, and Vinod Vaikuntanathan. Continuous LWE is as hard as LWE & applications to learning gaussian mixtures. In 63rd IEEE Annual Symposium on Foundations of Computer Science, FOCS 2022, Denver, CO, USA, October 31 - November 3, 2022, pages 1162 1173. IEEE, 2022. [HKMN23] Samuel B Hopkins, Gautam Kamath, Mahbod Majid, and Shyam Narayanan. Robustness implies privacy in statistical estimation. In Symposium on Theory of Computing, pages 497 506, 2023. 
[KMS22a] Gautam Kamath, Argyris Mouzakis, and Vikrant Singhal. New lower bounds for private estimation and a generalized fingerprinting lemma. In Advances in Neural Information Processing Systems 35, Neur IPS 22, 2022. [KMS+22b] Gautam Kamath, Argyris Mouzakis, Vikrant Singhal, Thomas Steinke, and Jonathan Ullman. A private and computationally-efficient estimator for unbounded gaussians. In Conference on Learning Theory, pages 544 572. PMLR, 2022. [KMV22] Pravesh Kothari, Pasin Manurangsi, and Ameya Velingker. Private robust estimation by stabilizing convex relaxations. In Conference on Learning Theory, pages 723 777. PMLR, 2022. [KSSU19] Gautam Kamath, Or Sheffet, Vikrant Singhal, and Jonathan Ullman. Differentially private algorithms for learning mixtures of separated gaussians. Advances in Neural Information Processing Systems, 32, 2019. [KV18] Vishesh Karwa and Salil Vadhan. Finite sample differentially private confidence intervals. In Proceedings of the 9th Conference on Innovations in Theoretical Computer Science, ITCS 18, pages 44:1 44:9, Dagstuhl, Germany, 2018. Schloss Dagstuhl Leibniz-Zentrum fuer Informatik. [LKWB22] Ta-Chun Liu, Peter N Kalugin, Jennifer L Wilding, and Walter F Bodmer. Gmmchi: gene expression clustering using gaussian mixture modeling. BMC bioinformatics, 23(1):457, 2022. [LL22] Allen Liu and Jerry Li. Clustering mixtures with almost optimal separation in polynomial time. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, pages 1248 1261, 2022. [LM21] Allen Liu and Ankur Moitra. Settling the robust learnability of mixtures of gaussians. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 518 531, 2021. [MH08] Arakaparampil M Mathai and Hans J Haubold. Special functions for applied scientists, volume 4. Springer, 2008. [MV10] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of gaussians. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 93 102. IEEE, 2010. [Nar23] Shyam Narayanan. Better and simpler lower bounds for differentially private statistical estimation. ar Xiv preprint ar Xiv:2310.06289, 2023. [NRS07] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, pages 75 84, 2007. [Pea94] Karl Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London, 1894. [PH24] Victor S Portella and Nick Harvey. Lower bounds for private estimation of gaussian covariance matrices under all reasonable parameter regimes. ar Xiv preprint ar Xiv:2404.17714, 2024. [Tea17] Apple Differential Privacy Team. Learning with privacy at scale. Apple Machine Learning Journal, 1(8), 2017. [VW04] Santosh Vempala and Grant Wang. A spectral algorithm for learning mixture models. Journal of Computer and System Sciences, 68(4):841 860, 2004. [ZSM+18] Bo Zong, Qi Song, Martin Renqiang Min, Wei Cheng, Cristian Lumezanu, Daeki Cho, and Haifeng Chen. Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In International conference on learning representations, 2018. A Preliminaries A.1 Differential Privacy We state the advanced composition theorem of differential privacy. Theorem A.1 (Advanced Composition Theorem [DR14, Theorem 3.20]). Let ε, δ, δ 0 be arbitrary parameters. Let A1, . . . , Ak be algorithms on a dataset X, where each Ai is (ε, δ)-DP. 
Then, the concatenation A(X) = (A1(X), . . . , Ak(X)) is q δ ε + kε (eε 1), k δ + δ -DP. Moreover, this holds even if the algorithms Ai are adaptive. By this, we mean that for all i 2 the algorithm Ai is allowed to depend on A1(X), . . . , Ai 1(X). However, privacy must still hold for Ai, conditioned on the previous outputs A1(X), . . . , Ai 1(X). Next, we note the properties of the Truncated Laplace distribution and mechanism. Definition A.2 (Truncated Laplace Distribution). For , ε, δ > 0, the Truncated Laplace Distribution TLap( , ε, δ) is the distribution with PDF proportional to e |x| ε/ on the region [ A, A], where A = ε log 1 + eε 1 2δ , and PDF 0 outside the region [ A, A]. Lemma A.3 (Truncated Laplace Mechanism [GDGK20, Theorem 1]). Let f : X n R be a realvalued function, and let > 0, such that for any neighboring datasets X, X , |f(X) f(X )| . Then, the mechanism A that outputs f(X) + TLap( , ε, δ) is (ε, δ)-DP. A.2 Matrix and Concentration Bounds In this section, we note some standard but useful lemmas. We first note the Courant-Fischer theorem. Theorem A.4 (Courant-Fischer). Let A Rd d be a real symmetric matrix, with eigenvalues λ1 λ2 λd. Then, for each k, λk = max V Rd dim(V )=k min x V x 2=1 x Ax = min V Rd dim(V )=d k+1 max x V x 2=1 x Ax, where V refers to a linear subspace of Rd. By setting A = J J, we have the following corollary. Corollary A.5. Let J Rd d be a real (possibly asymmetric) matrix with singular values σ1 σ2 σd. Then, for each k, σk = max V Rd dim(V )=k min x V x 2=1 Jx 2 = min V Rd dim(V )=d k+1 max x V x 2=1 Jx 2. Next, we note a basic proposition. Proposition A.6. [HKMN23, Lemma 6.8] Let M Rd d be a real symmetric matrix, and J Rd d be any real-valued (but not necessarily symmetric) matrix such that JJ I op ϕ 1, for some parameter ϕ. Then, J MJ 2 F (1 + 3ϕ) M 2 F . We also note the Hanson-Wright inequality. Lemma A.7 (Hanson-Wright). Given a d d matrix M Rd d and a d-dimensional Gaussian vector X N(0, I), for any t 0, P X MX E[X MX] t 2 exp c min t2 M 2 F , t M op for some universal constant c > 0. Finally, we note a folklore simple characterization of total variation distance (see also [AAL23, Theorem 1.8]). Lemma A.8. For any µ1, µ2 Rd and positive definite Σ1, Σ2 Rd d, d TV(N(µ1, Σ1), N(µ2, Σ2)) 1 2 max Σ 1/2 1 Σ2Σ 1/2 1 I F , Σ 1/2 1 (µ1 µ2) 2 . B Robust Estimation of a Single Gaussian In this section, we establish what a robust algorithm can do for learning a single high-dimensional Gaussian. We will state the accuracy guarantees in terms of the following definition, which combines Mahalanobis distance, spectral distance, and Frobenius distance. Definition B.1. Given mean-covariance pairs (µ, Σ) and (ˆµ, ˆΣ), and given parameters γ, ρ, τ > 0, we say that (ˆµ, ˆΣ) γ,ρ,τ (µ, Σ) if Σ 1/2 ˆΣΣ 1/2 I op γ. (Spectral distance) Σ 1/2 ˆΣΣ 1/2 I F ρ. (Frobenius distance) Σ 1/2(ˆµ µ) 2 τ. (Mahalanobis distance) We will also use γ,τ as a shorthand for γ, d γ,τ, i.e., there is no additional Frobenius norm condition beyond what is already imposed by the operator norm condition. Although γ,ρ,τ is neither symmetric nor transitive, symmetry and transitivity happen in an approximate sense. Namely, the following result holds: we defer the proof to Appendix H. We remark that similar results are known (e.g., [AL22, Lemma 3.2], [AAL23, Lemma 4.1]). Proposition B.2 (Approximate Symmetry and Transitivity). Fix any γ, ρ, τ > 0 such that γ 0.1. 
Then, for any (µ1, Σ1) γ,ρ,τ (µ2, Σ2), we have that (µ2, Σ2) 2γ,2ρ,2τ (µ1, Σ1), and for any (µ1, Σ1) γ,ρ,τ (µ2, Σ2) and (µ2, Σ2) γ,ρ,τ (µ3, Σ3), we have that (µ1, Σ1) 4γ,4ρ,4τ (µ3, Σ3). We note the following theorem about the accuracy of robustly learning a single Gaussian. While this result will roughly follow from known results and techniques in the literature, we were unable to find a formal statement of the theorem, so we prove it in Appendix H. Theorem B.3. For some universal constants c0 (0, 0.01) and C0 > 1, the following holds. Fix any η γ c0, β < 1, and ρ such that e O(η) ρ c0 d. Let n e O d+log(1/β) γ2 + (d+log(1/β))2 ρ2 . Then, there exists an (inefficient, deterministic) algorithm A0 with the following property. For any µ, Σ, with probability at least 1 β over X = {X1, . . . , Xn} N(µ, Σ), and for all datasets X with d H(X, X ) η n, we have A0(X ) C0γ,C0ρ,C0γ (µ, Σ). We remark that A0 may have knowledge of η, γ, ρ, β, but does not have knowledge of µ, Σ, or the uncorrupted data X. B.1 Directional bounds In this section, we note that, when given i.i.d. samples from a Gaussian, with high probability one cannot choose a reasonably large subset with an extremely different empirical mean or covariance. First, we note the following proposition, for which the proof follows by an ε-net type argument. Proposition B.4. Let n n 20d, and L C1n2 for a sufficiently large constant C1. Suppose that some data points X = {X1, . . . , Xn} are i.i.d. sampled from N(0, I). Then, with probability at least 1 n Ω(n ), the following both hold. 1. For all Xi X, Xi 2 L. 2. For all subsets Y X of size n , there does not exist any real number r and any unit vector v such that | Xi, v r| 1 L for all Xi Y. Proof. If Xi 2 L, then X MX L2 for M the d d identity matrix I. However, E[X MX] = d. So, by Lemma A.7, the probability of this event is at most 2e c min(n4/d,n2) 2e c n2 n Ω(n ). Next, we bound the probability that the first item holds but the second item doesn t hold. First, we can create a 1 L2 -net of unit vectors v , of size (3L2)d L3d, and a 1 L-net of real numbers r from [ L, L], of size 2L2. Now, suppose that the first item holds, but there is some unit vector v and real number r with | Xi, v r| 1 L for all Xi Y, for some Y X of size n . In this case, Xi, v Xi 2 L for all Xi X, so we can assume that |r| L. Thus, if v is the closest unit vector to v in the net and r is the closest real number to r in the net, then | Xi, v r | | Xi, v r| + | Xi, v v | + |r r | 1 L + Xi 2 1 L2 + 1 In other words, if the first event holds and the second does not, there exists v , r in the net and Y X of size n , such that | Xi, v r | 3 L for all Xi Y. We can now bound this via a union bound. To perform the union bound, we first bound the probability of this event holding for some fixed v , r , Y. For any fixed v , r , Y, note that Xi, v Xi Y is just n i.i.d. copies of N(0, 1). Let s call these values z1, . . . , zn . If there is some r such that |zi r | 3 L for all i, then |zi z1| 6 L for all i. Since the PDF of a N(0, 1) is uniformly bounded by at most 1 2, for any fixed z1, the probablity that any other zi is within 6 L of z1 is at most 6 L, so the overall probability is at most (6/L)n 1. Now, the union bound is done over n n choices of Y, at most L3d choices of v , and at most 2L2 choices of Y. Thus, the overall probability is at most (6/L)n 1 nn L3d 2L2 (6n)n /Ln 3 3d (6n/L0.7)n . Thus, as long as L 100n2, this is much smaller than n Ω(n ). By shifting and scaling appropriately, we have the following corollary. 
Corollary B.5. Let n n 20d, and L C1n2, for C1 the same constant as in Proposition B.4. Suppose that some data points X = {X1, . . . , Xn} are drawn from N(µ, Σ). Then, with probability at least 1 n Ω(n ), the following both hold. 1. For all Xi X, Σ 1/2(Xi µ) 2 L. 2. For all subsets Y X of size n , there does not exist any real number r and any nonzero vector v such that | Xi, v r| 1 L Σ1/2v 2 for all Xi Y. B.2 Modified robust algorithm We can modify the robust algorithm of Theorem B.3 to have the following guarantees. Lemma B.6. Let c0, C0 be as in Theorem B.3, and C1 be as in Proposition B.4. Let L := C1 n2. Also, fix any η γ c0, any e d β < 1, and and any ρ such that e O(η) ρ c0 d. Finally, suppose n m e O d γ2 + d2 Then, there exists an (inefficient, deterministic) algorithm A that, on a dataset Y of size m, outputs either some (ˆµ, ˆΣ) or , with the following properties. 1. For any datasets Y, Y with d H(Y, Y ) η m, if A(Y) = (ˆµ, ˆΣ) = and A(Y ) = (ˆµ , ˆΣ ) = , then (ˆµ , ˆΣ ) 8C0γ,8C0ρ,8C0γ (ˆµ, ˆΣ). 2. For any fixed µ, Σ, with probability at least 1 O(β) over Y1, . . . , Ym N(µ, Σ), A(Y) = and A(Y) C0γ,C0ρ,C0γ (µ, Σ). 3. Fix an integer k 1, and additionally suppose that m 40d k. Suppose that X = {X1, . . . , Xn} are drawn i.i.d. from some mixture of k Gaussians (µi, Σi). Then, with probability at least 1 O(β) over X, for every Y X of size m and every Y with d H(Y, Y ) m/2, either A(Y) = , or A(Y) = (ˆµ, ˆΣ), where there is some i [k] such that 1 9L4 Σi ˆΣ 9L4 Σi and Σ 1/2 i (ˆµ µi) 2 10L3. As in Theorem B.3, A has knowledge of η, γ, ρ, β, but not µ or Σ. Proof. The algorithm A works as follows. Start off by computing A0(Y) = (ˆµ, ˆΣ), where A0 is the algorithm of Theorem B.3. We can run the algorithm on any arbitrary dataset, even if it didn t come from Gaussian samples, and we can assume that the algorithm A0 always outputs some mean-covariance pair (for instance, by having it output (0, I) by default instead of ). Next, A, on a dataset Y, checks if any of the following three conditions hold. (a) There exists Y such that d H(Y, Y ) η m, and A0(Y ) 8C0γ,8C0ρ,8C0γ A0(Y). (b) There exists Yi Y such that ˆΣ 1/2(Yi ˆµ) 2 > 3L. (c) There exists a subset Z Y of size 20d, and a unit vector v and real number r, such that for all Yi Z, | Yi, v r| < 1 2L ˆΣ1/2v 2. If any of these conditions hold, the output is A(Y) = . Otherwise, the output is A(Y) = A0(Y). We now verify that the algorithm satisfies the required properties. Property 1 clearly holds, because if there exist datasets Y, Y with d H(Y, Y ) η m, A(Y) = and A(Y ) = , and A(Y) 8C0γ,8C0ρ,8C0γ A(Y ), then we would have in fact set A(Y) to . For Property 2, note that with probability 1 O(β), A0(Y) satisfies the desired property by Theorem B.3, since β e d so d + log 1/β = O(d). So, we just need to make sure that we don t set A(Y) to be . Since Y1, . . . , Ym N(µ, Σ), then with probability 1 O(β), both Y and any Y with d H(Y, Y ) η n satisfy the conditions of Theorem B.3. In other words, A0(Y) C0γ,C0ρ,C0γ and A0(Y ) C0γ,C0ρ,C0γ. So, by Proposition B.2, A0(Y ) 8C0γ,8C0ρ,8C0γ A0(Y) for any Y with d H(Y, Y ) η n. Thus, condition a) is not met. Moreover, by Corollary B.5, with failure probability at most m Ω(20d) β, Σ 1/2(Yi µ) 2 L for all Yi Y, and for all Z Y of size 20d, there does not exist a real r and a nonzero vector v such that | Yi, v r| 1 L Σ1/2v 2 for all Yi Y. Now, we know that by Theorem B.3, (ˆµ, ˆΣ) 0.5,0.5 (µ, Σ). 
So, ˆΣ 1/2(Yi ˆµ) 2 2 Σ 1/2(Yi ˆµ) 2 2( Σ 1/2(Yi µ) + 0.5) = 2L + 1 3L, and if | Yi, v r| 1 2L ˆΣ1/2v 2, then | Yi, v r| 1 L Σ1/2v 2, for all Yi Y. Thus, conditions b) and c) are also not met, so Property 2 holds. For Property 3, note that if m 40d k, then |Y X| 20d k, i.e., Y contains at least 20dk uncorrupted points. So, by the Pigeonhole Principle, for every possible Y, there is some index i [k] such that at least 20d points in Y X come from the ith Gaussian component. Now, let us condition on the event that Corollary B.5 holds for X, where n = 20d, which happens with at least 1 n Ω(n ) 1 β probability. We claim that for every possible Y, and for an i [k] such that 20d points Z Y X came from the ith Gaussian component, then either A(Y) = or if A(Y) = (ˆµ, ˆΣ) then 1 9L4 Σi ˆΣ 9L4 Σi and Σ 1/2 i (ˆµ µi) 2 10L3. First, we verify that 1 9L4 Σi ˆΣ 9L4 Σi. Otherwise, there exists a unit vector v such that either v ˆΣv 9L4 v Σiv or v ˆΣv 1 9L4 v Σiv. In the former case, by Part 1 of Corollary B.5, for all Yi Z, Σ 1/2 i (Yi µi) 2 L, so | v, Yi µi | = | Σ1/2 i v, Σ 1/2 i (Yi µi) | L Σ1/2 i v 2 = L p 3L ˆΣ1/2v 2. Thus, if we set r = v, µi , then | Yi, v r| < 1 2L ˆΣ1/2v 2 for all Yi Z, so the algorithm A would have output due to condition c). Alternatively, if there exists a unit vector v such that v ˆΣv 1 9L4 v Σiv, then by condition b), ˆΣ 1/2(Yi ˆµ) 2 3L for all Yi Z, so | v, Yi ˆµ | = | ˆΣ1/2v, ˆΣ 1/2(Yi µ) | 3L ˆΣ1/2v 2 1 L Σ1/2 i v 2. So, there exists r = v, ˆµ such that | v, Yi r| 1 v Σiv which is impossible by Part 2 of Corollary B.5 Next, we verify that Σ 1/2 i (ˆµ µi) 2 10L3. By Triangle inequality, for every Yi Z, Σ 1/2 i (Yi ˆµ) 2 + Σ 1/2 i (Yi µi) 2 Σ 1/2 i (µi ˆµ) 2. But, Σ 1/2 i (Yi µi) 2 L by Part 1 of Corollary B.5, and Σ 1/2 i (Yi ˆµ) 2 3L2 ˆΣ 1/2(Yi ˆµ) 2 3L2 3L 9L3, because we just proved that 9L4Σi ˆΣ and assuming we do not reject due to condition b). Thus, Σ 1/2 i (µi ˆµ) 2 9L3 + L 10L3. C Volume and the Robustness-to-Privacy Conversion In this section, we explain the robustness-to-privacy conversion [HKMN23] that we will utilize in proving Theorem 1.4. We will also need some relevant results about computing volume, which will be important in the robustness-to-privacy conversion. C.1 Normalized Volume Given a pair (µ, Σ), where µ Rd and Σ Rd d is positive definite, we will define Proj(µ, Σ) Rd (d+3)/2 to represent the coordinates of µ along with the upper-diagonal coordinates of Σ. For any set Ωof mean-covariance pairs (µ, Σ), define Proj(Ω) := {Proj(µ, Σ) : (µ, Σ) Ω}. Because Σ is symmetric, Proj(µ, Σ) fully encodes the information about µ and Σ. We also define vol(Ω) to be the Lebesgue measure of Proj(Ω), i.e., the d(d+3) 2 -dimensional measure of all points Proj(µ, Σ) for (µ, Σ) Ω. Next, we will define the normalized volume voln(Ω) := Z 1 (det Σ)(d+2)/2 dθ, (1) where θ = Proj(µ, Σ), and we take a Lebesgue integral over θ Proj(Ω). To motivate this choice of normalized volume, we will see that the volume is invariant under some basic transformations. Given a function f : RD RD, for some integer D 1 we recall that the Jacobian J at some x is the matrix with Jij = fi xj (x). For a function that takes a symmetric matrix Σ and outputs the symmetric matrix AΣA , we view D = d(d+1) 2 and the function Σ 7 AΣA as a function f from RD RD by taking the upper triangular part of both Σ and AΣA . Note that f is a linear function (thus f(x) = J x where J RD D, and moreover, the following fact is well-known. Lemma C.1. [MH08, Theorem 11.1.5] For J as defined above, det(J) = det(A)d+1. 
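Lemma C.1 can be checked numerically. The following is a minimal sketch of our own (the paper provides no implementation; the choice of $d$ and $A$ is arbitrary): it assembles the matrix of the linear map $\Sigma \mapsto A\Sigma A^\top$ in upper-triangular coordinates and compares its determinant with $\det(A)^{d+1}$.

```python
import numpy as np

def upper_tri(S):
    """Flatten the upper-triangular part (including diagonal) of a symmetric matrix."""
    return S[np.triu_indices(S.shape[0])]

def jacobian_of_congruence(A):
    """Matrix of the linear map Sigma -> A Sigma A^T in upper-triangular coordinates."""
    d = A.shape[0]
    D = d * (d + 1) // 2
    J = np.zeros((D, D))
    for col, (i, j) in enumerate(zip(*np.triu_indices(d))):
        # Basis symmetric matrix with a 1 in entries (i, j) and (j, i).
        E = np.zeros((d, d))
        E[i, j] = 1.0
        E[j, i] = 1.0
        J[:, col] = upper_tri(A @ E @ A.T)
    return J

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))
J = jacobian_of_congruence(A)
# Lemma C.1: |det J| = |det A|^(d+1).
print(abs(np.linalg.det(J)), abs(np.linalg.det(A)) ** (d + 1))
```

The two printed values agree up to floating-point error, matching $\det(J) = \det(A)^{d+1}$.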
Now, for any fixed µ Rd and positive definite Σ Rd d, consider the transformation (ˆµ, ˆΣ) 7 (Σ1/2 ˆΣΣ1/2, Σ1/2ˆµ + µ), viewed as a linear map g from Proj(ˆµ, ˆΣ) Proj(Σ1/2ˆµ + µ, Σ1/2 ˆΣΣ1/2). By setting A := Σ1/2, note that the map g behaves like f on the last d(d+1) 2 coordinates, and on the first d coordinates, it is simply an affine map ˆµ 7 Σ1/2ˆµ + µ. Therefore, the overall linear map g has determinant det(Σ1/2) det(Σ1/2)d+1 = (det Σ)(d+2)/2. From this, we can infer the following. Lemma C.2. Fix any µ Rd and positive definite Σ Rd d. Let h be the map (ˆµ, ˆΣ) 7 (Σ1/2ˆµ + µ, Σ1/2 ˆΣΣ1/2). Then, for any set S of mean-covariance pairs, the normalized volume of S equals the normalized volume of h(S). Proof. Let g be the corresponding map from Proj(ˆµ, ˆΣ) to Proj(Σ1/2ˆµ + µ, Σ1/2 ˆΣΣ1/2). Using a simple integration by substitution, we have voln(h(S)) = Z ˆθ=Proj(ˆµ,ˆΣ) Proj(h(S)) (det ˆΣ)(d+2)/2 dˆθ ˆθ g(Proj(S)) (det ˆΣ)(d+2)/2 dˆθ ˆθ=Proj(ˆµ,ˆΣ) Proj(S) (det(Σ1/2 ˆΣΣ1/2))(d+2)/2 (det g)dˆθ (det ˆΣ)(d+2)/2 dˆθ = voln(S). C.2 Robustness to Privacy Conversion We note the following restatement of (the inefficient, approx-DP version of) the main robustness-toprivacy conversion of [HKMN23]. In the following theorem, we will think of the parameter space as lying in RD, with some normalized volume voln(Ω) = R θ Ωp(θ)dθ, where p 0 is some nonnegative Lebesgue-measurable function and the integration is Lebesgue integration. In our application, we will think of θ = Proj(µ, Σ), and p(θ) = (det Σ) (d+2)/2, to match with (1). Theorem C.3. Restatement of [HKMN23, Lemma 4.2] Let 0 < η < 0.1 and 10η η 1 be fixed parameters. Also, fix privacy parameters ε0, δ0 < 1 and confidence parameter β0 < 1. Let S = S(θ, X) R 0 be a score function that takes as input a dataset X = {X1, . . . , Xn} and a parameter θ Θ RD. For any dataset X and any 0 η 1, let Vη (X) be the D-dimensional normalized volume of points θ Θ with score at most η n, i.e., Vη (X) = voln({θ Θ : S(θ, X) η n}). Suppose the following properties hold: (Bounded Sensitivity) For any two adjacent datasets X, X and any θ Θ, |S(θ, X) S(θ, X )| 1. (Volume) For some universal constant C, and for any X of size n such that there exists θ with S(θ, X) 0.7η n, n C log(Vη (X)/V0.8η (X))+log(1/δ0) Then, there exists an (ε0, δ0)-DP algorithm M, that takes as input X and outputs either some θ Θ or , such that for any dataset X, if n C max η :η η η log(Vη (X)/Vη(X))+log(1/(β0 η)) ε0 η , then M(X) outputs some θ Θ of score at most 2ηn with probability 1 β0. We remark that the algorithm M is allowed prior knowledge of n, D, η, η , ε0, δ0, β0, as well as the domain Θ, function p(θ) that dictates the normalized volume, and score function S. We remark that the original result in [HKMN23] assumes the volumes are unnormalized, but the result immediately generalizes to normalized volumes. To see why, consider a modified domain Θ RD+1, where θ = (θ, z) Θ if and only if θ Θ and 0 z p(θ). Also, consider a modified score function S acting on Θ , where S ((θ, z), X) := S(θ, X) for for any (θ, z) Θ . Then, note that the unnormalized volume of Θ , by Fubini s theorem, is precisely RR (θ,z) Θ 1dzdθ = R θ Θ p(θ)dθ, which is precisely the normalized volume of corresponding to the new Θ precisely match the normalized volumes corresponding to the old Θ. A similar calculation will give us that the unnormalized volume of points (θ, z) in Θ with S ((θ, z), X) t equals the normalized volume of points θ Θ with S(θ, X) t, for any t. 
Thus, to privately find a parameter θ Θ of low score with respect to S (assuming the normalized volume constraints hold), can create Θ and S , and apply the unnormalized version of Theorem C.3 to obtain some (θ, z). Also, if S ((θ, z), X) 2ηn, then S(θ) = S ((θ, z), X) 2ηn, and if outputting (θ, z) is (ε, δ)-DP, then so is simply outputting θ. C.3 Computing Normalized Volume Ratios Lemma C.4. Let M Rd d be symmetric with M op 0.1, and let 1 ν 2 be a scaling parameter. Then, ν d det(I + νM) det(I + M) νd. Proof. If the eigenvalues of M are λ1, . . . , λd [ 0.1, 0.1], then det(I + νM) det(I + M) = Note that for x [ 0.1, 0.1] and ν 1, 1+ν x 1+x is an increasing function. Therefore, where we used the fact that 1 ν 2. We note an important lemmas about the normalized volumes of certain sets. Lemma C.5. Fix some γ, τ 0.1 and some scaling parameters 1 ν1 2 and 1 ν2. Let R1 be the set of (µ, Σ) γ,τ (0, I) and R2 be the set of (µ, Σ) ν1 γ,ν2 τ (0, I). Then, the ratio of normalized volumes voln(R2) voln(R1) ν2d2 1 νd 2 Proof. By definition of normalized volume, we have voln(R2) = ZZ M op ν1 γ, µ 2 ν2 τ 1 det(I + M)(d+2)/2 dµd M, where we are slightly abusing notation because we are truly integrating along the upper-triangular part of M. (Overall, the integral is d(d+3) 2 -dimensional.) We have a similar expression for voln(R1). Let us consider the map sending (µ, I + M) 7 (ν2 µ, I + ν1 M). This is a linear map, and for µ 2 γ and symmetric M op τ, this is a bijective map from R1 to R2. Therefore, by an integration by substitution, we have voln(R2) = ZZ M op γ,|µ 2 τ 1 det(I + ν1 M)(d+2)/2 νd(d+1)/2 1 νd 2dµd M νd(d+1)/2 1 νd 2 (ν d) (d+2)/2 ZZ M op γ,|µ 2 τ 1 det(I + M)(d+2)/2 dµd M = ν2d2 1 νd 2 voln(R1). Above, the first line is integration by substitution, and the second line uses Lemma C.4. Next, we note the following bound on the size of a net of matrices with small Frobenius norm. Lemma C.6. Fix some γ 0.1 and 1 d. Let R3 be the set of symmetric matrices M Rd d with M op γ and M F ρ. Then, for any 1 ℓ d, there exists a net B of size at most e O(ℓ d log d) such that for any M R3, there exists B B such that M B op 2ρ Proof. First, note that the unit sphere in Rd has a 1 d10 -net (in Euclidean distance) of size e O(d log d). Let B0 be this net. The net B will be the set of matrices Pℓ i=1 κiwiw i , where every wi B0, every κi is an integral multiple of 1 d10 , and every |κi| γ. The cardinality of B is at most |B0|ℓ (d10)ℓ e O(ℓ d log d). Now, we show that B is actually a net. For any matrix M R3, we can write M = Pd i=1 λiviv i , where λi, vi are the eigenvalues and eigenvectors, respectively, of M. Assume the eigenvalues are sorted so that |λi| are in decreasing order. Now, since Pd i=1 λ2 i ρ2, which means that |λi| ρ ℓ for all i > ℓ. Now, let B1 = P i>ℓλiviv i : since the vi s are orthogonal and |λi| ρ ℓ, this means that B1 op ℓ. Also, note that M = B1 + Pℓ i=1 λiviv i . Suppose we replace each λi with κi by rounding to the nearest multiple of 1/d10, and replace each vi with wi B0 such that vi wi 2 1/d10. Then, Pℓ i=1 κiwiw i is in B, and by Triangle inequality, we have i=1 λiviv i i=1 κiwiw i op ℓ (|λi κi| + 2 λi vi wi 2) O 1 Thus, M Pℓ i=1 κiwiw i op 2 ρ We also note the following lemma. We defer the proof to Appendix H. Lemma C.7. Fix any µ and positive definite Σ. Then, for any µ1, µ2, Σ1, Σ2, we have that (µ1, Σ1) γ,ρ,τ (µ2, Σ2) if and only if (Σ1/2µ1 + µ, Σ1/2Σ1Σ1/2) γ,ρ,τ (Σ1/2µ2 + µ, Σ1/2Σ2Σ1/2). 
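Lemma C.7 says that the relation $\approx_{\gamma,\rho,\tau}$ of Definition B.1 is preserved by the affine reparametrization $(\mu', \Sigma') \mapsto (\Sigma^{1/2}\mu' + \mu,\ \Sigma^{1/2}\Sigma'\Sigma^{1/2})$. The sketch below (our own helper names; a numerical illustration rather than a proof) implements the relation and checks this invariance on random instances.

```python
import numpy as np

def sqrtm(S, power=0.5):
    """Symmetric matrix power of a positive definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** power) @ V.T

def is_close(mu1, S1, mu2, S2, gamma, rho, tau):
    """Check (mu1, S1) ~_{gamma, rho, tau} (mu2, S2) as in Definition B.1."""
    W = sqrtm(S2, power=-0.5)                  # Sigma_2^{-1/2}
    M = W @ S1 @ W - np.eye(len(mu1))
    return (np.linalg.norm(M, 2) <= gamma       # spectral (operator norm)
            and np.linalg.norm(M, 'fro') <= rho # Frobenius
            and np.linalg.norm(W @ (mu1 - mu2)) <= tau)  # Mahalanobis

rng = np.random.default_rng(1)
d = 3
def random_pd():
    B = rng.standard_normal((d, d))
    return B @ B.T + d * np.eye(d)

mu1, mu2, mu = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d)
S1, S2, S = random_pd(), random_pd(), random_pd()
R = sqrtm(S)                                    # Sigma^{1/2}
for (g, r, t) in [(0.5, 1.0, 0.5), (2.0, 4.0, 2.0), (5.0, 10.0, 5.0)]:
    lhs = is_close(mu1, S1, mu2, S2, g, r, t)
    rhs = is_close(R @ mu1 + mu, R @ S1 @ R, R @ mu2 + mu, R @ S2 @ R, g, r, t)
    print(lhs == rhs)  # Lemma C.7 predicts True for every threshold triple
```

Each printed comparison is True: the whitened matrices on the two sides have identical eigenvalues, and the Mahalanobis terms coincide exactly.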
Our main lemma in this subsection is the following, which roughly bounds the normalized volume of a Frobenius norm ball fattened by an operator norm ball. Lemma C.8. Let c1 (0, 0.01) be a sufficiently small universal constant. Fix any µ and positive definite Σ. Fix some parameters γ1, γ2 ( 1 d, c1) and ρ2 ( 1 d, γ3 1 1000 d). Let T1(µ, Σ) represent the set of {(µ1, Σ1)} such that (µ1, Σ1) γ1,γ1 (µ, Σ). Let T2(µ, Σ) represent the set of {(µ2, Σ2)} such that (µ2, Σ2) γ1,γ1 (µ1, Σ1) for some (µ1, Σ1) γ2,ρ2,γ2 (µ, Σ). Then, voln(T2(µ, Σ)) voln(T1(µ, Σ)) e O(d5/3 log d ρ2/3 2 /γ2 1). Furthermore, both voln(T1(µ, Σ)) and voln(T2(µ, Σ)) do not change even if we change µ, Σ. Proof. For now, we also assume that µ = 0 and Σ = I. Let 1 ℓ d be decided later, and let B be the net from Lemma C.6, where we set γ = γ2 and ρ = ρ2. Let B1 be a 1 d10 -net (in Euclidean distance) over the d-dimensional unit ball, i.e., over µ with µ 2 1. Note that |B1| e O(d log d), and recall that |B| e O(ℓ d log d). Now, for any (µ2, Σ2) T2(0, I), we can write (µ2, Σ2) γ1,γ1 (µ1, Σ1), where Σ1 I op γ2, Σ1 I F ρ2, and µ1 2 γ2. Now, we can choose some µ B1 with µ1 µ 2 d 10 and B B such that (Σ1 I) B op 2ρ2 What can we say about the relationship between (µ2, Σ2) and (µ , I + B)? First, note that because (µ2, Σ2) γ1,γ1 (µ1, Σ1), we have (1 γ1) Σ1 Σ2 (1+γ1) Σ1. Next, since Σ1 (I+B) op = (Σ1 I) B op 2ρ2 ℓ, and since Σ1 has all eigenvalues between 1 γ2 1 2 and 1 + γ2 2, this means 1 4ρ2 Σ1 (I + B) 1 + 4ρ2 Σ1. Thus, if 4ρ < ℓ, we have that 1 γ1 1 + (4ρ2/ ℓ) (I + B) Σ1 1 + γ1 1 (4ρ2/ ℓ) (I + B). Next, because (µ2, Σ2) γ1,γ1 (µ1, Σ1), we have Σ 1/2 1 (µ2 µ1) 2 γ1, which means µ2 µ1 2 2γ1. Thus, because µ1 µ 2 d 10 γ1, this means that µ2 µ 2 3γ1. Moreover, assuming 10ρ ℓ, the eigenvalues of I + B are at least (1 γ1) 1 4ρ2 4, which means that (I + B) 1/2(µ2 µ ) 2 6γ1. In summary, if 10ρ ℓ, then 1 γ1 1 + (4ρ2/ ℓ) (I + B) Σ2 1 + γ1 1 (4ρ2/ ℓ) (I + B), (I + B) 1/2(µ2 µ ) 2 6γ1. Note that if γ1 0.1 and 10ρ ℓ, then by a simple calculation, 1+γ1 1 (4ρ2/ ℓ) 1 + γ1 + 10ρ2 ℓ and 1 γ1 1+(4ρ2/ ℓ) 1 γ1 4ρ2 ℓ. This implies that (µ2, Σ2) γ1+10ρ2/ ℓ,6γ1 (µ , I + B), for some µ B1 and B B. Note that this holds for any (µ2, Σ2) T2(0, I). Therefore, recalling that |B1| e O(d log d), and |B| e O(ℓ d log d) we can cover T2(0, I) with at most e O(ℓ d log d) regions, each of which is the set of (µ2, Σ2) γ1+10ρ2/ ℓ,6γ1 (µ , Σ ). Now, by Lemma C.7, (µ2, Σ2) γ1+10ρ2/ ℓ,6γ1 (µ , Σ ) if and only if µ2 = Σ 1/2µ0 + µ and Σ2 = Σ 1/2Σ0Σ 1/2, where (µ0, Σ0) γ1+10ρ2/ ℓ,6γ1 (0, I). By Lemma C.2, this means that the volume of {(µ2, Σ2) : (µ2, Σ2) γ1+10ρ2/ ℓ,6γ1 (µ , Σ )} equals the volume of {(µ0, Σ0) : (µ0, Σ0) γ1+10ρ2/ ℓ,6γ1 (0, I)}. Thus, if 10ρ ℓ, then T2(0, I) can be covered by e O(ℓ d log d) regions, each of which has the same normalized volume as the volume of {(µ0, Σ0) : (µ0, Σ0) γ1+10ρ2/ ℓ,6γ1 (0, I)}. Moreover, if we further have 10ρ2 ℓ γ1, then by Lemma C.8, this is at most 1 + 10ρ2 2d2 6d voln(T1(0, I)). We will set ℓ= 100(dρ2)2/3 γ2 1 . Since ρ2 1 d, ℓ 100 as long as γ1 c1 1. Also, ℓ 100 γ3 1 1000 d3/2 2/3 1 γ2 1 = d. Finally, 10ρ2 ℓ= γ1 ρ2/3 2 d1/3 γ1, since ρ2 d. Overall, the ratio voln(T2(0, I)) voln(T1(0, I)) e O(ℓ d log d) 1 + 10ρ2 6d e O(d5/3 log d ρ2/3 2 /γ2 1). Finally, we show how to remove the assumption that µ = 0 and Σ = I. First, note that for any general µ, Σ, (µ1, Σ1) T1(0, I) if and only if (Σ1/2µ1 + µ, Σ1/2Σ1Σ1/2) T1(µ, Σ), by Lemma C.7. So, by Lemma C.2, voln(T1(µ, Σ)) = voln(T1(0, I)). 
Next, (µ2, Σ2) γ1,γ1 (µ1, Σ1) and (µ1, Σ1) γ2,ρ2,γ2 (0, I) if and only if (Σ1/2µ2 + µ, Σ1/2Σ2Σ1/2) γ1,γ1 (Σ1/2µ1 + µ, Σ1/2Σ1Σ1/2) and (Σ1/2µ1 + µ, Σ1/2Σ1Σ1/2) γ2,ρ2,γ2 (µ, Σ), by two applications of Lemma C.7. So, by Lemma C.2, voln(T2(µ, Σ)) = voln(T2(0, I)). Thus, the volume ratios stay the same as well. D Fine Approximation via Hypothesis Selection In this section, we prove that if we know a very crude approximation to all of the Gaussian mixture components with sufficiently large weight, we can privately learn the full density of the Gaussian mixture model up to low total variation distance. This will be useful for proving both Theorem 1.4 and Theorem 1.5. We start with the following auxiliary lemma. Lemma D.1. Let G 1 and ζ 1 be some parameters. For some d 1, let ˆµ Rd, and let ˆΣ Rd d be positive definite. Let Uˆµ,ˆΣ be the set of (µ, Σ) such that 1 G ˆΣ Σ G ˆΣ and ˆΣ 1/2(µ ˆµ) 2 G. Then, there exists a net Bˆµ,ˆΣ of size O(G d/ζ)d(d+3) such that for every (µ, Σ) Uˆµ,ˆΣ, there exists (µ , Σ ) Bˆµ,ˆΣ such that d TV(N(µ, Σ), N(µ , Σ )) ζ. Proof. First, assume that ˆΣ = I and ˆµ = 0. Now, let us consider the set U := U0,I = {(µ, Σ) : µ 2 G, 1 G I Σ G I}. Let G G be a parameter that we will set later. Now, we can look at the un-normalized volume of U viewed in Rd(d+3)/2 (i.e., we are projecting to the upper-triangular part of Σ). First, we claim (using a standard volume argument) that there is a cover B U of size (2G )d(d+3), such that for every (µ, Σ) U, there is (µ , Σ ) B with µ µ 2 1 G and Σ Σ op 1 G . To see why, consider a maximal packing of (µ i, Σ i) U such that for every i = j, either µ i µ j 2 > 1 G or Σ i Σ j op > 1 G . Then, this set is clearly a cover of U (or else we could have increased the size of the packing, breaking maximality). Then, the sets Si := {(µ, Σ) : µ µ i 1 2G , Σ Σ i op 1 2G } are disjoint, and are all contained in the set S := {(µ, Σ) : µ 2 2G , Σ op 2G }. S is just a shifting and scaling of Si by a factor of 4(G )2, and thus vol(S) = (4(G )2)d(d+3)/2 vol(Si) = (2G )d(d+3) vol(Si). Because every Si is disjoint and contained in S, the number of such indices i is at most (2G )d(d+3). Next, consider any (µ, Σ) U and (µ , Σ ) B with µ µ 2 1 G , Σ Σ op 1 G . Then, Σ 1/2Σ Σ 1/2 I op = Σ 1/2(Σ Σ)Σ 1/2 op Σ 1 op Σ Σ op G G . Also, Σ 1/2(µ µ ) 2 Σ 1/2 op µ µ 2 G G . In other words, we have found a net B of size (2G )d(d+3) such that for every (µ, Σ) U, there exists (µ , Σ ) B with (µ , Σ ) G/G ,G/G (µ, Σ). Next, consider general ˆµ, ˆΣ. Note that (µ, Σ) Uˆµ,ˆΣ is equivalent to µ = ˆΣ1/2µ0 + ˆµ, where µ0 2 G, and Σ = ˆΣ1/2Σ0 ˆΣ1/2, where 1 G I Σ0 G I. So, if we consider the map f : (µ0, Σ0) 7 (ˆΣ1/2µ0 + ˆµ, ˆΣ1/2Σ0 ˆΣ1/2), this bijectively maps U to Uˆµ,ˆΣ. We can also consider the cover Bˆµ,ˆΣ = f(B), where B is the cover constructed for U. Then, by Lemma C.7, (µ 0, Σ 0) G/G ,G/G (µ0, Σ0) if and only if f(µ 0, Σ 0) G/G ,G/G f(µ0, Σ0). Hence, regardless of the choice of ˆµ, ˆΣ, for every (µ, Σ) Uˆµ,ˆΣ, there exists (µ , Σ ) Bˆµ,ˆΣ with (µ , Σ ) G/G ,G/G (µ, Σ). Moreover, |Bˆµ,ˆΣ| = |B| (2G )d(d+3). Moreover, if (µ , Σ ) G/G ,G/G (µ, Σ), then Σ 1/2(µ µ) 2 G G and Σ 1/2ΣΣ 1/2 I F d Σ 1/2ΣΣ 1/2 I op d G G . So, by Lemma A.8, d TV(N(µ, Σ), N(µ , Σ )) d G G . Hence, if we set G = G d ζ , then the size of the net Bˆµ,ˆΣ is O(G d/ζ)d(d+3) and for every (µ, Σ) Uˆµ,ˆΣ, there exists (µ , Σ ) Bˆµ,ˆΣ such that d TV(N(µ, Σ), N(µ , Σ )) ζ. Next, we note the following result about differentially private hypothesis selection. Theorem D.2. [BSKW19] Let H = {H1, . . . 
, HM} be a set of probability distributions over some domain D. There exists an ε-differentially private algorithm (with respect to a dataset X = {X1, . . . , Xn}) which has following guarantees. Let D be an unknown probability distribution over D, and suppose there exists a distribution H H such that d TV(D, H ) α. If n O log M αε , and if X1, . . . , Xn are samples drawn independently from D, then the algorithm will output a distribution ˆH H such that d TV(D, ˆH) 4α with probability at least 9/10. Given Lemma D.1 and Theorem D.2, we are in a position to convert any crude approximation into a fine approximation. Namely, we prove the following result. Lemma D.3. Let D represent an unknown GMM with representation {(wi, µi, Σi)}k i=1. Let G 1 be some fixed parameter. Then there exists an ε-differentially algorithm, on n O d2k log(G k k d/α) α2 + d2k log(G k k d/α) αε samples, with the following property. Suppose the algorithm is given as input a set {(ˆµj, ˆΣj)}k j=1 for some k 1, such that for every (µi, Σi) with weight wi α k , there exists j k such that 1 G ˆΣj Σi G ˆΣj and ˆΣ 1/2 j (µi ˆµj) 2 G. Then, if the samples are X1, . . . , Xn i.i.d. D, then with probability at least 9/10, the algorithm outputs a mixture of at most k Gaussians H, with d TV(D, H) O(α). Proof. Let ζ = α k . For each (ˆµj, ˆΣj), let Bˆµj,ˆΣj be as in Lemma D.1. We define ˆB = S j k Bˆµj,ˆΣj. Let H be the set of hypotheses consisting of mixtures of up to k Gaussians N(µ i, Σ i) with weights w i, with every (µ i, Σ i) ˆB and with every w i an integral multiple of ζ. Note that the number of hypothesis M = |H| is at most | ˆB|O(k) (1/ζ)O(k) = (k G d/ζ)O(d2 k). Since we set ζ = α/k, M (G kk d α )O(d2 k). Next, let us consider the true GMM D, with representation {(wi, µi, Σi)}k i=1. Because every (µi, Σi) with wi α k satisfies 1 G ˆΣj Σi G ˆΣj and ˆΣ 1/2 j (µi ˆµj) 2 G, by Lemma D.1, there exists (µ i, Σ i) ˆB such that d TV(N(µi, Σi), (µ i, Σ i)) ζ. Moreover, we can round each wi α k to some w i which is an integral multiple of ζ, such that |wi w i| ζ. Finally, for each wi < α k , we can choose an arbitrary Gaussian in ˆB and round wi to some w i. Then, the total variation distance between D and the GMM with representation {(w i, µ i, Σ i)}k i=1 is at most X k + |wi w i| + X i k:wi α/k (ζ + |wi w i|) α + O(k ζ). Hence, if we set ζ = α k , there exists a distribution H H with d TV(D, H ) O(α). Therefore, the algorithm of Theorem D.2, using n O d2k log(G k k d/α) α2 + d2k log(G k k samples, will find H H such that d TV(D, H) O(α), and is ε-differentially private. Pseudocode: We give a simple pseudocode for the algorithm of Lemma D.3, in Algorithm 1. Algorithm 1: FINE-ESTIMATE(X1, X2, . . . , Xn Rd, d, k, k , ε, α, G, {(ˆµj, ˆΣj)}j k ) Input: Samples X1, . . . , Xn, and crude predictions (ˆµj, ˆΣj), where 1 j k for some k . Output: An (ε, 0)-DP prediction H, which is a mixture over at most k Gaussians. 1 Set ζ = α/k. 2 for j = 1 to k do 3 Define the sets Bˆµj,ˆΣj as in Lemma D.1, for parameters G, ζ. j k Bˆµj,ˆΣj. 6 Let H be the set of mixtures {(w i, µ i, Σ i)} of k or fewer components, where every (µ i, Σ i) ˆB and every w i is an integral multiple of ζ. 7 Run the ε-DP algorithm of Theorem D.2 on X1, . . . , Xn with respect to H. E The High Dimensional Setting In this section, we provide the algorithm and analysis for Theorem 1.4. We first describe and provide pseudocode for the algorithm, and then we prove that it can privately learn mixtures of d-dimensional Gaussians with low sample complexity. 
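Since the final step of the high-dimensional algorithm invokes the hypothesis-selection subroutine of Appendix D (Algorithm 1), we first illustrate that subroutine's structure. The following is a toy, univariate sketch of our own (hypothetical helper names and parameters; the $\varepsilon$-DP selection of Theorem D.2 is replaced by a non-private maximum-likelihood stand-in purely to show the flow): it builds a parameter net around each crude estimate, quantizes the mixing weights to multiples of $\zeta$, and selects among the resulting mixture hypotheses, mirroring Lemma D.3.

```python
import itertools
import numpy as np

def normal_pdf(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

def candidate_components(mu_hat, var_hat, G, step):
    """Toy 1-d net around a crude estimate: means within G*sd of mu_hat,
    variances within a multiplicative factor G of var_hat."""
    sd = np.sqrt(var_hat)
    means = np.arange(mu_hat - G * sd, mu_hat + G * sd + 1e-9, step * sd)
    variances = var_hat * np.exp(np.arange(-np.log(G), np.log(G) + 1e-9, step))
    return [(m, v) for m in means for v in variances]

def mixture_loglik(X, weights, comps):
    dens = sum(w * normal_pdf(X, m, v) for w, (m, v) in zip(weights, comps))
    return np.sum(np.log(dens + 1e-300))

def fine_estimate_toy(X, crude, k, G=4.0, step=1.0, zeta=0.25):
    # Union of the per-component nets (the analogue of B-hat in Algorithm 1).
    B = [c for (mu_hat, var_hat) in crude
         for c in candidate_components(mu_hat, var_hat, G, step)]
    # Weight vectors whose entries are multiples of zeta and sum to 1.
    grid = np.arange(zeta, 1 + 1e-9, zeta)
    weights = [w for w in itertools.product(grid, repeat=k) if abs(sum(w) - 1) < 1e-9]
    # Enumerate mixture hypotheses and select one.  The paper runs the eps-DP
    # hypothesis-selection algorithm of Theorem D.2 on this list; here we use a
    # NON-PRIVATE maximum-likelihood stand-in purely for illustration.
    best, best_ll = None, -np.inf
    for comps in itertools.combinations(B, k):
        for w in weights:
            ll = mixture_loglik(X, w, comps)
            if ll > best_ll:
                best, best_ll = (w, comps), ll
    return best

rng = np.random.default_rng(2)
X = np.concatenate([rng.normal(-5, 1, 300), rng.normal(5, 2, 300)])
print(fine_estimate_toy(X, crude=[(-4.0, 2.0), (6.0, 3.0)], k=2))
```

In the actual algorithm the selection step is the $\varepsilon$-DP procedure of [BSKW19], and the per-component nets are the $d$-dimensional covers of Lemma D.1 rather than this toy grid.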
E.1 Algorithm High-Level Approach: Suppose that the unknown distribution is a GMM with representation {(wi, µi, Σi)}k i=1. Define Θ to be the set of all feasible mean-covariance pairs (µ, Σ), i.e., where µ Rd and Σ Rd d is positive definite. We also start off with a set Ω= Θ, which will roughly characterize the region of remaining mean-covariance pairs. At a high level, our algorithm proceeds as follows. We will first learn a very crude approximation of each (µi, Σi), one at a time. Namely, using roughly ε k log(1/δ), δ -DP, we will learn some (ˆµ1, ˆΣ1) Ωwhich is vaguely close to some (µi, Σi). We will then remove every (µ, Σ) that is vaguely close to this (ˆµ, ˆΣ) from our Ω, and then attempt to repeat this process. We will repeat it up to k times, in hopes that every (µi, Σi) is vaguely close to some (ˆµj, ˆΣj). By advanced composition, the full set of {(ˆµj, ˆΣj)} is still (ε, δ)-DP. At this point, the remainder of the algorithm is quite simple. Namely, we have some crude estimate for every (µi, Σi) (it is vaguely close to some (ˆµj, ˆΣj)). Moreover, one can create a fine net of roughly e O(d2) (µ, Σ)-pairs that cover all mean-covariance pairs vaguely close to each (ˆµj, ˆΣj), since the dimension of (µ, Σ) is O(d2). Moreover, the weight vector (w1, . . . , wk) has a fine net of size roughly e O(k). As a result, we can reduce the problem to a small set of hypotheses: there are roughly e O(d2) choices for each (µi, Σi), and e O(k) choices for w, for a total of e O(k d2) choices for the mixture of Gaussians. We can then apply known results on private hypothesis selection [BSKW19], which will suffice. The main difficulty in the algorithm is in privately learning some (ˆµ, ˆΣ) which is vaguely close to some (µi, Σi). We accomplish this task by applying the robustness-to-privacy conversion of [HKMN23], along with a carefully constructed score function, which we will describe later in this section. Algorithm Description: Let c0 0.01 and C0 > 1 be the constants of Lemma B.6, and c1 0.01 be the constant of Lemma C.8. Let c2 0.1 min(c0, c1) be a sufficiently small constant. We set parameters η = c2 α 4k log(1/δ), δ0 = δ 2k, β0 = e d. We also set m = e O(d1.75), N = e O m k log3/2(1/δ) ε + kd , and n = N 2k α . Finally, we define parameters γ = τ = c2 and ρ = c2 d1/8. (See Lines 1 4 of Algorithm 2 for a more precise description of some of the parameters.) We now define the main score function. To do so, we first set some auxiliary parameter η = γ = c2 8C0 and ρ = ρ 8C0 . We will consider the (deterministic) algorithm A of Lemma B.6, where A is given parameters η , γ , ρ , β0, that acts on a dataset Z and outputs either or some (ˆµ, ˆΣ). Next, for some domain Ω Θ of feasible mean-covariance pairs (µ, Σ), we will define a function fΩ, which takes as input a dataset Y of size N and a mean-covariance pair (µ, Σ) Θ, and outputs fΩ(Y, µ, Σ) = 1 if (µ, Σ) Ω, A(Y) γ,ρ,τ (µ, Σ) , and P Z Y |Z|=m [A(Z) γ,ρ,τ (µ, Σ)] 2 0 else . (2) Finally, given a dataset X of size n and (eµ, eΣ) Θ, we define the score function SΩ((eµ, eΣ), X) = min t : X , Y , µ, Σ such that d H(X, X ) = t, Y X , |Y | = N, (eµ, eΣ) γ,τ (µ, Σ), fΩ(Y , µ, Σ) = 1 . (3) Note that (eµ, eΣ) does not need to be in Ω, but it must satisfy (eµ, eΣ) γ,τ (µ, Σ) for some (µ, Σ) Ω. Also, note that it is possible that no matter how one changes the data points, the conditions are never met (for instance if (eµ, eΣ) γ,τ (µ, Σ) for any (µ, Σ) Ω). This is not a problem: we will simply set the score to be + if this occurs. 
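To unpack the definition in (2): $f_\Omega(Y, \mu, \Sigma)$ requires that $(\mu, \Sigma) \in \Omega$, that the robust estimate computed on all of $Y$ be close to $(\mu, \Sigma)$ in the sense of Definition B.1, and that the same closeness hold for at least a $2/3$ fraction of the size-$m$ subsets of $Y$. The sketch below is a minimal, non-private illustration of this check, not the paper's mechanism: it replaces the (inefficient) robust estimator $A$ of Lemma B.6 with the plain empirical mean and covariance, takes $\Omega$ to be all of $\Theta$, and samples random size-$m$ subsets rather than enumerating them; all helper names are ours.

```python
import numpy as np

def robust_estimate_stub(Z):
    """Stand-in for the robust estimator A of Lemma B.6 (here: empirical mean/cov)."""
    return np.mean(Z, axis=0), np.cov(Z, rowvar=False)

def close(est, target, gamma, rho, tau):
    """The ~_{gamma, rho, tau} relation of Definition B.1."""
    (mu_hat, S_hat), (mu, S) = est, target
    w, V = np.linalg.eigh(S)
    W = V @ np.diag(w ** -0.5) @ V.T            # Sigma^{-1/2}
    M = W @ S_hat @ W - np.eye(len(mu))
    return (np.linalg.norm(M, 2) <= gamma and np.linalg.norm(M, 'fro') <= rho
            and np.linalg.norm(W @ (mu_hat - mu)) <= tau)

def f_omega_montecarlo(Y, mu, Sigma, m, gamma, rho, tau, trials=200, rng=None):
    """Monte-Carlo version of the check in (2), with Omega taken to be everything."""
    rng = rng or np.random.default_rng(0)
    if not close(robust_estimate_stub(Y), (mu, Sigma), gamma, rho, tau):
        return 0
    hits = 0
    for _ in range(trials):
        idx = rng.choice(len(Y), size=m, replace=False)
        hits += close(robust_estimate_stub(Y[idx]), (mu, Sigma), gamma, rho, tau)
    return 1 if hits / trials >= 2 / 3 else 0

rng = np.random.default_rng(3)
d, N, m = 3, 2000, 400
mu, Sigma = np.zeros(d), np.eye(d)
Y = rng.multivariate_normal(mu, Sigma, size=N)
print(f_omega_montecarlo(Y, mu, Sigma, m, gamma=0.3, rho=0.6, tau=0.3, rng=rng))
```

The score $S_\Omega$ in (3) then minimizes, over modified datasets $X'$, the number of changed points needed for such a certificate $(Y', \mu, \Sigma)$ to exist; this distance-to-a-set structure is what gives it sensitivity 1.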
With this definition of score function SΩ, we can let M be an (ε0, δ0)-DP algorithm based on Theorem C.3, with the settings of n, η, η , ε0, δ0, β0 as above, and with D = n(n+3) 2 . Here, we recall that ( µ, Σ) is viewed as n(n+3) 2 -dimensional by only considering the upper-triangular part of Σ. We also assume the domain is Θ and the normalized volume is as in (1). Given this algorithm M (which implicitly depends on Ω), the algorithm works as follows. We will actually draw a total of n + n samples (for n as above, and n to be defined later), though we start by just looking at the first n samples X = {X1, . . . , Xn}. We initially set Ω= Θ, and run the following procedure up to k times, or until the algorithm outputs . For the jth iteration, we use M(X) to compute some prediction (ˆµj, ˆΣj) (or possibly we output ). Assuming we do not output , we remove from Ωevery (µ, Σ) such that n 12 ˆΣj Σ n12 ˆΣj and ˆΣ 1/2 j (µ ˆµj) 2 n12. At the end, we have computed some {(ˆµj, ˆΣj)}k j=1 where 0 k k. If k = 0 we simply output . Otherwise, we will draw n = O d2k αε fresh samples. We run the ε-DP hypothesis selection based algorithm, Algorithm 1 on the fresh samples, using parameters G = n12 and {(ˆµj, ˆΣj)}k j=1. Pseudocode: We give pseudocode for the algorithm in Algorithm 2. Algorithm 2: ESTIMATE(X1, X2, . . . , Xn+n Rd, k, ε, δ, α) Input: Samples X1, . . . , Xn, Xn+1, . . . , Xn+n . Output: An (ε, δ)-DP prediction {(eµi, eΣi), ewi}. /* Set parameters */ 1 Let c2 be a sufficiently small constant and K = poly log(d, k, 1 δ ) be sufficiently large. 4k log(1/δ), δ0 δ 2k, β0 e d. 3 m = K d1.75, N K m+log(1/δ0) ε0 + kd , n N 2k 4 γ, τ c2, ρ 10 d1/8. 5 Θ, Ω {(µ, Σ) : µ Rd, Σ Rd d Symmetric, Positive Definite}. /* Get X */ 6 Obtain samples X {X1, X2, . . . , Xn}. /* Learn a crude approximation (ˆµj, ˆΣj) of the mean-covariance pairs, one at a time */ 7 for j = 1 to k do 8 Define fΩ(Y, µ, Σ) and SΩ((µ, Σ), X) as in (2) and (3), respectively. 9 Let M be the (ε0, δ0)-DP algorithm obtained by Theorem C.3 with the score function SΩ, with parameters n, η, η , ε0, δ0, β0 as defined above, D = n(n+3) 2 , domain Θ, and p(µ, Σ) = (det Σ) (d+2)/2. 11 if A = then 12 ˆµj, ˆΣj A 13 Ω Ω\{(µ, Σ) : n 12 ˆΣj Σ n12 ˆΣj and ˆΣ 1/2 j (µ ˆµj) 2 n12} 15 Break // Break out of the for loop /* Fine approximation, via private hypothesis selection */ 18 Set n K d2k 19 Sample Xn+1, . . . , Xn+n , and redefine X {Xn+1, . . . , Xn+n }. 20 Run Algorithm 1 on X, {(ˆµj, ˆΣj)}, with k = #{(ˆµj, ˆΣj)} and G = n12. E.2 Analysis of Crude approximation In this section, we analyze Lines 7 17 of Algorithm 2. The main goal is to show that the algorithm is private, and that for samples drawn from a mixture of Gaussians, every component (µi, Σi) with large enough weight wi is vaguely close to some (ˆµj, ˆΣj) computed by the algorithm. First, we show that when the samples are actually drawn as a Gaussian mixture, then under some reasonable conditions, any (µ, Σ) close to a true mean-covariance pair has low score. Proposition E.1. Suppose X = {X1, . . . , Xn} is drawn from a Gaussian mixture model, with representation {(wi, µi, Σi)}k i=1. Then, with 1 O(k β0) probability over X, for any set Ω of mean-covariance pairs, for all i [k] such that (µi, Σi) Ωand wi α/k, and for all (eµ, eΣ) γ,τ (µi, Σi), SΩ((eµ, eΣ), X) = 0. Proof. Fix any i [k] with wi α/k. Set µ = µi and Σ = Σi. Also, let X = X, and Y be a random subset of size N among the samples in X actually drawn from N(µi, Σi). 
Note that the number of such samples has distribution Bin(n, wi), which for wi α k and n = 2k α N, is at least N with e Ω(N) β0 failure probability. Then, Y is just N i.i.d. samples from N(µi, Σi). So, if we draw a random subset Z of size m of Y, it has the same distribution as m i.i.d. samples from N(µi, Σi). We apply Part 2 of Lemma B.6 (where we use parameters η , γ , ρ in the application). Note that m e O d γ 2 + d2 ρ 2 , so with probability at least 1 O(β0) over Z, A(Z) 8C0γ ,8C0ρ ,8C0γ (µi, Σi). By our setting of γ , ρ , this means that A(Z) γ,ρ,γ (µi, Σi). In other words, for each index i N m and corresponding subset Z of Y, if we let Wi be the indicator that A(Z) γ,ρ,γ (µi, Σi), then P(Wi = 1) O(β0). While the values of Wi are not necessarily independent, by linearity of expectation we have that E[P Wi] N m O(β0), so the probability that E[P Wi] 1 3 N m is at most O(β0) by Markov s inequality. Moreover, because N m e O d γ 2 + d2 ρ 2 , we can apply A on Y and we again obtain that with probability 1 O(β0), A(Y) γ,ρ,γ (µi, Σi). In summary, for any fixed i [k] with wi α k , with probability at least 1 O(β0) over X, there exists Y X of size N, such that A(Y) γ,ρ,γ (µi, Σi) and at least 2/3 of the subsets Z Y of size m have A(Z) γ,ρ,γ (µi, Σi). Thus, for any set Ω, if (µi, Σi) Ωthen SΩ((eµ, eΣ), X) = 0 for all (eµ, eΣ) γ,τ (µi, Σi). The proof follows by a union bound over all i [k]. Next, we show that for any dataset X, if there is even a single (µ , Σ ) with low score, there must be a region of (eµ, eΣ) which all has low score. Proposition E.2. Suppose that t = minµ ,Σ SΩ((µ , Σ ), X). Then, there exists some µ, Σ such that for all (eµ, eΣ) γ,τ (µ, Σ), we have that SΩ((eµ, eΣ), X) = t. Proof. Fix µ , Σ so that SΩ((µ , Σ ), X) = t. Then, there is some (µ, Σ) and some X , Y such that d H(X, X ) = t, Y X , |Y| = N, and fΩ(Y, µ, Σ) = 1. Thus, for any (eµ, eΣ) γ,τ (µ, Σ), by definition, we have that SΩ((eµ, eΣ), X) t. But since t is the minimum possible score, we in fact have equality: SΩ((eµ, eΣ), X) = t for all such (eµ, eΣ). Next, we want to show that regardless of the dataset (even if not drawn from any Gaussian mixture distribution), the set of points of low score can t have that much volume. Proposition E.3. Fix a dataset X of size n. Then, the set of eµ, eΣ such that SΩ((eµ, eΣ), X) η n can be partitioned into at most n m regions Si, which is indexed by some (µ i, Σ i). Moreover, for all (eµ, eΣ) Si, there exists (µ, Σ) such that (eµ, eΣ) γ,τ (µ, Σ) and (µ, Σ) 8γ,8ρ,8τ (µ i, Σ i). Proof. Pick any (eµ, eΣ) with score at most η n. Let X , Y , µ, Σ be such that d H(X, X ) η n, Y X , |Y | = N, (eµ, eΣ) γ,τ (µ, Σ), and fΩ(Y , µ, Σ) = 1. If we define Y X of size N to be the corresponding subset as Y is to X , then d H(Y, Y ) η n = c2 Now, if we take a random subset Z of size m in Y , and look at the corresponding subset Z of size m in Y, by a Chernoff bound, with at least 0.99 probability d H(Z, Z ) c2 2 m. Moreover, with at least 2/3 probability over the random subset Z , A(Z ) γ,ρ,τ (µ, Σ). Therefore, there always exists a subset Z X of size m and a set Z of size m such that d H(Z, Z ) c2 2 m and A(Z ) γ,ρ,τ (µ, Σ). Now, for any fixed Z X of size m, if we look at any sets Z , Z of size m and Hamming distance at most c2 2 m from Z, then d H(Z , Z ) c2 m. So, by Property 1 of Lemma B.6, for every such Z and Z with A(Z ), A(Z ) = , we must have A(Z ) γ,ρ,τ A(Z ). To complete the proof, we order the subsets Z1, Z2, . . . , Z( n m) of size m in X. 
For each i n m , if there exists some Z such that d H(Zi, Z ) c2 2 m and A(Z ) = , choose an arbitrary such Z and let (µ i, Σ i) := A(Z ). Otherwise, we do not define (µ i, Σ i). Then, for any (eµ, eΣ) of score at most η n, there exists (µ, Σ), a subset Zi X of size m, and a set Z i of size m such that (eµ, eΣ) γ,τ (µ, Σ), d H(Zi, Z i) c2 2 m, and A(Z i) γ,ρ,τ (µ, Σ). Because A(Z i) = and Z i has Hamming distance at most c2 2 m from Zi, this also implies that A(Z i) γ,ρ,τ (µ i, Σ i). By Proposition B.2, this means that (µ, Σ) 8γ,8ρ,8τ (µ i, Σ i), which completes the proof. Proposition E.4. Suppose X = {X1, . . . , Xn} is drawn from a Gaussian mixture model, with representation {(wi, µi, Σi)}k i=1. Then, with probability at least 1 O(β0) over the randomness of X, for any set Ωof mean-covariance pairs, and for every (eµ, eΣ) with SΩ((eµ, eΣ), X) N 2 , there exists an index i [k] such that 1 O(n8) Σi eΣ O(n8) Σi and Σ 1/2 i (eµ µi) 2 O(n6). Proof. We apply Part 3 of Lemma B.6, though we use N to replace the value m in Lemma B.6. We are assuming N 40d k. So, by definition of score, (eµ, eΣ) γ,τ (µ, Σ) where fΩ(Y , µ, Σ) = 1, for some Y which can be generated by taking a subset of X of size at most N and altering at most N/2 elements. Since fΩ(Y , µ, Σ) = 1, this means that if (ˆµ, ˆΣ) = A(Y ) then (ˆµ, ˆΣ) γ,ρ,τ (µ, Σ). So, by Proposition B.2, (eµ, eΣ) 8γ,8τ (ˆµ, ˆΣ). Assuming γ = τ = c2 is sufficiently small, this means that (eµ, eΣ) 1/2,1/2 (ˆµ, ˆΣ). By Part 3 of Lemma B.6, with probability at least 1 O(β0), there exists i [k] such that 1 O(n8) Σi ˆΣ O(n8) Σi and Σ 1/2 i (ˆµ µi) 2 O(n6). However, we know that 1 2 ˆΣ eΣ 2ˆΣ, which means that 1 O(n8) Σi eΣ O(n8) Σi. Moreover, ˆΣ 1/2(ˆµ eµ) 2 0.5, so Σ 1/2 i (ˆµ eµ) 2 O(n4). Thus, by triangle inequality, we have that Σ 1/2 i (eµ µi) 2 O(n6). Now, given a set X = {X1, . . . , Xn} drawn from a GMM with representation {(wi, µi, Σi)}k i=1, we say X is regular if it satisfies Propositions E.1 and E.4. In other words: 1. for any set Ωof mean-covariance pairs, for all i [k] such that (µi, Σi) Ωand wi α/k, and for all (eµ, eΣ) γ,τ (µi, Σi), SΩ((eµ, eΣ), X) = 0. 2. for any set Ωof mean-covariance pairs, and for every (eµ, eΣ) with SΩ((eµ, eΣ), X) N 2 , there exists an index i [k] such that 1 O(n8) Σi eΣ O(n8) Σi and Σ 1/2 i (eµ µi) 2 O(n6). When we say X is regular, the components (µi, Σi) and weights wi are implicit. We now show that every step of the crude approximation (Lines 8 16 in Algorithm 2) will, with 1 O(β0) probability, find some crude approximation (ˆµj, ˆΣj) to some (µi, Σi), as long as (µi, Σi) Ω. Lemma E.5. For any Ω Θ, the corresponding algorithm M (which depends on Ω) is (ε0, δ0)-DP on datasets X. Suppose additionally that X is regular, and that there exists i [k] such that wi α k and (µi, Σi) Ω. Then, with 1 O(β0) probability (just over the mechanism M), M(X) = (eµ, eΣ) and There exists i [k] such that 1 O(n8) Σi eΣ O(n8) Σi and Σ 1/2 i (eµ µi) 2 O(n6). There exists (µ, Σ) Ωsuch that (eµ, eΣ) γ,τ (µ, Σ). Proof. We will apply Theorem C.3. We first need to check the necessary conditions. It follows almost immediately from the definition of SΩthat SΩ((eµ, eΣ), ) has the bounded sensitivity property with respect to neighboring datasets, for any fixed Ω, eµ, eΣ. To check the volume condition, note that for any dataset X of size n, if minµ,Σ SΩ((µ, Σ), X) 0.7η n, then by Proposition E.2, there exists µ, Σ such that SΩ((eµ, eΣ), X) 0.8η n for all (eµ, eΣ) γ,τ (µ, Σ). 
Conversely, for any X, by Proposition E.3, the set of (eµ, eΣ) of score at most η n can be partitioned into n m regions Si, indexed by (µ i, Σ i). Moreover, each Si is contained in the set of (eµ, eΣ) such that there exists (µ, Σ) such that (eµ, eΣ) γ,τ (µ, Σ) 8γ,8ρ,8τ (µ i, Σ i). Let us use the notation of Lemma C.8, where γ1 = γ, γ2 = 8γ, and ρ2 = 8ρ. Then, we have that the set of points with score at most 0.8η n contains T1(µ, Σ) for some (µ, Σ), but the set of points with score at most η n is contained in the union of T2(µ i, Σ i) for at most n m choices of (µ i, Σ i). So, as long as minµ,Σ SΩ((µ, Σ), X) 0.7η n, we can apply Lemma C.8 to obtain Vη (X) V0.8η (X) n m d5/3 log d ρ2/3 2 γ2 1 = exp O d7/4 log d + m log n , since ρ2 = 8ρ = 8c2 d1/8 and γ1 = γ = c2, and since c2 is a constant. Thus, if n C O(d7/4 log d + m log n) + log(1/δ0) we have that M is (ε0, δ0)-DP, by Theorem C.3. This holds by our parameter settings, assuming K is sufficiently large. Next, we prove accuracy. Assume that X is regular. We again adopt the notation of Lemma C.8, (with γ1 = γ, γ2 = 8γ, and ρ2 = 8ρ). By the first condition of regularity, for all i [k] such that (µi, Σi) Ωand wi α/k, every (eµ, eΣ) T1(µi, Σi) satisfies SΩ((eµ, eΣ), X) = 0. We still have that the set of points of score at most η n is contained in the union of at most n m sets T2(µ i, Σ i). Thus, Vη (X) Vη(X) exp O d7/4 log d + m log n , where we recall that we set η = η 10. So, by Theorem C.3, as long as n C O(d7/4 log d + m log n) + log(1/(β0 η)) which indeed holds, M(X) outputs some (eµ, eΣ) of score at most 2ηn, with 1 β0 probability. By the second condition of regularity there exists an index i [k] such that 1 O(n8) Σi eΣ O(n8) Σi and Σ 1/2 i (eµ µi) 2 O(n6). Moreover, because (eµ, eΣ) has finite score, (eµ, eΣ) γ,τ (µ, Σ) for some (µ, Σ) Ω. Lemma E.6. Lines 7 17 of Algorithm 2 are (ε, δ)-DP. Moreover, for every regular set X, with probability at least 1 O(k β0) over the randomness of Algorithm 2, for every i [k] such that wi α k , there exists j such that n 12 ˆΣj Σi n12 ˆΣj and ˆΣ 1/2 j (µ ˆµj) 2 n12. Proof. The privacy guarantee is immediate by (adaptive) advanced composition. Indeed, the algorithm M at each step is (ε0, δ0)-DP, and only depends on X and Ω, which is determined only by the output of all previous runs of M. Now, let s say that X = {X1, . . . , Xn} is regular, and we run Algorithm 2 on X. Now, after j 0 steps of the loop, define Pj to be the set of indices i [k] such that wi α k and (µi, Σi) Ωj (where Ωj refers to the set Ωafter j steps of the loop are completed). Likewise, define Qj to be the set of indices i such that there exists (ˆµ, ˆΣ) Ωj with 1 n10 Σi ˆΣ n10 Σi and Σ 1/2 i (ˆµ µi) 2 n10. Note that P0 Q0 = [k], and that Pj Qj always. Moreover, because Ωonly shrinks, Pj+1 Pj and Qj+1 Qj always. Now, after some step j < k, we claim that if Pj = , then |Qj+1| |Qj| 1 with failure probability at most O(β0), i.e., Q decreases in size for the next step if P isn t currently empty. The failure probability is only over the randomness of M at each step, and holds for any regular dataset X. A union bound says the total failure probability is O(k β0). To see why this holds, note that if some index i Pj, then (µi, Σi) Ωj. So, we can apply Lemma E.5. At step j + 1, we find some (ˆµj+1, ˆΣj+1), with the following properties. First, there exists an index i [k] with 1 O(n8) Σi ˆΣj+1 O(n8) Σi and Σ 1/2 i (ˆµj+1 µi ) 2 O(n6). Second, (ˆµj+1, ˆΣj+1) γ,τ (µ, Σ), for some (µ, Σ) Ωj. We claim this means i Qj. 
To see why, it suffices to show that 1 n10 Σi Σ n10 Σi and Σ 1/2 i (µ µi ) 2 n10, by definition of Qj and because (µ, Σ) Ωj. However, we know that 1 O(n8) Σi ˆΣj+1 O(n8) Σi and 1 2 ˆΣj+1 Σ 2ˆΣj+1, which implies 1 n10 Σi Σ n10 Σi . Also, we know that Σ 1/2 i (ˆµj+1 µi ) 2 O(n6) and Σ 1/2(µ ˆµj+1) 2 τ 1 The latter inequality along with the fact that Σ n10 Σi implies that Σ 1/2 i (µ ˆµj+1) 2 n5. So, by triangle inequality, Σ 1/2 i (µ µi ) 2 O(n6) n10. Next, we claim that i Pj+1, so i Qj+1. Indeed, we have that 1 O(n8) Σi ˆΣj+1 O(n8) Σi , which immediately means 1 n10 ˆΣj+1 1 O(n8) ˆΣj+1 Σi O(n8) ˆΣj+1 n10 ˆΣj+1. Also, Σ 1/2 i (ˆµj+1 µi ) 2 O(n6), and since Σi n10 ˆΣj+1, this means ˆΣ 1/2 j+1 (µi ˆµj+1) 2 O(n11) n12. Thus, the algorithm will remove (µi , Σi ) from Ωat step j + 1, so i Qj+1. Thus, either Pj is empty, or Qj decreases in size to Qj+1 for each 0 j k + 1. This implies that Pk is empty, i.e., for every i [k] such that wi α k , (µi, Σi) Ωj. This can only happen if each such (µi, Σi) was removed at some point. So, for every such i, there was some index j such that n 12 ˆΣj Σi n12 ˆΣj and ˆΣ 1/2 j (µ ˆµj) 2 n12. E.3 Summary and Completion of Analysis We first quickly summarize how to put everything together to prove Theorem 1.4. To prove our main result, need to ensure that our algorithm is private, uses few enough samples, and accurately learns the mixture. Privacy will be simple, as we have shown in Lemma E.6 that the crude approximation is private, and the hypothesis selection is private as shown in Appendix D. The sample complexity will come from the settings of n and n in lines 3 and 18 of the algorithm, and from our setting of parameters. Finally, we showed in Lemma E.6 that we have found a set of (ˆµj, ˆΣj), of at most k meancovariance pairs, such that every true (µi, Σi) of sufficiently large weight is crudely approximated by some (ˆµj, ˆΣj). Indeed, this is exactly the situation for which we can apply the result on private hypothesis selection (Lemma D.3, based on [BSKW19]). We are now ready to complete the analysis and prove Theorem 1.4. Proof of Theorem 1.4. By Lemma E.6, note that the algorithm up to Line 17 is (ε, δ)-DP with respect to X1, . . . , Xn, and does not depend on Xn+1, . . . , Xn+n . Assuming {(ˆµj, ˆΣj)} from these lines is fixed, Lines 18 20 are (ε, δ)-DP with respect to Xn+1, . . . , Xn+n , by Lemma D.3, and do not depend on X1, . . . , Xn. So, the overall algorithm is (ε, δ)-DP. Next, we verify accuracy. Note that by Lemma E.6, and for G = n12, the sets (ˆµj, ˆΣj) that we find satisfy the conditions for Lemma D.3, with failure probability at most O(k β0). This probability is at most 0.1, assuming ed is significantly larger than k. So, it suffices for n , the number of samples used in Line 18 of Algorithm 2, to satisfy n O d2k log(G k d/α) α2 + d2k log(G k αε , where the last part of the bound holds by our assumptions on G and n. Thus, the total sample complexity is n + n = e O kd2 α2 + kd2 + d1.75k1.5 log0.5(1/δ) + k1.5 log1.5(1/δ) By Lemma D.3, with failure probability at most 0.1, Lines 18 20 of the algorithm will successfully output a hypothesis which is a mixture of k Gaussians, with total variation distance at most O(α) from the right answer. So, the overall success probability is at least 0.8. Finally, we note that the assumption that ed is much larger than k can be made WLOG, by padding d to be a sufficiently large multiple of log k. Namely, if given samples Xi = (Xi,1, . . . , Xi,d), we can add coordinates Xi,d+1, . . . , XO(log k) N(0, 1). 
Then, if the original samples were a mixture of k Gaussians N(µi, Σi), the new distribution is a mixture of k Gaussians N µi 0 the same mixing weights. So, we can learn the mixture distribution up to total variation distance α in the larger space, and then remove all except the first d coordinates, to output our final answer. This will not affect the sample complexity by more than a poly(log k) factor. F The Univariate Setting In this section, we provide the algorithm and analysis for Theorem 1.5. We first describe and provide pseudocode for the algorithm, and then we prove that it can privately learn mixtures of Gaussians with low sample complexity. We will use σi := Σi to denote the standard deviation of a univariate Gaussian N(µi, Σi). F.1 Algorithm As in the algorithm in Appendix E, we will actually draw a total of n + n samples. We start off by only considering the first n data points X = {X1, . . . , Xn}, where each Xj R. Now, let Y = {Y1, . . . , Yn} be the sorted version of X, i.e., Y1 Y2 Yn, and (X1, . . . , Xn) and (Y1, . . . , Yn) are the same up to permutation. Finally, for each j n 1, define Zj to be the ordered pair (Yj, Yj+1 Yj), and define Z = Z(X) to be the unordered multiset of Zj s. (Note that Z depends deterministically on X.) The algorithm now works as follows. Suppose we have data X = {X1, . . . , Xn}. After sorting to obtain Y1, . . . , Yn, let rj = Yj, sj = Yj+1 Yj, and Zj = (rj, sj). So, Z(X) = {(rj, sj)}1 j n 1, viewed as an unordered set. Note that every sj 0, and the rj s are in nondecreasing order. We will create a set of buckets that can be bijected onto Z2. For each Zj = (rj, sj), if sj > 0 we assign Zj to the bucket labeled (a, b) Z2 if 2a sj < 2a+1 and b n5 2a rj < (b + 1) n5 2a. If sj = 0, we do not assign Zj to any bucket. For each element e Z2, we keep track of the number of indices j [n 1] with Zj sent to e. In other words, for e = (a, b), we define the count ce = #{j : 2a sj < 2a+1, b n5 2a rj < (b+1) n5 2a}. For each ce, we sample an independent draw ge TLap(1, ε/10, δ/10), and define ce = ce + ge. Finally, we let S = {(ˆµi, ˆΣi)} be the the set of pairs (b n5 2a, 22a) where e = (a, b) satisfies ce > 100 Hence, we have computed some {(ˆµi, ˆΣi)}k i=1, for some k 0. (We will show in the analysis that k n 1 always, so k is finite.) If k = 0 we simply output . Otherwise, we will draw O k α2 + k αε fresh samples. We run the ε-DP hypothesis selection based algorithm, Algorithm 1 on the fresh samples, using parameters G = n10 and {(ˆµi, ˆΣi)}k i=1. Pseudocode. We give pseudocode for the algorithm in Algorithm 3. F.2 Analysis We start by analyzing Lines 1 19, which generates the set S = {(ˆµi, ˆΣi)} of candidate meancovariance pairs. First, we note a very basic proposition. Proposition F.1. Let ε, δ 1. Then, with probability 1, TLap(1, ε/10, δ/10) is at most 100 δ in absolute value. Proof. By definition of Truncated Laplace, | TLap(1, ε/10, δ/10)| 1 (ε/10) log 1 + eε/10 1 with probability 1. For ε, δ 1, this is less than 100 Algorithm 3: ESTIMATE1D(X1, X2, . . . , Xn+n R, k, ε, δ, α) Input: Samples X1, . . . , Xn, Xn+1, . . . , Xn+n . Output: An (ε, δ)-DP prediction {(eµi, eΣi), ewi}. /* Set parameters */ 1 Let K = poly log k, 1 δ be sufficiently large. 2 Set n K k log(1/δ) αε . /* Get X */ 3 Obtain samples X {X1, X2, . . . , Xn}. /* Obtain some candidate mean-covariance pairs (ˆµi, ˆΣi) */ 4 Let Y be X in sorted (nondecreasing) order. 5 Set c(a,b) 0 for all (a, b) Z2. 6 for j = 1 to n 1 do 7 Set rj Yj, sj Yj+1 Yj, Zj (rj, sj). 
8 if sj > 0 then 9 Let a log2(rj) , b rj n5 2a . 10 ca,b ca,b + 1. 14 for (a, b) Z2 do 15 Sample ca,b ca,b + TLap(1, ε/10, δ/10). 16 if ca,b > 100 17 S S {(b n5 2a, 22a)}. /* Fine approximation, via private hypothesis selection */ 20 Set n K d2k 21 Sample Xn+1, . . . , Xn+n , and redefine X {Xn+1, . . . , Xn+n }. 22 Run Algorithm 1 on X, S = {(ˆµi, ˆΣi)}, with parameters d = 1, k = #{(ˆµi, ˆΣi)}, and G = n3. Next, we show that S is not too large. Lemma F.2. The number of e = (a, b) Z2 such that ce > 100 δ is at most n 1. Thus, |S| n 1. Proof. Note that every Zj only increases the count of a single ce, so it suffices to prove that ce > 100 δ only if ce 1. To prove this, suppose that ce = 0. Then, ce TLap(1, ε/10, δ/10) 100 δ , by Proposition F.1. Thus, ce > 100 δ can only happen if ce > 0, meaning ce 1. Next, we show that for every Gaussian component (µi, σi) of sufficiently large weight wi, there are several pairs Zj = (Yj, Yj+1 Yj) that crudely approximate (µi, σi). Lemma F.3. Let n K k log(1/δ) αε , where K = poly log(k, 1 δ ) is sufficiently large. Let X be n i.i.d. samples from a GMM with representation {(wi, µi, Σi)}k i=1. Then, with failure probability at least 0.99, for every i [k] with wi α k , there are at least α 8k n indices j such that Yj [µi σi, µi + σi] and σi 104 n4 Yj+1 Yj 2σi. Proof. Fix some i [k] such that wi α k . By a Chernoff bound, since wi n α k n is a sufficiently large multiple of log k, with failure probability at most 1 1000k, the number of data points Xj from component i is at least α 2k n. Let us condition on Ti [n], the set of indices coming from the ith mixture component, and condition on |Ti| α 2k n. Then, by a Chernoff bound, with failure probability at most 1 1000k, at least α 4k n points Xr : r Ti are in the interval [µi σi, µi + σi]. Next, note that for any Xr, Xr N(µi, σ2 i ), Xr Xr N(0, 2σ2 i ), and thus has magnitude at least σi 104n3 with at most 1 1000n3 failure probability. So, by a union bound over all r = r Ti, with at most 1 1000n 1 1000k failure probability, all data points Xr, Xr for r = r Ti are separated by at least σi 104n3 . So, with at least 3 1000k failure probability, there is a subset Ui Ti of size at least α 4k n, such that for every r, r Ui, |Xr Xr |, and for every r Ui, Xr [µi σi, µi + σi]. By a union bound, this holds simultaneously for all i [k] with wj α k , with probability at least 0.997. Conditioned on this event, if we consider the elements X in sorted order (i.e., Y), between every consecutive pair Xr, Xr : r, r Ui (after sorting), we know Xr Xr σi 104 n3 . So, there exist some Xr Yj, Yj+1 Xr such that Yj+1 Yj σi 104 n4 . Hence, for every i [k] with wi α k , there exist at least |Ui| 1 α 8k indices j such that Yj+1 Yj σi 104 n4 . Moreover, note that Yj, Yj+1 [µi σi, µi + σi], so Yj+1 Yj 2σi. Hence, conditioned on the event from the previous paragraph, the lemma holds. Given the above lemma, we can prove that for every (µi, Σi) in the Gaussian mixture with at least α k weight, there is some corresponding crude approximation (ˆµ, ˆΣ) S. Lemma F.4. Assume that the conditions and result of Lemma F.3 holds. Then, there exists (ˆµ, ˆΣ) S such that n 10 ˆΣ Σi n10 ˆΣ and ˆΣ 1/2(µi ˆµ) 2 n6. Proof. Fix any i with wi α k . Then, for any j with Yj, Yj+1 [µi σi, µi+σi] and σi 104 n4 Yj+1 Yj 2σi, the index i contributes to the count of some bucket e = (a, b), where log2 σi 105 n4 a log2(2σi) . Moreover, Yj [b n5 2a, (b + 1) n5 2a). 
Therefore, if we consider the set Vi of e = (a, b) such that log2 σi 105 n4 a log2(2σi) and [b n5 2a, (b + 1) n5 2a) [µi σi, µi + σi] is nonempty, the sum of the counts ce across such e Vi is at least α 8k n. But there are at most O(log n) choices of a, and since n5 2a > 2σ, there are at most two choices for b for any fixed a. So, |Vi| O(log n), and P e Vi ce α 8k n. Assuming that α 8k n O(log n 1 δ ), i.e., n K k log(1/δ) αε for a sufficiently large polylogarithmic factor K = poly log k, 1 δ , one of these buckets e = (a, b) Vj must have ce > 200 δ. In this case, ce = ce + TLap(1, ε/10, δ/10) > 100 δ by Proposition F.1, so we include the pair (ˆµ, ˆΣ) := (b n5 2a, 22a) in S. Hence, there exists (a, b) Vi such that (ˆµ, ˆΣ) := (b n5 2a, 22a) in S. By definition of Vi, σi n5 2a 2σi, so 1 n10 Σi 22a 4Σi n10 Σi. Moreover, |ˆΣ 1/2(µi ˆµ)| = 2 a |µi ˆµ|. But the definition of Vi means ˆµ = b n5 2a satisfies ˆµ µi + σi and ˆµ µi σi n5 2a µi (2n5 + 1) σi. So, |ˆµ µi| (2n5 + 1) σi n6 2a. Thus, |ˆΣ 1/2(ˆµ µi)| n6. We now prove that the mechanism (until Line 19) is differentially private. First, we note the following auxiliary claim. Lemma F.5. Let X, X be adjacent datasets of size n (i.e., only differing on a single element). Then, the corresponding sets Z = Z(X) and Z = Z(X ) differ in distance at most 3, where by distance we mean that there exists a permutation of the elements in Z and Z , respectively, such that at most three indices i n 1 satisfy Zi = Z i. Proof. Note that for the sorted versions Y, Y of X, X , respectively, we can convert from Y to Y by removing one data point and adding one more data point, without affecting the order of any other data points. Suppose we remove some Yj from Y . If j = 1, then this just removes Z1, and if j = n, then this just removes Zn 1. If j 2, this modifies Zj 1 and removes Zj. Likewise, if we add some new Yj , this will either add one new ordered pair to Z (if Yj is either the smallest or largest element), or replace one ordered pair in Z with two new pairs. Therefore, if we remove a Yj and then add a Yj , this will change at most 3 of the ordered pairs in Z. We now prove privacy. Lemma F.6. The set S = {(ˆµj, ˆΣj)} of candidate mean-covariance pairs is (ε, δ)-DP with respect to X = {X1, . . . , Xn}. Proof. Let X, X be adjacent datasets of size n. Then, the corresponding sets Z, Z (after a possible permutation) differ in 3 elements. Therefore, if we let { ce}e Z2 be the counts for Z and { c e}e Z2 be the counts for Z , we have that ce c e 1 6. Because changing a single count ce by 1 leads to (ε/10, δ/10)-DP, overall, the counts { ce} will satisfy (ε, δ)-DP. Finally, S is a deterministic function of the noisy counts ce, and therefore must also be (ε, δ)-DP. Finally, we can incorporate the fine approximation (i.e., Lines 20 22 of Algorithm 3) and prove Theorem 1.5. Proof of Theorem 1.5. By Lemma F.6, the algorithm up to Line 19 is (ε, δ)-DP with respect to X1, . . . , Xn, and does not depend on Xn+1, . . . , Xn+n . Assuming S = {(ˆµi, ˆΣi)} from these lines is fixed, Lines 20 22 are (ε, δ)-DP with respect to Xn+1, . . . , Xn+n , by Lemma D.3, and do not depend on X1, . . . , Xn. So, the overall algorithm is (ε, δ)-DP. Next, we verify accuracy. Note that by Lemma F.4, and for G = n10, the sets (ˆµi, ˆΣi) that we find satisfy the conditions for Lemma D.3, with failure probability at most 0.01. Moreover, the size of S = {(ˆµi, ˆΣi)} is at most n, by Lemma F.2. 
So, because d = 1, it suffices for n , the number of samples used in Line 20 of Algorithm 3, to satisfy n O k log(G kn/α) α2 + k log(G kn/α) α2 + k αε , where the last part of the bound holds by our assumptions on G and n. Thus, the total sample complexity is n + n = e O k log(1/δ) By Lemma D.3, with failure probability at most 0.1, Lines 20 22 of the algorithm will successfully output a hypothesis which is a mixture of k Gaussians, with total variation distance at most O(α) from the right answer. So, the overall success probability is at least 0.8. G Lower Bound In this section, we prove Theorem 1.6. The proof of the lower bound will, at a high level, follow from known lower bounds for privately learning a single Gaussian [KV18, KMS22a], which we now state. Theorem G.1. [KV18] For some sufficiently small constant c , (ε, δ)-privately learning an arbitrary univariate Gaussian N(µ, σ2) up to total variation distance c requires Ω log(1/δ) Moreover, this lower bound holds even if we are promised that |µ| (1/δ)C for a sufficiently large constant C, and σ = 1. Note that this lower bound immediately implies the same lower bound for general dimension d. Theorem G.2. [KMS22a] Let α be at most a sufficiently small constant, and let δ αε d C for a sufficiently large constant C. Then, (ε, δ)-privately learning a d-dimensional Gaussian N(µ, Σ) up to total variation distance α requires Ω d2 αε samples. Moreover, this lower bound holds even if we are promised that µ = 0, and I Σ 2I. Note that a tighter version of the above theorem has been proved in [Nar23, PH24]; we also refer the reader to [BUV14]. Proof Sketch of Theorem 1.6. First, the lower bound of kd2 α2 is already known see [ABH+18]. The lower bound of kd2 αε will follow from Theorem G.2. To explain how, we consider k distinct Gaussians N(µi, Σi), where the means µi are known and very far away from each other, and I Σi 2I are unknown. The overall mixture that we will try to learn is the uniform mixture over N(µi, Σi), i.e., every weight wi = 1/k. By making them very far away from each other, we are making learning the full mixture equivalent to learning each component (on average). Namely, even if we are given the information of which Gaussian each sample comes from, we will need to learn at least 2/3 of the Gaussians up to total variation distance O(α), to learn the full mixture up to total variation distance α. Hence, we will need at least k times as many samples as for learning a single Gaussian, which means we need Ω kd2 αε total samples. The lower bound of k log(1/δ) αε will follow from Theorem G.1. Note that it suffices to prove the lower bound in the univariate case. We plant k distinct Gaussians N(µi, 1), where the µi are very far away from each other, i.e., pairwise |µi µj| (1/δ)10C. We also assume that µ1 = 0 is known, and the remaining µi are unknown but we are promised the value of each µi up to error (1/δ)C. The overall mixture will have the first Gaussian N(0, 1) of weight w1 = 1 α/c , and the remaining Gaussians N(µi, 1) each have weight wi = α/(c (k 1)). Even if we are given the information of which Gaussian component each sample comes from, to learn the overall mixture up to error α, we need to learn at least 2/3 of the small-weight components, each up to total variation distance O(c ). Hence, we will need log(1/δ) ε samples from most of the small-weight components. Since the small weight components have weight Θ(α/k), we need Ω k log(1/δ) αε total samples. We now give a formal proof of Theorem 1.6. 
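As an illustration of the packing construction in the sketch above (a toy data generator with parameters of our choosing, not the paper's), the code below builds the uniform mixture of $k$ far-apart Gaussians used for the $kd^2/(\alpha\varepsilon)$ term, with means $M \cdot i \cdot e_1$ and covariances satisfying $I \preceq \Sigma_i \preceq 2I$. Because the means are separated by $M \gg \sqrt{d}$, each sample's component is identified by its first coordinate, which plays the role of the side information assumed in the reduction.

```python
import numpy as np

def hard_instance(k, d, M, rng):
    """Uniform mixture of k Gaussians N(M*i*e_1, Sigma_i) with I <= Sigma_i <= 2I."""
    means = [M * i * np.eye(d)[0] for i in range(k)]
    covs = []
    for _ in range(k):
        Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal basis
        evals = rng.uniform(1.0, 2.0, size=d)               # eigenvalues in [1, 2]
        covs.append(Q @ np.diag(evals) @ Q.T)
    return means, covs

def sample(n, means, covs, rng):
    k = len(means)
    labels = rng.integers(0, k, size=n)                      # uniform mixing weights
    X = np.array([rng.multivariate_normal(means[z], covs[z]) for z in labels])
    return X, labels

rng = np.random.default_rng(4)
k, d, M = 5, 10, 1000.0
means, covs = hard_instance(k, d, M, rng)
X, labels = sample(2000, means, covs, rng)
# Component identity is recovered exactly from the first coordinate:
recovered = np.rint(X[:, 0] / M).astype(int)
print(np.mean(recovered == labels))   # ~1.0
```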
G.1 Formal Proof of Theorem 1.6 First, we note two lemmas that will be helpful in proving the theorem. Lemma G.3. Let α ∈ [0, 1). Let f be a probability density function over R^d. Assume there exists g : R^d → R_{≥0} such that ∫_{R^d} |f(x) − g(x)| dx ≤ α. Then there exists a probability density function h over R^d such that ∫_{R^d} |f(x) − h(x)| dx ≤ 2α. Proof. We know that |∫_{R^d} f(x) dx − ∫_{R^d} g(x) dx| ≤ ∫_{R^d} |f(x) − g(x)| dx ≤ α. Therefore, G := ∫_{R^d} g(x) dx ≥ 1 − α. If G = 1 we are done, so two cases remain: either G < 1 or G > 1. If G < 1, let h(x) be the density function corresponding to the following distribution: with probability G take a sample from g(x)/G, and with probability 1 − G output 0. Then ∫_{R^d} |f(x) − h(x)| dx ≤ (1 − G) + ∫_{R^d} |f(x) − g(x)| dx ≤ 2α, as desired. If G > 1, let h(x) be any density function with 0 ≤ h(x) ≤ g(x) for all x and ∫_{R^d} h(x) dx = 1; it is easy to see that such an h exists by greedily picking h. Then ∫_{R^d} |g(x) − h(x)| dx = ∫_{R^d} (g(x) − h(x)) dx = G − 1 ≤ α. Therefore, we can write ∫_{R^d} |f(x) − h(x)| dx ≤ ∫_{R^d} |f(x) − g(x)| dx + ∫_{R^d} |g(x) − h(x)| dx ≤ 2α, as desired. Lemma G.4. Suppose w ∈ (0, 1) is fixed. Suppose there exists an (ε, δ)-differentially private algorithm A that takes n samples from the mixture D = w·D1 + (1 − w)·D2 and learns D1 up to total variation distance α with success probability 1 − β, where each sample is labeled with the component it was drawn from. Then, for any γ ∈ (0, 1), there exists an (ε, δ)-differentially private algorithm that takes nw/γ samples from D1 and outputs an estimate of D1 up to total variation distance α, with success probability 1 − β − γ. Proof. We can view the sampling procedure for the mixture as drawing t ∼ Bin(n, w) and taking t samples from D1 and n − t samples from D2. To obtain an algorithm using nw/γ samples, take that many samples from D1, sample t ∼ Bin(n, w), and run A on n data points constructed as follows: if t ≤ nw/γ, use t of the samples taken from D1 and set the remaining n − t points to 0; if t > nw/γ, run A on n zeroes. Finally, output the output of A on this input. Clearly, this is (ε, δ)-differentially private. We know that A succeeds with probability 1 − β, over its random coins and the randomness of the sampling, whenever its input is drawn from w·D1 + (1 − w)·D2. By Markov's inequality, P[t ≥ nw/γ] ≤ γ. Therefore, with probability at least 1 − γ, our constructed input is distributed as n i.i.d. samples from w·D1 + (1 − w)·D2, where D2 is the distribution that always outputs 0. Therefore, there exists an (ε, δ)-differentially private algorithm that takes nw/γ samples from D1 and outputs an estimate of D1 up to total variation distance α with success probability 1 − β − γ, as desired. We now prove Theorem 1.6. Proof. We prove the lower bound terms one by one. The first (non-private) term, kd^2/α^2, is known from previous work [ABH+18]. We prove the lower bounds for the second and last terms using Theorems G.2 and G.1, respectively. We start with the second term, kd^2/(αε). Theorem G.2 implies that for any α smaller than a sufficiently small constant and δ ≤ (αε/d)^C for a sufficiently large C, any (ε, δ)-differentially private algorithm A taking n samples in R^d that satisfies, for every Σ with I ⪯ Σ ⪯ 2I, P_{X ∼ N(0,Σ)^n, A's internal random bits}[d_TV(A(X), N(0, Σ)) ≤ α] > 0.6, must also satisfy n = Ω(d^2/(αε)). Let μ1, . . . , μk be k distinct vectors in R^d, each pair at ℓ2 distance at least M from each other, for M to be set later. To see why such a set exists, we can take μi = M·i·e1, where e1 is the first standard basis vector.
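The reduction of Lemma G.4, which both arguments below invoke at the end, can be sketched as follows. This is a minimal sketch assuming numpy; the mixture learner A is treated as a black box, and `simulate_mixture_input` is an illustrative name.

```python
import numpy as np

def simulate_mixture_input(samples_from_D1, n, w, rng=None):
    """Sketch of the reduction in Lemma G.4: given roughly n*w/gamma labelled samples
    from D1, fabricate an n-point input for the mixture learner A by drawing
    t ~ Bin(n, w), using t genuine D1 samples, and filling the remaining n - t slots
    with the all-zeros point (playing the role of D2).  If t exceeds the available
    budget (probability at most gamma, by Markov), the input degenerates to all zeros."""
    rng = rng or np.random.default_rng()
    budget, d = samples_from_D1.shape
    t = rng.binomial(n, w)
    X = np.zeros((n, d))
    if t <= budget:
        X[:t] = samples_from_D1[:t]
    return X
```

Running A on this fabricated input and returning its estimate of D1 costs only the nw/γ genuine samples, which is how the per-component lower bounds are transferred to the mixture.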
Now consider the following set of Gaussians: Di = N(µi, Σi), where µi s are known and constructed as above and Σi unknown. Consider the uniform mixture D over these Gaussians, with weights wi = 1/k. We also assume that when sampling from this distribution we know that which component the samples came from. Consider an (ε, δ) differentially private algorithm A that takes n samples from D and outputs a distribution ˆD such that d TV(D, ˆD) α with probability 2/3. Now consider a sample from Di, from standard Gaussian tail bounds we know that at least 1 exp( M 2/800) fraction of the mass of Di is contained within a ball of radius M/10, around µi. Let Bi denote this ball, and note that Bi s are disjoint. Let f, fi, ˆf be the probability density functions corresponding to D, Di, ˆD respectively. Assuming, we are in the success regime, we can write 2α 2 d TV(D, ˆD) = Z f(x) ˆf(x) dx f(x) ˆf(x) dx. Now let B [k] be the set of indices i such that R f(x) ˆf(x) dx 200α/k, and G be its complement. Then we have that |B| k/100. Therefore, there exists G such that |G| 0.99k, and i G : R f(x) ˆf(x) dx 200α/k. Assume i G is one such index. Note that f(x) = Pk j=1 fj(x)/k. Therefore, we can write Z fi(x)/k ˆf(x) dx Z f(x) ˆf(x) dx + Z Bi |fi(x) f(x)| dx 200α/k + max j =i PX Dj [X / Bj] 200α/k + exp( M 2/800) 300α/k, where the last inequality comes from taking M 30 p log(k/100α). We show that using the distribution ˆD, we can construct an answer to the problem of learning the Gaussian Di. To do so first take gi to be equal to k ˆf(x) over Bi and 0 everywhere else. We have Rd |fi(x) gi(x)| dx = Z fi(x) k ˆf(x) dx + Z Rd\Bi fi(x) dx 300α + 100α/k 400α. Now we may apply Lemma G.3, and deduce that given ˆD we can construct probability density functions and distributions ˆDi s such that d TV( ˆDi, Di) 800α, for all i G. To recap, so far we have shown that given an (ε, δ) differentially private algorithm that takes as inputs samples from our constructed mixture of Gaussians and outputs a density ˆD that has total variation distance at most α, from the ground truth distribution with success probability 2/3, we can use ˆD to construct densities ˆDi such that d TV( ˆDi, Di) 800α, for 0.99k of the indices i. This implies that there exists a fixed index i for which the component Di is learned up to error 800α with success probability 2/3 0.01 0.65. Applying Lemma G.4, implies that there must exist an (ε, δ) differentially private algorithm that takes 100n/k samples from Di and estimates its density up to total variation distance 800α, with success probability at least 0.6. Therefore, applying Theorem G.2, we conclude that n = Ω( kd2 Now let s prove the last term k log(1/δ) αε . We aim to apply Theorem G.1. Let µi s be k distinct values in R, each having distance M from each other, for M (1/δ)10C to be set later, where µ1 = 0. It is easy to see such a set exists. Now consider the following set of Gaussians: Di = N(µi, 1), where µi s are known up to log(1/δ)C, and µ1 = 0 is also known. Consider the mixture D over these Gaussians, with weights w1 = 1 α/c , and wi = α/(c (k 1)). We also assume that when sampling from the mixture we know which component each sample comes from. Consider an (ε, δ) differentially private algorithm A that takes n samples from D and outputs a distribution ˆD such that d TV(D, ˆD) α with probability 2/3. Now consider a sample from Di, from standard Gaussian tail bounds we know that at least 1 exp( M 2/200) fraction of the mass of Di is contained within a ball of radius M/10, around µi. 
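For concreteness, the planted mixtures used in this argument and the next can be sampled as follows. The code is an illustrative sketch only: it assumes numpy, takes the means, covariances, and weights as explicit inputs, and `sample_planted_mixture` is our name; it reflects the assumption that each sample arrives labelled with its component.

```python
import numpy as np

def sample_planted_mixture(mus, Sigmas, weights, n, rng=None):
    """Sketch of the hard instances: k Gaussians N(mu_i, Sigma_i) with means spaced
    far apart (mu_i = M*i*e_1 in the text) and the given mixture weights
    (uniform 1/k for the kd^2/(alpha*eps) term; w_1 = 1 - alpha/c' and
    w_i = alpha/(c'(k-1)) for the k*log(1/delta)/(alpha*eps) term).
    Returns labelled samples, matching the assumption that the component of each
    sample is revealed to the learner."""
    rng = rng or np.random.default_rng()
    k = len(mus)
    labels = rng.choice(k, size=n, p=weights)
    X = np.stack([rng.multivariate_normal(mus[i], Sigmas[i]) for i in labels])
    return labels, X
```

We now return to the ball of radius M/10 around each μi.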
Let Bi denote this ball, and note that Bi s are disjoint. Let f, fi, ˆf be the probability density functions corresponding to D, Di, ˆD respectively. Assuming, we are in the success regime, similar to the proof of the previous term, we can show that there exists a set G of indices such that |G| 0.99k, and i G : R f(x) ˆf(x) dx 200α/k. Moreover, with a similar argument as the previous term for i G we can say as long as M 20 p log(k/100α), R wifi(x) ˆf(x) dx 300α/k. We show that using the distribution ˆD, we can construct an answer to the problem of learning the Gaussian Di, for i = 1. To do so take gi to be equal to ˆf(x)/wi, over Bi and 0 everywhere else. We have Rd |fi(x) gi(x)| dx = Z fi(x) ˆf(x)/wi dx+ Z Rd\Bi fi(x) dx 300α Now we may apply Lemma G.3, and deduce that given ˆD, we can construct probability density functions and distributions ˆDi s such that d TV( ˆDi, Di) 800c , for all i G. To recap, so far we have shown that given an (ε, δ) differentially private algorithm that takes as inputs samples from our constructed mixture of Gaussians and outputs density ˆD that has total variation distance at most α from the ground truth distribution with success probability 2/3, we can use ˆD to construct densities ˆ Di such that d TV( ˆ Di, Di) 800c , for 0.99k of the indices i. This implies that there exists a fixed index i = 1, for which the component Di is learned up to error 800c , with success probability 2/3 0.01 0.65. Applying Lemma G.4, implies that there must exist an (ε, δ) differentially private algorithm that takes 100nwi samples from Di and estimates its density up to total variation distance 800c , with success probability 0.6. Therefore, applying Theorem G.1, and noting that wi = α/(c (k 1)) we conclude that n = Ω( k log(1/δ) H Omitted Proofs In this section, we prove Proposition B.2, Theorem B.3, and Lemma C.7. H.1 Proof of Proposition B.2 First, we note a basic fact. Fact H.1. For any x, y [0.9, 1.1], we have that (xy 1)2 4((x 1)2 + (y 1)2). Proof. Note that |xy 1| = |x 1 + x(y 1)| |x 1| + |x| |y 1| 2 (|x 1| + |y 1|). Thus, (xy 1)2 2(|x 1| + |y 1|)2 4((x 1)2 + (y 1)2). We now prove Proposition B.2. Proof. Let J1 := Σ 1/2 2 Σ1/2 1 and J2 := Σ 1/2 3 Σ1/2 2 . First, assume (µ1, Σ1) γ,ρ,τ (µ2, Σ2). This means J1J 1 I op γ and J1J 1 I F ρ. We then have that Σ 1/2 1 Σ2Σ 1/2 1 = J 1 1 (J 1 1 ) = (J 1 J1) 1. Now, note that J1J 1 and J 1 J1 are both symmetric and have the same eigenvalues. If we call these eigenvalues λ1, . . . , λd, then the eigenvalues of Σ 1/2 1 Σ2Σ 1/2 1 are λ 1 1 , . . . , λ 1 d . Now, our assumption (µ1, Σ1) γ,ρ,τ (µ2, Σ2) implies that 1 γ λi 1+γ and P(1 λi)2 ρ2. This means that, assuming γ 0.1, 1 2γ 1 1+γ λ 1 i 1 1 γ 1 + 2γ, and P 1 λ 1 i 2 P(1 λi)2 λ 2 i 2 P(1 λi)2 = 2ρ2. This means that Σ 1/2 1 Σ2Σ 1/2 1 op 2γ and Σ 1/2 1 Σ2Σ 1/2 1 F 2ρ. Finally, Σ 1/2 1 (µ1 µ2) = J 1 1 Σ 1/2 2 (µ1 µ2). Because Σ 1/2 2 (µ1 µ2) has magnitude at most τ by our assumption, J 1 1 Σ 1/2 2 (µ1 µ2) has magnitude at most the maximum singular value of J 1 1 times τ. But every singular value of J 1 1 is some λ 1/2 i which is at most 2, so Σ 1/2 1 (µ2 µ1) 2 = J 1 1 Σ 1/2 2 (µ1 µ2) 2 2τ. Next, assume (µ1, Σ1) γ,ρ,τ (µ2, Σ2) and (µ2, Σ2) γ,ρ,τ (µ3, Σ3). First, note that Σ 1/2 3 Σ1Σ 1/2 3 = J2J1J 1 J 2 = (J2J1)(J2J1) . If the eigenvalues of J1J 1 are {λi} and the eigenvalues of J2J 2 are {λ i}, then the singular values of J1 and J2 are { λi} and { p λ i}, respectively. By our assumption, 1 γ λi, λ i 1+γ, which means that 1 γ λi, p λ i 1 + γ. 
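In particular, since the largest singular value of a product is at most the product of the largest singular values, and (for invertible square matrices, as here) the smallest singular value of a product is at least the product of the smallest singular values, we obtain the bound used in the next step:

```latex
\[
\sigma_{\max}(J_2 J_1) \;\le\; \sigma_{\max}(J_2)\,\sigma_{\max}(J_1) \;\le\; 1+\gamma,
\qquad
\sigma_{\min}(J_2 J_1) \;\ge\; \sigma_{\min}(J_2)\,\sigma_{\min}(J_1) \;\ge\; 1-\gamma.
\]
```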
Thus, the singular values of J2J1 are between 1 γ and 1 + γ, which means that the eigenvalues of (J2J1)(J2J1) are between (1 γ)2 and (1 + γ)2. Hence, for γ 0.1, J2J1J 1 J 2 I op 4γ. Assume that λi, λ i are in decreasing order. We now consider the kth largest singular value of J2J1. If σk := λk is the kth largest singular value of J1 and σ k := p λ k is the kth largest singular value of J2, by Corollary A.5 there exist subspaces Vk, V k of dimension d k +1 such that J1v 2 σk v 2 for all v Vk and J2v 2 σ k v 2 for all v V k. Therefore, for every v Vk J 1 1 V k (note that J1 is invertible since J1J 1 has all eigenvalues between 1 γ and 1 + γ) we have that J2J1v 2 σ k J1v 2 σkσ k v 2. Because Vk and J 1 1 V k both have dimension d k + 1, their intersection has dimension at least d 2k +2. So, there is a subspace of dimension at least d 2k +2 such that every v in the subspace has J2J1v 2 σkσ k v 2. Thus, the (2k 1)th largest singular value of J2J1 is at most σkσ k, so the (2k 1)th largest eigenvalue of J2J1J 1 J 2 is at most λkλ k. The same argument, looking at the smallest singular values, tells us that the (2k 1)th smallest eigenvalue of J2J1J 1 J 2 is at least λd k+1λ d k+1. Thus, for any t, the tth largest eigenvalue of J2J1J 1 J 2 is at most λ (t+1)/2 λ (t+1)/2 and at least λ (d+t)/2 λ (d+t)/2 . Overall, this means that J2J1J 1 J 2 I 2 F t=1 max (λ (t+1)/2 λ (t+1)/2 1)2, (λ (d+t)/2 λ (d+t)/2 1)2 (λ (t+1)/2 λ (t+1)/2 1)2 + (λ (d+t)/2 λ (d+t)/2 1)2 i=1 (λiλ i 1)2 i=1 (λi 1)2 + 8 i=1 (λ i 1)2 8 ( J1J 1 I 2 F + J2J 2 I 2 F ) 16ρ2, where the fourth line uses Fact H.1. Thus, Σ 1/2 3 Σ1Σ 1/2 3 I F 4ρ. Finally, note that Σ 1/2 3 (µ1 µ3) 2 Σ 1/2 3 (µ1 µ2) 2+ Σ 1/2 3 (µ2 µ3) 2 = J2Σ 1/2 2 (µ1 µ2) 2 + Σ 1/2 3 (µ2 µ3) 2. By our assumptions, both Σ 1/2 2 (µ1 µ2) 2 and Σ 1/2 3 (µ2 µ3) 2 are at most τ, and J2 has operator norm at most 1.1 2, which means that Σ 1/2 3 (µ1 µ3) 2 3τ. H.2 Proof of Theorem B.3 First, we note a series of known results that will be key to proving the theorem. We start with the bound for robust covariance estimation in spectral error. Lemma H.2 (e.g., [DK22, Exercise 4.3]). Fix any η (0, η0), where η0 < 0.01 is a small universal constant, and fix any β (0, 1). There is a (deterministic, inefficient) algorithm A1 with the following property. Let Σ Rd d be any covariance matrix, and let X = {X1, . . . , Xn} N(0, Σ), where n O((d+log(1/β))/η2). Then, with probability at least 1 β over the randomness of X, for any ηcorruption X = {X 1, . . . , X n} of X, A1(X ) outputs ˆΣ1 such that Σ 1/2 ˆΣ1Σ 1/2 I op O(η). Importantly, A1 may have knowledge of η and β, but does not have knowledge of X or Σ. Next, we prove how to robustly estimate the covariance up to Frobenius error. We start with the following structural lemma. Lemma H.3. There exists a universal constant c (0, 0.1) with the following property. Fix any 1 k d, and let n e O(d k) be a sufficiently large (i.e., n dk (C3 log(dk))C4 for some absolute constants C3, C4). If we sample i.i.d. X = {X1, . . . , Xn} N(0, I), then with probability at least 1 e cn over X, for all symmetric matrices P Rd d of rank at most k and Frobenius norm 1, and for all subsets S [n] of size at least (1 c) n, 1 i S Xi X i I, P 0.1. Proof. Consider any fixed P Rd d of rank k and Frobenius norm 1, and fix any integer m. For any data points X1, . . . , Xm i.i.d. N(0, I), let X = (X1, . . . , Xm) Rm d be the concatenation of X1, . . . , Xm, and let Q R(md) (md) be the block matrix P 0 0 0 P 0 ... ... ... ... 
0 0 P m Pm i=1 Xi X i I, P = 1 m Pm i=1 X i PXi Tr(P) = 1 m X QX Tr(Q) . By the Hanson-Wright inequality, we have that for any fixed P F 1, and for any t 1, P 1 m X QX Tr(Q) > t = P X QX Tr(Q) > m t 2 exp min Ω(mt)2 m P 2 F , Ω(mt) 2 exp Ω(m t2) . By setting t = 0.01, we have that for some universal constant c1, i=1 Xi X i I, P Now, we draw X1, . . . , Xn N(0, I), and take a union bound over all subsets S [n] of size at least (1 c)n (with m = |S|) and a union bound over a net of possible matrices P. The number of options for S is at most P i cn n i (e/c)cn. For P, we can choose a 1/n10-sized net over the Frobenius norm metric (i.e., the distance between two matrices P1, P2 is P1 P2 F ) for each of the k nonzero eigenvalues and eigenvectors in the unit d-dimensional sphere, which has size at most n100d k. Therefore, by a union bound, the probability that 1 i S Xi X i I, P 0.01 for every |S| (1 c)n and every P in the net is at least 1 (2e/c)cn n100d k e c1n/2. Finally, we consider P outside of the net. For any symmetric P of rank k and Frobenius norm 1, it has Frobenius distance at most 1/n8 from some P in the net. Let us consider the event that every Xi 2 2 10n, which for n d occurs with failure probability at most 2n e c1n by Hanson-Wright. Under this event, Xi X i I, P P Xi X i I F P P F (10n + d)/n8 0.01. As a result, with failure probability at most (2e/c)cn n100d k e c1n/2+2n e c1n e cn (assuming c is sufficiently small), we have both properties. Namely, 1 n P i S Xi X i I, P 0.01 for every |S| (1 c)n and every P in the net, and for any P, Xi X i I, P P 0.01 for the closest P in the net to P. Overall, this means that for all such P and S, 1 n P i S Xi X i I, P 0.1. We prove another lemma which contrasts with Lemma H.3. Lemma H.4. Fix any 2 ρ d. Let k = 4d/ρ2, and let n e O(d k) and c (0, 0.1) be as in Lemma H.3. Fix any covariance matrix Σ and let X = {X1 . . . , Xn} N(0, Σ). Then, with probability at least 1 e cn, for every eΣ such that 0.95 Σ eΣ 1.05 Σ and Σ 1/2eΣΣ 1/2 I F ρ, for every symmetric matrix P Rd d of rank at most k and Frobenius norm 1, and for every S [n] of size at least (1 c)n, 1 n i S eΣ 1/2Xi X i eΣ 1/2 I, P Proof. Let J = eΣ 1/2Σ1/2, and let Yi = Σ 1/2Xi. Then, eΣ 1/2Xi = JYi, which means that eΣ 1/2Xi X i eΣ 1/2 I = JYi Y i J I = J(Yi Y i I)J + (JJ I). From now on, we assume that for all S [n] of size at least (1 c) n, and for all symmetric matrices P of rank k and Frobenius norm 1, 1 i S Yi Y i I, P 0.1. This happens with at least e cn probability, by Lemma H.3. Now, for any subset S [n], 1 n i S eΣ 1/2Xi X i eΣ 1/2 I, P = 1 J(Yi Y i I)J , P + JJ I, P i S Tr(J(Yi Y i I)J P) i S Yi Y i I, J PJ n JJ I, P . Now, note that by our assumptions, Σ 1/2eΣΣ 1/2 I op 0.05 and Σ 1/2eΣΣ 1/2 I F ρ. Thus, by Proposition B.2, eΣ 1/2ΣeΣ 1/2 I op 0.1, and by Proposition B.2 again, applied the reverse direction this time, eΣ 1/2ΣeΣ 1/2 I F ρ/2. We just showed JJ I op 0.1, so by Proposition A.6, J PJ F 2 P F 2. Therefore, 1 i S Yi Y i I, J PJ 0.2. Conversely, we just showed JJ I F ρ/2. So, if we order the eigenvalues of JJ as λ1, λ2, . . . , λd (and the corresponding unit eigenvectors v1, . . . , vd) such that |λi 1| are in decreasing order, then Pd i=1(λi 1)2 ρ2/4, which means that P4d/ρ2 i=1 (λi 1)2 1. So, if we choose P to i=1 (λi 1)2 1/2 P4d/ρ2 i=1 (λi 1)viv i , we have that P F = 1 and i=1 (λi 1)2 i=1 (λi 1)2 = i=1 (λi 1)2 i S eΣ 1/2Xi X i eΣ 1/2 I, P |S| n 0.2 1 c 0.2 0.7. Now, we can show how to learn the covariance of Σ up to Frobenius error. Lemma H.5. 
Fix any η (0, η0), where η0 < 0.01 is a small universal constant, any β (0, 1), and any e O(η) ρ d. There is a (deterministic, possibly inefficient) algorithm A2 with the following property. Let Σ Rd d be any covariance matrix, and let X = {X1, . . . , Xn} N(0, Σ), where n O d2+log2(1/β) ρ2 + log( 1 β ) . Then, with probability at least 1 β over the randomness of X, for any η-corruption X = {X 1, . . . , X n} of X, A2(X ) outputs ˆΣ2 such that Σ 1/2 ˆΣ2Σ 1/2 I F O(ρ). Importantly, A2 may have knowledge of η, ρ, and β, but does not have knowledge of X or Σ. Proof. In the case that ρ 2, the claim follows immediately from known results (for instance, it is implicit from [HKMN23]). Alternatively, assume that ρ 2. In this case, the algorithm works as follows. Assume η η0 c/2, where c is the constant in Lemma H.3. First, compute ˆΣ1 based on Lemma H.2. Note that (1 O(η)) Σ ˆΣ1 (1 + O(η)) Σ with 1 β probability, since the number of samples is sufficiently large. Now, find any positive definite eΣ and a set T [n] of size at least (1 η)n, such that: (1 O(η)) ˆΣ1 eΣ (1 + O(η)) ˆΣ1. for any S T of size at least (1 2η)n, 1 i S eΣ 1/2Xi X i eΣ 1/2 I, P 0.2. First, we note that eΣ = Σ is a feasible choice of eΣ. Indeed, the first condition trivially holds. For the second condition, let T be the subset of uncorrupted data points (i.e., X i = Xi). Then, for any S T, the data points X i for i S are the same as Xi, so by Lemma H.3, with 1 β probability, for every such S, 1 i S Σ 1/2Xi X i Σ 1/2 I, P 0.1. Next, we show that every eΣ with Σ 1/2eΣΣ 1/2 I F ρ is infeasible. First, we may assume that 0.95Σ eΣ 1.05Σ, as otherwise, we cannot simultaneously satisfy (1 O(η)) ˆΣ1 eΣ (1 + O(η)) ˆΣ1 and (1 O(η)) Σ ˆΣ1 (1 + O(η)) Σ, assuming η c/2 is sufficiently small. Hence, we just have to verify the infeasibility for every eΣ such that Σ 1/2eΣΣ 1/2 I F ρ and 0.95Σ eΣ 1.05Σ. Indeed, for any subset T of size at least (1 η)n, let S be the uncorrupted points in T. Because there are at most ηn uncorrupted points, |S| (1 2η)n. So by Lemma H.4, with 1 β probability, for every such S, 1 i S eΣ 1/2Xi X i eΣ 1/2 I, P 0.8. Therefore, with at most O(β) failure probability, some ˆΣ2 := eΣ is returned, and it satisfies Σ 1/2eΣΣ 1/2 I F ρ. Next, we note some results on robust mean estimation. We first note the definition of stability, and some properties. Lemma H.6 ([DK22, Proposition 3.3]). Let n O((d + log(1/β))/α2), for some α log 1/η). Let X = {Xi}n i=1 i.i.d. D, where D is a subgaussian random variable with mean µ Rd and covariance I. Then, with probability 1 β, for all vectors b [0, 1]n such that Eibi 1 η and all unit vectors v Rd, we have: 1. |Eibi v, Xi µ | α. Eibi v, Xi µ 2 1 α. Given a dataset X with these properties, call it (η, α)-stable with respect to µ. Lemma H.7 (implicit from [DK22, Section 2]). Fix η sufficiently small and α = e O(η). There is a deterministic algorithm A3 that, on a dataset X , outputs ˆµ such that ˆµ µ 2 O(α), for any η-corruption X of any X that is (η, α)-mean stable with respect to any µ Rd. Importantly, A3 does not require knowledge of X or µ. We now prove Theorem B.3. Proof. We first show how to estimate the covariance Σ. First, we apply a sample pairing trick (e.g., see [DK22, Section 4.4]). Namely, assume WLOG that n is even, and define X to be the set {(X2i 1 X2i)/ 2}n/2 i=1, and X = {(X 2i 1 X 2i)/ 2}n/2 i=1. Note that X are i.i.d. samples from N(0, Σ), and X is at most 2η-corrupted. 
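A minimal code sketch of this pairing step, assuming the data is an (n, d) numpy array; `pair_samples` is an illustrative name.

```python
import numpy as np

def pair_samples(X):
    """Sample-pairing trick: map X_1, ..., X_n ~ N(mu, Sigma) to
    (X_{2i-1} - X_{2i}) / sqrt(2), which are i.i.d. N(0, Sigma) regardless of mu.
    An eta-corrupted input yields an (at most) 2*eta-corrupted paired dataset,
    since each corrupted point touches only one pair."""
    m = len(X) - (len(X) % 2)          # the proof assumes n is even WLOG
    return (X[:m:2] - X[1:m:2]) / np.sqrt(2.0)
```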
Now, because η γ, X is at most 2γ corrupted, so Lemma H.2 on X (replacing η with 2γ) gives us some ˆΣ1 such that Σ 1/2 ˆΣ1Σ 1/2 I op O(γ), by our assumed bound on the number of samples. Next, Lemma H.5 on X gives us some ˆΣ2 such that Σ 1/2 ˆΣ2Σ 1/2 I F O(ρ), by our assumed bound on the number of samples. So, we can set ˆΣ to be any covariance such that ˆΣ 1/2 ˆΣ1 ˆΣ 1/2 I op O(γ) and ˆΣ 1/2 ˆΣ2 ˆΣ 1/2 I F O(ρ). Note that Σ satisfies these properties, and any ˆΣ that satisfies these properties must satisfy Σ 1/2 ˆΣΣ 1/2 I F O(γ) and Σ 1/2 ˆΣΣ 1/2 I F O(ρ), by the approximate symmetry and transitivity properties (Proposition B.2). Now, we estimate the mean µ. Taking the original data X , we compute {ˆΣ 1/2X i}. By stability (Lemma H.7), we know that with 1 β probability, {Σ 1/2Xi} is (η, γ)-stable with respect to µ (where we are using the uncorrupted data and the true covariance Σ). Letting J = ˆΣ 1/2Σ1/2, we know that J has all singular values between 1 O(γ) and 1 + O(γ), and that {J 1 ˆΣ 1/2Xi} is (η, γ)-stable with respect to µ. Moreover, note that we can write v, ˆΣ 1/2(Xi µ) = Jv, J 1 ˆΣ 1/2(Xi µ) , and that 1 O(γ) Jv 2 1+O(γ). Therefore, {ˆΣ 1/2Xi} is (η, O(γ))-stable with respect to ˆΣ 1/2µ, which means that by Lemma H.7, A3 on {ˆΣ 1/2X i} outputs some value ˆν such that ˆν ˆΣ 1/2µ 2 O(γ). Thus, by setting ˆµ := ˆΣ1/2 ˆν, we have that ˆΣ 1/2(ˆµ ν) 2 O(γ), which means that Σ 1/2(ˆµ µ) 2 = J 1 ˆΣ 1/2(ˆµ µ) 2 (1 + O(γ)) ˆΣ 1/2(ˆµ µ) 2 O(γ). H.3 Proof of Lemma C.7 In this subsection, we prove Lemma C.7. Proof. First, note that for any positive definite matrices Σ1, Σ2, Σ 1/2 2 Σ1Σ 1/2 2 I op γ is equivalent to (1 γ)I Σ 1/2 2 Σ1Σ 1/2 2 (1 + γ)I. Since M being PSD implies AMA is PSD (and vice versa if A is invertible), the previous statement is thus equivalent to (1 γ) Σ2 Σ1 (1 + γ)Σ2. Next, note that Σ 1/2 2 Σ1Σ 1/2 2 I F = Tr (Σ 1/2 2 Σ1Σ 1/2 2 I)2 Tr Σ 1/2 2 Σ1Σ 1 2 Σ1Σ 1/2 2 2 Σ 1/2 2 Σ1Σ 1/2 2 + I Tr Σ1Σ 1 2 Σ1Σ 1 2 2 Σ1Σ 1 2 + I , (4) where the first two lines are a straightforward expansion, and the final line simply uses the fact that Tr(AB) = Tr(BA) for any matrices A, B. Finally, note that Σ 1/2 2 (µ1 µ2) 2 = q (µ1 µ2) Σ 1 2 (µ1 µ2). (5) Now, consider replacing Σ1 with Σ3 := Σ1/2Σ1Σ1/2, Σ2 with Σ4 := Σ1/2Σ2Σ1/2, µ1 with µ3 := Σ1/2µ1 +µ, and µ2 with µ4 := Σ1/2µ2 +µ. Again, since M being PSD implies AMA is PSD (and vice versa), we have that (1 γ) Σ2 Σ1 (1+γ) Σ2 if and only if (1 γ) Σ4 Σ3 (1+γ) Σ4. Moreover, note that Tr(Σ3Σ 1 4 Σ3Σ 1 4 ) = Tr(Σ1/2Σ1Σ 1 2 Σ1Σ 1 2 Σ 1/2) = Tr(Σ1Σ 1 2 Σ1Σ 1 2 ) and Tr(Σ3Σ 1 4 ) = Tr(Σ1/2Σ1Σ 1 2 Σ 1/2) = Tr(Σ1Σ 1 2 ), which means that (4) would be the same if we replaced Σ1 with Σ3 and Σ2 with Σ4. Finally, (µ3 µ4) Σ 1 4 (µ3 µ4) = (µ1 µ2) Σ1/2(Σ 1/2Σ 1 2 Σ 1/2)Σ1/2(µ1 µ2) = (µ1 µ2) Σ 1 2 (µ1 µ2), which means that (4) would be the same if we replaced µ1 with µ3, µ2 with µ4, Σ1 with Σ3, and Σ2 with Σ4. Overall, by the definition of γ,ρ,τ, we have that (µ1, Σ1) γ,ρ,τ (µ2, Σ2) if and only if (µ3, Σ3) γ,ρ,τ (µ4, Σ4). Neur IPS Paper Checklist Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: We give complete proofs for all theorem statements. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: We discuss limitations about the data assumptions and practical algorithms, at the end of the main body. 3. 
Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: We give complete proofs for all theorems. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [NA] Justification: No experiments. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA] Justification: No experiments. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [NA] Justification: No experiments. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [NA] Justification: No experiments. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [NA] Justification: No experiments. 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: Our work is entirely theoretical, and we do not see any potential harms coming from our research. 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: Our work is entirely theoretical, and we do not see any potential negative societal impacts coming from our research. 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: Our work is entirely theoretical, and poses no such risks. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [NA] Justification: Our paper does not use any existing assets, such as code/data/models/etc. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: Our paper does not release any new assets 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: Our paper does not involve crowdsourcing or research with human subjects. 15. 
Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: Our paper does not involve crowdsourcing or research with human subjects.