# Privately Learning Mixtures of Axis-Aligned Gaussians

Ishaq Aden-Ali (Department of Computing and Software, McMaster University, adenali@mcmaster.ca), Hassan Ashtiani (Department of Computing and Software, McMaster University, zokaeiam@mcmaster.ca), Christopher Liaw (Department of Computer Science, University of Toronto, cvliaw@cs.toronto.edu)

Abstract. We consider the problem of learning mixtures of Gaussians under the constraint of approximate differential privacy. We prove that Õ(k²d log^{3/2}(1/δ)/(α²ε)) samples are sufficient to learn a mixture of k axis-aligned Gaussians in ℝᵈ to within total variation distance α while satisfying (ε, δ)-differential privacy. This is the first result for privately learning mixtures of unbounded axis-aligned (or even unbounded univariate) Gaussians. If the covariance matrix of each of the Gaussians is the identity matrix, we show that Õ(kd/α² + kd log(1/δ)/(αε)) samples are sufficient. To prove our results, we design a new technique for privately learning mixture distributions. A class of distributions F is said to be list-decodable if there is an algorithm that, given heavily corrupted samples from f ∈ F, outputs a list of distributions one of which approximates f. We show that if F is privately list-decodable then we can learn mixtures of distributions in F. Finally, we show that axis-aligned Gaussian distributions are privately list-decodable, thereby proving that mixtures of such distributions are privately learnable.

## 1 Introduction

The fundamental problem of distribution learning concerns the design of algorithms (i.e., estimators) that, given samples generated from an unknown distribution f, output an approximation of f. While distribution learning has a long history [54], studying it under privacy constraints is relatively new and unexplored. In this paper, we work with a rigorous and practical notion of data privacy known as differential privacy [36].
Roughly speaking, differential privacy guarantees that no single data point can influence the output of an algorithm too much. Intuitively, this provides privacy by hiding the contribution of each individual. Differential privacy is the de facto standard for modern private analysis and has seen widespread impact in both industry and government [12, 22, 28, 29, 39].

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

In recent years, there has been a flurry of activity in differentially private distribution learning, and a number of techniques have been developed in the literature. In the pure differentially private setting, Bun et al. [17] introduced a method to learn classes of distributions that admit a finite cover, i.e., when the class of distributions is well-approximated by a finite number of distributions. They show that this is an exact characterization of the distributions that can be learned under pure differential privacy, in the sense that a class of distributions is learnable under pure differential privacy if and only if the class admits a finite cover [17, 41]. As a consequence of this result, they obtained pure differentially private algorithms for learning Gaussian distributions provided that the means of the Gaussians are bounded and the covariance matrices of the Gaussians are spectrally bounded.¹ Moreover, such restrictions on the Gaussians are necessary under the constraint of pure differential privacy. One way to remove the requirement of having a finite cover is to relax to a weaker notion of privacy known as approximate differential privacy. With this notion, Bun et al. [17] introduced a method to learn a class of distributions that, instead of requiring a finite cover, requires a locally small cover, i.e., a cover where each distribution in the class is well-approximated by only a small number of elements within the cover.
They prove that the class of Gaussians with arbitrary mean and a fixed, known covariance matrix has a locally small cover, which implies an approximate differentially private algorithm for learning this class of distributions. Later, Aden-Ali, Ashtiani, and Kamath [4] proved that the class of mean-zero Gaussians (with no assumptions on the covariance matrix) admits a locally small cover, leading to an approximate differentially private method for learning the class of all Gaussians. It is a straightforward observation that if a class of distributions admits a finite cover then the class of its mixtures also admits a finite cover. Combined with the aforementioned work of Bun et al., this implies a pure differentially private algorithm for learning mixtures of Gaussians with bounded means and spectrally bounded covariance matrices. It is natural to wonder whether an analogous statement holds for locally small covers. In other words, if a class of distributions admits a locally small cover, does the class of its mixtures also admit a locally small cover? If so, this would provide a fruitful direction for designing differentially private algorithms for learning mixtures of arbitrary Gaussians. Unfortunately, there are simple examples of classes of distributions that admit a locally small cover yet whose mixtures do not. This leaves open the question of designing private algorithms for many classes of distributions that are learnable in the non-private setting. One concrete open problem is the class of mixtures of two arbitrary univariate Gaussian distributions. A more general problem is private learning of mixtures of k axis-aligned (or general) Gaussian distributions.

### 1.1 Main Results

We demonstrate that it is indeed possible to privately learn mixtures of unbounded univariate Gaussians. More generally, we give sample complexity upper bounds for learning mixtures of unbounded d-dimensional axis-aligned Gaussians.
In the following theorem and the rest of the paper we use Õ to hide polylogarithmic factors, i.e., Õ(f(x)) means O(f(x) logᶜ f(x)) for some c > 0.

Theorem 1.1 (Informal). The sample complexity of learning a mixture of k d-dimensional axis-aligned Gaussians to α-accuracy in total variation distance under (ε, δ)-differential privacy is Õ(k²d log^{3/2}(1/δ)/(α²ε)).

The formal statement of this theorem can be found in Theorem 5.1. For technical reasons, we do require that δ ∈ (0, 1/n) for the above theorem to hold. This condition is quite standard in the differential privacy literature. Indeed, for useful privacy, δ should be "cryptographically small", i.e., δ ≪ 1/n. Even for the univariate case, our result is the first sample complexity upper bound for learning mixtures of Gaussians under differential privacy for which the variances are unknown and the parameters of the Gaussians may be unbounded. In the non-private setting, it is known that Θ̃(kd/α²) samples are necessary and sufficient to learn a mixture of k axis-aligned Gaussians in ℝᵈ [6, 61]. In the private setting, the best known sample complexity lower bound is Ω(d/(αε log d)) under (ε, δ)-DP when δ ≤ Õ(√d/n) [46]. Obtaining improved upper or lower bounds in this setting remains an open question. If the covariance matrix of each component of the mixture is the same and known or, without loss of generality, equal to the identity matrix, then we can improve the dependence on the parameters and obtain a result that is in line with the non-private setting.

¹When we say that a matrix Σ is spectrally bounded, we mean that there are 0 < a₁ ≤ a₂ such that a₁I ⪯ Σ ⪯ a₂I.

Theorem 1.2 (Informal). The sample complexity of learning a mixture of k d-dimensional Gaussians with identity covariance matrix to α-accuracy in total variation distance under (ε, δ)-differential privacy is Õ(kd/α² + kd log(1/δ)/(αε)).

We relegate the formal statement and the proof of this theorem to the supplementary materials (see Appendix F).
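To get a rough quantitative feel for the two informal bounds above, the sketch below evaluates their dominant terms for concrete parameters, dropping the constants and polylogarithmic factors hidden by Õ; the function names and parameter choices are purely illustrative.

```python
import math

def mixture_bound(k, d, alpha, eps, delta):
    """Dominant term of Theorem 1.1 (constants/polylogs dropped):
    k^2 * d * log^(3/2)(1/delta) / (alpha^2 * eps)."""
    return k**2 * d * math.log(1 / delta) ** 1.5 / (alpha**2 * eps)

def identity_cov_bound(k, d, alpha, eps, delta):
    """Dominant terms of Theorem 1.2 (constants/polylogs dropped):
    k*d/alpha^2 + k*d*log(1/delta)/(alpha*eps)."""
    return k * d / alpha**2 + k * d * math.log(1 / delta) / (alpha * eps)

# Example: 5 components in 100 dimensions, accuracy 0.1, (1, 1e-8)-DP.
n_general = mixture_bound(5, 100, 0.1, 1.0, 1e-8)
n_identity = identity_cov_bound(5, 100, 0.1, 1.0, 1e-8)
# For these parameters the known-covariance bound is far smaller,
# reflecting the improved dependence on k and alpha.
```

These are heuristic order-of-magnitude comparisons only; the hidden constants and logarithmic factors matter for any concrete sample size.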
Note that the work of [53] implies an upper bound of O(k²d³ log²(1/δ)/(α²ε²)) for private learning of the same class, albeit in the incomparable setting of parameter estimation.

Comparison with locally small covers. While the results in [4, 17] for learning Gaussian distributions under approximate differential privacy do not yield finite-time algorithms, they do give strong information-theoretic upper bounds. This is achieved by showing that certain classes of Gaussians admit locally small covers. It is thus natural to ask if we can obtain sharper results by showing that mixtures of Gaussians also admit locally small covers. Unfortunately, the following simple example shows that not even mixtures of two univariate Gaussians admit locally small covers.

Proposition 1.3 (Informal version of Proposition B.6). No cover for the class of mixtures of two univariate Gaussians is locally small.

### 1.2 Techniques

To prove our result, we devise a novel technique which reduces the problem of privately learning mixture distributions to the problem of private list-decodable learning of distributions. The framework of list-decodable learning was introduced by Balcan, Blum, and Vempala [8] and Balcan, Röglin, and Teng [9] in the context of clustering but has since been studied extensively in the literature in a number of different contexts [7, 20, 21, 26, 27, 49, 55, 56]. The problem of list-decodable learning of distributions is as follows. There is a distribution f of interest that we are aiming to learn. However, we do not receive samples from f; rather, we receive samples from a corrupted distribution g = (1 − γ)f + γh where γ ∈ (0, 1) and h is some arbitrary distribution. In our application, γ is close to 1, i.e., most of the samples are corrupted. The goal in list-decodable learning is to output a short list of distributions f₁, . . . , fₘ with the requirement that f is close to at least one of the fᵢ's.
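The corruption model just described is simple to simulate: each draw comes from f with probability 1 − γ and from the arbitrary distribution h otherwise. A minimal sketch, in which the particular choices of f, h, and γ are illustrative assumptions:

```python
import random

def sample_corrupted(sample_f, sample_h, gamma, rng):
    """One draw from g = (1 - gamma) * f + gamma * h.
    sample_f / sample_h are callables producing one draw each
    from the inlier distribution f and the corruption h."""
    return sample_f() if rng.random() < 1 - gamma else sample_h()

rng = random.Random(0)
gamma = 0.9  # heavy corruption: most samples do not come from f
draws = [
    sample_corrupted(lambda: rng.gauss(0.0, 1.0),      # inlier f = N(0, 1)
                     lambda: rng.uniform(50.0, 60.0),  # arbitrary outlier h
                     gamma, rng)
    for _ in range(10_000)
]
# only about a (1 - gamma) = 0.1 fraction of draws are inliers from f
inlier_fraction = sum(d < 10.0 for d in draws) / len(draws)
```

With γ this close to 1, only about one draw in ten comes from the distribution of interest, which is why a list of candidates, rather than a single estimate, is the natural output.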
The formal definition of list-decodable learning can be found in Definition 2.6. Informally, the reduction can be summarized by the following theorem, which is formalized in Section 3.

Theorem 1.4 (Informal). If a class of distributions F is privately list-decodable, then mixtures of distributions from F are privately learnable.

Roughly speaking, the reduction from learning mixture distributions to list-decodable learning works as follows. Suppose that there is an unknown distribution f which is a mixture of k distributions f₁, . . . , f_k. A list-decodable learner would then receive samples from f as input and output a short list of distributions F̂ so that for every fᵢ there is some element of F̂ that is close to fᵢ. In particular, some mixture of distributions from F̂ must be close to the true distribution f. Since F̂ is a small finite set, the set of possible mixtures must also be relatively small. This last observation allows us to make use of private hypothesis selection, which selects a good hypothesis from a small set of candidate hypotheses [4, 17]. In Section 3, we describe the aforementioned reduction in more detail. We note that a similar connection between list-decodable learning and learning mixture distributions was also used by Diakonikolas et al. [26]. However, our reduction is focused on the private setting. The reduction shows that to privately learn mixtures, it is sufficient to design differentially private list-decodable learning algorithms that work for (corrupted versions of) the individual mixture components. To devise list-decodable learners for (corrupted) univariate Gaussians, we utilize stability-based histograms [15, 51] that satisfy approximate differential privacy. To design a list-decodable learner for corrupted univariate Gaussians, we follow a three-step approach that is inspired by the seminal work of Karwa and Vadhan [50].
First, we use a histogram to output a list of variances, one of which approximates the true variance of the Gaussian. As a second step, we would like to output a list of means which approximate the true mean of the Gaussian. This can be done using histograms provided that we roughly know the variance of the Gaussian. Since we have candidate variances from the first step, we can use a sequence of histograms where the width of the bins of each histogram is determined by one of the candidate variances from the first step. As a last step, using the candidate variances and means from the first two steps, we construct a small set of distributions one of which approximates the true Gaussian to within accuracy α. In the axis-aligned setting, we use our solution for the univariate case as a subroutine on each dimension separately. Now that we have a list-decodable learner for axis-aligned Gaussians, we use our reduction to obtain a private algorithm for learning mixtures of axis-aligned Gaussians. Approaches based on constructing lists of candidate variances and means in order to learn an accurate mixture have been previously considered in the non-private setting [5, 6, 23, 61]. However, it does not seem possible to directly privatize the algorithms in this line of work since they construct these candidates directly from the samples, which is a clear violation of privacy.

### 1.3 Open Problems

The most basic open problem is to understand the exact sample complexity (up to constants) for learning mixtures of univariate Gaussians under approximate differential privacy.

Conjecture 1.5 (Informal). The sample complexity of learning a mixture of k univariate Gaussians to within total variation distance α with high probability under (ε, δ)-DP is Θ(k/α² + k/(αε) + log(1/δ)/ε).

Another open question is whether it is possible to privately learn mixtures of (arbitrary) high-dimensional Gaussians.
We conjecture that it is possible, with the following sample complexity.

Conjecture 1.6 (Informal). The sample complexity of learning a mixture of k d-dimensional Gaussians to within total variation distance α with high probability under (ε, δ)-DP is Θ(kd²/α² + kd²/(αε) + log(1/δ)/ε).

### 1.4 Additional Related Work

Recently, [17] showed how to learn spherical Gaussian mixtures where each Gaussian component has bounded mean under pure differential privacy. Acharya, Sun and Zhang [3] were able to obtain lower bounds in the same setting that nearly match the upper bounds of Bun et al. [17]. Both [47, 53] consider differentially private learning of Gaussian mixtures; however, their focus is on parameter estimation and they therefore require additional assumptions such as separation or boundedness of the components. There has been a flurry of activity on differentially private distribution learning and parameter estimation in recent years for many problem settings [3, 11, 14, 16–18, 25, 30, 38, 46, 48, 50, 52, 53, 59, 60]. There has also been a lot of work in the locally private setting [2, 31–33, 40, 43, 44, 63, 64]. Other work on differentially private estimation includes [1, 10, 13, 19, 34, 58, 65]. For a more comprehensive review of differentially private statistics, see [45].

## 2 Preliminaries

For any m ∈ ℕ, [m] denotes the set {1, 2, . . . , m}. Let X ∼ f denote a random variable X sampled from the distribution f. Let (Xᵢ)_{i=1}^m ∼ f^m denote an i.i.d. random sample of size m from distribution f. For a vector x ∈ ℝᵈ, we refer to the ith element of x as xᵢ. For any k ∈ ℕ, we define the k-dimensional probability simplex to be Δ_k := {(w₁, . . . , w_k) ∈ ℝ^k_{≥0} : Σ_{i=1}^k wᵢ = 1}. For a vector µ ∈ ℝᵈ and a positive semidefinite matrix Σ, we use N(µ, Σ) to denote the multivariate normal distribution with mean µ and covariance matrix Σ. We define G to be the class of univariate Gaussians and G_d = {N(µ, Σ) : Σᵢⱼ = 0 for all i ≠ j and Σᵢᵢ > 0 for all i} to be the class of axis-aligned Gaussians.

Definition 2.1 (k-mix(F)).
Let F be a class of probability distributions. Then the class of k-mixtures of F, written k-mix(F), is defined as k-mix(F) := {Σ_{i=1}^k wᵢfᵢ : (w₁, . . . , w_k) ∈ Δ_k, f₁, . . . , f_k ∈ F}.

### 2.1 Distribution Learning

A distribution learning method is a (potentially randomized) algorithm that, given a sequence of i.i.d. samples from a distribution f, outputs a distribution f̂ as an estimate of f. The focus of this paper is on absolutely continuous probability distributions (distributions that have a density with respect to the Lebesgue measure), so we refer to a probability distribution and its probability density function interchangeably. The specific measure of closeness between distributions that we use is the total variation (TV) distance.

Definition 2.2. Let g and f be two probability distributions defined over X and let Ω be the Borel σ-algebra on X. The total variation distance between g and f is defined as

d_TV(g, f) = sup_{S ∈ Ω} |P_g(S) − P_f(S)| = ½ ∫_X |g(x) − f(x)| dx = ½ ‖g − f‖₁ ∈ [0, 1],

where P_g(S) denotes the probability measure that g assigns to S. Moreover, if F is a set of distributions over a common domain, we define d_TV(g, F) = inf_{f ∈ F} d_TV(g, f).

Definition 2.3 (PAC learner). We say an algorithm A is a PAC learner for a class of distributions F which uses m(α, β) samples if for every α, β ∈ (0, 1), every f ∈ F, and every n ≥ m(α, β) the following holds: if the algorithm is given parameters α, β and a sequence of n i.i.d. samples from f as inputs, then it outputs an approximation f̂ such that d_TV(f, f̂) ≤ α with probability at least 1 − β.²

We work with an additive corruption model often studied in the list-decodable setting that is inspired by the seminal work of Huber [42]. In this model, a sample is drawn from a distribution of interest with some probability, and with the remaining probability is drawn from an arbitrary distribution. Our list-decodable learners take samples from these corrupted distributions as input.

Definition 2.4 (γ-corrupted distributions).
Fix some distribution f and let γ ∈ (0, 1). We define a γ-corrupted distribution of f as any distribution g such that g = (1 − γ)f + γh for an arbitrary distribution h. We define H_γ(f) to be the set of all γ-corrupted distributions of f.

Remark 2.5. Observe that H_γ(f) is monotone increasing in γ, i.e., H_γ(f) ⊆ H_{γ′}(f) for all γ′ ∈ (γ, 1). To see this, note that if g = (1 − γ)f + γh then we can also rewrite g = (1 − γ′)f + γ′h′, where h′ = ((γ′ − γ)/γ′) f + (γ/γ′) h. Hence, g ∈ H_{γ′}(f). In this work, γ is usually quite close to 1, i.e., the vast majority of the samples are corrupted.

Next, we define list-decodable learning. In this setting, the goal is to learn a distribution f given samples from a γ-corrupted distribution g of f. As γ → 1, the goal is to output a list of distributions, one of which approximates f. We use this primitive to design algorithms for learning mixture distributions.

Definition 2.6 (list-decodable learner). We say an algorithm A_LIST is an L-list-decodable learner for a class of distributions F using m_LIST(α, β, γ) samples if for every α, β, γ ∈ (0, 1), n ≥ m_LIST(α, β, γ), f ∈ F, and g ∈ H_γ(f), the following holds: given parameters α, β, γ and a sequence of n i.i.d. samples from g as inputs, A_LIST outputs a set of distributions F̃ with |F̃| ≤ L such that with probability no less than 1 − β we have d_TV(f, F̃) ≤ α.

### 2.2 Differential Privacy

Let X* = ∪_{n=1}^∞ X^n be the set of all datasets of arbitrary size over a domain set X. Two datasets D, D′ ∈ X* are neighbours if D and D′ differ in at most one data point. Informally, an algorithm is differentially private if its outputs on neighbouring datasets are similar. Formally, differential privacy (DP)³ has the following definition.

Definition 2.7 ([35, 36]). A randomized algorithm T : X* → Y is (ε, δ)-differentially private if for all n ≥ 1, for all neighbouring datasets D, D′ ∈ X^n, and for all measurable subsets S ⊆ Y, Pr[T(D) ∈ S] ≤ e^ε Pr[T(D′) ∈ S] + δ. If δ = 0, we say that T is ε-differentially private.
We refer to ε-DP as pure DP, and to (ε, δ)-DP for δ > 0 as approximate DP. We make use of the following property of differentially private algorithms, which asserts that adaptively composing differentially private algorithms remains differentially private. By adaptive composition, we mean that we run a sequence of algorithms M₁(D), . . . , M_T(D) where the choice of algorithm M_t may depend on the outputs of M₁(D), . . . , M_{t−1}(D).

²The probability is over m(α, β) samples drawn from f and the randomness of the algorithm.

³We will use the acronym DP to refer to both the terms "differential privacy" and "differentially private". Which term we are using will be clear from the specific sentence.

Lemma 2.8 (Composition of DP [36, 37]). If M is an adaptive composition of differentially private algorithms M₁, . . . , M_T, then the following two statements hold:

1. If M₁, . . . , M_T are (ε₁, δ₁), . . . , (ε_T, δ_T)-DP, then M is (ε, δ)-DP for ε = Σ_{t=1}^T ε_t and δ = Σ_{t=1}^T δ_t.

2. If M₁, . . . , M_T are (ε₀, δ₁), . . . , (ε₀, δ_T)-DP for some ε₀ ≤ 1, then for any δ₀ > 0, M is (ε, δ)-DP for ε = ε₀ √(6T log(1/δ₀)) and δ = δ₀ + Σ_{t=1}^T δ_t.

The first statement in Lemma 2.8 is often referred to as basic composition and the second statement is often referred to as advanced composition. We also make use of the fact that post-processing the output of a differentially private algorithm does not impact privacy.

Lemma 2.9 (Post-processing). If M : X^n → Y is (ε, δ)-differentially private and P : Y → Z is any randomized function, then the algorithm P ∘ M is (ε, δ)-differentially private.

We define (ε, δ)-DP PAC learners and (ε, δ)-DP L-list-decodable learners as PAC learners and L-list-decodable learners that satisfy (ε, δ)-DP.

## 3 List-decodability and Learning Mixtures

In this section, we describe our general technique which reduces the problem of private learning of mixture distributions to private list-decodable learning of distributions.
We show that if we have a differentially private list-decodable learner for a class of distributions, then it can be transformed, in a black-box way, into a differentially private PAC learner for the class of mixtures of such distributions. In the next section, we describe private list-decodable learners for the class of Gaussians and thereby obtain private algorithms for learning mixtures of Gaussians. First, let us begin with some intuition in the non-private setting. Suppose that we have a distribution g which can be written as g = Σ_{i=1}^k (1/k) fᵢ. Then we can view g as a ((k−1)/k)-corrupted distribution of fᵢ for each i ∈ [k]. Any list-decodable algorithm that receives samples from g as input is very likely to output a candidate set F̂ which contains distributions that are close to fᵢ for each i ∈ [k]. Hence, if we let K = {Σ_{i ∈ [k]} (1/k) f̂ᵢ : f̂ᵢ ∈ F̂}, then g must be close to some distribution in K. The only remaining task is to find a distribution in K that is close to g; this final task is known as hypothesis selection and has a known solution [24]. We note that the above argument can be easily generalized to the setting where g is a non-uniform mixture, i.e., g = Σ_{i=1}^k wᵢfᵢ where (w₁, . . . , w_k) ∈ Δ_k. The above establishes a blueprint that we can follow in order to obtain a private learner for mixture distributions. In particular, we aim to come up with a private list-decoding algorithm which receives samples from g and produces a set F̂. Thereafter, we can construct a candidate set K as mixtures of distributions from F̂. Note that this step does not access the samples and therefore maintains privacy. In order to choose a good candidate from K, we make use of private hypothesis selection [4, 17]. Formally, the following theorem establishes the reduction from private list-decodable learning to learning of mixtures. The proof can be found in Appendix C of the supplementary materials.

Theorem 3.1. Let k ∈ ℕ and ε, δ ∈ (0, 1).
If F is (ε/2, δ)-DP L-list-decodable with m_LIST samples, then there is an (ε, δ)-DP PAC learner for k-mix(F) where the number of samples used is

m(α, β, ε, δ) = m_LIST + O( (k log(Lk/α) + log(1/β))/α² + (k log(Lk/α) + log(1/β))/(αε) ).

This reduction is quite useful because it is conceptually much simpler to devise list-decodable learners for a given class F. In what follows, we devise such list-decodable learners for certain classes and use Theorem 3.1 to obtain private PAC learners for mixtures of these classes.

## 4 Learning Univariate Gaussian Mixtures

Let G be the class of all univariate Gaussians. In this section we consider the problem of privately learning univariate Gaussian mixtures, k-mix(G). In the previous section, we showed that it is sufficient to design private list-decodable learners for univariate Gaussians. As a warm-up and to build intuition about our techniques, we begin with the simpler problem of constructing private list-decodable learners for Gaussians with a single known variance σ². In what follows, we often use a tilde (e.g., M̃, Ṽ) to denote sets that are meant to be coarse, or constant-factor, approximations and a hat (e.g., F̂, M̂, V̂) to denote sets that are meant to be fine, say O(α), approximations.

### 4.1 Warm-up: Learning Gaussian Mixtures with a Known, Shared Variance

In this sub-section we construct a private list-decodable learner for univariate Gaussians with a known variance σ². A useful algorithmic primitive that we will use throughout this section and the next is the stable histogram algorithm.

Lemma 4.1 (Histogram learner [15, 51]). Let n ∈ ℕ, η, β, ε ∈ (0, 1) and δ ∈ (0, 1/n). Let D be a dataset of n points over a domain X. Let K be a countable index set and B = {Bᵢ}_{i ∈ K} a collection of disjoint bins defined on X, i.e., Bᵢ ⊆ X and Bᵢ ∩ Bⱼ = ∅ for i ≠ j. Finally, let pᵢ = |D ∩ Bᵢ|/n.
There is an (ε, δ)-DP algorithm Stable-Histogram(ε, δ, η, β, D, B) that takes as input parameters ε, δ, η, β, dataset D and bins B, and outputs estimates {p̃ᵢ}_{i ∈ K} such that |pᵢ − p̃ᵢ| ≤ η for all i ∈ K with probability no less than 1 − β, so long as n = Ω(log(1/βδ)/(ηε)).

For any fixed σ² > 0 we define G_σ to be the set of all univariate Gaussians with variance σ². For the remainder of this section, we let g = N(µ, σ²) ∈ G_σ and g′ ∈ H_γ(g). (Recall that g′ ∈ H_γ(g) means that g′ = (1 − γ)g + γh for some distribution h.) Algorithm 1 shows how we privately output a list of real numbers, one of which is close to the mean of g, given samples from g′. The following lemma shows that the output of Algorithm 1 is a list of real numbers with the guarantee that at least one element in the list is close to the true mean of a Gaussian which has been corrupted. Note that the lemma assumes the slightly weaker condition where the algorithm receives an approximation to the standard deviation instead of the true standard deviation. This additional generality is used in the next section.

Lemma 4.2. Algorithm 1 is an (ε, δ)-DP algorithm such that for any g = N(µ, σ²) and g′ ∈ H_γ(g), when it is given parameters ε, β, γ ∈ (0, 1), δ ∈ (0, 1/n), σ̃ ∈ [σ, 2σ) and a dataset D of n i.i.d. samples from g′ as input, it outputs a set M̃ of real numbers of size |M̃| ≤ 12/(1 − γ). Furthermore, with probability 1 − β there is an element µ̃ ∈ M̃ such that |µ̃ − µ| ≤ σ, so long as n = Ω(log(1/βδ)/((1 − γ)ε)).

Algorithm 1: Univariate-Mean-Decoder(β, γ, ε, δ, σ̃, D).
Input: Parameters ε, β, γ ∈ (0, 1), δ ∈ (0, 1/n), σ̃ and dataset D.
Output: Set of approximate means M̃.
1. Partition ℝ into bins B = {Bᵢ}_{i ∈ ℤ} where Bᵢ = ((i − 0.5)σ̃, (i + 0.5)σ̃].
2. {p̃ᵢ}_{i ∈ ℤ} ← Stable-Histogram(ε, δ, (1 − γ)/24, β/2, D, B).
3. H ← {i : p̃ᵢ > (1 − γ)/8}.
4. If |H| > 12/(1 − γ), fail and return M̃ = ∅.
5. M̃ ← {iσ̃ : i ∈ H}.
6. Return M̃.

We begin by gathering several simple claims whose proofs can be found in Appendix D.1. Let pᵢ = P_{X ∼ g′}[X ∈ Bᵢ] be the probability that a sample drawn from g′ lands in bin Bᵢ.
Let p̄ᵢ = |D ∩ Bᵢ|/n be the fraction of the samples drawn from g′ that landed in Bᵢ. Let j = ⌊µ/σ̃⌉ be the integer nearest to µ/σ̃. It is a simple calculation to check that |jσ̃ − µ| ≤ σ. Thus, we would like to show that jσ̃ ∈ M̃ or, equivalently, that j ∈ H. A straightforward calculation shows that pⱼ ≥ (1 − γ)/3. Then a standard application of a Chernoff bound shows that many samples actually land in bin Bⱼ, as asserted by the following claim.

Claim 4.3. If n = Ω(log(1/β)/(1 − γ)) then p̄ⱼ > (1 − γ)/6 with probability at least 1 − β/2.

Next, we claim that the output of the stable histogram approximately preserves the weight of all the bins and, moreover, that the output does not have too many heavy bins. The first assertion implies that since bin Bⱼ is heavy, the stable histogram also determines that bin Bⱼ is heavy. The second assertion implies that the algorithm does not fail. Let {p̃ᵢ}_{i ∈ ℤ} be the output of the stable histogram, as defined in Algorithm 1.

Claim 4.4. If n = Ω(log(1/βδ)/((1 − γ)ε)) then with probability 1 − β/2, we have (i) |p̄ᵢ − p̃ᵢ| ≤ (1 − γ)/24 for all i ∈ ℤ and (ii) |H| = |{i ∈ ℤ : p̃ᵢ > (1 − γ)/8}| ≤ 12/(1 − γ).

With Claim 4.3 and Claim 4.4 in hand, we are now ready to prove Lemma 4.2.

Proof of Lemma 4.2. We briefly prove that the algorithm is private before proceeding to the other assertions of the lemma.

Privacy. Line 2 is the only part of the algorithm that looks at the data, and it is (ε, δ)-DP by Lemma 4.1. The remainder of the algorithm can be viewed as post-processing (Lemma 2.9), so it does not affect the privacy.

Bound on |M̃|. For the bound on |M̃|, observe that if |H| > 12/(1 − γ) then the algorithm fails, so |M̃| ≤ 12/(1 − γ) deterministically.

Utility. Let g, g′, µ be as defined in the statement of the lemma. We now show that there exists µ̃ ∈ M̃ such that |µ̃ − µ| ≤ σ. Let j = ⌊µ/σ̃⌉. For the remainder of the proof, we assume that n = Ω(log(1/βδ)/((1 − γ)ε)). Claim 4.3 asserts that, with probability 1 − β/2, we have p̄ⱼ > (1 − γ)/6.
Claim 4.4 asserts that, with probability 1 − β/2, p̃ⱼ ≥ p̄ⱼ − (1 − γ)/24 and that |H| ≤ 12/(1 − γ). By a union bound, with probability 1 − β, we have that p̃ⱼ > (1 − γ)/8 and the algorithm does not fail. This implies that j ∈ H, so jσ̃ ∈ M̃. Finally, note that |jσ̃ − µ| ≤ σ̃/2 ≤ σ, where the last inequality uses the assumption that σ̃ < 2σ.

We can now use Lemma 4.2 to get a private list-decodable learner (Corollary 4.5) and then use this private list-decodable learner together with our reduction (Theorem 3.1) to get an (ε, δ)-DP PAC learner for k-mix(G_σ) (Theorem 4.6). The proof of Corollary 4.5 can be found in Appendix D.

Corollary 4.5. For any ε ∈ (0, 1) and δ ∈ (0, 1/n), there is an (ε, δ)-DP L-list-decodable learner for G_σ with known σ > 0 where L = O(1/((1 − γ)α)), and the number of samples used is m_LIST(α, β, γ, ε, δ) = O(log(1/βδ)/((1 − γ)ε)).

Theorem 4.6. For any ε ∈ (0, 1) and δ ∈ (0, 1/n), there is an (ε, δ)-DP PAC learner for k-mix(G_σ) with known σ > 0 that uses m(α, β, ε, δ) = Õ((k + log(1/β))/α² + k log(1/βδ)/(αε)) samples.

### 4.2 Learning Arbitrary Univariate Gaussian Mixtures

In this subsection, we construct a list-decodable learner for G, the class of all univariate Gaussians. First, in Lemma 4.7, we design an (ε, δ)-DP algorithm that receives samples from g′ ∈ H_γ(g) where g ∈ G and outputs a list of candidate values for the standard deviation, one of which approximates the standard deviation of g with high probability. Then, in Lemma 4.8, we use Lemma 4.2 and Lemma 4.7 to design an (ε, δ)-DP list-decoder for G.

#### 4.2.1 Estimating the variance

We begin with a method to estimate the variance. Here, we provide a high-level overview and relegate the details to the appendix. Suppose that g = N(µ, σ²) and g′ ∈ H_γ(g). Our goal is to obtain a multiplicative estimate of σ². The key observation is as follows. If X₁, X₂ ∼ g′ then Y = (X₁ − X₂)/√2 is distributed as (1 − γ)²N(0, σ²) + (1 − (1 − γ)²)h′ for some distribution h′. In particular, with probability roughly (1 − γ)², |Y| is itself a good estimate of σ.
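This pairing observation can be checked numerically. The sketch below pairs consecutive samples, forms Y = |X₁ − X₂|/√2, and tallies the dyadic bins used in the next step; the private stable histogram and the handling of corrupted samples are omitted, and all names are illustrative.

```python
import math
import random

def dyadic_variance_bins(data):
    """Pair consecutive samples, form Y = |X1 - X2| / sqrt(2), and count how
    many Y's land in each dyadic bin (2^i, 2^(i+1)].  For uncorrupted pairs,
    Y is distributed as |N(0, sigma^2)|, so the heavy bins sit near sigma."""
    bins = {}
    for x1, x2 in zip(data[::2], data[1::2]):
        y = abs(x1 - x2) / math.sqrt(2)
        if y > 0:
            i = math.floor(math.log2(y))  # index of the dyadic bin containing y
            bins[i] = bins.get(i, 0) + 1
    return bins

rng = random.Random(1)
sigma = 5.0
data = [rng.gauss(3.0, sigma) for _ in range(2_000)]  # no corruption here
bins = dyadic_variance_bins(data)
i_star = max(bins, key=bins.get)  # heaviest bin
sigma_hat = 2.0 ** (i_star + 1)   # its right endpoint, a coarse estimate of sigma
```

Note that pairing cancels the unknown mean (X₁ − X₂ is mean-zero regardless of µ), and on uncorrupted data the heaviest bin's endpoints give a constant-factor approximation of σ, matching the coarse multiplicative guarantee sought in Lemma 4.7.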
This observation allows us to proceed similarly to the proof of Lemma 4.2, with two changes. First, since our goal is a multiplicative approximation to σ², we use bins of the form (2^i, 2^{i+1}] for i ∈ ℤ. Second, given a dataset D = {X₁, . . . , X_{2m}}, we transform it into the dataset D′ = {|X₁ − X₂|/√2, . . . , |X_{2m−1} − X_{2m}|/√2} and use the previously mentioned observation. The following lemma formalizes the guarantee that we can achieve. The proof can be found in Appendix D.2.

Lemma 4.7. There is an (ε, δ)-DP algorithm such that for any g = N(µ, σ²) and g′ ∈ H_γ(g), when it is given parameters ε, β, γ ∈ (0, 1), δ ∈ (0, 1/n) and a dataset D of 2n i.i.d. samples from g′ as input, it outputs a set Ṽ of positive real numbers of size |Ṽ| ≤ 12/(1 − γ)². Furthermore, with probability no less than 1 − β there is an element σ̃ ∈ Ṽ such that σ ≤ σ̃ < 2σ, so long as n = Ω(log(1/βδ)/((1 − γ)²ε)).

#### 4.2.2 A list-decodable learner for univariate Gaussians

We can now use Lemma 4.2 and Lemma 4.7 to design a list-decodable learner for G. This is done in a few steps. First, we obtain a list of candidate variances using Lemma 4.7. We know that one of these candidates is a good approximation to the true variance, although we may not know which one. As a second step, we use the algorithm implied by Lemma 4.2 for all the candidate variances to get a list of candidate means. Since one of the candidate variances is a good estimate of the true variance, Lemma 4.2 promises that one of the candidate means is a good estimate of the true mean. Given a list of candidate variances and candidate means, we can create a list of all pairs of variances and means to obtain a list of distributions such that one is close to the true Gaussian. The guarantee is formalized in the following lemma and the proof appears in Appendix D.3.

Lemma 4.8. There is an (ε, δ)-DP algorithm such that for any g = N(µ, σ²) and g′ ∈ H_γ(g), when it is given parameters ε, α, β, γ ∈ (0, 1), δ ∈ (0, 1/n) and a dataset D of n i.i.d.
samples from g̃ as input, it outputs a set M̂ of real numbers and a set V̂ of positive real numbers such that |M̂| ≤ 144(2⌈1/α⌉ + 1)/(1−γ)³ and |V̂| ≤ 12⌈log_{1+α}(2)⌉/(1−γ)². Furthermore, with probability no less than 1 − β, we have the following:

1. ∃ µ̂ ∈ M̂ such that |µ̂ − µ| ≤ ασ;
2. ∃ σ̂ ∈ V̂ such that |σ̂ − σ| ≤ ασ;

so long as n = Ω̃(log^{3/2}(1/βδ)/((1−γ)²ε)).

We can now use Lemma 4.8 to get the following result. We defer the proof to Appendix D.4.

Corollary 4.9. For any ε ∈ (0, 1) and δ ∈ (0, 1/n), there is an (ε, δ)-DP L-list-decodable learner for G where L = O(1/((1−γ)⁵α²)) and the number of samples used by the algorithm is m_LIST(α, β, γ, ε, δ) = Õ(log^{3/2}(1/βδ)/((1−γ)²ε)).

Finally, Corollary 4.9 and Theorem 3.1 immediately imply the following theorem.

Theorem 4.10. For any ε ∈ (0, 1) and δ ∈ (0, 1/n), there is an (ε, δ)-DP PAC learner for k-mix(G) that uses m(α, β, ε, δ) = Õ(k² log^{3/2}(1/βδ)/(α²ε)) samples.

5 Learning Mixtures of Axis-Aligned Gaussians

In this section, we prove the following result, which is a formal version of Theorem 1.1.

Theorem 5.1. For any ε ∈ (0, 1) and δ ∈ (0, 1/n), there is an (ε, δ)-DP PAC learner for k-mix(Gd), and the number of samples used by the algorithm is m(α, β, ε, δ) = Õ(k²d log^{3/2}(1/βδ)/(α²ε)).

The following lemma shows that we can construct an (ε, δ)-DP list-decodable learner for the class of d-dimensional axis-aligned Gaussians, Gd. We explain the high-level details of our approach here. Let g = ∏_{i=1}^{d} g_i ∈ Gd and g̃ ∈ H_γ(g). For a sample X = (X₁, . . . , X_d) ∼ g̃, it follows that X_i ∼ g̃_i where g̃_i ∈ H_γ(g_i). We can thus split our dataset by dimension and run our univariate private list-decodable learner (Corollary 4.9) on each dimension separately to get a total of d lists of univariate Gaussians. From the guarantee of Corollary 4.9, in the i-th list there will be at least one Gaussian that is a good approximation to g_i, and this holds for all i ∈ [d].
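As an illustration of how the d per-coordinate lists are assembled into candidates for g, the snippet below takes one (mean, σ) candidate per coordinate via a Cartesian product. The candidate lists and the helper `combine` are hypothetical stand-ins for the output of the univariate learner; the point is only the mechanics, and why the final list size is the product of the per-coordinate list sizes (L^d when every coordinate contributes L candidates, matching the exponent in Lemma 5.2).

```python
import itertools

# Hypothetical per-coordinate candidate lists, as produced by running a
# univariate list-decodable learner on each coordinate separately: each
# inner list holds (mean, sigma) pairs for that coordinate, one of which
# is assumed close to the true univariate component g_i.
cand = [
    [(0.1, 1.0), (5.0, 2.0)],               # coordinate 1
    [(-3.0, 0.5), (0.0, 1.0), (2.0, 4.0)],  # coordinate 2
]

def combine(per_coord):
    """One axis-aligned Gaussian per way of picking a candidate per axis."""
    return [tuple(choice) for choice in itertools.product(*per_coord)]

candidates = combine(cand)
assert len(candidates) == 2 * 3  # product of per-coordinate list sizes
assert ((5.0, 2.0), (0.0, 1.0)) in candidates
```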
Finally, since g is an axis-aligned Gaussian (product distribution), we can take all possible combinations of the univariate distributions in the d lists to obtain a new list of axis-aligned Gaussians, one of which accurately approximates g. The details of the proof can be found in Appendix E.

Lemma 5.2. For any ε ∈ (0, 1) and δ ∈ (0, 1/n), there is an (ε, δ)-DP L-list-decodable learner for Gd where L = O(d²/((1−γ)⁵α²))^d and the number of samples used by the algorithm is m_LIST(α, β, γ, ε, δ) = Õ(d log^{3/2}(1/βδ)/((1−γ)²ε)).

We can now put together Lemma 5.2 and Theorem 3.1 to immediately get Theorem 5.1.

Acknowledgments and Disclosure of Funding

CL was supported by an NSERC postdoctoral fellowship. HA was supported by an NSERC Discovery Grant.

References

[1] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. Differentially private testing of identity and closeness of discrete distributions. In Advances in Neural Information Processing Systems 31, NeurIPS '18, pages 6878–6891. Curran Associates, Inc., 2018.

[2] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. Hadamard response: Estimating distributions privately, efficiently, and with little communication. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, AISTATS '19, pages 1120–1129. JMLR, Inc., 2019.

[3] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. Differentially private Assouad, Fano, and Le Cam. arXiv preprint arXiv:2004.06830, 2020.

[4] Ishaq Aden-Ali, Hassan Ashtiani, and Gautam Kamath. On the sample complexity of privately learning unbounded high-dimensional Gaussians. In Vitaly Feldman, Katrina Ligett, and Sivan Sabato, editors, Proceedings of the 32nd International Conference on Algorithmic Learning Theory, volume 132 of Proceedings of Machine Learning Research, pages 185–216. PMLR, 16–19 Mar 2021. URL http://proceedings.mlr.press/v132/aden-ali21a.html.

[5] Hassan Ashtiani, Shai Ben-David, and Abbas Mehrabian. Sample-efficient learning of mixtures.
In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, AAAI '18, pages 2679–2686. AAAI Publications, 2018. URL https://arxiv.org/abs/1706.01596.

[6] Hassan Ashtiani, Shai Ben-David, Nicholas J. A. Harvey, Christopher Liaw, Abbas Mehrabian, and Yaniv Plan. Near-optimal sample complexity bounds for robust learning of Gaussian mixtures via compression schemes. J. ACM, 67(6):32:1–32:42, 2020.

[7] Ainesh Bakshi and Pravesh K. Kothari. List-decodable subspace recovery: Dimension independent error in polynomial time. In Proceedings of the Thirty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1279–1297, 2021.

[8] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. A discriminative framework for clustering via similarity functions. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 671–680, 2008.

[9] Maria-Florina Balcan, Heiko Röglin, and Shang-Hua Teng. Agnostic clustering. In International Conference on Algorithmic Learning Theory, pages 384–398. Springer, 2009.

[10] Rina Foygel Barber and John C. Duchi. Privacy and statistical risk: Formalisms and minimax bounds. CoRR, abs/1412.4451, 2014. URL http://arxiv.org/abs/1412.4451.

[11] Sourav Biswas, Yihe Dong, Gautam Kamath, and Jonathan Ullman. CoinPress: Practical private mean and covariance estimation. arXiv preprint arXiv:2006.06618, 2020.

[12] Andrea Bittau, Úlfar Erlingsson, Petros Maniatis, Ilya Mironov, Ananth Raghunathan, David Lie, Mitch Rudominer, Ushasree Kode, Julien Tinnes, and Bernhard Seefeld. Prochlo: Strong privacy for analytics in the crowd. In Proceedings of the 26th ACM Symposium on Operating Systems Principles, SOSP '17, pages 441–459, New York, NY, USA, 2017. ACM.

[13] Mark Bun and Thomas Steinke. Average-case averages: Private algorithms for smooth sensitivity and mean estimation. In Advances in Neural Information Processing Systems 32, NeurIPS '19, pages 181–191. Curran Associates, Inc., 2019.
[14] Mark Bun, Jonathan Ullman, and Salil Vadhan. Fingerprinting codes and the price of approximate differential privacy. In Proceedings of the 46th Annual ACM Symposium on the Theory of Computing, STOC '14, pages 1–10, New York, NY, USA, 2014. ACM.

[15] Mark Bun, Kobbi Nissim, and Uri Stemmer. Simultaneous private learning of multiple concepts. In Proceedings of the 7th Conference on Innovations in Theoretical Computer Science, ITCS '16, pages 369–380, New York, NY, USA, 2016. ACM.

[16] Mark Bun, Thomas Steinke, and Jonathan Ullman. Make up your mind: The price of online queries in differential privacy. In Proceedings of the 28th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '17, pages 1306–1325, Philadelphia, PA, USA, 2017. SIAM.

[17] Mark Bun, Gautam Kamath, Thomas Steinke, and Zhiwei Steven Wu. Private hypothesis selection. In Advances in Neural Information Processing Systems 32, NeurIPS '19, pages 156–167. Curran Associates, Inc., 2019.

[18] T. Tony Cai, Yichen Wang, and Linjun Zhang. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. arXiv preprint arXiv:1902.04495, 2019.

[19] Clément L. Canonne, Gautam Kamath, Audra McMillan, Jonathan Ullman, and Lydia Zakynthinou. Private identity testing for high-dimensional distributions. arXiv preprint arXiv:1905.11947, 2019.

[20] Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In Proceedings of the 49th Annual ACM Symposium on the Theory of Computing, STOC '17, pages 47–60, New York, NY, USA, 2017. ACM.

[21] Yeshwanth Cherapanamjeri, Sidhanth Mohanty, and Morris Yau. List decodable mean estimation in nearly linear time. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 141–148, 2020. doi: 10.1109/FOCS46700.2020.00022.

[22] Aref N. Dajani, Amy D. Lauger, Phyllis E. Singer, Daniel Kifer, Jerome P. Reiter, Ashwin Machanavajjhala, Simson L. Garfinkel, Scot A.
Dahl, Matthew Graham, Vishesh Karwa, Hang Kim, Philip Leclerc, Ian M. Schmutte, William N. Sexton, Lars Vilhuber, and John M. Abowd. The modernization of statistical disclosure limitation at the U.S. Census Bureau, 2017. Presented at the September 2017 meeting of the Census Scientific Advisory Committee.

[23] Constantinos Daskalakis and Gautam Kamath. Faster and sample near-optimal algorithms for proper learning mixtures of Gaussians. In Proceedings of the 27th Annual Conference on Learning Theory, COLT '14, pages 1183–1213, 2014.

[24] Luc Devroye and Gábor Lugosi. Combinatorial Methods in Density Estimation. Springer, 2001.

[25] Ilias Diakonikolas, Moritz Hardt, and Ludwig Schmidt. Differentially private learning of structured discrete distributions. In Advances in Neural Information Processing Systems 28, NIPS '15, pages 2566–2574. Curran Associates, Inc., 2015.

[26] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. List-decodable robust mean estimation and learning mixtures of spherical Gaussians. In Proceedings of the 50th Annual ACM Symposium on the Theory of Computing, STOC '18, pages 1047–1060, New York, NY, USA, 2018. ACM.

[27] Ilias Diakonikolas, Daniel Kane, and Daniel Kongsgaard. List-decodable mean estimation via iterative multi-filtering. In Advances in Neural Information Processing Systems 33, 2020.

[28] Differential Privacy Team, Apple. Learning with privacy at scale. https://machinelearning.apple.com/docs/learning-with-privacy-at-scale/appledifferentialprivacysystem.pdf, December 2017.

[29] Bolin Ding, Janardhan Kulkarni, and Sergey Yekhanin. Collecting telemetry data privately. In Advances in Neural Information Processing Systems 30, NIPS '17, pages 3571–3580. Curran Associates, Inc., 2017.

[30] Wenxin Du, Canyon Foot, Monica Moniot, Andrew Bray, and Adam Groce. Differentially private confidence intervals. arXiv preprint arXiv:2001.02285, 2020.

[31] John Duchi and Ryan Rogers.
Lower bounds for locally private estimation via communication complexity. In Proceedings of the 32nd Annual Conference on Learning Theory, COLT '19, pages 1161–1191, 2019.

[32] John C. Duchi and Feng Ruan. The right complexity measure in locally private estimation: It is not the Fisher information. arXiv preprint arXiv:1806.05756, 2018.

[33] John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. Minimax optimal procedures for locally private estimation. Journal of the American Statistical Association, 2017.

[34] Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In Proceedings of the 41st Annual ACM Symposium on the Theory of Computing, STOC '09, pages 371–380, New York, NY, USA, 2009. ACM.

[35] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Proceedings of the 24th Annual International Conference on the Theory and Applications of Cryptographic Techniques, EUROCRYPT '06, pages 486–503, Berlin, Heidelberg, 2006. Springer.

[36] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Conference on Theory of Cryptography, TCC '06, pages 265–284, Berlin, Heidelberg, 2006. Springer.

[37] Cynthia Dwork, Guy N. Rothblum, and Salil Vadhan. Boosting and differential privacy. In Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science, FOCS '10, pages 51–60, Washington, DC, USA, 2010. IEEE Computer Society.

[38] Cynthia Dwork, Adam Smith, Thomas Steinke, Jonathan Ullman, and Salil Vadhan. Robust traceability from trace amounts. In Proceedings of the 56th Annual IEEE Symposium on Foundations of Computer Science, FOCS '15, pages 650–669, Washington, DC, USA, 2015. IEEE Computer Society.

[39] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response.
In Proceedings of the 2014 ACM Conference on Computer and Communications Security, CCS '14, pages 1054–1067, New York, NY, USA, 2014. ACM.

[40] Marco Gaboardi, Ryan Rogers, and Or Sheffet. Locally private confidence intervals: Z-test and tight confidence intervals. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, AISTATS '19, pages 2545–2554. JMLR, Inc., 2019.

[41] Moritz Hardt and Kunal Talwar. On the geometry of differential privacy. In Proceedings of the 42nd Annual ACM Symposium on the Theory of Computing, STOC '10, pages 705–714, New York, NY, USA, 2010. ACM.

[42] Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964. doi: 10.1214/aoms/1177703732. URL https://doi.org/10.1214/aoms/1177703732.

[43] Matthew Joseph, Janardhan Kulkarni, Jieming Mao, and Zhiwei Steven Wu. Locally private Gaussian estimation. In Advances in Neural Information Processing Systems 32, NeurIPS '19, pages 2980–2989. Curran Associates, Inc., 2019.

[44] Peter Kairouz, Keith Bonawitz, and Daniel Ramage. Discrete distribution estimation under local privacy. In Proceedings of the 33rd International Conference on Machine Learning, ICML '16, pages 2436–2444. JMLR, Inc., 2016.

[45] Gautam Kamath and Jonathan Ullman. A primer on private statistics. arXiv preprint arXiv:2005.00010, 2020.

[46] Gautam Kamath, Jerry Li, Vikrant Singhal, and Jonathan Ullman. Privately learning high-dimensional distributions. In Proceedings of the 32nd Annual Conference on Learning Theory, COLT '19, pages 1853–1902, 2019.

[47] Gautam Kamath, Or Sheffet, Vikrant Singhal, and Jonathan Ullman. Differentially private algorithms for learning mixtures of separated Gaussians. In Advances in Neural Information Processing Systems 32, NeurIPS '19, pages 168–180. Curran Associates, Inc., 2019.

[48] Gautam Kamath, Vikrant Singhal, and Jonathan Ullman. Private mean estimation of heavy-tailed distributions.
In Proceedings of the 33rd Annual Conference on Learning Theory, COLT '20, 2020.

[49] Sushrut Karmalkar, Adam Klivans, and Pravesh Kothari. List-decodable linear regression. In Advances in Neural Information Processing Systems 32, 2019.

[50] Vishesh Karwa and Salil Vadhan. Finite sample differentially private confidence intervals. In Proceedings of the 9th Conference on Innovations in Theoretical Computer Science, ITCS '18, pages 44:1–44:9, Dagstuhl, Germany, 2018. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.

[51] Aleksandra Korolova, Krishnaram Kenthapadi, Nina Mishra, and Alexandros Ntoulas. Releasing search queries and clicks privately. In Proceedings of the 18th International World Wide Web Conference, WWW '09, pages 171–180, New York, NY, USA, 2009. ACM.

[52] Xiyang Liu, Weihao Kong, Sham M. Kakade, and Sewoong Oh. Robust and differentially private mean estimation. CoRR, abs/2102.09159, 2021. URL https://arxiv.org/abs/2102.09159.

[53] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of the 39th Annual ACM Symposium on the Theory of Computing, STOC '07, pages 75–84, New York, NY, USA, 2007. ACM.

[54] Karl Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A, pages 71–110, 1894.

[55] Prasad Raghavendra and Morris Yau. List decodable learning via sum of squares. In Proceedings of the Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pages 161–180. SIAM, 2020.

[56] Prasad Raghavendra and Morris Yau. List decodable subspace recovery. In Conference on Learning Theory, pages 3206–3226. PMLR, 2020.

[57] Rolf-Dieter Reiss. Approximate Distributions of Order Statistics with Applications to Nonparametric Statistics. Springer Series in Statistics. Springer-Verlag, New York, 1989. ISBN 0-387-96851-2. URL https://doi.org/10.1007/978-1-4613-9620-8.

[58] Adam Smith.
Privacy-preserving statistical estimation with optimal convergence rates. In Proceedings of the 43rd Annual ACM Symposium on the Theory of Computing, STOC '11, pages 813–822, New York, NY, USA, 2011. ACM.

[59] Thomas Steinke and Jonathan Ullman. Between pure and approximate differential privacy. The Journal of Privacy and Confidentiality, 7(2):3–22, 2017.

[60] Thomas Steinke and Jonathan Ullman. Tight lower bounds for differentially private selection. In Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science, FOCS '17, pages 552–563, Washington, DC, USA, 2017. IEEE Computer Society.

[61] Ananda Theertha Suresh, Alon Orlitsky, Jayadev Acharya, and Ashkan Jafarpour. Near-optimal-sample estimators for spherical Gaussian mixtures. In Advances in Neural Information Processing Systems 27, NIPS '14, pages 1395–1403. Curran Associates, Inc., 2014.

[62] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018.

[63] Shaowei Wang, Liusheng Huang, Pengzhan Wang, Yiwen Nie, Hongli Xu, Wei Yang, Xiang-Yang Li, and Chunming Qiao. Mutual information optimally local private discrete distribution estimation. arXiv preprint arXiv:1607.08025, 2016.

[64] Min Ye and Alexander Barg. Optimal schemes for discrete distribution estimation under locally differential privacy. IEEE Transactions on Information Theory, 64(8):5662–5676, 2018.

[65] Huanyu Zhang, Gautam Kamath, Janardhan Kulkarni, and Zhiwei Steven Wu. Privately learning Markov random fields. In Proceedings of the 37th International Conference on Machine Learning, ICML '20. JMLR, Inc., 2020.

Checklist

1. For all authors...

(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] The main claims in the abstract and introduction are given by Theorem 1.1 and Theorem 1.2. They are proved in the paper and the appendix.

(b) Did you describe the limitations of your work?
[Yes] We have formal theorems for our work and raised some open questions which remain open after our work.

(c) Did you discuss any potential negative societal impacts of your work? [N/A] This is a theoretical paper and is unlikely to have any direct negative societal impacts.

(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

2. If you are including theoretical results...

(a) Did you state the full set of assumptions of all theoretical results? [Yes]

(b) Did you include complete proofs of all theoretical results? [Yes] Many proofs are in the appendix.

3. If you ran experiments...

(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [N/A]

(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A]

(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]

(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A]

4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...

(a) If your work uses existing assets, did you cite the creators? [N/A]

(b) Did you mention the license of the assets? [N/A]

(c) Did you include any new assets either in the supplemental material or as a URL? [N/A]

(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]

(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]

5. If you used crowdsourcing or conducted research with human subjects...

(a) Did you include the full text of instructions given to participants and screenshots, if applicable?
[N/A]

(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]

(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]