# Sample-Efficient Learning of Mixtures

Hassan Ashtiani, Shai Ben-David
David R. Cheriton School of Computer Science, University of Waterloo
200 University Avenue West, Waterloo, Ontario, Canada, N2L 3G1
{mhzokaei, shai}@uwaterloo.ca

Abbas Mehrabian
Simons Institute for the Theory of Computing, 121 Calvin Lab #2190
University of California, Berkeley, Berkeley, CA, USA 94720-2190
abbasmehrabian@gmail.com

We consider PAC learning of probability distributions (a.k.a. density estimation), where we are given an i.i.d. sample generated from an unknown target distribution, and want to output a distribution that is close to the target in total variation distance. Let $\mathcal F$ be an arbitrary class of probability distributions, and let $\mathcal F_k$ denote the class of $k$-mixtures of elements of $\mathcal F$. Assuming the existence of a method for learning $\mathcal F$ with sample complexity $m_{\mathcal F}(\varepsilon)$, we provide a method for learning $\mathcal F_k$ with sample complexity $O(k\log k \cdot m_{\mathcal F}(\varepsilon)/\varepsilon^{2})$. Our mixture learning algorithm has the property that, if the $\mathcal F$-learner is proper and agnostic, then the $\mathcal F_k$-learner is proper and agnostic as well.

This general result enables us to improve the best known sample complexity upper bounds for a variety of important mixture classes. First, we show that the class of mixtures of $k$ axis-aligned Gaussians in $\mathbb R^d$ is PAC-learnable in the agnostic setting with $O(kd/\varepsilon^4)$ samples, which is tight in $k$ and $d$ up to logarithmic factors. Second, we show that the class of mixtures of $k$ Gaussians in $\mathbb R^d$ is PAC-learnable in the agnostic setting with sample complexity $O(kd^2/\varepsilon^4)$, which improves the previously known bounds of $O(k^3d^2/\varepsilon^4)$ and $O(k^4d^4/\varepsilon^2)$ in its dependence on $k$ and $d$. Finally, we show that the class of mixtures of $k$ log-concave distributions over $\mathbb R^d$ is PAC-learnable using $O(d^{(d+5)/2}\,\varepsilon^{-(d+9)/2}\,k)$ samples.

1 Introduction

Learning distributions is a fundamental problem in statistics and computer science, and has numerous applications in machine learning and signal processing. The problem can be stated as follows: given an i.i.d. sample generated from an unknown probability distribution $g$, find a distribution $\hat g$ that is close to $g$ in total variation distance.¹ This strong notion of learning is not possible in general using a finite number of samples. However, if we assume that the target distribution belongs to, or can be approximated by, a family of distributions, then there is hope to acquire algorithms with finite-sample guarantees. In this paper, we study the important family of mixture models within this framework.

¹ Total variation distance is a prominent distance measure between distributions. For a discussion of this and other choices, see (Devroye and Lugosi 2001, Chapter 5).

Notice that we consider PAC learning of distributions (a.k.a. density estimation), which is different from parameter estimation. In the parameter estimation problem, it is assumed that the target distribution belongs to some parametric class, and the goal is to learn/identify the parameters (see, e.g., (Dasgupta 1999; Belkin and Sinha 2010; Moitra and Valiant 2010)). As an example of our setting, assume that the target distribution is a Gaussian mixture with $k$ components in $\mathbb R^d$. Then, how many examples do we need to find a distribution that is $\varepsilon$-close to the target?
This sample complexity question, as well as the corresponding computational complexity question, has received a lot of attention recently (see, e.g., (Feldman, Servedio, and O'Donnell 2006; Chan et al. 2014; Suresh et al. 2014; Diakonikolas et al. 2016; Diakonikolas, Kane, and Stewart 2017b; Acharya et al. 2017)).

In this paper, we consider a scenario in which we are given a method for learning a class of distributions (e.g., Gaussians). We then ask whether we can use it, as a black box, to come up with an algorithm for learning a mixture of such distributions (e.g., a mixture of Gaussians). We will show that the answer to this question is affirmative. We propose a generic method for learning mixture models. Roughly speaking, we show that by going from learning a single distribution from a class to learning a mixture of $k$ distributions from the same class, the sample complexity is multiplied by a factor of at most $(k\log^2 k)/\varepsilon^2$. This result is general, and yet it is surprisingly tight in many important cases. In this paper, we assume that the algorithm knows the number of components $k$.

As a demonstration, we show that our method provides a better sample complexity for learning mixtures of Gaussians than the state of the art. In particular, for learning mixtures of $k$ Gaussians in $\mathbb R^d$, our method requires $O(d^2k/\varepsilon^4)$ samples, improving by a factor of $k^2$ over the $O(d^2k^3/\varepsilon^4)$ bound of (Diakonikolas, Kane, and Stewart 2017b). Furthermore, for the special case of mixtures of axis-aligned Gaussians, we provide an upper bound of $O(dk/\varepsilon^4)$, which is the first optimal bound with respect to $k$ and $d$ up to logarithmic factors, and improves upon the $O(dk^9/\varepsilon^4)$ bound of (Suresh et al. 2014), which is only shown for the subclass of spherical Gaussians.

One merit of our approach is that it can be applied in the agnostic (a.k.a. robust) setting, where the target distribution does not necessarily belong to the mixture model of choice. To guarantee such a result, we need the black box to work in the agnostic setting. For example, an agnostic method for learning Gaussians can be lifted to work for Gaussian mixtures in the agnostic setting. We would like to emphasize that our focus is on the information-theoretic aspects of learning rather than the computational ones; in particular, although our framework is algorithmic, its running time is exponential in the number of required samples. Proving sample complexity bounds is important in understanding the statistical nature of various classes of distributions (see, e.g., the recent work of (Diakonikolas, Kane, and Stewart 2017a)), and may pave the way for developing efficient algorithms.

Our Results

Let $\mathcal F$ be a class of probability distributions, and let $\mathcal F_k$ denote the class of $k$-mixtures of elements of $\mathcal F$. In our main result, Theorem 5, assuming the existence of a method for learning $\mathcal F$ with sample complexity $m_{\mathcal F}(\varepsilon)$, we provide a method for learning $\mathcal F_k$ with sample complexity $O(k\log^2 k \cdot m_{\mathcal F}(\varepsilon)/\varepsilon^2)$. Our mixture learning algorithm has the property that, if the $\mathcal F$-learner is proper, then the $\mathcal F_k$-learner is proper as well (i.e., the learner always outputs a member of $\mathcal F_k$). Furthermore, the algorithm works in the more general agnostic setting provided that the base learners are agnostic learners. We provide several applications of our main result.
In Theorem 11, we show that the class of mixtures of $k$ axis-aligned Gaussians in $\mathbb R^d$ is PAC-learnable in the agnostic setting with sample complexity $O(kd\log^2 k/\varepsilon^4)$. This bound is tight in terms of $k$ and $d$ up to logarithmic factors (see Theorem 12). In Theorem 14, we show that the class of mixtures of $k$ Gaussians in $\mathbb R^d$ is PAC-learnable in the agnostic setting with sample complexity $O(kd^2\log^2 k/\varepsilon^4)$. Finally, in Theorem 16, we prove that the class of mixtures of $k$ log-concave distributions over $\mathbb R^d$ is PAC-learnable using $O(d^{(d+5)/2}\,\varepsilon^{-(d+9)/2}\,k)$ samples. To the best of our knowledge, this is the first upper bound on the sample complexity of learning this class.

Related Work

PAC learning of distributions was introduced by (Kearns et al. 1994); we refer the reader to (Diakonikolas 2016) for a recent survey. A closely related line of research in statistics (in which more emphasis is on sample complexity) is density estimation, for which the book by (Devroye and Lugosi 2001) is an excellent resource.

One approach for studying the sample complexity of learning a class of distributions is bounding the VC-dimension of its associated Yatracos class (see Definition 20), and applying results such as Theorem 22. (These VC-dimensions have mainly been studied for the purpose of proving generalization bounds for neural networks with sigmoid activation functions.) In particular, the VC-dimension bound of (Anthony and Bartlett 1999, Theorem 8.14), which is based on the work of (Karpinski and Macintyre 1997), implies a sample complexity upper bound of $O((k^4d^2 + k^3d^3)/\varepsilon^2)$ for PAC learning mixtures of axis-aligned Gaussians, and an upper bound of $O(k^4d^4/\varepsilon^2)$ for PAC learning mixtures of general Gaussians (both results hold in the more general agnostic setting).

A sample complexity upper bound of $O(d^2k^3\log^2 k/\varepsilon^4)$ for learning mixtures of Gaussians in the realizable setting was proved in (Diakonikolas, Kane, and Stewart 2017b, Theorem A.1) (the running time of their algorithm is not polynomial). Our algorithm is motivated by theirs, but we have introduced several new ideas in the algorithm and in the analysis, which result in an improvement of the sample complexity bound by a factor of $k^2$ and in an algorithm that works in the more general agnostic setting. For mixtures of spherical Gaussians, a polynomial time algorithm for the realizable setting with sample complexity $O(dk^9\log^2(d)/\varepsilon^4)$ was proposed in (Suresh et al. 2014, Theorem 11). We improve their sample complexity by a factor of $O(k^8)$, and moreover our algorithm works in the agnostic setting, too. In the special case of $d = 1$, a non-proper agnostic polynomial time algorithm with the optimal sample complexity of $O(k/\varepsilon^2)$ was given in (Chan et al. 2014), and a proper agnostic algorithm with the same sample complexity and better running time was given in (Li and Schmidt 2017).

An important question, which we do not address in this paper, is finding polynomial time algorithms for learning distributions. To the best of our knowledge, no polynomial time algorithm for learning mixtures of general Gaussians is known; see (Diakonikolas, Kane, and Stewart 2017b) for the state-of-the-art results. Another important setting is computationally efficient agnostic learning; see, e.g., (Diakonikolas et al. 2016) for some positive results.
A related line of research is parameter estimation for mixtures of Gaussians; see, e.g., (Dasgupta 1999; Belkin and Sinha 2010; Moitra and Valiant 2010), who gave polynomial time algorithms for this problem assuming certain separability conditions (these algorithms are polynomial in the dimension and the error tolerance but exponential in the number of components). Recall that parameter estimation is a more difficult problem: any algorithm for parameter estimation requires some separability assumptions on the target Gaussians, whereas for density estimation no such assumption is needed. For example, consider the case where $k = 2$ and the two components are identical; then there is no way to learn their mixing weights.

We finally remark that characterizing the sample complexity of learning a class of distributions in general is an open problem, even for the realizable (i.e., non-agnostic) case (see (Diakonikolas 2016, Open Problem 15.1)).

2 The Formal Framework

Generally speaking, a distribution learning method is an algorithm that takes a sample of i.i.d. points from a distribution $g$ as input, and outputs (a description of) a distribution $\hat g$ as an estimate of $g$. Furthermore, we assume that $g$ belongs to or can be approximated by a class $\mathcal F$ of distributions, and we may require that $\hat g$ also belongs to this class (i.e., proper learning).

Let $f_1$ and $f_2$ be two probability distributions defined over the Borel $\sigma$-algebra $\mathcal B$. The total variation distance between $f_1$ and $f_2$ is defined as
$$\|f_1 - f_2\|_{TV} = \sup_{B\in\mathcal B} |f_1(B) - f_2(B)| = \tfrac12\,\|f_1 - f_2\|_1,$$
where $\|f\|_1 = \int |f(x)|\,dx$ is the $L_1$ norm of $f$.

In the following definitions, $\mathcal F$ is a class of probability distributions, and $g$ is a distribution not necessarily in $\mathcal F$. Denote the set $\{1, 2, \dots, m\}$ by $[m]$. All logarithms are in the natural base. For a function $g$ and a class of distributions $\mathcal F$, we define
$$\mathrm{OPT}(\mathcal F, g) := \inf_{f\in\mathcal F} \|f - g\|_1.$$

Definition 1 ($\varepsilon$-approximation, $(\varepsilon, C)$-approximation). A distribution $\hat g$ is an $\varepsilon$-approximation for $g$ if $\|\hat g - g\|_1 \le \varepsilon$. A distribution $\hat g$ is an $(\varepsilon, C)$-approximation for $g$ with respect to $\mathcal F$ if $\|\hat g - g\|_1 \le C\cdot\mathrm{OPT}(\mathcal F, g) + \varepsilon$.

Definition 2 (PAC-Learning Distributions, Realizable Setting). A distribution learning method is called a (realizable) PAC-learner for $\mathcal F$ with sample complexity $m_{\mathcal F}(\varepsilon, \delta)$ if, for every distribution $g\in\mathcal F$ and all $\varepsilon, \delta > 0$, given $\varepsilon$, $\delta$, and a sample of size $m_{\mathcal F}(\varepsilon, \delta)$, it outputs an $\varepsilon$-approximation of $g$ with probability at least $1 - \delta$.

Definition 3 (PAC-Learning Distributions, Agnostic Setting). For $C > 0$, a distribution learning method is called a $C$-agnostic PAC-learner for $\mathcal F$ with sample complexity $m^C_{\mathcal F}(\varepsilon, \delta)$ if, for all distributions $g$ and all $\varepsilon, \delta > 0$, given $\varepsilon$, $\delta$, and a sample of size $m^C_{\mathcal F}(\varepsilon, \delta)$, it outputs an $(\varepsilon, C)$-approximation of $g$ with probability at least $1 - \delta$.²

² Note that in some papers, only the case $C = 1$ is called agnostic learning, and the case $C > 1$ is called semi-agnostic learning.

Clearly, a $C$-agnostic PAC-learner (for any constant $C$) is also a realizable PAC-learner, with the same error parameter $\varepsilon$. Conversely, a realizable PAC-learner can be thought of as an $\infty$-agnostic PAC-learner.

3 Learning Mixture Models

Let $\Delta_n$ denote the $n$-dimensional simplex:
$$\Delta_n = \Big\{(w_1, \dots, w_n) : w_i \ge 0,\ \sum_{i=1}^n w_i = 1\Big\}.$$

Definition 4. Let $\mathcal F$ be a class of probability distributions. Then the class of $k$-mixtures of $\mathcal F$, written $\mathcal F_k$, is defined as
$$\mathcal F_k := \Big\{\sum_{i=1}^k w_i f_i : (w_1, \dots, w_k)\in\Delta_k,\ f_1, \dots, f_k\in\mathcal F\Big\}.$$

Assume that we have a method to PAC-learn $\mathcal F$. Does this mean that we can PAC-learn $\mathcal F_k$? And if so, what is the sample complexity of this task?
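As a concrete illustration of the metric in which all of the above guarantees are stated, the following minimal sketch (not from the paper) approximates $\|f_1 - f_2\|_{TV} = \tfrac12\|f_1 - f_2\|_1$ for two univariate Gaussian densities by a Riemann sum; the function name, integration range, and grid size are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def tv_distance_1d(f1, f2, lo=-30.0, hi=30.0, n=600_001):
    """Approximate the total variation distance 0.5 * ||f1 - f2||_1 between
    two univariate densities by a Riemann sum over [lo, hi]."""
    xs = np.linspace(lo, hi, n)
    return 0.5 * np.sum(np.abs(f1(xs) - f2(xs))) * (xs[1] - xs[0])

# Two unit-variance Gaussians with means 0 and 1: the exact TV distance is
# 2*Phi(1/2) - 1, roughly 0.3829, which the numerical estimate matches closely.
print(tv_distance_1d(norm(0.0, 1.0).pdf, norm(1.0, 1.0).pdf))
```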
Our main theorem gives an affirmative answer to the first question, and provides a bound on the sample complexity of learning $\mathcal F_k$.

Theorem 5. Assume that $\mathcal F$ has a $C$-agnostic PAC-learner with sample complexity $m^C_{\mathcal F}(\varepsilon, \delta) = \lambda(\mathcal F, \delta)/\varepsilon^{\alpha}$ for some $C > 0$, $\alpha \ge 1$, and some function $\lambda(\mathcal F, \delta) = \Omega(\log(1/\delta))$. Then there exists a $3C$-agnostic PAC-learner for the class $\mathcal F_k$ requiring
$$m^{3C}_{\mathcal F_k}(\varepsilon, \delta) = O\!\left(\frac{k\log k\cdot\lambda(\mathcal F, \delta/3k)}{\varepsilon^{\alpha+2}}\right) = O\!\left(\frac{k\log k\cdot m_{\mathcal F}(\varepsilon, \delta/3k)}{\varepsilon^{2}}\right).$$

Since a realizable PAC-learner is an $\infty$-agnostic PAC-learner, we immediately obtain the following corollary.

Corollary 6. Assume that $\mathcal F$ has a realizable PAC-learner with sample complexity $m_{\mathcal F}(\varepsilon, \delta) = \lambda(\mathcal F, \delta)/\varepsilon^{\alpha}$ for some $\alpha \ge 1$ and some function $\lambda(\mathcal F, \delta) = \Omega(\log(1/\delta))$. Then there exists a realizable PAC-learner for the class $\mathcal F_k$ requiring
$$m_{\mathcal F_k}(\varepsilon, \delta) = O\!\left(\frac{k\log k\cdot\lambda(\mathcal F, \delta/3k)}{\varepsilon^{\alpha+2}}\right) = O\!\left(\frac{k\log k\cdot m_{\mathcal F}(\varepsilon, \delta/3k)}{\varepsilon^{2}}\right).$$

Some remarks:

1. Our mixture learning algorithm has the property that, if the $\mathcal F$-learner is proper, then the $\mathcal F_k$-learner is proper as well.

2. The computational complexity of the resulting algorithm is exponential in the number of required samples.

3. The condition $\lambda(\mathcal F, \delta) = \Omega(\log(1/\delta))$ is a technical condition that holds for all interesting classes $\mathcal F$.

4. One may wonder about the tightness of this theorem. In Theorem 2 of (Suresh et al. 2014), it is shown that if $\mathcal F$ is the class of spherical Gaussians, then $m^{O(1)}_{\mathcal F_k}(\varepsilon, \delta) = \Omega(k\, m_{\mathcal F}(\varepsilon, \delta/k))$; therefore, the factor of $k$ is necessary in general. However, it is not clear whether the additional factor of $\log k/\varepsilon^2$ in the theorem is tight.

5. The constant 3 (in the $3C$-agnostic result) comes from (Devroye and Lugosi 2001, Theorem 6.3) (see Theorem 8), and it is not clear whether it is necessary. If we allow for randomized algorithms (which produce a random distribution whose expected distance to the target is bounded by $\varepsilon$), then the constant can be improved to 2; see (Mahalanabis and Stefankovic 2008, Theorem 22).

In the rest of this section we prove Theorem 5. Let $g$ be the true data generating distribution, let $g^* = \operatorname{argmin}_{f\in\mathcal F_k} \|g - f\|_1$, and let $\rho = \|g - g^*\|_1 = \mathrm{OPT}(\mathcal F_k, g)$. Note that although $g^*\in\mathcal F_k$, $g$ itself is not necessarily in the form of a mixture. Since our algorithm works for mixtures, we would first like to write $g$ in the form of a mixture of $k$ distributions, such that each distribution is $\rho$-close to being in $\mathcal F$. This is done via the following lemma.

Lemma 7. Suppose that $g$ is a distribution with $\mathrm{OPT}(\mathcal F_k, g) = \rho$. Then we may write $g = \sum_{i\in[k]} w_i G_i$, such that each $G_i$ is a distribution, and for each $i$ we have $\mathrm{OPT}(\mathcal F, G_i) \le \rho$.

Proof. Since $\mathrm{OPT}(\mathcal F_k, g) = \rho$, we can write
$$g = \sum_{i=1}^k w_i f_i + h = \sum_{i=1}^k w_i (f_i + h) \qquad (2)$$
with $(w_1, \dots, w_k)\in\Delta_k$, each $f_i\in\mathcal F$, and $\|h\|_1 = \rho$. Note that $f_i + h$ is not necessarily a density function. Let $D$ denote the set of density functions, that is, the set of nonnegative functions with unit $L_1$ norm; note that this is a convex set. Since projection is a linear operator, by projecting both sides of (2) onto $D$ we find
$$g = \sum_{i\in[k]} w_i G_i,$$
where $G_i$ is the $L_1$ projection of $f_i + h$ onto $D$ (since $g\in D$, the projection of $g$ onto $D$ is $g$ itself). Also, since $f_i\in\mathcal F\subseteq D$ and projection onto a convex set does not increase distances, we have
$$\mathrm{OPT}(\mathcal F, G_i) \le \|f_i - G_i\|_1 \le \|f_i - (f_i + h)\|_1 = \|h\|_1 = \rho,$$
as required.

By Lemma 7, we have $g = \sum_{i\in[k]} w_i G_i$, where each $G_i$ is a probability distribution. Let $\rho_i := \mathrm{OPT}(\mathcal F, G_i)$; by the lemma we have
$$\sum_{i\in[k]} w_i \rho_i \le \rho. \qquad (3)$$
The idea now is to learn each of the $G_i$'s separately using the agnostic learner for $\mathcal F$. We will view $g$ as a mixture of the $k$ distributions $G_1, G_2, \dots, G_k$.
For proving Theorem 5, we will use the following theorem on learning finite classes of distributions, which immediately follows from (Devroye and Lugosi 2001, Theorem 6.3) and a standard Chernoff bound.

Theorem 8. Suppose we are given $M$ candidate distributions $f_1, \dots, f_M$ and we have access to i.i.d. samples from an unknown distribution $g$. Then there exists an algorithm that, given the $f_i$'s and $\varepsilon > 0$, takes $\log(3M^2/\delta)/2\varepsilon^2$ samples from $g$, and with probability at least $1 - \delta/3$ outputs an index $j\in[M]$ such that
$$\|f_j - g\|_1 \le 3\min_{i\in[M]} \|f_i - g\|_1 + 4\varepsilon.$$

Input: $k$, $\varepsilon$, $\delta$, and an i.i.d. sample $S$.
0. Let $W$ be an $(\varepsilon/k)$-cover for $\Delta_k$ in $\ell_\infty$ distance.
1. $\mathcal C \leftarrow \emptyset$ (the set of candidate distributions).
2. For each $(\hat w_1, \dots, \hat w_k)\in W$ do:
3.   For each possible partition of $S$ into $A_1, A_2, \dots, A_k$:
4.     Provide $A_i$ to the $\mathcal F$-learner, and let $\hat G_i$ be its output.
5.     Add the candidate distribution $\sum_{i\in[k]} \hat w_i \hat G_i$ to $\mathcal C$.
6. Apply the algorithm for finite classes (Theorem 8) to $\mathcal C$ and output its result.

Figure 1: Algorithm for learning the mixture class $\mathcal F_k$.

We now describe an algorithm that, with probability $1 - \delta$, outputs a distribution with $L_1$ distance $13\varepsilon + 3C\rho$ to $g$ (the error parameter is $13\varepsilon$ instead of $\varepsilon$ just for convenience of the proof; it is clear that this does not change the order of magnitude of the sample complexity). The algorithm, whose pseudocode is shown in Figure 1, has two main steps. In the first step we generate a set of candidate distributions, such that at least one of them is $(3\varepsilon + C\rho)$-close to $g$ in $L_1$ distance. These candidates are of the form $\sum_{i=1}^k \hat w_i \hat G_i$, where the $\hat G_i$'s are extracted from samples and are estimates for the real components $G_i$, and the $\hat w_i$'s come from a fixed discretization of $\Delta_k$ and are estimates for the real mixing weights $w_i$. In the second step, we use Theorem 8 to obtain a distribution that is $(13\varepsilon + 3C\rho)$-close to $g$.

We start with describing the first step. We take
$$s = \max\left\{\frac{2k\,\lambda(\mathcal F, \delta/3k)}{\varepsilon^{\alpha}},\ \frac{16k\log(3k/\delta)}{\varepsilon}\right\} \qquad (4)$$
i.i.d. samples from $g$. Let $S$ denote the set of generated points. Note that $\lambda(\mathcal F, \delta) = \Omega(\log(1/\delta))$ implies $s = O(k\,\lambda(\mathcal F, \delta/3k)\,\varepsilon^{-\alpha})$.

Let $W$ be an $(\varepsilon/k)$-cover for $\Delta_k$ in $\ell_\infty$ distance of cardinality $(k/\varepsilon + 1)^k$; that is, for any $x\in\Delta_k$ there exists $\hat w\in W$ such that $\|\hat w - x\|_\infty \le \varepsilon/k$. This can be obtained from a grid in $[0, 1]^k$ of side length $\varepsilon/k$, which is an $(\varepsilon/k)$-cover for $[0, 1]^k$, by projecting each of its points onto $\Delta_k$.

By an assignment, we mean a function $A : S \to [k]$. The role of an assignment is to guess which component each sample point came from, by mapping it to a component index. For each pair $(A, (\hat w_1, \dots, \hat w_k))$, where $A$ is an assignment and $(\hat w_1, \dots, \hat w_k)\in W$, we generate a candidate distribution as follows: let $A^{-1}(i)\subseteq S$ be the set of sample points that are assigned to component $i$. For each $i\in[k]$, we provide the set $A^{-1}(i)$ of samples to our $\mathcal F$-learner, and the learner provides us with a distribution $\hat G_i$. We add the distribution $\sum_{i\in[k]} \hat w_i \hat G_i$ to the set of candidate distributions.

Lemma 9. With probability at least $1 - 2\delta/3$, at least one of the generated candidate distributions is $(3\varepsilon + C\rho)$-close to $g$.

Before proving the lemma, we show that it implies our main result, Theorem 5. By the lemma, we obtain a set of candidates such that at least one of them is $(3\varepsilon + C\rho)$-close to $g$ (with failure probability $2\delta/3$). This step takes $s = O(k\,\lambda(\mathcal F, \delta/3k)\,\varepsilon^{-\alpha})$ samples. Then, we apply Theorem 8 to output one of those candidates that is $(13\varepsilon + 3C\rho)$-close to $g$ (with failure probability $\delta/3$), therefore using $\log(3M^2/\delta)/2\varepsilon^2$ additional samples. Note that the number of generated candidate distributions is $M = k^s(1 + k/\varepsilon)^k$.
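Before continuing with the sample-count accounting, here is a minimal, runnable sketch of the procedure in Figure 1 for one-dimensional data; it is not the authors' implementation. The toy `gaussian_base_learner` is a plug-in stand-in for the assumed $\mathcal F$-learner, the simplex cover is built by renormalizing a grid rather than by exact $\ell_\infty$ projection, and the selection step follows the minimum-distance flavor of estimate behind Theorem 8, with all masses approximated on a grid. Function names and numeric choices are illustrative assumptions.

```python
import itertools
import numpy as np

def gaussian_base_learner(points):
    """Toy stand-in for the F-learner: fit a 1-D Gaussian by plug-in estimates
    of mean and standard deviation and return its (vectorized) density.
    This is NOT the agnostic learner analyzed in the paper."""
    pts = np.asarray(points, dtype=float)
    if pts.size < 2:                        # fallback for empty or tiny parts
        mu, sigma = 0.0, 1.0
    else:
        mu, sigma = pts.mean(), max(pts.std(), 1e-3)
    return lambda x: np.exp(-0.5 * ((np.asarray(x) - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def simplex_cover(k, eps):
    """Rough (eps/k)-cover of the simplex Delta_k: grid points of [0,1]^k with
    step eps/k, renormalized to sum to one (a simple stand-in for the
    grid-plus-projection construction described in the text)."""
    grid = np.arange(0.0, 1.0 + 1e-12, eps / k)
    return [np.array(w) / sum(w) for w in itertools.product(grid, repeat=k) if sum(w) > 0]

def candidate_mixtures(sample, k, eps, base_learner):
    """Steps 0-5 of Figure 1: every assignment of the sample points to the k
    components, paired with every weight vector from the cover, yields one
    candidate mixture density. Exponential in len(sample); illustration only."""
    cover = simplex_cover(k, eps)
    candidates = []
    for assignment in itertools.product(range(k), repeat=len(sample)):
        parts = [[x for x, a in zip(sample, assignment) if a == i] for i in range(k)]
        comps = [base_learner(p) for p in parts]
        for w in cover:
            candidates.append(lambda x, w=w, comps=comps:
                              sum(w[i] * comps[i](x) for i in range(k)))
    return candidates

def minimum_distance_select(candidates, sample, lo=-15.0, hi=15.0, n_grid=1501):
    """Step 6 of Figure 1, in the spirit of the finite-class selection behind
    Theorem 8: for every ordered pair (j, l) form the Scheffe set
    A_jl = {x : f_j(x) > f_l(x)}; a candidate's score is its worst discrepancy,
    over all such sets, between its own mass on the set and the empirical mass
    of the sample on the set; the lowest-scoring candidate wins. Masses are
    approximated on a 1-D grid."""
    xs = np.linspace(lo, hi, n_grid)
    dx = xs[1] - xs[0]
    dens = np.vstack([f(xs) for f in candidates])      # (M, n_grid) density values
    idx = np.clip(np.searchsorted(xs, np.asarray(sample)), 0, n_grid - 1)
    scores = np.zeros(len(candidates))
    for j in range(len(candidates)):
        for l in range(len(candidates)):
            if j == l:
                continue
            in_A = (dens[j] > dens[l]).astype(float)   # Scheffe set A_jl on the grid
            emp_mass = np.mean(in_A[idx])              # empirical mass of the sample on A_jl
            cand_mass = dens @ in_A * dx               # every candidate's mass on A_jl
            scores = np.maximum(scores, np.abs(cand_mass - emp_mass))
    return candidates[int(np.argmin(scores))]

# Tiny end-to-end run; |S| is kept very small because the number of candidates
# is k^|S| * |W|.
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(-3.0, 1.0, 2), rng.normal(3.0, 1.0, 2)])
cands = candidate_mixtures(sample, k=2, eps=1.0, base_learner=gaussian_base_learner)
g_hat = minimum_distance_select(cands, sample)         # a 1-D mixture density estimate
```

Even at this toy scale the candidate set already has $k^{|S|}(1 + k/\varepsilon)^k$ elements, which is exactly the exponential blow-up that limits the algorithm to information-theoretic (rather than computational) efficiency.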
Hence, in the second step of our algorithm, we take
$$\frac{\log(3M^2/\delta)}{2\varepsilon^2} = O\!\left(\frac{\lambda(\mathcal F, \delta/3k)\,k\log k}{\varepsilon^{\alpha+2}}\right) = O\!\left(\frac{m_{\mathcal F}(\varepsilon, \delta/3k)\,k\log k}{\varepsilon^{2}}\right)$$
additional samples. The proof is completed by noting that the total failure probability is at most $\delta$, by the union bound.

We now prove Lemma 9. We will use the following concentration inequality, which holds for any binomial random variable $X$ (see (Mitzenmacher and Upfal 2005, Theorem 4.5(2))):
$$\Pr\{X < \mathbf E X/2\} \le \exp(-\mathbf E X/8). \qquad (5)$$

Say a component $i$ is negligible if $w_i \le 8\log(3k/\delta)/s$, and let $L\subseteq[k]$ denote the set of negligible components. Let $i$ be a non-negligible component. The number of points coming from component $i$ is binomial with parameters $s$ and $w_i$, and thus has mean $s w_i$, so (5) implies that, with probability at least $1 - \delta/3k$, $S$ contains at least $w_i s/2$ points from component $i$. Since we have $k$ components in total, the union bound implies that, with probability at least $1 - \delta/3$, uniformly for all $i\notin L$, $S$ contains at least $w_i s/2$ points from component $i$.

Now consider the pair $(A, (\hat w_1, \dots, \hat w_k))$ such that $A$ assigns the samples to their correct indices, and $|\hat w_i - w_i| \le \varepsilon/k$ for all $i\in[k]$. We claim that the resulting candidate distribution is $(3\varepsilon + C\rho)$-close to $g$. Let $\hat G_1, \dots, \hat G_k$ be the distributions provided by the learner. For each $i\in[k]$ define
$$\varepsilon_i := \left(\frac{2\lambda(\mathcal F, \delta/3k)}{w_i s}\right)^{1/\alpha}.$$
For any $i\notin L$, since there are at least $w_i s/2$ samples from component $i$, and since
$$w_i s/2 = \lambda(\mathcal F, \delta/3k)\,\varepsilon_i^{-\alpha} = m_{\mathcal F}(\varepsilon_i, \delta/3k),$$
we are guaranteed that $\|\hat G_i - G_i\|_1 \le C\rho_i + \varepsilon_i$ with probability at least $1 - \delta/3k$ (recall that each $G_i$ is $\rho_i$-close to the class $\mathcal F$). Therefore, $\|\hat G_i - G_i\|_1 \le C\rho_i + \varepsilon_i$ holds uniformly over all $i\notin L$ with probability at least $1 - \delta/3$.

Note that since $\alpha \ge 1$, the function $w_i^{1-1/\alpha}$ is concave in $w_i$, so by Jensen's inequality we have
$$\sum_{i\in[k]} w_i^{1-1/\alpha} \le k\Big(\sum_{i\in[k]} w_i/k\Big)^{1-1/\alpha} = k^{1/\alpha},$$
hence
$$\sum_{i\notin L} w_i\varepsilon_i = \left(\frac{2\lambda(\mathcal F, \delta/3k)}{s}\right)^{1/\alpha}\sum_{i\notin L} w_i^{1-1/\alpha} \le \left(\frac{2k\,\lambda(\mathcal F, \delta/3k)}{s}\right)^{1/\alpha}.$$
Also recall from (3) that $\sum_{i\in[k]} w_i\rho_i \le \rho$. Proving the lemma is now a matter of careful applications of the triangle inequality:
$$
\begin{aligned}
\Big\|\sum_{i\in[k]}\hat w_i\hat G_i - g\Big\|_1
&= \Big\|\sum_{i\in[k]}\hat w_i\hat G_i - \sum_{i\in[k]} w_i G_i\Big\|_1\\
&\le \Big\|\sum_{i\in[k]} w_i(\hat G_i - G_i)\Big\|_1 + \Big\|\sum_{i\in[k]}(\hat w_i - w_i)\hat G_i\Big\|_1\\
&\le \Big\|\sum_{i\in L} w_i(\hat G_i - G_i)\Big\|_1 + \Big\|\sum_{i\notin L} w_i(\hat G_i - G_i)\Big\|_1 + \sum_{i\in[k]}|\hat w_i - w_i|\,\|\hat G_i\|_1\\
&\le 2\sum_{i\in L} w_i + \sum_{i\notin L} w_i(\varepsilon_i + C\rho_i) + \sum_{i\in[k]}\varepsilon/k\\
&\le \frac{2k\cdot 8\log(3k/\delta)}{s} + \left(\frac{2k\,\lambda(\mathcal F, \delta/3k)}{s}\right)^{1/\alpha} + C\rho + \varepsilon\\
&\le \varepsilon + \varepsilon + C\rho + \varepsilon = 3\varepsilon + C\rho,
\end{aligned}
$$
where for the last inequality we used the definition of $s$ in (4). This completes the proof of Lemma 9.

4 Learning Mixtures of Gaussians

Gaussian mixture models (GMMs) are probably the most widely studied mixture classes, with numerous applications; yet, the sample complexity of learning this class is not fully understood, especially when the number of dimensions is large. In this section, we show that our method for learning mixtures improves the state of the art for learning GMMs in terms of sample complexity. In the following, $N_d(\mu, \Sigma)$ denotes a Gaussian density function defined over $\mathbb R^d$, with mean $\mu$ and covariance matrix $\Sigma$.

Mixtures of Axis-Aligned Gaussians

A Gaussian is called axis-aligned if its covariance matrix $\Sigma$ is diagonal. The class of axis-aligned Gaussian mixtures is an important special case of GMMs that has been thoroughly studied in the literature (e.g., (Feldman, Servedio, and O'Donnell 2006)).

Theorem 10. Let $\mathcal F$ denote the class of $d$-dimensional axis-aligned Gaussians. Then $\mathcal F$ is 3-agnostic PAC-learnable with $m^3_{\mathcal F}(\varepsilon, \delta) = O((d + \log(1/\delta))/\varepsilon^2)$.

We defer the proof of this result to Section 7. Combining this theorem with Theorem 5, we obtain the following result:

Theorem 11. The class $\mathcal F_k$ of mixtures of $k$ axis-aligned Gaussians in $\mathbb R^d$ is 9-agnostic PAC-learnable with sample complexity $m^9_{\mathcal F_k}(\varepsilon, \delta) = O(kd\log k\log(k/\delta)/\varepsilon^4)$.
Accordingly, it is also PAC-learnable in the realizable case with the same number of samples. This theorem improves the upper bound of $O(dk^9\log^2(d/\delta)/\varepsilon^4)$ proved in (Suresh et al. 2014, Theorem 11) for spherical Gaussians in the realizable setting. Spherical Gaussians are special cases of axis-aligned Gaussians in which all eigenvalues of the covariance matrix are equal, i.e., $\Sigma$ is a multiple of the identity matrix. The following minimax lower bound (i.e., worst-case over all instances) on the sample complexity of learning mixtures of spherical Gaussians is proved in the same paper.

Theorem 12 (Theorem 2 in (Suresh et al. 2014)). The class $\mathcal F_k$ of mixtures of $k$ axis-aligned Gaussians in $\mathbb R^d$ in the realizable setting has $m_{\mathcal F_k}(\varepsilon, 1/2) = \Omega(dk/\varepsilon^2)$.

Therefore, our upper bound of Theorem 11 is optimal in terms of its dependence on $d$ and $k$ (up to logarithmic factors) for axis-aligned Gaussians.

Mixtures of General Gaussians

For general Gaussians, we have the following result.

Theorem 13. Let $\mathcal F$ denote the class of $d$-dimensional Gaussians. Then $\mathcal F$ is 3-agnostic PAC-learnable with $m^3_{\mathcal F}(\varepsilon, \delta) = O((d^2 + \log(1/\delta))/\varepsilon^2)$.

We defer the proof of this result to Section 7. Combining this theorem with Theorem 5, we obtain the following result:

Theorem 14. The class $\mathcal F_k$ of mixtures of $k$ Gaussians in $\mathbb R^d$ is 9-agnostic PAC-learnable with sample complexity $m^9_{\mathcal F_k}(\varepsilon, \delta) = O(kd^2\log k\log(k/\delta)/\varepsilon^4)$.

Accordingly, it is also PAC-learnable in the realizable case with the same number of samples. This improves by a factor of $k^2$ the upper bound of $O(k^3d^2\log k/\varepsilon^4)$ in the realizable setting, proved in (Diakonikolas, Kane, and Stewart 2017b, Theorem A.1). Note that Theorem 12 gives a lower bound of $\Omega(kd/\varepsilon^2)$ for $m_{\mathcal F_k}(\varepsilon, \delta)$, hence the dependence of Theorem 14 on $k$ is optimal (up to logarithmic factors). However, there remains a gap of a factor of $d/\varepsilon^2$ between the upper and lower bounds.

5 Mixtures of Log-Concave Distributions

A probability density function over $\mathbb R^d$ is log-concave if its logarithm is a concave function. The following result about the sample complexity of learning log-concave distributions is a direct consequence of the recent work of (Diakonikolas, Kane, and Stewart 2017a).

Theorem 15. Let $\mathcal F$ be the class of distributions corresponding to the set of all log-concave densities over $\mathbb R^d$. Then $\mathcal F$ is 3-agnostic PAC-learnable using $m^3_{\mathcal F}(\varepsilon, \delta) = O((d/\varepsilon)^{(d+5)/2}\log^2(1/\varepsilon))$ samples.

Using Theorem 5, we obtain the first result on the sample complexity of learning mixtures of log-concave distributions.

Theorem 16. The class of mixtures of $k$ log-concave distributions over $\mathbb R^d$ is 9-agnostic PAC-learnable using $O(d^{(d+5)/2}\,\varepsilon^{-(d+9)/2}\,k)$ samples.

6 Conclusions

We studied PAC learning of classes of distributions that are in the form of mixture models, and proposed a generic approach for learning such classes when we have access to a black box method for learning a single-component distribution. We showed that by going from one component to a mixture model with $k$ components, the sample complexity is multiplied by a factor of at most $(k\log^2 k)/\varepsilon^2$. Furthermore, as a corollary of this general result, we provided upper bounds of $O(kd^2\log^2 k/\varepsilon^4)$ and $O(kd\log^2 k/\varepsilon^4)$ for the sample complexity of learning GMMs and axis-aligned GMMs, respectively. Both of these results improve upon the state of the art in terms of dependence on $k$ and $d$. It is worth noting that for the case of GMMs, the dependence of our bound on $\varepsilon$ is $1/\varepsilon^4$; proving an upper bound of $O(kd^2/\varepsilon^2)$ remains open.
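For concreteness, here is how these mixture bounds follow from Theorem 5; this short substitution is consistent with the statements above but is not taken verbatim from the paper. Theorem 13 gives $m^3_{\mathcal F}(\varepsilon, \delta) = O((d^2 + \log(1/\delta))/\varepsilon^2)$, i.e. $\lambda(\mathcal F, \delta) = O(d^2 + \log(1/\delta))$ and $\alpha = 2$, so Theorem 5 yields
$$m^9_{\mathcal F_k}(\varepsilon, \delta) = O\!\left(\frac{k\log k\cdot\lambda(\mathcal F, \delta/3k)}{\varepsilon^{4}}\right) = O\!\left(\frac{k\log k\,\big(d^2 + \log(3k/\delta)\big)}{\varepsilon^{4}}\right) = O\!\left(\frac{kd^2\log k\log(k/\delta)}{\varepsilon^{4}}\right),$$
which for constant $\delta$ is the $O(kd^2\log^2 k/\varepsilon^4)$ bound quoted above; the same substitution with $\lambda(\mathcal F, \delta) = O(d + \log(1/\delta))$ from Theorem 10 gives the axis-aligned bound $O(kd\log^2 k/\varepsilon^4)$.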
Also, note that our result can be readily applied to the general case of mixtures of the exponential family. Let $\mathcal F^d$ denote the $d$-parameter exponential family. Then the VC-dimension of the corresponding Yatracos class (see Definition 20) is $O(d)$ (see Theorem 8.1 in (Devroye and Lugosi 2001)), and therefore, by Theorem 22, the sample complexity of PAC learning $\mathcal F^d$ is $O(d/\varepsilon^2)$. Finally, applying Theorem 5 gives a sample complexity upper bound of $O(kd/\varepsilon^4)$ for learning $\mathcal F^d_k$.

7 Proofs of Theorems 10 and 13

We follow the general methodology of (Devroye and Lugosi 2001) to prove upper bounds on the sample complexity of learning Gaussian distributions. The idea is to first connect distribution learning to the VC-dimension of a related set system (called the Yatracos class of the corresponding distribution family), and then provide upper bounds on the VC-dimension of this system. Our Theorem 22 gives an upper bound on the sample complexity of agnostic learning, given an upper bound on the VC-dimension of the Yatracos class. We remark that a variant of this result, without explicit dependence on the failure probability, is proved implicitly in (Chan et al. 2014) and also appears explicitly in (Diakonikolas, Kane, and Stewart 2017a, Lemma 6).

Definition 17 ($\mathcal A$-Distance). Let $\mathcal A\subseteq 2^X$ be a class of subsets of a domain $X$. Let $p$ and $q$ be two probability distributions over $X$. Then the $\mathcal A$-distance between $p$ and $q$ is defined as
$$\|p - q\|_{\mathcal A} := \sup_{A\in\mathcal A} |p(A) - q(A)|.$$

Definition 18 (Empirical Distribution). Let $S = \{x_i\}_{i=1}^m$ be a sequence of members of $X$. The empirical distribution corresponding to this sequence is defined by
$$\hat p_S(x) = \frac{1}{m}\sum_{i=1}^m \mathbf 1\{x = x_i\}.$$

The following lemma is a well-known refinement of the uniform convergence theorem; see, e.g., (Anthony and Bartlett 1999, Theorem 4.9).

Lemma 19. Let $p$ be a probability distribution over $X$. Let $\mathcal A\subseteq 2^X$ and let $v$ be the VC-dimension of $\mathcal A$. Then there exist universal positive constants $c_1, c_2, c_3$ such that
$$\Pr_{S\sim p^m}\big\{\|p - \hat p_S\|_{\mathcal A} > \varepsilon\big\} \le \exp(c_1 + c_2 v - c_3 m\varepsilon^2).$$

Definition 20 (Yatracos class). For a class $\mathcal F$ of functions from $X$ to $\mathbb R$, their Yatracos class is the family of subsets of $X$ defined as
$$\mathcal Y(\mathcal F) := \big\{\{x\in X : f_1(x) > f_2(x)\} \text{ for some } f_1, f_2\in\mathcal F\big\}.$$
Observe that if $f, g\in\mathcal F$ then $\|f - g\|_{TV} = \|f - g\|_{\mathcal Y(\mathcal F)}$.

Definition 21 (Empirical Yatracos Minimizer). Let $\mathcal F$ be a class of distributions over a domain $X$. The empirical Yatracos minimizer is defined as $L_{\mathcal F} : \cup_{m=1}^{\infty} X^m \to \mathcal F$ satisfying
$$L_{\mathcal F}(S) = \operatorname{argmin}_{q\in\mathcal F} \|q - \hat p_S\|_{\mathcal Y(\mathcal F)}.$$

Theorem 22 (PAC Learning Families of Distributions). Let $\mathcal F$ be a class of probability distributions, and let $S\sim p^m$ be an i.i.d. sample of size $m$ generated from an arbitrary probability distribution $p$, which is not necessarily in $\mathcal F$. Then with probability at least $1 - \delta$ we have
$$\|p - L_{\mathcal F}(S)\|_{TV} \le 3\,\mathrm{OPT}(\mathcal F, p) + \alpha\sqrt{\frac{v + \log(1/\delta)}{m}},$$
where $v$ is the VC-dimension of $\mathcal Y(\mathcal F)$, $\mathrm{OPT}(\mathcal F, p) = \inf_{q\in\mathcal F}\|q - p\|_{TV}$, and $\alpha$ is a universal constant. In particular, in the realizable setting $p\in\mathcal F$, we have
$$\|p - L_{\mathcal F}(S)\|_{TV} \le \alpha\sqrt{\frac{v + \log(1/\delta)}{m}}.$$

Remark 23. The $L_1$ distance is precisely twice the total variation distance.

Proof. Let $q^* = \operatorname{argmin}_{q\in\mathcal F}\|p - q\|_{TV}$, so $\|q^* - p\|_{\mathcal Y(\mathcal F)} \le \|q^* - p\|_{TV} = \mathrm{OPT}(\mathcal F, p)$. Since $L_{\mathcal F}(S), q^*\in\mathcal F$, we have $\|L_{\mathcal F}(S) - q^*\|_{TV} = \|L_{\mathcal F}(S) - q^*\|_{\mathcal Y(\mathcal F)}$. By Lemma 19, applied with $\mathcal A = \mathcal Y(\mathcal F)$, with probability at least $1 - \delta$ we have $\|p - \hat p_S\|_{\mathcal Y(\mathcal F)} \le \frac{\alpha}{2}\sqrt{(v + \log(1/\delta))/m}$ for some universal constant $\alpha$. Also, since $L_{\mathcal F}(S)$ is the empirical minimizer of the $\mathcal Y(\mathcal F)$-distance, we have $\|L_{\mathcal F}(S) - \hat p_S\|_{\mathcal Y(\mathcal F)} \le \|q^* - \hat p_S\|_{\mathcal Y(\mathcal F)}$.
The proof follows from these facts combined with multiple applications of the triangle inequality:
$$
\begin{aligned}
\|p - L_{\mathcal F}(S)\|_{TV}
&\le \|L_{\mathcal F}(S) - q^*\|_{TV} + \|q^* - p\|_{TV}\\
&= \|L_{\mathcal F}(S) - q^*\|_{\mathcal Y(\mathcal F)} + \mathrm{OPT}(\mathcal F, p)\\
&\le \|L_{\mathcal F}(S) - \hat p_S\|_{\mathcal Y(\mathcal F)} + \|\hat p_S - q^*\|_{\mathcal Y(\mathcal F)} + \mathrm{OPT}(\mathcal F, p)\\
&\le \|q^* - \hat p_S\|_{\mathcal Y(\mathcal F)} + \|\hat p_S - p\|_{\mathcal Y(\mathcal F)} + \|p - q^*\|_{\mathcal Y(\mathcal F)} + \mathrm{OPT}(\mathcal F, p)\\
&\le \|q^* - p\|_{\mathcal Y(\mathcal F)} + 2\|p - \hat p_S\|_{\mathcal Y(\mathcal F)} + 2\,\mathrm{OPT}(\mathcal F, p)\\
&\le \|q^* - p\|_{TV} + 2\|p - \hat p_S\|_{\mathcal Y(\mathcal F)} + 2\,\mathrm{OPT}(\mathcal F, p)\\
&\le \alpha\sqrt{\frac{v + \log(1/\delta)}{m}} + 3\,\mathrm{OPT}(\mathcal F, p).
\end{aligned}
$$

Theorem 22 provides a tool for proving upper bounds on the sample complexity of distribution learning. To prove Theorems 10 and 13, it remains to show upper bounds on the VC-dimensions of the Yatracos classes of (axis-aligned) Gaussian densities. For classes $\mathcal F$ and $\mathcal G$ of functions, let
$$\mathrm{NN}(\mathcal G) := \big\{\{x : f(x) > 0\} \text{ for some } f\in\mathcal G\big\} \quad\text{and}\quad \Delta\mathcal F := \{f_1 - f_2 : f_1, f_2\in\mathcal F\},$$
and notice that $\mathcal Y(\mathcal F) = \mathrm{NN}(\Delta\mathcal F)$. We upper bound the VC-dimension of $\mathrm{NN}(\Delta\mathcal F)$ via the following well-known result in statistical learning theory; see, e.g., (Devroye and Lugosi 2001, Lemma 4.2).

Theorem 24 (Dudley). Let $\mathcal G$ be an $n$-dimensional vector space of real-valued functions. Then $VC(\mathrm{NN}(\mathcal G)) \le n$.

Now let $h$ be the indicator function of an arbitrary element $\{x : f_1(x) > f_2(x)\}$ of the Yatracos class, where $f_1, f_2$ are densities of (axis-aligned) Gaussians. Then $h$ is a $\{0, 1\}$-valued function and, writing $\alpha_1, \alpha_2$ for the normalizing constants of the two densities, we have
$$
\begin{aligned}
h(x) &= \mathbf 1\{N(\mu_1, \Sigma_1)(x) > N(\mu_2, \Sigma_2)(x)\}\\
&= \mathbf 1\Big\{\alpha_1\exp\big(-\tfrac12(x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1)\big) > \alpha_2\exp\big(-\tfrac12(x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2)\big)\Big\}\\
&= \mathbf 1\Big\{(x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2) - (x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1) > 2\log\tfrac{\alpha_2}{\alpha_1}\Big\}.
\end{aligned}
$$
The inner expression is a quadratic form in $x$, and the linear space of all quadratic functions on $\mathbb R^d$ has dimension $O(d^2)$. Furthermore, for axis-aligned Gaussians, $\Sigma_1$ and $\Sigma_2$ are diagonal, and therefore the inner function lies in the $O(d)$-dimensional space of functions spanned by $\{1, x_1, \dots, x_d, x_1^2, \dots, x_d^2\}$. Hence, by Dudley's theorem, we have the required upper bounds ($O(d)$ or $O(d^2)$) on the VC-dimensions of the Yatracos classes. Finally, Theorems 10 and 13 follow from applying Theorem 22 to the class of (axis-aligned) Gaussian distributions.
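To illustrate the quadratic-threshold form of $h$ used above, the following sketch (not from the paper) evaluates the indicator $\mathbf 1\{N(\mu_1, \Sigma_1)(x) > N(\mu_2, \Sigma_2)(x)\}$ both directly from the densities and via the quadratic form, and checks that the two agree on random test points; with $\alpha_i = (2\pi)^{-d/2}\det(\Sigma_i)^{-1/2}$, the threshold $2\log(\alpha_2/\alpha_1)$ becomes $\log(\det\Sigma_1/\det\Sigma_2)$. All parameter choices are arbitrary.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d = 3
mu1, mu2 = rng.normal(size=d), rng.normal(size=d)
A1, A2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
S1, S2 = A1 @ A1.T + np.eye(d), A2 @ A2.T + np.eye(d)   # two random covariance matrices

def indicator_density(x):
    """h(x) = 1{ N(mu1, S1)(x) > N(mu2, S2)(x) }, evaluated from the densities."""
    return multivariate_normal(mu1, S1).pdf(x) > multivariate_normal(mu2, S2).pdf(x)

def indicator_quadratic(x):
    """The same indicator as a quadratic threshold:
    (x-mu2)^T S2^{-1} (x-mu2) - (x-mu1)^T S1^{-1} (x-mu1) > log(det S1 / det S2)."""
    q1 = (x - mu1) @ np.linalg.solve(S1, x - mu1)
    q2 = (x - mu2) @ np.linalg.solve(S2, x - mu2)
    return q2 - q1 > np.log(np.linalg.det(S1) / np.linalg.det(S2))

# The two evaluations agree on random test points (prints True).
xs = rng.normal(size=(200, d))
print(all(bool(indicator_density(x)) == bool(indicator_quadratic(x)) for x in xs))
```

For diagonal $\Sigma_1, \Sigma_2$ the quadratic form has no cross terms, which is exactly why the axis-aligned case lands in the $O(d)$-dimensional span of $\{1, x_1, \dots, x_d, x_1^2, \dots, x_d^2\}$.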
Acknowledgments

Abbas Mehrabian was supported by an NSERC Postdoctoral Fellowship and a Simons-Berkeley Research Fellowship. The authors thank the reviewers of the ALT 2017 conference for their comments and for pointing out a mistake in an earlier version. Very recently, the authors have improved Theorem 11 and showed that the class of mixtures of $k$ axis-aligned Gaussians in $\mathbb R^d$ is 3-agnostic PAC-learnable with sample complexity $O(kd/\varepsilon^2)$. The proof uses novel techniques; see (Ashtiani, Ben-David, and Mehrabian 2017).

References

Acharya, J.; Diakonikolas, I.; Li, J.; and Schmidt, L. 2017. Sample-optimal density estimation in nearly-linear time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '17, 1278–1289. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics.

Anthony, M., and Bartlett, P. 1999. Neural Network Learning: Theoretical Foundations. Cambridge University Press.

Ashtiani, H.; Ben-David, S.; and Mehrabian, A. 2017. Agnostic distribution learning via compression. arXiv preprint arXiv:1710.05209.

Belkin, M., and Sinha, K. 2010. Polynomial learning of distribution families. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS '10, 103–112. Washington, DC, USA: IEEE Computer Society.

Chan, S.-O.; Diakonikolas, I.; Servedio, R. A.; and Sun, X. 2014. Efficient density estimation via piecewise polynomial approximation. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, STOC '14, 604–613. New York, NY, USA: ACM.

Dasgupta, S. 1999. Learning mixtures of Gaussians. In 40th Annual Symposium on Foundations of Computer Science, 634–644. IEEE.

Devroye, L., and Lugosi, G. 2001. Combinatorial Methods in Density Estimation. Springer Series in Statistics. New York: Springer-Verlag.

Diakonikolas, I.; Kamath, G.; Kane, D. M.; Li, J.; Moitra, A.; and Stewart, A. 2016. Robust estimators in high dimensions without the computational intractability. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), 655–664.

Diakonikolas, I.; Kane, D. M.; and Stewart, A. 2017a. Learning multivariate log-concave distributions. In Proceedings of Machine Learning Research, volume 65 of COLT '17, 1–17.

Diakonikolas, I.; Kane, D. M.; and Stewart, A. 2017b. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. arXiv preprint arXiv:1611.03473v2 [cs.LG]. To appear in Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS '17).

Diakonikolas, I. 2016. Learning structured distributions. In Bühlmann, P.; Drineas, P.; Kane, M.; and van der Laan, M., eds., Handbook of Big Data. Chapman and Hall/CRC. Chapter 15, 267–283.

Feldman, J.; Servedio, R. A.; and O'Donnell, R. 2006. PAC learning axis-aligned mixtures of Gaussians with no separation assumption. In Proceedings of the 19th Annual Conference on Learning Theory, COLT '06, 20–34. Berlin, Heidelberg: Springer-Verlag.

Karpinski, M., and Macintyre, A. 1997. Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. Journal of Computer and System Sciences 54(1):169–176.

Kearns, M.; Mansour, Y.; Ron, D.; Rubinfeld, R.; Schapire, R. E.; and Sellie, L. 1994. On the learnability of discrete distributions. In Proceedings of the Twenty-Sixth Annual ACM Symposium on Theory of Computing, STOC '94, 273–282. New York, NY, USA: ACM.

Li, J., and Schmidt, L. 2017. Robust and proper learning for mixtures of Gaussians via systems of polynomial inequalities. In Kale, S., and Shamir, O., eds., Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, 1302–1382. Amsterdam, Netherlands: PMLR.

Mahalanabis, S., and Stefankovic, D. 2008. Density estimation in linear time. In Servedio, R. A., and Zhang, T., eds., 21st Annual Conference on Learning Theory (COLT 2008), Helsinki, Finland, July 9–12, 2008, 503–512. Omnipress.

Mitzenmacher, M., and Upfal, E. 2005. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. New York, NY, USA: Cambridge University Press.

Moitra, A., and Valiant, G. 2010. Settling the polynomial learnability of mixtures of Gaussians. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS '10, 93–102. Washington, DC, USA: IEEE Computer Society.

Suresh, A. T.; Orlitsky, A.; Acharya, J.; and Jafarpour, A. 2014. Near-optimal-sample estimators for spherical Gaussian mixtures. In Ghahramani, Z.; Welling, M.; Cortes, C.; Lawrence, N. D.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 27, 1395–1403. Curran Associates, Inc.