# distributed_learning_of_finite_gaussian_mixtures__723957c5.pdf Journal of Machine Learning Research 23 (2022) 1-40 Submitted 1/21; Revised 2/22; Published 2/22 Distributed Learning of Finite Gaussian Mixtures Qiong Zhang qiong.zhang@stat.ubc.ca Department of Statistics University of British Columbia Vancouver, BC V6T 1Z4, Canada Jiahua Chen jhchen@stat.ubc.ca Department of Statistics University of British Columbia Vancouver, BC V6T 1Z4, Canada Editor: Sathiya Keerthi Advances in information technology have led to extremely large datasets that are often kept in different storage centers. Existing statistical methods must be adapted to overcome the resulting computational obstacles while retaining statistical validity and efficiency. In this situation, the split-and-conquer strategy is among the most effective solutions to many statistical problems, including quantile processes, regression analysis, principal eigenspaces, and exponential families. This paper applies this strategy to develop a distributed learning procedure of finite Gaussian mixtures. We recommend a reduction strategy and invent an effective majorization-minimization algorithm. The new estimator is consistent and retains root-n consistency under some general conditions. Experiments based on simulated and real-world datasets show that the proposed estimator has comparable statistical performance with the global estimator based on the full dataset, if the latter is feasible. It can even outperform the global estimator for the purpose of clustering if the model assumption does not fully match the real-world data. It also has better statistical and computational performance than some existing split-and-conquer approaches. Keywords: Barycenter, Gaussian mixture, Global convergence, Mixture reduction, MM algorithm, Model-based clustering, Split-and-conquer. 1. Introduction In the era of big data, the sizes of the datasets for various applications may be so large that they cannot be stored on a single machine. According to Corbett et al. (2013), Google distributes its huge database around the world. Even if the dataset is stored on a single machine, it can be difficult to load the whole dataset into the computer memory. Distributed data storage is also natural when the datasets are collected and managed by independent agencies. Examples include patient information collected from different hospitals and data collected by different government agencies (Agrawal et al., 2003). Privacy considerations may make it difficult or even impossible to pool the separate collections of data into a single dataset. Data analysis methods in such applications should therefore be designed so that they can work with subsets of the dataset, in parallel or sequentially. The information extracted from the subsets may then be combined to draw conclusions about the whole c 2022 Qiong Zhang and Jiahua Chen. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-0093.html. Zhang and Chen population. Under distributed data storage, a two-step split-and-conquer procedure is often used for inference: (i) Local inference: standard inference is carried out on local machines; (ii) Aggregation: the local results are transmitted to a central machine and aggregated. The split-and-conquer approach addresses privacy concerns by sharing only summary statistics across machines. It is also communication efficient and avoids the potentially high transmission cost that may exceed the computational cost (Jaggi et al., 2014). A variety of split-and-conquer approaches have been developed. For example, the splitand-conquer learning of the generalized linear models (Chen and Xie, 2014), kernel ridge regression models (Zhang et al., 2015), ordinary linear models (Chang et al., 2017), and the split-and-conquer estimation of principal eigen-spaces (Fan et al., 2019). Also see the split-and-conquer version of the Wald and score tests (Battey et al., 2015). Most existing approaches first obtain one local estimate of the model parameters based on per local dataset. These local estimates are then pooled and aggregated through a linear averaging operation. This paper focuses on developing split-and-conquer approaches for finite Gaussian mixture models (GMMs), which is widely used for model-based clustering (Murphy, 2012). Learning of mixture models is a well studied problem (Mc Lachlan and Peel, 2004; Frühwirth Schnatter, 2006; Mengersen et al., 2011; Balakrishnan et al., 2017; Chen et al., 2020; Heinrich and Kahn, 2018; Dwivedi et al., 2020) while the relatively new distributed learning of finite mixtures remains to be investigated. We usually learn finite mixture models via the maximum likelihood estimate (MLE) through a well-established EM algorithm (Dempster et al., 1977; Wu, 1983). Distributed versions of the EM algorithm such as Nowak (2003), Safarinejadian et al. (2010), and Chen et al. (2013) are also developed when datasets are stored in a distributed fashion. However, these approaches require the communication of summary statistics and the coordination across local machines at each iteration. Another approach is to first condense the data on local machines into much smaller coresets (Lucic et al., 2017) and then refit the mixture to the pooled coresets at a central machine. The coreset approach achieves high computational efficient at the cost of statistical efficiency, and privacy due to the transmission of raw data. The split-and-conquer based approaches also have their challenges under finite mixtures. The parameter of finite mixture, the mixing distribution, is a discrete distribution with a finite number of support points. The popular linear averaging operation of the local estimates results in a mixing distribution with an inflated number of support points. The aggregated mixture hence has redundant and spurious subpopulations. The KL divergence based aggregation approach of Liu and Ihler (2014) achieves the best efficiency under models from exponential family, but it is less efficient under GMMs. We therefore address the important problem of developing an aggregation procedure with high computational and statistical efficiencies under GMMs. In this paper, we learn the finite Gaussian mixtures via the penalized MLE during the local inference. We explore two potential aggregation approaches and find that a specific reduction approach has satisfactory statistical and computational efficiencies. During the aggregation step, the reduction method first pools the local estimates to form a mixture model with clearly an excessive number of subpopulations. We then search for a mixture with a designated order that is optimal according to some divergence criteria. We identify a Distributed Learning of Finite Gaussian Mixtures divergence metric based on the transportation distance so that the corresponding estimator retains statistical efficiency and computational simplicity. We also develop an majorizationminimization algorithm for the numerical computation. The remainder of the paper is structured as follows. In Section 2, we introduce finite Gaussian mixture models and the transportation distance. In Section 3, we study two splitand-conquer methods and recommend a specific one. In Section 4, we describe an algorithm for computing the recommended estimator. In Section 5, we show that the proposed estimator has the best possible statistical convergence rate under some generic conditions. In Section 6, we review existing approaches for distributed learning of finite Gaussian mixtures and their pros and cons compared with our proposed method. Numerical experiments on simulated, public, and real datasets are presented in Section 7. In all cases, the proposed approach performs well. Section 8 contains some discussion and concluding remarks. The technical details are given in the appendix. 2. Preliminaries on Finite Gaussian Mixtures and Divergences In this section, we introduce finite Gaussian mixture models, EM algorithm, and the transportation distances between mixing distributions or between mixtures. 2.1 Finite Gaussian Mixtures A statistical model is a family of distributions. In many applications, a dataset can be regarded as a random sample from a population, and the population is a member of some parametric distribution family. For instance, a Gaussian distribution family is often assumed in applications. When any single distribution in a parametric family fails to properly model the data, it may be caused by the heterogeneity in the population. For instance, the population is made of several subpopulations so that each subpopulation has a distinct distribution belonging to the same parametric family. Then each observation in the dataset is a random outcome from a subpopulation. The subpopulation, or the membership of the observation, is itself random. Without the knowledge of subpopulation memberships of the individual units, the data form a random sample from a mixture distribution. Mathematically, a finite mixture model is defined as follows. Definition 1 (Finite Mixture Model) Let F = {f( ; θ) : θ Θ} be a parametric family of density functions with respect to some σ-finite measure, where Θ is some parameter space. Let G = PK k=1 wkδθk be a discrete probability measure, assigning probability wk to parameter value θk on Θ for some integer K > 0. The finite mixture distribution of F with mixing distribution G has density function f(x; G) := Z f(x; θ) d G(θ) = k=1 wkf(x; θk). The finite mixture model of F is the distribution family with densities in the form of (1). The parameter space Θ is usually a subset of d-dimensional Euclidean space Rd for some integer d 1. In addition, θ = (θ1, . . . , θK) is a vector of subpopulation parameters. We Zhang and Chen call f( ; θk)s subpopulation density functions. Let the K 1 dimensional simplex be (w1, . . . , w K) : wi 0, The components of the vector w = (w1, w2, . . . , w K) K 1 are called the mixing weights. We use F(x; θ) and F(x; G) respectively for the cumulative distribution functions (CDFs) of f(x; θ) and f(x; G). The space of mixing distributions of order up to K is denoted k=1 wkδθk, w K 1, θ ΘK ) An order K mixture has its mixing distribution in GK GK 1. When the subpopulation distribution is a multivariate Gaussian, we have a finite Gaussian mixture and the corresponding Gaussian mixture model (GMM). Example 1 (Finite Gaussian Mixtures) Let the density function of a d-dimensional Gaussian random variable with mean vector µ and covariance matrix Σ be φ(x; µ, Σ) = det{2πΣ} 1/2 exp 1 2(x µ) Σ 1(x µ) , and let Φ(x; µ, Σ) be its CDF. A Gaussian mixture of order K is a distribution with density and CDF given by k=1 wkφ(x; µk, Σk); Φ(x; G) = k=1 wkΦ(x; µk, Σk). Note the mixing distribution G is a discrete distribution on Θ, which is the space of the d-dimensional vector µ and d d positive definite matrix Σ. The earliest example of GMM in Pearson (1894) suggests that the skewness displayed in the histogram of the biometrics of 1, 000 crabs is due to the presence of two non-identified species. There are also applications of mixture in finance. Suppose the stock prices in the hidden normal or extreme periods have different Gaussian distributions. The marginal distribution is then a Gaussian mixture of order 2 (Liesenfeld, 2001). For general theory and applications of the finite mixture models, see Mc Lachlan and Peel (2004) and Frühwirth Schnatter (2006). The most popular learning approach for finite mixture models is MLE. Under finite GMMs, the likelihood is however unbounded, and the MLE is not well defined. Let X = {x1, x2, . . . , xn} be observed values on a set of independent and identically distributed (IID) random vectors with a finite Gaussian mixture distribution φ(x; G) of order K. The loglikelihood function of G is given by i=1 log φ(xi; G) = i=1 log n K X k=1 wkφ xi; µk, Σk o . Distributed Learning of Finite Gaussian Mixtures This log-likelihood function is unbounded: its value goes to infinity for a specific combination of µk and some degenerating Σk (Chen and Tan, 2009). To solve this issue, Chen and Tan (2009) shows the log-likelihood function can be regularized by adding a penalty function on the subpopulation covariance matrices. The maximizer of the penalized likelihood is well-defined and is a consistent estimator. Let Sx be the sample covariance matrix. Chen and Tan (2009) recommends the following penalized log-likelihood function pℓn(G|X) = ℓn(G|X) an X tr(SxΣ 1 k ) + log det(Σk) for some positive an that is allowed to depend on n, and they learn the mixing distribution G via b Gn := arg sup pℓn(G|X). (2) The size of an controls the strength of the penalty, and a recommended value is n 1/2. Regularizing the likelihood function via a penalty function fixes the problem caused by degenerated Σk. We call b Gn in (2) p MLE and it is strongly consistent when the order has a known upper bound. The specific form of the penalty function mimics a Wishart prior on the subpopulation precision matrices. It allows an easy generalization of the EM algorithm for the numerical computation and a consistency proof (Chen and Tan, 2009). Other forms of the penalty function should be possible but unlikely to markedly outperform the current one. We describe the EM algorithm for numerical computation in the next section. 2.2 EM Algorithm for Computing p MLE Let X = {x1, x2, . . . , xn} i.i.d. φ( ; G) and the sample covariance matrix is Sx. Recall that each observation in the dataset can be regarded as a sample from a subpopulation φ(x; µk, Σk) with probability wk. We hence introduce a membership vector zi = (zi1, . . . , zi K) for the ith unit. The kth entry of zi is 1 when the response value xi is an observation from the kth subpopulation, and 0 otherwise. When the complete data {(zi, xi), i = 1, 2, . . . , n} are available, we can construct a penalized complete likelihood k=1 zik log{wkφ(xi; µk, Σk)} an X tr(SxΣ 1 k ) + log det(Σk) . Given the full dataset X and a proposed mixing distribution G(t), we may compute the conditional expectation w(t) ik = E(zik|X, G(t)) = w(t) k φ(xi; µ(t) k , Σ(t) k ) PK j=1 w(t) j φ(xi; µ(t) j , Σ(t) j ) . (3) With this, we define Q(G; G(t)) = i=1 w(t) ik log wk 1 i=1 w(t) ik log det(Σk) 2antr Σ 1 k S + i=1 w(t) ik (xi µk) Σ 1 k (xi µk) . (4) Zhang and Chen Note that the subpopulation parameters are separated in Q( ; ). We refer to the procedure starting from G(t) to the definition of Q(G; G(t)) as the E-step of the algorithm. The M-step of the algorithm maximizes Q(G; G(t)) with respect to G. Let i=1 w(t) ik (xi µ(t+1) k )(xi µ(t+1) k ) . The solution of the M-step has an explicit expression with the mixing distribution G(t+1) w(t+1) k = n 1 Pn i=1 w(t) ik , µ(t+1) k = nw(t+1) k 1 Pn i=1 w(t) ik xi, Σ(t+1) k = 2an + nw(t+1) k 1 2an Sx + S(t+1) k . The adapted EM algorithm iterates between E and M steps until some convergence criterion is satisfied and G(t) in the latest iteration is regarded as the p MLE. Like its MLE counterpart, pℓn(G(t)) is an increasing sequence in t. We say A B if A B is semi-positive definite for any two square matrices A and B. In this sense, Σ(t+1) k {2an/(n + 2an)}Sx and the latter one does not dependent on t. Hence, the increasing sequence {pℓn(G(t)), t = 1, 2, . . .} also has a finite upper bound. Therefore, pℓn(G(t)) is guaranteed to converge and the limiting points of G(t) are non-degenerate local maxima as t , following additional mathematical justifications of Wu (1983). 2.3 Divergence and Distance between Probability Measures To develop split-and-conquer methods for learning finite mixture models, we need a measure of the closeness between mixing distributions or between mixtures. We first introduce the divergence and distance. Definition 2 (Divergence and Distance) Let Θ be a space. A bivariate function ρ( , ) defined on Θ is a divergence if ρ(θ1, θ2) 0, with equality holds if and only if θ1 = θ2. Suppose that in addition ρ( , ) satisfies (i) symmetry, i.e., ρ(θ1, θ2) = ρ(θ2, θ1), and (ii) the triangle inequality, i.e., ρ(θ1, θ2) ρ(θ1, θ3)+ρ(θ3, θ2), for all θ1, θ2, θ3 Θ. Then ρ( , ) is a distance on Θ. When ρ( , ) is a distance, we call (Θ, ρ) a metric space. We need divergences or distances specifically defined on the space of probability measures. The transportation divergence (Villani, 2003) is particularly useful. Let P(Θ) and P(Θ2) be the spaces of probability measures on Θ and Θ Θ, respectively, equipped with some compatible σ-algebras. For any π P(Θ2), let its marginal measures be π1, and π ,2. For any η, ν P(Θ), we define a space of distributions on Θ Θ as follows: Π(η, ν) = {π P(Θ2) : π1, = η, π ,2 = ν}. The space Π(η, ν) is usually referred to as the coupling between η and ν, and it consists of bivariate distributions for which the marginal distributions are η and ν. For convenience, we use Π(η, ) for the space of distributions with first marginal distribution being η, and similarly for Π( , ν). We regard the mixing weights w as a distribution on Θ when appropriate. Distributed Learning of Finite Gaussian Mixtures Definition 3 (Transportation Divergence and r-Wasserstein distance) Let c( , ) be a non-negative valued bivariate function on Θ Θ satisfying c(θ, θ) = 0 for all θ Θ. The transportation divergence from η to ν is defined to be Tc(η, ν) = inf{Eπ[c(X, Y )] : π Π(η, ν)}. where Eπ{c(X, Y )} is the expectation calculated when the joint distribution of X, Y is π. If c( , ) = Dr( , ) for some r 1 and distance D( , ) on Θ, then WD = T 1/r c ( , ) is called the r-Wasserstein distance with the ground distance D( , ). The infimum in the above definition is with respect to π over Π(η, ν). The same convention may be used subsequently. The joint distribution π that minimizes Eπ{c(X, Y )} is called the optimal transportation plan and is usually denoted π . It is the most efficient plan that transports mass distributed according to η to mass distributed according to ν. Definition 4 (Barycenter) Let (P(Θ), ρ) be a space of probability measures on Θ that is endowed with the divergence ρ( , ). Given positive constants (λ1, λ2, . . . , λM) M 1, the (weighted) barycenter of ν1, . . . , νM P(Θ) is a minimum point of PM m=1 λmρ(νm, η). We allow ρ( , ) to be any divergence function. When ρ( , ) = Dr( , ), it becomes the usual r-Wasserstein barycenter (Cuturi and Doucet, 2014). Conceptually, a barycenter should always be referred to together with information on ρ and the weights (λ1, λ2, . . . , λM). However, the formal description can be tedious. We provide the full information only when necessary. 3. Proposed Split-and-Conquer Approach Suppose we have an IID random sample X = {x1, x2, . . . , x N} from a parametric distribution f(x; θ). Suppose that it is partitioned into M subsets X1, . . . , XM completely at random and stored on M local machines. Let Nm denote the sample size on the mth local machine. Clearly, PM m=1 Nm = N. Let bθm be the local estimate of θ based on Xm. For example, under finite GMM, the parameter θ becomes the mixing distribution G and the local estimate is the p MLE b Gm = arg max i Xm log φ(xi; G) am tr(SmΣ 1 k ) + log det(Σk) ) where Sm is the sample covariance matrix based on the observations in Xm and am = N 1/2 m as recommended. How should we aggregate the local estimates, bθm in general and b Gm in particular, in the split-and-conquer framework? One approach is to combine the local estimates by their linear average (Zhang et al., 2015; Chang et al., 2017). The aggregated estimator is then the weighted average θ = PM m=1 λmbθm with λm set to the sample proportion Nm/N. This simple approach is appropriate for parameters in a vector space, but it is nonsensical if the average of the parameters is not Zhang and Chen well defined. Under finite mixtures, let b G1, . . . , b GM be the local estimators. Then the linear average is m=1 λm b Gm. (5) Its corresponding mixture has density function f(x; G) = PM m=1 λmf(x; b Gm). While f(x; G) is a good estimate, it can be unsatisfactory in terms of revealing the latent structure of the mixture. For instance, this estimator gives a mixture with MK subpopulations rather than the assumed K, which is useless for clustering the dataset into K clusters. Approach 1 One way to adapt the linear averaging is to define a sensible average in the space of the mixing distributions. Recall that the linear average of vectors in the Euclidean space is the centroid (barycenter) of these vectors from a geometric point of view. Similarly, the local estimators b G1, . . . , b Gm may be averaged through their barycenter: GC = arg inf G GK m=1 λmρ( b Gm, G) (6) for some choice of the divergence ρ( , ). The parameter space GK is the space of mixing distributions with K support points and is defined in (1). Approach 2 The mixing distribution G is likely close to the true mixing distribution G , except for the incorrect number of support points. This problem can be solved by approximating G by some G GK. This suggests another aggregation approach. Let ρ( , ) be a divergence in the space of mixing distributions. We can aggregate the local estimates via the reduction estimator, given by GR = arg inf G GK ρ(G, G). (7) These two aggregation approaches, barycenter and reduction, are connected when specific divergences are used. For example, let the divergence ρ( , ) in the barycenter estimator (6) or in the reduction estimator (7) be the well-known KL-divergence: ρ(G1, G2) = DKL(Φ( ; G1) Φ( ; G2)) = Z φ(x; G1) log{φ(x; G1)/φ(x; G2)}dx. In this case, we have DKL(Φ( ; G) Φ( ; G)) = Z φ(x; G) log φ(x; G)/φ(x; G) dx = C1 Z φ(x; G) log φ(x; G)dx Z φ(x; b Gm) log φ(x; G)dx m=1 λm DKL(Φ( ; b Gm) Φ( ; G)) Distributed Learning of Finite Gaussian Mixtures where C1 and C2 are constants not dependent on G. This relationship implies that GR = arg inf G GK ρ(G, G) = arg inf G GK m=1 λmρ( b Gm, G) o = GC. (8) Thus, the two aggregation methods give identical aggregated estimators. The following example shows that the two methods do not always have the same outcome and the barycenter may not give ideal aggregation result. Example 2 (Barycenter of Mixtures with Identical Subpopulations) Suppose we wish to aggregate two local estimates given by Φ(x; G1) = 0.4Φ(x; 1, 1) + 0.6Φ(x; 1, 1) := 0.4Φ 1 + 0.6Φ1, Φ(x; G2) = 0.6Φ(x; 1, 1) + 0.4Φ(x; 1, 1) := 0.6Φ 1 + 0.4Φ1 with λ1 = λ2 = 0.5. We anticipate that whatever distance or divergence we choose, the barycenter will be given by Φ(x; G) = 0.5Φ(x; 1, 1) + 0.5Φ(x; 1, 1) = 0.5Φ 1 + 0.5Φ1. Consider the 2-Wasserstein distance with Euclidean ground distance between two univariate Gaussian distributions that is given by D2 Φ( ; µ1, σ2 1), Φ( ; µ2, σ2 2) = (µ1 µ2)2 + (σ1 σ2)2. Surprisingly, when the divergence is chosen to be ρ = W2 D( , ), the barycenter is given by Φ(x; GC) = 0.4Φ(x; 1, 1) + 0.6Φ(x; 2/3, 1) := 0.4Φ 1 + 0.6Φ2/3. We defer the technical details to Appendix 8.1.1. This example is extremely sharp and best illustrates this issue. The distortion of barycenter to the level in this example is unlikely in real-world applications. The issue here is rooted in the divergence ρ employed in defining the barycenter. Other choices of ρ may lead to a solution not far from our intuition. In comparison, for the reduction estimator, whatever divergence ρ is employed, we always have GR = G. This example urges us to explore various choices of the divergence ρ in (6) and (7). We must take into account both statistical efficiency and computational complexity. For statistical efficiency, the properties of the divergence in (6) must ensure that the corresponding barycenter matches our intuition. Besides, the computation of (6) is usually more expensive than that of (7) given the same ρ. This paper hence focuses more on reduction estimator (7) for split-and-conquer learning of finite mixtures. In the machine learning community, approximate one Gaussian mixture by another with a lower order is called Gaussian mixture reduction (GMR) (Crouse et al., 2011). These approaches are either ad hoc or computationally challenging. For example, Williams and Maybeck (2006) proposes to perform GMR by minimizing ρ(G, G) = L2(Φ( ; G), Φ( ; G)). Although the L2 distance between two Gaussian mixtures has an analytical form, the optimization can be expensive. Our key observation is that it is usually difficult to compute the divergence between two mixtures, but easy to compute that between two Gaussian distributions. This leads to the effective reduction algorithm that we present in the next section. For simplicity, we refer to the proposed estimator as the GMR estimator. Zhang and Chen 4. Numerical Algorithm for Proposed Reduction Approach Let G be defined as in (5). Let the subpopulations in G be Φi = Φ(x; µi, Σi) and the mixing weights be wi for i [MK] = {1, 2, . . . , MK}. Let G be any mixing distribution of order K with the K subpopulations Φγ = Φ(x; µγ, Σγ) and the mixing weights vγ for γ [K]. In vector format, the weights are w and v. Let c( , ) be a cost function based on a divergence in the space of d-dimensional Gaussian distributions for which the computational cost is low. Then the transportation divergence (Nguyen, 2013) between mixing distributions G and G GK with cost function c becomes Tc(G, G) = inf n X i,γ πiγc(Φi, Φγ) : π Π(w, v) o . The corresponding GMR estimator is GR = arg inf Tc(G, G) : G GK . (9) For (9), it may appear that calculating our estimator involves two optimizations: computing Tc(G, G) for each pair of G and G, and then searching for arg inf G Tc(G, G). We are able to design a more efficient optimization algorithm based on the following observation. The optimization problem in Tc(G, G) involves searching for transportation plans π under two marginal constraints specified by w and v. While constraint w is strict, v is a moving constraint. Instead of searching for π satisfying constraint v, we move v to meet π. This makes the marginal distribution constraint v on π redundant. Let us define two functions of G, with G hidden in the background: Jc(G) = inf X i,γ πiγc(Φi, Φγ) : π Π(w, ) , (10) π(G) = arg inf X i,γ πiγc(Φi, Φγ) : π Π(w, ) . (11) Note that both functions depend on G through its subpopulations Φγ but are free of its mixing weights v. The optimizations in (10) and (11) involve only the linear constraint in terms of w. Hence, the optimal transportation plan π(G) for a given G has an analytical form: ( wi if γ = arg minγ c(Φi, Φγ ) 0 otherwise. (12) When c(Φi, Φ) has multiple minima, we transport Φi evenly to every minimum Φ. For example, if c(Φ1, Φ) have two minima Φγ and Φγ , we let π1γ (G) = π1γ (G) = w1/2. Theorem 5 Let G, Tc( ), Jc( ), π( ), and other notations be the same as earlier. We have inf{Tc(G) : G GK} = inf{Jc(G) : G GK}. (13) The subpopulations of the GMR estimator are hence given by GR = arg inf{Jc(G) : G GK} (14) Distributed Learning of Finite Gaussian Mixtures and the mixing weights are given by v with i πiγ(GR). (15) The existence of a solution to (14) is guaranteed under a simple condition on cost function c( , ), see Theorem 7. The proof of Theorem 5 is in Appendix 8.1.2. Based on this theorem, the optimization reduces to searching for K subpopulations Φγ for γ [K] to make up G. The mixing proportions are then determined by (15). An iterative algorithm quickly emerges following the well-known majorization-minimization (MM) idea (Hunter and Lange, 2004). Algorithm 1 MM algorithm for GMR estimator with KL-divergence cost function. Initialization: Φγ, γ [K] repeat for γ [K] do for i [MK] do ( wi if γ = arg minγ DKL(Φi, Φγ ) 0 otherwise end for Let i=1 πiγ, µγ = i=1 {πiγ/π γ}µi i=1 {πiγ/π γ}{Σi + (µi µγ)(µi µγ) } end for until the change in the value of the objective function P i,γ πiγDKL(Φi, Φγ) is below some threshold ϵ > 0 Let vγ = P i πiγ for γ [K] Output: {(vγ, µγ, Σγ) : γ [K]} The algorithm starts with some G(0) with K subpopulations specified. Let G(t) be the mixing distribution after t MM iterations. Define a majorization function of Jc at G(t) to be Kc(G|G(t)) = X i,γ πiγ(G(t))c(Φi, Φγ) (16) where πiγ(G(t)) is computed according to (12). Once π(G(t)) has been obtained, we update the mixing proportion vector of G(t) easily via v(t+1) γ = X i πiγ(G(t)). Zhang and Chen In fact, v(t) is not needed until the algorithm converges. The subpopulations Φγ are separated in the majorization function (16). This allows us to update the subpopulation parameters, one Φγ at a time and possibly in parallel, as the solutions to Φ(t+1) γ = arg inf Φ i πiγ(G(t))c(Φi, Φ). (17) The MM algorithm then iterates between the majorization step (16) and the minimization step (17) until some user-selected convergence criterion is met. The most expensive step in the MM algorithm is the optimization in (17). If we choose the cost function c( , ) = ρr( , ) with ρ( , ) being a divergence in the space of probability measures, the solution to (17) is a barycenter as given in Definition 4. The following lemma shows that the KL-based barycenter of Gaussian distributions has an analytical form and is therefore computationally simple. Lemma 6 (KL-Barycenter of Gaussian Measures) Let νm = Φ( ; µm, Σm), m [M] and λ = (λ1, . . . , λM) M 1. Then PM m=1 λm DKL(νm η) is minimized uniquely in the space of Gaussian measures at η = Φ( ; µ, Σ) with m=1 λmµm and Σ = m=1 λm{Σm + (µm µ)(µm µ) }. Due to the ease of computing the barycenter as shown in this lemma, we recommend c(Φi, Φγ) = DKL(Φi Φγ) in (9). Cost functions define the geometries on the Gaussian distribution space (Peyré and Cuturi, 2019), leading to slightly different outputs. We do not rule out the possibility of better choices. The pseudo-code for the MM algorithm with KL divergence as the cost function is given in Algorithm 1. For notational convenience, in the following theorem, we use Φ for both the parameter (µ, Σ) and the distribution, and similarly for Φ . Theorem 7 Suppose the cost function c( , ) is continuous in both arguments. For some distance in the parameter space of Φ, assume that for any constant > 0 and Φ the following set is compact: {Φ : c(Φ , Φ) }. (18) Let {G(t)} be the sequence generated by G(t+1) = arg min Kc(G|G(t)) with some initial mixing distribution G(0). Then (i) Jc(G(t+1)) Jc(G(t)) for any t; (ii) if G is a limiting point of G(t), then G(t) = G implies Jc(G(t+1)) = Jc(G ). These two properties imply that Jc(G(t)) converges monotonically to some constant J . All the limiting points G(t) are stationary points of Jc( ): iterations from G do not further reduce the value of the objective function Jc( ). We have practically cloned the global convergence theorem (Zangwill, 1969). We do not see a way to directly apply it and therefore provide a proof of the theorem in Appendix 8.1.3. We have all the ingredients for the split-and-conquer learning of a finite GMM. We then consider the statistical properties of the GMR estimator and the experimental evidence for the efficiency of our method. Distributed Learning of Finite Gaussian Mixtures 5. Statistical Properties of the Reduction Estimator We show that the proposed GMR estimator GR is consistent and retains the optimal rate of convergence in a statistical sense. We first state some conditions on the data and the estimation methods. C1 The data X are IID observations from the finite Gaussian mixture Φ(x; G ) with K distinct subpopulations, that is the order of G is known to be K. The subpopulations have distinct parameters and positive definite covariance matrices. C2 The dataset X is partitioned into M subsets X1, . . . , XM of sizes N1, . . . , NM, where each set contains IID observations from the same finite Gaussian mixture distribution. The number of local machines M does not increase with N = P m Nm. C3 The local machine sample ratios Nm/N have a nonzero limit as N . C4 The cost function c(Φk, Φ0) 0 or c(Φ0, Φk) 0 if and only if Φk Φ0 in distribution, and c(Φ1, Φ2) is continuous in both Φ1 and Φ2. Condition C4 is necessary to ensure consistency. It further rules out the case that Tc(G, G ) = for any G with different mixing weights from that of G . The consistency and asymptotic normality of the p MLE under the finite multivariate GMM are established in Chen and Tan (2009) under the standard IID setting. We do not repeat the details here but state the conclusion in a simplified fashion. We use Φ k and (w k, µ k, Σ k) to respectively denote true subpopulation, the mixing weight and the true subpopulation parameters. Lemma 8 (Consistency of p MLE) Given n IID observations from a finite multivariate Gaussian mixture with known order K, the p MLE b G as defined by (2) with an = n 1/2 is asymptotically normal with rate n 1/2. Specifically, it is possible to line up the subpopulation parameters of the true mixing distribution G and of the p MLE b G such that ( bwk, bµk, bΣk) = (w k, µ k, Σ k) + o(1) and ( bwk, bµk, bΣk) = (w k, µ k, Σ k) + Op(n 1/2) as n in obvious notation. Here o(1) is a quantity that converges to 0 almost surely. A quantity is Op(n 1/2) if it is bounded by Cn 1/2 for sufficiently large C with a probability arbitrarily close to 1. Chen and Tan (2009) proves the root-n-consistency for a non-random penalty term, the asymptotic normality remains valid when the sample covariance matrix is part of the penalty. This is because the sample covariance matrix converges to a positive definite matrix. The assumption of known K is crucial for the claimed rate of convergence. If K is unknown, then the convergence rate of b G is far below n 1/2: see Chen (1995) and the recent developments in Rousseau and Mengersen (2011), Nguyen (2013), Heinrich and Kahn (2018), and Dwivedi et al. (2020). Our proposed reduction estimator is aggregated from the p MLEs learned Zhang and Chen at the local machines. Under the condition that min Nm stated above, all the local estimators are consistent by Lemma 8. Hence, the consistency of the linear average estimator G is taken as granted for a constant M. Theorem 9 (Consistency of GR) Let G be the linear average estimator defined by (5) and GR be the aggregated estimator by reduction defined by (7) with ρ = Tc. Assume conditions C1 C4 are satisfied. Then GR is strongly consistent. Specifically, Tc(GR, G ) 0 almost surely as N . The proof of the theorem is given in Appendix 8.1.4. The following theorem shows that under one additional mild condition on the cost function c( , ), the reduction estimator GR has the standard N 1/2 convergence rate. We denote by Φ1 Φ2 the Euclidean norm in µ, Σ in the sense of (20). We use ΦR k for the kth subpopulation of GR and wk for its mixing weight for k [K]. Theorem 10 (Convergence Rate of GR) Let G be the aggregate estimator defined by (5) and GR be the aggregate estimator by reduction defined by (7). Assume conditions C1 C4 are satisfied and further assume that C5 For any Φ, there exists a small neighborhood Ωof Φ and a positive constant A, such that for any Φ1, Φ2 Ω, we have A 1 Φ1 Φ2 2 c(Φ1, Φ2) A Φ1 Φ2 2. Then with proper labelling of subpopulations, we have ΦR k Φ k = Op(N 1/2), w R k π k = Op(N 1/2). Condition C5 requires the cost function c( , ) behaves locally as a quadratic loss function. This is a most natural property for a cost function. In Appendix 8.1.6, we show this condition holds for the KL divergence. The conclusion should hold with any other reasonable choices. The proof of the theorem is given in Appendix 8.1.5. Our proof remains valid, for instance, if we replace Φ1 Φ2 2 by Φ1 Φ2 r for any r > 0 in C5. 6. Related Work In this section, we describe several related approaches. We compare some of these approaches with our proposed approach in Section 7 in the experiments. 6.1 KL Averaging Liu and Ihler (2014) considers the distributed learning of models from an exponential family by the split-and-conquer approach. The parameter θ is a real vector in this case. It proposes to perform local inference by finding the local MLEs and aggregate them by their KL barycenter. This estimator is referred to as the KL averaging (KLA). When the model belongs to the exponential family, the aggregated estimator is as efficient as the global MLE based on the full dataset. For models not in the exponential family, such as GMM, the KLA Distributed Learning of Finite Gaussian Mixtures estimator is less efficient. Moreover, the exact computation of the KL barycenter of local estimators under GMM is difficult. Liu and Ihler (2014) suggests to find an approximate solution instead. It first generates random samples b Xm of size 1000 from the local estimates b Gm at the central machine. Then a GMM of order K is fitted on the pooled sample m b Xm which has a moderate size of 1000M. This approach does not need to transmit the raw data but requires refitting of the mixture on the central machine. 6.2 Distributed EM Algorithm The distributed learning of GMMs can also be tackled by developing a distributed version of the EM algorithm (DEM) (Nowak, 2003). We briefly describe DEM and provide a conceptual comparison to our approach in this section. Under distributed learning setting, let Xm be the dataset stored at the mth local machine m [M] and N be the total sample size. Note that the quantities required in defining Q(G; G(t)) based on full dataset X have the following decomposition: for k [K], i=1 w(t) ik = i Xm w(t) ik o := m=1 Ω(t) m,k, i=1 w(t) ik xi = i Xm w(t) ik xi o := m=1 A(t) m,k, i=1 w(t) ik xix i = i Xm w(t) ik xix i o := m=1 B(t) m,k. Given G(t), one can compute w(t) ik defined by (3) for ith observation on local machine m for all m [M] and k [K]. Hence, one can obtain K k=1{Ω(t) m,k, A(t) m,k, B(t) m,k} at the mth local machine and have them transmitted to a central machine. One can then construct Q(G; G(t)) on the central machine and carry out the M-step to get G(t+1), which reproduces the EM-iteration based on the full dataset. Nowak (2003) considers the situation where the local machines form a sensor network and the transmission cost cannot be ignored. The paper suggests the mth machine transmits m j=1{Ω(t) j,k, A(t) j,k, B(t) j,k} to the next machine in the queue. Furthermore, it adopts the incremental E and M steps of Neal and Hinton (1998) to speed up the convergence of the algorithm. Nowak (2003) further shows that the DEM has a local linear convergence rate. The DEM and proposed GMR approaches are designed for distributed learning with different communication schemes. DEM requires a high level of coordination between local machines and repeated access of the local data. Our proposed method allows local machines to complete the learning on its own. Moreover, our method only requires one round of communication across all machines and is communication efficient. If successful, DEM leads to a solution to the original learning problem retaining full statistical efficiency. We do not include DEM in the experiment because the conclusions are already known. Our proposed method is superior in terms of communication cost. The statistical efficienty of DEM is same as the global estimator which is compared with our proposed method in the experiment. Zhang and Chen 6.3 Learning at Scale via Coresets Most machine learning problems can be formulated as an optimization problem that minimizes a cost function cost(X, θ) over the parameter space. For our problem, the cost function could be the negative log likelihood, X is the full dataset, and the parameter space GK is given by (1). When X is very large, the computational burden can be extremely heavy. Feldman et al. (2011) suggests to replace X by a much smaller weighted subset C, called coreset hereafter, such that |cost(X, G) cost(C, G)| cost(X, G) ϵ (19) for some given small ϵ > 0. Minimize the cost(C, G) based on the coreset can be much faster than the original cost based on the full dataset. The construction of coreset is to assure the minimizer of cost(C, G) approximates that of cost(X, G). Lucic et al. (2017) provides theoretical analysis and techniques for constructing coresets under GMM. For GMM of dimension d, order K, it gives a scheme to obtain coresets of size |C| = O(d4K6ϵ 2) satisfying (19) uniformly over some compact subset of GK. This is a surprising result as the size of C does not depend on the size of X. When the datasets are stored in distributed fashion, one may first reduce the dataset in each machine into the first generation coresets. Then these coresets are paired up to create a second generation coreset from each pair. This procedure is repeated if needed until we get a final coreset. Due to the composition properties of coreset (Lucic et al., 2017, Section 5), the quality of the final coreset can be maintained. We refer to this approach as the Coreset approach hereafter. The Coreset approach is computationally very efficient. Unlike DEM, it looks for approximate solutions leading to inevitable loss in statistical efficiency. Moreover, the Coreset approach requires the transmission of the raw data unlike other approaches for distributed learning that only requires the communication of summary statistics. We include the Coreset approach in our experiment for efficiency comparison. We wish to remark that the log-likelihood function in statistics is defined up to an additive constant. The precision specification (19) can be affected by how it is normalized. We adopt the normalization convention of (Lucic et al., 2017). 6.4 Bayesian Moment Matching (BMM) Direct Bayesian inference under GMM is challening due to the well-known label switching problem. See Murphy (2012, Chapter 11) for explanation and possible solutions. Jaini and Poupart (2016) proposes an approximate inference procedure that does not suffer from the issue of label switching. Under GMM with known order K, given a prior π0 with Dirichlet for weights and Gaussian-Gamma for subpopulation parameters, the posterior distribution π1 with a single observation x1 is a complex mixture of Dirichlet and Gaussian-Gamma combination. Repeating this operation given another observation would lead to an even more complex posterior. Instead, Jaini and Poupart (2016) suggests to approximate π1 posterior with a simple Dirichlet and Gaussian-Gamma combination π1 so that π1 and π1 have the same lower order moments. Let π1 be the prior distribution, with the next single observation x2, we obtain π2 in the same way. Repeat sequentially until we have exhausted Distributed Learning of Finite Gaussian Mixtures all data to get πN. Under multivariate GMMs, one replaces Gaussian-Gamma prior by Gaussian-Wishart. The end product πN is regarded as an approximate posterior and it serves well for Bayes inferences. Being sequential in nature, BMM is computationally efficient but apparently loses statistical efficiency due to approximation. Based on Table 2 in Jaini and Poupart (2016) and our Table 1, the per observation log-likelihood value of BMM when applied to Magic04 is 32.1, which is much lower than that of our proposed GMR 26.6. Jaini and Poupart (2016) shows that BMM has higher statistical efficiency than the online EM approach of Cappé and Moulines (2009, Theorem 2), whose convergence rate is lower than N 1/2. We do not include BMM in the experiment not only because of the comparison above but also because we do not know how they handle the redundant moment equations. Under the multivariate GMM, there are K +d K +d(d+1)K/2 parameters to be estimated at each step but there are d K + 2K 1 + 3d(d + 1)K/2 moment equations. The redudant equations make it difficult for us to replicate their approach. 6.5 ADMM for Distributed Optimization The distributed learning of GMM is essentially a distributed optimization problem. We may therefore directly use the Alternating Direction Method of Multipliers (ADMM) of Boyd et al. (2011). Consider the optimization problem defined as m=1 f(θm|Xm) : θm θ = 0, m [M], θ Θ o for some function f( |Xm) and a Euclidean parameter space Θ. The parameters θm are called local variables and θ is called a global variable. The ADMM for this optimization problem is based on the augmented Lagrangian Lρ(θm, ηm, θ) = m=1 {f(θm|Xm) + η m(θm θ) + (ρ/2) θm θ 2 2} for some regularization parameter ρ > 0. The ADMM then iterates according to θ(t+1) m = arg min θm f(θm|Xm) + (η(t) m ) (θm θ(t)) + (ρ/2) θm θ(t) 2 2 , θ(t+1) = M 1 M X m=1 θ(t+1) m , η(t+1) m = η(t) m + ρ(θ(t+1) m θ(t+1)). Similar to DEM, the ADMM requires a high level of coordination between local machines at each iteration. If successful, it gives the solution to the original optimization problem, void the statistical efficiency comparison. In the context of GMM, one must look for suitable substitutes for the term (η(t) m ) (θm θ(t)) in the augmented Lagrangian since θm is a discrete distribution. We therefore do not include ADMM in the experiment. Zhang and Chen 7. Experiments We conduct experiments on both simulated and real data to illustrate the effectiveness of the proposed GMR estimator in (9) with c(Φi, Φγ) = DKL(Φi, Φγ). We compare its performance with some existing approaches in terms of their statistical efficiency and computational costs. Our experiments include the following estimators: 1. Global. The p MLE based on the full dataset. The Global estimator is statistically most efficient and therefore used as the baseline for comparison. 2. Median. An off-the-shelf aggregation approach is the median GM = arg min G { b G1, b G2,..., b GM} m=1 λm TDKL( b Gm, G). The sample median is intuitively a robust alternative with minor efficiency loss. 3. KLA. The KL-Averaging in Liu and Ihler (2014) with nm = 1000 observations generated from local estimate b Gm. The real datasets have different dimensions and sample sizes, the size nm for real data experiments is specified if different. 4. Coreset. The Coreset approach with |Cm| = 1000 on each local machine. They are repeatedly merged as in Lucic et al. (2017) to arrive at the final coreset C of size 1000, the coreset sizes for real data experiment is specified if different. For the ease of comparison, we use the p MLE defined in (2) with penalty size N 1/2 m as local estimates when applicable. The p MLE is also used in the KLA estimator of Liu and Ihler (2014) on the central machine with penalty size (1000M) 1/2, and for the Coreset method with penalty size |C| 1/2. We use the EM algorithm to compute p MLE and declare convergence when the per observation penalized log-likelihood function is less than 10 6. With very large sample sizes of the simulated data, the maxima of the penalized likelihood should be attained at a mixing distribution close to the true mixing distribution. We therefore use the true mixing distribution as the initial value and regard the output of the EM algorithm as the global maximum of the penalized likelihood. This strategy does not work for the real-world data in the absence of a true mixing distribution. For real-world data, we use kmeans++ with default arguments in scikit-learn package (Pedregosa et al., 2011) to generate 10 initial values for the EM algorithm. Ideally, we run the EM algorithm with these initial values until convergence and regard the output of the EM algorithm with the highest penalized log-likelihood function as the p MLE. To save time on the real dataset, we use a warmup strategy. We run the EM algorithm with these 10 initial values for 20 iterations and pick the one with the highest penalized log-likelihood value. We use the output of this one as the initial value to run the EM algorithm further until convergence and this output is treated as the p MLE. The choice of the initial value in the aggregation step is also important. When the sample size is large, we have good reason to believe that the optimal solution is close to the true value. Also, by the principle of majority rules, the median of the local estimates is likely the closest to the optimal solution. Thus, in simulation studies, we initialize the algorithm Distributed Learning of Finite Gaussian Mixtures with the true mixing distribution and the median estimate. For real-world data, we use the local estimators as multiple initial values and output of the MM algorithm with the lowest objective function value is regarded as the GMR estimator. We declare the convergence of the MM algorithm for the GMR estimator when the change in the objective function is less than 10 6. All experiments are conducted on the Compute Canada (Baldwin, 2012) Cedar cluster with Intel E5 Broadwell CPUs with 64G memory. The codes are written in Python and are publicly available at https://github.com/Sarah Qiong/SCGMM. The code for Coreset method is provided by the author of Lucic et al. (2017). 7.1 Performance Measure In each simulation experiment, we generate R = 100 random samples X (r) from a finite GMM with mixing distribution G(r) and obtain its estimate b G(r), for r [R]. We use the following performance metrics. 1. Transportation divergence (W1). We define the ground distance between Φi and Φγ with mean vectors µi and µγ and covariance matrices Σi and Σγ to be D(Φi, Φγ) = µi µγ 2 + Σ1/2 i Σ1/2 γ F (20) where A F = p tr(A A) is the Frobenius norm of a matrix. The corresponding transportation distance is W1( b G(r), G(r)) = TD( b G(r), G(r)) given in Definition 3. 2. Adjusted Rand Index (ARI). The well known adjusted Rand index (ARI) measures the degree of similarity between two clustering outcomes. Suppose the observations in a dataset are divided into K clusters A1, A2, . . . , AK by one method, and K clusters B1, B2, . . . , BK by another method. Let Ni = #(Ai), Mj = #(Bj), Nij = #(Ai Bj) for i [K] and j [K ], where #(A) is the number of units in set A. The ARI of these two clustering outcomes is defined to be P i,j Nij 2 N 2 1 P i,j Ni 2 Mj 2 1 2 P i Ni 2 + 1 2 P j Mj 2 N 2 1 P i,j Ni 2 Mj 2 where n k is the number of k combinations from a given set of n elements. An ARI value close to 1 indicates a high degree of agreement. The ARI is 1 when the two clustering methods completely agree. The mixture models are commonly used for model-based clustering. Based on a given G, we classify a unit with observed value x into cluster k(x) = arg max k {wkφ(x; µk, Σk)} (21) in obvious notation. We obtain the ARI between the clustering outcomes based on b G(r) and G(r) and use it as a performance metric. 3. Log-likelihood (LL). The per observation log-likelihood value at the estimated mixing distribution b G(r) based on the full dataset is often regarded as a goodness metric. See Liu and Ihler (2014), Jaini and Poupart (2016), and Lucic et al. (2017). Zhang and Chen The split-and-conquer approaches have some advantages in computation time by performing local inference on multiple machines. Since we can compute the local estimates in parallel, the time it takes to obtain the final estimate is the maximum local time plus the aggregation time. We keep record of this time for each experiment. Similarly, we record the total time for constructing the coreset and fitting the mixture on the central machine. We do not keep track of the transmitting time. 7.2 Experiments Based on Simulated Data Distributed learning methods are designed for learning at scale where the observations have high dimensions and large sample size. To reduce the potential influence of human bias, we simulate data from finite Gaussian mixtures with randomly generated parameter values. We use the R package Mix Sim (Melnykov et al., 2012; Maitra and Melnykov, 2010). An important quantity of a finite mixture is pairwise overlap oj|i = P(wif(X; θi) < wjf(X; θj)|X f( ; θi)). The maximum overlap of a finite mixture is defined to be Max Omega = max i,j [K]{oj|i + oi|j}. We use Mix Sim to generate 100 finite Gaussian mixtures with d = 50 and K = 5. We let Max Omega be 1%, 5%, and 10%, N = 2l for l = 17, 19, 21 and M = 2l for l = 2, 4, 6. The simulated data are divided evenly over the local machines. We combine 100 outcomes from each combination of dimension, order, Max Omega, sample size, and number of local machines to form boxplots for each estimation method. Figure 1 and Figure 2 show the results. Figure 1 reports the result when the total sample size is N = 221. Within each subfigure, the Max Omega increases from the left panel to the right panel. Within each panel, the x-axis gives the number of local machines: 4, 16, or 64. The plots in Figure 1 contain boxplots of W1, ARI, LL, and the computation time. Based on Figure 1, all methods have better performance in terms of W1, ARI, and LL when Max Omega is lower. This is consistent with our intuition and the experiment survives the sanity check. The proposed GMR has comparable performance to the gold-standard global estimator in all three metrics. It is arguably the best aggregation approach. The number of local machines has little influence on its performance. KLA has relatively poor performance though it improves moderately when the number of local machines increases. Recall that KLA generates a fixed number of observations from each local estimator. More local machines lead to a larger total sample size in its aggregation step. This helps its statistical performance but not computational efficiency. The median estimator does not perform with a large number of local machines. The Coreset estimator does not perform in all cases. All aggregation approaches take less computation time than the global estimator. The Coreset estimator takes the least amount of computation time. However, the computation time does not taken the transmission cost into account. Moreover, the computation time does not make up the poor statistical performance. By far a large chunk of computation time of the split-and-conquer approaches is spent on learning the local estimators. Because Distributed Learning of Finite Gaussian Mixtures (a) Transportation distance W1 (c) Log-likelihood (d) Computation time Figure 1: Performance of five estimators for learning 50-dimensional order 5 Gaussian mixtures with sample size N = 221. All subplots share the same color coding. Zhang and Chen GMR and the median estimators spend negligible time on aggregation, they use about 1/M of the global estimator computational time. When the data generating mixture has a high degree of overlapping, the EM algorithm needs more iterations to converge leading to higher computation time. KLA must re-learn the GMM based on the pooled data generated from the local estimates, therefore generally takes longer time than the GMR approach. Figure 2 presents results when Max Omega is 10% and M = 64. It is seen the proposed (a) W1 distance (c) Log-likelihood (d) Computation time Figure 2: Performance of five estimators when Max Omega = 0.1 and M = 64 for learning 50-dimensional order 5 Gaussian mixtures. GMR has very good performance, comparable to the global estimator. The relative performance of the median estimator improves when the sample size increases, but still not good enough to be recommended. The performance of KLA and Coreset approaches do not improve with increased sample size. They are far from competitive against the proposed GMR approach. Coreset approach does not benefit from larger sample size likely because the coreset size is fixed at 1000. KLA approach does not because the larger sample size improves only the precision of the local estimates. Its aggregation step is heavily influenced by the built-in randomness when we generate samples from the local estimates. The other aspects of the simulation results are as expected. Distributed Learning of Finite Gaussian Mixtures We have more simulation results when the dimensions d = 10, 50 combined with orders K = 5, 10, and 50. They are presented in Appendix 8.2.1. These results are consistent with what we find so far in terms of the statistical efficiency. Since the Coreset estimator is found not competitive, it is not included in these experiments. 7.3 Public and Real-World Datasets We now examine the performance of the proposed approach on large scale public datasets in Section 7.3.1, for clustering the handwritten digits in Section 7.3.2, and for clustering a large scale spatio-temporal data in Section 7.3.3. 7.3.1 Public Datasets We experiment on the public datasets that are widely used for learning Gaussian mixtures at scale in this section. The following datasets are used in Lucic et al. (2017) and Jaini and Poupart (2016). 1. MAGIC04. This is a simulated dataset for classifying gamma particles in the upper atmosphere. It contains 19, 020 observations with 10 real valued features and is publicly available at UCI machine learning repository. 2. Mini Boo NE. The dataset is taken from the Mini Boo NE experiment that is used to distinguish electron neutrinos from muon neutrinos. It contains 130, 065 observations with 50 real valued features and is publicly available at UCI machine learning repository. 3. KDD. This dataset is used in Lucic et al. (2017). It contains 145, 751 observations with 74 real valued features for predicting the protein types. It is available at https: //kdd.org/kdd-cup/view/kdd-cup-2004/Data. 4. MSYP. The dataset is used to predict the release year of a song from audio features. It contains 515, 345 observations with 90 real valued features. The dataset is publicly available at UCI machine learning repository. Following Lucic et al. (2017), we reduce the dataset to its top 25 principal components and fit the mixture model with these 25 features. For the first three datasets, we divide the dataset onto M = 4 local machines completely at random. Since MSYP is very big and the order of the mixture to be fitted is high, we divide the dataset onto M = 16 local machines. The random partition of the dataset is repeated R = 100 times. The size of the generated sample and coreset size are set to be 1000 for MAGIC04, and 10, 000 for other datasets. The order of the fitted mixture on each dataset follows the setting in Lucic et al. (2017) and is specified in Table 1. For each method, due to repetition, we have 100 LL values based on the full dataset at the learned mixing distributions. We summarize these values by its median and inter quartile range (IQR). We also obtain the total computation time for each method. These results are given in Table 1. Based on Table 1, it is clear that the proposed GMR has the best performance among all split-and-conquer based approaches in terms of the LL value. For the Mini Boo NE dataset, Zhang and Chen Table 1: Performance of five learning approaches on public datasets. Dataset N d K M Global GMR Median KLA Coreset Median (IQR) LL values (the larger the better) MIGIC04 19020 10 10 4 -24.15 -24.30(0.07) -26.60(0.05) -26.73(0.07) -27.16(0.55) Mini Boo NE 130065 50 10 4 -19.46 -22.00(0.53) -24.60(0.32) 6.41(1.95) 103 8.6(2.56) 109 KDD 145751 74 10 4 -221.80 -223.25(0.42) -232.93(8.02) -235.00(8.96) -374.43(193.58) MSYP 515345 25 50 16 -166.56 -167.05(0.04) -171.10(0.04) -170.72(0.01) -181.64(1.78) Median (IQR) computation times in seconds MIGIC04 19020 10 10 4 19.3 7.0(3.2) 6.7(3.2) 10.2(3.1) 2.2(0.6) Mini Boo NE 130065 50 10 4 346.9 313.1(162.6) 313.2(162.6) 511.3(213.2) 26.6(64.3) KDD 145751 74 10 4 1033.9 544.4(309.5) 543.0(310.0) 706.0(290.3) 4.3(64.0) MSYP 515345 25 50 16 67048.8 2611.6(474.0) 1777.5(511.2) 5515.9(1629.7) 67.4(12.6) the LL values of KLA and Coreset approaches are very small. This is likely because the total sample size to refit the GMM on the central machine is relatively small in 50-dimensional space. The fitted order 10 mixture may not be able to cover the entire space properly and the log-likelihood contribution of some observations is practically negative infinity. A single near zero likelihood value could lead to a very small LL value. To evaluate the improvement of the GMR approach over other approaches, we consider how many extra subpopulations are needed to achieve the same gain in the LL value. Note that the famous BIC (Schwarz, 1978) would favor a model with an extra parameter if the gain in LL is more than log(N)/(2N). The log(N)/(2N) values for these datasets are (26, 4.5, 4.1, 1.3) 10 5. A subpopulation in GMM with d = 74 needs 1 + 74 + 74 75/2 = 2854 parameters. This translates into a difference of 0.116 in LL. For the KDD data, the gain in GMR compared to KLA would allow another 17 subpopulations. All the split-and-conquer learning methods are much faster than the global method. For the MSYP dataset, the split-and-conquer methods can be 10 times as fast compared to the global estimator. The Coreset method takes the shortest time. The proposed GMR approach takes comparable computation time with the KLA approach. 7.3.2 NIST Handwritten-Digit Dataset The finite GMM is often used for model-based clustering (Friedman et al., 2001; Fraley and Raftery, 2002, Chapter 14.3). When the dataset is large and/or distributed over many local machines, split-and-conquer approaches such as the proposed GMR become useful. In this section, we demonstrate the use of the GMR method on the famous NIST dataset for character recognition (Grother and Hanaoka, 2016). We use the second edition of the dataset, named by_class.zip 1. It consists of approximately 4M images of handwritten digits and characters (0 9, A Z, and a z) by different writers. Our experiment focuses on the digits and we still refer to it as the NIST dataset. The images of the digits are in directories 30 39. According to the user guide 2, the images in the train_30 to train_39 and hsf_4 folders are used as the training and test sets respectively. The numbers of training images for each digit are listed in the following table: 1. Available at https://www.nist.gov/srd/nist-special-database-19. 2. Available at https://s3.amazonaws.com/nist-srd/SD19/sd19_users_guide_edition_2.pdf. Distributed Learning of Finite Gaussian Mixtures Table 2: The numbers of training images for each digit in NIST dataset. Digits 0 1 2 3 4 5 6 7 8 9 Training 34803 38049 34184 35293 33432 31067 34079 35796 33884 33720 Test 5560 6655 5888 5819 5722 5539 5858 6097 5695 5813 Each image is a 128 128 pixel grayscale matrix whose entries are real values between 0 and 1 that record the darkness of the corresponding pixels. A darker pixel has a value closer to 1. Following the common practice, we first train a 5-layer convolutional neural network and reduce each image to a d = 50 feature vector of real values. The details of the neural network for the dimension reduction are given in Appendix 8.2.2. A naïve approach to build a classifier is to regard the features of each digit as a random sample from a distinct Gaussian distribution. The pooled data is therefore a sample from a finite Gaussian mixture of order K = 10. We may learn this model based on the whole dataset or through split-and-conquer approaches. We randomly select R = 100 datasets of size N = 50K from the training set. Each dataset is then randomly partitioned into M = 10 subsets. We obtain global, GMR, Median, KLA, and Coreset estimates for a Gaussian mixture of order 10 on each dataset. The size of the generated sample for the KLA method and the coreset size in the Coreset method are both set to be 3000. This experiment is also carried out with the sample sizes N = 100K, 200K, and 300K. These mixture estimates are then used to cluster images of handwritten digits in the training and test sets. We show the boxplots of the LL values based on the training dataset and the test dataset in Figure 3(a) and Figure 3(b) respectively. The ARI between the true label of the image and the predicted label based on (21)is respectively given in Figure 3(c) and Figure 3(d). In terms of the LL value, the proposed GMR approach attains the highest log-likelihood among all split-and-conquer approaches. The performance of Coreset estimator is far behind. In this experiment, the number of local machine is fixed at M = 10 and the sample sizes are from 50K to 300K. Increasing the sample size benefits median estimator most notably. This is because the local sample size increases as the total sample size increases. When the total sample size is 300K, the number of samples used to fit the local models and the samples generated to fit the aggregated model in the KLA approach are the same. The LL value of median estimator and the KLA estimator are about the same. In terms of clustering performance, the global estimator surprisingly performs noticeably worse than the split-and-conquer approaches. The high LL value of the global estimator does not help. A likely explanation is that a GMM of order 10 is merely a working model rather than the true model, whereas true models are used in simulated data. This eliminates the advantage of the global estimator. This can also be seen that an increased total sample size N does not lead to an improved fit in general. The ARIs of all approaches get worse when the sample size increases. We think that the damage of the model-misspecification is more severe when the sample size is large. Nevertheless, the proposed GMR method has the best performance in all cases. It has the highest average ARI values and smaller variations. Figure 3(e) gives the computation time. All split-and-conquer approaches save computation time. The Coreset estimator is most time efficient, the GMR and median estimators takes slightly longer and the KLA takes the longest. Zhang and Chen (a) Training LL (b) Test LL (c) Training ARI (d) Test ARI (e) Computation time Figure 3: Performance of five approaches for learning of 50-dimension order 10 Gaussian mixture for NIST digit classification. All subplots share the same color coding. Distributed Learning of Finite Gaussian Mixtures 7.3.3 Atmospheric Data Analysis We apply the proposed GMR approach to fit a finite GMM to an atmospheric dataset 3 named CCSM run cam5.1.amip.2d.001 following Chen et al. (2013). These data are computer simulated based on Community Atmosphere Model version 5 (CAM5). The dataset contains daily observations of multiple atmospheric variables between years 1979 and 2005 over 192 longitudes (lon), 288 latitudes (lat), and 30 vertical altitude levels (lev). There variables are included: the moisture content (Q), temperature (T), and vertical velocity (Ω, OMEGA) of the air. For ease of comparison, we analyze only observations in December, January, and February, i.e., winter in the northern hemisphere. The number of days is thus 2, 430, and the restriction reduces the variation in the dataset. At each surface location, we filter out nonwet days (less than 1 mm of daily precipitation) and focus on days with precipitation above the 95th wet-day percentile. This step reduces the number of observations at each location, not necessarily evenly. The analysis aims to cluster the locations according to the multivariate variable of dimension d = 91: 30 lev {Q, T, Ω} plus the daily precipitation (PRECL) at the surface. Following Chen et al. (2013), we fit a finite GMM of order K = 4. They suggest that this model is helpful in identifying modes of extreme precipitation in 3D atmospheric space over a few atmospheric variables. After this preprocessing, the dataset still takes about 3 GB of memory, so we cannot learn a global mixture in a reasonable time. We partition the dataset evenly into M = 128 subsets and apply the proposed GMR approach with the same numerical strategies as in the NIST experiments. For comparison, we also aggregate the local estimates by the KLA with 500 observations generated from each local estimate. Figure 4: Surface locations colored by clusters. Within each cluster, the darker the color, the more wet days at that location. Once a finite GMM is learned, we cluster the observations based on (21). Each combination of day and surface location is clustered into one of four subpopulations. To visualize 3. Available at https://www.earthsystemgrid.org/dataset. Zhang and Chen the clusters, we further allocate each surface location to the cluster in which it belongs on most days. Figure 4 shows the geographical distribution of these four clusters represented by different colors. Similar to Chen et al. (2013), the GMR clusters reveal a strong latitudinal structure, they clearly separate the frigid, temperate, and tropical zones. The KLA results in similar clusters. Unlike Chen et al. (2013), the proposed GMR clusters are able to separate the continental and oceanic areas in the temperate zone. 8. Discussion and Concluding Remarks We develop an effective split-and-conquer approach for learning finite GMMs. Our experiments show that it has good performance both statistically and computationally. We focus on finite GMMs, but with some adjustment our approach could be applied to learning mixtures with other subpopulation distributions such as Gamma and Poisson. We have ignored many potential issues in this paper. In particular, we assume that the order of the mixture is correctly specified and that datasets on different local machines are IID and have the same underlying distributions. Our method is a preliminary attempt toward more satisfactory solutions to these real-world problems. Acknowledgement This research was enabled in part by support provided by West Grid (www.westgrid.ca) and Compute Canada Calcul Canada (www.computecanada.ca). The authors would like to thank Mario Lu cić for providing the code in Lucic et al. (2017). We would like thank the reviewer and action editor for their help and advice. Rakesh Agrawal, Alexandre Evfimievski, and Ramakrishnan Srikant. Information sharing across private databases. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 86 97. ACM, 2003. Sivaraman Balakrishnan, Martin J Wainwright, Bin Yu, et al. Statistical guarantees for the EM algorithm: From population to sample-based analysis. The Annals of Statistics, 45 (1):77 120, 2017. Susan Baldwin. Compute canada: advancing computational research. In Journal of Physics: Conference Series, volume 341, page 012001. IOP Publishing, 2012. Heather Battey, Jianqing Fan, Han Liu, Junwei Lu, and Ziwei Zhu. Distributed estimation and inference with statistical guarantees, 2015. Unpublished manuscript, ar Xiv:1509.05457. Stephen Boyd, Neal Parikh, and Eric Chu. Distributed optimization and statistical learning via the alternating direction method of multipliers. Now Publishers Inc, 2011. Olivier Cappé and Eric Moulines. On-line expectation maximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B (Methodological), 71(3): 593 613, 2009. Distributed Learning of Finite Gaussian Mixtures Xiangyu Chang, Shao-Bo Lin, and Yao Wang. Divide and conquer local average regression. Electronic Journal of Statistics, 11(1):1326 1350, 2017. Jiahua Chen. Optimal rate of convergence for finite mixture models. The Annals of Statistics, 23(1):221 233, 1995. Jiahua Chen and Xianming Tan. Inference for multivariate normal mixtures. Journal of Multivariate Analysis, 100(7):1367 1383, 2009. Sitan Chen, Jerry Li, and Zhao Song. Learning mixtures of linear regressions in subexponential time via fourier moments. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 587 600, 2020. Wei-Chen Chen, George Ostrouchov, David Pugmire, Prabhat, and Michael Wehner. A parallel EM algorithm for model-based clustering applied to the exploration of large spatiotemporal data. Technometrics, 55(4):513 523, 2013. Xueying Chen and M. Xie. A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica, 24(4):1655 1684, 2014. James C Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, Jeffrey John Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, et al. Spanner: Google s globally distributed database. ACM Transactions on Computer Systems (TOCS), 31(3):Article 8, 2013. David F Crouse, Peter Willett, Krishna Pattipati, and Lennart Svensson. A look at Gaussian mixture reduction algorithms. In 14th International Conference on Information Fusion, pages 1 8. IEEE, 2011. Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, pages 685 693, 2014. Suresh Dara and Priyanka Tumma. Feature extraction by using deep learning: A survey. In 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA), pages 1795 1801. IEEE, 2018. Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B, 39:1 38, 1977. Raaz Dwivedi, Nhat Ho, Koulik Khamaru, Martin Wainwright, Michael Jordan, and Bin Yu. Sharp analysis of expectation-maximization for weakly identifiable models. In International Conference on Artificial Intelligence and Statistics, pages 1866 1876, 2020. Jianqing Fan, Dong Wang, Kaizheng Wang, and Ziwei Zhu. Distributed estimation of principal eigenspaces. Annals of Statistics, 47(6):3009 3031, 2019. Dan Feldman, Matthew Faulkner, and Andreas Krause. Scalable training of mixture models via coresets. In Advances in neural information processing systems, pages 2142 2150, 2011. Zhang and Chen Chris Fraley and Adrian E Raftery. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458):611 631, 2002. Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics New York, 2001. Sylvia Frühwirth-Schnatter. Finite Mixture and Markov Switching Models. Springer Science & Business Media, 2006. Patrick Grother and Kayee Hanaoka. NIST special database 19 handprinted forms and characters 2nd edition. Technical report, National Institute of Standards and Technology, 2016. Philippe Heinrich and Jonas Kahn. Strong identifiability and optimal minimax rates for finite mixture estimation. The Annals of Statistics, 46(6A):2844 2870, 2018. David R Hunter and Kenneth Lange. A tutorial on MM algorithms. The American Statistician, 58(1):30 37, 2004. Martin Jaggi, Virginia Smith, Martin Takac, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems 27, pages 3068 3076, 2014. Priyank Jaini and Pascal Poupart. Online and distributed learning of Gaussian mixture models by Bayesian moment matching. ar Xiv preprint ar Xiv:1609.05881, 2016. Roman Liesenfeld. A generalized bivariate mixture model for stock price volatility and trading volume. Journal of Econometrics, 104(1):141 178, 2001. Qiang Liu and Alexander T Ihler. Distributed estimation, information loss and exponential families. In Advances in Neural Information Processing Systems 27, pages 1098 1106, 2014. Mario Lucic, Matthew Faulkner, Andreas Krause, and Dan Feldman. Training gaussian mixture models at scale via coresets. The Journal of Machine Learning Research, 18(1): 5885 5909, 2017. Ranjan Maitra and Volodymyr Melnykov. Simulating data to study performance of finite mixture modeling and clustering algorithms. Journal of Computational and Graphical Statistics, 19(2):354 376, 2010. Geoffrey Mc Lachlan and David Peel. Finite Mixture Models. John Wiley & Sons, 2004. Volodymyr Melnykov, Wei-Chen Chen, and Ranjan Maitra. Mix Sim: An R package for simulating data to study performance of clustering algorithms. Journal of Statistical Software, 51(12):1 25, 2012. Kerrie L Mengersen, Christian Robert, and Mike Titterington. Mixtures: estimation and applications, volume 896. John Wiley & Sons, 2011. Distributed Learning of Finite Gaussian Mixtures Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012. Radford M Neal and Geoffrey E Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355 368. Springer, 1998. Xuan Long Nguyen. Convergence of latent mixing measures in finite and infinite mixture models. The Annals of Statistics, 41(1):370 400, 2013. Robert D Nowak. Distributed EM algorithms for density estimation and clustering in sensor networks. IEEE Transactions on Signal Processing, 51(8):2245 2253, 2003. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary De Vito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8026 8037, 2019. Karl Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A, 185:71 110, 1894. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011. Gabriel Peyré and Marco Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends R in Machine Learning, 11(5 6):355 607, 2019. Judith Rousseau and Kerrie Mengersen. Asymptotic behaviour of the posterior distribution in overfitted mixture models. Journal of the Royal Statistical Society: Series B (Methodological), 73(5):689 710, 2011. Behrooz Safarinejadian, Mohammad B Menhaj, and Mehdi Karrari. A distributed EM algorithm to estimate the parameters of a finite mixture of components. Knowledge and Information Systems, 23(3):267 292, 2010. Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2): 461 464, 1978. Aad W Van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000. Cédric Villani. Topics in Optimal Transportation, volume 58. American Mathematical Society, 2003. Jason L Williams and Peter S Maybeck. Cost-function-based hypothesis control techniques for multiple hypothesis tracking. Mathematical and Computer Modelling, 43(9 10):976 989, 2006. Zhang and Chen CF JeffWu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95 103, 1983. Willard I Zangwill. Nonlinear Programming: A Unified Approach, volume 52. Prentice-Hall Englewood Cliffs, NJ, 1969. Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. The Journal of Machine Learning Research, 16(1):3299 3340, 2015. Distributed Learning of Finite Gaussian Mixtures The appendix is organized as follows. Section 8.1 contains all left over technical details and proofs. Section 8.2 provides additional details in the experiment. 8.1.1 Example 2: Technical Details. Let D(G) = 0.5W2 D(G1, G) + 0.5W2 D(G2, G). By definition, the barycenter minimizes D(G). Recall we suggest that the barycenter is given by Φ(x; GC) = 0.4Φ 1 + 0.6Φ2/3 instead of Φ(x; G) = 0.5Φ 1 + 0.5Φ1. The full proof is overly tedious. Instead, we verify that D(GC) < D(G) so that G is not a barycenter but the suggested GC is likely the one. Note that all transportation plans from G1 and G2 to the presumed barycenter GC have the form p 0.4 p 0.4 p 0.2 + p and p 0.6 p 0.4 p p respectively, for some p between 0 and 0.4. These two matrices are bivariate probability mass functions with the marginal probability masses (0.4, 0.6) and (0.6, 0.4) as required. The cost functions may be presented as c( 1, 1) c( 1, 2/3) c(1, 1) c(1, 2/3) = 0 25/9 4 1/9 It is clear that p = 0.4 gives the optimal plans for transporting G1 to GC and G2 to GC. With these plans in place, we can see that D(GC) = 0.5W2 D(G1, GC) + 0.5W2 D(G2, GC) = 1/3. In comparison, the optimal transportation plan from G1 to G is to move 0.1 mass from Φ1 to Φ 1 with a total cost of 0.1 4 = 0.4. Hence, D(G) = 0.5W2 D(G1, G) + 0.5W2 D(G2, G) = 0.4 > 1/3. That is, G is not a barycenter. 8.1.2 Proof of Theorem 5 The key conclusion of this theorem is inf{Tc(G) : G GK} = inf{Jc(G) : G GK}. Zhang and Chen We prove this result by showing both inf{Tc(G) : G GK} inf{Jc(G) : G GK} (22) and inf{Tc(G) : G GK} inf{Jc(G) : G GK}. (23) We first prove (22). Let G = arg inf{Jc(G) : G GK}. Denote the mixing weights of any G GK as v(G), and the subpopulations prescribed by G as Φγ. According to (15), we have v(G ) = X i πi,γ(G ), which implies that π(G ) Π( , v(G )). Since π(G ) Π(w, ) by (10), we also have that π(G ) Π(w, v(G )). In other words, π(G ) is a valid transportation plan from G to G . Consequently, inf{Tc(G) : G GK} X i,γ πiγ(G )c(Φi, Φ γ) = Jc(G ) = inf{Jc(G) : G GK}, with the last equality implied by the definition of G . This proves (22). Next, we prove (23). Let G = inf{Tc(G) : G GK}, the solution to the optimization on the left-hand side of (23). We denote the subpopulations prescribed by G as Φ γ. Let π = arg inf X i,γ πiγc(Φi, Φ γ) : π Π(w, v(G )) , which is the optimal transportation plan from G to this G . Because of this, we have inf{Tc(G) : G GK} = Tc(G ) = X i,γ π iγc(Φi, Φ γ) Jc(G ) inf{Jc(G) : G GK}. The last step holds because π Π(w, ). This proves (23). 8.1.3 Proof of Theorem 7 Theorem 7 claims that the MM-iteration leads to (i) decreasing sequence Jc(G(t)) in t; and (ii) stationary limiting points G of {G(t))}. Proof of (i): Clearly, we have Kc(G|G(t)) Jc(G) for all G with equality holds at G = G(t). Hence, Jc(G(t)) Jc(G(t)) {Kc(G(t+1)|G(t)) Jc(G(t+1))} = Jc(G(t+1)) {Kc(G(t+1)|G(t)) Jc(G(t))} Jc(G(t+1)) {Kc(G(t)|G(t)) Jc(G(t))} = Jc(G(t+1)). This is the property that an MM-algorithm must have. Distributed Learning of Finite Gaussian Mixtures Proof of (ii). Suppose G(t) has a convergent subsequence leading to a limit G . Let this subsequence be G(tk). By Helly s selection theorem (Van der Vaart, 2000), there is a subsequence sk of tk such that G(sk+1) has a limit, say G . These limits, however, could be subprobability distributions. That is, we cannot rule out the possibility that the total probability in the limit is below 1 by Helly s theorem. This is not the case under the theorem conditions. Let > 0 be large enough such that A1 = {Φ : c(Φi, Φ) , for all subpopulations Φi of G} is not empty. With this , we define A2 = {Φ : c(Φi, Φ) > , for all subpopulations Φi of G}. Suppose G has a subpopulation Φ such that c(Φi, Φ ) > for all i. Replacing this subpopulation in G by any Φ A1 to form G , we can see that for any t, Kc(G |G(t 1)) > Kc(G |G(t 1)). Because G(t) minimizes Kc(G|G(t 1)) in G, the above result shows that none of the subpopulations of G(t) are members of A2. Note that the complement of A2 is compact by condition (18). Consequently, the subpopulations of G(t) are confined to a compact subset. Hence, all limit points of G(t), including both G and G , are proper distributions. By the monotonicity of the iteration: Jc(G(sk+1)) Jc(G(sk+1)) Jc(G(sk)). Let k , we get Jc(G ) = Jc(G ). (24) By the definition of the MM iteration, we have Kc(G(sk+1)|G(sk)) Kc(G|G(sk)). Let k and by the continuity of Kc( | ), we get Kc(G |G ) Kc(G|G ). Hence, G is a solution to min Kc(G|Gt)) when G(t) = G . Namely, we have Kc(G |G ) = Kc(G(t+1)|G ). With the help of (24), it further implies Jc(G ) = Jc(G(t+1)) = Jc(G ) when G(t) = G . This shows that iteration from G(t) = G does not make Jc(G(t+1)) smaller than Jc(G(t)). Hence, G is a stationary point of the MM iteration. This is conclusion (ii) and we have completed the proof. Zhang and Chen 8.1.4 Proof of Theorem 9 Recall that the local estimators b Gm, m [M] are strongly consistent for G when the order K of G is known. This implies that the simple average G G and Tc(G, G ) 0 almost surely. Because Tc( , ) is not necessarily a mathematical distance, we do not have an easy-to-use triangle inequality. The subsequent steps may appear tedious but unavoidable. By almost surely , it implies other than a probability 0 event in the probability space Ω on which the random variables are defined, Tc(G, G ) 0 holds. Without loss of generality, assume Tc(G, G ) 0 holds at all ω Ωwithout a zero-probability exception. With this, Tc(G, G ) 0 holds only if each support point of G converges to one of those of G . The total weights of the support points of G converging to the same support of G must converge to the corresponding weight of G . By definition, GR has K support points. We also notice that Tc(G, GR) Tc(G, G ) 0. (25) Suppose that GR does not converge to G at some ω Ω. One possibility is that the smallest mixing weight of GR (or a subsequence thereof) goes to zero as N . In this case, GR has K 1 or fewer meaningful support points. Since the support points of G are in an infinitesimal neighborhood of those of G , one of them must be a distance away from any of the support points of GR. Therefore, by Condition 4, the transportation cost of this support point is larger than a positive constant not depending on N. The positive transportation cost implies that Tc(G, GR) 0, which contradicts (25). The next possibility is that the smallest mixing weight of GR does not go to zero. In this case, there is a subsequence such that all the mixing weights converge to positive constants. Without loss of generality, all the mixing weights simply converge to positive constants as N . If there is a subsequence of support points of GR that is at least ϵ-distance away from any of the support points of G , then the transportation cost from G to this support point will be larger than a positive constant not depending on N. This again leads to a contradiction to (25). The final possibility is that GR (or a subsequence thereof) has a proper limit, say G = G . If so, Tc(G, GR) Tc(G , G ) = 0, contradicting (25). We have exhausted all the possibilities. Hence, the consistency claim is true. 8.1.5 Proof of Theorem 10 Theorem 10 states that the GMR estimator GR retains the convergence rate N 1/2, under the key condition C1 that the order K of the finite mixture model is known. Let Φmk be the kth subpopulation learned at local machine m and wmk be its mixing weight. Note that we do not put a hat on them for notation simplicity. According to Lemma 8 on the rate of convergence of the p MLE at local machines, these subpopulations can be arranged so that for all m [M] and k [K], we have Φmk Φ k = Op(N 1/2), wmk w k = Op(N 1/2). By C5, the first rate conclusion above implies max{c(Φmk, Φ k) : k [K]} = Op(N 1). Distributed Learning of Finite Gaussian Mixtures For each k, let ewk = PM m=1 λmwmk and eΦk be the local barycenter of Φmk, m [M]: eΦk = arg min{Φ : m=1 λmwmkc(Φmk, Φ)}. Let G be the mixing distribution with weights ewk and subpopulations eΦk. By the rate conclusions given earlier, we have ewk = w k + Op(N 1/2) for k [K]. By C5, we must also have eΦk Φ k = Op(N 1/2) and Tc(G, e G) = Op(N 1). If the GMR GR = e G, then the rate conclusion of the theorem would have been proved. Next, we show GR = e G asymptotically. By theorem conditions, the true subpopulations Φ k are all distinct. Hence, by condition C4, we have min{c(Φ k, Φ k ) : k = k [K]} > 0. Thus, if the subpopulations of GR is not in an op(1) neighbourhood of one of Φ k even though everyone of G is, the transport cost to this subpopulation from any subpopulation of G exceeds a positive constant in probability. This contradicts Tc(G, GR) Tc(G, e G) = Op(N 1). (26) This implies, all subpopulations of GR are within op(1) neighbourhood of one of Φ k. Denote these subpopulations as ΦR k . The optimal plan must transport Φmk to ΦR k , otherwise the total transport cost exceeds a positive constant in probability which again contradicts (26). Since GR minimizes the transport cost, we must have ΦR k = eΦk, the local barycenter. These conclusions imply that GMR GR = e G with probability approaching to 1. Consequently, the rates of convergence of ewk, eΦk extend to those of GR and this completes the proof. 8.1.6 KL-divergence satisfies C5 Let µ1, Σ1 and µ2, Σ2 be the parameters of Φ1 and Φ2. It is known that 2DKL(Φ1 Φ2) = log{det(Σ2)/det(Σ1)} + tr(Σ 1 2 Σ1 Id) + (µ2 µ1)τΣ 1 2 (µ2 µ1). Assume both Φ1 and Φ2 are in a small neighborhood of Φ whose covariance matrix Σ is positive definite. Hence, eigenvalues of Σ2 are in small neighborhood of these of Σ. Thus, there exists a positive constant A1 such that the third term in 2DKL(Φ1 Φ2) satisfies A 1 1 µ2 µ1 2 (µ2 µ1)τΣ 1 2 (µ2 µ1) A1 µ2 µ1 2 (27) Let λ1, . . . , λd be eigenvalues of Σ 1/2 2 Σ1Σ 1/2 2 . Since both Σ1 and Σ2 are in a small neighborhood of Σ, we have λ1, . . . , λd all close to 1. Note that log{det(Σ2)/det(Σ1)} + tr(Σ 1 2 Σ1 Id) = j=1 {(λj 1) log λj}. Zhang and Chen Note that (λ 1) log λ is a convex function with its minimum attained at λ = 1 at which point its second derivative equals 1. Hence, there exists an A2 > 0 such that A 1 2 (λ 1)2 (λ 1) log λ A2(λ 1)2. We have therefore shown that j=1 (λj 1)2 log{det(Σ2)/det(Σ1)} + tr(Σ 1 2 Σ1 Id) A2 j=1 (λj 1)2. (28) We now connect this bound with Frobenius norm. For a positive definite matrix Σ, it is easy to see that Σ Id 2 F = Pd j=1(σj 1)2, where σ1, σ2, . . . , σd are eigenvalues of Σ. Frobenius norm also has sub-multiplicative property Σ1Σ2 F Σ1 F Σ2 F . Applying the sub-multiplicative property in our context, we get Σ1 Σ2 2 F Σ1/2 2 4 F Σ 1/2 2 Σ1Σ 1/2 2 Id 2 F A3 Σ 1/2 2 Σ1Σ 1/2 2 Id 2 F = A3 j=1 (λj 1)2 for some local positive constant A3 > Σ1/2 4 F , as both matrices are in a small neighborhood of Σ. Similarly, we have j=1 (λj 1)2 = Σ 1/2 2 Σ1Σ 1/2 2 Id 2 F = Σ 1/2 2 {Σ1 Σ2}Σ 1/2 2 2 F Σ 1/2 2 4 F Σ1 Σ2 2 F A4 Σ1 Σ2 2 F . for some positive constant A4. This leads to Σ1 Σ2 2 F A 1 4 j=1 (λj 1)2. Let A = 2 max{A1, A2, A3, A4}. Applying (27) and (28), we have A 1{ µ1 µ2 2 + Σ1 Σ2 2 F } DKL(Φ1 Φ2) A{ µ1 µ2 2 + Σ1 Σ2 2 F } when Φ1, Φ2 are in a small neighborhood of Φ, with A being a positive constant depends on Φ. This shows that the KL-divergence has property C5. 8.2 Other Details 8.2.1 Additional Simulation Results We present simulation results for K = 5, 10, 50 and d = 10, 50 as supplementary to Section 7.2. Figures 5 6 show the results for N = 219 and M = 4 with various combinations of Distributed Learning of Finite Gaussian Mixtures K and d. The panels in each figure are arranged so that the order of the mixture increases from left to right, and the dimension of the mixture increases from top to bottom. Comparing panels within the same row in Figure 5, we note that the performance of each the estimators becomes worse as the order of the mixture increases in terms of W1 distance. The panels within the same column in Figures 5 show that all the estimators become worse as the dimension of the mixture increases in terms of both performance measures. Regardless, our estimator has performance comparable to that of the global estimator. In terms of the misclassification error, for the same degree of overlapping, the superiority of our estimator becomes even clearer compared to the KL-averaging as the number of components and the dimension increase. The computational costs of the local estimators are typically low, and this gives our method an added computational advantage. However, this advantage is not guaranteed: see the bottom right panel in Figure 6, where d = 50, K = 50, and the degree of overlapping is above 1%. There are many other factors at play. A more skillful implementation may lead to different conclusions on the computational time. Figure 5: W1 distances of estimators for learning mixtures. 8.2.2 Convolutional Neural Network in NIST Example Deep convolutional neural networks (CNNs) are commonly used to reduce the complex structure of a dataset to informative rectangle data. CNNs effectively perform dimension reduction and classification in an end-to-end fashion (Dara and Tumma, 2018). The final soft-max layer in a CNN can be viewed as fitting a multinomial logistic regression model on Zhang and Chen Figure 6: Computation times for learning mixtures. the reduced feature space. We use a CNN for dimension reduction in the NIST experiment; its architecture is specified in Table 3. We implement the CNN in pytorch 1.5.0 (Paszke et al., 2019) and train it for 10 epochs on the NIST training dataset. We use the SGD optimizer with learning rate 0.01, momentum 0.9, and batch size 64. After training, we drop the final layer and use the resulting CNN to reduce the dimension to 50 for both the training and test sets. Table 3: Architecture and layer specifications of CNN for dimension reduction in NIST example. Layer Layer specification Activation function Conv2d Cin = 1, Cout = 20, H = W = 5 Relu Max Pool2d k = 2 Conv2d Cin = 20, Cout = 50, H = W = 5 Relu Max Pool2d k = 2 Flatten Linear Hin = 800, Hout = 50 Relu Linear Hin = 50, Hout = 10 Softmax