Estimating Mixture Models via Mixtures of Polynomials

Sida I. Wang, Arun Tejasvi Chaganty, Percy Liang
Computer Science Department, Stanford University, Stanford, CA 94305
{sidaw,chaganty,pliang}@cs.stanford.edu

Mixture modeling is a general technique for making any simple model more expressive through weighted combination. This generality and simplicity in part explains the success of the Expectation Maximization (EM) algorithm, in which updates are easy to derive for a wide class of mixture models. However, the likelihood of a mixture model is non-convex, so EM has no known global convergence guarantees. Recently, method of moments approaches have offered global guarantees for some mixture models, but they do not extend easily to the range of mixture models that exist. In this work, we present Polymom, a unifying framework based on the method of moments in which estimation procedures are easily derivable, just as in EM. Polymom is applicable when the moments of a single mixture component are polynomials of the parameters. Our key observation is that the moments of the mixture model are a mixture of these polynomials, which allows us to cast estimation as a Generalized Moment Problem. We solve its relaxations using semidefinite optimization, and then extract parameters using ideas from computer algebra. This framework allows us to draw insights and apply tools from convex optimization, computer algebra, and the theory of moments to study problems in statistical estimation. Simulations show good empirical performance on several models.

1 Introduction

Mixture models play a central role in machine learning and statistics, with diverse applications including bioinformatics, speech, natural language, and computer vision. The idea of mixture modeling is to explain data through a weighted combination of simple parametrized distributions [1, 2]. In practice, maximum likelihood estimation via Expectation Maximization (EM) has been the workhorse for these models, as the parameter updates are often easily derivable. However, EM is well known to suffer from local optima. The method of moments, dating back to Pearson [3] in 1894, is enjoying a recent revival [4, 5, 6, 7, 8, 9, 10, 11, 12, 13] due to its strong global theoretical guarantees. However, current methods depend strongly on the specific distributions and are not easily extensible to new ones.

In this paper, we present a method of moments approach, which we call Polymom, for estimating a wider class of mixture models in which the moment equations are polynomial equations (Section 2). Solving general polynomial equations is NP-hard, but our key insight is that for mixture models, the moment equations are mixtures of polynomial equations, and we can hope to solve them if the moment equations for each mixture component are simple polynomial equations that we can solve. Polymom proceeds as follows. First, we recover mixtures of monomials of the parameters from the data moments by solving an instance of the Generalized Moment Problem (GMP) [14, 15] (Section 3). We show that for many mixture models the GMP can be solved with basic linear algebra, and in the general case it can be approximated by an SDP in which the moment equations are linear constraints.
Second, we extend multiplication matrix ideas from the computer algebra literature [16, 17, 18, 19] to extract the parameters by solving a certain generalized eigenvalue problem (Section 4).

Table 1: Notation. We use lowercase letters (e.g., d) for indexing, and the corresponding uppercase letter to denote the upper limit (e.g., D, in sizes). We use lowercase letters (e.g., θ_{k,p}) for scalars, lowercase bold letters (e.g., θ) for vectors, and bold capital letters (e.g., M) for matrices.

  mixture model: x_t data point (R^D); z_t latent mixture component ([K]); θ_k parameters of component k (R^P); π_k mixing proportion p(z = k); θ* = [θ*_k]_{k=1}^K all model parameters
  moments of data: φ_n(x) observation function; f_n(θ) its expectation under one component (a polynomial in θ)
  moments of parameters: L_y the Riesz linear functional; y_α = L_y(θ^α) the αth moment; µ probability measure for y; y = (y_α) the moment sequence; M_r(y) moment matrix of degree r
  sizes: D data dimensions; K mixture components; P parameters of a mixture component; T data points; N constraints; [N] = {1, ..., N}; r degree of the moment matrix; s(r) size of the degree-r moment matrix
  polynomials: R[θ] polynomial ring in variables θ; ℕ set of non-negative integers; α, β, γ vectors of exponents (in ℕ^P or ℕ^D); θ^α the monomial ∏_{p=1}^P θ_p^{α_p}; a_{α,n} coefficient of θ^α in f_n(θ)

Figure 1: An overview of applying the Polymom framework, illustrated on a mixture of two 1D Gaussians with z ∼ Multinomial(π_1, π_2), x | z ∼ N(µ_z, σ_z²), and θ_k = (µ_k, σ_k²) ∈ R². The steps are: 1. write down a mixture model (user specified); 2. derive single-component moment equations, e.g., E[x] → µ, E[x²] → µ² + σ², E[x³] → µ³ + 3µσ² (user specified); 3. add data (framework specified), giving the SDP min_y tr(M_r(y)) s.t. M_r(y) ⪰ 0, y_{0,0} = 1, with linear constraints such as y_{1,0} = (1/T) Σ_t x_t, y_{2,0} + y_{0,1} = (1/T) Σ_t x_t², y_{3,0} + 3 y_{1,1} = (1/T) Σ_t x_t³, where the moment matrix over the monomials [1, µ, µ², σ²] has entries

    M_2(y) = [ y_{0,0}  y_{1,0}  y_{2,0}  y_{0,1} ]
             [ y_{1,0}  y_{2,0}  y_{3,0}  y_{1,1} ]
             [ y_{2,0}  y_{3,0}  y_{4,0}  y_{2,1} ]
             [ y_{0,1}  y_{1,1}  y_{2,1}  y_{0,2} ];

4. recover parameter moments by the simultaneous-diagonalization factorization M_r(y) = V Π V^T with Π = diag([π_1, π_2]) and V = [v(θ_1), v(θ_2)]; 5. solve for parameters.

Polymom improves on previous method of moments approaches in both generality and flexibility. First, while tensor factorization has been the main driver behind method of moments approaches for many types of mixture models [6, 20, 9, 8, 21, 12], each model required specific adaptations which are non-trivial even for experts. In contrast, Polymom provides a unified principle for tackling new models that is as turnkey as computing gradients or EM updates. To use Polymom (Figure 1), one only needs to provide a list of observation functions (φ_n) and derive their expected values expressed symbolically as polynomials in the parameters of the specified model (f_n). Polymom then estimates the expectations of φ_n from data and outputs parameter estimates of the specified model. Since Polymom works in an optimization framework, we can easily incorporate constraints such as non-negativity and parameter tying, which is difficult to do in the tensor factorization paradigm. In simulations, we compared Polymom with EM and tensor factorization and found that Polymom performs similarly or better (Section 5).

This paper assumes identifiability and infinite data. With the exception of a few specific models in Section 5, we defer issues of general identifiability and sample complexity to future work.

2 Problem formulation

2.1 The method of moments estimator

In a mixture model, each data point x ∈ R^D is associated with a latent component z ∈ [K]:

    z ∼ Multinomial(π),   x | z ∼ p(x; θ_z),

where π = (π_1, ..., π_K) are the mixing coefficients, θ_k ∈ R^P are the true model parameters for the kth mixture component, and x ∈ R^D is the random variable representing data.
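To make the generative process above concrete, here is a minimal sketch that draws data from the running two-component 1D Gaussian mixture of Figure 1; the function name and parameter values are illustrative and are not taken from the paper's released code.

```python
import numpy as np

def sample_mixture(pi, means, variances, T, seed=0):
    """Draw T points from a 1D Gaussian mixture: z ~ Multinomial(pi), x | z ~ N(mu_z, sigma_z^2)."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=T, p=pi)                        # latent components z_t
    x = rng.normal(np.asarray(means)[z], np.sqrt(np.asarray(variances)[z]))
    return x, z

# Illustrative parameters: pi = (0.4, 0.6), means (-2, 3), variances (1, 0.5).
x, z = sample_mixture([0.4, 0.6], [-2.0, 3.0], [1.0, 0.5], T=100_000)
```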
We restrict our attention to mixtures where each component distribution comes from the same parameterized family. For example, for a mixture of Gaussians, θ_k = (µ_k, Σ_k) ∈ R^D × R^{D×D} consists of the mean and covariance of component k. We define N observation functions φ_n : R^D → R for n ∈ [N], and define f_n(θ) to be the expectation of φ_n over a single component with parameters θ, which we assume is a simple polynomial:

    f_n(θ) := E_{x ∼ p(x; θ)}[φ_n(x)] = Σ_α a_{α,n} θ^α,   where θ^α := ∏_{p=1}^P θ_p^{α_p}.

The expectation of each observation function E[φ_n(x)] can then be expressed as a mixture of polynomials of the true parameters: E[φ_n(x)] = Σ_{k=1}^K π*_k E[φ_n(x) | z = k] = Σ_{k=1}^K π*_k f_n(θ*_k). The method of moments for mixture models seeks parameters [π_k, θ_k]_{k=1}^K that satisfy the moment conditions

    E[φ_n(x)] = Σ_{k=1}^K π_k f_n(θ_k),   n = 1, ..., N,    (3)

where E[φ_n(x)] can be estimated from the data: (1/T) Σ_{t=1}^T φ_n(x_t) converges in probability to E[φ_n(x)]. The goal of this work is to find parameters satisfying moment conditions that can be written in the mixture of polynomials form (3). We assume that the N observation functions φ_1, ..., φ_N uniquely identify the model parameters (up to permutation of the components).

Example 2.1 (1-dimensional Gaussian mixture). Consider a K-mixture of 1D Gaussians with parameters θ_k = [µ_k, σ_k²] corresponding to the mean and variance, respectively, of the kth component (Figure 1: steps 1 and 2). We choose the observation functions φ(x) = [x, x², ..., x⁶], which have corresponding moment polynomials f(θ) = [µ, µ² + σ², µ³ + 3µσ², ...]. For example, instantiating (3), E[x²] = Σ_{k=1}^K π_k (µ_k² + σ_k²). Given φ(x) and f(θ), and data, the Polymom framework can recover the parameters. Note that the 6 moments we use have been shown by [3] to be sufficient for a mixture of two Gaussians.

Example 2.2 (Mixture of linear regressions). Consider a mixture of linear regressions [22, 9], where each data point x = [x, y] is drawn from component k by sampling x from an unknown distribution independent of k and setting y = w_k x + ε, where ε ∼ N(0, σ_k²). The parameters θ_k = (w_k, σ_k²) are the slope and noise variance for each component k. Let us take our observation functions to be φ(x) = [x, xy, xy², x², ..., x³y²], for which the moment polynomials are f(θ) = [E[x], E[x²] w, E[x³] w² + E[x] σ², E[x²], ...].

In Example 2.1, the coefficients a_{α,n} in the polynomial f_n(θ) are just constants determined by integration. For the conditional model in Example 2.2, the coefficients depend on the data. However, we cannot handle arbitrary data dependence; see Section D for sufficient conditions and counterexamples.
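As an illustration of the moment conditions (3) for Example 2.1, the following self-contained sketch compares the empirical moments (1/T) Σ_t x_t^n against the mixture of polynomials Σ_k π_k f_n(θ_k); the parameter values are illustrative.

```python
import numpy as np

def f_single(mu, var):
    # Single-component moment polynomials f_n(theta) = E[x^n | theta] for n = 1, 2, 3 (Example 2.1).
    return np.array([mu, mu**2 + var, mu**3 + 3 * mu * var])

def mixture_moments(pi, means, variances):
    # Mixture of polynomials: E[phi_n(x)] = sum_k pi_k f_n(theta_k), i.e. the moment conditions (3).
    return sum(p * f_single(m, v) for p, m, v in zip(pi, means, variances))

rng = np.random.default_rng(0)
pi, means, variances = [0.4, 0.6], [-2.0, 3.0], [1.0, 0.5]         # illustrative true parameters
z = rng.choice(2, size=200_000, p=pi)
x = rng.normal(np.array(means)[z], np.sqrt(np.array(variances)[z]))
empirical = np.array([np.mean(x**n) for n in (1, 2, 3)])            # (1/T) sum_t phi_n(x_t)
print(empirical)
print(mixture_moments(pi, means, variances))                        # should agree up to sampling noise
```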
2.2 Solving the moment conditions

Our goal is to recover model parameters θ_1, ..., θ_K ∈ R^P for each of the K components of the mixture model that generated the data, as well as their respective mixing proportions π_1, ..., π_K ∈ R. To start, let us ignore sampling noise and identifiability issues and suppose that we are given exact moment conditions as defined in (3). Each condition f_n ∈ R[θ] is a polynomial of the parameters θ, for n = 1, ..., N. Equation (3) is thus a polynomial system of N equations in the K + KP variables [π_1, ..., π_K] ∈ R^K and [θ_1, ..., θ_K] ∈ R^{P×K}. It is natural to ask if standard polynomial solving methods can solve (3) in the case where each f_n(θ) is simple. Unfortunately, the complexity of general polynomial equation solving is lower bounded by the number of solutions, and each of the K! permutations of the mixture components corresponds to a distinct solution of (3) under this polynomial system representation. While several methods can take advantage of symmetries in polynomial systems [23, 24], they still cannot be adapted to tractably solve (3) to the best of our knowledge.

The key idea of Polymom is to exploit the mixture representation of the moment equations (3). Specifically, let µ* be a particular mixture over the component parameters θ*_k (i.e., µ* is a probability measure). Then we can express the moment conditions (3) in terms of µ*:

    E[φ_n(x)] = ∫ f_n(θ) µ*(dθ),   where µ*(θ) = Σ_{k=1}^K π*_k δ(θ − θ*_k).

As a result, solving the original moment conditions (3) is equivalent to solving the following feasibility problem over µ, where we deliberately forget the permutation of the components by using µ to represent the problem:

    find   µ ∈ M_+(R^P), the set of probability measures over R^P
    s.t.   ∫ f_n(θ) µ(dθ) = E[φ_n(x)],   n = 1, ..., N,
           µ is K-atomic (i.e., a sum of K deltas).    (5)

If the true model parameters [π*_k, θ*_k]_{k=1}^K can be identified by the N observed moments up to permutation, then the measure µ*(θ) = Σ_{k=1}^K π*_k δ(θ − θ*_k) solving Problem 5 is also unique. Polymom solves Problem 5 in two steps:

1. Moment completion (Section 3): We show that Problem 5 over the measure µ can be relaxed to an SDP over a certain (parameter) moment matrix M_r(y), whose optimal solution is M_r(y*) = Σ_{k=1}^K π*_k v_r(θ*_k) v_r(θ*_k)^T, where v_r(θ*_k) is the vector of all monomials of degree at most r.

2. Solution extraction (Section 4): We then take M_r(y) and construct a series of generalized eigendecomposition problems whose eigenvalues yield [θ_k]_{k=1}^K.

Remark. From this point on, distributions and moments refer to µ, which is over parameters, not over the data. All the structure about the data is captured in the moment conditions (3).

3 Moment completion

The first step is to reformulate Problem 5 as an instance of the Generalized Moment Problem (GMP) introduced by [15]. A reference on the GMP, algorithms for solving GMPs, and its various extensions is [14]. We start by observing that Problem 5 really only depends on the integrals of monomials under the measure µ: for example, if f_n(θ) = 2θ_1³ − θ_1²θ_2, then we only need to know the integrals over the constituent monomials (y_{3,0} := ∫ θ_1³ µ(dθ) and y_{2,1} := ∫ θ_1²θ_2 µ(dθ)) in order to evaluate the integral of f_n. This suggests that we can optimize over the (parameter) moment sequence y = (y_α)_{α ∈ ℕ^P} rather than the measure µ itself. We say that the moment sequence y has a representing measure µ if y_α = ∫ θ^α µ(dθ) for all α, but we do not assume that such a µ exists. The Riesz linear functional L_y : R[θ] → R is defined to be the linear map such that L_y(θ^α) := y_α and L_y(1) = 1. For example, L_y(2θ_1³ − θ_1²θ_2 + 3) = 2y_{3,0} − y_{2,1} + 3. If y has a representing measure µ, then L_y simply maps polynomials f to integrals of f against µ.

The key idea of the GMP approach is to convexify the problem by treating y as free variables and then introducing constraints to guarantee that y has a representing measure. First, let v_r(θ) := [θ^α : |α| ≤ r] ∈ R[θ]^{s(r)} be the vector of all s(r) monomials of degree no greater than r. Then define the truncated moment matrix as M_r(y) := L_y(v_r(θ) v_r(θ)^T), where the linear functional L_y is applied elementwise (see Example 3.1 below). If y has a representing measure µ, then M_r(y) is simply a (positive) integral over rank-1 matrices v_r(θ) v_r(θ)^T with respect to µ, so necessarily M_r(y) ⪰ 0 holds. Furthermore, by Theorem 1 [25], for y to have a K-atomic representing measure, it is sufficient that rank(M_r(y)) = rank(M_{r−1}(y)) = K. So Problem 5 is equivalent to

    find   the moment sequence y = (y_α)_{α ∈ ℕ^P}   (or equivalently, find M_r(y))
    s.t.   Σ_α a_{α,n} y_α = E[φ_n(x)],   n = 1, ..., N,
           M_r(y) ⪰ 0,   y_0 = 1,
           rank(M_r(y)) = K and rank(M_{r−1}(y)) = K.    (6)
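Before turning to the relaxation, the following sketch grounds this indexing: it builds M_r(y) for the monomial list used in Example 3.1 below from a candidate moment sequence y, here the rank-one sequence y_α = θ^α of a single component (a hypothetical helper with illustrative values, not the paper's implementation).

```python
import numpy as np
from itertools import product

def moment_matrix(y, monomials):
    # M_r(y)[i, j] = L_y(theta^{a_i} * theta^{a_j}) = y_{a_i + a_j}, with exponents as tuples.
    s = len(monomials)
    M = np.empty((s, s))
    for i, a in enumerate(monomials):
        for j, b in enumerate(monomials):
            M[i, j] = y[tuple(ai + bi for ai, bi in zip(a, b))]
    return M

# 1D Gaussian: theta = (mu, sigma^2); monomials [1, mu, mu^2, sigma^2] as in Example 3.1.
v2 = [(0, 0), (1, 0), (2, 0), (0, 1)]
mu, var = 2.0, 0.5                                    # a single atom (K = 1), illustrative values
y = {(i, j): mu**i * var**j for i, j in product(range(5), repeat=2)}
M2 = moment_matrix(y, v2)
print(np.linalg.matrix_rank(M2))                      # 1: one atom gives a rank-1 PSD moment matrix
```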
Unfortunately, the rank constraints in Problem 6 are not tractable. We use the following relaxation to obtain our final (convex) optimization problem:

    min_y   tr(C M_r(y))
    s.t.    Σ_α a_{α,n} y_α = E[φ_n(x)],   n = 1, ..., N,
            M_r(y) ⪰ 0,   y_0 = 1,    (7)

where C ⪰ 0 is a chosen scaling matrix. A common choice is C = I_{s(r)}, corresponding to minimizing the nuclear norm of the moment matrix, the usual convex relaxation for rank. Section A discusses some other choices of C.

Example 3.1 (moment matrix for a 1-dimensional Gaussian mixture). Recall that the parameters θ = [µ, σ²] are the mean and variance of a one-dimensional Gaussian. Let us choose the monomials v_2(θ) = [1, µ, µ², σ²]. Step 4 of Figure 1 shows the moment matrix when using r = 2. Each row and column of the moment matrix is labeled with a monomial, and entry (i, j) is subscripted by the product of the monomials in row i and column j. For φ_2(x) := x², we have f_2(θ) = µ² + σ², which leads to the linear constraint y_{2,0} + y_{0,1} − E[x²] = 0. For φ_3(x) := x³, f_3(θ) = µ³ + 3µσ², leading to the constraint y_{3,0} + 3y_{1,1} − E[x³] = 0.

Related work. Readers familiar with the sum of squares and polynomial optimization literature [26, 27, 28, 29] will note that Problem 7 is similar to the SDP relaxation of a polynomial optimization problem. However, in typical polynomial optimization we are only interested in solutions that actually satisfy the given constraints, whereas here we are interested in K solutions [π_k, θ_k]_{k=1}^K whose mixture satisfies constraints corresponding to the moment conditions (3). Within machine learning, generalized PCA has been formulated as a moment problem [30], and the Hankel matrix (basically the moment matrix) has been used to learn weighted automata [13]. While similar tools are used, the conceptual approach and the problems considered are different. For example, the moment matrix of this paper consists of unknown moments of the model parameters, whereas existing works considered moments of the data, which are always directly observable.

Constraints. Constraints such as non-negativity (for parameters which represent probabilities or variances) and parameter tying [31] are quite common in graphical models and are not easily addressed with existing method of moments approaches. The GMP framework allows us to incorporate some constraints using localizing matrices [32]. Thus, we can handle constraints during the estimation procedure rather than projecting back onto the constraint set as a post-processing step. This is necessary for models that only become identifiable by the observed moments after constraints are taken into account. We describe this method and its learning implications in Section C.1.

Guarantees and statistical efficiency. In some circumstances, e.g., in three-view mixture models or the mixture of linear regressions, the constraints fully determine the moment matrix; we consider these cases in Section 5 and Appendix B. While there are no general guarantees on Problem 7, the flat extension theorem tells us when the moment matrix corresponds to a unique solution (more discussion in Appendix A):

Theorem 1 (Flat extension theorem [25]). Let y be the solution to Problem 7 for a particular r. If M_r(y) ⪰ 0 and rank(M_{r−1}(y)) = rank(M_r(y)), then y is the optimal solution to Problem 6 for K = rank(M_r(y)), and there exists a unique K-atomic supporting measure µ of M_r(y).

The recovered M_r(y) depends linearly on small perturbations of the input [33], suggesting that the method has polynomial sample complexity for most models whose moments concentrate at a polynomial rate.
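To make Problem 7 concrete for Example 3.1, here is a minimal sketch of the SDP. The paper's implementation uses CVXOPT; this sketch uses CVXPY for brevity, includes only a handful of moment constraints (those of Figure 1 plus the fourth moment), and uses illustrative parameter values.

```python
import numpy as np
import cvxpy as cp

# Draw data from a 2-component 1D Gaussian mixture (illustrative parameters).
rng = np.random.default_rng(0)
pi, means, variances = [0.4, 0.6], [-2.0, 3.0], [1.0, 0.5]
z = rng.choice(2, size=200_000, p=pi)
x = rng.normal(np.array(means)[z], np.sqrt(np.array(variances)[z]))
m = [np.mean(x**n) for n in range(1, 5)]                 # empirical E[x], ..., E[x^4]

# Moment matrix M_2(y) over the monomials [1, mu, mu^2, sigma^2]; entries with the same
# exponent sum refer to the same moment y_alpha, so we tie them with equality constraints.
labels = [(0, 0), (1, 0), (2, 0), (0, 1)]
M = cp.Variable((4, 4), symmetric=True)
y, cons = {}, [M >> 0]
for i, a in enumerate(labels):
    for j, b in enumerate(labels):
        key = (a[0] + b[0], a[1] + b[1])
        if key in y:
            cons.append(M[i, j] == y[key])
        else:
            y[key] = M[i, j]
# Linear moment conditions: y_00 = 1, y_10 = E[x], y_20 + y_01 = E[x^2], y_30 + 3 y_11 = E[x^3],
# and y_40 + 6 y_21 + 3 y_02 = E[x^4] (since E[x^4] = mu^4 + 6 mu^2 sigma^2 + 3 sigma^4 for a Gaussian).
cons += [y[(0, 0)] == 1,
         y[(1, 0)] == m[0],
         y[(2, 0)] + y[(0, 1)] == m[1],
         y[(3, 0)] + 3 * y[(1, 1)] == m[2],
         y[(4, 0)] + 6 * y[(2, 1)] + 3 * y[(0, 2)] == m[3]]
prob = cp.Problem(cp.Minimize(cp.trace(M)), cons)        # C = I_{s(r)}: nuclear-norm relaxation of rank
prob.solve()
print(np.round(np.linalg.eigvalsh(M.value), 3))          # the spectrum should suggest rank K = 2
```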
Finally, in Appendix C, we discuss a few other important considerations, such as noise robustness and making Problem 7 more statistically efficient, along with some technical results on the moment completion problem and some open problems.

4 Solution extraction

Having completed the (parameter) moment matrix M_r(y) (Section 3), we now turn to the problem of extracting the model parameters [θ_k]_{k=1}^K. The solution extraction method we present is based on ideas from solving multivariate polynomial systems, where the solutions are eigenvalues of certain multiplication matrices [16, 17, 34, 35].¹ The main advantage of the solution extraction view is that higher-order moments and structure in parameters are handled in the framework without model-specific effort.

Recall that the true moment matrix is M_r(y*) = Σ_{k=1}^K π*_k v(θ*_k) v(θ*_k)^T, where v(θ) := [θ^{α_1}, ..., θ^{α_{s(r)}}] ∈ R[θ]^{s(r)} contains all the monomials up to degree r. We use θ = [θ_1, ..., θ_P] for the variables and [θ*_k]_{k=1}^K for the true solutions to these variables (marked with a star). For example, (θ*_k)_p denotes the pth value of the kth component, which corresponds to a solution for the variable θ_p. Typically s(r) is much larger than K and P, and the elements of v(θ) are arranged in a degree ordering so that ‖α_i‖_1 ≤ ‖α_j‖_1 for i ≤ j. We can also write M_r(y*) as M_r(y*) = V Π V^T, where the canonical basis V := [v(θ*_1), ..., v(θ*_K)] ∈ R^{s(r)×K} and Π := diag(π*_1, ..., π*_K).

At a high level, we want to factorize M_r(y*) to get V; however, we cannot simply eigen-decompose M_r(y*) since V is not orthogonal. To overcome this challenge, we will exploit the internal structure of V to construct several other matrices that share the same factors and perform simultaneous diagonalization. Specifically, let V[β_1; ...; β_K] ∈ R^{K×K} be the sub-matrix of V with only the rows corresponding to monomials with exponents β_1, ..., β_K ∈ ℕ^P. Typically, β_1, ..., β_K are just the first K monomials in v. Now consider the exponent γ_p ∈ ℕ^P which is 1 in position p and 0 elsewhere, corresponding to the monomial θ^{γ_p} = θ_p. The key property of the canonical basis is that multiplying each column k by its pth parameter θ*_{k,p} just performs a shift to another set of rows:

    V[β_1; ...; β_K] D_p = V[β_1 + γ_p; ...; β_K + γ_p],   where D_p := diag(θ*_{1,p}, ..., θ*_{K,p}).    (8)

Note that D_p contains the pth parameter for all K mixture components.

Example 4.1 (Shifting the canonical basis). Let θ = [θ_1, θ_2] and the true solutions be θ*_1 = [2, 3] and θ*_2 = [−2, 5]. To extract the solutions for θ_1 (namely θ*_{1,1} = 2 and θ*_{2,1} = −2), let β_1 = (1, 0), β_2 = (1, 1), and γ_1 = (1, 0). Then, with columns v(θ*_1), v(θ*_2),

    V[β_1; β_2] diag(θ*_{1,1}, θ*_{2,1}) = [[2, −2], [6, −10]] · diag(2, −2) = [[4, 4], [12, 20]] = V[β_1 + γ_1; β_2 + γ_1],

where the rows on the left correspond to the monomials θ_1 and θ_1θ_2, and on the right to θ_1² and θ_1²θ_2.

While the above reveals the structure of V, we do not know V. However, we can recover its column space U ∈ R^{s(r)×K} from the moment matrix M_r(y*), for example with an SVD. Thus, we can relate U and V by a linear transformation: V = UQ, where Q ∈ R^{K×K} is some unknown invertible matrix. Equation (8) can now be rewritten as

    U[β_1; ...; β_K] Q D_p = U[β_1 + γ_p; ...; β_K + γ_p] Q,   p = 1, ..., P,    (10)

which is a generalized eigenvalue problem in which D_p are the eigenvalues and Q are the eigenvectors. Crucially, the eigenvalues D_p = diag(θ*_{1,p}, ..., θ*_{K,p}) give us the solutions to our parameters. Note that for any choice of β_1, ..., β_K and p ∈ [P], we have generalized eigenvalue problems that share eigenvectors Q, though their eigenvalues D_p may differ.
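To make the extraction step concrete, the following sketch builds the exact moment matrix for the running 1D Gaussian example with illustrative parameter values, recovers the column space U by SVD, and reads off the D_p from the generalized eigenvalue problems (10) using one random combination, the random-projection idea detailed in the next paragraph. Names and values are illustrative and not taken from the paper's released code.

```python
import numpy as np
from scipy.linalg import eig, svd

# Illustrative true parameters for the running example: theta_k = (mu_k, sigma_k^2), K = 2, P = 2.
thetas = np.array([[-2.0, 1.0], [3.0, 0.5]])
pi = np.array([0.4, 0.6])
K, P = thetas.shape

# Canonical basis V over all monomials of degree <= 2 in theta, and the true moment matrix M_r(y*).
monos = [(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)]
row = {m: i for i, m in enumerate(monos)}
V = np.array([[t[0]**a * t[1]**b for t in thetas] for (a, b) in monos])   # s(r) x K
M = V @ np.diag(pi) @ V.T                                                  # M_r(y*) = V Pi V^T

U = svd(M)[0][:, :K]                            # a basis for the column space of V (so V = U Q)
betas = [(0, 0), (1, 0)]                        # the first K monomials as row exponents

def U_rows(shift):                              # U[beta_1 + shift; ...; beta_K + shift]
    return U[[row[(b[0] + shift[0], b[1] + shift[1])] for b in betas], :]

# One random combination of the problems (10) gives the shared eigenvectors Q ...
rng = np.random.default_rng(0)
r = rng.normal(size=P)
A = sum(r[p] * U_rows(tuple(int(i == p) for i in range(P))) for p in range(P))
_, Q = eig(A, U_rows((0, 0)))
# ... and each D_p = diag(theta_{1,p}, ..., theta_{K,p}) is read off by diagonalizing with that Q.
BQ = U_rows((0, 0)) @ Q
D = [np.diag(np.linalg.solve(BQ, U_rows(tuple(int(i == p) for i in range(P))) @ Q)).real
     for p in range(P)]
print(np.array(D).T)                            # rows recover theta_1, theta_2 up to permutation
```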
Corresponding eigenvalues (and hence solutions) can be obtained by solving a simultaneous generalized eigenvalue problem, e.g., by using random projections as in Algorithm B of [4] or, more robustly [37], simultaneous diagonalization algorithms [38, 39, 40].

¹ [36] is a short overview and [35] is a comprehensive treatment including numerical issues.

We describe one approach to solve (10), which is similar to Algorithm B of [4]. The idea is to take P random weighted combinations of the equations (10) and solve the resulting (generalized) eigendecomposition problems. Let R ∈ R^{P×P} be a random matrix whose entries are drawn from N(0, 1). Then for each q = 1, ..., P, solve the generalized eigenvalue problem

    U[β_1; ...; β_K] Q D̃_q = Σ_{p=1}^P R_{q,p} U[β_1 + γ_p; ...; β_K + γ_p] Q,   where D̃_q := Σ_{p=1}^P R_{q,p} D_p.

The resulting eigenvalues can be collected in Λ ∈ R^{P×K}, where Λ_{q,k} = (D̃_q)_{k,k}. Note that by definition Λ_{q,k} = Σ_{p=1}^P R_{q,p} θ*_{k,p}, so we can simply invert to obtain [θ*_1, ..., θ*_K] = R^{−1}Λ. Although this simple approach does not have great numerical properties, these eigenvalue problems are solvable if the eigenvalues [λ_{q,1}, ..., λ_{q,K}] are distinct for all q, which happens with probability 1 as long as the parameters θ_k are different from each other. In Appendix B.1, we show how a prior tensor decomposition algorithm from [4] can be seen as solving Equation (10) for a particular instantiation of β_1, ..., β_K.

5 Applications

Let us now look at some applications of Polymom. Table 2 presents several models with corresponding observation functions and moment polynomials. It is fairly straightforward to write down observation functions for a given model. The moment polynomials can then be derived by computing expectations under the model; this step can be compared to deriving gradients for EM.

Table 2: Applications of the Polymom framework. See Appendix B.2 for more details.

  Mixture of linear regressions.
    Model: x = [x, υ] is observed, where x ∈ R^D is drawn from an unspecified distribution and υ ∼ N(w⊤x, σ²), with σ known. The parameters are θ_k = (w_k) ∈ R^D.
    Observation functions: φ_{α,b}(x) = x^α υ^b for 0 ≤ |α| ≤ 3, b ∈ [2].
    Moment polynomials: f_{α,1}(θ) = Σ_{p=1}^P E[x^{α+γ_p}] w_p;  f_{α,2}(θ) = E[x^α] σ² + Σ_{p,q=1}^P E[x^α x_p x_q] w_p w_q, where γ_p ∈ ℕ^P is 1 in position p and 0 elsewhere.

  Mixture of Gaussians.
    Model: x ∈ R^D is observed, where x is drawn from a Gaussian with diagonal covariance: x ∼ N(µ, diag(c)). The parameters are θ_k = (µ_k, c_k) ∈ R^{D+D}.
    Observation functions: φ_α(x) = x^α for 0 ≤ |α| ≤ 4.
    Moment polynomials: f_α(θ) = ∏_{d=1}^D h_{α_d}(µ_d, c_d).²

  Multiview mixtures.
    Model: With 3 views, x = [x^(1), x^(2), x^(3)] is observed, where x^(1), x^(2), x^(3) ∈ R^D and x^(ℓ) is drawn from an unspecified distribution with mean µ^(ℓ) for ℓ ∈ [3]. The parameters are θ_k = (µ_k^(1), µ_k^(2), µ_k^(3)) ∈ R^{D+D+D}.
    Observation functions: φ_{ijk}(x) = x_i^(1) x_j^(2) x_k^(3), where 1 ≤ i, j, k ≤ D.
    Moment polynomials: f_{ijk}(θ) = µ_i^(1) µ_j^(2) µ_k^(3).

We implemented Polymom for several mixture models in Python (code: https://github.com/sidaw/polymom). We used CVXOPT to handle the SDP and the random projections algorithm described above to extract solutions. In Table 3, we show the relative error max_k ‖θ̂_k − θ*_k‖_2 averaged over 10 random models of each class. In the rest of this section, we will discuss guarantees on parameter recovery for each of these models.

² h_α(µ, c) = Σ_{i=0}^{⌊α/2⌋} a_{α,α−2i} µ^{α−2i} c^i, where a_{α,i} is the absolute value of the coefficient of the degree-i term of the αth (univariate) Hermite polynomial. For example, the first few are h_1(µ, c) = µ, h_2(µ, c) = µ² + c, h_3(µ, c) = µ³ + 3µc, h_4(µ, c) = µ⁴ + 6µ²c + 3c².
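As a quick numerical check of these single-coordinate Gaussian moment polynomials, the following sketch evaluates h_α(µ, c) from the (probabilists') Hermite coefficients via NumPy and compares against Monte Carlo moments; the helper name and test values are illustrative.

```python
import numpy as np
from numpy.polynomial.hermite_e import herme2poly

def h(a, mu, c):
    # h_a(mu, c) = E[x^a] for x ~ N(mu, c), built from |coefficients| of the a-th Hermite polynomial.
    coeffs = np.abs(herme2poly([0] * a + [1]))           # ascending power-basis coefficients of He_a
    return sum(coeffs[i] * mu**i * c**((a - i) // 2) for i in range(a % 2, a + 1, 2))

rng = np.random.default_rng(0)
x = rng.normal(1.5, np.sqrt(0.7), size=500_000)
print([round(float(np.mean(x**a)), 2) for a in range(1, 5)])   # Monte Carlo E[x^a]
print([round(float(h(a, 1.5, 0.7)), 2) for a in range(1, 5)])  # h_1, ..., h_4: should roughly agree
```

For the diagonal-covariance Gaussian in Table 2, the moment polynomial f_α(θ) is simply the product of these one-dimensional factors over the coordinates d.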
  Gaussians     K, D    T = 10^3              T = 10^4              T = 10^5
                        EM     TF     Poly    EM     TF     Poly    EM     TF     Poly
  spherical     2, 2    0.37   2.05   0.58    0.24   0.73   0.29    0.19   0.36   0.14
  diagonal      2, 2    0.44   2.15   0.48    0.48   4.03   0.40    0.38   2.46   0.35
  constrained   2, 2    0.49   7.52   0.38    0.47   2.56   0.30    0.34   3.02   0.29

  Others        K, D    T = 10^4              T = 10^5              T = 10^6
                        EM     TF     Poly    EM     TF     Poly    EM     TF     Poly
  3-view        3, 3    0.38   0.51   0.57    0.31   0.33   0.26    0.36   0.16   0.12
  lin. reg.     2, 2    -      -      3.51    -      -      2.60    -      -      2.52

Table 3: T is the number of samples, and the error metric is defined above. Methods: EM: sklearn initialized with k-means using 5 random restarts; TF: the tensor power method implemented in Python; Poly: Polymom, by solving Problem 7. Models: for mixtures of Gaussians, σ ≈ 2‖µ_1 − µ_2‖_2; spherical and diagonal describe the type of covariance matrix; the mean parameters of the constrained Gaussians satisfy µ_1 + µ_2 = 1. The best result is bolded. TF only handles spherical variance, but it was of interest to see what TF does if the data is drawn from a mixture of Gaussians with diagonal covariance; these results are in strikeout.

Mixture of linear regressions. We can guarantee that Polymom recovers the parameters of this model when K ≤ D by showing that Problem 6 can be solved exactly: observe that while no entry of the moment matrix M_3(y) is directly observed, each observation gives us a linear constraint on the entries of the moment matrix, and when K ≤ D there are enough equations that this system admits a unique solution for y. Chaganty et al. [9] were also able to recover parameters for this model under the same condition (K ≤ D) by solving a series of low-rank tensor recovery problems, which ultimately requires the computation of the same moments described above. In contrast, the Polymom framework makes the dependence on moments explicit up front and takes care of the heavy lifting in a problem-agnostic manner. Lastly, the model can be extended to handle per-component noise by including σ as a parameter, an extension that is not possible using the method in [9].

Multiview mixtures. We can guarantee parameter recovery when K ≤ D by proving that Problem 7 can be solved exactly (see Section B.2).

Mixture of Gaussians. In this case, however, the moment conditions are non-trivial and we cannot guarantee recovery of the true parameters. Polymom is nevertheless guaranteed to recover a mixture of Gaussians that matches the moments. We can also apply constraints to the model: consider the case of a 2D mixture where the mean parameters of all components lie on a parabola, µ_1 − µ_2² = 0. In this case, we just need to add the constraints y_{(1,0)+β} − y_{(0,2)+β} = 0 to Problem 7, for all β ∈ ℕ² up to degree |β| ≤ 2r − 2; a small helper enumerating these index pairs is sketched below. By incorporating these constraints at estimation time, we can possibly identify the model parameters with fewer moments. See Section C for more details.
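The following sketch (a hypothetical helper, not from the paper's code) enumerates the equality constraints y_{(1,0)+β} = y_{(0,2)+β} induced by the parabola µ_1 − µ_2² = 0 for a given moment matrix degree r.

```python
from itertools import product

def parabola_constraints(r):
    # Pairs (alpha, alpha') with y_alpha = y_alpha', one for each beta in N^2 with |beta| <= 2r - 2.
    pairs = []
    for b1, b2 in product(range(2 * r - 1), repeat=2):
        if b1 + b2 <= 2 * r - 2:
            pairs.append(((1 + b1, b2), (b1, 2 + b2)))
    return pairs

for lhs, rhs in parabola_constraints(r=2):
    print(f"y{lhs} = y{rhs}")        # e.g. y(1, 0) = y(0, 2), y(2, 0) = y(1, 2), ...
```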
6 Conclusion

We presented a unifying framework for learning many types of mixture models via the method of moments. For example, for the mixture of Gaussians, we can apply the same algorithm both to mixtures in 1D, which require higher-order moments [3, 11], and to mixtures in high dimensions, where lower-order moments suffice [6]. The Generalized Moment Problem [15, 14] and its semidefinite relaxation hierarchies are what give us this generality, although we rely heavily on the ability of nuclear norm minimization to recover the underlying rank. As a result, while we always obtain parameters satisfying the moment conditions, there are no formal guarantees on consistent estimation.

The second main tool is solution extraction, which characterizes a more general structure of mixture models compared to the tensor structure observed by [6, 4]. This view draws connections to the literature on solving polynomial systems, where many techniques might be useful [35, 18, 19]. Finally, through the connections we have drawn, it is our hope that Polymom can make the method of moments as turnkey as EM for more latent-variable models, as well as improve the statistical efficiency of method of moments procedures.

Acknowledgments. This work was supported by a Microsoft Faculty Research Fellowship to the third author and an NSERC PGS-D fellowship for the first author.

References

[1] D. M. Titterington, A. F. Smith, and U. E. Makov. Statistical Analysis of Finite Mixture Distributions, volume 7. Wiley, New York, 1985.
[2] G. McLachlan and D. Peel. Finite Mixture Models. John Wiley & Sons, 2004.
[3] K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London A, 185:71-110, 1894.
[4] A. Anandkumar, D. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden Markov models. In Conference on Learning Theory (COLT), 2012.
[5] A. Anandkumar, D. P. Foster, D. Hsu, S. M. Kakade, and Y. Liu. Two SVDs suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation. In Advances in Neural Information Processing Systems (NIPS), 2012.
[6] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. arXiv, 2013.
[7] D. Hsu, S. M. Kakade, and P. Liang. Identifiability and unmixing of latent parse trees. In Advances in Neural Information Processing Systems (NIPS), 2012.
[8] D. Hsu and S. M. Kakade. Learning mixtures of spherical Gaussians: Moment methods and spectral decompositions. In Innovations in Theoretical Computer Science (ITCS), 2013.
[9] A. Chaganty and P. Liang. Spectral experts for estimating mixtures of linear regressions. In International Conference on Machine Learning (ICML), 2013.
[10] A. T. Kalai, A. Moitra, and G. Valiant. Efficiently learning mixtures of two Gaussians. In Symposium on Theory of Computing (STOC), pages 553-562, 2010.
[11] M. Hardt and E. Price. Sharp bounds for learning a mixture of two Gaussians. arXiv preprint arXiv:1404.4997, 2014.
[12] R. Ge, Q. Huang, and S. M. Kakade. Learning mixtures of Gaussians in high dimensions. arXiv preprint arXiv:1503.00424, 2015.
[13] B. Balle, X. Carreras, F. M. Luque, and A. Quattoni. Spectral learning of weighted automata: A forward-backward perspective. Machine Learning, 96(1):33-63, 2014.
[14] J. B. Lasserre. Moments, Positive Polynomials and Their Applications. Imperial College Press, 2011.
[15] J. B. Lasserre. A semidefinite programming approach to the generalized problem of moments. Mathematical Programming, 112(1):65-92, 2008.
[16] H. J. Stetter. Multivariate polynomial equations as matrix eigenproblems. WSSIAA, 2:355-371, 1993.
[17] H. M. Möller and H. J. Stetter. Multivariate polynomial equations with multiple zeros solved by matrix eigenproblems. Numerische Mathematik, 70(3):311-329, 1995.
[18] B. Sturmfels. Solving Systems of Polynomial Equations. American Mathematical Society, 2002.
[19] D. Henrion and J. Lasserre. Detecting global optimality and extracting solutions in GloptiPoly. In Positive Polynomials in Control, pages 293-310, 2005.
[20] A. Anandkumar, R. Ge, D. Hsu, and S. Kakade. A tensor spectral approach to learning mixed membership community models. In Conference on Learning Theory (COLT), pages 867-881, 2013.
[21] A. Anandkumar, R. Ge, and M. Janzamin. Provable learning of overcomplete latent variable models: Semi-supervised and unsupervised settings. arXiv preprint arXiv:1408.0553, 2014.
[22] K. Viele and B. Tong. Modeling with mixtures of linear regressions. Statistics and Computing, 12(4):315-330, 2002.
[23] B. Sturmfels. Algorithms in Invariant Theory. Springer Science & Business Media, 2008.
[24] R. M. Corless, K. Gatermann, and I. S. Kotsireas. Using symmetries in the eigenvalue method for polynomial systems. Journal of Symbolic Computation, 44(11):1536-1550, 2009.
[25] R. E. Curto and L. A. Fialkow. Solution of the Truncated Complex Moment Problem for Flat Data, volume 568. American Mathematical Society, 1996.
[26] J. B. Lasserre. Global optimization with polynomials and the problem of moments. SIAM Journal on Optimization, 11(3):796-817, 2001.
[27] M. Laurent. Sums of squares, moment matrices and optimization over polynomials. In Emerging Applications of Algebraic Geometry, pages 157-270, 2009.
[28] P. A. Parrilo and B. Sturmfels. Minimizing polynomial functions. Algorithmic and Quantitative Real Algebraic Geometry, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 60:83-99, 2003.
[29] P. A. Parrilo. Semidefinite programming relaxations for semialgebraic problems. Mathematical Programming, 96(2):293-320, 2003.
[30] N. Ozay, M. Sznaier, C. M. Lagoa, and O. I. Camps. GPCA with denoising: A moments-based convex approach. In Computer Vision and Pattern Recognition (CVPR), pages 3209-3216, 2010.