# Benefits of over-parameterization with EM

Ji Xu (Columbia University, jixu@cs.columbia.edu) · Daniel Hsu (Columbia University, djhsu@cs.columbia.edu) · Arian Maleki (Columbia University, arian@stat.columbia.edu)

Expectation Maximization (EM) is among the most popular algorithms for maximum likelihood estimation, but it is generally only guaranteed to find stationary points of the log-likelihood objective. The goal of this article is to present theoretical and empirical evidence that over-parameterization can help EM avoid spurious local optima in the log-likelihood. We consider the problem of estimating the mean vectors of a Gaussian mixture model in a scenario where the mixing weights are known. Our study shows that the global behavior of EM, when one uses an over-parameterized model in which the mixing weights are treated as unknown, is better than that when one uses the (correct) model with the mixing weights fixed to the known values. For symmetric Gaussian mixtures with two components, we prove that introducing the (statistically redundant) weight parameters enables EM to find the global maximizer of the log-likelihood starting from almost any initial mean parameters, whereas EM without this over-parameterization may very often fail. For other Gaussian mixtures, we provide empirical evidence that shows similar behavior. Our results corroborate the value of over-parameterization in solving non-convex optimization problems, previously observed in other domains.

## 1 Introduction

In a Gaussian mixture model (GMM), the observed data $Y = \{y_1, y_2, \ldots, y_n\} \subseteq \mathbb{R}^d$ comprise an i.i.d. sample from a mixture of $k$ Gaussians:
$$y_1, \ldots, y_n \overset{\text{iid}}{\sim} \sum_{i=1}^{k} w^*_i\, \mathcal{N}(\mu^*_i, \Sigma^*_i), \qquad (1)$$
where $w^*_i$, $\mu^*_i$, and $\Sigma^*_i$ denote the weight, mean, and covariance matrix of the $i$th mixture component. Parameters of the GMM are often estimated using the Expectation Maximization (EM) algorithm, which aims to find the maximizer of the log-likelihood objective. However, the log-likelihood function is not concave, so EM is only guaranteed to find its stationary points. This leads to the following natural and fundamental question in the study of EM and non-convex optimization: How can EM escape spurious local maxima and saddle points to reach the maximum likelihood estimate (MLE)? In this work, we give theoretical and empirical evidence that over-parameterizing the mixture model can help EM achieve this objective.

Our evidence is based on models in (1) where the mixture components share a known, common covariance, i.e., we fix $\Sigma_i = \Sigma$ for all $i$. First, we assume that the mixing weights $w_i$ are also fixed to known values. Under this model, which we call Model 1, EM finds a stationary point of the log-likelihood function in the parameter space of component means $(\mu_1, \ldots, \mu_k)$. Next, we over-parameterize Model 1 as follows. Despite the fact that the weights are fixed in Model 1, we now pretend that they are not fixed. This gives a second model, which we call Model 2. Parameter estimation for Model 2 requires EM to estimate the mixing weights in addition to the component means. Finding the global maximizer of the log-likelihood over this enlarged parameter space is seemingly more difficult for Model 2 than it is for Model 1, and perhaps needlessly so. However, in this paper we present theoretical and empirical evidence to the contrary.
1. For mixtures of two symmetric Gaussians (i.e., $k = 2$ and $\mu^*_1 = -\mu^*_2$), we prove that EM for Model 2 converges to the global maximizer of the log-likelihood objective with almost any initialization of the mean parameters, while EM for Model 1 will fail to do so for many choices of the weights $(w^*_1, w^*_2)$. These results are established for idealized executions of EM in an infinite sample size limit, which we complement with finite sample results.

2. We prove that the spurious local maxima in the (population) log-likelihood objective for Model 1 are eliminated in the objective for Model 2.

3. We present an empirical study showing that for more general mixtures of Gaussians, with a variety of model parameters and sample sizes, EM for Model 2 has a higher probability of finding the MLE than EM for Model 1 under random initializations.

Related work. Since Dempster's 1977 paper [Dempster et al., 1977], the EM algorithm has become one of the most popular algorithms for finding the MLE of mixture models. Due to its popularity, the convergence analysis of EM has attracted researchers' attention for years. Local convergence of EM has been shown by Wu [1983], Xu and Jordan [1996], Tseng [2004], and Chrétien and Hero [2008]. Further, for certain models and under various assumptions about the initialization, EM has been shown to converge to the MLE [Redner and Walker, 1984, Balakrishnan et al., 2017, Klusowski and Brinda, 2016, Yan et al., 2017]. Typically, the initialization is required to be sufficiently close to the true parameter values of the data-generating distribution. Much less is known about global convergence of EM, as the landscape of the log-likelihood function has not been well studied. For GMMs, Xu et al. [2016] and Daskalakis et al. [2017] study mixtures of two Gaussians with equal weights and show that the log-likelihood objective has only two global maxima and one saddle point, and that if EM is randomly initialized in a natural way, the probability that EM converges to this saddle point is zero. (Our Theorem 2 generalizes these results.) It is known that for mixtures of three or more Gaussians, global convergence is not generally possible [Jin et al., 2016].

The value of over-parameterization for local or greedy search algorithms that aim to find a global minimizer of non-convex objectives has been rigorously established in other domains. Matrix completion is a concrete example: the goal is to recover a rank-$r$ ($r \ll n$) matrix $M \in \mathbb{R}^{n \times n}$ from observations of randomly chosen entries [Candès and Recht, 2009]. A direct approach to this problem is to find the matrix $X \in \mathbb{R}^{n \times n}$ of minimum rank that is consistent with the observed entries of $M$. However, this optimization problem is NP-hard in general, despite the fact that there are only $2nr - r^2 \ll n^2$ degrees of freedom. An indirect approach to this matrix completion problem is to find a matrix $X$ of smallest nuclear norm, subject to the same constraints; this is a convex relaxation of the rank minimization problem. By considering all $n^2$ degrees of freedom, Candès and Tao [2010] show that the matrix $M$ is exactly recovered via nuclear norm minimization as soon as $\Omega(nr \log^6 n)$ entries are observed (with high probability). Notably, this combination of over-parameterization with convex relaxation works well in many other research problems such as sparse PCA [d'Aspremont et al., 2005] and compressive sensing [Donoho, 2006]. However, many problems (like ours) do not have a straightforward convex relaxation.
Therefore, it is important to understand how over-parameterization can help one solve a non-convex problem by means other than convex relaxation. Another line of work in which the value of over-parameterization is observed is deep learning. It is conjectured that over-parameterization is the main reason for the success of local search algorithms in learning good parameters for neural nets [Livni et al., 2014, Safran and Shamir, 2017]. Recently, Haeffele and Vidal [2015], Nguyen and Hein [2017, 2018], Soltani and Hegde [2018], and Du and Lee [2018] confirmed this observation for many neural network architectures, such as feedforward and convolutional networks.

## 2 Theoretical results

In this section, we present our main theoretical results concerning EM and two-component Gaussian mixture models.

### 2.1 Sample-based EM and Population EM

Without loss of generality, we assume $\Sigma = I$. We consider the following Gaussian mixture model:
$$y_1, \ldots, y_n \overset{\text{iid}}{\sim} w^*_1\, \mathcal{N}(\mu^*, I) + w^*_2\, \mathcal{N}(-\mu^*, I). \qquad (2)$$
The mixing weights $w^*_1, w^*_2$ are fixed (i.e., assumed to be known). Without loss of generality, we also assume that $w^*_1 \geq w^*_2 > 0$ (and, of course, $w^*_1 + w^*_2 = 1$). The only parameter to estimate is the mean vector $\mu^*$. The EM algorithm for this model uses the following iterations:
$$\mu^{\langle t+1\rangle} = \frac{1}{n}\sum_{i=1}^{n} \frac{w^*_1 e^{\langle y_i, \mu^{\langle t\rangle}\rangle} - w^*_2 e^{-\langle y_i, \mu^{\langle t\rangle}\rangle}}{w^*_1 e^{\langle y_i, \mu^{\langle t\rangle}\rangle} + w^*_2 e^{-\langle y_i, \mu^{\langle t\rangle}\rangle}}\, y_i. \qquad (3)$$
We refer to this algorithm as Sample-based EM1: it is the EM algorithm one would normally use when the mixing weights are known. In spite of this, we also consider an EM algorithm that pretends that the weights are not known, and estimates them alongside the mean parameters. We refer to this algorithm as Sample-based EM2, which uses the following iterations:
$$\hat w^{\langle t+1\rangle}_1 = \frac{1}{n}\sum_{i=1}^{n} \frac{\hat w^{\langle t\rangle}_1 e^{\langle y_i, \hat\mu^{\langle t\rangle}\rangle}}{\hat w^{\langle t\rangle}_1 e^{\langle y_i, \hat\mu^{\langle t\rangle}\rangle} + \hat w^{\langle t\rangle}_2 e^{-\langle y_i, \hat\mu^{\langle t\rangle}\rangle}}, \qquad \hat w^{\langle t+1\rangle}_2 = 1 - \hat w^{\langle t+1\rangle}_1,$$
$$\hat\mu^{\langle t+1\rangle} = \frac{1}{n}\sum_{i=1}^{n} \frac{\hat w^{\langle t\rangle}_1 e^{\langle y_i, \hat\mu^{\langle t\rangle}\rangle} - \hat w^{\langle t\rangle}_2 e^{-\langle y_i, \hat\mu^{\langle t\rangle}\rangle}}{\hat w^{\langle t\rangle}_1 e^{\langle y_i, \hat\mu^{\langle t\rangle}\rangle} + \hat w^{\langle t\rangle}_2 e^{-\langle y_i, \hat\mu^{\langle t\rangle}\rangle}}\, y_i. \qquad (4)$$
This is the EM algorithm for a different Gaussian mixture model in which the weights $w^*_1, w^*_2$ are not fixed (i.e., unknown), and hence must be estimated.

Our goal is to study the global convergence properties of the above two EM algorithms on data from the first model, where the mixing weights are, in fact, known. We study idealized executions of the EM algorithms in the large sample limit, where the algorithms are modified to be computed over an infinitely large i.i.d. sample drawn from the mixture distribution in (2). Specifically, we replace the empirical averages in (3) and (4) with expectations with respect to the mixture distribution. We obtain the following two modified EM algorithms, which we refer to as Population EM1 and Population EM2.

Population EM1:
$$\mu^{\langle t+1\rangle} = \mathbb{E}_{y\sim f}\!\left[\frac{w^*_1 e^{\langle y, \mu^{\langle t\rangle}\rangle} - w^*_2 e^{-\langle y, \mu^{\langle t\rangle}\rangle}}{w^*_1 e^{\langle y, \mu^{\langle t\rangle}\rangle} + w^*_2 e^{-\langle y, \mu^{\langle t\rangle}\rangle}}\, y\right] =: H(\mu^{\langle t\rangle}; \mu^*, w^*_1), \qquad (5)$$
where $f = f(\,\cdot\,; \mu^*, w^*_1)$ here denotes the true distribution of $y_i$ given in (2).

Population EM2: set $w^{\langle 0\rangle}_1 = w^{\langle 0\rangle}_2 = 0.5$,¹ and run
$$w^{\langle t+1\rangle}_1 = \mathbb{E}_{y\sim f}\!\left[\frac{w^{\langle t\rangle}_1 e^{\langle y, \mu^{\langle t\rangle}\rangle}}{w^{\langle t\rangle}_1 e^{\langle y, \mu^{\langle t\rangle}\rangle} + w^{\langle t\rangle}_2 e^{-\langle y, \mu^{\langle t\rangle}\rangle}}\right] =: G_w(\mu^{\langle t\rangle}, w^{\langle t\rangle}_1; \mu^*, w^*_1), \qquad (6)$$
$$\mu^{\langle t+1\rangle} = \mathbb{E}_{y\sim f}\!\left[\frac{w^{\langle t\rangle}_1 e^{\langle y, \mu^{\langle t\rangle}\rangle} - w^{\langle t\rangle}_2 e^{-\langle y, \mu^{\langle t\rangle}\rangle}}{w^{\langle t\rangle}_1 e^{\langle y, \mu^{\langle t\rangle}\rangle} + w^{\langle t\rangle}_2 e^{-\langle y, \mu^{\langle t\rangle}\rangle}}\, y\right] =: G_\mu(\mu^{\langle t\rangle}, w^{\langle t\rangle}_1; \mu^*, w^*_1). \qquad (7)$$

¹Using equal initial weights is a natural way to initialize EM when the weights are unknown.

As $n \to \infty$, the performance of Sample-based EM1/EM2 can be shown to converge to that of Population EM1/EM2 in probability. This argument has been used rigorously in many previous works on EM [Balakrishnan et al., 2017, Xu et al., 2016, Klusowski and Brinda, 2016, Daskalakis et al., 2017]. The main goal of this section, however, is to study the dynamics of Population EM1 and Population EM2, and the landscape of the log-likelihood objectives of the two models.
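To make the updates concrete, here is a minimal NumPy sketch of the two sample-based algorithms for the symmetric mixture in (2). It is only a sketch under this section's assumptions ($\Sigma = I$, component means $\pm\mu$); the toy settings at the end ($\mu^* = 1$, $w^*_1 = 0.7$, $n = 10^5$, initialization $\mu^{\langle 0\rangle} = -\mu^*$) are illustrative choices, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_data(n, mu_star, w1_star):
    """Draw n points from w1* N(mu*, I) + (1 - w1*) N(-mu*, I)."""
    signs = np.where(rng.random(n) < w1_star, 1.0, -1.0)
    return signs[:, None] * mu_star + rng.standard_normal((n, mu_star.size))

def responsibilities(Y, mu, w1):
    """Posterior probability that each point came from the +mu component."""
    a = Y @ mu                                    # <y_i, mu>
    return 1.0 / (1.0 + np.exp(-2.0 * a + np.log((1.0 - w1) / w1)))

def em1_step(Y, mu, w1_star):
    """Sample-based EM1 (weights known): the mean update in (3)."""
    gamma = responsibilities(Y, mu, w1_star)
    return ((2.0 * gamma - 1.0)[:, None] * Y).mean(axis=0)

def em2_step(Y, mu, w1):
    """Sample-based EM2 (weights treated as unknown): the updates in (4)."""
    gamma = responsibilities(Y, mu, w1)
    return ((2.0 * gamma - 1.0)[:, None] * Y).mean(axis=0), gamma.mean()

# Toy run: true parameters mu* = 1, w1* = 0.7, and a "wrong sign" initialization.
Y = sample_data(100_000, np.array([1.0]), 0.7)
mu1, mu2, w1 = np.array([-1.0]), np.array([-1.0]), 0.5
for _ in range(200):
    mu1 = em1_step(Y, mu1, 0.7)          # EM1 keeps the weights fixed at (0.7, 0.3)
    mu2, w1 = em2_step(Y, mu2, w1)       # EM2 re-estimates the weights each iteration
print("EM1 (known weights):   mu =", mu1)
print("EM2 (unknown weights): mu =", mu2, " w1 =", round(float(w1), 3))
```

The two algorithms differ only in whether the responsibilities are computed with the true weights or with the current weight estimates; everything else is the standard E-step/M-step pair for this symmetric model.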
### 2.2 Main theoretical results

Let us first consider the special case $w^*_1 = w^*_2 = 0.5$. Then it is straightforward to show that $w^{\langle t\rangle}_1 = 0.5$ for all $t$ in Population EM2. Hence, Population EM2 is equivalent to Population EM1. Global convergence of $\mu^{\langle t\rangle}$ to $\pm\mu^*$ for this case was recently established by Xu et al. [2016, Theorem 1] for almost all initial $\mu^{\langle 0\rangle}$ (see also [Daskalakis et al., 2017]). We first show that the same global convergence may not hold for Population EM1 when $w^*_1 \neq w^*_2$.

Theorem 1. Consider Population EM1 in dimension one (i.e., $\mu^* \in \mathbb{R}$). For any $\mu^* > 0$, there exists $\delta > 0$ such that given $w^*_1 \in (0.5, 0.5 + \delta)$ and initialization $\mu^{\langle 0\rangle} \leq -\mu^*$, the Population EM1 estimates $\mu^{\langle t\rangle}$ converge to a fixed point $\mu_{\mathrm{wrong}}$ inside $(-\mu^*, 0)$.

This theorem, which is proved in Appendix A, implies that if we use random initialization, Population EM1 may converge to the wrong fixed point with constant probability. We illustrate this in Figure 1. The iterates of Population EM1 converge to a fixed point of the function $\mu \mapsto H(\mu; \mu^*, w^*_1)$ defined in (5). We have plotted this function for several different values of $w^*_1$ in the left panel of Figure 1. When $w^*_1$ is close to 1, $H(\mu; \mu^*, w^*_1)$ has only one fixed point, and that is at $\mu = \mu^*$. Hence, in this case, the estimates produced by Population EM1 converge to the true $\mu^*$. However, when we decrease the value of $w^*_1$ below a certain threshold (which is numerically found to be approximately 0.77 for $\mu^* = 1$), two other fixed points of $H(\mu; \mu^*, w^*_1)$ emerge. These new fixed points are foils for Population EM1.

From the failure of Population EM1, one may expect the over-parameterized Population EM2 to fail as well. Yet, surprisingly, our second theorem proves the opposite is true: Population EM2 achieves global convergence even when $w^*_1 \neq w^*_2$.

Theorem 2. For any $w^*_1 \in [0.5, 1)$, the Population EM2 estimates $(\mu^{\langle t\rangle}, w^{\langle t\rangle}_1)$ converge to either $(\mu^*, w^*_1)$ or $(-\mu^*, w^*_2)$ for any initialization $\mu^{\langle 0\rangle}$ except on the hyperplane $\langle \mu^{\langle 0\rangle}, \mu^*\rangle = 0$. Furthermore, the convergence speed is geometric after some finite number of iterations, i.e., there exist a finite number $T$ and a constant $\rho \in (0, 1)$ such that the following hold. If $\langle \mu^{\langle 0\rangle}, \mu^*\rangle > 0$, then for all $t > T$,
$$\|\mu^{\langle t+1\rangle} - \mu^*\|^2 + |w^{\langle t+1\rangle}_1 - w^*_1|^2 \;\leq\; \rho^{\,t-T}\left(\|\mu^{\langle T\rangle} - \mu^*\|^2 + (w^{\langle T\rangle}_1 - w^*_1)^2\right).$$
If $\langle \mu^{\langle 0\rangle}, \mu^*\rangle < 0$, then for all $t > T$,
$$\|\mu^{\langle t+1\rangle} + \mu^*\|^2 + |w^{\langle t+1\rangle}_1 - w^*_2|^2 \;\leq\; \rho^{\,t-T}\left(\|\mu^{\langle T\rangle} + \mu^*\|^2 + (w^{\langle T\rangle}_1 - w^*_2)^2\right).$$

Theorem 2 implies that if we use random initialization for $\mu^{\langle 0\rangle}$, then with probability one the Population EM2 estimates converge to the true parameters.
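Because the population maps (5)–(7) are expectations under a one-dimensional mixture, they can be evaluated directly by quadrature, which makes Theorems 1 and 2 easy to probe numerically. The sketch below uses Gauss-Hermite quadrature; the particular values $\mu^* = 1$, $w^*_1 = 0.55$, the node count, and the iteration budget are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def gh_expect(g, mean, n_nodes=80):
    """E[g(Y)] for Y ~ N(mean, 1), via Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    return (w * g(np.sqrt(2.0) * x + mean)).sum() / np.sqrt(np.pi)

def true_expect(g, mu_star, w1_star):
    """E[g(Y)] under the true mixture w1* N(mu*, 1) + w2* N(-mu*, 1)."""
    return w1_star * gh_expect(g, mu_star) + (1.0 - w1_star) * gh_expect(g, -mu_star)

def H(mu, mu_star, w1_star):
    """Population EM1 update map, eq. (5), with the weights fixed at (w1*, w2*)."""
    w2_star = 1.0 - w1_star
    t = lambda y: ((w1_star * np.exp(y * mu) - w2_star * np.exp(-y * mu))
                   / (w1_star * np.exp(y * mu) + w2_star * np.exp(-y * mu)))
    return true_expect(lambda y: t(y) * y, mu_star, w1_star)

def G(mu, w1, mu_star, w1_star):
    """Population EM2 update maps (G_mu, G_w), eqs. (6)-(7)."""
    gamma = lambda y: (w1 * np.exp(y * mu)
                       / (w1 * np.exp(y * mu) + (1.0 - w1) * np.exp(-y * mu)))
    mu_new = true_expect(lambda y: (2.0 * gamma(y) - 1.0) * y, mu_star, w1_star)
    return mu_new, true_expect(gamma, mu_star, w1_star)

# Iterate both maps from a negative start, the regime of Theorem 1 and the
# second case of Theorem 2 (illustrative values: mu* = 1, w1* = 0.55 < 0.77).
mu_star, w1_star = 1.0, 0.55
mu1 = -1.0
mu2, w1 = -1.0, 0.5
for _ in range(500):
    mu1 = H(mu1, mu_star, w1_star)
    mu2, w1 = G(mu2, w1, mu_star, w1_star)
print("Population EM1 limit:", mu1)      # expected: a spurious point in (-mu*, 0), cf. Theorem 1
print("Population EM2 limit:", mu2, w1)  # expected: close to (-mu*, w2*), cf. Theorem 2
```

Sweeping $\mu$ over a grid and plotting $H(\mu; \mu^*, w^*_1)$ against $\mu$ reproduces the qualitative picture described for the left panel of Figure 1.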
The failure of Population EM1 and the success of Population EM2 can be explained intuitively. Let $C_1$ and $C_2$, respectively, denote the true mixture components with parameters $(w^*_1, \mu^*)$ and $(w^*_2, -\mu^*)$. Due to the symmetry in Population EM1, we are assured that among the two estimated mixture components, one will have a positive mean and the other a negative mean: call these $\hat C_+$ and $\hat C_-$, respectively. Assume $\mu^* > 0$ and $w^*_1 > 0.5$, and consider initializing Population EM1 with $\mu^{\langle 0\rangle} := -\mu^*$. This initialization incorrectly associates $\hat C_-$ with the larger weight $w^*_1$ instead of the smaller weight $w^*_2$. This causes, in the E-step of EM, the component $\hat C_-$ to become responsible for an overly large share of the overall probability mass, and in particular an overly large share of the mass from $C_1$ (which has a positive mean). Thus, in the M-step of EM, when the mean of the estimated component $\hat C_-$ is updated, it is pulled rightward (towards $+\infty$). It is possible that this rightward pull would cause the estimated mean of $\hat C_-$ to become positive, in which case the roles of $\hat C_+$ and $\hat C_-$ would switch, but this will not happen as long as $w^*_1$ is sufficiently bounded away from 1 (but still $> 0.5$).² The result is a bias in the estimation of $\mu^*$, which explains why the Population EM1 estimates converge to some $\mu_{\mathrm{wrong}} \in (-\mu^*, 0)$ when $w^*_1$ is not too large.

²If $w^*_1$ is indeed very close to 1, then almost all of the probability mass of the true distribution comes from $C_1$, which has positive mean. So, in the M-step discussed above, the rightward pull on the mean of $\hat C_-$ may be so strong that the updated mean estimate becomes positive. Since the model enforces that the mean estimates of $\hat C_+$ and $\hat C_-$ be negations of each other, the roles of $\hat C_+$ and $\hat C_-$ switch, and now it is $\hat C_+$ that becomes associated with the larger mixing weight $w^*_1$. In this case, owing to the symmetry assumption, Population EM1 may be able to successfully converge to $\mu^*$. We revisit this issue in the numerical study, where the symmetry assumption is removed.

Our discussion confirms that one way Population EM1 may fail (in dimension one) is if it is initialized with $\mu^{\langle 0\rangle}$ having the incorrect sign (e.g., $\mu^{\langle 0\rangle} = -\mu^*$). On the other hand, the performance of Population EM2 does not depend on the sign of the initial $\mu^{\langle 0\rangle}$. Recall that the estimates of Population EM2 converge to the fixed points of the mapping $M : (\mu, w_1) \mapsto (G_\mu(\mu, w_1; \mu^*, w^*_1),\, G_w(\mu, w_1; \mu^*, w^*_1))$, as defined in (6) and (7). One can check that for all $\mu, w_1, \mu^*, w^*_1$ (with $w_2 = 1 - w_1$),
$$G_\mu(\mu, w_1; \mu^*, w^*_1) + G_\mu(-\mu, w_2; \mu^*, w^*_1) = 0, \qquad G_w(\mu, w_1; \mu^*, w^*_1) + G_w(-\mu, w_2; \mu^*, w^*_1) = 1. \qquad (8)$$
Hence, $(\mu, w_1)$ is a fixed point of $M$ if and only if $(-\mu, w_2)$ is a fixed point of $M$ as well. Therefore, Population EM2 is insensitive to the sign of the initial $\mu^{\langle 0\rangle}$. This property extends to mixtures of $k > 2$ Gaussians as well: in those cases, the performance of EM for Model 2 is insensitive to permutations of the component parameters. Because of this nice property, as we will confirm in our simulations, when the mixture components are well separated, EM for Model 2 performs well for most initializations, while EM for Model 1 fails in many cases.

One limitation of our permutation-invariance explanation is that the argument only holds when the weights in Population EM2 are initialized to be uniform. However, the benefits of over-parameterization are not limited to this case. Indeed, when we compare the landscapes of the log-likelihood objectives for (the mixture models corresponding to) Population EM1 and Population EM2, we find that over-parameterization eliminates spurious local maxima that were obstacles for Population EM1.

Theorem 3. For all $w^*_1 \neq 0.5$, the log-likelihood objective optimized by Population EM2 has only one saddle point, $(\mu, w_1) = (0, 1/2)$, and no local maximizers besides the two global maximizers $(\mu, w_1) = (\mu^*, w^*_1)$ and $(\mu, w_1) = (-\mu^*, w^*_2)$.

The proof of this theorem is presented in Appendix C.

Remark 1. Consider the landscape of the log-likelihood objective for Population EM2 and the point $(\mu_{\mathrm{wrong}}, w^*_1)$, where $\mu_{\mathrm{wrong}}$ is the local maximizer suggested by Theorem 1. Theorem 3 implies that we can still easily escape this point due to the non-zero gradient in the direction of $w_1$; thus $(\mu_{\mathrm{wrong}}, w^*_1)$ is not even a saddle point. We emphasize that this is exactly the mechanism that we hoped for as the purpose and benefit of over-parameterization (see the left panel of Figure 2).

Remark 2. Note that although $(\mu, w_1) = ((w^*_1 - w^*_2)\mu^*, 1)$ and $((w^*_2 - w^*_1)\mu^*, 0)$ are also fixed points of Population EM2, they are not first-order stationary points of the log-likelihood objective if $w^*_1 \neq 0.5$.

Finally, to complete the analysis of EM for mixtures of two Gaussians, we present the following result that applies to Sample-based EM2.

Theorem 4. Let $(\hat\mu^{\langle t\rangle}, \hat w^{\langle t\rangle}_1)$ be the estimates of Sample-based EM2. Suppose $\hat\mu^{\langle 0\rangle} = \mu^{\langle 0\rangle}$, $\hat w^{\langle 0\rangle}_1 = w^{\langle 0\rangle}_1 = 0.5$, and $\langle \mu^{\langle 0\rangle}, \mu^*\rangle \neq 0$. Then we have
$$\limsup_{t\to\infty} \|\hat\mu^{\langle t\rangle} - \mu^{\langle t\rangle}\| \to 0 \quad\text{and}\quad \limsup_{t\to\infty} |\hat w^{\langle t\rangle}_1 - w^{\langle t\rangle}_1| \to 0 \quad\text{as } n \to \infty,$$
where the convergence is in probability.

The proof of this theorem uses the same approach as Xu et al. [2016] and is presented in Appendix D.
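The landscape statements in Theorem 3 and Remark 1 can also be examined numerically: in one dimension, the population log-likelihood of Model 2 is $L(\mu, w_1) = \mathbb{E}_{y\sim f}[\log(w_1\,\phi(y-\mu) + (1-w_1)\,\phi(y+\mu))]$, and its partial derivative in $w_1$ is again a one-dimensional expectation. The sketch below is a rough probe of that derivative along the escape direction Remark 1 appeals to; the values $\mu^* = 1$, $w^*_1 = 0.6$ and the probe points standing in for $\mu_{\mathrm{wrong}}$ are illustrative assumptions, not the paper's.

```python
import numpy as np

def gh_expect(g, mean, n_nodes=80):
    """E[g(Y)] for Y ~ N(mean, 1), via Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    return (w * g(np.sqrt(2.0) * x + mean)).sum() / np.sqrt(np.pi)

def true_expect(g, mu_star, w1_star):
    return w1_star * gh_expect(g, mu_star) + (1.0 - w1_star) * gh_expect(g, -mu_star)

def phi(z):
    """Standard normal density."""
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def pop_loglik(mu, w1, mu_star, w1_star):
    """Population log-likelihood of Model 2 at (mu, w1), in one dimension."""
    return true_expect(lambda y: np.log(w1 * phi(y - mu) + (1.0 - w1) * phi(y + mu)),
                       mu_star, w1_star)

def dloglik_dw1(mu, w1, mu_star, w1_star):
    """Partial derivative of the population log-likelihood in the w1 direction."""
    return true_expect(lambda y: (phi(y - mu) - phi(y + mu))
                                 / (w1 * phi(y - mu) + (1.0 - w1) * phi(y + mu)),
                       mu_star, w1_star)

# Remark 1 asserts a non-zero w1-gradient at (mu_wrong, w1*); the probe points
# below only illustrate how one would check this numerically for mu in (-mu*, 0).
mu_star, w1_star = 1.0, 0.6
for mu in (-0.6, -0.3, -0.1):
    print(mu, pop_loglik(mu, w1_star, mu_star, w1_star),
          dloglik_dw1(mu, w1_star, mu_star, w1_star))
```

Evaluating `pop_loglik` on a grid of $(\mu, w_1)$ values gives a crude version of the surface shown in the left panel of Figure 2, with the Model 1 landscape recovered as the slice $w_1 = w^*_1$.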
### 2.3 Roadmap of the proof for Theorem 2

Our first lemma, proved in Appendix B.1, confirms that if $\langle \mu^{\langle 0\rangle}, \mu^*\rangle > 0$, then $\langle \mu^{\langle t\rangle}, \mu^*\rangle > 0$ for every $t$ and $w^{\langle t\rangle}_1 \in (0.5, 1)$. In other words, the estimates of Population EM2 remain on the correct side of the hyperplane $\langle \mu, \mu^*\rangle = 0$, and the weight moves in the right direction, too.

Figure 1: Left panel: the shape of the iteration map $H(\mu; \mu^*, w^*_1)$ with $\mu^* = 1$ and different values of $w^*_1 \in \{0.9, 0.77, 0.7\}$. The green plus indicates the origin $(0, 0)$ and the black points indicate the correct values $(\mu^*, \mu^*)$ and $(-\mu^*, -\mu^*)$. We observe that as $w^*_1$ increases, the number of fixed points goes down from 3 to 2 and finally to 1. Further, when there is more than one fixed point, there is one stable incorrect fixed point in $(-\mu^*, 0)$. Right panel: the shape of the iteration map $G_w(\mu, w_1; \mu^*, w^*_1)$ with $\mu^* = 1$, $w^*_1 = 0.7$, and different values of $\mu \in \{0.3, 1, 2\}$. We observe that as $\mu$ increases, $G_w$ changes from a concave function to a concave-then-convex function. Further, there are at most three fixed points and there is only one stable fixed point.

Lemma 1. If $\langle \mu^{\langle 0\rangle}, \mu^*\rangle > 0$, then $\langle \mu^{\langle t\rangle}, \mu^*\rangle > 0$ and $w^{\langle t\rangle}_1 \in (0.5, 1)$ for all $t \geq 1$. Otherwise, if $\langle \mu^{\langle 0\rangle}, \mu^*\rangle < 0$, then $\langle \mu^{\langle t\rangle}, \mu^*\rangle < 0$ and $w^{\langle t\rangle}_1 \in (0, 0.5)$ for all $t \geq 1$.

On account of Lemma 1 and the invariance in (8), we can assume without loss of generality that $\langle \mu^{\langle t\rangle}, \mu^*\rangle > 0$ and $w^{\langle t\rangle}_1 \in (0.5, 1)$ for all $t \geq 1$. Let $d$ be the dimension of $\mu^*$. We reduce the $d > 1$ case to the $d = 1$ case. This is achieved by proving that the angle between the two vectors $\mu^{\langle t\rangle}$ and $\mu^*$ is a decreasing function of $t$ and converges to 0. The details appear in Appendix B.4. Hence, in the rest of this section we focus on the proof of Theorem 2 for $d = 1$.

Let $g_\mu(\mu, w_1)$ and $g_w(\mu, w_1)$ be shorthand for the two update functions $G_\mu$ and $G_w$ defined in (6) and (7) for a fixed $(\mu^*, w^*_1)$. To prove that $\{(\mu^{\langle t\rangle}, w^{\langle t\rangle}_1)\}$ converges to the fixed point $(\mu^\star, w^\star)$, we establish the following claims:

C.1 There exists a set $S = (a_\mu, b_\mu) \times (a_w, b_w) \subseteq \mathbb{R}^2$, where $a_\mu, b_\mu \in \mathbb{R} \cup \{\pm\infty\}$ and $a_w, b_w \in \mathbb{R}$, such that $S$ contains the point $(\mu^\star, w^\star)$ and $(g_\mu(\mu, w_1), g_w(\mu, w_1)) \in S$ for all $(\mu, w_1) \in S$. Further, $g_\mu(\mu, w_1)$ is a non-decreasing function of $\mu$ for any given $w_1 \in (a_w, b_w)$, and $g_w(\mu, w_1)$ is a non-decreasing function of $w_1$ for any given $\mu \in (a_\mu, b_\mu)$.

C.2 There is a reference curve $r : [a_w, b_w] \to [a_\mu, b_\mu]$ defined on $\bar S$ (the closure of $S$) such that:

C.2a $r$ is continuous, decreasing, and passes through the point $(\mu^\star, w^\star)$, i.e., $r(w^\star) = \mu^\star$.

C.2b Given $\mu \in (a_\mu, b_\mu)$, the function $w \mapsto g_w(\mu, w)$ has a stable fixed point in $[a_w, b_w]$. Further, any stable fixed point $w_s$ in $[a_w, b_w]$ or fixed point $w_s$ in $(a_w, b_w)$ satisfies the following: if $\mu < \mu^\star$ and $\mu \geq r(b_w)$, then $r^{-1}(\mu) > w_s > w^\star$; if $\mu = \mu^\star$, then $r^{-1}(\mu) = w_s = w^\star$; if $\mu > \mu^\star$ and $\mu \leq r(a_w)$, then $r^{-1}(\mu) < w_s < w^\star$.

C.2c Given $w \in [a_w, b_w]$, the function $\mu \mapsto g_\mu(\mu, w)$ has a stable fixed point in $[a_\mu, b_\mu]$. Further, any stable fixed point $\mu_s$ in $[a_\mu, b_\mu]$ or fixed point $\mu_s$ in $(a_\mu, b_\mu)$ satisfies the following: if $w < w^\star$, then $r(w) > \mu_s > \mu^\star$; if $w = w^\star$, then $r(w) = \mu_s = \mu^\star$; if $w > w^\star$, then $r(w) < \mu_s < \mu^\star$.
We explain C.1 and C.2 in the right panel of Figure 2. Heuristically, we expect $(\mu^*, w^*_1)$ to be the only fixed point of the mapping $(\mu, w) \mapsto (g_\mu(\mu, w), g_w(\mu, w))$ on $S$, and we expect $(\mu^{\langle t\rangle}, w^{\langle t\rangle}_1)$ to move toward this fixed point. Hence, we can prove convergence of the iterates by establishing certain geometric relationships between the curves of fixed points of the two functions. Claim C.1 confines the iterates to a region where such geometric relations hold, and the reference curve $r$ together with C.2 is the tool that lets us characterize these relations mathematically. Indeed, the next lemma shows that C.1 and C.2 are sufficient for convergence to the right point $(\mu^\star, w^\star)$.

Lemma 2 (proved in Appendix B.2.1). Suppose the continuous functions $g_\mu(\mu, w)$ and $g_w(\mu, w)$ satisfy C.1 and C.2. Then there exists a continuous mapping $m : \bar S \to [0, \infty)$ such that $(\mu^\star, w^\star)$ is the only solution of $m(\mu, w) = 0$ on $\bar S$, the closure of $S$. Further, if we initialize $(\mu^{\langle 0\rangle}, w^{\langle 0\rangle})$ in $S$, the sequence $\{(\mu^{\langle t\rangle}, w^{\langle t\rangle})\}_{t \geq 0}$ defined by $\mu^{\langle t+1\rangle} = g_\mu(\mu^{\langle t\rangle}, w^{\langle t\rangle})$ and $w^{\langle t+1\rangle} = g_w(\mu^{\langle t\rangle}, w^{\langle t\rangle})$ satisfies $m(\mu^{\langle t\rangle}, w^{\langle t\rangle}) \downarrow 0$, and therefore $(\mu^{\langle t\rangle}, w^{\langle t\rangle})$ converges to $(\mu^\star, w^\star)$.

In our problem, we set $a_w = 0.5$, $b_w = 1$, $a_\mu = 0$, $b_\mu = \infty$, and $(\mu^\star, w^\star) = (\mu^*, w^*_1)$. Then, according to Lemma 1 and the monotonicity of $g_\mu$ and $g_w$, C.1 is satisfied. To show C.2, we first define the reference curve $r$ by
$$r(w_1) = \frac{(2w^*_1 - 1)\,\mu^*}{2w_1 - 1}, \qquad \forall\, w_1 \in (0.5, 1], \quad w_2 = 1 - w_1. \qquad (9)$$
The claim C.2a holds by construction. To show C.2b, we establish an even stronger property of the weight update function $g_w(\mu, w)$: for any fixed $\mu > 0$, the function $w_1 \mapsto g_w(\mu, w_1)$ has at most one other fixed point besides $w_1 = 0$ and $w_1 = 1$, and, most importantly, it has exactly one stable fixed point. This is formalized in the following lemma.

Lemma 3 (proved in Appendix B.2.2). For all $\mu > 0$, there are at most three fixed points of $w_1 \mapsto g_w(\mu, w_1)$. Further, there exists a unique stable fixed point $F_w(\mu) \in (0, 1]$, i.e., (i) $F_w(\mu) = g_w(\mu, F_w(\mu))$, and (ii) for all $w_1 \in (0, 1)$ with $w_1 \neq F_w(\mu)$, the iteration moves toward $F_w(\mu)$: $g_w(\mu, w_1) > w_1$ if $w_1 < F_w(\mu)$, and $g_w(\mu, w_1) < w_1$ if $w_1 > F_w(\mu)$. (10)

We illustrate Lemma 3 in Figure 1. Note that, in the figure, $g_w$ is an increasing function with $g_w(\mu, 0) = 0$ and $g_w(\mu, 1) = 1$. Further, it is either a concave function or a concave-then-convex function.³ Hence, if $\partial g_w(\mu, w_1)/\partial w_1\,|_{w_1 = 1}$ is at most 1, the only stable fixed point is $w_1 = 1$; if this derivative is larger than 1, there exists exactly one fixed point in $(0, 1)$ and it is the only stable fixed point. The complete proof of C.2b is given in Appendix B.3.

³There exists $\bar w \in (0, 1)$ such that $g_w(\mu, w)$ is concave in $[0, \bar w]$ and convex in $[\bar w, 1]$.

The final step in applying Lemma 2 is to prove C.2c. However, $(\mu, w_1) = ((2w^*_1 - 1)\mu^*, 1)$ is a point on the reference curve $r$, and $\mu = (2w^*_1 - 1)\mu^*$ is a stable fixed point of $g_\mu(\,\cdot\,, 1)$. This violates C.2c. To address this issue, since we can characterize the shape and the number of fixed points of $g_w$, standard uniform continuity arguments give $\delta, \epsilon > 0$ such that the adjusted reference curve $r_{\mathrm{adj}}(w) := r(w) - \epsilon\,\max(0, w - 1 + \delta)$ satisfies C.2a and C.2b. Then we can prove that the adjusted reference curve $r_{\mathrm{adj}}(w)$ satisfies C.2c; see Appendix B.3.1.
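Lemma 3 is also easy to probe in one dimension. The sketch below is an informal check, again using Gauss-Hermite quadrature; the values $\mu^* = 1$, $w^*_1 = 0.7$ and $\mu \in \{0.3, 1, 2\}$ mirror the right panel of Figure 1. It estimates the slope of $w_1 \mapsto g_w(\mu, w_1)$ at $w_1 = 1$ by a finite difference and locates the stable fixed point $F_w(\mu)$ by simply iterating the map.

```python
import numpy as np

def gh_expect(g, mean, n_nodes=80):
    """E[g(Y)] for Y ~ N(mean, 1), via Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    return (w * g(np.sqrt(2.0) * x + mean)).sum() / np.sqrt(np.pi)

def g_w(mu, w1, mu_star, w1_star):
    """Weight update g_w(mu, w1) of eq. (6) in one dimension (mu*, w1* fixed)."""
    gamma = lambda y: (w1 * np.exp(y * mu)
                       / (w1 * np.exp(y * mu) + (1.0 - w1) * np.exp(-y * mu)))
    return w1_star * gh_expect(gamma, mu_star) + (1.0 - w1_star) * gh_expect(gamma, -mu_star)

def stable_fixed_point(mu, mu_star, w1_star, iters=2000):
    """Approximate F_w(mu) of Lemma 3 by iterating the monotone map from w1 = 0.5."""
    w1 = 0.5
    for _ in range(iters):
        w1 = g_w(mu, w1, mu_star, w1_star)
    return w1

mu_star, w1_star = 1.0, 0.7
eps = 1e-5
for mu in (0.3, 1.0, 2.0):
    # Per the discussion above, the slope of w1 -> g_w(mu, w1) at w1 = 1 decides
    # whether the stable fixed point sits at w1 = 1 or strictly inside (0, 1).
    slope_at_1 = (g_w(mu, 1.0, mu_star, w1_star)
                  - g_w(mu, 1.0 - eps, mu_star, w1_star)) / eps
    print(mu, slope_at_1, stable_fixed_point(mu, mu_star, w1_star))
```

Plotting $g_w(\mu, \cdot)$ over a grid of $w_1$ values for the same three choices of $\mu$ reproduces the concave versus concave-then-convex shapes described in the caption of Figure 1.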
## 3 Numerical results

In this section, we present numerical results that show the value of over-parameterization in some mixture models not covered by our theoretical results. Our goal is to analyze the effect of the sample size, mixing weights, and number of mixture components on the success of the two EM algorithms described in Section 2.1.

We implement EM for both Model 1 (where the weights are assumed to be known) and Model 2 (where the weights are not known), and run each algorithm multiple times with random initial mean estimates. We compare the two versions of EM by their (empirical) success probabilities, which we denote by $P_1$ and $P_2$, respectively. Success is defined in two ways, depending on whether EM is run with a finite sample or with an infinite-size sample (i.e., the population analogue of EM).

Figure 2: Left panel: the landscapes of the log-likelihood objectives for Population EM1 and Population EM2 with $(\mu^*, w^*_1) = (1, 0.4)$ are shown as the black belt and the yellow surface, respectively. The two green points indicate the two global maxima of Population EM2, one of which is also the global maximum of Population EM1. The purple point indicates the local maximum of Population EM1. Over-parameterization helps us escape the local maximum through the direction of $w_1$. Right panel: the fixed-point curves of the functions $g_\mu$ and $g_w$ are shown as red and blue lines, respectively. The green point at the intersection of the three curves is the correct convergence point $(\mu^\star, w^\star)$. The black dotted curve shows the reference curve $r$. The cross points are possible initializations and the plus points are the corresponding positions after the first iteration. By the geometric relations between the three curves, the iterations have to converge to $(\mu^\star, w^\star)$.

When EM is run using a finite sample, we do not expect to recover the $\mu^*_i \in \mathbb{R}^d$ exactly. Hence, success is declared when the $\mu^*_i$ are recovered up to some expected error, according to the following measure:
$$\mathrm{error} = \min_{\pi \in \mathcal{P}_k} \sum_{i=1}^{k} w^*_i\, \|\hat\mu_{\pi(i)} - \mu^*_i\|^2, \qquad (11)$$
where $\mathcal{P}_k$ is the set of all possible permutations on $\{1, \ldots, k\}$. We declare success if the error is at most $C/n$, where $C := 4\,\mathrm{Tr}(W I^{-1}(\theta^*))$. Here, $W$ is the diagonal matrix whose diagonal is $(w^*_1, \ldots, w^*_1, \ldots, w^*_k, \ldots, w^*_k) \in \mathbb{R}^{kd}$, where each $w^*_i$ is repeated $d$ times, and $I(\theta^*)$ is the Fisher information at the true value $\theta^* := (\mu^*_1, \ldots, \mu^*_k)$. We adopt this criterion since it is well known that the MLE asymptotically converges to $\mathcal{N}(\theta^*, I^{-1}(\theta^*)/n)$; the constant $4 \approx 1.96^2$ thus corresponds to approximately 95% coverage. When EM is run using an infinite-size sample, we declare EM successful when the error defined in (11) is at most $10^{-7}$.
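For concreteness, here is a small sketch of the success criterion: a permutation-minimized, weight-weighted squared error in the spirit of (11). It is not the authors' evaluation code; the example means below are the Case 1 centers from Section 3.3, while the weight vector and the estimate are made-up values for illustration, and the threshold $C/n$ (which requires the Fisher information $I(\theta^*)$) is left to the caller.

```python
import itertools
import numpy as np

def recovery_error(mu_hat, mu_star, w_star):
    """Weight-weighted squared error between estimated and true component means,
    minimized over all assignments of estimates to true components (cf. (11))."""
    k = len(w_star)
    best = np.inf
    for perm in itertools.permutations(range(k)):
        err = sum(w_star[i] * np.sum((mu_hat[perm[i]] - mu_star[i]) ** 2)
                  for i in range(k))
        best = min(best, err)
    return best

# True means of Case 1 in Section 3.3; the weights and the estimate are made up.
mu_star = np.array([[-3.0, 0.0], [0.0, 0.0], [2.0, 0.0]])
w_star = np.array([0.5, 0.3, 0.2])
mu_hat = np.array([[0.1, -0.05], [2.02, 0.0], [-2.9, 0.1]])   # note: permuted order

err = recovery_error(mu_hat, mu_star, w_star)
# A finite-sample run would be declared a success when err <= C / n with
# C = 4 * Tr(W I(theta*)^{-1}); computing I(theta*) is omitted here.
print(err)
```

Brute-forcing the $k!$ permutations is fine for the small $k$ considered here; for larger $k$ one would switch to a linear assignment solver.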
### 3.2 Mixtures of two Gaussians

We first consider mixtures of two Gaussians in one dimension, i.e., $\mu^*_1, \mu^*_2 \in \mathbb{R}$. Unlike in our theoretical analysis, the mixture components are not constrained to be symmetric about the origin. For simplicity, we always let $\mu^*_1 = 0$, but this information is not used by EM. Further, we consider sample size $n \in \{1000, \infty\}$, separation $|\mu^*_2 - \mu^*_1| \in \{1, 2, 4\}$, and mixing weight $w^*_1 \in \{0.52, 0.7, 0.9\}$; this gives a total of 18 cases. For each case, we run EM with 2500 random initializations and compute the empirical probability of success. When $n = 1000$, the initial mean parameters are chosen uniformly at random from the sample. When $n = \infty$, the initial mean parameters are chosen uniformly at random from the rectangle $[-2, \mu^*_2 + 2] \times [-2, \mu^*_2 + 2]$. A subset of the success probabilities is shown in Table 1; see Appendix F for the full set of results.

Our simulations lead to the following empirical findings about the behavior of EM on data from well-separated mixtures ($|\mu^*_2 - \mu^*_1| > 1$). First, for $n = \infty$, EM for Model 2 finds the MLE almost always ($P_2 = 1$), while EM for Model 1 only succeeds about half the time ($P_1 \approx 0.5$). Second, for smaller $n$, EM for Model 2 still has a higher chance of success than EM for Model 1, except when the weights $w^*_1, w^*_2$ are almost equal. When $w^*_1 \approx w^*_2 \approx 1/2$, the bias in Model 1 is not big enough to stand out from the error due to the finite sample, and hence Model 1 is preferable. Notably, unlike in the special model in (2), highly unbalanced weights do not help EM for Model 1, due to the lack of symmetry of the component means (i.e., we may have $\mu^*_1 \neq -\mu^*_2$). We conclude that over-parameterization helps EM if the two mixture components are well separated and the mixing weights are not too close.

Table 1: Success probabilities for EM on Model 1 and Model 2 (denoted $P_1$ and $P_2$, respectively), reported as $P_1$ / $P_2$. Top: mixtures of two Gaussians with separation $|\mu^*_2 - \mu^*_1| = 2$ (Section 3.2). Bottom: mixtures of three or four Gaussians (Section 3.3).

| Two Gaussians | $w^*_1 = 0.52$ | $w^*_1 = 0.7$ | $w^*_1 = 0.9$ |
|---|---|---|---|
| $n = 1000$ | 0.799 / 0.500 | 0.497 / 0.800 | 0.499 / 0.899 |
| $n = \infty$ | 0.504 / 1.000 | 0.514 / 1.000 | 0.506 / 1.000 |

| Three or four Gaussians | Case 1 | Case 2 | Case 3 | Case 4 |
|---|---|---|---|---|
| $n = 2000$ | 0.164 / 0.900 | 0.167 / 1.000 | 0.145 / 0.956 | 0.159 / 0.861 |

### 3.3 Mixtures of three or four Gaussians

We now consider a setup with mixtures of three or four Gaussians. Specifically, we consider the following four cases, each using a larger sample size of $n = 2000$:

Case 1, mixture of three Gaussians on a line: $\mu^*_1 = (-3, 0)$, $\mu^*_2 = (0, 0)$, $\mu^*_3 = (2, 0)$, with mixing weights $(w^*_1, w^*_2, w^*_3)$, where $w^*_3 = 0.2$.

Case 2, mixture of three Gaussians on a triangle: $\mu^*_1 = (-3, 0)$, $\mu^*_2 = (0, 2)$, $\mu^*_3 = (2, 0)$, with mixing weights $(w^*_1, w^*_2, w^*_3)$, where $w^*_3 = 0.2$.

Case 3, mixture of four Gaussians on a line: $\mu^*_1 = (-3, 0)$, $\mu^*_2 = (0, 0)$, $\mu^*_3 = (2, 0)$, $\mu^*_4 = (5, 0)$, with mixing weights $(w^*_1, \ldots, w^*_4)$, where $w^*_1 = 0.35$ and $w^*_4 = 0.15$.

Case 4, mixture of four Gaussians on a trapezoid: $\mu^*_1 = (-3, 0)$, $\mu^*_2 = (-1, 2)$, $\mu^*_3 = (2, 0)$, $\mu^*_4 = (2, 2)$, with mixing weights $(w^*_1, \ldots, w^*_4)$, where $w^*_1 = 0.35$.

The other aspects of the simulations are the same as in the previous subsection. The results are presented in Table 1. From the table, we confirm that EM for Model 2 (with unknown weights) has a higher success probability than EM for Model 1 (with known weights). Therefore, over-parameterization helps in all four cases.

### 3.4 Explaining the disparity

As discussed in Section 2.2, the performance of the EM algorithm with unknown weights does not depend on the ordering of the initial means. We conjecture that, in general, this property, which is a consequence of over-parameterization, is what leads to the boost observed in the performance of EM with unknown weights. We support this conjecture by revisiting the previous simulations with a different way of running EM for Model 1. For each set of $k$ vectors selected to be used as initial component means, we run EM $k!$ times, each using a different one-to-one assignment of these vectors to initial component means. We measure the empirical success probability $P_3$ based on the lowest observed error among these $k!$ runs of EM. The results are presented in Table 3 in Appendix F. In general, we observe $P_3 \gtrsim P_2$ for all cases we have studied, which supports our conjecture. However, this procedure is generally more time-consuming than EM for Model 2, since $k!$ executions of EM are required.

## Acknowledgements

DH and JX were partially supported by NSF awards DMREF-1534910 and CCF-1740833, and JX was also partially supported by a Cheung-Kong Graduate School of Business Fellowship. We thank Jiantao Jiao for a helpful discussion about this problem.

## References

Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. Annals of Statistics, 45(1):77–120, 2017.
Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.

Emmanuel J. Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.

Stéphane Chrétien and Alfred O. Hero. On EM algorithms and their proximal generalizations. ESAIM: Probability and Statistics, 12:308–326, 2008.

Constantinos Daskalakis, Christos Tzamos, and Manolis Zampetakis. Ten steps of EM suffice for mixtures of two Gaussians. In Proceedings of the 2017 Conference on Learning Theory, pages 704–710, 2017.

Alexandre d'Aspremont, Laurent El Ghaoui, Michael I. Jordan, and Gert R. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. In Advances in Neural Information Processing Systems, pages 41–48, 2005.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.

David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

Simon Du and Jason Lee. On the power of over-parametrization in neural networks with quadratic activation. In Proceedings of the 35th International Conference on Machine Learning, pages 1329–1338, 2018.

Benjamin D. Haeffele and René Vidal. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540, 2015.

Chi Jin, Yuchen Zhang, Sivaraman Balakrishnan, Martin J. Wainwright, and Michael I. Jordan. Local maxima in the likelihood of Gaussian mixture models: Structural results and algorithmic consequences. In Advances in Neural Information Processing Systems, pages 4116–4124, 2016.

J. M. Klusowski and W. D. Brinda. Statistical guarantees for estimating the centers of a two-component Gaussian mixture by EM. arXiv e-prints, August 2016.

V. Koltchinskii. Oracle inequalities in empirical risk minimization and sparse recovery problems. In École d'été de probabilités de Saint-Flour XXXVIII, 2011.

Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pages 855–863, 2014.

Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 2603–2612, 2017.

Quynh Nguyen and Matthias Hein. Optimization landscape and expressivity of deep CNNs. In Proceedings of the 35th International Conference on Machine Learning, pages 3730–3739, 2018.

R. A. Redner and H. F. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195–239, 1984.

Itay Safran and Ohad Shamir. Spurious local minima are common in two-layer ReLU neural networks. arXiv preprint arXiv:1712.08968, 2017.

Mohammadreza Soltani and Chinmay Hegde. Towards provable learning of polynomial neural networks using low-rank matrix estimation. In International Conference on Artificial Intelligence and Statistics, pages 1417–1426, 2018.

Paul Tseng. An analysis of the EM algorithm and entropy-like proximal point methods. Mathematics of Operations Research, 29(1):27–44, 2004.

C. F. Jeff Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95–103, 1983.

Ji Xu, Daniel J. Hsu, and Arian Maleki. Global analysis of expectation maximization for mixtures of two Gaussians. In Advances in Neural Information Processing Systems, pages 2676–2684, 2016.
L. Xu and M. I. Jordan. On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation, 8:129–151, 1996.

Bowei Yan, Mingzhang Yin, and Purnamrita Sarkar. Convergence of gradient EM on multi-component mixture of Gaussians. In Advances in Neural Information Processing Systems, pages 6956–6966, 2017.