# Clustering Semi-Random Mixtures of Gaussians

Pranjal Awasthi¹   Aravindan Vijayaraghavan²

*Equal contribution. ¹Department of Computer Science, Rutgers University, USA. ²EECS Department, Northwestern University, USA. Correspondence to: Pranjal Awasthi , Aravindan Vijayaraghavan .

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Gaussian mixture models (GMM) are the most widely used statistical model for the k-means clustering problem and form a popular framework for clustering in machine learning and data analysis. In this paper, we propose a natural robust model for k-means clustering that generalizes the Gaussian mixture model, and that we believe will be useful in identifying robust algorithms. Our first contribution is a polynomial-time algorithm that provably recovers the ground truth up to small classification error w.h.p., assuming certain separation between the components. Perhaps surprisingly, the algorithm we analyze is the popular Lloyd's algorithm for k-means clustering, which is the method of choice in practice. Our second result complements the upper bound by giving a nearly matching lower bound on the number of misclassified points incurred by any k-means clustering algorithm on the semi-random model.

## 1. Introduction

Clustering is a ubiquitous task in machine learning and data mining for partitioning a data set into groups of similar points. The k-means clustering problem is arguably the most well-studied problem in machine learning. However, designing provably optimal k-means clustering algorithms is a challenging task, as the k-means clustering objective is NP-hard to optimize (Williamson & Shmoys, 2011) (in fact, it is also NP-hard to find near-optimal solutions (Awasthi et al., 2015; Lee et al., 2017)). A popular approach to cope with this intractability is to study average-case models for the k-means problem. The most widely used such statistical model for clustering is the Gaussian Mixture Model (GMM), which has a long and rich history (Teicher, 1961; Pearson, 1894; Dasgupta, 1999; Arora & Kannan, 2001; Vempala & Wang, 2004; Dasgupta & Schulman, 2007). In this model there are $k$ clusters, and the points from cluster $i$ are generated from a Gaussian in $d$ dimensions with mean $\mu_i \in \mathbb{R}^d$ and covariance matrix $\Sigma_i \in \mathbb{R}^{d \times d}$ with spectral norm $\|\Sigma_i\| \le \sigma^2$. Each of the $N$ points in the instance is generated independently at random, and is drawn from the $i$th component with probability $w_i \in [0,1]$ ($w_1, w_2, \ldots, w_k$ are also called mixing weights). If the means of the underlying Gaussians are separated enough, the ground-truth clustering is well defined¹. The algorithmic task is to recover the ground-truth clustering for any data set generated from such a model (note that the parameters of the Gaussians, the mixing weights and the cluster memberships of the points are unknown).

Starting from the seminal work of Dasgupta (Dasgupta, 1999), there have been a variety of algorithms to provably cluster data from a GMM model. Algorithms based on PCA and distance-based clustering (Arora & Kannan, 2001; Vempala & Wang, 2004; Achlioptas & McSherry, 2005; Kannan et al., 2008) provably recover the clustering when there is adequate separation between every pair of components (parameters)².

¹ A separation of $\|\mu_i - \mu_j\|_2 \ge \Omega(\sigma\sqrt{\log(Nk)})$ for all $i \ne j \in [k]$ suffices w.h.p.

² The best known guarantees along these lines for non-spherical Gaussians (Kumar & Kannan, 2010; Awasthi & Sheffet, 2012) require a separation of order $\sigma\sqrt{k\log N}$ between any pair of means, where $\sigma^2$ is the maximum variance among all clusters along any direction.
Other algorithmic approaches include the method of moments (Moitra & Valiant, 2010; Belkin & Sinha, 2010) and algebraic methods based on tensor decompositions (Hsu & Kakade, 2012). (Please see Section 1 of the Supplementary material for a more detailed comparison with related work such as (Tang & Monteleoni, 2016; Dutta et al., 2017) and their guarantees.) On the other hand, the methods of choice in practice are iterative algorithms like Lloyd's algorithm (also called the k-means algorithm) (Lloyd, 1982) and the k-means++ algorithm (Lloyd's algorithm initialized with centers from distance-based $D^2$-sampling). $D^2$-sampling-based initialization schemes were first theoretically analyzed in (Ostrovsky et al., 2006; Arthur & Vassilvitskii, 2007). In the absence of good worst-case guarantees, a compelling direction is to use beyond-worst-case paradigms like average-case analysis to provide provable guarantees. Polynomial-time guarantees for recovering the k-means optimal clustering by Lloyd's algorithm and k-means++ are known when the points are drawn from a GMM model under sufficient separation conditions (Ostrovsky et al., 2006; Dasgupta & Schulman, 2007; Kumar & Kannan, 2010; Awasthi & Sheffet, 2012).

Although the study of Gaussian mixture models has been very fruitful in designing a variety of efficient algorithms, real-world data rarely satisfies such strong distributional assumptions. Hence, our choice of algorithm should be informed not only by its computational efficiency but also by its robustness to errors and model misspecification. As a first step, we need theoretical frameworks that can distinguish between algorithms that are tailored towards a specific probabilistic model and algorithms that are robust to modeling assumptions. In this paper we initiate such a study in the context of clustering, by studying a natural robust generalization of the GMM model that we call the semi-random model.

Semi-random models involve a set of adversarial choices, in addition to the random choices of the probabilistic model, while generating the instance. These models have been successfully applied to study the design of robust algorithms for various optimization problems (Blum & Spencer, 1995; Feige & Kilian, 1998; Makarychev et al., 2012) (see Section 1 of the Supplementary material). In a typical semi-random model, there is a planted or ground-truth solution, and an instance is first generated according to a simple probabilistic model. An adversary is then allowed to make monotone or helpful changes to the instance that only make the planted solution more pronounced. For instance, in the semi-random model of Feige and Kilian (1998) for graph partitioning, the adversary is allowed to arbitrarily add extra edges within each cluster or delete edges between different clusters of the planted partitioning. These adversarial choices only make the planted partition more prominent; however, the choices can be dependent and thwart algorithms that rely on excessive independence or on strong but unrealistic structural properties of these instances. The study of semi-random models helps us understand and identify robust algorithms.
Our motivation for studying semi-random models for clustering is two-fold: (a) to design algorithms that are robust to strong distributional assumptions on the data, and (b) to explain the empirical success of simple heuristics like Lloyd's algorithm.

**Semi-random mixtures of Gaussians.** In an ideal clustering instance, each point $x$ in the $i$th cluster is significantly closer to the mean $\mu_i$ than to any other mean $\mu_j$ for $j \ne i$ (for a general instance, in the optimal solution, $\|x - \mu_i\|^2 - \|x - \mu_j\|^2 \le 0$ for all $j \ne i$). Moving each point in $C_i$ toward its own mean $\mu_i$ only increases this gap between the distance to its mean and the distance to any other mean. Hence, this perturbation corresponds to a monotone perturbation that only makes the planted clustering even better. In our semi-random model, the points are first drawn from a mixture of Gaussians (this gives the planted clustering). The adversary is then allowed to move each point in the $i$th cluster closer to its mean $\mu_i$. This allows the points to be even better clustered around their respective means; however, these perturbations are allowed to have arbitrary dependencies. We now formally define the model.

**Definition 1.1 (Semi-random GMM model).** Given parameters $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^d$ and $\sigma \in \mathbb{R}_+$, a clustering instance $X$ on $N$ points is generated as follows.

1. The adversary chooses an arbitrary partition $C = (C_1, \ldots, C_k)$ of $\{1, \ldots, N\}$; let $N_i = |C_i|$ for all $i \in [k]$.
2. For each $i \in [k]$ and each $t \in C_i$, $y^{(t)} \in \mathbb{R}^d$ is generated independently at random according to a Gaussian with mean $\mu_i$ and covariance $\Sigma_i$ with $\|\Sigma_i\| \le \sigma^2$, i.e., variance at most $\sigma^2$ in each direction.
3. The adversary then moves each point $y^{(t)}$ towards the mean of its component by an arbitrary amount, i.e., for each $i \in [k]$ and $t \in C_i$, the adversary picks $x^{(t)}$ arbitrarily in $\{\mu_i + \lambda(y^{(t)} - \mu_i) : \lambda \in [0,1]\}$ (these choices can be correlated arbitrarily).

The instance is $X = \{x^{(t)} : t \in [N]\}$ and is parameterized by $(\mu_1, \ldots, \mu_k, \sigma)$ with the planted clustering $C_1, \ldots, C_k$. We will denote $w_{\min} = \min_{i \in [k]} N_i/N$.

It is necessary that each point is moved closer to its mean along the direction of the mean; otherwise, one can move points closer to their own mean, but in other directions, in such a way that the optimal k-means clustering of the perturbed instance is very different from the planted clustering. This is especially true in the separation range of interest (when $k \ll d$), where the inter-mean distance is smaller than the average radius of the clusters (see Supplementary material for details).

Data generated by mixtures of high-dimensional Gaussians has certain properties that are often not exhibited by real-world instances. High-dimensional Gaussians have strong concentration properties; for example, all the points generated from a high-dimensional Gaussian are concentrated at a reasonably far distance from the mean (they are $\approx \sigma\sqrt{d}$ far away w.h.p.). In many real-world datasets, on the other hand, clusters in the ground truth often contain dense cores that are close to the mean. Our semi-random model admits such instances by allowing points in a cluster to move arbitrarily close to the mean.
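To make the generative process of Definition 1.1 concrete, here is a minimal NumPy sketch (ours, not from the paper); the helper names `semi_random_gmm` and `dense_core` are our own, and the shrink factors λ returned by `shrink_fn` stand in for the adversary's arbitrary, possibly correlated choices in step 3.

```python
import numpy as np

def semi_random_gmm(means, sigma, cluster_sizes, shrink_fn=None, seed=0):
    """Generate a semi-random GMM instance in the spirit of Definition 1.1.

    means:         (k, d) array of component means mu_1, ..., mu_k.
    sigma:         standard deviation (here spherical covariance sigma^2 I).
    cluster_sizes: the adversary's choice of N_i for each cluster.
    shrink_fn:     maps the centered draws of cluster i to factors lam in [0, 1];
                   lam = 1 leaves a point untouched, lam = 0 moves it onto mu_i.
    """
    rng = np.random.default_rng(seed)
    X, labels = [], []
    for i, n_i in enumerate(cluster_sizes):
        y = means[i] + sigma * rng.standard_normal((n_i, means.shape[1]))  # step 2
        r = y - means[i]
        lam = np.ones(n_i) if shrink_fn is None else np.clip(shrink_fn(i, r), 0.0, 1.0)
        X.append(means[i] + lam[:, None] * r)                              # step 3
        labels.append(np.full(n_i, i))
    return np.vstack(X), np.concatenate(labels)

# Example adversary: pull the half of each cluster closest to mu_i all the way
# onto mu_i, producing the "dense core" behaviour discussed above.
def dense_core(i, r):
    dists = np.linalg.norm(r, axis=1)
    return (dists > np.median(dists)).astype(float)

means = np.vstack([np.zeros(50), 10.0 * np.ones(50)])   # k = 2, d = 50, well separated
X, z = semi_random_gmm(means, sigma=1.0, cluster_sizes=[500, 500], shrink_fn=dense_core)
```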
**Our Results.** Our first result studies Lloyd's algorithm on the semi-random GMM model and gives an upper bound on the clustering error achieved by Lloyd's algorithm with the initialization procedure used in (Kumar & Kannan, 2010).

**Informal Theorem 1.2.** Consider any semi-random instance $X$ with $N$ points generated by the semi-random GMM model (Def. 1.1) with planted clustering $C_1, \ldots, C_k$ and parameters $\mu_1, \ldots, \mu_k, \sigma^2$ satisfying $\|\mu_i - \mu_j\|_2 > \Delta\sigma$ for all $i \ne j \in [k]$, where $\Delta \ge c_0\sqrt{\min\{k,d\}\log N}$ and $N \ge k^2 d^2/w_{\min}^2$. There is a polynomial-time algorithm based on the Lloyd's iterative algorithm that recovers the cluster memberships of all but $O(kd/\Delta^4)$ points.

The $O(\cdot)$ in the above statement hides a $\log(\log N/\Delta^4)$ factor and a $\log(d/\Delta^2)$ factor. Please see Theorem 3.1 for a formal statement. Furthermore, the initialization procedure of Kumar and Kannan (2010) can be replaced by a simpler procedure based on the popular k-means++ algorithm (Ostrovsky et al., 2006; Arthur & Vassilvitskii, 2007). On the other hand, certain other algorithmic techniques like moment-based methods, distance-based clustering and tensor decompositions seem less robust to semi-random perturbations (please see the related work in Section 1 of the Supplementary material for more details).

The work most closely related to ours is that of (Kumar & Kannan, 2010) and (Awasthi & Sheffet, 2012), who provided deterministic data conditions under which Lloyd's algorithm converges to the optimal clustering. Along these lines, our work provides further theoretical justification for the enormous empirical success that Lloyd's algorithm enjoys. It is also worth noting that, in spite of being robust to semi-random perturbations, the separation requirement of $\sigma\sqrt{k\log N}$ in our upper bound matches the separation requirement in the best guarantees (Awasthi & Sheffet, 2012) for Lloyd's algorithm even in the absence of any semi-random errors or perturbations³. We also remark that while the algorithm recovers a clustering of the given data that is very close to the planted clustering, this does not necessarily estimate the means of the original Gaussian components up to inverse polynomial accuracy (in fact, the centers of the planted clustering after the semi-random perturbation may be $\Omega(\sigma)$ far from the original means). This differs from the recent body of work on parameter estimation in the presence of some adversarial noise (please see Section 1 of the Supplementary material for a comparison).

³ We note that for clustering GMMs, the work of Brubaker and Vempala (2008) gives a qualitatively different separation condition that does not depend on the maximum variance. However, this separation condition is incomparable to (Awasthi & Sheffet, 2012), because of the potentially worse dependence on k.

While the monotone changes allowed in the semi-random model should only make the clustering task easier, our next result shows that the error achieved by Lloyd's algorithm is in fact near optimal. More specifically, we provide a lower bound on the number of points that will be misclassified by any k-means optimal solution for the instance.

**Informal Theorem 1.3.** Given any $N$ (that is a sufficiently large polynomial in $d, k$) and $\Delta$ such that $\Delta^4\log N \le d/(4\log d)$, there exists an instance $X$ on $N$ points in $d$ dimensions generated from the semi-random GMM model (Definition 1.1) with parameters $\mu_1, \ldots, \mu_k, \sigma^2$ and planted clustering $C_1, \ldots, C_k$ having separation $\|\mu_i - \mu_j\|_2 \ge \Delta\sigma$ for all $i \ne j \in [k]$, such that any optimal k-means clustering solution $C'_1, C'_2, \ldots, C'_k$ of $X$ misclassifies at least $\Omega(kd/\Delta^4)$ points with high probability.

The above lower bound also holds when the semi-random perturbations are applied to points generated from a mixture of $k$ spherical Gaussians, each with covariance $\sigma^2 I$ and weight $1/k$. Further, the lower bound holds not just for the optimal k-means solution, but for any locally optimal clustering solution. See Theorem 4.1 for a formal statement.
These two results together show that Lloyd's algorithm essentially recovers the planted clustering up to the optimal error possible for any k-means clustering based algorithm. Unlike algorithmic results for other semi-random models, an appealing aspect of our algorithmic result is that it gives provable robust guarantees in the semi-random model for a simple, popular algorithm that is used in practice (Lloyd's algorithm). Further, other approaches for clustering like distance-based clustering, the method of moments and tensor decompositions seem inherently non-robust to these semi-random perturbations (see Section 1 of the Supplementary material for details). This robustness of Lloyd's algorithm suggests an explanation for its widely documented empirical success across different application domains.

**Challenges and Overview of Techniques.** Lloyd's algorithm has been analyzed in the context of clustering mixtures of Gaussians (Kumar & Kannan, 2010; Awasthi & Sheffet, 2012). Any variant of Lloyd's algorithm consists of two steps: an initialization stage where a set of $k$ initial centers is computed, and the iterative algorithm which successively improves the clustering in each step. Kumar and Kannan (2010) considered a variant of the Lloyd's method where the initialization is given by using PCA along with an $O(1)$-factor approximation to the k-means optimization problem. The improved analysis of this algorithm in (Awasthi & Sheffet, 2012) leads to state-of-the-art results that perfectly recover all the clusters under a separation of order $\sqrt{k\log N}\,\sigma$. We analyze the variant of Lloyd's algorithm that was introduced by Kumar and Kannan (2010); further, our analysis also extends to a simpler initialization procedure based on the popular k-means++ algorithm.

However, there are several challenges in extending the analysis of (Awasthi & Sheffet, 2012) to the semi-random setting. While the semi-random perturbations in the model only move points in a cluster $C_i$ closer to the mean $\mu_i$, these perturbations can be coordinated in a way that moves the empirical mean of the cluster significantly. For instance, Lemma 4.3 gives a simple semi-random perturbation to the points in $C_i$ that moves the empirical mean of the points in $C_i$ to a point $\tilde\mu_i$ with $\tilde\mu_i \approx \mu_i + \Omega(\sigma)\hat{e}$, for any desired direction $\hat{e}$. This shift in the empirical means may now cause some of the points in cluster $C_i$ to become closer to $\mu_j$ (in particular, points that have a relatively large projection onto $\hat{e}$), and vice versa. In fact, the lower bound instance in Theorem 4.1 is constructed by applying such a semi-random perturbation given by Lemma 4.3 to the points in a cluster, along a carefully picked direction, so that $m = \Omega(d/\Delta^4)$ points are misclassified per cluster.

The empirical success of Lloyd's algorithm on real-world data is widely documented. The main contribution of the paper is a robust theoretical analysis of the Lloyd's iterative algorithm when the points come from the semi-random GMM model. The key is to understand the number of points that can be misclassified in an intermediate step of the Lloyd's iteration. We show in Lemma 3.3 that if, in the current iteration of Lloyd's algorithm, each of the current estimates of the means $\mu'_i$ is within $\tau\sigma$ of $\mu_i$, then the number of points misclassified by the current Lloyd's iteration is at most $O(kd\tau^2/\Delta^4)$. To analyze this, we need to understand the projection of points onto the direction of the line joining the centers.
While the average projection is small, i.e., $\widetilde{O}(1)$, there are certain directions where it is as large as $\Omega(\sqrt{d})$! However, we can still prove an upper bound (in Lemma 3.5) on the number of points $x$ in a cluster $C_i$ such that $(x - \mu_i)$ has a large inner product along any (potentially bad) direction $\hat{e}$. The effect of these bad points has to be carefully accounted for when analyzing both stages of the algorithm: the initialization phase, and the iterative algorithm (in Proposition 3.2 and Lemma 3.4).

**Some Related Work.** There has been a long line of algorithmic results on Gaussian mixture models starting from (Teicher, 1961; 1967; Pearson, 1894). Considering the vast and rich literature on mixtures of Gaussians, we defer much of the comparison to the related work section in the Supplementary material (Section 1). Here we mention and compare to some recent related work on robust parameter estimation for Gaussians and related models. A recent exciting line of work concerns designing robust high-dimensional estimators of the mean and covariance of Gaussians (and related models) when a small $\varepsilon$ fraction of the points are adversarially corrupted (Brubaker, 2009; Diakonikolas et al., 2016; Lai et al., 2016; Charikar et al., 2017). However, this model and these results are incomparable to our semi-random model: for example, they typically assume that only a $o(1/k)$ fraction of the points are corrupted, while potentially all the points could be perturbed in our semi-random model; on the other hand, our work does not handle arbitrary outliers. Please see Section 1 of the Supplementary material for additional related work.

## 2. Preliminaries

Clustering data from a mixture of Gaussians is a natural average-case model for the k-means clustering problem. Specifically, if the means of a Gaussian mixture model are well separated, then with high probability, the ground-truth clustering of an instance sampled from the model corresponds to the k-means optimal clustering.

**Definition 2.1 (k-means clustering).** Given an instance $X = \{x^{(1)}, \ldots, x^{(N)}\}$ of $N$ points in $\mathbb{R}^d$, the k-means problem is to find $k$ points $\mu_1, \ldots, \mu_k$ so as to minimize $\sum_{t \in [N]} \min_{i \in [k]} \|x^{(t)} - \mu_i\|^2$.

The optimal means or centers $\mu_1, \ldots, \mu_k$ naturally define a clustering of the data where each point is assigned to its closest center. A key property of the k-means objective is that the optimal solution induces a locally optimal clustering.

**Definition 2.2 (Locally Optimal Clustering).** A clustering $C_1, \ldots, C_k$ of $N$ data points in $\mathbb{R}^d$ is locally optimal if for each $i \in [k]$, each $x^{(t)} \in C_i$, and each $j \ne i$ we have that $\|x^{(t)} - \mu(C_i)\| \le \|x^{(t)} - \mu(C_j)\|$. Here $\mu(C_i)$ is the average of the points in $C_i$.

Hence, given the optimal k-means clustering, the optimal centers can be recovered by simply computing the average of each cluster. This is the underlying principle behind the popular Lloyd's algorithm (Lloyd, 1982) for k-means clustering. The algorithm starts with a choice of initial centers. It then repeatedly computes new centers to be the averages of the clusters induced by the current centers. Hence the algorithm converges to a locally optimal clustering.
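The two definitions above translate directly into code; the following is a small illustrative sketch (ours, with function names of our choosing) of the k-means cost of Definition 2.1 and the local-optimality check of Definition 2.2.

```python
import numpy as np

def kmeans_cost(X, centers):
    """k-means objective of Definition 2.1: sum over points of the squared
    distance to the nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # (N, k) squared distances
    return d2.min(axis=1).sum()

def is_locally_optimal(X, labels, k):
    """Definition 2.2: with centers taken as cluster averages, every point must be
    at least as close to its own cluster's center as to any other center.
    Assumes every cluster is non-empty and labels are integers in {0, ..., k-1}."""
    centers = np.vstack([X[labels == i].mean(axis=0) for i in range(k)])
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    own = d2[np.arange(len(X)), labels]
    return bool(np.all(own <= d2.min(axis=1) + 1e-9))
```

In a locally optimal clustering the two quantities compared in `is_locally_optimal` coincide for every point, which is exactly the fixed point that the Lloyd iterations described above converge to.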
Although popular in practice, the worst-case performance of Lloyd's algorithm can be arbitrarily bad (Arthur & Vassilvitskii, 2005). The choice of initial centers is very important for the success of Lloyd's algorithm. We show that our theoretical guarantees hold when the initialization is done via the popular k-means++ algorithm (Arthur & Vassilvitskii, 2007). There also exist more sophisticated constant-factor approximation algorithms for the k-means problem (Kanungo et al., 2002; Ahmadian et al., 2016) that can be used for seeding in our framework.

We now state several properties of semi-random mixtures that will be used throughout the analysis in the subsequent sections (please see the Supplementary material for proofs). We first start with a few simple properties that follow directly from the corresponding properties of high-dimensional Gaussians.

**Lemma 2.3.** Consider any semi-random instance $X = \{x^{(1)}, \ldots, x^{(N)}\}$ with parameters $\mu_1, \ldots, \mu_k, \sigma^2$ and clusters $C_1, \ldots, C_k$. Then w.h.p.

$$\forall i \in [k],\ \forall \ell \in C_i: \quad \|x^{(\ell)} - \mu_i\|_2 \le \sigma(\sqrt{d} + \sqrt{\log N}). \quad (1)$$

Further, for a fixed unit vector $u \in \mathbb{R}^d$, with probability at least $1 - 1/N^3$,

$$\forall i \in [k],\ \forall \ell \in C_i: \quad |\langle x^{(\ell)} - \mu_i, u\rangle| < 3\sigma\sqrt{\log N}. \quad (2)$$

Finally, we have w.h.p.

$$\forall i \ne j \in [k],\ \forall \ell \in C_i: \quad \Big|\Big\langle x^{(\ell)} - \mu_i, \frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|_2}\Big\rangle\Big| < 3\sigma\sqrt{\log N}. \quad (3)$$

We observe that (3) immediately follows from (2) after a union bound over the $k^2 < N^2$ directions given by the unit vectors along the $(\mu_i - \mu_j)$ directions. The proofs of these statements are given in Section 2 of the Supplementary material. The next lemma gives an upper bound on how far the mean and variance of the points in a component of a semi-random GMM can move away from the true parameters.

**Lemma 2.4.** Consider any semi-random instance $X$ with $N$ points generated with parameters $\mu_1, \ldots, \mu_k$, $C_1, \ldots, C_k$ such that $N_i \ge 4(d + \log(k/\delta))$ for all $i \in [k]$. Then with probability at least $1 - \delta$,

$$\forall i \in [k]: \quad \Big\|\frac{1}{|C_i|}\sum_{x \in C_i} x - \mu_i\Big\|_2 \le 2\sigma, \quad (4)$$

$$\forall i \in [k]: \quad \max_{v : \|v\| = 1}\ \frac{1}{|C_i|}\sum_{x \in C_i} |\langle x - \mu_i, v\rangle|^2 \le 4\sigma^2. \quad (5)$$

## 3. Upper Bounds for Semi-random GMMs

In this section we prove the following theorem that provides algorithmic guarantees for Lloyd's algorithm with appropriate initialization, under the semi-random model for mixtures of Gaussians in Definition 1.1.

**Theorem 3.1.** There exist universal constants $c_0, c_1 > 0$ such that the following holds. There exists a polynomial-time algorithm that, for any semi-random instance $X$ on $N$ points with planted clustering $C_1, \ldots, C_k$ generated by the semi-random GMM model (Def. 1.1) with parameters $\mu_1, \ldots, \mu_k, \sigma^2$ s.t. $\|\mu_i - \mu_j\|_2 > \Delta\sigma$ for all $i \ne j \in [k]$, where

$$\Delta \ge c_0\sqrt{\min\{k, d\}\log N} \quad \text{and} \quad N \ge k^2 d^2/w_{\min}^2, \quad (6)$$

finds w.h.p. a clustering $C'_1, \ldots, C'_k$ such that, for some permutation $\pi$ of $[k]$,

$$\sum_{i \in [k]} |C_{\pi(i)} \,\triangle\, C'_i| \ \le\ \frac{c_1 k d}{\Delta^4}\max\Big\{1,\ \log\frac{6(\sqrt{d} + \log^{1/2} N)}{\Delta^2}\Big\}.$$

**Algorithm 1: Lloyd's Algorithm**

Input: the $N \times d$ data matrix $A$ with rows $A_i$ for $i \in [N]$.

1. Use $A$ to compute initial centers $\mu^{(1)}_0, \mu^{(2)}_0, \ldots, \mu^{(k)}_0$ as detailed in Proposition 3.2.
2. Use these $k$ centers to seed a series of Lloyd-type iterations, i.e., for $r = 1, 2, \ldots$:
   - Set $Z_i$ to be the set of points for which the closest center among $\mu^{(1)}_{r-1}, \mu^{(2)}_{r-1}, \ldots, \mu^{(k)}_{r-1}$ is $\mu^{(i)}_{r-1}$.
   - Set $\mu^{(i)}_r = \frac{1}{|Z_i|}\sum_{A_j \in Z_i} A_j$.
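A minimal NumPy sketch (ours) of the Lloyd-type iterations in Algorithm 1; the initial centers are taken as an input, standing in for the seeding of Proposition 3.2, and the stopping rule and handling of empty clusters are our own additions.

```python
import numpy as np

def lloyd_iterations(A, init_centers, num_rounds=50):
    """Algorithm 1: alternately (a) assign every row of A to its closest current
    center and (b) recompute each center as the average of the rows assigned to it."""
    centers = init_centers.copy()
    assign = None
    for _ in range(num_rounds):
        d2 = ((A[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)          # Z_i = rows whose closest center is center i
        new_centers = np.vstack([
            A[assign == i].mean(axis=0) if np.any(assign == i) else centers[i]
            for i in range(len(centers))
        ])
        if np.allclose(new_centers, centers):   # reached a locally optimal clustering
            break
        centers = new_centers
    return centers, assign
```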
In Section 4 we show that the above error bound is close to the information-theoretically optimal bound (up to the logarithmic factor). Lloyd's algorithm as described in Algorithm 1 consists of two stages: the initialization stage and an iterative improvement stage. The initialization follows the same scheme as proposed by Kumar and Kannan (2010). The initialization algorithm first performs a k-SVD of the data matrix, followed by running the k-means++ algorithm that uses $D^2$-sampling to compute seed centers (Arthur & Vassilvitskii, 2007). One can also use any constant-factor approximation algorithm for k-means clustering in the projected space to obtain the initial centers (Kanungo et al., 2002; Ahmadian et al., 2016). This approach works for clusters that are nearly balanced in size. However, when the cluster sizes are arbitrary, an appropriate transformation of the data is performed first that amplifies the separation between the centers. Following this transformation, the (k-SVD + k-means++) procedure is used to get the initial centers. The formal guarantee of the initialization procedure is encapsulated in the following proposition, whose proof is given in Section 3.2.

The main algorithmic contribution of this paper is an analysis of Lloyd's algorithm when the points come from the semi-random GMM model. For the rest of the analysis we will assume that the instance $X$ generated from the semi-random GMM model satisfies (1) to (7). These conditions are shown to hold w.h.p. in Section 2 for instances generated from the model. Our analysis will in fact hold for any deterministic data set satisfying these conditions. This helps to gracefully argue about performing many iterations of Lloyd's algorithm on the same data set without the need to draw fresh samples at each step.

**Proposition 3.2.** In the above notation, for any $\delta > 0$, suppose we are given an instance $X$ on $N$ points satisfying (1)-(7) such that $|C_i| \ge \Omega(d + \log(k/\delta))$, and assume that $\Delta \ge 125\sqrt{\min\{k, d\}\log N}$. Then after the initialization step, for every $\mu_i$ there exists a center $\mu'_i$ such that $\|\mu_i - \mu'_i\| \le \tau\sigma$, where $\tau < \Delta/24$.

The analysis of the Lloyd's iterations crucially relies on the following lemma, which upper bounds the number of misclassified points when the current Lloyd's iterates are relatively close to the true means.

**Lemma 3.3 (Projection condition).** In the above notation, consider an instance $X$ satisfying (1)-(7) and the separation condition (6), and suppose we are given $\mu'_1, \ldots, \mu'_k$ satisfying $\|\mu'_j - \mu_j\|_2 \le \tau\sigma$ for all $j \in [k]$, with $\tau < \Delta/24$. Then there exists a set $Z \subset X$ such that for any $i \in [k]$ we have

$$\forall x \in C_i \cap (X \setminus Z): \quad \|x - \mu'_i\|_2^2 < \min_{j \ne i}\|x - \mu'_j\|_2^2,$$

where $|Z| = O\Big(\frac{d\tau^2}{\Delta^4}\max\Big\{1,\ \log\frac{3\tau(\sqrt{d} + \log^{1/2} N)}{\Delta^2}\Big\}\Big)$.

The following lemma quantifies the improvement in each step of Lloyd's algorithm. The proof uses Lemma 3.3 along with properties of semi-random Gaussians.

**Lemma 3.4.** In the above notation, suppose we are given an instance $X$ on $N$ points with $w_i N \ge 4\,d\sqrt{d}\,\log(d)$ for all $i$, satisfying (1)-(7). Furthermore, suppose we are given centers $\mu'_1, \ldots, \mu'_k$ such that $\|\mu'_i - \mu_i\| \le \tau\sigma$ for all $i \in [k]$, where $\tau < \Delta/24$. Then the centers $\mu''_1, \ldots, \mu''_k$ obtained after one Lloyd's update satisfy $\|\mu''_i - \mu_i\| \le \max\big\{(6 + \tfrac{\tau}{4})\sigma,\ \tfrac{\tau}{2}\sigma\big\}$ for all $i \in [k]$.

We now present the proof of Theorem 3.1.

**Proof of Theorem 3.1.** Firstly, the deterministic conditions (1)-(7) are shown to hold for the instance $X$ w.h.p. in Section 2. The proof then follows in a straightforward manner by combining Proposition 3.2, Lemma 3.4 and Lemma 3.3. Proposition 3.2 shows that $\|\mu^{(0)}_i - \mu_i\|_2 \le \Delta\sigma/24$ for all $i \in [k]$. Applying Lemma 3.4, we have that after $T = O(\log\Delta)$ iterations we get $\|\mu^{(T)}_i - \mu_i\|_2 \le 8\sigma$ for all $i \in [k]$ w.h.p. Finally, using Lemma 3.3 with $\tau = O(1)$, the theorem follows.

### 3.1. Analyzing Lloyd's Algorithm

The following lemma is crucial in analyzing the performance of Lloyd's algorithm. We would like to upper bound the inner product $|\langle x^{(\ell)} - \mu_i, \hat{e}\rangle| < \lambda\sigma$ for every direction $\hat{e}$ and every sample $\ell \in [N]$, but this is impossible since $\hat{e}$ can be aligned along $x^{(\ell)} - \mu_i$. The following lemma, however, upper bounds the total number of points in the dataset that can have a large projection of $\lambda$ (or above) onto any direction $\hat{e}$ by at most $O(d/\lambda^2)$. This involves a union bound over a net of all possible directions $\hat{e}$.
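A tiny numerical illustration (ours) of the issue just described: a fixed, data-independent direction sees projections of size $O(\sigma\sqrt{\log N})$, while a direction aligned with a particular sample sees a projection of order $\sigma\sqrt{d}$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, sigma = 10_000, 200, 1.0
X = sigma * rng.standard_normal((N, d))              # points around mu = 0

u = rng.standard_normal(d); u /= np.linalg.norm(u)   # a fixed, data-independent direction
e_bad = X[0] / np.linalg.norm(X[0])                  # direction aligned with one sample

print(np.abs(X @ u).max())    # roughly sigma * sqrt(2 log N), i.e. a few sigmas
print(abs(X[0] @ e_bad))      # equals ||X[0]||, roughly sigma * sqrt(d), much larger
```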
**Lemma 3.5 (Points in Bad Directions).** Consider any semi-random instance $X = \{x^{(1)}, \ldots, x^{(N)}\}$ with $N$ points having parameters $\mu_1, \ldots, \mu_k, \sigma^2$ and planted clustering $C_1, \ldots, C_k$, and denote $\tilde{x}^{(\ell)} = x^{(\ell)} - \mu_i$ for $i \in [k]$, $\ell \in C_i$. Then there exists a universal constant $c > 0$ such that for every $\lambda > 100\sqrt{\log N}$, with probability at least $1 - 2^{-d}$,

$$\forall\, \hat{e} : \|\hat{e}\|_2 = 1, \quad \big|\{\ell \in [N] : |\langle \tilde{x}^{(\ell)}, \hat{e}\rangle| > \lambda\sigma\}\big| \ \le\ \frac{c\,d}{\lambda^2}\max\Big\{1,\ \log\frac{3(\sqrt{d} + \log^{1/2} N)}{\lambda}\Big\}. \quad (7)$$

**Proof.** Set $\eta := \min\{\lambda/(2\sqrt{d} + 2\sqrt{\log N}),\ \tfrac{1}{2}\}$ and $m := 512\,d\log(3/\eta)/\lambda^2$. Consider an $\eta$-net $\mathcal{N} \subset \{u : \|u\|_2 = 1\}$ over unit vectors in $\mathbb{R}^d$. Hence for every $u \in \mathbb{R}^d$ with $\|u\|_2 = 1$ there exists $v \in \mathcal{N}$ s.t. $\|u - v\|_2 \le \eta$, and $|\mathcal{N}| \le \big(\tfrac{2+\eta}{\eta}\big)^d \le \exp(d\log(3/\eta))$. Further, since $|\langle \tilde{x}, \hat{e}\rangle| > \lambda\sigma$ and $\mathcal{N}$ is an $\eta$-net, there exists some unit vector $u = u(\hat{e}) \in \mathcal{N}$ with

$$|\langle \tilde{x}, u\rangle| \ \ge\ |\langle \tilde{x}, \hat{e}\rangle| - |\langle \tilde{x}, \hat{e} - u\rangle| \ \ge\ \sigma\lambda - \|\tilde{x}\|_2\|\hat{e} - u\|_2 \ \ge\ \sigma\lambda/2 \quad (8)$$

for our choice of $\eta$. Consider a fixed $\tilde{x} \in \{\tilde{x}^{(1)}, \ldots, \tilde{x}^{(N)}\}$ and a fixed direction $u \in \mathcal{N}$. Since the variance of $y$ is at most $\sigma^2$ in every direction, we have

$$\Pr\big[|\langle \tilde{x}, u\rangle| > \lambda\sigma/2\big] \ \le\ \Pr\big[|\langle \tilde{y}, u\rangle| > \lambda\sigma/2\big] \ \le\ \exp(-\lambda^2/8).$$

The probability that $m$ points in $\{\tilde{x}^{(1)}, \ldots, \tilde{x}^{(N)}\}$ satisfy (8) for a fixed direction $u$ is at most $\binom{N}{m}\exp(-m\lambda^2/8)$. Let $E$ represent the bad event that there exists a direction in $\mathcal{N}$ such that more than $m$ points satisfy the bad event given by (8). Then

$$\Pr[E] \ \le\ |\mathcal{N}|\binom{N}{m}\exp(-m\lambda^2/8) \ \le\ \exp\big(d\log(3/\eta) + m\log N - m\lambda^2/8\big) \ \le\ 2^{-d},$$

since $\lambda^2 > 32\log N$ and $m\lambda^2 \ge 32\,d\log(3/\eta)$.

Lemma 3.3 and Lemma 3.4 use the above lemma to analyze each iteration of Lloyd's algorithm and show that we make progress in each step by misclassifying fewer points with successive iterations.

### 3.2. Initialization

In this section we describe how to obtain initial centers satisfying the condition in Lemma 3.4. The final initialization procedure relies on the following subroutine, which provides a good initializer if the mean separation is much larger than that in Theorem 3.1. Let $A$ denote the $N \times d$ matrix of data points and $M$ the $N \times d$ matrix where each row of $M$ equals the mean $\mu_i$ of the component to which the corresponding row of $A$ belongs.

**Lemma 3.6.** In the above notation, for any $\delta > 0$, suppose we are given an instance $X$ on $N$ points satisfying (1)-(7), with components $C_1, \ldots, C_k$ such that $|C_i| \ge \Omega(d + \log(k/\delta))$. Let $A$ be the $N \times d$ matrix of data points and $\hat{A}$ the matrix obtained by projecting the points onto the best $k$-dimensional subspace obtained by the SVD of $A$. Let $\mu'_i$ be the centers obtained by running an $\alpha$-factor k-means approximation algorithm on $\hat{A}$. Then for every $\mu_i$ there exists a center $\mu'_i$ such that

$$\|\mu_i - \mu'_i\| \ \le\ 20\sqrt{k\alpha}\,\frac{\|A - M\|}{\sqrt{N w_{\min}}}.$$

The above lemma already provides a good initializer provided $\Delta = \Omega(\sqrt{k/w_{\min}})$ and one uses a constant-factor approximation algorithm for k-means (Ahmadian et al., 2016). Furthermore, if $\Delta = \Omega\big(\sqrt{(k\log k\log N)/w_{\min}}\big)$, then one can instead use the simpler and faster k-means++ approximation algorithm (Arthur & Vassilvitskii, 2007). The above lemma has a bad dependence on $w_{\min}$. However, using the boosting technique of (Kumar & Kannan, 2010) we can reduce the dependence to $\Delta > 25\sqrt{k\log N}$ and hence prove Proposition 3.2. We provide a proof of this in the Appendix.
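A sketch (ours, under the stated simplifications) of the initialization just described for the nearly balanced case: project the rows of $A$ onto the top-$k$ singular subspace (k-SVD) and pick seed centers by $D^2$-sampling in the projected space. The transformation for arbitrary cluster sizes and the boosting step of Kumar & Kannan (2010) are omitted, and any constant-factor k-means approximation could replace the seeding step.

```python
import numpy as np

def k_svd_project(A, k):
    """Project each row of A onto the span of the top-k right singular vectors
    (the best rank-k approximation of A)."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    V = Vt[:k].T                       # (d, k) orthonormal basis of the top-k subspace
    return (A @ V) @ V.T               # projected rows, still expressed in R^d

def d2_sample_seeds(A_hat, k, rng):
    """k-means++ style D^2-sampling: each new seed is a data point chosen with
    probability proportional to its squared distance to the current seeds."""
    seeds = [A_hat[rng.integers(len(A_hat))]]
    for _ in range(k - 1):
        d2 = ((A_hat[:, None, :] - np.array(seeds)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        seeds.append(A_hat[rng.choice(len(A_hat), p=d2 / d2.sum())])
    return np.vstack(seeds)

# usage sketch: initial centers for Algorithm 1
# rng = np.random.default_rng(0)
# centers0 = d2_sample_seeds(k_svd_project(A, k), k, rng)
```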
## 4. Lower Bounds for Semi-random GMMs

We prove the following theorem.

**Theorem 4.1.** For any $d, k \in \mathbb{Z}_+$, there exist $N_0 = \mathrm{poly}(d, k)$ and a universal constant $c_1 > 0$ such that the following holds for all $N \ge N_0$ and all $\Delta$ such that $\Delta^4\log N \le d/(64\log d)$. There exists an instance $X$ on $N$ points in $d$ dimensions with planted clustering $C_1, \ldots, C_k$, generated by applying semi-random perturbations to points generated from a mixture of spherical Gaussians with means $\mu_1, \mu_2, \ldots, \mu_k$, covariance $\sigma^2 I$ and weights $1/k$ each, with separation $\|\mu_i - \mu_j\|_2 \ge \Delta\sigma$ for all $i \ne j \in [k]$, such that any locally optimal k-means clustering solution $C'_1, C'_2, \ldots, C'_k$ of $X$ satisfies w.h.p.

$$\min_{\pi \in \mathrm{Perm}_k}\ \sum_{i=1}^k |C'_{\pi(i)} \,\triangle\, C_i| \ \ge\ \frac{c_1 k d}{\Delta^4}.$$

It suffices to set $N_0(d, k) := c_0 k^2 d^{3/2}\log^2(kd)$, where $c_0 > 0$ is a sufficiently large universal constant.

**Remark 4.2.** Note that the lower bound also applies, in particular, to the more general semi-random model in Definition 1.1; in this instance, the points are drawn i.i.d. from the mixture of spherical Gaussians before the semi-random perturbations are applied. Further, this lower bound holds for any locally optimal solution, and not just the optimal solution.

The lower bound construction will pick an arbitrary set of $\Omega(d/\Delta^4)$ points from each of $k/2$ clusters, and carefully choose a semi-random perturbation to all the points so that these $\Omega(kd/\Delta^4)$ points are misclassified. We start with a simple lemma that shows that an appropriate semi-random perturbation can move the mean of a cluster by an amount $\Omega(\sigma)$ along any fixed direction.

**Lemma 4.3.** Consider a spherical Gaussian in $d$ dimensions with mean $\mu$ and covariance $\sigma^2 I$, and let $\hat{e}$ be a fixed unit vector. Consider the semi-random perturbation given by

$$\forall y \in \mathbb{R}^d: \quad h(y) = \begin{cases}\mu & \text{if } \langle y - \mu, \hat{e}\rangle < 0,\\ y & \text{otherwise.}\end{cases}$$

Then we have $\mathbb{E}[h(y)] = \mu + \frac{1}{\sqrt{2\pi}}\sigma\hat{e}$.

**Construction.** Set $m := c_1 d/\Delta^4$ for some appropriately small constant $c_1 \in (0,1)$. We assume without loss of generality that $k$ is even (the following construction also works for odd $k$ by leaving the last cluster unchanged). We pair up the clusters into $k/2$ pairs $\{(C_1, C_2), (C_3, C_4), \ldots, (C_{k-1}, C_k)\}$, and we will ensure that $m$ points are misclassified in each of the $k/2$ clusters $C_1, C_3, \ldots, C_{k-1}$. The parameters of the mixture of spherical Gaussians $\mathcal{G}$ are set up as follows. For each $i \in \{1, 3, 5, \ldots, k-1\}$, $\|\mu_i - \mu_{i+1}\|_2 = \Delta\sigma$, and all the other inter-mean distances (across different pairs) are at least $M\sigma$, where $M$ is arbitrarily large ($M \ge 7\Delta$). For each $i \in \{1, 3, \ldots, k-1\}$, let $Z_i \subset C_i$ be the first $m$ points of cluster $C_i$ among the samples $y^{(1)}, \ldots, y^{(N)}$ drawn from $\mathcal{G}$ (these $m$ points inside the clusters can be chosen arbitrarily). Set $Z_i = \emptyset$ for $i \in \{2, 4, \ldots, k\}$. Then, for each $i \in \{1, 3, \ldots, k-1\}$, set $\hat{e}_i$ to be the unit vector along $u_i = \frac{1}{\sigma}\sum_{y \in Z_i}(y - \mu_i)$. Finally, for each $i \in \{1, 3, \ldots, k-1\}$, apply the following semi-random perturbation given by Lemma 4.3 to the points in cluster $C_{i+1}$ along $\hat{e}_i$:

$$x^{(t)} = h(y^{(t)}) = \begin{cases}\mu_{i+1} & \text{if } \langle y^{(t)} - \mu_{i+1}, \hat{e}_i\rangle < 0,\\ y^{(t)} & \text{otherwise.}\end{cases}$$

Note that the semi-random perturbations are only made to points in the even clusters (based on a few points in their respective odd clusters). The lower bound proof proceeds in two parts. Lemma 4.4 (using Lemma 4.3) and Lemma 4.5 show that in any k-means optimal clustering, the mean of each even cluster $C_i$ moves by roughly $\Omega(\sigma)\hat{e}_{i-1}$. Lemma 4.6 then shows that these means will classify all the $m$ points in $Z_{i-1}$ incorrectly w.h.p. In this proof, w.h.p. will refer to a probability of at least $1 - o(1)$ unless specified otherwise (this can be made $1 - 1/\mathrm{poly}(m, k)$).

Let $\tilde\mu_1, \ldots, \tilde\mu_k$ be the (empirical) means of the clusters in the planted clustering $C_1, C_2, \ldots, C_k$ after the semi-random perturbations. The following lemma shows that $\|\tilde\mu_i - \mu_i\|_2 = O(\sigma)$.

**Lemma 4.4.** There exists a constant $c_3 > 0$ s.t. for the semi-random instance $X$ described above, we have w.h.p., for all $i \in [k]$,

$$\tilde\mu_i = \begin{cases}\mu_i + \frac{1}{\sqrt{2\pi}}\sigma\hat{e}_{i-1} + z_i & \text{if } i \text{ is even,}\\ \mu_i + z_i & \text{if } i \text{ is odd,}\end{cases}$$

for some $z_i$ with $\|z_i\|_2 \le c_3\sigma\sqrt{k(d + \log N)/N}$.
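A quick Monte Carlo check (ours) of the mean shift in Lemma 4.3: applying $h$ to spherical Gaussian samples moves the empirical mean by about $\sigma/\sqrt{2\pi} \approx 0.4\sigma$ along $\hat{e}$, which is the $\Omega(\sigma)$ shift exploited by the construction above.

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma, n = 50, 1.0, 200_000
mu = np.zeros(d)
e_hat = np.eye(d)[0]                       # fixed unit direction e_hat

y = mu + sigma * rng.standard_normal((n, d))
proj = (y - mu) @ e_hat
h_y = np.where(proj[:, None] < 0, mu, y)   # h(y) = mu if <y - mu, e_hat> < 0, else y

shift = (h_y.mean(axis=0) - mu) @ e_hat
print(shift, sigma / np.sqrt(2 * np.pi))   # both close to ~0.399 * sigma
```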
The following lemma shows that if $C_i$ and $C'_i$ are close, then their empirical means are also close.

**Lemma 4.5.** Consider any cluster $C_i$ of the instance $X$, and let $C'_i$ satisfy $|C'_i \,\triangle\, C_i| \le m'$. Suppose $\tilde\mu_i$, $\mu'_i$ are the means of clusters $C_i$ and $C'_i$ respectively. Then

$$\|\mu'_i - \tilde\mu_i\|_2 \ \le\ \frac{4\sigma\, m'\,(\sqrt{d} + \sqrt{\log N} + \Delta)}{|C_i|}.$$

The following lemma shows that the Voronoi partition about $\tilde\mu_1, \ldots, \tilde\mu_k$ (or any points close to them) incorrectly classifies all points in $Z_i$ for each $i \in [k]$.

**Lemma 4.6.** Let $\mu'_1, \mu'_2, \ldots, \mu'_k$ satisfy $\|\mu'_i - \tilde\mu_i\|_2 \le \sigma/\big(16\sqrt{m}\,(1 + 2\sqrt{d/m})\big)$, where $\tilde\mu_i$ is the empirical mean of the points in $C_i$. Then we have w.h.p. that for each $i \in \{1, 3, \ldots, k-1\}$ and every $x \in Z_i$,

$$\|x - \mu'_i\|_2^2 \ >\ \|x - \mu'_{i+1}\|_2^2,$$

i.e., every point $x \in Z_i$ is misclassified.

**Proof of Theorem 4.1.** Let $C'_1, \ldots, C'_k$ be a locally optimal k-means clustering of $X$, and suppose $\sum_i |C'_i \,\triangle\, C_i| < mk/2$ (for the sake of contradiction). For each $i \in [k]$, let $\tilde\mu_i$ be the empirical mean of $C_i$ and $\mu'_i$ be the empirical mean of $C'_i$. Since $C'_1, \ldots, C'_k$ is a locally optimal clustering, the Voronoi partition given by $\mu'_1, \ldots, \mu'_k$ classifies all the points in agreement with $C'_1, \ldots, C'_k$. We will now contradict the local optimality of the clustering $C'_1, \ldots, C'_k$. Every cluster $C_i$ has at least $N/(2k)$ points w.h.p. Hence, for each $i \in [k]$, from Lemma 4.5 we have

$$\|\mu'_i - \tilde\mu_i\|_2 \ \le\ \frac{4|C_i \,\triangle\, C'_i|}{N/(2k)}\,\sigma\big(\sqrt{d} + \sqrt{\log N} + \Delta\big),$$

which for $N \ge N_0$ is small enough for Lemma 4.6 to apply. However, from Lemma 4.6, every point in $\cup_{i \in [k]} Z_i$ is then misclassified by $\mu'_1, \mu'_2, \ldots, \mu'_k$, i.e., the clustering given by the Voronoi partition around $\mu'_1, \ldots, \mu'_k$ differs from $C_1, \ldots, C_k$ on at least $mk/2$ points in total. But $\sum_{i \in [k]} |C'_i \,\triangle\, C_i| < mk/2$. Hence, this contradicts the local optimality of the clustering $C'_1, \ldots, C'_k$.

## 5. Conclusion

In this work we initiated the study of clustering data from a semi-random mixture of Gaussians. We proved that the popular Lloyd's algorithm achieves near-optimal error. The robustness of Lloyd's algorithm in the semi-random model suggests a theoretical justification for its widely documented success in practice. A concrete open question left by our work is to extend our lower bound for locally optimal clusterings to a more general statistical lower bound; this would also imply a separation between recovery guarantees for the semi-random model and the pure GMM model. Robust analysis under semi-random adversaries for related heuristics such as the EM algorithm, and studying semi-random variants of other popular statistical models in machine learning, will further help close the gap between our theoretical understanding and the observed practical performance of algorithms for such models.

## Acknowledgements

Aravindan Vijayaraghavan is supported by the National Science Foundation (NSF) under Grants No. CCF-1652491 and CCF-1637585.

## References

Achlioptas, D. and McSherry, F. On spectral learning of mixtures of distributions. In Learning Theory, pp. 458-469. Springer, 2005.

Ahmadian, S., Norouzi-Fard, A., Svensson, O., and Ward, J. Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. CoRR, abs/1612.07925, 2016. URL http://arxiv.org/abs/1612.07925.

Arora, S. and Kannan, R. Learning mixtures of arbitrary Gaussians. In Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, pp. 247-257. ACM, 2001.

Arthur, D. and Vassilvitskii, S. On the worst case complexity of the k-means method. Technical report, Stanford, 2005.

Arthur, D. and Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, pp. 1027-1035, 2007. ISBN 978-0-898716-245. URL http://dl.acm.org/citation.cfm?id=1283383.1283494.
Awasthi, P. and Sheffet, O. Improved spectral-norm bounds for clustering. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pp. 37-49. Springer, 2012.

Awasthi, P., Charikar, M., Krishnaswamy, R., and Sinop, A. K. The hardness of approximation of Euclidean k-means. arXiv preprint arXiv:1502.03316, 2015.

Belkin, M. and Sinha, K. Polynomial learning of distribution families. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pp. 103-112. IEEE, 2010.

Blum, A. and Spencer, J. Coloring random and semi-random k-colorable graphs. J. Algorithms, 19:204-234, September 1995. ISSN 0196-6774. doi: 10.1006/jagm.1995.1034. URL http://dx.doi.org/10.1006/jagm.1995.1034.

Brubaker, S. C. Robust PCA and clustering in noisy mixtures. In Proceedings of the Symposium on Discrete Algorithms, pp. 1078-1087, 2009.

Brubaker, S. C. and Vempala, S. Isotropic PCA and affine-invariant clustering. In Proceedings of the 2008 49th Annual IEEE Symposium on Foundations of Computer Science, FOCS '08, pp. 551-560, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0-7695-3436-7. doi: 10.1109/FOCS.2008.48. URL http://dx.doi.org/10.1109/FOCS.2008.48.

Charikar, M., Steinhardt, J., and Valiant, G. Learning from untrusted data. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pp. 47-60, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4528-6. doi: 10.1145/3055399.3055491. URL http://doi.acm.org/10.1145/3055399.3055491.

Dasgupta, S. Learning mixtures of Gaussians. In Foundations of Computer Science, 1999. 40th Annual Symposium on, pp. 634-644. IEEE, 1999.

Dasgupta, S. and Schulman, L. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. The Journal of Machine Learning Research, 8:203-226, 2007.

Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A., and Stewart, A. Robust estimators in high dimensions without the computational intractability. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pp. 655-664, Oct 2016. doi: 10.1109/FOCS.2016.85.

Dutta, A., Vijayaraghavan, A., and Wang, A. Clustering stable instances of Euclidean k-means. Proceedings of Neural Information Processing Systems (NIPS), 2017.

Feige, U. and Kilian, J. Heuristics for finding large independent sets, with applications to coloring semi-random graphs. In Foundations of Computer Science, 1998. Proceedings. 39th Annual Symposium on, pp. 674-683, Nov 1998. doi: 10.1109/SFCS.1998.743518.

Hsu, D. and Kakade, S. M. Learning Gaussian mixture models: Moment methods and spectral decompositions. arXiv preprint arXiv:1206.5766, 2012.

Kannan, R., Salmasian, H., and Vempala, S. The spectral method for general mixture models. SIAM J. Comput., 38(3):1141-1156, 2008. doi: 10.1137/S0097539704445925. URL http://dx.doi.org/10.1137/S0097539704445925.

Kanungo, T., Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silverman, R., and Wu, A. Y. A local search approximation algorithm for k-means clustering. In Proceedings of the Eighteenth Annual Symposium on Computational Geometry, pp. 10-18. ACM, 2002.

Kumar, A. and Kannan, R. Clustering with spectral norm and the k-means algorithm. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pp. 299-308. IEEE, 2010.

Lai, K. A., Rao, A. B., and Vempala, S. Agnostic estimation of mean and covariance. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pp. 665-674, Oct 2016. doi: 10.1109/FOCS.2016.76.

Lee, E., Schmidt, M., and Wright, J. Improved and simplified inapproximability for k-means. Information Processing Letters, 120:40-43, 2017.
Lloyd, S. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-137, 1982.

Makarychev, K., Makarychev, Y., and Vijayaraghavan, A. Approximation algorithms for semi-random partitioning problems. In Proceedings of the 44th Symposium on Theory of Computing (STOC), pp. 367-384. ACM, 2012.

Moitra, A. and Valiant, G. Settling the polynomial learnability of mixtures of Gaussians. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pp. 93-102. IEEE, 2010.

Ostrovsky, R., Rabani, Y., Schulman, L. J., and Swamy, C. The effectiveness of Lloyd-type methods for the k-means problem. In Foundations of Computer Science, 2006. FOCS '06. 47th Annual IEEE Symposium on, pp. 165-176. IEEE, 2006.

Pearson, K. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A, 185:71-110, 1894.

Tang, C. and Monteleoni, C. On Lloyd's algorithm: New theoretical insights for clustering in practice. In Artificial Intelligence and Statistics, pp. 1280-1289, 2016.

Teicher, H. Identifiability of mixtures. The Annals of Mathematical Statistics, 32(1):244-248, 1961.

Teicher, H. Identifiability of mixtures of product measures. The Annals of Mathematical Statistics, 38(4):1300-1302, 1967.

Vempala, S. and Wang, G. A spectral algorithm for learning mixture models. Journal of Computer and System Sciences, 68(4):841-860, 2004.

Williamson, D. P. and Shmoys, D. B. The Design of Approximation Algorithms. Cambridge University Press, New York, NY, USA, 1st edition, 2011.