# Learning Mixture of Gaussians with Streaming Data

Aditi Raghunathan (Stanford University, aditir@stanford.edu), Prateek Jain (Microsoft Research, India, prajain@microsoft.com), Ravishankar Krishnaswamy (Microsoft Research, India, rakri@microsoft.com)

In this paper, we study the problem of learning a mixture of Gaussians with streaming data: given a stream of $N$ points in $d$ dimensions generated by an unknown mixture of $k$ spherical Gaussians, the goal is to estimate the model parameters using a single pass over the data stream. We analyze a streaming version of the popular Lloyd's heuristic and show that the algorithm estimates all the unknown centers of the component Gaussians accurately if they are sufficiently separated. Assuming each pair of centers is $C\sigma$ apart with $C = \Omega((k \log k)^{1/4})$, where $\sigma^2$ is the maximum variance of any Gaussian component, we show that asymptotically the algorithm estimates the centers optimally (up to certain constants); our center separation requirement matches the best known result for spherical Gaussians [18]. For finite samples, we show that a bias term based on the initial estimate decreases at an $O(1/\mathrm{poly}(N))$ rate while the variance decreases at the nearly optimal rate of $\sigma^2 d/N$. Our analysis requires seeding the algorithm with a good initial estimate of the true cluster centers, for which we provide an online PCA based clustering algorithm. Indeed, the asymptotic per-step time complexity of our algorithm is the optimal $O(dk)$, while the space complexity of our algorithm is $O(dk \log k)$. In addition to the bias and variance terms which tend to 0, the hard-thresholding based updates of the streaming Lloyd's algorithm are agnostic to the data distribution and hence incur an approximation error that cannot be avoided. However, by using a streaming version of the classical (soft-thresholding-based) EM method that exploits the Gaussian distribution explicitly, we show that for a mixture of two Gaussians the true means can be estimated consistently, with estimation error decreasing at a nearly optimal rate and tending to 0 as $N \to \infty$.

1 Introduction

Clustering data into homogeneous clusters is a critical first step in any data analysis/exploration task and is used extensively to pre-process data, form features, remove outliers and visualize data. Due to the explosion in the amount of data collected and processed, designing clustering algorithms that can handle large datasets that do not fit in RAM is paramount to any big-data system. A common approach in such scenarios is to treat the entire dataset as a stream of data, and then design algorithms which update the model after every few points from the data stream. In addition, there are several practical applications where the data itself is not available beforehand and is streaming in, for example in any typical online system like web search.

For such a model, the algorithm of choice in practice is the so-called streaming k-means heuristic. It is essentially a streaming version of the celebrated k-means algorithm or Lloyd's heuristic [8]. The basic k-means algorithm is designed for offline/batch data, where each data point is assigned to the nearest centroid and the centroids are then updated based on the assigned points; this process is iterated until the solution is locally optimal. The streaming version of the k-means algorithm assigns the new point from the stream to the closest centroid and updates this centroid immediately.
That is, unlike offline k-means, which first assigns all the points to their respective centroids and then updates the centroids, the streaming algorithm updates the centroids after each point, making it much more space efficient. While streaming k-means and its several variants are used heavily in practice, their properties, such as solution quality and time complexity of convergence, have not been studied widely. In this paper, we attempt to provide such a theoretical study of the streaming k-means heuristic.

One of the big challenges is that even the (offline) k-means algorithm attempts to solve a non-convex NP-hard problem. Streaming data poses additional challenges because of the large noise in each point, which can deviate the solution significantly. In the offline setting, clustering algorithms are typically studied under certain simplifying assumptions that help bypass the worst-case NP-hardness of these problems. One of the most widely studied settings is when the data is sampled from a mixture of well-separated Gaussians [5, 18, 1], which is also the generative assumption that we impose on the data in this work. However, the online/streaming version of the k-means algorithm has not been studied in such settings. In this work, we design and study a variant of the popular online k-means algorithm where the data is streaming in, we cannot store more than logarithmically many data points, and each data point is sampled from a mixture of well-separated spherical Gaussians. The goal of the algorithm is then to learn the means of each of the Gaussians; note that estimating other parameters like the variance and the weight of each Gaussian in the mixture becomes simple once the true means are estimated accurately.

Our Results. Our main contribution is the first bias-variance bound for the problem of learning Gaussian mixtures with streaming data. Assuming that the centers are separated by $C\sigma$ where $C = \Omega(\sqrt{\log k})$, and if we seed the algorithm with initial cluster centers that are within $C\sigma/20$ distance of the true centers, then we show that the error in estimating the true centers can be decomposed into three terms and bound each one of them: (a) the bias term, i.e., the term dependent on the distance of the true means to the initial centers, decreases at a $1/\mathrm{poly}(N)$ rate, where $N$ is the number of data points observed so far; (b) the variance term is bounded by roughly $\sigma^2 \frac{d\log N}{N}$, where $\sigma$ is the standard deviation of each of the Gaussians and $d$ is the dimensionality of the data; and (c) an offline approximation error: indeed, note that even the offline Lloyd's heuristic will have an approximation error due to its hard-thresholding nature. For example, even when $k = 2$ and the centers are separated by $C\sigma$, around an $\exp(-C^2/8)$ fraction of points from the first Gaussian will be closer to the second center, and so the k-means heuristic will converge to centers that are at a squared distance of roughly $O(C^2)\exp(-C^2/8)\,\sigma^2$ from the true means. We essentially inherit this offline error up to constants.

Note that the above result holds at a center separation of $\Omega(\sqrt{\log k})\,\sigma$, which is substantially weaker than the currently best-known result of $\Omega(\sigma k^{1/4})$ for even the offline problem [18]. However, as mentioned before, this only holds provided we have a good initialization.
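As a quick sanity check on the offline approximation error in item (c) above, the following short simulation (our own illustration; the values of $d$, $C$, and the sample size are arbitrary) draws points from one of two spherical Gaussians whose true centers are $C\sigma$ apart and measures how often a point lands closer to the other center. The empirical fraction matches the exact Gaussian tail $\Phi(-C/2)$, which is upper bounded by the $\exp(-C^2/8)$ quantity used above.

```python
import math
import numpy as np

# Illustrative check of the overlap between two spherical Gaussians whose true
# centers are C*sigma apart (all parameter values below are arbitrary choices).
rng = np.random.default_rng(0)
d, sigma, C, n = 50, 1.0, 4.0, 200_000

mu1 = np.zeros(d)
mu2 = np.zeros(d)
mu2[0] = C * sigma                               # centers are C*sigma apart

x = mu1 + sigma * rng.standard_normal((n, d))    # samples from the first component
misclassified = np.linalg.norm(x - mu2, axis=1) < np.linalg.norm(x - mu1, axis=1)

print("empirical fraction closer to the wrong center:", misclassified.mean())
print("exact tail Phi(-C/2)                         :", 0.5 * math.erfc(C / (2 * math.sqrt(2))))
print("upper bound exp(-C^2/8)                      :", math.exp(-C**2 / 8))
```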
To obtain such an initialization, we show that when $C = \Omega((k\log k)^{1/4})$, we can combine an online PCA algorithm [9, 11] with the batch k-means algorithm on a small seed sample of around $O(k\log k)$ points. Note that this separation requirement nearly matches the best-known offline results [18].

Finally, we also study a soft version of the streaming k-means algorithm, which can also be viewed as a streaming version of the popular Expectation Maximization (EM) algorithm. We show that for a mixture of two well-separated Gaussians, a variant of the streaming EM algorithm recovers the above mentioned bias-variance bound but without the approximation error. That is, after observing infinitely many samples, streaming EM converges to the true means and matches the corresponding offline results in [3, 6]; to the best of our knowledge this is also the first such consistency result for the streaming mixture problem. However, the EM updates require that the data is sampled from a mixture of Gaussians, while the updates of the streaming Lloyd's algorithm are agnostic to the data distribution and hence the same updates can be used to solve arbitrary mixtures of sub-Gaussians as well.

Technical Challenges. One key technical challenge in analyzing the streaming k-means algorithm, in comparison to standard streaming regression style problems, is that the offline problem itself is non-convex and moreover can only be solved approximately. Hence, a careful analysis is required to separate out the error we get in each iteration in terms of the bias, variance, and inherent approximation error terms. Moreover, due to the non-convexity, we are able to guarantee a decrease in error only if each of our iterates lies in a small ball around the true means. While this is initially true due to the initialization algorithm, our intermediate centers might escape these balls during our updates. However, we show using a delicate martingale based argument that with high probability, our estimates stay within slightly larger balls around the true means, which turns out to be sufficient for us.

Related Work. A closely related work to ours is an independent work by [17], which studies a stochastic version of k-means for data points that satisfy a spectral variance condition; this condition can be seen as a deterministic version of the mixture of distributions assumption. However, their method requires multiple passes over the data and thus does not fit directly in the streaming k-means setting. In particular, the above mentioned paper analyzes the stochastic k-means method only for a highly accurate initial set of iterates, which requires a large burn-in period of $t = O(N^2)$ and hence needs $O(N)$ passes over the data, where $N$ is the number of data points. Tensor methods [1, 10] can also be extended to cluster streaming data points sampled from a mixture distribution, but these methods suffer from large sample/time complexity and might not provide reasonable results when the data distribution deviates from the assumed generative model. In addition to the Gaussian mixture model, clustering problems are also studied under other models such as data with small spectral variance [12], stability of data [4], etc. It would be interesting to study the streaming versions in such models as well.

Paper Outline. We describe our models and problem setup in Section 2. We then present our streaming k-means algorithm and its proof overview in Sections 3 and 4. We then discuss the initialization procedure in Section 5.
Finally, we describe our streaming EM algorithm in Section 6.

2 Setup and Notation

We assume that the data is drawn from a mixture of $k$ spherical Gaussian distributions, i.e.,

$$x \sim \sum_{i=1}^{k} w_i\, \mathcal{N}(\mu_i^*, \sigma^2 I), \qquad \mu_i^* \in \mathbb{R}^d,\ \ i = 1, 2, \ldots, k, \tag{1}$$

where $\mu_i^* \in \mathbb{R}^d$ is the mean of the $i$-th mixture component, and the mixture weights satisfy $w_i \ge 0$ and $\sum_i w_i = 1$. All the problem parameters (i.e., the true means, the variance $\sigma^2$ and the mixture weights) are unknown to the algorithm. Using the standard streaming setup, where the $t$-th sample $x^t \in \mathbb{R}^d$ is drawn from the data distribution, our goal is to produce an estimate $\hat\mu_i$ of $\mu_i^*$ for $i = 1, 2, \ldots, k$ in a single pass over the data using bounded space.

Center Separation. A suitable notion of signal to noise ratio for our problem turns out to be the ratio of the minimum separation between the true centers and the maximum variance along any direction. We denote this ratio by $C = \min_{i \ne j} \|\mu_i^* - \mu_j^*\|/\sigma$. For convenience, we also denote $\|\mu_i^* - \mu_j^*\|/\sigma$ by $C_{ij}$. Here and in the rest of the paper, $\|y\|$ is the Euclidean norm of a vector $y$. We use $\eta$ to denote the learning rate of the streaming updates and $\mu_i^t$ to denote the estimate of $\mu_i^*$ at time $t$.

Remarks. For a cleaner presentation, we assume that all the mixture weights are $1/k$, but our results hold with general weights as long as an appropriate center separation condition is satisfied. Secondly, our proofs also go through when the Gaussians have different variances $\sigma_i^2$, as long as the separation conditions are satisfied with $\sigma = \max_i \sigma_i$. We furnish details in the full version of this paper [14].

3 Algorithm and Main Result

In this section, we describe our proposed streaming clustering algorithm and present our analysis of the algorithm. At a high level, we follow the approach of various recent results for (offline) mixture recovery algorithms [18, 12]. That is, we initialize the algorithm with an SVD style operation which de-noises the data significantly (Algorithm 1) and then apply our streaming version of Lloyd's heuristic (Algorithm 2). Note that the Lloyd's algorithm is agnostic to the underlying distribution and does not include distribution specific terms like the variance. Intuitively, the initialization algorithm first computes an online batch PCA in the for-loop. After this step, we perform an offline distance-based clustering on the projected subspace (akin to Vempala and Wang [18] for the offline algorithm). Note that since we only need estimates for centers within a suitable proximity of the true centers, this step only uses a few (roughly $k\log k$) samples. These centers are fed as the initial centers for the streaming update algorithm. The streaming algorithm then, for each new sample, updates the current center which is closest to the sample, and iterates.

[Figure 1: Illustration of the optimal k-means error.]

Algorithm 1 Init_Alg($N_0$)
  $U \leftarrow$ random orthonormal matrix in $\mathbb{R}^{d \times k}$; $B = \Theta(d \log d)$, $S = 0$
  for $t = 1$ to $N_0 - k\log k$ do
    if $\mathrm{mod}(t, B) = 0$ then $U \leftarrow \mathrm{QR}(S\,U)$, $S \leftarrow 0$ end if
    Receive $x^t$ as generated by the input stream
    $S \leftarrow S + x^t (x^t)^T$
  end for
  $X_0 = [x^{N_0 - k\log k + 1}, \ldots, x^{N_0}]$
  Form the nearest-neighbor graph using $U^T X_0$ and find its connected components
  $[\nu_1^0, \ldots, \nu_k^0] \leftarrow$ mean of the points in each component
  Return: $[\mu_1^0, \ldots, \mu_k^0] = [U\nu_1^0, \ldots, U\nu_k^0]$

Algorithm 2 Stream_KMeans($N$, $N_0$)
  1: Set $\eta \leftarrow \frac{3k\log(3N)}{N}$.
  2: Set $\{\mu_1^0, \ldots, \mu_k^0\} \leftarrow$ Init_Alg($N_0$).
  3: for $t = 1$ to $N$ do
  4:   Receive $x^{t+N_0}$ given by the input stream
  5:   $x = x^{t+N_0}$
  6:   Let $i_t = \arg\min_i \|x - \mu_i^{t-1}\|$.
  7:   Set $\mu_{i_t}^t = (1-\eta)\mu_{i_t}^{t-1} + \eta x$
  8:   Set $\mu_i^t = \mu_i^{t-1}$ for $i \ne i_t$
  9: end for
  10: Output: $\mu_1^N, \ldots, \mu_k^N$
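To make the streaming update concrete, here is a minimal Python sketch of the streaming Lloyd's update loop of Algorithm 2. The function name, the NumPy-based interface, and the synthetic-data usage at the bottom are our own illustrative choices and not part of the paper; in particular, we seed the sketch near the true means for simplicity, whereas the paper obtains the seed from Init_Alg.

```python
import numpy as np

def stream_kmeans(stream, init_centers, N):
    """Streaming Lloyd's update (a sketch of the update loop of Algorithm 2).

    stream       : iterator yielding one d-dimensional sample at a time
    init_centers : (k, d) array, assumed to come from an Init_Alg-style seeding
    N            : number of streaming samples to process
    """
    mu = np.array(init_centers, dtype=float)        # current center estimates
    k = mu.shape[0]
    eta = 3.0 * k * np.log(3.0 * N) / N             # learning rate from Algorithm 2

    for _ in range(N):
        x = next(stream)                            # receive the next point
        i = np.argmin(np.linalg.norm(mu - x, axis=1))  # closest current center
        mu[i] = (1.0 - eta) * mu[i] + eta * x       # update only that center
    return mu

# Illustrative usage on synthetic data (hypothetical parameter choices).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k, N, sigma = 20, 3, 200_000, 1.0
    true_means = 10.0 * rng.standard_normal((k, d))    # well separated w.h.p.

    def gen():
        while True:
            j = rng.integers(k)                        # uniform weights w_i = 1/k
            yield true_means[j] + sigma * rng.standard_normal(d)

    # Here we cheat and seed near the true means; the paper uses Init_Alg instead.
    init = true_means + 0.5 * rng.standard_normal((k, d))
    est = stream_kmeans(gen(), init, N)
    print(np.linalg.norm(est - true_means, axis=1))
```

Each step performs one nearest-center search and one vector update, which matches the $O(dk)$ per-iteration time complexity discussed in the paper.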
We now present our main result for the streaming clustering problem.

Theorem 1. Let $x^t$, $1 \le t \le N + N_0$, be generated using a mixture of Gaussians (1) with $w_i = 1/k$ for all $i$. Let $N_0, N \ge O(1)\,k^3 d^3 \log d$ and $C \ge \Omega((k\log k)^{1/4})$. Then the mean estimates $(\mu_1^N, \ldots, \mu_k^N)$ output by Algorithm 2 satisfy the following error bound:

$$\mathbb{E}\Big[\sum_i \|\mu_i^N - \mu_i^*\|^2\Big] \;\le\; \underbrace{\frac{C^2 k\sigma^2}{N^{\Omega(1)}}}_{\text{bias}} \;+\; \underbrace{\frac{O(k^3)\, d\sigma^2\log N}{N}}_{\text{variance}} \;+\; \underbrace{O(k^3)\exp(-C^2/8)(C^2+k)\,\sigma^2}_{\text{offline $k$-means error}}.$$

Our error bound consists of three key terms: bias, variance, and offline k-means error, with bias and variance being standard statistical error terms: (i) the bias depends on the initial estimation error and goes down at an $N^{-\zeta}$ rate, where $\zeta > 1$ is a large constant; (ii) the variance error is the error due to noise in each observation $x^t$ and goes down at the nearly optimal rate of $\sigma^2\frac{d}{N}$, albeit with an extra $\log N$ term as well as a worse dependence on $k$; and (iii) an offline k-means error, which is the error that even the offline Lloyd's algorithm would incur for a given center separation $C$. Note that while sampling from the mixture distribution, an $\exp(-C^2/8)$ fraction of data points can be closer to the true means of other clusters rather than to their own mean, because the tails of the distributions overlap. Hence, in general it is not possible to assign these points back to the correct cluster without any modeling assumptions. These misclassified points will shift the estimated centers along the line joining the means; see Figure 1 for an illustration. This error can however be avoided by performing soft updates, which is discussed in Section 6.

Time, space, and sample complexity: Our algorithm has a nearly optimal time complexity of $O(dk)$ per iteration; the initialization algorithm requires about $O(d^4 k^3)$ time. The space complexity of our algorithm is $O(dk\log k)$, which is also nearly optimal. Finally, the sample complexity is $O(d^3 k^3)$, which is a loose upper bound and can be significantly improved by a more careful analysis. To compare, the best known sample complexity for the offline setting is $O(kd)$ [2], which is better by a factor of $(dk)^2$.

Analysis Overview. The proof of Theorem 1 essentially follows from the two theorems stated below: (a) an analysis of the streaming updates given a good initialization; (b) an analysis of Init_Alg showing that it provides such an initialization.

Theorem 2 (Streaming Update). Let $x^t$, $N_0 + 1 \le t \le N + N_0$, be generated using a mixture of Gaussians (1) with $w_i = 1/k$ for all $i$, and $N = \Omega(k^3 d^3 \log kd)$. Also, let the center separation satisfy $C \ge \Omega(\sqrt{\log k})$, and suppose our initial centers $\mu_i^0$ are such that $\|\mu_i^0 - \mu_i^*\| \le \frac{C\sigma}{20}$ for all $1 \le i \le k$. Then the streaming updates of Stream_KMeans($N$, $N_0$), i.e., Steps 3-8 of Algorithm 2, satisfy:

$$\mathbb{E}\Big[\sum_i \|\mu_i^N - \mu_i^*\|^2\Big] \;\le\; \frac{C^2 k\sigma^2}{N^{\Omega(1)}} \;+\; O(k^3)\exp(-C^2/8)(C^2+k)\,\sigma^2 \;+\; \frac{O(k^3)\, d\sigma^2\log N}{N}.$$

Note that our streaming update analysis requires only $C = \Omega(\sqrt{\log k})$ separation, but it needs an appropriate initialization, which is guaranteed by the result below.

Theorem 3 (Initialization). Let $x^t$, $1 \le t \le N_0$, be generated using a mixture of Gaussians (1) with $w_i = 1/k$ for all $i$. Let $\mu_1^0, \mu_2^0, \ldots, \mu_k^0$ be the output of Algorithm 1. If $C = \Omega((k\log k)^{1/4})$ and $N_0 = \Omega(d^3 k^3 \log dk)$, then w.p. $\ge 1 - 1/\mathrm{poly}(k)$ we have $\max_i \|\mu_i^0 - \mu_i^*\| \le \frac{C\sigma}{20}$.

4 Streaming Update Analysis

At a high level, our analysis shows that at each step of the streaming updates, the error decreases on average. However, due to the non-convexity of the objective function, we can show such a decrease only if the current estimates of our centers lie in a small ball around the true centers of the Gaussians.
Indeed, while the initialization provides us with such centers, due to the added noise in each step, our iterates may occasionally fall outside these balls, and we need to bound the probability that this happens. To overcome this, we start with initial centers that are within slightly smaller balls around the true means, and use a careful martingale argument to show that even if the iterates go a little farther from the true centers (due to noise), with high probability the iterates are still within the slightly larger balls that we require to show a decrease in error. We therefore divide our proof into two parts: (a) first we show in Section 4.1 that the error decreases in expectation, assuming that the current estimates lie in a reasonable neighborhood around the true centers; and (b) in Section 4.2 we show using a martingale analysis that with high probability, each iterate satisfies the required neighborhood condition if the initialization is good enough. We formalize the required condition for our per-iteration error analysis below. For the remainder of this section, we fix the initialization and only focus on Steps 3-8 of Algorithm 2.

Definition 1. For a fixed initialization, and given a sequence of points $\omega_t = (x^{t'+N_0+1} : 0 \le t' < t)$, we say that condition $\mathcal{I}_t$ is satisfied at time $t$ if $\max_i \|\mu_i^{t'} - \mu_i^*\| \le \frac{C\sigma}{10}$ holds for all $0 \le t' \le t$.

Note that given a sequence of points and a fixed initialization, Algorithm 2 is deterministic. We now define the following quantities which will be useful in the upcoming analysis. At any time $t \ge 1$, let $\omega_t = (x^{t'+N_0+1} : 0 \le t' < t)$ denote the sequence of points received by our algorithm. For all $t \ge 0$, let $\tilde E_i^t = \|\mu_i^t - \mu_i^*\|^2$ denote the random variable measuring the current error for cluster $i$, and let $\tilde V_t = \max_i \tilde E_i^t$ be the maximum cluster error at time $t$. Now, let $\hat E_i^{t+1} = \mathbb{E}_{x^{t+N_0+1}}\big[\|\mu_i^{t+1} - \mu_i^*\|^2 \,\big|\, \omega_t\big]$ be the expected error of the $i$-th cluster center after receiving the $(t+1)$-th sample, conditioned on $\omega_t$. Finally, let $E_i^t = \mathbb{E}\big[\|\mu_i^t - \mu_i^*\|^2 \,\big|\, \mathcal{I}_t\big]$ be the expected error conditioned on $\mathcal{I}_t$, and let $E_t = \sum_i E_i^t$.

4.1 Error Reduction in a Single Iteration

Our main tool toward showing Theorem 2 is the following theorem, which bounds the expected error after updating the means on arrival of the next sample.

Theorem 4. If $\mathcal{I}_t$ holds and $C \ge \Omega(\sqrt{\log k})$, then for all $i$ we have

$$\hat E_i^{t+1} \;\le\; \Big(1 - \frac{\eta}{2k}\Big)\tilde E_i^t + \frac{\eta}{k^5}\tilde V_t + O(1)\,\eta^2 d\sigma^2 + O(k)\,\eta(1-\eta)\exp(-C^2/8)(C^2+k)\,\sigma^2.$$

Proof sketch of Theorem 4. In all calculations in this proof, we first assume that the candidate centers satisfy $\mathcal{I}_t$, and all expectations and probabilities are only over the new sample $x^{t+N_0+1}$, which we denote by $x$ after omitting the superscript. Now recall our update rule: $\mu_i^{t+1} = (1-\eta)\mu_i^t + \eta x$ if $\mu_i^t$ is the closest center to the new sample $x$; the other centers are unchanged. To simplify notation, let

$$g_i^t(x) = 1 \ \text{ iff } \ i = \arg\min_j \|x - \mu_j^t\|, \qquad g_i^t(x) = 0 \ \text{ otherwise.} \tag{2}$$

By definition, we have for all $i$, $\mu_i^{t+1} = (1-\eta)\mu_i^t + \eta\big[g_i^t(x)\,x + (1-g_i^t(x))\,\mu_i^t\big] = \mu_i^t + \eta\,g_i^t(x)\,(x - \mu_i^t)$.

Our proof relies on the following simple yet crucial lemmas. The first bounds the failure probability of a sample being closest to an incorrect cluster center among our candidates. The second shows that if the candidate centers are sufficiently close to the true centers, then the failure probability of misclassifying a point to a wrong center is (up to constant factors) the probability of misclassification even in the optimal solution (with true centers). Finally, the third lemma shows that the probability of $g_i^t(x) = 1$ for each $i$ is lower bounded.
Complete details and proofs appear in [14].

Lemma 1. Suppose condition $\mathcal{I}_t$ holds. For any $i$ and $j \ne i$, let $x \in \mathrm{Cl}(j)$ denote a random point from cluster $j$. Then $\Pr\big[\|x - \mu_i^t\| \le \|x - \mu_j^t\|\big] \le \exp(-\Omega(C_{ij}^2))$.

Lemma 2. Suppose $\max(\|\mu_i^t - \mu_i^*\|, \|\mu_j^t - \mu_j^*\|) \le \sigma/C_{ij}$. For any $i$ and $j \ne i$, let $x \in \mathrm{Cl}(j)$ denote a random point from cluster $j$. Then $\Pr\big[\|x - \mu_i^t\| \le \|x - \mu_j^t\|\big] \le O(1)\exp(-C_{ij}^2/8)$.

Lemma 3. If $\mathcal{I}_t$ holds and $C = \Omega(\sqrt{\log k})$, then for all $i$, $\Pr\big[g_i^t(x) = 1\big] \ge \frac{1}{2k}$.

Equipped with the above notation and lemmas, we have

$$\hat E_i^{t+1} = \mathbb{E}_x\|\mu_i^{t+1} - \mu_i^*\|^2 = (1-\eta)^2\|\mu_i^t - \mu_i^*\|^2 + \eta^2\,\mathbb{E}\big\|g_i^t(x)(x - \mu_i^*) + (1-g_i^t(x))(\mu_i^t - \mu_i^*)\big\|^2$$
$$\qquad\qquad + 2\eta(1-\eta)\,\mathbb{E}\big[\big\langle \mu_i^t - \mu_i^*,\; g_i^t(x)(x - \mu_i^*) + (1-g_i^t(x))(\mu_i^t - \mu_i^*)\big\rangle\big]$$
$$\le \Big(1 - \frac{\eta}{2k}\Big)\tilde E_i^t + \underbrace{\eta^2\,\mathbb{E}\big[g_i^t(x)\,\|x - \mu_i^*\|^2\big]}_{T_1} + \underbrace{2\eta(1-\eta)\,\mathbb{E}\big[\big\langle \mu_i^t - \mu_i^*,\; g_i^t(x)(x - \mu_i^*)\big\rangle\big]}_{T_2}.$$

The last inequality holds because of the following line of reasoning: (i) firstly, the cross term in the second squared norm evaluates to 0 due to the product $g_i^t(x)(1-g_i^t(x))$; (ii) $\eta^2\,\mathbb{E}\big[(1-g_i^t(x))\|\mu_i^t - \mu_i^*\|^2\big] \le \eta^2\tilde E_i^t$; (iii) $2\eta(1-\eta)\,\mathbb{E}\big[\langle \mu_i^t - \mu_i^*, (1-g_i^t(x))(\mu_i^t - \mu_i^*)\rangle\big] = 2\eta(1-\eta)\tilde E_i^t\,\Pr[g_i^t(x) = 0] \le 2\eta(1-\eta)\tilde E_i^t\,(1 - \frac{1}{2k})$ by Lemma 3; and finally (iv) by collecting the terms with coefficient $\tilde E_i^t$.

The proof then roughly proceeds as follows: suppose, in an ideal case, $g_i^t(x)$ is 1 for all points $x$ generated from cluster $i$, and 0 otherwise. Then, if $x$ is a random sample from cluster $i$, $T_1$ would be (up to the $\eta^2$ factor) $d\sigma^2$, and $T_2$ would be 0. Of course, the difficulty is that $g_i^t(x)$ is not always as well behaved, and so the bulk of the analysis is in carefully using Lemmas 1 and 2, and appropriately charging the various error terms we get to the current error $\tilde E_i^t$, the variance, and the offline approximation error.

4.2 Ensuring the Proximity Condition via Super-Martingales

In the previous section, we saw that condition $\mathcal{I}_t = 1$ is sufficient to ensure that the expected one-step error reduces at time step $t+1$. Our next result shows that $\mathcal{I}_N = 1$ is satisfied with high probability.

Theorem 5. Suppose $\max_i \|\mu_i^0 - \mu_i^*\| \le \frac{C\sigma}{20}$. Then $\mathcal{I}_N = 1$ w.p. $\ge 1 - \frac{1}{\mathrm{poly}(N)}$.

Our argument proceeds as follows. Suppose we track the behaviour of the actual error terms $\tilde E_i^t$ over time, and stop the process (call it a failure) when any of these error terms exceeds $C^2\sigma^2/100$ (recall that they are all initially smaller than $C^2\sigma^2/400$). Assuming that the process has not stopped, we show that each of these error terms has a super-martingale behaviour using Theorem 4, which says that on average, the expected one-step error drops. Moreover, we also show that the actual one-step difference, while not bounded, has a sub-Gaussian tail. Our theorem now follows by using an Azuma-Hoeffding type inequality for super-martingale sequences [15].

4.3 Wrapping Up

Now, using Theorems 4 and 5, we can get the following theorem.

Theorem 6. Let $\gamma = O(k)\,\eta^2 d\sigma^2 + O(k^2)\,\eta(1-\eta)\exp(-C^2/8)(C^2+k)\,\sigma^2$. Then if $C \ge \Omega(\sqrt{\log k})$, for all $t$ we have $E_{t+1} \le (1 - \frac{\eta}{4k})E_t + \gamma$. It follows that $E_N \le (1 - \frac{\eta}{4k})^N E_0 + \frac{4k\gamma}{\eta}$.

Proof. Let $\bar E_i^{t+1} = \mathbb{E}\big[\|\mu_i^{t+1} - \mu_i^*\|^2 \,\big|\, \mathcal{I}_t\big]$ be the average over all sample paths of $\tilde E_i^{t+1}$ conditioned on $\mathcal{I}_t$. Recall that $E_i^{t+1}$ is very similar, except that the conditioning is on $\mathcal{I}_{t+1}$. With this notation, let us take the expectation over all sample paths where $\mathcal{I}_t$ is satisfied, and use Theorem 4 to get

$$\bar E_i^{t+1} \le \Big(1 - \frac{\eta}{2k}\Big)E_i^t + \frac{\eta}{k^5}E_t + O(1)\,\eta^2 d\sigma^2 + O(k)\,\eta(1-\eta)\exp(-C^2/8)(C^2+k)\,\sigma^2.$$

Summing over all $i$, we get

$$\bar E_{t+1} \le \Big(1 - \frac{\eta}{3k}\Big)E_t + O(k)\,\eta^2 d\sigma^2 + O(k^2)\,\eta(1-\eta)\exp(-C^2/8)(C^2+k)\,\sigma^2.$$

Finally, note that $E_{t+1}$ and $\bar E_{t+1}$ are related as $E_{t+1}\Pr[\mathcal{I}_{t+1}] \le \bar E_{t+1}\Pr[\mathcal{I}_t]$, and so $E_{t+1} \le \bar E_{t+1}(1 + \frac{1}{N^2})$ since $\Pr[\mathcal{I}_{t+1}] \ge 1 - 1/N^5$ by Theorem 5.
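To see how Theorem 6 yields the bias and variance terms in Theorem 2, it helps to unroll the recursion with the learning rate $\eta = \frac{3k\log(3N)}{N}$ set in Algorithm 2. The following back-of-the-envelope calculation is our own illustration and ignores constants:

$$E_N \;\le\; \Big(1 - \frac{\eta}{4k}\Big)^N E_0 + \frac{4k\gamma}{\eta} \;\le\; e^{-\eta N/4k}\,E_0 + \frac{4k\gamma}{\eta} \;=\; (3N)^{-3/4}\,E_0 + \frac{4k\gamma}{\eta},$$

and, plugging in the expression for $\gamma$,

$$\frac{4k\gamma}{\eta} = O(k^2)\,\eta\, d\sigma^2 + O(k^3)(1-\eta)\exp(-C^2/8)(C^2+k)\,\sigma^2 = O\!\Big(\frac{k^3 d\sigma^2\log(3N)}{N}\Big) + O(k^3)\exp(-C^2/8)(C^2+k)\,\sigma^2.$$

The first term is the $N^{-\Omega(1)}$ bias term (since $E_0 \le k C^2\sigma^2/400$ under the initialization condition), the second is the variance term, and the third is the offline k-means error.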
Proof of Theorem 2. From Theorem 5 we know that the probability of $\mathcal{I}_N$ being satisfied is at least $1 - 1/N^5$, and in this case we can use Theorem 6 to get the desired error bound. In case $\mathcal{I}_N$ fails, the maximum possible error is roughly $\max_{i,j}\|\mu_i^* - \mu_j^*\|^2\, N$ (when all our samples are sent to the same cluster), which contributes a negligible amount to the bias term.

5 Initialization for streaming k-means

In Section 4 we saw that our proposed streaming algorithm can lead to a good solution for any separation $C\sigma \ge \Omega(\sqrt{\log k})\,\sigma$, provided we can initialize all centers such that $\|\mu_i^0 - \mu_i^*\| \le \frac{C\sigma}{20}$. We now show that Init_Alg (Algorithm 1) is one such procedure. We first approximately compute the top-$k$ eigenvectors $U$ of the data covariance using a streaming PCA algorithm [9, 13] on $O(k^3 d^3 \log d)$ samples. We next store $k\log k$ points and project them onto the subspace spanned by $U$. We then perform a simple distance based clustering [18] that correctly clusters the stored points (assuming reasonable center separation), and finally we output these cluster centers.

Proof of Theorem 3. Using an argument similar to [9] (Theorem 3), we get that the $U$ obtained by the online PCA algorithm (Steps 1-4 of Algorithm 1) satisfies (w.p. $\ge 1 - 1/\mathrm{poly}(d)$):

$$\|UU^T\mu_i^* - \mu_i^*\|^2 \le .01\,\sigma^2, \qquad 1 \le i \le k. \tag{3}$$

Now, let $\hat\mu_i^* = U^T\mu_i^*$. For any $x$ sampled from the mixture distribution (1), $U^T x \sim \sum_i w_i\,\mathcal{N}(\hat\mu_i^*, \sigma^2 I)$. Hence, if $U^T x^t$, $U^T x^{t'}$ both belong to cluster $i$, then (w.p. $\ge 1 - 1/k^{\alpha}$):

$$\|U^T x^t - U^T x^{t'}\|^2 = \|U^T(z^t - z^{t'})\|^2 \le 2\big(k + 8\alpha\sqrt{k\log k}\big)\sigma^2, \tag{4}$$

where $x^t = \mu_i^* + z^t$ and $x^{t'} = \mu_i^* + z^{t'}$. The last inequality above follows by using a standard $\chi^2$ random variable tail bound. Similarly, if $U^T x^t$, $U^T x^{t'}$ belong to clusters $i$ and $j$, i.e., $x^t = \mu_i^* + z^t$ and $x^{t'} = \mu_j^* + z^{t'}$, then (w.p. $\ge 1 - 1/k^{\alpha}$):

$$\|U^T x^t - U^T x^{t'}\|^2 = \|\hat\mu_i^* - \hat\mu_j^*\|^2 + \|U^T(z^t - z^{t'})\|^2 + 2(\hat\mu_i^* - \hat\mu_j^*)^T U^T(z^t - z^{t'}) \ge \big(C^2 - .2C + 2k - 8\alpha\sqrt{k\log k} - 16\alpha C\sqrt{\log k}\big)\sigma^2, \tag{5}$$

where the above equation follows by using (3), setting $\alpha = C/32$, and using $C = \Omega((k\log k)^{1/4})$. Using (4) and (5), w.h.p. all the points from the same cluster are closer to each other than to points from other clusters. Hence, the connected components of the nearest-neighbor graph recover the clusters accurately. Now, we estimate $\hat\mu_i = \frac{1}{|\mathrm{Cluster}(i)|}\sum_{t \in \mathrm{Cluster}(i)} U^T x^t$ for each $i$. Since our clustering is completely accurate, we have (w.p. $\ge 1 - 2m^2/k^{C/32}$, where $m$ is the number of stored points):

$$\|\hat\mu_i - \hat\mu_i^*\| \le \frac{2\sigma\sqrt{\log k}}{\sqrt{|\mathrm{Cluster}(i)|}}. \tag{6}$$

As $w_i = 1/k$ for all $i$, $|\mathrm{Cluster}(i)| = \Omega(m/k)$ w.p. $\ge 1 - 1/k^{C/32}$. The theorem now follows by setting $m = O(k\log k)$ and by using (3) and (6) along with $C = \Omega((k\log k)^{1/4})$.

Remark 1. We would like to emphasize that our analysis for the convergence of the streaming algorithm works even for smaller separations, $C = \Omega(\sqrt{\log k})$, as long as we can get a good enough initialization. Hence, a better initialization algorithm with a weaker dependence of $C$ on $k$ would lead to an improvement in the overall algorithm.

6 Soft thresholding EM based algorithm

In this section, we study a streaming version of the Expectation Maximization (EM) algorithm [7], which is also used extensively in practice. While the standard k-means or Lloyd's heuristic is known to be agnostic to the distribution, and the same procedure can solve the mixture problem for a variety of distributions [12], EM algorithms are designed specifically for the input mixture distribution. In this section, we consider a streaming version of the EM algorithm when applied to the problem of a mixture of two spherical Gaussians with known variances. In this case, the EM algorithm reduces to a softer version of the Lloyd's algorithm where a point can be partially assigned to the two clusters.
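Before stating the algorithm, here is a minimal Python sketch of this soft update for the two-component case. The weight computed below is the posterior probability, under the current estimate $\pm\mu^t$ and known $\sigma$, that a sample came from the $+\mu^t$ component, evaluated through its log-odds for numerical stability. The function name and interface are our own illustrative choices rather than the paper's code; Algorithm 3 below states the streaming procedure precisely.

```python
import numpy as np

def stream_soft_update(stream, mu0, sigma, N):
    """Sketch of a streaming soft (EM-style) update for a balanced mixture of two
    spherical Gaussians at +mu* and -mu* with known variance sigma^2.
    The interface is illustrative; see Algorithm 3 for the precise procedure."""
    mu = np.array(mu0, dtype=float)
    eta = 3.0 * np.log(N) / N                    # learning rate used in Algorithm 3
    for _ in range(N):
        x = next(stream)
        # E-step: posterior weight that x belongs to the +mu component.
        # log p(+|x) - log p(-|x) = (||x+mu||^2 - ||x-mu||^2) / (2 sigma^2) = 2<x,mu>/sigma^2
        logit = 2.0 * np.dot(x, mu) / sigma**2
        w = 0.5 * (1.0 + np.tanh(0.5 * logit))   # sigmoid(logit), computed stably
        # M-step (streaming): move the estimate towards (2w - 1) * x.
        mu = (1.0 - eta) * mu + eta * (2.0 * w - 1.0) * x
    return mu
```

When the weight saturates to 0 or 1, the update moves $\mu$ towards $-x$ or $x$ respectively, which is exactly a hard (Lloyd-style) assignment for the two-component case; the soft weight interpolates between these two extremes.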
Recent results by [6, 3, 19] show convergence of the EM algorithm in the offline setting for this simple setup. In keeping with earlier notation, let $\mu_1^* = \mu^*$ and $\mu_2^* = -\mu^*$, and let the center separation be $C = 2\|\mu^*\|/\sigma$. Hence, $x^t \stackrel{\text{i.i.d.}}{\sim} \frac{1}{2}\mathcal{N}(\mu^*, \sigma^2 I) + \frac{1}{2}\mathcal{N}(-\mu^*, \sigma^2 I)$.

Algorithm 3 Stream_Soft_Update($N$, $N_0$)
  Set $\eta = \frac{3\log N}{N}$.
  Set $\mu^0 \leftarrow$ Init_Alg($N_0$).
  for $t = 1$ to $N$ do
    Receive $x^{t+N_0}$ as generated by the input stream.
    $x = x^{t+N_0}$
    Let $w_t = \dfrac{\exp\big(-\|x - \mu^t\|^2/2\sigma^2\big)}{\exp\big(-\|x - \mu^t\|^2/2\sigma^2\big) + \exp\big(-\|x + \mu^t\|^2/2\sigma^2\big)}$
    Set $\mu^{t+1} = (1-\eta)\mu^t + \eta\,[2w_t - 1]\,x$.
  end for

In our algorithm, $w_t(x)$ is an estimate of the probability that $x$ belongs to the cluster with center $\mu^t$, given that it is drawn from a balanced mixture of Gaussians at $\mu^t$ and $-\mu^t$. Calculating $w_t(x)$ corresponds to the E-step, and updating the estimate of the centers corresponds to the M-step of the EM algorithm. Similar to the streaming Lloyd's algorithm presented in Section 3, our analysis of the streaming soft updates can be separated into a streaming update analysis and an analysis of Init_Alg (which is already presented in Section 5). We now provide our main theorem; the proof is presented in Appendix C.

Theorem 7 (Streaming Update). Let $x^t$, $1 \le t \le N + N_0$, be generated using a mixture of two balanced spherical Gaussians with variance $\sigma^2$. Also, let the center separation satisfy $C \ge 4$, and suppose our initial estimate $\mu^0$ is such that $\|\mu^0 - \mu^*\| \le \frac{C\sigma}{20}$. Then the streaming updates of Stream_Soft_Update($N$, $N_0$), i.e., Steps 3-8 of Algorithm 3, satisfy:

$$\mathbb{E}\big[\|\mu^N - \mu^*\|^2\big] \;\le\; \underbrace{\frac{\|\mu^*\|^2}{N^{\Omega(1)}}}_{\text{bias}} \;+\; \underbrace{\frac{O(1)\, d\sigma^2\log N}{N}}_{\text{variance}}.$$

Remark 2. Our bias and variance terms are similar to the ones in Theorem 1, but the above bound does not have the additional approximation error term. Hence, in this case we can estimate $\mu^*$ consistently, but the algorithm applies only to a mixture of Gaussians, while our algorithm and result in Section 3 can potentially be applied to arbitrary sub-Gaussian mixtures.

Remark 3. We note that for our streaming soft update algorithm, it is not critical to know the variance $\sigma^2$ beforehand. One could get a good estimate of $\sigma$ by taking the mean of a random projection of a small number of points. We provide the details in the full version of this paper [14].

7 Conclusions

In this paper, we studied the problem of clustering with streaming data where each data point is sampled from a mixture of spherical Gaussians. For this problem, we study two algorithms that use an appropriate initialization: (a) a streaming version of Lloyd's method, and (b) a streaming EM method. For both methods we show that we can accurately initialize the cluster centers using an online PCA based method. We then show that, assuming $\Omega((k\log k)^{1/4})\,\sigma$ separation between the cluster centers, the updates of both methods lead to a decrease in both the bias and the variance error terms. For Lloyd's method there is an additional estimation error term, which even the offline algorithm incurs, and which is avoided by the EM method. However, the streaming Lloyd's method is agnostic to the data distribution and can in fact be applied to any mixture of sub-Gaussians problem. For future work, it would be interesting to study the streaming data clustering problem under deterministic assumptions like [12, 16]. Also, it is an important question to understand the optimal separation assumptions needed for even the offline Gaussian mixture clustering problem.

References

[1] Anima Anandkumar, Rong Ge, Daniel J. Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models (a survey for ALT). In Proceedings of ALT, pages 19-38, 2015.
[2] Hassan Ashtiani, Shai Ben-David, and Abbas Mehrabian. Sample-efficient learning of mixtures. arXiv preprint arXiv:1706.01596, 2017.
[3] Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. Annals of Statistics, 45(1):77-120, 2014.
[4] Maria-Florina Balcan, Avrim Blum, and Anupam Gupta. Clustering under approximation stability. J. ACM, 60(2):8:1-8:34, 2013.
[5] Anirban Dasgupta, John Hopcroft, Ravi Kannan, and Pradipta Mitra. Spectral clustering with limited independence. In Proceedings of SODA, pages 1036-1045, 2007.
[6] Constantinos Daskalakis, Christos Tzamos, and Manolis Zampetakis. Ten steps of EM suffice for mixtures of two Gaussians. arXiv preprint arXiv:1609.00368, 2016.
[7] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, pages 1-38, 1977.
[8] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and Sons, 2000.
[9] Moritz Hardt and Eric Price. The noisy power method: A meta algorithm with applications. In Proceedings of NIPS, pages 2861-2869, 2014.
[10] Daniel J. Hsu and Sham M. Kakade. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In Proceedings of ITCS '13, pages 11-20, 2013.
[11] Prateek Jain, Chi Jin, Sham M. Kakade, Praneeth Netrapalli, and Aaron Sidford. Streaming PCA: Matching matrix Bernstein and near-optimal finite sample guarantees for Oja's algorithm. In Proceedings of COLT, pages 1147-1164, 2016.
[12] Amit Kumar and Ravindran Kannan. Clustering with spectral norm and the k-means algorithm. In Proceedings of FOCS, pages 299-308, 2010.
[13] Ioannis Mitliagkas, Constantine Caramanis, and Prateek Jain. Memory limited, streaming PCA. In Proceedings of 27th NIPS, pages 2886-2894, 2013.
[14] Aditi Raghunathan, Ravishankar Krishnaswamy, and Prateek Jain. Learning mixture of Gaussians with streaming data. CoRR, abs/1707.02391, 2017.
[15] Ohad Shamir. A variant of Azuma's inequality for martingales with subgaussian tails. arXiv preprint arXiv:1110.2392, 2011.
[16] Cheng Tang and Claire Monteleoni. On Lloyd's algorithm: New theoretical insights for clustering in practice. In Proceedings of AISTATS, pages 1280-1289, 2016.
[17] Cheng Tang and Claire Monteleoni. Convergence rate of stochastic k-means. In Proceedings of AISTATS, 2017.
[18] Santosh Vempala and Grant Wang. A spectral algorithm for learning mixture models. J. Comput. Syst. Sci., 68(4):841-860, 2004.
[19] Ji Xu, Daniel J. Hsu, and Arian Maleki. Global analysis of expectation maximization for mixtures of two Gaussians. In Advances in Neural Information Processing Systems, pages 2676-2684, 2016.