# Simple and Sharp Analysis of k-means||

Václav Rozhoň (ETH Zurich)

Abstract. We present a simple analysis of k-means|| (Bahmani et al., PVLDB 2012), a distributed variant of the k-means++ algorithm (Arthur and Vassilvitskii, SODA 2007). Moreover, the bound on the number of rounds is improved from O(log n) to O(log n / log log n), which we show to be tight.

1. Introduction

Clustering is one of the classical machine learning problems. Arguably the simplest and most basic formalization of clustering is the k-means formulation: we are given a (large) set of points X in Euclidean space and are asked to find a (small) set of k centers C so as to minimize the sum of the squared distances between each point and its closest center. That is, we minimize

    ∑_{x∈X} min_{c_i∈C} ‖x − c_i‖².

Due to its simplicity, k-means is often considered the problem that tests our understanding of clustering.

The classical, yet still state-of-the-art, algorithm k-means++ (Arthur & Vassilvitskii, 2007a) combines two ideas to approach the problem. First, a fast randomized procedure finds a set of k centers that by itself is known to be O(log k) competitive in expectation with respect to the optimal solution. Then, the classical Lloyd's algorithm (Lloyd, 1982) is run to improve the found solution until a local minimum is reached.

A significant disadvantage of k-means++ is the inherently sequential nature of the first, seeding step: one needs to pass through the whole data set k times, each time to sample a single center. To overcome this problem, (Bahmani et al., 2012) devised k-means||, a distributed version of the k-means++ algorithm. In their algorithm one passes through the data set only a few times to extract a set of roughly O(k) candidate centers, from which one later chooses the final k centers by means of the classical k-means++ algorithm.

Our contribution. In this work we first provide a new, simple analysis of k-means||, thus simplifying the known proofs of (Bahmani et al., 2012) and (Bachem et al., 2017). In particular, if we denote by µ_X the mean of X and by ϕ* the cost of the optimal solution, we prove in Section 3 that O(log(ϕ_X(µ_X)/ϕ*)) rounds of the k-means|| algorithm suffice to get an expected constant approximation guarantee. This matches the known guarantees. In Section 4 we then refine the analysis to provide a better bound on the number of sampling rounds needed by the algorithm: we prove that O(log(ϕ_X(µ_X)/ϕ*) / log log(ϕ_X(µ_X)/ϕ*)) rounds suffice. In Section 5 the second bound is shown to be tight for a wide range of values of k by generalising a known lower bound (Bachem et al., 2017).

The first analysis of k-means|| (Bahmani et al., 2012) invokes linear programming duality as a part of the argument. The second analysis (Bachem et al., 2017) is more similar to ours, as it relies only on basic lemmas known from the analysis of k-means++ (Arthur & Vassilvitskii, 2007b). Our one-page analysis is considerably shorter and, we believe, also simpler. It can be summed up as "view one sampling step of the algorithm as a weighted balls-into-bins process". We explain this in more detail in Section 3.

2. Background and Notation

We mostly adopt the notation of the paper (Bahmani et al., 2012). Let X = {x_1, ..., x_n} be a point set in the d-dimensional Euclidean space.
We denote the standard Euclidean distance between two points x_i, x_j by ‖x_i − x_j‖, and for a subset Y ⊆ X we define the distance between Y and x ∈ X as d(x, Y) = min_{y∈Y} ‖x − y‖. For a subset Y ⊆ X we denote by µ(Y) its centroid, i.e., µ(Y) = (1/|Y|) ∑_{y∈Y} y. For a set of points C = {c_1, ..., c_k} and Y ⊆ X we define the cost of Y with respect to C as ϕ_Y(C) = ∑_{y∈Y} d²(y, C), and we use the shorthands ϕ_x(C) for ϕ_{{x}}(C) and ϕ_X(c) for ϕ_X({c}). It is easy to check that for a given point set A, the center x that minimizes its cost ϕ_A(x) is its mean µ_A.

The goal of the k-means problem is to find a set of centers C, |C| = k, that minimizes the cost ϕ_X(C) for a given set of points X. The k-means problem is known to be NP-hard (Aloise et al., 2009; Mahajan et al., 2009) and even hard to approximate up to a small constant precision (Awasthi et al., 2015; Lee et al., 2017).

From now on we fix an optimal solution C* and denote by ϕ* its cost. We assume that the minimum distance between two points x ≠ y from X is at least 1 and that the maximum distance ∆ between two points from X satisfies log ∆ = O(log n); this implies ϕ_X(µ_X) = poly(n) and hence the simplified complexities O(log n) and O(log n / log log n) in the abstract.

2.1. k-means++ Algorithm

The classical k-means++ algorithm (Arthur & Vassilvitskii, 2007b; Ostrovsky et al., 2013) computes the centers in k sampling steps. After the first step, where the first center is taken from the uniform distribution, each subsequent step samples a new point from the D²-distribution: if C is the current set of centers, x is sampled with probability ϕ_x(C)/ϕ_X(C), i.e., we sample the points proportionally to their current cost. The k-means++ algorithm is known to provide an O(log k) approximation guarantee in expectation.

The analysis crucially makes use of the following two lemmas that we will also use. The first lemma tells us that if we sample a uniformly random point p from some point set A, we expect the cost ϕ_A(p) to be within a constant factor of the cost ϕ_A(µ_A), which is the smallest cost achievable with one center. One can think of A as being a cluster of the optimal solution or the whole point set X.

Lemma 1 (Lemma 3.1 in (Arthur & Vassilvitskii, 2007a)). Let A be an arbitrary set of points. If we sample a random point p ∈ A according to the uniform distribution, we have E[ϕ_A({p})] ≤ 2ϕ_A(µ_A).

The second lemma ensures that, up to a constant factor, the same guarantee holds even for the D² distribution.

Lemma 2 (Lemma 3.2 in (Arthur & Vassilvitskii, 2007a)). Let A be an arbitrary set of points, C an arbitrary set of centers, and p ∈ A a point chosen by D² weighting. Then E[ϕ_A(C ∪ {p})] ≤ 8ϕ_A(µ_A).

The analysis of k-means++ follows from the above two lemmas by a careful inductive argument. The important observation is that sampling points proportionally to their cost implies that we sample from an optimal cluster with probability proportional to its current cost; hence we preferentially sample from costly clusters. Still, there is a small probability in each step that we sample from an already covered cluster, and this is the reason why the final approximation factor is O(log k) rather than O(1).
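To make the seeding step concrete, here is a minimal sketch of the sequential D² sampling described above (an illustrative numpy rendering of the idea, not the reference implementation of (Arthur & Vassilvitskii, 2007a); the function name and defaults are ours):

```python
import numpy as np

def d2_seeding(X, k, rng=None):
    """Illustrative k-means++ seeding: X is an (n, d) array, returns k centers."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                    # first center: uniform sample
    cost = np.sum((X - centers[0]) ** 2, axis=1)      # phi_x(C) for every point x
    for _ in range(k - 1):
        # sample the next center with probability phi_x(C) / phi_X(C)
        c = X[rng.choice(n, p=cost / cost.sum())]
        centers.append(c)
        # the cost of a point is the squared distance to its closest center so far
        cost = np.minimum(cost, np.sum((X - c) ** 2, axis=1))
    return np.array(centers)
```

Each of the k steps needs the up-to-date costs ϕ_x(C) of all points, which is exactly the sequential bottleneck that k-means|| removes by oversampling in a few rounds.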
2.2. k-means|| Algorithm

The distributed variant of the k-means++ algorithm, called k-means||, was introduced in (Bahmani et al., 2012). The algorithm consists of two parts. In the first, overseeding part (Algorithm 1), we proceed in t sequential rounds after sampling a single center uniformly, as in the first step of k-means++. In each of the t sampling rounds we sample each point of X with probability min(1, ℓ·ϕ_x(C)/ϕ_X(C)), i.e., ℓ ≥ 1 times bigger than the probability of taking the point in k-means++, independently of the other points.

Algorithm 1 k-means|| overseeding
Require: data X, number of rounds t, sampling factor ℓ
1: Uniformly sample x ∈ X and set C = {x}.
2: for i = 1, 2, ..., t do
3:   C' ← ∅
4:   for x ∈ X do
5:     Add x to C' with probability min(1, ℓ·ϕ_x(C)/ϕ_X(C))
6:   end for
7:   C ← C ∪ C'
8: end for
9: Return C

In the second part of the algorithm we collect the set of sampled centers C and create a new, weighted instance of k-means in which the weight of every center is equal to the number of points of X to which the given center is the closest. The new instance is solved, e.g., with k-means++, as in Algorithm 2.

Algorithm 2 k-means|| (Bahmani et al., 2012)
Require: data X, number of rounds t, sampling factor ℓ
1: B ← result of Algorithm 1 applied to (X, t, ℓ)
2: for c ∈ B do
3:   w_c ← number of points x ∈ X whose closest center in B is c
4: end for
5: C ← run k-means++ on the weighted instance (B, w)
6: Return C

One can prove that finding a set C with ϕ_X(C) = O(ϕ*) in Algorithm 1 implies that the overall approximation guarantee is, up to a constant, the same as the approximation guarantee of the algorithm used in the second part of the algorithm (see (Bachem et al., 2017), proof of Theorem 1, or (Guha et al., 2003) in general), which is, in this case, O(log k) in expectation. Hence, the analysis of Algorithm 2 boils down to bounding the number of steps t needed by Algorithm 1 to achieve a constant approximation guarantee for a given sampling factor ℓ. The authors of (Bahmani et al., 2012) prove the following.

Theorem 1 (roughly Theorem 1 in (Bahmani et al., 2012)). If we choose t = O(log(ϕ_X(µ_X)/ϕ*)) and ℓ ≥ k, Algorithm 1 gives a set C with E[ϕ_X(C)] = O(ϕ*).

Their result was later reproved in (Bachem et al., 2017). We provide a new, simple proof in Section 3.

2.3. Other Related Work

k-means++ was introduced in (Arthur & Vassilvitskii, 2007a) and a similar method was studied by (Ostrovsky et al., 2013). This direction led to approximation schemes (Jaiswal et al., 2012; 2015), constant approximation results based on additional local search (Lattanzi & Sohler, 2019; Choo et al., 2020), constant approximation bi-criteria results based on sampling more centers (Aggarwal et al., 2009; Ailon et al., 2009; Wei, 2016), approximate k-means++ based on Markov chains (Bachem et al., 2016b;a) or coresets (Bachem et al., 2018), and analyses of hard instances (Arthur & Vassilvitskii, 2007a; Brunsch & Röglin, 2013) or of the algorithm under adversarial noise (Bhattacharya et al., 2019). Consult (Celebi et al., 2013) for an overview of different seeding methods for k-means.

There is a long line of work on the related k-median problem. A k-median algorithm of (Mettu & Plaxton, 2012) is quite similar to k-means||, and its analysis is also quite similar to ours. We discuss some other related work on k-means more thoroughly at the end of Section 3.

3. Warm-up: a Simple Analysis

In this section we provide a simple analysis of Algorithm 1 based on viewing the process as a variant of the balls-into-bins problem. Recall that in the most basic version of the balls-into-bins problem, one throws k balls into k bins, each ball into a uniformly randomly chosen bin, and asks, e.g., what is the probability that a certain bin is hit by a ball. This is equal to 1 − (1 − 1/k)^k ≥ 1 − 1/e; hence, we expect a constant proportion of the bins to be hit in a single step.
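Throughout the analysis it helps to keep one sampling round of Algorithm 1 in mind. The following is a minimal sketch of the overseeding part (an illustrative numpy rendering under the notation above, not the implementation of (Bahmani et al., 2012); the function name is ours):

```python
import numpy as np

def kmeans_ll_overseeding(X, t, ell, rng=None):
    """Illustrative Algorithm 1: t rounds of oversampling with factor ell."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    C = X[[rng.integers(n)]]                          # uniform first center, shape (1, d)
    for _ in range(t):
        # phi_x(C): squared distance of every point to its closest current center
        cost = np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2), axis=1)
        # add each point independently with probability min(1, ell * phi_x(C) / phi_X(C))
        p = np.minimum(1.0, ell * cost / cost.sum())
        C = np.vstack([C, X[rng.random(n) < p]])
    return C
```

Every round is a single pass over the data (and is trivially parallelized), and the weighting step of Algorithm 2 then shrinks the roughly t·ℓ sampled centers back to k.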
To see the connection to our problem, we first define the notion of settled clusters, which is similar to notions used e.g. in (Aggarwal et al., 2009; Lattanzi & Sohler, 2019).

Definition 1 (Settled clusters). We call a cluster A of the optimum solution C* settled with respect to the current solution C if ϕ_A(C) ≤ 10·ϕ_A(C*). Otherwise, we call A unsettled with respect to C.

We view the clusters of C* as bins and each sampling round of Algorithm 1 as shooting at each bin and hitting it (i.e., making the cluster settled) with some probability. Intuitively, this probability is proportional to the cost of the cluster, since this is how we defined the probability of sampling any point of X. So, we view the process as a more general and repeated variant of the balls-into-bins process, where the costs of the clusters act like weights of the bins and we sample with probability (roughly) proportional to these weights.

We now prove that clusters are being settled with probability roughly proportional to their cost (unless they are very costly).

Lemma 3. Let C be the current set of sampled centers and let A be an unsettled cluster of the optimum solution. The cluster A is not made settled in the next iteration of Algorithm 1 with probability at most exp(−ℓ·ϕ_A(C)/(5ϕ_X(C))).

Intuitively, for clusters A with ϕ_A(C) ≤ ϕ_X(C)/ℓ the probability of hitting them in one step is of order ℓ·ϕ_A(C)/ϕ_X(C) (using that e^{−x} ≈ 1 − x for small positive x), while for more costly clusters the probability of hitting them is lower bounded by some constant.

Proof. If we sample a point c from A according to the D² weights, we have E[ϕ_A(C ∪ {c})] ≤ 8ϕ_A(C*) by Lemma 2. Hence, by Markov's inequality, A is made settled with probability at least 1 − 8/10 = 1/5. In other words, there is a subset of points A' ⊆ A with ϕ_{A'}(C) ≥ ϕ_A(C)/5 such that sampling a point from A' makes A settled. If A' contains a point x with ϕ_x(C) ≥ ϕ_X(C)/ℓ, we sample x and make A settled with probability 1. Otherwise, we have

    P(A does not get settled) ≤ ∏_{x∈A'} (1 − ℓ·ϕ_x(C)/ϕ_X(C)) ≤ exp(−∑_{x∈A'} ℓ·ϕ_x(C)/ϕ_X(C)) ≤ exp(−ℓ·ϕ_A(C)/(5ϕ_X(C))),

where we used 1 + x ≤ e^x and ϕ_{A'}(C) ≥ ϕ_A(C)/5.

Similarly to the classical balls-into-bins problem, we can now observe that the total cost of unsettled clusters drops by a constant factor in each step. From now on we simplify the notation and write ϕ^t_Y for the cost of the point set Y after t sampling rounds of Algorithm 1. Moreover, by ϕ^t_U we denote the total cost of the yet unsettled clusters after t sampling rounds.

Proposition 1 (roughly Theorem 2 in (Bahmani et al., 2012)). Suppose that ϕ^t_X ≥ 20ϕ*. For ℓ ≥ k we have E[ϕ^{t+1}_U] ≤ (1 − 1/50)·ϕ^t_U. In other words, the expected cost of the yet unsettled clusters drops by a constant factor in each iteration.

Proof. We split the unsettled clusters into two groups: a cluster A with ϕ^t_A ≥ ϕ^t_U/(2k) we call heavy, and otherwise we call it light. Note that the probability that a heavy cluster A is not settled in the (t+1)-th iteration is by Lemma 3 bounded by

    exp(−k·ϕ^t_A/(5ϕ^t_X)) ≤ exp(−ϕ^t_U/(10ϕ^t_X)) ≤ exp(−1/20),

where we used that ϕ^t_U ≥ ϕ^t_X/2: this holds since otherwise more than half of the cost ϕ^t_X would be formed by settled clusters, hence ϕ^t_X < 20ϕ*, contradicting our assumption ϕ^t_X ≥ 20ϕ*. Hence, after the sampling step, a heavy cluster A does not contribute to the overall cost of unsettled clusters with probability at least 1 − e^{−1/20} ≥ 1/25. This implies that the expected drop in the cost of unsettled clusters is at least

    ϕ^t_U − E[ϕ^{t+1}_U] ≥ ∑_{A heavy} ϕ^t_A/25 ≥ (ϕ^t_U − ∑_{A light} ϕ^t_A)/25 ≥ ϕ^t_U/50,

where we used that the light clusters have total cost of at most k·ϕ^t_U/(2k) = ϕ^t_U/2.
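As a quick illustration of Proposition 1, the following toy simulation (hypothetical cluster costs, not data from the paper) applies the settling-probability lower bound 1 − exp(−ℓ·ϕ_A/(5ϕ_X)) of Lemma 3 with ℓ = k over a few rounds:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 1000
# hypothetical costs of the k (initially unsettled) optimal clusters, heavy-tailed on purpose
costs = rng.pareto(a=1.5, size=k) + 1.0

for step in range(6):
    phi_U = costs.sum()                     # total cost of yet unsettled clusters
    if phi_U == 0:
        break
    # Lemma 3 with ell = k: cluster A stays unsettled with prob. at most exp(-k*phi_A/(5*phi_X));
    # for simplicity we approximate phi_X by phi_U, ignoring the residual cost of settled clusters
    p_settled = 1.0 - np.exp(-k * costs / (5.0 * phi_U))
    settled = rng.random(costs.size) < p_settled
    print(f"round {step}: unsettled clusters {costs.size}, cost {phi_U:.1f}")
    costs = costs[~settled]
```

On such inputs the unsettled cost shrinks at least geometrically, as Proposition 1 guarantees in expectation; the instance of Section 5 is designed so that only few clusters disappear per round.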
Theorem 1 now follows directly (Bahmani et al., 2012; Bachem et al., 2017); we prove it here for completeness.

Proof of Theorem 1. From Lemma 1 it follows that after we sample a uniformly random point, we have E[ϕ^0_X] ≤ 2ϕ_X(µ_X). Proposition 1 then gives E[ϕ^{t+1}_U] ≤ (49/50)·ϕ^t_U + 20ϕ*. Applying this result T times, we get

    E[ϕ^T_U] ≤ (49/50)^T·2ϕ_X(µ_X) + 20ϕ*·∑_{i=0}^{T−1} (49/50)^i ≤ (49/50)^T·2ϕ_X(µ_X) + 1000ϕ*.

Choosing T = O(log(ϕ_X(µ_X)/ϕ*)) and recalling that ϕ^T_X ≤ ϕ^T_U + 10ϕ* yields the desired claim.

3.1. Additional Remarks

For the sake of simplicity, we did not optimize the constants and analysed Algorithm 1 meaningfully only for the case ℓ = Θ(k). In the following remarks we note how one can extend this (or some previous) analysis and then use it to compare k-means|| more carefully to a recent line of work.

Remark 1. With more care, the approximation factor in Theorem 1 can be made arbitrarily close to 8. We omit the proof.

Remark 2. With more care, for general ℓ = αk one can prove that the number of steps of Algorithm 1 needed to sample a set of points that induces a cost of O(ϕ*) is O(log_α(ϕ_X(µ_X)/ϕ*)) for α ≥ 1 and O(log(ϕ_X(µ_X)/ϕ*)/α) for α ≤ 1. We omit the proof.

3.2. Similar Algorithms with Additive Error

Remark 2 allows us to make a closer comparison of k-means|| with a recent line of work (Bachem et al., 2016a; 2018) that aims for very fast algorithms allowing for an additive error of εϕ_X(µ_X). According to Remark 2, to obtain such a guarantee for the oversampled set of centers, Algorithm 1 either sets ℓ = O(k) and samples for t = log(1/ε) steps (this was observed in (Bachem et al., 2017)), or it sets ℓ = O(k/ε) and samples just once (i.e., t = 1). The approximation factor of Algorithm 2 is then multiplied by an additional factor of O(log k), as this is the approximation guarantee of k-means++.

The approach of Bachem et al. (Bachem et al., 2018) is similar to k-means|| with t = 1: there, the authors propose a coreset algorithm that samples O(dk/ε²) points from roughly the same distribution as the one used in the first sampling step of Algorithm 1. If we use their algorithm by running k-means++ on the provided coreset, we get an algorithm with essentially the same guarantees as Algorithm 2 with number of rounds t = 1 and ℓ = O(k/ε). The main difference is that in Algorithm 2 the weight of each sampled center used by the k-means++ subroutine is computed as the number of points for which the center is the closest, whereas in the coreset algorithm each center is simply given a weight inversely proportional to the probability that the center is sampled. This allows the coreset algorithm to be faster than Algorithm 2 with t = 1 and ℓ = O(k/ε), whose time complexity in this case is O(nk/ε), at the expense of a higher number of sampled points.

The paper (Bachem et al., 2016a) uses the Metropolis algorithm on top of the classical k-means++ algorithm to again achieve additive εϕ_X(µ_X) (and multiplicative O(log k)) error, while sampling only O(k/ε) points from the same distribution as Algorithm 2 with t = 1 and ℓ = O(k/ε). While the number of taken samples is only slightly higher than that of Algorithm 2 with t = 1 and ℓ = O(k/ε), its running time of O(k²/ε) is much better.

We see that the main advantage of k-means|| lies in the possibility of running multiple, easily distributed sampling steps that allow us to achieve strong guarantees.
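Returning to the weighting difference discussed above, the weighting step of Algorithm 2 (each center weighted by the number of points to which it is the closest) can be sketched as follows (illustrative numpy; B denotes the centers returned by the overseeding sketch above):

```python
import numpy as np

def weights_by_closest(X, B):
    """Weight of each center in B = number of points of X whose closest center in B it is."""
    closest = np.argmin(((X[:, None, :] - B[None, :, :]) ** 2).sum(axis=2), axis=1)
    return np.bincount(closest, minlength=B.shape[0])
```

Computing these weights requires one more pass over the data, which is the O(nk/ε)-type cost mentioned above; the coreset weighting avoids this pass at the price of more samples.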
3.3. MPC Context and a Theoretical Implication

Massively Parallel Computing (MPC) (Dean & Ghemawat, 2004; Karloff et al., 2010) is a distributed model in which the set of input points X is split across several machines, each capable of storing O(s) points for some parameter s. In each round, each machine performs some computation on its part of the input and sends O(s) bits to other machines, such that each machine also receives O(s) bits in total. We want to minimize the number of rounds while keeping the memory s small.

Algorithm 2 with t sampling steps can be implemented in O(t) MPC rounds if s = Ω(tk). Hence, we get an O(1)-approximation algorithm for k-means in O(log(ϕ_X(µ_X)/ϕ*)) = O(log n) rounds if each machine has Θ(k) memory.

In the case of per-machine memory O(n^α·k) for a constant 0 < α ≤ 1, we can use k-means|| to achieve a constant approximation guarantee in only O(1/α) rounds, as was noted by (Ghaffari). A hierarchical clustering scheme of Guha et al. (Guha et al., 2003) in this setting needs O(1/α) rounds to output a set of k centers Ĉ, but at the expense of the approximation guarantee: we have ϕ_X(Ĉ) = 2^{O(1/α)}·ϕ_X(C*). k-means|| can now recover this loss in approximation: if we run Algorithm 1 but start not with the trivial C⁰ = {x} for x sampled uniformly at random, and instead take C⁰ = Ĉ, then Proposition 1 ensures that t = O(log(ϕ_X(Ĉ)/ϕ_X(C*))) = O(log 2^{O(1/α)}) = O(1/α) sampling rounds suffice to get a clustering with an O(1)-approximation guarantee.

3.4. Submodular Context

One can observe why O(log(ϕ_X(µ_X)/ϕ*)) is a natural bound for the number of rounds by considering a different way of achieving the same round complexity. First, note that after sampling the first point c₁ we have E[ϕ_X(c₁)] ≤ 2ϕ_X(µ_X) via Lemma 1, with the set A chosen as the whole set X. The process of adding new points to the solution now satisfies a natural law of diminishing returns: for any C₁ ⊆ C₂ ⊆ X and c ∈ X we have

    ϕ_X({c₁} ∪ C₁) − ϕ_X({c₁} ∪ C₁ ∪ {c}) ≥ ϕ_X({c₁} ∪ C₂) − ϕ_X({c₁} ∪ C₂ ∪ {c}).

In other words, the function C ↦ ϕ_X({c₁}) − ϕ_X({c₁} ∪ C) is submodular (see e.g. (Krause & Golovin) for a collection of uses of submodularity in machine learning). Then one can use recent results about distributed algorithms for maximizing submodular functions (see e.g. (Mirzasoleiman et al., 2013; Barbosa et al., 2015a;b; Liu & Vondrak, 2018)) to get that in O(1) distributed rounds one can find a set of k points C such that ϕ_X({c₁} ∪ C) − ϕ* ≤ (ϕ_X({c₁}) − ϕ*)/2, i.e., the distance to the best solution drops by a constant factor. Continuing the same process for log(ϕ_X(µ_X)/ϕ*) rounds, one gets the same theoretical guarantees as with running Algorithm 2. However, the advantage of k-means|| is its extreme simplicity and speed. Moreover, rather surprisingly, we prove in the next section that the asymptotic round complexity of k-means|| is actually slightly better than logarithmic.
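As a concrete illustration of the diminishing-returns inequality above, the following toy check (random data; an illustrative sketch, not taken from the paper) evaluates the marginal gain of a new center with respect to a smaller and a larger set of centers:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))            # toy point set

def phi(X, centers):
    """Cost of X w.r.t. a list of centers: sum of squared distances to the closest one."""
    C = np.array(centers)
    return np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2), axis=1).sum()

c1 = X[0]
C1 = [X[1], X[2]]                        # C1 is a subset of C2
C2 = C1 + [X[3], X[4]]
c = X[5]                                 # candidate new center

gain_small = phi(X, [c1] + C1) - phi(X, [c1] + C1 + [c])
gain_large = phi(X, [c1] + C2) - phi(X, [c1] + C2 + [c])
assert gain_small >= gain_large - 1e-9   # diminishing returns (submodularity)
print(gain_small, gain_large)
```

The same check passes for any choice of C1 ⊆ C2, which is exactly the submodularity exploited by the distributed maximization results cited above.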
4. Sharp Analysis of k-means||: Upper Bound

It may seem surprising that the analysis of Theorem 1 can be strengthened, since even for the classical balls-into-bins problem, where we hit each bin with constant probability, we need O(log k) rounds to hit all the bins with high probability. However, during our process we can disregard already settled clusters, since they no longer contribute substantially to the overall cost.

If we go back to the classical balls-into-bins problem and let that process repeat on the same set of bins, with the additional property that in each round we throw each one of the k balls into a random bin out of those that are still empty, we expect to hit all the bins in mere O(log* n) steps (Lenzen & Wattenhofer, 2011). Roughly speaking, this is because of the rapid decrease in the number of bins: after the first round the probability that a bin remains empty is roughly 1/2, but after the second round it is only roughly 1/2² since the number of bins has decreased; in the next iteration it is roughly 1/2^{2²}, and so on. In our weighted case we cannot hope for such a rapid decrease in the number of bins, since the costs of the clusters can form a geometric series, in which case we get rid of only a small number of clusters in each step (cf. Section 5).¹

¹ We believe it is an interesting problem to analyze whether there are reasonable assumptions on the data under which the round complexity indeed follows the log* n behaviour (for example, this is the case if the clusters are of the same size and lie in the vertices of a simplex). This could explain why in practice it is enough to run Algorithm 1 for only a few rounds (Bahmani et al., 2012).

In this section we show a more careful analysis that bounds the number of necessary steps by O(log(ϕ_X(µ_X)/ϕ*) / log log(ϕ_X(µ_X)/ϕ*)). In the rest of the paper we use the notation γ = ϕ_X(µ_X)/ϕ*.

Concentration. O(log γ / log log γ) steps suffice even with high probability; to prove this, we first recall the classical Chernoff bounds that are used to argue about concentration around the mean.

Theorem 2 (Chernoff bounds). Suppose X₁, ..., X_n are independent random variables taking values in {0, 1}. Let X denote their sum. Then for any 0 ≤ δ ≤ 1 we have

    P(X ≤ (1 − δ)E[X]) ≤ e^{−E[X]δ²/2},
    P(X ≥ (1 + δ)E[X]) ≤ e^{−E[X]δ²/3},

and for δ ≥ 1 we have

    P(X ≥ (1 + δ)E[X]) ≤ e^{−E[X]δ/3}.

Intuition. In the following Proposition 2, a refined version of Proposition 1, we argue similarly, but more carefully, about one sampling step of Algorithm 1. The difference is that we analyze not only the drop in the cost of the unsettled clusters, but also the drop in their number. Here is the intuition. Let us go back to the proof of Proposition 1, where we distinguished heavy and light clusters. Heavy clusters formed at least a constant proportion of the cost of all clusters, and every heavy cluster was hit with probability that we lower-bounded by 1/25. For a light cluster we cannot give a good bound on the probability of hitting it, but since it is not very probable that one light cluster is hit by more than one sampled center, if we denote by α = (∑_{A light} ϕ^t_A)/ϕ^t_U the proportional cost of the light clusters, we expect roughly αk clusters to become settled. On the other hand, we may optimistically hope that after one iteration the cost of unsettled clusters drops from ϕ^t_U to αϕ^t_U, since the heavy clusters are hit with high probability. This is not exactly the case, since for a heavy cluster we only have a constant probability of hitting it. But we can consider two cases: either there are a lot of heavy clusters and we again make roughly αk clusters settled, or their cost is dominated by a few massive clusters, each of which remains unsettled only with exponentially small probability and, hence, we expect the cost to drop by a factor of α.
The tradeoff between the behaviour of the light and heavy clusters then yields a threshold α ≈ 1/polylog γ that balances the drop of the cost and of the number of unsettled clusters.

Proposition 2. Suppose that after t steps of Algorithm 1 with ℓ = k there are k_t unsettled clusters and their total cost is ϕ^t_U. Assume that ϕ^t_X ≥ 20ϕ*. After the next sampling step, with probability at least 1 − exp(−Θ(k^{0.1})), either the number of unsettled clusters has decreased by at least k/(40√log γ), or the total cost of unsettled clusters has decreased from ϕ^t_U to at most 4ϕ^t_U/√(3 log γ).

Proof. Note that ϕ^t_X ≥ 20ϕ* implies ϕ^t_U ≥ ϕ^t_X/2, since otherwise more than half of the cost ϕ^t_X would be formed by settled clusters and, hence, ϕ^t_X < 20ϕ*, a contradiction. We will say that an unsettled cluster A is heavy if its cost satisfies ϕ^t_A ≥ ϕ^t_U/k, and light otherwise. Let α = (∑_{A light} ϕ^t_A)/ϕ^t_U be the proportional cost of the light clusters. We distinguish three possible cases:

1. α ≥ 1/√log γ;
2. α < 1/√log γ and there are at least k/√log γ heavy clusters;
3. α < 1/√log γ and there are fewer than k/√log γ heavy clusters.

For each case we now prove that with probability 1 − exp(−Ω(k^{0.1})) we either settle at least k/(40√log γ) clusters or the total cost of unsettled clusters drops from ϕ^t_U to at most 4ϕ^t_U/√(3 log γ).

1. By Lemma 3, each light cluster A gets settled with probability at least 1 − e^{−kϕ^t_A/(5ϕ^t_X)} ≥ kϕ^t_A/(20ϕ^t_U), using ϕ^t_U ≥ ϕ^t_X/2 and e^{−x} ≤ 1 − x/2 for 0 ≤ x ≤ 1. If we define X_A to be the indicator of whether a light cluster A gets settled in this iteration and X = ∑_{A light} X_A, we have

    E[X] = ∑_{A light} E[X_A] ≥ ∑_{A light} kϕ^t_A/(20ϕ^t_U) = αk/20 ≥ k/(20√log γ),

where we used our assumption on α. Invoking the first bound of Theorem 2, we get P(X ≤ E[X]/2) ≤ e^{−E[X]/8} ≤ e^{−Θ(k^{0.1})}, using that k ≥ log γ / log log γ.

2. We proceed analogously to the previous case. By Lemma 3, each heavy cluster gets settled with probability at least 1 − e^{−kϕ^t_A/(5ϕ^t_X)} ≥ 1 − e^{−1/10} ≥ 1/20, using ϕ^t_U ≥ ϕ^t_X/2 and the definition of a heavy cluster. We define X_A to be the indicator of whether a heavy cluster A gets settled in this iteration and X = ∑_{A heavy} X_A. We have

    E[X] = ∑_{A heavy} E[X_A] ≥ (k/√log γ)·(1/20) = k/(20√log γ).

Invoking the first bound of Theorem 2, we get P(X ≤ E[X]/2) ≤ e^{−E[X]/8} ≤ e^{−Θ(k^{0.1})}, using that k ≥ log γ / log log γ.

3. Let ζ = 10√(log γ)·ϕ^t_U/k. We call a heavy cluster massive if its cost is at least ζ. Since there are fewer than k/√log γ heavy clusters, the total cost of clusters that are heavy but not massive is at most ϕ^t_U/√(3 log γ). Hence, the total contribution of clusters that are not massive is at most ϕ^t_U/√(3 log γ) + αϕ^t_U ≤ 2ϕ^t_U/√(3 log γ). By Lemma 3, each massive cluster is not settled with probability at most e^{−kϕ^t_A/(5ϕ^t_X)}. Define the random variable X_A to be equal to 0 if a massive cluster A gets settled in this iteration and to ϕ^t_A/ζ otherwise, and let X = ∑_{A massive} X_A. Note that the expected cost of massive clusters that are not settled in this iteration is bounded by E[X]·ζ. The value of X is stochastically dominated by the value of a variable X' defined as follows. We first replace each X_A by variables X'_{A,1}, ..., X'_{A,⌈ϕ^t_A/ζ⌉}, each new variable X'_{A,i} being equal to 1 with probability e^{−kζ/(10ϕ^t_X)} and to zero otherwise, independently of the other variables X'_{A,j}.
Note that the sum X'_A = X'_{A,1} + ... + X'_{A,⌈ϕ^t_A/ζ⌉} stochastically dominates the value of X_A, since it attains a value of at least ϕ^t_A/ζ with probability

    ∏_{i=1}^{⌈ϕ^t_A/ζ⌉} e^{−kζ/(10ϕ^t_X)} ≥ e^{−kζ/(10ϕ^t_X)·2ϕ^t_A/ζ} = e^{−kϕ^t_A/(5ϕ^t_X)},

and otherwise it is nonnegative. Hence, the value X' = ∑_A X'_A stochastically dominates the value X. Since the number of variables X'_{A,i} is ∑_A ⌈ϕ^t_A/ζ⌉ ≤ 2ϕ^t_X/ζ, we have

    E[X] ≤ E[X'] ≤ (2ϕ^t_X/ζ)·exp(−kζ/(10ϕ^t_X)) ≤ (4ϕ^t_U/ζ)·exp(−10√(log γ)/20) = (2k/(5√log γ))·e^{−√(log γ)/2} ≤ k/√log γ,

where the last step assumes γ large enough. Finally, we bound P(X ≥ 2k/√log γ) ≤ P(X' ≥ 2k/√log γ) = P(X' ≥ (1 + δ)E[X']) for δ = 2k/(√log γ·E[X']) − 1 ≥ k/(√log γ·E[X']) ≥ 1, by the above bound on E[X']. Applying the third bound in Theorem 2, we conclude that

    P(X ≥ 2k/√log γ) ≤ exp(−E[X']·δ/3) ≤ exp(−k/(3√log γ)) ≤ e^{−Ω(k^{0.1})}.

Hence, with probability 1 − e^{−Ω(k^{0.1})}, the total cost of the clusters that remain unsettled after this iteration is bounded by 2ϕ^t_U/√(3 log γ) + X·ζ ≤ 4ϕ^t_U/√(3 log γ).

Theorem 3. For ℓ = k, Algorithm 1 achieves a constant approximation ratio after t = O(min(k, log γ / log log γ)) sampling steps, with probability 1 − exp(−Ω(k^{0.1})).

Proof. To see that the number of steps is bounded by k with high probability, note that by Lemma 3 each unsettled cluster is made settled with probability at least 1 − e^{−kϕ^t_A/(5ϕ^t_X)}. So, unless ϕ^t_X ≤ 20ϕ*, we have ϕ^t_U ≥ ϕ^t_X/2 and, hence, with probability at least

    1 − ∏_{A unsettled} exp(−kϕ^t_A/(10ϕ^t_U)) = 1 − exp(−k/10)

we make at least one cluster settled. Union bounding over the first k steps of the algorithm, we conclude that the algorithm finishes in at most k steps with probability 1 − exp(−Ω(k)).

Whatever point c₀ is taken uniformly at the beginning of Algorithm 1, for the next iteration we invoke Lemma 3 with A = X (in its formulation we regard A = X as an unsettled cluster) to conclude that with probability 1 − exp(−Ω(k)) we have ϕ¹_X ≤ 10ϕ_X(µ_X). We invoke Proposition 2 and union bound over the k subsequent iterations of the algorithm to conclude that, with probability 1 − exp(−Ω(k^{0.1})), in each sampling step of Algorithm 1 we either make at least k/(40√log γ) clusters settled or the cost of unsettled clusters decreases by a factor of 4/√(3 log γ). The first case can happen at most O(√log γ) = O(log γ / log log γ) times, whereas the second case can happen at most O(log_{√(3 log γ)/4} γ) = O(log γ / log log γ) times, until we have ϕ^t_X ≤ 20ϕ*. Hence, the algorithm achieves a constant approximation ratio after O(log γ / log log γ) steps, with probability 1 − exp(−Ω(k^{0.1})).

Remark 3. The high-probability guarantee in Theorem 3 can be made 1 − exp(−Ω(k^{1−o(1)})). We omit the proof.

5. Sharp Analysis of k-means||: Lower Bound

In this section we show that the upper bound of O(log γ / log log γ) steps is best possible. Note that (Bachem et al., 2017) proved that for ℓ = k there is a dataset X such that E[ϕ^t_X] ≥ (1/4)·(4kt)^{−t}·Var(X) for t < k − 1. Hence, for k = O(poly(log γ)) we conclude that the number of steps T necessary to achieve a constant approximation satisfies (T·poly(log γ))^T ≥ γ, implying T = Ω(log γ / log log γ). We complement their result by showing that the same lower bound also holds for k = Ω(poly(log γ)). In the following theorem we construct an instance where the distance between two different data points can be zero, but it is easy to generalize the result to the case where the distance between two different points is lower bounded by 1.
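The instance constructed in Theorem 4 below places almost all points at the origin and the remaining points on mutually orthogonal axes, with squared norms L^{2(T−i+1)} decreasing geometrically over T levels. A minimal sketch of such a dataset (illustrative parameters and simplified constants; the precise choices of L, T and |X| in the proof below are what make the bound tight):

```python
import numpy as np

def lower_bound_instance(k, L, T, n_origin):
    """~k points on mutually orthogonal axes in T levels (level i at distance L**(T-i+1)),
    plus n_origin points at the origin."""
    per_level = k // T
    d = T * per_level                       # one fresh coordinate axis per non-origin point
    rows = [np.zeros(d) for _ in range(n_origin)]
    axis = 0
    for i in range(1, T + 1):
        for _ in range(per_level):
            x = np.zeros(d)
            x[axis] = float(L) ** (T - i + 1)
            rows.append(x)
            axis += 1
    return np.vstack(rows)

# e.g. X = lower_bound_instance(k=100, L=4, T=5, n_origin=1000)
```

Because the non-origin points sit on orthogonal axes, sampling a point from one level does not reduce the cost of any other level, which is what forces the overseeding rounds to clear the levels one by one.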
Theorem 4. For any function f with f(x) = Ω(log¹⁰ x) and f(x) = O(x^{0.9}), there is a dataset X with |X| = Θ(k) and k = Θ(f(ϕ_X(µ_X))) such that ϕ* = 0 and, with probability arbitrarily close to 1, Algorithm 1 needs Ω(log(ϕ_X(µ_X)) / log log(ϕ_X(µ_X))) iterations to achieve cost zero.

Proof. First we describe the dataset X. We place |X| − k + 1 points x_{0,j}, for 1 ≤ j ≤ |X| − k + 1, at the origin, i.e., x_{0,j} = 0. We choose |X| = Θ(k) large enough that with probability at least 1 − ε, for a given constant ε, the first uniformly chosen center is 0. For each of the remaining k − 1 points we consider a new axis, orthogonal to the remaining axes, and place the point on this axis. For x_{i,j}, 1 ≤ i ≤ T, 1 ≤ j ≤ k/T, we set ‖x_{i,j} − 0‖² = L^{2(T−i+1)} for some large enough L and T = Θ(L/log L), with the multiplicative constant chosen in such a way that for ϕ_X(µ_X) = Θ((k/T)·L^{2T}) we have ϕ_X(µ_X) = Θ(k·2^L). Note that we need to choose L = Ω(polylog k) to achieve k = Θ(f(ϕ_X(µ_X))) = O((ϕ_X(µ_X))^{0.9}). We define ϕ^t_i = ∑_{j=1}^{k/T} ϕ^t_{x_{i,j}}.

Conditioning on the first uniformly taken point being x_{0,j} for some j, we prove by induction on 0 ≤ t ≤ T that, with probability at least 1 − t/poly(k), after t sampling steps of the algorithm, for each i > t at most k/(2T) of the k/T points x_{i,j} have been sampled as centers. This proves the theorem, since it implies that with probability at least 1 − 1/poly(k), after t = T = Θ(L/log L) steps the cost ϕ^t_X is still nonzero; since we have k = O((ϕ_X(µ_X))^{0.9}), we have Θ(L/log L) = Θ(log(ϕ_X(µ_X)/k) / log log ϕ_X(µ_X)) = Θ(log ϕ_X(µ_X) / log log ϕ_X(µ_X)).

For t = 0 the claim is clearly true. For t ≥ 1, note that by induction, with probability at least 1 − (t−1)/poly(k), at least k/(2T) of the points x_{t,j} were not sampled as centers. Hence, ϕ^t_X ≥ (k/(3T))·L^{2(T−t+1)}, and the probability that any particular point x_{i,j}, i > t, is sampled is bounded by

    k·L^{2(T−i+1)} / ((k/(3T))·L^{2(T−t+1)}) = 3T/L^{2(i−t)} ≤ 1/L^{i−t}

for L large enough. Hence, for any i > t, the expected number of points x_{i,j} that are hit is bounded by k/(T·L^{i−t}). To get concentration around this value for a given i, consider two cases.

1. k/(3^{i−t}·T) ≥ k^{0.1}. Then, by the third bound of Theorem 2, we can bound the probability of sampling more than k/(3^{i−t}·T) of these points by exp(−Θ(k/(3^{i−t}·T))) ≤ exp(−k^{0.1}) ≤ 1/poly(k).

2. k/(3^{i−t}·T) < k^{0.1}. Using the assumption k = Ω(log¹⁰ ϕ_X(µ_X)), and hence k = Ω(L¹⁰), we have 3^{i−t} > k^{0.9}/T ≥ k^{0.7}. For L sufficiently large we then have L^{i−t} ≥ poly(k) for any fixed polynomial; hence the expected number of hits is at most k/(T·L^{i−t}) ≤ 1/poly(k) and so, with probability at least 1 − 1/poly(k), there is no hit.

In both cases, conditioning on an event of probability 1 − 1/poly(k), for any i > t the number of points x_{i,j} that were sampled as centers during the first t steps is, with probability at least 1 − t/poly(k), bounded by ∑_{t'=1}^{t} k/(3^{i−t'}·T) ≤ k/(2T).

6. Acknowledgement

I would like to thank my advisor Mohsen Ghaffari for his very useful insights and suggestions regarding the paper. I also thank Davin Choo, Danil Koževnikov, Christoph Grunau, Andreas Krause, and Julian Portmann for useful discussions, and the anonymous referees for helpful comments. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 853109).

References

Aggarwal, A., Deshpande, A., and Kannan, R. Adaptive sampling for k-means clustering. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pp. 15–28. Springer, 2009.
Ailon, N., Jaiswal, R., and Monteleoni, C. Streaming k-means approximation. In Advances in Neural Information Processing Systems, pp. 10–18, 2009.

Aloise, D., Deshpande, A., Hansen, P., and Popat, P. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245–248, May 2009. ISSN 1573-0565. doi: 10.1007/s10994-009-5103-0. URL https://doi.org/10.1007/s10994-009-5103-0.

Arthur, D. and Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, pp. 1027–1035, Philadelphia, PA, USA, 2007a. Society for Industrial and Applied Mathematics. ISBN 978-0-898716-24-5. URL http://dl.acm.org/citation.cfm?id=1283383.1283494.

Arthur, D. and Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035. Society for Industrial and Applied Mathematics, 2007b.

Awasthi, P., Charikar, M., Krishnaswamy, R., and Sinop, A. K. The hardness of approximation of Euclidean k-means. arXiv preprint arXiv:1502.03316, 2015.

Bachem, O., Lucic, M., Hassani, H., and Krause, A. Fast and provably good seedings for k-means. In Advances in Neural Information Processing Systems, pp. 55–63, 2016a.

Bachem, O., Lucic, M., Hassani, S. H., and Krause, A. Approximate k-means++ in sublinear time. In Thirtieth AAAI Conference on Artificial Intelligence, 2016b.

Bachem, O., Lucic, M., and Krause, A. Distributed and provably good seedings for k-means in constant rounds. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 292–300. JMLR.org, 2017.

Bachem, O., Lucic, M., and Krause, A. Scalable k-means clustering via lightweight coresets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1119–1127, 2018.

Bahmani, B., Moseley, B., Vattani, A., Kumar, R., and Vassilvitskii, S. Scalable k-means++. Proceedings of the VLDB Endowment, 5(7):622–633, 2012.

Barbosa, R., Ene, A., Nguyen, H., and Ward, J. The power of randomization: Distributed submodular maximization on massive datasets. In International Conference on Machine Learning, pp. 1236–1244, 2015a.

Barbosa, R., Ene, A., Nguyen, H. L., and Ward, J. A new framework for distributed submodular maximization, 2015b.

Bhattacharya, A., Eube, J., Röglin, H., and Schmidt, M. Noisy, greedy and not so greedy k-means++, 2019.

Brunsch, T. and Röglin, H. A bad instance for k-means++. Theoretical Computer Science, 505:19–26, 2013.

Celebi, M. E., Kingravi, H. A., and Vela, P. A. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Systems with Applications, 40(1):200–210, 2013.

Choo, D., Grunau, C., Portmann, J., and Rozhoň, V. k-means++: few more steps yield constant approximation, 2020.

Dean, J. and Ghemawat, S. MapReduce: Simplified data processing on large clusters, 2004.

Ghaffari, M. Personal communication.

Guha, S., Meyerson, A., Mishra, N., Motwani, R., and O'Callaghan, L. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515–528, 2003.

Jaiswal, R., Kumar, A., and Sen, S. A simple D²-sampling based PTAS for k-means and other clustering problems, 2012.

Jaiswal, R., Kumar, M., and Yadav, P. Improved analysis of D²-sampling based PTAS for k-means and other clustering problems. Information Processing Letters, 115(2):100–103, 2015.

Karloff, H., Suri, S., and Vassilvitskii, S.
A model of computation for MapReduce. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 938–948. SIAM, 2010.

Krause, A. and Golovin, D. Submodular function maximization.

Lattanzi, S. and Sohler, C. A better k-means++ algorithm via local search. In International Conference on Machine Learning, pp. 3662–3671, 2019.

Lee, E., Schmidt, M., and Wright, J. Improved and simplified inapproximability for k-means. Information Processing Letters, 120:40–43, 2017.

Lenzen, C. and Wattenhofer, R. Tight bounds for parallel randomized load balancing, 2011.

Liu, P. and Vondrak, J. Submodular optimization in the MapReduce model. arXiv preprint arXiv:1810.01489, 2018.

Lloyd, S. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, March 1982. ISSN 1557-9654. doi: 10.1109/TIT.1982.1056489.

Mahajan, M., Nimbhorkar, P., and Varadarajan, K. The planar k-means problem is NP-hard. In International Workshop on Algorithms and Computation, pp. 274–285. Springer, 2009.

Mettu, R. and Plaxton, G. Optimal time bounds for approximate clustering, 2012.

Mirzasoleiman, B., Karbasi, A., Sarkar, R., and Krause, A. Distributed submodular maximization: Identifying representative elements in massive data. In Advances in Neural Information Processing Systems, pp. 2049–2057, 2013.

Ostrovsky, R., Rabani, Y., Schulman, L. J., and Swamy, C. The effectiveness of Lloyd-type methods for the k-means problem. Journal of the ACM (JACM), 59(6):1–22, 2013.

Wei, D. A constant-factor bi-criteria approximation guarantee for k-means++. In Advances in Neural Information Processing Systems, pp. 604–612, 2016.