Analyzing Dα seeding for k-means

Etienne Bamas (ETH AI Center, Zurich, Switzerland), Sai Ganesh Nagarajan (Zuse Institute Berlin), Ola Svensson (EPFL). Correspondence to: Etienne Bamas, Sai Ganesh Nagarajan.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

One of the most popular clustering algorithms is the celebrated Dα seeding algorithm (also known as k-means++ when α = 2) by Arthur and Vassilvitskii (2007), who showed that it guarantees in expectation an O(2^{2α} log k)-approximate solution to the (k, α)-clustering cost (where distances are raised to the power α) for any α ≥ 1. More recently, Balcan, Dick, and White (2018) observed experimentally that using Dα seeding with α > 2 can lead to a better solution with respect to the standard k-means objective (i.e., the (k, 2)-clustering cost). In this paper, we provide a rigorous understanding of this phenomenon. For any α > 2, we show that Dα seeding guarantees in expectation an approximation factor of

O_α( (σmax/σmin)^{2−4/α} · (gα · min{ℓ, log k})^{2/α} )

with respect to the standard k-means cost of any underlying clustering (the hidden factor depends only on α); here gα is a parameter capturing the concentration of the points in each cluster, σmax and σmin are the maximum and minimum standard deviation of the clusters around their center, and ℓ is the number of distinct mixing weights in the underlying clustering (after rounding them to the nearest power of 2). For instance, if the underlying clustering is defined by a mixture of k Gaussian distributions with equal cluster variance (up to a constant factor), then our result implies that: (1) if there are a constant number of mixing weights, any constant α > 2 yields a constant-factor approximation; (2) if the mixing weights are arbitrary, any constant α > 2 yields an O(log^{2/α} k)-approximation, and α = Θ(log log k) yields an O((log log k)^3)-approximation. We complement these results by some lower bounds showing that the dependency on gα and σmax/σmin is tight. Finally, we provide an experimental validation of the effects of the aforementioned parameters when using Dα seeding.

1. Introduction

Clustering is a quintessential machine learning problem with numerous practical applications in medicine (Alashwal et al., 2019), image segmentation (Shi & Malik, 2000; Burney & Tariq, 2014), market analysis (Chiu et al., 2009) and anomaly detection (Münz et al., 2007), to name a few. One of the most popular formulations is the k-means problem, which requires us to pick k centers such that the sum of the squared distances from each data point to its closest center is minimized. The k-means problem is NP-hard even in 2 dimensions (Mahajan et al., 2009), and most research has therefore focused on heuristics and approximation algorithms. For a long time, a heavily used heuristic for this problem has been Lloyd's algorithm, with Expectation-Maximization (EM) style updates for the centers after an initial set of k centers is chosen uniformly at random from the data. While this method finds a local optimum, it is known not to have any approximation guarantee, and it can have an exponential running time in the worst case (Arthur & Vassilvitskii, 2006).

The k-means++ method.
(Arthur & Vassilvitskii, 2007) came up with the elegant k-means++ method that carefully selects the initial centers (a procedure also called D2 seeding): the next center is a data point chosen with probability proportional to its squared distance to its closest center selected thus far (see (Ostrovsky et al., 2013) for concurrent work on a similar algorithm). This intuitively makes sense, as this initialization is more likely to discover new clusters (that are far away) than simply selecting centers uniformly at random. Indeed, they proved that the initial choice of centers already forms an O(log k)-approximation (in expectation) for this problem. They complemented this upper bound with a family of instances where the expected cost of the D2 seeding is a factor Ω(log k) times the optimum cost, showing that their analysis is tight.

Limitations on clusterable instances. While the D2 seeding method provides clear improvements over uniform-at-random initialization in an elegant and efficient manner, the family of instances that shows the tightness of the analysis indicates some of its limitations. The family of instances presented in (Arthur & Vassilvitskii, 2007) is indeed highly clusterable: k regular simplices of radius one (each of n/k points), where the pairwise distance between the centers of any two simplices is some large value ∆. As ∆ tends to infinity (i.e., the instance becomes more and more clusterable), the expected approximation guarantee of the D2 seeding method tends to Θ(log k). For such clusterable instances, the issue is that the D2 seeding method does not put enough probability mass on discovering new clusters. This phenomenon was already observed in the original paper by (Arthur & Vassilvitskii, 2007), and they proposed a greedy variant that takes several samples at each iteration (increasing the probability that at least one hits a new, yet undiscovered cluster) and makes a greedy choice among them. This greedy variant has worse guarantees in the worst case (see (Grunau et al., 2023b) for a recent nearly tight analysis). However, the greedy variant shows better experimental performance (after all, we usually look for k clusters when the data is clusterable), and is currently the method implemented in the popular Scikit-learn library (Pedregosa et al., 2011). Specifically, at each iteration 2 + log(k) points are sampled, and, among them, the point that decreases the objective the most is greedily chosen.

Data-driven approach. More recently, Balcan et al. (Balcan et al., 2018) proposed a data-driven approach in order to address the aforementioned limitation of the D2 seeding method on clusterable data. Instead of always using D2 seeding for the initial centers, they proposed to use Dα seeding, where α is now a parameter of the algorithm. In Dα seeding, a point is selected as the next center with probability proportional to its α-powered distance to its closest center selected thus far.¹ One can observe that a large choice of α > 2 increases the probability that a sampled center will discover a new, yet undiscovered cluster, which is advantageous. At the same time, a large α makes the algorithm more sensitive to outliers (which is also the reason why the greedy variant of k-means++ is worse in the worst case). Hence the selection of α should depend on the kind of instances one wants to solve, which motivates a data-driven approach.
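To see this effect concretely, the following toy numpy sketch (ours, not from the paper; the cluster sizes and the separation are made up) compares how much probability mass D2 and Dα sampling place on a single far-away undiscovered cluster once many clusters have already been hit:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k_hit, delta = 500, 50, 10.0   # made-up sizes: 50 clusters already hit, one far cluster undiscovered

# Distance of each point to its closest chosen center:
#  - points in hit clusters: roughly their distance to their own cluster's center (2-d standard normal)
#  - points in the undiscovered cluster: roughly the separation delta
d_hit = np.linalg.norm(rng.normal(size=(k_hit * n, 2)), axis=1)
d_far = np.full(n, delta)
d = np.concatenate([d_hit, d_far])

for alpha in (2.0, 4.0, 10.0):
    w = d ** alpha                       # D^alpha sampling weight of each point
    p_far = w[-n:].sum() / w.sum()
    print(f"alpha={alpha:4.1f}  P(next center lands in the undiscovered cluster) ~ {p_far:.3f}")
```

On this toy instance the D2 weights give the undiscovered cluster only about half of the probability mass, while already α = 4 pushes it above 0.95, illustrating the trade-off discussed above.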
One of the main results in (Balcan et al., 2018) is that this is feasible. They showed that the parameter α is learnable, in the sense that if we assume the instance is drawn from some unknown distribution D, then with only polynomially many (in the instance size and other relevant parameters) samples and a polynomial running time, one can compute a parameter α for the sampling that is almost optimum for the given distribution D. This is especially interesting as it shows that setting α to a good value on a given distribution is in principle a manageable task. Additionally, Balcan et al. (Balcan et al., 2018) complemented their theoretical results with an experimental analysis showing that setting α equal to 2 is not always the best choice. For instance, on the MNIST dataset, they find that setting α close to 4 is a significantly better choice than α = 2. This is even more striking in the case where D is a mixture of Gaussians, in which case setting α close to 20 seems to be the best choice. This highlights the fact that in practice, one can outperform the popular k-means++ algorithm of (Arthur & Vassilvitskii, 2007) by tweaking the parameter α. Yet, they do not provide any quantitative understanding of this phenomenon, nor any approximation guarantees on these instances for different α. This is the main focus of this paper.

¹We remark that Dα seeding was already considered in (Arthur & Vassilvitskii, 2007), but they studied it on a cost that is proportional to the distances raised to the power α (i.e., the (k, α)-clustering cost) instead of the standard k-means objective. They showed that Dα seeding is O(2^{2α} log k)-approximate for this cost function. The use of Dα seeding for other objectives, including the standard squared-distance objective, was first introduced in (Balcan et al., 2018) in a data-driven approach.

Our contributions. Our main contribution is a theoretical analysis of the advantage of Dα seeding, proving that it leads to constant-factor approximation guarantees for a large class of instances, including a balanced mixture of k Gaussians, where the standard k-means++ algorithm is already no better than Ω(log k) (see Section 3). We remark that a beyond worst-case analysis is essential, as it is easy to see that α = 2 is an optimal choice in the worst case (just as the greedy variant of k-means++ is worse in the worst case). In our beyond worst-case analysis, we identify natural data-dependent parameters that measure (i) how concentrated points are in clusters (the parameter gα), (ii) the ratio of the maximum and minimum standard deviation of the optimal clusters around their mean (σmax/σmin), and (iii) how balanced clusters are in terms of number of points (the parameter ℓ). Using these parameters, we show that Dα seeding guarantees for any α > 2 an

O_α( (σmax/σmin)^{2−4/α} · (gα · min{ℓ, log k})^{2/α} )

approximation with respect to the standard k-means objective (the formal statement can be found in Section 2). We further show that the dependence on the first two parameters is necessary and tight (formal statement in Section 2), and this dependence gives a theoretical explanation of the importance of selecting α as a function of the data (Section 2.2). We leave it as an interesting open problem to understand the necessity of the third parameter ℓ. Finally, a more open-ended direction following our work is to give a beyond worst-case analysis of the greedy variant of k-means++.
We take a first step in this direction by proving a negative result: we give a family of instances where the natural parameters (i)-(iii) are all constant (and thus Dα seeding yields a constant-factor approximation guarantee for any constant α > 2), but greedy k-means++ as implemented in the Scikit-learn library (with Θ(log k) samples per iteration) has a super-constant approximation guarantee (see Section 2).

1.1. Further related works

Several variants have been studied since the original publication of the k-means++ method in (Arthur & Vassilvitskii, 2007; Ostrovsky et al., 2013). Aggarwal et al. (Aggarwal et al., 2009) obtained an O(1)-approximation with constant probability by selecting O(k) centers, which was improved later by (Wei, 2016; Makarychev et al., 2020). Bahmani et al. (Bahmani et al., 2012) provided a scalable version of k-means++ that is even more practical, and (Bachem et al., 2017; Cohen-Addad et al., 2020) provided faster ways for randomly selecting the centers. Recently, Lattanzi and Sohler (Lattanzi & Sohler, 2019) obtained an O(1)-approximation with additional O(k log log k) steps of local search after using k-means++ to choose the initial centers, and this was improved by Choo et al. (Choo et al., 2020) to obtain a 1030-approximation with εk steps of local search. Our constant-factor guarantees are arguably smaller without any additional steps of local search, but are applicable to the appropriate family of instances, whilst their method is applicable in the worst case across all instances. Moreover, their local search methods can be used on top of our Dα seeding guarantees, which may offer significant improvements.

Another variant of k-means++ particularly relevant to our setting is the noisy k-means++ algorithm, in which points are sampled according to the standard D2 seeding, but an adversary is allowed to perturb the sampling probabilities by some multiplicative factor of (1 ± ε). For this case, (Bhattacharya et al., 2020) showed an O(log² k) upper bound, which was then improved to the classic O(log k) by (Grunau et al., 2023a). In our case, the sampling probability of a point can be completely different between D2 seeding and Dα seeding. Since the analysis of noisy k-means++ is substantially more difficult than the analysis of the classic k-means++ algorithm, it might come as a surprise that it is still possible to obtain non-trivial guarantees in our case.

Tangentially, one could study algorithms for learning the cluster centers when the data is instantiated from a mixture of Gaussians (Dasgupta, 1999; Arora & Kannan, 2005). Furthermore, specific clustering algorithms have been designed under specific assumptions on the instances with various clusterability notions (Ackerman & Ben-David, 2009). However, these clusterability notions are often (computationally) hard to check, and the algorithms are not as efficient and simple as the seeding-based algorithms, which also work well in practice (without any assumptions). Finally, the main inspiration of this paper is the idea of data-driven clustering by Balcan et al. (Balcan et al., 2018).

1.2. Preliminaries and notations

To formally introduce the k-means problem and the seeding algorithms, we will work with the metric space (R^d, ‖·‖₂). Since we always work with the Euclidean norm, we will drop the subscript in the notation and write the Euclidean norm of a vector x as ‖x‖.
If we are given some data points X ⊂ R^d, the cost of our data X associated with a given set of t centers Zt can be defined as follows:

cost^(2)(X, Zt) := Σ_{x∈X} min_{c∈Zt} ‖x − c‖².    (1)

Now note that the centers Zt = {c1, . . . , ct} define a natural partition of the data, in that Cj = {x ∈ X : cj = arg min_{c∈Zt} ‖x − c‖²}. Then one can write the cost equivalently in the following useful way:

cost^(2)(X, Zt) = Σ_{j=1}^{t} Σ_{x∈Cj} ‖x − cj‖².    (2)

Furthermore, if one has a candidate clustering C, then each corresponding center can be computed as cj = (1/|Cj|) Σ_{x∈Cj} x, which is the centroid of the corresponding cluster. We will denote the optimal centroids by {µ1, µ2, . . . , µk}. By a slight abuse of notation, we might drop the subscript to identify a cluster C ∈ C, and µC will refer to the mean of that cluster. Let COPT be the optimal clustering whose corresponding centers are ZOPT. By definition,

cost^(2)(X, ZOPT) = min_{Z⊂R^d : |Z|=k} cost^(2)(X, Z).    (3)

The k-means++ algorithm. We will now describe the class of parameterized seeding algorithms as in (Arthur & Vassilvitskii, 2007; Balcan et al., 2018) for the k-means objective. For any α ∈ [0, ∞], the general Dα seeding procedure chooses k centers as follows:

1. The first center z1 ∈ X is chosen uniformly at random from the data points.
2. Let Zt be the set of t centers chosen so far, with t < k. The next center zt+1 ∈ X is chosen with probability given by

p_X^(α)(z) := P(z is sampled | Zt) = min_{c∈Zt} ‖z − c‖^α / Σ_{z'∈X} min_{c∈Zt} ‖z' − c‖^α.    (4)

The classic k-means++ algorithm is the special case of D2 seeding. For any α ≥ 1, Arthur and Vassilvitskii (Arthur & Vassilvitskii, 2007) show that the Dα seeding procedure is an O(2^{2α} log k)-approximation in expectation if the cost function is given by

cost^(α)(X, Zk) := Σ_{x∈X} min_{c∈Zk} ‖x − c‖^α.    (5)

Our focus in this paper is to provide guarantees on the Dα seeding algorithm with α > 2 for the standard k-means objective (i.e., α = 2).

The greedy k-means++ algorithm. Although this is not the main focus of the paper, it is helpful to briefly define the greedy variant of k-means++ here. The greedy variant with m samples works as follows.

1. The first center z1 ∈ X is chosen uniformly at random from the data points.
2. Let Zt be the set of t centers chosen so far, with t < k. We select a set of m candidate centers z1, z2, . . . , zm, where each candidate is sampled according to the probability distribution

p_X(z) := P(z is sampled | Zt) = min_{c∈Zt} ‖z − c‖² / Σ_{z'∈X} min_{c∈Zt} ‖z' − c‖².    (6)

The next added center zt+1 is selected to be the one which decreases the cost the most, among all the candidate centers z1, z2, . . . , zm. Usually m is selected to mildly increase with the input size. For instance, the standard scikit-learn library implements the greedy version of k-means++ using m = Θ(log k) candidates at each step (see (Pedregosa et al., 2011)).
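For concreteness, the procedure of Eq. (4) and its greedy variant of Eq. (6) can be sketched in a few lines of numpy. This is our own illustrative implementation (the function name and interface are made up), not the authors' released code:

```python
import numpy as np

def d_alpha_seeding(X, k, alpha=2.0, greedy_candidates=None, rng=None):
    """Sketch of D^alpha seeding (Eq. (4)). If greedy_candidates = m is given, the greedy
    variant of Eq. (6) is used: m candidates are drawn and the one decreasing the
    k-means cost the most is kept."""
    rng = np.random.default_rng(rng)
    n = len(X)
    centers = [X[rng.integers(n)]]                      # first center: uniform over the data
    d2 = np.sum((X - centers[0]) ** 2, axis=1)          # squared distance to the closest center
    for _ in range(1, k):
        probs = d2 ** (alpha / 2.0)                     # ||x - c||^alpha = (||x - c||^2)^(alpha/2)
        probs /= probs.sum()
        m = greedy_candidates or 1
        candidates = rng.choice(n, size=m, p=probs)
        # keep the candidate whose addition minimizes the standard k-means cost
        best = min(candidates,
                   key=lambda i: np.minimum(d2, np.sum((X - X[i]) ** 2, axis=1)).sum())
        centers.append(X[best])
        d2 = np.minimum(d2, np.sum((X - X[best]) ** 2, axis=1))
    return np.array(centers)

# e.g. centers = d_alpha_seeding(X, k=10, alpha=4.0)
# greedy k-means++ as in scikit-learn: d_alpha_seeding(X, k, alpha=2.0, greedy_candidates=2 + int(np.log(k)))
```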
2. Our Results

In this section, we define formally the natural parameters that Dα seeding depends on and state our results formally. Moreover, we provide a short discussion on the necessity of the dependence that will clarify our claims. The last part of this section focuses on using our results to provide recommendations on choosing α in different scenarios. Before we move on to stating our results on Dα seeding, we need to define the following quantities with respect to the optimal² clustering COPT = {C1, C2, . . . , Ck}.

²Although we state our definitions and results with respect to the optimal clusters, our results hold for any reference clustering that satisfies the aforementioned properties.

1. We define σC as the standard deviation of the points inside cluster C ∈ COPT. More precisely,

σC := sqrt( (1/|C|) Σ_{x∈C} ‖x − µC‖² ) = sqrt( cost^(2)(C, µC) / |C| ).    (7)

Following this, σmax is defined as the maximum standard deviation of points inside a given cluster, i.e., σmax := max_{C∈COPT} σC, and similarly σmin := min_{C∈COPT} σC.

2. We need a parameter gα that measures the concentration of the distances of the points to the centroid µC in a cluster C:

gα := max_{C∈COPT} ( (1/|C|) cost^(α)(C, µC) ) / ( cost^(2)(C, µC) / |C| )^{α/2}    (8)
    = max_{C∈COPT} ( (1/|C|) Σ_{x∈C} d^α(x, µC) ) / ( (1/|C|) Σ_{x∈C} d²(x, µC) )^{α/2}.    (9)

One can see that gα is equal to the αth absolute standardized moment of one cluster C (see Chapter 4 in (Kenney, 1939) for a reference).

3. Finally, we need a parameter ℓ to control the number of distinct weights of clusters (where the weight of a cluster C is simply equal to the number of points |C|). Formally, for any integer i ≥ 0 we let ki be the number of clusters of COPT whose weight lies in the interval [2^i, 2^{i+1}). Then we define the following key parameter:

ℓ := |{i ∈ N | ki > 0}|,    (10)

which is the number of intervals of the form [2^i, 2^{i+1}) containing the weight of at least one cluster. For instance, if all clusters have the same weight then ℓ = 1. If all the clusters have weights in some interval [m, M] then ℓ ≤ log₂(M/m).

Remark 2.1. Note that we can express OPT(C) for some optimal cluster C in terms of its standard deviation by OPT(C) = |C|σC², and the total cost of the optimal clustering is OPT = Σ_{C∈COPT} |C|σC².
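To make these three quantities concrete, here is a small numpy sketch (ours, not part of the paper's code) that computes gα, σmax/σmin and ℓ for a given reference clustering; `labels[i]` is assumed to be the index of the cluster containing point `X[i]`:

```python
import numpy as np

def instance_parameters(X, labels, alpha):
    """Sketch: g_alpha of Eq. (8), the ratio sigma_max/sigma_min, and the number of
    weight classes ell of Eq. (10), for the reference clustering given by `labels`."""
    g_alpha, sigmas, weight_classes = 0.0, [], set()
    for c in np.unique(labels):
        C = X[labels == c]
        d = np.linalg.norm(C - C.mean(axis=0), axis=1)        # distances to the centroid mu_C
        m2 = np.mean(d ** 2)                                  # sigma_C^2 = cost^(2)(C, mu_C) / |C|
        g_alpha = max(g_alpha, np.mean(d ** alpha) / m2 ** (alpha / 2))
        sigmas.append(np.sqrt(m2))
        weight_classes.add(int(np.floor(np.log2(len(C)))))    # the i with |C| in [2^i, 2^(i+1))
    return g_alpha, max(sigmas) / min(sigmas), len(weight_classes)

# e.g. g, sigma_ratio, ell = instance_parameters(X, labels, alpha=4.0)
```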
Given the aforementioned definitions, the main result that we show in this paper is the following theorem, whose formal proof appears in Section 3.

Theorem 2.2. For any clustering (C1, C2, . . . , Ck) of cost OPT, and any α > 2, the Dα seeding procedure returns a clustering of expected cost at most OPT times

O( f(α) · (σmax/σmin)^{2−4/α} · (gα · min{ℓ, log k})^{2/α} ),

where f(α) := α² / ( (α/2 − 1)^{2/α} (1 − 2^{2/α−1}) ). In particular, f(α) = O(α/ε)² if α ∈ [2 + ε, ∞) for some small ε > 0.

An immediate consequence of Theorem 2.2 is that Dα seeding with α > 2 fixed yields a constant-factor approximation guarantee for instances consisting of k regular simplices (i.e., all points in a cluster are arranged in a regular simplex) of the same radius (which implies σmax/σmin = 1) and same size (which implies ℓ = 1); a case that includes the described Ω(log k) lower bound instances for the standard D2 seeding from (Arthur & Vassilvitskii, 2007). Indeed, one can check that gα = 1 in that case. We remark that the above stated guarantees do not require the clusters to be separated (as they are, e.g., in the described lower bound instances of D2 seeding), which is a common assumption in several other works (see e.g. (Ostrovsky et al., 2013; Ackerman & Ben-David, 2009)).

We complement our results with the following lower bounds, which are fairly intuitive to prove.

Theorem 2.3. There exists an instance with a clustering of cost OPT such that ℓ = 1, σmax/σmin = 1, and the Dα seeding procedure returns a clustering of expected cost at least Ω( (gα)^{2/α} ) · OPT, and another instance with a clustering of cost OPT such that ℓ = 1, (gα)^{2/α} = O(1), and the Dα seeding procedure returns a clustering of expected cost at least Ω( (σmax/σmin)^{2−4/α} ) · OPT.

The formal proof of Theorem 2.3 can be found in Appendix D.1 and Appendix D.2. Moreover, one can use the lower bound instance in Appendix D.2 to obtain a single instance with k = 2 such that Dα seeding is no better than an ω(1)-approximation for all α ≥ 2 + ε, for some small fixed ε > 0.³

³However, it would be interesting to see if one can obtain a single instance which is Ω(log k) for all values of α ≥ 2 simultaneously.

Finally, as mentioned in the introduction, we also prove a lower bound on the greedy variant of k-means++.

Theorem 2.4. There exists an instance with k clusters for which Dα seeding guarantees a constant-factor approximation in expectation for any fixed α > 2, and such that the greedy k-means++ algorithm with f(k) samples is no better than an Ω((log log f(k))²)-approximation in expectation.

This highlights that Dα seeding can be superior in theory to the greedy variant. The proof of this last theorem can be found in Appendix D.3.

2.1. Discussion on the parameters

As we see, the guarantee of Dα seeding as stated in Theorem 2.2 has a dependence on gα, σmax/σmin, and min{ℓ, log k}. Here we discuss these dependencies.

The parameter gα. The moment condition can be seen as a characterization of the concentration of a cluster of points. For instance, one way that gα could be non-constant is when the cluster has many outliers that are still part of the cluster. On the contrary, if our clusters are generated by a Gaussian mixture, then gα (for any constant α ≥ 2) is a constant (in fact, it is not difficult to compute and see that gα ≤ α^α for Gaussian distributions). If we are in the infinite-number-of-samples limit, where each cluster becomes defined by a density function f on some domain D, then gα is equal to

( ∫_D ‖x − µ‖^α f(x) dx ) / ( ∫_D ‖x − µ‖² f(x) dx )^{α/2}.

To obtain a better understanding of our guarantees, we give below the value of (gα)^{2/α} for a few common distributions. W.l.o.g. we re-normalize to assume unit variance of each distribution (i.e., the denominator is equal to 1 in the definition of gα).

1. Perhaps the most classic distribution is the Gaussian distribution with unit variance. In this case, the standardized moment is equal to O((α/2)^{α/2}), thus (gα)^{2/α} is O(α). Furthermore, for multivariate Gaussians, gα = O(α^α d^{α/2−d/2}). Meaning, in higher dimensions gα decreases rapidly, and this is well supported by our understanding that a Gaussian distribution tends to behave like "balls" in higher dimensions (see Section 3.3.3 in (Vershynin, 2020)). In Remark 2.5, we detail how this gα can be used to obtain the guarantees for mixtures of Gaussians (as claimed in the abstract).

2. For the exponential distribution Exp(λ), which has slightly heavier tails than the Gaussian, the αth moment is O(α!) and thus (gα)^{2/α} is O(α²).

3. Now consider, for instance, a univariate Student-t distribution with degrees of freedom ν > 0, whose density function is given by

Γ((ν+1)/2) / ( sqrt(πν) Γ(ν/2) ) · (1 + x²/ν)^{−(ν+1)/2}

(Wikipedia T). It is well known that the αth moment exists only if α < ν. Thus the lower the degrees of freedom, the heavier the tail gets, and gα is bounded only when α < ν, in which case it is roughly O(ν^α) and thus (gα)^{2/α} = O(ν²).

Remark 2.5. The claims made in the abstract for instances drawn from a mixture of Gaussian distributions are now easy to see. Suppose the mixture of k Gaussians X ~ Σ_{i=1}^{k} wi N(µi, Σi) satisfies max_{i,j∈[k]} tr(Σi)/tr(Σj) = O(1), so that σmax/σmin = O(1). Therefore, in that case we obtain an approximation guarantee of O(α² gα^{2/α} min{ℓ, log k}^{2/α}) = O(α³ min{ℓ, log k}^{2/α}). If the mixing weights wi are all equal, then ℓ = 1 and we obtain an approximation ratio of O(α³), which is constant for any constant α > 2. If the mixing weights are arbitrary, our approximation ratio is at most O(α³ min{ℓ, log k}^{2/α}), which is O((log log k)³) for α = Θ(log log k). Finally, we show that we can achieve this approximation even in the case of a finite number of samples taken from an arbitrary mixture of Gaussians (see Appendix C.2).
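A quick Monte-Carlo sanity check of these orders of magnitude (our own snippet; the sample size and the particular degrees of freedom are arbitrary choices, and the distributions match the three examples above):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

samples = {
    "gaussian":        rng.normal(size=n),
    "exponential":     rng.exponential(size=n),
    "student-t, nu=5": rng.standard_t(df=5, size=n),
}

for alpha in (3.0, 4.0, 6.0):
    for name, x in samples.items():
        d = np.abs(x - x.mean())
        g = np.mean(d ** alpha) / np.mean(d ** 2) ** (alpha / 2)   # empirical g_alpha, Eq. (8)
        print(f"alpha={alpha:.0f}  {name:15s}  (g_alpha)^(2/alpha) ~ {g ** (2 / alpha):.2f}")

# Note: for the student-t with nu = 5 the moment of order alpha >= 5 does not exist, so the
# alpha = 6 estimate is unstable and grows with n, matching the discussion above.
```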
We show in Appendix D.2 that the dependency on gα in Theorem 2.2 is tight. As a simple example that highlights the intuition, consider the instance given in Figure 1. The red cluster is drawn from a standard 2-dimensional Gaussian law. The blue cluster consists of many points highly concentrated at distance δ from the mean of the red cluster, and one single point at distance δ + ∆ from the mean of the red cluster. For this blue cluster, gα will be unbounded as ∆ tends to infinity, for any α > 2. Note that we can choose the parameters in this instance so that (i) both clusters have the same variance and (ii) both clusters have the same number of points, so the other parameters do not play a role here. In this situation, there is still probability 1/2 that the first center is selected in the red cluster (the first center is always chosen uniformly at random), and conditioned on that fact, Dα seeding (for α > 2) will give way too much probability to the isolated point in the blue cluster, which is a serious issue. Our lower bound construction in Appendix D.2 is a simple formalization of this intuition.

The parameter σmax/σmin. The dependence on σmax/σmin is necessary and in fact tight (see Appendix D.1). The main issue in sampling with large α is that the algorithm might sample repeatedly from a cluster with large standard deviation more often, and thus it might fail to discover some other clusters.

The parameter ℓ. The dependence on min{ℓ, log k} remains an intriguing open problem, and it is unclear to us whether any dependency on ℓ or k is needed when α > 2. Note that for α going to infinity, the parameter ℓ should matter less and less, since Dα seeding becomes equivalent to picking the furthest point. This behavior is accurately reflected in our bound.

2.2. On choosing α

Our theorem states that there is a trade-off in choosing α. We already know that α = 2 may not be the best choice, and this is due to the well-clusterable instance of simplices of equal side length that are sufficiently far apart. However, Theorem 2.2 implies that any α > 2 gives a constant-factor approximation in this case, and this is because Dα is more aggressive in discovering new clusters. But is it in our best interest to set α → ∞? Interestingly, Balcan et al. (Balcan et al., 2018) show that the best α is learnable, hence selecting the best α is a manageable task when there is a training set. Moreover, our theorem predicts a new phenomenon that is not present in the experiments in (Balcan et al., 2018). For a mixture of balanced Gaussians, (Balcan et al., 2018) obtain experimental results whose pattern roughly matches the one shown on the left-hand side of Figure 2. This experiment corresponds to a Gaussian mixture with the same covariance matrix (namely identity). However, our theory indicates that there is a dependence on the variances that is necessary, and it appears on the right-hand side of Figure 2, where one of the Gaussians has much larger variance. Note that our approximation factor has a dependence of (σmax/σmin)^{2−4/α}, and to mitigate this effect one can choose some α that is not too large but still greater than 2. In fact, our result suggests a simple strategy: using the training set, one can obtain an estimate of the key parameters gα, σmax/σmin, ℓ, and use these estimates to sufficiently narrow down the search for the best α, hence speeding up the data-driven approach proposed by (Balcan et al., 2018).
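One possible instantiation of this strategy is sketched below: it scores each candidate α by the bound of Theorem 2.2 and keeps the best-scoring candidates for a subsequent (much smaller) data-driven search. The candidate grid and the use of the bound as a ranking proxy are our own choices, not a procedure from the paper; `instance_parameters` is the helper sketched after Remark 2.1.

```python
import numpy as np

def theorem_2_2_proxy(alpha, g_alpha, sigma_ratio, ell, k):
    # f(alpha) * g_alpha^(2/a) * (sigma_max/sigma_min)^(2 - 4/a) * min(ell, log k)^(2/a)
    f = alpha ** 2 / ((alpha / 2 - 1) ** (2 / alpha) * (1 - 2 ** (2 / alpha - 1)))
    return (f * g_alpha ** (2 / alpha)
              * sigma_ratio ** (2 - 4 / alpha)
              * min(ell, np.log(k)) ** (2 / alpha))

def narrow_alpha(X, labels, candidates=(2.5, 3.0, 4.0, 6.0, 10.0, 20.0)):
    """Rank candidate alphas on a training instance by the Theorem 2.2 proxy.
    g_alpha depends on alpha, so the parameters are re-estimated per candidate."""
    k = len(np.unique(labels))
    scores = []
    for a in candidates:
        g, ratio, ell = instance_parameters(X, labels, alpha=a)
        scores.append((theorem_2_2_proxy(a, g, ratio, ell, k), a))
    return [a for _, a in sorted(scores)]   # best-scoring candidates first
```

A small cross-validation over the two or three best-ranked candidates then replaces a full grid search over α.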
Note that the experiments mentioned here do not use additional steps of Lloyd's algorithm. Since it is quite common to use this algorithm after the seeding, we run additional experiments in Section B. Interestingly, we observe that the general pattern does not change much even after running Lloyd's algorithm until convergence. This means that choosing α > 2 has some significant benefits over α = 2, even if we run Lloyd's algorithm after the seeding.

As a final note, we mention that it is common wisdom among practitioners that k-means is a good objective, except when the clusters might have varying sizes and density, or when there are many outliers (Google; scikit). In this context, varying sizes and density can be interpreted as the parameters ℓ and σmax/σmin, while outliers correspond to the parameter gα. If one believes this common wisdom, then our result essentially implies that whenever k-means is a good clustering objective, choosing α > 2 should almost always be better than α = 2.

3. Proof Sketch of Theorem 2.2

The proof of Theorem 2.2 is inspired by a very clean potential function analysis of the D2 seeding algorithm by (Dasgupta, 2013). In a similar fashion, it is useful to bound the potential increase at each step. However, as we will see later, the potential function that is used for the D2 seeding analysis does not seem to work for Dα seeding, and some additional ideas are required. While there are other examples of potential-based analyses of seeding algorithms for k-means in (Aggarwal et al., 2009), to the best of our knowledge, the potential that we use is the first to be able to obtain interesting guarantees when the sampling probability of a point can be arbitrarily far from being proportional to its cost. This is generally non-trivial, as evidenced by recent results in Grunau et al. (2023a); Bhattacharya et al. (2020). We defer a more detailed discussion of the novelty of our potential function to Appendix C.1.

Figure 1. An instance with k = 2.

Figure 2. Performance of Dα seeding for two instances D1 (on the left) and D2 (on the right), from a balanced mixture of k Gaussians with k = 4 and d = 2. The centers/means of the Gaussians for both instances are placed on the vertices of a square of side length 100. However, the covariance matrices for D1 are {I, I, I, I}, whilst the covariance matrices for D2 are {800I, I, I, I}, where I is the identity matrix in two dimensions.

As in (Dasgupta, 2013), at every iteration t, it is useful to keep track of the set of optimal clusters from which a center has already been chosen (i.e., the hit clusters) and the complement of this set, which is the set of undiscovered clusters. Formally, we define Ht to be the set of hit clusters after selecting a set of t centers denoted by Zt, i.e.

Ht := {C ∈ COPT : C ∩ Zt ≠ ∅},

where we recall that COPT is the set of clusters in the optimum solution. Ut is defined to be the set of remaining undiscovered clusters, i.e., Ut := COPT \ Ht. Furthermore, we define cost^(2)_t(C) as a shorthand to denote the cost induced by the points in the cluster C after the set Zt of t centers is chosen. More formally,

cost^(2)_t(C) := cost^(2)(C, Zt) = Σ_{x∈C} min_{z∈Zt} ‖x − z‖².
Since we analyze Dα seeding, we also need to work with the α-cost:

cost^(α)_t(C) := cost^(α)(C, Zt) = Σ_{x∈C} min_{z∈Zt} ‖x − z‖^α.

For any set S of clusters, we define cost^(2)_t(S) := Σ_{C∈S} cost^(2)_t(C) and cost^(α)_t(S) := Σ_{C∈S} cost^(α)_t(C). Moreover, we can talk about the cost of a single point at iteration t as cost^(2)_t(x) := min_{z∈Zt} ‖x − z‖².

The main challenge of the proof is to upper bound the expected cost of undiscovered clusters after selecting our k centers. The cost of hit clusters can be upper bounded fairly easily, in a similar manner as in the analysis of D2 seeding in (Arthur & Vassilvitskii, 2007).

The potential function. We now proceed to define our potential function, which will be used to upper bound the cost of undiscovered clusters. For each i ≥ 0, we define Si to be the set of clusters C ∈ COPT such that |C| lies in the interval [2^i, 2^{i+1}) (recall that we also defined ki := |Si|). For each i ≥ 0 and t ≥ 0, we define an integer τi(t) ≥ 0 which will be a local counter, relevant only for the clusters in Si at iteration t. zt is defined to be the center selected at iteration t, and U^(i)_t := Ut ∩ Si is the set of undiscovered clusters in Si. For each i ≥ 0 and time t, we will define an integer wi(t) ≥ 0 corresponding to the number of iterations that are considered wasted by the clusters in Si at time t. We use the word wasted to follow the intuition given in (Dasgupta, 2013), where an iteration t is wasted when the selected center zt belongs to an already discovered cluster (in particular, this new center does not discover a new cluster).

Some intuition. In our potential function, the counter τi will intuitively count how many iterations were relevant to the set of clusters Si. Once τi reaches the value ki, we will consider that the set Si was given enough tries to cover its clusters. During the sampling, it might be that some cluster is hit twice, which will increase the wasted counter wj for all j ≥ 0 (and τj also increases for all j ≥ 0). It might be counter-intuitive that when a cluster in Si is hit twice, this still counts as a try for the other sets Sj even if j ≠ i. Indeed, it might seem much more natural to simply count this try only for the set of clusters Si. However, proceeding in this more intuitive way would create a serious issue: some set Sj might get fewer than kj tries, because some other set gets more tries than needed.

Formally, we initialize t = 0, wi(0) = 0 for all i ∈ N, and τi(0) = 0 for all i ∈ N. Then, we maintain these quantities as follows. At time t ≥ 0, let C be the cluster the next center zt is chosen from, and let i ≥ 0 be the integer such that C ∈ Si. Then, there are two cases.

1. If C ∈ U^(i)_t, then we set (for all j ≥ 0)

τj(t + 1) = τj(t) + 1 if j = i and τj(t) < kj, and τj(t + 1) = τj(t) otherwise,

and wj(t + 1) = wj(t) for all j ≥ 0.

2. Otherwise, if C ∈ Ht, then we set (for all j ≥ 0)

τj(t + 1) = τj(t) + 1 if τj(t) < kj, and τj(t + 1) = τj(t) if τj(t) ≥ kj;
wj(t + 1) = wj(t) + 1 if τj(t) < kj, and wj(t + 1) = wj(t) if τj(t) ≥ kj.

Based on these quantities, we define the potential function as follows. First, we define

ϕi(t) := ( wi(t) / |U^(i)_t| ) · (2^i)^{1−2/α} · Σ_{C∈U^(i)_t} (cost^(α)_t(C))^{2/α}.    (11)

The final potential function can now be defined as

ϕ(t) := Σ_{i≥0} ϕi(t).    (12)

Remark 3.1. Note that we introduced the quantity (2^i)^{1−2/α} Σ_{C∈U^(i)_t} (cost^(α)_t(C))^{2/α}. It is not obvious that this is a good quantity to control, since we are interested in upper bounding the quantity Σ_{C∈U^(i)_t} cost^(2)_t(C). However, by a simple application of a standard convexity inequality, one will notice that

(2^{i+1})^{1−2/α} Σ_{C∈U^(i)_t} (cost^(α)_t(C))^{2/α} ≥ Σ_{C∈U^(i)_t} cost^(2)_t(C).
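For completeness, here is the convexity step behind Remark 3.1 (our own one-line derivation; we write cost_t(x) := min_{z∈Zt} ‖x − z‖ for brevity). For any cluster C ∈ U^(i)_t we have |C| < 2^{i+1}, and the power-mean inequality (the mean of squares is at most the 2/α-th power of the mean of α-th powers when α ≥ 2) gives

cost^(2)_t(C) = |C| · (1/|C|) Σ_{x∈C} cost_t(x)² ≤ |C| · ( (1/|C|) Σ_{x∈C} cost_t(x)^α )^{2/α} = |C|^{1−2/α} (cost^(α)_t(C))^{2/α} ≤ (2^{i+1})^{1−2/α} (cost^(α)_t(C))^{2/α}.

Summing over C ∈ U^(i)_t yields the inequality stated in Remark 3.1; the same step appears again in the proof of Lemma A.2.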
Using this potential, we are ready to proceed with the main proof. Akin to Dasgupta s analysis (Dasgupta, 2013), we split the proof in three main parts. In Section A.1, we show that ϕ(k) is indeed an upper bound on the final cost of undiscovered clusters. In Section A.2, we upper bound the cost of hit clusters using gα. In Section A.3 we upper-bound the increase of the potential function. Finally, we complete the proof in Section A.4. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Acknowledgements This work was supported by the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract number MB22.00054. Part of this work had been carried out when Sai Ganesh Nagarajan was at EPFL. Ackerman, M. and Ben-David, S. Clusterability: A theoretical study. In Artificial intelligence and statistics, pp. 1 8. PMLR, 2009. Aggarwal, A., Deshpande, A., and Kannan, R. Adaptive sampling for k-means clustering. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques: 12th International Workshop, APPROX Seeding for k-means 2009, and 13th International Workshop, RANDOM 2009, Berkeley, CA, USA, August 21-23, 2009. Proceedings, pp. 15 28. Springer, 2009. Alashwal, H., El Halaby, M., Crouse, J. J., Abdalla, A., and Moustafa, A. A. The application of unsupervised clustering methods to alzheimer s disease. Frontiers in computational neuroscience, 13:31, 2019. Arora, S. and Kannan, R. Learning mixtures of separated nonspherical gaussians. The Annals of Applied Probability, 15(1A):69 92, 2005. Arthur, D. and Vassilvitskii, S. How slow is the k-means method? In Proceedings of the twenty-second annual symposium on Computational geometry, pp. 144 153, 2006. Arthur, D. and Vassilvitskii, S. k-means++ the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 1027 1035, 2007. Artin, E. The gamma function. Courier Dover Publications, 2015. Bachem, O., Lucic, M., and Krause, A. Distributed and provably good seedings for k-means in constant rounds. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pp. 292 300. PMLR, 2017. Bahmani, B., Moseley, B., Vattani, A., Kumar, R., and Vassilvitskii, S. Scalable k-means++. ar Xiv preprint ar Xiv:1203.6402, 2012. Balcan, M.-F. F., Dick, T., and White, C. Data-driven clustering via parameterized lloyd s families. Advances in Neural Information Processing Systems, 31, 2018. Bhattacharya, A., Eube, J., Röglin, H., and Schmidt, M. Noisy, greedy and not so greedy k-means++. In 28th Annual European Symposium on Algorithms (ESA 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020. Burney, S. A. and Tariq, H. K-means cluster analysis for image segmentation. International Journal of Computer Applications, 96(4), 2014. Chiu, C.-Y., Chen, Y.-F., Kuo, I.-T., and Ku, H. C. An intelligent market segmentation system using k-means and particle swarm optimization. Expert systems with applications, 36(3):4558 4565, 2009. Choo, D., Grunau, C., Portmann, J., and Rozhon, V. kmeans++: few more steps yield constant approximation. In International Conference on Machine Learning, pp. 1909 1917. PMLR, 2020. 
Cohen-Addad, V., Lattanzi, S., Norouzi-Fard, A., Sohler, C., and Svensson, O. Fast and accurate $k$-means++ via rejection sampling. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. Dα Seeding. Link to code. https://github. com/saiganesh223/Alpha-Seeding-for-kmeans, 2024. Dasgupta, S. Learning mixtures of gaussians. In 40th Annual Symposium on Foundations of Computer Science (Cat. No. 99CB37039), pp. 634 644. IEEE, 1999. Dasgupta, S. Algorithms for k-means clustering. https://cseweb.ucsd.edu/~dasgupta/ 291-geom/kmeans.pdf, 2013. Google. k-means advantages and disadvantages. https: //developers.google.com/machinelearning/clustering/algorithm/ advantages-disadvantages?hl=en. Accessed: 2023-10-06. Grunau, C., Özüdo gru, A. A., and Rozhoˇn, V. Noisy kmeans++ revisited. In 31st Annual European Symposium on Algorithms (ESA 2023). Schloss Dagstuhl-Leibniz Zentrum für Informatik, 2023a. Grunau, C., Özüdogru, A. A., Rozhon, V., and Tetek, J. A nearly tight analysis of greedy k-means++. In Proceedings of the 2023 ACM-SIAM Symposium on Discrete Algorithms, SODA 2023, Florence, Italy, January 22-25, 2023, pp. 1012 1070. SIAM, 2023b. Hardy, G. H., Littlewood, J. E., and Pólya, G. Inequalities. Cambridge university press, 1952. Kenney, J. F. Mathematics of statistics. D. Van Nostrand, 1939. Lattanzi, S. and Sohler, C. A better k-means++ algorithm via local search. In International Conference on Machine Learning, pp. 3662 3671. PMLR, 2019. Mahajan, M., Nimbhorkar, P., and Varadarajan, K. The planar k-means problem is np-hard. In International workshop on algorithms and computation, pp. 274 285. Springer, 2009. Makarychev, K., Reddy, A., and Shan, L. Improved guarantees for k-means++ and k-means++ parallel. Advances in Neural Information Processing Systems, 33:16142 16152, 2020. Seeding for k-means Münz, G., Li, S., and Carle, G. Traffic anomaly detection using k-means clustering. In Gi/itg workshop mmbnet, volume 7, 2007. Ostrovsky, R., Rabani, Y., Schulman, L. J., and Swamy, C. The effectiveness of lloyd-type methods for the k-means problem. Journal of the ACM (JACM), 59(6):1 22, 2013. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. Scikit-learn: Machine learning in python. the Journal of machine Learning research, 12:2825 2830, 2011. scikit. k-means clustering. https://scikit-learn. org/stable/modules/clustering.html. Accessed: 2024-01-18. Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence, 22(8):888 905, 2000. van der Vaart, A. and Wellner, J. A. Weak convergence and empirical processes. 1996. Vershynin, R. High-dimensional probability. University of California, Irvine, 2020. Wei, D. A constant-factor bi-criteria approximation guarantee for k-means++. Advances in neural information processing systems, 29, 2016. Wikipedia T. Student-t distribtuion. https://en. wikipedia.org/wiki/Student%27s_tdistribution. Accessed: 2023-10-07. Seeding for k-means A. Proof of Theorem 2.2 In the whole proof of the main result, we will work with a slightly modified parameter gα. We define ˆgα := max C COPT (1/|C|2) P z C cost(α)(C, z) (cost(2)(C, µC)/|C|)α/2 . 
(13) To see how this relates to the gα parameter we can write z C cost(α)(C, z) = 1 |C|2 X x C dα(x, z) x C 2α(dα(x, µC) + dα(z, µC)) x C dα(x, µC) + |C| X y C dα(y, µC) x C dα(x, µC) , where the second line uses the triangle inequality and the standard inequality (x + y)α 2α(xα + yα) for any x, y 0. Hence we clearly have ˆgα 2α+1 gα , and since all our guarantees will be involve the quantity (ˆgα)2/α, we can simply hide the factor (2α+1)2/α in the big O and replace ˆgα by gα. Now we are ready to use the potential function defined in Section 3 for our proof. A.1. Relating the potential function and the cost of undiscovered clusters This part is fairly straightforward, using a few lemmas. Lemma A.1. For all i N, we have that wi(k) |U (i) k |. Proof. Consider the quantity δi(t) := |U (i) t | wi(t). Clearly, we have that δi(0) = ki. Next, notice that for every time-step t < k such that τi(t + 1) = τi(t) + 1, the quantity δi(t) decreases by 1. Indeed, either the algorithm discovers a new cluster in |U (i) t | (in which case |U (i) t+1| = |U (i) t | 1), or the algorithm wastes an iteration in some cluster, in which case wi(t + 1) = wi(t) + 1. Finally, we claim that there are at least ki such iterations. Note that if τi(t) < ki, then the only way we have that τi does not increase is if the algorithm discovers a new cluster in |U (j) t | for some j = i. This can happen at most P j =i kj times. Hence there must be at least k P j =i kj = ki iterations where the counter τi increases. If we denote by ti the first time at which τi(ti) = ki, this implies ti k and that that δi(ti) = 0 hence wi(ti) = |U (i) ti |. Finally, note that for t > ti, wi(t) does not change anymore, and |U (i) t | can only decrease; which concludes the proof. Lemma A.2. We have that ϕ(k) P i N cost(2)(U (i) k , Zk)/2 = cost(2) k (Uk)/2. Proof. Using Lemma A.1, we have i N ϕi(k) (1/2) (2i+1)1 2/α X (cost(α)(C, Zk))2/α . Using Jensen s inequality (see Appendix C.3 for a reference) and the fact that |C| 2i+1 for all C U (i) k , we have that (2i+1)1 2/α(cost(α)(C, Zk))2/α cost(2) k (C) for all C U (i) k , which concludes the proof. Seeding for k-means A.2. The cost of hit clusters In this section, we give upper bounds on the expected cost of clusters that were hit during the seeding process. These proofs are similar to the ones found in (Dasgupta, 2013; Arthur & Vassilvitskii, 2007). In fact, the first two lemmas are taken directly from (Arthur & Vassilvitskii, 2007). Lemma A.3 (From (Arthur & Vassilvitskii, 2007)). Assume some arbitrary set T of centers have already been selected, and z C is added next using Dα-sampling. Then, E h cost(α)(C, T {z}) | z C i 22α cost(α)(C, µC) . Lemma A.4 (From (Arthur & Vassilvitskii, 2007)). Assume some point z is selected uniformly at random among the points belonging to some cluster C. Then, for any α 2, E h cost(α)(C, z) i 2α cost(α)(C, µC) . The next lemma deals with the expected squared cost of the hit clusters during the seeding process. Lemma A.5. Assume some arbitrary set T of centers have already been selected, and z C is added next using Dα seeding. Then, E h cost(2)(C, T {z}) | z C i (4e + (α + 1)2 (ˆˆgα)2/α) cost(2)(C, µC) . Proof. We start by noting that if cost(2)(C, T) (α + 1)2 ( ˆgα)2/αcost(2)(C, µC) then the lemma already holds since adding an additional center can only decrease the cost of C. Therefore, we assume this is not the case in the rest of the proof. We can write E h cost(2)(C, T {z}) | z C i = X cost(α)(z, T) cost(α)(C, T) X x C min{cost(2)(x, T), x z 2} . 
Let us upperbound the quantity cost(α)(z, T). For this, let us fix any x C. Then, if we denote by tx the point in T which is closest to x C, we have that cost(α)(z, T) z tx α ( z x + x tx )α (α + 1)α z x α + (1 + 1/α)α x tx α , using the triangle inequality and a case distinction whether z x > (1/α) x tx or not. Averaging this upper bound over all x C, we obtain that cost(α)(z, T) (α + 1)α cost(α)(C, z) |C| + (1 + 1/α)α cost(α)(C, T) (α + 1)α cost(α)(C, z) |C| + e cost(α)(C, T) Hence we can rewrite E h cost(2)(C, T {z}) | z C i (α+1)α cost(α)(C,z) |C| + e cost(α)(C,T ) |C| cost(α)(C, T) X x C min{cost(2)(x, T), ||x z||2} cost(α)(C, z) cost(α)(C, T) cost(2)(C, T) + e z C cost(2)(C, z) . To finish the argument, note that the second term in the last line corresponds to e times the expected cost of C if we pick once center z C, uniformly at random. By Lemma A.4, this is at most (4e) cost(2)(C, µC). For the second term, we use Equation (13) to write cost(α)(C, z) cost(α)(C, T) cost(2)(C, T) ((α + 1)α ˆgα) (cost(2)(C, µC))α/2 cost(2)(C, T) |C|α/2 1cost(α)(C, T) . Seeding for k-means Using Jensen s inequality, we obtain that |C|α/2 1cost(α)(C, T) (cost(2)(C, T))α/2. Hence we finally get that cost(α)(C, z) cost(α)(C, T) cost(2)(C, T) ((α + 1)α ˆgα) (cost(2)(C, µC))α/2 (cost(2)(C, T))α/2 1 . Using our assumption that cost(2)(C, T) (α + 1)2 ( ˆgα)2/αcost(2)(C, µC), we clearly get cost(α)(C, z) cost(α)(C, T) cost(2)(C, T) ((α + 1)2 ( ˆgα)2/α) cost(2)(C, µC) , which finishes the proof. The last lemma relates the squared cost and the α-powered cost of any cluster. Lemma A.6. For any cluster C, we have cost(α)(C, µC) ˆgα |C| (σC)α . Proof. We note that cost(α)(C, µC) 1 |C| P z C cost(α)(C, z) (using Jensen s inequality, and the convexity of the function y 7 P z C ||z y||α). Hence, we obtain cost(α)(C, µC) z C cost(α)(C, z) |C|1 α/2 ˆgα (cost(2)(C, µC))α/2 = ˆgα |C| (σC)α , where the second inequality uses our definition of ˆgα. A.3. The increase of potential In this section, we bound the final potential ϕ(k). First, we analyze the increase of local potential ϕi individually, then we use these results to bound the final expected potential E[ϕ(k)]. A.3.1. THE INCREASE IN A WEIGHT CLASS Lemma A.7. Let Bt be the event that the t-th center is selected from Ht. Then, for any t > 0, i 0, and any past choice of centers Zt 1, we have that E ϕi(t) ϕi(t 1) | {Zt 1, Bt} (τi(t) τi(t 1)) (2i)1 2/α |U (i) t 1| X C U (i) t 1 (cost(α) t 1(C))2/α . Proof. Fix some i 0. If τi(t 1) = τi(t), we have by definition wi(t) = wi(t 1) and clearly the potential ϕi cannot increase. Otherwise, we simply note that wi(t) = wi(t 1) + 1, and the result clearly follows. Note that in both cases, we use the fact that (cost(α) t (C))2/α (cost(α) t 1(C))2/α. Lemma A.8. Let A(i) t be the event that the t-th center is selected from and undiscovered cluster belonging to the weight class i. Then, for any t > 0, i, j 0, any past choice of centers Zt 1, we have that E ϕj(t) ϕj(t 1) | {Zt 1, A(i) t } 0. Proof. We first note that since zt / Ht, by definition we have wj(t) = wj(t 1) for all j = i, hence ϕj(t) ϕj(t 1) for all j = i. To bound the change of ϕi we notice that wi(t) = wi(t 1), and that |U (i) t | = |U (i) t 1| 1. Let Cz be the cluster which is selected in U (i) t 1. We claim that E (cost(α) t 1(Cz))2/α | {Zt 1, A(i) t } 1 |U (i) t 1| X C U (i) t 1 (cost(α) t 1(C))2/α . 
(14) Seeding for k-means Indeed, note that E h (cost(α) t 1(Cz))2/α | {Zt 1, A(i) t } i = X C U (i) t 1 cost(α) t 1(C) P C U (i) t 1 cost(α) t 1(C) cost(α) t 1(C) 2/α = |U (i) t 1| P C U (i) t 1 cost(α) t 1(C) X C U (i) t 1 cost(α) t 1(C) cost(α) t 1(C) 2/α |U (i) t 1| (1) |U (i) t 1| P C U (i) t 1 cost(α) t 1(C) X C U (i) t 1 cost(α) t 1(C) |U (i) t 1| X C U (i) t 1 (cost(α) t 1(C))2/α |U (i) t 1| C U (i) t 1 (cost(α) t 1(C))2/α |U (i) t 1| . The inequality (1) can be obtained using Lemma C.2 (Chebyshev s sum inequality) by considering the ordered sequence cost(α) t 1(C) C U (i) t 1 and (cost(α) t 1(C))2/α C U (i) t 1 . Thus we have: E h ϕi(t) | {Zt 1, A(i) t } i |U (i) t 1| 1 (2i)1 2/α C U (i) t 1 (cost(α) t (C))2/α E h (cost(α) t 1(Cz))2/α | {Zt 1, A(i) t } i |U (i) t 1| 1 (2i)1 2/α |U (i) t 1| 1 |U (i) t 1| X C U (i) t 1 (cost(α) t 1(C))2/α The previous two lemmas upper-bound the increase of ϕj conditioned on events Bt or A(i) t . Using this two bounds, the next lemma upper-bounds the expected increase of the potential ϕj, removing the conditioning on the events Bt, A(i) t . Lemma A.9. For every i 0, α > 2, t > 0, we have E [ϕi(t) ϕi(t 1) | {τi(t) = τi(t 1) + 1}] h(α) 2i 1 2/α (ki τi(t 1)) 2/α (22α ˆgα) X C COPT |C|(σC)α !2/α where h(α) = (α/2 1)1 2/α E [ϕi(t) ϕi(t 1) | {τi(t) = τi(t 1)}] 0 . Proof. If τi(t) = τi(t 1), then either we can apply Lemma A.7 or Lemma A.8, and in both cases the expected potential ϕi can only decrease. If τi(t) = τi(t 1) + 1 then either zt belongs to U (i) t 1, in which case the potential ϕi can only decrease in expectation (by Lemma A.8 again). Therefore there remains only the case that zt Ht 1 (we denote this event by Bt). In this case, let us denote by Zt 1 the current set of selected centers. Using Lemma A.7 and the previous cases we have that Seeding for k-means E [ϕi(t) ϕi(t 1) | {τi(t) = τi(t 1) + 1, Zt 1}] C U (i) t 1(cost(α) t 1(C))2/α |U (i) t 1| P[Bt | {τi(t) = τi(t 1) + 1, Zt 1}] C U (i) t 1(cost(α) t 1(C))2/α |U (i) t 1| cost(α) t 1(Ht 1) cost(α) t 1(Ht 1) + cost(α) t 1(U (i) t 1) C U (i) t 1(cost(α) t 1(C))2/α |U (i) t 1| cost(α) t 1(Ht 1) cost(α) t 1(Ht 1) + |U (i) t 1| C U(i) t 1 (cost(α) t 1(C))2/α |U (i) t 1| where the last inequality uses Jensen s inequality. If we consider the last expression as a function of X := P C U(i) t 1 (cost(α) t 1(C))2/α |U (i) t 1| (the other quantities being fixed), one can see that this expression attains a maximum value cost(α) t 1(Ht 1) |U (i) t 1| (α/2 1) Plugging in this value, we obtain that E [ϕi(t) ϕi(t 1) | {τi(t) = τi(t 1) + 1, Zt 1}] (α/2 1)1 2/α α/2 (2i)1 2/α (cost(α) t 1(Ht 1))2/α |U (i) t 1|2/α (α/2 1)1 2/α α/2 (2i)1 2/α (ki τi(t 1)) 2/α (cost(α) t 1(Ht))2/α , where the second inequality uses the fact that |U (i) t 1| ki τi(t 1). By the law of total expectation, we have that E [ϕi(t) ϕi(t 1) | {τi(t) = τi(t 1) + 1}] = EZt 1 [E [ϕi(t) ϕi(t 1) | {τi(t) = τi(t 1) + 1, Zt 1}]] (α/2 1)1 2/α α/2 (2i)1 2/α (ki τi(t 1)) 2/α EZt 1 cost(α) t 1(Ht 1) 2/α (α/2 1)1 2/α α/2 (2i)1 2/α (ki τi(t 1)) 2/α EZt 1 h cost(α) t 1(Ht 1) i 2/α (α/2 1)1 2/α α/2 (2i)1 2/α (ki τi(t 1)) 2/α (22α ˆgα) X C COPT |C|(σC)α !2/α where the second inequality uses Jensen s inequality, and the last inequality uses Lemmas A.6, A.4, and Lemma A.3. Indeed, we clearly have the expected α-powered cost of a hit cluster is at most its expected cost after the first time it is hit, which, using Lemmas A.3 and A.4 is at most 22α times cost(α)(C, µC). A.3.2. 
THE GLOBAL INCREASE In this part, we are ready to bound the final expected value of ϕ(k) with the following lemma. Lemma A.10. For any α > 2, we have that OPT f(α) ( ˆgα)2/α σmax 2 4/α min{ℓ, log(k)}2/α , where f(α) = 16(α/2 1)1 2/α α/2 1 2 22/α 1 Seeding for k-means Proof. Using Lemma A.9, we obtain that E [ϕi(k)] h(α) 2i 1 2/α (22α ˆgα) X C COPT |C|(σC)α !2/α t=0 (ki t) 2/α h(α) 2i 1 2/α (22α ˆgα) X C COPT |C|(σC)α !2/α 0 (ki u) 2/αdu h(α) 2i 1 2/α (22α ˆgα) X C COPT |C|(σC)α !2/α Therefore, we obtain E[ϕ(k)] = X i 0 E[ϕi(k)] h(α) 1 2/α (22α ˆgα) X C COPT |C|(σC)α !2/α i 0 (2iki)1 2/α = 8h(α)( ˆgα)2/α C COPT |C|(σC)α !2/α i 0 (2iki)1 2/α . Comparing this with optimum clustering, we get using basic algebraic manipulations OPT = 8h(α)( ˆgα)2/α 1 2/α (P C |C|(σC)α)2/α P C |C|(σC)2 X i 0 (2iki)1 2/α = 8h(α)( ˆgα)2/α C |C| σC σmax α σα max 2/α P C |C| σC σmin i 0 (2iki)1 2/α 8h(α)( ˆgα)2/α C |C| σC σmax 2 2/α σ2 max C |C| σC σmin i 0 (2iki)1 2/α 8h(α)( ˆgα)2/α C |C| (σC)2 2/α σ2 4/α max P C |C| σC σmin i 0 (2iki)1 2/α 8h(α)( ˆgα)2/α P C |C| σC σmin 2 2/α (σmin)4/α σ2 4/α max C |C| σC σmin i 0 (2iki)1 2/α 8h(α)( ˆgα)2/α P i 0(2iki)1 2/α C |C|)1 2/α 16h(α)( ˆgα)2/α i 0(2iki)1 2/α P i 0 2iki 1 2/α , where the fourth inequality is obtained using Seeding for k-means To conclude the proof, we need to upper bound the ratio P i 0(2iki)1 2/α P i 0 2iki 1 2/α for any integer sequence (ki)i 0 satisfying the constraint P i 0 ki = k. Using Jensen s inequality (note that x 7 x1 2/α is concave), we immediately obtain P i 0(2iki)1 2/α P i 0 2iki 1 2/α ℓ2/α . (15) For the second upperbound, note that to have equality in Jensen s inequality, we must have that 2iki = 2jkj for all i, j 0 such that ki = 0, kj = 0. Intuitively, this can only happen when ℓ= O(log k). Formally, let us denote by L the maximum i 0 such that k L = 0. We then build a decreasing sequence of indices as follows. We define i1 to be equal to L. Then, assuming we defined the indices i1, i2, . . . ix, we define ix+1 to be the highest index 0 i < ix such that kix+1 2kix. We stop until it is not possible to find an index ix+1 fitting those conditions anymore. We denote by I the set of indices that were selected, and J := N \ I its complement. We notice that, by construction, |I| log k since P i 0 ki = k. Then clearly we have that i 0(2iki)1 2/α P i 0 2iki 1 2/α P i I(2iki)1 2/α P i I 2iki 1 2/α + i J (2iki)1 2/α P i I 2iki 1 2/α . (16) The first term on the right-hand side can be upperbounded by |I|2/α (log k)2/α, using Jensen s inequality again. As for the second term, let us denote by i1, i2, . . . , i|I| the set of indices selected to be in I. Then, we write i J (2iki)1 2/α = j (ix+1,ix) (2jkj)1 2/α j (ix+1,ix) (2j2kix)1 2/α j (ix+1,ix) (2ix2j ix2kix)1 2/α i I (2iki)1 2/α i I(2iki)1 2/α Therefore the second right-hand side term in Equation (16) can be upper bounded by which concludes the proof. A.4. Wrapping it up Using Lemma A.10 in combination with Lemma A.2, we obtain that the final expected cost of undiscovered clusters is at most h(α) ( ˆgα)2/α σmax 2 4/α min{ℓ, log(k)}2/α OPT , Seeding for k-means where h(α) := 16(α/2 1)1 2/α α/2 1 2 22/α 1 1 22/α 1 . Regarding the cost of hit clusters, we use Lemma A.5 and Lemma A.4 to see that the expected cost of hit clusters is at most (4e + (α + 1)2 ( ˆgα)2/α) OPT . Therefore, if we define f(α) := (4e + (α + 1)2) 16(α/2 1)1 2/α α/2 1 2 22/α 1 1 22/α 1 = O α2 (α/2 1)2/α (1 22/α 1) , we obtain precisely the result of Theorem 2.2. B. 
Additional Experiments

In this section, we provide additional experiments to further validate our claims. The code can be found in (Dα Seeding). In particular, we run Dα seeding on the following instances:

1. D1: A mixture of 4 Gaussians with the centers/means of the Gaussians placed on the corners of a square with edge length 100 (default). All the covariances are identity matrices (in d = 2).
2. D2: A mixture of 4 Gaussians with the centers/means of the Gaussians placed on the corners of a square with edge length 100 (default). All the covariances are identity matrices (in d = 2), except one that has covariance σ²I, with σ² = 400.
3. D3: A mixture of 8 Gaussians with the centers/means of the Gaussians placed on the corners of a cube with edge length 100 (default). All the covariances are identity matrices (in d = 3).
4. D4: A mixture of 8 Gaussians with the centers/means of the Gaussians placed on the corners of a cube with edge length 100 (default). All the covariances are identity matrices (in d = 3), except one that has covariance σ²I, with σ² = 800.
5. D5: A mixture of 4 bivariate Student-t distributions (see Figure 5), with different degrees of freedom, ν = {1.6, 2, 5, 10}, where the location parameter is centered on the corners of a square with edge length 100 (default).

We generate these instances with n = 1000 samples from the appropriate mixture distribution. Then we consider α = {2, 6, 10, . . . , 38}, i.e., starting from 2 with increments of 4, for the Gaussian instances D1 through D4. In all experiments, we show the average performance over N = 5000 trial runs for each value of α. In some cases we also run Lloyd's algorithm until convergence after the seeding.

Figures 4 and 3 confirm our main claims that for Gaussians of the same/similar covariances, it is always best to choose an α that is much larger than 2. This can be seen in both D1 and D3, although the costs seem to remain roughly constant for α ≥ 6. However, the story is different when there is one Gaussian with a sufficiently different covariance: one can see from the figures for D2 and D4 that there is a trade-off. Moreover, we can see that even if we run Lloyd's algorithm until convergence, the general pattern does not change much. A small exception here is Figure 3 on the instance D2: our theoretical approximation of the seeding step with different variances points to good values of α being smaller, in order to minimize the effect of (σmax/σmin)^{2−4/α}. However, the seeds planted by larger values of α are able to hit more of the optimum clusters, only that the chosen point within a cluster could be sub-optimal, and hence only a few steps of Lloyd's algorithm are able to fix it. This might explain why Lloyd's algorithm can fix the issue in some cases where α is big. In contrast, when α = 2 the seeding likely misses some optimum clusters, and additional Lloyd's steps do not offer much help in that case because it gets stuck in a local optimum.

Finally, with the experiments on the Student-t distribution, we intend to show the effect of gα on the performance of Dα seeding. Here, when the degrees of freedom (df) satisfy 1 < df ≤ 2, the variance is infinite (see (Wikipedia T)). As seen in Figure 6, when df = 1.6, large values of α show a significant degradation in their performance. But when df = 2, α ≥ 4 performs better than α = 2. In general, one might argue that if such a pattern is observed where α = 2 performs better than α > 2, it may be that the distributions have many outliers. In this case, the k-means objective is probably not appropriate and one should consider more robust objectives such as k-medians. Now with df = 5, 10, where the distributions become more Gaussian-like, we see a similar pattern as the one observed for Gaussian mixtures, where a large α is preferred. Crucially, we also notice that in this case, the number of Lloyd's steps required for convergence is much lower for α > 2 than the number needed when α = 2.
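A condensed way to reproduce the qualitative pattern for D1 and D2 is sketched below. This is our own sketch, not the released code linked above; it reuses the d_alpha_seeding function sketched in Section 1.2 and uses fewer trials than the N = 5000 reported in the figures:

```python
import numpy as np

def make_mixture(n=1000, side=100.0, cov_scales=(1, 1, 1, 1), rng=None):
    """Balanced mixture of 4 spherical Gaussians on the corners of a square (instances D1/D2)."""
    rng = np.random.default_rng(rng)
    corners = np.array([[0, 0], [0, side], [side, 0], [side, side]], dtype=float)
    labels = rng.integers(4, size=n)
    return corners[labels] + rng.normal(size=(n, 2)) * np.sqrt(np.array(cov_scales)[labels, None])

def kmeans_cost(X, centers):
    return np.min(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1).sum()

rng = np.random.default_rng(0)
for name, scales in [("D1", (1, 1, 1, 1)), ("D2", (400, 1, 1, 1))]:
    X = make_mixture(cov_scales=scales, rng=rng)
    for alpha in (2, 6, 14, 38):
        costs = [kmeans_cost(X, d_alpha_seeding(X, k=4, alpha=alpha, rng=rng))
                 for _ in range(200)]        # 200 trials instead of 5000 to keep the sketch quick
        print(f"{name}  alpha={alpha:2d}  mean seeding cost = {np.mean(costs):.0f}")
```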
Figures 3 and 4 confirm our main claim that, for Gaussians with the same (or similar) covariances, it is best to choose α much larger than 2. This can be seen for both D1 and D3, although the cost remains essentially constant for α ≥ 6. The story is different when one Gaussian has a sufficiently different covariance: for D2 and D4, one can see that there is a trade-off. Moreover, even if we run Lloyd's algorithm until convergence, the general pattern does not change much. A small exception is instance D2 in Figure 3: our theoretical analysis of the seeding step with different variances suggests that good values of α should be smaller, in order to limit the effect of (σmax/σmin)^(2−4/α). However, the seeds planted with larger values of α hit more of the optimum clusters; the chosen points may be sub-optimal within their clusters, but a few Lloyd's steps can fix this. This may explain why Lloyd's iterations are able to repair the solution in some cases where α is large. In contrast, when α = 2 the seeding likely misses some optimum clusters entirely, and additional Lloyd's steps do not help much in that case because the algorithm gets stuck in a local optimum.

Finally, with the experiments on student-t distributions, we intend to show the effect of gα on the performance of Dα seeding. Here, when the degrees of freedom (df) satisfy 1 < df ≤ 2, the variance is infinite (see (Wikipedia T)). As seen in Figure 6, when df = 1.6, large values of α show a significant degradation in performance, but when df = 2, values α ≥ 4 already perform better than α = 2. In general, one might argue that if such a pattern is observed, where α = 2 performs better than α > 2, it may be that the distributions have many outliers. In this case, the k-means objective is probably not appropriate and one should consider more robust objectives such as k-medians. For df = 5, 10, where the distributions become more Gaussian-like, we see a pattern similar to the one observed for Gaussian mixtures, where a large α is preferred. Crucially, we also notice that in this case the number of Lloyd's steps required for convergence is much lower for α > 2 than for α = 2.

Figure 3. Performance of Dα seeding for the instances D1 (left) and D2 (right).

Figure 4. Performance of Dα seeding for the instances D3 (left) and D4 (right). The green curve indicates the cost right after the seeding step and the blue curve the cost if we additionally run Lloyd's steps until convergence.

Figure 5. The pdf of univariate student-t distributions with different degrees of freedom, compared to the standard normal.

Figure 6. Performance of Dα seeding for the instance D5. The figure on the left shows the performance, and the one on the right the number of Lloyd's steps needed until convergence.

C. Additional Discussion and Useful Inequalities

C.1. On potential functions of previous works

Here, we discuss in more detail why a new potential function is needed when dealing with Dα seeding. Many of the previous works on k-means++ rely on a more natural potential function. For instance, in (Dasgupta, 2013) the potential function is of the form
$$\frac{W_t}{|U_t|}\,\mathrm{cost}^{(2)}(U_t, Z_t)\,,$$
where $U_t$ is the set of still-undiscovered clusters and $W_t$ is the number of iterations in which a center was selected in an already discovered cluster. We notice two main differences with our potential function. The first one is that we partition the clusters by weight classes, and the second is that we replace the expression $\mathrm{cost}^{(2)}(U_t^{(i)}, Z_t)$ by the more involved expression
$$(2^i)^{1-2/\alpha}\sum_{C\in U_t^{(i)}}\big(\mathrm{cost}^{(\alpha)}(C, Z_t)\big)^{2/\alpha}\,.$$

Let us discuss the second difference first, and for simplicity let us assume there is a unique weight class. Then it is very tempting to use the same potential function as (Dasgupta, 2013). Unfortunately, we run into issues when the next center is selected in an undiscovered cluster. Indeed, in that case it is quite crucial in the proof that the potential decreases in expectation, and an increase in potential here seems like a fatal flaw. To show that the potential decreases in the case of D2 seeding, one simply notices that if we had picked one of the undiscovered clusters uniformly at random and removed it from $\mathrm{cost}^{(2)}(U_t, Z_t)$, then the potential would already decrease. Now, D2 seeding can only do better than that, since heavier clusters have an even higher chance of being sampled. But what about Dα seeding? Unfortunately, this is not true anymore, as there can be two clusters $C_1, C_2$ such that $\mathrm{cost}^{(2)}(C_1, Z_t) > \mathrm{cost}^{(2)}(C_2, Z_t)$ but $\mathrm{cost}^{(\alpha)}(C_1, Z_t) < \mathrm{cost}^{(\alpha)}(C_2, Z_t)$, i.e. the orders are reversed. And this can happen even if $g_\alpha$ is fairly small.
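As a toy numerical illustration of this reversal (our example, for $\alpha = 4$): suppose that, at some iteration, cluster $C_1$ contains 100 points, each at distance 1 from the current set of centers $Z_t$, while cluster $C_2$ contains a single point at distance 5 from $Z_t$. Then
$$\mathrm{cost}^{(2)}(C_1, Z_t) = 100 > 25 = \mathrm{cost}^{(2)}(C_2, Z_t)\,, \qquad \mathrm{cost}^{(4)}(C_1, Z_t) = 100 < 625 = \mathrm{cost}^{(4)}(C_2, Z_t)\,,$$
so D² seeding puts more probability mass on $C_1$ while D⁴ seeding puts more probability mass on $C_2$.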
Therefore, one needs a potential in which the undiscovered clusters are ranked in the same order of importance as Dα seeding weights them. Of course, this potential should also scale as a square of the Euclidean distance. This is where our potential comes in. Now, about the first difference: note that, because of scaling issues, we had to add an extra factor $n^{1-2/\alpha}$ if all the clusters are of weight $n$. If the clusters have different weights, the idea of partitioning them using geometric grouping comes naturally.

C.2. Guarantees for finite sample mixture of Gaussians

By assumption we have $X \sim \sum_{i=1}^k w_i\,N(\mu_i, \Sigma_i)$, with $\max_{i,j\in[k]}\operatorname{tr}(\Sigma_i)/\operatorname{tr}(\Sigma_j) = O(1)$ and $w_i > 0$ for all $i\in[k]$ (w.l.o.g.). In the infinite-sample case (where the sums are replaced by expectations), we have that $C_i = \{x\in X : x\sim N(\mu_i, \Sigma_i)\}$ is the optimal clustering (with infinite samples), i.e., the true centers of the Gaussians are the parameters that maximize the (population) likelihood.

Below we show for a mixture of Gaussian distributions that $g_\alpha = O(\alpha^{\alpha/2})$, which implies that in the case of arbitrary weights the worst-case approximation factor is $O(\alpha^2(g_\alpha)^{2/\alpha}\log^{2/\alpha}k)$; thus, when we set $\alpha = \Theta(\log\log k)$, we obtain an $O((\log\log k)^3)$-approximation. When $\ell = O(1)$, we get an $O(\alpha^3)$-approximation, which is a constant for constant $\alpha$. Hence in the infinite-sample limit the result holds.

Now we use the quantities computed above (for infinite samples) and show that they do not change much, with high probability, when sufficiently many samples are drawn from the Gaussian mixture. Consider the reference clustering $C_i = \{x\in X : x\sim N(\mu_i, \Sigma_i)\}$, which is optimal in the case of infinite samples. We let $|X| = N$, and we first consider the case when all the covariances $\Sigma_i$ are full rank. We have $\mathbb{E}[|C_i|] = Nw_i$, and we let $w_{\min} := \min_{i\in[k]}w_i$. We can define the "sampled" version of the centroid and the standard deviation of each cluster, that is,
$$\hat\mu_i := \frac{1}{|C_i|}\sum_{x\in C_i}x\,, \qquad \hat r_i^{\,2} := \frac{1}{|C_i|}\sum_{x\in C_i}\|x - \hat\mu_i\|^2\,.$$
To ensure that these definitions make sense, we will require sufficiently many samples with respect to $w_{\min}$. Also, we will effectively replace $\hat\mu_i$ by $\mu_i$ in the above equation. Thus, we will condition on the following events being true.

1. $E_1$ = {each reference cluster $i$ has $\Theta(Nw_i)$ points}.
2. $E_2$ = {$\|\Sigma_i^{-1/2}(\hat\mu_i - \mu_i)\| < \varepsilon$, for some arbitrary $\varepsilon > 0$, for each $i\in[k]$}.

We remark that if $w_i = \Theta(1/k)$ for all $i$, then as a result of $E_1$ we have $\ell = O(1)$ w.h.p., in which case we get an $O(\alpha^3)$-approximation.

Continuing our proof: since we have a multivariate Gaussian distribution, it holds that $\mathbb{E}\|X_i - \mu_i\|^2 = \operatorname{tr}(\Sigma_i)$, where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix. Let us denote the eigenvalues of $\Sigma_i$ by $\lambda_{i1}\ge\lambda_{i2}\ge\dots\ge\lambda_{id}\ge 0$, so that $\lambda_{i1}$ is the largest eigenvalue of $\Sigma_i$. Note that the covariance matrix is always positive semi-definite; furthermore, when it is full rank, the eigenvalues are positive.

Now, we have to characterize $g_\alpha$. Using the definition in (8), in the expected sense, one has to compute
$$g^{(i)}(\alpha) := \frac{\mathbb{E}_{X_i, X_j}\big[\|X_i - X_j\|^\alpha\big]}{\big(\mathbb{E}\|X_i - \mu_i\|^2\big)^{\alpha/2}}\,.$$
We also note that
$$2\,\mathbb{E}\big[\|X_i - \mu_i\|^\alpha\big] \;\le\; \mathbb{E}_{X_i, X_j}\big[\|X_i - X_j\|^\alpha\big] \;\le\; 2^{\alpha+1}\,\mathbb{E}\big[\|X_i - \mu_i\|^\alpha\big]\,. \qquad (17)$$
Thus we can instead re-define
$$g^{(i)}(\alpha) := \frac{\mathbb{E}\big[\|X_i - \mu_i\|^\alpha\big]}{\big(\mathbb{E}\|X_i - \mu_i\|^2\big)^{\alpha/2}}\,.$$
Here, the denominator is $\big(\mathbb{E}\|X_i - \mu_i\|^2\big)^{\alpha/2} = (\operatorname{tr}(\Sigma_i))^{\alpha/2}$. Now consider the transformation $Y_i := \Sigma_i^{-1/2}(X_i - \mu_i)$. Then $Y_i$ is a standard multivariate Gaussian, i.e., $Y_i\sim N(0, I_d)$, where $I_d$ is the identity matrix in $d$ dimensions. Then,
$$\mathbb{E}_{N(\mu_i,\Sigma_i)}\big[\|X_i - \mu_i\|^\alpha\big] = \mathbb{E}_{N(0, I_d)}\big[\|\Sigma_i^{1/2}Y_i\|^\alpha\big] \le \|\Sigma_i^{1/2}\|^\alpha\,\mathbb{E}_{N(0, I_d)}\big[\|Y_i\|^\alpha\big]\,.$$
Here, we have
$$\mathbb{E}_{N(0, I_d)}\big[\|Y_i\|^\alpha\big] = \frac{1}{(2\pi)^{d/2}}\int_{\mathbb{R}^d}\|y\|^\alpha\exp\big(-\|y\|^2/2\big)\,dy \;\le\; (\alpha/2)^{\alpha/2}\,2^{\alpha/2}\,(d/2)^{\alpha/2}\,.$$
From this we can compute
$$g^{(i)}(\alpha) \;\le\; \frac{\|\Sigma_i^{1/2}\|^\alpha\,\alpha^{\alpha/2}}{4\,(\operatorname{tr}(\Sigma_i))^{\alpha/2}} \;\le\; \frac{(\operatorname{tr}(\Sigma_i))^{\alpha/2}\,\alpha^{\alpha/2}}{4\,(\operatorname{tr}(\Sigma_i))^{\alpha/2}} \;=\; O\big((\alpha/2)^{\alpha/2}\big)\,.$$
Thus, the quantity $g_\alpha$ as defined in (8), in its expected form, is at most $O((2\alpha)^{\alpha/2})$, by using the relation in (17).

Now, we move on to the concentration bounds for the following value:
$$\hat r_i^{\,2} := \frac{1}{|C_i|}\sum_{x\in C_i}\|x - \mu_i\|^2\,.$$
First, we have i.i.d. samples from the mixture distribution. Consider the random variable $\|x - \mu_i\|^2$ with $x\in C_i$; its expectation is $\mathbb{E}\|X_i - \mu_i\|^2$. Then, using Chebyshev's inequality, we have
$$\Pr\Big[\big|\hat r_i^{\,2} - \mathbb{E}\|X_i - \mu_i\|^2\big| \ge \varepsilon\Big] \;\le\; \frac{\operatorname{Var}\big(\|X_i - \mu_i\|^2\big)}{|C_i|\,\varepsilon^2}\,. \qquad (18)$$
Setting $\varepsilon = \mathbb{E}[\|X_i - \mu_i\|^2]/2$, the concentration bound becomes
$$\Pr\Big[\big|\hat r_i^{\,2} - \mathbb{E}\|X_i - \mu_i\|^2\big| \ge \tfrac{1}{2}\,\mathbb{E}\|X_i - \mu_i\|^2\Big] \;\le\; \frac{4\operatorname{Var}\big(\|X_i - \mu_i\|^2\big)}{|C_i|\,\big(\mathbb{E}\|X_i - \mu_i\|^2\big)^2} \;\le\; \frac{4\,g^{(i)}(4)\,\big(\mathbb{E}\|X_i - \mu_i\|^2\big)^2}{|C_i|\,\big(\mathbb{E}\|X_i - \mu_i\|^2\big)^2} \;=\; \frac{4\,g^{(i)}(4)}{|C_i|}\,.$$
Thus we have, w.h.p., $\hat r_i^{\,2} = \Theta(r_i^2) = \Theta(\operatorname{tr}(\Sigma_i))$ for all $i\in[k]$. This holds when $N = \Omega(k^2/w_{\min}^2)$, in which case the success probability is $1 - O(1/\sqrt{N})$.

Now we can condition on this event to get a bound on $g_\alpha$. In particular, we must look at the $\alpha$-th moment, i.e., we must use the value of $g^{(i)}(\alpha)$. The plan is to show that $g_\alpha$ as defined is $O_\alpha(1)$ w.h.p. For this, it suffices to show that the quantity
$$\hat r_i(\alpha) := \frac{1}{|C_i|}\sum_{x\in C_i}\|x - \mu_i\|^\alpha$$
is $\Theta\big(\mathbb{E}[\|X_i - \mu_i\|^\alpha]\big)$ w.h.p. Assuming this is true, and using the fact that $\hat r_i^{\,2} = \Theta(r_i^2)$, we then have
$$g_\alpha \;\le\; \frac{2^\alpha\,\hat r_i(\alpha)}{(\hat r_i^{\,2})^{\alpha/2}} \;=\; \frac{\Theta\big(\mathbb{E}[\|X_i - \mu_i\|^\alpha]\big)}{\Theta\big((\operatorname{tr}\Sigma_i)^{\alpha/2}\big)} \;=\; O\big((2\alpha)^{\alpha/2}\big)\,.$$
Now, to see why the concentration holds, we again use Chebyshev's inequality, this time for the random variable $\|x - \mu_i\|^\alpha$ with $x\in C_i$, whose expectation is $\mathbb{E}[\|X_i - \mu_i\|^\alpha]$. Repeating the argument from the previous case, i.e., using Equation (18) for this random variable,
$$\Pr\Big[\big|\hat r_i(\alpha) - \mathbb{E}[\|X_i - \mu_i\|^\alpha]\big| \ge \tfrac{1}{2}\,\mathbb{E}[\|X_i - \mu_i\|^\alpha]\Big] \;\le\; \frac{4\,g^{(i)}(2\alpha)\,\big(\mathbb{E}[\|X_i - \mu_i\|^\alpha]\big)^2}{|C_i|\,\big(\mathbb{E}[\|X_i - \mu_i\|^\alpha]\big)^2} \;=\; \frac{4\,g^{(i)}(2\alpha)}{|C_i|}\,.$$
Thus, we have that $g_\alpha = O_\alpha(1)$ w.h.p. when $N = \Theta_\alpha(k^2/w_{\min}^2)$.

Events $E_1$ and $E_2$ occur with high probability: To end this proof, we show that the events $E_1$ and $E_2$ occur with high probability as well. When we have a mixture of distributions, the number of points sampled from each distribution, $(|C_1|, |C_2|, \dots, |C_k|)$, follows a multinomial distribution with parameters $(N, \{w_i\})$; when the $k$ distributions are balanced, $p_i = 1/k$. We can use the following concentration bound, the Bretagnolle-Huber-Carol inequality ((van der Vaart & Wellner, 1996), Proposition A.6.6), which states that
$$\Pr\Big(\sum_{i=1}^k\big||C_i| - Nw_i\big| \ge 2\sqrt{N}\,\delta\Big) \;\le\; 2^k\exp(-2\delta^2)\,, \qquad \forall\,\delta > 0\,. \qquad (19)$$
By choosing $\delta$ appropriately, we have that for $N = \Omega(k)$ samples, event $E_1$ holds w.h.p. Now, coming to $E_2$, let us focus on each Gaussian separately and condition on event $E_1$. For Gaussian $i$, we have that with probability $1 - \delta$, $\|\Sigma_i^{-1/2}(\hat\mu_i - \mu_i)\| < \varepsilon/k$ with $N = \Omega\big(\frac{k^3 d}{\varepsilon^2}\log^2(1/\delta)\big)$ samples. Hence, by choosing a small enough $(\delta, \varepsilon)$, we can guarantee that with sufficiently many samples, $N = \Omega\big(\mathrm{poly}(k, d, \frac{1}{\min_{i\in[k]}\lambda_{id}})\big)$ (where $\Omega$ hides logarithmic factors), both $E_1$ and $E_2$ hold with high probability. Note that $E_2$ also implies that the "reference" centers/clusters that we picked in the beginning are in fact close to the optimal centers with high probability.

Degenerate covariance: If the covariance matrix is not full rank, we can still define the Gaussian distribution on the appropriate affine subspace $\mathbb{R}^{d'}$, where $d'\le d$ and $\operatorname{rank}(\Sigma_i) = d'$. We can repeat the previous analysis, as it is carried out for each cluster separately, and obtain the same conclusion. When we have $N = \Omega\big(\mathrm{poly}(k, d, \frac{1}{\min_{i\in[k]}\lambda_{id'}}, \frac{1}{w_{\min}})\big)$, we get the desired finite-sample guarantee.
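To make the role of $g_\alpha$ concrete, the following is a small Monte Carlo sketch (ours, not part of the paper) that estimates the per-cluster ratio $\mathbb{E}\|X - X'\|^\alpha / (\mathbb{E}\|X - \mu\|^2)^{\alpha/2}$ from samples of a single cluster. For a spherical Gaussian the estimate stays moderate, consistent with the $O((2\alpha)^{\alpha/2})$ bound above, whereas for a heavy-tailed (coordinatewise) student-t with $\nu \le \alpha$ the $\alpha$-th moment does not exist and the estimate keeps growing with the sample size.

```python
import numpy as np

def estimate_g_alpha(sample, alpha, n_pairs=200_000, rng=None):
    """Monte Carlo estimate of E||X - X'||^alpha / (E||X - mu||^2)^(alpha/2)
    for one cluster, given as an array with one sample point per row."""
    rng = np.random.default_rng(rng)
    mu = sample.mean(axis=0)
    denom = np.mean(np.sum((sample - mu) ** 2, axis=1)) ** (alpha / 2)
    i = rng.integers(len(sample), size=n_pairs)
    j = rng.integers(len(sample), size=n_pairs)
    num = np.mean(np.sum((sample[i] - sample[j]) ** 2, axis=1) ** (alpha / 2))
    return num / denom

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n, alpha = 2, 100_000, 6
    gauss = rng.standard_normal((n, d))           # N(0, I_2)
    t2 = rng.standard_t(df=2.0, size=(n, d))      # heavy tails: 6th moment is infinite
    print("Gaussian  g(alpha) estimate:", estimate_g_alpha(gauss, alpha, rng=1))
    print("student-t g(alpha) estimate:", estimate_g_alpha(t2, alpha, rng=1))
```

This empirical ratio may differ from definition (8) by constant factors (cf. relation (17)); it is only meant to illustrate the qualitative gap between light- and heavy-tailed clusters.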
C.3. Useful inequalities

In our proof, we use the following inequalities several times; we state them here for completeness.

Lemma C.1 (Jensen's inequality (Hardy et al., 1952)). For any set of non-negative real numbers $\{y_i\}_{i=1}^n$ it holds that
$$\sum_{i=1}^n y_i^\gamma \;\ge\; n^{1-\gamma}\Big(\sum_{i=1}^n y_i\Big)^{\gamma} \qquad \text{if } \gamma\ge 1\,.$$
If $0\le\gamma < 1$, the reversed inequality holds.

Lemma C.2 (Chebyshev's sum inequality (Hardy et al., 1952)). For any sequences of real numbers $\{y_i\}_{i=1}^n$ and $\{w_i\}_{i=1}^n$ such that $y_1\ge y_2\ge\dots\ge y_n$ and $w_1\ge w_2\ge\dots\ge w_n$, it holds that
$$\frac{1}{n}\sum_{i=1}^n w_i y_i \;\ge\; \Big(\frac{1}{n}\sum_{i=1}^n w_i\Big)\Big(\frac{1}{n}\sum_{i=1}^n y_i\Big)\,.$$

D. Lower bounds

D.1. The dependence on σmax/σmin is tight

We consider an instance with one cluster of $n$ points arranged on a regular simplex of side length $R$, while each of the other $k-1$ clusters contains $n$ points arranged on a regular simplex of side length 1. Note that for a cluster consisting of a regular simplex we have $g_\alpha\le 2^\alpha$, hence $(g_\alpha)^{2/\alpha} = O(1)$, so the concentration of clusters plays no role in this construction. The clusters of radius 1 are pairwise separated by a distance $\delta$, while the cluster of radius $R$ is at infinite distance from all the other clusters. We choose $R$ and $\delta$ so that $R^\alpha = 10\,k\,\delta^\alpha$. Then, at each iteration $t\ge 3$, the probability of sampling from the cluster of radius $R$ is at least
$$\frac{n\,R^\alpha}{n\,k\,\delta^\alpha + n\,R^\alpha} \;\ge\; \frac{1}{2}\,.$$
Hence, the expected number of missed clusters is at least $k/3$, which implies that the expected cost of the returned solution is at least $(nk/3)\cdot\delta^2$, while OPT costs at most $n\,R^2 + nk$. We select $\delta$ so that $R^2 = k$, which is satisfied (up to a constant factor) for $\delta = k^{1/2 - 1/\alpha}$. In this case, the expected competitive ratio is at least $\Omega(k^{1-2/\alpha})$, and one can check that $\sigma_{\max} = \Theta(R) = \Theta(k^{1/2})$ and $\sigma_{\min} = \Theta(1)$.

D.2. The dependence on gα is tight

Consider the following instance. There are two clusters of $n$ points each. The first cluster $C_1$ consists of $n$ points on a regular simplex: all points in $C_1$ are at distance 1 from each other, and at distance 1 from the centroid of $C_1$. The second cluster $C_2$ has two groups of points: there are $n-1$ points at the origin, and one point at coordinates $(\Delta, 0, \dots, 0)$. We place the cluster $C_1$ so that its centroid lies at coordinates $(-\delta, 0, \dots, 0)$, with $\delta = \Delta/n^{1/\alpha}$. Second, we select $\Delta$ so that $\sigma_{\max}/\sigma_{\min} = 1$. For this, we simply need to choose $\Delta$ so that
$$(\Delta/n)^2\,(n-1) + \Delta^2(1 - 1/n)^2 = n\,,$$
which gives $\Delta = \Theta(\sqrt{n})$. The cost of the given clustering is equal to $2n$. However, we can see that $g_\alpha$ for the second cluster is not a constant. Indeed, we can compute that
$$g_\alpha \;=\; \frac{\frac{1}{n}\sum_{z\in C_2}\mathrm{cost}^{(\alpha)}(C_2, z)}{n^{1-\alpha/2}\,\big(\mathrm{cost}^{(2)}(C_2, \mu_{C_2})\big)^{\alpha/2}} \;=\; \frac{\frac{1}{n}\cdot 2(n-1)\,\Delta^\alpha}{n} \;=\; \Theta\big(n^{\alpha/2 - 1}\big)\,,$$
where we used that $\mathrm{cost}^{(2)}(C_2, \mu_{C_2}) = n$ by construction.

To analyze the Dα seeding procedure on this instance, let us denote by $B_1$ the event that the first sampled center belongs to $C_1$, and by $B_2$ the event that the second sample belongs to $C_2$ and lies at coordinates $(\Delta, 0, \dots, 0)$. Clearly we have $\Pr[B_1] = 1/2$, and
$$\Pr[B_2\mid B_1] \;\ge\; \frac{\Delta^\alpha}{n + (n-1)\,\delta^\alpha + \Delta^\alpha} \;\ge\; \frac{1}{3}\,,$$
by our choice of $\delta$. This implies that $\Pr[B_1\cap B_2]\ge 1/6$, and the expected cost of the output clustering is at least
$$\frac{1}{6}\,(n-1)\,\delta^2 \;=\; \Theta\big(n^{1+1-2/\alpha}\big) \;=\; \Omega\big(n^{1-2/\alpha}\cdot OPT\big) \;=\; \Omega\big((g_\alpha)^{2/\alpha}\big)\cdot OPT\,.$$
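As a quick numerical sanity check of this construction (our sketch, not part of the paper), one can verify that the per-cluster ratio used above for the second cluster indeed grows like $n^{\alpha/2-1}$; the empirical quantity below may differ from definition (8) by constant factors.

```python
import numpy as np

def empirical_g_alpha(points, alpha):
    """Empirical g_alpha of a single cluster:
    (mean pairwise ||x - z||^alpha) / (mean ||x - centroid||^2)^(alpha/2)."""
    mu = points.mean(axis=0)
    pair_sq = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    num = np.mean(pair_sq ** (alpha / 2))
    denom = np.mean(np.sum((points - mu) ** 2, axis=1)) ** (alpha / 2)
    return num / denom

alpha = 4
for n in (100, 200, 400, 800):
    C2 = np.zeros((n, 2))
    C2[-1, 0] = np.sqrt(n)   # n-1 points at the origin, one point at distance ~sqrt(n)
    g = empirical_g_alpha(C2, alpha)
    # the last column should stay roughly constant (close to 2), independent of n
    print(f"n={n:4d}   g_alpha={g:10.1f}   g_alpha / n^(alpha/2-1) = {g / n**(alpha/2 - 1):.2f}")
```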
Figure 7. A schematic view of a group of 4 clusters, with square side length a = log f(k). The densest parts are at the vertices A, B, C, D. All the points in a cluster lie on the segment from a vertex to the center of mass of the square.

D.3. A lower bound for greedy k-means++

In this section, we prove Theorem 2.4. In particular, it implies a super-constant lower bound for the greedy variant which uses $f(k)$ samples, as long as $f(k)$ is super-constant. We emphasize that we did not try to optimize this lower bound; the main purpose of this section is to show that the greedy algorithm with a super-constant number of samples cannot guarantee a constant factor in expectation.

Consider the following construction against greedy k-means++ with $f(k)$ samples. We place our points in Euclidean space. For ease of construction, we take the distributional point of view in this instance, i.e., we assume that each cluster $C_i$ contains infinitely many points placed in $\mathbb{R}^d$ according to some distribution $f_i$. Moreover, we assume that these clusters are balanced (i.e., they have the same weight), which means here that the global distribution of all points in all the clusters has probability density $f(x) = \frac{1}{k}\sum_{i\in[k]}f_i(x)$.

In our instance, we create $k/4$ groups of 4 clusters as follows. First, we describe how to construct one group. We create a square $ABCD$ of side length $a = \log(f(k))$, and we denote by $M$ the centroid of this square and by $b = a/\sqrt{2}$ the length of half a diagonal of the square. In the cluster $C_1$, the probability density is a one-dimensional density on the segment $[AM]$, defined as
$$f_1(x) = \frac{e^{-\|x - A\|}}{1 - e^{-b}} \quad\text{if } x\in[AM]\,, \qquad \text{and } 0 \text{ otherwise.}$$
We recall here that $\|AM\| = b$, so the above function is indeed a probability density. The centroid of the first cluster is the point $\mu_1\in[AM]$ such that
$$\|\mu_1 - A\| = \frac{\int_0^b x\,e^{-x}\,dx}{1 - e^{-b}} = 1 - \frac{b}{e^b - 1}\,.$$
We repeat the construction similarly for the clusters $C_2$, $C_3$, and $C_4$, replacing $A$ by $B$, $C$, and $D$ respectively. Note that we needed only 2 dimensions to construct our first group; we refer the reader to Figure 7 for an illustration of the construction. Then, we add $k/4 - 1$ extra dimensions and fix the point $M$ to be the origin of $\mathbb{R}^{k/4+1}$. Finally, we make $k/4 - 1$ additional copies $T_2, T_3, \dots, T_{k/4}$ of our first square and arrange their centroids $M_1 = (0)^{k/4+1}, M_2, \dots, M_{k/4}$ on a regular simplex of side length $\Delta = 100\,(f(k))^3\,k$. One should think of $\Delta$ as a much bigger distance than $a$ (essentially, we have $k/4$ squares that are infinitely far from each other). Each group of 4 clusters is denoted $G_j$ for $j\in[k/4]$.

We prove the following simple statements.

Claim D.1. For any cluster $C_i$, we have that $\lim_{a\to\infty}\mathrm{cost}^{(2)}(C_i) = 1$.

Proof. We can write
$$\mathrm{cost}^{(2)}(C_i) = \frac{\int_0^b\big(x - \|\mu_i - A\|\big)^2 e^{-x}\,dx}{1 - e^{-b}} = \frac{e^b + e^{-b} - b^2 - 2}{(e^b - 1)(1 - e^{-b})}\,,$$
which clearly tends to 1 as $b$ goes to infinity.

Claim D.2. For any cluster $C_i$, any fixed $\alpha\ge 2$ and $b > 10$, we have that
$$\frac{\int_0^b\int_0^b|x - y|^\alpha f_i(x)f_i(y)\,dx\,dy}{\Big(\int_0^b\big(x - \|\mu_i - A\|\big)^2 f_i(x)\,dx\Big)^{\alpha/2}} \;\le\; 4\,\Gamma(\alpha + 1)\,, \qquad (25)$$
where $\Gamma$ is the gamma function (Artin, 2015) and, with a slight abuse of notation, $x$ and $y$ denote distances to the vertex of the cluster.

Proof. From Claim D.1, we already have that the denominator tends to 1 as $a$ goes to infinity (in particular, it is at least $1/2$ for $b > 10$). Hence, we only need to upper bound the numerator. We proceed with the simple change of variables $u = x + y$ and $v = x - y$. Then we obtain
$$\int_0^b\int_0^b|x - y|^\alpha f_i(x)f_i(y)\,dx\,dy = \frac{1}{2(1 - e^{-b})^2}\int_{-b}^{b}|v|^\alpha\Big(\int_{|v|}^{2b - |v|}e^{-u}\,du\Big)dv = \frac{1}{(1 - e^{-b})^2}\int_0^b v^\alpha\big(e^{-v} - e^{v - 2b}\big)\,dv \;\le\; \frac{\Gamma(\alpha + 1)}{(1 - e^{-b})^2} \;\le\; 2\,\Gamma(\alpha + 1)\,.$$
Using the fact that we consider $b > 10$, so that the denominator of (25) is at least $1/2$, we obtain the desired upper bound.
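A quick numerical check of Claim D.2 (our sketch, not part of the paper) for the truncated exponential density used in the construction; the computed ratio should stay well below the bound $4\,\Gamma(\alpha+1)$.

```python
import numpy as np
from math import gamma

def moment_ratio(alpha, b, m=2000):
    """Left-hand side of (25) for the truncated exponential density
    f(x) = e^{-x} / (1 - e^{-b}) on [0, b], via midpoint quadrature."""
    x = (np.arange(m) + 0.5) * (b / m)                 # grid on [0, b]
    w = (b / m) * np.exp(-x) / (1.0 - np.exp(-b))      # f(x) dx weights
    mu = np.sum(x * w)
    denom = np.sum((x - mu) ** 2 * w) ** (alpha / 2)
    num = np.sum(np.abs(x[:, None] - x[None, :]) ** alpha * w[:, None] * w[None, :])
    return num / denom

for alpha in (2, 4, 6):
    for b in (10.5, 20.0):
        print(f"alpha={alpha}, b={b}: ratio={moment_ratio(alpha, b):.2f}, "
              f"bound={4 * gamma(alpha + 1):.1f}")
```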
In light of Claim D.2, it is clear that for any fixed $\alpha > 2$, the Dα seeding algorithm guarantees a constant-factor approximation in expectation on these instances. However, we use the instances we built to prove Theorem 2.4. To this end, we need to look at the cost of a group of 4 clusters after the greedy k-means++ algorithm has selected exactly one center inside the group $G_j$. We distinguish two key cases: (1) the first center $x$ belongs to a cluster $C_i\in G_j$ and the distance between $x$ and the centroid $\mu_i$ of that cluster is less than $\log(b)$; and (2) the first center is at distance at most $b/100$ from the centroid $M_j$ of the corresponding square. Let us denote by $K_1$ and $K_2$ the total cost of the 4 clusters in $G_j$ after case (1) or case (2) happened, respectively.

Claim D.3. For $k$ a big enough constant, we have that $K_1 \ge 11b^2/2$ and $K_2 \le 5b^2$.

Proof. Assume w.l.o.g. that the first center belongs to cluster $C_1$. In the first case, the total cost is lower bounded by the cost of the other three clusters $C_2, C_3, C_4$. Assume $C_3$ is the cluster whose centroid $\mu_3$ is the furthest away from $\mu_1$. Then the cost of this cluster is lower bounded by
$$\int_0^b\big(2b - \log(b) - x\big)^2 e^{-x}\,dx \;\ge\; 4b^2 - O(b\log b)\,.$$
For the other two clusters $C_2$ and $C_4$, we use Pythagoras' theorem: since they lie on segments perpendicular to that of $C_1$ and we are in case (1), any point in $C_2$ or $C_4$ is at distance at least $b - \log b$ from the sampled center. Hence we can lower bound the cost of each of these two clusters by
$$\int_0^b\big(b - \log(b)\big)^2 f(x)\,dx \;\ge\; b^2 - O(b\log b)\,.$$
Hence the total cost is at least $6b^2 - O(b\log b)$, which is more than $11b^2/2$ for $b$ big enough. In the second case, the cost of each cluster is upper bounded by
$$\int_0^b\big(1.01\,b - x\big)^2 f(x)\,dx \;\le\; 1.03\,b^2 + O(b)\,.$$
Hence the total cost is upper bounded by $4.12\,b^2 + O(b)$, which is less than $5b^2$ for $b$ big enough.

Next, we define $L$ as the event that the greedy k-means++ algorithm samples one of its $f(k)$ candidate centers at distance at most $b/100$ from the centroid $M_j$ of some still-undiscovered group $G_j$.

Claim D.4. Assume we are at iteration $t < k/4$. Then $L$ happens with probability at least $1/2$.

Proof. At iteration $t < k/4$, it must be that some group $G_j$ is still undiscovered. Then we have (by the triangle inequality) that
$$\mathrm{cost}^{(2)}(G_j) \;\ge\; 4\big(\Delta - 2\sqrt{2}\,a\big)^2 \;>\; \Delta^2\,,$$
for $k$ a big enough constant. The total cost of the discovered clusters is at most $k\,(2a)^2 \le k\,(2\log f(k))^2 \le \mathrm{cost}^{(2)}(G_j)/(10 f(k))$. By a union bound, we obtain that with probability at least $9/10$, none of the sampled candidate centers belongs to an already discovered group. Conditioned on that event, the triangle inequality implies that the probability that a specific sample is at distance at most $b/100$ from the centroid of an undiscovered square is at least (assuming $b$ and $f(k)$ are big enough constants)
$$\frac{9}{10}\,\big(e^{-99b/100} - e^{-b}\big) \;\ge\; \frac{1}{f(k)}\,.$$
Therefore, with probability at most $(1 - 1/f(k))^{f(k)} \le 1/e$, none of the sampled candidate centers is close to the centroid $M_j$ of a square; hence with probability at least $1/2$, at least one of them is at distance at most $b/100$ from the centroid $M_j$ of an undiscovered group.

When the event $L$ happens, it must be (by Claim D.3) that the selected center is at distance at least $\log b$ from the centroid of the cluster it belongs to. By Claim D.4, this happens in expectation at least $k/8$ times. This means that the expected cost of the output clustering is at least
$$\frac{k}{8}\int_0^{\log b}\big(\log b - x\big)^2 e^{-x}\,dx \;\ge\; \frac{k}{8}\Big((\log b)^2 - O(\log b)\Big) \;=\; \Omega\big(k\,(\log\log f(k))^2\big)\,.$$
Since by Claim D.1 the optimum cost is equivalent to $k$ (as $k$ grows to infinity), this concludes the proof of Theorem 2.4.
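For reference, here is a minimal sketch (ours, not the library implementation) of the greedy seeding variant analyzed above: at every round it draws several candidate points with D²-proportional probabilities and keeps the candidate that decreases the current k-means cost the most. The construction above is designed so that, with constant probability per round, some candidate lands near the centroid $M_j$ of an undiscovered square and is then greedily preferred over candidates that sit close to a cluster centroid.

```python
import numpy as np

def greedy_kmeanspp(X, k, n_candidates, rng=None):
    """Greedy k-means++ seeding sketch: each round, sample `n_candidates`
    points with D^2 probabilities and keep the one that lowers the cost most."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    closest_sq = np.sum((X - centers[0]) ** 2, axis=1)
    for _ in range(k - 1):
        probs = closest_sq / closest_sq.sum()
        candidates = rng.choice(n, size=n_candidates, p=probs)
        best_cost, best_new, best_point = np.inf, None, None
        for c in candidates:
            # cost obtained if candidate c were added as the next center
            new_closest = np.minimum(closest_sq, np.sum((X - X[c]) ** 2, axis=1))
            cost = new_closest.sum()
            if cost < best_cost:
                best_cost, best_new, best_point = cost, new_closest, X[c]
        centers.append(best_point)
        closest_sq = best_new
    return np.array(centers)
```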