Approximate K-Means++ in Sublinear Time

Olivier Bachem (ETH Zurich) olivier.bachem@inf.ethz.ch
Mario Lucic (ETH Zurich) lucic@inf.ethz.ch
S. Hamed Hassani (ETH Zurich) hamed@inf.ethz.ch
Andreas Krause (ETH Zurich) krausea@ethz.ch

Abstract

The quality of K-Means clustering is extremely sensitive to proper initialization. The classic remedy is to apply k-means++ to obtain an initial set of centers that is provably competitive with the optimal solution. Unfortunately, k-means++ requires k full passes over the data, which limits its applicability to massive datasets. We address this problem by proposing a simple and efficient seeding algorithm for K-Means clustering. The main idea is to replace the exact D²-sampling step in k-means++ with a substantially faster approximation based on Markov Chain Monte Carlo sampling. We prove that, under natural assumptions on the data, the proposed algorithm retains the full theoretical guarantees of k-means++ while its computational complexity is only sublinear in the number of data points. For such datasets, one can thus obtain a provably good clustering in sublinear time. Extensive experiments confirm that the proposed method is competitive with k-means++ on a variety of real-world, large-scale datasets while offering a reduction in runtime of several orders of magnitude.

1 Introduction

The goal of K-Means clustering is to find a set of k cluster centers for a dataset such that the sum of squared distances of each point to its closest cluster center is minimized. It is one of the classic clustering problems and has been studied for several decades. Yet even today, it remains a relevant problem: Lloyd's algorithm (Lloyd, 1982), a local search algorithm for K-Means, is still one of the ten most popular algorithms for data mining according to Wu et al. (2008) and is implemented as a standard clustering method in most machine learning libraries. In the last few years, K-Means clustering has further been studied in various fields of machine learning such as representation learning (Coates, Lee, and Ng, 2011; Coates and Ng, 2012) and Bayesian nonparametrics (Kulis and Jordan, 2012).

It is well known that K-Means clustering is highly sensitive to proper initialization. The classical remedy is to use a seeding procedure proposed by Arthur and Vassilvitskii (2007) that, together with Lloyd's algorithm, is known as k-means++. In the seeding step of k-means++, the cluster centers are sampled iteratively using D²-sampling: First, a cluster center is chosen uniformly at random from the data points. Then, in each subsequent iteration, a data point is sampled as a new cluster center with probability proportional to its squared distance to the already sampled cluster centers. Even without assumptions on the data, the resulting solution is in expectation O(log k)-competitive with regards to the optimal solution (Arthur and Vassilvitskii, 2007).

While k-means++ is easy to implement, it is non-trivial to apply it to large problems: k-means++ has to make a full pass through the data for every cluster center sampled. This leads to a complexity of Θ(nkd), where n is the number of data points, k the number of cluster centers and d the dimensionality of the data. Even if k is moderate, this can be computationally infeasible for massive datasets.
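To make the seeding procedure and its Θ(nkd) cost concrete, the following is a minimal NumPy sketch of the exact D²-sampling seeding described above; the function name and interface are illustrative and not the authors' implementation.

```python
import numpy as np

def kmeanspp_seeding(X, k, rng=np.random.default_rng(0)):
    """Exact k-means++ seeding via D^2-sampling (one full pass per center)."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]            # first center: uniform at random
    # Squared distance of every point to its closest center sampled so far.
    d2 = np.sum((X - centers[0]) ** 2, axis=1)
    for _ in range(1, k):
        probs = d2 / d2.sum()                 # D^2-sampling distribution
        idx = rng.choice(n, p=probs)          # needs all n probabilities
        centers.append(X[idx])
        # Update cached distances with the newly added center: Theta(nd) work.
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.stack(centers)

# Each of the k iterations touches all n points, giving the Theta(nkd) cost.
X = np.random.default_rng(1).normal(size=(10000, 5))
C = kmeanspp_seeding(X, k=10)
```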
This motivates our search for a seeding method with a lower, potentially even sublinear, complexity in the number of data points that retains both the empirical and theoretical benefits of k-means++. But is it even worth pursuing a fast seeding algorithm? After all, both evaluating the quality of such a seeding and running one iteration of Lloyd's algorithm exhibit the same Θ(nkd) complexity as the seeding step of k-means++. Hence, one might argue that there is no benefit in reducing the complexity of the k-means++ seeding step as it is dominated by these two other operations. There are two shortcomings to this argument: Firstly, k-means++ is an inherently sequential algorithm of k dependent iterations and, as such, difficult to parallelize in a distributed setting. Evaluating the quality of a K-Means solution, however, can be done in parallel using a single MapReduce step. Similarly, Lloyd's algorithm can also be implemented in MapReduce (Zhao, Ma, and He, 2009). Secondly, there are many use cases where one requires fast seeding without evaluating the quality of the seeding or running Lloyd's algorithm subsequently. For example, the quality of such a solution can be improved using fast algorithms such as online (Bottou and Bengio, 1994) or mini-batch K-Means (Sculley, 2010), which may be run for fewer than O(n) iterations in practice. Furthermore, various theoretical results such as coreset constructions (Bachem, Lucic, and Krause, 2015) rely on the theoretical guarantee of k-means++. Hence, a fast seeding algorithm with a strong theoretical guarantee would have an impact on all these applications.

Our Contributions. In this paper, we propose a simple but novel algorithm based on Markov Chain Monte Carlo (MCMC) sampling to quickly obtain a seeding for the K-Means clustering problem. The algorithm can be run with varying computational complexity and approximates the seeding step of k-means++ with arbitrary precision as its complexity is increased. Furthermore, we show that for a wide class of non-pathological datasets convergence is fast. Under these mild and natural assumptions, it is sufficient to run our algorithm with complexity sublinear in the number of data points to retain the same O(log k) guarantee as k-means++. This implies that for such datasets, a provably good K-Means clustering can be obtained in sublinear time. We extensively evaluate the proposed algorithm empirically and compare it to k-means++ as well as two other approaches on a variety of datasets.

2 Background & Related Work

K-Means clustering. Let X denote a set of n points in R^d. The K-Means clustering problem is to find a set C of k cluster centers in R^d such that the quantization error φ_C(X) is minimized, where

$$ \phi_C(\mathcal{X}) = \sum_{x \in \mathcal{X}} d(x, C)^2 = \sum_{x \in \mathcal{X}} \min_{c \in C} \lVert x - c \rVert_2^2. $$

In this paper, we implicitly use the Euclidean distance function; however, any distance function d(x, x') may be used. The optimal quantization error with k centers is denoted by φ^k_OPT(X).
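A direct NumPy translation of this objective (an illustrative helper, not part of the paper) also makes explicit why evaluating a single candidate solution already costs Θ(nkd):

```python
import numpy as np

def quantization_error(X, C):
    """phi_C(X) = sum_x min_c ||x - c||^2, computed via the full n-by-k
    matrix of squared distances (Theta(nkd) time and O(nk) memory)."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # shape (n, k)
    return d2.min(axis=1).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
C = X[rng.choice(len(X), size=5, replace=False)]
print(quantization_error(X, C))
```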
k-means++ seeding. The seeding step of k-means++ (Arthur and Vassilvitskii, 2007) works by sampling an initial cluster center uniformly at random and then adaptively sampling (k − 1) additional cluster centers using D²-sampling. More specifically, in each iteration i = 2, …, k, the data point x ∈ X is added to the set of already sampled cluster centers C_{i−1} with probability

$$ p(x) = \frac{d(x, C_{i-1})^2}{\sum_{x' \in \mathcal{X}} d(x', C_{i-1})^2}. \tag{1} $$

The algorithm's time complexity is Θ(nkd) and the resulting seeding C_k is in expectation O(log k)-competitive with respect to the optimal quantization error φ^k_OPT(X) (Arthur and Vassilvitskii, 2007), i.e.

$$ \mathbb{E}\left[\phi_{C_k}(\mathcal{X})\right] \leq 8(\log_2 k + 2)\,\phi_{OPT}^k(\mathcal{X}). $$

Related work. Previously, the same idea as in k-means++ was explored in Ostrovsky et al. (2006), where it was shown that, under some data separability assumptions, the algorithm provides a constant factor approximation. Similar assumptions were analyzed in Balcan, Blum, and Gupta (2009), Braverman et al. (2011), Shindler, Wong, and Meyerson (2011), Jaiswal and Garg (2012) and Agarwal, Jaiswal, and Pal (2013). Without any assumption on the data, it was shown that D²-sampling leads to a constant factor approximation if Ω(k log k) (Ailon, Jaiswal, and Monteleoni, 2009) or Ω(k) (Aggarwal, Deshpande, and Kannan, 2009) centers are sampled. Bad instances for k-means++ were considered in the original paper (Arthur and Vassilvitskii, 2007) as well as in Brunsch and Röglin (2011). A polynomial time approximation scheme for K-Means using D²-sampling was proposed in Jaiswal, Kumar, and Sen (2014) and Jaiswal, Kumar, and Yadav (2015).

Several ideas extending k-means++ to the streaming setting were explored: A single-pass streaming algorithm based on coresets and k-means++ was proposed in Ackermann et al. (2012). The main drawback of this approach is that the size of the coreset is exponential in the dimensionality of the data. Ailon, Jaiswal, and Monteleoni (2009) suggest a streaming algorithm based on Guha et al. (2003) that provides the same O(log k) guarantee as k-means++ with a complexity of O(ndk log n log k). Bahmani et al. (2012) propose a parallel version of k-means++ called k-means|| that obtains the same O(log k) guarantee with a complexity of Θ(ndk log n). The main idea is to replace the k sequential sampling rounds of k-means++ by r = Θ(log n) rounds, in each of which l = Θ(k) points are sampled in parallel. In a final step, the Θ(k log n) sampled points are clustered again using k-means++ to produce a final seeding of k points. As a result, the computational complexity of k-means|| is higher than that of k-means++ but can be efficiently distributed across different machines. In Section 6, we compare k-means|| with our proposed method on various datasets.

3 Approximate D²-sampling

In each iteration of D²-sampling, the k-means++ algorithm has a computational complexity of Θ(nd), as it needs to calculate the sampling probabilities p(x) in (1) for every data point. We aim to reduce the complexity by approximating the D²-sampling step: we strive for a fast sampling scheme whose implied sampling probabilities p̃(x) are close to p(x). To formalize this notion of closeness, we use the total variation distance, which measures the maximum difference in probabilities that two distributions assign to an event. More formally, let Ω be a finite sample space on which two probability distributions p and q are defined. The total variation distance between p and q is given by

$$ \lVert p - q \rVert_{TV} = \frac{1}{2} \sum_{x \in \Omega} \lvert p(x) - q(x) \rvert. \tag{2} $$

In Section 5 we will show that, using the total variation distance, we can bound the solution quality obtained by our algorithm. Informally, if the total variation distance is less than ε, we are able to retain the same theoretical guarantees as k-means++ with probability at least (1 − ε).
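For reference, the total variation distance in (2) between two distributions on the same finite sample space can be computed as follows (a small illustrative helper):

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance 0.5 * sum_x |p(x) - q(x)| between two
    probability vectors defined on the same finite sample space."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

# Example: a slightly perturbed distribution stays close in TV distance.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.45, 0.35, 0.2])
print(total_variation(p, q))   # 0.05
```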
MCMC approximation. The Metropolis-Hastings algorithm (Hastings, 1970), with an independent, uniform proposal distribution, applied to a single step of D²-sampling works as follows: We uniformly sample an initial state x_0 from the point set X and then iteratively build a Markov chain. In each iteration j, we uniformly sample a candidate point y_j and calculate the acceptance probability

$$ \pi = \min\left(1, \frac{p(y_j)}{p(x_{j-1})}\right) = \min\left(1, \frac{d(y_j, C)^2}{d(x_{j-1}, C)^2}\right). \tag{3} $$

With probability π we then set the state x_j to y_j, and with probability 1 − π to x_{j−1}. For a Markov chain of total length m, we only need to calculate the distances between m data points and the cluster centers, since the normalization constants of p(y_j) and p(x_{j−1}) in (3) cancel. By design, the stationary distribution of this Markov chain is the target distribution p(x). This implies that the distribution p_m(x) of the m-th state x_m converges to p(x) as m → ∞. Furthermore, the total variation distance decreases at a geometric rate with respect to the chain length m (Cai, 2000), i.e.

$$ \lVert p_m - p \rVert_{TV} = O\left(\left(1 - \frac{1}{\gamma}\right)^m\right), \quad \text{where} \quad \gamma = n \max_{x \in \mathcal{X}} p(x). \tag{4} $$

This implies that there is a chain length m = O(γ log(1/ε)) that achieves a total variation distance of at most ε. Intuitively, γ measures the difficulty of approximately sampling from p(x) and depends on the current set of centers C and the dataset X. We remark that, for a fixed chain length, the total variation distance increases with γ. For now, we assume γ to be given and defer the detailed analysis to Section 5.

4 Approximate K-Means++ using K-MC²

It is straightforward to extend this MCMC-based sampler to approximate the full seeding step of k-means++: We first sample an initial cluster center uniformly at random. Then, for each of the remaining k − 1 iterations, we build an independent Markov chain of length m and use its last element as the new cluster center. We call this algorithm K-MC² and provide pseudo-code in Algorithm 1. The complexity of the proposed algorithm is Θ(mk²d). In particular, it does not depend on the number of data points n.

Theorem 1 guarantees convergence of Algorithm 1 to k-means++ in terms of total variation distance. Since the (k − 1) Markov chains are independent, we may use a union bound: If the sampling induced by each chain has a total variation distance of at most ε/(k − 1), then the total variation distance between the sampling induced by K-MC² and the sampling induced by k-means++ is at most ε (as shown in the proof of Theorem 1).

Algorithm 1 K-MC²
Require: dataset X, number of centers k, chain length m
  c_1 ← point uniformly sampled from X
  C_1 ← {c_1}
  for i = 2, 3, …, k do
    x ← point uniformly sampled from X
    d_x ← d(x, C_{i−1})²
    for j = 2, 3, …, m do
      y ← point uniformly sampled from X
      d_y ← d(y, C_{i−1})²
      if d_y / d_x > Unif(0, 1) then
        x ← y, d_x ← d_y
    C_i ← C_{i−1} ∪ {x}
  return C_k

Theorem 1. Let k > 0 and 0 < ε < 1. Let p_++(C) be the probability of sampling a seeding C using k-means++ and p_mcmc(C) the probability using K-MC² (Algorithm 1). Then,

$$ \lVert p_{mcmc} - p_{++} \rVert_{TV} \leq \epsilon $$

for a chain length m = O(γ' log(k/ε)), where

$$ \gamma' = \max_{C \subseteq \mathcal{X},\, |C| \leq k} \; \max_{x \in \mathcal{X}} \; \frac{n\, d(x, C)^2}{\sum_{x' \in \mathcal{X}} d(x', C)^2}. $$

The resulting complexity of Algorithm 1 is O(γ' k² d log(k/ε)).

The proof is given in Section B of the Appendix. This result implies that we can use K-MC² to approximate the seeding step of k-means++ to arbitrary precision. The required chain length m depends linearly on γ', which is a uniform upper bound on γ over all possible sets of centers C. In the next section, we provide a detailed analysis of γ' and quantify its impact on the quality of the seeding produced by K-MC².
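The following NumPy sketch mirrors the pseudo-code of Algorithm 1; the variable names follow the pseudo-code, while the per-candidate distance computation and the random number generator interface are implementation choices made here for illustration.

```python
import numpy as np

def kmc2_seeding(X, k, m, rng=np.random.default_rng(0)):
    """Approximate k-means++ seeding (K-MC^2, Algorithm 1): each new center is
    the last state of an independent Metropolis-Hastings chain of length m with
    a uniform proposal and target proportional to d(x, C_{i-1})^2."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                  # c_1: uniform sample
    for _ in range(1, k):
        C = np.stack(centers)
        x = X[rng.integers(n)]                      # initial state of the chain
        dx = np.min(((C - x) ** 2).sum(axis=1))     # d(x, C_{i-1})^2
        for _ in range(1, m):
            y = X[rng.integers(n)]                  # uniform proposal
            dy = np.min(((C - y) ** 2).sum(axis=1))
            # Accept with probability min(1, dy / dx).
            if dy > dx * rng.uniform():
                x, dx = y, dy
        centers.append(x)
    return np.stack(centers)

# Complexity is Theta(m k^2 d): independent of the number of points n.
X = np.random.default_rng(1).normal(size=(100000, 10))
C = kmc2_seeding(X, k=20, m=50)
```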
5 Analysis

In the previous section, we saw that the rate of convergence of K-MC² depends linearly on γ'. By definition, γ' is trivially bounded by n, and it is easy to construct a dataset achieving this bound: Consider the 2-Means clustering problem and let (n − 1) points lie in an arbitrarily small cluster while a single point lies at some distance away. With probability (1 − 1/n), a point from the first group is sampled as the initial cluster center. In the subsequent D²-sampling step, we are thus required to sample the single point with probability approaching one. For such a pathological dataset, it is impossible to approximate D²-sampling in sublinear time. Our proposed algorithm is consistent with this result, as it would require linear complexity with regards to the number of data points for this dataset.

Fortunately, such pathological datasets rarely occur in a practical setting. In fact, under very mild and natural assumptions on the dataset, we will show that γ' is at most sublinear in the number of data points. To this end, we assume that the dataset X is sampled i.i.d. from a base distribution F and note that γ' can be bounded by the product of two terms α and β, i.e.

$$ \gamma' \leq 4 \underbrace{\frac{\max_{x \in \mathcal{X}} d(x, \mu(\mathcal{X}))^2}{\frac{1}{n}\sum_{x' \in \mathcal{X}} d(x', \mu(\mathcal{X}))^2}}_{\alpha} \; \underbrace{\frac{\phi_{OPT}^1(\mathcal{X})}{\phi_{OPT}^k(\mathcal{X})}}_{\beta} \tag{5} $$

where μ(X) denotes the mean of X and φ^k_OPT(X) denotes the quantization error of the optimal solution with k centers (see Section C of the Appendix for a proof).

Tail behavior of distribution F. The first term α measures the ratio between the maximum and the average of the squared distances between the data points and their empirical mean. In the pathological example introduced above, α would approach (n − 1). Yet, under the following assumption, α grows sublinearly in n, as formally stated and proven in Section A.1 of the Appendix.

(A1) For distributions F with finite variance and exponential tails¹, α is independent of k and d and w.h.p. α = O(log² n).

¹ That is, there exist constants c, t such that P[d(x, μ(F)) > a] ≤ c e^{−at}, where x ∼ F.

This assumption is satisfied by the univariate and multivariate Gaussian as well as the Exponential and Laplace distributions, but not by heavy-tailed distributions such as the Pareto distribution. Furthermore, if α is sublinear in n for all components of a mixture, then α is also sublinear for the mixture itself. For distributions with finite variance and bounded support, we even show a bound on α that is independent of n.
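Since α is defined purely in terms of the sample, it can be estimated directly from data, as is done later for Table 1. The following illustrative sketch computes α via the first factor of (5) and lets one check empirically that it grows only slowly for light-tailed data:

```python
import numpy as np

def alpha(X):
    """alpha = max_x d(x, mean)^2 / ((1/n) * sum_x d(x, mean)^2), cf. (5)."""
    d2 = ((X - X.mean(axis=0)) ** 2).sum(axis=1)
    return d2.max() / d2.mean()

rng = np.random.default_rng(0)
for n in (10**3, 10**4, 10**5, 10**6):
    X = rng.normal(size=(n, 10))      # Gaussian: exponential tails, cf. (A1)
    print(n, round(alpha(X), 1))      # grows only polylogarithmically in n
```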
Nondegeneracy of distribution F. The second term β measures the reduction in quantization error when k centers are used instead of just one. Without prior assumptions, β can be unbounded: If a dataset consists of at most k distinct points, the denominator of the second term in (5) is zero. Yet, what is the point of clustering such a dataset in practice if the solution is trivial? It is thus natural to assume that F is nondegenerate, i.e., that its support contains more than k points. Furthermore, we expect β to be independent of n if n is sufficiently large: Due to the strong consistency of K-Means, the optimal solution on a finite sample converges to the optimal quantizer of the generating distribution as n → ∞ (Pollard, 1981), and such an optimal quantizer is by definition independent of n. At the same time, β should be non-increasing with respect to k, as additional available cluster centers can only reduce the optimal quantization error. This allows us to derive a very general result (formally stated and proven in Section A.2 of the Appendix): for distributions F that are approximately uniform on a hypersphere, β is independent of n.

(A2) For distributions F for which the ratio between the maximal and minimal density on a hypersphere with nonzero probability mass is bounded by a constant, β is independent of n and w.h.p. β = O(k).

This property holds for a wide family of continuous probability distributions including the univariate and multivariate Gaussian, the Exponential and the Laplace distribution. Again, if β is bounded for all components of a mixture, then β is also bounded for the mixture.

Solution quality of K-MC². These two assumptions do not only allow us to bound γ' and thus obtain favourable convergence, but also to analyze the quality of the solutions generated by K-MC². In particular, we show in Section C of the Appendix that the expected quantization error φ_K-MC² of Algorithm 1 is bounded by

$$ \mathbb{E}\left[\phi_{\text{K-MC}^2}\right] \leq \mathbb{E}\left[\phi_{\text{k-means++}}\right] + 2\epsilon\beta\,\phi_{OPT}^k(\mathcal{X}). $$

Hence, by setting the total variation distance ε = O(1/β), the second term becomes a constant factor of φ^k_OPT(X). By applying Theorem 1 with m = O(αβ log(βk)), the solution sampled by K-MC² is in expectation O(log k)-competitive with the optimal solution, and we obtain the following theorem.

Theorem 2. Let k > 0 and let X be a dataset with α = O(log² n) and β = O(k), i.e. assume (A1) and (A2). Let C be the set of centers sampled by K-MC² (Algorithm 1) with m = O(k log² n log k). Then we have

$$ \mathbb{E}\left[\phi_C(\mathcal{X})\right] \leq O(\log k)\, \phi_{OPT}^k(\mathcal{X}). $$

The total complexity is O(k³ d log² n log k).

The proof is provided in Section C of the Appendix. The significance of this result is that, under natural assumptions, it is sufficient to run K-MC² with complexity sublinear in the number of data points to retain the theoretical guarantee of k-means++. Hence, one can obtain a provably good clustering for K-Means in sublinear time for such datasets.

6 Experiments

Datasets. We use six different datasets: USGS (United States Geological Survey, 2010), CSN (Faulkner et al., 2011), KDD (KDD Cup, 2004), BIGX (Ackermann et al., 2012), WEB (Yahoo! Labs, 2008) and SONG (Bertin-Mahieux et al., 2011). Table 1 shows the size and number of dimensions of these datasets as well as estimates of both α and β. We directly calculate α using (5) and approximate β by replacing the optimal solution φ^k_OPT(X) in (5) with the solution obtained using k-means++.

Table 1: Datasets with size n, dimensionality d and estimated values for α and β.

DATASET   n           d    α         β (k=200)
CSN       80000       17   546.27    3.04
KDD       145751      74   1267.65   1.81
USGS      59209       3    2.68      51.67
WEB       45811883    5    2.33      57.09
BIGX      11620300    57   7.82      14.17
SONG      515345      90   525.67    1.23
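The β column of Table 1 replaces the intractable φ^k_OPT(X) with the cost of a k-means++-based solution. A minimal sketch of such an estimate, assuming scikit-learn's KMeans with k-means++ initialization as a stand-in for the seeding used by the authors, could look as follows:

```python
import numpy as np
from sklearn.cluster import KMeans

def estimate_beta(X, k, seed=0):
    """beta ~= phi_OPT^1(X) / phi_OPT^k(X); the k-center term is approximated
    by the cost of a k-means++ seeded solution, which upper-bounds phi_OPT^k,
    so this estimate is a lower bound on the true beta."""
    phi_1 = ((X - X.mean(axis=0)) ** 2).sum()         # optimal 1-center cost
    km = KMeans(n_clusters=k, init="k-means++", n_init=1, random_state=seed).fit(X)
    return phi_1 / km.inertia_

rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 5))
print(estimate_beta(X, k=200))
```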
Methods. We compare the algorithm K-MC² to four alternative methods (k-means++, RANDOM, HEURISTIC and k-means||). We run K-MC² with different chain lengths, i.e. m ∈ {1, 2, 5, 10, 20, 50, 100, 150, 200}. As the main baselines, we consider the seeding step of k-means++ as well as RANDOM, a seeding procedure that uniformly samples k data points as cluster centers. We further propose the following HEURISTIC: It works by uniformly sampling s points and then running the seeding step of k-means++ on this subset. Similar to K-MC², we set s ∈ {100, 200, 500, …, 10000, 15000, 20000}. Finally, we also compare to k-means||. We use r = 5 rounds and a variety of oversampling factors, i.e. l ∈ {0.02k, 0.05k, 0.1k, 0.2k, 0.5k, 1k, 2k}.

Experimental setup. For the datasets USGS, CSN and KDD, we set k = 200 and train all methods on the full datasets. We measure the number of distance evaluations and the quality of the solution found in terms of quantization error on the full dataset. For the datasets BIGX, WEB and SONG, we set k = 2000 and train on all but 250,000 points, which we use as a holdout set for evaluation. We consider both training error and holdout error for the following reason: On one hand, the theoretical guarantees for both K-MC² and k-means++ hold in terms of training error. On the other hand, in practice, one is usually interested in the generalization error. As all the considered methods are randomized procedures, we run them repeatedly with different initial random seeds. We average the obtained quantization errors and use the standard error of the mean to construct 95% confidence intervals. For each method, we further calculate the relative error and the speedup in terms of distance evaluations with respect to our main baseline k-means++.

Table 2: Experimental results.

Relative error vs. k-means++
                            CSN       KDD       USGS       BIGX      WEB       SONG
K-MEANS++                   0.00%     0.00%     0.00%      0.00%     0.00%     0.00%
RANDOM                      394.50%   307.44%   315.50%    11.45%    105.34%   9.74%
K-MC2 (m=20)                63.58%    32.62%    2.63%      0.05%     0.77%     0.38%
K-MC2 (m=100)               14.67%    2.94%     -0.33%     0.13%     -0.00%    -0.02%
K-MC2 (m=200)               6.53%     1.00%     -0.83%     -0.03%    0.01%     -0.02%
HEURISTIC (s=2000)          94.72%    73.28%    5.56%      0.38%     2.12%     0.69%
HEURISTIC (s=10000)         29.22%    9.55%     0.20%      0.10%     0.15%     0.15%
HEURISTIC (s=20000)         13.99%    2.22%     0.27%      0.02%     0.07%     0.05%
K-MEANS|| (r=5, l=0.02k)    335.61%   118.03%   2356.06%   223.43%   562.23%   40.54%
K-MEANS|| (r=5, l=0.2k)     2.12%     0.71%     19.13%     1.74%     11.03%    -0.34%
K-MEANS|| (r=5, l=2k)       -3.75%    -6.22%    -3.78%     -2.43%    -2.04%    -5.16%

Speedup vs. k-means++ (distance evaluations)
                            CSN    KDD    USGS   BIGX    WEB      SONG
K-MEANS++                   1.0    1.0    1.0    1.0     1.0      1.0
RANDOM                      -      -      -      -       -        -
K-MC2 (m=20)                40.0   72.9   29.6   568.5   2278.1   13.3
K-MC2 (m=100)               8.0    14.6   5.9    113.7   455.6    2.7
K-MC2 (m=200)               4.0    7.3    3.0    56.9    227.8    1.3
HEURISTIC (s=2000)          40.0   72.9   29.6   568.5   2278.1   13.3
HEURISTIC (s=10000)         8.0    14.6   5.9    113.7   455.6    2.7
HEURISTIC (s=20000)         4.0    7.3    3.0    56.9    227.8    1.3
K-MEANS|| (r=5, l=0.02k)    9.6    9.0    8.9    10.0    9.5      9.8
K-MEANS|| (r=5, l=0.2k)     1.0    1.0    1.0    1.0     1.0      1.0
K-MEANS|| (r=5, l=2k)       0.1    0.1    0.1    0.1     0.1      0.1

Discussion. The experimental results are displayed in Figures 1 and 2 and Table 2. As expected, k-means++ produces substantially better solutions than RANDOM (see Figure 1). For m = 1, K-MC² essentially returns a uniform sample of data points and should thus exhibit the same solution quality as RANDOM. This is confirmed by the results in Figure 1. As the chain length m increases, the performance of K-MC² improves and converges to that of k-means++. Even for small chain lengths, K-MC² is already competitive with the full k-means++ algorithm. For example, on BIGX, K-MC² with a chain length of m = 20 exhibits a relative error of only 0.05% compared to k-means++ (see Table 2). At the same time, K-MC² is 568.5 times faster in terms of distance evaluations.

K-MC² significantly outperforms HEURISTIC on all datasets (see Figure 1). For the same number of distance evaluations, K-MC² achieves a smaller quantization error: In the case of BIGX, HEURISTIC with s = 2000 exhibits a relative error of 0.38% compared to the 0.05% of K-MC² with a chain length of m = 20. In contrast to HEURISTIC, K-MC² further offers the theoretical guarantees presented in Theorems 1 and 2.

Figure 2 shows the relationship between the performance of k-means|| and the number of distance evaluations. Even with five rounds, k-means|| is able to match the performance of the inherently sequential k-means++ and even outperforms it if more computational effort is invested. However, as noted in the original paper (Bahmani et al., 2012), k-means|| performs poorly if it is run with low computational complexity, i.e. if r·l < k. As such, K-MC² and k-means|| have different use scenarios: k-means|| allows one to run the full k-means++ seeding step in a distributed manner on a cluster and potentially obtain even better seedings than k-means++ at the cost of additional computational effort. In contrast, K-MC² produces approximate but competitive seedings on a single machine at a fraction of the computational cost of both k-means++ and k-means||.
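The speedup column of Table 2 can be read in terms of distance evaluations. Under one plausible accounting, assumed here for illustration only (roughly n(k−1) evaluations for k-means++ with cached point-to-center distances, and m·k(k−1)/2 for K-MC²), the reported factors are recovered, e.g. about 40 for CSN at m = 20:

```python
# Back-of-the-envelope accounting of the speedup column in Table 2.
# Assumption for illustration: count distance evaluations only, with cached
# point-to-nearest-center distances for k-means++.
def kmeanspp_evals(n, k):
    return n * (k - 1)            # one full pass per sampled center

def kmc2_evals(k, m):
    return m * k * (k - 1) // 2   # m candidates per center, each against i-1 centers

# CSN: n = 80000, k = 200; BIGX trains on 11620300 - 250000 points with k = 2000.
for name, n, k in [("CSN", 80000, 200), ("BIGX", 11620300 - 250000, 2000)]:
    for m in (20, 100, 200):
        speedup = kmeanspp_evals(n, k) / kmc2_evals(k, m)
        print(f"{name:5s} m={m:3d} speedup ~ {speedup:.1f}")
```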
Figure 1: Average quantization error vs. number of distance evaluations for K-MC² and HEURISTIC, as well as the average quantization error (without the number of distance evaluations) for k-means++ and RANDOM. Panels show training error for CSN, KDD and USGS (k = 200) and holdout error for BIGX, WEB and SONG (k = 2000). K-MC² quickly converges to full k-means++ and outperforms HEURISTIC. Shaded areas denote 95% confidence intervals.

Figure 2: Average quantization error vs. number of distance evaluations for K-MC², k-means++ and k-means||. Panels show training error for CSN, KDD and USGS (k = 200) and holdout error for BIGX, WEB and SONG (k = 2000). K-MC² obtains competitive solutions significantly faster than both k-means++ and k-means||.

7 Conclusion

We propose K-MC², an algorithm to quickly obtain an initial solution to the K-Means clustering problem. It has several attractive properties: It can be used to approximate the seeding step of k-means++ to arbitrary precision and, under natural assumptions, it even obtains provably good clusterings in sublinear time. This is confirmed by experiments on real-world datasets where the quality of the produced clusterings is similar to that of k-means++ while the runtime is drastically reduced. K-MC² further outperforms a heuristic approach based on subsampling the data and produces fast but competitive seedings with a computational budget unattainable by k-means||. We posit that our technique can be extended to improve on other theoretical results for D²-sampling as well as to other clustering problems.

Acknowledgments. We would like to thank Sebastian Tschiatschek and the anonymous reviewers for their comments. This research was partially supported by ERC StG 307036 and the Zurich Information Security Center.

References

Ackermann, M. R.; Märtens, M.; Raupach, C.; Swierkot, K.; Lammersen, C.; and Sohler, C. 2012. StreamKM++: A clustering algorithm for data streams. Journal of Experimental Algorithmics 17:2.4.

Agarwal, M.; Jaiswal, R.; and Pal, A. 2013. k-means++ under approximation stability. In Theory and Applications of Models of Computation. Springer. 84–95.

Aggarwal, A.; Deshpande, A.; and Kannan, R. 2009. Adaptive sampling for k-means clustering. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. Springer. 15–28.

Ailon, N.; Jaiswal, R.; and Monteleoni, C. 2009. Streaming k-means approximation. In Advances in Neural Information Processing Systems (NIPS), 10–18.

Arthur, D., and Vassilvitskii, S. 2007. k-means++: The advantages of careful seeding. In Symposium on Discrete Algorithms (SODA), 1027–1035. Society for Industrial and Applied Mathematics.

Bachem, O.; Lucic, M.; and Krause, A. 2015. Coresets for nonparametric estimation - the case of DP-means. In International Conference on Machine Learning (ICML).
Bahmani, B.; Moseley, B.; Vattani, A.; Kumar, R.; and Vassilvitskii, S. 2012. Scalable k-means++. Very Large Data Bases (VLDB) 5(7):622–633.

Balcan, M.-F.; Blum, A.; and Gupta, A. 2009. Approximate clustering without the approximation. In Symposium on Discrete Algorithms (SODA), 1068–1077. Society for Industrial and Applied Mathematics.

Bertin-Mahieux, T.; Ellis, D. P.; Whitman, B.; and Lamere, P. 2011. The Million Song Dataset. In International Conference on Music Information Retrieval.

Bottou, L., and Bengio, Y. 1994. Convergence properties of the k-means algorithms. In Advances in Neural Information Processing Systems (NIPS), 585–592.

Braverman, V.; Meyerson, A.; Ostrovsky, R.; Roytman, A.; Shindler, M.; and Tagiku, B. 2011. Streaming k-means on well-clusterable data. In Symposium on Discrete Algorithms (SODA), 26–40. Society for Industrial and Applied Mathematics.

Brunsch, T., and Röglin, H. 2011. A bad instance for k-means++. In Theory and Applications of Models of Computation. Springer. 344–352.

Cai, H. 2000. Exact bound for the convergence of Metropolis chains. Stochastic Analysis and Applications 18(1):63–71.

Coates, A., and Ng, A. Y. 2012. Learning feature representations with k-means. In Neural Networks: Tricks of the Trade. Springer. 561–580.

Coates, A.; Lee, H.; and Ng, A. Y. 2011. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 1001.

Faulkner, M.; Olson, M.; Chandy, R.; Krause, J.; Chandy, K. M.; and Krause, A. 2011. The next big one: Detecting earthquakes and other rare events from community-based sensors. In ACM/IEEE International Conference on Information Processing in Sensor Networks.

Guha, S.; Meyerson, A.; Mishra, N.; Motwani, R.; and O'Callaghan, L. 2003. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering 15(3):515–528.

Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1):97–109.

Jaiswal, R., and Garg, N. 2012. Analysis of k-means++ for separable data. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. Springer. 591–602.

Jaiswal, R.; Kumar, A.; and Sen, S. 2014. A simple D²-sampling based PTAS for k-means and other clustering problems. Algorithmica 70(1):22–46.

Jaiswal, R.; Kumar, M.; and Yadav, P. 2015. Improved analysis of D²-sampling based PTAS for k-means and other clustering problems. Information Processing Letters 115(2):100–103.

KDD Cup. 2004. Protein Homology Dataset. Retrieved from osmot.cs.cornell.edu/kddcup.

Kulis, B., and Jordan, M. I. 2012. Revisiting k-means: New algorithms via Bayesian nonparametrics. In International Conference on Machine Learning (ICML), 513–520.

Lloyd, S. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2):129–137.
Ostrovsky, R.; Rabani, Y.; Schulman, L. J.; and Swamy, C. 2006. The effectiveness of Lloyd-type methods for the k-means problem. In Foundations of Computer Science (FOCS), 165–176. IEEE.

Pollard, D. 1981. Strong consistency of k-means clustering. The Annals of Statistics 9(1):135–140.

Sculley, D. 2010. Web-scale k-means clustering. In World Wide Web Conference (WWW), 1177–1178. ACM.

Shindler, M.; Wong, A.; and Meyerson, A. W. 2011. Fast and accurate k-means for large datasets. In Advances in Neural Information Processing Systems (NIPS), 2375–2383.

United States Geological Survey. 2010. Global Earthquakes (1.1.1972–19.3.2010). Retrieved from the mldata.org repository.

Wu, X.; Kumar, V.; Ross Quinlan, J.; Ghosh, J.; Yang, Q.; Motoda, H.; McLachlan, G.; Ng, A.; Liu, B.; Yu, P.; Zhou, Z.-H.; Steinbach, M.; Hand, D.; and Steinberg, D. 2008. Top 10 algorithms in data mining. Knowledge and Information Systems 14(1):1–37.

Yahoo! Labs. 2008. R6A - Yahoo! Front Page Today Module User Click Log Dataset. Retrieved from the research.yahoo.com repository.

Zhao, W.; Ma, H.; and He, Q. 2009. Parallel k-means clustering based on MapReduce. In Cloud Computing. Springer. 674–679.

A Formal Statement of Natural Assumptions

We state the theorems related to the assumptions introduced in Section 5 and provide the corresponding proofs.

A.1 Tail behavior of F

The following theorem corresponds to Assumption (A1) in Section 5.

Theorem 3. Let F be a probability distribution over R^d with finite variance that has at most exponential tails, i.e. there exist c, t such that P[d(x, μ̃) > a] ≤ c e^{−at}, where μ̃ denotes the mean of F. Let X be a set of n points independently sampled from F. Then, with high probability, for n sufficiently large, α is independent of k as well as d and depends polylogarithmically on n, i.e.

$$ \alpha = \frac{\max_{x \in \mathcal{X}} d(x, \mu(\mathcal{X}))^2}{\frac{1}{n}\sum_{x' \in \mathcal{X}} d(x', \mu(\mathcal{X}))^2} = O\left(\log^2 n\right). $$

Proof. Since F has exponential tails, the mean μ̃ = ∫ x dF(x) is well defined and E_{x∼F}[d(x, μ̃)] < ∞. As a result, by the strong law of large numbers, we have almost surely that μ(X) → μ̃, i.e. d(μ(X), μ̃) → 0 as n → ∞. Furthermore, since F has at most exponential tails,

$$ P\left[d(x, \tilde{\mu}) > (2 \ln n + \ln c)/t\right] \leq n^{-2}. $$

Therefore, using the union bound, with probability at least 1 − 1/n we have d(x, μ̃) ≤ (2 ln n + ln c)/t for all x ∈ X. Hence, max_{x∈X} d(x, μ̃)² = O(log² n). Applying the triangle inequality, we obtain that

$$ \max_{x \in \mathcal{X}} d(x, \mu(\mathcal{X}))^2 \leq \max_{x \in \mathcal{X}} \left( d(x, \tilde{\mu}) + d(\tilde{\mu}, \mu(\mathcal{X})) \right)^2 \leq 2 \max_{x \in \mathcal{X}} d(x, \tilde{\mu})^2 + 2\, d(\tilde{\mu}, \mu(\mathcal{X}))^2, $$

which is O(log² n) w.h.p. Since F has finite variance, the denominator (1/n) Σ_{x'∈X} d(x', μ(X))² converges almost surely to a positive constant (as in the proof of Theorem 4), which yields the claimed bound on α.

If F has finite variance and bounded support, we can obtain a constant bound on α, which is formalized by the following theorem.

Theorem 4. Let F be a probability distribution over R^d with finite variance whose support is almost surely bounded by a d-dimensional sphere with radius R. Let X be a set of n points independently sampled from F. Then, with high probability, if n is sufficiently large, α is independent of n, k and d.

Proof. The distance between any point x ∈ X and the mean μ(X) is clearly bounded by 2R. Hence, we always have max_{x∈X} d(x, μ(X))² ≤ 4R². Also, let μ̃ = ∫ x dF(x) and σ² = ∫ d(x, μ̃)² dF(x). By the triangle inequality, we get

$$ \frac{1}{n}\sum_{x \in \mathcal{X}} d(x, \tilde{\mu})^2 \leq \frac{1}{n}\sum_{x \in \mathcal{X}} \left( d(x, \mu(\mathcal{X})) + d(\mu(\mathcal{X}), \tilde{\mu}) \right)^2 \leq 2\, d(\tilde{\mu}, \mu(\mathcal{X}))^2 + \frac{2}{n}\sum_{x \in \mathcal{X}} d(x, \mu(\mathcal{X}))^2. $$

Then, by the strong law of large numbers (note that F has bounded variance), as n grows large we have almost surely that μ(X) → μ̃ and (1/n) Σ_{x∈X} d(x, μ̃)² → σ², so that (1/n) Σ_{x∈X} d(x, μ(X))² is bounded below by a positive constant, which concludes the proof.

A.2 Nondegeneracy of F

The following theorem corresponds to Assumption (A2) in Section 5.

Theorem 5. Let F be a probability distribution over R^d with finite variance. Assume that there exists a d'-dimensional sphere B with radius R such that
d' ≥ 2, F(B) > 0, and F(x) ≤ c F(y) for all x, y ∈ B and some c ≥ 1 (i.e., F is sufficiently non-degenerate). Then, w.h.p.,

$$ \beta = \frac{\phi_{OPT}^1(\mathcal{X})}{\phi_{OPT}^k(\mathcal{X})} \leq c_1 k^{\min\{1, 4/d'\}}, \tag{6} $$

where c_1 is a constant inversely proportional to c F(B) R².

Proof. Consider picking n i.i.d. points according to the distribution F. Among such points, w.h.p. m ≥ n F(B)/2 points fall into B. Note that these m points are i.i.d. samples from B according to the distribution F̂(x) = F(x)/F(B). Partition these points into m/k' subsets of size k' = 15k. Each such subset is also an i.i.d. sample from B according to F̂. Consider one of the subsets X' = {x_1, …, x_{k'}} and let Y be a randomly chosen subset of X' of size k'/5. Let C = {c_1, c_2, …, c_k} ⊆ R^d be an arbitrary set of k centers and assume that for center c_i there are ℓ points y_{i1}, …, y_{iℓ} ∈ Y which have c_i as their nearest neighbor. Using the triangle inequality, we can then write

$$ \sum_{j=1}^{\ell} d(y_{ij}, c_i)^2 \geq \sum_{j=1}^{\lfloor \ell/2 \rfloor} \left( d(y_{i,2j-1}, c_i)^2 + d(y_{i,2j}, c_i)^2 \right) \geq \frac{1}{2} \sum_{j=1}^{\lfloor \ell/2 \rfloor} \left( d(y_{i,2j-1}, c_i) + d(y_{i,2j}, c_i) \right)^2 \geq \frac{\lfloor \ell/2 \rfloor}{2} \min_{y, y' \in Y:\, y \neq y'} d(y, y')^2. $$

By summing over all the centers, we obtain that

$$ \sum_{y \in Y} d(y, C)^2 \geq \frac{k}{2} \min_{y, y' \in Y:\, y \neq y'} d(y, y')^2. $$

Recall that we have partitioned the m points into m/k' groups of k' points. By applying Lemma 1 (see below) and Hoeffding's inequality to these groups, with high probability we have that

$$ \sum_{j=1}^{m} d(x_j, C)^2 \geq \frac{m}{30}\, c_2\, R^2\, c^{-2/d'}\, k^{-\min\{1, 4/d'\}}. \tag{7} $$

Since F has bounded variance, w.h.p. φ¹_OPT(X)/n converges to the variance of F. Hence, by (7), we have w.h.p.

$$ \frac{\phi_{OPT}^k(\mathcal{X})}{n} \geq \frac{F(B)}{60}\, c_2\, R^2\, c^{-2/d'}\, k^{-\min\{1, 4/d'\}}. $$

We conclude that w.h.p. β ≤ c_1 k^{min{1, 4/d'}} with c_1 as in the statement of the theorem.

Lemma 1. Let F be a probability distribution defined on a d'-dimensional sphere B with radius R, where d' ≥ 2. Assume that for any two points x, y ∈ B we have F(x) ≤ c F(y) for some constant c. Let X = {x_1, …, x_k} be a sample of k i.i.d. points from F. Then we have

$$ \mathbb{E}\left[ \max_{Y \subseteq X,\, |Y| = k/5} \; \min_{x, y \in Y:\, x \neq y} d(x, y) \right] \geq c_1\, R\, c^{-1/d'}\, k^{-\min\{1/2,\, 2/d'\}}. $$

Proof. Fix a value ε > 0 and denote by B_ε(y) the ball of radius ε centered at y. Consider the following covering of B by balls of radius ε: we center the first ball at the center of B and, at the i-th iteration, if B \ ∪_{j<i} B_ε(y_j) is nonempty, we place the center of the next ball inside this uncovered region. Let d_min denote the quantity inside the expectation above. One can then show that Pr(d_min > ε) ≥ 1 − c k² (5ε/R)^{d'}, and since d_min ≥ 0 we have that

$$ \mathbb{E}[d_{\min}] = \int_0^{2R} \Pr(d_{\min} > x)\, dx \geq \frac{d'}{d'+1} \cdot \frac{R}{5\, (c k^2)^{1/d'}}, $$

which concludes the proof for d' ≥ 4. As for the cases d' = 2, 3, one can recover a similar result using a finer covering of the sphere.

B Proof of Theorem 1

Theorem 1. Let k > 0 and 0 < ε < 1. Let p_++(C) be the probability of sampling a seeding C using k-means++ and p_mcmc(C) the probability using K-MC² (Algorithm 1). Then,

$$ \lVert p_{mcmc} - p_{++} \rVert_{TV} \leq \epsilon $$

for a chain length m = O(γ' log(k/ε)), where

$$ \gamma' = \max_{C \subseteq \mathcal{X},\, |C| \leq k} \; \max_{x \in \mathcal{X}} \; \frac{n\, d(x, C)^2}{\sum_{x' \in \mathcal{X}} d(x', C)^2}. $$

The resulting complexity of Algorithm 1 is O(γ' k² d log(k/ε)).

Proof. Let c_1, c_2, …, c_k denote the k sampled cluster centers in C and define C_i = ∪_{j=1}^{i} {c_j} for i = 1, 2, …, k. Let p_++(c_i | C_{i−1}) denote the conditional probability of sampling c_i given C_{i−1} for k-means++. Similarly, p_m(c_i | C_{i−1}) denotes the conditional probability for K-MC² with chain length m. Note that

$$ p_{++}(C) = \frac{1}{n} \prod_{i=2}^{k} p_{++}(c_i \mid C_{i-1}) \quad \text{and} \quad p_{mcmc}(C) = \frac{1}{n} \prod_{i=2}^{k} p_m(c_i \mid C_{i-1}). $$

By Corollary 1 of Cai (2000) and the definition of γ', there exists a chain length m = O(γ' log(k/ε)) such that for all C_{i−1} ⊆ X with |C_{i−1}| ≤ k − 1

$$ \lVert p_{++}(\cdot \mid C_{i-1}) - p_m(\cdot \mid C_{i-1}) \rVert_{TV} \leq \frac{\epsilon}{k-1}. \tag{8} $$

Next, we show an auxiliary result: Consider two arbitrary discrete probability distributions

$$ p_{X,Y}(x, y) = p_X(x)\, p_{Y|X}(y \mid x) \quad \text{and} \quad q_{X,Y}(x, y) = q_X(x)\, q_{Y|X}(y \mid x) $$

with ||p_X − q_X||_TV ≤ ε_1 and ||p_{Y|X}(·|x) − q_{Y|X}(·|x)||_TV ≤ ε_2 for all x. For all x and y, it holds that

$$ \lvert p_{X,Y}(x,y) - q_{X,Y}(x,y) \rvert \leq p_X(x)\, \lvert p_{Y|X}(y \mid x) - q_{Y|X}(y \mid x) \rvert + q_{Y|X}(y \mid x)\, \lvert p_X(x) - q_X(x) \rvert, $$

and hence, by the definition of the total variation distance,

$$ \lVert p_{X,Y} - q_{X,Y} \rVert_{TV} \leq \lVert p_X - q_X \rVert_{TV} + \max_x \lVert p_{Y|X}(\cdot \mid x) - q_{Y|X}(\cdot \mid x) \rVert_{TV} \leq \epsilon_1 + \epsilon_2. $$
An iterative application of this result to (8) yields

$$ \lVert p_{mcmc} - p_{++} \rVert_{TV} \leq \epsilon. $$

The same guarantee holds for the probabilities conditioned on the first sampled center c_1, i.e.

$$ \lVert p_{mcmc}(\cdot \mid c_1) - p_{++}(\cdot \mid c_1) \rVert_{TV} \leq \epsilon. \tag{9} $$

C Proof of Theorem 2

Theorem 2. Let k > 0 and let X be a dataset with α = O(log² n) and β = O(k), i.e. assume (A1) and (A2). Let C be the set of centers sampled by K-MC² (Algorithm 1) with m = O(k log² n log k). Then we have

$$ \mathbb{E}\left[\phi_C(\mathcal{X})\right] \leq O(\log k)\, \phi_{OPT}^k(\mathcal{X}). $$

The total complexity is O(k³ d log² n log k).

Proof. We have Σ_{x∈X} d(x, C)² ≥ φ^k_OPT(X) for all sets of centers C ⊆ X of cardinality at most k. Furthermore, for all x ∈ X,

$$ d(x, C)^2 \leq 2\, d(x, \mu(\mathcal{X}))^2 + 2\, d(\mu(\mathcal{X}), C)^2 \leq 4 \max_{x' \in \mathcal{X}} d(x', \mu(\mathcal{X}))^2. $$

Combining these two observations with Σ_{x'∈X} d(x', μ(X))² = φ¹_OPT(X) yields

$$ \gamma' \leq 4 \underbrace{\frac{\max_{x \in \mathcal{X}} d(x, \mu(\mathcal{X}))^2}{\frac{1}{n}\sum_{x' \in \mathcal{X}} d(x', \mu(\mathcal{X}))^2}}_{\alpha} \; \underbrace{\frac{\phi_{OPT}^1(\mathcal{X})}{\phi_{OPT}^k(\mathcal{X})}}_{\beta}, $$

i.e. the bound (5). Denote by φ_k-means++ the quantization error of k-means++ and by φ_mcmc that of K-MC². Let z be the random variable consisting of the sampled cluster centers c_2, c_3, …, c_k. Let p_++(z | c_1) denote the conditional probability of z given the initial cluster center c_1 for k-means++. Correspondingly, p_m(z | c_1) denotes the conditional probability for K-MC² with chain length m. We note that

$$ p_m(z \mid c_1) \leq p_{++}(z \mid c_1) + \left[ p_m(z \mid c_1) - p_{++}(z \mid c_1) \right]_+ $$

and E[φ_{c_1}(X)] ≤ 2β φ^k_OPT(X). Using Theorem 1.1 of Arthur and Vassilvitskii (2007) and (9), we then have that

$$ \begin{aligned} \mathbb{E}\left[\phi_{mcmc}\right] &= \frac{1}{n} \sum_{c_1 \in \mathcal{X}} \sum_{z \in \mathcal{X}^{k-1}} \phi_{c_1 \cup z}(\mathcal{X})\, p_m(z \mid c_1) \\ &\leq \frac{1}{n} \sum_{c_1 \in \mathcal{X}} \sum_{z \in \mathcal{X}^{k-1}} \phi_{c_1 \cup z}(\mathcal{X}) \left( p_{++}(z \mid c_1) + \left[ p_m(z \mid c_1) - p_{++}(z \mid c_1) \right]_+ \right) \\ &\leq \mathbb{E}\left[\phi_{\text{k-means++}}\right] + \frac{1}{n} \sum_{c_1 \in \mathcal{X}} \phi_{c_1}(\mathcal{X}) \sum_{z \in \mathcal{X}^{k-1}} \left[ p_m(z \mid c_1) - p_{++}(z \mid c_1) \right]_+ \\ &\leq \left[ 8(\log_2 k + 2) + 2\beta\epsilon \right] \phi_{OPT}^k(\mathcal{X}). \end{aligned} $$

The result then follows by setting ε = O(1/β).