# Exact Acceleration of K-Means++ and K-Means‖

Edward Raff
Booz Allen Hamilton; University of Maryland, Baltimore County
raff_edward@bah.com, raff.edward@umbc.edu

*Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21)*

**Abstract.** K-Means++ and its distributed variant K-Means‖ have become de facto tools for selecting the initial seeds of K-means. While alternatives have been developed, the effectiveness, ease of implementation, and theoretical grounding of the K-means++ and ‖ methods have made them difficult to best from a holistic perspective. By considering the limited opportunities within seed selection to perform pruning, we develop specialized triangle-inequality pruning strategies and a dynamic priority queue to show the first acceleration of K-Means++ and K-Means‖ that is faster in run-time while being algorithmically equivalent. For both algorithms we are able to reduce distance computations by over 500×. For K-means++ this results in up to a 17× speedup in run-time, and for K-means‖ a 551× speedup. We achieve this with simple, but carefully chosen, modifications to known techniques, which makes it easy to integrate our approach into existing implementations of these algorithms.

## 1 Introduction

Before one can run the K-means algorithm, a prerequisite step is needed to select the initial K seeds used as the first estimate of the means. This seed selection step is critical to obtaining high quality results with the K-means algorithm: selecting better initial centers m_1, ..., m_K can improve the quality of the final K-means clustering. A major step in developing better seed selection was the K-means++ algorithm, which was the first to show that the seeds it finds are log-optimal in expectation for solving the K-means problem [Arthur and Vassilvitskii, 2007]. For a dataset with n items, K-means++ requires O(nK) distance computations. If P processors are available, K-means++ can be done in O(nK/P); however, the communication overhead needed to run K-means++ in parallel is significant. To remedy this, Bahmani et al. [2012] introduced K-means‖, which retains the O(nK/P) complexity and performs a constant factor more distance computations in order to significantly reduce the communication overhead, while still yielding the same log-optimal results [Bachem et al., 2017]. When working in a distributed environment, where communication must occur over the network, this can lead to large reductions in run-time [Bahmani et al., 2012].

The cost of K-means++ has long been recognized as an expensive but necessary step for better results, with little progress on improvement. Modern accelerated versions of K-means clustering perform as few as 1.2 total iterations over the dataset [Ryšavý and Hamerly, 2016], making K-means++ seed selection take up to 44% of all distance computations. Outside of exact K-means clustering, faster seed selection can help improve stochastic variants of K-means [Bottou and Bengio, 1995; Sculley, 2010] and is useful for applications like coreset construction [Bachem et al., 2015], change detection [Raff et al., 2020], tensor algorithms [Jegelka et al., 2009], clustering with Bregman divergences [Nock et al., 2008], and Jensen divergences [Nielsen and Nock, 2015]. Applications with large K have in particular been neglected, even though K ≥ 20,000 is useful for scaling kernel methods [Si et al., 2017].
In this work, we seek to accelerate the original K-means++ and K-means‖ algorithms so that we may obtain the same provably good results in less time, without compromising on any of the desirable qualities of K-means++ or K-means‖. We review work related to our own in Section 2. Since the bottlenecks, and our approaches to removing them, differ between the two algorithms, we review their details and our accelerations sequentially. For K-means++, in Section 3 we show how a simple application of the triangle inequality plus a novel dynamic priority queue allows us to avoid redundant computations and keep the cost of sampling new means low. In Section 4 we address K-means‖ and develop a new Nearest In Range query that allows us to successfully use a metric index to prune distance computations, even though the index is built on a corpus normally too small to be useful with structures like KD-trees. We then perform an empirical evaluation of our modifications in Section 5 over a larger set of corpora with more diverse properties, covering K ∈ [32, 4096]. In doing so, we observe that our accelerated algorithms require either the same or less time across all datasets and all values of K, making them a Pareto improvement. Finally, we conclude in Section 6.

## 2 Related Work

Many prior works have used the triangle inequality, d(a, b) + d(b, c) ≥ d(a, c), to accelerate the K-means algorithm. While the first work along this line was done by Phillips [2002], it was first successfully popularized by Elkan [2003]. Since then, several works have attempted to build faster K-means clustering algorithms with better incorporation of, or tighter bounds developed through, the triangle inequality [Hamerly, 2010; Ding et al., 2015; Newling and Fleuret, 2016]. Despite the heavy use of the triangle inequality to accelerate K-means clustering, we are aware of no prior works that apply it to the seed selection step of K-means++ and K-means‖. We believe this is largely because these methods cannot accelerate the first iteration of K-means, as they rely on the first iteration's result to accelerate subsequent iterations. Since K-means++ is effectively a single iteration of K-means, their approaches cannot be directly applied to the seed selection step.

In our work to accelerate K-means‖ using metric index structures, a similar historical theme emerges. Prior works have used index structures like KD-trees [Pelleg and Moore, 1999] and Cover-trees [Curtin, 2017] to accelerate the K-means clustering algorithm, but did not consider the seed selection step. Similarly, we will use a metric index to accelerate K-means‖, but we will develop an enhanced nearest-neighbor query that considers a maximum range so that it can meaningfully prune even when K is small.

Most work we are aware of focuses on extending or utilizing the K-means++ algorithm, with few significant results on improving it. The most significant in this regard is the AFK-MC² algorithm [Bachem et al., 2016a] and its predecessor K-MC² [Bachem et al., 2016b]. Both can obtain initial seeds with the same quality as K-means++ with fewer distance computations, but scale as O(n/P + mK²), where m is a budget factor. This makes them less effective when a large number of CPUs P is available or when K is large. Neither work factored in actual run-time; Newling and Fleuret [2017] showed that these implementations are actually 3.3× slower when overheads are factored in.
We consider run-time in our own work to show that our improvements materialize in practice.

## 3 Accelerating K-Means++

We start with the K-means++ algorithm, for which we present detailed pseudo-code in Algorithm 1. We describe the method as it works when each data point x_i has an associated weight w_i, as this is required later on. The algorithm begins by selecting an initial seed at random and then assigning a new weight β_i to each data point x_i, based on the squared distance of x_i to the closest existing seed. At each iteration, we add a new seed to the set based on these weights, and return once we have K total seeds. This requires K passes through the dataset of size n, resulting in O(nK) distance computations. Note that we cache the distance between each point x_i and its closest mean in the variable α_i. We maintain this notation throughout the paper and use α_i as shorthand.

    Algorithm 1: K-Means++
    Require: desired number of seeds K, data points x_1, ..., x_n, data weights w_1, ..., w_n with w_i ≥ 0
     1: β_i ← w_i / Σ_{z=1}^{n} w_z,  ∀ i ∈ [1, n]
     2: m_1 ← x_i, where i is selected with probability β_i
     3: k ← 1, α ← ∞
     4: while k < K do
     5:     for i ∈ [1, n] do
     6:         α_i ← min(α_i, d(m_k, x_i))
     7:     for i ∈ [1, n] do
     8:         β_i ← w_i · α_i² / Σ_{z=1}^{n} w_z · α_z²
     9:     k ← k + 1
    10:     m_k ← x_i, where i is selected with probability β_i
    11: return initial means m_1, ..., m_K

The first step toward improving the K-means++ algorithm is to filter out redundant distance computations. At each iteration we compare the distance of each point x_i to the newest mean m_k against its previously closest mean m_j, where 1 ≤ j < k. That is, we need to determine whether d(x_i, m_k) < d(x_i, m_j). To avoid computing d(x_i, m_k) when possible, we can use Lemma 1, as introduced and proven by Elkan [2003].

**Lemma 1.** Let x be a point and let b and c be centers. If d(b, c) ≥ 2·d(x, b), then d(x, c) ≥ d(x, b).

### 3.1 Applying the Triangle Inequality

We can use the distance between m_k and m_j to determine whether computing d(x_i, m_k) is a fruitless effort by checking whether d(m_j, m_k) ≥ 2·d(x_i, m_j). The value d(x_i, m_j) is already available in the form of α_i as maintained in Algorithm 1. We then only need to compute d(m_j, m_k) for all j < k, of which there are intrinsically fewer than k unique values at each iteration. Thus, we can compute γ_j = d(m_j, m_k) once at the start of each loop, and re-use these k values for all n − k distance comparisons. Applying this bound, we can avoid many redundant computations. There are still K total iterations to select K means, and each iteration performs k distance computations against previous means plus at most n − k point-to-mean checks, so we incur at most n distance computations per iteration, leaving the worst case at O(nK) distance computations for the K-means++ algorithm.
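To make the Section 3.1 pruning concrete, below is a minimal NumPy sketch of one assignment round with the Lemma 1 test. This is an illustrative sketch, not the authors' JSAT implementation; the function and variable names (assign_with_prune, phi, gamma) are ours. It computes the at most k distances γ_j once and skips d(x_i, m_k) whenever γ_{φ_i} ≥ 2·α_i, where φ_i indexes the current closest seed of x_i.

```python
import numpy as np

def assign_with_prune(X, seeds, alpha, phi):
    """One Lemma 1-pruned assignment round (illustrative sketch).

    X: (n, d) data matrix; seeds: indices of seeds chosen so far, the last
    entry being the newest mean m_k; alpha[i] = d(x_i, closest seed);
    phi[i] = position in `seeds` of that closest seed.
    """
    k = len(seeds) - 1
    m_k = X[seeds[-1]]
    # gamma[j] = d(m_k, m_j) for j < k: at most k distances, reused for all n points.
    gamma = np.linalg.norm(X[seeds[:-1]] - m_k, axis=1)
    for i in range(X.shape[0]):
        # Lemma 1: if d(m_{phi_i}, m_k) >= 2 * d(x_i, m_{phi_i}) then
        # d(x_i, m_k) >= d(x_i, m_{phi_i}), so the distance is never needed.
        if k > 0 and gamma[phi[i]] >= 2.0 * alpha[i]:
            continue
        d_ik = np.linalg.norm(X[i] - m_k)
        if d_ik < alpha[i]:
            alpha[i], phi[i] = d_ik, k  # m_k becomes x_i's closest seed
    return alpha, phi
```

The only overhead relative to Algorithm 1 is the k − 1 seed-to-seed distances per iteration, which the experiments in Section 5 show is repaid many times over.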
### 3.2 Avoiding Subnormal Slowdowns

A non-trivial cost exists in lines 7–10 of Algorithm 1, where we must compute the probability of selecting each point as the next mean and then perform the selection. This requires at least 3n floating-point multiplications, which can be a bottleneck in low-dimensional problems. The problem can be exacerbated because the squared distance to the closest center, α_i², naturally becomes very small as k increases, resulting in subnormal floating-point values. Subnormals (also called denormals) attempt to extend the precision of IEEE floats, but can cause 100× slowdowns in computation [Dooley and Kale, 2006]. Depending on the hardware, subnormals can also interfere with pipelining behavior and out-of-order execution, making a single subnormal computation highly detrimental to performance [Fog, 2016].

This is particularly problematic because pruning based on the triangle inequality works best on low-dimensional problems, and the normalization step then prevents us from realizing speedups in terms of total run-time. To circumvent this bottleneck, we develop a simple approach to create a dynamic priority queue that allows us to sample the next mean accurately without having to touch most of the samples in each iteration. We start with the elegant sampling-without-replacement strategy introduced by Efraimidis and Spirakis [2006]. Given n items 1, ..., n with weighted probabilities w_1, ..., w_n, it works by assigning each item i a priority λ_i·w_i⁻¹, where λ_i is sampled from the exponential distribution with rate 1 (i.e., λ_i ∼ Exponential(1)). To select K items without replacement, one selects the K items with the highest priority (the smallest λ_i·w_i⁻¹ values). This can normally be done with the quick-select algorithm in O(n) time.

For K-means++ seeding we instead use the priority λ_i·w_i⁻¹·α_i⁻² to produce the weighted random samples. The term w_i·α_i² acts as the weight for datum i being selected, and it is a combination of the original relative weight w_i of the datum and the squared distance α_i² to the nearest seed. At the start we sample λ_i ∼ Exponential(1) once. During each round, we update all α_i values and leave λ_i fixed. It is trivial to see that this does not alter the probability of any point being selected, conditioned on the points already removed: all λ_i are sampled independently, so the removal of any λ_i does not impact the relative weights of any other point. Thus, we can use the weighted sampling-without-replacement strategy of Efraimidis and Spirakis [2006] to select the seeds. We performed a sanity check by implementing this naive approach and making no other changes; it produced solutions of the same quality over many trials, with the same statistical mean and variance.

At first glance, this strategy provides no benefit, since the value of α_i changes on each iteration. Each α_i changing means that the relative ordering of all remaining priorities λ_i·w_i⁻¹·α_i⁻² also changes, which would require a full quick-select run on each iteration to discover the new maximum-priority item. However, we note that α_i can only decrease with each iteration, and thus the priority of any given sample either remains constant or decreases. Our first contribution is the realization that this property can be exploited to reduce the cost of sampling, so that only a subset of priorities needs to be considered to sample the next point. We can instead create a priority queue using a standard binary heap to select the next smallest value of λ_i·w_i⁻¹·α_i⁻², and maintain a marker for whether the priority of an item i has become dirty. An item is dirty if and only if it currently appears to have a higher priority than it actually should.

If there is a clean item z in the queue, then all items with a lower apparent priority than z must have a true priority that is still lower than z's. Thus, we need only fix the priorities of items that appear above z. See Figure 1 for an example of this queue for a dataset of n = 4 items. Item 2 is clean, and all items with a higher priority (3 and 4) are dirty. That means item 2 occupies the lowest-priority position that could still hold the next true sample, because it is possible the priority values of items 3 and 4 will become larger (read: lower priority) once the updated values of α_3 and α_4 are computed. Thus, we can remove all items from the queue until we reach item 2 and then re-insert them with their correct priorities. In this hypothetical example, item 4 still has the smallest priority value after updating, and so becomes the next mean when we then remove it from the queue. Item 1 appears after item 2 because it has a lower priority. Even though item 1 is dirty, we do not need to consider it, because its priority can only decrease once α_1 is updated; because item 2 is clean, its priority will not change, and there is no possibility of item 1 being selected.

*Figure 1: Example of the priority re-queueing strategy for n = 4 items, showing each entry's priority λ_i·w_i⁻¹·α_i⁻² and dirty flag before and after re-prioritization. Initially, it is not clear whether item 2, 3, or 4 is the next to sample. All dirty items are removed from the queue until we reach a clean item and are then re-inserted after fixing their priorities. We do not need to consider any item after the first clean item.*
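The queue mechanics above can be sketched with a standard binary heap plus a dirty flag per item. The class below is an illustrative Python sketch under our own naming (LazySeedQueue, update_alpha, pop_next_seed), not the JSAT implementation: keys are λ_i/(w_i·α_i²), a shrinking α_i only marks the entry dirty, and a pop re-keys just the dirty prefix above the first clean entry.

```python
import heapq
import numpy as np

class LazySeedQueue:
    """Dirty-flag priority queue sketch for K-means++ seed sampling.

    Item i has a fixed lambda_i ~ Exponential(1) and true key
    lambda_i / (w_i * alpha_i^2); the smallest key is the next seed.
    Since alpha_i only decreases, a stale stored key is always a lower
    bound on the true key, so anything below the first clean entry can
    never overtake it and does not need to be re-keyed.
    """

    def __init__(self, w, rng=None):
        rng = np.random.default_rng(rng)
        self.w = np.asarray(w, dtype=float)
        n = len(self.w)
        self.lam = rng.exponential(1.0, size=n)
        self.alpha = np.full(n, np.inf)
        self.dirty = np.zeros(n, dtype=bool)
        # Before any seed exists the key is lambda_i / w_i (Algorithm 2, line 2).
        self.heap = [(self.lam[i] / self.w[i], i) for i in range(n)]
        heapq.heapify(self.heap)

    def _key(self, i):
        return self.lam[i] / (self.w[i] * self.alpha[i] ** 2)

    def update_alpha(self, i, new_alpha):
        """Record a smaller d(x_i, nearest seed); the stored key is now stale."""
        if new_alpha < self.alpha[i]:
            self.alpha[i] = new_alpha
            self.dirty[i] = True

    def pop_next_seed(self):
        """Pop the index of the next seed, re-keying only the dirty prefix."""
        stale = []
        while self.heap and self.dirty[self.heap[0][1]]:
            stale.append(heapq.heappop(self.heap)[1])
        for i in stale:
            heapq.heappush(self.heap, (self._key(i), i))
            self.dirty[i] = False
        return heapq.heappop(self.heap)[1]
```

In the experiments of Section 5.1, this lazy re-keying typically touches well under 1% of the remaining items per iteration.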
### 3.3 Accelerated K-Means++

The final algorithm that performs the accelerated computation is given in Algorithm 2.

    Algorithm 2: Our Accelerated K-Means++
    Require: desired number of seeds K, data points x_1, ..., x_n, data weights w_1, ..., w_n with w_i ≥ 0
     1: λ_i ← Exponential(1), ∀ i ∈ [1, n]
     2: Priority queue Q with each index i given priority λ_i / w_i
     3: dirty_i ← False, ∀ i ∈ [1, n]
     4: m_1 ← x_{Q.Pop()}
     5: α ← ∞, k ← 1, φ_i ← 0
     6: for k ∈ [1, K) do                            ▷ For each new center k
     7:     for j ∈ [1, k) do                        ▷ Get distances to previous centers
     8:         γ_j ← d(m_k, m_j)
     9:     for i ∈ [1, n] do
    10:         if ½·γ_{φ_i} ≥ α_i then
    11:             continue                         ▷ Pruned by Lemma 1
    12:         if d(m_k, x_i) < α_i then
    13:             α_i ← d(m_k, x_i), φ_i ← k
    14:             dirty_i ← True                   ▷ Priority may now be too high
    15:     Create new stack S
    16:     while dirty_{Q.Peek()} do                ▷ All items that could be selected
    17:         i ← Q.Pop()
    18:         S.Push(i)
    19:     for i ∈ S do                             ▷ Update true priority
    20:         Q.Push(i, λ_i / (w_i · α_i²))
    21:         dirty_i ← False
    22:     m_{k+1} ← x_{Q.Pop()}                    ▷ Select new mean by clean top priority
    23: return initial means m_1, ..., m_K

Lines 7–11 take care to avoid redundant distance computations, and lines 14–21 ensure that the dynamic priority queue allows us to select the next mean without considering all n − k remaining candidates. Combined, we regularly obtain reductions both in total time taken and in the number of distance computations required. Through the use of our dynamic priority queue, we find that we regularly consider less than 1% of the remaining n − k items. This is important when we work with low-dimensional datasets. When the dimension is very small (e.g., d = 2 for longitude/latitude data, a common use case), there is little computational cost in the distance computations themselves, and so much of the run-time bottleneck sits in the sampling process. Our dynamic queue avoids this bottleneck, allowing us to realize the benefits of the reduced distance computations.
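Putting the two pieces together, the sketch below mirrors Algorithm 2 in illustrative Python (again, not the JSAT implementation; accelerated_kmeanspp is our own name). It reuses the LazySeedQueue sketch defined above and inlines the Lemma 1 prune from Section 3.1.

```python
import numpy as np

def accelerated_kmeanspp(X, K, w=None, rng=None):
    """Sketch of Algorithm 2: Lemma 1 pruning plus the lazy priority queue.

    Reuses the LazySeedQueue sketch defined above.  Returns seed indices.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    w = np.ones(n) if w is None else np.asarray(w, dtype=float)

    q = LazySeedQueue(w, rng=rng)
    seeds = [q.pop_next_seed()]           # first seed sampled with priority lambda_i / w_i
    alpha = np.full(n, np.inf)
    phi = np.zeros(n, dtype=int)          # index (into seeds) of each point's nearest seed

    for _ in range(1, K):
        m_k = X[seeds[-1]]
        k = len(seeds) - 1
        # gamma[j] = d(m_k, m_j) for the previously chosen seeds.
        gamma = np.linalg.norm(X[seeds[:-1]] - m_k, axis=1)
        for i in range(n):
            if k > 0 and gamma[phi[i]] >= 2.0 * alpha[i]:
                continue                   # Lemma 1: m_k cannot be closer than the current seed
            d_ik = np.linalg.norm(X[i] - m_k)
            if d_ik < alpha[i]:
                alpha[i], phi[i] = d_ik, k
                q.update_alpha(i, d_ik)    # marks the queue entry dirty
        seeds.append(q.pop_next_seed())    # only the dirty prefix is re-keyed
    return seeds
```

The key property the queue relies on is exactly the one argued in the text: a point's priority can only get worse as α_i shrinks, so the sampling remains the Efraimidis and Spirakis scheme and the selected seeds follow the same distribution as Algorithm 1.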
## 4 Accelerating K-Means‖

Now we turn our attention to the K-means‖ algorithm, detailed in Algorithm 3. While K-means‖ requires more distance computations, it is preferred in distributed environments because it requires less communication, which is a significant bottleneck for K-means++ [Bahmani et al., 2012]. It works by reducing the K rounds of communication to a fixed number of R ≪ K rounds, yet still obtains the log-optimal results of K-means++ [Bachem et al., 2017]. In each round, new means are sampled based on the unnormalized probability ℓ·w_i·α_i², yielding ℓ new means in expectation. With the standard defaults of R = 5 and ℓ = 2K, we end up with an expected R·ℓ = 2RK > K total means. These R·ℓ potential means are weighted by the number of points that they are closest to and then passed to the K-means++ algorithm to reduce them to a final set of K means, which produces the final result. Note that this last step requires O(K²) distance computations when naively using Algorithm 1, making it necessary to accelerate the K-means++ algorithm in order to effectively accelerate K-means‖ for datasets with large K.

    Algorithm 3: K-Means‖
    Require: desired number of seeds K, data points x_1, ..., x_n, data weights w_1, ..., w_n, rounds R, oversampling factor ℓ
     1: β_i ← w_i / Σ_{j=1}^{n} w_j,  ∀ i ∈ [1, n]
     2: c_1 ← x_i, where i is selected with probability β_i
     3: k ← 1, k_prev ← 0, α ← ∞
     4: for r ∈ [1, R] do
     5:     for i ∈ [1, n] do
     6:         for j ∈ (k_prev, k] do
     7:             α_i ← min(α_i, d(c_j, x_i))
     8:     k_prev ← k
     9:     Z ← Σ_{i=1}^{n} w_i · α_i²
    10:     for i ∈ [1, n] do
    11:         if p ∼ Ber(min(1, ℓ · w_i · α_i² / Z)) is true then
    12:             k ← k + 1, c_k ← x_i, α_i ← 0
    13: Let w'_i ← Σ_{j=1}^{n} w_j · 1[d(c_i, x_j) = α_j]    ▷ Weight set to number of points closest to center c_i
    14: return K-Means++(K, c_1, ..., c_k, w'_1, ..., w'_k)   ▷ Run Algorithm 1

Since K < R·ℓ ≪ n, the final step of running K-means++ is not overbearing to run on a single compute node, and the sampling procedure is no longer a bottleneck that requires subversion. In a distributed setting, the new means selected in each round are broadcast to all worker nodes, which is possible because ℓ ≪ n and thus requires limited communication overhead. However, the ability to use the triangle inequality becomes less obvious. Using the same approach as before, similar to Elkan [2003], would require O(K²) pairwise distance computations between the new and old means, plus more book-keeping overhead that would reduce the effectiveness of avoiding distance computations. Another strategy is to use a structure like the Cover-tree, which accelerates nearest-neighbor searches and supports the removal of data points from the index [Beygelzimer et al., 2006; Izbicki and Shelton, 2015]; we could then perform an all-points nearest-neighbor search [Curtin et al., 2013]. However, we are unaware of any approach that has produced a distributed cover-tree algorithm that would not run into the same communication overheads that prevent the standard K-means++ from working in this scenario. As such, it does not appear to be a worthwhile strategy.
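For reference, here is a compact single-node sketch of the baseline oversampling rounds of Algorithm 3, before the acceleration developed next. This is illustrative Python under our own naming (kmeans_parallel_oversample), not the authors' JSAT code; a distributed implementation would compute α and the Bernoulli draws on each worker and only exchange the sampled candidates.

```python
import numpy as np

def kmeans_parallel_oversample(X, K, R=5, ell=None, w=None, rng=None):
    """Single-node sketch of the K-means|| oversampling phase (Algorithm 3).

    Returns candidate center indices and their weights w', ready to be
    reduced to K seeds by (accelerated) K-means++.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    w = np.ones(n) if w is None else np.asarray(w, dtype=float)
    ell = 2 * K if ell is None else ell

    centers = [rng.choice(n, p=w / w.sum())]   # c_1 sampled proportional to w
    alpha = np.full(n, np.inf)
    k_prev = 0
    for _ in range(R):
        # Update alpha against only the centers added in the previous round.
        for c in centers[k_prev:]:
            alpha = np.minimum(alpha, np.linalg.norm(X - X[c], axis=1))
        k_prev = len(centers)
        Z = np.sum(w * alpha**2)
        # Each point joins the candidate set independently with
        # probability min(1, ell * w_i * alpha_i^2 / Z).
        p = np.minimum(1.0, ell * w * alpha**2 / Z)
        for i in np.nonzero(rng.random(n) < p)[0]:
            centers.append(int(i))
            alpha[i] = 0.0
    # Weight each candidate by the total weight of the points closest to it
    # (naive O(n * |centers|) pass, matching line 13 of Algorithm 3).
    C = X[centers]
    nearest = np.argmin(np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2), axis=1)
    w_prime = np.bincount(nearest, weights=w, minlength=len(centers))
    return centers, w_prime
```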
### 4.1 Nearest In Range Queries

Another approach is to fit an index structure C to only the ℓ new points and, for each non-mean x_i, find its nearest potential new assignment by querying C. Since ℓ is O(K), this is normally too small a dataset for pruning to be effective with current methods. To remedy this, we note that we have additional information available to the search: the value α_i, the distance of point x_i to its closest current mean. We therefore introduce a Nearest In Range search that returns the nearest neighbor to a query point q within an index C only if it lies within a radius r of the query. Since most points x_i will not change ownership in a given iteration, a Nearest In Range search can often prune out the entire search tree, and its effectiveness increases even when K is small.

To do this, we use the Vantage Point tree (VP-tree) [Yianilos, 1993], because it is fast to construct, has low overhead that makes it competitive with other structures such as KD-trees and Cover-trees [Raff and Nicholas, 2018], and is simple to augment with our new Nearest In Range search. The pseudo-code for the standard VP-tree search is given in Algorithm 4, where GetChild, Search, and Best are auxiliary functions used by the Nearest function to implement a standard nearest-neighbor search. Each VP-tree node has a left and right child and uses a value τ to keep track of the distance to the nearest neighbor found so far. It also maintains two pairs of bounds: near_low and near_high give the shortest and farthest distance to the points in the left child, and far_low and far_high do the same for the right child. A standard nearest-neighbor search calls the Nearest function with τ = ∞, and the bound is updated as the search progresses whenever it fails to prune a branch.

    Algorithm 4: Nearest Neighbor Search in a VP-Tree
     1: function GetChild(low)
     2:     if low = true then
     3:         return left child
     4:     return right child                       ▷ Else, return other
     5: function Search(r, τ, low)
     6:     if low = true then
     7:         a ← near_low, b ← near_high
     8:     else
     9:         a ← far_low, b ← far_high
    10:     return a − τ < r < b + τ                 ▷ i.e., is this True or False?
    11: function Best(τ, τ′, ID, ID′)
    12:     if τ < τ′ then
    13:         return τ, ID
    14:     return τ′, ID′                           ▷ Else, return other
    15: function Nearest(q, τ, ID)
    16:     r ← d(p, q)
    17:     τ, ID ← Best(τ, r, ID, p)
    18:     m ← (near_high + far_low) / 2
    19:     l_f ← r < m                              ▷ True/False, search near/left child first?
    20:     if Search(r, τ, l_f) then
    21:         τ′, ID′ ← GetChild(l_f).Nearest(q, τ, ID)
    22:         τ, ID ← Best(τ, τ′, ID, ID′)
    23:     if Search(r, τ, ¬l_f) then
    24:         τ′, ID′ ← GetChild(¬l_f).Nearest(q, τ, ID)
    25:         τ, ID ← Best(τ, τ′, ID, ID′)
    26:     return τ, ID
    27: function NearestInRange(q, maxRange)
    28:     return Nearest(q, maxRange, −1)          ▷ This simple function, used in place of Nearest, is our contribution.

Our contribution is simple. The Nearest In Range function calls Nearest with τ set to the query's range limit maxRange (α_i in our use), rather than ∞. It is easy to verify that this can only monotonically improve the pruning rate of each search. Since τ bounds the distance to the nearest neighbor, and the α values give us an upper bound on that distance, the modification remains correct. The rest of the algorithm remains unaltered, and simply terminates the search faster thanks to a meaningful initial bound.
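To make the modification concrete, below is a small, self-contained VP-tree sketch in Python. It is our own simplified variant, not JSAT's: it stores one point per node with a single split radius rather than the four bounds of Algorithm 4, and all names (VPTree, nearest_in_range) are ours. The only change needed for Nearest In Range is visible in the last method: the recursion starts with τ = max_range and returns −1 when nothing in the index beats that bound.

```python
import numpy as np

class VPTree:
    """Minimal vantage-point tree with a Nearest-In-Range query (illustrative sketch)."""

    def __init__(self, points, ids=None):
        points = np.asarray(points, dtype=float)
        ids = np.arange(len(points)) if ids is None else np.asarray(ids)
        self.p, self.p_id = points[0], int(ids[0])   # vantage point of this node
        self.mu = 0.0                                # split radius
        self.left = self.right = None
        if len(points) > 1:
            d = np.linalg.norm(points[1:] - self.p, axis=1)
            self.mu = float(np.median(d))
            near, far = d <= self.mu, d > self.mu
            if near.any():
                self.left = VPTree(points[1:][near], ids[1:][near])
            if far.any():
                self.right = VPTree(points[1:][far], ids[1:][far])

    def _nearest(self, q, tau, best_id):
        r = float(np.linalg.norm(q - self.p))
        if r < tau:
            tau, best_id = r, self.p_id
        # Visit the child on q's side of the split radius first; visit the other
        # side only if the ball of radius tau around q crosses the boundary.
        order = [(self.left, True), (self.right, False)]
        if r > self.mu:
            order.reverse()
        for child, is_near in order:
            if child is None:
                continue
            crosses = (r - tau <= self.mu) if is_near else (r + tau >= self.mu)
            if crosses:
                tau, best_id = child._nearest(q, tau, best_id)
        return tau, best_id

    def nearest_in_range(self, q, max_range):
        """Nearest neighbor within max_range of q, or (max_range, -1) if none is closer."""
        return self._nearest(np.asarray(q, dtype=float), float(max_range), -1)
```

In the accelerated K-means‖ below, each point x_i would issue something like C.nearest_in_range(x_i, α_i); a returned id of −1 means none of the new candidates can beat x_i's current closest center, so α_i is left unchanged.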
Thus, to build an accelerated K-means‖, in each round we build an index C over the newly selected means and compare points against it with our Nearest In Range search, as detailed in Algorithm 5.

    Algorithm 5: Our Accelerated K-Means‖
    Require: desired number of seeds K, data points x_1, ..., x_n, data weights w_1, ..., w_n, rounds R, oversampling factor ℓ
     1: β_i ← w_i / Σ_{j=1}^{n} w_j,  ∀ i ∈ [1, n]
     2: c_1 ← x_i, where i is selected with probability β_i
     3: α ← ∞, k_prev ← 0, k ← 1
     4: for r ∈ [1, R] do
     5:     C ← new index built from {c_{k_prev}, ..., c_k}
     6:     for i ∈ [1, n] do
     7:         j ← C.NearestInRange(x_i, α_i)
     8:         if j ≥ 0 then
     9:             α_i ← d(c_j, x_i)
    10:     k_prev ← k, Z ← Σ_{i=1}^{n} w_i · α_i²
    11:     for i ∈ [1, n] do
    12:         if p ∼ Ber(min(1, ℓ · w_i · α_i² / Z)) is true then
    13:             k ← k + 1, c_k ← x_i, α_i ← 0
    14: Let w'_i ← Σ_{j=1}^{n} w_j · 1[d(c_i, x_j) = α_j]    ▷ Weight set to number of points closest to center c_i
    15: return K-Means++(K, c_1, ..., c_k, w'_1, ..., w'_k)   ▷ Run Algorithm 2

For the first iteration, the loop on lines 6–10 runs quickly with only c_1 present to determine the initial distribution, and on every subsequent round we have a meaningful value of α_i that can be used to accelerate the search. If none of the ℓ new candidates c_{k_prev}, ..., c_k are within a distance of α_i of a point x_i, then the Nearest In Range function returns a negative index and the point can be skipped. In addition, we use our accelerated K-means++ (Algorithm 2) in the final step rather than the standard algorithm. This allows us to accelerate all parts of the K-means‖ method while keeping the simplicity and low communication cost of the original design. The Vantage Point tree is a small index, since it is built on only ℓ points, and it can be sent to every worker node in a cluster in the same manner as the candidate means themselves.

## 5 Experimental Results

Now that we have detailed the methods by which we accelerate the K-means++ and K-means‖ algorithms, we evaluate their effectiveness. The two measures we are concerned with are: 1) the reduction in the total number of distance computations, and 2) the total run-time. Measuring distance computations gives us an upper bound on the potential effectiveness of our algorithms and allows us to compare approaches in an implementation- and hardware-independent manner. Measuring run-time speaks to the ultimate goal, which is to reduce the time it takes to obtain K seeds; however, it is sensitive to the hardware in use, the language the approach is implemented in, and the relative skills of the program authors. For this work we used the JSAT library [Raff, 2017]. The K-means++ algorithm was provided by this framework, and we implemented K-means‖ and the accelerated versions of both algorithms using JSAT, so all run-time comparisons between the K-means++ and K-means‖ algorithms presented here are directly comparable. Our implementations have been contributed to the JSAT library for public use.

Prior works that have investigated alternatives to K-means++ have generally explored only a few datasets with D < 100 features and fewer than 4 values of K, sometimes testing only one value of K per dataset [Bachem et al., 2016b]. For example, while MNIST is regularly tested in seed selection, it is usually projected down to 50 dimensions first [Hamerly, 2010] because it is difficult to accelerate. Since our goal is to produce accelerated versions of these algorithms that are uniformly better, we test over a wide selection of reasonable scenarios. In Table 1 we show the 11 datasets we use, with D ∈ [3, 780] and n covering four orders of magnitude. We test K ∈ [32, 4096] at every power of two, so that we may understand the behavior as K changes and make sure we produce an improvement even when K is small. To the best of our knowledge, this covers a larger number of datasets and a wider range of values of K and D than tested in prior work.¹

Unless stated otherwise, all experiments were run on a single CPU core of an iMac with a 3.5 GHz Intel i5 CPU and 64 GB of RAM.

¹ We are aware of no prior work in this space that has considered D > 1024, where pruning methods are unlikely to succeed due to the curse of dimensionality. We consider this reasonable and beyond scope, as such scenarios are usually sparse and best handled by topic models like LDA.
| Dataset | n | D |
|---|---|---|
| Phishing | 11,055 | 68 |
| cod-rna | 59,535 | 8 |
| MNIST | 60,000 | 780 |
| aloi | 108,000 | 128 |
| Range-Queries | 200,000 | 8 |
| Skin/No Skin | 245,057 | 3 |
| covertype | 581,012 | 54 |
| SUSY | 5,000,000 | 18 |
| Activity Rec. | 33,741,500 | 5 |
| HIGGS | 11,000,000 | 28 |
| Web | 45,811,883 | 5 |

*Table 1: Datasets used, ordered by number of samples n; the rightmost column gives the number of features D.*

The phishing dataset is only tested up to K = 2048, because at K = 4096 we would be selecting over 1/4 of the dataset as means, at which point the purpose of K-means++-style seeding is defeated by selecting too large a portion of the corpus. All results are averaged over 5 runs, and took four months to complete in our compute environment.

### 5.1 K-Means++ Results

We start with the K-means++ results, with the reduction in distance computations shown in Figure 2. In the worst case, K = 32 on the MNIST dataset, we still perform 98% of the distance computations of the standard algorithm, but this improves to only 63% by K = 4096. The best case is observed on the Web dataset, starting at only 15% of the distance computations at K = 32 and only 0.1% by K = 4096, a 739× improvement.

*Figure 2: Factor reduction in distance computations for our accelerated K-means++ algorithm compared to the original, plotted against K ∈ [2⁵, 2¹²] for all 11 datasets.*

Across all the datasets, we see that the factor reduction in distance computations is a monotonic improvement for K-means++. We never see a case where our accelerated approach performs more distance computations than the naive approach. This confirms our decision to spend the extra k − 1 distance computations between the newest mean m_k and the previous means m_1, ..., m_{k−1}.

As we noted in the design of our accelerated variant, we must avoid over-emphasizing the reduction in distance computations alone, as re-normalizing the distribution to sample the next mean is itself a non-trivial cost. This is especially true when we reduce the distance computations by 16× for several of our datasets. The run-time reductions are presented in Figure 3. In all cases, our accelerated version of K-means++ is faster than the standard algorithm. As expected, MNIST has the lowest speedup, in line with the number of distance computations avoided: at K = 32 we achieved only a 3.4% reduction in time, but were 1.5× faster by K = 4096.

*Figure 3: Run-time speedup for our accelerated K-means++ algorithm compared to the standard algorithm, plotted against K ∈ [2⁵, 2¹²] for all 11 datasets.*

**Dynamic Priority Impact.** Since the normalization step is non-trivial, especially when D is small, the actual speedup in run-time is not as strongly correlated with the dimension D. The Covertype dataset (D = 54) had the 4th-largest reduction in distance computations, but the largest reduction in run-time, with a 17× improvement at K = 4096. Our ability to still obtain real speedups on these datasets comes from our dynamic priority queue, which allows us to consider only a small subset of the dataset while still accurately selecting the next weighted random mean. This can be seen in Figure 4, where a subset of the datasets is shown with the fraction of the corpus examined on the y-axis.
As the datasets get larger, our dynamic queue generally becomes more effective, reducing the number of points that need to be checked to less than 1%. To confirm that our dynamic priority queue's gains are meaningful, we perform an ablation of Algorithm 2 in which the dynamic priority queue on lines 15–21 is replaced with the standard sampling code from Algorithm 1. We run both versions and record the speedup when our dynamic queue is used in Table 2 for K = 4096. With the exception of the cod-rna dataset, where there is a < 2% slowdown (on the fastest dataset to run), our approach gives a 5%–231% speedup in all other cases, with a median improvement of 20%. We also note that for all K < 4096 we still observe benefits from our queue, though the variance in the degree of speedup does increase. We did not observe any performance regressions larger than 3% in extended testing.

*Figure 4: The fraction of the remaining candidates that need to be examined (y-axis, log scale) to select the k-th mean (x-axis, linear scale) using our dynamic priority queue, shown for a subset of the datasets.*

| Dataset | Speedup |
|---|---|
| cod-rna | 0.983 |
| Phishing | 2.313 |
| MNIST | 1.059 |
| aloi | 1.059 |
| Range-Queries | 1.342 |
| Skin/No Skin | 1.217 |
| covtype | 1.877 |
| SUSY | 1.259 |
| Activity Recognition | 1.207 |
| HIGGS | 1.279 |
| Web | 1.725 |

*Table 2: Ablation testing of the speedup from using our new dynamic priority queue to perform seed selection at every iteration. Values greater than 1 indicate faster results with the dynamic queue; our pruning from Algorithm 2 is used both with and without the dynamic queue.*

### 5.2 K-Means‖ Results

In Figure 5 we show the factor reduction in distance computations, which mirrors the overall trends of Figure 2. The results improve by an additional 2–4×, with a 579× reduction in distance computations on the Activity Recognition dataset. The MNIST dataset again shows the least improvement, but still obtains a more significant 88% reduction in distance computations at K = 32.

*Figure 5: Factor reduction in distance computations needed to perform K-means‖ seed selection, plotted against K ∈ [2⁵, 2¹²]. Larger is better.*

The up-to-4× additional improvement in distance computations also carries over to the total run-time, as shown in Figure 6. We observe more consistent behavior because the cost of normalizing and sampling the new means is reduced to only R = 5 rounds of sampling. Where our accelerated K-means++ saw its relative improvement drop significantly for small D < 10 datasets due to this overhead, our accelerated K-means‖ algorithm sees the ordering remain relatively stable. For example, the Activity Recognition dataset enjoys the greatest reduction in both distance computations and run-time, and its 579× reduction in distance computations closely matches its 551× reduction in run-time. The HIGGS dataset has the lowest improvement in run-time, with a 1.02× speedup at K = 32 and 2.9× at K = 4096.

*Figure 6: Run-time speedup for our accelerated K-means‖ algorithm compared to the standard algorithm, plotted against K ∈ [2⁵, 2¹²]. Larger is better.*
We also note that the Nearest In Range query provided an additional 1.5–4× speedup in most cases, but this was highly dependent on the dataset and the value of K.

## 6 Conclusion

Leveraging simple modifications and a novel priority queue, we show the first method that delivers equal or better performance both in theory (fewer distance computations) and in practice (less run-time). None of our changes impact the function of K-means++ or K-means‖, allowing us to retain their existing properties.

## Acknowledgements

I would like to thank Ashley Klein, Drew Farris, and Frank Ferraro for reviewing early drafts and for their valuable feedback.

## References

- [Arthur and Vassilvitskii, 2007] David Arthur and Sergei Vassilvitskii. k-means++: The Advantages of Careful Seeding. In ACM-SIAM Symposium on Discrete Algorithms, volume 8, pages 1027–1035, 2007.
- [Bachem et al., 2015] Olivier Bachem, Mario Lucic, and Andreas Krause. Coresets for Nonparametric Estimation: the Case of DP-Means. In ICML, volume 37, pages 209–217, 2015.
- [Bachem et al., 2016a] Olivier Bachem, Mario Lucic, Hamed Hassani, and Andreas Krause. Fast and Provably Good Seedings for k-Means. In NeurIPS, pages 55–63, 2016.
- [Bachem et al., 2016b] Olivier Bachem, Mario Lucic, S. Hamed Hassani, and Andreas Krause. Approximate K-Means++ in Sublinear Time. In AAAI, pages 1459–1467, 2016.
- [Bachem et al., 2017] Olivier Bachem, Mario Lucic, and Andreas Krause. Distributed and Provably Good Seedings for k-Means in Constant Rounds. In ICML, volume 70, pages 292–300, 2017.
- [Bahmani et al., 2012] Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. Scalable K-means++. Proceedings of the VLDB Endowment, 5(7):622–633, 2012.
- [Beygelzimer et al., 2006] Alina Beygelzimer, S. Kakade, and John Langford. Cover trees for nearest neighbor. In ICML, pages 97–104, 2006.
- [Bottou and Bengio, 1995] Léon Bottou and Yoshua Bengio. Convergence Properties of the K-Means Algorithms. In NeurIPS, pages 585–592, 1995.
- [Curtin et al., 2013] Ryan R. Curtin, William B. March, Parikshit Ram, David V. Anderson, Alexander G. Gray, and Charles L. Isbell. Tree-Independent Dual-Tree Algorithms. In ICML, volume 28, pages 1435–1443, 2013.
- [Curtin, 2017] Ryan R. Curtin. A Dual-Tree Algorithm for Fast k-means Clustering With Large k. In SDM, pages 300–308, 2017.
- [Ding et al., 2015] Yufei Ding, Yue Zhao, Xipeng Shen, Madanlal Musuvathi, and Todd Mytkowicz. Yinyang K-means: A Drop-in Replacement of the Classic K-means with Consistent Speedup. In ICML, pages 579–587, 2015.
- [Dooley and Kale, 2006] Isaac Dooley and Laxmikant Kale. Quantifying the Interference Caused by Subnormal Floating-Point Values. In Proceedings of the Workshop on Operating System Interference in High Performance Applications, 2006.
- [Efraimidis and Spirakis, 2006] Pavlos S. Efraimidis and Paul G. Spirakis. Weighted random sampling with a reservoir. Information Processing Letters, 97(5):181–185, 2006.
- [Elkan, 2003] Charles Elkan. Using the Triangle Inequality to Accelerate k-Means. In ICML, pages 147–153, 2003.
- [Fog, 2016] Agner Fog. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs. 2016.
- [Hamerly, 2010] Greg Hamerly. Making k-means even faster. In SDM, pages 130–140, 2010.
- [Izbicki and Shelton, 2015] Mike Izbicki and Christian R. Shelton. Faster Cover Trees. In ICML, volume 37, 2015.
- [Jegelka et al., 2009] Stefanie Jegelka, Suvrit Sra, and Arindam Banerjee.
  Approximation Algorithms for Tensor Clustering. In ALT, pages 368–383, 2009.
- [Newling and Fleuret, 2016] James Newling and François Fleuret. Fast k-means with accurate bounds. In ICML, pages 936–944, 2016.
- [Newling and Fleuret, 2017] James Newling and François Fleuret. K-Medoids For K-Means Seeding. In NeurIPS, pages 5195–5203, 2017.
- [Nielsen and Nock, 2015] Frank Nielsen and Richard Nock. Total Jensen divergences: Definition, properties and clustering. In ICASSP, pages 2016–2020, 2015.
- [Nock et al., 2008] Richard Nock, Panu Luosto, and Jyrki Kivinen. Mixed Bregman Clustering with Approximation Guarantees. In ECML PKDD, pages 154–169, 2008.
- [Pelleg and Moore, 1999] Dan Pelleg and Andrew Moore. Accelerating Exact K-means Algorithms with Geometric Reasoning. In KDD, pages 277–281, 1999.
- [Phillips, 2002] Steven J. Phillips. Acceleration of K-Means and Related Clustering Algorithms. In Proc. 4th Int. Workshop on Algorithm Engineering and Experiments (ALENEX 2002), pages 166–177, 2002.
- [Raff and Nicholas, 2018] Edward Raff and Charles Nicholas. Toward Metric Indexes for Incremental Insertion and Querying. arXiv, 2018.
- [Raff et al., 2020] Edward Raff, Bobby Filar, and James Holt. Getting Passive Aggressive About False Positives: Patching Deployed Malware Detectors. In 2020 International Conference on Data Mining Workshops (ICDMW), pages 506–515. IEEE, 2020.
- [Raff, 2017] Edward Raff. JSAT: Java Statistical Analysis Tool, a Library for Machine Learning. JMLR, 18(23):1–5, 2017.
- [Ryšavý and Hamerly, 2016] Petr Ryšavý and Greg Hamerly. Geometric methods to accelerate k-means algorithms. In SDM, pages 324–332, 2016.
- [Sculley, 2010] D. Sculley. Web-scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web, pages 1177–1178, 2010.
- [Si et al., 2017] Si Si, Cho-Jui Hsieh, and Inderjit S. Dhillon. Memory Efficient Kernel Approximation. JMLR, 18(20):1–32, 2017.
- [Yianilos, 1993] P. N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In ACM-SIAM Symposium on Discrete Algorithms, pages 311–321, 1993.