# Multi-Swap k-Means++

Lorenzo Beretta (University of Copenhagen, lorenzo2beretta@gmail.com), Vincent Cohen-Addad (Google Research, cohenaddad@google.com), Silvio Lattanzi (Google Research, silviol@google.com), Nikos Parotsidis (Google Research, nikosp@google.com). Authors are ordered alphabetically.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Abstract. The k-means++ algorithm of Arthur and Vassilvitskii (SODA 2007) is often the practitioners' choice for optimizing the popular k-means clustering objective and is known to give an O(log k)-approximation in expectation. To obtain higher quality solutions, Lattanzi and Sohler (ICML 2019) proposed augmenting k-means++ with O(k log log k) local search steps obtained through the k-means++ sampling distribution to yield a c-approximation to the k-means clustering problem, where c is a large absolute constant. Here we generalize and extend their local search algorithm by considering larger and more sophisticated local search neighborhoods, hence allowing multiple centers to be swapped at the same time. Our algorithm achieves a 9 + ε approximation ratio, which is the best possible for local search. Importantly, we show that our approach yields substantial practical improvements: it produces solutions of significantly better quality than the approach of Lattanzi and Sohler (ICML 2019) on several datasets.

1 Introduction

Clustering is a central problem in unsupervised learning. In clustering, one is interested in grouping together similar objects and separating dissimilar ones. Thanks to its popularity, many notions of clustering have been proposed over time. In this paper, we focus on metric clustering and on one of the most studied problems in the area: the Euclidean k-means problem. In the Euclidean k-means problem one is given as input a set of points $P$ in $\mathbb{R}^d$. The goal is to find a set of $k$ centers so that the sum of the squared distances to the centers is minimized. More formally, we are interested in finding a set $C$ of $k$ points in $\mathbb{R}^d$ minimizing $\sum_{p \in P} \min_{c \in C} \|p - c\|^2$, where $\|p - c\|$ denotes the Euclidean distance between $p$ and $c$.

The k-means problem has a long history in statistics and operations research. For Euclidean k-means with running time polynomial in n, k and d, a 5.912-approximation was recently shown in Cohen-Addad et al. [2022a], improving upon Kanungo et al. [2004], Ahmadian et al. [2019], Grandoni et al. [2022] by leveraging the properties of the Euclidean metric. In terms of lower bounds, Guruswami and Indyk [2003] were the first to show that the high-dimensional k-means problem is APX-hard, and later Awasthi et al. [2015] showed that the APX-hardness holds even if the centers can be placed arbitrarily in $\mathbb{R}^d$. The inapproximability bound was later slightly improved by Lee et al. [2017], until the recent best known bounds of Cohen-Addad and Karthik C. S. [2019], Cohen-Addad et al. [2022d, 2021], which show that it is NP-hard to achieve a better than 1.06-approximation, and hard to approximate better than 1.36 assuming a stronger conjecture. From a more practical point of view, Arthur and Vassilvitskii [2009] showed that the widely-used heuristic of Lloyd [1957] can lead to solutions with arbitrarily bad approximation guarantees, but that it can be improved by a simple seeding strategy, called k-means++, so as to guarantee that the output is within an O(log k) factor of the optimum Arthur and Vassilvitskii [2007]. Thanks to its simplicity, k-means++ is widely adopted in practice.
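Since the D² (k-means++) seeding distribution is the engine behind every algorithm discussed in this paper, the following minimal Python sketch of the seeding step and of the k-means cost may help fix ideas. It is an illustrative re-implementation under our own naming (`kmeans_cost`, `kmeanspp_seeding`), not the authors' code.

```python
import numpy as np

def kmeans_cost(P, C):
    """k-means objective: sum over points of the squared distance to the closest center."""
    d2 = ((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # (n, k) squared distances
    return d2.min(axis=1).sum()

def kmeanspp_seeding(P, k, rng=None):
    """k-means++: first center uniform, then each new center is D^2-sampled,
    i.e., with probability proportional to its current cost."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = P.shape[0]
    centers = [P[rng.integers(n)]]
    d2 = ((P - centers[0]) ** 2).sum(axis=1)                 # current cost of each point
    for _ in range(k - 1):
        c = P[rng.choice(n, p=d2 / d2.sum())]                # D^2-sample the next center
        centers.append(c)
        d2 = np.minimum(d2, ((P - c) ** 2).sum(axis=1))      # update costs
    return np.array(centers)
```

The local search algorithms discussed below start from such a seeding and keep D²-sampling candidate centers from the same distribution.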
In an effort to improve its performance, Lattanzi and Sohler [2019], Choo et al. [2020] combine k-means++ and local search to efficiently obtain a constant-approximation algorithm with good practical performance. These two studies show that one can use the k-means++ distribution in combination with a local search algorithm to get the best of both worlds: a practical algorithm with constant approximation guarantees. However, the constant obtained in Lattanzi and Sohler [2019], Choo et al. [2020] is very large (several thousands in theory), and the question as to whether one could obtain a practical algorithm that efficiently matches the $(9+\varepsilon)$-approximation obtained by the $n^{O(d/\varepsilon)}$ algorithm of Kanungo et al. [2004] has remained open. Bridging the gap between the theoretical approach of Kanungo et al. [2004] and k-means++ has thus been a long-standing goal.

Our Contributions. We make significant progress on the above line of work. We adapt techniques from the analysis of Kanungo et al. [2004] to obtain a tighter analysis of the algorithm in Lattanzi and Sohler [2019]. In particular, in Corollary 4 we show that their algorithm achieves an approximation ratio of 26.64. We extend this approach to multi-swaps, where we allow swapping more than one center at each iteration of local search, improving the approximation significantly to 10.48 in time $O(nd \cdot \mathrm{poly}(k))$. Leveraging ideas from Cohen-Addad et al. [2021], we design a better local search swap that improves the approximation further to $9 + \varepsilon$ (see Theorem 12). This new algorithm matches the $(9+\varepsilon)$-approximation achieved by the local search algorithm in Kanungo et al. [2004], but it is significantly more efficient. Notice that 9 is the best approximation achievable through local search algorithms, as proved in Kanungo et al. [2004]. We provide experiments where we compare against k-means++ and Lattanzi and Sohler [2019]. We also study a variant of our algorithm that performs very competitively with our theoretically sound algorithm. The variant is very efficient and still outperforms previous work in terms of solution quality, even after the standard postprocessing using Lloyd's algorithm.

Additional Related Work. We start by reviewing the approach of Kanungo et al. [2004] and a possible adaptation to our setting. The bound of $9 + \varepsilon$ on the approximation guarantee shown by Kanungo et al. [2004] is for the following algorithm: Given a set $S$ of $k$ centers, if there is a set $S^+$ of at most $2/\varepsilon$ points in $\mathbb{R}^d$ together with a set $S^-$ of $|S^+|$ points in $S$ such that $(S \setminus S^-) \cup S^+$ achieves a better k-means cost than $S$, then set $S := (S \setminus S^-) \cup S^+$ and repeat until convergence. The main drawback of the algorithm is that it asks whether there exists a set $S^+$ of points in $\mathbb{R}^d$ that could be swapped with elements of $S$ to improve the cost. Identifying such a set, even of constant size, is already non-trivial. The best way of doing so is through the following path: First, compute a coreset using the state-of-the-art coreset construction of Cohen-Addad et al. [2022b] and apply the dimensionality reduction of Becchetti et al. [2019], Makarychev et al. [2019], hence obtaining a set of $O(k/\varepsilon^4)$ points in dimension $O(\log k/\varepsilon^2)$. Then, compute grids using the discretization framework of Matousek [2000] to identify a set of $\varepsilon^{-O(d)} = k^{O(\varepsilon^{-2}\log(1/\varepsilon))}$ grid points that contains nearly-optimal centers. Now, run the local search algorithm where the sets $S^+$ are chosen from the grid points by brute-force enumeration over all possible subsets of grid points of size at most, say, $s$.
The running time of the whole algorithm with swaps of magnitude $s$ (i.e., $|S^+| \le s$) hence becomes $k^{O(s\,\varepsilon^{-2}\log(1/\varepsilon))}$ for an approximation of $(1+\varepsilon)(9 + 2/s)$, meaning a dependency on $k$ of $k^{O(\varepsilon^{-3}\log(1/\varepsilon))}$ to achieve a $(9+\varepsilon)$-approximation. Our result improves upon this approach in two ways: (1) it improves over the above theoretical bound, and (2) it does so through an efficient and implementable, i.e., practical, algorithm.

Recently, Grunau et al. [2023] looked at how much applying a greedy rule on top of the k-means++ heuristic improves its performance. The heuristic is that at each step, the algorithm samples $\ell$ centers and only keeps the one that gives the best improvement in cost. Interestingly, the authors prove that from a theoretical standpoint this heuristic does not improve the quality of the output. Local search algorithms for k-median and k-means have also been studied by Gupta and Tangwongsan [2008], who drastically simplified the analysis of Arya et al. [2004]. Cohen-Addad and Schwiegelshohn [2017] demonstrated the power of local search for stable instances. Friggstad et al. [2019], Cohen-Addad et al. [2019] showed that local search yields a PTAS for Euclidean inputs of bounded dimension (and doubling metrics) and minor-free metrics. Cohen-Addad [2018] showed how to speed up the local search algorithm using kd-trees (i.e., for low-dimensional inputs). For fixed k, there are several known approximation schemes, typically using small coresets Becchetti et al. [2019], Feldman and Langberg [2011], Kumar et al. [2010]. The state-of-the-art approaches are due to Bhattacharya et al. [2020], Jaiswal et al. [2014]. The best known coreset constructions remain Cohen-Addad et al. [2022c,b]. If the constraint on the number of output centers is relaxed, then we talk about bicriteria approximations, and k-means has been widely studied in this setting Bandyapadhyay and Varadarajan [2016], Charikar and Guha [2005], Cohen-Addad and Mathieu [2015], Korupolu et al. [2000], Makarychev et al. [2016].

2 Preliminaries

Notation. We denote by $P \subseteq \mathbb{R}^d$ the set of input points and let $n = |P|$. Given a point set $Q \subseteq P$ we use $\mu(Q)$ to denote the mean of the points in $Q$. Given a point $p \in P$ and a set of centers $A$ we denote by $A[p]$ the closest center in $A$ to $p$ (ties are broken arbitrarily). We denote by $C$ the set of centers currently found by our algorithm and by $O^*$ an optimal set of centers. Therefore, given $p \in P$, we denote by $C[p]$ and $O^*[p]$ its closest ALG-center and OPT-center, respectively. We denote by $\mathrm{cost}(Q, A)$ the cost of points in $Q \subseteq P$ w.r.t. the centers in $A$, namely
$$\mathrm{cost}(Q, A) = \sum_{q \in Q} \min_{c \in A} \|q - c\|^2.$$
We use ALG and OPT as shorthands for $\mathrm{cost}(P, C)$ and $\mathrm{cost}(P, O^*)$, respectively. When we sample points proportionally to their current cost (namely, sample $q$ with probability $\mathrm{cost}(q, C)/\mathrm{cost}(P, C)$) we call this the $D^2$ distribution. When using $O_\varepsilon(\cdot)$ and $\Omega_\varepsilon(\cdot)$ we mean that $\varepsilon$ is considered constant. We use $\widetilde{O}(f)$ to hide polylogarithmic factors in $f$. The following lemma is folklore.

Lemma 1. Given a point set $Q \subseteq P$ and a point $p \in P$ we have $\mathrm{cost}(Q, p) = \mathrm{cost}(Q, \mu(Q)) + |Q| \cdot \|p - \mu(Q)\|^2$.

Let $O^*_i$ be an optimal cluster; we define the radius of $O^*_i$ as the value $\rho_i$ such that $\rho_i^2\,|O^*_i| = \mathrm{cost}(O^*_i, o_i)$, where $o_i = \mu(O^*_i)$. We define the $\delta$-core of the optimal cluster $O^*_i$ as the set of points $p \in O^*_i$ that lie in a ball of radius $(1+\delta)\rho_i$ centered at $o_i$. In symbols, $\mathrm{core}(O^*_i) = P \cap B(o_i, (1+\delta)\rho_i)$. Throughout the paper, $\delta$ is always a small constant fixed upfront, hence we omit it.

Lemma 2.
Let $O^*_i$ be an optimal cluster and sample $q \in O^*_i$ according to the $D^2$-distribution restricted to $O^*_i$. If $\mathrm{cost}(O^*_i, C) > (2 + 3\delta)\,\mathrm{cost}(O^*_i, o_i)$ then $\Pr[q \in \mathrm{core}(O^*_i)] = \Omega_\delta(1)$.

Proof. Define $\alpha := \mathrm{cost}(O^*_i, C)/\mathrm{cost}(O^*_i, o_i) > 2 + 3\delta$. Thanks to Lemma 1, for each $c \in C$ we have $\|c - o_i\|^2 \ge (\alpha - 1)\rho_i^2$. Therefore, for each $y \in \mathrm{core}(O^*_i)$ and every $c \in C$ we have
$$\mathrm{cost}(y, c) = \|y - c\|^2 \ge \big(\sqrt{\alpha - 1} - (1+\delta)\big)^2 \rho_i^2 = \Omega_\delta(\alpha \rho_i^2).$$
Moreover, by a Markov's inequality argument we have $|O^*_i \setminus \mathrm{core}(O^*_i)| \le \frac{1}{1+\delta}|O^*_i|$ and thus $|\mathrm{core}(O^*_i)| = \Omega_\delta(|O^*_i|)$. Combining everything we get
$$\mathrm{cost}(\mathrm{core}(O^*_i), C) \ge |\mathrm{core}(O^*_i)| \cdot \min_{c \in C,\, y \in \mathrm{core}(O^*_i)} \mathrm{cost}(y, c) = \Omega_\delta(|O^*_i|) \cdot \Omega_\delta(\alpha\rho_i^2)$$
and $|O^*_i| \cdot \alpha\rho_i^2 = \mathrm{cost}(O^*_i, C)$, hence $\mathrm{cost}(\mathrm{core}(O^*_i), C) = \Omega_\delta(\mathrm{cost}(O^*_i, C))$.

3 Multi-Swap k-Means++

The single-swap local search (SSLS) k-means++ algorithm in Lattanzi and Sohler [2019] works as follows. First, $k$ centers are sampled using k-means++ (namely, they are sampled one by one according to the $D^2$ distribution, updated for every new center). Then, $O(k \log\log k)$ steps of local search follow. In each local search step a point $q \in P$ is $D^2$-sampled; then let $c$ be the center among the current centers $C$ such that $\mathrm{cost}(P, (C \setminus \{c\}) \cup \{q\})$ is minimum. If $\mathrm{cost}(P, (C \setminus \{c\}) \cup \{q\}) < \mathrm{cost}(P, C)$ then we swap $c$ and $q$, or more formally we set $C \leftarrow (C \setminus \{c\}) \cup \{q\}$. We extend SSLS so that we allow multiple centers to be swapped simultaneously and call this algorithm multi-swap local search (MSLS) k-means++. Swapping multiple centers at the same time achieves a lower approximation ratio, in exchange for a higher time complexity. In this section, we present and analyse the $p$-swap local search (LS) algorithm for a generic number $p$ of centers swapped at each step. For any constant $\delta > 0$, we obtain an approximation ratio $\mathrm{ALG}/\mathrm{OPT} = \eta^2 + \delta$ where
$$\eta^2 - (2 + 2/p)\eta - (4 + 2/p) = 0. \qquad (1)$$

The Algorithm. First, we initialize our set of centers using k-means++. Then, we run $O(k^p \log\log k)$ local search steps, where a local search step works as follows. We $D^2$-sample a set $In = \{q_1, \dots, q_p\}$ of points from $P$ (without updating costs). Then, we iterate over all possible sets $Out = \{c_1, \dots, c_p\}$ of $p$ distinct elements in $C \cup In$ and select the set $Out$ such that performing the swap $(In, Out)$ maximally improves the cost (if $In \cap Out \neq \emptyset$, we are actually performing the swap $(In \setminus Out, Out \setminus In)$, which has size $< p$). If this choice of $Out$ improves the cost, then we perform the swap $(In, Out)$; else we do not perform any swap for this step.

Theorem 3. For any $\delta > 0$, the $p$-swap local search algorithm above runs in $\widetilde{O}(ndk^{2p})$ time and, with constant probability, finds an $(\eta^2 + \delta)$-approximation of k-means, where $\eta$ satisfies Equation (1).

Notice that the SSLS algorithm of Lattanzi and Sohler [2019] is exactly the $p$-swap LS algorithm above for $p = 1$.

Corollary 4. The single-swap local search in Lattanzi and Sohler [2019], Choo et al. [2020] achieves an approximation ratio $< 26.64$.

Corollary 5. For $p = O(1)$ large enough, multi-swap local search achieves an approximation ratio $< 10.48$ in time $O(nd \cdot \mathrm{poly}(k))$.

3.1 Analysis of Multi-Swap k-means++

In this section we prove Theorem 3. Our main stepping stone is the following lemma.

Lemma 6. Let ALG denote the cost at some point in the execution of MSLS. As long as $\mathrm{ALG}/\mathrm{OPT} > \eta^2 + \delta$, a local search step improves the cost by a factor $1 - \Omega(1/k)$ with probability $\Omega(1/k^{p-1})$.

Proof of Theorem 3. First, we show that $O(k^p \log\log k)$ local steps suffice to obtain the desired approximation ratio, with constant probability. Notice that a local search step can only improve the cost function, so it is sufficient to show that the approximation ratio is achieved at some point in time.
We initialize our centers using k-means++, which gives an $O(\log k)$-approximation in expectation. Thus, using Markov's inequality, the $O(\log k)$ approximation guarantee holds with arbitrarily high constant probability. We say that a local-search step is successful if it improves the cost by a factor of at least $1 - \Omega(1/k)$. Thanks to Lemma 6, we know that unless the algorithm has already achieved the desired approximation ratio, a local-search step is successful with probability $\Omega(1/k^{p-1})$. To go from $O(\log k)$ to $\eta^2 + \delta$ we need $O(k \log\log k)$ successful local search steps. Standard concentration bounds on the value of a negative binomial random variable show that, with high probability, the number of trials needed to obtain $O(k \log\log k)$ successful local-search steps is $O(k^p \log\log k)$. Therefore, after $O(k^p \log\log k)$ local-search steps we obtain an approximation ratio of $\eta^2 + \delta$.

To prove the running time bound it is sufficient to show that a local search step can be performed in time $\widetilde{O}(ndk^{p-1})$. This is possible if we maintain, for each point $x \in P$, a dynamic sorted dictionary (also known as a dynamic predecessor search data structure) storing the pairs $(\mathrm{cost}(x, c_i), c_i)$ for each $c_i \in C$. Then we can combine the exhaustive search over all possible size-$p$ subsets of $C \cup In$ and the computation of the new cost function using time $O(ndk^{p-1} \log k)$. To do so, we iterate over all possible size-$(p-1)$ subsets $Z$ of $C \cup In$ and update all costs as if these centers were removed; then for each point $x \in P$ we compute how much its cost increases if we remove its closest center $c_x$ in $(C \cup In) \setminus Z$ and charge that amount to $c_x$. In the end, we consider $Out = Z \cup \{c\}$ where $c$ is the cheapest-to-remove center found in this way.

The rest of this section is devoted to proving Lemma 6. For convenience, we prove that Lemma 6 holds whenever $\mathrm{ALG}/\mathrm{OPT} > \eta^2 + O(\delta)$, which is wlog by rescaling $\delta$. Recall that we now focus on a given step of the algorithm, and when we say current cost, current centers and current clusters we refer to the state of these objects at the end of the last local-search step before the current one. Let $O^*_1, \dots, O^*_k$ be an optimal clustering of $P$ and let $O^* = \{o_i = \mu(O^*_i) \mid i = 1 \dots k\}$ be the set of optimal centers of these clusters. We denote by $C_1, \dots, C_k$ the current set of clusters at that stage of the local search and by $C = \{c_1, \dots, c_k\}$ the set of their respective current centers. We say that $c_i$ captures $o_j$ if $c_i$ is the closest current center to $o_j$, namely $c_i = C[o_j]$. We say that $c_i$ is busy if it captures more than $p$ optimal centers, and we say it is lonely if it captures no optimal center. Let $\widetilde{O} = \{o_i \mid \mathrm{cost}(O^*_i, C) > \delta \cdot \mathrm{ALG}/k\}$ and $\widetilde{C} = C \setminus \{C[o_i] \mid o_i \in O^* \setminus \widetilde{O}\}$. For ease of notation, we simply assume that $\widetilde{O} = \{o_1, \dots, o_h\}$ and $\widetilde{C} = \{c_1, \dots, c_{h'}\}$. Notice that $h' \ge h$.

Weighted ideal multi-swaps. Given $In \subseteq P$ and $Out \subseteq \widetilde{C}$ of the same size, we say that the swap $(In, Out)$ is an ideal swap if $In \subseteq \widetilde{O}$. We now build a set $\mathcal{S}$ of weighted ideal multi-swaps. First, suppose wlog that $\{c_1, \dots, c_t\}$ is the set of current centers in $\widetilde{C}$ that are neither lonely nor busy. Let $L$ be the set of lonely centers in $\widetilde{C}$. For each $i = 1 \dots t$ we do the following. Let $In$ be the set of optimal centers in $\widetilde{O}$ captured by $c_i$. Choose a set $L_i$ of $|In| - 1$ centers from $L$, set $L \leftarrow L \setminus L_i$ and define $Out = L_i \cup \{c_i\}$. Assign weight 1 to $(In, Out)$ and add it to $\mathcal{S}$.
For each busy center $c_i \in \{c_{t+1}, \dots, c_{h'}\}$, let $A$ be the set of optimal centers in $\widetilde{O}$ captured by $c_i$ and pick a set $L_i$ of $|A| - 1$ lonely current centers from $L$ (a counting argument shows that this is always possible). Set $L \leftarrow L \setminus L_i$. For each $o_j \in A$ and $c_\ell \in L_i$ assign weight $1/(|A| - 1)$ to $(\{o_j\}, \{c_\ell\})$ and add it to $\mathcal{S}$.

Suppose we are left with $\ell$ centers $o'_1, \dots, o'_\ell \in \widetilde{O}$ such that $C[o'_i] \notin \widetilde{C}$; these have not been included in any swap yet. However, since $|\widetilde{C}| \ge |\widetilde{O}|$, we are left with at least $\ell$ lonely centers $c'_1, \dots, c'_\ell \in \widetilde{C}$. For each $i = 1 \dots \ell$ we assign weight 1 to $(\{o'_i\}, \{c'_i\})$ and add it to $\mathcal{S}$.

Observation 7. The process above generates a set of weighted ideal multi-swaps such that: (i) every swap has size at most $p$; (ii) the combined weight of the swaps involving an optimal center $o_i \in \widetilde{O}$ is 1; (iii) the combined weight of the swaps involving a current center $c_i$ is at most $1 + 1/p$.

Consider an ideal swap $(In, Out)$. Let $O^*_{In} = \bigcup_{o_i \in In} O^*_i$ and $C_{Out} = \bigcup_{c_j \in Out} C_j$. Define the reassignment cost $\mathrm{Reassign}(In, Out)$ as the increase in cost of reassigning the points in $C_{Out} \setminus O^*_{In}$ to centers in $C \setminus Out$. Namely,
$$\mathrm{Reassign}(In, Out) = \mathrm{cost}(C_{Out} \setminus O^*_{In},\, C \setminus Out) - \mathrm{cost}(C_{Out} \setminus O^*_{In},\, C).$$
We take the increase in cost of the following reassignment as an upper bound on the reassignment cost. For each $p \in C_{Out} \setminus O^*_{In}$ we consider its closest optimal center $O^*[p]$ and reassign $p$ to the current center that is closest to $O^*[p]$, namely $C[O^*[p]]$. In formulas, we have
$$\mathrm{Reassign}(In, Out) \le \sum_{p \in C_{Out} \setminus O^*_{In}} \big(\mathrm{cost}(p, C[O^*[p]]) - \mathrm{cost}(p, C[p])\big) \le \sum_{p \in C_{Out}} \big(\mathrm{cost}(p, C[O^*[p]]) - \mathrm{cost}(p, C[p])\big).$$
Indeed, by the way we defined our ideal swaps we have $C[O^*[p]] \notin Out$ for each $p \notin O^*_{In}$, so this reassignment is valid. Notice that the right-hand side in the equation above does not depend on $In$.

Lemma 8. $\sum_{p \in P} \mathrm{cost}(p, C[O^*[p]]) \le 2\,\mathrm{OPT} + \mathrm{ALG} + 2\sqrt{\mathrm{OPT} \cdot \mathrm{ALG}}$.

Proof. Deferred to the supplementary material.

Lemma 9. The combined weighted reassignment cost of all ideal multi-swaps in $\mathcal{S}$ is at most $(2 + 2/p)\big(\mathrm{OPT} + \sqrt{\mathrm{OPT} \cdot \mathrm{ALG}}\big)$.

Proof. Denote by $w(In, Out)$ the weight associated with the swap $(In, Out)$. Then
$$\sum_{(In,Out) \in \mathcal{S}} w(In, Out)\,\mathrm{Reassign}(In, Out) \le \sum_{(In,Out) \in \mathcal{S}} w(In, Out) \sum_{p \in C_{Out}} \big(\mathrm{cost}(p, C[O^*[p]]) - \mathrm{cost}(p, C[p])\big)$$
$$\le (1 + 1/p) \sum_{j} \sum_{p \in C_j} \big(\mathrm{cost}(p, C[O^*[p]]) - \mathrm{cost}(p, C[p])\big) = (1 + 1/p)\Big(\sum_{p \in P} \mathrm{cost}(p, C[O^*[p]]) - \mathrm{ALG}\Big).$$
The second inequality uses (iii) from Observation 7. Applying Lemma 8 completes the proof.

Recall the notions of radius and core of an optimal cluster introduced in Section 2. We say that a swap $(In, Out)$ is strongly improving if $\mathrm{cost}(P, (C \cup In) \setminus Out) \le (1 - \delta/k)\,\mathrm{cost}(P, C)$. Let $In = \{o_1, \dots, o_s\} \subseteq \widetilde{O}$ and $Out = \{c_1, \dots, c_s\} \subseteq \widetilde{C}$; we say that an ideal swap $(In, Out)$ is good if for every $q_1 \in \mathrm{core}(O^*_1), \dots, q_s \in \mathrm{core}(O^*_s)$ the swap $(Q, Out)$ is strongly improving, where $Q = \{q_1, \dots, q_s\}$. We call an ideal swap bad otherwise. We say that an optimal center $o_i \in \widetilde{O}$ is good if that is the case for at least one of the ideal swaps it belongs to; otherwise we say that it is bad. Notice that each optimal center in $\widetilde{O}$ is assigned to a swap in $\mathcal{S}$, so it is either good or bad. Denote by $G$ the union of the cores of good optimal centers in $\widetilde{O}$.

Lemma 10. If an ideal swap $(In, Out)$ is bad, then we have
$$\mathrm{cost}(O^*_{In}, C) \le (2 + \delta)\,\mathrm{cost}(O^*_{In}, O^*) + \mathrm{Reassign}(In, Out) + \delta\,\mathrm{ALG}/k. \qquad (2)$$

Proof. Let $In = \{o_1, \dots, o_s\}$ and $Q = \{q_1, \dots, q_s\}$ with $q_1 \in \mathrm{core}(O^*_1), \dots, q_s \in \mathrm{core}(O^*_s)$. Then, by Lemma 1, $\mathrm{cost}(O^*_{In}, Q) \le (2 + \delta)\,\mathrm{cost}(O^*_{In}, O^*)$. Moreover, $\mathrm{Reassign}(In, Out) = \mathrm{cost}(P \setminus O^*_{In}, C \setminus Out) - \mathrm{cost}(P \setminus O^*_{In}, C)$ because points in $P \setminus C_{Out}$ are not affected by the swap. Therefore,
$$\mathrm{cost}(P, (C \cup Q) \setminus Out) \le (2 + \delta)\,\mathrm{cost}(O^*_{In}, O^*) + \mathrm{Reassign}(In, Out) + \mathrm{cost}(P \setminus O^*_{In}, C).$$
Suppose by contradiction that Equation (2) does not hold; then
$$\mathrm{cost}(P, C) - \mathrm{cost}(P, (C \cup Q) \setminus Out) = \mathrm{cost}(P \setminus O^*_{In}, C) + \mathrm{cost}(O^*_{In}, C) - \mathrm{cost}(P, (C \cup Q) \setminus Out) \ge \delta\,\mathrm{ALG}/k.$$
Hence, $(Q, Out)$ is strongly improving, and this holds for any choice of $Q$: a contradiction.

Lemma 11. If $\mathrm{ALG}/\mathrm{OPT} > \eta^2 + \delta$ then $\mathrm{cost}(G, C) = \Omega_\delta(\mathrm{cost}(P, C))$. Thus, if we $D^2$-sample $q$ we have $\Pr[q \in G] = \Omega_\delta(1)$.

Proof. First, we observe that the combined current cost of all optimal clusters in $O^* \setminus \widetilde{O}$ is at most $k \cdot \delta\,\mathrm{ALG}/k = \delta\,\mathrm{ALG}$. Now, we prove that the combined current cost of all $O^*_i$ such that $o_i$ is bad is at most $(1 - 2\delta)\,\mathrm{ALG}$. Suppose, by contradiction, that this is not the case; then we have:
$$(1 - 2\delta)\,\mathrm{ALG} < \sum_{\text{bad } o_i \in \widetilde{O}} \mathrm{cost}(O^*_i, C) \le \sum_{\text{bad } (In, Out) \in \mathcal{S}} w(In, Out)\,\mathrm{cost}(O^*_{In}, C)$$
$$\le \sum_{\text{bad } (In, Out)} w(In, Out)\,\big((2 + \delta)\,\mathrm{cost}(O^*_{In}, O^*) + \mathrm{Reassign}(In, Out) + \delta\,\mathrm{ALG}/k\big)$$
$$\le (2 + \delta)\,\mathrm{OPT} + (2 + 2/p)\,\mathrm{OPT} + (2 + 2/p)\sqrt{\mathrm{OPT} \cdot \mathrm{ALG}} + \delta\,\mathrm{ALG}.$$
The second and last inequalities make use of Observation 7. The third inequality uses Lemma 10. Setting $\eta^2 = \mathrm{ALG}/\mathrm{OPT}$ we obtain the inequality $\eta^2 - (2 + 2/p - O(\delta))\eta - (4 + 2/p - O(\delta)) \le 0$. Hence, we obtain a contradiction in the previous argument as long as $\eta^2 - (2 + 2/p - O(\delta))\eta - (4 + 2/p - O(\delta)) > 0$. A contradiction there implies that at least a $\delta$-fraction of the current cost is due to points in $\bigcup_{\text{good } o_i \in \widetilde{O}} O^*_i$. We combine this with Lemma 2 and conclude that the total current cost of $G = \bigcup_{\text{good } o_i \in \widetilde{O}} \mathrm{core}(O^*_i)$ is $\Omega_\delta(\mathrm{cost}(P, C))$.

Finally, we prove Lemma 6. Whenever $q_1 \in G$ we have that $q_1 \in \mathrm{core}(O^*_1)$ for some good $o_1$. Then, for some $s \le p$ we can complete $o_1$ with $o_2, \dots, o_s$ such that $In = \{o_1, \dots, o_s\}$ belongs to a good swap. Concretely, there exists $Out \subseteq C$ such that $(In, Out)$ is a good swap. Since $In \subseteq \widetilde{O}$ we have $\mathrm{cost}(O^*_i, C) > \delta\,\mathrm{OPT}/k$ for all $o_i \in In$, which combined with Lemma 2 gives that, for $i = 2 \dots s$, $\Pr[q_i \in \mathrm{core}(O^*_i)] = \Omega_\delta(1/k)$. Hence, we have $\Pr[q_i \in \mathrm{core}(O^*_i) \text{ for } i = 1 \dots s] = \Omega_{\delta,p}(1/k^{p-1})$. Whenever we sample $q_1, \dots, q_s$ from $\mathrm{core}(O^*_1), \dots, \mathrm{core}(O^*_s)$, we have that $(Q, Out)$ is strongly improving. Notice, however, that $(Q, Out)$ is an $s$-swap and we may have $s < p$. Nevertheless, whenever we sample $q_1, \dots, q_s$ followed by any sequence $q_{s+1}, \dots, q_p$ it is enough to choose $Out' = Out \cup \{q_{s+1}, \dots, q_p\}$ to obtain that $(\{q_1, \dots, q_p\}, Out')$ is an improving $p$-swap.

4 A Faster (9 + ε)-Approximation Local Search Algorithm

The MSLS algorithm from Section 3 achieves an approximation ratio of $\eta^2 + \varepsilon$, where $\eta^2 - (2 + 2/p)\eta - (4 + 2/p) = 0$ and $\varepsilon > 0$ is an arbitrarily small constant. For large $p$ we have $\eta^2 \approx 10.48$. On the other hand, employing $p$ simultaneous swaps, Kanungo et al. [2004] achieve an approximation factor of $\xi^2 + \varepsilon$ where $\xi^2 - (2 + 2/p)\xi - (3 + 2/p) = 0$. If we set $p \approx 1/\varepsilon$ this yields a $(9 + O(\varepsilon))$-approximation. In the same paper, they prove that a 9-approximation is indeed the best possible for $p$-swap local search if $p$ is constant (see Theorem 3.1 in Kanungo et al. [2004]). They showed that 9 is the right locality gap for local search, but they matched it with a very slow algorithm. To achieve a $(9 + \varepsilon)$-approximation, they discretize the space, reducing to $O(n\varepsilon^{-d})$ candidate centers, and perform an exhaustive search over all size-$(1/\varepsilon)$ subsets of candidates at every step. As we saw in the related work section, it is possible to combine techniques from coresets and dimensionality reduction to reduce the number of points to $n = k \cdot \mathrm{poly}(\varepsilon^{-1})$ and the number of dimensions to $d = \log k \cdot \varepsilon^{-2}$. This reduces the complexity of Kanungo et al. [2004] to $k^{O(\varepsilon^{-3} \log \varepsilon^{-1})}$. In this section, we leverage techniques from Cohen-Addad et al. [2021] to achieve a $(9 + \varepsilon)$-approximation faster. (The complexity in Theorem 12 can be improved further by applying the same preprocessing based on coresets and dimensionality reduction that can be used to speed up the approach of Kanungo et al. [2004]; our algorithm then becomes asymptotically faster.) In particular, we obtain the following.
Theorem 12. Given a set of $n$ points in $\mathbb{R}^d$ with aspect ratio $\Delta$, there exists an algorithm that computes a $(9 + \varepsilon)$-approximation to k-means in time $nd \cdot k^{O(\varepsilon^{-2})} \cdot \log^{O(\varepsilon^{-1})}(\Delta) \cdot 2^{\mathrm{poly}(\varepsilon^{-1})}$.

Notice that, besides being asymptotically slower, the pipeline obtained by combining known techniques is highly impractical and thus does not make for an experimental test-bed. Moreover, it is not obvious how to simplify such an ensemble of complex techniques to obtain a practical algorithm.

Limitations of MSLS. The barrier we need to overcome in order to match the bound in Kanungo et al. [2004] is that, while we only consider points in $P$ as candidate centers, the discretization they employ also considers points in $\mathbb{R}^d \setminus P$. In the analysis of MSLS we show that we sample each point $q_i$ from $\mathrm{core}(O^*_i)$, or equivalently that $q_i \in B(o_i, (1+\varepsilon)\rho_i)$, where $\rho_i$ is such that $O^*_i$ would have the same cost w.r.t. $o_i$ if all its points were moved onto a sphere of radius $\rho_i$ centered at $o_i$. This allows us to use a Markov's-inequality kind of argument and conclude that there must be $\Omega_\varepsilon(|O^*_i|)$ points in $O^*_i \cap B(o_i, (1+\varepsilon)\rho_i)$. However, we have no guarantee that there is any point at all in $O^*_i \cap B(o_i, (1-\varepsilon)\rho_i)$. Indeed, all points in $O^*_i$ might lie on the boundary of $B(o_i, \rho_i)$. The fact that potentially all our candidate centers $q$ are at distance at least $\rho_i$ from $o_i$ yields (by Lemma 1) $\mathrm{cost}(O^*_i, q) \ge 2\,\mathrm{cost}(O^*_i, o_i)$, which causes the degree-zero term in $\xi^2 - (2 + 2/p)\xi - (3 + 2/p) = 0$ from Kanungo et al. [2004] to become a 4 in our analysis.

Improving MSLS by taking averages. First, we notice that, in order to achieve a $(9 + \varepsilon)$-approximation, we need to set $p = \Theta(1/\varepsilon)$. The main hurdle to achieving a $(9 + \varepsilon)$-approximation is that we need to replace the $q_i$ in MSLS with a better approximation of $o_i$. We design a subroutine that computes, with constant probability, an $\varepsilon$-approximation $\hat o_i$ of $o_i$ (namely, $\mathrm{cost}(O^*_i, \hat o_i) \le (1+\varepsilon)\,\mathrm{cost}(O^*_i, o_i)$). The key idea is that if we sample $O(1/\varepsilon)$ points uniformly from $O^*_i$ and define $\hat o_i$ to be the average of our samples, then $\mathrm{cost}(O^*_i, \hat o_i) \le (1+\varepsilon)\,\mathrm{cost}(O^*_i, o_i)$ with constant probability. However, we do not know $O^*_i$, so sampling uniformly from it is non-trivial. To achieve that, for each $q_i$ we identify a set $N$ of nice candidate points in $P$ such that a $\mathrm{poly}(\varepsilon)/k$ fraction of them are from $O^*_i$. We sample $O(1/\varepsilon)$ points uniformly from $N$, and thus with probability $(\varepsilon/k)^{O(1/\varepsilon)}$ we sample only points from $O^*_i$. Thus far, we have sampled $O(1/\varepsilon)$ points uniformly from $N \cap O^*_i$. What about the points in $O^*_i \setminus N$? We can define $N$ so that all points in $O^*_i \setminus N$ are either very close to some of the $(q_j)_j$ or very far from $q_i$. The points that are very close to the points $(q_j)_j$ are easy to treat: indeed, we can approximately locate them and we just need to guess their mass, which matters only when their contribution is at least $\mathrm{poly}(\varepsilon) \cdot \mathrm{ALG}$, and so we pay only a $\log^{O(1/\varepsilon)}(1/\varepsilon)$ multiplicative overhead to guess the mass close to $q_j$ for $j = 1 \dots p = \Theta(1/\varepsilon)$. As for a point $f$ that is very far from $q_i$ (say, $\|f - q_i\| \gg \rho_i$), we notice that, although $f$'s contribution to $\mathrm{cost}(O^*_i, o_i)$ may be large, we have $\mathrm{cost}(f, o) \approx \mathrm{cost}(f, o_i)$ for each $o \in B(q_i, \rho_i) \subseteq B(o_i, (2+\varepsilon)\rho_i)$, assuming $q_i \in \mathrm{core}(O^*_i)$.
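To illustrate the averaging idea behind the improved swaps, here is a small self-contained Python experiment, in the spirit of Lemma 13 in the supplementary material: the mean of a handful of uniform samples is already a near-optimal 1-means center. This is an illustration under our own naming, not part of the paper's algorithm or released code.

```python
import numpy as np

def one_means_cost(P, c):
    """1-means cost of center c on point set P: sum of squared distances."""
    return ((P - c) ** 2).sum()

rng = np.random.default_rng(0)
P = rng.normal(size=(10_000, 20))        # a synthetic cluster playing the role of O*_i
opt = one_means_cost(P, P.mean(axis=0))  # the optimal 1-means center is the mean

eps = 0.1
m = int(np.ceil(1 / eps))                # O(1/eps) uniform samples
S = P[rng.choice(P.shape[0], size=m, replace=False)]
approx = one_means_cost(P, S.mean(axis=0))

print(approx / opt)                      # typically below 1 + O(eps), with constant probability
```

In the actual algorithm the uniform samples from $O^*_i$ are not available directly, which is exactly what the nice/close/far decomposition and the coefficient grid of Equation (3) below are designed to work around.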
5 Experiments

In this section, we show that our new algorithm using multi-swap local search can be employed to design an efficient seeding algorithm for Lloyd's which outperforms both the classical k-means++ seeding and the single-swap local search from Lattanzi and Sohler [2019].

Algorithms. The multi-swap local search algorithm that we analysed above performs very well in terms of solution quality. This empirically verifies the improved approximation factor of our algorithm compared to the single-swap local search of Lattanzi and Sohler [2019]. Motivated by practical considerations, we heuristically adapt our algorithm to make it very competitive with SSLS in terms of running time while remaining very close, in terms of solution quality, to the theoretically superior algorithm that we analyzed. The adaptation replaces the phase where the algorithm selects the $p$ centers to swap out by performing an exhaustive search over $\binom{k+p}{p}$ subsets of centers. Instead, we use an efficient heuristic procedure that greedily selects the $p$ centers to swap out, one by one. Specifically, we select the first center to be the cheapest one to remove (namely, the one that increases the cost by the least amount once the points in its cluster are reassigned to the remaining centers). Then, we update all costs and select the next center iteratively. After $p$ repetitions we are done. We perform an experimental evaluation of the greedy variant of our algorithm compared to the theoretically-sound algorithm from Section 3 and show that employing the greedy heuristic does not measurably impact performance. The four algorithms that we evaluate are the following: 1) KM++: the k-means++ algorithm from Arthur and Vassilvitskii [2007]; 2) SSLS: the single-swap local search method from Lattanzi and Sohler [2019]; 3) MSLS: the multi-swap local search from Section 3; and 4) MSLS-G: the greedy variant of multi-swap local search as described above. We use MSLS-G-p=x and MSLS-p=x to denote MSLS-G and MSLS with $p = x$, respectively. Notice that MSLS-G-p=1 is exactly SSLS. Our experimental evaluation explores the effect of $p$-swap LS, for $p > 1$, in terms of solution cost and running time.

Datasets. We consider the three datasets used in Lattanzi and Sohler [2019] to evaluate the performance of SSLS: 1) KDD-PHY: 100,000 points with 78 features representing a quantum physics task kdd [2004]; 2) RNA: 488,565 points with 8 features representing RNA input sequence pairs Uzilov et al. [2006]; and 3) KDD-BIO: 145,751 points with 74 features measuring the match between a protein and a native sequence kdd [2004]. We discuss the results for two of our datasets, namely KDD-BIO and RNA. We defer the results on KDD-PHY to the appendix and note that they are very similar to the results on RNA. We performed a preprocessing step to clean up the datasets. We observed that the standard deviation of some features was disproportionately high, which caused almost all of the cost to be concentrated in a few dimensions, making the problem, in some sense, lower-dimensional. Thus, we applied min-max scaling to all datasets and observed that this makes the standard deviations of all features comparable.

Experimental setting. All our code is written in Python. The code will be made available upon publication of this work. We did not make use of parallelization techniques.
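As an illustration of the greedy swap-out selection used by MSLS-G (described under Algorithms above), the following Python sketch shows one way to implement it; the function name and structure are ours, and this is not the code released with the paper.

```python
import numpy as np

def greedy_swap_out(P, centers, p):
    """Greedily pick p centers to remove from `centers` (typically C plus the
    D^2-sampled candidates): repeatedly drop the center whose removal increases
    the cost the least, recomputing assignments after each removal."""
    remaining = list(range(len(centers)))
    removed = []
    for _ in range(p):
        best_i, best_cost = None, None
        for i in remaining:
            kept = np.array([centers[j] for j in remaining if j != i])
            d2 = ((P[:, None, :] - kept[None, :, :]) ** 2).sum(axis=2)
            cost_after = d2.min(axis=1).sum()   # cost if center i were removed
            if best_cost is None or cost_after < best_cost:
                best_i, best_cost = i, cost_after
        remaining.remove(best_i)
        removed.append(best_i)
    return removed  # indices of the p centers to swap out

# In a full MSLS-G step one would D^2-sample p candidate points, run this routine
# on the union of the current centers and the candidates, and accept the swap
# only if the overall k-means cost decreases.
```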
To run our experiments, we used a personal computer with 8 cores, a 1.8 GHz processor, and 15.9 GiB of main memory. We run all experiments 5 times and report the mean and standard deviation in our plots. All our plots report the progression of the cost either w.r.t. local search steps or w.r.t. Lloyd's iterations. We run experiments on all our datasets for k = 10, 25, 50. The main body of the paper reports the results for k = 25, while the rest can be found in the appendix. We note that the conclusions of the experiments for k = 10, 50 are similar to those for k = 25.

Removing centers greedily. We first compare MSLS-G with MSLS. To perform our experiment, we initialize k = 25 centers using k-means++ and then run 50 iterations of local search for both algorithms, with p = 3 swaps. Due to the higher running time of MSLS, we perform this experiment on a 1% uniform sample of each of our datasets. We find that the performance of the two algorithms is comparable on all our instances, with both improving over KM++ by roughly 15%-27% at convergence. Figure 1 shows the aggregate results over 5 repetitions of our experiment. It may happen that MSLS, which considers all possible swaps of size p at each LS iteration, performs worse than MSLS-G, as a sub-optimal swap at an intermediate iteration may, by coincidence, still lead to a better local optimum.

Figure 1: Comparison between MSLS and MSLS-G, for p = 3 and k = 25, on the datasets KDD-BIO and RNA. The y-axis shows the solution cost divided by the mean solution cost of KM++.

Given that MSLS-G performs very comparably to MSLS, while it is much faster in practice, we use MSLS-G for the rest of our experiments where we compare to baselines. This allows us to consider higher values of p without compromising the running time much.

Results: Evaluating the quality and performance of the algorithms. In our first experiment we run KM++ followed by 50 iterations of MSLS-G with p = 1, 4, 7, 10 and plot the relative cost w.r.t. KM++ at each iteration, for k = 25. The first row of Figure 2 plots the results. Our experiment shows that, after 50 iterations, MSLS-G for p = 4, 7, 10 achieves improvements of roughly 10% compared to MSLS-G-p=1 and of the order of 20%-30% compared to KM++. We also report the time per iteration that each algorithm takes. For comparison, we report the running time of a single iteration of Lloyd's next to the dataset's name. It is important to notice that, although MSLS-G-p=1 is faster, running more iterations of MSLS-G-p=1 is not sufficient to compete with MSLS-G when p > 1.

Results: Evaluating the quality after postprocessing using Lloyd's. In our second experiment, we use KM++ and MSLS-G as seeding algorithms for Lloyd's and measure how much of the performance improvement measured in the first experiment is retained after running Lloyd's. First, we initialize our centers using KM++ and then run 15 iterations of MSLS-G for p = 1, 4, 7. We measure the cost achieved by running 10 iterations of Lloyd's starting from the solutions found by MSLS-G as well as by KM++. In Figure 2 (second row) we plot the results. Notice that, according to the running times from the first experiment, 15 iterations of MSLS-G take less time than 10 iterations of Lloyd's for p = 4, 7 (and also for p = 10, except on RNA). We observe that MSLS-G for p > 1 performs at least as well as SSLS from Lattanzi and Sohler [2019] and in some cases maintains non-trivial improvements.

Results: Evaluating the quality and performance of the algorithms against a fixed deadline.
In this experiment we run KM++ followed by MSLS-G with p = 1, 4, 7, 10 for a set of fixed amounts of time. This setting allows the versions of MSLS-G with smaller swap size to perform more iterations compared to the versions of the algorithm with a larger swap size, as a smaller swap size leads to a lower running time per iteration. Let τ be the average time that Lloyd's algorithm requires to complete a single iteration on a specific instance. We plot the cost of the solution produced by each algorithm after running for time $\lambda \cdot \tau$, for each $\lambda \in \{1, \dots, 20\}$, in Figure 3. Our experiment shows that MSLS-G for p = 4, 7, 10 achieves improvements of more than 5% compared to MSLS-G-p=1 even when compared against a fixed running time, and of the order of 20%-30% compared to KM++.

Figure 2: The first row compares the cost of MSLS-G, for p ∈ {1, 4, 7, 10}, divided by the mean cost of KM++ at each LS step, for k = 25. The legend also reports the running time of MSLS-G per LS step (in seconds). The second row compares the cost after each of the 10 iterations of Lloyd's with seeding from MSLS-G, for p ∈ {1, 4, 7, 10} and 15 local search steps, and from KM++, for k = 25.

Figure 3: Comparison of the cost produced by MSLS-G, for p ∈ {1, 4, 7, 10} and k = 25, on the datasets KDD-BIO and KDD-PHY, divided by the mean cost of KM++, after running for fixed amounts of time expressed as multiples of the average time of an iteration of Lloyd's algorithm (i.e., for deadlines that are 1, ..., 20 times the average time of an iteration of Lloyd's).

Conclusion and Future Directions

We present a new algorithm for the k-means problem and show, both theoretically and experimentally, that it outperforms state-of-the-art practical algorithms with provable guarantees in terms of solution quality. A very interesting open question is to improve our local search procedure by avoiding the exhaustive search over all possible size-p subsets of centers to swap out; concretely, to design an algorithm with running time $O(2^{\mathrm{poly}(1/\varepsilon)} \cdot ndk)$.

Acknowledgements. This work was partially done while Lorenzo Beretta was a Research Student at Google Research. Moreover, Lorenzo Beretta receives funding from the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No. 801199.

References

KDD Cup. 2004. URL http://osmot.cs.cornell.edu/kddcup/datasets.html.

Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. SIAM Journal on Computing, 49(4):FOCS17-97-FOCS17-156, 2019.

David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027-1035, 2007.

David Arthur and Sergei Vassilvitskii. Worst-case and smoothed analysis of the ICP algorithm, with an application to the k-means method. SIAM J. Comput., 39(2):766-782, 2009. doi: 10.1137/070683921. URL http://dx.doi.org/10.1137/070683921.

Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristics for k-median and facility location problems. SIAM J. Comput., 33(3):544-562, 2004. doi: 10.1137/S0097539702416402. URL https://doi.org/10.1137/S0097539702416402.

Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and Ali Kemal Sinop. The hardness of approximation of euclidean k-means.
In Lars Arge and János Pach, editors, 31st International Symposium on Computational Geometry, So CG 2015, June 22-25, 2015, Eindhoven, The Netherlands, volume 34 of LIPIcs, pages 754 767. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2015. doi: 10.4230/LIPIcs.SOCG.2015.754. URL https://doi.org/10.4230/LIPIcs.SOCG.2015.754. Sayan Bandyapadhyay and Kasturi Varadarajan. On variants of k-means clustering. In 32nd International Symposium on Computational Geometry (So CG 2016). Schloss Dagstuhl-Leibniz Zentrum fuer Informatik, 2016. Luca Becchetti, Marc Bury, Vincent Cohen-Addad, Fabrizio Grandoni, and Chris Schwiegelshohn. Oblivious dimension reduction for k-means: beyond subspaces and the johnson-lindenstrauss lemma. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 1039 1050, 2019. doi: 10.1145/3313276. 3316318. URL https://doi.org/10.1145/3313276.3316318. Anup Bhattacharya, Dishant Goyal, Ragesh Jaiswal, and Amit Kumar. On sampling based algorithms for k-means. In Nitin Saxena and Sunil Simon, editors, 40th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, FSTTCS 2020, December 14-18, 2020, BITS Pilani, K K Birla Goa Campus, Goa, India (Virtual Conference), volume 182 of LIPIcs, pages 13:1 13:17. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020. doi: 10. 4230/LIPIcs.FSTTCS.2020.13. URL https://doi.org/10.4230/LIPIcs.FSTTCS.2020.13. Moses Charikar and Sudipto Guha. Improved combinatorial algorithms for facility location problems. SIAM J. Comput., 34(4):803 824, 2005. doi: 10.1137/S0097539701398594. URL https: //doi.org/10.1137/S0097539701398594. Davin Choo, Christoph Grunau, Julian Portmann, and Vaclav Rozhon. k-means++: few more steps yield constant approximation. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1909 1917. PMLR, 2020. URL https://proceedings.mlr.press/v119/ choo20a.html. Vincent Cohen-Addad. A fast approximation scheme for low-dimensional k-means. In Artur Czumaj, editor, Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pages 430 440. SIAM, 2018. doi: 10.1137/1.9781611975031.29. URL https://doi.org/10.1137/1.9781611975031.29. Vincent Cohen-Addad and Karthik C. S. Inapproximability of clustering in lp metrics. In David Zuckerman, editor, 60th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2019, Baltimore, Maryland, USA, November 9-12, 2019, pages 519 539. IEEE Computer Society, 2019. doi: 10.1109/FOCS.2019.00040. URL https://doi.org/10.1109/FOCS.2019.00040. Vincent Cohen-Addad and Claire Mathieu. Effectiveness of local search for geometric optimization. In 31st International Symposium on Computational Geometry, So CG 2015, June 22-25, 2015, Eindhoven, The Netherlands, pages 329 343, 2015. doi: 10.4230/LIPIcs.SOCG.2015.329. URL http://dx.doi.org/10.4230/LIPIcs.SOCG.2015.329. Vincent Cohen-Addad and Chris Schwiegelshohn. On the local structure of stable clustering instances. In Chris Umans, editor, 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, pages 49 60. IEEE Computer Society, 2017. doi: 10.1109/FOCS.2017.14. URL https://doi.org/10.1109/FOCS.2017.14. Vincent Cohen-Addad, Philip N. Klein, and Claire Mathieu. 
Local search yields approximation schemes for k-means and k-median in euclidean and minor-free metrics. SIAM J. Comput., 48(2): 644 667, 2019. doi: 10.1137/17M112717X. URL https://doi.org/10.1137/17M112717X. Vincent Cohen-Addad, Karthik C. S., and Euiwoong Lee. On approximability of clustering problems without candidate centers. In Dániel Marx, editor, Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms, SODA 2021, Virtual Conference, January 10 - 13, 2021, pages 2635 2648. SIAM, 2021. doi: 10.1137/1.9781611976465.156. URL https://doi.org/10.1137/1. 9781611976465.156. Vincent Cohen-Addad, David Saulpic, and Chris Schwiegelshohn. Improved coresets and sublinear algorithms for power means in euclidean spaces. Advances in Neural Information Processing Systems, 34:21085 21098, 2021. Vincent Cohen-Addad, Hossein Esfandiari, Vahab S. Mirrokni, and Shyam Narayanan. Improved approximations for euclidean k-means and k-median, via nested quasi-independent sets. In Stefano Leonardi and Anupam Gupta, editors, STOC 22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20 - 24, 2022, pages 1621 1628. ACM, 2022a. doi: 10.1145/3519935.3520011. URL https://doi.org/10.1145/3519935.3520011. Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, and Chris Schwiegelshohn. Towards optimal lower bounds for k-median and k-means coresets. In Stefano Leonardi and Anupam Gupta, editors, STOC 22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20 - 24, 2022, pages 1038 1051. ACM, 2022b. doi: 10.1145/3519935.3519946. URL https://doi.org/10.1145/3519935.3519946. Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, Chris Schwiegelshohn, and Omar Ali Sheikh-Omar. Improved coresets for euclidean k-means. In Neur IPS, 2022c. URL http://papers.nips.cc/paper_files/paper/2022/hash/ 120c9ab5c58ba0fa9dd3a22ace1de245-Abstract-Conference.html. Vincent Cohen-Addad, Euiwoong Lee, and Karthik C. S. Johnson coverage hypothesis: Inapproximability of k-means and k-median in ℓp metrics. In Proceedings of the 2022 ACM-SIAM Symposium on Discrete Algorithms, SODA 2022. SIAM, 2022d. D. Feldman and M. Langberg. A unified framework for approximating and clustering data. In STOC, pages 569 578, 2011. Zachary Friggstad, Mohsen Rezapour, and Mohammad R. Salavatipour. Local search yields a PTAS for k-means in doubling metrics. SIAM J. Comput., 48(2):452 480, 2019. doi: 10.1137/ 17M1127181. URL https://doi.org/10.1137/17M1127181. Fabrizio Grandoni, Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Rakesh Venkat. A refined approximation for euclidean k-means. Inf. Process. Lett., 176:106251, 2022. doi: 10.1016/j.ipl.2022.106251. URL https://doi.org/10.1016/j.ipl.2022.106251. Christoph Grunau, Ahmet Alper Özüdogru, Václav Rozhon, and Jakub Tetek. A nearly tight analysis of greedy k-means++. In Nikhil Bansal and Viswanath Nagarajan, editors, Proceedings of the 2023 ACM-SIAM Symposium on Discrete Algorithms, SODA 2023, Florence, Italy, January 2225, 2023, pages 1012 1070. SIAM, 2023. doi: 10.1137/1.9781611977554.ch39. URL https: //doi.org/10.1137/1.9781611977554.ch39. Anupam Gupta and Kanat Tangwongsan. Simpler analyses of local search algorithms for facility location. Co RR, abs/0809.2554, 2008. URL http://arxiv.org/abs/0809.2554. Venkatesan Guruswami and Piotr Indyk. Embeddings and non-approximability of geometric problems. 
In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 12-14, 2003, Baltimore, Maryland, USA., pages 537 538, 2003. URL http://dl.acm.org/ citation.cfm?id=644108.644198. Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted voronoi diagrams and randomization to variance-based k-clustering. In Proceedings of the tenth annual symposium on Computational geometry, pages 332 339, 1994. Ragesh Jaiswal, Amit Kumar, and Sandeep Sen. A simple D 2-sampling based PTAS for k-means and other clustering problems. Algorithmica, 70(1):22 46, 2014. doi: 10.1007/s00453-013-9833-9. URL https://doi.org/10.1007/s00453-013-9833-9. Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Computational Geometry, 28(2):89 112, 2004. ISSN 0925-7721. doi: https://doi.org/10.1016/ j.comgeo.2004.03.003. URL https://www.sciencedirect.com/science/article/pii/ S0925772104000215. Special Issue on the 18th Annual Symposium on Computational Geometry - So CG2002. Madhukar R. Korupolu, C. Greg Plaxton, and Rajmohan Rajaraman. Analysis of a local search heuristic for facility location problems. J. Algorithms, 37(1):146 188, 2000. doi: 10.1006/jagm. 2000.1100. URL http://dx.doi.org/10.1006/jagm.2000.1100. Amit Kumar, Yogish Sabharwal, and Sandeep Sen. Linear-time approximation schemes for clustering problems in any dimensions. J. ACM, 57(2):5:1 5:32, 2010. doi: 10.1145/1667053.1667054. URL https://doi.org/10.1145/1667053.1667054. Silvio Lattanzi and Christian Sohler. A better k-means++ algorithm via local search. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3662 3671. PMLR, 2019. URL https://proceedings.mlr.press/v97/lattanzi19a.html. Euiwoong Lee, Melanie Schmidt, and John Wright. Improved and simplified inapproximability for k-means. Inf. Process. Lett., 120:40 43, 2017. doi: 10.1016/j.ipl.2016.11.009. URL https: //doi.org/10.1016/j.ipl.2016.11.009. SP Lloyd. Least square quantization in pcm. bell telephone laboratories paper. published in journal much later: Lloyd, sp: Least squares quantization in pcm. IEEE Trans. Inform. Theor.(1957/1982), 18, 1957. Konstantin Makarychev, Yury Makarychev, Maxim Sviridenko, and Justin Ward. A bi-criteria approximation algorithm for k-means. In Klaus Jansen, Claire Mathieu, José D. P. Rolim, and Chris Umans, editors, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2016, September 7-9, 2016, Paris, France, volume 60 of LIPIcs, pages 14:1 14:20. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2016. doi: 10.4230/LIPIcs. APPROX-RANDOM.2016.14. URL https://doi.org/10.4230/LIPIcs.APPROX-RANDOM. 2016.14. Konstantin Makarychev, Yury Makarychev, and Ilya P. Razenshteyn. Performance of johnsonlindenstrauss transform for k-means and k-medians clustering. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 1027 1038. ACM, 2019. doi: 10.1145/3313276. 3316350. URL https://doi.org/10.1145/3313276.3316350. Jirí Matousek. On approximate geometric k-clustering. Discrete & Computational Geometry, 24(1):61 84, 2000. doi: 10.1007/s004540010019. URL http://dx.doi.org/10.1007/ s004540010019. 
Andrew V. Uzilov, Joshua M. Keegan, and David H. Mathews. Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change. BMC Bioinformatics, 7(1):1-30, 2006.

Supplementary Material

Proofs from Section 3

Lemma 8. $\sum_{p \in P} \mathrm{cost}(p, C[O^*[p]]) \le 2\,\mathrm{OPT} + \mathrm{ALG} + 2\sqrt{\mathrm{OPT} \cdot \mathrm{ALG}}$.

Proof.
$$\sum_{p \in P} \mathrm{cost}(p, C[O^*[p]]) = \sum_{o_i \in O^*} \sum_{p \in O^*_i} \mathrm{cost}(p, C[o_i]) = \sum_{o_i \in O^*} \Big( |O^*_i|\,\mathrm{cost}(o_i, C[o_i]) + \mathrm{cost}(O^*_i, o_i) \Big)$$
$$= \mathrm{OPT} + \sum_{p \in P} \mathrm{cost}(O^*[p], C[O^*[p]]) \le \mathrm{OPT} + \sum_{p \in P} \mathrm{cost}(O^*[p], C[p]) \le \mathrm{OPT} + \sum_{p \in P} \big( \|O^*[p] - p\| + \|p - C[p]\| \big)^2$$
$$= 2\,\mathrm{OPT} + \mathrm{ALG} + 2 \sum_{p \in P} \|O^*[p] - p\| \cdot \|p - C[p]\| \le 2\,\mathrm{OPT} + \mathrm{ALG} + 2\sqrt{\mathrm{OPT} \cdot \mathrm{ALG}}.$$
The second equality is due to Lemma 1 and the last inequality is due to Cauchy-Schwarz.

Proofs from Section 4

In this section, we prove the following.

Theorem 12. Given a set of $n$ points in $\mathbb{R}^d$ with aspect ratio $\Delta$, there exists an algorithm that computes a $(9 + \varepsilon)$-approximation to k-means in time $nd \cdot k^{O(\varepsilon^{-2})} \cdot \log^{O(\varepsilon^{-1})}(\Delta) \cdot 2^{\mathrm{poly}(\varepsilon^{-1})}$.

We start with a key lemma showing that a sample of size $O(1/\varepsilon)$ is enough to approximate the 1-means problem.

Lemma 13 (From Inaba et al. [1994]). Given an instance $P \subseteq \mathbb{R}^d$, sample $m = 1/(\varepsilon\delta)$ points uniformly at random from $P$ and denote the set of samples by $S$. Then $\mathrm{cost}(P, \mu(S)) \le (1 + \varepsilon)\,\mathrm{cost}(P, \mu(P))$ with probability at least $1 - \delta$.

Proof. We want to prove that with probability $1 - \delta$ we have $\|\mu(S) - \mu(P)\|^2 \le \varepsilon \cdot \mathrm{cost}(P, \mu(P))/|P|$. Then, applying Lemma 1 gives the desired result. First, we notice that $\mu(S)$ is an unbiased estimator of $\mu(P)$, namely $\mathbb{E}[\mu(S)] = \mu(P)$. Then, we have
$$\mathbb{E}\big[\|\mu(S) - \mu(P)\|^2\big] = \frac{1}{m^2} \sum_{i=1}^{m} \mathbb{E}\big[\|s_i - \mu(P)\|^2\big] = \frac{\mathrm{cost}(P, \mu(P))}{m\,|P|},$$
where the $s_i$ are uniform independent samples from $P$. Applying Markov's inequality concludes the proof.

The algorithm that verifies Theorem 12 is very similar to the MSLS algorithm from Section 3 and we use the same notation to describe it. The intuition is that in MSLS we sample $Q = \{q_1, \dots, q_p\}$ hoping that $q_i \in \mathrm{core}(O^*_i)$ for each $i$; here we refine $q_i$ to a better approximation $\hat o_i$ of $o_i$ and swap in the points $(\hat o_i)_i$ rather than $(q_i)_i$. Our points $\hat o_i$ are generated by taking the average of some sampled points, thus we possibly have $\hat o_i \notin P$ while, on the other hand, $q_i \in P$.

A (9 + ε)-approximation MSLS algorithm. First, we initialize our set of centers using k-means++. Then, we run $k^{O(\varepsilon^{-2})} \cdot 2^{\mathrm{poly}(\varepsilon^{-1})}$ local search steps, where a local search step works as follows. Set $p = \Theta(\varepsilon^{-1})$. We $D^2$-sample a set $Q = \{q_1, \dots, q_p\}$ of points from $P$ (without updating costs). Then, we iterate over all possible sets $Out = \{c_1, \dots, c_p\}$ of $p$ distinct elements in $C \cup Q$. We define the set of temporary centers $T = (C \cup Q) \setminus Out$ and run a subroutine APX-CENTERS($T$) which returns a list of $2^{\mathrm{poly}(\varepsilon^{-1})} \log^{O(\varepsilon^{-1})}(\Delta)$ size-$s$ sets $\widehat{In} = \{\hat o_1, \dots, \hat o_s\}$ (where $s = |Q \setminus Out|$). We select the set $\widehat{In}$ in this list such that the swap $(\widehat{In}, Out \setminus Q)$ yields the maximum cost reduction. Then we select the set $Out$ that maximizes the cost reduction obtained in this way. If $(\widehat{In}, Out \setminus Q)$ actually reduces the cost, then we perform that swap.

A subroutine to approximate optimal centers. Here we describe the subroutine APX-CENTERS($T$). Let $Q \setminus Out = \{q_1, \dots, q_s\}$. Recall that $s \le p = O(\varepsilon^{-1})$. This subroutine outputs a list of $2^{\mathrm{poly}(\varepsilon^{-1})} \log^{O(\varepsilon^{-1})}(\Delta)$ size-$s$ sets $\widehat{In} = \{\hat o_1, \dots, \hat o_s\}$. Here we describe how to find a list of $2^{\mathrm{poly}(\varepsilon^{-1})} \log(\Delta)$ values for $\hat o_1$. The same applies to $\hat o_2, \dots, \hat o_s$, and taking the Cartesian product yields a list of $2^{\mathrm{poly}(\varepsilon^{-1})} \log^{O(\varepsilon^{-1})}(\Delta)$ size-$s$ sets. Assume wlog that the pairwise distances between points in $P$ lie in $[1, \Delta]$. We iterate over all possible values of $\rho_1 \in \{1, (1+\varepsilon), \dots, (1+\varepsilon)^{\log_{1+\varepsilon} \Delta}\}$.
We partition $P$ into three sets: the set of far points $F = \{x \in P \mid \mathrm{cost}(x, q_1) > \rho_1^2/\varepsilon^3\}$, the set of close points $\mathcal{C} = \{x \in P \setminus F \mid \mathrm{cost}(x, T) \le \varepsilon^3\rho_1^2\}$, and the set of nice points $N = P \setminus (\mathcal{C} \cup F)$. Then, we sample uniformly from $N$ a set $S$ of size $\Theta(\varepsilon^{-1})$. For each $(s+1)$-tuple of coefficients
$$\alpha_0, \alpha_1, \dots, \alpha_s \in \big\{1, (1-\varepsilon), (1-\varepsilon)^2, \dots, (1-\varepsilon)^{\log_{1-\varepsilon}(\varepsilon^7)}\big\} \cup \{0\}$$
we output the candidate solution given by the convex combination
$$\hat o_1 = \hat o_1(\alpha_0, \dots, \alpha_s) = \frac{\alpha_0\,\mu(S) + \sum_{i=1}^{s} \alpha_i q_i}{\sum_{i=0}^{s} \alpha_i}, \qquad (3)$$
so, for each value of $\rho_1$, we output $2^{\mathrm{poly}(\varepsilon^{-1})}$ values for $\hat o_1$; hence $2^{\mathrm{poly}(\varepsilon^{-1})} \log(\Delta)$ values in total.

The key insight in the analysis of the MSLS algorithm from Section 3 was that every $q_i$ was a proxy for $o_i$ because $q_i \in \mathrm{core}(O^*_i)$, and thus $q_i$ provided a good center for $O^*_i$. In the analysis of this improved version of MSLS we replace $q_i$ with $\hat o_i$, which is a better center for $O^*_i$. Formally, having fixed $Out$, we say that a point $\hat o_i$ is a perfect approximation of $o_i$ when
$$\mathrm{cost}(O^*_i, (C \cup \{\hat o_i\}) \setminus Out) \le (1 + \varepsilon)\,\mathrm{OPT}_i + \varepsilon\,\mathrm{OPT}/k.$$
We define $\widetilde{O}$ and $\widetilde{C}$ as in Section 3, except that we replace $\delta$ with $\varepsilon$ (which here is not assumed to be a constant). Likewise, we build the set $\mathcal{S}$ of ideal multi-swaps as in Section 3. Recall that we say that a multi-swap $(In, Out)$ is strongly improving if $\mathrm{cost}(P, (C \cup In) \setminus Out) \le (1 - \varepsilon/k)\,\mathrm{cost}(P, C)$. Let $In = \{o_1, \dots, o_s\} \subseteq \widetilde{O}$ and $Out = \{c_1, \dots, c_s\} \subseteq \widetilde{C}$; we overload the definition from Section 3 and say that the ideal multi-swap $(In, Out)$ is good if for every $\widehat{In} = \{\hat o_1, \dots, \hat o_s\}$ such that $\hat o_i$ is a perfect approximation of $o_i$ for each $i = 1 \dots s$, the swap $(\widehat{In}, Out)$ is strongly improving. We call an ideal swap bad otherwise. As in Section 3, we define the core of an optimal center; once again we replace $\delta$ with $\varepsilon$, which is no longer constant. The two following lemmas are our stepping stones towards Theorem 12.

Lemma 14. If $\mathrm{ALG}/\mathrm{OPT} > 9 + O(\varepsilon)$ then, with probability $k^{-O(\varepsilon^{-1})} \cdot 2^{-\mathrm{poly}(\varepsilon^{-1})}$, there exists $Out \subseteq C \cup Q$ such that: (i) if $Q \setminus Out = \{q_1, \dots, q_s\}$ then $q_1 \in \mathrm{core}(O^*_1), \dots, q_s \in \mathrm{core}(O^*_s)$ for some $o_1, \dots, o_s \in O^*$; (ii) if we define $In = \{o_1, \dots, o_s\}$ then $(In, Out \setminus Q)$ is a good ideal swap.

Lemma 15. If (i) from Lemma 14 holds, then with probability $k^{-O(\varepsilon^{-2})} \cdot 2^{-\mathrm{poly}(\varepsilon^{-1})}$ the list returned by APX-CENTERS contains $\widehat{In} = \{\hat o_1, \dots, \hat o_s\}$ such that $\hat o_i$ is a perfect approximation of $o_i$ for each $i = 1 \dots s$.

Proof of Theorem 12. Here we prove that our improved MSLS algorithm achieves a $(9 + O(\varepsilon))$-approximation, which is equivalent to Theorem 12 up to rescaling $\varepsilon$. Combining Lemma 14 and Lemma 15 we obtain that, as long as $\mathrm{ALG}/\mathrm{OPT} > 9 + O(\varepsilon)$, with probability at least $k^{-O(\varepsilon^{-2})} \cdot 2^{-\mathrm{poly}(\varepsilon^{-1})}$ the list returned by APX-CENTERS contains $\widehat{In} = \{\hat o_1, \dots, \hat o_s\}$ such that $(\widehat{In}, Out \setminus Q)$ is strongly improving. If this happens, we call such a local step successful. Now the proof goes exactly as the proof of Theorem 3. Indeed, we show that $k^{O(\varepsilon^{-2})} \cdot 2^{\mathrm{poly}(\varepsilon^{-1})}$ local steps suffice to obtain $\Omega(k \log\log k/\varepsilon)$ successful local steps, and thus to obtain the desired approximation ratio, with constant probability. To prove the running time bound it is sufficient to notice that a local search step can be performed in time $nd \cdot \log^{O(\varepsilon^{-1})}(\Delta) \cdot 2^{\mathrm{poly}(\varepsilon^{-1})}$.

In the rest of this section, we prove Lemma 14 and Lemma 15.

Observation 16. If we assume $\delta = \varepsilon$ non-constant in Lemma 2, then performing the computations explicitly we obtain $\Pr[q \in \mathrm{core}(O^*_i)] \ge \mathrm{poly}(\varepsilon)$.

In order to prove Lemma 14, we first prove two lemmas: Lemma 17 is the analogue of Lemma 10 and Lemma 18 is the analogue of Lemma 11.
Overloading once again the definition from Section 3, we define $G$ as the union of the cores of good optimal centers in $\widetilde{O}$, where an optimal center is defined to be good if at least one of the ideal multi-swaps in $\mathcal{S}$ it belongs to is good (exactly as in Section 3).

Lemma 17. If an ideal swap $(In, Out)$ is bad, then we have
$$\mathrm{cost}(O^*_{In}, C) \le (1 + \varepsilon)\,\mathrm{cost}(O^*_{In}, O^*) + \mathrm{Reassign}(In, Out) + \varepsilon\,\mathrm{ALG}/k. \qquad (4)$$

Proof. Let $In = \{o_1, \dots, o_s\}$ and $\widehat{In} = \{\hat o_1, \dots, \hat o_s\}$ such that $\hat o_i$ is a perfect approximation of $o_i$ for each $i = 1 \dots s$. Recall that $O^*_{In} := \bigcup_{i=1}^{s} O^*_i$; then
$$\mathrm{cost}\big(O^*_{In}, (C \cup \widehat{In}) \setminus Out\big) \le \sum_{i=1}^{s} \mathrm{cost}\big(O^*_i, (C \cup \{\hat o_i\}) \setminus Out\big) \le (1 + \varepsilon)\,\mathrm{cost}(O^*_{In}, O^*). \qquad (5)$$
Moreover, $\mathrm{Reassign}(In, Out) = \mathrm{cost}(P \setminus O^*_{In}, C \setminus Out) - \mathrm{cost}(P \setminus O^*_{In}, C)$ because points in $P \setminus C_{Out}$ are not affected by the swap. Therefore,
$$\mathrm{cost}\big(P, (C \cup \widehat{In}) \setminus Out\big) \le (1 + \varepsilon)\,\mathrm{cost}(O^*_{In}, O^*) + \mathrm{Reassign}(In, Out) + \mathrm{cost}(P \setminus O^*_{In}, C).$$
Suppose by contradiction that Equation (4) does not hold; then
$$\mathrm{cost}(P, C) - \mathrm{cost}\big(P, (C \cup \widehat{In}) \setminus Out\big) = \mathrm{cost}(P \setminus O^*_{In}, C) + \mathrm{cost}(O^*_{In}, C) - \mathrm{cost}\big(P, (C \cup \widehat{In}) \setminus Out\big) \ge \varepsilon\,\mathrm{ALG}/k.$$
Hence, $(\widehat{In}, Out)$ is strongly improving, and this holds for any choice of $\widehat{In}$: a contradiction.

Lemma 18. If $\mathrm{ALG}/\mathrm{OPT} > 9 + O(\varepsilon)$ then $\mathrm{cost}(G, C) \ge \mathrm{cost}(P, C) \cdot \mathrm{poly}(\varepsilon)$. Thus, if we $D^2$-sample $q$ we have $\Pr[q \in G] \ge \mathrm{poly}(\varepsilon)$.

Proof. First, we observe that the combined current cost of all optimal clusters in $O^* \setminus \widetilde{O}$ is at most $k \cdot \varepsilon\,\mathrm{ALG}/k = \varepsilon\,\mathrm{ALG}$. Now, we prove that the combined current cost of all $O^*_i$ such that $o_i$ is bad is at most $(1 - 2\varepsilon)\,\mathrm{ALG}$. Suppose, by contradiction, that this is not the case; then we have:
$$(1 - 2\varepsilon)\,\mathrm{ALG} < \sum_{\text{bad } o_i \in \widetilde{O}} \mathrm{cost}(O^*_i, C) \le \sum_{\text{bad } (In, Out) \in \mathcal{S}} w(In, Out)\,\mathrm{cost}(O^*_{In}, C)$$
$$\le \sum_{\text{bad } (In, Out)} w(In, Out)\,\big((1 + \varepsilon)\,\mathrm{cost}(O^*_{In}, O^*) + \mathrm{Reassign}(In, Out) + \varepsilon\,\mathrm{ALG}/k\big)$$
$$\le (1 + \varepsilon)\,\mathrm{OPT} + (2 + 2/p)\,\mathrm{OPT} + (2 + 2/p)\sqrt{\mathrm{OPT} \cdot \mathrm{ALG}} + \varepsilon\,\mathrm{ALG}.$$
The second and last inequalities make use of Observation 7. The third inequality uses Lemma 17. Setting $\eta^2 = \mathrm{ALG}/\mathrm{OPT}$ we obtain the inequality $\eta^2 - (2 + 2/p - O(\varepsilon))\eta - (3 + 2/p - O(\varepsilon)) \le 0$. Hence, we obtain a contradiction in the previous argument as long as $\eta^2 - (2 + 2/p - O(\varepsilon))\eta - (3 + 2/p - O(\varepsilon)) > 0$, which holds for $p = \Theta(\varepsilon^{-1})$ and $\eta^2 = 9 + O(\varepsilon)$. A contradiction there implies that at least an $\varepsilon$-fraction of the current cost is due to points in $\bigcup_{\text{good } o_i \in \widetilde{O}} O^*_i$. Thanks to Observation 16, we have $\Pr_{q \sim \mathrm{cost}(q, C)}[q \in \mathrm{core}(O^*_i) \mid q \in O^*_i] \ge \mathrm{poly}(\varepsilon)$. Therefore, we can conclude that the current cost of $G = \bigcup_{\text{good } o_i \in \widetilde{O}} \mathrm{core}(O^*_i)$ is at least a $\mathrm{poly}(\varepsilon)$-fraction of the total current cost.

Proof of Lemma 14. Thanks to Lemma 18, we have that $\Pr[q_1 \in G] \ge \mathrm{poly}(\varepsilon)$. Whenever $q_1 \in G$ we have that $q_1 \in \mathrm{core}(O^*_1)$ for some good $o_1$. Then, for some $s \le p$ we can complete $o_1$ with $o_2, \dots, o_s$ such that $In = \{o_1, \dots, o_s\}$ belongs to a good swap. Concretely, there exists $Out \subseteq C$ such that $(In, Out)$ is a good swap. Since $In \subseteq \widetilde{O}$ we have $\mathrm{cost}(O^*_i, C) > \varepsilon\,\mathrm{OPT}/k$ for all $o_i \in In$, which combined with Observation 16 gives that, for each $i = 2 \dots s$, $\Pr[q_i \in \mathrm{core}(O^*_i)] \ge \mathrm{poly}(\varepsilon)/k$. Hence, we have $\Pr[q_i \in \mathrm{core}(O^*_i) \text{ for } i = 1 \dots s] \ge 2^{-\mathrm{poly}(\varepsilon^{-1})}\,k^{-O(\varepsilon^{-1})}$. Notice, however, that $(\widehat{In}, Out)$ is an $s$-swap and we may have $s < p$. Nevertheless, whenever we sample $q_1, \dots, q_s$ followed by any sequence $q_{s+1}, \dots, q_p$ it is enough to choose $Out' = Out \cup \{q_{s+1}, \dots, q_p\}$ to obtain that $(\{q_1, \dots, q_p\}, Out')$ is an improving $p$-swap.

In order to prove Lemma 15 we first need a few technical lemmas.

Lemma 19 (Lemma 2 from Lattanzi and Sohler [2019]). For each $x, y, z \in \mathbb{R}^d$ and $\varepsilon > 0$, $\mathrm{cost}(x, y) \le (1 + \varepsilon)\,\mathrm{cost}(x, z) + (1 + 1/\varepsilon)\,\mathrm{cost}(z, y)$.

Lemma 20. Given $q \in \mathbb{R}^d$ and $Z \subseteq \mathbb{R}^d$ such that $\mathrm{cost}(Z, q) \le \varepsilon^2\Gamma$, then for each $o \in \mathbb{R}^d$
$$(1 - O(\varepsilon))\,\mathrm{cost}(Z, o) - O(\varepsilon)\,\Gamma \;\le\; |Z| \cdot \mathrm{cost}(q, o) \;\le\; (1 + O(\varepsilon))\,\mathrm{cost}(Z, o) + O(\varepsilon)\,\Gamma.$$

Proof.
To obtain the first inequality, we apply Lemma 19 to bound $\mathrm{cost}(z, o) \le (1+\varepsilon)\,\mathrm{cost}(q, o) + (1 + 1/\varepsilon)\,\mathrm{cost}(z, q)$ for each $z \in Z$. To obtain the second inequality, we bound $\mathrm{cost}(q, o) \le (1+\varepsilon)\,\mathrm{cost}(z, o) + (1 + 1/\varepsilon)\,\mathrm{cost}(z, q)$ for each $z \in Z$. In both cases, summing over $z \in Z$ and using $\mathrm{cost}(Z, q) \le \varepsilon^2 \Gamma$ gives the claim.

Lemma 21. Let $X = \{x_1, \dots, x_\ell\}$ be a weighted set of points in $\mathbb{R}^d$ such that $x_i$ has weight $w_i$. Let $\mu$ be the weighted average of $X$ and let $\hat{\mu} = \hat{\mu}(\alpha_1, \dots, \alpha_\ell)$ be the weighted average of $X$ where $x_i$ has weight $\alpha_i$. If $w_i \le \alpha_i \le w_i/(1-\varepsilon)$ for each $i = 1, \dots, \ell$, then, interpreting $\mathrm{cost}(X, \cdot)$ as $\sum_{x_i \in X} w_i\,\mathrm{cost}(x_i, \cdot)$, we have $\mathrm{cost}(X, \hat{\mu}) \le (1 + O(\varepsilon))\,\mathrm{cost}(X, \mu)$.

Proof. We note that $\mu$ minimizes the expression $\mathrm{cost}(X, z)$ over $z$. Moreover, $\mathrm{cost}(X, z) \le \sum_{i=1}^{\ell} \alpha_i\,\mathrm{cost}(x_i, z) \le \mathrm{cost}(X, z)/(1-\varepsilon)$. Since $\hat{\mu}$ minimizes the expression $\sum_{i=1}^{\ell} \alpha_i\,\mathrm{cost}(x_i, z)$, it must be that $\mathrm{cost}(X, \hat{\mu}) \le \mathrm{cost}(X, \mu)/(1-\varepsilon)$.

Adopting the same proof strategy, we obtain the following.

Observation 22. Thanks to Lemma 20, we can assume that the points in $Z$ are concentrated in $q$ for the purpose of computing a $(1 + O(\varepsilon))$-approximation to the 1-means problem on $Z$, whenever an additive error of $O(\varepsilon)\Gamma$ is tolerable. Indeed, moving all points of $Z$ to $q$ introduces a $(1 + O(\varepsilon))$ multiplicative error on $\mathrm{cost}(Z, \cdot)$ and an $O(\varepsilon)\Gamma$ additive error.

The next lemma shows that a point $z$ that is far from a center $o$ experiences a small variation of $\mathrm{cost}(z, o)$ when the position of $o$ is slightly perturbed.

Lemma 23. Given $o, z \in \mathbb{R}^d$ such that $\|o - z\| \ge r/\varepsilon$, we have that for every $o' \in B(o, r)$, $\mathrm{cost}(z, o') = (1 \pm O(\varepsilon))\,\mathrm{cost}(z, o)$.

Proof. It is enough to prove the statement for all $o'$ that lie on the line $L$ passing through $o$ and $z$: any other point $o'' \in B(o, r)$ admits a point $o' \in B(o, r) \cap L$ with $\|o' - z\| = \|o'' - z\|$. It then suffices to compute the derivative of $\mathrm{cost}(z, \cdot)$ along the direction of $L$ and observe that, at any point $y \in B(o, r)$, its magnitude is $2\|z - y\| = (1 \pm O(\varepsilon)) \cdot 2\|z - o\|$. Thus, $\mathrm{cost}(z, o') = \mathrm{cost}(z, o) \pm O(r\,\|z - o\|) = (1 \pm O(\varepsilon))\,\mathrm{cost}(z, o)$, where the last equality uses $r \le \varepsilon\|z - o\|$.

Proof of Lemma 15. Here we prove that, for each $o_1, \dots, o_s$, there exist coefficients $\alpha^{(i)}_0, \dots, \alpha^{(i)}_s \in \{1, (1-\varepsilon), \dots, (1-\varepsilon)^{\log_{1-\varepsilon}(\varepsilon^7)}\} \cup \{0\}$ such that the convex combination $\hat{o}_i = \hat{o}_i(\alpha^{(i)}_0, \dots, \alpha^{(i)}_s)$ is a perfect approximation of $o_i$, with probability $k^{-O(\varepsilon^{-2})} \cdot 2^{-\mathrm{poly}(\varepsilon^{-1})}$. Wlog, we show this for $o_1$ only. Concretely, we want to show that, with probability $k^{-O(\varepsilon^{-1})} \cdot 2^{-\mathrm{poly}(\varepsilon^{-1})}$, there exist coefficients $\alpha_0, \dots, \alpha_s$ such that $\hat{o}_1 = \hat{o}_1(\alpha_0, \dots, \alpha_s)$ satisfies $\mathrm{cost}(O^*_1, (C \cup \{\hat{o}_1\}) \setminus \mathrm{Out}) \le (1 + O(\varepsilon))\,\mathrm{OPT}_1 + O(\varepsilon)\,\mathrm{OPT}/k$. Taking the joint probability of these events for each $i = 1, \dots, s$ we obtain the success probability $k^{-O(\varepsilon^{-2})} \cdot 2^{-\mathrm{poly}(\varepsilon^{-1})}$. Note that we are supposed to prove that $\mathrm{cost}(O^*_1, (C \cup \{\hat{o}_1\}) \setminus \mathrm{Out}) \le (1 + \varepsilon)\,\mathrm{OPT}_1 + \varepsilon\,\mathrm{OPT}/k$; however, we prove a weaker version where $\varepsilon$ is replaced by $O(\varepsilon)$, which is in fact equivalent up to rescaling $\varepsilon$.

Similarly to $C[\cdot]$ and $O^*[\cdot]$, define $T[p]$ as the closest center to $p$ in $T$. Denote by $C_1$, $F_1$ and $N_1$ the intersections of $O^*_1$ with $C$, $F$ and $N$ (the close, far and nice points defined above), respectively. In what follows we define the values of $\alpha_0, \dots, \alpha_s$ that define $\hat{o}_1 = \hat{o}_1(\alpha_0, \dots, \alpha_s)$ and exhibit an assignment of the points of $O^*_1$ to centers in $(C \cup \{\hat{o}_1\}) \setminus \mathrm{Out}$ of cost at most $(1 + O(\varepsilon))\,\mathrm{OPT}_1 + O(\varepsilon)\,\mathrm{OPT}/k$. Recall that we assume $q_i \in \mathrm{core}(o_i)$ for each $i = 1, \dots, s$. When we assign values to the coefficients $(\alpha_i)_i$, it is understood that if the final value we choose for $\alpha_i$ is $v$, then we rather set $\alpha_i$ to the smallest power of $(1-\varepsilon)$ that is larger than $v$ if $v > \varepsilon^7$, and to $0$ otherwise. We will see in the end that this restriction on the values of the $\alpha_i$ does not impact our approximation. In what follows, we will assign the points of $O^*_1$ to $C \setminus \mathrm{Out}$ if this can be done inexpensively.
If it cannot, then we will assign those points to $\hat{o}_1$. In order to compute a good value for $\hat{o}_1$ we need an estimate of the average of the points assigned to $\hat{o}_1$. For points in $N_1$, computing this average is doable (leveraging Lemma 13), while for points in $O^*_1 \setminus N_1$ we show that either their contribution is negligible or we can collapse them so as to coincide with some $q_i \in Q$ without affecting our approximation. The coefficient $\alpha_i$, for $i \ge 1$, represents the fraction of points of $O^*_1$ that is collapsed onto $q_i$; $\alpha_0$ represents the fraction of points of $O^*_1$ whose average we estimate by $\mu(S)$. Thus, Equation (3) defines $\hat{o}_1$ as the weighted average of the points $q_i$, where the weights are the (approximate) fractions of points collapsed onto each $q_i$, together with the average $\mu(S)$ and its associated weight $\alpha_0$.

Points in $C_1$. All points $p \in C_1$ such that $T[p] \notin Q$ can be assigned to $T[p] \in C \setminus \mathrm{Out}$, incurring a total cost of at most $\varepsilon^6\,\mathrm{OPT}_1$ by the definition of $C_1$. Given a point $p \in C_1$ with $T[p] \in Q$, we might have $T[p] \notin C \setminus \mathrm{Out}$ and thus we cannot assign $p$ to $T[p]$. Denote by $W$ the set of points $p$ with $T[p] \in Q$. Our goal is now to approximate $\mu(W)$. In order to do that, we will move each $p \in W$ so as to coincide with $q_i = T[p]$. We can partition $W$ into $W_1, \dots, W_s$ so that $T[z] = q_i$ for each $z \in W_i$. If $p \in W_i$ then we have $\|p - q_i\|^2 \le \varepsilon^3 \rho_1^2$. Hence, thanks to Observation 22, we can consider the points in $W_i$ as if they were concentrated in $q_i$ while losing at most an additive $O(\varepsilon)\,\mathrm{OPT}_1$ and a multiplicative factor $(1+\varepsilon)$ on their cost. For $i = 1, \dots, s$, set $\alpha_i \leftarrow |W_i|/|O^*_1|$. In this way, $\sum_{i=1}^{s} \alpha_i q_i / \sum_{i=1}^{s} \alpha_i$ is an approximate solution to 1-means on $W$, up to a multiplicative factor $(1+\varepsilon)$ and an additive $O(\varepsilon)\,\mathrm{OPT}_1$.

Points in $N_1$. Consider the two cases: (i) $\mathrm{cost}(N_1, T) > \varepsilon^2\,\mathrm{OPT}/k$; (ii) $\mathrm{cost}(N_1, T) \le \varepsilon^2\,\mathrm{OPT}/k$.

Case (i). We show that in this case $\mu(S)$ is a $(1+\varepsilon)$-approximation for 1-means on $N_1$, with probability $k^{-O(\varepsilon^{-1})} \cdot 2^{-\mathrm{poly}(\varepsilon^{-1})}$. First, notice that if we condition on $S \subseteq N_1$, then Lemma 13 gives that $\mu(S)$ is a $(1+\varepsilon)$-approximation for 1-means on $N_1$ with constant probability. Thus, we are left to prove that $S \subseteq N_1$ with probability $k^{-O(\varepsilon^{-1})} \cdot 2^{-\mathrm{poly}(\varepsilon^{-1})}$. We have $\Pr_{p \sim \mathrm{cost}(p, T)}[p \in N_1 \mid p \in N] \ge \varepsilon^2/k$; however, the costs w.r.t. $T$ of points in $N$ vary by at most a $\mathrm{poly}(\varepsilon^{-1})$ factor, thus $\Pr_{p \sim \mathrm{Unif}}[p \in N_1 \mid p \in N] \ge \mathrm{poly}(\varepsilon)/k$. The probability of $S \subseteq N_1$ is thus at least $(\mathrm{poly}(\varepsilon)/k)^{|S|} = k^{-O(\varepsilon^{-1})} \cdot 2^{-\mathrm{poly}(\varepsilon^{-1})}$. In this case, we set $\alpha_0 \leftarrow |N_1|/|O^*_1|$ because $\mu(S)$ approximates the mean of the entire set $N_1$.

Case (ii). Here we give up on estimating the mean of $N_1$ and set $\alpha_0 \leftarrow 0$. The points $x \in N_1$ such that $T[x] \notin Q$ can be assigned to $T[x]$, incurring a combined cost of at most $\varepsilon^2\,\mathrm{OPT}/k$. We partition the remaining points of $N_1$ into $Z_1, \dots, Z_s$, where each point $x$ is placed in $Z_i$ if $T[x] = q_i$. Now, we collapse the points in $Z_i$ so as to coincide with $q_i$ and show that this does not worsen our approximation factor. In terms of the coefficients $(\alpha_i)_i$, this translates into the updates $\alpha_i \leftarrow \alpha_i + |Z_i|/|O^*_1|$ for each $i = 1, \dots, s$. Indeed, using Observation 22 we can move all points in $Z_i$ to $q_i$ incurring an additive combined cost of $\varepsilon\,\mathrm{OPT}/k$ and a multiplicative factor $1 + O(\varepsilon)$.

Points in $F_1$. Points in $F_1$ are very far from $q_1$ and thus far from $o_1$; hence, even if their contribution to $\mathrm{cost}(O^*_1, o_1)$ might be large, we have $\mathrm{cost}(F_1, o_1) = (1 \pm O(\varepsilon))\,\mathrm{cost}(F_1, o')$ for all $o'$ in a ball of radius $\rho_1/\varepsilon$ centered at $o_1$, thanks to Lemma 23. Let $H$ be the set of points that have not been assigned to centers in $C \setminus \mathrm{Out}$. In particular, $H = W \cup N_1$ if the points in $N_1$ satisfy case (i), and $H = W \cup Z_1 \cup \dots \cup Z_s$ if they satisfy case (ii).
We consider two cases. If $\|\mu(H) - q_1\| \le \rho_1/\varepsilon$, then $\|\mu(H) - o_1\| \le \rho_1(1 + \varepsilon + 1/\varepsilon)$ because $q_1 \in \mathrm{core}(o_1)$. Since for each $f \in F_1$ we have $\|f - o_1\| \ge \|f - q_1\| - (1+\varepsilon)\rho_1 \ge \Omega(\rho_1/\varepsilon^3)$, Lemma 23 gives $\mathrm{cost}(f, o') = (1 \pm O(\varepsilon))\,\mathrm{cost}(f, o_1)$ for each $o'$ in a ball of radius $O(\rho_1/\varepsilon)$ centered at $o_1$, and so in particular for $o' = \mu(H)$. Thus, in this case we can simply disregard all points in $F_1$, and computing $\hat{o}_1$ according to the $(\alpha_i)_i$ defined above yields a perfect approximation of $o_1$.

Else, if $\|\mu(H) - q_1\| > \rho_1/\varepsilon$, a similar argument shows that $\mathrm{cost}(H, o') = (1 \pm O(\varepsilon))\,\mathrm{cost}(H, o_1)$ for each $o'$ in a ball of radius $O(\rho_1)$ centered at $o_1$. Indeed, we can rewrite $\mathrm{cost}(H, o')$ as $|H| \cdot \mathrm{cost}(\mu(H), o') + \mathrm{cost}(\mu(H), H)$. Since $\|\mu(H) - q_1\| > \rho_1/\varepsilon$, the first term varies by at most a factor $(1 + O(\varepsilon))$ and the second term does not depend on $o'$. Thus, in this case $\hat{o}_1 = q_1$ is a perfect approximation of $o_1$ and we simply set $\alpha_1 = 1$ and $\alpha_j = 0$ for $j \neq 1$. In other words, here $\mu(H)$ is too far from $q_1$ (and thus from $o_1$) to significantly influence the position of $\hat{o}_1$, and the same holds for any point in $F_1$. This works, of course, because we assumed $q_1 \in \mathrm{core}(o_1)$.

Discussing the limitations on the coefficients' values. The proof above would work smoothly if we were allowed to set each $\alpha_i$ exactly to the values discussed above, representing the fractions of points of $O^*_1$ captured by the different $q_i$'s. However, to make the algorithm efficient, we limit ourselves to values in $\{1, (1-\varepsilon), \dots, (1-\varepsilon)^{\log_{1-\varepsilon}(\varepsilon^7)}\} \cup \{0\}$. Lemma 21 shows that, as long as the values of the $(\alpha_i)_i$ estimate the frequencies described above up to a factor $1 - O(\varepsilon)$, the approximation error is within a multiplicative factor $1 + O(\varepsilon)$. We are left to take care of the case in which $\alpha_i$ is set to a value smaller than $\varepsilon^7$. We set the $\alpha_i$'s when dealing with points in $C_1 \cup N_1$, and for each $x \in C_1 \cup N_1$ we have, for each $o' \in B(q_1, (1+\varepsilon)\rho_1)$, $\mathrm{cost}(x, o') \le 2\,\mathrm{cost}(q_1, o') + 2\,\mathrm{cost}(x, q_1) = O(\rho_1^2 \varepsilon^{-6})$. Thus, if we simply set $\alpha_i \leftarrow 0$ whenever $\alpha_i < \varepsilon^7$, the combined cost of the points of $O^*_1$ with respect to $o'$ varies by at most $\varepsilon^7 |O^*_1| \cdot O(\rho_1^2 \varepsilon^{-6}) = O(\varepsilon)\,\mathrm{OPT}_1$. Effectively, ignoring these points does not significantly impact the cost; hence solving 1-means while ignoring them yields a $(1 + O(\varepsilon))$-approximate solution to the original problem.

Additional Experimental Evaluation

In this section we report additional experiments whose presentation did not fit in the main body. In particular, we run experiments on the dataset KDD-PHY and for k = 10, 50.

In Figure 4 we compare MSLS-G with MSLS. To perform our experiment, we initialize k = 25 centers using KM++ and then run 50 iterations of local search for both algorithms, for p ∈ {2, 3} swaps. We repeat each experiment 5 times. For ease of comparison, we repeat the plots for the KDD-BIO and RNA datasets that we present in the main body of the paper. Due to the higher running time of MSLS, we perform these experiments on a 1% uniform sample of each of our datasets. We find that the performance of the two algorithms is comparable on all our instances, while they both perform roughly 15%-27% better than k-means++ at convergence.

In Figure 5 we run KM++ followed by 50 iterations of MSLS-G with p = 1, 4, 7, 10 and k = 10, 25, 50 (excluding the degenerate case p = k = 10) and plot the cost relative to KM++ at each iteration. The results for k = 25 on KDD-BIO and RNA can be found in Figure 2. We repeat each experiment 5 times. Our experiments show that, after 50 iterations, MSLS-G with p = 4, 7, 10 achieves improvements of roughly 5-10% compared to MSLS-G with p = 1, and of the order of 20%-40% compared to KM++. These improvements are more prominent for k = 25, 50.
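For reference, the local-search step that these experiments iterate can be sketched as follows: $p$ new candidate centers are $D^2$-sampled, added to the current solution, and then $p$ centers are removed. This is our schematic reading of the multi-swap step being compared, written in plain numpy; the helper names and the greedy removal rule used to mimic the MSLS-G variant are our assumptions, not code from the paper, and the brute-force cost evaluation is meant to be readable rather than fast.

```python
import numpy as np

def kmeans_cost(X, centers):
    """Sum of squared distances from each point to its closest center."""
    return np.min(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1).sum()

def d2_sample(X, centers, p, rng):
    """D^2-sample p points: each x is picked with probability proportional to cost(x, centers)."""
    d2 = np.min(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    idx = rng.choice(len(X), size=p, replace=False, p=d2 / d2.sum())
    return X[idx]

def multi_swap_step(X, centers, p, rng):
    """One local-search step: add p D^2-sampled points, then drop p centers greedily.

    The greedy rule (repeatedly drop the center whose removal hurts the cost least)
    is our stand-in for the 'greedy' variant; it is not taken verbatim from the paper.
    """
    cand = np.vstack([centers, d2_sample(X, centers, p, rng)])  # k + p candidate centers
    for _ in range(p):
        costs = [kmeans_cost(X, np.delete(cand, j, axis=0)) for j in range(len(cand))]
        cand = np.delete(cand, int(np.argmin(costs)), axis=0)
    # Keep the swap only if it improves the current solution.
    return cand if kmeans_cost(X, cand) < kmeans_cost(X, centers) else centers
```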
We also report the time per iteration that each algorithm takes. For comparison, we report the running time of a single iteration of Lloyd's next to the dataset's name. Notice that the experiment on RNA for k = 50 is performed on a 10% uniform sample of the original dataset, due to the high running time.

In Figure 6, we use KM++ and MSLS-G as seeding algorithms for Lloyd's and measure how much of the measured performance improvement is retained after running Lloyd's. First, we initialize our centers using KM++ and then run 15 iterations of MSLS-G for p = 1, 4, 7. We then measure the cost achieved by running 10 iterations of Lloyd's starting from the solutions found by MSLS-G as well as from KM++. We run experiments for k = 10, 25, 50 and we repeat each experiment 5 times. We observe that for k = 25, 50 MSLS-G with p > 1 performs at least as well as SSLS from Lattanzi and Sohler [2019] and in some cases maintains non-trivial improvements. These improvements are not noticeable for k = 10; however, given how Lloyd's behaves for k = 10, we conjecture that k = 10 might be an unnatural number of clusters for our datasets.

Figure 4: Comparison between MSLS and MSLS-G, for p = 2 (left column) and p = 3 (right column), for k = 25, on the datasets KDD-BIO (first row), KDD-PHY (second row) and RNA (third row). The y-axis shows the mean solution cost, over the 5 repetitions of the experiment, divided by the mean solution cost of KM++.

Figure 5: We compare the cost of MSLS-G, for p ∈ {1, 4, 7, 10}, divided by the mean cost of KM++ at each LS step, for k ∈ {10, 25, 50}, excluding the degenerate case p = k = 10. The legend also reports the running time of MSLS-G per LS step (in seconds). The experiments were run on all datasets (KDD-BIO, RNA and KDD-PHY), excluding the case k = 25 for KDD-BIO and RNA, which is reported in the main body of the paper.

Figure 6: We compare the cost after each of the 10 iterations of Lloyd's, with seeding from MSLS-G (for p ∈ {1, 4, 7, 10} and 15 local search steps) and from KM++, for k ∈ {10, 25, 50}. We exclude the degenerate case p = k = 10 and the experiments reported in the main body of the paper.
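The seeding protocol behind Figure 6 can be sketched as follows, assuming scikit-learn's k-means++ initializer and Lloyd's implementation, and using the `multi_swap_step` sketch from the previous code block as a stand-in for the local-search refinement. The function name `seed_then_lloyd` and the dataset path are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans, kmeans_plusplus

def seed_then_lloyd(X, k, p, ls_steps, lloyd_iters=10, seed=0):
    """Seed with k-means++, optionally refine with local-search steps, then run Lloyd's."""
    rng = np.random.default_rng(seed)
    centers, _ = kmeans_plusplus(X, n_clusters=k, random_state=seed)  # KM++ seeding
    for _ in range(ls_steps):
        centers = multi_swap_step(X, centers, p, rng)  # local-search refinement (sketch above)
    lloyd = KMeans(n_clusters=k, init=centers, n_init=1, max_iter=lloyd_iters).fit(X)
    return lloyd.inertia_  # k-means cost after Lloyd's

# Illustrative use, mirroring the relative-cost comparison of Figure 6:
# X = np.loadtxt("kdd_bio_sample.txt")   # hypothetical path
# base = seed_then_lloyd(X, k=25, p=1, ls_steps=0)       # plain KM++ seeding
# refined = seed_then_lloyd(X, k=25, p=4, ls_steps=15)   # KM++ followed by local search
# print(refined / base)
```

With `ls_steps = 0` the routine reduces to plain KM++ seeding followed by Lloyd's, so the ratio of the two returned costs gives the kind of relative-cost comparison reported in Figure 6.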