Adapting k-means Algorithms for Outliers

Christoph Grunau*1  Václav Rozhoň*1

Abstract

This paper shows how to adapt several simple and classical sampling-based algorithms for the k-means problem to the setting with outliers. Recently, Bhaskara et al. (NeurIPS 2019) showed how to adapt the classical k-means++ algorithm to the setting with outliers. However, their algorithm needs to output O(log(k) · z) outliers, where z is the number of true outliers, to match the O(log k)-approximation guarantee of k-means++. In this paper, we build on their ideas and show how to adapt several sequential and distributed k-means algorithms to the setting with outliers, but with substantially stronger theoretical guarantees: our algorithms output (1 + ε)z outliers while achieving an O(1/ε)-approximation to the objective function. In the sequential world, we achieve this by adapting a recent algorithm of Lattanzi and Sohler (ICML 2019). In the distributed setting, we adapt a simple algorithm of Guha et al. (IEEE Trans. Knowl. and Data Engineering 2003) and the popular k-means|| of Bahmani et al. (PVLDB 2012). A theoretical application of our techniques is an algorithm with running time Õ(nk²/z)¹ that achieves an O(1)-approximation to the objective function while outputting O(z) outliers, assuming k ≤ z ≤ n. This is complemented with a matching lower bound of Ω(nk²/z) for this problem in the oracle model.

*Equal contribution. ¹ETH Zürich. Correspondence to: Christoph Grunau, Václav Rozhoň. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

¹Õ(·) hides polylogarithmic factors in n. We assume that the minimum distance between any two points is 1 and the maximum distance between any two points is bounded by poly(n). Our algorithms work in general metric spaces and we assume a unit cost to evaluate the distance between two points.

1. Introduction

Clustering is a fundamental tool in machine learning and data analysis. It aims to partition a given set of objects into clusters in such a way that similar objects end up in the same cluster. The classical way of approaching the clustering problem is via the k-means formulation. In this formulation, one works with a set X consisting of n points in the d-dimensional Euclidean space R^d, and the objective is to output a set C consisting of k centers so as to minimize the sum of squared distances of points in X to C, i.e.,

ϕ(X, C) = ∑_{x ∈ X} min_{c ∈ C} ‖x − c‖².

Assigning each point to its closest cluster center naturally induces a partition of the points where nearby points tend to end up in the same part.

k-means with Outliers  One major drawback of k-means in practice is its sensitivity to outliers (GKL+17). This motivates optimizing a more robust version of the k-means objective. Probably the simplest such formulation is the k-means with outliers formulation (CKMN01). In this formulation, one additionally receives a number z ∈ {0, 1, . . . , n}. The aim is to find a set C consisting of k cluster centers and additionally a set Xout ⊆ X consisting of z outliers so as to optimize the cost of the inliers Xin := X \ Xout. That is, we optimize

min_{Xout, C} ϕ(X \ Xout, C).

Known Results for k-means with Outliers  It is known that the problem can be approximated up to a constant factor in polynomial time by rounding a certain linear program (Che08; KLS18). However, the complexity of these known algorithms is a large polynomial.
Moreover, as we observe in Section 6, every constant-factor approximation algorithm that outputs exactly z outliers needs to perform Ω(z²) queries in the query model. This motivates weakening our requirements and looking for fast algorithms that are allowed to output slightly more than z outliers. There is a line of work that makes progress toward such algorithms (CKMN01; MOP04; GKL+17; GLZ17; BVX19; IQM+20). However, to the best of our knowledge, known algorithms either need to output Az outliers, for some large constant A > 1, to obtain a reasonable approximation guarantee, or they suffer from at least an Ω(z²) running time and, hence, to be truly applicable even for large z they need to be sped up by coreset constructions (FL11; HJLW18).

Our Contribution  In this work we aim to further close the gap between theory and practice. We show how to adapt a number of classical sampling-based k-means algorithms to the setting with outliers. The main idea of the adaptations is a reduction to the variant of the problem with penalties described below. This is a known approach (CKMN01), but previous reductions of sampling-based algorithms (BVX19) necessarily need to output Az outliers for some large constant A > 1, while we obtain algorithms that need to output only (1 + ε)z outliers.

More concretely, our contribution is threefold. First, building on ideas of (LS19), we design a simple sampling-based algorithm with running time Õ(nk/ε). It outputs an O(1/ε)-approximate solution while declaring at most (1 + ε)z points as outliers. This shows that sampling-based algorithms that are known to be fast and have good practical performance can also achieve strong theoretical guarantees. Second, we devise distributed algorithms for k-means with outliers where each machine needs to send only Õ(k/ε) bits. Our construction achieves an O(1)-approximation while outputting (1 + ε)z outliers. Moreover, each machine only needs to perform polynomial-time computation. This improves on (LG18), who achieve a (1 + ε)-approximation while outputting the same number of outliers as our algorithm, but their computation time is exponential. Third, we show that one can achieve an O(1)-approximation guarantee while discarding O(z) outliers in time Õ(k² · n/z). This is done by speeding up sampling with the Metropolis-Hastings algorithm as done in (BLHK16b; BLHK16a), together with additional ideas. This result is complemented by a matching lower bound of Ω(k² · (n/z)) for k-means/k-median/k-center algorithms that work for an arbitrary metric space accessed by distance queries. This improves on (MOP04; HJLW18), who give algorithms in this setting with a running time of Õ(k² · (n/z)²). This is a significant improvement for z ≪ n.

Roadmap  We overview the related work in Section 2 and our approach in Section 3. Then, we present our contributions in Sections 4 to 6 and our experiments in Section 7. Technical details are mostly deferred to appendices.

Notation  We define ϕ(x, C) = min_{c ∈ C} ‖x − c‖² and set ϕ(X, C) = ∑_{x ∈ X} ϕ(x, C). Similarly, we define τΘ(x, C) = min(Θ, ϕ(x, C)) and τΘ(X, C) = ∑_{x ∈ X} τΘ(x, C). We call an algorithm an (α, β)-approximation if it outputs a set of k centers C and a set of βz outliers Xout such that ϕ(X \ Xout, C) ≤ αOPT = αϕ(X \ X*out, C*), where OPT is the cost of a fixed optimal clustering C* with a set of outliers X*out, |X*out| = z. Moreover, we define Xin := X \ Xout and X*in := X \ X*out.
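To make the notation above concrete, the following NumPy sketch evaluates ϕ, τΘ, and the outlier objective for a given set of centers; the function names (phi, tau, outlier_cost) are ours, and the snippet only illustrates the definitions, not any algorithm of the paper.

import numpy as np

def phi(X, C):
    # k-means cost: every point pays its squared distance to the nearest center.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # (n, k) squared distances
    return d2.min(axis=1).sum()

def tau(X, C, theta):
    # Penalized cost tau_Theta: the payment of every point is capped at theta.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.minimum(d2.min(axis=1), theta).sum()

def outlier_cost(X, C, z):
    # k-means with outliers: the z most expensive points are dropped for free.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    return np.sort(d2)[: len(X) - z].sum()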
2. Previous Work

k-means Problem  It is well known that finding the optimal solution is NP-hard (ADHP09; MNV09). Currently, the best known approximation ratio is roughly 6.36 (ANFSW19), and an approximation ratio of (1 + ε) can be achieved for fixed dimension d (FRS19) or fixed k (KSS04). From the practical perspective, Lloyd's heuristic (Llo82; BLSS16) is the algorithm of choice. As Lloyd's algorithm converges only to a local optimum, it requires a careful seeding to achieve good performance. The most popular seeding choice is the k-means++ seeding (AV07; ORSS13). In k-means++ seeding (Algorithm 1), one chooses the first center uniformly at random from the set of input points X. In each of the following k − 1 steps, one samples a point x ∈ X as a new center with probability proportional to its current cost ϕ(x, C). The seeding works well in practice, and a theoretical analysis shows that even without running Lloyd's algorithm subsequently it provides an expected O(log k)-approximation to the k-means objective (AV07).

Algorithm 1 k-means++ seeding
Input: X, k
1: Uniformly sample c ∈ X and set C = {c}.
2: for i = 2, 3, . . . , k do
3:   Sample c ∈ X with probability ϕ(c, C)/ϕ(X, C) and add it to C.
4: end for
5: return C

k-means with Outliers  There is a growing body of research related to the k-means with outliers problem. On the practical side, Lloyd's algorithm can be adapted to the noisy setting (CG13), but the output quality still remains dependent on the initial seeding. On the theoretical side, algorithms based on the method of successive local search (Che08; KLS18) are known to provide a constant approximation guarantee. However, their running time is a large polynomial in n. We are interested in fast algorithms that are allowed to output slightly more than z outliers. Several algorithms have been proposed (CKMN01; MOP04; GKL+17; GLZ17; BVX19; IQM+20), but they either need at least Ω(z²) time or they need to output at least Cz outliers for some C > 1. The algorithms can in general be sped up by coreset constructions (FL11; GKL+17; HJLW18; IQM+20). However, there is still a need for fast and simple algorithms with strong guarantees.

k-means with (Uniform) Penalties  k-means with penalties is a different way of handling outliers, introduced in the seminal paper of Charikar et al. (CKMN01). In the version of the problem with uniform penalties, we are given some positive number Θ and the goal is to output a set of centers C = {c1, c2, . . . , ck} so as to minimize the expression

τΘ(X, C) = ∑_{x ∈ X} min(Θ, min_{c ∈ C} ‖x − c‖²).

That is, the cost of any point is bounded by the threshold Θ. It turns out that it is usually much simpler to work with k-means with penalties than with k-means with outliers. This is quite helpful because results for k-means with penalties can be turned into results for k-means with outliers (CKMN01; LG18; BVX19; BR20). We also take this approach in this paper. We formalize the reduction in the next section and also present our improvement of it for sampling-based algorithms.

3. Warmup: Reducing k-means with Outliers to k-means with Penalties

Our approach is based on reducing the problem of k-means with outliers to the problem of k-means with penalties. In this section, we first review how one can reduce the problem of k-means with outliers to the problem of k-means with penalties. Then we review an instance of this reduction by (BVX19) and provide a refined result which is the starting point of our work.
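Before turning to the reduction, here is a minimal NumPy sketch of the sampling primitive behind Algorithm 1; with a finite threshold it samples proportionally to τΘ instead of ϕ and thus also covers the penalized (over)seeding of Algorithm 2 below. The function and parameter names are ours, and the snippet omits the guessing of Θ discussed in this section.

import numpy as np

def kmeanspp_seeding(X, ell, theta=np.inf, rng=None):
    # Algorithm 1 when theta = inf and ell = k; Algorithm 2 for a finite penalty theta.
    rng = rng or np.random.default_rng()
    n = X.shape[0]
    C = [X[rng.integers(n)]]                              # first center: uniform sample
    for _ in range(ell - 1):
        d2 = ((X - np.array(C)[:, None, :]) ** 2).sum(axis=2).min(axis=0)
        cost = np.minimum(d2, theta)                      # tau_Theta(x, C) for every point
        if cost.sum() == 0:                               # all points already covered
            C.append(X[rng.integers(n)])
            continue
        C.append(X[rng.choice(n, p=cost / cost.sum())])   # cost-proportional sample
    return np.array(C)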
We note that this reduction can be seen as an instantiation of a general principle in optimization, where one replaces a constraint with a penalty function, known as Lagrangian relaxation (BBV04).

3.1. Review of the Previously Known Reduction

We start by giving the following lemma that formalizes how an α-approximate solution to the k-means with penalties objective can be used to obtain an algorithm providing an (O(α), O(α))-approximate solution to the k-means with outliers objective.

Lemma 1 ((CKMN01)). Let C be an α-approximate solution to the k-means with penalties objective with penalty OPT/(2z) ≤ Θ ≤ OPT/z. Let Xout denote the set of points x for which τΘ(x, C) = Θ. Then, it holds that ϕ(X \ Xout, C) = O(αOPT) and |Xout| = O(αz).

Proof. Note that the optimal solution for k-means with outliers has a k-means with penalties cost upper bounded by ϕ(X \ X*out, C*) + |X*out| · Θ = OPT + zΘ. As C is an α-approximate solution to the k-means with penalties objective, we have ϕ(X \ Xout, C) ≤ τΘ(X, C) ≤ α(OPT + zΘ) = O(αOPT), using Θ ≤ OPT/z. Moreover, paying Θ for every x ∈ Xout implies |Xout| ≤ τΘ(X, C)/Θ ≤ α(OPT + zΘ)/Θ ≤ α(2z + z) = O(αz), using Θ ≥ OPT/(2z).

We remark that the penalty Θ depends on OPT, so the algorithm for k-means with outliers needs to try this reduction for all O(log(n∆²)) = O(log n) powers of 2 between 1 and n∆². Here, ∆ = n^{O(1)} is the maximum distance between any two points of X. This reduction is very helpful as it allows us to easily adapt sampling-based algorithms like k-means++ to the setting with penalties, since their analysis generalizes to this setting. For the case of k-means++, this was shown in (BVX19; Li20) in the following theorem.

Algorithm 2 k-means++ (over)seeding with penalties
Input: A set of points X, k, ℓ, threshold Θ
1: Uniformly sample c ∈ X and set C = {c}.
2: for i = 2, 3, . . . , ℓ do
3:   Sample c ∈ X with probability τΘ(c, C)/τΘ(X, C) and add it to C.
4: end for
5: return C

Theorem 1 ((AV07; BVX19; Li20)). Suppose we run Algorithm 2 for ℓ = k steps. Then, the output set C is an O(log k)-approximation to the k-means with penalty Θ objective, in expectation.

Proof sketch. The analysis of (AV07) proves this guarantee for Algorithm 1 for any ϕ(a, b) := d²(a, b) such that d is a metric. As the distance function d'(a, b) = min(d(a, b), √Θ) still defines a metric, one can thus directly use their analysis to prove Theorem 1. More details are given in Appendix A.

By Theorem 1, plugging Algorithm 2 into Lemma 1 gives an (O(log k), O(log k))-approximate algorithm for k-means with outliers. Moreover, running Algorithm 2 for O(k) steps results in an O(1)-approximation to the k-means objective in metric spaces (ADK09; BVX19). This gives the following tri-criteria approximation via Lemma 1: we get an (O(1), O(1))-approximation algorithm that needs to use O(k) centers.

3.2. Our Improved Reduction

The starting point of our work is the following improvement of the tri-criteria result from the previous subsection that enables us to get a constant factor approximation algorithm that outputs only (1 + ε)z outliers, which is not possible by using Lemma 1 as a black box. The catch is that we need to use O(k/ε) centers.

Theorem 2. Running Algorithm 2 for ℓ = O(k/ε) iterations and OPT/(2εz) ≤ Θ ≤ OPT/(εz) results in a set C with τΘ(X*in, C) ≤ 20 · OPT, with positive constant probability.

Proof sketch (full proof in Appendix B). For an optimal set of centers C* = {c*1, . . . , c*k}, we define X*i ⊆ X*in as the subset of points x ∈ X*in with c*i = arg min_{c* ∈ C*} ϕ(x, c*), where ties are broken arbitrarily.
Fix one iteration of Algorithm 2 and let C be its current set of centers. We refer to a cluster X*i as unsettled if τΘ(X*i, C) ≥ 10 · τΘ(X*i, C*). Suppose that τΘ(X*in, C) ≥ 20 · OPT, since otherwise we are already done. We sample a new point c from X*in with probability

τΘ(X*in, C) / (τΘ(X*in, C) + τΘ(X*out, C)) ≥ 20 · OPT / (20 · OPT + τΘ(X*out, C)) ≥ 20 · OPT / (20 · OPT + OPT/ε) ≥ ε/2,

where we used τΘ(X*in, C) ≥ 20 · OPT, τΘ(X*out, C) ≤ |X*out| · Θ ≤ OPT/ε, and that ε is small enough. Moreover, the cost of all settled clusters is bounded by 10 · OPT; hence, given that c is sampled from X*in, the probability that it is sampled from an unsettled cluster is at least

(τΘ(X*in, C) − 10 · OPT) / τΘ(X*in, C) ≥ (20 · OPT − 10 · OPT) / (20 · OPT) = 1/2,

where we again used τΘ(X*in, C) ≥ 20 · OPT. Given that c is sampled from an unsettled cluster according to the τΘ-distribution, Corollary 1 in Appendix A tells us that the cluster becomes settled with probability at least 1/5. Hence, we make a new cluster settled with probability at least (ε/2) · (1/2) · (1/5) = ε/20.

By a standard concentration argument, this implies that after O(k/ε) steps of the algorithm, with positive constant probability either all clusters are settled, and hence we have τΘ(X*in, C) ≤ 10 · OPT and we are done, or during the course of the algorithm the condition τΘ(X*in, C) ≥ 20 · OPT stopped being true, in which case we are again done.

Note that setting ε = 1 recovers the tri-criteria result we discussed in the previous subsection: this holds as τΘ(X*in, C) = O(OPT) implies that τΘ(X, C) ≤ τΘ(X*in, C) + zΘ = O(OPT/ε).

However, we can get more out of Theorem 2: note that as τΘ(X*in, C) = O(OPT), for the set Xout defined as those x ∈ X with τΘ(x, C) = Θ, we have that |Xout ∩ X*in| ≤ O(OPT)/Θ = O(εz). Hence, setting Xout as our output set of outliers gives, on the one hand, |Xout| ≤ |X*out| + |Xout ∩ X*in| = z + O(εz). On the other hand, we can bound ϕ(X \ Xout, C) = τΘ(X \ Xout, C) ≤ τΘ(X, C) = O(OPT + zΘ) = O(OPT/ε). Hence, we obtain an O(1/ε)-approximation guarantee while outputting just (1 + ε)z outliers.²

By itself, Theorem 2 is still not satisfactory, as it requires us to oversample the number of centers by a factor of O(1/ε). However, we next show three directions of improvement that lead to more interesting results.

4. Fast Sequential Algorithm

In this section, we present a simple sequential sampling-based algorithm for k-means with outliers that achieves an O(1/ε)-approximation and outputs (1 + ε)z outliers:

Theorem 3. For every 0 < ε < 1, there exists an (O(1/ε), 1 + ε)-approximation algorithm for k-means with outliers with running time Õ(nk/ε).

The algorithm is based on ideas of a recent paper of Lattanzi and Sohler (LS19), who proposed augmenting k-means++ with O(k) local search steps (KMN+04) as follows: their algorithm first invokes k-means++ to obtain an initial set of k centers. Afterwards, in each local search step, the algorithm samples a (k + 1)-th point from the same distribution as k-means++. After sampling that point, the algorithm iterates over all k + 1 current centers and takes out the one whose deletion raises the cost the least (see Algorithm 3). Running k-means++ followed by O(k) steps of Local-search++ is known to yield an O(1)-approximation for the cost function ϕ = τ_∞ (CGPR20), with positive constant probability.

Algorithm 3 One step of Local-search++
Input: X, C, threshold Θ
1: Sample c ∈ X with probability τΘ(c, C)/τΘ(X, C)
2: c' ← arg min_{d ∈ C ∪ {c}} τΘ(X, C \ {d} ∪ {c})
3: return C \ {c'} ∪ {c}

We get Theorem 3 by proving that O(k/ε) iterations of Algorithm 3 with OPT/(2εz) ≤ Θ ≤ OPT/(εz) result in an (O(1/ε), 1 + ε)-approximation guarantee.
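The following sketch spells out one step of Algorithm 3 on top of the same τΘ primitives; it is a naive O(nk²)-per-step illustration (the bound in Theorem 3 relies on the faster bookkeeping of (CGPR20)), and the helper names are again ours.

import numpy as np

def tau_costs(X, C, theta):
    # Per-point penalized costs tau_Theta(x, C).
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    return np.minimum(d2, theta)

def local_search_step(X, C, theta, rng=None):
    # One step of Algorithm 3: sample a candidate proportionally to its tau_Theta cost,
    # then drop whichever of the k+1 centers increases the penalized cost the least.
    rng = rng or np.random.default_rng()
    cost = tau_costs(X, C, theta)
    c = X[rng.choice(len(X), p=cost / cost.sum())]
    C_plus = np.vstack([C, c])                            # k+1 candidate centers
    trial_costs = [tau_costs(X, np.delete(C_plus, i, axis=0), theta).sum()
                   for i in range(len(C_plus))]
    return np.delete(C_plus, int(np.argmin(trial_costs)), axis=0)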
The analysis deals with new technical challenges and is deferred to Appendix C. We note that with minor changes to the original analysis of Lattanzi and Sohler, one can show that their result generalizes to arbitrary τΘ, similarly to Theorem 1, and, hence, by Lemma 1 one obtains an (O(1), O(1))-approximation algorithm. Our refined analysis, crucially, does not use Lemma 1 as a black box. Instead, we use the idea of the lemma as a building block for the rather intricate analysis of the algorithm.

²In this particular case, we can even get an O(1)-approximation guarantee by labelling the furthest (1 + O(ε))z points as outliers, since the set Xout ∪ X*out has size at most (1 + O(ε))z and ϕ(X \ (Xout ∪ X*out), C) ≤ τΘ(X*in, C) = O(OPT).

The intuition behind our proof of Theorem 3 comes from Theorem 2: the difference between running k-means++ for O(k/ε) steps and the local search algorithm we described above is that the local search algorithm additionally removes one point after each sampling step. One can still hope that the increase in cost due to the removals is dominated by the decrease in cost due to the newly sampled center making an unsettled cluster settled, as we have seen in the analysis of Theorem 2. This is indeed the case. We note that a local search based algorithm was also considered by (GKL+17), though with substantially weaker guarantees than our algorithm.

5. Distributed Algorithms for k-means with Outliers

To model the distributed setting, we consider the coordinator model where the input data X = X1 ∪ X2 ∪ · · · ∪ Xm is split across m machines. Each machine can first perform some local computation. Then, each machine sends a message to a global coordinator who computes the final set of cluster centers C. The main complexity measure is the total number of bits each machine needs to send to the coordinator. An informal version of our main distributed result states the following.

Theorem 4. There exists an (O(1), 1 + ε)-approximate distributed algorithm in the coordinator model such that each machine sends at most Õ(k/ε) many bits. Moreover, each machine only needs to perform polynomial-time computations.

This improves on a recent result of (LG18). They give an algorithm with a (1 + ε, 1 + ε)-approximation guarantee, but the required local running time is exponential. Other results similar to ours include (GLZ17; CAZ18). Below, we sketch the high-level idea of the constructions and leave the details to Appendix F.

Simpler Construction  In this paragraph we explain the high-level intuition behind a weaker result that provides an (O(1/ε), 1 + ε)-approximation. This generalizes a classical construction of (GMM+03) that works in the k-means setting. For simplicity, we assume here that the machines know the value OPT and we define Θ := OPT/(εz). This assumption can be lifted at the cost of a logarithmic overhead in the time and message complexity. Each machine starts by running Algorithm 2 for O(k/ε) steps with input Xj to obtain a set of centers Yj. The sets of centers satisfy the property ∑_j τΘ(Xj ∩ X*in, Yj) = O(OPT). This follows from an argument very similar to Theorem 2 and will be important later on. Now, let Xout,j denote all points in Xj that have a squared distance of at least Θ to the closest point in Yj. All points in Xout,j are declared as outliers, and the remaining points in Xj \ Xout,j are moved to the closest point in Yj, which creates a new, weighted instance X'j; a sketch of this per-machine step is given below.
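Concretely, the local phase on machine j might look as follows; this is a sketch under the simplifying assumption that Θ = OPT/(εz) is known, reusing the kmeanspp_seeding sketch from Section 3 (all names are ours).

import numpy as np

def machine_local_step(X_j, k, eps, theta, rng=None):
    # Over-seed with penalties, declare far points as local outliers, and replace the
    # remaining points by their closest sampled center, recording multiplicities.
    rng = rng or np.random.default_rng()
    ell = int(np.ceil(k / eps))                            # O(k / eps) sampled centers
    Y_j = kmeanspp_seeding(X_j, ell, theta, rng)           # Algorithm 2 on the local data
    d2 = ((X_j[:, None, :] - Y_j[None, :, :]) ** 2).sum(axis=2)
    nearest, dist2 = d2.argmin(axis=1), d2.min(axis=1)
    far = dist2 >= theta                                   # X_out,j: squared distance >= Theta
    weights = np.bincount(nearest[~far], minlength=len(Y_j))
    # (Y_j, weights) is the weighted instance X'_j; far.sum() is the number of
    # locally declared outliers reported to the coordinator.
    return Y_j, weights, int(far.sum())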
Each machine sends its weighted instance together with the number of declared outliers to the coordinator. The coordinator combines the weighted instances and finds an (O(1/ε), 1 + ε)-approximate clustering using the algorithm guaranteed by Theorem 9 on the instance X', but with the number of outliers equal to z' = z − |Xout| + O(εz), where Xout := ∪_j Xout,j. Let C denote the corresponding set of cluster centers and X'out denote the set consisting of the z' outliers. In Theorem 15 in Appendix F we prove that (C, Xout ∪ X'out) is an (O(1/ε), 1 + ε)-approximation.

The (simple) analysis follows from two observations. First, the number of inliers incorrectly labelled as outliers in the first step, i.e., |∪_j (Xout,j ∩ X*in)|, is bounded by O(εz). This follows from ∑_j τΘ(Xj ∩ X*in, Yj) = O(OPT) and τΘ(x, Yj) = Θ for every x ∈ Xout,j. This ensures that in the end we output only O(εz) more outliers than if the coordinator ran Algorithm 4 on the full dataset with the original parameter z. Second, the total movement cost of changing the instance X to the instance X' is bounded by ∑_j ϕ(Xj \ Xout,j, Yj) ≤ ∑_j τΘ(Xj, Yj) ≤ zΘ + ∑_j τΘ(Xj ∩ X*in, Yj) = O(OPT/ε). Hence, the total cost changes additively by O(OPT/ε) compared to the case where the coordinator runs Algorithm 4 on the full dataset.

Refined Version  To improve the approximation factor from O(1/ε) down to O(1) in Theorem 4, we need to perform two changes. First, the coordinator runs the polynomial-time (O(1), 1)-approximation algorithm of (KLS18) on the weighted instance instead of Algorithm 4. Second, each machine records not only the number of points in Xj \ Xout,j closest to each point y ∈ Yj, but additionally sends, for each integer k ≤ O(log ∆), how many of those points have a distance between 2^k and 2^{k+1} to y. The coordinator then constructs an instance based on this refined information.

k-means||  Adapting the popular k-means|| algorithm (BMV+12) to the setting with outliers can also be accomplished with a construction similar to the one explained above. The only difference is that instead of obtaining the weighted instance as a union of weighted instances from each machine, it is constructed in O(log n) sampling rounds by using the k-means|| idea: in each round we sample from the same distribution as in Algorithm 2, but we sample O(k/ε) points instead of just one point.

6. Tight Bounds for (O(1), O(1))-approximation Algorithms

In this section, we discuss tight bounds on the complexity of finding an (O(1), O(1))-approximation for the k-means with outliers objective. In Appendix D we prove the following result.

Theorem 5. There is an (O(1), O(1))-approximation algorithm for k-means with outliers that runs in time Õ(nk · min(1, k/z)) and succeeds with positive constant probability.

This result is based on Theorem 2, but to speed up the algorithm from Õ(nk) to Õ(nk²/z), we use the Metropolis-Hastings algorithm (with uniform proposal distribution), which was used in (BLHK16b; BLHK16a) for k-means. Here, the idea is that in one sampling step of Algorithm 2, instead of computing τΘ(x, C) for all the points, we first subsample O(n/z) points uniformly at random and only from those points we then sample one with probability roughly proportional to its current τΘ cost, by defining a certain Markov chain. After this step, we end up with O(k) cluster centers that match the guarantee of Theorem 2, but the running time is improved to Õ(nk²/z).
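One simple way to realize such a Markov chain, in the spirit of (BLHK16b; BLHK16a), is a Metropolis-Hastings walk with a uniform proposal whose acceptance ratio compares the τΘ costs of the proposed and the current point; the sketch below is our illustration of this idea, not the exact procedure analyzed in Appendix D.

import numpy as np

def metropolized_sample(X, C, theta, chain_length=100, rng=None):
    # Approximate tau_Theta-proportional sampling that only evaluates the cost of the
    # O(chain_length) visited points instead of all n points.
    rng = rng or np.random.default_rng()

    def tau_point(x):
        return min(((C - x) ** 2).sum(axis=1).min(), theta)

    cur = rng.integers(len(X))
    cur_cost = tau_point(X[cur])
    for _ in range(chain_length):
        cand = rng.integers(len(X))                        # uniform proposal
        cand_cost = tau_point(X[cand])
        # Accept with probability min(1, cand_cost / cur_cost).
        if cur_cost == 0 or rng.random() * cur_cost < cand_cost:
            cur, cur_cost = cand, cand_cost
    return cur                                             # index of the sampled point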
To reduce the number of cluster centers from O(k) to k in sublinear time, we use additional sampling ideas to obtain a weighted instance that we can efficiently solve with our fast sequential algorithm.

Independently of our work, (DKP20) devised an algorithm that obtains a set of O(nk/z) cluster centers that match the guarantee of our O(k) cluster centers. A slight variation of their algorithm can be implemented with running time Õ(nk/z), faster than our Õ(nk²/z) runtime. However, it is not clear how to reduce the O(nk/z) cluster centers to just k in sublinear time, as one cannot simply run our fast sequential algorithm on the subsampled set of cluster centers; one additionally needs to define appropriate weights. Unlike in our case with O(k) cluster centers, it is not clear how one can define appropriate weights for the O(nk/z) cluster centers in time Õ(nk²/z).

Similarly, a result of (HJLW18) gives that running an (O(1), O(1))-approximation algorithm on a uniform sample of Õ(k(n/z)²) points results in an (O(1), O(1))-approximation, without any need to compute clever weights for the subsampled points. However, this leads only to a suboptimal Õ(k²(n/z)²) running time.

Lower Bound  Next, we provide a matching lower bound. To that end, we restrict ourselves to the class of algorithms that work in an arbitrary metric space M and access the distances of the space only by asking an oracle that, upon being queried on two points x, y ∈ M, returns their distance. Virtually all algorithms with guarantees we know of are of this type, possibly up to a constant loss in their approximation guarantee. In Appendix A we verify that this is the case also for the algorithms that we consider here.

For the classical problems of k-means, k-median, and k-center, there is an Ω(nk) lower bound (i.e., showing that so many queries to the oracle are necessary) for metric space algorithms (MP04; Met02). This essentially matches the complexity of k-means++ (AV07) or the Local-search++ algorithm of Lattanzi and Sohler (LS19). The lower bound also holds in the setting with outliers if the output of the algorithm is an assignment of each of the n points to an optimal cluster, together with the list of the outliers. On the other hand, Theorem 5 gives an (O(1), O(1)) algorithm with complexity Õ(nk²/z). The catch is that the output of this algorithm is just a set of centers: it does not compute for all the points their respective assignment to a closest center, and it does not label all the outliers. For this type of algorithm that only outputs the set of centers, we prove a matching lower bound of Ω(nk²/z) in the metric space query model. The following theorem is proved in Appendix E.

Theorem 6. Any randomized algorithm for the k-means/k-median/k-center problem with outliers in the setting k ≥ C, z ≥ Ck log k, and n ≥ Cz for an absolute constant C that with probability at least 0.5 gives an (O(1), O(1))-approximation in the general metric space query model needs Ω(nk²/z) queries.

Let us provide a brief intuition for the proof. The construction that yields Theorem 6 is the following: we have k/2 large clusters and k/2 small clusters. All clusters are well separated and the small clusters together contain Θ(z) points. As the algorithm can output only Θ(z) outliers, it needs to find Ω(k) small clusters. However, a point chosen from the input set uniformly at random is from a small cluster with probability O(z/n), and we expect to need Ω(k) queries until we find out whether a point is from a small or a big cluster.
This leads to the lower bound of Ω(k · (n/z) · k) = Ω(nk²/z) queries.

Why it makes sense to consider bicriteria approximation  Let us also observe a different lower bound against (O(1), 1)-approximation algorithms that motivates why we are concerned with algorithms that can output (1 + ε)z outliers. The following construction appears, e.g., in (Ind99). Consider an input metric space with z = n − 2 and k = 1, where any two points have a distance of 1, up to a single pair x1, x2 whose distance is 0 (or some small ε). Any approximation algorithm (even a randomized one) that outputs exactly z outliers needs to find the pair x1, x2, and Ω(z²) queries are needed for this. The example shows that there is a fundamental limit to the speed of (O(1), 1)-approximation algorithms: for z linear in n they need Ω(n²) time. This should be contrasted with the (O(1), 1 + ε)-approximation algorithms that only need poly(k/ε) time for z = Θ(n), which follows from (HJLW18).

7. Experiments

We tested the following algorithms on the datasets kdd (KDD Cup 1999), subsampled to 10,000 points with 38 dimensions, and spam (Spambase), with 4,601 points in 58 dimensions (DG17). We set the number of outliers z to be 10 percent of the dataset.

Lloyd: variant of Lloyd's algorithm (Llo82; CG13) that handles outliers, with random initialization (10 iterations);
k-means++: k-means++ seeding (AV07);
k-means++ with penalties: k-means++ with penalties (BVX19; Li20);
Metropolized k-means++ with penalties: k-means++ with penalties, sped up by the Metropolis-Hastings algorithm with 100 steps (see Appendix D);
Distributed k-means++ with penalties: simplified variant of Algorithm 5 (the input is partitioned into 10 subinputs, k-means++ with penalties is run on each of them with ℓ = 2k, and the weighted instances are sent to a coordinator who runs k-means++ with penalties again to obtain the final k centers);
Sped-up local search: variant of Algorithm 4 (k-means++ with penalties followed by k additional local search steps for the objective with penalties).

To guess the value of Θ in all except the first two algorithms, we tried 10 values from 1 to 10^10, exponentially separated. The best solution was then picked and we followed by running 10 Lloyd iterations on it, with the number of outliers for these iterations set to z (the same for the second k-means++ algorithm). The results for this setup for k ∈ {5, 10, . . . , 50} are in Figs. 1 and 2, each value being an average over 10 runs.

k-means++ with penalties outperforms the first two baseline algorithms on average by around 40% on both datasets. Surprisingly, k-means++ seeding leads to consistently worse solutions than random initialization. We believe this indicates that the datasets indeed contain outliers that k-means++ picks preferably due to their large distance from other points. The distributed and metropolized variants are on par with k-means++ with penalties, except for the metropolized variant on the spam dataset. Sped-up local search consistently outperforms k-means++ with penalties by around 12% on both datasets. It is also significantly slower, but we implemented only its simple O(nk²) implementation instead of the best possible O(nk).

Figure 1. Experiments on kdd.
Figure 2. Experiments on spam.

8. Conclusion

We have shown that several simple sampling-based algorithms for k-means can be adapted to handle outliers and retain strong theoretical guarantees, while still being similarly simple to implement.
As a theoretical application, we showed that the complexity of getting an (O(1), O(1))-approximation for k-means with outliers is Θ̃(nk²/z) in the query model.

Acknowledgment

We thank Davin Choo, Mohsen Ghaffari, Saeed Ilchi, Andreas Krause, and Julian Portmann for engaging discussions. In particular, we thank Mohsen for his numerous helpful remarks and Davin and Saeed for sharing their code with us.

References

[ADHP09] Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245–248, 2009.

[ADK09] Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. Adaptive sampling for k-means clustering. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 15–28. Springer, 2009.

[ANFSW19] Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. SIAM Journal on Computing, pages FOCS17-97, 2019.

[AV07] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.

[BBV04] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.

[BERS19] Anup Bhattacharya, Jan Eube, Heiko Röglin, and Melanie Schmidt. Noisy, greedy and not so greedy k-means++. arXiv preprint arXiv:1912.00653, 2019.

[BLHK16a] Olivier Bachem, Mario Lucic, Hamed Hassani, and Andreas Krause. Fast and provably good seedings for k-means. In Advances in Neural Information Processing Systems, pages 55–63, 2016.

[BLHK16b] Olivier Bachem, Mario Lucic, S. Hamed Hassani, and Andreas Krause. Approximate k-means++ in sublinear time. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[BLSS16] Johannes Blömer, Christiane Lammersen, Melanie Schmidt, and Christian Sohler. Theoretical analysis of the k-means algorithm - a survey. CoRR, abs/1602.08254, 2016.

[BMV+12] Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. Scalable k-means++. Proceedings of the VLDB Endowment, 5(7):622–633, 2012.

[BR20] Aditya Bhaskara and Aravinda Kanchana Rwanpathirana. Robust algorithms for online k-means clustering. In Algorithmic Learning Theory, pages 148–173, 2020.

[BVX19] Aditya Bhaskara, Sharvaree Vadgama, and Hong Xu. Greedy sampling for approximate clustering in the presence of outliers. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 11148–11157. Curran Associates, Inc., 2019.

[CAZ18] Jiecao Chen, Erfan Sadeqi Azer, and Qin Zhang. A practical algorithm for distributed clustering and outlier detection. In Advances in Neural Information Processing Systems, pages 2248–2256, 2018.

[CG13] Sanjay Chawla and Aristides Gionis. k-means--: A unified approach to clustering and outlier detection. In Proceedings of the 2013 SIAM International Conference on Data Mining, pages 189–197. SIAM, 2013.

[CGPR20] Davin Choo, Christoph Grunau, Julian Portmann, and Václav Rozhoň. k-means++: few more steps yield constant approximation, 2020.

[Che08] Ke Chen. A constant factor approximation algorithm for k-median clustering with outliers. In Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms, pages 826–835, 2008.
[CKMN01] Moses Charikar, Samir Khuller, David M Mount, and Giri Narasimhan. Algorithms for facility location problems with outliers. In Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms, pages 642 651. Society for Industrial and Applied Mathematics, 2001. [DG17] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. [DKP20] Amit Deshpande, Praneeth Kacham, and Rameshwar Pratap. Robust k-means++. In Conference on Uncertainty in Artificial Intelligence, pages 799 808. PMLR, 2020. [FL11] Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 569 578, 2011. Adapting k-means Algorithms for Outliers [FRS19] Zachary Friggstad, Mohsen Rezapour, and Mohammad R Salavatipour. Local search yields a ptas for k-means in doubling metrics. SIAM Journal on Computing, 48(2):452 480, 2019. [GKL+17] Shalmoli Gupta, Ravi Kumar, Kefu Lu, Benjamin Moseley, and Sergei Vassilvitskii. Local search methods for k-means with outliers. Proceedings of the VLDB Endowment, 10(7):757 768, 2017. [GLZ17] Sudipto Guha, Yi Li, and Qin Zhang. Distributed partial clustering. Co RR, abs/1703.01539, 2017. [GMM+03] Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O Callaghan. Clustering data streams: Theory and practice. IEEE transactions on knowledge and data engineering, 15(3):515 528, 2003. [HJLW18] Lingxiao Huang, Shaofeng Jiang, Jian Li, and Xuan Wu. Epsilon-coresets for clustering (with outliers) in doubling metrics. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 814 825. IEEE, 2018. [Ind99] Piotr Indyk. Sublinear time algorithms for metric space problems. In Proceedings of the thirty-first annual ACM symposium on Theory of computing, pages 428 434, 1999. [IQM+20] Sungjin Im, Mahshid Montazer Qaem, Benjamin Moseley, Xiaorui Sun, and Rudy Zhou. Fast noise removal for k-means clustering, 2020. [KLS18] Ravishankar Krishnaswamy, Shi Li, and Sai Sandeep. Constant approximation for kmedian and k-means with outliers via iterative rounding. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, page 646 659, New York, NY, USA, 2018. Association for Computing Machinery. [KMN+04] Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth Silverman, and Angela Y Wu. A local search approximation algorithm for k-means clustering. Computational Geometry, 28(2-3):89 112, 2004. [KSS04] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. A simple linear time (1+/spl epsiv/)- approximation algorithm for k-means clustering in any dimensions. In 45th Annual IEEE Symposium on Foundations of Computer Science, pages 454 462. IEEE, 2004. [LG18] Shi Li and Xiangyu Guo. Distributed kclustering for data with heavy noise. Co RR, abs/1810.07852, 2018. [Li20] Min Li. The bi-criteria seeding algorithms for two variants of k-means problem. Journal of Combinatorial Optimization, 2020. [Liu96] Jun S Liu. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and computing, 6(2):113 119, 1996. [Llo82] Stuart Lloyd. Least squares quantization in pcm. IEEE transactions on information theory, 28(2):129 137, 1982. [LS19] Silvio Lattanzi and Christian Sohler. A better k-means++ algorithm via local search. In International Conference on Machine Learning, pages 3662 3671, 2019. [Met02] Ramgopal Reddy Mettu. 
Approximation algorithms for np-hard clustering problems, 2002. [MNV09] Meena Mahajan, Prajakta Nimbhorkar, and Kasturi Varadarajan. The planar k-means problem is np-hard. In International Workshop on Algorithms and Computation, pages 274 285. Springer, 2009. [MOP04] Adam Meyerson, Liadan O callaghan, and Serge Plotkin. A k-median algorithm with running time independent of data size. Machine Learning, 56(1-3):61 87, 2004. [MP04] Ramgopal R Mettu and C Greg Plaxton. Optimal time bounds for approximate clustering. Machine Learning, 56(1-3):35 60, 2004. [MRR+53] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The journal of chemical physics, 21(6):1087 1092, 1953. [ORSS13] Rafail Ostrovsky, Yuval Rabani, Leonard J Schulman, and Chaitanya Swamy. The effectiveness of lloyd-type methods for the kmeans problem. Journal of the ACM (JACM), 59(6):1 22, 2013. Adapting k-means Algorithms for Outliers [Roz20] Vaclav Rozhon. Simple and sharp analysis of k-means||, 2020. [Yao77] Andrew Chi-Chin Yao. Probabilistic computations: Toward a unified measure of complexity. In 18th Annual Symposium on Foundations of Computer Science (sfcs 1977), pages 222 227. IEEE, 1977. Adapting k-means Algorithms for Outliers Appendices A. Preparatory lemmas More notation We start by setting up some more notation. Note that the optimal clustering C splits the dataset X into two parts: the set of inliers X in with ϕ(X in, C ) = OPT and the set of outliers X out with |X out| = z. Moreover, the set X in is naturally split in k sets X 1, X 2, . . . , X k, where X j is the set of points x X in for which ϕ(x, C) = ϕ(x, c j). When using the notation τΘ(X, C), we often drop the subscript Θ when it is clear from context and write just τ(X, C). For a single point c, we drop parentheses and write ϕ(X, c) instead of ϕ(X, {c}) and similarly for τ. We define ϕ z(X, C) = min Xout,|Xout|=z ϕ(X \ Xout, C). We use outΘ(X, C) to denote the subset of points x X such that τΘ(x, C) = Θ and inΘ(X, C) to denote the subset of points x X such that τΘ(x, C) < Θ. We assume that the minimum distance between any two points in X is lower bounded by 1 and their maximum distance is upper bounded by . In this section we collect several useful lemmas that are either well-known or are variations of well-known results. First, we recall the well-known Chernoff bounds and then, for completeness, prove a generalization of sampling lemmas that were used to analyze k-means++ (AV07). These were, in slightly different forms, also proved in (Li20) or (BVX19). Theorem 7 (Chernoff bounds). Suppose X1, . . . , Xn are independent random variables taking values in {0, 1}. Let X denote their sum. Then for any 0 δ 1 we have P(X (1 δ)E[X]) e E[X]δ2/2 and P(X (1 + δ)E[X]) e E[X]δ2/3. Moreover, for any δ 1 we have P(X (1 + δ)E[X]) e E[X]δ/3. Next, we state basic facts about metric spaces and Rd used to prove the sampling lemmas below. Fact 1 (cf. (AV07)). Let A be a subset of Rd. Then inf µ Rd ϕ(A, µ) = ϕ(A, µ(A)) where µ(A) = P x A x/|A| is the mean of A. Fact 2 (cf. (AV07)). Let a, b, c M be three points in an arbitrary metric space. Then ϕ(a, c) 2(ϕ(a, b) + ϕ(b, c)) and, more generally, for C M and ε > 0 we have ϕ(a, C) ϕ(b, C) ε ϕ(b, C) + 1 + 1 where the former inequality is a special case for ε = 1 and C = {c}. Proof. Let d denote the distance function of M. 
For c C such that d(b, C) = d(b, c) we have ϕ(a, C) = d2(a, C) d2(a, c) (d(a, b) + d(b, c))2 = (d(a, b) + d(b, C))2 = d2(a, b) + d2(b, C) + 2d(a, b)d(b, C) d2(a, b) + d2(b, C) + 1 εd2(a, b) + εd2(b, C) = ϕ(a, b) + ϕ(b, C) + 1 εϕ(a, b) + εϕ(b, C) which is the the needed inequality, up to rearrangement. We used that 2d(a, b)d(b, C) 1 εd2(a, b) + εd2(b, C) Adapting k-means Algorithms for Outliers is equivalent to ( εd(b, C) d(a, b)/ ε)2 0. Fact 3. Let M be an arbitrary metric space with distance function d. Then for any constant T > 0, the function d : d (a, b) = min(d(a, b), T) is also a distance function. Proof. For any a, b, c M, we need to prove d (a, c) d (a, b) + d (b, c). If d(a, b) T or d(b, c) T, respectively, we have d (a, b) = T or d (b, c) = T, respectively, and, hence, d (a, b) + d (b, c) T d (a, c), as needed. Otherwise, we have d (a, c) d(a, c) d(a, b) + d(b, c) = d (a, b) + d (b, c), as needed. Fact 3 for example implies that in Fact 2 we can replace ϕ by τΘ, as τΘ(a, b) = min(d(a, b), Θ)2. It also implies Theorem 1, as the analysis of (AV07) holds for an arbitrary metric space. However, this loses a factor of 2 in the approximation guarantee of Theorem 1 as we explain next. Proofs of Theorem 1 in (Li20; BVX19) instead generalize Lemmas 3.1 and 3.2 in (AV07) for τ costs to get Theorem 1; this can recover the same constants as in (AV07). For completeness, we also reprove these lemmas here as Lemmas 2 and 4 since we refer to them later in the proofs. Also, we prove a version of Lemma 2 for metric spaces as Lemma 3. This is the reason why arguments for arbitrary metric spaces lose additional factor of 2 in approximation guarantees. The relevance of Lemma 3 comes from the fact that it implies our algorithms work in arbitrary metric spaces, which is important for applications in Section 6. Lemma 2 (cf. Lemma 3.1 in (AV07)). For any cluster A Rd we have c A τΘ(A, c) 2 inf µ Rd τΘ(A, µ). Proof. Recall that for the cost function ϕ we have 1 |A| P c A ϕ(A, c) = 2ϕ(A, µ(A)) by Lemma 3.1 in (AV07). Let us fix an arbitrary µ Rd and denote Ain = inΘ(A, µ) and Aout = outΘ(A, µ). Note that τΘ(A, µ(Ain)) τΘ(A, µ) due to Fact 1. We have c A τΘ(A, c) c Ain τΘ(Ain, c) + X c Ain τΘ(Aout, c) + X c Aout τΘ(A, c)) c Ain ϕ(Ain, c) + |Ain||Aout|Θ + |Aout||A|Θ) = 1 |A|(2|Ain|ϕ(Ain, µ(Ain)) + |Ain||Aout|Θ + |Aout||A|Θ) Lemma 3.1 in (AV07) 2ϕ(Ain, µ(Ain)) + 2|Aout|Θ Lemma 3. For any metric space M and subset of its points A we have c A ϕ(A, c) 4 inf µ M ϕ(A, µ). Adapting k-means Algorithms for Outliers Proof. For any µ M, we can write c A ϕ(A, c) = 1 d A ϕ(d, c) d A ϕ(d, µ) + ϕ(µ, c) Fact 2 = 4ϕ(A, µ), The second sampling lemma states that sampling a point from a cluster A proportional to its τ-cost with respect to the current clustering C results in an 8-approximation of the cost of A. Lemma 4 (cf. Lemma 3.2 in (AV07), Lemma 4 in (Li20) and Lemma 6 in (BVX19)). For any cluster A Rd and an arbitrary point set C, we have X τΘ(c, C) τΘ(A, C)τΘ(A, C {c}) 8 inf µ Rd τΘ(A, µ). Proof. Fix an arbitrary µ Rd and c, d A. We have τΘ(c, C) 2(τΘ(c, d) + τΘ(d, C)) by Fact 2. For a given c A we can then average over all d A to get τΘ(c, C) 2 |A| d A (τΘ(c, d) + τΘ(d, C)) = 2 |A|(τΘ(A, c) + τΘ(A, C)). (1) This implies τΘ(c, C) τΘ(A, C)τΘ(A, C {c}) 2 |A|(τΘ(A, c) + τΘ(A, C)) d A min(Θ, ϕ(d, C), ϕ(d, c)) Eq. (1) 2 |A|τΘ(A, c) d A min(Θ, ϕ(d, C)) 2 |A|τΘ(A, C) d A min(Θ, ϕ(d, c)) 2 |A|τΘ(A, c) + 2 c A τΘ(A, c) c A τΘ(A, c) 8τΘ(A, µ). 
Lemma 2 Both Lemma 2 and Lemma 4 were proven in the same way as corresponding lemmas in (AV07), and Theorem 1 follows from these lemmas in the same way as Lemma 3.3 in (AV07) follows from Lemmas 3.1 and 3.2 in (AV07) (cf. (BVX19; Li20)). Finally, we will use a simple corollary of Lemma 4. Adapting k-means Algorithms for Outliers Corollary 1. For any cluster A Rd and an arbitrary point set C, sampling a point c A proportional to τΘ(c, C) results in τΘ(A, C {c}) 10 inf µ Rd τΘ(A, µ), with probability at least 1 Proof. Follows from Lemma 4 by Markov inequality. B. Basic Oversampling Result In this section we prove a formal and more general version of Theorem 2. The full generality of the theorem is needed at a later point in time. To obtain Theorem 2, plug in some arbitrary Θ [OPT/(2εz), OPT/(εz)] and δ = 0.5 in the following theorem. Theorem 8 (cf. (ADK09)). Let δ, ε (0, 1] be arbitrary and suppose we run Algorithm 2 for ℓ= O(k/ε log 1/δ) steps to get output C. Then, with probability at least 1 δ we have τΘ(X in, C) = O(OPT + εzΘ). Proof. We refer to a cluster X i as unsettled if τ(X i , C) 10τ(X i , µ(X i )). Fix one iteration of Algorithm 2 and let C be its current set of centers. Suppose that τ(X in, C) 20(OPT + εzΘ), since otherwise we are already done. We sample a new point c from X in with probability τ(X in, C) τ(X in, C) + τ(X out, C) 20εzΘ 20εzΘ + zΘ ε/2, where we used τ(X in, C) 20εzΘ, τ(X out, C) |X out|Θ = zΘ, and ε 1. Moreover, given that c is sampled from X in, the probability that it is sampled from an unsettled cluster is at least τ(X in, C) 10OPT τ(X in, C) 20OPT 10OPT where we used τ(X in, C) 20OPT. Given that c is sampled from an unsettled cluster according to the τ-distribution, Corollary 1 tells us that the cluster becomes settled with probability at least 1 5. Hence, we make a new cluster settled with probability at least (ε/2) 1 Finally, consider all O(k/ε log 1/δ) iterations. We call each iteration good, if either τ(X in, C) 20(OPT + εzΘ), or we make a new cluster settled. As each iteration is good with probability at least ε/20, the expected number of good iterations is Ω(k log 1/δ). By Theorem 7, at least k iterations are good with probability at least 1 e Ω(log 1/δ) 1 δ. This implies that in the end either τ(X in, C) 20(OPT + εzΘ) and we are done, or all clusters are settled. But this implies τ(X in, C) 10 P i τ(X i , µ(X i )) 10 P i ϕ(X i , µ(X i )) = 10OPT, so we are again done. C. Proof of Theorem 9 In this section we explain in more detail how to adapt the local-search algorithm of (LS19) in order to obtain an (O(1/ε), 1+ ε)-approximation algorithm for k-means with outliers. As explained in the main part of the paper, the main idea is to run the local-search algorithm with penalties. That is, each point is sampled as a new cluster center proportional to its τΘ-cost, where Θ = Θ( OPT εz ). In total, we run O(k log log k+k log(1/ε)/ε) local-search steps. Hence, the algorithm can be implemented in time O(nk/ε) (cf. (CGPR20), Section 3.3). After the first O(k log log k) search steps, we show that the current cost is with constant probability at most O(OPT/ε). The analysis is quite similar to the analysis of (LS19). Afterwards, within the next O(k log(1/ε)/ε) steps, we show that the number of points with a squared distance of at least 10Θ (The extra factor of 10 is needed in the analysis for technical reasons) drops below (1 + ε)z with constant probability. 
Hence, we can declare all those points as outliers and the ϕ-cost of each remaining point is then only at most 10 times larger than its τΘ cost. Therefore, ϕ(X, C) can still be bounded by O(OPT/ε), as needed. The formal algorithm description is given below as Algorithm 4. Recall that Θ depends on OPT Adapting k-means Algorithms for Outliers and as OPT is unknown to the algorithm, it needs to be guessed . Also note that due to technicalities in the analysis, the algorithm does not simply output the solution one obtains after running all the local-search steps, but it also considers all intermediate solutions as final solutions. The reason is that, unlike the total cost, the number of points with a squared distance of at least 10Θ is not necessarily decreasing monotonically. Algorithm 4 Local-search++ with outliers Input: X, k, z, ε, Assumptions: k + z < |X|, ε (0, 1] 1: C 2: Xout 3: for i = 0 to log(n 2) do 4: OPTguess 2i, β 300/ε, Θ βOP Tguess z 5: Ci,0 k-means++ (X, k, Θ) (Algorithm 2) 6: for j 1, 2, . . . , O(k log log k + k log(1/ε) ε ) do 7: Ci,j Local-search++ step(X, Ci,j 1, Θ) (Algorithm 3) 8: Xi,j out out10Θ(X, C) 9: if |Xi,j out| (1 + ε)z and ϕ(X \ Xi,j out, Ci,j) < ϕ(X \ Xout, C) then 10: C Ci,j 11: Xout Xi,j out 12: end if 13: end for 14: end for 15: Return (C, Xout) Theorem 9 (Formal version of Theorem 3). Let ε (0, 1] be arbitrary and X Rd be a set of n points. Furthermore, let k N and z N0 be such that k + z < n (otherwise the cost is 0). Then, with positive constant probability, Algorithm 4 returns a set Xout containing at most (1 + ε)z points and a set of k cluster centers C such that ϕ(X \ Xout, C) = O(OPT/ε) for OPT = min C ,X out : |C |=k,|X out|=z ϕ(X \ X out, C ). Before going into the actual proof of Theorem 9, we first need to introduce additional notation. As before, X denotes the set of all input points and X out denotes the set of all input points that are declared as outliers by the optimal solution under consideration. Moreover, let X far X \ X out denote the set consisting of those points in X \ X out that have a squared distance of at least Θ to the closest optimal cluster center. Now, we define X in = X \ (X out X far). In other words, we split X = X in X far X out (in Section 2 we used X = X in X out). For a fixed candidate clustering C and for i [k], Xi X in denotes the set consisting of those points in X in for which the closest center c C is the center ci. Similarly, let X i denote the set consisting of those points in X in that are clustered to the center c i . Moreover, Xfar,i denotes the set of points in X far that are closest to ci among all candidate centers in C. We say that the optimal cluster X i is ignored by C if for every x X i we have τΘ(x, C) = Θ. We define I to be the set of indices corresponding to ignored clusters. Next, we define a function f : [k] \ I 7 [k] with f(i) = arg minj k d(c i , cj) for every i [k] \ I. Intuitively, the function f maps each optimal cluster that is not ignored to the closest candidate center (cf. Fig. 3). Now, for each candidate center, either zero, one or more than one optimal cluster maps to it. If none of the optimal clusters map to it, then we call the candidate center lonely and we denote the indices of all the lonely candidate centers by L. If exactly one optimal cluster maps to a given candidate center, then we call the candidate center matched and H denotes the set of indices of all the matched candidate centers. Without loss of generality, we assume that f(i) = i for every i H. 
Finally, if more than one optimal cluster maps to a candidate center, then we call it popular. We refer to an optimal cluster X i with cluster index i [k] \ I as a cheap cluster if X i maps to some popular candidate center with corresponding index j and, moreover, X i has the smallest cost with respect to C among all optimal clusters that are mapped to the popular candidate center with index j. We let T denote the set of indices corresponding to cheap clusters. Let b : [k] \ T 7 H L be an arbitrary bijection with b(i) = i for every i H (cf. Fig. 3). That is, each optimal cluster center that is not cheap gets uniquely assigned to a candidate center which is not popular while preserving the matching between matched cluster centers. Adapting k-means Algorithms for Outliers Note that such a bijection exists since the number of cheap optimal clusters is equal to the number of popular candidate centers. H L popular X 1 X 2 X k X 1 X 2 X k matched lonely Figure 3. The figure captures several definitions. First, function f maps each optimal cluster X i to its closest candidate center cj C. Candidate centers with exactly one optimal cluster matched to them are denoted by H and for convenience we assume that f maps X i to ci there. Then there are popular centers with more than one match and for each such center we identify the cheapest matched cluster and put it in the set T. Finally, there are lonely centers with no matches. Ignored optimal clusters are not matched to any center by f. Second, we construct a bijection b that is used later for a double counting argument. In this bijection, we set b(i) = f(i) = i for clusters i matched to centers in H. Then, we omit cheap clusters T and popular candidate centers, and extend b to an arbitrary bijection between the remaining clusters and lonely candidate centers. For i H L, we define outliers(ci) = {x in10Θ(X out, C) : i = arg min j [k] ϕ(x, cj)}. That is, outliers(ci) contains the real outliers that are closest to the candidate center ci and which have a squared distance of at most 10Θ to ci. The factor of 10 comes into play because of the following lemma. Lemma 5. Let C denote the current set of candidate centers and assume there exists some x X i with ϕ(x, C) 10Θ. Then, X i is ignored by C. Proof. Let x X i be arbitrary and let c denote the candidate center in C that is closest to x . We have ||x c || ||x c || ||x x|| ||x c || ||x c i || ||x c i || Θ x, x / X far and therefore ϕ(x , C) > Θ. Hence, X i is ignored. Now we define the notion of reassignment cost, similarly as it was done in (LS19). For each lonely candidate center cℓ, reassign(X, C, cℓ) is an upper bound for the increase in τ cost due to removing cℓfrom the current set of candidate centers C. Similarly, for each matched candidate center ch, reassign(X, C, ch) is an upper bound for the increase in cost due to removing ch from the current set of candidate centers C, ignoring the cost increase for all points in the optimal cluster X h. Adapting k-means Algorithms for Outliers Definition 1. Let h H be arbitrary and let X h be the optimal cluster that is captured by the center ch. The reassignment cost of ch is defined as reassign(X, C, ch) = τ(X in \ X h, C \ {ch}) τ(X in \ X h, C) + Θ|Xfar,h| + (Θ|outliers(ch)| τ(outliers(ch), ch)) . For ℓ L, the reassignment cost of cℓis defined as reassign(X, C, cℓ) = τ(X in, C \ {cℓ}) τ(X in, C) + Θ|Xfar,ℓ| + (Θ|outliers(cℓ)| τ(outliers(cℓ), cℓ)) . 
Intuitively, the first two terms in the definition correspond to the increase of the cost for all the points in X in \ X h and X in, respectively. The third term corresponds to the increase of the cost for all points in X far. This increase is bounded by Θ, their maximal cost. Finally, the last two terms in the definition below upper bound the increase of the cost of outliers. Here we again crudely upper bound the cost of each outlier after removing the candidate center by Θ. Fact 4. For a matched cluster h we have reassign(X, C, ch) τ(X \ X h, C \ {ch}) τ(X \ X h, C). Similarly, for a lonely cluster ℓwe have reassign(X, C, cℓ) τ(X, C \ {cℓ}) τ(X, C). Proof. See above discussion. Lemma 6 (cf. Lemma 4 in (LS19)). For i H L we have reassign(X, C, ci) 100τ(inΘ(Xi, C), C) + 24ϕ(inΘ(Xi, C), C ) + Θ|Xfar,i| + Θ|outliers(ci)| τ(outliers(ci), ci). Proof. We only present the case i H, as the case i L is similar and even easier. We observe that reassign(X, C, ci) = τ(Xi \ X i , C \ {ci}) τ(Xi \ X i , C) + Θ|Xfar,i| + Θ|outliers(ci)| τ(outliers(ci), ci), since vertices in clusters other than Xi will still be assigned to their current center. Hence, it suffices to show that τ(Xi \ X i , C \ {ci}) τ(Xi \ X i , C) 100τ(inΘ(Xi, C), C) + 24ϕ(inΘ(Xi, C), C ). τ(outΘ(Xi \ X i , C), C \ {ci}) τ(outΘ(Xi \ X i , C), C) = 0, this is equivalent to showing that Adapting k-means Algorithms for Outliers τ(inΘ(Xi \ X i , C), C \ {ci}) τ(inΘ(Xi \ X i , C), C) 100τ(inΘ(Xi, C), C) + 24ϕ(inΘ(Xi, C), C ). The cost contribution of each point in inΘ(Xi \ X i , C) with respect to C is equal to the squared distance to the closest center in C and hence the analysis of Lattanzi and Sohler more or less directly applies. For the sake of completeness, we still present their analysis. To upper bound the cost increase one incurs by removing ci, we assign each point p inΘ(Xi \ X i , C) X j , j = i, to the candidate center that captures the cluster X j . By computing an upper bound for the clustering cost with respect to that assignment, we directly get an upper bound for τ(inΘ(Xi \ X i , C), C \ {ci}). We upper bound the cost increase in two steps. First, we move each point p inΘ(Xi \ X i , C) X j , j = i, to the cluster center c j of X j and we call the resulting multiset Qi. Let qp denote the point in Qi corresponding to p. Among the centers in C, the closest center to qp is now the center that captured the optimal cluster X j . Thus, it is not equal to ci, as ci captures exactly one optimal cluster, namely X i . Thus, we obtain τ(qp, C \ {ci}) τ(p, C) = τ(qp, C) τ(p, C) qp not assigned to ci (2) 10τ(p, C) + 11τ(p, qp) (Facts 2 and 3 with ε = 1/10) 10τ(p, C) + 11τ(p, C ). This implies τ(qp, C \ {ci}) = τ(qp, C) 11 10τ(p, C) + 11τ(p, C ). (3) Next, we upper bound the cost increase by moving each point qp Qi back to its original position p. Formally, we bound τ(p, C \ {ci}) τ(qp, C \ {ci}) (4) 10τ(qp, C \ {ci}) + 11 τ(p, qp) Facts 2 and 3 10τ(p, C) + 11 τ(p, C ) (5) + 11 τ(p, C ) Eq. (3) 100τ(p, C) + 13 τ(p, C ) Combining the two inequalities Eqs. (2) and (4) yields τ(p, C \ {ci}) τ(p, C) (6) = (τ(p, C \ {ci}) τ(qp, C \ {ci})) + (τ(qp, C \ {ci}) τ(p, C)) 100τ(p, C) + 13 τ(p, C ) + 1 10τ(p, C) + 11 τ(p, C ) Eqs. (2) and (4) 100τ(p, C) + 24 τ(p, C ). Adapting k-means Algorithms for Outliers Summing up the inequality Eq. 
(6) for all points p inΘ(Xi \ X i , C) then yields τ(inΘ(Xi \ X i , C), C \ {ci}) τ(inΘ(Xi \ X i , C), C) 100τ(inΘ(Xi \ X i , C), C) + 24τ(inΘ(Xi \ X i , C), C ) 100τ(inΘ(Xi, C), C) + 24ϕ(inΘ(Xi, C), C ) and therefore reassign(X, C, ci) 100τ(inΘ(Xi, C), C) + 24ϕ(inΘ(Xi, C), C ) + Θ|Xfar,i| + Θ|outliers(ci)| τ(outliers(ci), ci), C.1. Decreasing Cost to O(1/ε) In this subsection, we will prove that while the cost of our solution is Ω(OPT/ε), one local search step decreases the cost by a factor of 1 Θ(1/k) with constant probability (Lemma 8). This means that after O(k) steps, we have τΘ(X, C) = O(OPT/ε) with constant probability (Lemma 9). This, by itself, is the easy part and our proof is essentially the analysis of Lattanzi and Sohler (LS19). The hard part is then the generalization to Lemma 11 in Section C.2, which, roughly speaking, shows that while we have (1 + ε)z points with a big cost, we still do substantial progress in each step. For the next definition, recall that b is the bijection between optimal clusters and candidate cluster centers that we defined earlier. Definition 2. A cluster index i [k] \ T is called awesome, if τ(X i , C) reassign(X, C, cb(i)) 9τ(X i , c i ) > τ(X, C) Intuitively, awesome clusters are such that sampling a point c from them leads to high improvement in cost if the sampled center is sampled close to the cluster mean. In particular, if τ(X i , c) 9τ(X i , c i ), swapping cb(i) with c leads to an improvement of τ(X, C)/(100k) in cost. The next lemma shows that we sample from an awesome cluster with constant probability. Lemma 7. Let C denote the current set of candidate centers. Let ε (0, 1] be arbitrary and β := 300 ε . If τ(X, C) 100βOPT and Θ 2 βOPT i [k] \ T , i is awesome τ(X i , C) 1 Proof. We bound the cost of clusters that are not awesome. Adapting k-means Algorithms for Outliers i [k] \ T , i is not awesome τ(X i , C) i [k]\T reassign(X, C, cb(i)) + 9OPT + τ(X, C) 100 Definition 2 100τ(inΘ(Xb(i), C), C) + 24ϕ(inΘ(Xb(i), C), C ) + Θ|Xfar,b(i)| Lemma 6 + Θ |outliers(cb(i))| τ(outliers(cb(i)), cb(i)) + 9OPT + τ(X, C) 100τ(inΘ(Xi, C), C) + 24ϕ(inΘ(Xi), C), C ) + Θ|Xfar,i| + Θ |outliers(ci)| τ(outliers(ci), ci) + 9OPT + τ(X, C) 100 b([k] \ T) = H L 100τ(X, C) + 24OPT + OPT + zΘ+ 9OPT + τ(X, C) 100 Θ|Xfar,i| OPT, Xi X in 100τ(X, C) + (2β + 34)OPT Θ 2βOPT 100τ(X, C). β 300, τ(X, C) 100βOPT Observe that from the definition of cheap clusters it follows that i T τ(X i , C) 1 2τ(X, C) (7) because every cheap cluster can be matched with a different cluster with at least as big cost that is also matched to the same popular cluster center. Adapting k-means Algorithms for Outliers Finally, we bound i [k] \ T , i is awesome τ(X i , C) = τ(X in, C) X i [k] \ T , i is not awesome i T τ(X i , C) τ(X, C) τ(X out, C) τ(X far, C) 100τ(X, C) 1 2τ(X, C) Eq. (7) 100τ(X, C) zΘ OPT τ(X out, C) zΘ, τ(X far, C) OPT 10τ(X, C). τ(X, C) 100βOPT, Θ 2βOPT Lemma 8. Let C denote the current set of candidate centers. Let ε (0, 1] be arbitrary and β := 300 ε . Suppose that τ(X, C) 100β OPT and Θ 2 βOPT z . Then, with probability at least 1/100, one local search step of Algorithm 4 results in a new set of candidate centers C with τ(X, C ) (1 1/(100k))τ(X, C). Proof. Let c denote the point sampled by Local-search++ with outliers . The probability that c X i for some i with i being an awesome cluster index is at least (1/10)τ(X, C) τ(X, C) = 1 according to Lemma 7. Let i denote an arbitrary awesome cluster index. Lemma 4 implies that E[τ(X i , c )|c X i ] 8 τ(X i , c i ). 
Hence, by a simple application of Markov s inequality, we obtain P[τ(X i , c ) 9 τ(X i , c i )|c X i ] 1 Putting things together, the probability that the sampled point c is contained in X i for some awesome cluster index i and it additionally holds that τ(X i , c ) 9 τ(X i , c i ) (8) is at least 1 10 1 Consider the case that i H. We can upper bound τ(X, C ) as follows: Adapting k-means Algorithms for Outliers τ(X, (C \ {cb(i)}) {c }) swap cb(i) and c = τ(X, C) + τ(X, (C \ {cb(i)}) {c }) τ(X, C) = τ(X, C) + τ(X \ X i , (C \ {cb(i)}) {c }) τ(X \ X i , C) + τ(X i , (C \ {cb(i)}) {c }) τ(X i , C) τ(X, C) + reassign(X, C, ci) + τ(X i , c ) τ(X i , C) τ(X, C) + reassign(X, C, ci) + 9τ(X i , c i ) τ(X i , C) τ(X, C) τ(X, C) 100k Definition 2 The case i / H is very similar. Thus, we obtain that with probability at least 1/100 we have τ(X, C ) (1 1/(100k)) τ(X, C), as desired. Lemma 9. Let C denote the current set of candidate centers. Let ε (0, 1] be arbitrary and β := 300 ε . Let α > 0 be arbitrary such that τ(X, C) α β OPT. Furthermore, assume that Θ 2 βOPT z . Let C0 := C and for t 0, let Ct+1 denote the set of centers obtained by running Algorithm 4 with input centers Ct. Then, with positive constant probability p > 0, we have τ(X, CT ) 100βOPT with T := 1000 100k log α = O(k log(α)). Proof. For t 0, let Xt denote the indicator variable for the event that τ(X, Ct+1) > max((1 1/(100k))τ(X, Ct), 100βOPT). According to Lemma 8, E[Xt] 99 100. Hence, with X := PT 1 t=0 Xt, we have E[X] 99 100T. Thus, Markov s Inequality implies that there exists a constant p > 0 such that Pr[X 999 1000T] p. In that case, there are at least 100k log α iterations t for which Xt = 0 and therefore max((1 1/(100k))100k log α τ(X, C0), 100βOPT) max(e log ααβOPT, 100βOPT) as desired. C.2. Decreasing Number of Outliers to (1 + ε)z We are now ready to prove Lemma 11. It roughly states that with probability Ω(ε) we sufficiently improve our solution, provided that there are at least (1 + ε)z points with a squared distance of at least 10Θ to the current solution. Lemma 12 then follows as a simple consequence of Lemma 11. Roughly speaking, it states that starting with a set of centers C with a cost of O((1/ε)OPT), within O(k/ε) local search steps, one obtains a set of centers C such that at most (1 + ε)z points have a squared distance of at least 10Θ to C , with positive constant probability. Let g be the number of points in X out that have a squared distance of less than 10Θ to the closest center in C. Note that by Adapting k-means Algorithms for Outliers our definition of outliers(c), we have X i [k] |outliers(ci)| = g z. (9) Definition 3. A cluster index i [k] \ T is called good, if τ(X i , C) reassign(X, C, cb(i)) 9τ(X i , c i ) > τ(X, C) zΘ Lemma 10. Let C denote the current set of candidate centers. Let ε (0, 1] be arbitrary and β := 300 ε . If for some ε ε, there are (1 + ε )z different points with a squared distance of at least 10Θ to the closest candidate center in C and Θ βOPT i [k] \ T , i is good τ(X i , C) τ(X, C) zΘ Proof. Recall our assumption that exactly (1 + ε )z points have a squared distance of at least 10Θ to the set C of candidate centers, i.e., |out10Θ(X, C)| = (1 + ε )z. (10) Moreover, we defined g as the number of points in X out such that they have squared distance of less than 10Θ to C, so |out10Θ(X out, C)| = z g. (11) |out10Θ(X in, C)| (12) = |out10Θ(X, C)| |out10Θ(X out, C)| (13) |out10Θ(X far, C)| (1 + ε )z (z g) |X far| Eqs. (10) and (11) ε z + g OPT Θ . 
OPT |X far|Θ We will use the following bound τ(X in, C) + τ(X far, C) + τ(in10Θ(X out, C), C) gΘ 100 τ(X in, C) + τ(in10Θ(X out, C), C) gΘ 100 + OPT/100. Now we bound X i [k] \ T , i is not good τ(X i , C) i [k]\T reassign(X, C, cb(i)) + 9OPT + τ(X, C) zΘ Definition 3 100τ(inΘ(Xb(i), C), C) + 24ϕ(inΘ(Xb(i), C), C ) Adapting k-means Algorithms for Outliers + Θ|Xfar,b(i)| + Θ |outliers(cb(i))| τ(outliers(cb(i)), cb(i)) + 9OPT + τ(X, C) zΘ 100 Lemma 6 100τ(inΘ(Xi, C), C) + 24ϕ(inΘ(Xi, C), C ) + Θ|Xfar,i| + Θ |outliers(ci)| τ(outliers(ci), ci) + 9OPT + τ(X, C) zΘ 100 b([k] \ T) = H L 100τ(inΘ(X in, C), C) + 24OPT + OPT i H L (Θ |outliers(ci)| τ(outliers(ci), ci)) + 9OPT + τ(X, C) zΘ 100 Xi X in 100τ(inΘ(X in, C), C) + X i [k] (Θ |outliers(ci)| τ(outliers(ci), ci)) + 34OPT + τ(X, C) zΘ 100τ(inΘ(X in, C), C) + gΘ τ(in10Θ(X out, C), C) + 34OPT + τ(X, C) zΘ 100 Eq. (9) 100τ(inΘ(X in, C), C) + gΘ τ(in10Θ(X out, C), C) + 34OPT+ τ(X in, C) + τ(in10Θ(X out, C), C) gΘ 100 + OPT/100 100τ(inΘ(X in, C), C) + Θg 1 10τ(in10Θ(X out, C), C) + 35OPT + τ(X in, C) gΘ Next, we make use of the inequality i T τ(X i , C) 1 i [k],i is not ignored τ(X i , C). (15) This follows as a cheap cluster is not ignored, and each cheap cluster can be matched with a different cluster that is not ignored, has at least the same cost and is matched to the same popular cluster center. Crucially, we now use Lemma 5 that says that whenever a cluster X i is not ignored, we have X i = in10Θ(X i , C). Putting things together, we get X i [k] \ T , i is good τ(X i , C) Adapting k-means Algorithms for Outliers = τ(X in, C) X i [k] \ T , i is not good τ(X i , C) X i T τ(X i , C) τ(in10Θ(X in, C), C) + τ(out10Θ(X in, C), C) 100τ(inΘ(X in, C), C) + Θg 1 10τ(in10Θ(X out, C), C) + 35OPT + τ(X in, C) gΘ i [k], i is not ignored τ(X i , C) Eq. (15) 100τ(in10Θ(X in, C), C) + ((ε z + g)Θ OPT) 10τ(in10Θ(X out, C), C) + 35OPT + τ(X in, C) gΘ 1 2τ(in10Θ(X in, C), C) Lemma 5 10τ(in10Θ(X in, C), C) + 1 10τ(in10Θ(X out, C), C) + ε zΘ 100OPT τ(out10Θ(X in, C), C) gΘ 10τ(in10Θ(X, C), C) + ε zΘ 102OPT ε zΘ 100 Eq. (12) 10 (τ(in10Θ(X, C), C) + ε zΘ) Θ 300OPT 10 (τ(in10Θ(X, C), C) + τ(out10Θ(X, C), C) zΘ) = τ(X, C) zΘ Lemma 11. Let C denote the current set of candidate centers. Let ε (0, 1] be arbitrary and β := 300 ε . Suppose that for out current candidate centers C we have at least (1 + ε)z points in X that have a squared distance of at least 10Θ to the closest center in C. Furthermore, assume that Θ βOPT z . Then, with probability at least ε/200, one step of Local-search++ with outliers results in a new set of candidate centers C with τ(X, C ) zΘ (1 1/(100k)) (τ(X, C) zΘ). Proof. Let c denote the point sampled by Local-search++ with outliers . The probability that c X i for some i with i being a good cluster index is at least 10 τ(X, C) (1 + ε)zΘ zΘ 10(1 + ε)zΘ τ(X, C) (1 + ε)zΘ Let i denote an arbitrary good cluster index. Lemma 4 implies that E[τ(X i , c )|c X i ] 8 τ(X i , c i ). Adapting k-means Algorithms for Outliers Hence, by a simple application of Markov s inequality, we obtain P[τ(X i , c ) 9 τ(X i , c i )|c X i ] 1 Putting things together, the probability that the sampled point c is contained in X i for some good cluster index i and it additionally holds that τ(X i , c ) 9 τ(X i , c i ) (16) is at least First, consider the case that i H. 
In that case, we can upper bound τ(X, C ) as follows: τ(X, C ) τ(X, (C \ {cb(i)}) {c }) swap cb(i) and c = τ(X, C) + τ(X, (C \ {cb(i)}) {c }) τ(X, C) = τ(X, C) + τ(X \ X i , (C \ {cb(i)}) {c }) τ(X \ X i , C) + τ(X i , (C \ {cb(i)}) {c }) τ(X i , C) τ(X, C) + reassign(X, C, ci) + τ(X i , c ) τ(X i , C) Fact 4 τ(X, C) + reassign(X, C, ci) + 9τ(X i , c i ) τ(X i , C) Eq. (16) τ(X, C) τ(X, C) zΘ 100k . Definition 3 The case i / H is very similar. Thus, we obtain τ(X, C ) zΘ τ(X, C) τ(X, C) zΘ = (1 1/(100k)) (τ(X, C) zΘ) , as desired. Lemma 12. Let C denote the current set of candidate centers. Let ε (0, 1] be arbitrary and β := 300 ε . Let α > 0 be arbitrary such that τ(X, C) αOPT. Assume that Θ βOPT z . Let C0 := C and for t 0, let Ct+1 denote the set of centers obtained by running Local-search++ with outliers with input centers Ct. Then, with positive constant probability p > 0, there are at most (1 + ε)z points in X that have a squared distance of at least 10Θ to the closest candidate center in Ct for some t T with T = O(log(α)k/ε). Proof. For t 0, we define the potential Φ(t) = τ(X, Ct) zΘ. Now, let Yt denote the indicator variable for the event that there are less than (1 + ε)z points that have a squared distance of at least 10Θ to the closest center in Ct or Φ(t + 1) (1 1/(100k)) Φ(t). Note that the random variables Y0, Y1, . . . , YT 1 are not independent. However, Lemma 11 implies that E[Yi|Y0, Y1, . . . , Yi 1] ε 200 for every i {0, . . . , T 1} and every realization of random variables Y0, Y1, . . . Yi 1. Thus, the random variable Y = PT 1 i=0 Yi stochastically dominates the random variable Y = PT 1 i=0 Y i , where the (Y i ) s are independent Bernoulli Adapting k-means Algorithms for Outliers variables that are equal to 1 with probability ε 200. Now, let T := 100k log(α) + 1 and T := 2T ε/200 . By a standard application of the Chernoff bound in Theorem 7, we get P[Y T ] P[Y (1 1/2)E[Y ]] e Ω(E[Y ]) = e Ω(1). Thus, there exists a positive constant p such that with probability at least p, we have Assume that Y T . First, consider the case that there exists a t {0, . . . , T 1} for which there are less than (1 + ε)z points in X that have a squared distance of at least 10Θ to the closest center in Ct. In that case we are done. Thus, assume that this does not happen. In particular, this implies that Φ(T 1) (1 + ε)zΘ zΘ = εΘz > OPT. However, we also have Φ(T 1) (1 1/(100k))T 1 Φ(0) (1 1/(100k))100k log(α) αOPT a contradiction. This concludes the proof. C.3. Putting things together Lemma 13. Consider some iteration i for which OPT OPTguess 2OPT. Then, with positive constant probability, there exists some iteration j of the inner loop such that |Xi,j out| (1 + ε)z and ϕ(X \ Xi,j out, Ci,j) = O(1/ε)OPT. Proof. According to (BVX19; Li20), running Algorithm 2 results in a set of centers Ci,0 such that E[τ(X, Ci,0)] = O(log k) min C : |C |=k τ(X, C ) O(log k) (OPT + z Θ) O(log k)βOPT. By using Markov s inequality, we therefore have τ(X, Ci,0) = O(log k)βOPT with positive constant probability. Conditioned on that event, we use Lemma 9 to deduce that τ(X, Ci,T ) 100βOPT for some T = O(k log(O(log k))) = O(k log log k) with positive constant probability. Then, we use Lemma 12 to deduce that with positive constant probability, there exists some T O(log(1/ε)k/ε) such that there are at most (1 + ε)z points in X with a squared distance of at least 10Θ to the closest center in Ci,T +T and furthermore τ(X, Ci,T +T ) τ(X, Ci,T ) 100βOPT = O(1/ε)OPT. 
Hence, |X^{i,T+T'}_out| ≤ (1 + ε)z and ϕ(X \ X^{i,T+T'}_out, C^{i,T+T'}) ≤ 10 τ(X \ X^{i,T+T'}_out, C^{i,T+T'}) = O(1/ε)·OPT with T + T' = O(k log log k + log(1/ε)k/ε), as set in the inner loop of Algorithm 4. Finally, we are ready to finish the analysis of Algorithm 4 by proving Theorem 9.
Proof of Theorem 9. As k + z < n, we have 1 ≤ min_{C', X'_out : |C'|=k, |X'_out|=z} ϕ(X \ X'_out, C') ≤ nΔ². For i = 0, we have 2^i = 1 and for i = ⌈log(nΔ²)⌉, we have 2^i = 2^{⌈log(nΔ²)⌉} ≥ nΔ². Thus, there exists some i ∈ {0, 1, . . . , ⌈log(nΔ²)⌉} such that in the i-th iteration we have OPT_guess ∈ [OPT, 2OPT] and the statement follows from Lemma 13.
D. Using the Metropolis-Hastings algorithm to speed up Algorithm 2
Here we explain how to speed up Algorithm 2 by using the Metropolis-Hastings algorithm. This was first done in (BLHK16b; BLHK16a) for the classical k-means++ algorithm, but there it only leads to a rather weak additive error guarantee; as we will see, this is not the case for k-means with outliers. We alter every step of Algorithm 2 as follows. Instead of sampling a new point proportional to its τΘ-weight, we instead sample a new point from the following Markov chain: First, we sample a point x according to a proposal distribution q. Then, in each of the following T steps, we sample a new point y from q and set x ← y with probability min(1, q(x)π(y) / (q(y)π(x))), where π is the target distribution, in our case the τΘ(·, C)/τΘ(X, C)-distribution. In other words, we define a Markov chain with transition probability P(x, y) = q(y) · min(1, q(x)π(y) / (q(y)π(x))).
In (BLHK16b), a uniform proposal distribution q(x) = 1/n for all x ∈ X was chosen. The advantage of the uniform distribution is that one can sample from it easily, but only rather weak guarantees are known for the resulting algorithm in the context of speeding up k-means++. In (BLHK16a), q(x) = ϕ(x, c)/ϕ(X, c) for a random point c was chosen. This distribution can be precomputed in O(n) time and the guarantee of the resulting algorithm matches the guarantee of the original k-means++ algorithm up to an additional additive error term δϕ(X, µ(X)), if T = O(1/δ) is chosen.
Remark 1. One can see (but we will not prove it here) that the analysis of (BLHK16a) directly generalizes to τΘ-costs for every non-negative Θ. Choosing Θ = OPT/z, one obtains δτΘ(X, µ(X)) ≤ δ·nΘ = δn·OPT/z. Hence, the choice of δ = O(z/n) yields that the additional incurred error is of the same order as OPT. Hence, generalizing the result of (BLHK16a) and using Lemma 1, we obtain an (O(log k), O(log k))-approximation algorithm with time complexity O(n + k·(k/δ)) = O(n + nk²/z), because we need to precompute the proposal distribution q in O(n) time and sampling one point takes O(1/δ) steps, each implementable in time O(k). Changing the proposal distribution to uniform would recover the same guarantees.
While in the case of classical k-means one needs to use the more complicated proposal distribution q(x) = ϕ(x, c)/ϕ(X, c) to get any guarantees, it turns out that for k-means with outliers a uniform proposal suffices. To get Theorem 5, we start by proving that Algorithm 2 with ℓ = O(k) can be sped up with the Metropolis-Hastings algorithm with a uniform proposal.
Theorem 10. A version of Algorithm 2 with ℓ = O(k), Θ = Θ(OPT/z), and where each sampling step is approximated by the Metropolis-Hastings algorithm running for T = O(n/z) steps with a uniform proposal distribution, runs in time O(nk²/z) and produces a solution that with positive constant probability satisfies τΘ(X*_in, C) = O(OPT).
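To make the sped-up sampling step concrete, here is a minimal Python sketch (ours) of one Metropolis-Hastings sample with a uniform proposal and target distribution proportional to τ_Θ(·, C), as used in Theorem 10; the chain length and the degenerate-case handling are illustrative choices.

```python
import numpy as np

def mh_sample_index(points, centers, theta, num_steps, rng):
    """Approximate one tau_Theta-proportional sampling step of Algorithm 2 by running a
    Metropolis-Hastings chain with uniform proposal q(x) = 1/n (a sketch, not the paper's code)."""
    def tau_point(x):
        # thresholded squared distance of a single point to the current centers
        if len(centers) == 0:
            return theta
        return min(((centers - x) ** 2).sum(axis=1).min(), theta)

    n = len(points)
    idx = int(rng.integers(n))               # initial state, drawn uniformly
    w = tau_point(points[idx])
    for _ in range(num_steps):               # num_steps = O(n / z) in Theorem 10
        j = int(rng.integers(n))             # uniform proposal
        w_j = tau_point(points[j])
        # acceptance probability min(1, pi(y) q(x) / (pi(x) q(y))) = min(1, w_j / w) here
        if w == 0 or rng.random() < min(1.0, w_j / w):
            idx, w = j, w_j
    return idx
```

With O(k) sampled centers, O(n/z) proposals per sample, and O(k) work per τ_Θ evaluation, this gives the O(nk²/z) running time claimed in Theorem 10.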
To prove Theorem 10, we first recall a classical result about the Metropolis-Hastings algorithm (MRR+53; Liu96). If the proposal distribution q and the target distribution π satisfy q(x) δπ(x) for all x and some δ > 0, then after T = O(1/δ) steps, the distribution of the Metropolis-Hastings chain dominates π/2. Adapting k-means Algorithms for Outliers Lemma 14 (one Metropolis-Hastings step (Liu96)). Let π denote the target distribution we want to approximate with Metropolis-Hastings and pt = απ + (1 α)p t for some distribution p t. Let q be a proposal distribution with q(x) δπ(x) for all x. Then, after one step of Metropolis-Hastings, we are in a distribution pt+1 which can be written as pt+1 = α π + (1 α )p t+1 for α = α + δ(1 α). Proof. For any fixed vertex x, given that x is the current vertex after t steps, the probability of moving from x to y is P(x, y) = q(y) min 1, q(x)π(y) Now, if q(x)π(y) q(y)π(x), we have P(x, y) = q(y) δπ(y) by the definition of δ. On the other hand, if q(x)π(y) < q(y)π(x), we have P(x, y) = q(x)π(y) π(x) q(x)π(y) q(x)/δ = δπ(y). So, for any x and any y we have P(x, y) δπ(y), and since π is a stationary distribution, for any x we have pt+1(x) απ(x) + (1 α)δπ(x), as needed. We now prove Theorem 10. Proof. Fix one sampling step of Algorithm 2 and suppose that τ(X in, C) 40OPT. For T = O(n/z), we will show that we sample from a distribution p T such that for π(x) = τ(x, C)/τ(X, C) we have p T (x) π(x)/2 for all x. Hence, we may interpret sampling from p T as sampling from π with probability 1/2 and sampling from some other distribution otherwise. Thus, we have essentially reduced the analysis to the one in the proof of Theorem 8, only loosing a 2-factor in the number of required iterations. The running time of the algorithm is O(k n z k), as we need to sample O(k) points in total, O(n/z) iterations of the Metropolis-Hastings algorithm are necessary for each point sample and in each step we need to compute τΘ(x, C), which can be done in time proportional to the number of points already sampled. We now prove that p T (x) π(x)/2 for all x, provided that Θ [OPT/z, 2OPT/z]. First note that π(x) = τ(x, C)/τ(X, C) Θ/τ(X, C), so for δ = z n we have for any x that n 2OPT/z τ(X, C) = q(x) 2OPT τ(X, C) q(x). Hence, we may use Lemma 14 to conclude that after O(1/δ) = O(n/z) steps, p T (x) π(x)/2 for all x, as needed. We are now ready to prove Theorem 5 that we restate here for convenience. Theorem 5. There is an (O(1), O(1))-approximation algorithm for k-means with outliers that runs in time O(nk min(1, k/z)) and succeeds with positive constant probability. Proof. If z = O(k log k) or z = O(log2 n), we can run Algorithm 4 with ε = 1 to obtain an (O(1), O(1))-approximation in time O(nk) (Theorem 9). Otherwise, we run the Metropolis-Hastings based variant of Algorithm 2 from Theorem 10 with ℓ= O(k) that runs in O(nk2/z) time for O(log n) values Θ0 = 20/z, Θ1 = 21/z, . . . , Θlog( 2n) = 2n/z. Hence, we get a sequence of sets of centers Y0, Y1, . . . with |Yi| = O(k). Theorem 10 ensures that there will exist one value Θ with OPT/(2z) Θ OPT/z in the set of values Θ0, . . . , Θlog( 2n) such that the sped up Algorithm 2 returns, with positive constant probability, a set Y with τ Θ(X in, Y ) = O(OPT). In Theorem 11 we give an algorithm with running time O(nk2/z) that turns the pair ( Y , Θ) into a set of cluster centers C, | C| = k, such that ϕ Az(X, C) = O(OPT) for some universal constant A, with positive constant probability. 
We apply the algorithm from Theorem 11 to all pairs (Yi, Θi) and get a sequence of O(log n) sets C0, C1, . . . with |Ci| = k such that, by Theorem 11, with positive constant probability it contains a set C with ϕ Az(X, C) = O(OPT). Adapting k-means Algorithms for Outliers To conclude, we describe how in time O(nk/z) we can get an estimator ζC of ϕ Az(X, C) for some set of k centers C such that the following two properties are satisfied. ζC 4ϕ Az(X, C) with probability 1 1/poly(n) (17) 4ϕ 10Az(X, C) with probability 1 1/poly(n). (18) Then, we can simply estimate the cost of each set of centers Ci and return the one with the smallest estimated cost. With positive constant probability, there is at least one set of centers Ci with ϕ Az(X, Ci) = O(OPT). Hence, by Eq. (17) there also exists an estimator ζCi, with positive constant probability, such that ζCi = O(OPT). In that case, we choose a set of centers CA with ϕ 10Az(X, CA) 4ζCA = O(OPT) with constant positive probability, by Eq. (18). The estimator ζC for ϕ Az(X, C) is constructed as follows. We sample N = O((n/z) log2 n) points from X uniformly and independently at random with replacement. We denote the resulting multiset as X . Then, we compute the value ϕ 4Az(N/n)(X , C) in time O(|X |k) = O(nk/z) and set ζC = n N ϕ 4Az(N/n)(X , C). We define Xin as the closest Az points to the set C and Xout = X \ Xin. Let us split the points of Xin into log sets X1, X2, . . . such that x Xi if 2i ϕ(x, C) 2i+1. We say a set Xi is big if |Xi| z/ log and small otherwise. The intuition here is that either a set Xi is big enough so that the number of sampled points from Xi is concentrated around its expectation, or the set is small enough with respect to the number of outliers z we consider. More precisely, for every big set Xi we can compute that E[|Xi X |] z n log Ω( n z log2(n)) = Ω(log n), so by Theorem 7, we have P (E[|Xi X |]/2 |Xi X | 2E[|Xi X |]) (19) 1 e Θ(|Xi X |) = 1 1/poly(n). Similarly, we can compute P (E[|Xout X |]/2 |Xout X | 2E[|Xout X |]) (20) 1 e Θ(|Xout X |) = 1 1/poly(n). Furthermore, we have | i is small Xi| log( ) z log( ) = z. As z n N = Ω(log2 n), we can conclude that X contains at most 2 z n N points from i is small Xi with probability 1 1/poly(n). We now condition on the events from Eq. (19) and Eq. (20) and we additionally condition on the event that X contains at most 2 z n N points from i is small Xi. We first prove Eq. (17). We have ζC = n N ϕ 4Az(N/n)(X , C) 4ϕ Az(X, C). By Eq. (20), we have |Xout X | 2Az(N/n), so we can label points in Xout X as outliers. Furthermore, we can additionally label all points in ( i is small Xi) X as outliers, as | ( i is small Xi) X | 2z(N/n). The number of sampled points from each big Xi is at most 2 (N/n) |Xi| by Eq. (19) and the costs of points in Xi differ only by a 2-factor. Next, we prove Eq. (18). Let z i be the number of outliers one needs to choose from Xi X to optimize the cost ϕ 4Az(N/n)(X , C). Let us define a set of outliers XA out such that it contains the set Xout, all points from small sets Xi and 2z i(n/N) arbitrary points from each big set Xi. Then, |XA out| z + z + 2(4Az(N/n))(n/N) 10Az. To bound the cost ϕ(X \ XA out, C), we bound the contribution of each set Xi. We have (|Xi| 2z i(n/N)) 2i+1 N |X Xi| 2z i n N )2i+1 Eq. (19) N 2i(|X Xi| z i). Adapting k-means Algorithms for Outliers Summing up these contributions, we get ϕ 10Az(X, C) 4 n N ϕ 4Az(N/n), as desired. 
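The estimator ζ_C described above is simple to state in code. The following Python sketch (ours; the constants are illustrative and N plays the role of the sample size from the proof) subsamples N points with replacement, removes the 4Az·(N/n) costliest sampled points, and rescales the remaining cost by n/N.

```python
import numpy as np

def estimate_cost_with_outliers(points, centers, z, A, rng, N=None):
    """Sketch of the estimator zeta_C for phi_{Az}(X, C): uniform subsample of size
    N = O((n/z) log^2 n), drop the 4*A*z*(N/n) costliest sampled points, rescale by n/N."""
    n = len(points)
    if N is None:
        N = int(np.ceil((n / z) * np.log(n) ** 2))
    sample = points[rng.integers(n, size=N)]                      # with replacement
    d2 = ((sample[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)
    num_drop = int(np.floor(4 * A * z * N / n))                   # sampled "outliers"
    kept = np.sort(d2)[: max(N - num_drop, 0)]
    return (n / N) * kept.sum()
```

Evaluating the estimate for one candidate set costs O(|X'|·k), i.e., O(nk/z) up to logarithmic factors, so estimating all O(log n) candidate sets stays within the claimed running time.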
For the purpose of getting a set of centers that provides an (O(1), O(1))-approximation in O(nk2/z) time, we next provide a simple adaptation of the a clustering of a clustering is a clustering result of (GMM+03). Theorem 11. Let X be an input set, OPT the optimal solution to the k-means with z outliers objective on X and Θ such that OPT/(2z) Θ OPT/z. Suppose we have a set Y, |Y | = O(k) such that τΘ(X, Y ) = O(OPT). Then, in time O(nk2/z) we can compute a set CA, |CA| = k, such that τΘ(X, CA) = O(OPT) with positive constant probability. Proof. Let C be the clustering of an optimal solution. Recall that τΘ(X, C ) OPT + zΘ = O(OPT). We subsample N = Θ( nk z log n) points uniformly and independently with replacement from X, thus obtaining a multiset of points X . For each point x X we compute its distance to Y . This is done in time O(nk2/z). For y Y we define B (y) as the set of points x X for which y is the closest center and B(y) as the set of points x X for which y is the closest center. Moreover, we define ˆw(y) = (n/N)|B (y)| and w(y) = |B(y)|. Note that E[ ˆw(y)] = w(y). We denote with yx the point y Y for which x B(y). For each y Y we either have w(y) < z/k, and then we call the respective set B(y) small, or w(y) z/k and the respective set B(y) is called big. For big sets we have E[|B (y)|] (N/n) z/k = Ω(log n). In this case, an application of Chernoff bounds (Theorem 7) gives that w(y)/2 ˆw(y) 2w(y) (21) with probability 1 e Ω(E[|B (y)|]) = 1 1/poly(n). We condition that this event happens for all y Y . We know that P y small w(y) = O(z), so E[P y small ˆw(y)] = O(z) and by Markov s inequality P y small ˆw(y) = O(z) with constant probability. Again, we condition on this event. We now run some O(1)-approximate algorithm for k-means with penalties on the weighted instance (Y, ˆw(y)) that runs in O(n k) time for inputs of size n (e.g., Algorithm 4 can be interpreted as such an algorithm by the easy part of the analysis captured by Lemma 9; the hard part of the analysis Lemma 12 then talks about the guarantees of the algorithm for k-means with outliers). As our instance has n = O(nk/z) points, the time complexity of the algorithm will be O(nk2/z). The algorithm returns a set CA of k centers that we return. We have τΘ((Y, ˆw), C ) (22) y small ˆw(y)Θ + X y big ˆw(y)τΘ(y, C ) O(z) Θ + 2 X y big w(y)τΘ(y, C ) Eq. (21) O(OPT) + 4 X x B(y) τΘ(y, x) + τΘ(x, C ) Fact 2 O(OPT) + O(τΘ(X, Y )) + O(τΘ(X, C )) = O(OPT). x X τΘ(x, CA) x X (τΘ(x, yx) + τΘ(yx, CA)) Fact 2 = 2τΘ(X, Y ) + 2τΘ((Y, w), CA) O(OPT) + 2 X y small w(y)τΘ(y, CA) + 2 X y big w(y)τΘ(y, CA) O(OPT) + O(z) Θ + 4 X y big ˆw(y)τΘ(y, CA) Eq. (21) = O(OPT) + O(τΘ((Y, ˆw), C )) = O(OPT), Eq. (22) Adapting k-means Algorithms for Outliers Remark 2. We believe, but do not prove, that refining the above proofs and using the results from Appendices C and F, we can refine Theorem 5 to achieve an (O(1/poly(ε)), 1 + ε)-approximation algorithm with running time O(nk2/(zpoly(ε))). E. Lower bounds In the following, we consider algorithms for the k-means/k-median/k-center with outliers problem in the general metric space query model. The algorithm is only required to output a set of k centers and does not need to explicitly declare points as outliers. Theorem 12 (Formal version of Theorem 6). Let A denote a (randomized) algorithm for the k-means/k-median/k-center with outliers problem with with a query complexity of o(nk2/z). Let α, β O(1) be arbitrary. 
Then, for large enough values k, z, and n with 10000·k·log k ≤ z ≤ n/10000, there exists a concrete input instance for the k-means/k-median/k-center problem with parameters k, z and n such that A outputs an (α, β)-approximation with probability less than 0.5. If one does not assume that the ratio between the maximum and minimum distance is bounded, one can show that the statement even holds for an arbitrary multiplicative factor α. For simplicity, we assume in the proof that β = 2, but the proof works for an arbitrary constant β.
Proof. We roughly follow the proof of (Met02). Applying Yao's principle (Yao77), it suffices to provide a distribution over metric spaces such that any deterministic algorithm needs Ω(nk²/z) queries to succeed with probability 0.5. To that end, let k, z and n be given such that 10000·k·log k ≤ z ≤ n/10000 and assume that k is sufficiently large. Let C = {c_1, . . . , c_k} be a set of points such that the distance between any two points in C is 1. Next, we define a probability distribution D over C. Our random instance then consists of a multiset of n points X = {x_1, . . . , x_n} with each point being sampled independently according to D. More precisely, we call S = {c_1, . . . , c_{k/2}} the set of small points and B = {c_{k/2+1}, . . . , c_k} the set of big points (for simplicity, we assume that k is even). The probability of sampling each small point is set to 100z / (n · k/2) and the probability of sampling each big point is set to (n − 100z) / (n · k/2). This defines a valid probability distribution, as n ≥ 100z and (k/2) · 100z/(n · k/2) + (k/2) · (n − 100z)/(n · k/2) = 1. Note that it can (and will) happen that we sample the same point twice, but the algorithm does not know it unless it asks the oracle M, which for a query (x, y) returns the distance d(x, y) of the two points in the constructed metric space. One could perturb this instance slightly in order to get an instance with a lower bound on the minimum distance between any two points. We refer to the multiset of points that are a clone of c_j as a small cluster if j ∈ [k/2] and as a big cluster otherwise. We denote by E_{i,j} the event that the i-th point is a clone of c_j. We consider an arbitrary deterministic algorithm A that asks at most T = nk²/(100000z) queries to the distance oracle M. We refer to a query as informative if the distance between the two queried points is 0 and uninformative otherwise. We first prove a helper lemma.
Lemma 15. Assume that among the first t queries of A, at most 0.1k queries involved the i-th point and all of those queries were uninformative. Then, conditioned on the first t queries and the answers to those queries, the probability that the i-th point belongs to some arbitrary fixed small cluster with index j_small ∈ [k/2] is at most 500z/(kn). Moreover, the statement still holds if we additionally condition on an arbitrary cluster assignment of all points, except the i-th one, that is consistent with the oracle answers to the first t queries.
Proof. Let Q_t denote the event corresponding to the observed interaction between A and the distance oracle restricted to the first t queries. To prove the lemma, we will not only condition on Q_t, but we additionally fix the randomness of all points except the i-th one in an arbitrary way. We denote the corresponding event by R_{-i} and we assume that P[Q_t ∩ R_{-i}] > 0. Among the first t queries, the i-th point was queried at most 0.1k times and all queries were uninformative. Hence, conditioned on Q_t and R_{-i}, there are at least 0.5k − 0.1k = 0.4k big clusters that the i-th point could still be contained in.
Let jbig denote an arbitrary index of one of those big clusters. Using Bayes rule, we have Adapting k-means Algorithms for Outliers P[Ei,jsmall|Qt R i] P[Ei,jsmall|Qt R i] 0.4k P[Ei,jbig|Qt R i] = P[Qt R i|Ei,jsmall]P[Ei,jsmall] 0.4k P[Qt R i|Ei,jbig]P[Ei,jbig] = P[R i|Ei,jsmall]P[Ei,jsmall] 0.4k P[R i|Ei,jbig]P[Ei,jbig] = P[Ei,jsmall] 0.4k P[Ei,jbig] = 100z 0.4k (n 100z) Let CA denote the set of centers that A outputs. To simplify the notation, we assume that CA is a set of k indices between 1 and n instead of k points. We define CA,small = CA [k/2] as the set of indices corresponding to small clusters. The main ingredient of our lower bound argument is to show that E[|CA,small|] is small, namely at most 0.1k. To that end, we partition CA,small = C1 C2 C3 into three different sets C1 := {i CA,small : the i-th point was involved in less than 0.1k queries,all being uninformative}, C2 := {i CA,small : the i-th point was involved in at least 0.1k queries, the first 0.1k of those being uninformative}, C3 := {i CA,small : Among the first 0.1k queries the i-th point was involved with, at least one was informative}. We prove that the expected size of all three sets is small. We start by bounding E[|C1|]. To that end, let Q denote the event corresponding to some (fixed) complete observed interaction between A and the oracle. As A is deterministic, conditioned on Q, the set CA is completely determined. Consider some arbitrary i CA such that the i-th point was involved in at most 0.1k queries, all being uninformative. Lemma 15 together with a union bound over all small clusters implies that P[the i-th point belongs to a small cluster | Q] (23) Hence, using linearity of expectation, we obtain E[|C1|] k 250z Next, we bound the expected size of C2. Note that there can be at most 2 nk2/(100000z) 0.1k = nk 5000z points that are involved in at least 0.1k queries, the first 0.1k of those being uninformative. Among those points, the expected fraction of points belonging to small clusters is at most 250z n . This is again a consequence of Eq. (23), as at the moment the i-th point was queried exactly 0.1k times, all queries being uninformative, the probability that the point belongs to a small cluster is at most 250z n . Hence, we can conclude that E[|C2|] 250z n nk 5000z = 0.05k. Thus, it remains to bound E[|C3|]. To that end, for t [T], let Xt denote the indicator variable for the following event: The t-th query of A is informative, involves a point from a small cluster that was queried less than 0.1k times before and all those previous queries were uninformative. Again, Lemma 15 implies that E[Xt] 500z k n . As |C3| 2 P t [T ] Xt, linearity of expectation implies that E[|C3|] 2 nk2 100000z 500z Adapting k-means Algorithms for Outliers Putting things together, we obtain that E[|CA,small|] 0.1k. By Markov s inequality, this implies that P(|CA,small| 0.3k) 1/3. As the expected size of each small cluster is 200z k 200 log k, a Chernoff Bound followed by a Union Bound implies that each small cluster contains at least 100z/k points with a probability of at least 0.9. Hence, with a probability of at least 0.9 1/3 > 0.5, each small cluster contains at least 100z/k points and A outputs at most 0.3k centers belonging to small clusters. As (0.5k 0.3k) 100z k > 2z, this implies that the clustering cost with respect to CA is not zero, unlike the clustering cost of an optimal solution. This finishes the proof. Remark 3. 
Note that the same construction can be used to give an Ω(k²/ε) lower bound for k-means algorithms that output a solution C with ϕ(X, C) = O(ϕ(X, C*) + εϕ(X, µ(X))). This shows that the complexity of the AFK-MC² algorithm (BLHK16a) is essentially the best possible. We omit the details.
F. Distributed algorithms
In this section, we show how to adapt the classical streaming algorithm of (GMM+03) and the popular distributed algorithm k-means|| to the setting with outliers by extending Theorem 8. In Section F.1, we show two ways of, roughly speaking, distributing Theorem 8. Then, in Section F.2, we show how the resulting set can be used to get distributed algorithms that output (1 + ε)z outliers and achieve an O(1/ε)- or even O(1)-approximation. The most natural distributed model for the following algorithms is the coordinator model. In this model, the data X = X_1 ∪ X_2 ∪ · · · ∪ X_m is evenly split across m machines. In a 1-round distributed algorithm such as Algorithm 5, the machines perform some computation on their data and send the result to the coordinator, who computes the final clustering. In a t-round distributed algorithm such as Algorithm 6, there are t rounds of communication between the machines and the coordinator.
F.1. Distributed algorithms with bicriteria guarantee
In this section, we start by presenting Algorithm 5, a simple variant of the algorithm of (GMM+03). In this algorithm, each machine runs Algorithm 2 with oversampling and sends the resulting clustering, together with some additional information, to the coordinator. The analysis of this algorithm follows from our Theorem 8.
Algorithm 5 Overseeding from Guha et al. (GMM+03)
Require: data X, threshold OPT/(2ε) ≤ Θ ≤ OPT/ε
1: split X arbitrarily into sets X_1, X_2, . . . , X_m.
2: for j = 1, 2, . . . , m in parallel do
3:   Y_j ← Algorithm 2 with X_A2 = X_j, ℓ_A2 = O(k/ε), and Θ_A2 = Θ.
4:   For each y ∈ Y_j, let w(y) be the number of points x ∈ X_j with τΘ(x, Y_j) < Θ such that y = arg min_{y' ∈ Y_j} τΘ(x, y'), with ties handled arbitrarily. Let X_{out,j} be the set of points x ∈ X_j with τΘ(x, Y_j) = Θ.
5: end for
6: Each site j sends (Y_j, w) and |X_{out,j}| to the coordinator, who computes the weighted set (Y, w) = ∪_{j=1}^m (Y_j, w) and the cardinality of X_out = ∪_j X_{out,j} as |X_out| = ∑_j |X_{out,j}|.
Here, by a weighted set (Y, w) we mean a set Y = {y_1, y_2, . . . , y_{|Y|}} together with a weight function w : {1, 2, . . . , |Y|} → N. Set operations naturally extend to the weighted setting and, as we observe in the next subsection, so do k-means algorithms.
Theorem 13. Let OPT/(2εz) ≤ Θ ≤ OPT/(εz). Suppose we run Algorithm 5 on the m machines with ℓ = O((k/ε) log(m/δ)). Then with probability 1 − δ we have ∑_{j=1}^m τ(X_j ∩ X*_in, Y_j) = O(OPT).
Proof. Let z_j = |X_j ∩ X*_out| and recall that C* is the optimal solution for the whole instance X. By Theorem 8, with probability at least 1 − m · δ/m = 1 − δ, we have for every j that τ(X_j ∩ X*_in, Y_j) = O(inf_{C, |C|=k} τ(X_j ∩ X*_in, C) + εz_jΘ) = O(τ(X_j ∩ X*_in, C*) + εz_jΘ). This implies that ∑_{j=1}^m τ(X_j ∩ X*_in, Y_j) = ∑_{j=1}^m O(τ(X_j ∩ X*_in, C*) + εz_jΘ) = O(τ(X*_in, C*) + εzΘ) = O(OPT).
We leave the description of the algorithm run by the coordinator to Section F.2 and now show a different way of getting the weighted set (Y, w) with essentially the same guarantees, by adapting the k-means|| algorithm.
The k-means|| algorithm We now show a simple adaptation of the k-means|| algorithm (BMV+12): a popular distributed variant of k-means++. In k-means||, the goal is to get O(k) centers that provide a constant approximation guarantee in a small number of distributed steps.
This is done by changing Algorithm 2 as follows: in each round we sample k points instead of just one point proportional to the cost, but the algorithm is run only for O(log n) steps. Our adaptation of k-means|| algorithm to outliers is below. Algorithm 6 k-means|| overseeding Require data X,# rounds t, sampling factor ℓ, threshold Θ 1: Uniformly sample x X and set Y = {x}. 2: for i 1, 2, . . . , t do 3: Y 4: for x X do 5: Add x to Y with probability min 1, ℓτΘ(x,Y ) 6: end for 7: Y Y Y 8: end for 9: For each y Y , let w(y) be the number of points x Xj with τΘ(x, Y ) < Θ and such that y = arg miny Y τΘ(x, y ) with ties handled arbitrarily. Let Xout be the set of points x with τ(x, Y ) = Θ. 10: Return (Y, w), |Xout| We now prove an analogue of Theorem 13. Again, it can be seen as an extension of our basic result Theorem 8. Theorem 14. Let Θ [OPT/(2ε), OPT/ε] be arbitrary. Suppose we run Algorithm 6 for t = O(log(n/ε)) rounds with ℓ= O((k/ε) log n) to obtain output Y . Then, with probability at least 1 1/poly(n), we have that τΘ(X in, Y ) = O(OPT). The following proof is a simple adaptation of the analysis of k-means|| by Rozhon (Roz20). Proof. Let Y0 Y1 Yt be the cluster centers that are gradually built by Algorithm 6. We have Y0 = {y} for a point y picked uniformly at random. Whatever point we picked, we have τ(X, Y0) nΘ = poly(n/ε), as we assume = poly(n). We call a cluster X i unsettled in iteration j if τ(X i , Yj) > 10τ(X i , C ). We define τU(X, Yj) as τU(Y, Yj) = X i, X i unsettled in iter. j τ(X i , Yj). Adapting k-means Algorithms for Outliers We will now prove that with probability 1 1/poly(n) we have τU(X, Yj+1) 1 2τU(X, Yj) + OPT. (25) To prove Eq. (25), fix an iteration j. If τU(X, Yj) OPT, the claim certainly holds, so assume the opposite. We split unsettled clusters into two groups: a cluster X i is called heavy if τ(X i , Yj) τU(X, Yj)/(2k) and light otherwise. The probability that we make a heavy cluster X i settled can be bounded as follows. Sampling a random point from X i proportional to τ( , Yj) makes X i settled with probability at least 1/5 by Corollary 1. Hence, there is a subset Zi X i with τ(Zi, Yj) τ(X i , Yj)/5 such that sampling a point from Zi makes X i settled. If Zi contains a point x with τ(x, Yj) τ(X,Yj) ℓ , we sample x with probability 1 and this also makes X i settled. Otherwise, P(X i does not get settled) x Zi (1 ℓ τ(x, Yj)/τ(X, Yj)) exp( O(k log n/ε) X x Zi τ(x, Yj)/τ(X, Yj)) 1 + x ex exp( O(k(log n)/ε) 1 5τ(X i , Yj)/τ(X, Yj)) exp( O((log n)/ε) 1 10τU(X, Yj)/τ(X, Yj)) X i is heavy exp O((log n)/ε) 1 10 τU(X, Yj) τU(X, Yj) + 10OPT + zΘ each point is unsettled, settled, or an outlier exp O((log n)/ε) 1 10 OPT 11OPT + OPT/ε τU(X, Yj) OPT = 1/poly(n). Hence, every heavy cluster X i gets settled with probability 1 1/poly(n). We now condition on that event. Note that P X i light τ(X i , Yj) k τU(X, Yj)/(2k) = τU(X, Yj)/2. Hence, we get τU(X, Yj) τU(X, Yj+1) X X i heavy τ(X i , Yj) τU(X, Yj)/2 as needed to finish the proof of Eq. (25). Now we can apply Eq. (25) t = O(log(n/ε)) times together with τ(X, Y0) poly(n/ε) to conclude that τ(X in, Yt) O(OPT) + (1/2)tpoly(n/ε) + OPT with probability 1 1/poly(n). F.2. Constructing the Final Clustering In this subsection, we show how the output of Algorithms 5 and 6, that is, a weighted set of centers (Y, w), together with the number of found outliers |Xout|, can be used to get (O(1), 1 + ε)-approximation algorithms, via Theorems 13 and 14. 
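As a concrete picture of the summary that the coordinator receives, here is a minimal Python sketch (ours, not the paper's code) of the k-means||-style overseeding with the thresholded cost, followed by the weight and outlier-count computation from the last step of Algorithm 6. The constant inside ℓ, the round count, and the NumPy/Euclidean setting are illustrative assumptions; see Theorem 14 for the parameters used in the analysis.

```python
import numpy as np

def overseed_with_outliers(points, k, eps, theta, rounds, rng):
    """Sketch of Algorithm 6: every round, each point joins Y independently with probability
    min(1, ell * tau_Theta(x, Y) / tau_Theta(X, Y)); finally return the weighted summary."""
    n = len(points)
    ell = (k / eps) * max(np.log(n), 1.0)               # ell = O((k/eps) log n), constant is ours
    Y = points[[rng.integers(n)]]                       # one uniformly random seed
    for _ in range(rounds):                             # rounds = O(log(n / eps))
        d2 = ((points[:, None, :] - Y[None, :, :]) ** 2).sum(-1).min(axis=1)
        tau = np.minimum(d2, theta)
        probs = np.minimum(1.0, ell * tau / max(tau.sum(), 1e-12))
        new_pts = points[rng.random(n) < probs]
        if len(new_pts):
            Y = np.vstack([Y, new_pts])
    # summary sent to the coordinator: weights of non-saturated points and the outlier count
    d2 = ((points[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    closest, saturated = d2.argmin(axis=1), d2.min(axis=1) >= theta
    w = np.bincount(closest[~saturated], minlength=len(Y))
    return Y, w, int(saturated.sum())
```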
First, in Theorem 15, we only show how to get an (O(1/ε), 1 + ε)-approximation guarantee by simply running Algorithm 4 on Adapting k-means Algorithms for Outliers the weighted data with the parameter of number of outliers set to roughly z |Xout|. The advantage of Theorem 15 is its simplicity (we implement a variant of it in Section 7) and speed. Next, in Theorem 16 we refine the reduction to get an (O(1), 1 + ε)-approximation guarantee. This result is interesting from a theoretical perspective, as we explained in Section 5. As a subroutine, the coordinator needs to run some (O(1), 1 + ε)- approximation algorithm, i.e., linear programming based algorithms (Che08; KLS18), so the resulting distributed algorithm seems not practical. Also, we need to refine Algorithms 5 and 6 so that they pass somewhat more information than just (Y, w) and |Xout|. Theorem 15 (Adaptation of (GMM+03)). Let ε [0, 1] be arbitrary and Θ be such that OPT/(2εz) Θ OPT/(εz). Let (Y, w) denote the weighted point set that Algorithm 6 and Algorithm 5 compute, respectively. Furthermore, assume that τΘ(X in, Y ) αOPT in case of Algorithm 6, and P j τΘ(Xj X in, Yj) αOPT, Y = S Yj, in case of Algorithm 5, respectively. Now, suppose we then run a (β, 1 + ε)-approximation algorithm A for the weighted k-means with outliers formulation on the weighted instance (Y, w) with z A := z |Xout| + 2αεz outliers and let C denote the resulting set of k centers. Then, ϕ (1+5αε)z(X, C) = O((αβ + 1/ε)OPT). Here, we define weighted k-means with outliers as the following problem: for the weighted input (Y, w), where each w(y) is an integer, we want to choose a set of k centers C and a weight vector win(y) for any y such that win(y) is a non-negative integer, for wout(y) = w(y) win(y) we have P y Y wout(y) z A and the sum P y Y win(y)ϕ(y, C) is minimized. Note that algorithms for k-means with outliers can be seen as algorithms for weighted k-means with outliers by viewing weighted points (y, w(y)) as w(y) points on the same location (our assumption that pairwise distances of input points are at least one, is in our case only for simplicity of exposition). Moreover, sampling based algorithms such as Algorithm 3 can be implemented in time proportional not to P y w(y), but |Y | = O(k) by additionally weighting the sampling distribution by the weight of input points. Plugging in our Algorithm 3 as A in Theorem 15 hence yields a distributed algorithm with approximation factor O(αβ + 1/ε) = O(1/ε + 1/ε) = O(1/ε) that outputs (1 + O(ε))z outliers. Moreover, the computational complexity of the coordinator is O(k2/ε2) if we use Algorithm 6 and O(mk2/ε2) if we use Algorithm 5. Proof. We start by showing that |X in Xout| 2αεz. That is, the number of points that Algorithm 5 and Algorithm 6, respectively, declare as outliers in the first phase of the algorithm, but that are actually true inliers with respect to a given optimal solution is small. For Algorithm 6, we have |X in Xout| τΘ(X in, Y ) Θ αOPT OPT/(2εz) = 2αεz, and for Algorithm 5, we have |X in Xout| X τΘ(Xj X in, Yj) Θ αOPT OPT/(2εz) = 2αεz, as desired. In particular, |Xout| = |Xout X out| + |Xout X in| z + 2αεz. Hence, z A = z |Xout| + 2αεz 0 and therefore algorithm A is run on a legal instance. Let Xin := X \ Xout. Notice that for Algorithm 6, we have ϕ(X in Xin, Y ) = τΘ(X in Xin, Y ) αOPT, (26) and for Algorithm 5, we have X j ϕ(X in Xin Xj, Yj) (27) j τΘ(X in Xin Xj, Yj) αOPT. 
(28) Let us call each set of points B(y) X that Algorithm 5 or Algorithm 6 groups around y Y a blob and for a given x Xin, we denote with yx the point with x B(yx). We define (as the respective algorithm also does) w(y) = |B(y)|. Moreover, we define w in(y) = |B(y) X in| and w out(y) = |B(y) X out|. Adapting k-means Algorithms for Outliers Note that splitting the weights w into w in and w out gives us a natural upper bound on the cost of the optimal solution on the instance Y with z A outliers: On one hand, we have y Y w out(y) = |X out Xin| (29) = |X out| |X out Xout| = z |Xout| + |Xout X in| z |Xout| + 2αεz Thus, if we label the weighted set (Y, w out) as outliers, at most z A points are labeled as outliers. On the other hand, we will now bound ϕ((Y, w in), C ), where C is the optimum clustering for the original problem defined on X. For any y Y , x X in B(y), and c x = arg minc C ϕ(x, c ) we have ϕ(y, C ) ϕ(y, c x) 2ϕ(y, x) + 2ϕ(x, c x), (30) where the second inequality is due to Fact 2. Hence, ϕ((Y, w in), C ) (31) y Y w in(y) ϕ(y, C ) (32) x X in B(y) ϕ(y, C ) x X in B(y) 2ϕ(y, x) + 2ϕ(x, c x) Eq. (30) x X in Xin ϕ(yx, x) x X in ϕ(x, c x) 2αOPT + 2OPT Eqs. (26) and (27) Now we consider the output CA and w A out of A, i.e., for each y Y the algorithm A decides which integer weight w A out of y is labeled as outlier. Define w A in(y) = w(y) w A out(y). This solution of instance (Y, w) with P y Y w A out(y) outliers naturally defines a solution of the original instance X as follows. We leave the set of centers CA the same and whenever w A out(y) is nonzero, we label w A out(y) arbitrary points in B(y) as outliers and call this set XA out (note that the algorithm itself does not need to explicitly label the outliers, it suffices to prove that there is a labelling). Then we define points from Xout XA out as outliers. The number of points we label as outliers is bounded by y Y w A out(y) |Xout| + (1 + ε)z A (33) = |Xout| + (1 + ε)(z |Xout| + 2αεz) Adapting k-means Algorithms for Outliers Thus, it remains to bound the cost of the set X \ (Xout XA out) with respect to CA. We have ϕ(X \ (Xout XA out), CA) x B(y)\XA out x B(y) ϕ(x, yx) + 2 X x B(y)\XA out x B(y) X in ϕ(x, yx) + 2 X x B(y) X out ϕ(x, yx) x B(y)\XA out 2αOPT + 2ϕ(Xin X out, Y ) + 2 X y Y w A in(y)ϕ(y, CA) Eq. (26), Eq. (27) 2αOPT + 2zΘ + 2 X y Y w A in(y)ϕ(y, CA) |X out| = z, x Xin ϕ(x, Y ) Θ 2αOPT + 2(1/ε)OPT + 2βϕ((Y, w in), C ) A is β-approximation O((1/ε + αβ)OPT). Better construction Here we sketch a somewhat more elaborate construction that enables us to lose only a constant factor in approximation, instead of O(1/ε). Below, we assume the existence of a (β, 1 + ε)-approximation algorithm that works in what we call an almost-metric space, which is defined as a classical metric space, but without the axiom d(x, y) = 0 x = y. That is, even when an algorithm chooses some point x as a cluster center, the clustering cost of x might still be non-zero. Our algorithms work in this more general setting and we believe that the algorithm of (BERS19) too. In any case, note that the problem of k-means with outliers in an almost metric space can be reduced to the same problem in a metric space by splitting each vertex x into 2k clones x1, . . . , x2k and for i = j defining d (xi, xj) = d(x, x). This increases the number of points by a 2k-factor and we lose only a factor of 2 in our approximation guarantee. 
As in the previous construction, we moreover assume a weighted version of the problem, where for a given integer weight w(x) one is allowed to output an integer 0 wout(x) w(x) and then only w(x) wout(x) weight is used in the computation of the cost. Theorem 16. Let ε (0, 1] be arbitrary. Suppose that there is a (β, 1 + ε)-approximation algorithm A that solves the weighted variant of k-means with outliers in an almost-metric space in polynomial time. Then, there is a distributed polynomial-time (O(β), 1 + O(ε))-approximation algorithm in the coordinator model with communication O(k/ε) per site that succeeds with positive constant probability. Proof. Assume that Θ satisfies OPT εz . We later discuss how to guess OPT in this distributed setting. The algorithm needed is a variant of Algorithm 5, so we only describe the necessary changes. When the variant of Algorithm 5 aggregates all points to the closest center in Yj, it will not just compute for each y its weight w(y), i.e., for how many points in Xj \ Xout the point y is the closest, but it iterates over all these points x1, x2, . . . , xw(y) and rounds the distances d(xi, y) down to the closest power of two. Then, the additional information sent to the leader about y is not just w(y), but also the list w0(y), w1(y), w2(y), . . . , where wk(y) is the number of points x B(y) with distance d(x, y) rounded down to 2k. For any y Yj and k, we define Bk(y) as the set of points in Xj \ Xout for which y is the closest point in Yj and d(x, y) is rounded down to 2k. As before, for a given x Xin := X \ Xout, yx denotes the point y for which x B(y). Adapting k-means Algorithms for Outliers The instance (Y , w ) the leader is going to construct will not be just (Y, w), but the leader creates a new almost-metric space M that includes the original input metric space M that is only defined for points of Y together with the original metric, but moreover, for each y and 0 k log , it contains a point yk with d M (y, yk) = 2k. Now, imagine a weighted graph with the set of vertices being equal to the points in the almost-metric space and there is an edge between two points if we explicitly stated the distance between the two points and the weight of the edge is equal to that distance. Then, the distance between each pair of vertices is just equal to the length of the shortest path between the two vertices in the weighted graph. In fact, we also define d M (yk, yk) = 2 2k (which is not possible in a metric space) and only then the leader runs the (β, 1 + ε)-approximation algorithm A in M . We note that the defined distances satisfy the triangle inequality. As in Theorem 15, we assume that P j τΘ(Xj X in, Yj) αOPT. We run A on M (we will use ϕ and d when talking about (squared) distances in M ) with the number of outliers being set to z A = z |Xout| + 2αεz and get as output a set of centers CA and a set of outliers (Y, w A out) with P y Y w A out(y) (1 + ε)z A. As in the proof of Theorem 15, we would now like to argue that the optimal solution in the almost-metric space has a cost of at most O(αOPT) by considering the optimal set of centers C for the original problem. However, there is one small technical issue. Namely, the points in C might not be contained in the almost-metric space. To remedy this situation, we do the following: We naturally extend the almost-metric space to also include all the points in C and prove that the defined distances still satisfy the triangle inequality. 
Next, we show that in this extended metric space, the cost of the optimal solution (with outliers defined appropriately) has a cost of O(αOPT). By using the uniform sampling lemma (Lemma 3), this implies that there also exists a solution in the original almost-metric space (without the points in C ) with a cost of at most 4 O(αOPT) = O(αOPT), as desired. We start by extending the almost-metric space to points in C . For each c C and y Y , we define d M (y, c ) = d(y, c ). That is, the distances between points in Y and C are simply equal to their original distances. The distance between c C and yk is defined as d M (c , yk) = d M (c , y) + d M (y, yk) = d M (c , y) + 2k. This extension still satisfies the triangle inequality, which more or less directly follows from the fact that the original metric satisfies the triangle inequality. Next, we define a solution for the weighted k-means with z A outliers objective in this extended almost metric-space, whose cost gives us an upper bound of O(αOPT) for the optimal solution. We consider the optimal set of centers C together with the set of outliers X out. We set w out(yk) = |Bk(y) X out|, w out(y) = |{y} X out|, w in(yk) = wk(y) w out(yk) and w in(y) = wk(y) w out(y). Splitting the weights w into w in and w out together with the set of centers C gives us a natural upper bound (up to a factor of 4, as mentioned before) on the cost of the optimal solution on the instance (Y , w ) with z A outliers. First, one needs to verify that the number of declared outliers P y Y w out(y) + P k w out(yk) in the solution is at most z A. This follows in the exact same way as in the proof of Theorem 15 (cf. Eq. (29)). Thus, it remains to bound ϕ ((Y , w in), C ). Let x Bk(y) be arbitrary for some k and y Y . Moreover, let c x := arg minc C ϕ(x, c ). We have ϕ (yk, C ) (34) 2ϕ (yk, y) + 2ϕ (y, C ) = 2ϕ (yk, y) + 2ϕ(y, C ) 2ϕ (yk, y) + 4ϕ(y, x) + 4ϕ(x, c x) 6ϕ(y, x) + 4ϕ(x, c x) ϕ (yk, y) ϕ(y, x), Adapting k-means Algorithms for Outliers where we repeatedly used Fact 2. Hence, ϕ ((Y , w in), C ) (35) y Y w in(y) ϕ (y, C ) + X yk Y w in(yk) ϕ (yk, C ) (36) y Y X in ϕ (y, C ) + X x X in Bk(y) ϕ (yk, C ) y Y X in ϕ (y, C ) + X x X in Bk(y) 6ϕ(y, x) + 4ϕ(x, c x) (37) x X in Xin ϕ(yx, x) x X in ϕ(x, c x) 6O(αOPT) + 4OPT Now we consider the output CA and w A out of A, i.e., for each y Y the algorithm A decides which integer weight w A out(y ) of y is labeled as an outlier. Define w A in(y ) = wk(y ) w A out(y ). This solution of instance (Y , w ) in M with P y Y w A out(y ) outliers naturally defines a solution for the original instance X with |Xout| + P y Y w A out(y ) outliers: for each center c CA in M we define f(c) M as follows: if c = yk for some y and k, we let f(c) = y. Otherwise, f(c) = y for some y and we simply set f(c) = c. Now, we choose f(CA) := {f(c): c CA} Y X as our set of centers. Moreover, whenever w A out(yk) is nonzero, we label w A out(yk) arbitrary points in Bk(y) as outliers and similarly, if w A out(y) is nonzero (and therefore 1), we declare y as an outlier. We call the resulting set of outliers XA out. Finally, we define all the points in Xout XA out as outliers. Note that we have P y Y w A out(y ) (1 + ε)z A, so |Xout XA out| |Xout| + (1 + ε)z A = |Xout| + (1 + ε)(z |Xout| + αεz) = (1 + O(αε))z. Thus, we only label (1 + O(αε))z vertices as outliers, as desired. We now bound the cost of the set X \ (Xout XA out) with respect to the set of centers f(CA). 
To that end, we note that for an arbitrary x Bk(y) and any c M we have ϕ (yk, c) = (2k + d (y, c))2 and therefore ϕ(x, f(c)) (d(x, y) + d(y, f(c)))2 (38) (2k+1 + d(y, f(c)))2 x Bk(y) 4(2k + d(y, f(c)))2 4(2k + d (y, c))2 Definition of f and d = 4ϕ (yk, c). Adapting k-means Algorithms for Outliers Hence, we have ϕ(X \ (Xout XA out), f(CA)) (39) y Y \XA out ϕ(y, f(CA)) + X x Bk(y)\XA out ϕ(x, f(CA)) y Y \XA out ϕ (y, CA) + 4 X x Bk(y)\XA out ϕ (yk, CA) Eq. (38) y Y w A in(y )ϕ (y , CA) 4β 4ϕ ((Y , w in), C ) A is β apx = O(αβOPT). Eq. (35) Note that assuming OPT εz , Theorem 13 implies that P j τΘ(Xj X in, Yj) = O(OPT) with positive constant probability and therefore we can assume that α = O(1). This implies that our final solution has a cost of O(αβOPT) = O(β)OPT and moreover the solution outputs at most (1 + O(αε))z = (1 + O(ε))z outliers, as desired. What remains to be discussed is how to remove the assumption that the algorithm knows OPT. To that end, we again run the algorithm for O(log(n )) guesses of OPT, namely for all values 2e with e [log(n 2)] in parallel as before. Each machine will send all O(log n) respective weighted sets (Y , w) and |Xout,j| to the coordinator that outputs the set of candidate centers that minimizes ϕ ((Y , w A in), CA) among those where Xout satisfies |Xout| z. Eq. (39) certifies that, with positive constant probability, we get an O(β)-approximation of OPT, as needed.
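To summarize the guessing step in code: the coordinator runs the whole pipeline for every guess 2^e of OPT and keeps the cheapest run that does not declare too many outliers. A minimal Python sketch (ours; the candidate tuple format is an assumption, not the paper's interface):

```python
def coordinator_select(candidates, z):
    """Pick the final solution among the O(log n) guesses of OPT: each candidate is assumed
    to be a (centers, estimated_cost, num_declared_outliers) tuple; keep only runs declaring
    at most z outliers and return the one with the smallest estimated cost."""
    feasible = [c for c in candidates if c[2] <= z]
    return min(feasible, key=lambda c: c[1], default=None)
```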