# Clustering Stable Instances of Euclidean k-means

Abhratanu Dutta (Northwestern University, adutta@u.northwestern.edu), Aravindan Vijayaraghavan (Northwestern University, aravindv@northwestern.edu), Alex Wang (Carnegie Mellon University, alexwang@u.northwestern.edu)

The Euclidean k-means problem is arguably the most widely studied clustering problem in machine learning. While the k-means objective is NP-hard in the worst case, practitioners have enjoyed remarkable success in applying heuristics like Lloyd's algorithm to this problem. To address this disconnect, we study the following question: what properties of real-world instances enable us to design efficient algorithms and prove guarantees for finding the optimal clustering? We consider a natural notion called additive perturbation stability that we believe captures many practical instances of Euclidean k-means clustering. Stable instances have a unique optimal k-means solution that does not change even when each point is perturbed a little (in Euclidean distance). This captures the property that the optimal k-means solution should be tolerant to measurement errors and uncertainty in the points. We design efficient algorithms that provably recover the optimal clustering for instances that are additive perturbation stable. When the instance has some additional separation, we can design a simple, efficient algorithm with provable guarantees that is also robust to outliers. We complement these results by studying the amount of stability in real datasets, and by demonstrating that our algorithm performs well on these benchmark datasets.

1 Introduction

One of the major challenges in the theory of clustering is to bridge the large disconnect between our theoretical and practical understanding of the complexity of clustering. While theory tells us that most common clustering objectives like k-means or k-median are intractable in the worst case, many heuristics like Lloyd's algorithm or k-means++ seem to be effective in practice. In fact, this has led to the CDNM thesis [11, 9]: "Clustering is difficult only when it does not matter." We try to address the following natural questions in this paper: Why are real-world instances of clustering easy? Can we identify properties of real-world instances that make them tractable?

We focus on the Euclidean k-means clustering problem: we are given n points X = {x1, . . . , xn} ⊂ R^d, and we need to find k centers µ1, µ2, . . . , µk ∈ R^d minimizing the objective $\sum_{x \in X} \min_{i \in [k]} \|x - \mu_i\|^2$. The k-means clustering problem is the most well-studied objective for clustering points in Euclidean space [18, 3]. The problem is NP-hard in the worst case [14] even for k = 2, and a constant-factor hardness of approximation is known for larger k [5].

(Supported by the National Science Foundation (NSF) under Grant No. CCF-1637585. Part of the work was done while the author was at Northwestern University.)

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

One way to model real-world instances of clustering problems is through instance stability, which is an implicit structural assumption about the instance. Practically interesting instances of the k-means clustering problem often have a clear optimal clustering solution (usually the ground-truth clustering) that is stable: i.e., it remains optimal even under small perturbations of the instance.
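For concreteness, the k-means objective defined above is straightforward to evaluate for a candidate set of centers; the following minimal NumPy sketch (the function name and interface are our own, purely for illustration) computes it.

```python
import numpy as np

def kmeans_cost(X, centers):
    """k-means objective: each point pays its squared distance to the nearest center.

    X: (n, d) array of points; centers: (k, d) array of candidate centers.
    """
    # Pairwise squared Euclidean distances between points and centers, shape (n, k).
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.min(axis=1).sum()

# Tiny usage example on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
centers = X[rng.choice(len(X), size=3, replace=False)]
print(kmeans_cost(X, centers))
```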
As argued in [7], clustering objectives like k-means are often just a proxy for recovering a ground-truth clustering that is close to the optimal solution. Instances in practice always have measurement errors, and optimizing the k-means objective is meaningful only when the optimal solution is stable to these perturbations. This notion of stability was formalized independently in a pair of influential works [11, 7].

The predominant strand of work on instance stability assumes that the optimal solution is resilient to multiplicative perturbations of the distances [11]. For any γ ≥ 1, a metric clustering instance (X, d) on point set X ⊂ R^d and metric d : X × X → R+ is said to be γ-factor stable iff the (unique) optimal clustering C1, . . . , Ck of X remains the optimal solution for any instance (X, d′) where any subset of the distances is increased by up to a γ factor, i.e., d(x, y) ≤ d′(x, y) ≤ γ d(x, y) for any x, y ∈ X. In a series of recent works [4, 8] culminating in [2], it was shown that 2-factor perturbation stable (i.e., γ ≥ 2) instances of k-means can be solved in polynomial time.

Multiplicative perturbation stability represents an elegant, well-motivated formalism that captures robustness to measurement errors for clustering problems in general metric spaces (γ = 1.1 captures relative errors of 10% in the distances). However, multiplicative perturbation stability has the following drawbacks in the case of Euclidean clustering problems:

- Measurement errors in Euclidean instances are better captured using additive perturbations. Uncertainty of δ in the positions of x, y leads to an additive error of δ in ‖x − y‖2, irrespective of how large or small ‖x − y‖2 is.
- The amount of stability γ needed to enable efficient algorithms (i.e., γ ≥ 2) often implies strong structural conditions that are unlikely to be satisfied by many real-world datasets. For instance, γ-factor perturbation stability implies that every point is a multiplicative factor of γ closer to its own center (say µi) than to any other cluster center µj.
- Algorithms that are known to have provable guarantees under multiplicative perturbation stability are based on single-linkage or MST algorithms that are very non-robust by nature. In the presence of a few outliers or noise, any incorrect decision in the lower layers gets propagated up to the higher levels.

In this work, we consider a natural additive notion of stability for Euclidean instances: the optimal k-means clustering solution does not change even when each point is moved by a small Euclidean distance of up to δ. Moving each point by at most δ corresponds to a small additive perturbation of the pairwise distances between the points.³ Unlike multiplicative notions of perturbation stability [11, 4], this notion of additive perturbation is not scale invariant. Hence the normalization or scale of the perturbation is important.

Ackerman and Ben-David [1] initiated the study of additive perturbation stability when the distance between any pair of points can be changed by at most δ = ε diam(X), with diam(X) being the diameter of the whole dataset. Their algorithms take time $n^{O(k/\varepsilon^2)} = n^{O(k\,\mathrm{diam}^2(X)/\delta^2)}$ and correspond to polynomial time algorithms when k and 1/ε are constants. However, this dependence on k·diam²(X)/δ² in the exponent is not desirable, since the diameter is a very non-robust quantity: the presence of one outlier (even one that is far away from the decision boundary) can increase the diameter arbitrarily.
Hence, these guarantees are useful mainly when the whole instance lies within a small ball and for a small number of clusters [1, 10]. Our notion of additive perturbation stability will use a different scale parameter that is closely related to the distance between the centers, instead of the diameter diam(X). As we will see soon, our results for additive perturbation stability have no explicit dependence on the diameter, and allow instances to have potentially unbounded clusters (as in the case of far-away outliers). Further, with some additional assumptions, we also obtain polynomial time algorithmic guarantees for large k.

³ Note that not all additive perturbations to the distances can be captured by an appropriate movement of the points in the cluster. Hence the notion we consider in our paper is a weaker assumption on the instance.

Figure 1: (a) Left: an instance with k = 2 satisfying ε-APS, with D being the separation between the means. The half-angle of the cone is arctan(1/ε), and the distance between µ1 and the apex of the cone (∆) is at most D/2. (b) Right: a (ρ, ∆, ε)-separated instance with scale parameter ∆. All the points lie inside the cones of half-angle arctan(1/ε), whose apexes are separated by a margin of at least ρ.

1.1 Additive Perturbation Stability and Our Contributions

We consider a notion of additive stability where the points in the instance can be moved by at most δ = εD, where ε ∈ (0, 1) is a parameter and $D = \max_{i \neq j} D_{ij} = \max_{i \neq j} \|\mu_i - \mu_j\|_2$ is the maximum distance between pairs of means. Suppose X is a k-means clustering instance with optimal clustering C1, C2, . . . , Ck. We say that X is ε-APS (additive perturbation stable) iff every δ = εD additive perturbation of X has C1, C2, . . . , Ck as an optimal clustering solution (see Definition 2.3 for a formal definition). Note that there is no restriction on the diameter of the instance, or even the diameters of the individual clusters. Hence, our notion of additive perturbation stability allows the instance to be unbounded.

Geometric property of ε-APS instances. Clusters in the optimal solution of an ε-APS instance satisfy a natural geometric condition that implies an angular separation between every pair of clusters.

Proposition 1.1 (Geometric Implication of ε-APS). Consider an ε-APS instance X, and let Ci, Cj be two clusters of the optimal solution. Any point x ∈ Ci lies in a cone whose axis is along the direction (µi − µj) with half-angle arctan(1/ε). Hence, if u is the unit vector along µi − µj, then
$$\forall x \in C_i,\qquad \Big\langle x - \tfrac{\mu_i+\mu_j}{2},\, u \Big\rangle \;\ge\; \frac{\varepsilon}{\sqrt{1+\varepsilon^2}}\, \Big\| x - \tfrac{\mu_i+\mu_j}{2} \Big\|_2. \qquad (1)$$

For any j ∈ [k], all the points in cluster Ci lie inside the cone with its axis along (µi − µj), as in Figure 1. The distance between µi and the apex of the cone is ∆ = (1/2 − ε)D. We will call ∆ the scale parameter of the clustering.

We believe that many clustering instances in practice satisfy the ε-APS condition for reasonable constants ε. In fact, our experiments in Section 4 suggest that the above geometric condition is satisfied for reasonable values, e.g., ε ∈ (0.001, 0.2). While the points can be arbitrarily far away from their own means, the above angular separation (1) is crucial in proving the polynomial time guarantees for our algorithms. For instance, this implies that at least 1/2 of the points in a cluster Ci are within a Euclidean distance of at most O(∆/ε) from µi. This geometric condition (1) of the dataset enables the design of a tractable algorithm for k = 2 with provable guarantees.
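To make condition (1) concrete, the following is a minimal NumPy sketch that checks the angular-separation condition for a pair of clusters; the function name and interface are ours, for illustration only.

```python
import numpy as np

def satisfies_angular_separation(Ci, Cj, eps):
    """Check condition (1): every point of Ci (resp. Cj) lies in a cone of
    half-angle arctan(1/eps) around the inter-mean direction, measured from
    the midpoint of the two means.

    Ci, Cj: (n_i, d) and (n_j, d) arrays holding the points of the two clusters.
    """
    mu_i, mu_j = Ci.mean(axis=0), Cj.mean(axis=0)
    u = (mu_i - mu_j) / np.linalg.norm(mu_i - mu_j)  # unit inter-mean direction
    p = (mu_i + mu_j) / 2.0                          # midpoint of the means
    cos_half_angle = eps / np.sqrt(1.0 + eps ** 2)   # cos(arctan(1/eps))
    for points, axis in [(Ci, u), (Cj, -u)]:         # Cj uses the opposite axis
        proj = (points - p) @ axis                   # component along the cone axis
        norms = np.linalg.norm(points - p, axis=1)
        if not np.all(proj >= cos_half_angle * norms):
            return False
    return True
```

A similar computation yields, for each pair of clusters, the largest ε for which (1) holds; this is one way to estimate the per-pair values of ε examined in Section 4.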
The algorithm for k = 2 is based on a modification of the perceptron algorithm in supervised learning, and is inspired by [13].

Informal Theorem 1.2. For any fixed ε > 0, there exists a $d \cdot n^{\mathrm{poly}(1/\varepsilon)}$ time algorithm that correctly clusters all ε-APS 2-means instances.

For k-means clustering, similar techniques can be used to learn the separating halfspace for each pair of clusters. But this incurs an exponential dependence on k², which renders this approach inefficient for large k.⁴ We now consider a natural strengthening of this assumption that allows us to get poly(n, d, k) guarantees.

Angular Separation with additional Margin Separation. We consider a natural strengthening of additive perturbation stability where there is an additional margin between any pair of clusters. This is reminiscent of margin assumptions in supervised learning of halfspaces and the spectral clustering guarantees of Kumar and Kannan [15] (see Section 1.2). Consider a k-means clustering instance X having an optimal solution C1, C2, . . . , Ck. This instance is (ρ, ∆, ε)-separated iff for each i ≠ j ∈ [k], the subinstance induced by Ci, Cj has scale parameter ∆, and all points in the clusters Ci, Cj lie inside cones of half-angle arctan(1/ε) which are separated by a margin of at least ρ. This is implied by the stronger condition that the subinstance induced by Ci, Cj is ε-additive perturbation stable with scale parameter ∆ even when Ci and Cj are moved towards each other by ρ. Please see Figure 1 for an illustration. We formally define (ρ, ∆, ε)-separated stable instances geometrically in Section 2.

Informal Theorem 1.3 (Polytime algorithm for (ρ, ∆, ε)-separated instances). There is an algorithm running in time⁵ $\widetilde{O}(n^2 kd)$ that, given any instance X that is (ρ, ∆, ε)-separated with ρ ≥ Ω(∆/ε²), recovers the optimal clustering C1, . . . , Ck.

A formal statement of the theorem (with unequal sized clusters) and its proof are given in Section 3. We prove these polynomial time guarantees for a new, simple algorithm (Algorithm 3.1). The algorithm constructs a graph with one vertex for each point, and edges between points that are within a distance of at most r (for an appropriate threshold r). The algorithm then finds the k largest connected components. It then uses the k empirical means of these k components to cluster all the points. In addition to having provable guarantees, the algorithm also seems efficient in practice, and performs well on standard clustering datasets. Experiments that we conducted on some standard clustering datasets from UCI suggest that our algorithm manages to almost recover the ground truth, and achieves a k-means objective cost that is very comparable to Lloyd's algorithm and k-means++ (see Section 4). In fact, our algorithm can also be used to initialize Lloyd's algorithm: our guarantees show that when the instance is (ρ, ∆, ε)-separated, one iteration of Lloyd's algorithm already finds the optimal clustering. Experiments suggest that our algorithm finds initializers of smaller k-means cost compared to the initializers of k-means++ [3] and also recovers the ground truth to good accuracy (see Section 4 and the Supplementary material for details).

Robustness to Outliers. Perturbation stability requires the optimal solution to remain completely unchanged under any valid perturbation. In practice, the stability of an instance may be dramatically reduced by a few outliers.
We show provable guarantees for a slight modification of the algorithm in the setting where an η-fraction of the points can be arbitrary outliers and do not lie in the stable regions. More formally, we assume that we are given an instance X ∪ Z where there is an (unknown) set of points Z, with |Z| ≤ η|X|, such that X is a (ρ, ∆, ε)-separated instance. Here ηn is assumed to be smaller than the size of the smallest cluster by a constant factor. This is similar to the robust perturbation resilience considered in [8, 16]. Our experiments in Section 4 indicate that the stability or separation can increase a lot after ignoring a few points close to the margin. In what follows, w_max = max_i |Ci|/n and w_min = min_i |Ci|/n are the maximum and minimum weight of clusters, and η < w_min.

Theorem 1.4. Given X ∪ Z where X is (ρ, ∆, ε)-separated with η < w_min, there is a polynomial time algorithm running in time $\widetilde{O}(n^2 dk)$ that returns a clustering consistent with C1, . . . , Ck on X.

This robust algorithm is effectively the same as Algorithm 3.1, with one additional step that removes all low-degree vertices in the graph. This step removes bad outliers in Z without removing too many points from X.

⁴ We remark that the results of [1] also incur an exponential dependence on k.
⁵ The Õ hides an inverse Ackermann function of n.

1.2 Comparisons to Other Related Work

Awasthi et al. [4] showed that γ-multiplicative perturbation stable instances also satisfy the notion of γ-center-based stability (every point is a γ-factor closer to its center than to any other center) [4]. They showed that an algorithm based on the classic single-linkage algorithm works under this weaker notion when γ ≥ 3. This was subsequently improved by [8], and the best result along these lines [2] gives a polynomial time algorithm that works for γ ≥ 2. A robust version of (γ, η)-perturbation resilience was explored for center-based clustering objectives [8]. As such, the notions of additive perturbation stability and (ρ, ∆, ε)-separated instances are incomparable to the various notions of multiplicative perturbation stability. Further, as argued in [9], we believe that additive perturbation stability is more realistic for Euclidean clustering problems.

Ackerman and Ben-David [1] initiated the study of various deterministic assumptions for clustering instances. The measure of stability most related to this work is Center Perturbation (CP) clusterability (an instance is δ-CP-clusterable if perturbing the centers by a distance of δ does not increase the cost much). A subtle difference is their focus on obtaining solutions with small objective cost [1], while our goal is to recover the optimal clustering. However, the main qualitative difference is how the length scale is defined; this is crucial for additive perturbations. The run time of the algorithm in [1] is $n^{\mathrm{poly}(k,\, \mathrm{diam}(X)/\delta)}$, where the length scale of the perturbations is diam(X), the diameter of the whole instance. Our notion of additive perturbations uses a much smaller length scale of ∆ (essentially the inter-mean distance; see Prop. 1.1 for a geometric interpretation), and Theorem 1.2 gives a run-time guarantee of $n^{\mathrm{poly}(\Delta/\delta)}$ for k = 2 (Theorem 1.2 is stated in terms of ε = δ/∆). By using the largest inter-mean distance instead of the diameter as the length scale, our algorithmic guarantees can also handle unbounded clusters with arbitrarily large diameters and outliers.
The exciting results of Kumar and Kannan [15] and Awasthi and Sheffet [6] also gave a deterministic margin-separation condition under which spectral clustering (PCA followed by k-means) finds the optimal clusters. Suppose $\sigma = \|X - C\|_{\mathrm{op}}^2 / n$ is the spectral radius of the dataset, where C is the matrix given by the centers. In the case of equal-sized clusters, the improved results of [6] prove approximate recovery of the optimal clustering if the margin ρ between the clusters along the line joining the centers satisfies $\rho = \Omega(\sqrt{k}\,\sigma)$. Our notion of margin ρ in (ρ, ∆, ε)-separated instances is analogous to the margin-separation notion used by the above results on spectral clustering [15, 6]. In particular, we require a margin of ρ = Ω(∆/ε²) where ∆ is our scale parameter, with no extra √k factor. However, we emphasize that the two margin conditions are incomparable, since the spectral radius σ is incomparable to the scale parameter ∆.

We now illustrate the difference between these deterministic conditions by presenting a couple of examples. Consider an instance with n points drawn from a mixture of k Gaussians in d dimensions with identical diagonal covariance matrices, with variance 1 in the first O(1) coordinates and roughly 1/d in the others, and all the means lying in the subspace spanned by these first O(1) coordinates. In this setting, the results of [15, 6] require a margin separation of at least $\sqrt{k \log n}$ between clusters. On the other hand, these instances satisfy our geometric conditions with ε = Ω(1), $\Delta = O(\sqrt{\log n})$, and therefore our algorithm only needs a margin separation of $\rho \approx \sqrt{\log n}$ (hence saving a factor of $\sqrt{k}$).⁶ However, if the n points were drawn from a mixture of spherical Gaussians in high dimensions (with d ≫ k), then the margin condition required for [15, 6] is weaker.

2 Stability definitions and geometric properties

X ⊂ R^d will denote a k-means clustering instance and C1, . . . , Ck will often refer to its optimal clustering. It is well known that, given a cluster C, the value of µ minimizing $\sum_{x \in C} \|x - \mu\|^2$ is given by $\mu = \frac{1}{|C|}\sum_{x \in C} x$, the mean of the points in the set. We give more preliminaries about the k-means problem in the Supplementary Material.

2.1 Balance Parameter

We define an instance parameter, β, capturing how balanced a given instance's clusters are.

⁶ Further, while algorithms for learning GMM models may work here, adding some outliers far from the decision boundary will cause many of these algorithms to fail, while our algorithm is robust to such outliers.

Figure 2: An example of the family of perturbations considered by Lemma 2.4. Here v is in the upwards direction. If a is to the right of the diagonal solid line, then the perturbed point a′ will be to the right of the slanted dashed line and will lie on the wrong side of the separating hyperplane.

Definition 2.1 (Balance parameter). Given an instance X with optimal clustering C1, . . . , Ck, we say X satisfies balance parameter β ≥ 1 if for all i ≠ j, β|Ci| > |Cj|. We will write β in place of β(X) when the instance is clear from context.

2.2 Additive perturbation stability

Definition 2.2 (ε-additive perturbation). Let X = {x1, . . . , xn} be a k-means clustering instance with optimal clustering C1, C2, . . . , Ck whose means are given by µ1, µ2, . . . , µk. Let D = max_{i,j} ‖µi − µj‖. We say that the instance X′ = {x′1, . . . , x′n} is an ε-additive perturbation of X if for all i, ‖x′i − xi‖ ≤ εD.

Definition 2.3 (ε-additive perturbation stability).
Let X be a k-means clustering instance with optimal clustering C1, C2, . . . , Ck. We say that X is ε-additive perturbation stable (APS) if every ε-additive perturbation of X has a unique optimal clustering given by C1, C2, . . . , Ck.

Intuitively, the difficulty of the clustering task increases as the stability parameter ε decreases. For example, when ε = 0 the set of ε-APS instances contains any instance with a unique solution. In the following we will only consider ε > 0.

2.3 Geometric implication of ε-APS

Let X be an ε-APS k-means clustering instance such that each cluster has at least 4 points. Fix i ≠ j and consider a pair of clusters Ci, Cj with means µi, µj, and define the following notation. Let $D_{i,j} = \|\mu_i - \mu_j\|$ be the distance between µi and µj, and let $D = \max_{i',j'} \|\mu_{i'} - \mu_{j'}\|$ be the maximum distance between any pair of means. Let $u = \frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|}$ be the unit vector in the inter-mean direction. Let $V = u^{\perp}$ be the space orthogonal to u. For x ∈ R^d, let $x^{(u)}$ and $x^{(V)}$ be the projections of x onto u and V. Let $p = \frac{\mu_i + \mu_j}{2}$ be the midpoint between µi and µj.

A simple perturbation that we can use will move all points in Ci and Cj along the direction µi − µj by an amount δ, while another perturbation moves these points along µj − µi; together these allow us to conclude that there is a margin of size 2δ. To establish Proposition 1.1, we will choose a cleverer ε-perturbation that allows us to show that clusters must live in cone regions (see Figure 1, left). This perturbation chooses two clusters and moves their means in opposite directions orthogonal to u while moving a single point towards the other cluster (see Figure 2). The following lemma establishes Proposition 1.1.

Lemma 2.4. For any $x \in C_i \cup C_j$, $\;\|(x - p)^{(V)}\| < \frac{1}{\varepsilon}\left( |(x - p)^{(u)}| - \varepsilon D_{i,j} \right)$.

Proof. Let v ∈ V be a unit vector perpendicular to u. Without loss of generality, let a, b, c, d ∈ Ci be distinct. Note that D_{i,j} ≤ D, and consider the ε-additive perturbation given by
$$X' = \{a - \delta u,\; b + \delta u,\; c - \delta v,\; d - \delta v\} \;\cup\; \{x - \tfrac{\delta}{2} v \mid x \in C_i \setminus \{a,b,c,d\}\} \;\cup\; \{x + \tfrac{\delta}{2} v \mid x \in C_j\} \;\cup\; \big(X \setminus (C_i \cup C_j)\big),$$
where δ = εD_{i,j} (see Figure 2). By assumption, {Ci, Cj} remains the optimal clustering of Ci ∪ Cj. We have constructed X′ such that the new means are at $\mu_i' = \mu_i - \frac{\varepsilon D_{i,j}}{2} v$ and $\mu_j' = \mu_j + \frac{\varepsilon D_{i,j}}{2} v$, and the midpoint between the means is p′ = p. The halfspace containing µ′i given by the linear separator between µ′i and µ′j is $\{x : \langle x - p', \mu_i' - \mu_j' \rangle > 0\}$. Hence, as the perturbed point a′ = a − δu is classified correctly by the ε-APS assumption,
$$\langle a' - p', \mu_i' - \mu_j' \rangle = \langle a - p - \varepsilon D_{i,j} u,\; D_{i,j} u - \varepsilon D_{i,j} v \rangle = D_{i,j}\big( \langle a - p, u \rangle - \varepsilon \langle a - p, v \rangle - \varepsilon D_{i,j} \big) > 0.$$
Then, noting that ⟨u, a − p⟩ is positive, we have that $\langle a - p, v \rangle < \frac{1}{\varepsilon}\left( (a - p)^{(u)} - \varepsilon D_{i,j} \right)$. Since v was an arbitrary unit vector in V, this bounds $\|(a - p)^{(V)}\|$, which gives the claim for points of Ci; the case of points in Cj is symmetric.

Note that this property follows from perturbations which only affect two clusters at a time. Our results follow from this weaker notion.

2.4 (ρ, ∆, ε)-separation

Motivated by Lemma 2.4, we define a geometric condition where the angular separation and the margin separation are parametrized separately. This notion of separation is implied by a stronger stability assumption where any pair of clusters is ε-APS with scale parameter ∆ even after being moved towards each other a distance of ρ. We say that a pair of clusters is (ρ, ∆, ε)-separated if their points lie in cones with axes along the inter-mean direction, half-angle arctan(1/ε), and apexes at distance ∆ from their means and at least ρ from each other (see Figure 1, right). Formally, we require the following.

Definition 2.5 (Pairwise (ρ, ∆, ε)-separation). Given a pair of clusters Ci, Cj with means µi, µj, let $u = \frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|}$ be the unit vector in the inter-mean direction and let p = (µi + µj)/2.
We say that Ci and Cj are (ρ, ∆, ε)-separated if $D_{i,j} \ge \rho + 2\Delta$ and, for all $x \in C_i \cup C_j$,
$$\|(x - p)^{(V)}\| \;\le\; \frac{1}{\varepsilon}\Big( |(x - p)^{(u)}| - (D_{i,j}/2 - \Delta) \Big).$$

Definition 2.6 ((ρ, ∆, ε)-separation). We say that an instance X is (ρ, ∆, ε)-separated if every pair of clusters in the optimal clustering is (ρ, ∆, ε)-separated.

3 k-means clustering for general k

We assume that our instance has balance parameter β. Our algorithm takes as input the set of points X and k, and outputs a clustering of all the points.

Algorithm 3.1. Input: X = {x1, . . . , xn}, k.
1: for all pairs a, b of distinct points in {xi} do
2:   Let r = ‖a − b‖ be our guess for ρ
3:   procedure INITIALIZE
4:     Create a graph G on vertex set {x1, . . . , xn} where xi and xj have an edge iff ‖xi − xj‖ < r
5:     Let a1, . . . , ak ∈ R^d where ai is the mean of the i-th largest connected component of G
6:   procedure ASSIGN
7:     Let C1, . . . , Ck be the clusters obtained by assigning each point in X to the closest ai
8:     Calculate the k-means objective of C1, . . . , Ck
9: Return the clustering with the smallest k-means objective found above

Theorem 3.2. Algorithm 3.1 recovers C1, . . . , Ck for any (ρ, ∆, ε)-separated instance with $\rho = \Omega(\Delta/\varepsilon^2 + \beta\Delta/\varepsilon)$, and the running time is $\widetilde{O}(n^2 kd)$.

We maintain the connected components and their centers via a union-find data structure and keep it updated as we increase r and add edges to the dynamic graph. Since we go over n² possible choices of r, and each pass takes O(kd) time, the algorithm runs in time $\widetilde{O}(n^2 kd)$. (For concreteness, a short Python sketch of this procedure is included below.) The rest of the section is devoted to proving Theorem 3.2.

Define the following regions of R^d for every pair i, j. Given i, j, let Ci, Cj be the corresponding clusters with means µi, µj. Let $u = \frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|}$ be the unit vector in the inter-mean direction and $p = \frac{\mu_i + \mu_j}{2}$ be the point between the two means. We first formally define $S^{(\mathrm{cone})}_{i,j}$, which was described in the introduction (the feasible region), and two other regions of the clusters that will be useful in the analysis (see Figure 1b). We observe that Ci belongs to the intersection of all the cones $S^{(\mathrm{cone})}_{i,j}$ over j ≠ i.

Definition 3.3.
$$S^{(\mathrm{cone})}_{i,j} = \Big\{ x \in \mathbb{R}^d \;\Big|\; \|(x - (\mu_i - \Delta u))^{(V)}\| \le \tfrac{1}{\varepsilon} \langle x - (\mu_i - \Delta u), u \rangle \Big\},\qquad S^{(\mathrm{nice})}_{i,j} = \Big\{ x \in S^{(\mathrm{cone})}_{i,j} \;\Big|\; \langle x - \mu_i, u \rangle \ge 0 \Big\},\qquad S^{(\mathrm{good})}_i = \bigcap_{j \neq i} S^{(\mathrm{nice})}_{i,j}.$$

The nice area of i with respect to j, i.e. $S^{(\mathrm{nice})}_{i,j}$, is defined as all points in the cap of $S^{(\mathrm{cone})}_{i,j}$ above µi. The good area of a cluster, $S^{(\mathrm{good})}_i$, is the intersection of its nice areas with respect to all other clusters.

It suffices to prove the following two main lemmas. Lemma 3.4 states that the ASSIGN subroutine correctly clusters all points given an initialization satisfying certain properties. Lemma 3.5 states that the initialization returned by the INITIALIZE subroutine satisfies these properties when we guess r = ρ correctly. As ρ is only used as a threshold on edge lengths, testing the distances between all pairs of data points, i.e. {‖a − b‖ : a, b ∈ X}, suffices.

Lemma 3.4. For a (ρ, ∆, ε)-separated instance with ρ = Ω(∆/ε²), the ASSIGN subroutine recovers C1, C2, . . . , Ck correctly when initialized with k points {a1, a2, . . . , ak} where $a_i \in S^{(\mathrm{good})}_i$.

Lemma 3.5. For a (ρ, ∆, ε)-separated instance with balance parameter β and ρ = Ω(β∆/ε), the INITIALIZE subroutine outputs one point each from $\{S^{(\mathrm{good})}_i : i \in [k]\}$ when r = ρ.

To prove Lemma 3.5 we define a region of each cluster, $S^{(\mathrm{core})}_i$, the core, such that most (at least a β/(1 + β) fraction) of the points in Ci will belong to the connected component containing $S^{(\mathrm{core})}_i$. Hence, any large connected component (in particular, the k largest ones) must contain the core of one of the clusters.
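To make the procedure concrete, here is a minimal Python/NumPy sketch of Algorithm 3.1. The function name, the optional min_degree argument (which implements the low-degree pruning step of the robust variant behind Theorem 1.4), and the brute-force loop over thresholds are our own illustrative choices, not the paper's implementation.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

def cluster(X, k, min_degree=0):
    """Sketch of Algorithm 3.1.

    For every pairwise distance r (the guess for rho): build the r-threshold
    graph, take the means of its k largest connected components (INITIALIZE),
    assign every point to the nearest such mean (ASSIGN), and keep the
    clustering with the smallest k-means objective. min_degree > 0 drops
    low-degree vertices first, as in the robust variant (Theorem 1.4).
    """
    n = len(X)
    D = squareform(pdist(X))                        # all pairwise distances
    best_cost, best_labels = np.inf, None
    for r in np.unique(D[np.triu_indices(n, k=1)]): # candidate thresholds
        A = D < r
        np.fill_diagonal(A, False)
        keep = np.flatnonzero(A.sum(axis=1) >= min_degree)
        if len(keep) < k:
            continue
        # INITIALIZE: means of the k largest connected components.
        ncomp, comp = connected_components(A[np.ix_(keep, keep)], directed=False)
        if ncomp < k:
            continue
        largest = np.argsort(np.bincount(comp))[::-1][:k]
        centers = np.array([X[keep][comp == c].mean(axis=0) for c in largest])
        # ASSIGN: each point goes to its closest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # k-means objective of the resulting clustering (w.r.t. its own means).
        cost = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
                   for c in range(k) if np.any(labels == c))
        if cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_labels, best_cost
```

In the analysis, the threshold graph is maintained incrementally with a union-find data structure as r increases, which is what gives the Õ(n²kd) bound; the loop above recomputes the components from scratch for every threshold and trades that efficiency for brevity.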
Continuing the proof sketch of Lemma 3.5: the margin ensures that points across clusters are not connected. Further, since $S^{(\mathrm{core})}_i$ accounts for most points in Ci, the angular separation ensures that the empirical mean of the connected component is in $S^{(\mathrm{good})}_i$.

4 Experimental results

We evaluate Algorithm 3.1 on multiple real-world datasets, compare its performance to that of k-means++, and also check how well these datasets satisfy our geometric conditions. See the Supplementary material for details about ground-truth recovery.

Datasets. Experiments were run on unnormalized and normalized versions of four labeled datasets from the UCI Machine Learning Repository: Wine (n = 178, k = 3, d = 13), Iris (n = 150, k = 3, d = 4), Banknote Authentication (n = 1372, k = 2, d = 5), and Letter Recognition (n = 20,000, k = 26, d = 16). Normalization was used to scale each feature to unit range.

Performance. We ran Algorithm 3.1 on the datasets. The cost of the returned solution for each of the normalized and unnormalized versions of the datasets is recorded in column 2 of Table 1. Our guarantees show that under (ρ, ∆, ε)-separation for appropriate values of ρ (see Section 3), the algorithm will find the optimal clustering after a single iteration of Lloyd's algorithm. Even when ρ does not satisfy our requirement, we can use our algorithm as an initialization heuristic for Lloyd's algorithm. We compare our initialization with the k-means++ initialization heuristic (D² weighting). In Table 1, this is compared to the smallest initialization cost of 1000 trials of k-means++ on each of the datasets, the solution found by Lloyd's algorithm using our initialization, and the smallest k-means cost of 100 trials of Lloyd's algorithm using a k-means++ initialization.

Table 1: Comparison of k-means cost for Alg 3.1 and k-means++

| Dataset | Alg 3.1 | k-means++ | Alg 3.1 with Lloyd's | k-means++ with Lloyd's |
|---|---|---|---|---|
| Wine | 2.376e+06 | 2.426e+06 | 2.371e+06 | 2.371e+06 |
| Wine (normalized) | 48.99 | 65.50 | 48.99 | 48.95 |
| Iris | 81.04 | 86.45 | 78.95 | 78.94 |
| Iris (normalized) | 7.035 | 7.676 | 6.998 | 6.998 |
| Banknote Auth. | 44808.9 | 49959.9 | 44049.4 | 44049.4 |
| Banknote (norm.) | 138.4 | 155.7 | 138.1 | 138.1 |
| Letter Recognition | 744707 | 921643 | 629407 | 611268 |
| Letter Rec. (norm.) | 3367.8 | 4092.1 | 2767.5 | 2742.3 |

Separation in real data sets. As the ground-truth clusterings in our datasets are not in general linearly separable, we consider the clusters given by Lloyd's algorithm initialized with the ground-truth solutions.

Values of ε for Lemma 2.4. We calculate the maximum value of ε such that a given pair of clusters satisfies the geometric condition in Proposition 1.1. The results are collected in Table 3 in the Supplementary material. We see that the average value of ε lies approximately in the range (0.01, 0.1).

Values of (ρ, ∆, ε)-separation. We attempt to measure the values of ρ, ∆, and ε in the datasets. For η = 0.05, 0.1, ε = 0.1, 0.01, and a pair of clusters Ci, Cj, we calculate ρ as the maximum margin separation a pair of axis-aligned cones with half-angle arctan(1/ε) can have while capturing a (1 − η)-fraction of all points. For some datasets and values of η and ε there may not be any such value of ρ; in this case we leave the row blank.

Table 2: Values of (ρ, ε, ∆) satisfied by a (1 − η)-fraction of points

| Dataset | η | ε | minimum ρ/∆ | average ρ/∆ | maximum ρ/∆ |
|---|---|---|---|---|---|
| Wine | 0.1 | 0.1 | 0.566 | 1.5 | 3.05 |
| Wine | 0.1 | 0.01 | 0.609 | 1.53 | 3.07 |
| Iris | 0.1 | 0.1 | 0.398 | 4.35 | 7.7 |
| Iris | 0.1 | 0.01 | 0.496 | 5.04 | 9.06 |
| Banknote Auth. | 0.1 | 0.1 | 0.264 | 0.264 | 0.264 |
| Banknote Auth. | 0.1 | 0.01 | 0.398 | 0.398 | 0.398 |
| Letter Recognition | 0.1 | 0.1 | 0.018 | 2.19 | 7.11 |
| Letter Recognition | 0.1 | 0.01 | 0.378 | 3.07 | 11.4 |
The results for the unnormalized datasets with η = 0.1 are collected in Table 2. (See the Supplementary material for the full table.)

5 Conclusion and Future Directions

We studied a natural notion of additive perturbation stability that we believe captures many real-world instances of Euclidean k-means clustering. We first gave a polynomial time algorithm when k = 2. For large k, under an additional margin assumption, we gave a fast algorithm of independent interest that provably recovers the optimal clustering under these assumptions (in fact, the algorithm is also robust to noise and outliers). An appealing aspect of this algorithm is that it is not tailored towards the model; our experiments indicate that this algorithm works well in practice even when the assumptions do not hold. Our results with the margin assumption hence give an algorithm that (A) has provable guarantees (under reasonable assumptions), (B) is efficient and practical, and (C) is robust to errors. While the margin assumption seems like a realistic assumption qualitatively, we believe that the exact condition we assume is not optimal. An interesting open question is understanding whether such a margin is necessary for designing tractable algorithms for large k. We conjecture that for higher k, clustering remains hard even with ε-additive perturbation resilience (without any additional margin assumption). Improving the margin condition or proving lower bounds on the amount of additive stability required are interesting future directions.

References

[1] Margareta Ackerman and Shai Ben-David. Clusterability: A theoretical study. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5, pages 1–8. PMLR, 2009.
[2] Haris Angelidakis, Konstantin Makarychev, and Yury Makarychev. Algorithms for stable and perturbation-resilient problems. In Symposium on Theory of Computing (STOC), 2017.
[3] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, pages 1027–1035, 2007.
[4] Pranjal Awasthi, Avrim Blum, and Or Sheffet. Center-based clustering under perturbation stability. Information Processing Letters, 112(1–2):49–54, 2012.
[5] Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and Ali Kemal Sinop. The hardness of approximation of Euclidean k-means. In Symposium on Computational Geometry, pages 754–767, 2015.
[6] Pranjal Awasthi and Or Sheffet. Improved spectral-norm bounds for clustering. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 37–49, 2012.
[7] Maria-Florina Balcan, Avrim Blum, and Anupam Gupta. Approximate clustering without the approximation. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '09, pages 1068–1077, 2009.
[8] Maria-Florina Balcan and Yingyu Liang. Clustering under perturbation resilience. SIAM Journal on Computing, 45(1):102–155, 2016.
[9] Shai Ben-David. Computational feasibility of clustering under clusterability assumptions. CoRR, abs/1501.00437, 2015.
[10] Shalev Ben-David and Lev Reyzin. Data stability in clustering: A closer look. Theoretical Computer Science, 558:51–61, 2014. Algorithmic Learning Theory.
[11] Yonatan Bilu and Nathan Linial. Are stable instances easy? In Innovations in Computer Science - ICS 2010, Tsinghua University, Beijing, China, January 5-7, 2010. Proceedings, pages 332–341, 2010.
[12] Hans-Dieter Block. The perceptron: A model for brain functioning. Reviews of Modern Physics, 34(1):123–135, 1962.
[13] Avrim Blum and John Dunagan. Smoothed analysis of the perceptron algorithm for linear programming. In Proceedings of the Symposium on Discrete Algorithms (SODA), 2002.
[14] Sanjoy Dasgupta. The hardness of k-means clustering. Department of Computer Science and Engineering, University of California, San Diego, 2008.
[15] Amit Kumar and Ravindran Kannan. Clustering with spectral norm and the k-means algorithm. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pages 299–308. IEEE, 2010.
[16] Konstantin Makarychev, Yury Makarychev, and Aravindan Vijayaraghavan. Bilu-Linial stable instances of max cut. In Proc. 22nd Symposium on Discrete Algorithms (SODA), 2014.
[17] A. B. J. Novikoff. On convergence proofs on perceptrons. Proceedings of the Symposium on the Mathematical Theory of Automata, XII(1):615–622, 1962.
[18] David P. Williamson and David B. Shmoys. The Design of Approximation Algorithms. Cambridge University Press, New York, NY, USA, 1st edition, 2011.