# Clustering in the Presence of Background Noise

Shai Ben-David (SHAI@CS.UWATERLOO.CA), David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada
Nika Haghtalab (NIKA@CMU.EDU), Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Abstract

We address the problem of noise management in clustering algorithms, namely, the issues that arise when, on top of some cluster structure, the data also contains an unstructured set of points. We consider how clustering algorithms can be robustified so that they recover the cluster structure in spite of the unstructured part of the input. We introduce quantitative measures of such robustness that take into account the strength of the embedded cluster structure as well as the mildness of the noise subset. We propose a simple and efficient method to turn any centroid-based clustering algorithm into a noise-robust one, and prove robustness guarantees for our method with respect to these measures. We also prove that more straightforward ways of robustifying clustering algorithms fail to achieve similar guarantees.

1. Introduction

Clustering is (usually) aimed at detecting groups of similar objects in given datasets. Many common clustering algorithms output a partition of the input set. However, it is often the case that datasets one wishes to cluster contain, on top of groups of similar objects, a significant subset that is unstructured. Such a subset, which may be referred to as noise, tends to disrupt clustering algorithms and make it difficult to detect the cluster structure of the remaining domain points.

This problem can be viewed as a noise robustness issue. Are there useful clustering algorithms that are noise robust (w.r.t. the addition of unstructured data points)? Short reflection reveals that the noise robustness of an algorithm is closely related to its sensitivity to the input data. As an extreme example, it is easy to achieve noise robustness by ignoring the input data and always returning a fixed output. Ackerman et al. (2013) provide some formal tradeoffs between these two desired properties. Roughly stated, their results (for example, Theorem 4.3) show that no algorithm can be both noise robust and responsive to the cluster structure of the data (in the language of Ackerman et al. (2013), these properties are called robustness to oligarchies and separability-detecting). However, those results consider applying an algorithm with a fixed number-of-clusters parameter. This paper addresses the possibility of overcoming those pessimistic results by allowing clustering algorithms to add to the set of clusters they output an extra subset, serving as a "garbage collector".

An important aspect of clustering, one that distinguishes it from other major learning tasks like classification prediction, is the wide variability of input-output behavior among common clustering algorithms. Different clustering applications employ very different clustering algorithms, and there is no single clustering algorithm that is suitable for all. Consequently, solutions to fundamental clustering challenges, like the tradeoff between sensitivity to the input and noise robustness, should be modular in the sense of being applicable to a variety of clustering algorithms. In this paper we propose such a modular solution.
In Section 4, we consider a method to transform any centroid-based clustering algorithm into one that outputs a set of clusters augmented by a distinct noise bin. We show that our proposed paradigm can be achieved by employing a simple and efficiently implementable transformation of the input data, after which users can apply any centroid-based clustering algorithm.

Yet another contribution of this paper is the introduction of quantitative measures of noise robustness (Section 5). We consider three aspects of noise robustness for centroid-based clustering algorithms: the degree by which noise can affect the location of the centers of the clusters (or the archetypal cluster representatives), the effect of noise on the cost of the clustering solution (or, the value of the clustering objective function), and the similarity between the clustering of the un-noised data and its clustering in the presence of the noise.

In our results, we consider a scenario in which the input dataset X consists of two parts: a clusterable subset I, which is also called the un-noised data, and an added noise set X \ I (the identity of which is not known to the clustering algorithm). We consider two clustering algorithms, the original one, A, and its robustified transformation Rp(A) that is obtained by using our paradigm with a tunable parameter p. We examine to what extent well-clusterability of I and mildness properties of X \ I (in terms of the size and/or diameter of this set, relative to that of I) affect the similarity between the clusterings A(I) and Rp(A)(X) restricted to I.

In Section 6, we prove that our paradigm makes the algorithms noise-robust without sacrificing much of their ability to detect clear cluster structures. The degree of noise-robustness that our paradigm achieves depends on a parameter that can be tuned by the user, depending on the level and structure of the noise expected in the input data. On the other hand, in Section 7, we show that a simple transformation in which Rp(A) is the original algorithm A with an increased number of output clusters (the extra clusters may be used to handle noise) does not enjoy the same robustness guarantees. In Section 8, we further demonstrate the gap between these two paradigms and use experiments to confirm our theoretical results. Note that, in the interest of space, the proofs of some theorems and lemmas are moved to the supplementary material.
2. Related Work

Previous work on the robustness of clustering methods has mainly focused on two directions: first, developing measures of robustness and examining the performance of traditional clustering algorithms based on these measures; second, developing clustering algorithms that are robust to noise and outliers.

Various measures of robustness have been developed for examining the robustness characteristics of clustering algorithms to noise (Donoho, 1982; Hampel, 1971; Hennig, 2008). These measures have been used to demonstrate the lack of robustness of some traditional algorithms, where the number of clusters is fixed (Ackerman et al., 2013; Hennig, 2008). That is, they consider the scenario in which a clustering algorithm is used with the same number-of-clusters parameter for both the clean input and the input after the addition of noisy points. In this work, we allow some extra clusters to be used (so as to accommodate the added noisy data points) and introduce and analyze noise robustness measures that allow such flexibility. We show that, using the added flexibility of our paradigm, we can overcome some of the limitations that the above papers deemed inevitable (without allowing extra noise-accommodating clusters).

Several methods have been suggested for clustering a potentially noisy dataset (Cuesta-Albertos et al., 1997; Dave, 1993; Ester et al., 1996). One interesting line of work is the development of the concept of a noise cluster in a fuzzy setting by Dave (1991; 1993). In this work, we introduce a novel paradigm for robustifying any center-based clustering algorithm. We show that our paradigm generalizes a non-fuzzy variation of the algorithm introduced by Dave (1991). In addition, we prove noise robustness guarantees for our proposed paradigm, guarantees that were not proven in any of the earlier works we are aware of.

Some of the earlier work on the noise-robustness of clustering algorithms proposes the use of trimming. Trimming is the natural approach to separating clusterable parts of the data from unstructured ones: one fixes some noise-set size (say, as some fraction α of the data size) and lets the algorithm find the least structured α fraction of an input dataset and discard it before applying a clustering algorithm (Cuesta-Albertos et al., 1997; García-Escudero et al., 2008; García-Escudero & Gordaliza, 1999). However, such methods suffer from exponential computational complexity, or have to be compromised for efficient heuristic searches that have no performance guarantees. In this work, we avoid this issue by developing a paradigm that is of comparable computational complexity to traditional clustering algorithms.

Clustering has been the subject of many other theoretical studies. Ackerman & Ben-David (2009), Balcan et al. (2009), and Bilu & Linial (2012) examined the effect of robustness (to perturbations of the input) on the computational complexity of clustering. Ben-David et al. (2006) examined situations where the clustering of the input is similar to the clustering of a random subset of it. Ailon et al. (2009) approximated the k-means objective using extra centers. Our work has a different focus. Our goal is to find clustering algorithms that yield a good clustering on the clusterable portion of a noisy input. Therefore, the questions that we address, our setting, and our approach are different from those of the above studies. Discussing the details of more relevant previous work requires the definition of a few notations; hence, it is delayed to the relevant sections.

3. Preliminaries

For a set X and an integer k ≥ 1, a k-clustering of X is a partition C = {C1, . . . , Ck} of X into k disjoint sets. For a clustering C of X and points x, y ∈ X, we say x ∼_C y if x and y are in the same cluster, and x ≁_C y otherwise. For sets X and I such that I ⊆ X, and a clustering C = {C1, . . . , Ck} of X, we denote the restriction of C to I by C|I = {C1 ∩ I, . . . , Ck ∩ I}. For two clusterings C and C′ of the set X, we define the distance between them, Δ(C, C′), as the fraction of pairs of domain points which are in the same cluster under C but in different clusters under C′, or vice versa. Equivalently, Δ(C, C′) = 1 − R(C, C′), where R is the Rand index (Rand, 1971). Δ satisfies the triangle inequality.
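For concreteness, the following is a minimal sketch of this pair-counting distance (one minus the Rand index), assuming each clustering is represented as a list of cluster labels indexed by point; the function name and representation are ours, not the paper's.

```python
from itertools import combinations

def clustering_distance(labels1, labels2):
    """Pair-counting distance Delta(C, C') = 1 - Rand index: the fraction of
    point pairs that are together in one clustering and apart in the other."""
    n = len(labels1)
    assert n == len(labels2)
    disagreements = 0
    for i, j in combinations(range(n), 2):
        if (labels1[i] == labels1[j]) != (labels2[i] == labels2[j]):
            disagreements += 1
    return disagreements / (n * (n - 1) / 2)

# Two 2-clusterings of five points that disagree on the placement of one point.
print(clustering_distance([0, 0, 0, 1, 1], [0, 0, 1, 1, 1]))  # 0.4
```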
Let d be a symmetric distance function defined over X, with d(x, x) = 0 for all x ∈ X, and satisfying the triangle inequality. The diameter of X, denoted diam(X), is defined as the maximum distance between two elements of X. For a clustering C = {C1, . . . , Ck}, the diameter of C is defined as max_{Ci ∈ C} diam(Ci). The radius of X is rad(X) = min_{c ∈ X} max_{x ∈ X} d(c, x). A clustering C is σ-separable for σ ≥ 1 if min_{x ≁_C y} d(x, y) > σ max_{x ∼_C y} d(x, y). A clustering C is (ρ1, ρ2)-balanced if for all i ≤ k, ρ1|X| ≤ |Ci| ≤ ρ2|X|. We use ρ-balanced to refer to a clustering that is (0, ρ)-balanced.

A clustering algorithm is a function A that takes as input X and d and returns a clustering C of X. An objective function is a function that takes as input a clustering and outputs a non-negative cost associated with it. An objective-based clustering algorithm is an algorithm that produces a clustering that minimizes a specified objective function.

Consider an input set X drawn from a given space E with distance function d. Throughout this work, let g : R⁺ → R⁺ be any continuous, increasing, and unbounded function. The (k, g)-centroid algorithm is an objective-based clustering algorithm with objective function

$$ \Lambda^g_{E,d}(\{C_1, \dots, C_k\}) \;=\; \min_{\mu_1, \dots, \mu_k \in E} \; \sum_{i \le k} \; \sum_{x \in C_i} g\big(d(x, \mu_i)\big). $$

We refer to µi as the center of cluster Ci, and we define µ(x) = arg min_{µi ∈ {µ1,...,µk}} d(x, µi) to be the center closest to x. With a slight abuse of notation, we can also define the (k, g)-centroid algorithm as the algorithm that chooses centers µ1, . . . , µk that minimize

$$ \Lambda^g_{X,E,d}(\mu_1, \dots, \mu_k) \;=\; \sum_{x \in X} g\big(d(x, \mu(x))\big). $$

We remove X and E from the notation whenever they are clear from the context. Note that for g(x) = x and g(x) = x², Λ^g_d refers to the k-medians and k-means cost functions, respectively.

4. Robustifying Paradigms

We define parameterized robustifying paradigms that transform any clustering algorithm into an algorithm that is more robust to noise, to an extent determined by a predefined parameter. We define two robustifying paradigms. Moreover, we establish an equivalence between one of the paradigms and a generalization of an existing algorithm.

A robustifying parameter, p, denotes the degree to which an algorithm should be robustified to noise; for example, the number of extra clusters that can be used, or a notion of distance beyond which a point is considered an outlier. A robustifying paradigm, Rp(·), is a function that takes a clustering algorithm A and returns a robustified clustering algorithm Rp(A) based on the robustifying parameter p. We refer to A as the ground clustering algorithm of Rp(A).

Since noise, unstructured data, and outlying structures are heterogeneous with respect to the existing data, outliers and noise groups can be considered as separate clusters. Therefore, some statisticians simply recommend increasing the number of clusters when dealing with noisy data (García-Escudero & Gordaliza, 1999). The next paradigm captures robustification as used in this practice.

Definition 1 (p-Increased Paradigm). The p-Increased Paradigm is a robustifying paradigm, RIp(·), that takes as input a (k, g)-centroid algorithm and returns a (k + p, g)-centroid algorithm.

The next paradigm is parameterized by the distance after which a point should be considered an outlier. To define this paradigm, we first introduce a class of algorithms. Given a space E and distance function d, the δ-truncated distance function corresponding to d is the function d′ such that d′(x, y) = min{δ, d(x, y)} for x, y ∈ E. The (k, g)-δ-truncated algorithm is an objective-based algorithm that, given X ⊆ E, first optimizes the function Λ^g_{X,d′}(µ′1, . . . , µ′k). For j ≤ k, let C′j = {x ∈ X | j = arg min_i d(x, µ′i) and d(x, µ′j) < δ} and C′_{k+1} = {x ∈ X | min_i d(x, µ′i) ≥ δ}. Then the (k, g)-δ-truncated algorithm returns the (k + 1)-clustering C′ = {C′1, . . . , C′_{k+1}}. We refer to µ′i as the center of C′i for i ≤ k. With a slight abuse of notation, we define µ′(x) = arg min_{µ′i ∈ {µ′1,...,µ′k}} d(x, µ′i).
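To illustrate, the following is a minimal sketch of the assignment step of the (k, g)-δ-truncated algorithm for Euclidean data and g(x) = x² (the k-means case), assuming a set of candidate centers has already been obtained; the optimization of the centers themselves is not shown, and all names are ours.

```python
import numpy as np

def truncated_cost(X, centers, delta):
    """Objective Lambda^g_{d'} for g(x) = x^2, with d'(x, y) = min{delta, d(x, y)}."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (n, k)
    return float(np.square(np.minimum(d.min(axis=1), delta)).sum())

def truncated_assign(X, centers, delta):
    """Assign each point to its nearest center, or to the noise bin (label k)
    when every center is at distance at least delta."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    labels[d.min(axis=1) >= delta] = len(centers)  # noise bin C_{k+1}
    return labels
```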
Definition 2 (δ-Truncated Paradigm). The δ-Truncated Paradigm is a robustifying paradigm, RTδ(·), that takes as input a (k, g)-centroid algorithm and returns a (k, g)-δ-truncated algorithm.

Dave (1991) developed the notion of the noise prototype, which is a point equidistant from all points in E. He then introduced a clustering algorithm that performs fuzzy (k + 1)-means with one center fixed as the noise prototype. In the next definition, we provide a generalization of the non-fuzzy variation of Dave's algorithm for any centroid-based algorithm. Furthermore, we show that the class of algorithms produced in this way is equivalent to the class of algorithms that belong to the δ-Truncated paradigm.

Definition 3. Let µ* be defined such that for all y ∈ E, d(y, µ*) = δ. The generalized (k, g)-δ-centroid algorithm is an objective-based algorithm with objective function Λ^g_{X, E∪{µ*}, d}(µ1, . . . , µk, µ*). We refer to the (k, g)-δ-centroid algorithm as δ-k-median and δ-k-means when g(x) = x and g(x) = x², respectively.

Theorem 1. A clustering C is an optimal (k, g)-δ-centroid clustering of X if and only if C is an optimal (k, g)-δ-truncated clustering of X.

Proof. For convenience, let µ_{k+1} = µ* and d′(x, y) = min{δ, d(x, y)}. We show that the (k, g)-δ-centroid clustering with centers µ1, . . . , µk, µ* and the (k, g)-δ-truncated clustering with centers µ1, . . . , µk have the same objective value:

$$
\begin{aligned}
\Lambda^g_{(E,d')}(\{\mu_1,\dots,\mu_k\})
&= \sum_{y \in X} g\Big(\min_{i \in [k]} \min\{\delta,\, d(y,\mu_i)\}\Big)
 = \sum_{y \in X} g\Big(\min_{i \in [k]} \min\{d(y,\mu_{k+1}),\, d(y,\mu_i)\}\Big) \\
&= \sum_{y \in X} g\Big(\min_{i \in [k+1]} d(y,\mu_i)\Big)
 = \Lambda^g_{(E \cup \{\mu^*\},\, d)}(\{\mu_1,\dots,\mu_k,\mu^*\}).
\end{aligned}
$$

Moreover, µ1, . . . , µk, µ* induce the same (k, g)-δ-centroid clustering as the (k, g)-δ-truncated clustering induced by µ1, . . . , µk. Therefore, C is an optimal (k, g)-δ-centroid clustering if and only if it is an optimal (k, g)-δ-truncated clustering.
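As a quick numerical sanity check of the identity used in the proof of Theorem 1, the following toy computation (our own, with arbitrary data, centers, and δ) compares the δ-truncated k-means objective with the objective obtained by appending an explicit noise-prototype column at distance δ from every point; the two values coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))          # arbitrary toy data
centers = rng.normal(size=(3, 2))      # arbitrary candidate centers
delta = 1.5

d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # (n, k)

# delta-truncated k-means objective: g(min{delta, d(x, nearest center)})
truncated = np.square(np.minimum(d.min(axis=1), delta)).sum()

# same objective with an explicit noise-prototype "center" at distance delta
d_aug = np.hstack([d, np.full((len(X), 1), delta)])               # (n, k+1)
with_prototype = np.square(d_aug.min(axis=1)).sum()

assert np.isclose(truncated, with_prototype)
```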
5. Measures of Robustness

In previous work, robustness to the addition of noise has been measured by comparing the output of the same algorithm on both the un-noised and the noisy data. This approach leads to pessimistic results about the possibility of achieving noise robustness. One example of these results is the work of Ackerman et al. (2013), which shows that algorithms that are responsive to the structure of the data are not noise-robust. More precisely, Ackerman et al. (2013) define a k-clustering algorithm A to be σ-separability-detecting if, for all I such that there exists a σ-separable k-clustering C of I, A(I) = C. Then, they show that for any σ-separability-detecting algorithm and any ρ, there is a ρ-balanced σ-separable set I that is not robust to a noise set of size as small as k.

This approach does not reflect the way clustering is done in practice. For clustering noisy data, one may take special provisions to handle the noise, provisions that are not needed when the data is known to be noise free. For example, if it is known that the input data has k clusters, one may allow the algorithm to create more clusters in an attempt to separate the noise. One of the take-home messages of this paper is that by revising the previous approach to reflect a more practical setting, we can overcome those pessimistic results.

To this end, our measures of robustness compare the output of a robustified algorithm (one that can use extra clusters) on noisy data to the output of its corresponding ground algorithm on the un-noised data. More precisely, let A denote any clustering algorithm (the ground clustering algorithm) and Rp(·) denote a robustifying paradigm with parameter p. We use A′ = Rp(A) to denote the robustified algorithm corresponding to A. Given I ⊆ X, A(I) denotes the clustering of I using the ground algorithm, and A′(X) denotes the clustering of X by the robustified algorithm. We consider I to be robust (to X \ I) with respect to the Rp(A) algorithm if certain properties of A(I) are preserved in A′(X). In the following definitions, for any x ∈ X, we use µ(x) and µ′(x) to denote the centers of A(I) and A′(X), respectively, to which x belongs.

Cluster centers are commonly used to compress and represent data. The distances between points and their corresponding centers can be viewed as the distortion of such a compression. Therefore, it is essential to have clustering algorithms where this value does not grow significantly in the presence of noise. The next definition measures how much this distortion is affected by the addition of noise to the input data.

Definition 4 (α-distance-robust). A subset I ⊆ X is α-distance-robust with respect to A′ if for all y ∈ I, d(y, µ′(y)) ≤ d(y, µ(y)) + α.

An algorithm is considered robust if it separates the input using some low-cost clusters that cover the un-noised data. In the next definition, we capture this property by computing the minimal cost of a subset of clusters that covers at least |I| points in total.

Definition 5 (β-cost-robust). Let Λ be an objective (cost) function. I ⊆ X is β-cost-robust with respect to A′ for Λ if there exists C′ ⊆ A′(X) such that |⋃ C′| ≥ |I| and Λ(C′) ≤ Λ(A(I)) + β. Note that if A′(X) = A(I) ∪ {X \ I}, then I is 0-cost-robust with respect to A′.

The next definition measures the degree to which noise affects the structure of a clustering.

Definition 6 (γ-robust). I ⊆ X is γ-robust with respect to algorithm A′ if Δ(A′(X)|I, A(I)) ≤ γ.

Lemma 1. Given a (k, g)-centroid algorithm A and a parameter δ, let A′ = RTδ(A). For any I ⊆ X such that for all y ∈ I, d(y, µ′(y)) ≤ δ, I is g(δ)|X \ I|-cost-robust to X \ I with respect to RTδ(A) for the Λ^g_d cost function.

Proof. Refer to the supplementary material.
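Below is a small sketch, in our own notation, of how the distance-robustness and cost-robustness of a given pair of clusterings could be measured for k-means, assuming Euclidean data and access to the cluster memberships and centers; Definition 4 compares each inlier's distance to its ground and robustified centers, and Definition 5 asks for the cheapest sub-collection of robustified clusters covering at least |I| points.

```python
import numpy as np
from itertools import chain, combinations

def distance_robustness(I_points, ground_centers, robust_centers):
    """Smallest alpha for which Definition 4 holds, taking mu(y) and mu'(y)
    to be the nearest ground / robustified center of each inlier y."""
    d_ground = np.linalg.norm(I_points[:, None, :] - ground_centers[None, :, :], axis=2).min(axis=1)
    d_robust = np.linalg.norm(I_points[:, None, :] - robust_centers[None, :, :], axis=2).min(axis=1)
    return float((d_robust - d_ground).max())

def kmeans_cost(points, center):
    return float(np.square(np.linalg.norm(points - center, axis=1)).sum())

def min_cover_cost(clusters, centers, m):
    """Minimal k-means cost of a sub-collection of clusters covering at least
    m points (brute force over subsets; fine for the small k considered here)."""
    best = float("inf")
    idx = range(len(clusters))
    for sub in chain.from_iterable(combinations(idx, r) for r in range(1, len(clusters) + 1)):
        if sum(len(clusters[i]) for i in sub) >= m:
            best = min(best, sum(kmeans_cost(clusters[i], centers[i]) for i in sub))
    return best
```

The β of Definition 5 then corresponds to min_cover_cost evaluated on the robustified clustering with m = |I|, minus the ground cost Λ(A(I)).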
6. Robustness of Our Paradigm

In this section, we show guaranteed robustness results for the δ-Truncated paradigm, RTδ(·). We prove robustness, distance-robustness, and cost-robustness based on several properties of the underlying structure of I and mildness properties of X \ I.

RADIUS OF THE BALL COVERING I

The following results guarantee distance-robustness and cost-robustness for the δ-Truncated paradigm. Theorem 2 derives values of δ that render I robust with respect to the δ-Truncated algorithm, based on the radius of I and the signal-to-noise ratio of X.

Theorem 2. For any k and g, let A be the (k, g)-centroid algorithm. For all I ⊆ X, let r = rad(I) and λ = |I| / |X \ I| (the signal-to-noise ratio). Then for any δ ∈ [4r, g⁻¹(λ(g(2r) − g(r)))), I is 4r-distance-robust and g(δ)|X \ I|-cost-robust with respect to RTδ(A) for the Λ^g_d cost function.

Proof. Let A′ = RTδ(A) and let A′(X) = C′. For all x ∈ X, let µ′(x) denote the closest center of C′ to x. Assume on the contrary that there exists y ∈ I such that d(y, µ′(y)) > 4r. Since every point of I is within 2r of y, for any y′ ∈ I, d(y′, µ′(y′)) > 2r. Therefore, Λ^g_{d′}(C′) ≥ |I| g(2r). On the other hand, for any clustering C′′ that has a center at the center of the r-ball that covers I, Λ^g_{d′}(C′′) ≤ |I| g(r) + |X \ I| g(δ). By the choice of δ, Λ^g_{d′}(C′′) < Λ^g_{d′}(C′), so C′ is not optimal. Hence, I is 4r-distance-robust to X \ I with respect to RTδ(A). Using Lemma 1, I is also g(δ)|X \ I|-cost-robust with respect to RTδ(A) for the cost function Λ^g_d.

Note that Theorem 2 implies that if I has radius r and signal-to-noise ratio λ, then for any δ ∈ [4r, r√(3λ)), I is 4r-distance-robust and δ²|X \ I|-cost-robust with respect to δ-k-means.
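Spelled out, the k-means instantiation in this note is just Theorem 2 with g(x) = x², so that g⁻¹(t) = √t:

```latex
% Theorem 2 with g(x) = x^2 (k-means), so g^{-1}(t) = \sqrt{t}:
\[
  \delta \in \Big[\,4r,\; g^{-1}\big(\lambda\,(g(2r)-g(r))\big)\Big)
         = \Big[\,4r,\; \sqrt{\lambda\,(4r^2 - r^2)}\,\Big)
         = \big[\,4r,\; r\sqrt{3\lambda}\,\big),
\]
% and the cost-robustness bound g(\delta)\,|X \setminus I| becomes \delta^2\,|X \setminus I|.
% (Our observation: the interval is nonempty once \lambda \ge 16/3.)
```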
UNDERLYING STRUCTURE OF I

The following results guarantee robustness, distance-robustness, and cost-robustness for the δ-Truncated paradigm. Theorem 3 derives values of δ that render I robust with respect to the δ-Truncated algorithm, based on the underlying structure of I and the signal-to-noise ratio of X.

Theorem 3. For any k and g, let A be the (k, g)-centroid algorithm. Assume that I ⊆ X has the following property: I can be covered by a (ρ1, ρ2)-balanced set of balls B1, . . . , Bk such that for all i ≤ k, rad(Bi) ≤ r, and for all i ≠ j, the centers of Bi and Bj are at least ν > 4r + 2g⁻¹(((ρ1 + ρ2)/ρ1) g(r)) apart. Let δ ∈ [ν/2, g⁻¹((|I|/|X \ I|)(ρ1 g(ν/2 − 2r) − (ρ1 + ρ2) g(r)))). Then I is 0-robust, min{ν/2, g⁻¹(g(ν/2 − 2r) − (ρ2/ρ1) g(r)) + 2r}-distance-robust, and g(δ)|X \ I|-cost-robust with respect to RTδ(A) for the Λ^g_d cost function.

Note that Theorem 3 implies that if I ⊆ X can be covered by two balls of radius r whose centers are 10r apart, each covering half of I, and if there is 5% noise, then for any δ ∈ [5r, 8.5r), I is 0-robust, 4r-distance-robust, and δ|X \ I|-cost-robust with respect to RTδ(A) for the 2-medians cost function.

We need the next two lemmas to prove Theorem 3. Lemmas 2 and 3 examine the output of the (k, g)-δ-truncated and (k, g)-centroid algorithms when the un-noised data has a well-clusterable underlying pattern.

Lemma 2. Let A be the (k, g)-centroid algorithm. Assume that I ⊆ X, B1, . . . , Bk, and δ are as defined in Theorem 3. Let A′ = RTδ(A). Then A′(X)|I = {B1, . . . , Bk, ∅}. Furthermore, for all y ∈ I, d(y, µ′(y)) ≤ min{ν/2, g⁻¹(g(ν/2 − 2r) − (ρ2/ρ1) g(r)) + 2r}.

Proof. The conditions of Theorem 3 state that I can be covered by a (ρ1, ρ2)-balanced set of balls B = {B1, . . . , Bk} such that rad(Bi) ≤ r for all i ≤ k, and for all i ≠ j, the centers of Bi and Bj are at least ν > 4r + 2g⁻¹(((ρ1 + ρ2)/ρ1) g(r)) apart. Moreover, δ is assumed to be in the range [ν/2, g⁻¹((|I|/|X \ I|)(ρ1 g(ν/2 − 2r) − (ρ1 + ρ2) g(r)))).

For i ≤ k, let bi denote the center of Bi and let Di denote the ball of radius ν/2 − r centered at bi. Let A′(X) = C′ with centers µ′1, . . . , µ′k that minimize Λ^g_{d′}. Let D1 = {Di | Di does not cover any µ′j} and D2 = {Di | Di covers more than one µ′j}. Since D1, . . . , Dk are pairwise disjoint, |D1| ≥ |D2|.

Assume, in search of a contradiction, that D1 ≠ ∅. For any Di ∈ D1 and all y ∈ Di, d(y, µ′(y)) ≥ ν/2 − 2r. Consider the following set of centers µ′′1, . . . , µ′′k: if Dj includes exactly one center µ′i, then let µ′′j = µ′i; otherwise let µ′′j = bj. Then

$$
\begin{aligned}
\Lambda^g_{X,d'}(\mu''_1,\dots,\mu''_k)
&\le \Lambda^g_{X,d'}(\mu'_1,\dots,\mu'_k)
 - \sum_{D_i \in D_1}\sum_{y \in B_i}\big[g(d'(y,\mu'(y))) - g(d'(y,\mu''(y)))\big]
 + \sum_{D_i \in D_2}\sum_{y \in B_i}\big[g(d'(y,\mu''(y))) - g(d'(y,\mu'(y)))\big]
 + \sum_{y \in X \setminus I} g(d'(y,\mu''(y))) \\
&\le \Lambda^g_{X,d'}(\mu'_1,\dots,\mu'_k)
 + |D_1|\,|I|\,\rho_1\big(g(r) - g(\tfrac{\nu}{2}-2r)\big)
 + |D_2|\,|I|\,\rho_2\, g(r)
 + |X \setminus I|\, g(\delta) \\
&\le \Lambda^g_{X,d'}(\mu'_1,\dots,\mu'_k)
 + |D_1|\,|I|\big((\rho_1+\rho_2)\,g(r) - \rho_1\, g(\tfrac{\nu}{2}-2r)\big)
 + |X \setminus I|\, g(\delta)
 \;<\; \Lambda^g_{X,d'}(\mu'_1,\dots,\mu'_k).
\end{aligned}
$$

This forms a contradiction; so, without loss of generality, let every Di cover a center µ′i. For i ≠ j and for all y ∈ Bi, d(y, µ′i) ≤ ν/2 < d(y, µ′j). Therefore, A′(X)|I = {B1, . . . , Bk, ∅}.

For every C′i ∈ C′, |Bi| min_{y ∈ Bi} g(d(y, µ′i)) ≤ |Bi| g(r) + |C′i \ I| g(δ) (since µ′i minimizes the cost of C′i, which is at most its cost with center bi). Since |Bi| ≥ ρ1|I| and |C′i \ I| ≤ |X \ I|, there exists y ∈ Bi such that

$$ g(d(y, \mu'_i)) \;\le\; g(r) + \frac{|X \setminus I|}{\rho_1 |I|}\, g(\delta)
 \;\le\; g(r) + \frac{\rho_1\, g(\tfrac{\nu}{2} - 2r) - (\rho_1 + \rho_2)\, g(r)}{\rho_1}
 \;=\; g\big(\tfrac{\nu}{2} - 2r\big) - \tfrac{\rho_2}{\rho_1}\, g(r). $$

Hence, for all y ∈ I, d(y, µ′(y)) ≤ min{ν/2, 2r + g⁻¹(g(ν/2 − 2r) − (ρ2/ρ1) g(r))}.

Lemma 3. Let A be the (k, g)-centroid algorithm. For any I, if I can be covered by a (ρ1, ρ2)-balanced set of k balls B, where each ball has radius at most r and the centers of any two different balls are at least ν > 4r + 2g⁻¹(((ρ1 + ρ2)/ρ1) g(r)) apart, then A(I) = B.

Proof. Refer to the supplementary material.

Proof of Theorem 3. Lemma 2 shows that for every y ∈ I, d(y, µ′(y)) ≤ δ. By Lemma 1, I is g(δ)|X \ I|-cost-robust with respect to RTδ(A) for the cost function Λ^g_d. Let B1, . . . , Bk be the mentioned balls that cover I. Lemma 2 and Lemma 3 show that A′(X)|I = {B1, . . . , Bk, ∅} and A(I) = {B1, . . . , Bk}. Therefore, I is 0-robust with respect to RTδ(A). Finally, Lemma 2 shows that for any y ∈ I, d(y, µ′(y)) ≤ min{ν/2, g⁻¹(g(ν/2 − 2r) − (ρ2/ρ1) g(r)) + 2r}, so I is min{ν/2, g⁻¹(g(ν/2 − 2r) − (ρ2/ρ1) g(r)) + 2r}-distance-robust with respect to RTδ(A).

UNDERLYING STRUCTURE OF I & CONVEXITY OF g

In this section, we restrict our attention to convex cost functions and show that for such functions Theorems 2 and 3 can be strengthened. Note that g is convex in most common clustering objectives, e.g., k-medians and k-means. Throughout this section, we implicitly assume that g is also continuous, increasing, and unbounded.

Theorem 4. For any k and convex function g, let A be the (k, g)-centroid algorithm. For all I ⊆ X such that I has a σ-separable, ρ-balanced k-clustering of diameter s, for any δ > σs/2 and any γ ≥ (|X \ I|/|I| + 1)(g(δ) + 2g(s))/g(σs/2) + 2kρ², I is γ-robust with respect to RTδ(A).

We need to develop a few results before proving Theorem 4. The next lemma states an important property of convex functions.

Lemma 4. For any x, y ∈ X, a metric distance function d, a point c, and a convex function g, g(d(x, c)) + g(d(y, c)) ≥ 2g(d(x, y)/2).

Proof. In the following, the first inequality holds by the convexity of g and the second by the fact that g is increasing and d satisfies the triangle inequality:

$$ g(d(x, c)) + g(d(y, c)) \;\ge\; 2g\Big(\frac{d(x, c) + d(y, c)}{2}\Big) \;\ge\; 2g\Big(\frac{d(x, y)}{2}\Big). $$

The next lemma bounds the number of pairs that are clustered differently in two clusterings, based on the distance between the partitions.

Lemma 5 (Ackerman et al., 2013). Let C1 and C2 be two clusterings of Y, where C1 is ρ-balanced and has k clusters. If Δ(C1, C2) ≥ γ, then the number of disjoint pairs {x, y} ⊆ Y such that x ≁_{C1} y and x ∼_{C2} y is at least (1/2)(γ − kρ²)|Y|.

The next two lemmas examine the output of the (k, g)-δ-truncated and (k, g)-centroid algorithms when g is convex and I has a desirable internal structure.

Lemma 6. For any k and convex function g, let A be the (k, g)-centroid algorithm. For all I ⊆ X that have a σ-separable, ρ-balanced k-clustering of diameter s, namely B, and any δ > σs/2, if A′ = RTδ(A), then Δ(B, A′(X)|I) ≤ (|X \ I|/|I| + 1)(g(δ) + g(s))/g(σs/2) + kρ².

Proof. Let A′(X) = C′ with centers µ′1, . . . , µ′k. Let Δ(B, C′|I) = γ and assume, on the contrary, that γ > (|X \ I|/|I| + 1)(g(δ) + g(s))/g(σs/2) + kρ². Using Lemma 4, for any {x, y} ⊆ I such that x ≁_B y but x ∼_{C′} y with common center µ′i, g(d′(x, µ′i)) + g(d′(y, µ′i)) ≥ min{2g(δ), 2g(σs/2)} ≥ 2g(σs/2). Lemma 5 shows that there are at least (1/2)(γ − kρ²)|I| such disjoint pairs. Therefore,

$$ \Lambda^g_{d'}(C') \;\ge\; g\big(\tfrac{\sigma s}{2}\big)\,(\gamma - k\rho^2)\,|I|
 \;>\; \big(|X \setminus I| + |I|\big)\big(g(\delta) + g(s)\big)
 \;>\; g(s)\,|I| + g(\delta)\,|X \setminus I|
 \;\ge\; \Lambda^g_{d'}\big(B \cup \{X \setminus I\}\big), $$

which contradicts the optimality of C′. Therefore, Δ(B, A′(X)|I) ≤ (|X \ I|/|I| + 1)(g(δ) + g(s))/g(σs/2) + kρ².
Lemma 7. For any k and convex function g, let A be the (k, g)-centroid algorithm. Let I have a σ-separable, ρ-balanced k-clustering of diameter s, namely B. Then Δ(A(I), B) ≤ g(s)/g(σs/2) + kρ².

Proof. Let A(I) = C with centers µ1, . . . , µk. Let Δ(B, C) = γ and assume, in search of a contradiction, that γ > g(s)/g(σs/2) + kρ². For any {x, y} ⊆ I such that x ≁_B y but x ∼_C y with common center µi, Lemma 4 gives g(d(x, µi)) + g(d(y, µi)) ≥ 2g(d(x, y)/2) ≥ 2g(σs/2). Lemma 5 shows that there are at least (1/2)(γ − kρ²)|I| such disjoint pairs. Therefore, Λ^g_d(C) ≥ (1/2)(γ − kρ²)|I| · 2g(σs/2) > g(s)|I| ≥ Λ^g_d(B). This forms a contradiction. Therefore, Δ(A(I), B) ≤ g(s)/g(σs/2) + kρ².

Proof of Theorem 4. Let B be the σ-separable, ρ-balanced k-clustering of diameter s that covers I, and let γ′ = (|X \ I|/|I| + 1)(g(δ) + g(s))/g(σs/2) + kρ² and γ′′ = g(s)/g(σs/2) + kρ². Lemmas 6 and 7 respectively show that Δ(B, A′(X)|I) ≤ γ′ and Δ(B, A(I)) ≤ γ′′. Therefore, Δ(A(I), A′(X)|I) ≤ γ′ + γ′′ ≤ γ.

7. Non-robustness of the Simplistic Paradigm

A key component of our δ-Truncated paradigm is the use of a garbage-collecting cluster. In this section, we show that, using the common cost functions with no such "garbage collectors", we cannot achieve similar noise robustness performance. More specifically, Theorems 5 and 6 show that for any desired level of robustness and signal-to-noise ratio, there exists I ⊆ X with the desired signal-to-noise ratio and a well-clusterable underlying pattern that is not robust with respect to the p-Increased paradigm, as long as p < |X \ I|.

Theorem 5. Let A be the k-means algorithm. For any α, r, λ > 0 there exist X and I ⊆ X such that rad(I) ≤ r, I can be covered with k balls of arbitrarily small radii, and X has signal-to-noise ratio |I|/|X \ I| ≥ λ. But, for any p < |X \ I|, I is not α-distance-robust or α²(|I| − |X \ I|)-cost-robust with respect to RIp(A) for the k-means cost function.

Proof. Let d1 = (α + 2r)((λ/(λ + 1))|X| + 1) and d2 = 2(d1 + 2r) + 1. For i ≤ k, let Bi denote a closed ball of radius 0 such that |Bi| ≥ (λ/(k(λ + 1)))|X|. Let B1, . . . , Bk be evenly placed on a line segment of length 2r. For i ≤ |X|/(λ + 1), let oi be a point on the line that connects B1, . . . , Bk, such that d(o1, B1) = d1 and d(oi, oi+1) = d2 (see Figure 1). Let I = ⋃_{i ∈ [k]} Bi and X = I ∪ {o1, . . . , o_{|X|/(λ+1)}}. Note that X and I are chosen such that |I|/|X \ I| ≥ λ.

Figure 1. Structure of an unrobust dataset w.r.t. RIp(A).

Let A′ = RIp(A), A′(X) = C = {C1, . . . , Ck+p}, let µi denote the center of Ci, and let µ(x) denote the closest center to x. Assume (in the hope of finding a contradiction) that for all i ≤ k + p, Ci ⊆ I or Ci ⊆ X \ I. Without loss of generality let C1 ⊆ I; then d(o1, µ1) ≤ d1 + 2r. Moreover, for any Ci ⊆ X \ I such that |Ci| ≥ 2, if oj is the left-most or the right-most point of Ci, then d(oj, µi) ≥ d2/2 > d(o1, µ1). Without loss of generality, assume o1 ∈ C2. There are two cases:

1. C2 = {o1, oj, . . .}: then the cost of the clustering C′ = {C1 ∪ {o1}, C2 \ {o1}, . . . , Cp+k} is lower than the cost of C.
2. C2 = {o1}: let C3 ⊆ X \ I be any cluster of size at least 2, and let oi be its left-most point (such a cluster exists since p < |X \ I|). The cost of the clustering C′ = {C1 ∪ {o1}, {oi}, C3 \ {oi}, . . . , Cp+k} is lower than the cost of C.

Hence, C is not an optimal clustering. So, for any optimal RIp(A) clustering, there exists a cluster Ci such that {oj, y} ⊆ Ci for some oj ∈ X \ I and y ∈ I. Then, d(y, µi) ≥ d1/(|I| + 1) > α + 2r. Hence, I is not α-distance-robust to X \ I with respect to RIp(A). Moreover, because there exists y ∈ I with d(y, µ(y)) > α + 2r, and every point of I is within 2r of y, we have d(y′, µ(y′)) > α for all y′ ∈ I. Since any collection of clusters of A′(X) covering at least |I| points covers at least |I| − |X \ I| points of I, each at distance more than α from its center, while the ground clustering A(I) has cost 0, I is not α²(|I| − |X \ I|)-cost-robust with respect to RIp(A) for the k-means cost function.
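To make the construction above concrete, here is an illustrative one-dimensional instance laid out as in Figure 1; the specific values of k, r, α, and λ below are our own choices, not the paper's.

```python
import numpy as np

# Illustrative 1-D instance in the spirit of the construction in Theorem 5:
# k zero-radius groups on a segment of length 2r, plus outliers o_1, o_2, ...
# placed at distance d1 from the groups and spaced d2 apart.
k, r, alpha, lam = 2, 0.5, 1.0, 9.0
n_noise = 4                                   # |X \ I|
n_per_ball = 18                               # so |I| / |X \ I| = 9 = lam
n_total = k * n_per_ball + n_noise

d1 = (alpha + 2 * r) * (lam / (lam + 1) * n_total + 1)
d2 = 2 * (d1 + 2 * r) + 1

balls = [np.full(n_per_ball, i * 2 * r / (k - 1)) for i in range(k)]
outliers = 2 * r + d1 + d2 * np.arange(n_noise)
X = np.concatenate(balls + [outliers])        # the noisy one-dimensional dataset
```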
Note that Theorem 5 implies that for any (k, g)-centroid algorithm A, any desired level of robustness, and any signal-to-noise ratio, there exists I with arbitrarily small radius and X ⊇ I with that signal-to-noise ratio such that I is not robust with respect to RIp(A), as long as p ≤ |X \ I| − k. The next theorem proves results similar to those of Theorem 5 for (structural) robustness, and contrasts with the results shown in Theorems 3 and 4.

Theorem 6. Let A be the k-means algorithm. For any r, λ > 0, there exist X and I ⊆ X such that rad(I) ≤ r, I can be covered with k balls of arbitrarily small radii, and X has signal-to-noise ratio |I|/|X \ I| ≥ λ. But, for any p < |X \ I|, I is not (1 − 1/k)-robust with respect to RIp(A).

Bounded Space

Corollaries 1 and 2 demonstrate the limitations of p-Increased algorithms even when the diameter of the data is bounded.

Corollary 1. Let A be the k-means algorithm. For any λ, k, and ν < 1/k, there exists I ⊆ X such that X has diameter 1 and signal-to-noise ratio |I|/|X \ I| ≥ λ, and I can be covered by k balls that have arbitrarily small radii and whose centers are at least ν apart. But for any p < |X \ I| and any α ≤ (1 − kν)/(2|X \ I|(|I| + 1)), I is not α-distance-robust, (1 − 1/k)-robust, or (α − νk)²(|I| − |X \ I|)-cost-robust with respect to RIp(A) for the k-means cost function.

Corollary 2. Let A be the k-means algorithm. For any λ > 0, there exists I ⊆ X such that X has diameter 1 and signal-to-noise ratio |I|/|X \ I| ≥ λ, and I has an arbitrarily small radius. But for any p ≤ |X \ I| − k and any α ≤ 1/(2(|I| + 1)|X \ I|), I is not α-distance-robust or α²(|I| − |X \ I|)-cost-robust with respect to RIp(A) for the k-means cost function.

8. Comparing the Two Paradigms

Here, we use an example to further demonstrate the robustness of the δ-Truncated paradigm compared to the limitations of the p-Increased paradigm. Moreover, we use experiments to back up these theoretical results.

Example 1. Let A denote the k-means algorithm. By Corollary 2, there exists I ⊆ X such that X has n points, diameter 1, and 10% noise, and I has radius 0, but for any p < 0.1n − k, I is not 5/n²-distance-robust to X \ I with respect to RIp(A). Since I can be covered by a ball of radius 1/(4n²), by Theorem 2, for δ ∈ [1/n², 3√3/(4n²)), I is 1/n²-distance-robust with respect to RTδ(A).

8.1. Experiments

Datasets: For each k, we use n = 50000 data points on the unit square. 90% of the points come from k Gaussian distributions with centers selected uniformly at random and standard deviation 1/n. Additionally, 10% uniform noise is introduced in the data.

Clustering Algorithms: We run the Lloyd algorithm for (k + p)-means, with p = 1, 2. For δ-k-means, we adapt the Lloyd algorithm to calculate the clustering using a δ-truncated distance matrix in every iteration, where δ is set to 10.

Charts: The following graph shows the average cost of a clustering per clustered point for δ-k-means, (k + 1)-means, and (k + 2)-means. Let (C, Φ) be the δ-k-means clustering, where Φ is the noise cluster. The δ-k-means average cost is Cost(C)/|⋃ C|. For (k + p)-means, we find the minimal cost of a collection of clusters that covers at least |⋃ C| points, and then divide this cost by the number of points that are present in these clusters.

[Figure: average cost of clustered points as a function of the number of clusters k, for n = 50000, variance 1/n², and random centers; curves for δ-k-means, (k + 1)-means, and (k + 2)-means.]

The δ-k-means average cost per point is considerably smaller than that of (k + p)-means. Moreover, this cost is stable throughout the experiments and remains close to the average radius of the clusters. On the other hand, the cost of (k + p)-means fluctuates and is considerably higher than the average radius of the clusters.
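The δ-truncated Lloyd variant used in the experiments can be sketched as follows, assuming Euclidean data; the initialization, number of restarts, and exact value of δ are placeholders, since the text does not fully specify them.

```python
import numpy as np

def delta_kmeans(X, k, delta, n_iter=100, seed=0):
    """Lloyd-style delta-k-means sketch: in every iteration, points whose
    nearest center is at distance >= delta go to a noise bin (label k) and
    are ignored when the centers are recomputed."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # (n, k)
        labels = d.argmin(axis=1)
        labels[d.min(axis=1) >= delta] = k                                # noise bin
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Under this reading, the average cost per clustered point reported for δ-k-means is the k-means cost of the points with labels below k, divided by their number.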
9. Concluding Remarks

In this paper, we consider the problem of robustness of clustering algorithms to the addition of unstructured data points (which we termed "noise"). We propose to augment any given center-based clustering algorithm with an efficiently implementable noise-robustifying mechanism that creates an additional cluster, used as a garbage-collecting bucket. We introduce rigorous robustness notions that capture different aspects of robustness that may be desirable for such algorithms. We prove that our algorithmic paradigm indeed guarantees desirable noise robustness, and show that the simple strategy of just applying the underlying clustering algorithm with extra clusters (to accommodate such noisy data) cannot enjoy similar performance.

Acknowledgments

This research was partially supported by a Google research award.

References

Ackerman, Margareta and Ben-David, Shai. Clusterability: A theoretical study. In Dyk, David A. Van and Welling, Max (eds.), AISTATS, volume 5 of JMLR Proceedings, pp. 1–8. JMLR.org, 2009.

Ackerman, Margareta, Ben-David, Shai, Loker, David, and Sabato, Sivan. Clustering oligarchies. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, 2013.

Ailon, Nir, Jaiswal, Ragesh, and Monteleoni, Claire. Streaming k-means approximation. In Advances in Neural Information Processing Systems, pp. 10–18, 2009.

Balcan, Maria-Florina, Blum, Avrim, and Gupta, Anupam. Approximate clustering without the approximation. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1068–1077. Society for Industrial and Applied Mathematics, 2009.

Ben-David, Shai, Von Luxburg, Ulrike, and Pál, Dávid. A sober look at clustering stability. In Learning Theory, pp. 5–19. Springer, 2006.

Bilu, Yonatan and Linial, Nathan. Are stable instances easy? Combinatorics, Probability & Computing, 21(5):643–660, 2012.

Cuesta-Albertos, J. A., Gordaliza, Alfonso, and Matrán, C. Trimmed k-means: An attempt to robustify quantizers. The Annals of Statistics, 25(2):553–576, 1997.

Dave, Rajesh N. Characterization and detection of noise in clustering. Pattern Recognition Letters, 12(11):657–664, 1991.

Dave, Rajesh N. Robust fuzzy clustering algorithms. In Proceedings of the 2nd IEEE International Conference on Fuzzy Systems, pp. 1281–1286, 1993.

Donoho, David L. Breakdown properties of multivariate location estimators. Technical report, Harvard University, 1982.

Ester, Martin, Kriegel, Hans-Peter, Sander, Jörg, and Xu, Xiaowei. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of Knowledge Discovery in Databases, volume 96, pp. 226–231, 1996.

García-Escudero, Luis Angel and Gordaliza, Alfonso. Robustness properties of k means and trimmed k means. Journal of the American Statistical Association, 94(447):956–969, 1999.

García-Escudero, Luis Angel, Gordaliza, Alfonso, Matrán, Carlos, and Mayo-Iscar, Agustín. A general trimming approach to robust cluster analysis. The Annals of Statistics, pp. 1324–1345, 2008.

Hampel, Frank R. A general qualitative definition of robustness. The Annals of Mathematical Statistics, 42(6):1887–1896, 1971.
Hennig, Christian. Dissolution point and isolation robustness: Robustness criteria for general cluster analysis methods. Journal of Multivariate Analysis, 99(6):1154–1176, 2008.

Rand, William M. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.