# Clustering in the Presence of Background Noise

Shai Ben-David (SHAI@CS.UWATERLOO.CA), David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada
Nika Haghtalab (NIKA@CMU.EDU), Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Abstract

We address the problem of noise management in clustering algorithms, namely, the issues that arise when, on top of some cluster structure, the data also contains an unstructured set of points. We consider how clustering algorithms can be robustified so that they recover the cluster structure in spite of the unstructured part of the input. We introduce quantitative measures of such robustness that take into account the strength of the embedded cluster structure as well as the mildness of the noise subset. We propose a simple and efficient method to turn any centroid-based clustering algorithm into a noise-robust one, and prove robustness guarantees for our method with respect to these measures. We also prove that more straightforward ways of robustifying clustering algorithms fail to achieve similar guarantees.

1. Introduction

Clustering is (usually) aimed at detecting groups of similar objects in given datasets. Many common clustering algorithms output a partition of the input set. However, it is often the case that datasets one wishes to cluster contain, on top of groups of similar objects, a significant subset that is unstructured. Such a subset, which may be referred to as noise, tends to disrupt clustering algorithms and make it difficult to detect the cluster structure of the remaining domain points.

This problem can be viewed as a noise robustness issue. Are there useful clustering algorithms that are noise robust (w.r.t. the addition of unstructured data points)? Short reflection reveals that the noise robustness of an algorithm is closely related to its sensitivity to the input data. As an extreme example, it is easy to achieve noise robustness by ignoring the input data and always returning a fixed output. Ackerman et al. (2013) provide some formal tradeoffs between these two desired properties. Roughly stated, their results (for example, Theorem 4.3) show that no algorithm can be both noise robust and responsive to the cluster structure of the data (in the language of Ackerman et al. (2013), these properties are called robustness to oligarchies and separability-detecting). However, those results consider applying an algorithm with a fixed number-of-clusters parameter. This paper addresses the possibility of overcoming those pessimistic results by allowing clustering algorithms to add to the set of clusters they output an extra subset, serving as a "garbage collector".

An important aspect of clustering, one that distinguishes it from other major learning tasks like classification prediction, is the wide variability of input-output behavior among common clustering algorithms. Different clustering applications employ very different clustering algorithms, and there is no single clustering algorithm that is suitable for all. Consequently, solutions to fundamental clustering challenges, like the tradeoff between sensitivity to the input and noise robustness, should be modular in the sense of being applicable to a variety of clustering algorithms. In this paper we propose such a modular solution.
In Section 4, we consider a method to transform any centroid-based clustering algorithm into one that outputs a set of clusters augmented by a distinct noise bin. We show that our proposed paradigm can be achieved by employing a simple and efficiently implementable transformation of the input data, after which users can apply any centroid-based clustering algorithm.

Yet another contribution of this paper is the introduction of quantitative measures of noise robustness (Section 5). We consider three aspects of noise robustness for centroid-based clustering algorithms: the degree by which noise can affect the location of the centers of the clusters (or the archetypal cluster representatives), the effect of noise on the cost of the clustering solution (or, the value of the clustering objective function), and the similarity between the clustering of the un-noised data and its clustering in the presence of the noise.

In our results, we consider a scenario in which the input dataset X consists of two parts: a clusterable subset I, which is also called the un-noised data, and an added noise set X \ I (the identity of which is not known to the clustering algorithm). We consider two clustering algorithms, the original one, A, and its robustified transformation Rp(A) that is obtained by using our paradigm with a tunable parameter p. We examine to what extent well-clusterability of I and mildness properties of X \ I (in terms of the size and/or diameter of this set, relative to that of I) affect the similarity between the clusterings A(I) and Rp(A)(X) restricted to I.

In Section 6, we prove that our paradigm makes the algorithms noise-robust without sacrificing much of their ability to detect clear cluster structures. The degree of noise-robustness that our paradigm achieves depends on a parameter that can be tuned by the user, depending on the level and structure of the noise expected in the input data. On the other hand, in Section 7, we show that a simple transformation in which Rp(A) is the original algorithm A with an increased number of output clusters (the extra clusters may be used to handle noise) does not enjoy the same robustness guarantees. In Section 8, we further demonstrate the gap between these two paradigms and use experiments to confirm our theoretical results. Note that, in the interest of space, the proofs of some theorems and lemmas are moved to the supplementary material.
2. Related Work

Previous work on the robustness of clustering methods has mainly focused on two directions: first, developing measures of robustness and examining the performance of traditional clustering algorithms based on these measures; second, developing clustering algorithms that are robust to noise and outliers.

Various measures of robustness have been developed for examining the robustness characteristics of clustering algorithms to noise (Donoho, 1982; Hampel, 1971; Hennig, 2008). These measures have been used to demonstrate the lack of robustness of some traditional algorithms, where the number of clusters is fixed (Ackerman et al., 2013; Hennig, 2008). That is, they consider the scenario in which a clustering algorithm is used with the same number-of-clusters parameter for both the clean input and the input after the addition of noisy points. In this work, we allow some extra clusters to be used (so as to accommodate the added noisy data points) and introduce and analyze noise robustness measures that allow such flexibility. We show that, using the added flexibility of our paradigm, we can overcome some of the limitations that the above papers deemed inevitable (without allowing extra noise-accommodating clusters).

Several methods have been suggested for clustering a potentially noisy dataset (Cuesta-Albertos et al., 1997; Dave, 1993; Ester et al., 1996). One interesting line of work is the development of the concept of a noise cluster in a fuzzy setting by Dave (1991; 1993). In this work, we introduce a novel paradigm for robustifying any center-based clustering algorithm. We show that our paradigm generalizes a non-fuzzy variation of the algorithm introduced by Dave (1991). In addition, we prove noise robustness guarantees for our proposed paradigm, guarantees that were not proven in any of the earlier works we are aware of.

Some of the earlier work on the noise-robustness of clustering algorithms proposes the use of trimming. Trimming is the natural approach to separating clusterable parts of the data from unstructured ones: one fixes some noise-set size (say, as some fraction α of the data size) and lets the algorithm find the least structured α fraction of an input dataset and discard it before applying a clustering algorithm (Cuesta-Albertos et al., 1997; García-Escudero et al., 2008; García-Escudero & Gordaliza, 1999). However, such methods suffer from exponential computational complexity, or have to be compromised for efficient heuristic searches that have no performance guarantees. In this work, we avoid this issue by developing a paradigm that is of comparable computational complexity to traditional clustering algorithms.

Clustering has been the subject of many other theoretical studies. Ackerman & Ben-David (2009), Balcan et al. (2009), and Bilu & Linial (2012) examined the effect of robustness (to perturbations of the input) on the computational complexity of clustering. Ben-David et al. (2006) examined situations where the clustering of the input is similar to the clustering of a random subset of it. Ailon et al. (2009) approximated the k-means objective using extra centers. Our work has a different focus. Our goal is to find clustering algorithms that yield a good clustering on the clusterable portion of a noisy input. Therefore, the questions that we address, our setting, and our approach are different from those of the above studies. Discussing the details of more relevant previous work requires the definition of a few notations; hence, it is delayed to the relevant sections.

3. Preliminaries

For a set X and an integer k ≥ 1, a k-clustering of X is a partition C = {C1, . . . , Ck} of X into k disjoint sets. For a clustering C of X and points x, y ∈ X, we say x ∼_C y if x and y are in the same cluster, and x ≁_C y otherwise. For sets X and I such that I ⊆ X, and a clustering C = {C1, . . . , Ck} of X, we denote the restriction of C to I by C|I = {C1 ∩ I, . . . , Ck ∩ I}. For two clusterings C and C′ of the set X, we define the distance between them, Δ(C, C′), as the fraction of pairs of domain points which are in the same cluster under C but in different clusters under C′, or vice versa. Equivalently, Δ(C, C′) = 1 − R(C, C′), where R is the Rand index (Rand, 1971). Δ satisfies the triangle inequality.
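For concreteness, the following is a minimal sketch of this pair-counting distance (one minus the Rand index), assuming each clustering is represented as a list of cluster labels indexed by point; the function name and representation are ours, not the paper's.

```python
from itertools import combinations

def clustering_distance(labels1, labels2):
    """Pair-counting distance Delta(C, C') = 1 - Rand index: the fraction of
    point pairs that are together in one clustering and apart in the other."""
    n = len(labels1)
    assert n == len(labels2)
    disagreements = 0
    for i, j in combinations(range(n), 2):
        if (labels1[i] == labels1[j]) != (labels2[i] == labels2[j]):
            disagreements += 1
    return disagreements / (n * (n - 1) / 2)

# Two 2-clusterings of five points that disagree on the placement of one point.
print(clustering_distance([0, 0, 0, 1, 1], [0, 0, 1, 1, 1]))  # 0.4
```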
Let d be a symmetric distance function defined over X, with d(x, x) = 0 for all x ∈ X, and satisfying the triangle inequality. The diameter of X, denoted diam(X), is defined as the maximum distance between two elements of X. For a clustering C = {C1, . . . , Ck}, the diameter of C is defined as max_{Ci ∈ C} diam(Ci). The radius of X is rad(X) = min_{c ∈ X} max_{x ∈ X} d(c, x). A clustering C is σ-separable for σ ≥ 1 if min_{x ≁_C y} d(x, y) > σ max_{x ∼_C y} d(x, y). A clustering C is (ρ1, ρ2)-balanced if for all i ≤ k, ρ1|X| ≤ |Ci| ≤ ρ2|X|. We use ρ-balanced to refer to a clustering that is (0, ρ)-balanced.

A clustering algorithm is a function A that takes as input X and d and returns a clustering C of X. An objective function is a function that takes as input a clustering and outputs a non-negative cost associated with it. An objective-based clustering algorithm is an algorithm that produces a clustering that minimizes a specified objective function.

Consider an input set X drawn from a given space E with distance function d. Throughout this work, let g : R⁺ → R⁺ be any continuous, increasing, and unbounded function. The (k, g)-centroid algorithm is an objective-based clustering algorithm with objective function

$$ \Lambda^g_{E,d}(\{C_1, \dots, C_k\}) \;=\; \min_{\mu_1, \dots, \mu_k \in E} \; \sum_{i \le k} \; \sum_{x \in C_i} g\big(d(x, \mu_i)\big). $$

We refer to µi as the center of cluster Ci, and we define µ(x) = arg min_{µi ∈ {µ1,...,µk}} d(x, µi) to be the center closest to x. With a slight abuse of notation, we can also define the (k, g)-centroid algorithm as the algorithm that chooses centers µ1, . . . , µk that minimize

$$ \Lambda^g_{X,E,d}(\mu_1, \dots, \mu_k) \;=\; \sum_{x \in X} g\big(d(x, \mu(x))\big). $$

We remove X and E from the notation whenever they are clear from the context. Note that for g(x) = x and g(x) = x², Λ^g_d refers to the k-medians and k-means cost functions, respectively.

4. Robustifying Paradigms

We define parameterized robustifying paradigms that transform any clustering algorithm into an algorithm that is more robust to noise, to an extent determined by a predefined parameter. We define two robustifying paradigms. Moreover, we establish an equivalence between one of the paradigms and a generalization of an existing algorithm.

A robustifying parameter, p, denotes the degree to which an algorithm should be robustified to noise; for example, the number of extra clusters that can be used, or a notion of distance beyond which a point is considered an outlier. A robustifying paradigm, Rp(·), is a function that takes a clustering algorithm A and returns a robustified clustering algorithm Rp(A) based on the robustifying parameter p. We refer to A as the ground clustering algorithm of Rp(A).

Since noise, unstructured data, and outlying structures are heterogeneous with respect to the existing data, outliers and noise groups can be considered as separate clusters. Therefore, some statisticians simply recommend increasing the number of clusters when dealing with noisy data (García-Escudero & Gordaliza, 1999). The next paradigm captures robustification as used in this practice.

Definition 1 (p-Increased Paradigm). The p-Increased Paradigm is a robustifying paradigm, RIp(·), that takes as input a (k, g)-centroid algorithm and returns a (k + p, g)-centroid algorithm.

The next paradigm is parameterized by the distance after which a point should be considered an outlier. To define this paradigm, we first introduce a class of algorithms. Given a space E and distance function d, the δ-truncated distance function corresponding to d is the function d′ such that d′(x, y) = min{δ, d(x, y)} for x, y ∈ E. The (k, g)-δ-truncated algorithm is an objective-based algorithm that, given X ⊆ E, first optimizes the function Λ^g_{X,d′}(µ′1, . . . , µ′k). For j ≤ k, let C′j = {x ∈ X | j = arg min_i d(x, µ′i) and d(x, µ′j) < δ} and C′_{k+1} = {x ∈ X | min_i d(x, µ′i) ≥ δ}. Then the (k, g)-δ-truncated algorithm returns the (k + 1)-clustering C′ = {C′1, . . . , C′_{k+1}}. We refer to µ′i as the center of C′i for i ≤ k. With a slight abuse of notation, we define µ′(x) = arg min_{µ′i ∈ {µ′1,...,µ′k}} d(x, µ′i).
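To illustrate, the following is a minimal sketch of the assignment step of the (k, g)-δ-truncated algorithm for Euclidean data and g(x) = x² (the k-means case), assuming a set of candidate centers has already been obtained; the optimization of the centers themselves is not shown, and all names are ours.

```python
import numpy as np

def truncated_cost(X, centers, delta):
    """Objective Lambda^g_{d'} for g(x) = x^2, with d'(x, y) = min{delta, d(x, y)}."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (n, k)
    return float(np.square(np.minimum(d.min(axis=1), delta)).sum())

def truncated_assign(X, centers, delta):
    """Assign each point to its nearest center, or to the noise bin (label k)
    when every center is at distance at least delta."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    labels[d.min(axis=1) >= delta] = len(centers)  # noise bin C_{k+1}
    return labels
```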
Definition 2 (δ-Truncated Paradigm). The δ-Truncated Paradigm is a robustifying paradigm, RTδ(·), that takes as input a (k, g)-centroid algorithm and returns a (k, g)-δ-truncated algorithm.

Dave (1991) developed the notion of the noise prototype, which is a point equidistant from all points in E. He then introduced a clustering algorithm that performs fuzzy (k + 1)-means with one center fixed as the noise prototype. In the next definition, we provide a generalization of the non-fuzzy variation of Dave's algorithm for any centroid-based algorithm. Furthermore, we show that the class of algorithms produced in this way is equivalent to the class of algorithms that belong to the δ-Truncated paradigm.

Definition 3. Let µ* be defined such that for all y ∈ E, d(y, µ*) = δ. The generalized (k, g)-δ-centroid algorithm is an objective-based algorithm with objective function Λ^g_{X, E∪{µ*}, d}(µ1, . . . , µk, µ*). We refer to the (k, g)-δ-centroid algorithm as δ-k-median and δ-k-means when g(x) = x and g(x) = x², respectively.

Theorem 1. A clustering C is an optimal (k, g)-δ-centroid clustering of X if and only if C is an optimal (k, g)-δ-truncated clustering of X.

Proof. For convenience, let µ_{k+1} = µ* and d′(x, y) = min{δ, d(x, y)}. We show that the (k, g)-δ-centroid clustering with centers µ1, . . . , µk, µ* and the (k, g)-δ-truncated clustering with centers µ1, . . . , µk have the same objective value:

$$
\begin{aligned}
\Lambda^g_{(E,d')}(\{\mu_1,\dots,\mu_k\})
&= \sum_{y \in X} g\Big(\min_{i \in [k]} \min\{\delta,\, d(y,\mu_i)\}\Big)
 = \sum_{y \in X} g\Big(\min_{i \in [k]} \min\{d(y,\mu_{k+1}),\, d(y,\mu_i)\}\Big) \\
&= \sum_{y \in X} g\Big(\min_{i \in [k+1]} d(y,\mu_i)\Big)
 = \Lambda^g_{(E \cup \{\mu^*\},\, d)}(\{\mu_1,\dots,\mu_k,\mu^*\}).
\end{aligned}
$$

Moreover, µ1, . . . , µk, µ* induce the same (k, g)-δ-centroid clustering as the (k, g)-δ-truncated clustering induced by µ1, . . . , µk. Therefore, C is an optimal (k, g)-δ-centroid clustering if and only if it is an optimal (k, g)-δ-truncated clustering.
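As a quick numerical sanity check of the identity used in the proof of Theorem 1, the following toy computation (our own, with arbitrary data, centers, and δ) compares the δ-truncated k-means objective with the objective obtained by appending an explicit noise-prototype column at distance δ from every point; the two values coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))          # arbitrary toy data
centers = rng.normal(size=(3, 2))      # arbitrary candidate centers
delta = 1.5

d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # (n, k)

# delta-truncated k-means objective: g(min{delta, d(x, nearest center)})
truncated = np.square(np.minimum(d.min(axis=1), delta)).sum()

# same objective with an explicit noise-prototype "center" at distance delta
d_aug = np.hstack([d, np.full((len(X), 1), delta)])               # (n, k+1)
with_prototype = np.square(d_aug.min(axis=1)).sum()

assert np.isclose(truncated, with_prototype)
```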
5. Measures of Robustness

In previous work, robustness to the addition of noise has been measured by comparing the output of the same algorithm on both the un-noised and the noisy data. This approach leads to pessimistic results about the possibility of achieving noise robustness. One example of these results is the work of Ackerman et al. (2013), which shows that algorithms that are responsive to the structure of the data are not noise-robust. More precisely, Ackerman et al. (2013) define a k-clustering algorithm A to be σ-separability-detecting if, for all I such that there exists a σ-separable k-clustering C of I, A(I) = C. Then, they show that for any σ-separability-detecting algorithm and any ρ, there is a ρ-balanced σ-separable set I that is not robust to a noise set of size as small as k.

This approach does not reflect the way clustering is done in practice. For clustering noisy data, one may take special provisions to handle the noise, provisions that are not needed when the data is known to be noise free. For example, if it is known that the input data has k clusters, one may allow the algorithm to create more clusters in an attempt to separate the noise. One of the take-home messages of this paper is that by revising the previous approach to reflect a more practical setting, we can overcome those pessimistic results.

To this end, our measures of robustness compare the output of a robustified algorithm (one that can use extra clusters) on noisy data to the output of its corresponding ground algorithm on the un-noised data. More precisely, let A denote any clustering algorithm (the ground clustering algorithm) and Rp(·) denote a robustifying paradigm with parameter p. We use A′ = Rp(A) to denote the robustified algorithm corresponding to A. Given I ⊆ X, A(I) denotes the clustering of I using the ground algorithm, and A′(X) denotes the clustering of X by the robustified algorithm. We consider I to be robust (to X \ I) with respect to the Rp(A) algorithm if certain properties of A(I) are preserved in A′(X). In the following definitions, for any x ∈ X, we use µ(x) and µ′(x) to denote the centers of A(I) and A′(X), respectively, to which x belongs.

Cluster centers are commonly used to compress and represent data. The distances between points and their corresponding centers can be viewed as the distortion of such a compression. Therefore, it is essential to have clustering algorithms where this value does not grow significantly in the presence of noise. The next definition measures how much this distortion is affected by the addition of noise to the input data.

Definition 4 (α-distance-robust). A subset I ⊆ X is α-distance-robust with respect to A′ if for all y ∈ I, d(y, µ′(y)) ≤ d(y, µ(y)) + α.

An algorithm is considered robust if it separates the input using some low-cost clusters that cover the un-noised data. In the next definition, we capture this property by computing the minimal cost of a subset of clusters that covers at least |I| points in total.

Definition 5 (β-cost-robust). Let Λ be an objective (cost) function. I ⊆ X is β-cost-robust with respect to A′ for Λ if there exists C′ ⊆ A′(X) such that |⋃ C′| ≥ |I| and Λ(C′) ≤ Λ(A(I)) + β. Note that if A′(X) = A(I) ∪ {X \ I}, then I is 0-cost-robust with respect to A′.

The next definition measures the degree to which noise affects the structure of a clustering.

Definition 6 (γ-robust). I ⊆ X is γ-robust with respect to algorithm A′ if Δ(A′(X)|I, A(I)) ≤ γ.

Lemma 1. Given a (k, g)-centroid algorithm A and a parameter δ, let A′ = RTδ(A). For any I ⊆ X such that for all y ∈ I, d(y, µ′(y)) ≤ δ, I is g(δ)|X \ I|-cost-robust to X \ I with respect to RTδ(A) for the Λ^g_d cost function.

Proof. Refer to the supplementary material.
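Below is a small sketch, in our own notation, of how the distance-robustness and cost-robustness of a given pair of clusterings could be measured for k-means, assuming Euclidean data and access to the cluster memberships and centers; Definition 4 compares each inlier's distance to its ground and robustified centers, and Definition 5 asks for the cheapest sub-collection of robustified clusters covering at least |I| points.

```python
import numpy as np
from itertools import chain, combinations

def distance_robustness(I_points, ground_centers, robust_centers):
    """Smallest alpha for which Definition 4 holds, taking mu(y) and mu'(y)
    to be the nearest ground / robustified center of each inlier y."""
    d_ground = np.linalg.norm(I_points[:, None, :] - ground_centers[None, :, :], axis=2).min(axis=1)
    d_robust = np.linalg.norm(I_points[:, None, :] - robust_centers[None, :, :], axis=2).min(axis=1)
    return float((d_robust - d_ground).max())

def kmeans_cost(points, center):
    return float(np.square(np.linalg.norm(points - center, axis=1)).sum())

def min_cover_cost(clusters, centers, m):
    """Minimal k-means cost of a sub-collection of clusters covering at least
    m points (brute force over subsets; fine for the small k considered here)."""
    best = float("inf")
    idx = range(len(clusters))
    for sub in chain.from_iterable(combinations(idx, r) for r in range(1, len(clusters) + 1)):
        if sum(len(clusters[i]) for i in sub) >= m:
            best = min(best, sum(kmeans_cost(clusters[i], centers[i]) for i in sub))
    return best
```

The β of Definition 5 then corresponds to min_cover_cost evaluated on the robustified clustering with m = |I|, minus the ground cost Λ(A(I)).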
6. Robustness of Our Paradigm

In this section, we show guaranteed robustness results for the δ-Truncated paradigm, RTδ(·). We prove robustness, distance-robustness, and cost-robustness based on several properties of the underlying structure of I and mildness properties of X \ I.

RADIUS OF THE BALL COVERING I

The following results guarantee distance-robustness and cost-robustness for the δ-Truncated paradigm. Theorem 2 derives values of δ that render I robust with respect to the δ-Truncated algorithm, based on the radius of I and the signal-to-noise ratio of X.

Theorem 2. For any k and g, let A be the (k, g)-centroid algorithm. For all I ⊆ X, let r = rad(I) and λ = |I| / |X \ I| (the signal-to-noise ratio). Then for any δ ∈ [4r, g⁻¹(λ(g(2r) − g(r)))), I is 4r-distance-robust and g(δ)|X \ I|-cost-robust with respect to RTδ(A) for the Λ^g_d cost function.

Proof. Let A′ = RTδ(A) and let A′(X) = C′. For all x ∈ X, let µ′(x) denote the closest center of C′ to x. Assume on the contrary that there exists y ∈ I such that d(y, µ′(y)) > 4r. Since every point of I is within 2r of y, for any y′ ∈ I, d(y′, µ′(y′)) > 2r. Therefore, Λ^g_{d′}(C′) ≥ |I| g(2r). On the other hand, for any clustering C′′ that has a center at the center of the r-ball that covers I, Λ^g_{d′}(C′′) ≤ |I| g(r) + |X \ I| g(δ). By the choice of δ, Λ^g_{d′}(C′′) < Λ^g_{d′}(C′), so C′ is not optimal. Hence, I is 4r-distance-robust to X \ I with respect to RTδ(A). Using Lemma 1, I is also g(δ)|X \ I|-cost-robust with respect to RTδ(A) for the cost function Λ^g_d.

Note that Theorem 2 implies that if I has radius r and signal-to-noise ratio λ, then for any δ ∈ [4r, r√(3λ)), I is 4r-distance-robust and δ²|X \ I|-cost-robust with respect to δ-k-means.
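Spelled out, the k-means instantiation in this note is just Theorem 2 with g(x) = x², so that g⁻¹(t) = √t:

```latex
% Theorem 2 with g(x) = x^2 (k-means), so g^{-1}(t) = \sqrt{t}:
\[
  \delta \in \Big[\,4r,\; g^{-1}\big(\lambda\,(g(2r)-g(r))\big)\Big)
         = \Big[\,4r,\; \sqrt{\lambda\,(4r^2 - r^2)}\,\Big)
         = \big[\,4r,\; r\sqrt{3\lambda}\,\big),
\]
% and the cost-robustness bound g(\delta)\,|X \setminus I| becomes \delta^2\,|X \setminus I|.
% (Our observation: the interval is nonempty once \lambda \ge 16/3.)
```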
UNDERLYING STRUCTURE OF I

The following results guarantee robustness, distance-robustness, and cost-robustness for the δ-Truncated paradigm. Theorem 3 derives values of δ that render I robust with respect to the δ-Truncated algorithm, based on the underlying structure of I and the signal-to-noise ratio of X.

Theorem 3. For any k and g, let A be the (k, g)-centroid algorithm. Assume that I ⊆ X has the following property: I can be covered by a (ρ1, ρ2)-balanced set of balls B1, . . . , Bk such that for all i ≤ k, rad(Bi) ≤ r, and for all i ≠ j, the centers of Bi and Bj are at least ν > 4r + 2g⁻¹(((ρ1 + ρ2)/ρ1) g(r)) apart. Let δ ∈ [ν/2, g⁻¹((|I|/|X \ I|)(ρ1 g(ν/2 − 2r) − (ρ1 + ρ2) g(r)))). Then I is 0-robust, min{ν/2, g⁻¹(g(ν/2 − 2r) − (ρ2/ρ1) g(r)) + 2r}-distance-robust, and g(δ)|X \ I|-cost-robust with respect to RTδ(A) for the Λ^g_d cost function.

Note that Theorem 3 implies that if I ⊆ X can be covered by two balls of radius r whose centers are 10r apart, each covering half of I, and if there is 5% noise, then for any δ ∈ [5r, 8.5r), I is 0-robust, 4r-distance-robust, and δ|X \ I|-cost-robust with respect to RTδ(A) for the 2-medians cost function.

We need the next two lemmas to prove Theorem 3. Lemmas 2 and 3 examine the output of the (k, g)-δ-truncated and (k, g)-centroid algorithms when the un-noised data has a well-clusterable underlying pattern.

Lemma 2. Let A be the (k, g)-centroid algorithm. Assume that I ⊆ X, B1, . . . , Bk, and δ are as defined in Theorem 3. Let A′ = RTδ(A). Then A′(X)|I = {B1, . . . , Bk, ∅}. Furthermore, for all y ∈ I, d(y, µ′(y)) ≤ min{ν/2, g⁻¹(g(ν/2 − 2r) − (ρ2/ρ1) g(r)) + 2r}.

Proof. The conditions of Theorem 3 state that I can be covered by a (ρ1, ρ2)-balanced set of balls B = {B1, . . . , Bk} such that rad(Bi) ≤ r for all i ≤ k, and for all i ≠ j, the centers of Bi and Bj are at least ν > 4r + 2g⁻¹(((ρ1 + ρ2)/ρ1) g(r)) apart. Moreover, δ is assumed to be in the range [ν/2, g⁻¹((|I|/|X \ I|)(ρ1 g(ν/2 − 2r) − (ρ1 + ρ2) g(r)))).

For i ≤ k, let bi denote the center of Bi and let Di denote the ball of radius ν/2 − r centered at bi. Let A′(X) = C′ with centers µ′1, . . . , µ′k that minimize Λ^g_{d′}. Let D1 = {Di | Di does not cover any µ′j} and D2 = {Di | Di covers more than one µ′j}. Since D1, . . . , Dk are pairwise disjoint, |D1| ≥ |D2|.

Assume, in search of a contradiction, that D1 ≠ ∅. For any Di ∈ D1 and all y ∈ Di, d(y, µ′(y)) ≥ ν/2 − 2r. Consider the following set of centers µ′′1, . . . , µ′′k: if Dj includes exactly one center µ′i, then let µ′′j = µ′i; otherwise let µ′′j = bj. Then

$$
\begin{aligned}
\Lambda^g_{X,d'}(\mu''_1,\dots,\mu''_k)
&\le \Lambda^g_{X,d'}(\mu'_1,\dots,\mu'_k)
 - \sum_{D_i \in D_1}\sum_{y \in B_i}\big[g(d'(y,\mu'(y))) - g(d'(y,\mu''(y)))\big]
 + \sum_{D_i \in D_2}\sum_{y \in B_i}\big[g(d'(y,\mu''(y))) - g(d'(y,\mu'(y)))\big]
 + \sum_{y \in X \setminus I} g(d'(y,\mu''(y))) \\
&\le \Lambda^g_{X,d'}(\mu'_1,\dots,\mu'_k)
 + |D_1|\,|I|\,\rho_1\big(g(r) - g(\tfrac{\nu}{2}-2r)\big)
 + |D_2|\,|I|\,\rho_2\, g(r)
 + |X \setminus I|\, g(\delta) \\
&\le \Lambda^g_{X,d'}(\mu'_1,\dots,\mu'_k)
 + |D_1|\,|I|\big((\rho_1+\rho_2)\,g(r) - \rho_1\, g(\tfrac{\nu}{2}-2r)\big)
 + |X \setminus I|\, g(\delta)
 \;<\; \Lambda^g_{X,d'}(\mu'_1,\dots,\mu'_k).
\end{aligned}
$$

This forms a contradiction; so, without loss of generality, let every Di cover a center µ′i. For i ≠ j and for all y ∈ Bi, d(y, µ′i) ≤ ν/2 < d(y, µ′j). Therefore, A′(X)|I = {B1, . . . , Bk, ∅}.

For every C′i ∈ C′, |Bi| min_{y ∈ Bi} g(d(y, µ′i)) ≤ |Bi| g(r) + |C′i \ I| g(δ) (since µ′i minimizes the cost of C′i, which is at most its cost with center bi). Since |Bi| ≥ ρ1|I| and |C′i \ I| ≤ |X \ I|, there exists y ∈ Bi such that

$$ g(d(y, \mu'_i)) \;\le\; g(r) + \frac{|X \setminus I|}{\rho_1 |I|}\, g(\delta)
 \;\le\; g(r) + \frac{\rho_1\, g(\tfrac{\nu}{2} - 2r) - (\rho_1 + \rho_2)\, g(r)}{\rho_1}
 \;=\; g\big(\tfrac{\nu}{2} - 2r\big) - \tfrac{\rho_2}{\rho_1}\, g(r). $$

Hence, for all y ∈ I, d(y, µ′(y)) ≤ min{ν/2, 2r + g⁻¹(g(ν/2 − 2r) − (ρ2/ρ1) g(r))}.

Lemma 3. Let A be the (k, g)-centroid algorithm. For any I, if I can be covered by a (ρ1, ρ2)-balanced set of k balls B, where each ball has radius at most r and the centers of any two different balls are at least ν > 4r + 2g⁻¹(((ρ1 + ρ2)/ρ1) g(r)) apart, then A(I) = B.

Proof. Refer to the supplementary material.

Proof of Theorem 3. Lemma 2 shows that for every y ∈ I, d(y, µ′(y)) ≤ δ. By Lemma 1, I is g(δ)|X \ I|-cost-robust with respect to RTδ(A) for the cost function Λ^g_d. Let B1, . . . , Bk be the mentioned balls that cover I. Lemma 2 and Lemma 3 show that A′(X)|I = {B1, . . . , Bk, ∅} and A(I) = {B1, . . . , Bk}. Therefore, I is 0-robust with respect to RTδ(A). Finally, Lemma 2 shows that for any y ∈ I, d(y, µ′(y)) ≤ min{ν/2, g⁻¹(g(ν/2 − 2r) − (ρ2/ρ1) g(r)) + 2r}, so I is min{ν/2, g⁻¹(g(ν/2 − 2r) − (ρ2/ρ1) g(r)) + 2r}-distance-robust with respect to RTδ(A).

UNDERLYING STRUCTURE OF I & CONVEXITY OF g

In this section, we restrict our attention to convex cost functions and show that for such functions Theorems 2 and 3 can be strengthened. Note that g is convex in most common clustering objectives, e.g., k-medians and k-means. Throughout this section, we implicitly assume that g is also continuous, increasing, and unbounded.

Theorem 4. For any k and convex function g, let A be the (k, g)-centroid algorithm. For all I ⊆ X such that I has a σ-separable, ρ-balanced k-clustering of diameter s, for any δ > σs/2 and any γ ≥ (|X \ I|/|I| + 1)(g(δ) + 2g(s))/g(σs/2) + 2kρ², I is γ-robust with respect to RTδ(A).

We need to develop a few results before proving Theorem 4. The next lemma states an important property of convex functions.

Lemma 4. For any x, y ∈ X, a metric distance function d, a point c, and a convex function g, g(d(x, c)) + g(d(y, c)) ≥ 2g(d(x, y)/2).

Proof. In the following, the first inequality holds by the convexity of g and the second by the fact that g is increasing and d satisfies the triangle inequality:

$$ g(d(x, c)) + g(d(y, c)) \;\ge\; 2g\Big(\frac{d(x, c) + d(y, c)}{2}\Big) \;\ge\; 2g\Big(\frac{d(x, y)}{2}\Big). $$

The next lemma bounds the number of pairs that are clustered differently in two clusterings, based on the distance between the partitions.

Lemma 5 (Ackerman et al., 2013). Let C1 and C2 be two clusterings of Y, where C1 is ρ-balanced and has k clusters. If Δ(C1, C2) ≥ γ, then the number of disjoint pairs {x, y} ⊆ Y such that x ≁_{C1} y and x ∼_{C2} y is at least (1/2)(γ − kρ²)|Y|.

The next two lemmas examine the output of the (k, g)-δ-truncated and (k, g)-centroid algorithms when g is convex and I has a desirable internal structure.

Lemma 6. For any k and convex function g, let A be the (k, g)-centroid algorithm. For all I ⊆ X that have a σ-separable, ρ-balanced k-clustering of diameter s, namely B, and any δ > σs/2, if A′ = RTδ(A), then Δ(B, A′(X)|I) ≤ (|X \ I|/|I| + 1)(g(δ) + g(s))/g(σs/2) + kρ².

Proof. Let A′(X) = C′ with centers µ′1, . . . , µ′k. Let Δ(B, C′|I) = γ and assume, on the contrary, that γ > (|X \ I|/|I| + 1)(g(δ) + g(s))/g(σs/2) + kρ². Using Lemma 4, for any {x, y} ⊆ I such that x ≁_B y but x ∼_{C′} y with common center µ′i, g(d′(x, µ′i)) + g(d′(y, µ′i)) ≥ min{2g(δ), 2g(σs/2)} ≥ 2g(σs/2). Lemma 5 shows that there are at least (1/2)(γ − kρ²)|I| such disjoint pairs. Therefore,

$$ \Lambda^g_{d'}(C') \;\ge\; g\big(\tfrac{\sigma s}{2}\big)\,(\gamma - k\rho^2)\,|I|
 \;>\; \big(|X \setminus I| + |I|\big)\big(g(\delta) + g(s)\big)
 \;>\; g(s)\,|I| + g(\delta)\,|X \setminus I|
 \;\ge\; \Lambda^g_{d'}\big(B \cup \{X \setminus I\}\big), $$

which contradicts the optimality of C′. Therefore, Δ(B, A′(X)|I) ≤ (|X \ I|/|I| + 1)(g(δ) + g(s))/g(σs/2) + kρ².
Lemma 7. For any k and convex function g, let A be the (k, g)-centroid algorithm. Let I have a σ-separable, ρ-balanced k-clustering of diameter s, namely B. Then Δ(A(I), B) ≤ g(s)/g(σs/2) + kρ².

Proof. Let A(I) = C with centers µ1, . . . , µk. Let Δ(B, C) = γ and assume, in search of a contradiction, that γ > g(s)/g(σs/2) + kρ². For any {x, y} ⊆ I such that x ≁_B y but x ∼_C y with common center µi, Lemma 4 gives g(d(x, µi)) + g(d(y, µi)) ≥ 2g(d(x, y)/2) ≥ 2g(σs/2). Lemma 5 shows that there are at least (1/2)(γ − kρ²)|I| such disjoint pairs. Therefore, Λ^g_d(C) ≥ (1/2)(γ − kρ²)|I| · 2g(σs/2) > g(s)|I| ≥ Λ^g_d(B). This forms a contradiction. Therefore, Δ(A(I), B) ≤ g(s)/g(σs/2) + kρ².

Proof of Theorem 4. Let B be the σ-separable, ρ-balanced k-clustering of diameter s that covers I, and let γ′ = (|X \ I|/|I| + 1)(g(δ) + g(s))/g(σs/2) + kρ² and γ′′ = g(s)/g(σs/2) + kρ². Lemmas 6 and 7 respectively show that Δ(B, A′(X)|I) ≤ γ′ and Δ(B, A(I)) ≤ γ′′. Therefore, Δ(A(I), A′(X)|I) ≤ γ′ + γ′′ ≤ γ.

7. Non-robustness of the Simplistic Paradigm

A key component of our δ-Truncated paradigm is the use of a garbage-collecting cluster. In this section, we show that, using the common cost functions with no such "garbage collectors", we cannot achieve similar noise robustness performance. More specifically, Theorems 5 and 6 show that for any desired level of robustness and signal-to-noise ratio, there exists I ⊆ X with the desired signal-to-noise ratio and a well-clusterable underlying pattern that is not robust with respect to the p-Increased paradigm, as long as p < |X \ I|.

Theorem 5. Let A be the k-means algorithm. For any α, r, λ > 0 there exist X and I ⊆ X such that rad(I) ≤ r, I can be covered with k balls of arbitrarily small radii, and X has signal-to-noise ratio |I|/|X \ I| ≥ λ. But, for any p < |X \ I|, I is not α-distance-robust or α²(|I| − |X \ I|)-cost-robust with respect to RIp(A) for the k-means cost function.

Proof. Let d1 = (α + 2r)((λ/(λ + 1))|X| + 1) and d2 = 2(d1 + 2r) + 1. For i ≤ k, let Bi denote a closed ball of radius 0 such that |Bi| ≥ (λ/(k(λ + 1)))|X|. Let B1, . . . , Bk be evenly placed on a line segment of length 2r. For i ≤ |X|/(λ + 1), let oi be a point on the line that connects B1, . . . , Bk, such that d(o1, B1) = d1 and d(oi, oi+1) = d2 (see Figure 1). Let I = ⋃_{i ∈ [k]} Bi and X = I ∪ {o1, . . . , o_{|X|/(λ+1)}}. Note that X and I are chosen such that |I|/|X \ I| ≥ λ.

Figure 1. Structure of an unrobust dataset w.r.t. RIp(A).

Let A′ = RIp(A), A′(X) = C = {C1, . . . , Ck+p}, let µi denote the center of Ci, and let µ(x) denote the closest center to x. Assume (in the hope of finding a contradiction) that for all i ≤ k + p, Ci ⊆ I or Ci ⊆ X \ I. Without loss of generality let C1 ⊆ I; then d(o1, µ1) ≤ d1 + 2r. Moreover, for any Ci ⊆ X \ I such that |Ci| ≥ 2, if oj is the left-most or the right-most point of Ci, then d(oj, µi) ≥ d2/2 > d(o1, µ1). Without loss of generality, assume o1 ∈ C2. There are two cases:

1. C2 = {o1, oj, . . .}: then the cost of the clustering C′ = {C1 ∪ {o1}, C2 \ {o1}, . . . , Cp+k} is lower than the cost of C.
2. C2 = {o1}: let C3 ⊆ X \ I be any cluster of size at least 2, and let oi be its left-most point (such a cluster exists since p < |X \ I|). The cost of the clustering C′ = {C1 ∪ {o1}, {oi}, C3 \ {oi}, . . . , Cp+k} is lower than the cost of C.

Hence, C is not an optimal clustering. So, for any optimal RIp(A) clustering, there exists a cluster Ci such that {oj, y} ⊆ Ci for some oj ∈ X \ I and y ∈ I. Then, d(y, µi) ≥ d1/(|I| + 1) > α + 2r. Hence, I is not α-distance-robust to X \ I with respect to RIp(A). Moreover, because there exists y ∈ I with d(y, µ(y)) > α + 2r, and every point of I is within 2r of y, we have d(y′, µ(y′)) > α for all y′ ∈ I. Since any collection of clusters of A′(X) covering at least |I| points covers at least |I| − |X \ I| points of I, each at distance more than α from its center, while the ground clustering A(I) has cost 0, I is not α²(|I| − |X \ I|)-cost-robust with respect to RIp(A) for the k-means cost function.
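To make the construction above concrete, here is an illustrative one-dimensional instance laid out as in Figure 1; the specific values of k, r, α, and λ below are our own choices, not the paper's.

```python
import numpy as np

# Illustrative 1-D instance in the spirit of the construction in Theorem 5:
# k zero-radius groups on a segment of length 2r, plus outliers o_1, o_2, ...
# placed at distance d1 from the groups and spaced d2 apart.
k, r, alpha, lam = 2, 0.5, 1.0, 9.0
n_noise = 4                                   # |X \ I|
n_per_ball = 18                               # so |I| / |X \ I| = 9 = lam
n_total = k * n_per_ball + n_noise

d1 = (alpha + 2 * r) * (lam / (lam + 1) * n_total + 1)
d2 = 2 * (d1 + 2 * r) + 1

balls = [np.full(n_per_ball, i * 2 * r / (k - 1)) for i in range(k)]
outliers = 2 * r + d1 + d2 * np.arange(n_noise)
X = np.concatenate(balls + [outliers])        # the noisy one-dimensional dataset
```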
Note that Theorem 5 implies that for any (k, g)-centroid algorithm A, any desired level of robustness, and any signal-to-noise ratio, there exists I with arbitrarily small radius and X ⊇ I with that signal-to-noise ratio such that I is not robust with respect to RIp(A), as long as p ≤ |X \ I| − k. The next theorem proves results similar to those of Theorem 5 for (structural) robustness, and contrasts with the results shown in Theorems 3 and 4.

Theorem 6. Let A be the k-means algorithm. For any r, λ > 0, there exist X and I ⊆ X such that rad(I) ≤ r, I can be covered with k balls of arbitrarily small radii, and X has signal-to-noise ratio |I|/|X \ I| ≥ λ. But, for any p < |X \ I|, I is not (1 − 1/k)-robust with respect to RIp(A).

Bounded Space

Corollaries 1 and 2 demonstrate the limitations of p-Increased algorithms even when the diameter of the data is bounded.

Corollary 1. Let A be the k-means algorithm. For any λ, k, and ν < 1/k, there exists I ⊆ X such that X has diameter 1 and signal-to-noise ratio |I|/|X \ I| ≥ λ, and I can be covered by k balls that have arbitrarily small radii and whose centers are at least ν apart. But for any p < |X \ I| and any α ≤ (1 − kν)/(2|X \ I|(|I| + 1)), I is not α-distance-robust, (1 − 1/k)-robust, or (α − νk)²(|I| − |X \ I|)-cost-robust with respect to RIp(A) for the k-means cost function.

Corollary 2. Let A be the k-means algorithm. For any λ > 0, there exists I ⊆ X such that X has diameter 1 and signal-to-noise ratio |I|/|X \ I| ≥ λ, and I has an arbitrarily small radius. But for any p ≤ |X \ I| − k and any α ≤ 1/(2(|I| + 1)|X \ I|), I is not α-distance-robust or α²(|I| − |X \ I|)-cost-robust with respect to RIp(A) for the k-means cost function.

8. Comparing the Two Paradigms

Here, we use an example to further demonstrate the robustness of the δ-Truncated paradigm compared to the limitations of the p-Increased paradigm. Moreover, we use experiments to back up these theoretical results.

Example 1. Let A denote the k-means algorithm. By Corollary 2, there exists I ⊆ X such that X has n points, diameter 1, and 10% noise, and I has radius 0, but for any p < 0.1n − k, I is not 5/n²-distance-robust to X \ I with respect to RIp(A). Since I can be covered by a ball of radius 1/(4n²), by Theorem 2, for δ ∈ [1/n², 3√3/(4n²)), I is 1/n²-distance-robust with respect to RTδ(A).

8.1. Experiments

Datasets: For each k, we use n = 50000 data points on the unit square. 90% of the points come from k Gaussian distributions with centers selected uniformly at random and standard deviation 1/n. Additionally, 10% uniform noise is introduced in the data.

Clustering Algorithms: We run the Lloyd algorithm for (k + p)-means, with p = 1, 2. For δ-k-means, we adapt the Lloyd algorithm to calculate the clustering using a δ-truncated distance matrix in every iteration, where δ is set to 10.

Charts: The following graph shows the average cost of a clustering per clustered point for δ-k-means, (k + 1)-means, and (k + 2)-means. Let (C, Φ) be the δ-k-means clustering, where Φ is the noise cluster. The δ-k-means average cost is Cost(C)/|⋃ C|. For (k + p)-means, we find the minimal cost of a collection of clusters that covers at least |⋃ C| points, and then divide this cost by the number of points that are present in these clusters.

[Figure: average cost of clustered points as a function of the number of clusters k, for n = 50000, variance 1/n², and random centers; curves for δ-k-means, (k + 1)-means, and (k + 2)-means.]

The δ-k-means average cost per point is considerably smaller than that of (k + p)-means. Moreover, this cost is stable throughout the experiments and remains close to the average radius of the clusters. On the other hand, the cost of (k + p)-means fluctuates and is considerably higher than the average radius of the clusters.
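The δ-truncated Lloyd variant used in the experiments can be sketched as follows, assuming Euclidean data; the initialization, number of restarts, and exact value of δ are placeholders, since the text does not fully specify them.

```python
import numpy as np

def delta_kmeans(X, k, delta, n_iter=100, seed=0):
    """Lloyd-style delta-k-means sketch: in every iteration, points whose
    nearest center is at distance >= delta go to a noise bin (label k) and
    are ignored when the centers are recomputed."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # (n, k)
        labels = d.argmin(axis=1)
        labels[d.min(axis=1) >= delta] = k                                # noise bin
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Under this reading, the average cost per clustered point reported for δ-k-means is the k-means cost of the points with labels below k, divided by their number.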
9. Concluding Remarks

In this paper, we consider the problem of robustness of clustering algorithms to the addition of unstructured data points (which we termed "noise"). We propose to augment any given center-based clustering algorithm with an efficiently implementable noise-robustifying mechanism that creates an additional cluster, used as a garbage-collecting bucket. We introduce rigorous robustness notions that capture different aspects of robustness that may be desirable for such algorithms. We prove that our algorithmic paradigm indeed guarantees desirable noise robustness, and show that the simple strategy of just applying the underlying clustering algorithm with extra clusters (to accommodate such noisy data) cannot enjoy similar performance.

Acknowledgments

This research was partially supported by a Google research award.

References

Ackerman, Margareta and Ben-David, Shai. Clusterability: A theoretical study. In Dyk, David A. Van and Welling, Max (eds.), AISTATS, volume 5 of JMLR Proceedings, pp. 1–8. JMLR.org, 2009.

Ackerman, Margareta, Ben-David, Shai, Loker, David, and Sabato, Sivan. Clustering oligarchies. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, 2013.

Ailon, Nir, Jaiswal, Ragesh, and Monteleoni, Claire. Streaming k-means approximation. In Advances in Neural Information Processing Systems, pp. 10–18, 2009.

Balcan, Maria-Florina, Blum, Avrim, and Gupta, Anupam. Approximate clustering without the approximation. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1068–1077. Society for Industrial and Applied Mathematics, 2009.

Ben-David, Shai, Von Luxburg, Ulrike, and Pál, Dávid. A sober look at clustering stability. In Learning Theory, pp. 5–19. Springer, 2006.

Bilu, Yonatan and Linial, Nathan. Are stable instances easy? Combinatorics, Probability & Computing, 21(5):643–660, 2012.

Cuesta-Albertos, J. A., Gordaliza, Alfonso, and Matrán, C. Trimmed k-means: An attempt to robustify quantizers. The Annals of Statistics, 25(2):553–576, 1997.

Dave, Rajesh N. Characterization and detection of noise in clustering. Pattern Recognition Letters, 12(11):657–664, 1991.

Dave, Rajesh N. Robust fuzzy clustering algorithms. In Proceedings of the 2nd IEEE International Conference on Fuzzy Systems, pp. 1281–1286, 1993.

Donoho, David L. Breakdown properties of multivariate location estimators. Technical report, Harvard University, 1982.

Ester, Martin, Kriegel, Hans-Peter, Sander, Jörg, and Xu, Xiaowei. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of Knowledge Discovery in Databases, volume 96, pp. 226–231, 1996.

García-Escudero, Luis Angel and Gordaliza, Alfonso. Robustness properties of k means and trimmed k means. Journal of the American Statistical Association, 94(447):956–969, 1999.

García-Escudero, Luis Angel, Gordaliza, Alfonso, Matrán, Carlos, and Mayo-Iscar, Agustín. A general trimming approach to robust cluster analysis. The Annals of Statistics, pp. 1324–1345, 2008.

Hampel, Frank R. A general qualitative definition of robustness. The Annals of Mathematical Statistics, 42(6):1887–1896, 1971.
Hennig, Christian. Dissolution point and isolation robustness: Robustness criteria for general cluster analysis methods. Journal of Multivariate Analysis, 99(6):1154–1176, 2008.

Rand, William M. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.