# Clustering Redemption: Beyond the Impossibility of Kleinberg's Axioms

Vincent Cohen-Addad, Sorbonne Universités, UPMC Univ Paris 06, CNRS, LIP6. vincent.cohen-addad@lip6.fr
Varun Kanade, University of Oxford. varunk@cs.ox.ac.uk
Frederik Mallmann-Trenn, MIT. mallmann@mit.edu

Funding: This project benefited from support from the French state, administered by the Agence Nationale de la Recherche, under the Programme FOCAL, grant reference ANR-18-CE40-0004-01. Varun Kanade was supported in part by the Alan Turing Institute through EPSRC grant EP/N510129/1. Frederik Mallmann-Trenn was supported in part by NSF Award Numbers CCF-1461559, CCF-0939370, and CCF-1810758.

Abstract. Kleinberg [20] stated three axioms that any clustering procedure should satisfy and showed that there is no clustering procedure that simultaneously satisfies all three. One of these, called the consistency axiom, requires that when the data is modified in a helpful way, i.e. if points in the same cluster are made more similar and those in different ones made less similar, the algorithm should output the same clustering. To circumvent this impossibility result, research has focused on considering clustering procedures that have a clustering quality measure (or a cost) and showing that a modification of Kleinberg's axioms that takes cost into account leads to feasible clustering procedures. In this work, we take a different approach, based on the observation that the consistency axiom fails to be satisfied when the correct number of clusters changes. We modify this axiom by making use of cost functions to determine the correct number of clusters, and require that consistency holds only if the number of clusters remains unchanged. We show that single linkage satisfies the modified axioms, and that if the input is well-clusterable, some popular procedures such as k-means also satisfy the axioms, taking a step towards explaining the success of these objective functions for guiding the design of algorithms.

1 Introduction

In a highly influential paper, Kleinberg [20] showed that clustering is impossible in the following sense: there exists no clustering function, i.e. a function that takes a point-set and a pairwise dis-similarity function defined on it as input and outputs a partition of the point-set, that simultaneously fulfills three simple and reasonable axioms: scale invariance, richness and consistency. (We use dis-similarity rather than distance since, for the most part, we do not require the point-set and the associated dis-similarity function to form a metric space.) Scale invariance requires that scaling all the dis-similarities by the same positive number should not change the output partition. Richness requires that for any partition of the point-set, there should be a way to define pairwise dis-similarities such that the clustering function will produce said partition as output. Finally, consistency requires the following: if a clustering function outputs a certain partition of a point-set, given a certain dis-similarity function, then applying this clustering function to a transformed dis-similarity function that makes points within the same part more similar and points in different parts less similar should yield the same partition.

While seemingly very natural in the context of clustering, the last of these axioms, consistency, is somewhat questionable, as has been discussed by researchers over the years (see e.g. [30] and references therein).
Consider a dataset with a natural clustering consisting of k parts. Kleinberg's consistency axiom allows a transformation of the dis-similarity function by which one cluster may be subdivided into two subclusters, such that points in the same subcluster are very similar to each other, but sufficiently dis-similar to points from the other subcluster. The transformed instance may require a different partition: for a good clustering with k parts, it may be more suitable to define one cluster for each of the two new subclusters and rearrange the partition of the remaining points. Alternatively, one may ask what the right number of clusters in the new instance is. Since the original instance had k clusters and one of the clusters got subdivided into two subclusters, it may be more natural to ask for a clustering into k + 1 clusters for this new instance. Unfortunately, this is not allowed by the consistency axiom: the clustering should remain the same. This scenario can indeed be formalized, as shown in Section 4, even in the case where the original clustering into k parts is very well-clusterable. We are not the first to notice the problem with the consistency axiom as defined by Kleinberg, see e.g. [23, 2, 14]. This impossibility result has been contrasted by a large body of research arguing that relaxing the axioms by restating them with respect to cost functions (clustering quality measures) resolves the inconsistency [14]. For example, in the influential blog post [30], it is observed that the outcome of such a transformation can change the natural number of clusters.

Perhaps one of the main issues with Kleinberg's axioms is that they fail to explain why some of the classic clustering objectives, such as the k-means objective function (see Definition 1.1), give rise to very popular algorithms such as k-means++ and Lloyd's algorithm that are very successful in practice (note that k-means++ and Lloyd's algorithm aim at minimizing the k-means objective; each step improves the quality of the solution w.r.t. the k-means objective). This suggests that the impossibility result arises from instances that are unrealistic and contrived. A way to overcome this impossibility result is to look beyond the worst-case scenario. Motivated by the thesis that clustering is difficult only when it does not matter (see e.g. [19, 13, 17]), one can hope that classic objectives such as k-means would satisfy the axioms when the input is well-clusterable. Unfortunately, we show that k-means fails to satisfy Kleinberg's consistency axiom even when we restrict attention to very well-clusterable inputs, in fact even for types of instances for which the k-means++ algorithm has been proven to be efficient [21], and for which one may therefore expect k-means to be the right objective function to optimize. (See Section 4 for a formal statement of this.)

1.1 Our contributions

Our work aims at bridging the gap between real-world clustering scenarios and an axiomatic approach to understanding the theoretical foundations of clustering. We see the problem of clustering as a two-step procedure:

1. Determine the natural number k of clusters in the dataset;
2. Find the best clustering with k clusters.

The question of choosing the correct number of clusters is a very relevant one in practice because several of the commonly used clustering algorithms take the number of clusters k as a parameter (cf. [1, 26, 27, 18]) and would yield nonsensical clusters if k were not carefully chosen. Despite this, theoretical work on choosing the number of clusters is quite limited compared to the vast theoretical work analyzing various clustering algorithms. An approach employed quite often is the so-called elbow method, which itself can be defined in different ways. A natural definition is as follows: consider an objective function (a measure of quality) for clustering into k parts, and define OPT_k to be the minimum value of this objective function over all clusterings into k parts. According to the elbow method, the natural number of clusters is the value k* that maximizes the ratio OPT_{k-1}/OPT_k over k ∈ {2, ..., n-1}, where n is the number of data points; k = 1 and k = n are explicitly ruled out as they would lead to trivial clusterings. The intuition behind this approach is that the maximum gain in information is obtained precisely when moving from k*-1 to k* groups; there is diminishing information gain when allowing more clusters beyond k*. This approach is widely used in practice, e.g. [27], and has led to interesting theoretical models of real-world inputs. As an example, Ostrovsky et al. [24] define a real-world input with k clusters as an instance for which the k-means objective satisfies OPT_{k-1}/OPT_k > 1 + ε for a sufficiently large ε. In turn, such data models have been used in theoretical work to better understand the success of algorithms such as k-means++ [21, 10].
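To make the elbow rule just described concrete, here is a minimal sketch in Python. It is not part of the paper: it approximates OPT_k by the k-means cost (inertia) returned by scikit-learn's Lloyd-type solver, whereas the definition above uses the exact optimum, and the helper name elbow_k and the parameter k_max are ours. It assumes k_max is smaller than the number of distinct points, so that every cost is positive.

```python
from sklearn.cluster import KMeans

def elbow_k(points, k_max):
    """Return the k in {2, ..., k_max} maximizing OPT_{k-1} / OPT_k,
    with OPT_k approximated by the k-means cost (inertia) found by Lloyd's algorithm."""
    cost = {}
    for k in range(1, k_max + 1):
        cost[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points).inertia_
    # elbow rule: largest relative drop in cost when moving from k-1 to k clusters
    return max(range(2, k_max + 1), key=lambda k: cost[k - 1] / cost[k])
```

The intent is that, for well-separated data, this ratio peaks at the natural number of clusters, which is the intuition behind the data model of Ostrovsky et al. [24].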
Taking inspiration from this approach, we return to Kleinberg's axioms and amend the consistency axiom to take into account the potential change in the optimal number of clusters. More precisely, we require that consistency (i.e. preservation of the partition of the input point-set) holds for the transformed instance only if the correct number of clusters in the new instance is the same as that in the original instance. [Figure: an original instance and a perturbed instance.] In order to meaningfully define the correct number of clusters, we need to include a cost function, from partitions to the positive reals, as part of the input. We define the correct number of clusters using the elbow method. We show that the new set of axioms is no longer inconsistent, and that some clustering algorithms, such as single linkage, in fact satisfy these axioms. While in the worst case clustering algorithms based on the classic center-based clustering objectives such as k-means, k-median and k-center do not satisfy the axioms, we show that for stable clustering instances these objective functions do satisfy the axioms. We show that the notion of stable instance captures some interesting scenarios (see the full version). Thus our axioms arguably model the process of clustering inputs that are relevant in practice, taking a step towards explaining the success of some popular objective functions.

Stable clustering instances. We define well-clusterable, or stable, clustering instances using the stability notion introduced by Bilu and Linial [16] in the context of center-based clustering. This notion was later considered in the context of clustering in several other works (see e.g. [9, 13, 15, 12]), and various (provable) algorithms have been designed for solving these types of instances.
We consider the α-center proximity condition introduced by Awasthi et al. [11], which requires that the optimal clustering satisfies the following: given a point in the i-th optimal cluster, α times its distance to the center of cluster i remains smaller than its distance to the centers of the other clusters. This notion generalizes the notion of stability, as shown by [11]. In the full version we show that this condition arises, for large ranges of parameters (for which our proofs hold), in models such as the stochastic block model and mixtures of Gaussians, for which these clustering approaches are used in practice. Our results on stable instances require that the cluster sizes are approximately equal; we observe that when using k-means in the context of, e.g., Gaussian mixture models, roughly balanced clusters and a separation of the centers ensure that minimizing the k-means objective is roughly the same as finding maximum-likelihood estimators for the centers.

1.2 Related Work

Ben-David and Ackerman [14] were, in a prominent work, the first to challenge Kleinberg's impossibility result. The authors focus on clustering quality measures (CQMs), or cost functions, that assign a value to each clustering. They interpret Kleinberg's axioms in terms of these quality measures and show that this interpretation of the axioms is consistent. This is similar to our approach, but their definition of the consistency axiom differs: their notion of consistency asks for the value of a clustering not to increase after a perturbation of the input that brings points in the same cluster closer and pulls points in different clusters apart. As mentioned in the introduction, our consistency axiom requires that the optimal clustering remains the same after the same type of perturbation, if the optimal number of clusters remains the same. We believe that this is a stronger requirement and that it is an important one: when using a cost function for evaluating a k-clustering, we hope that the minimizer of the cost function (namely the optimal solution) is the underlying natural clustering. Hence, after a perturbation that does not increase distances between points of the same cluster and does not decrease distances between points in different clusters, we hope that the minimizer remains the same if the natural number of clusters remains k. The axiom proposed by Ben-David and Ackerman [14] does not force an optimal solution to remain optimal under the perturbation. Ackerman [2] and Ackerman et al. [4] also contribute a large set of axioms and properties that are suitable for clustering objective functions. In this paper we focus on the three original axioms introduced by Kleinberg. Thus, our approach aims at complementing their study of the axioms by replacing their consistency axiom with a stronger one. Our approach also differs slightly in the following sense: we aim at defining reasonable axioms that explain why popular objective functions, such as k-means, are good ones (we refer the reader to [5, 6, 3] for further advantages of k-means and similar methods). van Laarhoven and Marchiori [28] continue this line of research on quality measures and show that adding reasonable axioms leads to a set of axioms that is not fulfilled by modularity, a fairly popular CQM. Puzicha et al. [25] explore properties of clustering objective functions in the setting where the number of clusters k is fixed.
They propose a few natural axioms for clustering objective functions, and then focus on objective functions that arise by requiring the function to decompose into an additive form. Meilă [23] views clusterings as nodes of a lattice: there is an edge between two clusterings C and C' if C' can be obtained by splitting a cluster of C into two parts. The author gives axioms for comparing clusterings and shows that those axioms are inconsistent. Ackerman et al. [5] consider clustering in the weighted setting, where every point is assigned a real-valued weight, and analyze the influence of weighted data on standard clustering algorithms. Ackerman et al. [6] analyze the robustness of popular clustering methods to the addition of points. See Ackerman [2] for a thorough review of research on clustering properties. There has also been work focused on the single-linkage clustering algorithm and its characterization using a specific set of axioms including Kleinberg's axioms [31]. This has later been extended to more general families of linkage-based algorithms [3].

Organization of the paper: Section 1.3 introduces basic notions and notation. Section 2 describes and discusses our new axioms. Section 3 shows that single linkage satisfies all of them, even in the worst-case scenario, while k-means and k-median satisfy the axioms when we restrict our attention to well-clusterable instances. Section 4 shows various impossibility results: k-means does not satisfy Kleinberg's axioms even for well-clusterable instances, and in the worst case neither k-means nor k-median satisfies all of our refined axioms. The proofs can be found in the full version.

1.3 Preliminaries

Let [n] denote the set {1, ..., n}. An input to a clustering procedure is ([n], d), where [n] is the point-set and d : [n] × [n] → R_+ gives the pairwise distances between points in [n] (we assume d is always symmetric). We do not require ([n], d) to be a metric space, though all of our results continue to hold if this requirement is added. We denote by Π_{[n]} the set of all possible partitions of the set [n]; Π*_{[n]} denotes the set of non-trivial partitions of [n], i.e. excluding the partition consisting of exactly one part and the partition consisting of exactly n parts. For a partition P ∈ Π*_{[n]}, we denote by |P| the number of parts. We use OPT (respectively OPT^o) to denote the cost of the optimal solution under the perturbed metric (respectively the original metric).

Definition 1.1 (k-means). Let ([n], d) be a metric space and k a non-negative integer. The k-means problem asks for a subset S of [n], of cardinality at most k, that minimizes cost(S) = Σ_{x ∈ [n]} min_{c ∈ S} d(x, c)^2. In the k-median problem the distances are not squared, while in the k-center problem the sum is replaced by a maximum.

In the following, we will sometimes refer to points of [n] as clients. The clustering of [n] induced by S = {c_1, ..., c_k} is the partition of [n] into subsets C = {C_1, ..., C_k} such that C_i = {x ∈ [n] | c_i = argmin_{c ∈ S} d(x, c)} (breaking ties arbitrarily). Similarly, given a partition of [n] into k parts C = {C_1, ..., C_k}, we define the centers induced by C as the set {centroid(C_i) | C_i ∈ C}, where we slightly abuse notation by defining the centroid of a set of points X ⊆ [n] as the point y of X that minimizes Σ_{x ∈ X} d(y, x)^2 (a.k.a. the medoid). It is a well-known fact that cost(C) is minimized by the centers induced by C. Hence, we will refer to a solution to the k-means problem either by a partition of the points into k parts or by a set of k centers.
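The following short Python sketch (ours, not the paper's) spells out Definition 1.1 and the induced objects for a dissimilarity given as an n × n matrix; the function names are ours.

```python
import numpy as np

def kmeans_cost(d, S):
    """cost(S) = sum_{x in [n]} min_{c in S} d(x, c)^2, as in Definition 1.1.
    d is an n x n symmetric dissimilarity matrix, S a collection of center indices."""
    n = d.shape[0]
    return sum(min(d[x, c] for c in S) ** 2 for x in range(n))

def induced_clustering(d, S):
    """Partition of [n] induced by the centers S: each point goes to its closest center."""
    parts = {c: [] for c in S}
    for x in range(d.shape[0]):
        parts[min(S, key=lambda c: d[x, c])].append(x)  # ties broken by the order of S
    return list(parts.values())

def medoid(d, X):
    """'Centroid' of X in the discrete sense used above: the point of X
    minimizing the sum of squared dissimilarities to X (a.k.a. the medoid)."""
    return min(X, key=lambda y: sum(d[y, x] ** 2 for x in X))
```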
2 An Axiomatic Result

Kleinberg [20] introduced an axiomatic framework for clustering. Following Kleinberg, we define a clustering procedure to be a function f that takes a pair ([n], d) of a point-set and an associated distance function, and outputs a partition P of [n]. This definition is purely combinatorial, and in what follows we will modify it slightly to view clustering as an optimization procedure. Kleinberg [20] requires that any clustering procedure satisfy the following three axioms.

Axiom 2.1 (Scale Invariance). For any input ([n], d) and any α > 0, we have f(([n], d)) = f(([n], α·d)), where α·d denotes an α-scaling of the distance function d.

Axiom 2.2 (Richness). For any P ∈ Π_{[n]}, there exists a d_P : [n] × [n] → R_+ such that f(([n], d_P)) = P.

The third of Kleinberg's axioms requires the notion of a P-consistent transformation. For a partition P ∈ Π_{[n]}, a transformation d' of d is P-consistent if d'(x, y) ≤ d(x, y) whenever x, y are in the same part of P, and d'(x, y) ≥ d(x, y) whenever x, y are in different parts of P.

Axiom 2.3 (Consistency). If f(([n], d)) = P and d' is a P-consistent transformation of d, then f(([n], d')) = P.

It is this last axiom that is an unnecessary restriction on clustering procedures. As discussed in the introduction, this restriction comes from the fact that the axiom forces the number of clusters to remain the same, even after the perturbation of the input; yet the natural number of clusters may well have changed as a result of the distance transformation. In general, choosing the correct number of clusters is a fairly non-trivial problem. In order to do so, we assume that there is a cost function assigning a cost to every partition P ∈ Π_{[n]}. To avoid trivial cases, we will only allow a clustering algorithm to output a non-trivial partition P ∈ Π*_{[n]}. Let Γ : Π*_{[n]} → R_+ be a cost function. For any k ∈ {2, ..., n-1}, define

OPT^Γ_k := min_{P ∈ Π_{[n]} : |P| = k} Γ(P).

For example, in the so-called k-median clustering objective, k data points are chosen to be centers and each point is assigned to its closest center (with arbitrary tie-breaking) to arrive at a partition; the cost is then simply given by adding up the distance of each data point to its closest center.

We now present our refined consistency axiom. We consider a clustering procedure as a procedure that takes as input ([n], d) as well as a cost function Γ : Π*_{[n]} → R_+. The clustering procedure chooses the number of parts k* by picking the k that maximizes the ratio OPT^Γ_{k-1}/OPT^Γ_k, and then outputs a partition P consisting of k* parts that achieves the value OPT^Γ_{k*}. We refer to such a procedure as a clustering procedure with cost function Γ, and we use k*(([n], d), Γ) to denote the chosen value of k and f(([n], d), Γ) to denote the partition output by the clustering procedure f using the cost function Γ.

Axiom 2.4 (Refined Consistency). If f is a clustering procedure with cost function Γ and f(([n], d), Γ) = P, then for any P-consistent transformation d' of d, either k*(([n], d), Γ) ≠ k*(([n], d'), Γ) or f(([n], d), Γ) = f(([n], d'), Γ).

What the above axiom states is that if a P-consistent transformation changes the data in a way that changes the natural cluster structure, then the procedure may output a different partition as the proposed clustering, as long as the number of clusters has changed. However, if, as per the objective function Γ, the optimal number of clusters has not changed, then the same partition P should be returned after a P-consistent transformation. We refer to a clustering procedure using a cost function Γ that satisfies Axioms 2.1, 2.2 and 2.4 as admissible.
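For small inputs, the clustering procedure with cost function Γ described above can be made fully explicit. The brute-force sketch below is ours, exponential in n and for illustration only: it instantiates Γ with the k-means cost of Definition 1.1, computes OPT^Γ_k exactly over all center sets, chooses k* by the elbow ratio, and returns the induced partition. Checking Axiom 2.4 on a concrete P-consistent perturbation d' then amounts to comparing the two outputs.

```python
from itertools import combinations

def kmeans_opt(d, k):
    """Exact OPT_k for the k-means cost of Definition 1.1: best (cost, centers)
    over all k-element subsets of [n].  d is an n x n numpy array; small n only."""
    n = d.shape[0]
    return min(((sum(min(d[x, c] for c in S) ** 2 for x in range(n)), S)
                for S in combinations(range(n), k)), key=lambda t: t[0])

def cluster_with_cost(d):
    """Clustering procedure with cost function Gamma = k-means cost:
    pick k* maximizing OPT_{k-1}/OPT_k over {2, ..., n-1} and output an optimal
    partition with k* parts (assumes d(x, y) > 0 for x != y, so no cost is zero)."""
    n = d.shape[0]
    opt = {k: kmeans_opt(d, k) for k in range(1, n)}
    k_star = max(range(2, n), key=lambda k: opt[k - 1][0] / opt[k][0])
    centers = opt[k_star][1]
    parts = {c: set() for c in centers}
    for x in range(n):
        parts[min(centers, key=lambda c: d[x, c])].add(x)
    return k_star, [frozenset(p) for p in parts.values()]

def refined_consistency_holds(d, d_prime):
    """Axiom 2.4 for one P-consistent transformation d' of d: either the chosen
    number of clusters changes, or the output partition stays the same."""
    k1, P1 = cluster_with_cost(d)
    k2, P2 = cluster_with_cost(d_prime)
    return k1 != k2 or set(P1) == set(P2)
```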
Section 3 establishes that, unlike in Kleinberg's setting, which asks for clustering procedures satisfying Axioms 2.1, 2.2 and 2.3, we obtain a possibility theorem. Several cost functions commonly used in practice have the effect of encouraging increasingly finer partitions. As a result, the number of parts, e.g. k in k-means, has to be fixed to avoid obtaining the trivial partition where each point is placed in its own cluster. On the other hand, one may also imagine cost functions that encourage fewer clusters, e.g. if there is a cost for opening a new center, as in facility-location problems. Based on these observations, it is possible to demand a stronger consistency axiom than the one stated in Axiom 2.4. If P = f(([n], d), Γ) and P' = f(([n], d'), Γ), one may demand that if k*(([n], d), Γ) < k*(([n], d'), Γ), then P' is a refinement of P; likewise, if k*(([n], d), Γ) > k*(([n], d'), Γ), one may demand that P' is a coarsening of P. The former should be expected for cost functions encouraging finer partitions and the latter for cost functions encouraging fewer parts. Single linkage does have the property that a P-consistent transformation can never decrease k*, and the resulting modified partition P' is a refinement of P; however, we leave the formal analysis of this claim to the full version of this extended abstract.

3 Admissible Clustering Functions

3.1 Admissibility of Single Linkage

Single linkage is most often defined procedurally rather than as an optimization problem. It is also commonly used as an algorithm for hierarchical clustering; however, it may equally well be viewed as a partition-based clustering procedure. Formally, for a given k, the optimization problem that results in the single-linkage algorithm is the following: find the minimum-weight spanning forest with exactly k connected components (trees). As in any clustering procedure, the parameter k is an input to the algorithm. In order to choose the value k*, we pick the value of k that maximizes the ratio OPT_{k-1}/OPT_k for k ∈ {2, ..., n-1}. Note that this method of choosing k* allows neither k* = n nor k* = 1.

Proposition 3.1. Single-linkage clustering is admissible.

Proving scale invariance and richness is trivial. In order to prove refined consistency, we show that the optimal forest with k parts in the modified metric d' cannot use any edges that go between the trees of the forest obtained with d. We refer the reader to the appendix for the full proof.

Remark 3.2. In fact, a stronger claim can be made: if k* changes, the new partition output by single linkage on ([n], d') is a refinement of the partition output on ([n], d).

3.2 Admissibility of k-Means

We now turn to a more formal definition of our well-clusterable instances.

Definition 3.3 (Center proximity [11]). We say that a metric space ([n], d) satisfies the α-center proximity condition if the centers {c_1, ..., c_k} induced by the optimal clustering {C_1, ..., C_k} of ([n], d) with respect to the k-means cost satisfy the following: for all i ≠ j and p ∈ C_i, we have d(p, c_j) ≥ α · d(p, c_i). We further say that an instance is δ-balanced if for all i, j, |C_i| ≤ (1 + δ)|C_j|.

Theorem 3.4. For any α > 5.3 and δ ≤ 1/2, and for any δ-balanced instance satisfying α-center proximity, the k-means objective is an admissible cost function. Moreover, there exists a constant c such that, for any δ ≤ 1/2 and any δ-balanced instance satisfying α-center proximity with α ≥ c, the k-median objective is an admissible cost function.
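As a companion to Definition 3.3, the following sketch (ours, with hypothetical helper names) computes, for a given clustering with designated centers, the largest α for which α-center proximity holds and the smallest δ for which the instance is δ-balanced; in Theorem 3.4 these would be evaluated on the optimal k-means clustering.

```python
def center_proximity_alpha(d, clusters, centers):
    """Largest alpha with d(p, c_j) >= alpha * d(p, c_i) for every i != j and
    every p in C_i (Definition 3.3).  clusters is a list of index lists,
    centers[i] is the center of clusters[i], d is an n x n dissimilarity matrix."""
    alpha = float("inf")
    for i, C in enumerate(clusters):
        for p in C:
            if p == centers[i]:
                continue  # d(p, c_i) = 0: the constraint holds for every alpha
            for j in range(len(clusters)):
                if j != i:
                    alpha = min(alpha, d[p, centers[j]] / d[p, centers[i]])
    return alpha

def balance_delta(clusters):
    """Smallest delta such that |C_i| <= (1 + delta) * |C_j| for all i, j."""
    sizes = [len(C) for C in clusters]
    return max(sizes) / min(sizes) - 1
```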
Proof. For simplicity, we assume that α = 6 and δ ≤ 1/2; the general case is similar (the larger α is, the larger δ can be). We only show the proof for k-means; the proof for k-median is analogous. The proofs of all claims can be found in the appendix. It is easy to see that the k-means objective function satisfies Axioms 2.1 and 2.2 (see also [2]). Hence, we only need to show that the k-means objective satisfies Axiom 2.4. We will make use of the following lemma, mainly due to [12] and [11].

Lemma 3.5 ([12]). For any points p ∈ C_i and q ∈ C_j (j ≠ i) in the optimal clustering of an α-center proximity instance, we have d(c_i, q) ≥ α(α-1) d(c_i, p)/(α+1) and d(p, q) ≥ (α-1) max{d(p, c_i), d(q, c_j)}.

We complement this lemma with the following observation:

Claim 3.6. Given p, q' ∈ C_i and q ∈ C_j, we have that

d(c_i, q') ≤ ((α+1)/(α-1)^2) d(p, q)   (1)

and

d(p, q') ≤ (2α/(α-1)^2) d(p, q).   (2)

Consider an adversarial perturbation of the instance as prescribed by Axiom 2.4, namely a C-consistent transformation of d, where C is the optimal k-means clustering of the original instance. Assume towards a contradiction that the optimal k-means clustering for the perturbed instance, Γ = {Γ_1, ..., Γ_k} with centers γ* = {γ*_1, ..., γ*_k}, differs from the optimal k-means solution C = {C_1, ..., C_k} for the original instance. We claim that, assuming α > 2 + √3, at least one of the clusters of C must contain no center of γ*_1, ..., γ*_k. Indeed, if for each C_i there exists a γ*_j that lies in C_i, then the optimal clustering remains {C_1, ..., C_k} and so Γ = C. This follows from Claim 3.6: after the perturbation, each point of C_i remains closer to points of C_i than to any other point. Therefore, if each C_j contains a center of γ*, the optimal partition of the points remains C. Thus, we may assume that there is at least one cluster of C that contains no center of γ*_1, ..., γ*_k.

In the following we aim at bounding OPT_k, OPT_{k+1} and OPT_{k-1}, which are the costs of the optimal solutions using k, k+1 and k-1 centers in the perturbed instance. We consider the clusters of C that contain no center of the solution induced by Γ. We also consider the centers {γ_1, ..., γ_t} ⊆ γ* induced by Γ that are located in a cluster C_i that also contains another center of γ*. Given a clustering C' with centers c'_1, c'_2, ..., c'_k, we say that a client p is served by c'_i if d(p, c'_i) ≤ d(p, c'_j) for all j ≥ i and d(p, c'_i) < d(p, c'_j) for all j < i. For each C_i that contains at least two centers of γ*, let A_i denote the set of clients served by the centers of γ* located in C_i. We show:

Claim 3.7. There exists a cluster C_i containing at least two centers of γ* such that |A_i| ≤ ((α+1)^2 / ((α-1)^2 α (α-2))) |C_i|.

In the rest of the proof, we further analyze the structure of a cluster C_i satisfying Claim 3.7. Let Δ_i = max_{x ∈ A_i} min_{γ_j ∈ γ*} d(x, γ_j)^2.

Claim 3.8. We have that OPT_{k+1} ≤ OPT_k · ( ((α+1)/(α-1)^2)^2 + ((α+1)^2 (2α+1)) / ((α-1)^4 α (α-2)) ).

Claim 3.9. We have that OPT_{k-1} ≤ OPT_k (1+δ) ( (1 + 1/(α-1))^2 + 2/(α-2) ).

Claim 3.10. We have that OPT_k / OPT_{k+1} > OPT_{k-1} / OPT_k.

Claim 3.10 shows that if the perturbation creates a clustering Γ different from C, then the natural value of k has changed (namely, k no longer maximizes the ratio OPT_{k-1}/OPT_k over all values of k). Hence, the axiom is satisfied. It is easy to see that for larger δ, a larger value of α allows the proof to go through.

4 Inadmissibility

In this section we prove two theorems showing the inadmissibility of ubiquitous clustering objectives.

Theorem 4.1. k-means, k-median and k-center are not admissible w.r.t. our axioms.
The following theorem shows that k-means and k-median remain inadmissible w.r.t. Kleinberg's axioms even if c-center proximity is satisfied, for an arbitrarily large constant c. This is in contrast to Theorem 3.4, which shows that k-means is admissible w.r.t. our axioms if 6-center proximity is satisfied. Given that k-means is of great importance in real-world settings, we believe that this is further evidence that our axioms are more suitable.

Theorem 4.2. k-means and k-median are not admissible w.r.t. Kleinberg's axioms even when c-center proximity is satisfied, for any constant c.

4.1 Proof of Theorem 4.1

[Figure 1 contains a distance table d(·,·) over the points u_1, u_2, u_3, u_4, u ∈ L and u ∈ R; the table is not reproduced here.]

Figure 1: Original instance and perturbed instance. Let V be the set of points, with |V| = n. Assume n is even, γ = 1.5 and ε = 1/10. Let L and R be two sets of size (n-4)/2 each. The perturbed instance is obtained by using the values given in brackets in the table. Missing entries are given by symmetry, and d(u, v) = 0 for v = u.

To see that k-center, k-median and k-means are not admissible, we construct a distance function d for which k* = 2 with a unique optimal clustering C. The instance is given in Figure 1, and we refer to Figure 2 for an illustration. Note that the distance function fulfills the triangle inequality, although this is not required. The main idea behind the construction is that u_2 is, in the original instance, assigned to the cluster center u_1. In the perturbed instance, after decreasing the distance between u_3 and the nodes of R, u_3 becomes the new center of the right cluster. As a consequence, the node u_2 is now closer to that cluster than to the other cluster. Hence, the clustering changes. It remains to show that the optimal number k* of clusters remains 2 in the perturbed instance. Recall that, by definition, we exclude the cases k = 1 and k = n. We need to check that OPT_1/OPT_2 > OPT_k/OPT_{k+1} for all k ∈ {2, 3, ..., n-1}.

Figure 2: An illustration of the instance given in Figure 1, showing that k-center, k-means and k-median are not admissible w.r.t. our axioms. The distance from u_1 to u_2 is γ - ε. The distance from u_1 to all of the nodes of L is 1. The distance from a node of L to all other nodes of L is 2, etc. The perturbed instance is obtained by decreasing the distance between u_3 and the nodes of R; all other distances remain unchanged. After decreasing those distances, the center of the right cluster shifts to u_3, causing u_2 to switch clusters and hence yielding a different clustering. The red circles denote the optimal clusterings, with the centers marked red.

k-center. Note that the optimal solution for k = 1, in both the original and the perturbed instance, is to open a center at u_2. We get that OPT^o_1 = OPT_1 = 1 + 2γ + 4ε. For k = 2, the optimal solution in the original instance (perturbed instance, respectively) consists of opening centers at u_1 and u_4 (u_1 and u_3, respectively). This results in a cost of OPT^o_2 = γ (OPT_2 = γ - 2ε, respectively). Furthermore, note that for any k < n we have OPT_k, OPT^o_k ≥ 1. As a result, OPT_k/OPT_{k+1} ≤ OPT_2/OPT_{k+1} ≤ γ ≤ 1.5 < OPT_1/OPT_2 for k ≥ 2; the same bounds hold under the original metric.

k-median and k-means. Consider k-median. Note that OPT_1 is at least (n-4)(2γ-4ε) in the perturbed instance. Furthermore, OPT_2 in the perturbed instance satisfies OPT_2 ≤ |L| · 1 + |R| · (γ-2ε) + O(1). Hence OPT_1/OPT_2 ≥ 2. We can easily verify that OPT_k/OPT_{k+1} ≤ 1.5 < OPT_1/OPT_2 for all k ≥ 2. The argument for k-means is along the same lines.
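The counterexamples in this section are built from hand-crafted P-consistent perturbations. As a generic stand-in, the sketch below is ours (the scaling factors are illustrative and not taken from the paper): it produces such a perturbation from any partition P by shrinking within-part dissimilarities and stretching cross-part ones. Feeding d and the result to the brute-force procedure sketched in Section 2 and comparing the chosen k* and the output partitions reproduces the kind of admissibility test carried out above.

```python
import numpy as np

def consistent_perturbation(d, P, shrink=0.9, stretch=1.1):
    """Return a P-consistent transformation of d: dissimilarities within a part
    of P are scaled down by `shrink`, dissimilarities across parts are scaled up
    by `stretch`.  P is a list of disjoint index collections covering range(n)."""
    part_of = {x: i for i, part in enumerate(P) for x in part}
    d_new = d.astype(float).copy()
    n = d.shape[0]
    for x in range(n):
        for y in range(n):
            if x != y:
                factor = shrink if part_of[x] == part_of[y] else stretch
                d_new[x, y] = d[x, y] * factor
    return d_new
```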
4.2 Proof of Theorem 4.2

For simplicity, we consider an instance that satisfies 6-center proximity. The construction of the original and perturbed instances is given in Figure 3. Our construction is the same for k-means and k-median, and it satisfies the triangle inequality both before and after the perturbation. We show that reducing intra-cluster distances changes the optimal solution, hence violating Kleinberg's axioms.

[Figure 3, left panels (not reproduced): the original instance d and the perturbed instance d'.] The original distance metric d is given by the following table; missing entries follow by symmetry, and d(u, v) = 0 for v = u.

d(·,·)          | u_2 | u_3   | u ∈ S_1 | u ∈ S_2 | u ∈ S_3
u_1             | x   | x + y | 1       | x       | x + y
u_2             |     | x + y | x       | 1       | x + y
u_3             |     |       | x + y   | x + y   | y
v ∈ S_1, v ≠ u  |     |       | 2       | x       | x + y
v ∈ S_2, v ≠ u  |     |       |         | 2       | x + y
v ∈ S_3, v ≠ u  |     |       |         |         | 2y

Figure 3: The two figures on the left-hand side are an example of an instance that satisfies 6-center proximity and for which k-means and k-median are not admissible w.r.t. Kleinberg's axioms. The distance from u_1 to all nodes in S_1 (with |S_1| = 5) is 1, and the distance from any node of S_1 to all other nodes of S_1 is 2. The distance from any node of S_1 ∪ {u_1} to any node of S_2 ∪ {u_2} is x, etc. The perturbed instance is generated as follows. First, the intra-distance between the nodes of S_1 is reduced from 2 to 0. Second, the set S_3 ∪ {u_3} is partitioned into equal-sized sets S'_3 and S''_3; the intra-distance between nodes within each of the two sets is reduced to 0, and the distance between a node of S'_3 and a node of S''_3 is reduced to y. All other distances remain unchanged. The red circles denote the optimal clusterings, with the centers marked red. The table shows the original distance metric d.

We assume x > √(5/2) and y = 2x. Note that the instance is 0-balanced and satisfies x-center proximity, and hence also x'-center proximity for every x' ≤ x, by definition. We require a few definitions. Let u'_1 be the red node of S_1 and let u'_3 be the red node of S'_3. Let C_1 be the clustering induced by the centers u_1, u_2 and u_3, obtained by simply assigning every other node to the closest node among u_1, u_2 and u_3. Let C'_1 be the clustering induced by the centers u_1, u_3 and u'_3. Let C''_1 be the clustering induced by the centers u'_1, u_2 and u_3. Similarly, let C_2 be the clustering induced by the centers u'_1, u_3 and u'_3. Observe that the original metric space satisfies the x-center proximity definition. We will show that the optimal clustering C_1 in the original input and the optimal clustering C_2 in the perturbed input are different. Hence Kleinberg's axioms are not fulfilled, despite x-center proximity. For a clustering C, we use cost_o(C) and cost_p(C) to denote its cost before and after the perturbation.

k-means. Consider the original instance. We have OPT^o_3 = cost_o(C_1) = |S_1| · 1^2 + |S_2| · 1^2 + |S_3| · y^2 = 10 + 5y^2. Furthermore, cost_o(C_2) ≥ cost_o(C'_1) = 5 + 6x^2 + 4y^2, and so OPT^o_3 < cost_o(C_2). Consider the perturbed instance. We have cost_p(C''_1) = 1^2 + |S_2| · 1^2 + |S'_3| · y^2 = 6 + 3y^2 and cost_p(C''_1) ≤ cost_p(C_1). We have OPT_3 = cost_p(C_2) = 1 + (|S_2| + 1) · x^2 = 1 + 6x^2. Hence, the optimal clusterings in the original instance (C_1) and the perturbed instance (C_2) differ. An analogous reasoning yields the result for k-median.

References

[1] Determining the number of clusters in a data set - Wikipedia page. URL https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set.
[2] M. Ackerman. Towards theoretical foundations of clustering. 2012.
[3] M. Ackerman, S. Ben-David, and D. Loker. Characterization of linkage-based clustering. In COLT, pages 270-281. Citeseer, 2010.
[4] M. Ackerman, S. Ben-David, and D. Loker. Towards property-based classification of clustering paradigms.
In Advances in Neural Information Processing Systems, pages 10-18, 2010.
[5] M. Ackerman, S. Ben-David, S. Brânzei, and D. Loker. Weighted clustering. In AAAI, 2012.
[6] M. Ackerman, S. Ben-David, D. Loker, and S. Sabato. Clustering oligarchies. In Artificial Intelligence and Statistics, pages 66-74, 2013.
[7] N. Alon and N. Kahale. A spectral technique for coloring random 3-colorable graphs. SIAM Journal on Computing, 26(6):1733-1748, 1997.
[8] N. Alon, M. Krivelevich, and B. Sudakov. Finding a large hidden clique in a random graph. Random Structures and Algorithms, 13(3-4):457-466, 1998.
[9] H. Angelidakis, K. Makarychev, and Y. Makarychev. Algorithms for stable and perturbation-resilient problems. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, QC, Canada, June 19-23, 2017, pages 438-451, 2017. doi: 10.1145/3055399.3055487. URL http://doi.acm.org/10.1145/3055399.3055487.
[10] P. Awasthi and O. Sheffet. Improved spectral-norm bounds for clustering. In A. Gupta, K. Jansen, J. Rolim, and R. Servedio, editors, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 37-49, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. ISBN 978-3-642-32512-0.
[11] P. Awasthi, A. Blum, and O. Sheffet. Center-based clustering under perturbation stability. Information Processing Letters, 112(1-2):49-54, 2012.
[12] M. Balcan and Y. Liang. Clustering under perturbation resilience. SIAM J. Comput., 45(1):102-155, 2016. doi: 10.1137/140981575. URL https://doi.org/10.1137/140981575.
[13] S. Ben-David. Computational feasibility of clustering under clusterability assumptions. arXiv preprint arXiv:1501.00437, 2015.
[14] S. Ben-David and M. Ackerman. Measures of clustering quality: A working set of axioms for clustering. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 121-128. Curran Associates, Inc., 2009.
[15] S. Ben-David and N. Haghtalab. Clustering in the presence of background noise. In Proceedings of the 31st International Conference on Machine Learning - Volume 32, ICML'14, pages II-280-II-288. JMLR.org, 2014. URL http://dl.acm.org/citation.cfm?id=3044805.3044924.
[16] Y. Bilu and N. Linial. Are stable instances easy? Comb. Probab. Comput., 21(5):643-660, Sept. 2012. ISSN 0963-5483. doi: 10.1017/S0963548312000193. URL http://dx.doi.org/10.1017/S0963548312000193.
[17] A. Daniely, N. Linial, and M. Saks. Clustering is difficult only when it does not matter. arXiv preprint arXiv:1205.4891, 2012.
[18] S. Dudoit and J. Fridlyand. A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology, 3(7):research0036.1, 2002.
[19] A. Dutta, A. Vijayaraghavan, and A. Wang. Clustering stable instances of Euclidean k-means. arXiv preprint arXiv:1712.01241, 2017.
[20] J. M. Kleinberg. An impossibility theorem for clustering. In Advances in Neural Information Processing Systems, pages 463-470, 2002.
[21] A. Kumar and R. Kannan. Clustering with spectral norm and the k-means algorithm. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS'10, pages 299-308, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-0-7695-4244-7. doi: 10.1109/FOCS.2010.35. URL http://dx.doi.org/10.1109/FOCS.2010.35.
[22] F. McSherry. Spectral partitioning of random graphs. In Foundations of Computer Science, 2001. Proceedings.
42nd IEEE Symposium on, pages 529-537. IEEE, 2001.
[23] M. Meilă. Comparing clusterings: an axiomatic view. In ICML'05: Proceedings of the 22nd International Conference on Machine Learning, pages 577-584. ACM Press, 2005.
[24] R. Ostrovsky, Y. Rabani, L. J. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In Foundations of Computer Science, 2006. FOCS'06. 47th Annual IEEE Symposium on, pages 165-176. IEEE, 2006.
[25] J. Puzicha, T. Hofmann, and J. M. Buhmann. A theory of proximity based clustering: structure detection by optimization. Pattern Recognition, 33(4):617-634, 2000. ISSN 0031-3203. doi: https://doi.org/10.1016/S0031-3203(99)00076-X. URL http://www.sciencedirect.com/science/article/pii/S003132039900076X.
[26] R. L. Thorndike. Who belongs in the family? Psychometrika, 18(4):267-276, 1953.
[27] R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411-423, 2001.
[28] T. van Laarhoven and E. Marchiori. Axioms for graph clustering quality functions. Journal of Machine Learning Research, 15:193-215, 2014. URL http://jmlr.org/papers/v15/vanlaarhoven14a.html.
[29] E. W. Weisstein. Tree. From MathWorld - A Wolfram Web Resource. URL http://mathworld.wolfram.com/Tree.html.
[30] A. Williams. Is clustering mathematically impossible? http://alexhwilliams.info/itsneuronalblog/2015/10/01/clustering2/, 2015.
[31] R. B. Zadeh and S. Ben-David. A uniqueness theorem for clustering. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 639-646. AUAI Press, 2009.