# Individual Preference Stability for Clustering

Saba Ahmadi 1, Pranjal Awasthi 2, Samir Khuller 3, Matthäus Kleindessner 4, Jamie Morgenstern 5, Pattara Sukprasert 3, Ali Vakilian 1

1 Toyota Technological Institute at Chicago, USA; 2 Google, USA; 3 Northwestern University, USA; 4 Amazon Web Services, Germany; 5 University of Washington, USA. Correspondence to: S. Ahmadi, P. Awasthi, S. Khuller, M. Kleindessner, J. Morgenstern, P. Sukprasert, A. Vakilian.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

In this paper, we propose a natural notion of individual preference (IP) stability for clustering, which asks that every data point, on average, is closer to the points in its own cluster than to the points in any other cluster. Our notion can be motivated from several perspectives, including game theory and algorithmic fairness. We study several questions related to our proposed notion. We first show that deciding whether a given data set allows for an IP-stable clustering in general is NP-hard. As a result, we explore the design of efficient algorithms for finding IP-stable clusterings in some restricted metric spaces. We present a polytime algorithm to find a clustering satisfying exact IP-stability on the real line, and an efficient algorithm to find an IP-stable 2-clustering for a tree metric. We also consider relaxing the stability constraint, i.e., every data point should not be too far from its own cluster compared to any other cluster. For this case, we provide polytime algorithms with different guarantees. We evaluate some of our algorithms and several standard clustering approaches on real data sets.

1. Introduction

Clustering, a foundational topic in unsupervised learning, aims to find a partition of a dataset into sets where the intra-set similarity is higher than the inter-set similarity. The problem of clustering can be formalized in numerous ways, with different ways of measuring similarity within and between sets, such as centroid-based formulations like k-means, k-median and k-center (e.g., Arthur and Vassilvitskii, 2007), hierarchical partitionings (e.g., Dasgupta, 2002), or spectral clustering (e.g., von Luxburg, 2007). All these formulations have also been considered under additional constraints (e.g., Wagstaff et al., 2001); in particular, a recent line of work studies clustering under various fairness constraints (e.g., Chierichetti et al., 2017). Most formulations of clustering are NP-hard, which has led both to the study of approximation algorithms and to conditions under which (nearly) optimal clusterings can be found in computationally efficient ways. Conditions that have been studied in recent years are variants of perturbation stability, i.e., small perturbations to the distances do not affect the optimal clustering (Ackerman and Ben-David, 2009; Awasthi et al., 2012; Bilu and Linial, 2012), or variants of approximation stability, i.e., approximately optimal clusterings according to some objective are all close to each other (Balcan et al., 2013; Ostrovsky et al., 2013). In this work, we introduce a distinct notion of stability in clustering, individual preference stability of a clustering, which measures whether data points can reduce their individual objective by reassigning themselves to another cluster.
The study of individual preference (IP) stability, and clusterings which are IP-stable, has a number of motivations behind it, both in applications and in connections to other concepts in computing. For example, clustering can be used to design curricula for a collection of students, with the goal that each student is assigned to a cluster with the most similar learning needs. One might also design personalized marketing campaigns where customers are first clustered; the campaign's personalization will be more effective if customers have the most affinity to the clusters to which they are assigned. More generally, if one first partitions a dataset and then designs a collection of interventions or treatments for each subset, IP-stability ensures individuals are best represented by the cluster they belong to.

This notion has connections to several other concepts studied in computing, both for clustering and for other algorithmic problems. Suppose each data point belongs to an individual, and that individual chooses which cluster she wants to join to minimize her average distance to points in that cluster. If her features are fixed (and she cannot change them), an IP-stable clustering corresponds to a Nash equilibrium of this game. In this sense, IP-stability is related to the concept of stability from matching theory (Roth, 1984; Gale and Sotomayor, 1985). IP-stability is also a natural notion of individual fairness for clustering, asking that each individual prefers the cluster they belong to over any other cluster (and it directly captures envy-freeness of a clustering, similar to what has recently been studied in the context of classification (Balcan et al., 2019)). Finally, the study of IP-stability (whether such a clustering exists, whether an objective-maximizing clustering is also IP-stable) may open up a new set of conditions under which approximately optimal clusterings can be found more efficiently than in the worst case.

Our Contributions. We propose a natural notion of IP-stability for clustering and present a comprehensive analysis of its properties. Our notion requires that each data point, on average, is closer to the points in its own cluster than to the points in any other cluster (Section 2). We show that, in general, IP-stable clusterings might not exist, even for Euclidean data sets in $\mathbb{R}^2$ (Section 2). Moreover, we prove that deciding whether a given data set has an IP-stable k-clustering is NP-hard, even for k = 2 and when the distance function is assumed to be a metric (Section 3). Here, and in the following, k is the desired number of clusters. This naturally motivates the study of approximation algorithms for the problem and the study of the problem in special metric spaces. A clustering is called t-approximately IP-stable if the average distance of each data point to its own cluster is not more than t times its average distance to the points in any other cluster. By exploiting techniques from metric embeddings, in Section 4.1 we first provide a polynomial-time algorithm that finds an $O(\log^2 n/\varepsilon)$-approximately IP-stable clustering for a $(1-\varepsilon)$-fraction of the points in any metric space. Second, by designing a modified single-linkage approach as a pre-processing step, in Section 4.2 we provide an efficient approximation algorithm when the input has a well-separated IP-stable clustering. More precisely, if the input has an IP-stable clustering with clusters of size at least $\alpha n$, then our algorithm finds an $O(1/\alpha)$-approximately IP-stable clustering of the input in polynomial time.
When the data set lies on the real line, surprisingly, we show that an IP-stable k-clustering always exists, and we design an efficient algorithm with running time O(kn) to find one, where n is the number of data points (Section 5.1). In addition to finding an IP-stable clustering, one might want to optimize an additional global objective; we study such a problem in Appendix F, where we minimize the deviation of cluster sizes from desirable target sizes. We propose a DP solution which runs in $O(n^3 k)$ time for this problem. Moreover, we show how to efficiently find an IP-stable 2-clustering when the distance function is a tree metric (Section 5.2). We in fact conjecture that the result on the line can be extended to any tree, and while we show a positive result only for k = 2, we believe that a similar result holds for any k. Finally, we perform extensive experiments on real data sets and compare the performance of several standard clustering algorithms such as k-means++ and k-center w.r.t. IP-stability (Section 6). Although in the worst case the violation of IP-stability by solutions produced by these algorithms can be arbitrarily large (as we show in Section 3 / Appendix E), some of these algorithms perform surprisingly well in practice. We also study simple heuristic modifications to make the standard algorithms more aligned with the notion of IP-stability, in particular for linkage clustering (Appendix G).

Figure 1. Two data sets in $\mathbb{R}^2$ with d equaling the Euclidean metric. Left: An IP-stable 3-clustering with clusters $C_1, C_2, C_3$. Right: Four points for which there is no IP-stable 2-clustering. For example, if the two clusters were $C_1 = \{x_1, x_4\}$ and $C_2 = \{x_2, x_3\}$, then $x_1$ would not be stable because $d(x_1, x_4) = 0.71 > 0.68 = [d(x_1, x_2) + d(x_1, x_3)]/2$, and if the two clusters were $C_1 = \{x_1\}$ and $C_2 = \{x_2, x_3, x_4\}$, then $x_4$ would not be stable because $d(x_1, x_4) = 0.71 < 0.715 = [d(x_2, x_4) + d(x_3, x_4)]/2$.

2. Individual Preference Stability

Our notion of individual preference stability applies to a data set D together with a given dissimilarity function d that measures how close two data points are. We use the terms dissimilarity and distance synonymously. We assume $d : D \times D \to \mathbb{R}_{\ge 0}$ to be symmetric with $d(x, x) = 0$, but not necessarily to be a metric (i.e., to additionally satisfy the triangle inequality and $d(x, y) = 0 \Leftrightarrow x = y$). Our stability notion defines what it means that a data point is IP-stable in a clustering of D; namely: a data point is IP-stable if the average distance to the points in its own cluster (the point itself excluded) is not greater than the average distance to the points in any other cluster. Then a clustering of D is said to be IP-stable if every data point in D is stable. For the rest of the paper we assume D to be finite. Our definition of IP-stability for clustering can then be formally stated as follows (for $\ell \in \mathbb{N}$, we let $[\ell] = \{1, \ldots, \ell\}$):

Definition 1 (Individual preference (IP) stability). Let $\mathcal{C} = (C_1, \ldots, C_k)$ be a k-clustering of D, that is, $D = C_1 \,\dot\cup\, \cdots \,\dot\cup\, C_k$ and $C_i \neq \emptyset$ for $i \in [k]$. For $x \in D$, we write $C(x)$ for the cluster $C_i$ that x belongs to. We say that $x \in D$ is IP-stable if either $C(x) = \{x\}$ or
$$\frac{1}{|C(x)|-1} \sum_{y \in C(x)} d(x, y) \;\le\; \frac{1}{|C_i|} \sum_{y \in C_i} d(x, y) \qquad (1)$$
for all $i \in [k]$ with $C_i \neq C(x)$. The clustering $\mathcal{C}$ is an IP-stable k-clustering if every $x \in D$ is IP-stable.
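Definition 1 can be checked directly from a pairwise distance matrix. The following is a minimal illustrative sketch (not part of the paper); it assumes a symmetric n x n matrix `D` with zero diagonal and a label vector `labels`, and the function name `is_ip_stable` is ours.

```python
import numpy as np

def is_ip_stable(D, labels):
    """Check Definition 1: every point's average distance to its own cluster
    (itself excluded) is at most its average distance to every other cluster."""
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    for x in range(len(labels)):
        own = np.flatnonzero(labels == labels[x])
        own = own[own != x]
        if own.size == 0:          # a singleton cluster is stable by definition
            continue
        own_avg = D[x, own].mean()
        for c in np.unique(labels):
            if c == labels[x]:
                continue
            other = np.flatnonzero(labels == c)
            if D[x, other].mean() < own_avg:   # inequality (1) is violated
                return False
    return True
```

For example, `is_ip_stable(D, [0, 0, 1, 1])` checks the 2-clustering that puts the first two points together.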
For brevity, instead of IP-stable we may simply say stable.

We discuss some important observations about IP-stability as defined in Definition 1: if in a clustering all clusters are well-separated and sufficiently far apart, then this clustering is IP-stable. An example of such a scenario is provided in the left part of Figure 1. Hence, at least for such simple clustering problems with an obvious solution, IP-stability does not conflict with the clustering goal of partitioning the data set such that data points in the same cluster are similar to each other, and data points in different clusters are dissimilar (Celebi and Aydin, 2016, p. 306). However, there are also data sets for which no IP-stable k-clustering exists (for a fixed k and a given distance function d).1 This can even happen for Euclidean data sets and k = 2, as the right part of Figure 1 shows. If a data set allows for an IP-stable k-clustering, there might be more than one IP-stable k-clustering. An example of this is shown in Figure 2. This example also illustrates that IP-stability does not necessarily work towards the aforementioned clustering goal. Indeed, in Figure 2 the two clusters of the clustering encoded by the frames, which is stable, are not even contiguous.

Figure 2. A data set on the real line with more than one IP-stable clustering. Top: The data set and the distances between the points. Bottom: The same data set with two IP-stable 2-clusterings (one encoded by color: red vs blue / one encoded by frames: solid vs dotted boundary).

1Of course, the trivial 1-clustering $\mathcal{C} = (D)$ or the trivial |D|-clustering that puts every data point in a singleton are stable, and for a trivial distance function $d \equiv 0$, every clustering is stable.

These observations raise a number of questions such as: when does an IP-stable k-clustering exist? Can we efficiently decide whether an IP-stable k-clustering exists? If an IP-stable k-clustering exists, can we efficiently compute it? Can we minimize some (clustering) objective over the set of all IP-stable clusterings? How do standard clustering algorithms such as Lloyd's algorithm (aka k-means; Lloyd, 1982) or linkage clustering (e.g., Shalev-Shwartz and Ben-David, 2014, Section 22) perform in terms of IP-stability? Are there simple modifications to these algorithms in order to improve their stability? In this paper, we explore some of these questions as outlined in Section 1.

2.1. Related Work and Concepts

Clustering Stability. There is a large body of work on the design of efficient clustering algorithms, both in the worst case and under various stability notions (Awasthi and Balcan, 2014). Some works have studied stability notions called average stability that are similar to our notion of IP-stability. The work of Balcan et al. (2008) studies properties of a similarity function that are sufficient in order to approximately recover (in either a list model or a tree model) an unknown ground-truth clustering. One of the weaker properties they consider is the average attraction property, which requires inequality (1) (i.e., inequality (2) with t = 1) to hold for the ground-truth clustering, but with an additive gap of $\gamma > 0$ between the left and the right side of (1). Balcan et al. show that the average attraction property is sufficient to successfully cluster in the list model, but with the length of the list being exponential in $1/\gamma$, and is not sufficient to successfully cluster in the tree model. The works discussed above utilize strong forms of average stability to bypass computational barriers and recover the ground-truth clustering. We, on the other hand, focus on both this problem and the complementary question of when such a stability property even holds, either exactly or approximately.
Related notions of clustering stability such as perturbation stability and approximation stability have also been studied. However, the goal in these works is to approximate a clustering objective such as k-means or k-median under stability assumptions on the optimizer of the objective (Ackerman and Ben-David, 2009; Awasthi et al., 2012; Bilu and Linial, 2012; Balcan et al., 2013; Balcan and Liang, 2016; Makarychev and Makarychev, 2016). Closely related to our work is the paper by Daniely et al. (2012). They show that if there is an unknown ground-truth clustering with cluster size at least $\alpha n$ satisfying a slightly stronger requirement than IP-stability, i.e., each point, on average, is at least a factor 3 closer to the points in its own cluster than to the points in any other cluster, then they can find an O(1)-approximately IP-stable clustering. However, their algorithm runs in time $n^{\Omega(\log 1/\alpha)} = n^{\Omega(\log k)}$.

Individual Fairness and Fairness in Clustering. In the fairness literature, fairness notions are commonly categorized into notions of individual fairness or group fairness. The latter ensures fairness for groups of individuals (such as men vs. women), whereas the former aims to ensure fairness for single individuals. Our proposed notion of IP-stability can be interpreted as a notion of individual fairness for clustering. Individual fairness was originally proposed by Dwork et al. (2012) for classification, requiring that similar data points (as measured by a given task-specific metric) should receive a similar prediction. It has also been studied in the setting of online learning (Joseph et al., 2016; 2018; Gillen et al., 2018). Recently, two notions of individual fairness for clustering have been proposed. The first notion, proposed by Jung et al. (2020) and also studied by Mahabadi and Vakilian (2020), Chakrabarty and Negahbani (2021) and Vakilian and Yalciner (2022), asks that every data point is somewhat close to a center, where "somewhat" depends on how close the data point is to its n/k nearest neighbors. Jung et al. motivate their notion from the perspective of facility location, but it can also be seen in the context of data points that strive to be well represented by being close to a center, similarly to the notion of Chen et al. (2019). Note that our proposed notion of IP-stability defines being well represented in terms of the average distance of a data point to the other points in its cluster, rather than the distance to a center, and hence is not restricted to centroid-based clustering. The second notion, proposed by Anderson et al. (2020), is analogous to the notion of individual fairness by Dwork et al. (2012) for classification. It considers probabilistic clustering, where each data point is mapped to a distribution over a set of centers, and requires that similar data points are mapped to similar distributions. Interestingly, Anderson et al. allow the similarity measure used to define fairness and the similarity measure used to evaluate clustering quality to be the same or different. However, individual fairness as introduced by Dwork et al. (2012) has often been deemed impractical due to requiring access to a task-specific metric (Ilvento, 2019), and it is unclear where a fairness similarity measure that is different from the clustering similarity measure should come from. Note that both notions are different from our notion: (1) there exist clusterings that are IP-stable but achieve $\alpha = \infty$ according to the $\alpha$-fairness notion of Jung et al.,
and (2) the distributional property of Anderson et al. can always be satisfied (e.g., by the uniform distribution), whereas IP-stable clusterings do not always exist.

Hedonic Games. A related line of work is the class of hedonic games as a model of coalition formation (Dreze and Greenberg, 1980; Bogomolnaia and Jackson, 2002; Elkind et al., 2016). In a hedonic game the utility of a player only depends on the identity of the players belonging to her coalition. Feldman et al. (2015b) study clustering from the perspective of hedonic games. Our work is different from theirs in the sense that in our model the data points are not selfish players.

3. NP-Hardness

We show that in general metrics, the problem of deciding whether an IP-stable k-clustering exists is NP-hard. For such a result it is crucial to specify how an input instance is encoded: we assume that a data set D together with a distance function d is represented by the distance matrix $(d(x, y))_{x, y \in D}$. Under this assumption we can prove the following theorem:

Theorem 1 (NP-hardness of IP-stable clustering). Deciding whether a data set D together with a distance function d has an IP-stable k-clustering is NP-hard. This holds even if k = 2 and d is a metric.

Due to space limitations, the proof of this theorem and all other missing proofs are deferred to the appendix. The proof is via a reduction from a variant of 3-SAT where the number of clauses is equal to the number of variables and each variable occurs in at most three clauses. Unless P = NP, Theorem 1 implies that for general data sets, even when being guaranteed that an IP-stable k-clustering exists, there cannot be any efficient algorithm for computing such an IP-stable clustering. However, as with all NP-hard problems, there are two possible remedies. The first remedy is to look at approximately IP-stable clusterings:

Definition 2 (Approximate IP-stability). Let $\mathcal{C} = (C_1, \ldots, C_k)$ be a k-clustering of D and for $x \in D$ let $C(x)$ be the cluster $C_i$ that x belongs to. We say that, for some $t \ge 1$, $x \in D$ is t-approximately IP-stable if either $C(x) = \{x\}$ or
$$\frac{1}{|C(x)|-1} \sum_{y \in C(x)} d(x, y) \;\le\; \frac{t}{|C_i|} \sum_{y \in C_i} d(x, y) \qquad (2)$$
for every $C_i \neq C(x)$. The clustering $\mathcal{C}$ is t-approximately IP-stable if every $x \in D$ is t-approximately IP-stable.

An alternative way of defining approximate IP-stability, which is explored in Section 4.1, would be to allow a violation of inequality (1) in Definition 1 for a certain number of points. In our experiments in Section 6 we actually consider both these notions of approximation. The second remedy is to restrict our considerations to data sets with some special structure. This is what we do in Section 5, where we consider the 1-dimensional Euclidean metric and tree metrics.

To complement Theorem 1, we show theoretical lower bounds for the standard clustering algorithms k-means++, k-center, and single linkage. Their violation of IP-stability can be arbitrarily large.

Theorem 2. For any $\alpha > 1$, there exist separate examples where the clusterings produced by the k-means++, k-center, and single linkage algorithms are t-approximately IP-stable only for $t \ge \alpha$.
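Theorem 1 rules out an efficient decision procedure in general, but on very small instances existence can still be decided by exhausting all 2-clusterings. The sketch below is our own illustration (exponential in n, not an algorithm from the paper); it assumes a symmetric distance matrix `D`, and the helper names are hypothetical.

```python
import itertools
import numpy as np

def exists_ip_stable_2_clustering(D):
    """Brute force over all bipartitions into two non-empty clusters;
    returns a stable bipartition if one exists, else None. Exponential in n."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    points = range(n)
    for size in range(1, n // 2 + 1):
        for cluster_a in itertools.combinations(points, size):
            a = np.array(cluster_a)
            b = np.array([p for p in points if p not in cluster_a])
            if _stable(D, a, b) and _stable(D, b, a):
                return set(a.tolist()), set(b.tolist())
    return None

def _stable(D, own, other):
    """Every point of `own` weakly prefers `own` (on average) over `other`."""
    for x in own:
        rest = own[own != x]
        if rest.size == 0:
            continue
        if D[x, rest].mean() > D[x, other].mean():
            return False
    return True
```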
4. Approximation Algorithms for IP-Stability

In this section, we provide two algorithms for finding approximately IP-stable clusterings. Our first algorithm finds a partially IP-stable clustering, and the second one finds an approximately IP-stable clustering if the input point set admits a clustering that satisfies a set of requirements that are slightly stronger than the requirements of IP-stability.

4.1. Partially Stable Clustering

Here, we show that if we only want to cluster a $(1-\varepsilon)$-fraction of the points, it is possible to find an $O(\log^2 n/\varepsilon)$-approximately IP-stable k-clustering in any metric space. We first define hierarchically well-separated trees (HSTs).

Definition 3 (Bartal (1996)). A t-hierarchically well-separated tree (t-HST) is defined as a rooted weighted tree with the following properties:
- The edge weight from any node to each of its children is the same.
- The edge weights along any path from the root to a leaf decrease by a factor of at least t.

Fakcharoenphol et al. (2004) show that it is possible to have a tree embedding whose distortion is O(log n) in expectation. Recently, Haeupler et al. (2021) proposed an algorithm that gives a worst-case distortion guarantee via what they call partial tree embeddings, if it is allowed to remove a constant fraction of points from the embedding. While their work concerns hop-constrained network design, we provide a simplified version of their result which is useful in our context.

Theorem 3 (Haeupler et al. (2021)). Given a weighted graph G = (V, E, w), $0 < \varepsilon < \frac{1}{3}$ and root $r \in V$, there is a polynomial-time algorithm which samples from a distribution over partial tree embeddings whose trees are 2-HSTs rooted at r, with exclusion probability2 $\varepsilon$ and worst-case distance stretch $O(\log^2 n/\varepsilon)$.

2Let $\mathcal{D}$ be a distribution of partial tree metrics of G. $\mathcal{D}$ has exclusion probability $\varepsilon$ if for all $v \in V$ we have $\Pr_{d \sim \mathcal{D}}[v \in V_d] \ge 1 - \varepsilon$.

Claim 4. Without loss of generality, we can assume the following two properties from the construction in Theorem 3: (1) V is the set of leaves of T, and (2) depth(u) = depth(v) for any $u, v \in V$.

Our remaining task is to find an IP-stable clustering of the leaves (according to the tree metric distance). If we can do that, then it follows that we have an $O(\log^2(n)/\varepsilon)$-approximately IP-stable k-clustering.

Claim 5. There exists a k-clustering $\mathcal{C} = (C_1, \ldots, C_k)$ of the leaves of a t-HST, for any $t \ge 2$, such that $\mathcal{C}$ is IP-stable for any leaf node. Moreover, for $u, v \in C_i$ and $w \in C_j$, $d_T(u, v) \le d_T(u, w)$.

By combining the tree embedding result and our clustering on HSTs we have the following theorem.

Theorem 6. Let (V, d) be a metric. There is a randomized, polynomial-time algorithm that produces a clustering $\mathcal{C} = (C_1, \ldots, C_k)$ for $V' \subseteq V$, where $V'$ is taken from V with exclusion probability $\varepsilon$, such that for any node $u \in C_i$ and $j \ne i$, $d(u, C_i) \le O(\log^2 n/\varepsilon) \cdot d(u, C_j)$.3

3We let $d(u, C) = \frac{\sum_{v \in C} d(u, v)}{|C \setminus \{u\}|}$ be the average distance from u to cluster C. If $C = \{u\}$, then $d(u, C) = 0$.

Proof. We use Theorem 3 to embed (V, d) into a 2-HST $(V', T)$. We then apply Claim 5 to produce $(C_1, \ldots, C_k)$ that is IP-stable in T. Since T is a partial tree embedding with worst-case distortion $O(\log^2 n/\varepsilon)$, it follows that
$$d(u, C_i) \le d_T(u, C_i) \le d_T(u, C_j) \le O(\log^2 n/\varepsilon) \cdot d(u, C_j).$$
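The clustering of leaves promised by Claim 5 is constructed in the appendix by cutting the HST at the deepest level that contains at most k nodes. The sketch below is a simplified illustration of that idea, not the paper's exact construction: it assumes the tree is given as a `parent` map with all data points as equal-depth leaves (Claim 4) and, as in FRT-style embeddings, that edge weights depend only on depth; it may return fewer than k groups.

```python
from collections import defaultdict

def cluster_hst_leaves(parent, leaves, k):
    """Group the leaves of an HST by their ancestor at the deepest level
    that contains at most k nodes. `parent` maps node -> parent (root -> None).
    Returns a dict: ancestor node -> list of leaves (at most k groups)."""
    def depth(v):
        d = 0
        while parent[v] is not None:
            v, d = parent[v], d + 1
        return d

    depths = {v: depth(v) for v in parent}
    max_depth = max(depths[v] for v in leaves)

    # deepest level with at most k nodes (level 0 holds only the root)
    level_counts = defaultdict(int)
    for v, d in depths.items():
        level_counts[d] += 1
    cut = max(d for d in range(max_depth + 1) if level_counts[d] <= k)

    def ancestor_at(v, target_depth):
        while depths[v] > target_depth:
            v = parent[v]
        return v

    clusters = defaultdict(list)
    for leaf in leaves:
        clusters[ancestor_at(leaf, cut)].append(leaf)
    return dict(clusters)
```

Under these assumptions, leaves sharing a cut-level ancestor have a deeper least common ancestor, and hence a smaller tree distance, than leaves in different groups, which is the "moreover" property stated in Claim 5.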
4.2. Instances with Stable Clustering

Next, we show an algorithm that finds an approximately IP-stable clustering if there is a well-separated underlying clustering. Let us first define a well-separated clustering. A similar notion is also defined by Daniely et al. (2012).

Definition 4 ((α, γ)-clustering). Let $\mathcal{C} = (C_1, \ldots, C_k)$ be a k-clustering of D, that is, $D = C_1 \,\dot\cup\, \cdots \,\dot\cup\, C_k$ and $C_i \neq \emptyset$ for $i \in [k]$. For $x \in D$, we write $C(x)$ for the cluster $C_i$ that x belongs to. We say that $\mathcal{C}$ is an (α, γ)-clustering if
1. for all $C_i$, $|C_i| \ge \alpha n$, where $n = |D|$, and
2. for all $i \ne j$ and $x \in C_i$, $d(x, C_j) \ge \gamma \cdot d(x, C_i)$.

Lemma 1 (Daniely et al. (2012)). Let $\mathcal{C} = (C_1, \ldots, C_k)$ be an (α, γ)-clustering, and let $i \ne j$. Then
1. for every4 $x \in C_i$, $y \in C_j$, $\frac{\gamma-1}{\gamma}\, d(y, C_i) \le d(x, y) \le \frac{\gamma^2+1}{\gamma(\gamma-1)}\, d(y, C_i)$, and
2. for every $x, y \in C_i$, $d(x, y) \le \frac{2}{\gamma-1}\, d(x, C_j)$.

4Daniely et al. (2012) use the term "almost every" to avoid sets of measure zero.

Here, we show that, for large enough $\gamma$, the edges inside a cluster will always be shorter than the edges between clusters.

Claim 7. Let $\mathcal{C} = (C_1, \ldots, C_k)$ be an (α, γ)-clustering. Let $i \ne j$, and let $x, y \in C_i$ and $z \in C_j$. If $\gamma \ge 2 + \sqrt{3}$, then $d(x, y) \le d(y, z)$.

Theorem 8. Let $\gamma \ge 2 + \sqrt{3}$. If there exists an (α, γ)-clustering, then there is an algorithm that finds such a clustering in time $O\big(n^2 \log n + n\,(1/\alpha)^k\big)$.

Proof. Let us consider a modification of the single-linkage algorithm: we consider edges in non-decreasing order and only merge two clusters if at least one of them has size at most $\alpha n$. Claim 7 suggests that the edges within an underlying cluster will be considered before edges between two clusters. Since we know that any cluster size is at least $\alpha n$, when considering an edge, it is safe to use this condition to merge; if it does not hold, we can ignore the edge. By this process, we end up with a clustering where each cluster has size at least $\alpha n$. Hence, there are at most $O(1/\alpha)$ such clusters. We can then enumerate over all possible clusterings; as there are at most $O\big((1/\alpha)^k\big)$ such clusterings, the running time follows.

Note that the algorithm described in the above theorem has a high running time (especially if $\alpha$ is small). Notice that, before the enumeration, we do not make any mistakes when we run the modified single-linkage, and we get a clustering where each cluster has size at least $\alpha n$. We further show that it is possible to define a metric over this clustering and apply Theorem 6 to that metric. The clustering we get will be approximately IP-stable.

Conditioning the clusters via single-linkage. When we run the single-linkage algorithm, in addition to the size of any pair of clusters, we can also consider the ratio between each pair of points in two different clusters. We want the distances between every pair of points in two clusters to be roughly the same. Moreover, the distance from a point x to its own cluster should not be large compared to the distance from x to any other cluster. We formally define this condition below.

Claim 9. Let $\mathcal{C} = (C_1, \ldots, C_k)$ be an (α, γ)-clustering. Let $D \subseteq C_i$ and $D' \subseteq C_j$ be two sets of points from different clusters. Then for $x \in D$ and $y, y' \in D'$, $\frac{d(x, y)}{d(x, y')} \le \frac{\gamma^2+1}{(\gamma-1)^2}$.

Corollary 1. For two subsets D, D' from different underlying clusters, let $x, x' \in D$ and $y, y' \in D'$. Then $\frac{d(x, y)}{d(x', y')} \le \left(\frac{\gamma^2+1}{(\gamma-1)^2}\right)^2$.

Claim 10. Let D, D' be two clusters we consider in the single-linkage algorithm. Let e = (x, y) be an edge that we merge in an arbitrary step of the single-linkage algorithm, where $x \in D$ and $y \in D'$. Then D and D' must belong to the same underlying cluster if one of the following is true:
1. $|D| < \alpha n$ or $|D'| < \alpha n$,
2. $\frac{\max_{x \in D,\, y \in D'} d(x, y)}{\min_{x \in D,\, y \in D'} d(x, y)} > \left(\frac{\gamma^2+1}{(\gamma-1)^2}\right)^2$,
3. there exists $x' \in D$ such that $d(x, x') > \frac{2\gamma}{(\gamma-1)^2}\, d(x, y)$.

Theorem 11. Let $\alpha > 0$ and $\gamma \ge 2 + \sqrt{3}$. If there exists an (α, γ)-clustering, then there is a randomized algorithm that finds an $O\!\left(\frac{\log^2(1/\alpha)}{\alpha}\right)$-approximately IP-stable k-clustering in polynomial time with constant success probability.
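Before the proof, the merging phase shared by Theorems 8 and 11 can be sketched as a Kruskal-style single-linkage pass. The version below is illustrative only: it implements just the simplest merge rule (merge two components only when at least one of them still has at most $\alpha n$ points) and omits the remaining Claim 10 checks as well as the final enumeration/embedding step.

```python
class DSU:
    """Union-find with component sizes, for single-linkage style merging."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

def capped_single_linkage(D, alpha):
    """Process all pairs in non-decreasing order of distance; merge two
    components only if at least one of them has size at most alpha * n."""
    n = len(D)
    cap = alpha * n
    dsu = DSU(n)
    edges = sorted((D[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    for _, i, j in edges:
        ri, rj = dsu.find(i), dsu.find(j)
        if ri != rj and (dsu.size[ri] <= cap or dsu.size[rj] <= cap):
            dsu.union(i, j)
    # return the resulting components as lists of point indices
    comps = {}
    for v in range(n):
        comps.setdefault(dsu.find(v), []).append(v)
    return list(comps.values())
```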
Proof. In the first phase, we run the modified single-linkage algorithm: as in the standard single-linkage algorithm, we consider edges in non-decreasing order of length and merge two clusters using the criteria of Claim 10. When this phase finishes, we get a clustering $(D_1, \ldots, D_\ell)$ with $k \le \ell \le O(1/\alpha)$, where (1) $|D_i| \ge \alpha n$, (2) for any $x \in D_i$, $y \in D_j$, $d(x, y)$ approximates $d(D_i, D_j)$,5 and (3) $d(x, D_i) = O(d(x, D_j))$. This implies that we can pick $R = \{r_1, \ldots, r_\ell\}$, where each $r_i \in D_i$ is the representative of $D_i$. Then, we show that an IP-stable clustering of R yields an IP-stable clustering of the initial point set as well.

5We let $d(C, C') = \frac{\sum_{x \in C,\, y \in C'} d(x, y)}{|C||C'|}$.

We apply Theorem 3 to R with exclusion probability $\varepsilon < \alpha$ to produce a partial tree embedding T. By the choice of $\varepsilon$, with constant probability not a single point in R is excluded. We then apply Claim 5 to produce a clustering $\mathcal{C} = (C_1, \ldots, C_k)$ of R. Let $F = (F_1, \ldots, F_k)$, where $F_i = \bigcup_{r_j \in C_i} D_j$, be the final clustering. Next, we show that F is an approximately IP-stable clustering.

For every $x \in F_i$, let y be the point furthest from x in $F_i$ and z be the point closest to x in $F_j$. If we show that $d(x, y) = O\!\left(\frac{\log^2(1/\alpha)}{\alpha}\right) d(x, z)$, then this proves the claim, as $d(x, F_i) = O(d(x, y))$ and $d(x, F_j) = \Omega(d(x, z))$. Let $D_x, D_y, D_z$ be the clusters x, y, z belong to after the merging phase terminates, respectively. Also, let $r_x, r_y, r_z$ be the representatives of $D_x, D_y, D_z$. By Claim 5, since $r_x, r_y$ belong to the same cluster C, and since $r_z$ belongs to another cluster C',
$$d(r_x, r_y) \le d_T(r_x, r_y) \le d_T(r_x, r_z) \le O\!\left(\frac{\log^2(1/\alpha)}{\alpha}\right) d(r_x, r_z),$$
where the last inequality holds since T is a 2-HST embedding of the representative points with a worst-case distortion guarantee of $O\!\left(\frac{\log^2(1/\alpha)}{\alpha}\right)$. Because, by Claim 10, $d(x, y) = \Theta(d(r_x, r_y))$ and $d(x, z) = \Theta(d(r_x, r_z))$, it follows that $d(x, y) = O\!\left(\frac{\log^2(1/\alpha)}{\alpha}\right) d(x, z)$. Hence, the proof is complete.

Remark. Our analysis relies on the fact that the partial embedding provides a worst-case guarantee. However, to the best of our knowledge, a probabilistic tree embedding does not seem to work because its distortion guarantee holds only in expectation. In other words, there could be an edge e whose distortion is bad, and having only one such edge is enough to break the stability of any k-clustering on the tree embedding.

5. Special Metrics

Another approach to circumvent the NP-hardness of IP-stable clustering is to restrict our considerations to data sets with some special structure. Here, we consider 1-dimensional Euclidean metrics and tree metrics.

5.1. 1-dimensional Euclidean Case

Here we study the special case of $D \subset \mathbb{R}$, where |D| = n and d is the Euclidean metric. We provide an O(kn)-time algorithm that finds an IP-stable k-clustering; Theorem 12 below states this result.

High-level Idea. We start with a specific clustering where all but one cluster are singletons. By the definition, only nodes in the non-singleton cluster may want to leave their cluster (i.e., the IP-stability condition may be violated for a subset of points in that cluster). Lemma 5 shows that if a node v of a cluster C desires to leave C, then a boundary node v' of C will want to leave C as well. This allows us to focus only on boundary vertices. There can be at most two boundary vertices per cluster, as long as our clustering is contiguous. We further observe that a cluster will become unstable only when an additional vertex joins the cluster. When this happens, since the vertex that just joined, which has become a boundary vertex, does not desire to leave the cluster, it must be the boundary vertex on the other end that wants to leave the cluster. Since we maintain the order and the monotonicity of the clusters and boundary vertices that we inspect, we end up with an O(kn)-time algorithm.
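On small inputs, Theorem 12 can be sanity-checked by brute force over contiguous partitions of the sorted points; the sketch below is our own illustration (exponential in k) and is not the paper's O(kn) algorithm.

```python
import itertools
import numpy as np

def find_ip_stable_1d(points, k):
    """Brute-force search over contiguous k-clusterings of points on the real
    line. Returns a list of k clusters (as sorted arrays) that is IP-stable,
    or None if none of the contiguous partitions is stable."""
    xs = np.sort(np.asarray(points, dtype=float))
    n = len(xs)
    for splits in itertools.combinations(range(1, n), k - 1):
        bounds = (0,) + splits + (n,)
        clusters = [xs[bounds[i]:bounds[i + 1]] for i in range(k)]
        if all(_point_stable(x, ci, clusters) for ci in clusters for x in ci):
            return clusters
    return None

def _point_stable(x, own, clusters):
    """Definition 1 for a single point x whose cluster is `own`."""
    if len(own) == 1:
        return True
    own_avg = np.abs(own - x).sum() / (len(own) - 1)
    others = [c for c in clusters if c is not own]
    return all(own_avg <= np.abs(c - x).mean() for c in others)
```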
We show that the following theorem holds.

Theorem 12. Let $D \subset \mathbb{R}$ be a point set of size n. There exists an algorithm that, for any $1 \le k \le n$, gives an IP-stable k-clustering of D, and has a time complexity of O(kn).

It is well known that for any set of n points in a metric space, there exists an embedding into the one-dimensional Euclidean space with distortion O(n). Hence:

Corollary 2. Let D be a point set of size n in an arbitrary metric space. There exists a polynomial-time algorithm that returns an O(n)-approximately IP-stable k-clustering of D.

In Appendix F, we consider an extension of the 1-dimensional Euclidean case where a target size for each cluster is given, and the goal is to find an IP-stable k-clustering that minimizes the total violation of the target cluster sizes. Formally, the goal is to solve
$$\min_{\substack{\mathcal{C} = (C_1, \ldots, C_k):\ \mathcal{C} \text{ is an IP-stable clustering of } D \\ \text{with contiguous clusters}}} \big\| \big(|C_1| - t_1, \ldots, |C_k| - t_k\big) \big\|_p, \qquad (3)$$
where $t_1, \ldots, t_k \in [n]$ with $\sum_{i=1}^{k} t_i = n$ are given target cluster sizes, $p \in \mathbb{R}_{\ge 1} \cup \{\infty\}$, and $\|\cdot\|_p$ denotes the $\ell_p$-norm. In Appendix F, we provide an $O(n^3 k)$ dynamic programming approach for this problem.

5.2. IP-Stable Clusterings on Trees

In this section, we show how to find an IP-stable 2-clustering when d is a tree metric. Let T = (V, E) be a given weighted tree rooted at r (r can be chosen arbitrarily). Our construction first picks a boundary edge among the edges adjacent to r. We then rotate the boundary edge in a systematic manner until the clustering we have, which consists of the two connected components obtained by removing the boundary edge e from T, is IP-stable. While it could be the case that a non-contiguous IP-stable clustering exists, we only consider contiguous clusterings in our algorithm. Our algorithm implies that such a clustering exists.

For any graph G = (V, E) and $v \in V$, let $C_G(v)$ be the set of vertices reachable from v in G. For any tree T = (V, E) and an edge $e \in E$, let $T \setminus e = (V, E \setminus \{e\})$ be the subgraph of T obtained by removing e. Let N(u) denote the set of neighbors of u in T. We say that $u_f \in N(u)$ is the furthest neighbor of u if $d(u, C_{T \setminus (u, u_f)}(u_f)) \ge d(u, C_{T \setminus (u, w)}(w))$ for any $w \in N(u)$. Next, we define a rotate operation which is crucial for our algorithm and show that the following property holds:

Claim 13. Suppose the current boundary edge is (u, v). We define the operation rotate(u) to be the operation that moves the boundary edge from (u, v) to $(u, u_f)$. After rotate(u) is called, the clustering defined by the boundary edge $(u, u_f)$ is stable for u.

Algorithm. Now we are ready to describe our algorithm. First, pick an edge $(b_0 = r, v)$ arbitrarily among the edges adjacent to the root r, and call rotate($b_0$). After the rotation, let $e_1 = (b_0, b_1)$ denote the boundary edge. By Claim 13, the clustering defined by $e_1$ is stable for $b_0$. If it is also stable for $b_1$, then we have a stable clustering. If not, suppose the boundary edge is now $e_i = (b_{i-1}, b_i)$. By induction, we assume that the clustering is stable for $b_{i-1}$. If this clustering is not stable, then it is not stable for $b_i$. We call rotate($b_i$) to move the boundary edge to $e_{i+1} = (b_i, b_{i+1})$. Here $b_{i+1}$ cannot be $b_{i-1}$; otherwise, the previous clustering would already be stable for $b_i$. We repeat this process until we find a stable clustering. The process terminates because the boundary edge moves further away from the root at every step. Eventually, we reach a point where the furthest neighbor of $b_i$ is $b_{i-1}$. This might happen at a leaf. Once this happens, we have a clustering that is stable for both boundary nodes.
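A minimal sketch of this rotation procedure is given below, assuming the tree is a networkx graph with edge attribute `weight`; it is not optimized (it recomputes components and average distances from scratch) and relies on Claims 13 and 14 for correctness and termination. The function names are ours.

```python
import networkx as nx

def avg_dist(dist, u, component):
    """Average tree distance from u to the nodes of `component` (u excluded)."""
    others = [v for v in component if v != u]
    if not others:
        return 0.0
    return sum(dist[u][v] for v in others) / len(others)

def stable_2_clustering_on_tree(T, root):
    """Boundary-edge rotation for an IP-stable 2-clustering on a weighted tree."""
    dist = dict(nx.all_pairs_dijkstra_path_length(T, weight="weight"))

    def sides(u, v):
        """The two components obtained by removing edge (u, v); u's side first."""
        H = T.copy()
        H.remove_edge(u, v)
        return nx.node_connected_component(H, u), nx.node_connected_component(H, v)

    def furthest_neighbor(u):
        """Neighbor w of u maximizing the average distance from u to w's side."""
        return max(T.neighbors(u), key=lambda w: avg_dist(dist, u, sides(u, w)[1]))

    def stable_for(x, own, other):
        return avg_dist(dist, x, own) <= avg_dist(dist, x, other)

    prev, cur = root, furthest_neighbor(root)          # rotate(b0)
    for _ in range(T.number_of_nodes()):
        side_prev, side_cur = sides(prev, cur)
        if stable_for(cur, side_cur, side_prev):       # stable for both endpoints
            return side_prev, side_cur
        prev, cur = cur, furthest_neighbor(cur)        # rotate(cur)
    return None  # should not be reached per the termination argument above
```

Usage: build the tree with `T.add_edge(u, v, weight=w)` and call `stable_2_clustering_on_tree(T, root)`; it returns the two vertex sets of the 2-clustering.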
In Appendix D.3, we show that when considering a contiguous clustering, if the boundary nodes are stable, then every node is stable. In other words:

Claim 14. Let e = (u, v) be the boundary edge of a 2-clustering of T. If u and v are stable, then the clustering is stable for every node $x \in V$.

Finally, since the boundary edge always moves away from a fixed vertex, we call rotate at most |T| times.

6. Experiments

We ran a number of experiments.6 Our experiments are intended to serve as a proof of concept. They do not focus on the running times of the algorithms or their applicability to large data sets. Hence, we only use rather small data sets of sizes 500 to 1885.

6Code available on https://github.com/amazon-research/ip-stability-for-clustering

Let us define some quantities. For any point x, let
$$\mathrm{Vi}(x) = \max_{C_i \neq C(x)} \frac{\frac{1}{|C(x)|-1} \sum_{y \in C(x)} d(x, y)}{\frac{1}{|C_i|} \sum_{y \in C_i} d(x, y)},$$
where we use the convention that $\frac{0}{0} = 0$. A point x is IP-stable if and only if $\mathrm{Vi}(x) \le 1$. Let $U = \{x \in D : x \text{ is not stable}\}$. We measure the extent to which a k-clustering $\mathcal{C} = (C_1, \ldots, C_k)$ of a data set D is (un-)stable by # Uns = |U| ("number of unstable"), Max Vi = $\max_{x \in D} \mathrm{Vi}(x)$ ("maximum violation"), and Mean Vi ("mean violation"), defined as
$$\text{Mean Vi} = \frac{1}{|U|} \sum_{x \in U} \mathrm{Vi}(x).$$
The clustering $\mathcal{C}$ is IP-stable if and only if # Uns = 0 and Max Vi $\le 1$. Max Vi is the smallest value of t such that $\mathcal{C}$ is a t-approximately IP-stable clustering. We measure the quality of $\mathcal{C}$ w.r.t. the goal of putting similar points into the same cluster by Co ("cost"), defined as the average within-cluster distance. Formally,
$$\mathrm{Co} = \frac{1}{k} \sum_{i \in [k]} \frac{1}{\binom{|C_i|}{2}} \sum_{\{x, y\} \subseteq C_i} d(x, y). \qquad (4)$$

We performed all experiments in Python. We used the standard clustering algorithms from Scikit-learn7 or SciPy8 with all parameters set to their default values.

7https://scikit-learn.org/
8https://scipy.org/

6.1. General Data Sets

We performed the same set of experiments on the first 1000 records of the Adult data set, the Drug Consumption data set (1885 records), and the Indian Liver Patient data set (579 records), which are all publicly available in the UCI repository (Dua and Graff, 2019). As distance function d we used the Euclidean, Manhattan or Chebyshev metric. Here we only present the results for the Adult data set with the Euclidean metric; the other results are provided in Appendix G. Our observations are largely consistent across the different data sets and metrics.

(Un-)Stability of Standard Algorithms. Working with the Adult data set, we only used its six numerical features (e.g., age, hours worked per week), normalized to zero mean and unit variance, for representing records. We applied several standard clustering algorithms as well as the group-fair k-center algorithm of Kleindessner et al. (2019a) (referred to as k-center GF) to the data set (k-means++; k-medoids) or its distance matrix (k-center using the greedy strategy of Gonzalez (1985); k-center GF; single / average / complete linkage clustering). In order to study the extent to which these methods produce (un-)stable clusterings, for k = 2, 5, 10, 15, 20, 30, . . . , 100, we computed # Uns, Max Vi and Mean Vi as defined above for the resulting k-clusterings. For measuring the quality of the clusterings we computed Co as defined in (4).
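The quantities above can be computed directly from a distance matrix and a label vector; the routine below is a sketch (the normalization of Co follows the "average within-cluster distance, averaged over clusters" reading of (4), and Mean Vi is set to 0 when there are no unstable points).

```python
import numpy as np

def stability_report(D, labels):
    """Compute Vi(x) for every point and the summary statistics
    # Uns, Max Vi, Mean Vi and Co for the clustering given by `labels`."""
    D, labels = np.asarray(D, float), np.asarray(labels)
    n = len(labels)
    vi = np.zeros(n)
    for x in range(n):
        own = np.flatnonzero(labels == labels[x]); own = own[own != x]
        own_avg = D[x, own].mean() if own.size else 0.0
        ratios = []
        for c in np.unique(labels):
            if c == labels[x]:
                continue
            other_avg = D[x, np.flatnonzero(labels == c)].mean()
            ratios.append(0.0 if own_avg == other_avg == 0 else
                          np.inf if other_avg == 0 else own_avg / other_avg)
        vi[x] = max(ratios) if ratios else 0.0
    unstable = vi > 1
    mean_vi = vi[unstable].mean() if unstable.any() else 0.0
    # cost Co: average pairwise distance within each (non-singleton) cluster,
    # averaged over clusters -- one natural reading of equation (4)
    costs = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        if idx.size >= 2:
            sub = D[np.ix_(idx, idx)]
            costs.append(sub.sum() / (idx.size * (idx.size - 1)))
    return {"# Uns": int(unstable.sum()), "Max Vi": float(vi.max()),
            "Mean Vi": float(mean_vi),
            "Co": float(np.mean(costs)) if costs else 0.0}
```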
Figure 3. # Uns (top-left), Co (top-right), Max Vi (bottom-left) and Mean Vi (bottom-right) for the clusterings produced by the various algorithms as a function of k. k-center GF denotes the group-fair k-center algorithm of Kleindessner et al. (2019a).

Figure 3 shows the results, where we removed the single linkage algorithm since it performs significantly worse than the other algorithms. For k-means++, k-medoids, k-center and k-center GF we show average results obtained from running them 25 times, since their outcomes depend on random initializations. We can see that, in particular for large values of k, k-center, k-center GF, and the linkage algorithms can be quite unstable, with rather large values of # Uns and Max Vi. In contrast, k-means++ produces quite stable clusterings, with # Uns $\le 90$ and Max Vi $\le 1.4$ even when k is large. For a baseline comparison, for a random clustering in which every data point was assigned to one of k = 100 clusters uniformly at random, we observed # Uns = 990, Max Vi = 11.0, and Mean Vi = 2.5 on average. The k-medoids algorithm performs worse than k-means++, but better than the other algorithms. The clusterings produced by k-center GF, which we ran with the constraint of choosing k/2 female and k/2 male centers, are slightly more stable than the ones produced by k-center. However, note that it really depends on the data set whether group-fairness correlates with IP-stability or not (see Appendix G.1). All methods perform similarly w.r.t. the clustering cost Co.

Heuristics to Improve Standard Algorithms. One might wonder whether we can modify the standard clustering algorithms in order to make them more stable. A natural idea to make any clustering more stable is to change it locally and to iteratively pick a data point that is not stable and assign it to the cluster that it is closest to. However, this idea turned out not to work, as we observed that usually we can only pick a very small number of data points whose reassignment does not cause other data points that are initially stable to become unstable after the reassignment (see Appendix G.2). Another idea that we explored is specifically tied to linkage clustering. We provide the details in Appendix G.3.

6.2. 1-dimensional Euclidean Data Sets

In Appendix H we present experiments with our DP approach on real-world 1-dimensional Euclidean data sets.

7. Discussion

We proposed a notion of IP-stability that aims at data points being well represented by their clusters. Formally, it requires every point, on average, to be closer to the points in its own cluster than to the points in any other cluster. IP-stability is not restricted to centroid-based clustering and raises numerous questions, some of which we addressed: in a general metric space, we showed that it is NP-hard to decide whether an IP-stable k-clustering exists, and we provided approximation algorithms for the problem when the input contains a well-separated clustering or when a small fraction of the input points can be excluded from the output. For one-dimensional Euclidean data sets we can compute an IP-stable clustering in polynomial time. For the case of tree metrics, we showed how to find an IP-stable 2-clustering in polynomial time. We proved that in the worst case some standard clustering algorithms, including k-means++, provide arbitrarily unstable clusterings. However, our experiments show that k-means++ works quite well in practice. While our work focuses mostly on the upper bound side, understanding lower bounds, e.g., hardness of approximation for IP-stability, is an important open question. It is also a natural direction to explore whether our ideas for IP-stable 2-clustering on trees can be extended to k-clustering for k > 2.
Finally, it would be very interesting to provide theoretical evidence supporting our empirical findings that show the surprising effectiveness of k-means++ for our proposed notion of IP-stability, in spite of the existence of bad worst-case scenarios. Acknowledgements SA is supported by the Simons Collaborative grant on Theory of Algorithmic Fairness, and the National Science Foundation grant CCF-1733556. JM is supported by fundings from the NSF AI Institute for the Foundations of Machine Learning (IFML), an NSF Career award, and the Simons Collaborative grant on Theory of Algorithmic Fairness. AV is supported by NSF award CCF-1934843. M. Abbasi, A. Bhaskara, and S. Venkatasubramanian. Fair clustering via equitable group representations. In ACM Conference on Fairness, Accountability, and Transparency (ACM FAcc T), 2021. M. Ackerman and S. Ben-David. Clusterability: A theoretical study. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2009. S. Ahmadi, S. Galhotra, B. Saha, and R. Schwartz. Fair correlation clustering. ar Xiv:2002.03508 [cs.DS], 2020. S. Ahmadian, A. Epasto, R. Kumar, and M. Mahdian. Clustering without over-representation. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2019. S. Ahmadian, A. Epasto, M. Knittel, R. Kumar, M. Mahdian, B. Moseley, P. Pham, S. Vassilvitskii, and Y. Wang. Fair hierarchical clustering. In Neural Information Processing Systems (Neur IPS), 2020a. S. Ahmadian, A. Epasto, R. Kumar, and M. Mahdian. Fair correlation clustering. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2020b. A. Anagnostopoulos, L. Becchetti, M. B ohm, A. Fazzone, S. Leonardi, C. Menghini, and C. Schwiegelshohn. Principal fairness: Removing bias via projections. ar Xiv:1905.13651 [cs.DS], 2019. N. Anderson, S. Bera, S. Das, and Y. Liu. Distributional individual fairness in clustering. ar Xiv:2006.12589 [cs.LG], 2020. D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Symposium on Discrete Algorithms (SODA), 2007. P. Awasthi and M.-F. Balcan. Center based clustering: A foundational perspective. In Handbook of Cluster Analysis. CRC Press, 2014. P. Awasthi, A. Blum, and O. Sheffet. Center-based clustering under perturbation stability. Information Processing Letters, 112(1-2):49 54, 2012. A. Backurs, P. Indyk, K. Onak, B. Schieber, A. Vakilian, and T. Wagner. Scalable fair clustering. In International Conference on Machine Learning (ICML), 2019. M.-F. Balcan and Y. Liang. Clustering under perturbation resilience. SIAM Journal on Computing, 45(1):102 155, 2016. M.-F. Balcan, A. Blum, and S. Vempala. A discriminative framework for clustering via similarity functions. In ACM Symposium on Theory of Computing (STOC), 2008. M.-F. Balcan, A. Blum, and A. Gupta. Clustering under approximation stability. Journal of the ACM (JACM), 60 (2):1 34, 2013. M.-F. Balcan, T. Dick, R. Noothigattu, and A. Procaccia. Envy-free classification. In Neural Information Processing Systems (Neur IPS), 2019. Y. Bartal. Probabilistic approximations of metric spaces and its algorithmic applications. In Symposium on Foundations of Computer Science (FOCS), 1996. S. Bera, D. Chakrabarty, N. Flores, and M. Negahbani. Fair algorithms for clustering. In Neural Information Processing Systems (Neur IPS), 2019. I. O. Bercea, M. Groß, S. Khuller, A. Kumar, C. R osner, D. R. Schmidt, and M. Schmidt. On the cost of essentially fair clusterings. In Approximation, Randomization, and Combinatorial Optimization. 
Algorithms and Techniques (APPROX/RANDOM), 2019. Y. Bilu and N. Linial. Are stable instances easy? Combinatorics, Probability and Computing, 21(5):643 660, 2012. A. Bogomolnaia and M. O. Jackson. The stability of hedonic coalition structures. Games and Economic Behavior, 38 (2):201 230, 2002. B. Brubach, D. Chakrabarti, J. Dickerson, S. Khuller, A. Srinivasan, and L. Tsepenekas. A pairwise fair and community-preserving approach to k-center clustering. In International Conference on Machine Learning (ICML), 2020. B. Brubach, D. Chakrabarti, J. Dickerson, A. Srinivasan, and L. Tsepenekas. Fairness, semi-supervised learning, and more: A general framework for clustering with stochastic pairwise constraints. In AAAI Conference on Artificial Intelligence, 2021. M. E. Celebi and K. Aydin. Unsupervised Learning Algorithms. Springer, 2016. D. Chakrabarty and M. Negahbani. Better algorithms for individually fair k-clustering. ar Xiv preprint ar Xiv:2106.12150, 2021. X. Chen, B. Fain, L. Lyu, and K. Munagala. Proportionally fair clustering. In International Conference on Machine Learning (ICML), 2019. F. Chierichetti, R. Kumar, S. Lattanzi, and S. Vassilvitskii. Fair clustering through fairlets. In Neural Information Processing Systems (NIPS), 2017. E. Chlamt aˇc, Y. Makarychev, and A. Vakilian. Approximating fair clustering with cascaded norm objectives. In ACM-SIAM Symposium on Discrete Algorithms (SODA), 2022. A. Daniely, N. Linial, and M. Saks. Clustering is difficult only when it does not matter. ar Xiv:1205.4891 [cs.LG]], 2012. S. Dasgupta. Performance guarantees for hierarchical clustering. In International Conference on Computational Learning Theory (COLT), 2002. I. Davidson and S. S. Ravi. Making existing clusterings fairer: Algorithms, complexity results and insights. In AAAI Conference on Artificial Intelligence, 2020. J. H. Dreze and J. Greenberg. Hedonic coalitions: Optimality and stability. Econometrica: Journal of the Econometric Society, pages 987 1003, 1980. D. Dua and C. Graff. UCI machine learning repository, 2019. German Credit data set available on https://archive.ics.uci.edu/ml/ datasets/Statlog+(German+Credit+Data). Adult data set available on https://archive. ics.uci.edu/ml/datasets/adult. Drug Consumption data set available on https: //archive.ics.uci.edu/ml/datasets/ Drug+consumption+(quantified). Indian Liver Patient data set available on https: //archive.ics.uci.edu/ml/datasets/ ILPD+(Indian+Liver+Patient+Dataset). C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Innovations in Theoretical Computer Science Conference (ITCS), 2012. E. Elkind, A. Fanelli, and M. Flammini. Price of pareto optimality in hedonic games. In AAAI Conference on Artificial Intelligence, 2016. S. Esmaeili, B. Brubach, L. Tsepenekas, and J. Dickerson. Probabilistic fair clustering. In Neural Information Processing Systems (Neur IPS), 2020. J. Fakcharoenphol, S. Rao, and K. Talwar. A tight bound on approximating arbitrary metrics by tree metrics. Journal of Computer and System Sciences, 69(3):485 497, 2004. E. Fehrman, A. K. Muhammad, E. M. Mirkes, V. Egan, and A. N. Gorban. The five factor model of personality and evaluation of drug consumption risk. ar Xiv:1506.06297 [stat.AP], 2015. Data set available on https: //archive.ics.uci.edu/ml/datasets/ Drug+consumption+(quantified). M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian. Certifying and removing disparate impact. 
In ACM International Conference on Knowledge Discovery and Data Mining (KDD), 2015a. M. Feldman, L. Lewin-Eytan, and J. Naor. Hedonic clustering games. ACM Trans. Parallel Comput., 2(1), 2015b. D. Gale and M. Sotomayor. Some remarks on the stable matching problem. Discrete Applied Mathematics, 11(3): 223 232, 1985. M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979. M. Ghadiri, S. Samadi, and S. Vempala. Socially fair kmeans clustering. In ACM Conference on Fairness, Accountability, and Transparency (ACM FAcc T), 2021. S. Gillen, C. Jung, M. Kearns, and A. Roth. Online learning with an unknown fairness metric. In Neural Information Processing Systems (Neur IPS), 2018. T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38: 293 306, 1985. B. Haeupler, E. Hershkowitz, and G. Zuzic. Tree embeddings for hop-constrained network design. In Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2021. E. Harb and H. Lam. KFC: a scalable approximation algorithm for k-center fair clustering. In Neural Information Processing Systems (Neur IPS), 2020. L. Huang, S. H.-C. Jiang, and N. K. Vishnoi. Coresets for clustering with fairness constraints. In Neural Information Processing Systems (Neur IPS), 2019. C. Ilvento. Metric learning for individual fairness. ar Xiv:1906.00250 [cs.LG], 2019. M. Joseph, M. Kearns, J. Morgenstern, and A. Roth. Fairness in learning: Classic and contextual bandits. In Neural Information Processing Systems (NIPS), 2016. M. Joseph, M. Kearns, J. Morgenstern, S. Neel, and A. Roth. Meritocratic fairness for infinite and contextual bandits. In AAAI / ACM Conference on Artificial Intelligence, Ethics, and Society, 2018. C. Jung, S. Kannan, and N. Lutz. A center in your neighborhood: Fairness in facility location. In Symposium on Foundations of Responsible Computing (FORC), 2020. M. Kleindessner, P. Awasthi, and J. Morgenstern. Fair kcenter clustering for data summarization. In International Conference on Machine Learning (ICML), 2019a. M. Kleindessner, S. Samadi, P. Awasthi, and J. Morgenstern. Guarantees for spectral clustering with fairness constraints. In International Conference on Machine Learning (ICML), 2019b. S. Lloyd. Least squares quantization in pcm. EEE Transactions on Information Theory, 28(2):129 137, 1982. S. Mahabadi and A. Vakilian. Individual fairness for kclustering. In International Conference on Machine Learning (ICML), 2020. K. Makarychev and Y. Makarychev. Metric perturbation resilience. ar Xiv preprint ar Xiv:1607.06442, 2016. Y. Makarychev and A. Vakilian. Approximation algorithms for socially fair clustering. In Conference on Learning Theory (COLT), 2021. E. Micha and N. Shah. Proportionally fair clustering revisited. In 47th International Colloquium on Automata, Languages, and Programming (ICALP 2020), 2020. R. Ostrovsky, Y. Rabani, L. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for the k-means problem. Journal of the ACM (JACM), 59(6):1 22, 2013. C. R osner and M. Schmidt. Privacy preserving clustering with constraints. In International Colloquium on Automata, Languages, and Programming (ICALP), 2018. A. Roth. The evolution of the labor market for medical interns and residents: a case study in game theory. Journal of political Economy, 92(6):991 1016, 1984. M. Schmidt, C. Schwiegelshohn, and C. Sohler. Fair coresets and streaming algorithms for fair k-means clustering. 
ar Xiv:1812.10854 [cs.DS], 2018. S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014. A. Vakilian and M. Yalciner. Improved approximation algorithms for individually fair clustering. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2022. U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395 416, 2007. K. Wagstaff, C. Cardie, S. Rogers, and S. Schr odl. Constrained k-means clustering with background knowledge. In International Conference on Machine Learning (ICML), 2001. A. Other Related Work Other notions of fairness for clustering Fair clustering was first studied in the seminal work of Chierichetti et al. (2017). It is based on the fairness notion of disparate impact (Feldman et al., 2015a), which says that the output of a machine learning algorithm should be independent of a sensitive attribute, and asks that each cluster has proportional representation from different demographic groups. Chierichetti et al. provide approximation algorithms that incorporate their notion into k-median and k-center clustering, assuming that there are only two demographic groups. Several follow-up works extend this line of work to other clustering objectives such as k-means or spectral clustering, multiple or non-disjoint groups, some variations of the fairness notion, to address scalability issues or to improve approximation guarantees (R osner and Schmidt, 2018; Schmidt et al., 2018; Ahmadian et al., 2019; Anagnostopoulos et al., 2019; Backurs et al., 2019; Bera et al., 2019; Bercea et al., 2019; Huang et al., 2019; Kleindessner et al., 2019b; Ahmadian et al., 2020a;b; Ahmadi et al., 2020; Esmaeili et al., 2020; Harb and Lam, 2020). The recent work of Davidson and Ravi (2020) shows that for two groups, when given any clustering, one can efficiently compute the fair clustering (fair according to the notion of Chierichetti et al.) that is most similar to the given clustering using linear programming. Davidson and Ravi also show that it is NP-hard to decide whether a data set allows for a fair clustering that additionally satisfies some given must-link constraints. They mention that such must-link constraints could be used for encoding individual level fairness constraints of the form similar data points must go to the same cluster . However, for such a notion of individual fairness it remains unclear which pairs of data points exactly should be subject to a must-link constraint. Alternative fairness notions for clustering are tied to centroid-based clustering such as k-means, k-median and k-center, where one chooses k centers and then forms clusters by assigning every data point to its closest center: (i) Motivated by the application of data summarization, Kleindessner et al. (2019a) propose that the various demographic groups should be proportionally represented among the chosen centers. (ii) Chen et al. (2019) propose a notion of proportionality that requires that no sufficiently large subset of data points could jointly reduce their distances from their closest centers by choosing a new center. This notion is similar to our notion of IP-stability in that it assumes that an individual data point strives to be well represented (in the notion of Chen et al. by being close to a center) and it also does not rely on demographic group information. However, while our notion aims at ensuring stability for every single data point, the notion of Chen et al. 
only looks at sufficiently large subsets. Micha and Shah (2020) further studied the proportionally fair clustering in Euclidean space and on graph metrics. (iii) Brubach et al. (2020; 2021) introduce the notions of pairwise fairness and community preserving fairness and incorporate them into k-center clustering. (iv) Ghadiri et al. (2021) and Abbasi et al. (2021) study a fair version of k-means clustering where the goal is to minimize the maximum average cost that a demographic group incurs (the maximum is over the demographic groups and the average is over the various data points within a group). Makarychev and Vakilian (2021) and Chlamt aˇc et al. (2022) designed optimal algorithms for minimizing the global clustering cost of a clustering (e.g., k-median and k-means cost) with respect to this notion of fairness and its generalization known as clustering with cascaded norms. B. Proof of Theorem 1 We show NP-hardness of the IP-stable clustering decision problem (with k = 2 and d required to be a metric) via a reduction from a variant of 3-SAT. It is well known that deciding whether a Boolean formula in conjunctive normal form, where each clause comprises at most three literals, is satisfiable is NP-hard. NP-hardness also holds for a restricted version of 3-SAT, where each variable occurs in at most three clauses (Garey and Johnson, 1979, page 259). Furthermore, we can require the formula to have the same number of clauses as number of variables as the following transformation shows: let Φ be a formula with m clauses and n variables. If n > m, we introduce l = n m+1 2 new variables x1, . . . , xl and for each of them add three clauses (xi) to Φ (if n m is odd, we add only two clauses (xl)). The resulting formula has the same number of clauses as number of variables and is satisfiable if and only if Φ is satisfiable. Similarly, if n < m, we introduce l = 3 m n 2 new variables x1, . . . , xl and add to Φ the clauses (x1 x2 x3), (x4 x5 x6), . . . , (xl 2 xl 1 xl) (if m n is odd, the last clause is (xl 1 xl) instead of (xl 2 xl 1 xl)). As before, the resulting formula has the same number of clauses as number of variables and is satisfiable if and only if Φ is satisfiable. So let Φ = C1 C2 . . . Cn be a formula in conjunctive normal form over variables x1, . . . , xn such that each clause Ci comprises at most three literals xj or xj and each variable occurs in at most three clauses (as either xj or xj). We construct a metric space (D, d) in time polynomial in n such that D has an IP-stable 2-clustering with respect to d if and only if Φ is satisfiable (for n sufficiently large). We set D = {True, False, , , C1, . . . , Cn, x1, x1, . . . , xn, xn} d(x, y) = [d (x, y) + 1{x = y}] + 1{x = y} max x,y D [d (x, y) + 1] , x, y D, for some symmetric function d : D D R 0 with d (x, x) = 0, x D, that we specify in the next paragraph. It is straightforward to see that d is a metric. Importantly, note that for any x D, inequality (1) holds with respect to d if and only if it holds with respect to d . We set d (x, y) = 0 for all x, y D except for the following: d (True, False) = A, d (True, ) = B, d ( , False) = C, d (Ci, False) = D, i = 1, . . . , n, d (Ci, ) = E, i = 1, . . . , n, d ( , True) = F, d ( , False) = G, d ( , ) = H, d (Ci, ) = J, i = 1, . . . , n, d (xi, xi) = S, i = 1, . . . , n, d (Ci, xj) = U, (i, j) {(i, j) {1, . . . , n}2 : xj appears in Ci}, d (Ci, xj) = U, (i, j) {(i, j) {1, . . . 
, n}2 : xj appears in Ci}, where we set A = n, F = n2, B = 2F = 2n2, E = 5 J = E + log n = 5 2n2 + log n, D = J + log2 n = 5 2n2 + log n + log2 n, U = 3J = 15 2 n2 + 3 log n, H = n D + E = 5 2n2 + n log n + n log2 n, G = H + 2n2 n 2n log2 n = 5 2n2 n + n log n n log2 n, S = (3n + 3)U = 45 2 n2 + 9n log n + 9 log n, C = A + G + n D 4n2 + n log n. We show that for n 160 there is a satisfying assignment for Φ if and only if there is an IP-stable 2-clustering of D. Satisfying assignment IP-stable 2-clustering Let us assume we are given a satisfying assignment of Φ. We may assume that if xi only appears as xi in Φ and not as xi, then xi is true; similarly, if xi only appears as xi, then xi is false. We construct a clustering of D into two clusters V1 and V2 as follows: V1 = {True, , C1, . . . , Cn} {xi : xi is true in sat. ass.} { xi : xi is true in sat. ass.}, V2 = {False, } {xi : xi is false in satisfying assignment} { xi : xi is false in sat. ass.}. It is |V1| = 2 + 2n and |V2| = 2 + n. We need show that every data point in D is stable.This is equivalent to verifying that the following inequalities are true: Points in V1: True : 1 1 + 2n v V1 d (True, v) = F 1 + 2n A + B 2 + n = 1 2 + n v V2 d (True, v) (6) v V1 d ( , v) = F + n J 1 + 2n G + H 2 + n = 1 2 + n v V2 d ( , v) (7) Ci : 1 1 + 2n v V1 d (Ci, v) J + 2U 1 + 2n U + D + E 2 + n 1 2 + n v V2 d (Ci, v) (8) xi : 1 1 + 2n v V1 d (xi, v) 2U 1 + 2n S 2 + n 1 2 + n v V2 d (xi, v) (9) xi : 1 1 + 2n v V1 d ( xi, v) 2U 1 + 2n S 2 + n 1 2 + n v V2 d ( xi, v) (10) Points in V2: False : 1 1 + n v V2 d (False, v) = C 1 + n A + G + n D 2 + 2n = 1 2 + 2n v V1 d (False, v) (11) v V2 d ( , v) = C 1 + n B + H + n E 2 + 2n = 1 2 + 2n v V1 d ( , v) (12) xi : 1 1 + n v V2 d (xi, v) = 0 S 2 + 2n 1 2 + 2n v V1 d (xi, v) (13) xi : 1 1 + n v V2 d ( xi, v) = 0 S 2 + 2n 1 2 + 2n v V1 d ( xi, v) (14) It is straightforward to check that for our choice of A, B, C, D, E, F, G, H, J, S, U as specified in (5) all inequalities (6) to (14) are true. IP-stable 2-clustering satisfying assignment Let us assume that there is an IP-stable clustering of D with two clusters V1 and V2. For any partitioning of {C1, . . . , Cn} into two sets of size l and n l (0 l n) we denote the two sets by Cl and e Cn l. We first show that xi and xi cannot be contained in the same cluster (say in V1). This is because if we assume that xi, xi V1, for our choice of S and U in (5) we have v V2 d (xi, v) U < S 3n + 2 1 |V1| 1 v V1 d (xi, v) in contradiction to xi being stable. As a consequence we have n |V1|, |V2| 2n + 4. Next, we show that due to our choice of A, B, C, D, E, F, G, H, J in (5) none of the following cases can be true: 1. {True, } Cl V1 and e Cn l {False, } V2 for any 0 l < n In this case, False would not be stable since for all 0 l < n, v V1 d (False, v) = A + G + l D l + 2 + n < C + (n l)D n l + 1 + n = 1 |V2| 1 v V2 d (False, v). 2. {True} Cl V1 and e Cn l {False, , } V2 for any 0 l n In this case, False would not be stable since for all 0 l n, v V1 d (False, v) = A + l D l + 1 + n < C + G + (n l)D n l + 2 + n = 1 |V2| 1 v V2 d (False, v). 3. {False, } Cl V1 and e Cn l {True, } V2 for any 0 l n In this case, True would not be stable since for all 0 l n, v V1 d (True, v) = A + F l + 2 + n < B n l + 1 + n = 1 |V2| 1 v V2 d (True, v). 4. {False} Cl V1 and e Cn l {True, , } V2 for any 0 l n In this case, True would not be stable since for all 0 l n, v V1 d (True, v) = A l + 1 + n < B + F n l + 2 + n = 1 |V2| 1 v V2 d (True, v). 5. 
{ , } Cl V1 and e Cn l {False, True} V2 for any 0 l n In this case, would not be stable since for all 0 l n, v V2 d ( , v) = B + C + (n l)E n l + 2 + n < H + l E l + 1 + n = 1 |V1| 1 v V1 d ( , v). 6. { } Cl V1 and e Cn l {False, True, } V2 for any 0 l n In this case, would not be stable since for all 0 l n, v V1 d ( , v) = H + l J l + 1 + n < F + G + (n l)J n l + 2 + n = 1 |V2| 1 v V2 d ( , v). 7. Cl V1 and e Cn l {True, False, , } V2 for any 0 l n In this case, True would not be stable since for all 0 l n, v V1 d (True, v) = 0 < A + B + F 3 + (n l) + n = 1 |V2| 1 v V2 d (True, v). 8. { } Cl V1 and e Cn l {True, False, } V2 for any 0 l n In this case, True would not be stable since for all 0 l n, v V1 d (True, v) = F 1 + l + n < A + B 2 + (n l) + n = 1 |V2| 1 v V2 d (True, v). Of course, in all these cases we can exchange the role of V1 and V2. Hence, True, , C1, . . . , Cn must be contained in one cluster and , False must be contained in the other cluster. W.l.o.g., let us assume True, , C1, . . . , Cn V1 and , False V2 and hence |V1| = 2n + 2 and |V2| = n + 2. Finally, we show that for the clause Ci = (lj) or Ci = (lj lj ) or Ci = (lj lj lj ), with the literal lj equaling xj or xj, it cannot be the case that Ci, lj or Ci, lj, lj or Ci, lj, lj , lj are all contained in V1. This is because otherwise 1 |V2| v V2 d (Ci, v) = D + E n + 2 < U + J 2n + 1 1 |V1| 1 v V1 d (Ci, v) (15) for our choice of D, E, J, U in (5) and Ci would not be stable. Consequently, since xj and xj are not in the same cluster, for each clause Ci at least one of its literals must be in V1. Hence, if we set every literal xi or xi that is contained in V1 to a true logical value and every literal xi or xi that is contained in V2 to a false logical value, we obtain a valid assignment that makes Φ true. C. Missing Proofs from Section 4 C.1. Proof of Claim 4 Proof. The idea is to extend nodes corresponding to actual points in V so that they are the leaves and that they are at the same depth. The distortion of the modified tree will be worse by a constant factor, but that is fine for our purpose as the construction has O(polylog(n)) distortion. Let (V , d T ) be a given partial tree metric from the construction in Theorem 6 and let ℓ= depth(d T ) be the depth of the tree. If a node v V is not a leaf of T, we augment T by replacing v with v and connecting v to v with a path v , v1, v2, . . . , vℓ depth T (v) 1, v. Let T be the tree after our modification. By this augmentation, v is now a leaf in T and depth T (v) = ℓ 1. Let (V , d T ) be the new tree metric. For any u V , we show that d T (u, v) 3d T (u, v). In words, this is because the weight of the added path between v and v is a telescopic sum, and this sums up to at most the weight of the edge connecting v to its parent in T, i.e. p T (v). That is, d T (u, v) = d T (u, v ) + d T (v , v) = d T (u, v) + d T (v , v) d T (u, v) + d T (p T (v ), v ) = d T (u, v) + d T (p T (v), v) 3d T (u, v). Since we can apply this argument to any node, the claim holds true. C.2. Proof of Claim 5 Proof. Let T be our HST tree rooted at r. Let S = {v1, v2 . . . , vk} be a set of nodes that we are going to pick. Suppose they are sorted in such a way that for every 1 i < k, depth(vi) depth(vi+1). We let depth(r) = 0, and for every node u, depth(u) = depth(p(u)) + 1 where p(u) is the parent of u in T. To find S, consider the lowest depth, i.e., furthest from the root, ℓsuch that the number of nodes at depth ℓis at most k. Let Vℓbe the set of nodes at depth ℓ. Initially, S = . 
For each v ∈ Vℓ, if |S| + |children(v)| + |Vℓ| − 1 < k, then add children(v) to S, remove v from Vℓ, and continue. Otherwise, if |S| + |children(v)| + |Vℓ| − 1 = k, then add children(v) and Vℓ \ {v} to S. Since |S| = k after this operation, we are done. In the last case, where |S| + |children(v)| + |Vℓ| − 1 > k, select a subset of children(v) of size k − |S| − |Vℓ| + 1. Let this subset be Vv and let S = S ∪ (Vℓ \ {v}) ∪ Vv. An example is given in Figure 4 for k = 4.

Figure 4. IP-stable 4-clustering on a 2-HST. S is the set of the red nodes.

For i ≤ k, let Ti be the subtree of T \ S j γ2+1 (γ 1)2 2, then there exists x, x′ ∈ D and y, y′ ∈ D′ that violate the inequality in Corollary 1, so D and D′ must belong to the same underlying cluster. It remains to show (3). Suppose D belongs to the underlying cluster CD and D′ belongs to another underlying cluster CD′; then d(x, y) ≥ ((γ − 1)/γ) · d(x, CD′) ((1) in Lemma 1). However, d(x, x′) > (2γ/(γ − 1)²) · d(x, y) > (2γ/(γ − 1)²) · ((γ − 1)/γ) · d(x, CD′) = (2/(γ − 1)) · d(x, CD′). Since we know that x, x′ belong to the same underlying cluster, this is a contradiction.

D. Missing Proofs of Section 5

D.1. Proof of Theorem 12

Proof. Initially, start with the leftmost cluster containing n − k + 1 points and all remaining k − 1 clusters being singletons. Imagine that the clusters are separated by separators. An example is given in Figure 5 for k = 4.

Figure 5. Initialization for k = 4 and n = 8.

We show that an IP-stable k-clustering can be found by only moving the separators to the left. In this proof, we mainly argue about the stability of the boundary nodes of the clusters; however, Lemma 5 in Appendix F shows that stability of the boundary nodes implies stability of all the nodes. Since all the clusters but the first one are singletons, and singleton clusters are IP-stable, the only separator that might want to move is the first separator. The first separator moves to the left until the first cluster becomes IP-stable. In the next step, the only cluster that is not IP-stable is the second cluster, since the first cluster has just been made IP-stable and the rest are singletons. The second cluster is unstable due to some nodes on its right-hand side that want to join the third cluster by moving the second separator to the left (note that the leftmost node of the second cluster is stable since the first separator has just moved). If the second cluster shrinks, some rightmost nodes of the first cluster might become unstable and want to join the second cluster by moving the first separator to the left. The same scenario repeats over and over again until the first two clusters become stable by moving the first two separators only to the left. Using the same argument, at each step i, the only cluster that is not IP-stable is the ith cluster, and the first i clusters can be made stable by moving the first i separators only to the left. Consider the last, i.e., the kth, step. If the kth cluster is a singleton, then it is IP-stable, and our algorithm has converged to an IP-stable k-clustering. Assume the kth cluster is not a singleton. The reason that it is not a singleton is that the points on its left-hand side preferred to move from the (k − 1)th cluster to the kth cluster. If the kth cluster is a better cluster for its leftmost point, then it is also a better cluster for all the other points inside it. Therefore, the kth cluster is an IP-stable cluster, and our algorithm converges to an IP-stable k-clustering.
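To make the separator-moving procedure concrete, here is a minimal Python sketch (the function and variable names are our own; it is an illustration rather than the implementation used in our experiments). It re-scans the separators until no move is needed, shifting a separator one step to the left whenever the rightmost point of the cluster on its left prefers the cluster on its right; checking only these boundary points relies on the argument above, and the quadratic-time distance sums below can be replaced by the O(1) bookkeeping of Appendix D.2.

```python
def ip_stable_on_line(points, k):
    """Separator-moving sketch for IP-stable k-clustering on the real line.
    Start with one large leftmost cluster and k - 1 singletons; separators
    only ever move to the left. Written for clarity, not for the O(kn)
    running time analyzed in Appendix D.2."""
    xs = sorted(points)
    n = len(xs)
    assert 1 <= k <= n
    # boundaries b[0] < b[1] < ... < b[k]; cluster j consists of xs[b[j]:b[j+1]]
    b = [0] + list(range(n - k + 1, n + 1))

    def avg_dist(i, lo, hi):
        # average distance from xs[i] to the points xs[lo:hi], excluding xs[i] itself
        idx = [j for j in range(lo, hi) if j != i]
        return sum(abs(xs[i] - xs[j]) for j in idx) / len(idx) if idx else 0.0

    moved = True
    while moved:
        moved = False
        for j in range(1, k):              # separator j lies between clusters j-1 and j
            if b[j] - b[j - 1] <= 1:       # cluster j-1 is a singleton and hence stable
                continue
            r = b[j] - 1                   # rightmost point of cluster j-1
            if avg_dist(r, b[j - 1], b[j]) > avg_dist(r, b[j], b[j + 1]):
                b[j] -= 1                  # the separator moves one step to the left
                moved = True
    return [xs[b[j]:b[j + 1]] for j in range(k)]
```

For instance, ip_stable_on_line([0, 1, 7, 8], 2) returns [[0, 1], [7, 8]], the 2-clustering used in the example of Appendix G.1.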
By using the fact that a separator, if it moves, only moves to the left, and that therefore each separator moves at most n times, we show in Lemma 2 in Appendix D.2 that the running time of the algorithm is O(kn).

D.2. Running Time of the Algorithm for 1-dimensional Euclidean Metric

Lemma 2. The running time of the algorithm described for the 1-d Euclidean metric case is O(kn).

Proof. First, we argue that if the following information is maintained for each cluster, then the stability of each boundary node can be checked in O(1): the sum of distances from the leftmost node and from the rightmost node to all other nodes in the cluster, and the number of nodes in the cluster. Suppose that u is the rightmost node of a cluster Ci. In order to check whether u is stable, we compare the average distance from u to its own cluster Ci with the average distance from u to the adjacent cluster Ci+1. Let v be the node adjacent to u in Ci+1; then d(u, Ci+1) = |Ci+1| · d(u, v) + d(v, Ci+1). Since d(u, Ci), d(v, Ci+1), |Ci| and |Ci+1| are all maintained, we can check whether d(u, Ci)/(|Ci| − 1) ≤ d(u, Ci+1)/|Ci+1| in O(1). The argument for the case that u is the leftmost node of a cluster Ci is analogous. At the beginning of the algorithm, since our initialization has only one big cluster, computing the needed information for the two boundary nodes of the big cluster takes O(n). Since the only operation that our algorithm performs is moving a separator to the left, we show that whenever such an operation happens, all the stored information that needs to be updated can be updated in O(1). Suppose that when a separator moves to the left, the rightmost node of Ci, let's say u, moves to Ci+1. Let s, t, v, w be the leftmost node of Ci, the node next to u in Ci, the node next to u in Ci+1, and the rightmost node of Ci+1 before the move, respectively. Also, let C̃i = Ci \ {u} and C̃i+1 = Ci+1 ∪ {u} denote the updated clusters Ci and Ci+1 after u joins the cluster on its right. As |C̃i| = |Ci| − 1 and |C̃i+1| = |Ci+1| + 1, updating the numbers of nodes in the two clusters is trivial. Now we show how to update the other information, namely d(s, C̃i), d(t, C̃i), d(u, C̃i+1), d(w, C̃i+1):

d(s, C̃i) = d(s, Ci) − d(s, u)
d(t, C̃i) = d(u, Ci) − |C̃i| · d(t, u)
d(u, C̃i+1) = d(v, Ci+1) + |Ci+1| · d(u, v)
d(w, C̃i+1) = d(w, Ci+1) + d(u, w)

As the quantities on the right-hand sides are all maintained, each update can be performed in O(1). Since in the algorithm each separator only moves to the left, the total number of times that separators are moved is at most kn. Each time a separator wants to move, the stability of the corresponding boundary node is checked in O(1). Also, if a separator moves, updating the stored information takes O(1). Hence, the total time complexity of the algorithm is O(kn).

D.3. Proof of Claim 14

Proof. Let C1, C2 be the clusters containing u and v, respectively. W.l.o.g., suppose |C1| > 1 and x ∈ C1. Our goal is to show that d(x, C1) ≤ d(x, C2). Since the path from x to any y ∈ C2 has to go through u, we have that d(x, C2) = d(x, u) + d(u, C2). As u is stable, d(u, C1) ≤ d(u, C2). If we can show that d(x, C1) ≤ d(x, u) + d(u, C1), then we are done. This is true as

d(x, C1) = (1/(|C1| − 1)) Σ_{y ∈ C1, y ≠ x} d(x, y) ≤ (1/(|C1| − 1)) Σ_{y ∈ C1, y ≠ x} (d(x, u) + d(u, y)) = d(x, u) + (1/(|C1| − 1)) Σ_{y ∈ C1, y ≠ x} d(u, y) ≤ d(x, u) + (1/(|C1| − 1)) Σ_{y ∈ C1} d(u, y) = d(x, u) + d(u, C1).

E. Hard Instances for k-means++, k-center, and Single Linkage

In this section, we prove Theorem 2 by showing hard instances for the k-means++, k-center, and single linkage clustering algorithms.

E.1.
Hard Instances for k-means++ Here, we show that for any given approximation factor α > 1, there exists an instance Iα and a target number of clusters kα such that with constant probability k-means++ outputs a kα-clustering of Iα that violates the requirement of IP-stable clustering by a factor of α. High-level description of k-means++. k-means++ is an algorithm for choosing the initial seeds for the Lloyd s heuristic that provides a provable approximation guarantee (Arthur and Vassilvitskii, 2007). The high-level intuition is to spread out the initial set of k centers. The first center is picked uniformly at random, after which each subsequent center is picked from the remaining set points according to the probability distribution proportional to the squared distance of the points to the so-far-selected set of centers. Once the seeding phase picks k centers, k-means++ performs Lloyd s heuristic. Before stating the argument formally, we show how to construct the hard instance Iα for any given approximation parameter α > 1 for the IP-stability requirement. Structure of the hard instance. Given the approximation parameter α, Iα is constructed as follows. The instance consists of nα copies of block I such that each adjacent blocks are at distance Dα from each other; see Figure 6. As we explain in more detail, the solution SOL returned by k-means++ violates the IP-stability by a factor of α if there exists a block Ij Iα such that SOL Ij = {vj, uj}. In the rest of this section, our goal is to show that with constant probability this event happens when kα = 13 I I1 I2 Inα αr v u v1 u1 v2 u2 vnα unα Figure 6. The construction of hard instance for a given parameter α. (a) shows the structure of a building block I and (b) shows a hard instance which constitutes of nα blocks I1, . . . , Inα. Claim 15. Suppose that v and u are picked as the initial set of centers for 2-clustering of I. Then, the solution returned by k-means++ violates the IP-stability requirement by a factor of α. Proof. It is straightforward to verify that once v and u are picked as the initial set of centers, after a round of Lloyd s heuristic v and u remain the cluster centers and k-means++ stops. This is the case since when v and u are centers initially the clusters are {z, z , v} and {u}. The centroids of these two clusters are respectively v and u. Hence, k-means++ outputs {z, z , v} and {u} as the 2-clustering of I. The maximum violation of the IP-stable requirement for this 2-clustering corresponds to v which is equal to (d(v,z)+d(v,z ))/2 d(v,u) = α. In this section, the squared distance function is denoted by . Here is the main theorem of this section. Theorem 16. For any given parameter α, there exists an instance Iα such that with constant probability the solution returned by k-means++ on (Iα, kα = 13 12nα) is violating the requirement of IP-stable clustering by a factor of α the maximum violation over the points in Iα is at least α. To prove the above theorem, we show that by setting the values of Dα and nα properly, k-means++ with constant probability picks the points corresponding to u, v from at least one of the copies Ij (and picks no other points from Ij). Lemma 3. Let S1 denote the set of centers picked by k-means++ on (Iα, 13 12nα) after the first nα iterations in the seeding phase where nα 18700. If Dα > αr q ε where ε < 1 100nα , then with probability at least 0.98, 1. S1 contains exactly one point from each block; I Iα, |S1 I| = 1, and 2. 
In at least nα 10 blocks, copies of v are picked; |j [nα] | Ij S1 = {vj}| nα Proof. We start with the first property. Consider iteration i nα of k-means++ and let Si 1 denote the set of points that are selected as centers in the first i 1 iterations. Let I+ i 1 := {Ij | Ij Si 1 = } and I i 1 := {Ij | Ij Si 1 = }. Since for any pair of points p I, p I where I = I , (p, p ) D2 α, for any point p I i 1, (p, Si 1) D2 α. Moreover, by the construction of the building block (see Figure 6-(a)), for any point p I+ i 1, (p , Si 1) 4α2r2 4ε 3nα D2 α. Hence, the probability that in iteration i the algorithm picks a point from I i 1 is P p I i 1 (p, Si 1) P p I i 1 (p, Si 1) + P p I+ i 1 (p, Si 1) 4D2 α 4D2α + 3nα (4α2r2) 1 1 + ε 1 ε Thus, the probability that i-th point is picked from I+ i 1 is at most ε. By union bound over the first nα iterations of k-means++, the probability that S1 picks exactly one point from each block in Iα is at least i=1 Pr[i-th point is picked from I+ i 1] 1 ε nα 0.99 (16) Next, we show that the second property also holds with high probability. Consider iteration i of the algorithm. By the approximate triangle inequality for function9 and since the distance of any pair of points within any block is 2αr, for a point p Ij I i 1, η2 (p, Si 1) 2(η2 + 4α2r2) where η Dα. In particular, for each block Ij I i 1, (vj, Si 1) P p Ij (p, Si 1) η2 7η2 + 24α2r2 1/(8 + ε Let Xi be a random variable which indicates that the i-th point belongs to {vj|j [nα]}; Xi = 1 if the i-th point belongs to {vj|j [nα]} and is equal to zero otherwise. Ij I i 1 (vj, Si 1) P Ij I i 1 P p Ij (p, Si 1) + P p I+ i 1 (p, Si 1) D2 α (8 + ε/nα) D2α + P p I+ i 1 (p, Si 1) by Eq. (17) D2 α (8 + ε/nα) D2α + 3nα (4α2r2) p I+ i 1, (p, Si 1) 4α2r2 D2 α (8 + ε/nα) D2α + ε D2α > 1/9 Note that we showed that Pr[Xi = 1] 1/9 independent of the algorithm s choices in the first i 1 iterations of the algorithm. In particular, random variables X1, . . . , Xnα are stochastically dominated by independent random variable Y1, . . . , Ynα where each Yi = 1 with probability 1/9 and is zero otherwise. Hence, the random variable X := X1+. . .+Xnα is stochasitcally dominated by Y which is B(nα, 1/9), the binomial distribution with nα trials and success probability 1/9. Hence, by an application of Hoeffding s inequality on Y , 10 ] Pr[Y nα 10 ] exp( 2nα(1 10)2) 0.01 for nα > 18700 (18) Thus, by Eq. (16) and (18), with probability at least 0.98 both properties hold. Lemma 4. Let S be the set of centers picked at the end of the seeding phase of k-means++ on (Iα, kα = 13 12nα) where nα > max(9000, 240α2). Then, with probability at least 0.33, there exists a block Ij Iα such that S Ij = {vj, uj}. Proof. Let T1 denote the event that S1, the set of centers picked in the first nα iterations, picks exactly one point from each block and in at least nα/10 blocks the v-type points are picked in S1. Note that by an application of Lemma 3, Pr[T1] 0.98 with probability at least 0.98, T1 holds. First we show the following. Let mα = nα/20. With constant probability, there exists an iteration i [nα +1, nα +mα] in which k-means++ picks a u-type point form Ij such that Ij Si 1 = {vj}. Let Ei denote the event that in iteration i + nα where i [mα], k-means++ picks the point uj from Ij such that Si 1 Ij = {vj}. Note that for each block in 9 u, v, w P, (v, w) 2( (v, u) + (u, w)). {Ij | S1 Ij = {vj}}, P p Ij (p, S1) = (2α2 + 1)r2; otherwise, P p Ij (p, S1) (5α2 + 1)r2. 
Hence, for each i [mα], conditioned on T1 and independent of the prior choices of the algorithm in iterations nα +1, . . . , i := nα +i 1, Pr[Ei|T1] = {Ij | Ij Si ={vj}} (uj, Si ) P {Ij | Ij Si ={vj}} P p Ij (p, Si ) + P {Ij | Ij Si ={vj}} P p Ij (p, Si ) 10 mα) (2α2 + 1)r2 + (nα ( nα 10 mα)) (5α2 + 1)r2 mα mα(2α2 + 1) + (nα mα)(5α2 + 1) 1 2α2 + 1 + 20(5α2 + 1) For each i [mα], let random variable Xi := 1Ei|T1. Since X1, . . . , Xmα are stochastically dominated by independent random variables Yi such that Yi = 1 with probability 1/(123α2) and zero otherwise, we can bound the probability that none of E1, . . . , Emα happens as follows. Pr[ E1 . . . Emα|T1] = Pr[ X i mα Xi < 1|T1] i mα Yi < 1] = (1 1 123α2 )mα exp( mα 123α2 ) (19) Hence, with probability at least 1 exp( mα 123α2 ), there exists an iteration i [nα + 1, nα + mα] and block I j such that I j Si = {vj, uj}. Next, we show that with constant probability no more points is picked from I j in the remaining iterations of the seeding phase k-means++. Let T2 denote the event that there exists (I j , i ) such that I j Si = {vj, uj}; T2 := E1 . . . Emα. For any i [1, nα + mα i ], let Fi denote the event that in iteration i + i, k-means++ picks another point from I j . Pr[Fi|T1 T2] < 2α2r2 P {Ij | Ij Si 1={vj}} P p Ij (p, Si 1) 10 mα) (2α2 + 1)r2 = 2 3mα (20) Let tα := nα + mα i and define T3 to denote the event that none of F1, . . . , Ftα happens. For each i [tα], let random variable Xi := 1Fi|T1 T2. Since X1, . . . , Xmα are stochastically dominated by independent random variables Yi such that Yi = 1 with probability 1/(123α2) and zero otherwise, we can lower bound the probability that event T3 holds as follows. Pr[T3|T1 T2] = Pr[ F1 . . . Ftα|T1 T2] Pr[ X Yi < 1|T1 T2] (1 2 3mα )tα by Eq. (20) (1 2 3mα ) exp( 2 3) tα mα (21) Hence, Eq. (19) and Eq. (21) together imply that the probability that there exists no block I j such that |I j S| = {vj, uj} is at most Pr[ T1] + Pr[ T2|T1] + Pr[ T3|T1 T2] 0.02 + e mα 123α2 + 1 (1 2 3mα )e 2 0.02 + e mα 123α2 + 1 2 + 1 3mα mα 240α2 0.02 + 0.145 + 0.5 + 0.005 < 0.67 Thus, with probability at least 0.33, at the end of the seeding phase, there exists a block I j such that I j S = {vj, uj}. Now, we are ready to prove Theorem 16. Proof of Theorem 16. Note that Lemma 3 together with Lemma 4 implies that with constant probability, at the end of the seeding phase, at least one point is picked from each block in Iα and there exist a block Ij such that Ij S = {vj, uj}. Given the structure of Iα, after each step of the Lloyd s heuristic, no points from two different blocks will be in the same cluster. Furthermore, by Claim 15, the clusters of Ij which are initially {zj, z j, vj}, {uj} remain unchanged in the final solution of k-means++. However, such clustering is violating the requirement of IP-stable clustering for vj by a factor of α. Thus, with constant probability, the solution of k-means++ on (Iα, kα) violates the requirement of IP-stable clustering by a factor of at least α. E.2. Hard Instances for k-center 1+# 1 # 1/2 % &! &" # '! Figure 7. A hard instance for k-center. Figure 7 shows a hard instance for k-center. Each of the balls B1 and B2 have radius ϵ. The center of B1 has a distance of 1 + ϵ from c1, and a distance of 1 ϵ from c2. The center of B2 has a distance of 1 ϵ from c1, and a distance of 1 + ϵ from c2. The nodes of the graph are c1, c2, n points on the surface of B1, and n points on the surface of B2. Consider k-center using the greedy strategy by (Gonzalez, 1985). 
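For concreteness, a minimal sketch of this greedy farthest-first strategy is given below (the naming is ours; it is an illustration following the usual description of the algorithm, not the implementation used in our experiments): the first center is chosen arbitrarily, every further center is a point farthest from the centers chosen so far, and finally each point is assigned to its closest center.

```python
def greedy_k_center(points, k, dist, first=0):
    """Farthest-first traversal (greedy k-center): starting from an arbitrary
    first center, repeatedly add the point whose distance to the current
    centers is largest, then assign every point to its nearest center."""
    centers = [first]                                  # indices of chosen centers
    d_near = [dist(points[first], p) for p in points]  # distance to nearest center so far
    for _ in range(k - 1):
        far = max(range(len(points)), key=d_near.__getitem__)
        centers.append(far)
        d_near = [min(d_near[i], dist(points[far], p)) for i, p in enumerate(points)]
    clusters = {c: [] for c in centers}
    for p in points:                                   # assign each point to its closest center
        clusters[min(centers, key=lambda c: dist(points[c], p))].append(p)
    return list(clusters.values())
```

On the instance of Figure 7, if c1 is picked first, the farthest point from it is c2, which leads to the assignment analyzed next.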
Suppose k = 2, and the first center picked is c1. The furthest node from c1 is c2, and hence it is the second center picked. Since all the points get assigned to the closest center, all the points on B1 get assigned to c2, except the point p that has the same distance from c1 and c2 and gets assigned to c1. All the points on B2 get assigned to c1. We show such a clustering is unstable for p within a factor of n/8. The average distance of p to the nodes in its own cluster is at least n(1/2 2ϵ)+1 n+1 . The average distance of p to the nodes assigned to c2 is at most 2ϵ(n 1)+1 n . For ϵ 1/2n, the instability for p is at least within a multiplicative factor of n/8. By setting n > 8α, p is not t-approximately IP-stable for any value t < α. E.3. Hard Instances for Single Linkage Figure 8 shows a bad example for single linkage clustering when k = 2. A possible implementation of single linkage first merges v2 and v3. Next, it merges v4 and {v2, v3}, and repeatedly at each iteration i adds vi+2 to the set {v2, v3, . . . , vi+1}. When there are only two clusters {v2, v3, . . . , vn} and v1 left, the algorithm terminates. By setting ϵ = 1/2, v2 gets unstable by a factor of (n 1)/4. By setting (n 1)/4 > α, v2 is not t-approximately IP-stable for any value of t < α. 1 1 # 1 # 1 # 1 # +! +" +# +$ +% +&'! +& Figure 8. A hard instance for single linkage clustering. F. Dynamic Programming Approach for Solving (3) In the following, we propose an efficient DP approach to find a solution to (3). Let D = {x1, . . . , xn} with x1 . . . xn. Our approach builds a table T (N { })n n k with T(i, j, l) = min (C1,...,Cl) Hi,j,l (|C1| t1, . . . , |Cl| tl) p p (22) for i [n], j [n], l [k], where Hi,j,l = C = (C1, . . . , Cl) : C is a stable l-clustering of {x1, . . . , xi} with l non-empty contiguous clusters such that the right-most cluster Cl contains exactly j points and T(i, j, l) = if Hi,j,l = . Here, we consider the case p = . The modifications of our approach to the case p = are minimal and are described later. The optimal value of (3) is given by minj [n] T(n, j, k)1/p. Below, we will describe how to use the table T to compute an IP-stable k-clustering solving (3). First, we explain how to build T. We have, for i, j [n], T(i, j, 1) = ( |i t1|p, j = i, , j = i , T(i, j, i) = (Pi s=1 |1 ts|p, j = 1, , j = 1 , T(i, j, l) = , j + l 1 > i, and the recurrence relation, for l > 1 and j + l 1 i, T(i, j, l) = |j tl|p + min T(i j, s, l 1) : s [i j (l 2)], Ps 1 f=1 |xi j xi j f| Pj f=1 |xi j xi j+f| Pj f=2 |xi j+1 xi j+f| Ps 1 f=0 |xi j+1 xi j f| where we use the convention that 0 0 = 0 for the fractions on the left sides of the inequalities. First, we explain relation (24). Before that we need to show the following lemma holds: Lemma 5 (Stable boundary points imply stable clustering). Let C = (C1, . . . , Ck) be a k-clustering of D = {x1, . . . , xn}, where x1 x2 . . . xn, with contiguous clusters C1 = {x1, . . . , xi1}, C2 = {xi1+1, . . . , xi2}, . . . , Ck = {xik 1+1, . . . , xn}, for some 1 i1 < . . . < ik 1 < n. Then C is IP-stable if and only if all points xil and xil+1, l [k 1], are stable. Furthermore, xil (xil+1, resp.) is stable if and only if its average distance to the points in Cl \ {xil} (Cl+1 \ {xil+1}, resp.) is not greater than the average distance to the points in Cl+1 (Cl, resp.). Proof. We assume that D = {x1, . . . , xn} R with x1 x2 . . . xn and write the Euclidean metric d(xi, xj) between two points xi and xj in its usual way |xi xj|. 
If C is IP-stable, then all points xil and xil+1, l [k 1], are stable. Conversely, let us assume that xil and xil+1, l [k 1], are stable. We need to show that all points in D are stable. Let x Cl = {xil 1+1, . . . , xil} for some l {2, . . . , k 1} and l {l + 1, . . . , k}. Since xil is stable, we have y Cl (xil y) = 1 |Cl| 1 y Cl |xil y| 1 |Cl | y Cl |xil y| = 1 |Cl | y Cl (y xil) y Cl | x y| 1 |Cl| 1 y Cl\{ x} (| x xil| + |xil y|) = (xil x) + 1 |Cl| 1 y Cl\{ x} (xil y) (xil x) + 1 |Cl | y Cl (y xil) y Cl | x y|. Similarly, we can show for l {1, . . . , l 1} that y Cl | x y| 1 |Cl | y Cl | x y|, and hence x is stable. Similarly, we can show that all points x1, . . . , xi1 1 and xik 1+2, . . . , xn are stable. For the second claim observe that for 1 s l 1, the average distance of xil to the points in Cs cannot be smaller than the average distance to the points in Cl \ {xil} and for l + 2 s k, the average distance of xil to the points in Cs cannot be smaller than the average distance to the points in Cl+1. A similar argument proves the claim for xil+1. It follows from Lemma 5 that a clustering (C1, . . . , Cl) of {x1, . . . , xi} with contiguous clusters and Cl = {xi j+1, . . . , xi} is IP-stable if and only if (C1, . . . , Cl 1) is a stable clustering of {x1, . . . , xi j} and the average distance of xi j to the points in Cl 1 \ {xi j} is not greater than the average distance to the points in Cl and the average distance of xi j+1 to the points in Cl \ {xi j+1} is not greater than the average distance to the points in Cl 1. The latter two conditions correspond to the two inequalities in (24) (when |Cl 1| = s, where s is a variable). By explicitly enforcing these two constraints, we can utilize the first condition and rather than minimizing over Hi,j,l in (25), we can minimize over both s [i j (l 2)] and Hi j,s,l 1 (corresponding to minimizing over all IP-stable (l 1)-clusterings of {x1, . . . , xi j} with non-empty contiguous clusters). It is min s [i j (l 2)] (C1,...,Cl 1) Hi j,s,l 1 (|C1| t1, . . . , |Cl 1| tl 1) p p = min s [i j (l 2)] T(i j, s, l 1), and hence we end up with the recurrence relation (24). It is not hard to see that using (24), we can build the table T in time O(n3k). Once we have T, we can compute a solution (C 1, . . . , C k) to (3) by specifying |C 1|, . . . , |C k| in time O(nk) as follows: let v = minj [n] T(n, j, k). We set |C k| = j0 for an arbitrary j0 with v = T(n, j0, k). For l = k 1, . . . , 2, we then set |C l | = h0 for an arbitrary h0 with (i) T(n Pk r=l+1 |C r |, h0, l) + Pk r=l+1 ||C r | tr|p = v , (ii) the average distance of xn Pk r=l+1 |C r | to the closest h0 1 many points on its left side is not greater than the average distance to the points in C l+1, and (iii) the average distance of xn Pk r=l+1 |C r |+1 to the other points in C l+1 is not greater than the average distance to the closest h0 many points on its left side. Finally, it is |C 1| = n Pk r=2 |C r |. It follows from the definition of the table T in (22) and Lemma 5 that for l = k 1, . . . , 2 we can always find some h0 satisfying (i) to (iii) and that our approach yields an IP-stable k-clustering (C 1, . . . , C k) of D. Hence we have shown the following theorem: Theorem 17 (Efficient DP approach solves (3)). By means of the dynamic programming approach (22) to (24) we can compute an IP-stable clustering solving (3) in running time O(n3k). Let us first explain the recurrence relation (24): because of (x1, . . . , xl) p p = (x1, . . . , xl 1) p p + |xl|p and for every clustering (C1, . . . 
, Cl) Hi,j,l it is |Cl| = j, we have T(i, j, l) = |j tl|p + min (C1,...,Cl) Hi,j,l (|C1| t1, . . . , |Cl 1| tl 1) p p. (25) Now we describe how to modify the dynamic programming approach of Section 5.1 to the case p = : in this case, we replace the definition of the table T in (22) by T(i, j, l) = min (C1,...,Cl) Hi,j,l (|C1| t1, . . . , |Cl| tl) , i [n], j [n], l [k], and T(i, j, l) = if Hi,j,l = as before. The optimal value of (3) is now given by minj [n] T(n, j, k). Instead of (23), we have, for i, j [n], T(i, j, 1) = ( |i t1|, j = i, , j = i , T(i, j, i) = ( maxs=1,...,i |1 ts|, j = 1, , j = 1 T(i, j, l) = , j + l 1 > i, and the recurrence relation (24) now becomes, for l > 1 and j + l 1 i, T(i, j, l) = max |j tl|, min T(i j, s, l 1) : s [i j (l 2)], f=1 |xi j xi j f| 1 f=1 |xi j xi j+f|, f=2 |xi j+1 xi j+f| 1 f=0 |xi j+1 xi j f| ) Just like before, we can build the table T in time O(n3k). Computing a solution (C 1, . . . , C k) to (3) also works similarly as before. The only thing that we have to change is the condition (i) on h0 (when setting |C l | = h0 for l = k 1, . . . , 2): now h0 must satisfy r=l+1 |C r |, h0, l , max r=l+1,...,k ||C r | tr| or equivalently r=l+1 |C r |, h0, l G. Addendum to Section 6.1 In this section, we provide supplements to Section 6.1: In Section G.1, we present a simple example that shows that it really depends on the data set whether a group-fair clustering is IP-stable or not. Figure 9. An example illustrating why the local search idea outlined in Section 6.1 does not work. Top left: 12 points in R2. Top right: A k-means clustering of the 12 points (encoded by color) with two points that are not IP-stable (surrounded by a circle). Bottom row: After assigning one of the two points that are not stable in the k-means clustering to its closest cluster, that point is stable. However, now some points that were initially stable are not stable anymore. In Section G.2, we provide an example illustrating why the local search idea of Section 6.1 does not work. In Section G.3, we present a heuristic approach to make linkage clustering more aligned with IP-stability. In Section G.4, we present the experiments of Section 6.1 on the Adult data set (Dua and Graff, 2019): we study the performance of various standard clustering algorithms as a function of the number of clusters k when the underlying metric is either the Euclidean (as in the plot of Figure 3), Manhattan or Chebyshev metric. We also study our heuristic approach of Appendix G.3 for making linkage clustering more stable: just as for the standard algorithms, we show # Uns, Max Vi, Mean Vi and Co as a function of k for ordinary average / complete / single linkage clustering and their modified versions using our heuristic approach in its both variants (#U denotes the variant based on # Uns and MV the variant based on Max Vi). In Section G.5, we show the same set of experiments on the Drug Consumption data set (Fehrman et al., 2015). We used all 1885 records in the data set, and we used all 12 features describing a record (e.g., age, gender, or education), but did not use the information about the drug consumption of a record (this information is usually used as label when setting up a classification problem on the data set). We normalized the features to zero mean and unit variance. 
When running the standard clustering algorithms on the data set, we refrained from running spectral clustering since the Scikit-learn implementation occasionally was not able to do the eigenvector computations and aborted with a Lin Alg Error. Other than that, all results are largely consistent with the results for the Adult data set. In Section G.6, we show the same set of experiments on the Indian Liver Patient data set (Dua and Graff, 2019). Removing four records with missing values, we ended up with 579 records, for which we used all 11 available features (e.g., age, gender, or total proteins). We normalized the features to zero mean and unit variance. Again, all results are largely consistent with the results for the Adult data set. G.1. Compatibility of Group Fairness and IP-Stability By means of a simple example we want to illustrate that it really depends on the data set whether group fairness and IP-stability are compatible or at odds with each other. Here we consider the prominent group fairness notion for clustering of Chierichetti et al. (2017), which asks that in each cluster, every demographic group is approximately equally represented. Let us assume that the data set consists of the four 1-dimensional points 0, 1, 7 and 8 and the distance function d is the ordinary Euclidean metric. It is easy to see that the only IP-stable 2-clustering is C = ({0, 1}, {7, 8}). Now if there are two demographic groups G1 and G2 with G1 = {0, 7} and G2 = {1, 8}, the clustering C is perfectly fair according to the notion of Chierichetti et al.. But if G1 = {0, 1} and G2 = {7, 8}, the clustering C is totally unfair according to the their notion. G.2. Why Local Search Does not Work Figure 9 presents an example illustrating why the local search idea outlined in Section 6.1 does not work: assigning a data point that is not IP-stable to its closest cluster (so that that data point becomes stable) may cause other data points that are initially stable to not be stable anymore after the reassignment. G.3. Heuristic Approach to Make Linkage Clustering More Stable Linkage clustering builds a binary tree that represents a hierarchical clustering with the root of the tree corresponding to the whole data set and every node corresponding to a subset such that a parent is the union of its two children. The leaves of the tree correspond to singletons comprising one data point (e.g., Shalev-Shwartz and Ben-David, 2014, Section 22.1). If one wants to obtain a k-clustering of the data set, the output of a linkage clustering algorithm is a certain pruning of this tree. With IP-stability being our primary goal, we propose to construct a k-clustering / a pruning of the tree as follows: starting with the two children of the root, we maintain a set of nodes that corresponds to a clustering and proceed in k 2 rounds. In round i, we greedily split one of the i + 1 many nodes that we currently have into its two children such that the resulting (i + 2)-clustering minimizes, over the i + 1 many possible splits, # Uns (the number of non-stable datapoints; cf. the beginning of Section 6). Alternatively, we can split the node that gives rise to a minimum value of Max Vi (cf. the beginning of Section 6). Algorithm 1 provides the pseudocode of our proposed strategy. Algorithm 1 Algorithm to greedily prune a hierarchical clustering 1: Input: binary tree T representing a hierarchical clustering obtained from running a linkage clustering algorithm; number of clusters k {2, . . . 
, |D|}; measure meas {# Uns, Max Vi} that one aims to optimize for 2: Output: a k-clustering C 3: # Conventions: for a node v T, we denote the left child of v by Left(v) and the right child by Right(v) for a j-clustering C = (C1, C2, . . . , Cj), a cluster Cl and A, B Cl with A B = Cl we write C |Cl, A,B for the (j + 1) clustering that we obtain by replacing the cluster Cl with two clusters A and B in C 4: Let r be the root of T and initialize the clustering C as C = (Left(r), Right(r)) 5: for i = 1 to k 2 by 1 do 6: Set v = argmin v:v is a cluster in C with |v|>1 meas(C|v, Left(v),Right(v)) and C = C|v , Left(v ),Right(v ) 7: end for 8: return C G.4. Experiments on the Adult Data Set Figure 10. Adult data set plots of Figure 3: # Uns (top-left), Co (top-right), Max Vi (bottom-left) and Mean Vi (bottom-right) for the clusterings produced by the various standard algorithms as a function of k. Figure 11. Adult data set similar plots as in Figure 3, but for the Manhattan metric. Figure 12. Adult data set similar plots as in Figure 3, but for the Chebyshev metric. Figure 13. Adult data set with Euclidean metric: # Uns (top-left), Co (top-right), Max Vi (bottom-left), and Mean Vi (bottom-right) for the clusterings produced by average linkage clustering and the two variants of our heuristic of Appendix G.3 to improve it: the first (#U in the legend) greedily chooses splits as to minimize # Uns, the second (MV) as to minimize Max Vi. Figure 14. Adult data set with Euclidean metric: the various measures for the clusterings produced by complete linkage clustering and the two variants of our heuristic approach. Figure 15. Adult data set with Euclidean metric: the various measures for the clusterings produced by single linkage clustering and the two variants of our heuristic approach. Figure 16. Adult data set with Manhattan metric: the various measures for the clusterings produced by average linkage clustering and the two variants of our heuristic approach. Figure 17. Adult data set with Manhattan metric: the various measures for the clusterings produced by complete linkage clustering and the two variants of our heuristic approach. Figure 18. Adult data set with Manhattan metric: the various measures for the clusterings produced by single linkage clustering and the two variants of our heuristic approach. Figure 19. Adult data set with Chebyshev metric: the various measures for the clusterings produced by average linkage clustering and the two variants of our heuristic approach. Figure 20. Adult data set with Chebyshev metric: the various measures for the clusterings produced by complete linkage clustering and the two variants of our heuristic approach. Figure 21. Adult data set with Chebyshev metric: the various measures for the clusterings produced by single linkage clustering and the two variants of our heuristic approach. G.5. Experiments on the Drug Consumption Data Set Figure 22. Drug Consumption data set with Euclidean metric: # Uns (top-left), Co (top-right), Max Vi (bottom-left) and Mean Vi (bottom-right) for the clusterings produced by the various standard algorithms as a function of the number of clusters k. k. Figure 23. Drug Consumption data set similar plots as in Figure 22, but for the Manhattan metric. Figure 24. Drug Consumption data set similar plots as in Figure 22, but for the Chebyshev metric. Figure 25. 
Drug Consumption data set with Euclidean metric: # Uns (top-left), Co (top-right), Max Vi (bottom-left), and Mean Vi (bottom-right) for the clusterings produced by average linkage clustering and the two variants of our heuristic of Appendix G.3 to improve it: the first (#U in the legend) greedily chooses splits as to minimize # Uns, the second (MV) as to minimize Max Vi. Figure 26. Drug Consumption data set with Euclidean metric: the various measures for the clusterings produced by complete linkage clustering and the two variants of our heuristic approach. Figure 27. Drug Consumption data set with Euclidean metric: the various measures for the clusterings produced by single linkage clustering and the two variants of our heuristic approach. Figure 28. Drug Consumption data set with Manhattan metric: the various measures for the clusterings produced by average linkage clustering and the two variants of our heuristic approach. Figure 29. Drug Consumption data set with Manhattan metric: the various measures for the clusterings produced by complete linkage clustering and the two variants of our heuristic approach. Figure 30. Drug Consumption data set with Manhattan metric: the various measures for the clusterings produced by single linkage clustering and the two variants of our heuristic approach. Figure 31. Drug Consumption data set with Chebyshev metric: the various measures for the clusterings produced by average linkage clustering and the two variants of our heuristic approach. Figure 32. Drug Consumption data set with Chebyshev metric: the various measures for the clusterings produced by complete linkage clustering and the two variants of our heuristic approach. Figure 33. Drug Consumption data set with Chebyshev metric: the various measures for the clusterings produced by single linkage clustering and the two variants of our heuristic approach. G.6. Experiments on the Indian Liver Patient Data Set Figure 34. Indian Liver Patient data set with Euclidean metric: # Uns (top-left), Co (top-right), Max Vi (bottom-left) and Mean Vi (bottom-right) for the clusterings produced by the various standard algorithms as a function of the number of clusters k. k. Figure 35. Indian Liver Patient data set similar plots as in Figure 34, but for the Manhattan metric. Figure 36. Indian Liver Patient data set similar plots as in Figure 34, but for the Chebyshev metric. Figure 37. Indian Liver Patient data set with Euclidean metric: # Uns (top-left), Co (top-right), Max Vi (bottom-left), and Mean Vi (bottom-right) for the clusterings produced by average linkage clustering and the two variants of our heuristic of Appendix G.3 to improve it: the first (#U in the legend) greedily chooses splits as to minimize # Uns, the second (MV) as to minimize Max Vi. Figure 38. Indian Liver Patient data set with Euclidean metric: the various measures for the clusterings produced by complete linkage clustering and the two variants of our heuristic approach. Figure 39. Indian Liver Patient data set with Euclidean metric: the various measures for the clusterings produced by single linkage clustering and the two variants of our heuristic approach. Figure 40. Indian Liver Patient data set with Manhattan metric: the various measures for the clusterings produced by average linkage clustering and the two variants of our heuristic approach. Figure 41. Indian Liver Patient data set with Manhattan metric: the various measures for the clusterings produced by complete linkage clustering and the two variants of our heuristic approach. 
Figure 42. Indian Liver Patient data set with Manhattan metric: the various measures for the clusterings produced by single linkage clustering and the two variants of our heuristic approach. Figure 43. Indian Liver Patient data set with Chebyshev metric: the various measures for the clusterings produced by average linkage clustering and the two variants of our heuristic approach. Figure 44. Indian Liver Patient data set with Chebyshev metric: the various measures for the clusterings produced by complete linkage clustering and the two variants of our heuristic approach. Figure 45. Indian Liver Patient data set with Chebyshev metric: the various measures for the clusterings produced by single linkage clustering and the two variants of our heuristic approach. Figure 46. Histograms of the data sets used in the experiments of Appendix H. Left: The credit amount (one of the 20 features in the German credit data set; normalized to be in [0, 1]) for the 1000 records in the German credit data set. Note that there are only 921 unique values. Right: The estimated probability of having a good credit risk for the second 500 records in the German credit data set. The estimates are obtained from a multi-layer perceptron trained on the first 500 records in the German credit data set. Table 1. Experiment on German credit data set. Clustering 1000 people according to their credit amount. Target cluster sizes ti = 1000 k , i [k]. NAIVE=naive clustering that matches the target cluster sizes, DP=dynamic programming approach of Appendix F, k-MEANS= kmeans initialized with medians of the clusters of the naive clustering, k-ME++= k-means++. Results for k-ME++ averaged over 100 runs. Best values in bold. # UNS MAXVI MEANVI OBJ CO NAIVE 105 2.95 1.47 0 0.24 DP 0 1.0 0 172 0.28 k-MEANS 1 1.0 1.0 170 0.28 k-ME++ 1.11 1.01 0.61 275.3 0.29 NAIVE 113 2.16 1.3 0 0.28 DP 0 1.0 0 131 0.3 k-MEANS 4 1.01 1.01 136 0.29 k-ME++ 3.16 1.02 0.99 159.47 0.29 NAIVE 92 3.17 1.3 0 0.3 DP 0 1.0 0 37 0.3 k-MEANS 5 1.01 1.01 37 0.29 k-ME++ 7.44 1.06 1.03 105.06 0.26 NAIVE 101 2.6 1.3 0 0.32 DP 0 1.0 0 8 0.3 k-MEANS 18 1.26 1.06 10 0.3 k-ME++ 12.7 1.19 1.05 50.22 0.26 H. Experiments on 1-dimensional Euclidean data sets In this section we present experiments with our dynamic programming (DP) approach of Appendix F for 1-dim Euclidean data sets. We used the German Credit data set (Dua and Graff, 2019). It comprises 1000 records (corresponding to human beings) and for each record one binary label (good vs. bad credit risk) and 20 features. In our first experiment, we clustered the 1000 people according to their credit amount, which is one of the 20 features. A histogram of the data can be seen in the left plot of Figure 46. We were aiming for k-clusterings with clusters of equal size (i.e., target cluster sizes ti = 1000 k , i [k]) and compared our DP approach with p = to k-means clustering as well as a naive clustering that simply puts the t1 smallest points in the first cluster, the next t2 many points in the second cluster, and so on. We considered two initialization strategies for k-means: we either used the medians of the clusters of the naive clustering for initialization (thus, hopefully, biasing k-means towards the target cluster sizes) or we ran k-means++ (Arthur and Vassilvitskii, 2007). For the latter we report average results obtained from running the experiment for 100 times. 
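To illustrate this setup, the sketch below builds the naive baseline and evaluates stability on a 1-dimensional data set. It is our own illustration: the exact definitions of the reported measures are given at the beginning of Section 6, and here we simply assume that # Uns counts the points whose average distance to their own cluster exceeds their average distance to some other cluster, and that Max Vi is the largest ratio between these two quantities; the variable credit_amounts in the usage comment is hypothetical.

```python
import numpy as np

def naive_clustering(values, target_sizes):
    """The naive baseline: sort the 1-d values and cut them into contiguous
    groups of the prescribed target sizes t1, ..., tk."""
    xs = np.sort(np.asarray(values, dtype=float))
    cuts = np.cumsum(target_sizes)[:-1]
    return np.split(xs, cuts)

def stability_report(clusters):
    """Violation ratio of each point: average distance to the other points of
    its own cluster divided by the smallest average distance to another cluster
    (assumed readings of # Uns and Max Vi; see Section 6 for the definitions)."""
    ratios = []
    for i, C in enumerate(clusters):
        for x in C:
            own = np.abs(C - x).sum() / (len(C) - 1) if len(C) > 1 else 0.0
            other = min(np.abs(D - x).mean() for j, D in enumerate(clusters) if j != i)
            ratios.append(own / other if other > 0 else (0.0 if own == 0 else np.inf))
    ratios = np.array(ratios)
    return {"# Uns": int((ratios > 1).sum()), "Max Vi": float(ratios.max())}

# Usage with k = 10 equal target sizes on the 1000 credit amounts (hypothetical variable):
# clusters = naive_clustering(credit_amounts, [100] * 10)
# print(stability_report(clusters))
```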
In addition to the four quantities # Uns, Max Vi, Mean Vi and Co considered in the experiments of Section 6.1 / Appendix G.4–G.6 (defined at the beginning of Section 6), we report Obj ("objective"), which is the value of the objective function of (3) for p = ∞. Note that k-means / k-means++ yields contiguous clusters, so Obj is meaningful for all four clustering methods that we consider. The results are provided in Table 1, where we consider k = 5, k = 10, k = 20 or k = 50. As expected, for the naive clustering we always have Obj = 0, and for our DP approach (DP) we have # Uns = 0, Max Vi ≤ 1 and Mean Vi = 0. Most interesting to see is that both versions of k-means yield almost perfectly stable clusterings when k is small and moderately stable clusterings when k = 50 (with k-means++ outperforming k-means).

In our second experiment, we used the first 500 records of the data set to train a multi-layer perceptron (MLP) for predicting the label (good vs. bad credit risk). We then applied the MLP to estimate the probabilities of having a good credit risk for the other 500 people. A histogram of the probability estimates is shown in the right plot of Figure 46. We used the same clustering methods as in the first experiment to cluster the 500 people according to their probability estimate. We believe that such a clustering problem may arise frequently in practice (e.g., when a bank determines its lending policy) and that IP-stability is highly desirable in this context. Table 2 and Table 3 provide the results. In Table 2, we consider uniform target cluster sizes ti = 500/k, i ∈ [k], while in Table 3 we consider various non-uniform target cluster sizes. The interpretation of the results is similar to that for the first experiment of Section 6.2. Most notably, k-means can be quite unstable, with up to 33 data points being unstable when k is large, whereas k-means++ produces very stable clusterings with not more than three data points being unstable. However, k-means++ performs very poorly in terms of Obj, which can be almost ten times as large as for k-means and our dynamic programming approach (cf. Table 2, k = 50). The MLP that we used for predicting the label (good vs. bad credit risk) in the second experiment of Section 6.2 has three hidden layers of size 100, 50 and 20, respectively, and a test accuracy of 0.724.

Table 2. Experiment on German credit data set. Clustering the second 500 people according to their estimated probability of having a good credit risk. Target cluster sizes ti = 500/k, i ∈ [k]. NAIVE = naive clustering that matches the target cluster sizes, DP = dynamic programming approach of Appendix F, k-MEANS = k-means initialized with medians of the clusters of the naive clustering, k-ME++ = k-means++. Results for k-ME++ averaged over 100 runs. Best values in bold.

TARGET CLUSTER SIZES           METHOD    # UNS   MAXVI   MEANVI   OBJ      CO
k = 5,  t1 = ... = t5 = 100    NAIVE     197     58.28   9.52     0        0.34
                               DP        0       0.99    0        214      0.29
                               k-MEANS   1       1.02    1.02     212      0.29
                               k-ME++    0.67    1.01    0.62     220.72   0.3
k = 10, t1 = ... = t10 = 50    NAIVE     162     10.27   3.06     0        0.34
                               DP        0       0.98    0        217      0.29
                               k-MEANS   8       1.25    1.09     207      0.34
                               k-ME++    1.11    1.02    0.87     248.86   0.29
k = 20, t1 = ... = t20 = 25    NAIVE     116     9.64    2.09     0        0.34
                               DP        0       1.0     0        155      0.29
                               k-MEANS   33      2.13    1.38     95       0.3
                               k-ME++    2.53    1.07    0.97     239.55   0.28
k = 50, t1 = ... = t50 = 10    NAIVE     73      3.8     1.78     0        0.34
                               DP        0       1.0     0        24       0.3
                               k-MEANS   28      2.29    1.25     13       0.3
                               k-ME++    3.56    1.24    1.08     233.09   0.23

Table 3. Experiment on German credit data set.
Clustering the second 500 people according to their estimated probability of having a good credit risk. Various non-uniform target cluster sizes. NAIVE = naive clustering that matches the target cluster sizes, DP = dynamic programming approach of Appendix F, k-MEANS = k-means initialized with medians of the clusters of the naive clustering, k-ME++ = k-means++. Results for k-ME++ averaged over 100 runs. Best values in bold.

TARGET CLUSTER SIZES                                    METHOD    # UNS   MAXVI    MEANVI   OBJ      CO
k = 12, ti = 50 for 3 ≤ i ≤ 10 and ti = 25 otherwise    NAIVE     188     12.85    3.22     0        0.34
                                                        DP        0       0.97     0        232      0.29
                                                        k-MEANS   3       1.05     1.04     217      0.29
                                                        k-ME++    1.44    1.03     0.77     254.81   0.29
t1 = t12 = 10, t2 = t11 = 15, t3 = t10 = 25,            NAIVE     251     65.99    6.56     0        0.33
t4 = t9 = 50, t5 = t8 = 50, t6 = t7 = 100               DP        0       0.97     0        247      0.29
                                                        k-MEANS   14      1.52     1.16     238      0.3
                                                        k-ME++    1.43    1.03     0.83     270.09   0.29
k = 20, ti = 10 for i = 1, 3, 5, ... and                NAIVE     189     137.31   6.64     0        0.35
ti = 40 for i = 2, 4, 6, ...                            DP        0       1.0      0        140      0.29
                                                        k-MEANS   30      1.91     1.28     91       0.3
                                                        k-ME++    2.62    1.07     0.98     224.26   0.28
k = 20, ti = 115 for i = 10, 11 and ti = 15 otherwise   NAIVE     224     215.88   13.01    0        0.34
                                                        DP        0       1.0      0        165      0.29
                                                        k-MEANS   51      3.04     1.67     112      0.3
                                                        k-ME++    2.44    1.07     0.99     249.79   0.28