# Fair, Polylog-Approximate Low-Cost Hierarchical Clustering

Marina Knittel, Max Springer, John Dickerson, Mohammad Taghi Hajiaghayi
Department of Computer Science, University of Maryland, College Park
{mknittel,mss423}@umd.edu, {john,hajiagha}@cs.umd.edu

## Abstract

Research in fair machine learning, and particularly clustering, has been crucial in recent years given the many ethical controversies that modern intelligent systems have posed. Ahmadian et al. [2020] established the study of fairness in hierarchical clustering, a stronger, more structured variant of its well-known flat counterpart, though their proposed algorithm that optimizes for Dasgupta's [2016] famous cost function was highly theoretical. Knittel et al. [2023] then proposed the first practical fair approximation for cost; however, they were unable to break the polynomial-approximate barrier they posed as a hurdle of interest. We break this barrier, proposing the first truly polylogarithmic-approximate low-cost fair hierarchical clustering, thus greatly bridging the gap between the best fair and vanilla hierarchical clustering approximations.

## 1 Introduction

Clustering is a pervasive machine learning technique which has filled a vital niche in everyday computer systems. Extending upon this, a hierarchical clustering is a recursively defined clustering where each cluster is partitioned into two or more clusters, and so on. This adds structure to flat clustering, giving an algorithm the ability to depict data similarity at different resolutions as well as an ancestral relationship between data points, as in a phylogenetic tree [Kraskov et al., 2003]. On top of computational biology, hierarchical clustering has found various uses across computer imaging [Chen et al., 2021b, Selvan et al., 2005], computer security [Chen et al., 2020, 2021a], natural language processing [Ramanath et al., 2013], and much more. Moreover, it can be applied to any flat clustering problem where the number of desired clusters is not given. Specifically, a hierarchical clustering can be viewed as a structure of clusterings at different resolutions that all agree with each other (i.e., two points clustered together in a higher-resolution clustering will also be together in a lower-resolution clustering). Generally, hierarchical clustering techniques are quite impactful on modern technology, and it is important to guarantee they are both effective and unharmful.

Researchers have recognized the harmful biases unchecked machine learning programs pose. A few examples depicting racial discrimination include allocation of health care [Ledford, 2019], presentation of ads suggestive of arrest records [Sweeney, 2013], prediction of hiring success [Bogen and Rieke, 2018], and estimation of recidivism risk [Angwin et al., 2016]. A popular solution that has been extensively studied in the past decade is fair machine learning. Here, fairness concerns the mitigation of bias, particularly against protected classes. Most often, fairness is an additional constraint on the allowed solution space; we optimize for problems in light of this constraint. For instance, the notion of individual fairness introduced by the foundational work of Dwork et al. [2012] deems that an output must guarantee that any two individuals who are similar are classified similarly.

Figure 1: A hierarchical clustering of news articles.
Red articles are conservative, blue are liberal. On the left is the optimal unfair hierarchy. We alter the hierarchy slightly on the right to achieve fairness. Now, the user's query for "global warming" will yield both liberal and conservative articles.

In line with previous work in clustering and hierarchical clustering, this paper utilizes the notion of group fairness, which enforces that different protected classes receive a proportionally similar distribution of classifications (in our case, cluster placement). Chierichetti et al. [2017] first introduced this as a constraint for the flat clustering problem, arguing that it mitigates a system's disparate impact, or non-proportional impact on different demographics. This notion of fair clustering has been similarly leveraged and extended by a vast range of works in both flat [Ahmadian et al., 2019, Bera et al., 2019, Bercea et al., 2019] and hierarchical [Ahmadian et al., 2020, Knittel et al., 2023] clustering.

To illustrate our fairness concept, consider the following application (Figure 1): a news database is structured as a hierarchical clustering of search terms, where a search term is associated with a cluster of news articles to output to the reader, and more specific search terms access finer-resolution clusters. When a user searches for a term, we simply identify the corresponding cluster and output the contained articles. However, as is, the system does not account for the political skew of articles. In Figure 1, we label conservative-leaning articles in red and liberal-leaning articles in blue. We can see that in this example, when the user searches for global warming articles, they will only see liberal articles. To resolve this, we add a group fairness constraint on our clusters: for example, require at least 1/3 of the articles in each cluster to be of each political skew. This guarantees (as depicted on the right) that the outputted articles will always be at least 1/3 liberal and 1/3 conservative, thus guaranteeing the user is exposed to a range of perspectives along this political axis. This notion of fairness, which we formally define in Definition 3, has been explored in the context of hierarchical clustering in both Ahmadian et al. [2020] and Knittel et al. [2023].

This paper is concerned with approximations for fair low-cost hierarchical clustering. Cost, first defined by Dasgupta [2016] and formally stated in Definition 2, is perhaps the most natural and well-motivated optimization metric for hierarchical clustering evaluation. Effectively, every pair of points contributes an additive cost of similarity times cluster size, where the latter is the size of the smallest cluster in the hierarchy containing both points. Unfortunately, cost is quite difficult to optimize for (the best being $O(\sqrt{\log n})$-approximations [Charikar and Chatziafratis, 2017, Dasgupta, 2016]; it is hypothesized to not be $O(1)$-approximable [Charikar and Chatziafratis, 2017]). This appears to be even more difficult in the fair hierarchical clustering setting. The first work to attempt this problem, Ahmadian et al. [2020], achieved a highly impractical $O(n^{5/6} \log^{3/2} n)$-approximation (not too far from the trivial $O(n)$ upper bound), posing fair low-cost hierarchical clustering as an interesting and inherently difficult problem. Knittel et al. [2023] greatly improved this to a near-polylog approximation factor of $O(n^{\delta} \mathrm{polylog}(n))$, where $\delta$ can be arbitrarily close to 0 and parameterizes a trade-off between approximation factor and degree of fairness.
Still, a true polylog approximation was left as an open problem, one which we solve in this paper.

### 1.1 Our Contributions

This work proposes the first polylogarithmic approximation for fair, low-cost hierarchical clustering. We leverage the work of Knittel et al. [2023] as a starting inspiration and create something much simpler, more direct, and better in both fairness and approximation. Like their algorithm, our algorithm starts with a low-cost unfair hierarchical clustering and then alters it with multiple well-defined and limited tree operators. This gives it a degree of explainability, in that the user can understand exactly the steps the algorithm took to achieve its result and why. In addition, our algorithm achieves both relative cluster balance (i.e., clusters that are children of the same cluster have similar size) and fairness, along with a parameterizable trade-off between the two.

On top of the benefits of Knittel et al. [2023]'s techniques, we propose a greatly simplified algorithm. They initially proposed an algorithm that required four tree operators; we only require two of the four, and we greatly simplify the more complicated operator. This makes the algorithm simpler to understand and more implementable. We show that even with this reduced functionality, we can cleverly achieve both a better approximation and degree of fairness:

**Theorem 1 (High-Level).** Given a $\gamma$-approximation to the cost objective for a hierarchical clustering on a set of $n$ points, Algorithm 2 yields an $O(\log^2 n)$ approximation to cost which is relatively fair, in time $O(n \log^2 n)$. Moreover, all clusters are $O(1/\log n)$-balanced, i.e., any cluster's children clusters differ in size by at most an $O(1/\log n)$ fraction.

To put this in perspective, the best approximation for fair hierarchical clustering was previously $O(n^{\delta} \mathrm{polylog}(n))$, whereas the best unfair approximation is $O(\sqrt{\log n})$. Our work greatly reduces this gap by providing a true $O(\mathrm{polylog}(n))$ approximation, moreover with a low degree of 2. Note that this is not the most general and rigorous form of our result; for that, we defer to Theorem 1 as stated in the body of our paper. This simpler version is found by setting $k = O(1)$, $h = O(\log n)$, and $\epsilon = O(1/\log n)$ in the more general theorem (note we assume $\lambda = O(1)$).

## 2 Preliminaries

### 2.1 The Vanilla Problem

Fair clustering literature refers to the original problem variant, without fairness, as the vanilla problem. We define the vanilla problem of finding a low-cost hierarchical clustering here using our specific notation. In this problem, we are given a complete graph $G = (V, E, w)$, where the weight function $w : E \to \mathbb{R}^+$ is a measure of the similarity between data points. Note the data is encoded as a complete graph because we require knowledge of all point-point relationships.

We must construct a hierarchical clustering, represented by its dendrogram $T$, with root denoted $root(T)$. $T$ is a tree with vertices corresponding to all clusters of the hierarchical clustering. Leaves of $T$, denoted $leaves(root(T))$, correspond to the points in the dataset (i.e., singleton clusters). An internal node $v$ corresponds to the cluster containing all the leaf-data of the maximal subtree rooted at $v$ (i.e., all of $v$'s descendants), denoted $T[v]$. In addition, we let $u \wedge v$ denote the lowest common ancestor of $u$ and $v$ in $T$. In order to define Dasgupta's [2016] cost function, we use the same notational simplifications as Knittel et al. [2023].
For an edge $e = (x, y) \in E$, we say $n_T(e) = |leaves(T[x \wedge y])|$ is the size of the smallest cluster in the hierarchy containing $e$. Similarly, for a hierarchy node $v$, $n_T(v) = |leaves(T[v])|$ is the size of the corresponding cluster. This is sufficient to introduce the notion of cost.

**Definition 1** (Knittel et al. [2023]). The cost of $e \in E$ in a graph $G = (V, E, w)$ in a hierarchy $T$ is $cost_T(e) = w(e) \cdot n_T(e)$.

Dasgupta's cost function can then be written as a sum over the costs of all edges.

**Definition 2** (Dasgupta [2016]). The cost of a hierarchy $T$ on graph $G = (V, E, w)$ is:
$$cost(T) = \sum_{e \in E} cost_T(e) = \sum_{(x, y) \in E} w(x, y) \cdot |leaves(T[x \wedge y])|.$$

Our algorithm begins by assuming we have some approximate vanilla hierarchy, $T$. That is, if $OPT$ is the optimal hierarchy tree, then $cost(T) \le \gamma \cdot cost(OPT)$ for some approximation factor $\gamma$. According to Dasgupta [2016], we can transform this hierarchy to be binary without increasing the cost. Our paper simply assumes our input is binary. We then produce a modified hierarchy $T'$ with similar structure to $T$ that guarantees fairness, i.e., $cost(T') \le \gamma' \cdot cost(OPT)$ for some approximation factor $\gamma'$. Note this comparison is being made to the vanilla $OPT$, as we are unsure, at this time, how to characterize the optimal fair hierarchy. Note that the binary assumption may not hold when we consider adding a fairness constraint.

### 2.2 Fairness and Balance Constraints

We consider fairness constraints based on those introduced by Chierichetti et al. [2017] and extended by Bercea et al. [2019]. On a graph $G$ with colored vertices, let $n_\ell(C)$ count the number of $\ell$-colored points in cluster $C$.

**Definition 3** (Knittel et al. [2023]). Consider a graph $G = (V, E, w)$ with vertices colored one of $\lambda$ colors, and two vectors of parameters $\alpha, \beta \in (0, 1)^\lambda$ with $\alpha_\ell \le \beta_\ell$ for all $\ell \in [\lambda]$. A hierarchy $T$ on $G$ is fair if for any non-singleton cluster $C$ in $T$ and for every $\ell \in [\lambda]$, $\alpha_\ell |C| \le n_\ell(C) \le \beta_\ell |C|$. Additionally, any cluster with a leaf child has only leaf children.

Effectively, we are given bounds $\alpha_\ell$ and $\beta_\ell$ for each color $\ell$. Every non-singleton cluster must have at least an $\alpha_\ell$ fraction and at most a $\beta_\ell$ fraction of color $\ell$. This guarantees proportional representational fairness of each color in each cluster.

As an intermediate step in achieving fairness, we will create splits in our hierarchy that achieve relative balance in terms of subcluster size. Thus, the following definition will come in handy.

**Definition 4.** In a hierarchy, a vertex $v$ (corresponding to cluster $C$) with $c_v$ children is $\epsilon$-relatively balanced if every cluster $C_i$, $i \in [c_v]$, that corresponds to a child of $v$ satisfies $(\frac{1}{c_v} - \epsilon)|C| \le |C_i| \le (\frac{1}{c_v} + \epsilon)|C|$.

While this definition is quite similar to that of Knittel et al. [2023], it deviates in two ways: 1) we only define it on a single split as opposed to the entire hierarchy, and 2) we allow splits to be non-binary. If we apply it to the entire hierarchy and constrain it to be binary, it is equivalent to the former definition.
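To make Definitions 2, 3, and 4 concrete before moving on, here is a minimal Python sketch (illustrative only; the helper names are ours and this is not the paper's implementation) that computes Dasgupta's cost by brute force and checks the fairness and balance conditions on a toy hierarchy.

```python
from itertools import combinations

# A hierarchy is a nested tuple: a leaf is a point id, an internal node is a
# tuple of child subtrees.
def leaves(t):
    """Set of leaf ids under node t."""
    if not isinstance(t, tuple):
        return {t}
    return set().union(*(leaves(c) for c in t))

def dasgupta_cost(t, w):
    """Definition 2: sum over edges (x, y) of w(x, y) times the size of the
    smallest cluster containing both endpoints."""
    if not isinstance(t, tuple):
        return 0.0
    size = len(leaves(t))
    # Pairs split at this node have this cluster as their smallest common one.
    cost = sum(w.get((x, y), w.get((y, x), 0.0)) * size
               for a, b in combinations(t, 2)
               for x in leaves(a) for y in leaves(b))
    return cost + sum(dasgupta_cost(c, w) for c in t)

def is_fair(cluster, color, alpha, beta):
    """Proportion condition of Definition 3 for one non-singleton cluster
    (the leaf-children condition is not checked here)."""
    return all(alpha[l] * len(cluster)
               <= sum(1 for x in cluster if color[x] == l)
               <= beta[l] * len(cluster)
               for l in set(color.values()))

def is_relatively_balanced(node, eps):
    """Definition 4: each of the c_v children holds a (1/c_v +- eps) fraction."""
    c_v, size = len(node), len(leaves(node))
    return all((1 / c_v - eps) * size <= len(leaves(c)) <= (1 / c_v + eps) * size
               for c in node)

# Toy example: 4 points, unit similarities, two colors.
T = ((0, 1), (2, 3))
w = {(x, y): 1.0 for x in range(4) for y in range(x + 1, 4)}
color = {0: "red", 1: "blue", 2: "red", 3: "blue"}
print(dasgupta_cost(T, w))        # 1*2 + 1*2 + 4*(1*4) = 20
print(is_fair(leaves(T), color, {"red": 1/3, "blue": 1/3}, {"red": 2/3, "blue": 2/3}))
print(is_relatively_balanced(T, eps=0.1))
```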
### 2.3 Tree Operators

Figure 2: Our operators: subtree deletion and insertion, and shallow tree folding.

Our work simplifies that of Knittel et al. [2023]. In doing so, we follow the same framework, using tree operators to make well-defined and limited alterations to a given hierarchical clustering (Figure 2). In addition, our algorithm simplifies operator use in two ways: 1) we only utilize two of their four tree operators, and 2) we greatly simplify their most complicated operator and show that it can still be used to create a fair hierarchy.

First off, we utilize the same subtree deletion and insertion operator. The main difference is how we use it, which will be discussed in Section 3. At a high level, this operator removes a subtree from one part of the hierarchy and reinserts it elsewhere, adding and removing parent vertices as necessary.

**Definition 5** (Knittel et al. [2023]). Consider a binary tree $T$ with an internal node $u$, some non-ancestor $v$ of $u$, $u$'s sibling $s$, and $v$'s parent $g$. Subtree deletion at $u$ removes $T[u]$ from $T$ and contracts $s$ into its parent. Subtree insertion of $T[u]$ at $v$ inserts a new parent $p$ between $v$ and $g$ and adds $u$ as a second child of $p$. The operator $del\_ins(u, v)$ deletes $u$ and inserts $T[u]$ at $v$.

The other operator we leverage is their tree folding operator; however, we greatly simplify it. In the previous work, tree folding took two or more isomorphic trees and mapped their internal nodes to each other. Instead, we simply take two or more subtrees and merge their roots. The new root then directly splits into all children of the roots of all folded trees. In a way, this is an implementation of their folding operator but only at a single vertex in the tree topology. This is why we call it a shallow tree fold.

**Definition 6.** Consider a set of subtrees $T_1, \ldots, T_k$ of $T$ such that all $root(T_i)$ have the same parent $p$ in $T$. A shallow tree folding of trees $T_1, \ldots, T_k$ ($shallow\_fold(T_1, \ldots, T_k)$) modifies $T$ such that all of $T_1, \ldots, T_k$ are replaced by a single tree $T_f$ whose root $root(T_f)$ is made a child of $p$, and whose root-children are $\bigcup_{i \in [k]} children(root(T_i))$.

In addition, we assume the subtree $T_f$ is then arbitrarily binarized [Dasgupta, 2016] after folding. Since our algorithm works top-down, creating balanced vertices as it goes, we do not yet care about the fairness of the descendants of $T_f$. Moreover, we will then recursively call our algorithm on $T_f$ to do precisely this.

## 3 Main Algorithm

In this section, we present our fair, low-cost hierarchical clustering algorithm along with its analysis. Ultimately, we achieve the following (for a more intuitive explanation, see Section 1):

**Theorem 1.** When $T$ is a $\gamma$-approximate low-cost vanilla hierarchical clustering over $n$ vertices, of which $c_\ell n = O(n)$ have color $\ell$ for each $\ell \in [\lambda]$, MakeFair (Algorithm 2), for any parameters $\epsilon$, $h$, $k$ with $k^\lambda \le h \le n$, runs in $O(n \log n\,(h + \lambda \log n))$ time and yields a hierarchy $T'$ satisfying:

1. $T'$ approximates cost to within $\gamma$ times a factor that depends only on $h$, $k$, $\lambda$, and $\epsilon$ (made explicit by Lemma 8).
2. $T'$ is fair for any parameters $\alpha_\ell, \beta_\ell$, $\ell \in [\lambda]$, lying within the bounded multiplicative factors of the color proportions $\lambda_\ell = c_\ell$ given by Lemma 5.
3. All internal nodes in $T'$ are $\epsilon$-relatively balanced.

The main idea of our algorithm is to leverage tree operators similar to those of Knittel et al. [2023], but greatly simplify their usage and apply them in a more direct, careful manner. Specifically, the previous work processes the tree four times: once to achieve $1/6$-relative balance everywhere, next to achieve $\epsilon$-relative balance, next to remove the bottom of the hierarchy, and finally to achieve fairness. The problem is that this causes proportional cost increases to grow in an exponential manner, particularly because the relative balance significantly degrades as you descend the hierarchy. Our solution is to instead do a single top-to-bottom pass of the tree, rebalancing and folding to achieve fairness as we go.

We describe this in detail now. First, we assume our input is some given hierarchical clustering tree. Ideally, this will be a good approximation for the vanilla problem, but our results do work as a black box on top of any hierarchical clustering algorithm. Second, we apply SplitRoot in order to balance the root (Section 3.1). And finally, we apply shallow tree folding on the children of the root to achieve fairness (Section 3.2). This gives us the first layer of our output, and then we recurse.
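Before going into the two phases, here is a minimal sketch (illustrative only, not the paper's code) of the two operators of Definitions 5 and 6 on a small mutable tree. The `Node` class and its fields are hypothetical, and edge cases (operating at or directly below the root, or when $v$ is $u$'s sibling) are ignored for brevity.

```python
class Node:
    def __init__(self, children=None, point=None):
        self.point = point                 # set for leaves only
        self.children = children or []
        self.parent = None
        for c in self.children:
            c.parent = self

def del_ins(u, v):
    """Definition 5: delete T[u] (contracting u's sibling into their parent),
    then reinsert it by giving v and u a common new parent p."""
    parent, grand = u.parent, v.parent
    parent.children.remove(u)
    sibling = parent.children[0]           # binary tree: exactly one sibling
    # Contract: the sibling takes its parent's place.
    grandparent = parent.parent
    grandparent.children[grandparent.children.index(parent)] = sibling
    sibling.parent = grandparent
    # Insert a new parent p between v and its old parent g, with children {v, u}.
    p = Node(children=[v, u])
    grand.children[grand.children.index(v)] = p
    p.parent = grand

def shallow_fold(siblings):
    """Definition 6: replace sibling subtrees T_1..T_k by one subtree whose
    root's children are the union of the T_i roots' children
    (the subsequent re-binarization is omitted)."""
    parent = siblings[0].parent
    merged = Node(children=[c for s in siblings for c in s.children])
    for s in siblings:
        parent.children.remove(s)
    parent.children.append(merged)
    merged.parent = parent
    return merged
```

In the full algorithm, `del_ins` is only ever applied to a bounded-size subtree found under the largest child of the root, and `shallow_fold` is only ever applied to children of the current root.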
### 3.1 Root Splitting and Balancing

SplitRoot is depicted in Algorithm 1. It fills the role of Knittel et al. [2023]'s RefineRebalanceTree algorithm (and skips their RebalanceTree algorithm), but it functions differently in that it only rebalances the root, and it immediately splits the root into $h$ children according to our input parameter $h$.

We start SplitRoot by adding dummy children to $v$ until it has $h$ children (recall we can assume the input is binary). A dummy or null child is just a placeholder for a child to be constructed, or alternatively simply a zero-sized tree (note: this does not add any leaves to the tree). None of these children will be left empty in the end. Next, we define $v_{max}$ and $v_{min}$, the maximal subtrees rooted at $children(root(T'))$ which have the most and fewest leaves, respectively. As long as the root is not $\epsilon$-relatively balanced (which is equivalent to $n_{T'}(v_{max})$ or $n_{T'}(v_{min})$ deviating from the target $n/h$ by over $\epsilon n$, as they are extreme points), we will attempt to rebalance. We define $\delta_1$ and $\delta_2$ to be the proportional deviations of $n_{T'}(v_{min})$ and $n_{T'}(v_{max})$ from the target size $n/h$, respectively, and $\delta$ to be the minimum of the two. In effect, $\delta n$ is the maximum number of leaves we can move from the large subtree to the small subtree without causing $n_{T'}(v_{max})$ to dip below $n/h$ or $n_{T'}(v_{min})$ to rise above $n/h$. This is important to guarantee our runtime: as an accounting scheme, we show that clusters monotonically approach size $n/h$, and thus we can quantify how fast our algorithm completes. We fully analyze this later, in Lemma 2.

Now we must attempt exactly this procedure: move a large subtree from $v_{max}$ to $v_{min}$, though this subtree can have no more than $\delta n$ leaves. To do this, we simply start at $v_{max}$ and traverse down its right children (recall that below $v_{max}$, the tree is still binary). We halt on the first child that is of size $\delta n$ or smaller. We then remove it and find a place to reinsert it under $v_{min}$. The insertion spot is found similarly by descending down $v_{min}$'s left children until the right child of the current vertex has fewer leaves in its subtree than the tree we are inserting. Thus, we have completed our insertion and deletion operations. We repeat until the tree is relatively balanced as desired.

**Algorithm 1** SplitRoot

Input: A binary hierarchy tree $T$ of size $n \ge 1/(2\epsilon)$ over a graph $G = (V, E, w)$, with the smaller cluster always on the left, and parameters $h \in [n]$ and $\epsilon \in (0, \min(1/6, 1/h))$.
Output: A hierarchical clustering $T'$ with an $\epsilon$-relatively balanced root that has $h$ children.

1: Initialize $T' = T$
2: $v = root(T')$
3: Add null children to $v$ until it has $h$ children
4: Let $v_{min} = \arg\min_{v' \in children(v)} n_{T'}(v')$
5: Let $v_{max} = \arg\max_{v' \in children(v)} n_{T'}(v')$
6: while $n_{T'}(v_{max}) > n(1/h + \epsilon)$ or $n_{T'}(v_{min}) < n(1/h - \epsilon)$ do
7: $\quad \delta_1 = 1/h - n_{T'}(v_{min})/n$
8: $\quad \delta_2 = n_{T'}(v_{max})/n - 1/h$
9: $\quad \delta = \min(\delta_1, \delta_2)$
10: $\quad$ Let $v = v_{max}$
11: $\quad$ while $n_{T'}(v) > \delta n$ do
12: $\quad\quad v \leftarrow right_{T'}(v)$
13: $\quad$ end while
14: $\quad u \leftarrow v_{min}$
15: $\quad$ while $n_{T'}(right_{T'}(u)) \ge n_{T'}(v)$ do
16: $\quad\quad u \leftarrow left_{T'}(u)$
17: $\quad$ end while
18: $\quad T' \leftarrow T'.del\_ins(v, u)$
19: $\quad$ Reset $v_{min}$ and $v_{max}$
20: end while
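To see the rebalancing loop in isolation, the following toy sketch (hypothetical and deliberately simplified, not the actual algorithm) tracks only the sizes of the root's children and replaces the `del_ins` tree surgery with a direct transfer of leaves; it illustrates the $\delta$ accounting that drives the runtime argument below.

```python
# Toy, size-only view of SplitRoot's outer loop (not the real tree surgery).
def split_root_sizes(sizes, n, h, eps):
    """Move leaves from the largest child of the root to the smallest one
    until the root is eps-relatively balanced (each child within n/h +- eps*n)."""
    sizes = sizes + [0] * (h - len(sizes))            # dummy (null) children
    while max(sizes) > n * (1 / h + eps) or min(sizes) < n * (1 / h - eps):
        i_max, i_min = sizes.index(max(sizes)), sizes.index(min(sizes))
        delta1 = 1 / h - sizes[i_min] / n
        delta2 = sizes[i_max] / n - 1 / h
        delta = min(delta1, delta2)
        # The real algorithm descends from v_max to a subtree with at most
        # delta*n (and, by Lemma 1, more than eps*n/(2(h-1))) leaves and
        # del_ins's it under v_min; here we simply move that many leaves.
        moved = max(1, int(delta * n))
        sizes[i_max] -= moved
        sizes[i_min] += moved
    return sizes

print(split_root_sizes([70, 20, 10], n=100, h=4, eps=0.05))  # -> [30, 20, 25, 25]
```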
We now analyze this part of the algorithm. The full proofs can be found in Appendix B; we give the intuition here. To start, consider the tree we are deleting and reinserting, $T'[v]$. Ideally, we want this to have many leaves, but no more than $\delta n$. We demonstrate that:

**Lemma 1.** For a subtree $T'[v]$ that is deleted and reinserted in SplitRoot (Algorithm 1), we must have that $\epsilon n / (2(h - 1)) < n_{T'}(v) \le \delta n$.

The upper bound simply comes from our stopping condition in the first nested while loop: we ensure $n_{T'}(v) \le \delta n$ before selecting it. The lower bound is slightly more complicated. Effectively, we start by noting that $\max(\delta_1, \delta_2) > \epsilon$, because otherwise the stopping condition for the outer loop would be met. Then, consider the total "excess" of large clusters, or more precisely, the sum over all deviations from $n/h$ of clusters larger than $n/h$ (note that if all clusters had size $n/h$, the root would be perfectly balanced). This total excess must be matched by the "deficiency" of small clusters, which is the sum of deviations of clusters smaller than $n/h$. Since there are fewer than $h$ clusters on each side, the larger of the two extreme deviations can exceed the smaller by a factor of at most $h - 1$ under this accounting scheme. This allows us to bound $\delta \ge \epsilon/(h - 1)$. The tree that is deleted and reinserted must have at least half this many leaves, since it is the larger child of a node with over $\delta n$ leaves in its subtree. This gives our lower bound, showing we move a significant number of vertices at each step.

Next, we aim to establish the relative balance. In addition to our analysis, we also obtain the runtime, which is near-linear in $n$ when $h$ is small (e.g., $h = O(\log n)$, as in the high-level version of Theorem 1).

**Lemma 2.** SplitRoot (Algorithm 1) yields a hierarchy whose root is $\epsilon$-relatively balanced with $h$ children. In addition, it requires $O(nh)$ time to terminate.

The root has $h$ children by definition at the algorithm's start, and this remains invariant. The runtime comes from our aforementioned accounting scheme: the total excess and deficiency is reduced by the number of leaves in the subtree we move at each step, which we showed in Lemma 1 is at least $\epsilon n / (2(h - 1))$. This gives us a convergence time of $O(h)$ iterations (for constant $\epsilon$), and each step can be bounded by $O(n)$ time as we search for our insertion and deletion spots. Finally, the balance comes from the fact that our stopping condition is equivalent to the root being $\epsilon$-relatively balanced.

All that remains is to bound the negative impact on the cost of edges that are separated by the algorithm. We bound this via the following lemma.

**Lemma 3.** In SplitRoot (Algorithm 1), for all $e \in E$ that are separated: $cost_{T'}(e) \le n \cdot w(e) \le \frac{2(h-1)}{\epsilon} \cdot cost_T(e)$.

Lemma 1 gives us that moved subtrees have size at least $\epsilon n / (2(h-1))$, which is a lower bound on the size of the smallest cluster containing any edge separated by the algorithm. This is due to the fact that separated edges must have one endpoint in the deleted subtree and one outside, so their least common ancestor is an ancestor of the subtree. At worst, the final size of the smallest cluster containing such an edge is $n$, so the proportional increase is at most $2(h-1)/\epsilon$.

### 3.2 Fair Tree Folding

Next, we discuss how to achieve fairness using MakeFair, shown in Algorithm 2. This is our final recursive algorithm, which utilizes SplitRoot. Assume we are given some hierarchical clustering. We start by running SplitRoot to balance the split at the root and give it $h$ children. Next, we use a folding process similar to that of Knittel et al. [2023], but with our shallow tree fold operator. More specifically, we first sort the children of the root by the proportional representation of the first color (say, red). Then, we do a shallow fold across various $k$-sized sets, defined as follows: according to our ordering over the children, partition the vertices into $k$ contiguous chunks starting from the first vertex. For each $i \in [h/k]$, we find the $i$-th vertex in each chunk and fold them together. Notice that this is a $k$-wise fold since there are $k$ chunks, and we end up with $h/k$ vertices. This is repeated for each color. After this, we simply recurse on the children. If a child is too small to be balanced by SplitRoot, then we stop and give it a trivial topology (a root with many leaf-children). This completes our algorithm description.
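As a toy illustration of this grouping (the names are hypothetical), take $h = 8$ children already sorted by their fraction of red points and $k = 2$: the folds pair color-dense subtrees with color-sparse ones.

```python
# Children of the root, already sorted by the fraction of the first color;
# v1 has the highest red fraction.
h, k = 8, 2
children = [f"v{i}" for i in range(1, h + 1)]

chunk_size = h // k                                 # k contiguous chunks of h/k vertices
chunks = [children[j * chunk_size:(j + 1) * chunk_size] for j in range(k)]

# The i-th fold merges the i-th vertex of every chunk (a k-wise shallow fold),
# so each folded subtree mixes high- and low-density children.
folds = [[chunk[i] for chunk in chunks] for i in range(chunk_size)]
print(folds)   # [['v1', 'v5'], ['v2', 'v6'], ['v3', 'v7'], ['v4', 'v8']]
```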
**Algorithm 2** MakeFair

Input: A hierarchy tree $T$ of size $n \ge 1/(2\epsilon)$ over a graph $G = (V, E, w)$ whose vertices are each given one of $\lambda$ colors, and parameters $h \in [n]$, $k \in [h/(\lambda - 1)]$, and $\epsilon \in (0, \min(1/6, 1/h))$.
Output: A fair hierarchical clustering $T'$.

1: $T' = SplitRoot(T, h, \epsilon)$
2: $h' \leftarrow h$
3: for each color $\ell \in [\lambda]$ do
4: $\quad$ Order $\{v_i\}_{i \in [h']} = children(root(T'))$ decreasing by $n_\ell(leaves(v_i)) / n_{T'}(v_i)$
5: $\quad$ For all $i \in [h'/k]$: $T' \leftarrow T'.shallow\_fold(\{T'[v_{i + (j-1)h'/k}] : j \in [k]\})$
6: $\quad h' \leftarrow h'/k$
7: end for
8: for each child $v_i$ of $root(T')$ do
9: $\quad$ if $n_{T'}(v_i) \ge \max(1/(2\epsilon), h)$ then
10: $\quad\quad$ Replace $T'[v_i] \leftarrow MakeFair(T'[v_i], h, k, \epsilon)$
11: $\quad$ else
12: $\quad\quad$ Replace $T'[v_i]$ with a tree with root $v_i$, leaves $leaves(T'[v_i])$, and depth 1
13: $\quad$ end if
14: end for

We now evaluate its runtime, degree of fairness, and approximation factor. To start, we show the degree of fairness achieved at the top level of the hierarchy.

**Lemma 4.** MakeFair (Algorithm 2) yields a hierarchy in which every depth-1 vertex satisfies fairness for parameters $\alpha_\ell$ and $\beta_\ell$ lying within bounded multiplicative factors of the color proportions $\lambda_\ell$ (the precise factors are given in Appendix B).

This proof is quite involved, and most details are deferred to Appendix B. At a high level, we show that the folding process guarantees a level of fairness. The parts of our partition are ordered by the density of the color (say, red). Since each final vertex is made by folding together one vertex from each part, the folded vertices have a relatively wide spread in terms of their density of red points. This means that red points are distributed relatively well across our final subtrees, which guarantees a degree of balance.

The problem is that the degree of fairness still exhibits a compounding effect as we recurse. More specifically, since the first children are not perfectly balanced, in the next recursive step the data subset we are working on may already deviate from the true color proportions. This deviation is bounded by our result in Lemma 4, but it increases proportionally at each step.

**Lemma 5.** In MakeFair (Algorithm 2), let $\{\lambda_i\}_{i \in [\lambda]}$ be the proportions of the colors (where $\lambda_i = 1/c_i$ for a constant $c_i$) and assume $k^\lambda \le h$. At any recursive call, the proportion of any color $i$ deviates from $\lambda_i$ by at most the per-level multiplicative factor of Lemma 4 compounded $O(\log(n/h))$ times. Moreover, the recursive depth is bounded above by $O(\log(n/h))$.

This result derives directly from Lemma 4. Effectively, we increase the proportion of each color by the same factor at each recursive step. All that is left to do is bound the recursive depth. Notice we start with $n$ vertices. After splitting, our subtrees have size at most $(1/h + \epsilon)n$. After one fold, this is increased by a factor of $k$, and thus by $k^\lambda$ after all folds. Interestingly, this does not impact the final result significantly; it is fairly similar to turning an $n$-sized tree into an $(n/h)$-sized tree, giving an $O(\log(n/h))$ recursive depth. This will be sufficient to show our fairness.

Next, we evaluate the cost incurred at each stage in the hierarchy.

**Lemma 6.** In MakeFair (Algorithm 2), every edge $e \in E$ that is separated before the recursive call satisfies $cost_{T'}(e) \le O(1) \cdot cost_T(e)$, where the hidden factor depends only on $h$, $k$, $\lambda$, and $\epsilon$ (and is given explicitly in Appendix B).
As discussed before, the final cluster size after folding is at most $(1/h + \epsilon)k^\lambda n$. Any separated edge must have a starting cluster size of at least $(1/h - \epsilon)n$, as this is the size of the smallest cluster involved in tree folding. From this, it is straightforward to compute the proportional cost increase of a single recursive level, as well as the cost increase from the initial splitting in Lemma 3. We additionally demonstrate that, whenever an edge is separated, its endpoints' least common ancestor will no longer be involved in any further recursive step. More formally:

**Lemma 7.** In MakeFair (Algorithm 2), any edge $e \in E$ is separated at only one level of recursion.

Putting these two together fairly directly gives us our cost approximation.

**Lemma 8.** In MakeFair (Algorithm 2), $cost(T') \le O(1) \cdot cost(T)$, where the hidden factor depends only on $h$, $k$, $\lambda$, and $\epsilon$ (and is given explicitly in Appendix B).

Finally, Theorem 1 follows directly from Lemmas 6 and 8.

## 4 Simulations

This section validates the theoretical guarantees of Algorithm 2. Specifically, we demonstrate that modifying an unfair hierarchical clustering using the presented procedure yields a fair hierarchy that incurs only a modest increase in cost.

Datasets. We use two data sets, Census and Bank, from the UCI data repository [Dua and Graff, 2017]. Within each, we subsample only the features with numerical values. To compute the cost of a hierarchical clustering, we set the similarity to be $w(i, j) = \frac{1}{1 + d(i, j)}$, where $d(i, j)$ is the Euclidean distance between points $i$ and $j$. We color data based on binary (represented as blue and red) protected features: race for Census and marital status for Bank (both in line with the prior work of Ahmadian et al. [2020]). As a result, Census has a blue-to-red ratio of 1:7 while Bank has 1:3. We then subsample each color in each data set such that we retain (approximately) the data's original balance. We use samples of size 512 for the balance experiments, and vary the sample sizes when assessing cost. For each experiment we conduct 10 independent replications (with different random seeds for the subsampling) and report the average results. We vary the parameters $(c, h, k, \epsilon)$ to experimentally assess their theoretical impact on the approximation guarantees of Section 3; note that the $c$ parameter is used to scale $\epsilon$. Due to space constraints, we present only the results for the Census dataset here and defer the complementary results on Bank to the appendix.

Figure 3: Histogram of cluster balances after tree manipulation by Algorithm 2 on a subsample from the Census dataset of size $n = 512$. The four panels depict: (A) cluster balances after applying the (unfair) average-linkage algorithm, (B) the resultant cluster balances after running Algorithm 2 with parameters $(c, h, k, \epsilon) = (8, 4, 2, 1/(c \log_2 n))$, (C) cluster balances after tuning $c = 4$, and (D) cluster balances after further tuning $c = 2$. The vertical red line on each plot indicates the balance of the dataset itself.

Implementation. The Python code for the following experiments is available in the Supplementary Material. We start by running average-linkage, a popular hierarchical clustering algorithm.
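To make this pipeline concrete, here is a rough, self-contained sketch (not the released code from the Supplementary Material) of the first step: building the similarity graph, running average-linkage via SciPy, and evaluating Definition 2 on the resulting tree. The random matrix is only a stand-in for a numerical subsample of Census or Bank.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import pdist, squareform

def similarity_matrix(X):
    """w(i, j) = 1 / (1 + d(i, j)), with d the Euclidean distance."""
    return 1.0 / (1.0 + squareform(pdist(X, metric="euclidean")))

def dasgupta_cost(node, W):
    """Definition 2, evaluated on a SciPy ClusterNode tree."""
    if node.is_leaf():
        return 0.0
    left, right = node.left.pre_order(), node.right.pre_order()
    size = len(left) + len(right)
    # Pairs split at this node have it as their lowest common ancestor.
    split_cost = sum(W[i, j] for i in left for j in right) * size
    return split_cost + dasgupta_cost(node.left, W) + dasgupta_cost(node.right, W)

X = np.random.rand(256, 5)                   # stand-in for a numerical subsample
W = similarity_matrix(X)
T = to_tree(linkage(X, method="average"))    # unfair average-linkage hierarchy
baseline_cost = dasgupta_cost(T, W)
# T_fair would then be obtained by applying Algorithm 2 to T, and
# RATIO_cost = dasgupta_cost(T_fair, W) / baseline_cost.
```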
We then apply Algorithm 2 to modify this structure and induce a fair hierarchical clustering that exhibits a mild increase in the cost objective.

Metrics. In our results we track the approximate cost objective increase as follows: let $G$ be our given graph, $T$ be average-linkage's output, and $T'$ be Algorithm 2's output. We then measure the ratio $\mathrm{RATIO}_{cost} = cost_G(T')/cost_G(T)$. We additionally quantify the fairness that results from applying our algorithm by reporting the balances of each cluster in the final hierarchical clustering, where true fairness would match the color proportions of the underlying dataset.

Results. We first demonstrate how our algorithm adapts an unfair hierarchy into one that achieves fair representation of the protected attributes as desired in the original problem formulation. In Figure 3, we depict the cluster balances of an unfair hierarchical clustering algorithm, namely average-linkage, and subsequently demonstrate that our algorithm effectively concentrates all clusters around the underlying data balance. In particular, we first apply the algorithm and then show how the balance is further refined by tuning the parameters. The application of Algorithm 2 dramatically improves the representation of the protected attributes in the final clustering and, as such, firmly resolves the problem of achieving fairness.

Figure 4: Relative cost of the fair hierarchical clustering resulting from Algorithm 2 compared to the unfair clustering, as a function of the sample size $n$.

While reaching this fair partitioning of the data is the overall goal, we further demonstrate that, in modifying the unfair clustering, we only increase the cost approximation by a modest amount. Figure 4 illustrates the change in relative cost as we increase the sample size $n$, the primary influence on our theoretical cost guarantees of Section 3. Specifically, we vary $n$ in {128, 256, 512, 1024, 2048} and compute 10 replications (on different random seeds) of the fair hierarchical clustering procedure. Figure 4 depicts the mean relative cost of these replications with standard error bars. Notably, we see that the cost does increase with $n$ as expected, but the increase relative to the unfair cost obtained by average-linkage is only a small multiplicative factor.

As demonstrated through this experimentation, the simple procedure of Algorithm 2 not only ensures the desired fairness properties absent in conventional (unfair) clustering algorithms but accomplishes this feat with a negligible rise in the overall cost. These results further highlight the immense value of our work.

## References

Sara Ahmadian, Alessandro Epasto, Ravi Kumar, and Mohammad Mahdian. Clustering without over-representation. In KDD, pages 267-275, 2019.

Sara Ahmadian, Alessandro Epasto, Marina Knittel, Ravi Kumar, Mohammad Mahdian, Benjamin Moseley, Philip Pham, Sergei Vassilvitskii, and Yuyan Wang. Fair hierarchical clustering. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020.

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks. ProPublica, 2016.

Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning. www.fairmlbook.org, 2019.
Omer Ben-Porat, Fedor Sandomirskiy, and Moshe Tennenholtz. Protecting the protected group: Circumventing harmful fairness. In Thirty-Fifth AAAI Conference on Artificial Intelligence, pages 5176-5184. AAAI Press, 2021.

Suman Kalyan Bera, Deeparnab Chakrabarty, Nicolas Flores, and Maryam Negahbani. Fair algorithms for clustering. In NeurIPS, pages 4955-4966, 2019.

Ioana O. Bercea, Martin Groß, Samir Khuller, Aounon Kumar, Clemens Rösner, Daniel R. Schmidt, and Melanie Schmidt. On the cost of essentially fair clusterings. In APPROX-RANDOM, pages 18:1-18:22, 2019.

M. Bogen and A. Rieke. Help wanted: An examination of hiring algorithms, equity, and bias. Technical report, Upturn, 2018.

Brian Brubach, Darshan Chakrabarti, John P. Dickerson, Samir Khuller, Aravind Srinivasan, and Leonidas Tsepenekas. A pairwise fair and community-preserving approach to k-center clustering. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 1178-1189. PMLR, 2020.

Moses Charikar and Vaggos Chatziafratis. Approximate hierarchical clustering via sparsest cut and spreading metrics. In SODA, pages 841-854, 2017.

Junjie Chen, Shu Zhang, Xiaoting He, Qingwei Lin, Hongyu Zhang, Dan Hao, Yu Kang, Feng Gao, Zhangwei Xu, Yingnong Dang, and Dongmei Zhang. How incidental are the incidents? Characterizing and prioritizing incidents for large-scale online service systems. Association for Computing Machinery, 2021a.

Junjie Chen, Shu Zhang, Xiaoting He, Qingwei Lin, Hongyu Zhang, Dan Hao, Yu Kang, Feng Gao, Zhangwei Xu, Yingnong Dang, and Dongmei Zhang. How incidental are the incidents? Characterizing and prioritizing incidents for large-scale online service systems. Association for Computing Machinery, 2021b.

Yujun Chen, Xian Yang, Hang Dong, Xiaoting He, Hongyu Zhang, Qingwei Lin, Junjie Chen, Pu Zhao, Yu Kang, Feng Gao, Zhangwei Xu, and Dongmei Zhang. Identifying linked incidents in large-scale online service systems. Association for Computing Machinery, 2020.

Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, and Sergei Vassilvitskii. Fair clustering through fairlets. In NIPS, pages 5029-5037, 2017.

Vincent Cohen-Addad, Varun Kanade, Frederik Mallmann-Trenn, and Claire Mathieu. Hierarchical clustering: Objective functions and algorithms. In SODA, pages 378-397, 2018.

Sanjoy Dasgupta. A cost function for similarity-based hierarchical clustering. In STOC, pages 118-127, 2016.

Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard S. Zemel. Fairness through awareness. In Innovations in Theoretical Computer Science (ITCS 2012), pages 214-226. ACM, 2012.

Marina Knittel, Max Springer, John P. Dickerson, and Mohammad Taghi Hajiaghayi. Generalized reductions: Making any hierarchical clustering fair and balanced with low cost, 2023.

Alexander Kraskov, Harald Stögbauer, Ralph G. Andrzejak, and Peter Grassberger. Hierarchical clustering using mutual information. CoRR, q-bio.QM/0311037, 2003.

Heidi Ledford. Millions of black people affected by racial bias in healthcare algorithms. Nature, 2019.

Benjamin Moseley and Joshua Wang. Approximation bounds for hierarchical clustering: Average linkage, bisecting k-means, and local search. In NIPS, pages 3094-3103, 2017.
Rohan Ramanath, Monojit Choudhury, and Kalika Bali. Entailment: An effective metric for comparing and evaluating hierarchical and non-hierarchical annotation schemes. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse (LAW-ID@ACL 2013), pages 42-50. The Association for Computational Linguistics, 2013.

A. N. Selvan, L. M. Cole, L. Spackman, S. Naylor, and C. Wright. Hierarchical cluster analysis to aid diagnostic image data visualization of MS and other medical imaging modalities. In Molecular Biotechnology, 2005.

Latanya Sweeney. Discrimination in online ad delivery. ACM Queue, 2013.