# Fair Clustering via Alignment

Kunwoong Kim 1, Jihu Lee 1, Sangchul Park 2, Yongdai Kim 1

1 Department of Statistics, Seoul National University, Republic of Korea. 2 School of Law, Seoul National University, Republic of Korea. Correspondence to: Yongdai Kim.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Abstract

Algorithmic fairness in clustering aims to balance the proportions of instances assigned to each cluster with respect to a given sensitive attribute. While recently developed fair clustering algorithms optimize clustering objectives under specific fairness constraints, their inherent complexity or approximation often results in suboptimal clustering utility or numerical instability in practice. To resolve these limitations, we propose a new fair clustering algorithm based on a novel decomposition of the fair K-means clustering objective function. The proposed algorithm, called Fair Clustering via Alignment (FCA), operates by alternately (i) finding a joint probability distribution to align the data from different protected groups, and (ii) optimizing cluster centers in the aligned space. A key advantage of FCA is that it theoretically guarantees approximately optimal clustering utility for any given fairness level without complex constraints, thereby enabling high-utility fair clustering in practice. Experiments show that FCA outperforms existing methods by (i) attaining a superior trade-off between fairness level and clustering utility, and (ii) achieving near-perfect fairness without numerical instability.

1. Introduction

As artificial intelligence (AI) technology has advanced and been successfully applied to diverse domains and tasks, the requirement for AI systems to make fair decisions (i.e., algorithmic fairness) has emerged as an important societal issue. This requirement is particularly necessary when observed data possess historical biases with respect to specific sensitive attributes, leading to unfair outcomes of learned models based on such biased data (Angwin et al., 2016; Ingold & Soper, 2016; Damodaran et al., 2018; Mehrabi et al., 2019). Moreover, non-discrimination laws are also increasingly emphasizing the importance of fair decision making based on AI systems (Hellman, 2019). Specifically, group fairness is a category within algorithmic fairness that ensures models do not discriminate against certain protected groups, which are defined by specific sensitive attributes (e.g., race). In response, a large amount of research has been conducted to develop algorithms for mitigating such biases in various supervised learning tasks such as classification (Zafar et al., 2017; Donini et al., 2018; Agarwal et al., 2018; Quadrianto et al., 2019; Jiang et al., 2020) and regression (Agarwal et al., 2019; Chzhen et al., 2020). Along with supervised learning, algorithmic fairness for unsupervised learning tasks, such as clustering, has also gathered significant interest. Clustering algorithms have long been employed as fundamental unsupervised learning methods in machine learning applications such as recommendation systems (Widiyaningtyas et al., 2021), image processing (Le, 2013; Guo et al., 2020; Mittal et al., 2022), and language modeling (Butnaru & Ionescu, 2017; Zhang et al., 2023).

Related works for Fair Clustering (FC). Combining algorithmic fairness and clustering, the notion of Fair Clustering (FC) was initially introduced in Chierichetti et al. (2017).
FC operates under the goal that the proportion of each protected group within each cluster should be similar to that in the population. To achieve this goal, various algorithms have been developed to minimize a given clustering objective under pre-specified fairness constraints (Bera et al., 2019; Kleindessner et al., 2019; Backurs et al., 2019; Li et al., 2020; Esmaeili et al., 2021; Ziko et al., 2021; Zeng et al., 2023), to name a few. We can roughly categorize the existing FC algorithms into three types: (i) pre-processing, (ii) in-processing, and (iii) post-processing. Pre-processing methods (Chierichetti et al., 2017; Backurs et al., 2019) involve transforming instances into a fair space based on the concept of fairlets. Fairlets are small subsets that satisfy (perfect) fairness, and thus performing standard clustering on the fairlet space yields a fair clustering. In-processing methods (Kleindessner et al., 2019; Ziko et al., 2021; Li et al., 2020; Zeng et al., 2023) aim to simultaneously find both the cluster centers and assignments of the fair clustering by solving constrained optimization problems. Post-processing methods (Bera et al., 2019; Harb & Lam, 2020) focus on finding fair assignments given fixed cluster centers. The fixed cluster centers are typically predetermined by a standard clustering algorithm.

Our contributions. In this paper, we focus on the trade-off between fairness level and clustering utility: our goal is to maximize clustering utility while satisfying a given fairness level. While the trade-off between fairness and utility is inevitable (Bertsimas et al., 2011; Chhabra et al., 2021), achieving the optimal trade-off with existing FC algorithms remains challenging. For example, pre- or post-processing algorithms usually result in suboptimal clustering utility due to indirect maximization of clustering utility (e.g., Backurs et al. (2019); Esmaeili et al. (2021)). Even when designed to achieve a reasonable trade-off, in-processing algorithms may suffer from numerical instability, particularly when the given fairness level is high (e.g., Ziko et al. (2021)). This paper aims to address these challenges by developing a new in-processing algorithm that can practically achieve a superior trade-off between fairness level and clustering utility without numerical instability.

The primary idea of our proposed algorithm is to optimally align data from different protected groups by transforming them into a common space (called the aligned space), and then applying a standard clustering algorithm in the aligned space. We prove that the optimal fair clustering, i.e., the clustering with minimal clustering cost under a given fairness constraint, is equivalent to the optimal clustering in the aligned space. Based on this theoretical result, we devise a new FC algorithm, called Fair Clustering via Alignment (FCA); the implementation code is available at https://github.com/kwkimonline/FCA. FCA alternately finds the aligned space and the (approximately) optimal clustering in the aligned space until convergence. To find the aligned space, we develop a modified version of an algorithm for finding the optimal transport map (Kantorovich, 2006), while a standard clustering algorithm (e.g., the K-means++ algorithm (Arthur & Vassilvitskii, 2007)) is applied to find the (approximately) optimal clustering in the aligned space. It is worth noting that FCA can be particularly compared to the fairlet-based methods (e.g., Backurs et al. (2019)).
While existing fairlet-based methods first find fairlets and then perform clustering sequentially, FCA simultaneously builds the aligned space and performs clustering to obtain an approximately optimal fair clustering. A detailed comparison is provided in Remark 3.2.

The main contributions of this paper can be summarized as follows:
- We provide a novel decomposition of the fair clustering cost into two components: (i) the transport cost with respect to a joint distribution between the two protected groups, and (ii) the clustering cost with respect to cluster centers in the aligned space.
- Building on this decomposition, we develop a novel FC algorithm called FCA (Fair Clustering via Alignment), which is stable and guarantees convergence.
- Theoretically, we prove that FCA achieves an approximately optimal trade-off between fairness level and clustering utility, for any given fairness level.
- Experimentally, we show that FCA (i) outperforms existing baseline FC algorithms in terms of the trade-off and numerical stability, and (ii) effectively controls the trade-off across various fairness levels.

2. Preliminaries

Notations. Let $D = \{(x_i, s_i)\}_{i=1}^{n}$ be a given dataset (i.e., a set of observed instances), where $x_i \in \mathbb{R}^d$ and $s_i \in \{0, 1\}$ are the $d$-dimensional data and the binary sensitive attribute, respectively. We denote $(X, S)$ as the random vector whose joint distribution, denoted as $\mathbb{P}$, is the empirical distribution on $D$. Let $\mathbb{P}_s$ represent the conditional distribution of $X$ given $S = s$. In this paper, we specifically define these distributions to discuss the (probabilistic) matching between two protected groups of different sizes. We denote $\mathbb{E}$ and $\mathbb{E}_s$ as the expectation operators of $\mathbb{P}$ and $\mathbb{P}_s$, respectively. Let $\mathcal{X} = \{x_i\}_{i=1}^{n}$, $\mathcal{X}_s = \{x_i \in \mathcal{X} : s_i = s\}$, and $n_s := |\mathcal{X}_s|$ for $s \in \{0, 1\}$. Denote $\|\cdot\|$ as the $L_2$ norm. We assume that the number of clusters, represented by $K \in \mathbb{N}$, is given a priori. The $K$-many cluster centers are denoted as $\mu := \{\mu_1, \ldots, \mu_K\}$, where $\mu_k \in \mathbb{R}^d$, $k \in [K] = \{1, \ldots, K\}$. Let $A : \mathcal{X} \times \{0, 1\} \to \mathcal{S}_K$ be an assignment function that takes as input $(x, s) \in \mathcal{X} \times \{0, 1\}$ and returns the assignment probabilities over clusters for the data point $x$, where $\mathcal{S}_K$ is the $(K-1)$-dimensional simplex. We consider this probabilistic assignment function to ensure the existence of a perfectly fair clustering.

Clustering objective function. We first present the mathematical formulation of the clustering objective function. The objective of the standard (i.e., fair-unaware) K-means clustering is to minimize the clustering cost $C(\mu, A) := \frac{1}{n} \sum_{k=1}^{K} \sum_{(x,s) \in D} A(x, s)_k \|x - \mu_k\|^2$ with respect to $\mu$ and $A$. Note that $C(\mu, A)$ can be equivalently re-written as $\mathbb{E}\left[\sum_{k=1}^{K} A(X, S)_k \|X - \mu_k\|^2\right]$. Furthermore, the optimal assignment function is deterministic, i.e., $A(x, s)_k = \mathbb{1}\left(\arg\min_{k' \in [K]} \|x - \mu^*_{k'}\|^2 = k\right)$ for a given $(x, s) \in \mathcal{X} \times \{0, 1\}$, where $\mu^*_1, \ldots, \mu^*_K$ are the centers of the optimal clustering. Thus, $C(\mu, A)$ becomes $\mathbb{E}\left[\min_k \|X - \mu_k\|^2\right]$, and the optimal clustering is obtained by finding $\mu$ minimizing $\mathbb{E}\left[\min_k \|X - \mu_k\|^2\right]$.

Definition of fair clustering. An assignment function $A$ is said to be fair in view of group (or proportional) fairness (Chierichetti et al., 2017) if it satisfies, for all $k \in [K]$,
$$\frac{\sum_{x_i \in \mathcal{X}_0} A(x_i, 0)_k}{\sum_{x_j \in \mathcal{X}_1} A(x_j, 1)_k} = \frac{n_0}{n_1}. \quad (1)$$
This constraint ensures that the proportion of data belonging to each cluster is balanced across the protected groups, resulting in a fair clustering. That is, we find the cluster centers $\mu$ and the assignment function $A$ that minimize $C(\mu, A)$ among all $\mu$ and all fair assignment functions $A$ satisfying eq. (1).
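To make these preliminaries concrete, the following minimal sketch computes the clustering cost $C(\mu, A)$ and the per-cluster group ratios of eq. (1). It assumes hypothetical NumPy arrays `X` (shape `(n, d)`), a sensitive-attribute vector `s`, a (possibly probabilistic) assignment matrix `A` of shape `(n, K)`, and `centers` of shape `(K, d)`; the names are ours and not from the paper's released code.

```python
import numpy as np

def clustering_cost(X, A, centers):
    """C(mu, A): average over points of sum_k A(x, s)_k * ||x - mu_k||^2."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (n, K) squared distances
    return float((A * d2).sum(axis=1).mean())

def group_ratios(A, s):
    """Per-cluster ratios of eq. (1): sum_{x in X0} A(x, 0)_k / sum_{x in X1} A(x, 1)_k.
    Perfect fairness means every ratio equals n0 / n1 (assumes no empty cluster)."""
    return A[s == 0].sum(axis=0) / A[s == 1].sum(axis=0)
```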
To assess the fairness level in clustering, the Balance measure is widely used (Chierichetti et al., 2017; Bera et al., 2019; Backurs et al., 2019; Esmaeili et al., 2021; Ziko et al., 2021; Zeng et al., 2023), which is defined as
$$\min_{k \in [K]} \min\left( \frac{\sum_{x_i \in \mathcal{X}_0} A(x_i, 0)_k}{\sum_{x_j \in \mathcal{X}_1} A(x_j, 1)_k},\ \frac{\sum_{x_j \in \mathcal{X}_1} A(x_j, 1)_k}{\sum_{x_i \in \mathcal{X}_0} A(x_i, 0)_k} \right).$$
Note that the higher the balance, the fairer the clustering. Furthermore, the balance of any perfectly fair clustering is $\min(n_0/n_1, n_1/n_0)$. The objective of FC is to minimize $C(\mu, A)$ with respect to the cluster centers $\mu$ and the assignment function $A$, under a fairness constraint (e.g., balance $= \min(n_0/n_1, n_1/n_0)$). We abbreviate $A(\cdot, s) = A_s(\cdot)$ and $C(\mu, A) = C(\mu, A_0, A_1)$ when their meanings are clear.

For the case of perfect fairness, in Section 3, we prove that the FC objective can be decomposed into the sum of (i) the cost of transporting data from different protected groups to a common space (called the aligned space), and (ii) the clustering cost in the aligned space built by the transported data. Building on the decomposition, in Section 4.1, we introduce our proposed algorithm for perfect fairness. Then, in Section 4.2, we extend the algorithm to control the fairness level by relaxing the perfect fairness constraint.

3. Reformulation of fair clustering objective

This section presents our main theoretical contribution: we derive a novel decomposition of the perfectly fair (i.e., $\sum_{x_i \in \mathcal{X}_0} A_0(x_i)_k / \sum_{x_j \in \mathcal{X}_1} A_1(x_j)_k = n_0/n_1$ for all $k \in [K]$) clustering objective, which motivates our proposed algorithms in Section 4. In Section 3.1, we introduce our idea by discussing the simplest case where the two protected groups are of equal size ($n_0 = n_1$). Then, in Section 3.2, we generalize to the unequal case ($n_0 \neq n_1$) by constructing the aligned space defined by a given joint distribution on $\mathcal{X}_0 \times \mathcal{X}_1$. We prove that there exists a joint distribution such that the objective function of perfectly fair clustering can be decomposed into the sum of the transport cost with respect to the joint distribution and the clustering cost with respect to the cluster centers on the aligned space. Full proofs of all the theoretical findings in this section are given in Appendix B.

3.1. Case of equal sample sizes: $n_0 = n_1$

Assume that the sizes of the two protected groups are equal (i.e., $n_0 = n_1$). We consider deterministic assignment functions, i.e., $A_s(x)_k \in \{0, 1\}$, since the optimal perfectly fair assignment function is deterministic when $n_0 = n_1$ (see Remark 3.4 in Section 3.2). The case of probabilistic assignment functions when $n_0 \neq n_1$ is discussed in Section 3.2. The core idea of FCA is to match two instances from different protected groups and assign them to the same cluster. By doing so, i.e., matching all instances from $\mathcal{X}_0$ to $\mathcal{X}_1$ in a one-to-one fashion and assigning each pair to the same cluster, the resulting clustering becomes perfectly fair. Conversely, suppose we are given a perfectly fair clustering constructed by a deterministic assignment function $A$. Since $n_0 = n_1$ and $A$ is deterministic, there exists a one-to-one matching between $\mathcal{X}_0$ and $\mathcal{X}_1$ such that two matched instances belong to the same cluster. Thus, we can decompose the clustering cost in terms of the one-to-one matching, as presented in Theorem 3.1.
Theorem 3.1. For any given perfectly fair deterministic assignment function $A$ and cluster centers $\mu$, there exists a one-to-one matching map $T : \mathcal{X}_s \to \mathcal{X}_{1-s}$ such that, for any $s \in \{0, 1\}$,
$$C(\mu, A_0, A_1) = \underbrace{\mathbb{E}_s\!\left[\frac{\|X - T(X)\|^2}{4}\right]}_{\text{Transport cost w.r.t. } T} + \underbrace{\mathbb{E}_s\!\left[\sum_{k=1}^{K} A_s(X)_k \left\|\frac{X + T(X)}{2} - \mu_k\right\|^2\right]}_{\text{Clustering cost w.r.t. } \mu \text{ and } T}. \quad (2)$$

The assignment function $A$ that minimizes eq. (2) given $\mu$ and $T$ assigns both $x$ and $T(x)$ to cluster $k^*$, where $k^* = \arg\min_{k \in [K]} \|\frac{x + T(x)}{2} - \mu_k\|^2$. Hence, the optimal perfectly fair clustering can be found by minimizing $\mathbb{E}_s\left[\|X - T(X)\|^2/4 + \min_k \|(X + T(X))/2 - \mu_k\|^2\right]$ with respect to $\mu$ and $T$, instead of finding the optimal $\mu$ and $A$ minimizing $C(\mu, A_0, A_1)$. We update $\mu$ for a given $T$ by applying a standard clustering algorithm to $\{\frac{x + T(x)}{2} : x \in \mathcal{X}_s\}$, which is called the aligned space. To update $T$, any algorithm for finding the optimal matching can be used, where the cost between two instances $x_0 \in \mathcal{X}_0$ and $x_1 \in \mathcal{X}_1$ is given by $\|x_0 - x_1\|^2/4 + \min_k \|(x_0 + x_1)/2 - \mu_k\|^2$ (see Section 4.1 for the specific algorithm we use). Note that there are no complex constraints in updating $\mu$ and $T$. Finally, we define the $A$ corresponding to $T$, which assigns both $x$ and $T(x)$ to the same cluster on the aligned space $\{\frac{x + T(x)}{2} : x \in \mathcal{X}_s\}$. See Appendix A.4 for a similar decomposition that extends to other general distance metrics (e.g., the $L_p$ norm for $p \geq 1$).

Figure 1. Comparison between the fairlet-based method and our approach with $n_0 = n_1 = 4$ and $K = 2$. The representative of each fairlet is set as the mean vector of the data within that fairlet, and the standard K-means algorithm is then applied to this set of representatives. The clustering results are visualized using contours. While both the fairlet-based method and ours result in perfectly fair clustering, i.e., balance (Bal) $= 1 = \min(n_0/n_1, n_1/n_0)$, our approach yields a lower cost ($9.82 < 10.22$), due to more efficient matchings.

Remark 3.2 (Comparison to the fairlet-based methods). Although our idea of matching data from different protected groups may seem similar to the fairlet-based methods (Chierichetti et al., 2017; Backurs et al., 2019), they fundamentally differ: our method is an in-processing approach that directly minimizes the clustering cost with respect to both the matching map and the cluster centers simultaneously. In contrast, the fairlet-based method is a two-step pre-processing approach, which does not directly minimize the clustering cost; instead, it first searches for fairlets and then attempts to find the optimal cluster centers on the set of representatives for each fairlet. See Appendix A.1 for details about fairlets. As a result, in the fairlet-based method, the matchings and cluster centers are not jointly optimized, which may lead to suboptimal clustering utility. Figure 1 illustrates how the fairlet-based method can produce a suboptimal clustering when compared to our approach. It implies that more efficient matchings exist that yield higher clustering utility than the matchings of fairlets, and our approach is specifically designed to find these efficient matchings. We confirm this claim more comprehensively using real benchmark datasets in Section 5.2.
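As a small illustration of Theorem 3.1 for equal group sizes, the sketch below (a hypothetical helper, not the authors' implementation) evaluates the decomposed objective: a permutation `perm` plays the role of the one-to-one matching $T$, and each matched pair is assigned to the cluster whose center is closest to the pair's midpoint.

```python
import numpy as np

def decomposed_cost(X0, X1, perm, centers):
    """Objective of Theorem 3.1 for n0 == n1: x_i in X0 is matched with T(x_i) = X1[perm[i]]."""
    X1m = X1[perm]                                        # matched partners T(x)
    transport = ((X0 - X1m) ** 2).sum(-1) / 4.0           # ||x - T(x)||^2 / 4
    mid = (X0 + X1m) / 2.0                                # the aligned (midpoint) space
    d2 = ((mid[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    clustering = d2.min(axis=1)                           # min_k ||(x + T(x))/2 - mu_k||^2
    return float((transport + clustering).mean())         # empirical E_s[.] over matched pairs
```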
3.2. Case of unequal sample sizes: $n_0 \neq n_1$

To handle the case of $n_0 \neq n_1$, we follow a similar strategy to that for $n_0 = n_1$, but replace the matching map $T$ with a joint distribution $Q$ over $X|S = 0$ and $X|S = 1$. For each $s \in \{0, 1\}$, let $X_s$ denote the random variable following the conditional distribution of $X|S = s$, i.e., $\mathbb{P}_s$. We reformulate the perfectly fair clustering cost in terms of cluster centers $\mu$ and a joint distribution $Q$ whose marginal distributions are $\mathbb{P}_s$, $s \in \{0, 1\}$. Note that $Q$ serves as a smooth and stochastic version of $T$. Let $\mathcal{Q} = \{$all joint distributions $Q$ on $\mathcal{X}_0 \times \mathcal{X}_1$ with $Q_s = \mathbb{P}_s$, $s \in \{0, 1\}\}$, where $Q_s$ is the marginal distribution of $Q$ on $\mathcal{X}_s$. Theorem 3.3 below, which is the main theoretical result and motivation of our proposed algorithm, proves that the optimal perfectly fair clustering can be found by optimizing the joint distribution $Q$ and cluster centers $\mu$. Let $\pi_s = n_s/(n_s + n_{s'})$ for $s \neq s' \in \{0, 1\}$. We then define $T_A(x_0, x_1) := \pi_0 x_0 + \pi_1 x_1$ as the alignment map.

Theorem 3.3. Let $\mu^* \subset \mathbb{R}^d$ and $Q^* \in \mathcal{Q}$ be the cluster centers and joint distribution minimizing
$$\mathbb{E}_{(X_0, X_1) \sim Q}\left[2\pi_0\pi_1 \|X_0 - X_1\|^2 + \min_k \|T_A(X_0, X_1) - \mu_k\|^2\right]. \quad (3)$$
Then, $(\mu^*, A^*_0, A^*_1)$ is the solution of the perfectly fair K-means clustering, where $A^*_0(x)_k := Q^*\left(\arg\min_{k'} \|T_A(x, X_1) - \mu^*_{k'}\|^2 = k \,\middle|\, X_0 = x\right)$ and $A^*_1(x)_k$ is defined similarly.

This result implies that, by simultaneously minimizing the transport cost $\|X_0 - X_1\|^2$ and finding the cluster centers in the aligned space $\{T_A(X_0, X_1) : (X_0, X_1) \sim Q\}$, we obtain the optimal perfectly fair clustering. A notable observation here is that there is no explicit constraint in eq. (3). In fact, the constraint for perfect fairness (i.e., $\sum_{x_i \in \mathcal{X}_0} A(x_i, 0)_k / \sum_{x_j \in \mathcal{X}_1} A(x_j, 1)_k = n_0/n_1$ for all $k \in [K]$) is implicitly satisfied through the use of the alignment map $T_A$ together with the assignment functions $A^*_0$ and $A^*_1$. In conclusion, solving eq. (3) with respect to $\mu$ and $Q$ yields the optimal perfectly fair clustering. The algorithm for solving eq. (3) is detailed in Section 4.

Remark 3.4 ($A^*$ becomes deterministic when $n_0 = n_1$). Assume that $n_0 = n_1$. Then, for a given $x_i \in \mathcal{X}_0$, note that $\gamma_{i,j}$ is positive for a unique $j \in [n_1]$, meaning that $Q^*$ in Theorem 3.3 corresponds to the one-to-one matching map $T$ in Theorem 3.1. In other words, finding $Q^*$ becomes equivalent to finding the optimal permutation between $[n_0]$ and $[n_1]$. See Remark 2.4 in Peyré & Cuturi (2020) for the theoretical evidence. Then, we have $A^*_0(x_i)_k := Q^*\left(\arg\min_{k'} \|T_A(x_i, X_1) - \mu^*_{k'}\|^2 = k \mid X_0 = x_i\right) = \mathbb{1}\left(\arg\min_{k'} \|\frac{x_i + T(x_i)}{2} - \mu^*_{k'}\|^2 = k\right)$, which is deterministic, provided that $\pi_0 = \pi_1 = 1/2$.

4. Proposed algorithms

4.1. FCA: Fair Clustering via Alignment

Based on Theorem 3.3, we propose an algorithm for finding the (approximately) optimal perfectly fair clustering, called Fair Clustering via Alignment (FCA). FCA consists of two phases. Phase 1 finds the joint distribution $Q$ with the cluster centers $\mu$ being fixed. Phase 2 updates the cluster centers $\mu$ with the joint distribution $Q$ being fixed. Then, FCA iterates these two phases until the cluster centers converge.

Phase 1: Finding the joint distribution. Phase 1 finds the optimal $Q$ that minimizes the cost in eq. (3) given $\mu$. For this goal, we modify the Kantorovich problem (Kantorovich, 2006; Villani, 2008), which finds the optimal coupling between two measures for a given cost matrix. Appendix A.2 covers details regarding the Kantorovich problem along with the optimal transport problem. First, the transport cost matrix between the two groups (the first term of eq. (3)) is defined by $C := [c_{i,j}] \in \mathbb{R}^{n_0 \times n_1}_{+}$, where $c_{i,j} = 2\pi_0\pi_1 \|x_i - x_j\|^2$. The clustering cost matrix between the aligned data and their assigned centers (the second term of eq. (3)) is defined by $D := [d_{i,j}] \in \mathbb{R}^{n_0 \times n_1}_{+}$, where $d_{i,j} = \min_{k \in [K]} \|T_A(x_i, x_j) - \mu_k\|^2$. Then, we find the optimal coupling $\Gamma = [\gamma_{i,j}] \in \mathbb{R}^{n_0 \times n_1}_{+}$ solving
$$\min_{\Gamma} \langle C + D, \Gamma \rangle = \min_{\gamma_{i,j}} \sum_{i=1}^{n_0} \sum_{j=1}^{n_1} (c_{i,j} + d_{i,j})\, \gamma_{i,j} \quad \text{s.t.} \quad \sum_{i=1}^{n_0} \gamma_{i,j} = \frac{1}{n_1}\ \forall j,\quad \sum_{j=1}^{n_1} \gamma_{i,j} = \frac{1}{n_0}\ \forall i,\quad \gamma_{i,j} \geq 0. \quad (4)$$
Note that this problem becomes the original Kantorovich problem when $D = 0$. Hence, this objective can also be efficiently solved using linear programming, similar to the Kantorovich problem (Villani, 2008).
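The two cost matrices of Phase 1 can be built directly from the data and the current centers, as in the minimal sketch below (assuming NumPy arrays `X0`, `X1`, and `centers`; the function name is ours). $C$ holds the transport term and $D$ the clustering term of eq. (3), so that eq. (4) is an optimal-transport problem with cost $C + D$.

```python
import numpy as np

def fca_cost_matrices(X0, X1, centers):
    """Cost matrices of eq. (4) for fixed centers mu.
    c_ij = 2 * pi0 * pi1 * ||x_i - x_j||^2        (transport term)
    d_ij = min_k ||T_A(x_i, x_j) - mu_k||^2       (clustering term in the aligned space)."""
    n0, n1 = len(X0), len(X1)
    pi0, pi1 = n0 / (n0 + n1), n1 / (n0 + n1)
    diff = X0[:, None, :] - X1[None, :, :]                  # (n0, n1, d)
    C = 2.0 * pi0 * pi1 * (diff ** 2).sum(axis=-1)          # (n0, n1)
    aligned = pi0 * X0[:, None, :] + pi1 * X1[None, :, :]   # T_A(x_i, x_j), (n0, n1, d)
    D = np.full((n0, n1), np.inf)
    for mu in centers:                                      # minimum over the K centers
        D = np.minimum(D, ((aligned - mu) ** 2).sum(axis=-1))
    return C, D, aligned
```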
Based on the optimal coupling $\Gamma = [\gamma_{i,j}]$ that solves eq. (4), we define the joint distribution as $Q(\{x_i, x_j\}) = \gamma_{i,j}$. That is, we have the measures (weights) $\{\gamma_{i,j}\}_{i \in [n_0], j \in [n_1]}$ for the aligned points in $\{T_A(x_i, x_j)\}_{i \in [n_0], j \in [n_1]}$. As a result, we define the corresponding aligned space as the $n_0 \times n_1$ many pairs of weights and aligned points: $\{(\gamma_{i,j}, T_A(x_i, x_j))\}_{i \in [n_0], j \in [n_1]}$.

Phase 2: Optimizing cluster centers. Phase 2 optimizes the cluster centers $\mu$ on the aligned space $\{(\gamma_{i,j}, T_A(x_i, x_j))\}_{i \in [n_0], j \in [n_1]}$ obtained from Phase 1, by solving $\min_{\mu} \sum_{i=1}^{n_0} \sum_{j=1}^{n_1} \gamma_{i,j} \min_k \|T_A(x_i, x_j) - \mu_k\|^2$. Standard clustering algorithms, such as the K-means++ algorithm (Arthur & Vassilvitskii, 2007) or a gradient descent-based algorithm, can be used to update $\mu$. Note that in Section 5.4, we empirically show that FCA is stable regardless of the algorithm used to optimize $\mu$. See Algorithm 1 for the overall procedure of FCA.

Algorithm 1 (FCA)
Input: (i) Dataset $\mathcal{X}_0 \cup \mathcal{X}_1$. (ii) The number of clusters $K$.
1: Initialize cluster centers $\mu = \{\mu_k\}_{k=1}^{K}$.
2: while $\mu$ has not converged do
3:   Update $\Gamma = [\gamma_{i,j}] \in \mathbb{R}^{n_0 \times n_1}_{+}$ by solving eq. (4) for a fixed $\mu$. // Phase 1: update $\Gamma$
4:   Update $\mu$ by solving $\min_{\mu} \sum_{i=1}^{n_0} \sum_{j=1}^{n_1} \gamma_{i,j} \min_k \|T_A(x_i, x_j) - \mu_k\|^2$ for a fixed $\Gamma$. // Phase 2: update $\mu$
5: end while
6: Build fair assignments: for $x_i \in \mathcal{X}_s$, define $A_s(x_i)_k := \sum_{x_j \in \mathcal{X}_{s'}} n_s \gamma_{i,j}\, \mathbb{1}\left(\arg\min_{k'} \|\pi_s x_i + \pi_{s'} x_j - \mu_{k'}\|^2 = k\right)$, $k \in [K]$.
Output: (i) Cluster centers $\mu = \{\mu_k\}_{k=1}^{K}$. (ii) Assignments $A_0(x_i)$, $x_i \in \mathcal{X}_0$, and $A_1(x_j)$, $x_j \in \mathcal{X}_1$.

In Appendix A.3, we introduce a practical extension of FCA to scenarios involving multiple protected groups and present experimental results on a real dataset that validate this extension (see Table 4).
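Putting the two phases together, the following sketch of Algorithm 1 uses the libraries the paper reports using (`ot.emd` from POT for the linear program, scikit-learn's KMeans with sample weights for the centers) and reuses `fca_cost_matrices` from the sketch above. It is illustrative only: it runs a fixed number of outer iterations instead of testing convergence, and it materializes all $n_0 \times n_1$ aligned points, so it is intended for small datasets.

```python
import numpy as np
import ot                                    # POT: Python Optimal Transport
from sklearn.cluster import KMeans

def fca(X0, X1, K, n_iter=20, seed=0):
    """Sketch of Algorithm 1 (FCA)."""
    n0, n1 = len(X0), len(X1)
    a = np.full(n0, 1.0 / n0)                # row marginal of eq. (4): sum_j gamma_ij = 1/n0
    b = np.full(n1, 1.0 / n1)                # column marginal:          sum_i gamma_ij = 1/n1
    centers = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(
        np.vstack([X0, X1])).cluster_centers_            # fair-unaware initialization
    for _ in range(n_iter):
        C, D, aligned = fca_cost_matrices(X0, X1, centers)
        G = ot.emd(a, b, C + D)                          # Phase 1: optimal coupling Gamma
        w, pts = G.ravel(), aligned.reshape(-1, X0.shape[1])
        keep = w > 0                                     # the LP solution is sparse
        centers = KMeans(n_clusters=K, init=centers, n_init=1, random_state=seed).fit(
            pts[keep], sample_weight=w[keep]).cluster_centers_   # Phase 2: weighted K-means
    # Step 6 of Algorithm 1: soft fair assignments for group 0 (group 1 is symmetric)
    d2 = ((aligned[:, :, None, :] - centers[None, None, :, :]) ** 2).sum(-1)  # (n0, n1, K)
    nearest = d2.argmin(axis=-1)
    A0 = np.stack([n0 * (G * (nearest == k)).sum(axis=1) for k in range(K)], axis=1)
    return centers, G, A0
```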
4.2. FCA-C: control of fairness level

It is also important to find the optimal non-perfectly fair clustering. For this purpose, we introduce a feasible relaxation of FCA called FCA-Control (FCA-C), a variant of FCA specifically designed for controlling the fairness level. Let $W \subset \mathcal{X}_0 \times \mathcal{X}_1$ be a given subset. The idea of our relaxation is to apply the FCA algorithm only to instances in $W^c = \mathcal{X}_0 \times \mathcal{X}_1 \setminus W$ and to apply the standard K-means algorithm to those in $W$. Note that $W = \emptyset$ recovers FCA, while $W = \mathcal{X}_0 \times \mathcal{X}_1$ results in standard (fair-unaware) clustering. For given $\mu$ and $W$, we define
$$C_{\text{FCA}}(\mu, W) := \left(2\pi_0\pi_1\|X_0 - X_1\|^2 + \min_k \|T_A(X_0, X_1) - \mu_k\|^2\right)\mathbb{1}\left((X_0, X_1) \in W^c\right)$$
and
$$C_{\text{K-means}}(\mu, W) := \left(\min_k \pi_0\|X_0 - \mu_k\|^2 + \min_k \pi_1\|X_1 - \mu_k\|^2\right)\mathbb{1}\left((X_0, X_1) \in W\right).$$
Denote $\varepsilon > 0$ as a hyper-parameter that controls the fairness level. For a given $\varepsilon$, the FCA-C algorithm minimizes
$$\mathbb{E}_{(X_0, X_1) \sim Q}\left[C_{\text{FCA}}(\mu, W) + C_{\text{K-means}}(\mu, W)\right] \quad (5)$$
with respect to $\mu$, $Q$, and $W$ satisfying $Q((X_0, X_1) \in W) \leq \varepsilon$. Figure 2 visualizes an example of $W$ and $Q$. Then, we construct $A$ as follows. For $x_0 \in \mathcal{X}_0$, we define
$$A_0(x_0)_k := P_1\left(\arg\min_{k'} \|T_A(x_0, X_1) - \mu_{k'}\|^2 = k,\ (x_0, X_1) \in W^c\right) + \mathbb{1}\left(\arg\min_{k'} \|x_0 - \mu_{k'}\|^2 = k\right) P_1\left((x_0, X_1) \in W\right), \quad (6)$$
and the assignment function $A_1$ can be defined similarly.

Figure 2. Example illustration of $W$ and $\Gamma$ (or equivalently $Q$) when $(n_0, n_1) = (2, 5)$ and $\varepsilon = \gamma_{1,4} + \gamma_{2,2} + \gamma_{2,5}$.

In summary, FCA-C comprises three phases: the first two mirror Phases 1 and 2 of FCA, and the third updates $W$. The overall algorithm for FCA-C is in Algorithm 2.

Algorithm 2 (FCA-C)
Input: (i) Dataset $\mathcal{X}_0 \cup \mathcal{X}_1$. (ii) The number of clusters $K$. (iii) Fairness level $\varepsilon \in [0, 1]$.
1: Initialize cluster centers $\mu = \{\mu_k\}_{k=1}^{K}$ and a subset $W \subset \mathcal{X}_0 \times \mathcal{X}_1$ such that $\frac{1}{n_0 n_1}\sum_{i=1}^{n_0}\sum_{j=1}^{n_1} \mathbb{1}((x_i, x_j) \in W) \leq \varepsilon$.
2: while $\mu$ has not converged do
3:   Calculate the costs $C_{\text{K-means}}$ for $(x_i, x_j) \in W$ and $C_{\text{FCA}}$ for $(x_i, x_j) \in W^c$.
4:   Update $\Gamma$ by minimizing eq. (5) for fixed $\mu$ and $W$. // Phase 1: update $\Gamma$
5:   Update $\mu$ by minimizing eq. (5) for fixed $\Gamma$ and $W$. // Phase 2: update $\mu$
6:   For all $(x_i, x_j) \in \mathcal{X}_0 \times \mathcal{X}_1$, calculate $\eta(x_i, x_j) := 2\pi_0\pi_1\|x_i - x_j\|^2 + \min_k \|T_A(x_i, x_j) - \mu_k\|^2$. Let $\eta_\varepsilon$ be the $\varepsilon$-th upper quantile. Update $W = \{(x_i, x_j) \in \mathcal{X}_0 \times \mathcal{X}_1 : \eta(x_i, x_j) > \eta_\varepsilon\}$. // Phase 3: update $W$
7: end while
8: Build fair assignment functions $A_0$ and $A_1$ following eq. (6).
Output: (i) Cluster centers $\mu = \{\mu_k\}_{k=1}^{K}$. (ii) Assignments $A_0(x_i)$, $x_i \in \mathcal{X}_0$, and $A_1(x_j)$, $x_j \in \mathcal{X}_1$.

Theoretical validity of FCA-C. Notably, the FCA-C algorithm achieves a solid theoretical guarantee for the (approximately) optimal trade-off between fairness level and clustering utility, for any given fairness level to be satisfied. First, Theorem 4.1 shows that solving the objective of the FCA-C algorithm in eq. (5) is equivalent to minimizing the clustering cost $C(\mu, A)$ under a given fairness constraint; its proof is given in Appendix B.3. For technical simplicity, assume that the densities of $\mathbb{P}_0$ and $\mathbb{P}_1$ exist (i.e., we consider the population version). See Remark B.1 in Appendix B for the case when the densities do not exist. Recall that $\mathbb{E}_s(A_s(X)_k) = \frac{1}{n_s}\sum_{x_i \in \mathcal{X}_s} A_s(x_i)_k$. Let $\mathcal{A}_\varepsilon := \{(A_0, A_1) : \sum_{k=1}^{K} |\mathbb{E}_0(A_0(X)_k) - \mathbb{E}_1(A_1(X)_k)| \leq \varepsilon\}$ represent a set of fair assignment functions. Let $C(Q, W, \mu)$ be the objective of FCA-C in eq. (5).

Theorem 4.1 (Equivalence between $C(Q, W, \mu)$ and constrained $C(\mu, A)$). Minimizing the FCA-C objective $C(Q, W, \mu)$, with the corresponding assignment function defined in eq. (6), is equivalent to minimizing $C(\mu, A_0, A_1)$ subject to $(A_0, A_1) \in \mathcal{A}_\varepsilon$.

Note that $\mathcal{A}_\varepsilon$ is based on the sum of unfairness over $k \in [K]$, whereas several previous works (e.g., Bera et al. (2019); Backurs et al. (2019)) focus on the proportions for all $k \in [K]$, which is more directly related to balance. However, $\mathcal{A}_\varepsilon$ is also closely connected to balance: (i) $\varepsilon = 0$ ensures perfect fairness (i.e., balance $= \min(n_0/n_1, n_1/n_0)$), and (ii) for all $k \in [K]$, the gap between $\sum_{x_i \in \mathcal{X}_0} A_0(x_i)_k / \sum_{x_j \in \mathcal{X}_1} A_1(x_j)_k$ and $n_0/n_1$ is bounded linearly in $\varepsilon$, as shown in Proposition 4.2. Its proof is provided in Appendix B.4.

Proposition 4.2 (Relationship between balance and $\varepsilon$). For any assignment function $(A_0, A_1) \in \mathcal{A}_\varepsilon$, we have, for all $k \in [K]$,
$$\left|\frac{\sum_{x_i \in \mathcal{X}_0} A_0(x_i)_k}{\sum_{x_j \in \mathcal{X}_1} A_1(x_j)_k} - \frac{n_0}{n_1}\right| \leq c\,\varepsilon, \quad \text{where } c = \frac{n_0}{n_1}\max_{k \in [K]} \frac{1}{\mathbb{E}_1\left[A_1(X)_k\right]}.$$

Section 5.2 empirically validates Theorem 4.1 and Proposition 4.2 by showing that the trade-off between fairness (balance) and utility (cost) is effectively controlled by $\varepsilon$. Lastly, we establish the approximation guarantee of FCA-C in Theorem 4.3, similar to Bera et al. (2019). The approximation error of a given algorithm is defined by an upper bound on the ratio by which the cost of the solution of the algorithm can exceed the optimal cost. Let $\tau$ be the approximation error of a standard clustering algorithm used to find initial cluster centers without the fairness constraint. Suppose that $\sup_{x \in \mathcal{X}} \|x\|^2 \leq R$ for some $R > 0$.

Theorem 4.3 (Approximation guarantee of FCA-C). For any given $\varepsilon$, the FCA-C algorithm returns a $(\tau + 2)$-approximate solution with a violation of $3R\varepsilon$ for the optimal fair clustering, which is the solution of $\min_{\mu, A_0, A_1} C(\mu, A_0, A_1)$ subject to $(A_0, A_1) \in \mathcal{A}_\varepsilon$.

A more formal statement is provided in Theorem B.2, with the proof in Appendix B.5. For example, if we use K-means++ initialization (Arthur & Vassilvitskii, 2007), then we have $\tau = O(\log K)$. The violation term $3R\varepsilon$ suggests that FCA-C would be more efficient when $\varepsilon$ is small.

Remark 4.4 (Comparison of the approximation rate with existing approaches). The approximation rate $\tau + 2$ of FCA-C is comparable to that of existing approaches. For example, Bera et al. (2019) proposed an algorithm that achieves a $(\tau + 2)$-approximate solution with a violation of 3. FCA-C and the algorithm of Bera et al. (2019) provide the same rate of $\tau + 2$, but FCA-C can attain a smaller violation when $R = 1$, since $\varepsilon \in [0, 1]$. Furthermore, Schmidt et al. (2019) provided a $(5.5\tau + 1)$-approximation algorithm, which exceeds $\tau + 2$ for $\tau > 2/9$.
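For Phase 3 of Algorithm 2 above, a short sketch of the $W$ update (reusing the hypothetical `fca_cost_matrices` helper from Section 4.1; illustrative only): the $\varepsilon$-fraction of pairs with the largest cost $\eta$ is placed in $W$ and clustered fair-unaware, so $\varepsilon = 0$ recovers FCA.

```python
import numpy as np

def update_W(X0, X1, centers, eps):
    """Phase 3 of Algorithm 2 (FCA-C): keep the eps-th upper quantile of eta in W."""
    C, D, _ = fca_cost_matrices(X0, X1, centers)   # eta(x_i, x_j) = c_ij + d_ij
    eta = C + D
    eta_eps = np.quantile(eta, 1.0 - eps)          # eps-th upper quantile of eta
    return eta > eta_eps                           # boolean (n0, n1) membership mask for W
```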
4.3. Reducing computational complexity

As discussed in Section 4.1, finding the joint distribution $Q$ is technically equivalent to solving the Kantorovich problem, which can be done by linear programming (Villani, 2008). Its computational complexity is approximately $O(n^3)$ when $n_0 = n_1 = n$ (Bonneel et al., 2011), indicating a high computational cost when $n$ is large. To address this issue in practice, we randomly split each group into $L$ partitions of (approximately) the same size: $\{\mathcal{X}_0^{(l)}\}_{l=1}^{L}$ of $\mathcal{X}_0$ and $\{\mathcal{X}_1^{(l)}\}_{l=1}^{L}$ of $\mathcal{X}_1$. We then calculate the optimal coupling $\Gamma^{(l)}$ between $\mathcal{X}_0^{(l)}$ and $\mathcal{X}_1^{(l)}$ by solving eq. (4). The (full) optimal coupling between $\mathcal{X}_0$ and $\mathcal{X}_1$ is estimated by $\hat{\Gamma} := \text{diag}\left(\frac{1}{L}\Gamma^{(1)}, \ldots, \frac{1}{L}\Gamma^{(L)}\right)$. Let $m = |\mathcal{X}_0^{(l)}| + |\mathcal{X}_1^{(l)}|$ be the partition size. This technique can theoretically reduce the complexity to $O(n/m) \times O(m^3) = O(nm^2)$. In our experiments, we apply this technique with $m = 1024$, which is significantly faster than using the full datasets, while yielding similar performance (see Section 5.4 for the results).

5. Experiments

This section presents experiments showing that (i) FCA outperforms baseline methods in terms of the trade-off between fairness and utility; (ii) FCA-C effectively controls the fairness level; and (iii) FCA is numerically stable, efficient, and scalable. We further illustrate the applicability of FCA to visual clustering.

5.1. Settings

Datasets and performance measures. We use three benchmark tabular datasets, ADULT (Becker & Kohavi, 1996), BANK (Moro et al., 2012), and CENSUS (Meek et al.), from the UCI Machine Learning Repository (https://archive.ics.uci.edu/datasets) (Dua & Graff, 2017). The sensitive attribute is defined by gender, marital status, and gender for ADULT, BANK, and CENSUS, respectively. The number of clusters $K$ is set to 10 for ADULT and BANK, and 20 for CENSUS, following Ziko et al. (2021). The features of the data are scaled to have zero mean and unit variance. We then optionally apply L2 normalization, as used in Ziko et al. (2021). Further details are given in Appendix C.1.

We consider two performance measures, Cost and Bal. The former assesses clustering utility, while the latter measures fairness level. For a given assignment function $A$, let $\bar{A}_s(x)_k := \mathbb{1}(\arg\max_{k'} A_s(x)_{k'} = k)$ be a deterministic version of $A$. Note that we use this deterministic $\bar{A}$ instead of $A$ itself for fair comparison with existing algorithms. For a given $x$, let $k(x, s)$ be the index $k$ such that $\bar{A}_s(x)_k = 1$. Then, for given $\mu = \{\mu_k\}_{k=1}^{K}$, Cost and Bal on $D$ are defined as $\text{Cost} = \frac{1}{n}\sum_{(x,s) \in D} \|x - \mu_{k(x,s)}\|^2$ and $\text{Bal} = \min_k \min(r_k, 1/r_k)$, where $r_k = \sum_{x \in \mathcal{X}_0} \bar{A}_0(x)_k / \sum_{x \in \mathcal{X}_1} \bar{A}_1(x)_k$. Let $\text{Bal}^* := \min(n_0/n_1, n_1/n_0)$ be the balance of a perfectly fair clustering for the given $D$.
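Both evaluation measures can be computed directly from a hard cluster assignment, as in the following sketch (array names are ours; `assign` holds one cluster index per point and `s` the binary sensitive attribute).

```python
import numpy as np

def evaluate(X, s, centers, assign):
    """Cost and Bal of Section 5.1 for a hard (deterministic) assignment."""
    cost = float(np.mean(((X - centers[assign]) ** 2).sum(axis=1)))  # (1/n) sum ||x - mu_{k(x,s)}||^2
    bal = np.inf
    for k in range(len(centers)):
        c0 = np.sum((assign == k) & (s == 0))       # group-0 count in cluster k
        c1 = np.sum((assign == k) & (s == 1))       # group-1 count in cluster k
        if c0 == 0 or c1 == 0:
            return cost, 0.0                        # an empty side makes the cluster maximally unfair
        bal = min(bal, c0 / c1, c1 / c0)            # min_k min(r_k, 1/r_k)
    return cost, float(bal)
```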
Baseline algorithms and implementation details For the baseline algorithms compared with FCA, we consider four methods: a pre-processing (fairlet-based) method SFC from Backurs et al. (2019), two post-processing methods FCBC from Esmaeili et al. (2021) and FRAC from Gupta et al. (2023), and an in-processing method VFC from Ziko et al. (2021), which differs from the other baselines as it is specifically designed to control the trade-off between fairness level Bal and clustering utility Cost. For details of these baselines, refer to Appendix C.2. When solving the linear program (i.e., finding the coupling matrix Γ), we use the POT library (Flamary et al., 2021). For finding cluster centers, we adopt the scikit-learn library (Pedregosa et al., 2011) to run the K-means algorithm. We specifically choose the K-means++ initialization (Arthur & Vassilvitskii, 2007) with Lloyd s algorithm (Lloyd, 1982). An ablation study comparing the Lloyd s algorithm and a gradient descent-based algorithm (i.e., Adam (Kingma & Ba, 2014)) for finding cluster centers is provided in Section 5.4. The maximum number of iterations is set to 100, and we select the best iteration when Cost is minimized. 5.2. Performance comparison results Trade-off: fairness level vs. clustering utility First, we compare FC algorithms in terms of their ability to achieve reasonable trade-off between fairness level (Bal) and clustering utility (Cost). Specifically, we compare FCA with three baselines: SFC, FCBC, and FRAC, all of which are designed to achieve perfect fairness. Table 1 presents Bal and Cost of the four methods, where FCA consistently attains the lowest Cost (i.e., the highest utility). Notably, FCA outperforms SFC by achieving higher Bal and lower Cost, highlighting the effectiveness of inprocessing approach for finding better matchings. While FCA requires slightly more iterations until convergence (see Table 21 in Appendix C.3.8), it remains superior even with a smaller number of iterations similar to SFC (see Table 22 in Appendix C.3.8). Moreover, FCA also outperforms SFC in K-median clustering (see Table 23 in Appendix C.3.9), which is the original clustering objective of SFC. Table 8 in Appendix C.3.1 additionally shows that FCA also outperforms the first fairlet-based method introduced by Chierichetti et al. (2017). Furthermore, FCBC and Fair Clustering via Alignment Table 1. Comparison of the trade-off between Bal and Cost on ADULT, BANK, and CENSUS datasets. We underline Bal values for the cases of near-perfect fairness (i.e., Bal Bal ) and use bold face for the lowest Cost value among those cases. Similar results when data are not L2-normalized are presented in Table 6 in Appendix C.3.1. Dataset / Bal ADULT / 0.494 BANK / 0.649 CENSUS / 0.969 With L2 normalization Cost ( ) Bal ( ) Cost ( ) Bal ( ) Cost ( ) Bal ( ) Standard (fair-unaware) 0.295 0.223 0.208 0.325 0.403 0.024 FCBC (Esmaeili et al., 2021) 0.314 0.443 0.685 0.615 1.006 0.926 SFC (Backurs et al., 2019) 0.534 0.489 0.410 0.632 1.015 0.937 FRAC (Gupta et al., 2023) 0.340 0.490 0.307 0.642 0.537 0.954 FCA 0.328 0.493 0.264 0.645 0.477 0.962 FRAC offer lower Bal and higher Cost than FCA in most cases. Specifically for a fixed Cost 0.314, FCA achieves a higher Bal than FCBC (see Table 7 in Appendix C.3.1). These results suggest that FCA is the most effective algorithm for maximizing utility under perfect fairness. Table 2. 
Comparison of the two in-processing algorithms, VFC and FCA, in terms of numerical stability when achieving the maximum Bal, on ADULT, BANK, and CENSUS datasets with L2 normalization. Bold-faced results are the highest Bal. Similar results without L2 normalization are in Table 9 in Appendix C.3.1. Dataset / Bal VFC (Ziko et al., 2021) FCA ADULT / 0.494 0.437 (0.310) 0.493 (0.328) BANK / 0.649 0.568 (0.221) 0.645 (0.264) CENSUS / 0.969 0.749 (0.432) 0.962 (0.477) Numerical stability Second, we compare the two inprocessing algorithms, FCA and VFC, in terms of their ability to achieve a high (near-perfect) fairness level (i.e., Bal Bal ) without numerical instability. To do so, we obtain the maximum Bal that each algorithm can attain. We also evaluate robustness to data pre-processing, i.e., the impact of L2 normalization to numerical stability. The results presented in Table 2 show that VFC achieves Bal values no higher than 90% of Bal . While FCA is explicitly designed to achieve perfect fairness, VFC is designed to control Bal using a hyper-parameter. However, even with a large hyper-parameter (see Appendix C.2 for the chosen hyper-parameter), VFC fails to achieve perfect fairness (i.e., Bal remains significantly lower than Bal ). Moreover, the performance gap between FCA and VFC becomes greater without L2 normalization than with it, and VFC further fails on CENSUS dataset due to an overflow (see Appendix C.3.1 for details on this issue). Control of fairness level In addition to the previous analysis for the scenario of perfect fairness (where Bal Bal ), we also compare FCA-C and VFC in terms of controlling Bal lower than Bal . To this end, we assess the ability to achieve reasonable Cost while controlling Bal. Figure 3. Bal vs. Cost trade-offs on ADULT dataset. Black square ( ) is from the standard clustering, orange circle ( ) is from VFC, green star ( ) is from FCA-C, orange dashed line (- -) is the maximum of Bal that VFC can achieve, and blue line ( ) is the maximum achievable balance Bal . Figure 3 displays the performance of FCA-C and VFC across various fairness levels, on ADULT dataset. It shows that FCA-C achieves favorable trade-off for many values of Bal (by controlling ε), with Cost similar to that of VFC. Similar results on other datasets or without L2 normalization are given in Figures 4 and 5 in Appendix C.3.1. Furthermore, FCA-C can attain Bal beyond the orange dashed vertical line, which is the maximum achievable Bal by VFC. Refer to Appendix C.2 for details on selecting ε for FCA and the hyper-parameters used in VFC. Furthermore, the computation times of FCA-C (when ε > 0) and FCA (i.e., FCA-C with ε = 0) are compared in Table 20 in Appendix C.3.8, indicating that finding W in FCA-C does not require significant additional time. 5.3. Applicability to visual clustering We further evaluate the applicability of FCA to visual clustering using two image datasets, which are commonly used in recent visual clustering algorithms (Li et al., 2020; Zeng et al., 2023): (i) RMNIST, a mixture of the original MNIST and a color-reversed version, and (ii) OFFICE-31, consisting of images from two domains. We compare FCA with SFC, VFC and two visual FC baselines: DFC (Li et al., 2020) and FCMI (Zeng et al., 2023). Fair Clustering via Alignment Note that DFC and FCMI are end-to-end algorithms that simultaneously perform clustering and learn latent space using autoencoder with fairness constraints. 
In contrast, FCA, SFC, and VFC are applied to a latent space pre-trained by autoencoder without any fairness constraints. The clustering utility is evaluated using ACC (accuracy based on assignment indices and ground-truth labels) and NMI (normalized mutual information between assigned cluster distribution and ground-truth label distribution), as in Zeng et al. (2023). Table 3. Comparison of clustering utility (ACC) and fairness level (Bal) on two image datasets. Standard (fair-unaware) indicates autoencoder + K-means. The first-place values are bold, and second-place values are underlined. Dataset RMNIST OFFICE-31 Bal 1.000 0.282 A = ACC, B = Bal A ( ) B ( ) A ( ) B ( ) Standard (fair-unaware) 41.0 0.000 63.8 0.192 SFC (Backurs et al., 2019) 51.3 1.000 61.6 0.267 VFC (Ziko et al., 2021) 38.1 0.000 64.8 0.212 DFC (Li et al., 2020) 49.9 0.800 69.0 0.165 FCMI (Zeng et al., 2023) 88.4 0.995 70.0 0.226 FCA 89.0 1.000 67.6 0.270 Table 3 compares FCA with the baseline methods, showing its performance is comparable to the state-of-the-art FCMI and generally better than SFC, VFC, and DFC. While DFC achieves slightly higher ACC on the OFFICE-31 dataset, its Bal is significantly lower (0.165 vs. 0.270 of FCA). Notably, while FCA operates as a two-step approach leveraging the pre-trained latent space, while DFC and FCMI are endto-end methods. Moreover, FCA offers practical benefits: (i) requiring fewer hyper-parameters, reducing burden of hyperparameter tuning in the absence of ground-truth labels, and (ii) adaptability with any pre-trained latent space. Further details are provided in Appendix C.3.2, where Table 12 presents the full results of Table 3, which includes the similar superiority of FCA in terms of NMI values. Moreover, Table 13 highlights FCA s additional benefits over the fairlet-based method (SFC), in terms of matching efficiency. 5.4. Ablation studies (1) Selection of the partition size m We empirically confirm the efficiency of the partitioning technique in Section 4.3, by investigating the convergence of Bal and Cost with respect to the partition size m. In Appendix C.3.3, Figure 6 suggests that m = 1024 yields reasonable results, while Table 14 shows a significant reduction in computation time. (2) Optimization and initialization of cluster centers µ: We also analyze the stability with respect to (i) the choice of optimization algorithm for finding cluster centers, and (ii) the initialization of cluster centers. Appendix C.3.4 shows that FCA is robust to both factors. (3) Consistent outperformance across various K: We confirm that FCA consistently outperforms baseline methods for various values of K. That is, as shown in Figure 8 in Appendix C.3.5, FCA outperforms the baseline methods across all K {5, 10, 20, 40}. (4) Scalability for large-scale data: We further investigate the adaptivity of FCA to large-scale data. To do so, we compare FCA and VFC using a large-scale dataset (n = 106), and observe that FCA still outperforms VFC (see Appendix C.3.6). (5) Linear program vs. Sinkhorn for optimizing Q In our main experiments, we solve the objective in eq. (4) with the linear programming. To explore alternative approaches, we compare (i) FCA with the linear programming and (ii) FCA with the Sinkhorn algorithm, in terms of Cost, Bal, and runtime. Table 19 in Appendix C.3.7 shows that their performances are comparable; however, the Sinkhorn algorithm requires tuning a regularization parameter, suggesting that solving the linear program remains a practical and reliable option. 6. 
Conclusion and discussion This paper has proposed FCA, an in-processing algorithm for fair clustering, inspired by the theoretical equivalence between optimal perfectly fair clustering and finding cluster centers in the aligned space. FCA algorithm is based on well-known algorithms, the K-means algorithm and linear programming, making it transparent and stable. Furthermore, we have developed FCA-C (a variant of FCA) to control fairness level, and established its approximation guarantee. Experimental results show FCA achieves nearperfect fairness, superior clustering utility, and robustness to data pre-processing, optimization methods, and initialization. A promising direction for future work is to apply FCA to other clustering algorithms such as model-based clustering, e.g., Gaussian mixture, which we will pursue in near future. Fair Clustering via Alignment Acknowledgements This work was partly supported by the National Research Foundation of Korea(NRF) grant funded by the Korea government(MSIT) (No. 2022R1A5A7083908), the National Research Foundation of Korea(NRF) grant funded by the Korea government(MSIT) (RS-2025-00556079), Next-generation Intelligence semiconductor R&D Program through the National Research Foundation of Korea(NRF) funded by the Korea government(MSIT)(RS2024-00431718), and by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government(MSIT) [NO.RS-2021II211343, Artificial Intelligence Graduate School Program (Seoul National University)]. Impact Statement This study aims to advance the field of algorithmic fairness in machine learning. In particular, given the growing application of clustering algorithms across diverse domains, we believe the proposed framework offers a promising approach to fair clustering. For example, it can ensure that privileged and unprivileged groups are clustered fairly, leading to unbiased downstream decision-making. Thus, this study is expected to yield a potentially positive impact rather than significant negative societal consequences. Furthermore, incorporating the joint distribution in the proposed algorithm enables the identification of specific data that are matched and assigned together. This feature can enhance interpretability of fair clustering through the lens of implicit matching. While this work focuses on proportional (group) fairness, we acknowledge that such an approach may overlook other fairness notions, such as individual fairness or social fairness (e.g., balancing clustering costs across groups), which are not explicitly addressed within the proportional fairness framework. We therefore encourage readers to interpret our results with this scope in mind: our goal is to improve the trade-off between proportional fairness level and clustering cost, but care must be taken when considering the broader implications for other fairness notions. In conclusion, we anticipate future studies that integrate fair clustering research with these additional notions of fairness. Agarwal, A., Beygelzimer, A., Dudik, M., Langford, J., and Wallach, H. A reductions approach to fair classification. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 60 69. PMLR, 10 15 Jul 2018. URL https://proceedings.mlr.press/v80/ agarwal18a.html. Agarwal, A., Dud ık, M., and Wu, Z. S. Fair regression: Quantitative definitions and reduction-based algorithms. 
In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 915 June 2019, Long Beach, California, USA, 2019. URL http://proceedings.mlr.press/v97/ agarwal19d.html. Angwin, J., Larson, J., Mattu, S., and Kirchner, L. Machine bias. Pro Publica, May, 23:2016, 2016. Arthur, D. and Vassilvitskii, S. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 07, pp. 1027 1035, USA, 2007. Society for Industrial and Applied Mathematics. ISBN 9780898716245. Backurs, A., Indyk, P., Onak, K., Schieber, B., Vakilian, A., and Wagner, T. Scalable fair clustering. In International Conference on Machine Learning, pp. 405 413. PMLR, 2019. Becker, B. and Kohavi, R. Adult. UCI Machine Learning Repository, 1996. DOI: https://doi.org/10.24432/C5XW20. Bera, S., Chakrabarty, D., Flores, N., and Negahbani, M. Fair algorithms for clustering. Advances in Neural Information Processing Systems, 32, 2019. Bertsimas, D., Farias, V. F., and Trichakis, N. The price of fairness. Oper. Res., 59(1):17 31, January 2011. ISSN 0030-364X. doi: 10.1287/opre.1100.0865. URL https: //doi.org/10.1287/opre.1100.0865. Bonneel, N., van de Panne, M., Paris, S., and Heidrich, W. Displacement interpolation using lagrangian mass transport. ACM Trans. Graph., 30(6):1 12, dec 2011. ISSN 0730-0301. doi: 10.1145/2070781. 2024192. URL https://doi.org/10.1145/ 2070781.2024192. Butnaru, A. M. and Ionescu, R. T. From image to text classification: A novel approach based on clustering word embeddings. Procedia Computer Science, 112:1783 1792, 2017. ISSN 1877-0509. doi: https://doi.org/10.1016/j.procs.2017.08.211. URL https://www.sciencedirect.com/ science/article/pii/S1877050917316071. Knowledge-Based and Intelligent Information & Engineering Systems: Proceedings of the 21st International Conference, KES-20176-8 September 2017, Marseille, France. Fair Clustering via Alignment Chen, L., Kyng, R., Liu, Y. P., Peng, R., Gutenberg, M. P., and Sachdeva, S. Maximum flow and minimum-cost flow in almost-linear time. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pp. 612 623, 2022. doi: 10.1109/FOCS54457.2022.00064. Chhabra, A., Masalkovait e, K., and Mohapatra, P. An overview of fairness in clustering. IEEE Access, 9: 130698 130720, 2021. doi: 10.1109/ACCESS.2021. 3114099. Chiappori, P., Mc Cann, R., and Nesheim, L. Hedonic price equilibria, stable matching, and optimal transport: equivalence, topology, and uniqueness. Economic Theory, 42(2):317 354, 2010. URL https://Econ Papers.repec.org/Re PEc: spr:joecth:v:42:y:2010:i:2:p:317-354. Chierichetti, F., Kumar, R., Lattanzi, S., and Vassilvitskii, S. Fair clustering through fairlets. Advances in neural information processing systems, 30, 2017. Chzhen, E., Denis, C., Hebiri, M., Oneto, L., and Pontil, M. Fair regression with wasserstein barycenters. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 7321 7331. Curran Associates, Inc., 2020. URL https://proceedings. neurips.cc/paper/2020/file/ 51cdbd2611e844ece5d80878eb770436-Paper. pdf. Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Burges, C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings. neurips.cc/paper/2013/file/ af21d0c97db2e27e13572cbf59eb343d-Paper. pdf. 
Damodaran, B. B., Kellenberger, B., Flamary, R., Tuia, D., and Courty, N. Deepjdot: Deep joint distribution optimal transport for unsupervised domain adaptation. In Computer Vision ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part IV, pp. 467 483, Berlin, Heidelberg, 2018. Springer-Verlag. ISBN 978-3-030-01224-3. doi: 10.1007/978-3-030-01225-0 28. URL https://doi. org/10.1007/978-3-030-01225-0_28. Deb, N., Ghosal, P., and Sen, B. Rates of estimation of optimal transport maps using plug-in estimators via barycentric projections. In Neur IPS, 2021. Donini, M., Oneto, L., Ben-David, S., Shawe-Taylor, J. S., and Pontil, M. Empirical risk minimization under fairness constraints. In Advances in Neural Information Processing Systems, pp. 2791 2801, 2018. Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml. Esmaeili, S., Brubach, B., Srinivasan, A., and Dickerson, J. Fair clustering under a bounded cost. Advances in Neural Information Processing Systems, 34:14345 14357, 2021. Flamary, R., Courty, N., Gramfort, A., Alaya, M. Z., Boisbunon, A., Chambon, S., Chapel, L., Corenflos, A., Fatras, K., Fournier, N., Gautheron, L., Gayraud, N. T., Janati, H., Rakotomamonjy, A., Redko, I., Rolet, A., Schutz, A., Seguy, V., Sutherland, D. J., Tavenard, R., Tong, A., and Vayer, T. Pot: Python optimal transport. Journal of Machine Learning Research, 22(78):1 8, 2021. URL http: //jmlr.org/papers/v22/20-451.html. Forrow, A., H utter, J.-C., Nitzan, M., Rigollet, P., Schiebinger, G., and Weed, J. Statistical optimal transport via factored couplings. In Chaudhuri, K. and Sugiyama, M. (eds.), Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pp. 2454 2465. PMLR, 16 18 Apr 2019. URL https://proceedings.mlr.press/v89/ forrow19a.html. Galichon, A. Optimal Transport Methods in Economics. Princeton University Press, 2016. URL http://www. jstor.org/stable/j.ctt1q1xs9h. Genevay, A., Cuturi, M., Peyr e, G., and Bach, F. Stochastic optimization for large-scale optimal transport. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS 16, pp. 3440 3448, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819. Gordaliza, P., Barrio, E. D., Fabrice, G., and Loubes, J.- M. Obtaining fairness using optimal transport theory. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2357 2365. PMLR, 09 15 Jun 2019. URL https://proceedings.mlr.press/v97/ gordaliza19a.html. Guo, J., Zhu, X., Zhao, C., Cao, D., Lei, Z., and Li, S. Z. Learning meta face recognition in unseen domains. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6162 6171, Los Alamitos, CA, USA, jun 2020. IEEE Computer Society. doi: 10.1109/CVPR42600.2020.00620. URL Fair Clustering via Alignment https://doi.ieeecomputersociety.org/ 10.1109/CVPR42600.2020.00620. Gupta, S., Ghalme, G., Krishnan, N. C., and Jain, S. Efficient algorithms for fair clustering with a new notion of fairness. Data Min. Knowl. Discov., 37(5): 1959 1997, mar 2023. ISSN 1384-5810. doi: 10.1007/ s10618-023-00928-6. URL https://doi.org/10. 1007/s10618-023-00928-6. Harb, E. and Lam, H. S. Kfc: A scalable approximation algorithm for k-center fair clustering. 
In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 14509 14519. Curran Associates, Inc., 2020. URL https://proceedings.neurips. cc/paper_files/paper/2020/file/ a6d259bfbfa2062843ef543e21d7ec8e-Paper. pdf. Hellman, D. Measuring algorithmic fairness. Criminal Procedure e Journal, 2019. URL https: //api.semanticscholar.org/Corpus ID: 199002104. H utter, J.-C. and Rigollet, P. Minimax estimation of smooth optimal transport maps. The Annals of Statistics, 49(2): 1166 1194, 2021. doi: 10.1214/20-AOS1997. URL https://doi.org/10.1214/20-AOS1997. Ingold, D. and Soper, S. Amazon doesn t consider the race of its customers. should it. Bloomberg, April, 1, 2016. Jiang, R., Pacchiano, A., Stepleton, T., Jiang, H., and Chiappa, S. Wasserstein fair classification. In Adams, R. P. and Gogate, V. (eds.), Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, volume 115 of Proceedings of Machine Learning Research, pp. 862 872. PMLR, 22 25 Jul 2020. URL https://proceedings.mlr.press/ v115/jiang20a.html. Kantorovich, L. On the translocation of masses. Journal of Mathematical Sciences, 133:1381 1382, 2006. Kim, K., Kong, I., Lee, J., Chae, M., Park, S., and Kim, Y. Fairness through matching. Transactions on Machine Learning Research, 2025. ISSN 28358856. URL https://openreview.net/forum? id=d Hljja NHh1. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2014. URL https://arxiv.org/abs/ 1412.6980. Kleindessner, M., Samadi, S., Awasthi, P., and Morgenstern, J. Guarantees for spectral clustering with fairness constraints. In International conference on machine learning, pp. 3458 3467. PMLR, 2019. Le, Q. V. Building high-level features using large scale unsupervised learning. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 8595 8598. IEEE, 2013. Li, P., Zhao, H., and Liu, H. Deep fair clustering for visual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9070 9079, 2020. Lloyd, S. Least squares quantization in pcm. IEEE Transactions on Information Theory, 28(2):129 137, 1982. doi: 10.1109/TIT.1982.1056489. Meek, C., Thiesson, B., and Heckerman, D. US Census Data (1990). UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5VP42. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., and Galstyan, A. A survey on bias and fairness in machine learning. ar Xiv preprint ar Xiv:1908.09635, 2019. Mittal, H., Pandey, A. C., Saraswat, M., Kumar, S., Pal, R., and Modwel, G. A comprehensive survey of image segmentation: clustering methods, performance parameters, and benchmark datasets. Multimedia Tools and Applications, pp. 1 26, 2022. Moro, S., Rita, P., and Cortez, P. Bank Marketing. UCI Machine Learning Repository, 2012. DOI: https://doi.org/10.24432/C5K306. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011. Peyr e, G. and Cuturi, M. Computational optimal transport, 2020. URL https://arxiv.org/abs/1803. 00567. Quadrianto, N., Sharmanska, V., and Thomas, O. Discovering fair representations in the data domain. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8227 8236, 2019. 
Salimans, T., Zhang, H., Radford, A., and Metaxas, D. Improving gans using optimal transport, 2018. Schmidt, M., Schwiegelshohn, C., and Sohler, C. Fair coresets and streaming algorithms for fair k-means. In WAOA, pp. 232 251, 2019. URL https://doi.org/ 10.1007/978-3-030-39479-0_16. Fair Clustering via Alignment Seguy, V., Damodaran, B. B., Flamary, R., Courty, N., Rolet, A., and Blondel, M. Large scale optimal transport and mapping estimation. In International Conference on Learning Representations, 2018. URL https: //openreview.net/forum?id=B1zlp1b RW. Su, Z., Wang, Y., Shi, R., Zeng, W., Sun, J., Luo, F., and Gu, X. Optimal mass transport for shape matching and comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(11):2246 2259, 2015. doi: 10. 1109/TPAMI.2015.2408346. Villani, C. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2008. ISBN 9783540710509. URL https://books.google.co.kr/books?id= h V8o5R7_5tk C. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(110):3371 3408, 2010. URL http: //jmlr.org/papers/v11/vincent10a.html. Widiyaningtyas, T., Hidayah, I., and Adji, T. B. Recommendation algorithm using clustering-based upcsim (cbupcsim). Computers, 10(10):123, 2021. Yang, K. D. and Uhler, C. Scalable unbalanced optimal transport using generative adversarial networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum? id=Hyex Ai A5Fm. Yeh, I. and Lien, C. The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2, Part 1):2473 2480, 2009. ISSN 0957-4174. doi: https://doi.org/10.1016/j.eswa.2007.12. 020. URL https://www.sciencedirect.com/ science/article/pii/S0957417407006719. Zafar, M. B., Valera, I., Rogriguez, M. G., and Gummadi, K. P. Fairness constraints: Mechanisms for fair classification. In Artificial Intelligence and Statistics, pp. 962 970, 2017. Zeng, P., Li, Y., Hu, P., Peng, D., Lv, J., and Peng, X. Deep fair clustering via maximizing and minimizing mutual information: Theory, algorithm and metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 23986 23995, 2023. Zhang, Y., Wang, Z., and Shang, J. Clusterllm: Large language models as a guide for text clustering. ar Xiv preprint ar Xiv:2305.14871, 2023. Ziko, I. M., Yuan, J., Granger, E., and Ayed, I. B. Variational fair clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 11202 11209, 2021. Fair Clustering via Alignment A. Supplementary discussion A.1. Fairlet-based methods Fairlet decomposition was first introduced in Chierichetti et al. (2017), providing a pre-processing method for fair clustering. A fairlet is defined as a subset (as small as possible) of data in which the proportion of the sensitive attribute is balanced. The dataset is then partitioned into disjoint fairlets. It is important to note that the fairlets are not built arbitrarily; instead, the sum of distances among instances within each fairlet is minimized (i.e., each fairlet consists of similar instances). Finding such fairlets can be done by solving the minimum cost flow problem. 
Notably, when $n_0 = n_1$, building fairlets is equivalent to finding the optimal coupling $\Gamma$ in the Kantorovich problem, which can be understood as a special case of the minimum cost flow problem (Peyré & Cuturi, 2020; Chen et al., 2022). Hence, we can say that the fairlet-based methods find the coupling only once and then apply a standard clustering algorithm. After building the fairlets, each fairlet is deterministically assigned to a cluster. Since fairness is implicitly guaranteed within each fairlet, the resulting clustering directly becomes fair as well. The representatives of each fairlet can be chosen arbitrarily when $n_0 = n_1$, or chosen as the medoids (or possibly the centroids) when $n_0 \neq n_1$, as suggested by Chierichetti et al. (2017). Then, a standard clustering algorithm is applied to the set of these chosen representatives.

On the other hand, the computational cost is high, with most of the time spent finding the fairlets due to the quadratic complexity of the minimum cost flow problem. To address this issue, SFC (Backurs et al., 2019) proposed a scalable algorithm for fairlet decomposition using a reduction approach based on metric embedding and trees. See Appendix C.2.3 for details on the SFC algorithm.

A.2. Optimal transport problem

The notion of Optimal Transport (OT) provides a geometric view of the discrepancy between two probability measures. For two given probability measures $P_1$ and $P_2$, a map $T: \mathrm{Supp}(P_1) \to \mathrm{Supp}(P_2)$ is called a transport map if $T_{\#}P_1 = P_2$, where $T_{\#}P_1(A) = P_1(T^{-1}(A))$, for all measurable $A$, is the push-forward measure. The OT map is the minimizer of the transport cost among all transport maps. That is, the OT map from $P_1$ to $P_2$ is the solution of $\min_{T: T_{\#}P_1 = P_2} \mathbb{E}_{X \sim P_1}\left[ c(X, T(X)) \right]$ for a pre-specified cost function $c$ (e.g., the $L_2$ distance), which is the so-called Monge problem. Kantorovich relaxed the Monge problem by seeking the optimal coupling (joint distribution) between the two distributions. The Kantorovich problem is mathematically formulated as $\inf_{\pi \in \Pi(P_1, P_2)} \mathbb{E}_{(X, Y) \sim \pi}\left[ c(X, Y) \right]$, where $\Pi(P_1, P_2)$ is the set of all joint measures with marginals $P_1$ and $P_2$.

For two empirical measures, this problem can be solved by linear programming, as follows. Given two empirical distributions on $\mathcal{X}_0 = \{x_i\}_{i=1}^{n_0}$ and $\mathcal{X}_1 = \{x_j\}_{j=1}^{n_1}$, the cost matrix between the two is defined by $C := [c_{i,j}] \in \mathbb{R}_{+}^{n_0 \times n_1}$, where $c_{i,j} = \|x_i - x_j\|^2$. Then, the optimal coupling (joint distribution) is the matrix $\Gamma = [\gamma_{i,j}] \in \mathbb{R}_{+}^{n_0 \times n_1}$ that solves the following objective:
$$\min_{\Gamma} \langle C, \Gamma \rangle = \min_{\gamma_{i,j}} \sum_{i,j} c_{i,j}\gamma_{i,j} \quad \text{s.t.} \quad \sum_{j=1}^{n_1} \gamma_{i,j} = \frac{1}{n_0} \;\; \forall i, \quad \sum_{i=1}^{n_0} \gamma_{i,j} = \frac{1}{n_1} \;\; \forall j, \quad \gamma_{i,j} \ge 0.$$
This problem can be solved by linear programming. For the case of large $n$ with $n_0 = n_1$, various feasible estimators have been developed (Cuturi, 2013; Genevay et al., 2016), e.g., the Sinkhorn algorithm (Cuturi, 2013). Not only practical implementations but also theoretical aspects, such as minimax estimation, have been studied in depth (Seguy et al., 2018; Yang & Uhler, 2019; Deb et al., 2021; Hütter & Rigollet, 2021). Recently, the OT map has been utilized in diverse domains, for example, economics (Galichon, 2016; Chiappori et al., 2010), domain adaptation (Damodaran et al., 2018; Forrow et al., 2019), and computer vision (Su et al., 2015; Salimans et al., 2018). Several studies, including Jiang et al. (2020); Chzhen et al. (2020); Gordaliza et al. (2019); Kim et al. (2025), also employed the OT map in the field of algorithmic fairness for supervised learning.
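As a concrete illustration, the discrete Kantorovich problem above can be solved with an off-the-shelf solver. The sketch below uses the POT library (also used in our implementation, Appendix C.2.2); the sample sizes and data are illustrative assumptions.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(1)
n0, n1, d = 40, 60, 5                      # unequal group sizes (illustrative)
X0 = rng.normal(size=(n0, d))
X1 = rng.normal(size=(n1, d))

a = np.full(n0, 1.0 / n0)                  # uniform marginal on X0
b = np.full(n1, 1.0 / n1)                  # uniform marginal on X1
C = ot.dist(X0, X1, metric="sqeuclidean")  # cost matrix c_{ij} = ||x_i - x_j||^2

Gamma = ot.emd(a, b, C)                    # exact LP solution of the Kantorovich problem
# The coupling respects the prescribed marginals (up to numerical precision).
assert np.allclose(Gamma.sum(axis=1), a) and np.allclose(Gamma.sum(axis=0), b)
transport_cost = float((Gamma * C).sum())
```

For large samples, the entropically regularized problem solved by the Sinkhorn algorithm (ot.sinkhorn in POT) can be substituted for ot.emd; see Appendix C.3.7 for an empirical comparison of the two choices.

A.3.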
Extension to multiple protected groups In this paper, we mainly focus on the case of two protected groups, for ease of discussion. However, it is possible to extend FCA to handle multiple (more than two) protected groups. The idea is equivalent to the case of two protected groups: matching multiple individuals from different protected groups. Fair Clustering via Alignment The case of equal sample sizes Let G be the number of protected groups, and denote s {0, . . . , G 1} as a fixed sensitive attribute as an anchor (a reference) attribute. For the case of n0 = . . . = n G 1, we can similarly decompose the objective function in terms of matching map, as follows. First, we decompose the clustering objective similar to the proof of Theorem 3.1: C(µ, A0, . . . , AG 1) = E k=1 AS(X) X µk 2 k=1 As (X)k s =s Ts (X) µk 2 where Ts is the one-to-one matching map from Xs to Xs for s {0, . . . , G 1}. Note that Ts then becomes the identity map from Xs to Xs for s = s . Then for any given x and µk, we can decompose x µk 2 by s =s (x Ts (x)) s =s Ts (x) s =s (x Ts (x)) s =s Ts (x) We also similarly decompose Ts (x) µk 2 by s =s ,s {0,...,G 1}(Ts (x) Ts (x)) s =s Ts (x) s =s ,s {0,...,G 1}(Ts (x) Ts (x)) s =s Ts (x) By summing up the above G many terms, we have the final result that C(µ, A0, . . . , AG 1) k=1 As (X)k s =s(Ts(X) Ts (X)) s =s Ts (X) Note that this result holds for any anchor s {0, . . . , G 1}. The case of unequal sample sizes For the case of unequal sample sizes, we also derive a similar decomposition (i.e., eq. (8)). We first define the alignment map as: TA(x0, . . . , x G 1) := π0x0 + . . . + πG 1x G 1, where πs = ns/n for s {0, . . . , G 1}. Then, we find the joint distribution and cluster centers by minimizing the following objective: E(X0,...,XG 1) Q s =s (Xs Xs ) 2 + min k TA(X0, . . . , XG 1) µk 2 where πG := QG 1 s=0 πs. The proof idea is similar to the case of two protected groups, in Theorem 3.3. We first recall the original K-means objective: C(µ, A0, . . . , AG 1) := E PK k=1 AS(X)k X µk 2 = PG 1 s=0 πs Es PK k=1 As(Xs)k Xs µk 2. Consider a set of joint distributions of X0, . . . , XG 1 given µ as Q := {Q({x0, . . . , x G 1}|µ) : xs Xs, s {0, . . . , G 1}}. Then, we can show that there exists a Q = Q({x0, . . . , x G 1}|µ) Q satisfying C(µ, A0, . . . , AG 1) = k=1 As(Xs)k Xs µk 2 = E(X0,...,XG 1) Q s=0 πs Xs µk 2 ! Fair Clustering via Alignment by using the same logic in the proof of Theorem 3.3. Furthermore, we can reformulate it as E(X0,...,XG 1) Q s =s (Xs Xs ) 2 + min k TA(X0, . . . , XG 1) µk 2 with the assignment functions for fair clustering given as As(xs)k := Q arg min k πsxs + X s =s πs Xs µk 2 = k Xs = xs , s {0, . . . , G 1}. Furthermore, since finding the joint distribution for multiple protected groups is technically similar to the case of two protected groups, the optimization of the objective in eq. (8) can be solved using a linear program. Experiments To empirically confirm the validity of the extension, we use BANK dataset with three protected groups defined by dividing marital status as single/married/divorced (directly following the approach of VFC (Ziko et al., 2021)). We then compare the performance of FCA with VFC, which also can handle multiple protected groups. The result is presented in Table 4 below, showing the superior performance of FCA: achieving a lower cost (0.222 < 0.228) along with a higher balance (0.182 > 0.172). Table 4. Comparison of VFC and FCA in terms of the trade-off between Bal and Cost, on BANK dataset with three protected groups. 
BANK 3 groups (single/married/divorced) Bal = 0.185 Cost ( ) Bal ( ) VFC 0.228 0.172 FCA 0.222 0.182 A.4. Extension to K-clustering for general distance metrics Let D : Rd Rd R+ be a given distance satisfying (i) D(v, w) D(v, u)+D(u, w), (ii) D(v, w) = D(v+u, w+u), and (iii) D(λv, λw) = |λ| D(v, w), for all v, w, u Rd and λ R. For example, D can be the Lp norm on Rd for any p 1. Then, for the balanced case n0 = n1, we can replicate the argument of Theorem 3.1 as follows: k=1 AS(X)k D(X, µk) Es + D X + T(X) Minimizing this upper bound yields a fair K-clustering procedure for any distance satisfying the above conditions. Note that using the L1 norm for D corresponds to the K-median clustering. See Appendix C.3.9 for experiments based on this approach, showing the outperformance FCA over SFC in view of the K-median clustering. Fair Clustering via Alignment B. Proofs of the theorems B.1. Proof of Theorem 3.1 Theorem 3.1. For any given perfectly fair deterministic assignment function A and cluster centers µ, there exists a one-to-one matching map T : Xs Xs such that, for any s {0, 1}, C(µ, A0, A1) = Es 4 + X + T(X) Proof of Theorem 3.1. Without loss of generality, let s = 0. First, it is clear that we can construct a one-to-one map T that maps each x X k,0 := {x X0 : A0(x)k = 1} to a unique x X k,1 := {x X1 : A1(x )k = 1} for all k [K]. That is, {T(x) : x X k,0} = X k,1, k [K] and {T(x) : x X0} = X1. Then, for the given T, we can rewrite the clustering cost as C(µ, A0, A1) = E k=1 AS(X)k X µk 2 = 1 k=1 A0(X)k X µk 2 + T(X) µk 2 . (10) For any given x and µk, we decompose x µk 2 as: x µk 2 = x x + T(x) 2 + x + T(x) 4 + x + T(x) 2 + 2 x x + T(x) 2 , x + T(x) We similarly decompose T(x) µk 2 as: T(x) µk 2 = T(x) x + T(x) 2 + x + T(x) 4 + x + T(x) 2 + 2 T(x) x + T(x) 2 , x + T(x) Adding the two terms, we have 4 + x + T(x) Finally, we conclude that k=1 A0(X)k X µk 2 + T(X) µk 2 k=1 A0(X)k 2 4 + X + T(X) B.2. Proof of Theorem 3.3 Theorem 3.3. Let µ and Q be the cluster centers and joint distribution minimizing min µ,Q Q E(X0,X1) Q 2π0π1 X0 X1 2 + min k TA(X0, X1) µk 2 . (12) Then, (µ , A 0, A 1) is the solution of the perfectly fair K-means clustering, where A 0(x)k := Q arg mink TA(x, X1) µk 2 = k|X0 = x and A 1(x)k is defined similarly. Fair Clustering via Alignment Proof of Theorem 3.3. Let Xs = X|S = s, s {0, 1}. For given (µ, A0, A1), recall the original K-means objective: C(µ, A0, A1) := E PK k=1 AS(X)k X µk 2 = π0E0 PK k=1 A0(X0)k X0 µk 2 +π1E1 PK k=1 A1(X1)k X1 µk 2. Consider a set of joint distributions of X0, X1 given µ as Q := {Q({x0, x1}|µ) : x0 X0, x1 X1}. We show that there exists a Q = Q({x0, x1}|µ) Q satisfying C(µ, A0, A1) = π0E0 k=1 A0(X0)k X0 µk 2 + π1E1 k=1 A1(X1)k X1 µk 2 = E(X0,X1) Q π0 X0 µk 2 + π1 X1 µk 2 . Q({x0, x1}|µ) = A0(x0)k A1(x1)k Ck P0({x0})P1({x1}), (14) where Ck := EAS(X)k = E0A0(X0)k = E1A1(X1)k. Then, E(X0,X1) Q π0 X0 µk 2 + π1 X1 µk 2 x0 A0(x0)kπ0 x0 µk 2P0({x0}) + X x1 A1(x1)kπ1 x1 µk 2P1({x1}) E0π0A0(X0)k X0 µk 2 + E1π1A1(X1)k X1 µk 2 = C(µ, A0, A1), which concludes eq. (13). Our original aim is to find (µ, A0, A1) minimizing C(µ, A0, A1). Let µ(x) = µk , where k = arg mink x µk 2. Let µ(x0, x1) := µk , where k = arg mink π0x0 + π1x1 µk 2. For given Q defined in eq. (14), similar to Theorem 3.1, we can reformulate E(X0,X1) Q π0 X0 µ(X0) 2 + π1 X1 µ(X1) 2 = E(X0,X1) Q 2π0π1 X0 X1 2 + π0X0 + π1X1 µ(X0, X1) 2 . 
(16) In turn, the assignment functions are given as A0(x0)k := Q arg min k π0x0 + π1X1 µk 2 = k X0 = x0 A1(x1)k := Q arg min k π0X0 + π1x1 µk 2 = k X1 = x1 Hence, C(µ, A0, A1) = E(X0,X1) Q 2π0π1 X0 X1 2 + mink π0X0 + π1X1 µk 2 . B.3. Proof of Theorem 4.1 Theorem 4.1. Minimizing FCA-C objective C(Q, W, µ) with the corresponding assignment function defined in eq. (6), is equivalent to minimizing C(µ, A0, A1) subject to (A0, A1) Aε. Proof of Theorem 4.1. It suffices to show the followings: A and B. A. For given Q, W, µ, let A0 and A1 be the constructed assignment functions, defined in eq. (6). Then, we have C(Q, W, µ) = C(µ, A0, A1) and (A0, A1) Aε. B. (i) For given µ and (A0, A1) Aε, there exist Q and W s.t. C(Q, W, µ) C(µ, A0, A1). Fair Clustering via Alignment Proof of A. For given Q, W, µ, we construct the assignment functions as in eq. (6). In other words, we define A0(x0)k = P1( arg mink TA(x0, X1) µk 2 = k, (x0, X1) Wc ) +1 arg mink x0 µk 2 = k P1({(x0, X1) W}), and the assignment function A1 is defined similarly. Then, we have C(Q, W, µ) = C(µ, A0, A1) by its definition. Furthermore, k=1 |E0A0(X)k E1A1(X)k| arg min k TA(X0, X1) µk 2 = k, (X0, X1) Wc arg min k TA(X0, X1) µk 2 = k, (X0, X1) Wc + E0Q1 ((X0, X1) W) 1(arg min k X0 µk 2 = k) E1Q0 ((X0, X1) W) 1(arg min k X1 µk 2 = k) E0Q1 ((X0, X1) W) 1(arg min k X0 µk 2 = k) E1Q0 ((X0, X1) W) 1(arg min k X1 µk 2 = k) EX0,X11((X0, X1) W) 1(arg min k X0 µk 2 = k) 1(arg min k X1 µk 2 = k) 1(arg min k X0 µk 2 = k) 1(arg min k X1 µk 2 = k) |(X0, X1) W 1((X0, X1) W) which implies (A0, A1) Aε. Proof of B. For each k [K], let δk = min{E0(A0(X))k, E1(A1(X))k}. We decompose As( )k = As( )k +Ac s( )k, where As( )k = δk As( )k/Es(As(X))k. Then, E0 A0(X)k = E1 A1(X)k for all k [K]. Define As(X)K+1 := PK k=1 Ac s(X)k. Note that Es As(X)K+1 ε. Now, for given µ, we define a probability measure on X0 X1 by Q(dx0, dx1|µ) = A0(x0)k A1(x1)k δk p0(x0)p1(x1) + A0(x0)K+1 A1(x1)K+1 δK+1 p0(x0)p1(x1), where δK+1 = 1 PK k=1 δk and p0, p1 are the densities of P0, P1, respectively. Note that δK+1 = Es As(X)K+1 and thus δK+1 ε. In addition, for given (x0, x1) X0 X1, we define a binary random variable R(x0, x1) such that Pr(R(x0, x1) = 1) = A0(x0)K+1 A1(x1)K+1 δK+1 p0(x0)p1(x1) Q(dx0, dx1|µ) . Fair Clustering via Alignment For given x Xs, let µ(x) = µk , where k = argmink x µk 2. For given (x0, x1) X0 X1, let µ(x0, x1) = µk , where k = arg mink π0x0 + π1x1 µk 2. Then, it holds that C(Q, R, µ) C(µ, A0, A1), where C(Q, R, µ) := EQ,R 2π0π1 X0 X1 2 + π0X0 + π1X1 µ(X0, X1) 2 (1 R(X0, X1)) + EQ,R π0 X0 µ(X0) 2 + π1 X1 µ(X1) 2 R(X0, X1) . The final mission is to find W X0 X1 such that R(x0, x1) = 1((x0, x1) W). For this, define η(x0, x1) = 2π0π1 x0 x1 2 + mink TA(x0, x1) µk 2. Let ηε be the εth upper quantile of η(X0, X1) and let W = {(x0, x1) : η(x0, x1) > ηε}. (20) Then, we can find W W such that C(Q, W, µ) C(Q, R, µ) and Q(W) = ε (see Remark B.1 below), which completes the proof. Remark B.1 (The case when the densities do not exist). When the distribution of η(X0, X1) is strictly increasing, we have Q(W) = ε. If not, we can find W such that W {(x0, x1) : η(x0, x1) ε} with Q(W) = ε when Q has its density. When Q is discrete, the situation is tricky. When n0 = n1, the measure Q minimizing C(Q, W, µ) has masses 1/n0 on n0 many pairs of (x0, x1) among X0 X1. In this case, whenever n0ε is an integer, we can find W such that Q(W) = ε. Otherwise, we could consider a random assignment. Let Fη be the distribution of η(X0, X1). Suppose that Fη has a jump at ηε. 
In that case, Q(W) < ε. Let (x 0, x 1) be the element such that η(x 0, x 1) = ηε. Then, we can let R(x 0, x 1) = 1 with probability (ε Q(W))/Q({x 0, x 1}), R(x0, x1) = 1 for (x0, x1) W and R(x0, x1) = 0 for (x0, x1) (W {x 0, x 1})c . The current FCA-C algorithm can be modified easily for this random assignment. B.4. Proof of Proposition 4.2 Proposition 4.2. For any assignment function (A0, A1) Aε, we have xi X0 A0(xi)k P xi X1 A1(xi)k n0 where c = n0 n1 maxk [K] 1 E1A1(X)k . Proof of Proposition 4.2. On the other hand, by the definition of Aε, any (A0, A1) Aε satisfies xi X0 A0(xi)k 1 xj X1 A1(xj)k which implies xi X0 A0(xi)k 1 xj X1 A1(xj)k = |E0A0(X)k E1A1(X)k| ε for all k [K]. Then, we have xi X0 A0(xi)k P xi X1 A1(xi)k n0 xi X0 A0(xi)k/n0 P xi X1 A1(xi)k/n1 1 E0A0(X)k E1A1(X)k 1 1 E1A1(X)k |E0A0(X)k E1A1(X)k| n0 1 E1A1(X)k ε, for all k [K]. Letting c := n0 n1 maxk [K] 1 E1A1(X)k concludes the proof. Fair Clustering via Alignment B.5. Proof of Theorem 4.3 We establish the approximation guarantee of our proposed algorithm: The cost of the solution obtained by our FCA-C algorithm in Section 4.2 is at most τ + 2 times of the cost of the global optimal fair clustering solution (where τ is the approximation error of the algorithm used for finding initial cluster centers without the fairness constraint), with an additional violation of O(ε). In other words, for a given fairness level ε, FCA-C has an approximation error of τ + 2 with an additional violation of O(ε). First, for a given fairness level ε, we define several notations to be used as follows. µ , A 0 , A 1 : the solution of FCA-C algorithm for fairness level ε, given initial cluster centers obtained by a τapproximation algorithm for K-means clustering. µ, A0, A1: the optimal fair clustering solution of minµ,A0,A1 C(µ, A0, A1) subject to (A0, A1) Aε, for a given fairness level ε. µ , A 0, A 1: the (approximated) clustering solution, given a τ-approximation algorithm for (standard) K-means clustering. Then, we can prove Theorem B.2 below, which is a formal version of Theorem 4.3, showing the approximation error of FCA-C algorithm. Theorem B.2 (A formal version of Theorem 4.3). Suppose that supx X x 2 R for some R > 0. Given initial cluster centers obtained by a τ-approximation algorithm , FCA-C returns a (τ + 2)-approximate solution with a violation 3Rε for optimal fair clustering, i.e., C(µ , A 0 , A 1 ) (τ + 2)C( µ, A0, A1) + 3Rε. Proof of Theorem B.2. Let Q , W = arg min Q,W C(Q, W, µ ), and A 0, A 1 are the assignment functions corresponding to Q and W . It suffices to show the following two claims: 1. (Claim A): Given initial cluster centers obtained by a τ-approximation algorithm for K-means clustering, we have C(µ , A 0, A 1) (τ + 2)C( µ, A0, A1). 2. (Claim B): There exist Q , W such that C(µ , A 0, A 1) C(µ , A 0, A 1) + 3Rε, where A 0 and A 1 are the fair assignment functions corresponding to Q and W . Then, by combining Claim A and B, we have that C(µ , A 0, A 1) (τ + 2)C( µ, A0, A1) + 3Rε. Note that A 0 and A 1 are the feasible solution of FCA-C, while A 0 and A 1 are the global optimal fair assignment function given µ . Consequently, we can clearly get the same approximation error for (µ , A 0 , A 1 ), by iterating the process (finding new cluster centers minimizing the cost and updating the assignment functions), which completes the proof. See Remark B.3 for details. Proof of Claim A: Let C = minµ,A0,A1 C(µ, A0, A1) represent the global optimal clustering cost without fairness constraint. Then, we have C(µ , A 0, A 1) τC . 
We show that for given initial cluster centers µ , there exist A+ 0 and A+ 1 that satisfy the following conditions: (i) (A+ 0 , A+ 1 ) Aε, (ii) C(µ , A+ 0 , A+ 1 ) (τ + 2)C( µ, A0, A1), and (iii) C(µ , A 0, A 1) C(µ , A+ 0 , A+ 1 ), where A 0, A 1 are the fair assignment functions corresponding to Q , W = arg min Q,W C(Q, W, µ ). (i) Recall that µ = { µk}K k=1 and µ = {µ k}K k=1. For all k [K], we define the set of nearest neighbors of µ k as N(µ k) := { µk µ : arg minµ k µ µk µ k 2 = µ k}. For x X, we define A+ s (x)k := P k : µk N(µ k) As(x)k . Then, A+ s also becomes an assignment function since PK k=1 A+ s (x) = 1 for all x X. Since ( A0, A1) Aε, we have the fact that PK k=1 1 n0 Pn0 i=1 A0(xi)k 1 n1 Pn1 j=1 A1(xj)k ε. Therefore, we obtain i=1 A+ 0 (xi)k 1 j=1 A+ 1 (xj)k k : µk N(µ k) i=1 A0(xi)k 1 j=1 A1(xj)k Fair Clustering via Alignment k : µk N(µ k) i=1 A0(xi)k 1 j=1 A1(xj)k i=1 A0(xi)k 1 j=1 A1(xj)k where the last equality holds because K k=1 k : µk N(µ k) {k } = {k }K k =1. Thus, the condition (i) is satisfied. (ii) For a given x X, let µ k(x) := arg minµ k µ x µ k 2 be the initial cluster center closest to x. We then have: k=1 A+ s (x)k x µ k 2 = k : µk N(µ k) As(x)k x µ k 2 k : µk N(µ k) As(x)k x µk 2 + µk µ k 2 k : µk N(µ k) As(x)k x µk 2 + µk µ k(x) 2 k : µk N(µ k) As(x)k 2 x µk 2 + x µ k(x) 2 k =1 As(x)k x µk 2 + k : µk N(µ k) As(x)k x µ k(x) 2 k =1 As(x)k x µk 2 + x µ k(x) 2. Summing over all x and dividing by n, we obtain: C(µ , A+ 0 , A+ 1 ) 2C( µ, A0, A1) + 1 x X min k x µ k 2 = 2C( µ, A0, A1) + C(µ , A 0, A 1) 2C( µ, A0, A1) + τC 2C( µ, A0, A1) + τC( µ, A0, A1) = (τ + 2)C( µ, A0, A1), which concludes (ii). (iii) It is clear that C(µ , A 0, A 1) C(µ , A+ 0 , A+ 1 ), since A 0 and A 1 are the minimizers of C(µ, A0, A1) subject to (A0, A1) Aε given µ = µ (by Theorem 4.1). Thus, the condition (iii) is satisfied. Let µ k := Pn i=1 A s(xi)kxi/ Pn i=1 A s(xi)k, which is the minimizer of minµ C(µ, A 0, A 1) given A 0 and A 1. Then, it is clear that C(µ , A 0, A 1) C(µ , A 0, A 1) (τ + 2)C( µ, A0, A1). Proof of Claim B: Note that we can rewrite C(Q, W, µ) = EX0,X1 Q FCA cost(X0, X1, µ) EX0,X1 Q FCA cost(X0, X1, µ) K-means cost(X0, X1, µ) 1 ((X0, X1) W) , (24) where FCA cost(X0, X1, µ) = 2π0π1 X0 X1 2 +mink TA(X0, X1) µk 2 and K-means cost(X0, X1, µ) = mink π0 X0 µk 2 + mink π1 X1 µk 2 , which are defined in eq. (5). (i): For given µ , define Q as the solution of min Q Q EX0,X1 Q(FCA cost(X0, X1, µ )), which can be found by solving the Kantorovich problem. Then, we have EX0,X1 Q FCA cost(X0, X1, µ ) EX0,X1 Q FCA cost(X0, X1, µ ) . (ii): Let η(x0, x1) := FCA cost(x0, x1, µ ) K-means cost(x0, x1, µ ) and ηε be the εth upper quantile. Define Fair Clustering via Alignment W = {(x0, x1) X0 X1 : η(x0, x1) > ηε}. Then, using (i), we have C(Q , W , µ ) C(Q , , µ ) = EX0,X1 Q (FCA-cost(X0, X1, µ )) = EX0,X1 Q (FCA-cost(X0, X1, µ )) sup Q,W:Q(W) ε EX0,X1 Q FCA cost(X0, X1, µ) K-means cost(X0, X1, µ) 1 ((X0, X1) W) + sup Q,W:Q(W) ε EX0,X1 Q FCA cost(X0, X1, µ) K-means cost(X0, X1, µ) 1 ((X0, X1) W) EX0,X1 Q (FCA-cost(X0, X1, µ )) EX0,X1 Q FCA cost(X0, X1, µ) K-means cost(X0, X1, µ) 1 ((X0, X1) W ) + sup Q,W:Q(W) ε EX0,X1 Q FCA cost(X0, X1, µ) K-means cost(X0, X1, µ) 1 ((X0, X1) W) = C(Q , W , µ ) + sup Q,W:Q(W) ε EX0,X1 Q FCA cost(X0, X1, µ) K-means cost(X0, X1, µ) 1 ((X0, X1) W) . (iii) The last term of the right-hand-side can be bounded as follows: First, let µ (x) := arg minµ k µ x µ k 2 for x X and µ (x0, x1) := arg minµ k µ TA(x0, x1) µ k 2 for (x0, x1) X0 X1. 
Then, (x0, x1) X0 X1, it holds that FCA cost(x0, x1, µ ) K-means cost(x0, x1, µ ) = 2π0π1 x0 x1 2 + TA(x0, x1) µ (x0, x1) 2 π0 x0 µ (x0) 2 π1 x1 µ (x1) 2 2π0π1 x0 x1 2 + TA(x0, x1) µ (x0) 2 π0 x0 µ (x0) 2 π1 x1 µ (x1) 2 2π0π1 x0 x1 2 + π0 x0 µ (x0) 2 + π1 x1 µ (x0) 2 π0 x0 µ (x0) 2 π1 x1 µ (x1) 2 = 2π0π1 x0 x1 2 + π1( x1 µ (x0) 2 x1 µ (x1) 2) 2π0π1 x0 x1 2 + π1 x1 µ (x0) 2 2 x0 x1 2 + x1 µ (x0) 2 1 22R + 2R = 3R. Hence, we conclude that sup Q,W:Q(W) ε EX0,X1 Q FCA cost(X0, X1, µ) K-means cost(X0, X1, µ) 1 ((X0, X1) W) 3R sup Q,W:Q(W)=ε EX0,X1 Q(1 ((X0, X1) W)) = 3RQ(W) = 3Rε. Finally, combining (ii) and (iii), we have C(Q , W , µ ) C(Q , W , µ ) + 3Rε, which implies C(µ , A 0, A 1) C(µ , A 0, A 1) + 3Rε, where A 0 and A 1 are the fair assignment functions corresponding to Q and W . Remark B.3. Finally, FCA-C algorithm iterates the above process. Using FCA-C, we find A 0 and A 1, which minimizes the cost given µ , where µ k := Pn i=1 A s(xi)kxi/ Pn i=1 A s(xi)k, i.e., the minimizer of minµ C(µ, A 0, A 1) given A 0 and A 1. Again, let µ k := Pn i=1 A s(xi)kxi/ Pn i=1 A s(xi)k, which is the minimizer of minµ C(µ, A 0, A 1) given A 0 and A 1. This iteration process results in the following inequality: C(µ , A 0, A 1) C(µ , A 0, A 1) C(µ , A 0, A 1) (τ + 2)C( µ, A0, A1) + 3Rε. Hence, iterating this process until convergence guarantees the approximation error of τ + 2 with additional violation of 3Rε, which concludes that C(µ , A 0 , A 1 ) (τ + 2)C( µ, A0, A1) + 3Rε. Fair Clustering via Alignment C. Experiments C.1. Datasets The three datasets used in our experiments can be found in the UCI Machine Learning Repository3. ADULT dataset (Becker & Kohavi, 1996) can be downloaded from https://archive.is.uci/ml/datasets/ adult. We use 5 continuous variables (age, fnlwgt, education, capital-gain, and hours-per-week). The total sample size is 32,561. The sample size for S = 0 and S = 1 (resp. female and male) are 10,771 and 21,790, respectively. BANK dataset (Moro et al., 2012) can be downloaded from https://archive.ics.uci.edu/ml/datasets/ Bank+Marketing. We use 6 continuous variables (age, duration, euribor3m, nr.employed, cons.price.idx, and campaign). The total sample size is 41,108. The sample size for S = 0 and S = 1 (resp. not married and married) are 16,180 and 24,928, respectively. CENSUS dataset (Meek et al.) can be downloaded from https://archive.ics.uci.edu/ml/datasets/ US+Census+Data+(1990). We use 66 continuous variables (d Ancstry1, d Ancstry2, i Avail, i Citizen, i Class, d Depart, i Disabl1, i Disabl2, i English, i Feb55, i Fertil, d Hispanic, d Hour89, d Hours, i Immigr, d Income1, d Income2, d Income3, d Income4, d Income5, d Income6, d Income7, d Income8, d Industry, i Korean, i Lang1, i Looking, i Marital, i May75880, i Means, i Military, i Mobility, i Mobillim, d Occup, i Othrserv, i Perscare, d POB, d Poverty, d Pwgt1, i Ragechld, d Rearning, i Relat1, i Relat2, i Remplpar, i Riders, i Rlabor, i Rownchld, d Rpincome, i RPOB, i Rrelchld, i Rspouse, i Rvetserv, i School, i Sept80, i Subfam1, i Subfam2, i Tmpabsnt, d Travtime, i Vietnam, d Week89, i Work89, i Worklwk, i WWII, i Yearsch, i Yearwrk, and d Yrsserv). We subsample 20,000 instances, as done in Chierichetti et al. (2017); Bera et al. (2019); Esmaeili et al. (2021). The sample size for S = 0 and S = 1 (not married and married, respectively) are 9,844 and 10,156 respectively. 
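As an illustration of the preprocessing applied to these tabular datasets (selection of continuous features and standardization, as noted below), here is a minimal sketch for ADULT. The local file path and raw column names are assumptions about the UCI distribution, and the "education" variable is taken as its numeric encoding.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Assumed local copy of the raw UCI ADULT file and its standard column names.
cols = ["age", "workclass", "fnlwgt", "education", "education-num",
        "marital-status", "occupation", "relationship", "race", "sex",
        "capital-gain", "capital-loss", "hours-per-week", "native-country",
        "income"]
df = pd.read_csv("adult.data", header=None, names=cols, skipinitialspace=True)

# Five continuous features (education taken as its numeric encoding) and the
# sensitive attribute S (0: female, 1: male).
features = ["age", "fnlwgt", "education-num", "capital-gain", "hours-per-week"]
X = StandardScaler().fit_transform(df[features].to_numpy(dtype=float))
S = (df["sex"] == "Male").astype(int).to_numpy()
```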
Note that, for each dataset, we use only the continuous (numerical) variables as introduced above, consistent with previous studies, e.g., Backurs et al. (2019); Bera et al. (2019); Esmaeili et al. (2021); Ziko et al. (2021), to name a few. Particularly, the features we consider in this paper are the same as those selected in a baseline method VFC (Ziko et al., 2021). The variables of all datasets are scaled with zero mean and unit variance. C.2. Implementation details C.2.1. COMPUTING RESOURCES The computation is performed on several Intel Xeon Silver CPU cores and an additional RTX 4090 GPU processor. C.2.2. PROPOSED ALGORITHMS FCA We use the POT library (Flamary et al., 2021) to find the optimal joint distribution Q (i.e., Γ). For updating the cluster centers, we adopt the K-means++ algorithm (Arthur & Vassilvitskii, 2007) from the implementation of scikit-learn package Pedregosa et al. (2011). The iterative process of updating the cluster centers and the joint distribution is run for 100 iterations, with the result where Cost is minimized being selected. FCA-C To control Bal, we vary ε, i.e., the size of W. The value of ε is sweeped in increments of 0.05, ranging from 0.1 to 0.9. We also use a similar partitioning technique in Section 4.3 for FCA-C with m = 2048. For stability, at the first iteration step, we find W based on fixed cluster centers µ, initialized using the K-means++ algorithm. C.2.3. BASELINE ALGORITHMS Scalable Fair Clustering (SFC) (Backurs et al., 2019) We directly use the official source code of SFC, available on the authors Git Hub4. SFC provides a fast and scalable algorithm for fairlet decomposition, which builds the fairlets in nearly 3https://archive.ics.uci.edu/datasets 4https://github.com/talwagner/fair_clustering Fair Clustering via Alignment linear time. The given data are first embedded to a tree structure (hierarchically well-separated tree; HST) using probabilistic metric embedding, where we seek for optimal edges (as well as their nodes) to be activated that satisfy Bal Bal . Then, the linked nodes are then aggregated as a fairlet. This process can build fairlets in nearly linear time while minimizing the cost of building them. After building these fairlets using SFC, we apply the standard K-means algorithm on the fairlet space (i.e., the set of representatives of the obtained fairlets). Fair Clustering under a Bounded Cost (FCBC) (Esmaeili et al., 2021) We use the official source code of FCBC, available on the authors Git Hub5. FCBC maximizes Bal under a cost constraint, where the cost constraint is defined by the Price of Fairness (Po F) = cost of fair clustering (the solution) / cost of standard clustering . However, since the authors mentioned the constrained optimization problem is NP-hard, they reduced the problem to a post-processing approach (i.e., fairly assigning data under the cost constraint, with pre-specified centers opened by a standard clustering algorithm). We set the value of Po F to 1.2 to achieve Bal Bal . Fair Round-Robin Algorithm for Clustering (FRAC) (Gupta et al., 2023) We directly follow the official source code of FRAC, available on the authors Git Hub6. FRAC provides an in-loop post-processing approach to fairly assign data from each protected group to given cluster centers found by a standard clustering algorithm. In other words, the fair assignment problem is solved at each iteration of standard clustering algorithm. 
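To make the FCA implementation described in Appendix C.2.2 above concrete, the following is a minimal sketch of a single outer iteration of the alternating procedure for the equal-weight case π0 = π1 = 1/2: the coupling is computed with POT given the current centers, and the centers are then re-fit on the aligned (midpoint) space. Variable names are illustrative, and practical details such as the partitioning technique and the relaxation set W of FCA-C are omitted.

```python
import numpy as np
import ot
from sklearn.cluster import KMeans

def fca_outer_step(X0, X1, centers, K):
    """One illustrative outer iteration: update the coupling, then the centers."""
    # Alignment cost given the current centers (pi0 = pi1 = 1/2):
    #   0.5 * ||x0 - x1||^2 + min_k ||(x0 + x1)/2 - mu_k||^2
    sq_dist = ot.dist(X0, X1, metric="sqeuclidean")
    mid = 0.5 * (X0[:, None, :] + X1[None, :, :])               # pairwise midpoints
    to_centers = ((mid[:, :, None, :] - centers[None, None]) ** 2).sum(-1)
    cost = 0.5 * sq_dist + to_centers.min(axis=2)

    a = np.full(len(X0), 1.0 / len(X0))
    b = np.full(len(X1), 1.0 / len(X1))
    gamma = ot.emd(a, b, cost)                                  # coupling given the centers

    # Re-fit the centers on the aligned points, weighted by the coupling mass.
    i, j = np.nonzero(gamma)
    km = KMeans(n_clusters=K, n_init=10).fit(mid[i, j], sample_weight=gamma[i, j])
    return km.cluster_centers_, gamma
```

In the actual experiments, this step is wrapped in the outer loop described in C.2.2 (the coupling is solved on partitions of size m for scalability, and the iterate with the lowest Cost is kept).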
Variational Fair Clustering (VFC) (Ziko et al., 2021) We use the official source code of VFC, available on the authors Git Hub7. The overall objective of VFC is E mink X µk 2 + λ PK k=1 KL [π0, π1] h π0E0A0(X)k EAS(X)k , π1E1A1(X)k i , where λ is a hyper-parameter to control Bal. A higher λ results in higher Bal, with KL = 0 Bal = 1. We run the code with multiple trials, by varying the values of λ. For Table 2, we select the best λ that achieves the highest Bal and report the corresponding performance for the chosen hyper-parameter, as the authors have done. For Figure 3, we present all the results obtained using the various values of λ. Table 5 below provides the values of VFC s hyper-parameters used in our experiments. Table 5. The hyper-parameters used in VFC for searching a clustering with maximum achievable Bal for each dataset. The bold faces are the recommended ones by the authors. The underlined values are the ones we use for maximum achievable Bal. Dataset L2 normalization Hyper-parameters ADULT O {5000, 7000, 9000, 10000, 11000, 12000, 13000, 13600, 14200} X {5000, 7000, 9000, 10000, 11000, 12000, 13000, 15000, 20000, 22000, 23000} BANK O {5000, 6000, 7000, 9000, 10000, 11000, 12000, 12300} X {10000, 12000, 13000, 15000, 17000, 19000, 25200, 26000} CENSUS O {100, 200, 500, 700, 1000, 1500, 2000} X Failed Note that VFC s superior performance over other two well-known FC algorithms from Bera et al. (2019); Kleindessner et al. (2019) was already shown in Ziko et al. (2021), which is why we omit these two methods as baselines in our experiments. 5https://github.com/Seyed2357/Fair-Clustering-Under-Bounded-Cost 6https://github.com/shivi98g/Fair-k-means-Clustering-via-Algorithmic-Fairness 7https://github.com/imtiazziko/Variational-Fair-Clustering Fair Clustering via Alignment C.3. Omitted experimental results C.3.1. PERFORMANCE COMPARISON RESULTS (SECTION 5.2) Trade-off: fairness level vs. clustering utility Table 6 presents the comparison results for the trade-off between Bal and Cost without L2 normalization, which is similar to Table 1. Table 6. Comparison of the trade-off between Bal and Cost on ADULT, BANK, and CENSUS datasets, when data are not L2-normalized. We underline Bal values for the cases of near-perfect fairness (i.e., Bal Bal ) and use bold face for the lowest Cost value among those cases. Dataset / Bal ADULT / 0.494 BANK / 0.649 CENSUS / 0.969 Without L2 normalization Cost ( ) Bal ( ) Cost ( ) Bal ( ) Cost ( ) Bal ( ) Standard (fair-unaware) 1.620 0.206 1.510 0.391 28.809 0.030 FCBC (Esmaeili et al., 2021) 1.851 0.460 2.013 0.610 59.988 0.925 SFC (Backurs et al., 2019) 3.399 0.471 3.236 0.622 69.437 0.940 FRAC (Gupta et al., 2023) 2.900 0.488 2.716 0.646 38.430 0.962 FCA 1.875 0.492 1.859 0.647 33.472 0.959 On the other hand, as shown in Table 1, FCBC attains a slightly lower Cost (0.314) than FCA, whereas FCA yields a higher fairness level. To fairly compare the two methods at equal Cost, we run FCA-C targeting Cost 0.314, i.e., the Cost of FCBC. Under this constraint, FCA-C achieves a Balance of 0.473, compared to 0.443 of FCBC, demonstrating that FCA offers a superior trade-off between Bal and Cost. Table 7. Comparison of FCBC and FCA-C in terms of the trade-off between Bal and Cost on ADULT dataset, when data are L2normalized. Cost is fixed near 0.314 for a fair comparison. 
Bal = 0.494 Cost ( ) Bal ( ) FCBC (Esmaeili et al., 2021) 0.314 0.443 FCA-C 0.313 0.473 For our main experiments in Section 5, we mainly compare FCA with the scalable fairlet-based method of Backurs et al. (2019). Here, we additionally compare FCA with the original fairlet-based approach introduced by Chierichetti et al. (2017). Table 8 shows that FCA outperforms the fairlet-based method of Chierichetti et al. (2017) as well as SFC. Table 8. Comparison of Chierichetti et al. (2017), SFC and FCA in terms of the trade-off between Bal and Cost on ADULT, BANK, and CENSUS datasets, when data are L2-normalized. The bold-faced results indicate the bests. Dataset / Bal ADULT / 0.494 BANK / 0.649 CENSUS / 0.969 With L2 normalization Cost ( ) Bal ( ) Cost ( ) Bal ( ) Cost ( ) Bal ( ) Chietichetti et al. (2017) 0.507 0.488 0.378 0.639 1.124 0.941 SFC 0.534 0.489 0.410 0.632 1.015 0.937 FCA 0.328 0.493 0.264 0.645 0.477 0.962 Fair Clustering via Alignment Numerical stability Table 9 presents the comparison results of FCA and VFC in terms of numerical stability without L2 normalization, which is similar to Table 2. In specific, on CENSUS dataset without L2 normalization, VFC fails, i.e., it does not achieve higher Bal than the standard clustering for any λ. We find that the failure of VFC on CENSUS dataset (without L2 normalization) is due to explosion of an exponential term calculated in its algorithm. There exists an exponential term with respect to the clustering cost in the calculation of the optimal assignment vector in the VFC algorithm, and thus when the clustering cost becomes too large, VFC fails due to an overflow. Note that the input dimension of CENSUS dataset is 66, while those of ADULT and BANK are 5 and 6, respectively. While the clustering cost with L2 normalization is bounded regardless of the dimension, the clustering cost without L2 normalization is proportional to the input dimension. This is why VFC fails only for CENSUS dataset without L2 normalization. In contrast, as FCA does not fail at all regardless of the L2 normalization because there is no exponential term in the algorithm, meaning that it is numerically more stable than VFC. Table 9. Comparison of the two in-processing algorithms, VFC (Ziko et al., 2021) and FCA, in terms of numerical stability when achieving the maximum Bal, on ADULT, BANK, and CENSUS datasets without L2 normalization. Bold-faced results are the highest values of Bal. Dataset / Bal VFC (Ziko et al., 2021) FCA ADULT / 0.494 0.310 (1.688) 0.493 (1.875) BANK / 0.649 0.505 (1.549) 0.647 (1.859) CENSUS / 0.969 Failed 0.959 (33.472) Control of fairness level We also provide the full results (i.e., Figures 4 and 5 where the data are L2-normalized and not L2-normalized, respectively) showing the trade-off of VFC and FCA-C for various fairness levels, on ADULT, BANK and CENSUS datasets. FCA and VFC show similar trade-offs, while VFC cannot achieve sufficiently high values of Bal (the orange dashed lines are the limit values of balance that VFC can achieve). Figure 4. Bal vs. Cost trade-offs on (left) ADULT, (center) BANK and (right) CENSUS datasets. Black square ( ) is from the standard clustering, orange circle ( ) is from VFC, green star ( ) is from FCA-C, orange dashed line (- -) is the maximum of Bal that VFC can achieve, and blue line ( ) is the maximum achievable balance Bal . Figure 5. Bal vs. Cost trade-offs on (left) ADULT, (center) BANK and (right) CENSUS datasets. 
Black square ( ) is from the standard clustering, orange circle ( ) is from VFC, green star ( ) is from FCA-C, orange dashed line (- -) is the maximum of Bal that VFC can achieve, and blue line ( ) is the maximum achievable balance Bal . The data are not L2-normalized. Fair Clustering via Alignment Comparison in terms of the silhouette score As measuring Cost alone may not fully capture the clustering quality, we additionally consider another measure, called the silhouette score. The silhouette score (we abbreviate by Silhouette) is computed as the average of (dnear dintra)/ max(dintra, dnear) over all data points, where dintra denotes the intra-cluster distance and dnear represents the average distance to the nearest neighboring cluster. Notably, the results in Table 10 show that FCA is also superior or competitive to baselines in terms of the silhouette score. Table 10. Comparison of the Silhouette and Bal on ADULT dataset, when the data are L2-normalized. Dataset / Bal ADULT / 0.494 With L2 normalization Silhouette ( ) Bal ( ) Standard (fair-unaware) 0.227 0.223 FCBC 0.173 0.443 SFC 0.071 0.489 FRAC 0.156 0.490 FCA 0.176 0.493 Analysis on an additional dataset We additionally conduct an analysis on CREDITCARD dataset (n = 30000) from Yeh & Lien (2009), which was also used in Bera et al. (2019); Harb & Lam (2020). We directly follow the data preprocessing of Bera et al. (2019). We use gender as the sensitive attribute and set K = 10. The results are provided in Table 11 below, where we can observe that FCA outperforms the baseline methods on CREDITCARD dataset as well. Table 11. Comparison of the Cost and Bal on CREDITCARD dataset, when the data are L2-normalized. Dataset / Bal CREDITCARD / 0.656 With L2 normalization Cost ( ) Bal ( ) Standard (fair-unaware) 0.392 0.506 FCBC 0.492 0.629 SFC 0.682 0.653 FRAC 0.510 0.649 FCA 0.402 0.653 Fair Clustering via Alignment C.3.2. APPLICABILITY TO VISUAL CLUSTERING (SECTION 5.3) Settings: datasets, baselines, and measures RMNIST is a mixture of two image digit datasets: the original MNIST and a color-reversed version (where black and white are swapped). OFFICE-31 is a mixture of two datasets from two domains (amazon and webcam) with 31 classes. Both datasets are used in state-of-the-art visual FC methods (Li et al., 2020; Zeng et al., 2023). For the baseline, we consider a state-of-the-art FC method in vision domain called FCMI from Zeng et al. (2023). FCMI learns a fair autoencoder with two additional loss terms: (i) clustering loss on the latent space and (ii) mutual information between latent vector and color. While FCMI is an end-to-end method that learns the fair latent vector and perform clustering on the fair latent space simultaneously, FCA is applied to a pre-trained latent space obtained by learning an autoencoder with the reconstruction loss only. We also report the performances of DFC (Li et al., 2020) which was the main baseline in the FCMI paper, along with SFC (Backurs et al., 2019) and VFC (Ziko et al., 2021), even though these two methods are not specifically designed for the vision domain. The clustering performance for the two image datasets is evaluated by two classification measures ACC (accuracy calculated based on assigned cluster indices and ground-truth classification labels) and NMI (normalized mutual information between ground-truth label distribution and assigned cluster distribution), which are consistently used in Li et al. (2020); Zeng et al. 
(2023), as datasets involve ground-truth classification labels (e.g., {0, 1, . . . , 9} for RMNIST and the 31 classes for OFFICE-31). The fairness level is also evaluated by Bal. Results Table 12 shows that FCA performs similar to FCMI, which is the state-of-the-art, while outperforming the other baselines with large margins in terms of Bal. Note that SFC, VFC, and FCA are two-step methods, i.e., they find fair clustering on the pre-trained (fair-unaware) latent space, and FCA is the best among those. Furthermore, on the other hand, DFC and FCMI are end-to-end methods so it is noteworthy that FCA outperforms DFC and performs similarly to FCMI. Table 12. Comparison of clustering utility (ACC and NMI) and fairness level (Bal) on two image datasets. Standard (fair-unaware) indicates autoencoder + K-means (Vincent et al., 2010). First-place values are bold, and second-place values are underlined. The performances of the baselines reflects the better results between our re-implementation and the one reported by Zeng et al. (2023). Dataset / Bal RMNIST / 1.000 OFFICE-31 / 0.282 Performance ACC ( ) NMI ( ) Bal ( ) ACC ( ) NMI ( ) Bal ( ) Standard (fair-unaware) 41.0 52.8 0.000 63.8 66.8 0.192 SFC (Backurs et al., 2019) 51.3 49.1 1.000 61.6 61.2 0.267 VFC (Ziko et al., 2021) 38.1 42.7 0.000 64.8 70.4 0.212 DFC (Li et al., 2020) 49.9 68.9 0.800 69.0 70.9 0.165 FCMI (Zeng et al., 2023) 88.4 86.4 0.995 70.0 71.2 0.226 FCA 89.0 79.0 1.000 67.6 70.5 0.270 Further comparison between the fairlet-based method and FCA in visual clustering We compare the fairlet-based method and FCA using OFFICE-31 dataset, considering (i) not only the overall clustering utility, (ii) but also the similarity of matched features. For a clear comparison, we sample a balanced subset (with respect to the label and sensitive attribute) from the original dataset, which is imbalanced. That is, we ensure that the number of samples with the same label is equal across the two protected groups (i.e., the two domains). As a result, we obtain 795 images from both the amazon and webcam domains. We then find fairlets or apply FCA on this balanced dataset. Note that finding fairlets when n0 = n1 is equivalent to finding the optimal transport map (Villani, 2008; Chierichetti et al., 2017; Peyr e & Cuturi, 2020). Table 13. Comparison of the fairlet-based method and FCA. Matching cost is defined by the average distance between two matched features. Matching = Label is defined by the average ratio of images with the same label being matched. Bold-faced values indicate the best performance. Matching method Matching performance Clustering performance Matching cost Matching = Label ( ) Cost ( ) ACC ( ) NMI ( ) Fairlet-based 0.211 0.595 0.278 65.8 71.0 FCA 0.241 0.631 0.269 69.3 72.2 Fair Clustering via Alignment Table 13 presents the comparison results using various measures, including performance measures with respect to matching (the matching cost and how much images with the same label are matched) and clustering (Cost, ACC, and NMI). While the fairlets tend to match more similar features (i.e., the matching cost is lower) as expected by the definition of fairlets, FCA exhibits a better ability to collect images with same labels into clusters (i.e., lower Cost, higher ACC, and higher NMI). 
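For reference, the ACC and NMI measures reported in Tables 12 and 13 above can be computed with the standard recipe below: cluster indices are mapped to ground-truth classes by a Hungarian matching, and NMI comes directly from scikit-learn. The helper name is ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(labels_true, labels_pred):
    """ACC: best accuracy over one-to-one mappings between cluster indices and
    ground-truth classes, found with the Hungarian algorithm."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = max(labels_true.max(), labels_pred.max()) + 1
    w = np.zeros((n, n), dtype=np.int64)          # contingency matrix
    for p, t in zip(labels_pred, labels_true):
        w[p, t] += 1
    rows, cols = linear_sum_assignment(-w)        # maximize matched counts
    return w[rows, cols].sum() / labels_true.size

# nmi = normalized_mutual_info_score(labels_true, labels_pred)
```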
Moreover, in visual clustering, matching similar features in the pre-trained latent space does not always guarantee that images with the same label are matched (FCA achieves a higher proportion of matchings where images share the same label compared to fairlets). These results suggest that while fairlets provide optimal matchings in terms of feature similarity, however, they may be suboptimal in terms of label similarity and overall clustering utility. Fair Clustering via Alignment C.3.3. ABLATION STUDY: (1) SELECTION OF THE PARTITION SIZE m (SECTION 5.4) This section provided empirical evidence for the partitioning technique introduced in Section 4.3. First, Figure 6 indicates that using a partition size of around 1000 yields reasonable results. Specifically, using m values greater than 1000 shows similar performance compared to those obtained with m = 1024. Figure 6. Variations of Cost and Bal with respect to the partition size. (Left, Center, Right) = (ADULT, BANK, CENSUS). (Top, Bottom) = (With L2 normalization, Without L2 normalization). We further provide the elapsed computation time for various partition sizes, up to using the full dataset. Using m = 1024 as the baseline, we calculate the relative computation time (%) for other partition sizes. The comparison, presented in Table 14 below, shows that using m = 1024 leads to a significant reduction in computation time. Table 14. Comparison of computation time with different partition sizes, up to using the full dataset. For each partition size and dataset, we provide the averaged relative elapsed time per iteration, when compared to computation time for m = 1024. (Relative) elapsed time per iteration Partition size m 256 512 1024 2048 4096 Full ADULT (n = 32561, d = 5) 17% 58% 100% 140% 344% 3,184% BANK (n = 41108, d = 6) 14% 43% 100% 161% 288% 3,308% CENSUS (n = 20000, d = 66) 23% 52% 100% 176% 375% 1,064% Additionally, we observe that computation time is linear in m2, as shown in Figure 7 below. This numerical result can support the theoretical computational complexity O(nm2) described in Section 4.3. See Table 18 in Appendix C.3.6 for another result showing that the computation time is also linear in n. Figure 7. Squared partition size m2 vs. Relative computation time. (Left, Center, Right) = (ADULT, BANK, CENSUS). Fair Clustering via Alignment C.3.4. ABLATION STUDY: (2) OPTIMIZATION AND INITIALIZATION OF CLUSTER CENTERS (SECTION 5.4) Optimization algorithm of cluster centers We analyze how the performance of FCA varies depending on the optimization algorithm for finding cluster centers. (i) For the K-means algorithm whose results are reported in the main body, we use the K-means++ initialization (Arthur & Vassilvitskii, 2007) from the implementation of scikit-learn package (Pedregosa et al., 2011). Note that we use the algorithm of Lloyd (1982) for the K-means algorithm. (ii) We additionally consider random initialization of cluster centers with Lloyd s algorithm. (iii) For the gradient-based algorithm, we use Adam optimizer (Kingma & Ba, 2014). We set a learning rate of 0.005 for CENSUS dataset with L2 normalization, and 0.05 for all other cases. To accelerate convergence, 20 gradient steps of updating the centers are performed per iteration. Table 15 presents the results comparing these three approaches, showing that FCA is robust to the choice of the optimization algorithm for finding cluster centers. 
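A minimal sketch of the gradient-based variant (iii), assuming PyTorch (an assumption; it is not used elsewhere in this paper): the centers are treated as trainable parameters and updated by Adam against the K-means cost on the aligned points. The default learning rate and number of inner steps mirror the settings above but are otherwise illustrative.

```python
import torch

def update_centers_adam(aligned_points, centers, lr=0.05, steps=20):
    """Gradient-based center update: minimize the K-means cost with Adam."""
    centers = torch.nn.Parameter(torch.as_tensor(centers, dtype=torch.float32))
    X = torch.as_tensor(aligned_points, dtype=torch.float32)
    opt = torch.optim.Adam([centers], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Each point is charged the squared distance to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(dim=-1)
        loss = d2.min(dim=1).values.mean()
        loss.backward()
        opt.step()
    return centers.detach().numpy()
```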
Note that, while the gradient-based algorithm is also effective and accurate, it requires additional practical considerations such as selections of the learning rate and optimizer. Furthermore, the slight outperformance of K-means++ initialization over random initialization suggests that an efficient initialization algorithm can enhance the final performance of FCA. This is also theoretically examined through the approximation error in Appendix B.5, which depends on the approximation error of the standard algorithm for finding cluster centers without fairness constraints. However, since the margins are small, FCA can be considered empirically robust to the initialization. Table 15. Comparison of performance with respect to optimization algorithms for finding cluster centers with L2 normalization (top) and without L2 normalization (bottom). K-means++ indicates that the initial centers are set according to the K-means++ initialization in the first iteration, then apply the algorithm of Lloyd (1982). K-means random indicates that the initial centers are set randomly at the first iteration, then apply the algorithm from Lloyd (1982). Gradient-based indicates that the initial centers are set randomly, and the centers are subsequently updated using the Adam optimizer. Dataset / Bal ADULT / 0.494 BANK / 0.649 CENSUS / 0.969 With L2 normalization Cost Bal Cost Bal Cost Bal FCA (K-means++) 0.328 0.493 0.264 0.645 0.477 0.962 FCA (K-means random) 0.331 0.490 0.275 0.646 0.477 0.955 FCA (Gradient-based) 0.339 0.492 0.254 0.640 0.478 0.957 Dataset / Bal ADULT / 0.494 BANK / 0.649 CENSUS / 0.969 Without L2 normalization Cost Bal Cost Bal Cost Bal FCA (K-means++) 1.875 0.492 1.859 0.647 33.472 0.959 FCA (K-means random) 1.882 0.489 1.864 0.644 32.913 0.960 FCA (Gradient-based) 1.943 0.490 1.967 0.646 34.121 0.962 Fair Clustering via Alignment Initialization of cluster centers Moreover, we empirically assess the stability of FCA and FCA-C with respect to the different initial cluster centers (while keeping the initialization algorithm fixed to K-means++), and compare them with the standard K-means++ algorithm. For FCA and the standard K-means++ algorithm, we run each algorithm five times with five different random initial centers. For FCA-C, we use five random initial centers for each ε {0.1, 0.15, . . . , 0.85, 0.9} and compute the averages as well as standard deviations. Then, we divide the standard deviation by average 17 times (corresponding to 17 εs), and take average. Table 16 below reports the coefficient of variation (= standard deviation average) of Cost and Bal. The results show that the variations of all the three algorithms are similar, indicating that FCA and FCA-C are as stable as the standard K-means++ with respect to the choice of initial cluster centers. That is, aligning data (i.e., finding the optimal coupling matrix Γ) to build fair clustering does not affect the stability of the overall algorithm. Table 16. Standard deviations divided by averages (i.e., coefficient of variation) with respect to five random different choices of initial centers. FCA ADULT BANK CENSUS Cost Bal Cost Bal Cost Bal with L2 0.012 0.001 0.093 0.006 0.004 0.001 without L2 0.010 0.001 0.081 0.003 0.006 0.003 FCA-C ADULT BANK CENSUS Cost Bal Cost Bal Cost Bal with L2 0.015 0.001 0.083 0.007 0.011 0.004 without L2 0.011 0.001 0.088 0.002 0.010 0.002 K-means++ ADULT BANK CENSUS Cost Bal Cost Bal Cost Bal with L2 0.009 0.001 0.078 0.005 0.008 0.002 without L2 0.011 0.000 0.090 0.004 0.010 0.002 C.3.5. 
ABLATION STUDY: (3) CONSISTENT OUTPERFORMANCE ACROSS VARIOUS K (SECTION 5.4) We analyze the impact of K to the performance of four FC algorithms (FCBC, SFC, FRAC, and FCA). On ADULT dataset, we evaluate the FC algorithms with K {5, 10, 20, 40}. The results are presented in Figure 8 below, which show that FCA outperforms existing FC algorithms across all values of K. Specifically, FCA achieves lower values of Cost than baselines for most values of K, while maintaining the highest fairness level: Bal Bal . Figure 8. Performance comparison of FC algorithms in terms of Cost and Bal for K {5, 10, 20, 40}. (Left two, Right two) = (With L2 normalization, Without L2 normalization). Fair Clustering via Alignment C.3.6. ABLATION STUDY: (4) SCALABILITY FOR LARGE-SCALE DATA (SECTION 5.4) In this section, we evaluate the scalability of FCA for larger datasets, while the main experiments in Section 5 are conducted on real datasets with sample sizes of around 20, 000 to 40, 000. In specific, we apply FCA on a synthetic dataset with an extremely large sample size (around a million). Large-scale dataset generation We generate a large (n = 106) synthetic dataset in Rd using a J-component Gaussian mixture, as follows: 1. (Mean vectors) We sample J many d-dimensional vectors mj, j {1, . . . , J} from a uniform distribution Unif( 20, 20). To ensure diversity, the distance between any two vectors is constrained to be at least 1. These vectors are used as the mean vectors for the Gaussian components. 2. (Covariance matrices) Each jth Gaussian component is assigned a covariance matrix Σj = σ2 j I, where σj Unif(1, 3). 3. (Weights) Component weights, denoted as ϕj, j {1, . . . , J}, are sampled from a Dirichlet distribution Dirichlet(α1, . . . , αJ) for given parameters α1, . . . , αJ. 4. (Completion) The Gaussian mixture model is completed as PJ j=1 ϕj N(mj, Σj). We set J as an even number, and sample data for S = 0 from J/2 components and for S = 1 from the remaining J/2 components. Using this procedure, we construct a dataset with n = 106, d = 2, J = 20, and αj = 1, j {1, . . . , J}. The resulting generated dataset contains 320,988 samples for S = 0 and 679,012 samples for S = 1. The features are then scaled to have zero mean and unit variance. Figure 9 provides a visualization of this synthetically generated dataset. Figure 9. The large synthetic dataset generated with n = 106, d = 2, J = 20, and αj = 1, j {1, . . . , J}. Fair Clustering via Alignment Results We fix the number of clusters to K = 10. In this analysis, we compare only FCA and VFC, as other baselines incur extremely high computational costs for this dataset. Table 17 presents the results, demonstrating that FCA is easily scaled-up and remains a favorable FC algorithm for this large-scale dataset. That is, FCA successfully achieves near-perfect fairness (i.e., Bal = 0.472 0.473). In contrast, VFC fails to achieve near-perfect fairness, with a limit of Bal 0.114. Meanwhile, FCA-C achieves a lower Cost than VFC (0.107 < 0.111), while attaining the maximum achievable Bal for VFC ( 0.114). Table 17. Comparison of Cost and Bal between FCA (or FCA-C) and VFC on the large synthetic dataset in Figure 9. Bal = 0.473 Cost ( ) Bal ( ) Standard (fair-unaware) 0.058 0.000 VFC (Ziko et al., 2021) (λ = 51000) 0.111 0.114 FCA-C (ε = 0.65) 0.107 0.115 FCA 0.669 0.472 We further analyze the computation times of FCA on four datasets with different ns, including this large-scale dataset. 
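Before turning to the timing results, the generation procedure in steps 1–4 above can be sketched as follows for reproducibility. This simplified version draws per-sample components directly and omits the minimum-distance constraint between mean vectors, so it approximates, rather than reproduces, the exact procedure.

```python
import numpy as np

def make_synthetic(n=1_000_000, d=2, J=20, seed=0):
    """Simplified sketch of the J-component Gaussian-mixture generator."""
    rng = np.random.default_rng(seed)
    means = rng.uniform(-20, 20, size=(J, d))          # step 1: component means
    sigmas = rng.uniform(1, 3, size=J)                 # step 2: isotropic std. devs
    weights = rng.dirichlet(np.ones(J))                # step 3: mixture weights
    comp = rng.choice(J, size=n, p=weights)            # component of each sample
    X = means[comp] + sigmas[comp, None] * rng.standard_normal((n, d))
    S = (comp < J // 2).astype(int)                    # half of the components per group
    X = (X - X.mean(axis=0)) / X.std(axis=0)           # zero mean, unit variance
    return X, S
```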
Table 18 shows that the computation time scales linearly with n (i.e., Average time / n is nearly constant). This observation aligns with our discussion on the computational complexity of O(nm2) when applying the partitioning technique with a fixed m = 1024 (see Section 4.3). These results can further highlight the empirical efficiency of the partitioning technique. Table 18. Comparison of total computation times (seconds) of FCA, on four different datasets with different ns. The reported results are averages and standard deviations, based on five runs. ADULT BANK CENSUS Large synthetic (Figure 9) Partition size m 1024 1024 1024 1024 n 32,561 41,108 20,000 1,000,000 Average time (Standard deviation) 56.7 (3.9) 73.1 (7.8) 32.2 (5.4) 1410.1 (115.8) Average time / n 1.7 10 3 1.8 10 3 1.6 10 3 1.4 10 3 C.3.7. ABLATION STUDY: (5) LINEAR PROGRAM VS. SINKHORN FOR OPTIMIZING Q (SECTION 5.4) We evaluate an alternative algorithm for finding the coupling matrix Γ. Specifically, the Sinkhorn algorithm of Cuturi (2013) optimizes C + λ ent(Γ), where C is the cost matrix defined in Phase 1 of Section 4.1, λ is a regularization parameter, and ent(Γ) denotes the entropy of Γ. Note that, as λ increases, Γ approaches a uniform matrix. It is well-known that the Sinkhorn algorithm generally yields a more relaxed solution with reduced runtime compared to solving the Kantorovich problem via linear programming. We compare (i) the original FCA (using linear programming) and (ii) FCA with the Sinkhorn algorithm for λ {0.01, 0.1, 1.0}. Table 19 shows the results, suggesting: (i) A small regularization (λ = 0.01) achieves performance comparable to linear programming with a slight runtime reduction ( 2%). (ii) A large regularization (λ = 1.0) significantly degrades performance while reducing runtime ( 10%). Overall, careful tuning of λ is critical when using the Sinkhorn algorithm, while the runtime reduction may not be significant in practice. Therefore, we recommend solving the linear program for FCA. Table 19. Comparison of using the Sinkhorn algorithm and solving the linear program, in terms of Cost, Bal, and runtime per iteration. Bal = 0.494 Cost ( ) Bal ( ) Runtime / iteration (sec) FCA (Sinkhorn, λ = 1.0) 0.350 0.271 4.98 FCA (Sinkhorn, λ = 0.1) 0.315 0.463 5.12 FCA (Sinkhorn, λ = 0.01) 0.330 0.491 5.55 FCA (Linear program) 0.328 0.493 5.67 Fair Clustering via Alignment C.3.8. COMPARISONS OF COMPUTATION TIME FCA versus FCA-C We compare the computation times of FCA and FCA-C, as FCA-C technically involves an additional step of optimizing W. Table 20 below shows that while FCA-C requires slightly more computation time than FCA, the increase is not substantial (a maximum of 3.7%). Table 20. Comparison of computation times (seconds) of FCA and FCA-C per each iteration. The averages and standard deviations are calculated based on five runs. The data are L2-normalized and the batch size is fixed as 1024. Average (Standard deviation) ADULT BANK CENSUS FCA 5.67 (0.39) 7.31 (0.78) 16.10 (2.70) FCA-C 5.72 (0.47) 7.58 (0.32) 16.46 (1.23) Increase (FCA-C / FCA) 0.5% 3.7% 2.2% FCA versus SFC We here additionally consider two more scenarios to compare FCA and SFC. In this analysis, the data are L2-normalized and the batch size is fixed as 1024 for FCA. Note that FCA consists of an outer iteration (updating the cluster centers and joint distribution alternately), and an inner iteration when applying the K-means algorithm in the aligned space. 1. We compare the number of iterations until convergence. 
For SFC, we calculate the number of iterations in the K-means algorithm (after finding fairlets). For FCA, we calculate the sum of the number of iterations (inner iteration) in the K-means algorithm for each outer iteration (updating the cluster centers and joint distribution). In this analysis, note that we report the elapsed time for FCA with 10 iterations of the outer iteration, because the performance of FCA with 10 iterations of the outer iteration is not significantly different from FCA with 100 iterations of the outer iteration. As a result, SFC requires smaller number of iterations, compared to FCA (see Table 21), primarily because FCA involves the outer iterations. On the other hand, in FCA, the number of total iterations is almost linear with respect to the number of iterations in each inner loop. Table 21. Comparison of computational costs between FCA and SFC: the total number of iterations of the standard K-means algorithm in SFC and FCA. Total number of iterations ADULT BANK CENSUS SFC (Backurs et al., 2019) 15 10 32 FCA 57 110 93 2. To further analyze whether applying an early-stopping to FCA can maintain reasonable performance, we conduct an additional experiment: we fix the number of K-means iterations to 1 per outer iteration of FCA, then perform a total of 10 outer iterations. With this setup, the total number of iterations for FCA becomes 10, which is comparable to or smaller than that of SFC (15 for Adult, 10 for Bank, and 32 for Census dataset, as shown in Table 21). We observe that the performance of FCA with this early-stopping is slightly worse than the original FCA (where the K-means algorithm runs until convergence), but it still outperforms SFC (see Table 22). However, this early-stopping approach would be not recommended, at least for the datasets used in our experiments. This is because running the K-means algorithm takes less than a second or a few seconds, while the computation of finding the joint distribution dominates the overall runtime. Additionally, the original FCA slightly outperforms the early-stopped version with only 10 iterations. Table 22. Performance comparison of SFC, FCA with a total of 10 iterations, and the original FCA (updating until convergence). The data are not L2-normalized. Dataset / Bal ADULT / 0.494 BANK / 0.649 CENSUS / 0.969 Performance Cost ( ) Bal ( ) Cost ( ) Bal ( ) Cost ( ) Bal ( ) SFC (Backurs et al., 2019) 3.399 0.471 3.236 0.622 69.437 0.940 FCA (total 10 iterations) 1.923 0.489 1.992 0.644 33.967 0.955 FCA (original) 1.875 0.492 1.859 0.647 33.472 0.959 Fair Clustering via Alignment C.3.9. COMPARISON OF FCA AND SFC BASED ON K-MEDIAN CLUSTERING COST As SFC is originally designed for the K-median clustering objective, we compare FCA and SFC based on the K-median clustering cost (i.e., L1 cost) for a more fair comparison. In specific, FCA for K-median clustering cost is modified as follows: (i) The L2 norm in eq. (3) is replaced by the L1 norm. (ii) The cluster centers are found by minimizing the L1 distance, as we discuss in Appendix A.4. The results are presented in Table 23, which shows that FCA still outperforms SFC. It implies that the fairlet-based method still may not always find the most effective matching in view of clustering utility (cost), even when the given clustering objective more suited to fairlet-based approaches (e.g., L1 norm) is considered. Let Cost1 = 1 (x,s) D x µk(x,s) 1 be the K-median clustering cost. Table 23. Comparison of Cost1 and Bal of FCA and SFC. The data are not L2-normalized. 
Dataset / Bal                  ADULT / 0.494          BANK / 0.649           CENSUS / 0.969
Performance                    Cost1 (↓)   Bal (↑)    Cost1 (↓)   Bal (↑)    Cost1 (↓)   Bal (↑)
Standard (fair-unaware)        1.788       0.206      1.989       0.391      21.402      0.030
SFC (Backurs et al., 2019)     2.979       0.471      3.056       0.622      29.597      0.940
FCA                            2.032       0.492      2.383       0.647      22.927      0.959