# Variational Fair Clustering

Imtiaz Masud Ziko¹, Jing Yuan², Eric Granger¹ and Ismail Ben Ayed¹
¹ETS Montreal, Canada   ²Xidian University, China
imtiaz-masud.ziko.1@etsmtl.ca

We propose a general variational framework of fair clustering, which integrates an original Kullback-Leibler (KL) fairness term with a large class of clustering objectives, including prototype-based and graph-based ones. Fundamentally different from the existing combinatorial and spectral solutions, our variational multi-term approach enables control over the trade-off levels between the fairness and clustering objectives. We derive a general tight upper bound based on a concave-convex decomposition of our fairness term, its Lipschitz-gradient property and Pinsker's inequality. Our tight upper bound can be jointly optimized with various clustering objectives, while yielding a scalable solution with convergence guarantees. Interestingly, at each iteration, it performs an independent update for each assignment variable. Therefore, it can easily be distributed for large-scale datasets. This scalability is important as it enables exploring different trade-off levels between the fairness and clustering objectives. Unlike spectral relaxation, our formulation does not require storing an affinity matrix and computing its eigenvalue decomposition. We report comprehensive evaluations and comparisons with state-of-the-art methods over various fair-clustering benchmarks, which show that our variational formulation can yield highly competitive solutions in terms of fairness and clustering objectives.

Introduction

Machine learning models are impacting our daily life, for instance, in marketing, finance, education, and even in sentencing recommendations (Kleinberg et al. 2017). However, these models may exhibit biases towards specific demographic groups due to, for instance, the biases that exist within the data. For example, a higher level of face recognition accuracy may be found with white males (Buolamwini and Gebru 2018). These biases have recently triggered substantial interest in designing fair algorithms for the supervised learning setting (Hardt, Price, and Srebro 2016; Zafar et al. 2017; Donini et al. 2018). Also, very recently, the community has started to investigate fairness constraints in unsupervised learning (Chierichetti et al. 2017; Kleindessner et al. 2019; Backurs et al. 2019; Samadi et al. 2018; Celis et al. 2018). Specifically, Chierichetti et al. (2017) pioneered the concept of fair clustering. The problem consists of embedding fairness constraints that encourage clusters to have balanced demographic groups pertaining to some sensitive attributes (e.g., sex, gender, race, etc.), so as to counteract any form of data-inherent bias.

Assume that we are given N data points to be assigned to a set of K clusters, and let $S_k \in \{0, 1\}^N$ denote a binary indicator vector whose components take value 1 when the point is within cluster k, and 0 otherwise. Also suppose that the data contains J different demographic groups, with $V_j \in \{0, 1\}^N$ denoting a binary indicator vector of demographic group j. The authors of (Chierichetti et al. 2017; Kleindessner et al. 2019) suggested to evaluate fairness in terms of cluster-balance measures, which take the following form:

$$\mathrm{balance}(S_k) = \min_{j \neq j'} \frac{V_j^t S_k}{V_{j'}^t S_k} \in [0, 1] \qquad (1)$$

The higher this measure, the fairer the cluster.
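As a concrete illustration of the balance measure in Eq. (1), the following minimal NumPy sketch (ours, not from the paper) computes the per-cluster balance from hard assignments:

```python
import numpy as np

def cluster_balances(labels, groups, K):
    """Balance of each cluster as in Eq. (1): the minimum, over pairs of
    demographic groups, of the ratio of their counts within the cluster.
    labels: (N,) cluster index per point; groups: (N,) group index per point."""
    J = groups.max() + 1
    balances = []
    for k in range(K):
        counts = np.bincount(groups[labels == k], minlength=J).astype(float)
        if counts.min() == 0:          # a group absent from the cluster gives balance 0
            balances.append(0.0)
            continue
        ratios = counts[:, None] / counts[None, :]
        balances.append(float(ratios[~np.eye(J, dtype=bool)].min()))
    return balances
```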
The overall clustering balance is defined as the minimum of Eq. (1) over k. This notion of fairness in clusters has recently given rise to a new line of research that was introduced, mostly, for prototype-based clustering, e.g., K-center, K-medians and K-means (Chierichetti et al. 2017; Backurs et al. 2019; Schmidt, Schwiegelshohn, and Sohler 2018; Bera et al. 2019). Also, very recently, fairness has been investigated in the context of spectral graph clustering (Kleindessner et al. 2019). The general problem raises several interesting questions. How to embed fairness in popular clustering objectives? Can we control the trade-off between some acceptable fairness level (or tolerance) and the quality of the clustering objective? What is the cost of fairness with respect to the clustering objective and computational complexity?

Chierichetti et al. (2017) investigated combinatorial approximation algorithms for maximizing the fairness measures in Eq. (1), for K-center and K-medians clustering, and for binary demographic groups (J = 2). They compute fairlets, which are groups of points that are fair and cannot be split further into subsets that are also fair. Then, they consider each fairlet as a data point and cluster them with approximate K-center or K-medians algorithms. Unfortunately, as reported in the experiments in (Chierichetti et al. 2017), obtaining fair solutions with these fairlet-based algorithms comes at the price of a substantial increase in the clustering objectives. Also, the cost of finding fairlets with perfect matching is quadratic w.r.t. the number of data points, a complexity that increases for more than two demographic groups.

Several combinatorial solutions followed up on the work in (Chierichetti et al. 2017) to reduce this complexity. For instance, Backurs et al. (2019) proposed a solution to make the fairlet decomposition in (Chierichetti et al. 2017) scalable for J = 2, by embedding the input points in a tree metric. Rösner and Schmidt (2018) designed a 14-approximate algorithm for fair K-center. (Schmidt, Schwiegelshohn, and Sohler 2018; Huang, Jiang, and Vishnoi 2019) proposed fair K-means/K-medians based on coresets, i.e., reduced proxy sets for the full dataset. Bera et al. (2019) provided a bi-criteria approximation algorithm for fair prototype-based clustering, enabling multiple groups (J > 2). It is worth noting that, for large-scale data sets, (Chierichetti et al. 2017; Rösner and Schmidt 2018; Bera et al. 2019) sub-sample the inputs to mitigate the quadratic complexity w.r.t. N. More importantly, the combinatorial algorithms discussed above are tailored for specific prototype-based objectives. For instance, they are not applicable to the very popular graph-clustering objectives, e.g., Ratio Cut or Normalized Cut (Von Luxburg 2007), which limits their applicability in a breadth of graph problems, in which data takes the form of pairwise affinities.

Kleindessner et al. (2019) integrated fairness into graph-clustering objectives. They embedded linear constraints on the assignment matrix in a spectral relaxation. Then, they solved a constrained trace optimization by finding the K smallest eigenvalues of a transformed Laplacian matrix.
However, it is well known that spectral relaxation has heavy time and memory loads, since it requires storing an N × N affinity matrix and computing its eigenvalue decomposition; the complexity is cubic w.r.t. N for a straightforward implementation, and super-quadratic for fast implementations (Tian et al. 2014). In the general context of spectral relaxation and graph partitioning, issues related to computational scalability for large-scale problems are driving an active line of recent work (Shaham et al. 2018; Ziko, Granger, and Ayed 2018; Vladymyrov and Carreira-Perpiñán 2016).

The existing fair-clustering algorithms, such as the combinatorial or spectral solutions discussed above, do not have mechanisms that control the trade-off levels between the fairness and clustering objectives. Also, they are tailored either to prototype-based (Backurs et al. 2019; Bera et al. 2019; Chierichetti et al. 2017; Schmidt, Schwiegelshohn, and Sohler 2018) or graph-based objectives (Kleindessner et al. 2019). Finally, for a breadth of problems of wide interest, such as pairwise graph data, the computation and memory loads may become an issue for large-scale data sets.

Contributions: We propose a general, variational and bound-optimization framework of fair clustering, which integrates an original Kullback-Leibler (KL) fairness term with a large class of clustering objectives, including both prototype-based (e.g., K-means/K-medians) and graph-based (e.g., Normalized Cut or Ratio Cut). Fundamentally different from the existing combinatorial and spectral solutions, our variational multi-term approach enables control over the trade-off levels between the fairness and clustering objectives. We derive a general tight upper bound based on a concave-convex decomposition of our fairness term, its Lipschitz-gradient property and Pinsker's inequality. Our tight upper bound can be jointly optimized with various clustering objectives, while yielding a scalable solution with convergence guarantees. Interestingly, at each iteration, our general variational fair-clustering algorithm performs an independent update for each assignment variable. Therefore, it can easily be distributed for large-scale datasets. This scalability is important as it enables exploring different trade-off levels between fairness and the clustering objective. Unlike the constrained spectral relaxation in (Kleindessner et al. 2019), our formulation does not require storing an affinity matrix and computing its eigenvalue decomposition. We report comprehensive evaluations and comparisons with state-of-the-art methods over various fair-clustering benchmarks, which show that our variational method can yield highly competitive solutions in terms of fairness and clustering objectives, while being scalable and flexible.

Proposed Formulation

Let $X = \{x_p \in \mathbb{R}^M, \; p = 1, \ldots, N\}$ denote a set of N data points to be assigned to K clusters, and let $S = [s_1, \ldots, s_N] \in [0, 1]^{N \times K}$ be the matrix of soft cluster assignments. For each point p, $s_p = [s_{p,k}] \in [0, 1]^K$ is a probability simplex vector verifying $\sum_k s_{p,k} = 1$. Suppose that the data set contains J different demographic groups, with vector $V_j = [v_{j,p}] \in \{0, 1\}^N$ indicating point assignments to group j: $v_{j,p} = 1$ if data point p is in group j and 0 otherwise. We propose the following general variational formulation for optimizing any clustering objective F(S) with a fairness penalty, while constraining each $s_p$ within the K-dimensional probability simplex $\Delta_K = \{y \in [0, 1]^K \mid 1^t y = 1\}$:

$$\min_S \; F(S) + \lambda \sum_k D_{KL}(U \| P_k) \quad \text{s.t.} \;\; s_p \in \Delta_K \;\; \forall p \qquad (2)$$
$D_{KL}(U \| P_k)$ denotes the Kullback-Leibler (KL) divergence between the given (required) demographic proportions $U = [\mu_j]$ and the marginal probabilities of the demographics within cluster k, $P_k = [P(j|k)]$:

$$P(j|k) = \frac{V_j^t S_k}{1^t S_k} \quad \forall j, \qquad (3)$$

where $S_k = [s_{p,k}] \in [0, 1]^N$ is the N-dimensional vector¹ containing point assignments to cluster k, and t denotes the transpose operator. Notice that, at the vertices of the simplex (i.e., for hard binary assignments), $V_j^t S_k$ counts the number of points within the intersection of demographic j and cluster k, whereas $1^t S_k$ is the total number of points within cluster k. Parameter λ controls the trade-off between the clustering objective and the fairness penalty.

¹The set of N-dimensional vectors $S_k$ and the set of simplex vectors $s_p$ are two equivalent ways of representing the assignment variables. However, we use $S_k$ here for a clearer presentation of the problem, whereas, as will become clear later, the simplex vectors $s_p$ are more convenient in the subsequent optimization part.

The problem in (2) is challenging due to the ratios of summations in the fairness penalty and the simplex constraints. Expanding the KL term $D_{KL}(U \| P_k)$ and discarding the constant $\mu_j \log \mu_j$, our objective in (2) becomes equivalent to minimizing the following functional with respect to the relaxed assignment variables, and subject to the simplex constraints:

$$E(S) = \underbrace{F(S)}_{\text{clustering}} \; \underbrace{- \; \lambda \sum_k \sum_j \mu_j \log P(j|k)}_{\text{fairness}} \qquad (4)$$

Observe that, in Eq. (4), the fairness penalty becomes a cross-entropy between the given (target) proportions U and the marginal probabilities $P_k$ of the demographics within cluster k. Notice that our fairness penalty decomposes into convex and concave parts:

$$-\mu_j \log P(j|k) = \underbrace{\mu_j \log 1^t S_k}_{\text{concave}} \; \underbrace{- \; \mu_j \log V_j^t S_k}_{\text{convex}} \qquad (5)$$

This enables us to derive the following tight bounds (auxiliary functions) for minimizing our general fair-clustering model in (4), using a quadratic bound and the Lipschitz-gradient property of the convex part, along with Pinsker's inequality, and a first-order bound on the concave part.

Definition 1. $A_i(S)$ is an auxiliary function of objective E(S) if it is a tight upper bound at the current solution $S^i$, i.e., it satisfies the following conditions:

$$E(S) \leq A_i(S), \;\; \forall S \qquad (6a)$$
$$E(S^i) = A_i(S^i) \qquad (6b)$$

where i is the iteration index. Bound optimizers, also commonly referred to as Majorize-Minimize (MM) algorithms (Zhang, Kwok, and Yeung 2007), update the current solution $S^i$ to the next by optimizing the auxiliary function:

$$S^{i+1} = \arg\min_S A_i(S) \qquad (7)$$

When these updates correspond to the global optima of the auxiliary functions, MM procedures enjoy a strong guarantee: the original objective function E(S) does not increase at each iteration:

$$E(S^{i+1}) \leq A_i(S^{i+1}) \leq A_i(S^i) = E(S^i) \qquad (8)$$

This general principle is widely used in machine learning as it transforms a difficult problem into a sequence of easier sub-problems (Zhang, Kwok, and Yeung 2007). Examples of well-known bound optimizers include concave-convex procedures (CCCP) (Yuille and Rangarajan 2001), expectation-maximization (EM) algorithms and submodular-supermodular procedures (SSP) (Narasimhan and Bilmes 2005), among others. The main technical difficulty in bound optimization is how to derive an auxiliary function. In the following, we derive auxiliary functions for our general fair-clustering objective in (4).
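To make the quantities in Eqs. (2)-(4) concrete, the following NumPy sketch (our own illustration, not the authors' code) evaluates the marginals P(j|k), the KL fairness penalty of Eq. (2), and its cross-entropy form used in Eq. (4) for a given soft assignment matrix:

```python
import numpy as np

def fairness_terms(S, V, U):
    """S: (N, K) soft assignments with rows on the simplex.
    V: (J, N) binary group indicators.  U: (J,) target proportions.
    Assumes every group has non-zero mass in every cluster."""
    cluster_sizes = S.sum(axis=0)                  # 1^t S_k for each k
    joint = V @ S                                  # (J, K), entries V_j^t S_k
    P = joint / cluster_sizes[None, :]             # P(j|k), Eq. (3)
    # KL fairness penalty of Eq. (2): sum_k KL(U || P_k)
    kl = float(np.sum(U[:, None] * (np.log(U[:, None]) - np.log(P))))
    # Cross-entropy form of Eq. (4), equal to kl up to the constant sum_j mu_j log mu_j
    cross_entropy = float(-np.sum(U[:, None] * np.log(P)))
    return P, kl, cross_entropy

def fair_energy(S, V, U, clustering_term, lam):
    """Relaxed objective E(S) of Eq. (4): clustering term plus lam times the cross-entropy."""
    _, _, ce = fairness_terms(S, V, U)
    return clustering_term(S) + lam * ce
```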
Proposition 1 (Bound on the fairness penalty). Given the current clustering solution $S^i$ at iteration i, we have the following auxiliary function on the fairness term in (4), up to additive and multiplicative constants, and for current solutions in which each demographic is represented by at least one point in each cluster (i.e., $V_j^t S_k^i \geq 1 \;\; \forall j, k$):

$$G_i(S) = \sum_{p=1}^{N} s_p^t \left(b_p^i + \log s_p - \log s_p^i\right) \qquad (9)$$

with $b_p^i = [b_{p,1}^i, \ldots, b_{p,K}^i]$ and

$$b_{p,k}^i = \frac{1}{L}\left(\sum_j \frac{\mu_j}{1^t S_k^i} - \sum_j \frac{\mu_j v_{j,p}}{V_j^t S_k^i}\right),$$

where L is some positive Lipschitz-gradient constant verifying L ≤ N.

Proof: We provide a detailed proof in the supplemental material. Here, we give the main technical ingredients for obtaining our bound. We use a quadratic bound and a Lipschitz-gradient property for the convex part, and a first-order bound on the concave part. We further bound the quadratic distances between simplex variables with Pinsker's inequality (Csiszar and Körner 2011). This step completely avoids point-wise Lagrangian-dual projections and inner iterations for handling the simplex constraints, yielding scalable (parallel) updates with convergence guarantees.

Proposition 2 (Bound on the clustering objective). Given the current clustering solution $S^i$ at iteration i, the auxiliary functions for several popular clustering objectives F(S) take the following general form:

$$H_i(S) = \sum_{p=1}^{N} s_p^t a_p^i \qquad (10)$$

where the point-wise (unary) potentials $a_p^i$ are given in Table 1.

| Clustering | F(S) | $a_p^i = [a_{p,k}^i], \; \forall k$ | Where |
|---|---|---|---|
| K-means | $\sum_p \sum_k s_{p,k} (x_p - c_k)^2$ | $a_{p,k}^i = (x_p - c_k^i)^2$ | $c_k^i = \frac{X^t S_k^i}{1^t S_k^i}$ |
| Kmedians | $\sum_p \sum_k s_{p,k}\, d(x_p, c_k)$ | $a_{p,k}^i = d(x_p, c_k^i)$ | $c_k^i = \arg\min_{x_q} \sum_{p \neq q} s_{p,k}^i\, d(x_p, x_q)$; d is a distance metric |
| Ncut | $K - \sum_k \frac{S_k^t W S_k}{d^t S_k}$ | $a_{p,k}^i = d_p z_k^i - \frac{2 \sum_q w(x_p, x_q)\, s_{q,k}^i}{d^t S_k^i}$ | $z_k^i = \frac{(S_k^i)^t W S_k^i}{(d^t S_k^i)^2}$; $d = [d_p]$, with $d_p = \sum_q w(x_p, x_q) \; \forall p$; $W = [w(x_p, x_q)]$ is an affinity matrix |

Table 1: Auxiliary functions of several well-known clustering objectives.

Proofs: We give detailed proofs in the supplemental material. Here, we provide the main technical aspects. For the Ncut objective, the derivation of the auxiliary function is based on the fact that, for a positive semi-definite affinity matrix W, the Ncut objective is concave (Tang et al. 2019). Therefore, the first-order approximation at the current solution is an auxiliary function. For the prototype-based objectives, deriving an auxiliary function follows from the observation that the optimal parameters $c_k$, i.e., those that minimize the objective in closed form, correspond to the sample means/medians within the clusters. These auxiliary functions correspond to the standard K-means and K-medians procedures, which can be viewed as bound optimizers (Tang et al. 2019).

Proposition 3 (Bound on the fair-clustering functional). Given the current clustering solution $S^i$ at iteration i, and bringing back the trade-off parameter λ, we have the following auxiliary function for the general fair-clustering objective E(S) in Eq. (4):

$$A_i(S) = \sum_{p=1}^{N} s_p^t \left(a_p^i + \lambda b_p^i + \log s_p - \log s_p^i\right) \qquad (11)$$

Proof: It is straightforward to check that the sum of auxiliary functions, each corresponding to a term in the objective, is also an auxiliary function of the overall objective.
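For illustration, here is a small NumPy sketch (ours; the explicit 1/L scaling and the variable names are assumptions, and constants absorbed in the bound are ignored) of the two kinds of unary potentials that enter Eq. (11): the K-means potentials $a_{p,k}$ of Table 1 and the fairness-bound potentials $b_{p,k}$ of Proposition 1:

```python
import numpy as np

def kmeans_unaries(X, S):
    """a_{p,k} = ||x_p - c_k||^2 with c_k the soft cluster means (Table 1, K-means row)."""
    sizes = S.sum(axis=0)                          # 1^t S_k, shape (K,)
    centers = (S.T @ X) / sizes[:, None]           # c_k = X^t S_k / 1^t S_k
    # squared Euclidean distance from every point to every center, shape (N, K)
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)

def fairness_unaries(S, V, U, L=2.0):
    """b_{p,k} from Proposition 1 (up to the constants absorbed in the bound).
    S: (N, K) current soft assignments; V: (J, N) group indicators; U: (J,) targets."""
    sizes = S.sum(axis=0)                          # 1^t S_k, shape (K,)
    group_mass = V @ S                             # V_j^t S_k, shape (J, K)
    concave_part = U.sum() / sizes                 # sum_j mu_j / 1^t S_k, shape (K,)
    convex_part = V.T @ (U[:, None] / group_mass)  # sum_j mu_j v_{j,p} / V_j^t S_k, shape (N, K)
    return (concave_part[None, :] - convex_part) / L
```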
Notice that, at each iteration, our auxiliary function in (11) is the sum of independent functions, each corresponding to a single data point p. Therefore, our minimization problem in (4) can be tackled by optimizing each term over $s_p$, subject to the simplex constraint, and independently of the other terms, while guaranteeing convergence:

$$\min_{s_p \in \Delta_K} \; s_p^t \left(a_p^i + \lambda b_p^i + \log s_p - \log s_p^i\right), \;\; \forall p \qquad (12)$$

Also, notice that, in our derived auxiliary function, we obtained a convex negative-entropy barrier function $s_p^t \log s_p$, which comes from the convex part of our fairness penalty. This entropy term is very interesting as it completely avoids expensive projection steps and Lagrangian-dual inner iterations for the simplex constraint of each point. It yields closed-form updates for the dual variables of the constraints $1^t s_p = 1$ and restricts the domain of each $s_p$ to non-negative values, avoiding extra dual variables for the constraints $s_p \geq 0$. Interestingly, entropy-based barriers are commonly used in Bregman-proximal optimization (Yuan et al. 2017), and have well-known computational benefits when handling difficult simplex constraints (Yuan et al. 2017). However, they are not very common in the general context of clustering.

The objective in (12) is a sum of convex functions with the affine simplex constraint $1^t s_p = 1$. As strong duality holds for the convex objective and the affine simplex constraint, the solutions of the Karush-Kuhn-Tucker (KKT) conditions minimize the auxiliary function globally. The KKT conditions yield a closed-form solution for both the primal variables $s_p$ and the dual variables (Lagrange multipliers) corresponding to the simplex constraints $1^t s_p = 1$:

$$s_p^{i+1} = \frac{s_p^i \exp\!\left(-(a_p^i + \lambda b_p^i)\right)}{1^t \left[s_p^i \exp\!\left(-(a_p^i + \lambda b_p^i)\right)\right]} \;\; \forall p \qquad (13)$$

Notice that each closed-form update in (13) is within the simplex. We give the pseudo-code of the proposed fair clustering in Algorithm 1. The algorithm can be used with any specific clustering objective, e.g., K-means or Ncut, among others, by providing the corresponding $a_p^i$. The algorithm consists of an inner and an outer loop. The inner iterations update $s_p^{i+1}$ using (13) until $A_i(S)$ does not change, with the clustering term $a_p^i$ fixed from the outer loop. The outer iteration re-computes $a_p^i$ from the updated $s_p^{i+1}$. The time complexity of each inner iteration is O(NKJ). Also, the updates are independent for each data point p and, thus, can be efficiently computed in parallel. In the outer iteration, the time complexity of updating $a_p^i$ depends on the chosen clustering objective. For instance, for K-means, it is O(NKM), and, for Ncut, it is $O(N^2 K)$ for a full affinity matrix W, or much less for a sparse affinity matrix. Note that $a_p^i$ can be computed efficiently in parallel for all the clusters.
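A minimal vectorized sketch (ours) of the closed-form inner update in Eq. (13), where A and B stack the potentials $a_p^i$ and $b_p^i$ for all points:

```python
import numpy as np

def inner_update(S, A, B, lam):
    """One inner iteration of Eq. (13): an independent, simplex-preserving
    update of every row s_p, given unary potentials A (clustering) and
    B (fairness bound), both of shape (N, K)."""
    logits = -(A + lam * B)
    # subtract the row-wise max for numerical stability (cancels in the ratio)
    logits -= logits.max(axis=1, keepdims=True)
    unnormalized = S * np.exp(logits)
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)
```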
Convergence and monotonicity guarantees: Our variational model belongs to the family of MM procedures, whose theoretical guarantees are well studied in the literature, e.g., (Vaida 2005). In fact, the MM principle can be viewed as a generalization of the well-known expectation-maximization (EM). Therefore, in general, MM algorithms inherit the monotonicity and convergence guarantees of EM algorithms, as detailed in the theoretical discussion in (Vaida 2005). Theorem 3 in (Vaida 2005) states a condition for the convergence of the general MM procedure to a local minimum: the auxiliary function must have a unique global minimum, which should be obtained at each iteration when solving (7). This condition is important to guarantee, for instance, the monotonicity in (8). Our formulation satisfies this condition. In our case, the auxiliary function in (11) is strictly convex, as it is the sum of linear terms and a strictly convex term (the negative entropy), and is optimized under affine simplex constraints. Therefore, at each iteration, the closed-form solutions we obtained in (13) correspond to the unique global minimum of the auxiliary function $A_i(S)$ in (11). Our plots in Fig. 1 confirm the convergence and monotonicity of our general MM procedure for several fair-clustering objectives.

Figure 1: The convergence of the proposed bound optimizers for minimizing several fair-clustering objectives in (4): Fair K-means, Fair Ncut and Fair Kmedians. The plots are based on the Synthetic dataset.

Exploring different trade-off levels via multiplier λ: Our variational multi-term formulation enables exploring several levels of trade-off between the clustering and fairness objectives via the multiplier parameter λ, unlike the existing fair-clustering methods. In practice, we run our algorithm in parallel for several values of λ and choose the smallest value of λ that satisfies a pre-defined level of fairness error, i.e., $\sum_k D_{KL}(U \| P_k) \leq \epsilon$. This is conceptually similar to standard penalty and augmented-Lagrangian approaches in constrained optimization, where the weights of the penalties² are gradually increased until reaching a certain pre-defined precision (or duality gap) for the constraints; see Chapter 17.1 in (Nocedal and Wright 2006). The difference here is that we run independently for each λ, which can be implemented in parallel. As illustrated by the plots in Fig. 2, when λ increases, the fairness error decreases and the clustering objective increases, which is intuitive. As discussed in more detail below (Tables 2, 3 and 4), our variational formulation can achieve small fairness errors (competitive with the existing state-of-the-art fair-clustering methods), but with much better clustering objectives, consistently across all the data sets.

²In standard constrained optimization, penalties typically take a quadratic form, unlike our method, which is based on a KL-divergence penalty.

Figure 2: Clustering/Fairness objectives vs. λ.

Algorithm 1: Proposed Fair-clustering
  Input: X, initial seeds, λ, U, {V_j}, j = 1, ..., J
  Output: Clustering labels in {1, ..., K}^N
  Initialize labels from the initial seeds; initialize S from the labels; set i = 1.
  repeat
    Compute a_p^i from S (see Table 1).
    Initialize s_p^i = exp(-a_p^i) / 1^t exp(-a_p^i).
    repeat
      Compute s_p^{i+1} using (13).
      s_p^i ← s_p^{i+1}; S = [s_p^i], ∀p.
    until A_i(S) in (11) does not change
    i = i + 1.
  until E(S) in (4) does not change
  l_p = arg max_k s_{p,k}, ∀p; labels = {l_p}, p = 1, ..., N.

Experiments

In this section, we present comprehensive empirical evaluations of the proposed fair-clustering algorithm, along with comparisons with state-of-the-art fair-clustering techniques. We choose three well-known clustering objectives: K-means, K-medians and Normalized Cut (Ncut), and integrate our fairness-penalty bound with the corresponding clustering bounds $a_p$ (see Table 1). We refer to our bound-optimization versions as Fair K-means, Fair Kmedians and Fair Ncut³. Note that our formulation can be used with other clustering objectives (if a bound can be derived for the objective). We investigate the effect of fairness on the original discrete (i.e., w.r.t. binary assignment variables) clustering objectives, and compare with the existing methods. We evaluate the results in terms of the balance of each cluster $S_k$ in (1), and define the overall balance of the clustering as balance = $\min_k \mathrm{balance}(S_k)$. We further propose to evaluate the fairness error, which is the KL divergence $\sum_k D_{KL}(U \| P_k)$ in (2). This KL measure equals zero when the proportions of the demographic groups within all the output clusters match the target distribution.
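The λ-selection strategy just described, together with the fairness-error measure, can be sketched as follows (our illustration; `solver` is a placeholder for a fair-clustering routine returning hard labels, not a function from the paper's code):

```python
import numpy as np

def fairness_error(labels, V, U, K):
    """Discrete fairness error sum_k KL(U || P_k) computed from hard labels.
    Assumes every demographic group appears in every cluster (as in Proposition 1)."""
    S_hard = np.eye(K)[labels]                          # (N, K) one-hot assignments
    P = (V @ S_hard) / S_hard.sum(axis=0)[None, :]      # P(j|k)
    return float(np.sum(U[:, None] * (np.log(U[:, None]) - np.log(P))))

def select_lambda(X, V, U, K, lambdas, eps, solver):
    """Run `solver(X, V, U, K, lam) -> labels` for each lambda (these runs are
    independent and could be parallelized) and keep the smallest lambda whose
    solution meets the fairness tolerance eps."""
    for lam in sorted(lambdas):
        labels = solver(X, V, U, K, lam)
        if fairness_error(labels, V, U, K) <= eps:
            return lam, labels
    return None, None   # no lambda in the sweep met the tolerance
```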
³Code is available at: https://github.com/imtiazziko/Variational-Fair-Clustering

For Ncut, we use a 20-nearest-neighbor affinity matrix W: w(x_p, x_q) = 1 if data point x_q is within the 20 nearest neighbors of x_p, and 0 otherwise. In all the experiments, we fixed L = 2 and found that this does not increase the objective (see the detailed explanation in the supplemental material). We standardize each dataset by making each feature attribute have zero mean and unit variance. We then performed L2-normalization of the features, and used the standard K-means++ (Arthur and Vassilvitskii 2007) to generate the initial partitions for all the models.

| Datasets | Kmedians objective (Backurs et al. 2019) | Kmedians objective (ours) | Fairness error / balance (Backurs et al. 2019) | Fairness error / balance (ours) |
|---|---|---|---|---|
| Synthetic (N = 400, J = 2, λ = 10) | 299.859 | 292.4 | 0.00 / 1.00 | 0.00 / 1.00 |
| Synthetic-unequal (N = 400, J = 2, λ = 10) | 185.47 | 174.81 | 0.77 / 0.21 | 0.00 / 0.33 |
| Adult (N = 32,561, J = 2, λ = 9000) | 19330.93 | 18467.75 | 0.27 / 0.31 | 0.01 / 0.43 |
| Bank (N = 41,108, J = 3, λ = 9000) | N/A | 19527.08 | N/A | 0.02 / 0.18 |
| Census II (N = 2,458,285, J = 2, λ = 500000) | 2385997.92 | 1754109.46 | 0.41 / 0.38 | 0.02 / 0.78 |

Table 2: Comparison of the proposed Fair Kmedians to (Backurs et al. 2019).

| Datasets | K-means objective (Bera et al. 2019) | K-means objective (ours) | Fairness error / balance (Bera et al. 2019) | Fairness error / balance (ours) |
|---|---|---|---|---|
| Synthetic (N = 400, J = 2, λ = 10) | 758.43 | 207.80 | 0.00 / 1.00 | 0.00 / 1.00 |
| Synthetic-unequal (N = 400, J = 2, λ = 10) | 180.00 | 159.75 | 0.00 / 0.33 | 0.00 / 0.33 |
| Adult (N = 32,561, J = 2, λ = 9000) | 10913.84 | 9984.01 | 0.018 / 0.41 | 0.018 / 0.41 |
| Bank (N = 41,108, J = 3, λ = 6000) | 11331.51 | 9392.20 | 0.03 / 0.16 | 0.05 / 0.17 |
| Census II (N = 2,458,285, J = 2, λ = 500000) | 1355457.02 | 1018996.53 | 0.07 / 0.77 | 0.02 / 0.78 |

Table 3: Comparison of the proposed Fair K-means to (Bera et al. 2019).

Synthetic datasets: We created two types of synthetic datasets according to the proportions of the demographics, each having two clusters and a total of 400 data points with 2D features (figures in the supplemental material). The Synthetic dataset contains two perfectly balanced demographic groups, each having an equal number of 200 points. For this data set, we imposed target proportions U = [0.5, 0.5]. To experiment with our fairness penalty under unequal proportions, we also used the Synthetic-unequal dataset, with 300 and 100 points in the two demographic groups. In this case, we imposed target proportions U = [0.75, 0.25].
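The imposed target proportions above (and those used for the real datasets below) simply mirror each dataset's overall demographic make-up. A minimal sketch (ours, not part of the paper's pipeline) of deriving such a U from the group-indicator vectors:

```python
import numpy as np

def target_proportions(V):
    """U_j = (1/N) * sum_p v_{j,p}: the overall proportion of each demographic group.
    V: (J, N) binary group-indicator matrix. For the Synthetic dataset (200/200 points)
    this gives [0.5, 0.5]; for Synthetic-unequal (300/100) it gives [0.75, 0.25]."""
    return V.sum(axis=1) / V.shape[1]
```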
Real datasets: We use three datasets from the UCI machine learning repository: one large-scale data set whose demographics are balanced (Census), along with two other data sets with various demographic proportions.

Bank⁴ contains 41,188 records of direct marketing campaigns of a Portuguese banking institution, one per client contacted (Moro, Cortez, and Rita 2014). Note that the previous fair-clustering methods (Bera et al. 2019; Backurs et al. 2019) used a much smaller version of the Bank dataset, with only 4,520 records, J = 2 and 3 attributes. Instead, we use the marital status as the sensitive attribute, which contains three groups (J = 3): single, married and divorced; we removed the records with unknown marital status. Thus, we have 41,108 records in total. We chose 6 numeric attributes (age, duration, 3-month euribor rate, number of employees, consumer price index and number of contacts performed during the campaign) as features. We set the number of clusters to K = 10, and impose the target proportions of the three groups, U = [0.28, 0.61, 0.11], within each cluster.
⁴https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

Adult⁵ is a US census record data set from 1994. The dataset contains 32,561 records. We used gender as the sensitive attribute, which contains 10,771 females and 21,790 males. We chose the 5 numeric attributes as features, set the number of clusters to K = 10, and impose proportions U = [0.33, 0.67] within each cluster.
⁵https://archive.ics.uci.edu/ml/datasets/adult

Census⁶ is a large-scale data set corresponding to US census record data from 1990. The dataset contains 2,458,285 records. We used gender as the sensitive attribute, which contains 1,191,601 females and 1,266,684 males. We chose the 25 numeric attributes as features, similarly to (Backurs et al. 2019). We set the number of clusters to K = 20, and imposed proportions U = [0.48, 0.52] within each cluster.
⁶https://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990)
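Before turning to the results, here is a rough sketch of the preprocessing, affinity construction and seeding steps described at the start of this section (our illustration; the use of scikit-learn utilities is an assumption about tooling, not the authors' exact code):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import KMeans

def preprocess(X):
    """Standardize each feature to zero mean / unit variance, then L2-normalize rows."""
    X = StandardScaler().fit_transform(X)
    return normalize(X, norm='l2')

def knn_affinity(X, n_neighbors=20):
    """Binary 20-nearest-neighbor affinity W for Ncut: w(x_p, x_q) = 1 if x_q is among
    the 20 nearest neighbors of x_p, 0 otherwise (kept sparse)."""
    return kneighbors_graph(X, n_neighbors=n_neighbors, mode='connectivity')

def init_labels(X, K, seed=0):
    """K-means++ seeding to generate the initial partition for all models."""
    return KMeans(n_clusters=K, init='k-means++', n_init=1, random_state=seed).fit_predict(X)
```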
In this section, we discuss the results of the different experiments to evaluate the proposed general variational framework for Fair K-means, Fair Kmedians and Fair Ncut. We further report comparisons with (Bera et al. 2019), (Backurs et al. 2019) and (Kleindessner et al. 2019) in terms of discrete fairness measures and clustering objectives.

Trade-off between clustering and fairness objectives: We assess the effect of imposing fairness constraints on the original clustering objectives. In each plot in Fig. 2, the blue curve depicts the discrete-valued clustering objective F(S) (K-means or Ncut) obtained at convergence as a function of λ. The fairness error is depicted in red. Observe that, when multiplier λ increases (starting from a certain value), the discrete clustering objective increases while the fairness error decreases, which is intuitive. Also, the fairness error approaches 0 when λ → +∞, and both the clustering and fairness objectives tend to reach a plateau starting from a certain value of λ. The scalability of our model is highly relevant because it enables us to explore several solutions, each corresponding to a different value of multiplier λ, and to choose the smallest λ (i.e., the best clustering objective) that satisfies a pre-defined fairness level $\sum_k D_{KL}(U \| P_k) \leq \epsilon$. As detailed below, this flexibility enabled us to obtain better solutions, in terms of fairness and clustering objectives, than several recent fair-clustering methods. Low fairness errors are typically achieved with large values of λ. This is due to the fact that the scale of the fairness penalty could be much smaller than that of the clustering objectives. Notice that, for relatively small values of λ, the K-means objective (blue curve) for the Adult dataset has an oscillating behaviour. This might be due to the fact that, for small λ, the K-means objective dominates the KL fairness term. However, after a certain value of λ (λ ≈ 4000), the curves become smooth, with a predictable behaviour (i.e., the fairness term decreases and the clustering term increases). When the clustering objective dominates, the oscillating behaviour might be due to the local minima of bound optimization for the K-means term. We hypothesize that, with higher values of λ, the KL fairness term convexifies the function and facilitates optimization. With smaller values of λ, the K-means term dominates, with the possibility of being stuck in local minima (K-means is well known to be sensitive to the initial conditions).

Clustering cost with respect to K: Fig. 3 depicts how the clustering objectives behave w.r.t. the number of clusters K, with and without the fairness constraints. We plot the discrete clustering objective vs. K for K-means, K-medians and Ncut, using the Bank dataset, with each plot corresponding to a fixed multiplier λ. In both cases (i.e., with and without the fairness constraints), the obtained clustering objectives decrease with K, with the gap between the clustering objective obtained under fairness constraints and the vanilla clustering increasing with K. These experimental observations are consistent with the observations in (Bera et al. 2019).

Figure 3: Effect of K on the clustering objectives for vanilla clustering and our variational fair-clustering methods, including the K-means, K-medians and Ncut objectives. The results are shown for the Bank dataset. Note that, for each plot, the value of multiplier λ is fixed.

| Datasets | Ncut objective (Kleindessner et al. 2019) | Ncut objective (ours) | Fairness error / balance (Kleindessner et al. 2019) | Fairness error / balance (ours) |
|---|---|---|---|---|
| Synthetic (N = 400, J = 2, λ = 10) | 0.0 | 0.0 | 0.00 / 1.00 | 0.00 / 1.00 |
| Synthetic-unequal (N = 400, J = 2, λ = 10) | 0.03 | 0.06 | 0.00 / 0.33 | 0.00 / 0.33 |
| Adult (N = 32,561, J = 2, λ = 10) | 0.47 | 0.74 | 0.06 / 0.32 | 0.08 / 0.30 |
| Bank (N = 41,108, J = 3, λ = 40) | N/A | 0.58 | N/A | 0.39 / 0.14 |
| Census II (N = 2,458,285, J = 2, λ = 100) | N/A | 0.52 | N/A | 0.41 / 0.43 |

Table 4: Comparison of the proposed Fair Ncut to (Kleindessner et al. 2019).

Comparisons to state-of-the-art methods: Our algorithm is flexible as it can be used in conjunction with different well-known clustering objectives. This enabled us to compare our Fair Kmedians, Fair K-means and Fair Ncut versions to (Backurs et al. 2019), (Bera et al. 2019) and (Kleindessner et al. 2019), respectively. Tables 2, 3 and 4 report comparisons in terms of the original clustering objectives, achieved minimum balances and fairness errors, for Fair Kmedians, Fair K-means and Fair Ncut. For our model, we run the algorithm for several values of λ in ascending order, and choose the smallest λ that satisfies a pre-defined level of fairness error. This flexibility and scalability enabled us to obtain significantly better clustering objectives and fairness/minimum-balance measures in comparison to (Backurs et al. 2019); see Table 2. It is worth noting that, for the Bank dataset, we were unable to run (Backurs et al. 2019) as the number of demographic groups is 3 (i.e., J > 2). In comparison to (Bera et al. 2019), our variational method achieves significantly better K-means clustering objectives, with approximately the same fairness levels. Note that we can obtain better fairness with larger λ values. These results highlight the benefits of our proposed variational formulation, which provides control over the trade-off between the fairness level and the clustering objective. In the case of Fair Ncut, (Kleindessner et al. 2019) achieved slightly better Ncut objectives than our model, while achieving similar fairness levels. However, we were unable to run the spectral solution of (Kleindessner et al. 2019) for the large-scale Census II data set, and for Bank, due to its computational and memory load (as it requires computing the eigenvalues of the square affinity matrix). Our algorithm scales up to more than two demographic groups, i.e., when J > 2 (e.g., Bank), unlike many of the existing approaches. Furthermore, for Ncut graph clustering, our bound optimizer can deal with large-scale data sets, unlike (Kleindessner et al. 2019), which requires the eigendecomposition of large affinity matrices.
Finally, the parallel structure of our algorithm within each iteration (i.e., independent updates for each assignment variable) enables exploring different values of λ, thereby choosing the best trade-off between the clustering objective and the fairness error.

Broader Impact

This paper deals with ensuring fairness criteria in clustering decisions, so as to avoid unfair treatment of minority groups pertaining to a sensitive attribute such as race, gender, etc. The paper is an endeavor to present a flexible mechanism to control the required level of fairness, while ensuring clustering quality at the same time.

References

Arthur, D.; and Vassilvitskii, S. 2007. k-means++: The advantages of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms, 1027-1035. Society for Industrial and Applied Mathematics.
Backurs, A.; Indyk, P.; Onak, K.; Schieber, B.; Vakilian, A.; and Wagner, T. 2019. Scalable fair clustering. In International Conference on Machine Learning (ICML), 405-413.
Bera, S.; Chakrabarty, D.; Flores, N.; and Negahbani, M. 2019. Fair algorithms for clustering. In Advances in Neural Information Processing Systems, 4955-4966.
Buolamwini, J.; and Gebru, T. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, 77-91.
Celis, L. E.; Keswani, V.; Straszak, D.; Deshpande, A.; Kathuria, T.; and Vishnoi, N. K. 2018. Fair and Diverse DPP-Based Data Summarization. In International Conference on Machine Learning (ICML), 715-724.
Chierichetti, F.; Kumar, R.; Lattanzi, S.; and Vassilvitskii, S. 2017. Fair Clustering Through Fairlets. In Neural Information Processing Systems (NeurIPS), 5036-5044.
Csiszar, I.; and Körner, J. 2011. Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press.
Donini, M.; Oneto, L.; Ben-David, S.; Shawe-Taylor, J.; and Pontil, M. 2018. Empirical Risk Minimization Under Fairness Constraints. In Neural Information Processing Systems (NeurIPS), 2796-2806.
Hardt, M.; Price, E.; and Srebro, N. 2016. Equality of Opportunity in Supervised Learning. In Neural Information Processing Systems (NeurIPS), 3315-3323.
Huang, L.; Jiang, S.; and Vishnoi, N. 2019. Coresets for clustering with fairness constraints. In Advances in Neural Information Processing Systems, 7587-7598.
Kleinberg, J.; Lakkaraju, H.; Leskovec, J.; Ludwig, J.; and Mullainathan, S. 2017. Human decisions and machine predictions. The Quarterly Journal of Economics 133(1): 237-293.
Kleindessner, M.; Samadi, S.; Awasthi, P.; and Morgenstern, J. 2019. Guarantees for Spectral Clustering with Fairness Constraints. In International Conference on Machine Learning (ICML), 3458-3467.
Moro, S.; Cortez, P.; and Rita, P. 2014. A data-driven approach to predict the success of bank telemarketing. Decision Support Systems 62: 22-31.
Narasimhan, M.; and Bilmes, J. 2005. A Submodular-supermodular Procedure with Applications to Discriminative Structure Learning. In Conference on Uncertainty in Artificial Intelligence (UAI), 404-412. ISBN 0-9749039-1-4.
Nocedal, J.; and Wright, S. 2006. Numerical Optimization. Springer.
Rösner, C.; and Schmidt, M. 2018. Privacy preserving clustering with constraints. In ICALP.
Samadi, S.; Tantipongpipat, U. T.; Morgenstern, J. H.; Singh, M.; and Vempala, S. 2018. The Price of Fair PCA: One Extra Dimension. In Neural Information Processing Systems (NeurIPS), 10999-11010.
Schmidt, M.; Schwiegelshohn, C.; and Sohler, C. 2018. Fair Coresets and Streaming Algorithms for Fair k-Means Clustering. arXiv preprint abs/1812.10854.
Shaham, U.; Stanton, K.; Li, H.; Basri, R.; Nadler, B.; and Kluger, Y. 2018. SpectralNet: Spectral Clustering using Deep Neural Networks. In International Conference on Learning Representations (ICLR).
Tang, M.; Marin, D.; Ayed, I. B.; and Boykov, Y. 2019. Kernel Cuts: Kernel and Spectral Clustering Meet Regularization. International Journal of Computer Vision 127: 477-511.
Tian, F.; Gao, B.; Cui, Q.; Chen, E.; and Liu, T.-Y. 2014. Learning deep representations for graph clustering. In AAAI Conference on Artificial Intelligence, 1293-1299.
Vaida, F. 2005. Parameter convergence for EM and MM algorithms. Statistica Sinica 15: 831-840.
Vladymyrov, M.; and Carreira-Perpiñán, M. 2016. The Variational Nyström method for large-scale spectral problems. In International Conference on Machine Learning (ICML), 211-220.
Von Luxburg, U. 2007. A tutorial on spectral clustering. Statistics and Computing 17(4): 395-416.
Yuan, J.; Yin, K.; Bai, Y.; Feng, X.; and Tai, X. 2017. Bregman-Proximal Augmented Lagrangian Approach to Multiphase Image Segmentation. In Scale Space and Variational Methods in Computer Vision (SSVM), 524-534.
Yuille, A. L.; and Rangarajan, A. 2001. The Concave-Convex Procedure (CCCP). In Neural Information Processing Systems (NIPS), 1033-1040.
Zafar, M. B.; Valera, I.; Gomez-Rodriguez, M.; and Gummadi, K. P. 2017. Fairness Constraints: Mechanisms for Fair Classification. In International Conference on Artificial Intelligence and Statistics (AISTATS), 962-970.
Zhang, Z.; Kwok, J. T.; and Yeung, D.-Y. 2007. Surrogate maximization/minimization algorithms and extensions. Machine Learning 69: 1-33.
Ziko, I.; Granger, E.; and Ayed, I. B. 2018. Scalable Laplacian K-modes. In Advances in Neural Information Processing Systems, 10041-10051.