# Optimal Transport of Classifiers to Fairness

Maarten Buyl, Ghent University, maarten.buyl@ugent.be
Tijl De Bie, Ghent University, tijl.debie@ugent.be

**Abstract.** In past work on fairness in machine learning, the focus has been on forcing the predictions of classifiers to have similar statistical properties for people of different demographics. To reduce the violation of these properties, fairness methods usually simply rescale the classifier scores, ignoring similarities and dissimilarities between members of different groups. Yet, we hypothesize that such information is relevant in quantifying the unfairness of a given classifier. To validate this hypothesis, we introduce Optimal Transport to Fairness (OTF), a method that quantifies the violation of fairness constraints as the smallest Optimal Transport cost between a probabilistic classifier and any score function that satisfies these constraints. For a flexible class of linear fairness constraints, we construct a practical way to compute OTF as a differentiable fairness regularizer that can be added to any standard classification setting. Experiments show that OTF can be used to achieve an improved trade-off between predictive power and fairness.

## 1 Introduction

Machine learning methods are increasingly being deployed for automated decision making due to the potential benefits in terms of efficiency and effectiveness. Yet, making automated decisions about people comes with substantial legal and ethical risks, as evidenced by more and more cases of undesirable algorithmic discrimination [27, 13, 10] with respect to the protected traits of individuals. Even decisions that appear neutral because they do not use such sensitive information may still disproportionately affect certain groups indirectly [38]. The research field of fairness in machine learning [30] therefore studies ways in which a model's discrimination with respect to a person's sensitive features can be reduced or removed. Many notions of fairness have been proposed for binary classification tasks [37]. A popular example is demographic parity, which is expressed as a constraint that enforces equality between the rates at which positive decisions are made for each protected group [6].

**Motivation.** For a given fairness notion, we would like to tune a model so that it satisfies the constraint that expresses the notion. To this end, it can be practical to quantify the violation of this constraint as a fairness regularization term that can be added to any probabilistic classifier [26]. Such an approach provides an incentive for the model to learn parameters that result in fair probability scores, e.g. by learning to ignore highly biased features. A straightforward fairness regularizer is the norm of the difference between the quantities that should be equal according to the fairness notion [41, 31]. However, such a quantification of unfairness may be limiting, as it only considers the statistical properties of the model's output, while ignoring the input features for which those scores were computed. We argue that, to quantify the unfairness of a model as a fairness regularizer, it can be beneficial to more strongly consider data points with similar features but significant unfair discrepancies in probability scores. To this end, we employ Optimal Transport (OT) [33] as a way to measure these discrepancies while taking feature similarity into account.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).
**Contributions.** We propose to quantify the unfairness of a probabilistic classifier as the Optimal Transport to Fairness (OTF) cost, which is defined as the smallest OT cost between the score function of the classifier and any fair score function over the same data. We make novel derivations to compute OTF as a differentiable fairness regularizer that can be added to the training loss of any probabilistic classifier and is efficiently computed for popular notions of fairness such as demographic parity and equalized odds. Our approach is capable of handling multiple sensitive variables, which can be categorical or continuous. In experiments, our method shows the benefit of increased flexibility over other OT methods. In some cases, it also achieves a trade-off between predictive power and fairness that is significantly more effective for equalized odds.

## 2 Background

### 2.1 Fair Classification

Let $\mathcal{Z} = \mathcal{X} \times \mathcal{S} \times \{0, 1\}$ denote the sample space, from which we draw samples $Z = (X, S, Y)$ and try to estimate the output label $Y \in \{0, 1\}$ from input features $X \in \mathcal{X}$, without discriminating with respect to sensitive features $S \in \mathcal{S}$. It is assumed that $\mathcal{X} \subseteq \mathbb{R}^{d_X}$ and $\mathcal{S} \subseteq \mathbb{R}^{d_S}$. We sample binary predictions $\hat{Y} \in \{0, 1\}$ from a probabilistic classifier $h : \mathcal{X} \to [0, 1]$ that assigns a score $h(X)$ to the belief that a sample with features $X$ belongs to the positive class. Our general goal in fair classification is to minimize a loss function $L_Y(h)$, in our case the cross-entropy between $h(X)$ and $Y$, while also scoring well on a fairness measurement of $h(X)$.

#### 2.1.1 Group Fairness

The fairness of classifiers is commonly enforced through some notion of independence from sensitive information $S \in \mathcal{S}$. In particular, group fairness is concerned with constraints over statistical measurements of this independence over all members of a protected group [30]. For now, assume that $S$ consists of a single, categorical feature with $d_S$ values. We denote $S \in \mathcal{S}$ as the one-hot encoding[^1] with values $S_k \in \{0, 1\}$. This encoding is indexed by values from $[d_S] = \{0, \dots, d_S - 1\}$.

[^1]: For example, if $S$ only contains two categories, then we denote $\mathcal{S} = \{[1, 0], [0, 1]\}$ with $|\mathcal{S}| = d_S = 2$.

Assuming that the predictions $\hat{Y} \mid X$ are randomly sampled from a probabilistic classifier $h(X)$, the traditional fairness notion of demographic parity (DP) is equivalent to enforcing zero covariance between $h(X)$ and each group in $S$, which we refer to as Probabilistic Demographic Parity (PDP):

$$\text{(PDP)} \quad \forall k \in [d_S] : \quad \mathbb{E}_Z[h(X) S_k] - \mathbb{E}_Z[h(X)]\, \mathbb{E}_Z[S_k] = 0. \tag{1}$$

**Remark 1.** If $\hat{Y}$ is not sampled from $h(X)$ but instead decided by a threshold (e.g. $\hat{Y} = 1$ if $h(X) > 0.5$), then Eq. (1) is a relaxation of the actual DP notion and may therefore admit unfair models [29]. However, such a threshold introduces a discontinuity with respect to the input, making it difficult to efficiently find fair classifiers in practice [36, 1]. Moreover, by considering the probabilities directly, we account for the uncertainty expressed by the probabilistic classifier.

The notation in Eq. (1) is easily extended to fairness notions that condition on other variables. For example, the equalized odds (EO) [21] fairness notion conditions DP on the actual label $Y$. We apply EO to probabilistic classifiers by conditioning the covariance constraint of PDP on the one-hot encoding of $Y$ and retrieve the following expression for Probabilistic Equalized Odds (PEO):

$$\text{(PEO)} \quad \forall l \in \{0, 1\},\ \forall k \in [d_S] : \quad \mathbb{E}_Z[h(X) S_k Y_l] - \mathbb{E}_Z[h(X) Y_l]\, \mathbb{E}_Z[S_k Y_l] = 0. \tag{2}$$

#### 2.1.2 Linear Fairness

The PDP and PEO covariance constraints in Eq. (1) and Eq. (2) are practical, because they can be written as linear sums over the probabilities given by the classifier $h$. As was done in past work [1], we are generally interested in all fairness notions that can be enforced through such linear constraints.
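To make the covariance constraints of Eqs. (1)–(2) concrete, the following minimal NumPy sketch estimates them on a finite sample. The function names and array layout are our own illustration, not code from the paper.

```python
import numpy as np

def pdp_violations(scores, S_onehot):
    """Empirical PDP covariances E[h(X) S_k] - E[h(X)] E[S_k] for each group k (Eq. 1).

    scores: (n,) array of probability scores h(x_i); S_onehot: (n, d_S) one-hot groups.
    A perfectly PDP-fair score function yields the zero vector.
    """
    n = len(scores)
    return scores @ S_onehot / n - scores.mean() * S_onehot.mean(axis=0)

def peo_violations(scores, S_onehot, Y_onehot):
    """Empirical PEO covariances: the PDP covariances conditioned on each one-hot label Y_l (Eq. 2)."""
    n = len(scores)
    per_label = []
    for l in range(Y_onehot.shape[1]):
        y = Y_onehot[:, l]
        per_label.append((scores * y) @ S_onehot / n
                         - (scores * y).mean() * (S_onehot * y[:, None]).mean(axis=0))
    return np.concatenate(per_label)
```

Both functions are linear in the score vector, which is exactly the property exploited by the linear fairness notions below.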
Table 1: Mapping of binary fairness notions onto the $g$ functions of their probabilistic version, so that they can be written as linear fairness notions.

| Notion | Predictions constraint | Linear fairness notion |
|---|---|---|
| (P)DP | $P(\hat{Y} = 1 \mid S) = P(\hat{Y} = 1)$ | $g_k = \dfrac{S_k}{\mathbb{E}_Z[S_k]} - 1$ |
| (P)EO | $P(\hat{Y} = 1 \mid S, Y) = P(\hat{Y} = 1 \mid Y)$ | $g_{k + l \cdot d_S} = Y_l \left( \dfrac{S_k}{\mathbb{E}_Z[S_k Y_l]} - 1 \right)$ |

**Definition 1 (Linear fairness notion).** A notion of fairness is a linear fairness notion when the set $\mathcal{F}$ of all score functions $f : \mathcal{X} \to [0, 1]$ that satisfy it is given by

$$\mathcal{F} \triangleq \{ f : \mathcal{X} \to [0, 1] : \mathbb{E}_Z[g(Z) f(X)] = \mathbf{0}_{d_F} \} \tag{3}$$

with $\mathbf{0}_{d_F}$ a vector of $d_F$ zeros and the vector-valued function $g : \mathcal{Z} \to \mathbb{R}^{d_F}$ constant w.r.t. $f(X)$.

PDP and PEO are clearly linear fairness notions, with their $g$ functions listed in Table 1. Linear constraints are practical because the set of fair score functions $\mathcal{F}$ is convex. Moreover, multiple linear fairness notions can be combined into a new linear fairness notion by concatenating their constraint vectors. Sensitive information that is inherently continuous, like age, can be considered as a single dimension in $S$, thereby allowing for a mix of categorical and continuous sensitive attributes. However, under linear fairness, a non-linear dependence between a score function $f(X)$ and continuous sensitive attributes $S$ can remain undetected. Also, some well-known fairness notions are not linear. For example, notions related to sufficiency involve conditioning on $f(X)$, meaning that $f(X)$ also shows up in the $g$ function. To this end, approaches have been proposed [9] that can approximate classifiers with non-linear fairness through a reduction to problems with linear constraints.

### 2.2 Optimal Transport of Score Functions

Optimal Transport (OT) [32] theory considers the problem of moving a mass from one measure to another at the smallest possible total cost. Here, we let every score function $f$ correspond with the measure $\theta_f$ defined over the input space $\mathcal{X}$ endowed with the Borel $\sigma$-algebra: $\theta_f \triangleq \sum_{x \in D_X} f(x) \delta_x$, with $\delta_x$ the Dirac measure and $D_X$ the input features of samples in a collection $D$. Note that the input space measure $\theta_f$ is not normalized, though this is not necessary to apply OT theory. In what follows, we implicitly consider the score functions $h$ and $f$ as their corresponding input space measures $\theta_h$ and $\theta_f$ when used in the OT problem. See Appendix C.1 for further clarification.

For a collection $D$ of $n$ samples (e.g. a full dataset or only a batch from it), let $\mathbf{h}$ and $\mathbf{f}$ denote the $n$-dimensional vectors of score function values for all data points, i.e. $h_i = h(x_i)$ and $f_i = f(x_i)$. Furthermore, for a non-negative transport cost function $c$ defined over $\mathcal{X} \times \mathcal{X}$, let $C \in \mathbb{R}^{n \times n}_+$ represent the matrix of cost terms, i.e. $C_{ij} = c(x_i, x_j)$. Similarly, with $\pi(x_i, x_j)$ the coupling that reflects how much score mass was transported from $x_i$ to $x_j$, define the matrix $P \in \mathbb{R}^{n \times n}_+$ with $P_{ij} = \pi(x_i, x_j)$. The OT cost is then simplified to

$$\mathrm{OT}(\mathbf{h}, \mathbf{f}) = \min_{P \in \Pi(\mathbf{h}, \mathbf{f})} \langle C, P \rangle \tag{4}$$

with $\Pi(\mathbf{h}, \mathbf{f}) = \{ P \in \mathbb{R}^{n \times n}_+ : P \mathbf{1}_n = \mathbf{h},\ P^\top \mathbf{1}_n = \mathbf{f} \}$, where $\mathbf{1}_n$ is the $n$-dimensional vector of ones and $\langle C, P \rangle = \sum_{ij} C_{ij} P_{ij}$.
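As an illustration of Def. 1 and Table 1, the sketch below evaluates the $g$ functions on a sample and stacks them into a constraints matrix with one row per constraint and one column per data point (the matrix used as $G$ in Sec. 3.1). This is a hypothetical helper following our reading of Table 1, not the authors' implementation.

```python
import numpy as np

def pdp_g_matrix(S_onehot):
    """Rows G[k, j] = g_k(z_j) with g_k = S_k / E_Z[S_k] - 1, as in Table 1 for (P)DP."""
    return (S_onehot / S_onehot.mean(axis=0)).T - 1.0               # shape (d_S, n)

def peo_g_matrix(S_onehot, Y_onehot):
    """Rows g_{k + l*d_S} = Y_l * (S_k / E_Z[S_k Y_l] - 1), as in Table 1 for (P)EO."""
    blocks = []
    for l in range(Y_onehot.shape[1]):
        y = Y_onehot[:, l]                                           # (n,)
        ratio = S_onehot / (S_onehot * y[:, None]).mean(axis=0)      # S_k / E[S_k Y_l]
        blocks.append((y[:, None] * (ratio - 1.0)).T)                # (d_S, n)
    return np.vstack(blocks)                                         # (d_F, n), d_F = 2 * d_S

# A score vector f (with f_j = f(x_j)) satisfies the corresponding linear fairness
# notion of Def. 1 on this sample when G @ f / n is (approximately) the zero vector.
```

Concatenating the outputs of several such functions along the rows corresponds to combining multiple linear fairness notions, as described above.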
## 3 Optimal Transport to Fairness

Assume that a set $\mathcal{F}$ is available that denotes all score functions which satisfy the required notion of fairness. We would like to quantify the unfairness of a classifier's score function $h$ as the amount of work minimally required to make it a member of this set, by measuring how far $h$ is from its fair projection onto $\mathcal{F}$, i.e. the $f \in \mathcal{F}$ that is closest to $h$. We measure this closeness as the OT cost between $h$ and $f$, as we then assign a higher unfairness cost to $h$ if scores need to be transported between highly dissimilar individuals in order to reach a fair function $f \in \mathcal{F}$. Thus, we propose to quantify unfairness as the Optimal Transport to Fairness (OTF) cost, i.e. the cost of the OT-based projection of $h$ onto $\mathcal{F}$:

$$\mathrm{OTF}(h) = \min_{f \in \mathcal{F}} \mathrm{OT}(h, f). \tag{5}$$

In what follows, we expand on the OTF method by constructing it in three steps.

1. In Sec. 3.1, we directly express the $\mathrm{OTF}(h)$ objective as a linear programming problem by making the assumption that $\mathcal{F}$ is defined by a linear fairness notion as in Def. 1.
2. In Sec. 3.2, we add entropic smoothing to $\mathrm{OTF}(h)$, thereby making the resulting $\mathrm{OTF}_\epsilon(h)$ cost a differentiable fairness regularizer that can be computed efficiently.
3. In Sec. 3.3, we address the fact that, due to this smoothing, $\mathrm{OTF}_\epsilon(h)$ does not necessarily equal zero for a fair $h$. To this end, we subtract the adjustment term $\mathrm{OTFR}_\epsilon(h)$, resulting in the adjusted $\mathrm{OTF}^0_\epsilon(h)$ cost.

Finally, in Sec. 3.4, we show how $\mathrm{OTF}^0_\epsilon(h)$ can be minimized w.r.t. (the parameters of) $h$. For all derivations and proofs, we refer to Appendix A.

### 3.1 Optimal Transport to Linear Fairness

To compute the minimization in Eq. (5), first note that in the $\mathrm{OT}(\mathbf{h}, \mathbf{f})$ cost in Eq. (4), $\mathbf{f}$ only shows up in the constraint $P^\top \mathbf{1}_n = \mathbf{f}$ on the column marginals of the coupling matrix $P$. It therefore suffices to weaken this constraint to $P^\top \mathbf{1}_n \in \mathcal{F}$. Next, recall that a linear fairness notion is enforced through a vector of constraints on the expectation of $g(Z) f(X)$. For a collection $D$ of $n$ samples, let $G_{cj} = g_c(z_j)$, i.e. $G \in \mathbb{R}^{d_F \times n}$ is the constraints matrix with a row for every constraint and a column for every data point. It defines the linear fairness notion for $D$.

**Definition 2 (OTF).** For a linear fairness notion expressed through constraints matrix $G$ for $n$ samples and with non-negative cost matrix $C$, the Optimal Transport to Fairness cost for score function $h : \mathcal{X} \to [0, 1]$ is computed as

$$\mathrm{OTF}(h) = \min_{P \in \Pi_\mathcal{F}(\mathbf{h})} \langle C, P \rangle \tag{6}$$

with $\Pi_\mathcal{F}(\mathbf{h}) = \{ P \in \mathbb{R}^{n \times n}_+ : P \mathbf{1}_n = \mathbf{h},\ G P^\top \mathbf{1}_n = \mathbf{0}_{d_F} \}$.

The optimal coupling computed for $\mathrm{OTF}(h)$ implicitly transports the scores of $h$ to the fair vector $\mathbf{f}$ given by the coupling's column marginals (i.e. $\mathbf{f} = P^\top \mathbf{1}_n$). Note that for any coupling matrix that satisfies the first constraint ($P \mathbf{1}_n = \mathbf{h}$), we have that $\mathbf{h}^\top \mathbf{1}_n = \mathbf{1}_n^\top P^\top \mathbf{1}_n = \mathbf{f}^\top \mathbf{1}_n$, meaning that $\mathbf{h}$ and this implicitly found vector $\mathbf{f}$ sum to the same value. Scores of $h$ are therefore only transported to a fair score function, and no extra scores are created or destroyed.

For $P_{ij} \geq 0$, the optimization problem in Eq. (6) can be solved through linear programming. However, its solution is not differentiable with respect to the score function $h$ [12, 17]. It is thus impractical as a fairness regularization term.
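For intuition, the unsmoothed problem of Def. 2 can be solved directly as a linear program when $n$ is small. The sketch below does so with `scipy.optimize.linprog`; it is purely illustrative, since, as noted above, the LP solution is not differentiable and is not what is used during training.

```python
import numpy as np
from scipy.optimize import linprog

def otf_exact(h, C, G):
    """Solve the unsmoothed OTF linear program of Def. 2 (only sensible for small n).

    Decision variables: the n*n entries of P in row-major order. Constraints:
      P @ 1 = h            (n row-marginal equalities)
      G @ (P.T @ 1) = 0    (d_F fairness equalities)
      P >= 0
    """
    n, d_F = len(h), G.shape[0]
    # Row-marginal constraints: entries i*n .. (i+1)*n - 1 of vec(P) sum to h[i].
    A_rows = np.zeros((n, n * n))
    for i in range(n):
        A_rows[i, i * n:(i + 1) * n] = 1.0
    # Column sums: stacking n identity blocks recovers P.T @ 1 from vec(P).
    col_sums = np.tile(np.eye(n), (1, n))          # shape (n, n*n)
    A_eq = np.vstack([A_rows, G @ col_sums])
    b_eq = np.concatenate([h, np.zeros(d_F)])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(n, n)            # OTF(h) and an optimal coupling P
```

The $O(n^2)$ decision variables make this formulation a useful sanity check rather than a training-time tool, which motivates the smoothed dual approach that follows.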
### 3.2 Entropic Smoothing

A well-known trick to make OT costs differentiable is to apply entropic regularization or smoothing [12, 33], which is done by maximizing the entropy of the coupling while optimizing the OT cost. We apply the same principle to make OTF differentiable.

**Definition 3 (OTF$_\epsilon$).** For a linear fairness notion expressed through constraints matrix $G$ for $n$ samples and with non-negative cost matrix $C$, the smooth OTF cost with smoothing strength $\epsilon > 0$ for score function $h : \mathcal{X} \to (0, 1]$ is computed as

$$\mathrm{OTF}_\epsilon(h) = \min_{P \in \Pi_\mathcal{F}(\mathbf{h})} \langle C, P \rangle - \epsilon H(P) \tag{7}$$

with $H(P) = -\sum_{ij} P_{ij} (\log P_{ij} - 1)$ the entropy of coupling $P$ and $\epsilon > 0$ a hyperparameter that regulates the strength of entropic smoothing.

The use of an entropy term in Eq. (7) requires some justification, since we do not assume that $\mathbf{h}$ represents a normalized probability distribution (i.e. sums to one). Therefore, $P$ also does not represent a normalized joint distribution. However, the $H(P)$ term still provides the practical advantage that it only admits couplings with $P_{ij} > 0$, which is why we require that $h_i > 0$. Furthermore, the addition of $H(P)$ makes the objective strongly convex.

#### 3.2.1 Duality

The OTF problem for a linear fairness notion has $n + d_F$ constraints, which is usually far fewer in number than the $n^2$ variables that make up the coupling in the primal problem. Because $\mathrm{OTF}_\epsilon$ is also strongly convex, we will derive and then maximize the dual function instead.

**Proposition 1.** If $G$ has a non-empty null-space (i.e. $\exists \mathbf{f} \in \mathbb{R}^n : \mathbf{f} \neq \mathbf{0}_n \wedge G\mathbf{f} = \mathbf{0}_{d_F}$), then the $\mathrm{OTF}_\epsilon(h)$ problem has a unique solution for any $h : \mathcal{X} \to (0, 1]$. Moreover, the $\mathrm{OTF}_\epsilon$ problem is strongly convex and thus enjoys strong duality. Its dual function is

$$L(\lambda, \mu) = \lambda^\top \mathbf{h} - \epsilon \sum_{ij} \exp\left( \frac{-C_{ij} + \lambda_i + \sum_c \mu_c G_{cj}}{\epsilon} \right) \tag{8}$$

with $\lambda \in \mathbb{R}^n$ and $\mu \in \mathbb{R}^{d_F}$ the dual variable vectors for the marginalization and fairness constraints.

#### 3.2.2 Optimization

Though Eq. (8) could be maximized directly, we follow standard OT approaches [33] and perform our optimization with exact coordinate ascent. This strategy is particularly useful here, because the $\lambda_i$ variable that maximizes $L(\lambda, \mu)$ while $\mu$ is fixed is found independently of the other variables in $\lambda$. All variables in $\lambda$ can therefore be updated at the same time:

$$\lambda_i \leftarrow \epsilon \log h_i - \epsilon \log \sum_j \exp\left( \frac{-C_{ij} + \sum_c \mu_c G_{cj}}{\epsilon} \right) \tag{9}$$

where we can use the stabilized log-sum-exp operation. Unfortunately, there is no closed-form expression for the $\mu_c$ that maximizes $L(\lambda, \mu)$. Instead, we preprocess $L(\lambda, \mu)$ and numerically solve for each $\mu_c$:

$$\mu_c \leftarrow \operatorname*{argmin}_{\mu_c} \sum_j \eta_j(\lambda) \exp\left( \frac{\mu_c G_{cj} + \sum_{k \neq c} \mu_k G_{kj}}{\epsilon} \right), \quad \text{with } \eta_j(\lambda) = \sum_i \exp\left( \frac{-C_{ij} + \lambda_i}{\epsilon} \right). \tag{10}$$

#### 3.2.3 Complexity

The dual problem involves two variable vectors: $\lambda$ with length $n$ (the number of samples) and $\mu$ with length $d_F$ (the number of linear fairness constraints). It is commonly the case that $d_F \ll n$, because the number of distinguished protected groups is limited. This is in contrast with the traditional $\mathrm{OT}_\epsilon$ problem, where the dual problem involves two dual variable vectors of length $n$. In terms of computational complexity, the update of the full $\lambda$ has complexity $O(n(n + d_F))$. However, we stress that each $\lambda_i$ can be updated in parallel with complexity $O(n + d_F)$. When keeping the $\lambda$ vector fixed, we can perform an $O(n^2)$ operation to precompute the $\eta_j(\lambda)$ values for the $\mu$ updates. Though these updates have no closed form, each inner iteration of the update for $\mu_c$ only has complexity $O(n + d_F)$, and the update only needs to be performed for $d_F$ variables. By permitting a high tolerance on the convergence of $\lambda$, the necessary number of updates is also limited. In our experiments, we found that a single update is often enough. In principle, a memory complexity of $O(n + d_F)$ is achievable; however, to perform efficient matrix computations we store the cost matrix $C$ and the fairness constraints matrix $G$, so the memory required for practical use is $O(n(n + d_F))$.
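A rough NumPy sketch of the coordinate-ascent updates (9)–(10) is given below, assuming $G$ is a $d_F \times n$ constraints matrix. The helper name and the use of a generic scalar minimizer for each $\mu_c$ are our own choices and may differ from the authors' implementation.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.optimize import minimize_scalar

def otf_eps_dual(h, C, G, eps=1e-3, n_outer=1):
    """Approximate dual ascent for OTF_eps: the closed-form lambda update (Eq. 9, all i
    at once via log-sum-exp) alternated with a 1-D numerical search per mu_c (Eq. 10).
    Requires h > 0 elementwise."""
    n, d_F = len(h), G.shape[0]
    lam, mu = np.zeros(n), np.zeros(d_F)
    for _ in range(n_outer):                        # a single outer pass is often enough
        shift = mu @ G                              # (n,): sum_c mu_c G_cj
        lam = eps * np.log(h) - eps * logsumexp((-C + shift[None, :]) / eps, axis=1)
        log_eta = logsumexp((-C + lam[:, None]) / eps, axis=0)   # log eta_j(lambda)
        for c in range(d_F):
            rest = mu @ G - mu[c] * G[c]            # sum_{k != c} mu_k G_kj
            obj = lambda m, r=rest: logsumexp(log_eta + (m * G[c] + r) / eps)
            mu[c] = minimize_scalar(obj).x          # minimizes the log of Eq. 10's objective
    # Dual objective L(lambda, mu) of Eq. (8): a lower bound on OTF_eps(h) before convergence.
    dual = lam @ h - eps * np.exp(logsumexp((-C + lam[:, None] + (mu @ G)[None, :]) / eps))
    return dual, lam, mu
```

Since the log-sum-exp is monotone, minimizing it per $\mu_c$ is equivalent to minimizing the sum in Eq. (10) while remaining numerically stable.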
### 3.3 Adjusted OTF$_\epsilon$

A cause for concern with the smooth $\mathrm{OTF}_\epsilon$ cost is that, as opposed to the standard OTF, it does not necessarily equal zero if $h$ is already a fair score function. Indeed, while it may be feasible to achieve $\langle C, P \rangle = 0$, it is generally the case that the entropy $H(P) > 0$. We would thus like to find an interesting lower bound on $\mathrm{OTF}_\epsilon$ that is tight when $h \in \mathcal{F}$.

**Definition 4 (OTFR$_\epsilon$).** The $\mathrm{OTF}_\epsilon$ cost in Def. 3 is relaxed to the $\mathrm{OTFR}_\epsilon$ cost as

$$\mathrm{OTFR}_\epsilon(h) = \min_{P \in \tilde{\Pi}_\mathcal{F}(\mathbf{h})} \langle C, P \rangle - \epsilon H(P) \tag{11}$$

with $\tilde{\Pi}_\mathcal{F}(\mathbf{h}) = \{ P \in \mathbb{R}^{n \times n}_+ : P \mathbf{1}_n = \mathbf{h},\ |G P^\top \mathbf{1}_n| \leq |G \mathbf{h}| \}$.

**Definition 5 (OTF$^0_\epsilon$).** For a score function $h : \mathcal{X} \to (0, 1]$, the adjusted $\mathrm{OTF}_\epsilon$ cost is computed as

$$\mathrm{OTF}^0_\epsilon(h) = \mathrm{OTF}_\epsilon(h) - \mathrm{OTFR}_\epsilon(h). \tag{12}$$

Whereas the $\mathrm{OTF}_\epsilon(h)$ problem involved transport to fair score vectors ($G P^\top \mathbf{1}_n = \mathbf{0}_{d_F}$), the $\mathrm{OTFR}_\epsilon(h)$ problem relaxes this constraint by allowing unfairness up to the unfairness already present in $\mathbf{h}$. It is thus easily seen that $\mathrm{OTF}_\epsilon(h) \geq \mathrm{OTFR}_\epsilon(h)$. Interestingly, because this relaxation is bounded by the unfairness of $\mathbf{h}$, $\mathrm{OTFR}_\epsilon(h)$ is a tight lower bound if $h$ is already fair.

**Proposition 2.** $h \in \mathcal{F} \implies \mathrm{OTF}^0_\epsilon(h) = 0$.

In Appendix B.1, we provide empirical intuition for this theoretical result by using $\mathrm{OTF}^0_\epsilon(h)$ as a postprocessing approach and illustrate that $\mathrm{OTF}_\epsilon(h)$ gradually converges to $\mathrm{OTFR}_\epsilon(h)$ as the fairness of $h$ improves.

The adjustment term $\mathrm{OTFR}_\epsilon$ can be computed by using strong duality in a similar way as for $\mathrm{OTF}_\epsilon(h)$, though now with two dual variable vectors $\phi \in \mathbb{R}^{d_F}$ and $\psi \in \mathbb{R}^{d_F}$ that enforce the fairness inequality, in addition to the dual variable vector, here denoted by $\kappa \in \mathbb{R}^n$, that enforces the row constraint on the coupling. The dual function of $\mathrm{OTFR}_\epsilon$ is

$$L(\kappa, \phi, \psi) = \kappa^\top \mathbf{h} - \sum_c \gamma_c (\phi_c + \psi_c) - \epsilon \sum_{ij} \exp\left( \frac{-C_{ij} + \kappa_i + \sum_c (\phi_c - \psi_c) G_{cj}}{\epsilon} \right)$$

with $\phi_c \geq 0$, $\psi_c \geq 0$ and $\gamma_c = |\sum_j G_{cj} h_j|$. For a full discussion, we refer to Appendix A.4.

### 3.4 Minimising Unfairness

Given an efficient way to compute the parameters of the adjusted, smooth $\mathrm{OTF}^0_\epsilon(h)$ cost, we can now jointly minimize the weighted sum of the classifier's loss and unfairness cost with respect to $h$. Let $L_Y(h)$ denote the cross-entropy loss of $h(X)$ for the output labels $Y$. To add our $\mathrm{OTF}^0_\epsilon(h)$ cost to this objective, we pose the following optimization problem:

$$\min_h\ (1 - \alpha) L_Y(h) + \alpha\, \mathrm{OTF}^0_\epsilon(h) \tag{13}$$

with $0 \leq \alpha \leq 1$ a hyperparameter that regulates the importance of the unfairness cost term $\mathrm{OTF}^0_\epsilon(h)$.

So far, we considered the scores vector of $h$ as constant with respect to the $\mathrm{OTF}^0_\epsilon(h)$ problem. However, the solutions for $\mathrm{OTF}_\epsilon(h)$ and $\mathrm{OTFR}_\epsilon(h)$ clearly depend on $h$. Let $\lambda^*$, $\mu^*$, $\kappa^*$, $\phi^*$ and $\psi^*$ denote the fully converged dual variables. Then

$$\nabla_h \mathrm{OTF}^0_\epsilon(h) = \nabla_h \left( L(\lambda^*, \mu^*) - L(\kappa^*, \phi^*, \psi^*) \right). \tag{14}$$

Even if we did not update the dual variables until convergence, we can still use intermediate values to approximate the true gradient [18, 16]. We further approximate the true $\mathrm{OTF}^0_\epsilon(h)$ cost by only computing it for a subset of the dataset, e.g. when batching is already used to optimize $L_Y(h)$. The $\mathrm{OTF}^0_\epsilon(h)$ cost can then be optimized on a batch by only using the values of $C$ and the columns of $G$ that relate to these data points. For large datasets (including those we use in experiments), it is then unnecessary to precompute the $C$ matrix between all pairs of data points, since most pairs will never actually co-occur in randomly sampled batches.
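Plugging the regularizer into Eq. (13) could look roughly as follows in PyTorch, assuming a differentiable callable `otf0_eps` (a hypothetical helper wrapping the dual computations above) is available. This is a sketch of the training loop, not the paper's code.

```python
import torch
from torch import nn

def train_fair(X, y, otf0_eps, alpha=0.5, epochs=50, batch_size=256, lr=1e-2):
    """Minimal loop for Eq. (13): minimize (1 - alpha) * cross-entropy + alpha * OTF^0_eps.

    X: (n, d) float tensor, y: (n,) float tensor of 0/1 labels.
    otf0_eps(scores, idx) is assumed to return the adjusted smooth OTF cost for the
    scores of batch `idx`, differentiable w.r.t. `scores` (hypothetical helper).
    """
    model = nn.Sequential(nn.Linear(X.shape[1], 1), nn.Sigmoid())   # logistic regression
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    bce = nn.BCELoss()
    n = X.shape[0]
    for _ in range(epochs):
        perm = torch.randperm(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            scores = model(X[idx]).squeeze(1).clamp_min(1e-6)       # keep h(x) in (0, 1]
            loss = (1 - alpha) * bce(scores, y[idx]) + alpha * otf0_eps(scores, idx)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```

Passing the batch indices lets the regularizer select only the relevant entries of $C$ and columns of $G$, matching the batching strategy described above.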
## 4 Related Work

Fairness in machine learning is commonly understood in two (extreme) ways [30]. First, the notion of individual fairness states that similar individuals should be treated similarly [15]. Second, group fairness requires that, as a group, people are treated equally. The primary goal of our work is to achieve group fairness by measuring the distance of a score function to a set of group-fair functions. Yet, to an extent we also approach individual fairness, because our unfairness quantification takes the features of each individual into account in its cost function $c$. By appropriately choosing $c$, it may be possible to further improve upon individual fairness in future work.

Methods that aim to improve fairness can be categorized into three types: preprocessing methods, where the data is made more fair [25, 8]; postprocessing methods, where the model's predictions are modified to fit a notion of fairness [21]; and inprocessing methods, where the machine learning model or its training is directly altered to reduce unfairness [7, 41, 40, 1, 9]. As a fairness regularization term, our work clearly fits in the class of inprocessing methods. However, we highlight that the OTF cost could also be used as a postprocessing approach by setting $\alpha = 1$.

Our work draws inspiration from two kinds of inprocessing methods. First, from methods where a (still unfair) model is explicitly projected onto the set of fair models; a popular definition of distance for this projection is the Kullback-Leibler divergence [2, 23, 5]. Second, our work uses Optimal Transport (OT) [33] theory. Several works have applied OT to fairness, though they almost all do so by finding barycenters [35]: measures that are equally far from the measures belonging to each demographic. This type of method has been applied to preprocessing [19], classification [23, 42] and regression [11, 20], with the overarching goal being to move each group's measure closer to the barycenter. Such methods require fairness notions to be encoded as explicit equalities between a limited number of groups. They do not allow for more complex constraints, such as those involving continuous or multiple sensitive attributes, and make the strong assumption that the cost function is a distance metric. Our proposed method instead combines the idea of projecting the model onto a fairness-constrained set with the use of OT to measure the cost of this projection. In this manner, our work is most similar to [34], which computes the OT projection onto such a fair set. However, that work only uses the projection as a group fairness test. Our method has the advantage that it computes a differentiable cost that can be added to the loss of any probabilistic classifier.

## 5 Experiments

Our experiments were performed on four datasets from the fair binary classification literature [14, 28] with different kinds of sensitive information. First, the UCI Adult Census Income dataset with two binary sensitive attributes SEX and RACE. Second, the UCI Bank Marketing dataset, where we measure fairness with respect to the original continuous AGE values and the binary, quantized version AGE_BINNED. Third, the Dutch Census dataset with sensitive attribute SEX. Fourth, the Diabetes dataset with sensitive attribute GENDER. Additional information on datasets and hyperparameters is given in Appendix C.2 and C.3, respectively.

### 5.1 Methods

All experiments were performed using a logistic regression classifier. To achieve fairness, we jointly optimize fairness regularizers with the cross-entropy loss as in Eq. (13), and compute the gradient of their sum with respect to the parameters of the classifier. The OTF cost is the adjusted version from Def. 5, with $\epsilon = 10^{-3}$ (different choices for $\epsilon$ are illustrated in Appendix B.2).
For the cost function $c$, we use the Euclidean distance between non-protected features. The fairness constraints for both sensitive features in the Adult dataset were combined by concatenating their $G$ matrices.

As a first baseline, we use the L1 norm of the PDP and PEO fairness constraints and refer to this method as the Norm cost. This simple type of fairness regularizer is frequently proposed in the literature [26, 41, 39, 31]. In our experiments, it thus represents the class of methods that can accommodate a flexible range of fairness constraints while only considering the output of a classifier and not the input features. Our second baseline is an OT Barycenter[^2] approach [24]. This method directly minimizes the OT cost between the score distributions of two sensitive groups. However, the implementation can only minimize PDP, and does not admit a composition of linear fairness notions or continuous attributes. We therefore only minimize unfairness with respect to SEX and AGE_BINNED. To show the drawbacks of this limitation, we also report PDP unfairness with respect to RACE and the continuous AGE. The third baseline is Unfair, i.e. a classifier trained without a fairness cost term.

### 5.2 Evaluation

To measure the predictive power of the methods, the ROC AUC score was measured. Violation of PDP is quantified as the maximal absolute Pearson correlation between the scores $h(X)$ and each dimension $S_k$ of the sensitive attributes. For PEO, the violation is computed as the maximal PDP violation over the data points of each one-hot encoded output label $Y_l$. These measures mirror the requirements in Eq. (1) and Eq. (2) that the covariances ought to equal zero, though we choose the Pearson correlation instead such that the violation is normalized between 0 and 1. We thus avoid that a method can appear to improve on linear fairness notions merely by squeezing the variance of its scores.

All methods were tested for various strengths of $\alpha$ (see Eq. (13)). Each configuration of each method (i.e. each $\alpha$ value and fairness notion) was tested for 10 train-test splits with proportions 80/20. We report mean test set scores in Fig. 1 and show the confidence ellipse[^3] of their estimator for the first standard deviation. The train set results are reported in Appendix B.3 and show the same trends. Note that for the OTF and Norm regularizers, we omit fairness scores for notions that were not optimized in that configuration. While SEX and RACE are minimized at the same time for the Adult dataset, the AGE and AGE_BINNED attributes were minimized in separate training runs.

All experiments were conducted using half the hyperthreads on an internal machine equipped with a 12-core Intel(R) Xeon(R) Gold processor and 256 GB of RAM. In total, all experiments, including preliminary and failed experiments, used this machine for approximately 200 hours.
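For concreteness, the sketch below shows one way to compute the main ingredients of this experimental setup: a Euclidean cost matrix over non-protected features, the Norm baseline, and the Pearson-correlation violation metrics. The names and exact conventions are our own assumptions.

```python
import numpy as np

def euclidean_cost_matrix(X_nonprotected):
    """Pairwise Euclidean distances c(x_i, x_j) over non-protected features (the OT ground cost)."""
    sq = (X_nonprotected ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X_nonprotected @ X_nonprotected.T
    return np.sqrt(np.maximum(d2, 0.0))

def norm_regularizer(scores, G):
    """The 'Norm' baseline: L1 norm of the empirical linear-fairness violations."""
    return np.abs(G @ scores / len(scores)).sum()

def pdp_violation(scores, S):
    """Evaluation metric: max absolute Pearson correlation between scores and each sensitive column."""
    return max(abs(np.corrcoef(scores, S[:, k])[0, 1]) for k in range(S.shape[1]))

def peo_violation(scores, S, y):
    """PEO metric: the PDP violation within each label group, maximized over labels."""
    return max(pdp_violation(scores[y == label], S[y == label]) for label in np.unique(y))
```

Using the correlation rather than the raw covariance keeps the reported violations on a common $[0, 1]$ scale across attributes and datasets, as motivated above.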
### 5.3 Results

On all datasets, our OTF method makes a similar trade-off between AUC and PDP fairness as the baselines for various $\alpha$ values. However, OTF clearly outperforms Norm for both sensitive attributes on the Adult dataset in Fig. 1a when they are trained to minimize PEO, especially in the lower violation range. Similarly, OTF achieves a better trade-off between AUC and PEO violation on the Diabetes dataset in Fig. 1d, especially on the train set (see Appendix B.3). This advantage in minimizing PEO may be due to the fact that OTF has the incentive to achieve fairness by exchanging classifier scores between similar individuals. Conditioning on the output label rewards such exchanges, as individuals with similar labels can be expected to have similar features.

Furthermore, we remark that our OTF method is as flexible as the Norm baseline in minimizing either PDP or PEO with respect to multiple sensitive attributes on the Adult dataset and the continuous AGE attribute on the Bank dataset in Fig. 1b. Implementations of differentiable regularizers that can accommodate such attributes are scarcely available. For example, the Barycenter method was not trained to improve fairness with respect to RACE, as its implementation cannot handle multiple sensitive attributes. Since it can only minimize the OT cost between a limited number of score distributions, it is naturally ill-equipped for continuous attributes. In separate runs on the Bank dataset, we therefore also trained OTF and Norm to minimize unfairness with respect to AGE_BINNED. Though the results are subject to noise, we observe similar trends as for the other non-continuous attributes: our method is clearly on par with the baselines. On the train set (see Appendix B.3), the OTF method significantly outperforms the Norm baseline for PEO.

On the Adult, Bank and Diabetes datasets, the proposed OTF method thus displays its advantage in both an effective AUC-fairness trade-off and a flexible capacity for fairness. However, we note that on the Dutch dataset in Fig. 1c, no method seems to display a significantly better trade-off. We hypothesize that no such advantage is seen here because PEO, the fairness notion where our method usually achieves superior results, may be relatively easy to satisfy on this dataset.

[^2]: Retrieved from https://github.com/deepmind/wasserstein_fairness (Apache License 2.0).
[^3]: We use the variance of the estimator of the mean, which is smaller than that of individual scores.

[Figure 1 panels: AUC-fairness trade-offs of the OTF, Norm, Barycenter and Unfair methods on (a) the Adult, (b) the Bank, (c) the Dutch and (d) the Diabetes dataset, plotted against the violation of PDP and PEO for each sensitive attribute.]

Figure 1: Test set results for the methods that were trained to reduce the evaluated fairness measure (PDP or PEO). Violation of PDP (and PEO) is computed as the maximal absolute Pearson correlation between the probability scores (conditioned on the output labels) and each sensitive attribute.

### 5.4 Limitations and Impact

We nuance our results by pointing out that the OTF method has a higher computational complexity ($O(n(n + d_F))$) than Norm, which has complexity $O(n d_F)$, with $n$ the batch size and $d_F$ the number of fairness constraints. Furthermore, as discussed in Sec. 2.1, we only evaluate linear fairness notions and leave an extension to non-linear notions of fairness [9] for future work.

The fairness of a system should always be judged based on a holistic consideration of the context of the system [3] and relevant ethical principles [4]. Without such consideration, any method to improve fairness may reinforce the underlying injustice that led to the risks of discrimination that we aim to solve [22].
## 6 Conclusion

In this paper, we constructed the Optimal Transport to Fairness (OTF) method as a differentiable fairness regularization term, which combines the advantages of flexible linear fairness constraints with Optimal Transport theory, thereby taking non-protected similarities between individuals into account. In experiments, we show that our final method achieves a similar AUC-fairness trade-off as other cost terms for notions of fairness inspired by demographic parity, and a significantly better one for notions inspired by equalized odds. At the same time, OTF empirically demonstrates how its increased flexibility, i.e. its ability to improve all linear fairness notions with respect to multiple sensitive attributes that may be categorical or continuous, provides an advantage over previous fairness applications of OT.

In the future, we hope to further investigate the properties of OT-based fairness, inspired by its clear advantage in achieving equalized odds. Though our method was only applied to linear fairness notions, OT may provide an opportunity to better achieve non-linear notions of fairness, e.g. through a creative choice of the cost function that directs the transport of classifier scores.

## Acknowledgments and Disclosure of Funding

The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) (ERC Grant Agreement no. 615517), and under the European Union's Horizon 2020 research and innovation programme (ERC Grant Agreement no. 963924), from the Flemish Government under the Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen programme, and from the FWO (project no. G0F9816N, 3G042220). MB is supported by a doctoral scholarship from the Special Research Fund (BOF) of Ghent University (reference number: BOF20/DOC/144).

## References

[1] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudik, John Langford, and Hanna Wallach. A Reductions Approach to Fair Classification. In Proceedings of the 35th International Conference on Machine Learning, pages 60–69. PMLR, July 2018.

[2] Wael Alghamdi, Shahab Asoodeh, Hao Wang, Flavio P. Calmon, Dennis Wei, and Karthikeyan Natesan Ramamurthy. Model Projection: Theory and Applications to Fair Machine Learning. In 2020 IEEE International Symposium on Information Theory (ISIT), pages 2711–2716. IEEE, June 2020.

[3] Cynthia L. Bennett and Os Keyes. What is the point of fairness? Disability, AI and the complexity of justice. ACM SIGACCESS Accessibility and Computing, (125):5:1, March 2020.

[4] Reuben Binns. On the apparent conflict between individual and group fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 514–524, Barcelona, Spain, January 2020. ACM.

[5] Maarten Buyl and Tijl De Bie. The KL-Divergence Between a Graph Model and its Fair I-Projection as a Fairness Regularizer. In Machine Learning and Knowledge Discovery in Databases, pages 351–366. Springer International Publishing, 2021.

[6] Toon Calders, Faisal Kamiran, and Mykola Pechenizkiy. Building Classifiers with Independency Constraints. In 2009 IEEE International Conference on Data Mining Workshops, pages 13–18, December 2009.

[7] Toon Calders and Sicco Verwer. Three naive Bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery, 21(2):277–292, September 2010.

[8] Flavio Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R. Varshney. Optimized Pre-Processing for Discrimination Prevention. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[9] L. Elisa Celis, Lingxiao Huang, Vijay Keswani, and Nisheeth K. Vishnoi. Classification with Fairness Constraints: A Meta-Algorithm with Provable Guarantees. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 319–328, Atlanta, GA, USA, January 2019. ACM.

[10] Rumman Chowdhury. Sharing learnings about our image cropping algorithm. https://blog.twitter.com/engineering/en_us/topics/insights/2021/sharing-learnings-about-our-image-cropping-algorithm, May 2021.

[11] Evgenii Chzhen, Christophe Denis, Mohamed Hebiri, Luca Oneto, and Massimiliano Pontil. Fair Regression with Wasserstein Barycenters. arXiv:2006.07286 [cs, math, stat], June 2020.

[12] Marco Cuturi. Sinkhorn Distances: Lightspeed Computation of Optimal Transport. In Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.

[13] Jeffrey Dastin. Amazon scraps secret AI recruiting tool that showed bias against women | Reuters. https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G, October 2018.

[14] Dheeru Dua and Casey Graff. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml, 2017.

[15] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS '12, pages 214–226, New York, NY, USA, January 2012. Association for Computing Machinery.

[16] Jean Feydy, Thibault Séjourné, François-Xavier Vialard, Shun-ichi Amari, Alain Trouvé, and Gabriel Peyré. Interpolating between Optimal Transport and MMD using Sinkhorn Divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2681–2690. PMLR, April 2019.

[17] Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya-Polo, and Tomaso Poggio. Learning with a Wasserstein loss. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS '15, pages 2053–2061, Cambridge, MA, USA, December 2015. MIT Press.

[18] Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning Generative Models with Sinkhorn Divergences. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, pages 1608–1617. PMLR, March 2018.

[19] Paula Gordaliza, Eustasio Del Barrio, Fabrice Gamboa, and Jean-Michel Loubes. Obtaining Fairness using Optimal Transport Theory. In Proceedings of the 36th International Conference on Machine Learning, pages 2357–2365. PMLR, May 2019.

[20] Thibaut Le Gouic, Jean-Michel Loubes, and Philippe Rigollet. Projection to Fairness in Statistical Learning. arXiv:2005.11720 [cs, math, stat], June 2020.

[21] Moritz Hardt, Eric Price, and Nati Srebro. Equality of Opportunity in Supervised Learning. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.

[22] Anna Lauren Hoffmann. Where fairness fails: Data, algorithms, and the limits of antidiscrimination discourse. Information, Communication & Society, 22(7):900–915, June 2019.

[23] Heinrich Jiang and Ofir Nachum. Identifying and Correcting Label Bias in Machine Learning. In Proceedings of the Twenty-Third International Conference on Artificial Intelligence and Statistics, pages 702–712. PMLR, June 2020.

[24] Ray Jiang, Aldo Pacchiano, Tom Stepleton, Heinrich Jiang, and Silvia Chiappa. Wasserstein Fair Classification. In Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, pages 862–872. PMLR, August 2020.
[25] Faisal Kamiran and Toon Calders. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1–33, October 2012.

[26] Toshihiro Kamishima, Shotaro Akaho, and Jun Sakuma. Fairness-aware Learning through Regularization Approach. In 2011 IEEE 11th International Conference on Data Mining Workshops, pages 643–650, Vancouver, BC, Canada, December 2011. IEEE.

[27] Jeff Larson, Surya Mattu, Lauren Kirchner, and Julia Angwin. How We Analyzed the COMPAS Recidivism Algorithm. https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm, May 2016.

[28] Tai Le Quy, Arjun Roy, Vasileios Iosifidis, Wenbin Zhang, and Eirini Ntoutsi. A survey on datasets for fairness-aware machine learning. WIREs Data Mining and Knowledge Discovery, 12(3):e1452, 2022.

[29] Michael Lohaus, Michael Perrot, and Ulrike Von Luxburg. Too Relaxed to Be Fair. In Proceedings of the 37th International Conference on Machine Learning, pages 6360–6369. PMLR, November 2020.

[30] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A Survey on Bias and Fairness in Machine Learning. ACM Computing Surveys (CSUR), July 2021.

[31] Manisha Padala and Sujit Gujar. FNNC: Achieving fairness through neural networks. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI '20, pages 2277–2283, Yokohama, Japan, January 2021.

[32] Gabriel Peyré and Marco Cuturi. Computational Optimal Transport: With Applications to Data Science. Foundations and Trends in Machine Learning, 11(5-6):355–607, February 2019.

[33] Gabriel Peyré and Marco Cuturi. Computational Optimal Transport. arXiv:1803.00567 [stat], March 2020.

[34] Nian Si, Karthyek Murthy, Jose Blanchet, and Viet Anh Nguyen. Testing Group Fairness via Optimal Transport Projections. arXiv:2106.01070 [cs, math, stat], June 2021.

[35] Silvia Chiappa, Ray Jiang, Tom Stepleton, Aldo Pacchiano, Heinrich Jiang, and John Aslanides. A General Approach to Fairness with Optimal Transport. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04):3633–3640, April 2020.

[36] Bahar Taskesen, Jose Blanchet, Daniel Kuhn, and Viet Anh Nguyen. A Statistical Test for Probabilistic Fairness. arXiv:2012.04800 [cs, stat], December 2020.

[37] Sahil Verma and Julia Rubin. Fairness definitions explained. In Proceedings of the International Workshop on Software Fairness, pages 1–7, Gothenburg, Sweden, May 2018. ACM.

[38] Sandra Wachter, Brent Mittelstadt, and Chris Russell. Bias Preservation in Machine Learning: The Legality of Fairness Metrics under EU Non-Discrimination Law. West Virginia Law Review, 123(3):735–790, 2020.

[39] Michael Wick, Swetasudha Panda, and Jean-Baptiste Tristan. Unlocking Fairness: A Tradeoff Revisited. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

[40] Blake Woodworth, Suriya Gunasekar, Mesrob I. Ohannessian, and Nathan Srebro. Learning Non-Discriminatory Predictors. In Proceedings of the 2017 Conference on Learning Theory, pages 1920–1953. PMLR, June 2017.

[41] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez-Rodriguez, and Krishna P. Gummadi. Fairness Constraints: A Flexible Approach for Fair Classification. Journal of Machine Learning Research, 20(75):1–42, 2019.

[42] Meike Zehlike, Philipp Hacker, and Emil Wiedemann.
Matching code and law: Achieving algorithmic fairness with optimal transport. Data Mining and Knowledge Discovery, 34(1):163 200, January 2020. 1. For all authors... (a) Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? [Yes] In Sec. 3, we proposed the OTF cost term and deployed tricks to minimize it effectively. In Sec. 5, we show that the results of using our proposed method are close to the baselines or surpassing it. (b) Did you describe the limitations of your work? [Yes] We describe limitations of our considered fairness notions in Sec. 2.1 and the complexity of our proposed OTF cost term in Sec. 3.2.3. We briefly discuss overall limitations of fairness methods in Sec. 5.4. (c) Did you discuss any potential negative societal impacts of your work? [Yes] We briefly discuss the danger that inconsiderately applied fairness methods can pose in Sec. 5.4 (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] 2. If you are including theoretical results... (a) Did you state the full set of assumptions of all theoretical results? [Yes] Our method is constructed in several steps. When required, we discuss and add additional assumptions. (b) Did you include complete proofs of all theoretical results? [Yes] Proofs of all propositions and derivations of other results are included in the Appendix. 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] There are references to the datasets we used in the Appendix. Code and a pipeline to run it are included in the supplemental material. (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] Initial training details are given in Sec.5.2. More details of the hyperparameters are included in the Appendix. (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] Since we show the trade-off between two goals (accuracy and fairness) we show the confidence ellipses. (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] This is mentioned at the end of Sec. 5.2. 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... (a) If your work uses existing assets, did you cite the creators? [Yes] For the Barycenter code, a reference to the Github implementation is included. For datasets, we refer to the UCI repository. (b) Did you mention the license of the assets? [Yes] For the Barycenter code, we include the license. We refer to the data according to the requested citation policy. (c) Did you include any new assets either in the supplemental material or as a URL? [N/A] (d) Did you discuss whether and how consent was obtained from people whose data you re using/curating? [No] The datasets we use are popular, public datasets in machine learning fairness literature. Obtaining consent is infeasible. However, the data is highly anonymized and its popularity promotes reproducibility in the community. (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes] As briefly mentioned in the Appendix, the data is anonymous and free from any personal identifiers. 5. 
If you used crowdsourcing or conducted research with human subjects... (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]