# Multi-Output Distributional Fairness via Post-Processing

Published in Transactions on Machine Learning Research (03/2025)

Gang Li (gang-li@tamu.edu), Texas A&M University
Qihang Lin (qihang-lin@uiowa.edu), The University of Iowa
Ayush Ghosh (ayushghosh70@gmail.com), The University of Iowa
Tianbao Yang (tianbao-yang@tamu.edu), Texas A&M University

Reviewed on OpenReview: https://openreview.net/forum?id=MJOKrHqiV1

## Abstract

Post-processing approaches are becoming prominent techniques for enhancing machine learning models' fairness because of their intuitiveness, low computational cost, and excellent scalability. However, most existing post-processing methods are designed for task-specific fairness measures and are limited to single-output models. In this paper, we introduce a post-processing method for multi-output models, such as the ones used for multi-task/multi-class classification and representation learning, to enhance a model's distributional parity, a task-agnostic fairness measure. Existing methods for achieving distributional parity rely on the (inverse) cumulative density function of a model's output, restricting their applicability to single-output models. Extending previous works, we propose to employ optimal transport mappings to move a model's outputs across different groups towards their empirical Wasserstein barycenter. An approximation technique is applied to reduce the complexity of computing the exact barycenter, and a kernel regression method is proposed to extend this process to out-of-sample data. Our empirical studies evaluate the proposed approach against various baselines on multi-task/multi-class classification and representation learning tasks, demonstrating the effectiveness of the proposed approach. (Code is available at https://github.com/GangLii/TAB.)

## 1 Introduction

In machine learning, multi-output learning is a broadly defined domain (Xu et al., 2019; Liu et al., 2018), where the goal is to simultaneously predict multiple outputs given an input, as in multi-label classification, multi-class classification, multi-target regression, etc. In contrast to conventional single-output learning like binary classification, multi-output learning is characterized by its multivariate nature, whose outputs exhibit rich information for further handling. Multi-output learning is important for real-world decision-making where final decisions are made by considering and weighting multiple factors and criteria. For example, when applied to college admission, the predicted multiple outputs can represent a prospective student's likelihoods of accepting the offer, needing financial aid, completing the degree, finding a job at graduation, etc. Those outputs are weighted to guide admission decisions, though the weights may vary with colleges, majors, and years (Jaschik, 2023). However, multi-output learning for decision-making faces the challenge of bias and fairness.
There is plenty of evidence indicating the discriminatory impact of ML-based decision-making on individuals and groups (O'Neil, 2017; Datta et al., 2014; Bolukbasi et al., 2016; Barocas & Selbst, 2016; Raji & Buolamwini, 2019), such as racial bias in assessing the risk of recidivism (Flores et al., 2016) and gender bias in job advertising (Simonite, 2015). To mitigate bias in machine learning, numerous fairness criteria and algorithms have been proposed (Corbett-Davies et al., 2017; Barocas et al., 2023). These methods introduce statistical constraints during training or post-process the predictions to ensure fair treatment in accordance with corresponding fairness notions such as Demographic Parity (Calders et al., 2009; Chuang & Mroueh, 2021), Equality of Odds or Equal Opportunity (Hardt et al., 2016; Awasthi et al., 2020), Strong Demographic Parity (Agarwal et al., 2019; Jiang et al., 2020), and AUC fairness (Vogel et al., 2021; Yang et al., 2023; Yao et al., 2023). Nevertheless, almost all existing methods focus on ensuring fairness for binary classification or regression within the context of single-output models. Extending fairness to multi-output settings, such as multi-label/multi-class classification and representation learning, remains underexplored and non-trivial.

[Figure 1: Suppose the original model's outputs (left) minimize prediction error and perform well; one can balance performance and fairness by applying our method with $\alpha = 0.5$ (middle) or achieve exact fairness by setting $\alpha = 0$ (right).]

A naive approach for multi-output fairness is to apply existing fairness-enhancing algorithms for single-output models to each output individually. However, removing unfairness in each output may not help reduce the unfairness in the joint distribution of the outputs. Consider a model with two outputs, with the distributions for two groups shown in Figure 1 (left). The outputs have the same marginal distribution in both groups, so each dimension of the outputs is considered to be fair. However, the outputs have very different joint distributions for different groups, leading to potential unfairness. For example, suppose Output 1 and Output 2 denote the likelihoods of accepting the offer and of completing the degree in a college admission procedure. If a college requires the outputs to be higher than 0.45 and 0.6, respectively (shown as dotted lines in Figure 1), only students in group 1 are admitted.

To address this challenge, we propose a post-processing method that enhances the fairness of multi-output models on multi-group data by producing a similar distribution of multi-dimensional outputs on each group with minimal manipulation (i.e., minimal mean squared error compared with the original outputs). Before discussing details, we present a 2D example in Figure 1. Measuring fairness based on the distribution of the multi-dimensional outputs on each group leads to a strong notion called Statistical Parity or Strong Demographic Parity, which has recently been studied for single-output problems in the literature (Agarwal et al., 2019; Jiang et al., 2020; Chzhen et al., 2020; Hu et al., 2023). We apply the same notion to multi-output problems and call it Distributional Parity (Definition 1).
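The admission example above can be reproduced numerically. The following minimal sketch (the group distributions and correlation structures are synthetic stand-ins; the thresholds 0.45 and 0.6 are taken from Figure 1) builds two groups whose outputs share the same marginal distributions yet differ in their joint distribution, so a joint threshold rule admits almost exclusively from one group.

```python
# Marginally fair but jointly unfair outputs: a synthetic illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Group 1: the two outputs are positively correlated.
u = rng.uniform(0, 1, n)
g1 = np.column_stack([u, np.clip(u + 0.05 * rng.standard_normal(n), 0, 1)])

# Group 2: (approximately) the same uniform marginals, but negatively correlated.
v = rng.uniform(0, 1, n)
g2 = np.column_stack([v, np.clip(1 - v + 0.05 * rng.standard_normal(n), 0, 1)])

# Per-output (marginal) statistics match closely across groups ...
print(g1.mean(axis=0), g2.mean(axis=0))

# ... yet a joint decision rule (both outputs above thresholds) admits
# almost exclusively from group 1.
admit1 = np.mean((g1[:, 0] > 0.45) & (g1[:, 1] > 0.6))
admit2 = np.mean((g2[:, 0] > 0.45) & (g2[:, 1] > 0.6))
print(f"admit rate group 1: {admit1:.3f}, group 2: {admit2:.3f}")
```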
However, extending this kind of fairness from single-output problems to multi-output problems is non-trivial due to several technical and practical issues: (1) what is the optimal fair predictor for multi-output problems that achieves distributional parity while best preserving accuracy; (2) empirically, how to compute the theoretically optimal fair predictor with only a set of observed data; (3) practically, how to generalize the post-processing method from training data to out-of-sample test data. Our approach addresses these challenges, and our main contributions are summarized below:

- Following Gouic et al. (2020); Chzhen et al. (2020); Chzhen & Schreuder (2022), we derive a post-processing method by projecting a multi-output model onto a constraint set defined by a fairness inequality that promotes distributional parity among multiple groups up to a user-specified tolerance of unfairness. We generalize the closed-form solution of the projection by Chzhen et al. (2020); Chzhen & Schreuder (2022) from the single-output case to the multi-output case using the optimal transport mappings to the Wasserstein barycenter of the model's outputs on each group.

- The aforementioned closed form involves the density function of the model's outputs, which is unknown in practice. Although it can be approximated using training samples, the computational cost of the Wasserstein barycenter is still prohibitively high. To address this issue, we propose to replace the barycenter in the closed form by a low-cost approximate barycenter from Lindheim (2023) using training data. Additionally, we propose a kernel regression approach to extrapolate the closed-form solution, so the post-processing method can be applied to any out-of-sample data. The proposed method is task-agnostic and model-agnostic.

- Numerical experiments are conducted to compare our method with various baselines on multi-label/multi-class classification and representation learning. The results demonstrate the effectiveness of our method.

## 2 Related work

The existing methods to achieve algorithmic fairness fall into three primary categories: 1) pre-processing methods that exclude sensitive features from data prior to training machine learning models (Dwork et al., 2012; du Pin Calmon et al., 2018; Kamiran & Calders, 2012; d'Alessandro et al., 2017); 2) in-processing methods, which attain fairness during the training phase of the learning model (Agarwal et al., 2018; Goh et al., 2016; Zhang et al., 2018; Kim et al., 2019; Hong & Yang, 2021; Du et al., 2021; Mo et al., 2021; Yao et al., 2023); and 3) post-processing methods designed to alleviate unfairness in the model inferences after the learning process (Hardt et al., 2016; Lipton et al., 2018; Cui et al., 2023; Petersen et al., 2021; Lohia et al., 2019; Xian et al., 2023). The pre-processing methods are inadequate if the sensitive information can be inferred from the remaining variables. The in-processing methods have high computational cost since they typically require re-training a model when the tolerance of unfairness changes. Our method belongs to the last category, which can be directly applied to any pre-trained model and thus avoids the high computational cost of re-training a model.
This work is motivated by Gouic et al. (2020); Chzhen et al. (2020); Chzhen & Schreuder (2022), where task-agnostic post-processing methods are developed based on the empirical cumulative density function and the quantile function of the output, which limits them to single-output models. We extend their approaches to multi-output models using the optimal transport mappings to the barycenter of the model's outputs on different groups. Different from their methods, we approximate the barycenter to reduce the computational cost and apply kernel regression to extend our post-processing method to any new data.

There also exist post-processing methods for specific machine learning tasks. For example, assuming a Bayes optimal score function is available, the group thresholding methods by Gaucher et al. (2023); Chzhen et al. (2019); Schreuder & Chzhen (2021); Zeng et al. (2022) are developed for single-task binary classification. A similar thresholding method has been developed by Denis et al. (2021) for multi-class classification, where the thresholds are computed using the Lagrangian multipliers of the demographic parity constraints. The method by Xian et al. (2023) is also for multi-class classification, where the output of the score function is mapped to a class label by the optimal transport mapping to a Wasserstein barycenter with restricted supports. Compared to these works, the method in this paper is task-agnostic and can be applied to multi-task learning and (self-supervised) representation learning. Hu et al. (2023) developed an approach to enforce strong demographic parity in multi-task learning including both regression and binary classification tasks. Their method essentially applies the single-output method by Gouic et al. (2020); Chzhen et al. (2020) to each task separately, which may not guarantee the distributional parity studied in this paper. See the example in Figure 1. Though a large body of literature works on learning representations independent of sensitive attributes, it achieves this by adversarial training (Zhang et al., 2018; Kim et al., 2019; Xie et al., 2017; Madras et al., 2018), variational auto-encoders (Louizos et al., 2015; Sarhan et al., 2020; Amini et al., 2019; Gong et al., 2024), or contrastive learning (Han et al., 2023; Qi et al., 2024; Zhang et al., 2022; Park et al., 2022), which are mostly in-processing methods and thus generally have higher computational cost than our method.

The multi-output fairness problem in this paper is closely related to compositional fairness introduced in Dwork & Ilvento (2018), which highlights that individually fair classifiers may not compose into fair systems. While Dwork & Ilvento (2018) focuses on categorical outputs, we examine continuous output distributions. Since categorical outputs often result from thresholding or softmax on continuous outputs, compositional fairness is a special case of multi-output fairness, and our method can directly address it.

## 3 Preliminaries

We consider a general multi-output machine learning problem, e.g., multi-output classification, multi-output regression, and representation learning. Let $(X, S)$ be a random vector, where $X \in \mathbb{R}^d$ is the feature vector and $S \in \mathcal{S}$ is a sensitive attribute. It is assumed that $\mathcal{S} = [m] := \{1, \dots, m\}$. A sample from the distribution of $(X, S)$ is denoted by $(x, s)$. We denote by $f^* : \mathbb{R}^d \times \mathcal{S} \to \mathbb{R}^k$ a machine learning model learned to generate a $k$-dimensional output.
For example, $f^*$ can be a multi-task regression model that predicts a vector of continuous values, a multi-class/multi-task classification model that returns the probability of each class/task, or a feature extractor that outputs a high-dimensional vector for an input such as an image. The focus of this work is not to train the model $f^*$. Assuming $f^*$ is already obtained, this paper studies how to measure and improve its fairness using a post-processing method that modifies the output $f^*(X, S)$. More specifically, given $f^*$, we consider a numerical approach for building a mapping $f : \mathbb{R}^d \times \mathcal{S} \to \mathbb{R}^k$ that approximates $f^*$ but produces an output $f(X, S)$ that is more fair than $f^*(X, S)$. Here, the approximation error is measured by

$$R(f) = \mathbb{E}\,\| f^*(X, S) - f(X, S) \|_2^2. \tag{1}$$

If $f^*$ performs well for the task it was built for and the performance metric is continuous, a small $R(f)$ ensures that $f$ also has reasonably good performance.

### 3.1 Distributional Parity

Many existing definitions and measures of fairness in the literature are task-specific, including, for example, demographic parity, equalized odds, and equal opportunity (Dwork et al., 2012; Hardt et al., 2016), which are specific to classification problems. In order to be task-agnostic and model-agnostic, we are interested in a fairness measure that applies to the distributions of $f(X, S)$ and $f^*(X, S)$. A measure of this kind is the strong demographic parity introduced by Agarwal et al. (2019); Jiang et al. (2020); Chzhen et al. (2020), which essentially requires that a model $f(x, s)$ is fair only when $f(X, S)$ has the same distribution conditioning on $S = s$ for any $s \in \mathcal{S}$. Although they only consider a single-output model, their fairness measure also applies to the multi-output case. Next we present their fairness measure formally under the name multi-output distributional parity. This new name helps differentiate it from the traditional demographic parity for binary classification (Calders et al., 2009).

**Definition 1 (Multi-output Distributional Parity)** A measurable mapping $f : \mathbb{R}^d \times \mathcal{S} \to \mathbb{R}^k$ satisfies distributional parity if, for every $s, s' \in \mathcal{S}$, $f(X, S)$ conditioning on $S = s$ and $f(X, S)$ conditioning on $S = s'$ have the same distribution.

**Remark:** Although we focus on Definition 1, an extension of demographic parity, it is natural to extend other notions of fairness, such as equal odds and equal opportunity, in a similar fashion. In particular, suppose $f(X, S)$ is used to predict a target variable $Y$ which takes values from a finite set $\mathcal{Y}$ (e.g., multi-class/multi-label classification). We say $f$ satisfies distributionally equal opportunity with respect to a class $y \in \mathcal{Y}$ if, for every $s, s' \in \mathcal{S}$, $f(X, S)$ conditioning on $(S, Y) = (s, y)$ and $f(X, S)$ conditioning on $(S, Y) = (s', y)$ have the same distribution. Similarly, we say $f$ satisfies distributionally equal odds if, for every $s, s' \in \mathcal{S}$ and every $y \in \mathcal{Y}$, $f(X, S)$ conditioning on $(S, Y) = (s, y)$ and $f(X, S)$ conditioning on $(S, Y) = (s', y)$ have the same distribution.

Consider a joint distribution of $(X, S, Y)$ where $Y \in \mathbb{R}^p$ is a target vector. In the setting of single-task binary classification, we have $k = 1$, $p = 1$, and $Y \in \{-1, 1\}$ is the true class label. Then the model $f(X, S) \in \mathbb{R}$ typically represents a score/probability of positivity, which is used to predict $Y$ by comparing $f(X, S)$ with a threshold $\theta$, namely, the predicted label $\hat{Y}$ is generated as $\hat{Y} = 1$ if $f(X, S) \geq \theta$ and $\hat{Y} = -1$ if $f(X, S) < \theta$. The traditional demographic parity (Calders et al., 2009; Dwork et al., 2012) holds if

$$\mathbb{P}(\hat{Y} = 1 \mid S = s) = \mathbb{P}(\hat{Y} = 1 \mid S = s') \quad \forall s, s' \in \mathcal{S}.$$
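For intuition, this demographic parity condition is easy to check empirically. A small sketch (the group-dependent score distribution below is hypothetical) estimates the positive-prediction-rate gap across groups at a few thresholds; under distributional parity this gap would vanish for every threshold simultaneously.

```python
# Estimating the demographic parity gap of thresholded scores (synthetic data).
import numpy as np

def dp_gap(scores, groups, theta):
    """Max difference in P(Yhat = 1 | S = s) across groups at threshold theta."""
    rates = [np.mean(scores[groups == s] >= theta) for s in np.unique(groups)]
    return max(rates) - min(rates)

rng = np.random.default_rng(0)
groups = rng.integers(0, 2, 5000)
scores = rng.beta(2 + groups, 2, 5000)   # group 1 scores are stochastically larger
print([round(dp_gap(scores, groups, t), 3) for t in (0.3, 0.5, 0.7)])
```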
Obviously, when $f$ satisfies distributional parity, it satisfies demographic parity for any threshold $\theta$. In fact, this is still true for multi-task binary classification problems, where $p = k > 1$ and $Y \in \{-1, 1\}^k$. The notion of demographic parity has been extended to the multi-class classification problem, where $k > 1$, $p = 1$, and $Y \in [k]$ (Xian et al., 2023). In this case, each coordinate of $f(X, S)$ represents the score of one class in $[k]$, and the predicted label is typically generated as $\hat{Y} = \arg\max_{l \in [k]} f_l(X, S)$. According to Xian et al. (2023), a model satisfies demographic parity when

$$\mathbb{P}(\hat{Y} = y \mid S = s) = \mathbb{P}(\hat{Y} = y \mid S = s') \quad \forall s, s' \in \mathcal{S},\ \forall y \in [k], \tag{2}$$

which is clearly implied by distributional parity.

### 3.2 Wasserstein Distance and Wasserstein Barycenters of Discrete Distributions

It is challenging and, sometimes, unnecessary to obtain a model $f$ that exactly satisfies the distributional parity in Definition 1. In practice, a model slightly violating Definition 1 can be acceptable for some applications. To quantify the extent to which a model $f$ satisfies distributional parity, a statistical distance is often introduced to measure the difference between the distributions of $f(X, s)$ and $f(X, s')$ for any $s$ and $s'$ in $\mathcal{S}$. Following the literature (Chzhen et al., 2020; Gouic et al., 2020), we utilize the Wasserstein distance, which conveys the distinction between probability measures by quantifying the expense associated with transforming one probability measure into another.

**Definition 2 (Wasserstein-2 distance)** Let $\mu$ and $\nu$ be two probability measures on $\mathbb{R}^k$ with finite second moments. The squared Wasserstein-2 distance between $\mu$ and $\nu$ is defined as

$$W_2^2(\mu, \nu) = \inf_{\gamma \in \Gamma_{\mu,\nu}} \int_{\mathbb{R}^k \times \mathbb{R}^k} \| x - y \|_2^2 \, d\gamma(x, y), \tag{3}$$

where $\Gamma_{\mu,\nu}$ is the set of probability measures on $\mathbb{R}^k \times \mathbb{R}^k$ whose marginal distributions are equal to $\mu$ and $\nu$, i.e., for all $\gamma \in \Gamma_{\mu,\nu}$ and all measurable sets $A, B \subset \mathbb{R}^k$ it holds that $\gamma(A \times \mathbb{R}^k) = \mu(A)$ and $\gamma(\mathbb{R}^k \times B) = \nu(B)$.

Suppose probability measures $\mu$ and $\nu$ both have density functions and the infimum in (3) is attained at $\gamma^*$. There exists a mapping $T_{\mu,\nu} : \mathbb{R}^k \to \mathbb{R}^k$ such that $\gamma^* = (\mathrm{Id}, T_{\mu,\nu})_{\#}\mu$, where $\#$ denotes the pushforward operator on measures and $\mathrm{Id}$ denotes the identity mapping (Santambrogio, 2015). Here, $T_{\mu,\nu}$ is known as the optimal transport mapping from $\mu$ to $\nu$. With the Wasserstein distance, we can characterize the geometric average of finitely many probability distributions by the Wasserstein-2 barycenter, which will later be used in the measure of unfairness considered in this paper.

**Definition 3** For probability measures $\nu_1, \dots, \nu_m$ and weights $p_1, \dots, p_m$ such that $p_i > 0$ and $\sum_{i=1}^m p_i = 1$, the weighted Wasserstein-2 barycenter is given by

$$\bar{\nu} = \arg\min_{\nu} \sum_{i=1}^m p_i W_2^2(\nu_i, \nu). \tag{4}$$

### 3.3 Post-processing with Distributional Parity Constraint

Let $p_s = \mathbb{P}(S = s)$ and let $\nu_{f|s}$ be the probability distribution of $f(X, S)$ conditioning on $S = s$ for $s \in \mathcal{S}$. Following Chzhen & Schreuder (2022), we measure the unfairness of model $f$ by the weighted sum of the squared distances from the $\nu_{f|s}$ to their weighted Wasserstein-2 barycenter, namely,

$$U(f) = \min_{\nu} \sum_{s \in \mathcal{S}} p_s W_2^2(\nu_{f|s}, \nu). \tag{5}$$

As the Wasserstein-2 distance serves as a distance metric on the space of probability distributions, $W_2^2(\nu_{f|s}, \nu_{f|s'}) = 0$ if and only if $\nu_{f|s} = \nu_{f|s'}$ (Kwegyir-Aggrey et al., 2023; Peyré et al., 2017). Hence, $U(f) = 0$ if and only if $\nu_{f|s} = \nu_{f|s'}$ for any $s, s' \in \mathcal{S}$, i.e., if and only if $f$ satisfies distributional parity.
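On finite samples, the quantities in (5) can be computed with off-the-shelf optimal transport solvers. The following minimal sketch uses the POT library (pip install pot) on synthetic two-dimensional outputs; for illustration only, the pooled sample stands in for the barycenter, whereas Section 5 describes the approximation we actually use.

```python
# Empirical squared Wasserstein-2 distances and an unfairness proxy for (5).
import numpy as np
import ot  # POT: Python Optimal Transport

def w2_squared(x, y):
    """Squared Wasserstein-2 distance between two empirical distributions in R^k."""
    a, b = ot.unif(len(x)), ot.unif(len(y))
    M = ot.dist(x, y)  # pairwise squared Euclidean costs (POT's default metric)
    return ot.emd2(a, b, M)

rng = np.random.default_rng(0)
# Two groups of 2D model outputs with different distributions (synthetic).
out_by_group = {s: rng.normal(loc=s, scale=1.0, size=(300, 2)) for s in range(2)}

# Crude stand-in for the barycenter: the pooled sample (illustration only).
pooled = np.vstack(list(out_by_group.values()))
n = len(pooled)
U = sum((len(x) / n) * w2_squared(x, pooled) for x in out_by_group.values())
print(f"unfairness proxy U(f) ~ {U:.3f}")
```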
However, when some level of unfairness is allowed, we only need to ensure $U(f) \leq \alpha U(f^*)$, where $\alpha \in [0, 1]$ represents the tolerance of unfairness as a fraction of the unfairness of $f^*$. Like Chzhen & Schreuder (2022); Kwegyir-Aggrey et al. (2023), we consider the post-processing problem with a distributional parity constraint as follows:

$$\min_{f} R(f) \quad \text{s.t.} \quad U(f) \leq \alpha U(f^*), \tag{6}$$

where $R(f)$ is defined in (1).

## 4 Structure of the Optimal Solution

Problem (6) has been thoroughly studied by Chzhen et al. (2020); Chzhen & Schreuder (2022) under the setting where $f^*$ is a single-output model. They provide an intuitive closed form of the optimal solution of (6) for any $\alpha \in [0, 1]$ using the cumulative density function (CDF) of $\nu_{f^*|s}$ for each $s$. As they only focus on a single-output model, we extend their results to the multi-output case with some modifications in their proofs. We present the extension of their results in this section and highlight the main difference. When $\alpha = 0$ in (6), the optimal solution of (6) can be characterized by the following theorem.

**Theorem 1** Suppose $\nu_{f^*|s}$ has a density and finite second moments for each $s \in \mathcal{S}$. Then

$$\min_{U(f)=0} R(f) = \min_{\nu} \sum_{s \in \mathcal{S}} p_s W_2^2(\nu_{f^*|s}, \nu) = U(f^*). \tag{7}$$

Moreover, if $f_0$ and $\nu_0$ solve the first and second minimization in (7), respectively, then $\nu_0$ is the distribution of $f_0$ and

$$f_0(x, s) = T_{f^*|s,\nu_0}(f^*(x, s)), \tag{8}$$

where $T_{f^*|s,\nu_0} : \mathbb{R}^k \to \mathbb{R}^k$ is the optimal transport mapping from $\nu_{f^*|s}$ to $\nu_0$.

By Definition 3, $\nu_0$ solving the second minimization in (7) is the barycenter of $\{\nu_{f^*|s}\}_{s \in \mathcal{S}}$. This suggests a post-processing method that transports the output $f^*(X, S)$ when $S = s$ to that barycenter. When $k = 1$, Theorem 1 reduces to the structural results obtained by Chzhen et al. (2020); Gouic et al. (2020) (see, e.g., Theorem 2.3 in Chzhen et al. (2020)). Indeed, when $k = 1$,

$$T_{f^*|s,\nu_0}(f^*(x, s)) = \sum_{s' \in \mathcal{S}} p_{s'}\, Q_{f^*|s'}\big(F_{f^*|s}(f^*(x, s))\big), \tag{9}$$

where $F_{f^*|s}$ is the CDF of $\nu_{f^*|s}$ and $Q_{f^*|s}$ is the quantile function of $\nu_{f^*|s}$, i.e., $Q_{f^*|s}(t) = \inf\{y \in \mathbb{R} : F_{f^*|s}(y) \geq t\}$.

In practice, when some level of unfairness is acceptable, one can select $\alpha > 0$ in (6). When $\alpha = 1$, $f = f^*$ is the optimal solution. When $\alpha \in (0, 1)$, a closed form of the optimal solution for (6) is derived by Chzhen & Schreuder (2022); Kwegyir-Aggrey et al. (2023). Although it was originally studied only in the single-output case, the same result holds for a multi-output model by almost the same proof. This closed form is presented in the following proposition, which motivates the way of trading off error and fairness in our post-processing method in the next section.

**Proposition 1 (Proposition 4.1 in Chzhen & Schreuder (2022))** Suppose $\nu_{f^*|s}$ has a density and finite second moments for each $s \in \mathcal{S}$. For any $\alpha \in [0, 1]$, the optimal solution to (6), denoted by $f_\alpha$, satisfies (up to a zero-measure set)

$$f_\alpha(x, s) = \alpha f^*(x, s) + (1 - \alpha) f_0(x, s), \tag{10}$$

where $f_0$ is defined in Eq. (8).

**Remark:** It is not always possible to minimize $R$ and $U$ simultaneously by the same $f$. Therefore, a model $f$ is valuable as long as it achieves a value of $R$ (or $U$) that cannot be further reduced without increasing $U$ (or $R$). In fact, a function $f(x, s)$ measurable in $x$ is called Pareto efficient if there is no function $f'(x, s)$ measurable in $x$ such that one of the following two cases happens: (1) $R(f') \leq R(f)$ and $U(f') < U(f)$, or (2) $R(f') < R(f)$ and $U(f') \leq U(f)$. The Pareto frontier is the curve in two dimensions consisting of $(U(f), R(f))$ for all Pareto efficient $f$'s. Proposition 4.6 in Chzhen & Schreuder (2022) shows that the Pareto frontier for $U$ and $R$ is actually the curve $\{(R(f_\alpha), U(f_\alpha))\}$ with $\alpha$ varying from 0 to 1, where $f_\alpha$ is from (10). Although the output dimension equals one in Chzhen & Schreuder (2022), their Proposition 4.6 also holds in our setting for any output dimension $k \geq 1$ by the same proof.
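The single-output closed form (9) and the interpolation (10) are straightforward to implement with plug-in empirical CDFs and quantile functions. The sketch below (all scores synthetic, group weights estimated from sample sizes) post-processes 1D scores per group toward the weighted quantile average; $\alpha = 0$ yields approximate distributional parity and $\alpha = 1$ returns the original scores.

```python
# Empirical sketch of the 1D closed form (9) and the interpolation (10).
import numpy as np

def fair_1d(scores_by_group, alpha=0.0):
    """Post-process 1D scores per group toward the weighted quantile average."""
    groups = list(scores_by_group)
    n = sum(len(v) for v in scores_by_group.values())
    p = {s: len(v) / n for s, v in scores_by_group.items()}
    out = {}
    for s, v in scores_by_group.items():
        # Empirical CDF rank of each score within its own group.
        t = (np.argsort(np.argsort(v)) + 1) / len(v)
        # Weighted average of all groups' quantile functions at rank t, as in (9).
        f0 = sum(p[s2] * np.quantile(scores_by_group[s2], t) for s2 in groups)
        out[s] = alpha * v + (1 - alpha) * f0      # interpolation (10)
    return out

rng = np.random.default_rng(0)
scores = {0: rng.beta(2, 5, 400), 1: rng.beta(5, 2, 600)}
fair = fair_1d(scores, alpha=0.0)   # alpha = 0: exact distributional parity
print(fair[0].mean(), fair[1].mean())  # now approximately equal
```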
## 5 Fair Post-Processing with Finite Samples

Theorem 1 and Proposition 1 only characterize the optimal solution to (6) when $\nu_{f^*|s}$ has a density. However, in practice, we only have access to a finite set of outputs of $f^*$, denoted by $\{f^*(x_i, s_i)\}_{i=1}^n$, where $\{(x_i, s_i)\}_{i=1}^n$ is a dataset sampled from the distribution of $(X, S)$. (The technique we propose is model-agnostic and can be applied directly to the outputs $\{f^*(x_i, s_i)\}_{i=1}^n$, so it does not require knowing $\{(x_i, s_i)\}_{i=1}^n$.) As a result, we are not able to compute $\nu_0$ and $T_{f^*|s,\nu_0}$ exactly and apply (10). To address this issue when $k = 1$, a plug-in principle is applied by replacing $p_s$, $F_{f^*|s}(t)$, and $Q_{f^*|s}(t)$ in (9) with their empirical approximations based on a finite data sample (Chzhen et al., 2020; Gouic et al., 2020). Because $f^*$ is a multi-output mapping in our case, (9) is not applicable. In the following, we present how to employ the plug-in principle by approximating $\nu_0$ and $T_{f^*|s,\nu_0}$ in (8) from a finite data sample. It consists of three main steps.

### 5.1 Optimal Transportation with Finite Samples

Consider approximating $\nu_0$ and $T_{f^*|s,\nu_0}$ by a collection of datasets $D_s = \{(x_i^s, s)\}_{i=1}^{n_s}$ for $s \in \mathcal{S}$. We denote the empirical distribution of $f^*$ on $D_s$ by $\nu_{f^*}(D_s)$, i.e., $\nu_{f^*}(D_s) = \frac{1}{n_s}\sum_{i=1}^{n_s} \delta_{f^*(x_i^s, s)}$ for $s \in \mathcal{S}$, where $\delta$ is the Dirac measure. Consider two discrete distributions in $\mathbb{R}^k$: $\mu = \sum_{i=1}^{n_\mu} p_i^\mu \delta_{\xi_i^\mu}$ and $\nu = \sum_{j=1}^{n_\nu} p_j^\nu \delta_{\xi_j^\nu}$, where $\xi_i^\mu \in \mathbb{R}^k$, $\xi_j^\nu \in \mathbb{R}^k$, $p_i^\mu > 0$, $p_j^\nu > 0$, $\sum_{i=1}^{n_\mu} p_i^\mu = 1$, and $\sum_{j=1}^{n_\nu} p_j^\nu = 1$. As a special case of (3), the squared Wasserstein-2 distance between $\mu$ and $\nu$ is

$$W_2^2(\mu, \nu) = \min_{\gamma \in \mathbb{R}_+^{n_\mu \times n_\nu}} \sum_{i=1}^{n_\mu}\sum_{j=1}^{n_\nu} c_{ij}\gamma_{ij} \quad \text{s.t.} \quad \sum_{i=1}^{n_\mu}\gamma_{ij} = p_j^\nu\ \forall j, \quad \sum_{j=1}^{n_\nu}\gamma_{ij} = p_i^\mu\ \forall i, \tag{11}$$

where $c_{ij} = \|\xi_i^\mu - \xi_j^\nu\|_2^2$ and $\gamma_{ij}$ represents the mass located at $\xi_i^\mu$ that is transported to $\xi_j^\nu$ in order to move distribution $\mu$ to $\nu$. Also, $\gamma$ can be viewed as a discrete distribution in $\mathbb{R}^k \times \mathbb{R}^k$ supported on $(\xi_i^\mu, \xi_j^\nu)$ for $i = 1, \dots, n_\mu$ and $j = 1, \dots, n_\nu$. Suppose $\gamma^* \in \mathbb{R}_+^{n_\mu \times n_\nu}$ is the optimal solution of (11). The optimal transportations from $\mu$ to $\nu$ and from $\nu$ to $\mu$, denoted by $T_{\mu,\nu}$ and $T_{\nu,\mu}$ respectively, are random mappings such that

$$\mathbb{P}\big(T_{\mu,\nu}(\xi_i^\mu) = \xi_j^\nu\big) = \frac{\gamma_{ij}^*}{p_i^\mu}, \qquad \mathbb{P}\big(T_{\nu,\mu}(\xi_j^\nu) = \xi_i^\mu\big) = \frac{\gamma_{ij}^*}{p_j^\nu}, \tag{12}$$

for $i = 1, \dots, n_\mu$ and $j = 1, \dots, n_\nu$. With $W_2^2(\mu, \nu)$ given in (11), the barycenter of discrete distributions $\nu_i$ for $i = 1, \dots, m$ is also defined as the solution of (4). Since $\nu_{f^*}(D_s)$ is the discrete approximation of $\nu_{f^*|s}$, we propose to approximate $\nu_0$ in Theorem 1 by the barycenter of $\{\nu_{f^*}(D_s)\}_{s \in \mathcal{S}}$ with the weights $p_s = \frac{n_s}{\sum_{s' \in \mathcal{S}} n_{s'}}$. Unfortunately, Wasserstein barycenters are NP-hard to compute (Altschuler & Boix-Adsera, 2022). Although (4) can be formulated as a multi-marginal optimal transport problem (Agueh & Carlier, 2011) and solved as a linear program (Anderes et al., 2016), its computational complexity scales exponentially in terms of $m$. To avoid this exponential complexity, instead of computing the barycenter exactly, we adopt the approach by Lindheim (2023) to construct an approximation, which achieves a good balance between runtime and approximation error in practice. We present this approach in the next section.

### 5.2 Approximate Barycenter with Finite Samples

The approach by Lindheim (2023) first computes the optimal transport mapping from $\nu_{f^*}(D_s)$ to $\nu_{f^*}(D_{s'})$ for each pair of $s$ and $s'$ in $\mathcal{S}$, i.e., $T_{\nu_{f^*}(D_s),\nu_{f^*}(D_{s'})}$ satisfying (12).
For simplicity of notation, we denote $T_{\nu_{f^*}(D_s),\nu_{f^*}(D_{s'})}$ by $T_{s,s'}$. Then, they define a mapping $M(f^*(x_i^s, s)) = \sum_{s' \in \mathcal{S}} p_{s'}\, \mathbb{E}\, T_{s,s'}(f^*(x_i^s, s))$, where $p_{s'} = n_{s'}/\sum_{s'' \in \mathcal{S}} n_{s''}$ and the expectation is taken over the random output of $T_{s,s'}$ following the distribution in (12). Here, $M(f^*(x_i^s, s))$ is the weighted average of the expected outcomes after transporting $f^*(x_i^s, s)$ to each of the $|\mathcal{S}|$ distributions. Finally, the approximate barycenter is constructed as a discrete distribution with $\sum_{s \in \mathcal{S}} n_s$ supports defined as follows:

$$\tilde{\nu}_0 = \sum_{s \in \mathcal{S}} \sum_{i=1}^{n_s} \frac{p_s}{n_s} \delta_{M(f^*(x_i^s, s))}.$$

Compared to the exact barycenter, this approximation requires solving only $|\mathcal{S}|(|\mathcal{S}| - 1)/2$ optimal transport problems between two discrete distributions and thus has a polynomial time complexity. This procedure is formally stated in Algorithm 1.

**Algorithm 1: Approximate Barycenter**
1. **Input:** a mapping $f^* : \mathbb{R}^d \times \mathcal{S} \to \mathbb{R}^k$ and a dataset $D = \{(x_i, s_i)\}_{i=1}^n$ sampled from the distribution of $(X, S)$.
2. Partition $D$ into subsets based on $s$ and obtain $D_s = \{(x_i^s, s)\}_{i=1}^{n_s}$ for $s \in \mathcal{S}$.
3. Let $p_s = \frac{n_s}{\sum_{s' \in \mathcal{S}} n_{s'}}$ for $s \in \mathcal{S}$.
4. **For** $1 \leq s < s' \leq |\mathcal{S}|$: solve (11) with $\mu = \nu_{f^*}(D_s)$ and $\nu = \nu_{f^*}(D_{s'})$ to obtain $T_{s,s'} = T_{\nu_{f^*}(D_s),\nu_{f^*}(D_{s'})}$.
5. Construct the mapping $M(f^*(x_i^s, s)) = \sum_{s' \in \mathcal{S}} p_{s'}\, \mathbb{E}\, T_{s,s'}(f^*(x_i^s, s))$.
6. **Output:** the discrete distribution $\tilde{\nu}_0 = \sum_{s \in \mathcal{S}} \sum_{i=1}^{n_s} \frac{p_s}{n_s} \delta_{M(f^*(x_i^s, s))}$.

Let $m = |\mathcal{S}|$ and $\nu_s = \nu_{f^*}(D_s)$ in (4), and let $\Psi(\nu)$ denote the objective function of (4). Let $\hat{\nu}_0$ be the optimal solution of (4), i.e., the barycenter of $\{\nu_{f^*}(D_s)\}_{s \in \mathcal{S}}$. It is shown by Lindheim (2023) that $\Psi(\tilde{\nu}_0)/\Psi(\hat{\nu}_0) \leq 2$. This bound is significant because no polynomial-time algorithm can achieve a ratio arbitrarily close to one with high probability (Altschuler & Boix-Adsera, 2022).

Although Proposition 1 is derived for continuous distributions, it motivates a heuristic post-processing method that updates the output $f^*(x_i^s, s)$ to

$$f_\alpha(x_i^s, s) := \alpha f^*(x_i^s, s) + (1 - \alpha)\, T_{\nu_{f^*}(D_s),\tilde{\nu}_0}(f^*(x_i^s, s)) \tag{13}$$

for $i = 1, \dots, n_s$ and $s \in \mathcal{S}$, where $\alpha \in [0, 1]$. Note that $T_{\nu_{f^*}(D_s),\tilde{\nu}_0}$ is not obtained during Algorithm 1 and needs to be solved separately. When $\alpha = 0$ and $(x_i^s, s)$ is uniformly randomly sampled from $D_s$, $f_\alpha(x_i^s, s)$ has the same distribution for any $s$, indicating that the post-processed outputs satisfy distributional parity. However, (13) is only defined for existing data in $D_s$ for $s \in \mathcal{S}$. In the next section, we propose a method to extend this processing scheme to new samples from $(X, S)$.

### 5.3 Post-Processing Out-of-Sample Data

Note that $f_\alpha(x_i^s, s)$ is defined for $(x_i^s, s) \in D_s$ only because the optimal transport mapping $T_{\nu_{f^*}(D_s),\tilde{\nu}_0}$ in (13) is only defined on $f^*(x_i^s, s)$ with $(x_i^s, s) \in D_s$. To extend the definition of $f_\alpha$ from $D_s$ to any $(x, s) \in \mathbb{R}^d \times \mathcal{S}$, we extrapolate $T_{\nu_{f^*}(D_s),\tilde{\nu}_0}$ over $\mathbb{R}^k$ using the Nadaraya-Watson kernel regression method (Nadaraya, 1964). Let $K : \mathbb{R}^k \to \mathbb{R}_+$ be a kernel function satisfying $K(z) = K(-z)$ and $\int_{\mathbb{R}^k} K(z)\,dz = 1$. Let $h > 0$ be a bandwidth. For any $(x, s) \in \mathbb{R}^d \times \mathcal{S}$, the kernel regression estimator of $T_{\nu_{f^*}(D_s),\tilde{\nu}_0}(f^*(x, s))$ is

$$\tilde{f}(x, s) := \tilde{T}_{\nu_{f^*}(D_s),\tilde{\nu}_0}(f^*(x, s)) := \frac{\sum_{i=1}^{n_s} K\big((f^*(x, s) - f^*(x_i^s, s))/h\big)\, T_{\nu_{f^*}(D_s),\tilde{\nu}_0}(f^*(x_i^s, s))}{\sum_{j=1}^{n_s} K\big((f^*(x, s) - f^*(x_j^s, s))/h\big)}. \tag{14}$$

Here, $\tilde{f}(x, s)$ is the post-processed prediction for data $(x, s)$. Recall that $T_{\nu_{f^*}(D_s),\tilde{\nu}_0}$ is a random mapping (see (12)), and so is $\tilde{T}_{\nu_{f^*}(D_s),\tilde{\nu}_0}$.
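Before turning to the fairness analysis, here is a minimal sketch of Algorithm 1 on empirical outputs, using the POT library as in the earlier sketch. All names (e.g., `outputs_by_group`) are illustrative; the expectation $\mathbb{E}\,T_{s,s'}$ under (12) is computed as the barycentric projection of the optimal coupling, and $T_{s,s}$ is taken to be the identity.

```python
# Sketch of Algorithm 1: approximate barycenter supports via pairwise OT.
import numpy as np
import ot  # POT: Python Optimal Transport

def approx_barycenter_supports(outputs_by_group):
    """Return the supports M(f*(x_i^s, s)) of the approximate barycenter,
    given a dict mapping each group s to an (n_s, k) array of model outputs."""
    groups = list(outputs_by_group)
    n = sum(len(v) for v in outputs_by_group.values())
    p = {s: len(v) / n for s, v in outputs_by_group.items()}
    supports = {}
    for s in groups:
        xs = outputs_by_group[s]
        m = p[s] * xs.copy()                 # T_{s,s} is the identity mapping
        for s2 in groups:
            if s2 == s:
                continue
            ys = outputs_by_group[s2]
            gamma = ot.emd(ot.unif(len(xs)), ot.unif(len(ys)), ot.dist(xs, ys))
            # Row-normalizing gamma gives E[T_{s,s2}(x_i)] under (12)
            # (the barycentric projection of the coupling).
            m += p[s2] * (gamma / gamma.sum(axis=1, keepdims=True)) @ ys
        supports[s] = m                      # one barycenter support per training point
    return supports
```

Weighting the support $M(f^*(x_i^s, s))$ by $p_s/n_s$ then yields the discrete distribution $\tilde{\nu}_0$.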
To analyze the fairness of $\tilde{f}(x, s)$ on the out-of-sample data distribution, we present the following proposition, which characterizes how much $\tilde{f}(X, S)$ violates distributional parity after post-processing, by bounding $U(\tilde{f}(X, S))$ and the Wasserstein distance between the distributions of $\tilde{f}(X, S)$ on different groups. The proof is given in Section A.2.

**Proposition 2** Suppose $K(z) = \frac{1}{\sqrt{2\pi}}\exp(-\|z\|_2^2)$. Conditioning on the data $D = \{(x_i, s_i)\}_{i=1}^n$, it holds that

$$U(\tilde{f}(X, S)) \leq \sum_{s \in \mathcal{S}} O\left(p_s\left(n_s^2 \exp\left(-\frac{1}{h^4}\right) + W_2^2(\nu_{f^*|s}, \nu_{f^*}(D_s))\right)\right). \tag{15}$$

Moreover, let $\nu_{\tilde{f}|s}$ be the distribution of $\tilde{f}(X, S)$ conditioning on $S = s$ for $s \in \mathcal{S}$. It holds that

$$W_2^2(\nu_{\tilde{f}|s}, \nu_{\tilde{f}|s'}) \leq O\left((n_s^2 + n_{s'}^2) \exp\left(-\frac{1}{h^4}\right) + W_2^2(\nu_{f^*|s}, \nu_{f^*}(D_s)) + W_2^2(\nu_{f^*|s'}, \nu_{f^*}(D_{s'}))\right). \tag{16}$$

In both upper bounds above, the first term is the error due to extrapolation, which converges to zero as the bandwidth $h$ approaches zero. The second term is the error due to the sampling gap between the training data $D$ and the ground-truth data distribution, and is represented by the Wasserstein distance $W_2^2(\nu_{f^*|s}, \nu_{f^*}(D_s))$. As the data sizes $n_s$ for $s \in \mathcal{S}$ increase, the rate at which $W_2^2(\nu_{f^*|s}, \nu_{f^*}(D_s))$ converges to zero has been well studied in the literature. For example, according to case (3) of Theorem 2 in Fournier & Guillin (2015), if there exists $q > 4$ such that $\mathbb{E}_{(X,S)|S=s}\|f^*(X, S)\|_2^q < +\infty$, where $\mathbb{E}_{(X,S)|S=s}$ is the expectation over $(X, S)$ conditioning on $S = s$, then the Wasserstein distance in the second term of (15) and (16) satisfies the following concentration inequality:

$$\mathbb{P}\big(W_2^2(\nu_{f^*|s}, \nu_{f^*}(D_s)) \geq \eta\big) \leq a(n_s, \eta) + b(n_s, \eta) \quad \text{for any } \eta \in (0, 1), \tag{17}$$

where

$$a(n_s, \eta) = C\begin{cases}\exp(-c\, n_s \eta^2) & \text{if } k < 4,\\ \exp\big(-c\, n_s (\eta/\log(2 + 1/\eta))^2\big) & \text{if } k = 4,\\ \exp(-c\, n_s \eta^{k/2}) & \text{if } k > 4,\end{cases} \qquad b(n_s, \eta) = C\, n_s^{-(q-4)/q},$$

for some constants $C, c > 0$.

After extrapolating $T_{\nu_{f^*}(D_s),\tilde{\nu}_0}$ to $\tilde{T}_{\nu_{f^*}(D_s),\tilde{\nu}_0}$ defined above, we can extend (13) to any out-of-sample data by updating $f^*(x, s)$ to

$$f_\alpha(x, s) := \alpha f^*(x, s) + (1 - \alpha)\tilde{f}(x, s). \tag{18}$$

We formally present this post-processing method in Algorithm 2. Note that when $(x, s) \in D_s$, we still apply the original mapping in (13) instead of its extrapolation (18).

**Algorithm 2: Post-Processing Method by Transporting to Approximate Barycenter (TAB)**
1. **Input:** a data point $(x, s) \in \mathbb{R}^d \times \mathcal{S}$, a mapping $f^* : \mathbb{R}^d \times \mathcal{S} \to \mathbb{R}^k$, and a dataset $D = \{(x_i, s_i)\}_{i=1}^n$ sampled from the distribution of $(X, S)$.
2. Partition $D$ into subsets based on $s$ and obtain $D_s = \{(x_i^s, s)\}_{i=1}^{n_s}$ for $s \in \mathcal{S}$.
3. Compute $\tilde{\nu}_0$ by Algorithm 1.
4. **For** $s = 1, \dots, |\mathcal{S}|$: solve (11) with $\mu = \nu_{f^*}(D_s)$ and $\nu = \tilde{\nu}_0$ to obtain $T_{\nu_{f^*}(D_s),\tilde{\nu}_0}$.
5. **If** $(x, s) \in D_s$, compute $f_\alpha(x, s)$ as in (13); **else** compute $f_\alpha(x, s)$ as in (18).
6. **Output:** $f_\alpha(x, s)$.

Our post-processing method can also be extended to improve the fairness of a model under the notion of distributionally equal opportunity or distributionally equal odds introduced in the remark after Definition 1. In particular, to improve fairness in terms of distributionally equal odds, we can partition the training set $D$ into $|\mathcal{Y}|$ subsets $\{D_y\}_{y \in \mathcal{Y}}$ depending on the class label $Y$. Then we apply Algorithm 2 to each subset $D_y$ to construct a separate post-processing mapping $\tilde{f}_y(x, s)$ like (14) for each combination of group $s$ and class $y$. To apply it to an out-of-sample data point, we first generate a predicted label $\hat{y}$ using $f^*(x, s)$ and then convert $f^*(x, s)$ to $\tilde{f}_{\hat{y}}(x, s)$ using the mapping constructed on $D_{\hat{y}}$. The same procedure works for distributionally equal opportunity with respect to one class $y \in \mathcal{Y}$, except that we only need to construct the mapping on $D_y$ and post-process data for that one class $y$.
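A minimal sketch of the out-of-sample step combines the Nadaraya-Watson estimator (14) with the update (18). All names are illustrative: `z_train` holds a group's training outputs $f^*(x_i^s, s)$ and `t_train` the corresponding expected transported values toward $\tilde{\nu}_0$ (e.g., from the Algorithm 1 sketch above; using the expectation rather than random draws of the transport map is a deterministic surrogate we assume here). The Gaussian kernel's normalizing constant cancels in the ratio.

```python
# Sketch of post-processing a new point via (14) and (18).
import numpy as np

def postprocess_new_point(z_new, z_train, t_train, alpha=0.0, h=0.05):
    """z_new: f*(x, s) for a new x in group s; z_train: (n_s, k) training
    outputs of that group; t_train: (n_s, k) transported values toward the
    approximate barycenter."""
    # Gaussian kernel weights K((z_new - z_i)/h); the constant cancels below.
    w = np.exp(-np.sum(((z_new - z_train) / h) ** 2, axis=1))
    f_tilde = (w @ t_train) / w.sum()          # kernel regression estimate, eq. (14)
    return alpha * z_new + (1 - alpha) * f_tilde   # interpolation, eq. (18)
    # Caveat: for very small h all weights may underflow; in practice h is
    # tuned per input dimension (see Section 6).
```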
## 6 Experiments

In this section, we apply the proposed post-processing method to machine learning models with multiple outputs, including multi-label/multi-class classification and representation learning models, to evaluate its effectiveness.

**Datasets.** Our experiments include four datasets from various domains: marketing (Customer dataset, https://www.kaggle.com/datasets/kaushiksuresh147/customer-segmentation), medical diagnosis (CheXpert dataset (Irvin et al., 2019)), and face recognition (CelebA dataset (Liu et al., 2015) and UTKFace dataset (Zhang & Qi, 2017)). The details of these datasets are provided in Appendix B.

**Baselines and Settings.** To verify its effectiveness, we compare our method against six baseline approaches, including four post-processing methods: (1) Hu et al. (2023), which introduces a post-processing approach for incorporating fairness into multi-task learning using multi-marginal Wasserstein barycenters; (2) Xian et al. (2023), which characterizes the inherent tradeoff of demographic parity in multi-class classification and proposes an effective post-processing method; (3) FRAPPE (Tifrea et al., 2024), which proposes a generic framework that converts any regularized in-processing method into a post-processing approach; (4) AdvDebiasing (Zhang et al., 2018), which presents a framework for mitigating demographic group biases by introducing a group-related variable and jointly learning a predictor and an adversary; and two in-processing methods: (5) SimFair (Liu et al., 2023), which focuses on training models for fairness-aware multi-label classification; (6) f-FERM (Baharlouei et al., 2024), which presents a unified stochastic optimization framework for fair empirical risk minimization regularized by f-divergence measures.

Specifically, for the baseline AdvDebiasing, the Predictor block and the Adversary block are each implemented as a two-layer neural network with 128 hidden units and a tanh activation function. For the baselines FRAPPE, SimFair, and AdvDebiasing, we train the models for 60 epochs with the Adam optimizer and a batch size of 64, and tune the learning rate in {1e-3, 1e-4}. We vary their weight parameter $\lambda$ (or $\alpha$) in {0.1, 1, 2, 4, 8, 10} to show the trade-off between classification performance and fairness. For f-FERM, we follow their paper to vary the weight parameter $\lambda$ in {0.1, 1, 10, 50, 100, 150} and tune the learning rate in {0.1, 0.01, 0.001} with their proposed optimization algorithm. For Hu et al. (2023), we vary the parameter $\alpha$ of their method in {0, 0.2, 0.4, 0.6, 0.8, 1.0}. Following their paper, the parameter $\alpha$ for Xian et al. (2023) is varied in {1, 0.16, 0.14, 0.12, 0.1, 0.08, 0.06, 0.04, 0.02, 0.01, 0.008, 0.006, 0.004, 0.002, 0.001, 0.0}. For our method, we use a Gaussian kernel, and $h$ is chosen from {0.02, 0.04, 0.5, 1} based on the input dimension, as a smaller $h$ theoretically and empirically leads to better performance but a too small $h$ may cause numerical issues. We vary $\alpha$ for our method in {0, 0.2, 0.4, 0.6, 0.8, 1.0}. All experiments are run five times with different seeds.

### 6.1 Multi-label classification

For multi-label classification tasks, we experiment on the CelebA and CheXpert datasets. We compare our method with SimFair (Liu et al., 2023), FRAPPE (Tifrea et al., 2024), AdvDebiasing, and the method of Hu et al. (2023), which essentially applies the post-processing method for a single-output model in Chzhen et al. (2020); Chzhen & Schreuder (2022) independently to each coordinate of the output of a multi-task classification model.
First, one ResNet50 (He et al., 2016) and one DenseNet121 (Huang et al., 2017) are trained on the CelebA and CheXpert data, respectively. Then the post-processing methods Hu et al. (2023), AdvDebiasing, and our proposed method are applied to the predicted probabilities of each task, while SimFair (Liu et al., 2023) and FRAPPE (Tifrea et al., 2024) are applied to the extracted image features to train a new linear classification head. For both FRAPPE and SimFair, the fairness regularizer is defined as described in Equation 1 of Liu et al. (2023), as it has been demonstrated to be effective for promoting demographic parity.

The results are summarized in Fig. 2(a) and 2(b). We observe that, at the same level of accuracy, our method achieves much lower unfairness than the baselines, which can be seen clearly when drawing horizontal lines across the figures. In particular, on the CheXpert dataset, our method delivers fairer predictions with only a negligible decrease in predictive performance. From Fig. 2(a) and 2(b), we also note that learning-based baselines such as FRAPPE, SimFair, and AdvDebiasing tend to achieve lower unfairness only at the cost of a substantial drop in accuracy, whereas post-processing methods like Hu et al. (2023) and ours maintain a better balance. Furthermore, we observe that, with the same fairness regularizer, the post-processing method FRAPPE achieves comparable fairness-error trade-offs to the in-processing technique SimFair, aligning with the findings reported in Tifrea et al. (2024).

### 6.2 Multi-class classification

To verify the effectiveness of our proposed method on multi-class classification tasks, we apply our method to the Customer dataset. A classic multi-class logistic regression model is first built on the training data, and then the post-processing methods are applied to the predicted probabilities of each class. For the in-processing methods, the multi-class logistic regression models are trained with a fairness regularizer from scratch. We compare our method with FRAPPE (Tifrea et al., 2024), f-FERM (Baharlouei et al., 2024), Xian et al. (2023), and AdvDebiasing. Note that the method of Xian et al. (2023) directly produces class labels instead of a score function, so we can only perform the comparison on classification accuracy and the unfairness based on Equation (2), which is in favor of their method.

[Figure 2: Multi-label classification on the CelebA dataset (a) and the CheXpert dataset (b); (c) multi-class classification on the Customer dataset. Panels (a)/(b) plot performance against unfairness measured by the Wasserstein distance to the barycenter, and panel (c) against unfairness on predicted classes, for TAB (ours), Hu et al. (2023), AdvDebiasing, FRAPPE, SimFair, Xian et al. (2023), and f-FERM.]
For FRAPPE, the fairness regularizer is defined as Equation (2) proposed by Xian et al. (2023). We present the results in Fig. 2(c). The figure demonstrates that our method effectively reduces the unfairness measured by the metric proposed by Xian et al. (2023) and achieves a superior balance between accuracy and unfairness compared to the baselines, particularly when a higher level of fairness is required. Besides, we notice that although it is implied that f-FERM (Baharlouei et al., 2024) and FRAPPE (Tifrea et al., 2024) are applicable to multi-class classification, their respective papers do not include experiments for this scenario, and in our experiments their methods fail to mitigate the unfairness defined by Equation (2), even while incurring a significant, undesirable performance decrease.

### 6.3 Representation learning

In this part, we explore fair representation learning for self-supervised learning (SSL) models and large pre-trained foundation models. We experiment on the CelebA and UTKFace datasets. Due to the scarcity of post-processing methods for representation learning in the existing literature, we again compare our method with the post-processing method of Hu et al. (2023) and AdvDebiasing.

[Figure 3: t-SNE visualization of representations on the CelebA dataset (groups: Female Old, Female Young, Male Old, Male Young). (a) Raw representations from an SSL model; (b) representations after post-processing ($\alpha = 0$) by Hu et al. (2023); (c) representations after post-processing ($\alpha = 0$) with TAB (ours); (d) performance on downstream tasks versus unfairness on Wasserstein distance to the barycenter.]

[Figure 4: t-SNE visualization of representations on the UTKFace dataset (groups: White, Black, Asian, Indian, Others). (a) Raw representations from a CLIP model; (b) representations after post-processing ($\alpha = 0$) with Hu et al. (2023); (c) representations after post-processing ($\alpha = 0$) with TAB (ours); (d) performance on downstream tasks versus unfairness on Wasserstein distance to the barycenter.]

For the CelebA dataset, we first train an SSL model to learn representations with a dimension of 128 by employing the algorithm proposed in Yuan et al. (2022) on the whole training dataset. Then we apply the post-processing methods to eliminate the sensitive information in the raw representations. Specifically, for AdvDebiasing, we utilize the raw representations as the labels for the Predictor, as there are no task-specific labels in self-supervised learning. In other words, we aim for the model to remove sensitive information while altering the raw representations as little as possible. With the processed representations, we also build a multi-label logistic regression model to evaluate the performance on downstream tasks. For the UTKFace dataset, raw representations with a dimension of 512 are generated by a released CLIP (ViT-B/16) model (Radford et al., 2021), which is pre-trained on 400M text-image pairs. We evaluate the unfairness of the representations as well as the accuracy on the downstream tasks.

The results are summarized in Fig. 3 and 4. Examining the representations in Fig. 3(a) and 4(a), it is evident that both the SSL model and the CLIP model disclose sensitive attributes prominently.
Even after applying the post-processing techniques proposed in Hu et al. (2023), this exposure persists, as depicted in Fig. 3(b) and Fig. 4(b). However, through our proposed method, illustrated in Fig. 3(c) and Fig. 4(c), we are able to achieve fairer representations with respect to sensitive groups. Notably, from Fig. 3(d) and 4(d), we observe that our method achieves a better tradeoff between downstream performance and distributional parity, whereas Hu et al. (2023) fails to improve distributional parity because, as discussed for Fig. 1, eliminating unfairness in individual outputs may not mitigate unfairness in the joint distribution of the outputs.

## 7 Conclusion and Discussion

In this paper, we have proposed a post-processing method to enhance fairness for multi-output machine learning models, which is underexplored in the literature. Our approach employs optimal transport to move a model's outputs across different groups towards their empirical Wasserstein barycenter to achieve the model's distributional parity. We have developed an approximation technique to reduce the complexity of computing the exact barycenter and a kernel regression method for extending this process to out-of-sample data. Extensive experimental results on multi-label/multi-class classification and representation learning demonstrate the effectiveness of our method.

One limitation of this work is that the notion of fairness in Definition 1 that we pursue is strong, while a weaker notion such as (2) might be sufficient for some specific applications. Achieving a stronger sense of fairness may lead to a larger decrease in predictive performance than targeting a weaker one. Therefore, the applicability of the proposed method may vary depending on the specific use case. Additionally, the lack of a theoretical convergence analysis of the proposed method as the training sample size increases is another limitation, which is an important direction for future work.

**Broader Impact Statement**

This paper aims to attain distributional fairness in machine-learning-driven decision-making, particularly concerning various demographic groups such as gender, age, and race. Enforcing group fairness may potentially disadvantage certain individuals. Other than this, we feel nothing must be specifically highlighted here.

**Acknowledgments**

We thank anonymous reviewers for constructive comments. G. Li, Q. Lin, and T. Yang were partially supported by NSF Award 2147253.

## References

Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. A reductions approach to fair classification. In International Conference on Machine Learning, pp. 60–69. PMLR, 2018.

Alekh Agarwal, Miroslav Dudík, and Zhiwei Steven Wu. Fair regression: Quantitative definitions and reduction-based algorithms. In International Conference on Machine Learning, pp. 120–129. PMLR, 2019.

Martial Agueh and Guillaume Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.

Jason M Altschuler and Enric Boix-Adsera. Wasserstein barycenters are NP-hard to compute. SIAM Journal on Mathematics of Data Science, 4(1):179–203, 2022.

Alexander Amini, Ava P Soleimany, Wilko Schwarting, Sangeeta N Bhatia, and Daniela Rus. Uncovering and mitigating algorithmic bias through learned latent structure. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 289–295, 2019.
Ethan Anderes, Steffen Borgwardt, and Jacob Miller. Discrete Wasserstein barycenters: Optimal transport for discrete data. Mathematical Methods of Operations Research, 84:389–409, 2016.

Pranjal Awasthi, Matthäus Kleindessner, and Jamie Morgenstern. Equalized odds postprocessing under imperfect group information. In International Conference on Artificial Intelligence and Statistics, pp. 1770–1780. PMLR, 2020.

Sina Baharlouei, Shivam Patel, and Meisam Razaviyayn. f-FERM: A scalable framework for robust fair empirical risk minimization. In The Twelfth International Conference on Learning Representations, 2024.

Solon Barocas and Andrew D Selbst. Big data's disparate impact. California Law Review, pp. 671–732, 2016.

Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning: Limitations and Opportunities. MIT Press, 2023.

Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Advances in Neural Information Processing Systems, 29, 2016.

Toon Calders, Faisal Kamiran, and Mykola Pechenizkiy. Building classifiers with independency constraints. In 2009 IEEE International Conference on Data Mining Workshops, pp. 13–18. IEEE, 2009.

Ching-Yao Chuang and Youssef Mroueh. Fair mixup: Fairness via interpolation. arXiv preprint arXiv:2103.06503, 2021.

Evgenii Chzhen and Nicolas Schreuder. A minimax framework for quantifying risk-fairness trade-off in regression. The Annals of Statistics, 50(4):2416–2442, 2022.

Evgenii Chzhen, Christophe Denis, Mohamed Hebiri, Luca Oneto, and Massimiliano Pontil. Leveraging labeled and unlabeled data for consistent fair binary classification. Advances in Neural Information Processing Systems, 32, 2019.

Evgenii Chzhen, Christophe Denis, Mohamed Hebiri, Luca Oneto, and Massimiliano Pontil. Fair regression with Wasserstein barycenters. Advances in Neural Information Processing Systems, 33:7321–7331, 2020.

Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 797–806, 2017.

Sen Cui, Weishen Pan, Changshui Zhang, and Fei Wang. Bipartite ranking fairness through a model agnostic ordering adjustment. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.

Brian d'Alessandro, Cathy O'Neil, and Tom LaGatta. Conscientious classification: A data scientist's guide to discrimination-aware classification. Big Data, 5(2):120–134, 2017.

Amit Datta, Michael Carl Tschantz, and Anupam Datta. Automated experiments on ad privacy settings: A tale of opacity, choice, and discrimination. arXiv preprint arXiv:1408.6491, 2014.

Christophe Denis, Romuald Elie, Mohamed Hebiri, and François Hu. Fairness guarantee in multi-class classification. arXiv preprint arXiv:2109.13642, 2021.

Mengnan Du, Subhabrata Mukherjee, Guanchu Wang, Ruixiang Tang, Ahmed Awadallah, and Xia Hu. Fairness via representation neutralization. Advances in Neural Information Processing Systems, 34:12091–12103, 2021.

Flavio du Pin Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R Varshney. Data pre-processing for discrimination prevention: Information-theoretic optimization and analysis. IEEE Journal of Selected Topics in Signal Processing, 12(5):1106–1119, 2018.

Cynthia Dwork and Christina Ilvento. Fairness under composition. arXiv preprint arXiv:1806.06122, 2018.
Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226, 2012.

David A Edwards. On the Kantorovich–Rubinstein theorem. Expositiones Mathematicae, 29(4):387–398, 2011.

Anthony W Flores, Kristin Bechtel, and Christopher T Lowenkamp. False positives, false negatives, and false analyses: A rejoinder to "Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks." Fed. Probation, 80:38, 2016.

Nicolas Fournier and Arnaud Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3):707–738, 2015.

Solenne Gaucher, Nicolas Schreuder, and Evgenii Chzhen. Fair learning with Wasserstein barycenters for nondecomposable performance measures. In International Conference on Artificial Intelligence and Statistics, pp. 2436–2459. PMLR, 2023.

Gabriel Goh, Andrew Cotter, Maya Gupta, and Michael P Friedlander. Satisfying real-world goals with dataset constraints. Advances in Neural Information Processing Systems, 29, 2016.

Ziyu Gong, Ben Usman, Han Zhao, and David I Inouye. Towards practical non-adversarial distribution matching. In International Conference on Artificial Intelligence and Statistics, pp. 4276–4284. PMLR, 2024.

Thibaut Le Gouic, Jean-Michel Loubes, and Philippe Rigollet. Projection to fairness in statistical learning. arXiv preprint arXiv:2005.11720, 2020.

Sungwon Han, Seungeon Lee, Fangzhao Wu, Sundong Kim, Chuhan Wu, Xiting Wang, Xing Xie, and Meeyoung Cha. DualFair: Fair representation learning at both group and individual levels via contrastive self-supervision. In Proceedings of the ACM Web Conference 2023, pp. 3766–3774, 2023.

Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems, 29, 2016.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Youngkyu Hong and Eunho Yang. Unbiased classification through bias-contrastive and bias-balanced learning. Advances in Neural Information Processing Systems, 34:26449–26461, 2021.

François Hu, Philipp Ratz, and Arthur Charpentier. Fairness in multi-task learning via Wasserstein barycenters. arXiv preprint arXiv:2306.10155, 2023.

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708, 2017.

Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 590–597, 2019.

Scott Jaschik. Admissions offices, cautiously, start using AI. https://www.insidehighered.com/news/admissions/2023/05/15/admissions-offices-cautiously-start-using-ai, 2023. Accessed: 2023-05-15.

Ray Jiang, Aldo Pacchiano, Tom Stepleton, Heinrich Jiang, and Silvia Chiappa. Wasserstein fair classification. In Uncertainty in Artificial Intelligence, pp. 862–872. PMLR, 2020.
Faisal Kamiran and Toon Calders. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1–33, 2012.

Byungju Kim, Hyunwoo Kim, Kyungsu Kim, Sungjin Kim, and Junmo Kim. Learning not to learn: Training deep neural networks with biased data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9012–9020, 2019.

Kweku Kwegyir-Aggrey, Jessica Dai, A Feder Cooper, John Dickerson, Keegan Hines, and Suresh Venkatasubramanian. Repairing regressors for fair binary classification at any decision threshold. In NeurIPS 2023 Workshop on Optimal Transport and Machine Learning, 2023.

Johannes von Lindheim. Simple approximative algorithms for free-support Wasserstein barycenters. Computational Optimization and Applications, 85(1):213–246, 2023.

Zachary Lipton, Julian McAuley, and Alexandra Chouldechova. Does mitigating ML's impact disparity require treatment disparity? Advances in Neural Information Processing Systems, 31, 2018.

Tianci Liu, Haoyu Wang, Yaqing Wang, Xiaoqian Wang, Lu Su, and Jing Gao. SimFair: A unified framework for fairness-aware multi-label classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 14338–14346, 2023.

Weiwei Liu, Donna Xu, Ivor W Tsang, and Wenjie Zhang. Metric learning for multi-output tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):408–422, 2018.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), December 2015.

Pranay K Lohia, Karthikeyan Natesan Ramamurthy, Manish Bhide, Diptikalyan Saha, Kush R Varshney, and Ruchir Puri. Bias mitigation post-processing for individual and group fairness. In ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2847–2851. IEEE, 2019.

Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. The variational fair autoencoder. arXiv preprint arXiv:1511.00830, 2015.

David Madras, Elliot Creager, Toniann Pitassi, and Richard Zemel. Learning adversarially fair and transferable representations. In International Conference on Machine Learning, pp. 3384–3393. PMLR, 2018.

Sangwoo Mo, Hyunwoo Kang, Kihyuk Sohn, Chun-Liang Li, and Jinwoo Shin. Object-aware contrastive learning for debiased scene representation. Advances in Neural Information Processing Systems, 34:12251–12264, 2021.

Elizbar A Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141–142, 1964.

Cathy O'Neil. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown, 2017.

Sungho Park, Jewook Lee, Pilhyeon Lee, Sunhee Hwang, Dohyung Kim, and Hyeran Byun. Fair contrastive learning for facial attribute classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10389–10398, 2022.

Felix Petersen, Debarghya Mukherjee, Yuekai Sun, and Mikhail Yurochkin. Post-processing for individual fairness. Advances in Neural Information Processing Systems, 34:25944–25955, 2021.

Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport. Center for Research in Economics and Statistics Working Papers, (2017-86), 2017.

Qi Qi, Quanqi Hu, Qihang Lin, and Tianbao Yang. Provable optimization for adversarial fair self-supervised contrastive learning. arXiv preprint arXiv:2406.05686, 2024.
Provable optimization for adversarial fair self-supervised contrastive learning. arXiv preprint arXiv:2406.05686, 2024.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.

Inioluwa Deborah Raji and Joy Buolamwini. Actionable auditing: Investigating the impact of publicly naming biased performance results of commercial AI products. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 429–435, 2019.

Vikram V. Ramaswamy, Sunnie S. Y. Kim, and Olga Russakovsky. Fair attribute classification through latent space de-biasing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9301–9310, 2021.

Filippo Santambrogio. Optimal transport for applied mathematicians. Birkhäuser, NY, 55(58-63):94, 2015.

Mhd Hasan Sarhan, Nassir Navab, Abouzar Eslami, and Shadi Albarqouni. Fairness by learning orthogonal disentangled representations. In Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXIX, pp. 746–761. Springer, 2020.

Nicolas Schreuder and Evgenii Chzhen. Classification with abstention but without disparities. In Uncertainty in Artificial Intelligence, pp. 1227–1236. PMLR, 2021.

Tom Simonite. Probing the dark side of Google's ad-targeting system. MIT Technology Review, 2015.

Alexandru Tifrea, Preethi Lahoti, Ben Packer, Yoni Halpern, Ahmad Beirami, and Flavien Prost. FRAPPÉ: A group fairness framework for post-processing everything. In Forty-first International Conference on Machine Learning, 2024.

Robin Vogel, Aurélien Bellet, and Stephan Clémençon. Learning fair scoring functions: Bipartite ranking under ROC-based fairness constraints. In International Conference on Artificial Intelligence and Statistics, pp. 784–792. PMLR, 2021.

Ruicheng Xian, Lang Yin, and Han Zhao. Fair and optimal classification via post-processing. In International Conference on Machine Learning, pp. 37977–38012. PMLR, 2023.

Qizhe Xie, Zihang Dai, Yulun Du, Eduard Hovy, and Graham Neubig. Controllable invariance through adversarial feature learning. Advances in Neural Information Processing Systems, 30, 2017.

Donna Xu, Yaxin Shi, Ivor W. Tsang, Yew-Soon Ong, Chen Gong, and Xiaobo Shen. Survey on multi-output learning. IEEE Transactions on Neural Networks and Learning Systems, 31(7):2409–2429, 2019.

Zhenhuan Yang, Yan Lok Ko, Kush R. Varshney, and Yiming Ying. Minimax AUC fairness: Efficient algorithm with provable convergence. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 11909–11917, 2023.

Yao Yao, Qihang Lin, and Tianbao Yang. Stochastic methods for AUC optimization subject to AUC-based fairness constraints. In International Conference on Artificial Intelligence and Statistics, pp. 10324–10342. PMLR, 2023.

Zhuoning Yuan, Yuexin Wu, Zi-Hao Qiu, Xianzhi Du, Lijun Zhang, Denny Zhou, and Tianbao Yang. Provable stochastic optimization for global contrastive learning: Small batch does not harm performance. In International Conference on Machine Learning, pp. 25760–25782. PMLR, 2022.

Xianli Zeng, Edgar Dobriban, and Guang Cheng. Bayes-optimal classifiers under group fairness. arXiv preprint arXiv:2202.09724, 2022.

Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell.
Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 335–340, 2018.

Fengda Zhang, Kun Kuang, Long Chen, Yuxuan Liu, Chao Wu, and Jun Xiao. Fairness-aware contrastive learning with partially annotated sensitive attributes. In The Eleventh International Conference on Learning Representations, 2022.

Zhifei Zhang, Yang Song, and Hairong Qi. Age progression/regression by conditional adversarial autoencoder. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.

A Proofs

Before providing the proof of Theorem 1, let us review an additional result from optimal transport theory. It shows that, as long as the source measure admits a density and both measures have finite second moments, there exists a unique deterministic optimal transport map between them.

Lemma 1 (Theorem 1.22 in Santambrogio (2015)) Let $\mu, \nu$ be two measures on $\mathbb{R}^k$ with finite second moments such that $\mu$ has a density, and let $X \sim \mu$. Then there exists a unique deterministic mapping $T: \mathbb{R}^k \to \mathbb{R}^k$ such that $W_2^2(\mu, \nu) = \mathbb{E}\|X - T(X)\|_2^2$, that is, $(X, T(X)) \sim \gamma \in \Gamma_{\mu,\nu}$, where $\gamma$ is an optimal coupling.

A.1 Proof of Theorem 1

This proof adapts the proof of Theorem 2.3 in Chzhen et al. (2020), extending their result from the single-output case to the multi-output case with some modifications.

Theorem 1 Suppose $\nu_{f^*|s}$ has a density and finite second moments for each $s \in \mathcal{S}$. Then
$$\min_{U(f)=0} R(f) = \min_{\nu} \sum_{s \in \mathcal{S}} p_s W_2^2(\nu_{f^*|s}, \nu). \tag{7}$$
Moreover, if $f_0$ and $\nu_0$ solve the first and second minimization in (7), respectively, then $\nu_0$ is the distribution of $f_0(X, S)$ and $f_0(x, s) = T_{\nu_{f^*|s}, \nu_0}(f^*(x, s))$, where $T_{\nu_{f^*|s}, \nu_0}: \mathbb{R}^k \to \mathbb{R}^k$ is the optimal transport map from $\nu_{f^*|s}$ to $\nu_0$.

Proof First, we show that
$$\min_{U(f)=0} \mathbb{E}\|f^*(X, S) - f(X, S)\|^2 \geq \min_{\nu} \sum_{s \in \mathcal{S}} p_s W_2^2(\nu_{f^*|s}, \nu).$$
Let $g: \mathbb{R}^d \times \mathcal{S} \to \mathbb{R}^k$ be the minimizer of the left-hand side and denote by $\nu_g$ the distribution of $g(X, S)$. Since $\nu_{f^*|s}$ admits a density, by Lemma 1, for each $s \in \mathcal{S}$ there exists an optimal transport map $T_{f^*|s, g}$ from $\nu_{f^*|s}$ to $\nu_g$ such that
$$\sum_{s \in \mathcal{S}} p_s W_2^2(\nu_{f^*|s}, \nu_g) = \sum_{s \in \mathcal{S}} p_s \int \|z - T_{f^*|s, g}(z)\|^2 \, d\nu_{f^*|s}(z) = \sum_{s \in \mathcal{S}} p_s \int_{\mathbb{R}^d} \|f^*(x, s) - T_{f^*|s, g}(f^*(x, s))\|^2 \, d\mathbb{P}_{X|S=s}(x)$$
$$= \sum_{s \in \mathcal{S}} p_s \, \mathbb{E}\big[\|f^*(X, s) - T_{f^*|s, g}(f^*(X, s))\|^2 \mid S = s\big] = \mathbb{E}\|f^*(X, S) - \hat{g}(X, S)\|^2,$$
where we define $\hat{g}(x, s) := T_{f^*|s, g}(f^*(x, s))$ for all $(x, s) \in \mathbb{R}^d \times \mathcal{S}$. By optimal transportation, $\hat{g}(X, s) \mid S = s$ follows the distribution $\nu_g$ for any $s \in \mathcal{S}$. Then we have
$$U(\hat{g}) = \min_{\nu} \sum_{s \in \mathcal{S}} p_s W_2^2(\nu_{\hat{g}|s}, \nu) = 0,$$
which indicates that $\hat{g}$ is fair. By the optimality of $g$, we have
$$\mathbb{E}\|f^*(X, S) - g(X, S)\|^2 \leq \mathbb{E}\|f^*(X, S) - \hat{g}(X, S)\|^2.$$
By the definition of $W_2^2$, for each $s \in \mathcal{S}$ we have
$$W_2^2(\nu_{f^*|s}, \nu_g) \leq \mathbb{E}\big[\|f^*(X, S) - g(X, S)\|^2 \mid S = s\big].$$
Then we can conclude that
$$\sum_{s \in \mathcal{S}} p_s W_2^2(\nu_{f^*|s}, \nu_g) = \min_{U(f)=0} \mathbb{E}\|f^*(X, S) - f(X, S)\|^2.$$
This implies that
$$\min_{U(f)=0} \mathbb{E}\|f^*(X, S) - f(X, S)\|^2 \geq \min_{\nu} \sum_{s \in \mathcal{S}} p_s W_2^2(\nu_{f^*|s}, \nu). \tag{19}$$

Second, we show that the opposite inequality also holds. To this end, we define $\nu_0$ as
$$\nu_0 \in \arg\min_{\nu} \sum_{s \in \mathcal{S}} p_s W_2^2(\nu_{f^*|s}, \nu).$$
Since we assume $\nu_{f^*|s}$ admits a density, by Lemma 1, there exists an optimal transport map $T_{\nu_{f^*|s}, \nu_0}$ from $\nu_{f^*|s}$ to $\nu_0$, and we define $f_0$ for all $(x, s) \in \mathbb{R}^d \times \mathcal{S}$ as
$$f_0(x, s) = T_{\nu_{f^*|s}, \nu_0}(f^*(x, s)).$$
By the definition of $f_0$ and Lemma 1, we have
$$\min_{\nu} \sum_{s \in \mathcal{S}} p_s W_2^2(\nu_{f^*|s}, \nu) = \mathbb{E}\|f^*(X, S) - f_0(X, S)\|^2. \tag{20}$$
Moreover, since $\nu_0$ is independent of $S$, using a similar argument as above we can show that $f_0$ is fair, which yields
$$\min_{\nu} \sum_{s \in \mathcal{S}} p_s W_2^2(\nu_{f^*|s}, \nu) \geq \min_{U(f)=0} \mathbb{E}\|f^*(X, S) - f(X, S)\|^2. \tag{21}$$
Therefore, combining (19) and (21), we have shown that
$$\min_{\nu} \sum_{s \in \mathcal{S}} p_s W_2^2(\nu_{f^*|s}, \nu) = \min_{U(f)=0} \mathbb{E}\|f^*(X, S) - f(X, S)\|^2.$$
Thanks to (20), we also have $\mathbb{E}\|f^*(X, S) - f_0(X, S)\|^2 = \mathbb{E}\|f^*(X, S) - g(X, S)\|^2$, and since $f_0$ is fair, we can take $g = f_0$. This concludes the proof.
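To make the construction in Theorem 1 concrete, the following sketch computes an empirical $W_2$ barycenter of per-group outputs and transports each group onto it, in the spirit of $f_0(x, s) = T_{\nu_{f^*|s}, \nu_0}(f^*(x, s))$. It is a minimal illustration using the POT library, not the authors' released implementation: the synthetic data, the barycenter support size, and the use of a barycentric projection of the discrete coupling in place of an exact Monge map are all our assumptions, and the interpolation parameter `alpha` mirrors the fairness-accuracy trade-off discussed in the paper.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

rng = np.random.default_rng(0)
# Synthetic 2-output scores for two sensitive groups (illustrative only).
groups = [rng.normal(0.3, 0.1, size=(200, 2)), rng.normal(0.6, 0.1, size=(150, 2))]
mweights = [np.ones(len(Z)) / len(Z) for Z in groups]
p = np.array([len(Z) for Z in groups], dtype=float)
p /= p.sum()  # group proportions p_s

# Free-support W2 barycenter of the per-group empirical distributions,
# weighted by p_s (support size 100 is an arbitrary choice).
bary = ot.lp.free_support_barycenter(
    groups, mweights, X_init=rng.normal(0.45, 0.1, size=(100, 2)), weights=p)

def to_barycenter(Z, bary, alpha=0.0):
    """Transport outputs Z toward the barycenter via the barycentric
    projection of the optimal coupling; alpha in [0, 1] interpolates back
    toward the original outputs (alpha=0 gives exact distributional parity)."""
    a = np.ones(len(Z)) / len(Z)
    b = np.ones(len(bary)) / len(bary)
    G = ot.emd(a, b, ot.dist(Z, bary))  # optimal coupling, squared-Euclidean cost
    projected = (G @ bary) / G.sum(axis=1, keepdims=True)
    return alpha * Z + (1.0 - alpha) * projected

fair_outputs = [to_barycenter(Z, bary) for Z in groups]
```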
A.2 Proof of Proposition 2

Proposition 2 Suppose $K(z) = \frac{1}{\sqrt{2\pi}} \exp(-z^2)$. Conditioning on the data $D = \{(x_i, s_i)\}_{i=1}^n$, it holds that
$$U(\tilde{f}(X, S)) \leq \sum_{s \in \mathcal{S}} O\left(p_s n_s^2 \exp\left(-\Delta_{f^*(D_s),\min}^2 / h^2\right) + \frac{p_s}{h^4} W_2^2(\nu_{f^*|s}, \nu_{f^*(D_s)})\right),$$
and, for any $s, s' \in \mathcal{S}$,
$$W_2^2(\nu_{\tilde{f}|s}, \nu_{\tilde{f}|s'}) \leq O\left(\sum_{u \in \{s, s'\}} n_u^2 \exp\left(-\Delta_{f^*(D_u),\min}^2 / h^2\right) + \frac{1}{h^4} W_2^2(\nu_{f^*|u}, \nu_{f^*(D_u)})\right).$$

Proof Recall that $(X, S)$ follows the ground-truth distribution. Let $\tilde{f}(x, s) := \tilde{T}_{\nu_{f^*(D_s)}, \bar{\nu}_0}(f^*(x, s))$ denote the post-processed model obtained from the (random) kernel-smoothed transport map in (14). Let $\nu_{\tilde{f}}$ be the distribution of $\tilde{f}(X, S)$, and let $\nu_{\tilde{f}|s}$ be the distribution of $\tilde{f}(X, S)$ conditioning on $S = s$ for $s \in \mathcal{S}$. Suppose $\bar{\nu}_0$ has $n_0$ support points, denoted by $\{\xi_i\}_{i=1}^{n_0} \subset \mathbb{R}^k$. Throughout the proof, we assume $D_s$ is already sampled and thus deterministic for each $s \in \mathcal{S}$.

By (5), we have
$$U(\tilde{f}) = \min_{\nu} \sum_{s \in \mathcal{S}} p_s W_2^2(\nu_{\tilde{f}|s}, \nu) \leq \sum_{s \in \mathcal{S}} p_s W_2^2(\nu_{\tilde{f}|s}, \bar{\nu}_0) \leq \sum_{s \in \mathcal{S}} 2 p_s W_2^2(\nu_{\tilde{f}|s}, \nu_{\tilde{f}(D_s)}) + \sum_{s \in \mathcal{S}} 2 p_s W_2^2(\nu_{\tilde{f}(D_s)}, \bar{\nu}_0), \tag{22}$$
where the second inequality follows from the triangle inequality for $W_2$. Next, we bound $W_2^2(\nu_{\tilde{f}|s}, \nu_{\tilde{f}(D_s)})$ and $W_2^2(\nu_{\tilde{f}(D_s)}, \bar{\nu}_0)$ from above separately.

Consider any $s \in \mathcal{S}$ and denote by $\nu_{\tilde{f}(D_s)}$ the discrete distribution of $\tilde{f}(X, S)$ when $(X, S)$ is uniformly sampled from $D_s$. Let $\phi: \mathbb{R}^k \to \mathbb{R}$ be any 1-Lipschitz continuous function, meaning that $|\phi(\xi) - \phi(\xi')| \leq \|\xi - \xi'\|_2$ for any $\xi, \xi' \in \mathbb{R}^k$. We have
$$\int_{\mathbb{R}^k} \phi(\xi) \, d\nu_{\tilde{f}(D_s)} = \frac{1}{n_s} \sum_{i=1}^{n_s} \mathbb{E}\,\phi\big(\tilde{T}(f^*(x_i^s, s))\big) \quad \text{and} \quad \int_{\mathbb{R}^k} \phi(\xi) \, d\bar{\nu}_0 = \frac{1}{n_s} \sum_{i=1}^{n_s} \mathbb{E}\,\phi\big(T(f^*(x_i^s, s))\big),$$
where we write $\tilde{T}$ and $T$ for $\tilde{T}_{\nu_{f^*(D_s)}, \bar{\nu}_0}$ and $T_{\nu_{f^*(D_s)}, \bar{\nu}_0}$, respectively. Consider any $i \in \{1, \ldots, n_s\}$ and let $\mathbb{E}_i$ denote the conditional expectation conditioning on the outcome of the random mapping $\tilde{T}(f^*(x_i^s, s))$. By (14) and the 1-Lipschitz continuity of $\phi$, we have
$$\Big|\mathbb{E}\,\phi\big(\tilde{T}(f^*(x_i^s, s))\big) - \mathbb{E}\,\phi\big(T(f^*(x_i^s, s))\big)\Big| \leq \mathbb{E}\Big|\mathbb{E}_i\,\phi\big(\tilde{T}(f^*(x_i^s, s))\big) - \mathbb{E}_i\,\phi\big(T(f^*(x_i^s, s))\big)\Big|$$
$$\leq \frac{\sum_{j=1, j \neq i}^{n_s} K\big(\|f^*(x_i^s, s) - f^*(x_j^s, s)\| / h\big)\, \mathbb{E}\,\mathbb{E}_i \big\|\tilde{T}(f^*(x_i^s, s)) - T(f^*(x_j^s, s))\big\|}{\sum_{j=1}^{n_s} K\big(\|f^*(x_i^s, s) - f^*(x_j^s, s)\| / h\big)} \leq 2 n_s \exp\left(-\Delta_{f^*(D_s),\min}^2 / h^2\right) \Delta_{\bar{\nu}_0,\max},$$
where
$$\Delta_{f^*(D_s),\min} := \min_{i, j = 1, \ldots, n_s,\, i \neq j} \|f^*(x_i^s, s) - f^*(x_j^s, s)\| \quad \text{and} \quad \Delta_{\bar{\nu}_0,\max} := \max_{j, j' = 1, \ldots, n_0,\, j \neq j'} \|\xi_j - \xi_{j'}\|.$$
As a result, we have
$$\left|\int_{\mathbb{R}^k} \phi(\xi) \, d\nu_{\tilde{f}(D_s)} - \int_{\mathbb{R}^k} \phi(\xi) \, d\bar{\nu}_0\right| \leq 2 n_s \exp\left(-\Delta_{f^*(D_s),\min}^2 / h^2\right) \Delta_{\bar{\nu}_0,\max}$$
for any $\phi$ that is 1-Lipschitz continuous on $\mathbb{R}^k$. By Kantorovich-Rubinstein duality, we have
$$W_2^2(\nu_{\tilde{f}(D_s)}, \bar{\nu}_0) \leq n_s^2 \exp\left(-\Delta_{f^*(D_s),\min}^2 / h^2\right) \Delta_{\bar{\nu}_0,\max}^2 = O\left(n_s^2 \exp\left(-\Delta_{f^*(D_s),\min}^2 / h^2\right)\right). \tag{23}$$

Again, let $\phi: \mathbb{R}^k \to \mathbb{R}$ be any 1-Lipschitz continuous function, and let $\mathbb{E}_T$ denote the expectation over the random mappings $\tilde{T}(f^*(x_i^s, s))$ for $i = 1, \ldots, n_s$. We have
$$\int_{\mathbb{R}^k} \phi(\xi) \, d\nu_{\tilde{f}(D_s)} = \int_{\mathbb{R}^k} \mathbb{E}_T \Phi(\xi) \, d\nu_{f^*(D_s)} \quad \text{and} \quad \int_{\mathbb{R}^k} \phi(\xi) \, d\nu_{\tilde{f}|s} = \int_{\mathbb{R}^k} \mathbb{E}_T \Phi(\xi) \, d\nu_{f^*|s},$$
where
$$\Phi(\xi) := \phi\left(\frac{\sum_{i=1}^{n_s} K\big(\|\xi - f^*(x_i^s, s)\| / h\big)\, \tilde{T}(f^*(x_i^s, s))}{\sum_{i=1}^{n_s} K\big(\|\xi - f^*(x_i^s, s)\| / h\big)}\right)$$
is a random value function whose randomness comes from $\tilde{T}(f^*(x_i^s, s))$ for $i = 1, \ldots, n_s$. Since $K$ is $O(1/h^2)$-Lipschitz continuous and $\phi$ is 1-Lipschitz continuous, $\Phi(\xi)$ is $O(1/h^2)$-Lipschitz continuous in $\xi$ given any outcomes of the random mappings, and therefore $\mathbb{E}_T \Phi(\xi)$ is also $O(1/h^2)$-Lipschitz continuous. By the Kantorovich-Rubinstein inequality (Edwards, 2011), there exists a constant $C$ such that
$$\left|\int_{\mathbb{R}^k} \phi(\xi) \, d\nu_{\tilde{f}(D_s)} - \int_{\mathbb{R}^k} \phi(\xi) \, d\nu_{\tilde{f}|s}\right| = \left|\int_{\mathbb{R}^k} \mathbb{E}_T \Phi(\xi) \, d\nu_{f^*(D_s)} - \int_{\mathbb{R}^k} \mathbb{E}_T \Phi(\xi) \, d\nu_{f^*|s}\right| \leq \frac{C}{h^2} W_2(\nu_{f^*|s}, \nu_{f^*(D_s)}),$$
which implies
$$W_2^2(\nu_{\tilde{f}|s}, \nu_{\tilde{f}(D_s)}) \leq \frac{C^2}{h^4} W_2^2(\nu_{f^*|s}, \nu_{f^*(D_s)}). \tag{24}$$
Applying (23) and (24) to (22) gives the first conclusion.

Consider any two groups $s$ and $s'$ in $\mathcal{S}$. Applying (23) and (24) and the triangle inequality again, we have
$$W_2^2(\nu_{\tilde{f}|s}, \nu_{\tilde{f}|s'}) \leq 2 W_2^2(\nu_{\tilde{f}|s}, \bar{\nu}_0) + 2 W_2^2(\nu_{\tilde{f}|s'}, \bar{\nu}_0) \leq 4\left[W_2^2(\nu_{\tilde{f}|s}, \nu_{\tilde{f}(D_s)}) + W_2^2(\nu_{\tilde{f}(D_s)}, \bar{\nu}_0) + W_2^2(\nu_{\tilde{f}|s'}, \nu_{\tilde{f}(D_{s'})}) + W_2^2(\nu_{\tilde{f}(D_{s'})}, \bar{\nu}_0)\right]$$
$$\leq O\left(\sum_{u \in \{s, s'\}} n_u^2 \exp\left(-\Delta_{f^*(D_u),\min}^2 / h^2\right) + \frac{1}{h^4} W_2^2(\nu_{f^*|u}, \nu_{f^*(D_u)})\right),$$
which gives the second conclusion.
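The object analysed in Proposition 2 admits a simple out-of-sample rule: a fresh output $f^*(x, s)$ is mapped by Nadaraya-Watson averaging of the in-sample transported points with Gaussian kernel weights. Below is a minimal sketch in the spirit of the kernel-smoothed map in (14); the function and variable names are ours, and the bandwidth $h$ is a tuning parameter.

```python
import numpy as np

def kernel_extend(z_new, Z_train, T_train, h=0.1):
    """Extend the in-sample transport map to an out-of-sample point.

    z_new:   f*(x, s) for a fresh input x from group s, shape (k,)
    Z_train: in-sample outputs f*(x_i^s, s), shape (n_s, k)
    T_train: their transported images T(f*(x_i^s, s)), shape (n_s, k)
    """
    d = np.linalg.norm(Z_train - z_new, axis=1) / h
    w = np.exp(-d ** 2)   # Gaussian kernel K(z) proportional to exp(-z^2)
    w /= w.sum()          # Nadaraya-Watson weights
    return w @ T_train    # kernel-regression estimate of the transported output
```

Proposition 2 quantifies the price of this smoothing: shrinking $h$ drives the extra in-sample unfairness down at rate $n_s^2 \exp(-\Delta_{f^*(D_s),\min}^2 / h^2)$, but amplifies the empirical estimation error by a factor of order $1/h^4$.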
B Details of Datasets

In this section, we provide more details on the datasets used in our numerical experiments. We include four datasets from various domains: marketing (Customer dataset4), medical diagnosis (CheXpert dataset (Irvin et al., 2019)), and face recognition (CelebA dataset (Liu et al., 2015) and UTKFace dataset (Zhang et al., 2017)).

The Customer dataset has 8,068 training samples and 2,627 testing samples, and the task is to classify customers into anonymized customer categories for target marketing. We partition the data into four sensitive groups based on the gender and marital status of customers: married female, unmarried female, married male, and unmarried male.

The CheXpert dataset contains 224,316 training instances, and the task is to detect five chest and lung diseases from X-ray images. Due to the high computational cost of solving optimal transport between large datasets, we sample 5% of the instances from the original training data as the training set and another 5% as the testing set.

The CelebA dataset contains 162,770 training instances and 39,829 testing instances, and the task is to detect ten attributes (chosen following Ramaswamy et al. (2021)) of the person in an image: attractive, bags under eyes, black hair, bangs, eyeglasses, high cheekbones, smiling, wearing hat, mouth slightly open, and pointy nose. For the same computational reason, we sample 5% of the original training data as the training set and 20% of the original testing data as the testing set. For both the CheXpert and CelebA datasets, we partition the data into four sensitive groups based on gender and age: young female, old female, young male, and old male.

The UTKFace dataset consists of 23,705 face images with five sensitive groups in terms of race (White, Black, Asian, Indian, and Others), and we randomly split it into training and testing sets with an 8:2 ratio. The tasks are to classify gender and age (whether 25 ≤ age ≤ 60, a threshold defined by us) from face images. All data in the UTKFace dataset are used.

4https://www.kaggle.com/datasets/kaushiksuresh147/customer-segmentation
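For concreteness, the four sensitive groups of the Customer dataset described above could be constructed as in the following sketch. The column names Gender and Ever_Married are our assumption about the Kaggle CSV schema, and the file name is hypothetical.

```python
import pandas as pd

# Hypothetical local copy of the Kaggle customer segmentation training CSV.
df = pd.read_csv("customer_train.csv")

# Combine marital status and gender into one group label per row,
# e.g., "married_female", "unmarried_male".
married = df["Ever_Married"].map({"Yes": "married", "No": "unmarried"})
df["group"] = married + "_" + df["Gender"].str.lower()

# Row indices of each of the four sensitive groups.
groups = {g: sub.index.to_numpy() for g, sub in df.groupby("group")}
```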