Fair Regression with Wasserstein Barycenters

Evgenii Chzhen (LMO, Université Paris-Saclay, CNRS, Inria; evgenii.chzhen@universite-paris-saclay.fr), Christophe Denis (LAMA, Université Gustave Eiffel; MIA-Paris, AgroParisTech, INRAE, Université Paris-Saclay), Mohamed Hebiri (LAMA, Université Gustave Eiffel; CREST, ENSAE, IP Paris), Luca Oneto (DIBRIS, University of Genoa), Massimiliano Pontil (Istituto Italiano di Tecnologia; University College London)

Abstract

We study the problem of learning a real-valued function that satisfies the Demographic Parity constraint. It demands that the distribution of the predicted output be independent of the sensitive attribute. We consider the case where the sensitive attribute is available for prediction. We establish a connection between fair regression and optimal transport theory, from which we derive a closed form expression for the optimal fair predictor. Specifically, we show that the distribution of this optimum is the Wasserstein barycenter of the distributions induced by the standard regression function on the sensitive groups. This result offers an intuitive interpretation of the optimal fair prediction and suggests a simple post-processing algorithm to achieve fairness. We establish risk and distribution-free fairness guarantees for this procedure. Numerical experiments indicate that our method is very effective in learning fair models, with a relative increase in error that is smaller than the relative gain in fairness.

1 Introduction

A central goal of algorithmic fairness is to ensure that sensitive information does not unfairly influence the outcomes of learning algorithms. For example, if we wish to predict the salary of an applicant or the grade of a university student, we would like the algorithm to not unfairly use additional sensitive information such as gender or race. Since today's real-life datasets often contain discriminatory bias, standard machine learning methods behave unfairly. Therefore, a substantial effort is being devoted in the field to designing methods that satisfy fairness requirements while still optimizing prediction performance; see for example [5, 10, 13, 16, 18, 21, 23, 25, 26, 28, 32, 45-47, 49] and references therein.

In this paper we study the problem of learning a real-valued regression function which, among those complying with the Demographic Parity fairness constraint, minimizes the mean squared error. Demographic Parity requires the probability distribution of the predicted output to be independent of the sensitive attribute and has been used extensively in the literature, both in the context of classification and regression [1, 12, 20, 24, 34]. In this paper we consider the case where the sensitive attribute is available for prediction. Our principal result is to show that the distribution of the optimal fair predictor is the solution of a Wasserstein barycenter problem between the distributions induced by the unfair regression function on the sensitive groups. This result builds a bridge between fair regression and optimal transport; see e.g., [38, 41]. We illustrate our result with an example. Assume that $X$ represents a candidate's skills, $S$ is a binary attribute representing two groups of the population (e.g., majority or minority), and $Y$ is the current market salary.
Let $f^*(x, s) = \mathbb{E}[Y \mid X=x, S=s]$ be the regression function, that is, the optimal prediction of the salary currently offered by the market to candidate $(x, s)$. Due to bias present in the underlying data distribution, the induced distribution of the market salary predicted by $f^*$ varies across the two groups. We show that the optimal fair prediction $g^*$ transforms the regression function $f^*$ as
$$g^*(x, s) = p_s f^*(x, s) + (1 - p_s)\, t^*(x, s),$$
where $p_s$ is the frequency of group $s$ and the correction $t^*(x, s)$ is determined so that the ranking of $f^*(x, s)$ relative to the distribution of predictions for group $s$ (e.g., the minority) is the same as the ranking of $t^*(x, s)$ relative to the distribution of predictions for the other group $s' \neq s$ (e.g., the majority). We elaborate on this example after Theorem 2.3 and in Figure 1. The above expression of the optimal fair predictor naturally suggests a simple post-processing estimation procedure, where we first estimate $f^*$ and then transform it to get an estimator of $g^*$. Importantly, the transformation step involves only unlabeled data, since it only requires the estimation of cumulative distribution functions.

Contributions and organization. In summary, we make the following contributions. First, in Section 2 we derive the expression for the optimal function which minimizes the squared risk under the Demographic Parity constraint (Theorem 2.3). This result establishes a connection between fair regression and the problem of Wasserstein barycenters, which allows us to develop an intuitive interpretation of the optimal fair predictor. Second, based on the above result, in Section 3 we propose a post-processing procedure that can be applied on top of any off-the-shelf estimator of the regression function in order to transform it into a fair one. Third, in Section 4 we show that this post-processing procedure yields a fair prediction independently of the base estimator and of the underlying distribution (Proposition 4.1). Moreover, finite sample risk guarantees are derived under additional assumptions on the data distribution, provided that the base estimator is accurate (Theorem 4.4). Finally, Section 5 presents a numerical comparison of the proposed method with the state of the art.

Related work. Unlike the case of fair classification, fair regression has received limited attention to date; we are only aware of a few works on this topic that are supported by learning bounds or consistency results for the proposed estimator [1, 34]. Connections between algorithmic fairness and optimal transport, and in particular the problem of Wasserstein barycenters, have been studied in [12, 20, 24, 43], but mainly in the context of classification. These works are distinct from ours, in that they do not show the link between the optimal fair regression function and Wasserstein barycenters. Moreover, learning bounds are not addressed therein. Our distribution-free fairness guarantees share similarities with contributions on prediction sets [30, 31] and the conformal prediction literature [42, 48], as they also rely on results on rank statistics. Meanwhile, the risk guarantee that we derive combines deviation results on Wasserstein distances in one dimension [7] with the peeling ideas developed in [3] and the classical theory of rank statistics [40].

Notation. For any positive integer $N \in \mathbb{N}$ we denote by $[N]$ the set $\{1, \ldots, N\}$. For $a, b \in \mathbb{R}$ we denote by $a \wedge b$ (resp. $a \vee b$) the minimum (resp. the maximum) of $a$ and $b$. For two positive real sequences $(a_n)_n$, $(b_n)_n$ we write $a_n \lesssim b_n$ to indicate that there exists a constant $c$ such that $a_n \le c\, b_n$ for all $n$.
For a finite set $\mathcal{S}$ we denote by $|\mathcal{S}|$ its cardinality. The symbols $\mathbb{E}$ and $\mathbb{P}$ stand for generic expectation and probability. For any univariate probability measure $\mu$, we denote by $F_\mu$ its Cumulative Distribution Function (CDF) and by $Q_\mu : [0, 1] \to \mathbb{R}$ its quantile function (a.k.a. the generalized inverse of $F_\mu$), defined for all $t \in (0, 1]$ as $Q_\mu(t) = \inf\{y \in \mathbb{R} : F_\mu(y) \ge t\}$, with $Q_\mu(0) = Q_\mu(0^+)$. For a measurable set $A \subset \mathbb{R}$ we denote by $U(A)$ the uniform distribution on $A$.

2 The problem

In this section we introduce the fair regression problem and present our derivation of the optimal fair regression function, alongside its connection to the Wasserstein barycenter problem. We consider the general regression model
$$Y = f^*(X, S) + \xi, \qquad (1)$$
where $\xi \in \mathbb{R}$ is a centered random variable, $(X, S) \sim P_{X,S}$ on $\mathbb{R}^d \times \mathcal{S}$ with $|\mathcal{S}| < \infty$, and $f^* : \mathbb{R}^d \times \mathcal{S} \to \mathbb{R}$ is the regression function minimizing the squared risk. Let $P$ be the joint distribution of $(X, S, Y)$.

[Figure 1: Fair optimal prediction $g^*$ with $p_1 = 2/5$ and $p_2 = 3/5$. The plot shows the densities of $f^* \mid S=1$, $f^* \mid S=2$ and $g^*$, and marks the points $f^*(x,1) = t^*(\tilde x, 2)$, $g^*(x,1) = g^*(\tilde x, 2)$ and $t^*(x,1) = f^*(\tilde x, 2)$. For a new point $(x, 1)$, the value $t^*(x, 1)$ is chosen such that the shaded green area, $\mathbb{P}(f^*(X, S) \le t^*(x, 1) \mid S = 2)$, equals the shaded blue area, $\mathbb{P}(f^*(X, S) \le f^*(x, 1) \mid S = 1)$. The final prediction $g^*(x, 1)$ is a convex combination of $f^*(x, 1)$ and $t^*(x, 1)$. The same is done for $(\tilde x, 2)$.]

For any prediction rule $f : \mathbb{R}^d \times \mathcal{S} \to \mathbb{R}$, we denote by $\nu_{f|s}$ the distribution of $f(X, S) \mid S = s$, that is, the CDF of $\nu_{f|s}$ is given by
$$F_{\nu_{f|s}}(t) = \mathbb{P}(f(X, S) \le t \mid S = s); \qquad (2)$$
to shorten the notation we will write $F_{f|s}$ and $Q_{f|s}$ instead of $F_{\nu_{f|s}}$ and $Q_{\nu_{f|s}}$, respectively.

Definition 2.1 (Wasserstein-2 distance). Let $\mu$ and $\nu$ be two univariate probability measures. The squared Wasserstein-2 distance between $\mu$ and $\nu$ is defined as
$$W_2^2(\mu, \nu) = \inf_{\gamma \in \Gamma_{\mu,\nu}} \int |x - y|^2 \, d\gamma(x, y),$$
where $\Gamma_{\mu,\nu}$ is the set of distributions (couplings) on $\mathbb{R} \times \mathbb{R}$ such that for all $\gamma \in \Gamma_{\mu,\nu}$ and all measurable sets $A, B \subset \mathbb{R}$ it holds that $\gamma(A \times \mathbb{R}) = \mu(A)$ and $\gamma(\mathbb{R} \times B) = \nu(B)$.

In this work we use the following definition of (strong) Demographic Parity, which was previously used in the context of regression by [1, 12, 24].

Definition 2.2 (Demographic Parity). A prediction (possibly randomized) $g : \mathbb{R}^d \times \mathcal{S} \to \mathbb{R}$ is fair if, for every $s, s' \in \mathcal{S}$,
$$\sup_{t \in \mathbb{R}} \big| \mathbb{P}(g(X, S) \le t \mid S = s) - \mathbb{P}(g(X, S) \le t \mid S = s') \big| = 0.$$

Demographic Parity requires the Kolmogorov-Smirnov distance between $\nu_{g|s}$ and $\nu_{g|s'}$ to vanish for all $s, s'$. Thus, if $g$ is fair, $\nu_{g|s}$ does not depend on $s$, and to simplify the notation we will write $\nu_g$.

Recall the model in Eq. (1). Since the noise has zero mean, the minimization of $\mathbb{E}(Y - g(X, S))^2$ over $g$ is equivalent to the minimization of $\mathbb{E}(f^*(X, S) - g(X, S))^2$ over $g$. The next theorem shows that the optimal fair predictor for an input $(x, s)$ is obtained by a nonlinear transformation of the vector $(f^*(x, s))_{s=1}^{|\mathcal{S}|}$ that is linked to a Wasserstein barycenter problem [2].

Theorem 2.3 (Characterization of fair optimal prediction). Assume, for each $s \in \mathcal{S}$, that the univariate measure $\nu_{f^*|s}$ has a density and let $p_s = \mathbb{P}(S = s)$. Then,
$$\min_{g \text{ is fair}} \mathbb{E}(f^*(X, S) - g(X, S))^2 = \min_{\nu} \sum_{s \in \mathcal{S}} p_s\, W_2^2(\nu_{f^*|s}, \nu),$$
where the minimum on the right-hand side is over univariate probability measures. Moreover, if $g^*$ and $\nu^*$ solve the l.h.s. and the r.h.s. problems respectively, then $\nu^* = \nu_{g^*}$ and
$$g^*(x, s) = \sum_{s' \in \mathcal{S}} p_{s'}\, Q_{f^*|s'}\big(F_{f^*|s}(f^*(x, s))\big). \qquad (3)$$

The proof of Theorem 2.3 relies on the classical characterization of the optimal coupling for the Wasserstein-2 distance in one dimension (stated in Theorem A.1 in the appendix). We show that a minimizer $g^*$ of the $L_2$-risk can be used to construct $\nu^*$ and, vice versa, given $\nu^*$ we leverage a well-known expression for the one-dimensional Wasserstein barycenter (see e.g., [2, Section 6.1] and Lemma A.2 in the appendix) and construct $g^*$.
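To build intuition for Eq. (3), the following worked example is our own addition: it assumes that each group-wise prediction distribution is Gaussian, an assumption that Theorem 2.3 does not require. In that special case the transformation is affine and the barycenter is itself Gaussian.

```latex
% Hedged worked example (not from the paper): assume
%   \nu_{f^*|s} = \mathcal{N}(\mu_s, \sigma_s^2) for every s.
% Then F_{f^*|s}(y) = \Phi((y - \mu_s)/\sigma_s) and
% Q_{f^*|s'}(u) = \mu_{s'} + \sigma_{s'} \Phi^{-1}(u), so Eq. (3) reduces to
\begin{align*}
g^*(x, s)
  &= \sum_{s' \in \mathcal{S}} p_{s'}
     \Big( \mu_{s'} + \sigma_{s'} \, \frac{f^*(x, s) - \mu_s}{\sigma_s} \Big)
   = \sum_{s' \in \mathcal{S}} p_{s'} \mu_{s'}
     + \Big( \sum_{s' \in \mathcal{S}} p_{s'} \sigma_{s'} \Big)
       \frac{f^*(x, s) - \mu_s}{\sigma_s} .
\end{align*}
% Hence \nu^* is Gaussian with mean \sum_s p_s \mu_s and standard deviation
% \sum_s p_s \sigma_s: the fair prediction standardizes f^*(x, s) within its
% own group and rescales it to the common barycentric mean and spread.
```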
The case of a binary protected attribute. Let us unpack Eq. (3) in the case $\mathcal{S} = \{1, 2\}$, assuming w.l.o.g. that $p_2 \ge p_1$. Theorem 2.3 states that the fair optimal prediction $g^*$ is defined for all individuals $x \in \mathbb{R}^d$ in the first group as
$$g^*(x, 1) = p_1 f^*(x, 1) + p_2\, t^*(x, 1), \quad \text{with} \quad t^*(x, 1) = \inf\big\{t \in \mathbb{R} : F_{f^*|2}(t) \ge F_{f^*|1}(f^*(x, 1))\big\},$$
and likewise for the second group. This form of the optimal fair predictor, and more generally Eq. (3), allows us to understand the decision made by $g^*$ at the individual level. If we interpret $(x, s)$ as the candidate's CV and the candidate's group respectively, and $f^*(x, s)$ as the current market salary (which might be discriminatory), then the fair optimal salary $g^*(x, s)$ is a convex combination of the market salary $f^*(x, s)$ and the adjusted salary $t^*(x, s)$, which is computed as follows. If, say, $s = 1$, we first compute the fraction of individuals from the first group whose market salary is at most $f^*(x, 1)$, that is, we compute $\mathbb{P}(f^*(X, S) \le f^*(x, 1) \mid S = 1)$. Then, we find a candidate $\tilde x$ in group 2 such that the fraction of individuals from the second group whose market salary is at most $f^*(\tilde x, 2)$ is the same, that is, $\tilde x$ is chosen to satisfy $\mathbb{P}(f^*(X, S) \le f^*(\tilde x, 2) \mid S = 2) = \mathbb{P}(f^*(X, S) \le f^*(x, 1) \mid S = 1)$. Finally, the market salary of $\tilde x$ is exactly the adjustment for $x$, that is, $t^*(x, 1) = f^*(\tilde x, 2)$. This idea is illustrated in Figure 1 and leads to the following philosophy: if candidates $(x, 1)$ and $(\tilde x, 2)$ share the same group-wise market salary ranking, then they should receive the same salary, determined by the fair prediction $g^*(x, 1) = g^*(\tilde x, 2) = p_1 f^*(x, 1) + p_2 f^*(\tilde x, 2)$.

Lastly, note that Eq. (3) allows us to understand the (potential) amount of extra money that we need to pay in order to satisfy fairness. While the unfair decision made with $f^*$ costs $f^*(x, 1) + f^*(\tilde x, 2)$ for the salaries of $(x, 1)$ and $(\tilde x, 2)$, the fair decision $g^*$ costs $2\big(p_1 f^*(x, 1) + p_2 f^*(\tilde x, 2)\big)$. Thus, the extra (signed) salary that we pay is $\Delta = (p_2 - p_1)\big(f^*(\tilde x, 2) - f^*(x, 1)\big)$. Since $p_2 \ge p_1$, $\Delta$ will be positive whenever the candidate $\tilde x$ from the majority group gets a higher salary according to $f^*$, and negative otherwise. We believe that the expression in Eq. (3) could be the starting point for further, more applied, work on algorithmic fairness.

3 General form of the estimator

In this section we propose an estimator of the optimal fair predictor $g^*$ that relies on the plug-in principle. The expression (3) of $g^*$ suggests that we only need estimators for the regression function $f^*$, the proportions $p_s$, as well as the CDF $F_{f^*|s}$ and the quantile function $Q_{f^*|s}$, for all $s \in \mathcal{S}$. While the estimation of $f^*$ needs labeled data, all the other quantities rely only on $P_S$, $P_{X|S}$ and $f^*$; therefore unlabeled data together with an estimator of $f^*$ suffice. Thus, given a base estimator of $f^*$, our post-processing algorithm will require only unlabeled data.

For each $s \in \mathcal{S}$ let $\mathcal{U}^s = \{X_i^s\}_{i=1}^{N_s}$ be a group-wise unlabeled sample drawn i.i.d. from $P_{X|S=s}$. In the following, for simplicity, we assume that $N_s$ is even for all $s \in \mathcal{S}$.² Let $I_0^s, I_1^s \subset [N_s]$ be any fixed partition of $[N_s]$ such that $|I_0^s| = |I_1^s| = N_s/2$ and $I_0^s \cup I_1^s = [N_s]$. For each $j \in \{0, 1\}$ we let $\mathcal{U}_j^s = \{X_i^s \in \mathcal{U}^s : i \in I_j^s\}$ be the restriction of $\mathcal{U}^s$ to $I_j^s$. We use $\mathcal{U}_0^s$ to estimate $Q_{f|s}$ and $\mathcal{U}_1^s$ to estimate $F_{f|s}$; a minimal sketch of this splitting step is given below.

²Since we are ready to sacrifice a factor 2 in our bounds, this assumption is without loss of generality.
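The following snippet sketches the group-wise split under our own naming conventions; the dictionary `unlabeled`, mapping each group label to its feature array, is an assumption of the example and not notation from the paper.

```python
import numpy as np

def split_groupwise(unlabeled, seed=0):
    """Split each group-wise unlabeled sample U^s into two halves:
    U^s_0 (used for the empirical quantile function) and
    U^s_1 (used for the empirical CDF)."""
    rng = np.random.default_rng(seed)
    halves = {}
    for s, X_s in unlabeled.items():
        idx = rng.permutation(len(X_s))   # any fixed partition of [N_s] works
        half = len(X_s) // 2              # N_s is assumed even (footnote 2)
        halves[s] = (X_s[idx[:half]], X_s[idx[half:2 * half]])
    return halves
```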
For each $f : \mathbb{R}^d \times \mathcal{S} \to \mathbb{R}$ and each $s \in \mathcal{S}$, we estimate $\nu_{f|s}$ by
$$\hat\nu^0_{f|s} = \frac{1}{|I_0^s|} \sum_{i \in I_0^s} \delta_{f(X_i^s, s) + \varepsilon_i^s} \qquad \text{and} \qquad \hat\nu^1_{f|s} = \frac{1}{|I_1^s|} \sum_{i \in I_1^s} \delta_{f(X_i^s, s) + \varepsilon_i^s}, \qquad (4)$$
where $\delta_z$ is the Dirac measure at $z$ and all the $\varepsilon_i^s$ are i.i.d. $U([-\sigma, \sigma])$, for some positive $\sigma$ set by the user. Using the estimators in Eq. (4), we define for all $f : \mathbb{R}^d \times \mathcal{S} \to \mathbb{R}$ the estimators of $Q_{f|s}$ and of $F_{f|s}$ as
$$\hat Q_{f|s} \equiv Q_{\hat\nu^0_{f|s}} \qquad \text{and} \qquad \hat F_{f|s} \equiv F_{\hat\nu^1_{f|s}}. \qquad (5)$$
That is, $\hat F_{f|s}$ and $\hat Q_{f|s}$ are the empirical CDF and the empirical quantile function of $(f(X, S) + \varepsilon) \mid S = s$, based on $\{f(X_i^s, s) + \varepsilon_i^s\}_{i \in I_1^s}$ and $\{f(X_i^s, s) + \varepsilon_i^s\}_{i \in I_0^s}$ respectively. The noise $\varepsilon_i^s$ serves as a smoothing random variable, since for all $s \in \mathcal{S}$ and $i \in [N_s]$ the random variables $f(X_i^s, s) + \varepsilon_i^s$ are i.i.d. and continuous for any $P$ and $f$. In contrast, $f(X_i^s, s)$ might have atoms, resulting in a non-zero probability of observing ties in $\{f(X_i^s, s)\}_{i \in I_j^s}$. This step is also known as jittering, often used in data visualization [11] for tie-breaking. It plays a crucial role in the distribution-free fairness guarantees that we derive in Proposition 4.1; see the discussion thereafter.

Finally, let $\mathcal{A} = \{S_i\}_{i=1}^{N}$ be drawn i.i.d. from $P_S$ and, for each $s \in \mathcal{S}$, let $\hat p_s$ be the empirical frequency of $S = s$ evaluated on $\mathcal{A}$. Given a base estimator $\hat f$ of $f^*$ constructed from $n$ labeled samples $\mathcal{L} = \{(X_i, S_i, Y_i)\}_{i=1}^n$ drawn i.i.d. from $P$, we define the final estimator $\hat g$ of $g^*$ for all $(x, s) \in \mathbb{R}^d \times \mathcal{S}$, mimicking Eq. (3), as
$$\hat g(x, s) = \sum_{s' \in \mathcal{S}} \hat p_{s'}\, \hat Q_{\hat f|s'}\Big(\hat F_{\hat f|s}\big(\hat f(x, s) + \varepsilon\big)\Big), \qquad (6)$$
where $\varepsilon \sim U([-\sigma, \sigma])$ is assumed to be independent of every other random variable.

Remark 3.1. In practice one should use a very small value for $\sigma$ (e.g., $\sigma = 10^{-5}$), which does not alter the statistical quality of the base estimator $\hat f$, as indicated in Theorem 4.4.

Algorithm 1: Procedure to evaluate the estimator in Eq. (6)
Input: new point $(x, s)$; base estimator $\hat f$; unlabeled data $\mathcal{U}^1, \ldots, \mathcal{U}^{|\mathcal{S}|}$; jitter parameter $\sigma$; empirical frequencies $\hat p_1, \ldots, \hat p_{|\mathcal{S}|}$
Output: fair prediction $\hat g(x, s)$ for the point $(x, s)$
for $s' \in \mathcal{S}$ do  // data structure for Eq. (4)
    $\mathcal{U}^{s'}_0, \mathcal{U}^{s'}_1 \leftarrow \texttt{split\_in\_two}(\mathcal{U}^{s'})$  // split unlabeled data into two equal parts
    $\mathrm{ar}^{s'}_0 \leftarrow \{\hat f(X, s') + U([-\sigma, \sigma])\}_{X \in \mathcal{U}^{s'}_0}$,  $\mathrm{ar}^{s'}_1 \leftarrow \{\hat f(X, s') + U([-\sigma, \sigma])\}_{X \in \mathcal{U}^{s'}_1}$
    $\mathrm{ar}^{s'}_0 \leftarrow \texttt{sort}(\mathrm{ar}^{s'}_0)$,  $\mathrm{ar}^{s'}_1 \leftarrow \texttt{sort}(\mathrm{ar}^{s'}_1)$  // for fast evaluation of Eq. (5)
end
$k_s \leftarrow \texttt{position}\big(\hat f(x, s) + U([-\sigma, \sigma]),\, \mathrm{ar}^{s}_1\big)$  // evaluate $\hat F_{\hat f|s}(\hat f(x, s) + \varepsilon)$ in Eq. (6)
$\hat g(x, s) \leftarrow \sum_{s' \in \mathcal{S}} \hat p_{s'}\, \mathrm{ar}^{s'}_0\big[\lceil N_{s'} k_s / N_s \rceil\big]$  // evaluation of Eq. (6)

A pseudo-code implementation of $\hat g$ in Eq. (6) is reported in Algorithm 1. It requires two primitives: $\texttt{sort}(\mathrm{ar})$, which sorts the array $\mathrm{ar}$ in increasing order, and $\texttt{position}(a, \mathrm{ar})$, which outputs the index $k$ such that the insertion of $a$ into the $k$-th position of $\mathrm{ar}$ preserves the ordering (i.e., $\mathrm{ar}[k-1] \le a < \mathrm{ar}[k]$). Algorithm 1 consists of two parts: in the for-loop we perform a preprocessing step which takes $\sum_{s \in \mathcal{S}} O(N_s \log N_s)$ time³, since it involves sorting; then, the evaluation of $\hat g$ at a new point $(x, s)$ is performed in $O(\max_{s \in \mathcal{S}} \log N_s)$ time, since it involves a search in a sorted array. Note that the for-loop of Algorithm 1 needs to be performed only once, as this step is shared by any new $(x, s)$.

³It is assumed in this discussion that the time complexity to evaluate $\hat f$ is $O(1)$.
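To complement the pseudo-code, here is a minimal Python sketch of the post-processing step in Eq. (6). It is not the authors' implementation (their code is available at the repository cited in Section 5), and the names `fair_postprocess`, `base_predict`, `unlabeled` and `p_hat` are our own.

```python
import numpy as np

def fair_postprocess(base_predict, unlabeled, p_hat, sigma=1e-5, seed=0):
    """Build a fair predictor g_hat in the spirit of Eq. (6).

    base_predict : callable (X, s) -> 1-d array of base predictions f_hat(X, s)
    unlabeled    : dict {s: array of shape (N_s, d)} of unlabeled features per group
    p_hat        : dict {s: empirical frequency of group s}
    sigma        : jitter magnitude (Remark 3.1 suggests a very small value)
    """
    rng = np.random.default_rng(seed)
    quantile_vals = {}  # sorted jittered predictions on U^s_0 (empirical quantiles)
    cdf_vals = {}       # sorted jittered predictions on U^s_1 (empirical CDF)
    for s, X_s in unlabeled.items():
        preds = base_predict(X_s, s) + rng.uniform(-sigma, sigma, size=len(X_s))
        half = len(X_s) // 2
        quantile_vals[s] = np.sort(preds[:half])
        cdf_vals[s] = np.sort(preds[half:])

    def g_hat(x, s):
        # F_hat_{f|s}(f_hat(x, s) + eps): rank of the jittered prediction in group s
        val = base_predict(np.atleast_2d(x), s)[0] + rng.uniform(-sigma, sigma)
        u = np.searchsorted(cdf_vals[s], val, side="right") / len(cdf_vals[s])
        # Sum over groups of p_hat_{s'} * Q_hat_{f|s'}(u), with Q_hat read off
        # the sorted array of group s' predictions (empirical quantile).
        out = 0.0
        for s_prime, q in quantile_vals.items():
            idx = min(int(np.ceil(u * len(q))), len(q)) - 1
            out += p_hat[s_prime] * q[max(idx, 0)]
        return out

    return g_hat
```

For instance, with a fitted regressor `reg` trained on features augmented with the group label, one could pass `base_predict=lambda X, s: reg.predict(np.c_[X, np.full(len(X), s)])` (again, an illustrative choice rather than the paper's setup).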
4 Statistical analysis

In this section we provide a statistical analysis of the proposed algorithm. We first present, in Proposition 4.1, distribution-free finite sample fairness guarantees for the post-processing of any base learner with unlabeled data, and then we show, in Theorem 4.4, that if the base estimator $\hat f$ is a good proxy for $f^*$, then under mild assumptions on the distribution $P$ the processed estimator $\hat g$ in Eq. (6) is a good estimator of $g^*$ in Eq. (3).

Distribution-free post-processing fairness guarantees. We derive two distribution-free results in Proposition 4.1: the first, Eq. (7), shows that the fairness definition is satisfied as long as we take the expectation over the data inside the supremum in Definition 2.2, while the second, Eq. (8), bounds the expected violation of Definition 2.2.

Proposition 4.1 (Fairness guarantees). For any joint distribution $P$ of $(X, S, Y)$, any base estimator $\hat f$ constructed on labeled data, and for all $s, s' \in \mathcal{S}$, the estimator $\hat g$ defined in Eq. (6) satisfies
$$\sup_{t \in \mathbb{R}} \big| \mathbb{P}(\hat g(X, S) \le t \mid S=s) - \mathbb{P}(\hat g(X, S) \le t \mid S=s') \big| \le 2\,(N_s \wedge N_{s'} + 2)^{-1} \vee \mathbb{1}\{N_s \neq N_{s'}\}, \qquad (7)$$
$$\mathbb{E} \sup_{t \in \mathbb{R}} \big| \mathbb{P}(\hat g(X, S) \le t \mid S=s, \mathcal{D}) - \mathbb{P}(\hat g(X, S) \le t \mid S=s', \mathcal{D}) \big| \le 6\,(N_s \wedge N_{s'} + 1)^{-1/2}, \qquad (8)$$
where $\mathcal{D} = \mathcal{L} \cup \mathcal{A} \cup \bigcup_{s \in \mathcal{S}} \mathcal{U}^s$ is the union of all available datasets.

Let us point out that this result does not require any assumption on the distribution $P$, nor on the base estimator $\hat f$. This is achieved thanks to the jittering step in the definition of $\hat g$ in Eq. (6), which artificially introduces continuity. Continuity allows us to use results from the theory of rank statistics of exchangeable random variables to derive Eq. (7), as well as the classical inverse transform (see e.g., [40, Sections 13 and 21]) combined with the Dvoretzky-Kiefer-Wolfowitz inequality [33] to derive Eq. (8). Since basic results on rank statistics and the inverse transform are distribution-free as long as the underlying random variable is continuous, the guarantees in Eqs. (7)-(8) are also distribution-free and can be applied on top of any base estimator $\hat f$.

The bound in Eq. (7) might be surprising to the reader. Yet, let us emphasize that this bound holds because the expectation w.r.t. the data distribution is taken inside the supremum (since $\mathbb{P}$ stands for the joint distribution of all the random variables involved in $\hat g(X, S)$). Similar proof techniques are also used in randomization inference via permutations [19, 22], conformal prediction [30, 42], and knockoff estimation [4], to name a few. However, unlike the aforementioned contributions, the problem of fairness requires a non-trivial adaptation of these techniques. In contrast, Eq. (8) might be more appealing to the machine learning community, as it controls the expected (over the data) violation of the fairness constraint at the standard parametric rate.

Estimation guarantee with an accurate base estimator. In order to prove non-asymptotic risk bounds we require the following assumption on the distribution $P$ of $(X, S, Y) \in \mathbb{R}^d \times \mathcal{S} \times \mathbb{R}$.

Assumption 4.2. For each $s \in \mathcal{S}$ the univariate measure $\nu_{f^*|s}$ admits a density $q_s$, which is lower bounded by $\underline{\lambda}_s > 0$ and upper bounded by $\overline{\lambda}_s \ge \underline{\lambda}_s$.

Although the lower bound on the density is a rather strong assumption and might potentially be violated in practice, it is still reasonable in certain situations. We believe that it can be replaced by the assumption that $f^*(X, S)$ conditionally on $S = s$, for all $s \in \mathcal{S}$, admits $2 + \epsilon$ moments. We do not explore this relaxation in our work, as it significantly complicates the proof of Theorem 4.4. At the same time, our empirical study suggests that the lower bound on the density is not intrinsic to the problem, since the estimator exhibits good performance across various scenarios. In contrast, the milder assumption that the density is upper bounded is crucial for our proof and seems to be necessary.

Apart from the assumption on the density of $\nu_{f^*|s}$, the actual rate of estimation depends on the quality of the base estimator $\hat f$.

Assumption 4.3.
There exist a positive constant $c$, independent of $n, N, N_1, \ldots, N_{|\mathcal{S}|}$, and a positive sequence $(b_n)_{n \in \mathbb{N}}$ such that $\mathbb{E}\,|f^*(X, S) - \hat f(X, S)| \le c\, b_n^{-1/2}$.

The bound in Assumption 4.3 is stated in a very generic form: it only requires that the base estimator $\hat f$ estimates $f^*$ in $L_1$-norm at the rate $b_n^{-1/2}$. We refer to [3, 15, 29, 30, 39] for various examples of such estimators, including local polynomial estimators, $k$-nearest neighbours, and linear regression, to name just a few. Under these assumptions we can prove the following finite-sample estimation bound.

Theorem 4.4 (Estimation guarantee). Let Assumptions 4.2 and 4.3 be satisfied, and set $\sigma \lesssim \min_{s \in \mathcal{S}} N_s^{-1/2} \wedge b_n^{-1/2}$; then the estimator $\hat g$ defined in Eq. (6) satisfies
$$\mathbb{E}\,\big|g^*(X, S) - \hat g(X, S)\big| \;\lesssim\; b_n^{-1/2} \,\vee\, \sum_{s \in \mathcal{S}} p_s N_s^{-1/2} \,\vee\, N^{-1/2},$$
where the leading constant depends only on $\underline{\lambda}_s$, $\overline{\lambda}_s$, and $c$ from Assumptions 4.2 and 4.3.

The proof of this result combines the expected deviation of the empirical measure from the true measure in terms of the Wasserstein distance on the real line [7] with the already mentioned rank statistics and the classical peeling argument of [3]. The first term of the derived bound corresponds to the estimation error of $f^*$ by $\hat f$, the second term is the price to pay for not knowing the conditional distributions $X \mid S = s$, while the last term corresponds to the price of the unknown marginal probabilities of each protected attribute. Notice that if $N_s = p_s N$, which corresponds to standard i.i.d. sampling of the unlabeled data from $P_{X,S}$, the second and the third terms are of the same order. Moreover, if $N$ is sufficiently large, which in most scenarios⁴ is w.l.o.g., then the rate is dominated by $b_n^{-1/2}$. Notice that one can find a collection of joint distributions $P$ such that $f^*$ satisfies Demographic Parity. Hence, if $b_n^{-1/2}$ is the minimax optimal estimation rate for $f^*$, then it is also optimal for $g^*$.

⁴One can achieve this by splitting the labeled dataset $\mathcal{L}$, artificially augmenting the unlabeled one, which ensures that $N > n$. In this case, if $b_n^{-1/2} = O(n^{-1/2})$, then the first term is always dominant in the derived bound.

5 Empirical study

In this section, we present numerical experiments⁵ with the proposed fair regression estimator defined in Section 3. In all experiments, we collect statistics on a test set $\mathcal{T} = \{(X_i, S_i, Y_i)\}_{i=1}^{n_{\mathrm{test}}}$. The empirical mean squared error (MSE) is defined as
$$\mathrm{MSE}(g) = \frac{1}{n_{\mathrm{test}}} \sum_{(X, S, Y) \in \mathcal{T}} \big(Y - g(X, S)\big)^2.$$
We also measure the violation of the fairness constraint imposed by Definition 2.2 via the empirical Kolmogorov-Smirnov (KS) distance,
$$\mathrm{KS}(g) = \max_{s, s' \in \mathcal{S}} \sup_{t \in \mathbb{R}} \bigg| \frac{1}{|\mathcal{T}^s|} \sum_{(X, S, Y) \in \mathcal{T}^s} \mathbb{1}\{g(X, S) \le t\} - \frac{1}{|\mathcal{T}^{s'}|} \sum_{(X, S, Y) \in \mathcal{T}^{s'}} \mathbb{1}\{g(X, S) \le t\} \bigg|,$$
where for all $s \in \mathcal{S}$ we define the set $\mathcal{T}^s = \{(X, S, Y) \in \mathcal{T} : S = s\}$. For all datasets we split the data in two parts (70% train and 30% test); this procedure is repeated 30 times, and we report the average performance on the test set alongside its standard deviation. We employ the two-step 10-fold CV procedure considered by [17] to select the best hyperparameters on the training set. In the first step, we shortlist all the hyperparameters with an MSE close to the best one (in our case, the hyperparameters which lead to at most 10% larger MSE w.r.t. the best MSE). Then, from this list, we select the hyperparameters with the lowest KS.

⁵The source code of our method can be found at https://github.com/lucaoneto/NIPS2020_Fairness.
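As a reading aid, the short sketch below shows one way to compute the two reported metrics on a held-out test set; the array names (`y`, `pred`, `group`) are ours and the snippet is not taken from the paper's code.

```python
import numpy as np

def mse(y, pred):
    """Empirical mean squared error over the test set."""
    return np.mean((y - pred) ** 2)

def ks(pred, group):
    """Empirical Kolmogorov-Smirnov distance of Definition 2.2: the largest gap
    between group-wise empirical CDFs of the predictions, maximized over all
    pairs of groups and all thresholds t (it suffices to check the predicted
    values themselves, where the empirical CDFs jump)."""
    groups = np.unique(group)
    thresholds = np.sort(pred)
    cdfs = {s: np.mean(pred[group == s][:, None] <= thresholds[None, :], axis=0)
            for s in groups}
    return max(np.max(np.abs(cdfs[s1] - cdfs[s2]))
               for i, s1 in enumerate(groups) for s2 in groups[i + 1:])
```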
Methods. We compare our method (see Section 3) to different fair regression approaches, for both linear and non-linear regression. In the case of linear models we consider the following methods: Linear RLS plus [6] (RLS+Berk), Linear RLS plus [34] (RLS+Oneto), and Linear RLS plus our method (RLS+Ours), where RLS is the abbreviation of Regularized Least Squares. In the case of non-linear models we compare to the following methods: i) for Kernel RLS (KRLS): KRLS plus [34] (KRLS+Oneto), KRLS plus [35] (KRLS+Perez), and KRLS plus our method (KRLS+Ours); ii) for Random Forests (RF): RF plus [36] (RF+Raff), RF plus [1]⁶ (RF+Agar), and RF plus our method (RF+Ours). The hyperparameters of the methods are set as follows. For RLS we set the regularization hyperparameter $\lambda \in 10^{\{-4.5, -3.5, \ldots, 3\}}$, and for KRLS we set $\lambda \in 10^{\{-4.5, -3.5, \ldots, 3\}}$ and $\gamma \in 10^{\{-4.5, -3.5, \ldots, 3\}}$. Finally, for RF we set the number of trees to 1000, and for the number of features to select during tree creation we search in $\{d^{1/4}, d^{1/2}, d^{3/4}\}$.

⁶We thank the authors for sharing a prototype of their code.

Datasets. In order to analyze the performance of our method and test it against the state-of-the-art alternatives, we consider five benchmark datasets, CRIME, LAW, NLSY, STUD, and UNIV, which are briefly described below.

Communities & Crime (CRIME) contains socio-economic, law enforcement, and crime data about communities in the US [37], with 1994 examples. The task is to predict the number of violent crimes per $10^5$ population (normalized to [0, 1]), with race as the protected attribute. Following [9], we created a binary sensitive attribute $s$ according to the percentage of black population, which yielded 970 instances with $s = 1$ and a mean crime rate of 0.35, and 1024 instances with $s = -1$ and a mean crime rate of 0.13.

Law School (LAW) refers to the Law School Admission Council's National Longitudinal Bar Passage Study [44] and has 20649 examples. The task is to predict a student's GPA (normalized to [0, 1]), with race as the protected attribute (white versus non-white).

National Longitudinal Survey of Youth (NLSY) involves survey results collected by the U.S. Bureau of Labor Statistics, intended to gather information on the labor market activities and other life events of several groups [8].
Analogously to [27], we model a virtual company's hiring decision, assuming that the company does not have access to the applicants' academic scores. We set as target the person's GPA (normalized to [0, 1]), with race as the sensitive attribute.

Student Performance (STUD) concerns the achievement (final grade) in secondary education of 649 students from two Portuguese schools, described by 33 attributes [14], with gender as the protected attribute.

University (UNIV)⁷ is a proprietary and highly sensitive dataset containing all the data about the past and present students enrolled at the University of Genoa. In this study we take into consideration the students who enrolled in the academic year 2017-2018. The dataset contains 5000 instances, each one described by 35 attributes (both numeric and categorical) about ethnicity, gender, financial status, and previous school experience. The goal is to predict the average grade at the end of the first semester, with gender as the protected attribute.

⁷The data and the research are related to the project DROP@UNIGE of the University of Genoa.

| Method | CRIME MSE | CRIME KS | LAW MSE | LAW KS | NLSY MSE | NLSY KS | STUD MSE | STUD KS | UNIV MSE | UNIV KS |
|---|---|---|---|---|---|---|---|---|---|---|
| RLS | .033 ± .003 | .55 ± .06 | .107 ± .010 | .15 ± .02 | .153 ± .016 | .73 ± .07 | 4.77 ± .49 | .50 ± .05 | 2.24 ± .22 | .14 ± .01 |
| RLS+Berk | .037 ± .004 | .16 ± .02 | .121 ± .013 | .10 ± .01 | .189 ± .019 | .49 ± .05 | 5.28 ± .57 | .32 ± .03 | 2.43 ± .23 | .05 ± .01 |
| RLS+Oneto | .037 ± .004 | .14 ± .01 | .112 ± .012 | .07 ± .01 | .156 ± .016 | .50 ± .05 | 5.02 ± .54 | .23 ± .02 | 2.44 ± .26 | .05 ± .01 |
| RLS+Ours | .041 ± .004 | .12 ± .01 | .141 ± .014 | .02 ± .01 | .203 ± .019 | .09 ± .01 | 5.62 ± .52 | .04 ± .01 | 2.98 ± .32 | .02 ± .01 |
| KRLS | .024 ± .003 | .52 ± .05 | .040 ± .004 | .09 ± .01 | .061 ± .006 | .58 ± .06 | 3.85 ± .36 | .47 ± .05 | 1.43 ± .15 | .10 ± .01 |
| KRLS+Oneto | .028 ± .003 | .19 ± .02 | .046 ± .004 | .05 ± .01 | .066 ± .007 | .06 ± .01 | 4.07 ± .39 | .18 ± .02 | 1.46 ± .13 | .04 ± .01 |
| KRLS+Perez | .033 ± .003 | .25 ± .02 | .048 ± .005 | .04 ± .01 | .065 ± .007 | .08 ± .01 | 3.97 ± .38 | .14 ± .02 | 1.50 ± .15 | .06 ± .01 |
| KRLS+Ours | .034 ± .004 | .09 ± .01 | .056 ± .005 | .01 ± .01 | .081 ± .008 | .03 ± .01 | 4.46 ± .43 | .03 ± .01 | 1.71 ± .16 | .02 ± .01 |
| RF | .020 ± .002 | .45 ± .04 | .046 ± .005 | .11 ± .01 | .055 ± .006 | .55 ± .06 | 3.59 ± .39 | .45 ± .05 | 1.31 ± .13 | .10 ± .01 |
| RF+Raff | .030 ± .003 | .21 ± .02 | .058 ± .006 | .06 ± .01 | .066 ± .006 | .08 ± .01 | 4.28 ± .40 | .09 ± .01 | 1.38 ± .12 | .02 ± .01 |
| RF+Agar | .029 ± .003 | .13 ± .01 | .050 ± .005 | .04 ± .01 | .065 ± .006 | .07 ± .01 | 3.87 ± .41 | .07 ± .01 | 1.40 ± .13 | .02 ± .01 |
| RF+Ours | .033 ± .003 | .08 ± .01 | .064 ± .006 | .02 ± .01 | .070 ± .007 | .03 ± .01 | 4.18 ± .38 | .02 ± .01 | 1.49 ± .14 | .01 ± .01 |

Table 1: Results for all the datasets and all the methods in terms of MSE and KS (mean ± standard deviation over the 30 test splits).

Comparison w.r.t. the state of the art. In Table 1, we present the performance of the different methods on the datasets described above. One can notice that the LAW and UNIV datasets exhibit the least amount of discriminatory bias (as quantified by KS), since the fairness-unaware methods already perform reasonably well in terms of KS. Furthermore, on these two datasets, the difference in performance between the fairness-aware methods is less noticeable. In contrast, on CRIME, NLSY, and STUD, fairness-unaware methods perform poorly in terms of KS. More importantly, our findings indicate that the proposed method is competitive with state-of-the-art methods and is the most effective in imposing the fairness constraint. In particular, in all but two of the considered scenarios (CRIME+RLS, CRIME+RF) our method improves fairness by 50% (and up to 80% in some cases) over the closest fairness-aware method. In contrast, the accuracy of our method decreases by 1% up to 30% when compared to the most accurate fairness-aware method. However, let us emphasize that the relative decrease in accuracy is much smaller than the relative improvement in fairness across the considered scenarios. For example, on NLSY+RLS the most accurate fairness-aware method is RLS+Oneto, with mean MSE = .156 and mean KS = .50, while RLS+Ours yields mean MSE = .203 and mean KS = .09. That is, compared to RLS+Oneto our method drops about 30% in accuracy, while it gains about 82% in fairness. With RF, which is a more powerful estimator, the average drop in accuracy across all datasets compared to RF+Agar is about 12%, while the average improvement in fairness is about 53%.

6 Conclusion and perspectives

In this work we investigated the problem of fair regression under the Demographic Parity constraint, assuming that the sensitive attribute is available for prediction. We derived a closed form solution for the optimal fair predictor which offers a simple and intuitive interpretation. Relying on this expression, we devised a post-processing procedure which transforms any base estimator of the regression function into a nearly fair one, independently of the underlying distribution. Moreover, if the base estimator is accurate, our post-processing method yields an accurate estimator of the optimal fair predictor as well. Finally, we conducted an empirical study indicating the effectiveness of our method in imposing fairness in practice.
In the future it would be valuable to extend our methodology to the case where we are not allowed to use the sensitive feature, as well as to other notions of fairness.

Broader impact

This work investigates the problem of fair regression with multiple sensitive groups using tools from statistical learning theory and optimal transport theory. Our results lead to an efficient learning algorithm that we show, empirically and theoretically, to be very effective at imposing fairness according to the notion of Demographic Parity. Our approach is directly designed to mitigate potential bias present in the data. Hence, even though the work is primarily theoretical, we anticipate that our results could be used in the future by practitioners in order to specialize our methodology to real-life scenarios involving individuals, and to potentially help make decisions which help people with disadvantages or from minority groups. We believe that the most important positive impact of our work is the intuitive interpretation of the optimal fair prediction, which should help to reason about why a given prediction was made for a given individual. At the same time, this interpretation allows us to understand the weaknesses of the notion of Demographic Parity: if $f^*$ does not adequately reflect the group-wise ordering of individuals, the optimal fair prediction $g^*$ might not lead to a fair prediction from the individuals' perspective. In other words, returning to the salary example considered above, the notion of Demographic Parity reflects the principle: more qualified individuals get a higher salary within their respective groups.

Acknowledgments and Disclosure of Funding

This work was partially supported by Amazon Web Services and SAP SE.

References

[1] A. Agarwal, M. Dudik, and Z. S. Wu. Fair regression: Quantitative definitions and reduction-based algorithms. In International Conference on Machine Learning, 2019.

[2] M. Agueh and G. Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904-924, 2011.

[3] J. Y. Audibert and A. Tsybakov. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608-633, 2007.

[4] R. F. Barber and E. J. Candès. A knockoff filter for high-dimensional selective inference. The Annals of Statistics, 47(5):2504-2537, 2019.

[5] S. Barocas, M. Hardt, and A. Narayanan. Fairness and Machine Learning. fairmlbook.org, 2018.

[6] R. Berk, H. Heidari, S. Jabbari, M. Joseph, M. Kearns, J. Morgenstern, S. Neel, and A. Roth. A convex framework for fair regression. In Fairness, Accountability, and Transparency in Machine Learning, 2017.

[7] S. Bobkov and M. Ledoux. One-dimensional empirical measures, order statistics and Kantorovich transport distances. Memoirs of the American Mathematical Society, 2016.

[8] Bureau of Labor Statistics. National Longitudinal Surveys of Youth data set. www.bls.gov/nls/, 2019.

[9] T. Calders, A. Karim, F. Kamiran, W. Ali, and X. Zhang. Controlling attribute effect in linear regression. In IEEE International Conference on Data Mining, 2013.

[10] F. Calmon, D. Wei, B. Vinzamuri, K. N. Ramamurthy, and K. R. Varshney. Optimized pre-processing for discrimination prevention. In Neural Information Processing Systems, 2017.

[11] J. M. Chambers. Graphical Methods for Data Analysis. CRC Press, 2018.

[12] S. Chiappa, R. Jiang, T. Stepleton, A. Pacchiano, H. Jiang, and J. Aslanides. A general approach to fairness with optimal transport. In AAAI, 2020.

[13] F. Chierichetti, R. Kumar, S. Lattanzi, and S. Vassilvitskii. Fair clustering through fairlets. In Neural Information Processing Systems, 2017.
[14] P. Cortez and A. Silva. Using data mining to predict secondary school student performance. In FUture BUsiness TEChnology Conference, 2008.

[15] L. Devroye. The uniform convergence of nearest neighbor regression function estimators and their application in optimization. IEEE Transactions on Information Theory, 24(2):142-151, 1978.

[16] M. Donini, L. Oneto, S. Ben-David, J. S. Shawe-Taylor, and M. Pontil. Empirical risk minimization under fairness constraints. In Neural Information Processing Systems, 2018.

[17] M. Donini, L. Oneto, S. Ben-David, J. S. Shawe-Taylor, and M. Pontil. Empirical risk minimization under fairness constraints. In Neural Information Processing Systems, 2018.

[18] C. Dwork, N. Immorlica, A. T. Kalai, and M. D. M. Leiserson. Decoupled classifiers for group-fair and efficient machine learning. In Conference on Fairness, Accountability and Transparency, 2018.

[19] R. A. Fisher. Design of experiments. British Medical Journal, 1(3923):554, 1936.

[20] P. Gordaliza, E. Del Barrio, G. Fabrice, and J. M. Loubes. Obtaining fairness using optimal transport theory. In International Conference on Machine Learning, 2019.

[21] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In Neural Information Processing Systems, 2016.

[22] W. Hoeffding. The large-sample power of tests based on permutations of observations. The Annals of Mathematical Statistics, pages 169-192, 1952.

[23] S. Jabbari, M. Joseph, M. Kearns, J. Morgenstern, and A. Roth. Fair learning in Markovian environments. In Conference on Fairness, Accountability, and Transparency in Machine Learning, 2016.

[24] R. Jiang, A. Pacchiano, T. Stepleton, H. Jiang, and S. Chiappa. Wasserstein fair classification. arXiv preprint arXiv:1907.12059, 2019.

[25] M. Joseph, M. Kearns, J. H. Morgenstern, and A. Roth. Fairness in learning: Classic and contextual bandits. In Neural Information Processing Systems, 2016.

[26] N. Kilbertus, M. Rojas-Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Schölkopf. Avoiding discrimination through causal reasoning. In Neural Information Processing Systems, 2017.

[27] J. Komiyama and H. Shimao. Comparing fairness criteria based on social outcome. arXiv preprint arXiv:1806.05112, 2018.

[28] M. J. Kusner, J. Loftus, C. Russell, and R. Silva. Counterfactual fairness. In Neural Information Processing Systems, 2017.

[29] J. Lei. Classification with confidence. Biometrika, 101(4):755-769, 2014.

[30] J. Lei and L. Wasserman. Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):71-96, 2014.

[31] J. Lei, J. Robins, and L. Wasserman. Distribution-free prediction sets. Journal of the American Statistical Association, 108(501):278-287, 2013.

[32] K. Lum and J. Johndrow. A statistical framework for fair predictive algorithms. arXiv preprint arXiv:1610.08077, 2016.

[33] P. Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, 18(3):1269-1283, 1990.

[34] L. Oneto, M. Donini, and M. Pontil. General fair empirical risk minimization. arXiv preprint arXiv:1901.10080, 2019.

[35] A. Pérez-Suay, V. Laparra, G. Mateo-García, J. Muñoz-Marí, L. Gómez-Chova, and G. Camps-Valls. Fair kernel learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2017.

[36] E. Raff, J. Sylvester, and S. Mills. Fair forests: Regularized tree induction to minimize model bias. In AAAI/ACM Conference on AI, Ethics, and Society, 2018.
[37] M. Redmond and A. Baveja. A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research, 141(3):660-678, 2002.

[38] F. Santambrogio. Optimal Transport for Applied Mathematicians. Springer, 2015.

[39] S. Van de Geer. High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2):614-645, 2008.

[40] A. W. Van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.

[41] C. Villani. Topics in Optimal Transportation. American Mathematical Society, 2003.

[42] V. Vovk, A. Gammerman, and G. Shafer. Algorithmic Learning in a Random World. Springer Science & Business Media, 2005.

[43] H. Wang, B. Ustun, and F. Calmon. Repairing without retraining: Avoiding disparate impact with counterfactual distributions. In International Conference on Machine Learning, 2019.

[44] L. F. Wightman and H. Ramsey. LSAC National Longitudinal Bar Passage Study. Law School Admission Council, 1998.

[45] S. Yao and B. Huang. Beyond parity: Fairness objectives for collaborative filtering. In Neural Information Processing Systems, 2017.

[46] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In International Conference on World Wide Web, 2017.

[47] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork. Learning fair representations. In International Conference on Machine Learning, 2013.

[48] G. Zeni, M. Fontana, and S. Vantini. Conformal prediction: a unified review of theory and new challenges. arXiv preprint arXiv:2005.07972, 2020.

[49] I. Zliobaite. On the relation between accuracy and fairness in binary classification. arXiv preprint arXiv:1505.05723, 2015.