# Inherent Tradeoffs in Learning Fair Representations

Han Zhao (Machine Learning Department, School of Computer Science, Carnegie Mellon University, han.zhao@cs.cmu.edu) and Geoffrey J. Gordon (Microsoft Research, Montreal; Machine Learning Department, Carnegie Mellon University, geoff.gordon@microsoft.com)

With the prevalence of machine learning in high-stakes applications, especially those regulated by anti-discrimination laws or societal norms, it is crucial to ensure that predictive models do not propagate any existing bias or discrimination. Due to the ability of deep neural nets to learn rich representations, recent advances in algorithmic fairness have focused on learning fair representations with adversarial techniques to reduce bias in data while simultaneously preserving utility. In this paper, through the lens of information theory, we provide the first result that quantitatively characterizes the tradeoff between demographic parity and the joint utility across different population groups. Specifically, when the base rates differ between groups, we show that any method aiming to learn fair representations admits an information-theoretic lower bound on the joint error across these groups. To complement our negative results, we also prove that if the optimal decision functions across different groups are close, then learning fair representations leads to an alternative notion of fairness, known as accuracy parity, which states that the error rates are close between groups. Finally, our theoretical findings are confirmed empirically on real-world datasets.

1 Introduction

With the prevalence of machine learning applications in high-stakes domains, e.g., criminal judgement, medical testing, online advertising, etc., it is crucial to ensure that automated decision making systems do not propagate existing bias or discrimination that might exist in historical data [3, 4, 28]. Among many recent proposals for achieving different notions of algorithmic fairness [10, 14, 31-33], learning fair representations has received increasing attention due to recent advances in learning rich representations with deep neural networks [5, 11, 24, 26, 30, 34]. In fact, a line of work has proposed to learn group-invariant representations with adversarial learning techniques in order to achieve statistical parity, also known as demographic parity in the literature. This line of work dates at least back to Zemel et al. [33], where the authors proposed to learn predictive models that are independent of the group membership attribute. At a high level, the underlying idea is that if representations of instances from different groups are similar to each other, then any predictive model built on top of them will certainly make decisions independent of group membership. On the other hand, it has long been observed that there is an underlying tradeoff between utility and demographic parity:

"All methods have in common that to some extent accuracy must be traded-off for lowering the dependency." [6]

In particular, it is easy to see that in an extreme case where the group membership coincides with the target task, a call for exact demographic parity will inevitably remove the perfect predictor [14].

(Part of this work was done when Han Zhao was visiting Microsoft Research, Montreal.)

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
Empirically, it has also been observed that a tradeoff exists between accuracy and fairness in binary classification [38]. Clearly, methods based on learning fair representations are also bound by such an inherent tradeoff between utility and fairness. But how exactly does the fairness constraint trade off against utility? Will learning fair representations help to achieve other notions of fairness besides demographic parity? If so, what is the fundamental limit of utility that we can hope to achieve under such a constraint?

To answer the above questions, through the lens of information theory, in this paper we provide the first result that quantitatively characterizes the tradeoff between demographic parity and the joint utility across different population groups. Specifically, when the base rates differ between groups, we provide a tight information-theoretic lower bound on the joint error across these groups. Our lower bound is algorithm-independent, so it holds for all methods aiming to learn fair representations. When only approximate demographic parity is achieved, we also present a family of lower bounds that quantify the utility cost introduced by such an approximate constraint. As a side contribution, our proof technique is simple but general, and we expect it to have broader applications in other learning problems that use adversarial techniques, e.g., unsupervised domain adaptation [12, 36], privacy-preservation under attribute inference attacks [13, 35], and multilingual machine translation [16].

To complement our negative results, we show that if the optimal decision functions across different groups are close, then learning fair representations helps to achieve an alternative notion of fairness, i.e., accuracy parity, which states that the error rates are close between groups. Empirically, we conduct experiments on a real-world dataset that corroborate both our positive and negative results. We believe our theoretical insights contribute to a better understanding of the tradeoff between utility and different notions of fairness, and that they are also helpful in guiding the future design of representation learning algorithms to achieve algorithmic fairness.

2 Preliminary

We first introduce the notation used throughout the paper and formally describe the problem setup. We then briefly discuss some information-theoretic concepts that will be used in our analysis.

Notation. We use $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} = \{0, 1\}$ to denote the input and output space. Accordingly, we use $X$ and $Y$ to denote the random variables which take values in $\mathcal{X}$ and $\mathcal{Y}$, respectively. Lower case letters $x$ and $y$ are used to denote the instantiations of $X$ and $Y$. To simplify the presentation, we use $A \in \{0, 1\}$ as the sensitive attribute, e.g., race, gender, etc. (Our main results extend straightforwardly to the setting where $A$ is a categorical variable.) Let $\mathcal{H}$ be the hypothesis class of classifiers. In other words, for $h \in \mathcal{H}$, $h: \mathcal{X} \to \mathcal{Y}$ is the predictor that outputs a prediction. Note that even if the predictor does not explicitly take the sensitive attribute $A$ as input, this "fairness through blindness" mechanism can still be biased due to potential correlations between $X$ and $A$. In this work we study the stochastic setting where there is a joint distribution $\mathcal{D}$ over $X$, $Y$ and $A$ from which the data are sampled. To keep the notation consistent, for $a \in \{0, 1\}$, we use $\mathcal{D}_a$ to denote the conditional distribution of $\mathcal{D}$ given $A = a$. For an event $E$, $\mathcal{D}(E)$ denotes the probability of $E$ under $\mathcal{D}$.
In particular, in the literature of fair machine learning, we call $\mathcal{D}(Y = 1)$ the base rate of distribution $\mathcal{D}$, and we use $\text{BR}(\mathcal{D}, \mathcal{D}') := |\mathcal{D}(Y = 1) - \mathcal{D}'(Y = 1)|$ to denote the difference of the base rates between two distributions $\mathcal{D}$ and $\mathcal{D}'$ over the same sample space. Given a feature transformation function $g: \mathcal{X} \to \mathcal{Z}$ that maps instances from the input space $\mathcal{X}$ to the feature space $\mathcal{Z}$, we define $g_\sharp\mathcal{D} := \mathcal{D} \circ g^{-1}$ to be the induced (pushforward) distribution of $\mathcal{D}$ under $g$, i.e., for any event $E' \subseteq \mathcal{Z}$, $g_\sharp\mathcal{D}(E') := \mathcal{D}(g^{-1}(E')) = \mathcal{D}(\{x \in \mathcal{X} \mid g(x) \in E'\})$.

Problem Setup. Given a joint distribution $\mathcal{D}$, the error of a predictor $h$ under $\mathcal{D}$ is defined as $\text{Err}_{\mathcal{D}}(h) := \mathbb{E}_{\mathcal{D}}[|Y - h(X)|]$. Note that for binary classification problems, when $h(X) \in \{0, 1\}$, $\text{Err}_{\mathcal{D}}(h)$ reduces to the true error rate of binary classification. To make the notation more compact, we may drop the subscript $\mathcal{D}$ when it is clear from the context. In this work we focus on group fairness, where the group membership is given by the sensitive attribute $A$. Even in this context there are many possible definitions of fairness [27], and in what follows we provide a brief review of the ones that are most relevant to this work.

Definition 2.1 (Demographic Parity). Given a joint distribution $\mathcal{D}$, a classifier $\widehat{Y}$ satisfies demographic parity if $\widehat{Y}$ is independent of $A$.

Demographic parity reduces to the requirement that $\mathcal{D}_0(\widehat{Y} = 1) = \mathcal{D}_1(\widehat{Y} = 1)$, i.e., positive outcomes are given to the two groups at the same rate. When exact equality does not hold, we use the absolute difference between the two rates as an approximate measure:

Definition 2.2 (DP Gap). Given a joint distribution $\mathcal{D}$, the demographic parity gap of a classifier $\widehat{Y}$ is $\Delta_{\text{DP}}(\widehat{Y}) := |\mathcal{D}_0(\widehat{Y} = 1) - \mathcal{D}_1(\widehat{Y} = 1)|$.

Demographic parity is also known as statistical parity, and it has been adopted as a definition of fairness in a series of work [6, 11, 15, 17, 18, 24, 26, 33]. However, as we shall quantify precisely in Section 3, demographic parity may cripple the utility that we hope to achieve, especially in the common scenario where the base rates differ between the two groups, e.g., $\mathcal{D}_0(Y = 1) \neq \mathcal{D}_1(Y = 1)$ [14]. In light of this, an alternative definition is accuracy parity:

Definition 2.3 (Accuracy Parity). Given a joint distribution $\mathcal{D}$, a classifier $h$ satisfies accuracy parity if $\text{Err}_{\mathcal{D}_0}(h) = \text{Err}_{\mathcal{D}_1}(h)$.

In the literature, a violation of accuracy parity is also known as disparate mistreatment [32]. Again, when $h$ is a deterministic binary classifier, accuracy parity reduces to $\mathcal{D}_0(h(X) = Y) = \mathcal{D}_1(h(X) = Y)$. Different from demographic parity, the definition of accuracy parity does not eliminate the perfect predictor, even when $Y = A$ and the base rates differ between the two groups. When the costs of different error types matter, more refined definitions exist:

Definition 2.4 (Positive Rate Parity). Given a joint distribution $\mathcal{D}$, a deterministic classifier $h$ satisfies positive rate parity if $\mathcal{D}_0(h(X) = 1 \mid Y = y) = \mathcal{D}_1(h(X) = 1 \mid Y = y)$, $\forall y \in \{0, 1\}$.

Positive rate parity is also known as equalized odds [14], which essentially requires equal true positive and false positive rates between different groups. Furthermore, Hardt et al. [14] also defined true positive parity, or equal opportunity, to be $\mathcal{D}_0(h(X) = 1 \mid Y = 1) = \mathcal{D}_1(h(X) = 1 \mid Y = 1)$, for settings where the positive outcome is desirable.
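Before moving on to predictive rate parity, the parity notions introduced so far are easy to estimate from finite samples. The following minimal sketch is our own illustration, not part of the paper; the function and variable names are made up for exposition.

```python
import numpy as np

def fairness_gaps(y_true, y_pred, a):
    """Estimate group-fairness gaps from binary labels, predictions, and a binary sensitive attribute."""
    y_true, y_pred, a = map(np.asarray, (y_true, y_pred, a))
    g0, g1 = (a == 0), (a == 1)

    # Demographic parity gap: |D0(Yhat = 1) - D1(Yhat = 1)|
    dp_gap = abs(y_pred[g0].mean() - y_pred[g1].mean())

    # Accuracy parity gap: |Err_D0(h) - Err_D1(h)|
    err0 = np.mean(y_pred[g0] != y_true[g0])
    err1 = np.mean(y_pred[g1] != y_true[g1])
    acc_gap = abs(err0 - err1)

    # Positive rate parity (equalized odds): gaps of P(Yhat = 1 | Y = y) across groups
    def rate(group, y):
        mask = group & (y_true == y)
        return y_pred[mask].mean()
    tpr_gap = abs(rate(g0, 1) - rate(g1, 1))   # equal opportunity gap
    fpr_gap = abs(rate(g0, 0) - rate(g1, 0))

    return dict(dp_gap=dp_gap, err0=err0, err1=err1,
                acc_gap=acc_gap, tpr_gap=tpr_gap, fpr_gap=fpr_gap)

# Toy usage with synthetic labels and a group-blind random predictor.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=1000)
y = rng.binomial(1, np.where(a == 0, 0.31, 0.11))   # different base rates across groups
yhat = rng.binomial(1, 0.25, size=1000)             # a predictor that ignores A
print(fairness_gaps(y, yhat, a))
```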
Last but not least, predictive rate parity, also known as test fairness [7], asks for an equal chance of positive outcomes across groups given the prediction:

Definition 2.5 (Predictive Rate Parity). Given a joint distribution $\mathcal{D}$, a probabilistic classifier $h$ satisfies predictive rate parity if $\mathcal{D}_0(Y = 1 \mid h(X) = c) = \mathcal{D}_1(Y = 1 \mid h(X) = c)$, $\forall c \in [0, 1]$.

When $h$ is a deterministic binary classifier that only takes values in $\{0, 1\}$, Chouldechova [7] showed an intrinsic tradeoff between predictive rate parity and positive rate parity:

Theorem 2.1 (Chouldechova [7]). Assume $\mathcal{D}_0(Y = 1) \neq \mathcal{D}_1(Y = 1)$. Then for any deterministic classifier $h: \mathcal{X} \to \{0, 1\}$ that is not perfect, i.e., $h(X) \neq Y$, positive rate parity and predictive rate parity cannot hold simultaneously.

A similar tradeoff for probabilistic classifiers has also been observed by Kleinberg et al. [21], where the authors showed that for any non-perfect predictor, calibration and positive rate parity cannot be achieved simultaneously if the base rates differ across groups. Here a classifier $h$ is said to be calibrated if $\mathcal{D}(Y = 1 \mid h(X) = c) = c$, $\forall c \in [0, 1]$, i.e., if we look at the set of data points that receive a predicted probability of $c$ under $h$, we would like a $c$-fraction of them to be positive instances according to $Y$ [29].

f-divergence. Introduced by Ali and Silvey [2] and Csiszár [8, 9], the f-divergence, also known as the Ali-Silvey distance, is a general class of statistical divergences that measure the difference between two probability distributions $P$ and $Q$ over the same measurable space.

Definition 2.6 (f-divergence). Let $P$ and $Q$ be two probability distributions over the same space, and assume $P$ is absolutely continuous w.r.t. $Q$ ($P \ll Q$). Then for any convex function $f: (0, \infty) \to \mathbb{R}$ that is strictly convex at 1 and satisfies $f(1) = 0$, the $f$-divergence of $Q$ from $P$ is defined as
$$D_f(P \,\|\, Q) := \mathbb{E}_Q\!\left[f\!\left(\frac{dP}{dQ}\right)\right].$$
The function $f$ is called the generator function of $D_f(\cdot \,\|\, \cdot)$.

Different choices of the generator function $f$ recover popular statistical divergences as special cases, e.g., the KL-divergence. From Jensen's inequality it is easy to verify that $D_f(P \,\|\, Q) \geq 0$ and that $D_f(P \,\|\, Q) = 0$ iff $P = Q$ almost surely. Note that an $f$-divergence does not necessarily lead to a distance metric, and it is not symmetric in general, i.e., $D_f(P \,\|\, Q) \neq D_f(Q \,\|\, P)$, provided that $P \ll Q$ and $Q \ll P$. We list some common choices of the generator function $f$ and their corresponding properties in Table 1. Notably, Khosravifard et al. [20] proved that total variation is the only $f$-divergence that serves as a metric, i.e., that satisfies the triangle inequality.

Table 1: List of different f-divergences and their corresponding properties. $D_{\text{KL}}(P \,\|\, Q)$ denotes the KL-divergence of $Q$ from $P$ and $M := (P + Q)/2$ is the average distribution of $P$ and $Q$. Symm. stands for symmetric and Tri. stands for triangle inequality.

Name | $D_f(P \,\|\, Q)$ | Generator $f(t)$ | Symm. | Tri.
Kullback-Leibler | $D_{\text{KL}}(P \,\|\, Q)$ | $t \log t$ | No | No
Reverse-KL | $D_{\text{KL}}(Q \,\|\, P)$ | $-\log t$ | No | No
Jensen-Shannon | $D_{\text{JS}}(P, Q) := \frac{1}{2}\big(D_{\text{KL}}(P \,\|\, M) + D_{\text{KL}}(Q \,\|\, M)\big)$ | $t \log t - (t+1)\log\frac{t+1}{2}$ | Yes | No
Squared Hellinger | $H^2(P, Q) := \frac{1}{2}\int (\sqrt{dP} - \sqrt{dQ})^2$ | $(1 - \sqrt{t})^2 / 2$ | Yes | No
Total Variation | $d_{\text{TV}}(P, Q) := \sup_E |P(E) - Q(E)|$ | $|t - 1| / 2$ | Yes | Yes
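As a quick numerical companion to Table 1 (our own sketch, not part of the paper), the snippet below evaluates several of these divergences for two discrete distributions, either through their generator functions or directly from the definition, and checks the relations used later in Section 3. The distributions are placeholders chosen for illustration.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P || Q) = E_Q[f(dP/dQ)] for discrete distributions with full support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

# Generator functions from Table 1 (KL, squared Hellinger, total variation).
kl    = lambda t: t * np.log(t)
hell2 = lambda t: (1 - np.sqrt(t)) ** 2 / 2
tv    = lambda t: np.abs(t - 1) / 2

def js_divergence(p, q):
    """D_JS(P, Q) = (D_KL(P||M) + D_KL(Q||M)) / 2 with M = (P + Q) / 2."""
    m = (np.asarray(p, float) + np.asarray(q, float)) / 2
    return 0.5 * (f_divergence(p, m, kl) + f_divergence(q, m, kl))

P = np.array([0.31, 0.69])   # e.g., (P(Y=1), P(Y=0)) in one group
Q = np.array([0.11, 0.89])   # the same quantities in the other group

print("KL(P||Q) =", f_divergence(P, Q, kl), " KL(Q||P) =", f_divergence(Q, P, kl))  # asymmetric
print("H^2      =", f_divergence(P, Q, hell2))                                       # symmetric
print("TV       =", f_divergence(P, Q, tv))                                          # symmetric, a metric
print("JS       =", js_divergence(P, Q))

# Sanity checks used in Section 3: TV matches sup_E |P(E) - Q(E)|, and H^2 <= TV <= sqrt(2) * H.
h2, d_tv = f_divergence(P, Q, hell2), f_divergence(P, Q, tv)
assert np.isclose(d_tv, 0.5 * np.abs(P - Q).sum())
assert h2 <= d_tv <= np.sqrt(2 * h2) + 1e-12
```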
3 Main Results

As we briefly mentioned in Section 2, it is impossible to have an imperfect predictor that is both calibrated and preserves positive rate parity when the base rates differ between the two groups. A similar impossibility result also holds between positive rate parity and predictive rate parity. On the other hand, while it has long been observed that demographic parity may eliminate the perfect predictor [14], and previous work has empirically verified that a tradeoff exists between accuracy and demographic parity [6, 17, 38] on various datasets, so far a quantitative characterization of the exact tradeoff between accuracy and various notions of parity has been missing. In what follows we shall prove a family of information-theoretic lower bounds on the accuracy that hold for all algorithms.

3.1 Tradeoff between Fairness and Utility

Essentially, every prediction function induces a Markov chain $X \xrightarrow{\ g\ } Z \xrightarrow{\ h\ } \widehat{Y}$, where $g$ is the feature transformation, $h$ is the classifier on the feature space, $Z$ is the feature and $\widehat{Y}$ is the target variable predicted by $h \circ g$. Note that simple models, e.g., linear classifiers, are also included by specifying $g$ to be the identity map. With this notation, we first state the following theorem that quantifies an inherent tradeoff between fairness and utility.

Theorem 3.1. Let $\widehat{Y} = h(g(X))$ be the predictor. If $\widehat{Y}$ satisfies demographic parity, then $\text{Err}_{\mathcal{D}_0}(h \circ g) + \text{Err}_{\mathcal{D}_1}(h \circ g) \geq \text{BR}(\mathcal{D}_0, \mathcal{D}_1)$.

Remark. First of all, $\text{BR}(\mathcal{D}_0, \mathcal{D}_1)$ is the difference of base rates across groups, and it achieves its maximum value of 1 iff $Y = A$ almost surely, i.e., $Y$ indicates group membership. On the other hand, if $Y$ is independent of $A$, then $\text{BR}(\mathcal{D}_0, \mathcal{D}_1) = 0$, so the lower bound places no constraint on the joint error. Second, Theorem 3.1 applies to all possible feature transformations $g$ and predictors $h$. In particular, if we choose $g$ to be the identity map, then Theorem 3.1 says that when the base rates differ, no algorithm can achieve a small joint error on both groups, and it also recovers the previous observation that demographic parity can eliminate the perfect predictor [14]. Third, the lower bound in Theorem 3.1 is insensitive to the marginal distribution of $A$, i.e., it treats the errors from both groups equally. As a comparison, let $\alpha := \mathcal{D}(A = 1)$; then $\text{Err}_{\mathcal{D}}(h \circ g) = (1 - \alpha)\text{Err}_{\mathcal{D}_0}(h \circ g) + \alpha\,\text{Err}_{\mathcal{D}_1}(h \circ g)$. In this case $\text{Err}_{\mathcal{D}}(h \circ g)$ could still be small even if the minority group suffers a large error. Furthermore, by the pigeonhole principle, the following corollary holds:

Corollary 3.1. If the predictor $\widehat{Y} = h(g(X))$ satisfies demographic parity, then $\max\{\text{Err}_{\mathcal{D}_0}(h \circ g), \text{Err}_{\mathcal{D}_1}(h \circ g)\} \geq \text{BR}(\mathcal{D}_0, \mathcal{D}_1)/2$.

In words, this means that for fair predictors in the demographic parity sense, at least one of the subgroups has to incur an error of at least $\text{BR}(\mathcal{D}_0, \mathcal{D}_1)/2$, which could be large in settings like criminal justice where $\text{BR}(\mathcal{D}_0, \mathcal{D}_1)$ is large. Before we give the proof, we first present a useful lemma that lower bounds the prediction error by the total variation distance.

Lemma 3.1. Let $\widehat{Y} = h(g(X))$ be the predictor. Then for $a \in \{0, 1\}$, $d_{\text{TV}}(\mathcal{D}_a(Y), \mathcal{D}_a(\widehat{Y})) \leq \text{Err}_{\mathcal{D}_a}(h \circ g)$.

Proof. For $a \in \{0, 1\}$, we have:
$$d_{\text{TV}}(\mathcal{D}_a(Y), \mathcal{D}_a(\widehat{Y})) = |\mathcal{D}_a(Y = 1) - \mathcal{D}_a(h(g(X)) = 1)| = |\mathbb{E}_{\mathcal{D}_a}[Y] - \mathbb{E}_{\mathcal{D}_a}[h(g(X))]| \leq \mathbb{E}_{\mathcal{D}_a}[|Y - h(g(X))|] = \text{Err}_{\mathcal{D}_a}(h \circ g).$$

Now we are ready to prove Theorem 3.1:

Proof of Theorem 3.1. First of all, we show that if $\widehat{Y} = h(g(X))$ satisfies demographic parity, then:
$$d_{\text{TV}}(\mathcal{D}_0(\widehat{Y}), \mathcal{D}_1(\widehat{Y})) = \max\big\{|\mathcal{D}_0(\widehat{Y} = 0) - \mathcal{D}_1(\widehat{Y} = 0)|,\ |\mathcal{D}_0(\widehat{Y} = 1) - \mathcal{D}_1(\widehat{Y} = 1)|\big\} = |\mathcal{D}_0(\widehat{Y} = 1) - \mathcal{D}_1(\widehat{Y} = 1)| = |\mathcal{D}(\widehat{Y} = 1 \mid A = 0) - \mathcal{D}(\widehat{Y} = 1 \mid A = 1)| = 0,$$
where the last equality follows from the definition of demographic parity. Now, since $d_{\text{TV}}(\cdot, \cdot)$ is symmetric and satisfies the triangle inequality (Table 1), we have:
$$d_{\text{TV}}(\mathcal{D}_0(Y), \mathcal{D}_1(Y)) \leq d_{\text{TV}}(\mathcal{D}_0(Y), \mathcal{D}_0(\widehat{Y})) + d_{\text{TV}}(\mathcal{D}_0(\widehat{Y}), \mathcal{D}_1(\widehat{Y})) + d_{\text{TV}}(\mathcal{D}_1(\widehat{Y}), \mathcal{D}_1(Y)) = d_{\text{TV}}(\mathcal{D}_0(Y), \mathcal{D}_0(\widehat{Y})) + d_{\text{TV}}(\mathcal{D}_1(\widehat{Y}), \mathcal{D}_1(Y)). \quad (2)$$
The last step is to bound $d_{\text{TV}}(\mathcal{D}_a(Y), \mathcal{D}_a(\widehat{Y}))$ in terms of $\text{Err}_{\mathcal{D}_a}(h \circ g)$ for $a \in \{0, 1\}$ using Lemma 3.1: $d_{\text{TV}}(\mathcal{D}_0(Y), \mathcal{D}_0(\widehat{Y})) \leq \text{Err}_{\mathcal{D}_0}(h \circ g)$ and $d_{\text{TV}}(\mathcal{D}_1(Y), \mathcal{D}_1(\widehat{Y})) \leq \text{Err}_{\mathcal{D}_1}(h \circ g)$. Combining the above two inequalities with (2), and noting that $d_{\text{TV}}(\mathcal{D}_0(Y), \mathcal{D}_1(Y)) = |\mathcal{D}_0(Y = 1) - \mathcal{D}_1(Y = 1)| = \text{BR}(\mathcal{D}_0, \mathcal{D}_1)$, completes the proof.
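To see the bound in action on a concrete example (our own illustration, not from the paper), the sketch below defines a small discrete joint distribution over $(X, A, Y)$ with different base rates, enumerates every deterministic predictor, and checks Theorem 3.1 and Corollary 3.1 for the predictors that satisfy demographic parity exactly. All quantities are computed exactly, so no sampling is involved.

```python
import itertools
import numpy as np

# A small joint distribution over (X, A, Y); columns: x, a, y, probability.
table = np.array([
    [0, 0, 0, 0.20], [1, 0, 0, 0.15], [1, 0, 1, 0.05], [2, 0, 1, 0.10],
    [0, 1, 0, 0.05], [2, 1, 0, 0.20], [3, 1, 0, 0.15], [3, 1, 1, 0.10],
])
x, a, y, p = table[:, 0].astype(int), table[:, 1].astype(int), table[:, 2].astype(int), table[:, 3]
assert np.isclose(p.sum(), 1.0)

def cond(mask_num, mask_den):
    """Conditional probability of mask_num given mask_den under the table distribution."""
    return p[mask_num].sum() / p[mask_den].sum()

base_rate = [cond((a == g) & (y == 1), a == g) for g in (0, 1)]
BR = abs(base_rate[0] - base_rate[1])
print("base rates:", base_rate, " BR(D0, D1) =", BR)

# Enumerate all deterministic predictors h: {0,1,2,3} -> {0,1} and keep the DP-fair ones.
for h in itertools.product([0, 1], repeat=4):
    yhat = np.array([h[v] for v in x])
    dp_gap = abs(cond((a == 0) & (yhat == 1), a == 0) - cond((a == 1) & (yhat == 1), a == 1))
    err = [cond((a == g) & (yhat != y), a == g) for g in (0, 1)]
    if np.isclose(dp_gap, 0.0):
        assert err[0] + err[1] >= BR - 1e-12      # Theorem 3.1
        assert max(err) >= BR / 2 - 1e-12         # Corollary 3.1
        print(f"DP-fair h = {h}: errors = ({err[0]:.3f}, {err[1]:.3f}), sum = {err[0] + err[1]:.3f}")
```

On this toy distribution only the constant predictors are exactly DP-fair, and both satisfy the claimed inequalities.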
It is not hard to show that our lower bound in Theorem 3.1 is tight. To see this, consider the case $A = Y$, where the lower bound achieves its maximum value of 1. Now consider a constant predictor $\widehat{Y} \equiv 1$ or $\widehat{Y} \equiv 0$, which clearly satisfies demographic parity by definition. But in this case either $\text{Err}_{\mathcal{D}_0}(h \circ g) = 1$, $\text{Err}_{\mathcal{D}_1}(h \circ g) = 0$ or $\text{Err}_{\mathcal{D}_0}(h \circ g) = 0$, $\text{Err}_{\mathcal{D}_1}(h \circ g) = 1$, hence $\text{Err}_{\mathcal{D}_0}(h \circ g) + \text{Err}_{\mathcal{D}_1}(h \circ g) = 1$, achieving the lower bound. To conclude this section, we point out that the choice of total variation in the lower bound is not unique. As we will see shortly in Section 3.2, similar lower bounds can be attained using specific choices of the general f-divergence with some desired properties.

3.2 Tradeoff in Fair Representation Learning

In the last section we showed that there is an inherent tradeoff between fairness and utility when a predictor exactly satisfies demographic parity. In practice we may not be able to achieve demographic parity exactly. Instead, most algorithms [1, 5, 11, 24] build an adversarial discriminator that takes as input the feature vector $Z = g(X)$, and the goal is to learn fair representations such that it is hard for the adversarial discriminator to infer the group membership from $Z$. In this sense, due to the limited capacity of the adversarial discriminator, only approximate demographic parity can be achieved in the equilibrium. Hence it is natural to ask: what is the tradeoff between fair representations and accuracy in this scenario? In this section we answer this question by generalizing our previous analysis to f-divergences, proving a family of lower bounds on the joint target prediction error. Our results also show how approximate DP helps to reconcile, but not remove, the tradeoff between fairness and utility.

Before we state and prove the main results in this section, we first introduce the following lemma by Liese and Vajda [22] as a generalization of the data processing inequality for f-divergences:

Lemma 3.2 (Liese and Vajda [22]). Let $\mu(\mathcal{Z})$ be the space of all probability distributions over $\mathcal{Z}$. Then for any f-divergence $D_f(\cdot \,\|\, \cdot)$, any stochastic kernel $\kappa: \mathcal{X} \to \mu(\mathcal{Z})$, and any distributions $P$ and $Q$ over $\mathcal{X}$, $D_f(\kappa P \,\|\, \kappa Q) \leq D_f(P \,\|\, Q)$.

Roughly speaking, Lemma 3.2 says that data processing cannot increase discriminating information. Define $d_{\text{JS}}(P, Q) := \sqrt{D_{\text{JS}}(P, Q)}$ and $H(P, Q) := \sqrt{H^2(P, Q)}$. It is well known in information theory that both $d_{\text{JS}}(\cdot, \cdot)$ and $H(\cdot, \cdot)$ form bounded distance metrics over the space of probability distributions. Realizing that $d_{\text{TV}}(\cdot, \cdot)$, $H^2(\cdot, \cdot)$ and $D_{\text{JS}}(\cdot, \cdot)$ are all f-divergences, the following corollary holds:

Corollary 3.2. Let $h: \mathcal{Z} \to \mathcal{Y}$ be any hypothesis, and let $g_\sharp\mathcal{D}_a$ be the pushforward distribution of $\mathcal{D}_a$ by $g$, for $a \in \{0, 1\}$. Let $\widehat{Y} = h(g(X))$ be the predictor; then all of the following inequalities hold:
1. $d_{\text{TV}}(\mathcal{D}_0(\widehat{Y}), \mathcal{D}_1(\widehat{Y})) \leq d_{\text{TV}}(g_\sharp\mathcal{D}_0, g_\sharp\mathcal{D}_1)$
2. $H(\mathcal{D}_0(\widehat{Y}), \mathcal{D}_1(\widehat{Y})) \leq H(g_\sharp\mathcal{D}_0, g_\sharp\mathcal{D}_1)$
3. $d_{\text{JS}}(\mathcal{D}_0(\widehat{Y}), \mathcal{D}_1(\widehat{Y})) \leq d_{\text{JS}}(g_\sharp\mathcal{D}_0, g_\sharp\mathcal{D}_1)$
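The data-processing inequality in Lemma 3.2 is easy to check numerically. The sketch below is ours; the kernel and distributions are made up for illustration. It pushes two discrete distributions (stand-ins for $g_\sharp\mathcal{D}_0$ and $g_\sharp\mathcal{D}_1$) through a random stochastic kernel, which plays the role of the predictor $h$ on top of the features, and verifies that none of the three quantities in Corollary 3.2 increases.

```python
import numpy as np

rng = np.random.default_rng(1)

def tv(p, q):        return 0.5 * np.abs(p - q).sum()
def hellinger(p, q): return np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum())
def js(p, q):
    m = (p + q) / 2
    kl = lambda r, s: float(np.sum(r * np.log(r / s)))
    return np.sqrt(0.5 * (kl(p, m) + kl(q, m)))   # d_JS = sqrt(D_JS)

# Two distributions over a 5-element feature space.
P = rng.dirichlet(np.ones(5))
Q = rng.dirichlet(np.ones(5))

# A stochastic kernel: each row is a distribution over 2 output labels, so
# kappa P plays the role of D_a(Yhat) induced by a (randomized) classifier h.
K = rng.dirichlet(np.ones(2), size=5)      # shape (5, 2), rows sum to 1
KP, KQ = P @ K, Q @ K

for name, d in [("TV", tv), ("Hellinger", hellinger), ("sqrt-JS", js)]:
    assert d(KP, KQ) <= d(P, Q) + 1e-12    # data processing cannot increase the divergence
    print(f"{name:9s}: d(kP, kQ) = {d(KP, KQ):.4f} <= d(P, Q) = {d(P, Q):.4f}")
```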
Now we are ready to present the main theorem of this section:

Theorem 3.2. Let $\widehat{Y} = h(g(X))$ be the predictor. Assume $d_{\text{JS}}(g_\sharp\mathcal{D}_0, g_\sharp\mathcal{D}_1) \leq d_{\text{JS}}(\mathcal{D}_0(Y), \mathcal{D}_1(Y))$ and $H(g_\sharp\mathcal{D}_0, g_\sharp\mathcal{D}_1) \leq H(\mathcal{D}_0(Y), \mathcal{D}_1(Y))$; then the following three inequalities hold:
1. Total variation lower bound: $\text{Err}_{\mathcal{D}_0}(h \circ g) + \text{Err}_{\mathcal{D}_1}(h \circ g) \geq d_{\text{TV}}(\mathcal{D}_0(Y), \mathcal{D}_1(Y)) - d_{\text{TV}}(g_\sharp\mathcal{D}_0, g_\sharp\mathcal{D}_1)$.
2. Jensen-Shannon lower bound: $\text{Err}_{\mathcal{D}_0}(h \circ g) + \text{Err}_{\mathcal{D}_1}(h \circ g) \geq \big(d_{\text{JS}}(\mathcal{D}_0(Y), \mathcal{D}_1(Y)) - d_{\text{JS}}(g_\sharp\mathcal{D}_0, g_\sharp\mathcal{D}_1)\big)^2 / 2$.
3. Hellinger lower bound: $\text{Err}_{\mathcal{D}_0}(h \circ g) + \text{Err}_{\mathcal{D}_1}(h \circ g) \geq \big(H(\mathcal{D}_0(Y), \mathcal{D}_1(Y)) - H(g_\sharp\mathcal{D}_0, g_\sharp\mathcal{D}_1)\big)^2 / 2$.

Remark. All three lower bounds in Theorem 3.2 imply a tradeoff between the joint error across demographic subgroups and learning group-invariant feature representations. In a nutshell: for fair representations, it is not possible to construct a predictor that simultaneously minimizes the errors on both demographic subgroups. When $g_\sharp\mathcal{D}_0 = g_\sharp\mathcal{D}_1$, which also implies $\mathcal{D}_0(\widehat{Y}) = \mathcal{D}_1(\widehat{Y})$, all three lower bounds become larger; in this case we have
$$\max\Big\{d_{\text{TV}}(\mathcal{D}_0(Y), \mathcal{D}_1(Y)),\ \tfrac{1}{2} d_{\text{JS}}^2(\mathcal{D}_0(Y), \mathcal{D}_1(Y)),\ \tfrac{1}{2} H^2(\mathcal{D}_0(Y), \mathcal{D}_1(Y))\Big\} = d_{\text{TV}}(\mathcal{D}_0(Y), \mathcal{D}_1(Y)) = \text{BR}(\mathcal{D}_0, \mathcal{D}_1),$$
and this reduces to Theorem 3.1. Now we give a sketch of the proof of Theorem 3.2:

Proof Sketch of Theorem 3.2. We prove the three inequalities in turn. The total variation lower bound follows the same idea as the proof of Theorem 3.1, together with the inequality $d_{\text{TV}}(\mathcal{D}_0(\widehat{Y}), \mathcal{D}_1(\widehat{Y})) \leq d_{\text{TV}}(g_\sharp\mathcal{D}_0, g_\sharp\mathcal{D}_1)$ from Corollary 3.2. To prove the Jensen-Shannon lower bound, note that $d_{\text{JS}}(\cdot, \cdot)$ is a distance metric over probability distributions. Combining this with the inequality $d_{\text{JS}}(\mathcal{D}_0(\widehat{Y}), \mathcal{D}_1(\widehat{Y})) \leq d_{\text{JS}}(g_\sharp\mathcal{D}_0, g_\sharp\mathcal{D}_1)$ from Corollary 3.2, we have:
$$d_{\text{JS}}(\mathcal{D}_0(Y), \mathcal{D}_1(Y)) \leq d_{\text{JS}}(\mathcal{D}_0(Y), \mathcal{D}_0(\widehat{Y})) + d_{\text{JS}}(g_\sharp\mathcal{D}_0, g_\sharp\mathcal{D}_1) + d_{\text{JS}}(\mathcal{D}_1(\widehat{Y}), \mathcal{D}_1(Y)).$$
Now by Lin's lemma [23, Theorem 3], for any two distributions $P$ and $Q$ we have $d_{\text{JS}}^2(P, Q) \leq d_{\text{TV}}(P, Q)$. Combining Lin's lemma with Lemma 3.1, we get the following lower bound:
$$\sqrt{\text{Err}_{\mathcal{D}_0}(h \circ g)} + \sqrt{\text{Err}_{\mathcal{D}_1}(h \circ g)} \geq d_{\text{JS}}(\mathcal{D}_0(Y), \mathcal{D}_1(Y)) - d_{\text{JS}}(g_\sharp\mathcal{D}_0, g_\sharp\mathcal{D}_1).$$
Applying the AM-GM inequality, we can further upper bound the L.H.S. by
$$\sqrt{2\big(\text{Err}_{\mathcal{D}_0}(h \circ g) + \text{Err}_{\mathcal{D}_1}(h \circ g)\big)} \geq \sqrt{\text{Err}_{\mathcal{D}_0}(h \circ g)} + \sqrt{\text{Err}_{\mathcal{D}_1}(h \circ g)}.$$
Under the assumption that $d_{\text{JS}}(g_\sharp\mathcal{D}_0, g_\sharp\mathcal{D}_1) \leq d_{\text{JS}}(\mathcal{D}_0(Y), \mathcal{D}_1(Y))$, squaring both sides then completes the proof of the second inequality. The proof of the Hellinger lower bound follows exactly the same steps as for the Jensen-Shannon one, except that instead of Lin's lemma we use the fact that $H^2(P, Q) \leq d_{\text{TV}}(P, Q) \leq \sqrt{2}\,H(P, Q)$ for all $P, Q$.

As a simple corollary of Theorem 3.2, the following result shows how approximate DP (in terms of the DP gap) helps to reconcile the tradeoff between fairness and utility:

Corollary 3.3. Let $\widehat{Y} = h(g(X))$ be the predictor; then $\text{Err}_{\mathcal{D}_0}(h \circ g) + \text{Err}_{\mathcal{D}_1}(h \circ g) \geq \text{BR}(\mathcal{D}_0, \mathcal{D}_1) - \Delta_{\text{DP}}(\widehat{Y})$.

In a sense, Corollary 3.3 means that in order to achieve a small joint error, the DP gap of the predictor cannot be too small. Of course, since the above inequality is a lower bound, it only serves as a necessary condition for a small joint error. Hence an interesting question is whether it is possible to have a sufficient condition that guarantees a small joint error while the DP gap of the predictor is no larger than that of the perfect predictor, i.e., $\text{BR}(\mathcal{D}_0, \mathcal{D}_1)$. We leave this as future work.
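As a quick numerical check of Corollary 3.3 (our own sketch, not from the paper), the snippet below evaluates the inequality $\text{Err}_{\mathcal{D}_0} + \text{Err}_{\mathcal{D}_1} \geq \text{BR}(\mathcal{D}_0, \mathcal{D}_1) - \Delta_{\text{DP}}(\widehat{Y})$ on held-out predictions; the arrays below are synthetic placeholders for the outputs of a trained model.

```python
import numpy as np

def corollary_3_3_check(y_true, y_pred, a):
    """Return (joint error, BR - DP gap) estimated on finite samples."""
    y_true, y_pred, a = map(np.asarray, (y_true, y_pred, a))
    g0, g1 = (a == 0), (a == 1)
    br = abs(y_true[g0].mean() - y_true[g1].mean())         # base rate gap BR(D0, D1)
    dp_gap = abs(y_pred[g0].mean() - y_pred[g1].mean())     # demographic parity gap
    joint_err = np.mean(y_pred[g0] != y_true[g0]) + np.mean(y_pred[g1] != y_true[g1])
    return joint_err, br - dp_gap

# Hypothetical predictions; in practice these would come from a trained classifier.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=5000)
y = rng.binomial(1, np.where(a == 0, 0.31, 0.11))
yhat = rng.binomial(1, np.where(a == 0, 0.25, 0.15))   # imperfect, not fully DP-fair
joint_err, lower_bound = corollary_3_3_check(y, yhat, a)
print(f"joint error = {joint_err:.3f} >= lower bound = {lower_bound:.3f}")
assert joint_err >= lower_bound - 1e-9
```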
3.3 Fair Representations Lead to Accuracy Parity

In the previous sections we proved a family of information-theoretic lower bounds that demonstrate an inherent tradeoff between fair representations and the joint error across groups. A natural question to ask, then, is what kind of parity fair representations can bring us. To complement our negative results, in this section we show that learning group-invariant representations helps to reduce the discrepancy of errors (utilities) across groups.

First of all, since we work in the stochastic setting where $\mathcal{D}_a$ is a joint distribution over $X$ and $Y$ conditioned on $A = a$, any function $h: \mathcal{X} \to \mathcal{Y}$ will inevitably incur an error due to the noise in the distribution $\mathcal{D}_a$. Formally, for $a \in \{0, 1\}$, define the optimal function $h_a^*: \mathcal{X} \to \mathcal{Y}$ under the absolute error to be $h_a^*(X) := m_{\mathcal{D}_a}(Y \mid X)$, where $m_{\mathcal{D}_a}(Y \mid X)$ denotes the median of $Y$ given $X$ under distribution $\mathcal{D}_a$. Now define the noise of distribution $\mathcal{D}_a$ to be $n_{\mathcal{D}_a} := \mathbb{E}_{\mathcal{D}_a}[|Y - h_a^*(X)|]$. With these notations, we are now ready to present the following theorem:

Theorem 3.3. For any hypothesis $h \in \mathcal{H}$, $h: \mathcal{X} \to \mathcal{Y}$, the following inequality holds:
$$|\text{Err}_{\mathcal{D}_0}(h) - \text{Err}_{\mathcal{D}_1}(h)| \leq n_{\mathcal{D}_0} + n_{\mathcal{D}_1} + d_{\text{TV}}(\mathcal{D}_0(X), \mathcal{D}_1(X)) + \min\big\{\mathbb{E}_{\mathcal{D}_0}[|h_0^* - h_1^*|],\ \mathbb{E}_{\mathcal{D}_1}[|h_0^* - h_1^*|]\big\}.$$

Remark. Theorem 3.3 upper bounds the discrepancy of accuracy across groups by three terms: the noise, the distance between the representations across groups, and the discrepancy between the optimal decision functions. In an ideal setting where both distributions are noiseless, i.e., the same people in the same group are always treated equally, the upper bound simplifies to the latter two terms:
$$|\text{Err}_{\mathcal{D}_0}(h) - \text{Err}_{\mathcal{D}_1}(h)| \leq d_{\text{TV}}(\mathcal{D}_0(X), \mathcal{D}_1(X)) + \min\big\{\mathbb{E}_{\mathcal{D}_0}[|h_0^* - h_1^*|],\ \mathbb{E}_{\mathcal{D}_1}[|h_0^* - h_1^*|]\big\}.$$
If we further require that the optimal decision functions $h_0^*$ and $h_1^*$ are close to each other, i.e., optimal decisions are insensitive to the group membership, then Theorem 3.3 implies that a sufficient condition to guarantee accuracy parity is to find a group-invariant representation that minimizes $d_{\text{TV}}(\mathcal{D}_0(X), \mathcal{D}_1(X))$. We now present the proof of Theorem 3.3:

Proof of Theorem 3.3. First, we show that for $a \in \{0, 1\}$, $\text{Err}_{\mathcal{D}_a}(h)$ cannot be far from the noise level if $h$ is close to $h_a^*$:
$$|\text{Err}_{\mathcal{D}_a}(h) - n_{\mathcal{D}_a}| = |\text{Err}_{\mathcal{D}_a}(h) - \text{Err}_{\mathcal{D}_a}(h_a^*)| = \big|\mathbb{E}_{\mathcal{D}_a}[|Y - h(X)|] - \mathbb{E}_{\mathcal{D}_a}[|Y - h_a^*(X)|]\big| \leq \mathbb{E}_{\mathcal{D}_a}[|h(X) - h_a^*(X)|],$$
where the inequality is due to the triangle inequality. Next, we bound $|\text{Err}_{\mathcal{D}_0}(h) - \text{Err}_{\mathcal{D}_1}(h)|$ by:
$$|\text{Err}_{\mathcal{D}_0}(h) - \text{Err}_{\mathcal{D}_1}(h)| \leq n_{\mathcal{D}_0} + n_{\mathcal{D}_1} + \big|\mathbb{E}_{\mathcal{D}_0}[|h(X) - h_0^*(X)|] - \mathbb{E}_{\mathcal{D}_1}[|h(X) - h_1^*(X)|]\big|.$$
To proceed, define $\varepsilon_a(h, h') := \mathbb{E}_{\mathcal{D}_a}[|h(X) - h'(X)|]$, so that
$$\big|\mathbb{E}_{\mathcal{D}_0}[|h(X) - h_0^*(X)|] - \mathbb{E}_{\mathcal{D}_1}[|h(X) - h_1^*(X)|]\big| = |\varepsilon_0(h, h_0^*) - \varepsilon_1(h, h_1^*)|.$$
To bound $|\varepsilon_0(h, h_0^*) - \varepsilon_1(h, h_1^*)|$, note that $|h(X) - h_a^*(X)| \in \{0, 1\}$. On one hand, we have:
$$|\varepsilon_0(h, h_0^*) - \varepsilon_1(h, h_1^*)| = |\varepsilon_0(h, h_0^*) - \varepsilon_0(h, h_1^*) + \varepsilon_0(h, h_1^*) - \varepsilon_1(h, h_1^*)| \leq |\varepsilon_0(h, h_0^*) - \varepsilon_0(h, h_1^*)| + |\varepsilon_0(h, h_1^*) - \varepsilon_1(h, h_1^*)| \leq \varepsilon_0(h_0^*, h_1^*) + d_{\text{TV}}(\mathcal{D}_0(X), \mathcal{D}_1(X)),$$
where the last inequality holds because
$$|\varepsilon_0(h, h_1^*) - \varepsilon_1(h, h_1^*)| = |\mathcal{D}_0(|h - h_1^*| = 1) - \mathcal{D}_1(|h - h_1^*| = 1)| \leq \sup_E |\mathcal{D}_0(E) - \mathcal{D}_1(E)| = d_{\text{TV}}(\mathcal{D}_0(X), \mathcal{D}_1(X)).$$
Similarly, by subtracting and adding back $\varepsilon_1(h, h_0^*)$ instead, we can also show that
$$|\varepsilon_0(h, h_0^*) - \varepsilon_1(h, h_1^*)| \leq \varepsilon_1(h_0^*, h_1^*) + d_{\text{TV}}(\mathcal{D}_0(X), \mathcal{D}_1(X)).$$
Combining the above two inequalities yields
$$|\varepsilon_0(h, h_0^*) - \varepsilon_1(h, h_1^*)| \leq \min\{\varepsilon_0(h_0^*, h_1^*), \varepsilon_1(h_0^*, h_1^*)\} + d_{\text{TV}}(\mathcal{D}_0(X), \mathcal{D}_1(X)).$$
Incorporating the two noise terms back into the above inequality completes the proof.
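To make the bound of Theorem 3.3 concrete (our own toy example, not from the paper), the following sketch evaluates every term of the bound exactly on a small discrete distribution and verifies the inequality for every deterministic hypothesis.

```python
import itertools
import numpy as np

# Group-conditional joint distributions over (X, Y), X in {0,1,2}; each matrix sums to 1.
#              y=0    y=1
D0 = np.array([[0.35, 0.05],   # x = 0
               [0.10, 0.30],   # x = 1
               [0.05, 0.15]])  # x = 2
D1 = np.array([[0.15, 0.05],
               [0.05, 0.30],
               [0.35, 0.10]])

def analyze(D):
    px = D.sum(axis=1)                  # marginal over X
    p1 = D[:, 1] / px                   # P(Y=1 | X=x)
    h_star = (p1 > 0.5).astype(int)     # median of Y | X, i.e., optimal under absolute error
    noise = np.sum(px * np.minimum(p1, 1 - p1))
    return px, p1, h_star, noise

px0, p1_0, h0_star, n0 = analyze(D0)
px1, p1_1, h1_star, n1 = analyze(D1)

d_tv_x = 0.5 * np.abs(px0 - px1).sum()                  # d_TV(D0(X), D1(X))
disc = min(np.sum(px0 * (h0_star != h1_star)),          # min_a E_{D_a}|h0* - h1*|
           np.sum(px1 * (h0_star != h1_star)))
bound = n0 + n1 + d_tv_x + disc

def err(D, h):
    px, p1 = D.sum(axis=1), D[:, 1] / D.sum(axis=1)
    h = np.asarray(h)
    return np.sum(px * np.where(h == 1, 1 - p1, p1))    # E|Y - h(X)|

# Theorem 3.3 holds for every hypothesis h: {0,1,2} -> {0,1}.
for h in itertools.product([0, 1], repeat=3):
    assert abs(err(D0, h) - err(D1, h)) <= bound + 1e-12
print(f"accuracy-parity gap <= {bound:.3f} for every h "
      f"(noise = {n0:.2f} + {n1:.2f}, d_TV = {d_tv_x:.2f}, decision discrepancy = {disc:.2f})")
```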
4 Empirical Validation

Our theoretical results on the lower bound imply that over-training the feature transformation function to achieve group-invariant representations will inevitably lead to large joint errors. On the other hand, our upper bound also implies that group-invariant representations help to achieve accuracy parity. To verify these theoretical implications, in this section we conduct experiments on a real-world benchmark dataset, the UCI Adult dataset, and report empirical results with various metrics.

Dataset. The Adult dataset contains 30,162/15,060 training/test instances for income prediction. Each instance in the dataset describes an adult from the 1994 US Census. Attributes include gender, education level, age, etc. In this experiment we use gender (binary) as the sensitive attribute, and we preprocess the dataset to convert categorical variables into one-hot representations. The processed data contain 114 attributes. The target variable (income) is also binary: 1 if the income is above 50K/year and 0 otherwise. For the sensitive attribute $A$, $A = 0$ corresponds to male and $A = 1$ to female. In this dataset, the base rates across groups are different: $\Pr(Y = 1 \mid A = 0) = 0.310$ while $\Pr(Y = 1 \mid A = 1) = 0.113$. The group ratios are also different: $\Pr(A = 0) = 0.673$.

Experimental Protocol. To validate the effect of learning group-invariant representations with adversarial debiasing techniques [5, 26, 34], we perform a controlled experiment by fixing the baseline network architecture to be a three-hidden-layer feed-forward network with ReLU activations. The numbers of units in the hidden layers are 500, 200, and 100, respectively. The output layer corresponds to a logistic regression model. This baseline without debiasing is denoted NoDebias. For debiasing with adversarial learning techniques, the adversarial discriminator network takes the feature from the last hidden layer as input and connects it to a hidden layer with 50 units, followed by a binary classifier whose goal is to predict the sensitive attribute $A$. This model is denoted AdvDebias. Compared with NoDebias, the only difference of AdvDebias in terms of the objective function is that, besides the cross-entropy loss for target prediction, AdvDebias also contains a classification loss from the adversarial discriminator for predicting the sensitive attribute $A$. In the experiment, all other factors are fixed to be the same between the two methods, including learning rate, optimization algorithm, training epochs, and batch size. To see how the adversarial loss affects the joint error, demographic parity, and accuracy parity, we vary the coefficient $\rho$ of the adversarial loss among 0.1, 1.0, 5.0 and 50.0.

Results and Analysis. The experimental results are listed in Table 2. Note that in the table $|\text{Err}_{\mathcal{D}_0} - \text{Err}_{\mathcal{D}_1}|$ can be understood as measuring an approximate version of accuracy parity, and similarly $\Delta_{\text{DP}}(\widehat{Y})$ measures how close the classifier is to satisfying demographic parity. From the table, it is clear that with increasing $\rho$, both the overall error $\text{Err}_{\mathcal{D}}$ (sensitive to the marginal distribution of $A$) and the joint error $\text{Err}_{\mathcal{D}_0} + \text{Err}_{\mathcal{D}_1}$ (insensitive to the imbalance of $A$) increase. As expected, $\Delta_{\text{DP}}(\widehat{Y})$ decreases drastically as $\rho$ increases. Furthermore, $|\text{Err}_{\mathcal{D}_0} - \text{Err}_{\mathcal{D}_1}|$ also gradually decreases, but much more slowly than $\Delta_{\text{DP}}(\widehat{Y})$. This is due to the noise in the data as well as the shift between the optimal decision functions across groups, as indicated by our upper bound. To conclude, all the empirical results are consistent with our theoretical findings.

Table 2: Adversarial debiasing on demographic parity, joint error across groups, and accuracy parity.

 | $\text{Err}_{\mathcal{D}}$ | $\text{Err}_{\mathcal{D}_0} + \text{Err}_{\mathcal{D}_1}$ | $|\text{Err}_{\mathcal{D}_0} - \text{Err}_{\mathcal{D}_1}|$ | $\Delta_{\text{DP}}(\widehat{Y})$
NoDebias | 0.157 | 0.275 | 0.115 | 0.189
AdvDebias, $\rho = 0.1$ | 0.159 | 0.278 | 0.116 | 0.190
AdvDebias, $\rho = 1.0$ | 0.162 | 0.286 | 0.106 | 0.113
AdvDebias, $\rho = 5.0$ | 0.166 | 0.295 | 0.106 | 0.032
AdvDebias, $\rho = 50.0$ | 0.201 | 0.360 | 0.112 | 0.028
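The protocol above can be sketched in a few lines of PyTorch. The sketch below is our own minimal reconstruction of the described setup (hidden layers of size 500/200/100, a 50-unit adversary on the last hidden layer, adversarial coefficient $\rho$); the optimizer choice, learning rate, and alternating update scheme are assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Three hidden layers (500, 200, 100) with ReLU, as in the protocol above."""
    def __init__(self, d_in):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 500), nn.ReLU(),
            nn.Linear(500, 200), nn.ReLU(),
            nn.Linear(200, 100), nn.ReLU(),
        )
    def forward(self, x):
        return self.net(x)

d_in, rho = 114, 1.0                     # 114 one-hot features; rho weights the adversarial loss
enc = Encoder(d_in)
clf = nn.Linear(100, 1)                  # logistic-regression output layer for Y
adv = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 1))  # discriminator for A
bce = nn.BCEWithLogitsLoss()

opt_main = torch.optim.Adam(list(enc.parameters()) + list(clf.parameters()), lr=1e-3)
opt_adv = torch.optim.Adam(adv.parameters(), lr=1e-3)

def train_step(x, y, a):
    """One alternating update: fit the adversary, then the encoder/classifier against it."""
    # 1) Adversary tries to predict A from the (detached) representation.
    z = enc(x).detach()
    loss_adv = bce(adv(z).squeeze(1), a)
    opt_adv.zero_grad(); loss_adv.backward(); opt_adv.step()

    # 2) Encoder + classifier minimize the prediction loss while fooling the adversary.
    z = enc(x)
    loss_y = bce(clf(z).squeeze(1), y)
    loss_a = bce(adv(z).squeeze(1), a)
    loss = loss_y - rho * loss_a         # minus sign: make A hard to infer from Z
    opt_main.zero_grad(); loss.backward(); opt_main.step()
    return loss_y.item(), loss_a.item()

# Smoke test on random tensors standing in for a mini-batch of the Adult data.
x = torch.randn(64, d_in)
y = torch.randint(0, 2, (64,)).float()
a = torch.randint(0, 2, (64,)).float()
print(train_step(x, y, a))
```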
5 Related Work

Fairness Frameworks. Two central notions of fairness have been extensively studied, i.e., group fairness and individual fairness. In a seminal work, Dwork et al. [10] define individual fairness as a measure of smoothness of the classification function. Under the assumption that the number of individuals is finite, the authors proposed a linear programming framework to maximize the utility under their fairness constraint. However, their framework requires a distance function that computes the similarity between individuals to be specified a priori, and their optimization formulation does not produce an inductive rule that generalizes to unseen data. Based on the definition of positive rate parity, Hardt et al. [14] proposed a post-processing method to achieve fairness by taking as input the prediction and the sensitive attribute. In a concurrent work, Kleinberg et al. [21] offer a calibration technique to achieve the corresponding fairness criterion as well. However, both of the aforementioned approaches require the sensitive attribute during the inference phase, which is not available in many real-world scenarios.

Regularization Techniques. The line of work on fairness-aware learning through regularization dates at least back to Kamishima et al. [19], where the authors argue that simple deletion of sensitive features in data is insufficient for eliminating biases in automated decision making, due to the possible correlations among attributes and sensitive information [25]. In light of this, the authors proposed a prejudice remover regularizer that essentially penalizes the mutual information between the predicted goal and the sensitive information. In a more recent approach, Zafar et al. [31] leveraged a measure of decision boundary fairness and incorporated it via constraints into the objective functions of logistic regression as well as support vector machines. As discussed in Section 2, both approaches essentially reduce to achieving demographic parity through regularization.

Representation Learning. In a pioneering work, Zemel et al. [33] proposed to preserve both group and individual fairness through the lens of representation learning, where the main idea is to find a good representation of the data with two competing goals: to encode the data for utility maximization while at the same time obfuscating any information about membership in the protected group. Due to the power of learning rich representations offered by deep neural nets, recent advances in building fair automated decision making systems focus on using adversarial techniques to learn fair representations that also preserve enough information for the prediction vendor to achieve his utility [1, 5, 11, 24, 30, 34, 37]. Madras et al. [26] further extended this approach by incorporating a reconstruction loss given by an autoencoder into the objective function to preserve demographic parity, equalized odds, and equal opportunity.

6 Conclusion

In this paper we theoretically and empirically study the important problem of quantifying the tradeoff between utility and fairness in learning group-invariant representations. Specifically, we prove a novel lower bound to characterize the tradeoff between demographic parity and the joint utility across different population groups when the base rates differ between groups. In particular, our results imply that any method aiming to learn fair representations admits an information-theoretic lower bound on the joint error, and the better (i.e., the more group-invariant) the representation, the larger the joint error.
Complementary to our negative results, we also show that learning fair representations leads to accuracy parity if the optimal decision functions across different groups are close. These theoretical findings are also confirmed empirically on real-world datasets. We believe our results take an important step towards better understanding the tradeoff between utility and different notions of fairness. Inspired by our lower bound, one interesting direction for future work is to design instance-weighting algorithms that balance the base rates during representation learning.

Acknowledgments

HZ and GG would like to acknowledge support from the DARPA XAI project, contract #FA87501720152, and an Nvidia GPU grant. HZ would also like to thank Jianfeng Chi for helpful discussions on the relationship between algorithmic fairness and privacy-preservation learning.

References

[1] Tameem Adel, Isabel Valera, Zoubin Ghahramani, and Adrian Weller. One-network adversarial fairness. In 33rd AAAI Conference on Artificial Intelligence, 2019.
[2] Syed Mumtaz Ali and Samuel D Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society: Series B (Methodological), 28(1):131-142, 1966.
[3] Solon Barocas and Andrew D Selbst. Big data's disparate impact. Calif. L. Rev., 104:671, 2016.
[4] Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research, page 0049124118782533, 2018.
[5] Alex Beutel, Jilin Chen, Zhe Zhao, and Ed H Chi. Data decisions and theoretical implications when adversarially learning fair representations. arXiv preprint arXiv:1707.00075, 2017.
[6] Toon Calders, Faisal Kamiran, and Mykola Pechenizkiy. Building classifiers with independency constraints. In 2009 IEEE International Conference on Data Mining Workshops, pages 13-18. IEEE, 2009.
[7] Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153-163, 2017.
[8] Imre Csiszár. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Magyar Tud. Akad. Mat. Kutató Int. Közl., 8:85-108, 1964.
[9] Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229-318, 1967.
[10] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214-226. ACM, 2012.
[11] Harrison Edwards and Amos Storkey. Censoring representations with an adversary. arXiv preprint arXiv:1511.05897, 2015.
[12] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096-2030, 2016.
[13] Jihun Hamm. Minimax filter: Learning to preserve privacy from inference attacks. The Journal of Machine Learning Research, 18(1):4704-4734, 2017.
[14] Moritz Hardt, Eric Price, Nati Srebro, et al. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pages 3315-3323, 2016.
[15] James E Johndrow, Kristian Lum, et al. An algorithm for removing sensitive information: application to race-independent recidivism prediction. The Annals of Applied Statistics, 13(1):189-220, 2019.
[16] Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339-351, 2017.
[17] Faisal Kamiran and Toon Calders. Classifying without discriminating. In 2009 2nd International Conference on Computer, Control and Communication, pages 1-6. IEEE, 2009.
[18] Toshihiro Kamishima, Shotaro Akaho, and Jun Sakuma. Fairness-aware learning through regularization approach. In 2011 IEEE 11th International Conference on Data Mining Workshops, pages 643-650. IEEE, 2011.
[19] Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. Fairness-aware classifier with prejudice remover regularizer. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 35-50. Springer, 2012.
[20] Mohammadali Khosravifard, Dariush Fooladivanda, and T Aaron Gulliver. Confliction of the convexity and metric properties in f-divergences. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 90(9):1848-1853, 2007.
[21] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807, 2016.
[22] Friedrich Liese and Igor Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394-4412, 2006.
[23] Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145-151, 1991.
[24] Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. The variational fair autoencoder. arXiv preprint arXiv:1511.00830, 2015.
[25] Kristian Lum and James Johndrow. A statistical framework for fair predictive algorithms. arXiv preprint arXiv:1610.08077, 2016.
[26] David Madras, Elliot Creager, Toniann Pitassi, and Richard Zemel. Learning adversarially fair and transferable representations. In International Conference on Machine Learning, pages 3381-3390, 2018.
[27] Arvind Narayanan. Translation tutorial: 21 fairness definitions and their politics. In Proc. Conf. Fairness Accountability Transp., New York, USA, 2018.
[28] Executive Office of the President. Big data: A report on algorithmic systems, opportunity, and civil rights. Executive Office of the President, 2016.
[29] Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger. On fairness and calibration. In Advances in Neural Information Processing Systems, pages 5680-5689, 2017.
[30] Jiaming Song, Pratyusha Kalluri, Aditya Grover, Shengjia Zhao, and Stefano Ermon. Learning controllable fair representations. In Artificial Intelligence and Statistics, pages 2164-2173, 2019.
[31] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness constraints: Mechanisms for fair classification. arXiv preprint arXiv:1507.05259, 2015.
[32] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pages 1171-1180. International World Wide Web Conferences Steering Committee, 2017.
[33] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In International Conference on Machine Learning, pages 325-333, 2013.
[34] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 335-340. ACM, 2018.
[35] Han Zhao, Jianfeng Chi, Yuan Tian, and Geoffrey J. Gordon. Adversarial privacy preservation under attribute inference attack. arXiv preprint arXiv:1906.07902, 2019.
[36] Han Zhao, Remi Tachet des Combes, Kun Zhang, and Geoffrey J. Gordon. On learning invariant representation for domain adaptation. In International Conference on Machine Learning, 2019.
[37] Han Zhao, Amanda Coston, Tameem Adel, and Geoffrey J. Gordon. Conditional learning of fair representations. arXiv preprint arXiv:1910.07162, 2019.
[38] Indre Zliobaite. On the relation between accuracy and fairness in binary classification. arXiv preprint arXiv:1505.05723, 2015.