# A Reductions Approach to Fair Classification

Alekh Agarwal¹, Alina Beygelzimer², Miroslav Dudík¹, John Langford¹, Hanna Wallach¹

¹Microsoft Research, New York. ²Yahoo! Research, New York. Correspondence to: A. Agarwal, A. Beygelzimer, M. Dudík, J. Langford, H. Wallach.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

We present a systematic approach for achieving fairness in a binary classification setting. While we focus on two well-known quantitative definitions of fairness, our approach encompasses many other previously studied definitions as special cases. The key idea is to reduce fair classification to a sequence of cost-sensitive classification problems, whose solutions yield a randomized classifier with the lowest (empirical) error subject to the desired constraints. We introduce two reductions that work for any representation of the cost-sensitive classifier and compare favorably to prior baselines on a variety of data sets, while overcoming several of their disadvantages.

1. Introduction

Over the past few years, the media have paid considerable attention to machine learning systems and their ability to inadvertently discriminate against minorities, historically disadvantaged populations, and other protected groups when allocating resources (e.g., loans) or opportunities (e.g., jobs). In response to this scrutiny, and driven by ongoing debates and collaborations with lawyers, policy-makers, social scientists, and others (e.g., Barocas & Selbst, 2016), machine learning researchers have begun to turn their attention to the topic of fairness in machine learning and, in particular, to the design of fair classification and regression algorithms.

In this paper we study the task of binary classification subject to fairness constraints with respect to a pre-defined protected attribute, such as race or sex. Previous work in this area can be divided into two broad groups of approaches.

The first group of approaches incorporates specific quantitative definitions of fairness into existing machine learning methods, often by relaxing the desired definitions of fairness and only enforcing weaker constraints, such as lack of correlation (e.g., Woodworth et al., 2017; Zafar et al., 2017; Johnson et al., 2016; Kamishima et al., 2011; Donini et al., 2018). The resulting fairness guarantees typically only hold under strong distributional assumptions, and the approaches are tied to specific families of classifiers, such as SVMs.

The second group of approaches eliminates the restriction to specific classifier families and treats the underlying classification method as a black box, while implementing a wrapper that either works by pre-processing the data or post-processing the classifier's predictions (e.g., Kamiran & Calders, 2012; Feldman et al., 2015; Hardt et al., 2016; Calmon et al., 2017). Existing pre-processing approaches are specific to particular definitions of fairness and typically seek to come up with a single transformed data set that will work across all learning algorithms, which, in practice, leads to classifiers that still exhibit substantial unfairness (see our evaluation in Section 4). In contrast, post-processing allows a wider range of fairness definitions and results in provable fairness guarantees.
However, it is not guaranteed to find the most accurate fair classifier, and it requires test-time access to the protected attribute, which might not be available.

We present a general-purpose approach that has the key advantage of this second group of approaches (the underlying classification method is treated as a black box), but without the noted disadvantages. Our approach encompasses a wide range of fairness definitions, is guaranteed to yield the most accurate fair classifier, and does not require test-time access to the protected attribute. Specifically, our approach allows any definition of fairness that can be formalized via linear inequalities on conditional moments, such as demographic parity or equalized odds (see Section 2.1). We show how binary classification subject to these constraints can be reduced to a sequence of cost-sensitive classification problems. We require only black-box access to a cost-sensitive classification algorithm, which does not need to have any knowledge of the desired definition of fairness or protected attribute. We show that the solutions to our sequence of cost-sensitive classification problems yield a randomized classifier with the lowest (empirical) error subject to the desired fairness constraints.

Corbett-Davies et al. (2017) and Menon & Williamson (2018) begin with a similar goal to ours, but they analyze the Bayes-optimal classifier under fairness constraints in the limit of infinite data. In contrast, our focus is algorithmic, our approach applies to any classifier family, and we obtain finite-sample guarantees. Dwork et al. (2018) also begin with a similar goal to ours. Their approach partitions the training examples into subsets according to protected attribute values and then leverages transfer learning to jointly learn from these separate data sets. Our approach avoids partitioning the data and assumes access only to a classification algorithm rather than a transfer learning algorithm. A preliminary version of this paper appeared at the FAT/ML workshop (Agarwal et al., 2017), and led to extensions with more general optimization objectives (Alabi et al., 2018) and combinatorial protected attributes (Kearns et al., 2018).

In the next section, we formalize our problem. While we focus on two well-known quantitative definitions of fairness, our approach also encompasses many other previously studied definitions of fairness as special cases. In Section 3, we describe our reductions approach to fair classification and its guarantees in detail. The experimental study in Section 4 shows that our reductions compare favorably to three baselines, while overcoming some of their disadvantages and also offering the flexibility of picking a suitable accuracy-fairness tradeoff. Our results demonstrate the utility of having a general-purpose approach for combining machine learning methods and quantitative fairness definitions.

2. Problem Formulation

We consider a binary classification setting where the training examples consist of triples $(X, A, Y)$, where $X \in \mathcal{X}$ is a feature vector, $A \in \mathcal{A}$ is a protected attribute, and $Y \in \{0, 1\}$ is a label. The feature vector $X$ can either contain the protected attribute $A$ as one of the features or contain other features that are arbitrarily indicative of $A$.
For example, if the classification task is to predict whether or not someone will default on a loan, each training example might correspond to a person, where $X$ represents their demographics, income level, past payment history, and loan amount; $A$ represents their race; and $Y$ represents whether or not they defaulted on that loan. Note that $X$ might contain their race as one of the features or, for example, contain their zip code, a feature that is often correlated with race. Our goal is to learn an accurate classifier $h : \mathcal{X} \to \{0, 1\}$ from some set (i.e., family) of classifiers $\mathcal{H}$, such as linear threshold rules, decision trees, or neural nets, while satisfying some definition of fairness. Note that the classifiers in $\mathcal{H}$ do not explicitly depend on $A$.

2.1. Fairness Definitions

We focus on two well-known quantitative definitions of fairness that have been considered in previous work on fair classification; however, our approach also encompasses many other previously studied definitions of fairness as special cases, as we explain at the end of this section.

The first definition, demographic (or statistical) parity, can be thought of as a stronger version of the US Equal Employment Opportunity Commission's four-fifths rule, which requires that the "selection rate for any race, sex, or ethnic group [must be at least] four-fifths (4/5) (or eighty percent) of the rate for the group with the highest rate" (see the Uniform Guidelines on Employment Selection Procedures, 29 C.F.R. §1607.4(D), 2015).

Definition 1 (Demographic parity, DP). A classifier $h$ satisfies demographic parity under a distribution over $(X, A, Y)$ if its prediction $h(X)$ is statistically independent of the protected attribute $A$, that is, if $P[h(X) = \hat{y} \mid A = a] = P[h(X) = \hat{y}]$ for all $a$, $\hat{y}$. Because $\hat{y} \in \{0, 1\}$, this is equivalent to $E[h(X) \mid A = a] = E[h(X)]$ for all $a$.

The second definition, equalized odds, was recently proposed by Hardt et al. (2016) to remedy two previously noted flaws with demographic parity (Dwork et al., 2012). First, demographic parity permits a classifier which accurately classifies data points with one value $A = a$, such as the value $a$ with the most data, but makes random predictions for data points with $A \ne a$ as long as the probabilities of $h(X) = 1$ match. Second, demographic parity rules out perfect classifiers whenever $Y$ is correlated with $A$. In contrast, equalized odds suffers from neither of these flaws.

Definition 2 (Equalized odds, EO). A classifier $h$ satisfies equalized odds under a distribution over $(X, A, Y)$ if its prediction $h(X)$ is conditionally independent of the protected attribute $A$ given the label $Y$, that is, if $P[h(X) = \hat{y} \mid A = a, Y = y] = P[h(X) = \hat{y} \mid Y = y]$ for all $a$, $y$, and $\hat{y}$. Because $\hat{y} \in \{0, 1\}$, this is equivalent to $E[h(X) \mid A = a, Y = y] = E[h(X) \mid Y = y]$ for all $a$, $y$.

We now show how each definition can be viewed as a special case of a general set of linear constraints of the form

$$M \mu(h) \le c, \qquad (1)$$

where the matrix $M \in \mathbb{R}^{|\mathcal{K}| \times |\mathcal{J}|}$ and the vector $c \in \mathbb{R}^{|\mathcal{K}|}$ describe the linear constraints, each indexed by $k \in \mathcal{K}$, and $\mu(h) \in \mathbb{R}^{|\mathcal{J}|}$ is a vector of conditional moments of the form

$$\mu_j(h) = E\big[ g_j(X, A, Y, h(X)) \,\big|\, \mathcal{E}_j \big] \quad \text{for } j \in \mathcal{J},$$

where $g_j : \mathcal{X} \times \mathcal{A} \times \{0, 1\} \times \{0, 1\} \to [0, 1]$ and $\mathcal{E}_j$ is an event defined with respect to $(X, A, Y)$. Crucially, $g_j$ depends on $h$, while $\mathcal{E}_j$ cannot depend on $h$ in any way.
Example 1 (DP). In a binary classification setting, demographic parity can be expressed as a set of $|\mathcal{A}|$ equality constraints, each of the form $E[h(X) \mid A = a] = E[h(X)]$. Letting $\mathcal{J} = \mathcal{A} \cup \{\ast\}$, $g_j(X, A, Y, h(X)) = h(X)$ for all $j$, $\mathcal{E}_a = \{A = a\}$, and $\mathcal{E}_\ast = \{\text{True}\}$, where $\{\text{True}\}$ refers to the event encompassing all points in the sample space, each equality constraint can be expressed as $\mu_a(h) = \mu_\ast(h)$ (note that $\mu_\ast(h) = E[h(X) \mid \text{True}] = E[h(X)]$). Finally, because each such constraint can be equivalently expressed as a pair of inequality constraints of the form

$$\mu_a(h) - \mu_\ast(h) \le 0, \qquad -\mu_a(h) + \mu_\ast(h) \le 0,$$

demographic parity can be expressed as equation (1), where $\mathcal{K} = \mathcal{A} \times \{+, -\}$, $M_{(a,+),\,a'} = \mathbf{1}\{a' = a\}$, $M_{(a,+),\,\ast} = -1$, $M_{(a,-),\,a'} = -\mathbf{1}\{a' = a\}$, $M_{(a,-),\,\ast} = 1$, and $c = 0$. Expressing each equality constraint as a pair of inequality constraints allows us to control the extent to which each constraint is enforced by positing $c_k > 0$ for some (or all) $k$.

Example 2 (EO). In a binary classification setting, equalized odds can be expressed as a set of $2|\mathcal{A}|$ equality constraints, each of the form $E[h(X) \mid A = a, Y = y] = E[h(X) \mid Y = y]$. Letting $\mathcal{J} = (\mathcal{A} \cup \{\ast\}) \times \{0, 1\}$, $g_j(X, A, Y, h(X)) = h(X)$ for all $j$, $\mathcal{E}_{(a,y)} = \{A = a, Y = y\}$, and $\mathcal{E}_{(\ast,y)} = \{Y = y\}$, each equality constraint can be equivalently expressed as

$$\mu_{(a,y)}(h) - \mu_{(\ast,y)}(h) \le 0, \qquad -\mu_{(a,y)}(h) + \mu_{(\ast,y)}(h) \le 0.$$

As a result, equalized odds can be expressed as equation (1), where $\mathcal{K} = \mathcal{A} \times \{0, 1\} \times \{+, -\}$, $M_{(a,y,+),\,(a',y')} = \mathbf{1}\{a' = a,\, y' = y\}$, $M_{(a,y,+),\,(\ast,y')} = -\mathbf{1}\{y' = y\}$, $M_{(a,y,-),\,(a',y')} = -\mathbf{1}\{a' = a,\, y' = y\}$, $M_{(a,y,-),\,(\ast,y')} = \mathbf{1}\{y' = y\}$, and $c = 0$. Again, we can posit $c_k > 0$ for some (or all) $k$ to allow small violations of some (or all) of the constraints.

Although we omit the details, we note that many other previously studied definitions of fairness can also be expressed as equation (1). For example, equality of opportunity (Hardt et al., 2016) (also known as balance for the positive class; Kleinberg et al., 2017), balance for the negative class (Kleinberg et al., 2017), error-rate balance (Chouldechova, 2017), overall accuracy equality (Berk et al., 2017), and treatment equality (Berk et al., 2017) can all be expressed as equation (1); in contrast, calibration (Kleinberg et al., 2017) and predictive parity (Chouldechova, 2017) cannot, because to do so would require the event $\mathcal{E}_j$ to depend on $h$. We note that our approach can also be used to satisfy multiple definitions of fairness, though if these definitions are mutually contradictory, e.g., as described by Kleinberg et al. (2017), then our guarantees become vacuous.

2.2. Fair Classification

In a standard (binary) classification setting, the goal is to learn the classifier $h \in \mathcal{H}$ with the minimum classification error $\text{err}(h) := P[h(X) \ne Y]$. However, because our goal is to learn the most accurate classifier while satisfying fairness constraints, as formalized above, we instead seek to find the solution to the constrained optimization problem

$$\min_{h \in \mathcal{H}} \ \text{err}(h) \quad \text{subject to} \quad M\mu(h) \le c. \qquad (2)$$

(We consider misclassification error for concreteness, but all the results in this paper apply to any error of the form $\text{err}(h) = E[g_{\text{err}}(X, A, Y, h(X))]$, where $g_{\text{err}}(\cdot, \cdot, \cdot, \cdot) \in [0, 1]$.)

Furthermore, rather than just considering classifiers in the set $\mathcal{H}$, we can enlarge the space of possible classifiers by considering randomized classifiers that can be obtained via a distribution over $\mathcal{H}$. By considering randomized classifiers, we can achieve better accuracy-fairness tradeoffs than would otherwise be possible. A randomized classifier $Q$ makes a prediction by first sampling a classifier $h \in \mathcal{H}$ from $Q$ and then using $h$ to make the prediction. The resulting classification error is $\text{err}(Q) = \sum_{h \in \mathcal{H}} Q(h)\, \text{err}(h)$ and the conditional moments are $\mu(Q) = \sum_{h \in \mathcal{H}} Q(h)\, \mu(h)$ (see Appendix A for the derivation). Thus we seek to solve

$$\min_{Q \in \Delta} \ \text{err}(Q) \quad \text{subject to} \quad M\mu(Q) \le c, \qquad (3)$$

where $\Delta$ is the set of all distributions over $\mathcal{H}$.
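To make this machinery concrete, here is a minimal sketch (assuming NumPy; the function name and data below are illustrative placeholders, not code from the released implementation) that builds the demographic-parity matrix $M$ of Example 1 and evaluates the empirical constraint values $M\hat{\mu}(h)$ for a fixed classifier $h$:

```python
import numpy as np

def dp_constraint_values(h, X, A):
    """Empirical demographic-parity constraints from Example 1: rows of M are
    indexed by (a, +) and (a, -); columns by the groups a and the event '*'.
    Returns M mu_hat(h); with c = 0, positive entries are constraint violations."""
    A = np.asarray(A)
    yhat = np.array([h(x) for x in X], dtype=float)
    groups = sorted(set(A))
    # conditional moments mu_a(h) = E[h(X) | A = a] and mu_*(h) = E[h(X)]
    mu = np.array([yhat[A == a].mean() for a in groups] + [yhat.mean()])
    M = np.zeros((2 * len(groups), len(groups) + 1))
    for i in range(len(groups)):
        M[2 * i, i], M[2 * i, -1] = 1.0, -1.0          #  mu_a - mu_* <= 0
        M[2 * i + 1, i], M[2 * i + 1, -1] = -1.0, 1.0  # -mu_a + mu_* <= 0
    return M @ mu

# hypothetical usage with a threshold rule on the first feature
rng = np.random.default_rng(0)
X = rng.random((1000, 3))
A = rng.choice(["a1", "a2"], size=1000)
h = lambda x: int(x[0] > 0.5)
print(dp_constraint_values(h, X, A).max())  # largest DP violation
```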
In practice, we do not know the true distribution over $(X, A, Y)$ and only have access to a data set of training examples $\{(X_i, A_i, Y_i)\}_{i=1}^n$. We therefore replace $\text{err}(Q)$ and $\mu(Q)$ in equation (3) with their empirical versions $\widehat{\text{err}}(Q)$ and $\hat{\mu}(Q)$. Because of the sampling error in $\hat{\mu}(Q)$, we also allow errors in satisfying the constraints by setting $\hat{c}_k = c_k + \varepsilon_k$ for all $k$, where $\varepsilon_k \ge 0$. After these modifications, we need to solve the empirical version of equation (3):

$$\min_{Q \in \Delta} \ \widehat{\text{err}}(Q) \quad \text{subject to} \quad M\hat{\mu}(Q) \le \hat{c}. \qquad (4)$$

3. Reductions Approach

We now show how the problem (4) can be reduced to a sequence of cost-sensitive classification problems. We further show that the solutions to our sequence of cost-sensitive classification problems yield a randomized classifier with the lowest (empirical) error subject to the desired constraints.

3.1. Cost-sensitive Classification

We assume access to a cost-sensitive classification algorithm for the set $\mathcal{H}$. The input to such an algorithm is a data set of training examples $\{(X_i, C_i^0, C_i^1)\}_{i=1}^n$, where $C_i^0$ and $C_i^1$ denote the losses (costs, in this setting) for predicting the labels 0 or 1, respectively, for $X_i$. The algorithm outputs

$$\arg\min_{h \in \mathcal{H}} \ \sum_{i=1}^n \Big[ h(X_i)\, C_i^1 + \big(1 - h(X_i)\big)\, C_i^0 \Big]. \qquad (5)$$

This abstraction allows us to specify different costs for different training examples, which is essential for incorporating fairness constraints. Moreover, efficient cost-sensitive classification algorithms are readily available for several common classifier representations (e.g., Beygelzimer et al., 2005; Langford & Beygelzimer, 2005; Fan et al., 1999). In particular, equation (5) is equivalent to a weighted classification problem, where the input consists of labeled examples $\{(X_i, Y_i, W_i)\}_{i=1}^n$ with $Y_i \in \{0, 1\}$ and $W_i \ge 0$, and the goal is to minimize the weighted classification error $\sum_{i=1}^n W_i\, \mathbf{1}\{h(X_i) \ne Y_i\}$. This is equivalent to equation (5) if we set $W_i = |C_i^0 - C_i^1|$ and $Y_i = \mathbf{1}\{C_i^0 \ge C_i^1\}$.
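As a minimal sketch of this equivalence (assuming scikit-learn; LogisticRegression stands in for any learner in $\mathcal{H}$ that accepts example weights, and the helper name is illustrative, not from the released code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_cost_sensitive(X, C0, C1):
    """Approximately solve eq. (5) via the weighted-classification reduction of Sec. 3.1."""
    W = np.abs(C0 - C1)             # W_i = |C_i^0 - C_i^1|
    Y = (C0 >= C1).astype(int)      # Y_i = 1{C_i^0 >= C_i^1}
    clf = LogisticRegression()      # any learner that supports sample_weight
    clf.fit(X, Y, sample_weight=W)
    return clf
```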
3.2. Reduction

To derive our fair classification algorithm, we rewrite equation (4) as a saddle point problem. We begin by introducing a Lagrange multiplier $\lambda_k \ge 0$ for each of the $|\mathcal{K}|$ constraints, summarized as $\lambda \in \mathbb{R}_+^{|\mathcal{K}|}$, and form the Lagrangian

$$L(Q, \lambda) = \widehat{\text{err}}(Q) + \lambda^{\top}\big( M\hat{\mu}(Q) - \hat{c} \big).$$

Thus, equation (4) is equivalent to

$$\min_{Q \in \Delta} \ \max_{\lambda \in \mathbb{R}_+^{|\mathcal{K}|}} \ L(Q, \lambda). \qquad (6)$$

For computational and statistical reasons, we impose an additional constraint on the $\ell_1$ norm of $\lambda$ and seek to simultaneously find the solution to the constrained version of (6) as well as its dual, obtained by switching min and max:

$$\min_{Q \in \Delta} \ \max_{\lambda \in \mathbb{R}_+^{|\mathcal{K}|},\ \|\lambda\|_1 \le B} \ L(Q, \lambda), \qquad \text{(P)}$$

$$\max_{\lambda \in \mathbb{R}_+^{|\mathcal{K}|},\ \|\lambda\|_1 \le B} \ \min_{Q \in \Delta} \ L(Q, \lambda). \qquad \text{(D)}$$

Because $L$ is linear in $Q$ and $\lambda$ and the domains of $Q$ and $\lambda$ are convex and compact, both problems have solutions, which we denote by $Q^\dagger$ and $\lambda^\dagger$, and the minimum value of (P) and the maximum value of (D) are equal and coincide with $L(Q^\dagger, \lambda^\dagger)$. Thus, $(Q^\dagger, \lambda^\dagger)$ is a saddle point of $L$ (Corollary 37.6.2 and Lemma 36.2 of Rockafellar, 1970).

We find the saddle point by using the standard scheme of Freund & Schapire (1996), developed for the equivalent problem of solving for an equilibrium in a zero-sum game. From a game-theoretic perspective, the saddle point can be viewed as an equilibrium of a game between two players: the Q-player choosing $Q$ and the λ-player choosing $\lambda$. The Lagrangian $L(Q, \lambda)$ specifies how much the Q-player has to pay to the λ-player after they make their choices. At the saddle point, neither player wants to deviate from their choice. Our algorithm finds an approximate equilibrium in which neither player can gain more than $\nu$ by changing their choice (where $\nu > 0$ is an input to the algorithm). Such an approximate equilibrium corresponds to a $\nu$-approximate saddle point of the Lagrangian, which is a pair $(\hat{Q}, \hat{\lambda})$ such that

$$L(\hat{Q}, \hat{\lambda}) \le L(Q, \hat{\lambda}) + \nu \quad \text{for all } Q \in \Delta,$$
$$L(\hat{Q}, \hat{\lambda}) \ge L(\hat{Q}, \lambda) - \nu \quad \text{for all } \lambda \in \mathbb{R}_+^{|\mathcal{K}|} \text{ with } \|\lambda\|_1 \le B.$$

We proceed iteratively by running a no-regret algorithm for the λ-player, while executing the best response of the Q-player. Following Freund & Schapire (1996), the average play of both players converges to the saddle point. We run the exponentiated gradient algorithm (Kivinen & Warmuth, 1997) for the λ-player and terminate as soon as the suboptimality of the average play falls below the pre-specified accuracy $\nu$. The best response of the Q-player can always be chosen to put all of the mass on one of the candidate classifiers $h \in \mathcal{H}$, and can be implemented by a single call to a cost-sensitive classification algorithm for the set $\mathcal{H}$.

Algorithm 1: Exponentiated-gradient reduction for fair classification
Input: training examples $\{(X_i, Y_i, A_i)\}_{i=1}^n$; fairness constraints specified by $g_j$, $\mathcal{E}_j$, $M$, $\hat{c}$; bound $B$; accuracy $\nu$; learning rate $\eta$
Set $\theta_1 = 0 \in \mathbb{R}^{|\mathcal{K}|}$
for $t = 1, 2, \ldots$ do
  Set $\lambda_{t,k} = B\, \exp\{\theta_{t,k}\} \,/\, \big(1 + \sum_{k' \in \mathcal{K}} \exp\{\theta_{t,k'}\}\big)$ for all $k \in \mathcal{K}$
  $h_t \leftarrow \text{BEST}_h(\lambda_t)$
  $\hat{Q}_t \leftarrow \frac{1}{t}\sum_{t'=1}^{t} h_{t'}$,  $\bar{L} \leftarrow L\big(\hat{Q}_t,\, \text{BEST}_\lambda(\hat{Q}_t)\big)$
  $\hat{\lambda}_t \leftarrow \frac{1}{t}\sum_{t'=1}^{t} \lambda_{t'}$,  $\underline{L} \leftarrow L\big(\text{BEST}_h(\hat{\lambda}_t),\, \hat{\lambda}_t\big)$
  $\nu_t \leftarrow \max\big\{ L(\hat{Q}_t, \hat{\lambda}_t) - \underline{L},\ \bar{L} - L(\hat{Q}_t, \hat{\lambda}_t) \big\}$
  if $\nu_t \le \nu$ then return $(\hat{Q}_t, \hat{\lambda}_t)$
  Set $\theta_{t+1} = \theta_t + \eta\big( M\hat{\mu}(h_t) - \hat{c} \big)$
end for

Algorithm 1 fully implements this scheme, except for the functions $\text{BEST}_\lambda$ and $\text{BEST}_h$, which correspond to the best-response algorithms of the two players. (We need the best response of the λ-player to evaluate whether the suboptimality of the current average play has fallen below $\nu$.) The two best-response functions can be calculated as follows.

$\text{BEST}_\lambda(Q)$: the best response of the λ-player. The best response of the λ-player for a given $Q$ is any maximizer of $L(Q, \lambda)$ over all valid $\lambda$. In our setting, it can always be chosen to be either $0$ or to put all of the mass on the most violated constraint. Letting $\hat{\gamma}(Q) := M\hat{\mu}(Q)$ and letting $e_k$ denote the $k$th vector of the standard basis, $\text{BEST}_\lambda(Q)$ returns $0$ if $\hat{\gamma}(Q) \le \hat{c}$, and $B e_{k^\ast}$ otherwise, where $k^\ast = \arg\max_k \big[\hat{\gamma}_k(Q) - \hat{c}_k\big]$.

$\text{BEST}_h(\lambda)$: the best response of the Q-player. Here, the best response minimizes $L(Q, \lambda)$ over all $Q$ in the simplex $\Delta$. Because $L$ is linear in $Q$, the minimizer can always be chosen to put all of the mass on a single classifier $h$. We show how to obtain the classifier constituting the best response via a reduction to cost-sensitive classification. Letting $\hat{p}_j := \hat{P}[\mathcal{E}_j]$ be the empirical event probabilities, the Lagrangian for a $Q$ which puts all of the mass on a single $h$ is

$$L(h, \lambda) = \widehat{\text{err}}(h) + \lambda^\top\big( M\hat{\mu}(h) - \hat{c} \big) = \hat{E}\big[\mathbf{1}\{h(X) \ne Y\}\big] - \lambda^\top \hat{c} + \sum_{k,j} M_{k,j}\,\lambda_k\,\hat{\mu}_j(h)$$
$$= -\lambda^\top \hat{c} + \hat{E}\Big[ \mathbf{1}\{h(X) \ne Y\} + \sum_{k,j} \frac{M_{k,j}\,\lambda_k}{\hat{p}_j}\; g_j\big(X, A, Y, h(X)\big)\, \mathbf{1}\{(X, A, Y) \in \mathcal{E}_j\} \Big].$$

Assuming a data set of training examples $\{(X_i, A_i, Y_i)\}_{i=1}^n$, the minimization of $L(h, \lambda)$ over $h$ then corresponds to cost-sensitive classification on $\{(X_i, C_i^0, C_i^1)\}_{i=1}^n$ with costs

$$C_i^0 = \mathbf{1}\{Y_i \ne 0\} + \sum_{k,j} \frac{M_{k,j}\,\lambda_k}{\hat{p}_j}\; g_j(X_i, A_i, Y_i, 0)\, \mathbf{1}\{(X_i, A_i, Y_i) \in \mathcal{E}_j\},$$
$$C_i^1 = \mathbf{1}\{Y_i \ne 1\} + \sum_{k,j} \frac{M_{k,j}\,\lambda_k}{\hat{p}_j}\; g_j(X_i, A_i, Y_i, 1)\, \mathbf{1}\{(X_i, A_i, Y_i) \in \mathcal{E}_j\}.$$

(For a general error $\text{err}(h) = E[g_{\text{err}}(X, A, Y, h(X))]$, the costs $C_i^0$ and $C_i^1$ contain, respectively, the terms $g_{\text{err}}(X_i, A_i, Y_i, 0)$ and $g_{\text{err}}(X_i, A_i, Y_i, 1)$ instead of $\mathbf{1}\{Y_i \ne 0\}$ and $\mathbf{1}\{Y_i \ne 1\}$.)
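A minimal sketch of $\text{BEST}_h(\lambda)$ built from these cost formulas might look as follows (assuming NumPy, a dictionary encoding of $M$, boolean masks for the events $\mathcal{E}_j$, and a caller-supplied weighted-classification oracle; the names are illustrative, not from the released implementation):

```python
import numpy as np

def best_h(lmbda, X, A, Y, M, events, g, fit_weighted):
    """BEST_h(lambda): cost-sensitive best response of the Q-player (Sec. 3.2).

    lmbda        : dict k -> lambda_k >= 0
    M            : dict (k, j) -> M_{k,j} (missing entries are treated as 0)
    events       : dict j -> boolean mask with events[j][i] = 1{(X_i,A_i,Y_i) in E_j}
    g            : dict j -> function g_j(x, a, y, yhat) taking values in [0, 1]
    fit_weighted : oracle solving weighted binary classification over H
    """
    Y = np.asarray(Y)
    C0 = (Y != 0).astype(float)          # baseline cost of predicting 0
    C1 = (Y != 1).astype(float)          # baseline cost of predicting 1
    for j, mask in events.items():
        p_j = mask.mean()                # empirical probability of E_j
        coef = sum(M.get((k, j), 0.0) * lmbda[k] for k in lmbda) / p_j
        idx = np.where(mask)[0]
        C0[idx] += coef * np.array([g[j](X[i], A[i], Y[i], 0) for i in idx])
        C1[idx] += coef * np.array([g[j](X[i], A[i], Y[i], 1) for i in idx])
    # reduce cost-sensitive classification to weighted classification (Sec. 3.1)
    W = np.abs(C0 - C1)
    Yw = (C0 >= C1).astype(int)
    return fit_weighted(X, Yw, W)
```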
Theorem 1. Let $\rho := \max_h \big\| M\hat{\mu}(h) - \hat{c} \big\|_\infty$. Then Algorithm 1 satisfies the inequality

$$\nu_t \le \frac{B \log(|\mathcal{K}| + 1)}{\eta t} + \eta \rho^2 B.$$

Thus, for $\eta = \frac{\nu}{2\rho^2 B}$, Algorithm 1 will return a $\nu$-approximate saddle point of $L$ in at most $4\rho^2 B^2 \log(|\mathcal{K}| + 1)/\nu^2$ iterations.

This theorem, proved in Appendix B, bounds the suboptimality $\nu_t$ of the average play $(\hat{Q}_t, \hat{\lambda}_t)$, which is equal to its suboptimality as a saddle point. The right-hand side of the bound is optimized by $\eta = \sqrt{\log(|\mathcal{K}| + 1)}\,/\,(\rho\sqrt{t})$, leading to the bound $\nu_t \le 2\rho B \sqrt{\log(|\mathcal{K}| + 1)/t}$. This bound decreases with the number of iterations $t$ and grows very slowly with the number of constraints $|\mathcal{K}|$. The quantity $\rho$ is a problem-specific constant that bounds how much any single classifier $h \in \mathcal{H}$ can violate the desired set of fairness constraints. Finally, $B$ is the bound on the $\ell_1$-norm of $\lambda$, which we introduced to enable this specific algorithmic scheme. In general, larger values of $B$ will bring the problem (P) closer to (6), and thus also to (4), but at the cost of needing more iterations to reach any given suboptimality. In particular, as we derive in the theorem, achieving suboptimality $\nu$ may need up to $4\rho^2 B^2 \log(|\mathcal{K}| + 1)/\nu^2$ iterations.

Example 3 (DP). Using the matrix $M$ for demographic parity as described in Section 2, the cost-sensitive reduction for a vector of Lagrange multipliers $\lambda$ uses costs

$$C_i^0 = \mathbf{1}\{Y_i \ne 0\}, \qquad C_i^1 = \mathbf{1}\{Y_i \ne 1\} + \frac{\lambda_{A_i}}{\hat{p}_{A_i}} - \sum_{a} \lambda_a,$$

where $\hat{p}_a := \hat{P}[A = a]$ and $\lambda_a := \lambda_{(a,+)} - \lambda_{(a,-)}$, effectively replacing two non-negative Lagrange multipliers by a single multiplier, which can be either positive or negative. Because $c_k = 0$ for all $k$, $\hat{c}_k = \varepsilon_k$. Furthermore, because all empirical moments are bounded in $[0, 1]$, we can assume $\varepsilon_k \le 1$, which yields the bound $\rho \le 2$. Thus, Algorithm 1 terminates in at most $16 B^2 \log(2|\mathcal{A}| + 1)/\nu^2$ iterations.

Example 4 (EO). For equalized odds, the cost-sensitive reduction for a vector of Lagrange multipliers $\lambda$ uses costs

$$C_i^0 = \mathbf{1}\{Y_i \ne 0\}, \qquad C_i^1 = \mathbf{1}\{Y_i \ne 1\} + \frac{\lambda_{(A_i, Y_i)}}{\hat{p}_{(A_i, Y_i)}} - \frac{\sum_a \lambda_{(a, Y_i)}}{\hat{p}_{(\ast, Y_i)}},$$

where $\hat{p}_{(a,y)} := \hat{P}[A = a, Y = y]$, $\hat{p}_{(\ast,y)} := \hat{P}[Y = y]$, and $\lambda_{(a,y)} := \lambda_{(a,y,+)} - \lambda_{(a,y,-)}$. If we again assume $\varepsilon_k \le 1$, then we obtain the bound $\rho \le 2$. Thus, Algorithm 1 terminates in at most $16 B^2 \log(4|\mathcal{A}| + 1)/\nu^2$ iterations.
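Putting the pieces together, a compact sketch of Algorithm 1 specialized to demographic parity with the costs of Example 3 might look as follows (assuming scikit-learn logistic regression as the cost-sensitive learner and a fixed iteration budget in place of the $\nu_t$ stopping test; this is an illustrative sketch with hypothetical names, not the released implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def exp_gradient_dp(X, A, Y, eps=0.02, B=10.0, eta=0.5, T=50):
    """Sketch of Algorithm 1 for demographic parity (costs from Example 3)."""
    A, Y = np.asarray(A), np.asarray(Y)
    groups = sorted(set(A))
    p = {a: np.mean(A == a) for a in groups}         # p_hat_a = P_hat[A = a]
    K = [(a, s) for a in groups for s in (+1, -1)]   # constraints (a, +) and (a, -)
    theta = {k: 0.0 for k in K}
    hs = []

    def best_h(lmb):
        # collapse the (a, +)/(a, -) multipliers into one signed multiplier per group
        lam = {a: lmb[(a, +1)] - lmb[(a, -1)] for a in groups}
        adj = np.array([lam[a] / p[a] for a in A]) - sum(lam.values())
        C0 = (Y != 0).astype(float)
        C1 = (Y != 1).astype(float) + adj            # Example 3 costs
        W, Yw = np.abs(C0 - C1), (C0 >= C1).astype(int)
        return LogisticRegression().fit(X, Yw, sample_weight=W)

    for t in range(T):
        z = {k: np.exp(theta[k]) for k in K}
        lmb = {k: B * z[k] / (1.0 + sum(z.values())) for k in K}
        h = best_h(lmb)
        hs.append(h)
        pred = h.predict(X).astype(float)
        mu_star = pred.mean()
        for (a, s) in K:                             # theta update: M mu_hat(h) - c_hat
            gamma = s * (pred[A == a].mean() - mu_star)
            theta[(a, s)] += eta * (gamma - eps)
    # a uniform distribution over hs corresponds to the averaged play Q_hat
    return hs
```

In a full implementation one would also track the averaged multipliers $\hat{\lambda}_t$ and the suboptimality $\nu_t$ in order to use the stopping rule of Algorithm 1 rather than a fixed budget.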
3.3. Error Analysis

Our ultimate goal, as formalized in equation (3), is to minimize the classification error while satisfying fairness constraints under a true but unknown distribution over $(X, A, Y)$. In the process of deriving Algorithm 1, we introduced three different sources of error. First, we replaced the true classification error and true moments with their empirical versions. Second, we introduced a bound $B$ on the magnitude of $\lambda$. Finally, we only run the optimization algorithm for a fixed number of iterations, until it reaches suboptimality level $\nu$. The first source of error, due to the use of empirical rather than true quantities, is unavoidable and constitutes the underlying statistical error. The other two sources of error, the bound $B$ and the suboptimality level $\nu$, stem from the optimization algorithm and can be driven arbitrarily small at the cost of additional iterations. In this section, we show how the statistical error and the optimization error affect the true accuracy and the fairness of the randomized classifier returned by Algorithm 1, in other words, how well Algorithm 1 solves our original problem (3).

To bound the statistical error, we use the Rademacher complexity of the classifier family $\mathcal{H}$, which we denote by $R_n(\mathcal{H})$, where $n$ is the number of training examples. We assume that $R_n(\mathcal{H}) \le C n^{-\alpha}$ for some $C \ge 0$ and $\alpha \le 1/2$. We note that $\alpha = 1/2$ in the vast majority of classifier families, including norm-bounded linear functions (see Theorem 1 of Kakade et al., 2009), neural networks (see Theorem 18 of Bartlett & Mendelson, 2002), and classifier families with bounded VC dimension (see Lemma 4 and Theorem 6 of Bartlett & Mendelson, 2002).

Recall that in our empirical optimization problem we assume that $\hat{c}_k = c_k + \varepsilon_k$, where $\varepsilon_k \ge 0$ are error bounds that account for the discrepancy between $\mu(Q)$ and $\hat{\mu}(Q)$. In our analysis, we assume that these error bounds have been set in accordance with the Rademacher complexity of $\mathcal{H}$.

Assumption 1. There exist $C, C' \ge 0$ and $\alpha \le 1/2$ such that $R_n(\mathcal{H}) \le C n^{-\alpha}$ and $\varepsilon_k = C' \sum_{j \in \mathcal{J}} |M_{k,j}|\, n_j^{-\alpha}$, where $n_j$ is the number of data points that fall in $\mathcal{E}_j$, i.e., $n_j := |\{i :\, (X_i, A_i, Y_i) \in \mathcal{E}_j\}|$.

The optimization error can be bounded via a careful analysis of the Lagrangian and the optimality conditions of (P) and (D). Combining the three different sources of error yields the following bound, which we prove in Appendix C.

Theorem 2. Let Assumption 1 hold for $C' \ge 2C + 2 + \sqrt{\ln(4/\delta)/2}$, where $\delta > 0$. Let $(\hat{Q}, \hat{\lambda})$ be any $\nu$-approximate saddle point of $L$, let $Q^\ast$ minimize $\text{err}(Q)$ subject to $M\mu(Q) \le c$, and let $p_j^\ast = P[\mathcal{E}_j]$. Then, with probability at least $1 - (|\mathcal{J}| + 1)\delta$, the distribution $\hat{Q}$ satisfies

$$\text{err}(\hat{Q}) \le \text{err}(Q^\ast) + 2\nu + \tilde{O}(n^{-\alpha}),$$
$$\gamma_k(\hat{Q}) \le c_k + \frac{1 + 2\nu}{B} + \sum_{j \in \mathcal{J}} |M_{k,j}|\, \tilde{O}\big(n_j^{-\alpha}\big) \quad \text{for all } k,$$

where $\gamma(Q) := M\mu(Q)$ denotes the constraint values under the true distribution and $\tilde{O}(\cdot)$ suppresses polynomial dependence on $\ln(1/\delta)$. If $n p_j^\ast \ge 8\log(2/\delta)$ for all $j$, then, for all $k$,

$$\gamma_k(\hat{Q}) \le c_k + \frac{1 + 2\nu}{B} + \sum_{j \in \mathcal{J}} |M_{k,j}|\, \tilde{O}\big((n p_j^\ast)^{-\alpha}\big).$$

In other words, the solution returned by Algorithm 1 achieves the lowest feasible classification error on the true distribution up to the optimization error, which grows linearly with $\nu$, and the statistical error, which grows as $n^{-\alpha}$. Therefore, if we want to guarantee that the optimization error does not dominate the statistical error, we should set $\nu \propto n^{-\alpha}$. The fairness constraints on the true distribution are satisfied up to the optimization error $(1 + 2\nu)/B$ and up to the statistical error. Because the statistical error depends on the moments, and the error in estimating the moments grows as $n_j^{-\alpha} \ge n^{-\alpha}$, we can set $B \propto n^{\alpha}$ to guarantee that the optimization error does not dominate the statistical error. Combining this reasoning with the learning rate setting of Theorem 1 yields the following theorem (proved in Appendix C).

Theorem 3. Let $\rho := \max_h \|M\hat{\mu}(h) - \hat{c}\|_\infty$. Let Assumption 1 hold for $C' \ge 2C + 2 + \sqrt{\ln(4/\delta)/2}$, where $\delta > 0$. Let $Q^\ast$ minimize $\text{err}(Q)$ subject to $M\mu(Q) \le c$. Then Algorithm 1 with $\nu \propto n^{-\alpha}$, $B \propto n^{\alpha}$, and $\eta \propto \rho^{-2} n^{-2\alpha}$ terminates in $O(\rho^2 n^{4\alpha} \ln|\mathcal{K}|)$ iterations and returns $\hat{Q}$, which with probability at least $1 - (|\mathcal{J}| + 1)\delta$ satisfies

$$\text{err}(\hat{Q}) \le \text{err}(Q^\ast) + \tilde{O}(n^{-\alpha}), \qquad \gamma_k(\hat{Q}) \le c_k + \sum_{j \in \mathcal{J}} |M_{k,j}|\, \tilde{O}\big(n_j^{-\alpha}\big) \quad \text{for all } k.$$

Example 5 (DP). If $n_a$ denotes the number of training examples with $A_i = a$, then Assumption 1 states that we should set $\varepsilon_{(a,+)} = \varepsilon_{(a,-)} = C'\big(n_a^{-\alpha} + n^{-\alpha}\big)$, and Theorem 3 then shows that for a suitable setting of $C'$, $\nu$, $B$, and $\eta$, Algorithm 1 will return a randomized classifier $\hat{Q}$ with the lowest feasible classification error up to $\tilde{O}(n^{-\alpha})$ while also approximately satisfying the fairness constraints

$$\Big| E[h(X) \mid A = a] - E[h(X)] \Big| \le \tilde{O}\big(n_a^{-\alpha}\big) \quad \text{for all } a,$$

where $E$ is with respect to $(X, A, Y)$ as well as $h \sim \hat{Q}$.

Example 6 (EO).
Similarly, if $n_{(a,y)}$ denotes the number of examples with $A_i = a$ and $Y_i = y$ and $n_{(\ast,y)}$ denotes the number of examples with $Y_i = y$, then Assumption 1 states that we should set $\varepsilon_{(a,y,+)} = \varepsilon_{(a,y,-)} = C'\big(n_{(a,y)}^{-\alpha} + n_{(\ast,y)}^{-\alpha}\big)$, and Theorem 3 then shows that for a suitable setting of $C'$, $\nu$, $B$, and $\eta$, Algorithm 1 will return a randomized classifier $\hat{Q}$ with the lowest feasible classification error up to $\tilde{O}(n^{-\alpha})$ while also approximately satisfying the fairness constraints

$$\Big| E[h(X) \mid A = a, Y = y] - E[h(X) \mid Y = y] \Big| \le \tilde{O}\big(n_{(a,y)}^{-\alpha}\big) \quad \text{for all } a, y.$$

Again, $E$ includes randomness under the true distribution over $(X, A, Y)$ as well as $h \sim \hat{Q}$.

3.4. Grid Search

In some situations, it is preferable to select a deterministic classifier, even if that means a lower accuracy or a modest violation of the fairness constraints. A set of candidate classifiers can be obtained from the saddle point $(Q^\dagger, \lambda^\dagger)$. Specifically, because $Q^\dagger$ is a minimizer of $L(Q, \lambda^\dagger)$ and $L$ is linear in $Q$, the distribution $Q^\dagger$ puts non-zero mass only on classifiers that are the Q-player's best responses to $\lambda^\dagger$. If we knew $\lambda^\dagger$, we could retrieve one such best response via the reduction to cost-sensitive learning introduced in Section 3.2. We can compute $\lambda^\dagger$ using Algorithm 1, but when the number of constraints is very small, as is the case for demographic parity or equalized odds with a binary protected attribute, it is also reasonable to consider a grid of values $\lambda$, calculate the best response for each value, and then select the value with the desired tradeoff between accuracy and fairness.

Example 7 (DP). When the protected attribute is binary, e.g., $A \in \{a, a'\}$, the grid search can in fact be conducted in a single dimension. The reduction formally takes two real-valued arguments $\lambda_a$ and $\lambda_{a'}$, and then adjusts the costs for predicting $h(X_i) = 1$ by the amounts $\delta_a = \lambda_a/\hat{p}_a - \lambda_a - \lambda_{a'}$ and $\delta_{a'} = \lambda_{a'}/\hat{p}_{a'} - \lambda_a - \lambda_{a'}$, respectively, on the training examples with $A_i = a$ and $A_i = a'$. These adjustments satisfy $\hat{p}_a \delta_a + \hat{p}_{a'} \delta_{a'} = 0$, so instead of searching over $\lambda_a$ and $\lambda_{a'}$, we can carry out the grid search over $\delta_a$ alone and apply the adjustment $\delta_{a'} = -\hat{p}_a \delta_a / \hat{p}_{a'}$ to the protected attribute value $a'$. With three attribute values, e.g., $A \in \{a, a', a''\}$, we similarly have $\hat{p}_a \delta_a + \hat{p}_{a'} \delta_{a'} + \hat{p}_{a''} \delta_{a''} = 0$, so it suffices to conduct the grid search in two dimensions rather than three.

Example 8 (EO). If $A \in \{a, a'\}$, we obtain the adjustment $\delta_{(a,y)} = \lambda_{(a,y)}/\hat{p}_{(a,y)} - \big(\lambda_{(a,y)} + \lambda_{(a',y)}\big)/\hat{p}_{(\ast,y)}$ for an example with protected attribute value $a$ and label $y$, and similarly for protected attribute value $a'$. In this case, separately for each $y$, the adjustments satisfy $\hat{p}_{(a,y)} \delta_{(a,y)} + \hat{p}_{(a',y)} \delta_{(a',y)} = 0$, so it suffices to do the grid search over $\delta_{(a,0)}$ and $\delta_{(a,1)}$ and set the parameters for $a'$ to $\delta_{(a',y)} = -\hat{p}_{(a,y)} \delta_{(a,y)} / \hat{p}_{(a',y)}$.
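A minimal sketch of this one-dimensional grid search for demographic parity with a binary protected attribute is shown below (assuming scikit-learn; the function name and structure are illustrative, not the released implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def grid_search_dp(X, A, Y, a, deltas):
    """Grid search of Sec. 3.4 for DP with a binary protected attribute.

    For each candidate delta_a, train the cost-sensitive best response and record
    its training error and DP violation; the caller then picks the desired tradeoff.
    """
    A, Y = np.asarray(A), np.asarray(Y)
    p_a = np.mean(A == a)
    results = []
    for d_a in deltas:
        d_other = -p_a * d_a / (1.0 - p_a)       # ensures p_a*d_a + p_a'*d_a' = 0
        adj = np.where(A == a, d_a, d_other)
        C0 = (Y != 0).astype(float)
        C1 = (Y != 1).astype(float) + adj        # Example 7 cost adjustment
        W, Yw = np.abs(C0 - C1), (C0 >= C1).astype(int)
        if len(np.unique(Yw)) < 2:               # skip degenerate cost settings
            continue
        clf = LogisticRegression().fit(X, Yw, sample_weight=W)
        pred = clf.predict(X).astype(float)
        err = np.mean(pred != Y)
        disp = max(abs(pred[A == g].mean() - pred.mean()) for g in np.unique(A))
        results.append((d_a, err, disp, clf))
    return results  # pick a classifier with the desired accuracy-fairness tradeoff
```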
4. Experimental Results

We now examine how our exponentiated-gradient reduction (implementation available at https://github.com/Microsoft/fairlearn) performs at the task of binary classification subject to either demographic parity or equalized odds. We provide an evaluation of our grid-search reduction in Appendix D.

We compared our reduction with the score-based post-processing algorithm of Hardt et al. (2016), which takes as its input any classifier (i.e., a standard classifier trained without any fairness constraints) and derives a monotone transformation of the classifier's output to remove any disparity with respect to the training examples. This post-processing algorithm works with both demographic parity and equalized odds, as well as with binary and non-binary protected attributes.

For demographic parity, we also compared our reduction with the reweighting and relabeling approaches of Kamiran & Calders (2012). Reweighting can be applied to both binary and non-binary protected attributes and operates by changing importance weights on each example with the goal of removing any statistical dependence between the protected attribute and label. (Although reweighting was developed for demographic parity, the weights that it induces are achievable by our grid search, albeit the grid search for equalized odds rather than demographic parity.) Relabeling was developed for binary protected attributes. First, a classifier is trained on the original data (without considering fairness). The training examples close to the decision boundary are then relabeled to remove all disparity while minimally affecting accuracy. The final classifier is then trained on the relabeled data.

As the base classifiers for our reductions, we used the weighted classification implementations of logistic regression and gradient-boosted decision trees in scikit-learn (Pedregosa et al., 2011). In addition to the three baselines described above, we also compared our reductions to the unconstrained classifiers trained to optimize accuracy only.

We used four data sets, randomly splitting each one into training examples (75%) and test examples (25%):

- The adult income data set (Lichman, 2013; 48,842 examples). Here the task is to predict whether someone makes more than $50k per year, with gender as the protected attribute. To examine the performance for non-binary protected attributes, we also conducted another experiment with the same data, using both gender and race (binarized into white and non-white) as the protected attribute. Relabeling, which requires binary protected attributes, was therefore not applicable here.
- ProPublica's COMPAS recidivism data (7,918 examples). The task is to predict recidivism from someone's criminal history, jail and prison time, demographics, and COMPAS risk scores, with race as the protected attribute (restricted to white and black defendants).
- Law School Admissions Council's National Longitudinal Bar Passage Study (Wightman, 1998; 20,649 examples). Here the task is to predict someone's eventual passage of the bar exam, with race (restricted to white and black only) as the protected attribute.
- The Dutch census data set (Dutch Central Bureau for Statistics, 2001; 60,420 examples). Here the task is to predict whether or not someone has a prestigious occupation, with gender as the protected attribute.

While all the evaluated algorithms require access to the protected attribute $A$ at training time, only the post-processing algorithm requires access to $A$ at test time. For a fair comparison, we included $A$ in the feature vector $X$, so all algorithms had access to it at both training time and test time.

We used the test examples to measure the classification error for each approach, as well as the violation of the desired fairness constraints, i.e., $\max_a \big| E[h(X) \mid A = a] - E[h(X)] \big|$ and $\max_{a,y} \big| E[h(X) \mid A = a, Y = y] - E[h(X) \mid Y = y] \big|$ for demographic parity and equalized odds, respectively. We ran our reduction across a wide range of tradeoffs between the classification error and fairness constraints. We considered $\varepsilon \in \{0.001, \ldots, 0.1\}$ and for each value ran Algorithm 1 with $\hat{c}_k = \varepsilon$ across all $k$.
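For readers who want to reproduce this kind of experiment, a hedged usage sketch follows. It assumes a recent release of the fairlearn package, whose reductions API (ExponentiatedGradient, DemographicParity) grew out of the code linked above and may differ from the version used for the experiments in this paper; the synthetic data and the hand-computed violation metric are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from fairlearn.reductions import ExponentiatedGradient, DemographicParity

def dp_violation(pred, A):
    """max_a |E[h(X) | A = a] - E[h(X)]| on a sample."""
    pred, A = np.asarray(pred, dtype=float), np.asarray(A)
    return max(abs(pred[A == a].mean() - pred.mean()) for a in np.unique(A))

# synthetic placeholder data; substitute any of the data sets described above
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
A = rng.choice(["f", "m"], size=2000)
y = (X[:, 0] + (A == "m") * 0.5 + rng.normal(scale=0.5, size=2000) > 0).astype(int)

mitigator = ExponentiatedGradient(
    estimator=LogisticRegression(),
    constraints=DemographicParity(),  # allowed violation is configurable; defaults used here
)
mitigator.fit(X, y, sensitive_features=A)
pred = mitigator.predict(X)
print("error:", np.mean(pred != y), "DP violation:", dp_violation(pred, A))
```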
[Figure 1: a grid of panels, one per data set (adult, adult4, COMPAS, Dutch census, Law Schools), constraint (DP or EO), and base learner (logistic regression or boosting), plotting classification error against the violation of the fairness constraint for the methods: unconstrained, postprocessing, reweight, relabel, and reduction (exp. grad.).]

Figure 1. Test classification error versus constraint violation with respect to DP (top two rows) and EO (bottom two rows). All data sets have binary protected attributes except for adult4, which has four protected attribute values, so relabeling is not applicable there. For our reduction approach we plot the convex envelope of the classifiers obtained on training data at various accuracy-fairness tradeoffs. We show 95% confidence bands for the classification error of our reduction approach and 95% confidence intervals for the constraint violation of post-processing. Our reduction approach dominates or matches the performance of the other approaches up to statistical uncertainty.

As expected, the returned randomized classifiers tracked the training Pareto frontier (see Figure 2 in Appendix D). In Figure 1, we evaluate these classifiers alongside the baselines on the test data. For all the data sets, the range of classification errors is much smaller than the range of constraint violations. Almost all the approaches were able to substantially reduce or remove disparity without much impact on classifier accuracy. One exception was the Dutch census data set, where the classification error increased the most in relative terms.

Our reduction generally dominated or matched the baselines. The relabeling approach frequently yielded solutions that were not Pareto optimal. Reweighting yielded solutions on the Pareto frontier, but often with substantial disparity. As expected, post-processing yielded disparities that were statistically indistinguishable from zero, but the resulting classification error was sometimes higher than that achieved by our reduction under a statistically indistinguishable disparity. In addition, and unlike the post-processing algorithm, our reduction can achieve any desired accuracy-fairness tradeoff, allows a wider range of fairness definitions, and does not require access to the protected attribute at test time.

Our grid-search reduction, evaluated in Appendix D, sometimes failed to achieve the lowest disparities on the training data, but its performance on the test data very closely matched that of our exponentiated-gradient reduction. However, if the protected attribute is non-binary, then grid search is not feasible.
For instance, for the version of the adult income data set where the protected attribute takes on four values, the grid search would need to span three dimensions for demographic parity and six dimensions for equalized odds, both of which are prohibitively costly.

5. Conclusion

We presented two reductions for achieving fairness in a binary classification setting. Our reductions work for any classifier representation, encompass many definitions of fairness, satisfy provable guarantees, and work well in practice. Our reductions optimize the tradeoff between accuracy and any (single) definition of fairness given training-time access to protected attributes. Achieving fairness when training-time access to protected attributes is unavailable remains an open problem for future research, as does the navigation of tradeoffs between accuracy and multiple fairness definitions.

Acknowledgements

We would like to thank Aaron Roth, Sam Corbett-Davies, and Emma Pierson for helpful discussions.

References

Agarwal, A., Beygelzimer, A., Dudík, M., and Langford, J. A reductions approach to fair classification. In Fairness, Accountability, and Transparency in Machine Learning (FATML), 2017.

Alabi, D., Immorlica, N., and Kalai, A. T. Unleashing linear optimizers for group-fair learning and optimization. In Proceedings of the 31st Annual Conference on Learning Theory (COLT), 2018.

Barocas, S. and Selbst, A. D. Big data's disparate impact. California Law Review, 104:671-732, 2016.

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002.

Berk, R., Heidari, H., Jabbari, S., Kearns, M., and Roth, A. Fairness in criminal justice risk assessments: The state of the art. arXiv:1703.09207, 2017.

Beygelzimer, A., Dani, V., Hayes, T. P., Langford, J., and Zadrozny, B. Error limiting reductions between classification tasks. In Proceedings of the Twenty-Second International Conference on Machine Learning (ICML), pp. 49-56, 2005.

Boucheron, S., Bousquet, O., and Lugosi, G. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323-375, 2005.

Calmon, F., Wei, D., Vinzamuri, B., Ramamurthy, K. N., and Varshney, K. R. Optimized pre-processing for discrimination prevention. In Advances in Neural Information Processing Systems 30, 2017.

Chouldechova, A. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, Special Issue on Social and Technical Trade-Offs, 2017.

Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., and Huq, A. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 797-806, 2017.

Donini, M., Oneto, L., Ben-David, S., Shawe-Taylor, J., and Pontil, M. Empirical risk minimization under fairness constraints. arXiv:1802.08626, 2018.

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214-226, 2012.

Dwork, C., Immorlica, N., Kalai, A. T., and Leiserson, M. Decoupled classifiers for group-fair and efficient machine learning. In Conference on Fairness, Accountability and Transparency (FAT*), pp. 119-133, 2018.

Fan, W., Stolfo, S. J., Zhang, J., and Chan, P. K. AdaCost: Misclassification cost-sensitive boosting. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML), pp. 97-105, 1999.

Feldman, M., Friedler, S. A., Moeller, J., Scheidegger, C., and Venkatasubramanian, S. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.

Freund, Y. and Schapire, R. E. Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory (COLT), pp. 325-332, 1996.

Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. In Neural Information Processing Systems (NIPS), 2016.

Johnson, K. D., Foster, D. P., and Stine, R. A. Impartial predictive modeling: Ensuring fairness in arbitrary models. arXiv:1608.00528, 2016.

Kakade, S. M., Sridharan, K., and Tewari, A. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pp. 793-800, 2009.

Kamiran, F. and Calders, T. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1-33, 2012.

Kamishima, T., Akaho, S., and Sakuma, J. Fairness-aware learning through regularization approach. In 2011 IEEE 11th International Conference on Data Mining Workshops, pp. 643-650, 2011.

Kearns, M., Neel, S., Roth, A., and Wu, Z. S. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.

Kivinen, J. and Warmuth, M. K. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1-63, 1997.

Kleinberg, J., Mullainathan, S., and Raghavan, M. Inherent trade-offs in the fair determination of risk scores. In Proceedings of the 8th Innovations in Theoretical Computer Science Conference, 2017.

Langford, J. and Beygelzimer, A. Sensitive error correcting output codes. In Proceedings of the 18th Annual Conference on Learning Theory (COLT), pp. 158-172, 2005.

Ledoux, M. and Talagrand, M. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.

Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Menon, A. K. and Williamson, R. C. The cost of fairness in binary classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, 2018.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

Rockafellar, R. T. Convex Analysis. Princeton University Press, 1970.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2012.

Wightman, L. LSAC National Longitudinal Bar Passage Study, 1998.

Woodworth, B. E., Gunasekar, S., Ohannessian, M. I., and Srebro, N. Learning non-discriminatory predictors. In Proceedings of the 30th Conference on Learning Theory (COLT), pp. 1920-1953, 2017.

Zafar, M. B., Valera, I., Rodriguez, M. G., and Gummadi, K. P. Fairness constraints: Mechanisms for fair classification. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 962-970, 2017.