# A Reductions Approach to Fair Classification

Alekh Agarwal¹, Alina Beygelzimer², Miroslav Dudík¹, John Langford¹, Hanna Wallach¹

¹Microsoft Research, New York. ²Yahoo! Research, New York. Correspondence to: A. Agarwal, A. Beygelzimer, M. Dudík, J. Langford, H. Wallach.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

We present a systematic approach for achieving fairness in a binary classification setting. While we focus on two well-known quantitative definitions of fairness, our approach encompasses many other previously studied definitions as special cases. The key idea is to reduce fair classification to a sequence of cost-sensitive classification problems, whose solutions yield a randomized classifier with the lowest (empirical) error subject to the desired constraints. We introduce two reductions that work for any representation of the cost-sensitive classifier and compare favorably to prior baselines on a variety of data sets, while overcoming several of their disadvantages.

1. Introduction

Over the past few years, the media have paid considerable attention to machine learning systems and their ability to inadvertently discriminate against minorities, historically disadvantaged populations, and other protected groups when allocating resources (e.g., loans) or opportunities (e.g., jobs). In response to this scrutiny, and driven by ongoing debates and collaborations with lawyers, policy-makers, social scientists, and others (e.g., Barocas & Selbst, 2016), machine learning researchers have begun to turn their attention to the topic of fairness in machine learning and, in particular, to the design of fair classification and regression algorithms.

In this paper we study the task of binary classification subject to fairness constraints with respect to a pre-defined protected attribute, such as race or sex. Previous work in this area can be divided into two broad groups of approaches.

The first group of approaches incorporates specific quantitative definitions of fairness into existing machine learning methods, often by relaxing the desired definitions of fairness and only enforcing weaker constraints, such as lack of correlation (e.g., Woodworth et al., 2017; Zafar et al., 2017; Johnson et al., 2016; Kamishima et al., 2011; Donini et al., 2018). The resulting fairness guarantees typically only hold under strong distributional assumptions, and the approaches are tied to specific families of classifiers, such as SVMs.

The second group of approaches eliminates the restriction to specific classifier families and treats the underlying classification method as a black box, while implementing a wrapper that either works by pre-processing the data or post-processing the classifier's predictions (e.g., Kamiran & Calders, 2012; Feldman et al., 2015; Hardt et al., 2016; Calmon et al., 2017). Existing pre-processing approaches are specific to particular definitions of fairness and typically seek to come up with a single transformed data set that will work across all learning algorithms, which, in practice, leads to classifiers that still exhibit substantial unfairness (see our evaluation in Section 4). In contrast, post-processing allows a wider range of fairness definitions and results in provable fairness guarantees.
However, it is not guaranteed to find the most accurate fair classifier, and it requires test-time access to the protected attribute, which might not be available.

We present a general-purpose approach that has the key advantage of this second group of approaches (the underlying classification method is treated as a black box), but without the noted disadvantages. Our approach encompasses a wide range of fairness definitions, is guaranteed to yield the most accurate fair classifier, and does not require test-time access to the protected attribute. Specifically, our approach allows any definition of fairness that can be formalized via linear inequalities on conditional moments, such as demographic parity or equalized odds (see Section 2.1). We show how binary classification subject to these constraints can be reduced to a sequence of cost-sensitive classification problems. We require only black-box access to a cost-sensitive classification algorithm, which does not need to have any knowledge of the desired definition of fairness or protected attribute. We show that the solutions to our sequence of cost-sensitive classification problems yield a randomized classifier with the lowest (empirical) error subject to the desired fairness constraints.

Corbett-Davies et al. (2017) and Menon & Williamson (2018) begin with a similar goal to ours, but they analyze the Bayes-optimal classifier under fairness constraints in the limit of infinite data. In contrast, our focus is algorithmic, our approach applies to any classifier family, and we obtain finite-sample guarantees. Dwork et al. (2018) also begin with a similar goal to ours. Their approach partitions the training examples into subsets according to protected attribute values and then leverages transfer learning to jointly learn from these separate data sets. Our approach avoids partitioning the data and assumes access only to a classification algorithm rather than a transfer learning algorithm. A preliminary version of this paper appeared at the FAT/ML workshop (Agarwal et al., 2017), and led to extensions with more general optimization objectives (Alabi et al., 2018) and combinatorial protected attributes (Kearns et al., 2018).

In the next section, we formalize our problem. While we focus on two well-known quantitative definitions of fairness, our approach also encompasses many other previously studied definitions of fairness as special cases. In Section 3, we describe our reductions approach to fair classification and its guarantees in detail. The experimental study in Section 4 shows that our reductions compare favorably to three baselines, while overcoming some of their disadvantages and also offering the flexibility of picking a suitable accuracy-fairness tradeoff. Our results demonstrate the utility of having a general-purpose approach for combining machine learning methods and quantitative fairness definitions.

2. Problem Formulation

We consider a binary classification setting where the training examples consist of triples $(X, A, Y)$, where $X \in \mathcal{X}$ is a feature vector, $A \in \mathcal{A}$ is a protected attribute, and $Y \in \{0, 1\}$ is a label. The feature vector $X$ can either contain the protected attribute $A$ as one of the features or contain other features that are arbitrarily indicative of $A$.
For example, if the classification task is to predict whether or not someone will default on a loan, each training example might correspond to a person, where $X$ represents their demographics, income level, past payment history, and loan amount; $A$ represents their race; and $Y$ represents whether or not they defaulted on that loan. Note that $X$ might contain their race as one of the features or, for example, contain their zip code, a feature that is often correlated with race. Our goal is to learn an accurate classifier $h : \mathcal{X} \to \{0, 1\}$ from some set (i.e., family) of classifiers $\mathcal{H}$, such as linear threshold rules, decision trees, or neural nets, while satisfying some definition of fairness. Note that the classifiers in $\mathcal{H}$ do not explicitly depend on $A$.

2.1. Fairness Definitions

We focus on two well-known quantitative definitions of fairness that have been considered in previous work on fair classification; however, our approach also encompasses many other previously studied definitions of fairness as special cases, as we explain at the end of this section.

The first definition, demographic (or statistical) parity, can be thought of as a stronger version of the US Equal Employment Opportunity Commission's four-fifths rule, which requires that the "selection rate for any race, sex, or ethnic group [must be at least] four-fifths (4/5) (or eighty percent) of the rate for the group with the highest rate" (see the Uniform Guidelines on Employment Selection Procedures, 29 C.F.R. §1607.4(D), 2015).

Definition 1 (Demographic parity, DP). A classifier $h$ satisfies demographic parity under a distribution over $(X, A, Y)$ if its prediction $h(X)$ is statistically independent of the protected attribute $A$, that is, if $P[h(X) = \hat{y} \mid A = a] = P[h(X) = \hat{y}]$ for all $a$, $\hat{y}$. Because $\hat{y} \in \{0, 1\}$, this is equivalent to $E[h(X) \mid A = a] = E[h(X)]$ for all $a$.

The second definition, equalized odds, was recently proposed by Hardt et al. (2016) to remedy two previously noted flaws with demographic parity (Dwork et al., 2012). First, demographic parity permits a classifier which accurately classifies data points with one value $A = a$, such as the value $a$ with the most data, but makes random predictions for data points with $A \ne a$ as long as the probabilities of $h(X) = 1$ match. Second, demographic parity rules out perfect classifiers whenever $Y$ is correlated with $A$. In contrast, equalized odds suffers from neither of these flaws.

Definition 2 (Equalized odds, EO). A classifier $h$ satisfies equalized odds under a distribution over $(X, A, Y)$ if its prediction $h(X)$ is conditionally independent of the protected attribute $A$ given the label $Y$, that is, if $P[h(X) = \hat{y} \mid A = a, Y = y] = P[h(X) = \hat{y} \mid Y = y]$ for all $a$, $y$, and $\hat{y}$. Because $\hat{y} \in \{0, 1\}$, this is equivalent to $E[h(X) \mid A = a, Y = y] = E[h(X) \mid Y = y]$ for all $a$, $y$.

We now show how each definition can be viewed as a special case of a general set of linear constraints of the form

$$M \mu(h) \le c, \qquad (1)$$

where the matrix $M \in \mathbb{R}^{|\mathcal{K}| \times |\mathcal{J}|}$ and the vector $c \in \mathbb{R}^{|\mathcal{K}|}$ describe the linear constraints, each indexed by $k \in \mathcal{K}$, and $\mu(h) \in \mathbb{R}^{|\mathcal{J}|}$ is a vector of conditional moments of the form

$$\mu_j(h) = E\big[ g_j(X, A, Y, h(X)) \,\big|\, \mathcal{E}_j \big] \quad \text{for } j \in \mathcal{J},$$

where $g_j : \mathcal{X} \times \mathcal{A} \times \{0, 1\} \times \{0, 1\} \to [0, 1]$ and $\mathcal{E}_j$ is an event defined with respect to $(X, A, Y)$. Crucially, $g_j$ depends on $h$, while $\mathcal{E}_j$ cannot depend on $h$ in any way.
Example 1 (DP). In a binary classification setting, demographic parity can be expressed as a set of $|\mathcal{A}|$ equality constraints, each of the form $E[h(X) \mid A = a] = E[h(X)]$. Letting $\mathcal{J} = \mathcal{A} \cup \{\ast\}$, $g_j(X, A, Y, h(X)) = h(X)$ for all $j$, $\mathcal{E}_a = \{A = a\}$, and $\mathcal{E}_\ast = \{\text{True}\}$, where $\{\text{True}\}$ refers to the event encompassing all points in the sample space, each equality constraint can be expressed as $\mu_a(h) = \mu_\ast(h)$ (note that $\mu_\ast(h) = E[h(X) \mid \text{True}] = E[h(X)]$). Finally, because each such constraint can be equivalently expressed as a pair of inequality constraints of the form

$$\mu_a(h) - \mu_\ast(h) \le 0, \qquad -\mu_a(h) + \mu_\ast(h) \le 0,$$

demographic parity can be expressed as equation (1), where $\mathcal{K} = \mathcal{A} \times \{+, -\}$, $M_{(a,+),\,a'} = \mathbf{1}\{a' = a\}$, $M_{(a,+),\,\ast} = -1$, $M_{(a,-),\,a'} = -\mathbf{1}\{a' = a\}$, $M_{(a,-),\,\ast} = 1$, and $c = 0$. Expressing each equality constraint as a pair of inequality constraints allows us to control the extent to which each constraint is enforced by positing $c_k > 0$ for some (or all) $k$.

Example 2 (EO). In a binary classification setting, equalized odds can be expressed as a set of $2|\mathcal{A}|$ equality constraints, each of the form $E[h(X) \mid A = a, Y = y] = E[h(X) \mid Y = y]$. Letting $\mathcal{J} = (\mathcal{A} \cup \{\ast\}) \times \{0, 1\}$, $g_j(X, A, Y, h(X)) = h(X)$ for all $j$, $\mathcal{E}_{(a,y)} = \{A = a, Y = y\}$, and $\mathcal{E}_{(\ast,y)} = \{Y = y\}$, each equality constraint can be equivalently expressed as

$$\mu_{(a,y)}(h) - \mu_{(\ast,y)}(h) \le 0, \qquad -\mu_{(a,y)}(h) + \mu_{(\ast,y)}(h) \le 0.$$

As a result, equalized odds can be expressed as equation (1), where $\mathcal{K} = \mathcal{A} \times \{0, 1\} \times \{+, -\}$, $M_{(a,y,+),\,(a',y')} = \mathbf{1}\{a' = a,\, y' = y\}$, $M_{(a,y,+),\,(\ast,y')} = -\mathbf{1}\{y' = y\}$, $M_{(a,y,-),\,(a',y')} = -\mathbf{1}\{a' = a,\, y' = y\}$, $M_{(a,y,-),\,(\ast,y')} = \mathbf{1}\{y' = y\}$, and $c = 0$. Again, we can posit $c_k > 0$ for some (or all) $k$ to allow small violations of some (or all) of the constraints.

Although we omit the details, we note that many other previously studied definitions of fairness can also be expressed as equation (1). For example, equality of opportunity (Hardt et al., 2016) (also known as balance for the positive class; Kleinberg et al., 2017), balance for the negative class (Kleinberg et al., 2017), error-rate balance (Chouldechova, 2017), overall accuracy equality (Berk et al., 2017), and treatment equality (Berk et al., 2017) can all be expressed as equation (1); in contrast, calibration (Kleinberg et al., 2017) and predictive parity (Chouldechova, 2017) cannot, because to do so would require the event $\mathcal{E}_j$ to depend on $h$. We note that our approach can also be used to satisfy multiple definitions of fairness, though if these definitions are mutually contradictory, e.g., as described by Kleinberg et al. (2017), then our guarantees become vacuous.

2.2. Fair Classification

In a standard (binary) classification setting, the goal is to learn the classifier $h \in \mathcal{H}$ with the minimum classification error $\text{err}(h) := P[h(X) \ne Y]$. However, because our goal is to learn the most accurate classifier while satisfying fairness constraints, as formalized above, we instead seek to find the solution to the constrained optimization problem

$$\min_{h \in \mathcal{H}} \ \text{err}(h) \quad \text{subject to} \quad M\mu(h) \le c. \qquad (2)$$

(We consider misclassification error for concreteness, but all the results in this paper apply to any error of the form $\text{err}(h) = E[g_{\text{err}}(X, A, Y, h(X))]$, where $g_{\text{err}}(\cdot, \cdot, \cdot, \cdot) \in [0, 1]$.)

Furthermore, rather than just considering classifiers in the set $\mathcal{H}$, we can enlarge the space of possible classifiers by considering randomized classifiers that can be obtained via a distribution over $\mathcal{H}$. By considering randomized classifiers, we can achieve better accuracy-fairness tradeoffs than would otherwise be possible. A randomized classifier $Q$ makes a prediction by first sampling a classifier $h \in \mathcal{H}$ from $Q$ and then using $h$ to make the prediction. The resulting classification error is $\text{err}(Q) = \sum_{h \in \mathcal{H}} Q(h)\, \text{err}(h)$ and the conditional moments are $\mu(Q) = \sum_{h \in \mathcal{H}} Q(h)\, \mu(h)$ (see Appendix A for the derivation). Thus we seek to solve

$$\min_{Q \in \Delta} \ \text{err}(Q) \quad \text{subject to} \quad M\mu(Q) \le c, \qquad (3)$$

where $\Delta$ is the set of all distributions over $\mathcal{H}$.
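To make this machinery concrete, here is a minimal sketch (assuming NumPy; the function name and data below are illustrative placeholders, not code from the released implementation) that builds the demographic-parity matrix $M$ of Example 1 and evaluates the empirical constraint values $M\hat{\mu}(h)$ for a fixed classifier $h$:

```python
import numpy as np

def dp_constraint_values(h, X, A):
    """Empirical demographic-parity constraints from Example 1: rows of M are
    indexed by (a, +) and (a, -); columns by the groups a and the event '*'.
    Returns M mu_hat(h); with c = 0, positive entries are constraint violations."""
    A = np.asarray(A)
    yhat = np.array([h(x) for x in X], dtype=float)
    groups = sorted(set(A))
    # conditional moments mu_a(h) = E[h(X) | A = a] and mu_*(h) = E[h(X)]
    mu = np.array([yhat[A == a].mean() for a in groups] + [yhat.mean()])
    M = np.zeros((2 * len(groups), len(groups) + 1))
    for i in range(len(groups)):
        M[2 * i, i], M[2 * i, -1] = 1.0, -1.0          #  mu_a - mu_* <= 0
        M[2 * i + 1, i], M[2 * i + 1, -1] = -1.0, 1.0  # -mu_a + mu_* <= 0
    return M @ mu

# hypothetical usage with a threshold rule on the first feature
rng = np.random.default_rng(0)
X = rng.random((1000, 3))
A = rng.choice(["a1", "a2"], size=1000)
h = lambda x: int(x[0] > 0.5)
print(dp_constraint_values(h, X, A).max())  # largest DP violation
```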
In practice, we do not know the true distribution over $(X, A, Y)$ and only have access to a data set of training examples $\{(X_i, A_i, Y_i)\}_{i=1}^n$. We therefore replace $\text{err}(Q)$ and $\mu(Q)$ in equation (3) with their empirical versions $\widehat{\text{err}}(Q)$ and $\hat{\mu}(Q)$. Because of the sampling error in $\hat{\mu}(Q)$, we also allow errors in satisfying the constraints by setting $\hat{c}_k = c_k + \varepsilon_k$ for all $k$, where $\varepsilon_k \ge 0$. After these modifications, we need to solve the empirical version of equation (3):

$$\min_{Q \in \Delta} \ \widehat{\text{err}}(Q) \quad \text{subject to} \quad M\hat{\mu}(Q) \le \hat{c}. \qquad (4)$$

3. Reductions Approach

We now show how the problem (4) can be reduced to a sequence of cost-sensitive classification problems. We further show that the solutions to our sequence of cost-sensitive classification problems yield a randomized classifier with the lowest (empirical) error subject to the desired constraints.

3.1. Cost-sensitive Classification

We assume access to a cost-sensitive classification algorithm for the set $\mathcal{H}$. The input to such an algorithm is a data set of training examples $\{(X_i, C_i^0, C_i^1)\}_{i=1}^n$, where $C_i^0$ and $C_i^1$ denote the losses (costs, in this setting) for predicting the labels 0 or 1, respectively, for $X_i$. The algorithm outputs

$$\arg\min_{h \in \mathcal{H}} \ \sum_{i=1}^n \Big[ h(X_i)\, C_i^1 + \big(1 - h(X_i)\big)\, C_i^0 \Big]. \qquad (5)$$

This abstraction allows us to specify different costs for different training examples, which is essential for incorporating fairness constraints. Moreover, efficient cost-sensitive classification algorithms are readily available for several common classifier representations (e.g., Beygelzimer et al., 2005; Langford & Beygelzimer, 2005; Fan et al., 1999). In particular, equation (5) is equivalent to a weighted classification problem, where the input consists of labeled examples $\{(X_i, Y_i, W_i)\}_{i=1}^n$ with $Y_i \in \{0, 1\}$ and $W_i \ge 0$, and the goal is to minimize the weighted classification error $\sum_{i=1}^n W_i\, \mathbf{1}\{h(X_i) \ne Y_i\}$. This is equivalent to equation (5) if we set $W_i = |C_i^0 - C_i^1|$ and $Y_i = \mathbf{1}\{C_i^0 \ge C_i^1\}$.
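As a minimal sketch of this equivalence (assuming scikit-learn; LogisticRegression stands in for any learner in $\mathcal{H}$ that accepts example weights, and the helper name is illustrative, not from the released code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_cost_sensitive(X, C0, C1):
    """Approximately solve eq. (5) via the weighted-classification reduction of Sec. 3.1."""
    W = np.abs(C0 - C1)             # W_i = |C_i^0 - C_i^1|
    Y = (C0 >= C1).astype(int)      # Y_i = 1{C_i^0 >= C_i^1}
    clf = LogisticRegression()      # any learner that supports sample_weight
    clf.fit(X, Y, sample_weight=W)
    return clf
```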
3.2. Reduction

To derive our fair classification algorithm, we rewrite equation (4) as a saddle point problem. We begin by introducing a Lagrange multiplier $\lambda_k \ge 0$ for each of the $|\mathcal{K}|$ constraints, summarized as $\lambda \in \mathbb{R}_+^{|\mathcal{K}|}$, and form the Lagrangian

$$L(Q, \lambda) = \widehat{\text{err}}(Q) + \lambda^{\top}\big( M\hat{\mu}(Q) - \hat{c} \big).$$

Thus, equation (4) is equivalent to

$$\min_{Q \in \Delta} \ \max_{\lambda \in \mathbb{R}_+^{|\mathcal{K}|}} \ L(Q, \lambda). \qquad (6)$$

For computational and statistical reasons, we impose an additional constraint on the $\ell_1$ norm of $\lambda$ and seek to simultaneously find the solution to the constrained version of (6) as well as its dual, obtained by switching min and max:

$$\min_{Q \in \Delta} \ \max_{\lambda \in \mathbb{R}_+^{|\mathcal{K}|},\ \|\lambda\|_1 \le B} \ L(Q, \lambda), \qquad \text{(P)}$$

$$\max_{\lambda \in \mathbb{R}_+^{|\mathcal{K}|},\ \|\lambda\|_1 \le B} \ \min_{Q \in \Delta} \ L(Q, \lambda). \qquad \text{(D)}$$

Because $L$ is linear in $Q$ and $\lambda$ and the domains of $Q$ and $\lambda$ are convex and compact, both problems have solutions, which we denote by $Q^\dagger$ and $\lambda^\dagger$, and the minimum value of (P) and the maximum value of (D) are equal and coincide with $L(Q^\dagger, \lambda^\dagger)$. Thus, $(Q^\dagger, \lambda^\dagger)$ is a saddle point of $L$ (Corollary 37.6.2 and Lemma 36.2 of Rockafellar, 1970).

We find the saddle point by using the standard scheme of Freund & Schapire (1996), developed for the equivalent problem of solving for an equilibrium in a zero-sum game. From a game-theoretic perspective, the saddle point can be viewed as an equilibrium of a game between two players: the Q-player choosing $Q$ and the λ-player choosing $\lambda$. The Lagrangian $L(Q, \lambda)$ specifies how much the Q-player has to pay to the λ-player after they make their choices. At the saddle point, neither player wants to deviate from their choice. Our algorithm finds an approximate equilibrium in which neither player can gain more than $\nu$ by changing their choice (where $\nu > 0$ is an input to the algorithm). Such an approximate equilibrium corresponds to a $\nu$-approximate saddle point of the Lagrangian, which is a pair $(\hat{Q}, \hat{\lambda})$ such that

$$L(\hat{Q}, \hat{\lambda}) \le L(Q, \hat{\lambda}) + \nu \quad \text{for all } Q \in \Delta,$$
$$L(\hat{Q}, \hat{\lambda}) \ge L(\hat{Q}, \lambda) - \nu \quad \text{for all } \lambda \in \mathbb{R}_+^{|\mathcal{K}|} \text{ with } \|\lambda\|_1 \le B.$$

We proceed iteratively by running a no-regret algorithm for the λ-player, while executing the best response of the Q-player. Following Freund & Schapire (1996), the average play of both players converges to the saddle point. We run the exponentiated gradient algorithm (Kivinen & Warmuth, 1997) for the λ-player and terminate as soon as the suboptimality of the average play falls below the pre-specified accuracy $\nu$. The best response of the Q-player can always be chosen to put all of the mass on one of the candidate classifiers $h \in \mathcal{H}$, and can be implemented by a single call to a cost-sensitive classification algorithm for the set $\mathcal{H}$.

Algorithm 1: Exponentiated-gradient reduction for fair classification
Input: training examples $\{(X_i, Y_i, A_i)\}_{i=1}^n$; fairness constraints specified by $g_j$, $\mathcal{E}_j$, $M$, $\hat{c}$; bound $B$; accuracy $\nu$; learning rate $\eta$
Set $\theta_1 = 0 \in \mathbb{R}^{|\mathcal{K}|}$
for $t = 1, 2, \ldots$ do
  Set $\lambda_{t,k} = B\, \exp\{\theta_{t,k}\} \,/\, \big(1 + \sum_{k' \in \mathcal{K}} \exp\{\theta_{t,k'}\}\big)$ for all $k \in \mathcal{K}$
  $h_t \leftarrow \text{BEST}_h(\lambda_t)$
  $\hat{Q}_t \leftarrow \frac{1}{t}\sum_{t'=1}^{t} h_{t'}$,  $\bar{L} \leftarrow L\big(\hat{Q}_t,\, \text{BEST}_\lambda(\hat{Q}_t)\big)$
  $\hat{\lambda}_t \leftarrow \frac{1}{t}\sum_{t'=1}^{t} \lambda_{t'}$,  $\underline{L} \leftarrow L\big(\text{BEST}_h(\hat{\lambda}_t),\, \hat{\lambda}_t\big)$
  $\nu_t \leftarrow \max\big\{ L(\hat{Q}_t, \hat{\lambda}_t) - \underline{L},\ \bar{L} - L(\hat{Q}_t, \hat{\lambda}_t) \big\}$
  if $\nu_t \le \nu$ then return $(\hat{Q}_t, \hat{\lambda}_t)$
  Set $\theta_{t+1} = \theta_t + \eta\big( M\hat{\mu}(h_t) - \hat{c} \big)$
end for

Algorithm 1 fully implements this scheme, except for the functions $\text{BEST}_\lambda$ and $\text{BEST}_h$, which correspond to the best-response algorithms of the two players. (We need the best response of the λ-player to evaluate whether the suboptimality of the current average play has fallen below $\nu$.) The two best-response functions can be calculated as follows.

$\text{BEST}_\lambda(Q)$: the best response of the λ-player. The best response of the λ-player for a given $Q$ is any maximizer of $L(Q, \lambda)$ over all valid $\lambda$. In our setting, it can always be chosen to be either $0$ or to put all of the mass on the most violated constraint. Letting $\hat{\gamma}(Q) := M\hat{\mu}(Q)$ and letting $e_k$ denote the $k$th vector of the standard basis, $\text{BEST}_\lambda(Q)$ returns $0$ if $\hat{\gamma}(Q) \le \hat{c}$, and $B e_{k^\ast}$ otherwise, where $k^\ast = \arg\max_k \big[\hat{\gamma}_k(Q) - \hat{c}_k\big]$.

$\text{BEST}_h(\lambda)$: the best response of the Q-player. Here, the best response minimizes $L(Q, \lambda)$ over all $Q$ in the simplex $\Delta$. Because $L$ is linear in $Q$, the minimizer can always be chosen to put all of the mass on a single classifier $h$. We show how to obtain the classifier constituting the best response via a reduction to cost-sensitive classification. Letting $\hat{p}_j := \hat{P}[\mathcal{E}_j]$ be the empirical event probabilities, the Lagrangian for a $Q$ which puts all of the mass on a single $h$ is

$$L(h, \lambda) = \widehat{\text{err}}(h) + \lambda^\top\big( M\hat{\mu}(h) - \hat{c} \big) = \hat{E}\big[\mathbf{1}\{h(X) \ne Y\}\big] - \lambda^\top \hat{c} + \sum_{k,j} M_{k,j}\,\lambda_k\,\hat{\mu}_j(h)$$
$$= -\lambda^\top \hat{c} + \hat{E}\Big[ \mathbf{1}\{h(X) \ne Y\} + \sum_{k,j} \frac{M_{k,j}\,\lambda_k}{\hat{p}_j}\; g_j\big(X, A, Y, h(X)\big)\, \mathbf{1}\{(X, A, Y) \in \mathcal{E}_j\} \Big].$$

Assuming a data set of training examples $\{(X_i, A_i, Y_i)\}_{i=1}^n$, the minimization of $L(h, \lambda)$ over $h$ then corresponds to cost-sensitive classification on $\{(X_i, C_i^0, C_i^1)\}_{i=1}^n$ with costs

$$C_i^0 = \mathbf{1}\{Y_i \ne 0\} + \sum_{k,j} \frac{M_{k,j}\,\lambda_k}{\hat{p}_j}\; g_j(X_i, A_i, Y_i, 0)\, \mathbf{1}\{(X_i, A_i, Y_i) \in \mathcal{E}_j\},$$
$$C_i^1 = \mathbf{1}\{Y_i \ne 1\} + \sum_{k,j} \frac{M_{k,j}\,\lambda_k}{\hat{p}_j}\; g_j(X_i, A_i, Y_i, 1)\, \mathbf{1}\{(X_i, A_i, Y_i) \in \mathcal{E}_j\}.$$

(For a general error $\text{err}(h) = E[g_{\text{err}}(X, A, Y, h(X))]$, the costs $C_i^0$ and $C_i^1$ contain, respectively, the terms $g_{\text{err}}(X_i, A_i, Y_i, 0)$ and $g_{\text{err}}(X_i, A_i, Y_i, 1)$ instead of $\mathbf{1}\{Y_i \ne 0\}$ and $\mathbf{1}\{Y_i \ne 1\}$.)
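A minimal sketch of $\text{BEST}_h(\lambda)$ built from these cost formulas might look as follows (assuming NumPy, a dictionary encoding of $M$, boolean masks for the events $\mathcal{E}_j$, and a caller-supplied weighted-classification oracle; the names are illustrative, not from the released implementation):

```python
import numpy as np

def best_h(lmbda, X, A, Y, M, events, g, fit_weighted):
    """BEST_h(lambda): cost-sensitive best response of the Q-player (Sec. 3.2).

    lmbda        : dict k -> lambda_k >= 0
    M            : dict (k, j) -> M_{k,j} (missing entries are treated as 0)
    events       : dict j -> boolean mask with events[j][i] = 1{(X_i,A_i,Y_i) in E_j}
    g            : dict j -> function g_j(x, a, y, yhat) taking values in [0, 1]
    fit_weighted : oracle solving weighted binary classification over H
    """
    Y = np.asarray(Y)
    C0 = (Y != 0).astype(float)          # baseline cost of predicting 0
    C1 = (Y != 1).astype(float)          # baseline cost of predicting 1
    for j, mask in events.items():
        p_j = mask.mean()                # empirical probability of E_j
        coef = sum(M.get((k, j), 0.0) * lmbda[k] for k in lmbda) / p_j
        idx = np.where(mask)[0]
        C0[idx] += coef * np.array([g[j](X[i], A[i], Y[i], 0) for i in idx])
        C1[idx] += coef * np.array([g[j](X[i], A[i], Y[i], 1) for i in idx])
    # reduce cost-sensitive classification to weighted classification (Sec. 3.1)
    W = np.abs(C0 - C1)
    Yw = (C0 >= C1).astype(int)
    return fit_weighted(X, Yw, W)
```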
Theorem 1. Let $\rho := \max_h \big\| M\hat{\mu}(h) - \hat{c} \big\|_\infty$. Then Algorithm 1 satisfies the inequality

$$\nu_t \le \frac{B \log(|\mathcal{K}| + 1)}{\eta t} + \eta \rho^2 B.$$

Thus, for $\eta = \frac{\nu}{2\rho^2 B}$, Algorithm 1 will return a $\nu$-approximate saddle point of $L$ in at most $4\rho^2 B^2 \log(|\mathcal{K}| + 1)/\nu^2$ iterations.

This theorem, proved in Appendix B, bounds the suboptimality $\nu_t$ of the average play $(\hat{Q}_t, \hat{\lambda}_t)$, which is equal to its suboptimality as a saddle point. The right-hand side of the bound is optimized by $\eta = \sqrt{\log(|\mathcal{K}| + 1)}\,/\,(\rho\sqrt{t})$, leading to the bound $\nu_t \le 2\rho B \sqrt{\log(|\mathcal{K}| + 1)/t}$. This bound decreases with the number of iterations $t$ and grows very slowly with the number of constraints $|\mathcal{K}|$. The quantity $\rho$ is a problem-specific constant that bounds how much any single classifier $h \in \mathcal{H}$ can violate the desired set of fairness constraints. Finally, $B$ is the bound on the $\ell_1$-norm of $\lambda$, which we introduced to enable this specific algorithmic scheme. In general, larger values of $B$ will bring the problem (P) closer to (6), and thus also to (4), but at the cost of needing more iterations to reach any given suboptimality. In particular, as we derive in the theorem, achieving suboptimality $\nu$ may need up to $4\rho^2 B^2 \log(|\mathcal{K}| + 1)/\nu^2$ iterations.

Example 3 (DP). Using the matrix $M$ for demographic parity as described in Section 2, the cost-sensitive reduction for a vector of Lagrange multipliers $\lambda$ uses costs

$$C_i^0 = \mathbf{1}\{Y_i \ne 0\}, \qquad C_i^1 = \mathbf{1}\{Y_i \ne 1\} + \frac{\lambda_{A_i}}{\hat{p}_{A_i}} - \sum_{a} \lambda_a,$$

where $\hat{p}_a := \hat{P}[A = a]$ and $\lambda_a := \lambda_{(a,+)} - \lambda_{(a,-)}$, effectively replacing two non-negative Lagrange multipliers by a single multiplier, which can be either positive or negative. Because $c_k = 0$ for all $k$, $\hat{c}_k = \varepsilon_k$. Furthermore, because all empirical moments are bounded in $[0, 1]$, we can assume $\varepsilon_k \le 1$, which yields the bound $\rho \le 2$. Thus, Algorithm 1 terminates in at most $16 B^2 \log(2|\mathcal{A}| + 1)/\nu^2$ iterations.

Example 4 (EO). For equalized odds, the cost-sensitive reduction for a vector of Lagrange multipliers $\lambda$ uses costs

$$C_i^0 = \mathbf{1}\{Y_i \ne 0\}, \qquad C_i^1 = \mathbf{1}\{Y_i \ne 1\} + \frac{\lambda_{(A_i, Y_i)}}{\hat{p}_{(A_i, Y_i)}} - \frac{\sum_a \lambda_{(a, Y_i)}}{\hat{p}_{(\ast, Y_i)}},$$

where $\hat{p}_{(a,y)} := \hat{P}[A = a, Y = y]$, $\hat{p}_{(\ast,y)} := \hat{P}[Y = y]$, and $\lambda_{(a,y)} := \lambda_{(a,y,+)} - \lambda_{(a,y,-)}$. If we again assume $\varepsilon_k \le 1$, then we obtain the bound $\rho \le 2$. Thus, Algorithm 1 terminates in at most $16 B^2 \log(4|\mathcal{A}| + 1)/\nu^2$ iterations.
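Putting the pieces together, a compact sketch of Algorithm 1 specialized to demographic parity with the costs of Example 3 might look as follows (assuming scikit-learn logistic regression as the cost-sensitive learner and a fixed iteration budget in place of the $\nu_t$ stopping test; this is an illustrative sketch with hypothetical names, not the released implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def exp_gradient_dp(X, A, Y, eps=0.02, B=10.0, eta=0.5, T=50):
    """Sketch of Algorithm 1 for demographic parity (costs from Example 3)."""
    A, Y = np.asarray(A), np.asarray(Y)
    groups = sorted(set(A))
    p = {a: np.mean(A == a) for a in groups}         # p_hat_a = P_hat[A = a]
    K = [(a, s) for a in groups for s in (+1, -1)]   # constraints (a, +) and (a, -)
    theta = {k: 0.0 for k in K}
    hs = []

    def best_h(lmb):
        # collapse the (a, +)/(a, -) multipliers into one signed multiplier per group
        lam = {a: lmb[(a, +1)] - lmb[(a, -1)] for a in groups}
        adj = np.array([lam[a] / p[a] for a in A]) - sum(lam.values())
        C0 = (Y != 0).astype(float)
        C1 = (Y != 1).astype(float) + adj            # Example 3 costs
        W, Yw = np.abs(C0 - C1), (C0 >= C1).astype(int)
        return LogisticRegression().fit(X, Yw, sample_weight=W)

    for t in range(T):
        z = {k: np.exp(theta[k]) for k in K}
        lmb = {k: B * z[k] / (1.0 + sum(z.values())) for k in K}
        h = best_h(lmb)
        hs.append(h)
        pred = h.predict(X).astype(float)
        mu_star = pred.mean()
        for (a, s) in K:                             # theta update: M mu_hat(h) - c_hat
            gamma = s * (pred[A == a].mean() - mu_star)
            theta[(a, s)] += eta * (gamma - eps)
    # a uniform distribution over hs corresponds to the averaged play Q_hat
    return hs
```

In a full implementation one would also track the averaged multipliers $\hat{\lambda}_t$ and the suboptimality $\nu_t$ in order to use the stopping rule of Algorithm 1 rather than a fixed budget.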
3.3. Error Analysis

Our ultimate goal, as formalized in equation (3), is to minimize the classification error while satisfying fairness constraints under a true but unknown distribution over $(X, A, Y)$. In the process of deriving Algorithm 1, we introduced three different sources of error. First, we replaced the true classification error and true moments with their empirical versions. Second, we introduced a bound $B$ on the magnitude of $\lambda$. Finally, we only run the optimization algorithm for a fixed number of iterations, until it reaches suboptimality level $\nu$. The first source of error, due to the use of empirical rather than true quantities, is unavoidable and constitutes the underlying statistical error. The other two sources of error, the bound $B$ and the suboptimality level $\nu$, stem from the optimization algorithm and can be driven arbitrarily small at the cost of additional iterations. In this section, we show how the statistical error and the optimization error affect the true accuracy and the fairness of the randomized classifier returned by Algorithm 1, in other words, how well Algorithm 1 solves our original problem (3).

To bound the statistical error, we use the Rademacher complexity of the classifier family $\mathcal{H}$, which we denote by $R_n(\mathcal{H})$, where $n$ is the number of training examples. We assume that $R_n(\mathcal{H}) \le C n^{-\alpha}$ for some $C \ge 0$ and $\alpha \le 1/2$. We note that $\alpha = 1/2$ in the vast majority of classifier families, including norm-bounded linear functions (see Theorem 1 of Kakade et al., 2009), neural networks (see Theorem 18 of Bartlett & Mendelson, 2002), and classifier families with bounded VC dimension (see Lemma 4 and Theorem 6 of Bartlett & Mendelson, 2002).

Recall that in our empirical optimization problem we assume that $\hat{c}_k = c_k + \varepsilon_k$, where $\varepsilon_k \ge 0$ are error bounds that account for the discrepancy between $\mu(Q)$ and $\hat{\mu}(Q)$. In our analysis, we assume that these error bounds have been set in accordance with the Rademacher complexity of $\mathcal{H}$.

Assumption 1. There exist $C, C' \ge 0$ and $\alpha \le 1/2$ such that $R_n(\mathcal{H}) \le C n^{-\alpha}$ and $\varepsilon_k = C' \sum_{j \in \mathcal{J}} |M_{k,j}|\, n_j^{-\alpha}$, where $n_j$ is the number of data points that fall in $\mathcal{E}_j$, i.e., $n_j := |\{i :\, (X_i, A_i, Y_i) \in \mathcal{E}_j\}|$.

The optimization error can be bounded via a careful analysis of the Lagrangian and the optimality conditions of (P) and (D). Combining the three different sources of error yields the following bound, which we prove in Appendix C.

Theorem 2. Let Assumption 1 hold for $C' \ge 2C + 2 + \sqrt{\ln(4/\delta)/2}$, where $\delta > 0$. Let $(\hat{Q}, \hat{\lambda})$ be any $\nu$-approximate saddle point of $L$, let $Q^\ast$ minimize $\text{err}(Q)$ subject to $M\mu(Q) \le c$, and let $p_j^\ast = P[\mathcal{E}_j]$. Then, with probability at least $1 - (|\mathcal{J}| + 1)\delta$, the distribution $\hat{Q}$ satisfies

$$\text{err}(\hat{Q}) \le \text{err}(Q^\ast) + 2\nu + \tilde{O}(n^{-\alpha}),$$
$$\gamma_k(\hat{Q}) \le c_k + \frac{1 + 2\nu}{B} + \sum_{j \in \mathcal{J}} |M_{k,j}|\, \tilde{O}\big(n_j^{-\alpha}\big) \quad \text{for all } k,$$

where $\gamma(Q) := M\mu(Q)$ denotes the constraint values under the true distribution and $\tilde{O}(\cdot)$ suppresses polynomial dependence on $\ln(1/\delta)$. If $n p_j^\ast \ge 8\log(2/\delta)$ for all $j$, then, for all $k$,

$$\gamma_k(\hat{Q}) \le c_k + \frac{1 + 2\nu}{B} + \sum_{j \in \mathcal{J}} |M_{k,j}|\, \tilde{O}\big((n p_j^\ast)^{-\alpha}\big).$$

In other words, the solution returned by Algorithm 1 achieves the lowest feasible classification error on the true distribution up to the optimization error, which grows linearly with $\nu$, and the statistical error, which grows as $n^{-\alpha}$. Therefore, if we want to guarantee that the optimization error does not dominate the statistical error, we should set $\nu \propto n^{-\alpha}$. The fairness constraints on the true distribution are satisfied up to the optimization error $(1 + 2\nu)/B$ and up to the statistical error. Because the statistical error depends on the moments, and the error in estimating the moments grows as $n_j^{-\alpha} \ge n^{-\alpha}$, we can set $B \propto n^{\alpha}$ to guarantee that the optimization error does not dominate the statistical error. Combining this reasoning with the learning rate setting of Theorem 1 yields the following theorem (proved in Appendix C).

Theorem 3. Let $\rho := \max_h \|M\hat{\mu}(h) - \hat{c}\|_\infty$. Let Assumption 1 hold for $C' \ge 2C + 2 + \sqrt{\ln(4/\delta)/2}$, where $\delta > 0$. Let $Q^\ast$ minimize $\text{err}(Q)$ subject to $M\mu(Q) \le c$. Then Algorithm 1 with $\nu \propto n^{-\alpha}$, $B \propto n^{\alpha}$, and $\eta \propto \rho^{-2} n^{-2\alpha}$ terminates in $O(\rho^2 n^{4\alpha} \ln|\mathcal{K}|)$ iterations and returns $\hat{Q}$, which with probability at least $1 - (|\mathcal{J}| + 1)\delta$ satisfies

$$\text{err}(\hat{Q}) \le \text{err}(Q^\ast) + \tilde{O}(n^{-\alpha}), \qquad \gamma_k(\hat{Q}) \le c_k + \sum_{j \in \mathcal{J}} |M_{k,j}|\, \tilde{O}\big(n_j^{-\alpha}\big) \quad \text{for all } k.$$

Example 5 (DP). If $n_a$ denotes the number of training examples with $A_i = a$, then Assumption 1 states that we should set $\varepsilon_{(a,+)} = \varepsilon_{(a,-)} = C'\big(n_a^{-\alpha} + n^{-\alpha}\big)$, and Theorem 3 then shows that for a suitable setting of $C'$, $\nu$, $B$, and $\eta$, Algorithm 1 will return a randomized classifier $\hat{Q}$ with the lowest feasible classification error up to $\tilde{O}(n^{-\alpha})$ while also approximately satisfying the fairness constraints

$$\Big| E[h(X) \mid A = a] - E[h(X)] \Big| \le \tilde{O}\big(n_a^{-\alpha}\big) \quad \text{for all } a,$$

where $E$ is with respect to $(X, A, Y)$ as well as $h \sim \hat{Q}$.

Example 6 (EO).
Similarly, if $n_{(a,y)}$ denotes the number of examples with $A_i = a$ and $Y_i = y$ and $n_{(\ast,y)}$ denotes the number of examples with $Y_i = y$, then Assumption 1 states that we should set $\varepsilon_{(a,y,+)} = \varepsilon_{(a,y,-)} = C'\big(n_{(a,y)}^{-\alpha} + n_{(\ast,y)}^{-\alpha}\big)$, and Theorem 3 then shows that for a suitable setting of $C'$, $\nu$, $B$, and $\eta$, Algorithm 1 will return a randomized classifier $\hat{Q}$ with the lowest feasible classification error up to $\tilde{O}(n^{-\alpha})$ while also approximately satisfying the fairness constraints

$$\Big| E[h(X) \mid A = a, Y = y] - E[h(X) \mid Y = y] \Big| \le \tilde{O}\big(n_{(a,y)}^{-\alpha}\big) \quad \text{for all } a, y.$$

Again, $E$ includes randomness under the true distribution over $(X, A, Y)$ as well as $h \sim \hat{Q}$.

3.4. Grid Search

In some situations, it is preferable to select a deterministic classifier, even if that means a lower accuracy or a modest violation of the fairness constraints. A set of candidate classifiers can be obtained from the saddle point $(Q^\dagger, \lambda^\dagger)$. Specifically, because $Q^\dagger$ is a minimizer of $L(Q, \lambda^\dagger)$ and $L$ is linear in $Q$, the distribution $Q^\dagger$ puts non-zero mass only on classifiers that are the Q-player's best responses to $\lambda^\dagger$. If we knew $\lambda^\dagger$, we could retrieve one such best response via the reduction to cost-sensitive learning introduced in Section 3.2. We can compute $\lambda^\dagger$ using Algorithm 1, but when the number of constraints is very small, as is the case for demographic parity or equalized odds with a binary protected attribute, it is also reasonable to consider a grid of values $\lambda$, calculate the best response for each value, and then select the value with the desired tradeoff between accuracy and fairness.

Example 7 (DP). When the protected attribute is binary, e.g., $A \in \{a, a'\}$, the grid search can in fact be conducted in a single dimension. The reduction formally takes two real-valued arguments $\lambda_a$ and $\lambda_{a'}$, and then adjusts the costs for predicting $h(X_i) = 1$ by the amounts $\delta_a = \lambda_a/\hat{p}_a - \lambda_a - \lambda_{a'}$ and $\delta_{a'} = \lambda_{a'}/\hat{p}_{a'} - \lambda_a - \lambda_{a'}$, respectively, on the training examples with $A_i = a$ and $A_i = a'$. These adjustments satisfy $\hat{p}_a \delta_a + \hat{p}_{a'} \delta_{a'} = 0$, so instead of searching over $\lambda_a$ and $\lambda_{a'}$, we can carry out the grid search over $\delta_a$ alone and apply the adjustment $\delta_{a'} = -\hat{p}_a \delta_a / \hat{p}_{a'}$ to the protected attribute value $a'$. With three attribute values, e.g., $A \in \{a, a', a''\}$, we similarly have $\hat{p}_a \delta_a + \hat{p}_{a'} \delta_{a'} + \hat{p}_{a''} \delta_{a''} = 0$, so it suffices to conduct the grid search in two dimensions rather than three.

Example 8 (EO). If $A \in \{a, a'\}$, we obtain the adjustment $\delta_{(a,y)} = \lambda_{(a,y)}/\hat{p}_{(a,y)} - \big(\lambda_{(a,y)} + \lambda_{(a',y)}\big)/\hat{p}_{(\ast,y)}$ for an example with protected attribute value $a$ and label $y$, and similarly for protected attribute value $a'$. In this case, separately for each $y$, the adjustments satisfy $\hat{p}_{(a,y)} \delta_{(a,y)} + \hat{p}_{(a',y)} \delta_{(a',y)} = 0$, so it suffices to do the grid search over $\delta_{(a,0)}$ and $\delta_{(a,1)}$ and set the parameters for $a'$ to $\delta_{(a',y)} = -\hat{p}_{(a,y)} \delta_{(a,y)} / \hat{p}_{(a',y)}$.
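A minimal sketch of this one-dimensional grid search for demographic parity with a binary protected attribute is shown below (assuming scikit-learn; the function name and structure are illustrative, not the released implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def grid_search_dp(X, A, Y, a, deltas):
    """Grid search of Sec. 3.4 for DP with a binary protected attribute.

    For each candidate delta_a, train the cost-sensitive best response and record
    its training error and DP violation; the caller then picks the desired tradeoff.
    """
    A, Y = np.asarray(A), np.asarray(Y)
    p_a = np.mean(A == a)
    results = []
    for d_a in deltas:
        d_other = -p_a * d_a / (1.0 - p_a)       # ensures p_a*d_a + p_a'*d_a' = 0
        adj = np.where(A == a, d_a, d_other)
        C0 = (Y != 0).astype(float)
        C1 = (Y != 1).astype(float) + adj        # Example 7 cost adjustment
        W, Yw = np.abs(C0 - C1), (C0 >= C1).astype(int)
        if len(np.unique(Yw)) < 2:               # skip degenerate cost settings
            continue
        clf = LogisticRegression().fit(X, Yw, sample_weight=W)
        pred = clf.predict(X).astype(float)
        err = np.mean(pred != Y)
        disp = max(abs(pred[A == g].mean() - pred.mean()) for g in np.unique(A))
        results.append((d_a, err, disp, clf))
    return results  # pick a classifier with the desired accuracy-fairness tradeoff
```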
4. Experimental Results

We now examine how our exponentiated-gradient reduction (implementation available at https://github.com/Microsoft/fairlearn) performs at the task of binary classification subject to either demographic parity or equalized odds. We provide an evaluation of our grid-search reduction in Appendix D.

We compared our reduction with the score-based post-processing algorithm of Hardt et al. (2016), which takes as its input any classifier (i.e., a standard classifier trained without any fairness constraints) and derives a monotone transformation of the classifier's output to remove any disparity with respect to the training examples. This post-processing algorithm works with both demographic parity and equalized odds, as well as with binary and non-binary protected attributes.

For demographic parity, we also compared our reduction with the reweighting and relabeling approaches of Kamiran & Calders (2012). Reweighting can be applied to both binary and non-binary protected attributes and operates by changing importance weights on each example with the goal of removing any statistical dependence between the protected attribute and label. (Although reweighting was developed for demographic parity, the weights that it induces are achievable by our grid search, albeit the grid search for equalized odds rather than demographic parity.) Relabeling was developed for binary protected attributes. First, a classifier is trained on the original data (without considering fairness). The training examples close to the decision boundary are then relabeled to remove all disparity while minimally affecting accuracy. The final classifier is then trained on the relabeled data.

As the base classifiers for our reductions, we used the weighted classification implementations of logistic regression and gradient-boosted decision trees in scikit-learn (Pedregosa et al., 2011). In addition to the three baselines described above, we also compared our reductions to the unconstrained classifiers trained to optimize accuracy only.

We used four data sets, randomly splitting each one into training examples (75%) and test examples (25%):

- The adult income data set (Lichman, 2013; 48,842 examples). Here the task is to predict whether someone makes more than $50k per year, with gender as the protected attribute. To examine the performance for non-binary protected attributes, we also conducted another experiment with the same data, using both gender and race (binarized into white and non-white) as the protected attribute. Relabeling, which requires binary protected attributes, was therefore not applicable here.
- ProPublica's COMPAS recidivism data (7,918 examples). The task is to predict recidivism from someone's criminal history, jail and prison time, demographics, and COMPAS risk scores, with race as the protected attribute (restricted to white and black defendants).
- Law School Admissions Council's National Longitudinal Bar Passage Study (Wightman, 1998; 20,649 examples). Here the task is to predict someone's eventual passage of the bar exam, with race (restricted to white and black only) as the protected attribute.
- The Dutch census data set (Dutch Central Bureau for Statistics, 2001; 60,420 examples). Here the task is to predict whether or not someone has a prestigious occupation, with gender as the protected attribute.

While all the evaluated algorithms require access to the protected attribute $A$ at training time, only the post-processing algorithm requires access to $A$ at test time. For a fair comparison, we included $A$ in the feature vector $X$, so all algorithms had access to it at both training time and test time.

We used the test examples to measure the classification error for each approach, as well as the violation of the desired fairness constraints, i.e., $\max_a \big| E[h(X) \mid A = a] - E[h(X)] \big|$ and $\max_{a,y} \big| E[h(X) \mid A = a, Y = y] - E[h(X) \mid Y = y] \big|$ for demographic parity and equalized odds, respectively. We ran our reduction across a wide range of tradeoffs between the classification error and fairness constraints. We considered $\varepsilon \in \{0.001, \ldots, 0.1\}$ and for each value ran Algorithm 1 with $\hat{c}_k = \varepsilon$ across all $k$.
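For readers who want to reproduce this kind of experiment, a hedged usage sketch follows. It assumes a recent release of the fairlearn package, whose reductions API (ExponentiatedGradient, DemographicParity) grew out of the code linked above and may differ from the version used for the experiments in this paper; the synthetic data and the hand-computed violation metric are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from fairlearn.reductions import ExponentiatedGradient, DemographicParity

def dp_violation(pred, A):
    """max_a |E[h(X) | A = a] - E[h(X)]| on a sample."""
    pred, A = np.asarray(pred, dtype=float), np.asarray(A)
    return max(abs(pred[A == a].mean() - pred.mean()) for a in np.unique(A))

# synthetic placeholder data; substitute any of the data sets described above
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
A = rng.choice(["f", "m"], size=2000)
y = (X[:, 0] + (A == "m") * 0.5 + rng.normal(scale=0.5, size=2000) > 0).astype(int)

mitigator = ExponentiatedGradient(
    estimator=LogisticRegression(),
    constraints=DemographicParity(),  # allowed violation is configurable; defaults used here
)
mitigator.fit(X, y, sensitive_features=A)
pred = mitigator.predict(X)
print("error:", np.mean(pred != y), "DP violation:", dp_violation(pred, A))
```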
[Figure 1: a grid of panels, one per data set (adult, adult4, COMPAS, Dutch census, Law Schools), constraint (DP or EO), and base learner (logistic regression or boosting), plotting classification error against the violation of the fairness constraint for the methods: unconstrained, postprocessing, reweight, relabel, and reduction (exp. grad.).]

Figure 1. Test classification error versus constraint violation with respect to DP (top two rows) and EO (bottom two rows). All data sets have binary protected attributes except for adult4, which has four protected attribute values, so relabeling is not applicable there. For our reduction approach we plot the convex envelope of the classifiers obtained on training data at various accuracy-fairness tradeoffs. We show 95% confidence bands for the classification error of our reduction approach and 95% confidence intervals for the constraint violation of post-processing. Our reduction approach dominates or matches the performance of the other approaches up to statistical uncertainty.

As expected, the returned randomized classifiers tracked the training Pareto frontier (see Figure 2 in Appendix D). In Figure 1, we evaluate these classifiers alongside the baselines on the test data. For all the data sets, the range of classification errors is much smaller than the range of constraint violations. Almost all the approaches were able to substantially reduce or remove disparity without much impact on classifier accuracy. One exception was the Dutch census data set, where the classification error increased the most in relative terms.

Our reduction generally dominated or matched the baselines. The relabeling approach frequently yielded solutions that were not Pareto optimal. Reweighting yielded solutions on the Pareto frontier, but often with substantial disparity. As expected, post-processing yielded disparities that were statistically indistinguishable from zero, but the resulting classification error was sometimes higher than that achieved by our reduction under a statistically indistinguishable disparity. In addition, and unlike the post-processing algorithm, our reduction can achieve any desired accuracy-fairness tradeoff, allows a wider range of fairness definitions, and does not require access to the protected attribute at test time.

Our grid-search reduction, evaluated in Appendix D, sometimes failed to achieve the lowest disparities on the training data, but its performance on the test data very closely matched that of our exponentiated-gradient reduction. However, if the protected attribute is non-binary, then grid search is not feasible.
For instance, for the version of the adult income data set where the protected attribute takes on four values, the grid search would need to span three dimensions for demographic parity and six dimensions for equalized odds, both of which are prohibitively costly.

5. Conclusion

We presented two reductions for achieving fairness in a binary classification setting. Our reductions work for any classifier representation, encompass many definitions of fairness, satisfy provable guarantees, and work well in practice. Our reductions optimize the tradeoff between accuracy and any (single) definition of fairness given training-time access to protected attributes. Achieving fairness when training-time access to protected attributes is unavailable remains an open problem for future research, as does the navigation of tradeoffs between accuracy and multiple fairness definitions.

Acknowledgements

We would like to thank Aaron Roth, Sam Corbett-Davies, and Emma Pierson for helpful discussions.

References

Agarwal, A., Beygelzimer, A., Dudík, M., and Langford, J. A reductions approach to fair classification. In Fairness, Accountability, and Transparency in Machine Learning (FATML), 2017.

Alabi, D., Immorlica, N., and Kalai, A. T. Unleashing linear optimizers for group-fair learning and optimization. In Proceedings of the 31st Annual Conference on Learning Theory (COLT), 2018.

Barocas, S. and Selbst, A. D. Big data's disparate impact. California Law Review, 104:671-732, 2016.

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002.

Berk, R., Heidari, H., Jabbari, S., Kearns, M., and Roth, A. Fairness in criminal justice risk assessments: The state of the art. arXiv:1703.09207, 2017.

Beygelzimer, A., Dani, V., Hayes, T. P., Langford, J., and Zadrozny, B. Error limiting reductions between classification tasks. In Proceedings of the Twenty-Second International Conference on Machine Learning (ICML), pp. 49-56, 2005.

Boucheron, S., Bousquet, O., and Lugosi, G. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323-375, 2005.

Calmon, F., Wei, D., Vinzamuri, B., Ramamurthy, K. N., and Varshney, K. R. Optimized pre-processing for discrimination prevention. In Advances in Neural Information Processing Systems 30, 2017.

Chouldechova, A. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, Special Issue on Social and Technical Trade-Offs, 2017.

Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., and Huq, A. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 797-806, 2017.

Donini, M., Oneto, L., Ben-David, S., Shawe-Taylor, J., and Pontil, M. Empirical risk minimization under fairness constraints. arXiv:1802.08626, 2018.

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214-226, 2012.

Dwork, C., Immorlica, N., Kalai, A. T., and Leiserson, M. Decoupled classifiers for group-fair and efficient machine learning. In Conference on Fairness, Accountability and Transparency (FAT*), pp. 119-133, 2018.

Fan, W., Stolfo, S. J., Zhang, J., and Chan, P. K. AdaCost: Misclassification cost-sensitive boosting. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML), pp. 97-105, 1999.

Feldman, M., Friedler, S. A., Moeller, J., Scheidegger, C., and Venkatasubramanian, S. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.

Freund, Y. and Schapire, R. E. Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory (COLT), pp. 325-332, 1996.

Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. In Neural Information Processing Systems (NIPS), 2016.

Johnson, K. D., Foster, D. P., and Stine, R. A. Impartial predictive modeling: Ensuring fairness in arbitrary models. arXiv:1608.00528, 2016.

Kakade, S. M., Sridharan, K., and Tewari, A. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pp. 793-800, 2009.

Kamiran, F. and Calders, T. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1-33, 2012.

Kamishima, T., Akaho, S., and Sakuma, J. Fairness-aware learning through regularization approach. In 2011 IEEE 11th International Conference on Data Mining Workshops, pp. 643-650, 2011.

Kearns, M., Neel, S., Roth, A., and Wu, Z. S. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.

Kivinen, J. and Warmuth, M. K. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1-63, 1997.

Kleinberg, J., Mullainathan, S., and Raghavan, M. Inherent trade-offs in the fair determination of risk scores. In Proceedings of the 8th Innovations in Theoretical Computer Science Conference, 2017.

Langford, J. and Beygelzimer, A. Sensitive error correcting output codes. In Proceedings of the 18th Annual Conference on Learning Theory (COLT), pp. 158-172, 2005.

Ledoux, M. and Talagrand, M. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.

Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Menon, A. K. and Williamson, R. C. The cost of fairness in binary classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, 2018.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

Rockafellar, R. T. Convex Analysis. Princeton University Press, 1970.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2012.

Wightman, L. LSAC National Longitudinal Bar Passage Study, 1998.

Woodworth, B. E., Gunasekar, S., Ohannessian, M. I., and Srebro, N. Learning non-discriminatory predictors. In Proceedings of the 30th Conference on Learning Theory (COLT), pp. 1920-1953, 2017.

Zafar, M. B., Valera, I., Rodriguez, M. G., and Gummadi, K. P. Fairness constraints: Mechanisms for fair classification. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 962-970, 2017.