The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

One-Network Adversarial Fairness

Tameem Adel, University of Cambridge, UK, tah47@cam.ac.uk
Isabel Valera, MPI-IS, Germany, isabel.valera@tuebingen.mpg.de
Zoubin Ghahramani, University of Cambridge, UK and Uber AI Labs, USA, zoubin@eng.cam.ac.uk
Adrian Weller, University of Cambridge, UK and The Alan Turing Institute, UK, aw665@cam.ac.uk

Abstract

There is currently a great expansion of the impact of machine learning algorithms on our lives, prompting the need for objectives other than pure performance, including fairness. Fairness here means that the outcome of an automated decision-making system should not discriminate between subgroups characterized by sensitive attributes such as gender or race. Given any existing differentiable classifier, we make only slight adjustments to the architecture, including adding a new hidden layer, in order to enable the concurrent adversarial optimization for fairness and accuracy. Our framework provides one way to quantify the tradeoff between fairness and accuracy, while also leading to strong empirical performance.

1 Introduction

Automated decision support has become widespread, raising concerns about potential unfairness. Here, following earlier work, unfairness means discriminating against a particular group of people due to sensitive group characteristics such as gender or race (Grgic-Hlaca et al. 2018b; Hardt, Price, and Srebro 2016; Kusner et al. 2017; Louizos et al. 2016; Zafar et al. 2017c; 2017a; Zemel et al. 2013). Fairness concerns have been considered in applications including predictive policing (Brennan, Dieterich, and Ehret 2009), recidivism prediction (Chouldechova 2017) and credit scoring (Khandani, Kim, and Lo 2010). The current trend in the fairness literature is to begin with a notion (definition) of fairness, and then to construct a model which automates the detection and/or eradication of unfairness accordingly. Common definitions of fairness consider whether or not a decision is related to sensitive attributes, such as gender. Most state-of-the-art machine learning algorithms for fairness build from scratch a model architecture tailored to a specific fairness notion. In contrast, we propose a method that slightly modifies the architecture of a model which was to be optimized solely for accuracy. We learn a fair representation together with a new performance function acting on it, with the goal of concurrently optimizing for both fairness and performance (accuracy). Our method is based on an adversarial framework, which allows explicitly measuring the tradeoff between fairness and accuracy. Our approach is general, in that it may be applied to any differentiable discriminative model.

We establish a fairness paradigm where the architecture of a deep discriminative model, optimized for accuracy, is modified such that fairness is imposed (the same paradigm could be applied to a deep generative model in future work). Beginning with an ordinary neural network optimized for prediction accuracy of the class labels in a classification task, we propose an adversarial fairness framework that changes the network architecture, leading to a neural network that is maximally uninformative about the sensitive attributes of the data as well as predictive of the class labels.
In this adversarial learning framework, there is no need for a separate network architecture representing the adversary; thus we avoid the well-known difficulties which may arise from double-network adversarial learning (Goodfellow 2016; Goodfellow et al. 2014).

We make the following contributions: 1) We propose a fairness algorithm that modifies the architecture of a potentially unfair model to simultaneously optimize for both accuracy and fairness with respect to sensitive attributes (Section 2); 2) We quantify the tradeoff between the accuracy and fairness objectives (Section 2); 3) We propose a variation of the adversarial learning procedure that increases diversity among the elements of each minibatch of the gradient descent training, in order to achieve a representation with higher fidelity. This variation may be of independent interest since it is applicable generally to adversarial frameworks (Section 2.2); 4) We develop a novel generalization bound for our framework illustrating the theoretical grounding behind the relationship between the label classifier and the adversary's ability to predict the sensitive attribute value (Section 3); and 5) We experiment on two fairness datasets, comparing against many earlier approaches, to demonstrate state-of-the-art effectiveness of our methods (Section 4).

2 Fair Learning by Modifying an Unfair Model

We aim to adapt a classifier that was to learn solely to predict the data labels into a fair classifier, without reconstructing its architecture from scratch. Our strategy is to add a term to the optimization objective so as to simultaneously maximize both fairness and accuracy. This term imposes fairness by establishing a fair representation of the input data which is invariant to changes in the sensitive attribute values. To optimize this invariance, an adversary is included at the top of the model in the form of a classifier which tries to predict the sensitive attribute value. For a deep model, this can be implemented via: (i) adding another classifier at the top of the network, such that we have two classifiers, the original label predictor and the new predictor of the sensitive attribute value(s); and (ii) adding a network layer just under the classifiers (now the top hidden layer) that aims at maximizing the performance of the label predictor, while minimizing the performance of the sensitive attribute predictor by reversing the gradients of the latter in the backward phase of the gradient descent training. See Figure 1 for an illustration.

2.1 Fair adversarial discriminative (FAD) model

We consider data $D = \{(x_i, y_i, s_i)\}_{i=1}^{n}$, where $x_i$, $y_i$ and $s_i$ refer to the input features, the ground truth label and the sensitive attribute(s), respectively. Denote by $\hat{y}_i$ the corresponding label estimate. Typically, $\hat{y}_i$ is predicted by a potentially unfair discriminative classifier, possibly represented by a neural network, whose input consists of the features x and the sensitive attributes s. The goal of fairness is to ensure that the prediction of y does not depend on these attributes s (Zafar et al. 2017c). A naive approach is to simply discard s and let the classifier f have input x only; this has some proponents in terms of process (Grgic-Hlaca et al. 2018a; 2018b) but suffers since x might include significant information about s. For example, an address attribute in x might give a strong indication about race s. We achieve fairness through the adversarial construction illustrated in Figures 1 and 2, as follows.
Define the empirical classification loss of the label classifier $f(x')$ as $L(\hat{y}, y) = L(f(x'), y) = \frac{1}{|D|}\sum_{i=1}^{|D|} \mathrm{Err}(f(x'_i), y_i)$, and similarly the classification loss of the classifier $\phi$, which predicts the value of the sensitive attribute s, as $L(\phi(x'), s) = \frac{1}{|D|}\sum_{i=1}^{|D|} \mathrm{Err}(\phi(x'_i), s_i)$.

A simplified view of the architecture of our fair adversarial discriminative (FAD) framework is displayed in Figure 1. We extend the original architecture of the potentially unfair predictor by adding a new layer g that produces a fair data representation x′, and a new predictor φ aiming to predict the sensitive attribute s. Denoting by w the weight vector of a network layer, the backpropagation forward pass proceeds normally in FAD. During the backward pass (dotted lines in Figure 1), however, the gradient from the loss of the s classifier, $\frac{\partial L(\phi(x'), s)}{\partial w_\phi}$, is multiplied by a negative sign so that g adversarially aims at increasing the loss of φ, resulting in a representation x′ that is maximally invariant to changes in the values of s. Since the rest of the FAD training proceeds normally, the classifier f should also make x′ maximally informative about the original prediction task, depicted by y. The idea of reversing a layer's gradient with respect to the layer below in the backward pass was used in domain adaptation to develop representations invariant to changes between the source and target domains (Ajakan et al. 2014; Ganin and Lempitsky 2015; Ganin et al. 2016).

In our fairness paradigm, we aim at the following: i) achieving fairness; ii) quantitatively evaluating how much fairness we achieve; and iii) quantifying the impact of such fairness on accuracy, i.e. computing the difference in accuracy between the proposed fair model and a corresponding (potentially unfair) model.

In Figure 2, another schematic diagram of the proposed modifications is displayed. The predictor φ(x′) depicts a classifier predicting the s value from x′. The ability to accurately predict s given x′ signifies a high risk of unfairness, since this means that the sensitive attributes may be influential in any decision making process based on x′. Adversarially, through g, the non-sensitive features x are transformed into a representation x′ which attempts to break any dependence on the sensitive attributes s. As such, the optimization objective can have many trivial solutions, e.g. a g that maps every x to 0, which provides no information to predict s. In order to prevent that, and to ensure that fairness is aligned with the accuracy objective, a classifier f predicts the labels y from x′. Here we consider classification accuracy to be the metric of the initial model; however, our adversarial paradigm can also be adopted in models with other metrics to achieve fairness.

Figure 1: Architecture of the proposed fair adversarial discriminative model. The parts added by FAD to a potentially unfair deep architecture with input x are (shown in red): i) the layer g where x′ is learned; and ii) the sensitive attribute s predictor φ at the top of the network. Denote by w the weight vector of the respective layer. The forward pass of backpropagation proceeds normally in FAD. During the backward pass (dotted lines), the gradient from the loss of the s classifier, $\frac{\partial L(\phi(x'), s)}{\partial w_\phi}$, is multiplied by a negative sign so that g adversarially aims at increasing the loss of φ, resulting in a representation x′ maximally invariant to changes in the values of s. The rest of the FAD training proceeds normally, i.e.
the gradient from the labeling classifier f, $\frac{\partial L(f(x'), y)}{\partial w_f}$, is propagated to g normally (with a positive sign), ultimately making x′ maximally informative about y.

To summarize: g, φ and f are all involved in the concurrent optimization for fairness and accuracy. The parameters of g and φ adversarially optimize for fairness, while f guarantees that accuracy is not fully sacrificed for the sake of fairness. The labeling accuracy difference between the proposed fair model and a corresponding potentially unfair model may be quantified via the discrepancy between f(x′), obtained after learning g, and f(x), respectively.

Figure 2: The proposed FAD paradigm. We establish an adversarial objective where g minimizes the ability of the predictor φ to correctly predict values of the sensitive attributes s, whereas φ, on the other hand, maximizes its own accuracy in predicting s. Another predictor f predicts the labels y from x′.

Fairness notions. We focus on two common notions (definitions) of fairness: disparate impact (Barocas and Selbst 2016; Feldman et al. 2015; Primus 2010) and disparate mistreatment (Zafar et al. 2017b). We first address the former. Disparate impact refers to a decision making process that, in aggregate, leads to different outcomes for subpopulations with different sensitive attribute values. According to this notion, and assuming WLOG a binary label and sensitive attribute, fairness is achieved (in other words, there is no disparate impact) when:

$$p(\hat{y} = 1 \mid s = 0) = p(\hat{y} = 1 \mid s = 1). \quad (1)$$

For our proposed model, define the part of the data D where s = 0 as $D_{s=0}$; then $p(\hat{y} = 1 \mid s = 0) = \frac{1}{|D_{s=0}|}\sum_{i=1}^{|D_{s=0}|} p(f(x'_i) = 1)$. Accordingly, disparate impact is avoided when:

$$\frac{1}{|D_{s=0}|}\sum_{i=1}^{|D_{s=0}|} p(f(x'_i) = 1) = \frac{1}{|D_{s=1}|}\sum_{i=1}^{|D_{s=1}|} p(f(x'_i) = 1). \quad (2)$$

The disparity (Disp) between the two sides of (2) quantifies unfairness in terms of disparate impact; larger disparity values signify more unfairness:

$$\mathrm{Disp\,DI} = \left| \frac{1}{|D_{s=0}|}\sum_{i=1}^{|D_{s=0}|} p(f(x'_i) = 1) - \frac{1}{|D_{s=1}|}\sum_{i=1}^{|D_{s=1}|} p(f(x'_i) = 1) \right|. \quad (3)$$

The other notion of fairness we inspect is referred to as disparate mistreatment (Zafar et al. 2017b). One major difference between the two notions is that disparate mistreatment depends on the ground truth labels y, and therefore it can be considered only when such information is available. Assuming a binary label y with ground truth values −1 and 1, disparate mistreatment arises when the rate of erroneous decisions (false positives, false negatives or both) differs between subpopulations with different values of the sensitive attributes. Unfairness in terms of disparate mistreatment is thus avoided when the false positive rate (FPR) and false negative rate (FNR) satisfy (4) and (5), respectively:

$$p(\hat{y} \neq y \mid y = -1, s = 0) = p(\hat{y} \neq y \mid y = -1, s = 1), \quad (4)$$
$$p(\hat{y} \neq y \mid y = 1, s = 0) = p(\hat{y} \neq y \mid y = 1, s = 1). \quad (5)$$

A larger disparity between the two sides of either of these equations signifies more disparate mistreatment. For our model, disparate mistreatment, as a whole, is avoided when (6) holds:

$$\frac{1}{|D_{s=0}|}\sum_{i=1}^{|D_{s=0}|} \mathrm{Err}(f(x'_i), y_i) = \frac{1}{|D_{s=1}|}\sum_{i=1}^{|D_{s=1}|} \mathrm{Err}(f(x'_i), y_i). \quad (6)$$

Decomposing (6) into FPR and FNR is straightforward.
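For concreteness, the disparate impact gap in (3) and the FPR/FNR gaps that (7) and (8) below formalize reduce to simple group-wise rates. The following minimal NumPy sketch (our own illustration, not the authors' released code) computes them from arrays of predictions, ground-truth labels in {−1, 1} and a binary sensitive attribute; the function name and array conventions are assumptions for this example.

```python
# A minimal sketch of the disparity measures, assuming y_hat, y in {-1, 1}
# and s in {0, 1} are NumPy arrays of equal length.
import numpy as np

def disparity_metrics(y_hat, y, s):
    d0, d1 = (s == 0), (s == 1)

    # Disparate impact (Eq. 3): gap in positive-outcome rates between the two groups.
    disp_di = abs((y_hat[d0] == 1).mean() - (y_hat[d1] == 1).mean())

    def fpr(mask):  # error rate on true negatives (y = -1) within the group
        neg = mask & (y == -1)
        return (y_hat[neg] != y[neg]).mean()

    def fnr(mask):  # error rate on true positives (y = 1) within the group
        pos = mask & (y == 1)
        return (y_hat[pos] != y[pos]).mean()

    # Disparate mistreatment (Eqs. 7, 8): gaps in false positive / false negative rates.
    disp_fpr = abs(fpr(d0) - fpr(d1))
    disp_fnr = abs(fnr(d0) - fnr(d1))
    return disp_di, disp_fpr, disp_fnr
```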
The higher the following disparity values, the more unfair the model is in terms of disparate mistreatment:

$$\mathrm{Disp\,FPR} = \left| \frac{1}{|D_{s=0}|}\sum_{i=1}^{|D_{s=0}|} \mathrm{Err}(f(x'_i), y_i \mid y_i = -1) - \frac{1}{|D_{s=1}|}\sum_{i=1}^{|D_{s=1}|} \mathrm{Err}(f(x'_i), y_i \mid y_i = -1) \right|, \quad (7)$$

$$\mathrm{Disp\,FNR} = \left| \frac{1}{|D_{s=0}|}\sum_{i=1}^{|D_{s=0}|} \mathrm{Err}(f(x'_i), y_i \mid y_i = 1) - \frac{1}{|D_{s=1}|}\sum_{i=1}^{|D_{s=1}|} \mathrm{Err}(f(x'_i), y_i \mid y_i = 1) \right|. \quad (8)$$

Modeling objective. The overall training objective of the proposed model, for a dataset of size n, is stated as follows:

$$\min_{g,f}\Big[ L(f(g(x)), y) - \beta \max_{\phi} L(\phi(g(x)), s) \Big] = \min_{g,f}\Big[ \frac{1}{n}\sum_{i=1}^{n} \mathrm{Err}(f(g(x_i)), y_i) - \beta \max_{\phi} \frac{1}{n}\sum_{i=1}^{n} \mathrm{Err}(\phi(g(x_i)), s_i) \Big], \quad (9)$$

where the fairness hyperparameter β > 0 controls the degree of fairness induced in the model. The model, as such, is based on optimizing for disparate impact. Recall that the whole proposed model is ultimately implemented as one neural network in which the adversarial layer g is injected as a hidden layer right beneath the top layer consisting of the classifiers φ and f.

In order to optimize for disparate mistreatment, a modification to (9) is needed. The function φ(x′) = φ(g(x)) should be multiplied by a term reflecting what needs to be equal across subpopulations with different sensitive attribute values in order for disparate mistreatment to be avoided. Multiplying φ(g(x)) by the ratio of $\mathrm{Err}(f(x'_i), y_i)$ over n, which signifies the misclassification rate, turns the objective into an interpretation of disparate mistreatment. The overall training objective when optimizing for disparate mistreatment therefore becomes:

$$\min_{g,f}\Big[ \frac{1}{n}\sum_{i=1}^{n} \mathrm{Err}(f(g(x_i)), y_i) - \beta \max_{\phi} \frac{1}{n}\sum_{i=1}^{n} \mathrm{Err}(\phi(g(x_i)), s_i)\, I(\hat{y}_i \neq y_i) \Big], \quad (10)$$

where I(z) is the indicator function: I(z) = 1 when z is true and 0 otherwise. However, since ŷ is the classification output, which depends on both g and f, the indicator function might be problematic to optimize by gradient descent. Hence, we can relax this optimization problem and rewrite it using the labeling misclassification probability:

$$\min_{g,f}\Big[ \frac{1}{n}\sum_{i=1}^{n} \mathrm{Err}(f(g(x_i)), y_i) - \beta \max_{\phi} \frac{1}{n}\sum_{i=1}^{n} \mathrm{Err}(\phi(g(x_i)), s_i)\, p(\hat{y}_i \neq y_i) \Big]$$
$$= \min_{g,f}\Big[ \frac{1}{n}\sum_{i=1}^{n} \mathrm{Err}(f(g(x_i)), y_i) - \beta \max_{\phi} \sum_{i=1}^{n} \frac{\mathrm{Err}(f(x'_i), y_i)}{n}\, \mathrm{Err}(\phi(g(x_i)), s_i) \Big]. \quad (11)$$

For the decomposition of disparate mistreatment into FPR and FNR, the term $\mathrm{Err}(f(x'_i), y_i) = \mathrm{Err}(f(g(x_i)), y_i)$ in (11) is divided into $\mathrm{Err}(f(x'_i), y_i)_{y_i=-1}$ and $\mathrm{Err}(f(x'_i), y_i)_{y_i=1}$, respectively.
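Because the adversarial term in (9) lives inside a single network, the whole objective can be trained with one optimizer by reversing the gradient that flows from φ into g. The PyTorch sketch below is our own minimal reading of that construction, not the authors' code: the layer sizes, the `GradReverse` helper and the `train_step` function are illustrative assumptions, and only the disparate-impact objective (9) is shown (the reweighting in (10)-(11) would scale the adversary's per-example loss by the labeling error).

```python
# A minimal one-network FAD sketch under stated assumptions: shared layers feed g,
# which feeds the label head f and, through a gradient-reversal op, the adversary phi.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, beta):
        ctx.beta = beta
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip (and scale by beta) the gradient flowing from phi back into g,
        # so g is trained to increase phi's loss while phi still minimizes it.
        return -ctx.beta * grad_output, None

class FAD(nn.Module):
    def __init__(self, d_in, d_hid=32, d_g=16, n_classes=2, n_sens=2, beta=0.3):
        super().__init__()
        self.beta = beta
        self.base = nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU(),
                                  nn.Linear(d_hid, d_hid), nn.ReLU())
        self.g = nn.Sequential(nn.Linear(d_hid, d_g), nn.ReLU())  # fair representation x'
        self.f = nn.Linear(d_g, n_classes)   # label classifier
        self.phi = nn.Linear(d_g, n_sens)    # sensitive-attribute classifier (adversary)

    def forward(self, x):
        x_fair = self.g(self.base(x))
        return self.f(x_fair), self.phi(GradReverse.apply(x_fair, self.beta))

def train_step(model, opt, x, y, s):
    y_logits, s_logits = model(x)
    # phi's own parameters minimize its loss (the inner max over phi);
    # g receives the reversed, beta-scaled gradient, implementing Eq. (9).
    loss = F.cross_entropy(y_logits, y) + F.cross_entropy(s_logits, s)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Usage sketch: model = FAD(d_in=12); opt = torch.optim.Adam(model.parameters(), lr=1e-3)
```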
2.2 Fair adversarial discriminative model with increased minibatch diversity (FAD-MD)

Although our approach avoids two dueling networks, it is still prone to some of the common problems of adversarial learning, including mode collapse (Goodfellow 2016; Goodfellow et al. 2014). Imagine that the input to the discriminator is MNIST images and that, instead of covering the whole space of the ten digits, the comparisons are restricted to two or three digits which end up being represented quite similarly by the data of both classes fed to the discriminator. The discriminator therefore becomes maximally confused, and the whole adversarial learning framework converges even though, in reality, only a small subset of the data is well represented by the inferred representation. Since we want to avoid imposing too many variations on the original settings of the system, we need a computationally efficient solution to mode collapse. Most previously proposed solutions are either tailored to the generative versions of adversarial learning (Arjovsky, Chintala, and Bottou 2017; Kim et al. 2017; Metz et al. 2017; Rosca et al. 2017) or computationally inefficient for our purpose (Srivastava et al. 2017). We experimented with the heuristics proposed in (Salimans et al. 2016), but they did not improve results in our one-network adversarial framework.

We therefore propose a method based on making the elements of a minibatch as diverse as possible from one another. Each minibatch is formed as follows (see the sketch after this procedure): i) beginning with very few data points randomly chosen to belong to the minibatch, there is a pool of points from which some are to be selected for addition to the minibatch, based on the criterion of ending up with minibatch elements that are as diverse as possible without losing too much in terms of computational runtime; ii) a point is selected from the pool via the score produced by a one-class support vector machine (SVM) classifier, where the class consists of the current elements of the minibatch. The next data point added from the pool to the minibatch is the point with the lowest score, i.e. the point believed to be the least likely to belong to the class formed by the current minibatch elements, or equivalently the point most dissimilar to them; iii) this process continues until the prespecified minibatch size is greedily reached. The size of each starting pool of points is a hyperparameter to be specified by e.g. cross-validation. We claim that using the efficient one-class SVM to form minibatches is a step towards establishing adversarial learners with a better coverage of all the regions of the data space.
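The sketch below illustrates the greedy construction just described, using scikit-learn's one-class SVM (the experiments in Section 4 use ν = 0.5). The seeding size, pool size and the per-step refitting schedule are our assumptions; the paper leaves the pool size to cross-validation.

```python
# A sketch of FAD-MD minibatch construction under our own assumptions about
# seeding and pool handling: repeatedly fit a one-class SVM on the current
# minibatch and add the candidate point with the lowest decision score.
import numpy as np
from sklearn.svm import OneClassSVM

def diverse_minibatch(X, batch_size=64, pool_size=256, n_seed=4, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    batch = list(rng.choice(len(X), size=n_seed, replace=False))
    while len(batch) < batch_size:
        candidates = np.setdiff1d(np.arange(len(X)), batch)
        pool = rng.choice(candidates, size=min(pool_size, len(candidates)),
                          replace=False)
        svm = OneClassSVM(nu=0.5).fit(X[batch])       # "class" = current minibatch
        scores = svm.decision_function(X[pool])       # low score = unlike the batch
        batch.append(int(pool[np.argmin(scores)]))    # greedily add most dissimilar point
    return np.array(batch)                            # indices of the selected minibatch
```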
3 Generalization Bound

We illustrate the theoretical foundation of the relationship between the label classifier and the adversary's ability to predict the sensitive attribute value. We begin by deriving a generalization upper bound for the framework (Theorem 1). We then shed light on the interpretation of the sensitive attribute classifier as an adversary in the concurrent adversarial optimization for fairness and accuracy, in terms of the distance between data distributions given different sensitive attribute values.

Assume the label y (as well as its estimate ŷ) and the sensitive attribute s are binary. Recall that a data sample $D = \{(x_i, y_i, s_i)\}_{i=1}^{n}$ is given. Also recall that, for the i-th instance, the classification loss between ŷ and y (equivalently defined between any two labelings) is denoted by $\mathrm{Err}(\hat{y}_i, y_i)$. Denote the expected loss between two labelings ŷ and y with respect to a distribution P by $L_P(\hat{y}, y) = \mathbb{E}_P[\mathrm{Err}(\hat{y}, y)]$. When abbreviated as $\mathrm{Err}(\hat{y})$, the other argument is the ground truth label y.

Next we describe the Rademacher complexity, which measures the richness of a class of real-valued functions with respect to a probability distribution (Shalev-Shwartz and Ben-David 2014). Rademacher complexity facilitates the derivation of general learning guarantees for problems with infinite hypothesis sets, like ours. Thanks to the introduction of a Rademacher random variable σ, Rademacher complexity directly maps the measurement of accuracy (or, inversely, error rate) of the hypothesis into the richness or expressiveness of the hypothesis w.r.t. the probability distribution resulting from the introduction of σ. More formally, let H be a hypothesis set, where h ∈ H is an arbitrary hypothesis of the set. Define the expected loss of a hypothesis class as Err(H) (note that this is the expectation of a class of functions with respect to a specific data sample D; it differs from $L_P(h(x), y)$, which is the expectation of a specific function h with respect to a probability distribution P from which D is drawn).

When each h is a real-valued function, the empirical Rademacher complexity (Mansour, Mohri, and Rostamizadeh 2009; Mohri and Rostamizadeh 2008; Shalev-Shwartz and Ben-David 2014) of H given D can be defined as:

$$\widehat{\mathrm{Rad}}_D(H) = \frac{1}{n}\, \mathbb{E}_{\sigma}\Big[ \sup_{h \in H} \sum_{i=1}^{n} \sigma_i h(x_i) \Big], \quad (12)$$

where $\sigma_1, \sigma_2, \ldots, \sigma_n$ are independent random variables drawn from the Rademacher distribution, i.e. $\Pr(\sigma = 1) = \Pr(\sigma = -1) = 0.5$. The Rademacher complexity of a hypothesis set H is then the expectation of $\widehat{\mathrm{Rad}}_D(H)$ over all samples of size n:

$$\mathrm{Rad}_n(H) = \mathbb{E}_D\big[\widehat{\mathrm{Rad}}_D(H)\big]. \quad (13)$$

For deriving our bound, we use a notion of distance between distributions proposed by Mansour, Mohri, and Rostamizadeh (2009), referred to as the discrepancy distance. It is symmetric and satisfies the triangle inequality. Denote by H a hypothesis set, where each h ∈ H is a classifier X → Y. The discrepancy distance between two arbitrary distributions P and Q is defined as:

$$\mathrm{disc}(P, Q) = \max_{h, h' \in H} \big| L_P(h, h') - L_Q(h, h') \big|. \quad (14)$$

We also state the following bound (Bartlett and Mendelson 2002; Koltchinskii and Panchenko 2000), whose proof is in (Bartlett and Mendelson 2002). For a probability distribution P (estimated by P̂) defined on X × {±1}, let H be a hypothesis set where each h ∈ H is a {±1}-valued function mapping X to a binary Y. Then, for a data sample D of size n, with probability at least 1 − δ and for a 0-1 classification loss, every hypothesis h ∈ H satisfies:

$$\mathbb{E}_P\big(\mathrm{Err}(Y, h(X))\big) \le \mathbb{E}_{\hat{P}}\big(\mathrm{Err}(Y, h(X))\big) + \widehat{\mathrm{Rad}}_D(H) + \sqrt{\frac{\ln(2/\delta)}{2n}}. \quad (15)$$

Theorem 1. Denote by $P_0$ and $P_1$ the distributions $P(x \mid s = 0)$ and $P(x \mid s = 1)$, respectively. Assume training data D of size n which, without loss of generality (WLOG), is equally divided into data points with s = 0, denoted $D_0$, and those with s = 1, denoted $D_1$. For a class H of binary classifiers and for δ > 0, with probability greater than or equal to 1 − δ, the following holds:

$$\mathrm{disc}(P_0, P_1) \le \mathrm{disc}(\hat{P}_0, \hat{P}_1) + 2\big(\widehat{\mathrm{Rad}}_{D_0}(\mathrm{Err}(H)) + \widehat{\mathrm{Rad}}_{D_1}(\mathrm{Err}(H))\big) + 2\sqrt{\frac{\ln(2/\delta)}{n}}. \quad (16)$$

Proof. Using the triangle inequality, the L.H.S. of (16) can be bounded as:

$$\mathrm{disc}(P_0, P_1) \le \mathrm{disc}(P_0, \hat{P}_0) + \mathrm{disc}(P_1, \hat{P}_1) + \mathrm{disc}(\hat{P}_0, \hat{P}_1). \quad (17)$$

For a 0-1 classification error, we can use (15) with the class of real-valued functions being a class where each element is the loss of h, Err(h), on a specific training sample D, instead of h itself. We can likewise replace y in (15) with an arbitrary h′, and since we are upper bounding we can take the worst case h′, i.e. the h′ resulting in the highest upper bound. This leads to:

$$\mathbb{E}_P\big(\mathrm{Err}(h(x), h'(x))\big) \le \mathbb{E}_{\hat{P}}\big(\mathrm{Err}(h(x), h'(x))\big) + \widehat{\mathrm{Rad}}_D(\mathrm{Err}(H)) + \sqrt{\frac{\ln(2/\delta)}{2|D|}}. \quad (18)$$

Since $\mathbb{E}_P(\mathrm{Err}(h(x), h'(x))) = L_P(h(x), h'(x))$, and from (14), (18) turns into:

$$\mathrm{disc}(P, \hat{P}) \le \widehat{\mathrm{Rad}}_D(\mathrm{Err}(H)) + \sqrt{\frac{\ln(2/\delta)}{2|D|}}. \quad (19)$$

Using (19) to bound the terms $\mathrm{disc}(P_0, \hat{P}_0)$ and $\mathrm{disc}(P_1, \hat{P}_1)$ in (17), with $|D_0| = |D_1| = n/2$, we get:

$$\mathrm{disc}(P_0, P_1) \le \mathrm{disc}(\hat{P}_0, \hat{P}_1) + \widehat{\mathrm{Rad}}_{D_0}(\mathrm{Err}(H)) + \widehat{\mathrm{Rad}}_{D_1}(\mathrm{Err}(H)) + 2\sqrt{\frac{\ln(2/\delta)}{n}} \quad (20)$$
$$\le \mathrm{disc}(\hat{P}_0, \hat{P}_1) + 2\big(\widehat{\mathrm{Rad}}_{D_0}(\mathrm{Err}(H)) + \widehat{\mathrm{Rad}}_{D_1}(\mathrm{Err}(H))\big) + 2\sqrt{\frac{\ln(2/\delta)}{n}}, \quad (21)$$

which concludes the proof.

This provides an interpretation of our modeling objective in (9), since: the sensitive attribute classifier φ(g(x)) aims at minimizing the first term in the bound on the right of (21), $\mathrm{disc}(\hat{P}_0, \hat{P}_1)$; the label classifier f(g(x)) aims at minimizing the second term on the right of (21); and the third term on the right of (21) tends to zero as the sample size n goes to infinity.
To further clarify this: as noted in (Mansour, Mohri, and Rostamizadeh 2009), for a 0-1 classification error and for our hypothesis class consisting of the loss of each hypothesis, Err(h), the discrepancy distance $\mathrm{disc}(\hat{P}_0, \hat{P}_1)$ is equivalent to the following notion of distance, referred to as the H-divergence (Ben-David et al. 2007; 2010; Devroye, Gyorfi, and Lugosi 1996; Kifer, Ben-David, and Gehrke 2004):

$$d_H(\hat{P}_0, \hat{P}_1) = \sup_{a \in \{|h - h'| \,:\, h, h' \in H\}} \big| \hat{P}_0(a) - \hat{P}_1(a) \big|. \quad (22)$$

As proved in (Ben-David et al. 2007; 2010) (Lemma 2 in (Ben-David et al. 2010)), the H-divergence in this case can be approximated by performing a classification task on the data D, namely classifying points as belonging to $D_0$ or to $D_1$:

$$d_H(\hat{P}_0, \hat{P}_1) = 2\Big( 1 - \min_{h \in H}\Big[ \frac{2}{n}\sum_{x : h(x) = 0} I(x \in D_0) + \frac{2}{n}\sum_{x : h(x) = 1} I(x \in D_1) \Big] \Big), \quad (23)$$

where the distance (the H-divergence, which is equivalent in this case to the discrepancy distance) between $\hat{P}_0$ and $\hat{P}_1$ is inversely proportional to the performance of the classifier. From (21) and (23), for an arbitrary h:

$$\mathrm{disc}(P_0, P_1) \le 2 - \Big[ \frac{4}{n}\sum_{x : h(x) = 0} I(x \in D_0) + \frac{4}{n}\sum_{x : h(x) = 1} I(x \in D_1) \Big] + 2\big(\widehat{\mathrm{Rad}}_{D_0}(\mathrm{Err}(H)) + \widehat{\mathrm{Rad}}_{D_1}(\mathrm{Err}(H))\big) + 2\sqrt{\frac{\ln(2/\delta)}{n}}. \quad (24)$$

The term $\big[ \frac{4}{n}\sum_{x : h(x) = 0} I(x \in D_0) + \frac{4}{n}\sum_{x : h(x) = 1} I(x \in D_1) \big]$ is what is estimated in our proposed formulation in (9) by $\beta \max_{\phi} L(\phi(g(x_i)), s_i)$. Due to the negative sign preceding the latter term in (9) and the equivalent former term in (24), it is inversely proportional to the maximization of our objective and to the overall distance between the estimated distributions of data with different s values, respectively. Also, the larger the value of the fairness hyperparameter β, the higher the impact of this term on the optimization of our objective. The label classifier f(g(x_i)) aims at minimizing the error of the classifier, i.e. minimizing the error in predicting the label ŷ of the data points in D (both $D_0$ and $D_1$). It is therefore straightforward to see that f(g(x_i)) naturally aims at minimizing the second term in the bound in (21), $\big(\widehat{\mathrm{Rad}}_{D_0}(\mathrm{Err}(H)) + \widehat{\mathrm{Rad}}_{D_1}(\mathrm{Err}(H))\big)$. From the latter note and (24), the two classifiers of the adversarial formulation proposed in (9) can be interpreted w.r.t. the first two terms of the upper bound on the right of (21); minimizing the classifiers' losses can be interpreted as minimizing the generalization upper bound.

4 Experiments

On two datasets, we perform experiments to evaluate the following: (i) the difference in classification accuracy due to optimizing for fairness (Table 1 and Figure 3); comparisons among the unfair and fair versions of the proposed frameworks, FAD and FAD-MD, as well as previous state-of-the-art algorithms, show that the quest for fairness with FAD and FAD-MD leads to minimal loss in accuracy; (ii) (un)fairness, in terms of both disparate impact and disparate mistreatment (Table 2); results demonstrate state-of-the-art effectiveness of FAD and FAD-MD; and (iii) an MMD two-sample test to assess the fidelity of the learned fair representation x′ (Figure 4); this also shows the effectiveness of increasing the minibatch diversity in FAD-MD. Moreover, the increase in training time due to optimizing for fairness is not considerable for FAD compared to its unfair version.

We test our framework on two popular real-world datasets in the fairness literature, the ProPublica COMPAS dataset (Larson et al. 2016) and the Adult dataset (Dheeru and Taniskidou 2017). The task in the COMPAS dataset is binary classification, with two classes depicting whether or not a criminal defendant will recidivate within two years. We use the black vs. white values of the race feature as our sensitive attribute.
The COMPAS data we work with comprises 5,278 instances and 12 features. The Adult dataset consists of 32,561 complete instances describing adults from the 1994 US Census; the task is to predict, from 14 features, whether an adult's income is higher or lower than 50K USD. Gender (male or female) is the sensitive attribute in the Adult data.

Each experiment is repeated ten times where, in each run, the data is randomly split into three partitions: training, validation (to identify the value of the fairness hyperparameter β) and test. A portion of 60% of the data is reserved for training, 20% for validation and 20% for testing. Statistics reported are averages over the ten repetitions. For FAD-MD, we use the one-class SVM introduced in (Scholkopf et al. 2000) with ν = 0.5 (fraction of support vectors and outliers).

On both datasets, and in addition to the unfair versions of the proposed algorithms, of (Zafar et al. 2017b) and of (Zafar et al. 2017a), we compare (where applicable) FAD and FAD-MD to the following state-of-the-art fairness algorithms: Zafar et al. (2017b), Zafar et al. (2017a), Zafar et al. (2017c), Hardt, Price, and Srebro (2016), Feldman et al. (2015), Kamishima et al. (2012), Fish, Kun, and Lelkes (2016), Bechavod and Ligett (2017), Komiyama et al. (2018), Agarwal et al. (2018), and Narasimhan (2018).

Classification results are displayed in Table 1. A 2-layer neural network is used to obtain the unfair classification results, whereas the adversarial layer is added under the top layer to accomplish fairness via FAD and its variation FAD-MD. The values of the fairness hyperparameter β selected by cross-validation are 0.3 and 0.8 for the COMPAS and Adult datasets, respectively. Classification results on both datasets demonstrate that, among fairness algorithms, FAD and FAD-MD achieve state-of-the-art classification accuracy. In addition, the impact on classification accuracy due to the optimization for fairness by the proposed algorithms, FAD and FAD-MD, is rather minimal. This is quantified via the difference in accuracy between the unfair and fair versions of the proposed framework, compared to the corresponding differences for (Zafar et al. 2017b) and (Zafar et al. 2017a).

Table 1: Classification accuracy on the COMPAS and Adult datasets. In addition to its impact on fairness, increasing the minibatch diversity with FAD-MD improves the classification accuracy. Classification accuracy values achieved by FAD and FAD-MD on both datasets are higher than previous state-of-the-art results. The loss in accuracy due to fairness (difference in classification accuracy between the unfair and fair versions) is not large; with FAD-MD, it is as small as the state of the art by Zafar et al. (2017b) for the COMPAS data, and uniquely minimal for the Adult data. Bold refers to an accuracy value that is significantly better than the other fair competitors (all apart from the first three rows). To test significance, we perform a paired t-test with significance level at 5%.

Method | COMPAS | Adult
Unfair (β = 0) | 89.3% | 90.1%
Unfair (Zafar et al. 2017b) | 66.8% | 85.8%
Unfair (Zafar et al. 2017a) | 69.0% | 87.0%
FAD | 88.4% | 88.6%
FAD-MD | 88.7% | 89.0%
Zafar et al. (2017b) | 66.2% | 83.1%
Zafar et al. (2017a) | 67.5% | 84.0%
Hardt et al. (2016) | 64.4% | 84.6%
Feldman et al. (2015) | 86.8% | 82.1%
Kamishima et al. (2012) | 72.4% | 84.3%
Fish et al. (2016) | 81.2% | 84.0%
Bechavod (2017) | 66.4% | 78.3%
Komiyama et al. (2018) | 86.6% | 85.7%
Agarwal et al. (2018) | 71.2% | 86.2%
Narasimhan (2018) | 77.7% | 81.5%
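For concreteness, the evaluation protocol described above (ten repetitions of random 60/20/20 splits, averaged test accuracy, and a paired t-test at the 5% level) can be sketched as follows. The helpers `fit_and_score_a` and `fit_and_score_b` are hypothetical callables, assumed to train a model on the given split (using the validation part to pick β) and return its test accuracy.

```python
# A sketch of the repeated-split evaluation and significance test, under our own
# assumptions about how the per-run training routines are packaged.
import numpy as np
from scipy.stats import ttest_rel
from sklearn.model_selection import train_test_split

def repeated_eval(X, y, s, fit_and_score_a, fit_and_score_b, n_runs=10, seed=0):
    acc_a, acc_b = [], []
    for run in range(n_runs):
        # 60% train, then split the remaining 40% evenly into validation and test.
        X_tr, X_tmp, y_tr, y_tmp, s_tr, s_tmp = train_test_split(
            X, y, s, train_size=0.6, random_state=seed + run)
        X_val, X_te, y_val, y_te, s_val, s_te = train_test_split(
            X_tmp, y_tmp, s_tmp, test_size=0.5, random_state=seed + run)
        acc_a.append(fit_and_score_a(X_tr, y_tr, s_tr, X_val, y_val, s_val, X_te, y_te))
        acc_b.append(fit_and_score_b(X_tr, y_tr, s_tr, X_val, y_val, s_val, X_te, y_te))
    t_stat, p_value = ttest_rel(acc_a, acc_b)   # paired t-test across the ten runs
    return np.mean(acc_a), np.mean(acc_b), p_value < 0.05
```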
In Figure 3, we plot the classification accuracy as a function of the fairness hyperparameter β. FAD-MD leads to slightly higher classification accuracy than FAD. Classification accuracy initially decreases when optimizing for fairness, and then saturates for larger values of β.

Fairness results, in the form of empirical values of the disparity notions described in (3), (7) and (8), are displayed in Table 2. Such values are shown for FAD, FAD-MD and the same competitors as in Table 1, where applicable. The proposed algorithms, FAD and FAD-MD, minimize the disparity values (unfairness) when optimizing for fairness, and achieve the (at times joint) best results in terms of the fairness metrics in five out of six cases (three disparity values, Disp DI, Disp FPR and Disp FNR, for each dataset).

We now move on to a more rigorous evaluation of the minibatch diversity augmentation in FAD-MD. One of the problems of most current adversarial learning frameworks is that the adversary bases its optimization on comparing data points rather than distributions or, at least, sets of points. We evaluate this here by performing a nonparametric two-sample test between representations of data points belonging to different values of the sensitive attribute s. The null hypothesis states that the distributions of the learned representation x′ given different s values are equal, $H_0 : p(x' \mid s = 0) = p(x' \mid s = 1)$, and its alternative is $H_1 : p(x' \mid s = 0) \neq p(x' \mid s = 1)$. Failing to reject (i.e. accepting) the null hypothesis $H_0$ is the favorable outcome, since it means that the model has learned a representation x′ through which different values of s are indistinguishable. We perform a two-sample maximum mean discrepancy (MMD) test.

Table 2: Unfairness of different algorithms measured by disparate impact (Disp DI in Eq. (3)) and disparate mistreatment (Disp FPR and Disp FNR in Eqs. (7) and (8)) on the COMPAS and the Adult datasets. Smaller values are more favorable since they denote less unfairness. For each competitor, we report the best value achieved across their different settings. In five out of six cases (three disparity values for each dataset), the proposed algorithms, FAD and FAD-MD, (at times jointly) achieve the best results in terms of the fairness metrics. Bold refers to a value that is significantly less (better) than its non-bold competitors. To test significance, we perform a paired t-test with significance level at 5%. Empty cells indicate non-applicable experiments.

COMPAS dataset. Columns: Unfair (β = 0), Unfair (Zafar et al. 2017b), Unfair (Zafar et al. 2017a), FAD, FAD-MD, Zafar et al. (2017b), Zafar et al. (2017a), Hardt et al. (2016):
Disp DI: 0.6, 0.62, 0.08, 0.11, 0.38
Disp FPR: 0.21, 0.18, 0.01, 0.01, 0.03, 0.01
Disp FNR: 0.29, 0.3, 0.01, 0.02, 0.1, 0.01
Columns: Feldman et al. (2015), Kamishima et al. (2012), Fish et al. (2016), Bechavod (2017), Komiyama et al. (2018), Agarwal et al. (2018), Narasimhan (2018):
Disp DI: 0.95, 0.9, 0.15, 0.2, 0.09, 0.1
Disp FPR: 0.4, 0.2, 0.03, 0.01, 0.05, 0.09
Disp FNR: 0.45, 0.15, 0.03, 0.03, 0.05, 0.11

Adult dataset. Columns: Unfair (β = 0), Unfair (Zafar et al. 2017b), Unfair (Zafar et al. 2017a), FAD, FAD-MD, Zafar et al. (2017b), Zafar et al. (2017a), Hardt et al. (2016):
Disp DI: 0.71, 0.68, 0.14, 0.13, 0.29
Disp FPR: 0.36, 0.35, 0.02, 0.01, 0.12, 0.04
Disp FNR: 0.32, 0.4, 0.01, 0.02, 0.09, 0.03
Columns: Feldman et al. (2015), Kamishima et al. (2012), Fish et al. (2016), Bechavod (2017), Komiyama et al. (2018), Agarwal et al. (2018), Narasimhan (2018):
Disp DI: 0.25, 0.3, 0.16, 0.28, 0.13, 0.19
Disp FPR: 0.3, 0.07, 0.02, 0.0, 0.04, 0.14
Disp FNR: 0.4, 0.08, 0.03, 0.04, 0.05, 0.08
Figure 3: Classification accuracy as a function of the fairness hyperparameter β, where β = 0 indicates no adjustment for fairness; panel (a) shows the COMPAS dataset and panel (b) the Adult dataset. The COMPAS data is balanced: 47% of the instances have recidivated and 53% have not, hence a random guessing classifier's accuracy would be as low as around 53%. For the Adult data, a random guessing classifier would result in 75.9% accuracy, since there are many more records for people whose income is less than or equal to 50K. Classification accuracy initially decreases with increasing β until it saturates. For both datasets, classification accuracy does not change considerably for β > 1. The accuracy achieved by FAD-MD is slightly higher.

Let x′_0 and x′_1 refer to data points sampled from x′ given s = 0 and s = 1, respectively. We compute the estimate MMD(x′_0, x′_1) as a two-sample test between x′_0 and x′_1 (Gretton et al. 2006; Lloyd and Ghahramani 2015) by the following expression:

$$\mathrm{MMD}(x'_0, x'_1) = \frac{1}{n_0^2}\sum_{i=1}^{n_0}\sum_{j=1}^{n_0} k(x'_0(i), x'_0(j)) + \frac{1}{n_1^2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_1} k(x'_1(i), x'_1(j)) - \frac{2}{n_0 n_1}\sum_{i=1}^{n_0}\sum_{j=1}^{n_1} k(x'_0(i), x'_1(j)). \quad (25)$$

We use a Gaussian kernel $k(x'_0, x'_1) = e^{-\gamma \|x'_0 - x'_1\|^2}$, with γ chosen by cross-validation. The threshold used (the given allowable probability of false rejection (Gretton et al. 2012)) is 0.05. Results of running the MMD two-sample test 100 times are displayed in Figure 4. Smaller values are better since they signify a more similar representation for instances with different sensitive attribute values s. FAD-MD achieves better results, with lower rejection rates than FAD, on both datasets.

Figure 4: MMD two-sample test rejection rates (%) as a function of the fairness hyperparameter β; panel (a) shows the COMPAS dataset and panel (b) the Adult dataset. The lower the value, the better. The threshold used, i.e. the given allowable probability of false rejection, is 0.05. FAD-MD leads to less distinguishable (less unfair) representations of data with different s than FAD.
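A small sketch of the MMD statistic in Eq. (25) with the Gaussian kernel described above is given below. The paper calibrates the test at a 0.05 allowable false-rejection probability following (Gretton et al. 2012); the permutation-based threshold used here is our own substitute for that calibration, and γ is assumed given (the paper selects it by cross-validation).

```python
# A sketch of the MMD statistic of Eq. (25) and a permutation-calibrated
# two-sample test between representations with s = 0 and s = 1.
import numpy as np

def gaussian_kernel(a, b, gamma):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return np.exp(-gamma * d2)

def mmd2(x0, x1, gamma):
    # .mean() over the full kernel matrices gives the 1/n0^2, 1/n1^2, 1/(n0*n1) factors.
    k00 = gaussian_kernel(x0, x0, gamma).mean()
    k11 = gaussian_kernel(x1, x1, gamma).mean()
    k01 = gaussian_kernel(x0, x1, gamma).mean()
    return k00 + k11 - 2.0 * k01

def mmd_test(x0, x1, gamma, n_perm=200, alpha=0.05, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    observed = mmd2(x0, x1, gamma)
    pooled, n0 = np.vstack([x0, x1]), len(x0)
    null = []
    for _ in range(n_perm):                       # permutation null distribution
        idx = rng.permutation(len(pooled))
        null.append(mmd2(pooled[idx[:n0]], pooled[idx[n0:]], gamma))
    return observed > np.quantile(null, 1 - alpha)   # True = reject H0 (groups distinguishable)
```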
Details of the model architectures are listed in Table 3. Adam (Kingma and Ba 2015) is the optimizer used to compute the gradients.

Table 3: Architecture of the neural network used in the introduced models. There are two layers in addition to the adversarial layer g. FC stands for fully connected.

Dataset | Architecture
COMPAS | FC 16 ReLU, FC 32 ReLU, g: FC 16 ReLU, FC output
Adult | FC 32 ReLU, FC 32 ReLU, g: FC 16 ReLU, FC output

5 Related Work

We first focus on fairness frameworks that are based on adversarial learning, beginning with a comparison with the framework of (Madras et al. 2018). Differences between the latter and our proposed framework include the following. Our proposed platform can be applied to an existing neural network architecture with rather slender modifications; we do not have to reconstruct everything from scratch to obtain a potentially fair model. The whole optimization in our framework is implemented within one neural network, compared to up to four neural networks in (Madras et al. 2018); this avoids some of the well documented problems of adversarial learning that appear when two or more neural networks are involved in the optimization (Goodfellow 2016; Goodfellow et al. 2014). Performing the training with diverse minibatches, i.e. minibatches whose elements are maximally diverse from one another, leads to improved results. Finally, our framework is capable of quantitatively evaluating the achieved degree of fairness as well as the difference (potential loss) in accuracy due to imposing fairness.

The work in (Edwards and Storkey 2016) learns a representation that is concurrently fair and discriminative w.r.t. the prediction task. Similar to (Madras et al. 2018), it is based on more than one neural network, since each adversary consists of a separate network, leading to difficulties in reaching stability among adversaries. The work in (Edwards and Storkey 2016) also sheds light on a mapping between fairness, distances between distributions and the adversary's ability to predict the sensitive attribute value. In addition to taking a different approach, we extend beyond that by linking all these notions to the original labeling classification accuracy as well. Zhang, Lemoine, and Mitchell (2018) utilize an adversarial learning framework to achieve fairness via directly comparing the labeling classification outcomes, i.e. without learning an intermediate fair representation. The work in (Beutel et al. 2017) analyzes the impact of the data distribution on the fairness notion adopted by the adversary. Another adversarial framework is the one introduced by Louppe, Kagan, and Cranmer (2017), which permits two-network adversarial frameworks to act on a continuous sensitive attribute. Although they note that the algorithm is applicable to fairness, the experiments performed are on one real-world dataset that is not fairness-related. Implementing the same idea of (Louppe, Kagan, and Cranmer 2017), using continuous sensitive attributes on fairness datasets, in a monolithic network within our framework is an interesting direction for future work. Other fairness-aware adversarial works include (Wadsworth, Vera, and Piech 2018; Xu et al. 2018).

Looking more broadly (beyond adversarial frameworks), and in addition to those mentioned elsewhere in the paper, other fairness algorithms include Goel, Rao, and Shroff (2015), which defines disparities as a function of false positive rates for people from different races, i.e. similarly to disparate mistreatment. In (Celis et al. 2018), more than one notion of fairness can be jointly enforced via fairness constraints on a meta-algorithm. A comparative study of some fairness algorithms is provided in (Friedler et al. 2018). An interpolation between statistical notions of fairness, with the aim of obtaining the good properties of each definition, is presented in (Kearns et al. 2018). The work in (Hajian et al. 2015) imposes fairness via post-processing of the frequent patterns in the data.

6 Conclusion

We introduced a fair adversarial framework applicable to differentiable discriminative models. Instead of having to establish the architecture from scratch, we make slight adjustments to an existing differentiable classifier, adding a new hidden layer and a new classifier above it, to concurrently optimize for fairness and accuracy.
We analyzed and evaluated the resulting tradeoff between fairness and accuracy. We proposed a minibatch diversity variation of the learning procedure which may be of independent interest for other adversarial frameworks. We provided a theoretical interpretation of the two classifiers (adversaries) constituting the model. We demonstrated strong empirical performance of our methods compared to previous leading approaches. Our approach applies to existing architectures; hence, it will be interesting to study how a pre-trained network adapts to the new dual objective.

Acknowledgements

TA, ZG and AW acknowledge support from the Leverhulme Trust via the CFI. AW acknowledges support from the David MacKay Newton research fellowship at Darwin College and The Alan Turing Institute under EPSRC grant EP/N510129/1 & TU/B/000074. IV acknowledges support from the MPG Minerva Fast Track program.

References

Agarwal, A.; Beygelzimer, A.; Dudik, M.; Langford, J.; and Wallach, H. 2018. A reductions approach to fair classification. ICML.
Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; and Marchand, M. 2014. Domain-adversarial neural networks. arXiv preprint arXiv:1412.4446.
Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein generative adversarial networks. ICML.
Barocas, S., and Selbst, A. 2016. Big Data's Disparate Impact. California Law Review.
Bartlett, P., and Mendelson, S. 2002. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR.
Bechavod, Y., and Ligett, K. 2017. Penalizing unfairness in binary classification. arXiv preprint arXiv:1707.00044.
Ben-David, S.; Blitzer, J.; Crammer, K.; and Pereira, F. 2007. Analysis of representations for domain adaptation. NIPS 21:137–144.
Ben-David, S.; Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.; and Vaughan, J. 2010. A theory of learning from different domains. Machine Learning 79(2):151–175.
Beutel, A.; Chen, J.; Zhao, Z.; and Chi, E. 2017. Data decisions and theoretical implications when adversarially learning fair representations. arXiv preprint arXiv:1707.00075.
Brennan, T.; Dieterich, W.; and Ehret, B. 2009. Evaluating the predictive validity of the COMPAS risk and needs assessment system. Criminal Justice and Behavior 36:21–40.
Celis, E.; Huang, L.; Keswani, V.; and Vishnoi, N. 2018. Classification with fairness constraints: A meta-algorithm with provable guarantees. arXiv preprint arXiv:1806.06055.
Chouldechova, A. 2017. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data 2.
Devroye, L.; Gyorfi, L.; and Lugosi, G. 1996. A Probabilistic Theory of Pattern Recognition. Springer.
Dheeru, D., and Taniskidou, E. K. 2017. UCI ML Repository.
Edwards, H., and Storkey, A. 2016. Censoring representations with an adversary. ICLR.
Feldman, M.; Friedler, S.; Moeller, J.; Scheidegger, C.; and Venkatasubramanian, S. 2015. Certifying and removing disparate impact. KDD.
Fish, B.; Kun, J.; and Lelkes, A. 2016. A confidence-based approach for balancing fairness and accuracy. SDM.
Friedler, S.; Scheidegger, C.; Venkatasubramanian, S.; Choudhary, S.; Hamilton, E.; and Roth, D. 2018. A comparative study of fairness-enhancing interventions in machine learning. arXiv preprint arXiv:1802.04422.
Ganin, Y., and Lempitsky, V. 2015. Unsupervised domain adaptation by backpropagation. ICML 32.
Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. 2016. Domain-adversarial training of neural networks. JMLR 17(59):1–35.
Goel, S.; Rao, J.; and Shroff, R. 2015. Precinct or prejudice? Understanding racial disparities in New York City's stop-and-frisk policy. Annals of Applied Statistics.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. NIPS 2672–2680.
Goodfellow, I. 2016. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160.
Gretton, A.; Borgwardt, K.; Rasch, M.; Scholkopf, B.; and Smola, A. 2006. A kernel method for the two-sample-problem. NIPS.
Gretton, A.; Borgwardt, K.; Rasch, M.; Scholkopf, B.; and Smola, A. 2012. A kernel two-sample test. JMLR 13.
Grgic-Hlaca, N.; Redmiles, E.; Gummadi, K.; and Weller, A. 2018a. Human perceptions of fairness in algorithmic decision making: A case study of criminal risk prediction. WWW.
Grgic-Hlaca, N.; Zafar, M.; Gummadi, K.; and Weller, A. 2018b. Beyond distributive fairness in algorithmic decision making: Feature selection for procedurally fair learning. AAAI.
Hajian, S.; Domingo-Ferrer, J.; Monreale, A.; Pedreschi, D.; and Giannotti, F. 2015. Discrimination- and privacy-aware patterns. Data Mining and Knowledge Discovery 1733–1782.
Hardt, M.; Price, E.; and Srebro, N. 2016. Equality of opportunity in supervised learning. NIPS.
Kamishima, T.; Akaho, S.; Asoh, H.; and Sakuma, J. 2012. Fairness-aware classifier with prejudice remover regularizer.
Kearns, M.; Neel, S.; Roth, A.; and Wu, Z. 2018. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. ICML.
Khandani, A.; Kim, A.; and Lo, A. 2010. Consumer credit-risk models via machine-learning algorithms. JBF 34:2767–2787.
Kifer, D.; Ben-David, S.; and Gehrke, J. 2004. Detecting change in data streams. VLDB 180–191.
Kim, T.; Cha, M.; Kim, H.; Lee, J.; and Kim, J. 2017. Learning to discover cross-domain relations with generative adversarial networks. ICML.
Kingma, D., and Ba, J. 2015. Adam: A method for stochastic optimization. ICLR.
Koltchinskii, V., and Panchenko, D. 2000. Rademacher processes and bounding the risk of function learning. HDP.
Komiyama, J.; Takeda, A.; Honda, J.; and Shimao, H. 2018. Nonconvex optimization for regression with fairness constraints. ICML.
Kusner, M.; Loftus, J.; Russell, C.; and Silva, R. 2017. Counterfactual fairness. NIPS.
Larson, J.; Mattu, S.; Kirchner, L.; and Angwin, J. 2016. https://github.com/propublica/compas-analysis.
Lloyd, J., and Ghahramani, Z. 2015. Statistical model criticism using kernel two sample tests. NIPS.
Louizos, C.; Swersky, K.; Li, Y.; Welling, M.; and Zemel, R. 2016. The variational fair autoencoder. ICLR.
Louppe, G.; Kagan, M.; and Cranmer, K. 2017. Learning to pivot with adversarial networks. NIPS.
Madras, D.; Creager, E.; Pitassi, T.; and Zemel, R. 2018. Learning adversarially fair and transferable representations. ICML.
Mansour, Y.; Mohri, M.; and Rostamizadeh, A. 2009. Domain adaptation: Learning bounds and algorithms. COLT.
Metz, L.; Poole, B.; Pfau, D.; and Sohl-Dickstein, J. 2017. Unrolled generative adversarial networks. ICLR.
Mohri, M., and Rostamizadeh, A. 2008. Rademacher complexity bounds for non-i.i.d. processes. NIPS.
Narasimhan, H. 2018. Learning with complex loss functions and constraints. AISTATS 1646–1654.
Primus, R. 2010. The future of disparate impact. Mich. Law Rev.
Rosca, M.; Lakshminarayanan, B.; Warde-Farley, D.; and Mohamed, S. 2017. Variational approaches for auto-encoding generative adversarial networks. arXiv preprint arXiv:1706.04987.
Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; and Radford, A. 2016. Improved techniques for training GANs. NIPS.
Scholkopf, B.; Williamson, R.; Smola, A.; Shawe-Taylor, J.; and Platt, J. 2000. Support vector method for novelty detection. NIPS.
Shalev-Shwartz, S., and Ben-David, S. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
Srivastava, A.; Valkov, L.; Russell, C.; Gutmann, M.; and Sutton, C. 2017. VEEGAN: Reducing mode collapse in GANs using implicit variational learning. NIPS.
Wadsworth, C.; Vera, F.; and Piech, C. 2018. Achieving fairness through adversarial learning: An application to recidivism prediction. FAT/ML Workshop.
Xu, D.; Yuan, S.; Zhang, L.; and Wu, X. 2018. FairGAN: Fairness-aware generative adversarial networks. arXiv preprint arXiv:1805.11202.
Zafar, M.; Valera, I.; Rodriguez, M.; Gummadi, K.; and Weller, A. 2017a. From parity to preference-based notions of fairness in classification. NIPS.
Zafar, M.; Valera, I.; Rodriguez, M.; and Gummadi, K. 2017b. Fairness beyond disparate treatment and disparate impact: Learning classification without disparate mistreatment. WWW.
Zafar, M.; Valera, I.; Rodriguez, M.; and Gummadi, K. 2017c. Fairness constraints: Mechanisms for fair classification. AISTATS.
Zemel, R.; Wu, Y.; Swersky, K.; Pitassi, T.; and Dwork, C. 2013. Learning fair representations. ICML 325–333.
Zhang, B.; Lemoine, B.; and Mitchell, M. 2018. Mitigating unwanted biases with adversarial learning. arXiv preprint arXiv:1801.07593.