# Learning Adversarially Fair and Transferable Representations

David Madras 1 2, Elliot Creager 1 2, Toniann Pitassi 1 2, Richard Zemel 1 2

1 Department of Computer Science, University of Toronto, Toronto, Canada. 2 Vector Institute, Toronto, Canada. Correspondence to: David Madras.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

In this paper, we advocate for representation learning as the key to mitigating unfair prediction outcomes downstream. Motivated by a scenario where learned representations are used by third parties with unknown objectives, we propose and explore adversarial representation learning as a natural method of ensuring those parties act fairly. We connect group fairness (demographic parity, equalized odds, and equal opportunity) to different adversarial objectives. Through worst-case theoretical guarantees and experimental validation, we show that the choice of this objective is crucial to fair prediction. Furthermore, we present the first in-depth experimental demonstration of fair transfer learning and demonstrate empirically that our learned representations admit fair predictions on new tasks while maintaining utility, an essential goal of fair representation learning.

1. Introduction

There are two implicit steps involved in every prediction task: acquiring data in a suitable form, and specifying an algorithm that learns to predict well given the data. In practice these two responsibilities are often assumed by distinct parties. For example, in online advertising the so-called prediction vendor profits by selling its predictions (e.g., person X is likely to be interested in product Y) to an advertiser, while the data owner profits by selling a predictively useful dataset to the prediction vendor (Dwork et al., 2012). Because the prediction vendor seeks to maximize predictive accuracy, it may (intentionally or otherwise) bias the predictions to unfairly favor certain groups or individuals. The use of machine learning in this context is especially concerning because of its reliance on historical datasets that include patterns of previous discrimination and societal bias. Thus there has been a flurry of recent work from the machine learning community focused on defining and quantifying these biases and proposing new prediction systems that mitigate their impact.

Meanwhile, the data owner also faces a decision that critically affects the predictions: what is the correct representation of the data? Often, this choice of representation is made at the level of data collection: feature selection and measurement. If we want to maximize the prediction vendor's utility, then the right choice is to simply collect and provide the prediction vendor with as much data as possible. However, assuring that prediction vendors learn only fair predictors complicates the data owner's choice of representation, which must yield predictors that are never unfair but nevertheless have relatively high utility. In this paper, we frame the data owner's choice as a representation learning problem with an adversary criticizing potentially unfair solutions.
Our contributions are as follows: We connect common group fairness metrics (demographic parity, equalized odds, and equal opportunity) to adversarial learning by providing, for each metric, an appropriate adversarial objective function that upper bounds the unfairness of arbitrary downstream classifiers in the limit of adversarial training; we distinguish our algorithm from previous approaches to adversarial fairness and discuss its suitability to fair classification due to the novel choice of adversarial objective and emphasis on representation as the focus of adversarial criticism; we validate experimentally that classifiers trained naively (without fairness constraints) from representations learned by our algorithm achieve their respective fairness desiderata; furthermore, we show empirically that these representations achieve fair transfer: they admit fair predictors on unseen tasks, even when those predictors are not explicitly specified to be fair.

In Sections 2 and 3 we discuss relevant background material and related work. In Section 4 we describe our model and motivate our learning algorithm. In Section 5 we discuss our novel adversarial objective functions, connecting them to common group fairness metrics and providing theoretical guarantees. In Section 6 we discuss experiments demonstrating our method's success in fair classification and fair transfer learning.

2. Background

2.1. Fairness

In fair classification we have some data X ∈ ℝⁿ, labels Y ∈ {0, 1}, and sensitive attributes A ∈ {0, 1}. The predictor outputs a prediction Ŷ ∈ {0, 1}. We seek to learn to predict outcomes that are accurate with respect to Y but fair with respect to A; that is, the predictions are accurate but not biased in favor of one group or the other.

There are many possible criteria for group fairness in this context. One is demographic parity, which ensures that the positive outcome is given to the two groups at the same rate, i.e., P(Ŷ = 1 | A = 0) = P(Ŷ = 1 | A = 1). However, the usefulness of demographic parity can be limited if the base rates of the two groups differ, i.e., if P(Y = 1 | A = 0) ≠ P(Y = 1 | A = 1). In this case, we can pose an alternate criterion by conditioning the metric on the ground truth Y, yielding equalized odds and equal opportunity (Hardt et al., 2016); the former requires equal false positive and false negative rates between the groups, while the latter requires only one of these equalities. Equal opportunity is intended to match errors in the advantaged outcome across groups; whereas Hardt et al. (2016) chose Y = 1 as the advantaged outcome, the choice is domain specific, and we here use Y = 0 instead without loss of generality. Formally, this is P(Ŷ = Y | A = 0, Y = y) = P(Ŷ = Y | A = 1, Y = y) for all y ∈ {0, 1} (or just y = 0 for equal opportunity). Satisfying these constraints is known to conflict with learning well-calibrated classifiers (Chouldechova, 2017; Kleinberg et al., 2016; Pleiss et al., 2017). It is common to instead optimize a relaxed objective (Kamishima et al., 2012), whose hyperparameter values negotiate a tradeoff between maximizing utility (usually classification accuracy) and fairness.
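To make these definitions concrete, the following NumPy sketch (our illustration; the function names are ours, not from the paper) estimates the demographic parity, equalized odds, and equal opportunity gaps of a binary predictor from finite samples, using Y = 0 as the advantaged outcome for equal opportunity as above.

```python
import numpy as np

def demographic_parity_gap(y_hat, a):
    """|P(Yhat=1 | A=0) - P(Yhat=1 | A=1)| estimated from samples."""
    return abs(y_hat[a == 0].mean() - y_hat[a == 1].mean())

def equalized_odds_gap(y_hat, y, a):
    """Sum over y in {0,1} of |P(Yhat != y | A=0, Y=y) - P(Yhat != y | A=1, Y=y)|,
    i.e., the false positive rate gap plus the false negative rate gap."""
    gap = 0.0
    for y_val in (0, 1):
        err0 = (y_hat[(a == 0) & (y == y_val)] != y_val).mean()
        err1 = (y_hat[(a == 1) & (y == y_val)] != y_val).mean()
        gap += abs(err0 - err1)
    return gap

def equal_opportunity_gap(y_hat, y, a, advantaged=0):
    """Like equalized odds, but only for the advantaged outcome (Y = 0 here)."""
    err0 = (y_hat[(a == 0) & (y == advantaged)] != advantaged).mean()
    err1 = (y_hat[(a == 1) & (y == advantaged)] != advantaged).mean()
    return abs(err0 - err1)
```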
2.2. Adversarial Learning

Adversarial learning is a popular method of training neural network-based models. Goodfellow et al. (2014) framed learning a deep generative model as a two-player game between a generator G and a discriminator D. Given a dataset X, the generator aims to fool the discriminator by generating convincing synthetic data, i.e., starting from random noise z ∼ p(z), G(z) resembles X. Meanwhile, the discriminator aims to distinguish between real and synthetic data by assigning D(G(z)) = 0 and D(X) = 1. Learning proceeds by the max-min optimization of the joint objective V(D, G) ≜ E_{p(X)}[log D(X)] + E_{p(z)}[log(1 − D(G(z)))], where D and G seek to maximize and minimize this quantity, respectively.

3. Related Work

Interest in fair machine learning is burgeoning as researchers seek to define and mitigate unintended harm in automated decision making systems. Definitional works have been broadly concerned with group fairness or individual fairness. Dwork et al. (2012) discussed individual fairness within the owner-vendor framework we utilize. Zemel et al. (2013) encouraged elements of both group and individual fairness via a regularized objective. An intriguing body of recent work unifies the individual-group dichotomy by exploring fairness at the intersection of multiple group identities, and among small subgroups of individuals (Kearns et al., 2018; Hébert-Johnson et al., 2018).

Calmon et al. (2017) and Hajian et al. (2015) explored fair machine learning by pre- and post-processing training datasets. McNamara et al. (2017) provide a framework where the data producer, user, and regulator have separate concerns, and discuss fairness properties of representations. Louizos et al. (2016) give a method for learning fair representations with deep generative models by using maximum mean discrepancy (Gretton et al., 2007) to eliminate disparities between the two sensitive groups.

Adversarial training for deep generative modeling was popularized by Goodfellow et al. (2014) and applied to deep semi-supervised learning (Salimans et al., 2016; Odena, 2016) and segmentation (Luc et al., 2016), although similar concepts had previously been proposed for unsupervised and supervised learning (Schmidhuber, 1992; Gutmann & Hyvärinen, 2010). Ganin et al. (2016) proposed adversarial representation learning for domain adaptation, which resembles fair representation learning in the sense that multiple distinct data distributions (e.g., demographic groups) must be expressively modeled by a single representation. Edwards & Storkey (2016) made this connection explicit by proposing adversarially learning a classifier that achieves demographic parity. This work is the most closely related to ours, and we discuss some key differences in Section 5.4.

Recent work has explored the use of adversarial training for other notions of group fairness. Beutel et al. (2017) explored the particular fairness levels achieved by the algorithm from Edwards & Storkey (2016), and demonstrated that they can vary as a function of the demographic imbalance of the training data. In work concurrent to ours, Zhang et al. (2018) use an adversary which attempts to predict the sensitive variable solely based on the classifier output, to learn an equal opportunity fair classifier. Whereas they focus on fairness in classification outcomes, in our work we allow the adversary to work directly with the learned representation, which we show yields fair and transferable representations that in turn admit fair classification outcomes.

4. Adversarially Fair Representations

Figure 1. Model for learning adversarially fair representations.
The variables are data X, latent representations Z, sensitive attributes A, and labels Y. The encoder f maps X (and possibly A, not shown) to Z, the decoder k reconstructs X from (Z, A), the classifier g predicts Y from Z, and the adversary h predicts A from Z (and possibly Y, not shown).

4.1. A Generalized Model

We assume a generalized model (Figure 1), which seeks to learn a data representation Z capable of reconstructing the inputs X, classifying the target labels Y, and protecting the sensitive attribute A from an adversary. Either of the first two requirements can be omitted by setting hyperparameters to zero, so the model easily ports to strictly supervised or unsupervised settings as needed. This general formulation was originally proposed by Edwards & Storkey (2016); below we address our specific choices of adversarial objectives and explore their fairness implications, which distinguish our work as more closely aligned to the goals of fair representation learning.

The dataset consists of tuples (X, A, Y) in ℝⁿ, {0, 1}, and {0, 1}, respectively. The encoder f : ℝⁿ → ℝᵐ yields the representations Z. The encoder can also optionally receive A as input. The classifier and adversary g, h : ℝᵐ → {0, 1} each act on Z and attempt to predict Y and A, respectively. (In learning equalized odds or equal opportunity representations, the adversary h : ℝᵐ × {0, 1} → {0, 1} also takes the label Y as input.) Optionally, a decoder k : ℝᵐ × {0, 1} → ℝⁿ attempts to reconstruct the original data from the representation and the sensitive variable.

The adversary h seeks to maximize its objective L_Adv(h(f(X, A)), A). We discuss a novel and theoretically motivated adversarial objective in Sections 4.2 and 5, whose exact terms are modified according to the fairness desideratum. Meanwhile, the encoder, decoder, and classifier jointly seek to minimize classification loss and reconstruction error, and also to minimize the adversary's objective. Let L_C denote a suitable classification loss (e.g., cross entropy, ℓ1), and L_Dec denote a suitable reconstruction loss (e.g., ℓ2). Then we train the generalized model according to the following min-max procedure:

\min_{f,g,k} \; \max_{h} \; E_{X,Y,A}[L(f, g, h, k)],    (1)

with the combined objective expressed as

L(f, g, h, k) = \alpha L_C(g(f(X, A)), Y) + \beta L_{Dec}(k(f(X, A), A), X) + \gamma L_{Adv}(h(f(X, A)), A).    (2)

The hyperparameters α, β, γ respectively specify a desired balance between utility, reconstruction of the inputs, and fairness. Due to the novel focus on fair transfer learning, we call our model Learned Adversarially Fair and Transferable Representations (LAFTR).

4.2. Learning

We realize f, g, h, and k as neural networks and alternate gradient descent and ascent steps to optimize their parameters according to (2). First the encoder-classifier-decoder group (f, g, k) takes a gradient step to minimize L while the adversary h is fixed; then h takes a step to maximize L with fixed (f, g, k). Computing gradients necessitates relaxing the binary functions g and h, the details of which are discussed in Section 5.3.

One of our key contributions is a suitable adversarial objective, which we express here and discuss further in Section 5. For shorthand we denote the adversarial objective L_Adv(h(f(X, A)), A), whose functional form depends on the desired fairness criterion, as L_Adv(h). For demographic parity, we take the average absolute difference on each sensitive group D_0, D_1:

L^{DP}_{Adv}(h) = 1 - \sum_{i \in \{0,1\}} \frac{1}{|D_i|} \sum_{(x,a) \in D_i} |h(f(x, a)) - a|.    (3)

For equalized odds, we take the average absolute difference on each sensitive group-label combination D^0_0, D^0_1, D^1_0, D^1_1, where D^j_i = {(x, y, a) ∈ D | a = i, y = j}:

L^{EO}_{Adv}(h) = 2 - \sum_{(i,j) \in \{0,1\}^2} \frac{1}{|D^j_i|} \sum_{(x,y,a) \in D^j_i} |h(f(x, a)) - a|.    (4)

To achieve equal opportunity, we need only sum the terms corresponding to Y = 0.
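For reference, here is a minimal NumPy sketch (our illustration; the names and the exact normalization follow our reading of Equations 2-4, not the authors' code) that evaluates the group-normalized adversarial objectives from the adversary's soft outputs h(f(x, a)) ∈ [0, 1], and assembles the combined objective from precomputed loss terms. One natural reading of the equal opportunity variant is to restrict the sum in adv_objective_eo to the Y = 0 cells, with the leading constant 2 replaced by 1.

```python
import numpy as np

def adv_objective_dp(h_out, a):
    """Eq. (3): 1 minus the per-group average of |h(f(x,a)) - a|, summed over both groups."""
    return 1.0 - sum(np.abs(h_out[a == i] - i).mean() for i in (0, 1))

def adv_objective_eo(h_out, y, a):
    """Eq. (4): 2 minus the per-cell average of |h(f(x,a)) - a| over the four (a, y) cells."""
    total = 0.0
    for i in (0, 1):      # sensitive group a = i
        for j in (0, 1):  # label y = j
            cell = (a == i) & (y == j)
            total += np.abs(h_out[cell] - i).mean()
    return 2.0 - total

def laftr_objective(clf_loss, rec_loss, adv_obj, alpha=1.0, beta=0.0, gamma=1.0):
    """Eq. (2), assembled from already-computed loss terms: the encoder, classifier,
    and decoder take steps to minimize this while the adversary maximizes its own term."""
    return alpha * clf_loss + beta * rec_loss + gamma * adv_obj
```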
4.3. Motivation

For intuition on this approach and the upcoming theoretical section, we return to the framework from Section 1, with a data owner who sells representations to a (prediction) vendor. Suppose the data owner is concerned about the unfairness in the predictions made by vendors who use their data. Given that vendors are strategic actors with their own goals, the owner may wish to guard against two types of vendors:

The indifferent vendor: this vendor is concerned with utility maximization, and doesn't care about the fairness or unfairness of their predictions.

The adversarial vendor: this vendor will attempt to actively discriminate by the sensitive attribute.

In the adversarial model defined in Section 4.1, the encoder is what the data owner really wants; this yields the representations which will be sold to vendors. When the encoder is learned, the other two parts of the model ensure that the representations respond appropriately to each type of vendor: the classifier ensures utility by simulating an indifferent vendor with a prediction task, and the adversary ensures fairness by simulating an adversarial vendor with discriminatory goals. It is important to the data owner that the model's adversary be as strong as possible: if it is too weak, the owner will underestimate the unfairness enacted by the adversarial vendor.

However, there is another important reason why the model should have a strong adversary, which is key to our theoretical results. Intuitively, the degree of unfairness achieved by the adversarial vendor (who is optimizing for unfairness) will not be exceeded by the indifferent vendor. Beating a strong adversary h during training implies that downstream classifiers naively trained on the learned representation Z must also act fairly. Crucially, this fairness bound depends on the discriminative power of h; this motivates our use of the representation Z as a direct input to h, because it yields a strictly more powerful h and thus a tighter unfairness bound than adversarially training on the predictions and labels alone, as in Zhang et al. (2018).

5. Theoretical Properties

We now draw a connection between our choice of adversarial objective and several common metrics from the fair classification literature. We derive adversarial upper bounds on unfairness that can be used in adversarial training to achieve either demographic parity, equalized odds, or equal opportunity.

We are interested in quantitatively comparing two distributions corresponding to the learned group representations, so consider two distributions D_0 and D_1 over the same sample space Ω_D, as well as a binary test function µ : Ω_D → {0, 1}. µ is called a test since it can distinguish between samples from the two distributions according to the absolute difference in its expected value. We call this quantity the test discrepancy and express it as

d_\mu(D_0, D_1) \triangleq |E_{x \sim D_0}[\mu(x)] - E_{x \sim D_1}[\mu(x)]|.    (5)
The statistical distance (a.k.a. total variation distance) between distributions is defined as the maximum attainable test discrepancy (Cover & Thomas, 2012):

\Delta^*(D_0, D_1) \triangleq \sup_{\mu} d_\mu(D_0, D_1).    (6)

When learning fair representations we are interested in the distribution of Z conditioned on a specific value of group membership A ∈ {0, 1}. As a shorthand we denote the distributions p(Z|A = 0) and p(Z|A = 1) as Z_0 and Z_1, respectively.

5.1. Bounding Demographic Parity

In supervised learning we seek a g that accurately predicts some label Y; in fair supervised learning we also want to quantify g according to the fairness metrics discussed in Section 2. For example, the demographic parity distance is expressed as the absolute expected difference in classifier outcomes between the two groups:

\Delta_{DP}(g) \triangleq d_g(Z_0, Z_1) = |E_{Z_0}[g] - E_{Z_1}[g]|.    (7)

Note that Δ_DP(g) ≤ Δ*(Z_0, Z_1), and also that Δ_DP(g) = 0 if and only if g(Z) ⊥ A, i.e., demographic parity has been achieved.

Now consider an adversary h : Ω_Z → {0, 1} whose objective (negative loss) function is expressed as

L^{DP}_{Adv}(h) \triangleq E_{Z_0}[1 - h] + E_{Z_1}[h] - 1.    (8)

(This is equivalent to the objective expressed by Equation 3 when the expectations are evaluated with finite samples.) Given samples from Ω_Z, the adversary seeks to correctly predict the value of A, and learns by maximizing L^{DP}_{Adv}(h). Given an optimal adversary trained to maximize (8), the adversary's loss will bound Δ_DP(g) from above for any function g learnable from Z. Thus a sufficiently powerful adversary h will expose through the value of its objective the demographic disparity of any classifier g. We will later use this to motivate learning fair representations via an encoder f : X → Z that simultaneously minimizes the task loss and minimizes the adversary's objective.

Theorem. Consider a classifier g : Ω_Z → Ω_Y and adversary h : Ω_Z → Ω_A as binary functions, i.e., Ω_Y = Ω_A = {0, 1}. Then L^{DP}_{Adv}(h*) ≥ Δ_DP(g): the demographic parity distance of g is bounded above by the optimal objective value of h.

Proof. By definition Δ_DP(g) ≥ 0. Suppose without loss of generality (WLOG) that E_{Z_0}[g] ≥ E_{Z_1}[g], i.e., the classifier predicts the positive outcome more often for group A_0 in expectation. Then an immediate corollary is E_{Z_1}[1 − g] ≥ E_{Z_0}[1 − g], and we can drop the absolute value in our expression of the demographic parity distance:

\Delta_{DP}(g) = E_{Z_0}[g] - E_{Z_1}[g] = E_{Z_0}[g] + E_{Z_1}[1 - g] - 1,    (9)

where the second equality is due to E_{Z_1}[g] = 1 − E_{Z_1}[1 − g]. Now consider an adversary that guesses the opposite of g, i.e., h = 1 − g. (If instead E_{Z_0}[g] < E_{Z_1}[g], we simply choose h = g to achieve the same result.) Then we have

L^{DP}_{Adv}(h) = L^{DP}_{Adv}(1 - g) = E_{Z_0}[g] + E_{Z_1}[1 - g] - 1 = \Delta_{DP}(g).    (10)

The optimal adversary h* does at least as well as any arbitrary choice of h, therefore L^{DP}_{Adv}(h*) ≥ L^{DP}_{Adv}(h) = Δ_DP(g).
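The proof is constructive: the adversary that flips (or copies) the classifier attains an objective value exactly equal to Δ_DP(g), so the optimal adversary can only do better. The short NumPy check below (our illustration, not from the paper) mirrors that construction on arbitrary finite samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary binary classifier outputs g(Z) on samples from Z_0 and Z_1.
g_z0 = rng.integers(0, 2, size=1000)   # g evaluated on Z ~ p(Z | A = 0)
g_z1 = rng.integers(0, 2, size=800)    # g evaluated on Z ~ p(Z | A = 1)

dp_distance = abs(g_z0.mean() - g_z1.mean())        # Eq. (7), estimated from samples

# Adversary from the proof: h = 1 - g if E_{Z0}[g] >= E_{Z1}[g], else h = g.
flip = g_z0.mean() >= g_z1.mean()
h_z0, h_z1 = (1 - g_z0, 1 - g_z1) if flip else (g_z0, g_z1)

adv_objective = (1 - h_z0).mean() + h_z1.mean() - 1  # Eq. (8), estimated from samples

# The constructed adversary attains exactly the demographic parity distance,
# so the optimal adversary's objective can only be at least this large.
assert np.isclose(adv_objective, dp_distance)
```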
5.2. Bounding Equalized Odds

We now turn our attention to equalized odds. First we extend our shorthand to denote p(Z|A = a, Y = y) as Z^y_a, the representation of group a conditioned on a specific label y. The equalized odds distance of classifier g : Ω_Z → {0, 1} is

\Delta_{EO}(g) \triangleq |E_{Z^0_0}[g] - E_{Z^0_1}[g]| + |E_{Z^1_0}[1 - g] - E_{Z^1_1}[1 - g]|,    (11)

which comprises the absolute difference in false positive rates plus the absolute difference in false negative rates. Δ_EO(g) = 0 means g satisfies equalized odds. Note that Δ_EO(g) ≤ Δ*(Z^0_0, Z^0_1) + Δ*(Z^1_0, Z^1_1).

We can make a similar claim as above for equalized odds: given an optimal adversary trained on Z with the appropriate objective, if the adversary also receives the label Y, the adversary's loss will upper bound Δ_EO(g) for any function g learnable from Z.

Theorem. Let the classifier g : Ω_Z → Ω_Y and the adversary h : Ω_Z × Ω_Y → Ω_A be binary functions, i.e., Ω_Y = Ω_A = {0, 1}. Then L^{EO}_{Adv}(h*) ≥ Δ_EO(g): the equalized odds distance of g is bounded above by the optimal objective value of h.

Proof. Let the adversary h's objective be

L^{EO}_{Adv}(h) = E_{Z^0_0}[1 - h] + E_{Z^0_1}[h] + E_{Z^1_0}[1 - h] + E_{Z^1_1}[h] - 2.    (12)

By definition Δ_EO(g) ≥ 0. Let |E_{Z^0_0}[g] − E_{Z^0_1}[g]| = α ∈ [0, Δ_EO(g)] and |E_{Z^1_0}[1 − g] − E_{Z^1_1}[1 − g]| = Δ_EO(g) − α. WLOG, suppose E_{Z^0_0}[g] ≥ E_{Z^0_1}[g] and E_{Z^1_0}[1 − g] ≥ E_{Z^1_1}[1 − g]. Thus we can partition (11) into two expressions, which we write as

E_{Z^0_0}[g] + E_{Z^0_1}[1 - g] = 1 + \alpha, \qquad E_{Z^1_0}[1 - g] + E_{Z^1_1}[g] = 1 + (\Delta_{EO}(g) - \alpha),    (13)

which can be derived using the familiar identity E_p[η] = 1 − E_p[1 − η] for binary functions. Now, let us consider the following adversary h:

h(z) = \begin{cases} g(z), & \text{if } y = 1 \\ 1 - g(z), & \text{if } y = 0 \end{cases}    (14)

Then the previous statements become

E_{Z^0_0}[1 - h] + E_{Z^0_1}[h] = 1 + \alpha, \qquad E_{Z^1_0}[1 - h] + E_{Z^1_1}[h] = 1 + (\Delta_{EO}(g) - \alpha).    (15)

Recalling our definition of L^{EO}_{Adv}(h), this means that

L^{EO}_{Adv}(h) = E_{Z^0_0}[1 - h] + E_{Z^0_1}[h] + E_{Z^1_0}[1 - h] + E_{Z^1_1}[h] - 2 = 1 + \alpha + 1 + (\Delta_{EO}(g) - \alpha) - 2 = \Delta_{EO}(g).    (16)

That means that for the optimal adversary h* = arg max_h L^{EO}_{Adv}(h), we have L^{EO}_{Adv}(h*) ≥ L^{EO}_{Adv}(h) = Δ_EO(g).

An adversarial bound for the equal opportunity distance, defined as Δ_EOpp(g) ≜ |E_{Z^0_0}[g] − E_{Z^0_1}[g]|, can be derived similarly.

5.3. Additional points

One interesting note is that in each proof, we provided an example of an adversary which was calculated only from the joint distribution of Y, A, and Ŷ = g(Z) (we did not require direct access to Z), and this adversary achieved a loss exactly equal to the quantity in question (Δ_DP or Δ_EO). Therefore, if we only allow our adversary access to those outputs, our adversarial objective (assuming an optimal adversary) is equivalent to simply adding either Δ_DP or Δ_EO to our classification objective, similar to common regularization approaches (Kamishima et al., 2012; Bechavod & Ligett, 2017; Madras et al., 2017; Zafar et al., 2017). Below we consider a stronger adversary, with direct access to the key intermediate learned representation Z. This allows for a potentially greater upper bound on the degree of unfairness, which in turn forces any classifier trained on Z to act fairly.

In our proofs we have considered the classifier g and adversary h as binary functions. In practice we want to learn these functions by gradient-based optimization, so we instead substitute their continuous relaxations g̃, h̃ : Ω_Z → [0, 1]. By viewing the continuous output as parameterizing a Bernoulli distribution over outcomes, we can follow the same steps as in our earlier proofs to show that in both cases (demographic parity and equalized odds) E[L(h̃*)] ≥ E[Δ(g̃)], where h̃ and g̃ are randomized binary classifiers parameterized by the outputs of h and g.

5.4. Comparison to Edwards & Storkey (2016)

An alternative to optimizing the expectation of the randomized classifier h̃ is to minimize its negative log likelihood (NLL, also known as cross entropy loss), i.e., to maximize

L(\tilde{h}) = E_{Z,A}[A \log \tilde{h}(Z) + (1 - A) \log(1 - \tilde{h}(Z))].    (17)

This is the formulation adopted by Ganin et al. (2016) and Edwards & Storkey (2016), which propose maximizing (17) as a proxy for computing the statistical distance Δ*(Z_0, Z_1) during adversarial training. (These papers discuss the bound on Δ_DP(g) in terms of the H-divergence (Blitzer et al., 2006), which is simply the statistical distance up to a multiplicative constant.) The adversarial loss we adopt here instead of cross entropy is group-normalized ℓ1, defined in Equations 3 and 4.
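To make the contrast concrete, the sketch below (our illustration; it is not the Appendix A example) computes both candidate objectives, the pooled log-likelihood of Equation 17 and the group-normalized ℓ1 of Equation 3, for the same soft adversary outputs on a group-imbalanced sample. The structural difference is that the ℓ1 objective averages within each sensitive group before combining, whereas the cross-entropy objective averages over the pooled data.

```python
import numpy as np

def adv_loglik(h_prob, a):
    """Pooled log-likelihood of Eq. (17); the cross-entropy adversary maximizes this."""
    eps = 1e-12  # avoid log(0)
    return np.mean(a * np.log(h_prob + eps) + (1 - a) * np.log(1 - h_prob + eps))

def adv_group_l1(h_prob, a):
    """Group-normalized l1 objective of Eq. (3); each group is averaged separately."""
    return 1.0 - sum(np.abs(h_prob[a == i] - i).mean() for i in (0, 1))

# A group-imbalanced sample: only ~5% of examples have a = 1.
rng = np.random.default_rng(0)
a = (rng.random(10_000) < 0.05).astype(int)
h_prob = np.clip(rng.normal(0.5 + 0.1 * a, 0.1), 1e-3, 1 - 1e-3)  # arbitrary soft adversary outputs

print("pooled log-likelihood (Eq. 17):", adv_loglik(h_prob, a))
print("group-normalized l1 (Eq. 3):   ", adv_group_l1(h_prob, a))
```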
The main problems with cross entropy loss in this setting arise from the fact that the adversarial objective should be calculating the test discrepancy. However, the cross entropy objective sometimes fails to do so, for example when the dataset is imbalanced. In Appendix A, we discuss a synthetic example where a cross-entropy loss will incorrectly guide an adversary on an unbalanced dataset, but a group-normalized ℓ1 adversary will work correctly. Furthermore, group-normalized ℓ1 corresponds to a more natural relaxation of the fairness metrics in question. It is important that the adversarial objective incentivizes the test discrepancy, as group-normalized ℓ1 does; this encourages the adversary to get an objective value as close to Δ* as possible, which is key for fairness (see Section 4.3). In practice, optimizing ℓ1 loss with gradients can be difficult, so while we suggest it as a suitable theoretically-motivated continuous relaxation for our model (and present experimental results), there may be other suitable options beyond those considered in this work.

6. Experiments

6.1. Fair classification

LAFTR seeks to learn an encoder yielding fair representations, i.e., the encoder's outputs can be used by third parties with the assurance that their naively trained classifiers will be reasonably fair and accurate. Thus we evaluate the quality of the encoder according to the following training procedure, also described in pseudo-code by Algorithm 1.

Algorithm 1: Evaluation scheme for fair classification (Y′ = Y) and transfer learning (Y′ ≠ Y).
Input: data X ∈ Ω_X, sensitive attribute A ∈ Ω_A, labels Y, Y′ ∈ Ω_Y, representation space Ω_Z.
Step 1: Learn an encoder f : Ω_X → Ω_Z using data X, task label Y, and sensitive attribute A.
Step 2: Freeze f.
Step 3: Learn a classifier (without fairness constraints) g : Ω_Z → Ω_Y on top of f, using data f(X′), task label Y′, and sensitive attribute A′.
Step 4: Evaluate the fairness and accuracy of the composed classifier g ∘ f : Ω_X → Ω_Y on held-out test data, for task Y′.

Using labels Y, sensitive attribute A, and data X, we train an encoder using the adversarial method outlined in Section 4, receiving both X and A as inputs. We then freeze the learned encoder; from now on we use it only to output representations Z. Then, using unseen data, we train a classifier on top of the frozen encoder. The classifier learns to predict Y from Z; note that this classifier is not trained to be fair. We can then evaluate the accuracy and fairness of this classifier on a test set to assess the quality of the learned representations.

During Step 1 of Algorithm 1, the learning algorithm is specified either as a baseline (e.g., unfair MLP) or as LAFTR, i.e., stochastic gradient-based optimization of Equation 1 with one of the three adversarial objectives described in Section 4.2. When LAFTR is used in Step 1, all but the encoder f are discarded in Step 2. For all experiments we use ℓ1 loss for the classifier (we observed training instability with other classifier losses). The classifier g in Step 3 is a feed-forward MLP trained with SGD. See Appendix B for details.

We evaluate the performance of our model on fair classification on the UCI Adult dataset (https://archive.ics.uci.edu/ml/datasets/adult). We aimed to predict each person's income category (either greater or less than 50K/year). We took the sensitive attribute to be gender, which was listed as Male or Female.
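A compact rendering of this protocol might look like the following sketch (ours; train_laftr_encoder, train_mlp_classifier, and fairness_gap are hypothetical stand-ins supplied by the caller, not the authors' code).

```python
def evaluate_fair_classification(X_train, Y_train, A_train,
                                 X_clf, Y_clf, A_clf,
                                 X_test, Y_test, A_test,
                                 train_laftr_encoder, train_mlp_classifier,
                                 fairness_gap):
    """Algorithm 1 with Y' = Y: learn a representation, freeze it, then train
    and audit an unconstrained classifier on top of it. Inputs are NumPy arrays;
    the three trailing arguments are hypothetical helper callables."""
    # Step 1: adversarially learn the encoder f on (X, Y, A).
    f = train_laftr_encoder(X_train, Y_train, A_train)

    # Step 2: freeze f; from here on it is only used to produce representations Z.
    Z_clf, Z_test = f(X_clf), f(X_test)

    # Step 3: train a naive classifier g on (Z, Y) with no fairness constraints.
    g = train_mlp_classifier(Z_clf, Y_clf)

    # Step 4: evaluate accuracy and fairness of g composed with f on held-out data.
    Y_hat = g(Z_test)
    accuracy = (Y_hat == Y_test).mean()
    gap = fairness_gap(Y_hat, Y_test, A_test)
    return accuracy, gap
```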
Figure 2 shows classification results on the Adult dataset. Each sub-figure shows the accuracy-fairness trade-off (for varying values of γ; we set α = 1, β = 0 for all classification experiments) evaluated according to one of the group fairness metrics: Δ_DP, Δ_EO, and Δ_EOpp. For each fairness metric, we show the trade-off curves for LAFTR trained under three adversarial objectives: L^{DP}_Adv, L^{EO}_Adv, and L^{EOpp}_Adv. We observe, especially in the most important regime for fairness (small Δ), that the adversarial objective we propose for a particular fairness metric tends to achieve the best trade-off. Furthermore, in Figure 2a, we compare our proposed adversarial objective for demographic parity with the one proposed in Edwards & Storkey (2016), finding a similar result.

Figure 2. Accuracy-fairness tradeoffs for various fairness metrics (Δ_DP, Δ_EO, Δ_EOpp) and LAFTR adversarial objectives (L^{DP}_Adv, L^{EO}_Adv, L^{EOpp}_Adv) on fair classification of the Adult dataset: (a) tradeoff between accuracy and Δ_DP; (b) tradeoff between accuracy and Δ_EO; (c) tradeoff between accuracy and Δ_EOpp. The upper-left corner (high accuracy, low Δ) is preferable. Figure 2a also compares to a cross-entropy adversarial objective (Edwards & Storkey, 2016), denoted DP-CE. Curves are generated by sweeping a range of fairness coefficients γ, taking the median across 7 runs per γ, and computing the Pareto front. In each plot, the bolded line is the one we expect to perform the best. The magenta square is a baseline MLP with no fairness constraints; see Algorithm 1 and Appendix B.

For low values of unfairness, i.e., minimal violations of the respective fairness criteria, the LAFTR model trained to optimize the target criterion obtains the highest test accuracy. While the improvements are somewhat uneven for other regions of fairness-accuracy space (which we attribute to instability of adversarial training), this demonstrates the potential of our proposed objectives. However, the fairness of our model's learned representations is not limited to the task it is trained on. We now turn to experiments which demonstrate the utility of our model in learning fair representations for a variety of tasks.

6.2. Transfer Learning

In this section, we show the promise of our model for fair transfer learning. As far as we know, beyond a brief introduction in Zemel et al. (2013), we provide the first in-depth experimental results on this task, which is pertinent to the common situation where the data owner and vendor are separate entities. We examine the Heritage Health dataset (https://www.kaggle.com/c/hhp), which comprises insurance claims and physician records relating to the health and hospitalization observations of over 60,000 patients. We chose the goal of predicting the Charlson Index, a comorbidity indicator that estimates the risk of patient death in the next several years. We binarize the (nonnegative) Charlson Index as zero or nonzero. We took the sensitive variable to be age, binarized by thresholding at 70 years old. This dataset contains information on sex, age, lab tests, prescriptions, and claim details.
The task is as follows: using data X, sensitive attribute A, and labels Y, learn an encoding function f such that, given unseen X′, the representations produced by f(X′, A′) can be used to learn a fair predictor for new task labels Y′, even if the new predictor is being learned by a vendor who is indifferent or adversarial to fairness. This is an intuitively desirable condition: if the data owner can guarantee that predictors learned from their representations will be fair, then there is no need to impose fairness restrictions on vendors, or to rely on their goodwill.

The original task is to predict the Charlson Index Y fairly with respect to age A. The transfer tasks relate to the various primary condition group (PCG) labels, each of which indicates a patient's insurance claim corresponding to a specific medical condition. The PCG labels {Y′} were held out during LAFTR training but presumably correlate to varying degrees with the original label Y. The prediction task was binary: did a patient file an insurance claim for a given PCG label in this year? For various patients, this was true for zero, one, or many PCG labels. There were 46 different PCG labels in the dataset; we used only the 10 most common, whose positive base rates ranged from 9-60%, as transfer tasks.

Our experimental procedure was as follows. To learn representations that transfer fairly, we used the same model as described above, but set our reconstruction coefficient β = 1. Without this, the adversary will stamp out any information not relevant to the label from the representation, which will hurt transferability. We can optionally set our classification coefficient α to 0, which worked better in practice. Note that although the classifier g is no longer involved when α = 0, the target task labels are still relevant for either equalized odds or equal opportunity transfer fairness.

We split our test set (about 20,000 examples) into transfer-train, -validation, and -test sets. We trained LAFTR (α = 0, β = 1, ℓ2 loss for the decoder) on the full training set, and then kept only the encoder. In the results reported here, we trained using the equalized odds adversarial objective described in Section 4.2; similar results were obtained with the other adversarial objectives. Then, we created a feed-forward model which consisted of our frozen, adversarially-learned encoder followed by an MLP with one hidden layer, with a cross entropy loss and no fairness modifications. Then, for each i ∈ {1, . . . , 10}, we trained this feed-forward model on PCG label i (using the transfer-train and -validation sets), and tested it on the transfer-test set. This procedure is described in Algorithm 1, with Y′ taking 10 values in turn and Y remaining constant (Y′ ≠ Y).
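In code, the transfer evaluation might look like the sketch below (ours; frozen_encoder, train_mlp_head, equalized_odds_gap, and the transfer_tasks structure are hypothetical placeholders, not the authors' implementation): a single frozen LAFTR encoder is reused across all ten PCG prediction tasks, each with a naively trained head.

```python
def evaluate_fair_transfer(frozen_encoder, transfer_tasks,
                           train_mlp_head, equalized_odds_gap):
    """Algorithm 1 with Y' != Y: one frozen encoder, many naively trained task heads.

    transfer_tasks: iterable of dicts (one per PCG label) with NumPy arrays under
    the keys 'X_train', 'Y_train', 'X_test', 'Y_test', 'A_test'."""
    results = []
    for task in transfer_tasks:
        # The encoder is frozen; it only produces representations Z.
        Z_train = frozen_encoder(task['X_train'])
        Z_test = frozen_encoder(task['X_test'])

        # The head is an ordinary MLP with no fairness term in its objective.
        head = train_mlp_head(Z_train, task['Y_train'])

        Y_hat = head(Z_test)
        results.append({
            'error': float((Y_hat != task['Y_test']).mean()),
            'eo_gap': float(equalized_odds_gap(Y_hat, task['Y_test'], task['A_test'])),
        })
    return results
```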
We trained four models to test our method against. The first was an MLP predicting the PCG label directly from the data (Target-Unfair), with no separate representation learning involved and no fairness criteria in the objective; this provides an effective upper bound for classification accuracy. The others all involved learning separate representations on the original task and freezing the encoder as previously described; the internal representations of MLPs have been shown to contain useful information (Hinton & Salakhutdinov, 2006). These (and LAFTR) can be seen as alternative instantiations of the representation learning step (Step 1) in Algorithm 1. In two models, we learned the original label Y using an MLP (one regularized for fairness (Bechavod & Ligett, 2017), one not; Transfer-Fair and Transfer-Unfair, respectively) and trained for the transfer task on its internal representations. As a third baseline, we trained an adversarial model similar to the one proposed in Zhang et al. (2018), where the adversary has access only to the classifier output Ŷ = g(Z) and the ground truth label (Transfer-Y-Adv), to investigate the utility of our adversary having access to the underlying representation rather than just the joint classification statistics (Y, A, Ŷ).

We report our results in Figure 3 and Appendix C. In Figure 3, we show the relative change from the high-accuracy baseline learned directly from the data for both classification error and Δ_EO. LAFTR shows a clear fairness improvement, decreasing Δ_EO on average from the non-transfer baseline, with a mean relative difference of 20%, which is much larger than the other baselines. We also see that LAFTR's loss in accuracy is only marginally worse than that of the other models. A fairly-regularized MLP (Transfer-Fair) does not actually produce fair representations during transfer; on average it yields similar fairness results to transferring representations learned without fairness constraints. Another observation is that the output-only adversarial model (Transfer-Y-Adv) produces similar transfer results to the regularized MLP. This shows the practical gain of using an adversary that can observe the representations.

Figure 3. Fair transfer learning on the Health dataset. Displayed is the average across 10 transfer tasks of the relative difference in error and Δ_EO unfairness (lower is better for both metrics), as compared to a baseline unfair model learned directly from the data; -0.10 means a 10% decrease. Transfer-Unf and Transfer-Fair are MLPs without and with fairness restrictions respectively, Transfer-Y-Adv is an adversarial model with access to the classifier output rather than the underlying representations, and LAFTR is our model trained with the adversarial equalized odds objective.

In Appendix D we provide further analysis. We provide coarser metrics of fairness for our representations as further evidence indicating that LAFTR is doing something more sophisticated than moment-matching.

7. Conclusion

In this paper, we proposed and explored methods of learning adversarially fair representations. We provided theoretical grounding for the concept, and proposed novel adversarial objectives that guarantee performance on commonly used metrics of group fairness. Experimentally, we demonstrated that these methods can learn fair and useful predictors through using an adversary on the intermediate representation. We also demonstrated success on fair transfer learning, by showing that our methods can produce representations which transfer utility to new tasks as well as yielding fairness improvements.

Several open problems remain around the question of learning representations fairly. Various approaches have been proposed, both adversarial and non-adversarial; a careful, in-depth comparison of these approaches would help elucidate their pros and cons. Furthermore, questions remain about the optimal form of the adversarial loss function, both in theory and practice. Answering these questions could help stabilize adversarial training of fair representations.
As for transfer fairness, it would be useful to understand between what tasks and in what situations transfer fairness is possible, and on what tasks it is most likely to succeed.

Acknowledgements

We gratefully acknowledge Cynthia Dwork and Kevin Swersky for their helpful comments. This work was supported by the Canadian Institute for Advanced Research (CIFAR) and the Natural Sciences and Engineering Research Council of Canada (NSERC).

References

Bechavod, Y. and Ligett, K. Learning fair classifiers: A regularization-inspired approach. arXiv preprint arXiv:1707.00044, 2017.

Beutel, A., Chen, J., Zhao, Z., and Chi, E. H. Data decisions and theoretical implications when adversarially learning fair representations. arXiv preprint arXiv:1707.00075, 2017.

Blitzer, J., McDonald, R., and Pereira, F. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 120-128. Association for Computational Linguistics, 2006.

Calmon, F., Wei, D., Vinzamuri, B., Ramamurthy, K. N., and Varshney, K. R. Optimized pre-processing for discrimination prevention. In Advances in Neural Information Processing Systems, pp. 3995-4004, 2017.

Chouldechova, A. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153-163, 2017.

Cover, T. M. and Thomas, J. A. Elements of Information Theory. John Wiley & Sons, 2012.

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214-226. ACM, 2012.

Edwards, H. and Storkey, A. Censoring representations with an adversary. In International Conference on Learning Representations, 2016.

Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096-2030, 2016.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. J. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems, pp. 513-520, 2007.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767-5777, 2017.

Gutmann, M. and Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297-304, 2010.

Hajian, S., Domingo-Ferrer, J., Monreale, A., Pedreschi, D., and Giannotti, F. Discrimination- and privacy-aware patterns. Data Mining and Knowledge Discovery, 29(6):1733-1782, 2015.

Hardt, M., Price, E., Srebro, N., et al. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pp. 3315-3323, 2016.

Hébert-Johnson, U., Kim, M. P., Reingold, O., and Rothblum, G. N. Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning, 2018.

Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.
Kamishima, T., Akaho, S., Asoh, H., and Sakuma, J. Fairness-aware classifier with prejudice remover regularizer. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 35-50. Springer, 2012.

Kearns, M., Neel, S., Roth, A., and Wu, Z. S. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In International Conference on Machine Learning, 2018.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Kleinberg, J., Mullainathan, S., and Raghavan, M. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807, 2016.

Louizos, C., Swersky, K., Li, Y., Welling, M., and Zemel, R. The variational fair autoencoder. In International Conference on Learning Representations, 2016.

Luc, P., Couprie, C., Chintala, S., and Verbeek, J. Semantic segmentation using adversarial networks. In NIPS Workshop on Adversarial Training, 2016.

Madras, D., Pitassi, T., and Zemel, R. Predict responsibly: Increasing fairness by learning to defer. arXiv preprint arXiv:1711.06664, 2017.

McNamara, D., Ong, C. S., and Williamson, R. C. Provably fair representations. arXiv preprint arXiv:1710.04394, 2017.

Odena, A. Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583, 2016.

Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J., and Weinberger, K. Q. On fairness and calibration. In Advances in Neural Information Processing Systems, pp. 5684-5693, 2017.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234-2242, 2016.

Schmidhuber, J. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863-879, 1992.

Zafar, M. B., Valera, I., Gomez Rodriguez, M., and Gummadi, K. P. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pp. 1171-1180. International World Wide Web Conferences Steering Committee, 2017.

Zemel, R., Wu, Y., Swersky, K., Pitassi, T., and Dwork, C. Learning fair representations. In International Conference on Machine Learning, pp. 325-333, 2013.

Zhang, B. H., Lemoine, B., and Mitchell, M. Mitigating unwanted biases with adversarial learning. arXiv preprint arXiv:1801.07593, 2018.