# Fair Representation Learning through Implicit Path Alignment

Changjian Shui 1, Qi Chen 1, Jiaqi Li 2, Boyu Wang 2, Christian Gagné 1 3

Abstract

We consider a fair representation learning perspective, where optimal predictors, on top of the data representation, are ensured to be invariant with respect to different sub-groups. Specifically, we formulate this intuition as a bi-level optimization, where the representation is learned in the outer-loop, and invariant optimal group predictors are updated in the inner-loop. Moreover, the proposed bi-level objective is demonstrated to fulfill the sufficiency rule, which is desirable in various practical scenarios but has not been commonly studied in fair learning. Besides, to avoid the high computational and memory cost of differentiating through the inner-loop of the bi-level objective, we propose an implicit path alignment algorithm, which only relies on the solution of the inner optimization and on implicit differentiation, rather than on the exact optimization path. We further analyze the error gap of the implicit approach and empirically validate the proposed method in both classification and regression settings. Experimental results show a consistently better trade-off between prediction performance and fairness measurement.

1. Introduction

Machine learning has been widely used in real-world decision-making practice such as job candidate screening (Raghavan et al., 2020). However, it has been observed that learning algorithms can treat some population groups unfavorably, for example, predicting the likelihood of crime on the grounds of ethnicity, gender or age (Hardt et al., 2016). To that end, algorithmic fairness, which aims to mitigate the prediction bias w.r.t. a protected feature such as gender, has recently received tremendous attention.

1Université Laval, Québec, Canada 2University of Western Ontario, Ontario, Canada 3Canada CIFAR AI Chair, Mila. Correspondence to: Boyu Wang, Christian Gagné. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

With the advancements of deep learning (LeCun et al., 2015), fair representation learning (Zemel et al., 2013) has recently been highlighted. Specifically, the learned fair representation can easily transfer unbiased prior knowledge to various downstream learning tasks. For example, in language understanding, a fair embedding provides a representation that is both useful and unbiased for different goals such as translation or recommendation (Chang et al., 2019; Ethayarajh, 2020). It has also been investigated in other scenarios such as computer vision (Kehrenberg et al., 2020) and intelligent health (Fletcher et al., 2021).

Typically, fair representation learning is realized by introducing fairness constraints during training. Consequently, a number of fairness notions for various goals have been proposed. Most existing approaches in classification or regression use the independence or separation rule (see Sec. 2 and references therein) (Madras et al., 2018; Song et al., 2019; Chzhen et al., 2020). However, in a variety of applications, independence or separation are not always appropriate, and other fairness notions such as the sufficiency rule (Chouldechova, 2017) are preferred.
Intuitively, given the output of the algorithm $\hat{Y}$, the sufficiency rule ensures that the conditional expectation of the label, $\mathbb{E}[Y \mid \hat{Y}]$, is invariant across the different sub-groups (see Sec. 2 for the formal definition). In practice, neglecting the sufficiency rule can lead to significant bias in intelligent health. For example, health systems rely on commercial algorithms to identify and assist patients with complex health needs. Such algorithms output a score of healthcare needs, where a higher score indicates that the patient is sicker and requires additional care. Notably, Obermeyer et al. (2019) reveal that an industry-wide algorithm, which affects millions of patients, exhibits significant racial bias. Under the same predicted score $\hat{Y} = t$, Black patients are considerably sicker than White patients ($\mathbb{E}_{\text{black}}[Y \mid \hat{Y} = t] > \mathbb{E}_{\text{white}}[Y \mid \hat{Y} = t]$). Obermeyer et al. (2019) further point out that eliminating this disparity would increase the percentage of Black patients receiving additional healthcare from 17.7% to 46.5%.

From the algorithmic perspective, the sufficiency rule is generally incompatible with independence or separation, as demonstrated in Sec. 2, indicating that existing fair algorithms designed for independence or separation do not improve, or may even worsen, the sufficiency rule. Therefore fair representation learning w.r.t. the sufficiency rule is important and promising in both practice and algorithmic development.

Figure 1. Illustration of the explicit and implicit path. (a) An unfair representation leads to different optimization paths and non-invariant optimal predictors on the latent space $\mathcal{Z}$. (b) Fair representation learning ensures an invariant optimal predictor w.r.t. different sub-groups on $\mathcal{Z}$ (encouraging $h_0^* = h_1^*$). Since a gradient-based approach is adopted to optimize $h$, the explicit path alignment aims to learn a representation $\lambda(x)$ that enforces identical optimization paths (i.e., identical blue and red curves) w.r.t. $h$. (c) The proposed implicit path alignment only requires the last iterate and approximates the gradient w.r.t. $\lambda$ from the last update of $h$.

In this paper, we propose a framework to address the sufficiency rule via the following principle: given a fixed representation function, if the optimal predictors learned on the embedding space are invariant to different sub-groups, then the corresponding representation function is fair. The principle is further illustrated in Fig. 1(a): when the representation function $\lambda: \mathcal{X} \to \mathcal{Z}$ is unfair and we adopt gradient descent to learn the predictor $h: \mathcal{Z} \to \mathbb{R}$, the optimal predictors of different sub-groups (blue, red) are not invariant, yielding biased predictions. Intuitively, the optimal predictor for each sub-group approximates the conditional expectation, which encourages sufficiency. We will justify that such a principle ensures the learned representation satisfies the sufficiency rule under proper assumptions, as shown in Proposition 3.1. The aforementioned principle can be naturally formulated as a bi-level optimization problem, where we aim to adjust the representation $\lambda$ (in the outer-loop) to satisfy the invariant optimal predictor $h$ (in the inner-loop).
Based on this, when we adopt a gradient-based approach to solve the bi-level objective, a straightforward solution is to learn the representation $\lambda$ so that the explicit gradient-descent directions for learning the optimal predictor $h$ are identical across groups, shown in Fig. 1(b). Clearly, if the inner gradient-descent steps of each sub-group are identical, their final predictors (as approximations of $h^*$) will surely be invariant. However, the corresponding algorithmic realization is challenging in deep learning: 1) it requires storing the whole sequence of gradient steps, which induces a high memory burden; 2) the embedding function $\lambda$ is optimized via backpropagation through the whole optimization path, which induces a high computational complexity.

To address this, we propose an implicit path alignment, shown in Fig. 1(c). Namely, we only consider the final ($t$-th) update of the predictor, $h^{(t)}$, and then update the representation function $\lambda$ by approximating its gradient at the point $h^{(t)}$ through the implicit function (Bengio, 2000). By using this gradient approximation, we do not need to store the whole sequence of gradient steps or backpropagate through the entire optimization path. Overall, the contributions of this paper are as follows:

Fair representation learning for the sufficiency rule. The proposed fair-representation approach is proved to satisfy the sufficiency rule in both classification and regression. We also find that such a criterion is intrinsically consistent with the recently proposed Invariant Risk Minimization (Arjovsky et al., 2019; Bühlmann, 2020), which aims to preserve invariant correlations between the embedding (or representation) and the true label. Intuitively, if such correlations are robust and not influenced by the specific sub-group, the learned representation is in some sense fair.

Efficient algorithm. We propose an implicit path alignment algorithm to learn the fair representation, which addresses the prohibitive memory and computational cost of the original bi-level objective. We analyze the approximation error gap of the proposed implicit algorithm, which induces a trade-off between correct gradient estimation and fairness.

Improved fairness in classification and regression. We evaluate the implicit algorithm in classification and regression with tabular, computer vision and NLP datasets, where the implicit algorithm effectively improves fairness.

2. Sufficiency rule

We denote $X \in \mathcal{X}$ as the input, $Y \in \mathcal{Y}$ as the ground-truth label, and $\hat{Y} \in \mathcal{Y}$ as the algorithm's output. Following previous work in fair representation learning (Madras et al., 2018), we consider a binary protected feature, i.e., two sub-groups with corresponding distributions $D_0$ and $D_1$. Then according to (Liu et al., 2019), the sufficiency rule is defined as:

$$\mathbb{E}_{D_0}[Y \mid \hat{Y} = t] = \mathbb{E}_{D_1}[Y \mid \hat{Y} = t], \quad \forall t \in \mathcal{Y} \qquad (1)$$

Eq. (1) shows that the conditional expectations of the ground-truth label $Y$ are identical for $D_0$ and $D_1$, given the same prediction output $t$. Based on Eq. (1), we propose the sufficiency gap as the metric to measure fairness. Since we aim to evaluate this in both binary classification ($\mathcal{Y} = \{-1, 1\}$) and regression ($\mathcal{Y} \subseteq \mathbb{R}$), the sufficiency gaps are defined separately.

Sufficiency gap in binary classification. Based on the sufficiency rule, the sufficiency gap in binary classification is naturally defined as:

$$\mathrm{Suf}_C = \frac{1}{2} \sum_{y \in \{-1, 1\}} \big| D_0(Y = y \mid \hat{Y} = y) - D_1(Y = y \mid \hat{Y} = y) \big|$$

$\mathrm{Suf}_C \in [0, 1]$, and a small value encourages the two sub-groups to have identical positive predicted value (PPV) and negative predicted value (NPV).
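To make the metric concrete, the following is a small self-contained sketch (ours, not the paper's released code) of how $\mathrm{Suf}_C$ can be estimated from finite samples; the function and variable names are illustrative.

```python
import numpy as np

def sufficiency_gap_classification(y_true, y_pred, group):
    """Estimate Suf_C = 1/2 * sum_{y in {-1,+1}} |D0(Y=y | Yhat=y) - D1(Y=y | Yhat=y)|.

    y_true, y_pred: arrays with values in {-1, +1}; group: array with values in {0, 1}.
    """
    gap = 0.0
    for y in (-1, +1):
        rates = []
        for g in (0, 1):
            mask = (group == g) & (y_pred == y)       # samples of group g predicted as class y
            if mask.sum() == 0:                        # no prediction of this class in the group
                rates.append(0.0)
                continue
            rates.append(np.mean(y_true[mask] == y))   # PPV (y = +1) or NPV (y = -1) for group g
        gap += abs(rates[0] - rates[1])
    return 0.5 * gap

# Toy usage with synthetic data
rng = np.random.default_rng(0)
y_true = rng.choice([-1, 1], size=1000)
y_pred = np.where(rng.random(1000) < 0.8, y_true, -y_true)  # an 80%-accurate predictor
group = rng.choice([0, 1], size=1000)
print(sufficiency_gap_classification(y_true, y_pred, group))
```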
To better understand this metric, consider the example of a healthcare system that outputs only a binary score: High Risk or Low Risk. Obermeyer et al. (2019) essentially revealed $D_{\text{black}}(Y = \text{High Risk} \mid \hat{Y} = \text{Low Risk}) > D_{\text{white}}(Y = \text{High Risk} \mid \hat{Y} = \text{Low Risk})$: the severity of illness in Black patients is actually underestimated. Thus if $\mathrm{Suf}_C$ is small, this racial discrimination will be remedied.

Sufficiency gap in regression. Based on (Kuleshov et al., 2018), the sufficiency gap in regression is defined as:

$$\mathrm{Suf}_R = \int_{t \in \mathcal{Y}} \big| D_0(Y \le t \mid \hat{Y} \le t) - D_1(Y \le t \mid \hat{Y} \le t) \big| \, dt$$

An illustrative example is depicted in Fig. 2. Specifically, $\mathrm{Suf}_R \in [0, 1]$ is an approximation of $|D_0(Y = y \mid \hat{Y} = y) - D_1(Y = y \mid \hat{Y} = y)|, \forall y \in \mathbb{R}$, since the latter is difficult to estimate when $Y$ is continuous. We also adopt the healthcare example to understand this metric: assuming the health system outputs a real-valued healthcare score $\hat{Y} = t$ (higher indicates sicker), Obermeyer et al. (2019); Sjoding et al. (2020) observed $D_{\text{black}}(Y > t \mid \hat{Y} \le t) > D_{\text{white}}(Y > t \mid \hat{Y} \le t)$. Namely, among all patients with a predicted healthcare score lower than $t$, the proportion who are actually sicker ($Y > t$) is significantly higher for Black patients than for White patients. Therefore a small $\mathrm{Suf}_R$ suggests an improved disparity w.r.t. the sufficiency rule.

Relation to other fair rules. We briefly compare the sufficiency rule with the widely adopted independence and separation rules in binary classification. The detailed justifications and comparisons are shown in the Appendix. The Independence rule is defined as:

$$\mathbb{E}_{D_0}[\hat{Y}] = \mathbb{E}_{D_1}[\hat{Y}]$$

In binary classification, the independence rule is also referred to as demographic parity (DP) (Zemel et al., 2013). We can further justify that if $D_0(Y = y) \neq D_1(Y = y)$ (i.e., different label distributions in the sub-groups), the sufficiency and independence rules cannot both hold.

Figure 2. Illustrative example of the sufficiency gap ($\mathrm{Suf}_R$) in regression: predicted cumulative distribution versus ground-truth cumulative distribution for Group 0 and Group 1.

The Separation rule is defined as:

$$\mathbb{E}_{D_0}[\hat{Y} \mid Y = t] = \mathbb{E}_{D_1}[\hat{Y} \mid Y = t], \quad \forall t \in \mathcal{Y}$$

In binary classification, the separation rule is also denoted as equalized odds (EO) (Hardt et al., 2016). Barocas et al. (2019) further justified that if $D_0(Y = y) \neq D_1(Y = y)$ and the joint distribution of $(Y, \hat{Y})$ has positive probability in $D_0, D_1$, the sufficiency and separation rules cannot both hold.

3. Fair representation learning as a bi-level optimization

We denote the representation function $\lambda$ that maps the input $X$ into the latent variable $Z \in \mathcal{Z}$, and the prediction function $h$ such that $h: \mathcal{Z} \to \mathbb{R}$ for regression and $h: \mathcal{Z} \to \{-1, 1\}$ for binary classification. We denote the prediction loss as $\ell$; the prediction losses on sub-groups $D_0, D_1$ are expressed as:

$$L_0(h, \lambda) = \mathbb{E}_{(x, y) \sim D_0}\, \ell(h \circ \lambda(x), y), \qquad L_1(h, \lambda) = \mathbb{E}_{(x, y) \sim D_1}\, \ell(h \circ \lambda(x), y)$$

According to the intuition, we aim to solve the following bi-level objective:

$$\min_{\lambda}\; L_0(h_0^*, \lambda) + L_1(h_1^*, \lambda) \quad \text{(Outer-Loop)}$$
$$\text{s.t.}\quad h_0^* = h_1^*, \qquad h_0^* \in \operatorname*{argmin}_{h} L_0(h, \lambda), \quad h_1^* \in \operatorname*{argmin}_{h} L_1(h, \lambda) \quad \text{(Inner-Loop)}$$

In the outer-loop, we aim to find a representation function $\lambda$ that minimizes the prediction error, given the optimal predictors $(h_0^*, h_1^*)$ on the embedding space $\mathcal{Z}$. As for the inner-loop, given a fixed representation $\lambda$, $h_0^*, h_1^*$ are the optimal predictors for each sub-group. The constraint $h_0^* = h_1^*$ additionally encourages invariant optimal predictors across $D_0, D_1$.
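To fix ideas, here is a minimal PyTorch-style sketch of this bi-level structure, assuming a square loss and linear per-group heads; it is illustrative only (names such as `fit_head` are ours, not the paper's released code), and it deliberately stops before the gradient w.r.t. $\lambda$, which is the subject of Sec. 4.

```python
# A minimal sketch of the bi-level structure. The inner loop fits one predictor head per
# sub-group on top of a *fixed* representation; the outer objective combines the two group
# losses with the relaxed invariance penalty (kappa/2)*||h0 - h1||^2 introduced in Sec. 4.
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, latent_dim, kappa = 32, 16, 10.0
lam = nn.Sequential(nn.Linear(feat_dim, latent_dim), nn.ReLU())   # representation λ
h0, h1 = nn.Linear(latent_dim, 1), nn.Linear(latent_dim, 1)       # per-group predictors

def fit_head(head, z, y, steps=200, lr=0.05):
    """Inner loop: approximately solve argmin_h L(h, λ) on fixed features z = λ(x)."""
    opt = torch.optim.SGD(head.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        F.mse_loss(head(z).squeeze(-1), y).backward()              # square loss (regression)
        opt.step()
    return head

# Toy mini-batches for the two sub-groups D0 and D1
x0, y0 = torch.randn(64, feat_dim), torch.randn(64)
x1, y1 = torch.randn(64, feat_dim), torch.randn(64)

z0, z1 = lam(x0).detach(), lam(x1).detach()        # λ is held fixed inside the inner loop
h0, h1 = fit_head(h0, z0, y0), fit_head(h1, z1, y1)

# Outer objective, evaluated at the (approximate) inner solutions h0^eps, h1^eps.
flat = lambda m: torch.cat([p.flatten() for p in m.parameters()])  # parameter-space proxy for ||h0 - h1||
outer = (F.mse_loss(h0(lam(x0)).squeeze(-1), y0)
         + F.mse_loss(h1(lam(x1)).squeeze(-1), y1)
         + 0.5 * kappa * (flat(h0) - flat(h1)).pow(2).sum())

# Calling outer.backward() here would only yield the *partial* derivative w.r.t. λ and would
# ignore how h0^eps, h1^eps themselves move when λ changes; Sec. 4 adds that missing term.
```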
Relation to explicit path alignment. In deep learning, we adopt gradient-based approaches to minimize the loss, therefore $h^*$ in the inner-loop is approximated by $h^{(t+1)}$, the $t$-th update of gradient descent:

$$h_0^* \approx h^{(0)} - \sum_t \nabla_h L_0(h^{(t)}, \lambda), \qquad h_1^* \approx h^{(0)} - \sum_t \nabla_h L_1(h^{(t)}, \lambda),$$

where $h^{(0)}$ is the common initialization. Thus the invariant optimal predictor is equivalent to:

$$\sum_t \nabla_h L_0(h^{(t)}, \lambda) = \sum_t \nabla_h L_1(h^{(t)}, \lambda).$$

The aforementioned equation suggests learning a representation $\lambda$ that ensures an identical optimization path w.r.t. $h$ for each sub-group, which recovers the explicit path alignment.

Relation to Sufficiency rule. We further demonstrate the relation between the bi-level objective and the sufficiency rule.

Proposition 3.1. Specify the prediction loss $\ell$ as the logistic loss $\log(1 + \exp(-y h(z)))$ in classification with $\mathcal{Y} = \{-1, 1\}$, and as the square loss $(h(z) - y)^2$ in regression with $\mathcal{Y} \subseteq \mathbb{R}$. Then minimizing the inner-loop loss is equivalent to:

$$\mathbb{E}_{D_0}[Y \mid Z = z] = \mathbb{E}_{D_1}[Y \mid Z = z], \qquad \mathbb{E}_{D_0}[Y \mid \hat{Y} = h^*(z)] = \mathbb{E}_{D_1}[Y \mid \hat{Y} = h^*(z)],$$

where $h^* = h_0^* = h_1^*$ and $z = \lambda(x)$.

Proposition 3.1 demonstrates that the objective of the inner-loop loss fulfills the sufficiency rule in both binary classification and regression.

4. Proposed Algorithms

We propose the implicit alignment in deep learning, where $\lambda$ and $h$ are implemented by neural networks. We also reformulate the original objective through a Lagrangian relaxation:

$$\min_{\lambda}\; L_0(h_0^*, \lambda) + L_1(h_1^*, \lambda) + \frac{\kappa}{2}\|h_0^* - h_1^*\|_2^2 \quad \text{(Outer-Loop)}$$
$$\text{s.t.}\quad h_0^* \in \operatorname*{argmin}_{h} L_0(h, \lambda), \quad h_1^* \in \operatorname*{argmin}_{h} L_1(h, \lambda), \quad \text{(Inner-Loop)}$$

where the introduced coefficient $\kappa > 0$ controls fairness, with a sufficiently large $\kappa$ ensuring $h_0^* \approx h_1^*$. We then derive the approximate gradient w.r.t. $\lambda$, which contains the following key elements.

Solving the inner optimization. Given a fixed representation $\lambda$, we find $h_0^\epsilon, h_1^\epsilon$ such that:

$$\|h_0^* - h_0^\epsilon\| \le \epsilon, \qquad \|h_1^* - h_1^\epsilon\| \le \epsilon,$$

where $\epsilon$ is the optimization tolerance. Besides, $h_1^*$ and $h_1^\epsilon$ are essentially functions of $\lambda$, i.e., $h_1^\epsilon$ depends on the predefined representation function $\lambda$. It is worth mentioning that the optimization tolerance $\epsilon$ is realistic: e.g., for a fixed representation and a one-layer predictor $h_0, h_1$, the inner optimization is convex.

Computing the gradient of $\lambda$. Given the approximate solutions $h_0^\epsilon, h_1^\epsilon$, we can compute the gradient w.r.t. $\lambda$ (referred to as $\mathrm{grad}(\lambda)$) 1 in the outer-loop:

$$\mathrm{grad}(\lambda) = \partial_\lambda L_0(h_0^\epsilon, \lambda) + \partial_\lambda L_1(h_1^\epsilon, \lambda) + (\partial_\lambda h_0^\epsilon)^T \big(\nabla_{h_0} L_0(h_0^\epsilon, \lambda) + \kappa(h_0^\epsilon - h_1^\epsilon)\big) + (\partial_\lambda h_1^\epsilon)^T \big(\nabla_{h_1} L_1(h_1^\epsilon, \lambda) - \kappa(h_0^\epsilon - h_1^\epsilon)\big),$$

where $\nabla_{h_0} L_0(h_0^\epsilon, \lambda)$ is the partial derivative of the loss w.r.t. its first argument (about $h_0$), evaluated at $h_0^\epsilon$, and $\partial_\lambda L_0(h_0^\epsilon, \lambda)$ is the partial derivative w.r.t. its second argument (about $\lambda$).

Implicit function for approximating the gradient. In order to compute $\mathrm{grad}(\lambda)$ in autograd, we need to estimate $\partial_\lambda h_0^\epsilon$ and $\partial_\lambda h_1^\epsilon$. We herein adopt the implicit function (Bengio, 2000) to approximate $\partial_\lambda h_0^\epsilon$, which has also been adopted in hyperparameter optimization (Pedregosa, 2016) and meta-learning (Rajeswaran et al., 2019). Concretely, if the prediction loss is smooth and there exist stationary points achieving the optimum, we have:

$$\nabla_{h_0} L_0(h_0^*(\lambda), \lambda) = 0, \qquad \nabla_{h_1} L_1(h_1^*(\lambda), \lambda) = 0.$$

Then differentiating w.r.t. $\lambda$ induces:

$$d\big(\nabla_{h_0} L_0(h_0^*(\lambda), \lambda)\big)/d\lambda = \nabla^2_{h_0} L_0(h_0^*, \lambda)\, \partial_\lambda h_0^* + \nabla_\lambda \nabla_{h_0} L_0(h_0^*, \lambda) = 0. \; {}^2$$

Thus we have $\partial_\lambda h_0^* = -\nabla^2_{h_0} L_0(h_0^*, \lambda)^{-1}\, \nabla_\lambda \nabla_{h_0} L_0(h_0^*, \lambda)$, where the Hessian matrix $\nabla^2_{h_0} L_0(h_0^*, \lambda)$ is assumed to be invertible.
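As a sanity check of this identity, the following toy snippet (ours, purely illustrative) numerically verifies $\partial_\lambda h^* = -(\nabla^2_h L)^{-1}\nabla_\lambda\nabla_h L$ with autograd on a one-dimensional quadratic, where the closed-form answer is known.

```python
# Tiny numerical check of the implicit-function identity on L(h, λ) = (h - λ)^2,
# where h*(λ) = λ and therefore dh*/dλ = 1. Not the paper's implementation.
import torch

lam = torch.tensor(0.7, requires_grad=True)
h = torch.tensor(0.7, requires_grad=True)          # at the inner optimum h* = λ

L = (h - lam) ** 2
dL_dh, = torch.autograd.grad(L, h, create_graph=True)

d2L_dh2, = torch.autograd.grad(dL_dh, h, create_graph=True)   # ∇²_h L  (= 2)
dLdh_dlam, = torch.autograd.grad(dL_dh, lam)                   # ∇_λ∇_h L (= -2)

implicit_grad = -dLdh_dlam / d2L_dh2                           # -(∇²_h L)⁻¹ ∇_λ∇_h L
print(implicit_grad.item())                                    # ≈ 1.0, matching dh*/dλ
```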
Through the implicit function, we can approximate $\partial_\lambda h_0^\epsilon$ as:

$$\partial_\lambda h_0^\epsilon \approx -\nabla^2_{h_0} L_0(h_0^\epsilon, \lambda)^{-1}\, \nabla_\lambda \nabla_{h_0} L_0(h_0^\epsilon, \lambda)$$

As for $\partial_\lambda h_1^\epsilon$, we have the similar result:

$$\partial_\lambda h_1^\epsilon \approx -\nabla^2_{h_1} L_1(h_1^\epsilon, \lambda)^{-1}\, \nabla_\lambda \nabla_{h_1} L_1(h_1^\epsilon, \lambda).$$

Efficient and numerically stable gradient estimation. Plugging in the approximations, the gradient w.r.t. $\lambda$ is approximated as:

$$\mathrm{grad}(\lambda) \approx \partial_\lambda L_0(h_0^\epsilon, \lambda) - \big(\nabla_\lambda \nabla_{h_0} L_0(h_0^\epsilon, \lambda)\big)^T p_0 + \partial_\lambda L_1(h_1^\epsilon, \lambda) - \big(\nabla_\lambda \nabla_{h_1} L_1(h_1^\epsilon, \lambda)\big)^T p_1,$$

where $p_0, p_1$ denote the inverse-Hessian vector products:

$$p_0 = \nabla^2_{h_0} L_0(h_0^\epsilon, \lambda)^{-1} \big(\nabla_{h_0} L_0(h_0^\epsilon, \lambda) + \kappa(h_0^\epsilon - h_1^\epsilon)\big), \qquad p_1 = \nabla^2_{h_1} L_1(h_1^\epsilon, \lambda)^{-1} \big(\nabla_{h_1} L_1(h_1^\epsilon, \lambda) - \kappa(h_0^\epsilon - h_1^\epsilon)\big).$$

1 We denote the ground-truth gradient as $\mathrm{grad}(\lambda)$ when the optimal predictors $h_0^*, h_1^*$ are used in the computation.
2 $d(\cdot)/d\lambda$ denotes the total derivative.

However, the current form is still computationally expensive due to the computation of the inverse Hessian matrix. Computing $p_0$ and $p_1$ is equivalent to solving the following quadratic programs (QP):

$$\operatorname*{argmin}_{\hat{p}_0}\; \frac{1}{2}\hat{p}_0^T\, \nabla^2_{h_0} L_0(h_0^\epsilon, \lambda)\, \hat{p}_0 - \hat{p}_0^T \big(\nabla_{h_0} L_0(h_0^\epsilon, \lambda) + \kappa(h_0^\epsilon - h_1^\epsilon)\big) \qquad (2)$$

$$\operatorname*{argmin}_{\hat{p}_1}\; \frac{1}{2}\hat{p}_1^T\, \nabla^2_{h_1} L_1(h_1^\epsilon, \lambda)\, \hat{p}_1 - \hat{p}_1^T \big(\nabla_{h_1} L_1(h_1^\epsilon, \lambda) - \kappa(h_0^\epsilon - h_1^\epsilon)\big) \qquad (3)$$

Since this is a typical QP problem, we adopt the conjugate gradient method (Concus et al., 1985; Rajeswaran et al., 2019), which can be run efficiently through autograd by computing Hessian-vector products (a minimal sketch of this step is given at the end of this section). We additionally denote the optimization error of the QP as $\delta$, i.e.:

$$\|p_0 - p_0^\delta\| \le \delta, \qquad \|p_1 - p_1^\delta\| \le \delta,$$

then the gradient w.r.t. the representation $\lambda$ can finally be expressed as:

$$\mathrm{grad}_\delta(\lambda) = \partial_\lambda L_0(h_0^\epsilon, \lambda) - \big(\nabla_\lambda \nabla_{h_0} L_0(h_0^\epsilon, \lambda)\big)^T p_0^\delta + \partial_\lambda L_1(h_1^\epsilon, \lambda) - \big(\nabla_\lambda \nabla_{h_1} L_1(h_1^\epsilon, \lambda)\big)^T p_1^\delta$$

The $\mathrm{grad}_\delta(\lambda)$ can also be efficiently estimated through Hessian-vector products via autograd without explicitly computing the Hessian matrix.

Proposed algorithm. Based on these key elements, the proposed algorithm is shown in Algo. 1.

Algorithm 1: Implicit Path Alignment Algorithm
1: Input: representation function $\lambda$, predictors $h_0, h_1$, datasets from two sub-groups $D_0, D_1$.
2: for each mini-batch of samples from $(D_0, D_1)$ do
3:   Solve the inner-loop optimization with tolerance $\epsilon$, obtaining $h_0^\epsilon, h_1^\epsilon$.
4:   Solve Eq. (2), (3) with tolerance $\delta$, obtaining $p_0^\delta$ and $p_1^\delta$.
5:   Compute $\mathrm{grad}_\delta(\lambda)$ (the gradient of the representation $\lambda$).
6:   Update $\lambda$ through autograd: $\lambda \leftarrow \lambda - \mathrm{grad}_\delta(\lambda)$.
7: end for
8: Return: $\lambda, h_0^\epsilon, h_1^\epsilon$

4.1. The Cost of the Implicit Algorithm: Approximation-Fairness Trade-off

In the proposed bi-level objective, a sufficiently large $\kappa$ encourages the invariant optimal predictor, yielding fair results. However, the implicit approach leads to a biased estimation of the ground-truth gradient. We analyze the error gap of the approximation in Theorem 4.1.

Theorem 4.1 (Approximation Error Gap). Suppose that (1) Smooth predictive loss: the first-order and second-order derivatives of $L$ are Lipschitz continuous; (2) Non-singular Hessian matrices: $\nabla^2_{h_0} L_0(h_0, \lambda)$ and $\nabla^2_{h_1} L_1(h_1, \lambda)$, the Hessian matrices of the inner optimization problems, are invertible; (3) Bounded representation and predictor functions: $\|\lambda\|$ and $\|h\|$ are upper bounded by predefined positive constants. Then the approximation error between the ground-truth and the algorithmically estimated gradient w.r.t. the representation is upper bounded by:

$$\|\mathrm{grad}(\lambda) - \mathrm{grad}_\delta(\lambda)\| = O(\kappa\epsilon + \epsilon + \delta).$$

The proof is deferred to Appendix C. We also discuss the assumptions that guarantee the convergence of Algorithm 1 in Appendix D.
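For concreteness, below is a minimal sketch of the conjugate-gradient inverse-Hessian-vector product used to solve Eq. (2)/(3), assuming PyTorch; the function name `cg_inverse_hvp` and the small `damping` ridge term are ours and are not part of the paper's formulation.

```python
import torch

def cg_inverse_hvp(loss, params, v, iters=10, damping=1e-3):
    """Approximately solve (H + damping*I) p = v by conjugate gradient, where H is the
    Hessian of `loss` w.r.t. the flattened `params`. Only Hessian-vector products
    (double backprop) are used; H is never materialized."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    g = torch.cat([x.reshape(-1) for x in grads])          # flat gradient, kept in the graph

    def hvp(x):
        Hx = torch.autograd.grad(g @ x, params, retain_graph=True)
        return torch.cat([h.reshape(-1) for h in Hx]) + damping * x

    p = torch.zeros_like(v)
    r, d = v.clone(), v.clone()                             # residual and search direction
    rs = r @ r
    for _ in range(iters):
        Hd = hvp(d)
        alpha = rs / (d @ Hd)
        p, r = p + alpha * d, r - alpha * Hd
        rs_new = r @ r
        d = r + (rs_new / rs) * d
        rs = rs_new
    return p
```

In step 4 of Algorithm 1, `v` would be the flattened vector $\nabla_{h_0} L_0(h_0^\epsilon, \lambda) + \kappa(h_0^\epsilon - h_1^\epsilon)$ over the head parameters (and analogously for group 1), and the remaining mixed term $(\nabla_\lambda \nabla_{h_0} L_0)^T p_0^\delta$ in $\mathrm{grad}_\delta(\lambda)$ can then be obtained with one further vector-Jacobian product against the encoder parameters.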
Theorem 4.1 reveals that the gradient approximation error depends on the two optimization tolerances $\epsilon, \delta$ and on the coefficient of the fairness constraint $\kappa$. Specifically, the error gap reveals the inherent trade-off between accurate gradient estimation and fair-representation learning. If we fix the optimization tolerances $\epsilon$ and $\delta$, a smaller $\kappa$ indicates a better approximation of the gradient, but yields weaker fairness constraints. Thus the implicit alignment introduces a trade-off between prediction performance (i.e., correct approximation of the gradient) and fairness measurement.

5. Related Work

Fair Machine Learning. Below we only list the most related work and refer to the survey paper (Mehrabi et al., 2021) for details on algorithmic fairness. In classification, various methods for learning fair representations have been proposed. A common strategy is to introduce statistical constraints as regularization during training, e.g., demographic parity (DP) (Zhang et al., 2018; Madras et al., 2018; Song et al., 2019; Jiang et al., 2020; Kehrenberg et al., 2020), which encourages identical outputs of the representation, or equalized odds (EO) (Song et al., 2019; Gupta et al., 2021), which ensures identical conditional outputs of the representation, given the ground-truth label $Y$. Another direction is to disentangle the data to factorize meaningful representations, such as (Locatello et al., 2019; Kim et al., 2019). Intuitively, the disentangled embedding is independent of the protected feature, thus reflecting a fair representation w.r.t. the independence rule, which can be problematic when the label distributions of the sub-groups vary dramatically (Zhao et al., 2019).

The concept of fairness has also been extended to fields beyond classification. For instance, in the regression problem (Komiyama et al., 2018; Agarwal et al., 2019), the bounded group loss has been proposed as a fairness measure: if the prediction loss in each sub-group is smaller than $\epsilon$, the regression is $\epsilon$-level fair. In fact, the fairness criterion in our paper is not equivalent to $\epsilon$-level fairness. For a fixed representation function $\lambda$, $\epsilon$-level fairness does not guarantee an optimal and invariant predictor for each sub-group, and vice versa.

The sufficiency rule has also been discussed in previous work. Notably, Chouldechova (2017); Liu et al. (2019) proposed the sufficiency gap in classification for measuring fairness w.r.t. the sufficiency rule. (Liu et al., 2019) also discussed the relation between the sufficiency gap and probabilistic calibration (Guo et al., 2017) (referred to as the calibration gap). According to Pleiss et al. (2017), the calibration rule is a stronger condition than the sufficiency rule, while it can simultaneously hurt prediction performance. Throughout this paper, we only consider the sufficiency rule. The triple trade-off between probabilistic calibration, the sufficiency rule and accuracy is left as future work.

Learning Invariance. The analyzed fair-representation criterion shares a quite similar spirit with IRM, or Invariant Risk Minimization (Arjovsky et al., 2019; Bühlmann, 2020; Creager et al., 2021), where an algorithm IRM v1 is proposed to enable out-of-distribution (OOD) generalization.
The key difference between our work and (Arjovsky et al., 2019) lies in the algorithmic aspect: it has been theoretically justified that the originally proposed IRM v1 does not necessarily capture the invariance across environments (Rosenfeld et al., 2020; Shui et al., 2022). By contrast, we aim to solve the bi-level objective in the context of deep learning and propose an efficient and principled practical algorithm with better empirical performance than IRM v1. Besides, based on the results of (Chen et al., 2021), the proposed algorithm does not provably guarantee the OOD generalization property, due to the limited number of sub-groups ($N = 2$) considered in this paper.

6. Experiments

6.1. Experimental setup

In the paper, we adopt the aforementioned sufficiency gaps as fairness metrics, where $\hat{Y}$ is defined as:

$$\hat{Y} = \begin{cases} h_0^\epsilon \circ \lambda(X), & X \sim D_0 \\ h_1^\epsilon \circ \lambda(X), & X \sim D_1 \end{cases}$$

Then in binary classification, we can estimate $\mathrm{Suf}_C = \frac{1}{2}\sum_{y \in \{-1, +1\}} |D_0(Y = y \mid \hat{Y} = y) - D_1(Y = y \mid \hat{Y} = y)|$ from the data. As for regression, the original form $\mathrm{Suf}_R = \int_{t} |D_0(Y \le t \mid \hat{Y} \le t) - D_1(Y \le t \mid \hat{Y} \le t)|\, dt$ (as shown in Fig. 2, Appendix) is difficult to estimate due to the integration term. To address this, we sample multiple values $\{t_1, \ldots, t_m\}$ and compute the average difference as an approximation of the integral, namely $\mathrm{Suf}_R \approx \frac{1}{m}\sum_{i=1}^m |D_0(Y \le t_i \mid \hat{Y} \le t_i) - D_1(Y \le t_i \mid \hat{Y} \le t_i)|$. Concretely, for a given $t_i$, in each group we compute the predicted percentile at that point, e.g. $D_0(\hat{Y}_0 \le t_i)$, and then the corresponding ground-truth conditional cumulative probability $D_0(Y \le t_i \mid \hat{Y} \le t_i)$. Through this approximation, we can estimate $|D_0(Y \le t_i \mid \hat{Y} \le t_i) - D_1(Y \le t_i \mid \hat{Y} \le t_i)|$.

Baselines. We consider baselines that add fairness constraints during the training process. Specifically, we compare our method with (I) Empirical Risk Minimization (ERM), which trains the model without considering fairness; (II) Adversarial Debiasing (referred to as adv debias) (Zhang et al., 2018); (III) Fair Mix-up (Chuang & Mroueh, 2021), a recent and effective data-augmentation approach for fair representation learning. In fact, baselines (II) and (III) are based on the independence rule, or demographic parity (DP), and are included to demonstrate its general incompatibility with the sufficiency rule. Besides, we include two additional baselines that have a similar objective but different algorithmic realizations: (IV) the original IRM regularization (referred to as IRM v1) (Arjovsky et al., 2019), which adds a gradient penalty to encourage invariance among the different groups; (V) one-step explicit alignment, where the inner-loop optimization performs a single gradient-descent step ($T = 1$) for each sub-group, i.e., $h_0^* \approx h_{\text{init}} - \nabla_{h_0} L_0(h_0, \lambda)$, $h_1^* \approx h_{\text{init}} - \nabla_{h_1} L_1(h_1, \lambda)$. In the outer-loop optimization, this baseline adds a gradient-incoherence constraint to encourage an identical (one-step) optimization path: $\min_\lambda \|\nabla_{h_0} L_0(h_0, \lambda) - \nabla_{h_1} L_1(h_1, \lambda)\|_2^2$. All results are reported by averaging five repetitions, and additional experimental details are deferred to the Appendix.

6.2. Toxic Comments

The toxic comments dataset (Jigsaw, 2018) is a binary classification task in NLP to predict whether a comment is toxic or not. The original label is actually not binary, since each comment is rated by multiple annotators and labelling discrepancies commonly occur. We therefore adopt a simple strategy: a comment is labeled toxic if at least one annotator marks it as toxic. In this dataset, a portion of the comments have been labeled with identity attributes, including gender and race.
It has also been revealed that the race identity (e.g., Black) is correlated with the toxicity label, which can lead to predictive discrimination. Thus we adopt race as the protected feature by selecting the two sub-groups Black and Asian. For the sake of computational simplicity, we first apply pretrained BERT (Devlin et al., 2018) to extract the word embedding as a 748-dimensional vector. We then adopt the representation function $\lambda$ as two fully-connected layers with hidden dimension 200 and ReLU activation, and the classifier $h$ as a linear predictor. We report the test-set sub-group average accuracy and sufficiency gap ($\mathrm{Suf}_C$) in Tab. 1 and Fig. 3(a).

Table 1. Fair classification: accuracy and Suf_C on Toxic comments (left) and CelebA (right).

| Toxic comments | Accuracy (↑) | Suf_C (↓) |
|---|---|---|
| ERM (I) | 0.768 ± 0.004 | 0.173 ± 0.008 |
| Adv debias (II) | 0.760 ± 0.008 | 0.291 ± 0.006 |
| Mixup (III) | 0.758 ± 0.003 | 0.343 ± 0.022 |
| IRM v1 (IV) | 0.753 ± 0.004 | 0.057 ± 0.015 |
| One step (V) | 0.755 ± 0.007 | 0.048 ± 0.008 |
| Implicit | 0.760 ± 0.007 | 0.051 ± 0.012 |

| CelebA | Accuracy (↑) | Suf_C (↓) |
|---|---|---|
| ERM (I) | 0.780 ± 0.015 | 0.210 ± 0.022 |
| Adv debias (II) | 0.785 ± 0.022 | 0.165 ± 0.028 |
| Mixup (III) | 0.792 ± 0.011 | 0.160 ± 0.010 |
| IRM v1 (IV) | 0.795 ± 0.012 | 0.086 ± 0.015 |
| One step (V) | 0.797 ± 0.006 | 0.086 ± 0.012 |
| Implicit | 0.794 ± 0.027 | 0.074 ± 0.020 |

Figure 3. Fair classification. (a, b) The accuracy-fairness trade-off curves on the Toxic (a) and CelebA (b) datasets; the implicit approach demonstrates a consistently better trade-off. (c) Running-time comparison of explicit and implicit alignment on the CelebA dataset (relative time complexity vs. inner optimization step); solver = 2 indicates that the conjugate-gradient algorithm is executed for 2 iterations. The results show that the implicit approach avoids the long backpropagation through the entire inner-optimization path, whereas the time complexity of the explicit approach increases linearly with the number of inner-optimization steps.

From the results, the demographic parity (DP) based fairness constraints are incompatible with the sufficiency rule. Specifically, baselines (II, III) even increase $\mathrm{Suf}_C$ above the ERM value. For the baselines that track the sufficiency rule (IV, V), the sufficiency gap $\mathrm{Suf}_C$ is improved with similar accuracy, shown in Tab. 1. We also vary the regularization coefficient in (IV, V) and $\kappa$ in the implicit approach, and observe that the implicit approach demonstrates a consistently better accuracy-fairness trade-off, shown in Fig. 3(a).

6.3. CelebA Dataset

The CelebA dataset (Liu et al., 2015) contains around 200K images of celebrity faces, where each image is associated with 40 human-annotated binary attributes including gender, hair color, young, etc. In this paper, we designate gender as the protected feature and attractive as the binary classification task. We randomly select around 82K and 18K images as the training and validation sets. We adopt the representation function $\lambda$ as a pre-trained ResNet-18 (He et al., 2016) and the classifier $h$ as two fully-connected layers. We report the test-set sub-group average accuracy and sufficiency gap ($\mathrm{Suf}_C$) in Tab. 1 and Fig. 3(b). The results on CelebA show similar behavior to the Toxic comments.
Specifically, the DP-based fair approaches (II, III) did not effectively improve $\mathrm{Suf}_C$, shown in Tab. 1. In contrast, the sufficiency gap can be significantly improved by baselines (IV, V) and the implicit approach without a large loss of accuracy. Fig. 3(b) visualizes the accuracy-fairness trade-off curve, where the latter three approaches show quite similar behavior.

Ablation: Computational benefits of implicit alignment. To show the efficiency of the implicit approach in deep neural networks, we empirically evaluated the computing time of the $T$-step explicit alignment and the implicit approach. The experimental results (shown in Fig. 3(c)) verify the computational efficiency of the implicit alignment. Notably, a large inner-optimization step count does not considerably increase the overall computation time of the implicit approach under different numbers of conjugate-gradient solver iterations. In contrast, the computational time of the explicit alignment scales linearly with the number of inner-optimization steps, which is consistent with our analysis.

Table 2. Fair regression: MSE and Suf_R on the Law dataset (left) and the NLSY dataset (right).

| Law | MSE (↓) | Suf_R (↓) |
|---|---|---|
| ERM (I) | 0.190 ± 0.005 | 0.160 ± 0.007 |
| Adv debias (II) | 0.223 ± 0.008 | 0.188 ± 0.012 |
| Mixup (III) | 0.216 ± 0.012 | 0.172 ± 0.007 |
| IRM v1 (IV) | 0.208 ± 0.006 | 0.096 ± 0.006 |
| One step (V) | 0.204 ± 0.007 | 0.125 ± 0.010 |
| Implicit | 0.198 ± 0.005 | 0.091 ± 0.011 |

| NLSY | MSE (↓) | Suf_R (↓) |
|---|---|---|
| ERM (I) | 1.939 ± 0.021 | 0.246 ± 0.019 |
| Adv debias (II) | 1.982 ± 0.016 | 0.252 ± 0.020 |
| Mixup (III) | 1.979 ± 0.025 | 0.246 ± 0.023 |
| IRM v1 (IV) | 1.927 ± 0.031 | 0.077 ± 0.009 |
| One step (V) | 1.904 ± 0.027 | 0.090 ± 0.019 |
| Implicit | 1.906 ± 0.019 | 0.051 ± 0.005 |

Figure 4. Illustration of the sufficiency gap ($\mathrm{Suf}_R$) on the Law dataset (regression); panels show predicted vs. ground-truth cumulative distributions for Group 0 and Group 1 (including (b) Fair Mix-up and (c) Implicit). ERM and Fair Mix-up suffer a high $\mathrm{Suf}_R$, while the proposed implicit alignment can significantly mitigate the sufficiency gap.

6.4. Law Dataset

The Law dataset is a regression task to predict a student's GPA (a real value in $[0, 4]$), with data drawn from the Law School Admission Council's (LSAC) National Longitudinal Bar Passage Study (Wightman, 1998), comprising 20K examples. In this regression task, we adopt the square loss and race as the protected feature (white versus non-white). We adopt $\lambda$ as one fully-connected layer with hidden dimension 100 and ReLU activation, and the predictor $h$ as a linear predictor. We report the test-set sub-group average MSE (mean squared error) and sufficiency gap ($\mathrm{Suf}_R$) in Tab. 2 and Fig. 5.

Compared to the classification task, the results show similar behavior in regression. Specifically, the DP-based fair approaches (II, III) still increase $\mathrm{Suf}_R$ in regression. In contrast, the gap is significantly improved by our proposed approach and baselines (IV, V). Fig. 4 visualizes the sufficiency gap of the different approaches, where the implicit approach significantly mitigates the sufficiency gap.

Figure 5. Law dataset (regression): MSE-fairness trade-off curve.
In addition, Fig. 5 describes the MSE-sufficiency-gap curve, which further justifies the benefit of the implicit approach with a better trade-off between prediction performance and fairness.

6.5. NLSY Dataset

The National Longitudinal Survey of Youth (NLSY, 2021) dataset is a regression task with around 7K examples, involving survey results from the U.S. Bureau of Labor Statistics. It is intended to gather information on the labor-market activities and other life events of several groups, in order to predict the income $y$ of each person. We treat gender as the protected feature. We also normalize the output $y$ by dividing it by 10,000, so the final output $y$ ranges roughly within $[0, 8]$. The prediction loss is again the square loss. We adopt the representation $\lambda$ as two fully-connected layers with hidden dimension 200 and ReLU activation, and the predictor $h$ as a linear predictor. We report the test-set sub-group average MSE (mean squared error) and sufficiency gap ($\mathrm{Suf}_R$) in Tab. 2 and Fig. 6.

Tab. 2 shows trends similar to the other datasets. Baselines (IV, V) and the implicit approach effectively control the sufficiency gap, while the DP-based approaches generally fail to improve the gap. Fig. 6 reveals a slightly better approximation-fairness trade-off for the implicit approach. Finally, Fig. 8 (in the Appendix) visualizes the sufficiency gap of the different algorithms. The gap is significantly improved while the calibration gap still exists, which is consistent with (Liu et al., 2019). Therefore it would be quite interesting and promising to analyze the triple trade-off between the sufficiency gap, probabilistic calibration and prediction performance in regression in the future.

Figure 6. NLSY (regression): MSE-fairness trade-off curve.

7. Conclusion

We considered fair representation learning from a novel perspective, encouraging invariant optimal predictors on top of the data representation. We formulated this problem as a bi-level optimization and proposed an implicit alignment algorithm. We further demonstrated that the bi-level objective fulfills the sufficiency rule. We then analyzed the error gap of the implicit algorithm, which reveals the trade-off between biased gradient approximation and fairness constraints. The empirical results in both classification and regression settings suggest consistently improved fairness measurement. Finally, we think future work can include developing computationally efficient explicit algorithms that avoid the biased gradient computation.

Limitations. We considered a novel fair representation learning perspective to encourage the sufficiency rule. At the same time, this work has several limitations. In the proposed algorithm, we need a two-step optimization with tolerances $\epsilon$ and $\delta$. As for controlling $\epsilon$ (the tolerance w.r.t. the predictor $h$), since $h$ is a shallow network with one or two layers, optimizing over $h$ is relatively easy. As for $\delta$, since the representation $\lambda$ can be highly non-convex and high-dimensional, controlling $\delta$ is quite difficult in theory. In practice, we generally control the number of conjugate-gradient steps, while the convergence behavior in highly non-convex settings remains unclear. The current paper mainly focuses on a binary sensitive attribute with two sub-groups.
Although it is feasible to extend the multi-attribute settings by consider the pair-wise path alignment, but it would be promising to consider an efficient algorithm for the multi-attribute. The performance and fair trade-off is induced by the efficient gradient estimation in the bi-level objective. Thus it would be promising to develop an efficient explicit path approach for avoiding such a trade-off. Acknowledgments The authors appreciate the constructive feedback and suggestions from anonymous Reviewers and Meta-Reviewers. The authors also would like to thank Gezheng Xu and Jun Xiao for the discussion and proofreading the manuscript. C. Shui and C. Gagn e acknowledge funding from NSERCCanada and CIFAR while B. Wang and J. Li are supported by NSERC-Canada. Agarwal, A., Dud ık, M., and Wu, Z. S. Fair regression: Quantitative definitions and reduction-based algorithms. In International Conference on Machine Learning, pp. 120 129. PMLR, 2019. Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez Paz, D. Invariant risk minimization. ar Xiv preprint ar Xiv:1907.02893, 2019. Fair Representation Learning through Implicit Path Alignment Barocas, S., Hardt, M., and Narayanan, A. Fairness and Machine Learning. fairmlbook.org, 2019. http:// www.fairmlbook.org. Bengio, Y. Gradient-based optimization of hyperparameters. Neural computation, 12(8):1889 1900, 2000. Boyd, S., Boyd, S. P., and Vandenberghe, L. Convex optimization. Cambridge university press, 2004. B uhlmann, P. Invariance, causality and robustness. Statistical Science, 35(3):404 426, 2020. Chang, K.-W., Prabhakaran, V., and Ordonez, V. Bias and fairness in natural language processing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLPIJCNLP): Tutorial Abstracts, Hong Kong, China, November 2019. Association for Computational Linguistics. URL https://aclanthology.org/D19-2004. Chen, Y., Rosenfeld, E., Sellke, M., Ma, T., and Risteski, A. Iterative feature matching: Toward provable domain generalization with logarithmic environments. ar Xiv preprint ar Xiv:2106.09913, 2021. Chouldechova, A. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data, 5(2):153 163, 2017. Chuang, C.-Y. and Mroueh, Y. Fair mixup: Fairness via interpolation. ar Xiv preprint ar Xiv:2103.06503, 2021. Chzhen, E., Denis, C., Hebiri, M., Oneto, L., and Pontil, M. Fair regression with wasserstein barycenters. ar Xiv preprint ar Xiv:2006.07286, 2020. Concus, P., Golub, G., and Meurant, G. Block preconditioning for the conjugate gradient method. Siam Journal on Scientific and Statistical Computing, 6, 01 1985. doi: 10.1137/0906018. Creager, E., Jacobsen, J.-H., and Zemel, R. Environment inference for invariant learning. In International Conference on Machine Learning, pp. 2189 2200. PMLR, 2021. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. ar Xiv preprint ar Xiv:1810.04805, 2018. Ethayarajh, K. Is your classifier actually biased? measuring fairness under uncertainty with bernstein bounds. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2914 2919, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.262. URL https: //aclanthology.org/2020.acl-main.262. Fletcher, R. R., Nakeshimana, A., and Olubeko, O. 
Addressing fairness, bias, and appropriate use of artificial intelligence and machine learning in global health. Frontiers in Artificial Intelligence, 3:116, 2021. ISSN 2624-8212. doi: 10.3389/frai.2020. 561802. URL https://www.frontiersin.org/ article/10.3389/frai.2020.561802. Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321 1330. PMLR, 2017. Gupta, U., Ferber, A., Dilkina, B., and Ver Steeg, G. Controllable guarantees for fair outcomes via contrastive information estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 7610 7619, 2021. Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. Advances in neural information processing systems, 29:3315 3323, 2016. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Jiang, R., Pacchiano, A., Stepleton, T., Jiang, H., and Chiappa, S. Wasserstein fair classification. In Uncertainty in Artificial Intelligence, pp. 862 872. PMLR, 2020. Jigsaw. Toxic comment classification challenge, 2018. URL https://www.kaggle.com/c/ jigsaw-toxic-comment-classificationchallenge/overview/description. Kehrenberg, T., Bartlett, M., Thomas, O., and Quadrianto, N. Null-sampling for interpretable and fair representations. In European Conference on Computer Vision, pp. 565 580. Springer, 2020. Kim, B., Kim, H., Kim, K., Kim, S., and Kim, J. Learning not to learn: Training deep neural networks with biased data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9012 9020, 2019. Komiyama, J., Takeda, A., Honda, J., and Shimao, H. Nonconvex optimization for regression with fairness constraints. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2737 2746. PMLR, 10 15 Jul 2018. URL https://proceedings.mlr.press/v80/ komiyama18a.html. Fair Representation Learning through Implicit Path Alignment Kuleshov, V., Fenner, N., and Ermon, S. Accurate uncertainties for deep learning using calibrated regression. In International Conference on Machine Learning, pp. 2796 2804. PMLR, 2018. Le Cun, Y., Bengio, Y., and Hinton, G. Deep learning. nature, 521(7553):436 444, 2015. Liu, L. T., Simchowitz, M., and Hardt, M. The implicit fairness criterion of unconstrained learning. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 4051 4060. PMLR, 09 15 Jun 2019. URL https://proceedings.mlr.press/v97/ liu19f.html. Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015. Locatello, F., Abbati, G., Rainforth, T., Bauer, S., Sch olkopf, B., and Bachem, O. On the fairness of disentangled representations. ar Xiv preprint ar Xiv:1905.13662, 2019. Madras, D., Creager, E., Pitassi, T., and Zemel, R. Learning adversarially fair and transferable representations. In International Conference on Machine Learning, pp. 3384 3393. PMLR, 2018. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., and Galstyan, A. A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR), 54(6):1 35, 2021. NLSY. 
National longitudinal survey of youth, 2021. URL https://www.bls.gov/nls/. Obermeyer, Z., Powers, B., Vogeli, C., and Mullainathan, S. Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464):447 453, 2019. doi: 10.1126/science. aax2342. URL https://www.science.org/ doi/abs/10.1126/science.aax2342. Online. Problem with proof of Conditional expectation as best predictor. https://stats.stackexchange. com/questions/71863/problem-withproof-of-conditional-expectation-asbest-predictor, 2013. [Online; accessed May2022]. Pedregosa, F. Hyperparameter optimization with approximate gradient. In International conference on machine learning, pp. 737 746. PMLR, 2016. Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J., and Weinberger, K. Q. On fairness and calibration. ar Xiv preprint ar Xiv:1709.02012, 2017. Raghavan, M., Barocas, S., Kleinberg, J., and Levy, K. Mitigating bias in algorithmic hiring: Evaluating claims and practices. In Proceedings of the 2020 conference on fairness, accountability, and transparency, pp. 469 481, 2020. Rajeswaran, A., Finn, C., Kakade, S., and Levine, S. Metalearning with implicit gradients. In Advances in neural information processing systems, 2019. Rosenfeld, E., Ravikumar, P., and Risteski, A. The risks of invariant risk minimization. ar Xiv preprint ar Xiv:2010.05761, 2020. Shui, C., Wang, B., and Gagn e, C. On the benefits of representation regularization in invariance based domain generalization. Machine Learning, pp. 1 21, 2022. Sjoding, M. W., Dickson, R. P., Iwashyna, T. J., Gay, S. E., and Valley, T. S. Racial bias in pulse oximetry measurement. New England Journal of Medicine, 383(25): 2477 2478, 2020. Song, J., Kalluri, P., Grover, A., Zhao, S., and Ermon, S. Learning controllable fair representations. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2164 2173. PMLR, 2019. Wightman, L. F. Lsac national longitudinal bar passage study, 1998. Zemel, R., Wu, Y., Swersky, K., Pitassi, T., and Dwork, C. Learning fair representations. In International conference on machine learning, pp. 325 333. PMLR, 2013. Zhang, B. H., Lemoine, B., and Mitchell, M. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 335 340, 2018. Zhang, W. and Ntoutsi, E. Faht: an adaptive fairness-aware decision tree classifier. ar Xiv preprint ar Xiv:1907.07237, 2019. Zhao, H., Coston, A., Adel, T., and Gordon, G. J. Conditional learning of fair representations. ar Xiv preprint ar Xiv:1910.07162, 2019. Fair Representation Learning through Implicit Path Alignment A. Sufficiency rule: comparison with other fair criteria Sufficiency v.s. Independence We will demonstrate: If D0(Y = y) = D1(Y = y) (i.e, different label distribution in the sub-groups), the Sufficiency and Independence rule cannot both hold. Proof. Since we consider the binary-classification with Y = { 1, 1}, the expectation and conditional can be expressed as the probability of predicted output. D0( ˆY = 1) = 1 2(1 + ED0[ ˆY ]), D0(Y = 1| ˆY = t) = 1 2(1 + ED0[Y | ˆY = t]) Then sufficiency and independence are equivalent to: D0( ˆY = t) = D1( ˆY = t) and D0(Y = y| ˆY = t) = D1(Y = y| ˆY = t) both hold for any t, y. Then the joint distribution of ˆY , Y should be identical: D0( ˆY = t, Y = y) = D1( ˆY = t, Y = y), t, y Then the marginal distribution D0(Y = y) = D1(Y = y) must holds. 
If D0(Y = y) = D1(Y = y), the joint distribution is not equal: D0( ˆY = t, Y = y) = D1( ˆY = t, Y = y), t, y Since D0( ˆY = t, Y = y) = D0(Y = y| ˆY = t)D0( ˆY = t), thus either D0( ˆY = t) = D1( ˆY = t) or D0(Y = y| ˆY = t) = D1(Y = y| ˆY = t) must hold for at least one term. I.e, the sufficiency and Independence could not both hold. Sufficiency v.s. Separation We will demonstrate: If D0(Y = y) = D1(Y = y) and joint distribution of (Y, ˆY ) has positive probability in D0, D1, the Sufficiency and Separation rule cannot both hold. Proof. Based on the previous results, if D0(Y = y) = D1(Y = y), then the joint distribution of (Y, ˆY ) are not identical: D0( ˆY = t, Y = y) = D1( ˆY = t, Y = y), t, y Then either D0( ˆY = t) = D1( ˆY = t) or D0(Y = y| ˆY = t) = D1(Y = y| ˆY = t) must hold for at least one term (conclusion of previous result). If the sufficiency and separation both hold when D0(Y = y) = D1(Y = y), it must be the following case: D0( ˆY = t) = D1( ˆY = t), D0(Y = y| ˆY = t) = D1(Y = y| ˆY = t) D0(Y = y) = D1(Y = y), D0( ˆY = t|Y = y) = D1( ˆY = t|Y = y) However, we will prove it is impossible in the classification. Based on Bayes rule: D0(Y = y| ˆY = t) = D0( ˆY = t|Y = y)D0(Y = y) D0( ˆY = t) = D1( ˆY = t|Y = y)D1(Y = y) D1( ˆY = t) = D1(Y = y| ˆY = t) Thus we should have D0(Y = y) D0( ˆY = t) = D1(Y = y) D1( ˆY = t) , t, y We consider the binary classification by denoting D0(Y = 1) = p, D1(y = 1) = q, D0( ˆY = 1) = ˆp, D1( ˆY = 1) = ˆq, then we have: p ˆq , p 1 ˆp = q 1 ˆq , 1 p There exists the unique non-zero solution of p = q = ˆp = ˆq = 0.5, which clearly contradicts our assumptions. Fair Representation Learning through Implicit Path Alignment Table 3. Common Fair criteria in classification General Definition Binary classification Definition Relation Independence D0( ˆY ) = D1( ˆY ) Demographic parity D0( ˆY = 1) = D1( ˆY = 1) Equivalent Separation D0( ˆY |Y = y) = D1( ˆY |Y = y), y Equalized odds D0( ˆY = 1|Y = y) = D1( ˆY = 1|Y = y), y Equivalent Separation Equal opportunity D0( ˆY = 1|Y = 1) = D1( ˆY = 1|Y = 1) Relaxation Sufficiency D0(Y | ˆY = y) = D1(Y | ˆY = y), y Conditional use accuracy equality D0(Y = y| ˆY = y) = D1(Y = y| ˆY = y), y Equivalent Sufficiency Predictive parity D0(Y = 1| ˆY = 1) = D1(Y = 1| ˆY = 1) Relaxation A.1. Comparison Tables For the sake of completeness, we list common fair criteria for the comparison. B. Proof Proposition 3.1 We consider the regression and classification separately. Regression According to the definition, given a fixed and deterministic representation λ, we have L0(h, λ) = ED0(h(z) y)2 Since it is a functional optimization w.r.t. the function h, through using the calculus of variations (Online, 2013), δh(z) = 2 Z [h(z) y]D0(z, y)dy = 0 Solving for h(z), and using the sum and product rules of probability, we obtain R y D0(z, y)dy D0(z) = Z y D0(y|Z = z)dy = ED0[Y |Z = z] Then we have h 0(z) = ED0[Y |Z = z]. As for D1, we apply the same strategy with h 1(z) = ED1[Y |Z = z]. Based on the invariant optimal predictor, we have ED0[Y |Z = z] = ED1[Y |Z = z] with z = λ(x). Classification According to the definition, we have: L0(h, λ) = ED0 log(1 + exp( yh(z))) Since the optimal predictor on the logistic loss is the log-conditional density ratio: h 0(z) = log D0(Y =1|Z=z) D0(Y = 1|Z=z) . 
Observe that in the binary classification with Y = { 1, 1}, we have D0(Y = 1|Z = z) = 1 2(1 + ED0[Y |Z = z]) and D0(Y = 1|Z = z) = 1 2(1 ED0[Y |Z = z]), then we have: h 0(z) = log 1 + ED0[Y |Z = z] 1 ED0[Y |Z = z] As for D1, we adopt the same strategy and we have log 1+ED0[Y |Z=z] 1 ED0[Y |Z=z] = log 1+ED1[Y |Z=z] 1 ED1[Y |Z=z] , then we have ED0[Y |Z = z] = ED1[Y |Z = z]. As for the predictive parity, since we have ED0[Y |Z = z] = ED1[Y |Z = z] and h = h 1 = h 2, then we have ED0[Y |h (z)] = ED1[Y |h (z)]. C. Approximation Error Theorem C.1 (Approximation Error Gap). Suppose that (1) Smooth Predictive Loss. The first-order derivatives and second-order derivatives of L are Lipschitz continuous; (2) Non-singular Hessian matrix. We assume Fair Representation Learning through Implicit Path Alignment h0,h0L0(h0, λ), h1,h1L1(h1, λ), the Hessian matrix of the inner optimization problem, are invertible. (3) Bounded representation and predictor function. We assume the λ and h are bounded, i.e., λ , h are upper bounded by the predefined positive constants. Then the approximation error between the ground truth and algorithmic estimated gradient w.r.t. the representation is be upper bounded by: grad(λ) grad δ(λ) = O(κϵ + ϵ + δ). Proof. We denote grad(λ) as the ground truth gradient w.r.t. λ in outer-loop loss (given the optimal predictor h 0, h 1). Then we aim to bound grad(λ) grad δ(λ) We first introduce the following terms for facilitating the proof: Aϵ 0 = h0 λL0(hϵ 0, λ), Aϵ 1 = λ h1L1(hϵ 1, λ), A 0 = λ h0L0(h 0, λ), A 1 = λ h1L1(h 1, λ), Bϵ 0 = λL0(hϵ 0, λ), Bϵ 1 = λL1(hϵ 1, λ), B 0 = λL0(h 0, λ), B 1 = λL1(h 1, λ), p 0 = 2 h0L0(h 0, λ) 1 ( h0L0(h 0, λ) + κ(h 0 h 1)) , p 1 = 2 h1L1(h 1, λ) 1 ( h1L1(h 1, λ) κ(h 0 h 1)) . Then the approximation error gap can be expressed as: grad(λ) grad δ(λ) = (B 0 A 0p 0 + B 1 A 1p 1) Bϵ 0 Aϵ 0pδ 0 + Bϵ 1 Aϵ 1pδ 1 i=0 B i Bϵ i + i=0 A i p i Aδ i pδ i Due to the symmetric of D0 and D1, we only focus on the term on i = 0, the the upper bound in i = 1 can be derived analogously. As for bounding B 0 Bϵ 0 , since we assume first order derivative of the loss is Lipschitz functions (with constant L1), then we have : B 0 Bϵ 0 L1 h 0 hϵ 0 ϵL1 Then the second term can be upper bounded by three terms: A 0p 0 Aδ 0pδ 0 A 0p 0 A 0p0 | {z } (1) + A 0p0 Aϵ 0p0 | {z } (2) + Aϵ 0p0 Aϵ 0pδ 0 | {z } (3) Before estimating the upper bound, we first demonstrate Aϵ 0 and A 0 are also bounded. Since we assume λ and h are bounded (assuming the bounded constant as η and ϕ), the second order derivative are Lipschitz (with constant L2). Then we consider another fixed point (λ , h 0(λ )) with bounded second order derivative: A0 = 2 h0,λL0(h 0(λ ), λ ) and A0 A. We have: A 0 A0 2 L2 [h 0(λ), λ] [h 0(λ ), λ ] 2 L2 p Thus we have A 0 A + L2 p η2 + ϕ2 = A sup. As for the second derivative at point hϵ 0, it can be upper bounded analogously with a similar constant Aϵ sup. The upper bound of term (1) We have: A 0p 0 A 0p0 A 0 p 0 p0 We have proved A 0 is upper bounded by A sup. We additionally introduce the following auxiliary terms: P 0 = 2 h0L0(h 0, λ) 1 , P ϵ 0 = 2 h1L1(h 1, λ) 1 . b 0 = h0L0(h 0, λ) + κ(h 0 h 1), bϵ 0 = h0L0(hϵ 0, λ) + κ(hϵ 0 hϵ 1) Fair Representation Learning through Implicit Path Alignment Then we have: p 0 p0 = P 0 b 0 P ϵ 0bϵ 0 P 0 b 0 P 0 bϵ 0 + P 0 bϵ 0 P ϵ 0bϵ 0 P 0 b 0 bϵ 0 + bϵ 0 P 0 P ϵ 0 As for the P 0 , since we assume the Hessian matrix is invertible thus its norm is upper bounded by some constant (denoted as A 1). 
As for b 0 bϵ 0 , we have: b 0 bϵ 0 h0L0(h 0, λ) h0L0(hϵ 0, λ) + 2κϵ Thus we have P 0 b 0 bϵ 0 A 1(ϵL1 + 2κϵ). As for bϵ 0 , we can easily verify that it is indeed bounded by some constant b. For the first term, we can adopt the same strategy in proving bounded A 0 . As for the second term in bϵ 0, it is upper bounded by 2κϕ, due to the bounded predictor. We now demonstrate P 0 P ϵ 0 . Denoting = (P 0 ) 1 (P ϵ 0) 1, then according to the second order Lipschitz assumption, we have: ϵL2. Plugging in the result, we have: P 0 P ϵ 0 = (P 0 ) (P ϵ 0) P 0 P ϵ 0 (A 1)2L2ϵ We still adopt the assumption that the bounded Hessian-inverse matrix by A 1. Plugging in all the results, we have: (1) A1(ϵL1 + 2κϵ) + b(A1)2L2ϵ := O(κϵ + ϵ) The upper bound of term (2) We have: A 0p0 Aϵ 0p0 p0 2 A 0 Aϵ 0 Since we assume the loss is second-order Lipschitz, thus we have A 0 Aϵ 0 = λ h0L0(h 0, λ) λ h0L0(hϵ 0, λ) L2 h 0 hϵ 0 ϵL2 We can also demonstrate p0 is bounded. According to the definition we have: p0 2 h0L0(hϵ 0, λ) 1 ( h0L0(hϵ 0, λ) + κ(hϵ 0 hϵ 1)) (i) A 1(L1 h 0 hϵ 0 2 + 2κϕ) (ii) A 1(ϵL1 + 2κϕ) For (i), we assume: 1) the Hessian matrix is invertible thus its norm is surely upper bounded by some constant (denoted as A 1), 2) the first-order derivative is Lipschitz (bounded by L1), 3) the predictor h is bounded. For (ii), we adopt the definition of hϵ 0. Therefore, the upper bound for Term (2) is formulated as: (2) ϵL2A 1(ϵL1 + 2κϕ) := O(κϵ) The upper bound of term (3) We have: Aϵ 0p0 Aϵ 0pδ 0 Aϵ 0 p0 pδ 0 δAϵ sup = O(δ) Through the upper bound in (1)-(3), we finally have the error between the estimated and ground-truth gradient: grad(λ) grad δ(λ) = O(κϵ + ϵ + δ) Fair Representation Learning through Implicit Path Alignment D. The Convergence Behavior For the sake of completeness, we provide the convergence analysis of the proposed algorithm. Proposition D.1. We execute the implicit alignment algorithm (Algo. 1), obtaining a sequence of λ1, . . . , λk, . . . . Supposing the fair constraint κ is fixed. The optimization tolerances are summable: P k ϵ2 k + and P k δ2 k + , then λk is proved to be converged with lim k λk = λ . If the stationary point λ is also within the bounded norm, then we have: grad(λ ) = 0. Proof. We denote the entire outer-loop loss w.r.t. λ as L(λ), by the assumption the β-smooth loss L. Then at iteration k + 1 and k, we have: L(λk+1) L(λk) grad(λk)T (λk λk+1) + β 2 λk+1 λk 2 = L(λk) grad(λk) grad δ(λk) + grad δ(λk) T (λk λk+1) + β 2 λk+1 λk 2 = L(λk) grad(λk) grad δ(λk) T (λk λk+1) grad δ(λk)(λk λk+1) + β 2 λk+1 λk 2 Since we assume the representation is within the bounded norm, the projection onto the convex set are non-expansive operators (Boyd et al., 2004). Then for any point p, q, we have proj(p) proj(q) 2 (p q)T (proj(p) proj(q)). Then we set λk and λk+1 = λk 1 β grad δ(λk), we have: λk λk+1 2 1 β ( grad δ(λk))T (λk λk+1) Plugging into the results, we have: L(λk+1) L(λk) grad(λk) grad δ(λk) T (λk λk+1) β 2 λk+1 λk 2 L(λk) + grad(λk) grad δ(λk) λk λk+1 β 2 λk+1 λk 2 Rearranging the inequality, we have: 2 λk+1 λk 2 grad(λk) grad δ(λk) λk λk+1 + (L(λk+1) L(λk)) 0 Then we have: grad(λk) grad δ(λk) + q grad(λk) grad δ(λk) 2 2β (L(λk+1) L(λk)) By denoting Bk = grad(λk) grad δ(λk) and Ck = L(λk+1) L(λk). 
Then we have: λk+1 λk 2 1 B2 k + B2 k 2βCk + 2Bk q β2 B2 k + B2 k 2βCk + B2 k + B2 k 2βCk β2 [ grad(λk) grad δ(λk) 2 2 2β (L(λk+1) L(λk))] Fair Representation Learning through Implicit Path Alignment Taking sum over k, we have: k=1 λk+1 λk 2 4 k=1 grad(λk) grad δ(λk) 2 2 8 β ( lim k L(λk+1) L(λ1)) k [(C + κ)2ϵ2 k + δ2 k] 8 lim k L(λk+1) L(λ1) < + Since 1) the first term on the right side is finite, because the optimization tolerance is summable; 2) the second term is also finite, because the loss is assumed to be bounded. Then the upper bound is finite. In order to satisfy this condition, on the left side we should have: lim k λk+1 λk = 0 By adopting the definition λk+1 = Proj(λk grad δ(λk)) and limk grad δ(λk) = grad(λk) (Based on theorem 1, the limit of the optimization tolerance is zero), then we have: λ = proj(λ grad(λ )) Where λ = limk + λk+1 = limk + λk. Since the projection is on the bounded norm Lnorm and λ is within the bounded norm space, thus if λ grad(λ ) is within the bounded norm space, we have: grad(λ ) = 0 Else if λ grad(λ ) is outside the bounded norm space, then according to the definition, the projection of λ grad(λ ) is surely on the boundary of the Lnorm space, with proj(λ grad(λ )) = Lnorm. However, we have assumed the λ is within the bounded norm space with λ < Lnorm, which leads to the contradiction. Based on these discussions, we finally have: grad(λ ) = 0 E. Possible extensions to non-binary protected features For the completeness, it is also possible to extend to binary protected features with distribution D1, . . . , DN. For example, the bi-level objective can be naturally formulated as n=1 Ln(h n, λ) + m=1,n=1,n