# Fair Text Classification via Transferable Representations

Journal of Machine Learning Research 26 (2025) 1-47. Submitted 3/25; Revised 10/25; Published 10/25.

**Thibaud Leteno** (thibaud.leteno@univ-st-etienne.fr), Université Jean Monnet Saint-Etienne, CNRS, Institut d'Optique Graduate School, Laboratoire Hubert Curien UMR 5516, F-42023, Saint-Etienne, France

**Michael Perrot** (michael.perrot@inria.fr), Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 - CRIStAL, F-59000, Lille, France

**Charlotte Laclau** (charlotte.laclau@telecom-paris.fr), LTCI, Télécom Paris, Institut Polytechnique de Paris, France

**Antoine Gourru** (antoine.gourru@univ-st-etienne.fr), Université Jean Monnet Saint-Etienne, CNRS, Institut d'Optique Graduate School, Laboratoire Hubert Curien UMR 5516, F-42023, Saint-Etienne, France

**Christophe Gravier** (christophe.gravier@univ-st-etienne.fr), Université Jean Monnet Saint-Etienne, CNRS, Institut d'Optique Graduate School, Laboratoire Hubert Curien UMR 5516, F-42023, Saint-Etienne, France

**Editor:** Manuel Gomez-Rodriguez

**Abstract**

Group fairness is a central research topic in text classification, where reaching fair treatment between sensitive groups (e.g., women and men) remains an open challenge. We propose an approach that extends the use of the Wasserstein Dependency Measure for learning unbiased neural text classifiers. Given the challenge of distinguishing fair from unfair information in a text encoder, we draw inspiration from adversarial training by inducing independence between representations learned for the target label and those for a sensitive attribute. We further show that domain adaptation can be efficiently leveraged to remove the need for access to the sensitive attributes in the data set we cure. We provide both theoretical and empirical evidence that our approach is well-founded.

**Keywords:** natural language processing, fairness, text classification, domain adaptation, transfer

©2025 Thibaud Leteno, Michael Perrot, Charlotte Laclau, Antoine Gourru and Christophe Gravier. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/25-0485.html.

## 1. Introduction

Machine learning algorithms have become increasingly influential in decision-making processes that significantly impact our daily lives. One of the major challenges that has emerged in research, both academic and industrial, concerns the fairness of these models, that is, their ability to treat individuals and groups equitably without causing prejudice or discrimination. As more researchers work to overcome these shortcomings, the first problem is to define what fairness is. This definition is hardly consensual (Han et al., 2023), or at least difficult to establish, as it depends on situational and cultural contexts (Fiske, 2017). In this work, we focus on group fairness (which we will refer to as fairness for simplicity), which prevents predictions related to individuals from being based on sensitive attributes such as gender or ethnicity. We then adopt common metrics for assessing group fairness in practice, which are based on the notion of disparate impact referenced in legal frameworks across several countries.¹ This type of metric considers a predictive model fair if its outcomes remain consistent across groups of individuals defined by sensitive attributes.
In this article, we focus on the problem of fairness in Natural Language Processing (NLP) (Li et al., 2023; Chu et al., 2024), and more specifically in text classification, as it is one of the most ubiquitous tasks in our society, with prominent examples in the medical and legal domains (Demner-Fushman et al., 2009) or human resources (Jatobá et al., 2019), to name a few. For more general overviews of fairness in machine learning systems, we refer the interested reader to Caton and Haas (2024) and Barocas et al. (2023). Initially, works in text classification relied on text encoders, which are parameterized and learned functions that map tokens (arbitrary text chunks) into a latent space of controllable dimension, usually followed by a classification layer. Built upon the Transformer architecture (Vaswani et al., 2017), popular Pre-trained Language Models (PLMs) such as BERT (Devlin et al., 2019) leverage self-supervised learning to train the text encoder parameters. These PLMs are further fine-tuned for the supervised task at hand. More recently, with the advent of powerful decoder-based models, practitioners have started to prompt those models for classification tasks (Dubey, 2024; Ruan et al., 2024). While many studies already report biases in NLP systems (Sun et al., 2019; Hutchinson et al., 2020; Tan and Celis, 2019; Liang et al., 2021; Bender et al., 2021), these issues become even more significant with the advent of public-ready AI-powered NLP systems.

As mentioned above, recent developments in NLP, such as prompting-based models, raise questions about ensuring fairness in text classification. Atwood et al. (2024) highlight the limitations of prompting for fairness control, whereas regularization-based methods achieve better fairness-performance trade-offs. Meanwhile, Roccabruna et al. (2024) evaluate multiple large decoder-based models alongside RoBERTa (Liu et al., 2019) on temporal relation classification, finding that RoBERTa outperforms all the decoder-based models on this task. However, other approaches leverage powerful decoder models to generate embeddings for various tasks, including text classification, as seen with SFR-Embedding-2_R (Meng et al., 2024) or NV-Embed-v2 (Lee et al., 2024), both built on Mistral-7B (Jiang et al., 2023). While some recent works adopt this embedding-based strategy (Yang, 2024), others continue to rely on encoder-only architectures (Sturman et al., 2024). For fairness control in text classification, this leaves two main approaches: incorporating fairness constraints into prompts or debiasing the model during fine-tuning. Our work belongs to this latter setting.

1. For example, GDPR, Article 22 (European Parliament and Council of the European Union, 2016) and the AI Act (European Parliament and Council of the European Union, 2024), Recital 27, in the European Union; Title VII of the 1964 Civil Rights Act (Act, 1964) in the United States of America.

**Contributions** This paper extends our work on Wasserstein independence for text classification (Leteno et al., 2023) to mitigate bias in text classifiers. We introduce an extensive theoretical analysis and present additional experimental results. Our approach addresses bias directly in the latent space, making it applicable to any text encoder or decoder (e.g., BERT or Mistral). To proceed, we disentangle the neural signals encoding bias from those used for predictions.
Disentanglement-based methods have primarily focused on images or tabular data (Jang and Wang, 2024; Locatello et al., 2019). In this paper, we introduce an approach tailored to NLP and capable of handling less-explored scenarios, including continuous sensitive attributes and regression tasks. Our method overcomes a major shortcoming of prior studies that rely on access to the sensitive attributes during training: regulations such as the GDPR (European Parliament and Council of the European Union, 2016) impose stringent requirements on the collection and use of protected attributes, which can, in certain cases, constrain such methodologies. In the following, we demonstrate that our approach tackles this issue by learning from simple data sets, such as toy data sets, to transfer knowledge and enable fair classification even when sensitive attributes are not available in the deployment data.

In a nutshell, our goal is to reduce the dependency between predictions and sensitive attributes to improve fairness. To achieve this, we minimize the Wasserstein Dependency Measure (Ozair et al., 2019) between the hidden representations of two neural networks: one for the end-task classification and one for predicting the sensitive attributes. This requires approximating several measures relative to the initial objective of independence between the classifier and the sensitive attribute. In this paper, we establish the theoretical validity of these approximations. First, we examine the relation between the chosen dependency measure and various fairness metrics. Second, we derive an upper bound on the transfer of sensitive attributes, supporting the use of predicted sensitive attributes when the real ones are unavailable. Finally, we justify the use of latent representations and provide guarantees on this approximation. We further validate our approach empirically by comparing it to state-of-the-art methods and evaluating different variations of our architecture.

**Organization of the paper** The rest of this paper is organized as follows. Section 2 presents recent advances related to our proposition. Section 3 discusses our motivation, provides the background knowledge needed to understand our contributions, and presents our first results establishing the relation between fairness and the Wasserstein Dependency Measure. Section 4 develops the theoretical framework of the proposed approach and its analysis. Section 5 describes the proposed approach and the algorithmic details of its implementation. Section 6 introduces the setting of our experiments, and Section 7 presents the experiments and their interpretation. We present our conclusions and research perspectives in Section 8 and end the paper with a section dedicated to the limitations of our contributions.

## 2. Related Works

Recent work on fairness in NLP has focused on fair text classification, with adversarial methods (Beutel et al., 2017; Zhang et al., 2018; Elazar and Goldberg, 2018; Madras et al., 2018; Torres, 2024) being widely investigated. Han et al. (2021b,a) suggest using multiple discriminators, each learning distinct hidden representations, or applying adversarial training across domains. Other contributions enforce fairness through balanced training (Han et al., 2021c), batch selection (Roh et al., 2021), or by integrating fairness metrics, such as Equality of Opportunity, directly into the objective function (Shen et al., 2022a,b).
However, these methods rely on access to sensitive attribute annotations during training, limiting their practical applicability. In this work, we overcome this constraint while providing strong theoretical guarantees. Next, we discuss related work that considers settings where sensitive attributes are unavailable, followed by fairness approaches based on dependency measures and theoretical guarantees.

**Sensitive attribute access for fairness mitigation** To address the absence of sensitive attributes, proxy models have been proposed to enhance fairness. Other approaches circumvent the use of sensitive attributes during training or inference by leveraging related features (Zhao et al., 2022), knowledge distillation (Chai et al., 2022), adversarial reweighted learning (Lahoti et al., 2020), proxy features (Gupta et al., 2018), or perturbations (Awasthi et al., 2020). However, Kenfack et al. (2023) recently highlighted the risks associated with proxy-sensitive attributes, which may exacerbate the fairness-accuracy trade-off. Domain adaptation has also been explored as a means to address fairness in data sets lacking demographic information. Schumann et al. (2019) employ adversarial learning to enforce fairness in the source domain while predicting domain membership, while Coston et al. (2019) propose loss reweighting to mitigate the absence of sensitive attributes in either domain. Our approach follows this line of research, specifically addressing the lack of sensitive attributes in the target domain. By working in the representation space to minimize the divergence between domains, we aim to ensure that the classifier trained on the source domain treats both domains equivalently.

**Fair classification with dependency measures** The Wasserstein distance has been increasingly used to enforce fairness constraints in machine learning. For instance, Risser et al. (2022) and Jiang et al. (2020) apply it to measure the discrepancy between the distributions of predictions conditioned on the groups defined by the sensitive attribute. Although effective, these approaches are limited to categorical sensitive attributes and mainly favor conditional independence. In contrast, we propose to exploit the Wasserstein Dependency Measure, which captures the dependence between the joint distribution of the hidden output representations and the sensitive attribute and the product of their marginals. This distinction allows us to assess and mitigate bias at a more fundamental level, ensuring that the learned representations themselves do not encode sensitive information. Our approach is inspired by Ozair et al. (2019), who use the Wasserstein Dependency Measure to improve representation learning for images. However, while their work focuses on improving feature representations for downstream tasks, we incorporate sensitive attributes into the estimation process to promote fairness. Another related approach in NLP is proposed by Cheng et al. (2021), who maximize the mutual information between sentence representations and their augmented counterparts to remove sensitive information from inputs. However, as noted by Shen et al. (2022b) and Cabello et al. (2023), this does not guarantee the independence between predictions and sensitive attributes. Our method differs by explicitly minimizing the dependency between representations of the same sentence processed by two different encoders, ensuring that predictions remain unaffected by sensitive attributes.
Additionally, our work shares conceptual similarities with Nam et al. (2020), who address bias in image data. However, instead of reweighting samples to counteract biases in a secondary model, we employ the Wasserstein distance to quantify and minimize the dependency between the representations learned by two models. More recently, Iskander et al. (2024) also seek to mitigate disparities, but rely on task-specific representations and the KL divergence to enforce distributional uniformity across groups.

**Theoretical guarantees in fairness** Most fairness mitigation techniques are evaluated on test sets that may not fully represent real-world deployment scenarios (Dunkelau, 2020; Hort et al., 2024). This highlights the need for theoretical guarantees to ensure the reliability of mitigation approaches with respect to fairness metrics. Several works provide such guarantees, often focusing on post-training corrections. For instance, Woodworth et al. (2017) propose a post-hoc correction method with guarantees on classifier performance and prediction disparities across sensitive attributes. Denis et al. (2024) derive distribution-free fairness guarantees, while Chzhen et al. (2020) establish fairness bounds that depend only on the dimensionality of the unlabeled data set. On the other hand, Celis et al. (2019) develop a meta-learning framework to obtain an optimally fair classifier with respect to algorithmic complexity, and McNamara et al. (2017) show that learned representations can satisfy both group and individual fairness criteria. Finally, a closely related work is Gupta et al. (2021), who use Mutual Information to measure the dependency between representations and provide fairness guarantees based on it. They derive an upper bound on the Demographic Parity measure via the Mutual Information between latent representations and the sensitive attributes, as well as bounds on the Mutual Information between classification labels and conditional latent representations. However, unlike our approach, they do not provide guarantees on the dependency between the classification labels and the sensitive attributes.

## 3. Wasserstein Dependency Measure and Group Fairness

This section introduces the notations used throughout the paper, along with the definitions of key fairness metrics and the Wasserstein Dependency Measure ($I_W$). We then present our first result, establishing a link between two popular group fairness metrics and $I_W$.

### 3.1 Notations

We consider a corpus of $n$ triplets $\{(x_i, y_i, a_i)\}_{i=1}^{n}$, where $x_i \in \mathcal{X}$ is a short document or a sentence, $y_i \in \mathcal{Y}$ is a label, and $a_i \in \mathcal{A}$ is either a sensitive attribute, such as gender, ethnicity or age, or represents intersectional groups of several sensitive attributes. In this paper, we assume that $\mathcal{Y}$ and $\mathcal{A}$ are discrete spaces, and we will often abuse notations such that $y \in \mathcal{Y}$ and $a \in \mathcal{A}$ represent either a target label or a vector representation obtained through one-hot encoding. The embeddings (or representations) are obtained thanks to an encoding function, Enc, that maps words into numeric values. The objective is to predict outcomes $y$ for a given input $x$ by estimating the conditional distribution $p(Y|X = x)$. To this end, we learn a scoring function $\pi_y : \mathcal{X} \rightarrow \mathcal{P}(\mathcal{Y})$, where $\mathcal{P}(\mathcal{Y})$ is the set of probability distributions over $\mathcal{Y}$. Given $\pi_y(x)$, the actual prediction is denoted by $\hat{y}$ and corresponds to the label predicted as most likely.
For instance, in a social network context, one can learn a classifier to predict whether a message is toxic. This prediction could inform decisions such as banning the message or its author from the platform. In modern NLP applications, deep classification often follows a two-step approach: the scoring function is expressed as $\pi_y = h_y \circ \text{Enc}$, where $\text{Enc}(x) \in \mathbb{R}^d$ maps a text $x$ into a low-dimensional embedding space, and $h_y$, typically a simple neural network layer with a softmax activation, serves as the classification layer.

### 3.2 Group Fairness

Our goal is to learn fair models, and we focus on two main definitions of fairness. On the one hand, we consider demographic parity (Hardt et al., 2016), which is defined, for a desirable outcome $y$ and a sensitive attribute $a$, as

$$DP_{a,y} = P(\hat{Y} = y \mid A = a) - P(\hat{Y} = y). \tag{1}$$

On the other hand, we consider equality of opportunity (Hardt et al., 2016), which is defined, for an outcome $y$ and a sensitive attribute $a$, as

$$EO_{a,y} = P(\hat{Y} = Y \mid Y = y, A = a) - P(\hat{Y} = Y \mid Y = y). \tag{2}$$

### 3.3 Wasserstein Dependency Measure

Mutual Information (MI) is an information-theoretic metric that measures the statistical dependence, or the amount of information shared, between two variables. For two random variables $U \sim p(U)$ and $V \sim p(V)$ taking values in $\mathcal{U}$ and $\mathcal{V}$, respectively, the MI is defined as the KL-divergence between the joint distribution $p(U, V)$ and the product of the marginal distributions $p(U)p(V)$:

$$MI(U, V) = KL(p(U, V) \,\|\, p(U)p(V)).$$

Early works in fair classification introduced the idea that fairness can be improved by reducing the Mutual Information between the classifier's output, $\hat{Y}$, and the sensitive attribute, $A$ (Kamishima et al., 2012; Zemel et al., 2013). Specifically, enforcing Demographic Parity (DP) corresponds to minimizing the MI between these two random variables, ensuring that $\hat{Y}$ is independent of $A$. Similarly, Equalized Odds (EO) can be formulated as minimizing the MI between $A$ and $\hat{Y}$ conditionally on the true label $Y$, ensuring that predictions remain independent of the sensitive attribute within each outcome class. However, MI is known to be intractable in most real-life scenarios and has strong theoretical limitations, as outlined by McAllester and Stratos (2020). Notably, building a high-confidence lower bound requires a number of samples exponential in the value of the MI, and the estimate is sensitive to small perturbations in the data sample. To overcome this issue, Ozair et al. (2019) propose a theoretically sound dependency measure, the Wasserstein Dependency Measure ($I_W$), based on the Wasserstein 1-distance:

$$I_W(U, V) = W_1(p(U, V), p(U)p(V)).$$

Using the Kantorovich-Rubinstein duality, it can also be expressed as

$$I_W(U, V) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{(U,V) \sim p(U,V)}[f(U, V)] - \mathbb{E}_{U \sim p(U),\, V \sim p(V)}[f(U, V)], \tag{3}$$

where the constraint $\|f\|_L \leq 1$ restricts the supremum to the set of 1-Lipschitz functions. The Wasserstein distance has been efficiently used in many machine learning applications (Frogner et al., 2015; Courty et al., 2014; Torres et al., 2021), and a particularly interesting one is fair machine learning (Jiang et al., 2020; Silvia et al., 2020; Gordaliza et al., 2019; Laclau et al., 2021). We present the Wasserstein distance in Appendix A along with technical lemmas relevant to our proposition.
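To make the measure concrete, the sketch below computes $I_W$ exactly for two discrete random variables using the POT library (the library used for the exact values reported in Section 7.3.2). This is a minimal illustration under our own choices, not the paper's training-time estimator: the function name, the one-hot embedding of the support points, and the default $p = 2$ ground metric are assumptions for the example.

```python
import numpy as np
import ot  # POT: Python Optimal Transport


def wasserstein_dependency_measure(y_hat, a, p=2):
    """Exact I_W = W1(p(y_hat, a), p(y_hat)p(a)) for two discrete
    variables, with a p-norm ground metric on one-hot encodings."""
    y_vals, a_vals = np.unique(y_hat), np.unique(a)
    # Support of both distributions: every (y, a) pair, one-hot encoded.
    support = [(y, av) for y in y_vals for av in a_vals]
    points = np.array([
        np.concatenate([(y_vals == y).astype(float),
                        (a_vals == av).astype(float)])
        for y, av in support
    ])
    # Empirical joint distribution and product of the marginals.
    joint = np.array([np.mean((y_hat == y) & (a == av)) for y, av in support])
    prod = np.array([np.mean(y_hat == y) * np.mean(a == av) for y, av in support])
    # Pairwise p-norm costs between support points, then exact W1.
    cost = ot.dist(points, points, metric="minkowski", p=p)
    return ot.emd2(joint, prod, cost)


# Toy check: predictions independent of the attribute give I_W close to 0,
# while a fully dependent variable gives a strictly positive value.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=10_000)
y_independent = rng.integers(0, 2, size=10_000)
print(wasserstein_dependency_measure(y_independent, a))  # ~ 0
print(wasserstein_dependency_measure(a, a))              # > 0
```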
### 3.4 Connection with Group Fairness

In this section, we show a connection between the Wasserstein Dependency Measure and the two group fairness measures we consider. In the next lemma, we show that a linear combination of Demographic Parity or Equality of Opportunity terms, over all possible values of $a$ and $y$, is equivalent to the Wasserstein Dependency Measure between well-chosen random variables. This result is reminiscent of that of Gupta et al. (2021), who showed a connection between group fairness and mutual information.

**Lemma 1 (Group fairness and Wasserstein Dependency Measure)** *Let $I_W$ be the Wasserstein Dependency Measure, and $A$, $Y$, $\hat{Y}$ be random variables corresponding to the sensitive attribute, the true label, and the predicted label, respectively. Let $p$ be the ground metric for the Wasserstein 1-distance. We have that*

$$I_W(\hat{Y}, A) = \sqrt[p]{2} \sum_{a \in \mathcal{A}} P(A = a) \sum_{y \in \mathcal{Y}} |DP_{a,y}|,$$

$$I_W\big((\hat{Y} = Y)|Y = y,\; A|Y = y\big) = \sqrt[p]{2} \sum_{a \in \mathcal{A}} P(A = a|Y = y)\, |EO_{a,y}|,$$

*with $|\cdot|$ denoting the absolute value.*

**Proof** The proof is provided in Appendix B.

This lemma shows that minimizing the Wasserstein Dependency Measure between well-chosen random variables is a sound way to minimize Demographic Parity or Equality of Opportunity. This motivates regularizing a learning algorithm with $I_W(\hat{Y}, A)$ to improve the fairness of text classifiers.

## 4. Predictive and Sensitive Information Approximations

To improve classifier fairness, we aim to minimize the Wasserstein Dependency Measure ($I_W$) between the sensitive attribute $A$ and the label predictions $\hat{Y}$. However, this optimization presents several challenges, notably the need to access the sensitive attributes and to differentiate a signal that went through an argmax function to obtain the label predictions. To address these, we first approximate the sensitive attribute labels using their predicted values, $\hat{A}$, obtained from a neural network. Then, instead of working directly with $\hat{Y}$ and $\hat{A}$, we use their hidden representations, denoted $Z_y$ and $Z_a$, from the corresponding neural networks to overcome the non-differentiability of the argmax function. We also provide guarantees on these approximations. This leads to the following optimization objective for learning a fair text classifier:

$$\arg\min \; \mathcal{L}(Y, h_y(\text{Enc}(X_y))) + \beta\, I_W(Z_y, Z_a), \tag{4}$$

where $I_W(Z_y, Z_a) = W_1(p(Z_y, Z_a), p(Z_y)p(Z_a))$. Here, $Z_y$ and $Z_a$ represent the hidden representations from two Multi-Layer Perceptrons (MLPs): one for classification and one for the proxy model introduced in Section 4.1. The function $\mathcal{L}$ ensures the classifier achieves high accuracy on $Y$ (e.g., we consider the cross-entropy for binary classification), while the second term encourages fairness by constraining the learned representations. The hyperparameter $\beta \in \mathbb{R}^+$ controls the balance between accuracy and fairness, as the two objectives may converge at different speeds. We refer to this approach as Wasserstein Fair Classification (WFC). Details on its implementation are provided in Section 5.

### 4.1 Definition of the Demonic Model

In the following, we use a surrogate model, referred to as the demonic model, to predict the sensitive attribute $A$ without requiring explicit observation of the attributes at training time. To proceed, we assume an architecture similar to the one used for predicting the labels: we learn a scoring function $\pi_a = h_a \circ \text{Enc}$ which, given an example $x$, outputs a probability distribution over $\mathcal{A}$, with $h_a$ a fixed classification function predicting $A$. The predicted sensitive attribute is then $\hat{a}$ and corresponds to the most likely sensitive attribute according to $\pi_a$.
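To fix ideas, here is a minimal PyTorch sketch of the two branches: a frozen encoder output feeding a classification head $h_y$ and a pre-trained demonic head $h_a$, each exposing a hidden representation ($Z_y$, $Z_a$). The class name, layer sizes, and the single hidden layer are illustrative assumptions; the actual architectures are given in Appendix F.

```python
import torch
import torch.nn as nn


class Head(nn.Module):
    """An MLP head h that also exposes a hidden representation Z."""
    def __init__(self, d_in, d_hidden, n_classes):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.out = nn.Linear(d_hidden, n_classes)

    def forward(self, emb):
        z = self.body(emb)        # hidden representation (Z_y or Z_a)
        return self.out(z), z     # logits and representation


d = 768                           # e.g., dimension of BERT embeddings
h_y = Head(d, 100, 28)            # main classifier (28 occupations on Bios)
h_a = Head(d, 100, 2)             # demonic model, pre-trained then frozen
for param in h_a.parameters():
    param.requires_grad = False

emb = torch.randn(4, d)           # Enc(x) for a batch of 4 documents
logits_y, z_y = h_y(emb)
logits_a, z_a = h_a(emb)
a_hat = logits_a.argmax(dim=-1)   # predicted sensitive attribute
```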
Consequently, we propose to consider $I_W(\hat{Y}, \hat{A})$ instead of $I_W(\hat{Y}, A)$ to approximate the dependency between the predictions and the sensitive attributes. In the next lemma, we study this approximation and show that it is close to the original measure, with a gap that depends on the demonic model's performance.

**Lemma 2** *Let $\hat{Y}$, $\hat{A}$, $A$ be random variables that correspond to the predicted label, predicted sensitive attribute, and true sensitive attribute, respectively. Let $p$ be the ground metric for the Wasserstein 1-distance. Then, we have that*

$$I_W(\hat{Y}, A) \leq I_W(\hat{Y}, \hat{A}) + 2\sqrt[p]{2}\; P(A \neq \hat{A}).$$

**Proof** The proof is provided in Appendix C.

This lemma shows that replacing $A$ by $\hat{A}$ is sound when the latter is an accurate estimate of the former, that is, when $P(A \neq \hat{A})$ is small. In the next theorem, we combine this result with a standard generalization result to show that this remains valid in the finite sample regime. The proof is provided in Appendix C.1.

**Theorem 3** *Let $\hat{A}, A \in \{0, 1\}$, and $\mathcal{H}$ be a hypothesis space of VC-dimension $d$. Let $p$ be the ground metric for the Wasserstein 1-distance. Assume that we have access to a training set of $m$ i.i.d. examples. Then, with probability at least $1 - \delta$, we have, for all $h \in \mathcal{H}$,*

$$I_W(\hat{Y}, A) \leq I_W(\hat{Y}, \hat{A}) + 2\sqrt[p]{2} \left( \hat{\varepsilon} + \sqrt{\frac{4}{m}\left(d \log \frac{2em}{d} + \log \frac{4}{\delta}\right)} \right),$$

*with $e$ the base of the natural logarithm and $\hat{\varepsilon}$ the empirical risk of the demonic model.*

**Remark** This bound indicates that minimizing $I_W(\hat{Y}, \hat{A})$ allows to minimize $I_W(\hat{Y}, A)$. However, it is tight only when the demonic model accurately predicts the sensitive attributes: with an accurate demonic model, the bound on the error rate is low and the bound tends to the estimate $I_W(\hat{Y}, \hat{A})$. In the perfect case, where the demonic model achieves perfect predictions, the bound is simply $I_W(\hat{Y}, \hat{A})$. Moreover, with a training set of sufficient size, the bound on the error rate gets lower. We will consider the case where the demonic model is trained on out-of-domain data (transfer learning scenario) in Section 4.2. Note that we can easily generalize to multi-label sensitive attributes by considering the Natarajan dimension (Natarajan, 1989) instead of the VC-dimension. Moreover, multiple sensitive attributes can be handled by looking at the group intersections to cast the problem as a multi-label one. Finally, continuous sensitive attributes can be handled by binning; however, finding the relevant threshold for binning is a non-trivial problem. We present the different scenarios and solutions in Appendix E.

### 4.2 Demonic Model in Cross-Domain Settings

Recall that $\hat{A}$ and the latent representations $Z_a$ are obtained through a proxy neural network trained to predict the sensitive attribute, so as to tackle the lack of sensitive attribute annotations. As such, one can train $h_a$ on a data set different from the end-task one. Let us consider two data sets: the end-task data set (or target) $D_T$ and the side data set (or source) $D_S$. $D_T = \{x_{T,i}, y_{T,i}\}_{i=1}^{n_T}$ is composed of a set of features and labels, while $D_S = \{x_{S,i}, a_{S,i}\}_{i=1}^{n_S}$ is composed of a set of features and sensitive attributes. We assume that we are in the context of covariate shift: the feature distributions are different but the sensitive attribute distributions are similar ($A_T \approx A_S$). Then, we want to learn a mapping $\varphi : \mathcal{X}_S \rightarrow \mathcal{X}_T$ and train the demonic model classification layer $h_a$ on the mapped $X_S$:

$$\min_{h_a, \varphi} \; \mathcal{L}(h_a(\varphi(\text{Enc}(X_S))), A_S) + \Lambda(\varphi(\text{Enc}(X_S)), \text{Enc}(X_T)),$$

with $\Lambda(\varphi(\text{Enc}(X_S)), \text{Enc}(X_T))$ a measure of divergence between the embeddings of $X_T$ and $X_S$.
Note that the encoder Enc has to be the same for the source and target domains. We provide experimental details in Section 5.2. Moreover, Theorem 3 can be adapted to this setting; only the approximation of the error rate of the demonic model changes.

**Theorem 4** *Assume that $\hat{A}, A \in \{0, 1\}$. Assume that $D_S$ and $D_T$ are a source and a target distribution such that $P_{D_S}(X = x) \neq P_{D_T}(X = x)$ and $P_{D_S}(A = a|X = x) = P_{D_T}(A = a|X = x)$, that is, assume a covariate shift. Let $p$ be the ground metric for the Wasserstein 1-distance. Assume that $I_W(\hat{Y}, A)$ and $I_W(\hat{Y}, \hat{A})$ are computed on the target distribution and let $\varepsilon_S = P_{D_S}(\hat{A} \neq A)$, $\varepsilon_T = P_{D_T}(\hat{A} \neq A)$. Then we have that*

$$I_W(\hat{Y}, A) \leq I_W(\hat{Y}, \hat{A}) + 2\sqrt[p]{2} \left( \varepsilon_S + \frac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(D_S, D_T) + \lambda \right),$$

*where $d_{\mathcal{H}\Delta\mathcal{H}}(D_S, D_T)$ is the $\mathcal{H}\Delta\mathcal{H}$-divergence between the marginal feature distributions of $D_S$ and $D_T$, and $\lambda = \lambda_S + \lambda_T$ with $\lambda_S$ and $\lambda_T$ the errors of $h^* = \arg\min_{h \in \mathcal{H}} (\varepsilon_S(h) + \varepsilon_T(h))$ with respect to $D_S$ and $D_T$, respectively.*

**Proof** This is a direct application of Ben-David et al. (2010, Theorem 2).

**Remark** We can draw conclusions similar to those of Theorem 3. However, in this case, one must also consider the divergence between the domains, which is determinant for the success of the approximation. The closer the two domains are, the tighter the bound is. Therefore, if the demonic model loses accuracy due to the divergence between the source and target domains, the bound gets looser. Note that the $\mathcal{H}\Delta\mathcal{H}$-divergence between the source and target domains is restricted to the binary setting. To generalize to categorical sensitive attributes (and, by extension, to multiple or continuous sensitive attributes as in Theorem 3), one could consider a generalization of the $\mathcal{H}\Delta\mathcal{H}$-divergence as in Sicilia et al. (2022). Recall that we detail the different scenarios and solutions in Appendix E.

### 4.3 Using Latent Representations

In the previous section, we explained why we use the Wasserstein Dependency Measure between the predicted labels and the predicted sensitive attributes, $I_W(\hat{Y}, \hat{A})$, instead of between the predicted labels and the true sensitive attributes, $I_W(\hat{Y}, A)$. Nevertheless, as such, we cannot use this measure to regularize a training algorithm, since the argmax operation producing the hard predictions ($\hat{Y}$) after the classification layer is not differentiable. Thus, instead of considering the networks' final outputs, one can overcome this limitation by minimizing the $I_W$ between the latent representations of the networks $h_y$ and $h_a$, respectively referred to as $Z_y$ and $Z_a$. In Theorem 5, we show that the $I_W$ between the neural network representations is an upper bound of the $I_W$ between the predictions.

**Theorem 5** *Let $\hat{Y}$, $\hat{A}$ be random variables that correspond to the predicted label and predicted sensitive attribute, respectively. Assume that $h_y = \sigma_\lambda(f(Z_y))$ and $h_a = \sigma_\lambda(g(Z_a))$, where $\sigma_\lambda$ is the softmax function with temperature $\lambda$, $f$ and $g$ are both $L$-Lipschitz with respect to the $p$-norm, and $Z_y$ and $Z_a$ are latent representations of the examples. Let $p$ be the ground metric for the Wasserstein 1-distance. For a given example $x$ with predicted label $\hat{y}$ and predicted sensitive attribute $\hat{a}$, let $\xi_y(x) = f(Z_y)_{\hat{y}} - \max_{y \neq \hat{y}} f(Z_y)_y$ and $\xi_a(x) = g(Z_a)_{\hat{a}} - \max_{a \neq \hat{a}} g(Z_a)_a$ be positive margins. Let $\delta = 1 - P(\xi_y(X) \geq \xi,\, \xi_a(X) \geq \xi)$ with $\xi > 0$. Let $\alpha = \big(2\,|\mathcal{Y}|\,|\mathcal{A}|\big)^{\frac{1}{p}} (1 - \delta)$ and $\iota = L(|\mathcal{Y}| + |\mathcal{A}|)$.*
*Then, setting $\xi \geq \log \dfrac{2\xi\alpha}{\iota\, I_W(Z_y, Z_a)}$, we have that*

$$I_W(\hat{Y}, \hat{A}) \leq 2\, I_W(Z_y, Z_a)\, \iota \left( 1 + \log \max\left(4,\; \frac{2\xi\alpha}{\iota\, I_W(Z_y, Z_a)}\right) \right).$$

**Proof** The proof of a slightly sharper result, in particular when $I_W(Z_y, Z_a)$ is large, is provided in Appendix D. We present this simpler version here for better readability.

**Remark** This result suggests that minimizing $I_W(Z_y, Z_a)$ is a sound way to minimize $I_W(\hat{Y}, \hat{A})$. The tightness of the bound depends mainly on the error introduced by the softmax and, more specifically, on two terms: $\xi$ and $\delta$. The margin $\xi_y(x)$ (resp. $\xi_a(x)$) measures how dominant the predicted class is relative to the others, i.e., it is large when $\hat{Y}$ in one-hot encoded form and $\sigma_\lambda(f(Z_y))$ are close. In other words, $\xi_y(x)$ (resp. $\xi_a(x)$) represents the confidence level of the classification model, and $\xi$ represents the minimum expected confidence. The term $\delta$ is the proportion of examples for which this minimum confidence is not reached by the model. We note that there is a trade-off between the first and the second term in the bound, depending on the value of $\xi$: a high value of $\xi$ is likely to imply a large $\delta$, and vice versa. This result also indicates that, for a given model, there is an optimal softmax temperature for inference; but, since the models are assumed given, it does not help us find the optimal softmax temperature at training time. Furthermore, since the softmax is followed by an argmax function, the optimal temperature at inference has a limited impact. Finally, as used here, temperature scaling is mainly a technical tool, and it might be possible to derive a similar result without it. For all these reasons, we do not investigate this term further experimentally.

## 5. Implementation of Wasserstein Fair Classification

In this section, we present both the overall architecture of WFC and the implemented training strategy.

### 5.1 Architecture of WFC

The overall architecture of WFC is composed of three components: two classifiers and a critic (see Figure 1). We recall that the architecture aims to minimize the loss function described in Equation 4.

Figure 1: Architecture of our method. The top part illustrates the pre-training of the demonic model (red) with domain adaptation. The model is trained to predict the sensitive attribute on the source domain ($A_S$) while minimizing the divergence between the hidden representations from the source and target domains ($Z_S$ and $Z_T$).

**Learning $Z_y$ and $Z_a$** Given a batch of documents along with their sensitive attributes, we start by generating a representation of each document using a pre-trained language model (PLM). These representations serve as input to two MLPs, which are trained to predict $A$ and $Y$, respectively. The first model, referred to as the demonic model, is pre-trained. The prediction $\hat{Y}$ output by the second MLP (in green in Figure 1) is directly used to compute the first term of our objective function (see Equation 4). Additionally, from a given hidden layer in each of the MLPs, we extract the hidden representation vectors, $Z_y$ and $Z_a$, which capture intermediate features relevant to their respective tasks.

**Computing $I_W(Z_y, Z_a)$** The second term of the loss is the $I_W$ between $Z_y$ and $Z_a$. To compute it, we use the following approximation (Arjovsky et al., 2017):

$$\max_{\omega,\, \|C_\omega\|_L \leq 1} \mathbb{E}_{(Z_y, Z_a) \sim p(Z_y, Z_a)}[C_\omega(Z_y, Z_a)] - \mathbb{E}_{Z_y \sim p(Z_y),\, Z_a \sim p(Z_a)}[C_\omega(Z_y, Z_a)], \tag{5}$$
where $C_\omega$ is called the critic. (Continuing the Figure 1 caption: the bottom part of the figure describes the WFC pipeline for a batch of size 4, where the demonic model is frozen; the data representation on the right shows how we enforce dependency or independence between $Z_y$ and $Z_a$; during inference, only the trained classifier (green) is retained to predict $Y$.) Initially proposed for the Wasserstein GAN (Generative Adversarial Network) (Arjovsky et al., 2017), the critic is a neural network used as an alternative to the GAN's discriminator to overcome the latter's unstable training (Arjovsky and Bottou, 2017). It estimates the Wasserstein distance between real and fake distributions by outputting scores (the Wasserstein distance) instead of classification probabilities. The underlying idea is to use a neural network with a Lipschitz constraint on the weights to approximate the Kantorovich-Rubinstein formulation of the Wasserstein distance (Equation 3). The Lipschitz constraint is induced by weight clipping or a gradient penalty (Gulrajani et al., 2017). We follow Ozair et al. (2019), who use this differentiable estimate of the Wasserstein distance to compute the $I_W$. To enforce the Lipschitz constraint, we clamp the weights to a given range ($[-0.01, 0.01]$) at each optimization step.²

2. We also tested some more recent improvements of Lipschitz constraint enforcement (Gulrajani et al., 2017; Wei et al., 2018). Interestingly, all lead to poorer performance.

For a batch of documents, the critic takes as input the concatenation of $Z_y$ and $Z_a$, and the concatenation of $Z_y$ and $Z_a$ randomly drawn from the data set (equivalent to $Z_y \sim p(Z_y)$, $Z_a \sim p(Z_a)$). We then follow the training procedure introduced by Arjovsky et al. (2017), which alternates between maximizing Equation 5 in the critic parameters for $n_c$ iterations and minimizing Equation 4 in the $h_y$ classifier parameters for $n_d$ iterations (a condensed sketch of this alternating procedure is given at the end of this subsection). We add a comparison to WFC_eo, where we compute and minimize the $I_W$ between instances that were well classified during training. This allows us to compare optimizing directly for DP vs. EO.

**Overall** An overview of the training process is detailed in Appendix F.1. The details of the MLPs used to parameterize each component, including their architecture, are given in Appendix F. We evaluate and optimize the hyperparameters for our models on a validation set, focusing on the MLP and critic learning rates, the value of $n_d$ (the number of batches used to train the main MLP), the layers producing $Z_a$ and $Z_y$, the value of $\beta$, and the value used to clamp the weights to enforce the Lipschitz constraint. The values allowing us to obtain the optimal trade-off between accuracy and fairness (DTO, cf. Section 6.1) during this process are presented in Appendix F.2. In all our experiments, unless mentioned otherwise, the value of $\beta$ is set to 1. Our implementation is available on GitHub: https://github.com/LetenoThibaud/wasserstein_fair_classification.
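The sketch below condenses this alternating procedure, assuming the two heads from the sketch in Section 4.1. The optimizers, learning rates, and default $n_c$, $n_d$ values are placeholders rather than the tuned values of Appendix F.2; the clipping range matches the $[-0.01, 0.01]$ interval above.

```python
import torch
import torch.nn as nn

# Critic C_w scores concatenated pairs (Z_y, Z_a); 100 is the hidden size
# of each head in the earlier sketch, so the critic input is 200-dim.
critic = nn.Sequential(nn.Linear(200, 100), nn.ReLU(), nn.Linear(100, 1))
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)  # placeholder
opt_y = torch.optim.Adam(h_y.parameters(), lr=1e-3)        # placeholder
ce, beta, clip = nn.CrossEntropyLoss(), 1.0, 0.01


def iw_estimate(z_y, z_a):
    """Critic estimate of Equation 5: joint pairs vs. shuffled pairs,
    the latter mimicking draws from the product of the marginals."""
    joint = critic(torch.cat([z_y, z_a], dim=-1)).mean()
    shuffled = z_a[torch.randperm(z_a.size(0))]
    marginals = critic(torch.cat([z_y, shuffled], dim=-1)).mean()
    return joint - marginals


def training_round(batches, n_c=10, n_d=5):
    batches = list(batches)                 # iterable of (embeddings, labels)
    for emb, _ in batches[:n_c]:            # 1) maximize Eq. 5 in the critic
        _, z_y = h_y(emb)
        _, z_a = h_a(emb)
        loss_c = -iw_estimate(z_y.detach(), z_a.detach())
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()
        for p in critic.parameters():       # Lipschitz constraint via clipping
            p.data.clamp_(-clip, clip)
    for emb, y in batches[n_c:n_c + n_d]:   # 2) minimize Eq. 4 in h_y
        logits_y, z_y = h_y(emb)
        _, z_a = h_a(emb)                   # demonic model stays frozen
        loss = ce(logits_y, y) + beta * iw_estimate(z_y, z_a)
        opt_y.zero_grad(); loss.backward(); opt_y.step()
```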
### 5.2 Pre-training the Demonic Model

**Overview** We pre-train the demonic model, an MLP with an architecture similar to the previous classifier, to predict the sensitive attributes. Note that we do not update the demonic weights during the training phase of the main model. The benefits are twofold. First, unlike previous works (Caton and Haas, 2020), we require only limited access to sensitive attribute labels during training, and we do not need access to the sensitive attributes at inference. This makes WFC highly compatible with recent regulations (e.g., US Consumer Financial Protection Bureau). Second, the demonic model can be trained in a few-shot fashion if some examples of the training set are annotated with sensitive attributes.

**Learning with a related data set** However, when no sensitive attributes are available in the training set, we replace the training data of the demonic model with data from another domain (e.g., another data set) containing sensitive information for the same attribute. For example, for gender, we can leverage generated data sets, like the EEC data set (Kiritchenko and Mohammad, 2018). This enables knowledge transfer between data sets, promoting fairness autonomy regardless of whether sensitive attributes are present in the data, as long as another data set with similar sensitive attributes exists. In most cases, sensitive attribute knowledge transfers easily between data sets without additional adjustments. However, when the data set divergence is significant, domain adaptation techniques can be applied to ensure transfer quality.

**Learning with domain adaptation** If the training data set differs significantly from the end-task data set, we add a regularization term to the loss of the demonic model to train it with a double objective: 1) predicting the sensitive attribute, and 2) generating representations from the source and target domains that are both close and informative for classification. In practice, under the covariate shift assumption, we use the Wasserstein distance between the representations of the source and target data sets as a measure of divergence. For domain adaptation, as in WFC, a critic model estimates the Wasserstein distance between the source and target representations. We use this measure for domain adaptation as done in Shen et al. (2018). Note that while in WFC the Wasserstein distance is computed between the joint and the product of the marginal distributions of the representations to obtain a measure of independence, here we compute it between the representations themselves. Specifically, we compute the Wasserstein distance between the last hidden states of the model for both sets of representations (source and target). Therefore, if we consider the source and target domains, respectively $D_S = \{x_{S,i}, a_{S,i}\}_{i=1}^{n_S}$ and $D_T = \{x_{T,i}, y_{T,i}\}_{i=1}^{n_T}$, with $X_S$, $X_T$ the sets of input texts and $A_S$ the sensitive attributes, the objective of the demonic model $h_a$ can be written as follows:

$$\arg\min \; \mathcal{L}(A_S, h_a(\text{Enc}(X_S))) + \eta\, W_1(Z_S, Z_T), \tag{6}$$

where $\mathcal{L}$ is the loss function aiming at maximizing the accuracy of $h_a$ in predicting $A$, and $Z_S$, $Z_T$ are the hidden representations of the model for $X_S$ and $X_T$, respectively.
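A minimal sketch of one optimization step of Equation 6, in the spirit of Shen et al. (2018): a separate domain critic provides the $W_1(Z_S, Z_T)$ estimate, and the demonic head is updated to predict $A_S$ while shrinking the estimated gap. The function name and signature are illustrative, and the adversarial updates of the domain critic itself (analogous to the WFC critic above, with weight clipping) are omitted.

```python
import torch.nn as nn


def demonic_da_step(h_a, domain_critic, opt, emb_src, a_src, emb_tgt, eta=1.0):
    """One step of Equation 6 on the demonic model h_a (trainable here):
    source cross-entropy plus eta times a critic estimate of W1(Z_S, Z_T)."""
    logits_src, z_src = h_a(emb_src)
    _, z_tgt = h_a(emb_tgt)
    # Kantorovich-Rubinstein estimate: mean critic score gap between domains.
    # The domain critic is trained to maximize this gap in alternating steps
    # (not shown); h_a is trained to minimize it, aligning the two domains.
    w1_hat = domain_critic(z_src).mean() - domain_critic(z_tgt).mean()
    loss = nn.CrossEntropyLoss()(logits_src, a_src) + eta * w1_hat
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```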
## 6. Experimental Framework

In this section, we present the setting of our experiments, that is, the data sets and metrics we consider.

### 6.1 Evaluation Metrics

We introduce here the metrics used to evaluate the performance of the models. For utility, we consider the balanced accuracy (Bal. Acc.) to handle class imbalance in the data. For fairness, we rely on the Equality of Opportunity recalled from Section 3.2 (cf. Equation 2). In our experiments, we consider binary sensitive attributes ($\mathcal{A} = \{0, 1\}$). For multi-class objectives (e.g., $\mathcal{Y} = \{1, \ldots, C\}$), one can aggregate EO scores over classes (formulated as the difference of true positive rates across sensitive groups). This measure is the TPR-parity (or TPR-GAP) score (De-Arteaga et al., 2019; Ravfogel et al., 2020), defined as follows:

$$\text{TPR-parity} = \sqrt{\sum_{c \in C} (\text{TPR}_{1,c} - \text{TPR}_{0,c})^2},$$

with $\text{TPR}_{0,c}$ and $\text{TPR}_{1,c}$ the true positive rates for class $c$ and sensitive groups 0 and 1, respectively. For clarity when comparing with the accuracy score, we consider

$$\text{Fairness} = (1 - \text{TPR-parity}) \times 100.$$

The Fairness score indicates a perfectly fair model when equal to 100, and an unfair one when equal to 0. Additionally, as fairness often requires determining a trade-off such that reaching equity does not degrade the general classification performance, Han et al. (2021c) proposed the Distance To Optimum (DTO) score. It measures the accuracy-fairness trade-off by computing the Euclidean distance from a model to a Utopia point (the point corresponding to the best accuracy and best fairness values across all the baselines). The goal is to minimize the DTO. Considering the Utopia point with coordinates $\{\text{accuracy}_u, \text{fairness}_u\}$ and the performance of a model at a given epoch $\{\text{accuracy}_m, \text{fairness}_m\}$, we have

$$\text{DTO} = \sqrt{(\text{fairness}_u - \text{fairness}_m)^2 + (\text{accuracy}_u - \text{accuracy}_m)^2}.$$

Finally, we consider the Leakage metric, which measures the accuracy of a classification model trained to predict the sensitive attribute $A$ from the latent representations ($Z$) of another model. Let us consider two models: a classification model $h$ that we want to evaluate, and another model $h_{leakage}$ trained to retrieve the sensitive information $A$ from the latent representations $Z_h$ of $h$. For a test set of size $n$, we compute

$$\text{Leakage} = \frac{100}{n} \sum_{i=1}^{n} \mathbb{1}_L(Z_{h_i}), \qquad \mathbb{1}_L(Z_h) = \begin{cases} 1 & \text{if } h_{leakage}(Z_h) = A, \\ 0 & \text{otherwise}. \end{cases}$$

It measures the fairness of the latent representations themselves and demonstrates representation unfairness when close to 100. We use the architecture presented in Shen et al. (2022b), that is, an MLP with one hidden layer of size 100, a ReLU activation function, and a constant learning rate of 0.001. The optimizer used is Adam.
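The sketch below implements the three scalar metrics as defined above (TPR-parity as reconstructed, Fairness, and DTO). It assumes NumPy arrays of true labels, predictions, and binary group memberships, and that every (class, group) cell of the test set is non-empty; function names are ours.

```python
import numpy as np


def tpr_parity(y_true, y_pred, a, classes):
    """Square root of the summed squared TPR gaps between groups 0 and 1."""
    gaps = []
    for c in classes:
        # TPR for class c within each sensitive group.
        tpr = [np.mean(y_pred[(y_true == c) & (a == g)] == c) for g in (0, 1)]
        gaps.append(tpr[1] - tpr[0])
    return np.sqrt(np.sum(np.square(gaps)))


def fairness_score(y_true, y_pred, a, classes):
    """100 = perfectly fair, 0 = maximally unfair."""
    return (1.0 - tpr_parity(y_true, y_pred, a, classes)) * 100.0


def dto(acc_m, fair_m, acc_u, fair_u):
    """Distance To Optimum: Euclidean distance to the Utopia point."""
    return np.hypot(fair_u - fair_m, acc_u - acc_m)
```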
### 6.2 Data Sets

We employ two widely used data sets to evaluate fairness in the context of text classification, building upon prior research (Ravfogel et al., 2020; Han et al., 2021b; Shen et al., 2022b). Both data sets are readily available in the fairlib library (Han et al., 2022).

**Bias in Bios (De-Arteaga et al., 2019).** This data set, referred to as the Bios data set in the rest of the paper, consists of brief biographies from the Common Crawl associated with occupations (28 in total) and genders (male or female). As per the partitioning prepared by Ravfogel et al. (2020), the training, validation, and test sets comprise 257,000, 40,000, and 99,000 samples, respectively.

**Moji (Blodgett et al., 2016).** This data set contains tweets written in either Standard American English (SAE) or African American English (AAE), annotated with positive or negative polarity. We use the data set prepared by Ravfogel et al. (2020), which includes 100,000 training examples, 8,000 validation examples, and 8,000 test examples. The target variable $Y$ represents the polarity, while the protected attribute corresponds to ethnicity, indicated by the AAE/SAE attribute.

## 7. Results and Discussion

In this section, we consider three experimental axes to illustrate our method: 1) in-domain experiments compared to state-of-the-art methods, 2) cross-domain experiments, and 3) analysis of the WFC method.

Table 1: Results on Moji. For baselines, results are drawn from Shen et al. (2022b). We report the mean ± standard deviation over 5 runs. * indicates the model without fairness consideration, and - indicates that we cannot access the result. The best results are in bold, and results in blue indicate the best results without fine-tuning BERT. Statistically significant differences with WFC are marked (Student's t-test).

| Model | Bal. Acc. | Fairness | DTO | Leakage |
|---|---|---|---|---|
| *CE | 72.3 ± 0.5 | 61.2 ± 1.4 | 31.0 | 87.9 ± 3.3 |
| INLP + BERTft | 73.3 ± 0.0 | 85.6 ± 0.0 | 8.49 | 86.7 ± 0.6 |
| Adv + BERTft | 75.6 ± 0.4 | 90.4 ± 1.1 | 4.03 | 78.8 ± 6.0 |
| Gate + BTEO + BERTft | 76.2 ± 0.3 | 90.1 ± 1.30 | 3.55 | 100.0 ± 0.0 |
| FairBatch + BERTft | 75.1 ± 0.6 | 90.6 ± 0.5 | 4.47 | 88.4 ± 0.4 |
| EOGLB + BERTft | 75.2 ± 0.2 | 90.1 ± 0.4 | 4.49 | 85.7 ± 1.2 |
| DAFair + BERTft | 79.5 ± 0.2 | 73.1 ± 1.1 | 18.3 | - |
| Adv | 74.5 ± 0.3 | 81.5 ± 2.0 | 11.1 | - |
| Gate + BTEO | 74.9 ± 0.2 | 86.2 ± 0.3 | 6.94 | - |
| Con_dp | 75.8 ± 0.3 | 88.1 ± 0.6 | 4.96 | 54.2 ± 0.9 |
| Con_eo | 74.1 ± 0.7 | 84.1 ± 3.0 | 9.08 | 80.1 ± 4.2 |
| WFC | 75.2 ± 0.1 | 91.4 ± 0.3 | 4.29 | 86.9 ± 0.2 |
| WFC_eo | 75.1 ± 0.1 | 91.0 ± 0.8 | 4.39 | 85.9 ± 0.2 |
| WFC + BTEO | 75.3 ± 0.1 | 91.1 ± 0.3 | 4.21 | 87.2 ± 0.5 |

### 7.1 Comparison with State-of-the-Art Methods

First, we compare our approach with state-of-the-art methods and different text encoders.

**Baselines** First, we consider *CE, that is, the architecture without any form of regularization. The other baselines are INLP (Ravfogel et al., 2020), the ADV method (Han et al., 2021b), FairBatch (Roh et al., 2021), GATE (Han et al., 2021c), EOGLB (Shen et al., 2022a), and Con, in its dp and eo versions (Shen et al., 2022b). Shen et al. (2022b) extend some of these methods by rebalancing classes during training (+ BTEO) or fine-tuning a BERT model in addition to the trainable MLP (+ BERTft). We also consider DAFair (Iskander et al., 2024) among our baselines due to its proximity with our work, as discussed in Section 2, and rerun their experiments with the same settings as the authors on the respective data sets, to ensure results comparable with respect to splits and seeds. Unless mentioned otherwise, the results of the other baselines are drawn from Han et al. (2022) and Shen et al. (2022b). We did not rerun them since we built our code on the fairlib library made available by the authors; as such, the base architectures, evaluation protocols, seeds, and data splits are identical.

**Setting** To compare our method against state-of-the-art approaches, we first use the representations generated by a base BERT model as input to the MLPs.

Table 2: Results on Bios. For baselines, results are drawn from Shen et al. (2022b). We report the mean ± standard deviation over 5 runs. * indicates the model without fairness consideration, and - indicates that we do not have access to these results. The best results are in bold, and results in blue indicate the best results without fine-tuning BERT. Statistically significant differences with WFC are marked (Student's t-test).

| Model | Bal. Acc. | Fairness | DTO | Leakage |
|---|---|---|---|---|
| *CE | 82.3 ± 0.2 | 85.1 ± 0.8 | 5.87 | 98.0 ± 0.0 |
| INLP + BERTft | 82.3 ± 0.0 | 88.6 ± 0.0 | 2.61 | 97.6 ± 0.1 |
| Adv + BERTft | 81.9 ± 0.2 | 90.6 ± 0.5 | 1.81 | 88.6 ± 4.6 |
| Gate + BTEO + BERTft | 83.7 ± 0.2 | 90.4 ± 0.9 | 0.40 | 100.0 ± 0.0 |
| FairBatch + BERTft | 82.2 ± 0.1 | 89.5 ± 1.3 | 1.98 | 98.0 ± 0.3 |
| EOGLB + BERTft | 81.7 ± 0.4 | 88.4 ± 1.0 | 3.12 | 97.2 ± 0.5 |
| DAFair + BERTft | 83.7 ± 0.1 | 86.4 ± 0.3 | 4.40 | - |
| Adv | 81.1 ± 0.1 | 87.3 ± 0.9 | 4.36 | - |
| Gate + BTEO | 79.4 ± 0.1 | 90.8 ± 0.2 | 4.30 | - |
| Con_dp | 82.1 ± 0.2 | 84.3 ± 0.8 | 6.69 | 76.3 ± 1.5 |
| Con_eo | 81.8 ± 0.3 | 85.2 ± 0.4 | 5.91 | 84.9 ± 3.4 |
| WFC | 82.4 ± 0.1 | 89.0 ± 0.3 | 2.22 | 96.5 ± 0.5 |
| WFC_eo | 82.1 ± 0.2 | 89.0 ± 0.2 | 2.42 | 97.4 ± 0.3 |
| WFC + BTEO | 82.3 ± 0.2 | 89.1 ± 0.3 | 2.20 | 96.7 ± 0.5 |
For Bios, the demonic MLP is trained on 1% of the training set and obtains 99% accuracy in predicting the sensitive attributes on the test set. Similarly, the demonic MLP obtains 88.5% accuracy on Moji. Except for the standard cross-entropy loss without a fairness constraint (*CE) and the DAFair baseline, which we run ourselves, we report results from Shen et al. (2022b) and Han et al. (2022), as mentioned in the Baselines paragraph. In our approach, embedding representations are derived from a fixed BERT model, with only the MLP weights being adjusted. We also evaluate the quality of our method under balanced training, as in Shen et al. (2022b).

**Discussion** We compare WFC with text classification baselines. For Moji (Table 1), the accuracy of WFC is higher than that of *CE and equivalent to that of the competitors. On the fairness metric, we outperform all baselines. Note that DAFair, related to our work through its use of the KL-divergence as a dependency measure, outperforms all baselines in terms of accuracy, with a limited gain in Fairness. For Bios (Table 2), our method is competitive with the other baselines and ranks 4th out of 12 with BTEO and 5th without it in terms of accuracy-fairness trade-off (DTO). Notably, WFC has the second-best accuracy among the baselines. Moreover, on both data sets, we obtain a Leakage similar to that of the comparable baselines; we further discuss this score in Section 7.3.1.

Note that BERT is not fine-tuned during our training pipeline. This decision is based on several factors: first, fine-tuning BERT increases training complexity and may hinder convergence. Additionally, not fine-tuning makes our method applicable to any encoder or decoder architecture, regardless of size. Even so, among the baselines without BERT fine-tuning, we reach the lowest DTO, comparable to those obtained with methods that fine-tune BERT. When comparing the versions of WFC optimizing EO or DP and rebalancing classes, we report close results for the three approaches. Noting a slightly better DTO for the version optimizing DP (WFC), we use this version in the other experiments. Despite the better DTO of WFC + BTEO, we do not choose it for the experiments in Sections 7.2 and 7.3, so as to evaluate the method without external influence. Ultimately, compared to the baselines, our method demonstrates notable advantages, particularly its ability to achieve competitive performance without access to sensitive attributes in the training set. We assess this capability in Section 7.2. In the next subsection, we explore an alternative model for generating the representations used by the classifier.

#### 7.1.1 Using a recent decoder-based model

**Setting** State-of-the-art baselines use BERT representations. However, recent PLMs have surpassed BERT's performance. Additionally, many modern embedding models are based on a decoder architecture.
Therefore, we assess the robustness of our method using representations from the SFR-Embedding-2_R model³ (Meng et al., 2024), built on the Mistral model (Jiang et al., 2023). This model ranked first on the MTEB benchmark⁴ (Muennighoff et al., 2022) as of July 8th, 2024, notably for the classification task. We conduct this set of experiments on the Bios data set and exclude the Moji data set, since we do not have access to the raw text and the embeddings depend on the DeepMoji model (Felbo et al., 2017). The demonic MLP is also trained on SFR-Embedding-2_R's representations. We compare our approach to the cross-entropy without regularization (*CE), as well as to the best baselines on BERT with respect to fairness and accuracy (GATE and ADV, respectively). The approaches are evaluated with and without balanced training (BTEO). We perform hyperparameter tuning for all methods as described in Appendix F.3.

3. https://huggingface.co/Salesforce/SFR-Embedding-2_R
4. https://huggingface.co/spaces/mteb/leaderboard

**Discussion** We evaluate the efficiency of our architecture when recent decoder-based models generate the embedding representations, and compare it with the best baselines from the BERT-encoding results. We perform this evaluation on the Bios data set, as explained above, and present the results in Table 3. We observe an improvement in both accuracy and fairness for all methods compared to the results with a BERT encoder. However, in this experiment, improving fairness comes at the cost of performance compared to the model without regularization (*CE). Among all baselines, ours enhances fairness with the smallest drop in performance. In contrast, the other baselines that improve fairness (GATE + BTEO, ADV, and ADV + BTEO) lead to a performance drop of up to one point.

Table 3: Results on Bios with SFR-Embedding-2_R representations. We report the mean ± standard deviation over 5 runs. * indicates the model without fairness consideration. The best results are in bold. Statistically significant differences with WFC are marked (Student's t-test).

| Model | Bal. Acc. | Fairness | DTO | Leakage |
|---|---|---|---|---|
| *CE | 85.5 ± 0.09 | 86.1 ± 0.36 | 6.63 | 97.9 ± 0.41 |
| GATE | 85.3 ± 0.23 | 83.5 ± 0.60 | 9.22 | 100.0 ± 0.01 |
| GATE + BTEO | 84.4 ± 0.14 | 92.7 ± 0.67 | 1.10 | 99.9 ± 0.13 |
| ADV | 84.8 ± 0.72 | 90.3 ± 0.40 | 2.49 | 89.1 ± 7.96 |
| ADV + BTEO | 84.3 ± 0.07 | 91.4 ± 0.41 | 1.74 | 86.2 ± 6.05 |
| WFC | 85.2 ± 0.02 | 90.0 ± 0.21 | 2.74 | 97.8 ± 0.41 |
| WFC + BTEO | 85.1 ± 0.06 | 90.0 ± 0.25 | 2.75 | 97.8 ± 0.34 |

### 7.2 Cross-Domain WFC

We consider two experiments to assess the transfer of sensitive attributes: with and without the domain adaptation procedure. We conduct these experiments on Bios, as other data sets with gender annotations are already available, unlike AAE/SAE data sets for Moji. The main objective of this section is to evaluate the performance of WFC when the demonic model is trained on sources other than the task data set.

#### 7.2.1 Zero-shot cross-domain demonic training

**Setting** We consider two source data sets to train the demonic MLP without domain adaptation. The EEC data set (Kiritchenko and Mohammad, 2018) consists of 8,640 synthetic English sentences for sentiment analysis. The Marked Personas (MP) data set (Cheng et al., 2023) is composed of 2,700 descriptions of individuals obtained through a generative procedure; we consider the dv2 version. We then evaluate the WFC pipeline with these demonic MLPs. When training on the EEC data set, we obtain, on average over 5 runs, 98.1% accuracy, and 98.4% on the MP data set.

Table 4: Comparison between several scenarios for training the demonic model for prediction on Bios. We report the mean ± standard deviation over 5 runs.

| Data | Bal. Acc. | Fairness | DTO | Leakage | Demonic Bal. Acc. |
|---|---|---|---|---|---|
| Bios 1% | 82.4 ± 0.1 | 89.0 ± 0.3 | 2.22 | 96.5 ± 0.5 | 99.0 |
| EEC | 82.2 ± 0.4 | 88.9 ± 0.4 | 2.42 | 97.5 ± 0.3 | 98.1 |
| MP | 82.4 ± 0.3 | 88.9 ± 0.4 | 2.30 | 96.4 ± 0.5 | 98.4 |

**Discussion** Table 4 shows that when the source and target data sets are similar, we achieve results comparable to those obtained when pre-training is performed on the same data set. The average loss in accuracy and fairness is minimal, with the standard deviations causing the measurements to overlap.
These results are promising for improving fairness, especially in situations where collecting sensitive data is not feasible or when only partial information is available. In the next subsection, we investigate the case where the divergence between the source and target is higher, and consider domain adaptation to train the demonic model.

#### 7.2.2 Demonic training with domain adaptation

**Setting and protocol** We begin by considering a variant of the MP data set for this experiment. A set of gendered words (listed in Appendix F.3.4) is removed from the texts to increase the divergence with the Bios data set. Next, we train a demonic model on this data set with regularization values $\eta \in \{0.5, 1, 2\}$ on the domain adaptation term in Equation 6, recalled below:

$$\arg\min \; \mathcal{L}(A_S, h_a(\text{Enc}(X_S))) + \eta\, W_1(Z_S, Z_T).$$

We run the pipeline for 15,000 epochs; at each epoch, the critic is trained on 20 batches and the model on 5 batches. We assess different values of the learning rate on a validation set and obtain an optimal learning rate of 1e-5. We also compare to the baseline, which consists of training the demonic model for 20 epochs on the source data set only, without any adaptation. For the baseline, the demonic model is optimized with the following objective:

$$\arg\min \; \mathcal{L}(a, h_a(z_{source})).$$

Finally, we run the WFC pipeline with the demonic model obtained as in the previous set of experiments.

Table 5: Performance of the demonic model trained with domain adaptation. Performance on the source is given for the best corresponding performance on the target set.

| Method | Accuracy on S | Accuracy on T |
|---|---|---|
| baseline | 65.3 ± 3.23 | 75.3 ± 13.9 |
| η = 0.5 | 75.0 ± 0.00 | 96.5 ± 0.94 |
| η = 1 | 81.3 ± 0.67 | 98.0 ± 0.24 |
| η = 2 | 75.0 ± 0.00 | 95.9 ± 0.27 |

**Cross-domain demonic performance** As shown in Table 5, the domain adaptation procedure significantly improves the performance of the demonic model on the sensitive attribute predictions when the domains diverge. Note that the value of $\eta$ matters: with a lower $\eta$, the adaptation may be too weak to align the domains, whereas with a higher $\eta$, the regularization term may overly dominate the classification term in the loss. Interestingly, in the case of gender, when the most common expressions of gender are removed from the source but remain in the target domain, the procedure also helps improve the performance of the demonic model on the source domain. Furthermore, we note better scores on the target domain than on the source. We hypothesize that the presence of gender markers in the target domain makes the task easier, inducing better results. Finally, it is interesting to note the variance of the baseline demonic model: while in some cases domain adaptation is not necessary, the procedure ensures an efficient demonic model regardless of the initial conditions of the optimization.

Table 6: Results on Bios with a demonic model trained with domain adaptation. We report the mean ± standard deviation over 5 runs. * indicates the model without fairness consideration.

| Model | Bal. Acc. | Fairness | DTO | Leakage | Demonic accuracy |
|---|---|---|---|---|---|
| *CE | 82.3 ± 0.20 | 85.1 ± 0.80 | 5.87 | 98.0 ± 0.00 | - |
| Baseline | 82.5 ± 0.05 | 86.8 ± 0.50 | 4.19 | 97.1 ± 0.44 | 75.3 ± 13.9 |
| η = 0.5 | 82.5 ± 0.02 | 87.4 ± 0.21 | 3.57 | 96.6 ± 0.36 | 96.5 ± 0.94 |
| η = 1.0 | 82.4 ± 0.09 | 88.7 ± 0.47 | 2.50 | 96.7 ± 0.31 | 98.0 ± 0.24 |
| η = 2.0 | 82.5 ± 0.06 | 87.2 ± 0.15 | 3.79 | 96.7 ± 0.10 | 95.9 ± 0.27 |

**WFC results with cross-domain demonic** Table 6 reports the results of the WFC pipeline on the Bios data set when using domain adaptation during the demonic training.
WFC results with cross-domain demonic   Table 6 reports the results of the WFC pipeline on the Bios data set when using domain adaptation during the demonic training.

Model      Bal. Acc.     Fairness      DTO    Leakage       Demonic accuracy
*CE        82.3 ± 0.20   85.1 ± 0.80   5.87   98.0 ± 0.00   -
Baseline   82.5 ± 0.05   86.8 ± 0.50   4.19   97.1 ± 0.44   75.3 ± 13.9
η = 0.5    82.5 ± 0.02   87.4 ± 0.21   3.57   96.6 ± 0.36   96.5 ± 0.94
η = 1.0    82.4 ± 0.09   88.7 ± 0.47   2.50   96.7 ± 0.31   98.0 ± 0.24
η = 2.0    82.5 ± 0.06   87.2 ± 0.15   3.79   96.7 ± 0.10   95.9 ± 0.27

Table 6: Results on Bios with a demonic trained with domain adaptation. We report the mean ± standard deviation over 5 runs. * indicates the model without fairness consideration.

We note that, thanks to the improved accuracy of the demonic model, the fairness on the end task is improved compared to both the pipeline without fairness consideration and the pipeline where the demonic model is trained without adaptation. With domain adaptation, the improved performance of the demonic model is reflected in the enhanced fairness. This experiment highlights the importance of an accurate demonic model, and the advantage of considering domain adaptation when training it on data sets that diverge from the end-task data set.

7.3 WFC Architecture Components Investigation

In this section, we first investigate the hyperparameter β, its impact on the fairness of the representations (Leakage), and on the fairness metrics related to the bounds from Lemma 1. Next, we study the use of the representations from different layers of the two MLPs (classifier and demonic models). Finally, we explore the use of the predicted sensitive attributes instead of the hidden representations of the demonic model, as done previously.

7.3.1 Impact of the hyperparameter β

Setting   In this experiment, we investigate the impact of the hyperparameter β associated with the regularization term. Recall that our objective is the following:

$$\arg\min \; \mathcal{L}(Y, h_y(\mathrm{Enc}(X_y))) + \beta \, I_W(Z_y, Z_a),$$

where β controls the impact of the Wasserstein Dependency Measure on the loss. We train the model over 5 seeds for different values of β; specifically, β ∈ {0.1, 1, 2, 5, 10, 100}.

        Bios                                                Moji
β       Acc.         Fair.        DTO    Leak.             Acc.         Fair.        DTO    Leak.
0.1     82.8 ± 0.1   87.2 ± 0.4   3.75   98.1 ± 0.2        50.4 ± 0.7   99.5 ± 1.1   27.08  85.7 ± 0.1
1.0     82.4 ± 0.1   89.0 ± 0.3   2.22   96.7 ± 0.5        75.2 ± 0.1   91.4 ± 0.3   1.00   86.9 ± 0.2
5.0     81.8 ± 0.2   88.9 ± 0.2   2.69   91.8 ± 1.4        71.4 ± 0.5   93.7 ± 0.4   5.38   81.1 ± 0.5
10.0    81.6 ± 0.2   88.6 ± 0.2   3.06   86.1 ± 0.8        70.1 ± 0.6   92.7 ± 0.4   6.21   82.5 ± 0.8
100.0   81.2 ± 0.4   87.9 ± 0.4   3.84   77.7 ± 1.7        67.9 ± 1.4   94.7 ± 1.1   8.9    83.0 ± 0.5

Table 7: Study of the impact of β. We report the mean ± standard deviation over 5 runs.

Discussion   First, we note in Table 7 that with a higher β, the Leakage decreases, meaning that the sensitive attribute is harder to retrieve from the latent representations. Although our initial aim is to improve the Fairness while maintaining the Accuracy of the model, our method can thus also be used to improve the Leakage by increasing the value of β in Equation 4. In other words, we give more importance to the Wasserstein regularization in the loss; as observed in Figure 2, increasing the importance of the regularization term yields a lower $I_W(Z_y, Z_a)$.5 However, on both data sets, the Accuracy that we want to preserve decreases, and the trade-off worsens as the Leakage gets better. In other words, reducing the Leakage makes it more challenging to retrieve the sensitive attributes, but can result in the unintended loss of information needed for the classification task, affecting the performance. Ultimately, we want to enhance fairness while keeping a good performance, and this objective does not necessarily match a strong Leakage improvement (Shen et al., 2022b). Indeed, Lohaus et al. (2022) show that training neural networks to satisfy fairness constraints (e.g., demographic parity) is often done by making the model more aware of the sensitive attributes, and that stronger fairness constraints make these attributes more recoverable from the model's internal states. As such, combining privacy and fairness remains an open challenge.

Finally, note that on the Moji data set, the performance for β = 0.1 is surprisingly low; this is due to the selection criterion used, the DTO. Indeed, when looking at the best results for this setting, we have an accuracy of 73.0 ± 0.0 for a fairness of 68.5 ± 0.0. This can be explained by the fact that the fairness regularization term is too weak to improve fairness on this data set, so that the results for the best accuracy are close to the CE baseline results (cf. Table 1). However, at initialization, with an inaccurate classifier, the fairness is very high; the optimal trade-off is thus obtained with these values. In the next subsection, we investigate the relation between the Wasserstein Dependency Measure of the latent representations and the fairness metrics for different values of β.

5. Note that the values are computed exactly using the POT library (Flamary et al., 2021).
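Expanding on footnote 5, the following minimal sketch illustrates how $I_W(Z_y, Z_a) = W_1(p(Z_y, Z_a), p(Z_y)p(Z_a))$ can be computed exactly on empirical samples with the POT library (Flamary et al., 2021), emulating the product of marginals by shuffling the $Z_a$ vectors as in Algorithm 1 (Appendix F.1). The synthetic data and variable names are illustrative assumptions.

import numpy as np
import ot  # POT: Python Optimal Transport (Flamary et al., 2021)

rng = np.random.default_rng(0)
n, d = 256, 300
# Synthetic stand-ins for the latent representations of a mini-batch.
z_y = rng.normal(size=(n, d))                       # task latents Z_y
z_a = 0.5 * z_y + rng.normal(size=(n, d))           # correlated demonic latents Z_a

dep = np.hstack([z_y, z_a])                         # samples from p(Z_y, Z_a)
ind = np.hstack([z_y, z_a[rng.permutation(n)]])     # samples from p(Z_y)p(Z_a)

M = ot.dist(dep, ind, metric='euclidean')           # ground costs between samples
w = np.full(n, 1.0 / n)                             # uniform empirical weights
iw = ot.emd2(w, w, M)                               # exact W1 between the empirical measures
print(f"I_W(Z_y, Z_a) estimate: {iw:.4f}")

Note that the resulting value is the exact Wasserstein distance between the two empirical measures, but only a plug-in estimate of the population quantity.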
7.3.2 Relation between fairness metrics and the regularization term

In this section, we empirically show the validity of the bounds from Lemma 1 on two data sets, Bios and Moji.

Setting   We train the WFC pipeline with different β values, as in Section 7.3.1, and report, in Figure 2, $I_W(Z_y, Z_a)$ on the training data and the EO (Figures 2a and 2b) and DP (Figures 2c and 2d) on the test set at every training epoch.

Discussion   We note that the more the loss is constrained by the regularization term, the lower $I_W(Z_y, Z_a)$ is, as well as the fairness metrics. However, after a certain threshold value for β (5.0 in the experiments), the fairness metrics converge. Finally, while in most cases $I_W(Z_y, Z_a)$ is greater than the considered metrics, as expected from Lemma 1 and Theorems 3 and 5, the contrary happens on a few epochs. This discrepancy with the expected theoretical relation arises because we only plot $I_W(Z_y, Z_a)$ rather than the full right-hand term.

[Figure 2: $I_W(Z_y, Z_a)$ and fairness metrics averaged over classes across training epochs, for several values of β, averaged over 5 runs. Panels: (a) Bios, Equality of Opportunity; (b) Moji, Equality of Opportunity; (c) Bios, Demographic Parity; (d) Moji, Demographic Parity.]

7.3.3 Use of representations from different layers

Setting   In the previous experiments, following the approaches presented in Han et al. (2022), the Wasserstein distance is approximated using the last hidden representations of the 3-layer MLPs. In this section, we explore the use of representations from different layers of the MLPs. We compare this approach, on both data sets, with the use of the first hidden representations of the MLPs and with the output logits (before the argmax), as illustrated in Figure 3. For the latter, the Wasserstein distance is estimated between distributions of different dimensions: for example, for Bios, the demonic MLP predicts 2 labels while the classification MLP predicts 28 labels.
Layer           Bal. Acc.     Fairness      DTO    Leakage
Bios
Last hidden     82.4 ± 0.1    89.0 ± 0.3    2.06   96.5 ± 0.5
First hidden    81.9 ± 0.2    86.7 ± 0.4    4.29   96.5 ± 0.6
Output layer    82.1 ± 0.6    87.5 ± 0.3    3.49   87.0 ± 1.1
Moji
Last hidden     75.2 ± 0.1    91.4 ± 0.3    1.17   86.9 ± 0.2
First hidden    74.3 ± 0.1    80.8 ± 1.0    11.4   85.6 ± 0.6
Output layer    73.5 ± 0.0    70.2 ± 0.2    21.9   64.5 ± 0.1

Table 8: Comparison between the use of the representations of different MLP layers to compute the Wasserstein distance.

[Figure 3: Schematic of the MLP layers: Input (|Emb|) → First Hidden (300) → Last Hidden (300) → Output (|C|).]

Labels            Bal. Acc.     Fairness      DTO    Leakage
Bios
Representations   82.4 ± 0.1    89.0 ± 0.3    2.06   96.5 ± 0.5
Hard labels       82.6 ± 0.2    87.5 ± 0.2    3.28   92.0 ± 0.2
Moji
Representations   75.2 ± 0.1    91.4 ± 0.3    1.17   86.9 ± 0.2
Hard labels       72.2 ± 0.1    65.0 ± 0.0    27.3   81.0 ± 0.8

Table 9: Comparison between the use of the representations Za and of the hard sensitive attributes to compute the Wasserstein distance.

Discussion   On both data sets (Table 8), the accuracy is rather stable regardless of the layer used to compute the Wasserstein distance. Still, the best results are obtained using the last hidden representations. However, while we note a slight decrease in fairness on Bios when using representations from other layers, the decrease becomes much more significant on Moji. Thus, using the last hidden layer is the best strategy.

7.3.4 Independence with predicted hard sensitive attributes

Setting   To assess the impact of using the representations $Z_a$, we replace $Z_a$ with the sensitive attributes predicted by the demonic MLP, $\hat{A}$. We consider the setting using the embeddings from BERT, with the Bios and Moji data sets. We then obtain the following regularization term: $I_W(Z_y, \hat{A}) = W_1(p(Z_y, \hat{A}), p(Z_y)p(\hat{A}))$. Note that we do not encounter the non-differentiability problem for $\hat{A}$ (caused by the argmax operation, as for $\hat{Y}$, mentioned in Section 4.3), since the demonic model is pre-trained.

Discussion   We report the results of this experiment in Table 9. When we replace $Z_a$ with the predicted $\hat{A}$ to compute the Wasserstein distance, we observe, on average, a slight improvement of the accuracy on Bios and a slight decrease of the accuracy on Moji. However, while the decrease in Fairness is not significant for Bios, we observe a substantial drop for Moji. As a result, using $\hat{A}$ instead of $Z_a$ has a neutral impact at best; in some cases, it may result in a reduction of both accuracy and fairness.

8. Conclusion

We extend WFC, a method enforcing fairness constraints using a neural network pre-trained on the sensitive attributes and a Wasserstein regularization. We show that minimizing the Wasserstein Dependency Measure ($I_W$) improves fairness by reducing the statistical dependence between predictions and sensitive attributes, linking it to key metrics such as Demographic Parity and Equality of Opportunity. Instead of directly optimizing $I_W$ between predictions and sensitive attributes, we apply it to the latent representations of two models: one predicting the classification labels and the other the sensitive attributes. We prove that this formulation provides an upper bound on the dependency measure between predictions and true sensitive attributes while ensuring computational feasibility. Specifically, the $I_W$ between latent representations upper-bounds the $I_W$ between predicted labels and predicted sensitive attributes, which in turn upper-bounds the $I_W$ between predictions and true sensitive attributes. Our method requires sensitive attribute annotations neither at training nor at inference time.
We obtain competitive results on the Bios data set, and we outperform the baselines on the fairness metrics while maintaining a comparable accuracy on the Moji data set. The approach is also compatible with both encoder-based and decoder-based architectures. We further extend our method to settings where sensitive attributes are unavailable, leveraging a domain adaptation approach to enable training under this constraint. We provide theoretical guarantees, inspired by domain adaptation results, to assess its generalization to other data sets.

Perspectives   Overall, this work could be extended in numerous ways. The bound presented in Theorem 4 relies on a covariate shift assumption. However, in the literature on causal vs. anticausal learning (Schölkopf et al., 2012), many NLP tasks can be viewed as anticausal, where the target label (e.g., an occupation) generates the observed text (e.g., the description of the occupation) (Jin et al., 2021). In such cases, a prior probability shift assumption, where the label distribution changes across domains but the feature distribution conditioned on the label remains invariant, would arguably be a natural setting to consider. When the sensitive attribute itself is considered as the label (e.g., gender), as is the case for our demonic, one could also argue for an anticausal relation, but in a weaker sense: the sensitive attribute influences some parts of the text (e.g., pronouns, names) without fully determining its content. Studying fairness-aware transfer under such partial forms of anticausality would be an interesting direction for future work.

In Theorem 5, we discussed the use of an optimal softmax temperature to obtain a sharper bound. It would be interesting to investigate whether requiring such a given temperature is an artifact of our proof technique or an actual necessity. Furthermore, if it turns out to be necessary, a possible avenue of research would be to draw connections with other research domains that rely on temperature scaling. Calibration is a prime example of such a field, where temperature scaling is a popular baseline that has been shown to perform well in different settings (Park et al., 2020; Wang et al., 2020; Chen and Su, 2023; Guo et al., 2017). Hence, it would be interesting to study in which cases the temperature derived in Theorem 5 would positively or negatively impact the level of calibration of the model. Finally, although we did not explore this direction, the approach could be extended beyond text classification, to tasks such as regression or unsupervised learning, or to other types of data such as images.

9. Limitations

The proposed approach is flexible and can handle various types of sensitive attributes. However, due to the lack of available data sets, we were unable to evaluate its performance on continuous sensitive attributes, such as age. Additionally, while gender can be represented as an n-ary variable, our experiments were limited to a binary classification (men vs. women) due to data availability. Finally, our experiments demonstrate the effectiveness of our approach in transferring sensitive attributes to improve fairness. However, our theoretical results indicate that the success of this transfer depends on its quality; a poor transfer could, in theory, lead to a decrease in fairness.

Acknowledgments

This work is funded by the French National Research Agency (ANR) in the context of the grant ANR-21-CE23-0026 (Project DIKE).
Michael Perrot is supported by the ANR through the grant ANR-23-CE23-0011-01 (Project FaCTor). Charlotte Laclau is supported by the ANR through the grant ANR-23-CE23-0026 (Project ReFAIR) and through the PEPR IA FOUNDRY (ANR-23-PEIA-0003). Our experiments use the previously mentioned Fairlib framework. We would like to express our gratitude to Xudong Han for his availability and assistance in using it.

Appendix A. Wasserstein Distance

Finding correspondences between two sets of points is a longstanding issue in machine learning. The optimal transport (OT) problem (Monge, 1781) offers an efficient solution to this issue by calculating an optimal one-to-one transport map between the two sets, taking into account the geometrical proximity of the points. Let $\hat{\mu}_0$ and $\hat{\mu}_1$ be measures supported on the point sets $X_0 = \{x_0^{(i)} \in \mathbb{R}^d\}_{i=1}^{N_0}$ and $X_1 = \{x_1^{(j)} \in \mathbb{R}^d\}_{j=1}^{N_1}$, respectively. We consider the Monge-Kantorovich formulation of the original OT problem (Kantorovich, 1942), where the goal is to find a coupling $\gamma$, defined as a joint probability distribution over $X_0 \times X_1$ with marginals $\hat{\mu}_0$ and $\hat{\mu}_1$. This amounts to minimizing the cost of transport with respect to some metric $\ell_p = \|\cdot\|_p : X_0 \times X_1 \to \mathbb{R}^+$, the $\ell_p$-norm. This problem admits a solution $\gamma^*$ and defines a metric on the space of probability measures, called the Wasserstein distance (also known as the Earth Mover's Distance), as follows:

$$W_1(\hat{\mu}_0, \hat{\mu}_1) = \min_{\gamma \in \Pi(\hat{\mu}_0, \hat{\mu}_1)} \langle M, \gamma \rangle_F,$$

where $\langle \cdot, \cdot \rangle_F$ is the Frobenius dot product, $M$ is a dissimilarity matrix, i.e., $M_{ij} = \ell(x_0^{(i)}, x_1^{(j)})$, defining the cost of associating $x_0^{(i)}$ with $x_1^{(j)}$, and $\Pi(\hat{\mu}_0, \hat{\mu}_1) = \{\gamma \in \mathbb{R}_+^{N_0 \times N_1} \mid \gamma \mathbf{1} = \hat{\mu}_0, \gamma^T \mathbf{1} = \hat{\mu}_1\}$ is a set of doubly stochastic matrices. In the following, we will rely on the following technical lemmas on the Wasserstein distance between discrete distributions.

Lemma 6   Let $U \sim p(U)$ and $V \sim p(V)$ be two discrete random variables respectively taking values in $u_1, \ldots, u_k$ and $v_1, \ldots, v_k$. Assume that

$$\|u_i - v_j\|_p = \begin{cases} 0 & \text{if } i = j \\ \sqrt[p]{2} & \text{otherwise,} \end{cases}$$

then we have that

$$W_1(p(U), p(V)) = \frac{\sqrt[p]{2}}{2} \sum_{i=1}^{k} |P(u_i) - P(v_i)|.$$

Proof   From Gibbs and Su (2002, Theorem 4), we have that

$$\min_{u \neq v} \|u - v\|_p \; TV(p(U), p(V)) \;\leq\; W_1(p(U), p(V)) \;\leq\; \max_{u, v} \|u - v\|_p \; TV(p(U), p(V)),$$

where $TV(p(U), p(V)) = \frac{1}{2} \sum_{i=1}^{k} |P(u_i) - P(v_i)|$ is the total variation. Noticing that, in our case, $\min_{u \neq v} \|u - v\|_p = \max_{u, v} \|u - v\|_p = \sqrt[p]{2}$ concludes the proof.

Lemma 7   Let $U \sim p(U)$, $V \sim p(V)$, and $W \sim p(W)$ be discrete random variables taking values in $\mathcal{U}$, $\mathcal{V}$, and $\mathcal{W}$ respectively, and such that

$$\|u - u'\|_p = \|v - v'\|_p = \|w - w'\|_p = \begin{cases} 0 & \text{if } u = u', v = v' \text{ or } w = w' \\ \sqrt[p]{2} & \text{otherwise,} \end{cases}$$

then we have that

$$W_1(p(U, W), p(U)p(W)) = \sum_{w} W_1(p(U|W = w), p(U)) \, P(W = w),$$
$$W_1(p(U, W), p(V, W)) = \sum_{w} W_1(p(U|W = w), p(V|W = w)) \, P(W = w),$$
$$W_1(p(U)p(W), p(V)p(W)) = \sum_{w} W_1(p(U), p(V)) \, P(W = w).$$

Proof   The cost matrix associated with $W_1(p(U, W), p(U)p(W))$ is of size $|\mathcal{U}||\mathcal{W}| \times |\mathcal{U}||\mathcal{W}|$. Assuming that we order the pairs $(u, w)$ by varying the values of $u$ first, that is $(u_1, w_1), (u_2, w_1), \ldots$, the cost matrix contains blocks of size $|\mathcal{U}| \times |\mathcal{U}|$. The diagonal blocks have value $\sqrt[p]{2}\,(\mathbf{1}_{|\mathcal{U}| \times |\mathcal{U}|} - I_{|\mathcal{U}| \times |\mathcal{U}|})$, where $I_{|\mathcal{U}| \times |\mathcal{U}|}$ is the identity matrix of size $|\mathcal{U}| \times |\mathcal{U}|$ and $\mathbf{1}_{|\mathcal{U}| \times |\mathcal{U}|}$ is a matrix of ones. The off-diagonal blocks have value $\sqrt[p]{2}\, I_{|\mathcal{U}| \times |\mathcal{U}|} + \sqrt[p]{4}\,(\mathbf{1}_{|\mathcal{U}| \times |\mathcal{U}|} - I_{|\mathcal{U}| \times |\mathcal{U}|})$. We have that, for all $w \in \mathcal{W}$, $\sum_u P(U = u, W = w) = P(W = w) = \sum_u P(U = u)P(W = w)$, which means that we can consider each diagonal block independently when computing $W_1(p(U, W), p(U)p(W))$, that is, compute $W_1(p(U|W = w), p(U))$ for each $w$ and then weight the transport cost by $P(W = w)$.
This will be the optimal cost, since the mass that is not transported with a cost of 0 will be transported with a cost of $\sqrt[p]{2}$, which is the smallest possible cost different from 0. This concludes the proof of the first equality. The proofs of the second and third equalities follow from the same arguments.

Appendix B. Connection with Group Fairness

The following lemma shows that minimizing the Wasserstein dependency measure is a sound way to improve either demographic parity or equality of opportunity.

Lemma 1 (Group fairness and Wasserstein dependency measure.)   Let $I_W$ be the Wasserstein dependency measure, and $A$, $Y$, $\hat{Y}$ be the random variables corresponding to the sensitive attribute, the true label, and the predicted label, respectively. We have that

$$I_W(\hat{Y}, A) = \frac{\sqrt[p]{2}}{2} \sum_{a \in \mathcal{A}} P(A = a) \sum_{y \in \mathcal{Y}} DP_{a,y},$$
$$I_W\big((\hat{Y} = Y)\,|\,Y = y, \; A\,|\,Y = y\big) = \sqrt[p]{2} \sum_{a \in \mathcal{A}} P(A = a\,|\,Y = y) \, EO_{a,y}.$$

Proof   Let $\hat{Y}$ and $A$ be the two random variables corresponding to the predicted label and the sensitive attribute, respectively. Recall that these random variables are encoded using one-hot vectors, that is,

$$\|y_i - y_j\|_p = \begin{cases} 0 & \text{if } i = j \\ \sqrt[p]{2} & \text{otherwise} \end{cases} \quad \text{and} \quad \|a_i - a_j\|_p = \begin{cases} 0 & \text{if } i = j \\ \sqrt[p]{2} & \text{otherwise.} \end{cases}$$

Then, by successively applying Lemma 7 and Lemma 6, we have that

$$I_W(\hat{Y}, A) := W_1(p(\hat{Y}, A), p(\hat{Y})p(A)) = \sum_{a \in \mathcal{A}} W_1(p(\hat{Y}|A = a), p(\hat{Y})) \, P(A = a) = \frac{\sqrt[p]{2}}{2} \sum_{a \in \mathcal{A}} P(A = a) \sum_{y \in \mathcal{Y}} \big| P(\hat{Y} = y|A = a) - P(\hat{Y} = y) \big|.$$

Noticing that $|P(\hat{Y} = y|A = a) - P(\hat{Y} = y)|$ is the demographic parity for group $a$ and label $y$ concludes the proof of the first statement. Similarly, notice that, given a label $y \in \mathcal{Y}$,

$$I_W\big((\hat{Y} = Y)|Y = y, A|Y = y\big) := W_1\big(p((\hat{Y} = Y), A|Y = y), \, p((\hat{Y} = Y)|Y = y)\,p(A|Y = y)\big) = \sum_{a \in \mathcal{A}} W_1\big(p((\hat{Y} = Y)|A = a, Y = y), \, p((\hat{Y} = Y)|Y = y)\big) \, P(A = a|Y = y)$$
$$= \frac{\sqrt[p]{2}}{2} \sum_{a \in \mathcal{A}} P(A = a|Y = y) \Big( \big| P(\hat{Y} = Y|A = a, Y = y) - P(\hat{Y} = Y|Y = y) \big| + \big| P(\hat{Y} \neq Y|A = a, Y = y) - P(\hat{Y} \neq Y|Y = y) \big| \Big)$$
$$= \sqrt[p]{2} \sum_{a \in \mathcal{A}} P(A = a|Y = y) \, \big| P(\hat{Y} = Y|A = a, Y = y) - P(\hat{Y} = Y|Y = y) \big|.$$

Noticing that $|P(\hat{Y} = Y|A = a, Y = y) - P(\hat{Y} = Y|Y = y)|$ is the equality of opportunity for group $a$ and label $y$ concludes the proof of the second statement.

Appendix C. Bounding I_W(Ŷ, A) by the error rate

In this section, we provide the details of the proof of Lemma 2, leading to Theorems 3 and 4.

Lemma 2   Let $\hat{Y}$, $\hat{A}$, $A$ be the random variables corresponding to the predicted label, the predicted sensitive attribute, and the true sensitive attribute, respectively. Then, we have that

$$I_W(\hat{Y}, A) \leq I_W(\hat{Y}, \hat{A}) + 2\sqrt[p]{2} \, P(\hat{A} \neq A).$$

Proof   Let $\hat{Y}$, $\hat{A}$ and $A$ be the random variables corresponding to the predicted label, predicted sensitive attribute, and true sensitive attribute, respectively. The Wasserstein Dependency Measure (Ozair et al., 2019) is

$$I_W(\hat{Y}, A) = W_1(p(\hat{Y}, A), p(\hat{Y})p(A)).$$

The $W_1$-metric can be shown to be a proper metric when the compared distributions have the same overall mass (Rubner et al., 2000). Therefore, it satisfies the triangle inequality:

$$I_W(\hat{Y}, A) := W_1(p(\hat{Y}, A), p(\hat{Y})p(A)) \leq W_1(p(\hat{Y}, A), p(\hat{Y}, \hat{A})) + W_1(p(\hat{Y}, \hat{A}), p(\hat{Y})p(\hat{A})) + W_1(p(\hat{Y})p(\hat{A}), p(\hat{Y})p(A)),$$

with $W_1(p(\hat{Y}, \hat{A}), p(\hat{Y})p(\hat{A})) = I_W(\hat{Y}, \hat{A})$. Recall that $\hat{Y}$, $\hat{A}$ and $A$ are encoded using one-hot vectors, that is,

$$\|y_i - y_j\|_p = \begin{cases} 0 & \text{if } i = j \\ \sqrt[p]{2} & \text{otherwise} \end{cases} \quad \text{and} \quad \|a_i - a_j\|_p = \begin{cases} 0 & \text{if } i = j \\ \sqrt[p]{2} & \text{otherwise.} \end{cases}$$
Then, by successively applying Lemma 7 and Lemma 6, we have that

$$W_1(p(\hat{Y}, A), p(\hat{Y}, \hat{A})) = \sum_{y \in \mathcal{Y}} W_1(p(A|\hat{Y} = y), p(\hat{A}|\hat{Y} = y)) \, P(\hat{Y} = y) = \frac{\sqrt[p]{2}}{2} \sum_{y \in \mathcal{Y}} P(\hat{Y} = y) \sum_{a \in \mathcal{A}} \big| P(A = a|\hat{Y} = y) - P(\hat{A} = a|\hat{Y} = y) \big|. \quad (7)$$

By the law of total probability and the union bound for disjoint events, we have that

$$\sum_{a \in \mathcal{A}} \big| P(A = a|\hat{Y} = y) - P(\hat{A} = a|\hat{Y} = y) \big| = \sum_{a \in \mathcal{A}} \big| P(A = a, \hat{A} = a|\hat{Y} = y) + P(A = a, \hat{A} \neq a|\hat{Y} = y) - P(\hat{A} = a, A = a|\hat{Y} = y) - P(\hat{A} = a, A \neq a|\hat{Y} = y) \big|$$
$$= \sum_{a \in \mathcal{A}} \big| P(A = a, \hat{A} \neq a|\hat{Y} = y) - P(\hat{A} = a, A \neq a|\hat{Y} = y) \big| \leq \sum_{a \in \mathcal{A}} \Big( P(A = a, \hat{A} \neq a|\hat{Y} = y) + P(\hat{A} = a, A \neq a|\hat{Y} = y) \Big)$$
$$= P\Big( \bigcup_{a \in \mathcal{A}} \{A = a, \hat{A} \neq a\} \,\Big|\, \hat{Y} = y \Big) + P\Big( \bigcup_{a \in \mathcal{A}} \{\hat{A} = a, A \neq a\} \,\Big|\, \hat{Y} = y \Big) = 2 \, P(A \neq \hat{A}|\hat{Y} = y).$$

Plugging this result into Equation (7), we obtain

$$W_1(p(\hat{Y}, A), p(\hat{Y}, \hat{A})) \leq \frac{\sqrt[p]{2}}{2} \sum_{y \in \mathcal{Y}} P(\hat{Y} = y) \, 2 P(A \neq \hat{A}|\hat{Y} = y) = \sqrt[p]{2} \, P(A \neq \hat{A}).$$

Using similar arguments, we obtain that

$$W_1(p(\hat{Y})p(A), p(\hat{Y})p(\hat{A})) = \frac{\sqrt[p]{2}}{2} \sum_{y \in \mathcal{Y}} P(\hat{Y} = y) \sum_{a \in \mathcal{A}} \big| P(A = a) - P(\hat{A} = a) \big| \leq \sqrt[p]{2} \, P(A \neq \hat{A}).$$

This concludes the proof of this lemma.

Building on this first result, we consider two scenarios to bound the error rate: either we pre-trained the demonic model on the data of the classification task (in-domain), or, as a domain adaptation problem, on different data with shared sensitive attributes (e.g., gender, ethnicity, etc.) (cross-domain).

C.1 In-domain bound of the error rate for binary sensitive attributes

Theorem 3   Let $\hat{A}, A \in \{0, 1\}$, and $\mathcal{H}$ be a hypothesis space of VC-dimension $d$. Let $\|\cdot\|_p$ be the ground metric for the Wasserstein-1 distance. Assume that we have access to a training set of $m$ i.i.d. examples. Then, with probability at least $1 - \delta$, we have, for every $h \in \mathcal{H}$,

$$I_W(\hat{Y}, A) \leq I_W(\hat{Y}, \hat{A}) + 2\sqrt[p]{2} \left( \hat{\varepsilon} + \sqrt{\frac{4}{m}\left( d \log \frac{2em}{d} + \log \frac{4}{\delta} \right)} \right),$$

with $e$, the base of the natural logarithm, and $\hat{\varepsilon}$, the empirical risk of the demonic model.

Proof   From Lemma 2, we derive the following:

$$I_W(\hat{Y}, A) \leq I_W(\hat{Y}, \hat{A}) + 2\sqrt[p]{2} \, P(\hat{A} \neq A) = I_W(\hat{Y}, \hat{A}) + 2\sqrt[p]{2} \, \varepsilon.$$

We apply the Vapnik-Chervonenkis theory (Vapnik, 1998) to bound the true error $\varepsilon$ of the demonic model by its empirical risk $\hat{\varepsilon}$. Let $h_a$ be a fixed classification function from $Z_a$ to $A$ and $\mathcal{H}$ be a hypothesis space of VC-dimension $d$. Therefore, if the training set consists of $m$ i.i.d. samples, with probability at least $1 - \delta$, we have, for every $h \in \mathcal{H}$,

$$I_W(\hat{Y}, A) \leq I_W(\hat{Y}, \hat{A}) + 2\sqrt[p]{2} \left( \hat{\varepsilon} + \sqrt{\frac{4}{m}\left( d \log \frac{2em}{d} + \log \frac{4}{\delta} \right)} \right),$$

where $e$ is the base of the natural logarithm.

Appendix D. Bounding I_W(Ŷ, Â) by I_W(Z_y, Z_a)

In this section, we present the proof of Theorem 5, recalled below.

Theorem 5   Let $\hat{Y}$, $\hat{A}$ be the random variables corresponding to the predicted label and the predicted sensitive attribute, respectively. Assume that $h_y = \sigma_\lambda(f(Z_y))$ and $h_a = \sigma_\lambda(g(Z_a))$, where $\sigma_\lambda$ is the softmax function with temperature $\lambda$, $f$ and $g$ are both $L$-Lipschitz with respect to the $p$-norm, and $Z_y$ and $Z_a$ are the latent representations of the examples. Let $\|\cdot\|_p$ be the ground metric for the Wasserstein-1 distance. For a given example $x$ with predicted label $\hat{y}$ and predicted sensitive attribute $\hat{a}$, let $\xi_y(x) = f(Z_y)_{\hat{y}} - \max_{y \neq \hat{y}} f(Z_y)_y$ and $\xi_a(x) = g(Z_a)_{\hat{a}} - \max_{a \neq \hat{a}} g(Z_a)_a$ be positive margins. Let $\delta = 1 - P(\xi_y(X) \geq \xi, \xi_a(X) \geq \xi)$ with $\xi > 0$. Let $\alpha = \sqrt[p]{2}\,(|\mathcal{Y}||\mathcal{A}| - 1)(1 - \delta)$ and $\iota = L(|\mathcal{Y}| + |\mathcal{A}|)$. Then, setting $\lambda = \frac{1}{\xi} \log \frac{2\xi\alpha}{\iota \, I_W(Z_y, Z_a)}$, we have that

$$I_W(\hat{Y}, \hat{A}) \leq \min\left\{ \alpha, \; \frac{\iota \, I_W(Z_y, Z_a)}{\xi} \left( 1 + \log \max\left\{ 4, \frac{2\xi\alpha}{\iota \, I_W(Z_y, Z_a)} \right\} \right) \right\}.$$

Proof   Since, in our case, the Wasserstein distance is a proper metric, we have that

$$I_W(\hat{Y}, \hat{A}) = W_1(p(\hat{Y}, \hat{A}), p(\hat{Y})p(\hat{A})) \leq W_1\big(p(\hat{Y}, \hat{A}), p(\sigma_\lambda(f(Z_y)), \sigma_\lambda(g(Z_a)))\big) + W_1\big(p(\sigma_\lambda(f(Z_y)), \sigma_\lambda(g(Z_a))), p(\sigma_\lambda(f(Z_y)))\,p(\sigma_\lambda(g(Z_a)))\big) + W_1\big(p(\sigma_\lambda(f(Z_y)))\,p(\sigma_\lambda(g(Z_a))), p(\hat{Y})p(\hat{A})\big). \quad (8)$$
We will first bound each term independently, and then show that we can choose the softmax temperature $\lambda$ in order to minimize the right-hand side of the bound.

Bounding $W_1(p(\sigma_\lambda(f(Z_y)), \sigma_\lambda(g(Z_a))), p(\sigma_\lambda(f(Z_y)))p(\sigma_\lambda(g(Z_a))))$.   Given $\gamma \in \Gamma$, a coupling between the two distributions, the second term can be bounded as

$$W_1\big(p(\sigma_\lambda(f(Z_y)), \sigma_\lambda(g(Z_a))), p(\sigma_\lambda(f(Z_y)))p(\sigma_\lambda(g(Z_a)))\big) = \inf_{\gamma} \mathbb{E}_{(z_y, z_a, z'_y, z'_a) \sim \gamma} \big\| (\sigma_\lambda(f(z_y)), \sigma_\lambda(g(z_a))) - (\sigma_\lambda(f(z'_y)), \sigma_\lambda(g(z'_a))) \big\|_p$$
$$\leq \inf_{\gamma} \mathbb{E}_{(z_y, z_a, z'_y, z'_a) \sim \gamma} \, L\lambda(|\mathcal{Y}| + |\mathcal{A}|) \, \big\| (z_y, z_a) - (z'_y, z'_a) \big\|_p \leq L\lambda(|\mathcal{Y}| + |\mathcal{A}|) \, W_1(p(Z_y, Z_a), p(Z_y)p(Z_a)) = L\lambda(|\mathcal{Y}| + |\mathcal{A}|) \, I_W(Z_y, Z_a),$$

where the first inequality comes from the $\lambda$-Lipschitzness of the softmax function with respect to the $\ell_2$-norm (Gao and Pavel, 2017), the $L$-Lipschitzness of $f$ and $g$, and the equivalence-of-norms properties.

Bounding $W_1(p(\hat{Y}, \hat{A}), p(\sigma_\lambda(f(Z_y)), \sigma_\lambda(g(Z_a))))$ and $W_1(p(\sigma_\lambda(f(Z_y)))p(\sigma_\lambda(g(Z_a))), p(\hat{Y})p(\hat{A}))$.   Let the softmax function be $\sigma_\lambda(f(z)) = \frac{e^{\lambda f(z)}}{\|e^{\lambda f(z)}\|_1}$ for $z$ a vector representation of an example $x$ and $\lambda \geq 0$ the temperature. Then, we have that

$$W_1\big(p(\hat{Y}, \hat{A}), p(\sigma_\lambda(f(Z_y)), \sigma_\lambda(g(Z_a)))\big) = W_1\big(p(\hat{Y})p(\hat{A}), p(\sigma_\lambda(f(Z_y)))p(\sigma_\lambda(g(Z_a)))\big) = \mathbb{E} \, c(X, X).$$

Indeed, for an example $x$ represented as $z_y, z_a$ with predictions $\hat{y}$ and $\hat{a}$, and an example $x'$ represented as $z'_y, z'_a$ with predictions $\hat{y}'$ and $\hat{a}'$, the cost matrix $c$ is such that

$$c(x, x') = \left\| (\hat{y}, \hat{a}) - \left( \frac{e^{\lambda f(z'_y)}}{\|e^{\lambda f(z'_y)}\|_1}, \frac{e^{\lambda g(z'_a)}}{\|e^{\lambda g(z'_a)}\|_1} \right) \right\|_p.$$

Thus, the minimal cost is achieved when each example is mapped onto itself, since the predictions are obtained by taking the labels and sensitive attributes predicted as being the most likely. We then have that

$$c(x, x)^p = \left( 1 - \frac{e^{\lambda f(z_y)_{\hat{y}}}}{\sum_y e^{\lambda f(z_y)_y}} \right)^p + \sum_{y \neq \hat{y}} \left( \frac{e^{\lambda f(z_y)_y}}{\sum_{y'} e^{\lambda f(z_y)_{y'}}} \right)^p + \left( 1 - \frac{e^{\lambda g(z_a)_{\hat{a}}}}{\sum_a e^{\lambda g(z_a)_a}} \right)^p + \sum_{a \neq \hat{a}} \left( \frac{e^{\lambda g(z_a)_a}}{\sum_{a'} e^{\lambda g(z_a)_{a'}}} \right)^p.$$

For a given example $x$ with predicted label $\hat{y}$ and predicted sensitive attribute $\hat{a}$, let $\xi_y(x) = f(z_y)_{\hat{y}} - \max_{y \neq \hat{y}} f(z_y)_y$ and $\xi_a(x) = g(z_a)_{\hat{a}} - \max_{a \neq \hat{a}} g(z_a)_a$ be positive margins. We note $m_y = f(z_y)_{\hat{y}}$ and $m_a = g(z_a)_{\hat{a}}$. Then, we have that

$$c(x, x)^p \leq 2\left( \frac{(|\mathcal{Y}| - 1)\, e^{\lambda(m_y - \xi_y)}}{e^{\lambda m_y} + e^{\lambda(m_y - \xi_y)}} \right)^p + 2\left( \frac{(|\mathcal{A}| - 1)\, e^{\lambda(m_a - \xi_a)}}{e^{\lambda m_a} + e^{\lambda(m_a - \xi_a)}} \right)^p = 2\left( \frac{|\mathcal{Y}| - 1}{e^{\lambda \xi_y} + 1} \right)^p + 2\left( \frac{|\mathcal{A}| - 1}{e^{\lambda \xi_a} + 1} \right)^p.$$

Let $\delta = 1 - P(\xi_y(X) \geq \xi, \xi_a(X) \geq \xi)$ with $\xi > 0$; then we have that

$$\mathbb{E} \, c(X, X) = \mathbb{E}\big[ c(X, X) \mid \xi_y(X) \geq \xi, \xi_a(X) \geq \xi \big] (1 - \delta) + \mathbb{E}\big[ c(X, X) \mid \neg(\xi_y(X) \geq \xi, \xi_a(X) \geq \xi) \big] \, \delta$$
$$\leq \frac{\sqrt[p]{2\big((|\mathcal{Y}| - 1)^p + (|\mathcal{A}| - 1)^p\big)}}{e^{\lambda \xi} + 1} (1 - \delta) + \sqrt[p]{2\big((|\mathcal{Y}| - 1)^p + (|\mathcal{A}| - 1)^p\big)} \, \delta \leq \frac{\sqrt[p]{2}\,(|\mathcal{Y}||\mathcal{A}| - 1)}{e^{\lambda \xi} + 1} (1 - \delta) + \sqrt[p]{2}\,(|\mathcal{Y}||\mathcal{A}| - 1) \, \delta,$$

where the last step uses $(|\mathcal{Y}| - 1)^p + (|\mathcal{A}| - 1)^p \leq (|\mathcal{Y}| + |\mathcal{A}| - 2)^p \leq (|\mathcal{Y}||\mathcal{A}| - 1)^p$.

Optimizing the softmax temperature.   Our goal is to minimize the right-hand side of Equation (8); since the remaining term does not depend on $\lambda$, we need to solve

$$\arg\inf_\lambda \; \frac{2\sqrt[p]{2}\,(|\mathcal{Y}||\mathcal{A}| - 1)(1 - \delta)}{e^{\lambda \xi} + 1} + \lambda L(|\mathcal{Y}| + |\mathcal{A}|) \, I_W(Z_y, Z_a).$$

Let $\alpha = \sqrt[p]{2}\,(|\mathcal{Y}||\mathcal{A}| - 1)(1 - \delta)$ and $\beta = L(|\mathcal{Y}| + |\mathcal{A}|) \, I_W(Z_y, Z_a) = \iota \, I_W(Z_y, Z_a)$, which are both positive values; then we consider

$$\arg\inf_\lambda \; \frac{2\alpha}{e^{\lambda \xi} + 1} + \lambda\beta.$$

Let $\gamma = \lambda\xi$; since $\xi > 0$ and $\alpha > 0$, then

$$\arg\inf_\lambda \; \frac{2\alpha}{e^{\gamma} + 1} + \lambda\beta = \frac{1}{\xi} \arg\inf_\gamma \; 2\alpha\left( \frac{1}{e^{\gamma} + 1} + \gamma \, \frac{\beta}{2\xi\alpha} \right).$$

Let $c = \frac{\beta}{2\xi\alpha} \geq 0$ by definition; then we solve

$$\arg\inf_\gamma \; \frac{1}{e^{\gamma} + 1} + c\gamma.$$

We can study this function by looking at the sign of its derivative. Setting the derivative equal to 0, we have

$$c - \frac{e^{\gamma}}{(e^{\gamma} + 1)^2} = 0 \iff c(e^{\gamma} + 1)^2 - e^{\gamma} = 0 \iff c e^{2\gamma} + (2c - 1)e^{\gamma} + c = 0.$$

With the change of variables $x = e^{\gamma}$, we solve $cx^2 + (2c - 1)x + c = 0$ and obtain the following discriminant: $\Delta = (2c - 1)^2 - 4c^2 = 1 - 4c$. In the following, we consider two cases.

If $c \geq \frac{1}{4}$, then $\Delta \leq 0$ and there is no root or a single root. Since $c \geq 0$, the gradient is always positive, which implies that the minimum is reached at $\gamma = 0$, that is, $\lambda = 0$. Therefore, in this case, the bound is equal to $\alpha$.

If $c < \frac{1}{4}$, then $\Delta > 0$ and we have $x = \frac{1 - 2c \pm \sqrt{1 - 4c}}{2c}$. Since $x = e^{\gamma}$ and $\gamma \geq 0$, then $x \geq 1$.
If $x = \frac{1 - 2c - \sqrt{1 - 4c}}{2c}$ and $x \geq 1$, then $1 - 4c \geq \sqrt{1 - 4c}$, which is impossible since $0 < c < \frac{1}{4}$ implies $0 < 1 - 4c < 1$. It implies that $x = \frac{1 - 2c + \sqrt{1 - 4c}}{2c}$, which corresponds to $\lambda^* = \frac{1}{\xi} \log\left( \frac{1 - 2c + \sqrt{1 - 4c}}{2c} \right)$. Then, we have

$$\lambda^* \leq \frac{1}{\xi} \log \frac{1}{c}.$$

Since we have an increasing function for $\lambda \geq \lambda^*$ and $\sqrt{1 - 4c} \leq 1$, we can consider $\bar{\lambda} = \frac{1}{\xi} \log \frac{1}{c}$. In this case, the bound becomes

$$\frac{2\alpha}{e^{\log(1/c)} + 1} + \frac{\beta}{\xi} \log \frac{1}{c} = \frac{2\alpha c}{1 + c} + \frac{\beta}{\xi} \log \frac{1}{c} \leq 2\alpha c \left( 1 + \log \frac{1}{c} \right) = \frac{\beta}{\xi} \left( 1 + \log \frac{2\xi\alpha}{\beta} \right).$$

Thus, we obtain the following bound:

$$I_W(\hat{Y}, \hat{A}) \leq \min\left\{ \alpha, \; \frac{\beta}{\xi} \left( 1 + \log \max\left\{ 4, \frac{2\xi\alpha}{\beta} \right\} \right) \right\},$$

where the left term of the minimization corresponds to the bound when $\frac{2\xi\alpha}{\beta} \leq 4$; otherwise, the bound is equal to the right term.

Appendix E. Details of the bounds under different sensitive attribute scenarios

We summarize the different scenarios of sensitive attributes and how adaptable our theoretical framework is in Table 10.

Table 10: Summary of the different possible scenarios for the sensitive attributes (SA).
- A ∈ {0, 1}: covered by Lemmas 1 & 2, Theorem 3, Theorem 4, and handled empirically.
- A ∈ {0, ..., K}: covered by Lemmas 1 & 2; by Theorem 3 with the Natarajan dimension; by Theorem 4 with a generalization of the H∆H-divergence (Sicilia et al., 2022); and handled empirically.
- Multiple SA: covered by considering the intersection of Y and A so that A ∈ {0, ..., K}.
- A ∈ R: covered with binning for Lemmas 1 & 2, Theorem 3, and Theorem 4, and handled empirically with regression.

Appendix F. Experimental details

F.1 WFC algorithm

In this section, we describe the full algorithm of WFC. Algorithm 1 provides the detailed algorithm for WFC used in our experiments.

F.2 Details when using the BERT encoder

In this section, we provide additional experimental details. Notably, we detail the architectures of the MLPs and give the optimal hyperparameters when the BERT model is used to obtain the initial representations.

F.2.1 MLP architecture

In Table 11a, we present the architectural details of the classifier MLP. We grid searched over the learning rate (lr ∈ {1e-5, 1e-4, 1e-3, 5e-5, 5e-4, 5e-3}), the number of training batches for classification per epoch (nd ∈ {5, 10, 20}), the value used to clip the weights to enforce the Lipschitz constraint (clipping value ∈ {0.001, 0.01, 0.1}), the parameter β ∈ {0.1, 0.5, 1, 2, 5, 10, 100}, and the layer used (first hidden, last hidden, or last layer).

F.2.2 Critic architecture

In Table 11b, we present the architectural details of the Critic, which is a simple multi-layer perceptron. We grid searched over the learning rate (lr ∈ {5e-5, 5e-4, 5e-3}).

(a) Hyperparameters of the classifying MLP:
Hyperparameter      Bios      Moji
input dimension     768       2304
hidden layers       2         2
hidden dimension    300       300
learning rate       1e-4      1e-5
batch size          128       128
epochs max          10000     10000
activation          TanH      TanH
β                   1         1
nc                  20        5
nd                  5         5
clipping value      0.01      0.01
layer used          last      last

(b) Hyperparameters of the Critic MLP:
Hyperparameter        Value
number hidden layers  1
hidden dimension      512
activation            ReLU
optimizer             Root Mean Square Propagation
learning rate         5e-5

Table 11: Hyperparameter details when using the BERT encoder.

Data: D = {(x_i, y_i, a_i)}_{i=1}^n the training set; n_e the number of epochs; n_c and n_d the number of training iterations per epoch for the critic and the classifier, respectively; a batch size n_b; two neural networks h_a(Enc(x)) and h_y(Enc(x); θ); a Critic C_ω; a weight β on the regularization.

for e = 1, ..., n_e do
    for t = 1, ..., n_c do
        Sample {(x_i, y_i, a_i)}_{i=1}^{n_b}
        Encode: z_a ← {h_a(Enc(x_i))}_{i=1}^{n_b}, z_y ← {h_y(Enc(x_i))}_{i=1}^{n_b}
        Concatenate the vectors to get Z_dep ← [z_{a,i}, z_{y,i}]_{i=1}^{n_b}
        Shuffle the z_{a,i} vectors; concatenate the vectors to get Z_ind ← [z_{a,i}, z_{y,i}]_{i=1}^{n_b}
        grad(ω) ← ∇_ω (1/n_b) (Σ_{i=1}^{n_b} C_ω(Z_dep,i) − Σ_{i=1}^{n_b} C_ω(Z_ind,i))
        ω ← Adam(ω; grad(ω))
    end
    for t = 1, ..., n_d do
        Sample {(x_i, y_i, a_i)}_{i=1}^{n_b}
        Encode: z_a ← {h_a(Enc(x_i))}_{i=1}^{n_b}, z_y ← {h_y(Enc(x_i))}_{i=1}^{n_b}
        Concatenate the vectors to get Z_dep ← [z_{a,i}, z_{y,i}]_{i=1}^{n_b}
        Shuffle the z_{a,i} vectors; concatenate the vectors to get Z_ind ← [z_{a,i}, z_{y,i}]_{i=1}^{n_b}
        L ← Σ_{i=1}^{n_b} L(y_i, h_y(Enc(x_i))) + β (Σ_{i=1}^{n_b} C_ω(Z_dep,i) − Σ_{i=1}^{n_b} C_ω(Z_ind,i))
        θ ← Adam(θ; ∇_θ (1/n_b) L)
    end
end

Algorithm 1: WFC Algorithm.
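To complement Algorithm 1, here is a minimal PyTorch sketch of its two inner loops (critic iterations, then classifier iterations), assuming precomputed frozen embeddings, a pre-trained and frozen demonic body h_a, and hyperparameter values taken from Table 11. All module names and the synthetic data are illustrative assumptions rather than our actual implementation: the critic ascends on the shuffled-pair estimate of I_W(Z_y, Z_a) under the weight-clipping constraint, while the classifier minimizes the cross-entropy plus β times the same estimate.

import torch
import torch.nn as nn

torch.manual_seed(0)
d_emb, d_hid, n_cls, n_b = 768, 300, 28, 128
n_e, n_c, n_d, beta, clip = 3, 20, 5, 1.0, 0.01
# Synthetic stand-ins for frozen text embeddings and task labels.
X, Y = torch.randn(1024, d_emb), torch.randint(0, n_cls, (1024,))

h_y = nn.Sequential(nn.Linear(d_emb, d_hid), nn.Tanh(),
                    nn.Linear(d_hid, d_hid), nn.Tanh())      # task body -> Z_y
head_y = nn.Linear(d_hid, n_cls)
h_a = nn.Sequential(nn.Linear(d_emb, d_hid), nn.Tanh())      # demonic body -> Z_a
for p in h_a.parameters():
    p.requires_grad_(False)                                  # demonic stays frozen
critic = nn.Sequential(nn.Linear(2 * d_hid, 512), nn.ReLU(), nn.Linear(512, 1))

opt_theta = torch.optim.Adam(list(h_y.parameters()) + list(head_y.parameters()), lr=1e-4)
opt_omega = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
ce = nn.CrossEntropyLoss()

def gap(z_y, z_a):
    # mean C(Z_dep) - mean C(Z_ind): dependent pairs vs shuffled pairs.
    z_dep = torch.cat([z_a, z_y], dim=1)
    z_ind = torch.cat([z_a[torch.randperm(z_a.size(0))], z_y], dim=1)
    return critic(z_dep).mean() - critic(z_ind).mean()

def sample_batch():
    idx = torch.randint(0, X.size(0), (n_b,))
    return X[idx], Y[idx]

for e in range(n_e):
    for _ in range(n_c):                     # critic iterations
        emb, _ = sample_batch()
        with torch.no_grad():
            z_y, z_a = h_y(emb), h_a(emb)
        loss_c = -gap(z_y, z_a)              # gradient ascent on the gap
        opt_omega.zero_grad(); loss_c.backward(); opt_omega.step()
        for p in critic.parameters():
            p.data.clamp_(-clip, clip)       # Lipschitz constraint (clipping value 0.01)
    for _ in range(n_d):                     # classifier iterations
        emb, y = sample_batch()
        z_y, z_a = h_y(emb), h_a(emb)
        loss = ce(head_y(z_y), y) + beta * gap(z_y, z_a)
        opt_theta.zero_grad(); loss.backward(); opt_theta.step()

As in Algorithm 1, the same shuffled-pair estimate is used for both updates: the critic maximizes it to estimate the Wasserstein dependency, while the classifier minimizes it jointly with the classification loss.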
F.3 Details when using SFR-Embedding-2_R

F.3.1 MLP architecture

In Table 12a, we present the architectural details of the classifier MLP when the embeddings are produced by SFR-Embedding-2_R. We grid searched over the learning rate (lr ∈ {3e-7, 3e-6, 3e-5, 3e-3, 3e-1}), the number of training batches per epoch for classification and for the Critic training (nd, nc ∈ {5, 10, 20}), and the hidden layer dimension ({100, 300, 900}).

F.3.2 Critic architecture

In Table 12b, we present the architectural details of the Critic for the task using SFR-Embedding-2_R. We grid searched over the learning rate (lr ∈ {3e-7, 3e-6, 3e-5, 3e-3, 3e-1}).

(a) Hyperparameters of the classifying MLP:
Hyperparameter      Value
input dimension     4096
hidden layers       1
hidden dimension    300
learning rate       3e-5
batch size          128
epochs max          10000
activation          TanH
β                   1
nc                  20
nd                  10
clipping value      0.01
layer used          last

(b) Hyperparameters of the Critic MLP:
Hyperparameter        Value
number hidden layers  1
hidden dimension      512
activation            ReLU
optimizer             Root Mean Square Propagation
learning rate         3e-6

Table 12: Hyperparameter details for SFR-Embedding-2_R.

F.3.3 Baselines hyperparameters

We select the best hyperparameters for the baselines for the classification of the representations generated by the SFR-Embedding-2_R model. Following Shen et al. (2022b), we first determine the optimal hyperparameters of the classification models and keep those hyperparameters fixed when searching for the method-specific best hyperparameters. We tune the learning rate (lr ∈ {3e-1, 3e-2, 3e-3, 3e-4, 3e-5}) and the hidden dimension ({100, 300, 900}). For the ADV baseline, we use 3 adversaries and consider several values for the following hyperparameters: adv diverse lambda ∈ {1e-1, 1e-2, 1e-3, 1e-4} and adv lambda ∈ {0.3, 0.5, 1, 2}. Values in bold are the selected ones. When BTEO is used, the hyperparameters are set to EO for BTObj and Resampling for BT, as in Shen et al. (2022b). Finally, the embedding size is 4096, the batch size is 1024, and we set a patience of 10 for the early stopping.

F.3.4 Details for Cross-domain WFC

In this section, we explain how we build the data set used for the cross-domain experiment to increase the divergence with the Bios data set. To do so, we remove a set of words from the MP data set with regard to the sensitive attribute gender. The words included in the set are the following: "he", "him", "his", "himself", "Mr.", "Sir", "Lord", "King", "Prince", "man", "boy", "gentleman", "father", "son", "husband", "brother", "uncle", "nephew", "king", "prince", "she", "her", "hers", "herself", "Mrs.", "Ms.", "Miss", "Lady", "Dame", "Queen", "Princess", "woman", "girl", "lady", "mother", "daughter", "wife", "sister", "aunt", "niece", "queen", "princess".

References

Civil Rights Act. Civil rights act of 1964. Title VII, Equal Employment Opportunities, 1964.

Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations, 2017.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks.
In ICML, pages 214-223. PMLR, 2017.

James Atwood, Nino Scherrer, Preethi Lahoti, Ananth Balashankar, Flavien Prost, and Ahmad Beirami. Inducing group fairness in prompt-based language model decisions. 2024.

Pranjal Awasthi, Matthäus Kleindessner, and Jamie Morgenstern. Equalized odds postprocessing under imperfect group information. In International Conference on Artificial Intelligence and Statistics, pages 1770-1780. PMLR, 2020.

Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning: Limitations and Opportunities. MIT Press, 2023.

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79:151-175, 2010.

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In FAccT, pages 610-623, 2021.

Alex Beutel, Jilin Chen, Zhe Zhao, and Ed H. Chi. Data decisions and theoretical implications when adversarially learning fair representations. arXiv preprint arXiv:1707.00075, 2017.

Su Lin Blodgett, Lisa Green, and Brendan O'Connor. Demographic dialectal variation in social media: A case study of African-American English. arXiv preprint arXiv:1608.08868, 2016.

Laura Cabello, Anna Katrine Jørgensen, and Anders Søgaard. On the independence of association bias and empirical fairness in language models. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, FAccT '23, pages 370-378, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400701924. doi: 10.1145/3593013.3594004. URL https://doi.org/10.1145/3593013.3594004.

Simon Caton and Christian Haas. Fairness in machine learning: A survey. arXiv preprint arXiv:2010.04053, 2020.

Simon Caton and Christian Haas. Fairness in machine learning: A survey. ACM Comput. Surv., 56(7), April 2024. ISSN 0360-0300. doi: 10.1145/3616865. URL https://doi.org/10.1145/3616865.

L. Elisa Celis, Lingxiao Huang, Vijay Keswani, and Nisheeth K. Vishnoi. Classification with fairness constraints: A meta-algorithm with provable guarantees. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* '19, pages 319-328, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450361255. doi: 10.1145/3287560.3287586. URL https://doi.org/10.1145/3287560.3287586.

Junyi Chai, Taeuk Jang, and Xiaoqian Wang. Fairness without demographics through knowledge distillation. Advances in Neural Information Processing Systems, 35:19152-19164, 2022.

Jiahao Chen and Bing Su. Transfer knowledge from head to tail: Uncertainty calibration under long-tailed distribution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19978-19987, 2023.

Myra Cheng, Esin Durmus, and Dan Jurafsky. Marked personas: Using natural language prompts to measure stereotypes in language models. arXiv preprint arXiv:2305.18189, 2023.

Pengyu Cheng, Weituo Hao, Siyang Yuan, Shijing Si, and Lawrence Carin. FairFil: Contrastive neural debiasing method for pretrained text encoders, 2021.

Zhibo Chu, Zichong Wang, and Wenbin Zhang. Fairness in large language models: A taxonomic survey. arXiv preprint arXiv:2404.01349, 2024.

Evgenii Chzhen, Christophe Denis, Mohamed Hebiri, Luca Oneto, and Massimiliano Pontil. Fair regression via plug-in estimator and recalibration with statistical guarantees.
Advances in Neural Information Processing Systems, 33:19137-19148, 2020.

Amanda Coston, Karthikeyan Natesan Ramamurthy, Dennis Wei, Kush R. Varshney, Skyler Speakman, Zairah Mustahsan, and Supriyo Chakraborty. Fair transfer learning with missing protected attributes. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 91-98, 2019.

Nicolas Courty, Rémi Flamary, and Devis Tuia. Domain adaptation with regularized optimal transport. In ECML PKDD, pages 274-289. Springer, 2014.

Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. Bias in bios: A case study of semantic representation bias in a high-stakes setting. In FAccT, pages 120-128, 2019.

Dina Demner-Fushman, Wendy W. Chapman, and Clement J. McDonald. What can natural language processing do for clinical decision support? Journal of Biomedical Informatics, 42(5):760-772, 2009.

Christophe Denis, Romuald Elie, Mohamed Hebiri, and François Hu. Fairness guarantees in multi-class classification with demographic parity. Journal of Machine Learning Research, 25(130):1-46, 2024. URL http://jmlr.org/papers/v25/23-0322.html.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171-4186, 2019.

Kush Dubey. Evaluating the fairness of task-adaptive pretraining on unlabeled test data before few-shot text classification. In Dieuwke Hupkes, Verna Dankers, Khuyagbaatar Batsuren, Amirhossein Kazemnejad, Christos Christodoulopoulos, Mario Giulianelli, and Ryan Cotterell, editors, Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP, pages 1-26, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.genbench-1.1. URL https://aclanthology.org/2024.genbench-1.1.

Jannik Dunkelau. Fairness-aware machine learning: An extensive overview. 2020. URL https://api.semanticscholar.org/CorpusID:237483522.

Yanai Elazar and Yoav Goldberg. Adversarial removal of demographic attributes from text data. In Proceedings of EMNLP. Association for Computational Linguistics, 2018.

European Parliament and Council of the European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council, 2016. URL https://data.europa.eu/eli/reg/2016/679/oj.

European Parliament and Council of the European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council. 2024.

Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1615-1625, 2017.

Susan T. Fiske. Prejudices in cultural contexts: Shared stereotypes (gender, age) versus variable stereotypes (race, ethnicity, religion). Perspectives on Psychological Science, 12(5):791-799, 2017.

Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J.
Sutherland, Romain Tavenard, Alexander Tong, and Titouan Vayer. POT: Python optimal transport. Journal of Machine Learning Research, 22(78):1-8, 2021. URL http://jmlr.org/papers/v22/20-451.html.

Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A. Poggio. Learning with a Wasserstein loss. NeurIPS, 28, 2015.

Bolin Gao and Lacra Pavel. On the properties of the softmax function with application in game theory and reinforcement learning. arXiv preprint arXiv:1704.00805, 2017.

Alison L. Gibbs and Francis Edward Su. On choosing and bounding probability metrics. International Statistical Review / Revue Internationale de Statistique, 70(3):419-435, 2002. ISSN 03067734, 17515823. URL http://www.jstor.org/stable/1403865.

Paula Gordaliza, Eustasio Del Barrio, Fabrice Gamboa, and Jean-Michel Loubes. Obtaining fairness using optimal transport theory. In ICML, pages 2357-2365. PMLR, 2019.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. In Proceedings of the International Conference on Neural Information Processing Systems, NIPS'17, pages 5769-5779. Curran Associates Inc., 2017. ISBN 9781510860964.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321-1330. PMLR, 2017.

Maya Gupta, Andrew Cotter, Mahdi Milani Fard, and Serena Wang. Proxy fairness. arXiv preprint arXiv:1806.11212, 2018.

Umang Gupta, Aaron M. Ferber, Bistra Dilkina, and Greg Ver Steeg. Controllable guarantees for fair outcomes via contrastive information estimation. Proceedings of the AAAI Conference on Artificial Intelligence, 35(9):7610-7619, May 2021. doi: 10.1609/aaai.v35i9.16931. URL https://ojs.aaai.org/index.php/AAAI/article/view/16931.

Xudong Han, Timothy Baldwin, and Trevor Cohn. Decoupling adversarial training for fair NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 471-477, Online, August 2021a. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-acl.41. URL https://aclanthology.org/2021.findings-acl.41.

Xudong Han, Timothy Baldwin, and Trevor Cohn. Diverse adversaries for mitigating bias in training. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2760-2765, Online, April 2021b. Association for Computational Linguistics. doi: 10.18653/v1/2021.eacl-main.239. URL https://aclanthology.org/2021.eacl-main.239.

Xudong Han, Timothy Baldwin, and Trevor Cohn. Balancing out bias: Achieving fairness through balanced training. arXiv preprint arXiv:2109.08253, 2021c.

Xudong Han, Aili Shen, Yitong Li, Lea Frermann, Timothy Baldwin, and Trevor Cohn. Fairlib: A unified framework for assessing and improving fairness. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 60-71, 2022.

Xudong Han, Timothy Baldwin, and Trevor Cohn. Fair enough: Standardizing evaluation and model selection for fairness research in NLP. In Andreas Vlachos and Isabelle Augenstein, editors, Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 297-312, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.eacl-main.23. URL https://aclanthology.org/2023.eacl-main.23.
Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. NeurIPS, 29, 2016.

Max Hort, Zhenpeng Chen, Jie M. Zhang, Mark Harman, and Federica Sarro. Bias mitigation for machine learning classifiers: A comprehensive survey. ACM J. Responsib. Comput., 1(2), June 2024. doi: 10.1145/3631326. URL https://doi.org/10.1145/3631326.

Ben Hutchinson, Vinodkumar Prabhakaran, Emily Denton, Kellie Webster, Yu Zhong, and Stephen Denuyl. Social biases in NLP models as barriers for persons with disabilities. In ACL, pages 5491-5501, 2020.

Shadi Iskander, Kira Radinsky, and Yonatan Belinkov. Leveraging prototypical representations for mitigating social bias without demographic information. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 379-390, Mexico City, Mexico, June 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.naacl-short.33.

Taeuk Jang and Xiaoqian Wang. FADES: Fair disentanglement with sensitive relevance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12067-12076, June 2024.

Mariana Jatobá, Juliana Santos, Ives Gutierriz, Daniela Moscon, Paula Odete Fernandes, and João Paulo Teixeira. Evolution of artificial intelligence research in human resources. Procedia Computer Science, 164:137-142, 2019.

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.

Ray Jiang, Aldo Pacchiano, Tom Stepleton, Heinrich Jiang, and Silvia Chiappa. Wasserstein fair classification. In Uncertainty in Artificial Intelligence, pages 862-872. PMLR, 2020.

Zhijing Jin, Julius von Kügelgen, Jingwei Ni, Tejas Vaidhya, Ayush Kaushal, Mrinmaya Sachan, and Bernhard Schölkopf. Causal direction of data collection matters: Implications of causal and anticausal learning for NLP. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP, pages 9499-9513. Association for Computational Linguistics, 2021.

Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. Fairness-aware classifier with prejudice remover regularizer. In Machine Learning and Knowledge Discovery in Databases, pages 35-50, 2012. ISBN 978-3-642-33486-3.

Leonid Kantorovich. On the translocation of masses. In C.R. (Doklady) Acad. Sci. URSS (N.S.), volume 37(10), pages 199-201, 1942.

Patrik Joslin Kenfack, Samira Ebrahimi Kahou, and Ulrich Aïvodji. Fairness under demographic scarce regime. arXiv preprint arXiv:2307.13081, 2023.

Svetlana Kiritchenko and Saif M. Mohammad. Examining gender and race bias in two hundred sentiment analysis systems, 2018.

Charlotte Laclau, Ievgen Redko, Manvi Choudhary, and Christine Largeron. All of the fairness for edge prediction with optimal transport. In AISTATS, pages 1774-1782. PMLR, 2021.

Preethi Lahoti, Alex Beutel, Jilin Chen, Kang Lee, Flavien Prost, Nithum Thain, Xuezhi Wang, and Ed Chi. Fairness without demographics through adversarially reweighted learning. Advances in Neural Information Processing Systems, 33:728-740, 2020.
Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. NV-Embed: Improved techniques for training LLMs as generalist embedding models. arXiv preprint arXiv:2405.17428, 2024.

Thibaud Leteno, Antoine Gourru, Charlotte Laclau, Rémi Emonet, and Christophe Gravier. Fair text classification with Wasserstein independence. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 15790-15803, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.978. URL https://aclanthology.org/2023.emnlp-main.978/.

Yingji Li, Mengnan Du, Rui Song, Xin Wang, and Ying Wang. A survey on fairness in large language models. arXiv preprint arXiv:2308.10149, 2023.

Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, and Ruslan Salakhutdinov. Towards understanding and mitigating social biases in language models. In ICML, pages 6565-6576. PMLR, 2021.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

Francesco Locatello, Gabriele Abbati, Thomas Rainforth, Stefan Bauer, Bernhard Schölkopf, and Olivier Bachem. On the fairness of disentangled representations. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/1b486d7a5189ebe8d8c46afc64b0d1b4-Paper.pdf.

Michael Lohaus, Matthäus Kleindessner, Krishnaram Kenthapadi, Francesco Locatello, and Chris Russell. Are two heads the same as one? Identifying disparate treatment in fair neural networks. Advances in Neural Information Processing Systems, 35:16548-16562, 2022.

David Madras, Elliot Creager, Toniann Pitassi, and Richard Zemel. Learning adversarially fair and transferable representations. In International Conference on Machine Learning, pages 3384-3393. PMLR, 2018.

David McAllester and Karl Stratos. Formal limitations on the measurement of mutual information. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty-Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 875-884. PMLR, 26-28 Aug 2020. URL https://proceedings.mlr.press/v108/mcallester20a.html.

Daniel McNamara, Cheng Soon Ong, and Robert C. Williamson. Provably fair representations. arXiv preprint arXiv:1710.04394, 2017.

Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. SFR-Embedding-2: Advanced text embedding with multi-stage training, 2024. URL https://huggingface.co/Salesforce/SFR-Embedding-2_R.

Gaspard Monge. Mémoire sur la théorie des déblais et des remblais. Histoire de l'Académie Royale des Sciences, pages 666-704, 1781.

Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316, 2022. doi: 10.48550/ARXIV.2210.07316. URL https://arxiv.org/abs/2210.07316.

Junhyun Nam, Hyuntak Cha, Sungsoo Ahn, Jaeho Lee, and Jinwoo Shin. Learning from failure: De-biasing classifier from biased classifier. Advances in Neural Information Processing Systems, 33:20673-20684, 2020.

Balas K. Natarajan.
On learning sets and functions. Machine Learning, 4:67-97, 1989.

Sherjil Ozair, Corey Lynch, Yoshua Bengio, Aaron Van den Oord, Sergey Levine, and Pierre Sermanet. Wasserstein dependency measure for representation learning. Advances in Neural Information Processing Systems, 32, 2019.

Sangdon Park, Osbert Bastani, James Weimer, and Insup Lee. Calibrated prediction with covariate shift via unsupervised domain adaptation. In International Conference on Artificial Intelligence and Statistics, pages 3219-3229. PMLR, 2020.

Shauli Ravfogel, Yanai Elazar, Hila Gonen, Michael Twiton, and Yoav Goldberg. Null it out: Guarding protected attributes by iterative nullspace projection. In Proceedings of the Annual Meeting of ACL, pages 7237-7256. ACL, 2020.

Laurent Risser, Alberto González Sanz, Quentin Vincenot, and Jean-Michel Loubes. Tackling algorithmic bias in neural-network classifiers using Wasserstein-2 regularization. Journal of Mathematical Imaging and Vision, 64(6):672-689, 2022.

Gabriel Roccabruna, Massimo Rizzoli, and Giuseppe Riccardi. Will LLMs replace the encoder-only models in temporal relation classification? In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 20402-20415, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.1136. URL https://aclanthology.org/2024.emnlp-main.1136/.

Yuji Roh, Kangwook Lee, Steven Euijong Whang, and Changho Suh. FairBatch: Batch selection for model fairness. In International Conference on Learning Representations, 2021.

Qian Ruan, Ilia Kuznetsov, and Iryna Gurevych. Are large language models good classifiers? A study on edit intent classification in scientific document revisions. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15049-15067, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.839. URL https://aclanthology.org/2024.emnlp-main.839.

Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. The earth mover's distance as a metric for image retrieval. Int. J. Comput. Vision, 40(2):99-121, 2000. ISSN 0920-5691. doi: 10.1023/A:1026543900054.

Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris M. Mooij. On causal and anticausal learning. In Proceedings of the International Conference on Machine Learning, ICML, 2012.

Candice Schumann, Xuezhi Wang, Alex Beutel, Jilin Chen, Hai Qian, and Ed H. Chi. Transfer of machine learning fairness across domains. arXiv preprint arXiv:1906.09688, 2019.

Aili Shen, Xudong Han, Trevor Cohn, Timothy Baldwin, and Lea Frermann. Optimising equal opportunity fairness in model training. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4073-4084. Association for Computational Linguistics, 2022a.

Aili Shen, Xudong Han, Trevor Cohn, Timothy Baldwin, and Lea Frermann. Does representational fairness imply empirical fairness? In Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022, pages 81-95, 2022b.

Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. Wasserstein distance guided representation learning for domain adaptation.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Anthony Sicilia, Katherine Atwell, Malihe Alikhani, and Seong Jae Hwang. PAC-Bayesian domain adaptation bounds for multiclass learners. In Uncertainty in Artificial Intelligence, pages 1824-1834. PMLR, 2022.

Silvia Chiappa, Ray Jiang, Tom Stepleton, Aldo Pacchiano, Heinrich Jiang, and John Aslanides. A general approach to fairness with optimal transport. In Proceedings of AAAI, volume 34, pages 3633-3640, 2020.

Olivia Sturman, Aparna R. Joshi, Bhaktipriya Radharapu, Piyush Kumar, and Renee Shelby. Debiasing text safety classifiers through a fairness-aware ensemble. In Franck Dernoncourt, Daniel Preoțiuc-Pietro, and Anastasia Shimorina, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 199-214, Miami, Florida, US, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-industry.16. URL https://aclanthology.org/2024.emnlp-industry.16.

Tony Sun, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding, Kai-Wei Chang, and William Yang Wang. Mitigating gender bias in natural language processing: Literature review. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 1630-1640, 2019.

Yi Chern Tan and L. Elisa Celis. Assessing social and intersectional biases in contextualized word representations. NeurIPS, 32, 2019.

Luis Caicedo Torres, Luiz Manella Pereira, and M. Hadi Amini. A survey on optimal transport for machine learning: Theory and applications. arXiv preprint arXiv:2106.01963, 2021.

Nicolás Torres. Contrastive adversarial gender debiasing. Natural Language Processing Journal, 8:100092, 2024. ISSN 2949-7191. doi: https://doi.org/10.1016/j.nlp.2024.100092. URL https://www.sciencedirect.com/science/article/pii/S2949719124000402.

V.N. Vapnik. Statistical Learning Theory. Adaptive and Learning Systems for Signal Processing, Communications and Control, 1998.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 30, 2017.

Ximei Wang, Mingsheng Long, Jianmin Wang, and Michael Jordan. Transferable calibration with lower bias and variance in domain adaptation. Advances in Neural Information Processing Systems, 33:19212-19223, 2020.

Xiang Wei, Boqing Gong, Zixia Liu, Wei Lu, and Liqiang Wang. Improving the improved training of Wasserstein GANs: A consistency term and its dual effect. In International Conference on Learning Representations (ICLR), 2018.

Blake Woodworth, Suriya Gunasekar, Mesrob I. Ohannessian, and Nathan Srebro. Learning non-discriminatory predictors. In Conference on Learning Theory, pages 1920-1953. PMLR, 2017.

Xilin Yang. Diagnosing hate speech classification: Where do humans and machines disagree, and why? arXiv e-prints, arXiv 2410, 2024.

Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 325-333. PMLR, 2013.

Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 335-340, 2018.
Tianxiang Zhao, Enyan Dai, Kai Shu, and Suhang Wang. Towards fair classifiers without sensitive attributes: Exploring biases in related features. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, WSDM '22, pages 1433-1442, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450391320. doi: 10.1145/3488560.3498493. URL https://doi.org/10.1145/3488560.3498493.