Journal of Machine Learning Research 25 (2024) 1-59. Submitted 12/21; Revised 2/24; Published 5/24.

Transport-based Counterfactual Models

Lucas De Lara (lucas.de lara@math.univ-toulouse.fr), Institut de Mathématiques de Toulouse, Université Paul Sabatier, Toulouse, France
Alberto González-Sanz (ag4855@columbia.edu), Department of Statistics, Columbia University, New York, United States
Nicholas Asher (nicholas.asher@irit.fr), Institut de Recherche en Informatique de Toulouse, CNRS, Toulouse, France
Laurent Risser (laurent.risser@math.univ-toulouse.fr), Institut de Mathématiques de Toulouse, CNRS, Toulouse, France
Jean-Michel Loubes (loubes@math.univ-toulouse.fr), Institut de Mathématiques de Toulouse, Université Paul Sabatier, Toulouse, France

Editor: Silvia Chiappa

Abstract

Counterfactual frameworks have grown popular in machine learning, both for explaining algorithmic decisions and for defining individual notions of fairness that are more intuitive than typical group fairness conditions. However, state-of-the-art models to compute counterfactuals are either unrealistic or unfeasible. In particular, while Pearl's causal inference provides appealing rules to calculate counterfactuals, it relies on a model that is unknown and hard to discover in practice. We address the problem of designing realistic and feasible counterfactuals in the absence of a causal model. We define transport-based counterfactual models as collections of joint probability distributions between observable distributions, and show their connection to causal counterfactuals. More specifically, we argue that optimal-transport theory defines relevant transport-based counterfactual models, as they are numerically feasible, statistically faithful, and can coincide under some assumptions with causal counterfactual models. Finally, these models make counterfactual approaches to fairness feasible, and we illustrate their practicality and efficiency on fair learning.
With this paper, we aim to lay out the theoretical foundations for a new, implementable approach to counterfactual thinking.

Keywords: counterfactuals, optimal transport, causality, fairness, supervised learning

©2024 Lucas De Lara, Alberto González-Sanz, Nicholas Asher, Laurent Risser, Jean-Michel Loubes. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/21-1440.html.

1. Introduction

A counterfactual states how the world should be modified so that a given outcome occurs. For instance, the statement "had you been a woman, you would have gotten half your salary" is a counterfactual relating the antecedent "had you been a woman" to the consequent "you would have gotten half your salary". Counterfactuals have been used to define causation (Lewis, 1973) and hence have attracted attention in the fields of explainability and robustness in machine learning, as such statements are tailored to explain black-box decision rules. Applications abound, including algorithmic recourse (Joshi et al., 2019; Poyiadzi et al., 2020; Karimi et al., 2021; Slack et al., 2021; Bajaj et al., 2021; Rasouli and Chieh Yu, 2024), defense against adversarial attacks (Ribeiro et al., 2016; Moosavi-Dezfooli et al., 2016), and fairness (Kusner et al., 2017; Chiappa, 2019; Black et al., 2020; Plecko and Meinshausen, 2020; Asher et al., 2021). State-of-the-art models for computing meaningful counterfactuals have mostly focused on the nearest counterfactual explanation principle (Wachter et al., 2017), according to which one finds minimal translations, that is, minimal changes in the features of an instance that lead to a desired outcome. However, as noted by Black et al. (2020) and Poyiadzi et al. (2020), this simple distance-based approach generally fails to describe realistic alternative worlds, as it implicitly assumes the features to be independent.
Changing just the gender of a person in such a translation might convert a typical male into an untypical female, rendering out-of-distribution counterfactuals like the following: "if I were a woman I would be 190 cm tall and weigh 85 kg". According to intuition, such counterfactuals are false, and rightly so, because they are not representative of the underlying statistical distributions. As a practical consequence, such counterfactuals typically hide biases in machine-learning decision rules (Lipton et al., 2018; Besse et al., 2021). The link between counterfactual modality and causality motivated the use of Pearl's causal modeling (Pearl, 2009) to address the aforementioned shortcoming (Kusner et al., 2017; Joshi et al., 2019; Mahajan et al., 2020; Karimi et al., 2021). Pearl's do-calculus, by enforcing a change in a set of variables while keeping the rest of the causal mechanism untouched, provides a rigorous basis for generating intuitively true counterfactuals. The cost of this approach is fully specifying the causal model, namely specifying not only the Bayesian network (or graph) capturing the causal links between variables, but also the structural equations relating them, and the law of the latent, exogenous variables. The reliance on such a strong prior makes the causal approach appealing in theory, but inadequate for deployment in practical cases. To sum up, research has mostly focused on two divergent frameworks to compute counterfactuals: one that proposes an easy-to-implement model which leads, however, to intuitively untrue counterfactuals; another that rigorously takes into account the dependencies between variables to produce realistic counterfactuals, but at the cost of feasibility. Our contribution addresses a third way. Extending the work of Black et al. (2020), who first suggested substituting causality-based counterfactual reasoning with optimal transport, we define transport-based counterfactual models.
Such models, by characterizing a counterfactual operation as a coupling, that is, a mass transportation plan between two observable distributions, ensure that the generated counterfactuals are in-distribution, hence realistic. In addition, they remedy the impracticability issues of causal modeling, as they can be computed through any mass transportation technique, for instance optimal transport. The major benefit of this approach is that it makes feasible many critical applications of counterfactual frameworks, for example in algorithmic fairness.

1.1 Outline of Contributions

We make both theoretical and practical contributions in the fields of counterfactual reasoning and fair machine learning. We propose a mass-transportation framework for counterfactual reasoning and point out its similarities to the causal approach. Additionally, we show that counterfactual methods for fairness become feasible in this framework by introducing and implementing transport-based counterfactual fairness criteria. More precisely, our contributions can be outlined as follows.

1. In Section 2, we recall the necessary background on Pearl's causal modeling, while we introduce in Section 3 the basics of mass transportation and optimal transport theory. Both sections serve as the theoretical and notational toolbox that will be used throughout; they are meant to keep the paper self-contained and can be skipped by readers familiar with these subjects.

2. In Section 4, we first recall how to compute counterfactual quantities using causal modeling. Then, we introduce a general causality-free framework for the computation of counterfactuals through mass-transportation techniques, encompassing the approach of Black et al. (2020).
Essentially, we also propose a unified mass-transportation viewpoint of counterfactuals, be they causality-based or transport-based, through the definition of counterfactual models: collections of couplings characterizing all possible counterfactual statements for a given feature to alter (for example, the gender). We provide concrete examples of models, and discuss the limitations of the different approaches.

3. In Section 5, we leverage the unified formalism proposed in the previous section to demonstrate connections between causality and optimal transport. More precisely, after studying the implications of two general causal assumptions on the induced counterfactual models, we demonstrate that optimal transport maps for the quadratic cost generate the same counterfactual instances as some specific causal models, including the common linear additive models. We argue that this makes optimal-transport-based counterfactual models relevant surrogates in the absence of a known causal model.

4. In Sections 6, 7 and 8, we illustrate the practicality of our approach for fairness in machine learning. We apply the mass-transportation viewpoint of structural counterfactuals by recasting the counterfactual fairness criterion (Kusner et al., 2017) into a transport-like one. Then, we propose new causality-free criteria by substituting transport-based models for the causal model in the original criterion. Finally, we address the training of counterfactually fair classifiers and predictors, providing statistical guarantees and numerical experiments over various data sets.

To sum up: Sections 2 and 3 provide the prerequisites for the paper; Sections 4 and 5 introduce the concept of counterfactual models and the corresponding theory; Sections 6 to 8 address fairness applications of these models.

1.2 Related Work

This work follows the paper of Black et al.
(2020), which focuses on building sound counterfactual quantities through optimal transport, deviating from both causality-based techniques and the nearest-counterfactual-instance principle. Our contributions in Sections 4 and 5 can be seen as the theoretical foundations of their approach, shedding light on the link between measure-preserving counterfactuals and structural counterfactuals. Also, we note that the way we introduce the causal account of counterfactual reasoning in Section 4 concurs with (Plecko and Meinshausen, 2020) and (Bongers et al., 2021). More precisely, we underline that the objects of interest are the joint probability distributions, or couplings, generated by manipulations of the causal model. Additionally, we propose in Section 6 a direct extension of the counterfactual fairness frameworks introduced in (Kusner et al., 2017) and (Russell et al., 2017) to transport-based counterfactual models, leading to a new method for supervised fair learning. This relates our work to the rich literature on fair learning through optimal transport (Gordaliza et al., 2019; Chiappa et al., 2020; Le Gouic et al., 2020; Chzhen et al., 2020; Risser et al., 2022). Finally, we note that the main result of Section 5, stating that optimal transport maps recover causal effects under specific assumptions, shares similarities with the main theorem of (Torous et al., 2024). In contrast to our work, their assumptions are motivated by the study of classical treatment and control experiments over time via difference-in-differences methods (Imbens and Rubin, 2015).

2. Causal Modeling

Pearl's causal modeling addresses the fundamental problem of analyzing causal relations between variables, beyond mere correlations (Pearl, 2009). It can be regarded as a mathematical formalism meant to describe associations that standard probability calculus cannot represent (Pearl, 2010).
This section recalls the basic theory of this modeling, borrowing the rigorous mathematical framework recently proposed by Bongers et al. (2021). It is meant to keep the paper self-contained and can be skipped by a reader familiar with causality. Let us fix some notation before proceeding. Throughout the paper, we consider a probability space $(\Omega, \mathcal{A}, \mathbb{P})$. We denote respectively by $\mathcal{L}(W)$ and $\mathbb{E}[W]$ the law and expectation under $\mathbb{P}$ of any random variable $W$ on $\Omega$ taking values in a measurable space $(\mathbb{R}^p, \mathcal{B})$ where $p \geq 1$ and $\mathcal{B}$ is the Borel $\sigma$-algebra. Additionally, for any tuple $w := (w_i)_{i \in I}$ indexed by a finite index set $I$ and any subset $I' \subseteq I$, we write $w_{I'} := (w_i)_{i \in I'}$. Similarly, we define $\mathcal{W}_{I'} := \prod_{i \in I'} \mathcal{W}_i$ for any collection of spaces $(\mathcal{W}_i)_{i \in I}$.

2.1 Definition

Causal reasoning rests on the knowledge of a structural causal model (SCM), which represents the causal relationships between the studied variables.

Definition 1 Let $I$ and $J$ be two disjoint finite index sets, and write $\mathcal{V} := \prod_{i \in I} \mathcal{V}_i \subseteq \mathbb{R}^{|I|}$, $\mathcal{U} := \prod_{j \in J} \mathcal{U}_j \subseteq \mathbb{R}^{|J|}$ for two measurable product spaces. A structural causal model $M$ is a couple $\langle U, G \rangle$ where:

1. $U : \Omega \to \mathcal{U}$ is a vector of mutually independent random variables, sometimes called the random noise;

2. $G = \{G_i\}_{i \in I}$ is a collection of measurable $\mathbb{R}$-valued functions, where for every $i \in I$ there exist two subsets of indices $\mathrm{Endo}(i) \subseteq I$ and $\mathrm{Exo}(i) \subseteq J$, respectively called the endogenous and exogenous parents of $i$, such that $G_i : \mathcal{V}_{\mathrm{Endo}(i)} \times \mathcal{U}_{\mathrm{Exo}(i)} \to \mathcal{V}_i$.

A random vector $V : \Omega \to \mathcal{V}$ is a solution of $M$ if for every $i \in I$

$V_i = G_i(V_{\mathrm{Endo}(i)}, U_{\mathrm{Exo}(i)})$ $\mathbb{P}$-a.s.   (1)

The collection of equations defined by (1) and characterized by $G$ and $U$ are called the structural equations. By characterizing $G$ as a measurable vector function $G : \mathcal{V} \times \mathcal{U} \to \mathcal{V}$, we compactly write that $V$ is a solution of $M$ if $V = G(V, U)$ $\mathbb{P}$-a.s. A structural causal model can be seen as a generative model. The variables in $U$ are said to be exogenous as they are imposed a priori by the model.
In contrast, the variables in a solution $V$ are said to be endogenous as they are outputs of the model, determined through the structural equations. In practice, the endogenous variables represent observed data, while the exogenous ones model latent background phenomena. Note that, similarly to Bongers et al. (2021), we assume the $(U_j)_{j \in J}$ to be mutually independent but we do not suppose causal sufficiency (that is, no unmeasured common causes): formally, we do not require $\mathrm{Exo}(i) \cap \mathrm{Exo}(i') = \emptyset$ for every $i \neq i'$. The structural equations specify the causal dependencies between all these variables and are frequently illustrated by the directed graph defined as follows: the set of nodes is $I \cup J$, and a directed edge points from node $k$ to node $l$ if and only if $l \in I$ and $k \in \mathrm{Endo}(l) \cup \mathrm{Exo}(l)$ (we say that $k$ is a parent of $l$). For clarity, we will often substitute the indexes $i \in I$ or $j \in J$ for the variables $V_i$ or $U_j$, in particular when drawing such a graph (see Figure 1). Also, similarly to Bongers et al. (2021), we will in practice use non-disjoint subsets $I$ and $J$ of duplicated natural integers for the sake of clarity. The example below illustrates the above notations and definitions.

Example 1 Consider a simple SCM $M := \langle U, G \rangle$ where $U := (U_1, U_2, U_3)$ is an arbitrary random vector, and such that $G$ is defined by $G_1(u_1) := u_1$, $G_2(v_1, u_2) := v_1 + u_2$, $G_3(v_1, v_2, u_3) := v_1 + v_2 + u_3$. Figure 1a represents the corresponding graph. By definition, finding a solution $V := (V_1, V_2, V_3)$ to $M$ amounts to solving, $\mathbb{P}$-a.s., $V_1 = U_1$, $V_2 = V_1 + U_2$, $V_3 = V_1 + V_2 + U_3$. Then, we readily obtain an almost-surely unique solution given by $V_1 = U_1$, $V_2 = U_1 + U_2$, $V_3 = 2U_1 + U_2 + U_3$ $\mathbb{P}$-a.s.

Figure 1: Examples of causal graphs. (a) Without intervention. (b) After $\mathrm{do}(V_2 = 0)$.
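Example 1 can be checked numerically. Below is a minimal sketch, assuming standard Gaussian noise (the example leaves $U$ arbitrary): evaluating the structural equations in topological order reproduces the closed-form solution map.

```python
import random

def solve_scm(u1, u2, u3):
    # Structural equations of Example 1, evaluated in topological order
    v1 = u1
    v2 = v1 + u2
    v3 = v1 + v2 + u3
    return v1, v2, v3

random.seed(0)
for _ in range(1000):
    u1, u2, u3 = (random.gauss(0, 1) for _ in range(3))
    v1, v2, v3 = solve_scm(u1, u2, u3)
    # Agrees with the closed-form solution V1 = U1, V2 = U1+U2, V3 = 2U1+U2+U3
    assert v1 == u1 and v2 == u1 + u2
    assert abs(v3 - (2 * u1 + u2 + u3)) < 1e-12
```

Because the graph is acyclic, a single forward pass through the equations yields the (almost-surely unique) solution.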
According to (Bongers et al., 2021, Theorem 3.3), a model $M := \langle U, G \rangle$ admits a solution if and only if it is solvable, that is, there exists a measurable function $\Gamma : \mathcal{U} \to \mathcal{V}$, called a solution map, such that $V = \Gamma(U)$ $\mathbb{P}$-a.s. implies $V = G(V, U)$ $\mathbb{P}$-a.s. Solvability signifies that a solution $V$ can be expressed solely in terms of $U$, as in the above example. Note that SCMs are not always solvable (Bongers et al., 2021, Example 2.4). For the sake of convenience, we make in the rest of the paper the common assumption that the considered models are acyclic, meaning that their graphs do not contain any cycles:

(A) The structural causal model $M$ induces a directed acyclic graph (DAG).

Acyclicity entails unique solvability of the SCM (Bongers et al., 2021, Proposition 3.4), in the sense that Equation (1) admits a unique solution up to $\mathbb{P}$-negligible sets (as in Example 1). In particular, the generated distribution of the endogenous variables is unique. We will abusively refer to a solution as the solution of the SCM. Essentially, causal structures capture the assumption that features are not independently manipulable. As we detail next, they make it possible to understand the downstream effect, on the non-intervened variables, of fixing some variables to certain values.

2.2 Do-intervention

The so-called do-calculus embodies mathematically the fundamental distinction between causation and correlation. While standard probability theory can only account for correlations through conditioning, do-calculus allows for intervening on variables through the do-operator. Concretely, a do-intervention is an operation mapping any model $M$ to an alternative one by modifying the generative process.

Definition 2 Let $M = \langle U, G \rangle$ be an SCM, $I' \subseteq I$ a subset of endogenous variables, and $v_{I'} \in \mathcal{V}_{I'}$ a value. The action $\mathrm{do}(I', v_{I'})$ defines the modified model $M_{\mathrm{do}(I', v_{I'})} = \langle U, \widetilde{G} \rangle$ where $\widetilde{G}$ is given by $\widetilde{G}_i := v_i$ if $i \in I'$, and $\widetilde{G}_i := G_i$ if $i \in I \setminus I'$.
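Definition 2 can be illustrated as a purely syntactic operation on the structural equations. The representation below (an SCM as a dictionary of assignment functions, acting on the equations of Example 1) is a hypothetical sketch, not the paper's formalism:

```python
# Each endogenous index maps to its assignment G_i, reading previously
# computed endogenous values v and the exogenous noise u (Example 1).
scm = {
    1: lambda v, u: u[1],
    2: lambda v, u: v[1] + u[2],
    3: lambda v, u: v[1] + v[2] + u[3],
}

def do(model, values):
    """Return the modified model of Definition 2: G_i is replaced by the
    constant v_i for every intervened index i, and kept intact otherwise."""
    return {i: (lambda v, u, c=values[i]: c) if i in values else g
            for i, g in model.items()}

def solve(model, u):
    # Acyclic case: evaluate the equations in topological order 1, 2, 3
    v = {}
    for i in sorted(model):
        v[i] = model[i](v, u)
    return v

u = {1: 1.0, 2: 0.5, 3: -1.0}
assert solve(scm, u) == {1: 1.0, 2: 1.5, 3: 1.5}
# do(V2 = 0): the noise u is untouched, only the equation on V2 changes
assert solve(do(scm, {2: 0.0}), u) == {1: 1.0, 2: 0.0, 3: 0.0}
```

Note that `do` never touches `u`: the surgery modifies the mechanism while respecting the exogeneity of the random seed.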
The model surgery described in Definition 2 consists in enforcing a state of things by substituting fixed values for a set of endogenous variables while keeping all the rest of the causal mechanism equal. By definition, do-interventions respect the exogeneity of the random seed since $U$ remains unchanged. This transcribes the causal principle that acting upon endogenous phenomena does not affect exogenous ones. Provided it is solvable, the modified model $M_{\mathrm{do}(I', v_{I'})}$ generates a new distribution of endogenous variables, describing an alternative world where every $V_i$ for $i \in I'$ is set to the value $v_i$. Note that do-interventions preserve acyclicity. Therefore, if an SCM $M$ satisfies (A), then $M_{\mathrm{do}(I', v_{I'})}$ also satisfies (A). Going further, if $V$ is the solution of an acyclic $M$, we can non-ambiguously define (up to $\mathbb{P}$-negligible sets) its intervened counterpart $V_{\mathrm{do}(I', v_{I'})}$, solution to $M_{\mathrm{do}(I', v_{I'})}$. All in all, (A) provides a convenient setting where the output of a causal model, as well as the outputs of its intervened counterparts, are always well-defined. This implication allows us to clarify the notation: in the sequel we write $\mathrm{do}(V_{I'} = v_{I'})$ for the operation $\mathrm{do}(I', v_{I'})$, and use the subscript $V_{I'} = v_{I'}$ to indicate results of this operation. Crucially, intervening does not amount to conditioning in general, that is, $\mathcal{L}(V \mid V_{I'} = v_{I'}) \neq \mathcal{L}(V_{V_{I'} = v_{I'}})$. This means that causal outcomes may not be observable and hence require a known causal model to be inferred, as exemplified below.

Example 2 Let $M := \langle U, G \rangle$ be the SCM from Example 1 and consider the do-intervention $\mathrm{do}(V_2 = 0)$. This defines the intervened model $M_{V_2=0} := \langle U, \widetilde{G} \rangle$ where $\widetilde{G}_1(u_1) := u_1$, $\widetilde{G}_2(v_1, u_2) := 0$, $\widetilde{G}_3(v_1, v_2, u_3) := v_1 + v_2 + u_3$. Figure 1b represents the graph after surgery. The modified structural equations on a solution $\widetilde{V} := (\widetilde{V}_1, \widetilde{V}_2, \widetilde{V}_3)$ are, $\mathbb{P}$-a.s., $\widetilde{V}_1 = U_1$, $\widetilde{V}_2 = 0$, $\widetilde{V}_3 = \widetilde{V}_1 + \widetilde{V}_2 + U_3$. Then, we readily obtain that the almost-surely unique solution is given by $\widetilde{V}_1 = U_1$, $\widetilde{V}_2 = 0$, $\widetilde{V}_3 = U_1 + U_3$ $\mathbb{P}$-a.s. Since $U_1, U_2, U_3$ are mutually independent, we have $\mathcal{L}(V_1 \mid V_2 = 0) = \mathcal{L}(U_1 \mid U_1 + U_2 = 0)$ while $\mathcal{L}(\widetilde{V}_1) = \mathcal{L}(U_1)$. Therefore, $\mathcal{L}(V_1 \mid V_2 = 0) \neq \mathcal{L}(\widetilde{V}_1)$ in general. In Section 4, we will explain how the do-operator enables counterfactual inference from a causal model. We now turn to the second mathematical theory of interest for our work: mass transportation.

3. Mass Transportation

We first introduce the necessary background on mass (or measure) transportation. Then, we detail the specific case of optimal transport.

3.1 Definition

In probability theory, the problem of mass transportation amounts to constructing a joint distribution, namely a coupling, between two marginal probability measures. Suppose that each marginal distribution is a sand pile in the ambient space. A coupling is a mass transportation plan transforming one pile into the other, by specifying how to move each elementary sand mass from the first distribution so as to recover the second distribution. Alternatively, we can see a coupling as a random matching which pairs start points to end points between the respective supports with a certain weight. Formally, let $P, Q$ be two Borel probability measures on $\mathbb{R}^d$, whose respective supports are denoted by $\mathrm{supp}(P)$ and $\mathrm{supp}(Q)$. We recall that the support is the set of points $x \in \mathbb{R}^d$ such that every open neighbourhood of $x$ has positive probability. A coupling between $P$ and $Q$ is a probability measure $\pi$ on $\mathbb{R}^d \times \mathbb{R}^d$ admitting $P$ as first marginal and $Q$ as second marginal, precisely $\pi(A \times \mathbb{R}^d) = P(A)$ and $\pi(\mathbb{R}^d \times B) = Q(B)$ for all measurable sets $A, B \subseteq \mathbb{R}^d$. Throughout the paper, we denote by $\Pi(P, Q) \subseteq \mathcal{P}(\mathbb{R}^d \times \mathbb{R}^d)$ the set of joint distributions whose marginals coincide with $P$ and $Q$ respectively. A coupling $\pi \in \Pi(P, Q)$ is said to be deterministic if each instance from the first marginal is paired with probability one to an instance of the second marginal.
Such a coupling can be characterized by a measurable map $T : \mathbb{R}^d \to \mathbb{R}^d$ that pushes forward $P$ to $Q$, that is, $Q(B) = P(T^{-1}(B))$ for any measurable set $B \subseteq \mathbb{R}^d$. This property, denoted by $T\#P = Q$, means that if the law of a random variable $Z$ is $P$, then the law of $T(Z)$ is $Q$. To make the relation with random couplings, we also introduce the action of couples of functions on probability measures. For any pair of functions $T_1, T_2 : \mathbb{R}^d \to \mathbb{R}^d$, we define $(T_1 \otimes T_2) : \mathbb{R}^d \to \mathbb{R}^d \times \mathbb{R}^d$, $x \mapsto (T_1(x), T_2(x))$. As such, $(T_1 \otimes T_2)\#P$ denotes the law of $(T_1(Z), T_2(Z))$ where $\mathcal{L}(Z) = P$. This coupling admits $T_1\#P$ and $T_2\#P$ as first and second marginal respectively. Thus, the deterministic coupling $\pi$ between $P$ and $Q$ characterized by a push-forward operator $T$ satisfying $T\#P = Q$ can be written as $\pi = (I \otimes T)\#P$, where $I$ is the identity function on $\mathbb{R}^d$. This coupling matches a given instance $x \in \mathrm{supp}(P)$ to $T(x) \in \mathrm{supp}(Q)$ with probability 1.

3.2 Optimal Transport

We recall here some basic knowledge on optimal transport theory, which is the mass transportation approach we focus on in this work, and refer to (Villani, 2003, 2008) for further details. Optimal transport restricts the set of feasible couplings between two marginals by isolating ones that are optimal in some sense.

3.2.1 Arbitrary Cost

The Monge formulation of the optimal transport problem with general cost $c : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is the optimization problem

$\min_{T : T\#P = Q} \int_{\mathbb{R}^d} c(x, T(x)) \, \mathrm{d}P(x)$.   (2)

We refer to solutions to (2) as optimal transport maps between $P$ and $Q$ with respect to $c$; they minimize the effort, quantified by $c$, of moving every elementary mass from $P$ to $Q$. One mathematical complication is that the push-forward constraint renders the problem unfeasible in many general settings, in particular when $P$ and $Q$ are not absolutely continuous with respect to the Lebesgue measure or have unbalanced numbers of atoms.
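The push-forward condition is easy to illustrate in one dimension. A minimal sketch, assuming the hypothetical choices $P = \mathcal{N}(0, 1)$ and $Q = \mathcal{N}(2, 9)$: the increasing affine map $T(x) = 2 + 3x$ satisfies $T\#P = Q$, so it is a feasible candidate for Monge's problem (2).

```python
import random
import statistics

random.seed(0)
# Sample from P = N(0, 1) and push each point through T(x) = 2 + 3x
xs = [random.gauss(0, 1) for _ in range(50_000)]
ys = [2 + 3 * x for x in xs]

# The image sample should follow Q = N(2, 9): check the first two moments
assert abs(statistics.mean(ys) - 2) < 0.1
assert abs(statistics.stdev(ys) - 3) < 0.1
```

Seen as a coupling, $(I \otimes T)\#P$ pairs each sampled $x$ with $2 + 3x$ with probability 1, hence the coupling is deterministic.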
This issue motivates the following Kantorovich relaxation of the optimal transport problem with cost $c$,

$\min_{\pi \in \Pi(P, Q)} \int_{\mathbb{R}^d \times \mathbb{R}^d} c(x, x') \, \mathrm{d}\pi(x, x')$.   (3)

Solutions to (3) are optimal transport plans (possibly non-deterministic) between $P$ and $Q$ with respect to $c$. In contrast to optimal transport maps, they exist under very mild assumptions, such as the non-negativity of the cost. Notice that, since a push-forward operator defines a coupling, the set of feasible solutions to (2) is included in the set of feasible solutions to (3).

3.2.2 Quadratic Cost

Optimal transport enjoys a well-established theory, in particular when the ground cost is the squared Euclidean distance $c(x, x') := \|x - x'\|^2$ on $\mathbb{R}^d \times \mathbb{R}^d$. Suppose that $P$ is absolutely continuous with respect to the Lebesgue measure in $\mathbb{R}^d$, and that both $P$ and $Q$ have finite second-order moments. Theorem 2.12 in Villani (2003), originally proved by Cuesta and Matrán (1989) and then Brenier (1991), states that there exists a unique solution to Kantorovich's formulation of optimal transport (3), whose form is $(I \otimes T)\#P$ where $T : \mathbb{R}^d \to \mathbb{R}^d$ solves the corresponding squared Monge problem,

$\min_{T : T\#P = Q} \int_{\mathbb{R}^d} \|x - T(x)\|^2 \, \mathrm{d}P(x)$.   (4)

Although it may not be unique, this optimal transport map $T$ is uniquely determined $P$-almost everywhere, and we will abusively refer to it as the solution to Problem (4). Crucially, this map coincides $P$-almost everywhere with the gradient of a convex function. Moreover, according to McCann (1995), under the sole assumption that $P$ is absolutely continuous with respect to the Lebesgue measure, there exists only one (up to $P$-negligible sets) gradient of a convex function $\varphi$ satisfying the push-forward condition $\nabla\varphi\#P = Q$. We combine Brenier's and McCann's theorems into the following lemma, which simplifies the search for the solutions to (4).

Lemma 3 Assume that $P$ is absolutely continuous with respect to the Lebesgue measure, and that both $P$ and $Q$ have finite second-order moments.
Then, a measurable map $T$ is a solution to (4) if and only if it satisfies the two following conditions: (i) $T\#P = Q$; (ii) there exists a convex function $\varphi : \mathbb{R}^d \to \mathbb{R}$ such that $T = \nabla\varphi$ $P$-almost everywhere.

This result will play a key role in Section 5.2 to prove a link between optimal transport and causality.

3.2.3 Implementation

In practice, we do not know the measures $P$ and $Q$ but have access to empirical observations. This naturally raises the questions of building relevant data-driven approximations, or estimators, of the optimal transport plans, and of what should be required to ensure statistical guarantees. In this section, we briefly present the computational aspects of optimal transport, and refer to (Peyré and Cuturi, 2019) for a complete overview. Concretely, consider two samples of i.i.d. observations $\{x_1, \ldots, x_n\}$ and $\{x'_1, \ldots, x'_m\}$ drawn from respectively $P$ and $Q$. These samples define the empirical measures $P_n = n^{-1} \sum_{i=1}^n \delta_{x_i}$ and $Q_m = m^{-1} \sum_{j=1}^m \delta_{x'_j}$, where $\delta_x$ denotes the Dirac measure at point $x$. Then, the standard way to estimate an optimal transport plan between the marginals $P$ and $Q$ is to solve the Kantorovich formulation (3) between their empirical counterparts $P_n$ and $Q_m$. By characterizing a discrete coupling by a matrix, we write this problem as

$\min_{\pi \in \Sigma(n,m)} \sum_{i=1}^n \sum_{j=1}^m c(x_i, x'_j) \pi(i, j)$,   (5)

where $\Sigma(n, m) := \{\pi \in \mathbb{R}_+^{n \times m} \mid \sum_{j=1}^m \pi(i, j) = n^{-1} \text{ and } \sum_{i=1}^n \pi(i, j) = m^{-1}\}$. Note that empirical transport plans are statistically consistent. This means that if the Kantorovich problem (3) admits a unique solution $\pi$, then a sequence $\{\pi_{n,m}\}_{n,m \in \mathbb{N}}$ of solutions to Problem (5) converges weakly to $\pi$ as $n$ and $m$ increase to infinity (Villani, 2008, Theorem 5.19). This property is crucial to ensure statistical guarantees in optimal-transport frameworks. We emphasize that even if a solution to Problem (5) is necessarily non-deterministic as soon as $n \neq m$, the corresponding solution to Problem (3) can be deterministic.
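In the balanced uniform case $n = m$, problem (5) admits an optimal solution that is a permutation matrix scaled by $n^{-1}$, so it reduces to a linear assignment problem. The sketch below, on hypothetical Gaussian samples, uses SciPy's assignment solver; it is an illustration, not the implementation used in the paper's experiments.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n = 8
x = rng.normal(size=(n, 2))             # i.i.d. sample from P
y = rng.normal(loc=3.0, size=(n, 2))    # i.i.d. sample from Q

# Quadratic cost matrix c(x_i, x'_j) = ||x_i - x'_j||^2
cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)

# Optimal assignment <=> optimal transport plan for uniform weights
rows, cols = linear_sum_assignment(cost)
plan = np.zeros((n, n))
plan[rows, cols] = 1.0 / n

# The plan is a coupling: its marginals are the empirical measures
assert np.allclose(plan.sum(axis=1), 1.0 / n)
assert np.allclose(plan.sum(axis=0), 1.0 / n)
```

The resulting plan is deterministic (each $x_i$ is matched to exactly one $x'_j$), consistently with the remark above that determinism can hold in the balanced case.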
The main challenge when working with empirical optimal-transport solutions is that they are expensive in both computational complexity and memory: solving (5) typically requires $O((n + m)nm \log(n + m))$ computer operations, and the solution is stored as an $n \times m$ matrix, which can limit applications to large data sets. Our implementation (see the experiments in Section 8) exploits the sparsity of the transport matrix to avoid overloading the memory and to speed up the evaluation of optimal-transport-based metrics. One could also consider entropic regularization schemes to accelerate the computation of a solution, reaching $O(nm)$ operations (Cuturi, 2013). However, the obtained approximation of the transport matrix is typically non-sparse, hence contains many non-zero coefficients, which precludes memory-efficient implementations. This is why we address only standard optimal transport in our numerical experiments.

4. Counterfactual Models

We now have all the tools to focus on the main subject of this paper: counterfactual reasoning. As mentioned in the introduction, both causality and transport techniques have been used for this purpose. However, a yet non-appreciated aspect is that these frameworks can be written in a common formalism; this is what this section addresses. More precisely, we propose the definition of counterfactual models, mathematical objects encoding the probabilities of all counterfactual statements with respect to modifications of one variable, and detail how to construct them with, respectively, causal models and mass-transportation methods.

4.1 Problem Setup

Set $d \geq 1$, and define the random vector $V := (X, S) \in \mathbb{R}^{d+1}$, where the variables $X : \Omega \to \mathcal{X} \subseteq \mathbb{R}^d$ represent some observed features, while the variable $S : \Omega \to \mathcal{S} \subseteq \mathbb{R}$ can be subjected to interventions. For simplicity, we assume that $\mathcal{S}$ is finite and such that for every $s \in \mathcal{S}$, $\mathbb{P}(S = s) > 0$. We consider the problem of computing the potential outcomes of $X$ when changing $S$.
Typically, $S$ represents the sensitive, protected attribute in fairness settings, or the treatment status in the potential-outcome framework (Rubin, 1974). Suppose for instance that the event $\{X = x, S = s\}$ is observed, and let $s' \neq s$. We aim at answering the counterfactual question: had $S$ been equal to $s'$ instead of $s$, what would have been the value of $X$? Critically, because of correlations or structural relations between the variables, computing the alternative state does not amount to changing the value of $S$ while keeping the features $X$ equal.

4.2 Structural Profile Counterfactuals

Answering the counterfactual question from Section 4.1 with Pearl's framework requires assuming causal dependencies between $X$ and $S$. Formally, suppose that $V = (X, S) \in \mathbb{R}^{d+1}$ is the unique solution to an SCM $M = \langle U, G \rangle$ satisfying the acyclicity assumption (A). We recall that each endogenous variable $V_k$ is then defined (up to sets of probability zero) by the structural equation $V_k = G_k(V_{\mathrm{Endo}(k)}, U_{\mathrm{Exo}(k)})$ $\mathbb{P}$-a.s., where $G_k$ is a real-valued measurable function, $U$ is a vector of exogenous variables, while $V_{\mathrm{Endo}(k)}$ and $U_{\mathrm{Exo}(k)}$ denote respectively the endogenous and exogenous parents of $V_k$. In the following, we denote by $U_X$ and $U_S$ the exogenous parents of respectively $X$ and $S$. We write $X_{S=s'}$ for the post-intervention counterpart of $X$ through the do-intervention $\mathrm{do}(S = s')$, that is, after replacing the structural equation on $S$ by $S = s'$ while keeping the rest of the causal mechanism equal. Several types of counterfactual quantities can be derived from SCMs, corresponding to answers to distinct counterfactual questions. For the sake of clarity, we first explain in detail the specific SCM-based counterfactuals that are relevant to answer the question from Section 4.1. Then, we highlight the mass-transportation interpretation of such counterfactuals.

4.2.1 Definitions of SCM-based Counterfactuals

In the SCM $M$, a unit is characterized by a realization $u \in \mathcal{U}$ of the exogenous random vector $U$.
Such a value completely determines a realization $v = (x, s) \in \mathcal{X} \times \mathcal{S}$ of the endogenous random vector $V$ through the structural equations, formally $(x, s) := \Gamma(u)$ using the solution-map notation. In this article, we follow Wu et al. (2019) and call a vector $(x, s) \in \mathcal{X} \times \mathcal{S}$ a profile. Importantly, a unit uniquely determines a profile in $M$, but a profile does not uniquely determine a unit, due to $\Gamma$ not being necessarily invertible. (Pearl et al., 2016, Section 4.2.1) defines the basic counterfactuals at the unit level. This construction rests on a fundamental property of the alternative SCMs $\{M_{S=s'}\}_{s' \in \mathcal{S}}$ obtained by do-interventions: they share the same exogenous vector $U$ as $M$. Therefore, fixing a same input unit $u \in \mathcal{U}$ in the models $M$ and $M_{S=s'}$ for a chosen $s' \in \mathcal{S}$ produces two output profiles: $(x, s)$ from $M$ and $(x', s')$ from $M_{S=s'}$. They respectively describe the factual outcome and the counterfactual outcome had $S$ been equal to $s'$ for the specific unit $u$. Unit counterfactuals enable one to answer the question: What would have been the value of $X$ had $S$ been equal to $s'$, given that $\{U = u\}$? While this question is interesting in theory, it is often irrelevant in practice, since what one generally observes are endogenous variables (namely profiles) instead of hidden exogenous variables (namely units). Therefore, people doing counterfactual reasoning focus on surrogate questions of the form: What would have been the value of $X$ had $S$ been equal to $s'$, given that $\{X = x, S = s\}$? This is precisely the pivotal question that we aim at answering in this paper. It differs fundamentally from the former query, notably because the profile-level context $\{X = x, S = s\}$ does not always uniquely determine a unit-level context $\{U = u\}$. Uncertainty as to the unit behind the profile $(x, s)$ is quantified by the probability distribution $\mathcal{L}(U \mid X = x, S = s)$, which is not necessarily degenerate.
Therefore, one cannot always certainly find the true unit responsible for the observation, and thereby cannot always certainly compute the unit-level counterfactual outcome correctly answering our question. Instead, one can compute a distribution over unit-level counterfactual outcomes consistent with the context {X = x, S = s}, formally given by L(X_{S=s'} | X = x, S = s), which specifies a set of possible answers with probability weights. This means that answers to the counterfactual question from Section 4.1 are fundamentally probabilistic from an SCM perspective. Computing the probabilities of such answers relies on the well-known three-step procedure (Pearl, 2009, Theorem 7.1.7), which requires knowing the fully-specified SCM. This algorithm takes as input a context, framed as an event on endogenous variables, and delivers a probability distribution of counterfactual outcomes for a chosen do-intervention. More specifically, for an input context {X = x, S = s} and an intervention do(S = s'), it carries out the following three steps:

1. (Abduction) compute the posterior distribution of units consistent with the factual observation, namely L(U | X = x, S = s);
2. (Action) perform do(S = s') to obtain the modified causal mechanism G_{S=s'} of M_{S=s'};
3. (Prediction) pass the distribution L(U | X = x, S = s) through the modified system of equations G_{S=s'} to get the distribution of counterfactual outcomes L(X_{S=s'} | X = x, S = s).

This procedure defines the SCM-based counterfactuals that one needs to address the problem from Section 4.1. We call them structural profile counterfactuals to emphasize two characteristics: (1) they are computed via an SCM, (2) they are defined for a specific profile (instead of a specific unit). The tag "structural" notably serves to distinguish such counterfactuals from the (later introduced) transport-based counterfactuals at the heart of this work.

Definition 4 Let M satisfy (A).
For observed evidence {X = x, S = s} and an intervention do(S = s'), the structural profile counterfactuals are characterized by the probability distribution µ*_{s'|s}(·|x) defined as

µ*_{s'|s}(·|x) := L(X_{S=s'} | X = x, S = s).

We make three remarks. First, this definition can be seen as a specific case of (Barocas et al., 2023, Chapter 5, Definition 2) for an input context framed as an event of the form {V = v} where v ∈ V (rather than an event on a possibly reduced number of endogenous variables). It also corresponds to the law of the counterfactual random variable defined in (Karimi et al., 2020, Section 2.2). Second, as explained, the profile-level counterfactual counterparts of a single profile are not necessarily deterministic (that is, uniquely determined), which differs from unit-level counterfactuals. Mathematically, this comes from the fact that L(X_{S=s'} | X = x, S = s) is not necessarily degenerate, whereas L(X_{S=s'} | U = u) is always a point-mass distribution. Obviously, in case a profile (x, s) ∈ X × S uniquely determines a unit u ∈ U, the two conditional distributions for these specific values coincide, meaning that the structural counterfactual counterpart of (x, s) is unique. Section 5.1.1 details under which conditions profile-level counterfactuals are deterministic, and therefore correspond to unit-level counterfactuals. We also emphasize that, while the rest of the paper almost exclusively focuses on profile counterfactuals rather than unit counterfactuals, the distinction will play a critical role in Section 6 to understand the definition of counterfactual fairness. Third, note that µ*_{s'|s}(·|x) measures counterfactual outcomes of the feature vector X alone rather than the whole profile vector (X, S). This is an arbitrary choice that serves to highlight that after do(S = s'), only the resulting values of X_{S=s'} need to be specified since S_{S=s'} = s'.
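As a concrete illustration of the abduction-action-prediction steps above, consider a toy additive-noise SCM in the spirit of the linear model used for the Law data set in Section 4.4.1. The parameters b and w below are hypothetical, chosen only for demonstration; because the noise is additive, abduction is exact and the counterfactual is a single point rather than a distribution:

```python
import numpy as np

# Toy additive-noise SCM (hypothetical, for illustration only):
#   S = U_S,   X = b + w * S + U_X.
b = np.array([1.0, -0.5])
w = np.array([2.0, 0.3])

def three_step_counterfactual(x, s, s_cf):
    """Abduction-action-prediction for this toy SCM.

    Because the noise is additive, the posterior L(U_X | X=x, S=s)
    is a point mass, so the counterfactual outcome is unique.
    """
    u_x = x - b - w * s        # 1. abduction: recover the exogenous term
    # 2. action: replace the equation of S by S = s_cf (do-intervention)
    x_cf = b + w * s_cf + u_x  # 3. prediction: push u_x through the modified SCM
    return x_cf

x = np.array([3.5, -0.1])      # observed profile with S = 0
x_cf = three_step_counterfactual(x, s=0, s_cf=1)
print(x_cf)                    # equals x + w, i.e. a translation
```

In this degenerate case µ*_{s'|s}(·|x) is a point mass at x + w(s' − s), which previews the deterministic analysis of Section 5.1.1.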
On this basis, we introduce the following notations to formalize the contrast between interventional, counterfactual and factual outcomes. For s, s' ∈ S we define three probability distributions. Firstly, µ_s := L(X | S = s) is the distribution of the factual s-instances. This observable measure describes the possible values of X such that S = s, and we write X_s for its support. Secondly, we denote by µ_{S=s'} := L(X_{S=s'}) the distribution of the interventional s'-instances. It describes the alternative values of X in a world where S is forced to take the value s'. In contrast to the factual distribution, the interventional distribution is in general not observational, in the sense that we cannot draw empirical observations from it. Finally, we define by µ*_{s'|s} := L(X_{S=s'} | S = s) the distribution of the counterfactual s'-instances given s. It describes what would have been the factual instances of µ_s had S been equal to s' instead of s. According to the consistency rule (Pearl et al., 2016), which states that X = Σ_{s'∈S} 1{S = s'} X_{S=s'} P-a.s., the factual and counterfactual distributions coincide when s = s', that is, µ_s = µ*_{s|s}. However, when s ≠ s', the counterfactual distribution µ*_{s'|s} is generally not observable.

4.2.2 Mass-transportation Viewpoint

In this paper, we focus on a mass-transportation viewpoint of the counterfactuals from Definition 4. It rests on the fact that, for a given change of S, there exists a correspondence between factual profiles and structural counterfactual profiles. Such a correspondence is not necessarily one-to-one, but consists of a collection of weighted correspondences described by the distribution of counterfactuals. This random mapping between factual and counterfactual counterparts as induced by the SCM is formalized by the following definition.

Definition 5 Let M satisfy (A). For every s, s' ∈ S, the structural counterfactual coupling between µ_s and µ*_{s'|s} is given by

π*_{s'|s} := L((X, X_{S=s'}) | S = s).
We call the collection of couplings Π* := {π*_{s'|s}}_{s,s'∈S} the structural counterfactual model on X with respect to S. In this formalism, the quantity dπ*_{s'|s}(x, x') is the elementary probability of the counterfactual statement "had S been equal to s' instead of s, then X would have been equal to x' instead of x". As such, a counterfactual model characterizes the distribution of all the cross-world statements on X with respect to changes of S. Note that each realization of π*_{s'|s}, that is, each pair of factual instance and counterfactual counterpart, is generated by the same realization of L(U | S = s). We point out that Definitions 4 and 5 characterize the exact same counterfactual statements, the formal link being dπ*_{s'|s}(x, x') = dµ*_{s'|s}(x'|x) dµ_s(x). In particular, there is an equivalence between µ*_{s'|s}(·|x) narrowing down to a single value for every x ∈ X_s and π*_{s'|s} being a deterministic coupling. Assumptions rendering counterfactuals single-valued will be studied in Section 5.1.1. We also note that this joint-probability-distribution perspective of Pearl's counterfactuals concurs with the one from (Bongers et al., 2021, Section 2.5).

4.3 Transport-based Profile Counterfactuals

The main issue of structural profile counterfactuals, which will be widely discussed in Section 4.5, comes from the causal model being unknown in practice. Thus, the necessity to make counterfactual frameworks feasible naturally raises the question of finding good surrogates to causal profile counterfactuals. We have seen that the problem of assessing counterfactual statements about X with respect to interventions on S using causal models could be reduced to knowing a collection of random mappings from factual distributions {µ_s}_{s∈S} towards counterfactual distributions {µ*_{s'|s}}_{s,s'∈S}.
This perspective suggests that mass-transportation techniques can be natural substitutes for structural counterfactual reasoning, as they remedy the aforementioned issues.

4.3.1 Definition

In (Black et al., 2020), the authors mimicked the structural account of counterfactuals by computing alternative instances using a deterministic optimal transport map. Extending their idea, we propose a more general framework where the counterfactual operation switching S from s to s' can be seen as a mass transportation plan, not necessarily optimal-transport based and not necessarily deterministic, between two distributions.¹ In the following, t : R^d × R^d → R^d × R^d, (x, x') ↦ (x', x) denotes the permutation function.

Definition 6 A transport-based counterfactual model is a collection of couplings Π := {π_{s'|s}}_{s,s'∈S} satisfying for every s, s' ∈ S,
(i) π_{s'|s} ∈ Π(µ_s, µ_{s'});
(ii) π_{s|s} = (I × I)#µ_s;
(iii) π_{s|s'} = t#π_{s'|s}.

An element of Π is called a counterfactual coupling. We say that Π is a random counterfactual model if at least one coupling for s ≠ s' is not deterministic. Otherwise, we say that Π is a deterministic counterfactual model. In the deterministic case, Π can be characterized almost everywhere by a collection T := {T_{s'|s}}_{s,s'∈S} of measurable mappings from X to X satisfying for every s, s' ∈ S,
(i) T_{s'|s}#µ_s = µ_{s'};
(ii) T_{s|s} = I;
(iii) T_{s'|s} is invertible µ_s-almost everywhere such that T_{s|s'} = T_{s'|s}^{-1}.

An element of T is called a counterfactual operator. Similarly to structural counterfactual models, these models assign a probability to all the cross-world statements on X with respect to interventions on S. By convention, we use the superscript * to denote structural counterfactual models, and no superscript for transport-based counterfactual models.

1. In (Asher et al., 2022, Section 7.2), we present this view of counterfactuals from a logic perspective.
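For finitely supported marginals, conditions (i)-(iii) of Definition 6 can be checked mechanically: a coupling is then a matrix whose rows sum to the weights of µ_s and whose columns sum to the weights of µ_{s'}, and condition (iii) amounts to transposition. The following sketch (function and variable names are ours, not from any library) verifies these conditions on a two-group example:

```python
import numpy as np

def is_counterfactual_model(couplings, marginals, tol=1e-9):
    """Check Definition 6 for discrete measures.

    couplings[s][t]: coupling matrix from group s to group t;
    marginals[s]: weight vector of µ_s.
    """
    groups = marginals.keys()
    for s in groups:
        for t in groups:
            P = couplings[s][t]
            # (i) marginal constraint: P ∈ Π(µ_s, µ_t)
            if not (np.allclose(P.sum(axis=1), marginals[s], atol=tol)
                    and np.allclose(P.sum(axis=0), marginals[t], atol=tol)):
                return False
            # (iii) symmetry: the reverse coupling is the transpose
            if not np.allclose(couplings[t][s], P.T, atol=tol):
                return False
        # (ii) identity coupling on the diagonal: (I × I)#µ_s
        if not np.allclose(couplings[s][s], np.diag(marginals[s]), atol=tol):
            return False
    return True

mu = {0: np.array([0.5, 0.5]), 1: np.array([0.25, 0.75])}
P01 = np.array([[0.25, 0.25],   # an admissible coupling from µ_0 to µ_1
                [0.00, 0.50]])
couplings = {0: {0: np.diag(mu[0]), 1: P01},
             1: {0: P01.T, 1: np.diag(mu[1])}}
print(is_counterfactual_model(couplings, mu))  # True
```

Note that the trivial product couplings {µ_s ⊗ µ_{s'}} would also pass this check, which is precisely why Section 4.3.2 discusses how to choose among admissible models.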
The marginal constraint (i) in Definition 6 translates the intuition that a realistic counterfactual operation on S should morph the non-intervened variables X so that their values fit the targeted distribution. In this sense, transport-based models preserve the principle that features are not independently manipulable, but without using causal relations. The symmetry constraints (ii) and (iii) cover the reciprocity intuition we have on counterfactual counterparts. Remark that in the case of discrete measures, the operation t# in condition (iii) simply amounts to transposing the associated coupling matrices. Lastly, note that this definition replaces the unobservable, SCM-dependent distributions {µ*_{s'|s}}_{s,s'∈S} of structural counterfactual models by the observational {µ_{s'}}_{s'∈S} for feasibility reasons. In Section 5.1.2, we will see that this approximation makes sense in typical fairness settings where µ*_{s'|s} = µ_{s'} for every s, s' ∈ S. The adjective "deterministic" refers to the fact that the model assigns to each factual profile a unique counterfactual counterpart. Formally, the counterfactual counterpart of some observation x ∈ X_s after changing S from s to s' is given by x' = T_{s'|s}(x) ∈ X_{s'}. In contrast, a random model allows possibly several counterparts with probability weights. The first interest of considering random couplings is purely conceptual: allowing the counterfactual counterpart of a single instance to be non-unique has philosophical implications (Asher et al., 2022, Section 6.3). Besides, it is consistent with the causal approach, which also authorizes non-deterministic profile counterfactuals. The second and most critical benefit is mathematical and practical. While there always exist random couplings between two distributions, deterministic push-forward mappings (even causally-induced ones) may not exist when the marginals do not have densities, making deterministic counterfactual models ill-defined in many settings with non-continuous variables.
This makes the extension to random couplings mathematically necessary to properly define transport-based counterfactual models in any configuration, in particular concrete machine-learning tasks, which typically involve both continuous and discrete covariates. Notably, we rely on random couplings in the numerical experiments from Section 8, illustrating Section 7. We also point out that Definition 6 refers to the true conditional distributions {µ_s}_{s∈S}, not their empirical counterparts. This means that, in practice, after choosing the counterfactual model in theory (for instance as the solutions to an optimal transport problem), one must estimate this model from finite samples. Sections 4.4 and 8 respectively illustrate the estimation of deterministic and random transport-based counterfactual models. For statistical coherence, we leverage a (deterministic) map estimator only if the coupling specified by the chosen model is itself deterministic.

4.3.2 Choosing a Model

One challenge for the transport-based approach is to choose the model appropriately in order to define a relevant notion of counterpart. There possibly exists an infinite number of admissible counterfactual models in the sense of Definition 6, many of them being inappropriate. As an illustration, consider the family of trivial couplings, namely {µ_s ⊗ µ_{s'}}_{s,s'∈S}, where ⊗ denotes the product of measures. Though it is a well-defined transport-based counterfactual model, it is not intuitively justifiable as it completely decorrelates factual and counterfactual profiles. To sum up, a transport-based counterfactual model must be both intuitively justifiable and computationally feasible. We argue that optimal-transport solutions are tailored couplings with respect to both criteria.
Regarding feasibility, optimal transport has become the most popular tool in statistics-related fields to define couplings between distributions when no canonical choice is available, as in generative modeling (Balaji et al., 2020), domain adaptation (Courty et al., 2014, 2017; Redko et al., 2019; Rakotoarison et al., 2022), and transfer learning (Gayraud et al., 2017; Peterson et al., 2021), thanks to significant advances in computational schemes. Regarding justifiability, as argued by Black et al. (2020), generating a counterfactual operation by solving the optimal-transport Problem (2) leads to intuitively relevant counterfactuals for two reasons. Firstly, because counterfactuals are obtained by minimizing a metric between paired instances with different S-values, they transcribe the founding principle of Lewis (1973) to compute counterfactual counterparts: finding the most similar alternative worlds verifying the desired change. Secondly, the mass-preservation constraint ensures that the generated counterfactuals are realistic, conforming to the statistical features of a target distribution. More specifically, deterministic optimal transport for the quadratic cost (see Section 3.2.2) has remarkable properties that are relevant in the context of constructing counterfactuals. According to Lemma 3, solutions to Problem (4) are gradients of convex functions, which extends the notion of non-decreasing function to several dimensions. In particular, the optimal transport map in dimension one is the quantile-preservation map between univariate distributions. This behaviour has notably inspired constructions of multivariate notions of quantile based on optimal transport (Chernozhukov et al., 2017; Hallin et al., 2021; Ghosal and Sen, 2022). It makes sense in counterfactual reasoning where, without further information on the data-generation process, preserving the quantile from one marginal to the other is an intuitive definition of the counterfactual counterpart.
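In one dimension, this quantile-preservation map can be estimated directly from samples via empirical CDFs and quantiles; a minimal sketch (the sample sizes and distributions below are arbitrary choices for illustration):

```python
import numpy as np

# In 1D, the quadratic-cost OT map is the quantile-preservation map
# T = F_{s'}^{-1} ∘ F_s.  Empirically: send a point to the target point
# of the same rank.
rng = np.random.default_rng(0)
xs = rng.normal(loc=0.0, scale=1.0, size=1000)  # factual sample, from µ_s
xt = rng.normal(loc=3.0, scale=2.0, size=1000)  # target sample, from µ_s'

def quantile_map(x, source, target):
    """Empirical counterfactual counterpart of x: preserve its quantile."""
    q = np.mean(source <= x)        # empirical CDF F_s(x)
    return np.quantile(target, q)   # empirical quantile F_{s'}^{-1}(q)

x = 1.0
print(quantile_map(x, xs, xt))      # close to 3 + 2*1 = 5: for these
                                    # Gaussians the true map is x ↦ 3 + 2x
```

The map is non-decreasing by construction, which is exactly the one-dimensional instance of the gradient-of-convex-function property above.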
Charpentier et al. (2023) notably explore this idea. For the sake of illustration, Section 4.4 below provides several examples of optimal transport applied to counterfactual reasoning. In Section 5.2, we will further justify the pertinence of optimal-transport-based counterfactual models by showing that they coincide with structural counterfactual models under some assumptions. However, the scope of Definition 6 goes beyond solutions to standard optimal-transport problems, allowing other transport methods and as such more possible counterfactual models. The purpose of this generalization is partly theoretical: in the future, one could propose an original matching technique and justify its relevance compared to optimal transport. In particular, the couplings mentioned in (Villani, 2008, Chapter 1) as well as diffeomorphic registration mappings (Joshi and Miller, 2000; Beg et al., 2005) are possible candidates we do not investigate in this paper. Additionally, this generalization permits the use of regularized optimal transport (Cuturi, 2013), which deviates from the original formulation of Problem (3), to accelerate computations. Note in passing that solutions to regularized optimal transport, which are not deterministic, define adequate transport-based counterfactual models thanks to Definition 6 taking random couplings into account. Lastly, we will see in Section 5.1.2 that, under some assumptions, structural counterfactual models are transport-based counterfactual models, though not necessarily optimal-transport-based ones.

4.4 Examples

Now that we have given definitions and insights on counterfactual models, let us study two concrete examples on real data.
4.4.1 Law Data Set

We start by focusing on the Law School Admission Council data set, which gathers statistics from 163 US law schools and more than 20,000 students, including four variables: the race S, the entrance-exam score X1, the grade-point average before law school X2, and the first-year average grade Y. The end goal is to predict the first-year grade Y from the profile (X, S). Similarly to Russell et al. (2017), we consider a fairness setting where the race plays the role of a protected, sensitive attribute which should not be discriminated against, and we restrict to only black (S = 0) and white (S = 1) students. Counterfactual reasoning has become popular in such algorithmic fairness tasks to either ensure or test that, for example, had a black student been white, the output would have been the same. This requires a model to compute the counterfactual counterpart of any student after changing their skin color. First, we consider a structural counterfactual model. This requires a causal model: Russell et al. (2017) proposed the following SCM for the data set,

X1 = b1 + w1 S + U1,
X2 = b2 + w2 S + U2,
S = U_S,
with U_S, U1, U2 independent,

where b := (b1, b2) and w := (w1, w2) are deterministic parameters in R^2 obtained by adjusting linear-regression models component-wise. Let us now calculate the induced structural counterfactual model by applying Definition 5. The coupling from S = 0 to S = 1 is given by

π*_{1|0} := L((X, X_{S=1}) | S = 0) = L((b + U_X, b + w + U_X)) = L((X, X + w) | S = 0).

Conversely, the structural counterfactual coupling from S = 1 to S = 0 is

π*_{0|1} := L((X, X_{S=0}) | S = 1) = L((b + w + U_X, b + U_X)) = L((X, X − w) | S = 1).

Figures 2a and 2b illustrate the computation of the corresponding counterfactual counterparts on samples. We make two important remarks.
Firstly, generating counterfactual quantities in this case amounts to translating instances of µ_0 by the constant w or conversely translating instances of µ_1 by the constant −w. Notably, the two couplings are deterministic: π*_{1|0} and π*_{0|1} are respectively characterized by the mappings T*_{1|0}(x) := x + w and T*_{0|1}(x) := x − w. Note that there is consequently no need to specify the law of the exogenous variables to compute counterfactual quantities. Section 5.1.1 provides a general analysis of such deterministic settings. Secondly, the causal model implies that S ⊥ U_X. This critically entails that the counterfactual distributions are observable, since µ*_{1|0} = L(X_{S=1} | S = 0) = L(b + w + U_X | S = 0) = L(b + w + U_X | S = 1) = µ_1, and µ*_{0|1} = µ_0 analogously. Therefore, the structural counterfactual couplings π*_{1|0} and π*_{0|1} belong respectively to Π(µ_0, µ_1) and Π(µ_1, µ_0). Additionally, they are transposed from one another, that is, t#π*_{1|0} = π*_{0|1}. This means that the structural counterfactual model Π* := {π*_{1|0}, π*_{0|1}} is a transport-based counterfactual model. Mathematical justifications of these properties will be studied in Section 5.1.2. We now turn to an optimal-transport-based counterfactual model. More precisely, we learn the optimal transport map for the quadratic cost, denoted by T_{1|0}, from the black distribution µ_0 towards the white distribution µ_1. This map (and therefore the associated deterministic counterfactual model) is well-defined because the population marginals µ_0 and µ_1 admit densities with respect to the Lebesgue measure. In practice, we rely on the Python Optimal Transport (POT) library to compute an approximation of the mapping from data (Flamary et al., 2021). Note that solving the empirical optimal-transport problem (5) between samples provides a matching that cannot generalize to new, out-of-sample observations.
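To make the estimation pipeline concrete, here is a self-contained sketch of an empirical quadratic-cost OT matching between two equal-size synthetic samples, followed by a crude out-of-sample extension. We use SciPy's assignment solver and a nearest-neighbor displacement as simplified stand-ins for POT's solvers and barycentric extension; the data below are synthetic, not the Law data set:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# For equal sample sizes and uniform weights, empirical quadratic-cost OT
# reduces to an assignment problem whose solution is a permutation.
rng = np.random.default_rng(1)
n, d = 200, 2
X0 = rng.normal(size=(n, d))                          # factual sample of µ_0
X1 = rng.normal(size=(n, d)) + np.array([2.0, 1.0])   # factual sample of µ_1

cost = ((X0[:, None, :] - X1[None, :, :]) ** 2).sum(axis=2)
rows, cols = linear_sum_assignment(cost)   # optimal permutation matching
X0_cf = X1[cols]                           # in-sample counterfactual counterparts

def map_out_of_sample(x):
    """Crude out-of-sample extension (our simplification, not POT's
    barycentric projection): reuse the displacement of the nearest factual."""
    i = np.argmin(((X0 - x) ** 2).sum(axis=1))
    return x + (X0_cf[i] - X0[i])

print(map_out_of_sample(np.zeros(d)))      # roughly shifted by (2, 1)
```

Since the optimal coupling here is a permutation, the matching is deterministic; a proper barycentric extension would instead average target points weighted by the coupling, which is what the estimator used below relies on.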
This is why we employ POT's in-built non-regularized barycentric extension of the empirical solution to obtain a mapping defined everywhere. We use 800 points from each distribution to compute the estimator of T_{1|0} illustrated in Figure 2a. The converse counterfactual operation T_{0|1} represented in Figure 2d is produced by inversion. We emphasize that all the couplings in Figure 2, be they causal-based or optimal-transport-based, are imperfect approximations, but for different reasons. More precisely, we assumed that a linear causal model generated the data in order to compute the structural counterfactual couplings. However, this model-class assumption is not a perfect fit: in particular, some of the produced counterfactual instances are not realistic, yielding GPA scores exceeding the upper limit of 4.0 points; more generally, while both couplings should have µ_0 and µ_1 for marginals, several counterfactual counterparts do not conform to these distributions. Besides, the translation vector w used in practice is an estimation from data, thereby an approximation of the best linear model fitting the data. The implemented optimal-transport mappings are also finite-sample estimators of the true mappings between the continuous distributions. Figure 2c notably shows poor counterfactual associations for outliers of the red sample, likely due to weak estimation in low-density domains. Nevertheless, the marginal constraint of optimal transport ensures that the generated counterfactuals faithfully fit the data and are therefore plausible. Finally, these approximation artifacts of different nature induce distinct behaviours of the causal-based and transport-based couplings at the border of the data set. However, we remark that the couplings are similar in high-density areas. More precisely, the grey lines near the center of the data set correspond to a translation by the same constant for both approaches. This proximity, which sides with the observations of Black et al. (2020), will be theoretically grounded in Section 5.2.

Figure 2: Counterfactual models for the Law data set. The red sample represents 200 factual black students while the blue sample represents 200 factual white students. The green sample depicts counterfactual instances: the first column (Figures 2a and 2c) has white counterfactual students; the second column (Figures 2b and 2d) has black counterfactual students. The lightgray lines describe the coupling between factual and counterfactual instances.

4.4.2 Body-measurement Data Set

We now further illustrate the properties of optimal-transport counterfactuals on a data set of body measurements from n_0 = 260 women and n_1 = 247 men. The features of interest are the weight X1 and the height X2, while S encodes the gender. Suppose now that Bob is an 80kg and 190cm man. What would have been Bob's height and weight had he been a woman? Since we do not know the structural relationships between X, S and possibly hidden sources of randomness U, we follow Black et al. (2020) and rely on mass-transportation techniques to answer this counterfactual question. We proceed as before to estimate the optimal transport map from the male distribution µ_1 towards the female distribution µ_0.

Figure 3: Body data set. (a) 2D optimal-transport matching. (b) 2D component-wise quantile-preservation. The red dots represent a data sample of men, while the blue dots represent a data sample of women. The green dots are the estimated counterfactual counterparts of the male sample.

Applying this operator to Bob, we obtain that, had he been a woman, she would have been 59kg and 177cm. Though it does not have a canonical definition when d = 2, optimal transport seems visually to preserve the position of the paired points from one marginal to another.
This is due to the optimal map being the unique gradient of a convex function between distributions, as previously explained. We underscore in Figure 3 that optimal transport does not amount to feature-wise quantile-preservation, making it a relevant extension of the notion of order to higher dimensions. Notably, preserving the quantile along each coordinate does not satisfy the marginal constraint, yielding counterfactual women not representative of their gender's distribution.

4.5 Discussion

Counterfactuals have valuable applications in fairness and explainability. One could for example try to learn a predictor h designed to make h(x, 0) as close as possible to h(x', 1) for every counterfactual pair (x, x'). This is what Russell et al. (2017) proposed using causal models, and what we implement in Section 7 using transport-based models. Or, one could test whether a trained predictor h is unfair by checking whether h(x, 0) = h(x', 1) holds for every counterfactual pair (x, x'), which is essentially the procedure of Black et al. (2020) leveraging optimal transport maps. However, the application of counterfactual models raises several issues. We conclude Section 4 by discussing important drawbacks of the causal account of counterfactual reasoning as well as the limitations of the transport approach.

4.5.1 Shortcomings of the Causal Approach

The main limitation of computing counterfactuals like Definition 4 and Definition 5, as for any causal-based framework, is its feasibility due to the reliance on an SCM. Assuming a known causal model, in particular a fully-specified causal model, is too strong an assumption in practice. It requires experts to reach a consensus on the causal graph, the structural equations, the distribution of the input exogenous variables, and to test the validity of their model on available data.
This is not a realistic scenario, especially when dealing with a high number of features and possibly complex structural relations. Besides, this is not practical since a causal model must be designed and tested for each possible data set. A more straightforward approach is to directly infer the causal model from observational data. There exist for instance sound techniques to learn the causal graph, but they suffer from being NP-hard, with an exponential worst-case complexity with respect to the number of nodes (Cooper, 1990; Chickering et al., 2004; Scutari et al., 2019). In addition, this is not enough to compute counterfactual quantities, as the structural equations would still be lacking. To obtain these equations, researchers often predefine the functional form of the relations between the variables on the basis of a known graph (be it assumed or inferred) and learn them through regression models (Kusner et al., 2017; Russell et al., 2017), or infer the graph and the structural equations simultaneously. However, this also becomes computationally challenging as the number of features increases. Notably, the literature mostly addresses simple linear models (Shimizu et al., 2006) or very few variables (Hoyer et al., 2008). Finally, the approximation error implied by the choice of the functional class can lead to unrealistic, out-of-distribution counterfactuals, as exemplified in Figure 2 above. To our knowledge, the literature on causal counterfactuals has not pointed out this flaw to date. A related issue is causal uncertainty. There exist several causal models corresponding to the same data distribution, leading to possibly different counterfactual models (see Bongers et al., 2021, Example 4.2). It cannot be tested whether the adjusted model is the true one, making the modeling inherently uncertain.
Moreover, for non-deterministic structural counterfactual models, the computation of counterfactual quantities requires knowing the law of the exogenous variables, which is not observable. While it is common to assume a prior distribution on U, this also adds uncertainty to the causal modeling, hence to the induced counterfactuals. Perhaps more surprisingly, counterfactual quantities are sometimes nonexistent in Pearl's causal framework. The causal modeling we introduced is very general: we do not assume the model to be causally sufficient, and only suppose that the equations are acyclic. Assumption (A) is very common for both practical reasons and reasons of interpretability. In general, however, observational data can be generated through a cyclic mechanism. Critically, (solvable) cyclic models do not always admit solutions under do-interventions, implying that X_{S=s'} may not be defined. We refer to (Bongers et al., 2021, Example 2.17) for an illustration. As a consequence, counterfactual quantities are ill-defined in such settings.

4.5.2 Applicability of the Transport Approach

Regarding transport-based counterfactual reasoning, the main practical limitation is also computational. The domain S of the intervened variable S must be finite for the counterfactual model to be tractable. Moreover, generating the model needs |S|(|S| − 1)/2 computations of transportation plans, which can become too expensive when |S| is large. We recall that solving Problem 5 demands O((n + m)nm log(n + m)) operations. Therefore, this approach is tailored to settings with small |S|, typically fairness problems where S represents gender or race. This is why we specifically focus on fairness applications in Sections 6 to 8. Another inconvenience comes from the fact that one must specify a family of couplings to implement a transport-based counterfactual model.
There is no quantitative rule for this choice; it is guided by intuition and feasibility reasons, and we explained above why optimal transport is a relevant option. Note that the causal approach has a similar flaw: as previously explained, structural counterfactual models are subject to misspecification since the underlying causal model itself is uncertain. The advantage of transport methods compared to causal modeling is that they circumvent possibly wrong assumptions on the data-generative process. In particular, transport plans consistently adjust to the data (thanks to the marginal constraint) regardless of the chosen family of couplings, whereas misspecification of the SCM may lead to out-of-distribution structural counterfactuals, as aforementioned. In the following, we derive theoretical properties of the counterfactual models introduced in this section, grounding the similarity between optimal transport and Pearl's computation of counterfactuals we evidenced in Figure 2. Interestingly, this echoes the work of Black et al. (2020), who also empirically observed that optimal transport maps generate nearly identical counterfactuals to the ones based on causal models.

5. Theoretical Results

Until now, we have recalled the basics of causality and transport in Sections 2 and 3, and introduced counterfactual models, either causal-based or transport-based, in Section 4. In what follows, we demonstrate connections between both approaches. Concretely, we first explore in Section 5.1 the relationship between an SCM and the counterfactual model it induces, providing justifications for what we observed in Section 4.4.1. More precisely, we study the implications of typical causal assumptions on the generated counterfactuals. Then, on the basis of these assumptions and the mass-transportation formalism proposed in Section 4, we demonstrate in Section 5.2 that optimal transport recovers structural counterfactuals in specific cases.
5.1 Causal Assumptions and Their Consequences

We analyze in detail two standard scenarios of the causal counterfactual framework: first, when the counterfactuals are deterministic, the computation can be written as an explicit push-forward operation; second, when S can be considered exogenous, the counterfactual distribution is observable. Note that none of Section 5.1 involves any specific knowledge of optimal transport theory, only of causal modeling and (general) mass transportation.

5.1.1 The Deterministic Case

We show that when the SCM deterministically implies the counterfactual values of X, the counterfactual coupling is deterministic. Additionally, we provide the expression of the corresponding push-forward operator. To reformulate structural counterfactuals in deterministic transport terms, we first highlight the functional relation between an instance and its intervened counterparts.

Lemma 7 If M satisfies (A), then there exists a measurable function F such that X = F(S, U_X) P-a.s. and X_{S=s} = F(s, U_X) P-a.s. for every s ∈ S.

The proof leverages the acyclicity of the structural equations, which implies that the system of equations defining X and S is triangular, enabling X to be expressed solely in terms of U_X and S. Now, for every s ∈ S, let us set the function f_s : u ↦ F(s, u), defined L(U_X)-almost everywhere. Using this notation, we can give a simple expression of the possible counterfactual counterparts of any factual instance. In what follows, cl(B) denotes the closure of any B ⊆ R^d.

Proposition 8 Let M satisfy (A). For any s, s′ ∈ S and µ_s-almost every x ∈ X_s,

supp µ⋆_{s′|s}(·|x) ⊆ cl(f_{s′} ∘ f_s^{-1}({x})).

As a direct consequence of this proposition, all counterfactual quantities on X with respect to S are uniquely determined when the right term of the inclusion becomes a singleton, hence when the following assumption holds.

Assumption (I) The functions in {f_s}_{s∈S} are injective.
While the unique solvability of acyclic models ensures that (X, S) is deterministically determined by U, (I) states that, conversely, U_X is deterministically determined by {X = x, S = s}. This assumption holds in particular for additive-noise models: classical models where the exogenous variables are independent additive terms of the structural equations, such as in Example 1 and Section 4.4.1.

Example 3 An acyclic SCM M = (U, G) is an additive-noise model if for any i ∈ I, Exo(i) is a singleton and G_i(v_{Endo(i)}, u_{Exo(i)}) := φ_i(v_{Endo(i)}) + u_{Exo(i)}, where φ_i : V_{Endo(i)} → V_i. In this case, the random seed U is fully determined by the value of V, meaning that for any v ∈ V the posterior distribution L(U | V = v) narrows down to a single value. As such, whatever the do-intervention on V, the three-step procedure can only generate a deterministic counterfactual quantity.

In our setting, which addresses interventions on a single endogenous variable S, satisfying (I) does not require a fully invertible model between V = (X, S) and U, but only between X and U_X for every fixed value of S. As an illustration, consider an additive-noise model over X only, meaning that the equation for S does not matter. Formally, the structural equations on the variables in X can be summarized as X = ϕ(S, X) + U_X P-a.s., where ϕ : S × X → X is a deterministic measurable function. Assumption (A) entails through unique solvability that X = (I − ϕ(S, ·))^{-1}(U_X) P-a.s. Thereby f_s(u) := (I − ϕ(s, ·))^{-1}(u), and Assumption (I) readily holds, with f_s^{-1}(x) = x − ϕ(s, x).

Remark that Assumption (I) imposes constraints on the variables to enable a one-to-one correspondence between the realizations of L(X) and L(U_X). In particular, these two probability laws must be supported on spaces with the same cardinality, preventing for instance a continuous U_X with a discrete X.
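To make Assumption (I) concrete, here is a minimal one-dimensional sketch: a toy additive-noise model whose structural term depends on S only (the function phi and all coefficients are illustrative, not from the paper). Recovering U_X from a factual pair (x, s) and pushing it through the alternative equation yields the deterministic structural counterfactual.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy additive-noise equation X = phi(S) + U_X (illustrative choice of phi).
# Here f_s(u) = phi(s) + u is injective for every s, so Assumption (I) holds.
def phi(s):
    return 2.0 * s

def f(s, u):          # structural assignment for X given S = s
    return phi(s) + u

def f_inv(s, x):      # abduction step: recover U_X from {X = x, S = s}
    return x - phi(s)

u = rng.normal(size=1000)
s, s_alt = 0.0, 1.0
x = f(s, u)           # factual observations in group s
u_hat = f_inv(s, x)   # recovered exogenous noise
x_cf = f(s_alt, u_hat)  # deterministic counterfactuals of x under S = s_alt
```

With this choice of phi, every counterfactual is simply the factual value shifted by phi(1) − phi(0) = 2, which is the translation behavior discussed for additive models.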
Note also that, even though it is restrictive, the mainstream literature on causality frequently assumes full invertibility. In particular, most causal-discovery frameworks, which aim at inferring structural causal models from observational data, require invertible models (Zhang and Chan, 2006; Hoyer et al., 2008) or even linear-additive ones (Shimizu et al., 2006). Analogously, the recent research on causal algorithmic recourse generally addresses invertible models in both theory and practice (Karimi et al., 2021; Dominguez-Olmedo et al., 2022; von Kügelgen et al., 2022). In Section 5.2, we will use the invertibility assumption as an ideal setting to derive theoretical guarantees. Let us finally turn to the structural counterfactual models. Assumption (I) implies that all the couplings between the factual and counterfactual distributions are deterministic, as written in the next proposition.

Proposition 9 Let M satisfy (A), suppose that (I) holds, and for any s, s′ ∈ S set the mapping T⋆_{s′|s} := f_{s′} ∘ f_s^{-1}|_{X_s}, defined µ_s-almost everywhere, where f_s^{-1}|_{X_s} denotes the restriction of f_s^{-1} to X_s. The following properties hold:

(i) µ⋆_{s′|s}(·|x) = δ_{T⋆_{s′|s}(x)} for µ_s-almost every x ∈ X_s;

(ii) µ⋆_{s′|s} = T⋆_{s′|s} # µ_s;

(iii) π⋆_{s′|s} = (I, T⋆_{s′|s}) # µ_s.

We say that T⋆_{s′|s} is a structural counterfactual operator, and associate T⋆ := {T⋆_{s′|s}}_{s,s′∈S} to the deterministic structural counterfactual model Π⋆ = {(I, T⋆_{s′|s}) # µ_s}_{s,s′∈S}. Similarly to the structural counterfactual couplings, the operators in T⋆ describe the effect of causal interventions on factual distributions. We highlight that they are well-defined without any knowledge of L(U), meaning that the exogenous variables are not necessary to compute counterfactual quantities under (I).
Lastly, remark that we framed (I) so that it implies that all the counterfactual instances, for any change on S, are deterministic, leading to a fully-deterministic counterfactual model.2 However, according to Proposition 8, it suffices that one f_s be injective for some s ∈ S to render all the counterfactual couplings {π⋆_{s′|s}}_{s′∈S} deterministic. Therefore, when (I) does not hold, the structural counterfactual model possibly contains both random and deterministic couplings.

5.1.2 The Exogenous Case

We now discuss the counterfactual implications of the position of S in the causal graph. More specifically, we focus on the case where S can be considered as a root node. We will see that this entails that the structural counterfactual model is a transport-based counterfactual model.

2. In logic terms, this means that the model verifies the conditional excluded middle (Stalnaker, 1980).

Figure 4: DAG of a structural causal model satisfying (RE). The nodes U_X and X possibly represent several variables.

Let ⊥⊥ denote independence between random variables. The variable S is said to be exogenous relative to X (Galles and Pearl, 1998) if the following holds:

Assumption (RE) U_S ⊥⊥ U_X and X ∩ Endo(S) = ∅.

The first item, U_S ⊥⊥ U_X, ensures that there is no hidden confounder between X and S. The literature on causal modeling generally supposes causal sufficiency (Shimizu et al., 2006; Karimi et al., 2021; Dominguez-Olmedo et al., 2022), which is a stronger condition: by demanding Exo(i) ∩ Exo(i′) = ∅ for every i ≠ i′, causal sufficiency implies that the (U_{Exo(i)})_{i∈I} are mutually independent (as a consequence of the (U_j)_{j∈J} being mutually independent). The second item, X ∩ Endo(S) = ∅, means that S is ancestrally closed: no variable in X is a direct cause (or parent) of S (see Figure 4).
This assumption intuitively holds in fairness problems, such as in Section 4.4.1, where the variable S to alter generally encodes someone's gender, race or age, which presumably do not have any observable causes. As pointed out by Fawkes et al. (2021), ancestral closure is a common hypothesis in causal-fairness research, and even a requirement for many frameworks (Kusner et al., 2017; Russell et al., 2017; Nabi and Shpitser, 2018; Chiappa, 2019; Kilbertus et al., 2020; Plecko and Meinshausen, 2020). Note that (RE) cannot be tested, and thereby rests on plausibility. Interestingly, relative exogeneity has critical implications for the generated counterfactuals. Assumption (RE) readily entails that S ⊥⊥ U_X. Then, it is easy to see that at the distributional level, intervening on S amounts to conditioning X on a value of S.

Proposition 10 Let M satisfy (A). If (RE) holds, then for every s, s′ ∈ S we have µ_{S=s′} = µ_{s′} = µ⋆_{s′|s}.

Recall that the structural counterfactual coupling π⋆_{s′|s} represents an intervention transforming an observable distribution µ_s into an a priori non-observable counterfactual distribution µ⋆_{s′|s}. According to Proposition 10, (RE) renders the causal model otiose for the purpose of generating the counterfactual distribution, as the latter coincides with the observable factual distribution µ_{s′}. This is notably what occurred in the example from Section 4.4.1. However, we underline that the coupling is still required to determine how each specific profile is matched. As such, the causal model still carries major information on the induced counterfactual quantities. Besides, as remarked by Plecko and Meinshausen (2020) and Fawkes et al. (2021), a practical consequence of (RE) is that it enables linking observational and causal notions of fairness. In Section 6, we will prove a similar result through the prism of counterfactual models.
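Proposition 10 can be checked by simulation on a toy SCM satisfying (RE); the model and sample size below are illustrative choices, not from the paper. Under relative exogeneity, the interventional distribution µ_{S=1} matches the observable group distribution µ_1.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

# Toy SCM satisfying (RE), for illustration only: S = U_S is a root node,
# X = S + U_X, and U_S is independent of U_X (no hidden confounder).
u_s = rng.integers(0, 2, size=n).astype(float)
u_x = rng.normal(size=n)
x = u_s + u_x

# Interventional distribution mu_{S=1}: force S = 1 while keeping the same U_X.
x_do = 1.0 + u_x
# Factual group distribution mu_1: observed X restricted to the group S = 1.
x_obs = x[u_s == 1.0]

# Under (RE) the two distributions coincide; compare their first two moments.
gap_mean = abs(x_do.mean() - x_obs.mean())
gap_std = abs(x_do.std() - x_obs.std())
```

Both gaps shrink toward zero as n grows; without (RE) (e.g., with a confounder shared by U_S and U_X), the conditional and interventional distributions would differ.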
The demonstration relies on the proposition below, which ensures that structural counterfactual models are transport-based counterfactual models when S is relatively exogenous to X.

Proposition 11 Let M satisfy (A). If (RE) holds, then for any s, s′ ∈ S,

(i) π⋆_{s′|s} ∈ Π(µ_s, µ_{s′});

(ii) (π⋆_{s′|s})^t = π⋆_{s|s′}.

Suppose additionally that (I) holds. Then, for any s, s′ ∈ S,

(iii) T⋆_{s′|s} # µ_s = µ_{s′};

(iv) the operator T⋆_{s′|s} is invertible µ_s-almost everywhere, with T⋆_{s′|s}^{-1} = T⋆_{s|s′} µ_{s′}-almost everywhere.

Notably, this means that in classical fairness settings, transport-based models can be seen as approximations, or relaxations, of structural models. Another meaningful consequence of Proposition 11 is that items (ii) and (iv) may be false when (RE) does not hold. Said differently, in general contexts, there is no reciprocity between a factual instance and its structural counterfactual counterparts. We emphasize that even under the assumptions of Proposition 11, a transport-based counterfactual model does not necessarily coincide with the structural counterfactual model: Theorem 12 will provide sufficient conditions under which such an equivalence occurs.

5.1.3 The Example of Linear Additive SCMs

We illustrate how our notation and assumptions apply to the case of linear additive structural models, which account for many state-of-the-art models (Bentler and Weeks, 1980; Shimizu et al., 2006; Hyttinen et al., 2012; Rothenhäusler et al., 2021).

Example 4 Under (RE) and (A), a linear additive SCM is characterized by the structural equations X = MX + wS + b + U_X, where w, b ∈ R^d and M ∈ R^{d×d} are deterministic parameters such that I − M is invertible, and U_S ⊥⊥ U_X. Solving the equations, we get X = (I − M)^{-1}(wS + b + U_X) =: F(S, U_X). Besides, note that (I) holds, with f_s^{-1}(x) = (I − M)x − ws − b for any s ∈ S. Then, for any s, s′ ∈ S, T⋆_{s′|s}(x) = x + (I − M)^{-1}w(s′ − s). This general expression is consistent with the example from Section 4.4.1.
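A small numerical sketch of Example 4 (the parameters M, w, b are arbitrary illustrative choices): the abduction–action–prediction procedure reduces to the affine operator T⋆_{s′|s}(x) = x + (I − M)^{-1} w (s′ − s), which we cross-check against the three-step computation.

```python
import numpy as np

d = 2
M = np.array([[0.0, 0.0],
              [0.5, 0.0]])          # strictly triangular, hence acyclic
w = np.array([1.0, -0.5])
b = np.array([0.2, 0.1])
A = np.linalg.inv(np.eye(d) - M)    # (I - M)^{-1}

def T(x, s, s_new):
    # closed-form structural counterfactual operator of Example 4
    return x + A @ w * (s_new - s)

# Cross-check against the three-step procedure: abduct U_X, act on S, predict.
rng = np.random.default_rng(1)
x = rng.normal(size=d)
s, s_new = 0.0, 1.0
u = (np.eye(d) - M) @ x - w * s - b   # abduction: u = f_s^{-1}(x)
x_cf = A @ (w * s_new + b + u)        # prediction: f_{s'}(u)
```

The two computations agree exactly, illustrating that for this class of SCMs the counterfactual is a translation of the factual point, independent of b and of the noise distribution.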
Remarkably, in the specific case of linear additive SCMs fitting (RE), computing counterfactual quantities amounts to applying translations between factual distributions. Therefore, should an oracle reveal that the SCM belongs to this class without providing the structural equations, it would suffice to compute the mean translation between sampled points from µ_s and µ_{s′} to obtain an estimator of the counterfactual operator T⋆_{s′|s}. For more complex SCMs satisfying (RE), it is presumably difficult to infer the counterfactual model from data. We address this issue in the next section. Specifically, we show that optimal transport for the quadratic cost generates the same counterfactuals as a class of causal models including linear additive models.

5.2 When Optimal Transport Meets Causality

We focus on the deterministic transport-based counterfactual model T = {T_{s′|s}}_{s,s′∈S} defined by the solutions of Problem (4) between all pairs of factual distributions. That is, for every s, s′ ∈ S,

T_{s′|s} := argmin_{T : T#µ_s = µ_{s′}} ∫_{X_s} ‖x − T(x)‖² dµ_s(x). (6)

As explained in Section 4, this model provides an elegant interpretation of the obtained counterfactual statements, as they are defined by minimizing the squared Euclidean distance between paired instances, and preserve the quantiles between marginals when d = 1. Moreover, as stated in the following theorem, this transport-based counterfactual model recovers structural counterfactuals in specific cases.

Theorem 12 Let M satisfy (A), (RE) and (I). Suppose that the factual distributions {µ_s}_{s∈S} are absolutely continuous with respect to the Lebesgue measure and have finite second-order moments. For any s, s′ ∈ S, if the structural counterfactual operator T⋆_{s′|s} is the gradient of some convex function, then it is the solution to Problem (6).
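The map in Problem (6), which Theorem 12 relates to structural counterfactuals, can be estimated from samples. A minimal sketch (synthetic data; SciPy's assignment solver is assumed available): with equal-size samples, the quadratic-cost plan is a permutation, and in one dimension it is the monotone, quantile-preserving matching mentioned above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(2)
n = 200
# Two hypothetical factual groups with shifted and rescaled 1-d features.
xs = rng.normal(loc=0.0, scale=1.0, size=n)
xt = rng.normal(loc=2.0, scale=1.5, size=n)

# Quadratic-cost OT between equal-size empirical distributions reduces to
# an assignment problem on the squared-distance cost matrix.
cost = (xs[:, None] - xt[None, :]) ** 2
row, col = linear_sum_assignment(cost)

# In one dimension, the optimal assignment is monotone (quantile-preserving):
# the i-th smallest source point is matched to the i-th smallest target point.
order_s, order_t = np.argsort(xs), np.argsort(xt)
monotone = np.empty(n, dtype=int)
monotone[order_s] = order_t
```

For distinct sample points the monotone matching is the unique minimizer of the squared cost in 1-d, so the assignment solver recovers it exactly; in higher dimension the same assignment computation yields the empirical plan, without the sorting shortcut.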
The mass-transportation formalism of Pearl's counterfactual reasoning introduced in Section 4.2 and developed in Section 5.1 renders the proof of this theorem straightforward. The non-triviality comes precisely from the reformulation of deterministic structural counterfactuals through push-forward operators. We underline that the demonstration does not require any prior knowledge of optimal transport theory beyond what we summarized in Lemma 3. Thus, for the sake of illustration and clarity, we reproduce it directly below.

Proof According to (I) and Proposition 9, the SCM defines a structural counterfactual operator T⋆_{s′|s} between µ_s and µ⋆_{s′|s}. Additionally, (RE) implies through Proposition 10 that µ⋆_{s′|s} = µ_{s′}. Therefore, T⋆_{s′|s} # µ_s = µ_{s′}. Assume now that µ_s is absolutely continuous with respect to the Lebesgue measure, and that both µ_s and µ_{s′} have finite second-order moments. If T⋆_{s′|s} is the gradient of some convex function, then according to Lemma 3 it is the solution to Problem (4) between µ_s and µ_{s′}, that is, the solution to Problem (6).

Understanding the strengths and limitations of Theorem 12 requires understanding how rich the class of SCMs fitting its assumptions is. The larger the class, the more likely optimal transport maps for the squared Euclidean cost will provide (nearly) identical counterfactuals to causality. Finding explicit conditions on f_s and f_{s′} so that f_{s′} ∘ f_s^{-1} is the gradient of a convex potential requires tedious computations as soon as d > 1, which renders the specification of the relevant SCMs difficult. Nevertheless, we can find specific sub-classes of causal models fitting Theorem 12. For instance, as the structural counterfactual operator from Example 4 is the gradient of a convex function, we obtain the following corollary.

Corollary 13 Let M be a linear additive SCM satisfying (RE) (see Example 4).
Suppose that the factual distributions {µ_s}_{s∈S} are absolutely continuous with respect to the Lebesgue measure and have finite second-order moments. Then for any s, s′ ∈ S, the structural counterfactual operator T⋆_{s′|s} is the solution to Problem (4) between µ_s and µ_{s′}.

Therefore, up to a linear approximation of the data-generating process, employing optimal transport maps for counterfactual reasoning in fairness contexts recovers causal changes, as in the example from Section 4.4.1. Besides, the scope of Theorem 12 goes beyond linear additive SCMs, as shown in the following non-linear, non-additive example.

Example 5 Consider the following SCM,

X_1 = α(S) U_1 + β_1(S),
X_2 = α(S) ln²((X_1 − β_1(S))/α(S)) U_2 + β_2(S),

where α, β_1, β_2 are R-valued functions such that α > 0, U_1 > 0, and U_S ⊥⊥ (U_1, U_2). It satisfies (A), (I) and (RE), and for any s, s′ ∈ S the associated structural counterfactual operator is given by

T⋆_{s′|s}(x) = (α(s′)/α(s))(x − β(s)) + β(s′),

where β = (β_1, β_2) is R²-valued. This is the gradient of the convex function x ↦ (α(s′)/(2α(s)))‖x‖² + [β(s′) − (α(s′)/α(s))β(s)]ᵀx. Then, if the factual distributions are absolutely continuous with respect to the Lebesgue measure and have finite second-order moments, T⋆_{s′|s} is the solution to (4) between µ_s and µ_{s′}.

Note that the converse of the implication in Theorem 12 does not hold. This comes from the fact that many functions (even continuous ones) cannot be written as gradients when d > 1, as illustrated in the following example.

Example 6 Consider the following SCM,

X_1 = U_1,
X_2 = S X_1² + U_2,
S = U_S,

where U_S ⊥⊥ (U_1, U_2). It satisfies (A), (I) and (RE), and for any s, s′ ∈ S the associated structural counterfactual operator is given by

T⋆_{s′|s}(x_1, x_2) = (x_1, x_2 + (s′ − s) x_1²).

It cannot be written as the gradient of a function, since its Jacobian is not symmetric. Consequently, it is not a solution to (4) according to Lemma 3.
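The obstruction in Example 6 can be verified numerically (taking s′ − s = 1 for illustration): a C¹ map is a gradient field only if its Jacobian is symmetric everywhere, and here ∂T_2/∂x_1 = 2x_1 while ∂T_1/∂x_2 = 0.

```python
import numpy as np

# Structural counterfactual operator from Example 6, with s' - s = 1.
def T(x):
    x1, x2 = x
    return np.array([x1, x2 + x1 ** 2])

# Central finite-difference Jacobian at a point.
def jacobian(f, x, eps=1e-6):
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

# At x = (1, 0.5): dT2/dx1 = 2*x1 = 2, but dT1/dx2 = 0, so J is asymmetric.
J = jacobian(T, np.array([1.0, 0.5]))
asymmetry = abs(J[0, 1] - J[1, 0])
```

A symmetric Jacobian is exactly the integrability condition for T to be a gradient; the nonzero asymmetry (here 2 at x_1 = 1) certifies that no potential exists.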
Through Section 5, we aimed notably at justifying the pertinence of optimal transport in counterfactual frameworks, on top of the insights and illustrations given in Section 4. To sum up, the main requisite for transport-based methods, typically optimal transport, to be used as substitutes for causal counterfactual reasoning is Assumption (RE), which ensures that structural counterfactual models are transport-based counterfactual models. As previously explained, this condition is almost systematically verified in fairness problems, making the proposed surrogate approach relevant in various essential tasks. The more specific assumptions from Theorem 12, which include (I), describe an ideal setting meant to derive theoretical guarantees; optimal transport remains an arguably relevant alternative even outside this context. Altogether, Theorem 12 and Corollary 13 support the intuition that computing a Π from optimal transport provides a suitable approximation of the unknown structural Π⋆. In the sequel, we apply this approach by extending causal counterfactual frameworks for fairness to transport-based models.

6. Transport-based Counterfactual Fairness

The strength of the unified mass-transportation viewpoint of counterfactual reasoning we proposed in Section 4 and further studied in Section 5 lies in the fact that all definitions and frameworks implicitly based on a structural counterfactual model have a transport-based analogue, and can therefore be made feasible. In this section, we apply this process to fairness in machine learning. Suppose that the random variable S encodes a so-called sensitive or protected attribute (for example race or gender) which divides the population into different classes in a machine-learning prediction task. We denote by h : X × S → R an arbitrary predictor, defining the random variable of predicted output Ŷ := h(X, S).
In this work, we focus on deterministic predictors (be they classifiers or regression functions), in the sense that h is a measurable function of X and S only, meaning that Ŷ has no sources of randomness other than X and S. Fairness addresses the question of the dependence of Ŷ on the protected attribute S. The most classical fairness criterion is the so-called demographic or statistical parity, which is achieved when Ŷ ⊥⊥ S. However, this criterion is notoriously limited, as it only gives a notion of group fairness, and does not control discrimination at a subgroup or an individual level: a conflict illustrated by Dwork et al. (2012). The counterfactual framework, by capturing the structural or statistical links between the features and the protected attribute, allows for sharper notions of fairness. After clarifying the role of the accepted counterfactual fairness condition (Kusner et al., 2017), we reformulate it using the mass-transportation formalism introduced in Section 4. On the basis of this reformulation, we then propose new fairness criteria derived from transport-based counterfactual models.

6.1 Causal Counterfactual Fairness From a Classical Viewpoint

Counterfactually fair predictors satisfy the definition below, proposed by Kusner et al. (2017).

Definition 14 Let M satisfy (A). A predictor Ŷ = h(X, S) is counterfactually fair if for every s, s′ ∈ S and µ_s-almost every x ∈ X_s,

L(Ŷ_{S=s} | X = x, S = s) = L(Ŷ_{S=s′} | X = x, S = s),

where Ŷ_{S=s′} := h(X_{S=s′}, s′).

The interpretation of counterfactual fairness can be confusing, as explained in (Plečko and Bareinboim, 2024, Section 4.4.1) and (Silva, 2024). Before addressing the mass-transportation viewpoint, let us thoroughly analyze the meaning of counterfactual fairness by clarifying the level of discrimination it controls and the type of counterfactuals it involves. We will try to be particularly careful with the vocabulary.
On the one hand, counterfactual fairness as expressed by Definition 14 involves a profile-level context. A direct rewriting of the equality between laws in the definition leads to:

δ_{h(x,s)} = h(·, s′) # µ⋆_{s′|s}(·|x).

This shows that counterfactual fairness requires equal outputs between every factual profile and their structural counterfactual counterparts (Proposition 16 will make this even clearer). On the other hand, many people refer to counterfactual fairness as a within-unit notion of fairness, suggesting that it ensures that every single unit is treated equally in all alternative worlds in which only the protected status has changed. This could puzzle some readers, as the direct formalization of such a condition would rather be L(Ŷ | U = u) = L(Ŷ_{S=s′} | U = u) for every unit u ∈ U and altered protected attribute s′ ∈ S, to explicitly include a unit-level context. All in all, it seems that two different interpretations coexist: an across-profile condition with input {X = x, S = s}; a within-unit condition with input {U = u}. Crucially, as explained in (Kusner et al., 2017, Appendix S1) and further justified by Silva (2024), these two interpretations formally coincide; using a profile-level context in the original definition was notably meant to remind of the three-step procedure. For the sake of completeness, we provide our own version of this equivalence in the next proposition.

Proposition 15 Let M satisfy (A). A predictor Ŷ = h(X, S) is counterfactually fair if and only if for every s′ ∈ S and L(U)-almost every u ∈ U,

L(Ŷ | U = u) = L(Ŷ_{S=s′} | U = u),

where Ŷ_{S=s′} := h(X_{S=s′}, s′). All the probability distributions written above are degenerate.

Remark that this result holds even when {U = u} is not uniquely determined by {X = x, S = s}. It is due to the fact that Ŷ only depends on U through (X, S).3 Moreover, counterfactual fairness is frequently referred to as an individual fairness criterion, which can also be ambiguous.
In the causality literature, the terms individual and unit are used interchangeably to refer to an element u ∈ U (Saha and Garain, 2022; Plečko and Bareinboim, 2024). According to (Kusner et al., 2017, Appendix S1), counterfactual fairness is an individual notion of fairness in the sense that it controls the outputs of a same unit across different realities. This signifies that counterfactual fairness enjoys two individual properties: (1) it involves units; (2) it constrains predictions within a single unit rather than across distinct units. However, there exist fairness definitions tagged individual without involving units or causal models. Notably, individual fairness as popularized by Dwork et al. (2012) simply demands similar outputs for similar inputs (with respect to some similarity metrics that are not necessarily causal-based). It does not qualify as an individual fairness notion in the way counterfactual fairness does; this criterion is individual in the sense that it simply prevents disparate treatment against specific data points (or profiles), as opposed to group (or distributional) criteria. The most notorious group fairness condition is statistical parity, that is h(X, S) ⊥⊥ S, which can be written as h(·, s) # µ_s = h(·, s′) # µ_{s′} for every s, s′ ∈ S, highlighting that it prevents discrimination across groups of profiles but not across specific profiles.4 Interestingly, counterfactual fairness is an individual fairness criterion in both senses: it constrains predictions within a unit and across input profiles. In what follows, we avoid the loaded word individual and prefer the dissimilar terms unit and profile.

3. In practice, when the counterfactual counterparts are not deterministic, the probabilities in Definition 14 are approximated via a Monte-Carlo method after learning L(U | X = x, S = s). See for instance (Russell et al., 2017) or (Chiappa and Isaac, 2019).
We also specify whether we study an effect within a single instance or across different instances.

6.2 Causal Counterfactual Fairness From a Mass-transportation Viewpoint

We now move on to giving a mass-transportation interpretation of Definition 14.

Proposition 16 Let M satisfy (A). The following properties hold.

(i) A predictor h(X, S) is counterfactually fair if and only if for every s, s′ ∈ S and π⋆_{s′|s}-almost every (x, x′), h(x, s) = h(x′, s′).

(ii) If (RE) holds, then a predictor h(X, S) is counterfactually fair if and only if for every s, s′ ∈ S such that s < s′ and π⋆_{s′|s}-almost every (x, x′), h(x, s) = h(x′, s′).

(iii) If (I) holds, then a predictor h(X, S) is counterfactually fair if and only if for every s, s′ ∈ S and µ_s-almost every x, h(x, s) = h(T⋆_{s′|s}(x), s′).

(iv) If (I) and (RE) hold, then a predictor h(X, S) is counterfactually fair if and only if for every s, s′ ∈ S such that s < s′ and µ_s-almost every x, h(x, s) = h(T⋆_{s′|s}(x), s′).

Let us detail this result. Items (ii) to (iv) are variations of the first item under the implications of (RE) and (I), through Propositions 11 and 9 respectively. Note that they have practical interest. Assumption (I) highlights the deterministic relationship between factual and counterfactual quantities, and makes the knowledge of L(U) unnecessary to test counterfactual fairness. Assumption (RE) entails by symmetry that only half of the couplings are necessary to check the condition. Proposition 16 clearly shows that counterfactual fairness requires equal outputs across all counterfactually-paired profiles.

4. The survey paper of Castelnovo et al. (2022) clarifies the two main distinctions among fairness criteria: (1) observational vs. causal; (2) group vs. individual.
We point out that because counterfactual fairness compares inputs across protected groups rather than within a same neighborhood of profiles, it is not incompatible with group fairness, in contrast to the individual-fairness criterion studied in (Dwork et al., 2012; Xu and Strohmer, 2024). It is in fact the opposite: if (RE) holds, then counterfactual fairness is a stronger criterion than statistical parity across groups, as stated in the following proposition.

Proposition 17 Let M satisfy (A) and suppose that (RE) holds. If the predictor h(X, S) satisfies counterfactual fairness, then it satisfies statistical parity. The converse does not hold in general.

In the original paper on counterfactual fairness, the authors had already demonstrated the connection between their definition and statistical parity. Note that the assumptions in Proposition 17 correspond to a specific case of those in (Kusner et al., 2017, Lemma 2), showing that the two results are consistent. Interestingly, our proof of Proposition 17 relies on the mass-transportation viewpoint, which highlights the following interpretation: counterfactual fairness is stronger than statistical parity because it constrains the predictor along some couplings, while statistical parity constrains the predictor only along the couplings' marginals. Remark that comparing statistical parity, formulated as h(·, s) # µ_s = h(·, s′) # µ_{s′} for every s, s′ ∈ S, to counterfactual fairness as expressed in Proposition 16 emphasizes the marginal vs. coupling (which translates into group vs. profile) relation between the two fairness definitions.

6.3 Extending Counterfactual Fairness

One can think of being counterfactually fair as being invariant to counterfactual operations with respect to the protected attribute. In order to define SCM-free criteria, we generalize this idea to the models introduced in Section 4.

Definition 18 1. Let Π = {π_{s′|s}}_{s,s′∈S} be a (random) transport-based counterfactual model.
A predictor h(X, S) is Π-counterfactually fair if for every s, s′ ∈ S and π_{s′|s}-almost every (x, x′), h(x, s) = h(x′, s′).

2. Let T = {T_{s′|s}}_{s,s′∈S} be a deterministic transport-based counterfactual model. A predictor h(X, S) is T-counterfactually fair if for every s, s′ ∈ S and µ_s-almost every x, h(x, s) = h(T_{s′|s}(x), s′).

Note that it follows from the symmetry of transport-based counterfactual models (see items (ii) and (iii) in Definition 6) that only half of the couplings are truly required in the above conditions. Besides, because the proof of Proposition 17 only relies on the assumption that the couplings have the factual distributions as marginals, the following proposition automatically holds.

Proposition 19 Let Π be a transport-based counterfactual model (deterministic or not). If a predictor h(X, S) satisfies Π-counterfactual fairness, then it satisfies statistical parity. The converse does not hold in general.

This result has interesting consequences. Consider that, for the purpose of computing counterfactual quantities, some practitioners designed a candidate SCM fitting the data and satisfying (RE). Even if the SCM is misspecified, it would still characterize a transport-based counterfactual model controlling statistical parity. The fair data-processing transformation proposed by Plecko and Meinshausen (2020) is an illustrative example. More generally, the conceptual interest of transport-based fairness criteria is similar to that of the original counterfactual fairness criterion: they offer across-profile notions of fairness while still controlling for discrimination against protected groups. The added value is their feasibility. In contrast to Definition 14 and Proposition 16, Definition 18 relies on computationally feasible counterfactual models that obviate any assumptions about the data-generating process.
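As a toy illustration of Definition 18 (the translation coupling and both predictors are hypothetical choices, in the spirit of the translation operators arising for linear additive SCMs): for a linear predictor and a translation coupling, T-counterfactual fairness amounts to the weight vector being orthogonal to the translation.

```python
import numpy as np

rng = np.random.default_rng(3)
delta = np.array([2.0, 1.0])   # hypothetical deterministic coupling: x' = x + delta

def T(x):                      # T_{s'|s}: transports group-s profiles to group s'
    return x + delta

w0 = np.array([1.0, 2.0])
v = delta / np.linalg.norm(delta)
w_fair = w0 - (w0 @ v) * v     # project the weights orthogonally to delta

def h_fair(x, s):              # invariant along delta => T-counterfactually fair
    return x @ w_fair

def h_unfair(x, s):
    return x @ w0 + 0.5 * s

# Empirically check h(x, s) = h(T(x), s') over sampled profiles.
xs = rng.normal(size=(100, 2))
gap_fair = max(abs(h_fair(x, 0) - h_fair(T(x), 1)) for x in xs)
gap_unfair = max(abs(h_unfair(x, 0) - h_unfair(T(x), 1)) for x in xs)
```

The projected predictor produces identical outputs on every counterfactual pair, while the unprojected one shifts by a constant delta·w0 + 0.5 = 4.5 on each pair, so it violates the criterion uniformly.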
In addition, as Definition 14 amounts to Π⋆-counterfactual fairness (when (RE) holds), one can as well think of Definition 18 as an approximation of counterfactual fairness. Crucially, these new criteria can naturally be applied in classical explainability and fairness machine-learning frameworks based on counterfactual reasoning. While Black et al. (2020) focused on explaining discriminatory biases in binary decision rules, we address the training of a Π-counterfactually fair predictor in Section 7.

6.4 Ethical Risk

We conclude this section by discussing a potentially negative impact of our work. As mentioned above, the transport-based approach allows for many counterfactual models, but they do not all define legitimate notions of counterparts. Consequently, transport-based notions of counterfactual fairness could be used for unethical fair-washing. The next proposition formalizes this risk.

Proposition 20 If h(X, S) is a classifier satisfying statistical parity, then there exists a transport-based counterfactual model Π with a closed form such that h(X, S) satisfies Π-counterfactual fairness.

Practitioners could take advantage of the weak notion of statistical parity to construct counterfactual models such that their trained classifiers are counterfactually fair, while still discriminating at the subgroup or individual level. This is why we argue that practitioners must always be able to justify the counterfactual models when these are not imposed by legal experts of the prediction task.

7. Application to Counterfactually Fair Learning

We now address an application of transport-based counterfactual models to fairness. More precisely, we introduce a supervised learning procedure trading off between Π-counterfactual fairness and accuracy, and provide statistical guarantees.
7.1 Learning Problem

In (Russell et al., 2017), the authors considered a learning problem involving a penalization controlling the pair-wise difference in decision between the training inputs and their structural counterfactual counterparts. While they gave empirical evidence of the efficiency of their training method, they had to assume a known causal model and did not provide consistency guarantees on the estimated predictor. In this sub-section, we illustrate that this counterfactual approach can naturally be made both feasible and statistically consistent by replacing the structural counterparts with transport-based counterparts. Note that in contrast to Russell et al. (2017), we do not optimize over several counterfactual models.

Let Y : Ω → Y ⊂ R denote the so-called ground-truth variable to predict, and denote by D the law of the data (X, S, Y). We consider a parametric class of predictors {h_θ}_{θ∈Θ} from X × S to Y, indexed by Θ ⊂ R^p where p > 1. For a given counterfactual model Π := {π_{s'|s}}_{s,s'∈S} and a given weight λ > 0, we define the following expected risk on the predictors,

    R_{D,Π,λ}(θ) := E[ℓ(h_θ(X, S), Y)] + λ Σ_{s∈S} P(S = s) Σ_{s'≠s} E[r_θ(X_s, s, X_{s'}, s')],   (7)

where L((X_s, X_{s'})) = π_{s'|s} for every s, s' ∈ S. The function ℓ denotes a data-loss function, continuous with respect to each of its input variables, while r_θ(x, s, x', s') is a penalty promoting counterfactual fairness by enforcing the difference between the outputs of the algorithm for an input and its counterfactual, namely |h_θ(x, s) − h_θ(x', s')|, to be small. For instance, in (Russell et al., 2017), the authors considered the tightest convex relaxation of ϵ-approximate counterfactual fairness, that is

    r_θ(x, s, x', s') := max(0, |h_θ(x, s) − h_θ(x', s')| − ϵ)

for some ϵ > 0. In this paper, we rather work with the penalty

    r_θ(x, s, x', s') := |h_θ(x, s) − h_θ(x', s')|²,

which is smoother. Through λ, the risk R_{D,Π,λ} quantifies a trade-off between accuracy and counterfactual fairness.
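The two penalties can be contrasted on toy predictions. This is a minimal sketch; the helper names are ours, not from the paper's code:

```python
import numpy as np

def hinge_penalty(h_fact, h_cf, eps):
    # max(0, |h(x, s) - h(x', s')| - eps): the convex relaxation of
    # eps-approximate counterfactual fairness from Russell et al. (2017)
    return np.maximum(0.0, np.abs(h_fact - h_cf) - eps)

def squared_penalty(h_fact, h_cf):
    # |h(x, s) - h(x', s')|^2: the smoother penalty used in this paper
    return (h_fact - h_cf) ** 2

h_fact = np.array([0.9, 0.2, 0.5])  # h_theta(x_i, s) on factual inputs
h_cf = np.array([0.1, 0.2, 0.4])    # h_theta(x_j', s') on counterparts
hp = hinge_penalty(h_fact, h_cf, eps=0.05)
sp = squared_penalty(h_fact, h_cf)
print(hp)  # [0.75 0.   0.05]
print(sp)  # [0.64 0.   0.01]
```

The squared penalty is differentiable everywhere, which suits gradient-descent-based training, while the hinge penalty is flat inside the ϵ-tolerance band.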
Importantly, when Π = Π*, it corresponds precisely to the expected risk of the learning problem proposed by Russell et al. (2017) reframed using the mass-transportation viewpoint. In what follows, we simply write R_{D,Π,λ} as R.

In practice, we learn a predictor by minimizing an empirical version of R. To this end, we need an empirical counterfactual model. Concretely, consider a training set {(x_i, s_i, y_i)}_{i=1}^n composed of n i.i.d. observations drawn from D. We divide this collection into |S| protected categories by defining for any s ∈ S the index set I^n_s := {1 ≤ i ≤ n | s_i = s} of cardinality n_s := |I^n_s|. Then, the empirical versions of the factual distributions are, for every s ∈ S, μ^n_s := n_s^{-1} Σ_{i∈I^n_s} δ_{x_i}. In our case, the counterfactual pairs between μ^n_s and μ^n_{s'} are estimated within the training data set through an empirical transport plan {π^n_{s'|s}(i, j)}_{i∈I^n_s, j∈I^n_{s'}}, typically by solving Problem (5) as explained in Section 3.2.3. Then, we define the following empirical risk,

    R_n(θ) := (1/n) Σ_{i=1}^n [ ℓ(h_θ(x_i, s_i), y_i) + λ n_{s_i} Σ_{s'≠s_i} Σ_{j∈I^n_{s'}} π^n_{s'|s_i}(i, j) r_θ(x_i, s_i, x_j, s') ].   (8)

The learning procedure amounts to carrying out a gradient-descent-based routine to minimize R_n. We underline that this procedure, like the original one from (Russell et al., 2017), is tailored to both regression and multi-class classification. It also works for more than two protected groups, but requires the domain of the sensitive variable to be finite.

7.2 Consistency

In this part, we focus on the counterfactual model constructed with quadratic optimal transport, and prove the statistical consistency of the learning procedure. Set a sequence {θ_n}_{n∈N} defined by θ_n ∈ argmin_{θ∈Θ} R_n(θ). The next theorem ensures the convergence to zero of the excess risk R(θ_n) − min_{θ∈Θ} R(θ) for ball-constrained linear predictors.

Theorem 21 Suppose that for every pair of factual distributions, the Kantorovich problem (3) with cost c(x, x') := ‖x − x'‖² admits a unique solution.
Thus, we can define the counterfactual model Π = {π_{s'|s}}_{s,s'∈S} and its empirical counterpart Π_n = {π^n_{s'|s}}_{s,s'∈S} as, for every s, s' ∈ S,

    π_{s'|s} := argmin_{π∈Π(μ_s, μ_{s'})} ∫ ‖x − x'‖² dπ(x, x'),   (9)

    π^n_{s'|s} ∈ argmin_{π∈Σ(n_s, n_{s'})} Σ_{i∈I^n_s} Σ_{j∈I^n_{s'}} ‖x_i − x_j‖² π(i, j).   (10)

Now, let Φ : X × S → R^p be a feature map such that for every s ∈ S and x_1, x_2 ∈ X, ‖Φ(x_1, s) − Φ(x_2, s)‖ ≤ L_s ‖x_1 − x_2‖ where L_s > 0. Consider for Θ ⊂ R^p the class of linear predictors {h_θ}_{θ∈Θ} defined as h_θ(x, s) := θ^T Φ(x, s). If the following assumptions hold,

(i) there exists D > 0 such that Θ = {θ ∈ R^p | ‖θ‖ ≤ D},
(ii) there exists R > 0 such that X ⊆ {x ∈ R^d | ‖x‖ ≤ R},
(iii) there exists b > 0 such that Y ⊆ {y ∈ R | |y| ≤ b},
(iv) there exists L > 0 such that for any (x, s, y) ∈ X × S × Y, the function θ ∈ Θ ↦ ℓ(θ^T Φ(x, s), y) is L-Lipschitz,

then R(θ_n) − min_{θ∈Θ} R(θ) → 0 almost surely as n → +∞.

The proof analyzes the accuracy term and the regularization term separately. The demonstration for the former follows classical results from the statistical-learning literature; the demonstration for the latter is original: we first show that each penalty contribution can be bounded by a distance between the empirical and the true coupling, and then invoke the convergence in law. We gather some additional remarks below.

Remark 22
1. Uniqueness of the solution of (9) notably holds when the factual distributions are Lebesgue absolutely continuous.
2. Typically, Φ is defined as (x, s) ↦ (x, s, 1) in order to add an intercept, or corresponds to the feature map of a kernel when aiming for non-linear decision boundaries.
3. Assumptions (i) to (iv) are common for supervised learning problems. The sets X and Y are usually bounded spaces, as is Θ, the set of parameters defining the algorithm. The Lipschitz conditions for the loss function and the feature map can be directly assumed, or are direct consequences of smoothness properties and compactness assumptions on the spaces on which they are defined.
4.
The assumption on the second-order moments of the factual distributions is automatically satisfied under (ii).
5. If the risks R_n and R are strictly convex, then θ_n = argmin_{θ∈Θ} R_n(θ) and θ* = argmin_{θ∈Θ} R(θ) are well-defined, and it follows that θ_n → θ* almost surely as n → +∞ (this additional step is detailed in the proof of Theorem 21).

8. Numerical Experiments

In this section, we present the implementation of our counterfactually fair learning procedure on real data, and show that it has the expected behaviour. The code is available at https://github.com/lucasdelara/PI-Fair.

8.1 Procedure

Whatever the data set, the general procedure is the following: after dividing the studied data set into a training set and a testing set, we learn one empirical counterfactual model for each set. The first one implements the penalty of the training loss function; the second one enables evaluating the counterfactual fairness of the trained predictors. We compute the corresponding optimal transport plans using the default (non-regularized) POT solver. Then, we train several predictors for various values of the weight λ to study the model's ability to trade off between accuracy and fairness. Finally, we assess the performance of the learnt algorithms according to three criteria: accuracy, group fairness and counterfactual fairness, and we benchmark them against baselines.

8.1.1 Evaluation Metrics

In what follows, h : X × S → Y denotes either a binary classifier (Y = {0, 1}) or a regression function (Y ⊆ R), and D denotes a data set (X, S, Y). Let us properly define the different metrics we employ. To evaluate the data fidelity of a classifier, we compute the accuracy (Acc), defined as

    Acc(h, D) := P(h(X, S) = Y).

For a regression function, we compute the mean square error (MSE), defined as

    MSE(h, D) := E[|h(X, S) − Y|²].
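The empirical transport plans of Problem (10) are computed in the paper with POT's default exact solver. As an illustration of what such a solver does, the discrete Kantorovich problem with uniform weights can be written directly as a linear program; the sketch below (our own helper, not the paper's code) solves it with `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

def empirical_ot_plan(xs, xt):
    """Solve the discrete Kantorovich problem between two empirical
    distributions with uniform weights and quadratic cost; returns the
    optimal transport plan as an (n_s, n_t) matrix of pair weights."""
    n_s, n_t = len(xs), len(xt)
    # Quadratic cost matrix C[i, j] = ||x_i - x_j'||^2
    C = ((xs[:, None, :] - xt[None, :, :]) ** 2).sum(-1)
    # Marginal constraints: rows sum to 1/n_s, columns sum to 1/n_t
    A_eq = []
    for i in range(n_s):
        row = np.zeros((n_s, n_t)); row[i, :] = 1; A_eq.append(row.ravel())
    for j in range(n_t):
        col = np.zeros((n_s, n_t)); col[:, j] = 1; A_eq.append(col.ravel())
    b_eq = np.concatenate([np.full(n_s, 1 / n_s), np.full(n_t, 1 / n_t)])
    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
    return res.x.reshape(n_s, n_t)

rng = np.random.default_rng(0)
plan = empirical_ot_plan(rng.normal(size=(5, 2)), rng.normal(size=(7, 2)) + 2)
print(plan.shape)  # (5, 7)
print(plan.sum())  # ~1.0
```

This scales poorly; in practice one would rely on a dedicated solver such as POT's exact one, as the experiments in this section do.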
To assess the statistical parity of a binary classifier when the protected attribute is binary, we compute the parity gap (PG), defined as

    PG(h, D) := |P(h(X, S) = 1 | S = 0) − P(h(X, S) = 1 | S = 1)|.

It quantifies the violation of group fairness, and equals zero when statistical parity is achieved. For a regression function, we use the Kolmogorov-Smirnov distance (KS) between L(Ŷ | S = 0) and L(Ŷ | S = 1), defined as

    KS(h, D) := sup_{y∈R} |E[1_{h(X,S)>y} | S = 0] − E[1_{h(X,S)>y} | S = 1]|.

Note that this extends the parity gap to the continuous case. The purpose of these two group-fairness indicators is to illustrate Proposition 19, stating that counterfactual fairness implies statistical parity.

Finally, we need a metric to evaluate counterfactual fairness. We extend the notion of (ϵ, δ)-approximate counterfactual fairness introduced by Russell et al. (2017) to transport-based counterfactual models. For a counterfactual model Π and a tolerance ϵ > 0, we define the probability for the disparate treatment by h between (x, s) and its s'-counterfactual counterpart to be lower than ϵ as

    CFT_ϵ(h, x, s, s', Π) := ∫_{x'∈X_{s'}} 1_{|h(x,s) − h(x',s')| ≤ ϵ} dπ_{s'|s}(x'|x),

where π_{s'|s}(·|x) denotes the disintegration of π_{s'|s} with respect to μ_s. Then, for a probability threshold 0 ≤ δ ≤ 1, we say that a predictor h is (ϵ, δ)-approximately counterfactually fair if for every s ∈ S, for μ_s-almost every x ∈ X_s, and for every s' ≠ s,

    CFT_ϵ(h, x, s, s', Π) ≥ 1 − δ.   (11)

We make two remarks: firstly, if h is a classifier, then the only relevant value for ϵ is 0; secondly, if the counterfactual model is deterministic, then the only relevant value for δ is 0. As the empirical counterfactual models we use are non-deterministic, although their continuous counterparts may be deterministic, we set δ = 0.1 whatever the prediction task. In practice, we quantify counterfactual fairness through the (ϵ, δ)-counterfactual fairness rate (CFR),

    CFR_{ϵ,δ}(h, D, Π) := Σ_{s∈S} P(S = s) ∫_{X_s} ∏_{s'≠s} 1_{CFT_ϵ(h,x,s,s',Π) ≥ 1−δ} dμ_s(x).

This corresponds to the proportion of points satisfying (11).
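With an empirical transport plan, CFT and CFR reduce to weighted averages over the rows of the plan. Below is a minimal sketch for one pair of groups (the helper name and the row-normalization convention are ours):

```python
import numpy as np

def empirical_cfr(h_fact, h_cf, plan, eps=0.0, delta=0.1):
    """(eps, delta)-counterfactual fairness rate for one group pair.

    h_fact: (n_s,) predictions h(x_i, s) on the factual group,
    h_cf:   (n_t,) predictions h(x_j', s') on the other group,
    plan:   (n_s, n_t) empirical transport plan (rows sum to 1/n_s).
    """
    # Conditional kernel pi(j | i): normalize each row of the plan
    cond = plan / plan.sum(axis=1, keepdims=True)
    close = np.abs(h_fact[:, None] - h_cf[None, :]) <= eps
    cft = (cond * close).sum(axis=1)   # CFT_eps(h, x_i, s, s', Pi)
    return np.mean(cft >= 1 - delta)   # proportion satisfying (11)

# Toy check: a constant predictor is perfectly counterfactually fair.
plan = np.full((4, 4), 1 / 16)
print(empirical_cfr(np.zeros(4), np.zeros(4), plan))  # 1.0
```

For more than two protected groups, one would take, for each point, the product of the indicator over all s' ≠ s before averaging, as in the definition of CFR above.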
In the classification setting we set ϵ = 0, while in the regression setting we work with ϵ = (1/2) E[|Y − Y'|], where Y' is an independent copy of Y.

Table 1: Data sets

Data set | Adult          | COMPAS         | Law         | Crimes
Task     | Classification | Classification | Regression  | Regression
S : 0/1  | Woman/Man      | Black/White    | Black/White | Black/Non-black
d        | 35             | 6              | 2           | 97
n_train  | 32,724         | 4,120          | 13,109      | 1,335
n_test   | 16,118         | 2,030          | 6,458       | 659

8.1.2 Baselines

We aim at applying our regularized approach for several values of the weight λ to study the model's ability to trade off between accuracy and fairness. For classification tasks, we consider logistic models; for regression tasks, we consider linear regression models. These choices will be useful in particular to benchmark our method against the one of Zafar et al. (2017), tailored to such models. For a given λ, we write Π-Fair(λ) for the corresponding regularized predictor. We compare the obtained results to three baseline algorithms: the best constant predictor Const, which achieves perfect fairness; the group-fair predictor Z developed by Zafar et al. (2017), which is meant to maximize accuracy under an exact-fairness constraint; and the unaltered (λ = 0) predictor U, which is presumably the most accurate but also the most unfair predictor.

8.2 Data Sets

We carry out the experiments on four data sets: the first two for classification and the last two for regression. Note that in all the considered settings, the sensitive variable S is binary and relatively exogenous to X. Table 1 summarizes information about each data set after preprocessing.

8.2.1 Adult

The Adult Data Set from the UCI Machine Learning Repository (Dua and Graff, 2019) has become a gold-standard data set to evaluate and benchmark fairness frameworks. The classification task is to predict whether the income of an individual exceeds 50K USD per year based on census data.
Concretely, the data set contains n = 48,842 instances with 14 attributes (numerical and categorical). The ground-truth variable Y equals 1 whenever the income exceeds 50K, and 0 otherwise. In this work, we set the sensitive variable S to be the gender: S = 0 stands for female, while S = 1 stands for male. The potential sources of algorithmic bias against women have been widely studied by Besse et al. (2021). They mainly amount to an under-representation of women in the data set, as well as a high correlation between being a woman and having a lower income. Any standard algorithm optimizing only for accuracy is bound to be unfair towards women. Before training any models, we process the data using a one-hot encoding of the categorical attributes. The processing is exactly the same as in (Besse et al., 2021). This leads to a data set of dimension d + 1 = 36 (without the outcome). We divide it into a training set of size n_train = 32,724 and a testing set of size n_test = 16,118.

8.2.2 COMPAS

The Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) is an infamous score used by US court officers to assess the risk of criminal recidivism. ProPublica analyzed more than 10,000 cases from Florida, and concluded that black defendants tended to be predicted riskier than they actually were, whereas white defendants were often predicted at lower risk than they actually were.⁵ In this part, we follow Kusner et al. (2017) and try to predict the risk of recidivism while avoiding discrimination based on race, using the same data. Keeping only black and white defendants, we get n = 6,150 instances with d + 1 = 7 attributes, such as the number of prior offenses and the type of crime committed. The ground-truth variable Y equals 1 if the individual recidivated and 0 otherwise. We set the sensitive variable S to be the race: S = 0 stands for black, while S = 1 stands for white.
Finally, we divide the data into a training set of size n_train = 4,120 and a testing set of size n_test = 2,030.

8.2.3 Law School

This is the data set used in Section 4.4.1, gathering statistics from 163 US law schools and more than 20,000 students. Here again we follow Kusner et al. (2017), and try to predict the first-year average grade of individuals Y on the basis of the race (black or white) S, the entrance-exam score X_1, and the grade-point average before law school X_2. All in all, we have d = 2 features excluding the outcome and the protected attribute, and work with n_train = 13,109 training entries and n_test = 6,458 testing entries.

8.2.4 Communities and Crimes

The Communities and Crimes data set can also be found in the UCI Machine Learning Repository (Dua and Graff, 2019). It contains socio-economic, law-enforcement and crime data from communities across the United States. Similarly to Chzhen et al. (2020), we consider the problem of predicting the rate of violent crime per 10^5 population Y, with S = 0 indicating that at least 50% of the population is black and S = 1 otherwise. After processing the 128 numerical and categorical attributes composing the data set, we obtain d + 1 = 98 features over n_train = 1,335 training instances and n_test = 659 testing instances.

8.3 Results

The regularization weight λ successively takes all the values in the grid {10^{-4}, 10^{-3.5}, ..., 10^1}. We repeat the training and evaluation processes of our models together with the baselines across 10 repeats for every data set. As all learning techniques are deterministic, the randomness of the experiments comes uniquely from the division of the data set into training and testing sets. The results are reported in the figures below.
5. https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm

8.3.1 Trade-off Between Accuracy and Fairness

Figures 5 to 8 show the evolution with respect to λ of the accuracy, the counterfactual fairness rate, and the statistical-parity metric. The solid line represents the mean value of the evaluation metric, while the vertical length of the shaded area corresponds to the standard deviation. We observe that our learning algorithm is able to reliably trade off accuracy for counterfactual fairness as λ increases, confirming the relevance of the approach. Additionally, the evaluation metrics remain stable across the different repeats. As anticipated from Proposition 19, the regularization also tends to improve group fairness. Overall, the group-fair learning technique of Zafar et al. (2017) sacrifices less accuracy than our method to reach the same level of statistical parity, but our method performs better at encouraging counterfactual fairness. We conclude that the prevailing technique depends on the specific type of fairness one wants to achieve. Note that the group-fair predictor Z on the Law data set (Figure 7) behaves similarly to the perfectly counterfactually fair predictor. This is likely due to the use of simple linear models on such a low-dimensional data set (d + 1 = 3), limiting the space of feasible algorithms. We leave the in-depth analysis of this phenomenon for further research.

8.3.2 Recovering Causal Effects

To conclude these numerical experiments, let us verify that our optimal-transport counterfactual loss enforces causal counterfactual fairness in the adequate setting. We address the Law data set, for which a plausible causal model is known (see Section 4.4.1) and satisfies the assumptions of Corollary 13.
Figure 9b displays the evolution of the two counterfactual losses, one based on the structural counterfactual model and the other on the optimal-transport counterfactual model, for predictors trained according to the optimal-transport counterfactual model. Figure 9a serves as a sanity check: it plots the normalized variance E[(Ŷ − E[Ŷ])²] / E[(Y − E[Y])²] to control how close a predictor is to being constant. As anticipated by the theory, the training process does promote causal counterfactual fairness: the two curves in Figure 9b are almost identical. Crucially, this is not a consequence of the predictors merely becoming constant, since the predictions in Figure 9a have variations that remain significantly higher than those of the best constant predictor.

8.4 Discussion

To sum up, our learning procedure enables increasing counterfactual fairness while limiting the loss in accuracy, and is both theoretically sound and computationally efficient. This simple approach expands the fair-learning arsenal to stronger fairness criteria than group-fairness conditions, and does so without requiring any additional knowledge of the data-generation process. Regarding limitations, we note that the current procedure is not tailored to mini-batch learning. Using mini-batches would require computing a new empirical counterfactual model for each one, which increases the computational complexity, especially since the batch size should be chosen large enough for the empirical transport plans to make sense.

Figure 5: Evaluation metrics on the Adult data set.
Figure 6: Evaluation metrics on the COMPAS data set.
Figure 7: Evaluation metrics on the Law data set.
Figure 8: Evaluation metrics on the Crimes data set.
Figure 9: Promotion of causal counterfactual fairness on the Law data set. (a) Standard deviation of Ŷ. (b) Counterfactual fairness losses.

This opens new lines of inquiry for leveraging recent advances in computational optimal transport in order to improve counterfactual learning problems. In particular, we could take advantage of entropic regularization schemes to speed up the computation of optimal transport plans (Cuturi, 2013; Peyré and Cuturi, 2019). This would make the produced counterfactual model blurry, but still close to the desired solution, allowing a trade-off between the precision of the counterfactuals and numerical efficiency. Additionally, we could use the growing literature on plug-in estimations of optimal transport maps (Beirlant et al., 2020; Hallin et al., 2021; Manole et al., 2021; Pooladian and Niles-Weed, 2021) to construct empirical counterfactual models not as matrices, but as mappings able to generalize to out-of-sample observations, reusable on new data sets and batches. We leave these directions for future work.

9. Conclusion

In this paper, we focused on the challenge of designing sound and feasible counterfactuals. Our work showed that the causal account of counterfactual modeling can be written in a mass-transportation formalism, where the choice between deterministic and random counterfactuals has a direct formulation in terms of the deterministic or random nature of the couplings between factual and counterfactual instances. This novel perspective enabled us to generalize sharp but unfeasible causal criteria of fairness into actionable transport-based ones. We illustrated that the use of optimal transport is a competitive approach to implement these criteria, as it can recover causal changes and can be computed efficiently. In particular, we proposed a new easy-to-implement method to train accurate classifiers with a counterfactual-fairness regularization.
We provided statistical guarantees, and showed empirically the relevance of our method. With this article, we hope to shed new light on counterfactual reasoning, and to open lines for strengthening the explainability and fair-learning arsenal in artificial intelligence.

Acknowledgments

This work was funded by the AI interdisciplinary institute ANITI, grant agreement ANR-19-PI3A-0004, under the French "Investing for the Future" PIA3 program. The authors thank the Action Editor and the anonymous reviewers for their careful readings and helpful suggestions.

Appendix A. Proofs of Section 5

Proof of Lemma 7 As a direct consequence of Assumption (A), there exists a topological ordering on the nodes of the graph induced by M with respect to the parent-child relation. Therefore, starting with the components X_k for which Endo(k) = ∅ or Endo(k) = {S}, we can recursively replace the terms X_{Endo(k)} in the formulas G_k(X_{Endo(k)}, S_{Endo(k)}, U_{Exo(k)}) by expressions depending only on U_X and S. This yields a measurable function F such that X = F(S, U_X) P-almost surely. The same computation, but changing S to s for some s ∈ S, leads to X_{S=s} = F(s, U_X) P-almost surely.

Proof of Proposition 8 Recall that X = F(S, U_X) P-almost surely. This implies that, P-almost surely, the event {X = x, S = s} coincides with {U_X ∈ f_s^{-1}({x}), S = s}. Besides, X_{S=s'} = f_{s'}(U_X) P-almost surely according to Lemma 7. Then, let B ⊆ X be an arbitrary measurable subset and compute:

P(X_{S=s'} ∈ B | X = x, S = s) = P(f_{s'}(U_X) ∈ B | X = x, S = s)
= P(f_{s'}(U_X) ∈ B, U_X ∈ f_s^{-1}({x}) | X = x, S = s)
= P(f_{s'}(U_X) ∈ B, f_{s'}(U_X) ∈ f_{s'} ∘ f_s^{-1}({x}) | X = x, S = s)
= P(X_{S=s'} ∈ B ∩ f_{s'} ∘ f_s^{-1}({x}) | X = x, S = s).

Therefore, L(X_{S=s'} | X = x, S = s) does not put mass outside f_{s'} ∘ f_s^{-1}({x}). The definition of the support, the set of points x ∈ R^d such that every open neighborhood of x has a positive probability, thus implies that supp(μ*_{s'|s}(·|x)) ⊆ f_{s'} ∘ f_s^{-1}({x}).

Proof of Proposition 9 Set s, s' ∈ S and x ∈ X_s. Note that, according to (I), U_X = f_S^{-1}(X) P-almost surely.
Let us address each item of the proposition separately.

Item (i). Proposition 8 states that supp(μ*_{s'|s}(·|x)) ⊆ f_{s'} ∘ f_s^{-1}({x}) for μ_s-almost every x ∈ X_s. This means, according to (I), that supp(μ*_{s'|s}(·|x)) ⊆ {f_{s'} ∘ f_s^{-1}(x)}. Since the support of a probability distribution cannot be empty, we have equality. This proves the first item.

Item (ii). By definition of the counterfactual distribution, we find that

μ*_{s'|s} = L(X_{S=s'} | S = s) = L(f_{s'}(U_X) | S = s) = L(f_{s'} ∘ f_S^{-1}(X) | S = s) = L(f_{s'} ∘ f_s^{-1}(X) | S = s).

This proves the second item.

Item (iii). Similarly, by definition of the structural counterfactual coupling we obtain

π*_{s'|s} = L((X, X_{S=s'}) | S = s) = L((X, f_{s'}(U_X)) | S = s) = L((X, f_{s'} ∘ f_s^{-1}(X)) | S = s) = L((X_s, f_{s'} ∘ f_s^{-1}(X_s))),

where L(X_s) = μ_s. This completes the proof.

Proof of Proposition 10 Set s ∈ S and recall that X = F(S, U_X) P-almost surely while X_{S=s} = F(s, U_X) P-almost surely. Thanks to Assumption (RE), we have that S ⊥ U_X. Therefore,

L(X | S = s) = L(F(S, U_X) | S = s) = L(F(s, U_X) | S = s) = L(F(s, U_X)).

This means that μ_s = μ_{S=s}. Similarly, for s, s' ∈ S the counterfactual distribution becomes

L(X_{S=s'} | S = s) = L(F(s', U_X) | S = s) = L(F(s', U_X)) = L(F(s', U_X) | S = s') = L(F(S, U_X) | S = s') = L(X | S = s').

This means that μ*_{s'|s} = μ_{s'}, which completes the proof.

Proof of Proposition 11 We address each item separately.

Item (i). It is a direct consequence of π*_{s'|s} ∈ Π(μ_s, μ*_{s'|s}) by definition and μ*_{s'|s} = μ_{s'} from Proposition 10.

Item (ii). Recall that (RE) implies that S ⊥ U_X. Then, by definition we have

π*_{s|s'} = L((X, X_{S=s}) | S = s') = L((f_{s'}(U_X), f_s(U_X)) | S = s') = L((f_{s'}(U_X), f_s(U_X)) | S = s) = L((X_{S=s'}, X) | S = s) = ᵗπ*_{s'|s}.

Item (iii). It is a direct consequence of (T*_{s'|s})_# μ_s = μ*_{s'|s} from Proposition 9 and μ*_{s'|s} = μ_{s'} from Proposition 10.

Item (iv). We know according to Lemma 7 that X_{S=s} = f_s(U_X) and X_{S=s'} = f_{s'}(U_X) P-almost surely.
Furthermore, it follows from (RE) and Proposition 10 that μ_s = L(X_{S=s}) and μ_{s'} = L(X_{S=s'}). Wrapping this up, there exists a measurable set Ω_0 ⊆ Ω with P(Ω_0) = 1 such that for every ω ∈ Ω_0,

X_{S=s}(ω) = f_s(U_X(ω)) ∈ X_s,   X_{S=s'}(ω) = f_{s'}(U_X(ω)) ∈ X_{s'}.

In the rest of the proof we implicitly work on an ω ∈ Ω_0. Assumption (I) ensures that U_X = f_s^{-1}(X_{S=s}), so that X_{S=s'} = f_{s'} ∘ f_s^{-1}(X_{S=s}). Noting that X_{S=s} ∈ X_s, we obtain X_{S=s'} = f_{s'} ∘ f_s^{-1}|_{X_s}(X_{S=s}) = T*_{s'|s}(X_{S=s}). Following the same computation after switching s and s', we additionally get that X_{S=s} = f_s ∘ f_{s'}^{-1}|_{X_{s'}}(X_{S=s'}) = T*_{s|s'}(X_{S=s'}). Therefore, T*_{s'|s} is invertible on X_{S=s}(Ω_0), with (T*_{s'|s})^{-1} = T*_{s|s'} on X_{S=s'}(Ω_0). Since μ_s(X_{S=s}(Ω_0)) = P(Ω_0) = 1 and μ_{s'}(X_{S=s'}(Ω_0)) = P(Ω_0) = 1, this means that T*_{s'|s} is invertible μ_s-almost everywhere, with (T*_{s'|s})^{-1} = T*_{s|s'} μ_{s'}-almost everywhere. This completes the proof.

Proof of Corollary 13 We address the structural equations X = MX + wS + b + U_X, where w, b ∈ R^d and M ∈ R^{d×d} are deterministic parameters. We showed that for any s, s' ∈ S,

T*_{s'|s}(x) = x + (I − M)^{-1} w (s' − s).

Notice that T*_{s'|s} is the gradient of x ↦ (1/2)‖x‖² + ((I − M)^{-1} w (s' − s))^T x, which is a convex function. As (RE) holds, and μ_s together with μ_{s'} are Lebesgue absolutely continuous with finite second-order moments, it follows from Theorem 12 that T*_{s'|s} is the solution to (4) between μ_s and μ_{s'}.

Appendix B. Proofs of Section 6

We prove Proposition 16 before Proposition 15.

Proof of Proposition 16 We address each item separately.

Item (i). We claim that counterfactual fairness is equivalent to

(Goal) For every s, s' ∈ S, there exists a measurable set C = C(s, s') ⊆ X × X satisfying π*_{s'|s}(C) = 1 such that for every (x, x') ∈ C, h(x, s) = h(x', s').
Note that a direct reformulation of the original counterfactual fairness condition is

(CF) For every s, s' ∈ S, there exists a measurable set A = A(s) satisfying μ_s(A) = 1, such that for every x ∈ A and every measurable set M ⊆ R,

P(Ŷ_{S=s} ∈ M | X = x, S = s) = P(Ŷ_{S=s'} ∈ M | X = x, S = s).   (12)

We aim at showing that (CF) is equivalent to (Goal). To do so, we first prove that one can rewrite (CF) into the following intermediary formulation:

(IF) For every s, s' ∈ S, there exists a measurable set A = A(s) satisfying μ_s(A) = 1, such that for every x ∈ A and every measurable M ⊆ R there exists a measurable set B = B(s, s', x, M) satisfying μ*_{s'|s}(B|x) = 1 and such that for every x' ∈ B, 1_{h(x,s)∈M} = 1_{h(x',s')∈M}.

Let us show that (CF) ⇔ (IF). Set s, s' ∈ S, x ∈ A and M ⊆ R. According to the consistency rule, that is X = Σ_{s∈S} 1_{S=s} X_{S=s} P-almost surely, we can rewrite the left term of (12) as

P(Ŷ_{S=s} ∈ M | X = x, S = s) = P(h(X_{S=s}, s) ∈ M | X = x, S = s) = P(h(X, s) ∈ M | X = x, S = s) = P(h(x, s) ∈ M) = 1_{h(x,s)∈M}.

Then, using Definition 4, we reframe the right term of (12) as

P(Ŷ_{S=s'} ∈ M | X = x, S = s) = P(h(X_{S=s'}, s') ∈ M | X = x, S = s) = ∫ 1_{h(x',s')∈M} dμ*_{s'|s}(x'|x).

Remark now that, because the indicator functions take either the value 0 or 1, the condition

1_{h(x,s)∈M} = ∫ 1_{h(x',s')∈M} dμ*_{s'|s}(x'|x)

is equivalent to 1_{h(x,s)∈M} = 1_{h(x',s')∈M} for μ*_{s'|s}(·|x)-almost every x'. This means that there exists a measurable set B = B(s, s', x, M) such that μ*_{s'|s}(B|x) = 1 and for every x' ∈ B, 1_{h(x,s)∈M} = 1_{h(x',s')∈M}. This proves that (CF) is equivalent to (IF).

We now prove that (IF) ⇒ (Goal). As (IF) is true for any arbitrary measurable set M ⊆ R, we can apply this result with M = {h(x, s)} to obtain a measurable set B = B(s, s', x) such that μ*_{s'|s}(B|x) = 1 and for every x' ∈ B, h(x', s') = h(x, s).
To sum up, for every s, s' ∈ S, there exists a measurable set A = A(s) satisfying μ_s(A) = 1 such that for every x ∈ A, there exists a measurable set B = B(s, s', x) satisfying μ*_{s'|s}(B|x) = 1, such that for every x' ∈ B, h(x', s') = h(x, s). Now, we must show that the latter equality holds for π*_{s'|s}-almost every (x, x'). To this end, set C = C(s, s') = {(x, x') ∈ X × X | x ∈ A(s), x' ∈ B(s, s', x)}. Remark that, by definition of A and B, for every (x, x') ∈ C, h(x, s) = h(x', s'). To conclude, let us prove that π*_{s'|s}(C) = 1:

π*_{s'|s}(C) = ∫_A P(X_{S=s'} ∈ B | X = x, S = s) dμ_s(x) = ∫_A μ*_{s'|s}(B|x) dμ_s(x) = 1.

This proves that (IF) implies (Goal).

We finally prove that (Goal) ⇒ (IF). Using (Goal), consider a measurable set C = C(s, s') satisfying π*_{s'|s}(C) = 1 and such that for every (x, x') ∈ C, h(x, s) = h(x', s'). Then, define for any x ∈ X the measurable set B(s, s', x) = {x' ∈ X | (x, x') ∈ C}. We use the disintegration formula of π*_{s'|s} to write

1 = ∫ μ*_{s'|s}(B|x) dμ_s(x).

Since 0 ≤ μ*_{s'|s}(B|x) ≤ 1, this implies that for μ_s-almost every x, μ*_{s'|s}(B|x) = 1. Said differently, there exists a measurable set A = A(s) satisfying μ_s(A) = 1 such that for every x ∈ A, the measurable set B(s, s', x) satisfies μ*_{s'|s}(B|x) = 1. By construction of B and by definition of C, for every x ∈ A and every x' ∈ B, h(x, s) = h(x', s'). To obtain (IF), it suffices to take any measurable M ⊆ R and to note that the latter equality implies that 1_{h(x,s)∈M} = 1_{h(x',s')∈M}.

Item (ii). Recall that π*_{s|s} = (Id, Id)_# μ_s. Therefore, it follows from the previous item that counterfactual fairness can be written as: for every s, s' ∈ S such that s < s', for π*_{s'|s}-almost every (x, x'), h(x, s) = h(x', s'), and for π*_{s|s'}-almost every (x, x'), h(x, s') = h(x', s). Moreover, (RE) implies through Proposition 11 that π*_{s|s'} = ᵗπ*_{s'|s}. Therefore, the second condition above can be written as: for π*_{s'|s}-almost every (x, x'), h(x', s') = h(x, s), which is exactly the first condition.
This means that only the first condition is necessary, proving this item.

Item (iii). Consider (CF), and recall that for every s, s' ∈ S, μ_s-almost every x, and every measurable M ⊆ R, the left term of (12) is 1_{h(x,s)∈M}. Let us now reframe the right term of (12). If (I) holds, using that U_X = f_S^{-1}(X) P-almost surely we obtain

P(Ŷ_{S=s'} ∈ M | X = x, S = s) = P(h(X_{S=s'}, s') ∈ M | X = x, S = s)
= P(h(F(s', U_X), s') ∈ M | X = x, S = s)
= P(h(f_{s'}(f_S^{-1}(X)), s') ∈ M | X = x, S = s)
= P(h(f_{s'} ∘ f_s^{-1}(x), s') ∈ M)
= P(h(T*_{s'|s}(x), s') ∈ M)
= 1_{h(T*_{s'|s}(x),s')∈M}.

Consequently, (CF) holds if and only if, for every measurable M ⊆ R,

1_{h(x,s)∈M} = 1_{h(T*_{s'|s}(x),s')∈M}.

Using the same reasoning as before, we take M = {h(x, s)} to prove that this condition is equivalent to h(x, s) = h(T*_{s'|s}(x), s'). This concludes the third part of the proof.

Item (iv). From the previous item and Proposition 10, it follows that counterfactual fairness can be written as: for every s, s' ∈ S such that s < s', for μ_s-almost every x, h(x, s) = h(T*_{s'|s}(x), s'), and for μ_{s'}-almost every x, h(x, s') = h(T*_{s|s'}(x), s). Set s, s' ∈ S such that s < s'. To prove the fourth item, we show, as for item (ii), that the two above conditions are equivalent. Set A a measurable subset of X_s such that μ_s(A) = 1 and h(x, s) = h(T*_{s'|s}(x), s') for every x ∈ A. Then, make the change of variable x' = T*_{s'|s}(x), so that h((T*_{s'|s})^{-1}(x'), s) = h(x', s') for every x' ∈ T*_{s'|s}(A). By Propositions 9 and 10, (T*_{s'|s})_# μ_s = μ_{s'}, which implies that μ_{s'}(T*_{s'|s}(A)) = 1. Therefore, the equality h((T*_{s'|s})^{-1}(x'), s) = h(x', s') holds for μ_{s'}-almost every x'. Finally, recall that according to Proposition 11, (T*_{s'|s})^{-1} = T*_{s|s'} μ_{s'}-almost everywhere. As the intersection of two sets of probability one is a set of probability one, h(T*_{s|s'}(x'), s) = h(x', s') holds for μ_{s'}-almost every x'. To prove the converse, we proceed similarly by switching s and s'.
Proof of Proposition 15

First of all, we introduce some notation that will greatly simplify the proof. Recall that by unique solvability of the SCM, there exists a solution map $\Gamma$ such that $(X, S) \stackrel{\mathbb{P}\text{-a.s.}}{=} \Gamma(U)$. Using the same argument in any intervened model $\mathcal{M}_{S=s'}$ for $s' \in \mathcal{S}$ yields a solution map $\Gamma_{s'}$ such that $(X_{S=s'}, s') \stackrel{\mathbb{P}\text{-a.s.}}{=} \Gamma_{s'}(U)$.

We prove that counterfactual fairness as expressed in Definition 14 is equivalent to the condition in Proposition 15. According to Proposition 16, counterfactual fairness is equivalent to: for every $s, s' \in \mathcal{S}$,
$$\mathbb{P}\left(h(X, s) = h(X_{S=s'}, s') \mid S = s\right) = 1,$$
by definition of $\pi_{s'|s}$. Using the solution maps, this can be written as: for every $s, s' \in \mathcal{S}$, $\mathbb{P}\left(h(\Gamma(U)) = h(\Gamma_{s'}(U)) \mid S = s\right) = 1$. Because $\mathbb{P}(S = s) > 0$ for every $s \in \mathcal{S}$, this is equivalent to: for every $s' \in \mathcal{S}$, $\mathbb{P}\left(h(\Gamma(U)) = h(\Gamma_{s'}(U))\right) = 1$. The above equality means that for $\mathcal{L}(U)$-almost every $u \in \mathcal{U}$, $h(\Gamma(u)) = h(\Gamma_{s'}(u))$, which can be written as $\delta_{h(\Gamma(u))} = \delta_{h(\Gamma_{s'}(u))}$. Noticing that $\mathcal{L}(\hat{Y} \mid U = u) = \delta_{h(\Gamma(u))}$ and $\mathcal{L}(\hat{Y}_{S=s'} \mid U = u) = \delta_{h(\Gamma_{s'}(u))}$ concludes the proof.

Proof of Proposition 17

According to Proposition 16, $h$ is counterfactually fair if and only if for any $s, s' \in \mathcal{S}$ and for $\pi_{s'|s}$-almost every $(x, x')$, $h(x, s) = h(x', s')$, or equivalently $\mathbb{1}\{h(x,s) \in M\} = \mathbb{1}\{h(x',s') \in M\}$ for every measurable $M \subseteq \mathbb{R}$. Set $s, s' \in \mathcal{S}$. Recall that from (RE), $\pi_{s'|s}$ admits $\mu_s$ as first marginal and $\mu_{s'}$ as second marginal. Let us integrate this equality with respect to $\pi_{s'|s}$ to obtain, for every measurable $M \subseteq \mathbb{R}$,
$$\int \mathbb{1}\{h(x,s) \in M\}\, d\mu_s(x) = \int \mathbb{1}\{h(x',s') \in M\}\, d\mu_{s'}(x').$$
This can be written as $\mathbb{P}(h(X, s) \in M \mid S = s) = \mathbb{P}(h(X, s') \in M \mid S = s')$, which means that $\mathcal{L}(h(X, S) \mid S = s) = \mathcal{L}(h(X, S) \mid S = s')$. As this holds for any $s, s' \in \mathcal{S}$, we have that $h(X, S) \perp S$.

One can easily convince oneself that the converse is not true. As a counterexample, consider the following causal model,
$$X = S\, U_X + (1 - S)(-U_X), \qquad S = U_S,$$
where $U_S$ follows a Bernoulli distribution, $U_X$ follows a centered Gaussian distribution with unit variance, and $U_S \perp U_X$.
By do-interventions, $X_{S=1} = U_X$ and $X_{S=0} = -U_X$. Observe that (RE) is satisfied, so that
$$\mathcal{L}(X_{S=0}) = \mathcal{L}(X \mid S = 0), \qquad \mathcal{L}(X_{S=1}) = \mathcal{L}(X \mid S = 1).$$
Moreover, since $\mathcal{L}(U_X) = \mathcal{L}(-U_X)$, we have $\mathcal{L}(X \mid S = 0) = \mathcal{L}(X \mid S = 1)$. In particular, whatever the chosen predictor, statistical parity will hold since the factual distributions are the same. By definition of the structural counterfactual operator, we have $T_{1|0}(x) = -x$. Now, set the unaware predictor (namely, one which does not take the protected attribute as an input) $h(X, S) := \operatorname{sign}(X)$. Clearly, $h(T_{1|0}(x)) = h(-x) = -h(x) \neq h(x)$, except on $\{0\}$, which is a null set for $\mathcal{L}(X \mid S = 0)$. Therefore $h(X, S)$ satisfies statistical parity but not counterfactual fairness.

Proof of Proposition 20

Suppose that the classifier $h(X, S)$ takes values in the finite set $\mathcal{Y} \subset \mathbb{R}$, and define for any $s \in \mathcal{S}$ and $y \in \mathcal{Y}$ the sets $H(s, y) := \{x \in \mathbb{R}^d \mid h(x, s) = y\}$. Remark that for every $s \in \mathcal{S}$, $\{H(s, y)\}_{y \in \mathcal{Y}}$ forms a partition of $\mathbb{R}^d$. Statistical parity can be written as: for any $s \in \mathcal{S}$ and any $y \in \mathcal{Y}$, $\mu_s(H(s, y)) = p_y$, where $\{p_y\}_{y \in \mathcal{Y}}$ is a probability on $\mathcal{Y}$ that does not depend on $s$. Now, set $s, s' \in \mathcal{S}$. We aim at constructing a coupling $\pi_{s'|s}$ between $\mu_s$ and $\mu_{s'}$ such that
$$\pi_{s'|s}\left(\left\{(x, x') \in \mathbb{R}^d \times \mathbb{R}^d \mid h(x, s) = h(x', s')\right\}\right) = 1.$$
We define our candidate $\pi_{s'|s}$ as
$$d\pi_{s'|s}(x, x') := \sum_{y \in \mathcal{Y}} \frac{\mathbb{1}\{x \in H(s,y)\}\,\mathbb{1}\{x' \in H(s',y)\}}{p_y}\, d\mu_s(x)\, d\mu_{s'}(x').$$
First, let us show that it admits respectively $\mu_s$ and $\mu_{s'}$ as first and second marginals. Let $A \subseteq \mathbb{R}^d$ be a measurable set:
$$\begin{aligned}
\pi_{s'|s}\left(A \times \mathbb{R}^d\right) &= \sum_{y \in \mathcal{Y}} \frac{1}{p_y}\int_{A \times \mathbb{R}^d} \mathbb{1}\{x \in H(s,y)\}\,\mathbb{1}\{x' \in H(s',y)\}\, d\mu_s(x)\, d\mu_{s'}(x')\\
&= \sum_{y \in \mathcal{Y}} \frac{\mu_{s'}(H(s',y))}{p_y} \int_{x \in A} \mathbb{1}\{x \in H(s,y)\}\, d\mu_s(x)\\
&= \sum_{y \in \mathcal{Y}} \mu_s\left(A \cap H(s, y)\right) = \mu_s(A),
\end{aligned}$$
where we used that $\mu_{s'}(H(s', y)) = p_y$ by statistical parity. One can follow the same computation for the second marginal. To conclude, compute
$$\begin{aligned}
\pi_{s'|s}\left(\left\{(x, x') \in \mathbb{R}^d \times \mathbb{R}^d \mid h(x, s) = h(x', s')\right\}\right) &= \pi_{s'|s}\left(\bigcup_{y \in \mathcal{Y}} H(s, y) \times H(s', y)\right)\\
&= \sum_{y \in \mathcal{Y}} \pi_{s'|s}\left(H(s, y) \times H(s', y)\right)\\
&= \sum_{y \in \mathcal{Y}} \frac{1}{p_y} \int \mathbb{1}\{x \in H(s,y)\}\, d\mu_s(x) \int \mathbb{1}\{x' \in H(s',y)\}\, d\mu_{s'}(x')\\
&= \sum_{y \in \mathcal{Y}} \frac{p_y \cdot p_y}{p_y} = 1.
\end{aligned}$$

Appendix C.
Proofs of Section 7

Proof of Theorem 21

The outline of the proof is typical for such supervised learning problems, though some parts require basic knowledge of optimal transport. It mainly amounts to showing the uniform convergence of $\{R_n\}_{n \in \mathbb{N}}$ to $R$, to then use the following classical deviation inequality:
$$R(\theta_n) - \min_{\theta \in \Theta} R(\theta) \leq 2 \sup_{\theta \in \Theta} |R_n(\theta) - R(\theta)|. \tag{13}$$
For any measure $P$ and any measurable function $g$, we will use the notation $P(g) := \int g\, dP$ throughout the proof.

Step 1. Uniform convergence of the risk. By the triangle inequality,
$$\sup_{\theta \in \Theta} |R_n(\theta) - R(\theta)| \leq \sup_{\theta \in \Theta} \left|\frac{1}{n}\sum_{i=1}^n \ell(h_\theta(x_i, s_i), y_i) - \mathbb{E}[\ell(h_\theta(X, S), Y)]\right| + \sum_{s \neq s'} \sup_{\theta \in \Theta} \left|\left(\tfrac{n_s}{n}\pi^n_{s'|s} - \mathbb{P}(S = s)\,\pi_{s'|s}\right)\left(r_\theta(\cdot, s, \cdot, s')\right)\right|.$$
The first term corresponds to the standard uniform risk deviation of supervised learning problems for Lipschitz losses and linear predictions. Under (i) to (iv), for $0 < \delta < 1$ it follows from (Shalev-Shwartz and Ben-David, 2014, Theorem 26.5) that, with probability greater than $1 - \delta$, the deviation $\sup_{\theta \in \Theta} |\frac{1}{n}\sum_{i=1}^n \ell(h_\theta(x_i, s_i), y_i) - \mathbb{E}[\ell(h_\theta(X, S), Y)]|$ is bounded by a quantity of order $\sqrt{\log(1/\delta)/n}$, with a constant involving $\ell_0 := \sup_{|y| \leq b} |\ell(0, y)|$. Then, by taking $\delta_n := \frac{1}{n^2}$, we apply the Borel–Cantelli lemma, so that for almost every $\omega \in \Omega$ there exists a threshold $N(\omega)$ such that the bound holds for any $n \geq N(\omega)$. The upper bound tends to zero as $n$ tends to infinity, and consequently
$$\sup_{\theta \in \Theta}\left|\frac{1}{n}\sum_{i=1}^n \ell(h_\theta(x_i, s_i), y_i) - \mathbb{E}[\ell(h_\theta(X, S), Y)]\right| \xrightarrow[n \to +\infty]{\text{a.s.}} 0.$$
The critical part is dealing with the counterfactual penalization. Let $s, s' \in \mathcal{S}$ such that $s \neq s'$. In the remainder of this step, we aim at showing that
$$\sup_{\theta \in \Theta} \left|\left(\tfrac{n_s}{n}\pi^n_{s'|s} - \mathbb{P}(S = s)\,\pi_{s'|s}\right)\left(r_\theta(\cdot, s, \cdot, s')\right)\right| \xrightarrow[n \to +\infty]{\text{a.s.}} 0.$$
To do so, we use the triangle inequality again, leading to
$$\sup_{\theta \in \Theta} \left|\left(\tfrac{n_s}{n}\pi^n_{s'|s} - \mathbb{P}(S = s)\,\pi_{s'|s}\right)\left(r_\theta(\cdot, s, \cdot, s')\right)\right| \leq \left|\tfrac{n_s}{n} - \mathbb{P}(S = s)\right| \sup_{\theta \in \Theta} \int r_\theta(x, s, x', s')\, d\pi^n_{s'|s}(x, x') \tag{14}$$
$$+\ \mathbb{P}(S = s) \sup_{\theta \in \Theta} \left|\int r_\theta(x, s, x', s')\, \left(d\pi^n_{s'|s}(x, x') - d\pi_{s'|s}(x, x')\right)\right|. \tag{15}$$
The term (14) tends to zero almost surely as $n$ increases to infinity, since $\frac{n_s}{n} \to \mathbb{P}(S = s)$ almost surely by the law of large numbers while the supremum remains bounded. We now turn to the convergence of the term (15).
Firstly, let us show that the functions $\{r_\theta(\cdot, s, \cdot, s')\}_{\theta \in \Theta}$ are uniformly Lipschitz on $\mathcal{X} \times \mathcal{X}$. For any $(x_1, x'_1), (x_2, x'_2) \in \mathcal{X} \times \mathcal{X}$, we have
$$\begin{aligned}
\left|r_\theta(x_1, s, x'_1, s') - r_\theta(x_2, s, x'_2, s')\right| &\leq \left\|\theta^T\left(\Phi(x_1, s) - \Phi(x'_1, s') - \Phi(x_2, s) + \Phi(x'_2, s')\right)\right\|_2\\
&\leq \left\|\theta^T\left(\Phi(x_1, s) - \Phi(x_2, s)\right)\right\|_2 + \left\|\theta^T\left(\Phi(x'_1, s') - \Phi(x'_2, s')\right)\right\|_2\\
&\leq \|\theta\|_2 \left\|\Phi(x_1, s) - \Phi(x_2, s)\right\|_2 + \|\theta\|_2 \left\|\Phi(x'_1, s') - \Phi(x'_2, s')\right\|_2\\
&\leq D\sqrt{2}\left\{L_s^2\|x_1 - x_2\|^2 + L_{s'}^2\|x'_1 - x'_2\|^2\right\}^{1/2}\\
&\leq D\sqrt{2}\, \max_{s \in \mathcal{S}} L_s \left\|(x_1, x'_1) - (x_2, x'_2)\right\|_2\\
&\leq 2\sqrt{2}\, D^2 \left(\max_{s \in \mathcal{S}} L_s\right)^2 R \left\|(x_1, x'_1) - (x_2, x'_2)\right\|.
\end{aligned}$$
Let us set $\Lambda := 2\sqrt{2}\, D^2 (\max_{s \in \mathcal{S}} L_s)^2 R$, so that the functions $\{r_\theta(\cdot, s, \cdot, s')\}_{\theta \in \Theta}$ are $\Lambda$-Lipschitz.

Secondly, we know from (Villani, 2008, Theorem 5.19) that $\pi^n_{s'|s}$ converges almost surely weakly to $\pi_{s'|s}$ as $n_s, n_{s'} \to +\infty$. Moreover, for any $s \in \mathcal{S}$, we have $\frac{n_s}{n} \to \mathbb{P}(S = s) > 0$, hence $n_s \to +\infty$ as $n \to +\infty$. As a consequence, $\pi^n_{s'|s}$ converges almost surely weakly to $\pi_{s'|s}$ as $n \to +\infty$. Additionally, since $\mathcal{X}_s \times \mathcal{X}_{s'} \subseteq \mathcal{X} \times \mathcal{X}$, it follows from (ii) that $\pi_{s'|s}$ is compactly supported. According to Remark 7.13 in (Villani, 2003), this implies that
$$W_1\left(\pi^n_{s'|s}, \pi_{s'|s}\right) \xrightarrow[n \to +\infty]{\text{a.s.}} 0,$$
where $W_1$ denotes the Wasserstein-1 distance. Using the dual formulation of the Wasserstein distance, this convergence can be written as
$$W_1\left(\pi^n_{s'|s}, \pi_{s'|s}\right) = \sup_{r \in \operatorname{Lip}_1(\mathcal{X} \times \mathcal{X}, \mathbb{R})} \left|\int r\, \left(d\pi^n_{s'|s} - d\pi_{s'|s}\right)\right| \xrightarrow[n \to +\infty]{\text{a.s.}} 0.$$
Noting that for any $\theta \in \Theta$, $r_\theta(\cdot, s, \cdot, s')/\Lambda$ is 1-Lipschitz, we have
$$\frac{1}{\Lambda} \sup_{\theta \in \Theta} \left|\int r_\theta(x, s, x', s')\, \left(d\pi^n_{s'|s}(x, x') - d\pi_{s'|s}(x, x')\right)\right| \leq W_1\left(\pi^n_{s'|s}, \pi_{s'|s}\right) \xrightarrow[n \to +\infty]{\text{a.s.}} 0.$$
This entails that the term (15) converges almost surely to 0, and completes this first step.

Step 2. Consistency of the minimum. For this additional step, we assume that $R_n$ and $R$ have unique minimizers, and we denote $\theta^* := \arg\min_{\theta \in \Theta} R(\theta)$. Note that the sequence $\{\theta_n\}_{n \in \mathbb{N}}$ is bounded by $D$, and as such we can extract a subsequence $\{\theta_{\sigma(n)}\}_{n \in \mathbb{N}}$ converging to some $\theta_\sigma \in \Theta$. Let us prove that $\theta_\sigma = \theta^*$ regardless of the choice of the subsequence $\{\sigma(n)\}_{n \in \mathbb{N}}$. According to the deviation inequality (13) and by continuity of $R$, we have at the limit $R(\theta_\sigma) \leq R(\theta^*)$. This means that $\theta_\sigma$ is a minimizer of $R$.
Therefore, by uniqueness, $\theta_\sigma = \theta^*$. This completes the proof, as it implies that $\theta_n \xrightarrow[n \to +\infty]{\text{a.s.}} \theta^*$.

References

Nicholas Asher, Soumya Paul, and Chris Russell. Fair and adequate explanations. In International Cross-Domain Conference for Machine Learning and Knowledge Extraction, pages 79–97. Springer, 2021.

Nicholas Asher, Lucas De Lara, Soumya Paul, and Chris Russell. Counterfactual models for fair and adequate explanations. Machine Learning and Knowledge Extraction, 4(2):316–349, 2022.

Mohit Bajaj, Lingyang Chu, Zi Yu Xue, Jian Pei, Lanjun Wang, Peter Cho-Ho Lam, and Yong Zhang. Robust counterfactual explanations on graph neural networks. In Advances in Neural Information Processing Systems, volume 34, 2021.

Yogesh Balaji, Rama Chellappa, and Soheil Feizi. Robust optimal transport with applications in generative modeling and domain adaptation. In Advances in Neural Information Processing Systems, volume 33, pages 12934–12944, 2020.

Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning: Limitations and Opportunities. MIT Press, 2023.

M. Faisal Beg, Michael I. Miller, Alain Trouvé, and Laurent Younes. Computing large deformation metric mappings via geodesic flows of diffeomorphisms. International Journal of Computer Vision, 61(2):139–157, 2005.

J. Beirlant, S. Buitendag, E. del Barrio, M. Hallin, and F. Kamper. Center-outward quantiles and the measurement of multivariate risk. Insurance: Mathematics and Economics, 95:79–100, 2020.

Peter M. Bentler and David G. Weeks. Linear structural equations with latent variables. Psychometrika, 45(3):289–308, 1980.

Philippe Besse, Eustasio del Barrio, Paula Gordaliza, Jean-Michel Loubes, and Laurent Risser. A survey of bias in machine learning through the prism of statistical parity. The American Statistician, pages 1–11, 2021.

Emily Black, Samuel Yeom, and Matt Fredrikson. FlipTest: Fairness testing via optimal transport.
In Conference on Fairness, Accountability, and Transparency, pages 111–121, New York, USA, 2020. Association for Computing Machinery.

Stephan Bongers, Patrick Forré, Jonas Peters, and Joris M. Mooij. Foundations of structural causal models with cycles and latent variables. The Annals of Statistics, 49(5):2885–2915, 2021.

Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4):375–417, 1991.

Alessandro Castelnovo, Riccardo Crupi, Greta Greco, Daniele Regoli, Ilaria Giuseppina Penco, and Andrea Claudio Cosentini. A clarification of the nuances in the fairness metrics landscape. Scientific Reports, 12(1):4209, 2022.

Arthur Charpentier, Emmanuel Flachaire, and Ewen Gallic. Optimal transport for counterfactual estimation: A method for causal inference. In Optimal Transport Statistics for Economics and Related Topics, pages 45–89. Springer, 2023.

Victor Chernozhukov, Alfred Galichon, Marc Hallin, and Marc Henry. Monge–Kantorovich depth, quantiles, ranks and signs. The Annals of Statistics, 45(1):223–256, 2017.

Silvia Chiappa. Path-specific counterfactual fairness. In AAAI Conference on Artificial Intelligence, volume 33, pages 7801–7808, 2019.

Silvia Chiappa and William S. Isaac. A causal Bayesian networks viewpoint on fairness. In Privacy and Identity Management. Fairness, Accountability, and Transparency in the Age of Big Data: 13th IFIP WG 9.2, 9.6/11.7, 11.6/SIG 9.2.2 International Summer School, Vienna, Austria, August 20–24, 2018, Revised Selected Papers 13, pages 3–20, 2019.

Silvia Chiappa, Ray Jiang, Tom Stepleton, Aldo Pacchiano, Heinrich Jiang, and John Aslanides. A general approach to fairness with optimal transport. In AAAI Conference on Artificial Intelligence, pages 3633–3640, 2020.

David Maxwell Chickering, David Heckerman, and Christopher Meek. Large-sample learning of Bayesian networks is NP-hard.
Journal of Machine Learning Research, 5(Oct):1287–1330, 2004.

Evgenii Chzhen, Christophe Denis, Mohamed Hebiri, Luca Oneto, and Massimiliano Pontil. Fair regression with Wasserstein barycenters. In Advances in Neural Information Processing Systems, volume 33, pages 7321–7331. Curran Associates, Inc., 2020.

Gregory F. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks (research note). Artificial Intelligence, 42(2–3):393–405, March 1990.

Nicolas Courty, Rémi Flamary, and Devis Tuia. Domain adaptation with regularized optimal transport. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 274–289. Springer, 2014.

Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems, volume 30, 2017.

Juan Antonio Cuesta and Carlos Matrán. Notes on the Wasserstein metric in Hilbert spaces. The Annals of Probability, pages 1264–1276, 1989.

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, volume 26, pages 2292–2300, 2013.

Ricardo Dominguez-Olmedo, Amir H. Karimi, and Bernhard Schölkopf. On the adversarial robustness of causal algorithmic recourse. In International Conference on Machine Learning, pages 5324–5342. PMLR, 2022.

Dheeru Dua and Casey Graff. UCI machine learning repository, 2019. URL http://archive.ics.uci.edu/ml.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Innovations in Theoretical Computer Science Conference, pages 214–226, 2012.

Jake Fawkes, Robin Evans, and Dino Sejdinovic. Selection, ignorability and challenges with causal fairness. In Conference on Causal Learning and Reasoning, 2021.

Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z.
Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong, and Titouan Vayer. POT: Python optimal transport. Journal of Machine Learning Research, 22(78):1–8, 2021.

David Galles and Judea Pearl. An axiomatic characterization of causal counterfactuals. Foundations of Science, 3(1):151–182, 1998.

Nathalie T.H. Gayraud, Alain Rakotomamonjy, and Maureen Clerc. Optimal transport applied to transfer learning for P300 detection. In BCI 2017 - 7th Graz Brain-Computer Interface Conference, page 6, 2017.

Promit Ghosal and Bodhisattva Sen. Multivariate ranks and quantiles using optimal transport: Consistency, rates and nonparametric testing. The Annals of Statistics, 50(2):1012–1037, 2022.

Paula Gordaliza, Eustasio Del Barrio, Fabrice Gamboa, and Jean-Michel Loubes. Obtaining fairness using optimal transport theory. In International Conference on Machine Learning, pages 2357–2365. PMLR, 2019.

Marc Hallin, Eustasio del Barrio, Juan Cuesta-Albertos, and Carlos Matrán. Distribution and quantile functions, ranks and signs in dimension d: A measure transportation approach. The Annals of Statistics, 49(2):1139–1165, 2021.

Patrik Hoyer, Dominik Janzing, Joris M. Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems, volume 21, 2008.

Antti Hyttinen, Frederick Eberhardt, and Patrik O. Hoyer. Learning linear cyclic causal models with latent variables. The Journal of Machine Learning Research, 13(1):3387–3439, 2012.

Guido W. Imbens and Donald B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

Sarang C. Joshi and Michael I. Miller.
Landmark matching via large deformation diffeomorphisms. IEEE Transactions on Image Processing, 9(8):1357–1370, 2000.

Shalmali Joshi, Oluwasanmi Koyejo, Warut Vijitbenjaronk, Been Kim, and Joydeep Ghosh. Towards realistic individual recourse and actionable explanations in black-box decision making systems. arXiv preprint arXiv:1907.09615, 2019.

Amir-Hossein Karimi, Julius von Kügelgen, Bernhard Schölkopf, and Isabel Valera. Towards causal algorithmic recourse. In International Workshop on Extending Explainable AI Beyond Deep Models and Classifiers, pages 139–166. Springer, 2020.

Amir-Hossein Karimi, Bernhard Schölkopf, and Isabel Valera. Algorithmic recourse: from counterfactual explanations to interventions. In ACM Conference on Fairness, Accountability, and Transparency, pages 353–362, 2021.

Niki Kilbertus, Philip J. Ball, Matt J. Kusner, Adrian Weller, and Ricardo Silva. The sensitivity of counterfactual fairness to unmeasured confounding. In Conference on Uncertainty in Artificial Intelligence, pages 616–626. PMLR, 2020.

Matt J. Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems, volume 30, pages 4066–4076. Curran Associates, Inc., 2017.

David Lewis. Causation. Journal of Philosophy, 70(17):556–567, 1973.

Zachary Lipton, Julian McAuley, and Alexandra Chouldechova. Does mitigating ML's impact disparity require treatment disparity? In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Divyat Mahajan, Chenhao Tan, and Amit Sharma. Preserving causal constraints in counterfactual explanations for machine learning classifiers. arXiv preprint arXiv:1912.03277, 2020.

Tudor Manole, Sivaraman Balakrishnan, Jonathan Niles-Weed, and Larry Wasserman. Plugin estimation of smooth optimal transport maps. arXiv preprint arXiv:2107.12364, 2021.

Robert J. McCann.
Existence and uniqueness of monotone measure-preserving maps. Duke Mathematical Journal, 80(2):309–323, November 1995. doi: 10.1215/S0012-7094-95-08013-2.

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: a simple and accurate method to fool deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016.

Razieh Nabi and Ilya Shpitser. Fair inference on outcomes. In AAAI Conference on Artificial Intelligence, volume 32, 2018.

Judea Pearl. Causality. Cambridge University Press, 2009.

Judea Pearl. The mathematics of causal relations. In Causality and Psychopathology: Finding the Determinants of Disorders and their Cures (P. Shrout, K. Keyes and K. Ornstein, eds.), pages 47–65. Oxford University Press, 2010.

Judea Pearl, Madelyn Glymour, and Nicholas P. Jewell. Causal Inference in Statistics: A Primer. John Wiley & Sons, 2016.

Victoria Peterson, Nicolás Nieto, Dominik Wyser, Olivier Lambercy, Roger Gassert, Diego H. Milone, and Rubén D. Spies. Transfer learning based on optimal transport for motor imagery brain-computer interfaces. IEEE Transactions on Biomedical Engineering, 69(2):807–817, 2021.

Gabriel Peyré and Marco Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

Drago Plečko and Elias Bareinboim. Causal fairness analysis: A causal toolkit for fair machine learning. Foundations and Trends in Machine Learning, 17(3):304–589, 2024.

Drago Plečko and Nicolai Meinshausen. Fair data adaptation with quantile preservation. Journal of Machine Learning Research, 21:1–44, 2020.

Aram-Alexandre Pooladian and Jonathan Niles-Weed. Entropic estimation of optimal transport maps. arXiv preprint arXiv:2109.12004, 2021.

Rafael Poyiadzi, Kacper Sokol, Raul Santos-Rodriguez, Tijl De Bie, and Peter Flach. FACE: feasible and actionable counterfactual explanations.
In AAAI/ACM Conference on AI, Ethics, and Society, pages 344–350, 2020.

Herilalaina Rakotoarison, Louisot Milijaona, Andry Rasoanaivo, Michèle Sebag, and Marc Schoenauer. Learning meta-features for AutoML. In International Conference on Learning Representations, 2022.

Peyman Rasouli and Ingrid Chieh Yu. CARE: Coherent actionable recourse based on sound counterfactual explanations. International Journal of Data Science and Analytics, 17(1):13–38, 2024.

Ievgen Redko, Nicolas Courty, Rémi Flamary, and Devis Tuia. Optimal transport for multisource domain adaptation under target shift. In International Conference on Artificial Intelligence and Statistics, pages 849–858. PMLR, 2019.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, New York, NY, USA, 2016. Association for Computing Machinery.

Laurent Risser, Alberto Gonzalez Sanz, Quentin Vincenot, and Jean-Michel Loubes. Tackling algorithmic bias in neural-network classifiers using Wasserstein-2 regularization. Journal of Mathematical Imaging and Vision, pages 1–18, 2022.

Dominik Rothenhäusler, Nicolai Meinshausen, Peter Bühlmann, Jonas Peters, et al. Anchor regression: Heterogeneous data meet causality. Journal of the Royal Statistical Society Series B, 83(2):215–246, 2021.

Donald B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974.

Chris Russell, Matt J. Kusner, Joshua Loftus, and Ricardo Silva. When worlds collide: Integrating different counterfactual assumptions in fairness. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

Saptarshi Saha and Utpal Garain. On noise abduction for answering counterfactual queries: A practical outlook. Transactions on Machine Learning Research, 2022. ISSN 2835-8856.
URL https://openreview.net/forum?id=4FU8Jz1Oyj.

Marco Scutari, Claudia Vitolo, and Allan Tucker. Learning Bayesian networks from big data with greedy search: computational complexity and efficient implementation. Statistics and Computing, 29(5):1095–1108, 2019.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Shohei Shimizu, Patrik O. Hoyer, Aapo Hyvärinen, Antti Kerminen, and Michael Jordan. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(10), 2006.

Ricardo Silva. Counterfactual fairness is not demographic parity, and other observations. arXiv preprint arXiv:2402.02663, 2024.

Dylan Slack, Sophie Hilgard, Himabindu Lakkaraju, and Sameer Singh. Counterfactual explanations can be manipulated. In Advances in Neural Information Processing Systems, volume 34, 2021.

Robert C. Stalnaker. A defense of conditional excluded middle. In Ifs, pages 87–104. Springer, 1980.

Thibaut Le Gouic, Jean-Michel Loubes, and Philippe Rigollet. Projection to fairness in statistical learning. arXiv preprint arXiv:2005.11720, 2020.

William Torous, Florian Gunsilius, and Philippe Rigollet. An optimal transport approach to estimating causal effects via nonlinear difference-in-differences. arXiv preprint arXiv:2108.05858, 2024.

Cédric Villani. Topics in Optimal Transportation. Number 58 in Graduate Studies in Mathematics. American Mathematical Society, 2003.

Cédric Villani. Optimal Transport: Old and New. Number 338 in Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 2008.

Julius von Kügelgen, Amir-Hossein Karimi, Umang Bhatt, Isabel Valera, Adrian Weller, and Bernhard Schölkopf. On the fairness of causal algorithmic recourse. In AAAI Conference on Artificial Intelligence, volume 36, pages 9584–9594, 2022.

Sandra Wachter, Brent Mittelstadt, and Chris Russell.
Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology, 31:841, 2017.

Yongkai Wu, Lu Zhang, and Xintao Wu. Counterfactual fairness: Unidentification, bound and algorithm. In International Joint Conference on Artificial Intelligence, 2019.

Shizhou Xu and Thomas Strohmer. On the (in)compatibility between group fairness and individual fairness. arXiv preprint arXiv:2401.07174, 2024.

Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. Fairness constraints: Mechanisms for fair classification. In International Conference on Artificial Intelligence and Statistics, pages 962–970. PMLR, 2017.

Kun Zhang and Lai-Wan Chan. Extensions of ICA for causality discovery in the Hong Kong stock market. In International Conference on Neural Information Processing, pages 400–409. Springer, 2006.