Journal of Machine Learning Research 25 (2024) 1-59. Submitted 12/21; Revised 2/24; Published 5/24.

Transport-based Counterfactual Models

Lucas De Lara (lucas.de lara@math.univ-toulouse.fr), Institut de Mathématiques de Toulouse, Université Paul Sabatier, Toulouse, France
Alberto González-Sanz (ag4855@columbia.edu), Department of Statistics, Columbia University, New York, United States
Nicholas Asher (nicholas.asher@irit.fr), Institut de Recherche en Informatique de Toulouse, CNRS, Toulouse, France
Laurent Risser (laurent.risser@math.univ-toulouse.fr), Institut de Mathématiques de Toulouse, CNRS, Toulouse, France
Jean-Michel Loubes (loubes@math.univ-toulouse.fr), Institut de Mathématiques de Toulouse, Université Paul Sabatier, Toulouse, France

Editor: Silvia Chiappa

Abstract

Counterfactual frameworks have grown popular in machine learning, both for explaining algorithmic decisions and for defining individual notions of fairness that are more intuitive than typical group fairness conditions. However, state-of-the-art models to compute counterfactuals are either unrealistic or unfeasible. In particular, while Pearl's causal inference provides appealing rules to calculate counterfactuals, it relies on a model that is unknown and hard to discover in practice. We address the problem of designing realistic and feasible counterfactuals in the absence of a causal model. We define transport-based counterfactual models as collections of joint probability distributions between observable distributions, and show their connection to causal counterfactuals. More specifically, we argue that optimal-transport theory defines relevant transport-based counterfactual models, as they are numerically feasible, statistically faithful, and can coincide under some assumptions with causal counterfactual models. Finally, these models make counterfactual approaches to fairness feasible, and we illustrate their practicality and efficiency on fair learning.
With this paper, we aim to lay out the theoretical foundations for a new, implementable approach to counterfactual thinking.

Keywords: counterfactuals, optimal transport, causality, fairness, supervised learning

©2024 Lucas De Lara, Alberto González-Sanz, Nicholas Asher, Laurent Risser, Jean-Michel Loubes. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/21-1440.html.

1. Introduction

A counterfactual states how the world should be modified so that a given outcome occurs. For instance, the statement "had you been a woman, you would have gotten half your salary" is a counterfactual relating the antecedent "had you been a woman" to the consequent "you would have gotten half your salary". Counterfactuals have been used to define causation (Lewis, 1973) and hence have attracted attention in the fields of explainability and robustness in machine learning, as such statements are tailored to explain black-box decision rules. Applications abound, including algorithmic recourse (Joshi et al., 2019; Poyiadzi et al., 2020; Karimi et al., 2021; Slack et al., 2021; Bajaj et al., 2021; Rasouli and Chieh Yu, 2024), defense against adversarial attacks (Ribeiro et al., 2016; Moosavi-Dezfooli et al., 2016), and fairness (Kusner et al., 2017; Chiappa, 2019; Black et al., 2020; Plecko and Meinshausen, 2020; Asher et al., 2021). State-of-the-art models for computing meaningful counterfactuals have mostly focused on the nearest counterfactual explanation principle (Wachter et al., 2017), according to which one finds minimal translations, that is, minimal changes in the features of an instance that lead to a desired outcome. However, as noted by Black et al. (2020) and Poyiadzi et al. (2020), this simple distance-based approach generally fails to describe realistic alternative worlds, as it implicitly assumes the features to be independent.
Changing just the gender of a person in such a translation might convert a typical male into an untypical female, rendering out-of-distribution counterfactuals like the following: "if I were a woman I would be 190 cm tall and weigh 85 kg". According to intuition, such counterfactuals are false, and rightly so, because they are not representative of the underlying statistical distributions. As a practical consequence, such counterfactuals typically hide biases in machine-learning decision rules (Lipton et al., 2018; Besse et al., 2021). The link between counterfactual modality and causality motivated the use of Pearl's causal modeling (Pearl, 2009) to address the aforementioned shortcoming (Kusner et al., 2017; Joshi et al., 2019; Mahajan et al., 2020; Karimi et al., 2021). Pearl's do-calculus, by enforcing a change in a set of variables while keeping the rest of the causal mechanism untouched, provides a rigorous basis for generating intuitively true counterfactuals. The cost of this approach is fully specifying the causal model, namely specifying not only the Bayesian network (or graph) capturing the causal links between variables, but also the structural equations relating them, and the law of the latent, exogenous variables. The reliance on such a strong prior makes the causal approach appealing in theory, but inadequate for deployment in practical cases. To sum up, research has mostly focused on two divergent frameworks to compute counterfactuals: one that proposes an easy-to-implement model which leads, however, to intuitively untrue counterfactuals; another that rigorously takes into account the dependencies between variables to produce realistic counterfactuals, but at the cost of feasibility. Our contribution addresses a third way. Extending the work of Black et al. (2020), who first suggested substituting causality-based counterfactual reasoning with optimal transport, we define transport-based counterfactual models.
Such models, by characterizing a counterfactual operation as a coupling, that is, a mass transportation plan between two observable distributions, ensure that the generated counterfactuals are in-distribution, hence realistic. In addition, they remedy the impracticability issues of causal modeling, as they can be computed through any mass transportation technique, for instance optimal transport. The major benefit of this approach is that it makes feasible many critical applications of counterfactual frameworks, for example in algorithmic fairness.

1.1 Outline of Contributions

We make both theoretical and practical contributions in the fields of counterfactual reasoning and fair machine learning. We propose a mass-transportation framework for counterfactual reasoning and point out its similarities to the causal approach. Additionally, we show that counterfactual methods for fairness become feasible in this framework by introducing and implementing transport-based counterfactual fairness criteria. More precisely, our contributions can be outlined as follows.

1. In Section 2, we recall the necessary background on Pearl's causal modeling, while we introduce in Section 3 the basics of mass transportation and optimal transport theory. Both sections serve as the theoretical and notational toolbox that will be used throughout; they are meant to keep the paper self-contained and can be skipped by readers familiar with these subjects.

2. In Section 4, we first recall how to compute counterfactual quantities using causal modeling. Then, we introduce a general causality-free framework for the computation of counterfactuals through mass-transportation techniques, encompassing the approach of Black et al. (2020).
Essentially, we also propose a unified mass-transportation viewpoint of counterfactuals, be they causality-based or transport-based, through the definition of counterfactual models: collections of couplings characterizing all possible counterfactual statements for a given feature to alter (for example, the gender). We provide concrete examples of models, and discuss the limitations of the different approaches.

3. In Section 5, we leverage the unified formalism proposed in the previous section to demonstrate connections between causality and optimal transport. More precisely, after studying the implications of two general causal assumptions on the induced counterfactual models, we demonstrate that optimal transport maps for the quadratic cost generate the same counterfactual instances as some specific causal models, including the common linear additive models. We argue that this makes optimal-transport-based counterfactual models relevant surrogates in the absence of a known causal model.

4. In Sections 6, 7 and 8, we illustrate the practicality of our approach for fairness in machine learning. We apply the mass-transportation viewpoint of structural counterfactuals by recasting the counterfactual fairness criterion (Kusner et al., 2017) into a transport-like one. Then, we propose new causality-free criteria by substituting transport-based models for the causal model in the original criterion. Finally, we address the training of counterfactually fair classifiers and predictors, providing statistical guarantees and numerical experiments over various data sets.

To sum up: Sections 2 and 3 provide the prerequisites for the paper; Sections 4 and 5 introduce the concept of counterfactual models and the corresponding theory; Sections 6 to 8 address fairness applications of these models.

1.2 Related Work

This work follows the paper of Black et al.
(2020), which focuses on building sound counterfactual quantities through optimal transport, deviating from both causality-based techniques and the nearest-counterfactual-instance principle. Our contributions in Sections 4 and 5 can be seen as the theoretical foundations of their approach, shedding light on the link between measure-preserving counterfactuals and structural counterfactuals. Also, we note that the way we introduce the causal account of counterfactual reasoning in Section 4 concurs with (Plecko and Meinshausen, 2020) and (Bongers et al., 2021). More precisely, we underline that the objects of interest are the joint probability distributions, or couplings, generated by manipulations of the causal model. Additionally, we propose in Section 6 a direct extension of the counterfactual fairness frameworks introduced in (Kusner et al., 2017) and (Russell et al., 2017) to transport-based counterfactual models, leading to a new method for supervised fair learning. This relates our work to the rich literature on fair learning through optimal transport (Gordaliza et al., 2019; Chiappa et al., 2020; Le Gouic et al., 2020; Chzhen et al., 2020; Risser et al., 2022). Finally, we note that the main result of Section 5, stating that optimal transport maps recover causal effects under specific assumptions, shares similarities with the main theorem of (Torous et al., 2024). In contrast to our work, their assumptions are motivated by the study of classical treatment and control experiments over time via difference-in-differences methods (Imbens and Rubin, 2015).

2. Causal Modeling

Pearl's causal modeling addresses the fundamental problem of analyzing causal relations between variables, beyond mere correlations (Pearl, 2009). It can be regarded as a mathematical formalism meant to describe associations that standard probability calculus cannot represent (Pearl, 2010).
This section recalls the basic theory of this modeling, borrowing the rigorous mathematical framework recently proposed by Bongers et al. (2021). It is meant to keep the paper self-contained and can be skipped by a reader familiar with causality. Let us fix some notation before proceeding. Throughout the paper, we consider a probability space $(\Omega, \mathcal{A}, \mathbb{P})$. We denote respectively by $\mathcal{L}(W)$ and $\mathbb{E}[W]$ the law and expectation under $\mathbb{P}$ of any random variable $W$ on $\Omega$ taking values in a measurable space $(\mathbb{R}^p, \mathcal{B})$ where $p \geq 1$ and $\mathcal{B}$ is the Borel $\sigma$-algebra. Additionally, for any tuple $w := (w_i)_{i \in I}$ indexed by a finite index set $I$ and any subset $I' \subseteq I$, we write $w_{I'} := (w_i)_{i \in I'}$. Similarly, we define $\mathcal{W}_{I'} := \prod_{i \in I'} \mathcal{W}_i$ for any collection of spaces $(\mathcal{W}_i)_{i \in I}$.

2.1 Definition

Causal reasoning rests on the knowledge of a structural causal model (SCM), which represents the causal relationships between the studied variables.

Definition 1 Let $I$ and $J$ be two disjoint finite index sets, and write $\mathcal{V} := \prod_{i \in I} \mathcal{V}_i \subseteq \mathbb{R}^{|I|}$, $\mathcal{U} := \prod_{j \in J} \mathcal{U}_j \subseteq \mathbb{R}^{|J|}$ for two measurable product spaces. A structural causal model $M$ is a couple $\langle U, G \rangle$ where:

1. $U : \Omega \to \mathcal{U}$ is a vector of mutually independent random variables, sometimes called the random noise;

2. $G = \{G_i\}_{i \in I}$ is a collection of measurable $\mathbb{R}$-valued functions, where for every $i \in I$ there exist two subsets of indices $\mathrm{Endo}(i) \subseteq I$ and $\mathrm{Exo}(i) \subseteq J$, respectively called the endogenous and exogenous parents of $i$, such that $G_i : \mathcal{V}_{\mathrm{Endo}(i)} \times \mathcal{U}_{\mathrm{Exo}(i)} \to \mathcal{V}_i$.

A random vector $V : \Omega \to \mathcal{V}$ is a solution of $M$ if for every $i \in I$

$V_i = G_i(V_{\mathrm{Endo}(i)}, U_{\mathrm{Exo}(i)})$ $\mathbb{P}$-a.s.   (1)

The collection of equations defined by (1) and characterized by $G$ and $U$ are called the structural equations. By characterizing $G$ as a measurable vector function $G : \mathcal{V} \times \mathcal{U} \to \mathcal{V}$, we compactly write that $V$ is a solution of $M$ if $V = G(V, U)$ $\mathbb{P}$-a.s. A structural causal model can be seen as a generative model. The variables in $U$ are said to be exogenous as they are imposed a priori by the model.
In contrast, the variables in a solution $V$ are said to be endogenous as they are outputs of the model, determined through the structural equations. In practice, the endogenous variables represent observed data, while the exogenous ones model latent background phenomena. Note that, similarly to Bongers et al. (2021), we assume the $(U_j)_{j \in J}$ to be mutually independent but we do not suppose causal sufficiency (that is, no unmeasured common causes): formally, we do not require $\mathrm{Exo}(i) \cap \mathrm{Exo}(i') = \emptyset$ for every $i \neq i'$. The structural equations specify the causal dependencies between all these variables and are frequently illustrated by the directed graph defined as follows: the set of nodes is $I \cup J$, and a directed edge points from node $k$ to node $l$ if and only if $l \in I$ and $k \in \mathrm{Endo}(l) \cup \mathrm{Exo}(l)$ (we say that $k$ is a parent of $l$). For clarity, we will often substitute the indexes $i \in I$ or $j \in J$ for the variables $V_i$ or $U_j$, in particular when drawing such a graph (see Figure 1). Also, similarly to Bongers et al. (2021), we will in practice use non-disjoint subsets $I$ and $J$ of duplicated natural integers for the sake of clarity. The example below illustrates the above notations and definitions.

Example 1 Consider a simple SCM $M := \langle U, G \rangle$ where $U := (U_1, U_2, U_3)$ is an arbitrary random vector, and such that $G$ is defined by $G_1(u_1) := u_1$, $G_2(v_1, u_2) := v_1 + u_2$, $G_3(v_1, v_2, u_3) := v_1 + v_2 + u_3$. Figure 1a represents the corresponding graph. By definition, finding a solution $V := (V_1, V_2, V_3)$ to $M$ amounts to solving, $\mathbb{P}$-a.s., $V_1 = U_1$, $V_2 = V_1 + U_2$, $V_3 = V_1 + V_2 + U_3$. Then, we readily obtain an almost-surely unique solution given by $V_1 = U_1$, $V_2 = U_1 + U_2$, $V_3 = 2U_1 + U_2 + U_3$ $\mathbb{P}$-a.s.

Figure 1: Examples of causal graphs. (a) Without intervention. (b) After $\mathrm{do}(V_2 = 0)$.
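Example 1 can be checked numerically. Below is a minimal sketch, assuming standard Gaussian noise (the example leaves $U$ arbitrary): evaluating the structural equations in topological order reproduces the closed-form solution map.

```python
import random

def solve_scm(u1, u2, u3):
    # Structural equations of Example 1, evaluated in topological order
    v1 = u1
    v2 = v1 + u2
    v3 = v1 + v2 + u3
    return v1, v2, v3

random.seed(0)
for _ in range(1000):
    u1, u2, u3 = (random.gauss(0, 1) for _ in range(3))
    v1, v2, v3 = solve_scm(u1, u2, u3)
    # Agrees with the closed-form solution V1 = U1, V2 = U1+U2, V3 = 2U1+U2+U3
    assert v1 == u1 and v2 == u1 + u2
    assert abs(v3 - (2 * u1 + u2 + u3)) < 1e-12
```

Because the graph is acyclic, a single forward pass through the equations yields the (almost-surely unique) solution.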
According to (Bongers et al., 2021, Theorem 3.3), a model $M := \langle U, G \rangle$ admits a solution if and only if it is solvable, that is, there exists a measurable function $\Gamma : \mathcal{U} \to \mathcal{V}$, called a solution map, such that $V = \Gamma(U)$ $\mathbb{P}$-a.s. implies $V = G(V, U)$ $\mathbb{P}$-a.s. Solvability signifies that a solution $V$ can be expressed solely in terms of $U$, as in the above example. Note that SCMs are not always solvable (Bongers et al., 2021, Example 2.4). For the sake of convenience, we make in the rest of the paper the common assumption that the considered models are acyclic, meaning that their graphs do not contain any cycles:

(A) The structural causal model $M$ induces a directed acyclic graph (DAG).

Acyclicity entails unique solvability of the SCM (Bongers et al., 2021, Proposition 3.4), in the sense that Equation (1) admits a unique solution up to $\mathbb{P}$-negligible sets (as in Example 1). In particular, the generated distribution of the endogenous variables is unique. We will abusively refer to a solution as the solution of the SCM. Essentially, causal structures capture the assumption that features are not independently manipulable. As we detail next, they make it possible to understand the downstream effect, on the non-intervened variables, of fixing some variables to certain values.

2.2 Do-intervention

The so-called do-calculus embodies mathematically the fundamental distinction between causation and correlation. While standard probability theory can only account for correlations through conditioning, do-calculus allows for intervening on variables through the do-operator. Concretely, a do-intervention is an operation mapping any model $M$ to an alternative one by modifying the generative process.

Definition 2 Let $M = \langle U, G \rangle$ be an SCM, $I' \subseteq I$ a subset of endogenous variables, and $v_{I'} \in \mathcal{V}_{I'}$ a value. The action $\mathrm{do}(I', v_{I'})$ defines the modified model $M_{\mathrm{do}(I', v_{I'})} = \langle U, \widetilde{G} \rangle$ where $\widetilde{G}$ is given by $\widetilde{G}_i := v_i$ if $i \in I'$, and $\widetilde{G}_i := G_i$ if $i \in I \setminus I'$.
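Definition 2 can be illustrated as a purely syntactic operation on the structural equations. The representation below (an SCM as a dictionary of assignment functions, acting on the equations of Example 1) is a hypothetical sketch, not the paper's formalism:

```python
# Each endogenous index maps to its assignment G_i, reading previously
# computed endogenous values v and the exogenous noise u (Example 1).
scm = {
    1: lambda v, u: u[1],
    2: lambda v, u: v[1] + u[2],
    3: lambda v, u: v[1] + v[2] + u[3],
}

def do(model, values):
    """Return the modified model of Definition 2: G_i is replaced by the
    constant v_i for every intervened index i, and kept intact otherwise."""
    return {i: (lambda v, u, c=values[i]: c) if i in values else g
            for i, g in model.items()}

def solve(model, u):
    # Acyclic case: evaluate the equations in topological order 1, 2, 3
    v = {}
    for i in sorted(model):
        v[i] = model[i](v, u)
    return v

u = {1: 1.0, 2: 0.5, 3: -1.0}
assert solve(scm, u) == {1: 1.0, 2: 1.5, 3: 1.5}
# do(V2 = 0): the noise u is untouched, only the equation on V2 changes
assert solve(do(scm, {2: 0.0}), u) == {1: 1.0, 2: 0.0, 3: 0.0}
```

Note that `do` never touches `u`: the surgery modifies the mechanism while respecting the exogeneity of the random seed.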
The model surgery described in Definition 2 consists in enforcing a state of things by substituting fixed values for a set of endogenous variables while keeping all the rest of the causal mechanism equal. By definition, do-interventions respect the exogeneity of the random seed since $U$ remains unchanged. This transcribes the causal principle that acting upon endogenous phenomena does not affect exogenous ones. Provided it is solvable, the modified model $M_{\mathrm{do}(I', v_{I'})}$ generates a new distribution of endogenous variables, describing an alternative world where every $V_i$ for $i \in I'$ is set to the value $v_i$. Note that do-interventions preserve acyclicity. Therefore, if an SCM $M$ satisfies (A), then $M_{\mathrm{do}(I', v_{I'})}$ also satisfies (A). Going further, if $V$ is the solution of an acyclic $M$, we can non-ambiguously define (up to $\mathbb{P}$-negligible sets) its intervened counterpart $V_{\mathrm{do}(I', v_{I'})}$, solution to $M_{\mathrm{do}(I', v_{I'})}$. All in all, (A) provides a convenient setting where the output of a causal model, as well as the outputs of its intervened counterparts, are always well-defined. This implication allows us to clarify the notation: in the sequel we write $\mathrm{do}(V_{I'} = v_{I'})$ for the operation $\mathrm{do}(I', v_{I'})$, and use the subscript $V_{I'} = v_{I'}$ to indicate results of this operation. Crucially, intervening does not amount to conditioning in general, that is, $\mathcal{L}(V \mid V_{I'} = v_{I'}) \neq \mathcal{L}(V_{V_{I'} = v_{I'}})$. This means that causal outcomes may not be observable and hence require a known causal model to be inferred, as exemplified below.

Example 2 Let $M := \langle U, G \rangle$ be the SCM from Example 1 and consider the do-intervention $\mathrm{do}(V_2 = 0)$. This defines the intervened model $M_{V_2=0} := \langle U, \widetilde{G} \rangle$ where $\widetilde{G}_1(u_1) := u_1$, $\widetilde{G}_2(v_1, u_2) := 0$, $\widetilde{G}_3(v_1, v_2, u_3) := v_1 + v_2 + u_3$. Figure 1b represents the graph after surgery. The modified structural equations on a solution $\widetilde{V} := (\widetilde{V}_1, \widetilde{V}_2, \widetilde{V}_3)$ are, $\mathbb{P}$-a.s., $\widetilde{V}_1 = U_1$, $\widetilde{V}_2 = 0$, $\widetilde{V}_3 = \widetilde{V}_1 + \widetilde{V}_2 + U_3$. Then, we readily obtain that the almost-surely unique solution is given by $\widetilde{V}_1 = U_1$, $\widetilde{V}_2 = 0$, $\widetilde{V}_3 = U_1 + U_3$ $\mathbb{P}$-a.s. Since $U_1, U_2, U_3$ are mutually independent, we have $\mathcal{L}(V_1 \mid V_2 = 0) = \mathcal{L}(U_1 \mid U_1 + U_2 = 0)$ while $\mathcal{L}(\widetilde{V}_1) = \mathcal{L}(U_1)$. Therefore, $\mathcal{L}(V_1 \mid V_2 = 0) \neq \mathcal{L}(\widetilde{V}_1)$ in general. In Section 4, we will explain how the do-operator enables counterfactual inference from a causal model. We now turn to the second mathematical theory of interest for our work: mass transportation.

3. Mass Transportation

We first introduce the necessary background on mass (or measure) transportation. Then, we detail the specific case of optimal transport.

3.1 Definition

In probability theory, the problem of mass transportation amounts to constructing a joint distribution, namely a coupling, between two marginal probability measures. Suppose that each marginal distribution is a sand pile in the ambient space. A coupling is a mass transportation plan transforming one pile into the other, by specifying how to move each elementary sand mass from the first distribution so as to recover the second distribution. Alternatively, we can see a coupling as a random matching which pairs start points to end points between the respective supports with a certain weight. Formally, let $P, Q$ be two Borel probability measures on $\mathbb{R}^d$, whose respective supports are denoted by $\mathrm{supp}(P)$ and $\mathrm{supp}(Q)$. We recall that the support is the set of points $x \in \mathbb{R}^d$ such that every open neighbourhood of $x$ has positive probability. A coupling between $P$ and $Q$ is a probability measure $\pi$ on $\mathbb{R}^d \times \mathbb{R}^d$ admitting $P$ as first marginal and $Q$ as second marginal, precisely $\pi(A \times \mathbb{R}^d) = P(A)$ and $\pi(\mathbb{R}^d \times B) = Q(B)$ for all measurable sets $A, B \subseteq \mathbb{R}^d$. Throughout the paper, we denote by $\Pi(P, Q) \subseteq \mathcal{P}(\mathbb{R}^d \times \mathbb{R}^d)$ the set of joint distributions whose marginals coincide with $P$ and $Q$ respectively. A coupling $\pi \in \Pi(P, Q)$ is said to be deterministic if each instance from the first marginal is paired with probability one to an instance of the second marginal.
Such a coupling can be characterized by a measurable map $T : \mathbb{R}^d \to \mathbb{R}^d$ that pushes forward $P$ to $Q$, that is, $Q(B) = P(T^{-1}(B))$ for any measurable set $B \subseteq \mathbb{R}^d$. This property, denoted by $T\#P = Q$, means that if the law of a random variable $Z$ is $P$, then the law of $T(Z)$ is $Q$. To make the relation with random couplings, we also introduce the action of couples of functions on probability measures. For any pair of functions $T_1, T_2 : \mathbb{R}^d \to \mathbb{R}^d$, we define $(T_1 \otimes T_2) : \mathbb{R}^d \to \mathbb{R}^d \times \mathbb{R}^d$, $x \mapsto (T_1(x), T_2(x))$. As such, $(T_1 \otimes T_2)\#P$ denotes the law of $(T_1(Z), T_2(Z))$ where $\mathcal{L}(Z) = P$. This coupling admits $T_1\#P$ and $T_2\#P$ as first and second marginal respectively. Thus, the deterministic coupling $\pi$ between $P$ and $Q$ characterized by a push-forward operator $T$ satisfying $T\#P = Q$ can be written as $\pi = (I \otimes T)\#P$, where $I$ is the identity function on $\mathbb{R}^d$. This coupling matches a given instance $x \in \mathrm{supp}(P)$ to $T(x) \in \mathrm{supp}(Q)$ with probability 1.

3.2 Optimal Transport

We recall here some basic knowledge on optimal transport theory, which is the mass transportation approach we focus on in this work, and refer to (Villani, 2003, 2008) for further details. Optimal transport restricts the set of feasible couplings between two marginals by isolating ones that are optimal in some sense.

3.2.1 Arbitrary Cost

The Monge formulation of the optimal transport problem with general cost $c : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is the optimization problem

$\min_{T : T\#P = Q} \int_{\mathbb{R}^d} c(x, T(x)) \, \mathrm{d}P(x)$.   (2)

We refer to solutions to (2) as optimal transport maps between $P$ and $Q$ with respect to $c$; they minimize the effort, quantified by $c$, of moving every elementary mass from $P$ to $Q$. One mathematical complication is that the push-forward constraint renders the problem unfeasible in many general settings, in particular when $P$ and $Q$ are not absolutely continuous with respect to the Lebesgue measure or have unbalanced numbers of atoms.
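The push-forward condition is easy to illustrate in one dimension. A minimal sketch, assuming the hypothetical choices $P = \mathcal{N}(0, 1)$ and $Q = \mathcal{N}(2, 9)$: the increasing affine map $T(x) = 2 + 3x$ satisfies $T\#P = Q$, so it is a feasible candidate for Monge's problem (2).

```python
import random
import statistics

random.seed(0)
# Sample from P = N(0, 1) and push each point through T(x) = 2 + 3x
xs = [random.gauss(0, 1) for _ in range(50_000)]
ys = [2 + 3 * x for x in xs]

# The image sample should follow Q = N(2, 9): check the first two moments
assert abs(statistics.mean(ys) - 2) < 0.1
assert abs(statistics.stdev(ys) - 3) < 0.1
```

Seen as a coupling, $(I \otimes T)\#P$ pairs each sampled $x$ with $2 + 3x$ with probability 1, hence the coupling is deterministic.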
This issue motivates the following Kantorovich relaxation of the optimal transport problem with cost $c$,

$\min_{\pi \in \Pi(P, Q)} \int_{\mathbb{R}^d \times \mathbb{R}^d} c(x, x') \, \mathrm{d}\pi(x, x')$.   (3)

Solutions to (3) are optimal transport plans (possibly non-deterministic) between $P$ and $Q$ with respect to $c$. In contrast to optimal transport maps, they exist under very mild assumptions, such as the non-negativity of the cost. Notice that, since a push-forward operator defines a coupling, the set of feasible solutions to (2) is included in the set of feasible solutions to (3).

3.2.2 Quadratic Cost

Optimal transport enjoys a well-established theory, in particular when the ground cost is the squared Euclidean distance $c(x, x') := \|x - x'\|^2$ on $\mathbb{R}^d \times \mathbb{R}^d$. Suppose that $P$ is absolutely continuous with respect to the Lebesgue measure in $\mathbb{R}^d$, and that both $P$ and $Q$ have finite second-order moments. Theorem 2.12 in Villani (2003), originally proved by Cuesta and Matrán (1989) and then Brenier (1991), states that there exists a unique solution to Kantorovich's formulation of optimal transport (3), whose form is $(I \otimes T)\#P$ where $T : \mathbb{R}^d \to \mathbb{R}^d$ solves the corresponding squared Monge problem,

$\min_{T : T\#P = Q} \int_{\mathbb{R}^d} \|x - T(x)\|^2 \, \mathrm{d}P(x)$.   (4)

Although it may not be unique, this optimal transport map $T$ is uniquely determined $P$-almost everywhere, and we will abusively refer to it as the solution to Problem (4). Crucially, this map coincides $P$-almost everywhere with the gradient of a convex function. Moreover, according to McCann (1995), under the sole assumption that $P$ is absolutely continuous with respect to the Lebesgue measure, there exists only one (up to $P$-negligible sets) gradient of a convex function $\varphi$ satisfying the push-forward condition $\nabla\varphi\#P = Q$. We combine Brenier's and McCann's theorems into the following lemma, which simplifies the search for the solutions to (4).

Lemma 3 Assume that $P$ is absolutely continuous with respect to the Lebesgue measure, and that both $P$ and $Q$ have finite second-order moments.
Then, a measurable map $T$ is a solution to (4) if and only if it satisfies the two following conditions: (i) $T\#P = Q$; (ii) there exists a convex function $\varphi : \mathbb{R}^d \to \mathbb{R}$ such that $T = \nabla\varphi$ $P$-almost everywhere.

This result will play a key role in Section 5.2 to prove a link between optimal transport and causality.

3.2.3 Implementation

In practice, we do not know the measures $P$ and $Q$ but have access to empirical observations. This naturally raises the questions of building relevant data-driven approximations, or estimators, of the optimal transport plans, and of what should be required to ensure statistical guarantees. In this section, we briefly present the computational aspects of optimal transport, and refer to (Peyré and Cuturi, 2019) for a complete overview. Concretely, consider two samples of i.i.d. observations $\{x_1, \ldots, x_n\}$ and $\{x'_1, \ldots, x'_m\}$ drawn from respectively $P$ and $Q$. These samples define the empirical measures $P_n = n^{-1} \sum_{i=1}^n \delta_{x_i}$ and $Q_m = m^{-1} \sum_{j=1}^m \delta_{x'_j}$, where $\delta_x$ denotes the Dirac measure at point $x$. Then, the standard way to estimate an optimal transport plan between the marginals $P$ and $Q$ is to solve the Kantorovich formulation (3) between their empirical counterparts $P_n$ and $Q_m$. By characterizing a discrete coupling by a matrix, we write this problem as

$\min_{\pi \in \Sigma(n,m)} \sum_{i=1}^n \sum_{j=1}^m c(x_i, x'_j) \pi(i, j)$,   (5)

where $\Sigma(n, m) := \{\pi \in \mathbb{R}_+^{n \times m} \mid \sum_{j=1}^m \pi(i, j) = n^{-1} \text{ and } \sum_{i=1}^n \pi(i, j) = m^{-1}\}$. Note that empirical transport plans are statistically consistent. This means that if the Kantorovich problem (3) admits a unique solution $\pi$, then a sequence $\{\pi_{n,m}\}_{n,m \in \mathbb{N}}$ of solutions to Problem (5) converges weakly to $\pi$ as $n$ and $m$ increase to infinity (Villani, 2008, Theorem 5.19). This property is crucial to ensure statistical guarantees in optimal-transport frameworks. We emphasize that even if a solution to Problem (5) is necessarily non-deterministic as soon as $n \neq m$, the corresponding solution to Problem (3) can be deterministic.
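In the balanced uniform case $n = m$, problem (5) admits an optimal solution that is a permutation matrix scaled by $n^{-1}$, so it reduces to a linear assignment problem. The sketch below, on hypothetical Gaussian samples, uses SciPy's assignment solver; it is an illustration, not the implementation used in the paper's experiments.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n = 8
x = rng.normal(size=(n, 2))             # i.i.d. sample from P
y = rng.normal(loc=3.0, size=(n, 2))    # i.i.d. sample from Q

# Quadratic cost matrix c(x_i, x'_j) = ||x_i - x'_j||^2
cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)

# Optimal assignment <=> optimal transport plan for uniform weights
rows, cols = linear_sum_assignment(cost)
plan = np.zeros((n, n))
plan[rows, cols] = 1.0 / n

# The plan is a coupling: its marginals are the empirical measures
assert np.allclose(plan.sum(axis=1), 1.0 / n)
assert np.allclose(plan.sum(axis=0), 1.0 / n)
```

The resulting plan is deterministic (each $x_i$ is matched to exactly one $x'_j$), consistently with the remark above that determinism can hold in the balanced case.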
The main challenge when working with empirical optimal-transport solutions is that they are expensive in both computational complexity and memory: solving (5) typically requires $O((n + m)nm \log(n + m))$ computer operations, and the solution is stored as an $n \times m$ matrix, which can limit applications to large data sets. Our implementation (see the experiments in Section 8) exploits the sparsity of the transport matrix to avoid overloading the memory and to speed up the evaluation of optimal-transport-based metrics. One could also consider entropic regularization schemes to accelerate the computation of a solution, reaching $O(nm)$ operations (Cuturi, 2013). However, the obtained approximation of the transport matrix is typically non-sparse, hence contains many non-zero coefficients, which precludes memory-efficient implementations. This is why we address only standard optimal transport in our numerical experiments.

4. Counterfactual Models

We now have all the tools to focus on the main subject of this paper: counterfactual reasoning. As mentioned in the introduction, both causality and transport techniques have been used for this purpose. However, a yet non-appreciated aspect is that these frameworks can be written in a common formalism; this is what this section addresses. More precisely, we propose the definition of counterfactual models, mathematical objects encoding the probabilities of all counterfactual statements with respect to modifications of one variable, and detail how to construct them with, respectively, causal models and mass-transportation methods.

4.1 Problem Setup

Set $d \geq 1$, and define the random vector $V := (X, S) \in \mathbb{R}^{d+1}$, where the variables $X : \Omega \to \mathcal{X} \subseteq \mathbb{R}^d$ represent some observed features, while the variable $S : \Omega \to \mathcal{S} \subseteq \mathbb{R}$ can be subjected to interventions. For simplicity, we assume that $\mathcal{S}$ is finite and such that for every $s \in \mathcal{S}$, $\mathbb{P}(S = s) > 0$. We consider the problem of computing the potential outcomes of $X$ when changing $S$.
Typically, $S$ represents the sensitive, protected attribute in fairness settings, or the treatment status in the potential-outcome framework (Rubin, 1974). Suppose for instance that the event $\{X = x, S = s\}$ is observed, and let $s' \neq s$. We aim at answering the counterfactual question: had $S$ been equal to $s'$ instead of $s$, what would have been the value of $X$? Critically, because of correlations or structural relations between the variables, computing the alternative state does not amount to changing the value of $S$ while keeping the features $X$ equal.

4.2 Structural Profile Counterfactuals

Answering the counterfactual question from Section 4.1 with Pearl's framework requires assuming causal dependencies between $X$ and $S$. Formally, suppose that $V = (X, S) \in \mathbb{R}^{d+1}$ is the unique solution to an SCM $M = \langle U, G \rangle$ satisfying the acyclicity assumption (A). We recall that each endogenous variable $V_k$ is then defined (up to sets of probability zero) by the structural equation $V_k = G_k(V_{\mathrm{Endo}(k)}, U_{\mathrm{Exo}(k)})$ $\mathbb{P}$-a.s., where $G_k$ is a real-valued measurable function, $U$ is a vector of exogenous variables, while $V_{\mathrm{Endo}(k)}$ and $U_{\mathrm{Exo}(k)}$ denote respectively the endogenous and exogenous parents of $V_k$. In the following, we denote by $U_X$ and $U_S$ the exogenous parents of respectively $X$ and $S$. We write $X_{S=s'}$ for the post-intervention counterpart of $X$ through the do-intervention $\mathrm{do}(S = s')$, that is, after replacing the structural equation on $S$ by $S = s'$ while keeping the rest of the causal mechanism equal. Several types of counterfactual quantities can be derived from SCMs, corresponding to answers to distinct counterfactual questions. For the sake of clarity, we first explain in detail the specific SCM-based counterfactuals that are relevant to answer the question from Section 4.1. Then, we highlight the mass-transportation interpretation of such counterfactuals.

4.2.1 Definitions of SCM-based Counterfactuals

In the SCM $M$, a unit is characterized by a realization $u \in \mathcal{U}$ of the exogenous random vector $U$.
Such a value completely determines a realization $v = (x, s) \in \mathcal{X} \times \mathcal{S}$ of the endogenous random vector $V$ through the structural equations, formally $(x, s) := \Gamma(u)$ using the solution-map notation. In this article, we follow Wu et al. (2019) and call a vector $(x, s) \in \mathcal{X} \times \mathcal{S}$ a profile. Importantly, a unit uniquely determines a profile in $M$, but a profile does not uniquely determine a unit, due to $\Gamma$ not being necessarily invertible. (Pearl et al., 2016, Section 4.2.1) defines the basic counterfactuals at the unit level. This construction rests on a fundamental property of the alternative SCMs $\{M_{S=s'}\}_{s' \in \mathcal{S}}$ obtained by do-interventions: they share the same exogenous vector $U$ as $M$. Therefore, fixing a same input unit $u \in \mathcal{U}$ in the models $M$ and $M_{S=s'}$ for a chosen $s' \in \mathcal{S}$ produces two output profiles: $(x, s)$ from $M$ and $(x', s')$ from $M_{S=s'}$. They respectively describe the factual outcome and the counterfactual outcome had $S$ been equal to $s'$ for the specific unit $u$. Unit counterfactuals enable one to answer the question: What would have been the value of $X$ had $S$ been equal to $s'$, given that $\{U = u\}$? While this question is interesting in theory, it is often irrelevant in practice, since what one generally observes are endogenous variables (namely profiles) instead of hidden exogenous variables (namely units). Therefore, people doing counterfactual reasoning focus on surrogate questions of the form: What would have been the value of $X$ had $S$ been equal to $s'$, given that $\{X = x, S = s\}$? This is precisely the pivotal question that we aim at answering in this paper. It differs fundamentally from the former query, notably because the profile-level context $\{X = x, S = s\}$ does not always uniquely determine a unit-level context $\{U = u\}$. Uncertainty as to the unit behind the profile $(x, s)$ is quantified by the probability distribution $\mathcal{L}(U \mid X = x, S = s)$, which is not necessarily degenerate.
Therefore, one cannot always certainly find the true unit responsible for the observation, and thereby cannot always certainly compute the unit-level counterfactual outcome correctly answering our question. Instead, one can compute a distribution over unit-level counterfactual outcomes consistent with the context {X = x, S = s}, formally given by L(X_{S=s'} | X = x, S = s), which specifies a set of possible answers with probability weights. This means that answers to the counterfactual question from Section 4.1 are fundamentally probabilistic from an SCM perspective. Computing the probabilities of such answers relies on the well-known three-step procedure (Pearl, 2009, Theorem 7.1.7), which requires knowing the fully-specified SCM. This algorithm takes as input a context, framed as an event on endogenous variables, and delivers a probability distribution of counterfactual outcomes for a chosen do-intervention. More specifically, for an input context {X = x, S = s} and an intervention do(S = s'), it carries out the following three steps:

1. (Abduction) compute the posterior distribution of units consistent with the factual observation, namely L(U | X = x, S = s);
2. (Action) perform do(S = s') to obtain the modified causal mechanism G_{S=s'} of M_{S=s'};
3. (Prediction) pass the distribution L(U | X = x, S = s) through the modified system of equations G_{S=s'} to get the distribution of counterfactual outcomes L(X_{S=s'} | X = x, S = s).

This procedure defines the SCM-based counterfactuals that one needs to address the problem from Section 4.1. We call them structural profile counterfactuals to emphasize two characteristics: (1) they are computed via an SCM, (2) they are defined for a specific profile (instead of a specific unit). The tag "structural" notably serves to distinguish such counterfactuals from the (later introduced) transport-based counterfactuals at the heart of this work.

Definition 4 Let M satisfy (A).
For observed evidence {X = x, S = s} and an intervention do(S = s'), the structural profile counterfactuals are characterized by the probability distribution µ*_{s'|s}(·|x) defined as

µ*_{s'|s}(·|x) := L(X_{S=s'} | X = x, S = s).

We make three remarks. First, this definition can be seen as a specific case of (Barocas et al., 2023, Chapter 5, Definition 2) for an input context framed as an event of the form {V = v} where v ∈ V (rather than an event on a possibly reduced number of endogenous variables). It also corresponds to the law of the counterfactual random variable defined in (Karimi et al., 2020, Section 2.2). Second, as explained, the profile-level counterfactual counterparts of a single profile are not necessarily deterministic (that is, uniquely determined), which differs from unit-level counterfactuals. Mathematically, this comes from the fact that L(X_{S=s'} | X = x, S = s) is not necessarily degenerate, whereas L(X_{S=s'} | U = u) is always a point-mass distribution. Obviously, in case a profile (x, s) ∈ X × S uniquely determines a unit u ∈ U, the two conditional distributions for these specific values coincide, meaning that the structural counterfactual counterpart of (x, s) is unique. Section 5.1.1 details under which conditions profile-level counterfactuals are deterministic, and therefore correspond to unit-level counterfactuals. We also emphasize that, while the rest of the paper almost exclusively focuses on profile counterfactuals rather than unit counterfactuals, the distinction will play a critical role in Section 6 to understand the definition of counterfactual fairness. Third, note that µ*_{s'|s}(·|x) measures counterfactual outcomes of the feature vector X alone rather than the whole profile vector (X, S). This is an arbitrary choice that serves to highlight that after do(S = s'), only the resulting values of X_{S=s'} need to be specified since S_{S=s'} = s'.
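As a concrete illustration of the abduction-action-prediction steps above, consider a toy additive-noise SCM in the spirit of the linear model used for the Law data set in Section 4.4.1. The parameters b and w below are hypothetical, chosen only for demonstration; because the noise is additive, abduction is exact and the counterfactual is a single point rather than a distribution:

```python
import numpy as np

# Toy additive-noise SCM (hypothetical, for illustration only):
#   S = U_S,   X = b + w * S + U_X.
b = np.array([1.0, -0.5])
w = np.array([2.0, 0.3])

def three_step_counterfactual(x, s, s_cf):
    """Abduction-action-prediction for this toy SCM.

    Because the noise is additive, the posterior L(U_X | X=x, S=s)
    is a point mass, so the counterfactual outcome is unique.
    """
    u_x = x - b - w * s        # 1. abduction: recover the exogenous term
    # 2. action: replace the equation of S by S = s_cf (do-intervention)
    x_cf = b + w * s_cf + u_x  # 3. prediction: push u_x through the modified SCM
    return x_cf

x = np.array([3.5, -0.1])      # observed profile with S = 0
x_cf = three_step_counterfactual(x, s=0, s_cf=1)
print(x_cf)                    # equals x + w, i.e. a translation
```

In this degenerate case µ*_{s'|s}(·|x) is a point mass at x + w(s' − s), which previews the deterministic analysis of Section 5.1.1.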
On this basis, we introduce the following notations to formalize the contrast between interventional, counterfactual and factual outcomes. For s, s' ∈ S we define three probability distributions. Firstly, µ_s := L(X | S = s) is the distribution of the factual s-instances. This observable measure describes the possible values of X such that S = s, and we write X_s for its support. Secondly, we denote by µ_{S=s'} := L(X_{S=s'}) the distribution of the interventional s'-instances. It describes the alternative values of X in a world where S is forced to take the value s'. In contrast to the factual distribution, the interventional distribution is in general not observational, in the sense that we cannot draw empirical observations from it. Finally, we define by µ*_{s'|s} := L(X_{S=s'} | S = s) the distribution of the counterfactual s'-instances given s. It describes what would have been the factual instances of µ_s had S been equal to s' instead of s. According to the consistency rule (Pearl et al., 2016), which states that X = Σ_{s'∈S} 1{S = s'} X_{S=s'} P-a.s., the factual and counterfactual distributions coincide when s = s', that is, µ_s = µ*_{s|s}. However, when s ≠ s', the counterfactual distribution µ*_{s'|s} is generally not observable.

4.2.2 Mass-transportation Viewpoint

In this paper, we focus on a mass-transportation viewpoint of the counterfactuals from Definition 4. It rests on the fact that, for a given change of S, there exists a correspondence between factual profiles and structural counterfactual profiles. Such a correspondence is not necessarily one-to-one, but consists of a collection of weighted correspondences described by the distribution of counterfactuals. This random mapping between factual and counterfactual counterparts as induced by the SCM is formalized by the following definition.

Definition 5 Let M satisfy (A). For every s, s' ∈ S, the structural counterfactual coupling between µ_s and µ*_{s'|s} is given by

π*_{s'|s} := L((X, X_{S=s'}) | S = s).
We call the collection of couplings Π* := {π*_{s'|s}}_{s,s'∈S} the structural counterfactual model on X with respect to S. In this formalism, the quantity dπ*_{s'|s}(x, x') is the elementary probability of the counterfactual statement "had S been equal to s' instead of s, then X would have been equal to x' instead of x". As such, a counterfactual model characterizes the distribution of all the cross-world statements on X with respect to changes of S. Note that each realization of π*_{s'|s}, that is, each pair of factual instance and counterfactual counterpart, is generated by the same realization of L(U | S = s). We point out that Definitions 4 and 5 characterize the exact same counterfactual statements, the formal link being dπ*_{s'|s}(x, x') = dµ*_{s'|s}(x'|x) dµ_s(x). In particular, there is an equivalence between µ*_{s'|s}(·|x) narrowing down to a single value for every x ∈ X_s and π*_{s'|s} being a deterministic coupling. Assumptions rendering counterfactuals single-valued will be studied in Section 5.1.1. We also note that this joint-probability-distribution perspective of Pearl's counterfactuals concurs with the one from (Bongers et al., 2021, Section 2.5).

4.3 Transport-based Profile Counterfactuals

The main issue of structural profile counterfactuals, which will be widely discussed in Section 4.5, comes from the causal model being unknown in practice. Thus, the necessity to make counterfactual frameworks feasible naturally raises the question of finding good surrogates to causal profile counterfactuals. We have seen that the problem of assessing counterfactual statements about X with respect to interventions on S using causal models could be reduced to knowing a collection of random mappings from factual distributions {µ_s}_{s∈S} towards counterfactual distributions {µ*_{s'|s}}_{s,s'∈S}.
This perspective suggests that mass-transportation techniques can be natural substitutes for structural counterfactual reasoning, as they remedy the aforementioned issues.

4.3.1 Definition

In (Black et al., 2020), the authors mimicked the structural account of counterfactuals by computing alternative instances using a deterministic optimal transport map. Extending their idea, we propose a more general framework where the counterfactual operation switching S from s to s' can be seen as a mass transportation plan, not necessarily optimal-transport based and not necessarily deterministic, between two distributions.¹ In the following, t : R^d × R^d → R^d × R^d, (x, x') ↦ (x', x) denotes the permutation function.

Definition 6 A transport-based counterfactual model is a collection of couplings Π := {π_{s'|s}}_{s,s'∈S} satisfying for every s, s' ∈ S,
(i) π_{s'|s} ∈ Π(µ_s, µ_{s'});
(ii) π_{s|s} = (I × I)#µ_s;
(iii) π_{s|s'} = t#π_{s'|s}.

An element of Π is called a counterfactual coupling. We say that Π is a random counterfactual model if at least one coupling for s ≠ s' is not deterministic. Otherwise, we say that Π is a deterministic counterfactual model. In the deterministic case, Π can be characterized almost everywhere by a collection T := {T_{s'|s}}_{s,s'∈S} of measurable mappings from X to X satisfying for every s, s' ∈ S,
(i) T_{s'|s}#µ_s = µ_{s'};
(ii) T_{s|s} = I;
(iii) T_{s'|s} is invertible µ_s-almost everywhere such that T_{s|s'} = T_{s'|s}^{-1}.

An element of T is called a counterfactual operator. Similarly to structural counterfactual models, these models assign a probability to all the cross-world statements on X with respect to interventions on S. By convention, we use the superscript * to denote structural counterfactual models, and no superscript for transport-based counterfactual models.

1. In (Asher et al., 2022, Section 7.2), we present this view of counterfactuals from a logic perspective.
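For finitely supported marginals, conditions (i)-(iii) of Definition 6 can be checked mechanically: a coupling is then a matrix whose rows sum to the weights of µ_s and whose columns sum to the weights of µ_{s'}, and condition (iii) amounts to transposition. The following sketch (function and variable names are ours, not from any library) verifies these conditions on a two-group example:

```python
import numpy as np

def is_counterfactual_model(couplings, marginals, tol=1e-9):
    """Check Definition 6 for discrete measures.

    couplings[s][t]: coupling matrix from group s to group t;
    marginals[s]: weight vector of µ_s.
    """
    groups = marginals.keys()
    for s in groups:
        for t in groups:
            P = couplings[s][t]
            # (i) marginal constraint: P ∈ Π(µ_s, µ_t)
            if not (np.allclose(P.sum(axis=1), marginals[s], atol=tol)
                    and np.allclose(P.sum(axis=0), marginals[t], atol=tol)):
                return False
            # (iii) symmetry: the reverse coupling is the transpose
            if not np.allclose(couplings[t][s], P.T, atol=tol):
                return False
        # (ii) identity coupling on the diagonal: (I × I)#µ_s
        if not np.allclose(couplings[s][s], np.diag(marginals[s]), atol=tol):
            return False
    return True

mu = {0: np.array([0.5, 0.5]), 1: np.array([0.25, 0.75])}
P01 = np.array([[0.25, 0.25],   # an admissible coupling from µ_0 to µ_1
                [0.00, 0.50]])
couplings = {0: {0: np.diag(mu[0]), 1: P01},
             1: {0: P01.T, 1: np.diag(mu[1])}}
print(is_counterfactual_model(couplings, mu))  # True
```

Note that the trivial product couplings {µ_s ⊗ µ_{s'}} would also pass this check, which is precisely why Section 4.3.2 discusses how to choose among admissible models.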
The marginal constraint (i) in Definition 6 translates the intuition that a realistic counterfactual operation on S should morph the non-intervened variables X so that their values fit the targeted distribution. In this sense, transport-based models preserve the principle that features are not independently manipulable, but without using causal relations. The symmetry constraints (ii) and (iii) cover the reciprocity intuition we have on counterfactual counterparts. Remark that in the case of discrete measures, the operation t# in condition (iii) simply amounts to transposing the associated coupling matrices. Lastly, note that this definition replaces the unobservable, SCM-dependent distributions {µ*_{s'|s}}_{s,s'∈S} of structural counterfactual models by the observational {µ_{s'}}_{s'∈S} for feasibility reasons. In Section 5.1.2, we will see that this approximation makes sense in typical fairness settings where µ*_{s'|s} = µ_{s'} for every s, s' ∈ S. The adjective "deterministic" refers to the fact that the model assigns to each factual profile a unique counterfactual counterpart. Formally, the counterfactual counterpart of some observation x ∈ X_s after changing S from s to s' is given by x' = T_{s'|s}(x) ∈ X_{s'}. In contrast, a random model allows possibly several counterparts with probability weights. The first interest of considering random couplings is purely conceptual: allowing the counterfactual counterpart of a single instance to be non-unique has philosophical implications (Asher et al., 2022, Section 6.3). Besides, it is consistent with the causal approach, which also authorizes non-deterministic profile counterfactuals. The second and most critical benefit is mathematical and practical. While there always exist random couplings between two distributions, deterministic push-forward mappings (even causally-induced ones) may not exist when the marginals do not have densities, making deterministic counterfactual models ill-defined in many settings with non-continuous variables.
This makes the extension to random couplings mathematically necessary to properly define transport-based counterfactual models in any configuration, in particular concrete machine-learning tasks, which typically involve both continuous and discrete covariates. Notably, we rely on random couplings in the numerical experiments from Section 8, illustrating Section 7. We also point out that Definition 6 refers to the true conditional distributions {µ_s}_{s∈S}, not their empirical counterparts. This means that, in practice, after choosing the counterfactual model in theory (for instance as the solutions to an optimal transport problem), one must estimate this model from finite samples. Sections 4.4 and 8 respectively illustrate the estimation of deterministic and random transport-based counterfactual models. For statistical coherence, we leverage a (deterministic) map estimator only if the coupling specified by the chosen model is itself deterministic.

4.3.2 Choosing a Model

One challenge for the transport-based approach is to choose the model appropriately in order to define a relevant notion of counterpart. There possibly exists an infinite number of admissible counterfactual models in the sense of Definition 6, many of them being inappropriate. As an illustration, consider the family of trivial couplings, namely {µ_s ⊗ µ_{s'}}_{s,s'∈S}, where ⊗ denotes the product of measures. Though it is a well-defined transport-based counterfactual model, it is not intuitively justifiable as it completely decorrelates factual and counterfactual profiles. To sum up, a transport-based counterfactual model must be both intuitively justifiable and computationally feasible. We argue that optimal-transport solutions are tailored couplings with respect to both criteria.
Regarding feasibility, optimal transport has become the most popular tool in statistics-related fields to define couplings between distributions when no canonical choice is available, as in generative modeling (Balaji et al., 2020), domain adaptation (Courty et al., 2014, 2017; Redko et al., 2019; Rakotoarison et al., 2022), and transfer learning (Gayraud et al., 2017; Peterson et al., 2021), thanks to significant advances in computational schemes. Regarding justifiability, as argued by Black et al. (2020), generating a counterfactual operation by solving the optimal-transport Problem (2) leads to intuitively relevant counterfactuals for two reasons. Firstly, because counterfactuals are obtained by minimizing a metric between paired instances with different S-values, they transcribe the founding principle of Lewis (1973) to compute counterfactual counterparts: finding the most similar alternative worlds verifying the desired change. Secondly, the mass-preservation constraint ensures that the generated counterfactuals are realistic, conforming to the statistical features of a target distribution. More specifically, deterministic optimal transport for the quadratic cost (see Section 3.2.2) has remarkable properties that are relevant in the context of constructing counterfactuals. According to Lemma 3, solutions to Problem (4) are gradients of convex functions, which extends the notion of non-decreasing function to several dimensions. In particular, the optimal transport map in dimension one is the quantile-preservation map between univariate distributions. This behaviour has notably inspired constructions of multivariate notions of quantile based on optimal transport (Chernozhukov et al., 2017; Hallin et al., 2021; Ghosal and Sen, 2022). It makes sense in counterfactual reasoning where, without further information on the data-generation process, preserving the quantile from one marginal to the other is an intuitive definition of the counterfactual counterpart.
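In one dimension, this quantile-preservation map can be estimated directly from samples via empirical CDFs and quantiles; a minimal sketch (the sample sizes and distributions below are arbitrary choices for illustration):

```python
import numpy as np

# In 1D, the quadratic-cost OT map is the quantile-preservation map
# T = F_{s'}^{-1} ∘ F_s.  Empirically: send a point to the target point
# of the same rank.
rng = np.random.default_rng(0)
xs = rng.normal(loc=0.0, scale=1.0, size=1000)  # factual sample, from µ_s
xt = rng.normal(loc=3.0, scale=2.0, size=1000)  # target sample, from µ_s'

def quantile_map(x, source, target):
    """Empirical counterfactual counterpart of x: preserve its quantile."""
    q = np.mean(source <= x)        # empirical CDF F_s(x)
    return np.quantile(target, q)   # empirical quantile F_{s'}^{-1}(q)

x = 1.0
print(quantile_map(x, xs, xt))      # close to 3 + 2*1 = 5: for these
                                    # Gaussians the true map is x ↦ 3 + 2x
```

The map is non-decreasing by construction, which is exactly the one-dimensional instance of the gradient-of-convex-function property above.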
Charpentier et al. (2023) notably explore this idea. For the sake of illustration, Section 4.4 below provides several examples of optimal transport applied to counterfactual reasoning. In Section 5.2, we will further justify the pertinence of optimal-transport-based counterfactual models by showing that they coincide with structural counterfactual models under some assumptions. However, the scope of Definition 6 goes beyond solutions to standard optimal-transport problems, allowing other transport methods and as such more possible counterfactual models. The purpose of this generalization is partly theoretical: in the future, one could propose an original matching technique and justify its relevance compared to optimal transport. In particular, the couplings mentioned in (Villani, 2008, Chapter 1) as well as diffeomorphic registration mappings (Joshi and Miller, 2000; Beg et al., 2005) are possible candidates we do not investigate in this paper. Additionally, this generalization permits the use of regularized optimal transport (Cuturi, 2013), which deviates from the original formulation of Problem (3), to accelerate computations. Note in passing that solutions to regularized optimal transport, which are not deterministic, define adequate transport-based counterfactual models thanks to Definition 6 taking random couplings into account. Lastly, we will see in Section 5.1.2 that, under some assumptions, structural counterfactual models are transport-based counterfactual models, though not necessarily optimal-transport-based ones.

4.4 Examples

Now that we have given definitions and insights on counterfactual models, let us study two concrete examples on real data.
4.4.1 Law Data Set

We start by focusing on the Law School Admission Council data set, which gathers statistics from 163 US law schools and more than 20,000 students, including four variables: the race S, the entrance-exam score X1, the grade-point average before law school X2, and the first-year average grade Y. The end goal is to predict the first-year grade Y from the profile (X, S). Similarly to Russell et al. (2017), we consider a fairness setting where the race plays the role of a protected, sensitive attribute which should not be discriminated against, and we restrict to only black (S = 0) and white (S = 1) students. Counterfactual reasoning has become popular in such algorithmic fairness tasks to either ensure or test that, for example, had a black student been white, the output would have been the same. This requires a model to compute the counterfactual counterpart of any student after changing their skin color. First, we consider a structural counterfactual model. This requires a causal model: Russell et al. (2017) proposed the following SCM for the data set,

X1 = b1 + w1 S + U1,
X2 = b2 + w2 S + U2,
S = U_S,
with U_S, U1, U2 independent,

where b := (b1, b2) and w := (w1, w2) are deterministic parameters in R^2 obtained by adjusting linear-regression models component-wise. Let us now calculate the induced structural counterfactual model by applying Definition 5. The coupling from S = 0 to S = 1 is given by

π*_{1|0} := L((X, X_{S=1}) | S = 0) = L((b + U_X, b + w + U_X)) = L((X, X + w) | S = 0).

Conversely, the structural counterfactual coupling from S = 1 to S = 0 is

π*_{0|1} := L((X, X_{S=0}) | S = 1) = L((b + w + U_X, b + U_X)) = L((X, X − w) | S = 1).

Figures 2a and 2b illustrate the computation of the corresponding counterfactual counterparts on samples. We make two important remarks.
Firstly, generating counterfactual quantities in this case amounts to translating instances of µ_0 by the constant w or conversely translating instances of µ_1 by the constant −w. Notably, the two couplings are deterministic: π*_{1|0} and π*_{0|1} are respectively characterized by the mappings T*_{1|0}(x) := x + w and T*_{0|1}(x) := x − w. Note that there is consequently no need to specify the law of the exogenous variables to compute counterfactual quantities. Section 5.1.1 provides a general analysis of such deterministic settings. Secondly, the causal model implies that S ⊥ U_X. This critically entails that the counterfactual distributions are observable, since µ*_{1|0} = L(X_{S=1} | S = 0) = L(b + w + U_X | S = 0) = L(b + w + U_X | S = 1) = µ_1, and µ*_{0|1} = µ_0 analogously. Therefore, the structural counterfactual couplings π*_{1|0} and π*_{0|1} belong respectively to Π(µ_0, µ_1) and Π(µ_1, µ_0). Additionally, they are transposed from one another, that is, t#π*_{1|0} = π*_{0|1}. This means that the structural counterfactual model Π* := {π*_{1|0}, π*_{0|1}} is a transport-based counterfactual model. Mathematical justifications of these properties will be studied in Section 5.1.2. We now turn to an optimal-transport-based counterfactual model. More precisely, we learn the optimal transport map for the quadratic cost, denoted by T_{1|0}, from the black distribution µ_0 towards the white distribution µ_1. This map (and therefore the associated deterministic counterfactual model) is well-defined because the population marginals µ_0 and µ_1 admit densities with respect to the Lebesgue measure. In practice, we rely on the Python Optimal Transport (POT) library to compute an approximation of the mapping from data (Flamary et al., 2021). Note that solving the empirical optimal-transport problem (5) between samples provides a matching that cannot generalize to new, out-of-sample observations.
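To make the estimation pipeline concrete, here is a self-contained sketch of an empirical quadratic-cost OT matching between two equal-size synthetic samples, followed by a crude out-of-sample extension. We use SciPy's assignment solver and a nearest-neighbor displacement as simplified stand-ins for POT's solvers and barycentric extension; the data below are synthetic, not the Law data set:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# For equal sample sizes and uniform weights, empirical quadratic-cost OT
# reduces to an assignment problem whose solution is a permutation.
rng = np.random.default_rng(1)
n, d = 200, 2
X0 = rng.normal(size=(n, d))                          # factual sample of µ_0
X1 = rng.normal(size=(n, d)) + np.array([2.0, 1.0])   # factual sample of µ_1

cost = ((X0[:, None, :] - X1[None, :, :]) ** 2).sum(axis=2)
rows, cols = linear_sum_assignment(cost)   # optimal permutation matching
X0_cf = X1[cols]                           # in-sample counterfactual counterparts

def map_out_of_sample(x):
    """Crude out-of-sample extension (our simplification, not POT's
    barycentric projection): reuse the displacement of the nearest factual."""
    i = np.argmin(((X0 - x) ** 2).sum(axis=1))
    return x + (X0_cf[i] - X0[i])

print(map_out_of_sample(np.zeros(d)))      # roughly shifted by (2, 1)
```

Since the optimal coupling here is a permutation, the matching is deterministic; a proper barycentric extension would instead average target points weighted by the coupling, which is what the estimator used below relies on.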
This is why we employ POT's in-built non-regularized barycentric extension of the empirical solution to obtain a mapping defined everywhere. We use 800 points from each distribution to compute the estimator of T_{1|0} illustrated in Figure 2a. The converse counterfactual operation T_{0|1} represented in Figure 2d is produced by inversion. We emphasize that all the couplings in Figure 2, be they causal-based or optimal-transport-based, are imperfect approximations, but for different reasons. More precisely, we assumed that a linear causal model generated the data in order to compute the structural counterfactual couplings. However, this model-class assumption is not a perfect fit: in particular, some of the produced counterfactual instances are not realistic, yielding GPA scores exceeding the upper limit of 4.0 points; more generally, while both couplings should have µ_0 and µ_1 for marginals, several counterfactual counterparts do not conform to these distributions. Besides, the translation vector w used in practice is an estimation from data, thereby an approximation of the best linear model fitting the data. The implemented optimal-transport mappings are also finite-sample estimators of the true mappings between the continuous distributions. Figure 2c notably shows poor counterfactual associations for outliers of the red sample, likely due to weak estimation in low-density domains. Nevertheless, the marginal constraint of optimal transport ensures that the generated counterfactuals faithfully fit the data and are therefore plausible. Finally, these approximation artifacts of different nature induce distinct behaviours of the causal-based and transport-based couplings at the border of the data set. However, we remark that the couplings are similar in high-density areas. More precisely, the grey lines near the center of the data set correspond to a translation by the same constant for both approaches. This proximity, which sides with the observations of Black et al. (2020), will be theoretically grounded in Section 5.2.

Figure 2: Counterfactual models for the Law data set. The red sample represents 200 factual black students while the blue sample represents 200 factual white students. The green sample depicts counterfactual instances: the first column (Figures 2a and 2c) has white counterfactual students; the second column (Figures 2b and 2d) has black counterfactual students. The lightgray lines describe the coupling between factual and counterfactual instances.

4.4.2 Body-measurement Data Set

We now further illustrate the properties of optimal-transport counterfactuals on a data set of body measurements from n_0 = 260 women and n_1 = 247 men. The features of interest are the weight X1 and the height X2, while S encodes the gender. Suppose now that Bob is an 80kg and 190cm man. What would have been Bob's height and weight had he been a woman? Since we do not know the structural relationships between X, S and possibly hidden sources of randomness U, we follow Black et al. (2020) and rely on mass-transportation techniques to answer this counterfactual question. We proceed as before to estimate the optimal transport map from the male distribution µ_1 towards the female distribution µ_0.

Figure 3: Body data set. (a) 2D optimal-transport matching. (b) 2D component-wise quantile-preservation. The red dots represent a data sample of men, while the blue dots represent a data sample of women. The green dots are the estimated counterfactual counterparts of the male sample.

Applying this operator to Bob, we obtain that, had he been a woman, she would have been 59kg and 177cm. Though it does not have a canonical definition when d = 2, optimal transport seems visually to preserve the position of the paired points from one marginal to another.
This is due to the optimal map being the unique gradient of a convex function between distributions, as previously explained. We underscore in Figure 3 that optimal transport does not amount to feature-wise quantile-preservation, making it a relevant extension of the notion of order to higher dimensions. Notably, preserving the quantile along each coordinate does not satisfy the marginal constraint, yielding counterfactual women not representative of their gender's distribution.

4.5 Discussion

Counterfactuals have valuable applications in fairness and explainability. One could for example try to learn a predictor h designed to make h(x, 0) as close as possible to h(x', 1) for every counterfactual pair (x, x'). This is what Russell et al. (2017) proposed using causal models, and what we implement in Section 7 using transport-based models. Or, one could test whether a trained predictor h is unfair by checking whether h(x, 0) = h(x', 1) holds for every counterfactual pair (x, x'), which is essentially the procedure of Black et al. (2020) leveraging optimal transport maps. However, the application of counterfactual models raises several issues. We conclude Section 4 by discussing important drawbacks of the causal account of counterfactual reasoning as well as the limitations of the transport approach.

4.5.1 Shortcomings of the Causal Approach

The main limitation of computing counterfactuals like Definition 4 and Definition 5, as for any causal-based framework, is its feasibility due to the reliance on an SCM. Assuming a known causal model, in particular a fully-specified causal model, is too strong an assumption in practice. It requires experts to reach a consensus on the causal graph, the structural equations, the distribution of the input exogenous variables, and to test the validity of their model on available data.
This is not a realistic scenario, especially when dealing with a high number of features and possibly complex structural relations. Besides, this is not practical since a causal model must be designed and tested for each possible data set. A more straightforward approach is to directly infer the causal model from observational data. There exist for instance sound techniques to learn the causal graph, but they suffer from being NP-hard, with an exponential worst-case complexity with respect to the number of nodes (Cooper, 1990; Chickering et al., 2004; Scutari et al., 2019). In addition, this is not enough to compute counterfactual quantities, as the structural equations would still be lacking. To obtain these equations, researchers often predefine the functional form of the relations between the variables on the basis of a known graph (be it assumed or inferred) and learn them through regression models (Kusner et al., 2017; Russell et al., 2017), or infer the graph and the structural equations simultaneously. However, this also becomes computationally challenging as the number of features increases. Notably, the literature mostly addresses simple linear models (Shimizu et al., 2006) or very few variables (Hoyer et al., 2008). Finally, the approximation error implied by the choice of the functional class can lead to unrealistic, out-of-distribution counterfactuals, as exemplified in Figure 2 above. To our knowledge, the literature on causal counterfactuals has not pointed out this flaw to date. A related issue is causal uncertainty. There exist several causal models corresponding to the same data distribution, leading to possibly different counterfactual models (see Bongers et al., 2021, Example 4.2). It cannot be tested whether the adjusted model is the true one, making the modeling inherently uncertain.
Moreover, for non-deterministic structural counterfactual models, the computation of counterfactual quantities requires knowing the law of the exogenous variables, which is not observable. While it is common to assume a prior distribution on U, this also adds uncertainty to the causal modeling, hence to the induced counterfactuals. Perhaps more surprisingly, counterfactual quantities are sometimes nonexistent in Pearl's causal framework. The causal modeling we introduced is very general: we do not assume the model to be causally sufficient, and only suppose that the equations are acyclic. Assumption (A) is very common for both practical reasons and reasons of interpretability. In general, however, observational data can be generated through a cyclic mechanism. Critically, (solvable) cyclic models do not always admit solutions under do-interventions, implying that X_{S=s'} may not be defined. We refer to (Bongers et al., 2021, Example 2.17) for an illustration. As a consequence, counterfactual quantities are ill-defined in such settings.

4.5.2 Applicability of the Transport Approach

Regarding transport-based counterfactual reasoning, the main practical limitation is also computational. The domain S of the intervened variable S must be finite for the counterfactual model to be tractable. Moreover, generating the model needs |S|(|S| − 1)/2 computations of transportation plans, which can become too expensive when |S| is large. We recall that solving Problem 5 demands O((n + m)nm log(n + m)) operations. Therefore, this approach is tailored to settings with small |S|, typically fairness problems where S represents gender or race. This is why we specifically focus on fairness applications in Sections 6 to 8. Another inconvenience comes from the fact that one must specify a family of couplings to implement a transport-based counterfactual model.
There is no quantitative rule for this choice; it is guided by intuition and feasibility reasons, and we explained above why optimal transport is a relevant option. Note that the causal approach has a similar flaw: as previously explained, structural counterfactual models are subject to misspecification since the underlying causal model itself is uncertain. The advantage of transport methods compared to causal modeling is that they circumvent possibly wrong assumptions on the data-generative process. In particular, transport plans consistently adjust to the data (thanks to the marginal constraint) regardless of the chosen family of couplings, whereas misspecification of the SCM may lead to out-of-distribution structural counterfactuals, as aforementioned. In the following, we derive theoretical properties of the counterfactual models introduced in this section, grounding the similarity between optimal transport and Pearl's computation of counterfactuals we evidenced in Figure 2. Interestingly, this echoes the work of Black et al. (2020), who also empirically observed that optimal transport maps generate nearly identical counterfactuals to the ones based on causal models.

5. Theoretical Results

Until now, we have recalled the basics of causality and transport in Sections 2 and 3, and introduced counterfactual models, either causal-based or transport-based, in Section 4. In what follows, we demonstrate connections between both approaches. Concretely, we first explore in Section 5.1 the relationship between an SCM and the counterfactual model it induces, providing justifications for what we observed in Section 4.4.1. More precisely, we study the implications of typical causal assumptions on the generated counterfactuals. Then, on the basis of these assumptions and the mass-transportation formalism proposed in Section 4, we demonstrate in Section 5.2 that optimal transport recovers structural counterfactuals in specific cases.
5.1 Causal Assumptions and Their Consequences

We analyze in detail two standard scenarios of the causal counterfactual framework: first, when the counterfactuals are deterministic, the computation can be written as an explicit push-forward operation; second, when S can be considered exogenous, the counterfactual distribution is observable. Note that none of Section 5.1 involves any specific knowledge of optimal transport theory, only of causal modeling and (general) mass transportation.

5.1.1 The Deterministic Case

We show that when the SCM deterministically implies the counterfactual values of X, the counterfactual coupling is deterministic. Additionally, we provide the expression of the corresponding push-forward operator. To reformulate structural counterfactuals in deterministic transport terms, we first highlight the functional relation between an instance and its intervened counterparts.

Lemma 7 If M satisfies (A), then there exists a measurable function F such that X = F(S, U_X) P-a.s. and X_{S=s} = F(s, U_X) P-a.s. for every s ∈ S.

The proof leverages the acyclicity of the structural equations, which implies that the system of equations defining X and S is triangular, enabling X to be expressed solely in terms of U_X and S. Now, for every s ∈ S, let us set the function f_s : u ↦ F(s, u), defined L(U_X)-almost everywhere. Using this notation, we can give a simple expression of the possible counterfactual counterparts of any factual instance. In what follows, cl(B) denotes the closure of any B ⊆ R^d.

Proposition 8 Let M satisfy (A). For any s, s′ ∈ S and µ_s-almost every x ∈ X_s,

supp µ⋆_{s′|s}(·|x) ⊆ cl(f_{s′} ∘ f_s^{-1}({x})).

As a direct consequence of this proposition, all counterfactual quantities on X with respect to S are uniquely determined when the right term of the inclusion becomes a singleton, hence when the following assumption holds.

Assumption (I) The functions in {f_s}_{s∈S} are injective.
While the unique solvability of acyclic models ensures that (X, S) is deterministically determined by U, (I) states that, conversely, U_X is deterministically determined by {X = x, S = s}. This assumption holds in particular for additive-noise models: classical models where the exogenous variables are independent additive terms of the structural equations, such as in Example 1 and Section 4.4.1.

Example 3 An acyclic SCM M = (U, G) is an additive-noise model if for any i ∈ I, Exo(i) is a singleton and G_i(v_{Endo(i)}, u_{Exo(i)}) := φ_i(v_{Endo(i)}) + u_{Exo(i)}, where φ_i : V_{Endo(i)} → V_i. In this case, the random seed U is fully determined by the value of V, meaning that for any v ∈ V the posterior distribution L(U | V = v) narrows down to a single value. As such, whatever the do-intervention on V, the three-step procedure can only generate a deterministic counterfactual quantity.

In our setting, which addresses interventions on a single endogenous variable S, satisfying (I) does not require a fully invertible model between V = (X, S) and U, but only between X and U_X for every fixed value of S. As an illustration, consider an additive-noise model over X only, meaning that the equation for S does not matter. Formally, the structural equations on the variables in X can be summarized as X = ϕ(S, X) + U_X P-a.s., where ϕ : S × X → X is a deterministic measurable function. Assumption (A) entails through unique solvability that X = (I − ϕ(S, ·))^{-1}(U_X) P-a.s. Thereby f_s(u) := (I − ϕ(s, ·))^{-1}(u), and Assumption (I) readily holds, with f_s^{-1}(x) = x − ϕ(s, x).

Remark that Assumption (I) imposes constraints on the variables to enable a one-to-one correspondence between the realizations of L(X) and L(U_X). In particular, these two probability laws must be supported on spaces with the same cardinality, preventing for instance a continuous U_X with a discrete X.
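To make Assumption (I) concrete, here is a minimal one-dimensional sketch: a toy additive-noise model whose structural term depends on S only (the function phi and all coefficients are illustrative, not from the paper). Recovering U_X from a factual pair (x, s) and pushing it through the alternative equation yields the deterministic structural counterfactual.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy additive-noise equation X = phi(S) + U_X (illustrative choice of phi).
# Here f_s(u) = phi(s) + u is injective for every s, so Assumption (I) holds.
def phi(s):
    return 2.0 * s

def f(s, u):          # structural assignment for X given S = s
    return phi(s) + u

def f_inv(s, x):      # abduction step: recover U_X from {X = x, S = s}
    return x - phi(s)

u = rng.normal(size=1000)
s, s_alt = 0.0, 1.0
x = f(s, u)           # factual observations in group s
u_hat = f_inv(s, x)   # recovered exogenous noise
x_cf = f(s_alt, u_hat)  # deterministic counterfactuals of x under S = s_alt
```

With this choice of phi, every counterfactual is simply the factual value shifted by phi(1) − phi(0) = 2, which is the translation behavior discussed for additive models.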
Note also that, even though it is restrictive, the mainstream literature on causality frequently assumes full invertibility. In particular, most causal-discovery frameworks, which aim at inferring structural causal models from observational data, require invertible models (Zhang and Chan, 2006; Hoyer et al., 2008) or even linear-additive ones (Shimizu et al., 2006). Analogously, the recent research on causal algorithmic recourse generally addresses invertible models in both theory and practice (Karimi et al., 2021; Dominguez-Olmedo et al., 2022; von Kügelgen et al., 2022). In Section 5.2, we will use the invertibility assumption as an ideal setting to derive theoretical guarantees. Let us finally turn to the structural counterfactual models. Assumption (I) implies that all the couplings between the factual and counterfactual distributions are deterministic, as written in the next proposition.

Proposition 9 Let M satisfy (A), suppose that (I) holds, and for any s, s′ ∈ S set the mapping T⋆_{s′|s} := f_{s′} ∘ f_s^{-1}|_{X_s}, defined µ_s-almost everywhere, where f_s^{-1}|_{X_s} denotes the restriction of f_s^{-1} to X_s. The following properties hold:

(i) µ⋆_{s′|s}(·|x) = δ_{T⋆_{s′|s}(x)} for µ_s-almost every x ∈ X_s;

(ii) µ⋆_{s′|s} = T⋆_{s′|s} # µ_s;

(iii) π⋆_{s′|s} = (I, T⋆_{s′|s}) # µ_s.

We say that T⋆_{s′|s} is a structural counterfactual operator, and associate T⋆ := {T⋆_{s′|s}}_{s,s′∈S} to the deterministic structural counterfactual model Π⋆ = {(I, T⋆_{s′|s}) # µ_s}_{s,s′∈S}. Similarly to the structural counterfactual couplings, the operators in T⋆ describe the effect of causal interventions on factual distributions. We highlight that they are well-defined without any knowledge of L(U), meaning that the exogenous variables are not necessary to compute counterfactual quantities under (I).
Lastly, remark that we framed (I) so that it implies that all the counterfactual instances, for any change on S, are deterministic, leading to a fully-deterministic counterfactual model.2 However, according to Proposition 8, it suffices that one f_s be injective for some s ∈ S to render all the counterfactual couplings {π⋆_{s′|s}}_{s′∈S} deterministic. Therefore, when (I) does not hold, the structural counterfactual model possibly contains both random and deterministic couplings.

5.1.2 The Exogenous Case

We now discuss the counterfactual implications of the position of S in the causal graph. More specifically, we focus on the case where S can be considered as a root node. We will see that this entails that the structural counterfactual model is a transport-based counterfactual model.

2. In logic terms, this means that the model verifies the conditional excluded middle (Stalnaker, 1980).

Figure 4: DAG of a structural causal model satisfying (RE). The nodes U_X and X possibly represent several variables.

Let ⊥⊥ denote independence between random variables. The variable S is said to be exogenous relative to X (Galles and Pearl, 1998) if the following holds:

Assumption (RE) U_S ⊥⊥ U_X and X ∩ Endo(S) = ∅.

The first item, U_S ⊥⊥ U_X, ensures that there is no hidden confounder between X and S. The literature on causal modeling generally supposes causal sufficiency (Shimizu et al., 2006; Karimi et al., 2021; Dominguez-Olmedo et al., 2022), which is a stronger condition: by demanding Exo(i) ∩ Exo(i′) = ∅ for every i ≠ i′, causal sufficiency implies that the (U_{Exo(i)})_{i∈I} are mutually independent (as a consequence of the (U_j)_{j∈J} being mutually independent). The second item, X ∩ Endo(S) = ∅, means that S is ancestrally closed: no variable in X is a direct cause (or parent) of S (see Figure 4).
This assumption intuitively holds in fairness problems, such as in Section 4.4.1, where the variable S to alter generally encodes someone's gender, race or age, which presumably do not have any observable causes. As pointed out by Fawkes et al. (2021), ancestral closure is a common hypothesis in causal-fairness research, and even a requirement for many frameworks (Kusner et al., 2017; Russell et al., 2017; Nabi and Shpitser, 2018; Chiappa, 2019; Kilbertus et al., 2020; Plecko and Meinshausen, 2020). Note that (RE) cannot be tested, and thereby rests on plausibility. Interestingly, relative exogeneity has critical implications for the generated counterfactuals. Assumption (RE) readily entails that S ⊥⊥ U_X. Then, it is easy to see that at the distributional level, intervening on S amounts to conditioning X on a value of S.

Proposition 10 Let M satisfy (A). If (RE) holds, then for every s, s′ ∈ S we have µ_{S=s′} = µ_{s′} = µ⋆_{s′|s}.

Recall that the structural counterfactual coupling π⋆_{s′|s} represents an intervention transforming an observable distribution µ_s into an a priori non-observable counterfactual distribution µ⋆_{s′|s}. According to Proposition 10, (RE) renders the causal model otiose for the purpose of generating the counterfactual distribution, as the latter coincides with the observable factual distribution µ_{s′}. This is notably what occurred in the example from Section 4.4.1. However, we underline that the coupling is still required to determine how each specific profile is matched. As such, the causal model still carries major information on the induced counterfactual quantities. Besides, as remarked by Plecko and Meinshausen (2020) and Fawkes et al. (2021), a practical consequence of (RE) is that it enables linking observational and causal notions of fairness. In Section 6, we will prove a similar result through the prism of counterfactual models.
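Proposition 10 can be checked by simulation on a toy SCM satisfying (RE); the model and sample size below are illustrative choices, not from the paper. Under relative exogeneity, the interventional distribution µ_{S=1} matches the observable group distribution µ_1.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

# Toy SCM satisfying (RE), for illustration only: S = U_S is a root node,
# X = S + U_X, and U_S is independent of U_X (no hidden confounder).
u_s = rng.integers(0, 2, size=n).astype(float)
u_x = rng.normal(size=n)
x = u_s + u_x

# Interventional distribution mu_{S=1}: force S = 1 while keeping the same U_X.
x_do = 1.0 + u_x
# Factual group distribution mu_1: observed X restricted to the group S = 1.
x_obs = x[u_s == 1.0]

# Under (RE) the two distributions coincide; compare their first two moments.
gap_mean = abs(x_do.mean() - x_obs.mean())
gap_std = abs(x_do.std() - x_obs.std())
```

Both gaps shrink toward zero as n grows; without (RE) (e.g., with a confounder shared by U_S and U_X), the conditional and interventional distributions would differ.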
The demonstration relies on the proposition below, which ensures that structural counterfactual models are transport-based counterfactual models when S is relatively exogenous to X.

Proposition 11 Let M satisfy (A). If (RE) holds, then for any s, s′ ∈ S,

(i) π⋆_{s′|s} ∈ Π(µ_s, µ_{s′});

(ii) (π⋆_{s′|s})^t = π⋆_{s|s′}.

Suppose additionally that (I) holds. Then, for any s, s′ ∈ S,

(iii) T⋆_{s′|s} # µ_s = µ_{s′};

(iv) the operator T⋆_{s′|s} is invertible µ_s-almost everywhere, with T⋆_{s′|s}^{-1} = T⋆_{s|s′} µ_{s′}-almost everywhere.

Notably, this means that in classical fairness settings, transport-based models can be seen as approximations, or relaxations, of structural models. Another meaningful consequence of Proposition 11 is that items (ii) and (iv) may be false when (RE) does not hold. Said differently, in general contexts, there is no reciprocity between a factual instance and its structural counterfactual counterparts. We emphasize that even under the assumptions of Proposition 11, a transport-based counterfactual model does not necessarily coincide with the structural counterfactual model: Theorem 12 will provide sufficient conditions under which such an equivalence occurs.

5.1.3 The Example of Linear Additive SCMs

We illustrate how our notation and assumptions apply to the case of linear additive structural models, which account for many state-of-the-art models (Bentler and Weeks, 1980; Shimizu et al., 2006; Hyttinen et al., 2012; Rothenhäusler et al., 2021).

Example 4 Under (RE) and (A), a linear additive SCM is characterized by the structural equations X = MX + wS + b + U_X, where w, b ∈ R^d and M ∈ R^{d×d} are deterministic parameters such that I − M is invertible, and U_S ⊥⊥ U_X. Solving the equations, we get X = (I − M)^{-1}(wS + b + U_X) =: F(S, U_X). Besides, note that (I) holds, with f_s^{-1}(x) = (I − M)x − ws − b for any s ∈ S. Then, for any s, s′ ∈ S, T⋆_{s′|s}(x) = x + (I − M)^{-1}w(s′ − s). This general expression is consistent with the example from Section 4.4.1.
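A small numerical sketch of Example 4 (the parameters M, w, b are arbitrary illustrative choices): the abduction–action–prediction procedure reduces to the affine operator T⋆_{s′|s}(x) = x + (I − M)^{-1} w (s′ − s), which we cross-check against the three-step computation.

```python
import numpy as np

d = 2
M = np.array([[0.0, 0.0],
              [0.5, 0.0]])          # strictly triangular, hence acyclic
w = np.array([1.0, -0.5])
b = np.array([0.2, 0.1])
A = np.linalg.inv(np.eye(d) - M)    # (I - M)^{-1}

def T(x, s, s_new):
    # closed-form structural counterfactual operator of Example 4
    return x + A @ w * (s_new - s)

# Cross-check against the three-step procedure: abduct U_X, act on S, predict.
rng = np.random.default_rng(1)
x = rng.normal(size=d)
s, s_new = 0.0, 1.0
u = (np.eye(d) - M) @ x - w * s - b   # abduction: u = f_s^{-1}(x)
x_cf = A @ (w * s_new + b + u)        # prediction: f_{s'}(u)
```

The two computations agree exactly, illustrating that for this class of SCMs the counterfactual is a translation of the factual point, independent of b and of the noise distribution.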
Remarkably, in the specific case of linear additive SCMs fitting (RE), computing counterfactual quantities amounts to applying translations between factual distributions. Therefore, should an oracle reveal that the SCM belongs to this class without providing the structural equations, it would suffice to compute the mean translation between sampled points from µ_s and µ_{s′} to obtain an estimator of the counterfactual operator T⋆_{s′|s}. For more complex SCMs satisfying (RE), it is presumably difficult to infer the counterfactual model from data. We address this issue in the next section. Specifically, we show that optimal transport for the quadratic cost generates the same counterfactuals as a class of causal models including linear additive models.

5.2 When Optimal Transport Meets Causality

We focus on the deterministic transport-based counterfactual model T = {T_{s′|s}}_{s,s′∈S} defined by the solutions of Problem (4) between all pairs of factual distributions. That is, for every s, s′ ∈ S,

T_{s′|s} := argmin_{T : T#µ_s = µ_{s′}} ∫_{X_s} ‖x − T(x)‖² dµ_s(x). (6)

As explained in Section 4, this model provides an elegant interpretation of the obtained counterfactual statements, as they are defined by minimizing the squared Euclidean distance between paired instances, and preserve the quantiles between marginals when d = 1. Moreover, as stated in the following theorem, this transport-based counterfactual model recovers structural counterfactuals in specific cases.

Theorem 12 Let M satisfy (A), (RE) and (I). Suppose that the factual distributions {µ_s}_{s∈S} are absolutely continuous with respect to the Lebesgue measure and have finite second-order moments. For any s, s′ ∈ S, if the structural counterfactual operator T⋆_{s′|s} is the gradient of some convex function, then it is the solution to Problem (6).
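The map in Problem (6), which Theorem 12 relates to structural counterfactuals, can be estimated from samples. A minimal sketch (synthetic data; SciPy's assignment solver is assumed available): with equal-size samples, the quadratic-cost plan is a permutation, and in one dimension it is the monotone, quantile-preserving matching mentioned above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(2)
n = 200
# Two hypothetical factual groups with shifted and rescaled 1-d features.
xs = rng.normal(loc=0.0, scale=1.0, size=n)
xt = rng.normal(loc=2.0, scale=1.5, size=n)

# Quadratic-cost OT between equal-size empirical distributions reduces to
# an assignment problem on the squared-distance cost matrix.
cost = (xs[:, None] - xt[None, :]) ** 2
row, col = linear_sum_assignment(cost)

# In one dimension, the optimal assignment is monotone (quantile-preserving):
# the i-th smallest source point is matched to the i-th smallest target point.
order_s, order_t = np.argsort(xs), np.argsort(xt)
monotone = np.empty(n, dtype=int)
monotone[order_s] = order_t
```

For distinct sample points the monotone matching is the unique minimizer of the squared cost in 1-d, so the assignment solver recovers it exactly; in higher dimension the same assignment computation yields the empirical plan, without the sorting shortcut.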
The mass-transportation formalism of Pearl's counterfactual reasoning introduced in Section 4.2 and developed in Section 5.1 renders the proof of this theorem straightforward. The non-triviality comes precisely from the reformulation of deterministic structural counterfactuals through push-forward operators. We underline that the demonstration does not require any prior knowledge of optimal transport theory beyond what we summarized in Lemma 3. Thus, for the sake of illustration and clarity, we reproduce it directly below.

Proof According to (I) and Proposition 9, the SCM defines a structural counterfactual operator T⋆_{s′|s} between µ_s and µ⋆_{s′|s}. Additionally, (RE) implies through Proposition 10 that µ⋆_{s′|s} = µ_{s′}. Therefore, T⋆_{s′|s} # µ_s = µ_{s′}. Assume now that µ_s is absolutely continuous with respect to the Lebesgue measure, and that both µ_s and µ_{s′} have finite second-order moments. If T⋆_{s′|s} is the gradient of some convex function, then according to Lemma 3 it is the solution to Problem (4) between µ_s and µ_{s′}, that is, the solution to Problem (6).

Understanding the strengths and limitations of Theorem 12 requires understanding how rich the class of SCMs fitting its assumptions is. The larger the class, the more likely optimal transport maps for the squared Euclidean cost will provide (nearly) identical counterfactuals to causality. Finding explicit conditions on f_s and f_{s′} so that f_{s′} ∘ f_s^{-1} is the gradient of a convex potential requires tedious computations as soon as d > 1, which renders the specification of the relevant SCMs difficult. Nevertheless, we can find specific sub-classes of causal models fitting Theorem 12. For instance, as the structural counterfactual operator from Example 4 is the gradient of a convex function, we obtain the following corollary.

Corollary 13 Let M be a linear additive SCM satisfying (RE) (see Example 4).
Suppose that the factual distributions {µ_s}_{s∈S} are absolutely continuous with respect to the Lebesgue measure and have finite second-order moments. Then for any s, s′ ∈ S, the structural counterfactual operator T⋆_{s′|s} is the solution to Problem (4) between µ_s and µ_{s′}.

Therefore, up to a linear approximation of the data-generating process, employing optimal transport maps for counterfactual reasoning in fairness contexts recovers causal changes, as in the example from Section 4.4.1. Besides, the scope of Theorem 12 goes beyond linear additive SCMs, as shown in the following non-linear, non-additive example.

Example 5 Consider the following SCM,

X_1 = α(S) U_1 + β_1(S),
X_2 = α(S) ln²((X_1 − β_1(S))/α(S)) U_2 + β_2(S),

where α, β_1, β_2 are R-valued functions such that α > 0, U_1 > 0, and U_S ⊥⊥ (U_1, U_2). It satisfies (A), (I) and (RE), and for any s, s′ ∈ S the associated structural counterfactual operator is given by

T⋆_{s′|s}(x) = (α(s′)/α(s))(x − β(s)) + β(s′),

where β = (β_1, β_2) is R²-valued. This is the gradient of the convex function x ↦ (α(s′)/(2α(s)))‖x‖² + [β(s′) − (α(s′)/α(s))β(s)]ᵀx. Then, if the factual distributions are absolutely continuous with respect to the Lebesgue measure and have finite second-order moments, T⋆_{s′|s} is the solution to (4) between µ_s and µ_{s′}.

Note that the converse of the implication in Theorem 12 does not hold. This comes from the fact that many functions (even continuous ones) cannot be written as gradients when d > 1, as illustrated in the following example.

Example 6 Consider the following SCM,

X_1 = U_1,
X_2 = S X_1² + U_2,
S = U_S,

where U_S ⊥⊥ (U_1, U_2). It satisfies (A), (I) and (RE), and for any s, s′ ∈ S the associated structural counterfactual operator is given by

T⋆_{s′|s}(x_1, x_2) = (x_1, x_2 + (s′ − s) x_1²).

It cannot be written as the gradient of a function, since its Jacobian is not symmetric. Consequently, it is not a solution to (4) according to Lemma 3.
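The obstruction in Example 6 can be verified numerically (taking s′ − s = 1 for illustration): a C¹ map is a gradient field only if its Jacobian is symmetric everywhere, and here ∂T_2/∂x_1 = 2x_1 while ∂T_1/∂x_2 = 0.

```python
import numpy as np

# Structural counterfactual operator from Example 6, with s' - s = 1.
def T(x):
    x1, x2 = x
    return np.array([x1, x2 + x1 ** 2])

# Central finite-difference Jacobian at a point.
def jacobian(f, x, eps=1e-6):
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

# At x = (1, 0.5): dT2/dx1 = 2*x1 = 2, but dT1/dx2 = 0, so J is asymmetric.
J = jacobian(T, np.array([1.0, 0.5]))
asymmetry = abs(J[0, 1] - J[1, 0])
```

A symmetric Jacobian is exactly the integrability condition for T to be a gradient; the nonzero asymmetry (here 2 at x_1 = 1) certifies that no potential exists.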
Through Section 5, we aimed notably at justifying the pertinence of optimal transport in counterfactual frameworks, on top of the insights and illustrations given in Section 4. To sum up, the main requisite for transport-based methods, typically optimal transport, to be used as substitutes for causal counterfactual reasoning is Assumption (RE), which ensures that structural counterfactual models are transport-based counterfactual models. As previously explained, this condition is almost systematically verified in fairness problems, making the proposed surrogate approach relevant in various essential tasks. The more specific assumptions from Theorem 12, which include (I), describe an ideal setting meant to derive theoretical guarantees; optimal transport remains an arguably relevant alternative even outside this context. Altogether, Theorem 12 and Corollary 13 support the intuition that computing a Π from optimal transport provides a suitable approximation of the unknown structural Π⋆. In the sequel, we apply this approach by extending causal counterfactual frameworks for fairness to transport-based models.

6. Transport-based Counterfactual Fairness

The strength of the unified mass-transportation viewpoint of counterfactual reasoning we proposed in Section 4 and further studied in Section 5 lies in the fact that all definitions and frameworks implicitly based on a structural counterfactual model have a transport-based analogue, and can therefore be made feasible. In this section, we apply this process to fairness in machine learning. Suppose that the random variable S encodes a so-called sensitive or protected attribute (for example race or gender) which divides the population into different classes in a machine-learning prediction task. We denote by h : X × S → R an arbitrary predictor, defining the random variable of predicted output Ŷ := h(X, S).
In this work, we focus on deterministic predictors (be they classifiers or regression functions), in the sense that h is a measurable function of X and S only, meaning that Ŷ has no sources of randomness other than X and S. Fairness addresses the question of the dependence of Ŷ on the protected attribute S. The most classical fairness criterion is the so-called demographic or statistical parity, which is achieved when Ŷ ⊥⊥ S. However, this criterion is notoriously limited, as it only gives a notion of group fairness, and does not control discrimination at a subgroup or an individual level: a conflict illustrated by Dwork et al. (2012). The counterfactual framework, by capturing the structural or statistical links between the features and the protected attribute, allows for sharper notions of fairness. After clarifying the role of the accepted counterfactual fairness condition (Kusner et al., 2017), we reformulate it using the mass-transportation formalism introduced in Section 4. On the basis of this reformulation, we then propose new fairness criteria derived from transport-based counterfactual models.

6.1 Causal Counterfactual Fairness From a Classical Viewpoint

Counterfactually fair predictors satisfy the definition below, proposed by Kusner et al. (2017).

Definition 14 Let M satisfy (A). A predictor Ŷ = h(X, S) is counterfactually fair if for every s, s′ ∈ S and µ_s-almost every x ∈ X_s,

L(Ŷ_{S=s} | X = x, S = s) = L(Ŷ_{S=s′} | X = x, S = s),

where Ŷ_{S=s′} := h(X_{S=s′}, s′).

The interpretation of counterfactual fairness can be confusing, as explained in (Plečko and Bareinboim, 2024, Section 4.4.1) and (Silva, 2024). Before addressing the mass-transportation viewpoint, let us thoroughly analyze the meaning of counterfactual fairness by clarifying the level of discrimination it controls and the type of counterfactuals it involves. We will try to be particularly careful with the vocabulary.
On the one hand, counterfactual fairness as expressed by Definition 14 involves a profile-level context. A direct rewriting of the equality between laws in the definition leads to:

δ_{h(x,s)} = h(·, s′) # µ⋆_{s′|s}(·|x).

This shows that counterfactual fairness requires equal outputs between every factual profile and their structural counterfactual counterparts (Proposition 16 will make this even clearer). On the other hand, many people refer to counterfactual fairness as a within-unit notion of fairness, suggesting that it ensures that every single unit is treated equally in all alternative worlds in which only the protected status has changed. This could puzzle some readers, as the direct formalization of such a condition would rather be L(Ŷ | U = u) = L(Ŷ_{S=s′} | U = u) for every unit u ∈ U and altered protected attribute s′ ∈ S, to explicitly include a unit-level context. All in all, it seems that two different interpretations coexist: an across-profile condition with input {X = x, S = s}; a within-unit condition with input {U = u}. Crucially, as explained in (Kusner et al., 2017, Appendix S1) and further justified by Silva (2024), these two interpretations formally coincide; using a profile-level context in the original definition was notably meant to remind of the three-step procedure. For the sake of completeness, we provide our own version of this equivalence in the next proposition.

Proposition 15 Let M satisfy (A). A predictor Ŷ = h(X, S) is counterfactually fair if and only if for every s′ ∈ S and L(U)-almost every u ∈ U,

L(Ŷ | U = u) = L(Ŷ_{S=s′} | U = u),

where Ŷ_{S=s′} := h(X_{S=s′}, s′). All the probability distributions written above are degenerate.

Remark that this result holds even when {U = u} is not uniquely determined by {X = x, S = s}. It is due to the fact that Ŷ only depends on U through (X, S).3 Moreover, counterfactual fairness is frequently referred to as an individual fairness criterion, which can also be ambiguous.
In the causality literature, the terms individual and unit are used interchangeably to refer to an element u ∈ U (Saha and Garain, 2022; Plečko and Bareinboim, 2024). According to (Kusner et al., 2017, Appendix S1), counterfactual fairness is an individual notion of fairness in the sense that it controls the outputs of a same unit across different realities. This signifies that counterfactual fairness enjoys two individual properties: (1) it involves units; (2) it constrains predictions within a single unit rather than across distinct units. However, there exist fairness definitions tagged individual without involving units or causal models. Notably, individual fairness as popularized by Dwork et al. (2012) simply demands similar outputs for similar inputs (with respect to some similarity metrics that are not necessarily causal-based). It does not qualify as an individual fairness notion in the way counterfactual fairness does; this criterion is individual in the sense that it simply prevents disparate treatment against specific data points (or profiles), as opposed to group (or distributional) criteria. The most notorious group fairness condition is statistical parity, that is h(X, S) ⊥⊥ S, which can be written as h(·, s) # µ_s = h(·, s′) # µ_{s′} for every s, s′ ∈ S, highlighting that it prevents discrimination across groups of profiles but not across specific profiles.4 Interestingly, counterfactual fairness is an individual fairness criterion in both senses: it constrains predictions within a unit and across input profiles. In what follows, we avoid the loaded word individual and prefer the dissimilar terms unit and profile.

3. In practice, when the counterfactual counterparts are not deterministic, the probabilities in Definition 14 are approximated via a Monte-Carlo method after learning L(U | X = x, S = s). See for instance (Russell et al., 2017) or (Chiappa and Isaac, 2019).
We also specify whether we study an effect within a single instance or across different instances.

6.2 Causal Counterfactual Fairness From a Mass-transportation Viewpoint

We now move on to giving a mass-transportation interpretation of Definition 14.

Proposition 16 Let M satisfy (A). The following properties hold.

(i) A predictor h(X, S) is counterfactually fair if and only if for every s, s′ ∈ S and π⋆_{s′|s}-almost every (x, x′), h(x, s) = h(x′, s′).

(ii) If (RE) holds, then a predictor h(X, S) is counterfactually fair if and only if for every s, s′ ∈ S such that s < s′ and π⋆_{s′|s}-almost every (x, x′), h(x, s) = h(x′, s′).

(iii) If (I) holds, then a predictor h(X, S) is counterfactually fair if and only if for every s, s′ ∈ S and µ_s-almost every x, h(x, s) = h(T⋆_{s′|s}(x), s′).

(iv) If (I) and (RE) hold, then a predictor h(X, S) is counterfactually fair if and only if for every s, s′ ∈ S such that s < s′ and µ_s-almost every x, h(x, s) = h(T⋆_{s′|s}(x), s′).

Let us detail this result. Items (ii) to (iv) are variations of the first item under the implications of (RE) and (I), through Propositions 11 and 9 respectively. Note that they have practical interest. Assumption (I) highlights the deterministic relationship between factual and counterfactual quantities, and makes the knowledge of L(U) unnecessary to test counterfactual fairness. Assumption (RE) entails by symmetry that only half of the couplings are necessary to check the condition. Proposition 16 clearly shows that counterfactual fairness requires equal outputs across all counterfactually-paired profiles.

4. The survey paper of Castelnovo et al. (2022) clarifies the two main distinctions among fairness criteria: (1) observational vs. causal; (2) group vs. individual.
We point out that because counterfactual fairness compares inputs across protected groups rather than within a same neighborhood of profiles, it is not incompatible with group fairness, in contrast to the individual-fairness criterion studied in (Dwork et al., 2012; Xu and Strohmer, 2024). It is in fact the opposite: if (RE) holds, then counterfactual fairness is a stronger criterion than statistical parity across groups, as stated in the following proposition.

Proposition 17 Let M satisfy (A) and suppose that (RE) holds. If the predictor h(X, S) satisfies counterfactual fairness, then it satisfies statistical parity. The converse does not hold in general.

In the original paper on counterfactual fairness, the authors had already demonstrated the connection between their definition and statistical parity. Note that the assumptions in Proposition 17 correspond to a specific case of those in (Kusner et al., 2017, Lemma 2), showing that the two results are consistent. Interestingly, our proof of Proposition 17 relies on the mass-transportation viewpoint, which highlights the following interpretation: counterfactual fairness is stronger than statistical parity because it constrains the predictor along some couplings, while statistical parity constrains the predictor only along the couplings' marginals. Remark that comparing statistical parity, formulated as h(·, s) # µ_s = h(·, s′) # µ_{s′} for every s, s′ ∈ S, to counterfactual fairness as expressed in Proposition 16 emphasizes the marginal vs. coupling (which translates into group vs. profile) relation between the two fairness definitions.

6.3 Extending Counterfactual Fairness

One can think of being counterfactually fair as being invariant to counterfactual operations with respect to the protected attribute. In order to define SCM-free criteria, we generalize this idea to the models introduced in Section 4.

Definition 18 1. Let Π = {π_{s′|s}}_{s,s′∈S} be a (random) transport-based counterfactual model.
A predictor h(X, S) is Π-counterfactually fair if for every s, s′ ∈ S and π_{s′|s}-almost every (x, x′), h(x, s) = h(x′, s′).

2. Let T = {T_{s′|s}}_{s,s′∈S} be a deterministic transport-based counterfactual model. A predictor h(X, S) is T-counterfactually fair if for every s, s′ ∈ S and µ_s-almost every x, h(x, s) = h(T_{s′|s}(x), s′).

Note that it follows from the symmetry of transport-based counterfactual models (see items (ii) and (iii) in Definition 6) that only half of the couplings are truly required in the above conditions. Besides, because the proof of Proposition 17 only relies on the assumption that the couplings have the factual distributions as marginals, the following proposition automatically holds.

Proposition 19 Let Π be a transport-based counterfactual model (deterministic or not). If a predictor h(X, S) satisfies Π-counterfactual fairness, then it satisfies statistical parity. The converse does not hold in general.

This result has interesting consequences. Consider that, for the purpose of computing counterfactual quantities, some practitioners designed a candidate SCM fitting the data and satisfying (RE). Even if the SCM is misspecified, it would still characterize a transport-based counterfactual model controlling statistical parity. The fair data-processing transformation proposed by Plecko and Meinshausen (2020) is an illustrative example. More generally, the conceptual interest of transport-based fairness criteria is similar to that of the original counterfactual fairness criterion: they offer across-profile notions of fairness while still controlling for discrimination against protected groups. The added value is their feasibility. In contrast to Definition 14 and Proposition 16, Definition 18 relies on computationally feasible counterfactual models that obviate any assumptions about the data-generating process.
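As a toy illustration of Definition 18 (the translation coupling and both predictors are hypothetical choices, in the spirit of the translation operators arising for linear additive SCMs): for a linear predictor and a translation coupling, T-counterfactual fairness amounts to the weight vector being orthogonal to the translation.

```python
import numpy as np

rng = np.random.default_rng(3)
delta = np.array([2.0, 1.0])   # hypothetical deterministic coupling: x' = x + delta

def T(x):                      # T_{s'|s}: transports group-s profiles to group s'
    return x + delta

w0 = np.array([1.0, 2.0])
v = delta / np.linalg.norm(delta)
w_fair = w0 - (w0 @ v) * v     # project the weights orthogonally to delta

def h_fair(x, s):              # invariant along delta => T-counterfactually fair
    return x @ w_fair

def h_unfair(x, s):
    return x @ w0 + 0.5 * s

# Empirically check h(x, s) = h(T(x), s') over sampled profiles.
xs = rng.normal(size=(100, 2))
gap_fair = max(abs(h_fair(x, 0) - h_fair(T(x), 1)) for x in xs)
gap_unfair = max(abs(h_unfair(x, 0) - h_unfair(T(x), 1)) for x in xs)
```

The projected predictor produces identical outputs on every counterfactual pair, while the unprojected one shifts by a constant delta·w0 + 0.5 = 4.5 on each pair, so it violates the criterion uniformly.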
In addition, as Definition 14 amounts to Π⋆-counterfactual fairness (when (RE) holds), one can as well think of Definition 18 as an approximation of counterfactual fairness. Crucially, these new criteria can naturally be applied in classical explainability and fairness machine-learning frameworks based on counterfactual reasoning. While Black et al. (2020) focused on explaining discriminatory biases in binary decision rules, we address the training of a Π-counterfactually fair predictor in Section 7.

6.4 Ethical Risk

We conclude this section by discussing a potentially negative impact of our work. As mentioned above, the transport-based approach allows for many counterfactual models, but they do not all define legitimate notions of counterparts. Consequently, transport-based notions of counterfactual fairness could be used for unethical fair-washing. The next proposition formalizes this risk.

Proposition 20 If h(X, S) is a classifier satisfying statistical parity, then there exists a transport-based counterfactual model Π with a closed form such that h(X, S) satisfies Π-counterfactual fairness.

Practitioners could take advantage of the weak notion of statistical parity to construct counterfactual models such that their trained classifiers are counterfactually fair, while still discriminating at the subgroup or individual level. This is why we argue that practitioners must always be able to justify the counterfactual models when these are not imposed by legal experts of the prediction task.

7. Application to Counterfactually Fair Learning

We now address an application of transport-based counterfactual models to fairness. More precisely, we introduce a supervised learning procedure trading off between Π-counterfactual fairness and accuracy, and provide statistical guarantees.
7.1 Learning Problem

In (Russell et al., 2017), the authors considered a learning problem involving a penalization controlling the pair-wise difference in decision between the training inputs and their structural counterfactual counterparts. While they gave empirical evidence of the efficiency of their training method, they had to assume a known causal model and did not provide consistency guarantees on the estimated predictor. In this sub-section, we illustrate that this counterfactual approach can naturally be made both feasible and statistically consistent by replacing the structural counterparts with transport-based counterparts. Note that in contrast to Russell et al. (2017), we do not optimize over several counterfactual models.

Let Y : Ω → Y ⊂ R denote the so-called ground-truth variable to predict, and denote by D the law of the data (X, S, Y). We consider a parametric class of predictors {h_θ}_{θ∈Θ} from X × S to Y, indexed by Θ ⊂ R^p where p > 1. For a given counterfactual model Π := {π_{s'|s}}_{s,s'∈S} and a given weight λ > 0, we define the following expected risk on the predictors,

    R_{D,Π,λ}(θ) := E[ℓ(h_θ(X, S), Y)] + λ Σ_{s∈S} P(S = s) Σ_{s'≠s} E[r_θ(X_s, s, X_{s'}, s')],   (7)

where L((X_s, X_{s'})) = π_{s'|s} for every s, s' ∈ S. The function ℓ denotes a data-loss function, continuous with respect to each of its input variables, while r_θ(x, s, x', s') is a penalty promoting counterfactual fairness by enforcing the difference between the outputs of the algorithm for an input and its counterfactual, namely |h_θ(x, s) − h_θ(x', s')|, to be small. For instance, in (Russell et al., 2017), the authors considered the tightest convex relaxation of ϵ-approximate counterfactual fairness, that is

    r_θ(x, s, x', s') := max(0, |h_θ(x, s) − h_θ(x', s')| − ϵ)

for some ϵ > 0. In this paper, we rather work with the penalty

    r_θ(x, s, x', s') := |h_θ(x, s) − h_θ(x', s')|²,

which is smoother. Through λ, the risk R_{D,Π,λ} quantifies a trade-off between accuracy and counterfactual fairness.
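The two penalties can be contrasted on toy predictions. This is a minimal sketch; the helper names are ours, not from the paper's code:

```python
import numpy as np

def hinge_penalty(h_fact, h_cf, eps):
    # max(0, |h(x, s) - h(x', s')| - eps): the convex relaxation of
    # eps-approximate counterfactual fairness from Russell et al. (2017)
    return np.maximum(0.0, np.abs(h_fact - h_cf) - eps)

def squared_penalty(h_fact, h_cf):
    # |h(x, s) - h(x', s')|^2: the smoother penalty used in this paper
    return (h_fact - h_cf) ** 2

h_fact = np.array([0.9, 0.2, 0.5])  # h_theta(x_i, s) on factual inputs
h_cf = np.array([0.1, 0.2, 0.4])    # h_theta(x_j', s') on counterparts
hp = hinge_penalty(h_fact, h_cf, eps=0.05)
sp = squared_penalty(h_fact, h_cf)
print(hp)  # [0.75 0.   0.05]
print(sp)  # [0.64 0.   0.01]
```

The squared penalty is differentiable everywhere, which suits gradient-descent-based training, while the hinge penalty is flat inside the ϵ-tolerance band.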
Importantly, when Π = Π*, it corresponds precisely to the expected risk of the learning problem proposed by Russell et al. (2017) reframed using the mass-transportation viewpoint. In what follows, we simply write R_{D,Π,λ} as R.

In practice, we learn a predictor by minimizing an empirical version of R. To this end, we need an empirical counterfactual model. Concretely, consider a training set {(x_i, s_i, y_i)}_{i=1}^n composed of n i.i.d. observations drawn from D. We divide this collection into |S| protected categories by defining for any s ∈ S the index set I^n_s := {1 ≤ i ≤ n | s_i = s} of cardinality n_s := |I^n_s|. Then, the empirical versions of the factual distributions are, for every s ∈ S, μ^n_s := n_s^{-1} Σ_{i∈I^n_s} δ_{x_i}. In our case, the counterfactual pairs between μ^n_s and μ^n_{s'} are estimated within the training data set through an empirical transport plan {π^n_{s'|s}(i, j)}_{i∈I^n_s, j∈I^n_{s'}}, typically by solving Problem (5) as explained in Section 3.2.3. Then, we define the following empirical risk,

    R_n(θ) := (1/n) Σ_{i=1}^n [ ℓ(h_θ(x_i, s_i), y_i) + λ n_{s_i} Σ_{s'≠s_i} Σ_{j∈I^n_{s'}} π^n_{s'|s_i}(i, j) r_θ(x_i, s_i, x_j, s') ].   (8)

The learning procedure amounts to carrying out a gradient-descent-based routine to minimize R_n. We underline that this procedure, like the original one from (Russell et al., 2017), is tailored to both regression and multi-class classification. It also works for more than two protected groups, but requires the domain of the sensitive variable to be finite.

7.2 Consistency

In this part, we focus on the counterfactual model constructed with quadratic optimal transport, and prove the statistical consistency of the learning procedure. Set a sequence {θ_n}_{n∈N} defined by θ_n ∈ argmin_{θ∈Θ} R_n(θ). The next theorem ensures the convergence to zero of the excess risk R(θ_n) − min_{θ∈Θ} R(θ) for ball-constrained linear predictors.

Theorem 21 Suppose that for every pair of factual distributions, the Kantorovich problem (3) with cost c(x, x') := ‖x − x'‖² admits a unique solution.
Thus, we can define the counterfactual model Π = {π_{s'|s}}_{s,s'∈S} and its empirical counterpart Π_n = {π^n_{s'|s}}_{s,s'∈S} as, for every s, s' ∈ S,

    π_{s'|s} := argmin_{π∈Π(μ_s, μ_{s'})} ∫ ‖x − x'‖² dπ(x, x'),   (9)

    π^n_{s'|s} ∈ argmin_{π∈Σ(n_s, n_{s'})} Σ_{i∈I^n_s} Σ_{j∈I^n_{s'}} ‖x_i − x_j‖² π(i, j).   (10)

Now, let Φ : X × S → R^p be a feature map such that for every s ∈ S and x_1, x_2 ∈ X, ‖Φ(x_1, s) − Φ(x_2, s)‖ ≤ L_s ‖x_1 − x_2‖ where L_s > 0. Consider for Θ ⊂ R^p the class of linear predictors {h_θ}_{θ∈Θ} defined as h_θ(x, s) := θ^T Φ(x, s). If the following assumptions hold,

(i) there exists D > 0 such that Θ = {θ ∈ R^p | ‖θ‖ ≤ D},
(ii) there exists R > 0 such that X ⊆ {x ∈ R^d | ‖x‖ ≤ R},
(iii) there exists b > 0 such that Y ⊆ {y ∈ R | |y| ≤ b},
(iv) there exists L > 0 such that for any (x, s, y) ∈ X × S × Y, the function θ ∈ Θ ↦ ℓ(θ^T Φ(x, s), y) is L-Lipschitz,

then R(θ_n) − min_{θ∈Θ} R(θ) → 0 almost surely as n → +∞.

The proof analyzes the accuracy term and the regularization term separately. The demonstration for the former follows classical results from the statistical-learning literature; the demonstration for the latter is original: we first show that each penalty contribution can be bounded by a distance between the empirical and the true coupling, and then invoke the convergence in law. We gather some additional remarks below.

Remark 22
1. Uniqueness of the solution of (9) notably holds when the factual distributions are Lebesgue absolutely continuous.
2. Typically, Φ is defined as (x, s) ↦ (x, s, 1) in order to add an intercept, or corresponds to the feature map of a kernel when aiming for non-linear decision boundaries.
3. Assumptions (i) to (iv) are common for supervised learning problems. The sets X and Y are usually bounded spaces, as is Θ, the set of parameters defining the algorithm. The Lipschitz conditions for the loss function and the feature map can be directly assumed, or are direct consequences of smoothness properties and compactness assumptions on the spaces on which they are defined.
4.
The assumption on the second-order moments of the factual distributions is automatically satisfied under (ii).
5. If the risks R_n and R are strictly convex, then θ_n = argmin_{θ∈Θ} R_n(θ) and θ* = argmin_{θ∈Θ} R(θ) are well-defined, and it follows that θ_n → θ* almost surely as n → +∞ (this additional step is detailed in the proof of Theorem 21).

8. Numerical Experiments

In this section, we present the implementation of our counterfactually fair learning procedure on real data, and show that it has the expected behaviour. The code is available at https://github.com/lucasdelara/PI-Fair.

8.1 Procedure

Whatever the data set, the general procedure is the following: after dividing the studied data set into a training set and a testing set, we learn one empirical counterfactual model for each set. The first one implements the penalty of the training loss function; the second one enables evaluating the counterfactual fairness of the trained predictors. We compute the corresponding optimal transport plans using the default (non-regularized) POT solver. Then, we train several predictors for various values of the weight λ to study the model's ability to trade off between accuracy and fairness. Finally, we assess the performance of the learnt algorithms according to three criteria: accuracy, group fairness and counterfactual fairness, and we benchmark them against baselines.

8.1.1 Evaluation Metrics

In what follows, h : X × S → Y denotes either a binary classifier (Y = {0, 1}) or a regression function (Y ⊆ R), and D denotes a data set (X, S, Y). Let us properly define the different metrics we employ. To evaluate the data fidelity of a classifier, we compute the accuracy (Acc), defined as

    Acc(h, D) := P(h(X, S) = Y).

For a regression function, we compute the mean square error (MSE), defined as

    MSE(h, D) := E[|h(X, S) − Y|²].
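The empirical transport plans of Problem (10) are computed in the paper with POT's default exact solver. As an illustration of what such a solver does, the discrete Kantorovich problem with uniform weights can be written directly as a linear program; the sketch below (our own helper, not the paper's code) solves it with `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

def empirical_ot_plan(xs, xt):
    """Solve the discrete Kantorovich problem between two empirical
    distributions with uniform weights and quadratic cost; returns the
    optimal transport plan as an (n_s, n_t) matrix of pair weights."""
    n_s, n_t = len(xs), len(xt)
    # Quadratic cost matrix C[i, j] = ||x_i - x_j'||^2
    C = ((xs[:, None, :] - xt[None, :, :]) ** 2).sum(-1)
    # Marginal constraints: rows sum to 1/n_s, columns sum to 1/n_t
    A_eq = []
    for i in range(n_s):
        row = np.zeros((n_s, n_t)); row[i, :] = 1; A_eq.append(row.ravel())
    for j in range(n_t):
        col = np.zeros((n_s, n_t)); col[:, j] = 1; A_eq.append(col.ravel())
    b_eq = np.concatenate([np.full(n_s, 1 / n_s), np.full(n_t, 1 / n_t)])
    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
    return res.x.reshape(n_s, n_t)

rng = np.random.default_rng(0)
plan = empirical_ot_plan(rng.normal(size=(5, 2)), rng.normal(size=(7, 2)) + 2)
print(plan.shape)  # (5, 7)
print(plan.sum())  # ~1.0
```

This scales poorly; in practice one would rely on a dedicated solver such as POT's exact one, as the experiments in this section do.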
To assess the statistical parity of a binary classifier when the protected attribute is binary, we compute the parity gap (PG), defined as

    PG(h, D) := |P(h(X, S) = 1 | S = 0) − P(h(X, S) = 1 | S = 1)|.

It quantifies the violation of group fairness, and equals zero when statistical parity is achieved. For a regression function, we use the Kolmogorov-Smirnov distance (KS) between L(Ŷ | S = 0) and L(Ŷ | S = 1), defined as

    KS(h, D) := sup_{y∈R} |E[1_{h(X,S)>y} | S = 0] − E[1_{h(X,S)>y} | S = 1]|.

Note that this extends the parity gap to the continuous case. The purpose of these two group-fairness indicators is to illustrate Proposition 19, stating that counterfactual fairness implies statistical parity.

Finally, we need a metric to evaluate counterfactual fairness. We extend the notion of (ϵ, δ)-approximate counterfactual fairness introduced by Russell et al. (2017) to transport-based counterfactual models. For a counterfactual model Π and a tolerance ϵ > 0, we define the probability for the disparate treatment by h between (x, s) and its s'-counterfactual counterpart to be lower than ϵ as

    CFT_ϵ(h, x, s, s', Π) := ∫_{x'∈X_{s'}} 1_{|h(x,s) − h(x',s')| ≤ ϵ} dπ_{s'|s}(x'|x),

where π_{s'|s}(·|x) denotes the disintegration of π_{s'|s} with respect to μ_s. Then, for a probability threshold 0 ≤ δ ≤ 1, we say that a predictor h is (ϵ, δ)-approximately counterfactually fair if for every s ∈ S, for μ_s-almost every x ∈ X_s, and for every s' ≠ s,

    CFT_ϵ(h, x, s, s', Π) ≥ 1 − δ.   (11)

We make two remarks: firstly, if h is a classifier, then the only relevant value for ϵ is 0; secondly, if the counterfactual model is deterministic, then the only relevant value for δ is 0. As the empirical counterfactual models we use are non-deterministic, although their continuous counterparts may be deterministic, we set δ = 0.1 whatever the prediction task. In practice, we quantify counterfactual fairness through the (ϵ, δ)-counterfactual fairness rate (CFR),

    CFR_{ϵ,δ}(h, D, Π) := Σ_{s∈S} P(S = s) ∫_{X_s} ∏_{s'≠s} 1_{CFT_ϵ(h,x,s,s',Π) ≥ 1−δ} dμ_s(x).

This corresponds to the proportion of points satisfying (11).
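With an empirical transport plan, CFT and CFR reduce to weighted averages over the rows of the plan. Below is a minimal sketch for one pair of groups (the helper name and the row-normalization convention are ours):

```python
import numpy as np

def empirical_cfr(h_fact, h_cf, plan, eps=0.0, delta=0.1):
    """(eps, delta)-counterfactual fairness rate for one group pair.

    h_fact: (n_s,) predictions h(x_i, s) on the factual group,
    h_cf:   (n_t,) predictions h(x_j', s') on the other group,
    plan:   (n_s, n_t) empirical transport plan (rows sum to 1/n_s).
    """
    # Conditional kernel pi(j | i): normalize each row of the plan
    cond = plan / plan.sum(axis=1, keepdims=True)
    close = np.abs(h_fact[:, None] - h_cf[None, :]) <= eps
    cft = (cond * close).sum(axis=1)   # CFT_eps(h, x_i, s, s', Pi)
    return np.mean(cft >= 1 - delta)   # proportion satisfying (11)

# Toy check: a constant predictor is perfectly counterfactually fair.
plan = np.full((4, 4), 1 / 16)
print(empirical_cfr(np.zeros(4), np.zeros(4), plan))  # 1.0
```

For more than two protected groups, one would take, for each point, the product of the indicator over all s' ≠ s before averaging, as in the definition of CFR above.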
In the classification setting we set ϵ = 0, while in the regression setting we work with ϵ = (1/2) E[|Y − Y'|], where Y' is an independent copy of Y.

Table 1: Data sets

Data set | Adult          | COMPAS         | Law         | Crimes
Task     | Classification | Classification | Regression  | Regression
S : 0/1  | Woman/Man      | Black/White    | Black/White | Black/Non-black
d        | 35             | 6              | 2           | 97
n_train  | 32,724         | 4,120          | 13,109      | 1,335
n_test   | 16,118         | 2,030          | 6,458       | 659

8.1.2 Baselines

We aim at applying our regularized approach for several values of the weight λ to study the model's ability to trade off between accuracy and fairness. For classification tasks, we consider logistic models; for regression tasks, we consider linear regression models. These choices will be useful in particular to benchmark our method against the one of Zafar et al. (2017), tailored to such models. For a given λ, we write Π-Fair(λ) for the corresponding regularized predictor. We compare the obtained results to three baseline algorithms: the best constant predictor Const, which achieves perfect fairness; the group-fair predictor Z developed by Zafar et al. (2017), which is meant to maximize accuracy under an exact-fairness constraint; and the unaltered (λ = 0) predictor U, which is presumably the most accurate but also the most unfair predictor.

8.2 Data Sets

We carry out the experiments on four data sets: the first two for classification and the last two for regression. Note that in all the considered settings, the sensitive variable S is binary and relatively exogenous to X. Table 1 summarizes information about each data set after preprocessing.

8.2.1 Adult

The Adult Data Set from the UCI Machine Learning Repository (Dua and Graff, 2019) has become a gold-standard data set to evaluate and benchmark fairness frameworks. The classification task is to predict whether the income of an individual exceeds 50K USD per year based on census data.
Concretely, the data set contains n = 48,842 instances with 14 attributes (numerical and categorical). The ground-truth variable Y equals 1 whenever the income exceeds 50K, and 0 otherwise. In this work, we set the sensitive variable S to be the gender: S = 0 stands for female, while S = 1 stands for male. The potential sources of algorithmic bias against women have been widely studied by Besse et al. (2021). They mainly amount to an under-representation of women in the data set, as well as a high correlation between being a woman and having a lower income. Any standard algorithm optimizing only for accuracy is bound to be unfair towards women. Before training any models, we process the data using a one-hot encoding of the categorical attributes. The processing is exactly the same as in (Besse et al., 2021). This leads to a data set of dimension d + 1 = 36 (without the outcome). We divide it into a training set of size n_train = 32,724 and a testing set of size n_test = 16,118.

8.2.2 COMPAS

The Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) is an infamous score used by US court officers to assess the risk of criminal recidivism. ProPublica analyzed more than 10,000 cases from Florida, and concluded that black defendants tended to be predicted riskier than they actually were, whereas white defendants were often predicted at lower risk than they actually were.⁵ In this part, we follow Kusner et al. (2017) and try to predict the risk of recidivism while avoiding discrimination based on race, using the same data. Keeping only black and white defendants, we get n = 6,150 instances with d + 1 = 7 attributes, such as the number of prior offenses and the type of crime committed. The ground-truth variable Y equals 1 if the individual recidivated and 0 otherwise. We set the sensitive variable S to be the race: S = 0 stands for black, while S = 1 stands for white.
Finally, we divide the data into a training set of size n_train = 4,120 and a testing set of size n_test = 2,030.

8.2.3 Law School

This is the data set used in Section 4.4.1, gathering statistics from 163 US law schools and more than 20,000 students. Here again we follow Kusner et al. (2017), and try to predict the first-year average grade of individuals Y on the basis of the race (black or white) S, the entrance-exam score X_1, and the grade-point average before law school X_2. All in all, we have d = 2 features excluding the outcome and the protected attribute, and work with n_train = 13,109 training entries and n_test = 6,458 testing entries.

8.2.4 Communities and Crimes

The Communities and Crimes data set can also be found in the UCI Machine Learning Repository (Dua and Graff, 2019). It contains socio-economic, law-enforcement and crime data from communities across the United States. Similarly to Chzhen et al. (2020), we consider the problem of predicting the rate of violent crime per 10^5 population Y, with S = 0 indicating that at least 50% of the population is black and S = 1 otherwise. After processing the 128 numerical and categorical attributes composing the data set, we obtain d + 1 = 98 features over n_train = 1,335 training instances and n_test = 659 testing instances.

8.3 Results

The regularization weight λ successively takes all the values in the grid {10^{-4}, 10^{-3.5}, ..., 10^1}. We repeat the training and evaluation processes of our models together with the baselines across 10 repeats for every data set. As all learning techniques are deterministic, the randomness of the experiments comes uniquely from the division of the data set into training and testing sets. The results are reported in the figures below.
5. https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm

8.3.1 Trade-off Between Accuracy and Fairness

Figures 5 to 8 show the evolution with respect to λ of the accuracy, the counterfactual fairness rate, and the statistical-parity metric. The solid line represents the mean value of the evaluation metric, while the vertical length of the shaded area corresponds to the standard deviation. We observe that our learning algorithm is able to reliably trade off accuracy for counterfactual fairness as λ increases, confirming the relevance of the approach. Additionally, the evaluation metrics remain stable across the different repeats. As anticipated from Proposition 19, the regularization also tends to improve group fairness. Overall, the group-fair learning technique of Zafar et al. (2017) sacrifices less accuracy than our method to reach the same level of statistical parity, but our method performs better at encouraging counterfactual fairness. We conclude that the prevailing technique depends on the specific type of fairness one wants to achieve. Note that the group-fair predictor Z on the Law data set (Figure 7) behaves similarly to the perfectly counterfactually fair predictor. This is likely due to the use of simple linear models on such a low-dimensional data set (d + 1 = 3), limiting the space of feasible algorithms. We leave the in-depth analysis of this phenomenon for further research.

8.3.2 Recovering Causal Effects

To conclude these numerical experiments, let us verify that our optimal-transport counterfactual loss enforces causal counterfactual fairness in the adequate setting. We address the Law data set, for which a plausible causal model is known (see Section 4.4.1) and satisfies the assumptions of Corollary 13.
Figure 9b displays the evolution of the two counterfactual losses, one based on the structural counterfactual model and the other on the optimal-transport counterfactual model, for predictors trained according to the optimal-transport counterfactual model. Figure 9a serves as a sanity check: it plots the normalized variance E[(Ŷ − E[Ŷ])²] / E[(Y − E[Y])²] to control how close a predictor is to being constant. As anticipated by the theory, the training process does promote causal counterfactual fairness: the two curves in Figure 9b are almost identical. Crucially, this is not a consequence of the predictors merely becoming constant, since the predictions in Figure 9a have variations that remain significantly higher than those of the best constant predictor.

8.4 Discussion

To sum up, our learning procedure enables increasing counterfactual fairness while limiting the loss in accuracy, and is both theoretically sound and computationally efficient. This simple approach expands the fair-learning arsenal to stronger fairness criteria than group-fairness conditions, and does so without requiring any additional knowledge of the data-generation process. Regarding limitations, we note that the current procedure is not tailored to mini-batch learning. Using mini-batches would require computing a new empirical counterfactual model for each one, which increases the computational complexity, especially since the batch size should be chosen large enough for the empirical transport plans to make sense.

Figure 5: Evaluation metrics on the Adult data set.
Figure 6: Evaluation metrics on the COMPAS data set.
Figure 7: Evaluation metrics on the Law data set.
Figure 8: Evaluation metrics on the Crimes data set.
Figure 9: Promotion of causal counterfactual fairness on the Law data set. (a) Standard deviation of Ŷ. (b) Counterfactual fairness losses.

This opens new lines of inquiry for leveraging recent advances in computational optimal transport in order to improve counterfactual learning problems. In particular, we could take advantage of entropic regularization schemes to speed up the computation of optimal transport plans (Cuturi, 2013; Peyré and Cuturi, 2019). This would make the produced counterfactual model blurry, but still close to the desired solution, allowing a trade-off between the precision of the counterfactuals and numerical efficiency. Additionally, we could use the growing literature on plug-in estimations of optimal transport maps (Beirlant et al., 2020; Hallin et al., 2021; Manole et al., 2021; Pooladian and Niles-Weed, 2021) to construct empirical counterfactual models not as matrices, but as mappings able to generalize to out-of-sample observations, reusable on new data sets and batches. We leave these directions for future work.

9. Conclusion

In this paper, we focused on the challenge of designing sound and feasible counterfactuals. Our work showed that the causal account of counterfactual modeling can be written in a mass-transportation formalism, where the choice between deterministic and random counterfactuals has a direct formulation in terms of the deterministic or random nature of the couplings between factual and counterfactual instances. This novel perspective enabled us to generalize sharp but unfeasible causal criteria of fairness into actionable transport-based ones. We illustrated that the use of optimal transport is a competitive approach to implement these criteria, as it can recover causal changes and can be computed efficiently. In particular, we proposed a new easy-to-implement method to train accurate classifiers with a counterfactual-fairness regularization.
We provided statistical guarantees, and showed empirically the relevance of our method. With this article, we hope to shed new light on counterfactual reasoning, and to open lines for strengthening the explainability and fair-learning arsenal in artificial intelligence.

Acknowledgments

This work was funded by the AI interdisciplinary institute ANITI, grant agreement ANR-19-PI3A-0004, under the French "Investing for the Future" PIA3 program. The authors thank the Action Editor and the anonymous reviewers for their careful readings and helpful suggestions.

Appendix A. Proofs of Section 5

Proof of Lemma 7 As a direct consequence of Assumption (A), there exists a topological ordering on the nodes of the graph induced by M with respect to the parent-child relation. Therefore, starting with the components X_k for which Endo(k) = ∅ or Endo(k) = {S}, we can recursively replace the terms X_{Endo(k)} in the formulas G_k(X_{Endo(k)}, S_{Endo(k)}, U_{Exo(k)}) by expressions depending only on U_X and S. This yields a measurable function F such that X = F(S, U_X) P-almost surely. The same computation, but changing S to s for some s ∈ S, leads to X_{S=s} = F(s, U_X) P-almost surely.

Proof of Proposition 8 Recall that X = F(S, U_X) P-almost surely. This implies that, P-almost surely, the event {X = x, S = s} coincides with {U_X ∈ f_s^{-1}({x}), S = s}. Besides, X_{S=s'} = f_{s'}(U_X) P-almost surely according to Lemma 7. Then, let B ⊆ X be an arbitrary measurable subset and compute:

P(X_{S=s'} ∈ B | X = x, S = s) = P(f_{s'}(U_X) ∈ B | X = x, S = s)
= P(f_{s'}(U_X) ∈ B, U_X ∈ f_s^{-1}({x}) | X = x, S = s)
= P(f_{s'}(U_X) ∈ B, f_{s'}(U_X) ∈ f_{s'} ∘ f_s^{-1}({x}) | X = x, S = s)
= P(X_{S=s'} ∈ B ∩ f_{s'} ∘ f_s^{-1}({x}) | X = x, S = s).

Therefore, L(X_{S=s'} | X = x, S = s) does not put mass outside f_{s'} ∘ f_s^{-1}({x}). The definition of the support, the set of points x ∈ R^d such that every open neighborhood of x has a positive probability, thus implies that supp(μ*_{s'|s}(·|x)) ⊆ f_{s'} ∘ f_s^{-1}({x}).

Proof of Proposition 9 Set s, s' ∈ S and x ∈ X_s. Note that, according to (I), U_X = f_S^{-1}(X) P-almost surely.
Let us address each item of the proposition separately.

Item (i). Proposition 8 states that supp(μ*_{s'|s}(·|x)) ⊆ f_{s'} ∘ f_s^{-1}({x}) for μ_s-almost every x ∈ X_s. This means, according to (I), that supp(μ*_{s'|s}(·|x)) ⊆ {f_{s'} ∘ f_s^{-1}(x)}. Since the support of a probability distribution cannot be empty, we have equality. This proves the first item.

Item (ii). By definition of the counterfactual distribution, we find that

μ*_{s'|s} = L(X_{S=s'} | S = s) = L(f_{s'}(U_X) | S = s) = L(f_{s'} ∘ f_S^{-1}(X) | S = s) = L(f_{s'} ∘ f_s^{-1}(X) | S = s).

This proves the second item.

Item (iii). Similarly, by definition of the structural counterfactual coupling we obtain

π*_{s'|s} = L((X, X_{S=s'}) | S = s) = L((X, f_{s'}(U_X)) | S = s) = L((X, f_{s'} ∘ f_s^{-1}(X)) | S = s) = L((X_s, f_{s'} ∘ f_s^{-1}(X_s))),

where L(X_s) = μ_s. This completes the proof.

Proof of Proposition 10 Set s ∈ S and recall that X = F(S, U_X) P-almost surely while X_{S=s} = F(s, U_X) P-almost surely. Thanks to Assumption (RE), we have that S ⊥ U_X. Therefore,

L(X | S = s) = L(F(S, U_X) | S = s) = L(F(s, U_X) | S = s) = L(F(s, U_X)).

This means that μ_s = μ_{S=s}. Similarly, for s, s' ∈ S the counterfactual distribution becomes

L(X_{S=s'} | S = s) = L(F(s', U_X) | S = s) = L(F(s', U_X)) = L(F(s', U_X) | S = s') = L(F(S, U_X) | S = s') = L(X | S = s').

This means that μ*_{s'|s} = μ_{s'}, which completes the proof.

Proof of Proposition 11 We address each item separately.

Item (i). It is a direct consequence of π*_{s'|s} ∈ Π(μ_s, μ*_{s'|s}) by definition and μ*_{s'|s} = μ_{s'} from Proposition 10.

Item (ii). Recall that (RE) implies that S ⊥ U_X. Then, by definition we have

π*_{s|s'} = L((X, X_{S=s}) | S = s') = L((f_{s'}(U_X), f_s(U_X)) | S = s') = L((f_{s'}(U_X), f_s(U_X)) | S = s) = L((X_{S=s'}, X) | S = s) = ᵗπ*_{s'|s}.

Item (iii). It is a direct consequence of (T*_{s'|s})_# μ_s = μ*_{s'|s} from Proposition 9 and μ*_{s'|s} = μ_{s'} from Proposition 10.

Item (iv). We know according to Lemma 7 that X_{S=s} = f_s(U_X) and X_{S=s'} = f_{s'}(U_X) P-almost surely.
Furthermore, it follows from (RE) and Proposition 10 that μ_s = L(X_{S=s}) and μ_{s'} = L(X_{S=s'}). Wrapping this up, there exists a measurable set Ω_0 ⊆ Ω with P(Ω_0) = 1 such that for every ω ∈ Ω_0,

X_{S=s}(ω) = f_s(U_X(ω)) ∈ X_s,   X_{S=s'}(ω) = f_{s'}(U_X(ω)) ∈ X_{s'}.

In the rest of the proof we implicitly work on an ω ∈ Ω_0. Assumption (I) ensures that U_X = f_s^{-1}(X_{S=s}), so that X_{S=s'} = f_{s'} ∘ f_s^{-1}(X_{S=s}). Noting that X_{S=s} ∈ X_s, we obtain X_{S=s'} = f_{s'} ∘ f_s^{-1}|_{X_s}(X_{S=s}) = T*_{s'|s}(X_{S=s}). Following the same computation after switching s and s', we additionally get that X_{S=s} = f_s ∘ f_{s'}^{-1}|_{X_{s'}}(X_{S=s'}) = T*_{s|s'}(X_{S=s'}). Therefore, T*_{s'|s} is invertible on X_{S=s}(Ω_0), with (T*_{s'|s})^{-1} = T*_{s|s'} on X_{S=s'}(Ω_0). Since μ_s(X_{S=s}(Ω_0)) = P(Ω_0) = 1 and μ_{s'}(X_{S=s'}(Ω_0)) = P(Ω_0) = 1, this means that T*_{s'|s} is invertible μ_s-almost everywhere, with (T*_{s'|s})^{-1} = T*_{s|s'} μ_{s'}-almost everywhere. This completes the proof.

Proof of Corollary 13 We address the structural equations X = MX + wS + b + U_X, where w, b ∈ R^d and M ∈ R^{d×d} are deterministic parameters. We showed that for any s, s' ∈ S,

T*_{s'|s}(x) = x + (I − M)^{-1} w (s' − s).

Notice that T*_{s'|s} is the gradient of x ↦ (1/2)‖x‖² + ((I − M)^{-1} w (s' − s))^T x, which is a convex function. As (RE) holds, and μ_s together with μ_{s'} are Lebesgue absolutely continuous with finite second-order moments, it follows from Theorem 12 that T*_{s'|s} is the solution to (4) between μ_s and μ_{s'}.

Appendix B. Proofs of Section 6

We prove Proposition 16 before Proposition 15.

Proof of Proposition 16 We address each item separately.

Item (i). We claim that counterfactual fairness is equivalent to

(Goal) For every s, s' ∈ S, there exists a measurable set C = C(s, s') ⊆ X × X satisfying π*_{s'|s}(C) = 1 such that for every (x, x') ∈ C, h(x, s) = h(x', s').
Note that a direct reformulation of the original counterfactual fairness condition is

(CF) For every s, s' ∈ S, there exists a measurable set A = A(s) satisfying μ_s(A) = 1, such that for every x ∈ A and every measurable set M ⊆ R,

P(Ŷ_{S=s} ∈ M | X = x, S = s) = P(Ŷ_{S=s'} ∈ M | X = x, S = s).   (12)

We aim at showing that (CF) is equivalent to (Goal). To do so, we first prove that one can rewrite (CF) into the following intermediary formulation:

(IF) For every s, s' ∈ S, there exists a measurable set A = A(s) satisfying μ_s(A) = 1, such that for every x ∈ A and every measurable M ⊆ R there exists a measurable set B = B(s, s', x, M) satisfying μ*_{s'|s}(B|x) = 1 and such that for every x' ∈ B, 1_{h(x,s)∈M} = 1_{h(x',s')∈M}.

Let us show that (CF) ⇔ (IF). Set s, s' ∈ S, x ∈ A and M ⊆ R. According to the consistency rule, that is X = Σ_{s∈S} 1_{S=s} X_{S=s} P-almost surely, we can rewrite the left term of (12) as

P(Ŷ_{S=s} ∈ M | X = x, S = s) = P(h(X_{S=s}, s) ∈ M | X = x, S = s) = P(h(X, s) ∈ M | X = x, S = s) = P(h(x, s) ∈ M) = 1_{h(x,s)∈M}.

Then, using Definition 4, we reframe the right term of (12) as

P(Ŷ_{S=s'} ∈ M | X = x, S = s) = P(h(X_{S=s'}, s') ∈ M | X = x, S = s) = ∫ 1_{h(x',s')∈M} dμ*_{s'|s}(x'|x).

Remark now that, because the indicator functions take either the value 0 or 1, the condition

1_{h(x,s)∈M} = ∫ 1_{h(x',s')∈M} dμ*_{s'|s}(x'|x)

is equivalent to 1_{h(x,s)∈M} = 1_{h(x',s')∈M} for μ*_{s'|s}(·|x)-almost every x'. This means that there exists a measurable set B = B(s, s', x, M) such that μ*_{s'|s}(B|x) = 1 and for every x' ∈ B, 1_{h(x,s)∈M} = 1_{h(x',s')∈M}. This proves that (CF) is equivalent to (IF).

We now prove that (IF) ⇒ (Goal). As (IF) is true for any arbitrary measurable set M ⊆ R, we can apply this result with M = {h(x, s)} to obtain a measurable set B = B(s, s', x) such that μ*_{s'|s}(B|x) = 1 and for every x' ∈ B, h(x', s') = h(x, s).
To sum up, for every s, s' ∈ S, there exists a measurable set A = A(s) satisfying μ_s(A) = 1 such that for every x ∈ A, there exists a measurable set B = B(s, s', x) satisfying μ*_{s'|s}(B|x) = 1, such that for every x' ∈ B, h(x', s') = h(x, s). Now, we must show that the latter equality holds for π*_{s'|s}-almost every (x, x'). To this end, set C = C(s, s') = {(x, x') ∈ X × X | x ∈ A(s), x' ∈ B(s, s', x)}. Remark that, by definition of A and B, for every (x, x') ∈ C, h(x, s) = h(x', s'). To conclude, let us prove that π*_{s'|s}(C) = 1:

π*_{s'|s}(C) = ∫_A P(X_{S=s'} ∈ B | X = x, S = s) dμ_s(x) = ∫_A μ*_{s'|s}(B|x) dμ_s(x) = 1.

This proves that (IF) implies (Goal).

We finally prove that (Goal) ⇒ (IF). Using (Goal), consider a measurable set C = C(s, s') satisfying π*_{s'|s}(C) = 1 and such that for every (x, x') ∈ C, h(x, s) = h(x', s'). Then, define for any x ∈ X the measurable set B(s, s', x) = {x' ∈ X | (x, x') ∈ C}. We use the disintegration formula of π*_{s'|s} to write

1 = ∫ μ*_{s'|s}(B|x) dμ_s(x).

Since 0 ≤ μ*_{s'|s}(B|x) ≤ 1, this implies that for μ_s-almost every x, μ*_{s'|s}(B|x) = 1. Said differently, there exists a measurable set A = A(s) satisfying μ_s(A) = 1 such that for every x ∈ A, the measurable set B(s, s', x) satisfies μ*_{s'|s}(B|x) = 1. By construction of B and by definition of C, for every x ∈ A and every x' ∈ B, h(x, s) = h(x', s'). To obtain (IF), it suffices to take any measurable M ⊆ R and to note that the latter equality implies that 1_{h(x,s)∈M} = 1_{h(x',s')∈M}.

Item (ii). Recall that π*_{s|s} = (Id, Id)_# μ_s. Therefore, it follows from the previous item that counterfactual fairness can be written as: for every s, s' ∈ S such that s < s', for π*_{s'|s}-almost every (x, x'), h(x, s) = h(x', s'), and for π*_{s|s'}-almost every (x, x'), h(x, s') = h(x', s). Moreover, (RE) implies through Proposition 11 that π*_{s|s'} = ᵗπ*_{s'|s}. Therefore, the second condition above can be written as: for π*_{s'|s}-almost every (x, x'), h(x', s') = h(x, s), which is exactly the first condition.
This means that only the first condition is necessary, proving this item.

Item (iii). Consider (CF), and recall that for every s, s' ∈ S, μ_s-almost every x, and every measurable M ⊆ R, the left term of (12) is 1_{h(x,s)∈M}. Let us now reframe the right term of (12). If (I) holds, using that U_X = f_S^{-1}(X) P-almost surely we obtain

P(Ŷ_{S=s'} ∈ M | X = x, S = s) = P(h(X_{S=s'}, s') ∈ M | X = x, S = s)
= P(h(F(s', U_X), s') ∈ M | X = x, S = s)
= P(h(f_{s'}(f_S^{-1}(X)), s') ∈ M | X = x, S = s)
= P(h(f_{s'} ∘ f_s^{-1}(x), s') ∈ M)
= P(h(T*_{s'|s}(x), s') ∈ M)
= 1_{h(T*_{s'|s}(x),s')∈M}.

Consequently, (CF) holds if and only if, for every measurable M ⊆ R,

1_{h(x,s)∈M} = 1_{h(T*_{s'|s}(x),s')∈M}.

Using the same reasoning as before, we take M = {h(x, s)} to prove that this condition is equivalent to h(x, s) = h(T*_{s'|s}(x), s'). This concludes the third part of the proof.

Item (iv). From the previous item and Proposition 10, it follows that counterfactual fairness can be written as: for every s, s' ∈ S such that s < s', for μ_s-almost every x, h(x, s) = h(T*_{s'|s}(x), s'), and for μ_{s'}-almost every x, h(x, s') = h(T*_{s|s'}(x), s). Set s, s' ∈ S such that s < s'. To prove the fourth item, we show, as for item (ii), that the two above conditions are equivalent. Set A a measurable subset of X_s such that μ_s(A) = 1 and h(x, s) = h(T*_{s'|s}(x), s') for every x ∈ A. Then, make the change of variable x' = T*_{s'|s}(x), so that h((T*_{s'|s})^{-1}(x'), s) = h(x', s') for every x' ∈ T*_{s'|s}(A). By Propositions 9 and 10, (T*_{s'|s})_# μ_s = μ_{s'}, which implies that μ_{s'}(T*_{s'|s}(A)) = 1. Therefore, the equality h((T*_{s'|s})^{-1}(x'), s) = h(x', s') holds for μ_{s'}-almost every x'. Finally, recall that according to Proposition 11, (T*_{s'|s})^{-1} = T*_{s|s'} μ_{s'}-almost everywhere. As the intersection of two sets of probability one is a set of probability one, h(T*_{s|s'}(x'), s) = h(x', s') holds for μ_{s'}-almost every x'. To prove the converse, we proceed similarly by switching s and s'.
Proof of Proposition 15

First of all, we introduce some notation that will greatly simplify the proof. Recall that by unique solvability of the SCM, there exists a solution map $\Gamma$ such that $(X, S) \stackrel{\mathbb{P}\text{-a.s.}}{=} \Gamma(U)$. Using the same argument in any intervened model $\mathcal{M}_{S=s'}$ for $s' \in \mathcal{S}$ yields a solution map $\Gamma_{s'}$ such that $(X_{S=s'}, s') \stackrel{\mathbb{P}\text{-a.s.}}{=} \Gamma_{s'}(U)$.

We prove that counterfactual fairness as expressed in Definition 14 is equivalent to the condition in Proposition 15. According to Proposition 16, counterfactual fairness is equivalent to: for every $s, s' \in \mathcal{S}$,
$$\mathbb{P}\left(h(X, s) = h(X_{S=s'}, s') \mid S = s\right) = 1,$$
by definition of $\pi_{s'|s}$. Using the solution maps, this can be written as: for every $s, s' \in \mathcal{S}$, $\mathbb{P}\left(h(\Gamma(U)) = h(\Gamma_{s'}(U)) \mid S = s\right) = 1$. Because $\mathbb{P}(S = s) > 0$ for every $s \in \mathcal{S}$, this is equivalent to: for every $s' \in \mathcal{S}$, $\mathbb{P}\left(h(\Gamma(U)) = h(\Gamma_{s'}(U))\right) = 1$. The above equality means that for $\mathcal{L}(U)$-almost every $u \in \mathcal{U}$, $h(\Gamma(u)) = h(\Gamma_{s'}(u))$, which can be written as $\delta_{h(\Gamma(u))} = \delta_{h(\Gamma_{s'}(u))}$. Noticing that $\mathcal{L}(\hat{Y} \mid U = u) = \delta_{h(\Gamma(u))}$ and $\mathcal{L}(\hat{Y}_{S=s'} \mid U = u) = \delta_{h(\Gamma_{s'}(u))}$ concludes the proof.

Proof of Proposition 17

According to Proposition 16, $h$ is counterfactually fair if and only if for any $s, s' \in \mathcal{S}$ and for $\pi_{s'|s}$-almost every $(x, x')$, $h(x, s) = h(x', s')$, or equivalently $\mathbb{1}\{h(x,s) \in M\} = \mathbb{1}\{h(x',s') \in M\}$ for every measurable $M \subseteq \mathbb{R}$. Set $s, s' \in \mathcal{S}$. Recall that from (RE), $\pi_{s'|s}$ admits $\mu_s$ as first marginal and $\mu_{s'}$ as second marginal. Let us integrate this equality with respect to $\pi_{s'|s}$ to obtain, for every measurable $M \subseteq \mathbb{R}$,
$$\int \mathbb{1}\{h(x,s) \in M\}\, d\mu_s(x) = \int \mathbb{1}\{h(x',s') \in M\}\, d\mu_{s'}(x').$$
This can be written as $\mathbb{P}(h(X, s) \in M \mid S = s) = \mathbb{P}(h(X, s') \in M \mid S = s')$, which means that $\mathcal{L}(h(X, S) \mid S = s) = \mathcal{L}(h(X, S) \mid S = s')$. As this holds for any $s, s' \in \mathcal{S}$, we have that $h(X, S) \perp S$.

One can easily convince oneself that the converse is not true. As a counterexample, consider the following causal model,
$$X = S\, U_X + (1 - S)(-U_X), \qquad S = U_S,$$
where $U_S$ follows a Bernoulli distribution, $U_X$ follows a centered Gaussian distribution with unit variance, and $U_S \perp U_X$.
By do-interventions, $X_{S=1} = U_X$ and $X_{S=0} = -U_X$. Observe that (RE) is satisfied, so that
$$\mathcal{L}(X_{S=0}) = \mathcal{L}(X \mid S = 0), \qquad \mathcal{L}(X_{S=1}) = \mathcal{L}(X \mid S = 1).$$
Moreover, since $\mathcal{L}(U_X) = \mathcal{L}(-U_X)$, we have $\mathcal{L}(X \mid S = 0) = \mathcal{L}(X \mid S = 1)$. In particular, whatever the chosen predictor, statistical parity will hold since the factual distributions are the same. By definition of the structural counterfactual operator, we have $T_{1|0}(x) = -x$. Now, set the unaware predictor (namely, one which does not take the protected attribute as an input) $h(X, S) := \operatorname{sign}(X)$. Clearly, $h(T_{1|0}(x)) = h(-x) = -h(x) \neq h(x)$, except on $\{0\}$, which is a null set for $\mathcal{L}(X \mid S = 0)$. Therefore $h(X, S)$ satisfies statistical parity but not counterfactual fairness.

Proof of Proposition 20

Suppose that the classifier $h(X, S)$ takes values in the finite set $\mathcal{Y} \subset \mathbb{R}$, and define for any $s \in \mathcal{S}$ and $y \in \mathcal{Y}$ the sets $H(s, y) := \{x \in \mathbb{R}^d \mid h(x, s) = y\}$. Remark that for every $s \in \mathcal{S}$, $\{H(s, y)\}_{y \in \mathcal{Y}}$ forms a partition of $\mathbb{R}^d$. Statistical parity can be written as: for any $s \in \mathcal{S}$ and any $y \in \mathcal{Y}$, $\mu_s(H(s, y)) = p_y$, where $\{p_y\}_{y \in \mathcal{Y}}$ is a probability on $\mathcal{Y}$ that does not depend on $s$. Now, set $s, s' \in \mathcal{S}$. We aim at constructing a coupling $\pi_{s'|s}$ between $\mu_s$ and $\mu_{s'}$ such that
$$\pi_{s'|s}\left(\left\{(x, x') \in \mathbb{R}^d \times \mathbb{R}^d \mid h(x, s) = h(x', s')\right\}\right) = 1.$$
We define our candidate $\pi_{s'|s}$ as
$$d\pi_{s'|s}(x, x') := \sum_{y \in \mathcal{Y}} \frac{\mathbb{1}\{x \in H(s,y)\}\,\mathbb{1}\{x' \in H(s',y)\}}{p_y}\, d\mu_s(x)\, d\mu_{s'}(x').$$
First, let us show that it admits respectively $\mu_s$ and $\mu_{s'}$ as first and second marginals. Let $A \subseteq \mathbb{R}^d$ be a measurable set:
$$\begin{aligned}
\pi_{s'|s}\left(A \times \mathbb{R}^d\right) &= \sum_{y \in \mathcal{Y}} \frac{1}{p_y}\int_{A \times \mathbb{R}^d} \mathbb{1}\{x \in H(s,y)\}\,\mathbb{1}\{x' \in H(s',y)\}\, d\mu_s(x)\, d\mu_{s'}(x')\\
&= \sum_{y \in \mathcal{Y}} \frac{\mu_{s'}(H(s',y))}{p_y} \int_{x \in A} \mathbb{1}\{x \in H(s,y)\}\, d\mu_s(x)\\
&= \sum_{y \in \mathcal{Y}} \mu_s\left(A \cap H(s, y)\right) = \mu_s(A),
\end{aligned}$$
where we used that $\mu_{s'}(H(s', y)) = p_y$ by statistical parity. One can follow the same computation for the second marginal. To conclude, compute
$$\begin{aligned}
\pi_{s'|s}\left(\left\{(x, x') \in \mathbb{R}^d \times \mathbb{R}^d \mid h(x, s) = h(x', s')\right\}\right) &= \pi_{s'|s}\left(\bigcup_{y \in \mathcal{Y}} H(s, y) \times H(s', y)\right)\\
&= \sum_{y \in \mathcal{Y}} \pi_{s'|s}\left(H(s, y) \times H(s', y)\right)\\
&= \sum_{y \in \mathcal{Y}} \frac{1}{p_y} \int \mathbb{1}\{x \in H(s,y)\}\, d\mu_s(x) \int \mathbb{1}\{x' \in H(s',y)\}\, d\mu_{s'}(x')\\
&= \sum_{y \in \mathcal{Y}} \frac{p_y \cdot p_y}{p_y} = 1.
\end{aligned}$$

Appendix C.
Proofs of Section 7

Proof of Theorem 21

The outline of the proof is typical for such supervised learning problems, though some parts require basic knowledge of optimal transport. It mainly amounts to showing the uniform convergence of $\{R_n\}_{n \in \mathbb{N}}$ to $R$, to then use the following classical deviation inequality:
$$R(\theta_n) - \min_{\theta \in \Theta} R(\theta) \leq 2 \sup_{\theta \in \Theta} |R_n(\theta) - R(\theta)|. \tag{13}$$
For any measure $P$ and any measurable function $g$, we will use the notation $P(g) := \int g\, dP$ throughout the proof.

Step 1. Uniform convergence of the risk. By the triangle inequality,
$$\sup_{\theta \in \Theta} |R_n(\theta) - R(\theta)| \leq \sup_{\theta \in \Theta} \left|\frac{1}{n}\sum_{i=1}^n \ell(h_\theta(x_i, s_i), y_i) - \mathbb{E}[\ell(h_\theta(X, S), Y)]\right| + \sum_{s \neq s'} \sup_{\theta \in \Theta} \left|\left(\tfrac{n_s}{n}\pi^n_{s'|s} - \mathbb{P}(S = s)\,\pi_{s'|s}\right)\left(r_\theta(\cdot, s, \cdot, s')\right)\right|.$$
The first term corresponds to the standard uniform risk deviation of supervised learning problems for Lipschitz losses and linear predictions. Under (i) to (iv), for $0 < \delta < 1$ it follows from (Shalev-Shwartz and Ben-David, 2014, Theorem 26.5) that, with probability greater than $1 - \delta$, the deviation $\sup_{\theta \in \Theta} |\frac{1}{n}\sum_{i=1}^n \ell(h_\theta(x_i, s_i), y_i) - \mathbb{E}[\ell(h_\theta(X, S), Y)]|$ is bounded by a quantity of order $\sqrt{\log(1/\delta)/n}$, with a constant involving $\ell_0 := \sup_{|y| \leq b} |\ell(0, y)|$. Then, by taking $\delta_n := \frac{1}{n^2}$, we apply the Borel–Cantelli lemma, so that for almost every $\omega \in \Omega$ there exists a threshold $N(\omega)$ such that the bound holds for any $n \geq N(\omega)$. The upper bound tends to zero as $n$ tends to infinity, and consequently
$$\sup_{\theta \in \Theta}\left|\frac{1}{n}\sum_{i=1}^n \ell(h_\theta(x_i, s_i), y_i) - \mathbb{E}[\ell(h_\theta(X, S), Y)]\right| \xrightarrow[n \to +\infty]{\text{a.s.}} 0.$$
The critical part is dealing with the counterfactual penalization. Let $s, s' \in \mathcal{S}$ such that $s \neq s'$. In the remainder of this step, we aim at showing that
$$\sup_{\theta \in \Theta} \left|\left(\tfrac{n_s}{n}\pi^n_{s'|s} - \mathbb{P}(S = s)\,\pi_{s'|s}\right)\left(r_\theta(\cdot, s, \cdot, s')\right)\right| \xrightarrow[n \to +\infty]{\text{a.s.}} 0.$$
To do so, we use the triangle inequality again, leading to
$$\sup_{\theta \in \Theta} \left|\left(\tfrac{n_s}{n}\pi^n_{s'|s} - \mathbb{P}(S = s)\,\pi_{s'|s}\right)\left(r_\theta(\cdot, s, \cdot, s')\right)\right| \leq \left|\tfrac{n_s}{n} - \mathbb{P}(S = s)\right| \sup_{\theta \in \Theta} \int r_\theta(x, s, x', s')\, d\pi^n_{s'|s}(x, x') \tag{14}$$
$$+\ \mathbb{P}(S = s) \sup_{\theta \in \Theta} \left|\int r_\theta(x, s, x', s')\, \left(d\pi^n_{s'|s}(x, x') - d\pi_{s'|s}(x, x')\right)\right|. \tag{15}$$
The term (14) tends to zero almost surely as $n$ increases to infinity, since $\frac{n_s}{n} \to \mathbb{P}(S = s)$ almost surely by the law of large numbers while the supremum remains bounded. We now turn to the convergence of the term (15).
Firstly, let us show that the functions $\{r_\theta(\cdot, s, \cdot, s')\}_{\theta \in \Theta}$ are uniformly Lipschitz on $\mathcal{X} \times \mathcal{X}$. For any $(x_1, x'_1), (x_2, x'_2) \in \mathcal{X} \times \mathcal{X}$, we have
$$\begin{aligned}
\left|r_\theta(x_1, s, x'_1, s') - r_\theta(x_2, s, x'_2, s')\right| &\leq \left\|\theta^T\left(\Phi(x_1, s) - \Phi(x'_1, s') - \Phi(x_2, s) + \Phi(x'_2, s')\right)\right\|_2\\
&\leq \left\|\theta^T\left(\Phi(x_1, s) - \Phi(x_2, s)\right)\right\|_2 + \left\|\theta^T\left(\Phi(x'_1, s') - \Phi(x'_2, s')\right)\right\|_2\\
&\leq \|\theta\|_2 \left\|\Phi(x_1, s) - \Phi(x_2, s)\right\|_2 + \|\theta\|_2 \left\|\Phi(x'_1, s') - \Phi(x'_2, s')\right\|_2\\
&\leq D\sqrt{2}\left\{L_s^2\|x_1 - x_2\|^2 + L_{s'}^2\|x'_1 - x'_2\|^2\right\}^{1/2}\\
&\leq D\sqrt{2}\, \max_{s \in \mathcal{S}} L_s \left\|(x_1, x'_1) - (x_2, x'_2)\right\|_2\\
&\leq 2\sqrt{2}\, D^2 \left(\max_{s \in \mathcal{S}} L_s\right)^2 R \left\|(x_1, x'_1) - (x_2, x'_2)\right\|.
\end{aligned}$$
Let us set $\Lambda := 2\sqrt{2}\, D^2 (\max_{s \in \mathcal{S}} L_s)^2 R$, so that the functions $\{r_\theta(\cdot, s, \cdot, s')\}_{\theta \in \Theta}$ are $\Lambda$-Lipschitz.

Secondly, we know from (Villani, 2008, Theorem 5.19) that $\pi^n_{s'|s}$ converges almost surely weakly to $\pi_{s'|s}$ as $n_s, n_{s'} \to +\infty$. Moreover, for any $s \in \mathcal{S}$, we have $\frac{n_s}{n} \to \mathbb{P}(S = s) > 0$, hence $n_s \to +\infty$ as $n \to +\infty$. As a consequence, $\pi^n_{s'|s}$ converges almost surely weakly to $\pi_{s'|s}$ as $n \to +\infty$. Additionally, since $\mathcal{X}_s \times \mathcal{X}_{s'} \subseteq \mathcal{X} \times \mathcal{X}$, it follows from (ii) that $\pi_{s'|s}$ is compactly supported. According to Remark 7.13 in (Villani, 2003), this implies that
$$W_1\left(\pi^n_{s'|s}, \pi_{s'|s}\right) \xrightarrow[n \to +\infty]{\text{a.s.}} 0,$$
where $W_1$ denotes the Wasserstein-1 distance. Using the dual formulation of the Wasserstein distance, this convergence can be written as
$$W_1\left(\pi^n_{s'|s}, \pi_{s'|s}\right) = \sup_{r \in \operatorname{Lip}_1(\mathcal{X} \times \mathcal{X}, \mathbb{R})} \left|\int r\, \left(d\pi^n_{s'|s} - d\pi_{s'|s}\right)\right| \xrightarrow[n \to +\infty]{\text{a.s.}} 0.$$
Noting that for any $\theta \in \Theta$, $r_\theta(\cdot, s, \cdot, s')/\Lambda$ is 1-Lipschitz, we have
$$\frac{1}{\Lambda} \sup_{\theta \in \Theta} \left|\int r_\theta(x, s, x', s')\, \left(d\pi^n_{s'|s}(x, x') - d\pi_{s'|s}(x, x')\right)\right| \leq W_1\left(\pi^n_{s'|s}, \pi_{s'|s}\right) \xrightarrow[n \to +\infty]{\text{a.s.}} 0.$$
This entails that the term (15) converges almost surely to 0, and completes this first step.

Step 2. Consistency of the minimum. For this additional step, we assume that $R_n$ and $R$ have unique minimizers, and we denote $\theta^* := \arg\min_{\theta \in \Theta} R(\theta)$. Note that the sequence $\{\theta_n\}_{n \in \mathbb{N}}$ is bounded by $D$, and as such we can extract a subsequence $\{\theta_{\sigma(n)}\}_{n \in \mathbb{N}}$ converging to some $\theta_\sigma \in \Theta$. Let us prove that $\theta_\sigma = \theta^*$ regardless of the choice of the subsequence $\{\sigma(n)\}_{n \in \mathbb{N}}$. According to the deviation inequality (13) and by continuity of $R$, we have at the limit $R(\theta_\sigma) \leq R(\theta^*)$. This means that $\theta_\sigma$ is a minimizer of $R$.
Therefore, by uniqueness, $\theta_\sigma = \theta^*$. This completes the proof, as it implies that $\theta_n \xrightarrow[n \to +\infty]{\text{a.s.}} \theta^*$.

References

Nicholas Asher, Soumya Paul, and Chris Russell. Fair and adequate explanations. In International Cross-Domain Conference for Machine Learning and Knowledge Extraction, pages 79–97. Springer, 2021.

Nicholas Asher, Lucas De Lara, Soumya Paul, and Chris Russell. Counterfactual models for fair and adequate explanations. Machine Learning and Knowledge Extraction, 4(2):316–349, 2022.

Mohit Bajaj, Lingyang Chu, Zi Yu Xue, Jian Pei, Lanjun Wang, Peter Cho-Ho Lam, and Yong Zhang. Robust counterfactual explanations on graph neural networks. In Advances in Neural Information Processing Systems, volume 34, 2021.

Yogesh Balaji, Rama Chellappa, and Soheil Feizi. Robust optimal transport with applications in generative modeling and domain adaptation. In Advances in Neural Information Processing Systems, volume 33, pages 12934–12944, 2020.

Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning: Limitations and Opportunities. MIT Press, 2023.

M. Faisal Beg, Michael I. Miller, Alain Trouvé, and Laurent Younes. Computing large deformation metric mappings via geodesic flows of diffeomorphisms. International Journal of Computer Vision, 61(2):139–157, 2005.

J. Beirlant, S. Buitendag, E. del Barrio, M. Hallin, and F. Kamper. Center-outward quantiles and the measurement of multivariate risk. Insurance: Mathematics and Economics, 95:79–100, 2020.

Peter M. Bentler and David G. Weeks. Linear structural equations with latent variables. Psychometrika, 45(3):289–308, 1980.

Philippe Besse, Eustasio del Barrio, Paula Gordaliza, Jean-Michel Loubes, and Laurent Risser. A survey of bias in machine learning through the prism of statistical parity. The American Statistician, pages 1–11, 2021.

Emily Black, Samuel Yeom, and Matt Fredrikson. FlipTest: Fairness testing via optimal transport.
In Conference on Fairness, Accountability, and Transparency, pages 111–121, New York, USA, 2020. Association for Computing Machinery.

Stephan Bongers, Patrick Forré, Jonas Peters, and Joris M. Mooij. Foundations of structural causal models with cycles and latent variables. The Annals of Statistics, 49(5):2885–2915, 2021.

Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4):375–417, 1991.

Alessandro Castelnovo, Riccardo Crupi, Greta Greco, Daniele Regoli, Ilaria Giuseppina Penco, and Andrea Claudio Cosentini. A clarification of the nuances in the fairness metrics landscape. Scientific Reports, 12(1):4209, 2022.

Arthur Charpentier, Emmanuel Flachaire, and Ewen Gallic. Optimal transport for counterfactual estimation: A method for causal inference. In Optimal Transport Statistics for Economics and Related Topics, pages 45–89. Springer, 2023.

Victor Chernozhukov, Alfred Galichon, Marc Hallin, and Marc Henry. Monge–Kantorovich depth, quantiles, ranks and signs. The Annals of Statistics, 45(1):223–256, 2017.

Silvia Chiappa. Path-specific counterfactual fairness. In AAAI Conference on Artificial Intelligence, volume 33, pages 7801–7808, 2019.

Silvia Chiappa and William S. Isaac. A causal Bayesian networks viewpoint on fairness. In Privacy and Identity Management. Fairness, Accountability, and Transparency in the Age of Big Data: 13th IFIP WG 9.2, 9.6/11.7, 11.6/SIG 9.2.2 International Summer School, Vienna, Austria, August 20–24, 2018, Revised Selected Papers 13, pages 3–20, 2019.

Silvia Chiappa, Ray Jiang, Tom Stepleton, Aldo Pacchiano, Heinrich Jiang, and John Aslanides. A general approach to fairness with optimal transport. In AAAI Conference on Artificial Intelligence, pages 3633–3640, 2020.

David Maxwell Chickering, David Heckerman, and Christopher Meek. Large-sample learning of Bayesian networks is NP-hard.
Journal of Machine Learning Research, 5(Oct):1287–1330, 2004.

Evgenii Chzhen, Christophe Denis, Mohamed Hebiri, Luca Oneto, and Massimiliano Pontil. Fair regression with Wasserstein barycenters. In Advances in Neural Information Processing Systems, volume 33, pages 7321–7331. Curran Associates, Inc., 2020.

Gregory F. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks (research note). Artificial Intelligence, 42(2–3):393–405, March 1990.

Nicolas Courty, Rémi Flamary, and Devis Tuia. Domain adaptation with regularized optimal transport. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 274–289. Springer, 2014.

Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems, volume 30, 2017.

Juan Antonio Cuesta and Carlos Matrán. Notes on the Wasserstein metric in Hilbert spaces. The Annals of Probability, pages 1264–1276, 1989.

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, volume 26, pages 2292–2300, 2013.

Ricardo Dominguez-Olmedo, Amir H. Karimi, and Bernhard Schölkopf. On the adversarial robustness of causal algorithmic recourse. In International Conference on Machine Learning, pages 5324–5342. PMLR, 2022.

Dheeru Dua and Casey Graff. UCI machine learning repository, 2019. URL http://archive.ics.uci.edu/ml.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Innovations in Theoretical Computer Science Conference, pages 214–226, 2012.

Jake Fawkes, Robin Evans, and Dino Sejdinovic. Selection, ignorability and challenges with causal fairness. In Conference on Causal Learning and Reasoning, 2021.

Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z.
Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong, and Titouan Vayer. POT: Python optimal transport. Journal of Machine Learning Research, 22(78):1–8, 2021.

David Galles and Judea Pearl. An axiomatic characterization of causal counterfactuals. Foundations of Science, 3(1):151–182, 1998.

Nathalie T.H. Gayraud, Alain Rakotomamonjy, and Maureen Clerc. Optimal transport applied to transfer learning for P300 detection. In BCI 2017 - 7th Graz Brain-Computer Interface Conference, page 6, 2017.

Promit Ghosal and Bodhisattva Sen. Multivariate ranks and quantiles using optimal transport: Consistency, rates and nonparametric testing. The Annals of Statistics, 50(2):1012–1037, 2022.

Paula Gordaliza, Eustasio Del Barrio, Fabrice Gamboa, and Jean-Michel Loubes. Obtaining fairness using optimal transport theory. In International Conference on Machine Learning, pages 2357–2365. PMLR, 2019.

Marc Hallin, Eustasio del Barrio, Juan Cuesta-Albertos, and Carlos Matrán. Distribution and quantile functions, ranks and signs in dimension d: A measure transportation approach. The Annals of Statistics, 49(2):1139–1165, 2021.

Patrik Hoyer, Dominik Janzing, Joris M. Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems, volume 21, 2008.

Antti Hyttinen, Frederick Eberhardt, and Patrik O. Hoyer. Learning linear cyclic causal models with latent variables. The Journal of Machine Learning Research, 13(1):3387–3439, 2012.

Guido W. Imbens and Donald B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

Sarang C. Joshi and Michael I. Miller.
Landmark matching via large deformation diffeomorphisms. IEEE Transactions on Image Processing, 9(8):1357–1370, 2000.

Shalmali Joshi, Oluwasanmi Koyejo, Warut Vijitbenjaronk, Been Kim, and Joydeep Ghosh. Towards realistic individual recourse and actionable explanations in black-box decision making systems. arXiv preprint arXiv:1907.09615, 2019.

Amir-Hossein Karimi, Julius von Kügelgen, Bernhard Schölkopf, and Isabel Valera. Towards causal algorithmic recourse. In International Workshop on Extending Explainable AI Beyond Deep Models and Classifiers, pages 139–166. Springer, 2020.

Amir-Hossein Karimi, Bernhard Schölkopf, and Isabel Valera. Algorithmic recourse: from counterfactual explanations to interventions. In ACM Conference on Fairness, Accountability, and Transparency, pages 353–362, 2021.

Niki Kilbertus, Philip J. Ball, Matt J. Kusner, Adrian Weller, and Ricardo Silva. The sensitivity of counterfactual fairness to unmeasured confounding. In Conference on Uncertainty in Artificial Intelligence, pages 616–626. PMLR, 2020.

Matt J. Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems, volume 30, pages 4066–4076. Curran Associates, Inc., 2017.

David Lewis. Causation. Journal of Philosophy, 70(17):556–567, 1973.

Zachary Lipton, Julian McAuley, and Alexandra Chouldechova. Does mitigating ML's impact disparity require treatment disparity? In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Divyat Mahajan, Chenhao Tan, and Amit Sharma. Preserving causal constraints in counterfactual explanations for machine learning classifiers. arXiv preprint arXiv:1912.03277, 2020.

Tudor Manole, Sivaraman Balakrishnan, Jonathan Niles-Weed, and Larry Wasserman. Plugin estimation of smooth optimal transport maps. arXiv preprint arXiv:2107.12364, 2021.

Robert J. McCann.
Existence and uniqueness of monotone measure-preserving maps. Duke Mathematical Journal, 80(2):309–323, November 1995. doi: 10.1215/S0012-7094-95-08013-2.

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: a simple and accurate method to fool deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016.

Razieh Nabi and Ilya Shpitser. Fair inference on outcomes. In AAAI Conference on Artificial Intelligence, volume 32, 2018.

Judea Pearl. Causality. Cambridge University Press, 2009.

Judea Pearl. The mathematics of causal relations. In Causality and Psychopathology: Finding the Determinants of Disorders and their Cures (P. Shrout, K. Keyes and K. Ornstein, eds.), pages 47–65. Oxford University Press, 2010.

Judea Pearl, Madelyn Glymour, and Nicholas P. Jewell. Causal Inference in Statistics: A Primer. John Wiley & Sons, 2016.

Victoria Peterson, Nicolás Nieto, Dominik Wyser, Olivier Lambercy, Roger Gassert, Diego H. Milone, and Rubén D. Spies. Transfer learning based on optimal transport for motor imagery brain-computer interfaces. IEEE Transactions on Biomedical Engineering, 69(2):807–817, 2021.

Gabriel Peyré and Marco Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

Drago Plečko and Elias Bareinboim. Causal fairness analysis: A causal toolkit for fair machine learning. Foundations and Trends in Machine Learning, 17(3):304–589, 2024.

Drago Plečko and Nicolai Meinshausen. Fair data adaptation with quantile preservation. Journal of Machine Learning Research, 21:1–44, 2020.

Aram-Alexandre Pooladian and Jonathan Niles-Weed. Entropic estimation of optimal transport maps. arXiv preprint arXiv:2109.12004, 2021.

Rafael Poyiadzi, Kacper Sokol, Raul Santos-Rodriguez, Tijl De Bie, and Peter Flach. FACE: feasible and actionable counterfactual explanations.
In AAAI/ACM Conference on AI, Ethics, and Society, pages 344–350, 2020.

Herilalaina Rakotoarison, Louisot Milijaona, Andry Rasoanaivo, Michèle Sebag, and Marc Schoenauer. Learning meta-features for AutoML. In International Conference on Learning Representations, 2022.

Peyman Rasouli and Ingrid Chieh Yu. CARE: Coherent actionable recourse based on sound counterfactual explanations. International Journal of Data Science and Analytics, 17(1):13–38, 2024.

Ievgen Redko, Nicolas Courty, Rémi Flamary, and Devis Tuia. Optimal transport for multisource domain adaptation under target shift. In International Conference on Artificial Intelligence and Statistics, pages 849–858. PMLR, 2019.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, New York, NY, USA, 2016. Association for Computing Machinery.

Laurent Risser, Alberto Gonzalez Sanz, Quentin Vincenot, and Jean-Michel Loubes. Tackling algorithmic bias in neural-network classifiers using Wasserstein-2 regularization. Journal of Mathematical Imaging and Vision, pages 1–18, 2022.

Dominik Rothenhäusler, Nicolai Meinshausen, Peter Bühlmann, Jonas Peters, et al. Anchor regression: Heterogeneous data meet causality. Journal of the Royal Statistical Society Series B, 83(2):215–246, 2021.

Donald B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974.

Chris Russell, Matt J. Kusner, Joshua Loftus, and Ricardo Silva. When worlds collide: Integrating different counterfactual assumptions in fairness. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

Saptarshi Saha and Utpal Garain. On noise abduction for answering counterfactual queries: A practical outlook. Transactions on Machine Learning Research, 2022. ISSN 2835-8856.
URL https://openreview.net/forum?id=4FU8Jz1Oyj.

Marco Scutari, Claudia Vitolo, and Allan Tucker. Learning Bayesian networks from big data with greedy search: computational complexity and efficient implementation. Statistics and Computing, 29(5):1095–1108, 2019.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Shohei Shimizu, Patrik O. Hoyer, Aapo Hyvärinen, Antti Kerminen, and Michael Jordan. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(10), 2006.

Ricardo Silva. Counterfactual fairness is not demographic parity, and other observations. arXiv preprint arXiv:2402.02663, 2024.

Dylan Slack, Sophie Hilgard, Himabindu Lakkaraju, and Sameer Singh. Counterfactual explanations can be manipulated. In Advances in Neural Information Processing Systems, volume 34, 2021.

Robert C. Stalnaker. A defense of conditional excluded middle. In Ifs, pages 87–104. Springer, 1980.

Thibaut Le Gouic, Jean-Michel Loubes, and Philippe Rigollet. Projection to fairness in statistical learning. arXiv preprint arXiv:2005.11720, 2020.

William Torous, Florian Gunsilius, and Philippe Rigollet. An optimal transport approach to estimating causal effects via nonlinear difference-in-differences. arXiv preprint arXiv:2108.05858, 2024.

Cédric Villani. Topics in Optimal Transportation. Number 58 in Graduate Studies in Mathematics. American Mathematical Society, 2003.

Cédric Villani. Optimal Transport: Old and New. Number 338 in Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 2008.

Julius von Kügelgen, Amir-Hossein Karimi, Umang Bhatt, Isabel Valera, Adrian Weller, and Bernhard Schölkopf. On the fairness of causal algorithmic recourse. In AAAI Conference on Artificial Intelligence, volume 36, pages 9584–9594, 2022.

Sandra Wachter, Brent Mittelstadt, and Chris Russell.
Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology, 31:841, 2017.

Yongkai Wu, Lu Zhang, and Xintao Wu. Counterfactual fairness: Unidentification, bound and algorithm. In International Joint Conference on Artificial Intelligence, 2019.

Shizhou Xu and Thomas Strohmer. On the (in)compatibility between group fairness and individual fairness. arXiv preprint arXiv:2401.07174, 2024.

Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. Fairness constraints: Mechanisms for fair classification. In International Conference on Artificial Intelligence and Statistics, pages 962–970. PMLR, 2017.

Kun Zhang and Lai-Wan Chan. Extensions of ICA for causality discovery in the Hong Kong stock market. In International Conference on Neural Information Processing, pages 400–409. Springer, 2006.