# counterfactual_transportability_a_formal_approach__c83d4a85.pdf

Counterfactual Transportability: A Formal Approach

Juan D. Correa 1 Sanghack Lee 2 Elias Bareinboim 3

Generalizing causal knowledge across environments is a common challenge shared across many of the data-driven disciplines, including AI and ML. Experiments are usually performed in one environment (e.g., in a lab, on Earth, in a training ground), almost invariably, with the intent of being used elsewhere (e.g., outside the lab, on Mars, in the real world), in an environment that is related but somewhat different than the original one, where certain conditions and mechanisms are likely to change. This generalization task has been studied in the causal inference literature under the rubric of transportability (Pearl and Bareinboim, 2011). While most transportability works focused on generalizing associational and interventional distributions, the generalization of counterfactual distributions has not been formally studied. In this paper, we investigate the transportability of counterfactuals from an arbitrary combination of observational and experimental distributions coming from disparate domains. Speciﬁcally, we introduce a sufﬁcient and necessary graphical condition and develop an efﬁcient, sound, and complete algorithm for transporting counterfactual quantities across domains in nonparametric settings. Failure of the algorithm implies the impossibility of generalizing the target counterfactual from the available data without further assumptions.

1. Introduction

Counterfactuals form the basis for different notions across human cognition and decision-making, including credit assignment, regret, responsibility and blame. Counterfactual relations require retrospective thinking, where one must be

1Department of Computer Science, Universidad Aut onoma de Manizales, Manizales, Colombia 2Graduate School of Data Science, Seoul National University, Seoul, South Korea 3Department of Computer Science, Columbia University, New York, USA. Correspondence to: Juan D. Correa <jcorrea@autonoma.edu.co>.

Proceedings of the 39 th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

able to compare what did happen with what would have happened under some alternative hypothesis (Pearl, 2000). Given the impossibility of observing an alternative outcome once an action is taken, counterfactuals evoke what if? questions which answer can only be approached by imagining hypothetical conditions usually contrary to the factual evidence. For instance, questions such as what would be the death rates had the vaccination started two weeks earlier? or given that I arrived late, would I have been on time had I taken the subway instead of the taxi? require us to carry out a mental experiment where we recover some state of affairs, perform a change in the sequence of events, and let a hypothetical situation to play out. More generally, counterfactuals are an important component in the construction of explanations regarding why events occurred the way they did. For instance, the previous questions could be related to why did the death rate achieve the number it did? or was it the way of transportation that caused my late arrival? (Pearl & Mackenzie, 2018)[Ch. 8].

Formally, a structural account of causation provides suitable semantics for representing counterfactual statements (Pearl, 2000). Each structural causal model (SCM) M models a generative process and induces a collection of distributions related to the activities of seeing (observational), acting (interventional), and imagining (counterfactual), which together form what is known as the ladder of causation (Pearl & Mackenzie, 2018; Bareinboim et al., 2022). In practice, the SCM M is usually not fully observable, which leads to the inferential challenge of using data from one part of the ladder to make inferences about another. For instance, there exist a plethora of methods allowing for inferences from observational to experimental (i.e., layers 1 to 2 in the ladder) (Pearl, 1995; Tian & Pearl, 2003; Shpitser & Pearl, 2006; Huang & Valtorta, 2006; Bareinboim & Pearl, 2012b; Lee et al., 2019), and from observational and experimental to counterfactual distributions (Pearl, 2001; Avin et al., 2005; Shpitser & Pearl, 2007; Correa et al., 2021).

In practice, obtaining different experimental distributions for the same population is often highly nontrivial. One of the key aspects of human cognition is the ability to generalize concepts from one domain to another. The task of leveraging causal invariances to extrapolate and fuse experimental knowledge across settings has been formally studied in the causal inference literature under the rubric

Counterfactual Transportability: A Formal Approach

of transportability (Pearl & Bareinboim, 2011). By and large, there are several graphical conditions and algorithms for the transportability of causal effects from a combination of observational and experimental data in various settings (Bareinboim & Pearl, 2012a; Lee & Honavar, 2013a;b; Bareinboim & Pearl, 2013; 2014; 2016; Lee et al., 2020; Correa & Bareinboim, 2019; 2020).

Despite the powerful identiﬁability and transportability results found in this literature, it is still largely unknown how to transport counterfactual distributions across different environments and changing conditions. In particular, the literature on transportability has been focused on the extrapolation of observational and experimental distributions (layers 1 and 2 of the ladder) but has not addressed how to operate within counterfactual ones (layer 3). For concreteness, consider an example motivated by (Powdthavee et al., 2013).

Example 1.1 (Compulsory education and well-being). Consider an economic study to understand the effects of compulsory schooling (X) on people s subjective well-being (Y ). A researcher group in Australia performed a controlled longitudinal experiment to assess the effect of X on income (Z), written as P(Z | do(X)). A selection diagram (to be deﬁned in Section 3) that describes this scenario is shown with the graph in Figure 1, where each variable corresponds to a vertex and the edges describe how variables causally inﬂuence one another. The bidirected arrow between X and Z indicates the existence of unmeasured confounders that affect both X and Z (e.g., social status, race, neighborhood).

Another group of researchers in the United States aims to determine how strong is the inﬂuence of X on Y by means other than Z. This inﬂuence can be captured through a quantity known as the natural direct effect (NDE), which is written in counterfactual language as E[Yx ,Zx Yx,Zx] (Pearl, 2001)[Def. 5]. The ﬁrst expression is the value Y attains if X is held constant at x while Z still follows X = x. Noteworthy, this is a typical counterfactual quantity since Z and Y consider X as taking different values, while in the real world the variable X can take only one value at a time. The second quantity represents the value of Y when X = x and Z vary accordingly. As the researchers believe that people in the US perceive well-being based on income, which is different than in Australia, they are surveying several people in order to obtain the observational distribution P (X, Z, Y ). This difference between the populations is represented with a node pointing to Y ( Y ), as shown in Figure 1. The distributions with superscript indicate the target population, in this case, the US, and those without superscript represent the source population, Australia.

In this setting, the NDE cannot be determined from data from the US alone nor from Australia alone. Still, it can be

Figure 1: Selection diagram describing the causal structure of a model for studying the effect of compulsory education (X) on perceived well-being (Y ) (see Example 1.1).

determined through the following expression: X

P (y | x , z) P (y | x, z) | {z } from the US

P(z | do(x)) | {z } from Australia

In other words, the ﬁrst factor in Equation (1) (in parenthesis) is a difference computed from the observational distribution in America, while the second factor (the do distribution) is from the interventional distribution in Australia.

In this paper, our goal is to explicate why this extrapolation (and formula) holds in this particular example, and more broadly, understand under what conditions this type of counterfactual inference across domains is allowed. We will investigate the nonparametric transportability of arbitrary counterfactual quantities when the input consists of any combination of observational and interventional distributions, gathered across different heterogeneous domains. More speciﬁcally, our contributions are as follows:

1. Graphical characterization. We introduce a graphical condition for determining whether a counterfactual quantity is transportable from a collection of datasets. We then prove that this condition is both necessary and sufﬁcient. 2. Algorithmic solution. We develop an efﬁcient algorithm to determine the existence of an estimand for a target counterfactual distribution, as a function of available observational and experimental distributions from the different domains. We further show that this algorithm is not only sound but also complete. In other words, the failure of the algorithm implies that the quantity is not transportable without further assumptions.

Our results generalize the identiﬁcation results in (Correa et al., 2021) to a transportability setting, where available data comes from different domains and different experimental conditions, also generalizing results in (Correa & Bareinboim, 2020) to counterfactual queries. The remaining of the paper is organized as follows. Section 2 introduces structural causal models and required notions about counterfactuals. Section 3 reviews transportability and the graphical models used for it. It also deﬁnes the counterfactual transportability task and a building block to solve it. Section 4 generalizes previous work to the context of the transportability task, including conditional and nested counterfactuals. Section 5 presents a sound and complete algorithm for the proposed task. Finally, Section 6 ends with the conclusions.

Counterfactual Transportability: A Formal Approach

1.1. Preliminaries

We denote variables by capital letters, X, and values by small letters, x. Bold letters, X represent sets of variables and x sets of values. The domain of a variable X is denoted by XX. Two values x and z are said to be consistent if they share the common values for X Z. We also denote by x \ Z the value of X \ Z consistent with x and by x Z the subset of x corresponding to variables in Z. We assume the domain of every variable is ﬁnite.

We represent qualitative assumptions in the form of causal graphs that are named with a calligraphic letter, e.g., G, H, etc. We denote by V(G) the set of vertices (i.e., variables) in a graph H. Given a graph G, GWX is the result of removing edges coming into variables in W and going out from variables in X. G[W] denotes a vertex-induced subgraph, which includes W and the edges among its elements.We use kinship notation for graphical relationships such as parents, children, descendants, and ancestors of a set of variables. For example, the set of parents of X in G is denoted by Pa(X)G := S

X X Pa(X)G. Similarly, we deﬁne the set of ancestors (An( )) and descendants (De( )), which include the argument itself.

To articulate and formalize the generalization of counterfactuals, we require a framework that allows us to reason about multiple domains and alternative worlds simultaneously. For this purpose, we use the Structural Causal Model (SCM) paradigm (Pearl, 2000). An SCM M is a 4-tuple U, V, F, P(u) , where U is a set of exogenous (latent) variables; V is a set of endogenous (observable) variables; F is a collection of functions such that each variable Vi V is determined by a function fi F. Each fi is a mapping from a set of exogenous variables Ui U and a set of endogenous variables Pai V \ {Vi} to the domain of Vi. The uncertainty is encoded through a probability distribution over the exogenous variables, P(U). An SCM M induces a causal diagram G where V is the set of vertices, there is a directed edge (Vj Vi) for every Vi V and Vj Pai, and a bidirected edge (Vi L9999K Vj) for every pair Vi, Vj V such that Ui Uj = (Vi and Vj have a common exogenous parent).

We assume that the underlying model is recursive. That is, there are no cyclic dependencies among the variables. Equivalently, the corresponding causal diagram is acyclic.

The set V decomposes into subsets called c-components (Tian & Pearl, 2002b) according to a diagram G such that two variables belong to the same c-component if they are connected in G by a path made entirely of bidirected edges.

2. Structural Causal Models and Counterfactuals

Intervening on a system represented by an SCM M results in a new model differing from M only in the mechanisms associated with the intervened variables (Pearl, 1994; Dawid, 2002; 2015). Let b X be a collection of functions { b X : X b UX XX}X X for some X V and some set

of unobservable variables b UX disjoint from the original U. Then, an intervention can be described by some b X that induces a derived model M b X where each f X has been replaced by b X. Then, Y b X(u, bu X) is called the potential response of Y to X = b X, and is deﬁned as the solution of Y, for a particular (u, bu X), in the derived model M b X. The most simple and common type of intervention considered consists on one that ﬁxes f X to a constant value x XX, denoted do(X = x). Other interventions may replace f X with some b X which is a function of only observable variables (conditional intervention) or even a function with both observable and unobservable (from b UX) arguments (stochastic intervention). In any case, we assume that b X is not a function of any of the original U.

We use W to denote sets of arbitrary counterfactual variables1. Let W = {W1[b T1], W2[b T2], . . .} represent a set of counterfactual variables such that Wi V and Ti V for i = 1, . . . , l. Deﬁne V(W ) = {W V | Wb T W }, that is, the set of observables that appear in W . Let w represent a vector of values, one for each variable in W .

If all variables in the expression have the same subscript, we could write P(W1[x], W2[x], . . .) as Px(W1, W2, . . .). Also, we will write P(Y = y ) simply as P(y ) when there is no room for confusion.

As the variables in the causal diagram have ancestors that causally affect them, counterfactual variables also have causally relevant ancestors. This generalization of the notion of ancestrality and causal relevance was formalized by (Correa et al., 2021) with the following deﬁnition.

Deﬁnition 2.1 (Ancestors of a counterfactual). Let Yx be such that Y V, X V. Then, the set of (counterfactual) ancestors of Yx, denoted An(Yx), consist of each Wz, such that W An(Y )GX (which includes Y itself), and z = x An(W)GX.

For a set of variables W , An(W ) is deﬁned as the union of the ancestors of each variable in the set. That is, An(W ) = S

Wt W An(Wt).

Example 2.1 (College Degree and Earnings Counterfactual ancestors). For instance, consider the causal di-

1When subscripts are used to enumerate variables such as W1, W2, . . ., square brackets are added around the part of the subscript denoting interventions, e.g., W1[x], W2[x], . . ..

Counterfactual Transportability: A Formal Approach

(a) A causal diagram G

(b) A selection diagram G

Figure 2: A causal diagram and a selection diagram over four variables. The selection diagram summarizes the differences between two source domains and a target domain.

agram in Figure 2(a) and suppose X represents college degree, W occupation, Z socio-economic factors, and Y earnings. Let x0 = computer science , then the counterfactual variable Yx0 represents earnings had college degree been ﬁxed to computer science (X = x0). The set of ancestors, An(Yx0) = {Yx0, Wx0, Z}, represents the set of random variables that causally affect Yx0. Speciﬁcally, Yx0 is not affected by X or W, only by Z and Wx0, which is not necessarily equal to W. Similarly, we can compute the ancestors of other counterfactual variables such as An(Wyz) = {Wz, Xz} and An(Yw) = {Yw, X, Z}.

There are counterfactual distributions with a special form that exploits the local structure described by the causal diagram called counterfactual factors, deﬁned as follows:

Deﬁnition 2.2 (Counterfactual Factor (ctf-factor) (Correa et al., 2021)). A ctf-factor is a distribution of the form

P(w1[pa1], w2[pa2], . . . , wl[pal]), (2)

where each Wi V and there could be Wi = Wj for some i, j {1, . . . , l}.

Example 2.2 (ctf-factor). Consider the causal diagram in Figure 2(a), then all of the following are ctf-factors:

P(yzxw, wx), P(wx, z), and P(yz0x0w0, yz1x1w1). (3)

In contrast, the following are not ctf-factors:

P(yzw, wx), P(wz, z), and P(yz0x0w0, yz1x1), (4)

as they do not have the required form.

3. Transporting Counterfactual Relationships Across Domains

In this section, we review transportability and selection diagrams, including some examples. Then, we deﬁne the counterfactual transportability task and provide a lemma that serves as a basic block to solve it.

Our goal is to assess a counterfactual quantity using assumptions encoded in the form of a graph and observa-

tional and/or experimental distributions arising from different domains. Let Π={π , π1, π2, . . .} be the set of domains/populations involved in the analysis, and π is the target domain where the query is to be inferred.

We assume that each domain has an underlying SCM that produces the samples observed. A distribution generated by a domain πi is denoted with a superscript as P i. For instance, the observational distribution in the domain π

is denoted as P (V). Moreover, each domain is associated with a causal diagram Gi describing the qualitative assumptions made for the SCM in that domain.

The ability to generalize pieces of data from one domain to another depends on the commonalities and differences between domains and the quantity of interest. The differences between domains are called domain discrepancies (Lee et al., 2020), formally deﬁned next.

Deﬁnition 3.1 (Domain Discrepancy). Let πa and πb be domains associated, respectively, with SCMs Ma and Mb

conforming to a causal diagrams Ga and Gb. We denote by a,b V a set of variables such that, for every Vi a,b, there might exist a discrepancy if f a i = f b i or P a(Ui) = P b(Ui).

We will write ,i simply as i to represent the differences between the target and each source domain, with = . Domain discrepancies can be represented graphically by augmenting the causal diagram for a domain πi with extra nodes that represent changes in a mechanism. This type of diagram, called selection diagram, was ﬁrst proposed in (Pearl & Bareinboim, 2011).

While previous approaches assume that all domains have the same causal structure (i.e., the arguments of the corresponding functions in every SCM match), we allow for a function to have different arguments in different domains as long as no cyclic dependencies are present. Therefore, we consider one selection diagram per domain, as deﬁned next.

Deﬁnition 3.2 (Selection Diagram). Given a causal diagram Gi = V, E and domain discrepancies i, let S = {Sv | n i=1V i} be called selection variables. Then, a selection diagram G i is deﬁned as a graph V S, E {Sv V }Sv S .

Let G = {G i | π Π} denote the collection of selection diagrams, including one per domain. Here, G is always the same G .

Example 3.1 (Selection diagram). The selection diagram in Figure 2(b) augments the causal diagram in Figure 2(a) with two selection nodes pointing, respectively, to Z and W. These nodes are labeled with the name of the domain (or domains) for which they advertise differences with respect to the target domain π . This single selection diagram is a summary of the two selection diagrams shown in Figure 3(a)

Counterfactual Transportability: A Formal Approach

Figure 3: Two selection diagrams showing the possible differences between domains π1, π2, and π .

and Figure 3(b). Formally, this diagram encodes 1 = {Z} and 2 = {W}. In the context of Example 2.1, 1 advertises possible differences in the socio-economic conditions (Z) between the domains π1 and π . Meanwhile, 2 indicates differences regarding occupation (W) due, perhaps, to π2 being a region where the occupational proﬁle of several college program alumni differs with respect to π .

In this paper, our goal is to assess a counterfactual quantity in a target domain π using observational and experimental data from one or more source domains. We represent experimental distributions using regime indicators (Dawid, 2002; Tian, 2008; Correa & Bareinboim, 2019) to indicate an atomic, conditional, or stochastic policy. An atomic policy σX = do(x) ﬁxes the value of a variable X to x. A conditional intervention σX = g(W) sets the value of X according to a deterministic function g : XW XX for some set W V \ De(X). A stochastic intervention σX = b P(X | W) sets the value of X according to a given probability distribution conditional on a similar set W. A distribution affecting a set of variables X is analogously represented as σX. Depending on the intervention, the causal diagram Gi σX corresponds to the SCM Mi after being intervened with σX.

Example 3.2 (Experimental input distributions). Recall the story in Example 2.1 and suppose we are interested in assessing the impact of studying computer science (x0) on the earnings of people in π . We are, however, not interested in the effect that this has on the average person in π but on those who choose to pursue this degree on their own. To do this, we could consider the following quantity:

E[Yx0 | x0] X

x αx E[Yx | x0]. (5)

The ﬁrst expectation in Equation (5) refers to the expected value of earnings, had a person studied computer science given that this was the person s choice. Naturally, by the axioms of counterfactuals (and common sense), this is the same as E[Y | x0]; the interesting aspect is the contrast produced by considering the expectations in the sum that comes after. The quantity E[Yx | x0] evokes a counterfactual that considers the expected earnings had a person who chose computer science studied some other degree x. Moreover, αx is some weight assigned to a particular college

degree x for the sake of comparison. While one could use uniform weights, a sensible choice could be the distribution of second choices made by people who in the end studied computer science.

With the target quantity (Equation (5)) in mind, suppose the available data consists of an observational study carried out in π2 (P 2(Z, X, W, Y )) and a study in π1 reporting the results of a large-scale scholarship program where students were given the change to pursue any degree they wanted regardless of socio-economical factors, which we represent with distribution P 1(Z, X, W, Y ; σX = b P(X)). No data from π is observed.

The expectations in Equation (5) are associated, respectively to the distributions P (Yx0 | x0) and P (Yx | x0) for x XX2. For the reasons mentioned before, the ﬁrst quantity is simply P (Y | x0), yet we have not observed data directly from π , hence we need to be clever about how to use the available data. As the conditions in Section 4 and the algorithm introduced in Section 5 (and some simpliﬁcation) will allow us to derive:

P (y | x0) = X

z P 1(y | x0, z; σX)P 2(x0, z). (6)

The second probability can be similarly obtained from available data as

P (yx | x0) = X

z P 1(y | x, z; σX)P 2(x0, z), (7)

which together with Equation (6) allow us to estimate Equation (5) using the available distributions.

In general, it is useful to represent graphically the effect of an intervention. For instance, the causal diagram that represents the intervention σX is shown in Section 3. Compared to the original causal diagram (Figure 2(a)), the edges Z X and Z L99 99K X have disappeared as they are removed by σX.

Data available in each domain is speciﬁed by Z = {Zi | πi Π}, where each Zi = {σZ1, σZ2, . . .}, Zj V, corresponds to domain πi. This means that distributions {P i(V; σZj) > 0 | Zj Zi}Zi Z are assumed to be available. Notice that P i(V; σ ) = P i(V) describes the observational (non-interventional) distribution in domain πi. In this example, Z = {Z = , Z1 = {σZ}, Z2 = {σW }}.

We are now ready to formally deﬁne the task to be solved in this paper.

Deﬁnition 3.3 (Counterfactual Transportability). A query P (y | x ) is said to be transportable from G , Z , if P (y | x ) is uniquely computable from the set of distributions Z for every assignment (y, x) and every set of models {Mi}πi Π inducing G and Z.

2Since we are trying to say something about π , we use the superscript to indicate the domain of the distributions of interest.

Counterfactual Transportability: A Formal Approach

Figure 4: Causal diagram corresponding to the regime σX, denoted as GσX.

To understand the conditions for counterfactual transportability to be feasible, as well as devising a systematic approach to solve it, we establish transportability at the level of a ctf-factor, that serves as our unit of analysis.

Lemma 3.1 (Counterfactual Factor Transportability). Let G i be the selection diagram based on Gi and i, and let P (w ) be a ctf-factor, then P (w ) = P i(w ) if G i

does not contain selection nodes Svi pointing to any variable in Vi W, that is, Vi / i.

Lemma 3.1 allow us to determine whether the ctf-factor needed to estimate the query in the target domain can be obtained from other domains, based on the assumptions encoded in the selection diagram.

Example 3.3 (Ctf-factor transportability). Consider again the ctf-factors in Equation (3). We have,

P (yzxw, wx) = P 1(yzxw, wx), and (8)

P (yz0x0w0, yz1x1w1) = P 1(yz0x0w0, yz1x1w1)

= P 2(yz0x0w0, yz1x1w1). (9)

However, we cannot guarantee P (wx, z) to be equal to its counterpart P 1(wx, z) or P 2(wx, z), as Z 1 and W 2.

4. Counterfactual Transportability

To solve counterfactual transportability systematically, we will generalize results for the identiﬁability of counterfactuals in a single domain setting based on ctf-factors. The goal is to decompose any given counterfactual query in terms of ctf-factors which transportability from available data can be judged independently.

For this purpose, we follow the strategy in (Correa et al., 2021) and use the notion of ancestrality introduced in Definition 2.1, to write the query in terms of all the variables relevant for the analysis. Given an arbitrary query P (y ), let D = An(Y ) then

d \y P (d ). (10)

Next, we characterize the relationship between the transportability of P (d ) and P (y ).

Lemma 4.1 (Transportability of the sum over an ancestral set). Let P (d ) be a ctf-factor and let Y D be such that D = An(Y ). Then, P

d \y P (d ) is transportable from Z iff P (d ) is transportable from Z.

Then, P (d ) can be written in ctf-factor-form as

Dt D Dpad = d , (11)

where each d = d {Dt} and pad is determined for each Dt D as the union of t (Pad T) and d (Pad \T).

Example 4.1 (Ancestral set and ctf-factor). Consider again the selection diagram in Figure 2(b) and the target quantity P(yx | x0) = P(yx, x0)/P(x0) in Example 3.2. The corresponding ancestral set is D = An(Yx, Wx, X, Z) and the query can be written as

P (yx, x0) = X

z,w P (yx, wx, x0, z) (12)

z,w P (yxwz, wx, x0[z], z), (13)

where Equation (13) is in ctf-factor form.

The next step in the solution strategy is to factorize the righthand-side of Equation (11) according to the c-component structure of the graph G [V(D )]. Let C1, . . . , Ck be the c-components of this graph and deﬁne Cj = {Dpad D | D Cj} and cj as the values in d corresponding to Cj . Then P (d ) decomposes as

j P (cj ). (14)

Once the query of interest is in ctf-factor-form, the transportability question reduces to determining the transportability of smaller ctf-factors.

Example 4.2 (Ctf-factor factorization). Following from Example 4.1 and Equation (13), the query factorizes as

P (yx, x0) = X

z,w P (yxwz, wx)P (x0[z], z). (15)

The question becomes whether ctf-factors corresponding to individual c-components can be transported from the available input. The following deﬁnition and theorem characterize the factors that can be transported from Z and G.

Deﬁnition 4.1 (Inconsistent ctf-factor). P(w ) is an inconsistent ctf-factor if it is a ctf-factor, G[V(W )] has a single c-component, and one of the following situations hold:

(i) there exist Wt W , Z T V(W ) such that z t, z w and z = z , or (ii) there exist Wi[ti], Wj[tj] W and, T Ti Tj such that t ti, t tj and t = t .

Counterfactual Transportability: A Formal Approach

(a) Diagram where the ctf-factor P(yx, x ) is inconsistent.

(b) Diagram where the ctf-factor P(wx, w x ) is inconsistent.

Figure 5: Examples of causal diagrams and inconsistent ctf-factors derived from them.

Example 4.3 (Some inconsistent ctf-factors). First, consider the causal diagram in Figure 5(a), where the ctffactor P(yx, x ) is inconsistent due to condition (i) in Definition 4.1. For another example, consider the ctf-factor P(wx, w x ) is inconsistent due to condition (ii).

Once the query has been expressed in terms of ctf-factors and their consistency, the following theorem characterizes their transportability with an experimental (or observational) distribution.

Theorem 4.1 (Transportability from Z). A ctf-factor P (w ) is transportable from Z only if it is consistent. If consistent, let W = V(W ) and W = Pa(W)\W; then P (w ) is equal to P w (w) where w and w are consistent with w S

{Wpaw W } paw, and P (w ) is transportable from Z iff P w (w) is transportable from Z.

Example 4.4 (Ctf-factor transportability). Consider each of the two factors in Example 4.2. First, P (yxwz, wx) can be transported from P 1(Z, X, W, Y ; σX = b P(X)) using Lemma 3.1 since Y, W / 1:

P (yxwz, wx) = P 1(yxwz, wx) = P 1 xz(y, w) (16)

= P 1(y, w | x, z; σX). (17)

Moreover, P (x0[z], z) can be transported from P 2(Z, X, W, Y ), because X, Z / 2, as

P (x0[z], z) = P 2(x0[z], z) = P 2 wy(x0, z) (18)

= P 2(x0, z). (19)

Both Equation (17) and Equation (19) follow from using σ-TR (Algorithm 4 in Appendix B) (Correa & Bareinboim, 2020) to transport P 1 xz(y, w) and P 2 wy(x0, z) from Z.

Given a counterfactual variable Yx some values in x may be causally irrelevant to Y once the rest of x is ﬁxed. In general, a counterfactual Yx can be minimized with a process that we denote with the operator as Yx = Yt, where T = X An(Y )GX and t = x T. For a set of counterfactual variables Y , minimization is done as Y = { Yx | Yx Y }. Moreover, such minimization could reveal repeated portions of a counterfactual event or inconsistencies that make the probability of the event to be zero. We capture these ideas in Algorithm 1.

Algorithm 1 SIMPLIFY(Y , y ) Input: Y a set of counterfactual variables in V and y a set of values for Y . Output: An interventionally minimal event Y = y without redundant subscripts or 0 if the counterfactual event is guaranteed to have probability 0.

1: let Y Y . 2: if there exists Yx Y with two or more different values in y Yx or Yy Y with y Yy = y then return 0. 3: if there exists Yx Y with two consistent values in y Yx or Yy Y with y Yy = y then remove repeated variables from Y and values y . 4: return Y = y .

4.1. Conditional Queries

The counterfactual query of interest could be a conditional one of the form P (y | x ). For this case, there exists a tight reduction from such conditional counterfactual to an unconditional one, described in (Correa et al., 2021).

To perform this reduction, one needs to look at the selection diagram paying special attention to variables after the conditioning bar that are also ancestors of those before. Let X (Wt) = V( X An(Wt)), that is, the primitive variables in X that are ancestors of Wt.

Deﬁnition 4.2 (Ancestral components). Let W be a set of counterfactual variables, X W , and G be a causal diagram. Then the ancestral components induced by W , given X , are sets A1 , A2 , . . . that form a partition over An(W ), made of unions of the ancestral sets An(Wt)GX (Wt), Wt W . Sets An W1[t1] GX (W1[t1]) and An W2[t2] GX (W2[t2]) are put together if they are not

disjoint or there exists a bidirected arrow in G connecting variables in those sets.

Next, we extend the result in (Correa et al., 2021) to the transportability task:

Lemma 4.2 (Conditional-marginal transportability reduction). Let Y , X be two sets of counterfactual variables and let D be the set of variables in the same ancestral component, given X , as any variable in Y , then

d \(y x ) P (V

Dt D Dpad=d) P

Dt D Dpad=d) , (20)

where pad is consistent with t, d and x (Dt), for each Dt D . Moreover, P (y | x ) is transportable from Z iff P (V

Dt D Dpad = d) is transportable from Z.

Example 4.5 (Conditional query simpliﬁcation). Recall the selection diagram in Figure 1 and consider the counterfactual P (yx | zx, x ). While the unconditional probability P (yx, zx, x ) is not transportable due to the inconsistent ctf-factor P (zx, x ), the probability P (yx | zx, x ) is still

Counterfactual Transportability: A Formal Approach

Figure 6: Causal diagram used to compute the ancestral sets of Y = {Yx} given X = {Zx, X}, denoted GX (Yx).

transportable as we can see by using Lemma 4.2. Here Y = {Yx}, X = {Zx, X}, then X (Yx) = {Zx} and D can be computed as the ancestors of Yx in the causal diagram in Figure 6, that is, {Yx}. This gives

P (yx | zx, x ) = P (yxz), (21)

which is equal to P x,z(y) = P (y | x, z) and could be obtained from a suitable Z.

4.2. Nested Queries

The query of interest could involve counterfactuals with interventions that involve other counterfactuals, also called nested counterfactuals . One example of this is the natural direct effect that we described in Example 1.1. Theorem 4 in (Correa et al., 2021) reduces the identiﬁability of a nested counterfactual to that of a non-nested one. We extend this reduction to the transportability case in the following.

Lemma 4.3 (Counterfactual Unnesting). Let b X, b Z be any counterfactual variables (nested or non-nested) sets X, Z V. Then, for Y V disjoint from X and Z such that X An(Y)GZ, P (Yb Z, b X = y) is transportable from

Z, G iff P (Yb Z,x = y, b X = x) is transportable from Z, G for every x, and given by

P (Yb Z, b X=y) = X

x XX P (Yb Z,x=y, b X=x). (22)

Example 4.6 (Conditional and nested). In Example 1.1, one of the target quantities is P (yx ,Zx), which is a nested counterfactual. Using Lemma 4.3 we get

P (yx ,Zx) = X

z P (yx z, zx) (23)

and the problem is reduced to transporting P (yx z, zx) from Z.

5. A Sound and Complete Algorithm for Counterfactual Transportability

Using the factorization of the query described in the previous section and Lemma 3.1, we propose CTFTR (Algorithm 3), an algorithm that determines the transportability of a probability of the form P (y | x ) corresponding to a target domain π from a collection of observational and

Algorithm 2 CTFTRU(Y , y , Z, G )

Input: G = {G i}π Π selection diagrams over V; Y a set of counterfactual variables in V; y a set of values for Y ; and available distribution speciﬁcation Z. Output: P (Y = y ) in terms of available distributions or FAIL if not transportable from G , Z .

1: (Y , y ) SIMPLIFY(Y , y ). 2: let W An(Y ), and let C1 , . . . , Ck be corresponding ctf-factors in G [V(W )]. 3: if inconsistent Ci then return FAIL. 4: for each Ci do 5: Q σ-TR(Ci, Z, G ). 6: if Q is not FAIL then 7: let P Pa(Ci)\Ci(Ci) Q. 8: let c (ci S

Ct Ci pac). 9: let P (Ci = ci ) Q|c. 10: move to the next Ci. 11: end if 12: end for 13: if any P (Ci = ci ) was not transported from Z then return FAIL. 14: return P (Y = y ) P

i P (Ci = ci ).

experimental distributions Z, and a selection diagram G . When the query is transportable, the algorithm outputs an expression for P (y | x ) in terms of the speciﬁed distributions and FAIL if the query is not transportable from such input in G .

In CTFTR, line 1 computes the ancestral components associated with the query and line 2 determines the set D . In line 3 invokes the subroutine CTFTRU to transport the numerator of Equation (20) that is later returned in line 14.

The subroutine CTFTRU can be used to transport nonconditional counterfactuals. Line 1 invokes SIMPLIFY (Algorithm 1) which makes the query interventionally minimal, removes redundant variables, or determines if the event has probability zero. Line 2 computes the ancestral set and ctffactors corresponding to the query and line 3 checks whether any of them is inconsistent, in which case the algorithm fails. The loop in line 4 tries to transport every ctf-factor from the available input. Line 5 calls σ-TR (Algorithm 4 in Appendix B) to transport Q = P Pa(Ci)\Ci(Ci) from Z and G . If successful, line 8 creates a set with the values that are used to evaluate Q (in line 9) so that it is equal to the ctf-factor P (Ci = ci ). Line 10 moves on to the next factor when the current one has been transported. Finally, line 13 fails if any of the ctf-factors is not transportable from the input or line 14 the corresponding expression.

Theorem 5.1 (CTFTR completeness). A counterfactual probability P (y | x ) is transportable from Z and G if and only if CTFTR returns an expression for it. Moreover,

Counterfactual Transportability: A Formal Approach

Algorithm 3 CTFTR(Y , y , X , x , Z, G )

Input: G causal diagram over variables V; Y , X a set of counterfactual variables in V; y , x a set of values for Y and X ; and available distribution speciﬁcation Z. Output: P (Y =y | X =x ) in terms of available distributions or FAIL if not transportable from G , Z .

1: Let A1 , A2 , . . . be the ancestral components of Y X given X . 2: Let D be the union of the ancestral components containing a variable in Y and d the corresponding set of values. 3: let Q CTFTRU(S

Dt D Dpad, d , Z, G ). 4: return P

d \(y x ) Q/ P

CTFTR decides this task in time O(n4z) where n = |V| and z = P Zi Z |Zi|.

6. Conclusions

In this paper, we studied the problem of transporting counterfactual quantities from a combination of observational and experimental distributions obtained from one or more heterogeneous domains. Using a decomposition based on ctf-factors, we characterized the transportability of such factors between domains (Lemma 3.1) and used it to establish a sufﬁcient and necessary graphical condition for the transportability of a given counterfactual query (Lemma 4.1, Theorem 4.1). We considered conditional and nested counterfactuals, and then provided tight reductions for those types of queries (Lemmas 4.2 and 4.3). In Section 5, we developed a sound and complete algorithm (Algorithm 3) for the counterfactual transportability task (Theorem 5.1). In other words, this means that the target counterfactual quantity is not transportable whenever the algorithm returns failure, unless further parametric assumptions are made about the underlying generating model. The problem of generalizing and fusing information across settings is pervasive in the sciences. Many questions in the empirical sciences can be formulated as counterfactual queries, with data coming from observations and experiments in heterogeneous domains, which constitute counterfactual transportability tasks. We hope the language, conditions, and algorithms developed in this paper serve as stepping stones in the modeling and the solution of these problems.

ACKNOWLEDGEMENTS

Elias Bareinboim and Juan D. Correa were supported in part by the NSF, ONR, Do E, Amazon, JP Morgan, and The Alfred P. Sloan Foundation. Sanghack Lee was supported by the New Faculty Startup Fund from Seoul National University.

Avin, C., Shpitser, I., and Pearl, J. Identiﬁability of Path Speciﬁc Effects. In Proceedings of the Nineteenth International Joint Conference on Artiﬁcial Intelligence {IJCAI-05}, pp. 357 363, Edinburgh, UK, 2005. Morgan Kaufmann Publishers.

Bareinboim, E. and Pearl, J. Transportability of causal effects: Completeness Results. In Proceedings of the 26th AAAI Conference on Artiﬁcial Intelligence, CA, 2012a. Department of Computer Science, University of California, Los Angeles.

Bareinboim, E. and Pearl, J. Causal Inference by Surrogate Experiments: z-Identiﬁability. In Murphy, N. d. F. and Kevin (eds.), Proceedings of the Twenty-Eighth Conference on Uncertainty in Artiﬁcial Intelligence, pp. 113 120. AUAI Press, 2012b.

Bareinboim, E. and Pearl, J. A general algorithm for deciding transportability of experimental results. Journal of Causal Inference, 1(1):107 134, 2013.

Bareinboim, E. and Pearl, J. Transportability from Multiple Environments with Limited Experiments: Completeness Results. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 280 288. Curran Associates, Inc., 2014.

Bareinboim, E. and Pearl, J. Causal inference and the datafusion problem. Proceedings of the National Academy of Sciences, 113(27):7345 7352, 2016.

Bareinboim, E., Correa, J. D., Ibeling, D., and Icard, T. On pearl s hierarchy and the foundations of causal inference. In Probabilistic and Causal Inference: The Works of Judea Pearl, pp. 507 556. Association for Computing Machinery, New York, NY, USA, 1st edition, 2022.

Correa, J. D. and Bareinboim, E. From Statistical Transportability to Estimating the Effects of Stochastic Interventions. In Proceedings of the 27th International Joint Conference on Artiﬁcial Intelligence, 2019.

Correa, J. D. and Bareinboim, E. General Transportability of Soft Interventions: Completeness Results. In Advances in Neural Information Processing Systems, volume 33, 2020.

Correa, J. D., Lee, S., and Bareinboim, E. Nested Counterfactual Identiﬁcation from Arbitrary Surrogate Experiments. In Advances in Neural Information Processing Systems, volume 34, 2021.

Dawid, A. P. Inﬂuence diagrams for causal modelling and inference. International Statistical Review, 70:161 189, 2002.

Counterfactual Transportability: A Formal Approach

Dawid, A. P. Statistical Causality from a Decision-Theoretic Perspective. Annual Review of Statistics and Its Application, 2(1):273 303, 2015.

Huang, Y. and Valtorta, M. Pearl s Calculus of Intervention Is Complete. In T.S. Richardson, R. D. a. (ed.), Proceedings of the Twenty-Second Conference on Uncertainty in Artiﬁcial Intelligence, pp. 217 224, Corvallis, OR, 2006. AUAI Press.

Huang, Y. and Valtorta, M. On the completeness of an identiﬁability algorithm for semi-Markovian models. Annals of Mathematics and Artiﬁcial Intelligence, 54(4):363 408, 2008.

Lee, S. and Honavar, V. Causal Transportability of Experiments on Controllable Subsets of Variables: z Transportability. In Nicholson, A. and Smyth, P. (eds.), Proceedings of the Twenty-Ninth Conference on Uncertainty in Artiﬁcial Intelligence (UAI), pp. 361 370. AUAI Press, 2013a.

Lee, S. and Honavar, V. m-Transportability: Transportability of a Causal Effect from Multiple Environments. In des Jardins, M. and Littman, M. (eds.), Proceedings of the Twenty-Seventh National Conference on Artiﬁcial Intelligence, pp. 583 590, Menlo Park, CA, 2013b. AAAI Press.

Lee, S., Correa, J. D., and Bareinboim, E. General Identiﬁability with Arbitrary Surrogate Experiments. In Proceedings of the Thirty-Fifth Conference Annual Conference on Uncertainty in Artiﬁcial Intelligence, Corvallis, OR, 2019. AUAI Press.

Lee, S., Correa, J. D., and Bareinboim, E. Generalized Transportability: Synthesis of Experiments from Heterogeneous Domains. In Proceedings of the 34th AAAI Conference on Artiﬁcial Intelligence, Menlo Park, CA, 2020. AAAI Press.

Pearl, J. A probabilistic calculus of actions. In de Mantaras, R. L. and D. Poole (eds.), Uncertainty in Artiﬁcial Intelligence 10, pp. 454 462. Morgan Kaufmann, San Mateo, CA, 1994.

Pearl, J. Causal diagrams for empirical research. Biometrika, 82(4):669 688, 1995.

Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, NY, USA, 2nd edition, 2000.

Pearl, J. Direct and indirect effects. In Proceedings of the Seventeenth Conference on Uncertainty in Artiﬁcial Intelligence, pp. 411 420. Morgan Kaufmann, San Francisco, CA, 2001.

Pearl, J. and Bareinboim, E. Transportability of Causal and Statistical Relations: A Formal Approach. In Proceedings of the Twenty-Fifth Conference on Artiﬁcial Intelligence (AAAI-11), pp. 247 254, Menlo Park, CA, 8 2011.

Pearl, J. and Mackenzie, D. The Book of Why. Basic Books, New York, 2018.

Powdthavee, N., Lekfuangfu, W. N., and Wooden, M. The Marginal Income Effect of Education on Happiness: Estimating the Direct and Indirect Effects of Compulsory Schooling on Well-Being in Australia. Technical Report 7365, Institute for the Study of Labor (IZA), Bonn, 2013.

Shpitser, I. and Pearl, J. Identiﬁcation of Joint Interventional Distributions in Recursive semi-Markovian Causal Models. In Proceedings of the Twenty-First AAAI Conference on Artiﬁcial Intelligence, volume 2, pp. 1219 1226, 2006.

Shpitser, I. and Pearl, J. What Counterfactuals Can Be Tested. In Proceedings of the Twenty-Third Conference on Uncertainty in Artiﬁcial Intelligence, pp. 352 359. AUAI Press, Vancouver, BC, Canada, 2007.

Tian, J. Identifying Dynamic Sequential Plans. In In Proceedings of the Twenty-Fourth Conference Annual Conference on Uncertainty in Artiﬁcial Intelligence (UAI-08), pp. 554 561, Corvallis, Oregon, 2008. AUAI Press.

Tian, J. and Pearl, J. A General Identiﬁcation Condition for Causal Effects. In Proceedings of the Eighteenth National Conference on Artiﬁcial Intelligence (AAAI 2002), pp. 567 573, Menlo Park, CA, 2002a. AAAI Press/The MIT Press.

Tian, J. and Pearl, J. On the Testable Implications of Causal Models with Hidden Variables. Proceedings of the Eighteenth Conference on Uncertainty in Artiﬁcial Intelligence (UAI-02), pp. 519 527, 2002b.

Tian, J. and Pearl, J. A General identiﬁcation condition for causal effects. Technical Report R-290-A, Department of Computer Science, University of California, Los Angeles, CA, 2003.

Counterfactual Transportability: A Formal Approach

A. Graphical Condition for Transportability

Lemma 3.1 (Counterfactual Factor Transportability). Let G i be the selection diagram based on Gi and i, and let P (w ) be a ctf-factor, then P (w ) = P i(w ) if G i does not contain selection nodes Svi pointing to any variable in Vi W, that is, Vi / i.

Proof. A ctf-factor can be expressed, as3

{u|W (u)=w } P (u). (24)

The predicate W (u) = w depends on the functions fw, W W and W (u) only depends on P (U(W)), where U(W) = S

W W Uw. Since no Vi W i, we have f w = f i w and P (U(W)) = P i(U(W)). Then, it follows that Equation (24) is also equal to X

{u|W (u)=w } P i(u) = P i(w ). (25)

Lemma A.1. Suppose P (W = w ) is not transportable from a set of available distributions in a causal diagram G. Let A1[t1], A2[t2] W such that A2[t2] is a function of A1[t1]. Then P

a1 P (W = w ) is not identiﬁable from the same input either.

Proof. Let M(1) and M(2) be the two sets of models witnessing the non-transportability of P (W = w ), they agree on available distributions, but for w we have P (1)(W = w ) = α, P (2)(W = w ) = β with α = β. Assume, without loss of generality that α > β. We will extend the strategy used by (Huang & Valtorta, 2008) to construct two sets of models M(1) and M(2) where the domain of A2 is XA2 {0, 1}, where XA2 is the domain of A2 in M1, M2. Let F(A1) be a probability function from XA1 to {0, 1}, such that P(F(a1) = k) > 0, k = 0, 1 and P(F(a1) = 0) = 1 P(F(a1) = 1). In M(i) , i = 1, 2 we deﬁne the new function for A2 such that:

P (i) ((a2, k) | paa2, ua2) = P (i)(a2 | paa2, ua2)P(F(a1)=k). (26)

For any other Vj V \ {A2} we keep the same function such that P (i) (vj|paj, uj) = P (i)(vj|paj, uj). In particular, the functions of variables that are children of A2 simply ignore the new bit in A2 s domain. We can verify that for any counterfactual Z on which M(1) and M(2) agree, we have

P (i) (Z = z ) = X

d \z P (i) (D = d ), (27)

where D = An(Z ) and d is the union of z and the indexing values of the sum. This can be written in the form of a ctf-factor as

P (i) (Z = z ) = X

Dt D Dpad = d

and then sum over U

P (i) (Z = z ) = X

u,d \z P (i)

Dt D Dpad = d | u

P (i) (u). (29)

Note that P (i) (u) = P (i)(u). Also, once U has been ﬁxed, any Dpad becomes deterministic and independent of any other variable, as it only depends on Pad and Ud. Then we can write

Dt D Dpad = d | u

Dt D P (i) (Dpad = d | u) = Y

Dt D P (i) (d | pad, u), (30)

3see (Correa et al., 2021) for details on computing counterfactual probabilities from the SCM

Counterfactual Transportability: A Formal Approach

For any D = A2 we have P (i) (d | pad, u) = P (i)(d | pad, u). Then P (i) (Z = z ) = P Mi(Z = z ), if D does not contain any A2.

By construction, the factor for A2 is given by Equation (26). Hence if D does contain some instance of A2 and d has some (a2, k),

P (i) (Z = z ) = X

Dt D Dpad = d

P(F(a1) = k). (31)

If (a2, k) / z, we can sum out k to obtain P (i) (Z = z ) = P (i)(Z = z ). If a1 / (d \ z ) then the expression becomes

P (i) (Z = z ) = P(F(a1) = k) X

Dt D Dpad = d

= P(F(a1) = k)P (i)(Z = z ). (32)

For any input distribution P k(V; σT), we have

P k(V; σT) = P k t (V \ T) Y

T T P(T | Pat; σT) = P k(

Vi V\T Vi[t]) Y

T T P(T | Pat; σT), (33)

either A1[t] appears in the expression and does not have an index in d \ z , or A1 T so that no instance of A1 is an ancestor of any other variable in the expression. Hence any experimental distribution of the form P(V; σT), given as input (T Z), will match in M(1) and M(2) as it matches in M(1) and M(2).

Consider the assignment w \ {a2}, (a 2, 0), by construction we have

P (i) (W = (w \ {a2}) (a2, 0)) = P (i)(W = w )P(F(a1) = 0) (34)

For a1 w , a 1 = a1, let P(F(a1) = 0) = 1/2 and P(F(a 1) = 0) = (α β)/4. Then X

a 1 P (i)(W = (w \ {a1, a2}) {a 1, (a2, 0)})

a 1 P (i)(W = (w \ {a1, a2}) {a 1, a2})P(F(a 1) = 0) (35)

For M(1) this means X

a 1 P (1) (W = (w \ {a1, a2}) {a 1, (a2, 0)})

a1 =a 1 P (1)(W = (w \ {a1, a2}) {a 1, a2}) (36)

As for M 2: X

a 1 P (2) (W = (w \ {a1, a2}) {a 1, (a2, 0)})

a1 =a 1 P (2)(W = (w \ {a1, a2}) {a 1, a2}) (38)

Then, M(1) and M(2) are compatible with G, match in the available distributions and yield different P

a1 P (W = w ).

Counterfactual Transportability: A Formal Approach

Lemma 4.1 (Transportability of the sum over an ancestral set). Let P (d ) be a ctf-factor and let Y D be such that D = An(Y ). Then, P d \y P (d ) is transportable from Z iff P (d ) is transportable from Z.

Proof. If P (W = w ) is transportable, then P

w \y P (W = w ) is clearly transportable too. Since W is ancestral with respect of Y , every variable in W \ Y must have a children in W , that is, another variable that is a function of it. Then, we can use Lemma A.1 in a topological order for the variables in W \ Y to claim that if P (W = w ) is not identiﬁable then P

w \y P (W = w ) is also not identiﬁable.

Lemma A.2 (Inconsistent factor non-identiﬁability (Correa et al., 2021)). Let P(W = w ) be an inconsistent ctf-factor. Then, P(W = w ) is not identiﬁable from any set of interventional distributions.

Theorem 4.1 (Transportability from Z). A ctf-factor P (w ) is transportable from Z only if it is consistent. If consistent, let W = V(W ) and W = Pa(W) \ W; then P (w ) is equal to P w (w) where w and w are consistent with w S

{Wpaw W } paw, and P (w ) is transportable from Z iff P w (w) is transportable from Z.

Proof. Let z = w S {Wpaw W } paw. Since the factor is consistent, z contains at most one value per observable. We can write P (W = w) = P (V

{Wpaw W } Wpaw = w) where every paw is consistent with z.

Let W W be such that Paw W = , then Paw W and Wpaw = Ww where w has any valid value for W \ Z. For a W with parents in W, suppose every T Paw \ W already appears as Tw . Then, we can use composition to make W[pa W w ] = Ww . Applying this reasoning from parent to children in W we get

{Wpaw W } Ww = w = P w (w). (41)

The only if portion of the statement follows from Lemma A.2 because even if P (w ) cannot be identiﬁed from any subset of interventional distributions in the target domain π , it will certainly not be transportable from interventional distributions in other domains, even if they can be transported to π .

A.1. Conditional Counterfactual Transportability

Before the proof of Lemma 4.2, we prove some auxiliary lemmata.

Lemma A.3. Let A, B and C be binary random variables causally related as given by the chain A B C. Suppose P(B = 1 | A = 1) = α and P(B = 1 | A = 0) = 1 α, for some α [0, 1]. Then, for any β such that |1/2 β| |1/2 α| there is always a function f C such that P(C = 1 | A = 1) = β, P(C = 1 | A = 0) = 1 β and P(A, B, C) is a positive distribution.

Proof. Let P(C = 1 | B = 1) = x and P(C = 1 | B = 0) = 1 x, then

β = P(C = 1 | A = 1) = X

b P(C = 1 | b)P(b | A = 1) = 1 α + x(2α 1) (42)

x = α + β 1

2α 1 . (43)

Since x must belong to the interval (0, 1), we can bound β (1 α, α) if α 1/2 and β (α, 1 α) if α > 1/2. Both conditions are satisﬁed when |1/2 β| |1/2 α| as assumed, so any solution x is a valid probability.

Then, we can deﬁne the function for C as

f C = B Uc, (44)

where Uc is a binary unobservable, P(Uc = 0) = x and is the logical xor operator.

Lemma A.4. Let A and B be two variables in a causal graph where A An An 1 A1 C B1 Bm 1 Bm B. The variables A1, . . . , An, B1, . . . , Bm are observable, C could be observable or unobservable

Counterfactual Transportability: A Formal Approach

and m, n are nonnegative integers. Then we can deﬁne functions for all variables involved such that they are binary and

( 1 2γ if a = b 1 2(1 γ) otherwise , (45)

for any γ (0, 1).

Proof. First, if C is unobservable set P(C = 1) = 1/2, else deﬁne an unobservable Uc with P(Uc = 1) = 1/2 and let f C = Uc. Let α, β (0, 1) be parameters to decide later.

If n = 0 deﬁne f A such that P(A = 1 | C = 1) = α, P(A = 1 | C = 0) = 1 α. Similarly, if m = 0 deﬁne f B such that P(B = 1 | C = 1) = α, P(B = 1 | C = 0) = 1 α.

Suppose n > 0, then we will deﬁne the functions for A1, A2, . . . , An, A such that P(Ai = 1 | C = 1) gets closer to α as i increases. If α < 1/2, set f A1 such that P(A1 = 1 | C = 1) = α/(n + 1), P(A1 = 1 | C = 0) = 1 α/(n + 1). Then use lemma A.3 to deﬁne f Ai, i = 1, . . . , n and f A such that P(Ai = 1 | C = 1) = iα/(n + 1), P(Ai = 1 | C = 0) = 1 iα/(n + 1) and ﬁnally P(A = 1 | C = 1) = (n + 1)α/(n + 1) = α, P(A = 1 | C = 1) = 1 α.

If α > 1/2 use the same strategy but starting from P(A1 = 1 | C = 1) = 1 (1 α)/(n + 1) and decreasing as P(Ai = 1 | C = 1) = 1 i(1 α)/(n + 1), to obtain P(A = 1 | C = 1) = 1 (n + 1)(1 α)/(n + 1) = α.

The same procedure is applied for B1, . . . , Bm, B to obtain P(B = 1 | C = 1) = β, P(B = 1 | C = 0) = 1 β.

P(A = 1, B = 1) = X

c P(A = 1 | c)P(B = 1 | c)P(c) (46)

2[αβ + (1 α)(1 β)] (47)

P(A = 0, B = 0) = 1

2[(1 α)(1 β) + αβ] (48)

P(A = 0, B = 1) = 1

2[(1 α)β + α(1 β)] (49)

P(A = 1, B = 0) = 1

2[α(1 β) + (1 α)β]. (50)

If γ < 1/2 make β = 1 α and α = 1 1 2γ

2 . If γ 1/2, let β = α and α = 1 2γ 1

2 . It is just a matter of algebra to verify that P(A, B) results in the intended distribution.

Lemma A.5. Let W be a set of counterfactual variables, X , and let A1 , A2 , . . . be the ancestral components of W given X . Then

P(W = w ) = Y

At Aj Apaa = a , (51)

where paa is consistent with t and aj , for each At Aj .

Proof. For the sake of simplicity let X = X . First, note that by deﬁnition A1 , A2 , . . . must be mutually disjoint and their union is equal to A = An(W ). Then, we can write

P(W = w ) = X

j Aj = aj . (52)

Furthermore, no variable in two distinct ancestral components could be in the same c-component in G[V(A )]. Hence, we have

P(W = w ) = X

At Aj Apaa = a . (53)

Counterfactual Transportability: A Formal Approach

To prove the result, it sufﬁces to show that this expression can be factorized such that the sum over each aj \ w only affects factor j. For this, we argue that for At Aj no value of paa or a appears in ak \ w for any k = j. To see why this is the case, assume for the sake of contradiction that (i) there is some value in paa that appears in such a ak \ w or (ii) a appears in the same set. For case (i), this implies that there is an ancestor of At (not in X ) which is also an ancestor of some variable in component Ak . For case (ii), this means that At / X is an ancestor of some variable in component Ak . In both cases would imply that Aj and Ak are not disjoint, a contradiction.

Lemma 4.2 (Conditional-marginal transportability reduction). Let Y , X be two sets of counterfactual variables and let D be the set of variables in the same ancestral component, given X , as any variable in Y , then

d \(y x ) P (V

Dt D Dpad=d) P

Dt D Dpad=d) , (20)

where pad is consistent with t, d and x (Dt), for each Dt D . Moreover, P (y | x ) is transportable from Z iff P (V

Dt D Dpad = d) is transportable from Z.

P (Y = y | X = x ) = P (Y = y , X = x ) P

y P (Y = y , X = x ) (54)

Let A1 , A2 , . . . be the ancestral components of Y X , given X . Then, by Lemma A.5

P (Y = y , X = x ) = Y

aj \(y x ) P

At Aj Apaa = a . (55)

Let Yr Y and y its corresponding value in y . If there exists any variable At A such that y t, Yr and At must be in the same Aj , as the former is then an ancestor of the latter. Let Ay be the union of ancestral components that contain ancestors of Y , and let Ay be the rest. Then, if we sum over y in the right-hand-side of Equation (55), the factors with variables in Ay can taken out of the sum.

P (Y = y | X = x )

P ay \x P V At Ay Apaa = a P ay \(y x ) P V At Ay Apaa = a

At Ay Apaa = a P

ay \(y x ) P V

At Ay Apaa = a (56)

After cancelling out the ﬁrst factor, simplifying the expression, and deﬁning D = Ay , we obtain Equation (20).

For simplicity, let G be the same as G after removing all edges out of V(X D ) and any edge out of V(Y An(X )). G and G have the same c-component structure and the same ancestral relationships; therefore and by the same reasoning, the query is not transportable from G , Z either. For the rest of the proof, let G denote G instead. Clearly, if P (V

Dt D Dpad = d) is transportable, so it is P (Y = y | X = x ). For the second part of the statement, suppose this query is not transportable and let M(1), M(2) be two sets of models witnessing this fact.

For simplicity, let

ρ(y, x) = X

d \(y x ) P

Dt D Dpad = d (57)

P (Y = y | X = x ) = ρ(y , x ) P

y ρ(y , x ). (58)

If ρ(x) = P

y ρ(y, x) is transportable then P (Y = y | X = x ) must be non-transportable, else

ρ(y, x) = P (Y = y | X = x )ρ(x), (59)

Counterfactual Transportability: A Formal Approach

contradicting the assumption that the ρ(y, x) is not identiﬁable. Therefore, we can further assume ρ(x) is not identiﬁable, for the rest of the argument.

Let (y , x ) be an assignment such that ρ(y , x ) differs between M(1) and M(2), then let

X Y ρ(1)(X, Y) ρ(2)(X, Y)

x y a b x = y c d = x y e f = x = y g h

Due to the non-transportability of ρ(y, x) we have a = b and without loss of generality we can assume a > b. Similarly, due to the non-transportability of ρ(x) we have a + c = b + d. For M(1) and M(2):

ρ(1)(y | x ) = a a + c (60)

ρ(2)(y | x ) = b b + d. (61)

These probabilities are equal if and only if ad = bc. Hence, if they are not equal, we are done as ρ(1)(y | x ) = ρ(2)(y | x ). If they are equal, let Xt X D be such that Xt and some Yr Y have a common ancestor or they have ancestors sharing a bidirected edge. Assume Xt is such that no other element in X is closer to Yr in the corresponding path in G. Let p be the path in G that goes from X to Y , passing to the common ancestor or bidirected alluded before. Add a bit to every variable in p and denote them with subscript p. Deﬁne independent functions for the bits which we will parametrize later.

We deﬁne two new sets of models, M(1) and M(2) , based on M(1) and M(2). For every variable in p except for X and Y , append the corresponding extra bit, deﬁned in p, to the original variables in M(1) and M(2). Rename X and Y as e X, e Y and make them unobservable, then deﬁne X in the new models with functions:

( x if Xp = 1 e X otherwise, (62)

where Xp is unobservable too and x = x (X), for which the query disagrees in M(1) and M(2).

Analogously, deﬁne

( y if Yp = 1 e Y otherwise. (63)

According to our choice of X, there exists Z p An(X) An(Y ) (possibly Y itself) or there exists Z1, Z2 p with Z1 An(X), Z2 An(Y ) and there is an edge Z1 L9999K Z2 in p. For the parametrization of the extra bits in p deﬁne a new unobservable U and let Zp = U if the common ancestor is observable or let U be the unobservable parent of Z1p and Z2p in the second. Notice that Z1 and Z2 may be equal to X and Y themselves.

Using lemma A.4 we will parametrize p such that

X0 Y0 P(X0, Y0)

1 0 1 2(1 γ)

0 1 1 2(1 γ)

for some γ (0, 1) that we will pick later.

Counterfactual Transportability: A Formal Approach

Claim A.1 (Disagreement on the query). M(1) and M(2) disagree on the query for any γ such that c d = [(a + c + 1)h (b + d + 1)g](1 γ).

Proof. For M(1) and M(2) we have

ρ (x , y ) = X

xp,yp,ex,ey ρ (x , y , ex, ey | xp, yp)P(xp, yp). (64)

Going over each possible combination of Xp and Yp, we get

ρ (x , y ) = ρ(X = x , Y = y )P(Xp = 0, Yp = 0) (65)

+ ρ(Y = y )P(Xp = 1, Yp = 0) (66)

+ ρ(X = x )P(Xp = 0, Yp = 1) (67)

+ P(Xp = 1, Yp = 1). (68)

ρ (x ) = ρ(X = x )P(Xp = 0) + P(Xp = 1). (69)

ρ(1) (y | x ) =

2(a + e)(1 γ) + 1

2(a + c)(1 γ) + 1

1 2(a + c) + 1

= aγ + (2a + c + e)(1 γ) + γ

a + c + 1 (71)

= a (a + c + e)(γ 1) + γ

a + c + 1 (72)

= a + (a + c + e)(1 γ) (1 γ) + 1

a + c + 1 (73)

= a (1 (a + c + e))(1 γ) + 1

a + c + 1 (74)

= a g(1 γ) + 1

a + c + 1 (75)

Analogously for M(2) :

ρ(2) (y | x ) = b h(1 γ) + 1

b + d + 1 (76)

Those two are equal if and only if

ab bg(1 γ) + b + ad dg(1 γ) + d + a g(1 γ) + 1

= ab ah(1 γ) + a + bc ch(1 γ) + c + b h(1 γ) + 1 (77)

bg(1 γ) + ad dg(1 γ) + d + g(1 γ)

= ah(1 γ) + bc ch(1 γ) + c + h(1 γ) (78)

Recall that we have ad = bc, which also implies that c = d, else a = b which is a contradiction. Then, the condition for equality can be further simpliﬁed as

c d = [(a + c + 1)h (b + d + 1)g](1 γ). (79)

The left hand side is nonzero, all a, b, c, d, g and h are ﬁxed, and γ is a free parameter. Therefore, as long as we pick a γ such that the equality doesn t hold, we get that ρ (y | x ) = P (Y = y | X = x ) does not match in M(1) and M(2) .

Counterfactual Transportability: A Formal Approach

Claim A.2 (Agreement on given distributions). Let Z V be any subset of observable variables and let σZ Zk Z. If M(1) and M(2) agree on P k(V; σZ), then M(1) and M(2) also agree on P k(V; σZ).

Proof. For simplicity we omit the superscript k for the domain, which is ﬁxed with σZ. The superscript (1) and (2) indicate to which of the sets of models under consideration the expression refers to.

As the input distributions are in layer 2, let us use c-factor notation, where each c-factor corresponds to a causal effect. For any C V, the quantity Q[C](v) is the c-factor of C and denotes the following function

Q[C](v) = X

{i|Vi C} P k(vi | pai, ui)P k(u(C)), (80)

where U(C)= S

Let C1, C2, . . . be the C-components of GσZ. By assumption we have Q(1)[V; σZ] = Q(2)[V; σZ], and since any Q[Cj; σZ] is identiﬁable from Q[V; σZ] (Tian & Pearl, 2002b), it follows Q(1)[Cj; σZ] = Q(2)[Cj; σZ] for any Cj not intersecting Z.

M(k) is identical to M(k), k = 1, 2, except for the functions of the observables in the path p. For any variable T not in p, but with a parent on it, the function f T remains the same and it simply ignores the extra bit that its parent has in M(k) .

Let Cj be a c-component containing some set of variables R in p different to X and Y (the endpoints of p). First, by deﬁnition

Q(k) [Cj; σZ](v) = X

Vi Cj P (k) (vi | pai, ui; σX)P (k) (u(Cj)). (81)

For any S / p, R R, X and Y that could be in Cj in M(k) , their corresponding factors in the previous expression can be re-written in terms of probabilities of M(k), as follows:

P (k) (s | pas, us; σX) = P (k)(s | pas, us; σX), (82)

P (k) (r | par, ur; σX) = P (k)(r | par, ur; σX)P(rp | (par)p), (83)

P (k) (y | pay, uy; σX) = P (k)(y | pay, uy; σX)P(Yp = 0 | (pay)p)

+ 1[y = y ]P(Yp = 1 | (pay)p)), (84)

P (k) (w | paw, uw; σX) = P (k)(w | paw, uw; σX)P(Wp = 0 | (paw)p)

+ 1[w = w ]P(Wp = 1 | (paw)p)). (85)

It follows that

Q(k) [Cj; σZ](v) =

R R P(rp | (par)p)

Q(k)[Cj; σZ](v)P(Yp = 0 | (pay)p)P(Wp = 0 | (paw)p)

+ Q(k)[Cj \ {Y }; σZ](v)P(Yp = 1 | (pay)p)P(Wp = 0 | (paw)p)1[y = y ] (86)

+ Q(k)[Cj \ {W}; σZ](v)P(Yp = 0 | (pay)p)P(Wp = 1 | (paw)p)1[w = w ]

+ Q(k)[Cj \ {W, Y }; σZ](v)P(Yp = 1 | (pay)p)P(Wp = 1 | (paw)p)1[w = w , y = y ] .

Since X and Y have no descendants in G,

Q(k)[Cj \ {W, Y }; σZ](v) = X

w,y Q(k)[Cj; σZ](v) (87)

Q(k)[Cj \ {W}; σZ](v) = X

w Q(k)[Cj; σZ](v) (88)

Q(k)[Cj \ {Y }; σZ](v) = X

y Q(k)[Cj; σZ](v), (89)

Counterfactual Transportability: A Formal Approach

all match between M(1) and M(2). Consequently, every c-factor in the right-hand side of (86) is the same in those models, and since every other term is also the same in both M(1) and M(2) , we conclude that Q(1) [Cj; σZ](v) = Q(2) [Cj; σZ](v), which in turn implies our claim since

P (k) (v; σZ) = Y

j Q(k) [Cj; σZ](v). (90)

In summary, M(1) and M(2) induce G and matching Z, yet they differ on the value for P (Y = y | X = x ), proving the non-identiﬁability of the query.

A.2. Nested Counterfactual Transportability

Lemma 4.3 (Counterfactual Unnesting). Let b X, b Z be any counterfactual variables (nested or non-nested) sets X, Z V. Then, for Y V disjoint from X and Z such that X An(Y)GZ, P (Yb Z, b X = y) is transportable from Z, G iff

P (Yb Z,x = y, b X = x) is transportable from Z, G for every x, and given by

P (Yb Z, b X=y) = X

x XX P (Yb Z,x=y, b X=x). (22)

Proof. (soundness) From the model we have: P (Yb Z, b X = y) = P

{u|Yb Z, b X(u)=y} P (u). In Mb Z b X, for any given u, the

variables in X take a particular value-vector x = b X(u). Then, we can partition {u | Yb Z, b X(u) = y} as S x XX{u |

Yb Z,X=x(u) = y, b X(u) = x}. Hence

P (Yb Z, b X = y) = X

{u|Yb Z,x(u)=y, b X(u)=x} P (u) = X

x XX P (Yb Z,x = y, b X = x). (91)

(necessity) Suppose P (Yb Z,x = y, b X = x) is not transportable from Z. Since X An(Y)GZ there exist Xr b X and a set of variables Db Z such that a path X D1 D2 Dk Y exists in G for some Yb Z Y with k 0. This implies that D1[b Z] is a function of Xr, Di+1[b Z] is a function of Di[b Z] and Yb Z is a function of Dk[b Z]. Then, we can use Lemma A.1 from X to Y following this path to prove the result.

B. Algorithm

B.1. Soundness and Completeness

Lemma B.1 (CTFTRU completeness for non-conditional, non-nested queries). A counterfactual probability P (y ) is transportable from Z and G if and only if CTFTRU returns an expression for it.

Proof. (soundness) The algorithm starts by simplifying the target event by removing irrelevant subscripts and looking for inconsistent or redundant events (line 1). Line 2 ﬁnds the relevant ctf-factors consisting of a single c-component, licensed by the properties of ancestral sets and ctf-factors. As long as the factors are consistent, and allowed by Theorem 4.1, lines 7-10 try to transport of the causal effect P Pa(Ci)\Ci(Ci) from the available distributions employing the algorithm σ-TR (Algorithm 4) as a subroutine. Then, the factor P (Ci = ci ) is transported by evaluating P Pa(Ci)\Ci(Ci) with the appropriate values according to Theorem 4.1. If all the ctf-factors are consistent and transportable, the algorithm returns the query as in line 14. This expression is equal to the original query.

(necessity) The procedure fails if any of the factors P (Ci = ci ) is inconsistent or not transportable from Z. If inconsistent, Theorem 4.1 ascertains the non-transportability of the factor from Z. On the other hand, σ-TR (in line 5) failing to obtain P Pa(Ci)\Ci(Ci) from Z implies the non-transportability of P Pa(Ci)\Ci(Ci) from Z (Correa & Bareinboim, 2020). That is, there exist two sets of models that agree on all distributions given by Z but differ on P Pa(Ci)\Ci(Ci) for a particular set

Counterfactual Transportability: A Formal Approach

Algorithm 4 σ-TR(Ci, Z, G )

Input: A set of variables Ci that form a single c-component in G, a set of distributions Z, and a selection diagram G . Output: An expression for the c-factor P Pa(Ci)\Ci(Ci) in terms of distributions available in Z or FAIL if this distribution is not transportable.

1: for each Z Zk Z s.t. Ci Z= and Ci k= do 2: let Bi be the c-component of Gk σZ such that Ci Bi. 3: compute PV\Bi(Bi) from PZ(V). 4: let Q IDENTIFY(Ci, Bi, PV\Bi(Bi), Gk σZ). 5: if Q is not FAIL then 6: return Q. 7: end if 8: end for 9: return FAIL

of values v. With a proper encoding of the values, consistent with (ci S Ct Ci pac), this is also a counterexample for P (Ci = ci ).

If any P (Ci = ci ) is not transportable, then P (W = w ) is also not transportable. This is because if the latter were transportable, the former can be obtained from it; a contradiction. Moreover, Lemma 4.1 then implies that P (Y = y ) is not identiﬁable.

Theorem 5.1 (CTFTR completeness). A counterfactual probability P (y | x ) is transportable from Z and G if and only if CTFTR returns an expression for it. Moreover, CTFTR decides this task in time O(n4z) where n = |V| and z = P

Proof. (Soundness and completeness) Lines 1 and 2 determine the set of variables and values D and d as speciﬁed by Lemma 4.2 to reduce the conditional query to a marginal one. Then, line 3 invokes CTFTRU to identify this quantity. If CTFTRU succeeds, the returned expression is equal to the conditional query. If it fails, Lemma 4.2 implies that the conditional query is also not transportable.

(Efﬁciency) Let n = |V| and z = P

πi |Zi|. Operations such as computing the set of ancestors, ancestral components or c-components in a graph can be done in O(n2) time. The number of sets of ancestors and c-components is O(n), hence the ﬁrst two operations in CTFTR take at most O(n3) time. CTRTRU is called once and the ﬁrst step, that uses SIMPLIFY, takes at most O(n3) time because the simpliﬁcation is done by computing sets of ancestors and then performing checks that take O(n2) time. There are O(n) ctf-factors to process, so the total number of times the for-loop in CTFTRU could execute, calling σ-TR is O(n). Besides some O(n2) operations, σ-TR may call IDENTIFY up to z times. IDENTIFY recursively reduces the input c-factor at least by one variables each time, and the operations used can be performed in O(n2) so it takes O(n3) time per call to return an expression or FAIL. Consequently, CTRTRU takes O(n4z) time, which is also the overall time for CTFTR.

B.2. Transportability algorithm for interventional distributions

Our algorithm CTFTRU relies on a subroutine to trasnport a given interventional distribution from Z. For this purpose we employ σ-TR (Correa & Bareinboim, 2020) which is shown in Algorithm 4. In turn, σ-TR uses the IDENTIFY algorithm (Tian & Pearl, 2002a), shown in Algorithm 5.

Counterfactual Transportability: A Formal Approach

Algorithm 5 IDENTIFY(C, T, Q, G)

Input: C T V, Q = Q[T] and graph G. Assuming G[C] and G[T] are composed of a single c-component. Output: Expression for Q[C] in terms of Q or Fail.

1: Let A An(C)G[T]. 2: if A = C then return Q[C] = P

t\c Q. 3: if A = T then return FAIL. 4: if A = C then 5: Let T be the C-component containing C in G[A]. 6: Compute Q[T ] from Q[A] = P

t\a Q. 7: return Identify(C, T , Q[T ], G). 8: end if