# Nested Counterfactual Identification from Arbitrary Surrogate Experiments

Juan D. Correa (Columbia University, jdcorrea@cs.columbia.edu), Sanghack Lee (Seoul National University, sanghack@snu.ac.kr), Elias Bareinboim (Columbia University, eb@cs.columbia.edu)

The Ladder of Causation describes three qualitatively different types of activities an agent may be interested in engaging in, namely, seeing (observational), doing (interventional), and imagining (counterfactual) (Pearl and Mackenzie, 2018). The inferential challenge imposed by the causal hierarchy is that data is collected by an agent observing or intervening in a system (layers 1 and 2), while its goal may be to understand what would have happened had it taken a different course of action, contrary to what factually ended up happening (layer 3). While there exists a solid understanding of the conditions under which cross-layer inferences are allowed from observations to interventions, the results are somewhat scarcer when targeting counterfactual quantities. In this paper, we study the identification of nested counterfactuals from an arbitrary combination of observations and experiments. Specifically, building on a more explicit definition of nested counterfactuals, we prove the counterfactual unnesting theorem (CUT), which allows one to map arbitrary nested counterfactuals to unnested ones. For instance, applications in mediation and fairness analysis usually evoke notions of direct, indirect, and spurious effects, which naturally require nesting. Second, we introduce a sufficient and necessary graphical condition for counterfactual identification from an arbitrary combination of observational and experimental distributions. Lastly, we develop an efficient and complete algorithm for identifying nested counterfactuals; if the algorithm fails to return an expression for a query, then the query is not identifiable.

1 Introduction

Counterfactuals provide the basis for notions pervasive throughout human affairs, such as credit assignment, blame and responsibility, and regret. One of the most powerful constructs in human reasoning, the "what if?" question, evokes hypothetical conditions usually contradicting the factual evidence. Judgment and understanding of critical situations, from medicine to psychology to business, involve counterfactual reasoning, e.g.: "Joe received the treatment and died; would he be alive had he not received it?", "Had the candidate been male instead of female, would the decision from the admissions committee be more favorable?", or "Would the profit this quarter remain within 5% of its value had we increased the price by 2%?". By and large, counterfactuals are key ingredients that go into the construction of explanations about why things happened as they did [17, 19].

The structural interpretation of causality provides proper semantics for representing counterfactuals [17, Ch. 7]. Specifically, each structural causal model (SCM) M induces a collection of distributions related to the activities of seeing (called observational), doing (interventional), and imagining (counterfactual). The collection of these distributions is known as the Ladder of Causation [19], and has also been called Pearl's Causal Hierarchy (PCH, for short) [2]. The PCH is a containment hierarchy; each type of distribution can be put in increasingly refined layers: observational content goes in layer 1; experimental in layer 2; counterfactual in layer 3; see Fig. 1.
Figure 1: Every SCM induces different quantities in each layer of the PCH: the structural causal model (unobserved nature) induces P(X, Y) (layer 1), P(Y | do(X)) (layer 2), and P(Y_x | x', y') (layer 3).

It is understood that if we have all the information in the world about layer 1, there are still questions about layers 2 and 3 that are unanswerable, or technically undetermined; further, if we have data from layers 1 and 2, there are still questions in the world about layer 3 that are underdetermined [17, 2]. The inferential challenge in these settings arises because the generating model M is not fully observed, nor is data from all of the layers necessarily available, perhaps due to the cost or the infeasibility of performing certain interventions. One common task found in the literature is to determine the effect of an intervention of a variable X on an outcome Y, say P(Y | do(X)) (layer 2), using data from observations P(V) (layer 1), where V is the set of observed variables, and possibly other interventions, e.g., P(V | do(Z)). Also, qualitative assumptions about the system are usually articulated in the form of a causal diagram G. This setting has been studied in the literature under the rubric of non-parametric identification from a combination of observations and experiments. Multiple solutions exist, including Pearl's celebrated do-calculus [16], and other increasingly refined methods that are computationally efficient, sufficient, and necessary [25, 8, 20, 26, 21, 10, 3, 13, 12].¹

There is a growing literature about cross-layer inferences from data in layers 1 and 2 to quantities in layer 3. For example, a data scientist may be interested in evaluating the effect of an intervention on the group of subjects who receive the treatment instead of those randomly assigned to it. This measure is known as the effect of treatment on the treated [9, 17], and there exists a graphical condition for mapping it to a (layer 2) causal effect [23]. Further, there are also results on the identification of path-specific effects, which are counterfactuals that isolate specific paths in the graph [18, 1]. In particular, [24] provides a sufficient and necessary algorithm for identification of these effects from observational data, and [28] provides identification conditions from observational and experimental data in general canonical models. Further, [22] studied counterfactual identification under the assumption that all experimental distributions (i.e., over every subset of the observed variables) are available.²

In this paper, our goal is to identify the probability distribution of (possibly nested) counterfactual events from an arbitrary combination of user-specified observational and experimental distributions. To the best of our knowledge, this provides the first general treatment of nested counterfactual identification from arbitrary data collections. Moreover, it also provides the first graphical and algorithmic, sufficient and necessary conditions for the identification of counterfactuals from observational data alone (when no experimental data is available) and arbitrary causal diagrams. Moving up the PCH, our results allow for arbitrary quantities as inferential targets and for the addition of arbitrary experimental distributions to the input, increasing the flexibility of the solution.

Figure 2: Causal diagram with treatment X, outcome Y, and mediator M.
For concreteness, consider the causal diagram shown in Fig. 2 and a counterfactual query called the direct effect. This quantity represents the sensitivity of a variable Y to changes in another variable X while all other factors in the analysis remain fixed. Suppose X is the level of exercise, M cholesterol levels, and Y cardiovascular disease. Exercising can improve cholesterol levels, which in turn affect the chances of developing cardiovascular disease. An interesting question is how much exercise prevents the disease by means other than regulating cholesterol. In counterfactual notation, this is to compare Y_{x,M_x} and Y_{x',M_x}, where x and x' are different values. The first quantity represents the value of Y when X = x and M varies accordingly. The second expression is the value Y attains if X is held constant at x' while M still follows X = x. The difference E[Y_{x',M_x} − Y_{x,M_x}], known as the natural direct effect (NDE), is non-zero if there is some direct effect of X on Y. In this instance, this nested counterfactual is identifiable only if observational data and experiments on X are available. After all, there is no general identification method for this particular counterfactual family (which also includes indirect and spurious effects) and, more broadly, other arbitrary nested counterfactuals that are well-defined in layer 3. Our goal is to understand the non-parametric identification of arbitrary nested and conditional counterfactuals when the input consists of any combination of observational and interventional distributions, whatever is available to the data scientist. More specifically, our contributions are as follows.

1. We look at nested counterfactuals from an SCM perspective and introduce machinery that supports counterfactual reasoning. In particular, we prove the counterfactual unnesting theorem (CUT), which allows one to map any nested counterfactual to an unnested one (Section 2).
2. Building on this new machinery, we derive sufficient and necessary graphical conditions and an algorithm to determine the identifiability of marginal nested counterfactuals from an arbitrary combination of observational and experimental distributions (Section 3).
3. We prove a reduction from conditional counterfactuals to marginal ones, and use it to derive a complete algorithm for their identification (Section 4).

Due to space constraints, all the proofs in this paper can be found in the full technical report [5].

¹ In fact, this is a classic task in a larger family of problems known as data fusion, which includes other challenges such as selection bias and transportability, to cite a few. For more details, see [4].

² For the sake of context, the work proposed here can be seen as a generalization of two tasks: counterfactual identification under the assumptions discussed earlier [22] and interventional identification from arbitrary experiments [13]. As discussed later on, we will be able to show, based on the machinery developed here, that the individual methods for those tasks can be combined and also be shown complete.

1.1 Preliminaries

We denote variables by capital letters, X, and values by small letters, x. Bold letters, X, represent sets of variables, and x sets of values. The domain of a variable X is denoted by 𝒳_X. Two values x and z are said to be consistent if they share the common values for X ∩ Z. We also denote by x \ Z the value of X \ Z consistent with x, and by x ∩ Z the subset of x corresponding to the variables in Z. We assume the domain of every variable is finite. We rely on causal graphs and denote them with a calligraphic letter, e.g., G.
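Before turning to graph notation, the following minimal Python sketch (our own illustration, not part of the paper) makes the value-set notation above concrete by representing value assignments as dictionaries; the helper names `consistent`, `restrict`, and `drop` are ours and purely expository.

```python
# Minimal sketch (illustrative): value assignments such as x and z are
# represented as dicts mapping variable names to values.

def consistent(x, z):
    """x and z are consistent if they agree on every variable in X ∩ Z."""
    return all(x[v] == z[v] for v in x.keys() & z.keys())

def restrict(x, Z):
    """x ∩ Z: the subset of x corresponding to the variables in Z."""
    return {v: val for v, val in x.items() if v in Z}

def drop(x, Z):
    """x \\ Z: the value of X \\ Z consistent with x."""
    return {v: val for v, val in x.items() if v not in Z}

x = {"X": 1, "Z": 0, "W": 1}
z = {"Z": 0, "Y": 1}
assert consistent(x, z)                               # they agree on Z
assert restrict(x, {"Z", "W"}) == {"Z": 0, "W": 1}    # x ∩ {Z, W}
assert drop(x, {"Z"}) == {"X": 1, "W": 1}             # x \ {Z}
```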
We denote the set of vertices (i.e., variables) in G as V(G). Given a graph G, G_{W̄X̲} is the result of removing edges coming into variables in W and going out from variables in X. G[W] denotes a vertex-induced subgraph including W and the edges among its elements. We use kinship notation for graphical relationships such as parents, children, descendants, and ancestors of a set of variables. For example, the set of parents of X in G is Pa(X)_G := X ∪ ⋃_{X ∈ X} Pa(X)_G. Similarly, we define Ch(·), De(·), and An(·).

To articulate and formalize counterfactual questions, we require a framework that allows us to reason simultaneously about events from alternative worlds. Accordingly, we employ the Structural Causal Model (SCM) paradigm [17, Ch. 7]. An SCM M is a 4-tuple ⟨U, V, F, P(u)⟩, where U is a set of exogenous (latent) variables; V is a set of endogenous (observable) variables; and F is a collection of functions such that each variable V_i ∈ V is determined by a function f_i ∈ F. Each f_i is a mapping from a set of exogenous variables U_i ⊆ U and a set of endogenous variables Pa_i ⊆ V \ {V_i} to the domain of V_i. Uncertainty is encoded through a probability distribution over the exogenous variables, P(U). An SCM M induces a causal diagram G where V is the set of vertices, there is a directed edge (V_j → V_i) for every V_i ∈ V and V_j ∈ Pa_i, and a bidirected edge (V_i ↔ V_j) for every pair V_i, V_j ∈ V such that U_i ∩ U_j ≠ ∅ (V_i and V_j have a common exogenous parent). We assume that the underlying model is recursive, that is, there are no cyclic dependencies among the variables; equivalently, the corresponding causal diagram is acyclic. The set V(G) can be partitioned into subsets called c-components [27] such that two variables belong to the same c-component if they are connected in G by a path made entirely of bidirected edges.

2 SCMs and Nested Counterfactuals

Intervening on a system represented by an SCM M results in a new model differing only in the mechanisms associated with the intervened variables [15, 6, 7]. If the intervention consists of fixing the value of a variable X to a constant x ∈ 𝒳_X, it induces a submodel, denoted M_x [17, Def. 7.1.2]. To formally study nested counterfactuals, we extend this notion to models derived from interventions that replace functions from the original SCM with other, not necessarily constant, functions.

Definition 1 (Derived Model). Let M be an SCM, Û ⊆ U, X ∈ V, and X̂ : 𝒳_Û → 𝒳_X a function. Then M_X̂, called the derived model of M according to X̂, is identical to M, except that the function f_X is replaced with a function f̂_X identical to X̂.

This definition is easily extendable to models derived from an intervention on a set X instead of a singleton. When X̂ is a collection of functions {X̂ : 𝒳_{Û_X} → 𝒳_X}_{X ∈ X}, the derived model M_X̂ is obtained by replacing each f_X with X̂ for X ∈ X. Next, we discuss the concept of potential response [17, Def. 7.4.1] with respect to the derived models.

Definition 2 (Potential Response). Let X, Y ⊆ V be subsets of observable variables, let u be a unit, and let X̂(u) be a set of functions from 𝒳_{Û_X} to 𝒳_X, for X ∈ X, where Û_X ⊆ U. Then Y_{X=X̂}(u) (or Y_X̂(u), for short) is called the potential response of Y to X = X̂, and is defined as the solution of Y, for a particular u, in the derived model M_X̂.

A potential response Y_X̂(u) describes the value that variable Y would attain for a unit (or individual) u if the intervention X̂ is performed.
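To make Definitions 1 and 2 concrete, here is a minimal Python sketch (our own illustration, not code from the paper) for a toy SCM over the chain X → Z → Y; the dictionary-of-lambdas representation and all names (`mechanisms`, `solve`, `derived`) are expository assumptions.

```python
# Minimal illustration (ours, not the paper's code) of Definitions 1 and 2 on a
# toy SCM over the chain X -> Z -> Y with binary variables.  A "unit" u is a
# full assignment to the exogenous variables U = {u_x, u_z, u_y}.

mechanisms = {
    "X": lambda v, u: u["u_x"],            # f_X(u_x)
    "Z": lambda v, u: v["X"] ^ u["u_z"],    # f_Z(X, u_z)
    "Y": lambda v, u: v["Z"] | u["u_y"],    # f_Y(Z, u_y)
}
order = ["X", "Z", "Y"]  # a topological order of the (recursive) model

def solve(mechs, u):
    """Solve the structural equations for unit u (the 'solution' in Definition 2)."""
    v = {}
    for name in order:
        v[name] = mechs[name](v, u)
    return v

def derived(mechs, var, fn):
    """Derived model (Definition 1): replace f_var with a new function fn of u."""
    new = dict(mechs)
    new[var] = lambda v, u: fn(u)
    return new

u = {"u_x": 1, "u_z": 0, "u_y": 0}

# Potential response Y_{X=0}(u): the value of Y in the derived model fixing X to 0.
y_x0 = solve(derived(mechanisms, "X", lambda u: 0), u)["Y"]

# A non-constant intervention: replace f_Z with the potential response Z_{X=0},
# yielding the nested counterfactual Y_{Z_{X=0}}(u) discussed in Section 2.1.
z_x0 = solve(derived(mechanisms, "X", lambda u: 0), u)["Z"]
y_nested = solve(derived(mechanisms, "Z", lambda u_: z_x0), u)["Y"]
print(y_x0, y_nested)
```

Replacing f_Z with a potential response rather than a constant is exactly the kind of natural intervention used to define nested counterfactuals in Section 2.1.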
This concept is tightly related to that of potential outcome, but the former explicitly allows for interventions that do not necessarily fix the variables in X to a constant value. Averaging over the space of U, a potential response Y_X̂(u) induces a random variable that we will denote simply as Y_X̂. If the intervention replaces a function f_X with a potential response of X in M, we say the intervention is natural. When variables are enumerated as W_1, W_2, ..., we may add square brackets around the part of the subscript denoting interventions.

We use W_* to denote sets of arbitrary counterfactual variables. Let W_* = {W_{1[T̂_1]}, W_{2[T̂_2]}, ...} represent a set of counterfactual variables such that W_i ∈ V and T_i ⊆ V for i = 1, ..., l. Define V(W_*) = {W ∈ V | W_T̂ ∈ W_*}, that is, the set of observables that appear in W_*. Let w_* represent a vector of values, one for each variable in W_*, and define w_*(X_*) as the subset of w_* corresponding to X_*, for any X_* ⊆ W_*. The probability of any counterfactual event is given by

P(Y_* = y_*) = Σ_{u : Y_*(u) = y_*} P(u),    (1)

where the predicate Y_*(u) = y_* means ⋀_{Y_X̂ ∈ Y_*} Y_X̂(u) = y. When all variables in the expression have the same subscript, that is, they belong to the same submodel, we will often denote it as P_x(W_1, W_2, ...). For most real-world scenarios, having access to a fully specified SCM of the underlying system is unfeasible. Nevertheless, our analysis does not rely on such privileged access, but on the aspects of the model captured by the causal graph and data samples generated by the unobserved model.

2.1 Nested Counterfactuals

Potential responses can be compounded based on natural interventions. For instance, the counterfactual Y_{Z_x}(u) (i.e., Y_{Z=Z_x}(u)) can be seen as the potential response of Y to an intervention that makes Ẑ equal to Z_x. Notice that Z_x(u) is in itself a potential response, but from a different (nested) model. Hence we call Y_{Z_x} a nested counterfactual. Recall the causal diagram in Fig. 2 and consider once again the NDE as

NDE_{x,x'}(Y) = E[Y_{x',Z_x}] − E[Y_x].    (2)

The second term is also equal to Y_{x,Z_x}, since Z_x is consistent with X = x, so it is the value Y listens to in M_x. Meanwhile, the first one is related to P(Y_{x',Z_x}), the probability of a nested counterfactual.

2.2 Tools for Counterfactual Reasoning

Before characterizing the identification of counterfactuals from observational and experimental data, we develop from first principles a canonical representation of any such query. First, we extend the notion of ancestors to counterfactual variables, which subsumes the usual one described before.

Definition 3 (Ancestors, of a counterfactual). Let Y_x be such that Y ∈ V, X ⊆ V. Then, the set of (counterfactual) ancestors of Y_x, denoted An(Y_x), consists of each W_z such that W ∈ An(Y)_{G_X̲} (which includes Y itself), and z = x ∩ An(W)_{G_X̄}.

For a set of variables W_*, we define An(W_*) as the union of the ancestors of each variable in the set, that is, An(W_*) = ⋃_{W_t ∈ W_*} An(W_t). For instance, in Fig. 3(a), An(Y_x) = {Y_x, Z}, An(X_{y,z}) = {X_z}, and An(Y_z) = {Y_z, X_z} (depicted in Fig. 3(b)). In Fig. 3(c), An(Z, Y_z) = {Y_z, X_z, Z, W} and An(Y_x) = {Y_x} (represented in Fig. 3(d)).

Figure 3: Two causal diagrams and the subgraphs considered when finding sets of ancestors for a counterfactual variable: (a) backdoor graph; (b) graphical representation of the ancestors of Y_z; (c) napkin graph; (d) graphical representation of the ancestors of Y_x.
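As an illustration, the following Python sketch (ours, not the paper's code) implements the ancestor computation of Definition 3 as reconstructed above and reproduces the backdoor-graph examples; the two edge-removal steps mirror G_X̲ and G_X̄ as written in the definition, and the names `ancestors` and `ctf_ancestors` are purely illustrative.

```python
# Illustrative sketch (not the paper's code): counterfactual ancestors per
# Definition 3 on the backdoor graph of Fig. 3(a): Z -> X, Z -> Y, X -> Y.
# Only directed edges are needed for ancestor computations.

parents = {"Z": set(), "X": {"Z"}, "Y": {"Z", "X"}}

def ancestors(v, par):
    """All ancestors of v (including v itself) via directed paths."""
    out, stack = {v}, [v]
    while stack:
        for p in par[stack.pop()]:
            if p not in out:
                out.add(p)
                stack.append(p)
    return out

def ctf_ancestors(y, x_vars, par):
    """An(Y_x) as pairs (W, intervened ancestors that remain in the subscript)."""
    # G with edges *out of* the intervened variables removed (membership check).
    g_under = {v: {p for p in ps if p not in x_vars} for v, ps in par.items()}
    # G with edges *into* the intervened variables removed (subscript computation).
    g_over = {v: (set() if v in x_vars else ps) for v, ps in par.items()}
    return {(w, frozenset(x_vars & ancestors(w, g_over)))
            for w in ancestors(y, g_under)}

print(ctf_ancestors("Y", {"X"}, parents))  # An(Y_x) = {Y_x, Z}
print(ctf_ancestors("Y", {"Z"}, parents))  # An(Y_z) = {Y_z, X_z}, cf. Fig. 3(b)
```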
Probabilistic and causal inference with graphical models exploits the local structure among variables, specifically parent-child relationships, to infer and even estimate probabilities. In particular, Tian [27] introduced c-factors, which have proven instrumental in solving many problems in causal inference. We naturally generalize this notion to the counterfactual setting with the following definition.

Definition 4 (Counterfactual Factor (ctf-factor)). A counterfactual factor is a distribution of the form

P(W_{1[pa_1]} = w_1, W_{2[pa_2]} = w_2, ..., W_{l[pa_l]} = w_l),    (3)

where each W_i ∈ V and there could be W_i = W_j for some i, j ∈ {1, ..., l}.

For example, for Fig. 3(c), P(Y_x = y, Y_{x'} = y') and P(Y_x = y, X_z = x) are ctf-factors, but P(Y_z = y, Z_w = z) is not. Using the notion of ancestrality introduced in Definition 3, we can factorize counterfactual probabilities as ctf-factors.

Theorem 1 (Ancestral set factorization). Let W_* be an ancestral set, that is, An(W_*) = W_*, and let w_* be a vector with a value for each variable in W_*. Then,

P(W_* = w_*) = P(⋀_{W_t ∈ W_*} W_{pa_W} = w),    (4)

where each w is the corresponding w_t, and pa_W is determined for each W_t ∈ W_* as follows: (i) the values for variables in Pa_W ∩ T are the same as in t, and (ii) the values for variables in Pa_W \ T are taken from w_* corresponding to the parents of W.

Proof outline. Following a reverse topological order in G, look at each W_{i[t_i]} ∈ W_*. Since any parent of W_i not in T_i must appear in W_*, the composition axiom [17, 7.3.1] licenses adding them to the subscript. Then, by exclusion restrictions [16], any intervention not involving Pa(W_i) can be removed to obtain the form in Eq. (4).

For example, consider the diagram in Fig. 3(c) and the counterfactual P(Y_x = y | X = x'), known as the effect of the treatment on the treated (ETT) [9, 17]. First note that P(Y_x = y | X = x') = P(Y_x = y, X = x') / P(X = x') and that An(Y_x, X) = {Y_x, X, Z, W}, then

P(Y_x = y, X = x') = Σ_{z,w} P(Y_x = y, X = x', Z = z, W = w).    (5)

Then, by Theorem 1 we can write

P(Y_x = y, X = x') = Σ_{z,w} P(Y_x = y, X_z = x', Z_w = z, W = w).    (6)

Moreover, the following result describes a factorization of ctf-factors based on the c-component structure of the graph, which will prove instrumental in the next section.

Theorem 2 (Counterfactual factorization). Let P(W_* = w_*) be a ctf-factor, let W_1 < W_2 < ... be a topological order over the variables in G[V(W_*)], and let C_1, ..., C_k be the c-components of the same graph. Define C_j* = {W_{pa_W} ∈ W_* | W ∈ C_j} and c_j* as the values in w_* corresponding to C_j*; then P(W_* = w_*) decomposes as

P(W_* = w_*) = Π_j P(C_j* = c_j*).    (7)

Figure 4: Three causal diagrams representing plausible structures in mediation analysis.

Furthermore, each factor can be computed from P(W_* = w_*) as P(C_j* = c_j*) = Y {w|Wpaw W ,Wi