# minimum_intervention_cover_of_a_causal_graph__ac3504dc.pdf

The Thirty-Third AAAI Conference on Artiﬁcial Intelligence (AAAI-19)

Minimum Intervention Cover of a Causal Graph

Saravanan Kandasamy Tata Institute of Fundamental Research Mumbai, India

Arnab Bhattacharyya Dept. of Computer Science National Univ. of Singapore Singapore & Dept. of Computer Science & Automation Indian Institute of Science Bangalore, India

Vasant G Honavar Artiﬁcial Intelligence Research Laboratory College of Information Sciences & Technology, Pennsylvania State Univ. University Park, PA, USA

Eliciting causal effects from interventions and observations is one of the central concerns of science, and increasingly, artiﬁcial intelligence. We provide an algorithm that, given a causal graph G, determines MIC(G), a minimum intervention cover of G, i.e., a minimum set of interventions that sufﬁces for identifying every causal effect that is identiﬁable in a causal model characterized by G. We establish the completeness of do-calculus for computing MIC(G). MIC(G) effectively offers an efﬁcient compilation of all of the information obtainable from all possible interventions in a causal model characterized by G. Minimum intervention cover ﬁnds applications in a variety of contexts including counterfactual inference, and generalizing causal effects across experimental settings. We analyze the computational complexity of minimum intervention cover and identify some special cases of practical interest in which MIC(G) can be computed in time that is polynomial in the size of G.

Introduction Determining causal effects from interventions and observations is one of the central concerns of science, and increasingly, of artiﬁcial intelligence (Pearl 2009; Spirtes, Glymour, and Scheines 2000; Imbens and Rubin 2015; Morgan and Winship 2014; Hernan and Robins 2010; Berzuini, Dawid, and Bernardinell 2012; Peters, Janzing, and Sch olkopf 2017). In the framework pioneered by Pearl (Pearl 2009), the structure of a causal model is encoded using a causal graph G, deﬁned over a set of observable variables (vertices) V. In the resulting causal model, for any two variables X and Y , a directed edge X Y denotes that X is a direct cause of Y ; and a bi-directed edge X Y indicates that X and Y are confounded by an unobservable variable (which is a common parent). The parameters of the causal model correspond to the probability distributions of the variables conditioned on their parents in G. In a causal model encoded using a graph G, for a given subset of observable variables X, and an assignment x of X, the intervention do(x) refers to the action of ﬁxing X to x, irrespective of the values of the parents of X. For any Y V, the causal effect of do(x) on Y will be the interventional distribution obtained on Y by the intervention do(x) (Pearl

Copyright c 2019, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved.

2009). Given a causal model, Do-calculus (Pearl 1995; 2009) offers a general machinery that can be used to identify causal effects from observations and interventions, answer counterfactual queries, etc., given a causal graph.

Related Work

The problem of identifying causal effects from data has been extensively studied in the literature. In the absence of unobservable variables, all causal effects are identiﬁable from the observational distribution (Robins 1986; Spirtes, Glymour, and Scheines 2000; Pearl 1995). When some of the variables are unobservable, it is not always possible to identify causal effects from the observational distribution alone. A series of papers established sufﬁcient graphical conditions for solving this problem (Spirtes, Glymour, and Scheines 2000; Pearl 1995; Galles and Pearl 1995; Pearl and Robins 1995; Halpern 1998; Kuroki and Miyakawa 1999; Tian and Pearl 2002a), eventually leading to a sound and complete algorithm (Shpitser and Pearl 2006; Huang and Valtorta 2006). The resulting methods have been generalized to work in settings where the underlying causal graph is unknown (Hyttinen, Eberhardt, and J arvisalo 2015). Recent work has considered the problem of generalizing a causal effect from one or more source domains (where observational and all interventional distributions are available) to a target domain (where only the observational distribution is available), provided some invariances in causal mechanisms hold across the source and target domains (Pearl and Bareinboim 2011; Lee and Honavar 2013b; Bareinboim and Pearl 2013b). Extensions of this problem consider the identiﬁcation of causal effects in the target domain, but from the observational and interventional distributions on subsets of observable variables (that are amenable to intervention) of the source domains (Bareinboim and Pearl 2013a; Lee and Honavar 2013a; Bareinboim et al. 2013; Bareinboim and Pearl 2014). These results provide a sound theoretical foundation for integrative analyses of observational and experimental data (Tsamardinos, Triantaﬁllou, and Lagani 2012; Bareinboim and Pearl 2016). A related line of work (Shpitser and Pearl 2008) provides a sound and complete graphical characterization for the problem of answering counterfactual queries in the setting where the observational distribution and all the interventional distributions are available.

Motivation The focus of the entire body of existing work (summarized above) on generalizing interventional data across multiple domains and on identifying counterfactual queries from interventional data is on whether the required quantity of interest can be determined when all of the interventional distributions on the observable variables (or a subset of observables that are amenable to experimental manipulation) are obtainable. However, obtaining an interventional distribution requires performing the corresponding intervention. Because interventions can incur signiﬁcant cost and effort, it is important to minimize the number of interventions that need to be performed. Minimizing the number of interventions is especially useful in cases where a signiﬁcant amount of interventional distributions are actually required by the existing algorithms for the identiﬁcation of the quantities of interest.

Contributions We address the following question: Given a causal graph G, ﬁnd a smallest set of interventions that sufﬁces to determine all interventional distributions. We call such a minimum set of interventions of G a minimum intervention cover of G (MIC(G)). We treat the case where any observable variable may be manipulated by interventions. A similar analysis when the set of manipulable variables is restricted remains a challenging open problem. The main contributions of this paper include: 1. A necessary and sufﬁcient condition for a set of interventions (or, equivalently, interventional distributions) that form MIC(G), a minimum intervention cover of a causal graph G. 2. A sound and complete algorithm for ﬁnding the minimum intervention cover of a causal graph G. 3. Proof of completeness of do-calculus for the minimum intervention cover problem. 4. An analysis of the computational complexity of MIC(G), including a characterization of special cases of practical interest in which MIC(G) can be determined in time that is polynomial in the size of G. In particular, we provide an efﬁcient algorithm to determine MIC(G) when G has bounded in-degree, out-degree, and C-components of bounded size1.

Preliminaries Probabilistic Causal Models and Causal Graphs We follow the notational conventions of probabilistic causal models (PCM) (Pearl 2009), also known as structural causal models or data-generating models. As is customary, without loss of generality (see (Verma and Pearl 1990; Tian and Pearl 2002b)), we limit our attention to causal graphs with the unobservable variables as root nodes, each with exactly two observable children. The resulting causal graph is directed acyclic with respect to the observable variables, but contains bi-directed edges between observable variables to represent the common unobservable parent between those observable

1See Deﬁnition 6.

variables. A probabilistic causal model consists of a causal graph G, and a four-tuple V, U, F, Pr(U) , where V is the set of all observable variables, U is the set of all unobservable variables (connected by the bi-directed edges of G) distributed according to Pr[U], and F = {f1, . . . , f|V|} is a set of functions. The value of each variable V V is determined by the function f V based on the values of the parents (both observable and unobservable) of V .

Notation We use uppercase letters to denote variables; lowercase to denote value assignments of variables; bold uppercase and bold lowercase letters to denote sets of variables and their value assignments respectively2; GX and GX to denote the graph obtained from G by removing the incoming edges (including bi-directed edges) to X and outgoing edges from X respectively; V = {V1, V2, . . . , Vn} to denote the observable vertices of graph G, with the vertex indices being topologically ordered; Pa(X) and An(X) to denote the observable parents and observable ancestors of X (excluding X) in G respectively; pa(X) to denote an assignment to Pa(X); G[D] to denote the induced subgraph of G on D: whose vertex set is D and whose edge set contains all the edges (including bi-directed edges) of G that have both endpoints in D, for any D V; Pr M[ ] to denote Pr[ ] in model M. Two assignments x, y of X, Y are said to be consistent if they agree on all of the vertices in X Y. We use set operations to denote values of a set of variables. For example, a \ pa(A) is used to represent the values of A \ Pa(A). When the graph being referenced is not clear from context, we use Pa(X)G to denote the observable parents of X (excluding X) in graph G. For any non-negative integer k, we use [k] to denote the set {1, 2, . . . , k}; For a given a collection of binary values s, we use bp(s) to denote the bit parity of s and bp(s) to denote the complement of the bit parity of s; We also use 0 and 1 to represent a set of 0 and 1 values respectively.

Interventions and Identiﬁability We review some essential deﬁnitions:

Intervention. Given a causal graph G, a set of observable variables X V, and an assignment x of X, an intervention do(x) is the process of ﬁxing X to x irrespective of the values of parents of X (Pearl 2009), which induces a new graph GX obtained from G by removing all incoming edges of X.

Interventional distribution. Given two disjoint subsets X, Y of V, and an intervention do(x), the interventional distribution denoted by Pr[Y | do(x)], is the causal effect of the intervention do(x) on Y, that is the distribution over Y obtained over the intervention do(x).

Information set. An information set IS(G) denotes a set of interventional distributions over the causal graph G.

2Without loss of generality, the variables are assumed to take values from the domain Σ. The results presented in the paper generalize in a straightforward way to the setting where each variable X takes values from the corresponding domain ΣX

Joint interventional distribution For a given intervention do(x), the distribution observed over the rest of the variables Pr[V \ X | do(x)] is a joint interventional distribution. Joint information set. A joint information set is an information set that contains only joint interventional distributions. Note that interventions with different assignments represent distinct elements in the information set. For example, do(Smoking = 0) and do(Smoking = 1) are different. Deﬁnition 1 (Identiﬁability). For a given causal graph G, let X, Y be disjoint subsets of the observable vertices V and let IS(G) be a given information set for G. The causal effect of an action do(x) on Y, denoted by Pr[Y | do(x)], is identiﬁable from IS(G) in G if Pr[Y | do(x)] is uniquely determined by IS(G) in any causal model that is deﬁned on G. Similarly, an information set IS (G) is identiﬁable from IS(G) in G, if for each intervention I IS (G), I is identiﬁable from IS(G) in G. The following lemma directly follows from the deﬁnition of identiﬁability. Lemma 1. Given a causal graph G and an information set IS(G), the interventional distribution Pr[Y | do(x)] is not identiﬁable from IS(G), if there exist two models M1, M2 that is deﬁned on the same causal graph G such that (i) Pr M1[Y | do(x)] = Pr M2[Y | do(x)]; and (ii) Pr M1[T | do(s)] = Pr M2[T | do(s)] for each intervention Pr[T | do(s)] IS(G). Deﬁnition 2 (All possible interventional distributions (IS*(G))). For a given causal graph G, let

IS*(G) := [

s {Pr[V \ S | do(s)]}

represent the set of all possible interventional distributions. Deﬁnition 3 (Size of Information Set). For a given information set IS(G), the size of IS(G) is the cardinality of IS(G), i.e., the number of interventions in the set IS(G). Deﬁnition 4 (Intervention Cover). For a given causal graph G, an information set IS(G) is an intervention cover of G if IS*(G) is identiﬁable from IS(G) in G. Deﬁnition 5 (Minimum Intervention Cover (MIC(G))). For a given causal graph G, a minimum intervention cover of G (MIC(G)) is an information set ISmin(G) such that (i) ISmin(G) is an intervention cover of G; (ii) there exists no intervention cover of G of size smaller than ISmin(G). Remark 1. Note that because we are minimizing the number of interventions in the information set, not the number of variables involved in each intervention, we may take MIC(G) to be a joint information set without loss of generality. Remark 2. Determining causal graphs from a small number of interventions has been studied in the literature under the faithfulness assumption (using Conditional Independence (CI) tests) (Shanmugam et al. 2015; Kocaoglu, Shanmugam, and Bareinboim 2017). MIC(G) can be applied to the causal graphs constructed using the above algorithms.

We assume that the distributions are positive as in (Pearl 2009). We adopt the deﬁnitions and rules of do-calculus (Pearl 2009) (See Supplementary material). In what follows, we will use the graph H (shown in Figure 1a) to construct examples that illustrate the key deﬁnitions, arguments and results.

(a) Causal graph H

(b) Causal Graph H

(c) Causal Graph H

Figure 1: X1 Smoking, X2 Alcohol Consumption, Y1 Depression, Y2 Sleep Disorder

C-Component Factorization Deﬁnition 6 ((Tian and Pearl 2002a) C-component). Given a causal graph G, and a set of observable vertices S V, S is a C-component of G if in the induced subgraph G[S] there is a path between any two vertices of S that consists of only bi-directed edges.

For a given causal graph G, we use C(V) := {S1, S2, . . . , Sk 1, Sk} to represent the partition of V into the maximal C-components of G, i.e., each Si is a maximal Ccomponent of G. Similarly, for any subset W V, we use C(W) to denote the set of all maximal C-components of the subgraph G[W] induced on W.

Example 1. The C-components of the graph H are: {X1}, {X2}, {Y1}, {Y2}, {X1, Y1, Y2}, {X1, Y1}, {Y1, Y2} Hence, the maximal C-components of H are: C({X1, X2, Y1, Y2}) = {{X1, Y1, Y2}, {X2}}.

As we will see below, the properties of C-components play a crucial role in the identiﬁcation of causal effects. Let us ﬁrst recall a fact that directly follows from the deﬁnition of probabilistic causal models:

Lemma 2. Given a subset of observable variables S V, an assignment s of S, and an assignment pa(S) of the observable parents of S, the interventional probability Pr[s | do(pa(S))] can be expressed as

Pr[s | do(pa(S))] = Pr[s | do(pa(S), o)]

for any assignment o of any O V \ (Pa(S) S).

Proof. By the deﬁnition of probabilistic causal models, when all the observable parents of S are targeted by an intervention, the distribution on S remains unchanged regardless of whether the other vertices (i.e., O) are also targeted by the intervention.

Example 2. Consider the graph H. Let S = {Y1} and O = {X1, Y2}. Hence Pa(S) = {X2}. By the deﬁnition of probabilistic causal model, when X2 is set by an intervention (say to the value x2), the probability distribution of Y1 is unaffected by whether or not X1 or Y2 are are also set by intervention. Hence Pr[s | do(pa(S))] = Pr[s | do(pa(S), o)].

The following C-component factorization Lemma (Tian and Pearl 2002b) highlights the role played by Ccomponents in the identiﬁcation of causal effects:

Lemma 3. (Tian and Pearl 2002b) For a given subset X V, let C(V \ X) = {B1, . . . , Bk}. Then, for any assignment v of the observable vertices V, the interventional probability Pr[v \ x | do(x)] can be expressed as Pr[v \ x | do(x)] = Q

i Pr[bi | do(v \ bi)]. Hence:

Pr[v \ x | do(x)] = Y

i Pr[bi | do(v \ bi)]

i Pr[bi | do(pa(Bi))] (by Lemma 2)

where the assignment pa(Bi) is consistent with v \ bi.

Example 3. Let x1, x2, y1, y2 be an arbitrary assignment of the variables X1, X2, Y1, Y2 respectively , and do(x1) an intervention in a causal model characterized by the graph H. It is easy to see that {X2} and {Y1, Y2} are the the maximal C-components of the causal graph HX1, resulting from the intervention, do(x1). Lemma 3 says that the interventional probability Pr[x2, y1, y2 | do(x1)] can be expressed as the product of Pr[x2 | do(x1)] and Pr[y1, y2 | do(x1, x2)].

Minimum Intervention Cover

Overview: We begin by deﬁning an information set LD(G), a set of local distributions, such that IS*(G) (the set of all possible interventional distributions of a causal model characterized by a causal graph G) is identiﬁable from LD(G). We then deﬁne ILD(G), an informative subset of LD(G), and show that LD(G), and hence IS*(G), is identiﬁable from ILD(G). We proceed to introduce a sound and complete graphical criteria for identifying ILD(G) from a joint information set ISinp(G). We show that ISinp(G) is a minimum intervention cover of a causal model characterized by G if and only if ISinp(G) identiﬁes ILD(G) and no other information set of a smaller size does so. Deﬁnition 7. For a given causal graph G, we deﬁne

Bi:Bi is a C-component of G

pa(Bi) Pr[Bi|do(pa(Bi))].

Example 4. Consider the graph H shown in Figure 1a. For every C-component Bi of H (See Example 1), and for every assignment of values to their parents pa(Bi), it is easy to see that the corresponding interventional distribution Pr[Bj | do(pa(Bj))] is in LD(H).

The next claim, which directly follows from the Ccomponent factorization of Lemma 3, shows that every intervention of G is identiﬁable from LD(G).

Claim 1. IS*(G) is identiﬁable from LD(G).

Proof. Recall that each interventional distribution of IS*(G) can be factorized in terms of its corresponding C-factors Pr[bi | do(pa(Bi))] (Lemma 3). All such C-factors are available in LD(G).

Next we show that an informative subset of LD(G), which we call ILD(G), sufﬁces to identify each distribution in LD(G). ILD(G) is the set of all Pr[Bj | do(pa(Bj))] LD(G) such that Bj is a maximal C-component of the induced subgraph G[V \ Pa(Bj)].

Deﬁnition 8. For a given causal graph G, we deﬁne

ILD(G) := [

Bj:Bj C(V\Pa(Bj))

pa(Bj) Pr[Bj | do(pa(Bj))] .

Example 5. Consider the graph H shown in shown in Figure 1a. It is easy to verify that ILD(H) contains only the interventional distributions Pr[Bj | do(pa(Bj))] such that Bj {{X2}, {Y1, Y2}, {X1, Y1, Y2}}.

Claim 2. LD(G) is identiﬁable from ILD(G).

Proof. Let LD(G) Pr[Bi | do(pa(Bi))] / ILD(G). We prove the claim by demonstrating the existence of an informative local distribution that identiﬁes Pr[Bi | do(pa(Bi))]. Deﬁne Bj := Bi D

where D V \ (Bi Pa(Bi)) such that Bi D C(V \ Pa(Bi)) (See Figure 2). Let pa(Bj) be an assignment consistent with pa(Bi). By deﬁnition, Pr[Bj | do(pa(Bj))] ILD(G), because Bj C(V \ Pa(Bj)). Hence:

Pr[Bi|do(pa(Bi))] = Pr[Bi|do(pa(Bj))]

bj\bi Pr[Bi, (bj \ bi)|do(pa(Bj))].

The ﬁrst equality follows from Lemma 2, since (i) Pa(Bi) Pa(Bj); (ii) Pa(Bj) and Bi are disjoint; (iii) the assignments pa(Bi) and pa(Bj) are consistent. The second equality is obtained by marginalization. Hence the claim.

𝐁𝑖 Note: 𝐁𝑖is a C-component Note: 𝐁𝑖is non-empty

𝐕 (𝐁𝑖 𝐏𝐚(𝐁𝑖))

All other vertices

Here 𝐁𝑖 𝐃is the maximal C-component of G[𝐕 𝐏𝐚𝐁𝑖]

Note: 𝐁𝑖 𝐁𝑗 C( 𝐕 𝐏𝐚(𝐁𝑗)) Note: 𝐁𝑗is non-empty

All other vertices

𝐕 (𝐁𝑗 𝐏𝐚(𝐁𝑗))

Figure 2: Illustration of Claim 2

Claims 1 and 2 imply that ILD(G) sufﬁces to identify all interventional distributions IS*(G). We now we proceed to specify a graphical criterion for the identiﬁability of ILD(G)

Note: 𝐁is non-empty

𝐕 (𝐁 𝐏𝐚(𝐁))

All other vertices

Note: 𝐁is a maximal C-component of G[𝐕 𝐏𝐚(𝐁)]

Note: Each vertex 𝐴𝑖 𝐀has: at least one outgoing edge to 𝐁 at least one bi-directed edge with 𝐁

Note: Each vertex 𝑃𝑖 𝐏𝐚𝐁\𝑨has: at least one outgoing edge to 𝐁 no bi-directed edge with B

Figure 3: Bush A, B

from a given joint information set ISinp(G). Our analysis relies on the existence of a special structure, which we call a bush.

Deﬁnition 9 (Bush). For a given causal graph G, let A and B be disjoint subsets of the observable variables V. Then A, B form a bush in G, if (i) |B| = 0 (ii) B C(V \ Pa(B)) (iii) For each Ai A, (iii.a) Ai Pa(B) (and) (iii.b) C({Ai} B) = {{Ai} B} (iv) For each Pi Pa(B) \ A, C({Pi} B) = {{Pi}, B}.

In other words, if A, B form a bush, then (i) B is nonempty; (ii) B is a maximal C-component of G[V \ Pa(B)]; (iii) Each Ai A a) has at least one child in B (and) b) share a bi-directed edge with at least one vertex in B; (iv) no vertex in (Pa(B) \ A) share a bi-directed edge to a vertex in B (See Figure 3).

Example 6. There are three bushes in the graph H shown in Figure 1a: (i) A1 = {}, B1 = {X1, Y1, Y2}; (ii) A2 = {}, B2 = {X2}; and (iii) A3 = {X1}, B3 = {Y1, Y2}.

Claim 3. For a given causal graph G, there is a one-to-one correspondence between ILD(G), and the set of all bush and assignment pairs of G

pa(B) {((A, B), pa(B))}.

Proof. Every B V that respects B C(B \ Pa(B)) maps to a unique bush A, B, where A = {Ai Pa(B) : C({Ai} B) = {{Ai} B}}. Hence, every informative local distribution Pr[B | do(pa(B))] ILD(G) maps to a unique bush A, B and assignment pa(B) pair. Furthermore, every bush A, B and assignment pa(B) pair maps to a unique informative local distribution Pr[B | do(pa(B))] ILD(G).

For a given bush and assignment pair for a causal graph G, we deﬁne what we call an informative intervention set, which plays an important role in identifying the corresponding informative local distribution from a given joint information set.

Deﬁnition 10 (Informative Intervention Set). Given a causal graph G, and a bush and assignment pair ((A, B), pa(B)) of G, the informative intervention set IIS(a ; pa(B) \ a ; B) is a joint information set3 that contains the set of all interventions I such that 1. I intervenes on all the vertices of A with assignment a. 2. I does not intervene on any vertex in B. 3. I and pa(B) are consistent on Pa(B). Example 7. Consider the bush A3 = {X1}, B3 = {Y1, Y2} of H. For a given assignment x1, x2, IIS(x1, x2, {Y1, Y2}) is a joint information set that contains at least one of the following interventional distributions: a) Pr[Y1, Y2, X2 | do(x1)] ; b) Pr[Y1, Y2 | do(x1, x2)]. Theorem 4 shows that bushes and IISs can be used to characterize the identiﬁability of informative local distributions ILD(G) from ISinp(G), a given joint information set of a causal model characterized by G. Theorem 4. Given a causal graph G, a bush and assignment pair ((A, B), pa(B)) of G, and a joint information set ISinp(G), the distribution Pr[B | do(pa(B))] is identiﬁable from ISinp(G), if and only if, ISinp(G) IIS(a ; pa(B) \ a ; B) is non-empty. The proof of Theorem 4 is given in the Appendix. Theorem 4 asserts that an informative local distribution Pr[B | do(pa(B))] ILD(G) is uniquely determinable from ISinp(G), if and only if, ISinp(G) contains an intervention from the corresponding informative intervention set. We will now illustrate how the concept of bushes and Theorem 4 can be used to ﬁnd minimum intervention covers. Consider the causal graphs shown in Figure 1. For simplicity, we assume that all the variables are boolean. Example 8 (MIC(H)). Consider the graph H (shown in Figure 1a), where the only possible bushes are: (i) B1 : A1 = {}, B1 = {X1, Y1, Y2}; (ii) B2 : A2 = {}, B2 = {X2}; and (iii) B3 : A3 = {X1}, B3 = {Y1, Y2}. With respect to bush B1, for each assignment pa(B1) of Pa(B1), i.e., for each x2 {0, 1}, identifying the informative local distribution Pr[B1 | pa(B1)], i.e., Pr[X1, Y1, Y2 | do(x2)], requires that any minimum intervention cover include an intervention from IIS( ; x2 ; {X1, Y1, Y2}). Similarly, with respect to bush B2, for each assignment x1 {0, 1}, identifying Pr[X2 | do(x1)] requires that any minimum intervention cover include an intervention from IIS( ; x1 ; {X2}); and for bush B3, identifying Pr[Y1, Y2 | do(x1, x2)] for each assignment of (x1, x2) {0, 1}2 requires that MIC(H) include an intervention from IIS(x1 ; x2 ; {Y1, Y2}). Note that the observable distribution intersects the informative intervention sets corresponding to bushes B1 and B2. Similarly, for each x1 {0, 1}, do(x1) intersects the informative intervention set IIS(x1 ; x2 ; {Y1, Y2}) that correspond to bush B3 for both values of x2 {0, 1}. By claims 1 and 2, we know that the informative local distributions ILD(H) are sufﬁcient to identify the set of all interventional distributions IS*(H). Hence, it is easy to see

3Note that a is the assignment of A consistent with pa(B).

that the observable distribution together with do(X1 = 0) and do(X1 = 1) form a minimum intervention cover of H, because any other information set with fewer than 3 interventions cannot intersect all the required IISs (a fact that can be veriﬁed by brute-force enumeration).

Example 9 (MIC(H )). Now consider the graph H which contains an additional bi-directed edge X2 Y1. H contains four different bushes: (i) B1 : A1 = {}, B1 = {X1, X2, Y1, Y2}; (ii) B2 : A2 = {X1}, B2 = {X2, Y1, Y2}; (iii) B3 : A3 = {X2}, B3 = {X1, Y1, Y2}; and (iv) B4 : A4 = {X1, X2}, B4 = {Y1, Y2}. By Theorem 4, for bushes B1, B2, B3, B4, identifying the respective local distributions, i.e., Pr[Bi | do(pa(Bi))]s, requires (1) the observable distribution, (2) do(x1) for each x1 {0, 1}, (3) do(x2) for each x2 {0, 1}, and (4) do(x1, x2) for each (x1, x2) {0, 1}2 respectively. Also note that no two IISs have any intervention in common. Hence, the size of MIC(H ) is 9.

Based on the previous examples, one might be tempted to conclude that when the graph is a C-component, minimum intervention cover must include all the possible interventions (except the trivial interventions that target a leaf node). However, this is not true because structure of the C-component plays a crucial role in determining the minimum intervention cover.

Example 10 (MIC(H )). Consider the graph H shown in Figure 1c which consists of the bushes: (i) B1 : A1 = {}, B1 = {X1, X2, Y1, Y2}; (ii) B2 : A2 = {X1}, B2 = {X2}; (iii) B3 : A3 = {X1}, B3 = {Y1, Y2}; and (iv) B4 : A4 = {X2}, B4 = {X1, Y1, Y2}. By Theorem 4, we know that B1 requires the observable distribution, and B4 requires the two interventions do(X2 = 0) and do(X2 = 1) to be included in MIC(H ). Also, the two interventions do(X1 = 0) and do(X1 = 1) intersect the informative intervention sets that arise from bushes B2 and B3. It is easy to verify by brute force that no information set of size less than 5 can intersect all the required IISs. Hence the size of MIC(H ) is 5.

An Efﬁcient Algorithm for MIC(G) Based on the results presented in the previous section, it is possible to ﬁnd MIC(G) by exhaustively enumerating the subsets of IS*(G) in increasing order of size, and checking whether the subset being considered is an intervention cover of G. First, such an approach is impractical for causal graphs with more than a few variables. Second, it is unclear whether such a brute-force approach is optimal. Hence, we proceed to study the computational complexity of MIC(G). Speciﬁcally, we exploit the structural properties of bushes to reduce the problem of ﬁnding MIC(G) to that of ﬁnding a minimum vertex coloring of an undirected graph b G obtained from G. Minimum vertex coloring of a graph with vertices W and edges E is the well-known problem of minimizing the number of colors required to assign colors c Wi to vertices Wi W such that (Wi, Wj) E = c Wi = c Wj. If we denote the subset of vertices that are assigned the color ci by Wci, then a vertex coloring of a graph corresponds to

Edges to all other vertices

Bush 1: 𝐀1= , 𝐁1 = 𝑋1, 𝑋2, 𝑋3, 𝑋4

𝑊1 = 𝐀1 , 𝐁1 ,

Bush 2: 𝐀2 = 𝑋1 , 𝐁2 = 𝑋2

𝑊2 = ((𝐀2 , 𝐁2) , 𝑋1 = 0 )

𝑊3 = 𝐀3 , 𝐁3 , 𝑋1 = 1

Bush 3: 𝐀3= 𝑋1 , 𝐁3 = 𝑌1, 𝑌2

𝑊4 = ((𝐀3 , 𝐁3) , (𝑋1, 𝑋2) = (0,0))

𝑊5 = ((𝐀3 , 𝐁3) , (𝑋1, 𝑋2) = (0,1))

𝑊6 = ((𝐀3 , 𝐁3) , (𝑋1, 𝑋2) = (1,0))

𝑊7 = ((𝐀3 , 𝐁3) , (𝑋1, 𝑋2) = (1,1))

Bush 4: 𝐀4 = 𝑋2 , 𝐁2 = 𝑋1, 𝑌1, 𝑌2

𝑊8 = ((𝐀4 , 𝐁4) , 𝑋2 = 0 )

𝑊9 = ((𝐀4 , 𝐁4) , 𝑋2 = 1 )

All bush, assignment pairs of 𝑯

Figure 4: c H obtained by the reduction from H

a partition of the vertices into (disjoint) subsets where each subset is assigned a distinct color and no subset contains an edge. Minimum vertex coloring is NP-complete (Garey and Johnson 1990).

Reduction of MIC(G) to Minimum Vertex Coloring

We proceed to describe how to construct the graph b G from G and how to use the resulting graph b G to compute MIC(G). We deﬁne the vertex set of b G as follows: With each bush and assignment pair ((A, B), pa(B)) for a causal model characterized by G, we associate a vertex in b G. We deﬁne the edge set of b G as follows: Two vertices Wi = ((Ai, Bi), pa(Bi)) and Wj = ((Aj, Bj), pa(Bj)) of b G share an edge if and only if there exists no intervention that identiﬁes both Pr[Bi | do(pa(bi))] and Pr[Bj | do(pa(bj))]. By Theorem 4, this translates to the following deﬁnition.

Deﬁnition 11 (Edge set of b G). There exists an edge (conﬂict edge) between two distinct vertices Wi = ((Ai, Bi), pa(Bi)) and Wj = ((Aj, Bj), pa(Bj)) of b G, if one of the following conditions hold:

1. Ai Bj is non-empty; 2. Aj Bi is non-empty; 3. ai and pa(bj) are inconsistent; 4. aj and pa(bi) are inconsistent.

Given an undirected graph b G with the above speciﬁcations of vertices and edges, any minimum vertex coloring of b G corresponds to a minimum intervention cover of G.

Example 11 (MIC(H )). Consider the graph H (Figure 1c). For each bush and assignment pair of H , there exists a vertex in b H (Figure 4). The edges are constructed according to Deﬁnition 11. Note that any vertex coloring would contain three distinct colors for W1, W8 and W9. It is easy to see that no other vertex can use these colors. Hence, let Wc1 := {W1}, Wc2 := {W8} and Wc3 := {W8}.

Also, it is easy to see that coloring the remaining vertices requires at least two colors. Suppose Wc4 := {W2, W4, W5} and Wc5 := {W3, W6, W7}. From Theorem 4, it follows that the interventions do( ), do(X2 = 0), do(X2 = 1), do(X1 = 0) and do(X1 = 1) will identify all the informative local distributions (bush assignment pairs) in Wc1, Wc2, Wc3, Wc4, Wc5 respectively.

The following lemma is useful in designing an efﬁcient algorithm to construct the vertex set of b G.

Lemma 5. Let A , B form a bush for G. Then for all A A , there is a bush A, B for G such that B B (A \A).

Proof. Let A A . Since A , B form a bush for G,

C(B (A \A)) = {B (A \A)} is singleton, because every vertex of A has bi-directed edge to a vertex in B . A Pa(B (A \ A)), because A Pa(B ) by bush deﬁnition.

Therefore, there exists a B B (A \ A) such that A, B form a bush for G (where B will be the maximal Ccomponent of G[V \ A] that contains B ).

Algorithm 1 MIC(G)

1: IB FINDALLBUSHES(G) 2: W VERTEXCONSTRUCTION(IB, G) 3: Let W be the vertex set of Gco. 4: Construct the edges of Gco using Deﬁnition 11. 5: Find a minimum vertex coloring of Gco. Let Wc be the vertices colored using color c. 6: For each color c, let Ic= Pr[V \ Ac | do(ac)]

where, Ac = S

((A,B),pa(b)) Wc A

((A,B),pa(b)) Wc a 4.

7: return S

Algorithm 2 FINDALLBUSHES(G)

1: Initialize IB0 = S

Bj C(V){({}, Bj)} 2: for i 1 to n do 3: IBi = {} 4: if IBi 1 is empty then break 5: for all (A, B) IBi 1 do 6: for all W B do 7: B=FINDPAIRS(A {W}, B\{W}, G) 8: IBi B IBi return n S

4Here a is an assignment of A consistent with pa(B). Since no two vertices in Wc are connected by an edge, the union operation results an unambiguous assignment ac, due to conditions 3 and 4 of Deﬁnition 11.

Algorithm 3 FINDPAIRS(A , B \ {W}, G)

1: Initialize B = {} 2: for all B C(B \ {W}) do 3: if A , B is a bush then 4: B B (A , B ) return B

Algorithm 4 VERTEXCONSTRUCTION(IB,G)

1: Initialize pairs P = {} 2: for all (A, B) IB do 3: for all (pa(B) Σ|Pa(B)|)5 do 4: Add a new vertex ((A, B), pa(B)) to P return P

Our procedure for ﬁnding MIC(G) is shown in Algorithm 1. In order to ﬁnd the vertex set of b G from the given graph G, ﬁrst the algorithm stores the set of all bushes in the set IB. This step is efﬁciently executed by making use of Lemma 5. For a non-negative integer i, let IBi represent the set of all bushes A, B such that |A| = i. Given IBi, the algorithm determines IBi+1 as follows: For every bush (A, B) IBi and W B, bushes of the form (A = A {W}, B ) are added to IBi+1. By Lemma 5, we know that every bush A , B with |A | = i+1 will get added to IBi+1 from some bush A, B IBi (where A = A \{W} for some W B) in this process. Hence, at the end of this process, IB will contain the set of all bushes of G. Next, for every bush A, B IB and assignment pa(B) pair, the algorithm creates a vertex ((A, B), pa(B)) and adds it to W. Thus W forms the vertex set of b G. The edges are added according to Deﬁnition 11. The algorithm then proceeds to ﬁnd a minimum vertex coloring of b G. For each color c, the algorithm deﬁnes an intervention Ic such that Ic identiﬁes Pr[B | do(pa(b))] for every ((A, B), pa(b)) Wc, which follows from Theorem 4, Deﬁnition 11, the deﬁnition of Ic as in line 6 of Algorithm 1, and the fact that Wc forms an independent set (i.e., no two vertices in Wc are connected by an edge). Hence, S

c Ic identiﬁes ILD. From Claims 1 and 2, we know that ILD(G) identiﬁes LD(G), and LD(G) identiﬁes IS*(G). Hence, S

c Ic returned by the algorithm is an intervention cover of G. Next we prove (by contradiction) that S

c Ic is indeed a minimum intervention cover of G. Suppose there exists a joint information set IS of size smaller than S

c Ic that identiﬁes ILD(G). For each I IS , suppose, using Theorem 4, we identify vertices ((A, B), pa(B)) in b G such that Pr[B | do(pa(b))] is identiﬁable from I, i.e., I IIS(a ; pa(B) \ a ; B). By Deﬁnition 11, all such vertices form an independent set. Also, for each vertex ((A, B), pa(B)) of b G there must exist an intervention I IS such that I IIS(a ; pa(B) \ a ; B), by Theo-

5When the variables take values from different alphabets, pa(B) will loop over all possible assignments of Pa(B).

rem 4. It therefore follows that S

c Ic cannot be a minimum vertex coloring of b G. The proof of completeness of do-calculus follows from the proof of Theorem 4.

Theorem 6. Do-calculus is complete for MIC(G).

Proof. Follows from the fact that all of the results needed to prove Theorem 4 have been obtained using only the 3 rules of do-calculus along with basic manipulations.

Complexity of Minimum Intervention Cover

The complexity of Algorithm 1 is a function of the complexity of the minimum vertex coloring for the class of graphs Γ = { b G|G is a causal graph}. While the minimum vertex coloring for general graphs is known to be NP-complete, it is unclear whether this hardness result necessarily holds for the class of graphs Γ. In what follows, we show that (i) the size of MIC(G) can be large when the size of the largest Ccomponent is large; and (ii) when the degree of the causal graph G and the C-component sizes are bounded, the size of the minimum intervention is small, and the runtime of Algorithm 1 is polynomial in the size of G.

Lemma 7. For a given causal graph G, suppose there exists a bush A , B such that |A | = k. Then the size of MIC(G) is at least (|Σ| + 1)k.

Proof. The proof directly follows from Lemma 5. We know that for any A A , there exists a bush A, B. Let ((A, B), κ) and ((A, B), τ) be vertices of b G such that κ, and τ, are inconsistent on A. Then there is an edge between the above vertices in b G, because the assignments over A are inconsistent. Hence, for each such bush A, B, over all the possible assignments of A, there are |Σ||A| vertices in b G that form a clique. Let Ai and Aj be distinct subsets of A . From Lemma 5 we know the existence of bushes Ai, Bi and Aj, Bj. Note that there is an edge in b G between ((Ai, Bi), κ) and ((Aj, Bj), τ) for any assignments κ and τ, because6 Ai Bj = or Aj Bi = . The preceding claims together imply that the vertices described above must form a clique of size (|Σ|+1)k in b G because Pk i=0|Σ|i k i = (|Σ| + 1)k .

Hence the coloring number of b G is at least (|Σ| + 1)k.

The preceding lemma provides a characterization of the class of graphs for which the size of MIC(G) is superpolynomial in the size of G.

Corollary 7.1. For a given causal graph G, the size of MIC(G) is super-polynomial in n if there exists a bush A, B such that |A| is ω(log n).

Next we show that when the C-component size and the sum of the in-degree (excluding the unobservable parents) and out-degree of each vertex in the causal graph G are bounded, then MIC is small.

6Recall that Bi (B (A \Ai)) and Bj (B (A \Aj)).

Lemma 8. For a given causal graph G, suppose the sum of the in-degree (excluding the unobserved parents) and outdegree of each vertex of G is bounded by d, and the size of each C-component is bounded by p. Then the degree of the undirected graph, b G, constructed by Algorithm 1, is at most 22(p+pd2)|Σ|pd3.

Proof. Fix a vertex W = ((A, B), pa(B)) W of the undirected graph b G obtained using the reduction of Algorithm 1. Since the C-component size is at most p, the size of A B is at most p. For any vertex W = ((A , B ), pa(B )) that share an edge with W, we know that there does not exist an intervention I that satisﬁes I IIS(a ; pa(B)\a ; B) and I IIS(a ; pa(B ) \ a ; B ). That is, W satisﬁes one of the following properties:

1. A B is non-empty 2. B A is non-empty 3. a and pa(B ) are inconsistent 4. a and pa(B) are inconsistent.

Note that the size of A, A , B and B is at most p. Hence for a ﬁxed W, the total number of W s satisfying the ﬁrst condition above is at most 2p, and similarly, the number of W satisfying the second condition is at most 2p. Since the in-degree is bounded by d, |Pa(B)| is at most pd. Also, the out-degree of the graph is bounded by d, implying that there can be at most pd2 children of Pa(B). Hence for a ﬁxed W, the total number of W s satisfying the third condition is at most 2pd2|Σ|pd3, and the fourth condition is at most 2pd2|Σ|pd3. Thus, the degree of W is bounded by at most 22p(1+d2)|Σ|2pd3.

As the chromatic number of any graph of degree r is at most r + 1, we obtain a characterization of a class of causal graphs where the size of MIC is constant. Corollary 8.1. When the size of the C-component and the degree of the causal graph G are bounded by constants, the size of MIC(G) is O(1).

Note that the number of vertices of ˆG is O(n) when the size of the C-component and the sum of the in-degree and out-degree of the causal graph G are bounded by constants, and hence MIC(G) can be computed in time that is polynomial in the number of variables. Arguably, causal models that are readily communicable to humans need to be simple , and hence likely to have small in-degree and out-degree and C-components of bounded size.

Summary and Discussion We have provided an algorithm that, given a causal graph G, computes MIC(G), a minimum intervention cover of G, i.e., a minimum set of interventions that sufﬁce for identifying every causal effect of a causal model that is characterized by G. We have established the completeness of do-calculus for the minimum intervention cover problem. MIC(G) effectively offers an efﬁcient compilation of all of the information that is obtainable from observations and interventions relative to a causal graph G in anticipation of all possible

causal queries that are answerable by any causal model with structure speciﬁed by G. These results ﬁnd applications in a variety of contexts, including in particular, counterfactual inference, and generalizing causal effects across experimental settings. Work in progress is aimed at generalizing the deﬁnition of MIC(G) relative to an arbitrary subset of feasible interventions, as opposed to all possible interventions on V. This paper focused on minimizing the number of interventions, and not the number of variables targeted by each intervention. It would be interesting to consider variants of MIC(G) that minimize the sum of the numbers of variables targeted by all interventions, or minimize over both the number of interventions as well as the the number of variables targeted by the interventions.

Appendix: Proof of Theorem 4

Before proving Theorem 4, ﬁrst we recall the hedge structure of (Shpitser and Pearl 2006) which determines the identiﬁability of causal effects from the observable distribution of a causal graph G.

Deﬁnition 12 (Hedge). For a given causal graph G, two disjoint variables X, Y V, and assignments x, y, there exists a hedge for Pr[y | do(x)] in G, if there exists two subgraphs7 F, F of G such that

F and F are C-components. F is a subset8 of F . The leaf vertices9 of F and F are common. F includes at least one vertex from X. F does not include any vertex from X. Each observable variable has at most one outgoing edge in F and F . The set of leaf nodes of F is a subset of An(Y)GX Y.

Theorem 9 (Identifying causal effects from observable distribution (Shpitser and Pearl 2006)). For a given causal graph G, Pr[y | do(x)] is identiﬁable from the observable distribution Pr[V], if and only if, there does not exist a hedge for Pr[y | do(x)] in G.

Now we are ready to prove Theorem 4.

Soundness proof, Theorem 4. The if part of Theorem 4 easily follows from the known identiﬁcation algorithm of (Shpitser and Pearl 2006). Let M be the probabilistic causal model MG deﬁned over the causal graph G. Suppose there exists an intervention Pr M[V\S | do(s)] ISinp such that Pr M[V \ S | do(s)] IIS(a ; pa(B)G \ a ; B). Our goal is to show that Pr M[B | do(pa(B)G)] is identiﬁable from Pr M[V \ S | do(s)]. Note that the intervention do(s) induces a new model N on the graph G = G[V\S], which essentially simulates the

7A subgraph of G is a graph obtained from G by excluding some vertices and edges. 8F is a subset of F if all the vertices and edges of F are also present in F . 9A vertex with no child is called a leaf vertex.

interventional distribution Pr M[V \S | do(s)] as its observable distribution Pr N[V \ S] by excluding the variables S in the causal graph G . Formally, N is a probabilistic causal model, with observable variables V \ S, on the causal graph G = G[V \ S] such that:

The unobservable variables U, and the underlying distribution Pr[U] in N remains the same as M. For each vertex V V\S, the function f V in the induced model N behaves the same as the corresponding function in M, when restricted to the assignment s.

From the above deﬁnition,

Pr N[V \ S] = Pr M[V \ S | do(s)]. (1)

By the deﬁnition of IIS, we know that s and pa(B)G are consistent, and that B and S are disjoint. Also, Pa(B)G = Pa(B)G \ S. Hence, it follows that:

Pr N[B | do(pa(B)G )] = Pr M[B | do(pa(B)G)]. (2)

where pa(B)G is the assignment consistent with pa(B)G. Also, since A S, and B does not have a bi-directed edge to any vertex of V \ (B A) (refer Figure 3), we know that B is a maximal C-component in the graph G . Note that there can not exist a subgraph of G , F , that (i) is a C-component; and (ii) contains a vertex from Pa(B)G . Hence there can not exist a hedge (Shpitser and Pearl 2006) for Pr N[B | do(pa(B)G )] in the graph G , which implies that Pr N[B | do(pa(B)G )] is identiﬁable from the observable distribution Pr N[V \ S] of N. This, combined with Equations (1) and (2), imply that the required informative local distribution Pr M[B | do(pa(B)G)] is identiﬁable form Pr M[V \ S | do(s)].

Completeness proof, Theorem 4. Omitted 10.

Acknowledgements Saravanan Kandasamy was supported in part by the DRDO Frontiers Project DRDO0687, and by Ramanujan grant SB/S2/RJN-020/2017 of DST India at the Tata Institute of Fundamental Research. Arnab Bhattacharyya was supported at IISc by Ramanujan grant DSTO1358 and the DRDO Frontiers Project DRDO0687 and at NUS by an Ac RF Tier 1 grant on Inference and Testing of Sparse Models in High Dimensions . Vasant G Honavar was supported in part by the National Center for Advancing Translational Sciences, NIH through the grant UL1 TR000127 and TR002014, by the NSF, USA, through the grants 1518732, 1640834, and 1636795, the Pennsylvania State University s Institute for Cyberscience and the Center for Big Data Analytics and Discovery Informatics, the Edward Frymoyer Endowed Professorship in Information Sciences and Technology at Pennsylvania State University and the Sudha Murty Distinguished Visiting Chair in Neurocomputing and Data Science funded by the Pratiksha Trust at the Indian Institute of Science.The content is solely the responsibility of the authors and does not necessarily represent the ofﬁcial views of the sponsors.

10See supplementary material available at https://www.comp. nus.edu.sg/ arnab/aaai19-sup.pdf

References Bareinboim, E., and Pearl, J. 2013a. Causal transportability with limited experiments. In Proceedings of the 27th AAAI Conference on Artiﬁcial Intelligence, 95 101. Bareinboim, E., and Pearl, J. 2013b. Meta-transportability of causal effects: A formal approach. In Proceedings of the 16th International Conference on Artiﬁcial Intelligence and Statistics, 135 143. Bareinboim, E., and Pearl, J. 2014. Transportability from multiple environments with limited experiments: Completeness results. In Advances in Neural Information Processing Systems 27. 280 288. Bareinboim, E., and Pearl, J. 2016. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences 113(27):7345 7352. Bareinboim, E.; Lee, S.; Honavar, V.; and Pearl, J. 2013. Transportability from multiple environments with limited experiments. In Advances in Neural Information Processing Systems 26. 136 144. Berzuini, C.; Dawid, P.; and Bernardinell, L. 2012. Causality: Statistical perspectives and applications. John Wiley & Sons. Galles, D., and Pearl, J. 1995. Testing identiﬁability of causal effects. In Proceedings of the 11th Conference on Uncertainty in Artiﬁcial Intelligence, 185 195. Garey, M. R., and Johnson, D. S. 1990. Computers and Intractability; A Guide to the Theory of NP-Completeness. New York, NY, USA: W. H. Freeman & Co. Halpern, J. Y. 1998. Axiomatizing causal reasoning. In Proceedings of the 14th Conference on Uncertainty in Artiﬁcial Intelligence, 202 210. Hernan, M. A., and Robins, J. M. 2010. Causal inference. CRC Boca Raton, FL. Huang, Y., and Valtorta, M. 2006. Identiﬁability in causal bayesian networks: a sound and complete algorithm. In Proceedings of the 21st National Conference on Artiﬁcial intelligence, 1149 1154. AAAI Press. Hyttinen, A.; Eberhardt, F.; and J arvisalo, M. 2015. Docalculus when the true graph is unknown. In Proceedings of the 31st Conference on Uncertainty in Artiﬁcial Intelligence, 395 404. Imbens, G. W., and Rubin, D. B. 2015. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press. Kocaoglu, M.; Shanmugam, K.; and Bareinboim, E. 2017. Experimental design for learning causal graphs with latent variables. In Advances in Neural Information Processing Systems 30, 7018 7028. Kuroki, M., and Miyakawa, M. 1999. Identiﬁability criteria for causal effects of joint interventions. Journal of the Japan Statistical Society 29(2):105 117. Lee, S., and Honavar, V. 2013a. Causal transportability of experiments on controllable subsets of variables: ztransportability. In Proceedings of the 29th Conference on Uncertainty in Artiﬁcial Intelligence, 361 370.

Lee, S., and Honavar, V. 2013b. m-transportability: Transportability of a causal effect from multiple environments. In Proceedings of the 27th AAAI Conference on Artiﬁcial Intelligence, 583 590. Morgan, S. L., and Winship, C. 2014. Counterfactuals and causal inference. Cambridge University Press. Pearl, J., and Bareinboim, E. 2011. Transportability of causal and statistical relations: A formal approach. In Proceedings of the 25th AAAI Conference on Artiﬁcial Intelligence, 247 254. Pearl, J., and Robins, J. 1995. Probabilistic evaluation of sequential plans from causal models with hidden variables. In Proceedings of the 11th Conference on Uncertainty in Artiﬁcial Intelligence, 444 453. Pearl, J. 1995. Causal diagrams for empirical research. Biometrika 82(4):669 688. Pearl, J. 2009. Causality: Models, Reasoning and Inference. New York, USA: Cambridge University Press, 2nd edition. Peters, J.; Janzing, D.; and Sch olkopf, B. 2017. Elements of causal inference: foundations and learning algorithms. MIT press. Robins, J. 1986. A new approach to causal inference in mortality studies with a sustained exposure period application to control of the healthy worker survivor effect. Mathematical modelling 7:1393 1512. Shanmugam, K.; Kocaoglu, M.; Dimakis, A. G.; and Vishwanath, S. 2015. Learning causal graphs with small interventions. In Advances in Neural Information Processing Systems 28, 3195 3203. Shpitser, I., and Pearl, J. 2006. Identiﬁcation of joint interventional distributions in recursive semi-markovian causal models. In Proceedings of the 21st National Conference on Artiﬁcial Intelligence - Volume 2, 1219 1226. Shpitser, I., and Pearl, J. 2008. Complete identiﬁcation methods for the causal hierarchy. Journal of Machine Learning Research 9:1941 1979. Spirtes, P.; Glymour, C.; and Scheines, R. 2000. Causation, Prediction, and Search. MIT press, 2nd edition. Tian, J., and Pearl, J. 2002a. A general identiﬁcation condition for causal effects. In Proceedings of the 18th National Conference on Artiﬁcial Intelligence, 567 573. Tian, J., and Pearl, J. 2002b. On the testable implications of causal models with hidden variables. In Proceedings of the 18th Conference on Uncertainty in Artiﬁcial Intelligence, 519 527. Tsamardinos, I.; Triantaﬁllou, S.; and Lagani, V. 2012. Towards integrative causal analysis of heterogeneous data sets and studies. Journal of Machine Learning Research 13:1097 1157. Verma, T., and Pearl, J. 1990. Causal networks: Semantics and expressiveness. In Proceedings of the 4th Conference on Uncertainty in Artiﬁcial Intelligence, 69 78.