# Causal Effect Identifiability under Partial-Observability

Sanghack Lee and Elias Bareinboim, Department of Computer Science, Columbia University, New York, NY 10027, USA. Correspondence to: Sanghack Lee.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract. Causal effect identifiability is concerned with establishing the effect of intervening on a set of variables on another set of variables from observational or interventional distributions, under causal assumptions usually encoded in the form of a causal graph. Most of the results in this literature implicitly assume that every variable modeled in the graph is measured in the available distributions. In practice, however, the data collections of the different studies considered do not measure the same variables consistently. In this paper, we study the causal effect identifiability problem when the available distributions encompass different sets of variables, which we refer to as identification under partial-observability. We study a number of properties of the factors that comprise a causal effect under various levels of abstraction, and then characterize the relationship between them with respect to their status relative to the identification of a targeted intervention. We establish a sufficient graphical criterion for determining whether the effects are identifiable from partially-observed distributions. Finally, building on these graphical properties, we develop an algorithm that returns a formula for a causal effect in terms of the available distributions.

1. Introduction

One of the central goals in the data sciences (e.g., the health and social sciences), artificial intelligence, and machine learning is to discover cause-and-effect relationships. If a scientific study is performed appropriately, and causal relations are eventually discovered, the corresponding effects are more likely to hold under a broader set of conditions; causal relations are usually more stable and generalizable across disparate conditions (Pearl, 2000; Pearl & Mackenzie, 2018). Causal inference provides a collection of principles and tools to help understand the conditions under which these extrapolations can take place (Pearl, 2000; Spirtes et al., 2001; Bareinboim & Pearl, 2016). For instance, a scientist may be able to use an observational study to infer the effect of a new intervention by leveraging knowledge encoded in its causal model (Pearl, 1995; Tian & Pearl, 2002; Tian, 2002; Shpitser & Pearl, 2006; Huang & Valtorta, 2006b). Causal effects can also be inferred across a broad range of conditions, including across disparate populations (Bareinboim & Pearl, 2014; Lee et al., 2020), under selection bias (Bareinboim & Pearl, 2012b; Correa & Bareinboim, 2017), with missing data (Mohan & Pearl, 2014), and in the absence of the causal graph (Jaber et al., 2019), to cite a few. Further, recent advances in causal inference have led to algorithmic solutions for combining data collected under multiple, disparate regimes (observational and interventional) to identify a causal effect (Lee et al., 2019). Despite all the generality and power provided by these results, by and large, they implicitly assume that every modeled variable is consistently available across the different data collections.
Since each study is usually designed to fulfill its own objectives, datasets across studies tend to measure different sets of variables (i.e., the datasets have different columns). As a consequence, data relevant to a question about a specific effect may not be usable in another, possibly closely related study. For concreteness, consider the setting where a researcher aims to understand the effect of physical exercise ($X$) on stroke ($Y$), written as $P_x(y)$, and is analyzing two related datasets. The first is based on an experimental study estimating the effect of physical exercise ($X$) on blood pressure ($C$), collected from different age groups ($A$); commonly written as $P_X(A, C)$. The second dataset is based on an observational study of the association among body mass index (BMI, or $B$), blood pressure ($C$), and stroke ($Y$), i.e., $P(B, C, Y)$. The causal graphs representing these two studies are shown in Figs. 1a and 1b, respectively. After conducting standard causal analysis, the researcher realizes that $P_x(y)$, the effect of interest, cannot be inferred from either of the two datasets separately. She then creates a common representation of their union, which is summarized in the causal graph shown in Fig. 1c. Unfortunately, the identification algorithms available today assume full-observability, which means that graphs and datasets defined over different sets of variables cannot be taken as input.

Figure 1. (a, b) Causal graphs representing the experimental ($P_X(A, C)$) and observational ($P(B, C, Y)$) studies, respectively. (c) The causal graph representing the union of both studies.

On the other hand, the effect $P_x(y)$ is inferable by a careful combination of these studies through the expression $\sum_{a,c} P_x(a, c) \sum_b P(y \mid b, c) P(b)$. The first factor can be computed from the experimental study, while the other two can be obtained from the observational study (a numerical check of this combination appears at the end of this section). Our goal in this paper is to understand under what conditions inferences such as this one are allowed from first principles.

More broadly, and motivated by the lack of a systematic treatment of combining partially-observed datasets, we formally introduce and study the problem of causal effect identifiability under partial-observability. More specifically, the main contributions of this work are as follows: (i) we develop novel machinery to account for partial-observability constraints, including constructs for co-identification, embedding of critical identification factors, and a formal understanding of minimum viable embeddings; putting these results together, we derive a novel graphical condition for identification under partial-observability; (ii) we then develop the first general algorithm for this setting, which avoids redundant computations and runs in polynomial time for the known subclasses of identifiability problems under full-observability. Furthermore, we provide a detailed discussion of the necessity of our algorithm and the NP-completeness status of this particular identifiability problem. Finally, we discuss the extension of this work to the transportability setting (Bareinboim & Pearl, 2014), in which the corresponding datasets may come from multiple, heterogeneous domains.
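To make the opening example concrete, the following is a small numerical check (our own illustration, not code from the paper). The exact structure of Fig. 1c is not reproduced here; we assume the edges $A \rightarrow B \leftarrow X$, $B \rightarrow C \rightarrow Y$, and $X \leftrightarrow Y$ (consistent with the factorization derived for this graph in Sec. 4.1) together with arbitrary conditional probability tables. Exhaustive enumeration then confirms that the two-study combination above recovers the ground-truth effect:

```python
from itertools import product

# P(. = 0 | parents); complements are implied for binary variables.
pA = 0.6
pU = 0.7                                    # latent confounder of X and Y
pX = lambda u: 0.8 - 0.5 * u                # P(X=0 | u)
pB = lambda a, x: 0.1 + 0.2 * a + 0.4 * x   # P(B=0 | a, x)
pC = lambda b: 0.3 + 0.5 * b                # P(C=0 | b)
pY = lambda c, u: 0.2 + 0.3 * c + 0.4 * u   # P(Y=0 | c, u)

def val(p0, v):                             # P(. = v) from P(. = 0)
    return p0 if v == 0 else 1.0 - p0

def weight(a, u, x, b, c, y, do_x=False):
    """Product of the SCM's factors; under do(x), X's mechanism is removed."""
    px = 1.0 if do_x else val(pX(u), x)
    return (val(pA, a) * val(pU, u) * px * val(pB(a, x), b)
            * val(pC(b), c) * val(pY(c, u), y))

B2 = (0, 1)

def truth_Px(y, x):                         # ground-truth P_x(y), by enumeration
    return sum(weight(a, u, x, b, c, y, do_x=True)
               for a, u, b, c in product(B2, repeat=4))

def study1_Px(a, c, x):                     # P_x(a, c): the experimental study
    return sum(weight(a, u, x, b, c, y, do_x=True)
               for u, b, y in product(B2, repeat=3))

def study2(b, c, y):                        # P(b, c, y): the observational study
    return sum(weight(a, u, x, b, c, y)
               for a, u, x in product(B2, repeat=3))

def combined_Px(y, x):                      # sum_{a,c} P_x(a,c) sum_b P(y|b,c) P(b)
    pB_marg = {b: sum(study2(b, c, y) for c, y in product(B2, repeat=2)) for b in B2}
    out = 0.0
    for a, c in product(B2, repeat=2):
        pc_y = sum(study2(b, c, y) / sum(study2(b, c, yy) for yy in B2) * pB_marg[b]
                   for b in B2)
        out += study1_Px(a, c, x) * pc_y
    return out

for x, y in product(B2, repeat=2):
    assert abs(truth_Px(y, x) - combined_Px(y, x)) < 1e-9
print("two-study combination recovers P_x(y)")
```

Swapping in other CPT values leaves the assertion intact, since the identification argument is distribution-free under the assumed graph.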
One metaphor that will be informative and facilitate the understanding of this paper is to compare the task of identification under partial-observability to a jigsaw puzzle (Fig. 2). The targeted causal query and the underlying causal graph define the puzzle and its layout (Sec. 4.1), while each of the available observational and experimental distributions provides pieces and chunks (pieces tied together) for the puzzle (Secs. 4.1.1 and 4.1.2). The given pieces and chunks are then combined, without overlapping each other, to complete the puzzle, eventually forming the targeted effect (Sec. 4.2). For simplicity, we start the discussion in Sec. 3 (just after the preliminaries) with a relatively simple class of puzzles, and then build on it, refining our understanding.

Figure 2. A high-level abstraction of our problem as a jigsaw puzzle (panels: layout/pieces, chunks, solution). A causal query $P_x(y)$ is a puzzle together with how its pieces (factors) should be laid out. Each of the available distributions (represented as different colors) provides chunks of information, where a chunk is either a piece or the combination of multiple pieces. The solution for the puzzle is then putting a subset of the chunks together.

2. Preliminaries

Following conventions in the field, a variable is denoted by an uppercase letter, e.g., $Z$, and its value by the corresponding lowercase letter, $z \in \mathfrak{X}_Z$, where $\mathfrak{X}_Z$ is the state space of $Z$. Bold letters denote sets of variables or values, e.g., $\mathbf{z} \in \mathfrak{X}_{\mathbf{Z}} = \times_{Z \in \mathbf{Z}} \mathfrak{X}_Z$. For simplicity, we may omit curly braces for a singleton set used as an argument, e.g., $f(x)$ versus $f(\{x\})$.

This paper builds on the language of Structural Causal Models (SCMs) (Pearl, 2000). Each SCM $M$ is a quadruple $\langle \mathbf{U}, \mathbf{V}, \mathbf{F}, P(\mathbf{U}) \rangle$. The set of unobserved variables $\mathbf{U}$ follows the joint probability distribution $P(\mathbf{U})$. The set of observed variables $\mathbf{V}$ is specified through the set of structural functions $\mathbf{F} = \{f_V\}_{V \in \mathbf{V}}$, where each function is of the form $f_V(\mathrm{pa}_V, \mathbf{u}_V)$ such that $\mathrm{pa}_V$ and $\mathbf{u}_V$ are values for $\mathrm{PA}_V \subseteq \mathbf{V} \setminus \{V\}$ and $\mathbf{U}_V \subseteq \mathbf{U}$, respectively. The SCM $M$ induces a causal graph $\mathcal{G}$ over $\mathbf{V}$ with a directed edge $W \rightarrow V$ if $W \in \mathrm{PA}_V$ and a bidirected edge $W \leftrightarrow V$ if there exists an unobserved confounder (UC, for short) $U \in \mathbf{U}_W \cap \mathbf{U}_V$. Further, $M$ induces a set of observational and interventional distributions. One can intervene on $\mathbf{X}$, setting it to $\mathbf{x}$, which yields a submodel $M_{\mathbf{x}}$ where the function for each $X \in \mathbf{X}$ in $M$ is replaced by the constant $x \in \mathbf{x}$. The distribution generated by $M_{\mathbf{x}}$ is denoted by $P_{\mathbf{x}}(\mathbf{V})$ (or $P(\mathbf{V} \mid do(\mathbf{x}))$). For simplicity, $P_{\mathbf{Z}}(\mathbf{W})$ denotes the collection of probabilities $\{P_{\mathbf{z}}(\mathbf{w})\}_{\mathbf{z} \in \mathfrak{X}_{\mathbf{Z}}, \mathbf{w} \in \mathfrak{X}_{\mathbf{W}}}$, which we may simply call a distribution. For a more detailed discussion of SCMs, please refer to (Pearl, 2000, Ch. 7).

We denote by $\mathbf{V}_{\mathcal{H}}$ the vertices of a graph $\mathcal{H}$, and often use $\mathbf{V}$ for the vertices of $\mathcal{G}$ where no ambiguity arises. We denote by $\mathrm{pa}(\mathbf{W})_{\mathcal{G}}$ the union of the parents of each $W \in \mathbf{W}$ in $\mathcal{G}$; similarly, $\mathrm{ch}$, $\mathrm{an}$, and $\mathrm{de}$ denote children, ancestors, and descendants. Additionally, $\mathrm{Ch}$, $\mathrm{An}$, and $\mathrm{De}$ include their arguments as well, e.g., $\mathrm{Ch}(\mathbf{W})_{\mathcal{G}} = \mathrm{ch}(\mathbf{W})_{\mathcal{G}} \cup \mathbf{W}$. A subgraph of $\mathcal{G}$ over $\mathbf{V}' \subseteq \mathbf{V}$ is denoted by $\mathcal{G}[\mathbf{V}']$. Further, $\mathcal{G}_{\overline{\mathbf{X}}}$ and $\mathcal{G}_{\underline{\mathbf{Z}}}$ denote $\mathcal{G}$ with edges incoming to $\mathbf{X}$ and outgoing from $\mathbf{Z}$ removed, respectively.

The latent projection (or projection, for short) of a causal graph is defined so as to retain the causal relationships among a subset of variables. This concept is crucial in understanding partial-observability. We define projection as below, adopted from (Tian & Pearl, 2003).

Definition 1 (Projection). The projection of a causal graph $\mathcal{G}$ over $\mathbf{V}$ onto $\mathbf{V}' \subseteq \mathbf{V}$, denoted by $\mathcal{G}\langle\mathbf{V}'\rangle$, is a causal graph over $\mathbf{V}'$ such that, for every pair of vertices $X$ and $Y$:
1. There exists a directed edge $X \rightarrow Y$ in $\mathcal{G}\langle\mathbf{V}'\rangle$ if there exists a directed path from $X$ to $Y$ in $\mathcal{G}$ such that every vertex on the path other than $X$ and $Y$ is not in $\mathbf{V}'$.
2. There exists a bidirected edge $X \leftrightarrow Y$ in $\mathcal{G}\langle\mathbf{V}'\rangle$ if there exists a divergent path between $X$ and $Y$ in $\mathcal{G}$ such that every vertex on the path other than $X$ and $Y$ is not in $\mathbf{V}'$. (A divergent path between $X$ and $Y$ exists if there are two directed paths in $\mathcal{G}$, towards $X$ from $W_1$ and towards $Y$ from $W_2$, such that $W_1 = W_2$ or $W_1 \leftrightarrow W_2$.)

We denote by $\mathcal{G}_{-\mathbf{W}}$ a causal graph with $\mathbf{W}$ projected out, i.e., $\mathcal{G}_{-\mathbf{W}} = \mathcal{G}\langle\mathbf{V} \setminus \mathbf{W}\rangle$. Conditional independence statements and the rules of do-calculus (Pearl, 1995) that hold on a projection are valid in $\mathcal{G}$, and vice versa. We often simplify notation involving projections by letting $\mathcal{G}' = \mathcal{G}\langle\mathbf{V}'\rangle$; similarly, $\mathcal{G}''$ and $\mathcal{G}_j$ are defined for $\mathbf{V}''$ and $\mathbf{V}_j$, respectively. An illustration of how projection induces new edges, along with some remarks, is provided in App. A of (Lee & Bareinboim, 2020).

Graphical Constructs for Identifiability. Throughout the paper, we fix the use of a few symbols. A query $P_{\mathbf{x}}(\mathbf{y})$ is defined on a causal graph $\mathcal{G}$ of an unknown SCM $M$ with vertices (or variables) $\mathbf{V}$, where $\mathbf{X}$ and $\mathbf{Y}$ are possibly empty disjoint subsets of $\mathbf{V}$. Further, we define a few new symbols that greatly help in understanding the anatomy of a given graph with respect to the identification of a query, namely:

$\mathbf{X}^* = \mathbf{X} \cap \mathrm{An}(\mathbf{Y})_{\mathcal{G}_{\overline{\mathbf{X}}}}$  and  $\mathbf{X}^+ = \mathrm{An}(\mathbf{X}^*)_{\mathcal{G}} \setminus \mathbf{Y}^+$;
$\mathbf{Y}^* = \mathbf{Y}$  and  $\mathbf{Y}^+ = \mathrm{An}(\mathbf{Y})_{\mathcal{G}_{\overline{\mathbf{X}}}}$;
$\mathbf{V}^* = \mathbf{X}^* \cup \mathbf{Y}^*$  and  $\mathbf{V}^+ = \mathbf{X}^+ \cup \mathbf{Y}^+$.

The left-hand symbols capture the essential parts of the query and the right-hand symbols describe the variables relevant to identifying it; Fig. 3 provides an illustration of these notions. Their meanings and relationships will become clearer when they appear in lemmas and theorems, especially in Sec. 4. We will use the same color scheme to visualize graphs.

Figure 3. An illustrative example of $\mathbf{X}^+$ (light red), $\mathbf{Y}^+$ (light blue), $\mathbf{X}^*$ (dark red), and $\mathbf{Y}^*$ (dark blue), shown (a) on part of a causal graph and (b) as an abstract illustration. Bidirected edges are not relevant in defining these symbols and are omitted on purpose.

Concepts for Non-identifiability. A crucial building block for characterizing our problem of interest is a graphical structure articulating the non-identifiability of a causal effect. A graph $\mathcal{H}$ is said to be a c-component if a subset of its bidirected arcs forms a spanning tree over all vertices in $\mathcal{H}$ (Tian & Pearl, 2002; Tian, 2002); that is, a c-component can be understood as a set of vertices connected via bidirected edges. We utilize the decomposition of a set of vertices in a graph with respect to its maximal c-components, which we call the c-component decomposition. As a special type of c-component, a graph $\mathcal{H}$ with root set (sink nodes) $\mathbf{R}$ is said to be an $\mathbf{R}$-rooted c-forest if $\mathcal{H}$ is a c-component with a minimal number of edges.

Definition 2 (Hedge). A hedge is a pair of $\mathbf{R}$-rooted c-forests $\langle F, F' \rangle$ such that $F' \subseteq F$.

A hedge is said to be formed for $P_{\mathbf{x}}(\mathbf{y})$ if $\mathbf{R} \subseteq \mathrm{An}(\mathbf{Y})_{\mathcal{G}_{\overline{\mathbf{X}}}}$, $F \cap \mathbf{X} \neq \emptyset$, and $F' \cap \mathbf{X} = \emptyset$, which implies the non-identifiability of $P_{\mathbf{x}}(\mathbf{y})$ from $P(\mathbf{V})$ in $\mathcal{G}$ (Shpitser & Pearl, 2006). We prefer to separate the graphical definition of a hedge from the specific syntactic goal/task, following the discussion in (Lee et al., 2019). Such a hedge satisfies that $\mathbf{V}_{F \setminus F'} \cap \mathbf{V}^+$ intersects with $\mathbf{X}^*$; further, $\mathbf{V}_{F'} \subseteq \mathbf{Y}^+$, since $\mathbf{X}^+$ being part of $F'$ would imply that some $X \in \mathbf{X}$ is in $F'$, which violates the definition. Examples are illustrated in Fig. 4, where the first two causal graphs are hedges for $P_x(y)$ but the last one is not.

Figure 4. (a, b) Hedges for $P_x(y)$ where, in both cases, $\{Y\}$ is the sole root set and the only element of $F'$, with $F = \mathcal{G}$. (c) is not a hedge for $P_x(y)$, but the $\{Y_1, W\}$-rooted c-forests $F = \mathcal{G}[\{Y_1, W, X\}]$ and $F' = \mathcal{G}[\{Y_1, W\}]$ form a hedge for $P_x(y)$.
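To ground these constructions, here is a minimal Python sketch (ours, not the paper's code) of two routines used throughout: c-component decomposition and the latent projection of Def. 1. A graph is represented as sets of directed edges (pairs) and bidirected edges over vertex names; this representation is an assumption of the sketch.

```python
from itertools import product

def c_components(vertices, bidirected):
    """Partition `vertices` into maximal c-components, i.e., classes
    connected by bidirected edges restricted to `vertices`."""
    comp = {v: {v} for v in vertices}
    for a, b in bidirected:
        if a in comp and b in comp and comp[a] is not comp[b]:
            merged = comp[a] | comp[b]
            for v in merged:
                comp[v] = merged
    seen, out = set(), []
    for v in vertices:
        if id(comp[v]) not in seen:
            seen.add(id(comp[v]))
            out.append(comp[v])
    return out

def _reach(start, directed, dropped):
    """Vertices reachable from `start` by directed paths whose
    intermediate vertices all lie in `dropped`."""
    seen, stack, out = {start}, [start], set()
    while stack:
        v = stack.pop()
        for a, b in directed:
            if a == v and b not in seen:
                seen.add(b)
                out.add(b)
                if b in dropped:   # may continue only through dropped vertices
                    stack.append(b)
    return out

def latent_projection(vertices, directed, bidirected, keep):
    """The projection G<keep> of Def. 1, returned as (directed, bidirected)."""
    keep, dropped = set(keep), set(vertices) - set(keep)
    d_new = {(x, y) for x in keep
             for y in _reach(x, directed, dropped) if y in keep and y != x}
    def served(w):   # kept endpoints a divergent path can reach via w
        return {w} if w in keep else \
               {v for v in _reach(w, directed, dropped) if v in keep}
    b_new = set()
    # divergent paths stem from a bidirected pair or a single dropped vertex
    for s1, s2 in list(bidirected) + [(w, w) for w in dropped]:
        for x, y in product(served(s1), served(s2)):
            if x != y:
                b_new.add(frozenset((x, y)))
    return d_new, b_new
```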
Omitted proofs and other supporting materials are provided in (Lee & Bareinboim, 2020).

3. Identifiability with a Single Partially-Observed Distribution

We begin with a simpler, yet crucial, identifiability problem that concerns identifying a causal effect given a single partially-observed distribution. Using the jigsaw metaphor discussed earlier, this task can be seen as a puzzle where all the available chunks are of the same color, as defined next.

Definition 3 (Causal Effect Identifiability under a Partially-Observed Distribution). Given a causal graph $\mathcal{G}$, let $\mathbf{Z}', \mathbf{V}', \mathbf{X}, \mathbf{Y} \subseteq \mathbf{V}$ where $\mathbf{Z}' \cap \mathbf{V}' = \emptyset$ and $\mathbf{X} \cap \mathbf{Y} = \emptyset$. The causal effect $P_{\mathbf{x}}(\mathbf{y})$ is said to be identifiable from $P_{\mathbf{Z}'}(\mathbf{V}')$ in $\mathcal{G}$ if $P_{\mathbf{x}}(\mathbf{y})$ is (uniquely) computable from $P_{\mathbf{Z}'}(\mathbf{V}')$ in any causal model that induces $\mathcal{G}$ and satisfies $P_{\mathbf{Z}'}(\mathbf{V} \setminus \mathbf{Z}') > 0$.

The definition is similar to a number of identifiability problems with full-observability, except that the given data are partially-observed. The positivity assumption is imposed on $P_{\mathbf{Z}'}(\mathbf{V} \setminus \mathbf{Z}')$ (i.e., the full joint before the projection) instead of the given distribution $P_{\mathbf{Z}'}(\mathbf{V}')$. This ensures that partial-observability is correctly held responsible for the non-identifiability of a causal effect, especially when the effect is identifiable with $P_{\mathbf{Z}'}(\mathbf{V} \setminus \mathbf{Z}')$. We present several necessary criteria (a sketch checking these criteria appears at the end of this section).

Proposition 1. A causal effect $P_{\mathbf{x}}(\mathbf{y})$ is identifiable from $P_{\mathbf{Z}'}(\mathbf{V}')$ in $\mathcal{G}$ only if $\mathbf{Y} \subseteq \mathbf{V}'$ (inclusion of outcomes), $\mathbf{X}^* \subseteq \mathbf{V}' \cup \mathbf{Z}'$ (inclusion of minimal treatments), and $\mathbf{Z}' \cap \mathbf{Y}^+ = \emptyset$ (undisturbed mechanisms).

Under full-observability, the first condition is implied by the third, and the second holds trivially since $\mathbf{Z}' \cup \mathbf{V}' = \mathbf{V}$ and $\mathbf{X}, \mathbf{Y} \subseteq \mathbf{V}$. The third condition highlights that an intervention prohibits the understanding of the underlying natural mechanisms relevant to the distribution over $\mathbf{Y}$ (Lee et al., 2019). When these criteria hold, the problem is reducible to classic identifiability.

Lemma 1. Given $P_{\mathbf{x}}(\mathbf{y})$ and $P_{\mathbf{Z}'}(\mathbf{V}')$ in $\mathcal{G}$ satisfying the three criteria in Prop. 1, $P_{\mathbf{x}}(\mathbf{y})$ is identifiable from $P_{\mathbf{Z}'}(\mathbf{V}')$ in $\mathcal{G}$ if and only if $Q_{\mathbf{x}^* \setminus \mathbf{Z}'}(\mathbf{y})$ is identifiable in $(\mathcal{G} \setminus \mathbf{Z}')\langle\mathbf{V}' \cap \mathbf{V}^+\rangle$, where $Q = P_{\mathbf{z}'}$ with $\mathbf{z}'$ consistent with $\mathbf{x}^+$.

Roughly speaking, Lemma 1 implies that the effect is identifiable when no structure embeds a hedge under the projection onto the partially-observed variables. A simple example (Fig. 5) shows the identifiability of $P_x(y)$ given $P_Z(X, Y)$. The effect is not identifiable from $P(\mathbf{V})$ (due to the existence of a hedge with $F = \mathcal{G}$ and $F' = \mathcal{G}[\{W, Y\}]$), from $P_W(X, Y, Z)$ (a disturbed mechanism), nor from $P_Y(W, X, Z)$ (exclusion of outcomes).

Corollary 1 (Soundness and Completeness). $P_{\mathbf{x}}(\mathbf{y})$ is identifiable from $P_{\mathbf{Z}'}(\mathbf{V}')$ in $\mathcal{G}$ if and only if $\mathbf{Y} \subseteq \mathbf{V}'$, $\mathbf{X}^* \subseteq \mathbf{V}' \cup \mathbf{Z}'$, $\mathbf{Z}' \cap \mathbf{Y}^+ = \emptyset$, and $Q_{\mathbf{x}^* \setminus \mathbf{Z}'}(\mathbf{y})$ is identifiable in $(\mathcal{G} \setminus \mathbf{Z}')\langle\mathbf{V}' \cap \mathbf{V}^+\rangle$, where $Q = P_{\mathbf{z}'}$ with $\mathbf{z}'$ consistent with $\mathbf{x}^+$.

Figure 5. Causal diagrams representing the identifiability of $P_x(y)$ given $P_Z(X, Y)$ as $P_x(y) = P_z(y \mid x)$ for any $z$; panel (c) shows $(\mathcal{G} \setminus \{Z\})_{-W}$.

Figure 6. A causal graph $\mathcal{G}$ (a) which embeds hedges under different projections onto (b) $\{A, X, Y\}$ and (c) $\{B, X, Y\}$.
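As referenced above, the three necessary criteria of Prop. 1 are directly mechanizable. Below is a minimal sketch reusing the edge-set representation from the earlier sketch; the full test of Cor. 1 would additionally invoke a classic identification routine on the projected graph, which is omitted here.

```python
def An(targets, directed):
    """Ancestors of `targets`, inclusive (the An(.) of Sec. 2)."""
    out, changed = set(targets), True
    while changed:
        changed = False
        for a, b in directed:
            if b in out and a not in out:
                out.add(a)
                changed = True
    return out

def prop1_criteria(directed, X, Y, Z_p, V_p):
    """Check: Y subset of V'; X* subset of V' + Z'; Z' disjoint from Y+."""
    gx_bar = {(a, b) for a, b in directed if b not in X}  # cut edges into X
    y_plus = An(Y, gx_bar)                                # Y+
    x_star = set(X) & y_plus                              # X* = X ∩ An(Y) in G_Xbar
    return (set(Y) <= set(V_p)                   # inclusion of outcomes
            and x_star <= set(V_p) | set(Z_p)    # inclusion of minimal treatments
            and not set(Z_p) & y_plus)           # undisturbed mechanisms

# e.g., on a hypothetical chain Z -> X -> Y (not a figure from the paper):
d = {("Z", "X"), ("X", "Y")}
assert prop1_criteria(d, X={"X"}, Y={"Y"}, Z_p={"Z"}, V_p={"X", "Y"})
assert not prop1_criteria(d, X={"X"}, Y={"Y"}, Z_p={"X"}, V_p={"Z", "Y"})
```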
4. Identifiability with Multiple Partially-Observed Distributions

We now investigate the main task of this paper, i.e., how to systematically use multiple distributions with different levels of observability, as defined next.

Definition 4 (Causal Effect Identifiability under Partially-Observed Distributions (GID-PO)). Let $\mathcal{G}$ be a causal graph, and let $\mathbf{X}, \mathbf{Y} \subseteq \mathbf{V}$ be the sets corresponding to the treatment and outcome variables, respectively. Further, let $\mathbb{P} = \{P_{\mathbf{Z}_i}(\mathbf{V}_i)\}_{i=1}^{m}$ be a collection of partially-observed distributions such that $\mathbf{Z}_i$ and $\mathbf{V}_i$ are disjoint subsets of $\mathbf{V}$ for $1 \leq i \leq m$. The causal effect $P_{\mathbf{x}}(\mathbf{y})$ is identifiable from the graph $\mathcal{G}$ and $\mathbb{P}$ if it is uniquely computable from $\mathbb{P}$ in any model that induces $\mathcal{G}$ and in which $\{P_{\mathbf{Z}_i}(\mathbf{V} \setminus \mathbf{Z}_i)\}_{i=1}^{m}$ are positive distributions.

This definition generalizes g-identifiability (GID, Lee et al., 2019) with partial-observability. An example of the problem is shown in Fig. 7a, where $P(B, C, Y)$ and $P_X(A, C)$ are given to identify $P_x(y)$ (the same setting as Fig. 1c). Its formula can be derived as (see App. D of (Lee & Bareinboim, 2020) for the detailed derivation):

$$\sum_{a,c} P_{x'}(a)\, P_x(c \mid a) \sum_b P(y \mid b, c)\, P(b), \qquad (1)$$

where $x'$ can be any value in $\mathfrak{X}_X$. Naively incorporating Cor. 1 into GID, which looks for colored pieces but not chunks, fails to identify the query, since one of the pieces is not identifiable from any of the given datasets (Fig. 7c).

Lemma 2. Let $\mathcal{G}$ be a causal graph and $\mathbf{X}, \mathbf{Y} \subseteq \mathbf{V}$. A query $P_{\mathbf{x}}(\mathbf{y})$ is not identifiable if there exist two causal models $M^1$ and $M^2$ compatible with $\mathcal{G}$ such that $P^1_{\mathbf{Z}_i}(\mathbf{V}_i) = P^2_{\mathbf{Z}_i}(\mathbf{V}_i)$ for every $P_i \in \mathbb{P}$, but $P^1_{\mathbf{x}}(\mathbf{y}) \neq P^2_{\mathbf{x}}(\mathbf{y})$ and $P^1_{\mathbf{Z}_i}(\mathbf{V} \setminus \mathbf{Z}_i) = P^2_{\mathbf{Z}_i}(\mathbf{V} \setminus \mathbf{Z}_i) > 0$.

Consider the causal graph in Fig. 6a, where two observational distributions $P(X, Y, A)$ and $P(B, X, Y)$ are available. The causal query $P_x(y)$ is identifiable given $P(X, Y, A, B)$, but not from either of $P(X, Y, A)$ and $P(B, X, Y)$ alone, due to the existence of a hedge (Cor. 1). Further, one can show that the query is not identifiable even taking both into account. Let $f_X = U_1$, $f_A = U_1 \oplus U_2$, and $f_B = U_2$ be the functions common to $M^1$ and $M^2$, and let $f^1_Y = X \oplus A \oplus B \oplus U_Y$ and $f^2_Y = U_Y$. Further, let $U_1$ and $U_2$ be two fair coins, and $P(U_Y = 1) = 0.1$ for both models. Then, $M^1$ and $M^2$ agree on both $P(A, X, Y)$ and $P(B, X, Y)$, while $P^1_{X=0}(Y = 0) = 0.5$ and $P^2_{X=0}(Y = 0) = 0.9$.
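The construction above can be checked mechanically. The following enumeration (our own verification sketch) confirms that $M^1$ and $M^2$ agree on the two available marginals while disagreeing on $P_{X=0}(Y=0)$:

```python
from itertools import product
from collections import defaultdict

def run(model, do_x=None):
    """Joint distribution over (A, B, X, Y) for model 1 or 2."""
    joint = defaultdict(float)
    for u1, u2, uy in product((0, 1), repeat=3):
        p = 0.5 * 0.5 * (0.1 if uy else 0.9)         # U1, U2 fair; P(UY=1)=0.1
        x = u1 if do_x is None else do_x
        a, b = u1 ^ u2, u2
        y = (x ^ a ^ b ^ uy) if model == 1 else uy   # f_Y differs across models
        joint[(a, b, x, y)] += p
    return joint

def marginal(joint, idx):
    out = defaultdict(float)
    for k, p in joint.items():
        out[tuple(k[i] for i in idx)] += p
    return out

j1, j2 = run(1), run(2)
assert marginal(j1, (0, 2, 3)) == marginal(j2, (0, 2, 3))    # P(A, X, Y) agrees
assert marginal(j1, (1, 2, 3)) == marginal(j2, (1, 2, 3))    # P(B, X, Y) agrees
p1 = sum(p for k, p in run(1, do_x=0).items() if k[3] == 0)
p2 = sum(p for k, p in run(2, do_x=0).items() if k[3] == 0)
assert abs(p1 - 0.5) < 1e-9 and abs(p2 - 0.9) < 1e-9         # yet P_x(y) differs
```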
4.1. Characterization of Factors under Projection

To develop an algorithm capable of combining the different parts of the available distributions, we briefly review approaches to the related problem of decomposing distributions currently known in the literature. A joint probability distribution $P(\mathbf{V})$ in $\mathcal{G}$ can be seen as a product of causal effects in $\mathcal{G}$. Tian & Pearl (2002) introduced the Q-decomposition, which expresses $P(\mathbf{v})$ using c-factors as $P(\mathbf{v}) = \prod_i P_{\mathbf{v} \setminus \mathbf{s}_i}(\mathbf{s}_i)$, where each $\mathbf{S}_i$ is a c-component of $\mathbf{V}$. (Such a factorization leads to a natural divide-and-conquer approach to the problem of causal identification, which underpins, sometimes more or less explicitly, most of the results in this literature, e.g., (Huang & Valtorta, 2006a; Shpitser & Pearl, 2006; Bareinboim & Pearl, 2012a; 2014; Lee et al., 2019), to cite a few. Depending on the identification task, one can prove that the Q-decomposition leads to a complete characterization.) We characterize here such factorization under projections. As a first step, we define the key notion of generalized c-factors:

Definition 5 (gc-factors). Let $\mathbf{V}'$ be a subset of variables such that $\mathbf{V}^* \subseteq \mathbf{V}' \subseteq \mathbf{V}^+$, and let $\mathcal{G}' = \mathcal{G}\langle\mathbf{V}'\rangle$. Let $\mathbb{S} = \{\mathbf{S}_i\}_{i=1}^{k}$ be the c-components of $\mathcal{G}'[\mathbf{Y}'^+]$, where $\mathbf{Y}'^+ = \mathrm{An}(\mathbf{Y})_{\mathcal{G}'_{\overline{\mathbf{X}}}} \setminus \mathbf{X}$. Then, the gc-factors of $P_{\mathbf{x}}(\mathbf{y})$ in $\mathcal{G}$ with respect to $\mathbf{V}'$ are defined as $\mathcal{F}^{\mathbf{X},\mathbf{Y}}_{\mathcal{G},\mathbf{V}'} = \{\langle \mathrm{pa}(\mathbf{S}_i)_{\mathcal{G}'} \setminus \mathbf{S}_i,\ \mathbf{S}_i \rangle\}_{i=1}^{k}$.

For example, consider the causal graph $\mathcal{G}$ in Fig. 6a and a query $P_x(y)$, where $\mathbf{V}^* = \{X, Y\}$ and $\mathbf{V}^+ = \mathbf{V}$. With $\mathbf{V}' = \mathbf{V}$, $\mathbf{Y}'^+ = \{A, B, Y\}$ (i.e., the variables in blue), and $\mathbb{S} = \{\{A, B\}, \{Y\}\}$ (i.e., the two c-components in the blue area). Hence, the gc-factors are $\mathcal{F}^{X,Y}_{\mathcal{G},\mathbf{V}} = \{\langle \emptyset, \{A, B\} \rangle, \langle \{A, B, X\}, \{Y\} \rangle\}$. With $\mathbf{V}' = \{A, X, Y\}$, $\mathcal{G}'$ is shown in Fig. 6b. In this case, $\mathbf{Y}'^+ = \{A, Y\}$, $\mathbb{S} = \{\{A, Y\}\}$, and $\mathcal{F}^{X,Y}_{\mathcal{G},\{A,X,Y\}} = \{\langle \{X\}, \{A, Y\} \rangle\}$.

A gc-factor (or, simply, a factor), represented as a pair of sets of variables, can be used to denote a set of distributions, e.g., $\langle \mathbf{X}_i, \mathbf{Y}_i \rangle \in \mathcal{F}^{\mathbf{X},\mathbf{Y}}_{\mathcal{G},\mathbf{V}'}$ and $P_{\mathbf{X}_i}(\mathbf{Y}_i)$. We may call the first and second elements the X- and Y-side of the gc-factor, respectively. Graphically, this decomposes $\mathbf{Y}'^+$ in $\mathcal{G}'$ with respect to c-components, so that the Y-side of every gc-factor is maximally confounded with respect to the underlying projection; the X-side resides in $\mathbf{Y}'^+ \cup \mathbf{X}^*$. For brevity, we simplify the notation by $\mathcal{F} = \mathcal{F}^{\mathbf{X},\mathbf{Y}}_{\mathcal{G},\mathbf{V}^+}$, and a superscript is delegated to $\mathbf{V}$ so that $\mathcal{F}' = \mathcal{F}^{\mathbf{X},\mathbf{Y}}_{\mathcal{G},\mathbf{V}'}$. Also, we interchangeably use a factor $\langle \mathbf{Z}, \mathbf{W} \rangle$ with $P_{\mathbf{Z}}(\mathbf{W})$, or with $P_{\mathbf{z}}(\mathbf{w})$ for an arbitrary assignment.

Figure 7. Causal graphs (a) $\mathcal{G}$ and (b) $\mathcal{G}_{-B}$. Corresponding gc-factors (c, d) represented as a puzzle, where identified factors are colored green, $P_X(A, C)$, and purple, $P(B, C, Y)$.

Probabilities associated with the gc-factors of $P_{\mathbf{x}}(\mathbf{y})$ in $\mathcal{G}$ with respect to $\mathbf{V}'$ can form an expression for $P_{\mathbf{x}}(\mathbf{y})$:

Proposition 2 (gc-decomposition). For $\mathbf{V}^* \subseteq \mathbf{V}' \subseteq \mathbf{V}^+$,
$$P_{\mathbf{x}}(\mathbf{y}) = \sum_{\mathbf{y}'^+ \setminus \mathbf{y}}\ \prod_{\{i \,\mid\, \langle \mathbf{X}_i, \mathbf{Y}_i \rangle \in \mathcal{F}'\}} P_{\mathbf{x}_i}(\mathbf{y}_i). \qquad (2)$$

The algorithm for general identifiability (Lee et al., 2019) makes use of this decomposition based on $\mathcal{F}$ (i.e., restricted to $\mathbf{V}' = \mathbf{V}^+$) and identifies each factor using one of the available distributions. However, such a strategy, relying on the decomposition based on $\mathcal{F}$, is insufficient for handling partially-observed distributions. Recall Fig. 7a, where $P_x(y)$ can be factorized, following Prop. 2, as
$$\sum_{a,b,c} P(a)\, P_{a,x}(b)\, P_b(c)\, P_c(y). \qquad (3)$$
Unfortunately, $P_{a,x}(b) = P(b \mid a, x)$ is not identifiable from either of the available distributions, since no distribution includes all of $A$, $B$, and $X$ (Cor. 1). On the other hand, a gc-decomposition based on a projection onto $\mathbf{V}'' = \{A, C, X, Y\}$ (Fig. 7b) yields
$$\sum_{a,c} P(a)\, P_{a,x}(c)\, P_c(y), \qquad (4)$$
which allows each factor to be identified by at least one of the available distributions (i.e., $P(a) = P_{x'}(a)$ for any $x' \in \mathfrak{X}_X$, $P_{a,x}(c) = P_x(c \mid a)$, and $P_c(y) = \sum_b P(y \mid b, c) P(b)$). This provides a basis for the following sufficient condition.

Lemma 3 (Soundness). Let $\mathcal{G}$, $P_{\mathbf{x}}(\mathbf{y})$, and $\mathbb{P}$ be the causal graph, the query, and the distributions forming a GID-PO instance. If there exists a subset of variables $\mathbf{V}'$ with $\mathbf{V}^* \subseteq \mathbf{V}' \subseteq \mathbf{V}^+$ such that every term $P_{\mathbf{x}_i}(\mathbf{y}_i)$ for each gc-factor $\langle \mathbf{X}_i, \mathbf{Y}_i \rangle \in \mathcal{F}'$ is identifiable from some $P \in \mathbb{P}$, then $P_{\mathbf{x}}(\mathbf{y})$ is identifiable from $\mathbb{P}$ in $\mathcal{G}$.

Given Lemma 3, we are interested in finding $\mathbf{V}'$, a subset of $\mathbf{V}$, that yields a gc-decomposition in which every gc-factor is identifiable. (After this work was submitted in early February 2020, Lee & Shpitser (2020) independently introduced the problem of mID, where "m" stands for marginal distributions, which corresponds to the notion of partial-observability here (Def. 4). Even though there are subtle differences in terminology and notation, their main result in this context can be seen as our Lemma 3.) A naive solution emerges from searching over the exponentially many subsets of $\mathbf{V}^+$. While this would certainly lead to a sound procedure, it offers little to no insight into the problem. This motivates us to study the relationships among gc-factors at different levels of projection, so as to develop an efficient solution while avoiding this naive and clearly intractable approach.
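Before turning to embedding factors, the following sketch ties Def. 5 and Eq. (2) together, reusing `latent_projection`, `An`, and `c_components` from the earlier sketches; the exclusion of treatments from $\mathbf{Y}'^+$ follows the worked example under Def. 5.

```python
def gc_factors(vertices, directed, bidirected, X, Y, V_p):
    """gc-factors of P_x(y) w.r.t. V' (Def. 5), as (X-side, Y-side) pairs."""
    d, b = latent_projection(vertices, directed, bidirected, V_p)
    d_xbar = {(s, t) for s, t in d if t not in X}   # cut edges into X
    y_plus = An(Y, d_xbar) - set(X)                 # Y'^+ (treatments excluded)
    b_sub = [tuple(e) for e in b if set(e) <= y_plus]
    factors = []
    for S in c_components(y_plus, b_sub):           # c-components of G'[Y'^+]
        pa_S = {s for s, t in d if t in S} - S      # pa(S) in G', minus S
        factors.append((frozenset(pa_S), frozenset(S)))
    return factors

# Assuming, as in the Sec. 1 sketch, the structure A -> B <- X, B -> C -> Y,
# X <-> Y for Fig. 7a:
V = {"A", "B", "C", "X", "Y"}
d = {("A", "B"), ("X", "B"), ("B", "C"), ("C", "Y")}
b = {frozenset(("X", "Y"))}
print(gc_factors(V, d, b, X={"X"}, Y={"Y"}, V_p=V))          # matches Eq. (3)
print(gc_factors(V, d, b, X={"X"}, Y={"Y"}, V_p=V - {"B"}))  # matches Eq. (4)
```

Summing the product of the returned factors per Eq. (2) then expresses $P_x(y)$ at the chosen level of projection.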
4.1.1. Embedding Factor and Co-identification

In this section, we formally relate gc-factors before and after marginalization through two new concepts, embedding factor and co-identification, which associate factors and their identification under various levels of projection, allowing us to view the task of identification more comprehensively.

Comparing the decompositions in Eqs. (3) and (4), based on $\mathcal{G}$ and $\mathcal{G}_{-B}$, respectively, we observe that the two factors $P(a)$ and $P_c(y)$, which do not involve $B$, are shared, while the other two factors $P_{a,x}(b)$ and $P_b(c)$ in Eq. (3) are replaced by $P_{a,x}(c)$ in Eq. (4), where we can elicit $\sum_b P_{a,x}(b) P_b(c) = P_{a,x}(c)$. We characterize such changes in gc-factors before and after a projection (equivalently, a marginalization); a sketch of this merging procedure follows at the end of this subsection.

Proposition 3 (Factors under Marginalization). Let $\mathbf{W} \subseteq \mathbf{V}'$ and $\mathbf{V}'' = \mathbf{V}' \setminus \mathbf{W}$, where $\mathbf{W} \cap \mathbf{V}^* = \emptyset$. Let $\mathcal{H}$ be an undirected graph whose vertices are the factors $\langle \mathbf{X}'_j, \mathbf{Y}'_j \rangle \in \mathcal{F}'$ and where an edge exists between two factors if their Y-sides both intersect $\mathrm{Ch}(W)_{\mathcal{G}'}$ for some $W \in \mathbf{W}$. Then, the vertices (i.e., gc-factors) in each connected component of $\mathcal{H}$ are merged to form $\langle \mathbf{X}''_k, \mathbf{Y}''_k \rangle \in \mathcal{F}''$ such that
$$\mathbf{Y}''_k = \bigcup\nolimits_{j \in \mathbb{j}} \mathbf{Y}'_j \setminus \mathbf{W}, \qquad \mathbf{X}''_k = \Big(\bigcup\nolimits_{j \in \mathbb{j}} \mathbf{X}'_j\Big) \setminus \mathbf{Y}''_k \setminus \mathbf{W},$$
where $\mathbb{j}$ is the set of indices of the gc-factors in $\mathcal{F}'$ forming the connected component. The other gc-factors, those not involving $\mathbf{W}$, remain intact, shared by both $\mathcal{F}'$ and $\mathcal{F}''$.

For instance, marginalizing $B$ out of Eq. (3) consolidates the factors whose Y-sides intersect $\mathrm{Ch}(B)_{\mathcal{G}} = \{B, C\}$, that is, $P_{a,x}(b)$ and $P_b(c)$. Further, $\sum_b P_{a,x}(b) P_b(c)$ results in a new gc-factor with Y-side $(\{B\} \cup \{C\}) \setminus \{B\} = \{C\}$ and X-side $(\{A, X\} \cup \{B\}) \setminus \{C\} \setminus \{B\} = \{A, X\}$.

We introduce an embedding relationship to describe the connection between the factors merged through a projection and the resulting factor.

Definition 6 (Embedding Factor). A gc-factor $\langle \mathbf{X}'_i, \mathbf{Y}'_i \rangle \in \mathcal{F}'$ is said to be embedded in a gc-factor $\langle \mathbf{X}''_k, \mathbf{Y}''_k \rangle \in \mathcal{F}''$, for $\mathbf{V}'' \subseteq \mathbf{V}'$, if $\mathbf{Y}''_k$ contains any variable in the Y-sides of the gc-factors in the connected component of $\mathcal{H}$ containing $\langle \mathbf{X}'_i, \mathbf{Y}'_i \rangle$, where $\mathcal{H}$ is constructed as follows: $\mathcal{H}$ is an undirected graph over the gc-factors in $\mathcal{F}'$ with an edge between two factors, say $\langle \mathbf{X}'_i, \mathbf{Y}'_i \rangle$ and $\langle \mathbf{X}'_j, \mathbf{Y}'_j \rangle$, if both share a variable of $\mathbf{V}' \setminus \mathbf{V}''$, that is, $(\mathbf{X}'_i \cup \mathbf{Y}'_i) \cap (\mathbf{X}'_j \cup \mathbf{Y}'_j) \cap (\mathbf{V}' \setminus \mathbf{V}'') \neq \emptyset$.

Note that a gc-factor is also a (non-proper) embedding factor of itself, and that $P_{\mathbf{x}}(\mathbf{y})$ itself, as $\langle \mathbf{X}, \mathbf{Y} \rangle$, is an embedding factor of every gc-factor. We illustrate the embedding relationships among all gc-factors of $P_x(y)$ in $\mathcal{G}$ (Fig. 7a) in Fig. 8a. The four factors at the bottom correspond to $\mathcal{F}$, and the embedding relationships are represented as directed edges. For instance, $\langle \{A, X\}, \{C\} \rangle$ (i.e., $P_{a,x}(c)$) is an embedding factor of $\langle \{B\}, \{C\} \rangle$ and $\langle \{A, X\}, \{B\} \rangle$ (Fig. 7a). Quantitatively, embedding relationships can be understood as sum-product relationships (Fig. 8b), where projections are equivalent to marginalizations. Even when a gc-factor in $\mathcal{F}$ (a finest-grained factor) cannot be identified by itself, it can be identified as a group under the summation, i.e., when its embedding factor (a coarser-grained factor) is identified.

Figure 8. (a) Embedding relationships among the gc-factors $P(a)$, $P_{a,x}(b)$, $P_b(c)$, $P_c(y)$, $P_x(b)$, $P_{a,x}(c)$, $P_b(y)$, $P_x(c)$, and $P_{a,x}(y)$, where a directed edge $i \rightarrow j$ indicates that $j$ embeds $i$ (represented with a transitive reduction). (b) The same embedding relationships in sum-product notation.

Definition 7 (Co-Identification). If a factor $Q$ is identifiable with a distribution, the factors in $\mathcal{F}$ embedded in $Q$ are said to be co-identified with the distribution with respect to $Q$.
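As referenced above, Prop. 3's merge is a small connected-components computation. A sketch, operating on the (X-side, Y-side) pairs produced by `gc_factors`:

```python
def marginalize_factors(factors, W, directed):
    """Merge gc-factors per Prop. 3 when W is marginalized out of V'."""
    parent = list(range(len(factors)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for w in W:
        ch_w = {w} | {t for s, t in directed if s == w}   # Ch(w) in G'
        idxs = [i for i, (_, ys) in enumerate(factors) if set(ys) & ch_w]
        for i in idxs[1:]:                                 # same-w factors merge
            parent[find(i)] = find(idxs[0])
    comps = {}
    for i in range(len(factors)):
        comps.setdefault(find(i), []).append(i)
    out = []
    for idxs in comps.values():
        ys = set().union(*(factors[i][1] for i in idxs)) - set(W)
        xs = set().union(*(factors[i][0] for i in idxs)) - ys - set(W)
        out.append((frozenset(xs), frozenset(ys)))
    return out

# Reproducing the Eq. (3) -> Eq. (4) example (graph structure as assumed above):
F = [(frozenset(), frozenset("A")), (frozenset("AX"), frozenset("B")),
     (frozenset("B"), frozenset("C")), (frozenset("C"), frozenset("Y"))]
d = {("A", "B"), ("X", "B"), ("B", "C"), ("C", "Y")}
print(marginalize_factors(F, {"B"}, d))   # -> P(a), P_{a,x}(c), P_c(y)
```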
With the puzzle metaphor, co-identification of a factor refers to whether there will be a colored chunk covering the piece. Following the strategy described in Lemma 3, $P_{\mathbf{x}}(\mathbf{y})$ will only be identified when every gc-factor in $\mathcal{F}$ is co-identified by one of the available distributions. However, there can be an exponential number of embedding factors for a gc-factor in $\mathcal{F}$. Hence, we investigate the relationship between the (non-)identifiability of embedding factors.

4.1.2. Minimum Viable Embedding Factor

We introduce the crucial concept of a minimum viable embedding factor (MVEF), which helps characterize the co-identification of gc-factors in $\mathcal{F}$ with respect to a single available distribution. Its purpose is to find the finest-grained factors that a (partially-observed) distribution might be able to identify. In other words, just as the default decomposition offers factors ($\mathcal{F}$) of the right granularity for fully-observed distributions, we attempt to find factors of the appropriate granularity with respect to a single partially-observed distribution.

Definition 8 (Minimum Viable Embedding Factor (MVEF)). An embedding factor $\langle \mathbf{X}'', \mathbf{Y}'' \rangle \in \mathcal{F}''$ of a gc-factor $\langle \mathbf{X}_i, \mathbf{Y}_i \rangle \in \mathcal{F}$ is said to be a minimum viable embedding factor of $\langle \mathbf{X}_i, \mathbf{Y}_i \rangle$ with respect to $P_{\mathbf{Z}'}(\mathbf{V}')$ if the three criteria $\mathbf{Y}'' \subseteq \mathbf{V}'$; $\mathbf{X}'' \subseteq \mathbf{V}' \cup \mathbf{Z}'$; and $\mathbf{Z}' \cap \mathrm{An}(\mathbf{Y}'')_{\mathcal{G}_{\overline{\mathbf{X}''}}} = \emptyset$ hold true, and $\mathbf{V} \setminus \mathbf{V}''$ is minimal. Further, $\mathbf{V} \setminus \mathbf{V}''$ is said to be an MVEF-admissible set.

Given one of the available distributions in $\mathbb{P}$, we want to check whether it co-identifies a gc-factor in $\mathcal{F}$. Among the exponentially many embedding factors, we can choose one that satisfies the necessary conditions specified in Prop. 1. A polynomial-time algorithm for finding an MVEF is depicted in App. D of (Lee & Bareinboim, 2020); it iteratively seeks variables to be projected out in order to satisfy the necessary conditions (a brute-force rendering of Def. 8 is sketched below).

For example, consider the graph in Fig. 9a, where there are four factors in $\mathcal{F}$ (Fig. 9d). Consider co-identifying the factor $P_d(e, y_1)$ (with its Y-side highlighted in Fig. 9a) with $P_B(A, X, \mathbf{Y}) \in \mathbb{P}$. Since $E$ does not appear in $\mathbf{V}' = \{A, X, \mathbf{Y}\}$ and $D$ does not appear in $\mathbf{Z}' \cup \mathbf{V}' = \{B, A, X, \mathbf{Y}\}$ (Prop. 1), we may project both $D$ and $E$ out of $\mathcal{G}$ at once, and examine the resulting embedding factor of $P_d(e, y_1)$ in $\mathcal{G}_{-\{E,D\}}$ (Fig. 9c), that is, $P_{a,b,x}(\mathbf{y})$. The puzzle diagram illustrates how projecting out $D$ and $E$, or equivalently applying $\sum_{d,e}$, yields a chunk with three pieces, all except $P(a, b)$. Further, a Hasse diagram (Fig. 9e) represents the transitive reduction of the embedding relationships, where one can verify that $P_{a,b,x}(\mathbf{y})$ embeds the three factors in $\mathcal{F}$ excluding $P(a, b)$, as well as two intermediate factors. Another example, given in the Appendix, shows that obtaining an MVEF may take multiple steps.

Figure 9. (a, b, c) Causal diagrams with the c-components in $\mathbf{Y}'^+$ highlighted, which correspond to the Y-sides of gc-factors; panel (c) shows $\mathcal{G}_{-\{E,D\}}$. (d) $\mathcal{F}$ as a puzzle, with pieces $P(ab)$, $P_{ax}(d)$, $P_{be}(y_2)$, and $P_d(ey_1)$, and how pieces are merged by marginalization. (e) A Hasse diagram for the embedding relationships among factors (not exhaustive), with intermediate factors such as $P_x(bd)$, $P_e(ay_2)$, $P_{ax}(ey_1)$, $P_{bd}(\mathbf{y})$, $P_{ex}(dy_2)$, $P_x(bey_1)$, $P_d(a\mathbf{y})$, and $P_{abx}(\mathbf{y})$; the four vertices at the bottom are $\mathcal{F}$. The highlighted areas in (a, b, c) correspond to the three rightmost vertices in (e).
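Since the paper's polynomial-time MVEF procedure lives in App. D, the following is only a brute-force rendering of Def. 8 itself (our own sketch): enumerate candidate admissible sets by increasing size and return the first embedding factor passing the three criteria. The embedding test below uses a simple Y-side intersection, which suffices when the target's Y-side is not projected out entirely; a fuller treatment would follow Def. 6's connected components.

```python
from itertools import combinations

def find_mvef(target, dist, vertices, directed, bidirected, X, Y, V_plus):
    """Brute-force Def. 8: target is an (X-side, Y-side) gc-factor,
    dist = (Z', V') is the available distribution."""
    Z_p, V_p = dist
    fixed = set(X) | set(Y)                     # V* must stay in any projection
    pool = sorted(set(V_plus) - fixed)
    for size in range(len(pool) + 1):           # smallest admissible set first
        for drop in combinations(pool, size):
            keep = set(V_plus) - set(drop)
            for Xk, Yk in gc_factors(vertices, directed, bidirected, X, Y, keep):
                if not set(target[1]) & Yk:     # must embed the target factor
                    continue
                d_xbar = {(s, t) for s, t in directed if t not in Xk}
                if (Yk <= set(V_p)                        # criterion 1
                        and Xk <= set(V_p) | set(Z_p)     # criterion 2
                        and not set(Z_p) & An(Yk, d_xbar)):  # criterion 3
                    return (Xk, Yk), set(drop)  # the MVEF and its admissible set
    return None                                 # no MVEF exists
```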
An MVEF, if it exists, is uniquely determined. Further, the MVEFs obtained from a given distribution for a subset of gc-factors are disjoint with respect to the gc-factors they embed.

Proposition 4 (Uniqueness). If an MVEF exists for a gc-factor with respect to a distribution, then it is unique.

Proposition 5 (Disjointness). Given a distribution, let $Q$ and $R$ be the MVEFs of two gc-factors in $\mathcal{F}$, respectively. Then either $Q = R$, or there is no common gc-factor embedded in both $Q$ and $R$.

MVEFs are only a subset of all factors at different levels. If they are identifiable with a distribution, they correspond to same-colored chunks. Should we further examine the possibility of other chunks of the same color? That is, if an MVEF is found but not identified, is it still possible for factors further embedding the MVEF to be identifiable? We show next that factors embedding a non-identified MVEF are themselves non-identifiable.

Lemma 4 (Hedge over Embedding Factors). Consider a gc-factor $\langle \mathbf{X}_i, \mathbf{Y}_i \rangle \in \mathcal{F}$ and a distribution $P_{\mathbf{Z}'}(\mathbf{V}')$. The gc-factor is not co-identifiable with $P_{\mathbf{Z}'}(\mathbf{V}')$ with respect to any of its embedding factors if its MVEF does not exist, or if there exists a hedge for $P_{\mathbf{x}'' \setminus \mathbf{Z}'}(\mathbf{y}'')$ in $(\mathcal{G} \setminus \mathbf{Z}')\langle\mathbf{V}'\rangle$ given the MVEF $\langle \mathbf{X}'', \mathbf{Y}'' \rangle \in \mathcal{F}''$.

Hence, the failure to identify an MVEF informs us that none of its embedding factors needs to be examined for (non-)identifiability. Below is a complementary lemma stating that, in puzzle terms, any chunk of a given color is composed of MVEFs (i.e., the smallest chunks of that color).

Lemma 5 (Compositionality). Given a distribution $P_{\mathbf{Z}'}(\mathbf{V}')$, any identifiable embedding factor of a gc-factor in $\mathcal{F}$ can be represented as a summation over the product of a subset of identified MVEFs.

For instance, given that $P(a)$ and $P_{a,x}(c)$ are identified with $P_X(A, C)$ (Fig. 7d), the identifiability of $P_x(c)$ as $\sum_a P(a) P_{a,x}(c)$ is immediate, and checking it is redundant. Thus, finding MVEFs and checking their identifiability are sufficient to understand how one distribution contributes to the co-identification of a portion of the factors.

4.2. Algorithm for GID-PO

The previous sections discussed which factors, at different levels of projection, are available from each distribution. The identified MVEFs from the available distributions thus become chunks of different colors, which need to be put together to complete the causal jigsaw puzzle $P_{\mathbf{x}}(\mathbf{y})$. We devise a two-phase algorithm called GID-PO (Alg. 1), which first identifies MVEFs co-identifying the gc-factors in $\mathcal{F}$ (the colorful chunks), and then combines them to produce a formula for $P_{\mathbf{x}}(\mathbf{y})$.

The implementation of the first phase is straightforward given the characterization of MVEFs in the previous section. For every pair of a gc-factor in $\mathcal{F}$ and a distribution in $\mathbb{P}$ (Line 3), the existence of an MVEF is first checked (Line 5). If it exists and is identifiable, the algorithm records the MVEF together with the gc-factors embedded in it (i.e., co-identified), the distribution used, and the variables marginalized out (Lines 7-9).

With the colorful chunks given by the identified MVEFs, the second phase of the algorithm tries to complete the big picture. Based on Prop. 2, it attempts to select MVEFs whose co-identified gc-factors (represented as their indices in $\mathcal{F}$) do not overlap. This reduces to solving an exact cover problem (Line 13), which is NP-complete (Karp, 1972): given a family $\mathbb{I}$ of subsets of a set $[n] = \{1, \ldots, n\}$, decide whether there exists a subfamily $\mathbb{I}^* \subseteq \mathbb{I}$ such that the sets in $\mathbb{I}^*$ are disjoint and $\bigcup \mathbb{I}^* = [n]$ (a compact solver is sketched below). If a solution exists, then an expression for $P_{\mathbf{x}}(\mathbf{y})$ is written as a summation over the product of sub-formulas corresponding to the selected MVEFs, where the variables to be marginalized follow Eq. (2), except that part of the variables are moved inside the sub-formulas (Line 16).
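The exact cover step can be served by any off-the-shelf solver. Below is a compact backtracking sketch; the chunk contents used in the example are an assumption consistent with Fig. 7d.

```python
def exact_cover(universe, sets):
    """Return indices of a subfamily of `sets` partitioning `universe`,
    or None. Classic backtracking on the smallest uncovered element."""
    universe = frozenset(universe)
    def solve(uncovered, chosen):
        if not uncovered:
            return chosen
        pivot = min(uncovered)
        for i, s in enumerate(sets):
            if pivot in s and s <= uncovered:   # disjointness is enforced here
                result = solve(uncovered - s, chosen + [i])
                if result is not None:
                    return result
        return None
    return solve(universe, [])

# With F indexed 0..3 as in Fig. 7 (P(a), P_{a,x}(b), P_b(c), P_c(y)), suppose
# P_X(A, C) yields chunks {0} and {1, 2}, and P(B, C, Y) yields {2} and {3}:
chunks = [frozenset({0}), frozenset({1, 2}), frozenset({2}), frozenset({3})]
print(exact_cover(range(4), chunks))   # -> [0, 1, 3]
```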
Finally, we show next that this approach is indeed correct.

Theorem 1 (Soundness). GID-PO is sound.

Remarkably, GID-PO is more efficient than a naive implementation of Lemma 3, which would require running a traditional identifiability algorithm ($O(|\mathbf{V}|^4)$) for exponentially many gc-factors and for each dataset. In contrast, GID-PO runs the identifiability algorithm $|\mathcal{F}|$ times for each distribution to collect the identified MVEFs. Then, an exact cover runs in time exponential in the number of uniquely identified MVEFs, which is upper bounded by $|\mathcal{F}|$. More specifically, GID-PO runs in $O(\ell m n^4 + 2^\ell)$ time, while a naive implementation of Lemma 3 runs in $O(m n^4 2^n)$, where $n = |\mathbf{V}|$, $m = |\mathbb{P}|$, and $\ell = |\mathcal{F}|$. Further, if the problem instance is compatible with g-identifiability, where no partial-observability is involved, then the algorithm runs in time polynomial in $|\mathbf{V}|$, since every identified MVEF will be a single-piece chunk and an exact cover algorithm will spot uncovered elements first.

Algorithm 1 GID-PO
1: Input: $\mathcal{G}$ a causal graph, $\mathbf{x}$ and $\mathbf{y}$ value assignments for a query, $\mathbb{P}$ a collection of available distributions
2: Prepare $\mathbb{J}$, an empty collection.
3: for $\langle P_{\mathbf{Z}'}(\mathbf{V}'), \langle \mathbf{X}_i, \mathbf{Y}_i \rangle \rangle \in \mathbb{P} \times \mathcal{F}$ do
4:   continue if an MVEF that embeds $\langle \mathbf{X}_i, \mathbf{Y}_i \rangle$ was already found with the same data $P_{\mathbf{Z}'}(\mathbf{V}')$.
5:   $\langle \mathbf{X}'', \mathbf{Y}'' \rangle, \mathbf{W} \leftarrow \mathrm{MVEF}(\langle \mathbf{X}_i, \mathbf{Y}_i \rangle, P_{\mathbf{Z}'}(\mathbf{V}'))$.
6:   continue if the MVEF is co-identified by some data.
7:   if $\langle \mathbf{X}'', \mathbf{Y}'' \rangle$ exists and is identifiable with $P_{\mathbf{Z}'}(\mathbf{V}')$ then
8:     $\mathbb{i} \leftarrow$ the indices of the factors in $\mathcal{F}$ embedded in the MVEF.
9:     Add $\langle \langle \mathbf{X}'', \mathbf{Y}'' \rangle, \mathbf{W}, \mathbb{i}, \mathbf{Z}', \mathbf{V}' \rangle$ to $\mathbb{J}$.
10:  end if
11: end for
12: Let $\mathbb{I}$ be the collection of index sets in $\mathbb{J}$.
13: if $\mathbb{I}^* \leftarrow$ exact cover of $\{1, \ldots, |\mathcal{F}|\}$ with $\mathbb{I}$, and $\mathbb{I}^* \neq \emptyset$ then
14:   Let $\mathbb{J}^* \subseteq \mathbb{J}$ be the subcollection matching the solution $\mathbb{I}^*$.
15:   Let $\mathbf{W}^*$ be the union of all MVEF-admissible sets in $\mathbb{J}^*$.
16:   return $\sum_{\mathbf{y}^+ \setminus (\mathbf{Y} \cup \mathbf{W}^*)} \prod_{\mathbb{J}^*} \mathrm{GID}(P_{\mathbf{x}''}(\mathbf{y}''), \mathcal{G}\langle\mathbf{V}'\rangle, \{\mathbf{Z}'\})$.
17: end if
18: return NULL.

5. Discussions: Completeness, Complexity, Conditional Effects, and Transportability

We conjecture that Lemma 3 is also a necessary condition and, hence, that our algorithm is sound and complete. Under this completeness conjecture, we can show that the decision version of our problem is NP-complete. Further, we discuss two possible extensions of this work.

On Completeness. Our intuition for completeness lies in the fact that each gc-factor in $\mathcal{F}$ provides non-decomposable information about the model. Formally speaking, a probability $P_{\mathbf{w}}(\mathbf{z}) \in \mathcal{F}$ is not identifiable from the combination of smaller pieces, i.e., $\mathbb{P} = \{P_{\mathbf{W}'}(\mathbf{Z}') \mid \mathbf{Z}' \subseteq \mathbf{Z}, \mathbf{W}' \subseteq \mathbf{W}\} \setminus \{P_{\mathbf{W}}(\mathbf{Z})\}$. Following Lemma 2, we can construct two models $M^1$ and $M^2$ where $\mathbf{W}$ and the UCs are independent fair coins and each $Z \in \mathbf{Z}$ takes the exclusive-or of its parents (including UCs) in both models, except that one $Z \in \mathbf{Z}$ flips its value in $M^2$. Further, estimates for proper embedding factors and other gc-factors will not pinpoint a gc-factor of interest. For example, does knowing a chunk $\{f, g\}$ and a piece $\{g\}$ allow us to infer what $\{f\}$ is? Consider the equation $h = \int_x f(x)\, g(x)\, dx$. Generally, knowing $h$ and $g$ does not allow us to infer what $f$ is. Having a number of such equations may help characterize $f$, but will not identify $f$ unless the domain of $X$ is small and finite. For such reasons, we conjecture that $\mathcal{F}$ provides the fundamental, necessary, and sufficient building blocks for constructing $P_{\mathbf{x}}(\mathbf{y})$ from marginal interventional distributions. A rigorous proof of completeness is still under investigation.
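A two-point toy instance of the $h = \int_x f(x) g(x)\, dx$ argument (our own illustration): two different pieces $f$ induce the same chunk $h$ against the same $g$, so $h$ and $g$ together cannot pin $f$ down.

```python
g  = [0.5, 0.5]
f1 = [0.2, 0.8]
f2 = [0.8, 0.2]
h1 = sum(f * w for f, w in zip(f1, g))   # both inner products equal 0.5,
h2 = sum(f * w for f, w in zip(f2, g))   # yet f1 and f2 are distinct
assert h1 == h2 and f1 != f2
```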
On NP-Completeness. Assume we have proved the completeness of the algorithm, which reflects Prop. 2. Then we can show that the decision problem of the puzzle is indeed NP-complete, via a polynomial reduction from an exact cover problem to a GID-PO problem instance (a construction sketch is given at the end of this section). Consider an arbitrary exact cover problem with a universe $[n]$ and a collection $\mathbb{I}$ of its subsets. (An empty set can simply be ignored, because it does not contribute to answering the exact cover problem.) We construct a causal graph $\mathcal{G}$ with $\mathbf{V} = \{V_{i,j}\}_{1 \leq i \neq j \leq n}$, where the variables $V_{i,j}$ with $j < i$ are the sink nodes in $\mathcal{G}$. Each gc-factor $\langle \mathrm{pa}(V_{i,:}), V_{i,:} \rangle \in \mathcal{F}$, with $V_{i,:} = \{V_{i,j}\}_{j \neq i}$, represents element $i$ of the universe $[n]$, e.g., $P(V_{1,2}, V_{1,3})$ for element 1 and $P(V_{3,1}, V_{3,2} \mid do(V_{1,3}, V_{2,3}))$ for element 3. Given a subset of integers $\mathbf{w} \in \mathbb{I}$, there exists a corresponding distribution $P_{\mathbf{X}'}(\mathbf{Y}')$ in $\mathbb{P}$ such that $\langle \mathbf{X}', \mathbf{Y}' \rangle \in \mathcal{F}''$ is an MVEF of every gc-factor corresponding to $w \in \mathbf{w}$, where $\mathbf{V}'' = \mathbf{V} \setminus \{V_{i,j}\}_{i < j \,\mid\, i, j \in \mathbf{w}}$. For instance, the subsets of an exact cover problem over the universe $\{1, 2, 3\}$ (Fig. 10a) are matched to the following distributions:

{1}: $P(V_{1,2}, V_{1,3})$
{2}: $P(V_{2,1}, V_{2,3} \mid do(V_{1,2}))$
{3}: $P(V_{3,1}, V_{3,2} \mid do(V_{1,3}, V_{2,3}))$
{1, 2}: $P(V_{1,3}, V_{2,1}, V_{2,3})$
{1, 3}: $P(V_{1,2}, V_{3,1}, V_{3,2} \mid do(V_{2,3}))$
{2, 3}: $P(V_{2,1}, V_{3,1}, V_{3,2} \mid do(V_{1,2}, V_{1,3}))$
{1, 2, 3}: $P(V_{2,1}, V_{3,1}, V_{3,2})$

Hence, the construction can be done in time polynomial in $n \cdot |\mathbb{I}|$, the size of the problem. Further, Alg. 1 yields an identification formula if and only if the reduced problem has an exact cover solution, since each set, as an available distribution, can only identify itself and no others. Therefore, under the completeness of Lemma 3, the decision problem of general identifiability under partial-observability is NP-complete.

Figure 10. Causal diagrams for exact cover problems with universes (a) $\{1, 2, 3\}$ and (b) $\{1, 2, 3, 4\}$.

Generalization to Transportability and Conditional Distributions. A natural extension of this work is to the context of transportability (Pearl & Bareinboim, 2011; Bareinboim & Pearl, 2014; Lee et al., 2020). Transportability is concerned with the fusion of datasets collected from heterogeneous domains, where the mechanisms of some of the variables differ from those of the domain in which a causal effect is sought. Unless the differences between a source and the target (i.e., incompatible piece shapes) fall directly on the Y-side of a factor, a similar procedure can be applied. Another generalization is the identification of a conditional interventional distribution: e.g., $P_{\mathbf{x}}(\mathbf{y} \mid \mathbf{w})$ is delegated to identifying $P_{\mathbf{x}, \mathbf{w}'}(\mathbf{y}, \mathbf{w}'')$ (Tian, 2004; Lee et al., 2020). This change only requires additional pre- and post-processing of the result from GID-PO or its extension to transportability. We provide the generalization of our problem, taking both directions into account, in App. F of (Lee & Bareinboim, 2020).
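Returning to the reduction above, its bookkeeping can be written out mechanically. The sketch below assumes the indexing suggested by the listed examples (element $i$ is encoded by the row $V_{i,:}$; the distribution for a subset $\mathbf{w}$ spans the rows of $\mathbf{w}$ minus $\{V_{i,j} : i < j,\ i, j \in \mathbf{w}\}$, and it intervenes on the upper off-diagonal parents of $\mathbf{w}$'s rows that lie outside $\mathbf{w}$); it reproduces the seven distributions listed for the universe $\{1, 2, 3\}$.

```python
def distribution_for(w, n):
    """Map a subset w of the universe [n] to the scope (Y') and the
    intervention set (Z') of its corresponding distribution."""
    w = set(w)
    scope = {(i, j) for i in w for j in range(1, n + 1) if j != i}
    scope -= {(i, j) for i in w for j in w if i < j}          # drop in-w uppers
    do = {(i, j) for i in range(1, n + 1)                     # out-of-w parents
          for j in w if i < j and i not in w}
    return sorted(scope), sorted(do)

for subset in [{1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]:
    print(sorted(subset), "->", distribution_for(subset, 3))
```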
6. Conclusions

We introduced the general identifiability problem in which the available distributions are only partially observable, named GID-PO. We investigated how a causal query can be factorized under different levels of projection, and then introduced new constructs called embedding factors and co-identification. These constructs make explicit the connection between the factors required to identify the targeted query and the available observed distributions, which allows a systematic view of the problem of identifiability under different granularities. We introduced a new graphical structure called the minimum viable embedding factor (MVEF) and studied its properties, including its uniqueness, disjointness, and compositionality. Putting these results together, we developed a new algorithm (GID-PO) that efficiently and systematically examines the identifiability of embedding factors and combines the identified MVEFs to compose an expression for a given query. Since each of the factors cannot be identified from smaller marginal interventional distributions, we conjecture that the procedure is also necessary. Assuming its completeness, we showed that the decision version of this new identifiability problem is NP-complete; yet, it runs in time polynomial in the number of observed variables for problems under full-observability (Lee et al., 2019). Noting that, in practice, available datasets are measured inconsistently with respect to the variables they cover (they usually have different columns), we hope the results in this paper can help data scientists tackle more challenging identification instances and determine causal effects in more intricate and realistic scenarios.

Acknowledgements. This research is supported in parts by grants from NSF (IIS-1704352 and IIS-1750807 (CAREER)).

References

Bareinboim, E. and Pearl, J. Causal inference by surrogate experiments: z-identifiability. In de Freitas, N. and Murphy, K. (eds.), Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pp. 113-120. AUAI Press, Corvallis, OR, 2012a.

Bareinboim, E. and Pearl, J. Controlling selection bias in causal inference. In Girolami, M. and Lawrence, N. (eds.), Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2012), pp. 100-108. JMLR (22), 2012b.

Bareinboim, E. and Pearl, J. Transportability from multiple environments with limited experiments: Completeness results. In Advances in Neural Information Processing Systems 27, pp. 280-288. Curran Associates, Inc., 2014.

Bareinboim, E. and Pearl, J. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113:7345-7352, 2016.

Correa, J. and Bareinboim, E. Causal effect identification by adjustment under confounding and selection biases. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pp. 3740-3746. AAAI Press, 2017.

Huang, Y. and Valtorta, M. Identifiability in causal Bayesian networks: A sound and complete algorithm. In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI 2006), pp. 1149-1156. AAAI Press, Menlo Park, CA, 2006a.

Huang, Y. and Valtorta, M. Pearl's calculus of intervention is complete. In Dechter, R. and Richardson, T. (eds.), Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence (UAI 2006), pp. 217-224. AUAI Press, Corvallis, OR, 2006b.

Jaber, A., Zhang, J., and Bareinboim, E. Causal identification under Markov equivalence: Completeness results. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pp. 2981-2989. PMLR, 2019.

Karp, R. M. Reducibility among combinatorial problems. In Miller, R. E. and Thatcher, J. W. (eds.), Complexity of Computer Computations, The IBM Research Symposia Series, pp. 85-103. Plenum Press, New York, 1972.

Lee, J. J. R. and Shpitser, I. Identification methods with arbitrary interventional distributions as inputs. arXiv:2004.01157, 2020.

Lee, S. and Bareinboim, E. Causal effect identifiability under partial-observability. Technical Report R-58, Columbia Causal Artificial Intelligence Lab, Department of Computer Science, Columbia University, 2020.
Lee, S., Correa, J. D., and Bareinboim, E. General identifiability with arbitrary surrogate experiments. In Proceedings of the Thirty-Fifth Annual Conference on Uncertainty in Artificial Intelligence. AUAI Press, Corvallis, OR, 2019.

Lee, S., Correa, J. D., and Bareinboim, E. Generalized transportability: Synthesis of experiments from heterogeneous domains. In Proceedings of the 34th AAAI Conference on Artificial Intelligence. AAAI Press, 2020.

Mohan, K. and Pearl, J. On the testability of models with missing data. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics. JMLR, 2014.

Pearl, J. Causal diagrams for empirical research. Biometrika, 82(4):669-688, 1995.

Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, 2000.

Pearl, J. and Bareinboim, E. Transportability of causal and statistical relations: A formal approach. Technical Report R-372, Cognitive Systems Laboratory, Department of Computer Science, UCLA, 2011.

Pearl, J. and Mackenzie, D. The Book of Why: The New Science of Cause and Effect. Basic Books, 2018.

Shpitser, I. and Pearl, J. Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, pp. 1219-1226. AAAI Press, 2006.

Spirtes, P., Glymour, C., and Scheines, R. Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd edition, 2001.

Tian, J. Studies in Causal Reasoning and Learning. PhD thesis, Computer Science Department, University of California, Los Angeles, CA, November 2002.

Tian, J. Identifying conditional causal effects. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 561-568. AUAI Press, 2004.

Tian, J. and Pearl, J. A general identification condition for causal effects. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI 2002), pp. 567-573. AAAI Press/The MIT Press, Menlo Park, CA, 2002.

Tian, J. and Pearl, J. On the identification of causal effects. Technical Report R-290-L, Department of Computer Science, University of California, Los Angeles, CA, 2003.