# Causal Effect Identifiability under Partial-Observability

Sanghack Lee and Elias Bareinboim, Department of Computer Science, Columbia University, New York, NY 10027, USA. Correspondence to: Sanghack Lee.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract. Causal effect identifiability is concerned with establishing the effect of intervening on a set of variables on another set of variables from observational or interventional distributions, under causal assumptions usually encoded in the form of a causal graph. Most of the results in this literature implicitly assume that every variable modeled in the graph is measured in the available distributions. In practice, however, the data collections of the different studies considered do not measure the same variables consistently. In this paper, we study the causal effect identifiability problem when the available distributions encompass different sets of variables, which we refer to as identification under partial-observability. We study a number of properties of the factors that comprise a causal effect under various levels of abstraction, and then characterize the relationship between them with respect to their status relative to the identification of a targeted intervention. We establish a sufficient graphical criterion for determining whether the effects are identifiable from partially-observed distributions. Finally, building on these graphical properties, we develop an algorithm that returns a formula for a causal effect in terms of the available distributions.

1. Introduction

One of the central goals in the data sciences (e.g., the health and social sciences), artificial intelligence, and machine learning is to discover cause-and-effect relationships. If a scientific study is performed appropriately, and causal relations are eventually discovered, the corresponding effects are more likely to hold under a broader set of conditions; causal relations are usually more stable and generalizable across disparate conditions (Pearl, 2000; Pearl & Mackenzie, 2018). Causal inference provides a collection of principles and tools to help understand the conditions under which these extrapolations can take place (Pearl, 2000; Spirtes et al., 2001; Bareinboim & Pearl, 2016). For instance, a scientist may be able to use an observational study to infer the effect of a new intervention by leveraging knowledge encoded in its causal model (Pearl, 1995; Tian & Pearl, 2002; Tian, 2002; Shpitser & Pearl, 2006; Huang & Valtorta, 2006b). Causal effects can also be inferred across a broad range of conditions, including across disparate populations (Bareinboim & Pearl, 2014; Lee et al., 2020), under selection bias (Bareinboim & Pearl, 2012b; Correa & Bareinboim, 2017), with missing data (Mohan & Pearl, 2014), and in the absence of the causal graph (Jaber et al., 2019), to cite a few. Further, recent advances in causal inference have led to algorithmic solutions for combining data collected under multiple, disparate regimes (observational and interventional) to identify a causal effect (Lee et al., 2019). Despite all the generality and power provided by these results, by and large, they implicitly assume that every modeled variable is consistently available across the different data collections.
Since each study is usually designed to fulfill its own objectives, datasets across studies tend to measure different sets of variables (i.e., the datasets have different columns). As a consequence, data relevant to a question about a specific effect may not be usable in another, possibly closely related study. For concreteness, consider the setting where a researcher aims to understand the effect of physical exercise ($X$) on stroke ($Y$), written as $P_x(y)$, and is analyzing two related datasets. The first is based on an experimental study estimating the effect of physical exercise ($X$) on blood pressure ($C$), collected from different age groups ($A$); commonly written as $P_X(A, C)$. The second dataset is based on an observational study of the association among body mass index (BMI, or $B$), blood pressure ($C$), and stroke ($Y$), i.e., $P(B, C, Y)$. The causal graphs representing these two studies are shown in Figs. 1a and 1b, respectively. After conducting standard causal analysis, the researcher realizes that $P_x(y)$, the effect of interest, cannot be inferred from either of the two datasets separately. She then creates a common representation of their union, which is summarized in the causal graph shown in Fig. 1c. Unfortunately, the identification algorithms available today assume full-observability, which means that graphs and datasets defined over different sets of variables cannot be taken as input.

Figure 1. (a, b) Causal graphs representing the experimental ($P_X(A, C)$) and observational ($P(B, C, Y)$) studies, respectively. (c) The causal graph representing the union of both studies.

On the other hand, the effect $P_x(y)$ is inferable by a careful combination of these studies through the expression $\sum_{a,c} P_x(a, c) \sum_b P(y \mid b, c) P(b)$. The first factor can be computed from the experimental study, while the other two can be obtained from the observational study (a numerical check of this combination appears at the end of this section). Our goal in this paper is to understand under what conditions inferences such as this one are allowed from first principles.

More broadly, and motivated by the lack of a systematic treatment of combining partially-observed datasets, we formally introduce and study the problem of causal effect identifiability under partial-observability. More specifically, the main contributions of this work are as follows: (i) we develop novel machinery to account for partial-observability constraints, including constructs for co-identification, embedding of critical identification factors, and a formal understanding of minimum viable embeddings; putting these results together, we derive a novel graphical condition for identification under partial-observability; (ii) we then develop the first general algorithm for this setting, which avoids redundant computations and runs in polynomial time for the known subclasses of identifiability problems under full-observability. Furthermore, we provide a detailed discussion of the necessity of our algorithm and the NP-completeness status of this particular identifiability problem. Finally, we discuss the extension of this work to the transportability setting (Bareinboim & Pearl, 2014), in which the corresponding datasets may come from multiple, heterogeneous domains.
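To make the opening example concrete, the following is a small numerical check (our own illustration, not code from the paper). The exact structure of Fig. 1c is not reproduced here; we assume the edges $A \rightarrow B \leftarrow X$, $B \rightarrow C \rightarrow Y$, and $X \leftrightarrow Y$ (consistent with the factorization derived for this graph in Sec. 4.1) together with arbitrary conditional probability tables. Exhaustive enumeration then confirms that the two-study combination above recovers the ground-truth effect:

```python
from itertools import product

# P(. = 0 | parents); complements are implied for binary variables.
pA = 0.6
pU = 0.7                                    # latent confounder of X and Y
pX = lambda u: 0.8 - 0.5 * u                # P(X=0 | u)
pB = lambda a, x: 0.1 + 0.2 * a + 0.4 * x   # P(B=0 | a, x)
pC = lambda b: 0.3 + 0.5 * b                # P(C=0 | b)
pY = lambda c, u: 0.2 + 0.3 * c + 0.4 * u   # P(Y=0 | c, u)

def val(p0, v):                             # P(. = v) from P(. = 0)
    return p0 if v == 0 else 1.0 - p0

def weight(a, u, x, b, c, y, do_x=False):
    """Product of the SCM's factors; under do(x), X's mechanism is removed."""
    px = 1.0 if do_x else val(pX(u), x)
    return (val(pA, a) * val(pU, u) * px * val(pB(a, x), b)
            * val(pC(b), c) * val(pY(c, u), y))

B2 = (0, 1)

def truth_Px(y, x):                         # ground-truth P_x(y), by enumeration
    return sum(weight(a, u, x, b, c, y, do_x=True)
               for a, u, b, c in product(B2, repeat=4))

def study1_Px(a, c, x):                     # P_x(a, c): the experimental study
    return sum(weight(a, u, x, b, c, y, do_x=True)
               for u, b, y in product(B2, repeat=3))

def study2(b, c, y):                        # P(b, c, y): the observational study
    return sum(weight(a, u, x, b, c, y)
               for a, u, x in product(B2, repeat=3))

def combined_Px(y, x):                      # sum_{a,c} P_x(a,c) sum_b P(y|b,c) P(b)
    pB_marg = {b: sum(study2(b, c, y) for c, y in product(B2, repeat=2)) for b in B2}
    out = 0.0
    for a, c in product(B2, repeat=2):
        pc_y = sum(study2(b, c, y) / sum(study2(b, c, yy) for yy in B2) * pB_marg[b]
                   for b in B2)
        out += study1_Px(a, c, x) * pc_y
    return out

for x, y in product(B2, repeat=2):
    assert abs(truth_Px(y, x) - combined_Px(y, x)) < 1e-9
print("two-study combination recovers P_x(y)")
```

Swapping in other CPT values leaves the assertion intact, since the identification argument is distribution-free under the assumed graph.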
One metaphor that will be informative and facilitate the understanding of this paper is to compare the task of identification under partial-observability to a jigsaw puzzle (Fig. 2). The targeted causal query and the underlying causal graph define the puzzle and its layout (Sec. 4.1), while each of the available observational and experimental distributions provides pieces and chunks (pieces tied together) for the puzzle (Secs. 4.1.1 and 4.1.2). The given pieces and chunks are then combined, without overlapping each other, to complete the puzzle, eventually forming the targeted effect (Sec. 4.2). For simplicity, we start the discussion in Sec. 3 (just after the preliminaries) with a relatively simple class of puzzles, and then build on it, refining our understanding.

Figure 2. A high-level abstraction of our problem as a jigsaw puzzle (panels: layout/pieces, chunks, solution). A causal query $P_x(y)$ is a puzzle together with how its pieces (factors) should be laid out. Each of the available distributions (represented as different colors) provides chunks of information, where a chunk is either a piece or the combination of multiple pieces. The solution for the puzzle is then putting a subset of the chunks together.

2. Preliminaries

Following conventions in the field, a variable is denoted by an uppercase letter, e.g., $Z$, and its value by the corresponding lowercase letter, $z \in \mathfrak{X}_Z$, where $\mathfrak{X}_Z$ is the state space of $Z$. Bold letters denote sets of variables or values, e.g., $\mathbf{z} \in \mathfrak{X}_{\mathbf{Z}} = \times_{Z \in \mathbf{Z}} \mathfrak{X}_Z$. For simplicity, we may omit curly braces for a singleton set used as an argument, e.g., $f(x)$ versus $f(\{x\})$.

This paper builds on the language of Structural Causal Models (SCMs) (Pearl, 2000). Each SCM $M$ is a quadruple $\langle \mathbf{U}, \mathbf{V}, \mathbf{F}, P(\mathbf{U}) \rangle$. The set of unobserved variables $\mathbf{U}$ follows the joint probability distribution $P(\mathbf{U})$. The set of observed variables $\mathbf{V}$ is specified through the set of structural functions $\mathbf{F} = \{f_V\}_{V \in \mathbf{V}}$, where each function is of the form $f_V(\mathrm{pa}_V, \mathbf{u}_V)$ such that $\mathrm{pa}_V$ and $\mathbf{u}_V$ are values for $\mathrm{PA}_V \subseteq \mathbf{V} \setminus \{V\}$ and $\mathbf{U}_V \subseteq \mathbf{U}$, respectively. The SCM $M$ induces a causal graph $\mathcal{G}$ over $\mathbf{V}$ with a directed edge $W \rightarrow V$ if $W \in \mathrm{PA}_V$ and a bidirected edge $W \leftrightarrow V$ if there exists an unobserved confounder (UC, for short) $U \in \mathbf{U}_W \cap \mathbf{U}_V$. Further, $M$ induces a set of observational and interventional distributions. One can intervene on $\mathbf{X}$, setting it to $\mathbf{x}$, which yields a submodel $M_{\mathbf{x}}$ where the function for each $X \in \mathbf{X}$ in $M$ is replaced by the constant $x \in \mathbf{x}$. The distribution generated by $M_{\mathbf{x}}$ is denoted by $P_{\mathbf{x}}(\mathbf{V})$ (or $P(\mathbf{V} \mid do(\mathbf{x}))$). For simplicity, $P_{\mathbf{Z}}(\mathbf{W})$ denotes the collection of probabilities $\{P_{\mathbf{z}}(\mathbf{w})\}_{\mathbf{z} \in \mathfrak{X}_{\mathbf{Z}}, \mathbf{w} \in \mathfrak{X}_{\mathbf{W}}}$, which we may simply call a distribution. For a more detailed discussion of SCMs, please refer to (Pearl, 2000, Ch. 7).

We denote by $\mathbf{V}_{\mathcal{H}}$ the vertices of a graph $\mathcal{H}$, and often use $\mathbf{V}$ for the vertices of $\mathcal{G}$ where no ambiguity arises. We denote by $\mathrm{pa}(\mathbf{W})_{\mathcal{G}}$ the union of the parents of each $W \in \mathbf{W}$ in $\mathcal{G}$; similarly, $\mathrm{ch}$, $\mathrm{an}$, and $\mathrm{de}$ denote children, ancestors, and descendants. Additionally, $\mathrm{Ch}$, $\mathrm{An}$, and $\mathrm{De}$ include their arguments as well, e.g., $\mathrm{Ch}(\mathbf{W})_{\mathcal{G}} = \mathrm{ch}(\mathbf{W})_{\mathcal{G}} \cup \mathbf{W}$. A subgraph of $\mathcal{G}$ over $\mathbf{V}' \subseteq \mathbf{V}$ is denoted by $\mathcal{G}[\mathbf{V}']$. Further, $\mathcal{G}_{\overline{\mathbf{X}}}$ and $\mathcal{G}_{\underline{\mathbf{Z}}}$ denote $\mathcal{G}$ with edges incoming to $\mathbf{X}$ and outgoing from $\mathbf{Z}$ removed, respectively.

The latent projection (or projection, for short) of a causal graph is defined so as to retain the causal relationships among a subset of variables. This concept is crucial in understanding partial-observability. We define projection as below, adopted from (Tian & Pearl, 2003).

Definition 1 (Projection). The projection of a causal graph $\mathcal{G}$ over $\mathbf{V}$ onto $\mathbf{V}' \subseteq \mathbf{V}$, denoted by $\mathcal{G}\langle\mathbf{V}'\rangle$, is a causal graph over $\mathbf{V}'$ such that, for every pair of vertices $X$ and $Y$:
1. There exists a directed edge $X \rightarrow Y$ in $\mathcal{G}\langle\mathbf{V}'\rangle$ if there exists a directed path from $X$ to $Y$ in $\mathcal{G}$ such that every vertex on the path other than $X$ and $Y$ is not in $\mathbf{V}'$.
2. There exists a bidirected edge $X \leftrightarrow Y$ in $\mathcal{G}\langle\mathbf{V}'\rangle$ if there exists a divergent path between $X$ and $Y$ in $\mathcal{G}$ such that every vertex on the path other than $X$ and $Y$ is not in $\mathbf{V}'$. (A divergent path between $X$ and $Y$ exists if there are two directed paths in $\mathcal{G}$, towards $X$ from $W_1$ and towards $Y$ from $W_2$, such that $W_1 = W_2$ or $W_1 \leftrightarrow W_2$.)

We denote by $\mathcal{G}_{-\mathbf{W}}$ a causal graph with $\mathbf{W}$ projected out, i.e., $\mathcal{G}_{-\mathbf{W}} = \mathcal{G}\langle\mathbf{V} \setminus \mathbf{W}\rangle$. Conditional independence statements and the rules of do-calculus (Pearl, 1995) that hold on a projection are valid in $\mathcal{G}$, and vice versa. We often simplify notation involving projections by letting $\mathcal{G}' = \mathcal{G}\langle\mathbf{V}'\rangle$; similarly, $\mathcal{G}''$ and $\mathcal{G}_j$ are defined for $\mathbf{V}''$ and $\mathbf{V}_j$, respectively. An illustration of how projection induces new edges, along with some remarks, is provided in App. A of (Lee & Bareinboim, 2020).

Graphical Constructs for Identifiability. Throughout the paper, we fix the use of a few symbols. A query $P_{\mathbf{x}}(\mathbf{y})$ is defined on a causal graph $\mathcal{G}$ of an unknown SCM $M$ with vertices (or variables) $\mathbf{V}$, where $\mathbf{X}$ and $\mathbf{Y}$ are possibly empty disjoint subsets of $\mathbf{V}$. Further, we define a few new symbols that greatly help in understanding the anatomy of a given graph with respect to the identification of a query, namely:

$\mathbf{X}^* = \mathbf{X} \cap \mathrm{An}(\mathbf{Y})_{\mathcal{G}_{\overline{\mathbf{X}}}}$  and  $\mathbf{X}^+ = \mathrm{An}(\mathbf{X}^*)_{\mathcal{G}} \setminus \mathbf{Y}^+$;
$\mathbf{Y}^* = \mathbf{Y}$  and  $\mathbf{Y}^+ = \mathrm{An}(\mathbf{Y})_{\mathcal{G}_{\overline{\mathbf{X}}}}$;
$\mathbf{V}^* = \mathbf{X}^* \cup \mathbf{Y}^*$  and  $\mathbf{V}^+ = \mathbf{X}^+ \cup \mathbf{Y}^+$.

The left-hand symbols capture the essential parts of the query and the right-hand symbols describe the variables relevant to identifying it; Fig. 3 provides an illustration of these notions. Their meanings and relationships will become clearer when they appear in lemmas and theorems, especially in Sec. 4. We will use the same color scheme to visualize graphs.

Figure 3. An illustrative example of $\mathbf{X}^+$ (light red), $\mathbf{Y}^+$ (light blue), $\mathbf{X}^*$ (dark red), and $\mathbf{Y}^*$ (dark blue), shown (a) on part of a causal graph and (b) as an abstract illustration. Bidirected edges are not relevant in defining these symbols and are omitted on purpose.

Concepts for Non-identifiability. A crucial building block for characterizing our problem of interest is a graphical structure articulating the non-identifiability of a causal effect. A graph $\mathcal{H}$ is said to be a c-component if a subset of its bidirected arcs forms a spanning tree over all vertices in $\mathcal{H}$ (Tian & Pearl, 2002; Tian, 2002); that is, a c-component can be understood as a set of vertices connected via bidirected edges. We utilize the decomposition of a set of vertices in a graph with respect to its maximal c-components, which we call the c-component decomposition. As a special type of c-component, a graph $\mathcal{H}$ with root set (sink nodes) $\mathbf{R}$ is said to be an $\mathbf{R}$-rooted c-forest if $\mathcal{H}$ is a c-component with a minimal number of edges.

Definition 2 (Hedge). A hedge is a pair of $\mathbf{R}$-rooted c-forests $\langle F, F' \rangle$ such that $F' \subseteq F$.

A hedge is said to be formed for $P_{\mathbf{x}}(\mathbf{y})$ if $\mathbf{R} \subseteq \mathrm{An}(\mathbf{Y})_{\mathcal{G}_{\overline{\mathbf{X}}}}$, $F \cap \mathbf{X} \neq \emptyset$, and $F' \cap \mathbf{X} = \emptyset$, which implies the non-identifiability of $P_{\mathbf{x}}(\mathbf{y})$ from $P(\mathbf{V})$ in $\mathcal{G}$ (Shpitser & Pearl, 2006). We prefer to separate the graphical definition of a hedge from the specific syntactic goal/task, following the discussion in (Lee et al., 2019). Such a hedge satisfies that $\mathbf{V}_{F \setminus F'} \cap \mathbf{V}^+$ intersects with $\mathbf{X}^*$; further, $\mathbf{V}_{F'} \subseteq \mathbf{Y}^+$, since $\mathbf{X}^+$ being part of $F'$ would imply that some $X \in \mathbf{X}$ is in $F'$, which violates the definition. Examples are illustrated in Fig. 4, where the first two causal graphs are hedges for $P_x(y)$ but the last one is not.

Figure 4. (a, b) Hedges for $P_x(y)$ where, in both cases, $\{Y\}$ is the sole root set and the only element of $F'$, with $F = \mathcal{G}$. (c) is not a hedge for $P_x(y)$, but the $\{Y_1, W\}$-rooted c-forests $F = \mathcal{G}[\{Y_1, W, X\}]$ and $F' = \mathcal{G}[\{Y_1, W\}]$ form a hedge for $P_x(y)$.
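To ground these constructions, here is a minimal Python sketch (ours, not the paper's code) of two routines used throughout: c-component decomposition and the latent projection of Def. 1. A graph is represented as sets of directed edges (pairs) and bidirected edges over vertex names; this representation is an assumption of the sketch.

```python
from itertools import product

def c_components(vertices, bidirected):
    """Partition `vertices` into maximal c-components, i.e., classes
    connected by bidirected edges restricted to `vertices`."""
    comp = {v: {v} for v in vertices}
    for a, b in bidirected:
        if a in comp and b in comp and comp[a] is not comp[b]:
            merged = comp[a] | comp[b]
            for v in merged:
                comp[v] = merged
    seen, out = set(), []
    for v in vertices:
        if id(comp[v]) not in seen:
            seen.add(id(comp[v]))
            out.append(comp[v])
    return out

def _reach(start, directed, dropped):
    """Vertices reachable from `start` by directed paths whose
    intermediate vertices all lie in `dropped`."""
    seen, stack, out = {start}, [start], set()
    while stack:
        v = stack.pop()
        for a, b in directed:
            if a == v and b not in seen:
                seen.add(b)
                out.add(b)
                if b in dropped:   # may continue only through dropped vertices
                    stack.append(b)
    return out

def latent_projection(vertices, directed, bidirected, keep):
    """The projection G<keep> of Def. 1, returned as (directed, bidirected)."""
    keep, dropped = set(keep), set(vertices) - set(keep)
    d_new = {(x, y) for x in keep
             for y in _reach(x, directed, dropped) if y in keep and y != x}
    def served(w):   # kept endpoints a divergent path can reach via w
        return {w} if w in keep else \
               {v for v in _reach(w, directed, dropped) if v in keep}
    b_new = set()
    # divergent paths stem from a bidirected pair or a single dropped vertex
    for s1, s2 in list(bidirected) + [(w, w) for w in dropped]:
        for x, y in product(served(s1), served(s2)):
            if x != y:
                b_new.add(frozenset((x, y)))
    return d_new, b_new
```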
Omitted proofs and other supporting materials are provided in (Lee & Bareinboim, 2020).

3. Identifiability with a Single Partially-Observed Distribution

We begin with a simpler, yet crucial, identifiability problem that concerns identifying a causal effect given a single partially-observed distribution. Using the jigsaw metaphor discussed earlier, this task can be seen as a puzzle where all the available chunks are of the same color, as defined next.

Definition 3 (Causal Effect Identifiability under a Partially-Observed Distribution). Given a causal graph $\mathcal{G}$, let $\mathbf{Z}', \mathbf{V}', \mathbf{X}, \mathbf{Y} \subseteq \mathbf{V}$ where $\mathbf{Z}' \cap \mathbf{V}' = \emptyset$ and $\mathbf{X} \cap \mathbf{Y} = \emptyset$. The causal effect $P_{\mathbf{x}}(\mathbf{y})$ is said to be identifiable from $P_{\mathbf{Z}'}(\mathbf{V}')$ in $\mathcal{G}$ if $P_{\mathbf{x}}(\mathbf{y})$ is (uniquely) computable from $P_{\mathbf{Z}'}(\mathbf{V}')$ in any causal model that induces $\mathcal{G}$ and satisfies $P_{\mathbf{Z}'}(\mathbf{V} \setminus \mathbf{Z}') > 0$.

The definition is similar to a number of identifiability problems with full-observability, except that the given data are partially-observed. The positivity assumption is imposed on $P_{\mathbf{Z}'}(\mathbf{V} \setminus \mathbf{Z}')$ (i.e., the full joint before the projection) instead of the given distribution $P_{\mathbf{Z}'}(\mathbf{V}')$. This ensures that partial-observability is correctly held responsible for the non-identifiability of a causal effect, especially when the effect is identifiable with $P_{\mathbf{Z}'}(\mathbf{V} \setminus \mathbf{Z}')$. We present several necessary criteria (a sketch checking these criteria appears at the end of this section).

Proposition 1. A causal effect $P_{\mathbf{x}}(\mathbf{y})$ is identifiable from $P_{\mathbf{Z}'}(\mathbf{V}')$ in $\mathcal{G}$ only if $\mathbf{Y} \subseteq \mathbf{V}'$ (inclusion of outcomes), $\mathbf{X}^* \subseteq \mathbf{V}' \cup \mathbf{Z}'$ (inclusion of minimal treatments), and $\mathbf{Z}' \cap \mathbf{Y}^+ = \emptyset$ (undisturbed mechanisms).

Under full-observability, the first condition is implied by the third, and the second holds trivially since $\mathbf{Z}' \cup \mathbf{V}' = \mathbf{V}$ and $\mathbf{X}, \mathbf{Y} \subseteq \mathbf{V}$. The third condition highlights that an intervention prohibits the understanding of the underlying natural mechanisms relevant to the distribution over $\mathbf{Y}$ (Lee et al., 2019). When these criteria hold, the problem is reducible to classic identifiability.

Lemma 1. Given $P_{\mathbf{x}}(\mathbf{y})$ and $P_{\mathbf{Z}'}(\mathbf{V}')$ in $\mathcal{G}$ satisfying the three criteria in Prop. 1, $P_{\mathbf{x}}(\mathbf{y})$ is identifiable from $P_{\mathbf{Z}'}(\mathbf{V}')$ in $\mathcal{G}$ if and only if $Q_{\mathbf{x}^* \setminus \mathbf{Z}'}(\mathbf{y})$ is identifiable in $(\mathcal{G} \setminus \mathbf{Z}')\langle\mathbf{V}' \cap \mathbf{V}^+\rangle$, where $Q = P_{\mathbf{z}'}$ with $\mathbf{z}'$ consistent with $\mathbf{x}^+$.

Roughly speaking, Lemma 1 implies that the effect is identifiable when no structure embeds a hedge under the projection onto the partially-observed variables. A simple example (Fig. 5) shows the identifiability of $P_x(y)$ given $P_Z(X, Y)$. The effect is not identifiable from $P(\mathbf{V})$ (due to the existence of a hedge with $F = \mathcal{G}$ and $F' = \mathcal{G}[\{W, Y\}]$), from $P_W(X, Y, Z)$ (a disturbed mechanism), nor from $P_Y(W, X, Z)$ (exclusion of outcomes).

Corollary 1 (Soundness and Completeness). $P_{\mathbf{x}}(\mathbf{y})$ is identifiable from $P_{\mathbf{Z}'}(\mathbf{V}')$ in $\mathcal{G}$ if and only if $\mathbf{Y} \subseteq \mathbf{V}'$, $\mathbf{X}^* \subseteq \mathbf{V}' \cup \mathbf{Z}'$, $\mathbf{Z}' \cap \mathbf{Y}^+ = \emptyset$, and $Q_{\mathbf{x}^* \setminus \mathbf{Z}'}(\mathbf{y})$ is identifiable in $(\mathcal{G} \setminus \mathbf{Z}')\langle\mathbf{V}' \cap \mathbf{V}^+\rangle$, where $Q = P_{\mathbf{z}'}$ with $\mathbf{z}'$ consistent with $\mathbf{x}^+$.

Figure 5. Causal diagrams representing the identifiability of $P_x(y)$ given $P_Z(X, Y)$ as $P_x(y) = P_z(y \mid x)$ for any $z$; panel (c) shows $(\mathcal{G} \setminus \{Z\})_{-W}$.

Figure 6. A causal graph $\mathcal{G}$ (a) which embeds hedges under different projections onto (b) $\{A, X, Y\}$ and (c) $\{B, X, Y\}$.
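As referenced above, the three necessary criteria of Prop. 1 are directly mechanizable. Below is a minimal sketch reusing the edge-set representation from the earlier sketch; the full test of Cor. 1 would additionally invoke a classic identification routine on the projected graph, which is omitted here.

```python
def An(targets, directed):
    """Ancestors of `targets`, inclusive (the An(.) of Sec. 2)."""
    out, changed = set(targets), True
    while changed:
        changed = False
        for a, b in directed:
            if b in out and a not in out:
                out.add(a)
                changed = True
    return out

def prop1_criteria(directed, X, Y, Z_p, V_p):
    """Check: Y subset of V'; X* subset of V' + Z'; Z' disjoint from Y+."""
    gx_bar = {(a, b) for a, b in directed if b not in X}  # cut edges into X
    y_plus = An(Y, gx_bar)                                # Y+
    x_star = set(X) & y_plus                              # X* = X ∩ An(Y) in G_Xbar
    return (set(Y) <= set(V_p)                   # inclusion of outcomes
            and x_star <= set(V_p) | set(Z_p)    # inclusion of minimal treatments
            and not set(Z_p) & y_plus)           # undisturbed mechanisms

# e.g., on a hypothetical chain Z -> X -> Y (not a figure from the paper):
d = {("Z", "X"), ("X", "Y")}
assert prop1_criteria(d, X={"X"}, Y={"Y"}, Z_p={"Z"}, V_p={"X", "Y"})
assert not prop1_criteria(d, X={"X"}, Y={"Y"}, Z_p={"X"}, V_p={"Z", "Y"})
```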
4. Identifiability with Multiple Partially-Observed Distributions

We now investigate the main task of this paper, i.e., how to systematically use multiple distributions with different levels of observability, as defined next.

Definition 4 (Causal Effect Identifiability under Partially-Observed Distributions (GID-PO)). Let $\mathcal{G}$ be a causal graph, and let $\mathbf{X}, \mathbf{Y} \subseteq \mathbf{V}$ be the sets corresponding to the treatment and outcome variables, respectively. Further, let $\mathbb{P} = \{P_{\mathbf{Z}_i}(\mathbf{V}_i)\}_{i=1}^{m}$ be a collection of partially-observed distributions such that $\mathbf{Z}_i$ and $\mathbf{V}_i$ are disjoint subsets of $\mathbf{V}$ for $1 \leq i \leq m$. The causal effect $P_{\mathbf{x}}(\mathbf{y})$ is identifiable from the graph $\mathcal{G}$ and $\mathbb{P}$ if it is uniquely computable from $\mathbb{P}$ in any model that induces $\mathcal{G}$ and in which $\{P_{\mathbf{Z}_i}(\mathbf{V} \setminus \mathbf{Z}_i)\}_{i=1}^{m}$ are positive distributions.

This definition generalizes g-identifiability (GID, Lee et al., 2019) with partial-observability. An example of the problem is shown in Fig. 7a, where $P(B, C, Y)$ and $P_X(A, C)$ are given to identify $P_x(y)$ (the same setting as Fig. 1c). Its formula can be derived as (see App. D of (Lee & Bareinboim, 2020) for the detailed derivation):

$$\sum_{a,c} P_{x'}(a)\, P_x(c \mid a) \sum_b P(y \mid b, c)\, P(b), \qquad (1)$$

where $x'$ can be any value in $\mathfrak{X}_X$. Naively incorporating Cor. 1 into GID, which looks for colored pieces but not chunks, fails to identify the query, since one of the pieces is not identifiable from any of the given datasets (Fig. 7c).

Lemma 2. Let $\mathcal{G}$ be a causal graph and $\mathbf{X}, \mathbf{Y} \subseteq \mathbf{V}$. A query $P_{\mathbf{x}}(\mathbf{y})$ is not identifiable if there exist two causal models $M^1$ and $M^2$ compatible with $\mathcal{G}$ such that $P^1_{\mathbf{Z}_i}(\mathbf{V}_i) = P^2_{\mathbf{Z}_i}(\mathbf{V}_i)$ for every $P_i \in \mathbb{P}$, but $P^1_{\mathbf{x}}(\mathbf{y}) \neq P^2_{\mathbf{x}}(\mathbf{y})$ and $P^1_{\mathbf{Z}_i}(\mathbf{V} \setminus \mathbf{Z}_i) = P^2_{\mathbf{Z}_i}(\mathbf{V} \setminus \mathbf{Z}_i) > 0$.

Consider the causal graph in Fig. 6a, where two observational distributions $P(X, Y, A)$ and $P(B, X, Y)$ are available. The causal query $P_x(y)$ is identifiable given $P(X, Y, A, B)$, but not from either of $P(X, Y, A)$ and $P(B, X, Y)$ alone, due to the existence of a hedge (Cor. 1). Further, one can show that the query is not identifiable even taking both into account. Let $f_X = U_1$, $f_A = U_1 \oplus U_2$, and $f_B = U_2$ be the functions common to $M^1$ and $M^2$, and let $f^1_Y = X \oplus A \oplus B \oplus U_Y$ and $f^2_Y = U_Y$. Further, let $U_1$ and $U_2$ be two fair coins, and $P(U_Y = 1) = 0.1$ for both models. Then, $M^1$ and $M^2$ agree on both $P(A, X, Y)$ and $P(B, X, Y)$, while $P^1_{X=0}(Y = 0) = 0.5$ and $P^2_{X=0}(Y = 0) = 0.9$.
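The construction above can be checked mechanically. The following enumeration (our own verification sketch) confirms that $M^1$ and $M^2$ agree on the two available marginals while disagreeing on $P_{X=0}(Y=0)$:

```python
from itertools import product
from collections import defaultdict

def run(model, do_x=None):
    """Joint distribution over (A, B, X, Y) for model 1 or 2."""
    joint = defaultdict(float)
    for u1, u2, uy in product((0, 1), repeat=3):
        p = 0.5 * 0.5 * (0.1 if uy else 0.9)         # U1, U2 fair; P(UY=1)=0.1
        x = u1 if do_x is None else do_x
        a, b = u1 ^ u2, u2
        y = (x ^ a ^ b ^ uy) if model == 1 else uy   # f_Y differs across models
        joint[(a, b, x, y)] += p
    return joint

def marginal(joint, idx):
    out = defaultdict(float)
    for k, p in joint.items():
        out[tuple(k[i] for i in idx)] += p
    return out

j1, j2 = run(1), run(2)
assert marginal(j1, (0, 2, 3)) == marginal(j2, (0, 2, 3))    # P(A, X, Y) agrees
assert marginal(j1, (1, 2, 3)) == marginal(j2, (1, 2, 3))    # P(B, X, Y) agrees
p1 = sum(p for k, p in run(1, do_x=0).items() if k[3] == 0)
p2 = sum(p for k, p in run(2, do_x=0).items() if k[3] == 0)
assert abs(p1 - 0.5) < 1e-9 and abs(p2 - 0.9) < 1e-9         # yet P_x(y) differs
```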
4.1. Characterization of Factors under Projection

To develop an algorithm capable of combining the different parts of the available distributions, we briefly review approaches to the related problem of decomposing distributions currently known in the literature. A joint probability distribution $P(\mathbf{V})$ in $\mathcal{G}$ can be seen as a product of causal effects in $\mathcal{G}$. Tian & Pearl (2002) introduced the Q-decomposition, which expresses $P(\mathbf{v})$ using c-factors as $P(\mathbf{v}) = \prod_i P_{\mathbf{v} \setminus \mathbf{s}_i}(\mathbf{s}_i)$, where each $\mathbf{S}_i$ is a c-component of $\mathbf{V}$. (Such a factorization leads to a natural divide-and-conquer approach to the problem of causal identification, which underpins, sometimes more or less explicitly, most of the results in this literature, e.g., (Huang & Valtorta, 2006a; Shpitser & Pearl, 2006; Bareinboim & Pearl, 2012a; 2014; Lee et al., 2019), to cite a few. Depending on the identification task, one can prove that the Q-decomposition leads to a complete characterization.) We characterize here such factorization under projections. As a first step, we define the key notion of generalized c-factors:

Definition 5 (gc-factors). Let $\mathbf{V}'$ be a subset of variables such that $\mathbf{V}^* \subseteq \mathbf{V}' \subseteq \mathbf{V}^+$, and let $\mathcal{G}' = \mathcal{G}\langle\mathbf{V}'\rangle$. Let $\mathbb{S} = \{\mathbf{S}_i\}_{i=1}^{k}$ be the c-components of $\mathcal{G}'[\mathbf{Y}'^+]$, where $\mathbf{Y}'^+ = \mathrm{An}(\mathbf{Y})_{\mathcal{G}'_{\overline{\mathbf{X}}}} \setminus \mathbf{X}$. Then, the gc-factors of $P_{\mathbf{x}}(\mathbf{y})$ in $\mathcal{G}$ with respect to $\mathbf{V}'$ are defined as $\mathcal{F}^{\mathbf{X},\mathbf{Y}}_{\mathcal{G},\mathbf{V}'} = \{\langle \mathrm{pa}(\mathbf{S}_i)_{\mathcal{G}'} \setminus \mathbf{S}_i,\ \mathbf{S}_i \rangle\}_{i=1}^{k}$.

For example, consider the causal graph $\mathcal{G}$ in Fig. 6a and a query $P_x(y)$, where $\mathbf{V}^* = \{X, Y\}$ and $\mathbf{V}^+ = \mathbf{V}$. With $\mathbf{V}' = \mathbf{V}$, $\mathbf{Y}'^+ = \{A, B, Y\}$ (i.e., the variables in blue), and $\mathbb{S} = \{\{A, B\}, \{Y\}\}$ (i.e., the two c-components in the blue area). Hence, the gc-factors are $\mathcal{F}^{X,Y}_{\mathcal{G},\mathbf{V}} = \{\langle \emptyset, \{A, B\} \rangle, \langle \{A, B, X\}, \{Y\} \rangle\}$. With $\mathbf{V}' = \{A, X, Y\}$, $\mathcal{G}'$ is shown in Fig. 6b. In this case, $\mathbf{Y}'^+ = \{A, Y\}$, $\mathbb{S} = \{\{A, Y\}\}$, and $\mathcal{F}^{X,Y}_{\mathcal{G},\{A,X,Y\}} = \{\langle \{X\}, \{A, Y\} \rangle\}$.

A gc-factor (or, simply, a factor), represented as a pair of sets of variables, can be used to denote a set of distributions, e.g., $\langle \mathbf{X}_i, \mathbf{Y}_i \rangle \in \mathcal{F}^{\mathbf{X},\mathbf{Y}}_{\mathcal{G},\mathbf{V}'}$ and $P_{\mathbf{X}_i}(\mathbf{Y}_i)$. We may call the first and second elements the X- and Y-side of the gc-factor, respectively. Graphically, this decomposes $\mathbf{Y}'^+$ in $\mathcal{G}'$ with respect to c-components, so that the Y-side of every gc-factor is maximally confounded with respect to the underlying projection; the X-side resides in $\mathbf{Y}'^+ \cup \mathbf{X}^*$. For brevity, we simplify the notation by $\mathcal{F} = \mathcal{F}^{\mathbf{X},\mathbf{Y}}_{\mathcal{G},\mathbf{V}^+}$, and a superscript is delegated to $\mathbf{V}$ so that $\mathcal{F}' = \mathcal{F}^{\mathbf{X},\mathbf{Y}}_{\mathcal{G},\mathbf{V}'}$. Also, we interchangeably use a factor $\langle \mathbf{Z}, \mathbf{W} \rangle$ with $P_{\mathbf{Z}}(\mathbf{W})$, or with $P_{\mathbf{z}}(\mathbf{w})$ for an arbitrary assignment.

Figure 7. Causal graphs (a) $\mathcal{G}$ and (b) $\mathcal{G}_{-B}$. Corresponding gc-factors (c, d) represented as a puzzle, where identified factors are colored green, $P_X(A, C)$, and purple, $P(B, C, Y)$.

Probabilities associated with the gc-factors of $P_{\mathbf{x}}(\mathbf{y})$ in $\mathcal{G}$ with respect to $\mathbf{V}'$ can form an expression for $P_{\mathbf{x}}(\mathbf{y})$:

Proposition 2 (gc-decomposition). For $\mathbf{V}^* \subseteq \mathbf{V}' \subseteq \mathbf{V}^+$,
$$P_{\mathbf{x}}(\mathbf{y}) = \sum_{\mathbf{y}'^+ \setminus \mathbf{y}}\ \prod_{\{i \,\mid\, \langle \mathbf{X}_i, \mathbf{Y}_i \rangle \in \mathcal{F}'\}} P_{\mathbf{x}_i}(\mathbf{y}_i). \qquad (2)$$

The algorithm for general identifiability (Lee et al., 2019) makes use of this decomposition based on $\mathcal{F}$ (i.e., restricted to $\mathbf{V}' = \mathbf{V}^+$) and identifies each factor using one of the available distributions. However, such a strategy, relying on the decomposition based on $\mathcal{F}$, is insufficient for handling partially-observed distributions. Recall Fig. 7a, where $P_x(y)$ can be factorized, following Prop. 2, as
$$\sum_{a,b,c} P(a)\, P_{a,x}(b)\, P_b(c)\, P_c(y). \qquad (3)$$
Unfortunately, $P_{a,x}(b) = P(b \mid a, x)$ is not identifiable from either of the available distributions, since no distribution includes all of $A$, $B$, and $X$ (Cor. 1). On the other hand, a gc-decomposition based on a projection onto $\mathbf{V}'' = \{A, C, X, Y\}$ (Fig. 7b) yields
$$\sum_{a,c} P(a)\, P_{a,x}(c)\, P_c(y), \qquad (4)$$
which allows each factor to be identified by at least one of the available distributions (i.e., $P(a) = P_{x'}(a)$ for any $x' \in \mathfrak{X}_X$, $P_{a,x}(c) = P_x(c \mid a)$, and $P_c(y) = \sum_b P(y \mid b, c) P(b)$). This provides a basis for the following sufficient condition.

Lemma 3 (Soundness). Let $\mathcal{G}$, $P_{\mathbf{x}}(\mathbf{y})$, and $\mathbb{P}$ be the causal graph, the query, and the distributions forming a GID-PO instance. If there exists a subset of variables $\mathbf{V}'$ with $\mathbf{V}^* \subseteq \mathbf{V}' \subseteq \mathbf{V}^+$ such that every term $P_{\mathbf{x}_i}(\mathbf{y}_i)$ for each gc-factor $\langle \mathbf{X}_i, \mathbf{Y}_i \rangle \in \mathcal{F}'$ is identifiable from some $P \in \mathbb{P}$, then $P_{\mathbf{x}}(\mathbf{y})$ is identifiable from $\mathbb{P}$ in $\mathcal{G}$.

Given Lemma 3, we are interested in finding $\mathbf{V}'$, a subset of $\mathbf{V}$, that yields a gc-decomposition in which every gc-factor is identifiable. (After this work was submitted in early February 2020, Lee & Shpitser (2020) independently introduced the problem of mID, where "m" stands for marginal distributions, which corresponds to the notion of partial-observability here (Def. 4). Even though there are subtle differences in terminology and notation, their main result in this context can be seen as our Lemma 3.) A naive solution emerges from searching over the exponentially many subsets of $\mathbf{V}^+$. While this would certainly lead to a sound procedure, it offers little to no insight into the problem. This motivates us to study the relationships among gc-factors at different levels of projection, so as to develop an efficient solution while avoiding this naive and clearly intractable approach.
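Before turning to embedding factors, the following sketch ties Def. 5 and Eq. (2) together, reusing `latent_projection`, `An`, and `c_components` from the earlier sketches; the exclusion of treatments from $\mathbf{Y}'^+$ follows the worked example under Def. 5.

```python
def gc_factors(vertices, directed, bidirected, X, Y, V_p):
    """gc-factors of P_x(y) w.r.t. V' (Def. 5), as (X-side, Y-side) pairs."""
    d, b = latent_projection(vertices, directed, bidirected, V_p)
    d_xbar = {(s, t) for s, t in d if t not in X}   # cut edges into X
    y_plus = An(Y, d_xbar) - set(X)                 # Y'^+ (treatments excluded)
    b_sub = [tuple(e) for e in b if set(e) <= y_plus]
    factors = []
    for S in c_components(y_plus, b_sub):           # c-components of G'[Y'^+]
        pa_S = {s for s, t in d if t in S} - S      # pa(S) in G', minus S
        factors.append((frozenset(pa_S), frozenset(S)))
    return factors

# Assuming, as in the Sec. 1 sketch, the structure A -> B <- X, B -> C -> Y,
# X <-> Y for Fig. 7a:
V = {"A", "B", "C", "X", "Y"}
d = {("A", "B"), ("X", "B"), ("B", "C"), ("C", "Y")}
b = {frozenset(("X", "Y"))}
print(gc_factors(V, d, b, X={"X"}, Y={"Y"}, V_p=V))          # matches Eq. (3)
print(gc_factors(V, d, b, X={"X"}, Y={"Y"}, V_p=V - {"B"}))  # matches Eq. (4)
```

Summing the product of the returned factors per Eq. (2) then expresses $P_x(y)$ at the chosen level of projection.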
4.1.1. Embedding Factor and Co-identification

In this section, we formally relate gc-factors before and after marginalization through two new concepts, embedding factor and co-identification, which associate factors and their identification under various levels of projection, allowing us to view the task of identification more comprehensively.

Comparing the decompositions in Eqs. (3) and (4), based on $\mathcal{G}$ and $\mathcal{G}_{-B}$, respectively, we observe that the two factors $P(a)$ and $P_c(y)$, which do not involve $B$, are shared, while the other two factors $P_{a,x}(b)$ and $P_b(c)$ in Eq. (3) are replaced by $P_{a,x}(c)$ in Eq. (4), where we can elicit $\sum_b P_{a,x}(b) P_b(c) = P_{a,x}(c)$. We characterize such changes in gc-factors before and after a projection (equivalently, a marginalization); a sketch of this merging procedure follows at the end of this subsection.

Proposition 3 (Factors under Marginalization). Let $\mathbf{W} \subseteq \mathbf{V}'$ and $\mathbf{V}'' = \mathbf{V}' \setminus \mathbf{W}$, where $\mathbf{W} \cap \mathbf{V}^* = \emptyset$. Let $\mathcal{H}$ be an undirected graph whose vertices are the factors $\langle \mathbf{X}'_j, \mathbf{Y}'_j \rangle \in \mathcal{F}'$ and where an edge exists between two factors if their Y-sides both intersect $\mathrm{Ch}(W)_{\mathcal{G}'}$ for some $W \in \mathbf{W}$. Then, the vertices (i.e., gc-factors) in each connected component of $\mathcal{H}$ are merged to form $\langle \mathbf{X}''_k, \mathbf{Y}''_k \rangle \in \mathcal{F}''$ such that
$$\mathbf{Y}''_k = \bigcup\nolimits_{j \in \mathbb{j}} \mathbf{Y}'_j \setminus \mathbf{W}, \qquad \mathbf{X}''_k = \Big(\bigcup\nolimits_{j \in \mathbb{j}} \mathbf{X}'_j\Big) \setminus \mathbf{Y}''_k \setminus \mathbf{W},$$
where $\mathbb{j}$ is the set of indices of the gc-factors in $\mathcal{F}'$ forming the connected component. The other gc-factors, those not involving $\mathbf{W}$, remain intact, shared by both $\mathcal{F}'$ and $\mathcal{F}''$.

For instance, marginalizing $B$ out of Eq. (3) consolidates the factors whose Y-sides intersect $\mathrm{Ch}(B)_{\mathcal{G}} = \{B, C\}$, that is, $P_{a,x}(b)$ and $P_b(c)$. Further, $\sum_b P_{a,x}(b) P_b(c)$ results in a new gc-factor with Y-side $(\{B\} \cup \{C\}) \setminus \{B\} = \{C\}$ and X-side $(\{A, X\} \cup \{B\}) \setminus \{C\} \setminus \{B\} = \{A, X\}$.

We introduce an embedding relationship to describe the connection between the factors merged through a projection and the resulting factor.

Definition 6 (Embedding Factor). A gc-factor $\langle \mathbf{X}'_i, \mathbf{Y}'_i \rangle \in \mathcal{F}'$ is said to be embedded in a gc-factor $\langle \mathbf{X}''_k, \mathbf{Y}''_k \rangle \in \mathcal{F}''$, for $\mathbf{V}'' \subseteq \mathbf{V}'$, if $\mathbf{Y}''_k$ contains any variable in the Y-sides of the gc-factors in the connected component of $\mathcal{H}$ containing $\langle \mathbf{X}'_i, \mathbf{Y}'_i \rangle$, where $\mathcal{H}$ is constructed as follows: $\mathcal{H}$ is an undirected graph over the gc-factors in $\mathcal{F}'$ with an edge between two factors, say $\langle \mathbf{X}'_i, \mathbf{Y}'_i \rangle$ and $\langle \mathbf{X}'_j, \mathbf{Y}'_j \rangle$, if both share a variable of $\mathbf{V}' \setminus \mathbf{V}''$, that is, $(\mathbf{X}'_i \cup \mathbf{Y}'_i) \cap (\mathbf{X}'_j \cup \mathbf{Y}'_j) \cap (\mathbf{V}' \setminus \mathbf{V}'') \neq \emptyset$.

Note that a gc-factor is also a (non-proper) embedding factor of itself, and that $P_{\mathbf{x}}(\mathbf{y})$ itself, as $\langle \mathbf{X}, \mathbf{Y} \rangle$, is an embedding factor of every gc-factor. We illustrate the embedding relationships among all gc-factors of $P_x(y)$ in $\mathcal{G}$ (Fig. 7a) in Fig. 8a. The four factors at the bottom correspond to $\mathcal{F}$, and the embedding relationships are represented as directed edges. For instance, $\langle \{A, X\}, \{C\} \rangle$ (i.e., $P_{a,x}(c)$) is an embedding factor of $\langle \{B\}, \{C\} \rangle$ and $\langle \{A, X\}, \{B\} \rangle$ (Fig. 7a). Quantitatively, embedding relationships can be understood as sum-product relationships (Fig. 8b), where projections are equivalent to marginalizations. Even when a gc-factor in $\mathcal{F}$ (a finest-grained factor) cannot be identified by itself, it can be identified as a group under the summation, i.e., when its embedding factor (a coarser-grained factor) is identified.

Figure 8. (a) Embedding relationships among the gc-factors $P(a)$, $P_{a,x}(b)$, $P_b(c)$, $P_c(y)$, $P_x(b)$, $P_{a,x}(c)$, $P_b(y)$, $P_x(c)$, and $P_{a,x}(y)$, where a directed edge $i \rightarrow j$ indicates that $j$ embeds $i$ (represented with a transitive reduction). (b) The same embedding relationships in sum-product notation.

Definition 7 (Co-Identification). If a factor $Q$ is identifiable with a distribution, the factors in $\mathcal{F}$ embedded in $Q$ are said to be co-identified with the distribution with respect to $Q$.
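As referenced above, Prop. 3's merge is a small connected-components computation. A sketch, operating on the (X-side, Y-side) pairs produced by `gc_factors`:

```python
def marginalize_factors(factors, W, directed):
    """Merge gc-factors per Prop. 3 when W is marginalized out of V'."""
    parent = list(range(len(factors)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for w in W:
        ch_w = {w} | {t for s, t in directed if s == w}   # Ch(w) in G'
        idxs = [i for i, (_, ys) in enumerate(factors) if set(ys) & ch_w]
        for i in idxs[1:]:                                 # same-w factors merge
            parent[find(i)] = find(idxs[0])
    comps = {}
    for i in range(len(factors)):
        comps.setdefault(find(i), []).append(i)
    out = []
    for idxs in comps.values():
        ys = set().union(*(factors[i][1] for i in idxs)) - set(W)
        xs = set().union(*(factors[i][0] for i in idxs)) - ys - set(W)
        out.append((frozenset(xs), frozenset(ys)))
    return out

# Reproducing the Eq. (3) -> Eq. (4) example (graph structure as assumed above):
F = [(frozenset(), frozenset("A")), (frozenset("AX"), frozenset("B")),
     (frozenset("B"), frozenset("C")), (frozenset("C"), frozenset("Y"))]
d = {("A", "B"), ("X", "B"), ("B", "C"), ("C", "Y")}
print(marginalize_factors(F, {"B"}, d))   # -> P(a), P_{a,x}(c), P_c(y)
```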
With the puzzle metaphor, co-identification of a factor refers to whether there will be a colored chunk covering the piece. Following the strategy described in Lemma 3, $P_{\mathbf{x}}(\mathbf{y})$ will only be identified when every gc-factor in $\mathcal{F}$ is co-identified by one of the available distributions. However, there can be an exponential number of embedding factors for a gc-factor in $\mathcal{F}$. Hence, we investigate the relationship between the (non-)identifiability of embedding factors.

4.1.2. Minimum Viable Embedding Factor

We introduce the crucial concept of a minimum viable embedding factor (MVEF), which helps characterize the co-identification of gc-factors in $\mathcal{F}$ with respect to a single available distribution. Its purpose is to find the finest-grained factors that a (partially-observed) distribution might be able to identify. In other words, just as the default decomposition offers factors ($\mathcal{F}$) of the right granularity for fully-observed distributions, we attempt to find factors of the appropriate granularity with respect to a single partially-observed distribution.

Definition 8 (Minimum Viable Embedding Factor (MVEF)). An embedding factor $\langle \mathbf{X}'', \mathbf{Y}'' \rangle \in \mathcal{F}''$ of a gc-factor $\langle \mathbf{X}_i, \mathbf{Y}_i \rangle \in \mathcal{F}$ is said to be a minimum viable embedding factor of $\langle \mathbf{X}_i, \mathbf{Y}_i \rangle$ with respect to $P_{\mathbf{Z}'}(\mathbf{V}')$ if the three criteria $\mathbf{Y}'' \subseteq \mathbf{V}'$; $\mathbf{X}'' \subseteq \mathbf{V}' \cup \mathbf{Z}'$; and $\mathbf{Z}' \cap \mathrm{An}(\mathbf{Y}'')_{\mathcal{G}_{\overline{\mathbf{X}''}}} = \emptyset$ hold true, and $\mathbf{V} \setminus \mathbf{V}''$ is minimal. Further, $\mathbf{V} \setminus \mathbf{V}''$ is said to be an MVEF-admissible set.

Given one of the available distributions in $\mathbb{P}$, we want to check whether it co-identifies a gc-factor in $\mathcal{F}$. Among the exponentially many embedding factors, we can choose one that satisfies the necessary conditions specified in Prop. 1. A polynomial-time algorithm for finding an MVEF is depicted in App. D of (Lee & Bareinboim, 2020); it iteratively seeks variables to be projected out in order to satisfy the necessary conditions (a brute-force rendering of Def. 8 is sketched below).

For example, consider the graph in Fig. 9a, where there are four factors in $\mathcal{F}$ (Fig. 9d). Consider co-identifying the factor $P_d(e, y_1)$ (with its Y-side highlighted in Fig. 9a) with $P_B(A, X, \mathbf{Y}) \in \mathbb{P}$. Since $E$ does not appear in $\mathbf{V}' = \{A, X, \mathbf{Y}\}$ and $D$ does not appear in $\mathbf{Z}' \cup \mathbf{V}' = \{B, A, X, \mathbf{Y}\}$ (Prop. 1), we may project both $D$ and $E$ out of $\mathcal{G}$ at once, and examine the resulting embedding factor of $P_d(e, y_1)$ in $\mathcal{G}_{-\{E,D\}}$ (Fig. 9c), that is, $P_{a,b,x}(\mathbf{y})$. The puzzle diagram illustrates how projecting out $D$ and $E$, or equivalently applying $\sum_{d,e}$, yields a chunk with three pieces, all except $P(a, b)$. Further, a Hasse diagram (Fig. 9e) represents the transitive reduction of the embedding relationships, where one can verify that $P_{a,b,x}(\mathbf{y})$ embeds the three factors in $\mathcal{F}$ excluding $P(a, b)$, as well as two intermediate factors. Another example, given in the Appendix, shows that obtaining an MVEF may take multiple steps.

Figure 9. (a, b, c) Causal diagrams with the c-components in $\mathbf{Y}'^+$ highlighted, which correspond to the Y-sides of gc-factors; panel (c) shows $\mathcal{G}_{-\{E,D\}}$. (d) $\mathcal{F}$ as a puzzle, with pieces $P(ab)$, $P_{ax}(d)$, $P_{be}(y_2)$, and $P_d(ey_1)$, and how pieces are merged by marginalization. (e) A Hasse diagram for the embedding relationships among factors (not exhaustive), with intermediate factors such as $P_x(bd)$, $P_e(ay_2)$, $P_{ax}(ey_1)$, $P_{bd}(\mathbf{y})$, $P_{ex}(dy_2)$, $P_x(bey_1)$, $P_d(a\mathbf{y})$, and $P_{abx}(\mathbf{y})$; the four vertices at the bottom are $\mathcal{F}$. The highlighted areas in (a, b, c) correspond to the three rightmost vertices in (e).
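Since the paper's polynomial-time MVEF procedure lives in App. D, the following is only a brute-force rendering of Def. 8 itself (our own sketch): enumerate candidate admissible sets by increasing size and return the first embedding factor passing the three criteria. The embedding test below uses a simple Y-side intersection, which suffices when the target's Y-side is not projected out entirely; a fuller treatment would follow Def. 6's connected components.

```python
from itertools import combinations

def find_mvef(target, dist, vertices, directed, bidirected, X, Y, V_plus):
    """Brute-force Def. 8: target is an (X-side, Y-side) gc-factor,
    dist = (Z', V') is the available distribution."""
    Z_p, V_p = dist
    fixed = set(X) | set(Y)                     # V* must stay in any projection
    pool = sorted(set(V_plus) - fixed)
    for size in range(len(pool) + 1):           # smallest admissible set first
        for drop in combinations(pool, size):
            keep = set(V_plus) - set(drop)
            for Xk, Yk in gc_factors(vertices, directed, bidirected, X, Y, keep):
                if not set(target[1]) & Yk:     # must embed the target factor
                    continue
                d_xbar = {(s, t) for s, t in directed if t not in Xk}
                if (Yk <= set(V_p)                        # criterion 1
                        and Xk <= set(V_p) | set(Z_p)     # criterion 2
                        and not set(Z_p) & An(Yk, d_xbar)):  # criterion 3
                    return (Xk, Yk), set(drop)  # the MVEF and its admissible set
    return None                                 # no MVEF exists
```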
An MVEF, if it exists, is uniquely determined. Further, the MVEFs obtained from a given distribution for a subset of gc-factors are disjoint with respect to the gc-factors they embed.

Proposition 4 (Uniqueness). If an MVEF exists for a gc-factor with respect to a distribution, then it is unique.

Proposition 5 (Disjointness). Given a distribution, let $Q$ and $R$ be the MVEFs of two gc-factors in $\mathcal{F}$, respectively. Then either $Q = R$, or there is no common gc-factor embedded in both $Q$ and $R$.

MVEFs are only a subset of all factors at different levels. If they are identifiable with a distribution, they correspond to same-colored chunks. Should we further examine the possibility of other chunks of the same color? That is, if an MVEF is found but not identified, is it still possible for factors further embedding the MVEF to be identifiable? We show next that factors embedding a non-identified MVEF are themselves non-identifiable.

Lemma 4 (Hedge over Embedding Factors). Consider a gc-factor $\langle \mathbf{X}_i, \mathbf{Y}_i \rangle \in \mathcal{F}$ and a distribution $P_{\mathbf{Z}'}(\mathbf{V}')$. The gc-factor is not co-identifiable with $P_{\mathbf{Z}'}(\mathbf{V}')$ with respect to any of its embedding factors if its MVEF does not exist, or if there exists a hedge for $P_{\mathbf{x}'' \setminus \mathbf{Z}'}(\mathbf{y}'')$ in $(\mathcal{G} \setminus \mathbf{Z}')\langle\mathbf{V}'\rangle$ given the MVEF $\langle \mathbf{X}'', \mathbf{Y}'' \rangle \in \mathcal{F}''$.

Hence, the failure to identify an MVEF informs us that none of its embedding factors needs to be examined for (non-)identifiability. Below is a complementary lemma stating that, in puzzle terms, any chunk of a given color is composed of MVEFs (i.e., the smallest chunks of that color).

Lemma 5 (Compositionality). Given a distribution $P_{\mathbf{Z}'}(\mathbf{V}')$, any identifiable embedding factor of a gc-factor in $\mathcal{F}$ can be represented as a summation over the product of a subset of identified MVEFs.

For instance, given that $P(a)$ and $P_{a,x}(c)$ are identified with $P_X(A, C)$ (Fig. 7d), the identifiability of $P_x(c)$ as $\sum_a P(a) P_{a,x}(c)$ is immediate, and checking it is redundant. Thus, finding MVEFs and checking their identifiability are sufficient to understand how one distribution contributes to the co-identification of a portion of the factors.

4.2. Algorithm for GID-PO

The previous sections discussed which factors, at different levels of projection, are available from each distribution. The identified MVEFs from the available distributions thus become chunks of different colors, which need to be put together to complete the causal jigsaw puzzle $P_{\mathbf{x}}(\mathbf{y})$. We devise a two-phase algorithm called GID-PO (Alg. 1), which first identifies MVEFs co-identifying the gc-factors in $\mathcal{F}$ (the colorful chunks), and then combines them to produce a formula for $P_{\mathbf{x}}(\mathbf{y})$.

The implementation of the first phase is straightforward given the characterization of MVEFs in the previous section. For every pair of a gc-factor in $\mathcal{F}$ and a distribution in $\mathbb{P}$ (Line 3), the existence of an MVEF is first checked (Line 5). If it exists and is identifiable, the algorithm records the MVEF together with the gc-factors embedded in it (i.e., co-identified), the distribution used, and the variables marginalized out (Lines 7-9).

With the colorful chunks given by the identified MVEFs, the second phase of the algorithm tries to complete the big picture. Based on Prop. 2, it attempts to select MVEFs whose co-identified gc-factors (represented as their indices in $\mathcal{F}$) do not overlap. This reduces to solving an exact cover problem (Line 13), which is NP-complete (Karp, 1972): given a family $\mathbb{I}$ of subsets of a set $[n] = \{1, \ldots, n\}$, decide whether there exists a subfamily $\mathbb{I}^* \subseteq \mathbb{I}$ such that the sets in $\mathbb{I}^*$ are disjoint and $\bigcup \mathbb{I}^* = [n]$ (a compact solver is sketched below). If a solution exists, then an expression for $P_{\mathbf{x}}(\mathbf{y})$ is written as a summation over the product of sub-formulas corresponding to the selected MVEFs, where the variables to be marginalized follow Eq. (2), except that part of the variables are moved inside the sub-formulas (Line 16).
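The exact cover step can be served by any off-the-shelf solver. Below is a compact backtracking sketch; the chunk contents used in the example are an assumption consistent with Fig. 7d.

```python
def exact_cover(universe, sets):
    """Return indices of a subfamily of `sets` partitioning `universe`,
    or None. Classic backtracking on the smallest uncovered element."""
    universe = frozenset(universe)
    def solve(uncovered, chosen):
        if not uncovered:
            return chosen
        pivot = min(uncovered)
        for i, s in enumerate(sets):
            if pivot in s and s <= uncovered:   # disjointness is enforced here
                result = solve(uncovered - s, chosen + [i])
                if result is not None:
                    return result
        return None
    return solve(universe, [])

# With F indexed 0..3 as in Fig. 7 (P(a), P_{a,x}(b), P_b(c), P_c(y)), suppose
# P_X(A, C) yields chunks {0} and {1, 2}, and P(B, C, Y) yields {2} and {3}:
chunks = [frozenset({0}), frozenset({1, 2}), frozenset({2}), frozenset({3})]
print(exact_cover(range(4), chunks))   # -> [0, 1, 3]
```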
Finally, we show next that this approach is indeed correct.

Theorem 1 (Soundness). GID-PO is sound.

Remarkably, GID-PO is more efficient than a naive implementation of Lemma 3, which would require running a traditional identifiability algorithm ($O(|\mathbf{V}|^4)$) for exponentially many gc-factors and for each dataset. In contrast, GID-PO runs the identifiability algorithm $|\mathcal{F}|$ times for each distribution to collect the identified MVEFs. Then, an exact cover runs in time exponential in the number of uniquely identified MVEFs, which is upper bounded by $|\mathcal{F}|$. More specifically, GID-PO runs in $O(\ell m n^4 + 2^\ell)$ time, while a naive implementation of Lemma 3 runs in $O(m n^4 2^n)$, where $n = |\mathbf{V}|$, $m = |\mathbb{P}|$, and $\ell = |\mathcal{F}|$. Further, if the problem instance is compatible with g-identifiability, where no partial-observability is involved, then the algorithm runs in time polynomial in $|\mathbf{V}|$, since every identified MVEF will be a single-piece chunk and an exact cover algorithm will spot uncovered elements first.

Algorithm 1 GID-PO
1: Input: $\mathcal{G}$ a causal graph, $\mathbf{x}$ and $\mathbf{y}$ value assignments for a query, $\mathbb{P}$ a collection of available distributions
2: Prepare $\mathbb{J}$, an empty collection.
3: for $\langle P_{\mathbf{Z}'}(\mathbf{V}'), \langle \mathbf{X}_i, \mathbf{Y}_i \rangle \rangle \in \mathbb{P} \times \mathcal{F}$ do
4:   continue if an MVEF that embeds $\langle \mathbf{X}_i, \mathbf{Y}_i \rangle$ was already found with the same data $P_{\mathbf{Z}'}(\mathbf{V}')$.
5:   $\langle \mathbf{X}'', \mathbf{Y}'' \rangle, \mathbf{W} \leftarrow \mathrm{MVEF}(\langle \mathbf{X}_i, \mathbf{Y}_i \rangle, P_{\mathbf{Z}'}(\mathbf{V}'))$.
6:   continue if the MVEF is co-identified by some data.
7:   if $\langle \mathbf{X}'', \mathbf{Y}'' \rangle$ exists and is identifiable with $P_{\mathbf{Z}'}(\mathbf{V}')$ then
8:     $\mathbb{i} \leftarrow$ the indices of the factors in $\mathcal{F}$ embedded in the MVEF.
9:     Add $\langle \langle \mathbf{X}'', \mathbf{Y}'' \rangle, \mathbf{W}, \mathbb{i}, \mathbf{Z}', \mathbf{V}' \rangle$ to $\mathbb{J}$.
10:  end if
11: end for
12: Let $\mathbb{I}$ be the collection of index sets in $\mathbb{J}$.
13: if $\mathbb{I}^* \leftarrow$ exact cover of $\{1, \ldots, |\mathcal{F}|\}$ with $\mathbb{I}$, and $\mathbb{I}^* \neq \emptyset$ then
14:   Let $\mathbb{J}^* \subseteq \mathbb{J}$ be the subcollection matching the solution $\mathbb{I}^*$.
15:   Let $\mathbf{W}^*$ be the union of all MVEF-admissible sets in $\mathbb{J}^*$.
16:   return $\sum_{\mathbf{y}^+ \setminus (\mathbf{Y} \cup \mathbf{W}^*)} \prod_{\mathbb{J}^*} \mathrm{GID}(P_{\mathbf{x}''}(\mathbf{y}''), \mathcal{G}\langle\mathbf{V}'\rangle, \{\mathbf{Z}'\})$.
17: end if
18: return NULL.

5. Discussions: Completeness, Complexity, Conditional Effects, and Transportability

We conjecture that Lemma 3 is also a necessary condition and, hence, that our algorithm is sound and complete. Under this completeness conjecture, we can show that the decision version of our problem is NP-complete. Further, we discuss two possible extensions of this work.

On Completeness. Our intuition for completeness lies in the fact that each gc-factor in $\mathcal{F}$ provides non-decomposable information about the model. Formally speaking, a probability $P_{\mathbf{w}}(\mathbf{z}) \in \mathcal{F}$ is not identifiable from the combination of smaller pieces, i.e., $\mathbb{P} = \{P_{\mathbf{W}'}(\mathbf{Z}') \mid \mathbf{Z}' \subseteq \mathbf{Z}, \mathbf{W}' \subseteq \mathbf{W}\} \setminus \{P_{\mathbf{W}}(\mathbf{Z})\}$. Following Lemma 2, we can construct two models $M^1$ and $M^2$ where $\mathbf{W}$ and the UCs are independent fair coins and each $Z \in \mathbf{Z}$ takes the exclusive-or of its parents (including UCs) in both models, except that one $Z \in \mathbf{Z}$ flips its value in $M^2$. Further, estimates for proper embedding factors and other gc-factors will not pinpoint a gc-factor of interest. For example, does knowing a chunk $\{f, g\}$ and a piece $\{g\}$ allow us to infer what $\{f\}$ is? Consider the equation $h = \int_x f(x)\, g(x)\, dx$. Generally, knowing $h$ and $g$ does not allow us to infer what $f$ is. Having a number of such equations may help characterize $f$, but will not identify $f$ unless the domain of $X$ is small and finite. For such reasons, we conjecture that $\mathcal{F}$ provides the fundamental, necessary, and sufficient building blocks for constructing $P_{\mathbf{x}}(\mathbf{y})$ from marginal interventional distributions. A rigorous proof of completeness is still under investigation.
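A two-point toy instance of the $h = \int_x f(x) g(x)\, dx$ argument (our own illustration): two different pieces $f$ induce the same chunk $h$ against the same $g$, so $h$ and $g$ together cannot pin $f$ down.

```python
g  = [0.5, 0.5]
f1 = [0.2, 0.8]
f2 = [0.8, 0.2]
h1 = sum(f * w for f, w in zip(f1, g))   # both inner products equal 0.5,
h2 = sum(f * w for f, w in zip(f2, g))   # yet f1 and f2 are distinct
assert h1 == h2 and f1 != f2
```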
On NP-Completeness. Assume we have proved the completeness of the algorithm, which reflects Prop. 2. Then we can show that the decision problem of the puzzle is indeed NP-complete, via a polynomial reduction from an exact cover problem to a GID-PO problem instance (a construction sketch is given at the end of this section). Consider an arbitrary exact cover problem with a universe $[n]$ and a collection $\mathbb{I}$ of its subsets. (An empty set can simply be ignored, because it does not contribute to answering the exact cover problem.) We construct a causal graph $\mathcal{G}$ with $\mathbf{V} = \{V_{i,j}\}_{1 \leq i \neq j \leq n}$, where the variables $V_{i,j}$ with $j < i$ are the sink nodes in $\mathcal{G}$. Each gc-factor $\langle \mathrm{pa}(V_{i,:}), V_{i,:} \rangle \in \mathcal{F}$, with $V_{i,:} = \{V_{i,j}\}_{j \neq i}$, represents element $i$ of the universe $[n]$, e.g., $P(V_{1,2}, V_{1,3})$ for element 1 and $P(V_{3,1}, V_{3,2} \mid do(V_{1,3}, V_{2,3}))$ for element 3. Given a subset of integers $\mathbf{w} \in \mathbb{I}$, there exists a corresponding distribution $P_{\mathbf{X}'}(\mathbf{Y}')$ in $\mathbb{P}$ such that $\langle \mathbf{X}', \mathbf{Y}' \rangle \in \mathcal{F}''$ is an MVEF of every gc-factor corresponding to $w \in \mathbf{w}$, where $\mathbf{V}'' = \mathbf{V} \setminus \{V_{i,j}\}_{i < j \,\mid\, i, j \in \mathbf{w}}$. For instance, the subsets of an exact cover problem over the universe $\{1, 2, 3\}$ (Fig. 10a) are matched to the following distributions:

{1}: $P(V_{1,2}, V_{1,3})$
{2}: $P(V_{2,1}, V_{2,3} \mid do(V_{1,2}))$
{3}: $P(V_{3,1}, V_{3,2} \mid do(V_{1,3}, V_{2,3}))$
{1, 2}: $P(V_{1,3}, V_{2,1}, V_{2,3})$
{1, 3}: $P(V_{1,2}, V_{3,1}, V_{3,2} \mid do(V_{2,3}))$
{2, 3}: $P(V_{2,1}, V_{3,1}, V_{3,2} \mid do(V_{1,2}, V_{1,3}))$
{1, 2, 3}: $P(V_{2,1}, V_{3,1}, V_{3,2})$

Hence, the construction can be done in time polynomial in $n \cdot |\mathbb{I}|$, the size of the problem. Further, Alg. 1 yields an identification formula if and only if the reduced problem has an exact cover solution, since each set, as an available distribution, can only identify itself and no others. Therefore, under the completeness of Lemma 3, the decision problem of general identifiability under partial-observability is NP-complete.

Figure 10. Causal diagrams for exact cover problems with universes (a) $\{1, 2, 3\}$ and (b) $\{1, 2, 3, 4\}$.

Generalization to Transportability and Conditional Distributions. A natural extension of this work is to the context of transportability (Pearl & Bareinboim, 2011; Bareinboim & Pearl, 2014; Lee et al., 2020). Transportability is concerned with the fusion of datasets collected from heterogeneous domains, where the mechanisms of some of the variables differ from those of the domain in which a causal effect is sought. Unless the differences between a source and the target (i.e., incompatible piece shapes) fall directly on the Y-side of a factor, a similar procedure can be applied. Another generalization is the identification of a conditional interventional distribution: e.g., $P_{\mathbf{x}}(\mathbf{y} \mid \mathbf{w})$ is delegated to identifying $P_{\mathbf{x}, \mathbf{w}'}(\mathbf{y}, \mathbf{w}'')$ (Tian, 2004; Lee et al., 2020). This change only requires additional pre- and post-processing of the result from GID-PO or its extension to transportability. We provide the generalization of our problem, taking both directions into account, in App. F of (Lee & Bareinboim, 2020).
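Returning to the reduction above, its bookkeeping can be written out mechanically. The sketch below assumes the indexing suggested by the listed examples (element $i$ is encoded by the row $V_{i,:}$; the distribution for a subset $\mathbf{w}$ spans the rows of $\mathbf{w}$ minus $\{V_{i,j} : i < j,\ i, j \in \mathbf{w}\}$, and it intervenes on the upper off-diagonal parents of $\mathbf{w}$'s rows that lie outside $\mathbf{w}$); it reproduces the seven distributions listed for the universe $\{1, 2, 3\}$.

```python
def distribution_for(w, n):
    """Map a subset w of the universe [n] to the scope (Y') and the
    intervention set (Z') of its corresponding distribution."""
    w = set(w)
    scope = {(i, j) for i in w for j in range(1, n + 1) if j != i}
    scope -= {(i, j) for i in w for j in w if i < j}          # drop in-w uppers
    do = {(i, j) for i in range(1, n + 1)                     # out-of-w parents
          for j in w if i < j and i not in w}
    return sorted(scope), sorted(do)

for subset in [{1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]:
    print(sorted(subset), "->", distribution_for(subset, 3))
```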
6. Conclusions

We introduced the general identifiability problem in which the available distributions are only partially observable, named GID-PO. We investigated how a causal query can be factorized under different levels of projection, and then introduced new constructs called embedding factors and co-identification. These constructs make explicit the connection between the factors required to identify the targeted query and the available observed distributions, which allows a systematic view of the problem of identifiability under different granularities. We introduced a new graphical structure called the minimum viable embedding factor (MVEF) and studied its properties, including its uniqueness, disjointness, and compositionality. Putting these results together, we developed a new algorithm (GID-PO) that efficiently and systematically examines the identifiability of embedding factors and combines the identified MVEFs to compose an expression for a given query. Since each of the factors cannot be identified from smaller marginal interventional distributions, we conjecture that the procedure is also necessary. Assuming its completeness, we showed that the decision version of this new identifiability problem is NP-complete; yet, it runs in time polynomial in the number of observed variables for problems under full-observability (Lee et al., 2019). Noting that, in practice, available datasets are measured inconsistently with respect to the variables they cover (they usually have different columns), we hope the results in this paper can help data scientists tackle more challenging identification instances and determine causal effects in more intricate and realistic scenarios.

Acknowledgements. This research is supported in parts by grants from NSF (IIS-1704352 and IIS-1750807 (CAREER)).

References

Bareinboim, E. and Pearl, J. Causal inference by surrogate experiments: z-identifiability. In de Freitas, N. and Murphy, K. (eds.), Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pp. 113-120. AUAI Press, Corvallis, OR, 2012a.

Bareinboim, E. and Pearl, J. Controlling selection bias in causal inference. In Girolami, M. and Lawrence, N. (eds.), Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2012), pp. 100-108. JMLR (22), 2012b.

Bareinboim, E. and Pearl, J. Transportability from multiple environments with limited experiments: Completeness results. In Advances in Neural Information Processing Systems 27, pp. 280-288. Curran Associates, Inc., 2014.

Bareinboim, E. and Pearl, J. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113:7345-7352, 2016.

Correa, J. and Bareinboim, E. Causal effect identification by adjustment under confounding and selection biases. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pp. 3740-3746. AAAI Press, 2017.

Huang, Y. and Valtorta, M. Identifiability in causal Bayesian networks: A sound and complete algorithm. In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI 2006), pp. 1149-1156. AAAI Press, Menlo Park, CA, 2006a.

Huang, Y. and Valtorta, M. Pearl's calculus of intervention is complete. In Dechter, R. and Richardson, T. (eds.), Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence (UAI 2006), pp. 217-224. AUAI Press, Corvallis, OR, 2006b.

Jaber, A., Zhang, J., and Bareinboim, E. Causal identification under Markov equivalence: Completeness results. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pp. 2981-2989. PMLR, 2019.

Karp, R. M. Reducibility among combinatorial problems. In Miller, R. E. and Thatcher, J. W. (eds.), Complexity of Computer Computations, The IBM Research Symposia Series, pp. 85-103. Plenum Press, New York, 1972.

Lee, J. J. R. and Shpitser, I. Identification methods with arbitrary interventional distributions as inputs. arXiv:2004.01157, 2020.

Lee, S. and Bareinboim, E. Causal effect identifiability under partial-observability. Technical Report R-58, Columbia Causal Artificial Intelligence Lab, Department of Computer Science, Columbia University, 2020.
Lee, S., Correa, J. D., and Bareinboim, E. General identifiability with arbitrary surrogate experiments. In Proceedings of the Thirty-Fifth Annual Conference on Uncertainty in Artificial Intelligence. AUAI Press, Corvallis, OR, 2019.

Lee, S., Correa, J. D., and Bareinboim, E. Generalized transportability: Synthesis of experiments from heterogeneous domains. In Proceedings of the 34th AAAI Conference on Artificial Intelligence. AAAI Press, 2020.

Mohan, K. and Pearl, J. On the testability of models with missing data. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics. JMLR, 2014.

Pearl, J. Causal diagrams for empirical research. Biometrika, 82(4):669-688, 1995.

Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, 2000.

Pearl, J. and Bareinboim, E. Transportability of causal and statistical relations: A formal approach. Technical Report R-372, Cognitive Systems Laboratory, Department of Computer Science, UCLA, 2011.

Pearl, J. and Mackenzie, D. The Book of Why: The New Science of Cause and Effect. Basic Books, 2018.

Shpitser, I. and Pearl, J. Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, pp. 1219-1226. AAAI Press, 2006.

Spirtes, P., Glymour, C., and Scheines, R. Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd edition, 2001.

Tian, J. Studies in Causal Reasoning and Learning. PhD thesis, Computer Science Department, University of California, Los Angeles, CA, November 2002.

Tian, J. Identifying conditional causal effects. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 561-568. AUAI Press, 2004.

Tian, J. and Pearl, J. A general identification condition for causal effects. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI 2002), pp. 567-573. AAAI Press/The MIT Press, Menlo Park, CA, 2002.

Tian, J. and Pearl, J. On the identification of causal effects. Technical Report R-290-L, Department of Computer Science, University of California, Los Angeles, CA, 2003.