# Causal Identification with Matrix Equations

Sanghack Lee, Graduate School of Data Science, Seoul National University, Seoul, South Korea (sanghack@snu.ac.kr)
Elias Bareinboim, Department of Computer Science, Columbia University, New York, USA (eb@cs.columbia.edu)

Abstract. Causal effect identification is concerned with determining whether a causal effect is computable from a combination of qualitative assumptions about the underlying system (e.g., a causal graph) and distributions collected from this system. Many identification algorithms exclusively rely on graphical criteria made of a non-trivial combination of probability axioms, do-calculus, and refined c-factorization (e.g., Lee & Bareinboim, 2020). In a sequence of increasingly sophisticated results, it has been shown how proxy variables can be used to identify certain effects that would not otherwise be recoverable in challenging scenarios through solving matrix equations (e.g., Kuroki & Pearl, 2014; Miao et al., 2018). In this paper, we develop a new causal identification algorithm which utilizes both graphical criteria and matrix equations. Specifically, we first characterize the relationships between certain graphically-driven formulae and matrix multiplications. With such characterizations, we broaden the spectrum of proxy-variable-based identification conditions and further propose novel intermediary criteria based on the pseudoinverse of a matrix. Finally, we devise a causal effect identification algorithm, which accepts as input a collection of marginal, conditional, and interventional distributions, integrating the enriched matrix-based criteria into a graphical identification approach.

1 Introduction

Cause-and-effect relations are one of the most common types of knowledge sought after throughout the empirical sciences. These relations are one of the main ingredients in the construction of stable explanations, and usually underpin robust and generalizable decision-making strategies [19, 26]. There is a growing literature that aims to systematically find causal relations by fusing observations, experiments, and substantive knowledge about the phenomenon under investigation [19, 3, 20]. Formally, the inferential target usually appears as the effect of a set of variables $do(X = x)$ on another set of variables $Y$, which is written as $P(y \mid do(x))$ or $P_x(y)$. Assumptions about the underlying data-generating processes are commonly expressed as a causal graph $G$ over endogenous variables $V$.

There are different lines of investigation that aim to establish the quantity $P_x(y)$. First, one line of investigation attempts to exploit the non-parametric constraints encoded in $G$ to determine whether the quantity $P_x(y)$ (i.e., the query) is uniquely computable from the different available distributions. For instance, Pearl's celebrated method known as do-calculus provides a symbolic way of determining whether a query can be identified from $G$ and observational data $P(V)$ [18]. A number of necessary and sufficient conditions were developed for systematically determining the identifiability status of the query from observational data [18, 29, 23, 9]. For example, the causal effect $P_x(y)$ in Fig. 1a is identified by the back-door criterion as $P_x(y) = \sum_u P(y \mid x, u) P(u)$. Further methods were developed to identify the query from a combination of observational and experimental distributions [2, 14, 12, 11].
The key ideas behind these methods are (i) to decompose the given causal query into standard factors (following Tian's c-factorization, [29]), leveraging the graphical constraints, and (ii) to identify these factors individually, matching each of them with one of the available distributions. We call this the factorization-based approach.

Figure 1: Causal graphs: (a) with a back-door condition; (b) with $W$ as a proxy for $U$; (c) with proxies $W$ and $Z$ for $U$; (d) a causal graph $G$ where the causal effect $P_x(y)$ can be computed by estimating each of $P_{x_2}(t)$ (light blue, left) and $P_{t,x_1}(y)$ (light red, right) using the MGT criterion.

Another line of investigation attempts to exploit assumptions about the relationship between unobserved confounders and the observable variables through the idea of proxy variables [5, 8, 10, 15, 32]. These methods rely on the cardinality and complexity of these relationships, which possibly lead to the invertibility of certain matrices. The focus is usually on local conditions involving the treatment $X$, the outcome $Y$, and the proxies $W, Z$ for the set of unobserved confounders $U$; for example, see Fig. 1b. Note that since $U$ is unobserved, the effect of $X$ on $Y$, i.e., $P_x(y)$, is provably not identifiable by the standard graphical methods discussed earlier. Still, if the distribution of the proxy given the unobserved confounder $P(W \mid U)$ is available, in addition to $P(X, Y, W)$, the quantity $P_x(y)$ can be computed under the invertibility of $\mathbf{P}(W \mid U)$, a matrix representation of $P(W \mid U)$ where $\mathbf{P}(W \mid U)_{i,j} = P(w_i \mid u_j)$ (see Eq. (3) for detail). In the case of Fig. 1c, the causal effect $P_x(y)$ is identified from $P(X, Y, W, Z)$ through the following,1

$$P_x(y) = \mathbf{P}(y \mid Z, x)\, \mathbf{P}(W \mid Z, x)^{-1}\, \mathbf{P}(W), \qquad (1)$$

where $\mathbf{P}(W \mid Z, x)$ has rank matching the size of the domain of the unmeasured confounder $U$. Although typically not framed in terms of identification with multiple datasets, this setting corresponds to identifying $P_x(y)$ with, e.g., marginal distributions $\{P(X, Y, Z), P(W, X, Z)\}$ or conditional distributions $\{P(Y \mid Z, x), P(W \mid Z, x), P(W)\}$. We call this the proxy approach.2

Despite all the power and successes achieved by the factorization and proxy approaches, there still exist interesting cases not covered by either of them individually. To witness, consider the causal graph in Fig. 1d and first note that the effect is not identified by either approach on its own. On the other hand, if we combine both approaches, the causal effect is identifiable and given by

$$P_x(y) = \sum_t P_x(t, y) = \sum_t P_{t,x_1}(y)\, P_{x_2}(t) = \sum_t \big(\mathbf{P}(y \mid Z_1, x_1, t)\, \mathbf{P}(W_1 \mid Z_1, x_1, t)^{-1}\, \mathbf{P}(W_1)\big)\, \big(\mathbf{P}(t \mid Z_2, x_2)\, \mathbf{P}(W_2 \mid Z_2, x_2)^{-1}\, \mathbf{P}(W_2)\big),$$

where $P_{t,x_1}(y)$ and $P_{x_2}(t)$, the two factors of the query, are individually identified with the help of the proxy approach, i.e., Eq. (1). Our goal in this paper is to explicate how this can be accomplished from first principles. More broadly, we will develop a flexible and general graphical identification approach that combines both the factorization and matrix equations (the underpinning idea of proxy-based methods). More specifically, our contributions are as follows:

1. We connect the graphical and matrical approaches by characterizing matrix equations of probability distributions driven by graphical constraints in a causal graph. This leads to a better understanding of identification through solving systems of equations.
2. Building on this new characterization, we generalize proxy-based criteria and devise novel intermediary pseudoinverse criteria so as to identify a causal effect by utilizing the generalized inverse of a matrix and a diverse collection of distributions.

3. We develop a general identification algorithm that amalgamates graphical and matrical approaches, returning an identification formula for a causal query given a causal diagram and a set of marginal, experimental, and conditional distributions. We show that this method subsumes the current state of the art in the literature.

1 We focus on discrete variables. A detailed discussion of the continuous case can be found in [15]. We refer to this identification condition as the MGT criterion, where the acronym MGT comes from the surnames of the authors of [15].
2 Recent work in this line of research uses the term proximal causal inference (PCI) [25, 27, 33, 6]. [27] generalizes the results in [15] to allow observed confounders as well. [25] considered combining graphical and proximal approaches, similar to the motivating example.

2 Preliminaries

We follow notational conventions from the literature on causal inference. We denote a variable by an uppercase letter $Y$, and its value by the corresponding lowercase letter $y$ in the domain $\mathcal{X}_Y$. A set of variables is denoted by a bold capital letter $\mathbf{Y}$ with its value $\mathbf{y}$. We may use $\dot\cup$, instead of $\cup$, to emphasize the union of two disjoint sets. Given $Z \subseteq W$, $w \setminus Z$ denotes the value of $W \setminus Z$ consistent with $w$. Let $a/B = (A \setminus B, a \setminus B)$, which retains $B$ as a set of variables and the values of $a$ excluding $B$. Without loss of generality, we refer to $P_Z(V' \mid W)$ as a distribution. We may call it conditional to emphasize $W \neq \emptyset$, experimental or interventional if $Z \neq \emptyset$ as opposed to observational $Z = \emptyset$, and marginal if $Z \cup W \cup V' \subsetneq V$.

We employ structural causal models (SCMs) [19, Ch. 7] as the semantical framework to represent a domain of interest. An SCM $M$ is a quadruple $\langle U, V, P(U), \mathcal{F} \rangle$. A set of exogenous variables $U$, which follows $P(U)$, is determined by factors outside the model. $V$ is a set of endogenous variables whose values are determined by functions $\mathcal{F} = \{f_i\}_{V_i \in V}$ such that $V_i \leftarrow f_i(pa_i, u_i)$ where $PA_i \subseteq V \setminus \{V_i\}$ and $U_i \subseteq U$. Further, $do(x)$ represents the operation of holding a set $X$ at a constant $x$ regardless of its original mechanisms. Such an intervention induces a submodel $M_x$, which is $M$ with $f_X$ replaced by $x$ for every $X \in \mathbf{X}$. The distribution over $V$ induced by the submodel is denoted by $P_x(V)$. We may employ the letter $Q$ to denote an interventional distribution, e.g., $Q = P_r$.

Each SCM (model, for short) induces a causal diagram (or causal graph) $G = \langle V, E \rangle$, where each type of edge represents a different causal relationship among the variables: (i) $X \rightarrow Y$ if $X$ is used as an argument of $f_Y$; and (ii) $X \leftrightarrow Y$ if $U_X$ and $U_Y$ are correlated. Given a causal diagram $G$, familial relationships among its vertices are denoted by $pa$ and $an$ for parents and ancestors, respectively. Further, $An$ is the set of ancestors including its argument as well. We denote by $G_{\overline{X}\underline{Z}}$ the edge subgraph of $G$ which removes edges incoming to $X$ and outgoing from $Z$. Causal relationships among the other variables are captured in $G \setminus X$, the subgraph of $G$ over $V \setminus X$. A vertex-induced subgraph is denoted by $G[V']$ where $V' \subseteq V$.
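To ground the definitions above, the following is a minimal Python sketch (not part of the paper) of a toy SCM consistent with Fig. 1a and of $do(\cdot)$ as submodel surgery; the structural equations, parameters, and sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample(n, x_do=None):
    """Toy SCM for Fig. 1a: U -> X, U -> Y, X -> Y (all binary).

    x_do=None samples the observational P(V); x_do=0/1 samples from the
    submodel M_x, i.e., the interventional distribution P_x(V)."""
    u = rng.integers(0, 2, size=n)                              # exogenous U ~ P(U)
    x = (rng.random(n) < 0.3 + 0.4 * u).astype(int)             # X <- f_X(u, noise)
    if x_do is not None:                                        # do(X = x): replace f_X by a constant
        x = np.full(n, x_do)
    y = (rng.random(n) < 0.2 + 0.3 * x + 0.3 * u).astype(int)   # Y <- f_Y(x, u, noise)
    return u, x, y

# P_{x=1}(y=1) from the submodel vs. the back-door adjustment sum_u P(y|x,u) P(u)
u_i, x_i, y_i = sample(500_000, x_do=1)
u_o, x_o, y_o = sample(500_000)
adj = sum(y_o[(x_o == 1) & (u_o == k)].mean() * (u_o == k).mean() for k in (0, 1))
print(round(y_i.mean(), 3), round(adj, 3))   # both approx 0.65 in this toy model
```

The two printed numbers agree (up to sampling error) because, in Fig. 1a, $U$ satisfies the back-door criterion for the effect of $X$ on $Y$.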
Causal effect identification relies heavily on standard graphical constraints imposed by a causal diagram, such as d-separation (reading off conditional independence from the graph [30, 7]) and do-calculus (equivalences among interventional probabilities) [18], which we present in the Appendix for completeness. The latent projection (or projection, for short) of a causal diagram is a causal diagram retaining the causal relationships among a subset of variables. We denote by $G\langle V' \rangle$ the latent projection of $G$ onto $V' \subseteq V$, the causal graph over $V'$ [31]. Conditional independence (CI) statements and do-calculus [18] on a projection are valid in $G$, and vice versa. We formally define a latent projection in the supplementary material. The omitted proofs and derivations are also provided in [13].

3 Characterization of Matrix Equations of Graphical Constraints

In this section, we present characterizations of graphical constraints in a given causal diagram $G$, leading to equations expressed as the multiplication of matrices. The characterizations will further advance our understanding of the constraints imposed over the distributions generated by the underlying system, compared to simple equivalence relationships such as conditional independence and do-calculus. To begin with, we denote by $\mathbf{P}$ the matrix representation of a distribution $P$, where free outcome variables correspond to rows and free condition or intervention variables correspond to columns. For instance, $\mathbf{P}_{a,B}(C, d)$ is a $|\mathcal{X}_C| \times |\mathcal{X}_B|$ matrix and $\mathbf{P}_r(A, b \mid C, d)$ is a $|\mathcal{X}_A| \times |\mathcal{X}_C|$ matrix. Further, we may use $'$ and $''$ to represent two disjoint subsets such that $B = B' \cup B''$.

Chain Rule with Conditional Independence. The definitions of conditional and marginal distributions naturally lead to a sum of a product of probabilities. Let $Q = P_r$ be an arbitrary interventional distribution. Let $A$, $B$, $C$, and $R$ be disjoint. A marginal probability over a chain-rule-induced multiplication is expressed as $Q(a, b' \mid c) = \sum_{b''} Q(a \mid b, c)\, Q(b \mid c) = \mathbf{Q}(a \mid b', B'', c)\, \mathbf{Q}(B'', b' \mid c)$. Considering conditional independence, we can further enrich such a characterization.

Lemma 1. Given a causal diagram $G$, let $Q = P_r$ for some $r \in \mathcal{X}_R$ where $R \subsetneq V$. Let $A$, $B$, $C$, $D$, $E$ be disjoint subsets of $V \setminus R$. If $(D \perp\!\!\!\perp A \mid B, C, E)$ and $(E \perp\!\!\!\perp B \mid C, D)$ in $G \setminus R$, then
$$\mathbf{Q}(A, b' \mid c, D, e) = \mathbf{Q}(A \mid b', B'', c, e)\, \mathbf{Q}(B'', b' \mid c, D).$$

The lemma emphasizes the condition under which the result of the multiplication is a matrix, not just a row or column. A special case of the lemma appears in the MGT criterion, where the inverse of a matrix in the multiplication $\mathbf{P}(W \mid Z, x) = \mathbf{P}(W \mid U, x)\, \mathbf{P}(U \mid Z, x)$ is utilized so as to cancel out an unknown distribution $\mathbf{P}(W \mid U) = \mathbf{P}(W \mid U, x)$.

Adjustment Criterion. Given a graph $G$ and a causal effect of interest $P_x(y)$, the adjustment criterion [24] seeks a set of covariates $Z \subseteq V \setminus X \setminus Y$, called an adjustment set for the causal effect $P_x(y)$, which grants the following expression: $P_x(y) = \sum_z P(y \mid x, z)\, P(z)$. The adjustment criterion generalizes the back-door criterion [19]. Its matricized expression, employing $Q = P_r$, is $Q_x(y) = \sum_z Q(y \mid x, z)\, Q(z) = \mathbf{Q}(y \mid x, Z)\, \mathbf{Q}(Z)$. This simple expression plays a central role in identification with proxy variables. In many settings, the left-hand side (LHS) is the query of interest, and the two terms on the RHS are usually available or are to be inferred using other available quantities. However, substituting the value $y$ with $Y$, we can further obtain (under an invertibility assumption) $\mathbf{Q}(Z) = \mathbf{Q}(Y \mid x, Z)^{-1}\, \mathbf{Q}_x(Y)$, which restores the covariate distribution of interest given a causal effect and a conditional distribution.
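As an illustration of the matrix notation and of the adjustment identity above, here is a small numerical sketch (not from the paper, with randomly chosen illustrative parameters): it checks that the sum-product form of adjustment equals the matrix product $\mathbf{Q}(Y \mid x, Z)\, \mathbf{Q}(Z)$, and that, when $\mathbf{Q}(Y \mid x, Z)$ is square and invertible, the covariate distribution can be restored from the effect.

```python
import numpy as np

rng = np.random.default_rng(1)
nZ, nX, nY = 3, 2, 3          # |X_Y| = |X_Z| so that P(Y|x, Z) is square

pZ = rng.dirichlet(np.ones(nZ))                      # P(z)
pY_XZ = rng.dirichlet(np.ones(nY), size=(nX, nZ))    # P(y|x,z), shape (nX, nZ, nY)

x = 0
# Adjustment in matrix form: P_x(Y) = P(Y|x, Z) P(Z), rows = outcomes Y, columns = Z.
pY_xZ = pY_XZ[x].T                                   # shape (nY, nZ)
px_Y = pY_xZ @ pZ

brute = np.array([sum(pY_XZ[x, z, y] * pZ[z] for z in range(nZ)) for y in range(nY)])
assert np.allclose(px_Y, brute)                      # sum-product form == matrix product

# Restoring the covariate distribution: P(Z) = P(Y|x, Z)^{-1} P_x(Y)
pZ_restored = np.linalg.inv(pY_xZ) @ px_Y
assert np.allclose(pZ, pZ_restored)
```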
C-Factorization. C-factorization [28] decomposes a causal effect $P_x(y)$ into a sum-product of c-factors (simply, factors) with respect to the given causal diagram $G$. Without loss of generality, let $X$ be minimal in the sense that no $X' \subsetneq X$ satisfies $P_{x'}(y) = P_x(y)$ (i.e., overriding $X$ by $an_{G_{\overline{X}}}(Y) \cap X$). For any projection $H$ of $G$ that preserves $X \cup Y$, the following holds:

$$P_x(y) = \sum_{y^+ \setminus y}\ \prod_{Y_i \in \mathcal{C}(H[Y^+])} P_{pa_H(Y_i) \setminus Y_i}(y_i), \qquad (2)$$

where $Y^+ = An_{H_{\overline{X}}}(Y)$ and $\mathcal{C}(\cdot)$ is the c-component decomposition (partitioning the variables in the graph based on their connectivity through bidirected edges). Given a query $P_x(y)$, we let the c-factors be $\mathcal{F}_H = \{\langle X_i, Y_i \rangle\}_i$, where $\{Y_i\}_i$ form the c-component decomposition of $H[Y^+]$ and $X_i = pa_H(Y_i) \setminus Y_i$. The general identification method under partial observability [12] can be summarily described as finding $H$ such that each factor is identified by one of the available distributions.

Figure 2: (a) An Euler diagram of the set relationships among $Z$, $Y_i$, $X_j$, $X_i$, and $Y_j$; (b) a causal graph where the numbers correspond to the regions of (a) from left to right.

Let us denote $X_{ij} = X_i \cup X_j$. We focus here on the sum-product of a pair of factors at some latent projection $H$, which satisfies $P_{x_i}(y_i)\, P_{x_j}(y_j) = P_{x_{ij} \setminus y_{ij}}(y_{ij})$ where the values are consistent [12]. Now consider the summation over $Z \subseteq X_i \cap Y_j$. Then, $\sum_z P_{x_i \setminus z,\, z}(y_i)\, P_{x_j}(y_j \setminus z, z) = P_{x_{ij} \setminus y_{ij}}(y_{ij} \setminus Z)$. To properly represent this as a matrix multiplication, we should decide which variables will become rows and columns. The two matrices to be multiplied should share only $Z$; other common variables (appearing in both terms) and non-matching variables (appearing as an intervention in the left term and as an outcome in the right term) need to be set to constants. To help understanding, we illustrate in Fig. 2a the set relationships among $X_i$, $Y_i$, $X_j$, and $Y_j$ in the case of $Z = X_i \cap Y_j$. Therefore, $Y_i \cap X_j$ (the shared variables other than $Z$) and $(X_i \cup Y_j) \setminus Z$ (the non-matching variables) are fixed. These fixed variables are not colored in Fig. 2a. As a result, we can obtain a submatrix of $\mathbf{P}_{X_{ij} \setminus Y_{ij}}(Y_{ij} \setminus Z)$ as the multiplication of submatrices of $\mathbf{P}_{X_i}(Y_i)$ and $\mathbf{P}_{X_j}(Y_j)$ obtained via fixing variables.

Figure 3: Causal diagrams admitting single- and double-proxy settings with surrogate experiments, e.g., (b) $P_{x,y}(W)$ and (e) $P_{x,z}(W)$.

Lemma 2 (Matrix Equation of C-Factorization with Two Factors). Given a causal diagram $G$ and an experimental distribution $Q = P_r$ where a causal effect is c-factorized as $Q_x(y) = \sum_z Q_{x_i}(y_i)\, Q_{x_j}(y_j)$ in a projection $H$ of $G \setminus R$, the effect can be represented as a matrix multiplication if $Z \subseteq X_i \cap Y_j$. Further, the corresponding matrix equation is
$$\mathbf{Q}_{(x_{ij} \setminus y_{ij})/(X_j \setminus X_i \setminus Y_i)}\big((y_{ij} \setminus Z)/(Y_i \setminus X_j)\big) = \mathbf{Q}_{x_i/Z}\big(y_i/(Y_i \setminus X_j)\big)\; \mathbf{Q}_{x_j/(X_j \setminus X_i \setminus Y_i)}\big(y_j/Z\big).$$

We illustrate a causal graph in Fig. 2b where each variable corresponds to a region of the Euler diagram (Fig. 2a). With $Z = \{V_6\}$ and for an arbitrary instantiation of the variables $V_2$, $V_4$, $V_5$, and $V_7$, $\mathbf{P}_{V_3, v_4, v_5}(V_1, v_2, v_7) = \mathbf{P}_{v_4, v_5, V_6}(V_1, v_2)\, \mathbf{P}_{v_2, V_3, v_4}(V_6, v_7)$.

In this section, we connected different graphical constraints, from the chain rule with conditional independence to c-factorization-induced identification formulae, to matrix equations. The results presented in this section are by no means complete, yet they cover, to the best of our knowledge, every sum-product expression appearing in both graphical and matrical identification approaches. Nevertheless, this suite of characterizations provides a fundamental understanding of the mathematical structures involved in identification methods with matrical expressions.
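The following is a minimal numerical sketch (not from the paper) of Lemma 2 for the simplest instance, a chain $X \to T \to Y$ with no unobserved confounding: the two c-factors are $P_t(y) = P(y \mid t)$ and $P_x(t) = P(t \mid x)$, with $Z = X_i \cap Y_j = \{T\}$, so the matrix equation reduces to $\mathbf{P}_X(Y) = \mathbf{P}_T(Y)\, \mathbf{P}_X(T)$. All parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
nX, nT, nY = 2, 3, 2

pT_X = rng.dirichlet(np.ones(nT), size=nX).T   # P(t|x) = P_x(t), shape (nT, nX)
pY_T = rng.dirichlet(np.ones(nY), size=nT).T   # P(y|t) = P_t(y), shape (nY, nT)

# Lemma 2 for the chain X -> T -> Y: the sum-product over Z = {T} of the two
# c-factors is exactly the product of their matrix representations.
pX_Y = pY_T @ pT_X                             # P_X(Y): rows = outcomes Y, columns = interventions X

x, y = 1, 0
brute = sum(pY_T[y, t] * pT_X[t, x] for t in range(nT))
assert np.isclose(pX_Y[y, x], brute)
```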
4 Generalized Proxy-based Criteria

Figure 4: Schematic for the single-proxy setting, relating $\mathbf{P}(W \mid x)$, $\mathbf{P}(y, W \mid x)$, $\mathbf{P}(W)$, $\mathbf{P}(U)$, and $\mathbf{P}(U \mid x)$. Gray and red lines are for elementwise matrix multiplication and the adjustment criterion, respectively.

Equipped with the characterization from the previous section, we revisit the single- and double-proxy settings [5, 8, 10, 15] more formally, which identify a causal effect through the combination of the chain rule, the adjustment criterion (c-factorization), and inverses of matrices, given an observational distribution and an additional external study involving an unobserved confounder. We investigate their extension capable of utilizing other types of available distributions. The results presented in this section are crucial in adopting matrical approaches into a factorization-based identification algorithm.

Single-Proxy Setting. We illustrate in Fig. 4 the available distributions and the unknown distributions, considered in Fig. 1b, that can lead to $P_x(y)$. In the figure, a matrix multiplication $\mathbf{A} = \mathbf{B}\,\mathbf{C}$ is represented by connecting $\mathbf{B}$ and $\mathbf{C}$ with their operand positions annotated. First, note that $(X, Y \perp\!\!\!\perp W \mid U)$ in the causal graph $G$ is central, which grants $P(W \mid U, x, y) = P(W \mid U, x) = P(W \mid U)$. If $\mathbf{P}(W \mid U)$ is invertible,3 the three distributions $\mathbf{P}(y, U \mid x)$, $\mathbf{P}(U \mid x)$, and $\mathbf{P}(U)$ are obtained, where $P(y \mid U, x)$ is computed by the chain rule (Lemma 1). Then, with $\oslash$ denoting elementwise division, $P_x(y)$ is

$$P_x(y) = \Big( \underbrace{\big(\mathbf{P}(W \mid U)^{-1}\, \mathbf{P}(y, W \mid x)\big)}_{\mathbf{P}(y, U \mid x)} \oslash \underbrace{\big(\mathbf{P}(W \mid U)^{-1}\, \mathbf{P}(W \mid x)\big)}_{\mathbf{P}(U \mid x)} \Big)^{\!\top} \underbrace{\mathbf{P}(W \mid U)^{-1}\, \mathbf{P}(W)}_{\mathbf{P}(U)},$$

where the term in parentheses is $\mathbf{P}(y \mid U, x)$. However, this scheme would not work in more restrictive conditions such as Fig. 3a or 3b. This challenging scenario can be handled if a different external study is available altogether with a surrogate experiment. For example, when only $Y$ is independent of $W$ given $U$ and $X$, the availability of an external study $P(W \mid U, x)$, instead of $P(W \mid U)$, and a surrogate experiment $P_x(W)$, in addition to the observational distribution $P(Y, W, X)$, suffices to identify the causal effect (see [13] for the derivations for Fig. 3a, 3b). These examples suggest that we can generalize the single-proxy criterion to take advantage of a diverse collection of external studies and surrogate experiments.

3 In case $W$ has more categorical values than $U$, one can coarsen $W$ to $W'$, ensuring that $\mathbf{P}(W' \mid U)$ is full rank. In other words, $\mathbf{P}(W \mid U)$ has full column rank.

Theorem 1. Given $G$, let $X$, $Y$, $W$, $U$, and $R$ be disjoint subsets of $V$, $Q = P_r$ for some $r \in \mathcal{X}_R$, and $H = G \setminus R$. A causal effect $P_{r,x}(y) = Q_x(y)$ is identifiable if (1) $U$ is an adjustment-admissible set for $Q_x(y)$ in $H$; (2a) $(Y \perp\!\!\!\perp W \mid U, X)_H$ and $Q(W \mid U, x)$, $Q(y, W \mid x)$, and $Q(W \mid x)$ are available, where $\mathbf{Q}(W \mid U, x)$ has full column rank; or (2b) $(Y \not\perp\!\!\!\perp W \mid U, X)_H$ and $Q(W \mid U, x, Y)$ and $Q(Y, W \mid x)$ are available, where every $\mathbf{Q}(W \mid U, x, y')$ has full column rank for $y' \in \mathcal{X}_Y$; and (3) $Q_{z'}(W)$ is available for $Z \subseteq X \cup Y$ and $Z' = (X \cup Y) \setminus Z$ with $z'$ consistent with $x \cup y$ such that $(Z \perp\!\!\!\perp W \mid U, Z')$ in $H$, and $U$ is an adjustment-admissible set for $Q_{z'}(W)$ in $H$.

The theorem presents a sound condition to elicit the causal effect by combining various distributions, especially when the given situation is more restrictive.
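Before moving on, here is a minimal numerical sketch (not from the paper) of the single-proxy scheme above, on a randomly parameterized discrete model in the spirit of Fig. 1b, where $W$ depends only on $U$; the external study $\mathbf{P}(W \mid U)$ is assumed to be handed to us as a matrix, and all array names and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
nU, nW, nX, nY = 3, 3, 2, 2

pU = rng.dirichlet(np.ones(nU))                     # P(u)
pW_U = rng.dirichlet(np.ones(nW), size=nU).T        # external study P(w|u), shape (nW, nU)
pX_U = rng.dirichlet(np.ones(nX), size=nU).T        # P(x|u)
pY_XU = rng.dirichlet(np.ones(nY), size=(nX, nU))   # P(y|x,u)

x, y = 0, 1
# Population-level observational quantities for this model (W depends only on U):
pU_x = pX_U[x] * pU / (pX_U[x] @ pU)                # P(U|x)
pyW_x = pW_U @ (pY_XU[x, :, y] * pU_x)              # P(y, W|x) = P(W|U) P(y, U|x)
pW_x = pW_U @ pU_x                                  # P(W|x)   = P(W|U) P(U|x)
pW = pW_U @ pU                                      # P(W)     = P(W|U) P(U)

# Single-proxy recovery: elementwise division restores P(y|U,x), then adjust over U.
inv = np.linalg.inv(pW_U)
py_Ux = (inv @ pyW_x) / (inv @ pW_x)                # P(y|U,x) = P(y,U|x) / P(U|x)
px_y = py_Ux @ (inv @ pW)                           # sum_u P(y|u,x) P(u)

assert np.isclose(px_y, pY_XU[x, :, y] @ pU)        # matches the adjustment over U
```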
Figure 5: Schematics of (a) the MGT criterion, relating $\mathbf{P}(W \mid U)$, $\mathbf{P}(U \mid Z, x)$, and $\mathbf{P}_x(y)$, and (b) its generalization (simplified), relating $\mathbf{Q}(W \mid U, s')$, $\mathbf{Q}(U \mid z/S, x)$, $\mathbf{Q}(W \mid z/S, x)$, $\mathbf{Q}(y \mid z/S, x)$, and $\mathbf{Q}_x(y)$; operand positions (L/R) are annotated.

Double-Proxy Setting. We now generalize the MGT criterion to utilize distributions other than the originally considered observational study. The MGT criterion for a double-proxy setting relies on the following conditions to identify $P_x(y)$ with $P(X, Y, Z, W)$: (C1) $U$ is an adjustment set for $P_x(y)$ in $G$; (C2) $Y \perp\!\!\!\perp Z \mid U, X$ in $G$; (C3) $Z, X \perp\!\!\!\perp W \mid U$ in $G$; and (C4) $\mathbf{P}(W \mid Z, x)$ has rank $|\mathcal{X}_U|$. Under these conditions, the algebraic relationships between the causal effect $P_x(y)$ and the other distributions can be illustrated as in Fig. 5a, where the distributions form a closed loop alternating between (i) distributions with an unmeasured confounder and (ii) given distributions and the query. Among the four multiplications, $P_x(y)$ corresponds to the adjustment criterion (C1), so that $\mathbf{P}_x(y) = \mathbf{P}(y \mid U, x)\, \mathbf{P}(U)$. The others are due to the chain rule combined with CI, e.g., (C2) $P(y \mid U, x) = P(y \mid U, x, z)$ and (C3) $P(W \mid U) = P(W \mid U, x, z)$. With (C4), which implies that both $\mathbf{P}(W \mid U)$ and $\mathbf{P}(U \mid Z, x)$ are invertible under coarsening of $W$ and $Z$ if necessary [15, 1], the causal effect $P_x(y)$ can be expressed as Eq. (1) by subsequently rewriting $\mathbf{P}(U \mid Z, x)$ and $\mathbf{P}(y \mid U, x)$, and contracting $\mathbf{P}(W) = \mathbf{P}(W \mid U)\, \mathbf{P}(U)$ (see [13], Appendix C, for an illustration).

To examine possible extensions of the setting, we relax (C3), whose CI grants the use of matrix multiplications corresponding to the chain rule, $\mathbf{P}(W) = \mathbf{P}(W \mid U)\, \mathbf{P}(U)$ and $\mathbf{P}(W \mid Z, x) = \mathbf{P}(W \mid U)\, \mathbf{P}(U \mid Z, x)$. One can consider three relaxed versions of (C3), namely, $(Z \perp\!\!\!\perp W \mid U, X)$, $(X \perp\!\!\!\perp W \mid U, Z)$, and, ultimately, dropping (C3), illustrated respectively in Fig. 3c, 3d, and 3e. For concreteness, consider the first relaxation, $(Z \perp\!\!\!\perp W \mid U, X)$. The assumption grants $P(W \mid U, Z, x) = P(W \mid U, x)$. Given that a surrogate experiment $P_x(W)$ is accessible, and that it can be decomposed as $\mathbf{P}_x(W) = \mathbf{P}(W \mid U, x)\, \mathbf{P}(U)$ where $U$ is also an adjustment-admissible set for $P_x(w)$, then $P_x(y)$ is identified. The other two relaxations turn out to be much more challenging, yet the causal effect can be identified by exploiting different surrogate experiments or an additional external study (see [13], Appendix C, for detail). Motivated by these examples, we present a theorem that extends the MGT criterion with varying degrees of assumptions and data collection.

Theorem 2 (Generalized MGT Criterion). Given a causal graph $G$, let $X, Y, Z, W, U, R \subseteq V$ be disjoint sets of variables, where $R$ can be empty. Let $Q = P_r$ for some $r \in \mathcal{X}_R$ and $H = G \setminus R$. Let $S \subseteq X \cup Z$. A causal effect $Q_x(y) = P_{x,r}(y)$ is identifiable in $G$ if, for some $z$, (1) $U$ is an adjustment set for $Q_x(y)$ in $H$; (2) $(Y \perp\!\!\!\perp Z \mid U, X)_H$; (3) $(W \perp\!\!\!\perp S' \mid U, S)_H$ where $S' = (X \cup Z) \setminus S$; (4) $U$ is an adjustment set for $Q_{s'}(W)$ in $H$; (5) $\mathbf{Q}(W \mid U, s')$ is invertible; and (6) $\mathbf{Q}(U \mid z/(S \cup Z'), x)$ is invertible for some $Z' \subseteq Z \setminus S$, and $Q(y \mid z/(S \cup Z'), x)$, $Q(W \mid z/(S \cup Z'), x)$, and $Q_{s'}(W)$ are available. Additionally, $Q(W \mid U, s'/Z')$ is available if $Z' \neq \emptyset$.

The theorem broadens the applicability of the MGT criterion as an identification template to a range of collections of distributions. Although the conditions involved in generalizing the MGT criterion are non-trivial, we can similarly draw a (simplified) big picture of the generalized criterion as in Fig. 5b. The key idea is relaxing condition (C3) for a set of variables while considering sets of variables (explicitly) and interventional distributions. We have established generalized identification criteria exploiting proxy variables.4 The purpose of both criteria aligns well with the philosophy of general identification [14, 12], so that they can be smoothly integrated into graphical approaches.

4 As mentioned earlier, [27] extends the MGT criterion by allowing observed confounders $C$. Roughly, this can be understood as replacing $U$ in Fig. 5a with $U$ and $C$, where $C$ is not necessarily marginalized out, so that the resulting distribution may contain $C$.
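To make the double-proxy manipulation concrete, the following is a minimal numerical sketch (not from the paper) of Eq. (1) on a randomly parameterized discrete model satisfying (C1)-(C4); the chosen cardinalities $|\mathcal{X}_Z| = |\mathcal{X}_W| = |\mathcal{X}_U|$ and all array names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
nU, nZ, nW, nX, nY = 3, 3, 3, 2, 2   # |Z| = |W| = |U| so that P(W|Z,x) is invertible

# Random CPTs for a discrete model where U -> {Z, W, X, Y} and X -> Y,
# so that (C2) Y _||_ Z | U, X and (C3) Z, X _||_ W | U hold.
pU = rng.dirichlet(np.ones(nU))                      # P(u)
pZ_U = rng.dirichlet(np.ones(nZ), size=nU).T         # P(z|u)
pW_U = rng.dirichlet(np.ones(nW), size=nU).T         # P(w|u)
pX_U = rng.dirichlet(np.ones(nX), size=nU).T         # P(x|u)
pY_XU = rng.dirichlet(np.ones(nY), size=(nX, nU))    # P(y|x,u)

x, y = 0, 1
# Ground-truth P_x(y) by adjusting over the latent confounder U (condition (C1)).
px_y_true = pY_XU[x, :, y] @ pU

# Observational joint P(u, z, w, x, y), then the quantities appearing in Eq. (1).
joint = np.einsum('u,zu,wu,xu,xuy->uzwxy', pU, pZ_U, pW_U, pX_U, pY_XU)
p_zw_x = joint[:, :, :, x, :].sum(axis=(0, 3))       # P(z, w, x), shape (nZ, nW)
p_z_x = p_zw_x.sum(axis=1)                           # P(z, x)
pW_Zx = (p_zw_x / p_z_x[:, None]).T                  # P(W|Z,x): rows W, columns Z
py_Zx = joint[:, :, :, x, y].sum(axis=(0, 2)) / p_z_x   # P(y|Z,x): row over Z
pW = joint.sum(axis=(0, 1, 3, 4))                    # P(W)

# Eq. (1): P_x(y) = P(y|Z,x) P(W|Z,x)^{-1} P(W)
px_y_hat = py_Zx @ np.linalg.inv(pW_Zx) @ pW
assert np.isclose(px_y_true, px_y_hat)
print(px_y_true, px_y_hat)
```

The check works at the population level because $\mathbf{P}(W \mid Z, x) = \mathbf{P}(W \mid U)\, \mathbf{P}(U \mid Z, x)$ under (C3), so the inverse cancels the unknown factor exactly as described above.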
5 Pseudoinverse and Intermediary Criteria

We now present a novel identification condition for a challenging setting that neither previous matrical nor graphical approaches could handle. In this setting, the probability of interest is expressed as the multiplication of three matrices and is identifiable by employing the pseudoinverse of a matrix, without an invertibility assumption. Let $\mathbf{P}(\cdot)^{+}$ denote the pseudoinverse of $\mathbf{P}(\cdot)$.5

Lemma 3 (Base Intermediary Criterion). Let $\{P_1, P_2, P_3, P_4\}$ be distributions and let $\{\mathbf{P}_i\}_{i=1}^4$ be their matrix representations. If submatrices $\{\mathbf{P}'_i\}_{i=1}^4$ of $\{\mathbf{P}_i\}_{i=1}^4$ satisfy $\mathbf{P}'_1 = \mathbf{P}'_2\, \mathbf{P}'_3\, \mathbf{P}'_4$ and $\mathbf{P}'_2 \mathbf{P}'_3$, $\mathbf{P}'_3$, and $\mathbf{P}'_3 \mathbf{P}'_4$ are given, then $\mathbf{P}'_1 = (\mathbf{P}'_2 \mathbf{P}'_3)\, (\mathbf{P}'_3)^{+}\, (\mathbf{P}'_3 \mathbf{P}'_4)$.

Proof. By the given condition, the associativity of matrix multiplication, and the property of the pseudoinverse $\mathbf{P}\mathbf{P}^{+}\mathbf{P} = \mathbf{P}$, we have $(\mathbf{P}'_2 \mathbf{P}'_3)(\mathbf{P}'_3)^{+}(\mathbf{P}'_3 \mathbf{P}'_4) = \mathbf{P}'_2 \big(\mathbf{P}'_3 (\mathbf{P}'_3)^{+} \mathbf{P}'_3\big) \mathbf{P}'_4 = \mathbf{P}'_2\, \mathbf{P}'_3\, \mathbf{P}'_4 = \mathbf{P}'_1$.

Although the lemma itself is rather general, we concretely characterize distributions satisfying Lemma 3 with respect to the chain rule (Sec. 5.1) and c-factorization (Sec. 5.2).

5 In the case of continuous variables, $\mathbf{P}(B \mid A)$ can be understood as a linear operator on Hilbert spaces. See [16] for a more detailed discussion on the existence and uniqueness of the pseudoinverse of a bounded linear operator.

5.1 Chain-Rule-based Intermediary Criterion

Figure 6: Causal diagrams where the chain-rule intermediary criterion is applicable to identify $P(a \mid d)$ given $P(a \mid C, d)$, $P(B \mid C, d)$, and $P(B \mid d)$.

We start by characterizing an intermediary criterion with the chain rule using a simple illustrative example. Let $Q$ be an arbitrary interventional distribution. One way to decompose $Q(a, b, c \mid d)$ into three probabilities is $Q(a \mid b, c, d)\, Q(b \mid c, d)\, Q(c \mid d)$. Let the probability of interest be its marginal $Q(a \mid d) = \sum_{b,c} Q(a, b, c \mid d)$, where the following distributions are available: $Q(B \mid C, d)$, $Q(A \mid C, d)$, and $Q(B \mid d)$. If the first term $Q(a \mid b, c, d)$ is equal to $Q(a \mid b, d)$, the term can be multiplied by $Q(b \mid d) = \sum_c Q(b \mid c, d)\, Q(c \mid d)$. Hence, the matricized expression becomes

$$\mathbf{Q}(A \mid B, d)\, \mathbf{Q}(B \mid C, d)\, \mathbf{Q}(C \mid d) = \mathbf{Q}(A \mid C, d)\, \mathbf{Q}(B \mid C, d)^{+}\, \mathbf{Q}(B \mid d),$$

where $\mathbf{Q}(A \mid B, d)\, \mathbf{Q}(B \mid C, d) = \mathbf{Q}(A \mid C, d)$ and $\mathbf{Q}(B \mid C, d)\, \mathbf{Q}(C \mid d) = \mathbf{Q}(B \mid d)$, for any $d \in \mathcal{X}_D$. Two illustrative examples where this expression is applicable (i.e., $(C \perp\!\!\!\perp A \mid B, D)$ in $G$) are shown in Fig. 6. We now propose a chain-rule intermediary criterion.

Lemma 4 (Chain-Rule Intermediary Criterion). Given a causal diagram $G$, let $A$, $B$, $C$, $D$, and $R$ be disjoint subsets of $V$, where $D$ and $R$ can be empty. Let $B = B' \cup B''$ and $C = C' \cup C''$, where $B'$ and $C'$ are not empty. Given an interventional distribution $Q = P_r$, if $(C' \perp\!\!\!\perp A \mid B, C'', D)$ in $G \setminus R$ and $Q(a, b'', c'' \mid d) = \sum_{b', c'} Q(a \mid b, c, d)\, Q(b \mid c, d)\, Q(c \mid d)$, then
$$\mathbf{Q}(A, b'', c'' \mid d) = \mathbf{Q}(A, b'' \mid C', c'', d)\; \mathbf{Q}(B', b'' \mid C', c'', d)^{+}\; \mathbf{Q}(B', b'', c'' \mid d).$$

Additional CI allows the criterion to return a matrix, which can then be nested into other equations.

Corollary 1. Given Lemma 4, if $(D \perp\!\!\!\perp A, B'' \mid C)$ and $(D \perp\!\!\!\perp B \mid C)$ in $G \setminus R$, then
$$\mathbf{Q}(A, b'', c'' \mid D) = \mathbf{Q}(A, b'' \mid C', c'')\; \mathbf{Q}(B', b'' \mid C', c'')^{+}\; \mathbf{Q}(B', b'', c'' \mid D).$$

These results advance current knowledge of causal and non-causal identification with fragments of information presented as conditional distributions, which is often under-studied.
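As a sanity check of the chain-rule intermediary criterion (Lemma 4 with $B'' = C'' = D = \emptyset$), the following sketch (not from the paper) uses a toy chain $C \to B \to A$, in which $(C \perp\!\!\!\perp A \mid B)$ holds; the dimensions are deliberately non-square so that no inverse exists and only the pseudoinverse property $\mathbf{P}\mathbf{P}^{+}\mathbf{P} = \mathbf{P}$ is exercised. All parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
nA, nB, nC = 2, 4, 3   # deliberately non-square: no invertibility assumption needed

# Chain C -> B -> A, so (C _||_ A | B) holds as required by Lemma 4.
pC = rng.dirichlet(np.ones(nC))                   # P(c)
pB_C = rng.dirichlet(np.ones(nB), size=nC).T      # P(b|c), shape (nB, nC)
pA_B = rng.dirichlet(np.ones(nA), size=nB).T      # P(a|b), shape (nA, nB)

# Quantities assumed to be available:
pA_C = pA_B @ pB_C                                # P(A|C) = P(A|B) P(B|C)
pB = pB_C @ pC                                    # P(B)   = P(B|C) P(C)

# Target P(A) recovered via the pseudoinverse of the center matrix:
pA_hat = pA_C @ np.linalg.pinv(pB_C) @ pB
assert np.allclose(pA_hat, pA_B @ pB_C @ pC)      # equals P(A) = P(A|B) P(B|C) P(C)
```

Note the analogy with Lemma 3: the two outer matrices are the given products $\mathbf{Q}(A \mid C)$ and $\mathbf{Q}(B)$, and the pseudoinverse of the shared center matrix $\mathbf{Q}(B \mid C)$ splices them back together.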
5.2 C-Factorization-based Intermediary Criterion

We proceed to characterize an intermediary criterion where a matrix equation corresponds to a c-factorization resulting in three factors, extending the case of a pair of factors (Lemma 2). Concretely speaking, we are interested in the matrix form of

$$\sum_{z, w} P_{x_i}(y_i)\, P_{x_j}(y_j)\, P_{x_k}(y_k) = \sum_{z} P_{x_i}(y_i) \sum_{w} P_{x_j}(y_j)\, P_{x_k}(y_k), \qquad (4)$$

where $Z \subseteq X_i \cap Y_j$ and $W \subseteq X_j \cap Y_k$ are disjoint sets of variables to be marginalized (the order among the three factors is irrelevant since they are invariant up to renaming). In addition, the available distributions are of the form $P_{X_j}(Y_j)$, $P_{X_{ij} \setminus Y_{ij}}(Y_{ij} \setminus Z)$, and $P_{X_{jk} \setminus Y_{jk}}(Y_{jk} \setminus W)$.

Theorem 3 (C-Factorization Intermediary Criterion). Let $G$ be a causal diagram and $Q = P_r$. Let $Q_{x^*}(y^*)$ be c-factorized as in Eq. (4). Let $X_k^+$ be a subset of $X_k$ excluding the other five sets, $\{Y_i, Y_j, Y_k, X_i, X_j\}$; $Y_i^+$ is similarly defined. If $Z \subseteq (X_i \cap Y_j) \setminus X_k$ and $W \subseteq X_j \cap Y_k$, then $\mathbf{Q}_{x^*/X_k^+}(y^*/Y_i^+)$, a submatrix of $\mathbf{Q}_{X^*}(Y^*)$, becomes
$$\mathbf{Q}_{x^*/X_k^+}(y^*/Y_i^+) = \mathbf{Q}_{(x_{ij} \setminus y_{ij})/W}\big((y_{ij} \setminus Z)/Y_i^+\big)\; \mathbf{Q}_{x_j/W}\big(y_j/Z\big)^{+}\; \mathbf{Q}_{(x_{jk} \setminus y_{jk})/X_k^+}\big((y_{jk} \setminus W)/Z\big).$$

The theorem imposes an additional constraint that $Z$ should be disjoint from $X_k$, compared to naively interpreting the three-matrix multiplication as two individual matrix multiplications as in Lemma 2 (we depict the sophisticated set relationships among $W$, the $X$s, the $Y$s, and $Z$ in Appendix D of [13]). Briefly speaking, given that the summation over $W$ is nested (Eq. (4)), $(X_j \cup Y_j) \setminus W$ is fixed along with $Y_j \cap X_k$. Thus, $X_i \cap Y_j \cap X_k$ cannot be part of $Z$. In other words, the constraints imposed in the original expression asymmetrically affect what $Z$ can be, but not what $W$ can be.

The implication of this result is immediate. It was previously unknown whether the identification procedures by Lee and Bareinboim [12] (GID-PO) and Lee and Shpitser [11] (Lemmas 3 and 6), which take marginal and interventional distributions as inputs, are complete. We now show a negative result.

Proposition 1. GID-PO [12] and Lemmas 3 and 6 of [11] are not complete.

Proof. Consider a causal graph $X \rightarrow A \rightarrow B \rightarrow Y$ and distributions $P(X, B)$, $P(A, B)$, and $P(A, Y)$. $P_x(y)$ is identified as $\mathbf{P}(y \mid A)\, \mathbf{P}(B \mid A)^{+}\, \mathbf{P}(B \mid x)$, which is not solvable by [12, 11].

In this section, we developed novel intermediary criteria by characterizing both the chain rule and c-factorization with respect to matrix multiplications of three matrices exploiting the pseudoinverse, which, to the best of our knowledge, has never been employed in the context of causal identification.
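The counterexample in Proposition 1 can likewise be checked numerically. The sketch below (not from the paper) instantiates the chain $X \to A \to B \to Y$ with random parameters, forms the quantities obtainable from $P(A, Y)$, $P(A, B)$, and $P(X, B)$, and verifies the pseudoinverse formula against the ground-truth effect; names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
nX, nA, nB, nY = 2, 3, 4, 2

# Chain X -> A -> B -> Y (no unobserved confounding), random CPTs.
pA_X = rng.dirichlet(np.ones(nA), size=nX).T     # P(a|x), shape (nA, nX)
pB_A = rng.dirichlet(np.ones(nB), size=nA).T     # P(b|a), shape (nB, nA)
pY_B = rng.dirichlet(np.ones(nY), size=nB).T     # P(y|b), shape (nY, nB)

# What the three available datasets P(A, Y), P(A, B), and P(X, B) provide:
pY_A = pY_B @ pB_A          # P(Y|A), computable from P(A, Y)
pB_X = pB_A @ pA_X          # P(B|X), computable from P(X, B); P(B|A) comes from P(A, B)

x, y = 1, 0
# Proposition 1: P_x(y) = P(y|A) P(B|A)^+ P(B|x), although no single input identifies it.
px_y = pY_A[y] @ np.linalg.pinv(pB_A) @ pB_X[:, x]
truth = (pY_B @ pB_A @ pA_X)[y, x]               # P_x(y) = sum_{a,b} P(y|b) P(b|a) P(a|x)
assert np.isclose(px_y, truth)
```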
6 A Unifying Causal Identification Algorithm

We present a causal identification algorithm, ID-ME (Alg. 1), taking a collection of marginal, conditional, and interventional distributions $\mathcal{D}$ and a causal graph $G$.6 The algorithm integrates different approaches such as the generalized proxy-variable-based criteria (Sec. 4 and [8, 10, 15]), the intermediary criteria (Sec. 5), and factorization approaches [23, 14, 12, 11]. Taking a causal query $P_x(y)$, causal graph $G$, and distributions $\mathcal{D}$, it refines the given query by removing redundant interventions based on Rule 3 of do-calculus, i.e., $P_x(y) = P_{x'}(y)$ for a minimal subset $X' \subseteq X$. Further, $G'$, a copy of the causal graph, is obtained by pruning $G$ to the ancestors of $Y$. However, note that the original $G$ is kept intact to be used in the proxy criteria later, where proxies can be non-ancestors of $Y$, e.g., Fig. 1b. Then, the algorithm proceeds in the following three parts: 1) expanding the given distributions $\mathcal{D}$; 2) c-factorizing the query $P_x(y)$ and identifying each factor; and 3) combining the identified c-factors $\mathcal{F}'$ to elicit the causal query.

6 We omit some details on whether some variables are fixed (e.g., $P_a(B)$ versus $P_A(B)$). Without loss of generality, every distribution contains no redundant conditions and interventions, which can be obtained by repeatedly applying the rules of do-calculus.

Algorithm 1 ID-ME
1: function ID-ME(x, y, G, D)
   Input: x, y value assignments for a query $P_x(y)$; $G$ a causal diagram; $\mathcal{D}$ a collection of distributions
   Output: a formula for $P_x(y)$ made with $\mathcal{D}$, or FAIL
2: $X, G' \leftarrow an_{G_{\overline{X}}}(Y) \cap X,\ G[An_G(Y)]$
3: $\mathcal{D} \leftarrow$ expand($G$, $\mathcal{D}$) unless $\mathcal{D}$ is unconditional
4: $\mathcal{F}^+ \leftarrow \mathcal{F}_{G'}$ if $\mathcal{D}$ is unconditional and non-marginal, else $\bigcup \{\mathcal{F}_{G'\langle V' \rangle} \mid X \cup Y \subseteq V' \subseteq V\}$
5: $\mathcal{F}', \mathcal{F}'' \leftarrow$ empty dictionary, $\mathcal{F}^+$
6: for all $P_{X_i}(Y_i) \in \mathcal{F}''$ and $\{P_Z(V' \mid T) \in \mathcal{D} \mid Y_i \subseteq V',\ X_i \subseteq Z \cup V' \cup T\}$ do
7:     update $\mathcal{F}', \mathcal{F}''$ with ID-RC($P_{X_i}(Y_i)$, $G'$, $P_Z(V' \mid T)$)
8: for all $P_{X_i}(Y_i) \in \mathcal{F}''$ do update $\mathcal{F}', \mathcal{F}''$ with PROXY($P_{X_i}(Y_i)$, $G$, $\mathcal{D} \cup \mathcal{F}'$)
9: repeat update $\mathcal{F}', \mathcal{F}''$ with CF-INT($G'$, $\mathcal{F}'$, $\mathcal{F}''$) and CF-INV($G'$, $\mathcal{F}'$, $\mathcal{F}''$) until $\mathcal{F}'$ not changed
10: return exact-cover($G'$, $P_x(y)$, $\mathcal{F}'$)

First, the algorithm expands the given distributions (expand in Line 3). A given conditional distribution $P_Z(V' \mid T)$ can be understood as a distribution $P_Z(V')$ under a selection bias, which might hinder identification of a c-factor. By multiplying another distribution equal to, e.g., $P_Z(T'' \mid T')$, we can obtain a distribution with more outcomes and less bias, $P_Z(V', T'' \mid T')$. Chain-rule closure [11] describes a state of a set of distributions where no new distribution with a smaller condition can be added to the collection. The function expand contains a procedure for chain-rule closure along with chain-rule matrix inversion (Lemma 1) and the chain-rule intermediary criterion (Lemma 4), enjoying the matrical approach especially relevant to conditional distributions.

The next part enumerates c-factors (Line 4) and attempts to identify every c-factor based on the available distributions (Lines 5-8). An algorithm for identifying a c-factor given a single unconditional and non-marginal distribution has been thoroughly studied (e.g., Identify [29] and ID [23]), and is a building block for general identification [2, 14]. Dealing with a marginal or conditional distribution, one should consider (i) whether the condition in the conditional distribution in $\mathcal{D}$ can be neglected with respect to identifying the c-factor, or (ii) whether matrix equations can be utilized (e.g., a simple inverse or the generalized proxy criteria). ID-RC (Line 7) is an identification module modified to handle a conditional distribution [4, 11], and PROXY (Line 8) refers to the generalized proxy criteria.

The last part (Lines 9-10) attempts to map the identified c-factors to the causal effect $P_x(y)$. Once a subset of c-factors is identified from the previous step, the algorithm proceeds to examine whether some of the unidentified factors in $\mathcal{F}''$ can be further inferred from the individually identified factors $\mathcal{F}'$, i.e., whether a simple inverse (CF-INV, implementing Lemma 2) or the intermediary criterion over c-factors (CF-INT, implementing Thm. 3) can be invoked (Line 9). Then, we finally elicit a causal effect if a valid combination of identified c-factors exists (exact-cover in Line 10, checking Eq. (2)). The algorithm coherently integrates existing methods and the machinery newly developed in the previous sections. Whenever ID-ME returns a formula, evaluating the formula yields the quantity $P_x(y)$.
Further, ID-ME strictly subsumes the union of the aforementioned methods, in the sense that not only does it return a formula whenever any of those methods returns one for the same problem instance, but it can also output novel formulae for problems that cannot be answered by any of them, as demonstrated through examples and by Prop. 1.

Theorem 4. ID-ME is sound.

Theorem 5. ID-ME strictly subsumes the proxy criteria [8, 15], GID(-PO) [14, 12], mID, or eID [11].

Finally, we discuss the time complexity of ID-ME by examining the complexity of algorithms concerning subsets of the problem instances ID-ME can handle. First, the state-of-the-art identification conditions with distributions that are marginal and experimental but unconditional [12, 11] involve finding a latent projection where the corresponding set of c-factors are all identified. It is conjectured that this requires time exponential in $|V|$ [12]. Next, dealing with conditional distributions by expanding them takes time exponential in $|V|$ (e.g., chain-rule closure [11]). As a result, the algorithm generally runs in time exponential in $|V|$; we speculate that this complexity reflects a fundamental trade-off between expressive power and tractability within causal inference, as already observed in [12, 11].

7 Conclusion

In this paper, we studied the use of matrix equations in causal identification given general distributions. In particular, we characterized matrix equations made of distributions driven by graphical constraints, deepening our understanding of the algebraic constraints imposed by the graph (Lemmas 1 and 2 in Sec. 3). We generalized proxy-based identification methods to cover a broad spectrum of problem instances beyond identification problems with an observational study and an optional external study (Thms. 1 and 2 in Sec. 4). We developed novel intermediary criteria that identify a query by utilizing the pseudoinverse of the center matrix within a multiplication of three matrices (Lemma 4 and Thm. 3 in Sec. 5). Finally, we provided a causal identification algorithm (Alg. 1 in Sec. 6) integrating several existing identification results to produce an identification formula for a causal query given a causal graph and distributions that can be marginal, experimental, and conditional, harnessing the power of matrix equations and (pseudo)inverses. As a future research direction, it will be crucial to devise statistically efficient estimators for identification formulae involving matrix inversion.

Acknowledgments and Disclosure of Funding

Sanghack Lee was supported by the New Faculty Startup Fund from Seoul National University. Elias Bareinboim was supported in part by funding from the NSF, Amazon, JP Morgan, and The Alfred P. Sloan Foundation.

References

[1] S. Banerjee and A. Roy. Linear Algebra and Matrix Analysis for Statistics. Taylor & Francis, Boca Raton, 2014.
[2] E. Bareinboim and J. Pearl. Causal inference by surrogate experiments: z-identifiability. In N. de Freitas and K. Murphy, editors, Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pages 113-120, Corvallis, OR, 2012. AUAI Press.
[3] E. Bareinboim and J. Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113:7345-7352, 2016.
[4] E. Bareinboim and J. Tian. Recovering causal effects from selection bias. In S. Koenig and B. Bonet, editors, Proceedings of the 29th AAAI Conference on Artificial Intelligence, pages 3475-3481, Palo Alto, CA, 2015. AAAI Press.
[5] R. Carroll, D. Ruppert, L. Stefanski, and C. Crainiceanu. Measurement Error in Nonlinear Models: A Modern Perspective. Chapman & Hall/CRC, Boca Raton, 2nd edition, 2006.
[6] O. Dukes, I. Shpitser, and E. J. T. Tchetgen. Proximal mediation analysis, 2021. arXiv:2109.11904.
[7] D. Geiger, T. Verma, and J. Pearl. Identifying independence in Bayesian networks. Networks, 20:507-534, 1990.
[8] S. Greenland and T. Lash. Bias analysis. In K. Rothman, S. Greenland, and T. Lash, editors, Modern Epidemiology, pages 345-380. Lippincott Williams & Wilkins, Philadelphia, PA, 3rd edition, 2008.
[9] Y. Huang and M. Valtorta. Pearl's calculus of intervention is complete. In R. Dechter and T. Richardson, editors, Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence (UAI 2006), pages 217-224. AUAI Press, Corvallis, OR, 2006.
[10] M. Kuroki and J. Pearl. Measurement bias and effect restoration in causal inference. Biometrika, 101:423-437, 2014.
[11] J. J. R. Lee and I. Shpitser. Identification methods with arbitrary interventional distributions as inputs, 2020. arXiv:2004.01157.
[12] S. Lee and E. Bareinboim. Causal effect identifiability under partial-observability. In Proceedings of the 37th International Conference on Machine Learning, volume 119. PMLR, 2020.
[13] S. Lee and E. Bareinboim. Causal identification with matrix equations. Technical Report R-70, Columbia Causal Artificial Intelligence Lab, Department of Computer Science, Columbia University, 2021.
[14] S. Lee, J. D. Correa, and E. Bareinboim. General identifiability with arbitrary surrogate experiments. In Proceedings of the Thirty-Fifth Annual Conference on Uncertainty in Artificial Intelligence, Corvallis, OR, 2019. AUAI Press.
[15] W. Miao, Z. Geng, and E. J. Tchetgen Tchetgen. Identifying causal effects with proxy variables of an unmeasured confounder. Biometrika, 105(4):987-993, 2018.
[16] M. Z. Nashed. Generalized inverses, normal solvability, and iteration for singular operator equations. In L. B. Rall, editor, Nonlinear Functional Analysis and Applications, pages 311-359. Academic Press, 1971.
[17] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, CA, 1988.
[18] J. Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669-688, 1995.
[19] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York.
[20] J. Pearl and D. Mackenzie. The Book of Why: The New Science of Cause and Effect. Basic Books, 2018.
[21] I. Shpitser. Appendum to "On the validity of covariate adjustment for estimating causal effects".
[22] I. Shpitser and J. Pearl. Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of The Twenty-First National Conference on Artificial Intelligence, pages 1219-1226. AAAI Press, 2006.
[23] I. Shpitser and J. Pearl. Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI 2006), pages 1219-1226. AAAI Press, Menlo Park, CA, 2006.
[24] I. Shpitser, T. VanderWeele, and J. Robins. On the validity of covariate adjustment for estimating causal effects. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 527-536. AUAI Press, Corvallis, OR, 2010.
[25] I. Shpitser, Z. Wood-Doughty, and E. J. T. Tchetgen. The proximal ID algorithm, 2021. arXiv:2108.06818.
[26] P. Spirtes, C. N. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press.
[27] E. J. T. Tchetgen, A. Ying, Y. Cui, X. Shi, and W. Miao. An introduction to proximal causal learning, 2020. arXiv:2009.10982.
[28] J. Tian. Studies in Causal Reasoning and Learning. PhD thesis, Computer Science Department, University of California, Los Angeles, CA, November 2002.
[29] J. Tian and J. Pearl. A general identification condition for causal effects. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI 2002), pages 567-573, Menlo Park, CA, 2002. AAAI Press/The MIT Press.
[30] T. Verma and J. Pearl. Causal networks: Semantics and expressiveness. In Proceedings of the Fourth Workshop on Uncertainty in Artificial Intelligence, pages 352-359, Mountain View, CA, 1988.
[31] T. Verma and J. Pearl. Equivalence and synthesis of causal models. In Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence (UAI 1990), pages 220-227, Cambridge, MA, July 1990.
[32] Y. Wang and D. M. Blei. The blessings of multiple causes. Journal of the American Statistical Association, 114(528):1574-1596, 2019.
[33] A. Ying, W. Miao, X. Shi, and E. J. T. Tchetgen. Proximal causal inference for complex longitudinal studies, 2021.