# Linear Causal Disentanglement via Interventions

Chandler Squires* (Broad Institute of MIT and Harvard; Laboratory for Information and Decision Systems, MIT), Anna Seigal* (Broad Institute of MIT and Harvard; School of Engineering and Applied Sciences, Harvard University), Salil Bhate (Broad Institute of MIT and Harvard), Caroline Uhler (Broad Institute of MIT and Harvard; Laboratory for Information and Decision Systems, MIT). *Equal contribution. Correspondence to: Chandler Squires.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

Causal disentanglement seeks a representation of data involving latent variables that are related via a causal model. A representation is identifiable if both the latent model and the transformation from latent to observed variables are unique. In this paper, we study observed variables that are a linear transformation of a linear latent causal model. Data from interventions are necessary for identifiability: if one latent variable is missing an intervention, we show that there exist distinct models that cannot be distinguished. Conversely, we show that a single intervention on each latent variable is sufficient for identifiability. Our proof uses a generalization of the RQ decomposition of a matrix that replaces the usual orthogonal and upper triangular conditions with analogues depending on a partial order on the rows of the matrix, with the partial order determined by a latent causal model. We corroborate our theoretical results with a method for causal disentanglement. We show that the method accurately recovers a latent causal model on synthetic and semi-synthetic data, and we illustrate a use case on a dataset of single-cell RNA sequencing measurements.

1. Introduction

The goal of representation learning is to find a description of data that is interpretable, useful for reasoning, and generalizable. Such a representation disentangles the data into conceptually distinct variables. Traditionally, conceptual distinctness of variables has meant statistical independence. This is the setting of independent component analysis (Comon, 1994). However, human reasoning often involves variables that are not statistically independent; for example, the presence of a sink and the presence of a mirror in an image. It is therefore natural to generalize conceptual distinctness to variables that are causally autonomous; i.e., interventions can be performed on each variable separately. This motivates causal disentanglement (Yang et al., 2021), the recovery of causally autonomous variables from data.

In this paper, we study the identifiability of causal disentanglement; i.e., its uniqueness. We adopt a generative perspective, as in (Bengio et al., 2013; Moran et al., 2021). We assume that the observed variables are generated in two steps. First, latent variables $Z$ are sampled from a distribution $P(Z)$. Then, the observed variables $X$ are the image of the latent variables under a deterministic mixing function. We assume that the latent variables are generated according to a linear structural equation model and that the mixing function is an injective linear map.

Recent work has studied identifiability of various settings in representation learning (Khemakhem et al., 2020; Ahuja et al., 2021). A common assumption for identifiability is that variables are observed across multiple contexts, each affecting the latent distribution $P(Z)$ but not the mixing function.
In our setup, each context is either an intervention on a latent variable, or is observational, i.e., has no interventions. We use the same terminology for interventions as Squires & Uhler (2022). From most to least general, a soft intervention on $Z_i$ changes the dependency of $Z_i$ on its direct causes, a perfect intervention removes this dependency but allows for stochasticity of $Z_i$, and a do-intervention sets $Z_i$ to a deterministic value.

Our main result is that our linear causal disentanglement setup is identifiable if, in addition to an observational context, for each latent variable $Z_i$ there is a context where $Z_i$ is the intervened variable under a perfect intervention; see Section 3.2. This is a sufficient condition for identifiability. Furthermore, we show that the condition of at least one intervention per latent node is necessary in the worst case: if some latent node is not intervened in any context, then there exist latent causal representations that are not identifiable; see Section 3.3.

Our focus in this paper is on identifiability guarantees. Nonetheless, we convert our proofs into a method for causal disentanglement in the finite-sample setting. In Section 4, we apply the method to synthetic and semi-synthetic data and show that it recovers the generative model, and we compute a linear causal disentanglement on a single-cell RNA sequencing dataset.

1.1. Motivating Example

Consider two latent variables $Z = (Z_1, Z_2)$. Assume that $X = (X_1, X_2)$ is observed in two contexts $k \in \{0, 1\}$, that $X = GZ$ in both contexts, and that in context $k$, $Z = A_k Z + \Omega_k^{1/2}\varepsilon$ for $\varepsilon \sim N(0, I)$, where
$$A_0 = A_1 = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}, \quad \Omega_0 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad \Omega_1 = \begin{pmatrix} 1 & 0 \\ 0 & 1/4 \end{pmatrix}, \quad G = \begin{pmatrix} 2 & 2 \\ 2 & 1 \end{pmatrix}.$$
Context $k = 1$ is an intervention on $Z_2$ that changes its variance. The covariance matrices of $X$ in contexts $k = 0$ and $k = 1$ are, respectively,
$$\Sigma_0 = \begin{pmatrix} 20 & 16 \\ 16 & 13 \end{pmatrix} \quad \text{and} \quad \Sigma_1 = \begin{pmatrix} 8 & 7 \\ 7 & 25/4 \end{pmatrix},$$
since $\Sigma_k = G(I - A_k)^{-1}\Omega_k(I - A_k)^{-\top}G^\top$. However, the following parameters give the same covariance matrices:
$$\widehat{A}_0 = \widehat{A}_1 = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}, \quad \widehat{\Omega}_0 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad \widehat{\Omega}_1 = \begin{pmatrix} 1 & 0 \\ 0 & 1/4 \end{pmatrix}, \quad \widehat{G} = \begin{pmatrix} 2 & 4 \\ 2 & 3 \end{pmatrix}.$$
The second set of parameters implies independence of $Z_1$ and $Z_2$, since $(\widehat{A}_0)_{1,2} = 0$, whereas the original parameters imply non-independence, since $(A_0)_{1,2} \neq 0$. This non-identifiability holds for generic $A_0, A_1, \Omega_0, \Omega_1$, and $G$, where $A_1$ comes from an intervention on $Z_2$; see Section 3.3. The non-identifiability extends to any number of latent variables $d$: we show that, in the worst case, non-identifiability holds when fewer than $d + 1$ contexts are observed.
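To make the non-identifiability concrete, the following short NumPy check (a sketch of ours, using exactly the matrices of the example above) confirms that both parameter sets produce the same observed covariances $\Sigma_0$ and $\Sigma_1$.

```python
import numpy as np

def observed_cov(G, A, Omega):
    """Sigma = G (I - A)^{-1} Omega (I - A)^{-T} G^T for Z = A Z + Omega^{1/2} eps, X = G Z."""
    M = np.linalg.inv(np.eye(A.shape[0]) - A)
    return G @ M @ Omega @ M.T @ G.T

# Original parameters: edge Z2 -> Z1; context k = 1 changes the variance of Z2.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
Omega0, Omega1 = np.diag([1.0, 1.0]), np.diag([1.0, 0.25])
G = np.array([[2.0, 2.0], [2.0, 1.0]])

# Alternative parameters: independent latent variables, different mixing matrix.
A_hat = np.zeros((2, 2))
G_hat = np.array([[2.0, 4.0], [2.0, 3.0]])

for Om in (Omega0, Omega1):
    assert np.allclose(observed_cov(G, A, Om), observed_cov(G_hat, A_hat, Om))

print(observed_cov(G, A, Omega0))  # [[20. 16.] [16. 13.]]
print(observed_cov(G, A, Omega1))  # [[8.   7.  ] [7.   6.25]]
```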
1.2. Related Work

The growing field of causal representation learning blends techniques from multiple lines of work. Chief among these are identifiable representation learning, causal structure learning, and latent DAG (directed acyclic graph) learning.

Identifiable representation learning. The identifiability of the linear independent component analysis (ICA) model was given in Comon (1994). Identifiability of a nonlinear ICA model is studied in (Hyvarinen et al., 2019; Khemakhem et al., 2020), in the presence of auxiliary variables. The ICA model imposes the stringent condition that the latent variables are independent (in the linear setting) or conditionally independent given the auxiliary variables (in the nonlinear setting). Recent works on identifiable representation learning (Ahuja et al., 2021; Zimmermann et al., 2021) introduce structure on the data generating process to circumvent the independence condition. However, they do not consider latent variables that are causally related.

Causal structure learning. Causal structure is identifiable up to an equivalence class that depends on the available interventional data (Verma & Pearl, 1990; Hauser & Bühlmann, 2012; Yang et al., 2018; Squires et al., 2020; Jaber et al., 2020); see Squires & Uhler (2022) for a recent review. A key line of work (Eberhardt et al., 2005; Hyttinen et al., 2013) characterizes the interventions necessary and sufficient to ensure that the causal structure is fully identifiable; i.e., that the equivalence class is of size one. In particular, Eberhardt et al. (2005) showed that $d - 1$ interventions are in the worst case necessary to fully identify a causal DAG model on $d$ nodes. The current paper extends this line of work to DAG models over latent variables.

Learning latent DAG models. The task of learning a DAG over latent variables dates back at least to Silva et al. (2006). They introduced the notion of a pure child: an observed variable $X_i$ with only one latent parent; such an $X_i$ is also called an anchor (Halpern et al., 2015; Saeed et al., 2020). The method of Silva et al. (2006) requires that all observed variables are pure children. Recent works relax this assumption by studying the linear non-Gaussian setting, where all latent and observed variables are linear functions of their parents plus independent non-Gaussian noise. For example, Cai et al. (2019) propose a method which learns a latent DAG under the assumption that each latent variable has at least two pure children. The pure child assumption can be extended to allow subsets of latent variables with the same observed children, as in Xie et al. (2020), which introduces the Generalized Independent Noise condition. This condition was used by Xie et al. (2022) to permit latent variables with no observed children; i.e., a hierarchical latent model. Other works consider the discrete setting, requiring that the latent and observed variables are discrete (Halpern et al., 2015) or that the latent variables are discrete (Kivva et al., 2021). The paper Kivva et al. (2021) relaxes the pure child assumption, as follows. The children of node $Z_i$ are the variables with a directed edge from $Z_i$. The no twins assumption says that the observed children of any two latent nodes are distinct sets. A similar assumption called strong non-redundancy appears in Adams et al. (2021), which considers models whose latent variables can be downstream of observed variables; see Appendix A. These works require sparsity in the map between latent and observed variables: they do not allow all observed variables to be children of all latent variables, which is the setting of the present paper.

A number of recent works discard the sparsity requirement. Ahuja et al. (2022a) and Brehmer et al. (2022) learn a latent DAG from paired counterfactual data. In contrast, we study unpaired data, which is more realistic in applications such as biology (Stark et al., 2020). To the best of our knowledge, only two works consider unpaired data without sparsity assumptions. Liu et al. (2022) study a linear Gaussian model over $d$ latent variables, a nonlinear mixing function, and vector-valued contexts. Their identifiability result only applies to our setting if the latent graph has at most as many edges as nodes; see Appendix K. In that case, their result implies that $2d$ interventions suffice for identifiability. We strengthen this result, showing that (i) $d$ interventions are sufficient and (ii) no restrictions on the latent graph are required. Moreover, we show that $d$ interventions are, in the worst case, necessary. Such necessary conditions do not appear in prior work on identifying latent DAGs. Finally, contemporaneous work (Ahuja et al., 2022b) shows that a latent DAG is identifiable from the more restricted class of do-interventions, but allows non-linear relationships. See Table 1 for a summary of prior work.

Table 1: Settings from prior works on learning latent DAG models. LNG is short for linear non-Gaussian, LG for linear Gaussian, Dis for discrete, and Poly for polynomial mixing. In *, pure children are allowed to be latent. The number of nodes and the number of edges in the latent graph are denoted $|V|$ and $|E|$, respectively.
- Silva et al. (2006) — Setting: LNG; Graphical conditions: all children pure; Result: identified up to Markov equivalence.
- Halpern et al. (2015) — Setting: Dis; Graphical conditions: 1 pure child per latent; Result: identified up to Markov equivalence.
- Cai et al. (2019) — Setting: LNG; Graphical conditions: 2 pure children per latent; Result: fully identified.
- Xie et al. (2020) — Setting: LNG; Graphical conditions: 2 pure children per latent; Result: fully identified.
- Xie et al. (2022) — Setting: LNG; Graphical conditions: 2 pure children per latent*; Result: fully identified.
- Kivva et al. (2021) — Setting: Dis; Graphical conditions: no twins; Result: identified up to Markov equivalence.
- Liu et al. (2022) — Setting: LG; Graphical conditions: none; Result: fully identified from $2|V|$ perfect interventions if $|E| \leq |V|$.
- Ahuja et al. (2022b) — Setting: Poly; Graphical conditions: none; Result: fully identified from $|V|$ do-interventions.
- This paper — Setting: LNG or LG; Graphical conditions: none; Result: fully identified from $|V|$ perfect interventions.

2. Setup

We consider $d$ latent variables $Z = (Z_1, \ldots, Z_d)$, generated according to a linear structural equation model. We index contexts by $k \in \{0\} \cup [K]$, where $[K] := \{1, \ldots, K\}$. The linear structural equation models in each context are related: context $k = 0$ is observational data, while contexts $k \in [K]$ are interventional data. We now state the assumptions for our model; see also Fig. 1.

Assumption 1.
(a) Linear latent model: Let $\mathcal{G}$ be a DAG with nodes ordered so that an edge $j \to i$ implies $j > i$. The latent variables $Z$ follow a linear structural equation model: in context $k$, the latent variables $Z$ satisfy
$$Z = A_k Z + \Omega_k^{1/2}\varepsilon, \qquad \mathrm{Cov}(\varepsilon) = I_d,$$
where $I_d \in \mathbb{R}^{d\times d}$ is the identity matrix, $\Omega_k \in \mathbb{R}^{d\times d}$ is diagonal with positive entries, and $A_k \in \mathbb{R}^{d\times d}$ has $(A_k)_{ij} \neq 0$ if and only if there is an edge $j \to i$ in $\mathcal{G}$. That is, in context $k$,
$$Z = B_k^{-1}\varepsilon, \qquad \text{where } B_k = \Omega_k^{-1/2}(I_d - A_k). \tag{1}$$
(b) Generic single-node interventions: For each $k \in [K]$, there exists $i_k \in \{1, \ldots, d\}$ such that $B_k = B_0 + e_{i_k} c_k^\top$; further, $B_k^\top e_{i_k}$ is not a multiple of $B_0^\top e_{i_k}$ unless $i_k$ has no parents in $\mathcal{G}$.
(c) Linear observations: Fix $p \geq d$. There is a full rank matrix $G \in \mathbb{R}^{p\times d}$ such that $X = GZ$ in every context $k$. Let $H := G^\dagger$ denote its Moore–Penrose pseudoinverse. We set the entry of largest absolute value in each row of $H$ to 1. If multiple entries in a row have the same absolute value, we set the leftmost such entry to be positive.

Our strongest results hold under one additional assumption.

Assumption 2. Perfect interventions: For each $k \in [K]$, there exists $i_k \in \{1, \ldots, d\}$ such that $B_k = B_0 + e_{i_k} c_k^\top$, where $c_k = \lambda_k e_{i_k} - B_0^\top e_{i_k}$ for some $\lambda_k > 0$.
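For concreteness, here is the effect of a perfect intervention spelled out directly from Assumption 2 and Equation (1); this is a restatement of the definitions above rather than an additional assumption.

```latex
% Under Assumption 2, only row i_k of B_0 changes, and it becomes a scaled basis vector:
e_{i_k}^\top B_k
  = e_{i_k}^\top B_0 + c_k^\top
  = e_{i_k}^\top B_0 + \bigl(\lambda_k e_{i_k} - B_0^\top e_{i_k}\bigr)^\top
  = \lambda_k e_{i_k}^\top ,
\qquad
e_j^\top B_k = e_j^\top B_0 \quad \text{for } j \neq i_k .
```

Comparing with $B_k = \Omega_k^{-1/2}(I_d - A_k)$, this says that $Z_{i_k}$ loses its dependence on its parents ($(A_k)_{i_k, j} = 0$ for all $j$) while remaining stochastic, with $(\Omega_k)_{i_k, i_k} = \lambda_k^{-2}$; this matches the description of perfect interventions in Section 1.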
Remark 1 (The parts of Assumption 1 that hold without loss of generality). Taking $\mathrm{Var}(\varepsilon_i) = 1$ for all $i$ holds without loss of generality, since scaling can be absorbed into the matrix $\Omega_k$. A linear structural equation model is causally sufficient if $\varepsilon_i \perp\!\!\!\perp \varepsilon_j$ for all $i \neq j$. Thus, for a causally sufficient linear structural equation model, we have $\mathrm{Cov}(\varepsilon) = I_d$ in Assumption 1(a) without loss of generality. The ordering of nodes in Assumption 1(a) is also without loss of generality: a permutation of the latent nodes can be absorbed into $G$. Our ordering makes the matrices $A_k$ upper triangular. The scaling of $H$ in Assumption 1(c) is without loss of generality, as follows. If $\{B_k\}_{k=0}^K$ and $H$ satisfy Assumption 1, then $X = (B_k H)^\dagger \varepsilon$. Consider the matrices $\{B_k \Lambda\}_{k=0}^K$ and $\Lambda^{-1} H$, for $\Lambda$ diagonal with positive entries. Observe that $X = (B_k \Lambda \Lambda^{-1} H)^\dagger \varepsilon$ has the same distribution as $X$ in context $k$. The alternative matrices satisfy Assumption 1, except for the scaling condition on $H$. Assumption 1(c) therefore fixes the scaling indeterminacy of each node.

Figure 1: The proposed setup. The latent variables $Z = (Z_1, \ldots, Z_d)$ follow a linear DAG model, with contexts $k = 1, \ldots, K$ being single-node interventions on targets $i_1, \ldots, i_K$. The observed variables $X = (X_1, \ldots, X_p)$ are an injective linear function of the latent variables, $X = GZ$, where $G \in \mathbb{R}^{p\times d}$ does not vary across contexts.

The genericity condition in Assumption 1(b) automatically holds for perfect interventions. It fails to hold only for soft interventions that change the variance but not the edge weights (i.e., $(\Omega_k)_{i_k,i_k} \neq (\Omega_0)_{i_k,i_k}$ and $A_k^\top e_{i_k} = A_0^\top e_{i_k}$). We show the importance of the genericity assumption for the identifiability of causal disentanglement in Appendix B.

We give an example of a setting in which Assumption 1 might apply. Suppose $Z$ is the internal state of a cell (e.g., the concentrations of proteins, the locations of organelles, etc.) and that each context is an exposure to a different small molecule. Assumption 1(b) posits that each small molecule has a highly selective effect, modifying only one cellular mechanism. Assumption 2 posits that each small molecule completely disrupts the modified mechanism. While one does not expect all small molecules to be highly selective, one could filter based on selectivity. In Appendix G, we describe a hypothesis test for implications of Assumption 1(b), and show empirically that this test effectively determines model membership.

The covariance of $X$ in context $k$ is rank deficient when $d < p$, since $X = GZ$. We therefore define the precision matrix of $X$ in context $k$ to be the pseudoinverse of the covariance matrix, $\Theta_k := \mathbb{E}[XX^\top]^\dagger$. Then
$$\Theta_k = H^\top B_k^\top B_k H, \tag{2}$$
by Prop. 6 in Appendix C, since $\mathrm{Cov}_k(Z) = (B_k^\top B_k)^{-1}$.

We consider an unknown latent DAG. Each candidate DAG has unknown weights on its edges, unknown variances on its nodes, unknown new weights under each intervention, and an unknown mixing map to the observed variables. That is, our goal is to decompose the precision matrices $\{\Theta_k\}_{k=0}^K$ in Equation (2) to recover $H$ and $\{B_k\}_{k=0}^K$.
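As a sanity check on Equation (2), the following NumPy sketch simulates a small instance of the model (the dimensions, graph, and weights are illustrative choices of ours, not values used in the paper) and numerically confirms that the pseudoinverse precision matrix equals $H^\top B_k^\top B_k H$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 3, 6

# Assumption 1(a): a linear latent model whose DAG has edges j -> i only for j > i,
# so A_0 is strictly upper triangular and B_0 = Omega_0^{-1/2} (I - A_0).
A0 = np.triu(rng.uniform(0.5, 1.5, size=(d, d)), k=1)
Omega0 = np.diag(rng.uniform(1.0, 2.0, size=d))
B0 = np.diag(np.diag(Omega0) ** -0.5) @ (np.eye(d) - A0)

# Assumption 2: a perfect intervention on target i_k replaces row i_k of B_0 by lambda_k * e_{i_k}.
i_k, lam = 1, 0.7
B1 = B0.copy()
B1[i_k, :] = 0.0
B1[i_k, i_k] = lam

# Assumption 1(c): an injective linear mixing G and H = G^+ (the row normalization of H
# is omitted here, since Equation (2) does not depend on it).
G = rng.normal(size=(p, d))
H = np.linalg.pinv(G)

for B in (B0, B1):
    cov_Z = np.linalg.inv(B.T @ B)            # Cov(Z) = (B^T B)^{-1}
    Theta = np.linalg.pinv(G @ cov_Z @ G.T)   # pseudoinverse of the rank-d covariance of X
    assert np.allclose(Theta, H.T @ B.T @ B @ H)   # Equation (2)
```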
We recall some graph-theoretic notions. The parents of node $i$ are $\mathrm{pa}_{\mathcal{G}}(i) = \{j \mid j \to i \text{ in } \mathcal{G}\}$, and we define $\overline{\mathrm{pa}}_{\mathcal{G}}(i) := \mathrm{pa}_{\mathcal{G}}(i) \cup \{i\}$. Similarly, $\mathrm{an}_{\mathcal{G}}(i)$ denotes the ancestors of $i$ in $\mathcal{G}$, the vertices $j$ with a directed path from $j$ to $i$. We define $\overline{\mathrm{an}}_{\mathcal{G}}(i) := \mathrm{an}_{\mathcal{G}}(i) \cup \{i\}$ and $\overline{\mathrm{an}}_{\mathcal{G}}(I) := \bigcup_{i \in I} \overline{\mathrm{an}}_{\mathcal{G}}(i)$. The source nodes of $\mathcal{G}$ are the nodes $i$ with $\mathrm{pa}_{\mathcal{G}}(i) = \emptyset$. We drop the subscript $\mathcal{G}$ when the graph is clear from context. The transitive closure of $\mathcal{G}$, denoted $\overline{\mathcal{G}}$, is the DAG with $\mathrm{pa}_{\overline{\mathcal{G}}}(i) = \mathrm{an}_{\mathcal{G}}(i)$. Given a DAG $\mathcal{G}$, define the partial order $\preceq_{\mathcal{G}}$ by $i \preceq_{\mathcal{G}} j$ if and only if $j \in \overline{\mathrm{an}}_{\mathcal{G}}(i)$. Thus, the transitive closure is the graph with an edge $j \to i$ whenever $i \prec_{\mathcal{G}} j$.

To decompose the precision matrices in Equation (2), we introduce a matrix decomposition defined relative to a partial order. Recall that the RQ decomposition writes $H \in \mathbb{R}^{d\times p}$ as $H = RQ$ for an upper triangular $R \in \mathbb{R}^{d\times d}$ and a matrix $Q \in \mathbb{R}^{d\times p}$ with orthonormal rows. We generalize the RQ decomposition. (The RQ decomposition is used here, rather than the more familiar QR decomposition, because it gives an expression for the rows of $H$, which each correspond to one latent variable.)

Definition 1 (The partial order RQ decomposition). Given a partial order $\preceq$, the partial order RQ decomposition writes $H \in \mathbb{R}^{d\times p}$ as $H = RQ$, where $R \in \mathbb{R}^{d\times d}$ satisfies $R_{ii} \geq 0$ and $R_{ij} = 0$ unless $i \preceq j$, and where $q_i$, the $i$-th row of $Q \in \mathbb{R}^{d\times p}$, has norm one and is orthogonal to $\langle q_j : i \prec j \rangle$.

We recover the usual reduced RQ decomposition (Trefethen & Bau III, 1997) when $\preceq$ is the total order $1 < \cdots < d$. We construct the partial order RQ decomposition in Appendix D. Finally, given a positive definite matrix $M \in \mathbb{R}^{d\times d}$, its Cholesky factor $U \in \mathbb{R}^{d\times d}$, denoted $\mathrm{CHOLESKY}(M)$, is the unique upper triangular matrix with positive diagonal such that $M = U^\top U$.
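A sketch of how the partial order RQ decomposition can be computed, in the spirit of Algorithm 4 in Appendix D (the `anc` interface and the descending loop are our choices; we use the node labeling of Assumption 1(a), under which ancestors have larger indices than their descendants).

```python
import numpy as np

def partial_order_rq(H, anc):
    """Sketch of the partial order RQ decomposition H = R Q (Definition 1).

    H   : (d, p) full-row-rank matrix, rows indexed 0..d-1.
    anc : dict mapping each node i to the set of its strict ancestors in the latent DAG,
          so i precedes j in the partial order exactly when j is in anc[i] (or j == i).
          With the node ordering of Assumption 1(a), every j in anc[i] satisfies j > i.
    Returns R with R[i, j] = 0 unless j == i or j in anc[i], R[i, i] >= 0, and Q with
    unit-norm rows such that q_i is orthogonal to q_j for every j in anc[i].
    """
    d, p = H.shape
    R, Q = np.zeros((d, d)), np.zeros((d, p))
    # Process ancestors before descendants: with Assumption 1(a)'s labeling,
    # ancestors have larger indices, so loop from d-1 down to 0.
    for i in range(d - 1, -1, -1):
        A = sorted(anc[i])
        h = H[i]
        if A:
            # Coefficients of h_i on the already-built ancestor rows of Q.
            coeffs, *_ = np.linalg.lstsq(Q[A].T, h, rcond=None)
            R[i, A] = coeffs
            resid = h - Q[A].T @ coeffs
        else:
            resid = h.copy()
        R[i, i] = np.linalg.norm(resid)
        Q[i] = resid / R[i, i]
    return R, Q

# Tiny usage check: two latent nodes with edge 2 -> 1 in the paper's 1-based labels,
# i.e. node 0 has ancestor set {1} in 0-based labels.
H = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]])
R, Q = partial_order_rq(H, anc={0: {1}, 1: set()})
assert np.allclose(R @ Q, H)
```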
3. Identifiability of Causal Disentanglement

We establish the sufficiency and worst-case necessity of one intervention per latent node for identifiability of our causal disentanglement problem. The following result describes recovery of $\mathcal{G}$. Later, we discuss identifiability of the parameters in our setup. Since the labeling of latent nodes is unimportant, $\mathcal{G}$ is recovered if it is found up to relabeling.

Theorem 1. Assume the setup in Assumption 1 with $d$ latent variables. Then $d$ interventions are sufficient and, in the worst case, necessary to recover the transitive closure $\overline{\mathcal{G}}$ from $\{\Theta_k\}_{k \in \{0\} \cup [K]}$. If Assumption 2 also holds, then $d$ interventions are sufficient and, in the worst case, necessary to recover $\mathcal{G}$ from $\{\Theta_k\}_{k \in \{0\} \cup [K]}$.

3.1. Preliminaries

We note the following basic fact, where $v^{\otimes 2} := vv^\top$:

Fact 1. Let $B \in \mathbb{R}^{d\times d}$. Then $B^\top B = \sum_{i=1}^d (B^\top e_i)^{\otimes 2}$.

We give a proof in Appendix F. This fact gives the key identity that drives our identifiability results.

Proposition 1. Consider the setup in Assumption 1. Then, for any $k \in [K]$,
$$\Theta_k - \Theta_0 = (H^\top B_k^\top e_{i_k})^{\otimes 2} - (H^\top B_0^\top e_{i_k})^{\otimes 2}.$$
In particular, the rank of the difference $\Theta_k - \Theta_0$ is 1 if $i_k$ is a source node in $\mathcal{G}$, and 2 otherwise.

Proof. By Assumption 1, $B_k^\top e_i = B_0^\top e_i$ for all $i \neq i_k$. Using Fact 1, we have
$$B_k^\top B_k - B_0^\top B_0 = (B_k^\top e_{i_k})^{\otimes 2} - (B_0^\top e_{i_k})^{\otimes 2}.$$
Recall from Equation (2) that $\Theta_k = H^\top B_k^\top B_k H$. The result follows from left-multiplying both sides of the displayed equation by $H^\top$ and right-multiplying by $H$. This shows that $\mathrm{rank}(\Theta_k - \Theta_0) \leq 2$. For a source node, both vectors $B_k^\top e_{i_k}$ and $B_0^\top e_{i_k}$ have just one nonzero entry, and $\mathrm{rank}(\Theta_k - \Theta_0) = 1$. Otherwise, the vectors have more than one nonzero entry and, by the genericity condition in Assumption 1(b), the difference $\Theta_k - \Theta_0$ has rank two.

We can reduce a more general causal disentanglement problem to our setting, as we explain in Appendix F. First, we can count the latent dimension, since $\mathrm{rank}(\Theta_k) = d$ for any $k$. Second, we can identify which environments correspond to interventions on the same intervention target; see Prop. 9. Finally, we can identify which environment is observational using rank constraints; see Prop. 10. Thus, we assume without loss of generality that $d$ is known, that the observational environment is known, and that each node is intervened on in only one context.

3.2. Sufficiency

We define $S(\mathcal{G})$ to be the set of permutations of $d$ letters such that $\sigma(j) > \sigma(i)$ for all edges $j \to i$. For example, if $\mathcal{G}$ is a complete graph, then $S(\mathcal{G})$ contains only the identity. If $\mathcal{G}$ has no edges, then $S(\mathcal{G})$ is the group of all permutations of $d$ letters. The permutation matrix corresponding to a permutation $\sigma$ is $P_\sigma \in \mathbb{R}^{d\times d}$ with $(P_\sigma)_{ij} = 1_{\{i = \sigma(j)\}}$. Our main sufficiency result is the following.

Theorem 2. Assume the setup in Assumptions 1 and 2 with one intervention on each latent node. Then the graph $\mathcal{G}$, the intervention targets $i_k$, and the parameters are identifiable up to $S(\mathcal{G})$: given a solution $(B_0, \ldots, B_K, H)$, the set of solutions is $\{(P_\sigma B_0 P_\sigma^\top, \ldots, P_\sigma B_K P_\sigma^\top, P_\sigma H) : \sigma \in S(\mathcal{G})\}$.

Theorem 2 says that solutions to the causal disentanglement problem are unique up to permutations of the latent nodes that preserve the property that $j \to i$ implies $j > i$. First, we verify that each permutation in $S(\mathcal{G})$ gives a solution.

Proposition 2. Assume the setup in Assumption 1. Given a solution $(B_0, \ldots, B_K, H)$ to Equation (2) for $k \in \{0\} \cup [K]$, the matrices $(P_\sigma B_0 P_\sigma^\top, \ldots, P_\sigma B_K P_\sigma^\top, P_\sigma H)$ are a valid solution whenever $\sigma \in S(\mathcal{G})$.

Proof. Let $\{B_k\}_{k=0}^K$ and $H$ satisfy Assumption 1 and Equation (2). Define $B_k^{(\sigma)} = P_\sigma B_k P_\sigma^\top$ and $H^{(\sigma)} = P_\sigma H$ for $\sigma \in S(\mathcal{G})$. Then $\Theta_k = (H^{(\sigma)})^\top (B_k^{(\sigma)})^\top B_k^{(\sigma)} H^{(\sigma)}$. The matrices $B_k^{(\sigma)}$ are upper triangular, as follows. For all $i, j \in [d]$, we have $(B_k^{(\sigma)})_{\sigma(i),\sigma(j)} = (B_k)_{ij}$. Hence $B_k^{(\sigma)}$ is upper triangular when $(B_k)_{ij} = 0$ for all $i, j \in [d]$ with $\sigma(i) > \sigma(j)$. This holds since $\sigma \in S(\mathcal{G})$. Moreover, these matrices also satisfy Assumption 1(b), with intervention target $\sigma(i_k)$ in context $k$. Finally, $H^{(\sigma)}$ satisfies Assumption 1(c), since we just permute the rows of $H$.

We give a constructive proof of Theorem 2 via an algorithm to recover $H$ and $\{B_k\}_{k=0}^K$ from $\{\Theta_k\}_{k=0}^K$. (We only use the second moment of $X$: we do not use the first moment, since we assume $\mathbb{E}[\varepsilon] = 0$, and we do not use higher moments, since in the worst case, Gaussian noise, they contain no additional information.) The computational complexity of the algorithm is given in Appendix H. The bulk of the algorithm is devoted to recovering $H$. First, we recover the partial order $\preceq_{\mathcal{G}}$ (i.e., the DAG $\overline{\mathcal{G}}$), together with the matrix $Q$ from a partial order RQ decomposition of $H$, up to signs and permutations of rows, in Algorithm 2. The subroutine Algorithm 1 recovers the ancestors of a node and its corresponding row of $Q$. Then we recover $R$ in Algorithm 3. Having recovered $H$, the matrices $\{B_k\}_{k=0}^K$ are found via the Cholesky decomposition.

Algorithm 1 ID-ANCESTORS
1: Input: $\Theta_k$, $\Theta_0$, $\{\hat q_i\}_{i \in I}$
2: Output: Vector $\hat q_k$, ancestor set $A$
3: Let $A = I$
4: for $i \in I$ do
5: Let $W_i = \langle \hat q_j : j \in I \setminus \{i\} \rangle$
6: Let $V_i = \mathrm{proj}_{W_i^\perp} \mathrm{rowspan}(\Theta_k - \Theta_0)$
7: If $\dim(V_i) = 1$, let $A = A \setminus \{i\}$
8: end for
9: Let $W = \langle \hat q_a : a \in A \rangle$
10: Let $V = \mathrm{proj}_{W^\perp} \mathrm{rowspan}(\Theta_k - \Theta_0)$
11: Take $\hat q_k$ with first nonzero entry positive and $\|\hat q_k\|_2 = 1$, such that $\langle \hat q_k \rangle = V$
12: return $\hat q_k$, $A$

Algorithm 2 ID-PARTIALORDER
1: Input: Precision matrices $(\Theta_0, \Theta_1, \ldots, \Theta_K)$, rank $d$
2: Output: Factor $\hat Q$, partial order $\preceq$
3: Let $I_0 = \{\}$ and let $\hat Q_0$ be the empty matrix
4: for $t = 1, \ldots, K$ do
5: Let $W_t = \langle \hat q_i : i \in I_{t-1} \rangle$
6: Let $V_k = \mathrm{proj}_{W_t^\perp} \mathrm{rowspan}(\Theta_k - \Theta_0)$ for $k \notin I_{t-1}$
7: Pick $k$ such that $\dim(V_k) = 1$
8: Let $\hat q_k, A = \text{ID-ANCESTORS}(\Theta_k, \Theta_0, \{\hat q_i\}_{i \in I_{t-1}})$
9: Add $k \prec a'$ for every $a'$ with $a' \succeq a$ for some $a \in A$
10: Let $I_t = I_{t-1} \cup \{k\}$, $\hat Q_t = [\hat q_k; \hat Q_{t-1}]$
11: end for
12: return $\hat Q = \hat Q_K$, $\preceq$

We show that Algorithm 2 returns $Q$ from the partial order RQ decomposition of $H$, up to a permutation $\sigma \in S(\mathcal{G})$ and a matrix in $\mathrm{Sig}_d$, the set of $d \times d$ diagonal matrices with diagonal entries $\pm 1$.

Proposition 3. Assume the setup in Assumption 1, and that every latent node is intervened on; i.e., $\{i_k\}_{k=1}^K = [d]$. Let $(\hat Q, \preceq)$ be the output of Algorithm 2. Then $\preceq$ is the partial order $\preceq_{\mathcal{G}}$. Moreover, let $H = RQ$ be the partial order RQ decomposition of $H$ for the partial order $\preceq_{\mathcal{G}}$. Then $\hat Q = S P_\sigma Q$ for some $\sigma \in S(\mathcal{G})$ and $S \in \mathrm{Sig}_d$.
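A population-level NumPy sketch of Algorithms 1 and 2 (the function names, data structures, and tolerance are ours; the finite-sample versions in Appendix L replace exact rank checks with singular-value scores).

```python
import numpy as np

TOL = 1e-9

def row_space(M):
    """Orthonormal basis (as rows) of the row space of M."""
    _, s, Vt = np.linalg.svd(M)
    return Vt[s > TOL]

def proj_off(V, W):
    """Project the rows of V onto the orthogonal complement of the row span of W."""
    if W.shape[0] == 0:
        return V
    P = W.T @ np.linalg.pinv(W.T)   # orthogonal projector onto the row span of W
    return V - V @ P

def space_dim(V):
    """Dimension of the span of the rows of V."""
    return 0 if V.size == 0 else int((np.linalg.svd(V, compute_uv=False) > TOL).sum())

def stack(rows, p):
    return np.vstack(rows) if rows else np.zeros((0, p))

def id_ancestors(theta_k, theta_0, q_hat, I):
    """Sketch of ID-ANCESTORS (Algorithm 1). q_hat maps i -> recovered row q_i, for i in I."""
    p = theta_0.shape[0]
    V_full = row_space(theta_k - theta_0)
    A = set(I)
    for i in I:
        W_i = stack([q_hat[j] for j in I if j != i], p)
        if space_dim(proj_off(V_full, W_i)) == 1:   # q_i is not needed to explain the difference
            A.discard(i)
    W = stack([q_hat[a] for a in sorted(A)], p)
    q = row_space(proj_off(V_full, W))[0]           # the remaining one-dimensional component
    q = q * np.sign(q[np.argmax(np.abs(q) > TOL)])  # make the first nonzero entry positive
    return q, A

def id_partial_order(thetas):
    """Sketch of ID-PARTIALORDER (Algorithm 2). thetas = [Theta_0, Theta_1, ..., Theta_K]."""
    p, K = thetas[0].shape[0], len(thetas) - 1
    q_hat, order, done = {}, set(), []              # order stores pairs (k, a) meaning "k below a"
    for _ in range(K):
        W = stack([q_hat[i] for i in done], p)
        for k in range(1, K + 1):
            if k in done:
                continue
            V_k = proj_off(row_space(thetas[k] - thetas[0]), W)
            if space_dim(V_k) == 1:                 # all parents of this target already processed
                q, A = id_ancestors(thetas[k], thetas[0], q_hat, done)
                order |= {(k, a) for a in A} | {(k, b) for a in A for (aa, b) in order if aa == a}
                q_hat[k] = q
                done.append(k)
                break
    Q_hat = np.vstack([q_hat[k] for k in reversed(done)])  # last-processed row on top
    return Q_hat, order
```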
The following lemma relates the partial order of $\mathcal{G}$ to the linear spaces $\mathrm{rowspan}(\Theta_k - \Theta_0)$.

Lemma 1. Assume the setup in Assumption 1. Let $H = RQ$ be a partial order RQ decomposition of $H$. Let $i_k$ be the intervention target of context $k$ and let $I \subseteq [d]$. Then
(a) $\mathrm{rowspan}(\Theta_k - \Theta_0) \subseteq \langle h_i : i \in I \rangle$ if and only if $\overline{\mathrm{pa}}(i_k) \subseteq I$;
(b) $\mathrm{rowspan}(\Theta_k - \Theta_0) \subseteq \langle q_i : i \in \overline{\mathrm{an}}(i_k) \rangle$; and
(c) if $\mathrm{rowspan}(\Theta_k - \Theta_0) \subseteq \langle q_i : i \in I \rangle$, then $\mathrm{pa}(i_k) \subseteq I$.

Proof. Let $h_i$ denote the $i$-th row of $H$, and let $\alpha_{j, i_k} := \sum_{i \in \overline{\mathrm{pa}}(i_k)} (B_j)_{i_k, i} h_i$. By Prop. 1, we have
$$\mathrm{rowspan}(\Theta_k - \Theta_0) = \langle \alpha_{k, i_k}, \alpha_{0, i_k} \rangle. \tag{3}$$
Equation (3) shows that $\mathrm{rowspan}(\Theta_k - \Theta_0) \subseteq \langle h_i : i \in \overline{\mathrm{pa}}(i_k) \rangle$. This linear space is contained in $\langle h_i : i \in I \rangle$ whenever $\overline{\mathrm{pa}}(i_k) \subseteq I$. Conversely, assume there exists $j \in \overline{\mathrm{pa}}(i_k)$ with $j \notin I$. Then containment of $\mathrm{rowspan}(\Theta_k - \Theta_0)$ in $\langle h_i : i \in I \rangle$ cannot hold: containment would imply $(B_0)_{i_k, j} h_j \in \langle h_i : i \in [d] \setminus \{j\} \rangle$, a contradiction, since $H$ is full row rank and $(B_0)_{i_k, j} \neq 0$. Hence (a) holds.

By definition of the partial order RQ decomposition, we have $h_i \in \langle q_j \mid j \in \overline{\mathrm{an}}(i) \rangle$. Thus, by transitivity of the ancestorship relation, (b) holds.

Conversely, assume there exists $j \in \mathrm{pa}(i_k)$ with $j \notin I$. Then $\mathrm{rowspan}(\Theta_k - \Theta_0) \subseteq \langle q_i : i \in I \rangle$ implies that
$$h_j \in \langle q_i : i \in I \rangle + \langle h_i : i \in \overline{\mathrm{pa}}(i_k) \setminus \{j\} \rangle,$$
since $(B_0)_{i_k, j} \neq 0$. We partition $I$ into the descendants and non-descendants of $j$: let $I_d := \{i \in I : j \in \mathrm{an}(i)\}$ and let $I_{nd} := \{i \in I : j \notin \mathrm{an}(i)\}$. By definition of the partial order RQ decomposition, we have $q_i \perp h_j$ whenever $j \in \mathrm{an}(i)$. Thus
$$h_j \in \langle q_i : i \in I_{nd} \rangle + \langle h_i : i \in \overline{\mathrm{pa}}(i_k) \setminus \{j\} \rangle.$$
Inverting the partial order RQ decomposition gives $q_i \in \langle h_{i'} : i' \in \overline{\mathrm{an}}(i) \rangle$. Hence
$$h_j \in \langle h_i : i \in \overline{\mathrm{an}}(I_{nd}) \cup (\overline{\mathrm{pa}}(i_k) \setminus \{j\}) \rangle,$$
a contradiction, since $j \notin \overline{\mathrm{an}}(I_{nd})$ and $H$ is full rank. Hence (c) holds.

Proof of Prop. 3. Assume that $\hat Q_{t-1}$ consists of the last $t-1$ rows of $S P_\sigma Q$, for some $\sigma \in S(\mathcal{G})$ and $S \in \mathrm{Sig}_d$. Then $W_t = \langle \hat q_i : i \in I_{t-1} \rangle$. At step $t$, we pick $k$ such that $V_k = \mathrm{proj}_{W_t^\perp} \mathrm{rowspan}(\Theta_k - \Theta_0)$ has dimension one. Lemma 1 implies that such $k$ are those with $\mathrm{pa}(i_k) \subseteq I_{t-1}$. Algorithm 1 returns a set $A$ with $\mathrm{pa}(i_k) \subseteq A \subseteq \mathrm{an}_{\mathcal{G}}(i_k)$. Thus, Line 9 of Algorithm 2 adds the relation $k \prec a$ if and only if $a \in \mathrm{an}_{\mathcal{G}}(i_k)$. Hence $\preceq$ is the partial order $\preceq_{\mathcal{G}}$. Line 11 of Algorithm 1 picks $\hat q_k$ orthogonal to $\{q_i : i \succ_{\mathcal{G}} i_k\}$, such that $\mathrm{rowspan}(\hat Q_t)$ contains $\mathrm{rowspan}(\Theta_k - \Theta_0)$. By Lemma 1 and Definition 1, this is $q_{i_k}$, up to sign. Thus, $\hat Q_t$ is equal to the last $t$ rows of $S' P_\sigma Q$, for some $\sigma \in S(\mathcal{G})$ and $S' \in \mathrm{Sig}_d$. Repeating for $t = 1, \ldots, K$ gives the result.

Algorithm 3 ITERATIVEDIFFERENCEPROJECTION
1: Input: Precision matrices $(\Theta_0, \Theta_1, \ldots, \Theta_K)$
2: Output: $\hat H$, $(\hat B_0, \hat B_1, \ldots, \hat B_K)$
3: Let $d = \mathrm{rank}(\Theta_0)$
4: Let $\hat Q, \preceq$ = ID-PARTIALORDER($(\Theta_0, \Theta_1, \ldots, \Theta_K)$, $d$)
5: Let $\hat C_k = \mathrm{CHOLESKY}((\hat Q^\dagger)^\top \Theta_k \hat Q^\dagger)$ for $k = 0, \ldots, K$
6: Let $\hat R = I_d$
7: for $k = 1, \ldots, K$ do
8: Let $\hat D_k = \hat C_k - \hat C_0$
9: Let $\hat i_k$ be the index of the only nonzero row of $\hat D_k$
10: Let $\hat R_{\hat i_k} = (\hat D_k)_{\hat i_k} + (\hat C_0)_{\hat i_k}$
11: end for
12: Let $\hat H = \hat R \hat Q$
13: Let $\hat H \leftarrow \hat\Lambda \hat H$, for $\hat\Lambda$ diagonal such that $\hat H$ satisfies the conditions on $H$ in Assumption 1(c)
14: Let $\hat B_0 = \mathrm{CHOLESKY}((\hat H^\dagger)^\top \Theta_0 \hat H^\dagger)$
15: Let $\hat B_k = \hat B_0 + e_{\hat i_k}\big(|\hat\Lambda_{\hat i_k, \hat i_k}|\, e_{\hat i_k} - \hat B_0^\top e_{\hat i_k}\big)^\top$ for $k = 1, \ldots, K$
16: return $\hat H$, $(\hat B_0, \hat B_1, \ldots, \hat B_K)$
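A population-level NumPy sketch of the remaining steps of Algorithm 3, given the factor $\hat Q$ produced by ID-PARTIALORDER (for instance by the sketch following Proposition 3). Names are ours, and the rescaling of Line 13 is implemented in the simplified form of making each row's largest-magnitude entry equal to one.

```python
import numpy as np

def chol_upper(M):
    """Upper-triangular U with positive diagonal such that M = U^T U."""
    M = (M + M.T) / 2.0                    # symmetrize against numerical asymmetry
    return np.linalg.cholesky(M).T

def iterative_difference_projection(thetas, Q_hat):
    """Sketch of Lines 5-16 of Algorithm 3.

    thetas : list [Theta_0, ..., Theta_K] of pseudoinverse precision matrices (p x p).
    Q_hat  : (d, p) matrix whose rows are the recovered q-hat vectors.
    Returns H_hat (d, p), the list [B_0, ..., B_K] (d x d), and the estimated targets.
    """
    K, d = len(thetas) - 1, Q_hat.shape[0]
    Q_pinv = np.linalg.pinv(Q_hat)
    C = [chol_upper(Q_pinv.T @ th @ Q_pinv) for th in thetas]      # Line 5

    R_hat, targets = np.eye(d), []
    for k in range(1, K + 1):                                      # Lines 7-11
        D = C[k] - C[0]
        i_k = int(np.argmax(np.linalg.norm(D, axis=1)))            # the only nonzero row, picked by norm
        targets.append(i_k)
        R_hat[i_k] = D[i_k] + C[0][i_k]

    H_hat = R_hat @ Q_hat                                          # Line 12
    # Line 13: rescale each row so that its largest-absolute-value entry equals 1.
    scales = np.array([1.0 / row[np.argmax(np.abs(row))] for row in H_hat])
    H_hat = np.diag(scales) @ H_hat

    H_pinv = np.linalg.pinv(H_hat)
    B = [chol_upper(H_pinv.T @ thetas[0] @ H_pinv)]                # Line 14
    for i_k in targets:                                            # Line 15
        Bk = B[0].copy()
        Bk[i_k] = 0.0
        Bk[i_k, i_k] = abs(scales[i_k])
        B.append(Bk)
    return H_hat, B, targets
```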
We prove Theorem 2 by proving the following result, which is its constructive analogue.

Theorem 3. Assume the setup in Assumptions 1 and 2, and that every latent node is intervened on; i.e., $\{i_k\}_{k=1}^K = [d]$. Let $\hat H$ and $\{\hat B_k\}_{k=0}^K$ be the output of Algorithm 3. Then $\hat H = P_\sigma H$ and $\hat B_k = P_\sigma B_k P_\sigma^\top$ for all $k$, for some $\sigma \in S(\mathcal{G})$.

Proof. Let $H = RQ$. Then $\Theta_k = Q^\top R^\top B_k^\top B_k R Q$, by Equation (2), and $\hat Q = S P_\sigma Q$ for some $\sigma \in S(\mathcal{G})$ and $S \in \mathrm{Sig}_d$, by Prop. 3. Hence $(\hat Q^\dagger)^\top \Theta_k \hat Q^\dagger$ equals
$$S P_\sigma R^\top P_\sigma^\top \, P_\sigma B_k^\top P_\sigma^\top \, P_\sigma B_k P_\sigma^\top \, P_\sigma R P_\sigma^\top S.$$
Let $C_k^{(\sigma)} = S (P_\sigma B_k P_\sigma^\top)(P_\sigma R P_\sigma^\top) S$. The matrix $C_k^{(\sigma)}$ is upper triangular, since it is a product of four upper triangular matrices, by the definition of $S(\mathcal{G})$ and of the partial order RQ decomposition, where we use that $i \prec_{\mathcal{G}} j$ implies $i < j$. Moreover, $C_k^{(\sigma)}$ has positive diagonal, since the matrices $R$ and $B_k$ have positive diagonal, by Definition 1 and Assumption 1(a), respectively. Hence $\hat C_k := C_k^{(\sigma)}$ is the Cholesky factor of $(\hat Q^\dagger)^\top \Theta_k \hat Q^\dagger$. The differences $\hat D_k := \hat C_k - \hat C_0$ equal
$$S P_\sigma (B_k - B_0) R P_\sigma^\top S = S P_\sigma e_{i_k} c_k^\top R P_\sigma^\top S,$$
by Assumption 1(b). The intervention target $\sigma(i_k)$ indexes the only nonzero row of $\hat D_k$; i.e., $\hat i_k = \sigma(i_k)$. Observe that, by Assumption 2,
$$(\hat D_k)_{\hat i_k} + (\hat C_0)_{\hat i_k} = S_{\sigma(i_k),\sigma(i_k)}\,\lambda_k\,\big(P_\sigma R P_\sigma^\top S\big)_{\sigma(i_k)}.$$
Thus, after the loop we have recovered $\hat R = \hat\Lambda P_\sigma R P_\sigma^\top S$, for $\hat\Lambda$ diagonal with $|\hat\Lambda_{\hat i_k, \hat i_k}| = \lambda_k$. This gives $\hat H = \hat R \hat Q = \hat\Lambda P_\sigma R P_\sigma^\top S \, S P_\sigma Q = \hat\Lambda P_\sigma H$. The rescaling in Line 13 recovers $\hat H = P_\sigma H$ and $\hat\Lambda$. We have $(\hat H^\dagger)^\top \Theta_0 \hat H^\dagger = P_\sigma B_0^\top P_\sigma^\top \, P_\sigma B_0 P_\sigma^\top$, where $P_\sigma B_0 P_\sigma^\top$ is upper triangular, and thus we recover $\hat B_0 = P_\sigma B_0 P_\sigma^\top$ from the Cholesky decomposition. Finally, since $|\hat\Lambda_{\hat i_k, \hat i_k}| = \lambda_k$, Line 15 gives $\hat B_k = P_\sigma B_k P_\sigma^\top$.

Theorem 3 requires Assumption 2; see Appendix J. In Appendix K, we compare our identifiability condition to that of Liu et al. (2022). We show that Liu et al. (2022) require that the latent graph has fewer than $d$ edges. In contrast, our condition imposes no constraints on the latent graph.

3.3. Worst-case Necessity

We show that one intervention per latent node is necessary for identifiability of our setup, in the worst case. It is clear that observational data alone does not suffice for identifiability: from just observational data, we always have a solution with independent latent variables, as follows. Let $\hat H = \Lambda B_0 H$ and $\hat B_0 = \Lambda^{-1}$, for $\Lambda$ a diagonal matrix with positive entries such that Assumption 1(c) holds. Then $\hat B_0 \hat H = B_0 H$; i.e., $\hat B_0, \hat H$ solve the causal disentanglement problem. The new solution has independent latent nodes, since $\hat B_0$ is diagonal.

The next result, which follows from prior work in causal structure learning, says that $d - 1$ interventions are required in the worst case for a fully observed model. This is the special case of our setup where $H$ is a permutation matrix.

Proposition 4. Assume the setup in Assumptions 1 and 2, with $H$ a permutation matrix. Let $K < d - 1$ with all intervention targets distinct. Then, in the worst case over intervention targets $\{i_k\}_{k=1}^K$, $B_0$ and $H$ are not identifiable.

Proof. In the linear Gaussian setting with unknown-target interventions, the structure of a DAG is only identifiable up to its interventional Markov equivalence class (MEC); see, e.g., Proposition 3.3(ii) of Castelletti & Peluso (2022). For single-node interventions, $d - 1$ interventions are in the worst case necessary to ensure that the interventional MEC has size one, by Theorem 3.7 of Eberhardt et al. (2005).

We show that $d$ interventions are necessary, in the worst case, when $H$ is a general matrix. The proof is in Appendix I.

Proposition 5. Assume the setup in Assumptions 1 and 2, with $K < d$. Then there exist $H$ and $\{B_k\}_{k=0}^K$ and a set of $K$ distinct intervention targets such that (i) $H$ and $B_k$ are not identifiable up to $S(\mathcal{G})$ and (ii) $\overline{\mathcal{G}}$ is not identifiable.

Example 1. Proposition 5 generalizes our motivating example from Section 1. Fix $H \in \mathbb{R}^{2\times 2}$ with entries $H_{ij}$, and fix upper triangular $B_0, B_1 \in \mathbb{R}^{2\times 2}$ with entries $(B_0)_{ij}$ and $(B_1)_{ij}$, respectively. Assume $i_1 = 2$; i.e., $(B_0)_{11} = (B_1)_{11}$ and $(B_0)_{12} = (B_1)_{12}$.
Let
$$\hat B_0 = \begin{pmatrix} 1 & 0 \\ 0 & (B_0)_{22} \end{pmatrix}, \qquad \hat B_1 = \begin{pmatrix} 1 & 0 \\ 0 & (B_1)_{22} \end{pmatrix}, \qquad \hat H = \begin{pmatrix} (B_0)_{11}H_{11} + (B_0)_{12}H_{21} & (B_0)_{11}H_{12} + (B_0)_{12}H_{22} \\ H_{21} & H_{22} \end{pmatrix}.$$
Then for $k \in \{0, 1\}$, we have $\hat B_k \hat H = B_k H$, both equal to
$$\begin{pmatrix} (B_0)_{11}H_{11} + (B_0)_{12}H_{21} & (B_0)_{11}H_{12} + (B_0)_{12}H_{22} \\ (B_k)_{22}H_{21} & (B_k)_{22}H_{22} \end{pmatrix}.$$

Remark 2. Prop. 5 pertains to the worst case over intervention targets. It is natural to ask if there exists a better choice of $K$ intervention targets, for $K < d$, such that $H$ and $\{B_k\}_{k=0}^K$ are identifiable. For example, when $d = 2$, consider $\mathcal{G} = \{2 \to 1\}$, with an intervention on $Z_1$; i.e., $i_1 = 1$. Then $\mathrm{rowspan}(\Theta_1 - \Theta_0) \subseteq \langle h_i : i \in I \rangle$ if and only if $I = \{1, 2\}$, by Lemma 1(a). Thus, $\Theta_1 - \Theta_0$ has rank 2, and we can detect that $i_1 = 1$ and $\mathcal{G} = \{2 \to 1\}$; otherwise, we would have $\mathrm{rank}(\Theta_1 - \Theta_0) = 1$. While $i_1$ and $\mathcal{G}$ are identifiable, preliminary computational evidence suggests that the entries of $B_0$, $B_1$, and $H$ are not identifiable.

Proof of Theorem 1. The necessity of $d$ interventions is Prop. 5. Under Assumptions 1 and 2, the sufficiency of $d$ interventions follows from Theorem 2. Under Assumption 1, the sufficiency is the recovery of $\overline{\mathcal{G}}$ in Prop. 3.

4. Experimental Results

We adapt our proof of Theorem 3 into a method for causal disentanglement in the finite-sample setting. We modify our methods to (1) use matrices instead of vector spaces, (2) use scores based on singular values to test rank, and (3) choose a nonzero row based on norms. The adapted algorithms are in Appendix L. In this section, we investigate the performance of the method in a simulation study. There is a single hyperparameter $\gamma \in [0, 1]$, the percentage of spectral energy associated to the largest singular value, above which we consider a matrix to have rank one. We use $\gamma = 0.99$.

Figure 2: The adapted version of Algorithm 3 is consistent for recovering $H$, $B_0$, and $\{i_k\}_{k=1}^K$ from noisy data. At each sample size, we generate 500 random models. Note the logarithmic scale on the x-axis. In (a), we plot the median of $\|\hat H - H\|$, the error in Frobenius norm; in (b), the median of $\|\hat B_0 - B_0\|$; in (c), the fraction of models in which all intervention targets were correctly estimated.

4.1. Synthetic Data Generation

We generate 500 random models following Assumption 1 for $d = 5$ latent and $p = 10$ observed variables, as follows. We sample the graph $\mathcal{G}$ from an Erdős–Rényi random graph model with density 0.75. We sample the nonzero entries of $A_0$ independently from $\mathrm{Unif}(\pm[0.25, 1])$, and the nonzero entries of $\Omega_0$ independently from $\mathrm{Unif}([2, 4])$. We sample uniformly among permutations to generate the intervention targets $i_k$. In context $k$, we have $A_k = A_0 - e_{i_k} e_{i_k}^\top A_0$; i.e., all entries in row $i_k$ are 0. We change $(\Omega_0)_{i_k, i_k}$ into a new value $(\Omega_k)_{i_k, i_k}$, sampled from $\mathrm{Unif}([6, 8])$ to ensure a non-negligible change. Finally, the entries of $H$ are sampled independently from $\mathrm{Unif}([-2, 2])$. We consider sample sizes $n$ from 2500 to 250,000 and use as input the sample precision matrices. All code for data generation and for our adapted versions of Algorithms 1, 2, and 3 (that is, Algorithms 6, 5, and 7) can be found at the link in Appendix M.
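A minimal sketch of the singular-value criterion behind the hyperparameter $\gamma$ introduced at the start of this section (the function name and interface are ours; the full finite-sample algorithms are in Appendix L).

```python
import numpy as np

def has_rank_one(M, gamma=0.99):
    """Declare M to have rank one when the largest singular value accounts for at
    least a fraction gamma of the spectral energy (the sum of squared singular values)."""
    s = np.linalg.svd(M, compute_uv=False)
    return s[0] ** 2 >= gamma * np.sum(s ** 2)
```

In the adapted version of Algorithm 2, a test of this form replaces the exact check $\dim(V_k) = 1$ on projected differences of sample precision matrices.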
4.2. Synthetic Data Results

We show the results of applying our method in the finite-sample setting in Fig. 2. We measure the error in estimating the parameters $H$ and $B_0$ and the intervention targets $\{i_k\}_{k=1}^K$. Since the problem is only identifiable up to the partial order $\preceq_{\mathcal{G}}$, we align our estimated $\hat H$, $\hat B_0$, and $\{\hat i_k\}_{k=1}^K$ with $H$, $B_0$, and $\{i_k\}_{k=1}^K$ by picking $\sigma \in S(\mathcal{G})$ to maximize $\sum_{k=1}^K 1_{\{i_k = \sigma(\hat i_k)\}}$. For small $d$, this optimization can be solved by enumerating over $S(\mathcal{G})$. For large $d$, we use the integer linear program in Appendix M.1, implemented in gurobipy. As expected, the estimation errors for $H$ and $B_0$ decrease as the sample size increases, while the fraction of models with all intervention targets correctly estimated increases.

4.3. Biological Data Results

We evaluate our method on a dataset from Ursu et al. (2022). This single-cell RNA sequencing (scRNA-seq) dataset consists of 90,000 cells from a lung cancer cell line, with 83 different nonsynonymous mutations of the KRAS oncogene overexpressed.

Semi-synthetic analysis. The authors divide the mutations into five categories based on the genes that they affect, and compute a score for the impact of each mutation. Taking the two highest-impact mutations from each category gives $K = 10$ contexts. The wild-type KRAS gene is taken as the observational context. We select $p = 100$ observational features to be the most variable genes from the proliferation regulation category in the Gene Ontology (Ashburner et al., 2000), which are significant modulators of cancer activity such as oncogenes and tumor suppressor genes. We compute the sample precision matrices $\hat\Theta_0, \hat\Theta_1, \ldots, \hat\Theta_{10}$ and use them as input to Algorithm 7 with $\gamma = 0.99$. Given estimates $\{\hat B_k\}_{k=1}^K$, $\hat H$ from our algorithm, we let $B_k = M_k \odot \hat B_k$, where $(M_k)_{ij} = 1_{\{|(\hat B_k)_{ij}| > 0.04\}}$; i.e., we set to zero the entries that are not bounded away from zero. We treat the resulting parameters as a new generative model. This tests our method in a more realistic setting, with parameters based on real data, while retaining the ability to measure performance. Since the entries of the matrices $B_k$ are smaller than for our synthetic data, we consider larger sample sizes $n \in \{10^6, 5\times 10^6, 10^7, 5\times 10^7, 10^8\}$. As seen in Fig. 6 in Appendix M.2, we successfully recover the generative model.

Hypothetical workflow. We illustrate a hypothetical workflow of our method on biological data. If we run our algorithm for all mutations ($K = 83$) on the $p = 83$ most variable genes, we obtain the graph in Fig. 3. We see that G12R, G12V, G13R, G13V, and G12I all perturb highly connected latent nodes with several descendants. The G12 and G13 positions in the KRAS protein are key functional residues whose mutations are known to be causal drivers of cancer (Huang et al., 2021). This indicates that the learned graph can highlight influential biological pathways, which may be useful for prioritizing therapeutic development.

Figure 3: The latent graph and intervention targets learned from scRNA-seq data. Edges with weight of magnitude above 0.2 are shown. Boxes represent context indicators, corresponding to KRAS mutations, with edges to their respective targets. Red nodes are significantly associated with survival outcomes in the TCGA dataset.
The matrix $\hat H$ from our algorithm gives a mapping from genes to latent variables that can be transferred across datasets and related to other observations. For example, we compute estimates of the latent variables for the 589 lung cancer patients in The Cancer Genome Atlas (Liu et al., 2018) and relate these variables to the patients' survival outcomes. We find that 5 latent variables, those targeted by the M170L, I163S, A55G, G12R, and Q25H mutations, are significantly associated with survival outcomes (at significance level 0.1, after multiple testing correction). See Appendix M.2 for details. Note that the output of our method should be treated as exploratory; further theoretical and methodological development is required to better match complex real-world data.

5. Discussion

In this paper, we showed that a latent causal model is identifiable when data are available from an intervention on each latent variable. Conversely, we showed that, in the worst case, such data are necessary for identifiability of the latent representation. Our proof is constructive, consisting of an algorithm for recovering the latent representation, which can be adapted to the noisy setting. Our algorithm was developed for maximal clarity of our identifiability result, and leaves open several directions for future work.

Theory of latent interventional Markov equivalence. We established sufficient and (worst-case) necessary conditions for the complete identifiability of the parameters $H$ and $\{B_k\}_{k=0}^K$. However, in many settings, it is expected that these parameters (and the corresponding combinatorial structures) are only partially identifiable. Indeed, Proposition 5 suggests that the problem parameters may be partially recoverable. In future work, it would be interesting to develop a theory of identifiability for $K < p$ interventions.

Non-linear setting. Our results require that both the latent structural equation model and the mixing function are linear. We expect that the insights developed here may apply when one or both of these elements are non-linear. Notably, the contemporaneous work of Ahuja et al. (2022b) shows that, under certain conditions, the case of polynomial mixing can be reduced to the case of linear mixing, which can be immediately applied to extend our result.

Statistical analysis of causal disentanglement. A next step beyond identifiability is to investigate the statistical properties of the setup. This includes lower bounds on the accuracy of recovering the parameters $H$ and $\{B_k\}_{k=0}^K$, along with corresponding combinatorial structures such as $\mathcal{G}$ and the matching between $k$ and $i_k$, and computationally efficient algorithms that match these lower bounds. Our identifiability result suggests that the differences of precision matrices may play a role. These differences appear in Varici et al. (2021), which uses techniques for directly estimating differences between precision matrices. Moreover, it would be interesting to develop a score-based approach, e.g., (penalized) maximum likelihood estimation. Our problem formulation suggests a natural parameterization for such an approach, which could be addressed using combinatorial optimization or gradient-based search.
Salil Bhate was supported by the Eric and Wendy Schmidt Center at the Broad Institute. In addition, this work was supported by NCCIH/NIH (1DP2AT012345), ONR (N00014-22-1-2116), NSF (DMS-1651995), the MIT-IBM Watson AI Lab, and a Simons Investigator Award to Caroline Uhler. Linear Causal Disentanglement via Interventions Adams, J., Hansen, N., and Zhang, K. Identification of partially observed linear causal models: Graphical conditions for the non-gaussian and heterogeneous cases. Advances in Neural Information Processing Systems, 34: 22822 22833, 2021. Ahuja, K., Hartford, J., and Bengio, Y. Properties from mechanisms: an equivariance perspective on identifiable representation learning. In International Conference on Learning Representations, 2021. Ahuja, K., Hartford, J., and Bengio, Y. Weakly supervised representation learning with sparse perturbations. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022a. URL https://openreview.net/forum? id=6ZI4i F_T7t. Ahuja, K., Wang, Y., Mahajan, D., and Bengio, Y. Interventional causal representation learning. In Neur IPS 2022 Workshop on Neuro Causal and Symbolic AI (n CSI), 2022b. Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., et al. Gene ontology: tool for the unification of biology. Nature genetics, 25(1):25 29, 2000. Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8): 1798 1828, 2013. Brehmer, J., Haan, P. D., Lippe, P., and Cohen, T. Weakly supervised causal representation learning. In ICLR2022 Workshop on the Elements of Reasoning: Objects, Structure and Causality, 2022. URL https: //openreview.net/forum?id=r UXOUBu Ucg5. Cai, R., Xie, F., Glymour, C., Hao, Z., and Zhang, K. Triad constraints for learning causal structure of latent variables. Advances in neural information processing systems, 32, 2019. Castelletti, F. and Peluso, S. Network structure learning under uncertain interventions. Journal of the American Statistical Association, pp. 1 12, 2022. Comon, P. Independent component analysis, a new concept? Signal processing, 36(3):287 314, 1994. Davidson-Pilon, C. lifelines: survival analysis in python. Journal of Open Source Software, 4(40):1317, 2019. Drton, M., Sturmfels, B., and Sullivant, S. Algebraic factor analysis: tetrads, pentads and beyond. Probability Theory and Related Fields, 138:463 493, 2007. Drton, M., Massam, H., and Olkin, I. Moments of minors of wishart matrices. 2008. Eberhardt, F., Glymour, C., and Scheines, R. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among n variables. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pp. 178 184, 2005. Greville, T. N. E. Note on the generalized inverse of a matrix product. Siam Review, 8(4):518 521, 1966. Halpern, Y., Horng, S., and Sontag, D. Anchored discrete factor analysis. ar Xiv preprint ar Xiv:1511.03299, 2015. Hauser, A. and B uhlmann, P. Characterization and greedy learning of interventional markov equivalence classes of directed acyclic graphs. The Journal of Machine Learning Research, 13(1):2409 2464, 2012. Huang, L., Guo, Z., Wang, F., and Fu, L. Kras mutation: from undruggable to druggable in cancer. Signal transduction and targeted Therapy, 6(1):386, 2021. 
Hyttinen, A., Eberhardt, F., and Hoyer, P. O. Experiment selection for causal discovery. Journal of Machine Learning Research, 14:3041 3071, 2013. Hyvarinen, A., Sasaki, H., and Turner, R. Nonlinear ICA using auxiliary variables and generalized contrastive learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 859 868. PMLR, 2019. Jaber, A., Kocaoglu, M., Shanmugam, K., and Bareinboim, E. Causal discovery from soft interventions with unknown targets: Characterization and learning. Advances in neural information processing systems, 33:9551 9561, 2020. Khemakhem, I., Kingma, D., Monti, R., and Hyvarinen, A. Variational autoencoders and nonlinear ICA: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pp. 2207 2217. PMLR, 2020. Kivva, B., Rajendran, G., Ravikumar, P., and Aragam, B. Learning latent causal graphs via mixture oracles. Advances in Neural Information Processing Systems, 34: 18087 18101, 2021. Kuleshov, M. V., Jones, M. R., Rouillard, A. D., Fernandez, N. F., Duan, Q., Wang, Z., Koplev, S., Jenkins, S. L., Jagodnik, K. M., Lachmann, A., et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic acids research, 44(W1):W90 W97, 2016. Liu, J., Lichtenberg, T., Hoadley, K. A., Poisson, L. M., Lazar, A. J., Cherniack, A. D., Kovatich, A. J., Benz, C. C., Levine, D. A., Lee, A. V., et al. An integrated tcga Linear Causal Disentanglement via Interventions pan-cancer clinical data resource to drive high-quality survival outcome analytics. Cell, 173(2):400 416, 2018. Liu, Y., Zhang, Z., Gong, D., Gong, M., Huang, B., Hengel, A. v. d., Zhang, K., and Shi, J. Q. Weight-variant latent causal models. ar Xiv preprint ar Xiv:2208.14153, 2022. Moran, G. E., Sridhar, D., Wang, Y., and Blei, D. M. Identifiable deep generative models via sparse decoding. ar Xiv preprint ar Xiv:2110.10804, 2021. Saeed, B., Belyaeva, A., Wang, Y., and Uhler, C. Anchored causal inference in the presence of measurement error. In Conference on uncertainty in artificial intelligence, pp. 619 628. PMLR, 2020. Seabold, S. and Perktold, J. statsmodels: Econometric and statistical modeling with python. In 9th Python in Science Conference, 2010. Silva, R., Scheines, R., Glymour, C., Spirtes, P., and Chickering, D. M. Learning the structure of linear latent variable models. Journal of Machine Learning Research, 7 (2), 2006. Squires, C. and Uhler, C. Causal structure learning: A combinatorial perspective. Foundations of Computational Mathematics, 2022. Squires, C., Wang, Y., and Uhler, C. Permutation-based causal structure learning with unknown intervention targets. In Conference on Uncertainty in Artificial Intelligence, pp. 1039 1048. PMLR, 2020. Squires, C., Yun, A., Nichani, E., Agrawal, R., and Uhler, C. Causal structure discovery between clusters of nodes induced by latent factors. In Conference on Causal Learning and Reasoning, pp. 669 687. PMLR, 2022. Stark, S. G., Ficek, J., Locatello, F., Bonilla, X., Chevrier, S., Singer, F., R atsch, G., and Lehmann, K.-V. Scim: universal single-cell matching with unpaired feature sets. Bioinformatics, 36(Supplement 2):i919 i927, 2020. Trefethen, L. N. and Bau III, D. Numerical linear algebra, volume 50. Siam, 1997. Ursu, O., Neal, J. T., Shea, E., Thakore, P. I., Jerby Arnon, L., Nguyen, L., Dionne, D., Diaz, C., Bauman, J., Mosaad, M. M., et al. Massively parallel phenotyping of coding variants in cancer with perturb-seq. Nature Biotechnology, pp. 1 10, 2022. 
Varici, B., Shanmugam, K., Sattigeri, P., and Tajer, A. Scalable intervention target estimation in linear models. Advances in Neural Information Processing Systems, 34: 1494 1505, 2021. Verma, T. and Pearl, J. Equivalence and synthesis of causal models. In Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence, pp. 255 270, 1990. Xie, F., Cai, R., Huang, B., Glymour, C., Hao, Z., and Zhang, K. Generalized independent noise condition for estimating latent variable causal graphs. Advances in Neural Information Processing Systems, 33:14891 14902, 2020. Xie, F., Huang, B., Chen, Z., He, Y., Geng, Z., and Zhang, K. Identification of linear non-Gaussian latent hierarchical structure. In International Conference on Machine Learning, pp. 24370 24387. PMLR, 2022. Yang, K., Katcoff, A., and Uhler, C. Characterizing and learning equivalence classes of causal DAGs under interventions. In International Conference on Machine Learning, pp. 5541 5550. PMLR, 2018. Yang, M., Liu, F., Chen, Z., Shen, X., Hao, J., and Wang, J. Causal VAE: Disentangled representation learning via neural structural causal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9593 9602, 2021. Zimmermann, R. S., Sharma, Y., Schneider, S., Bethge, M., and Brendel, W. Contrastive learning inverts the data generating process. In International Conference on Machine Learning, pp. 12979 12990. PMLR, 2021. Linear Causal Disentanglement via Interventions Contents of Appendix A Additional related work 13 B Non-generic soft interventions 13 C Pseudoinverse of a covariance matrix 13 D The partial order RQ decomposition 13 E Further preliminaries for identifiability and reduction 14 F Reduction 14 F.1 Reducing to one intervention per node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 F.2 Reducing to a known observational context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 G Hypothesis testing a necessary condition for model membership 15 H Computational complexity 16 I Proofs 17 I.1 Proof of non-identifiability for one missing intervention . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 J Non-identifiability for imperfect interventions 17 J.1 Parameter non-identifiability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 J.2 Graph non-identifiability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 K Comparison to Liu et al. (2022) 18 L Finite-sample algorithms 18 M Code and Data 19 M.1 Optimizing over permutations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 M.2 Additional information on real data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Linear Causal Disentanglement via Interventions A. Additional related work Fig. 4 shows the two graphical conditions assumed in some prior works. (a) Pure children (b) No twins Figure 4: Graphical conditions assumed in prior works. In (a), the orange edges link pure children (X1 and X3) to their parents (Z1 and Z3, respectively). In (b), the no twins assumption is satisfied since the observed children of Z1, Z2, and Z3 are, respectively, {X1, X2}, {X1, X2, X3}, and {X3}, and these three sets are distinct. B. Non-generic soft interventions We discuss the genericity condition in Assumption 1(b). 
We show that for soft interventions in which this genericity condition fails to hold, identifiability of the causal disentanglement problem as in Theorem 1 may fail. The following matrices satisfy all of Assumption 1, except for the genericity condition in Assumption 1(b), since B 1 e1 = 2B 0 e1: B0 = 1 1 0 1 , B1 = 2 2 0 1 , B2 = 1 1 0 2 , G = 1 0 0 1 Consider the alternative matrices b B0 = 1 0 0 1 , b B1 = 2 0 0 1 , b B2 = 1 0 0 2 , b G = 1 1 0 1 These do not differ from the original tuple of matrices via a permutation. However, one can check that they are a valid solution, since Θ0 = bΘ0 = 1 1 1 2 , Θ1 = bΘ1 = 4 4 4 5 , Θ1 = bΘ1 = 1 1 1 5 C. Pseudoinverse of a covariance matrix Proposition 6. Let X = GZ for G Rp d with full column rank. Assume Cov(Z) is invertible and let K := Cov(Z) 1. Then Cov(X) = H KH, where H := G . Proof. The covariance matrices Cov(X) and Cov(Z) are related via Cov(X) = G Cov(Z) G . The property (UV ) = V U holds whenever U has full column rank and V has full row rank (Greville, 1966). The matrix G has full column rank, Cov(Z) has full rank, and G has full row rank. Hence Cov(X) = (G ) Cov(Z) G = H KH. D. The partial order RQ decomposition Recall the partial order RQ decomposition from Definition 1. We present Algorithm 4 to find the partial order RQ decomposition of a matrix. In Line 4, the normalize operator is normalize(v) := v v 2 . We let 0d p denote the d p matrix of zeros and Qj i denote the submatrix of Q on rows j with j i. We say that a partial order is consistent with the total order 1, 2, . . . , d if i j implies i < j. Any partial order can be put in this form by relabelling. Proposition 7. Let H Rd p be full rank and fix a partial order over [d]. Then there exists a unique partial order RQ decomposition of H. If is consistent with the total order 1, 2, . . . , d, the decomposition is returned by Algorithm 4. Linear Causal Disentanglement via Interventions Proof. The matrix Qj i has fewer rows than columns and hi qj : j i , by its construction in Algorithm 4. Hence the vector r of non-zero entries in the i-th row of R is the unique solution to Q j ir = hi. By construction, we see that H = RQ and, moreover, Rij = 0 if j i, and qi is orthogonal to qj for i j. Furthermore, the entry Rii is positive, since qi is the (normalized) projection of hi onto Wi. Algorithm 4 Partial Order RQ Decomposition 1: Input: Matrix H Rd p, partial order over [d] consistent with the total order 1, 2, . . . , d 2: Output: A partial order RQ decomposition R, Q 3: Let R = 0d d, Q = 0d p 4: for i = 1, . . . , d do 5: Let hi be the i-th row of H 6: Let Wi = qj : j i 7: Let qi = normalize(proj W i hi) be the i-th row of Q 8: Let r = (Q j i) hi 9: For i j, let Rij = rj 10: end for 11: return R, Q E. Further preliminaries for identifiability and reduction We prove Fact 1 and discuss the reduction described in Section 3.1. Proof of Fact 1. We have B B = B Pd i=1 eie i B = Pd i=1 B eie i B = Pd i=1(B ei) 2. Proposition 8. Consider the setup in Assumption 1. For k, ℓ [K], we have B k Bk B ℓBℓ= (B k eik) 2 (B 0 eik) 2 (B ℓeiℓ) 2 + (B 0 eiℓ) 2, (4) and thus Θk Θℓ= (HB k eik) 2 (HB 0 eik) 2 (HB ℓeiℓ) 2 + (HB 0 eiℓ) 2. Proof. The proof follows the same steps as Prop. 1, together with the fact that B k Bk B ℓBℓ= (B k Bk B 0 B0) (B ℓBℓ B 0 B0). F. Reduction In this section, we show that a more general causal disentanglement problem can be simplified to one that satisfies our assumptions. We consider an unknown observational context, and multiple contexts with the same target. 
We focus here on the case of perfect interventions. F.1. Reducing to one intervention per node We identify which contexts correspond to interventions on the same node. Thus, we can reduce to the case of one intervention per node by removing any redundant contexts. We do not use knowledge of which context is the observational context here. Proposition 9. Consider the setup in Assumptions 1 and 2. Assume generic parameters for B0, λk, and λℓ. For k, ℓ [K], we have rk,ℓ= 1 if and only if ik = iℓ, where rk,ℓ:= rank(Θk Θℓ). Proof. Since H is full rank, we have rk,ℓ= rank(B k Bk B ℓBℓ). Thus, we consider rank(B k Bk B ℓBℓ). Suppose ik = iℓ= i. We have B k Bk B ℓBℓ= B k ei 2 B ℓei 2 , Linear Causal Disentanglement via Interventions by Equation (4) in Prop. 8. Both B k ei and B ℓei have a single nonzero entry at the i-th coordinate, by Assumption 2. Thus rk,ℓ= 1. Suppose ik = iℓand assume without loss of generality that ik < iℓ. Given a matrix M, let MU denote the submatrix of M with rows and columns indexed by the elements of the set U. We have B k Bk B ℓBℓ {ik,iℓ} = λ2 k (B0)2 ik,ik (B0)ik,ik(B0)ik,iℓ (B0)ik,ik(B0)ik,iℓ λ2 ℓ+ (B0)2 iℓ,iℓ (B0)2 ik,iℓ by Equation (4) in Prop. 8 and Assumption 2. For generic parameters, this submatrix has rank two, so the full matrix has rank at least two; i.e., rk,ℓ 2. F.2. Reducing to a known observational context The previous section explains how to reduce to the case with one intervention per latent node. We may also reduce to the case with only one observational context: if more than one context is the observational context, they will all have the same inverse covariance matrix, so we may select only one of these contexts to serve as the observational context k = 0. Next we show that, with one intervention per node, and one observational context, we can identify the observational context. We show that the observational context has the sparsest changes from the other contexts. We formalize this intuition with the following definition. Definition 2. The deviation score of context k is ℓ [K]\{k} rk,ℓ, where rk,ℓ:= rank(Θk Θℓ)forall k, ℓ [K]. Proposition 10. Consider the setup in Assumption 1. Then k {0} [K] is an observational context if and only if k = arg mink {0} [K] rk. Proof. Let source(G) denote the set of source nodes in G. By Prop. 1, r0,ℓ= 1 + 1pa(iℓ) = for all ℓ [K]. Thus, r0 = 2K |source(G)|. For k = 0, we have rk,ℓ 2 + 1pa(iℓ) = for all ℓ [K] \ {k}. Thus, 2K. Since G must have at least one source node, we see that rk > r0 for all k = 0. G. Hypothesis testing a necessary condition for model membership We define the null hypothesis H0 : rank(Θk Θ0) 2 k [K] Assumption 1(b) implies that H0 holds, by Prop. 1. The null hypothesis H0 is a necessary condition for membership of (Θ0, Θ1, . . . , ΘK) in the model defined by Assumption 1. However, H0 is not a sufficient condition for model membership: we may have rank(Θk Θ0) 2 for all k [K], despite some interventions not targeting single nodes. For example, if G is the empty graph, and all interventions have two targets, then H0 holds. These cases may be ruled out with other conditions implied by model membership. We leave a membership test for our model to future work. Here, we focus on developing a test for H0. Prior work on testing latent variables models (Drton et al., 2007; Squires et al., 2022) use such rank constraints. 
G. Hypothesis testing a necessary condition for model membership

We define the null hypothesis
\[
H_0 : \operatorname{rank}(\Theta_k - \Theta_0) \leq 2 \quad \text{for all } k \in [K].
\]
Assumption 1(b) implies that $H_0$ holds, by Prop. 1. The null hypothesis $H_0$ is a necessary condition for membership of $(\Theta_0, \Theta_1, \ldots, \Theta_K)$ in the model defined by Assumption 1. However, $H_0$ is not a sufficient condition for model membership: we may have $\operatorname{rank}(\Theta_k - \Theta_0) \leq 2$ for all $k \in [K]$, despite some interventions not targeting single nodes. For example, if $G$ is the empty graph and all interventions have two targets, then $H_0$ holds. These cases may be ruled out with other conditions implied by model membership. We leave a membership test for our model to future work. Here, we focus on developing a test for $H_0$. Prior work on testing latent variable models (Drton et al., 2007; Squires et al., 2022) uses such rank constraints.

To test whether a matrix $M \in \mathbb{R}^{p \times p}$ is rank $k$, one can test that all minors of size $k+1$ vanish; i.e., the collection of hypotheses
\[
H_{A,B} : t_{A,B} = 0 \quad \text{for all } A, B \subseteq [p], \ |A| = |B| = k+1,
\]
where $t_{A,B} := \det(M_{A,B})$. For example, Squires et al. (2022) use this to test whether certain submatrices of a covariance matrix are rank one, as follows. Let $\hat M$ be the sample covariance matrix computed from $n$ samples. If the underlying distribution is multivariate Gaussian, it is well-known that $\hat M$ follows a Wishart distribution. Now, for each pair of subsets $A, B$, compute the empirical minor $\hat t_{A,B} := \det(\hat M_{A,B})$. Then, compute an estimate $\widehat{\mathrm{Var}}(\hat t_{A,B})$. Such an estimate can be obtained by evaluating the expression for $\mathrm{Var}(\hat t_{A,B})$ in Drton et al. (2008), which characterizes the moments of minors of Wishart matrices. Given this estimate, compute the z-score $z_{A,B} = \hat t_{A,B} / \sqrt{\widehat{\mathrm{Var}}(\hat t_{A,B})}$. By typical asymptotic theory, $z_{A,B} \to N(0, 1)$ as $n \to \infty$, so we can use the z-score to compute an asymptotically correct p-value. Finally, the p-values for all pairs of subsets $A, B$ can be aggregated into a single p-value using a multiple hypothesis testing procedure such as Bonferroni correction or Sidak adjustment.

In principle, a similar procedure can be performed to test our null hypothesis $H_0$. However, under a Gaussianity assumption, $\hat\Theta_k$ and $\hat\Theta_0$ follow inverse Wishart distributions, rather than Wishart distributions. This would require expressions for the moments of minors of inverse Wishart matrices. We leave such a hypothesis test, which could give guarantees on false discovery rate control, to future work.

Instead, we demonstrate the performance of a simple hypothesis test based on the singular values of the matrix $\hat\Theta_k - \hat\Theta_0$. Let
\[
\rho_2(M) := \frac{\sigma_1^2(M) + \sigma_2^2(M)}{\sum_{i=1}^p \sigma_i^2(M)}.
\]
If $\operatorname{rank}(M) \leq 2$, then $\rho_2(M) = 1$; otherwise, $\rho_2(M) < 1$. Thus, we may test $H_0$ by checking whether $\rho_2(\hat\Theta_k - \hat\Theta_0) > \tau$ for some threshold $\tau$ near 1.

We demonstrate the performance of this procedure for testing model membership. We generate 500 random models following Assumption 1, using the same hyperparameters as in Section 4. These models satisfy $H_0$. We also generate 500 random models where the interventions target two nodes instead of one. For each $k$, we pick intervention targets $I_k \subseteq [d]$ with $|I_k| = 2$, uniformly at random among all subsets of size two. We hold all other hyperparameters of the simulation fixed. We consider only $n = 2500$ samples, the smallest sample size used in Section 4, and vary the threshold $\tau$ from 0.97 to 0.999, linearly spaced over 20 values. The singular-value-based test is able to determine model membership at a rate well above random guessing; see Fig. 5.

Figure 5: Performance of our singular-value-based hypothesis test for $H_0$, the hypothesis that all precision matrix differences are of rank at most two. The gray line indicates the performance of randomly guessing. Our test performs thresholding on the value $\rho_2(\hat\Theta_k - \hat\Theta_0)$. Here, we vary the threshold from 0.97 to 0.999.
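A minimal sketch of this test, assuming only the definition of $\rho_2$ above (the default threshold and the function names are illustrative):

```python
import numpy as np

def rho2(M):
    # Fraction of spectral energy carried by the top two singular values of M.
    s = np.linalg.svd(M, compute_uv=False)
    return (s[0] ** 2 + s[1] ** 2) / np.sum(s ** 2)

def accept_H0(Theta_hats, tau=0.99):
    """Accept H0 if every difference of estimated precision matrices is
    numerically close to rank two. tau = 0.99 is an example value inside
    the range swept in Figure 5."""
    Theta_0 = Theta_hats[0]
    return all(rho2(Theta_k - Theta_0) > tau for Theta_k in Theta_hats[1:])
```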
H. Computational complexity

Algorithm 3 takes as input $K$ precision matrices in $\mathbb{R}^{p \times p}$. If these are each computed from at most $n$ samples, then the total cost of computing them is $O(Kp^3 + Knp^2)$. Algorithm 2 runs for $K$ rounds and computes at most $2K$ projections per round. Each projection costs $O(p^3)$, so the cost of this step is $O(K^2 p^3)$. In the remainder of Algorithm 3, we perform $O(K)$ matrix multiplications and Cholesky decompositions. Each matrix multiplication costs $O(p^2)$ and each Cholesky decomposition costs $O(p^3)$. The other operations (e.g., selecting rows) in Algorithm 3 are negligible. Therefore, the overall runtime of Algorithm 3 is dominated by Algorithm 2, with a total cost of $O(K^2 p^3)$.

I.1. Proof of Prop. 5

Proof. For (i), it suffices to find $\hat H$ and $\{\hat B_k\}_{k=1}^K$ such that $\hat B_k \hat H = B_k H$ for all $k \in [K]$, and such that there is no $\sigma \in S(G)$ satisfying $\hat B_k = P_\sigma B_k P_\sigma^\top$, $\hat H = P_\sigma H$, by Equation (2). Suppose $i_k \neq 1$ for all $k$. Let
\[
\hat B_k = \begin{pmatrix} e_1^\top \\ (B_k)_{2:d,\,1:d} \end{pmatrix}, \qquad
\hat H = \begin{pmatrix} (B_0)_{1,\,1:d}\, H \\ H_{2:d,\,1:p} \end{pmatrix}.
\]
Then, for all $k$, we have $\hat B_k \hat H = B_k H$. Suppose that $Z_1$ has at least one parent, i.e., $(B_0)_{1j} \neq 0$ for at least one $j > 1$. Then the first row of $\hat H$ is a nonzero combination of at least two rows of $H$. Hence it is not equal to a single row of $H$, since $H$ is full rank. Thus, $\hat H$ is not equal to $H$ up to any permutation of rows.

For (ii), observe that for the stated example, the partial order given by $\hat B_0$ differs in general from the partial order $\preceq_G$, since $Z_1$ has no predecessors in the former.

J. Non-identifiability for imperfect interventions

J.1. Parameter non-identifiability

In this section, we show that Assumption 2 is necessary to identify $H$. Let
\[
B_0 = \begin{pmatrix} (B_0)_{11} & (B_0)_{12} \\ 0 & (B_0)_{22} \end{pmatrix}, \quad
B_1 = \begin{pmatrix} (B_1)_{11} & (B_1)_{12} \\ 0 & (B_0)_{22} \end{pmatrix}, \quad
B_2 = \begin{pmatrix} (B_0)_{11} & (B_0)_{12} \\ 0 & (B_2)_{22} \end{pmatrix}, \quad
H = \begin{pmatrix} 1 & H_{12} \\ 0 & 1 \end{pmatrix}.
\]
Then, for any value $\hat H_{12} \in \mathbb{R}$, we have $B_k H = \hat B_k \hat H$ for all $k$, where
\[
\hat B_0 = \begin{pmatrix} (B_0)_{11} & (B_0)_{12} + (B_0)_{11} H_{12} - (B_0)_{11} \hat H_{12} \\ 0 & (B_0)_{22} \end{pmatrix}, \quad
\hat B_1 = \begin{pmatrix} (B_1)_{11} & (B_1)_{12} + (B_1)_{11} H_{12} - (B_1)_{11} \hat H_{12} \\ 0 & (B_0)_{22} \end{pmatrix},
\]
\[
\hat B_2 = \begin{pmatrix} (B_0)_{11} & (B_0)_{12} + (B_0)_{11} H_{12} - (B_0)_{11} \hat H_{12} \\ 0 & (B_2)_{22} \end{pmatrix}, \quad
\hat H = \begin{pmatrix} 1 & \hat H_{12} \\ 0 & 1 \end{pmatrix}.
\]

J.2. Graph non-identifiability

Suppose that $G$ has edges $2 \to 1$, $3 \to 2$, and $3 \to 1$. Then the weight matrices have the form
\[
B_0 = \begin{pmatrix} (B_0)_{11} & (B_0)_{12} & (B_0)_{13} \\ 0 & (B_0)_{22} & (B_0)_{23} \\ 0 & 0 & (B_0)_{33} \end{pmatrix}, \quad
B_1 = \begin{pmatrix} (B_1)_{11} & (B_1)_{12} & (B_1)_{13} \\ 0 & (B_0)_{22} & (B_0)_{23} \\ 0 & 0 & (B_0)_{33} \end{pmatrix}, \quad
B_2 = \begin{pmatrix} (B_0)_{11} & (B_0)_{12} & (B_0)_{13} \\ 0 & (B_2)_{22} & (B_2)_{23} \\ 0 & 0 & (B_0)_{33} \end{pmatrix},
\]
\[
B_3 = \begin{pmatrix} (B_0)_{11} & (B_0)_{12} & (B_0)_{13} \\ 0 & (B_0)_{22} & (B_0)_{23} \\ 0 & 0 & (B_3)_{33} \end{pmatrix}, \quad
H = \begin{pmatrix} 1 & H_{12} & H_{13} \\ 0 & 1 & H_{23} \\ 0 & 0 & 1 \end{pmatrix}.
\]
Let $\hat G$ be the DAG with edges $2 \to 1$ and $3 \to 2$. We show that there exist $\hat H$ and matrices $\hat B_k$, following the support of $\hat G$, such that $B_k H = \hat B_k \hat H$ for all $k$. Note that $G$ and $\hat G$ have the same transitive closure. Let $\hat H_{12} \in \mathbb{R}$ and let $\hat H_{13}, \hat H_{23}$ be a solution to the system of equations
\[
\begin{pmatrix}
(B_0)_{11} & (B_0)_{12} + (B_0)_{11}(H_{12} - \hat H_{12}) \\
(B_1)_{11} & (B_1)_{12} + (B_1)_{11}(H_{12} - \hat H_{12})
\end{pmatrix}
\begin{pmatrix} \hat H_{13} \\ \hat H_{23} \end{pmatrix}
=
\begin{pmatrix}
(B_0)_{11} H_{13} + (B_0)_{12} H_{23} + (B_0)_{13} \\
(B_1)_{11} H_{13} + (B_1)_{12} H_{23} + (B_1)_{13}
\end{pmatrix}.
\]
This system generically has a solution, since for generic parameters the matrix on the left hand side is rank two. Then a solution, in which all matrices $\hat B_k$ have vanishing entry $(1,3)$, is as follows:
\[
\hat B_0 = \begin{pmatrix} (B_0)_{11} & (B_0)_{12} + (B_0)_{11}(H_{12} - \hat H_{12}) & 0 \\ 0 & (B_0)_{22} & (B_0)_{23} + (B_0)_{22}(H_{23} - \hat H_{23}) \\ 0 & 0 & (B_0)_{33} \end{pmatrix},
\]
\[
\hat B_1 = \begin{pmatrix} (B_1)_{11} & (B_1)_{12} + (B_1)_{11}(H_{12} - \hat H_{12}) & 0 \\ 0 & (B_0)_{22} & (B_0)_{23} + (B_0)_{22}(H_{23} - \hat H_{23}) \\ 0 & 0 & (B_0)_{33} \end{pmatrix},
\]
\[
\hat B_2 = \begin{pmatrix} (B_0)_{11} & (B_0)_{12} + (B_0)_{11}(H_{12} - \hat H_{12}) & 0 \\ 0 & (B_2)_{22} & (B_2)_{23} + (B_2)_{22}(H_{23} - \hat H_{23}) \\ 0 & 0 & (B_0)_{33} \end{pmatrix},
\]
\[
\hat B_3 = \begin{pmatrix} (B_0)_{11} & (B_0)_{12} + (B_0)_{11}(H_{12} - \hat H_{12}) & 0 \\ 0 & (B_0)_{22} & (B_0)_{23} + (B_0)_{22}(H_{23} - \hat H_{23}) \\ 0 & 0 & (B_3)_{33} \end{pmatrix}.
\]
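The construction in Appendix J.1 is easy to verify numerically; the following sketch (with our own illustrative parameter draws) checks that $B_k H = \hat B_k \hat H$ for a random choice of parameters and of $\hat H_{12}$.

```python
import numpy as np

rng = np.random.default_rng(0)
b0_11, b0_12, b0_22, b1_11, b1_12, b2_22 = rng.normal(size=6)
H12, H12_hat = rng.normal(size=2)   # H12_hat is an arbitrary alternative value

B0 = np.array([[b0_11, b0_12], [0.0, b0_22]])
B1 = np.array([[b1_11, b1_12], [0.0, b0_22]])
B2 = np.array([[b0_11, b0_12], [0.0, b2_22]])
H = np.array([[1.0, H12], [0.0, 1.0]])
H_hat = np.array([[1.0, H12_hat], [0.0, 1.0]])

def B_hat(B):
    # Adjust the (1,2) entry as in the displayed matrices of Appendix J.1.
    out = B.copy()
    out[0, 1] = B[0, 1] + B[0, 0] * (H12 - H12_hat)
    return out

for B in (B0, B1, B2):
    assert np.allclose(B @ H, B_hat(B) @ H_hat)
# The two parameterizations produce identical observed quantities B_k H.
```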
K. Comparison to Liu et al. (2022)

We compare Theorem 2 to Liu et al. (2022). We restate their main result for convenience and notation.

Theorem (Theorem 4.1 of Liu et al. (2022)). Let $Z = A_k Z + \varepsilon_k$ and $X = g(Z) + \varepsilon_x$. Let $\eta^k$ be the sufficient statistic for the distribution of $Z$ in environment $k$. That is, $\eta^k = \mathrm{vec}(\bar\Theta_k)$, where $\bar\Theta_k$ denotes the inverse covariance matrix of $Z$ in the $k$-th setting and $\mathrm{vec}$ denotes the vectorization of a matrix. We assume that $\mathrm{vec}$ ignores zeros and repetitions. Assume that (i) $\{x \in \mathcal{X} \mid \varphi_{\varepsilon_x}(x) = 0\}$ has measure zero, where $\varphi_{\varepsilon_x}$ is the characteristic function of $\varepsilon_x$, (ii) $g$ is bijective, and (iii) there exist $K + 1$ environments such that the following matrix is invertible:
\[
L = \begin{pmatrix} \vert & \vert & & \vert \\ \eta^1 - \eta^0 & \eta^2 - \eta^0 & \cdots & \eta^K - \eta^0 \\ \vert & \vert & & \vert \end{pmatrix}.
\]
Then we can recover $g$ up to permutation and scaling.

First, we show that (i) and (ii) hold in our setting. Our assumption that $X = GZ$ for $G$ invertible guarantees (ii). Our assumption that $X$ is a deterministic function of $Z$ corresponds to taking $\varepsilon_x \sim \delta_0$, i.e., $\varepsilon_x = 0$ with probability one. The characteristic function is $\varphi_{\varepsilon_x}(t) = 1$, thus satisfying (i).

We now show that (iii) is only satisfied in our setting when the number of edges in the latent graph is at most $d$. The vector $\mathrm{vec}(\bar\Theta_k - \bar\Theta_0)$ is of length $d + |E|$, where $|E|$ is the number of edges in the graphical model defined by $\bar\Theta_0$. To be invertible, $L$ must be a square matrix, and hence we require $K = d + |E|$. If $|E| > d$, then $K > 2d$, and we must have an intervention target $i$ such that $i = i_k$ for at least three values of $k$. We have $\bar\Theta_k = B_k^\top B_k$, and thus
\[
\bar\Theta_k - \bar\Theta_\ell = (\lambda_k e_{i_k})^{\otimes 2} - (B_0^\top e_{i_k})^{\otimes 2} - (\lambda_\ell e_{i_\ell})^{\otimes 2} + (B_0^\top e_{i_\ell})^{\otimes 2}.
\]
Given $k_1$, $k_2$, and $k_3$ such that $i_{k_1} = i_{k_2} = i_{k_3} = i$, we see that $\eta^{k_1}$, $\eta^{k_2}$, and $\eta^{k_3}$ differ only at position $(i, i)$. The space of vectors that differ in at most one entry is at most two-dimensional. Thus $\eta^{k_1}$, $\eta^{k_2}$, and $\eta^{k_3}$ are not linearly independent, and $L$ is not invertible.

L. Finite-sample algorithms

Matrix rank scoring. In Line 1 of Algorithm 1 and Line 2 of Algorithm 2, we check whether a subspace is rank one. In the finite-sample setting, we represent these subspaces by matrices and measure how close the matrices are to rank one.

Algorithm 5 IDENTIFYPARTIALORDERFINITESAMPLE
1: Hyperparameters: $\gamma$
2: Input: Precision matrices $(\Theta_0, \Theta_1, \ldots, \Theta_K)$
3: Output: Factor $\hat Q$, partial order $\preceq$
4: Let $I_0 = \{\}$, $\hat Q_0 = 0_{d \times d}$
5: for $t = 1, \ldots, K$ do
6: Let $M_k = \mathrm{proj}_{\hat Q_{t-1}}(\Theta_k - \Theta_0)$ for each $k \notin I_{t-1}$
7: Let $\rho_k = \sigma_1^2(M_k) / \big(\sum_{i=1}^p \sigma_i^2(M_k)\big)$ for each $k \notin I_{t-1}$
8: Pick $k \in \arg\max_{k \notin I_{t-1}} \rho_k$
9: Let $(\hat q_k, A) = \textsc{IdentifyAncestorsFiniteSample}(\Theta_k, \Theta_0, \{\hat q_i\}_{i \in I_{t-1}}; \gamma)$
10: Add $a' \preceq k$ for any $a' \preceq a$, $a \in A$
11: Let $I_t = I_{t-1} \cup \{k\}$, $\hat Q_t = [\hat q_k; \hat Q_{t-1}]$
12: end for
13: return $\hat Q$, $\preceq$

Algorithm 6 IDENTIFYANCESTORSFINITESAMPLE
1: Hyperparameters: $\gamma$
2: Input: $\Theta_k$, $\Theta_0$, $\{\hat q_i\}_{i \in I}$
3: Output: Vector $\hat q_k$, ancestor set $A$
4: Let $A = I$
5: for $i \in I$ do
6: Let $W_{-i} = [\hat q_j : j \in I \setminus \{i\}]$
7: Let $M_{-i} = \mathrm{proj}_{W_{-i}}(\Theta_k - \Theta_0)$
8: Let $\rho_{-i} = \sigma_1^2(M_{-i}) / \big(\sum_{j=1}^p \sigma_j^2(M_{-i})\big)$
9: If $\rho_{-i} \geq \gamma$, let $A = A \setminus \{i\}$
10: end for
11: Let $W = [\hat q_a : a \in A]$
12: Let $M = \mathrm{proj}_{W}(\Theta_k - \Theta_0)$
13: Let $\hat q_k$ be the (normalized) leading left singular vector of $M$
14: return $\hat q_k$, $A$

We use the score $\rho(M) := \sigma_1^2(M) / \big(\sum_{i=1}^p \sigma_i^2(M)\big)$, where $\sigma_i(M)$ is the $i$-th largest singular value of $M$; $\rho(M)$ can be interpreted as the percentage of spectral energy associated to the largest singular value of $M$. Using this score to choose the next element of the partial order does not require hyperparameters; see Line 8 of Algorithm 5. In contrast, using this score to prune the set of ancestors requires a hyperparameter to determine whether a matrix is close enough to rank one; see $\gamma$ in Line 9 of Algorithm 6. Larger values of $\gamma$ (e.g., 0.999) result in a more conservative algorithm and will output a denser latent graph, while smaller values of $\gamma$ (e.g., 0.8) result in more aggressive pruning of the latent graph.

Picking $q_k$. The matrix $M$ in Line 12 of Algorithm 6 is not guaranteed to be rank one in the finite-sample case. We instead select the leading left singular vector of $M$.
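To make the scoring concrete, here is a small NumPy sketch of one call of Algorithm 6 as we read it: the projection is taken to remove the span of the supplied rows from both sides of the symmetric difference matrix, and the function and variable names are our own.

```python
import numpy as np

def rho(M):
    # Fraction of spectral energy carried by the top singular value of M.
    s = np.linalg.svd(M, compute_uv=False)
    return s[0] ** 2 / np.sum(s ** 2)

def project_out(D, rows):
    # Remove the span of the given row vectors from both sides of the
    # symmetric matrix D (our reading of proj in Algorithms 5 and 6).
    if len(rows) == 0:
        return D
    Q = np.asarray(rows)
    P = np.eye(D.shape[0]) - Q.T @ np.linalg.pinv(Q.T)
    return P @ D @ P

def identify_ancestors(Theta_k, Theta_0, q_hats, gamma=0.99):
    """One call of IdentifyAncestorsFiniteSample as we read it: prune the
    indices whose direction is not needed for Theta_k - Theta_0 to look
    rank one, then return the leading direction of the pruned projection."""
    D = Theta_k - Theta_0
    I = list(range(len(q_hats)))
    A = list(I)
    for i in I:
        if rho(project_out(D, [q_hats[j] for j in I if j != i])) >= gamma:
            A.remove(i)
    M = project_out(D, [q_hats[a] for a in A])
    q_k = np.linalg.svd(M)[0][:, 0]   # leading left singular vector (unit norm)
    return q_k, A
```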
Picking nonzero rows. In the finite-sample case, the matrix $\hat D_k$ will not usually have only one nonzero row; see Line 10 of Algorithm 7. We estimate the intervention target $i_k$ by picking the row of largest norm. Since we assume that $i_k$ is distinct for distinct $k$, we maintain a set $N$ of candidate intervention targets and do not allow replicates.

M. Code and Data

Our code can be found at https://github.com/csquires/linear-causal-disentanglement-via-interventions.

Algorithm 7 ITERATIVEDIFFERENCEPROJECTIONFINITESAMPLE
1: Hyperparameters: $\gamma$
2: Input: Precision matrices $(\Theta_0, \Theta_1, \ldots, \Theta_K)$
3: Output: $\hat H$, $(\hat B_0, \hat B_1, \ldots, \hat B_K)$
4: Let $d = K$
5: Let $(\hat Q, \preceq) = \textsc{IdentifyPartialOrderFiniteSample}((\Theta_0, \Theta_1, \ldots, \Theta_K); \gamma)$
6: Let $\hat C_k = \textsc{Cholesky}\big((\hat Q^{\dagger})^{\top} \Theta_k \hat Q^{\dagger}\big)$ for $k = 0, \ldots, K$
7: Let $N = [p]$
8: Let $\hat R = I_d$
9: for $k = 1, \ldots, K$ do
10: Let $\hat D_k = \hat C_k - \hat C_0$
11: Pick $\hat i_k \in \arg\max_{i \in N} \|(\hat D_k)_i\|_2$
12: Let $N = N \setminus \{\hat i_k\}$
13: Let $\hat R_{\hat i_k} = (\hat D_k)_{\hat i_k} + (\hat C_0)_{\hat i_k}$
14: end for
15: Let $\hat H = \hat R \hat Q$
16: Let $\hat H = \Lambda \hat H$, for $\Lambda$ diagonal such that $\hat H$ satisfies the conditions on $H$ in Assumption 1(c)
17: Let $\hat B_0 = \textsc{Cholesky}\big((\hat H^{\dagger})^{\top} \Theta_0 \hat H^{\dagger}\big)$
18: Let $\hat B_k = \hat B_0 + e_{\hat i_k}\big(|\Lambda_{\hat i_k, \hat i_k}|\, e_{\hat i_k} - \hat B_0^{\top} e_{\hat i_k}\big)^{\top}$ for $k = 1, \ldots, K$
19: return $\hat H$, $(\hat B_0, \hat B_1, \ldots, \hat B_K)$

M.1. Optimizing over $S(G)$

Consider a partial order $\preceq_G$, a set of true intervention targets $i_1, \ldots, i_K$, and a set of estimated intervention targets $\hat i_1, \ldots, \hat i_K$. The integer linear program (5) computes the topological order $\pi$ consistent with $\preceq_G$ that maximizes the number of agreements between $i_k$ and $\hat i_k$. The topological order $\pi$ can be recovered by letting $\pi(i) = j$ for the unique $j$ such that $A_{ij} = 1$. The first two lines of constraints ensure this uniqueness, and that $\pi(i) \neq \pi(i')$ for $i \neq i'$. The final line of constraints ensures that $\pi$ is consistent with $\preceq_G$. If $\pi$ is not consistent with $\preceq_G$, then there exist $i, i', j$ such that $i \preceq_G i'$, $A_{i'j} = 1$, and $A_{ij'} = 0$ for all $j' \leq j$, which violates the constraint $\sum_{j' \leq j} (A_{ij'} - A_{i'j'}) \geq 0$.
\[
\begin{aligned}
\max_{A \in \{0,1\}^{d \times d}} \quad & \sum_{k \in [K]} A_{\hat i_k,\, i_k} \\
\text{s.t.} \quad & \sum_{i=1}^d A_{ij} = 1 \quad \forall\, j \in [d], \\
& \sum_{j=1}^d A_{ij} = 1 \quad \forall\, i \in [d], \\
& \sum_{j' \leq j} (A_{ij'} - A_{i'j'}) \geq 0 \quad \forall\, i \preceq_G i',\ j \in [d].
\end{aligned}
\tag{5}
\]
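For the small values of $d$ in our experiments, program (5) can also be solved by brute force. The following sketch enumerates permutations directly; it is an illustration of the objective rather than the solver used in our code, and it assumes the reading that an agreement at context $k$ means $\pi(\hat i_k) = i_k$.

```python
from itertools import permutations

def best_consistent_relabeling(true_targets, est_targets, precedes, d):
    """Brute-force analogue of program (5) for small d (0-indexed).
    precedes(i, j) is True if i precedes j in the partial order of G."""
    best_pi, best_score = None, -1
    for pi in permutations(range(d)):
        # pi must be a topological order: i preceding j forces pi[i] < pi[j].
        if any(precedes(i, j) and pi[i] >= pi[j]
               for i in range(d) for j in range(d) if i != j):
            continue
        score = sum(pi[e] == t for t, e in zip(true_targets, est_targets))
        if score > best_score:
            best_pi, best_score = pi, score
    return best_pi, best_score
```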
M.2. Additional information on real data

The scRNA-seq dataset of Ursu et al. (2022) is available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE161824. The TCGA dataset of Liu et al. (2018) is available at https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-LUAD.survival.tsv and https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-LUAD.htseq_fpkm.tsv.gz.

Processing. We use EnrichR (Kuleshov et al., 2016) to pick the $p = 100$ and $p = 83$ most variable genes in the proliferation regulation gene set from the Gene Ontology. We use the values from the processed dataset from Ursu et al. (2022); the only additional processing removes cells which were assigned to synonymous mutations (i.e., those that do not change any amino acids and hence do not have structural effects).

Figure 6: (Semi-synthetic) The adapted version of Algorithm 3 is consistent for recovering $H$, $B_0$, and $\{i_k\}_{k=1}^K$ from semi-synthetic data. At each sample size, we generate 50 datasets. Panels: (a) error in estimating $H$, (b) error in estimating $B_0$, (c) intervention targets; the x-axis of each panel is the number of samples (note the logarithmic scale). In (a), we plot the mean of $\|\hat H - H\|_F$, the error in Frobenius norm. In (b), we plot the mean of $\|\hat B_0 - B_0\|_F$. In (c), we plot the fraction of models where all intervention targets were correctly estimated.

Semi-synthetic analysis. Our algorithm recovers the problem parameters for the semi-synthetic data; see Fig. 6.

Comparison to TCGA dataset. Our survival analysis is performed using the Cox proportional hazards model from the lifelines package (Davidson-Pilon, 2019). To correct for multiple hypothesis testing, we use the Benjamini-Hochberg procedure from the statsmodels package (Seabold & Perktold, 2010).
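For completeness, a sketch of how such an analysis can be run with these packages; the column names, and the choice to fit one univariate Cox model per latent variable, are our assumptions rather than details from the paper.

```python
import pandas as pd
from lifelines import CoxPHFitter
from statsmodels.stats.multitest import multipletests

def survival_analysis(df, latent_cols, duration_col="time", event_col="event"):
    """Fit one univariate Cox proportional hazards model per latent variable
    and apply Benjamini-Hochberg correction across the resulting p-values.
    Column names here are placeholders."""
    pvals = []
    for col in latent_cols:
        cph = CoxPHFitter()
        cph.fit(df[[duration_col, event_col, col]],
                duration_col=duration_col, event_col=event_col)
        pvals.append(cph.summary.loc[col, "p"])
    reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
    return pd.DataFrame({"latent": latent_cols, "p": pvals,
                         "q": qvals, "significant": reject})
```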