# Identification of Nonlinear Latent Hierarchical Models

Lingjing Kong1, Biwei Huang2, Feng Xie3, Eric Xing1,4, Yuejie Chi1, and Kun Zhang1,4
1 Carnegie Mellon University, 2 University of California San Diego, 3 Beijing Technology and Business University, 4 Mohamed bin Zayed University of Artificial Intelligence

Abstract

Identifying latent variables and causal structures from observational data is essential to many real-world applications involving biological data, medical data, and unstructured data such as images and languages. However, this task can be highly challenging, especially when observed variables are generated by causally related latent variables and the relationships are nonlinear. In this work, we investigate the identification problem for nonlinear latent hierarchical causal models in which observed variables are generated by a set of causally related latent variables, and some latent variables may not have observed children. We show that the identifiability of causal structures and latent variables (up to invertible transformations) can be achieved under mild assumptions: on causal structures, we allow for multiple paths between any pair of variables in the graph, which relaxes latent tree assumptions in prior work; on structural functions, we permit general nonlinearity and multi-dimensional continuous variables, alleviating existing work's parametric assumptions. Specifically, we first develop an identification criterion in the form of novel identifiability guarantees for an elementary latent variable model. Leveraging this criterion, we show that both causal structures and latent variables of the hierarchical model can be identified asymptotically by explicitly constructing an estimation procedure. To the best of our knowledge, our work is the first to establish identifiability guarantees for both causal structures and latent variables in nonlinear latent hierarchical models.

1 Introduction

Classical causal structure learning algorithms often assume no latent confounders. However, it is usually impossible to enumerate and measure all task-related variables in real-world scenarios. Neglecting latent confounders may lead to spurious correlations among observed variables. Hence, much effort has been devoted to handling the confounding problem. For instance, Fast Causal Inference and its variants [Spirtes et al., 2000, Pearl, 2000, Colombo et al., 2012, Akbari et al., 2021] leverage conditional independence constraints to locate possible latent confounders and estimate causal structures among observed variables, assuming no causal relationships among latent variables. This line of approaches ends up with an equivalence class, which usually consists of many directed acyclic graphs (DAGs). Later, several approaches have been proposed to tackle direct causal relations among latent variables with observational data. These approaches are built upon principles such as rank constraints [Silva et al., 2006a, Kummerfeld and Ramsey, 2016, Huang et al., 2022], matrix decomposition [Chandrasekaran et al., 2011, 2012, Anandkumar et al., 2013], high-order moments [Shimizu et al., 2009, Cai et al., 2019, Xie et al., 2020, Adams et al., 2021, Chen et al., 2022], copula models [Cui et al., 2018], and mixture oracles [Kivva et al., 2021]. Pearl [1988], Zhang [2004], Choi et al. [2011], Drton et al. [2017], Zhou et al. [2020], Huang et al.
[2020] extend such approaches to handle tree structures, where only one path is permitted between any pair of variables. Recently, Huang et al. [2022] propose to use rank-deficiency constraints to identify more general latent hierarchical structures. A common assumption behind these approaches is that either the causal relationships should be linear or the variables should be discrete. However, the linearity and discrete-variable assumptions are rather restrictive in real-world scenarios. Moreover, these approaches only focus on structure learning without identifying the latent variables and corresponding causal models. We defer a detailed literature review to Appendix A.

Figure 1: Examples of latent hierarchical graphs satisfying Condition 2.4. $x_i$ denotes observed variables and $z_j$ denotes latent variables. The node size represents the dimensionality of each variable. We note that our graph condition permits multiple paths between two latent variables (e.g., $z_1$ and $z_4$ in Figure 1b), thus more general than tree structures.

In this work, we identify the hierarchical graph structure and latent variables simultaneously for general nonlinear latent hierarchical causal models. Specifically, we first develop novel identifiability guarantees on a specific latent variable model which we refer to as the basis model (Figure 2). We then draw connections between the basis-model identifiability and the identification of nonlinear latent hierarchical models. Leveraging this connection, we develop an algorithm to localize and estimate latent variables and simultaneously learn causal structures underlying the entire system. We show the correctness of the proposed algorithm and thus obtain identifiability of both latent hierarchical causal structures and latent variables for nonlinear latent hierarchical models.

In sum, our main contributions are as follows.

- Analogous to independence tests in the PC algorithm [Spirtes et al., 2000] and rank-deficiency tests in Huang et al. [2022], we develop a novel identifiability theory (Theorem 3.2) as a fundamental criterion for locating and identifying latent variables in general nonlinear causal models.
- We show structure identification guarantees for latent hierarchical models admitting continuous multi-dimensional (c.f., one-dimensional and often discrete [Pearl, 1988, Choi et al., 2011, Huang et al., 2022, Xie et al., 2022]) variables, general nonlinear (c.f., linear [Pearl, 1988, Huang et al., 2022, Xie et al., 2022]) structural functions, and general graph structures (c.f., generalized trees [Pearl, 1988, Choi et al., 2011, Huang et al., 2022]).
- Under the same conditions, we provide identification guarantees for latent variables up to invertible transformations, which, to the best of our knowledge, is the first attempt in cases of nonlinear hierarchical models.
- We accompany our theory with an estimation method that can asymptotically identify the causal structure and latent variables for nonlinear latent hierarchical models and validate it on multiple synthetic and real-world datasets.

2 Nonlinear Latent Hierarchical Causal Models

In this section, we introduce notations and formally define the latent hierarchical causal model. We focus on causal models with directed acyclic graph (DAG) structures and denote the graph with $G$, which consists of the latent variable set $Z := \{z_1, \ldots, z_m\}$,
the observed variable set$^1$ $X := \{x_1, \ldots, x_n\}$, and the edge set $E$. Each variable is a random vector comprising potentially multiple dimensions, i.e., $x_i \in \mathbb{R}^{d_{x_i}}$ and $z_j \in \mathbb{R}^{d_{z_j}}$, where $d_{x_i}$ and $d_{z_j}$ stand for the dimensionalities of $x_i \in X$ and $z_j \in Z$, respectively. Both latent variables and observed variables are generated recursively by their latent parents:
$$x_i = g_{x_i}(\mathrm{Pa}(x_i), \varepsilon_{x_i}), \qquad z_j = g_{z_j}(\mathrm{Pa}(z_j), \varepsilon_{z_j}), \tag{1}$$
where $\mathrm{Pa}(z_j)$ represents all the parent variables of $z_j$ and $\varepsilon_{z_j}$ represents the exogenous variable generating $z_j$. Identical definitions apply to the notations involving $x_i$. All exogenous variables $\varepsilon_{x_i}, \varepsilon_{z_j}$ are independent of each other and could also comprise multiple dimensions. We now define a general latent hierarchical causal model with a causal graph $G := (Y, E)$ in Definition 2.1.

$^1$ We refer to leaf variables in the graph as observed variables for ease of exposition. The theory in this work is also applicable when some non-leaf variables happen to be observed, which we discuss in Appendix C.2.

Definition 2.1 (Latent Hierarchical Causal Models). The variable set $Y$ consists of the observed variable set $X$ and the latent variable set $Z$, and each variable is generated by Equation 1.

In light of the given definitions, we formally state the objectives of this work:
1. Structure identification: given observed variables $X$, we would like to recover the edge set $E$.
2. Latent-variable identification: given observed variables $X$, we would like to obtain a set of variables $\hat{Z} := \{\hat{z}_1, \ldots, \hat{z}_m\}$ where, for $i \in [m]$, $z_i$ and $\hat{z}_i$ are identical up to an invertible mapping, i.e., $\hat{z}_i = h_i(z_i)$, where $h_i$ is an invertible function.

Definition 2.1 gives a general class of latent hierarchical causal models. On the functional constraint, the general nonlinear function form (Equation 1) renders it highly challenging to identify either the graph structure or the latent variables. Prior work [Pearl, 1988, Choi et al., 2011, Xie et al., 2022, Huang et al., 2022] relies on either linear model conditions or discrete variable assumptions. In this work, we remove these constraints to address the general nonlinear case with continuous variables. On the structure constraint, identifying arbitrary causal structures is generally impossible with only observational data. For instance, tree-like structures are assumed in prior work [Pearl, 1988, Choi et al., 2011, Huang et al., 2022] for structural identifiability. In the following, we present relaxed structural conditions under which we show structural identifiability.

Definition 2.2 (Pure Children). $v_i$ is a pure child of another variable $v_j$ if $v_j$ is the only parent of $v_i$ in the graph, i.e., $\mathrm{Pa}(v_i) = \{v_j\}$.

Definition 2.3 (Siblings). Variables $v_i$ and $v_j$ are siblings if $\mathrm{Pa}(v_i) \cap \mathrm{Pa}(v_j) \neq \emptyset$.

Condition 2.4 (Structural Conditions).
(i) Each latent variable has at least 2 pure children.
(ii) There are no directed paths between any two siblings.

Intuitively, Condition 2.4-i allows each latent variable to leave a footprint sufficient for identification. This excludes some fundamentally unidentifiable structures. For instance, if latent variables $z_1$ and $z_2$ share the same children $X$, i.e., $z_1 \to X$ and $z_2 \to X$, the two latent variables cannot be identified without further assumptions, while pure children would help in this case. The pure child assumption naturally holds in many applications with many directly measured variables, including psychometrics, image analysis, and natural languages.
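To make the data-generating process in Equation 1 and Condition 2.4 concrete, the following is a minimal sketch of sampling from a small nonlinear latent hierarchical model whose graph satisfies Condition 2.4 (every latent variable has two pure observed children, and no two siblings are connected by a directed path). The particular graph, the two-dimensional variables, and the random two-layer MLPs used as structural functions are illustrative assumptions, not the models used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2  # dimensionality of every variable in this toy example

def mlp(d_in, d_out):
    """A random two-layer MLP playing the role of a structural function g."""
    W1 = rng.standard_normal((16, d_in))
    W2 = rng.standard_normal((d_out, 16))
    return lambda u: W2 @ np.tanh(W1 @ u)

# Graph: z1 -> z2, z1 -> z3; pure children: x1, x2 of z1; x3, x4 of z2; x5, x6 of z3.
names_latent = ["z2", "z3"]
names_obs = ["x1", "x2", "x3", "x4", "x5", "x6"]
g = {name: mlp(2 * d, d) for name in names_latent + names_obs}

def sample():
    eps = {name: rng.standard_normal(d) for name in ["z1"] + names_latent + names_obs}
    z = {"z1": eps["z1"]}                                     # root: z1 = eps_z1
    z["z2"] = g["z2"](np.concatenate([z["z1"], eps["z2"]]))   # z2 = g_z2(Pa(z2), eps_z2)
    z["z3"] = g["z3"](np.concatenate([z["z1"], eps["z3"]]))
    parents = {"x1": "z1", "x2": "z1", "x3": "z2", "x4": "z2", "x5": "z3", "x6": "z3"}
    # Only the leaves are observed (cf. footnote 1).
    return {x: g[x](np.concatenate([z[p], eps[x]])) for x, p in parents.items()}

samples = [sample() for _ in range(1000)]
```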
We require fewer pure children than prior work [Silva et al., 2006a, Kummerfeld and Ramsey, 2016] and place no constraints on the number of neighboring variables as in prior work [Huang et al., 2022, Xie et al., 2022]. Condition 2.4-ii avoids potential triangle structures in the latent hierarchical model. Triangles present significant obstacles for identification: in a triangle structure formed by $z_1 \to z_2 \to z_3$ and $z_1 \to z_3$, it is difficult to discern how $z_1$ influences $z_3$ without functional constraints, as there are two directed paths between them. We defer the discussion of how to use our techniques to handle more general graphs that include triangles to Appendix C.1. Condition 2.4-ii is widely adopted in prior work, especially those on tree structures [Pearl, 1988, Choi et al., 2011, Drton et al., 2017] (which cannot handle multiple undirected paths between variables as we do) and more recent work [Huang et al., 2022]. To the best of our knowledge, only Xie et al. [2022] manage to handle triangles in the latent hierarchical structure, which, however, heavily relies on strong assumptions on the function class (linear functions), the distribution (non-Gaussianity), and the structure (the existence of neighboring variables).

3 Identifiability of Nonlinear Latent Hierarchical Causal Models

This section presents the identifiability and identification procedure of causal structures and latent variables in nonlinear latent hierarchical causal models, from only observed variables. First, in Section 3.1, we introduce a latent variable model (i.e., the basis model in Figure 2), whose identifiability$^2$ underpins the identifiability of the hierarchical model. In Section 3.2, we establish the connection between the basis model and the hierarchical model. Leveraging this, in Section 3.3, we show by construction that both causal structures and latent variables are identifiable in the hierarchical model.

$^2$ We are concerned about representation vectors corresponding to each individual variable. Thus, identifiability up to invertible transforms suffices. Generally, stronger identifiability (e.g., component-wise) necessitates additional assumptions, e.g., multiple distributions [Khemakhem et al., 2020] or sparsity assumptions [Zheng et al., 2022].

Figure 2: The basis model. $v_1$ and $v_2$ are generated from nonlinear functions $g_1$ and $g_2$, which can be distinct and non-invertible individually. We also admit dependence between $z$ and $s_2$, as indicated by the dashed edge.

3.1 Basis Model Identifiability

We present the data-generating process of the basis model in Equation 2 and illustrate it in Figure 2:
$$v_1 = g_1(z, s_1), \qquad v_2 = g_2(z, s_2), \tag{2}$$
where $v_1 \in \mathcal{V}_1 \subseteq \mathbb{R}^{d_{v_1}}$, $v_2 \in \mathcal{V}_2 \subseteq \mathbb{R}^{d_{v_2}}$, $z \in \mathcal{Z} \subseteq \mathbb{R}^{d_z}$, $s_1 \in \mathcal{S}_1 \subseteq \mathbb{R}^{d_{s_1}}$, and $s_2 \in \mathcal{S}_2 \subseteq \mathbb{R}^{d_{s_2}}$ are variables consisting of potentially multiple components. The data-generating process admits potential dependence between $z$ and $s_2$ and dependence across dimensions in each variable. We denote the entire latent variable $c := [z, s_1, s_2] \in \mathcal{C}$ and the mapping from $c$ to $(v_1, v_2)$ as $g : \mathcal{C} \to \mathcal{V}_1 \times \mathcal{V}_2$. Below, we introduce notations and basic definitions that we use throughout this work.

Indexing: For a matrix $M$, we denote its $i$-th row as $M_{i,:}$, its $j$-th column as $M_{:,j}$, and its $(i, j)$ entry as $M_{i,j}$. Similarly, for an index set $\mathcal{I} \subseteq \{1, \ldots, m\} \times \{1, \ldots, n\}$, we denote $\mathcal{I}_{i,:} := \{j : (i, j) \in \mathcal{I}\}$ and $\mathcal{I}_{:,j} := \{i : (i, j) \in \mathcal{I}\}$.

Subspaces: We denote the subspace of $\mathbb{R}^n$ defined by an index set $\mathcal{S}$ as $\mathbb{R}^n_{\mathcal{S}} := \{z \in \mathbb{R}^n : z_i = 0 \text{ for all } i \notin \mathcal{S}\}$.
Support of matrix-valued functions: We define the support of a matrix-valued function $M(x) : \mathcal{X} \to \mathbb{R}^{m \times n}$ as $\mathrm{Supp}(M) := \{(i, j) : \exists x \in \mathcal{X} \text{ s.t. } M(x)_{i,j} \neq 0\}$, i.e., the set of indices whose corresponding entries are nonzero for some $x \in \mathcal{X}$.

In Theorem 3.2 below, we show that the latent variable $z$ is identifiable up to an invertible transformation by estimating a generative model $(\hat{p}_{z,s_2}, \hat{p}_{s_1}, \hat{g})$ according to Equation 2. We denote the Jacobian matrices of $g$ and $\hat{g}$ as $J_g$ and $J_{\hat{g}}$, and their supports as $\mathcal{G}$ and $\hat{\mathcal{G}}$, respectively. Further, we denote as $\mathcal{T}$ a set of matrices with the same support as that of the matrix-valued function $J_g^{-1}(c) J_{\hat{g}}(\hat{c})$.

Assumption 3.1 (Basis model conditions).
(i) [Differentiability & Invertibility]: The mapping $g(c) = (v_1, v_2)$ is a differentiable invertible function with a differentiable inverse.
(ii) [Subspace span]: For all $i \in \{1, \ldots, d_{v_1} + d_{v_2}\}$, there exist $\{c^{(\ell)}\}_{\ell=1}^{|\mathcal{G}_{i,:}|}$ and $T \in \mathcal{T}$, such that $\mathrm{span}(\{J_g(c^{(\ell)})_{i,:}\}_{\ell=1}^{|\mathcal{G}_{i,:}|}) = \mathbb{R}^{d_c}_{\mathcal{G}_{i,:}}$ and $[J_g(c^{(\ell)})\, T]_{i,:} \in \mathbb{R}^{d_c}_{\hat{\mathcal{G}}_{i,:}}$.
(iii) [Edge connectivity]: For all $j_z \in \{1, \ldots, d_z\}$, there exist $i_{v_1} \in \{1, \ldots, d_{v_1}\}$ and $i_{v_2} \in \{d_{v_1} + 1, \ldots, d_{v_1} + d_{v_2}\}$, such that $(i_{v_1}, j_z) \in \mathcal{G}$ and $(i_{v_2}, j_z) \in \mathcal{G}$.

Theorem 3.2. Under Assumption 3.1, if a generative model $(\hat{p}_{z,s_2}, \hat{p}_{s_1}, \hat{g})$ follows the data-generating process in Equation 2 and matches the true joint distribution:
$$p_{v_1, v_2}(v_1, v_2) = \hat{p}_{v_1, v_2}(v_1, v_2), \quad \forall (v_1, v_2) \in \mathcal{V}_1 \times \mathcal{V}_2, \tag{3}$$
then the estimated variable $\hat{z}$ and the true variable $z$ are equivalent up to an invertible transformation.

The proof can be found in Appendix B.1.

Interpretation. Intuitively, we can infer the latent variable $z$ shared by the two observed variables $v_1$ and $v_2$, in the sense that the estimated variable $\hat{z}$ contains all the information of $z$ without mixing in any information from $s_1$ and $s_2$. The identifiability up to an invertible mapping is extensively employed in identifiability theory [Casey and Westner, 2000, Hyvärinen and Hoyer, 2000, Le et al., 2011, Theis, 2006, von Kügelgen et al., 2021, Kong et al., 2022].

Discussion on assumptions. Assumption 3.1-i assumes that the mapping $g$ preserves the latent variables' information, guaranteeing the possibility of identification. Such assumptions are universally adopted in the existing literature on nonlinear causal model identification [Hyvarinen et al., 2019, Khemakhem et al., 2020, Kong et al., 2022, Lyu et al., 2022, von Kügelgen et al., 2021]. Assumption 3.1-ii guarantees that the influence of $z$ changes adequately across its domain, as discussed in prior work [Lachapelle et al., 2022, Zheng et al., 2022]. This eliminates situations where the Jacobian matrix is partially constant, causing it to insufficiently capture the connection between the observed and latent variables. This condition is specific to nonlinear functions, and a counterpart can be derived for linear cases [Zheng et al., 2022]. Assumption 3.1-iii is a formal characterization of the common-cause variable $z$ in the basis model. This assumption excludes the scenario where some dimensions of $z$ only influence one of $v_1$ and $v_2$, in which case these dimensions of $z$ are not truly the common cause of $v_1$ and $v_2$ and should be modeled as a separate variable rather than part of $z$. This is equivalent to causal minimality, which is necessary for causal structure identification for general function classes with observational data [Peters et al., 2017].

Comparison with prior work. In contrast to similar data-generating processes in prior work [Lyu et al., 2022, von Kügelgen et al., 2021], Theorem 3.2 allows the most general conditions.
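Before detailing this comparison, the following minimal sketch illustrates the objects that Assumption 3.1 refers to: it numerically estimates the Jacobian support $\mathcal{G} = \mathrm{Supp}(J_g)$ of a toy generator and checks the edge-connectivity condition (Assumption 3.1-iii). The toy generator, the use of torch.autograd, the number of random draws, and the threshold are all illustrative assumptions.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
dz, ds1, ds2 = 2, 2, 2
dc, dv1, dv2 = dz + ds1 + ds2, 4, 4
W1 = torch.randn(dv1, dz + ds1)   # v1 depends only on (z, s1)
W2 = torch.randn(dv2, dz + ds2)   # v2 depends only on (z, s2)

def g(c):
    """Toy generator g(c) = (v1, v2) for c = [z, s1, s2] (Equation 2)."""
    z, s1, s2 = c[:dz], c[dz:dz + ds1], c[dz + ds1:]
    v1 = torch.tanh(W1 @ torch.cat([z, s1]))
    v2 = torch.tanh(W2 @ torch.cat([z, s2]))
    return torch.cat([v1, v2])

# Estimate G = Supp(J_g): entries that are nonzero for some c, over random draws.
support = torch.zeros(dv1 + dv2, dc, dtype=torch.bool)
for _ in range(100):
    J = jacobian(g, torch.randn(dc))          # shape: (dv1 + dv2, dc)
    support |= J.abs() > 1e-6

# Edge connectivity (Assumption 3.1-iii): every z-column must be hit by at least
# one v1-row (indices 0..dv1-1) and at least one v2-row (indices dv1 onwards).
for j in range(dz):
    assert support[:dv1, j].any() and support[dv1:, j].any(), f"z[{j}] violates edge connectivity"
print(support.int())
```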
First, Theorem 3.2 does not necessitate the invertibility of $g_1(\cdot)$ and $g_2(\cdot)$ individually as in [Lyu et al., 2022, von Kügelgen et al., 2021]. The invertibility of $g_1$ and $g_2$ amounts to asking that $z$'s information is duplicated over $v_1$ and $v_2$. In contrast, Assumption 3.1-i only requires that $z$'s information is stored in $(v_1, v_2)$ jointly. Moreover, the generating functions $g_1(\cdot)$ and $g_2(\cdot)$ are assumed identical in [von Kügelgen et al., 2021], as opposed to being potentially distinct in Theorem 3.2. Additionally, Theorem 3.2 allows for dependence between $z$ and $s_2$, which is absent in [Lyu et al., 2022]. These relaxations expand the applicability of the basis model. For example, the distinction between $g_1$ and $g_2$ is indispensable in our application to the hierarchical model, as we cannot assume each latent variable generates its many children and descendants through the same function. Our proof technique is distinct from prior related work, which can be of independent interest to the community.

3.2 Local Identifiability of Latent Hierarchical Models

In this section, we build the connection between the basis and hierarchical models. Specifically, we show in Theorem 3.4 that a careful grouping of variables in the hierarchical model enables us to apply Theorem 3.2 to identify latent variables and their causal relations in a local scope. To this end, we modify Assumption 3.1 in basis models and obtain Assumption 3.3 for hierarchical models.

Assumption 3.3 (Hierarchical model conditions).
(i) [Differentiability]: Structural equations in Equation 1 are differentiable.
(ii) [Information-conservation]: Any $z \in Z$ and exogenous variable $\varepsilon$ can be expressed as differentiable functions of all observed variables, i.e., $z = f_z(X)$ and $\varepsilon = f_\varepsilon(X)$.
(iii) [Subspace span]: For each set $A$ that d-separates all its ancestors $\mathrm{Anc}(A)$ and the observed variables $X$, i.e., $X \perp \mathrm{Anc}(A) \mid A$, and for any $z_A \in \mathrm{Pa}(A)$, there exists an invertible mapping from a set of ancestors of $A$ containing $z_A$ to the separation set $A$, such that this mapping satisfies the subspace span condition (i.e., Assumption 3.1-ii).
(iv) [Edge connectivity]: The function between each latent variable $z$ and each of its children $z'$ has a Jacobian $J$, such that for all $j \in \{1, \ldots, d_z\}$, there exists $i \in \{1, \ldots, d_{z'}\}$, such that $(i, j)$ is in the support of $J$.

Theorem 3.4. In a latent hierarchical causal model that satisfies Condition 2.4 and Assumption 3.3, we consider $x_i \in X$ as $v_1$ and $X \setminus v_1$ as $v_2$ in the basis model (Figure 2).$^3$ With an estimation model $(\hat{p}_{z,s_2}, \hat{p}_{s_1}, \hat{g})$ that follows the data-generating process in Equation 2, the estimated $\hat{z}$ is a one-to-one mapping of the parent(s) of $v_1$, i.e., $\hat{z} = h(\mathrm{Pa}(v_1))$, where $h(\cdot)$ is an invertible function.

The proof can be found in Appendix B.2.

Interpretation. Theorem 3.4 shows that invoking Theorem 3.2 with the assignment $v_1 = x$ and $v_2 = X \setminus \{x\}$ can identify the parents of $x$ in the hierarchical model. Intuitively, we can identify latent variables one level above the current level $X$. Moreover, as the estimated variables are equivalent to the true variables up to one-to-one mappings, we take a step further in Section 3.3 to show that this procedure can be applied to the newly estimated variables iteratively to traverse the hierarchical model level by level. In the degenerate case where $v_1$ does not have any parents, i.e., it is the root of the hierarchical model, the identified variable $z$ in the basis model corresponds to $v_1$ itself: we can regard $s_1$ as a constant and $g_1(\cdot)$ as an identity function in Theorem 3.2.
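As a structural illustration of this local step, the sketch below instantiates the assignment $v_1 = x_i$ and $v_2 =$ the remaining observed variables, and fits an estimation model shaped like Equation 2: an encoder $\hat{f}$ produces $(\hat{z}, \hat{s}_1, \hat{s}_2)$, one decoder reconstructs $v_1$ from $(\hat{z}, \hat{s}_1)$, and the other reconstructs $v_2$ from $(\hat{z}, \hat{s}_2)$. The architecture, dimensions, optimizer, and the plain reconstruction surrogate are assumptions for illustration only; the paper's actual estimation models and training objective, which must also match the joint distribution (Equation 3), are specified in its Appendix E and are not reproduced here.

```python
import torch
import torch.nn as nn

def mlp(d_in, d_out, hidden=64):
    return nn.Sequential(nn.Linear(d_in, hidden), nn.LeakyReLU(),
                         nn.Linear(hidden, d_out))

class BasisModel(nn.Module):
    """Structural sketch of the estimation model: an encoder maps (v1, v2) to
    (z_hat, s1_hat, s2_hat); decoder g1_hat sees only (z_hat, s1_hat) and
    decoder g2_hat sees only (z_hat, s2_hat), mirroring Equation 2."""
    def __init__(self, d_v1, d_v2, d_z, d_s1, d_s2):
        super().__init__()
        self.dims = [d_z, d_s1, d_s2]
        self.encoder = mlp(d_v1 + d_v2, d_z + d_s1 + d_s2)
        self.g1 = mlp(d_z + d_s1, d_v1)
        self.g2 = mlp(d_z + d_s2, d_v2)

    def forward(self, v1, v2):
        z, s1, s2 = self.encoder(torch.cat([v1, v2], dim=-1)).split(self.dims, dim=-1)
        return z, self.g1(torch.cat([z, s1], dim=-1)), self.g2(torch.cat([z, s2], dim=-1))

def local_step(X, i, d_z, d_s1, d_s2, epochs=500):
    """Theorem 3.4's assignment: v1 = x_i, v2 = concatenation of the remaining
    observed variables. X is a list of (n_samples, d_xj) tensors."""
    v1 = X[i]
    v2 = torch.cat([X[j] for j in range(len(X)) if j != i], dim=-1)
    model = BasisModel(v1.shape[-1], v2.shape[-1], d_z, d_s1, d_s2)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        z, v1_hat, v2_hat = model(v1, v2)
        # Reconstruction is only a surrogate objective here; a faithful
        # implementation must also match the joint distribution p(v1, v2).
        loss = ((v1_hat - v1) ** 2).mean() + ((v2_hat - v2) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return model(v1, v2)[0].detach()   # z_hat: an estimate of h(Pa(x_i))
```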
$^3$ To avoid cluttering, we slightly abuse the bold lowercase font to represent either an individual vector or a vector set (which can be viewed as a concatenation).

Figure 3: Evolution of the active set $A$ in Algorithm 1. We mark the active set $A$ with shaded gray circles before each iteration of Algorithm 1, with Figures 3a, 3b, and 3c corresponding to iterations 1, 2, and 3. Before iteration 1, $A$ is set to $X$ by default. At iteration 1, $z_2$, $z_3$, and $z_4$ can be estimated by the basis model; however, only $z_4$ can be updated into $A$. Otherwise, directed paths would be introduced by $z_2$ and $z_3$.

Discussion on assumptions. Assumption 3.3-ii corresponds to Assumption 3.1-i for the basis model. Intuitively, it states that the information in the system is preserved without unnecessary duplication. As far as we know, the existing literature on causal variable identification for nonlinear models [Hyvarinen et al., 2019, Khemakhem et al., 2020, Kong et al., 2022, Lyu et al., 2022, Yao et al., 2022] universally assumes the invertibility of such a mapping. Without this assumption, we cannot guarantee the identification of latent variables from observed variables in the nonlinear case. Assumption 3.3-iii is an extension of Assumption 3.1-ii to the hierarchical case, with $A$ and $z_A$ playing the roles of $(v_1, v_2)$ and $z$, respectively. Assumption 3.3-iv excludes degenerate cases where an edge connects components that do not influence each other. This is equivalent to causal minimality and is generally considered necessary for causal identification with observational data [Peters et al., 2017]. For instance, we cannot distinguish $Y := f(X) + N_Y$ and $Y := c + N_Y$ if $f$ is allowed to differ from the constant $c$ only outside $X$'s support. Further, when $v_1 = x$ is a pure child of a latent variable $z$, Condition 2.4-i ensures that $v_2 := X \setminus \{v_1\}$ contains at least one pure child or descendant of $z$ to fulfill Assumption 3.1-iii.

Implications on measurement causal models. Despite being an intermediate step towards the global structure identification, Theorem 3.4 can be applied to a class of nonlinear measurement models$^4$ with arbitrary latent structures and identify all the latent variables. Then, the latent structures in the measurement model can be identified by performing existing causal discovery algorithms, such as the PC algorithm [Spirtes et al., 2001], on the identified latent variables. We leave the detailed presentation of this application as future work.

3.3 Global Identifiability of Latent Hierarchical Causal Models

Equipped with the local identifiability (i.e., Theorem 3.4), the following theorem shows that all causal structures and latent variables are identifiable in the hierarchical model.

Theorem 3.5. Under the assumptions in Theorem 3.4, all latent variables $Z$ in the hierarchical model can be identified up to one-to-one mappings, and the causal structure $G$ can also be identified.

Comparison to prior work. Theorem 3.5 handles general nonlinear cases, whereas prior work [Pearl, 1988, Choi et al., 2011, Huang et al., 2022, Xie et al., 2022] is limited to linear-function or discrete-latent-variable conditions.
Structure-wise, our identifiability results apply to latent structures with V-structures and certain triangle structures (see the discussion in Appendix C.1) beyond the generalized trees studied in prior work [Choi et al., 2011, Pearl, 1988, Huang et al., 2022], require fewer pure children and no neighboring variables in comparison with [Xie et al., 2022, Huang et al., 2022], and can determine directions for each edge (c.f., Markov-equivalent classes in Huang et al. [2022]).

Proof sketch with search procedures. The proof can be found in Appendix B.3. In particular, we develop a concrete algorithm (i.e., Algorithm 1) and show that it can successfully identify the causal structure and latent variables asymptotically. We give a proof sketch of Theorem 3.5 below by navigating through Algorithm 1 and illustrate it with an example in Figure 3.

Stage 1: identifying parents with the basis model (line 4 - line 5). As shown in Theorem 3.4, the basis model can identify the latent parents of leaf-level variables in the hierarchy. In Figure 3a, we can identify $z_2$ when applying the basis model with the assignment $v_1 := \{x_1\}$ and $v_2 := X \setminus v_1$. Naturally, the basic idea of Algorithm 1 is to iteratively apply the basis model to the most recently identified latent variables to further identify their parents, which is the purpose of Stage 1 (line 4 - line 5). In Algorithm 1, we define as the active set $A$ the set of variables to which we apply the basis model. For example, $A$ equals $X$ in the first round (Figure 3a) and becomes $\{x_1, x_2, x_3, z_4, x_7, x_8, x_9\}$ in the second round (Figure 3b).

$^4$ We refer to Silva et al. [2006b] for a general measurement model definition. Here, we consider a popular type of measurement models that has been widely used in the literature (see Xie et al. [2020]), in which observed variables do not cause other variables.

Stage 2: merging duplicate variables (line 6). As multiple variables in $A$ can share a parent, the dictionary $P(\cdot)$ may contain multiple variables that are one-to-one mappings of each other, which would obscure the true causal structure and increase the algorithm's complexity. Stage 2 (line 6) detects such duplicates and represents them with one variable. In Figure 3a, setting $v_1$ to any of $x_1$, $x_2$, and $x_3$ would yield an equivalent variable of $z_2$. We merge the three equivalents by randomly selecting one.

Stage 3: detecting and merging super-variables (line 7 - line 9). Due to the potential existence of V-structures, variables in $A$ may have multiple parents and produce super-variables in $P$. For instance, at the second iteration of Algorithm 1 (i.e., Figure 3b), the basis model will be run on $v_1 = z_4$ and $v_2 = \{x_1, x_2, x_3, x_7, x_8, x_9\}$ and output a variable equivalent to the concatenation $(z_2, z_3)$. Leaving this super-variable untouched would be problematic, as we would generate a false causal structure $z \to z_4$, where $z$ is the estimated super-variable $(z_2, z_3)$, rather than recognizing that $z_4$ is the child of the two already identified variables $z_2$ and $z_3$. Line 7 to line 9 detect such super-variables by testing whether each variable $\hat{z}$ in $P$ is equivalent to a union of other variables in $P$. If such a union $Z := \{\hat{z}_1, \ldots, \hat{z}_m\}$ exists, we replace $\hat{z}$ with $Z$ in all its occurrences. In Figure 3b, we would split the variable equivalent to $[z_2, z_3]$ into the variables $z_2$ and $z_3$ individually.
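Both the duplicate merging in Stage 2 and the super-variable handling in Stage 3 rely on pairwise "perfect prediction" tests between estimated variables. A minimal sketch of one possible implementation of such a test is given below; the kernel-ridge regressor and the R² threshold are assumptions for illustration, not choices specified in the paper.

```python
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import r2_score

def predicts(z_a, z_b, threshold=0.95):
    """Return True if the estimate z_a can (approximately) perfectly predict z_b.
    z_a, z_b: arrays of shape (n_samples, dim)."""
    reg = KernelRidge(kernel="rbf", alpha=1e-3).fit(z_a, z_b)
    return r2_score(z_b, reg.predict(z_a)) > threshold

def compare(z_a, z_b):
    """Pairwise logic used in Stages 2 and 3: mutual prediction indicates
    duplicates (merge them); one-directional prediction indicates that z_a
    carries strictly more information, i.e., a super-variable candidate."""
    a2b, b2a = predicts(z_a, z_b), predicts(z_b, z_a)
    if a2b and b2a:
        return "duplicates"
    if a2b and not b2a:
        return "super-variable candidate"
    return "unrelated"
```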
If $\hat{z}$ is tested to be a super-variable, i.e., it can perfectly predict another variable $\hat{z}'$ in $P$ while $\hat{z}'$ cannot predict $\hat{z}$ perfectly, and the equivalent union cannot be found, we track $\hat{z}$ in line 9 to prevent it from being updated into $A$ at line 16.

Stage 4: detecting and avoiding directed paths in $A$ (line 12 - line 14). Ideally, we would like to repeat line 4 to line 9 until reaching the root of the hierarchical model. Unfortunately, such an approach can be problematic, as this would cause variables in the active set $A$ to have directed edges among them, whereas Theorem 3.4 applies to the leaf variable set $X$, which contains no directed edges. In Figure 3a, $P$ would contain $\{z_2, z_3, z_4\}$, as each of these latent variables has pure observed children in $X$. However, due to directed paths within $P$, i.e., $z_2 \to z_4$ and $z_3 \to z_4$, we cannot directly substitute $X$ with $\{z_2, z_3, z_4\}$ in $A$. To resolve this dilemma, we proactively detect directed paths emerging with newly estimated variables and defer the local update of such estimated variables to eliminate directed paths. For directed-path detection, we introduce a corollary of Theorem 3.4 as Corollary 3.6.

Corollary 3.6. Under the assumptions in Theorem 3.4, for any $z \in \mathrm{Pa}(A)$, where $A$ is the active set in Algorithm 1, we consider $v_1 := z$ and $v_2 := A \setminus (\widetilde{\mathrm{Ch}}(z) \cap A)$, where $\widetilde{\mathrm{Ch}}(z)$ is a strict subset of $z$'s children (i.e., when the active set $A$ contains at least one child of $z$'s); estimating the basis model yields $\hat{z}$ equivalent to $z$ up to a one-to-one mapping.

Corollary 3.6 can be obtained by setting $v_1$ as an ancestor of $v_2$ and $s_1$ as a degenerate variable (i.e., a deterministic quantity) in Theorem 3.4. Leveraging Corollary 3.6, we can detect whether each newly identified latent variable in $P$ would introduce directed paths into $A$ if it were substituted in. If directed paths exist, the variable $\hat{z}$ would contain the same information as $v_1 := z$, which prediction tests can evaluate. In this event, we suppress the update of $z$ at this iteration; that is, we still keep its children in $A$. This directed-path detection is conducted in lines 12-14, after properly grouping variables that share children in lines 10-11. As shown in Figure 3b, $z_2$ and $z_3$ are not placed in $A$, even though they are found in the first iteration. This update only happens when $z_4$ has been placed in $A$ at the second iteration. Generally, a latent variable enters $A$ only if all its children reside in $A$. We can show that $A$ contains all the information of unidentified latent variables: $A$ d-separates the latent variables that have not been placed in $A$ from those that were once in $A$. Equipped with such a procedure, we can identify the hierarchical model iteratively until completion. We discuss Algorithm 1's complexity in Appendix D.

4 Experimental Results

In this section, we present experiments to corroborate our theoretical results in Section 3. We start with the problem of recovering the basis model in Section 4.2, which is the foundation of the overall identifiability. In Section 4.3, we present experiments for hierarchical models on a synthetic dataset and two real-world datasets.

Algorithm 1: Identification of Latent Hierarchical Models. $A$: the active set; $X$: the observed variable set; $P$/JointP: dictionaries that store the relations between variables in $A$ and estimated variables.
1: Initialize the active set: $A \leftarrow X$;
2: while $A \neq \emptyset$ do
3:   initialize an empty set $R$ and empty dictionaries $P$, JointP;
4:   for each active variable $a \in A$ do
5:     estimate the basis model with $v_1 = a$ and $v_2 = A \setminus \{a\}$ to obtain $\hat{z}$, and $P \leftarrow P \cup \{(a, \hat{z})\}$;
6:   merge equivalent variables in $P$;
7:   for each variable $\hat{z}$ in $P$ do
8:     if $\exists\, \hat{z}' \in P$ that $\hat{z}$ can perfectly predict but $\hat{z}'$ cannot predict $\hat{z}$, then
9:       if $\exists\, Z \subseteq P \setminus \{\hat{z}\}$ that perfectly predicts $\hat{z}$, then replace $\hat{z}$ with $Z$; else: $R \leftarrow R \cup \{\hat{z}\}$;
10:  cluster $\hat{z}$ variables in $P$ into $\{Z_i\}_{i=1}^{m}$ such that $Z_i$ is either a singleton or contains spouse variables;
11:  store each cluster $Z_i$ and its children set $H_i$ as a pair $(Z_i, H_i)$ into JointP;
12:  for each pair $(Z_i, H_i) \in$ JointP do
13:    estimate the basis model with $v_1 = Z_i$ and $v_2 = A \setminus H_i$ to obtain a variable $\hat{z}_{\text{test}}$;
14:    if $\exists\, z' \in Z_i$ such that $\hat{z}_{\text{test}}$ can perfectly predict $z'$, then $R \leftarrow R \cup Z_i$;
15:  for each pair $(Z_i, H_i) \in$ JointP do
16:    if no variable in $Z_i$ has been tracked by $R$, i.e., $Z_i \cap R = \emptyset$, then substitute $H_i$ with $Z_i$ in $A$;
17:  remove variable $a$ from $A$ if $a \perp A \setminus \{a\}$;
18: Return: all the past active sets $A$ and parent sets $P$.

Figure 4: Evaluated hierarchical models: (a) a balanced tree (latent $z_1$, $z_2$, $z_3$ over observed $x_1$-$x_4$; estimation $R^2$ of 0.78 ± 0.03, 0.88 ± 0.09, and 0.90 ± 0.06, respectively); (b) with a V-structure (latent $z_1$, $z_2$, $z_3$ over observed $x_1$-$x_5$; $R^2$ of 0.76 ± 0.02, 0.87 ± 0.11, and 0.91 ± 0.00); (c) an unbalanced tree (latent $z_1$-$z_4$ over observed $x_1$-$x_5$; $R^2$ of 0.76 ± 0.09, 0.86 ± 0.09, 0.88 ± 0.07, and 0.90 ± 0.02). We denote the estimation $R^2$ scores around the corresponding latent variables. We can observe that all latent variables can be identified decently, justifying our theoretical results.

4.1 Experimental Setup

Synthetic data generation. For the basis model (Figure 2), we sample $z \sim \mathcal{N}(0, I)$, $s_1 \sim \mathcal{N}(0, I)$, and $s_2 \sim \mathcal{N}(Az + b, I)$, where the dependence between $z$ and $s_2$ is implemented by a randomly constructed matrix $A$ and bias $b$. We choose the true mixing function $g$ as a multilayer perceptron (MLP) with Leaky-ReLU activations and well-conditioned weights to facilitate invertibility. We apply element-wise $\max\{z, 0\}$ to $z$ before inputting $[z, s_1]$ to $g_1$ and element-wise $\min\{z, 0\}$ to $z$ before inputting $[z, s_2]$ to $g_2$, so that $v_1$ is generated by the positive elements of $z$ and $v_2$ is generated by the negative elements of $z$. This way, $g_1$ and $g_2$ are jointly invertible but not individually invertible. For latent hierarchical models (Figure 4), we sample each exogenous variable as $\varepsilon \sim \mathcal{N}(0, I)$, and each endogenous variable $z$ is endowed with a distinct generating function $g_z$ parameterized by an MLP, i.e., $z = g_z(\mathrm{Pa}(z), \varepsilon_z)$.

Real-world datasets. We adopt two real-world datasets with hierarchical generating processes, namely a personality dataset and a digit dataset. The personality dataset was curated through an interactive online personality test [Project, 2019]. Participants were requested to provide a rating for each question on a five-point scale. Each question was designed to be associated with one of the five personality attributes, i.e., agreeableness, openness, conscientiousness, extraversion, and neuroticism. The corresponding answer scores are denoted as $a_i$, $o_j$, etc., as indicated in Figure 5. We use responses (around 19,500 for each question) to six questions for each of the five personality attributes. For the digit dataset, we construct a multi-view digit dataset from MNIST [Deng, 2012]. We first randomly crop each image to obtain two intermediate views and then randomly rotate each of the intermediate views independently to obtain four views. This procedure gives rise to a latent structure similar to that in Figure 4a.
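A minimal sketch of this multi-view construction is given below; the crop size, crop offsets, and rotation range are illustrative assumptions (the exact values are specified in the paper's Appendix E and not reproduced here).

```python
import random
import torchvision
import torchvision.transforms.functional as TF

random.seed(0)
mnist = torchvision.datasets.MNIST(root="./data", train=True, download=True)

def four_views(img):
    """Crop one digit into two intermediate views, then rotate each intermediate
    view twice (independently) to obtain four observed views v1, ..., v4."""
    views = []
    for _ in range(2):                                   # two random crops
        top, left = random.randint(0, 6), random.randint(0, 6)
        crop = TF.resize(TF.crop(img, top, left, 22, 22), [28, 28])
        for _ in range(2):                               # two random rotations per crop
            views.append(TF.rotate(crop, angle=random.uniform(-45.0, 45.0)))
    return views

# Views within a pair share the crop factor; all four share the digit identity,
# mirroring the balanced-tree structure of Figure 4a.
multiview_data = [four_views(img) for img, _ in list(mnist)[:1000]]
```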
We feed images to a pretrained ResNet-44 [He et al., 2016] for dimensionality reduction ($28 \times 28 \to 64$) and run our algorithm on the produced features.

Estimation models. We implement the estimation model $(\hat{g}_1, \hat{g}_2, \hat{f})$ following Equation 2, where $\hat{f}$ can be seen as an encoder that transforms $(v_1, v_2)$ to the latent space and $(\hat{g}_1, \hat{g}_2)$ act as the decoders generating $v_1$ and $v_2$, respectively. We parameterize each module with an MLP with Leaky-ReLU activation. Training configurations can be found in Appendix E.

|  | $d_z = d_{s_1} = d_{s_2} = 2$ | $d_z = d_{s_1} = 2$, $d_{s_2} = 3$ | $d_z = d_{s_1} = d_{s_2} = 4$ | $d_z = d_{s_1} = 4$, $d_{s_2} = 6$ |
|---|---|---|---|---|
| Joint invertibility (Ours) | 0.93 ± 0.09 | 0.90 ± 0.10 | 0.89 ± 0.02 | 0.83 ± 0.12 |
| Individual invertibility | 0.67 ± 0.06 | NA | 0.67 ± 0.09 | NA |

Table 1: The basis model identifiability. We show the identifiability for $z$ under various data dimensionalities $d_z$, $d_{s_1}$, and $d_{s_2}$ for $z$, $s_1$, $s_2$. We compare our results with prior work that assumes both $g_1$ and $g_2$ are invertible individually. NA indicates that the model is not applicable when the dimensionalities of $s_1$ and $s_2$ differ. The results are averaged over 30 random seeds.

Figure 5: The causal structure of the personality dataset learned by our method. The letter in each variable name indicates the personality attribute. Subscripts correspond to question indices. Some observed variables are not shown in the figure, as they are not clustered with other variables, suggesting their distant relation to the system.

Evaluation metrics. Due to the block-wise nature of our identifiability results, we adopt the coefficient of determination $R^2$ between the estimated variables $\hat{z}$ and the true variables $z$, where $R^2 = 1$ suggests that the estimated variable $\hat{z}$ can perfectly capture the variation of the true variable $z$. We apply kernel regression with a Gaussian kernel to estimate the nonlinear mapping. Such a protocol is employed in related work [von Kügelgen et al., 2021, Kong et al., 2022]. We repeat each experiment over at least 3 random seeds.

4.2 Basis Model Identification

The results for the basis model are presented in Table 1 and Figure 9. We vary the number of components of each latent partition ($d_z$, $d_{s_1}$, $d_{s_2}$). We can observe that the model with individual invertibility (as assumed in prior work [von Kügelgen et al., 2021, Lyu et al., 2022]) can only capture around half of the information in $z$, due to their assumption that both $g_1$ and $g_2$ are invertible, which is violated in this setup. In contrast, our estimation approach can largely recover the information of $z$ across a range of latent component dimensions, verifying our Theorem 3.2. Moreover, prior work [von Kügelgen et al., 2021] assumes $g_1 = g_2$ and therefore is not applicable when the dimensionalities of $s_1$ and $s_2$ differ (e.g., $d_{s_1} = 2$, $d_{s_2} = 3$), hence the NA entries in the table. Figure 9 in Appendix E.2 shows scatter plots of the true and the estimated components with $d_z = d_{s_1} = d_{s_2} = 2$. We can observe that components of $z$ and those of $\hat{z}$ are highly correlated, suggesting that the information of $z$ is indeed restored. In contrast, $\hat{z}$ contains very little information of $s_1$, consistent with our theory that a desirable disentanglement is attainable. Additional experiments on varying sample sizes can be found in Appendix E.2.

4.3 Hierarchical Model Identification

Synthetic data. We present the evaluation of Algorithm 1 on latent hierarchical models in Figure 4, Table 2, and Tables 4 and 5 in Appendix E.3.
In Figure 4, we can observe that all variables can be estimated decently despite a slight recovery loss from a lower to a higher level. Table 2 presents the pairwise prediction scores within pairs of estimations while learning the V-structure model in Figure 4b. We can observe that the scores within the sibling pairs $(x_1, x_2)$ and $(x_4, x_5)$ are much higher than those within non-sibling pairs. Notably, the estimate from $v_1 = x_3$ can perform accurate prediction over the other estimates, whereas the other estimates fail to capture it faithfully. This is consistent with Theorem 3.4: the basis model with $v_1 = x_3$ will output a variable equivalent to the concatenation of $z_2$ and $z_3$, explaining the asymmetric prediction performances. These results empirically corroborate Theorem 3.5.

Personality dataset. From Figure 5, we can observe that our nonlinear model can correctly cluster the variables associated with the same attribute together in the first level, which is consistent with the intentions of these questions. It suggests that conscientiousness, agreeableness, and neuroticism are closely related at the intermediate level, whereas extraversion and neuroticism are remotely related. Some observed variables are not shown in the figure, as they are not clustered with other variables, indicating that they are not closely related to the system. This may provide insights into questionnaire design.

Figure 6: Multi-view digit dataset results. (a) The learned causal structure: $(v_1, v_2)$ and $(v_3, v_4)$ form two clusters, as they share the aspect-ratio factor within the cluster while being distinct in the rotation-angle factor. (b) Pairwise predictions: each box $(a, b)$ shows the $R^2$ score obtained by applying the estimated variable produced by treating one specific view $a$ as $v_1$ to predict the estimated variable produced by treating view $b$ as $v_1$.

|  | $v_1$ | $v_2$ | $v_3$ | $v_4$ |
|---|---|---|---|---|
| $v_1$ | - | 0.69 ± 0.001 | 0.51 ± 0.01 | 0.51 ± 0.005 |
| $v_2$ | 0.65 ± 0.02 | - | 0.48 ± 0.01 | 0.47 ± 0.03 |
| $v_3$ | 0.48 ± 0.02 | 0.42 ± 0.03 | - | 0.78 ± 0.05 |
| $v_4$ | 0.53 ± 0.08 | 0.48 ± 0.02 | 0.75 ± 0.05 | - |

|  | $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ |
|---|---|---|---|---|---|
| $x_1$ | - | 0.85 ± 0.000 | 0.53 ± 0.01 | 0.57 ± 0.002 | 0.55 ± 0.003 |
| $x_2$ | 0.83 ± 0.006 | - | 0.52 ± 0.002 | 0.54 ± 0.001 | 0.53 ± 0.000 |
| $x_3$ | 0.90 ± 0.002 | 0.88 ± 0.002 | - | 0.86 ± 0.001 | 0.90 ± 0.006 |
| $x_4$ | 0.57 ± 0.001 | 0.54 ± 0.002 | 0.55 ± 0.003 | - | 0.86 ± 0.003 |
| $x_5$ | 0.55 ± 0.006 | 0.56 ± 0.001 | 0.55 ± 0.002 | 0.83 ± 0.003 | - |

Table 2: Pairwise predictions among estimated variables in Figure 4b. Each box $(a, b)$ shows the $R^2$ score obtained by applying the estimated variable produced by treating $a$ as $v_1$ to predict that produced by treating $b$ as $v_1$. We observe that the prediction scores within sibling pairs are noticeably higher than those of other pairs, suggesting a decent structure estimation. In particular, the estimate from $v_1 = x_3$ can predict the other estimates accurately, whereas not the other way round, confirming our theory that $v_1 = x_3$ will recover the information of both $z_2$ and $z_3$. The results are averaged over 30 random seeds.

Digit dataset. Figure 6a and Table 6b present the causal structure learned from the multi-view digit dataset. We can observe that we can automatically cluster the two views sharing more latent factors. This showcases that our theory and approach can handle high-dimensional variables, whereas prior causal structure learning work mostly assumes that all variables are one-dimensional.

5 Conclusion

In this work, we investigate the identifiability of causal structures and latent variables in nonlinear latent hierarchical models.
We provide identifiability theory for both the causal structures and latent variables without assuming linearity/discreteness as in prior work [Pearl, 1988, Choi et al., 2011, Huang et al., 2022, Xie et al., 2022], while admitting structures more general than (generalized) latent trees [Pearl, 1988, Choi et al., 2011, Huang et al., 2022]. Together with the theory, we devise an identification algorithm and validate it across multiple synthetic and real-world datasets.

We have shown that our algorithm yields informative results across various datasets. However, it is essential to acknowledge that its primary role is as a theoretical device for our identification proof. It may not be well-suited to large-scale datasets, e.g., ImageNet, due to its nature as a discrete search algorithm. In future research, we aim to integrate our theoretical insights into scalable continuous-optimization-based algorithms [Zheng et al., 2018] and deep learning. We believe that our work facilitates the understanding of the underlying structure of highly complex and high-dimensional datasets, which is the foundation for creating more interpretable, safer, and principled machine learning systems.

Acknowledgment. We thank anonymous reviewers for their constructive feedback. The work of LK and YC is supported in part by NSF under the grants CCF-1901199 and DMS-2134080. This project is also partially supported by NSF Grant 2229881, the National Institutes of Health (NIH) under Contract R01HL159805, a grant from Apple Inc., a grant from KDDI Research Inc., and generous gifts from Salesforce Inc., Microsoft Research, and Amazon Research.

References

J. Adams, N. Hansen, and K. Zhang. Identification of partially observed linear causal models: Graphical conditions for the non-gaussian and heterogeneous cases. Advances in Neural Information Processing Systems, 34, 2021.
K. Ahuja, D. Mahajan, Y. Wang, and Y. Bengio. Interventional causal representation learning, 2023.
S. Akbari, E. Mokhtarian, A. Ghassami, and N. Kiyavash. Recursive causal structure learning in the presence of latent variables and selection bias. In Advances in Neural Information Processing Systems, volume 34, pages 10119-10130, 2021.
A. Anandkumar, D. Hsu, A. Javanmard, and S. Kakade. Learning linear bayesian networks with latent variables. In International Conference on Machine Learning, pages 249-257, 2013.
S. Buchholz, G. Rajendran, E. Rosenfeld, B. Aragam, B. Schölkopf, and P. Ravikumar. Learning linear causal representations from interventions under general nonlinear mixing, 2023.
R. Cai, F. Xie, C. Glymour, Z. Hao, and K. Zhang. Triad constraints for learning causal structure of latent variables. In Advances in Neural Information Processing Systems, pages 12863-12872, 2019.
M. A. Casey and A. Westner. Separation of mixed audio sources by independent subspace analysis. In ICMC, pages 154-161, 2000.
V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572-596, 2011.
V. Chandrasekaran, P. A. Parrilo, and A. S. Willsky. Latent variable graphical model selection via convex optimization. Annals of Statistics, 40(4):1935-1967, 2012.
Z. Chen, F. Xie, J. Qiao, Z. Hao, K. Zhang, and R. Cai. Identification of linear latent variable model with arbitrary distribution. In Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI), 2022.
M. J. Choi, V. Y. Tan, A. Anandkumar, and A. S. Willsky. Learning latent tree graphical models. Journal of Machine Learning Research, 12:1771-1812, 2011.
D. Colombo, M. H. Maathuis, M. Kalisch, and T. S. Richardson. Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics, pages 294-321, 2012.
R. Cui, P. Groot, M. Schauer, and T. Heskes. Learning the causal structure of copula models with latent variables. In UAI, pages 188-197. AUAI Press, 2018.
L. Deng. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141-142, 2012.
M. Drton, S. Lin, L. Weihs, and P. Zwiernik. Marginal likelihood and model selection for gaussian latent tree and forest models. Bernoulli, 23(2):1202-1232, 2017.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770-778, 2016.
B. Huang, C. J. H. Low, F. Xie, C. Glymour, and K. Zhang. Latent hierarchical causal structure discovery with rank constraints. arXiv preprint arXiv:2210.01798, 2022.
F. Huang, N. U. Naresh, I. Perros, R. Chen, J. Sun, and A. Anandkumar. Guaranteed scalable learning of latent tree models. In Uncertainty in Artificial Intelligence, pages 883-893. PMLR, 2020.
A. Hyvärinen and P. Hoyer. Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces. Neural computation, 12(7):1705-1720, 2000.
A. Hyvarinen, H. Sasaki, and R. Turner. Nonlinear ica using auxiliary variables and generalized contrastive learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 859-868. PMLR, 2019.
Y. Jiang and B. Aragam. Learning nonparametric latent causal graphs with unknown interventions, 2023.
I. Khemakhem, D. Kingma, R. Monti, and A. Hyvarinen. Variational autoencoders and nonlinear ica: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pages 2207-2217. PMLR, 2020.
B. Kivva, G. Rajendran, P. Ravikumar, and B. Aragam. Learning latent causal graphs via mixture oracles. Advances in Neural Information Processing Systems, 34, 2021.
L. Kong, S. Xie, W. Yao, Y. Zheng, G. Chen, P. Stojanov, V. Akinwande, and K. Zhang. Partial disentanglement for domain adaptation. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 11455-11472. PMLR, 17-23 Jul 2022. URL https://proceedings.mlr.press/v162/kong22a.html.
E. Kummerfeld and J. Ramsey. Causal clustering for 1-factor measurement models. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1655-1664. ACM, 2016.
S. Lachapelle, P. R. López, Y. Sharma, K. Everett, R. L. Priol, A. Lacoste, and S. Lacoste-Julien. Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ica, 2022.
Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR 2011, pages 3361-3368. IEEE, 2011.
W. Liang, A. Kekić, J. von Kügelgen, S. Buchholz, M. Besserve, L. Gresele, and B. Schölkopf. Causal component analysis, 2023.
Q. Lyu, X. Fu, W. Wang, and S. Lu. Understanding latent correlation-based multiview learning and self-supervision: An identifiability perspective. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=5FUq05QRc5b.
J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, NY, USA, 2000. ISBN 0-521-77362-8.
J. Peters, D. Janzing, and B. Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, Cambridge, MA, USA, 2017.
O.-S. P. Project. Big five personality test, 2019. URL https://openpsychometrics.org/tests/IPIP-BFFM/. Accessed: 2023-5-16.
S. Shimizu, P. O. Hoyer, and A. Hyvärinen. Estimation of linear non-gaussian acyclic models for latent factors. Neurocomputing, 72(7-9):2024-2027, 2009.
R. Silva, R. Scheines, C. Glymour, and P. Spirtes. Learning the structure of linear latent variable models. Journal of Machine Learning Research, 7(Feb):191-246, 2006a.
R. Silva, R. Scheines, C. Glymour, and P. Spirtes. Learning the structure of linear latent variable models. Journal of Machine Learning Research, 7:191-246, 2006b.
P. Spirtes, C. N. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2000.
P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd edition, 2001.
C. Squires, A. Seigal, S. Bhate, and C. Uhler. Linear causal disentanglement via interventions, 2023.
F. Theis. Towards a general independent subspace analysis. Advances in Neural Information Processing Systems, 19:1361-1368, 2006.
B. Varici, E. Acarturk, K. Shanmugam, A. Kumar, and A. Tajer. Score-based causal representation learning with interventions, 2023.
J. von Kügelgen, Y. Sharma, L. Gresele, W. Brendel, B. Schölkopf, M. Besserve, and F. Locatello. Self-supervised learning with data augmentations provably isolates content from style. arXiv preprint arXiv:2106.04619, 2021.
F. Xie, R. Cai, B. Huang, C. Glymour, Z. Hao, and K. Zhang. Generalized independent noise condition for estimating latent variable causal graphs. In Advances in Neural Information Processing Systems, pages 14891-14902, 2020.
F. Xie, B. Huang, Z. Chen, Y. He, Z. Geng, and K. Zhang. Identification of linear non-gaussian latent hierarchical structure. In International Conference on Machine Learning, pages 24370-24387. PMLR, 2022.
W. Yao, Y. Sun, A. Ho, C. Sun, and K. Zhang. Learning temporally causal latent processes from general temporal data. arXiv preprint arXiv:2110.05428, 2021.
W. Yao, G. Chen, and K. Zhang. Learning latent causal dynamics. arXiv preprint arXiv:2202.04828, 2022.
J. Zhang, C. Squires, K. Greenewald, A. Srivastava, K. Shanmugam, and C. Uhler. Identifiability guarantees for causal disentanglement from soft interventions, 2023.
N. L. Zhang. Hierarchical latent class models for cluster analysis. The Journal of Machine Learning Research, 5:697-723, 2004.
X. Zheng, B. Aragam, P. K. Ravikumar, and E. P. Xing. Dags with no tears: Continuous optimization for structure learning. Advances in Neural Information Processing Systems, 31, 2018.
Y. Zheng, I. Ng, and K. Zhang. On the identifiability of nonlinear ica: Sparsity and beyond. arXiv preprint arXiv:2206.07751, 2022.
C. Zhou, X. Wang, and J. Guo. Learning mixed latent tree models. Journal of Machine Learning Research, 21(249):1-35, 2020.
Appendix for "Identification of Nonlinear Latent Hierarchical Models"

A Detailed Literature Review

Previous causal discovery approaches that allow latent confounders and causal relationships among those latent variables either assume linear causal relationships or assume discrete data. Representative approaches along this line include rank-deficiency-based methods, matrix-decomposition-based methods, generalized-independent-noise-condition-based methods, and mixture-oracle-based methods.

(1) Rank deficiency-based. By testing the rank deficiency of cross-covariance matrices over observed variables, one is able to locate latent variables and identify the causal relationships among them in linear-Gaussian models. Silva et al. [2006a], Kummerfeld and Ramsey [2016] make use of the Tetrad condition, a special case of the rank-deficiency constraints, to handle the case where each observed variable is influenced by only one latent parent and each latent variable has at least three pure measured children. The Tetrad condition has also been used to identify a latent tree structure [Pearl, 1988]. Recently, the rank-deficiency constraints have been extended to identify more general hierarchical structures [Huang et al., 2022].

(2) Matrix decomposition-based. It has been shown that, under certain conditions, the precision matrix can be decomposed into a low-rank matrix and a sparse matrix, where the low-rank matrix represents the causal structure from latent variables to observed variables and the sparse matrix gives the structural relationships over observed variables. To achieve such a decomposition, certain assumptions are imposed on the structure [Chandrasekaran et al., 2011, 2012, Anandkumar et al., 2013], e.g., there should be three times more measured variables than latent variables.

(3) Generalized independent noise (GIN) condition-based. The GIN condition is an extension of the independent noise condition to the existence of latent confounders, relying on higher-order statistics to identify latent structures. In particular, Xie et al. [2020] propose a GIN-based approach that allows multiple latent parents behind every pair of observed variables and can identify causal directions among latent variables. Moreover, Adams et al. [2021] give necessary and sufficient structural constraints in the linear, non-Gaussian or heterogeneous case to identify the latent structures.

(4) Mixture oracle-based. Recently, Kivva et al. [2021] proposed a mixture-oracle-based method to identify the latent variable graph that allows nonlinear causal relationships. However, it requires that the latent variables are discrete and that each latent variable has measured variables as children. Thanks to the discreteness assumption, it can handle general DAGs over latent variables.

On the other hand, regarding the scenario of latent hierarchical structures, most previous work along this line assumes a tree structure and requires that each variable has only one dimension and that the data is either linear-Gaussian or discrete [Pearl, 1988, Zhang, 2004, Choi et al., 2011, Drton et al., 2017, Huang et al., 2020]. In contrast, we address the general nonlinear case with continuous variables. Moreover, our conditions allow for multiple undirected paths between two variables and thus are more general than tree-based assumptions in prior work.

Another related research line is the latent-variable identifiability literature. Hyvarinen et al. [2019], Khemakhem et al. [2020], Kong et al.
[2022] have shown that with an additional observed variable to modulate latent independent variables, the latent independent variables are identifiable. Recently, Yao et al. [2021, 2022] allow time-delayed causal relationships among latent variables. However, for time-delayed causal relations, the causal direction is fixed and predefined; moreover, they assume that all latent variables have measured variables as children, avoiding the hierarchical cases. Prior work [von Kügelgen et al., 2021, Lyu et al., 2022] studies latent-variable models related to the basis model proposed in this work (which serves as a tool and is defined in Section 2), but with more restrictive functional and statistical assumptions. To the best of our knowledge, no prior work has managed to identify latent variables or causal structures in nonlinear latent hierarchical models.

We note that our work focuses on identifying latent causal models from observational data and structure conditions. Another important line of work [Ahuja et al., 2023, Squires et al., 2023, Varici et al., 2023, Jiang and Aragam, 2023, Liang et al., 2023, Buchholz et al., 2023, Zhang et al., 2023] utilizes interventional data for this purpose. Specifically, these works leverage multiple data distributions generated by one causal model under distinct interventions, which are accessible in applications like biological experiments. The accessibility of interventional data can allow for relaxed structure conditions. Hence, one should consider this tradeoff when faced with causal identification problems.

B.1 Proof for Theorem 3.2

The original assumptions and theorem are replicated here for reference. In Theorem 3.2, we show that the latent variable $z$ shared by $v_1$ and $v_2$ is identifiable up to a one-to-one mapping by estimating a generative model $(\hat{p}_{z,s_2}, \hat{p}_{s_1}, \hat{g})$ according to Equation 2. We denote the supports of the Jacobian matrices $J_g$ and $J_{\hat{g}}$ as $\mathcal{G}$ and $\hat{\mathcal{G}}$, respectively, and denote by $\mathcal{T}$ a set of matrices with the same support as $T(c)$ in $J_{\hat{g}}(\hat{c}) = J_g(c) T(c)$.

Assumption 3.1 (Basis model conditions).
(i) [Differentiability & Invertibility]: The mapping $g(c) = (v_1, v_2)$ is a differentiable invertible function with a differentiable inverse.
(ii) [Subspace span]: For all $i \in \{1, \ldots, d_{v_1} + d_{v_2}\}$, there exist $\{c^{(\ell)}\}_{\ell=1}^{|\mathcal{G}_{i,:}|}$ and $T \in \mathcal{T}$, such that $\mathrm{span}(\{J_g(c^{(\ell)})_{i,:}\}_{\ell=1}^{|\mathcal{G}_{i,:}|}) = \mathbb{R}^{d_c}_{\mathcal{G}_{i,:}}$ and $[J_g(c^{(\ell)})\, T]_{i,:} \in \mathbb{R}^{d_c}_{\hat{\mathcal{G}}_{i,:}}$.
(iii) [Edge connectivity]: For all $j_z \in \{1, \ldots, d_z\}$, there exist $i_{v_1} \in \{1, \ldots, d_{v_1}\}$ and $i_{v_2} \in \{d_{v_1} + 1, \ldots, d_{v_1} + d_{v_2}\}$, such that $(i_{v_1}, j_z) \in \mathcal{G}$ and $(i_{v_2}, j_z) \in \mathcal{G}$.

Theorem 3.2. Under Assumption 3.1, if a generative model $(\hat{p}_{z,s_2}, \hat{p}_{s_1}, \hat{g})$ follows the data-generating process in Equation 2 and matches the true joint distribution:
$$p_{v_1, v_2}(v_1, v_2) = \hat{p}_{v_1, v_2}(v_1, v_2), \quad \forall (v_1, v_2) \in \mathcal{V}_1 \times \mathcal{V}_2, \tag{3}$$
then the estimated variable $\hat{z}$ and the true variable $z$ are equivalent up to an invertible transformation.

Proof. We first define the indeterminacy function $h := \hat{g}^{-1} \circ g$, which is a smooth and invertible function $h : \mathcal{C} \to \hat{\mathcal{C}}$ thanks to Assumption 3.1-i. According to the chain rule and the inverse function theorem, we have the following relation among the Jacobian matrices:
$$J_{\hat{g}}(\hat{c}) = J_g(c)\, J_h^{-1}(c). \tag{4}$$
For ease of exposition, we denote $M(c) := J_h^{-1}(c)$ in the following. We define the support notations as follows: $\mathcal{G} := \mathrm{supp}(J_g)$, $\hat{\mathcal{G}} := \mathrm{supp}(J_{\hat{g}})$, $\mathcal{T} := \mathrm{supp}(M)$. Because of Assumption 3.1-ii, for any $i \in \{1, \ldots, d_{v_1} + d_{v_2}\}$, there exists $\{c^{(\ell)}\}_{\ell=1}^{|\mathcal{G}_{i,:}|}$, such that $\mathrm{span}(\{J_g(c^{(\ell)})_{i,:}\}_{\ell=1}^{|\mathcal{G}_{i,:}|}) = \mathbb{R}^{d_c}_{\mathcal{G}_{i,:}}$.
Since $\{J_g(c^{(\ell)})_{i,:}\}_{\ell=1}^{|\mathcal{G}_{i,:}|}$ forms a basis of $\mathbb{R}^{d_c}_{\mathcal{G}_{i,:}}$, for any $j_0 \in \mathcal{G}_{i,:}$ we can express the canonical basis vector $e_{j_0} \in \mathbb{R}^{d_c}_{\mathcal{G}_{i,:}}$ as
$e_{j_0} = \sum_{\ell \in \mathcal{G}_{i,:}} \alpha_{\ell} \, J_g(c^{(\ell)})_{i,:},$  (5)
where $\alpha_{\ell} \in \mathbb{R}$ is a coefficient. Then, following Assumption 3.1-ii, there exists a deterministic matrix $T$ such that
$T_{j_0,:} = e_{j_0}^{\top} T = \sum_{\ell \in \mathcal{G}_{i,:}} \alpha_{\ell} \, J_g(c^{(\ell)})_{i,:} T \in \mathbb{R}^{d_c}_{\hat{\mathcal{G}}_{i,:}},$  (6)
where the membership follows from the fact that each term in the summation belongs to $\mathbb{R}^{d_c}_{\hat{\mathcal{G}}_{i,:}}$. Hence, for all $j \in \mathcal{G}_{i,:}$, $T_{j,:} \in \mathbb{R}^{d_c}_{\hat{\mathcal{G}}_{i,:}}$. Equivalently, we have:
$\forall (i, j) \in \mathcal{G}: \quad \{i\} \times \mathcal{T}_{j,:} \subseteq \hat{\mathcal{G}}.$  (7)

We would like to show that $\hat{z}$ is not influenced by $s_1$ and $s_2$; to this end, we first show that $T_{j_z, j_{\hat{s}}} = 0$ for $j_z \in \{1, \dots, d_z\}$ and $j_{\hat{s}} \in \{d_z + 1, \dots, d_c\}$. We first prove this for $j_{\hat{s}} \in \{d_z + 1, \dots, d_z + d_{s_1}\}$ by contradiction. Suppose that $(j_z, j_{\hat{s}_1}) \in \mathcal{T}$ with $j_z \in \{1, \dots, d_z\}$ and $j_{\hat{s}_1} \in \{d_z + 1, \dots, d_z + d_{s_1}\}$. Thanks to Assumption 3.1-iii, there must exist $i_{v_2} \in \{d_{v_1}+1, \dots, d_{v_1}+d_{v_2}\}$ such that $(i_{v_2}, j_z) \in \mathcal{G}$. It follows from Equation 7 that
$\{i_{v_2}\} \times \mathcal{T}_{j_z,:} \subseteq \hat{\mathcal{G}} \;\Longrightarrow\; (i_{v_2}, j_{\hat{s}_1}) \in \hat{\mathcal{G}}.$  (8)
However, due to the structure of $\hat{g}_2$, $[J_{\hat{g}_2}]_{i_{v_2}, j_{\hat{s}_1}} = 0$, which results in a contradiction. Therefore, no such $(j_z, j_{\hat{s}_1}) \in \mathcal{T}$ exists. The same reasoning shows that $(j_z, j_{\hat{s}_2}) \notin \mathcal{T}$ for $j_z \in \{1, \dots, d_z\}$ and $j_{\hat{s}_2} \in \{d_z + d_{s_1} + 1, \dots, d_c\}$. Therefore, we have shown that $T_{j_z, j_{\hat{s}}} = 0$ for $j_z \in \{1, \dots, d_z\}$ and $j_{\hat{s}} \in \{d_z + 1, \dots, d_c\}$. As $T$ is invertible, it follows from the block-matrix inverse formula that $[T^{-1}]_{j_{\hat{z}}, j_s} = 0$ for $j_{\hat{z}} \in \{1, \dots, d_z\}$ and $j_s \in \{d_z + 1, \dots, d_c\}$. In conclusion, $\hat{z}$ is not influenced by $(s_1, s_2)$; since $h$ is invertible and $\hat{z}$ has the same dimensionality as $z$, the map from $z$ to $\hat{z}$ is itself invertible. Thus, we have shown that there is a one-to-one mapping between $z$ and $\hat{z}$.

B.2 Proof for Theorem 3.4

The original assumption and theorem are copied below.

Assumption 3.3 (Hierarchical model conditions).
i [Differentiability]: The structural equations in Equation 1 are differentiable.
ii [Information-conservation]: Any $z \in Z$ and any exogenous variable $\varepsilon$ can be expressed as differentiable functions of all observed variables, i.e., $z = f_z(X)$ and $\varepsilon = f_{\varepsilon}(X)$.
iii [Subspace span]: For each set $A$ that d-separates all its ancestors $\mathrm{Anc}(A)$ from the observed variables $X$, i.e., $X \perp \mathrm{Anc}(A) \mid A$, and for any $z_A \in \mathrm{Pa}(A)$, there exists an invertible mapping from a set of ancestors of $A$ containing $z_A$ to the separation set $A$, such that this mapping satisfies the subspace span condition (i.e., Assumption 3.1-ii).
iv [Edge connectivity]: The function between each latent variable $z$ and each of its children $z'$ has a Jacobian $J$ such that, for all $j \in \{1, \dots, d_z\}$, there exists $i \in \{1, \dots, d_{z'}\}$ with $(i, j)$ in the support of $J$.

Theorem 3.4. In a latent hierarchical causal model that satisfies Condition 2.4 and Assumption 3.3, we consider $x_i \in X$ as $v_1$ and $X \setminus v_1$ as $v_2$ in the basis model (Figure 2).⁵ With an estimation model $(\hat{p}_{z,s_2}, \hat{p}_{s_1}, \hat{g})$ that follows the data-generating process in Equation 2, the estimated $\hat{z}$ is a one-to-one mapping of the parent(s) of $v_1$, i.e., $\hat{z} = h(\mathrm{Pa}(v_1))$ where $h(\cdot)$ is an invertible function.

Proof. We show that performing estimation following the generating process in Equation 2 with any $v_1 = x_i \in X$ and $v_2 = X \setminus v_1$ is equivalent to estimating the model in Figure 2 with $z = \mathrm{Pa}(v_1)$. The identifiability result then ensues, thanks to Theorem 3.2. First, we show that for each selection of $v_1$, we can locate $(z, s_1, s_2)$ such that the conditions for the basis model in Theorem 3.2 are satisfied. We choose $(z, s_1, s_2)$ as follows. We choose $z$ to be all parents of $v_1$: $z := \mathrm{Pa}(v_1)$. We choose $s_1$ to be the exogenous variable(s) that generate $v_1$ together with $z$: $s_1 := \varepsilon_{v_1}$.
⁵ To avoid clutter, we slightly abuse the bold lowercase font to represent either an individual vector or a set of vectors (which can be viewed as a concatenation).

To obtain $s_2$, we trace back (traversing backward along every edge) from $v_2$ recursively and execute the following steps. At each step of the backtracking, we include any cause (exogenous or endogenous variable) that is not situated on any directed path from $z$ to $v_2$. The backtracking for each path halts when the most recent step recovers an endogenous variable outside the directed paths, or when it reaches $z$. We note that such a choice of $s_2$ is not unique; the above procedure is only one instance.

Invertibility. From $(z, s_1, s_2)$ to $(v_1, v_2)$: by construction, $(z, s_1, s_2)$ contains all the information needed to generate $(v_1, v_2)$, i.e., $X$, since $(z, s_1, s_2)$ d-separates its parents/ancestors from $(v_1, v_2)$ and contains all the necessary exogenous variables. From $(v_1, v_2)$ to $(z, s_1, s_2)$: the tuple $(v_1, v_2)$ comprises the entire observed set $X$, which contains the information of all exogenous and endogenous latent variables according to the information-conservation condition (Assumption 3.3-ii) of the hierarchical model. Therefore, the mapping is invertible.

Conditional independence: $v_1 \perp v_2 \mid z$. As $z$ is chosen to be the parents of $v_1$ and there are no edges among the observed variables $X$ (i.e., leaf variables), the local Markov property implies this conditional independence.

Subspace span. As $(v_1, v_2) := X$ satisfies the conditions in Assumption 3.3-iii with respect to $z := \mathrm{Pa}(v_1)$, the subspace span condition follows from Assumption 3.3-iii.

Edge connectivity. We assume that, in the hierarchical model, the function between each latent variable $z$ and each of its pure children $z'$ has a non-degenerate Jacobian matrix, in the sense that for all $j \in \{1, \dots, d_z\}$ there exists $i \in \{1, \dots, d_{z'}\}$ such that $(i, j)$ is included in the support of the Jacobian (i.e., Assumption 3.3-iv). Since each latent variable has at least 2 pure children (i.e., Condition 2.4-i) and $v_1$ contains 1 variable, at least 1 pure child of $z$, or a descendant of a pure child of $z$, will appear in $v_2$. Thus, for each $j \in \{1, \dots, d_z\}$, there exist $i_{v_1}$ and $i_{v_2}$ such that both $(i_{v_1}, j)$ and $(i_{v_2}, j)$ are contained in the support of the Jacobian matrix, which fulfills the edge connectivity condition (i.e., Assumption 3.1-iii).

Thus, it follows from Theorem 3.2 that the estimated variable $\hat{z}$ is a one-to-one mapping of the true variable $z$.

B.3 Proof for Theorem 3.5

Theorem 3.5. Under the assumptions in Theorem 3.4, all latent variables $Z$ in the hierarchical model can be identified up to one-to-one mappings, and the causal structure $G$ can also be identified.

Proof. We will show by induction that each iteration of Algorithm 1 fulfills the following conditions:

Condition B.1 (Active-set conditions).
i Each element in the active set $A$ is a one-to-one mapping of a distinct variable in $Y$.
ii There are no directed paths among variables in $A$.
iii The graph is d-separated by $A$ into the latent variables $Z_A$ that have not been included in $A$ and those $\bar{Z}_A$ that were previously in $A$ (but are not now): $Z_A \perp \bar{Z}_A \mid A$.

We first verify the base case, where the active set $A$ is initialized to the observed set $X$. Condition B.1-i is automatically satisfied due to the initial assignment. As $X$ consists of leaf variables only, there are no directed paths within $X$, and therefore Condition B.1-ii is satisfied. Condition B.1-iii is also met trivially, as $\bar{Z}_A = \emptyset$ at the initial step. So far, we have verified the base case.
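Before turning to the inductive step, the following is a minimal, hypothetical sketch of the bottom-up active-set loop analyzed in this proof. It is not the authors' Algorithm 1; the helper names estimate_parent, merge_equivalent, and predicts_perfectly are placeholders standing in for the basis-model estimation of Theorem 3.4, the duplicate merging at Step 6, and the prediction test at Step 14, respectively.

```python
# A hypothetical, simplified sketch of the bottom-up active-set loop analyzed in this
# proof. This is NOT the authors' Algorithm 1: estimate_parent, merge_equivalent, and
# predicts_perfectly are placeholders for the basis-model fit (Theorem 3.4), the
# duplicate merging at Step 6, and the prediction test at Step 14, respectively.
from typing import Dict, FrozenSet, List, Set


def estimate_parent(v1: str, active: Set[str]) -> str:
    """Placeholder: fit the basis model with v1 against active \\ {v1} (Theorem 3.4)."""
    raise NotImplementedError


def merge_equivalent(parents: Dict[str, str]) -> Dict[FrozenSet[str], List[str]]:
    """Placeholder for Step 6: group active variables whose estimated parents are
    equivalent up to invertible transformations (e.g., via mutual R^2 prediction)."""
    raise NotImplementedError


def predicts_perfectly(parent_set: FrozenSet[str], rest: Set[str]) -> bool:
    """Placeholder for Step 14: test whether the remaining active variables can still
    perfectly predict the candidate parent estimate (i.e., some children are missing)."""
    raise NotImplementedError


def discover_hierarchy(observed: Set[str]) -> List[FrozenSet[str]]:
    """Iterate the active-set update until only the root remains (Condition B.1)."""
    active: Set[str] = set(observed)            # base case: A = X (leaf variables)
    discovered: List[FrozenSet[str]] = []
    while len(active) > 1:
        # One basis-model estimate per active variable, each treated as v1.
        parents = {a: estimate_parent(a, active) for a in active}
        for parent_set, children in merge_equivalent(parents).items():
            # Update A (Step 16) only when all children are active, i.e., the rest of
            # the active set cannot predict the parent estimate (Cases B.2-i / B.2-iii).
            if not predicts_perfectly(parent_set, active - set(children)):
                active -= set(children)
                active |= set(parent_set)
                discovered.append(parent_set)
    return discovered
```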
Now, we make the inductive hypothesis that all conditions hold before an iteration and show that these conditions are maintained after the iteration. We first note that Condition B.1-i, Condition B.1-ii, and Condition B.1-iii in the inductive hypothesis enable us to apply Theorem 3.4 to the modified graph consisting of the variables in $A$ as the bottom layer together with all latent variables that have not been placed in $A$. This modified graph satisfies the structural properties required by Theorem 3.4: all directed paths end at active variables (which play the role of observed variables), and there are no directed paths among the active variables.

We now analyze the updates to $A$ made at Step 16 of Algorithm 1 after each iteration. For a specific variable $z \in P(A)$, there are the following cases.

Case B.2 (One-iteration cases).
i $z$ has only pure children and all of them are in $A$.
ii $z$ has only pure children and not all of them are in $A$.
iii $z$ has co-parents, and all its children and its co-parents' children are in $A$.
iv $z$ has co-parents, and not all its children and its co-parents' children are in $A$.

For Case B.2-i, when each pure child $a$ of $z$ is treated as $v_1$, the basis model will yield $z$. So, there are certainly estimates of $z$ in $P(a)$. As $z$ possesses at least 2 pure children, all of which belong to $A$, there will be duplicates, i.e., multiple equivalent one-to-one mappings of $z$, which we merge into one at Step 6. Therefore, for each pure child $a$ of $z$, the only element in $P(a)$ is a one-to-one mapping of $z$. It follows that JointP contains a pair $(Z, H)$ where $Z$ consists of $z$ alone and $H$ contains all its pure children. This fact, together with Condition B.1-iii, implies that $\hat{z}_{\text{test}}$ would be the parent of $z$, which cannot perfectly predict $z$ at Step 14. Therefore, Algorithm 1 will substitute all the pure children of $z$ with the estimate of $z$ in $A$.

We now switch to Case B.2-ii. Analogous to Case B.2-i, after Step 6 we will have $P(a) = \{z\}$ for each active pure child $a$. However, as not all pure children of $z$ are active, JointP$(\{z\})$ does not include all of the pure children of $z$, and $A \setminus H$ certainly contains descendants of $z$. It follows from Corollary 3.6 that $\hat{z}_{\text{test}}$ will be a one-to-one mapping of $z$ and can predict it perfectly at Step 14. Therefore, $z$ will not be updated into $A$ at this iteration.

For Case B.2-iii, we will show that the update takes place for $z$ and its co-parents. Without loss of generality, we assume that $z$ has only one co-parent $z'$. As both $z$ and $z'$ have multiple pure children (i.e., Condition 2.4-i), all of which are in $A$, the dictionary $P$ will contain $z$ and $z'$ in its keys, and $P(z)$ and $P(z')$ contain their pure children, respectively. Suppose $a$ is a shared child of $z$ and $z'$; then, after Step 6, $P(a)$ would be a singleton set containing a one-to-one mapping of $\tilde{z} := [z, z']$. At Step 9, $\tilde{z}$ will be recognized as a combination of $z$ and $z'$ and replaced with them. It follows that JointP will contain a pair $(Z, H)$ such that $Z = (z, z')$ and $H$ is composed of all the children of the two co-parents. At Step 16, both $z$ and $z'$ will be updated into $A$. Cases with more than one co-parent can be dealt with in the same manner.

For Case B.2-iv, we show that the update will not be carried out for $z$ and its co-parents. Following the notation in Case B.2-iii, we assume that $z$ has a co-parent $z'$, and these two share at least one active child. If neither $z$ nor $z'$ has any pure active children, they will be regarded as a supernode $\tilde{z} = [z, z']$, and the corresponding $H$ contains only the shared active children.
Then, $A \setminus H$ will certainly contain descendants of both $z$ and $z'$. Consequently, $\hat{z}_{\text{test}}$ will be equivalent to the supernode $\tilde{z}$ and predicts it perfectly at Step 14. In this case, neither $z$ nor $z'$ will be updated into $A$. If only one of $z$ and $z'$ has no pure children in $A$, then neither will be updated into $A$ due to Step 9. Without loss of generality, we assume that $z$ does not have any active pure children, whereas $z'$ does. Then, there will be no estimate of $z$ in the values of $P$, but there will be estimates of $z'$ and of $(z, z')$. As a result, Step 9 will prevent $z$ and $z'$ from being updated into $A$. If both $z$ and $z'$ have active pure children, then there will be a pair $(Z, H)$ in JointP such that $Z = \{z, z'\}$ and $H$ contains all their active children. As $H$ does not contain all the children of $z$ and $z'$, the set $A \setminus H$ will contain descendants of at least one of $z$ and $z'$. Then $\hat{z}_{\text{test}}$ will contain all the information of at least one of $z$ and $z'$ and predicts it perfectly at Step 14, which aborts the update of $z$ and $z'$. Therefore, no updates will happen in this case.

As each of the newly estimated variables in the values of JointP is a one-to-one mapping of a distinct variable in $Z$, the iteration preserves Condition B.1-i.

Figure 7: A specific case that demonstrates how we can avoid assuming specific functional classes.

For Condition B.1-ii, we can observe that if $z$ is updated into $A$ at this iteration, all its children would be in $A$ at the beginning of the iteration and get removed at Step 16. Therefore, the update does not introduce directed paths into $A$. Condition B.1-iii holds for the same reason: the newly introduced variables in the active set $A$ at the end of the iteration separate their children (which were in the active set at the beginning of the iteration) from all undiscovered latent variables.

So far, by induction, we have shown that Condition B.1-i, Condition B.1-ii, and Condition B.1-iii hold at the end of each iteration. Given the analysis above, at each iteration we update variables whenever all their children and their co-parents' children are active, i.e., Case B.2-i and Case B.2-iii, and the causal relations between the parents and the children are encoded in JointP.

We further argue that at every iteration, there exists at least one undiscovered latent variable that has all its children in $A$. We focus on the trimmed graph over $Z_A$ and $A$, where $Z_A$ refers to all undiscovered variables. As the graph is of finite size and all variables in $Z_A$ have directed paths ending at active variables in $A$, the active variable $a$ farthest from the root does not have siblings in $Z_A$; otherwise, there would be an active variable $a'$ (i.e., a descendant of $a$'s sibling in $Z_A$) in $A$ that is farther from the root than $a$.

In sum, we have shown that there exists at least one update at each iteration, and each update leads to the discovery of new latent variables and correct causal structures. Therefore, Algorithm 1 can successfully identify the underlying hierarchical causal graph and each latent variable up to one-to-one mappings.

B.4 Illustration of a Specific Case

Figure 7 shows a specific case that demonstrates how we can avoid assuming specific functional classes. For the active set $A = \{z_2, z_4, z_5\}$, if we treat $z_2$ as $v_1$, then $z_1$ becomes $z$ in the basis model and $[z_4, z_5]$ becomes $v_2$. This choice of $(z, v_1, v_2)$ satisfies Assumption 3.1-i, as all information about $z_1$ is contained in $z_2$, $z_4$, and $z_5$. Assumption 3.1-iii and Assumption 3.1-ii are direct consequences of Assumption 3.3-iv and Assumption 3.3-iii, respectively. Also, we can always find an independent variable $s_1 := \varepsilon_{z_2}$ in the basis model.
Therefore, Theorem 3.4 guarantees that the estimated $\hat{z}$ is a one-to-one mapping of $z_1$.

C Relaxation of Structural Conditions

C.1 Relaxation of Condition 2.4-ii

We remark that Condition 2.4-ii is introduced to simplify the presentation of our main results: as a first step of nonlinear latent hierarchical causal discovery, we wish to present the basic techniques clearly, and handling triangles involves additional algorithmic components that could potentially compromise readability. We give an example with a triangle structure in Figure 8 to illustrate how our theory and algorithm can handle triangles. We refer to the variables in a triangle under the following convention: 1) root: the variable that is the cause of the other two variables (e.g., $z_1$ in Figure 8); 2) sink: the variable that is the effect of the other two variables (e.g., $z_3$ in Figure 8); and 3) bridge: the variable that is the effect of the root variable and the cause of the sink variable (e.g., $z_2$ in Figure 8).

Figure 8: An example of a triangle structure, i.e., $z_1$, $z_2$, and $z_3$.

In the following, we illustrate the procedure to recognize and handle triangle structures by navigating through Figure 8. First, the directed path detection procedure portrayed in Section 3.3 can pinpoint the directed path between the bridge variable and the sink variable (e.g., $z_2$ and $z_3$). We then need to determine whether the two variables indeed belong to a triangle. To this end, we can lump the bridge variable and the sink variable together as $v_1$. We perform the basis model estimation over $v_1$ and the active set $A$ excluding $v_1$'s children. In Figure 8, we perform the basis model estimation over $v_1 = [z_2, z_3]$ and $A = \{z_2, z_3, x_7, x_8, x_9\}$, which returns an estimate of $z_1$. Further, by mutual prediction we can determine that the bridge variable $z_2$ and the variable estimated from the lumped $v_1$ constitute the parents of the sink variable $z_3$, which ascertains that the two variables are indeed the bridge variable and the sink variable of a triangle, respectively. In this example, we find that the estimates of $z_2$ and $z_1$ together contain the same information as the estimate of $z_3$, which reveals that $z_2$ and $z_3$ are the bridge variable and the sink variable of a triangle. This concludes the triangle detection procedure. With this knowledge, we can handle the triangle structure by lumping the bridge variable $z_2$ and the sink variable $z_3$ into a super-variable and proceeding with this modification.

C.2 Admitting Non-leaf Variables into X

In the same manner as the directed path detection detailed in Section 3.3, Corollary 3.6 can detect non-leaf observed variables and exclude them from the initial active set $X$ in Algorithm 1. Thus, the problem can be reduced back to the case without observed non-leaf variables, which can be tackled by Algorithm 1.

D Algorithm 1 Complexity

The dominant complexity term in our proposed algorithm is the deep learning training, which is of complexity $O(R \cdot n)$, where $R$ is the number of levels in the hierarchical model and $n$ is the number of observed variables. In contrast, recent works on linear hierarchical models provide algorithms with complexity $O(R \cdot (1 + n)^{p+1})$ [Huang et al., 2022] and $O(R \cdot n!)$ [Xie et al., 2022], where $p$ is the largest number of variables that share children. Our algorithm thus achieves the lowest algorithmic complexity among these, as illustrated by the back-of-the-envelope comparison below.
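To make the comparison concrete, the following minimal sketch evaluates the three asymptotic counts for hypothetical values of $R$, $n$, and $p$; these particular numbers are illustrative assumptions of ours, not figures reported in the paper, and the counts are orders of growth rather than runtimes.

```python
# Illustrative comparison of the asymptotic counts quoted above. The values of R, n,
# and p below are hypothetical; the expressions are the stated complexity orders only,
# not wall-clock estimates.
import math

R, n, p = 3, 15, 3  # levels, observed variables, max number of variables sharing children

ours = R * n                         # O(R * n)             -- this work
rank_based = R * (1 + n) ** (p + 1)  # O(R * (1 + n)^(p+1)) -- Huang et al. [2022]
gin_based = R * math.factorial(n)    # O(R * n!)            -- Xie et al. [2022]

print(f"O(R n)             ~ {ours:,}")
print(f"O(R (1 + n)^(p+1)) ~ {rank_based:,}")
print(f"O(R n!)            ~ {gin_based:,}")
```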
As we allow for multi-dimensional variables, the dimensionality of each variable also contributes to the overall wall-clock time through the deep learning model dimension, which is a hyperparameter chosen by the experimenter.

E Experiments

E.1 Additional Experimental Details

We provide additional details of the experiments in Section 4. For the basis model evaluation in Section 4.2, the encoder $\hat{f}$ and the decoders $(\hat{g}_1, \hat{g}_2)$ are 4-layer MLPs with hidden dimensions 30 times as large as the input dimension and LeakyReLU ($\alpha = 0.2$) activation functions. For the synthetic hierarchy experiments in Section 4.3, the models have 8 layers with hidden dimensions 50 times the input dimension. For the personality dataset in Section 4.3, the models have 4 layers with hidden dimensions 8 times the input dimension. For the multi-view digit dataset in Section 4.3, the models have 4 layers, with encoder hidden dimensions 4 times and decoder hidden dimensions 2 times the input dimension. The model configurations are summarized in Table 3.

                                    # encoder layers   # decoder layers   encoder width   decoder width
Basis models (Section 4.2)                 4                  4                30              30
Synthetic hierarchy (Section 4.3)          8                  8                50              50
Personality dataset (Section 4.3)          4                  4                 8               8
Digit dataset (Section 4.3)                4                  4                 4               2
Table 3: Estimation model configuration for each experiment (widths are multiples of the input dimension).

We apply Adam to train each model for 20,000 steps with a learning rate of 1e-3. For hierarchical models, we consistently use 2-layer MLPs as the generating functions and set the exogenous variable dimensionality to 2. We evaluate the R² score with kernel regression over 8192 samples for each estimated variable pair. The matched pairs give significantly higher scores than the unmatched ones (e.g., Table 2 and Figure 6b). This gap allows us to set a threshold for each experiment to distinguish the matched estimated variables (e.g., a threshold of 0.6 suffices for the experiment in Table 2).

E.2 Additional Results on the Basis Model

Figure 9 visualizes the dependence between the estimated and true components. We can observe that $z$ is highly correlated with $\hat{z}$, whereas $s_1$ has little correlation with $\hat{z}$, which verifies Theorem 3.2, i.e., that we can successfully identify $z$.

Figure 9: Scatter plots between the estimated $\hat{z}$ and the true $(z, s_1)$. The scatter points are obtained in an estimation run with $d_z = d_{s_1} = d_{s_2} = 2$. The estimated components of $\hat{z}$ are highly correlated with those of $z$ and have little correlation with those of $s_1$, empirically verifying Theorem 3.2.

Although our theory only concerns asymptotic properties, our approach performs stably in relatively low-sample regimes. As shown in Figure 10, our algorithm achieves decent identification scores once the number of samples exceeds 5,000. The performance peaks, with little variance, when the sample size is over 10,000.

E.3 Additional Results on Synthetic Hierarchical Models

Table 4 and Table 5 contain the mutual prediction scores for the hierarchical structures in Figure 4a and Figure 4c. We can observe that the estimated variables that yield high mutual prediction scores indeed belong to the same causal cluster, which provides the signal for structure learning.

Figure 10: R² scores for basis model identification as a function of the number of training samples. The true basis model (Figure 2 in the manuscript) has latent dimensionality $d_z = d_{s_1} = d_{s_2} = 2$.
The basis model training and data generation follow the training configuration in the main text. We vary the number of training samples and measure the R² score to assess the identification result. Each result is averaged over 3 random seeds.

        x1             x2             x3             x4
x1      -              0.83 ± 0.03    0.57 ± 0.04    0.50 ± 0.08
x2      0.86 ± 0.01    -              0.58 ± 0.03    0.52 ± 0.04
x3      0.60 ± 0.02    0.59 ± 0.05    -              0.77 ± 0.00
x4      0.59 ± 0.02    0.59 ± 0.05    0.81 ± 0.04    -
Table 4: Pairwise predictions among estimates in Figure 4a. Each cell (a, b) shows the R² score obtained by applying the estimate produced by treating a as $v_1$ to predict the estimate produced by treating b as $v_1$. The prediction scores within the sibling pairs $(x_1, x_2)$ and $(x_3, x_4)$ are noticeably higher than those of other pairs, showing a decent structure estimation.

E.4 Additional Results on the Personality Dataset

We randomly and evenly partition each set of six questions into two variables. Following this procedure, we obtain ten observed variables $c_1, c_2, a_1, a_2, n_1, n_2, e_1, e_2, o_1$, and $o_2$, each of three dimensions. The letter in each variable name indicates the attribute to which the variable belongs, to facilitate readability. Figure 11 contains the results. Analogous to Figure 5, the learned structure faithfully reflects the questionnaire structure: questions designed for the same personality trait are clustered together.

        z4             x3             x4             x5
z4      -              0.81 ± 0.011   0.48 ± 0.006   0.52 ± 0.000
x3      0.87 ± 0.002   -              0.47 ± 0.003   0.54 ± 0.002
x4      0.57 ± 0.000   0.52 ± 0.006   -              0.77 ± 0.005
x5      0.58 ± 0.001   0.57 ± 0.015   0.81 ± 0.000   -
Table 5: Pairwise predictions among estimated variables in Figure 4c. The pattern is consistent with Table 4 and Table 2.

Figure 11: The causal graph learned by our method, with latent variables $z_2, z_3, z_4, z_5, z_6$ over the observed variables $c_1, c_2, a_1, a_2, n_1, n_2, e_1, e_2, o_1, o_2$; each observed variable (c, a, n, e, o) is of three dimensions.
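For completeness, below is a minimal sketch of how pairwise mutual-prediction R² scores of this kind can be computed with kernel regression. The RBF kernel, the ridge penalty, the train/test split, and the synthetic data are our illustrative assumptions; the paper only states that kernel regression over 8192 samples is used.

```python
# A minimal sketch of the pairwise mutual-prediction score reported in Tables 4 and 5:
# regress one estimated variable on another with kernel (ridge) regression and report
# R^2 on held-out samples. The RBF kernel, the ridge penalty alpha, and the 80/20 split
# are illustrative assumptions; the paper only states that kernel regression over 8192
# samples is used.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def mutual_r2(z_a: np.ndarray, z_b: np.ndarray, seed: int = 0) -> float:
    """R^2 of predicting estimate z_b from estimate z_a (rows are samples)."""
    a_tr, a_te, b_tr, b_te = train_test_split(z_a, z_b, test_size=0.2, random_state=seed)
    model = KernelRidge(kernel="rbf", alpha=1e-3).fit(a_tr, b_tr)
    return r2_score(b_te, model.predict(a_te))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 2048                                    # kept small here; the paper uses 8192
    z1 = rng.normal(size=(n, 2))                # stand-in for one variable's estimate
    z2 = np.tanh(z1 @ rng.normal(size=(2, 2)))  # a nonlinear transform of z1 (matched)
    s = rng.normal(size=(n, 2))                 # an independent variable (unmatched)
    print("matched pair   R^2:", round(mutual_r2(z1, z2), 3))  # high
    print("unmatched pair R^2:", round(mutual_r2(s, z2), 3))   # near zero
```

A high score for the matched pair and a near-zero score for the unmatched pair mirror the gap between sibling and non-sibling entries in Tables 4 and 5 that the thresholding in Appendix E.1 relies on.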