Exogenous Isomorphism for Counterfactual Identifiability

Yikang Chen¹  Dehui Du¹

¹ Shanghai Key Laboratory of Trustworthy Computing, East China Normal University. Correspondence to: Dehui Du.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Abstract

This paper investigates L3-identifiability, a form of complete counterfactual identifiability within the Pearl Causal Hierarchy (PCH) framework, ensuring that all Structural Causal Models (SCMs) satisfying the given assumptions provide consistent answers to all causal questions. To simplify this problem, we introduce exogenous isomorphism and propose EI-identifiability, reflecting the strength of model identifiability required for L3-identifiability. We explore sufficient assumptions for achieving EI-identifiability in two special classes of SCMs: Bijective SCMs (BSCMs), based on counterfactual transport, and Triangular Monotonic SCMs (TM-SCMs), which extend L2-identifiability. Our results unify and generalize existing theories, providing theoretical guarantees for practical applications. Finally, we leverage neural TM-SCMs to address the consistency problem in counterfactual reasoning, with experiments validating both the effectiveness of our method and the correctness of the theory.

1. Introduction

The purpose of counterfactual reasoning is to answer questions about hypothetical, unobserved worlds. It has been applied in tasks such as fairness evaluation (Kusner et al., 2017), explanation generation (Karimi et al., 2020), harm quantification (Richens et al., 2022), and policy optimization (Tsirtsis & Rodriguez, 2023). Counterfactual identification is a subtask of counterfactual reasoning, aiming to determine whether all models satisfying certain reasonable assumptions yield the same answer to a specific counterfactual question. Identification is critical for the dependability of counterfactual reasoning: only when counterfactual identifiability is guaranteed can the reasoning outcomes be consistent with the assumptions, confirming the reliability of the constructed model.

(Pearl & Mackenzie, 2018) categorized causal questions into three levels of human cognition, termed the Pearl Causal Hierarchy (PCH) (Bareinboim et al., 2022). Counterfactual reasoning corresponds to human imagination, situated at the highest level L3 of the PCH, encoding the most intricate and nuanced information. Structural Causal Models (SCMs) provide specific semantics for addressing counterfactual questions. Counterfactual identifiability on SCMs guarantees that all SCMs adhering to the assumptions yield consistent results for L3 quantities. Prior research has constrained causal structures to identify specific counterfactual effects (Shpitser & Pearl, 2008; Correa et al., 2021; Xia et al., 2023), or limited causal mechanisms to determine counterfactual outcomes (Lu et al., 2020; Nasr-Esfahany et al., 2023; Scetbon et al., 2024). Further studies investigate model identifiability for counterfactuals, offering empirical evidence that counterfactual outcomes are identifiable (Khemakhem et al., 2021; Javaloy et al., 2023). This work introduces a novel identification target: identification across the entire counterfactual layer of the PCH. This requires that all SCMs satisfying the assumptions yield consistent results for any counterfactual statement.
Within the PCH, since the counterfactual layer L3 encodes all causal information, the identifiability of SCMs in L3 implies that these SCMs provide consistent answers to all causal questions, rendering them indistinguishable for any causal statement. Thus, identifiability over the counterfactual layer represents the most stringent and comprehensive goal for causal quantity identifiability within the PCH.

To achieve this goal, we first examine the identifiability problem in causal inference from the perspective of model identifiability in Section 2. In this context, identifiability over the counterfactual layer is denoted as L3-identifiability. To simplify the L3-identifiability problem, we establish an alternative form of identifiability in Section 3, denoted as EI-identifiability, which is induced by an equivalence relation termed exogenous isomorphism. This form of identifiability demonstrates that fully recovering the exogenous variables of SCMs is unnecessary to ensure consistent results for any counterfactual quantities, clarifying the strength of assumptions required for counterfactual layer identification.

We then explore sets of assumptions that induce EI-identifiability, focusing on two special classes of SCMs. The first class, termed Bijective SCMs (BSCMs), is studied in Section 4, where we induce exogenous isomorphism between BSCMs from the perspective of counterfactual transport (Theorem 4.6), offering a novel interpretation for counterfactual identifiability. The second class, termed Triangular Monotonic SCMs (TM-SCMs), is examined in Section 5. We identify a simple method to induce exogenous isomorphism between TM-SCMs (Corollary 5.4), which can be viewed as a strengthened version of L2-identifiability for Markovian Causal Bayesian Networks. This corollary unifies and generalizes prior theories for proving counterfactual outcome identifiability, indirectly demonstrating that several models used in previous works for counterfactual outcome identifiability are theoretically L3-identifiable. This provides theoretical guarantees for their safe application in practice. Finally, leveraging this theory, we utilize TM-SCMs parameterized by neural networks in Section 6 to address counterfactual identification and estimation problems. Experimental results in Section 8 on these models empirically support the validity of our theory.

2. Preliminaries

In this section, we provide the background for understanding the problem and describe the problem setting. We begin by reviewing the relevant concepts of SCMs. In Section 2.1, we present the definition of SCM and introduce key notions such as do-intervention and causal order. Subsequently, in Section 2.2, we leverage these concepts to logically characterize L1, L2, L3 in the PCH and define L3-consistency. Next, in Section 2.3, we introduce the counterfactual identification problem. Finally, in Section 2.4, we elaborate on model identifiability, establish the connection between causal identifiability and model identifiability, and motivate the research objective of this work: L3-identifiability.

Notation. We study the problem from a more general measure-theoretic perspective. Let the probability space be $(\Omega, \mathcal{F}, P)$, where the set $\Omega$ is the sample space, the $\sigma$-algebra $\mathcal{F}$ on $\Omega$ is the event space, and $P$ is the probability measure. We assume all measurable spaces are standard Borel.
We denote by uppercase $X$ a random variable taking values in the measurable space $(\Omega_X, \mathcal{F}_X)$, i.e., a measurable function from $\Omega$ to $\Omega_X$, where $\Omega_X$ is also called the domain of $X$. For a given finite index set $I$, we use uppercase bold $\mathbf{X}$ to denote a collection of random variables $(X_i)_{i \in I}$ with the associated space $(\Omega_{\mathbf{X}}, \mathcal{F}_{\mathbf{X}})$, where $\Omega_{\mathbf{X}} = \prod_{i \in I} \Omega_{X_i}$ and $\mathcal{F}_{\mathbf{X}} = \bigotimes_{i \in I} \mathcal{F}_{X_i}$ are the product space and the product $\sigma$-algebra, respectively. For a subset of indices $I' \subseteq I$, $(X_i)_{i \in I'}$ is abbreviated as $\mathbf{X}_{I'}$ or renamed as $\mathbf{Y}$ with $\mathbf{Y} \subseteq \mathbf{X}$, where the corresponding index set is denoted by $I_{\mathbf{Y}}$. The distribution of $\mathbf{X}$ is represented by $P_{\mathbf{X}}$, defined as the pushforward of $P$ under $\mathbf{X}$, such that $P_{\mathbf{X}} = \mathbf{X}_* P$. A set $\mathcal{X} \in \mathcal{F}_{\mathbf{X}}$ represents a random event, with the associated probability term $P(\mathbf{X} \in \mathcal{X}) = P_{\mathbf{X}}(\mathcal{X})$.

2.1. Structural Causal Model

Definition 2.1 (Structural Causal Model (SCM)). An SCM is a tuple $\mathcal{M} = \langle I, \Omega_{\mathbf{V}}, \Omega_{\mathbf{U}}, f, P_{\mathbf{U}} \rangle$, where $I$ is a finite index set. The product spaces $\Omega_{\mathbf{V}} = \prod_{i \in I} \Omega_{V_i}$ and $\Omega_{\mathbf{U}} = \prod_{i \in I} \Omega_{U_i}$ denote the domains of endogenous and exogenous variables, respectively. The measurable function $f = (f_i)_{i \in I}$, $f : \Omega_{\mathbf{V}} \times \Omega_{\mathbf{U}} \to \Omega_{\mathbf{V}}$, specifies the causal mechanisms, where each $f_i : \Omega_{\mathbf{V}_{pa(i)}} \times \Omega_{U_i} \to \Omega_{V_i}$ is associated with a set of indices $pa(i) \subseteq I$. The probability measure $P_{\mathbf{U}}$ is called the exogenous distribution. If $P_{\mathbf{U}} = \prod_{i \in I} P_{U_i}$ is a product measure, the SCM is said to be Markovian.

In this work, we primarily focus on recursive SCMs, which are defined such that a partial order $\preceq$ exists on the index set $I$, and for any pair of indices $i, j \in I$, if $j \preceq i$, then $i \notin pa(j)$. If the partial order admits a linear extension $\leq$, we call it the causal order of the SCM, which can be constructed using a topological sorting algorithm and is not necessarily unique. Recursiveness ensures that the solution of the SCM exists and is unique. According to (Bongers et al., 2016), if an SCM has a unique solution, then there exists a function $\Gamma$ such that for almost all $u \in \Omega_{\mathbf{U}}$ and $v \in \Omega_{\mathbf{V}}$, $v = \Gamma(u)$ if and only if $v = f(v, u)$. We refer to $\Gamma$ as the solution mapping of the recursive SCM.

SCMs also allow for operations called do-interventions. For a subset of indices $I_{\mathbf{X}} \subseteq I$, performing an intervention to set $\mathbf{X} = \mathbf{V}_{I_{\mathbf{X}}}$ to a specific value $x$ corresponds to deriving a new SCM $\mathcal{M}[x] = \langle I, \Omega_{\mathbf{V}}, \Omega_{\mathbf{U}}, f[x], P_{\mathbf{U}} \rangle$, called a submodel. In $\mathcal{M}[x]$, $f[x] = (f_i[x])_{i \in I}$ is defined such that

$$f_i[x] = \begin{cases} x_i & i \in I_{\mathbf{X}}, \\ f_i & i \in I \setminus I_{\mathbf{X}}. \end{cases}$$

Endogenous variables in $\mathcal{M}[x]$ will be denoted as $\mathbf{V}[x]$. Another property of recursive SCMs is that any submodel is also a recursive SCM. This implies that, in the submodel, given a value $u$, the endogenous variables $\mathbf{V}[x]$ are determined as $\Gamma[x](u)$. Under this premise, the value of the endogenous variables $\mathbf{Y}[x]$ under a given $u$ is represented as $\mathbf{Y}_{\mathcal{M}}[x](u) = (\Gamma[x](u))_{I_{\mathbf{Y}}}$, called the potential response.
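To ground these definitions, the following is a minimal sketch (not from the paper; the helper names `solution` and the two-variable mechanisms are illustrative) of a recursive SCM, its submodel under a do-intervention, and the resulting potential response:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy recursive SCM over I = {1, 2}, Markovian with U1, U2 ~ N(0, 1):
#   V1 = f1(U1)     = U1
#   V2 = f2(V1, U2) = 2 * V1 + U2
def solution(u, intervene=None):
    """Solution mapping Gamma[x]: evaluate structural equations in causal order.
    `intervene` maps an index to a fixed value, implementing the submodel M[x]."""
    intervene = intervene or {}
    v = {}
    v[1] = intervene.get(1, u[1])              # f1[x]
    v[2] = intervene.get(2, 2 * v[1] + u[2])   # f2[x]
    return v

# Potential response Y_M[x](u): the value of V2 in the submodel do(V1 = 1),
# evaluated at the same exogenous value u as the factual world.
u = {1: rng.normal(), 2: rng.normal()}
print("factual V        :", solution(u))
print("potential V2[x=1]:", solution(u, intervene={1: 1.0})[2])
```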
2.2. Pearl Causal Hierarchy

(Pearl & Mackenzie, 2018) categorized causal questions into three levels of human cognition: seeing, doing, and imagining. To formalize these questions and delineate the logical differences between the three levels, symbolic languages L1, L2, L3 are introduced. Each language consists of Li-statements $\varphi$, where every $\varphi \in$ Li is a Boolean combination of inequalities between polynomials over probability terms $P(\alpha_i)$. For L1, $P(\alpha_1)$ takes the form $P(\mathbf{Y} \in \mathcal{Y})$, representing the probability of $\mathcal{Y}$ occurring, making L1 a standard probabilistic logic. L2 introduces interventions, with $P(\alpha_2)$ in the form $P(\mathbf{Y}[x] \in \mathcal{Y})$, denoting the probability of $\mathcal{Y}$ occurring when $\mathbf{X}$ is intervened to $x$. L3 further encodes conjunctions under different interventions, making $P(\alpha_3)$ take the form $P(\mathbf{Y}_* \in \mathcal{Y}_*)$, where $\mathbf{Y}_* = (\mathbf{Y}_j[x_j])_{j \in 1:n}$ represents the collection of endogenous variables $\mathbf{Y}_j[x_j]$ from $n$ submodels $\mathcal{M}[x_1], \mathcal{M}[x_2], \ldots, \mathcal{M}[x_n]$. Thus, $\mathbf{Y}_* \in \mathcal{Y}_*$ is logically equivalent to the conjunction $\bigwedge_{j \in 1:n} \mathbf{Y}_j[x_j] \in \mathcal{Y}_j$.

An SCM $\mathcal{M}$ provides semantics for the symbolic languages L1, L2, L3. For L3, any term of the form $P(\mathbf{Y}_* \in \mathcal{Y}_*)$ can be evaluated by $\mathcal{M}$ as $P^{\mathcal{M}}_{\mathbf{Y}_*}(\mathcal{Y}_*)$, where $P^{\mathcal{M}}_{\mathbf{Y}_*}$ is called the counterfactual distribution. Let $\mathbf{Y}_{\mathcal{M}*} = (\mathbf{Y}_{\mathcal{M}}[x_j])_{j \in 1:n}$. Since each $\mathbf{Y}_j[x_j] = \mathbf{Y}_{\mathcal{M}}[x_j] \circ \mathbf{U}$ almost surely, $P^{\mathcal{M}}_{\mathbf{Y}_*}$ is the pushforward measure $(\mathbf{Y}_{\mathcal{M}*})_* P_{\mathbf{U}}$, yielding:

$$P^{\mathcal{M}}_{\mathbf{Y}_*}(\mathcal{Y}_*) = P_{\mathbf{U}}\big((\mathbf{Y}_{\mathcal{M}*})^{-1}[\mathcal{Y}_*]\big). \qquad (1)$$

Equation 1 is called L3-valuation. After the L3-valuation, a statement $\varphi \in$ L3 evaluated by $\mathcal{M}$, denoted as $\varphi(\mathcal{M})$, can be checked for validity. If $\varphi(\mathcal{M})$ holds, we write $\mathcal{M} \models \varphi$.

The Pearl Causal Hierarchy (PCH) is defined as the combination of the above syntax and semantics. Specifically, when an SCM $\mathcal{M}$ is given, the collection of observational, interventional, and counterfactual distributions defined syntactically by L1, L2, L3 and semantically by L1-, L2-, L3-valuations constitutes the PCH. In this sense, the PCH encapsulates all causal information entailed by an SCM. Furthermore, the PCH exhibits a strict hierarchy. In terms of syntax, the representations of the terms $P(\alpha_i)$ imply that L1 ⊆ L2 ⊆ L3. In terms of logical expressiveness, similar theorems have been established, known as the Causal Hierarchy Theorem (CHT) (Bareinboim et al., 2022). To illustrate the expressiveness of Li, consistency is introduced. Let Li$(\mathcal{M}) = \{\varphi(\mathcal{M}) \mid \varphi \in \mathrm{L}i\}$ denote the Li-theory of $\mathcal{M}$; Li-consistency is then defined as follows:

Definition 2.2 (Li-Consistency). SCMs $\mathcal{M}^{(1)}$ and $\mathcal{M}^{(2)}$ are Li-consistent, denoted $\mathcal{M}^{(1)} \sim_{\mathrm{L}i} \mathcal{M}^{(2)}$, if their Li-theories are identical, i.e., Li$(\mathcal{M}^{(1)}) =$ Li$(\mathcal{M}^{(2)})$.

The CHT demonstrates that in the PCH, Li-consistency generally does not imply Lj-consistency for $i < j$, formally indicating that higher levels possess greater expressiveness than lower levels.

2.3. Counterfactual Identification

In practice, the underlying true SCM $\mathcal{M}$ is almost never fully specified, making direct reasoning on $\mathcal{M}$ impractical. Consequently, researchers rely on partial knowledge about $\mathcal{M}$ to address causal questions, which necessitates constructing a proxy $\mathcal{M}'$ that satisfies this partial knowledge and performing equivalent reasoning on $\mathcal{M}'$. Causal identification is the task of proving or reasoning whether any proxy constructed from partial knowledge provides consistent answers to specific causal questions. For example, when the true SCM $\mathcal{M}$ is assumed to be Markovian and both the observational distribution and the causal graph are available, constructing a Markovian Causal Bayesian Network (CBN) suffices to answer any causal question in L2 via truncated factorization (Pearl, 2009). Thus, Markovian CBNs can be regarded as identifiable at the intervention level. When the Markovian assumption fails, semi-Markovian CBNs can still answer a subset of causal questions in L2 using do-calculus.

Regarding counterfactuals, as they represent the highest level in the PCH, achieving counterfactual identifiability requires stricter conditions and more complex methodologies. For instance, even Markovian CBNs cannot ensure consistency when answering certain L3 questions (Nasr-Esfahany & Kiciman, 2023). Prior research on counterfactual identification often focuses on identifying a subset of counterfactual statements $\varphi \in$ L3. These efforts can be categorized into constraining causal structures to identify counterfactual effects, such as (Shpitser & Pearl, 2008), or constraining causal mechanisms to identify counterfactual outcomes, such as (Nasr-Esfahany et al., 2023).
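The failure of L1/L2 knowledge to pin down L3 quantities can be seen in a classic toy case. The following sketch (our illustration, not from the paper) simulates two Markovian SCMs over the graph X → Y that agree on all observational and interventional distributions yet disagree on a counterfactual query:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.integers(0, 2, n)          # X ~ Bern(1/2)
u = rng.integers(0, 2, n)          # U ~ Bern(1/2), Markovian

# Two SCMs over the graph X -> Y that agree on all of L1 and L2:
scms = {"M_A": lambda x_, u_: u_,            # Y = U (X is ignored)
        "M_B": lambda x_, u_: x_ ^ u_}       # Y = X xor U

for name, f in scms.items():
    y = f(x, u)
    y_do1 = f(np.ones(n, dtype=int), u)      # submodel do(X = 1), same U
    obs = (x == 0) & (y == 0)                # evidence for the counterfactual
    print(name,
          f"P(Y=1)={y.mean():.3f}",
          f"P(Y[1]=1)={y_do1.mean():.3f}",
          f"P(Y[1]=1 | X=0, Y=0)={y_do1[obs].mean():.3f}")
# M_A and M_B match on the L1 and L2 quantities (both ~0.5) but return 0 vs 1
# on the L3 query, so the counterfactual is not identified from L1/L2 alone.
```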
2.4. Model Identifiability

SCMs are a special class of generative models, where the exogenous distribution and causal mechanisms can be interpreted as the latent distribution and generative function of the model, respectively. Identifiability has been a continuously discussed topic in the literature, appearing in representation learning (Bengio et al., 2013), causal discovery (Glymour et al., 2019), causal representation learning (Schölkopf et al., 2021), and causal quantity inference (Pearl, 2009). Providing a broad definition of identifiability may lead to ambiguity. Borrowing from the definition of identifiability in (Khemakhem et al., 2020), these notions of identifiability can be unified through equivalence classes:

Definition 2.3 ($\sim$-Identifiability). Let $\mathcal{A}$ denote a set of assumptions, $[\mathcal{A}] = \{\mathcal{M} \mid \mathcal{M} \models \mathcal{A}\}$ include all models satisfying the assumptions, and $\sim$ denote an equivalence relation between models. Then a model is said to be $\sim$-identifiable from $\mathcal{A}$, or identifiable up to the $\sim$-equivalence class, if for any $\mathcal{M}^{(1)}, \mathcal{M}^{(2)} \in [\mathcal{A}]$, $\mathcal{M}^{(1)} \sim \mathcal{M}^{(2)}$.

We now interpret the problem of causal identifiability from the perspective of $\sim$-identifiability. Suppose the causal question involves a set of Li-statements $\varphi \subseteq$ Li, and two SCMs $\mathcal{M}^{(1)}$ and $\mathcal{M}^{(2)}$ are said to be $\varphi$-consistent if $\varphi(\mathcal{M}^{(1)}) = \varphi(\mathcal{M}^{(2)})$. Since $\varphi$-consistency is an equivalence relation between SCMs, denoted as $\mathcal{M}^{(1)} \sim_\varphi \mathcal{M}^{(2)}$, the problem of causal identification with respect to $\varphi$ can be reframed as a problem of $\sim_\varphi$-identifiability.

The CHT states that when $\mathcal{A}$ contains only low-level knowledge, $\sim_\varphi$-identifiability is often unattainable. Achieving this requires assuming higher-level information. For instance, under the assumptions of a known L1 observational distribution, causal graph, and Markovianity, the causal graph encodes structural constraints on L2, enabling the constructed CBN to answer all questions in L2. Since the union of all statements $\varphi \subseteq$ L2 equals L2, this property is also referred to as L2-identifiability.

This paper focuses on L3-identifiability, which is an enhanced version of L2-identifiability, requiring $\sim_\varphi$-identifiability for any $\varphi \subseteq$ L3. If $\mathcal{A}$ satisfies L3-identifiability, then by the definition of L3-consistency, any $\mathcal{M}^{(1)}, \mathcal{M}^{(2)} \in [\mathcal{A}]$ satisfy $\mathcal{M}^{(1)} \sim_{\mathrm{L}3} \mathcal{M}^{(2)}$. Since L3 represents the highest level in the PCH and encodes all causal information of the SCM, if $\mathcal{M}^{(1)} \sim_{\mathrm{L}3} \mathcal{M}^{(2)}$, the two models are indistinguishable under any causal statement. Therefore, L3-identifiability is the ultimate goal for causal identifiability within the PCH.
3. Exogenous Isomorphism

L3-identifiability, by definition, is not straightforward to handle, as it is indirectly defined through the PCH rather than directly based on the model itself. Therefore, we aim to find a simpler model-based identifiability notion that implies L3-identifiability, thereby simplifying the problem. One possible choice is $=$-identifiability, which requires uniquely identifying the generative model. Some prior works have achieved $=$-identifiability (Xi & Bloem-Reddy, 2023), which indirectly induces L3-identifiability. This serves as one approach to addressing L3-identifiability, but the required assumptions are often too strong. An alternative property is defined via counterfactual equivalence (Peters et al., 2017, Proposition 6.49), which, while not insisting on uniqueness of the causal mechanisms, does demand full recovery of the exogenous distribution.

Between $=$-identifiability, counterfactual equivalence, and L3-identifiability, there should exist other forms of $\sim$-identifiability, as counterfactual reasoning does not require a completely fixed latent representation. This is akin to humans being able to answer what-if questions without fully understanding the underlying factors of the physical world. Identifying such a form of $\sim$-identifiability is a goal worth exploring. Motivated by this, we have identified an equivalence relation between SCMs, denoted as $\sim_{EI}$ and referred to as exogenous isomorphism, for which $\sim_{EI}$-identifiability strictly implies L3-identifiability, yet is weaker than the previous two notions. This characterization more precisely reflects the strength of model identifiability required to achieve complete counterfactual identifiability.

Let the recursive SCMs $\mathcal{M}^{(k)} = \langle I, \Omega_{\mathbf{V}}, \Omega^{(k)}_{\mathbf{U}}, f^{(k)}, P^{(k)}_{\mathbf{U}} \rangle$ share the index set $I$ and the domain of endogenous variables $\Omega_{\mathbf{V}}$. For a given endogenous value $v \in \Omega_{\mathbf{V}}$, we abbreviate $f^{(k)}_i(v_{pa^{(k)}(i)}, \cdot)$ as $f^{(k)}_i(v, \cdot)$. Exogenous isomorphism is then defined as:

Definition 3.1 (Exogenous Isomorphism). Recursive SCMs $\mathcal{M}^{(1)}$ and $\mathcal{M}^{(2)}$ are said to be exogenously isomorphic, denoted $\mathcal{M}^{(1)} \sim_{EI} \mathcal{M}^{(2)}$, if there exist a shared causal order and a function $h : \Omega^{(1)}_{\mathbf{U}} \to \Omega^{(2)}_{\mathbf{U}}$ satisfying:

- Component-wise bijection: $h = (h_i)_{i \in I}$, where each $h_i : \Omega^{(1)}_{U_i} \to \Omega^{(2)}_{U_i}$ is a bijection;
- Exogenous distribution isomorphism: $P^{(2)}_{\mathbf{U}} = h_* P^{(1)}_{\mathbf{U}}$;
- Causal mechanism isomorphism: for each $i \in I$, for almost every $u^{(1)}_i \in \Omega^{(1)}_{U_i}$ and all $v \in \Omega_{\mathbf{V}}$, $f^{(2)}_i(v, h_i(u^{(1)}_i)) = f^{(1)}_i(v, u^{(1)}_i)$.

The implication of $\sim_{EI}$-identifiability for L3-identifiability is formally stated in the following theorem:

Theorem 3.2 ($\sim_{EI}$ Implies $\sim_{\mathrm{L}3}$). For recursive SCMs $\mathcal{M}^{(1)}$ and $\mathcal{M}^{(2)}$, if $\mathcal{M}^{(1)} \sim_{EI} \mathcal{M}^{(2)}$, then $\mathcal{M}^{(1)} \sim_{\mathrm{L}3} \mathcal{M}^{(2)}$.

Sketch of proof. The mechanism isomorphism of each $f_i$ can be progressively propagated into the potential responses, ensuring $\mathbf{V}^{(1)}_{\mathcal{M}*} = \mathbf{V}^{(2)}_{\mathcal{M}*} \circ h$. Given $P^{(2)}_{\mathbf{U}} = h_* P^{(1)}_{\mathbf{U}}$, we have $(\mathbf{V}^{(2)}_{\mathcal{M}*} \circ h)_* P^{(1)}_{\mathbf{U}} = (\mathbf{V}^{(2)}_{\mathcal{M}*})_* (h_* P^{(1)}_{\mathbf{U}})$, i.e., $(\mathbf{V}^{(1)}_{\mathcal{M}*})_* P^{(1)}_{\mathbf{U}} = (\mathbf{V}^{(2)}_{\mathcal{M}*})_* P^{(2)}_{\mathbf{U}}$. Thus, all L3-valuations for $\mathcal{M}^{(1)}$ and $\mathcal{M}^{(2)}$ are identical, and any statement $\varphi \in$ L3 yields the same result. Therefore, $\mathcal{M}^{(1)} \sim_{\mathrm{L}3} \mathcal{M}^{(2)}$. For a detailed proof, see Appendix A.2.
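A toy numeric check of Theorem 3.2 (our illustration, under the stated Gaussian/uniform assumptions): two SCMs with different exogenous domains, related by the isomorphism $h_i = \Phi$ (the standard normal CDF), produce the same joint counterfactual law:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 200_000

# M1: U_i ~ N(0,1),       V1 = U1,       V2 = V1 + U2
# M2: U_i ~ Uniform(0,1), V1 = ppf(U1),  V2 = V1 + ppf(U2)
# h_i = norm.cdf is an exogenous isomorphism from M1 to M2: it pushes N(0,1)
# to Uniform(0,1), and M2's mechanisms compose with ppf = cdf^{-1} to undo it.
def joint_ctf(e1, e2):
    """Joint counterfactual (V2, V2[do(V1=0)]) computed from shared noise."""
    return np.stack([e1 + e2, 0.0 + e2])

a = joint_ctf(*rng.normal(size=(2, n)))              # samples under M1
b = joint_ctf(*norm.ppf(rng.uniform(size=(2, n))))   # samples under M2

for q in (0.1, 0.5, 0.9):
    print(q, np.quantile(a, q, axis=1).round(3), np.quantile(b, q, axis=1).round(3))
# The quantiles of the joint counterfactual law agree across M1 and M2,
# illustrating Theorem 3.2: exogenous isomorphism implies L3-consistency.
```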
4. Bijective SCM

Once identifiability is defined, the next step is to explore the assumption sets that induce identifiability. In this work, for simplicity, we target some special settings of recursive SCMs. Throughout, the set of positive integers $\{i \mid l \leq i \leq r, i \in \mathbb{N}^+\}$ is abbreviated as $l{:}r$ for $l, r \in \mathbb{N}^+$. Moreover, we assume that for each $i \in I$, the domain of exogenous variables $\Omega_{U_i}$ and the domain of endogenous variables $\Omega_{V_i}$ are both $\mathbb{R}^{d_i}$, indexed with $1{:}d_i$ as the index set. That is, both exogenous and endogenous variables are random vectors, and the nested index set $\bar{I} = \{(i, j) \mid i \in I, j \in 1{:}d_i\}$ indexes the individual dimensions of $\Omega_{\mathbf{U}}$ and $\Omega_{\mathbf{V}}$.

Since the cardinalities of the exogenous and endogenous domains are equal, a bijection can be established between these domains. A bijection implies no information loss. If the causal mechanism is also bijective, then for any observation in the endogenous domain, a distinguishable latent encoding can be identified in the exogenous domain. Below, we formally define this specific setting of SCMs:

Definition 4.1 (Bijective SCM (BSCM)). A recursive SCM $\mathcal{M}$ is called a bijective SCM (BSCM) if its solution mapping $\Gamma$ is a bijection.

Proposition 4.2. A recursive SCM $\mathcal{M}$ is a BSCM if and only if $f_i(v, \cdot)$ is a bijection for every $i \in I$ and all $v \in \Omega_{\mathbf{V}}$.

For BSCMs $\mathcal{M}^{(1)}$ and $\mathcal{M}^{(2)}$, since their solution mappings $\Gamma^{(k)} : \Omega^{(k)}_{\mathbf{U}} \to \Omega_{\mathbf{V}}$ are bijections, it is evident that the composition $(\Gamma^{(2)})^{-1} \circ \Gamma^{(1)} : \Omega^{(1)}_{\mathbf{U}} \to \Omega^{(2)}_{\mathbf{U}}$ is also a bijection. Given that exogenous isomorphism requires a bijection between $\Omega^{(1)}_{\mathbf{U}}$ and $\Omega^{(2)}_{\mathbf{U}}$, a natural question arises: can $(\Gamma^{(2)})^{-1} \circ \Gamma^{(1)}$ directly serve as an exogenous isomorphism? The following theorem provides a precise answer.

Theorem 4.3 (BSCM-EI). If two BSCMs $\mathcal{M}^{(1)}$ and $\mathcal{M}^{(2)}$ share a common causal order and the same observational distribution $P_{\mathbf{V}}$, then $\mathcal{M}^{(1)} \sim_{EI} \mathcal{M}^{(2)}$ if and only if for every $i \in I$, there exists a bijection $h_i : \Omega^{(1)}_{U_i} \to \Omega^{(2)}_{U_i}$ such that for all $v \in \Omega_{\mathbf{V}}$, $(f^{(2)}_i(v, \cdot))^{-1} \circ (f^{(1)}_i(v, \cdot)) = h_i$ almost surely.¹

¹ The BGM proposed in (Nasr-Esfahany et al., 2023) corresponds to the $f_i$ of a BSCM. Furthermore, component-wise bijection and causal mechanism isomorphism in exogenous isomorphism are described there as BGM equivalence. Therefore, Theorem 3.2 extends (Nasr-Esfahany et al., 2023, Proposition 6.2).

4.1. EI-Identification for BSCM

To derive assumption sets that imply $\sim_{EI}$-identifiability, we need to restrict our perspective to a single SCM $\mathcal{M}$. Consider the conditions in Theorem 4.3, where there exists a function $h_i : \Omega^{(1)}_{U_i} \to \Omega^{(2)}_{U_i}$ such that for all $v \in \Omega_{\mathbf{V}}$, $(f^{(2)}_i(v, \cdot))^{-1} \circ (f^{(1)}_i(v, \cdot)) = h_i$ holds almost surely. Now, consider different $v, v' \in \Omega_{\mathbf{V}}$. Since both $f^{(1)}_i(v, \cdot)$ and $f^{(2)}_i(v, \cdot)$ are bijections, by the associativity of composition, $(f^{(1)}_i(v', \cdot)) \circ (f^{(1)}_i(v, \cdot))^{-1} = (f^{(2)}_i(v', \cdot)) \circ (f^{(2)}_i(v, \cdot))^{-1}$ almost surely, where both sides come from the SCM $\mathcal{M}$, and we explicitly name this concept as follows:

Definition 4.4 (Counterfactual Transport). For a BSCM $\mathcal{M}$, the function $K_{\mathcal{M}} : \Omega_{\mathbf{V}} \times \Omega_{\mathbf{V}} \times \Omega_{\mathbf{V}} \to \Omega_{\mathbf{V}}$ is called the counterfactual transport if $K_{\mathcal{M}} = (K_{\mathcal{M},i})_{i \in I}$ and for every $i \in I$ and all $v, v' \in \Omega_{\mathbf{V}}$, the component $K_{\mathcal{M},i}(\cdot, v, v') = (f_i(v', \cdot)) \circ (f_i(v, \cdot))^{-1}$.

Under Markovianity, the practical meaning of counterfactual transport is transport between conditional distributions:

Proposition 4.5. If the BSCM $\mathcal{M}$ is Markovian, then for almost all $v, v' \in \Omega_{\mathbf{V}}$, the conditional distributions satisfy $P_{V_i | \mathbf{V}_{pa(i)}}(\cdot, v') = (K_{\mathcal{M},i}(\cdot, v, v'))_* P_{V_i | \mathbf{V}_{pa(i)}}(\cdot, v)$.²

² A more general notion of counterfactual transport was defined in (Lara et al., 2024) from the perspective of coupling between conditional distributions. In this paper, the counterfactual transport in Markovian BSCMs (Definition 4.4 and Proposition 4.5) is a special instance of their more general framework.

Combining this with Theorem 3.2, we find that when all counterfactual transports are fixed, $\sim_{EI}$-identifiability can be achieved. Let the following assumptions be defined: (i) $\mathcal{A}_{\mathrm{BSCM}}$: $\mathcal{M}$ is a BSCM; (ii) $\mathcal{A}_{\leq}$: the total order $\leq$ is a causal order for $\mathcal{M}$; (iii) $\mathcal{A}_{P_{\mathbf{V}}}$: the observational distribution of $\mathcal{M}$ is $P_{\mathbf{V}}$ with strictly positive density; (iv) $\mathcal{A}_K$: there exists a function $K : \Omega_{\mathbf{V}} \times \Omega_{\mathbf{V}} \times \Omega_{\mathbf{V}} \to \Omega_{\mathbf{V}}$, $K = (K_i)_{i \in I}$, such that the counterfactual transport satisfies $K_{\mathcal{M}} = K$ almost surely. Let $\mathcal{A}_{\{\mathrm{BSCM}, \leq, P_{\mathbf{V}}, K\}} = \{\mathcal{A}_{\mathrm{BSCM}}, \mathcal{A}_{\leq}, \mathcal{A}_{P_{\mathbf{V}}}, \mathcal{A}_K\}$. Then:

Theorem 4.6 (EI-ID from Counterfactual Transport). An SCM is $\sim_{EI}$-identifiable from $\mathcal{A}_{\{\mathrm{BSCM}, \leq, P_{\mathbf{V}}, K\}}$.
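For intuition, consider an additive-noise special case (an illustrative example of Definition 4.4, not from the paper): if $f_i(v, u_i) = g_i(v_{pa(i)}) + u_i$ for some function $g_i$, then for any outcome $w \in \Omega_{V_i}$,

$$K_{\mathcal{M},i}(w, v, v') = f_i\big(v', f_i(v, \cdot)^{-1}(w)\big) = w - g_i(v_{pa(i)}) + g_i(v'_{pa(i)}),$$

i.e., the counterfactual transport is a shift that carries a factual outcome under parents $v_{pa(i)}$ to its counterfactual outcome under $v'_{pa(i)}$, without ever recovering the noise distribution itself. Assumption $\mathcal{A}_K$ fixes exactly this transport, which is why it suffices for $\sim_{EI}$-identifiability even though $g_i$ and $P_{U_i}$ are not separately pinned down.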
Verifying the assumption set $\mathcal{A}_K$ in practice is challenging, making Theorem 4.6 difficult to apply. By combining Proposition 4.5 and considering Markovian BSCMs, a natural direction is to identify specific types of transport that are always well-defined between conditional distributions. One such transport, the Knothe-Rosenblatt (KR) transport, is constructed using one-dimensional conditional cumulative distribution functions in $1{:}d$ order, ensuring that the mass at each point of the distribution follows a lexicographical order, and:

Lemma 4.7 (Santambrogio 2015, Proposition 2.18). Given any two distributions $P$ and $P'$ on $\mathbb{R}^d$ with strictly positive densities, the KR transport $T : \mathbb{R}^d \to \mathbb{R}^d$ such that $P' = T_* P$ always exists and is almost surely unique.

The KR transport induces a special case of $\mathcal{A}_{\{\mathrm{BSCM}, \leq, P_{\mathbf{V}}, K\}}$ in Theorem 4.6. Let the following assumptions be defined: (i) $\mathcal{A}_M$: $\mathcal{M}$ is Markovian; (ii) $\mathcal{A}_{\mathrm{KR}}$: for each $i \in I$, the counterfactual transport component $K_{\mathcal{M},i}$ is almost surely equivalent to the KR transport. Let $\mathcal{A}_{\{\mathrm{BSCM}, \leq, M, P_{\mathbf{V}}, \mathrm{KR}\}} = \{\mathcal{A}_{\mathrm{BSCM}}, \mathcal{A}_{\leq}, \mathcal{A}_M, \mathcal{A}_{P_{\mathbf{V}}}, \mathcal{A}_{\mathrm{KR}}\}$. Then:

Theorem 4.8 (EI-ID from KR Transport). An SCM is $\sim_{EI}$-identifiable from $\mathcal{A}_{\{\mathrm{BSCM}, \leq, M, P_{\mathbf{V}}, \mathrm{KR}\}}$.

Nevertheless, in most practical scenarios, conditional cumulative distribution functions are unknown, and one often only has access to samples from the distributions. Therefore, the KR transport cannot be constructed as defined, motivating the need for a tractable approximation of KR transport. This leads to a more practical subclass of BSCMs.

5. Triangular Monotonic SCM

We call a function $T(x) = (T_j(x_{1:j-1}, x_j))_{j \in 1:d}$, where $T : \mathbb{R}^d \to \mathbb{R}^d$, a triangular mapping if the $j$-th component $T_j$ depends only on $x_{1:j}$. The name originates from the fact that the Jacobian matrix of such a mapping is triangular. For a triangular mapping $T$, we define its monotonicity signature $\xi(T) = (\xi(T_j))_{j \in 1:d}$ as:

$$\xi(T_j) = \begin{cases} 1 & \forall x_{1:j-1} \in \mathbb{R}^{j-1},\ T_j(x_{1:j-1}, \cdot) \text{ is s.m.i.}, \\ -1 & \forall x_{1:j-1} \in \mathbb{R}^{j-1},\ T_j(x_{1:j-1}, \cdot) \text{ is s.m.d.}, \\ 0 & \text{otherwise}, \end{cases}$$

where s.m.i. and s.m.d. abbreviate strictly monotonically increasing and decreasing, respectively. If $d = \sum_{j=1}^{d} \xi(T_j)$, i.e., every component is consistently s.m.i., we call $T$ a triangular monotonic increasing (TMI) mapping. TMI mappings are closely related to the KR transport:

Lemma 5.1 (Jaini et al. 2019, Theorem 1). Given any two distributions $P$ and $P'$ on $\mathbb{R}^d$ with strictly positive densities, if a TMI mapping $T : \mathbb{R}^d \to \mathbb{R}^d$ satisfies $P' = T_* P$, then $T$ is almost surely equivalent to the KR transport.

TMI mappings can be further generalized. For a triangular mapping $T$, if $d = \sum_{j=1}^{d} |\xi(T_j)|$, i.e., every component is either consistently s.m.i. or consistently s.m.d., we call $T$ a triangular monotonic (TM) mapping. TM mappings, in addition to being bijective due to monotonicity, exhibit special properties: they remain TM mappings under inversion, composition, and selection of contiguous components. Moreover, their monotonicity signatures adhere to specific rules, which are elaborated in Appendix A.4. Crucially, for two TM mappings $T^{(1)}$ and $T^{(2)}$ with identical monotonicity signatures, $T^{(2)} \circ (T^{(1)})^{-1}$ is always a TMI mapping.
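In one dimension the KR transport of Lemma 4.7 reduces to the increasing rearrangement $T = F'^{-1} \circ F$, which is also the unique TMI pushforward map in the sense of Lemma 5.1. A quick numeric check (our illustration, using scipy and two Gaussians as the assumed distributions):

```python
import numpy as np
from scipy.stats import norm

# 1-D KR transport between P = N(0,1) and P' = N(3, 2^2): the increasing
# rearrangement T = F'^{-1} o F, the unique s.m.i. map with T_* P = P'.
F = norm(0, 1).cdf
F_inv_prime = norm(3, 2).ppf          # scipy's scale parameter is the std dev

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 100_000)         # samples from P
y = F_inv_prime(F(x))                 # pushforward T_* P

print("mean, std of T_*P:", y.mean().round(3), y.std().round(3))   # ~ (3, 2)
print("T monotone increasing:", bool(np.all(np.diff(y[np.argsort(x)]) >= 0)))
```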
Next, we aim to relate the solution mapping $\Gamma$ of an SCM to TM mappings. However, TM mappings are defined over the index set $1{:}d$, whereas SCMs are defined over the nested index set $\bar{I} = \{(i, j) \mid i \in I, j \in 1{:}d_i\}$. To bridge this gap, we introduce the concept of flattening, a generalization of permutation. For a finite index set $I$, we call a bijection $\iota : I \to 1{:}|I|$ a flattening mapping of $I$, and flattening refers to re-indexing according to $\iota$. Define the re-indexing $P_\iota : \prod_{i \in I} \Omega_{X_i} \to \prod_{j=1}^{|I|} \Omega_{X_{\iota^{-1}(j)}}$ such that $P_\iota((x_i)_{i \in I}) = (x_{\iota^{-1}(j)})_{j \in \iota[I]}$, where $\iota[I]$ is the image set. Since $\iota$ is bijective, $P_\iota$ is also bijective, and for any $J \subseteq 1{:}|I|$, we have $P_\iota^{-1}((x_j)_{j \in J}) = (x_{\iota(i)})_{i \in \iota^{-1}[J]}$, where $\iota^{-1}[J]$ is the pre-image set.

For the nested index set $\bar{I}$ and a total order $\leq$ on $I$, if a flattening mapping $\iota$ satisfies $\iota(i, j) \leq \iota(i, j')$ if and only if $j \leq j'$ for any $(i, j), (i, j') \in \bar{I}$, and $\iota(i, j) < \iota(i', j')$ for any $(i, j), (i', j') \in \bar{I}$ with $i < i'$, we call it a vectorization under $\leq$. In other words, vectorization merges all random vectors into a unified order. Based on whether the solution mapping $\Gamma$ is a TM mapping under vectorization, we define a more specialized class of SCMs:

Definition 5.2 (Triangular Monotonic SCM (TM-SCM)). A recursive SCM $\mathcal{M}$ is called a triangular monotonic SCM (TM-SCM) if there exists a vectorization $\iota$ under a causal order $\leq$ such that the re-indexed $P_\iota \circ \Gamma$ is a TM mapping.

Proposition 5.3. A recursive SCM $\mathcal{M}$ is a TM-SCM if and only if $f_i(v, \cdot)$ is a TM mapping and there exists $\xi_i$ such that $\xi(f_i(v, \cdot)) = \xi_i$ for every $i \in I$ and all $v \in \Omega_{\mathbf{V}}$.

5.1. EI-Identification for TM-SCM

As a subclass of BSCMs, TM-SCMs allow the assumption set in Theorem 4.8 to be further simplified. The new identifiability strategy is specifically tied to TM-SCMs. Let $\mathcal{A}_{\{\mathrm{TM\text{-}SCM}, \leq, M, P_{\mathbf{V}}\}} = \{\mathcal{A}_{\mathrm{TM\text{-}SCM}}, \mathcal{A}_{\leq}, \mathcal{A}_M, \mathcal{A}_{P_{\mathbf{V}}}\}$, where $\mathcal{A}_{\mathrm{TM\text{-}SCM}}$ states that $\mathcal{M}$ is a TM-SCM. Then:

Corollary 5.4 (EI-ID from Triangular Monotonicity). An SCM is $\sim_{EI}$-identifiable from $\mathcal{A}_{\{\mathrm{TM\text{-}SCM}, \leq, M, P_{\mathbf{V}}\}}$.

Sketch of proof. By combining the definition of TM-SCMs with the properties of TM mappings, any counterfactual transport in a TM-SCM is a TMI mapping. Furthermore, under Markovianity, given the causal order and observational distribution $P_{\mathbf{V}}$, the KR transport is almost surely uniquely determined by Lemmas 4.7 and 5.1, which then implies Theorem 4.8. A detailed proof is provided in Appendix A.4.

Corollary 5.4 (together with Theorem 3.2) can be viewed as a strengthened version of the classical result for Markovian CBNs, which exhibit L2-identifiability, i.e., SCMs are L2-identifiable from $\mathcal{A}_{\{\leq, M, P_{\mathbf{V}}\}}$. Corollary 5.4 augments $\mathcal{A}_{\{\leq, M, P_{\mathbf{V}}\}}$ with the additional assumption $\mathcal{A}_{\mathrm{TM\text{-}SCM}}$, strengthening L2-identifiability to L3-identifiability. Moreover, this extends and unifies the theoretical results in (Lu et al., 2020, Theorem 1), (Nasr-Esfahany et al., 2023, Theorem 5.1), and (Scetbon et al., 2024, Theorem 2.14). Specifically, these theorems focus on counterfactual identifiability from the perspectives of bijectivity, monotonicity, and Markovianity, demonstrating identifiability based on abduction-action-prediction. Compared to these results, our theoretical contributions offer the following advancements: (i) generalizing the measurable space of each endogenous variable from a scalar $\mathbb{R}$ to a vector $\mathbb{R}^{d_i}$, enabling support for a broader class of SCMs; (ii) demonstrating $\sim_{EI}$-identifiability based on exogenous isomorphism theory (i.e., Theorem 3.2), which implies L3-identifiability rather than merely the identifiability of counterfactual outcomes.
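To make Definition 5.2 and Proposition 5.3 concrete, here is a toy check (an illustrative construction of ours, not from the paper) that a flattened solution mapping with vector-valued nodes has a lower triangular Jacobian with positive diagonal, i.e., is TMI:

```python
import numpy as np

# Toy TM-SCM: I = {1, 2}, V1 in R, V2 in R^2, causal order 1 <= 2.
#   f1(u1)     = u1**3 + u1                               (s.m.i. in u1)
#   f2(v1, u2) = [tanh(v1) + u2[0],
#                 v1**2 + exp(u2[0]) + u2[1]]             (TM, s.m.i. in u2)
def gamma(u):                      # flattened solution mapping P_iota o Gamma
    v1 = u[0] ** 3 + u[0]
    v21 = np.tanh(v1) + u[1]
    v22 = v1 ** 2 + np.exp(u[1]) + u[2]
    return np.array([v1, v21, v22])

# Finite-difference Jacobian at a random point; TMI means lower triangular
# structure with a strictly positive diagonal.
u0, eps = np.random.default_rng(0).normal(size=3), 1e-6
J = np.stack([(gamma(u0 + eps * e) - gamma(u0 - eps * e)) / (2 * eps)
              for e in np.eye(3)], axis=1)
print(np.round(J, 3))
print("triangular:", np.allclose(J, np.tril(J)), " s.m.i.:", bool(np.all(np.diag(J) > 0)))
```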
6. Neural TM-SCM

Problem formulation. Consider the true underlying SCM $\mathcal{M}^* = \langle I, \Omega_{\mathbf{V}}, \Omega^*_{\mathbf{U}}, f^*, P^*_{\mathbf{U}} \rangle$, where $\Omega^*_{\mathbf{U}}$, $f^*$, and $P^*_{\mathbf{U}}$ are unknown. Assuming $\mathcal{M}^*$ is a Markovian TM-SCM with causal order $\leq$, how can one construct a parameterized proxy SCM $\mathcal{M}_\theta = \langle I, \Omega_{\mathbf{V}}, \Omega_{\mathbf{U}_\theta}, f_\theta, P_{\mathbf{U}_\theta} \rangle$ based on the observational dataset $D_{\mathbf{V}} = \{v^{(i)} \mid v^{(i)} \sim P_{\mathbf{V}}\}_{i=1}^{N}$ such that $\mathcal{M}_\theta$ is as L3-consistent with $\mathcal{M}^*$ as possible?

According to Corollary 5.4 and Theorem 3.2, this problem is equivalent to constructing $\mathcal{M}_\theta$ such that it satisfies $\mathcal{A}_{\{\mathrm{TM\text{-}SCM}, \leq, M, P_{\mathbf{V}}\}}$ as closely as possible. This entails four construction objectives: (i) G1: $\mathcal{M}_\theta \models \mathcal{A}_{\mathrm{TM\text{-}SCM}}$, meaning that $\mathcal{M}_\theta$ belongs to the TM-SCM class; (ii) G2: $\mathcal{M}_\theta \models \mathcal{A}_{\leq}$, meaning that $\leq$ is one of the causal orders of $\mathcal{M}_\theta$; (iii) G3: $\mathcal{M}_\theta \models \mathcal{A}_M$, meaning that $\mathcal{M}_\theta$ satisfies Markovianity; (iv) G4: $\mathcal{M}_\theta \models \mathcal{A}_{P_{\mathbf{V}}}$, meaning that $\mathcal{M}_\theta$ induces the observational distribution $P_{\mathbf{V}}$.

To satisfy G1, $\mathcal{M}_\theta$ must be a TM-SCM or one of its special subclasses. Furthermore, in this paper, we primarily focus on neural networks that provide the parameters $\theta$ for $\mathcal{M}_\theta$, thereby referring to it as a neural TM-SCM. Previous works have constructed neural SCMs that satisfy this requirement in the scalar case; we categorize these works into four categories and propose corresponding prototype models as extensions to the vector case (a sketch of the first prototype follows this list):

DNME: Constrains the causal mechanism $f_{i,\theta}$ to be a TM mapping of the form

$$f_{i,\theta}(v_{pa(i)}, u_i) = b_{i,\theta}(v_{pa(i)}) + a_{i,\theta}(v_{pa(i)}) \odot u_i, \qquad (2)$$

where $a_{i,\theta}(v_{pa(i)}), b_{i,\theta}(v_{pa(i)}) \in \mathbb{R}^{d_i}$ with the vector $a_{i,\theta}$ always strictly positive, and $\odot$ denotes component-wise multiplication. Since the Jacobian matrix of $f_{i,\theta}$ w.r.t. $u_i$ is diagonal, it is named Diagonal Noise MEchanism, as in LSNM (Immer et al., 2023).

TNME: Constrains the causal mechanism $f_{i,\theta}$ to be a TM mapping of the form

$$f_{i,\theta}(v_{pa(i)}, u_i) = b_{i,\theta}(v_{pa(i)}) + A_{i,\theta}(v_{pa(i)})\, u_i, \qquad (3)$$

where $A_{i,\theta}(v_{pa(i)}) \in \mathbb{R}^{d_i \times d_i}$ and $b_{i,\theta}(v_{pa(i)}) \in \mathbb{R}^{d_i}$, with the matrix $A_{i,\theta}$ always lower triangular with a strictly positive diagonal. Since the Jacobian matrix of $f_{i,\theta}$ w.r.t. $u_i$ is lower triangular, it is named Triangular Noise MEchanism, as in FiP (Scetbon et al., 2024).

CMSM: Constrains the re-indexed solution mapping $P_\iota \circ \Gamma_\theta$ to be a TM mapping by composing multiple TM mappings

$$P_\iota \circ \Gamma_\theta = T_{1,\theta} \circ \cdots \circ T_{n,\theta}, \qquad (4)$$

where each $T_{i,\theta}$ can be an autoregressive affine transformation in normalizing flows (Dinh et al., 2017). Since the solution mapping $\Gamma_\theta$ is constructed by composing multiple mappings, it is named Composed Mapped Solution Mapping, as in Causal NF (Javaloy et al., 2023).

TVSM: Constrains the re-indexed solution mapping $P_\iota \circ \Gamma_\theta$ to be a TM mapping by defining it as the flow constructed from the solution to the ODE

$$\begin{cases} \mathrm{d}x(t) = v_\theta(x(t), t)\, \mathrm{d}t, \\ x(0) = v_0, \end{cases} \qquad (5)$$

where the vector field $v_\theta$ is a Lipschitz continuous triangular mapping. According to the Picard-Lindelöf theorem, such flows are TMI; see Lemma A.21 in Appendix A.4. Since the solution mapping $\Gamma_\theta$ is implied by a triangular velocity field, it is named Triangular Velocity Solution Mapping, as in CFM (Khoa Le et al., 2025).

The relationship between these related works and TM-SCM is detailed in Appendix B.4, and the implementation details of the four prototype models can be found in Appendix C.
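A minimal sketch of the DNME mechanism in Eq. (2) (our illustrative PyTorch parameterization, not the paper's Appendix C implementation; the module name and softplus positivity trick are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DNMEMechanism(nn.Module):
    """f_i(v_pa, u_i) = b(v_pa) + a(v_pa) * u_i with a > 0 (sketch of Eq. 2)."""
    def __init__(self, pa_dim: int, d_i: int, hidden: int = 64):
        super().__init__()
        # One network outputs both the shift b and the raw scale for a.
        self.net = nn.Sequential(nn.Linear(pa_dim, hidden), nn.SiLU(),
                                 nn.Linear(hidden, 2 * d_i))

    def forward(self, v_pa, u_i):
        b, raw_a = self.net(v_pa).chunk(2, dim=-1)
        a = F.softplus(raw_a) + 1e-6          # strictly positive scale
        return b + a * u_i                    # diagonal Jacobian w.r.t. u_i

    def invert(self, v_pa, v_i):
        """Abduction: recover u_i from (v_pa, v_i); exact since a > 0."""
        b, raw_a = self.net(v_pa).chunk(2, dim=-1)
        return (v_i - b) / (F.softplus(raw_a) + 1e-6)
```

Because the scale is component-wise positive, each mechanism is a TM mapping with signature all ones, and abduction is closed-form, which is what makes the counterfactual transport tractable.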
To satisfy G2, the causal mechanisms $f_\theta$ in the constructed SCM need to maintain a specific causal structure to derive the causal order $\leq$. When parameterizing the causal mechanisms $f_{i,\theta}$ as in (Xia et al., 2023), we address the issue of inappropriate additional independence assumptions arising from only knowing the causal order by setting $pa(i) = \{j \in I \mid j < i\}$. Similarly, when indirectly modeling the solution mapping $\Gamma_\theta$ using an autoregressive generative model as in (Javaloy et al., 2023), we align the autoregressive order with the causal order.

To satisfy G3, the modeled exogenous distribution is required to satisfy $P_{\mathbf{U}_\theta} = \prod_{i \in I} P_{U_i,\theta}$. To enhance the expressiveness of the exogenous distribution, and to indirectly indicate that identifiability is independent of the specific implementation of the exogenous distribution, we employ unconstrained normalizing flows. Specifically, the log-likelihood of each $P_{U_i,\theta}$ is given by the change of variables formula:

$$\log p_{U_i,\theta}(u_i) = \log p_{Z_i,\theta}\big(T^{-1}_{i,\theta}(u_i)\big) + \log \left| \det J_{T^{-1}_{i,\theta}}(u_i) \right|, \qquad (6)$$

where $T_{i,\theta} : \Omega_{Z_i} \to \Omega_{U_i}$ is a MAF (Papamakarios et al., 2017), and $\det J_{T^{-1}_{i,\theta}}(u_i)$ is the Jacobian determinant at $u_i$.

To satisfy G4, in generative model learning, the observational dataset $D_{\mathbf{V}}$ is commonly used to optimize a learning objective, ensuring that the proxy SCM $\mathcal{M}_\theta$ induces an observational distribution $P_{\mathbf{V},\theta}$ that closely approximates the true observational distribution $P_{\mathbf{V}}$. To achieve this, we adopt the traditional maximum likelihood (MLE) approach, corresponding to the negative log-likelihood (NLL) loss:

$$\mathcal{L}_{\mathrm{NLL}}(\theta) = -\sum_{i=1}^{N} \log p_{\mathbf{V}_\theta}(v^{(i)}), \qquad (7)$$

where $\log p_{\mathbf{V}_\theta}$ is the log-likelihood of the modeled observational distribution. Since a TM-SCM ensures a bijective $\Gamma_\theta$, the log-likelihood $\log p_{\mathbf{V}_\theta}(v^{(i)})$ at $v^{(i)}$ can be computed using the change of variables formula.
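A sketch of this change-of-variables computation (our illustration; it assumes a flattened solution map with a tractable inverse and log-Jacobian, which the TM structure provides):

```python
import math
import torch

def nll_loss(v, gamma_inverse, log_abs_det_jac_inv, log_p_u):
    """Sketch of Eq. (7) via the change of variables:
       log p_V(v) = log p_U(Gamma^{-1}(v)) + log |det J_{Gamma^{-1}}(v)|."""
    u = gamma_inverse(v)                       # abduction through bijective Gamma
    log_p_v = log_p_u(u) + log_abs_det_jac_inv(v)
    return -log_p_v.mean()

# Toy usage: Gamma(u) = 2u + 1 with U ~ N(0,1), so V ~ N(1, 4).
v = 2.0 * torch.randn(1024) + 1.0
loss = nll_loss(
    v,
    lambda v: (v - 1.0) / 2.0,                             # Gamma^{-1}
    lambda v: torch.full_like(v, -math.log(2.0)),          # log |det J| = -log 2
    lambda u: -0.5 * u**2 - 0.5 * math.log(2 * math.pi),   # standard normal log-pdf
)
print(float(loss))   # approaches the differential entropy of N(1, 4)
```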
7. Related Works

Isomorphism and counterfactual identifiability. Several prior works have introduced notions similar to exogenous isomorphism and shown that SCMs satisfying these equivalence relations enjoy counterfactual consistency, mirroring the result of Theorem 3.2. This body of work includes counterfactual equivalence as defined in (Peters et al., 2017), which requires identical exogenous distributions; BGM equivalence, introduced for BGMs within BSCMs in (Nasr-Esfahany et al., 2023); LCM isomorphism, proposed by (Brehmer et al., 2022) for causal representation learning and proven, under additional conditions, to coincide with LCM counterfactual consistency; and domain counterfactual equivalence between ILDs, together with necessary and sufficient conditions, as defined in (Zhou et al., 2024). The latter three lines of work additionally assume that the solution mapping $\Gamma$ is bijective, whereas Theorem 3.2 applies to arbitrary recursive SCMs.

TM-SCMs and their identifiability. Previously, we classified four construction prototypes of TM-SCMs from a construction perspective; a complementary perspective considers the tasks these models address and the corresponding identifiability guarantees. For cause-effect identification, special TM-SCMs including LiNGAM (Shimizu et al., 2006), ANM (Hoyer et al., 2008), PNL (Zhang & Hyvärinen, 2009), LSNM (Immer et al., 2023), and CAREFL (Khemakhem et al., 2021) have been proven identifiable with respect to causal direction. For causal effect estimation, Causal NF (Javaloy et al., 2023) establishes representation identifiability of TMI mappings, and similar TM-SCM constructions are adopted by StrAF (Chen et al., 2023), CCNF (Zhou et al., 2025), and CFM (Khoa Le et al., 2025), which therefore inherit the same theoretical guarantees. In counterfactual identification, (Lu et al., 2020, Theorem 1), (Nasr-Esfahany et al., 2023, Theorem 5.1), and (Scetbon et al., 2024, Theorem 2.14) each demonstrate that TM-SCMs entail counterfactual identifiability under different settings, and these theorems are all special cases of Corollary 5.4.

Counterfactual inference with neural SCMs. Proxy SCMs built with neural network components are widely employed for counterfactual reasoning by directly learning from observational or interventional data. Several methods focus on the tractability of inference rather than identifiability by adopting different neural modules, including DSCM (Pawlowski et al., 2020), Diff-SCM (Sanchez & Tsaftaris, 2022), and VACA (Sánchez-Martín et al., 2022). Other approaches obtain identifiability for a subset of counterfactual queries; for example, CVAE-SCM (Karimi et al., 2020) is identifiable for specific counterfactual queries from causal sufficiency and the observational distribution, and NCM (Xia et al., 2023) shows a duality between the identifiability of structure-constrained proxy SCMs and non-parametric identification results. Once parametric assumptions are introduced, complete counterfactual identifiability becomes attainable, exemplified by any subclass of neural TM-SCMs. Additional examples are provided in Appendix B.4.

8. Experiments

To demonstrate that neural TM-SCMs can effectively address the counterfactual consistency problem in practice, we conducted experiments on synthetic datasets. These experiments were designed to showcase the models' ability to generate counterfactual results that are consistent with the test set, using only the endogenous samples drawn from the observational distribution as the training set.³

³ Code is available at: https://github.com/cyisk/tmscm

Datasets. The experiments involve the following synthetic datasets, with details described in Appendix D.1. TM-SCM-SYM: a collection of four small datasets (BARBELL, STAIR, FORK, BACKDOOR) with up to 4 causal variables, using exogenous distributions that are standard or Markovian multivariate normals and manually defined TM causal mechanisms. ER-DIAG-50 and ER-TRIL-50: each contains 50 datasets with 3-8 causal variables, Markovian multivariate normal exogenous distributions, and Erdős-Rényi causal graphs (edge probability 0.5). ER-DIAG-50 ensures diagonal Jacobians, while ER-TRIL-50 ensures lower triangular Jacobians for the TM mappings.

Metrics. To evaluate a trained $\mathcal{M}_\theta$, we compute OBSWD (Wasserstein distance) for the fit to the observational distribution and CTFRMSE (root mean square error) for L3-consistency with ground-truth counterfactual outcomes.

Ablation on TM-SCM-SYM. To provide empirical evidence for Corollary 5.4 and Theorem 3.2, and to show that neural TM-SCMs address counterfactual consistency, we performed a small-scale ablation study on TM-SCM-SYM, including: (i) w/o O: reverses the causal order $\leq$, ablating $\mathcal{A}_{\leq}$; (ii) w/o M: uses non-Markovian exogenous distributions, ablating $\mathcal{A}_M$; (iii) w/o T: constructs a non-triangular SCM, ablating $\mathcal{A}_{\mathrm{TM\text{-}SCM}}$ (supported only by CMSM and TVSM).

Figure 1 shows scatter plots of OBSWD and CTFRMSE on the validation set.

[Figure 1. Ablation results of neural TM-SCMs on TM-SCM-SYM. Colored curves depict sliding-window regressions, with shaded areas showing 95% CI. (a) DNME for BARBELL; (b) TNME for STAIR; (c) CMSM for FORK; (d) TVSM for BACKDOOR.]

Configurations w/o O and w/o T fail to converge, highlighting the necessity of $\mathcal{A}_{\leq}$ and $\mathcal{A}_{\mathrm{TM\text{-}SCM}}$. Most w/o M settings result in higher CTFRMSE, emphasizing the role of $\mathcal{A}_M$. CMSM shows slightly lower stability than the other methods. Additional results are in Appendix D.4.
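For reference, a sketch of the two metrics (our simplified variant, assuming access to held-out samples and ground-truth counterfactuals; the paper's exact implementations in Appendix D may differ, e.g., by using a multivariate Wasserstein estimate):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def obs_wd(v_model: np.ndarray, v_test: np.ndarray) -> float:
    """OBSWD sketch: mean 1-D Wasserstein distance across endogenous dimensions
    between model samples and held-out observational samples."""
    return float(np.mean([wasserstein_distance(v_model[:, j], v_test[:, j])
                          for j in range(v_test.shape[1])]))

def ctf_rmse(y_model: np.ndarray, y_true: np.ndarray) -> float:
    """CTFRMSE sketch: RMSE between model counterfactual outcomes and the
    ground-truth counterfactuals generated by the true SCM."""
    return float(np.sqrt(np.mean((y_model - y_true) ** 2)))
```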
Ablation on ER-DIAG-50 and ER-TRIL-50. We further conducted a more comprehensive evaluation on ER-DIAG-50 and ER-TRIL-50, reinforcing the generality of the theory through experiments on a wide variety of synthetic SCMs. The ablations in these experiments also targeted $\mathcal{A}_{\leq}$, $\mathcal{A}_M$, and $\mathcal{A}_{\mathrm{TM\text{-}SCM}}$. Table 1 presents the final CTFRMSE on the ER-DIAG-50 and ER-TRIL-50 test sets for each method and its ablations. The results demonstrate that violating $\mathcal{A}_{\leq}$, $\mathcal{A}_M$, or $\mathcal{A}_{\mathrm{TM\text{-}SCM}}$ significantly degrades L3-consistency, emphasizing the importance of these assumptions and the validity of the theory for ensuring consistent counterfactual inference. Additional results on these datasets can be found in Appendix D.4.

Table 1. Ablation results of neural TM-SCMs on ER-DIAG-50 and ER-TRIL-50. The values shown are the means of 50 experiments, with the ± value representing the 95% CI. The best-performing results are highlighted in bold.

| METHOD |       | ER-DIAG-50 | ER-TRIL-50 |
|--------|-------|------------|------------|
| DNME   | -     | 0.53 ± 0.05 | 0.51 ± 0.12 |
|        | w/o O | 0.78 ± 0.05 | 0.89 ± 0.10 |
|        | w/o M | 0.62 ± 0.04 | 0.58 ± 0.10 |
| TNME   | -     | 0.47 ± 0.05 | 0.55 ± 0.12 |
|        | w/o O | 11.24 ± 20.98 | 6.41 ± 9.84 |
|        | w/o M | 0.62 ± 0.04 | 0.73 ± 0.21 |
| CMSM   | -     | **0.37 ± 0.05** | **0.42 ± 0.12** |
|        | w/o O | 2.64 ± 3.72 | 2.12 ± 2.49 |
|        | w/o M | 1.69 ± 2.60 | 0.75 ± 0.49 |
|        | w/o T | 0.64 ± 0.05 | 1.25 ± 1.29 |
| TVSM   | -     | 0.46 ± 0.05 | 0.50 ± 0.12 |
|        | w/o O | 0.79 ± 0.04 | 0.88 ± 0.10 |
|        | w/o M | 0.53 ± 0.05 | 0.53 ± 0.11 |
|        | w/o T | 0.67 ± 0.05 | 0.78 ± 0.12 |

9. Conclusion

In this work, we explored L3-identifiability, the strongest form of causal model indistinguishability within the PCH. By simplifying the problem through the introduction of exogenous isomorphism and $\sim_{EI}$-identifiability, we developed concrete methods to achieve identifiability in BSCMs and TM-SCMs. These findings unify and extend existing theories, providing reliability guarantees for counterfactual reasoning. Our empirical evaluations further validate the practicality of the proposed approach, paving the way for reliable applications in counterfactual modeling.

Acknowledgements

This work was supported in part by the National Key R&D Program of China (No. 2022ZD0120302).

Impact Statement

This study aims to advance the field of causal inference by providing both theoretical and practical support for the consistency and trustworthiness of the results of counterfactual reasoning. In future applications, the proposed methods can enhance the reliability of reasoning in relevant practical scenarios. However, we emphasize the necessity of maintaining close collaboration with domain experts and stakeholders when applying these models in practice. It is particularly important to assess whether the theoretical assumptions proposed in this paper are satisfied to mitigate the risks of potential negative consequences.

References

Avin, C., Shpitser, I., and Pearl, J. Identifiability of path-specific effects. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, IJCAI'05, pp. 357-363, San Francisco, CA, USA, 2005. Morgan Kaufmann Publishers Inc.

Bareinboim, E., Correa, J. D., Ibeling, D., and Icard, T. On Pearl's Hierarchy and the Foundations of Causal Inference, pp. 507-556. Association for Computing Machinery, New York, NY, USA, 1 edition, 2022. ISBN 9781450395861.

Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798-1828, August 2013. ISSN 0162-8828. doi: 10.1109/TPAMI.2013.50.

Bongers, S., Forré, P., Peters, J., and Mooij, J. M. Foundations of structural causal models with cycles and latent variables.
The Annals of Statistics, 2016.

Brehmer, J., de Haan, P., Lippe, P., and Cohen, T. S. Weakly supervised causal representation learning. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 38319-38331. Curran Associates, Inc., 2022.

Chao, P., Blöbaum, P., Patel, S. K., and Kasiviswanathan, S. Modeling causal mechanisms with diffusion models for interventional and counterfactual queries. Transactions on Machine Learning Research, 2024. ISSN 2835-8856.

Chen, A., Shi, R. I., Gao, X., Baptista, R., and Krishnan, R. G. Structured neural networks for density estimation and causal inference. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 66438-66450. Curran Associates, Inc., 2023.

Chen, R. T. Q. torchdiffeq, 2018.

Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Correa, J., Lee, S., and Bareinboim, E. Nested counterfactual identification from arbitrary surrogate experiments. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 6856-6867. Curran Associates, Inc., 2021.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real NVP. In International Conference on Learning Representations, 2017.

Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. Neural spline flows. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Feydy, J., Séjourné, T., Vialard, F.-X., Amari, S.-i., Trouvé, A., and Peyré, G. Interpolating between optimal transport and MMD using Sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2681-2690, 2019.

Glymour, C., Zhang, K., and Spirtes, P. Review of causal discovery methods based on graphical models. Frontiers in Genetics, 10, 2019. ISSN 1664-8021. doi: 10.3389/fgene.2019.00524.

Grathwohl, W., Chen, R. T. Q., Bettencourt, J., and Duvenaud, D. Scalable reversible generative models with free-form continuous dynamics. In International Conference on Learning Representations, 2019.

Gresele, L., von Kügelgen, J., Stimper, V., Schölkopf, B., and Besserve, M. Independent mechanism analysis, a new concept? In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 28233-28248. Curran Associates, Inc., 2021.

Hoyer, P., Janzing, D., Mooij, J. M., Peters, J., and Schölkopf, B. Nonlinear causal discovery with additive noise models. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L. (eds.), Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2008.

Hyvärinen, A., Khemakhem, I., and Morioka, H. Nonlinear independent component analysis for principled disentanglement in unsupervised deep learning. Patterns, 4(10):100844, 2023. ISSN 2666-3899. doi: 10.1016/j.patter.2023.100844.

Immer, A., Schultheiss, C., Vogt, J. E., Schölkopf, B., Bühlmann, P., and Marx, A.
On the identifiability and estimation of causal location-scale noise models. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 14316-14332. PMLR, 23-29 Jul 2023.

Jaini, P., Selby, K. A., and Yu, Y. Sum-of-squares polynomial flow. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 3009-3018. PMLR, 09-15 Jun 2019.

Javaloy, A., Sanchez-Martin, P., and Valera, I. Causal normalizing flows: from theory to practice. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 58833-58864. Curran Associates, Inc., 2023.

Karimi, A.-H., von Kügelgen, J., Schölkopf, B., and Valera, I. Algorithmic recourse under imperfect causal knowledge: a probabilistic approach. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 265-277. Curran Associates, Inc., 2020.

Khemakhem, I., Kingma, D., Monti, R., and Hyvärinen, A. Variational autoencoders and nonlinear ICA: A unifying framework. In Chiappa, S. and Calandra, R. (eds.), Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 2207-2217. PMLR, 26-28 Aug 2020.

Khemakhem, I., Monti, R., Leech, R., and Hyvärinen, A. Causal autoregressive flows. In Banerjee, A. and Fukumizu, K. (eds.), Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pp. 3520-3528. PMLR, 13-15 Apr 2021. URL https://proceedings.mlr.press/v130/khemakhem21a.html.

Khoa Le, M., Do, K., and Tran, T. Learning structural causal models from ordering: Identifiable flow models. Proceedings of the AAAI Conference on Artificial Intelligence, 39(17):17831-17839, Apr. 2025. doi: 10.1609/aaai.v39i17.33961.

Kivva, B., Rajendran, G., Ravikumar, P., and Aragam, B. Identifiability of deep generative models without auxiliary information. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 15687-15701. Curran Associates, Inc., 2022.

Kusner, M. J., Loftus, J., Russell, C., and Silva, R. Counterfactual fairness. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

Lara, L. D., González-Sanz, A., Asher, N., Risser, L., and Loubes, J.-M. Transport-based counterfactual models. Journal of Machine Learning Research, 25(136):1-59, 2024.

Lu, C., Huang, B., Wang, K., Hernández-Lobato, J. M., Zhang, K., and Schölkopf, B. Sample-efficient reinforcement learning via counterfactual-based data augmentation. CoRR, abs/2012.09092, 2020.

Nasr-Esfahany, A. and Kiciman, E. Counterfactual (non-)identifiability of learned structural causal models, 2023.

Nasr-Esfahany, A., Alizadeh, M., and Shah, D. Counterfactual identifiability of bijective causal models. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J.
(eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 25733-25754. PMLR, 23-29 Jul 2023.

Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

Pawlowski, N., Coelho de Castro, D., and Glocker, B. Deep structural causal models for tractable counterfactual inference. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 857-869. Curran Associates, Inc., 2020.

Pearl, J. Causality: Models, Reasoning and Inference. Cambridge University Press, USA, 2nd edition, 2009. ISBN 052189560X.

Pearl, J. and Mackenzie, D. The Book of Why: The New Science of Cause and Effect. Basic Books, Inc., USA, 1st edition, 2018. ISBN 046509760X.

Peters, J., Janzing, D., and Schölkopf, B. Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press, 2017. ISBN 0262037319.

Peters, J., Bauer, S., and Pfister, N. Causal Models for Dynamical Systems, pp. 671-690. Association for Computing Machinery, New York, NY, USA, 1 edition, 2022. ISBN 9781450395861.

Richens, J., Beard, R., and Thompson, D. H. Counterfactual harm. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 36350-36365. Curran Associates, Inc., 2022.

Rozet, F., Divo, F., and Schnake, S. Zuko: Normalizing Flows in PyTorch, 2024.

Sanchez, P. and Tsaftaris, S. A. Diffusion causal models for counterfactual estimation. In Schölkopf, B., Uhler, C., and Zhang, K. (eds.), Proceedings of the First Conference on Causal Learning and Reasoning, volume 177 of Proceedings of Machine Learning Research, pp. 647-668. PMLR, 11-13 Apr 2022.

Sánchez-Martín, P., Rateike, M., and Valera, I. VACA: Designing variational graph autoencoders for causal queries. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 8159-8168, 2022.

Santambrogio, F. One-dimensional issues, pp. 59-85. Springer International Publishing, Cham, 2015. ISBN 978-3-319-20828-2. doi: 10.1007/978-3-319-20828-2_2.

Scetbon, M., Jennings, J., Hilmkil, A., Zhang, C., and Ma, C. A fixed-point approach for causal generative modeling. In Salakhutdinov, R., Kolter, Z., Heller, K., Weller, A., Oliver, N., Scarlett, J., and Berkenkamp, F. (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 43504-43541. PMLR, 21-27 Jul 2024.

Schölkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., and Bengio, Y. Toward causal representation learning. Proceedings of the IEEE, 109(5):612-634, 2021.

Shimizu, S., Hoyer, P. O., Hyvärinen, A., and Kerminen, A. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(72):2003-2030, 2006.

Shpitser, I. and Pearl, J. Complete identification methods for the causal hierarchy. Journal of Machine Learning Research, 9(64):1941-1979, 2008.

Spirtes, P. and Zhang, K. Causal discovery and inference: concepts and recent methodological advances. Applied Informatics, 3(1):3, Feb 2016. ISSN 2196-0089. doi: 10.1186/s40535-016-0018-x.

Tian, J. and Pearl, J.
Probabilities of causation: Bounds and identification. Annals of Mathematics and Artificial Intelligence, 28(1):287-313, 2000.

Tsirtsis, S. and Rodriguez, M. Finding counterfactually optimal action sequences in continuous state spaces. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 3220-3247. Curran Associates, Inc., 2023.

von Kügelgen, J., Besserve, M., Wendong, L., Gresele, L., Kekić, A., Bareinboim, E., Blei, D., and Schölkopf, B. Nonparametric identifiability of causal representations from unknown interventions. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 48603-48638. Curran Associates, Inc., 2023.

Vowels, M. J., Camgoz, N. C., and Bowden, R. D'ya like DAGs? A survey on structure learning and causal discovery. ACM Comput. Surv., 55(4), November 2022. ISSN 0360-0300. doi: 10.1145/3527154.

Wehenkel, A. and Louppe, G. Unconstrained monotonic neural networks. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Wu, P., Li, H., Zheng, C., Zeng, Y., Chen, J., Liu, Y., Guo, R., and Zhang, K. Learning counterfactual outcomes under rank preservation, 2025.

Xi, Q. and Bloem-Reddy, B. Indeterminacy in generative models: Characterization and strong identifiability. In Ruiz, F., Dy, J., and van de Meent, J.-W. (eds.), Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pp. 6912-6939. PMLR, 25-27 Apr 2023.

Xia, K., Lee, K.-Z., Bengio, Y., and Bareinboim, E. The causal-neural connection: Expressiveness, learnability, and inference. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 10823-10836. Curran Associates, Inc., 2021.

Xia, K. M., Pan, Y., and Bareinboim, E. Neural causal models for counterfactual identification and estimation. In The Eleventh International Conference on Learning Representations, 2023.

Zečević, M., Dhami, D. S., Veličković, P., and Kersting, K. Relating graph neural networks to structural causal models, 2021.

Zhang, J. and Bareinboim, E. Fairness in decision-making: the causal explanation formula. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), Apr. 2018. doi: 10.1609/aaai.v32i1.11564.

Zhang, K. and Hyvärinen, A. On the identifiability of the post-nonlinear causal model. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI'09, pp. 647-655, Arlington, Virginia, USA, 2009. AUAI Press. ISBN 9780974903958.

Zhou, Q., Lu, K., and Xu, M. Causally consistent normalizing flow. Proceedings of the AAAI Conference on Artificial Intelligence, 39(21):22974-22981, Apr. 2025. doi: 10.1609/aaai.v39i21.34460.

Zhou, Z., Bai, R., Kulinski, S., Kocaoglu, M., and Inouye, D. I. Towards characterizing domain counterfactuals for invertible latent causal models. In The Twelfth International Conference on Learning Representations, 2024.
[Figure 2. Overview of all theorems discussed in the main text and appendix, along with their dependency graph. Nodes represent theorems, and edges indicate dependencies, directed from top to bottom. Different colors denote different topics: theorems related to recursive SCMs are marked in blue; theorems related to exogenous isomorphism in green; theorems related to BSCMs in yellow; and theorems related to TM-SCMs in red.]

A.1. Background and Preliminaries

Definition 2.1 draws characteristics from both (Bongers et al., 2016) and (Bareinboim et al., 2022), but differs in the following aspects: (i) we use a shared index set $I$ for exogenous and endogenous variables, rather than an additional index set $J$ for the endogenous variables, because this paper does not focus on the relationships between exogenous variables; (ii) to partially compensate for the limitation introduced in (i), we allow $P_U$ to differ from $\prod_{i \in I} P_{U_i}$, i.e., $P_U$ need not be a product measure. This relaxation permits the exogenous variables to be dependent unless the SCM is Markovian.

Following Definition 2.1, an SCM describes the functional relationships between endogenous and exogenous variables through a set of deterministic equations of the form $v_i = f_i(v_{pa(i)}, u_i)$, known as structural equations. (Bongers et al., 2016) discuss the solvability of this system of equations. Specifically, if there exists a pair $(V, U)$ such that $V = f(V, U)$ holds almost surely and the distribution of $U$ is exactly $P_U$, then the SCM is said to have a solution. If there exists a function $\Gamma$ such that for almost all $u \in \Omega_U$ and all $v \in \Omega_V$, $v = \Gamma(u)$ implies $v = f(v, u)$, then the SCM is said to be solvable. According to (Bongers et al., 2016, Theorem 3.2), an SCM has a solution if and only if it is solvable.

A recursive SCM is a special type of SCM with certain desirable properties. In this subsection, we prove a fundamental lemma on recursive SCMs to facilitate subsequent proofs, and we introduce some notation. For an index $i \in I$ in a recursive SCM, the lower set $an(i) = \{j \mid j \prec i\}$ of the partial order $\preceq$ is called the causal ancestors of $i$, while the lower set $pr(i) = \{j \mid j < i\}$ of its linear extension $\leq$ is called the causal prefix of $i$. Additionally, $an^*(i) = an(i) \cup \{i\}$ and $pr^*(i) = pr(i) \cup \{i\}$.
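To make these index sets concrete, the following minimal sketch (hypothetical code, not from the paper) computes $an(i)$, $pr(i)$, $an^*(i)$, and $pr^*(i)$ for a small DAG, taking the partial order to be reachability over parent sets and the linear extension to be a fixed topological order.

```python
# Hypothetical 4-node DAG given by parent sets; indices are ordered by a
# fixed topological (linear) extension: 0 < 1 < 2 < 3.
PA = {0: [], 1: [0], 2: [0], 3: [1, 2]}
TOPO = [0, 1, 2, 3]  # linear extension of the causal partial order

def an(i):
    """Causal ancestors: lower set of i under the partial order (strict)."""
    frontier, out = list(PA[i]), set()
    while frontier:
        j = frontier.pop()
        if j not in out:
            out.add(j)
            frontier.extend(PA[j])
    return out

def pr(i):
    """Causal prefix: lower set of i under the linear extension (strict)."""
    return set(TOPO[:TOPO.index(i)])

def an_star(i): return an(i) | {i}
def pr_star(i): return pr(i) | {i}

assert an(3) == {0, 1, 2} and pr(2) == {0, 1}
assert an(2) == {0} and an(2) <= pr(2)  # an(i) is always a subset of pr(i)
```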
Consider a recursive SCM $M = \langle I, \Omega_V, \Omega_U, f, P_U \rangle$. We can recursively expand the endogenous part of $f_i$ to obtain a new function:

Definition A.1 (Unrolled Causal Mechanism (UCM)). For each causal mechanism $f_i : \Omega_{V_{pa(i)}} \times \Omega_{U_i} \to \Omega_{V_i}$ in $M$, let $\tilde f_i : \Omega_{U_{an^*(i)}} \to \Omega_{V_i}$ be the result of recursively unrolling the endogenous part $\Omega_{V_{pa(i)}}$ of the causal mechanism $f_i$, such that for any $u \in \Omega_U$,
$$\tilde f_i(u_{an^*(i)}) = f_i\big((\tilde f_k(u_{an^*(k)}))_{k \in pa(i)},\, u_i\big).$$
This is referred to as the Unrolled Causal Mechanism (UCM).

The following lemma demonstrates that the tuple of UCMs constitutes the solution mapping $\Gamma$ of a recursive SCM, thereby indirectly proving the existence and uniqueness of solutions for recursive SCMs.

Lemma A.2. For all $u \in \Omega_U$, $v = f(v, u)$ if and only if $v = \Gamma(u)$, where $\Gamma(u) = (\tilde f_i(u_{an^*(i)}))_{i \in I}$.

Proof. Let $u \in \Omega_U$ be arbitrary.

($\Rightarrow$): Suppose the structural equations $v = f(v, u)$ hold, meaning each component satisfies $v_i = f_i(v_{pa(i)}, u_i)$. We proceed by induction on the partial order $\preceq$. When $pa(i) = \emptyset$, we have $v_i = f_i(u_i)$, and by the definition of the UCM, $\tilde f_i(u_i) = f_i(u_i) = v_i$. When $pa(i) \neq \emptyset$, assume that $v_k = \tilde f_k(u_{an^*(k)})$ holds for each $k \in pa(i)$. Then,
$$\tilde f_i(u_{an^*(i)}) = f_i\big((\tilde f_k(u_{an^*(k)}))_{k \in pa(i)},\, u_i\big) \quad \text{(Definition A.1)}$$
$$= f_i(v_{pa(i)}, u_i) = v_i. \quad \text{(induction hypothesis)}$$
Thus, for each $i \in I$, we have $v_i = \tilde f_i(u_{an^*(i)})$, implying that $(v_i)_{i \in I} = (\tilde f_i(u_{an^*(i)}))_{i \in I}$, i.e., $v = \Gamma(u)$.

($\Leftarrow$): Suppose $v = \Gamma(u)$, meaning each component satisfies $v_i = \tilde f_i(u_{an^*(i)})$. Then, for each $i \in I$,
$$v_i = \tilde f_i(u_{an^*(i)}) = f_i\big((\tilde f_k(u_{an^*(k)}))_{k \in pa(i)},\, u_i\big) \quad \text{(Definition A.1)}$$
$$= f_i(v_{pa(i)}, u_i),$$
where the last equality follows from $v_k = \tilde f_k(u_{an^*(k)})$ for each $k \in pa(i)$. Therefore, the structural equation $v_i = f_i(v_{pa(i)}, u_i)$ holds for every $i \in I$, and thus the system of structural equations $v = f(v, u)$ holds. □
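As a concrete illustration of Lemma A.2, the sketch below (hypothetical code; the mechanisms are invented for the example) evaluates the solution mapping $\Gamma$ by memoized recursion over the UCMs and checks that the output satisfies the structural equations.

```python
import math

# Hypothetical recursive SCM with I = {0, 1, 2}: mechanisms f_i(v_pa(i), u_i).
PA = {0: [], 1: [0], 2: [0, 1]}
F = {
    0: lambda pa, u: u,                      # v0 = u0
    1: lambda pa, u: math.tanh(pa[0]) + u,   # v1 = tanh(v0) + u1
    2: lambda pa, u: pa[0] * pa[1] + u,      # v2 = v0 * v1 + u2
}

def gamma(u):
    """Solution mapping: Gamma(u)_i is the unrolled mechanism of Def. A.1."""
    cache = {}
    def unroll(i):  # UCM computed by recursion over the ancestors
        if i not in cache:
            cache[i] = F[i]([unroll(k) for k in PA[i]], u[i])
        return cache[i]
    return {i: unroll(i) for i in PA}

u = {0: 0.3, 1: -1.0, 2: 0.5}
v = gamma(u)
# Lemma A.2: v = Gamma(u) iff v solves the structural equations v = f(v, u).
assert all(abs(v[i] - F[i]([v[k] for k in PA[i]], u[i])) < 1e-12 for i in PA)
```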
We decompose the problem and, based on the causal mechanisms, define the following compound structure as an extension of the causal mechanisms at the counterfactual level, establishing a connection with the potential responses in the PCH. Suppose there are $n$ submodels of recursive SCMs, $M[x_1], M[x_2], \ldots, M[x_n]$, and let the UCM corresponding to $f_{i[x_j]}$ in each submodel $M[x_j]$ be denoted by $\tilde f_{i[x_j]}$.

Definition A.3 (Compound Causal Mechanism (CCM)). Let $f_{i[x_\bullet]} : \prod_{j=1}^{n} \Omega_{V_{pa(i)}} \times \Omega_{U_i} \to \prod_{j=1}^{n} \Omega_{V_i}$ be the tuple of causal mechanisms indexed by $i \in I$ across the submodels, denoted $(f_{i[x_j]})_{j \in 1:n}$, where $x_\bullet = (x_j)_{j \in 1:n}$. This satisfies, for each $j \in 1:n$, all $v_j \in \Omega_V$, and all $u_i \in \Omega_{U_i}$,
$$f_{i[x_\bullet]}\big((v_{pa(i),j})_{j \in 1:n},\, u_i\big) = \big(f_{i[x_j]}(v_{pa(i),j}, u_i)\big)_{j \in 1:n}.$$
This is referred to as the Compound Causal Mechanism (CCM).

Definition A.4 (Unrolled Compound Causal Mechanism (UCCM)). Let $\tilde f_{i[x_\bullet]} : \Omega_{U_{an^*(i)}} \to \prod_{j=1}^{n} \Omega_{V_i}$ be the result of recursively unrolling the $\prod_{j=1}^{n} \Omega_{V_{pa(i)}}$ part of the CCM $f_{i[x_\bullet]}$, such that for all $u \in \Omega_U$,
$$\tilde f_{i[x_\bullet]}(u_{an^*(i)}) = f_{i[x_\bullet]}\big((\tilde f_{k[x_\bullet]}(u_{an^*(k)}))_{k \in pa(i)},\, u_i\big).$$
This is referred to as the Unrolled Compound Causal Mechanism (UCCM).

Lemma A.5. For each $i \in I$ and all $u \in \Omega_U$, the UCCM $\tilde f_{i[x_\bullet]}(u_{an^*(i)})$ is equal to the tuple of UCMs $(\tilde f_{i[x_j]}(u_{an^*(i)}))_{j \in 1:n}$.

Proof. Given any $u \in \Omega_U$, we proceed by induction on the partial order $\preceq$. When $pa(i) = \emptyset$, according to the definition of the UCCM, $\tilde f_{i[x_\bullet]}(u_{an^*(i)}) = \tilde f_{i[x_\bullet]}(u_i) = f_{i[x_\bullet]}(u_i)$. Additionally, according to the definition of the UCM, $\tilde f_{i[x_j]}(u_{an^*(i)}) = f_{i[x_j]}(u_i)$. Furthermore, based on the definition of the CCM, $f_{i[x_\bullet]}(u_i) = (f_{i[x_j]}(u_i))_{j \in 1:n}$. Therefore, $\tilde f_{i[x_\bullet]}(u_{an^*(i)}) = (\tilde f_{i[x_j]}(u_{an^*(i)}))_{j \in 1:n}$. When $pa(i) \neq \emptyset$, assume that $\tilde f_{k[x_\bullet]}(u_{an^*(k)}) = (\tilde f_{k[x_j]}(u_{an^*(k)}))_{j \in 1:n}$ holds for each $k \in pa(i)$. Then,
$$\tilde f_{i[x_\bullet]}(u_{an^*(i)}) = f_{i[x_\bullet]}\big((\tilde f_{k[x_\bullet]}(u_{an^*(k)}))_{k \in pa(i)},\, u_i\big) \quad \text{(Definition A.4)}$$
$$= f_{i[x_\bullet]}\big(((\tilde f_{k[x_j]}(u_{an^*(k)}))_{j \in 1:n})_{k \in pa(i)},\, u_i\big) \quad \text{(induction hypothesis)}$$
$$= \big(f_{i[x_j]}((\tilde f_{k[x_j]}(u_{an^*(k)}))_{k \in pa(i)}, u_i)\big)_{j \in 1:n} \quad \text{(Definition A.3)}$$
$$= (\tilde f_{i[x_j]}(u_{an^*(i)}))_{j \in 1:n}. \quad \text{(Definition A.1)}$$
Therefore, for each $i \in I$ and all $u \in \Omega_U$, we have $\tilde f_{i[x_\bullet]}(u_{an^*(i)}) = (\tilde f_{i[x_j]}(u_{an^*(i)}))_{j \in 1:n}$. □

Definition A.6 (Augmented Unrolled Compound Causal Mechanism (AUCCM)). Augment the domain of $\tilde f_{i[x_\bullet]}$ from $\Omega_{U_{an^*(i)}}$ to $\Omega_{U_{pr^*(i)}}$, resulting in $\tilde g_{i[x_\bullet]} : \Omega_{U_{pr^*(i)}} \to \prod_{j=1}^{n} \Omega_{V_i}$. This ensures that $\tilde f_{i[x_\bullet]}(u_{an^*(i)}) = \tilde g_{i[x_\bullet]}(u_{pr^*(i)})$ for all $u \in \Omega_U$; the result is referred to as the Augmented Unrolled Compound Causal Mechanism (AUCCM).

Definition A.7 (Prefix Compound Causal Mechanism (PCCM)). Define $\tilde g_{pr^*(i)[x_\bullet]} : \Omega_{U_{pr^*(i)}} \to \prod_{j=1}^{n} \Omega_{V_{pr^*(i)}}$ as the tuple of AUCCMs $(\tilde g_{k[x_\bullet]})_{k \in pr^*(i)}$. This is referred to as the Prefix Compound Causal Mechanism (PCCM).

Lemma A.8. For all $u \in \Omega_U$, the potential response $V^M_{x_\bullet}(u)$ equals the PCCM $\tilde g_{I[x_\bullet]}(u)$.

Proof. Given any $u \in \Omega_U$, consider the component indexed by $i$ in the $j$-th submodel, namely $(V^M_{x_\bullet}(u))_{i,j}$ and $(\tilde g_{I[x_\bullet]}(u))_{i,j}$. According to the definition of the potential response, $(V^M_{x_\bullet}(u))_{i,j} = (V^{M[x_j]}(u))_i = (\Gamma[x_j](u))_i$, where the $i$-th component of the solution mapping $\Gamma[x_j](u)$ is $\tilde f_{i[x_j]}(u_{an^*(i)})$ according to Lemma A.2. On the other hand, by the definition of the PCCM, $(\tilde g_{I[x_\bullet]}(u))_{i,j} = (\tilde g_{i[x_\bullet]}(u))_j$, and by the definition of the AUCCM, $(\tilde g_{i[x_\bullet]}(u))_j = (\tilde f_{i[x_\bullet]}(u_{an^*(i)}))_j$, where, according to Lemma A.5, $(\tilde f_{i[x_\bullet]}(u_{an^*(i)}))_j$ is the UCM $\tilde f_{i[x_j]}(u_{an^*(i)})$ of the submodel $M[x_j]$. Therefore,
$$(V^M_{x_\bullet}(u))_{i,j} = (\Gamma[x_j](u))_i \overset{\text{Lemma A.2}}{=} \tilde f_{i[x_j]}(u_{an^*(i)}) \overset{\text{Lemma A.5}}{=} (\tilde f_{i[x_\bullet]}(u_{an^*(i)}))_j = (\tilde g_{I[x_\bullet]}(u))_{i,j}.$$
Since this equality holds for every component $(i, j)$, the potential response $V^M_{x_\bullet}(u)$ equals the PCCM $\tilde g_{I[x_\bullet]}(u)$. □
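The submodels $M[x_j]$ above only replace intervened mechanisms with constants, so potential responses can be computed by reusing the recursion of Lemma A.2. A minimal sketch (hypothetical mechanisms and intervention, not the paper's code):

```python
import math

PA = {0: [], 1: [0], 2: [0, 1]}
F = {
    0: lambda pa, u: u,
    1: lambda pa, u: math.tanh(pa[0]) + u,
    2: lambda pa, u: pa[0] * pa[1] + u,
}

def potential_response(u, x=None):
    """V_{M[x]}(u): solve the submodel where mechanisms in x are clamped."""
    x = x or {}
    cache = {}
    def unroll(i):
        if i not in cache:
            cache[i] = x[i] if i in x else F[i]([unroll(k) for k in PA[i]], u[i])
        return cache[i]
    return {i: unroll(i) for i in PA}

u = {0: 0.3, 1: -1.0, 2: 0.5}
# Joint counterfactual evaluation: one shared u across several submodels,
# as in the compound mechanisms of Definition A.3.
for x in ({}, {1: 2.0}):
    print(x, potential_response(u, x))
```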
A.2. Exogenous Isomorphism

Definitions A.3, A.4, A.6 and A.7 enable us to progressively lift the isomorphism of causal mechanisms in an exogenous isomorphism to the potential responses. Specifically, consider two recursive SCMs $M^{(1)}$ and $M^{(2)}$ such that $M^{(1)} \cong_{EI} M^{(2)}$. The following lemmas then hold in succession. To prevent ambiguity during the proofs, we use the full notation $f^{(k)}_i(v_{pa^{(k)}(i)}, \cdot)$ instead of the abbreviated $f^{(k)}_i(v, \cdot)$.

Lemma A.9. For each $i \in I$, for almost all $u^{(1)} \in \Omega^{(1)}_U$ and for all $v \in \Omega_V$, the CCM satisfies $f^{(2)}_{i[x_\bullet]}(v_{pa^{(2)}(i)}, h_i(u^{(1)}_i)) = f^{(1)}_{i[x_\bullet]}(v_{pa^{(1)}(i)}, u^{(1)}_i)$.

Proof. For each $i \in I$ and any $v \in \Omega_V$, consider $u^{(1)}_i \in \Omega^{(1)}_{U_i}$ such that $f^{(2)}_i(v_{pa^{(2)}(i)}, h_i(u^{(1)}_i)) = f^{(1)}_i(v_{pa^{(1)}(i)}, u^{(1)}_i)$. Then,
$$f^{(2)}_{i[x_\bullet]}(v_{pa^{(2)}(i)}, h_i(u^{(1)}_i)) = \big(f^{(2)}_{i[x_j]}(v_{pa^{(2)}(i),j}, h_i(u^{(1)}_i))\big)_{j \in 1:n} \quad \text{(Definition A.3)}$$
$$= \left(\begin{cases} (x_j)_i & i \in I_{X_j} \\ f^{(2)}_i(v_{pa^{(2)}(i),j}, h_i(u^{(1)}_i)) & i \in I \setminus I_{X_j} \end{cases}\right)_{j \in 1:n} \quad \text{(definition of submodel)}$$
$$= \left(\begin{cases} (x_j)_i & i \in I_{X_j} \\ f^{(1)}_i(v_{pa^{(1)}(i),j}, u^{(1)}_i) & i \in I \setminus I_{X_j} \end{cases}\right)_{j \in 1:n} \quad (M^{(1)} \cong_{EI} M^{(2)})$$
$$= \big(f^{(1)}_{i[x_j]}(v_{pa^{(1)}(i),j}, u^{(1)}_i)\big)_{j \in 1:n} \quad \text{(definition of submodel)}$$
$$= f^{(1)}_{i[x_\bullet]}(v_{pa^{(1)}(i)}, u^{(1)}_i). \quad \text{(Definition A.3)}$$
Since $f^{(2)}_i(v_{pa^{(2)}(i)}, h_i(u^{(1)}_i)) = f^{(1)}_i(v_{pa^{(1)}(i)}, u^{(1)}_i)$ holds for almost all $u^{(1)}_i \in \Omega^{(1)}_{U_i}$, the proposition is established. □

Lemma A.10. For each $i \in I$ and for almost all $u^{(1)} \in \Omega^{(1)}_U$, the AUCCM satisfies $(\tilde g^{(2)}_{i[x_\bullet]} \circ h_{pr^*(i)})(u^{(1)}_{pr^*(i)}) = \tilde g^{(1)}_{i[x_\bullet]}(u^{(1)}_{pr^*(i)})$.

Proof. Consider $u^{(1)} \in \Omega^{(1)}_U$ for which Lemma A.9 holds. We proceed by induction along the common causal order (hence the necessity of the causal order). When $|pr(i)| = 0$, we have $pr^*(i) = \{i\}$. By Lemma A.9:
$$(\tilde g^{(2)}_{i[x_\bullet]} \circ h_{pr^*(i)})(u^{(1)}_{pr^*(i)}) = \tilde g^{(2)}_{i[x_\bullet]}(h_i(u^{(1)}_i)) \quad (pr^*(i) = \{i\})$$
$$= \tilde f^{(2)}_{i[x_\bullet]}(h_i(u^{(1)}_i)) \quad \text{(Definition A.6)}$$
$$= f^{(2)}_{i[x_\bullet]}(h_i(u^{(1)}_i)) \quad \text{(Definition A.4)}$$
$$= f^{(1)}_{i[x_\bullet]}(u^{(1)}_i) \quad \text{(Lemma A.9)}$$
$$= \tilde f^{(1)}_{i[x_\bullet]}(u^{(1)}_i) \quad \text{(Definition A.4)}$$
$$= \tilde g^{(1)}_{i[x_\bullet]}(u^{(1)}_i) \quad \text{(Definition A.6)}$$
$$= \tilde g^{(1)}_{i[x_\bullet]}(u^{(1)}_{pr^*(i)}). \quad (pr^*(i) = \{i\})$$
When $|pr(i)| > 0$, assume that $(\tilde g^{(2)}_{k[x_\bullet]} \circ h_{pr^*(k)})(u^{(1)}_{pr^*(k)}) = \tilde g^{(1)}_{k[x_\bullet]}(u^{(1)}_{pr^*(k)})$ holds for each $k \in pr(i)$. Then,
$$\tilde g^{(2)}_{i[x_\bullet]}(h_{pr^*(i)}(u^{(1)}_{pr^*(i)})) = \tilde f^{(2)}_{i[x_\bullet]}(h_{an^{(2)*}(i)}(u^{(1)}_{an^{(2)*}(i)})) \quad \text{(Definition A.6)}$$
$$= f^{(2)}_{i[x_\bullet]}\big((\tilde f^{(2)}_{k[x_\bullet]}(h_{an^{(2)*}(k)}(u^{(1)}_{an^{(2)*}(k)})))_{k \in pa^{(2)}(i)},\, h_i(u^{(1)}_i)\big) \quad \text{(Definition A.4)}$$
$$= f^{(2)}_{i[x_\bullet]}\big((\tilde g^{(2)}_{k[x_\bullet]}(h_{pr^*(k)}(u^{(1)}_{pr^*(k)})))_{k \in pa^{(2)}(i)},\, h_i(u^{(1)}_i)\big) \quad \text{(Definition A.6)}$$
$$= f^{(2)}_{i[x_\bullet]}\big((\tilde g^{(1)}_{k[x_\bullet]}(u^{(1)}_{pr^*(k)}))_{k \in pa^{(2)}(i)},\, h_i(u^{(1)}_i)\big) \quad \text{(induction hypothesis)}$$
$$= f^{(1)}_{i[x_\bullet]}\big((\tilde g^{(1)}_{k[x_\bullet]}(u^{(1)}_{pr^*(k)}))_{k \in pa^{(1)}(i)},\, u^{(1)}_i\big) \quad \text{(Lemma A.9)}$$
$$= f^{(1)}_{i[x_\bullet]}\big((\tilde f^{(1)}_{k[x_\bullet]}(u^{(1)}_{an^{(1)*}(k)}))_{k \in pa^{(1)}(i)},\, u^{(1)}_i\big) \quad \text{(Definition A.6)}$$
$$= \tilde f^{(1)}_{i[x_\bullet]}(u^{(1)}_{an^{(1)*}(i)}) \quad \text{(Definition A.4)}$$
$$= \tilde g^{(1)}_{i[x_\bullet]}(u^{(1)}_{pr^*(i)}). \quad \text{(Definition A.6)}$$
Since Lemma A.9 holds for almost all $u^{(1)} \in \Omega^{(1)}_U$, the proposition is established. □

Lemma A.11. For each $i \in I$ and for almost all $u^{(1)} \in \Omega^{(1)}_U$, the PCCM satisfies $(\tilde g^{(2)}_{pr^*(i)[x_\bullet]} \circ h_{pr^*(i)})(u^{(1)}_{pr^*(i)}) = \tilde g^{(1)}_{pr^*(i)[x_\bullet]}(u^{(1)}_{pr^*(i)})$.

Proof. Consider $u^{(1)} \in \Omega^{(1)}_U$ for which Lemma A.10 holds. Then,
$$\tilde g^{(2)}_{pr^*(i)[x_\bullet]}(h_{pr^*(i)}(u^{(1)}_{pr^*(i)})) = \big(\tilde g^{(2)}_{k[x_\bullet]}(h_{pr^*(i)}(u^{(1)}_{pr^*(i)}))\big)_{k \in pr^*(i)} \quad \text{(Definition A.7)}$$
$$= \big(\tilde g^{(1)}_{k[x_\bullet]}(u^{(1)}_{pr^*(i)})\big)_{k \in pr^*(i)} \quad \text{(Lemma A.10)}$$
$$= \tilde g^{(1)}_{pr^*(i)[x_\bullet]}(u^{(1)}_{pr^*(i)}). \quad \text{(Definition A.7)}$$
Since Lemma A.10 holds for almost all $u^{(1)} \in \Omega^{(1)}_U$, the proposition is established. □

Theorem 3.2 (EI Implies L3). For recursive SCMs $M^{(1)}$ and $M^{(2)}$, if $M^{(1)} \cong_{EI} M^{(2)}$, then $M^{(1)} \equiv_{L3} M^{(2)}$.

Proof. Consider $u^{(1)} \in \Omega^{(1)}_U$ for which Lemma A.8 and Lemma A.11 hold simultaneously. Then, taking $pr^*(i) = I$, we have
$$V^{M^{(1)}}_{x_\bullet}(u^{(1)}) \overset{\text{Lemma A.8}}{=} \tilde g^{(1)}_{I[x_\bullet]}(u^{(1)}) \overset{\text{Lemma A.11}}{=} (\tilde g^{(2)}_{I[x_\bullet]} \circ h)(u^{(1)}) \overset{\text{Lemma A.8}}{=} (V^{M^{(2)}}_{x_\bullet} \circ h)(u^{(1)}).$$
Since Lemmas A.8 and A.11 hold for almost all $u^{(1)} \in \Omega^{(1)}_U$, it follows that $V^{M^{(1)}}_{x_\bullet}$ almost surely equals $V^{M^{(2)}}_{x_\bullet} \circ h$. Moreover, since $P^{(2)}_U = h_* P^{(1)}_U$, it follows that
$$P^{M^{(1)}}_{V_\bullet} = (V^{M^{(1)}}_{x_\bullet})_* P^{(1)}_U = (V^{M^{(2)}}_{x_\bullet} \circ h)_* P^{(1)}_U = (V^{M^{(2)}}_{x_\bullet})_* (h_* P^{(1)}_U) = (V^{M^{(2)}}_{x_\bullet})_* P^{(2)}_U = P^{M^{(2)}}_{V_\bullet}.$$
Therefore, for any counterfactual random variables $Y_\bullet \subseteq V_\bullet$ and any event $\mathcal{Y}$, we have $P^{M^{(1)}}_{Y_\bullet}(\mathcal{Y}) = P^{M^{(2)}}_{Y_\bullet}(\mathcal{Y})$, as both are obtained by marginalizing $P^{M^{(1)}}_{V_\bullet} = P^{M^{(2)}}_{V_\bullet}$. According to the L3-valuation, any term of the form $P(Y_\bullet \in \mathcal{Y})$ receives equal assignments in $M^{(1)}$ and $M^{(2)}$. Since any $\varphi \in L3$ consists of terms of the form $P(Y_\bullet \in \mathcal{Y})$, it follows that $\varphi(M^{(1)}) = \varphi(M^{(2)})$. Therefore, by the definition of the L3-theory,
$$L3(M^{(1)}) = \{\varphi(M^{(1)}) \mid \varphi \in L3\} = \{\varphi(M^{(2)}) \mid \varphi \in L3\} = L3(M^{(2)}),$$
i.e., $M^{(1)} \equiv_{L3} M^{(2)}$ according to Definition 2.2. □
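Theorem 3.2 can be probed numerically: given a candidate component-wise bijection $h$, one checks that it intertwines the mechanisms and then compares counterfactual outcomes. The sketch below uses two invented one-parent SCMs related by $h_i(u) = 2u$ (a hypothetical example, not from the paper).

```python
import random

# Two hypothetical SCMs over V1 -> V2, related by h_i(u) = 2u:
#   M1: v1 = u1,      v2 = v1 + u2,      U^(1)_i ~ N(0, 1)
#   M2: v1 = u1 / 2,  v2 = v1 + u2 / 2,  U^(2)_i ~ N(0, 4) = h_* N(0, 1)
h = lambda u: 2.0 * u
f1 = {1: lambda pa, u: u, 2: lambda pa, u: pa + u}
f2 = {1: lambda pa, u: u / 2, 2: lambda pa, u: pa + u / 2}

# Mechanism preservation: f2_i(v, h(u)) == f1_i(v, u) for all v, u.
for _ in range(1000):
    v, u = random.gauss(0, 1), random.gauss(0, 1)
    assert abs(f2[2](v, h(u)) - f1[2](v, u)) < 1e-12

# Matching counterfactuals V2[V1 := 3](u): both models answer identically
# on corresponding exogenous values u2 and h(u2), as Theorem 3.2 predicts.
u2 = random.gauss(0, 1)
cf1 = f1[2](3.0, u2)       # in M1 with exogenous u2
cf2 = f2[2](3.0, h(u2))    # in M2 with exogenous h(u2)
assert abs(cf1 - cf2) < 1e-12
```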
A.3. Bijective SCM

From the perspective of measure theory, a (regular) conditional distribution is a probability kernel $P_{Y|X} : \mathcal{F}_Y \times \Omega_X \to [0, 1]$, where the index sets are required to satisfy $I_X \cap I_Y = \emptyset$. If for any $\mathcal{X} \in \mathcal{F}_X$ and $\mathcal{Y} \in \mathcal{F}_Y$ there exists a joint distribution $P_{X,Y}(\mathcal{X} \times \mathcal{Y}) = \int_{\mathcal{X}} P_{Y|X}(\mathcal{Y}, x)\, P_X(dx)$, where $P_X$ is the marginal distribution, then such a probability kernel $P_{Y|X}$ is referred to as a regular conditional distribution given $X$. The Decomposition Theorem establishes its existence and uniqueness:

Theorem A.12 (Decomposition Theorem). If the space $\Omega_X$ is a Polish space, then the regular conditional distribution $P_{Y|X} : \mathcal{F}_Y \times \Omega_X \to [0, 1]$ exists and is almost surely unique.

Lemma A.13. If $Y = f(X)$, then for all $\mathcal{Y} \in \mathcal{F}_Y$ and for almost every $x \in \Omega_X$, the conditional distribution $P_{Y|X}(\mathcal{Y}, x) = \delta_{f(x)}(\mathcal{Y}) = 1_{\mathcal{Y}}(f(x))$, where $\delta_{f(x)}$ is the Dirac measure and $1_{\mathcal{Y}}$ is the indicator function.

Proof. For any $\mathcal{X} \in \mathcal{F}_X$ and $\mathcal{Y} \in \mathcal{F}_Y$, according to the definition of the joint distribution,
$$P_{X,Y}(\mathcal{X} \times \mathcal{Y}) = ((X, Y)_* P)(\mathcal{X} \times \mathcal{Y}) = P(\{\omega \mid \omega \in Y^{-1}[\mathcal{Y}] \cap X^{-1}[\mathcal{X}]\})$$
$$= P(\{\omega \mid \omega \in (f \circ X)^{-1}[\mathcal{Y}] \cap X^{-1}[\mathcal{X}]\}) = P(\{\omega \mid \omega \in X^{-1}[f^{-1}[\mathcal{Y}]] \cap X^{-1}[\mathcal{X}]\})$$
$$= P(\{\omega \mid \omega \in X^{-1}[f^{-1}[\mathcal{Y}] \cap \mathcal{X}]\}) \quad (g^{-1}[S] \cap g^{-1}[T] = g^{-1}[S \cap T])$$
$$= P_X(f^{-1}[\mathcal{Y}] \cap \mathcal{X}).$$
Then, for the Lebesgue integral of the indicator function (or Dirac measure),
$$\int_{\mathcal{X}} 1_{\mathcal{Y}}(f(x))\, P_X(dx) = \int_{\mathcal{X}} 1_{f^{-1}[\mathcal{Y}]}(x)\, P_X(dx) = \int_{f^{-1}[\mathcal{Y}] \cap \mathcal{X}} P_X(dx) = P_X(f^{-1}[\mathcal{Y}] \cap \mathcal{X}) = P_{Y,X}(\mathcal{Y} \times \mathcal{X}).$$
According to Theorem A.12, since the $P_{Y|X}$ satisfying $P_{Y,X}(\mathcal{Y} \times \mathcal{X}) = \int_{\mathcal{X}} P_{Y|X}(\mathcal{Y}, x)\, P_X(dx)$ is almost surely unique, it follows that for all $\mathcal{Y} \in \mathcal{F}_Y$ and almost every $x \in \Omega_X$, $P_{Y|X}(\mathcal{Y}, x) = \delta_{f(x)}(\mathcal{Y}) = 1_{\mathcal{Y}}(f(x))$. □

Lemma A.14. If $Y = f(X, Z)$, then for all $\mathcal{Y} \in \mathcal{F}_Y$ and for almost every $z \in \Omega_Z$, the conditional distribution $P_{Y|Z}(\mathcal{Y}, z) = \big(f(\cdot, z)_*\, P_{X|Z}(\cdot, z)\big)(\mathcal{Y})$.

Proof. For any $\mathcal{Y} \in \mathcal{F}_Y$, consider $z \in \Omega_Z$ for which Lemma A.13 holds. Then,
$$P_{Y|Z}(\mathcal{Y}, z) = P_{Y,X|Z}(\mathcal{Y} \times \Omega_X, z) = \int_{\Omega_X} P_{Y|X,Z}(\mathcal{Y}, x, z)\, P_{X|Z}(dx, z) \quad \text{(factorization)}$$
$$= \int_{\Omega_X} 1_{\mathcal{Y}}(f(x, z))\, P_{X|Z}(dx, z) \quad \text{(Lemma A.13)}$$
$$= \int_{\Omega_X} 1_{f(\cdot,z)^{-1}[\mathcal{Y}]}(x)\, P_{X|Z}(dx, z) = \int_{f(\cdot,z)^{-1}[\mathcal{Y}]} P_{X|Z}(dx, z) = P_{X|Z}\big(f(\cdot, z)^{-1}[\mathcal{Y}], z\big) = \big(f(\cdot, z)_*\, P_{X|Z}(\cdot, z)\big)(\mathcal{Y}).$$
Since Lemma A.13 holds for almost every $z \in \Omega_Z$, the proposition follows. □

Proposition 4.2. A recursive SCM $M$ is a BSCM if and only if $f_i(v, \cdot)$ is a bijection for every $i \in I$ and all $v \in \Omega_V$.

Proof. ($\Rightarrow$): Suppose that the SCM is a BSCM, so that, by Definition 4.1, the solution mapping $\Gamma$ is a bijection. Consider each $i \in I$; it suffices to prove that $f_i(v_{pa(i)}, \cdot)$ is both injective and surjective for all $v \in \Omega_V$.

First, injectivity. Given any $v \in \Omega_V$ and any $u_i, u'_i \in \Omega_{U_i}$ such that $f_i(v_{pa(i)}, u_i) = f_i(v_{pa(i)}, u'_i)$, let $v^*_i = f_i(v_{pa(i)}, u_i)$ (note that $v^*_i$ may not equal $v_i$). Construct $v^* = (v_{pa(i)}, v^*_i, v_{I \setminus \{i\} \setminus pa(i)})$, where $v_{I \setminus \{i\} \setminus pa(i)}$ is arbitrary. By the definition of a BSCM, since $\Gamma$ is a bijection, there exists a unique $u^*$ such that $v^* = \Gamma(u^*)$. Replace the $i$-th component of $u^*$ with $u_i$ and with $u'_i$ to obtain $u$ and $u'$, respectively. Since $f_i(v_{pa(i)}, u_i) = f_i(v_{pa(i)}, u'_i) = v^*_i$, it follows that $v^* = \Gamma(u) = \Gamma(u')$. Given the uniqueness of the $u^*$ satisfying $v^* = \Gamma(u^*)$, we have $u = u' = u^*$, implying $u_i = u'_i$. Therefore, $f_i(v_{pa(i)}, \cdot)$ is injective.

Next, surjectivity. Assume, for contradiction, that there exists $v \in \Omega_V$ such that $f_i(v_{pa(i)}, \cdot)$ is not surjective. Then there exists $v^*_i$ such that $f_i(v_{pa(i)}, u_i) \neq v^*_i$ for all $u_i \in \Omega_{U_i}$. Construct $v^* = (v_{pa(i)}, v^*_i, v_{I \setminus \{i\} \setminus pa(i)})$, where $v_{I \setminus \{i\} \setminus pa(i)}$ is arbitrary. By the definition of a BSCM, since $\Gamma$ is a bijection, there exists $u^*$ such that $v^* = \Gamma(u^*)$, whose $i$-th component satisfies $f_i(v_{pa(i)}, u^*_i) = v^*_i$. This contradicts the assumption that $f_i(v_{pa(i)}, u_i) \neq v^*_i$ for all $u_i \in \Omega_{U_i}$. Therefore, the assumption is false, and $f_i(v_{pa(i)}, \cdot)$ must be surjective for all $v \in \Omega_V$.

In summary, for all $v \in \Omega_V$, $f_i(v_{pa(i)}, \cdot)$ is a bijection.

($\Leftarrow$): Suppose that $f_i(v_{pa(i)}, \cdot)$ is a bijection for each $i \in I$ and all $v \in \Omega_V$. It suffices to prove that $\Gamma$ is both injective and surjective.

First, injectivity. Let $u, u' \in \Omega_U$ be such that $v = \Gamma(u) = \Gamma(u')$. According to Lemma A.2, for each $i \in I$, we have $v_i = f_i(v_{pa(i)}, u_i) = f_i(v_{pa(i)}, u'_i)$.
Since $f_i(v_{pa(i)}, \cdot)$ is a bijection, it follows that $u_i = u'_i$. Therefore $u = u'$, proving that $\Gamma$ is injective.

Next, surjectivity. For each $v \in \Omega_V$ and each $i \in I$, consider the $i$-th component $v_i$. Since $f_i(v_{pa(i)}, \cdot)$ is a bijection for any $v \in \Omega_V$, there exists $u_i = (f_i(v_{pa(i)}, \cdot))^{-1}(v_i)$ such that $v_i = f_i(v_{pa(i)}, u_i)$. Let $u = ((f_i(v_{pa(i)}, \cdot))^{-1}(v_i))_{i \in I}$. According to the definition of the solution mapping $\Gamma$, substituting $u$ into $\Gamma$ yields $\Gamma(u) = v$. Thus, for any $v \in \Omega_V$ there exists $u$ such that $\Gamma(u) = v$, proving that $\Gamma$ is surjective.

Therefore, $\Gamma$ is a bijection, and by Definition 4.1, $M$ is a BSCM. □

Theorem 4.3 (BSCM-EI). If two BSCMs $M^{(1)}$ and $M^{(2)}$ share a common causal order and the same observational distribution $P_V$, then $M^{(1)} \cong_{EI} M^{(2)}$ if and only if for every $i \in I$ there exists a bijection $h_i : \Omega^{(1)}_{U_i} \to \Omega^{(2)}_{U_i}$ such that for all $v \in \Omega_V$, $(f^{(2)}_i(v, \cdot))^{-1} \circ f^{(1)}_i(v, \cdot) = h_i$ almost surely.

Proof. ($\Rightarrow$): Suppose $M^{(1)} \cong_{EI} M^{(2)}$. According to the definition of exogenous isomorphism, there exists a bijection $h = (h_i)_{i \in I}$ such that each $h_i$ is a bijection and the causal mechanisms are preserved. For each $i \in I$ and any given $v \in \Omega_V$, consider the preserved causal mechanism $f^{(2)}_i(v, h_i(u^{(1)}_i)) = f^{(1)}_i(v, u^{(1)}_i)$ for some $u^{(1)}_i \in \Omega^{(1)}_{U_i}$. Since $f^{(2)}_i(v, \cdot)$ is a bijection, composing both sides with $(f^{(2)}_i(v, \cdot))^{-1}$ yields $h_i(u^{(1)}_i) = \big((f^{(2)}_i(v, \cdot))^{-1} \circ f^{(1)}_i(v, \cdot)\big)(u^{(1)}_i)$. Since $f^{(2)}_i(v, h_i(u^{(1)}_i)) = f^{(1)}_i(v, u^{(1)}_i)$ holds for almost every $u^{(1)}_i \in \Omega^{(1)}_{U_i}$, it follows that $(f^{(2)}_i(v, \cdot))^{-1} \circ f^{(1)}_i(v, \cdot) = h_i$ almost surely.

($\Leftarrow$): For each $i \in I$, suppose there exists a bijection $h_i : \Omega^{(1)}_{U_i} \to \Omega^{(2)}_{U_i}$ such that $(f^{(2)}_i(v, \cdot))^{-1} \circ f^{(1)}_i(v, \cdot) = h_i$ almost surely. By the associativity of function composition, for any $v \in \Omega_V$, we have
$$f^{(2)}_i(v, \cdot) \circ h_i = f^{(2)}_i(v, \cdot) \circ (f^{(2)}_i(v, \cdot))^{-1} \circ f^{(1)}_i(v, \cdot) = f^{(1)}_i(v, \cdot) \quad \text{almost surely}.$$
This equality holds for each $i \in I$ and all $v \in \Omega_V$, so the causal mechanisms are preserved. Moreover, since the observational distribution satisfies $P_V = \Gamma^{(1)}_* P^{(1)}_U = \Gamma^{(2)}_* P^{(2)}_U$, and both $\Gamma^{(1)}$ and $\Gamma^{(2)}$ are bijections, it follows by pullback that $((\Gamma^{(2)})^{-1} \circ \Gamma^{(1)})_* P^{(1)}_U = P^{(2)}_U$. By Lemma A.8 and Lemma A.11, we have $\Gamma^{(1)} = \Gamma^{(2)} \circ h$, that is, $(\Gamma^{(2)})^{-1} \circ \Gamma^{(1)} = h$. Therefore, $h_* P^{(1)}_U = P^{(2)}_U$, which preserves the exogenous distributions. Lastly, the component-wise bijections hold by the assumption on $h_i : \Omega^{(1)}_{U_i} \to \Omega^{(2)}_{U_i}$. Hence, $h = (h_i)_{i \in I}$ constitutes an exogenous isomorphism. Given such an exogenous isomorphism and the common causal order, it follows that $M^{(1)} \cong_{EI} M^{(2)}$. □

Before proceeding with the proofs below, we visually depict the objects of study in Theorem 4.3 and Definition 4.4 in Figure 3, allowing readers to intuitively grasp the strong connection between these two concepts. Intuitively, these objects are related through a flipping operation; formally, this relationship is characterized algebraically by associativity and inverses. Subsequently, we focus on Definition 4.4, which enables the problem to be confined within a single BSCM.
[Figure 3. Commutative diagrams on $\Omega^{(1)}_{U_i}$ and $\Omega^{(2)}_{U_i}$ relating the mechanisms $f^{(1)}_i(v, \cdot)$, $f^{(1)}_i(v', \cdot)$, $f^{(2)}_i(v, \cdot)$, and $f^{(2)}_i(v', \cdot)$. (a) The objects of study in Theorem 4.3, $(f^{(2)}_i(v, \cdot))^{-1} \circ f^{(1)}_i(v, \cdot)$ (green) and $(f^{(2)}_i(v', \cdot))^{-1} \circ f^{(1)}_i(v', \cdot)$ (red), constructed across different BSCMs; (b) the objects of study in Definition 4.4, $f^{(1)}_i(v', \cdot) \circ (f^{(1)}_i(v, \cdot))^{-1}$ (blue) and $f^{(2)}_i(v', \cdot) \circ (f^{(2)}_i(v, \cdot))^{-1}$ (yellow), constructed within the same BSCM.]

Proposition 4.5. If the BSCM $M$ is Markovian, then for almost all $v, v' \in \Omega_V$, the conditional distributions satisfy $P_{V_i|V_{pa(i)}}(\cdot, v'_{pa(i)}) = K_{M,i}(\cdot, v, v')_*\, P_{V_i|V_{pa(i)}}(\cdot, v_{pa(i)})$.

Proof. Since $M$ is a recursive SCM, there exists a unique solution such that $V_i = f_i(V_{pa(i)}, U_i)$. Based on Proposition 4.2, the causal mechanism $f_i(v_{pa(i)}, \cdot)$ of the BSCM is a bijection for each $i \in I$ and all $v \in \Omega_V$; therefore $U_i = (f_i(V_{pa(i)}, \cdot))^{-1}(V_i)$. According to Lemma A.14, for any $v \in \Omega_V$, it holds that
$$P_{V_i|V_{pa(i)}}(\cdot, v_{pa(i)}) = f_i(v_{pa(i)}, \cdot)_*\, P_{U_i|V_{pa(i)}}(\cdot, v_{pa(i)}), \qquad P_{U_i|V_{pa(i)}}(\cdot, v_{pa(i)}) = (f_i(v_{pa(i)}, \cdot))^{-1}_*\, P_{V_i|V_{pa(i)}}(\cdot, v_{pa(i)}).$$
By the Markovianity assumption, $P_{U_i|V_{pa(i)}} = P_{U_i}$. Therefore, for any $v, v' \in \Omega_V$, the counterfactual transport satisfies
$$K_{M,i}(\cdot, v, v')_*\, P_{V_i|V_{pa(i)}}(\cdot, v_{pa(i)}) = \big(f_i(v'_{pa(i)}, \cdot) \circ (f_i(v_{pa(i)}, \cdot))^{-1}\big)_*\, P_{V_i|V_{pa(i)}}(\cdot, v_{pa(i)})$$
$$= f_i(v'_{pa(i)}, \cdot)_*\, (f_i(v_{pa(i)}, \cdot))^{-1}_*\, P_{V_i|V_{pa(i)}}(\cdot, v_{pa(i)}) = f_i(v'_{pa(i)}, \cdot)_*\, P_{U_i|V_{pa(i)}}(\cdot, v_{pa(i)}) \quad \text{(Lemma A.14)}$$
$$= f_i(v'_{pa(i)}, \cdot)_*\, P_{U_i} = f_i(v'_{pa(i)}, \cdot)_*\, P_{U_i|V_{pa(i)}}(\cdot, v'_{pa(i)}) \quad \text{(Markovianity)}$$
$$= P_{V_i|V_{pa(i)}}(\cdot, v'_{pa(i)}). \quad \text{(Lemma A.14)}$$
This shows that the conditional distribution satisfies $P_{V_i|V_{pa(i)}}(\cdot, v'_{pa(i)}) = K_{M,i}(\cdot, v, v')_*\, P_{V_i|V_{pa(i)}}(\cdot, v_{pa(i)})$. □

Theorem 4.6 (EI-ID from Counterfactual Transport). An SCM is EI-identifiable from $A\{\mathrm{BSCM}, \preceq, P_V, K\}$.

Proof. Consider any pair of models $M^{(1)}, M^{(2)} \models A\{\mathrm{BSCM}, \preceq, P_V, K\}$ and each $i \in I$. For any pair $v, v' \in \Omega_V$, let the components of the counterfactual transport be $K_{M^{(1)},i}(\cdot, v, v')$ and $K_{M^{(2)},i}(\cdot, v, v')$, respectively. By the validity of $A_K(M)$, it holds that $K_{M^{(1)},i}(\cdot, v, v') = K_{M^{(2)},i}(\cdot, v, v') = K_i(\cdot, v, v')$ almost surely. By the definition of counterfactual transport, this implies
$$f^{(1)}_i(v'_{pa(i)}, \cdot) \circ (f^{(1)}_i(v_{pa(i)}, \cdot))^{-1} = f^{(2)}_i(v'_{pa(i)}, \cdot) \circ (f^{(2)}_i(v_{pa(i)}, \cdot))^{-1}$$
almost surely. Since the causal mechanisms $f^{(1)}_i(v_{pa(i)}, \cdot)$ and $f^{(2)}_i(v'_{pa(i)}, \cdot)$ are bijections, we can compose both sides on the right with $f^{(1)}_i(v_{pa(i)}, \cdot)$ and on the left with $(f^{(2)}_i(v'_{pa(i)}, \cdot))^{-1}$, respectively. By the associativity of function composition, we obtain
$$(f^{(2)}_i(v'_{pa(i)}, \cdot))^{-1} \circ f^{(1)}_i(v'_{pa(i)}, \cdot) = (f^{(2)}_i(v_{pa(i)}, \cdot))^{-1} \circ f^{(1)}_i(v_{pa(i)}, \cdot)$$
almost surely, and both sides remain bijections. Since this equality holds for any pair $v, v' \in \Omega_V$, all of these maps are equal to a single bijection $h_i : \Omega^{(1)}_{U_i} \to \Omega^{(2)}_{U_i}$. That is, for each $i \in I$, there exists a bijection $h_i : \Omega^{(1)}_{U_i} \to \Omega^{(2)}_{U_i}$ such that for all $v \in \Omega_V$, $(f^{(2)}_i(v, \cdot))^{-1} \circ f^{(1)}_i(v, \cdot) = h_i$ almost surely. Furthermore, a common causal order exists by assumption. Therefore, by Theorem 4.3, it follows that $M^{(1)} \cong_{EI} M^{(2)}$. □
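In one dimension the KR transport between two conditional distributions is simply quantile matching, which is what makes Theorem 4.8 below operational: the counterfactual transport can be estimated from $P_V$ alone. A minimal sketch under assumed Gaussian conditionals (a hypothetical example, not the paper's experiments):

```python
from statistics import NormalDist

# Hypothetical Markovian BSCM edge V_pa -> V_i with
# V_i | V_pa = v ~ N(2 * v, 1). The KR (increasing) transport between the
# conditionals at v and v' is K(y, v, v') = F_{v'}^{-1}(F_v(y)).
def cond(v):
    return NormalDist(mu=2 * v, sigma=1.0)

def kr_transport(y, v, v_prime):
    return cond(v_prime).inv_cdf(cond(v).cdf(y))

# Counterfactual outcome of V_i had V_pa been v' = 1.5, given the
# observation (v = 1.0, y = 2.7); any SCM satisfying the assumptions of
# Theorem 4.8 must return this same value.
y_cf = kr_transport(2.7, 1.0, 1.5)
print(round(y_cf, 6))  # 3.7 here, since this transport is a mean shift
```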
Theorem 4.8 (EI-ID from KR Transport). An SCM is EI-identifiable from $A\{\mathrm{BSCM}, \preceq, \mathrm{M}, P_V, \mathrm{KR}\}$.

Proof. We need to prove that if $M \models A\{\mathrm{BSCM}, \preceq, \mathrm{M}, P_V, \mathrm{KR}\}$, then $M \models A\{\mathrm{BSCM}, \preceq, P_V, K\}$, and subsequently apply Theorem 4.6. Specifically, we only need to demonstrate that if $M \models A\{\mathrm{BSCM}, \preceq, \mathrm{M}, P_V, \mathrm{KR}\}$, then $A_K(M)$ holds.

Assume $M \models A\{\mathrm{BSCM}, \preceq, \mathrm{M}, P_V, \mathrm{KR}\}$. Then $M$ is a Markovian BSCM with causal order $\preceq$ and observational distribution $P_V$. For each $i \in I$, consider any pair $v, v' \in \Omega_V$. Given the observational distribution $P_V$, the conditional distribution $P_{V_i|V_{pa(i)}}(\cdot, v_{pa(i)})$ is almost surely uniquely determined by Theorem A.12. Then, according to Lemma 4.7, the KR transport between $P_{V_i|V_{pa(i)}}(\cdot, v_{pa(i)})$ and $P_{V_i|V_{pa(i)}}(\cdot, v'_{pa(i)})$ is almost surely unique. Let the partial function $K_i(\cdot, v, v')$ be any version of these KR transports. Therefore, $K_i(\cdot, v, v')$ is almost surely uniquely determined, and hence $K = (K_i)_{i \in I}$ is also almost surely uniquely determined. Since $A_{KR}(M)$ holds, the components $K_{M,i}$ of the counterfactual transport are almost surely equivalent to the KR transports. Moreover, because $K_{M,i}$ is a transport between these conditional distributions by Proposition 4.5, and by Lemma 4.7 the KR transport between two distributions is almost surely unique given the distributions, the component $K_{M,i}(\cdot, v, v')$ of the counterfactual transport is almost surely equal to $K_i(\cdot, v, v')$. Since the counterfactual transport is $K_M = (K_{M,i})_{i \in I}$, it follows that $K_M = K$ almost surely; thus $A_K(M)$ is satisfied.

Consequently, we have proven that if $M \models A\{\mathrm{BSCM}, \preceq, \mathrm{M}, P_V, \mathrm{KR}\}$, then $M \models A\{\mathrm{BSCM}, \preceq, P_V, K\}$. By Theorem 4.6, for any pair $M^{(1)}, M^{(2)} \models A\{\mathrm{BSCM}, \preceq, P_V, K\}$, it holds that $M^{(1)} \cong_{EI} M^{(2)}$. Therefore, for any pair $M^{(1)}, M^{(2)} \models A\{\mathrm{BSCM}, \preceq, \mathrm{M}, P_V, \mathrm{KR}\}$, we also have $M^{(1)} \cong_{EI} M^{(2)}$. This implies that the SCM is EI-identifiable from $A\{\mathrm{BSCM}, \preceq, \mathrm{M}, P_V, \mathrm{KR}\}$. □

A.4. Triangular Monotonic SCM

We first establish some properties of TM mappings.

Lemma A.15. For a TM mapping $T : \mathbb{R}^d \to \mathbb{R}^d$, its inverse $T^{-1}$ is also a TM mapping, and the monotonicity signature satisfies $\xi(T^{-1}) = \xi(T)$.

Proof. To clarify the proof, we denote the domain and codomain by the symbols $\Omega_X$ and $\Omega_Z$, both of which refer to $\mathbb{R}^d$. For a TM mapping $T : \Omega_X \to \Omega_Z$, assume it is of the form $T(x) = (T_j(x_{1:j-1}, x_j))_{j \in 1:d}$, where $T_j : \Omega_{X_{1:j-1}} \times \Omega_{X_j} \to \Omega_{Z_j}$. According to the definition of a TM mapping, for any $x_{1:j-1} \in \Omega_{X_{1:j-1}}$, the function $T_j(x_{1:j-1}, \cdot) : \Omega_{X_j} \to \Omega_{Z_j}$ is consistently strictly monotonic; that is, for each $j \in 1:d$, there exists $\xi_j \in \{-1, 1\}$ such that $\xi(T_j) = \xi_j$.

We construct $\tilde T_{1:j} : \Omega_{Z_{1:j}} \to \Omega_{X_{1:j}}$ of the form $\tilde T_{1:j} = (\tilde T_k)_{k \in 1:j}$ with
$$\tilde T_j(z_{1:j-1}, z_j) = \big(T_j(\tilde T_{1:j-1}(z_{1:j-1}), \cdot)\big)^{-1}(z_j)$$
for any $j \in 1:d$ and any $z_{1:j} \in \Omega_{Z_{1:j}}$. Since $\tilde T_j$ depends only on $z_{1:j}$, $\tilde T$ is a triangular mapping. Let $\tilde T = \tilde T_{1:d}$.

Next, we prove that $\tilde T = T^{-1}$, i.e., that $\tilde T$ is the explicit form of the inverse of $T$. According to the definition of an inverse function, we need to show that $z = T(\tilde T(z))$ for any $z \in \Omega_Z$; component-wise, that $z_j = (T(\tilde T(z)))_j$ for each $j \in 1:d$. First, by the form of $T$, its $j$-th component expands as $(T(\tilde T(z)))_j = T_j((\tilde T(z))_{1:j-1}, (\tilde T(z))_j)$. According to the construction above, $(\tilde T(z))_{1:j-1} = \tilde T_{1:j-1}(z_{1:j-1})$ and $(\tilde T(z))_j = \tilde T_j(z_{1:j-1}, z_j)$. Substituting these into the above equation, we obtain
$$(T(\tilde T(z)))_j = T_j\Big(\tilde T_{1:j-1}(z_{1:j-1}),\ \big(T_j(\tilde T_{1:j-1}(z_{1:j-1}), \cdot)\big)^{-1}(z_j)\Big).$$
Let $\tilde T_{1:j-1}(z_{1:j-1}) = x^*_{1:j-1}$. Since $T$ is a TM mapping, the function $T_j(x^*_{1:j-1}, \cdot) : \Omega_{X_j} \to \Omega_{Z_j}$ is strictly monotonic and hence invertible. Therefore, the expression simplifies to $T_j(x^*_{1:j-1}, \cdot)\big((T_j(x^*_{1:j-1}, \cdot))^{-1}(z_j)\big) = z_j$. Thus $(T(\tilde T(z)))_j = z_j$ holds for each $j \in 1:d$, implying $\tilde T = T^{-1}$. Since $\tilde T$ is a triangular mapping, $T^{-1}$ is also a triangular mapping.

Next, we consider the monotonicity signature. For each $j \in 1:d$, recall that $\tilde T_j(z_{1:j-1}, z_j) = (T_j(\tilde T_{1:j-1}(z_{1:j-1}), \cdot))^{-1}(z_j)$. If for any $x_{1:j-1} \in \Omega_{X_{1:j-1}}$ the function $T_j(x_{1:j-1}, \cdot) : \Omega_{X_j} \to \Omega_{Z_j}$ is s.m.i. (strictly monotonically increasing), then its inverse $(T_j(x_{1:j-1}, \cdot))^{-1} : \Omega_{Z_j} \to \Omega_{X_j}$ is also s.m.i. For any $z_{1:j-1} \in \Omega_{Z_{1:j-1}}$, let $\tilde T_{1:j-1}(z_{1:j-1}) = x^*_{1:j-1}$. Since $x^*_{1:j-1} \in \Omega_{X_{1:j-1}}$, the function $(T_j(x^*_{1:j-1}, \cdot))^{-1}$ is always s.m.i., so $\tilde T_j$ is s.m.i. with respect to $z_j$ given any $z_{1:j-1} \in \Omega_{Z_{1:j-1}}$. A similar conclusion holds when $T_j(x_{1:j-1}, \cdot)$ is s.m.d. (strictly monotonically decreasing). Therefore, by the definition of the monotonicity signature, for each $j \in 1:d$ we have $\xi(T_j) = \xi(\tilde T_j) = \xi_j$. Consequently,
$$\xi(T^{-1}) = \xi(\tilde T) = (\xi(\tilde T_j))_{j \in 1:d} = (\xi_j)_{j \in 1:d} = (\xi(T_j))_{j \in 1:d} = \xi(T),$$
i.e., the monotonicity signature satisfies $\xi(T^{-1}) = \xi(T)$. Moreover, since $\sum_{j=1}^{d} |\xi_j| = d$ and $T^{-1}$ is a triangular mapping, $T^{-1}$ is indeed a TM mapping by definition. □
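The construction in Lemma A.15 is directly computable: invert coordinate by coordinate, each time solving a one-dimensional strictly monotone equation. A sketch using bisection (hypothetical map; the bracket widths are chosen generously for the example):

```python
def invert_tm(T, z, lo=-1e6, hi=1e6, tol=1e-10):
    """Invert a strictly increasing triangular map T: R^d -> R^d at z,
    following Lemma A.15: solve T_j(x_{1:j-1}, x_j) = z_j for j = 1..d."""
    x = []
    for j, zj in enumerate(z):
        a, b = lo, hi
        while b - a > tol:  # 1-D bisection in the j-th coordinate
            m = (a + b) / 2
            if T(x + [m])[j] < zj:
                a = m
            else:
                b = m
        x.append((a + b) / 2)
    return x

# Hypothetical TM mapping with signature (+1, +1):
# T(x) = (x1^3, x1 + 2*x2); T must accept partial inputs x_{1:j}.
T = lambda x: [x[0] ** 3] + ([x[0] + 2 * x[1]] if len(x) > 1 else [])
z = [8.0, 5.0]
x = invert_tm(T, z)
print([round(c, 6) for c in x])  # [2.0, 1.5], and T(x) recovers z
```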
Lemma A.16. For two TM mappings $T^{(1)} : \mathbb{R}^d \to \mathbb{R}^d$ and $T^{(2)} : \mathbb{R}^d \to \mathbb{R}^d$, their composition $T^{(1)} \circ T^{(2)}$ is also a TM mapping, and the monotonicity signature satisfies $\xi(T^{(1)} \circ T^{(2)}) = \xi(T^{(1)}) \odot \xi(T^{(2)})$, where $\odot$ denotes component-wise multiplication.

Proof. To clarify the proof, we denote the domains and codomains by the symbols $\Omega_X$, $\Omega_Y$, and $\Omega_Z$, all of which refer to $\mathbb{R}^d$. For the TM mapping $T^{(1)} : \Omega_Y \to \Omega_Z$, assume it is of the form $T^{(1)}(y) = (T^{(1)}_j(y_{1:j-1}, y_j))_{j \in 1:d}$, where $T^{(1)}_j : \Omega_{Y_{1:j-1}} \times \Omega_{Y_j} \to \Omega_{Z_j}$. According to the definition of a TM mapping, for any $y_{1:j-1} \in \Omega_{Y_{1:j-1}}$, the function $T^{(1)}_j(y_{1:j-1}, \cdot) : \Omega_{Y_j} \to \Omega_{Z_j}$ is consistently strictly monotonic; hence for each $j \in 1:d$ there exists $\xi^{(1)}_j \in \{-1, 1\}$ such that $\xi(T^{(1)}_j) = \xi^{(1)}_j$. The same holds for the TM mapping $T^{(2)}$.

We construct $\tilde T_{1:j} : \Omega_{X_{1:j}} \to \Omega_{Z_{1:j}}$ of the form $\tilde T_{1:j} = (\tilde T_k)_{k \in 1:j}$ with
$$\tilde T_j(x_{1:j-1}, x_j) = T^{(1)}_j\big(T^{(2)}_{1:j-1}(x_{1:j-1}),\ T^{(2)}_j(x_{1:j-1}, x_j)\big),$$
where $T^{(2)}_{1:j} = (T^{(2)}_k)_{k \in 1:j}$. Since $\tilde T_j$ depends only on $x_{1:j}$, $\tilde T$ is a triangular mapping. According to the form of $T^{(1)}$, for any $j \in 1:d$ and any $x \in \Omega_X$, let $y = T^{(2)}(x)$. Then $((T^{(1)} \circ T^{(2)})(x))_j = (T^{(1)}(y))_j = T^{(1)}_j(y_{1:j-1}, y_j)$. According to the form of $T^{(2)}$, we have $y_{1:j-1} = (T^{(2)}(x))_{1:j-1} = (T^{(2)}_k(x_{1:k-1}, x_k))_{k \in 1:j-1} = T^{(2)}_{1:j-1}(x_{1:j-1})$ and $y_j = T^{(2)}_j(x_{1:j-1}, x_j)$. Therefore,
$$((T^{(1)} \circ T^{(2)})(x))_j = T^{(1)}_j(y_{1:j-1}, y_j) = T^{(1)}_j\big(T^{(2)}_{1:j-1}(x_{1:j-1}),\ T^{(2)}_j(x_{1:j-1}, x_j)\big) = \tilde T_j(x_{1:j-1}, x_j)$$
for any $j \in 1:d$. Let $\tilde T = \tilde T_{1:d}$; that is, $\tilde T = T^{(1)} \circ T^{(2)}$. Since $\tilde T$ is triangular, $T^{(1)} \circ T^{(2)}$ is also a triangular mapping.

Next, we consider the monotonicity signature. For each $j \in 1:d$, recall that $\tilde T_j(x_{1:j-1}, x_j) = T^{(1)}_j(T^{(2)}_{1:j-1}(x_{1:j-1}), T^{(2)}_j(x_{1:j-1}, x_j))$. If for any $x_{1:j-1} \in \Omega_{X_{1:j-1}}$ the function $T^{(2)}_j(x_{1:j-1}, \cdot) : \Omega_{X_j} \to \Omega_{Y_j}$ is s.m.i., and for any $y_{1:j-1} \in \Omega_{Y_{1:j-1}}$ the function $T^{(1)}_j(y_{1:j-1}, \cdot) : \Omega_{Y_j} \to \Omega_{Z_j}$ is also s.m.i., then the composition $T^{(1)}_j(y_{1:j-1}, \cdot) \circ T^{(2)}_j(x_{1:j-1}, \cdot)$ is s.m.i. By the definition of monotonic functions, if both functions have the same monotonicity, the composition is s.m.i.; if they have opposite monotonicities, the composition is s.m.d. Therefore, by the definition of the monotonicity signature, for each $j \in 1:d$ we have $\xi(\tilde T_j) = \xi^{(1)}_j \cdot \xi^{(2)}_j$. Consequently,
$$\xi(T^{(1)} \circ T^{(2)}) = \xi(\tilde T) = (\xi(\tilde T_j))_{j \in 1:d} = (\xi^{(1)}_j \cdot \xi^{(2)}_j)_{j \in 1:d} = (\xi(T^{(1)}_j) \cdot \xi(T^{(2)}_j))_{j \in 1:d} = \xi(T^{(1)}) \odot \xi(T^{(2)}),$$
i.e., the monotonicity signature satisfies $\xi(T^{(1)} \circ T^{(2)}) = \xi(T^{(1)}) \odot \xi(T^{(2)})$. Moreover, since $\xi^{(1)}_j, \xi^{(2)}_j \in \{-1, 1\}$, it follows that $|\xi^{(1)}_j \cdot \xi^{(2)}_j| = 1$ for every $j$, so $\sum_{j=1}^{d} |\xi^{(1)}_j \cdot \xi^{(2)}_j| = d$. Additionally, since $T^{(1)} \circ T^{(2)}$ is a triangular mapping, it is indeed a TM mapping by definition. □

Remark A.17. For two TM mappings $T^{(1)} : \mathbb{R}^d \to \mathbb{R}^d$ and $T^{(2)} : \mathbb{R}^d \to \mathbb{R}^d$, if $\xi(T^{(1)}) = \xi(T^{(2)})$, then $\xi(T^{(1)} \circ T^{(2)}) = (1)_{1:d}$, meaning that $T^{(1)} \circ T^{(2)}$ is a TMI mapping.

Proof.
This result follows directly from the component-wise multiplication in Lemma A.16: when $\xi(T^{(1)}) = \xi(T^{(2)})$, each component of $\xi(T^{(1)}) \odot \xi(T^{(2)})$ equals $(\pm 1)^2 = 1$. □

Lemma A.18. For a TM mapping $T : \mathbb{R}^d \to \mathbb{R}^d$ and any $i, j \in 1:d$ with $i < j$, its subcomponent $T_{i:j}$ restricted to any $x_{1:i-1} \in \mathbb{R}^{i-1}$, i.e., $T_{i:j}(x_{1:i-1}, \cdot)$, is also a TM mapping, and its monotonicity signature satisfies $\xi(T_{i:j}(x_{1:i-1}, \cdot)) = (\xi(T))_{i:j}$.

Proof. To clarify the proof, we denote the domain and codomain by $\Omega_X$ and $\Omega_Z$, both referring to $\mathbb{R}^d$. For a TM mapping $T : \Omega_X \to \Omega_Z$, assume it is of the form $T(x) = (T_j(x_{1:j-1}, x_j))_{j \in 1:d}$, where $T_j : \Omega_{X_{1:j-1}} \times \Omega_{X_j} \to \Omega_{Z_j}$. By the definition of a TM mapping, for any $x_{1:j-1} \in \Omega_{X_{1:j-1}}$, the function $T_j(x_{1:j-1}, \cdot) : \Omega_{X_j} \to \Omega_{Z_j}$ is consistently strictly monotonic, so for each $j \in 1:d$ there exists $\xi_j \in \{-1, 1\}$ with $\xi(T_j) = \xi_j$.

The subcomponent $T_{i:j}$ is of the form $(T_k(x_{1:i-1}, x_{i:k-1}, x_k))_{k \in i:j}$. For any $x_{1:i-1} \in \mathbb{R}^{i-1}$, the restriction $T_{i:j}(x_{1:i-1}, \cdot)$ is of the form $(T_k(x_{1:i-1}, x_{i:k-1}, x_k))_{k \in i:j}$; since each component depends only on $x_{i:k}$, the restriction $T_{i:j}(x_{1:i-1}, \cdot)$ is a triangular mapping.

Now consider the monotonicity signature. Assume that for some $k \in i:j$, by the definition of $\xi(T_k)$, $T_k$ is s.m.i. with respect to $x_k$ given any $x_{1:k-1} \in \Omega_{X_{1:k-1}}$. Since $(x^*_{1:i-1}, x_{i:k-1}) \in \Omega_{X_{1:k-1}}$, the restriction $T_k(x^*_{1:i-1}, \cdot)$ is also s.m.i. with respect to $x_k$ given any $x_{i:k-1} \in \Omega_{X_{i:k-1}}$. The same conclusion holds when $T_k$ is s.m.d. Thus, by the definition of the monotonicity signature, for any $x_{1:i-1} \in \mathbb{R}^{i-1}$ and each $k \in i:j$, we have $\xi(T_k(x_{1:i-1}, \cdot)) = \xi_k$. Consequently,
$$\xi(T_{i:j}(x_{1:i-1}, \cdot)) = (\xi(T_k(x_{1:i-1}, \cdot)))_{k \in i:j} = (\xi_k)_{k \in i:j} = (\xi(T))_{i:j}.$$
Moreover, since $\sum_{k=i}^{j} |\xi_k| = j - i + 1$ and $T_{i:j}(x_{1:i-1}, \cdot)$ is a triangular mapping, it follows from the definition of TM mappings that $T_{i:j}(x_{1:i-1}, \cdot)$ is a TM mapping. □

Lemma A.19. For the vectorization $\iota$ under $\leq$ and each $i \in I$, there exists $l_i = 1 + \sum_{i' \in pr(i)} d_{i'}$ such that the image set $\iota[\{(i, j)\}_{j \in 1:d_i}] = l_i : (l_i + d_i - 1)$, where $pr(i)$ denotes the prefix of the index $i$ under $\leq$.

Proof. Recall that the vectorization $\iota$ is a flattening mapping such that for any $(i, j), (i, j')$, $\iota(i, j) - \iota(i, j') = j - j'$, and for any $(i, j), (i', j')$, $i < i'$ implies $\iota(i, j) < \iota(i', j')$. We prove the result by induction on $|pr(i)|$. When $|pr(i)| = 0$, we have $l_i = 1$. Since there does not exist any $i' \in I$ with $i' < i$, there are no pairs $(i', j')$ such that $(i', j') < (i, 1)$. Thus, by the definition of vectorization, $\iota(i, 1) = 1$. For any $(i, j)$ with $j \in 1:d_i$, we have $\iota(i, j) - \iota(i, 1) = j - 1$, so $\iota(i, j) = j$. Therefore the image set is $\iota[\{(i, j)\}_{j \in 1:d_i}] = 1:d_i = l_i : (l_i + d_i - 1)$.

Now suppose that for any $k \in I$ with $|pr(k)| < |pr(i)|$, there exists $l_k = 1 + \sum_{i' \in pr(k)} d_{i'}$ such that $\iota[\{(k, j)\}_{j \in 1:d_k}] = l_k : (l_k + d_k - 1)$. Let $l_i = 1 + \sum_{i' \in pr(i)} d_{i'}$. Since $\iota$ is a bijection, the preimage satisfies $\iota^{-1}[1 : (l_i - 1)] = \bigcup_{k \in pr(i)} \{(k, j)\}_{j \in 1:d_k}$ by the induction hypothesis, so the smallest index not yet assigned is $l_i$. By the properties of vectorization, $\iota(i, 1) = l_i$ and $\iota(i, j) = l_i + j - 1$ for each $j \in 1:d_i$. Therefore $\iota[\{(i, j)\}_{j \in 1:d_i}] = l_i : (l_i + d_i - 1)$. □

Lemma A.20 (Vectorize an). For the vectorization $\iota$ under $\leq$ and any $i \in I$, the image of the ancestor indices satisfies $\iota[an(i)] \subseteq 1 : (l_i - 1)$.

Proof. Assume, for contradiction, that there exists $j \in an(i)$ with $l_j + d_j - 1 \geq l_i$. By Lemma A.19 and the properties of vectorization, $l_j + d_j - 1 \geq l_i$ holds if and only if $j > i$. However, since $j \in an(i)$, we know $j < i$ (by the definition of the causal order). This contradicts $j > i$, so the assumption $l_j + d_j - 1 \geq l_i$ does not hold. Hence, for any $i \in I$, we have $\iota[an(i)] \subseteq 1 : (l_i - 1)$. □

Proposition 5.3. A recursive SCM $M$ is a TM-SCM if and only if $f_i(v, \cdot)$ is a TM mapping and there exists $\xi_i$ such that $\xi(f_i(v, \cdot)) = \xi_i$ for every $i \in I$ and all $v \in \Omega_V$.

Proof. Consider a causal order $\leq$ with its vectorization $\iota$. Define the reindexing mapping $P_{\iota,i} = (P_{\iota,i,j})_{j \in 1:d_i}$. By Lemma A.19, the vectorization satisfies $P_{\iota,i}((x_{i,j})_{j \in 1:d_i}) = (x_j)_{j \in l_i:(l_i+d_i-1)}$, and its inverse satisfies $P^{-1}_{\iota,i}((x_j)_{j \in l_i:(l_i+d_i-1)}) = (x_{i,j})_{j \in 1:d_i}$.
For simplicity, when we refer to $v_i$ or $u_i$ indexed by $I$, we actually mean the entire vector $(x_{i,j})_{j \in 1:d_i}$.

($\Rightarrow$): Assume that the SCM $M$ is a TM-SCM. By Definition 5.2, there exists a causal order $\leq$ with vectorization $\iota$ such that the reindexed solution mapping $P_\iota \circ \Gamma$ is a TM mapping. Let $P_\iota \circ \Gamma = T = (T_j)_{j \in 1:D}$, where each $T_j$ is of the form $T_j(u_{1:j-1}, u_j)$ and $D = \sum_{i \in I} d_i$. Then, for each $i \in I$, we have $(\Gamma)_i = P^{-1}_{\iota,i} \circ T_{l_i:(l_i+d_i-1)}$. By Lemma A.2, $\Gamma(u) = (\tilde f_i(u_{an^*(i)}))_{i \in I}$, so $\tilde f_i = P^{-1}_{\iota,i} \circ T_{l_i:(l_i+d_i-1)}$. The function $T_{l_i:(l_i+d_i-1)}(u_{1:l_i-1}, \cdot)$ is of the form $T_{l_i:(l_i+d_i-1)}(u_{1:l_i-1}, u_{l_i:(l_i+d_i-1)})$; thus $(P^{-1}_{\iota,i} \circ T_{l_i:(l_i+d_i-1)})(u_{\iota^{-1}[1:l_i-1]}, \cdot)$ is of the form $(P^{-1}_{\iota,i} \circ T_{l_i:(l_i+d_i-1)})(u_{\iota^{-1}[1:l_i-1]}, (u_{i,j})_{j \in 1:d_i})$. By Lemma A.20 and the fact that $u_{an(i)} = (u_{i',j})_{i' \in an(i)}$, $\tilde f_i$ is independent of $u_{1:l_i-1 \setminus \iota[an(i)]}$. Hence $\tilde f_i(u_{an(i)}, u_i) = (P^{-1}_{\iota,i} \circ T_{l_i:(l_i+d_i-1)})(u_{an(i)}, (u_{i,j})_{j \in 1:d_i})$, implying that $\tilde f_i(u_{an(i)}, \cdot)$ is a triangular mapping. By Lemma A.18, the monotonicity signature satisfies $\xi(T_{l_i:(l_i+d_i-1)}(u_{1:l_i-1}, \cdot)) = (\xi(T))_{l_i:(l_i+d_i-1)}$; thus $\xi(\tilde f_i(u_{an(i)}, \cdot)) = (\xi(T))_{l_i:(l_i+d_i-1)}$. Since $\sum_{j \in l_i:(l_i+d_i-1)} |(\xi(T))_j| = d_i$, it follows that $\tilde f_i(u_{an(i)}, \cdot)$ is a TM mapping. Finally, since TM mappings are bijections, for any given $v \in \Omega_V$ there exists $u^* \in \Omega_U$ with $u^* = \Gamma^{-1}(v)$. By Definition A.1, $f_i((\tilde f_k(u^*_{an^*(k)}))_{k \in pa(i)}, \cdot) = \tilde f_i(u^*_{an(i)}, \cdot)$. Therefore, for all $v \in \Omega_V$, $f_i(v, \cdot)$ is a TM mapping, and there exists $\xi_i = (\xi(T))_{l_i:(l_i+d_i-1)}$ such that $\xi(f_i(v, \cdot)) = \xi_i$.

($\Leftarrow$): Let $\leq$ be any causal order and $\iota$ its vectorization. Assume $f_i(v, \cdot)$ is a TM mapping and there exists $\xi_i$ such that $\xi(f_i(v, \cdot)) = \xi_i$ for each $i \in I$ and all $v \in \Omega_V$. For any given $v \in \Omega_V$, there exists $u^* \in \Omega_U$ with $u^* = \Gamma^{-1}(v)$. By Definition A.1, $f_i((\tilde f_k(u^*_{an^*(k)}))_{k \in pa(i)}, \cdot) = \tilde f_i(u^*_{an(i)}, \cdot)$. By assumption, $\tilde f_i(u_{an(i)}, \cdot)$ is a TM mapping with $\xi(\tilde f_i(u_{an(i)}, \cdot)) = \xi_i$. Using Lemma A.20, we construct $T_{l_i:(l_i+d_i-1)} = P_{\iota,i} \circ \tilde f_i$, where $\xi(T_j) = (\xi_i)_{j - l_i + 1}$ for each $j \in l_i:(l_i+d_i-1)$. By Lemma A.2, $\Gamma = (\tilde f_i)_{i \in I}$. Thus $P_\iota \circ \Gamma = (P_{\iota,i} \circ \tilde f_i)_{i \in I} = (T_{l_i:(l_i+d_i-1)})_{i \in I} = (T_j)_{j \in 1:D}$. Each $T_j$ depends only on $u_{1:j}$, and since $\sum_{j \in 1:D} |\xi(T_j)| = \sum_{i \in I} d_i = D$, $P_\iota \circ \Gamma$ is a TM mapping. By Definition 5.2, $M$ is a TM-SCM. □

Corollary 5.4 (EI-ID from Triangular Monotonicity). An SCM is EI-identifiable from $A\{\mathrm{TM\text{-}SCM}, \preceq, \mathrm{M}, P_V\}$.

Proof. We aim to prove that if $M \models A\{\mathrm{TM\text{-}SCM}, \preceq, \mathrm{M}, P_V\}$, then $M \models A\{\mathrm{BSCM}, \preceq, \mathrm{M}, P_V, \mathrm{KR}\}$, allowing us to apply Theorem 4.8. Specifically, it suffices to show that if $M \models A\{\mathrm{TM\text{-}SCM}, \preceq, \mathrm{M}, P_V\}$, then $A_{KR}(M)$ holds.

Assume $M \models A\{\mathrm{TM\text{-}SCM}, \preceq, \mathrm{M}, P_V\}$. Then $M$ is a Markovian BSCM, inducing the causal order $\preceq$, with observational distribution $P_V$. Since $M$ is a TM-SCM, by Proposition 5.3, for each $i \in I$ and all $v \in \Omega_V$, the causal mechanism $f_i(v, \cdot)$ is a TM mapping, and there exists $\xi_i$ such that $\xi(f_i(v, \cdot)) = \xi_i$. Furthermore, by the definition of counterfactual transport and Remark A.17, $K_{M,i} = f_i(v', \cdot) \circ (f_i(v, \cdot))^{-1}$ is a TMI mapping for any pair $v, v' \in \Omega_V$.

Now, for each $i \in I$, consider an arbitrary pair $v, v' \in \Omega_V$. Given the observational distribution $P_V$, the conditional distribution $P_{V_i|V_{pa(i)}}(\cdot, v_{pa(i)})$ is almost surely uniquely determined by Theorem A.12. Consequently, the conditional distributions $P_{V_i|V_{pa(i)}}(\cdot, v_{pa(i)})$ and $P_{V_i|V_{pa(i)}}(\cdot, v'_{pa(i)})$ are also almost surely determined. Given these distributions, by Lemma 5.1 and Proposition 4.5, the TMI mapping $K_{M,i} = f_i(v', \cdot) \circ (f_i(v, \cdot))^{-1}$ is almost surely equal to the KR transport. Thus $A_{KR}(M)$ holds. Hence, we have proven that if $M \models A\{\mathrm{TM\text{-}SCM}, \preceq, \mathrm{M}, P_V\}$, then $M \models A\{\mathrm{BSCM}, \preceq, \mathrm{M}, P_V, \mathrm{KR}\}$. By Theorem 4.8, for any pair $M^{(1)}, M^{(2)} \models A\{\mathrm{BSCM}, \preceq, \mathrm{M}, P_V, \mathrm{KR}\}$, we have $M^{(1)} \cong_{EI} M^{(2)}$; therefore, for any pair $M^{(1)}, M^{(2)} \models A\{\mathrm{TM\text{-}SCM}, \preceq, \mathrm{M}, P_V\}$, we also have $M^{(1)} \cong_{EI} M^{(2)}$. Thus, SCMs satisfying $A\{\mathrm{TM\text{-}SCM}, \preceq, \mathrm{M}, P_V\}$ are EI-identifiable. □
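Corollary 5.4 is what licenses using a learned triangular monotone map as a proxy SCM: any two such models matching $P_V$ agree on all counterfactuals. The following sketch shows the resulting abduction-action-prediction loop on a toy triangular model (hypothetical mechanisms; a neural TM-SCM would replace them with learned monotone networks):

```python
import math

# Hypothetical TM-SCM over (V1, V2): each mechanism is strictly increasing
# in its exogenous argument, so abduction is exact inversion.
f = {
    1: (lambda pa, u: u,                lambda pa, v: v),              # f1, f1^{-1}
    2: (lambda pa, u: math.sin(pa) + u, lambda pa, v: v - math.sin(pa)),
}

def counterfactual(v_obs, x):
    """Abduction-action-prediction on the proxy TM-SCM."""
    u = {1: f[1][1](None, v_obs[1]),
         2: f[2][1](v_obs[1], v_obs[2])}      # abduction: u = Gamma^{-1}(v)
    v1 = x.get(1, f[1][0](None, u[1]))        # action: clamp intervened nodes
    v2 = x.get(2, f[2][0](v1, u[2]))          # prediction: re-solve downstream
    return {1: v1, 2: v2}

v_obs = {1: 0.8, 2: 1.2}
print(counterfactual(v_obs, x={1: -0.5}))
```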
Lemma A.21 (restated from Khoa Le et al. 2025, Theorem 1). If the velocity field $v : \mathbb{R}^d \times [0, 1] \to \mathbb{R}^d$ is a Lipschitz continuous triangular mapping for every $t \in [0, 1]$, then for the ODE
$$dx(t) = v(x(t), t)\, dt, \qquad x(0) = x_0,$$
the flow $T(\cdot, t)$ is a TMI mapping for every $t \in [0, 1]$.

Proof. We first show that $T(\cdot, t)$ is a triangular mapping. Writing $T(\cdot, t)$ as $(T_j(\cdot, t))_{j \in 1:d}$ and $x(t)$ as $(x_j(t))_{j \in 1:d}$, consider dimension $j = 1$. Since $v_1(\cdot, t)$ is a triangular mapping depending only on $x_1(t)$, the corresponding ODE can be written as
$$dx_1(t) = v_1(x_1(t), t)\, dt, \qquad x_1(0) = x_{1,0}.$$
This is a one-dimensional ODE. By Lipschitz continuity and the Picard-Lindelöf theorem, the solution exists and is unique, implying that $x_1(t)$ depends only on $x_{1,0}$. Therefore, by the definition of the flow, $T_1(x_0, t) = x_1(t)$ depends only on $x_{1,0}$. Now assume by induction that $T_{j-1}(x_0, t) = x_{j-1}(t)$ depends only on $x_{1:j-1,0}$. For dimension $j$, since $v_j(\cdot, t)$ is a triangular mapping depending only on $x_{1:j}(t)$, the corresponding ODE can be written as
$$dx_j(t) = v_j(x_{1:j}(t), t)\, dt, \qquad x_{1:j}(0) = x_{1:j,0},$$
or equivalently,
$$dx_j(t) = v_j(x_{1:j-1}(t), x_j(t), t)\, dt, \qquad x_j(0) = x_{j,0}, \qquad x_{1:j-1}(0) = x_{1:j-1,0}.$$
Here, $x_{1:j-1}(t)$ depends only on $x_{1:j-1,0}$ and can be treated as given, so the ODE for $x_j(t)$ becomes one-dimensional. By Lipschitz continuity and the Picard-Lindelöf theorem, the solution exists and is unique, implying that $x_j(t)$ depends only on $x_{j,0}$ once $x_{1:j-1,0}$ is fixed; consequently, $x_j(t)$ depends only on $x_{1:j,0}$ when $x_{1:j-1,0}$ is unspecified. By the definition of the flow, $T_j(x_0, t) = x_j(t)$ depends only on $x_{1:j,0}$. By induction, $T(\cdot, t)$ is a triangular mapping.

Next, we show that $T(\cdot, t)$ is a TMI mapping. For each dimension $j$, consider two different initial values $x_{j,0}, x'_{j,0} \in \mathbb{R}$ with $x_{j,0} < x'_{j,0}$. From the previous argument, given $x_{1:j-1,0}$, the ODE is one-dimensional. Arguing by contradiction, suppose that at some time $t = t^*$ we have $x_j(t^*) \geq x'_j(t^*)$, where $x_j(t)$ and $x'_j(t)$ are the solutions starting at $x_{j,0}$ and $x'_{j,0}$, respectively. By continuity, there exists $t^\dagger \leq t^*$ such that $x_j(t^\dagger) = x'_j(t^\dagger)$. However, by Lipschitz continuity and the Picard-Lindelöf theorem, the solution of the ODE through any point is unique; thus, if $x_j(t^\dagger) = x'_j(t^\dagger)$ at some $t^\dagger$, then $x_j(t) = x'_j(t)$ for all $t$, including $t = 0$. This contradicts the assumption $x_{j,0} < x'_{j,0}$, so the hypothesis is false. Therefore, given $x_{1:j-1,0}$, if $x_{j,0} < x'_{j,0}$, then $x_j(t) < x'_j(t)$; in other words, $\xi(T_j(\cdot, t)) = 1$. In summary, $T(\cdot, t)$ is a triangular mapping with $\sum_{j=1}^{d} \xi(T_j(\cdot, t)) = d$. By the definition of a TMI mapping, $T(\cdot, t)$ is a TMI mapping. □
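A quick numerical illustration of Lemma A.21 (hypothetical velocity field; forward Euler stands in for an exact ODE solver): integrate a Lipschitz triangular field and check that the flow preserves the order of initial conditions coordinate-wise.

```python
import math

def flow(x0, steps=1000):
    """Euler-integrate a hypothetical triangular velocity field on [0, 1]:
    v1 = -x1, v2 = tanh(x1) - 0.5 * x2 (each v_j depends only on x_{1:j})."""
    x1, x2 = x0
    dt = 1.0 / steps
    for _ in range(steps):
        x1, x2 = x1 + dt * (-x1), x2 + dt * (math.tanh(x1) - 0.5 * x2)
    return (x1, x2)

a = flow((0.0, -1.0))
b = flow((0.0, 1.0))   # same x1_0, larger x2_0
assert a[0] == b[0]     # triangularity: x1(t) ignores x2_0
assert a[1] < b[1]      # monotonicity: order in x2_0 is preserved
```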
B. Related Works

B.1. Identifiability

Counterfactual Identification. Previous work on counterfactual identification has typically focused on identifying specific classes of counterfactual statements $\varphi \in L3$, which can be categorized as follows. (i) Constraining causal structures to identify specific counterfactual effects. This involves representing counterfactual statements using symbols, graphs, and the available observational and interventional distributions. Examples include the identification of counterfactual probabilities (Shpitser & Pearl, 2008), nested counterfactual probabilities (Correa et al., 2021), sufficient or necessary probabilities (Tian & Pearl, 2000), path-specific effects (Avin et al., 2005), discrimination effects (Zhang & Bareinboim, 2018), and identification via neural networks (Xia et al., 2023). (ii) Constraining causal mechanisms to identify counterfactual outcomes. This focuses on deterministic counterfactuals inferred through abduction-action-prediction (Pearl, 2009), a special class of counterfactual statements of the form $y[x] \mid x', y'$. Identifiability is achieved primarily by imposing parametric constraints such as monotonicity, as in (Lu et al., 2020), (Nasr-Esfahany et al., 2023), and (Scetbon et al., 2024).

Identifiability. The identifiability exhibited by generative models has been a continuously discussed topic in the literature, spanning tasks such as representation learning (Bengio et al., 2013), causal discovery (Glymour et al., 2019), causal representation learning (Schölkopf et al., 2021), and causal quantity reasoning (Pearl, 2009). Identifiability in representation learning refers to uniquely determining latent factors up to certain ambiguities, such as permutation and scaling (Hyvärinen et al., 2023), affine transformations (Kivva et al., 2022), or component-wise invertible transformations (Gresele et al., 2021). In causal discovery, identifiability pertains to structural identifiability, which requires determining equivalence classes of causal graphs (Spirtes & Zhang, 2016; Vowels et al., 2022) or identifying the direction of cause and effect (Shimizu et al., 2006; Hoyer et al., 2008). In causal representation learning, identifiability involves determining latent causal variables and causal structures, where causal variables must be recovered up to component-wise diffeomorphic mappings and causal structures up to graph isomorphism (Brehmer et al., 2022; von Kügelgen et al., 2023). Finally, in causal quantity reasoning, identifiability refers to the consistency of causal statements (Pearl, 2009), which was previously termed causal identification.

B.2. Triangular Monotonic SCMs

To demonstrate the broad applicability of Corollary 5.4, we summarize several representative prior works and establish their connection to our theory by proving that they are special cases of the TM-SCM, which highlights the significance of our theoretical contributions. Overall, these works can be categorized according to the criteria outlined in Table 2, which further inspires us to derive four prototypical models of neural TM-SCMs.
Table 2. Representative works that are special cases of the TM-SCM.

| Model | Research Domain | TM Object | TM Type | TM Principle |
|---|---|---|---|---|
| LiNGAM | Cause-effect identification | $f_i$ | ANM | $b_i^\top v_{pa(i)} + u_i$ |
| ANM | Cause-effect identification | $f_i$ | ANM | $g_i(v_{pa(i)}) + u_i$ |
| LSNM | Cause-effect identification | $f_i$ | ANM | $l_i(v_{pa(i)}) + s_i(v_{pa(i)}) \cdot u_i$ |
| BGM | Counterfactual identification | $f_i$ | UMM | Prior |
| PNL | Cause-effect identification | $f_i$ | UMM | $h(g_i(v_{pa(i)}) + u_i)$ |
| FiP | Counterfactual identification | $f_i$ | UMM | $C^1$ smoothness |
| CAREFL | Cause-effect identification | $P_\iota \circ \Gamma$ | DCF | Affine transform |
| StrAF | Causal quantity estimation | $P_\iota \circ \Gamma$ | DCF | UMNN transform |
| CausalNF | Causal quantity estimation | $P_\iota \circ \Gamma$ | DCF | Monotonic RQS transform |
| CCNF | Causal quantity estimation | $P_\iota \circ \Gamma$ | DCF | Partial causal transform |
| CKM | Generalization | $P_\iota \circ \Gamma$ | CCF | Picard-Lindelöf |
| CFM | Causal quantity estimation | $P_\iota \circ \Gamma$ | CCF | Picard-Lindelöf |

Original Research Domain. The related works summarized here originate from various research domains, each with distinct objectives. For instance, many studies on specialized SCMs aim to address the task of cause-effect identification in causal discovery. Others, as previously mentioned, focus on identifying counterfactual outcomes by imposing constraints on causal mechanisms. Some works explore how to generalize SCMs to other scenarios. Another line of research, more aligned with deep generative models, seeks to address efficient causal effect estimation. Interestingly, despite their differing objectives, the SCM formulations of these works uniformly fall under the TM-SCM framework.

Triangular Monotonicity Object. The TM-SCM framework admits two equivalent definitions. One, as described in Definition 5.2, requires the solution mapping $P_\iota \circ \Gamma$, obtained by reindexing under the causal-ordering vectorization $\iota$, to be a TM mapping. The other, given in Proposition 5.3, requires the causal mechanism $f_i(v, \cdot)$ for each $i \in I$ and any $v \in \Omega_V$ to be a TM mapping with a consistent monotonicity signature. Accordingly, these works can be categorized by whether they constrain the global solution mapping $P_\iota \circ \Gamma$ or the individual causal mechanisms $f_i$.

Triangular Monotonicity Type. Existing works can be further categorized based on the type of triangular monotonicity they achieve:

Additive Noise Mechanism (ANM): Additive noise mechanisms restrict each causal mechanism $f_i$ in the SCM to the form $\alpha(v_{pa(i)}) + \beta(v_{pa(i)}) \cdot u_i$, where $\beta$ is required to be a strictly positive function. Consequently, even in the vector-valued case, the causal mechanism $f_i(v, \cdot)$ for each $i \in I$ and any $v \in \Omega_V$ is always a TMI mapping: the Jacobian matrix is diagonal, with each diagonal element $(\beta(v_{pa(i)}))_j > 0$ due to the constraint on $\beta$. Relevant works include LiNGAM (Shimizu et al., 2006), ANM (Hoyer et al., 2008), and LSNM (Immer et al., 2023). These studies have inspired the construction of DNME-type neural TM-SCMs; a counterfactual computation under this mechanism class is sketched after this list.

Univariate Monotonic Mechanism (UMM): In the univariate case, invertible functions are necessarily strictly monotonic. Thus, it suffices to further constrain the monotonicity signature of each causal mechanism $f_i(v, \cdot)$ to be consistent for every $i \in I$ and any $v \in \Omega_V$. This consistency can be ensured in several ways: (1) enforcing the function to be s.m.i. or s.m.d. by prior construction, (2) indirectly composing an additive noise mechanism with an invertible function, or (3) deriving monotonicity consistency from $C^1$ smoothness. These approaches correspond to BGM (Nasr-Esfahany et al., 2023), PNL (Zhang & Hyvärinen, 2009), and FiP (Scetbon et al., 2024), respectively. These works have inspired the construction of TNME-type neural TM-SCMs, extending from the univariate to the multivariate case.
Discrete Causal Flow (DCF): Certain discrete-time transformations in normalizing flows, such as affine transformations (Dinh et al., 2017), unconstrained monotonic neural networks (Wehenkel & Louppe, 2019), and monotonic rational quadratic splines (Durkan et al., 2019), can be shown to be TMI mappings. When these transformation blocks are composed, the entire normalizing flow remains a TMI mapping. Several works establish a connection between $\Gamma$ and normalizing flows, such as CAREFL (Khemakhem et al., 2021), StrAF (Chen et al., 2023), CausalNF (Javaloy et al., 2023), and CCNF (Zhou et al., 2025), each employing different transformation blocks. These works can be classified as CMSM-type neural TM-SCMs.

Continuous Causal Flow (CCF): In continuous-time settings where ODEs are used to construct SCMs, structural constraints are introduced into the velocity field, providing a novel pathway to induce TM mappings. Specifically, when the velocity field is triangular, the solution of the ODE is always a TMI mapping, as guaranteed by the Picard-Lindelöf condition (see Lemma A.21). Some works establish an equivalence between the ODE solution and the solution mapping $\Gamma$ of an SCM, such as CKM (Peters et al., 2022) and CFM (Khoa Le et al., 2025). These studies lay the foundation for constructing TVSM-type neural TM-SCMs.
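As referenced in the ANM item above, the following minimal sketch (hypothetical mechanism) shows why additive noise mechanisms make counterfactual outcomes immediate: abduction reduces to a subtraction, and prediction to an addition.

```python
# Hypothetical ANM edge: V_i = g(V_pa) + U_i with g(v) = v ** 2.
g = lambda v: v ** 2

def anm_counterfactual(v_pa, v_i, v_pa_cf):
    u_i = v_i - g(v_pa)          # abduction: the additive noise is exposed
    return g(v_pa_cf) + u_i      # action + prediction under V_pa := v_pa_cf

# Observed (v_pa, v_i) = (1.0, 1.3), so u_i = 0.3; counterfactually v_pa = 2.0:
print(anm_counterfactual(1.0, 1.3, 2.0))  # 4.3
```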
Corollary 5.4 establishes that, under the additional assumptions of Markovianity and an identical causal ordering, and provided they induce the same observational distribution, the constructed models are identifiable up to the EI-equivalence class. Furthermore, by Theorem 3.2, these models are mutually L3-consistent, implying that they are indistinguishable at the counterfactual level. Therefore, when these models are claimed to be counterfactually consistent or counterfactually identifiable, Corollary 5.4 and Theorem 3.2 theoretically justify such assertions.

B.3. Theories

We now elaborate on how other related theorems mentioned in the main text can be interpreted as special cases of Theorem 3.2 and Corollary 5.4, starting with a restatement of these theorems in the notation of this paper.

Special cases of Theorem 3.2. Several prior works have introduced notions similar to exogenous isomorphism and shown that SCMs satisfying these equivalence relations enjoy counterfactual consistency, mirroring the result of Theorem 3.2. For instance, (Peters et al., 2017) formalizes the concept of counterfactual equivalence and proves that, when two SCMs share the same exogenous distribution, counterfactual equivalence is guaranteed by the following theorem:

Theorem B.1 (restated from Peters et al. 2017, Proposition 6.49). For two recursive SCMs $M^{(1)}$ and $M^{(2)}$, if their exogenous distributions satisfy $P^{(1)}_U = P^{(2)}_U = P_U$, and for each $i \in I$, for almost every $u_i \in \Omega_{U_i}$ and all $v \in \Omega_V$, $f^{(2)}_i(v_{pa^{(2)}(i)}, u_i) = f^{(1)}_i(v_{pa^{(1)}(i)}, u_i)$, then $M^{(1)}$ and $M^{(2)}$ are counterfactually equivalent.

The original purpose of this theorem was to establish the minimality property of SCMs: namely, if a mechanism in an SCM does not depend on one of its parent endogenous variables, one can always select an SCM with a sparser structure. However, as elaborated in the main text, counterfactual equivalence defined in this way mandates a fixed latent representation.

(Nasr-Esfahany et al., 2023) relaxes counterfactual equivalence to non-fixed latent encodings: rather than requiring the exogenous distributions to coincide exactly, it only demands a component-wise bijection.

Theorem B.2 (restated from Nasr-Esfahany et al. 2023, Proposition 6.2). Suppose $f(V_{pa(i)}, U_i)$ is a BGM, meaning that for each realization $v_{pa(i)}$ of $V_{pa(i)}$, $f(v_{pa(i)}, \cdot)$ is invertible. For two BGMs $f^{(1)}$ and $f^{(2)}$, if for almost every $u^{(1)}_i \in \Omega^{(1)}_{U_i}$ and all $v \in \Omega_V$, $f^{(2)}(v_{pa^{(2)}(i)}, g(u^{(1)}_i)) = f^{(1)}(v_{pa^{(1)}(i)}, u^{(1)}_i)$, where $g : \Omega^{(1)}_{U_i} \to \Omega^{(2)}_{U_i}$ is a bijection, then we say that $f^{(1)}$ and $f^{(2)}$ are equivalent. Two BGMs are equivalent if and only if they produce the same counterfactual outcomes.

This result requires a weaker premise than that of (Peters et al., 2017), but it is confined to BSCMs. Moreover, the original paper considers only a single causal mechanism within a BSCM, referred to as a BGM. In addition, the theorem addresses counterfactual outcomes rather than counterfactual distributions, a gap filled later by Theorem B.6. In contrast, although conceptually similar, Theorem 3.2 is not limited to BSCMs; it applies to any recursive SCM.

Recent advances in causal representation learning yield analogous results. Specifically, (Brehmer et al., 2022) investigates a class of models termed Latent Causal Models (LCMs). Because the causal mechanisms in an LCM are assumed to be point-wise diffeomorphic, an LCM can be viewed as a BSCM augmented with latent endogenous variables. Using an isomorphism-based criterion to characterize equivalence between LCMs, the authors show that a component-wise diffeomorphic latent representation aligns the counterfactual distributions of two LCMs (counterfactual distributions are referred to in their paper as weakly supervised distributions).

Theorem B.3 (restated from Brehmer et al. 2022, Theorem 1). For LCMs $M$ and $M'$: if they can be reparameterized through a graph isomorphism and element-wise diffeomorphisms, we say there is an LCM isomorphism between them, or LCM equivalence, and we write $M \cong_{LCM} M'$. Assume that (i) the two models share the same observation space; (ii) the domains of every endogenous variable and its corresponding exogenous noise are $\mathbb{R}$; (iii) the intervention set contains all atomic, perfect interventions; and (iv) the intervention distributions have full support. Under these conditions, $M \cong_{LCM} M'$ if and only if $M$ and $M'$ entail equal weakly supervised distributions.

(Zhou et al., 2024) also investigates domain counterfactual equivalence between Invertible Latent Domain Causal Models (ILDs). An ILD incorporates latent endogenous variables together with a mixing function $g$. Under a fully connected DAG with a known causal ordering, the causal mechanism $f_d$ in domain $d$ is autoregressive. Because an ILD assumes invertibility, the latent portion of its SCM can be regarded as a BSCM, and the domain counterfactual can be interpreted as a special instance of the generalized counterfactual in which the domain itself serves as the intervention. Accordingly, a domain-specific mechanism $f_d$ corresponds to the mechanism $f[d]$ in the submodel $M[d]$.

Theorem B.4 (restated from Zhou et al. 2024, Theorem 1). For two ILDs $M$ and $M'$ whose mixing functions are $g$ and $g'$ and whose causal mechanisms are $f$ and $f'$, if for any domains $d, d'$ it holds that $g \circ f_{d'} \circ f_d^{-1} \circ g^{-1} = g' \circ f'_{d'} \circ (f'_d)^{-1} \circ (g')^{-1}$, then $M$ and $M'$ are said to be domain counterfactual equivalent, denoted $M \equiv_{DC} M'$.
The relation $M \equiv_{DC} M'$ holds if and only if there exist invertible maps $h_1, h_2$ such that $g' = g \circ h_1^{-1}$ is invertible and $f'_d = h_1 \circ f_d \circ h_2$ is autoregressive.

In summary, Theorem B.3 and Theorem B.4 achieve results comparable to Theorem B.2. Their main contribution is to extend counterfactual equivalence to the setting of invertible latent causal models, where, in addition to standard counterfactual reasoning, an unknown mixing function maps latent causal variables to the observable space, thereby advancing the development of causal representation learning.

Special cases of Corollary 5.4. Several existing results on counterfactual identifiability for SCMs under triangular monotonicity can be regarded as special cases of Corollary 5.4. (Lu et al., 2020) applied counterfactual reasoning in reinforcement learning, where their Theorem 1 states that counterfactual outcomes are identifiable when the noise is independent of states and actions and the state transition function is strictly monotonically increasing:

Theorem B.5 (restated from Lu et al. 2020, Theorem 1). Assume $S_{t+1} = f(S_t, A_t, U_{t+1})$, where $U_{t+1} \perp (S_t, A_t)$, and $S_t$, $A_t$, $U_t$ denote the state, action, and noise at time step $t$ in reinforcement learning, respectively. Suppose $f$ is smooth and strictly monotonically increasing in the noise for fixed $S_t$ and $A_t$. Given the observed values $S_t = s_t$, $A_t = a$, and $S_{t+1} = s_{t+1}$, the counterfactual outcome $S_{t+1}[A_t = a'] \mid S_t = s_t, A_t = a, S_{t+1} = s_{t+1}$ is identifiable for a counterfactual action $A_t = a'$.

Here, the independence of the noise from states and actions corresponds to the Markovianity assumption. The function $f$, being smooth and strictly monotonically increasing, is a univariate TM mapping. The causal ordering assumption is implicitly given by the time step $t$, and the observational distribution assumption is implicitly encoded in the observed values $S_t = s_t$, $A_t = a$, and $S_{t+1} = s_{t+1}$. Thus, all four assumptions in Corollary 5.4 are satisfied, making the SCM described in Theorem B.5 EI-identifiable. By Theorem 3.2, this SCM is also L3-identifiable, thereby ensuring the identifiability of counterfactual outcomes.
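Concretely, under Theorem B.5 the counterfactual next state can be computed by abduction on the monotone transition. A sketch with an invented transition function (hypothetical; a real task would estimate the conditional CDFs from data):

```python
import math

# Hypothetical monotone transition: s' = f(s, a, u) = s + a + tanh(u),
# strictly increasing in u, with U independent of (S, A).
f = lambda s, a, u: s + a + math.tanh(u)
f_inv_u = lambda s, a, s1: math.atanh(s1 - s - a)   # abduction in u

def counterfactual_next_state(s, a, s1, a_cf):
    u = f_inv_u(s, a, s1)        # abduction: recover the noise
    return f(s, a_cf, u)         # action + prediction under A_t := a_cf

# Observed transition (s=0.0, a=1.0, s1=1.5); counterfactual action a'=2.0.
print(counterfactual_next_state(0.0, 1.0, 1.5, 2.0))  # 2.5
```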
(Scetbon et al., 2024) adopted an alternative fixed-point-based representation of SCMs, referring to the identifiability of counterfactual outcomes as weak partial recovery of fixed-point SCMs, and proposed the following theorem:

Theorem B.7 (restated from Scetbon et al. 2024, Theorem 2.14). Assume that $\Omega_{V_i}$ and $\Omega_{U_i}$ are subsets of $\mathbb{R}$ for each $i \in I$, $P_U$ is absolutely continuous with a continuous density, and the $U_i$ are mutually independent. Let $[M]_{\mathrm{INV}}$ denote the set of SCMs satisfying the observational distribution $P_V$, the causal ordering $\prec$, and causal mechanisms $f$ that are $C^1$ and such that $f(v_{pa(i)}, \cdot)$ is bijective for all $i \in I$. Then $[M]_{\mathrm{INV}} \neq \emptyset$, and all models in this set are consistent on $(f_{[x]} \circ f^{-1})_* P_V$ for any intervention $x \in \Omega_X$, where $X \subseteq V$.

Here, $(f_{[x]} \circ f^{-1})_* P_V$ represents the pushforward form of counterfactual outcomes defined via the abduction-action-prediction framework, indicating that the theorem essentially discusses counterfactual identifiability. The mutual independence of the $U_i$ corresponds to the Markovianity assumption. The causal ordering is represented by a permutation matrix $P$ in the original paper, and the observational distribution is described as a pushforward of the exogenous distribution. Compared to Theorem B.6, the key difference lies in the lack of an explicit assumption that $f(v_{pa(i)}, \cdot)$ is strictly monotonic. Instead, this property is implicitly ensured through $C^1$ smoothness. Specifically, leveraging $C^1$ smoothness and (Scetbon et al., 2024, Lemma G.8), they indirectly proved that the partial derivative of $f(v_{pa(i)}, \cdot)$ with respect to $u_i$ is always strictly positive or strictly negative, which implies the strict monotonicity of $f(v_{pa(i)}, \cdot)$.

There are other generalizations of the monotonicity assumption for counterfactual identifiability. (Wu et al., 2025) employs the rank preservation assumption to identify the counterfactual outcome, which relaxes the strict monotonicity assumption. As discussed in the main text, strict monotonicity essentially corresponds to a lexicographical order; rank preservation can therefore be viewed as a redefinition of this lexicographical order, thereby generalizing the monotonicity assumption.

B.4. Neural SCMs

As for the related work on neural TM-SCMs, we focus here on methods designed for counterfactual inference tasks that construct proxy models by mimicking the structure of SCMs and parameterizing them with neural networks, which we refer to as neural SCMs.

Neural SCMs without proven identifiability  A series of works connect SCMs with various deep generative models, utilizing the inverse processes of these models to address the tractability issues in counterfactual inference. DSCM (Pawlowski et al., 2020) employs normalizing flows and variational inference to propose a general standardized framework for constructing SCMs using neural networks, enabling tractable inference of exogenous variables for counterfactual reasoning. Diff-SCM (Sanchez & Tsaftaris, 2022) introduces diffusion models tailored for counterfactual inference on high-dimensional data, where latent variables are inferred through the diffusion process, and gradients are intervened upon during the reverse diffusion process, ultimately applied to counterfactual image generation. VACA (Sánchez-Martín et al., 2022) parameterizes SCMs using a variational graph autoencoder (VGAE), distinguishing itself by directly modeling the global causal mechanism rather than individual conditional mechanisms and encoding variable dependencies within the SCM using graph neural networks.
While these models introduce the concept of mimicking SCMs to construct proxy models and enable counterfactual reasoning, they do not explicitly demonstrate the reliability of their inference results, i.e., whether the counterfactual outcomes are identifiable.

Neural SCMs proven to be identifiable for counterfactual effects  A series of studies address the identifiability of counterfactual effects in proxy SCMs by leveraging deep generative models. CVAE-SCM (Karimi et al., 2020) employs conditional variational autoencoders (CVAEs) to model conditional mechanisms and proves that, under the Markov assumption, certain types of counterfactual queries are identifiable solely from the observational distribution. Consequently, the constructed model ensures consistency for these counterfactual queries. MLE-NCM (Xia et al., 2021) systematically explores the identifiability of connecting neural networks to SCMs and demonstrates that, in discrete cases and with sufficiently expressive neural networks, proxy models constructed from causal graphs and observational distributions exhibit duality in L2 causal queries. That is, a causal query is identifiable on the proxy model if and only if it is identifiable on the causal graph. iVGAE (Zečević et al., 2021) extends MLE-NCM by showing that such duality also holds when using a VGAE. GAN-NCM (Xia et al., 2023) further generalizes MLE-NCM to L3 causal queries, proving that proxy models constructed from causal graphs and interventional distributions exhibit duality in the identifiability of counterfactual queries. The model is trained using generative adversarial networks (GANs).

Neural SCMs with empirically identifiable counterfactual outcomes  A series of studies primarily focus on proving model identifiability and empirically demonstrate the identifiability of counterfactual outcomes through experiments. CAREFL (Khemakhem et al., 2021) connects causal inference with autoregressive normalizing flows (ANFs), utilizing affine transformations primarily for cause-effect identification tasks in causal discovery and proposing methods for counterfactual reasoning. Causal NF (Javaloy et al., 2023) extends this approach to other types of ANF transformations and proves model identifiability in latent encodings. CCNF (Zhou et al., 2025) further introduces partial causal transformations to enhance inference performance. CFM (Khoa Le et al., 2025) generalizes this line of work from discrete normalizing flows to continuous normalizing flows, employing flow matching for modeling. While these studies demonstrate the identifiability of counterfactual outcomes on synthetic datasets, the connection between latent encoding identifiability (a problem in representation learning) and counterfactual identifiability (a problem in causal quantities) remains unclear. Proving under which conditions the former implies the latter is one of the key contributions of this work.

Neural SCMs proven to be identifiable for counterfactual outcomes  A series of studies have established the identifiability of counterfactual outcomes under specified assumptions. BGM (Nasr-Esfahany et al., 2023) investigates the identifiability of causal mechanisms for counterfactual outcomes under the bijection assumption and proposes three identification methods, ultimately using conditional normalizing flows to construct a proxy SCM to empirically support their findings.
FiP (Scetbon et al., 2024) formalizes SCMs and counterfactual distributions using fixed points, framing the identifiability problem for counterfactual outcomes as the weak partial recovery problem. They identify a condition for recovering counterfactual distributions, though their experiments return to additive noise models based on attention mechanisms while employing optimal transport to model noise distributions. DCM (Chao et al., 2024) assumes that the exogenous distribution of the true latent SCM is uniform, focusing on encoder-decoder models, and proposes two identification methods as extensions of (Nasr-Esfahany et al., 2023, Proposition 6.2). These methods rely on diffusion models to construct proxy SCMs. As shown in Appendix B.3, the first two studies are special cases of TM-SCM, while the last study leverages (Nasr-Esfahany et al., 2023, Proposition 6.2), which corresponds to a special instance of Theorem 3.2 in this work that considers only a single causal mechanism. Although these studies demonstrate the identifiability of counterfactual outcomes, such results cover only a specific subset of counterfactual statements. In contrast, a key contribution of this work is proving that the theoretical foundations extend beyond the identifiability of counterfactual outcomes to encompass the more general and stronger L3-identifiability.

C. Neural TM-SCM

In this section, we present the implementation details of the various neural TM-SCMs used in the experiments.

C.1. Vectorization

In this subsection, we address the implementation issues related to vectorization and de-vectorization, in particular how to switch between the symbolic form of endogenous or exogenous values (primarily used for reasoning on $M$) and the vectorized form (primarily used for reasoning on the solution mapping $\Gamma$) when a TM-SCM is given.

Vectorization  Suppose we aim to vectorize an endogenous or exogenous value $x$, resulting in its vectorized form $P_\iota(x)$. Since these values originate from a TM-SCM, their indices are fully aligned, and thus we do not distinguish between them here. Vectorization requires a predefined re-indexing map $\iota$. According to the definition provided in the main text, this only additionally requires a causal order $\prec$ in the TM-SCM. The causal order can be assumed to be directly provided, as in the neural TM-SCM problem setting described in the main text. For the ground truth TM-SCM, a causal order can be obtained through topological sorting. Subsequently, we can construct such a map $\iota$ based on Lemma A.19, defined as
$$\iota(i, j) = j + \sum_{k \in pr_\prec(i)} d_k.$$
From an implementation perspective, the collection of values $(x_i)_{i \in I}$, representing the symbolic form of $x$, can be expressed as a dictionary whose keys are $I$ and whose values are the vectors $x_i$. Vectorization then amounts to concatenating the vectors $x_i$ of the different causal variables in the causal order $\prec$, yielding $\bigoplus_{|pr_\prec(i)| = k} x_i$, where $pr_\prec(i)$ denotes the causal prefix of $i \in I$ under $\prec$, and $\oplus$ represents vector concatenation.

De-vectorization  Suppose we aim to de-vectorize an endogenous or exogenous value $P_\iota(x)$ to obtain the symbolic form of $x$. This operation requires knowledge of the vectorization mapping $\iota$, which is provided during the vectorization process. According to Lemma A.19 and $(P_\iota)^{-1}$, each causal variable is given by $x_i = (P_\iota(x))_{l_i : l_i + d_i}$, where $l_i = \sum_{k \in pr_\prec(i)} d_k$. From an implementation perspective, this is equivalent to selecting the slice $l_i : l_i + d_i$ from the vector $P_\iota(x)$.
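The following minimal sketch illustrates this vectorization and de-vectorization scheme, assuming symbolic values are stored as a dictionary of one-dimensional arrays and the causal order is given as a list of keys; all names are illustrative.

```python
import numpy as np

# Minimal sketch of vectorization / de-vectorization (Appendix C.1).
# Offsets l_i are the cumulative dimensions of the causal prefix,
# as in Lemma A.19.

def vectorize(values, order):
    # concatenate the x_i along the causal order: P_iota(x)
    return np.concatenate([values[i] for i in order])

def devectorize(vec, order, dims):
    # slice l_i : l_i + d_i back out for each causal variable
    out, offset = {}, 0
    for i in order:
        out[i] = vec[offset:offset + dims[i]]
        offset += dims[i]
    return out

values = {"x": np.array([1.0]), "y": np.array([2.0, 3.0])}
dims = {"x": 1, "y": 2}
flat = vectorize(values, order=["x", "y"])        # -> [1., 2., 3.]
assert devectorize(flat, ["x", "y"], dims)["y"][0] == 2.0
```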
C.2. Exogenous Distribution

In the main text, we design the use of normalizing flows to model the exogenous distribution. However, in the practical experiments, to demonstrate the insensitivity of our theory to the type of exogenous distribution, we additionally consider other types of exogenous distributions. Recall that, according to the Markovianity assumption, it is required that $P_{U_\theta} = \prod_{i \in I} P_{U_i, \theta}$, so the following discussion focuses on each $P_{U_i, \theta}$.

Standard normal distribution  A simple modeling approach is to use a standard normal distribution, such that
$$\log p_{U_i, \theta}(u_i) = \log \mathcal{N}(u_i \mid 0, I),$$
where $\mathcal{N}$ denotes the density of a Gaussian distribution. However, this modeling approach may not provide sufficient expressive power. For example, in the causal mechanisms of DNME, the diagonal structure results in non-interacting dimensions, and using a standard normal distribution in such cases would enforce independence between these dimensions. This limitation is reflected in the experiments presented in Appendix D.

Normalizing flow  As introduced in the main text, the log-likelihood for each $P_{U_i, \theta}$ is given by
$$\log p_{U_i, \theta}(u_i) = \log p_{Z_i, \theta}(T_{i,\theta}^{-1}(u_i)) + \log \left| \det J_{T_{i,\theta}^{-1}}(u_i) \right|,$$
where $T_{i,\theta} : Z_i \to U_i$ is a masked autoregressive flow (MAF), and $\det J_{T_{i,\theta}^{-1}}(u_i)$ denotes the Jacobian determinant at $u_i$.

Gaussian mixture model  The log-likelihood for each $P_{U_i, \theta}$ is given by
$$\log p_{U_i, \theta}(u_i) = \log \sum_{k=1}^{K} w_{k,i,\theta} \, \mathcal{N}(u_i \mid \mu_{k,i,\theta}, \Sigma_{k,i,\theta}),$$
where $\mathcal{N}$ denotes the density of a Gaussian distribution with mean $\mu_{k,i,\theta}$ and covariance matrix $\Sigma_{k,i,\theta}$, both parameterized by neural networks. Additionally, $\sum_{k=1}^{K} w_{k,i,\theta} = 1$ for any $i \in I$. In the implementation, these parameterized distribution models are provided by the normalizing flow library Zuko (Rozet et al., 2024).

C.3. DNME

We begin with the simplest prototype, DNME, to demonstrate how to construct the model while enabling the derivation of inverses and log absolute Jacobian determinants. Specifically, we leverage the normalizing flow library Zuko, which provides a unified framework and standard for building normalizing flow models. Thus, the constructed neural TM-SCM is effectively transformed into a normalizing flow. Recall that the symbolic form of DNME is given by
$$f_{i,\theta}(v_{pa(i)}, u_i) = b_{i,\theta}(v_{pa(i)}) + a_{i,\theta}(v_{pa(i)}) \odot u_i.$$
To transform a DNME-type neural TM-SCM into a normalizing flow, we interpret the causal mechanism $f_{i,\theta}$ as a discrete transformation within a normalizing flow. This can be represented using a coupling transformation $c$ combined with an affine transformation $a$:
$$f_{i,\theta}(v_{I \setminus \{i\}}, u_i) = c(v_{I \setminus \{i\}}, u_i; a(\cdot\,; m_\theta(v_{pa(i)}))), \qquad c(x, y; f) = (x, f(y)), \qquad a(x; a, b) = b + \exp(a) \odot x,$$
where the affine transformation satisfies the strict positivity requirement on $a_{i,\theta}(v_{pa(i)})$, and the coupling transformation ensures that $f_{i,\theta}$ depends only on $v_{pa(i)}$ and $u_i$, thereby constraining the causal structure. The function $m_\theta$ is a multi-layer perceptron (MLP) that provides the parameters required for the affine transformation. Correspondingly, the inverse of the causal mechanism, $(f_{i,\theta}(v_{I \setminus \{i\}}, \cdot))^{-1}(v_i)$, can be expressed as:
$$(f_{i,\theta}(v_{I \setminus \{i\}}, \cdot))^{-1}(v_i) = c^{-1}(v_{I \setminus \{i\}}, v_i; a(\cdot\,; m_\theta(v_{pa(i)}))), \qquad c^{-1}(x, y; f) = (x, f^{-1}(y)), \qquad a^{-1}(x; a, b) = (x - b) \oslash \exp(a),$$
where $\oslash$ denotes component-wise division.
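As a concrete illustration, the following sketch implements the DNME-style affine step, its inverse, and its log absolute Jacobian determinant. The conditioner `m` stands in for the MLP $m_\theta$ and is a placeholder; this is a sketch of the construction above, not the exact experiment code.

```python
import torch

# Sketch of the DNME mechanism f_i(v_pa, u_i) = b(v_pa) + exp(a(v_pa)) * u_i.
# `m` is a placeholder conditioner mapping v_pa to (a, b) of dimension d_i.

def dnme_forward(v_pa, u_i, m):
    a, b = m(v_pa)                      # parameters of the affine map
    return b + torch.exp(a) * u_i       # exp(a) > 0 ensures monotonicity

def dnme_inverse(v_pa, v_i, m):
    a, b = m(v_pa)
    return (v_i - b) * torch.exp(-a)    # component-wise division

def dnme_log_abs_det(v_pa, m):
    a, _ = m(v_pa)
    return a.sum(dim=-1)                # log|det J| of the affine map

m = lambda v_pa: (v_pa.mean(-1, keepdim=True).expand(2), torch.zeros(2))
u = torch.tensor([0.5, -1.0])
v = dnme_forward(torch.ones(3), u, m)
assert torch.allclose(dnme_inverse(torch.ones(3), v, m), u)
```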
The log absolute Jacobian determinants for the coupling and affine transformations are, respectively:
$$\log |\det J_c(x, y; f)| = \log |\det J_f(y)|, \qquad \log |\det J_a(x; a, b)| = \sum_{j \in 1:d_i} a_j,$$
where $a_j$ is the $j$-th component of $a$. The solution mapping $\Gamma_\theta$ is then composed according to the causal order $\prec$ as $f_{s,\theta} \circ \cdots \circ f_{r,\theta}$, where $|pr_\prec(s)| = 0$ and $|pr_\prec(r)| = |I|$. This forms a discrete normalizing flow, and since each transformation supports forward and inverse evaluation as well as log absolute Jacobian determinant calculation, $\Gamma_\theta$ inherits these properties. Consequently, the log-likelihood of an endogenous value $v^{(i)}$ is computed as:
$$\log p_{V_\theta}(v^{(i)}) = \log p_{U_\theta}(\Gamma_\theta^{-1}(v^{(i)})) + \log \left| \det J_{\Gamma_\theta^{-1}}(v^{(i)}) \right|,$$
supporting a maximum likelihood optimization process. Here, $\log p_{U_\theta}$ is the log-likelihood function of the exogenous distribution, discussed in Appendix C.2.

C.4. TNME

The TNME improves upon the DNME, with its formal representation defined as
$$f_{i,\theta}(v_{pa(i)}, u_i) = b_{i,\theta}(v_{pa(i)}) + A_{i,\theta}(v_{pa(i)}) \, u_i.$$
Specifically, the component-wise affine transformation $a$ is replaced by a more sophisticated lower triangular affine transformation $A$, such that
$$f_{i,\theta}(v_{I \setminus \{i\}}, u_i) = c(v_{I \setminus \{i\}}, u_i; A(\cdot\,; m_\theta(v_{pa(i)}))), \qquad A(x; A, b) = b + \mathrm{tril}(\exp(A + \epsilon)) \, x,$$
where the lower triangular affine transformation takes a matrix $A$ and a vector $b$ as parameters. The $\exp$ function constrains the matrix entries to be positive, ensuring monotonicity, while the $\mathrm{tril}$ function and a small positive scalar $\epsilon$ enforce the matrix to be a full-rank lower triangular matrix, guaranteeing invertibility. The MLP $m_\theta$ provides the parameters required for the lower triangular affine transformation. Similar to DNME, the inverse of the lower triangular affine transformation is given by
$$A^{-1}(x; A, b) = (\mathrm{tril}(\exp(A + \epsilon)))^{-1}(x - b),$$
where $(\mathrm{tril}(\exp(A + \epsilon)))^{-1}$ is the inverse of $\mathrm{tril}(\exp(A + \epsilon))$. During implementation, instead of explicitly computing the matrix inverse, the inverse operation is treated as solving the linear system
$$\mathrm{tril}(\exp(A + \epsilon)) \, A^{-1}(x; A, b) = x - b,$$
where $A^{-1}(x; A, b)$ is treated as the unknown. Due to the properties of lower triangular matrices, the log-determinant of the Jacobian of the lower triangular affine transformation is straightforward to compute:
$$\log |\det J_A(x; A, b)| = \sum_{j \in 1:d_i} (a_{j,j} + \epsilon),$$
where $a_{j,j}$ is the $(j,j)$-th element of the matrix $A$. Subsequently, the construction of the mapping $\Gamma_\theta$ follows the same principles as in DNME.
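The following sketch mirrors the TNME step above: the forward pass applies the lower triangular affine map, while the inverse solves the triangular system rather than forming a matrix inverse. The fixed `A` and `b` stand in for the outputs of $m_\theta$ and are illustrative only.

```python
import torch

# Sketch of the TNME step f(u) = b + tril(exp(A + eps)) @ u and its
# inverse via a triangular solve, as described above.

eps = 1e-3
A = torch.randn(3, 3)
b = torch.randn(3)
L = torch.tril(torch.exp(A + eps))      # positive entries, full rank

def tnme_forward(u):
    return b + L @ u

def tnme_inverse(v):
    # solve L x = (v - b) instead of computing L^{-1} explicitly
    return torch.linalg.solve_triangular(
        L, (v - b).unsqueeze(-1), upper=False
    ).squeeze(-1)

def tnme_log_abs_det():
    # the Jacobian is L itself; its determinant is the diagonal product
    return torch.log(torch.diagonal(L)).sum()

u = torch.randn(3)
assert torch.allclose(tnme_inverse(tnme_forward(u)), u, atol=1e-4)
```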
C.5. CMSM

The CMSM models the re-indexed solution mapping $P_\iota \circ \Gamma_\theta$ as a sequence of discrete transformations, $P_\iota \circ \Gamma_\theta = T_{1,\theta} \circ \cdots \circ T_{n,\theta}$, where $n$ is required to be no less than the diameter of the causal graph, as stated in (Javaloy et al., 2023). Each $T_{j,\theta}$ is modeled as an autoregressive transformation $t$, with an affine transformation $a$ as the univariate transformation, such that $x_j = t(x_{j-1}; a)$, where $x_0 = P_\iota(u)$ represents the vectorized exogenous values, and $x_n = P_\iota(v)$ represents the vectorized endogenous values. The autoregressive transformation $t$ ensures that each component $x_{j,i}$ of $x_j$ is expressed as
$$x_{j,i} = a(x_{j-1,i}; m_{j,\theta}(x_{j-1,1:i-1})),$$
where the MLP $m_{j,\theta}$ provides the parameters required for the affine transformation. The autoregressive transformation $t$ is a triangular transformation, and its inverse is computed in the same manner as for triangular transformations, as detailed in Lemma A.15. Furthermore, its log absolute Jacobian determinant is given by
$$\log |\det J_t(x; f)| = \sum_{i} \log |\det J_f(x_i)|,$$
as the Jacobian matrix of a triangular transformation is lower triangular. Consequently, the construction of the re-indexed solution mapping $P_\iota \circ \Gamma_\theta$ follows the same principles as DNME.

C.6. TVSM

For the ODE
$$\frac{dx(t)}{dt} = v(x(t), t), \qquad x(0) = x_0,$$
we denote the flow as $T(x_0, t) = x(t)$, where $v : \mathbb{R}^d \times [0, 1] \to \mathbb{R}^d$ is a time-dependent velocity field, and $x : [0, 1] \to \mathbb{R}^d$ is the solution to the ODE. A flow $T(\cdot, t)$ pushes forward the distribution at $t = 0$ to that at time $t$, such that $P_{t,x} = (T(\cdot, t))_* P_{0,x}$. According to Lemma A.21, if the velocity field $v$ is a Lipschitz continuous triangular mapping for any $t$, then the flow $T(\cdot, t)$ is always a TMI mapping. In implementation, we adopt a method similar to (Chen et al., 2018) for modeling velocity fields, introducing a mask to ensure that $v$ is a triangular mapping:
$$v(x, t) = \tanh((W \odot M)x + b) \, (U \odot \mathrm{sigmoid}(G) \odot M),$$
where $(W, b, U, G) = m_\theta(t)$; $W, U, G$ are matrices and $b$ is a vector, all parameterized by an MLP with $t$ as input, and $M$ is a lower triangular mask. Unlike the previous approaches that model the re-indexed solution mapping using composite discrete transforms, we model $P_\iota \circ \Gamma_\theta = \mathrm{odeint}(v, 0, 1)$ and its inverse $\Gamma_\theta^{-1} \circ (P_\iota)^{-1} = \mathrm{odeint}(v, 1, 0)$, where odeint represents an IVP solver from the torchdiffeq library (Chen, 2018). For log-likelihood computation, we employ the unbiased log-likelihood estimation method in (Grathwohl et al., 2019):
$$\log p_{V_\theta}(v^{(i)}) = \log p_{U_\theta}\big((\Gamma_\theta^{-1} \circ (P_\iota)^{-1})(v^{(i)})\big) - \mathbb{E}_{\epsilon \sim P_\epsilon}\left[ \int_0^1 \epsilon^\top \frac{\partial v_{s,\theta}}{\partial x}\big((\Gamma_{s,\theta}^{-1} \circ (P_\iota)^{-1})(v^{(i)})\big) \, \epsilon \, ds \right],$$
where $\Gamma_{s,\theta}^{-1} \circ (P_\iota)^{-1}$ corresponds to $\mathrm{odeint}(v, 1, s)$, and $P_\epsilon$ is the standard normal distribution.

C.7. Inference

In the experiments, the trained neural TM-SCM infers counterfactual outcomes through a process referred to as the pseudo potential response (Algorithm 1). Direct intervention on causal mechanisms is challenging for TM-SCMs constructed from solution mappings (e.g., CMSM and TVSM), as they lack explicit causal mechanisms and thus cannot directly fix the values of causal parent variables. The term pseudo potential response arises from the fact that Algorithm 1 performs indirect interventions. A TM-SCM, as a BSCM, possesses a fully supported exogenous variable space. Combined with the well-defined iteration order of a TM-SCM, the algorithm aims to inversely identify an exogenous value that perfectly explains the intervention and the ancestors already determined, rather than directly intervening on endogenous variables. This extends the intervention algorithm for single intervened variables in (Javaloy et al., 2023) to support arbitrary intervened variables and random vectors.

Algorithm 1 Pseudo Potential Response for TM-SCM
  Input: a TM-SCM $M$, exogenous value $u$, intervened value $x$, vectorization $\iota$ under a causal order $\prec$ of $M$.
  Output: potential response $V_{M[x]}(u)$
  $u' \leftarrow u$, $v' \leftarrow \Gamma(u)$, $D \leftarrow \sum_{i \in I} d_i$   {Initialize potential response}
  for $k = 1$ to $D$ do
    $(i, j) \leftarrow \iota^{-1}(k)$
    if $i \in I_x$ then
      $(v')_{i,j} \leftarrow (x)_{i,j}$   {Do-intervention}
    end if
    $(u')_{\iota^{-1}[1:k]} \leftarrow (\Gamma^{-1}(v'))_{\iota^{-1}[1:k]}$   {Find an exogenous value fully explaining the prefix part}
    $(v')_{\iota^{-1}[k:D]} \leftarrow (\Gamma(u'))_{\iota^{-1}[k:D]}$   {Assume the suffix part is not intervened, and update the potential response}
  end for
  return $v'$
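A compact sketch of Algorithm 1 is given below, assuming the solution mapping and its inverse act on vectorized values and that a boolean mask marks the intervened scalar coordinates; the toy two-variable SCM at the end is purely illustrative.

```python
import numpy as np

# Sketch of Algorithm 1 (pseudo potential response): instead of fixing
# parents directly, each step abducts an exogenous value that explains
# the prefix, then re-solves the suffix.

def pseudo_potential_response(gamma, gamma_inv, u, x, mask):
    u2 = u.copy()
    v2 = gamma(u2)
    D = len(v2)
    for k in range(D):
        if mask[k]:
            v2[k] = x[k]                       # do-intervention on coord k
        u2[:k + 1] = gamma_inv(v2)[:k + 1]     # explain the prefix
        v2[k:] = gamma(u2)[k:]                 # update the suffix
    return v2

# Toy triangular SCM in vectorized form: v1 = u1, v2 = v1 + u2.
gamma = lambda u: np.array([u[0], u[0] + u[1]])
gamma_inv = lambda v: np.array([v[0], v[1] - v[0]])
u = np.array([1.0, 2.0])
# Intervene v1 := 0; the counterfactual v2 keeps the abducted u2 = 2.
print(pseudo_potential_response(gamma, gamma_inv, u,
                                x=np.array([0.0, 0.0]),
                                mask=[True, False]))   # -> [0., 2.]
```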
To ensure correctness, we prove that the output of the algorithm is consistent with the potential response $V_{M[x]}(u)$:

Theorem C.1 (Correctness of Pseudo Potential Response). The result returned by the pseudo potential response algorithm, $v'$, equals $V_{M[x]}(u)$.

Proof. For any $t \in 1:D$, where $D = \sum_{i \in I} d_i$, suppose $t \in l_i : l_i + d_i$. By the definition of the potential response, we have
$$(V_{M[x]}(u))_{\iota^{-1}(t)} = (\Gamma_{[x]}(u))_{\iota^{-1}(t)} \overset{\text{Lemma A.2}}{=} (\tilde{f}_{i[x]}(u_{an^\prec(i)}))_{\iota^{-1}(t)} \overset{\text{Definition A.1}}{=} \begin{cases} (f_i((\tilde{f}_{k[x]}(u_{an^\prec(k)}))_{k \in pa(i)}, u_i))_{t - l_i} & i \notin I_x, \\ (x)_{\iota^{-1}(t)} & i \in I_x. \end{cases}$$
Suppose $(v')_{\iota^{-1}[1:t-1]} = (\tilde{f}_{i[x]}(u_{an^\prec(i)}))_{\iota^{-1}[1:t-1]}$, and consider the $t$-th iteration. If $i \in I_x$, then $(v')_{\iota^{-1}(t)} = x_{\iota^{-1}(t)}$, and by the property of $\Gamma$ as a TM mapping, the component at index $t$ remains unchanged by the suffix update:
$$(v')_{\iota^{-1}(t)} = x_{\iota^{-1}(t)} = (\tilde{f}_{i[x]}(u_{an^\prec(i)}))_{\iota^{-1}(t)}.$$
If $i \notin I_x$, then after the suffix update of the $t$-th iteration, $(v')_{\iota^{-1}(t)} = (\Gamma(u'))_{\iota^{-1}(t)}$, where for each $k \in 1:t-1$:
$$(\tilde{f}_j(u'_{an^\prec(j)}))_{\iota^{-1}(k)} = (\tilde{f}_{j[x]}(u_{an^\prec(j)}))_{\iota^{-1}(k)},$$
assuming $k \in l_j : l_j + d_j$. Thus,
$$(v')_{\iota^{-1}(t)} = (f_i((\tilde{f}_k(u'_{an^\prec(k)}))_{k \in pa(i)}, u_i))_{t - l_i} = (f_i((\tilde{f}_{k[x]}(u_{an^\prec(k)}))_{k \in pa(i)}, u_i))_{t - l_i} = (\tilde{f}_{i[x]}(u_{an^\prec(i)}))_{\iota^{-1}(t)}.$$
Hence, after the $t$-th iteration, $(v')_{\iota^{-1}(t)} = (\tilde{f}_{i[x]}(u_{an^\prec(i)}))_{\iota^{-1}(t)}$ holds. Since $(v')_{\iota^{-1}[1:t-1]}$ remains unchanged, $(v')_{\iota^{-1}[1:t]} = (\tilde{f}_{i[x]}(u_{an^\prec(i)}))_{\iota^{-1}[1:t]}$. This holds for any $t \in 1:D$. Thus, at $t = D$, $v' = \tilde{f}_{i[x]}(u_{an^\prec(i)})$ where $an^\prec(i) = I$, implying $v' = V_{M[x]}(u)$.

D. Experiments

D.1. Datasets

TM-SCM-SYM  The dataset collection includes four small synthetic datasets: BARBELL, STAIR, FORK, and BACKDOOR. Each causal variable is represented as a random vector with dimensionality ranging from 1 to 8, with up to four causal variables in total. The exogenous distributions are either mutually independent standard normal distributions or multivariate normal distributions with a Markov structure. The causal mechanisms between variables are defined as manually designed symbolic TM mappings. The specific generation mechanisms for these four synthetic datasets are detailed as follows.

BARBELL  Includes 2 causal variables, $x$ and $y$, each with 8 dimensions. Their causal relationship follows the simplest binary cause-effect structure, $x \to y$, with the causal mechanisms represented as:
$$x_i = (f_x(u_x))_i = \begin{cases} s(1.8\, u_{x,i}) - 1 & i = 1, \\ l\big(\sum_{j=1}^{i-1} x_j,\; u_{x,i}\big) & 1 < i \le 5, \\ 0.3\, u_{x,i} + s\big(\sum_{j=1}^{i-1} x_j + 1\big) - 1 & 5 < i \le 7, \\ \mathrm{CDF}^{-1}\big(s\big(\big(1.3 \sum_{j=1}^{i-2} x_j + \sum_{j=1}^{i-1} x_j\big)/3 + 1\big) + 2,\; 0.6,\; u_{x,i}\big) & i = 8, \end{cases}$$
$$y_i = (f_y(x, u_y))_i = \begin{cases} l(s(1.8\, u_{y,i}) - 1,\; x_i) & i = 1, \\ l\big(l\big(\sum_{j=1}^{i-1} y_j,\; u_{y,i}\big),\; x_i\big) & 1 < i \le 5, \\ \mathrm{CDF}^{-1}\big(s\big(\big(1.3 \sum_{j=1}^{i-1} y_j + l\big(\sum_{j=1}^{i-1} y_j,\; x_i\big)\big)/3 + 1\big) + 2,\; 0.6,\; u_{y,i}\big) & 5 < i \le 7, \\ 0.3\, u_{y,i} - 0.5 \sum_{j=1}^{i-1} y_j + s\big(\sum_{j=1}^{i-2} y_j + 1\big) - 1 & i = 8. \end{cases}$$
The function $s$ represents the softplus function, $s(x) = \log(1 + \exp(x))$. The binary function $l$ is defined as $l(x, y) = s(x + 1) + s(0.5 + y) - 3$. The function $\mathrm{CDF}^{-1}(\mu, b, x)$ denotes the quantile function of the Laplace distribution at $x$, with location $\mu$ and scale $b$. The exogenous distribution follows a standard normal distribution.

STAIR  Consider 3 causal variables $x$, $y$, and $z$ with dimensions 1, 2, and 3, respectively. Their causal structure follows a chain: $x \to y \to z$. All causal mechanisms are linear and are represented as:
$$x_i = (f_x(u_x))_i = 0.75\, u_{x,i},$$
$$y_i = (f_y(x, u_y))_i = \begin{cases} 10\, x_i + u_{y,i} & i = 1, \\ 0.25\, x_{i-1} - 0.5\, u_{y,i} + 2 & i = 2, \end{cases}$$
$$z_i = (f_z(y, u_z))_i = \begin{cases} 0.5\, y_i + u_{z,i} - 4 & i = 1, \\ 5\, y_i - 1.5\, u_{z,i} & i = 2, \\ y_{i-1} + 2\, u_{z,i} - 0.5 & i = 3, \end{cases}$$
and the exogenous distribution follows a standard normal distribution.

FORK  Consider 4 causal variables $x$, $y$, $z$, and $w$, each of dimension 2. The causal structure among them includes a collider structure $x \to z \leftarrow y$ as well as a direct causal relationship $z \to w$. All causal mechanisms are additive and can be expressed as:
$$x_i = (f_x(u_x))_i = \begin{cases} u_{x,i} & i = 1, \\ u_{x,i-1} + 0.25\, u_{x,i} & i = 2, \end{cases}$$
$$y_i = (f_y(u_y))_i = \begin{cases} u_{y,i} & i = 1, \\ 0.5\, u_{y,i-1} + u_{y,i} & i = 2, \end{cases}$$
$$z_i = (f_z(x, y, u_z))_i = \begin{cases} \dfrac{1}{1 + \exp(-x_1 - y_2)} - y_2 - 1 + 0.5\, u_{z,i} & i = 1, \\[1ex] \dfrac{1}{1 + \exp(-y_1 - x_2)} - x_2 - 1 + 0.5\,(u_{z,i-1} + u_{z,i}) & i = 2, \end{cases}$$
$$w_i = (f_w(z, u_w))_i = \begin{cases} \dfrac{20}{1 + \exp(0.5\, z_1^2 - z_2)} + u_{w,i} & i = 1, \\[1ex] \dfrac{20}{1 + \exp(0.5\, z_2^2 - z_1)} + 0.25\, u_{w,i-1} - u_{w,i} & i = 2, \end{cases}$$
and the exogenous distribution also follows a standard normal distribution.

BACKDOOR  The causal graph involves 4 causal variables, $x$, $y$, $z$, and $w$, each of dimension 4. The causal structure includes a direct causal link $y \to w$, an indirect causal path $y \to z \to w$, and a backdoor path $y \leftarrow x \to w$. The causal mechanisms are represented as:
$$x_i = (f_x(u_x))_i = \begin{cases} \tanh(u_{x,i}) & i = 1, \\ \mathrm{sigmoid}(-u_{x,i}) & i = 2, \\ x_{i-1} - \tanh(u_{x,i}) & i = 3, \\ 1 + \mathrm{sigmoid}(u_{x,i}) & i = 4, \end{cases}$$
$$y_i = (f_y(x, u_y))_i = \begin{cases} \tanh(x_1 + u_{y,i}) & i = 1, \\ \mathrm{sigmoid}(x_2 - u_{y,i}) & i = 2, \\ y_{i-1} - \tanh(u_{y,i}) & i = 3, \\ \langle x_{1:i-1},\, \mathrm{softmax}(y_{1:i-1}) \rangle + \mathrm{sigmoid}(u_{y,i}) & i = 4, \end{cases}$$
$$z_i = (f_z(y, u_z))_i = \begin{cases} \mathrm{elu}(y_1 - u_{z,i}) & i = 1, \\ \mathrm{leaky\_relu}(y_2 + u_{z,i}) & i = 2, \\ z_{i-1} + \mathrm{elu}(u_{z,i}) & i = 3, \\ \langle y_{1:i-1},\, \mathrm{softmin}(z_{1:i-1}) \rangle + \mathrm{leaky\_relu}(u_{z,i}) & i = 4, \end{cases}$$
$$w_i = (f_w(x, y, z, u_w))_i = \begin{cases} \langle (x_i, y_i, z_i),\, \mathrm{softmax}((x_i, y_i, z_i)) \rangle + \mathrm{elu}(z_i - u_{w,i}) & i = 1, \\ \langle (w_{i-1}, u_{w,i-1}),\, \mathrm{softmin}((w_{i-1}, u_{w,i-1})) \rangle + \mathrm{leaky\_relu}(y_i + u_{w,i}) & i = 2, \\ \langle (x_{1:i}, y_{1:i}, z_{1:i}),\, \mathrm{softmin}((x_{1:i}, y_{1:i}, z_{1:i})) \rangle + x_{i-1} + \mathrm{elu}(u_{w,i}) & i = 3, \\ \langle (w_{1:i-1}, u_{w,1:i-1}),\, \mathrm{softmax}((w_{1:i-1}, u_{w,1:i-1})) \rangle + \mathrm{leaky\_relu}(u_{w,i}) & i = 4, \end{cases}$$
where $\tanh$, $\mathrm{sigmoid}$, $\mathrm{elu}$, and $\mathrm{leaky\_relu}$ are widely used activation functions, ensuring monotonicity and continuity. The exogenous distribution follows a multivariate normal distribution, with location $0$ and covariance matrix:
$$\begin{pmatrix} 1 & 0.8 & 0.3 & 0.2 \\ 0.8 & 1 & 0.4 & 0.1 \\ 0.3 & 0.4 & 1 & 0.6 \\ 0.2 & 0.1 & 0.6 & 1 \end{pmatrix}.$$

ER-DIAG-50 and ER-TRIL-50  Each dataset collection consists of 50 randomly synthesized datasets, where each dataset contains 3-8 causal variables, and the dimensionality of each causal variable is randomly chosen between 1 and 8. The causal structure among the variables is modeled as an Erdős–Rényi graph, the exogenous distribution is randomly generated as a multivariate normal distribution, and the causal mechanisms are constructed using randomly generated TM mappings. The details for generating these components are described as follows; a sketch of the graph and covariance generation appears after the next paragraph.

Variables  The number of causal variables, $n$, is sampled from a discrete uniform distribution $\mathrm{Unif}\{3, 8\}$, and these variables are denoted as $x_1, \ldots, x_n$.

Dimensions  The dimension $d_i$ of a causal variable $x_i$ is sampled from a discrete uniform distribution $\mathrm{Unif}\{1, 8\}$, with the $j$-th component denoted as $x_{i,j}$.

Graphs  An Erdős–Rényi graph is generated over the $n$ causal variables with a probability of 0.5 for each edge, ensuring the graph is directed and acyclic. Specifically, we construct an $n \times n$ adjacency matrix $A$, where each element $a_{i,j}$ is sampled from a continuous uniform distribution $\mathrm{Unif}(0, 1)$. If $a_{i,j} > 0.5$ and $i < j$, the edge $x_i \to x_j$ is included.

Exogenous distributions  For each causal variable, a multivariate normal distribution of matching dimension is randomly generated to represent its corresponding exogenous distribution. By Markovianity, exogenous samples are formed by independently sampling from these distributions. Specifically, the parameters of the multivariate normal distribution are generated by sampling a mean vector $\mu$ and a matrix $C$ from the standard normal distribution. The covariance matrix is computed as $\Sigma = C C^\top + I$, ensuring positive definiteness, where $I$ is the identity matrix.
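The following sketch shows one way to realize the graph and covariance generation just described, using numpy; the seed and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the random synthesis for ER-DIAG-50 / ER-TRIL-50: an
# Erdos-Renyi DAG (edges only from lower to higher index, guaranteeing
# acyclicity) and a positive-definite covariance Sigma = C C^T + I.

n = int(rng.integers(3, 9))                 # number of causal variables
A = rng.uniform(0, 1, size=(n, n))
adjacency = np.triu(A > 0.5, k=1)           # edge i -> j iff a_ij > 0.5, i < j

d = int(rng.integers(1, 9))                 # dimension of one variable
mu = rng.standard_normal(d)
C = rng.standard_normal((d, d))
Sigma = C @ C.T + np.eye(d)                 # positive definite by construction
u_sample = rng.multivariate_normal(mu, Sigma)
```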
Causal mechanisms  The randomly generated causal mechanisms are ensured to be TM mappings. For ER-DIAG-50, the causal mechanism is represented as
$$(f_i(x_{pa(i)}, u_i))_j = h_{i,j}(g_i(x_{pa(i)}), u_{i,j}), \quad \text{for } j \in 1:d_i.$$
For ER-TRIL-50, it is expressed as
$$(f_i(x_{pa(i)}, u_i))_j = h_{i,j}((f_i(x_{pa(i)}, u_i))_{j-1}, u_{i,j}),$$
where for $j = 1$, $(f_i(x_{pa(i)}, u_i))_{j-1} = g_i(x_{pa(i)})$. The function $g_i$ is a randomly generated continuous function $\mathbb{R}^{\sum_{k \in pa(i)} d_k} \to \mathbb{R}$. To enhance its complexity, it is expressed as a combination of sinusoidal and segmented terms:
$$g_i(x) = \underbrace{\sum_{k=1}^{n} c_k \sin(\langle w_k, x \rangle + \phi_k)}_{\text{sinusoidal term}} + \underbrace{s_{j^*(x)} \, \lVert x - \mu_{j^*(x)} \rVert}_{\text{segmented term}},$$
where $j^*(x) = \arg\min_{1 \le j \le m} \lVert x - \mu_j \rVert$, and $c_k, w_k, \phi_k, s_j, \mu_j$ are random parameters. The number of sinusoidal and segmented terms is $n = m = 3$. The function $h_{i,j}$ is a randomly generated monotonic function $\mathbb{R}^2 \to \mathbb{R}$, ensuring monotonicity with respect to the second dimension. It is represented in a linear form:
$$h_{i,j}(x_1, x_2) = s_1 x_1 + s_2 x_2 + b,$$
where the bias term $b$ and scaling coefficients $s_1, s_2$ are random parameters. The signs of $s_1, s_2$ are not constrained to be positive, indicating that the resulting mapping is not necessarily a TMI mapping but rather a more general TM mapping.

D.2. Metrics

We employed three metrics to evaluate the performance of the trained models; a sketch of how the first two can be computed follows this subsection.

OBSWD  This metric measures the Wasserstein distance between the observational distribution derived from the model and the ground truth, evaluating how well the trained model fits the observational distribution. The Wasserstein distance is computed using the geomloss library (Feydy et al., 2019), which calculates the unbiased Sinkhorn divergence with a blur parameter of 0.05 and a cost function of $C(x, y) = \frac{1}{2} \lVert x - y \rVert^2$.

CTFRMSE  This metric measures the error between the counterfactual results inferred by the model and the ground truth, assessing the accuracy of the trained model in counterfactual reasoning and reflecting the model's consistency in answering counterfactual queries. We use the root mean square error (RMSE), which retains the same unit as the original values, making it more interpretable for evaluating the accuracy of counterfactual reasoning.

CTFWD  This is a metric unreported in the main text but included as a supplementary evaluation in the appendix. It measures the Wasserstein distance between the interventional distribution and the ground truth in an average sense. Since the observed values in the ground truth of counterfactual datasets are sampled from the observational distribution, these samples, when considering only the counterfactual outcomes, correspond to the distribution expressed as
$$\int_{\Omega_V} P_{V[x] \mid V}(\cdot, v) \, P_V(dv) = \mathbb{E}_{x \sim P_X}\left[ P_{V[x]} \right],$$
where $P_X$ denotes the prior distribution of intervention values provided by the counterfactual dataset, and $P_{V[x]}$ represents the interventional distribution under intervention $x$.
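The first two metrics can be computed as in the following sketch, which assumes model and ground-truth samples are given as (N, d) tensors; the geomloss call mirrors the Sinkhorn setting described above, while the exact experiment code may differ.

```python
import torch
from geomloss import SamplesLoss

# Sketch of OBSWD (debiased Sinkhorn divergence, blur 0.05) and
# CTFRMSE (plain RMSE between counterfactual predictions and targets).

obswd = SamplesLoss(loss="sinkhorn", p=2, blur=0.05)

def ctf_rmse(v_pred, v_true):
    return torch.sqrt(((v_pred - v_true) ** 2).mean())

v_model = torch.randn(1000, 4)          # samples from the trained model
v_data = torch.randn(1000, 4) + 0.1     # ground-truth samples
print(obswd(v_model, v_data).item(), ctf_rmse(v_model, v_data).item())
```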
D.3. Execution

This subsection details the preprocessing, training, and testing procedures of the experiments to ensure reproducibility.

Preprocessing  During the initial execution, following the settings described in Appendix D.1, each synthetic dataset is divided into three splits: training, validation, and test datasets. The training dataset comprises observational data with a sample size of 20,000, directly sampled from the exogenous distribution, with exogenous samples propagated through the synthesized TM-SCM to derive endogenous observational values. The validation and test datasets, each containing 2,000 samples, consist of counterfactual data, providing observations, interventions, and counterfactual outcomes. These counterfactual datasets are synthesized in three steps: (i) similar to the observational dataset, exogenous values and their corresponding endogenous observations are sampled; (ii) intervened values are sampled anew, where the intervened variables $\mathbf{X}$ satisfy certain conditions termed interventionally meaningful, and the intervened values adhere to the observational distribution; (iii) the counterfactual outcomes are derived using the exogenous values from (i) and the intervened values from (ii) to compute the potential responses.

The notion of interventionally meaningful refers to the following three conditions on an intervention set $\mathbf{X}$: (i) any $X \in \mathbf{X}$ must have at least one child; (ii) if $X, Y \in \mathbf{X}$ and $X$ is a parent of $Y$, then $X$ and $Y$ must have at least one common child; and (iii) $\mathbf{X}$ is non-empty. In implementation, indices $I_X$ are sampled to satisfy these conditions, and a mask is used to indicate whether a variable has been intervened upon; a sketch of the corresponding check appears below.
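A direct check of these three conditions might look as follows, assuming the causal graph is given as child and parent sets; the graph and names are illustrative.

```python
# Sketch of checking the "interventionally meaningful" conditions for a
# candidate intervention set X.

def interventionally_meaningful(X, children, parents):
    if not X:                                        # (iii) X is non-empty
        return False
    for i in X:
        if not children[i]:                          # (i) every X has a child
            return False
    for i in X:
        for j in X:
            if i in parents[j]:                      # (ii) parent-child pairs in X
                if not (children[i] & children[j]):  #      need a common child
                    return False
    return True

# Chain graph x -> y -> z: intervening on {x, y} fails condition (ii),
# since x and y share no common child.
children = {"x": {"y"}, "y": {"z"}, "z": set()}
parents = {"x": set(), "y": {"x"}, "z": {"y"}}
print(interventionally_meaningful({"x", "y"}, children, parents))  # False
print(interventionally_meaningful({"y"}, children, parents))       # True
```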
Finally, all dimensions in the datasets are standardized. To ensure consistency, the mean and variance from the observational dataset are used for standardization, including for the intervened values and counterfactual outcomes in the counterfactual datasets. Since standardization merely involves component-wise scaling and shifting, the triangular monotonicity property of the TM-SCM is preserved.

Training  All neural network components in the models, as reported in Appendix C, are configured as MLPs with 2 hidden layers and a width of 128. The outputs of these MLPs include an additional dimension representing the encoding length, and the parameters subsequently derived are averaged over this encoding. This design enhances the expressiveness of the MLPs while maintaining fairness in model parameters. Specifically, before each run, we perform a binary search to find an appropriate encoding length such that the total number of model parameters does not exceed 1M. All models are constructed according to this standard, ensuring that the parameter counts are comparable across models. For experiments on TM-SCM-SYM, we train for 100 epochs with a batch size of 64. Validation is performed every $k$ training steps, where $k$ grows exponentially such that the interval after the $t$-th validation is $k = \lceil \gamma^t \rceil$. We set $\gamma = 1.25$. The validation results are used to plot the scatter relationship between OBSWD and CTFRMSE. All experiments under this configuration are conducted with 10 different random seeds. For experiments on ER-DIAG-50 and ER-TRIL-50, we train for 50 epochs with a batch size of 64, performing validation after every epoch. All experiments are run once on their respective 50 pre-generated random datasets. For all the experiments mentioned above, the optimizer used is Adam with a learning rate of 0.001 and a weight decay of 0.01 as a regularization term. The model weights corresponding to the epoch with the lowest CTFRMSE on the validation set are saved for testing.

Testing  The test set is used to evaluate the model performance reported in the tables. Specifically, for OBSWD, we utilize the part of the test set labeled as observed values. According to the description in the preprocessing section, these samples satisfy the ground truth observational distribution. For CTFRMSE and CTFWD, we use the parts of the test set labeled as observed values, intervened values, and counterfactual outcomes. The observed and intervened values are input into Algorithm 1 to estimate the counterfactual outcomes, and the ground truth counterfactual outcomes are used to compute these counterfactual metrics. During validation and testing, in cases where certain model configurations exhibit instability at specific positions, leading to numerical overflow, the corresponding test cases are masked and ignored.

D.4. Full Results

Ablation on TM-SCM-SYM  We report below the complete ablation study results on the TM-SCM-SYM dataset as discussed in the main text. First, we present the full version of the scatter plots shown in the main text, as illustrated in Figure 4, which are obtained from the validation results of experiments conducted under 10 different random seeds.

[Figure 4: a 4x4 grid of scatter plots of CTFRMSE (vertical axis) against OBSWD (horizontal axis), with separate curves for TM-SCM, w/o O, w/o M, and w/o T.]

Figure 4. Ablation results of neural TM-SCMs on TM-SCM-SYM. Colored curves depict sliding-window regressions, with shaded areas showing 95% CI. To improve the readability of the plot scales, outliers below the 0.01 quantile and above the 0.99 quantile were removed. Rows represent models (DNME, TNME, CMSM, TVSM), and columns represent datasets (BARBELL, STAIR, FORK, BACKDOOR).

Across all combinations of models and datasets, the non-ablated neural TM-SCM (blue) consistently achieves lower CTFRMSE, demonstrating superior performance in counterfactual consistency. Specifically, as training progresses, the horizontal axis OBSWD gradually decreases, indicating an improved fit of the model to the observed distribution. It can be observed that, except for certain settings where the model lacks sufficient expressiveness, OBSWD is reduced to a similar level across most configurations. However, the final performance on the vertical axis CTFRMSE varies across different ablation modes, highlighting the negative impact of the ablated components on counterfactual consistency.

Table 3. Ablation results of neural TM-SCM on BARBELL and STAIR. The values shown are the means of 10 experiments with different random seeds, with ± denoting the 95% CI.

                        BARBELL                                   STAIR
METHOD          OBSWD        CTFRMSE      CTFWD          OBSWD        CTFRMSE      CTFWD
DNME   -        0.42±0.03    0.18±0.00    0.20±0.01      0.18±0.03    0.10±0.02    0.03±0.01
       w/o O    4.44±2.74    0.94±0.00    3.89±0.00      31.54±52.74  0.80±0.00    0.61±0.00
       w/o M    5.38±0.12    0.50±0.02    1.53±0.10      0.17±0.01    0.47±0.03    0.48±0.04
TNME   -        0.39±0.03    0.07±0.00    0.04±0.00      0.16±0.01    0.16±0.01    0.07±0.01
       w/o O    18.26±1.02   0.94±0.00    3.89±0.00      32.65±52.99  0.80±0.00    0.61±0.00
       w/o M    7.50±6.76    0.52±0.05    1.55±0.23      0.18±0.01    0.39±0.02    0.35±0.04
CMSM   -        0.41±0.02    0.05±0.01    0.02±0.01      0.17±0.01    0.24±0.23    0.30±0.53
       w/o O    0.39±0.04    0.94±0.00    3.89±0.00      15.05±38.19  0.83±0.02    0.60±0.03
       w/o M    1.01±0.46    0.07±0.01    0.04±0.01      0.17±0.01    0.13±0.04    0.05±0.03
       w/o T    1.82±2.12    0.98±1.22    25.38±55.39    0.19±0.03    14.67±32.28  0.51±0.70
TVSM   -        0.37±0.02    0.10±0.00    0.03±0.01      0.17±0.02    0.16±0.01    0.08±0.01
       w/o O    0.48±0.15    0.94±0.00    3.89±0.00      32.15±53.73  0.80±0.00    0.61±0.00
       w/o M    1.15±0.21    0.29±0.01    0.48±0.03      0.17±0.01    0.28±0.01    0.21±0.02
       w/o T    1.33±0.21    0.51±0.01    1.59±0.07      20.56±46.17  0.43±0.02    0.44±0.05
Table 4. Ablation results of neural TM-SCM on FORK and BACKDOOR. The values shown are the means of 10 experiments with different random seeds, with ± denoting the 95% CI.

                        FORK                                      BACKDOOR
METHOD          OBSWD        CTFRMSE      CTFWD          OBSWD        CTFRMSE      CTFWD
DNME   -        0.28±0.01    0.14±0.00    0.08±0.00      2.83±0.03    0.27±0.00    0.59±0.02
       w/o O    2.15±0.10    0.74±0.00    0.57±0.00      4.79±0.26    0.46±0.00    1.45±0.00
       w/o M    1.61±2.86    0.62±0.01    0.47±0.01      4.36±0.12    0.34±0.01    0.88±0.05
TNME   -        0.27±0.01    0.14±0.00    0.07±0.00      2.47±0.04    0.16±0.00    0.20±0.01
       w/o O    3.88±0.24    0.74±0.00    0.57±0.00      9.20±0.24    0.46±0.00    1.45±0.00
       w/o M    15.31±32.54  0.58±0.02    0.48±0.01      3.20±0.04    0.28±0.01    0.63±0.03
CMSM   -        0.30±0.02    0.08±0.01    0.02±0.00      2.53±0.09    0.15±0.00    0.19±0.01
       w/o O    1.56±2.43    0.80±0.05    0.71±0.26      6.35±3.18    0.47±0.00    1.50±0.01
       w/o M    0.30±0.02    0.08±0.00    0.02±0.00      2.68±0.09    0.17±0.02    0.23±0.05
       w/o T    0.37±0.06    0.63±0.24    1.02±0.56      7.91±11.03   0.37±0.02    1.04±0.11
TVSM   -        0.29±0.01    0.16±0.00    0.08±0.00      2.29±0.02    0.12±0.01    0.11±0.01
       w/o O    2.10±0.08    0.74±0.00    0.57±0.00      5.79±0.05    0.46±0.00    1.45±0.00
       w/o M    0.32±0.03    0.48±0.01    0.31±0.01      2.53±0.07    0.21±0.01    0.36±0.03
       w/o T    0.46±0.03    0.61±0.02    0.43±0.02      2.64±0.44    0.34±0.01    0.87±0.05

In these ablation modes, w/o O (yellow) represents the case where the causal order is reversed. Although OBSWD decreases in almost all settings, CTFRMSE remains nearly unchanged. This highlights the significant role of the causal order assumption $A_\prec$ in ensuring counterfactual consistency. w/o M (green) represents the case where the exogenous distribution is non-Markovian. In most settings, as OBSWD decreases, CTFRMSE instead increases. This indicates that counterfactual consistency cannot converge to a lower level as expected under non-Markovian conditions, underscoring the necessity of the Markovianity assumption $A_M$ for counterfactual consistency. w/o T (red) represents the case where the solution mapping of the constructed SCM is not triangular (allowed only for CMSM and TVSM). In these settings, as OBSWD decreases, CTFRMSE diverges. This demonstrates the critical impact of the assumption that the SCMs are TM-SCMs, $A_{\text{TM-SCM}}$, on counterfactual consistency. Only in the no-ablation mode (blue), where all three assumptions in Corollary 5.4 ($A_{\text{TM-SCM}}$, $A_M$, and $A_\prec$) are satisfied, does counterfactual consistency improve (as evidenced by decreasing CTFRMSE) as $A_{P_V}$ is gradually satisfied (indicated by decreasing OBSWD).
Furthermore, we can analyze the neural TM-SCM methods under different constructions (without ablations) row by row. It can be observed that DNME and TNME exhibit similar performance but perform slightly worse on the BACKDOOR dataset. This may be attributed to their limited expressive capacity for the observed distribution, as reflected by the OBSWD, which did not decrease as expected to the same level as for other models. CMSM appears to be less stable during training, as evidenced by the widely dispersed scatter points across all settings. This could imply that the parameters of CMSM undergo significant changes during training, resulting in unstable performance. In contrast, TVSM demonstrates more concentrated scatter points, indicating higher stability. Moreover, it achieves a lower OBSWD on the BACKDOOR dataset compared to DNME and TNME, suggesting stronger expressive capacity. However, due to its reliance on IVP solvers, its inference speed is relatively slow.

We report the final performance of the above experiments on the test set in Table 3 and Table 4. It can be observed that, apart from certain anomalies exhibited by CMSM under the non-Markovian ablation on the STAIR dataset (as also reflected in Figure 4), the non-ablation settings consistently achieve lower CTFRMSE. This further highlights the critical role of the three assumptions in Corollary 5.4, namely $A_{\text{TM-SCM}}$, $A_M$, and $A_\prec$, in ensuring counterfactual consistency.

Finally, we conducted an ablation study on exogenous distributions by constructing new models using the different exogenous distributions described in Appendix C.2. This analysis demonstrates that Corollary 5.4 and the proposed neural TM-SCM are insensitive to the type of exogenous distribution. The results, shown in Table 5, indicate that while different models may exhibit some variation in the fit of the observational distribution (i.e., OBSWD), their counterfactual consistency (i.e., CTFRMSE) remains largely similar. Furthermore, as asserted in Appendix C.2, using a standard normal distribution as the exogenous distribution may not provide sufficient expressive power, as evidenced by the higher OBSWD values of the DNME and TNME models in Table 5.

Table 5. Ablation results of neural TM-SCM on TM-SCM-SYM for exogenous distributions. The values shown are the means of 10 experiments with different random seeds, with ± denoting the 95% CI.

                     BARBELL                   STAIR                     FORK                      BACKDOOR
METHOD  DIST    OBSWD       CTFRMSE      OBSWD       CTFRMSE      OBSWD       CTFRMSE      OBSWD        CTFRMSE
DNME    N       6.20±0.06   0.19±0.00    0.17±0.00   0.11±0.00    1.14±0.03   0.29±0.00    4.57±0.02    0.20±0.00
        GMM     1.94±0.43   0.19±0.01    0.16±0.01   0.12±0.01    0.37±0.04   0.17±0.01    3.41±0.23    0.22±0.01
        NF      0.42±0.03   0.18±0.00    0.18±0.03   0.10±0.02    0.28±0.01   0.14±0.00    2.83±0.03    0.27±0.00
TNME    N       5.88±0.07   0.10±0.00    0.17±0.00   0.10±0.00    1.12±0.04   0.26±0.00    3.12±0.02    0.15±0.00
        GMM     3.65±3.46   0.09±0.01    0.17±0.02   0.14±0.01    0.34±0.03   0.16±0.01    2.82±0.22    0.15±0.00
        NF      0.39±0.03   0.07±0.00    0.16±0.01   0.16±0.01    0.27±0.01   0.14±0.00    2.47±0.04    0.16±0.00
CMSM    N       2.96±2.68   0.06±0.01    0.17±0.02   0.12±0.02    0.32±0.02   0.07±0.01    2.86±0.41    0.13±0.00
        GMM     1.99±2.13   0.06±0.00    0.19±0.02   0.11±0.02    0.30±0.02   0.08±0.00    15.93±27.12  0.13±0.00
        NF      0.41±0.02   0.05±0.01    0.17±0.01   0.24±0.23    0.30±0.02   0.08±0.01    2.53±0.09    0.15±0.00
TVSM    N       0.34±0.01   0.09±0.00    0.16±0.00   0.14±0.00    0.31±0.01   0.17±0.00    2.29±0.01    0.10±0.00
        GMM     0.53±0.10   0.09±0.00    0.16±0.01   0.15±0.01    0.29±0.01   0.15±0.00    2.27±0.01    0.10±0.01
        NF      0.37±0.02   0.10±0.00    0.17±0.02   0.16±0.01    0.29±0.01   0.16±0.00    2.29±0.02    0.12±0.01

Ablation on ER-DIAG-50 and ER-TRIL-50  We report below the complete results of the ablation study on the ER-DIAG-50 and ER-TRIL-50 datasets as presented in the main text.
The ER-DIAG-50 and ER-TRIL-50 datasets encompass diverse TM-SCM configurations, including various scales, structures, and parameter settings. Compared to the four symbolically constructed datasets in TM-SCM-SYM, they provide more comprehensive validation and testing for our framework. The complete version of the ablation results table from the main text, as shown in Table 6, presents the final results on the test set. The table additionally includes the metrics OBSWD and CTFWD, which measure the average fit to the observational distribution and the prediction of the interventional distribution, respectively.

Table 6. Ablation results of neural TM-SCM on ER-DIAG-50 and ER-TRIL-50. The values shown are the means of 50 experiments, with ± denoting the 95% CI.

                        ER-DIAG-50                                ER-TRIL-50
METHOD          OBSWD        CTFRMSE      CTFWD          OBSWD         CTFRMSE      CTFWD
DNME   -        3.47±0.53    0.53±0.05    1.50±0.28      21.28±22.09   0.51±0.12    3.37±3.05
       w/o O    5.06±1.75    0.78±0.05    2.88±0.43      25.42±21.27   0.89±0.10    4.79±3.03
       w/o M    4.35±0.77    0.62±0.04    1.98±0.30      18.53±24.04   0.58±0.10    3.60±3.04
TNME   -        4.65±2.93    0.47±0.05    1.32±0.29      3.34±3.17     0.55±0.12    3.49±3.05
       w/o O    5.40±2.55    11.24±20.98  3.47±1.08      23.75±29.53   6.41±9.84    37.11±45.91
       w/o M    3.88±0.65    0.62±0.04    2.08±0.33      5.57±3.27     0.73±0.21    13.60±19.84
CMSM   -        15.70±23.20  0.37±0.05    0.96±0.23      9.78±9.87     0.42±0.12    3.17±3.07
       w/o O    41.74±45.13  2.64±3.72    2.91±0.44      36.65±30.28   2.12±2.49    4.69±3.08
       w/o M    8.00±9.88    1.69±2.60    1.20±0.35      10.39±10.56   0.75±0.49    7.79±8.34
       w/o T    3.74±1.09    0.64±0.05    2.47±0.51      23.36±36.32   1.25±1.29    2.89±1.33
TVSM   -        3.66±1.05    0.46±0.05    1.29±0.28      3.40±2.01     0.50±0.12    3.35±2.99
       w/o O    5.25±2.31    0.79±0.04    2.89±0.43      11.79±10.36   0.88±0.10    4.68±3.03
       w/o M    3.54±0.53    0.53±0.05    1.63±0.29      4.25±4.82     0.53±0.11    3.33±3.00
       w/o T    22.22±36.74  0.67±0.05    2.43±0.49      14.30±15.34   0.78±0.12    6.63±3.87

The results demonstrate that models without ablations achieve the best performance on the final counterfactual consistency metric, CTFRMSE. This highlights the universality of the three assumptions in Corollary 5.4, namely $A_{\text{TM-SCM}}$, $A_M$, and $A_\prec$, in ensuring counterfactual consistency. Furthermore, these findings empirically reinforce the validity of Corollary 5.4. Beyond providing empirical support for counterfactual consistency, these results also reveal several characteristics of the four different models. For instance, DNME performs significantly better on ER-DIAG-50 compared to ER-TRIL-50, as DNME is specifically designed for causal mechanisms where exogenous variables interact in a diagonal manner. The performance drop on ER-TRIL-50 indicates that DNME lacks sufficient expressive power in more general settings. TNME addresses this limitation of DNME through its design. CMSM may exhibit instability due to its potentially poor fit to the observed distribution. Furthermore, as only numerically stable test results are recorded in the experiments, the many omitted cases in CMSM's results may contribute to its lower CTFWD values. In contrast, TVSM demonstrates higher accuracy in counterfactual inference compared to DNME and TNME, while also achieving a better fit to the observational distribution than CMSM.