# Disentangled Representation Learning in Non-Markovian Causal Systems

Adam Li*, Yushu Pan*, and Elias Bareinboim
Causal Artificial Intelligence Lab, Columbia University
{adam.li, yushupan, eb}@cs.columbia.edu

Across various data modalities, such as images, videos, and text, humans perform causal reasoning using high-level causal variables, as opposed to operating at the low, pixel level from which the data comes. In practice, most causal reasoning methods assume that the data is described at the same level of granularity as the underlying causal generative factors, which is violated in many AI tasks. This mismatch translates into a lack of guarantees in tasks such as generative modeling, decision-making, fairness, and generalizability, to cite a few. In this paper, we acknowledge this issue and study the problem of causal disentangled representation learning from a combination of data gathered from various heterogeneous domains and assumptions in the form of a latent causal graph. To the best of our knowledge, the proposed work is the first to consider (i) non-Markovian causal settings, where there may be unobserved confounding, (ii) arbitrary distributions that arise from multiple domains, and (iii) a relaxed version of disentanglement. Specifically, we introduce graphical criteria that allow for disentanglement under various conditions. Building on these results, we develop an algorithm that returns a causal disentanglement map, highlighting which latent variables can be disentangled given the combination of data and assumptions. The theory is corroborated by experiments.

1 Introduction

Causality is fundamental throughout various aspects of human cognition, including understanding, planning, and decision-making. The ability to perform causal reasoning is considered one of the hallmarks of human intelligence [1-3]. In the context of AI, the capability of reasoning with cause-and-effect relationships plays a critical role in challenges of explainability, fairness, decision-making, robustness, and generalizability. One key assumption of most methods currently available in the causal literature is that the set of (endogenous) variables is at the right level of granularity. However, this is not the case in many AI applications, where various modalities, such as images and text, come into play [4]. For example, images of a park scene capture objects as causal variables, not the pixels themselves. AI must disentangle these latent causal variables to represent the true relationships in the image. Faithfully representing this latent structure impacts downstream AI tasks like image generation and few-shot learning.

In machine learning, the representation learning literature is concerned with finding useful representations from data [5]. One important line of work traces back to linear ICA (independent component analysis) [6], where one attempts to disentangle latent variables assuming a linear mixing function. The literature has also considered settings where the mixing function is nonlinear [7, 8]. It has been understood that nonlinear ICA is, in general, not identifiable (ID) given only observational data [9].

*These authors contributed equally to this work, and the author names are listed in alphabetical order.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).
Table 1: A non-exhaustive list of identifiability results given knowledge of the latent graph, organized by assumptions (non-Markovian, non-parametric), data (interventions, multiple domains, distributional requirements), and identifiability goal.
- [6, 11-14]: 1 distribution per node; up to scaling, mixture, or affine transformation
- [7, 8, 10, 15, 16]: 2|V| + 1 distributions; up to scaling
- [17, 18]: 1 per node; up to scaling
- [19, 20]: 1 per node; up to scaling
- [21]: 1 per node; up to scaling or ancestral mixture
- [22]: 2|V| + |M_G| + 1 distributions; up to scaling or mixture
- [23]: TBD; up to scaling and affine transformation
- [24]: TBD; functional dependency map^1
- This work: general combinations of distributions; causal disentanglement map

Different routes have been taken to circumvent such impossibilities. For instance, one might assume parametric families (e.g., exponential) and auxiliary variables as input, which can be thought of as non-stationary time series that may lead to certain invariances that can be exploited [7, 8, 10].

[Figure 1: Dimensions of identifiability in causal disentangled representation learning — single vs. multiple domains; independent, Markovian, or non-Markovian structure; non-parametric, linear, or polynomial mechanisms; pure observation, one intervention, or arbitrary interventions; and full, ancestral, surrounding-node, or causal disentanglement goals.]

Interestingly, the machinery developed in this context can be applied to causal settings with multimodal data, where there is a mismatch between the causal variables and the granularity at which they are represented in the data. The key observation that links these two worlds is that an underlying causal system generates the data at such granularity (images, texts). Acknowledging this connection opens various possibilities for learning, or disentangling, the causal variables from data, similar to the initial ICA-like literature. First, the assumption that the features underlying a signal are independent needs to be relaxed, since it is arguably too stringent, a priori ruling out almost any interesting causal system. So, we should consider different assumptions regarding the structure of the underlying generative model. One initial relaxation is to assume this model is Markovian, where the features need not be independent and causal relationships are allowed across features. In the context of computer vision, for example, one might assume a specific structure on the latent variables where the style and content of the images are separated, and augmented data is leveraged to disentangle these two components [18]. Generalizing this idea to more relaxed causal settings, one can show ID up to certain indeterminacies given observational data across multiple domains, or interventional data [21, 22]. Another approach allows for certain parametric mixing functions, which could lead to new ID results [11, 14]. These results have been applied and advanced across various downstream tasks [25-31].

Considering this background, we study three axes within the different types of input and expected outputs of the causal disentangled representation learning task, which is summarized in Table 1 and Fig. 1. The input can be partitioned into qualitative and quantitative components. Qualitatively, we consider different assumptions about the underlying generative processes, including non-parametric models, as well as linear or Gaussian ones. We also account for systems with richer causal topologies than ICA (independent features) and generalize the Markovian setting. Notably, we do not rule out a priori the existence of unobserved confounding among features, a pervasive challenge in causal inference in the empirical sciences.
Quantitatively, we consider data gathered from arbitrary combinations of interventions and domains. Recent literature on this distinction acknowledges key differences [32-37], whereas prior literature often assumes data comes from different interventions in the same domain or from various (observational) distributions from different domains. In fact, it is feasible that data spans various interventions and domains in a less well-structured manner (App. A.3). In terms of the expected output, similar to [24], we will consider both full disentanglement as well as a more relaxed type of disentanglement, known as the causal disentanglement map.

For concreteness, consider a hypothetical latent graph depicted in Fig. 2 in the context of epilepsy research [38-46]. In terms of assumptions, hospitals in different countries Π_i and Π_j will differ in the amount of sleep (V1) patients get (represented by the S-node S^{i,j} → V1). Now suppose sleep (V1) affects the efficacy of the drug treatment (V2), and the drug helps epilepsy patients control their seizures (V3). The quality of sleep and the type of drug treatment are confounded by socioeconomic factors (V1 ↔ V2). Clinicians are then given electroencephalogram (EEG) data from each hospital, where they know different drug treatments were administered. The EEG X is a nonlinear (non-parametric) transformation of the latents V = {V1, V2, V3} via f_X. The clinicians' goal is to generate realistic EEG data to understand how different drugs affect EEG patterns. This requires a general output representation that disentangles sleep from drug, as it is understood that sleep affects EEG [47]. One could leverage state-of-the-art generative modeling techniques and train a self-supervised learning model to learn a representation of the EEG that is then perturbed to generate new instances of EEG [48-50]. However, there are no guarantees that the representation, or interventions in the latent space, will generate realistic EEG. In this case, drug and sleep might remain entangled in the learned representation, which is potentially harmful, since it may lead to unrealistic EEG data that contains visual differences due to sleep rather than drug. More formally, given an input set of distributions and knowledge of the latent variable causal structure, the goal is to learn the inverse of the mixing function f̂_X^{-1} and a representation V̂ = {V̂1, V̂2, V̂3}, where V̂2 is disentangled from V1^2.

[Figure 2: Data generating model and the goal of learning disentangled causal latent representations — the input (data and assumptions) consists of a latent selection diagram G_S, an (unobserved) mixing function over the latents, and multi-domain high-dimensional distributions over X ∈ ℝ^d (e.g., EMRs, imaging); the output representation consists of the inverse mixing and a causal disentanglement map.]

^1 We recently became aware of the work in [24], where their definition of a functional dependency map is what we define as a causal disentanglement map. However, the disentanglement map that they can achieve is different from ours, as discussed in Section F.

In this paper, we develop graphical and algorithmic machinery to determine whether (and how) causal representations can be disentangled from heterogeneous data and assumptions about the underlying causal system, which might help improve various downstream tasks. Our contributions are^3:

1. Graphical criteria for determining the disentangleability of causal factors.
We formalize a general version of the causal representation learning problem and develop methods to determine whether a pair of (user-chosen) variables can be disentangled in a non-Markovian setting with arbitrary distributions from multiple heterogeneous domains (Props. 3, 4, and 5)^4.

2. An algorithm to learn the causal disentanglement map. Leveraging these new conditions, we develop an algorithm called CRID, which systematically determines whether two sets of latent variables are disentangleable given their selection diagram and a collection of intervention targets (Thm. 1). The theoretical findings are corroborated with simulations.

All supplementary material (including proofs) is provided in the full technical report.

Preliminaries. We introduce basic definitions used throughout the paper. Uppercase letters (X) represent random variables, lowercase letters (x) signify assignments, and bold letters (X) indicate sets. For a set X, |X| denotes its dimension. Denote P(X) as a probability distribution over X and p(x) as its density function. The basic semantic framework of our analysis rests on structural causal models (SCMs) [1, Ch. 7]. An SCM is a 4-tuple ⟨U, V, F, P(U)⟩, where (1) U is a set of background variables, also called exogenous variables, that are determined by factors outside the model; (2) V = {V1, V2, ..., Vd} is the set of endogenous variables that are determined by other variables in the model; (3) F is the set of functions {f_{V1}, f_{V2}, ..., f_{Vd}}, where each f_{Vj} maps U_{Vj} ∪ Pa_{Vj} to Vj, with U_{Vj} ⊆ U and Pa_{Vj} ⊆ V \ {Vj}; and (4) P(U) is a probability function over the domain of U. Each SCM induces a causal diagram G, which is a directed acyclic graph where every Vj is a vertex. There is a directed arrow from Vj to Vk if Vj ∈ Pa_{Vk}, and a bidirected arrow between Vj and Vk if U_{Vj} and U_{Vk} are not independent [3]. The variables V can be partitioned into subsets called c-components [57]. The c-component of a variable X, denoted C(X), is the set of variables connected to X by bidirected paths; the c-component of a set X, denoted C(X), is defined as the union of the c-components of every X ∈ X. We use Pa(X) or Pa_X to denote the parents of X in G, and let Pa⁺(X) = Pa(X) ∪ X, which includes X itself. The subgraph over X ⊆ V in G is denoted G(X), and G_{X̄} denotes the subgraph obtained by removing arrows coming into nodes in X.

A soft intervention on a variable X, denoted σ_X, replaces f_X with a new function f′_X of new parents Pa′_X and exogenous variables U′_X [58, 59]. For interventions on a set of variables X ⊆ V, let σ_X = {σ_X}_{X ∈ X}, that is, the result of applying one intervention after the other. Given an SCM M, let M_{σ_X} be the submodel of M induced by the intervention σ_X. A special class of soft interventions, resulting in observational distributions and called the idle intervention, leaves the function as it is; we write σ_X = {}. Another special class of stochastic soft interventions, called perfect interventions [21, 51] and denoted perf(X), is such that Pa′_X = ∅ and U′_X ∩ U = ∅; this implies that the diagram induced by M_{σ_X} is G_{X̄}. We assume that soft interventions that are not perfect do not change the structure of the graph^5; namely, the diagram induced by M_{σ_X} is the same as G.

^2 We separate the tasks of disentanglement and structural learning, and consider the latent causal graph as an input to our task. Still, there are works in the literature that consider both tasks simultaneously [13, 22, 51-56].
^3 We refer the readers to our full technical report for a more detailed treatment of the problem setting.
^4 All proofs are provided in Appendix C.
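To make the graphical notions above concrete, the following is a minimal Python sketch (not from the paper; the class and method names CausalDiagram, c_component, and cut_incoming are our own) of a causal diagram with directed and bidirected edges, the c-component C(X), and the mutilated graph G_{X̄}:

```python
# A minimal sketch of the Preliminaries' graph notions, using Fig. 2's
# latent graph (V1 -> V2 -> V3 with V1 <-> V2) as the running example.
from itertools import chain

class CausalDiagram:
    def __init__(self, directed, bidirected):
        # directed: iterable of (parent, child); bidirected: iterable of 2-sets
        self.directed = set(directed)
        self.bidirected = {frozenset(e) for e in bidirected}
        self.nodes = set(chain.from_iterable(self.directed)) | set(
            chain.from_iterable(self.bidirected))

    def parents(self, v):
        return {p for (p, c) in self.directed if c == v}

    def c_component(self, x):
        # C(x): all nodes reachable from x via bidirected edges only.
        comp, frontier = {x}, [x]
        while frontier:
            v = frontier.pop()
            for e in self.bidirected:
                if v in e:
                    for u in e - {v}:
                        if u not in comp:
                            comp.add(u)
                            frontier.append(u)
        return comp

    def cut_incoming(self, xs):
        # G_{X-bar}: remove arrows coming into nodes in xs.
        return CausalDiagram({(p, c) for (p, c) in self.directed if c not in xs},
                             self.bidirected)

G = CausalDiagram(directed={("V1", "V2"), ("V2", "V3")},
                  bidirected=[{"V1", "V2"}])
assert G.c_component("V1") == {"V1", "V2"}
assert G.cut_incoming({"V2"}).parents("V2") == set()
```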
2 Modeling Disentangled Representation Learning (General Case)

In this section, we formalize the disentangled representation learning task in causal language. We leverage an Augmented SCM to model the generative process over latent causal variables V.

Definition 2.1 (Augmented Structural Causal Model). An Augmented Structural Causal Model (ASCM, for short) over a generative-level SCM M0 = ⟨U0, V0, F0, P0(U0)⟩ is a tuple M = ⟨U, {V, X}, F, P(U)⟩ such that (1) the exogenous variables U = U0; (2) V = V0 = {V1, ..., Vd} are d latent endogenous variables, and X is an m-dimensional mixture variable; (3) F = {F0, f_X}, where f_X: ℝ^d → ℝ^m is a diffeomorphic^6 function that maps from (the respective domains of) V to X, with h = f_X^{-1} such that V = h(X); and (4) P(U0) = P0(U0).

In words, an ASCM M describes a two-stage generative process involving latent generative factors V and a high-dimensional mixture X (e.g., images or text). First, the latent generative factors V ∈ ℝ^d are generated by an underlying SCM. The causal diagram induced by M0 over V is called a latent causal graph (LCG), denoted G here. Next, a non-parametric diffeomorphism f_X mixes V to obtain the high-dimensional mixture X ∈ ℝ^m. An important aspect of f_X is that it is invertible with respect to V, which implies that the generative factors V are recoverable from a given X = x^7. The initial disentangled representation learning setting can be traced back at least to linear/nonlinear ICA [7-9], where G is assumed to have no edges (the V are independent of each other) and to be Markovian (no bidirected edges in the LCG). More recently, allowing latent variables to have edges in the LCG was studied, albeit still under the Markovian assumption [11, 13, 14, 21, 22, 51, 55]. We relax this assumption and allow confounding to exist between V, which we call non-Markovianity^8.

Domains. We address the general setting of distributions that arise from multiple domains. Following [32-35, 62, 63], we define the so-called latent selection diagram, which represents a collection of ASCMs and models the multi-domain setting. Selection diagrams enable us to compactly represent causal structure and cross-domain invariances^9.

Definition 2.2 (Latent Selection Diagrams). Let M = ⟨M1, M2, ..., MN⟩ be a collection of ASCMs relative to N domains Π = ⟨Π1, Π2, ..., ΠN⟩, sharing the mixing function f_X and the LCG G. M defines a latent selection diagram (LSD, for short) G_S, constructed as follows: (1) every edge in G is also an edge in G_S; (2) G_S contains an extra node S^{i,j} and a corresponding edge S^{i,j} → Vk whenever there exists a discrepancy f^i_{Vk} ≠ f^j_{Vk} or P^i(U_k) ≠ P^j(U_k) between Mi and Mj.

S-nodes indicate possible differences over V due to changes in the underlying mechanisms or exogenous distributions across domains. For example, consider the LSD in Fig. 2. The S-node S^{i,j} implies that V1 possibly changes from domain Π_i to Π_j, while the mechanisms of the other variables are assumed to be invariant. Note that no S-node points to X, since f_X is shared across M.

^5 General soft interventions can arbitrarily change the graph by adding or removing edges. We do not consider this setting, and refer the readers to [1, 58, 59] for a general discussion on soft interventions.
^6 A diffeomorphism is a bijective function f_X such that both f_X and f_X^{-1} are continuously differentiable [27].
^7 Further discussion on the invertibility and non-parametric assumptions is provided in Appendix A.2.
^8 To our knowledge, this is the first work in disentangled causal representation learning to relax Markovianity, which we believe is important since a significant challenge in causal inference stems from the existence of confounding bias, traced back to Rubin [60], Pearl [1, 61], and more recently data fusion [36].
^9 See [32, 33] and Appendix Sec. A.3 for a more detailed discussion on the fundamental differences between interventions and domains, and why modeling their distinction is important in general.
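As an illustration of Def. 2.1, here is a toy two-stage generative process in Python: the latents follow Fig. 2's LCG (with a shared exogenous noise inducing V1 ↔ V2), and an arbitrary diffeomorphism f_X mixes V into X. The particular mechanisms, and the choice of an orthogonal map composed with the elementwise bijection arcsinh, are our own assumptions for illustration, not the paper's setup.

```python
# A toy ASCM: latent SCM over V = (V1, V2, V3) plus a diffeomorphic mixing.
import numpy as np

rng = np.random.default_rng(0)
A = np.linalg.qr(rng.normal(size=(3, 3)))[0]   # orthogonal => invertible

def sample_latents(n):
    u12 = rng.normal(size=n)                   # shared exogenous noise: V1 <-> V2
    v1 = u12 + rng.normal(size=n)              # f_{V1}(U)
    v2 = 0.8 * v1 - u12 + rng.normal(size=n)   # f_{V2}(V1, U)
    v3 = np.tanh(v2) + rng.normal(size=n)      # f_{V3}(V2, U)
    return np.stack([v1, v2, v3], axis=1)

def f_X(v):                    # diffeomorphism R^3 -> R^3
    return np.arcsinh(v @ A)   # arcsinh is a smooth bijection of R

def f_X_inv(x):
    return np.sinh(x) @ A.T    # exact inverse, since A is orthogonal

V = sample_latents(1000)
X = f_X(V)
assert np.allclose(f_X_inv(X), V)   # V is recoverable from X (cf. Remark 1)
```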
Interventions. A set of interventions Σ = {σ^{(k)}}_{k=1}^K is applied across the domains Π, where k is an index from 1 to K. The corresponding domains in which the Σ are applied are denoted Π_Σ = {Π^{(k)}}_{k=1}^K (the domain associated with each σ^{(k)} ∈ Σ). We study a general setting where each intervention can be applied to any subset of nodes and in any domain, which can be seen as a generalization of the more restricted settings in prior work (see Appendix F). The collection of intervention targets of these K interventions is denoted Ψ = {I^{(k)}}_{k=1}^K. Each intervention target I^{(k)} is given in the form {V_i^{Π(k),{b},t}, V_j^{Π(k),{b′},t}, ...}, which indicates that the intervention σ^{(k)} changes the mechanisms of {V_i, V_j, ...} in domain Π^{(k)}. The label {b} indexes the mechanism of an intervention on a given node: the mechanisms of V_i^1 and V_i^2 are different, while the mechanisms on different nodes (V_i^1 and V_j^1) are by default different. The tag t = perf indicates that the intervention is perfect. When I^{(k)} = {V_i^{Π(k),t}}, where {b} is omitted, the intervention is assumed to have a different mechanism. When I^{(k)} = {V_i^{Π(k),{b}}}, where t is omitted, the intervention is assumed to be a general soft intervention. When I^{(k)} is an idle intervention in Π^{(n)}, it is denoted {}^{Π(n)}. The set Perf[I^{(k)}] is the set of variables with perfect interventions in σ^{(k)}; thus, Perf[I^{(k)}] = {} implies there are no variables with perfect interventions in σ^{(k)}. Ψ_T^{perf} is the subset of Ψ such that T ⊆ Perf[I^{(j)}] for every I^{(j)} ∈ Ψ_T^{perf}, which implies each I^{(j)} contains perfect interventions on T; see Fig. S1 and Ex. 7 illustrating the notation.

Given Distributions. The interventions Σ = {σ^{(k)}}_{k=1}^K induce distributions P = {P^{(k)}}_{k=1}^K across the domains, where P^{(k)} = P^{Π(k)}(X; σ^{(k)}).

[Figure 3: General ID/disentangleability — within the space of collections of ASCMs matching the given data distributions and structural assumptions, every model admits the same relationship V̂_tar = τ(V \ V_en) between V and V̂.]

Problem Statement. Suppose the underlying true model M induces the LSD G_S, and a collection of distributions P over X is given according to a corresponding collection of interventions Σ. The goal of this paper is to learn a disentangled representation V̂ of the latent generative factors V in M. In the literature, it is common to require every variable Vi ∈ V to be disentangled from all other variables [7, 21] or from some special subset (e.g., the non-ancestors of Vi) [21, 22]. However, as illustrated in Fig. 2, sometimes only the target variables V_tar ⊆ V need to be disentangled from some user-chosen entangled variables V_en. Recent work has also considered a similar goal of generalized disentanglement [24]. Our work still differs from theirs in the following ways: (i) [assumptions] we model a completely non-parametric, non-Markovian ASCM, whereas [24] assumes sparsity and a Markovian ASCM; and (ii) [input] we consider arbitrary combinations of distributions from multiple domains, whereas [24] considers only interventions within a single domain (see Appendix F for a detailed comparison). We formally define this type of general indeterminacy next, as well as the formal version of our ID task.
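Before formalizing the task, note that the intervention-target bookkeeping above can be encoded directly as data. The sketch below (with hypothetical field names node, domain, mech, and perfect, our own) represents Ψ from the upcoming Ex. 1 and recovers Perf[I^{(k)}]:

```python
# A sketch of the intervention-target notation as Python records.
from dataclasses import dataclass

@dataclass(frozen=True)
class Target:
    node: str             # which V_i is intervened on
    domain: int           # the domain Pi^(k) of the distribution
    mech: int = 0         # mechanism label {b}: same label => same mechanism
    perfect: bool = False # t = perf marks a perfect intervention

@dataclass
class InterventionTarget:
    domain: int
    targets: frozenset = frozenset()   # empty => idle (observational)

    def perf_set(self):
        # Perf[I^(k)]: variables with perfect interventions in sigma^(k).
        return {t.node for t in self.targets if t.perfect}

# Psi for Ex. 1: {}^{Pi1}, {}^{Pi2}, V3^{Pi2}, V2^{Pi1,perf}
Psi = [
    InterventionTarget(domain=1),
    InterventionTarget(domain=2),
    InterventionTarget(domain=2, targets=frozenset({Target("V3", 2)})),
    InterventionTarget(domain=1, targets=frozenset({Target("V2", 1, perfect=True)})),
]
assert Psi[3].perf_set() == {"V2"}
```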
Definition 2.3 (General Identifiability/Disentangleability (ID)). Let M be the underlying true collection of ASCMs inducing the LSD G_S, and let P = {P^{(k)}}_{k=1}^K be a set of distributions resulting from K intervention sets Σ. Consider target variables V_tar ⊆ V and V_en ⊆ V \ V_tar. The set V_tar is identifiable (disentangled) with respect to (from) V_en if there exists a function τ such that V̂_tar = τ(V \ V_en) for any M̂ that is compatible with G_S and satisfies P_{M̂} = P. For short, V_tar is said to be ID w.r.t. V_en.

To illustrate, consider a target variable V_tar whose representation one wants to be disentangled from another subset of variables V_en. The above definition states that V_tar is disentangled from V_en (or is ID w.r.t. V_en) if the learned representation V̂_tar in M̂ is only a function of V \ V_en for any M̂ that matches the LSD G_S and distributions P^10. Def. 2.3 is illustrated in Fig. 3. Following the example illustrated in Fig. 2, suppose the user wants V3 to be disentangled from V1 while considering the entanglement between V2 and V3 acceptable. If V̂3 = τ(V2, V3) for any ASCM M̂ that matches the distributions and LSD, then V3 is ID w.r.t. V1. Def. 2.3 is more relaxed than previous notions, since one is free to choose any target V_tar and V_en; it can be reduced to existing identifiability definitions (Appendix F.2).

Example 1 (Example of an ID task). Suppose the pair of underlying ASCMs ⟨M1, M2⟩ induces the LSD G_S in Fig. 2 and the distributions P = {P^{(1)}, P^{(2)}, P^{(3)}, P^{(4)}} = {P^{Π1}(X), P^{Π2}(X), P^{Π2}(X; σ_{V3}), P^{Π1}(X; perf(V2))} from the interventions Σ = {σ^{(1)}, σ^{(2)}, σ^{(3)}, σ^{(4)}} = {{}, {}, σ_{V3}, perf(V2)}. Given the intervention targets Ψ = {I^{(1)}, I^{(2)}, I^{(3)}, I^{(4)}} = {{}^{Π1}, {}^{Π2}, V_3^{Π2}, V_2^{Π1,perf}} and G_S, the task is to determine whether (and how) {V2, V3} is ID w.r.t. V1, and V1 is ID w.r.t. {V2, V3}. The answer is provided in Ex. 6.

Assumptions (Informal) and Modeling Concepts. Before discussing the main theoretical contributions, we restate important assumptions and remarks (discussed in this section) here to ground the ASCM model^11.

Assumption 1 (Soft interventions without altering the causal structure). Interventions do not change the causal diagram. Hard interventions cut all incoming parent edges, and soft interventions preserve them [59]. However, more general interventions may arbitrarily change the parent set of any given node [59]. We do not consider such interventions and leave this general case for future work.

Assumption 2 (Known-target interventions). All interventions occur with known targets, reducing permutation indeterminacy for intervened variables.

Assumption 3 (Sufficiently different distributions). Each pair of distributions P^{(j)}, P^{(k)} ∈ P is sufficiently different, unless stated otherwise. This is naturally satisfied if the ASCMs and interventions are randomly chosen [51]. Similar assumptions include the "genericity" [51], "interventional discrepancy" [21], and "sufficient changes" [10, 22] assumptions.

Remark 1 (Mixing is invertible). As a consequence of Def. 2.1, the mixing function f_X is invertible, ensuring that the latent variables are uniquely learnable [9, 10, 17, 64].

Remark 2 (Confounders are not part of the mixing function). According to Def. 2.1, the latent exogenous variables U influence the high-dimensional mixture X only through the latent causal variables V, so unobserved confounding does not directly affect the mixing function.

Remark 3 (Shared causal structure). As a consequence of Def. 2.2, each environment's ASCM shares the same latent causal graph, with no structural changes among the latent variables^12.

^10 In general, this definition is defined up to a permutation of the variables. We elaborate in Sec. A.4.
^11 For a formal discussion of the assumptions, we refer the readers to Appendix A.2.
^12 The assumption that there are no structural changes between domains can be relaxed, and is considered in the context of inference when the causal variables are fully observed, as discussed in [33]. This is an interesting topic for future exploration, and we do not consider this avenue here.
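Def. 2.3 in miniature: a representation V̂3 = τ(V2, V3) is disentangled from V1 because no change to V1 alone can move V̂3. The snippet below is only a didactic illustration of this functional requirement (with a made-up τ), not the formal criterion:

```python
# Disentanglement as a functional constraint: tau never reads V1.
import math

def tau(v2, v3):                       # some function of V2 and V3 only
    return math.tanh(v2) + 0.5 * v3

v1, v2, v3 = 0.3, -1.2, 0.7
assert tau(v2, v3) == tau(v2, v3)      # changing v1 cannot affect V3-hat

def tau_entangled(v1, v2, v3):         # an alternative that leaks V1
    return math.tanh(v2) + 0.5 * v3 + 0.1 * v1

assert tau_entangled(0.3, v2, v3) != tau_entangled(9.9, v2, v3)
```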
3 Graphical Criteria for Causal Disentanglement

In this section, we study a general form of identifiability given general assumptions and input distributions. More specifically, we build the connection between V and the representation V̂ through comparing distributions, and then introduce three graphical criteria (Props. 3, 4, and 5) to check ID. First, we introduce a factorization of distributions induced by non-Markovian models [3, Def. 15]. Specifically, consider P_T(V) induced by an ASCM M after a perfect intervention on T. Then, given a topological order < of G, P_T(V) can be factorized as follows:

P_T(V) = ∏_{Vi ∈ V} P_T(Vi | Pa_i^{T+}),    (1)

where Pa_i^{T+} = Pa⁺({V ∈ C(Vi) : V ≤ Vi}) \ {Vi} is the extended parent set of Vi in G_{T̄}. The factorization of P_T(V) will differ according to the chosen order.

Example 2. Consider a collection M inducing the LSD shown in Fig. 4(c). Given order A: V1 < V2 < V3 < V4, P(V) can be factorized as P(V1) P(V2 | V1) P(V3 | V2, V1) P(V4 | V3). Notice that the conditioning set of V3 includes {V2, V1}, which are not parents of V3. Choosing order B: V1 < V3 < V2 < V4, P(V) can be factorized as P(V1) P(V3) P(V2 | V1, V3) P(V4 | V3). The conditioning sets of V2 and V3 differ under the two orders.

[Figure 4: LSDs used in the examples and experiments, with S-nodes {S^{i,j}}, i, j ∈ [5] — (a) chain, (b) collider, and (c) non-Markovian graphs over V1, ..., V4.]

Armed with this factorization, the representation V̂ in M̂ and the true underlying variables V in M can be related by comparing distributions as follows.

Proposition 1 (Distribution Comparison). Consider a pair of collections of ASCMs M and M̂ that both match the distributions P resulting from the interventions Σ and the LSD G_S. Consider two distributions P^{Π(j)}(X; σ^{(j)}) and P^{Π(k)}(X; σ^{(k)}), and suppose perf(T) is in both intervention sets. Then,

∑_i log [ p_T^{(j)}(vi | pa_i^{T+}) / p_T^{(k)}(vi | pa_i^{T+}) ] = ∑_i [ log p_T^{(j)}(v̂i | p̂a_i^{T+}) − log p_T^{(k)}(v̂i | p̂a_i^{T+}) ],    (2)

where p_T^{(j)}(·) and p_T^{(k)}(·) are density functions.

To illustrate, Prop. 1 shows that when the intervention and domain change from σ^{(k)} to σ^{(j)} and from Π^{(k)} to Π^{(j)}, the change comes from the factors p_T(vi | pa_i^{T+}), both in M and in M̂. However, not all factors necessarily contribute to Eq. (2). For example, in the Markovian setting, only one factor p_T(vi | pa_i) possibly changes when comparing the observational distribution to a singleton interventional distribution in the same domain. The other, invariant factors cancel out in Eq. (2). The following result generalizes the identification of invariant factors when comparing distributions from different domains and interventions in non-Markovian settings.

Proposition 2 (Invariant Factors). Consider two distributions P^{(j)}, P^{(k)} ∈ P with intervention targets σ^{(j)} and σ^{(k)} containing perf(T). Construct the changed variable set ΔV[I^{(j)}, I^{(k)}, G_S] (for short, ΔV) from the target sets I^{(j)}, I^{(k)} as follows: (1) Vl ∈ ΔV if Vl is intervened on in I^{(j)} and I^{(k)} with different mechanisms, or is intervened on in one but not the other; (2) Vl ∈ ΔV if (i) S^{Π(j),Π(k)} points to Vl and (ii) Vl is not intervened on with the same mechanism in both I^{(j)} and I^{(k)}. If Vi ∈ V \ C(ΔV), then p_T^{(j)}(vi | pa_i^{T+}) = p_T^{(k)}(vi | pa_i^{T+}); these are called invariant factors.

Prop. 2 states that the factors p_T(vi | pa_i^{T+}) are guaranteed to be invariant if Vi is not in the c-component of the changed variable set ΔV. ΔV[I^{(j)}, I^{(k)}, G_S] contains the variables that are intervened on differently in I^{(j)} and I^{(k)}, and the variables pointed to by the S-node S^{j,k}^13.

Example 3. Consider the diagram in Fig. 4(c) and two distributions P^{(1)}, P^{(2)} ∈ P with intervention targets I^{(1)} = {}^{Π1} and I^{(2)} = {V_2^{Π1}}. The changed variable set is ΔV^{(2),(1)} = {V2}, since V2 ∈ I^{(2)} but V2 ∉ I^{(1)}, and its c-component closure is C(ΔV) = C(V2) = {V2, V3}. Thus, comparing P^{(2)} with P^{(1)} (order A in Ex. 2), the factors p(v1) and p(v4 | v3) are invariant, whereas p(v2 | v1) and p(v3 | v2, v1) can change.

^13 Notice that an identical intervention mechanism dominates the domain changes: when the intervened mechanism on Vl is the same in I^{(j)} and I^{(k)}, the discrepancy on Vl due to the change of domain between Π^{(j)} and Π^{(k)} is canceled. See Appendix A.3 for an example.
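Computationally, the changed variable set ΔV of Prop. 2 can be assembled from the Target and InterventionTarget records sketched earlier. The function below (changed_set, our naming) is a sketch under our reading of conditions (1)-(2), with S-nodes supplied as a map from domain pairs to the variables they point to:

```python
# A sketch of Prop. 2's changed variable set Delta-V[I^(j), I^(k), G_S].
def changed_set(Ij, Ik, s_nodes):
    def mechs(I):   # node -> (mechanism label, perfect?) for targeted nodes
        return {t.node: (t.mech, t.perfect) for t in I.targets}
    mj, mk = mechs(Ij), mechs(Ik)
    # (1) intervened on differently in the two target sets
    delta = {v for v in set(mj) | set(mk) if mj.get(v) != mk.get(v)}
    # (2) pointed to by the relevant S-node, unless the same intervention
    # mechanism dominates the domain change (cf. footnote 13)
    same = {v for v in set(mj) & set(mk) if mj[v] == mk[v]}
    pair = frozenset({Ij.domain, Ik.domain})
    delta |= {v for v in s_nodes.get(pair, set()) if v not in same}
    return delta

# Ex. 3 on Fig. 4(c): I^(1) = {}^{Pi1}, I^(2) = {V2}^{Pi1}; no S-node applies.
I1 = InterventionTarget(domain=1)
I2 = InterventionTarget(domain=1, targets=frozenset({Target("V2", 1)}))
assert changed_set(I2, I1, s_nodes={}) == {"V2"}
# With C(V2) = {V2, V3} in Fig. 4(c), the variant factors are exactly those
# of the variables in C(Delta-V), matching Ex. 3.
```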
With Prop. 2, Eq. (2) naturally keeps only the factors in the c-component of ΔV, i.e.,

∑_{Vi ∈ V̄} log [ p_T^{(j)}(vi | pa_i^{T+}) / p_T^{(k)}(vi | pa_i^{T+}) ] = ∑_{Vi ∈ V̄} [ log p_T^{(j)}(v̂i | p̂a_i^{T+}) − log p_T^{(k)}(v̂i | p̂a_i^{T+}) ],    (3)

where V̄ = C(ΔV[I^{(j)}, I^{(k)}, G_S]). This factorization hints that V̂ (the RHS of Eq. (3)) is only related to the variables that appear on the LHS.

Definition 3.1 (Q Set). Given two distributions P^{(j)}, P^{(k)} with intervention targets σ^{(j)} and σ^{(k)} containing perfect interventions on T, the set Q[I^{(j)}, I^{(k)}, T, G_S] (for short, Q^{(j),(k)}, or Q if the index is not needed) of the target sets I^{(j)}, I^{(k)} consists of the variables remaining after the comparison (i.e., Eq. 3):

Q[I^{(j)}, I^{(k)}, T, G_S] = V̄ ∪ Pa^{T+}(V̄), where V̄ = C(ΔV[I^{(j)}, I^{(k)}, G_S]).

To illustrate, the Q set contains all the variables on the LHS of Eq. (3): V̄ and its extended parents. These variables come from the factors that possibly change and are thus kept in Eq. (3). We call V \ Q the canceled variables, since their invariant factors cancel out of the comparison. Continuing Ex. 3, Q = {V1, V2, V3} given either topological order.

Leveraging the comparisons among the distributions in P (Eq. 3), we next develop three criteria for disentanglement. First, we can disentangle the canceled variables from the Q set, since the difference of the densities over the representations V̂ in the Q set (the RHS of Eq. (3)) is irrelevant to the canceled variables (the LHS of Eq. (3)).

Proposition 3 (ID of the Q Set w.r.t. Canceled Variables). Consider variables V_tar ⊆ V. Let P_T = {P^{(a0)}, P^{(a1)}, ..., P^{(aL)}} ⊆ P be a collection of distributions such that (1) ∀ l ∈ [L], T = Perf[I^{(a0)}] ⊆ Perf[I^{(al)}]^14; (2) ⋃_{l ∈ [L]} Q[I^{(al)}, I^{(a0)}, T, G_S] = V_tar; and (3) there exists {a′_1, ..., a′_d} ⊆ {a_1, ..., a_L} such that for every V_i^{tar} ∈ V_tar, V_i^{tar} ∈ ΔV[I^{(a′_i)}, I^{(a0)}, G_S], where d = |V_tar|. Then V_tar is ID w.r.t. V \ V_tar.

Prop. 3 disentangles the target variables V_tar (a union of Q sets) from the canceled variables according to Eq. (3). To illustrate, it asks for a collection of L distributions {P^{(a1)}, ..., P^{(aL)}} to compare with a baseline P^{(a0)} such that (1) the perfect-intervention variable sets of {I^{(a1)}, ..., I^{(aL)}} contain the perfect-intervention set of the baseline I^{(a0)}; (2) the union of the Q sets is exactly V_tar; and (3) each V_i^{tar} changes at least once. Then V_tar is ID w.r.t. V \ V_tar.

^14 Recall that we use the notation Perf[I] to denote the set of all variables that have perfect interventions in I.
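Continuing the sketches above (CausalDiagram, changed_set, and the targets I1, I2 from Ex. 3), Def. 3.1 can be computed mechanically: the extended parents per Eq. (1), the c-component closure V̄, and then Q = V̄ ∪ Pa^{T+}(V̄). The hypothetical helpers below reproduce Ex. 3's Q set on Fig. 4(c):

```python
# A sketch of Def. 3.1: Q[I^(j), I^(k), T, G_S] for T = {}.
def extended_parents(G, v, order):
    # Pa_i^{T+} per Eq. (1); for T = {}, G_{T-bar} = G (otherwise pass
    # G.cut_incoming(T) in place of G).
    comp = G.c_component(v)
    earlier = {u for u in comp if order.index(u) <= order.index(v)}
    pa_plus = set(earlier)
    for u in earlier:
        pa_plus |= G.parents(u)
    return pa_plus - {v}

def q_set(G, Ij, Ik, s_nodes, order):
    delta = changed_set(Ij, Ik, s_nodes)
    vbar = set().union(*(G.c_component(v) for v in delta))
    pa = set().union(*(extended_parents(G, v, order) for v in vbar))
    return vbar | pa

# Fig. 4(c): V1 -> V2, V3 -> V4, with V2 <-> V3 (as implied by Ex. 2).
G4c = CausalDiagram(directed={("V1", "V2"), ("V3", "V4")},
                    bidirected=[{"V2", "V3"}])
orderA = ["V1", "V2", "V3", "V4"]
assert q_set(G4c, I2, I1, {}, orderA) == {"V1", "V2", "V3"}   # as in Ex. 3
```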
Example 4 (Ex. 3 continued). Suppose P = {P^{(1)}, P^{(2)}, P^{(3)}, P^{(4)}} with intervention targets I^{(1)} = {}^{Π1}, I^{(2)} = {V2}^{Π1}, I^{(3)} = {V3}^{Π1}, I^{(4)} = {V1}^{Π1}. Consider V_tar = {V1, V2, V3} and V_en = V \ {V1, V2, V3} = {V4}. Comparing {I^{(2)}, I^{(3)}, I^{(4)}} with the baseline I^{(1)}, the perfect-intervention variables are T = Perf[I^{(1)}] = {}. We then have the Q sets {V1, V2, V3}, {V1, V2, V3}, and {V1}. Thus, these three comparisons satisfy the three conditions of Prop. 3, and V_tar is ID w.r.t. V_en by Prop. 3. See Appendix Ex. 17 for a derivation.

From Prop. 3, a disentanglement corollary leveraging the comparison of interventional and observational distributions can be derived.

Corollary 1 (ID of Intervened Variables). Given an observational distribution and L distributions resulting from interventions on the same target W but with different mechanisms (in the same domain), if L ≥ |Pa_W^+ ∪ W|, then Pa_W^+ ∪ W is ID w.r.t. V \ (Pa_W^+ ∪ W), where Pa_W^+ = ⋃_{Wi ∈ W} Pa_{Wi}^+.

The second result disentangles variables within Q sets.

Proposition 4 (ID of Variables within Q Sets). Consider variables V_tar ⊆ V and a collection P_T that satisfies condition (1) of Prop. 3 with Q^{(al),(a0)} = V_tar for all l ∈ [L]. For any pair Vi, Vj ∈ V_tar such that Vi ⊥ Vj | V_tar \ {Vi, Vj} in G_{T̄}(V_tar), Vi is ID w.r.t. Vj if L ≥ 2|V_tar| + δ, where δ is the number of pairs Vk, Vr ∈ V_tar such that Vk and Vr are connected given V_tar \ {Vk, Vr} in G_{T̄}(V_tar).

Prop. 4 disentangles target variables Vi and Vj that are both in the Q sets. To illustrate, consider a set of distributions that satisfies condition (1) of Prop. 3. The proposition states that if (1) Vi, Vj ∈ V_tar are conditionally independent given all other variables in V_tar, and (2) L is no smaller than 2|V_tar| + δ, then Vi can be disentangled from Vj.

Example 5. Suppose the LSD G_S is the collider graph shown in Fig. 4(b), and the intervention targets are Ψ = {{}^{Π1}, {}^{Π2}, {}^{Π3}, {}^{Π4}, {}^{Π5}}, meaning observational distributions are available in each of the five domains. Consider T = {} and let V_tar = {V1, V3}. We have V1 ⊥ V3 in G(V1, V3). Based on Def. 3.1, Q[I^{(j)}, I^{(1)}, T, G_S] = {V1, V3} for j = 2, 3, 4, 5. The number of distributions used for comparison (four) is no smaller than the required 2·2 + 0 = 4, so V1 is ID w.r.t. V3 and V3 is ID w.r.t. V1 by Prop. 4. See App. Ex. 18 for a derivation.

Given the disentanglements established via Props. 3 and 4, the following proposition considers the inverse direction, which identifies canceled variables w.r.t. Q sets^15.

Proposition 5 (ID of Canceled Variables w.r.t. Q Sets). Suppose Ψ contains perf(T). Given that V \ {V^{tar}} is ID w.r.t. a single variable V^{tar}, V^{tar} is ID w.r.t. V \ {V^{tar}} if V^{tar} ⊥ V \ {V^{tar}} in G_{T̄}.

To illustrate, Prop. 5 states that if V \ {V^{tar}} is already disentangled from V^{tar}, then V^{tar} is ID w.r.t. V \ {V^{tar}} whenever a perfect intervention on some T exists that separates V^{tar} from V \ {V^{tar}} in G_{T̄}. Prop. 5 does not compare distributions, but relies on existing disentanglements. See Ex. 19 for details.

^15 Recall that ID is one-way: ID of Vi w.r.t. Vj does not imply that Vj is ID w.r.t. Vi.
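Before moving to the algorithm, here is a small worked instance of Cor. 1's counting condition (assuming Fig. 4(a)'s chain is V1 → V2 → V3, which matches its use in Sec. 5); the required number of differently-mechanized interventions on W can be read off the graph with the CausalDiagram sketch from the Preliminaries:

```python
# A sketch of Cor. 1's requirement L >= |Pa_W^+ ∪ W|.
def cor1_required_L(G, W):
    pa_plus = set(W)
    for w in W:
        pa_plus |= G.parents(w) | {w}
    return pa_plus, len(pa_plus)

# Chain V1 -> V2 -> V3: intervening on W = {V3} requires L >= |{V2, V3}| = 2
# mechanisms, matching the two interventions on V3 used in Sec. 5.
chain = CausalDiagram(directed={("V1", "V2"), ("V2", "V3")}, bidirected=[])
ids, L_min = cor1_required_L(chain, {"V3"})
assert ids == {"V2", "V3"} and L_min == 2
```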
4 Algorithmic Disentanglement of Causal Representations

In this section, we develop an algorithmic procedure for determining whether V_tar and V_en are disentangleable given the LSD G_S and the intervention target sets Ψ. The full algorithm, Causal Representation ID (CRID, for short), is described in Alg. 1. We start by introducing a bipartite graph G_{V,V̂}, called the causal disentanglement map (CDM), which was informally shown in Fig. 2 (right). In words, the absence of the edge Vi → V̂j implies that Vj is ID w.r.t. Vi. If each V̂i is pointed to only by Vi, then we have full disentanglement of V. If some set of variables beyond Vi points to V̂i, then we have partial disentanglement of Vi.

Algorithm 1 CRID: algorithm for determining causal representation identifiability. G_S is the LSD; Ψ is the intervention target sets; G_{V,V̂} is the output bipartite graph (i.e., the CDM).
Input: G_S and intervention target sets Ψ. Output: CDM G_{V,V̂}.
 1: G_{V,V̂} ← FullyConnectedBipartiteGraph(V, V̂)   ▷ Initialize G_{V,V̂} with Alg. F.2
 2: while G_{V,V̂} was updated in the last epoch do
 3:   Perf = {T1, T2, ..., Ts} ← Ψ   ▷ Get the perfect-intervention variable sets
 4:   for all T ∈ Perf do
 5:     Ψ_T^{perf} ← Ψ   ▷ Collect intervention targets containing perfect interventions on T
 6:     for all I ∈ Ψ_T^{perf} such that Perf[I] = T do   ▷ Iterate intervention targets as the baseline
 7:       Q = {Q1, ..., Q_{|Ψ_T\I|}}, where Qk ← Q[J^{(k)}, I, T, G_S]   ▷ Construct Q sets
 8:       for all Q̄ such that Q̄ = ⋃_{Ql ∈ Q} Ql ⊆ V do   ▷ Iterate over unions of Q factors
 9:         G_{V,V̂} ← DisQFromCancel(Q̄, G_{V,V̂}, G_{T̄}, Ψ_T^{perf}, I, Q)   ▷ Alg. F.3, Prop. 3
10:         G_{V,V̂} ← DisWithinQ(Q̄, G_{V,V̂}, G_{T̄}, Ψ_T^{perf}, I, Q)   ▷ Alg. F.6, Prop. 4
11:   for all T ∈ Perf do
12:     G_{V,V̂} ← DisCancelFromQ(G_{V,V̂}, G_{T̄})   ▷ Alg. F.8, Prop. 5
13: return G_{V,V̂}

[Figure 5: Process of removing edges from the CDM G_{V,V̂} using Alg. 1 in Ex. 6 — (a) initialization; (b) 1st epoch, Step 9; (c) 1st epoch, Step 12; (d) 2nd epoch, Step 9, the final output.]

CRID proceeds by first constructing the fully connected CDM in Step 1. In each iteration, a perfect-intervention set T and a baseline intervention target set I are enumerated (Steps 4 and 6). For each T and baseline, all Q sets are constructed based on Def. 3.1 and collected in Q (Step 7). After a union of Q sets (denoted Q̄) is chosen iteratively (Step 8), Props. 3 and 4 are leveraged in two procedures (Steps 9 and 10) to check the identification of Q̄ w.r.t. V \ Q̄ and the identification within Q̄. The disentanglements present in the CDM at the current stage are leveraged to reduce the required number of distributions (see Algs. F.3 and F.6 for details). At the end of each iteration, Prop. 5 is used for identifying V \ Q̄ from Q̄, leveraging the current disentanglements in the CDM (Steps 11-12).

Example 6 (Ex. 1 continued). Consider the selection diagram (Fig. 2) and the setup in Ex. 1. The perfect-intervention variable sets are the empty set {} and {V2}. First, T is chosen as {}, so Ψ_T^{perf} = Ψ. Choosing the baseline I = I^{(1)} gives the Q collection Q = {Q1, Q2, Q3} = {{V2, V3}, {V2, V3}, {V1, V2}}. We consider the unions Q̄ = {V2, V3} and Q̄ = {V1, V2}. For Q̄ = {V2, V3}, Step 9 (Prop. 3) removes the edges from V1 to {V̂2, V̂3} (see Ex. 15 for details), and Step 10 (Prop. 4) removes no further edges. However, for Q̄ = {V1, V2}, no edge can be removed, since at least two comparisons are needed to claim disentanglement. Choosing {}^{Π2}, V_3^{Π2}, or V_2^{Π1,perf} as the baseline, no new Q̄ can be constructed, so no further edges are removed. When T is chosen as {V2}, the comparison does not apply, since no other distribution with perfect interventions on {V2} is available. At the end of this iteration, using the fact that {V2, V3} is ID w.r.t. V1 and V1 ⊥ {V2, V3} in G_{V̄2}, Step 12 (Prop. 5) removes the edges from V2 to V̂1 and from V3 to V̂1. In the second epoch, the algorithm repeats the choices of T and baseline. This time, for Q̄ = {V1, V2}, the edge from V3 to V̂2 is removed, since the edge from V3 to V̂1 has already been removed in the CDM and only one comparison is now needed.
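A minimal sketch of the CDM data structure that CRID manipulates (class and method names are our own): a boolean bipartite adjacency from V to V̂, edge removals recording disentanglements, and the ID query used by the soundness theorem stated next. The example replays the first Prop. 3 step of Ex. 6:

```python
# A sketch of the causal disentanglement map (CDM).
class CDM:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        # edges[vi][vj] == True means Vi -> V^_j is still present (entangled)
        self.edges = {vi: {vj: True for vj in self.nodes} for vi in self.nodes}

    def remove(self, vi, vj_hat):
        # Record that V^_j is ID w.r.t. Vi by deleting the edge Vi -> V^_j.
        self.edges[vi][vj_hat] = False

    def is_ID(self, v_tar, v_en):
        # V_tar is ID w.r.t. V_en if no edge from V_en points into V^_tar.
        # Note that ID is one-way (footnote 15).
        return all(not self.edges[ve][vt] for ve in v_en for vt in v_tar)

cdm = CDM(["V1", "V2", "V3"])
cdm.remove("V1", "V2")
cdm.remove("V1", "V3")                    # Step 9 (Prop. 3) in Ex. 6
assert cdm.is_ID({"V2", "V3"}, {"V1"})    # {V2, V3} is ID w.r.t. V1
assert not cdm.is_ID({"V1"}, {"V2"})      # but not yet the reverse
```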
At the end of this epoch, no further edges can be removed by Alg. F.8. In the third epoch, G_{V,V̂} is not updated, and the edge-removal process and returned CDM are shown in Fig. 5.

The following theorem establishes the soundness of CRID.

Theorem 1 (Soundness of CRID). Consider an LSD G_S and intervention targets Ψ. Consider target variables V_tar and V_en ⊆ V \ V_tar. If no edge from V_en points to V̂_tar in the output causal disentanglement map (CDM) G_{V,V̂} from CRID, then V_tar is ID w.r.t. V_en.

5 Experiments

We corroborate the theoretical findings through simulations and the MNIST dataset. For full details, see Appendix Section G. In the simulations, we consider the LSDs shown in Fig. 4 with different collections of distributions P = {P^{(k)}(X; σ^{(k)})}_{k=1}^K; the results are presented in Fig. 6. For the evaluation, we follow a standard protocol from prior work [18]: we take the latent representations V̂, compute their mean correlation coefficient (MCC) w.r.t. the true latents V, and compare the MCC with what is expected from CRID. Fig. S9 shows the full MCC comparisons of V and V̂.

[Figure 6: Correlation of learned latent representations with the true latent variables from Fig. 4, panels (a)-(d), alongside the CRID output.]

Chain graph, Fig. 4(a). Fig. 6(a) shows ID of V3 w.r.t. {V1} using input distributions P with interventions Σ = {σ_{}, σ_{V3}^{(1)}, σ_{V3}^{(2)}}: MCC(V̂3, V1) is relatively low compared to MCC(V̂3, V3), which is consistent with CRID. The ID results of [21] state that V3 would still be entangled with V1 because V1 ∈ Anc(V3). Fig. 6(b) shows ID of V1 w.r.t. {V2, V3}. Interestingly, we do not even have to intervene on V1 to obtain full disentanglement.

Collider graph, Fig. 4(b). Fig. 6(c) shows that V1 and V3 are ID w.r.t. V2 and each other, because MCC(V̂3, V3) > MCC(V̂3, Vi) and MCC(V̂2, V2) > MCC(V̂2, Vi), which is consistent with CRID. Here, there are distributions from four domains that have a change in mechanism on {V1, V3} (represented by the S-nodes). According to [22], since V1 and V3 are adjacent in the Markov network, V1 and V3 would not be disentangleable.

Non-Markovian graph, Fig. 4(c). Fig. 6(d) shows that V3 is ID w.r.t. {V1, V2, V4} with interventions Σ = {do^{(1)}(V3), do^{(2)}(V3)}, which is consistent with CRID. No prior results achieve disentanglement with confounding among V.
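For reference, the MCC protocol described above can be computed with a Hungarian matching over absolute correlations; the following is a standard sketch (following the convention of [18]; the function name mcc is our own), using scipy's linear_sum_assignment:

```python
# A sketch of the mean correlation coefficient (MCC) between V and V-hat.
import numpy as np
from scipy.optimize import linear_sum_assignment

def mcc(v_true, v_hat):
    # v_true, v_hat: arrays of shape (n_samples, d)
    d = v_true.shape[1]
    corr = np.corrcoef(v_true.T, v_hat.T)[:d, d:]    # d x d cross-correlations
    row, col = linear_sum_assignment(-np.abs(corr))  # best one-to-one matching
    return np.abs(corr[row, col]).mean()

rng = np.random.default_rng(0)
V = rng.normal(size=(1000, 3))
V_hat = V[:, [2, 0, 1]] + 0.1 * rng.normal(size=(1000, 3))  # permuted + noise
print(round(mcc(V, V_hat), 3))   # close to 1: latents recovered up to permutation
```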
6 Conclusions

This work introduces theory and a practical ID algorithm for determining which latent variables are disentangleable from a given set of assumptions in the form of an LSD and input distributions from heterogeneous domains. This brings us one step closer to building robust AI that can reason causally over high-level concepts when given only low-level data.

Acknowledgements. AL was supported by the NSF Computing Innovation Fellowship (#2127309). YP and EB were supported in part by the NSF, ONR, AFOSR, DoE, Amazon, JP Morgan, and The Alfred P. Sloan Foundation.

References

[1] J. Pearl. Causality: Models, Reasoning, and Inference. 2nd ed. Cambridge University Press, 2009.
[2] J. Pearl and D. Mackenzie. The Book of Why: The New Science of Cause and Effect. 2019.
[3] E. Bareinboim, J. D. Correa, D. Ibeling, and T. Icard. On Pearl's Hierarchy and the Foundations of Causal Inference. In: Probabilistic and Causal Inference: The Works of Judea Pearl. 1st ed. Vol. 36. New York, NY, USA: Association for Computing Machinery, 2022, pp. 507-556.
[4] B. Schölkopf, F. Locatello, S. Bauer, N. R. Ke, N. Kalchbrenner, A. Goyal, and Y. Bengio. Towards Causal Representation Learning. arXiv:2102.11107 [cs]. 2021.
[5] Y. Bengio, A. Courville, and P. Vincent. Representation Learning: A Review and New Perspectives. In: arXiv (2012).
[6] A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. In: Neural Networks 13.4-5 (2000), pp. 411-430.
[7] A. Hyvarinen, H. Sasaki, and R. E. Turner. Nonlinear ICA Using Auxiliary Variables and Generalized Contrastive Learning. arXiv:1805.08651 [cs, stat]. 2019.
[8] A. Hyvärinen, I. Khemakhem, and H. Morioka. Nonlinear independent component analysis for principled disentanglement in unsupervised deep learning. In: Patterns 4.10 (2023), p. 100844.
[9] A. Hyvärinen and P. Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. In: Neural Networks 12.3 (1999), pp. 429-439.
[10] I. Khemakhem, D. Kingma, R. Monti, and A. Hyvarinen. Variational autoencoders and nonlinear ICA: A unifying framework. In: International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 2207-2217.
[11] C. Squires, A. Seigal, S. S. Bhate, and C. Uhler. Linear Causal Disentanglement via Interventions. In: Proceedings of the 40th International Conference on Machine Learning. PMLR, 2023, pp. 32540-32560.
[12] K. Ahuja, J. S. Hartford, and Y. Bengio. Weakly supervised representation learning with sparse perturbations. In: Advances in Neural Information Processing Systems 35 (2022), pp. 15516-15528.
[13] B. Varici, E. Acarturk, K. Shanmugam, A. Kumar, and A. Tajer. Score-based causal representation learning with interventions. In: arXiv preprint arXiv:2301.08230 (2023).
[14] K. Ahuja, D. Mahajan, Y. Wang, and Y. Bengio. Interventional Causal Representation Learning. 2024.
[15] L. Gresele, P. K. Rubenstein, A. Mehrjou, F. Locatello, and B. Schölkopf. The incomplete Rosetta Stone problem: Identifiability results for multi-view nonlinear ICA. In: Uncertainty in Artificial Intelligence. PMLR, 2020, pp. 217-227.
[16] L. Gresele, J. Von Kügelgen, V. Stimper, B. Schölkopf, and M. Besserve. Independent mechanism analysis, a new concept? In: Advances in Neural Information Processing Systems 34 (2021), pp. 28233-28248.
[17] S. Lachapelle, P. Rodriguez, Y. Sharma, K. E. Everett, R. Le Priol, A. Lacoste, and S. Lacoste-Julien. Disentanglement via Mechanism Sparsity Regularization: A New Principle for Nonlinear ICA. In: First Conference on Causal Learning and Reasoning. 2021.
[18] J. von Kügelgen, Y. Sharma, L. Gresele, W. Brendel, B. Schölkopf, M. Besserve, and F. Locatello. Self-Supervised Learning with Data Augmentations Provably Isolates Content from Style. 2022.
[19] F. Locatello, B. Poole, G. Rätsch, B. Schölkopf, O. Bachem, and M. Tschannen. Weakly-supervised disentanglement without compromises. In: International Conference on Machine Learning. PMLR, 2020, pp. 6348-6359.
[20] J. Brehmer, P. De Haan, P. Lippe, and T. S. Cohen. Weakly supervised causal representation learning. In: Advances in Neural Information Processing Systems 35 (2022), pp. 38319-38331.
[21] L. Wendong, A. Kekić, J. von Kügelgen, S. Buchholz, M. Besserve, L. Gresele, and B. Schölkopf. Causal Component Analysis. arXiv:2305.17225 [cs, stat]. 2023.
[22] K. Zhang, S. Xie, I. Ng, and Y. Zheng. Causal Representation Learning from Multiple Distributions: A General Setting. arXiv:2402.05052 [cs, stat]. 2024.
[23] B. Varici, E. Acartürk, K. Shanmugam, and A. Tajer. General identifiability and achievability for causal representation learning. In: International Conference on Artificial Intelligence and Statistics. PMLR, 2024, pp. 2314-2322.
[24] S. Lachapelle, P. R. López, Y. Sharma, K. Everett, R. Le Priol, A. Lacoste, and S. Lacoste-Julien.
Nonparametric partial disentanglement via mechanism sparsity: Sparse actions, interventions and sparse temporal dependencies. In: arXiv preprint arXiv:2401.04890 (2024).
[25] Z. Jin, J. Liu, Z. Lyu, S. Poff, M. Sachan, R. Mihalcea, M. Diab, and B. Schölkopf. Can Large Language Models Infer Causation from Correlation? arXiv:2306.05836 [cs]. 2023.
[26] M. Zečević, M. Willig, D. S. Dhami, and K. Kersting. Causal Parrots: Large Language Models May Talk Causality But Are Not Causal. In: Transactions on Machine Learning Research (2023).
[27] Y. Pan and E. Bareinboim. Counterfactual Image Editing. In: arXiv preprint arXiv:2403.09683 (2024).
[28] P. C. Austin. An introduction to propensity score methods for reducing the effects of confounding in observational studies. In: Multivariate Behavioral Research 46.3 (2011), pp. 399-424.
[29] M. Brookhart, T. Stürmer, R. Glynn, J. Rassen, and S. Schneeweiss. Confounding control in healthcare database research: challenges and potential approaches. In: Medical Care 48.6 (2010), S114-S120.
[30] C. Wachinger, B. G. Becker, A. Rieckmann, and S. Pölsterl. Quantifying Confounding Bias in Neuroimaging Datasets with Causal Inference. In: Medical Image Computing and Computer Assisted Intervention - MICCAI 2019. Ed. by D. Shen, T. Liu, T. M. Peters, L. H. Staib, C. Essert, S. Zhou, P.-T. Yap, and A. Khan. Cham: Springer International Publishing, 2019, pp. 484-492.
[31] F. Mahmood, R. Chen, and N. J. Durr. Unsupervised Reverse Domain Adaptation for Synthetic Medical Images via Adversarial Training. In: IEEE Transactions on Medical Imaging 37.12 (2018), pp. 2572-2581.
[32] A. Li, A. Jaber, and E. Bareinboim. Causal discovery from observational and interventional data across multiple environments. In: Thirty-seventh Conference on Neural Information Processing Systems. 2023.
[33] E. Bareinboim and J. Pearl. Transportability of Causal Effects: Completeness Results. In: Proceedings of the AAAI Conference on Artificial Intelligence 26.1 (2012), pp. 698-704.
[34] J. Pearl and E. Bareinboim. Transportability across studies: A formal approach. 2018.
[35] E. Bareinboim and J. Pearl. Meta-Transportability of Causal Effects: A Formal Approach. In: Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics. PMLR, 2013, pp. 135-143.
[36] E. Bareinboim and J. Pearl. Causal inference and the data-fusion problem. In: Proceedings of the National Academy of Sciences 113.27 (2016), pp. 7345-7352.
[37] P. Hünermund and E. Bareinboim. Causal Inference and Data Fusion in Econometrics. arXiv:1912.09104 [econ]. 2023.
[38] A. Li, C. Huynh, Z. Fitzgerald, I. Cajigas, D. Brusko, J. Jagid, A. O. Claudio, A. M. Kanner, J. Hopp, S. Chen, J. Haagensen, E. Johnson, W. Anderson, N. Crone, S. Inati, K. A. Zaghloul, J. Bulacio, J. Gonzalez-Martinez, and S. V. Sarma. Neural fragility as an EEG marker of the seizure onset zone. In: Nature Neuroscience 24.10 (2021), pp. 1465-1474.
[39] A. Li, P. Myers, N. Warsi, K. M. Gunnarsdottir, S. Kim, V. Jirsa, A. Ochi, H. Otusbo, G. M. Ibrahim, and S. V. Sarma. Neural Fragility of the Intracranial EEG Network Decreases after Surgical Resection of the Epileptogenic Zone. 2022.
[40] A. Li, B. Chennuri, S. Subramanian, R. Yaffe, S. Gliske, W. Stacey, R. Norton, A. Jordan, K. Zaghloul, S. Inati, S. Agrawal, J. Haagensen, J. Hopp, C. Atallah, E. Johnson, N. Crone, W. Anderson,
Z. Fitzgerald, J. Bulacio, J. Gale, S. Sarma, and J. Gonzalez-Martinez. Using network analysis to localize the epileptogenic zone from invasive EEG recordings in intractable focal epilepsy. In: Network Neuroscience 2.2 (2017).
[41] J. M. Bernabei, A. Li, A. Y. Revell, R. J. Smith, K. M. Gunnarsdottir, I. Z. Ong, K. A. Davis, N. Sinha, S. Sarma, and B. Litt. Quantitative approaches to guide epilepsy surgery from intracranial EEG. In: Brain (2023), awad007.
[42] K. M. Gunnarsdottir, A. Li, R. J. Smith, J.-Y. Kang, A. Korzeniewska, N. E. Crone, A. G. Rouse, J. J. Cheng, M. J. Kinsman, P. Landazuri, U. Uysal, C. M. Ulloa, N. Cameron, I. Cajigas, J. Jagid, A. Kanner, T. Elarjani, M. M. Bicchi, S. Inati, K. A. Zaghloul, V. L. Boerwinkle, S. Wyckoff, N. Barot, J. Gonzalez-Martinez, and S. V. Sarma. Source-sink connectivity: a novel interictal EEG marker for seizure localization. In: Brain 145.11 (2022), pp. 3901-3915.
[43] L. Nobili, B. Frauscher, S. Eriksson, S. Gibbs, H. Peter, I. Lambert, R. Manni, L. Peter-Derex, P. Proserpio, F. Provini, A. Weerd, and L. Parrino. Sleep and epilepsy: A snapshot of knowledge and future research lines. In: Journal of Sleep Research 31 (2022).
[44] A. Bagshaw, J. Jacobs, P. Le Van, F. Dubeau, and J. Gotman. Effect of sleep stage on interictal high-frequency oscillations recorded from depth macroelectrodes in patients with focal epilepsy. In: Epilepsia 50 (2008), pp. 617-628.
[45] S. Gibbs, P. Proserpio, M. Terzaghi, A. Pigorini, S. Sarasso, G. Russo, L. Tassi, and L. Nobili. Sleep-related epileptic behaviors and non-REM-related parasomnias: Insights from stereo-EEG. In: Sleep Medicine Reviews 63 (2015).
[46] P. Greene, A. Li, J. Gonzalez-Martinez, and S. Sarma. Classification of Stereo-EEG Contacts in White Matter vs. Gray Matter Using Recorded Activity. In: Frontiers in Neurology 11 (2021).
[47] A. A. Borbély, F. Baumann, D. Brandeis, I. Strauch, and D. Lehmann. Sleep deprivation: Effect on sleep stages and EEG power density in man. In: Electroencephalography and Clinical Neurophysiology 51.5 (1981), pp. 483-493.
[48] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is All you Need. In: Advances in Neural Information Processing Systems. Vol. 30. Curran Associates, Inc., 2017.
[49] R. Bommasani et al. On the Opportunities and Risks of Foundation Models. 2022.
[50] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language Models are Few-Shot Learners. 2020.
[51] J. von Kügelgen, M. Besserve, L. Wendong, L. Gresele, A. Kekić, E. Bareinboim, D. M. Blei, and B. Schölkopf. Nonparametric Identifiability of Causal Representations from Unknown Interventions. arXiv:2306.00542 [cs, stat]. 2023.
[52] K. Ahuja, D. Mahajan, Y. Wang, and Y. Bengio. Interventional causal representation learning. In: International Conference on Machine Learning. PMLR, 2023, pp. 372-407.
[53] D. Yao, D. Xu, S. Lachapelle, S. Magliacane, P. Taslakian, G. Martius, J. von Kügelgen, and F. Locatello. Multi-View Causal Representation Learning with Partial Observability. 2024.
[54] M. Yang, F. Liu, Z. Chen, X. Shen, J. Hao, and J. Wang.
CausalVAE: Disentangled Representation Learning via Neural Structural Causal Models. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021, pp. 9588-9597.
[55] J. Zhang, K. Greenewald, C. Squires, A. Srivastava, K. Shanmugam, and C. Uhler. Identifiability guarantees for causal disentanglement from soft interventions. In: Advances in Neural Information Processing Systems 36 (2024).
[56] A. Li, J. Feitelberg, A. P. Saini, R. Höchenberger, and M. Scheltienne. MNE-ICALabel: Automatically annotating ICA components with ICLabel in Python. In: Journal of Open Source Software 7.76 (2022), p. 4484.
[57] J. Tian and J. Pearl. On the testable implications of causal models with hidden variables. In: arXiv preprint arXiv:1301.0608 (2012).
[58] J. Correa and E. Bareinboim. A Calculus for Stochastic Interventions: Causal Effect Identification and Surrogate Experiments. In: Proceedings of the AAAI Conference on Artificial Intelligence 34.06 (2020), pp. 10093-10100.
[59] J. Correa and E. Bareinboim. General Transportability of Soft Interventions: Completeness Results. In: Advances in Neural Information Processing Systems. Vol. 33. Curran Associates, Inc., 2020, pp. 10902-10912.
[60] P. R. Rosenbaum and D. B. Rubin. The Central Role of the Propensity Score in Observational Studies for Causal Effects. In: Biometrika 70.1 (1983), pp. 41-55.
[61] J. Pearl. Causal Diagrams for Empirical Research. In: Biometrika 82.4 (1995), pp. 669-688.
[62] J. Pearl and E. Bareinboim. Transportability of Causal and Statistical Relations: A Formal Approach. In: Proceedings of the AAAI Conference on Artificial Intelligence 25.1 (2011), pp. 247-254.
[63] S. Lee, J. D. Correa, and E. Bareinboim. General identifiability with arbitrary surrogate experiments. 2019.
[64] F. Locatello, S. Bauer, M. Lucic, G. Raetsch, S. Gelly, B. Schölkopf, and O. Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In: International Conference on Machine Learning. PMLR, 2019, pp. 4114-4124.
[65] R. Perry, J. von Kügelgen, and B. Schölkopf. Causal Discovery in Heterogeneous Environments Under the Sparse Mechanism Shift Hypothesis. arXiv:2206.02013 [cs, stat]. 2022.
[66] B. Huang, K. Zhang, M. Gong, and C. Glymour. Causal Discovery and Forecasting in Nonstationary Environments with State-Space Models. In: Proceedings of the 36th International Conference on Machine Learning. PMLR, 2019, pp. 2901-2910.
[67] B. Huang, C. J. H. Low, F. Xie, C. Glymour, and K. Zhang. Latent hierarchical causal structure discovery with rank constraints. In: arXiv preprint arXiv:2210.01798 (2022).
[68] J. Peters, P. Bühlmann, and N. Meinshausen. Causal inference using invariant prediction: identification and confidence intervals. arXiv:1501.01332 [stat]. 2015.
[69] J. M. Mooij, S. Magliacane, and T. Claassen. Joint causal inference from multiple contexts. In: The Journal of Machine Learning Research 21.1 (2020), pp. 3919-4026.
[70] M. Okamoto. Distinctness of the eigenvalues of a quadratic form in a multivariate sample. In: The Annals of Statistics (1973), pp. 763-765.
[71] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[72] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[73] S. L. Lauritzen, A. P. Dawid, B. N. Larsen, and H.-G. Leimer.
Independence properties of directed Markov fields. In: Networks 20.5 (1990), pp. 491-505.
[74] K. Rajamanickam. A Mini Review on Different Methods of Functional-MRI Data Analysis. In: Archives of Internal Medicine Research 3 (2020), pp. 44-60.
[75] D. Nuzillard and A. Bijaoui. Blind source separation and analysis of multispectral astronomical images. In: Astronomy and Astrophysics Supplement Series 147.1 (2000), pp. 129-138.
[76] E. Bingham and A. Hyvarinen. ICA of complex valued signals: a fast and robust deflationary algorithm. In: Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000). Vol. 3. 2000, pp. 357-362.
[77] A. D. Back and A. S. Weigend. A First Application of Independent Component Analysis to Extracting Structure from Stock Returns. In: Econometrics: Applied Econometrics & Modeling eJournal (1997).
[78] E. Bingham, J. Kuusisto, and K. Lagus. ICA and SOM in text document analysis. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2002, pp. 361-362.
[79] B. Varıcı, E. Acartürk, K. Shanmugam, and A. Tajer. Linear Causal Representation Learning from Unknown Multi-node Interventions. In: arXiv preprint arXiv:2406.05937 (2024).
[80] S. Bing, U. Ninad, J. Wahl, and J. Runge. Identifying linearly-mixed causal representations from multi-node interventions. In: arXiv preprint arXiv:2311.02695 (2023).
[81] L. Gresele, G. Fissore, A. Javaloy, B. Schölkopf, and A. Hyvärinen. Relative gradient optimization of the Jacobian term in unsupervised deep learning. 2020.
[82] G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan. Normalizing Flows for Probabilistic Modeling and Inference. 2021.
[83] C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios. Neural Spline Flows. 2019.
[84] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. 2017.
[85] K. Sachs, O. Perez, D. Pe'er, D. A. Lauffenburger, and G. P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. In: Science 308.5721 (2005), pp. 523-529.
[86] J. M. Robins, M. A. Hernán, and B. Brumback. Marginal structural models and causal inference in epidemiology. In: Epidemiology 11.5 (2000), pp. 550-560.
[87] J. Tian and J. Pearl. A General Identification Condition for Causal Effects. In: AAAI (2002).
[88] I. Shpitser and J. Pearl. Identification of Joint Interventional Distributions in Recursive Semi-Markovian Causal Models. In: AAAI Proceedings (2006), pp. 1219-1226.
[89] M. Kocaoglu, K. Shanmugam, and E. Bareinboim. Experimental Design for Learning Causal Graphs with Latent Variables. In: Advances in Neural Information Processing Systems 30 (2017).
[90] M. Kocaoglu, A. Jaber, K. Shanmugam, and E. Bareinboim. Characterization and Learning of Causal Graphs with Latent Variables from Soft Interventions. In: Advances in Neural Information Processing Systems 32 (2019).
[91] A. Jaber, M. Kocaoglu, K. Shanmugam, and E. Bareinboim. Causal Discovery from Soft Interventions with Unknown Targets: Characterization and Learning. In: Advances in Neural Information Processing Systems. Vol. 33.
[92] X. Shen, F. Liu, H. Dong, Q. Lian, Z. Chen, and T. Zhang. Weakly Supervised Disentangled Generative Causal Representation Learning. In: Journal of Machine Learning Research 23 (2022), pp. 1–55. [93] K. Xia, K.-Z. Lee, Y. Bengio, and E. Bareinboim. The Causal-Neural Connection: Expressiveness, Learnability, and Inference. In: Advances in Neural Information Processing Systems 34 (2021). [94] K. Xia, Y. Pan, and E. Bareinboim. Neural Causal Models for Counterfactual Identification and Estimation. In: International Conference on Learning Representations. 2022. [95] A. Jaber, A. H. Ribeiro, J. Zhang, and E. Bareinboim. Causal Identification under Markov Equivalence: Calculus, Algorithm, and Completeness. In: Advances in Neural Information Processing Systems. 2022. [96] A. Jaber, J. Zhang, and E. Bareinboim. Causal Identification under Markov Equivalence: Completeness Results. PMLR, 2019, pp. 2981–2989. [97] T. V. Anand, A. H. Ribeiro, J. Tian, and E. Bareinboim. Causal effect identification in cluster DAGs. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 37. 10. 2023, pp. 12172–12179. [98] I. Shpitser and J. Pearl. Complete Identification Methods for the Causal Hierarchy. In: Journal of Machine Learning Research 9 (2008), pp. 1941–1979. [99] J. Zhang, J. Tian, and E. Bareinboim. Partial Counterfactual Identification from Observational and Experimental Data. In: (2021).

A Background and Assumptions
  A.1 Notations
  A.2 Assumptions and Remarks
  A.3 Domains vs Interventions
  A.4 Permutation Indeterminancy
B CRID Algorithm Details
C Proofs
  C.1 "Distributions Change Sufficiently" - Proof of Lemma 1
  C.2 Distribution comparison - Proof of Proposition 1
  C.3 Invariant factors - Proof of Proposition 2
  C.4 ID of the Q set w.r.t. Canceled Factors - Proof of Proposition 3 and Lemma 2
  C.5 ID within the Q set - Proof of Proposition 4 and Lemma 3
  C.6 ID-reverse of existing disentangled variables - Proof of Proposition 5
  C.7 Soundness of the Latent ID Algorithm - Proof of Thm. 1
D Examples and Discussion
  D.1 Additional Example Illustrating the Motivation of Causal Disentangled Learning
  D.2 Examples for the non-Markovian Factorization
E Examples for Proposition 2
  E.1 The detailed examples of Propositions 3 and 4
F Related Work Discussion
  F.1 Causal representation learning with unknown latent causal structure
  F.2 Comparisons with other identifiability criteria
  F.3 Case study on challenges when disentangling variables in a non-Markovian setting
  F.4 ID within c-components
  F.5 Case study on disentangling variables in a Markovian setting
  F.6 Comparing different identifiability results
G Experimental Results
  G.1 Synthetic data-generating process
  G.2 Image Editing Using Disentangled Representations
  G.3 Model
  G.4 Training details
  G.5 Evaluation metrics
  G.6 Limitations
  G.7 Discussion of Results
H Broader Impact and Forward-Looking Statements
I Frequently Asked Questions

A Background and Assumptions

A.1 Notations

$[d]$: $\{1, 2, \ldots, d\}$
$\mathcal{M}$: an ASCM (Def. 2.1) describing the data-generating process of $d$ latent variables $\mathbf{V} \in \mathbb{R}^d$ and an observed high-dimensional mixture $\mathbf{X} \in \mathbb{R}^m$
$\mathcal{G}$: Latent Causal Graph (LCG) over $\mathbf{V}$ induced by an $\mathcal{M}$
$Pa(V), Pa^+_V$: the union of the parents of $V$ and $V$ itself
$C(V)$: C-component of $V$ (Def. 6.2)
$\mathbf{M}$: a set of $N$ ASCMs $\mathcal{M}^1, \ldots, \mathcal{M}^N$ (sharing the mixing function $f_{\mathbf{X}}$), relative to domains $\Pi = \{\Pi^1, \ldots, \Pi^N\}$
$\mathcal{G}^S$: Latent Selection Diagram (LSD, Def. 2.2) induced by $\mathbf{M}$
$\Sigma = \{\sigma^{(k)}\}_{k=1}^K$: a set of $K \geq N$ interventions applied to $\mathbf{M}$; each intervention $\sigma^{(k)}$ can be idle, perfect, or another soft intervention that does not alter the structure of $\mathcal{G}$
$\Pi_\Sigma = \{\Pi^{(k)}\}_{k=1}^K$: the corresponding domains of the interventions $\Sigma$; $\sigma^{(k)}$ is applied in $\Pi^{(k)}$
$\Psi = \{I^{(k)}\}_{k=1}^K$: the collection of intervention target sets of $\Sigma$; $I^{(k)} = \{V_i^{\Pi^{(k)},\{b\},t}, \ldots\}$ is the target set of $\sigma^{(k)}$
$\{b\}$: the mechanism label of an intervention; when $b$ is omitted, interventions default to having different mechanisms. Also, the mechanisms of different variables are different: the mechanism of $V_1^{\{1\}}$ is not equal to that of $V_2^{\{1\}}$
$t$: whether an intervention is perfect; $t = do$ means perfect, and an omitted $t$ defaults to not perfect
$Perf[I^{(k)}]$: the variables perfectly intervened on in $\sigma^{(k)}$
$\Psi^{perf}_{\mathbf{T}}$: the collection of intervention target sets that contain a perfect intervention on $\mathbf{T}$
$\mathcal{P} = \{P^{(k)}\}_{k=1}^K$: the set of distributions induced by $\mathbf{M}$ under the interventions $\Sigma$; $P^{(k)} = P^{\Pi^{(k)}}(\mathbf{X}; \sigma^{(k)})$
$Pa^{\mathbf{T}+}(V), Pa^{\mathbf{T}+}_V$: extended parents from the factorization in Eq. (1)
$\bar{\mathbf{V}}[I^{(j)}, I^{(k)}, \mathcal{G}^S]$: the changed variable set constructed in Proposition 2; $\bar{\mathbf{V}}$ or $\bar{\mathbf{V}}^{(j),(k)}$ for short when an index is needed
$C(\bar{\mathbf{V}})$: the C-component of $\bar{\mathbf{V}}[I^{(j)}, I^{(k)}, \mathcal{G}^S]$; the factor $P(v_i \mid pa^{\mathbf{T}+})$ for $V_i \in \mathbf{V} \setminus C(\bar{\mathbf{V}})$ remains invariant in Eq. (2)
$Q[I^{(j)}, I^{(k)}, \mathbf{T}, \mathcal{G}^S]$: the Q set defined in Def. 3.1; the variables in the Q set remain from Eq. (2) to Eq. (3); $Q$ or $Q^{(j),(k)}$ for short when an index is needed
Canceled variables: the complement of the Q set, $\mathbf{V} \setminus Q$

Figure S1: Table of Notations

We give an example to illustrate the notation for a collection of intervention target sets $\Psi$ and the individual intervention target sets $I^{(k)}$.
Example 7. Let an intervention target collection be
$$\Psi = \{I^{(1)} = \{\emptyset^{\Pi_1}\},\ I^{(2)} = \{V_1^{\Pi_1,\{1\}}\},\ I^{(3)} = \{V_1^{\Pi_2,\{2\}}, V_2^{\Pi_2,\{1\},perf}\},\ I^{(4)} = \{V_1^{\Pi_2,\{1\}}, V_2^{\Pi_2,perf}\}\} \quad (4)$$
In words, $\Psi$ indicates 4 different interventions $\Sigma = \{\sigma^{(k)}\}_{k=1}^4$:
$\sigma^{(1)}$: an idle intervention is applied, resulting in an observational distribution in domain $\Pi_1$.
$\sigma^{(2)}$: a soft intervention with mechanism $\{1\}$ is applied to $V_1$ in domain $\Pi_1$.
$\sigma^{(3)}$: an intervention is applied to $V_1$ and $V_2$ in domain $\Pi_2$, where the mechanism of $V_1$ differs from that in $\sigma^{(2)}$ and the intervention on $V_2$ is perfect.
$\sigma^{(4)}$: an intervention is applied to $V_1$ and $V_2$ in domain $\Pi_2$, where the mechanism of $V_1$ is the same as in $\sigma^{(2)}$ and the mechanism of $V_2$ differs from that in $\sigma^{(3)}$.
$Perf[I^{(3)}] = \{V_2\}$ means that $\sigma^{(3)}$ perfectly intervenes on $\{V_2\}$. $\Psi^{perf}_{V_2} = \{I^{(3)}, I^{(4)}\}$ collects the intervention targets that contain a perfect intervention on $V_2$, and $\Psi^{perf}_{\emptyset} = \Psi$. Also, the mechanism of $V_1^{\Pi_1,\{1\}}$ is different from the mechanism of $V_2^{\Pi_2,\{1\}}$, since the variables are different.

A.2 Assumptions and Remarks

In this paper, we make a few key assumptions about interventions and the differences between domains. We leverage assumptions similar to those proposed in the causal representation learning literature on handling multiple domains and interventions [21, 22, 32, 51]. We discuss those assumptions and their implications here.

Remark 1 (Mixing is invertible). As a consequence of Def. 2.1, the mixing function $f_{\mathbf{X}}$ is invertible, ensuring that the latent variables are uniquely learnable [9, 10, 17, 64]. The mapping from the generative factors $\mathbf{V}$ to the high-dimensional mixture $\mathbf{X}$ is one-to-one. Consider images: in one direction, $\mathbf{V}$ constructs the image through a mixing device $f_{\mathbf{X}}$ (such as a camera lens); in the reverse direction, the generative factors $\mathbf{V}$ can be uniquely recovered through $f_{\mathbf{X}}^{-1}$. Take the face-image example of Sec. D.1: the generative factors Gender, Age, and Hair Color are directly expressed through the pixels of the images, so given an image, the values of these generative factors are uniquely determined. This assumption is commonly used in the nonlinear ICA and representation learning literature [9, 10, 17, 64].

Remark 2 (Confounders are not part of the mixing function). According to Def. 2.1, the latent exogenous variables $\mathbf{U}$ influence the high-dimensional mixture $\mathbf{X}$ only through the latent causal variables $\mathbf{V}$, so unobserved confounding does not directly affect the mixing function. A real-world example arises when modeling high-dimensional T1 MRI scans. Let the LCG comprise Drug Treatment $\rightarrow$ Outcome, confounded by socioeconomic status (Drug Treatment $\leftrightarrow$ Outcome). The drug treatment and outcome are assumed to be visually discernible on the MRI; socioeconomic status, however, does not directly impact how the MRI appears, except through its effect on the drug treatment efficacy or outcome. Similarly, in EEG data, sleep quality and drug treatment may influence the EEG's appearance, while socioeconomic status may confound sleep and drug treatment without directly affecting the EEG. This idea is also present in prior work such as nonlinear ICA, where independent exogenous variables $U_i$ each point to a single $V_i$ [7].

Remark 3 (Shared causal structure). As a consequence of Def. 2.2, each environment's ASCM shares the same latent causal graph, with no structural changes among the latent variables.^16 This means that the S-nodes will not represent structural changes, such as $V_i$ having a different parent set across domains.
^16 The assumption that there are no structural changes between domains can be relaxed; this has been considered in the context of inference when the causal variables are fully observed, as discussed in [33]. This is an interesting topic for future exploration, and we do not pursue this avenue here.

Remark 4 (Mixing function is shared across all domains). By Def. 2.1, the mixing function $f_{\mathbf{X}}$ is the same for all ASCMs $\mathcal{M}^i \in \mathbf{M}$, enabling cross-domain analysis. If the mixing function varied across distributions, the latent representations would not be identifiable from i.i.d. data alone [9, 51]. A shared mixing function is needed in the multi-domain setting because, if everything may change across environments, each domain can only be analyzed in isolation, making it impossible to leverage the changes (and similarities) across domains.

Assumptions for Interventions. We now discuss the assumptions related to interventions.

Assumption 1 (Soft interventions do not alter the causal structure). Interventions do not change the causal diagram.

This work allows both perfect interventions, which cut all incoming parent edges, and soft interventions, which preserve them [59]. More general interventions may arbitrarily change the parent set of any given node [59]; we do not consider such interventions and leave this general case for future work. This assumption thus precludes any soft intervention that modifies the graphical structure of the causal diagram.

Note that Assumption 1 does not mean that interventions cannot occur with the same mechanism across domains. For example, consider two hospitals $\Pi_1$ and $\Pi_2$. Treating epilepsy in these hospitals can lead to vastly different outcomes due to the differences between the domains [38, 39, 41]; this is represented graphically in $\mathcal{G}^S$ with $S_{1,2} \rightarrow$ Outcome. However, if a neurologist who controls every aspect of the treatment procedure treats patients in both hospitals herself for the purposes of an experiment, then the outcomes will not differ in distribution; this is represented graphically by removing the S-node pointing to the Outcome variable. Thus, if a pair of interventions occurring in different domains is deemed to have the same mechanism, the S-node (if one points to the intervened variable) is removed when comparing these two distributions.

Another assumption we make is that all interventions have known targets.

Assumption 2 (Known-target interventions). All interventions occur with known targets, reducing the permutation indeterminacy for intervened variables.

That is, for each interventional distribution we have, we know which interventions occurred and at which node(s). This assumption allows us to reduce the permutation indeterminacy that would arise if the intervention targets were unknown.
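To make the notation concrete, the following is a minimal Python sketch (all names are hypothetical and not part of any described implementation) that encodes the intervention target collection of Example 7 and computes $Perf[I^{(k)}]$ and $\Psi^{perf}_{\mathbf{T}}$ from it.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical encoding of a single-variable intervention V_i^{Pi,{b},t}.
# mechanism=None models an omitted {b} (distinct mechanisms by default).
@dataclass(frozen=True)
class Target:
    variable: str                    # e.g. "V1"
    domain: str                      # e.g. "Pi1"
    mechanism: Optional[int] = None  # the shared-mechanism label {b}
    perfect: bool = False            # t == "do"

# Example 7: the four intervention target sets I^(1)..I^(4).
I1 = frozenset()  # idle intervention in Pi1 -> observational distribution
I2 = frozenset({Target("V1", "Pi1", mechanism=1)})
I3 = frozenset({Target("V1", "Pi2", mechanism=2),
                Target("V2", "Pi2", mechanism=1, perfect=True)})
I4 = frozenset({Target("V1", "Pi2", mechanism=1),
                Target("V2", "Pi2", perfect=True)})
Psi = [I1, I2, I3, I4]

def perf(I):
    """Perf[I]: the variables perfectly intervened on in I."""
    return {t.variable for t in I if t.perfect}

def psi_perf(Psi, T):
    """Psi^perf_T: target sets whose perfect interventions cover T."""
    return [I for I in Psi if set(T) <= perf(I)]

print(perf(I3))                                           # {'V2'}
print([Psi.index(I) + 1 for I in psi_perf(Psi, {"V2"})])  # [3, 4]
print(len(psi_perf(Psi, set())) == len(Psi))              # True
```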
In this work, we are also not concerned with permutation indeterminacy for variables we do not intervene on, because we are mostly concerned with disentanglement w.r.t. the intervened variables (see Appendix A.4). It would be interesting for future work to consider unknown intervention targets.

Assumptions for Distributions. In Sec. 2, we assume that each distribution resulting from an intervention is sufficiently distinct from the others (Assumption 4). Here we formally define and illustrate what "changing sufficiently" means.

Assumption 4 (Changing Sufficiently). Consider a collection of ASCMs $\mathbf{M}$ and a set of distributions $\mathcal{P}$ induced by $\mathbf{M}$ from a collection of interventions $\Sigma$. Let the LSD induced by $\mathbf{M}$ be $\mathcal{G}^S$. Let $\mathcal{P}_{\mathbf{T}} = \{P^{(a_0)}, P^{(a_1)}, \ldots, P^{(a_L)}\} \subseteq \mathcal{P}$ be any collection of distributions such that $\mathbf{T} = Perf[I^{(a_0)}] \subseteq Perf[I^{(a_l)}]$ for all $l \in [L]$, meaning that the perfect interventions of the baseline distribution must be exactly on $\mathbf{T}$, and all other distributions must at least contain $\mathbf{T}$ among their perfect interventions. Let $\bar{Q} = \bigcup_{l \in [L]} Q[I^{(a_l)}, I^{(a_0)}, \mathbf{T}, \mathcal{G}^S]$ (Def. 3.1). It is assumed that:

1. The probability density function of $\mathbf{V}$ is smooth and positive, i.e., $p^{(a_l)}_{\mathbf{T}}(\mathbf{v})$ is smooth and $p^{(a_l)}_{\mathbf{T}}(\mathbf{v}) > 0$ almost everywhere.

2. (First-order discrepancy) If there exist $\{a'_1, \ldots, a'_{|\bar{Q}|}\} \subseteq \{a_1, \ldots, a_L\}$ such that for every $V_q \in \bar{Q}$, $V_q \in Q[I^{(a'_q)}, I^{(a_0)}, \mathbf{T}, \mathcal{G}^S]$, then $\{\omega_1(\mathbf{v}, a_1), \ldots, \omega_1(\mathbf{v}, a_L)\}$ are linearly independent, where
$$\omega_1(\mathbf{v}, a_l) = \Big( \frac{\partial \big( \log p^{(a_l)}_{\mathbf{T}}(\mathbf{v}) - \log p^{(a_0)}_{\mathbf{T}}(\mathbf{v}) \big)}{\partial v_q} \Big)_{V_q \in \bar{Q}}$$

3. (Second-order discrepancy) Let the set $E$ consist of the pairs $\{V_p, V_q\}$ that appear together in at least one Q set and such that $V_p$ is connected with $V_q$ conditioned on the remaining variables, namely
$$E = \big\{ \epsilon_j = \{V_p, V_q\} \mid \text{(i) } \exists a_l,\ \{V_p, V_q\} \subseteq Q^{(a_l),(a_0)}; \text{ (ii) } V_p \text{ is d-connected to } V_q \text{ given } \bar{Q} \setminus \{V_p, V_q\} \text{ in } \mathcal{G}_{\bar{\mathbf{T}}}(\bar{Q}) \big\}$$
If there exist $\{a'_1, \ldots, a'_{2|\bar{Q}|+|E|}\} \subseteq \{a_1, \ldots, a_L\}$ such that for every $V_q \in \bar{Q}$, $V_q \in Q^{(a'_i),(a_0)}$ and $V_q \in Q^{(a'_{|\bar{Q}|+i}),(a_0)}$, and for every $\epsilon_j \in E$, $\epsilon_j \subseteq Q^{(a'_{2|\bar{Q}|+j}),(a_0)}$, then $\{\omega_2(\mathbf{v}, a_1), \ldots, \omega_2(\mathbf{v}, a_L)\}$ are linearly independent, where
$$\omega_2(\mathbf{v}, a_l) = \Big( \Big( \frac{\partial (\log p^{(a_l)}_{\mathbf{T}} - \log p^{(a_0)}_{\mathbf{T}})}{\partial v_q} \Big)_{V_q \in \bar{Q}},\ \Big( \frac{\partial^2 (\log p^{(a_l)}_{\mathbf{T}} - \log p^{(a_0)}_{\mathbf{T}})}{\partial v_q^2} \Big)_{V_q \in \bar{Q}},\ \Big( \frac{\partial^2 (\log p^{(a_l)}_{\mathbf{T}} - \log p^{(a_0)}_{\mathbf{T}})}{\partial v_p \partial v_q} \Big)_{\{V_p, V_q\} \in E} \Big)$$

At a high level, this assumption is naturally satisfied if the ASCMs and interventions are chosen at random, and it is violated only if the probability densities of $P^{(j)}$ and $P^{(k)}$ are fine-tuned to each other [51]. Assumptions of this kind are common in the causal representation learning literature, such as the "genericity" assumption [51], the "interventional discrepancy" assumption [21], and the "sufficient changes" assumption [10, 22]. The assumption contains two linear-independence constraints: the first-order and second-order partial derivatives of the log-discrepancy from $P^{(a_l)}$ to $P^{(a_0)}$ must be linearly independent across distributions. The two existence conditions are imposed out of necessity, since the linear-independence constraints can hold only if these conditions hold.
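The first-order condition can be pictured numerically. Below is a minimal sketch (the Q-set pattern is hypothetical) that stacks $\omega_1$ vectors into a matrix whose entry for $V_q$ is zero whenever $V_q$ lies outside the corresponding Q set and generic otherwise, mirroring the construction in the proof of Lemma 1; when every variable in $\bar{Q}$ is covered by some comparison, the matrix is full rank almost surely.

```python
import numpy as np

rng = np.random.default_rng(0)
Q_bar = ["V1", "V2", "V3"]
# Hypothetical Q sets from three comparisons against the baseline; each
# variable of Q_bar is covered, as condition [2] of Assumption 4 requires.
Q_sets = [{"V1"}, {"V1", "V2"}, {"V1", "V2", "V3"}]

# Row l ~ omega_1(v, a_l): zero outside Q^{(a_l),(a_0)}, generic inside.
A = np.array([[rng.standard_normal() if v in Q else 0.0 for v in Q_bar]
              for Q in Q_sets])

print(np.linalg.matrix_rank(A) == len(Q_bar))  # True almost surely
```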
The following example illustrates the necessity of the first-order condition.

Example 8 (Distributions do not change sufficiently). Consider the Q sets obtained after three comparisons:
$$Q^{(1),(0)} = \{V_1\}, \quad Q^{(2),(0)} = \{V_1\}, \quad Q^{(3),(0)} = \{V_1, V_2, V_3\} \quad (8)$$
Let $\bar{Q} = \{V_1, V_2, V_3\}$. We have
$$\frac{\partial \big( \log p^{(1)}_{\mathbf{T}}(\mathbf{v}) - \log p^{(0)}_{\mathbf{T}}(\mathbf{v}) \big)}{\partial v_2} = 0 \quad (9)$$
since $V_2 \notin Q^{(1),(0)}$. Similarly, we know
$$\omega_1(\mathbf{v}, a_1) = \Big( \frac{\partial (\log p^{(1)}_{\mathbf{T}} - \log p^{(0)}_{\mathbf{T}})}{\partial v_1}, 0, 0 \Big), \quad \omega_1(\mathbf{v}, a_2) = \Big( \frac{\partial (\log p^{(2)}_{\mathbf{T}} - \log p^{(0)}_{\mathbf{T}})}{\partial v_1}, 0, 0 \Big),$$
$$\omega_1(\mathbf{v}, a_3) = \Big( \frac{\partial (\log p^{(3)}_{\mathbf{T}} - \log p^{(0)}_{\mathbf{T}})}{\partial v_1}, \frac{\partial (\log p^{(3)}_{\mathbf{T}} - \log p^{(0)}_{\mathbf{T}})}{\partial v_2}, \frac{\partial (\log p^{(3)}_{\mathbf{T}} - \log p^{(0)}_{\mathbf{T}})}{\partial v_3} \Big)$$
This implies that $\omega_1(\mathbf{v}, a_1), \omega_1(\mathbf{v}, a_2), \omega_1(\mathbf{v}, a_3)$ are certainly not linearly independent, since the first two vectors are supported on a single coordinate.

From another perspective, violating these assumptions amounts to stating that the probability densities are fine-tuned to each other [51]. Here we give an example of how this assumption can be violated.

Domain | Observational | Interventional
$\Pi^1$ | $P^1_{\{\}}(\mathbf{X})$ | $P^1_{v_i}(\mathbf{X})$, $P^1_{v_j}(\mathbf{X})$, $P^1_{v_i,v_j}(\mathbf{X})$, ...
$\Pi^2$ | $P^2_{\{\}}(\mathbf{X})$ | $P^2_{v_i}(\mathbf{X})$, $P^2_{v_k}(\mathbf{X})$, $P^2_{v_i,v_k}(\mathbf{X})$, ...
... | ... | ...
$\Pi^N$ | $P^N_{\{\}}(\mathbf{X})$ | $P^N_{v_l}(\mathbf{X})$, $P^N_{v_m}(\mathbf{X})$, $P^N_{v_l,v_j}(\mathbf{X})$, ...

Table S1: Possible distributions observed for any given causal representation learning task. Each domain in $\Pi = \{\Pi^1, \Pi^2, \ldots, \Pi^N\}$ may contain observational and interventional distributions over the latent variables $\mathbf{V}$, which are mixed via $f_{\mathbf{X}}$ to generate $\mathbf{X} \in \mathbb{R}^m$. The first row and first column are studied in the existing literature under the lens of the multi-domain intervention exchangeability assumption [32]. Prior work also requires distributions across an entire column (i.e., many domains must be observed) or an entire row (i.e., an intervention per latent variable). This paper discusses a more general disentangled representation learning setting in which an arbitrary combination of distributions from interventions and domains can serve as input (i.e., any combination of cells in the table).

Example 9 (Distributions do not change sufficiently). Consider the intervention targets
$$\Psi = \{I^{(1)} = \{\emptyset^{\Pi_1}\},\ I^{(2)} = \{V_1^{\Pi_1,\{1\}}\},\ I^{(3)} = \{V_2^{\Pi_1,\{2\}}\},\ I^{(4)} = \{V_1^{\Pi_1,\{1\}}, V_2^{\Pi_1,\{2\}}\}\} \quad (11)$$
Choosing $I^{(1)}$ as the baseline, $\mathbf{T} = \emptyset$. The corresponding Q sets are $\{\{V_1\}, \{V_2\}, \{V_1, V_2\}\}$. Let $\bar{Q}$ be the union of the Q sets, which is $\{V_1, V_2\}$. One can verify
$$\omega_1(\mathbf{v}, 2) + \omega_1(\mathbf{v}, 3) = \omega_1(\mathbf{v}, 4) \quad (12)$$
since $I^{(4)}$ is designed as the combination of $I^{(2)}$ and $I^{(3)}$, with matching mechanisms.

We provide the following lemma to justify Assumption 4 formally.

Lemma 1. Assumption 4 almost surely holds.

A.3 Domains vs Interventions

In previous studies, there has been a tendency to conflate the notions of interventions and domain shifts [65-69]. However, it is essential to recognize their distinctness, particularly when considering real-world examples spanning different scientific disciplines that utilize observational and interventional data. The differentiation between interventions and domains is not only conceptually significant but also has implications for causal inference and the characterization of the corresponding causal structures, as noted by [32]. Moreover, it is crucial to avoid conflating these qualitatively distinct concepts, as highlighted in transportability analysis [62]. Pearl and Bareinboim introduced clear semantics for selection nodes (S-nodes, representing environments), presenting a unified representation in the form of selection diagrams [33, 35, 36]. By recognizing these differences, this work leverages any combination of observational and/or interventional data arising from multiple domains, yielding a more general approach to disentanglement learning than prior work (see Table S1). Prior work generally considered either interventions in a single domain (the top row, $\Pi^1$), where there must be an intervention per latent variable [14, 21], or observational distributions from many domains $\Pi^1, \Pi^2, \ldots, \Pi^N$ (the first column, under "Observational").
However, this paper considers a general setting in which we may have an arbitrary collection of interventions or observations from any combination of domains. Here we illustrate some examples of the CRID algorithm using distributions from multiple domains.

Example 10 (CRID with domains). Consider the LSD shown in Fig. 4(a). We have the distributions $\mathcal{P} = \{P^{(1)}, P^{(2)}\} = \{P^{\Pi_1}(\mathbf{X}), P^{\Pi_2}(\mathbf{X})\}$ from the interventions $\Sigma = \{\sigma^{(1)}, \sigma^{(2)}\} = \{\{\}, \{\}\}$. Applying the CRID algorithm, we can determine that $V_1$ is ID w.r.t. $V_2$ and $V_3$. This example illustrates that observational data from two domains can help disentangle a root variable ($V_1$) from all of its descendants.

Example 11 (CRID with interventions across domains with different mechanisms). Consider the LSD shown in Fig. 4(a). We have the distributions $\mathcal{P} = \{P^{(1)}, P^{(2)}\} = \{P^{\Pi_1}(\mathbf{X}), P^{\Pi_2}(\mathbf{X})\}$ from interventions $\Sigma = \{\sigma^{(1)}, \sigma^{(2)}\}$ with targets $\Psi = \{\{V_2\}^{\Pi_1}, \{\}^{\Pi_2}\}$. Applying the CRID algorithm, we can determine that $V_2$ and $V_1$ are ID w.r.t. $V_3$. This demonstrates that, when comparing observational data from one domain with interventional data from a different domain, the only invariant factor is $P(V_3 \mid V_2)$, with $\bar{\mathbf{V}}[\{V_2\}^{\Pi_1}, \{\}^{\Pi_2}, \mathcal{G}^S] = \{V_1, V_2\}$. The canceled variable is $V_3$, and thus we achieve the identifiability result.

Example 12 (CRID with interventions across domains with the same mechanism). Consider the LSD shown in Fig. 4(a). We have the distributions $\mathcal{P} = \{P^{(1)}, P^{(2)}, P^{(3)}\}$ from interventions $\Sigma = \{\sigma^{(1)}, \sigma^{(2)}, \sigma^{(3)}\}$ with targets $\Psi = \{\{V_1^{[i]}, V_2\}^{\Pi_1}, \{\}^{\Pi_2}, \{V_1^{[i]}\}^{\Pi_2}\}$, where $[i]$ marks a shared mechanism. Applying the CRID algorithm, we can determine that $V_1$ is ID w.r.t. $\{V_2, V_3\}$ and $V_2$ is ID w.r.t. $\{V_3\}$, even with an intervention that changes both $V_1$ and $V_2$. When comparing the distributions $P^{(1)}$ and $P^{(3)}$, the factor $P(V_1)$ becomes invariant because the two interventions on $V_1$ share the same mechanism; this removes the possible difference encoded by the S-node on $V_1$ between the domains $\Pi_1$ and $\Pi_2$.

These examples further demonstrate the importance of distinguishing domains from interventions: a difference in mechanism is present when comparing any pair of distributions across two domains $\Pi_i \neq \Pi_j$, which in principle results in additional variables in the Q set. Interventions, however, may allow us to remove variables from this set by increasing the number of invariant factors.

A.4 Permutation Indeterminancy

In the context of causal representation learning, permutation indeterminacy is a significant challenge that arises when attempting to identify latent variables from observed data. It occurs when the ordering of the latent variables is not uniquely determined, leading to multiple equivalent representations (i.e., permutations of the latent variables) that explain the observed data equally well. In the earliest results on disentangled representation learning, linear ICA was known to be identifiable only up to permutation and scaling indeterminacies [6]. Permutation indeterminacy is still present in nonlinear ICA [7], since the independent components may be permuted arbitrarily. Interestingly, when generalizing the problem to the Markovian setting where the latent variables have causal structure (i.e., edges in a causal graph), permutation indeterminacy can be reduced to a graph isomorphism in certain cases. That is, latent variables are exchangeable only with latent variables that preserve the topological ordering of the latent causal graph, rather than with arbitrary latent variables [13, 22, 51]. When the interventions occur with known targets on the latent space, with a unique intervention on every latent variable, there is no permutation indeterminacy [21].
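This graph-isomorphism view can be made concrete with a small sketch (using a hypothetical three-variable collider, which foreshadows the grouped-variables example below): the residual permutation indeterminacy corresponds to the relabelings of the latent variables that leave the latent causal graph unchanged.

```python
from itertools import permutations

# Collider V1 -> V2 <- V3: which relabelings preserve the graph?
nodes = ["V1", "V2", "V3"]
edges = {("V1", "V2"), ("V3", "V2")}

def preserves_graph(labels):
    relabel = dict(zip(nodes, labels))
    return {(relabel[a], relabel[b]) for a, b in edges} == edges

for labels in permutations(nodes):
    if preserves_graph(labels):
        print(dict(zip(nodes, labels)))
# Prints the identity and the swap V1 <-> V3: exactly the two latent
# orderings that structure alone cannot tell apart.
```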
For variables that are intervened on uniquely (i.e., exactly one intervention applied to only that variable), there is no permutation ambiguity. For variables that are intervened on in groups, or not intervened on at all, permutation ambiguity remains:

1. (Grouped variables) These variables are always intervened on together; in the context of our paper, they consistently appear in the same Q set. For example, consider the LCG $V_1 \rightarrow V_2 \leftarrow V_3$. If we only have distributions arising from interventions on $\{V_1, V_3\}$ together with the observational distribution, and we assume the learned representation is fully disentangled, then the learned representation still has a permutation indeterminacy w.r.t. $\{V_1, V_3\}$: $\hat{V}_1$ could be the representation of $V_1$ or of $V_3$, and similarly for $\hat{V}_3$ (see Example 18 for why the permutation can hold).

2. (Non-intervened variables) These variables receive no interventions, and permutation ambiguity remains among them. However, instead of a graph-isomorphism ambiguity, these variables pose a subgraph isomorphism problem, because there may be other variables that change across distributions (i.e., via interventions or changes in domain) and are therefore not permutable with respect to these invariant variables.

Specifically, the identifiability we discuss (Def. 2.3) is considered after such a subgraph-isomorphism permutation. For example, in the collider setting above, where a permutation can occur between $V_1$ and $V_3$, the statement "$V_1$ is ID w.r.t. $\{V_2, V_3\}$" should be read as: there exists a function $\tau$ such that $\pi(\hat{\mathbf{V}})[\hat{V}_1] = \tau(\pi(\mathbf{V})[V_1])$, where $\pi(\mathbf{V})[V_i]$ denotes variable $V_i$ after the permutation of $\mathbf{V}$, and $\pi$ denotes a permutation only in this passage. In this paper, we are primarily concerned with disentanglement, i.e., determining whether the learned representation is disentangled in a general sense; the permutation part is out of our scope.

B CRID Algorithm Details

Here, we provide additional pseudocode for CRID (Alg. 1). First, the following algorithm illustrates how to initialize a fully connected bipartite graph $\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}$. In the initial $\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}$, every true underlying factor in $\mathbf{V}$ points to every representation $\hat{V}_i \in \hat{\mathbf{V}}$, which means each variable $V_i \in \mathbf{V}$ is entangled with all other variables.

Algorithm F.2 FullyConnectedBipartiteGraph: Initialization step. Initialize a fully connected bipartite graph.
Input: $\mathbf{V}$, $\hat{\mathbf{V}}$. Output: $\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}$
1: Initialize an empty graph $\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}$
2: for $V_i$ in $\mathbf{V}$ do
3:   for $\hat{V}_j$ in $\hat{\mathbf{V}}$ do
4:     Add edge $(V_i, \hat{V}_j)$ to $\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}$
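For intuition, the following is a minimal Python sketch (names hypothetical) of the causal disentanglement map that CRID maintains: an edge $V_i \rightarrow \hat{V}_j$ means $\hat{V}_j$ may still depend on $V_i$, and each identifiability result removes edges.

```python
def init_bipartite(latents):
    # Algorithm F.2: start fully connected -- everything entangled.
    return {v: {f"hat_{w}" for w in latents} for v in latents}

def record_id(G, targets, env):
    # Record "targets are ID w.r.t. env": the representations of the targets
    # no longer depend on env, so drop the edges env -> hat(target).
    for z in env:
        G[z] -= {f"hat_{t}" for t in targets}

G = init_bipartite(["V1", "V2", "V3"])
record_id(G, targets=["V1"], env=["V2", "V3"])  # conclusion of Example 10
print(G["V2"])  # {'hat_V2', 'hat_V3'}: V2 no longer feeds hat_V1
```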
Then, after constructing the collection $\mathbb{Q}$ of Q sets from comparisons of distributions, Alg. F.3 checks whether the canceled variables $\mathbf{V} \setminus \bar{Q}$ can be disentangled from the Q factors $\bar{Q}$ according to Proposition 3. To illustrate, each variable $Z \in \mathbf{V} \setminus \bar{Q}$ is checked one by one. The variables that have already been disentangled from $Z$ are collected in the list Mem through the procedure CheckMemoize. Next, we check whether there is a sub-collection of $\mathbb{Q}$ that satisfies conditions [1-3] of Proposition 3; the checking procedure is shown in Alg. F.5. If the conditions are satisfied, the edges from $Z$ to $\hat{\bar{Q}}$ are removed to record the disentanglement. Based on Lemma 2, condition [3] of Prop. 3 can be reduced to a weaker condition [4] that leverages the disentanglements already recorded in the CDM.

Lemma 2. Consider variables $\mathbf{V}^{tar} \subseteq \mathbf{V}$ and $Z \in \mathbf{V} \setminus \mathbf{V}^{tar}$. Suppose Mem $= \{V_j \in \mathbf{V}^{tar} \mid V_j$ is ID w.r.t. $Z\}$. Consider $\mathcal{P}_{\mathbf{T}}$ and its corresponding intervention targets satisfying conditions [1-2] of Prop. 3. If the following weakened version of condition [3] is also satisfied:

[4] there exist $\{a'_1, \ldots, a'_{|\mathbf{V}^{tar}|}\} \subseteq \{a_1, \ldots, a_L\}$ such that for all $V^{tar}_i \in \mathbf{V}^{tar} \setminus$ Mem, $V^{tar}_i \in Q[I^{(a'_i)}, I^{(a_0)}, \mathbf{T}, \mathcal{G}^S]$,

then $\mathbf{V}^{tar}$ is ID w.r.t. $Z$.

To illustrate, the lemma says that not all variables in $\mathbf{V}^{tar}$ need to be covered uniquely: variables that have already been disentangled (those in Mem) do not need to be considered.

Example 13. Consider the LSD $\mathcal{G}^S$ and the intervention targets $I^{(1)} = \{\}$ and $I^{(4)} = \{V_2^{\Pi_1,do}\}$. Comparing $I^{(4)}$ and $I^{(1)}$ with $\mathbf{T} = \{\}$ yields $Q = \{V_1, V_2\}$. Based on Prop. 3 alone, we cannot conclude that $V_2$ is ID w.r.t. $V_3$, since covering $V_1$ and $V_2$ separately requires at least two Q sets. Now assume it is known that $V_1$ is ID w.r.t. $V_3$, namely Mem $= \{V_1\}$. By condition [4] of Lemma 2, the Q sets only need to cover $V_2$, not $V_1$; hence $V_2$ is ID w.r.t. $V_3$.

Algorithm F.3 DisQFromCancel: Check whether the canceled variables $\mathbf{V} \setminus \bar{Q}$ can be disentangled from the Q factors $\bar{Q}$. $\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}$ is the current bipartite graph; $\mathcal{G}_{\bar{\mathbf{T}}}$ is the LCG after the perfect intervention on $\mathbf{T}$; $\Psi^{perf}_{\mathbf{T}}$ is the collection of target sets containing perfect interventions on $\mathbf{T}$; $I \in \Psi^{perf}_{\mathbf{T}}$ is the chosen baseline distribution; $\mathbb{Q}$ is the collection of Q sets obtained by comparing intervention targets $J \in \Psi^{perf}_{\mathbf{T}} \setminus I$ with the baseline.
Input: $\mathbb{Q}$, $\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}$, $\mathcal{G}_{\bar{\mathbf{T}}}$, $\Psi^{perf}_{\mathbf{T}}$, $I$, $\bar{Q}$. Output: $\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}$
1: for all $Z \in \mathbf{V} \setminus \bar{Q}$ do
2:   Mem $\leftarrow$ CheckMemoize($\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}, Z, \bar{Q}$)  ▷ variables in $\bar{Q}$ already ID w.r.t. $Z$
3:   if CheckCondition3($\mathbb{Q}$, $\bar{Q}$, Mem) then  ▷ check the conditions of Prop. 3 and Lem. 2
4:     remove the edges $Z \rightarrow \hat{\bar{Q}}$ in $\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}$
5: return $\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}$

Algorithm F.4 CheckMemoize: Memoization step. Collect the variables in $\bar{Q}$ that are already ID w.r.t. $Z$.
Input: $\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}$, $Z$, $\bar{Q}$. Output: Mem
1: Mem $\leftarrow \{\}$
2: for all $V \in \bar{Q}$ do
3:   if $Z \rightarrow \hat{V} \notin \mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}$ then
4:     Mem.append($V$)
5: return Mem

Algorithm F.5 CheckCondition3: Check the conditions of Proposition 3 and Lemma 2. $\mathbb{Q}$ is the collection of Q sets; $\bar{Q}$ holds the target variables; Mem holds the variables of $\bar{Q}$ that have already been disentangled.
Input: $\mathbb{Q}$, $\bar{Q}$, Mem. Output: True or False
1: $L \leftarrow \{\}$
2: for $Q_k \in \mathbb{Q}$ do
3:   if $Q_k \subseteq \bar{Q}$ then
4:     $L$.append($Q_k$)
5: $Q^{re} = \{Q_1, \ldots, Q_{d'}\} \leftarrow \bar{Q} \setminus$ Mem, $d' \leftarrow |Q^{re}|$
6: if $Q_1 \in L_1, Q_2 \in L_2, \ldots, Q_{d'} \in L_{d'}$ after some permutation of $L$ then
7:   return True
8: return False

Algorithm F.6 DisWithinQ: Check the disentanglement of variables within $\bar{Q}$. $\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}$ is the current bipartite graph; $\mathcal{G}_{\bar{\mathbf{T}}}$ is the LCG after the perfect intervention on $\mathbf{T}$; $\Psi^{perf}_{\mathbf{T}}$ is the collection of target sets containing perfect interventions on $\mathbf{T}$; $I \in \Psi^{perf}_{\mathbf{T}}$ is the chosen baseline distribution; $\mathbb{Q}$ is the collection of Q sets obtained by comparing intervention targets $J \in \Psi^{perf}_{\mathbf{T}} \setminus I$ with the baseline.
Input: $\mathbb{Q}$, $\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}$, $\mathcal{G}_{\bar{\mathbf{T}}}$, $\Psi^{perf}_{\mathbf{T}}$, $I$, $\bar{Q}$. Output: $\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}$
1: for all pairs $V_i, V_j \in \bar{Q}$ do
2:   if $V_i \perp V_j \mid \bar{Q} \setminus \{V_i, V_j\}$ then
3:     Mem$_i$ $\leftarrow$ CheckMemoize($\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}, V_i, \bar{Q}$)  ▷ variables in $\bar{Q}$ already ID w.r.t. $V_i$
4:     Mem$_j$ $\leftarrow$ CheckMemoize($\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}, V_j, \bar{Q}$)  ▷ variables in $\bar{Q}$ already ID w.r.t. $V_j$
5:     if CheckCondition4($\mathbb{Q}$, $\bar{Q}$, Mem$_i$, Mem$_j$, $\mathcal{G}_{\bar{\mathbf{T}}}$) then  ▷ check the conditions of Prop. 4 and Lem. 3
6:       remove the edges $V_j \rightarrow \hat{V}_i$ and $V_i \rightarrow \hat{V}_j$ in $\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}$
7: return $\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}$
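The covering test in line 6 of Alg. F.5 is a system-of-distinct-representatives check: each not-yet-disentangled target must be matched to its own Q set. A brute-force sketch (hypothetical encoding, adequate for the small graphs in our examples):

```python
from itertools import permutations

def check_condition3(Q_sets, Q_bar, mem):
    """Condition [4] of Lemma 2: match each remaining target variable to a
    distinct comparison whose Q set contains it."""
    remaining = [v for v in Q_bar if v not in mem]
    usable = [Q for Q in Q_sets if Q <= set(Q_bar)]  # Q sets inside Q_bar
    if len(usable) < len(remaining):
        return False
    for chosen in permutations(usable, len(remaining)):
        if all(v in Q for v, Q in zip(remaining, chosen)):
            return True
    return False

# Example 13: one comparison of I^(4) against the baseline, Q_bar = {V1, V2}.
print(check_condition3([{"V1", "V2"}], ["V1", "V2"], mem=set()))   # False
print(check_condition3([{"V1", "V2"}], ["V1", "V2"], mem={"V1"}))  # True
```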
Next, Alg. F.6 checks whether pairs $V_i, V_j \in \bar{Q}$ that are independent of each other conditioned on the other variables of $\bar{Q}$ can be disentangled from each other according to Proposition 4. To illustrate, two lists of variables that have already been disentangled from $V_i$ and $V_j$ are constructed as Mem$_i$ and Mem$_j$, respectively, through CheckMemoize. Next, we check whether there is a sub-collection of $\mathbb{Q}$ satisfying the conditions of Proposition 4; the checking procedure is shown in Alg. F.7. If the conditions are satisfied, the corresponding edges are removed to record the disentanglement. Based on Lemma 3, condition [3'] of Prop. 4 can be reduced to a weaker condition [4'] that leverages the disentanglements already recorded in the CDM.

Lemma 3 (ID of variables within Q sets). Consider variables $\mathbf{V}^{tar} \subseteq \mathbf{V}$. For any pair $V_i, V_j \in \mathbf{V}^{tar}$ such that $V_i \perp V_j \mid \mathbf{V}^{tar} \setminus \{V_i, V_j\}$ in $\mathcal{G}_{\bar{\mathbf{T}}}(\mathbf{V}^{tar})$, let Mem$_i$ be the list of variables in $\bar{Q}$ already ID w.r.t. $V_i$, and Mem$_j$ the list of variables in $\bar{Q}$ already ID w.r.t. $V_j$. Suppose there exists $\mathcal{P}_{\mathbf{T}}$ satisfying conditions [1-2] of Prop. 3 and the following condition [4']:

[4'] (Enough changes occur across distributions) Let $Q^{re} = \mathbf{V}^{tar} \setminus ($Mem$_i \cup$ Mem$_j)$ and $d' = |Q^{re}|$, and let
$$E_{ij} = \{\epsilon = \{V_k, V_r\} \mid \text{(i) } \exists a_l, \{V_k, V_r\} \subseteq Q^{(a_l),(a_0)}; \text{ (ii) } V_k \text{ is connected to } V_r \text{ given } \mathbf{V}^{tar} \setminus \{V_k, V_r\} \text{ in } \mathcal{G}_{\bar{\mathbf{T}}}(\mathbf{V}^{tar}); \text{ (iii) } V_k, V_r \notin \text{Mem}_i \cup \text{Mem}_j\}$$
There exist $\{a'_1, \ldots, a'_{2d'+|E_{ij}|}\} \subseteq \{a_1, \ldots, a_L\}$ such that for all $Q_i \in Q^{re}$, $Q_i \in Q^{(a'_i),(a_0)}$ and $Q_i \in Q^{(a'_{d'+i}),(a_0)}$, and for all $\epsilon_l \in E_{ij}$, $\epsilon_l \subseteq Q^{(a'_{2d'+l}),(a_0)}$.

Then $V_i$ is ID w.r.t. $V_j$.

Algorithm F.7 CheckCondition4: Check the conditions of Proposition 4 and Lemma 3. $\mathbb{Q}$ is the collection of Q sets; $\bar{Q}$ holds the target variables; Mem$_i$ (resp. Mem$_j$) holds the variables of $\bar{Q}$ already disentangled from $V_i$ (resp. $V_j$); $\mathcal{G}_{\bar{\mathbf{T}}}$ is the diagram after removing the incoming edges to $\mathbf{T}$.
Input: $\mathbb{Q}$, $\bar{Q}$, Mem$_i$, Mem$_j$, $\mathcal{G}_{\bar{\mathbf{T}}}$. Output: True or False
1: $L \leftarrow \{\}$
2: for $Q_k \in \mathbb{Q}$ do
3:   if $Q_k \subseteq \bar{Q}$ then
4:     $L$.append($Q_k$)
5: $E \leftarrow \{\}$
6: for $\{V_k, V_r\} \subseteq \bar{Q}$ do
7:   if (i) $\exists L' \in L$ such that $\{V_k, V_r\} \subseteq L'$, (ii) $V_k$ is conditionally connected to $V_r$, and (iii) $V_k, V_r \notin$ Mem$_i \cup$ Mem$_j$ then
8:     $E$.append($(V_k, V_r)$)  ▷ construct $E$ according to Lem. 3
9: $Q^{re+} = \{Q_1, \ldots, Q_{d^+}\} \leftarrow (\bar{Q} \setminus ($Mem$_i \cup$ Mem$_j)) \cup E$, $d^+ \leftarrow |Q^{re+}|$
10: if $Q_1 \in L_1, Q_2 \in L_2, \ldots, Q_{d^+} \in L_{d^+}$ after some permutation of $L$ then
11:   return True
12: return False

Lastly, we leverage the independence relations and the disentanglement results already stored in $\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}$: the canceled variables in $\mathbf{V} \setminus \bar{Q}$ can be disentangled according to Proposition 5. The following algorithm illustrates this step.

Algorithm F.8 DisCancelFromQ: Disentangle canceled variables from $\bar{Q}$. $\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}$ is the current bipartite graph; $\mathcal{G}_{\bar{\mathbf{T}}}$ is the LCG after the perfect intervention on $\mathbf{T}$.
Input: $\mathbb{Q}$, $\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}$, $\mathcal{G}_{\bar{\mathbf{T}}}$. Output: $\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}$
1: for all $Z$ such that $Z \perp \mathbf{V} \setminus Z$ in $\mathcal{G}_{\bar{\mathbf{T}}}$ do
2:   if there are no edges from $Z$ to $\hat{\mathbf{V}} \setminus \hat{Z}$ then
3:     remove the edges from $\mathbf{V} \setminus Z$ to $\hat{Z}$
4: return $\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}$

C Proofs

Here, we provide detailed proofs of the theoretical results in the main paper.

C.1 "Distributions Change Sufficiently" - Proof of Lemma 1

We assume that the distributions "change sufficiently" in Sec. 2. This assumption is formally stated as Assumption 4 (Appendix A.2) and is used as a technical assumption in the proofs of the propositions in this work; we do not restate it here. Lemma 1 provides the justification of this assumption, stating that Assumption 4 almost surely holds. We provide the proof next.
Lemma 1. Assumption 4 almost surely holds.

Proof. We prove that the first-order and second-order discrepancy conditions almost surely hold, which means the situations in which they fail have Lebesgue measure 0. Consider first the first-order discrepancy. Stack $\{\omega_1(\mathbf{v}, a_1), \omega_1(\mathbf{v}, a_2), \ldots, \omega_1(\mathbf{v}, a_L)\}$ into a matrix $A$ whose entries are
$$a_{lq} = \frac{\partial \big( \log p^{(a_l)}_{\mathbf{T}}(\mathbf{v}) - \log p^{(a_0)}_{\mathbf{T}}(\mathbf{v}) \big)}{\partial v_q} \quad (14)$$
According to Eq. (3), $\log p^{(a_l)}_{\mathbf{T}}(\mathbf{v}) - \log p^{(a_0)}_{\mathbf{T}}(\mathbf{v})$ is a function only of the variables in $\bar{Q}$. Thus, if $V_q \notin Q[I^{(a_l)}, I^{(a_0)}, \mathbf{T}, \mathcal{G}^S]$, then $a_{lq} = 0$; if $V_q \in Q[I^{(a_l)}, I^{(a_0)}, \mathbf{T}, \mathcal{G}^S]$, we assume $a_{lq}$ follows a standard normal distribution, which means the non-zero entries of the matrix $A$ are randomly sampled and are not fine-tuned. It therefore suffices to prove: if there exist $\{a'_1, \ldots, a'_{|\bar{Q}|}\} \subseteq \{a_1, \ldots, a_L\}$ such that every $V_q \in \bar{Q}$ satisfies $V_q \in Q[I^{(a'_q)}, I^{(a_0)}, \mathbf{T}, \mathcal{G}^S]$, then the rows of $A$ are almost surely linearly independent. W.l.o.g., let $a'_1 = a_1, \ldots, a'_{|\bar{Q}|} = a_{|\bar{Q}|}$; it is then equivalent to prove that the corresponding square submatrix of $A$ is full rank, i.e., that its determinant is almost surely non-zero. Since every $V_q \in \bar{Q}$ appears in some $Q[I^{(a_q)}, I^{(a_0)}, \mathbf{T}, \mathcal{G}^S]$, there exists a realization of $A$ with $\det(A) \neq 0$, so $\det(A)$ is a non-trivial function of the entries. Based on a simple algebraic lemma in [70], the subset $\{A \mid \det(A) = 0\}$ of the corresponding real space has Lebesgue measure 0. Hence $\det(A) \neq 0$ almost surely.
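As a numerical companion (a sketch, not part of the formal argument), drawing the free entries of a matrix at random essentially never produces a rank-deficient matrix, matching the measure-zero claim:

```python
import numpy as np

# {A : det(A) = 0} has Lebesgue measure zero [70], so random draws of the
# free entries are full rank almost surely.
rng = np.random.default_rng(1)
trials = 10_000
full_rank = sum(
    np.linalg.matrix_rank(rng.standard_normal((4, 4))) == 4
    for _ in range(trials)
)
print(full_rank / trials)  # 1.0 -- no singular draw in 10k samples
```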
The proof for the second-order discrepancy is similar. Stack $\{\omega_2(\mathbf{v}, a_1), \omega_2(\mathbf{v}, a_2), \ldots, \omega_2(\mathbf{v}, a_L)\}$ into a matrix $B$ whose entries are
$$b_{lq} = \frac{\partial (\log p^{(a_l)}_{\mathbf{T}} - \log p^{(a_0)}_{\mathbf{T}})}{\partial v_q} \text{ for } q \leq |\bar{Q}|, \qquad b_{lq} = \frac{\partial^2 (\log p^{(a_l)}_{\mathbf{T}} - \log p^{(a_0)}_{\mathbf{T}})}{\partial v_q^2} \text{ for } |\bar{Q}| + 1 \leq q \leq 2|\bar{Q}|,$$
$$b_{l\epsilon} = \frac{\partial^2 (\log p^{(a_l)}_{\mathbf{T}} - \log p^{(a_0)}_{\mathbf{T}})}{\partial v_p \partial v_q} \text{ for } 2|\bar{Q}| + 1 \leq \epsilon \leq 2|\bar{Q}| + |E|, \ \{V_p, V_q\} \in E.$$
As before, the entries vanish when the corresponding variable lies outside $Q[I^{(a_l)}, I^{(a_0)}, \mathbf{T}, \mathcal{G}^S]$; in particular, if $\{V_p, V_q\} \not\subseteq Q[I^{(a_l)}, I^{(a_0)}, \mathbf{T}, \mathcal{G}^S]$, then $b_{l\epsilon} = 0$. Otherwise, we assume the entries follow a standard normal distribution, which means the non-zero entries of $B$ are randomly sampled and are not fine-tuned. Following the same discussion as above, the subset $\{B \mid \det(B) = 0\}$ of the corresponding real space has Lebesgue measure 0, so $\det(B) \neq 0$ almost surely. $\square$

C.2 Distribution comparison - Proof of Proposition 1

Proposition 1 (Distribution Comparison). Consider a pair of collections of ASCMs $\mathbf{M}$ and $\hat{\mathbf{M}}$ that match the distributions $\mathcal{P}$ resulting from interventions $\Sigma$ and the LSD $\mathcal{G}^S$. Consider two distributions $P^{\Pi^{(j)}}(\mathbf{X}; \sigma^{(j)})$ and $P^{\Pi^{(k)}}(\mathbf{X}; \sigma^{(k)})$, and suppose perf($\mathbf{T}$) is in both intervention sets. Then
$$\sum_i \log \frac{p^{(j)}_{\mathbf{T}}(v_i \mid pa^{\mathbf{T}+}_i)}{p^{(k)}_{\mathbf{T}}(v_i \mid pa^{\mathbf{T}+}_i)} = \sum_i \Big( \log p^{(j)}_{\mathbf{T}}(\hat{v}_i \mid \hat{pa}^{\mathbf{T}+}_i) - \log p^{(k)}_{\mathbf{T}}(\hat{v}_i \mid \hat{pa}^{\mathbf{T}+}_i) \Big), \quad (2)$$
where $p^{(j)}_{\mathbf{T}}(\cdot)$ and $p^{(k)}_{\mathbf{T}}(\cdot)$ are density functions.

Proof. According to the ASCM definition (Def. 2.1), the mapping from $\mathbf{V}$ to $\mathbf{X}$ and the mapping from $\mathbf{X}$ to $\hat{\mathbf{V}}$ can be composed as
$$\hat{\mathbf{V}} = \hat{f}_{\mathbf{X}}^{-1}(\mathbf{X}) = \hat{f}_{\mathbf{X}}^{-1}(f_{\mathbf{X}}(\mathbf{V})) \quad (16)$$
Then, based on the change-of-variables formula, we have
$$p(\mathbf{v}) = p(\hat{\mathbf{v}}) |J_\phi| \quad (17)$$
where $\phi = \hat{f}_{\mathbf{X}}^{-1} \circ f_{\mathbf{X}}$ and $J_\phi$ is the Jacobian matrix of $\phi$. Leveraging the factorization in Eq. (1) and taking the log of the above equation,
$$\sum_{i} \log p_{\mathbf{T}}(v_i \mid pa^{\mathbf{T}+}_i) = \sum_{i} \log p_{\mathbf{T}}(\hat{v}_i \mid \hat{pa}^{\mathbf{T}+}_i) + \log |J_\phi| \quad (18)$$
Subtracting the factorization of the density function induced by $I^{(k)}$ from that induced by $I^{(j)}$ cancels $\log |J_\phi|$ and yields Eq. (2). $\square$

Eq. (2) naturally gives a connection from $\mathbf{V}$ to $\hat{\mathbf{V}}$. Comparing the two factorizations for Fig. 4(c), the factors are matched as $P(v_1), p(v_2 \mid v_1), p(v_3 \mid v_2, v_1), P(v_4 \mid v_3)$, or alternatively as $P(v_1), p(v_3), p(v_2 \mid v_1, v_3), P(v_4 \mid v_3)$.

C.3 Invariant factors - Proof of Proposition 2

Proposition 2 (Invariant Factors). Consider two distributions $P^{(j)}, P^{(k)} \in \mathcal{P}$ with intervention targets $I^{(j)}$ and $I^{(k)}$ containing perf($\mathbf{T}$). Construct the changed variable set $\bar{\mathbf{V}}[I^{(j)}, I^{(k)}, \mathcal{G}^S]$ (for short, $\bar{\mathbf{V}}$) from the target sets $I^{(j)}, I^{(k)}$ as follows: (1) $V_l \in \bar{\mathbf{V}}$ if $V_l^{\pi_l,\{b_l\},t_l} \in I^{(j)}$ but $V_l^{\pi_l,\{b_l\},t_l} \notin I^{(k)}$, or vice versa; (2) $V_l \in \bar{\mathbf{V}}$ if (i) $S^{\Pi^{(j)},\Pi^{(k)}}$ points to $V_l$ and (ii) $V_l^{\pi_l,\{b_l\},t_l} \notin I^{(j)} \cap I^{(k)}$. If $V_i \in \mathbf{V} \setminus C(\bar{\mathbf{V}})$, then $p^{(j)}_{\mathbf{T}}(v_i \mid pa^{\mathbf{T}+}_i) = p^{(k)}_{\mathbf{T}}(v_i \mid pa^{\mathbf{T}+}_i)$ (denoted invariant factors).

Proof. Consider an arbitrary topological order. By construction, $\bar{\mathbf{V}}[I^{(j)}, I^{(k)}, \mathcal{G}^S]$ includes all variables whose mechanism $f_V$ or exogenous variable $U$ may change when the intervention changes from $I^{(k)}$ to $I^{(j)}$; in other words, for any $V_l \in \mathbf{V} \setminus \bar{\mathbf{V}}[I^{(j)}, I^{(k)}, \mathcal{G}^S]$, $f_{V_l}$ and the exogenous $U_l$ are invariant. Let $V_i \in \mathbf{V} \setminus C(\bar{\mathbf{V}})$ and $\mathbf{Z} = C(V_i) \cap Pa^{\mathbf{T}+}_i$. We have $\mathbf{Z} \subseteq \mathbf{V} \setminus C(\bar{\mathbf{V}})$ according to the definition of C-components, and by the definition of $Pa^{\mathbf{T}+}_i$, $Pa^{\mathbf{T}+}_i \setminus \mathbf{Z} = Pa(\{V_i\} \cup \mathbf{Z})$. Now reconsider the distributions $P^{\Pi^{(j)}}(V_i \mid Pa^{\mathbf{T}+}_i; \sigma^{(j)})$ and $P^{\Pi^{(k)}}(V_i \mid Pa^{\mathbf{T}+}_i; \sigma^{(k)})$:
$$P^{\Pi^{(j)}}_{\mathbf{T}}(V_i \mid Pa^{\mathbf{T}+}_i; \sigma^{(j)}) = P^{\Pi^{(j)}}_{\mathbf{T}}(V_i, \mathbf{Z} \mid Pa(\{V_i\} \cup \mathbf{Z}); \sigma^{(j)}) / P^{\Pi^{(j)}}_{\mathbf{T}}(\mathbf{Z} \mid Pa(\{V_i\} \cup \mathbf{Z}); \sigma^{(j)})$$
$$P^{\Pi^{(k)}}_{\mathbf{T}}(V_i \mid Pa^{\mathbf{T}+}_i; \sigma^{(k)}) = P^{\Pi^{(k)}}_{\mathbf{T}}(V_i, \mathbf{Z} \mid Pa(\{V_i\} \cup \mathbf{Z}); \sigma^{(k)}) / P^{\Pi^{(k)}}_{\mathbf{T}}(\mathbf{Z} \mid Pa(\{V_i\} \cup \mathbf{Z}); \sigma^{(k)}) \quad (20)$$
Since the mechanisms and exogenous variables of $V_i$ and $\mathbf{Z}$ are invariant, both the numerators and the denominators are the same, namely
$$P^{\Pi^{(j)}}_{\mathbf{T}}(V_i, \mathbf{Z} \mid Pa(\{V_i\} \cup \mathbf{Z}); \sigma^{(j)}) = P^{\Pi^{(k)}}_{\mathbf{T}}(V_i, \mathbf{Z} \mid Pa(\{V_i\} \cup \mathbf{Z}); \sigma^{(k)}) \quad (21)$$
$$P^{\Pi^{(j)}}_{\mathbf{T}}(\mathbf{Z} \mid Pa(\{V_i\} \cup \mathbf{Z}); \sigma^{(j)}) = P^{\Pi^{(k)}}_{\mathbf{T}}(\mathbf{Z} \mid Pa(\{V_i\} \cup \mathbf{Z}); \sigma^{(k)}) \quad (22)$$
which implies that the density functions are invariant:
$$p^{(j)}_{\mathbf{T}}(v_i \mid pa^{\mathbf{T}+}_i) = p^{(k)}_{\mathbf{T}}(v_i \mid pa^{\mathbf{T}+}_i) \quad (23) \quad \square$$
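A minimal sketch of the construction of the changed variable set (the encoding and the S-node target are hypothetical; compare with Example 11 in Appendix A.3):

```python
# Encode each target set as: variable -> (mechanism, perfect).
def changed_set(I_j, I_k, same_domain, s_node_targets):
    # Rule (1): the intervention on V differs between the two target sets.
    changed = {v for v in set(I_j) | set(I_k) if I_j.get(v) != I_k.get(v)}
    # Rule (2): the S-node points to V, and no identical intervention on V
    # appears in both target sets to neutralize the domain difference.
    if not same_domain:
        changed |= {v for v in s_node_targets
                    if not (v in I_j and I_j.get(v) == I_k.get(v))}
    return changed

# Example 11's comparison: soft intervention on V2 in Pi1 vs. observation
# in Pi2, with the S-node assumed (hypothetically) to point at V1.
I_j = {"V2": (None, False)}   # {V2}^{Pi1}
I_k = {}                      # {}^{Pi2}
print(changed_set(I_j, I_k, same_domain=False, s_node_targets={"V1"}))
# -> {'V1', 'V2'}, matching V_bar = {V1, V2}
```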
C.4 ID of the Q set w.r.t. Canceled Factors - Proof of Proposition 3 and Lemma 2

Proposition 3 (ID the Q set w.r.t. Canceled Variables). Consider variables $\mathbf{V}^{tar} \subseteq \mathbf{V}$. Let $\mathcal{P}_{\mathbf{T}} = \{P^{(a_0)}, P^{(a_1)}, \ldots, P^{(a_L)}\} \subseteq \mathcal{P}$ be a collection of distributions such that (1) $\forall l \in [L]$, $\mathbf{T} = Perf[I^{(a_0)}] \subseteq Perf[I^{(a_l)}]$;^18 (2) $\bigcup_{l \in [L]} Q[I^{(a_l)}, I^{(a_0)}, \mathbf{T}, \mathcal{G}^S] = \mathbf{V}^{tar}$; (3) there exist $\{a'_1, \ldots, a'_{d}\} \subseteq \{a_1, \ldots, a_L\}$ such that for all $V^{tar}_i \in \mathbf{V}^{tar}$, $V^{tar}_i \in Q[I^{(a'_i)}, I^{(a_0)}, \mathbf{T}, \mathcal{G}^S]$, where $d = |\mathbf{V}^{tar}|$. Then $\mathbf{V}^{tar}$ is ID w.r.t. $\mathbf{V} \setminus \mathbf{V}^{tar}$.

^18 Recall that we use the notation $Perf[I]$ to denote all variables that receive perfect interventions in $I$.

Proof. We denote $\mathbf{V}^{tar}$ as $\bar{Q}$ for convenience; note that Assumption 4 will be used in the proof. Comparing $P^{\Pi^{(a_l)}}(\mathbf{V}; \sigma^{(a_l)})$ with $P^{\Pi^{(a_0)}}(\mathbf{V}; \sigma^{(a_0)})$, we have, from Eq. (3),
$$\sum_{V_i \in \mathbf{V}} \log \frac{p^{(a_l)}_{\mathbf{T}}(v_i \mid pa^{\mathbf{T}+}_i)}{p^{(a_0)}_{\mathbf{T}}(v_i \mid pa^{\mathbf{T}+}_i)} = \sum_{V_i \in \mathbf{V}} \Big( \log p^{(a_l)}_{\mathbf{T}}(\hat{v}_i \mid \hat{pa}^{\mathbf{T}+}_i) - \log p^{(a_0)}_{\mathbf{T}}(\hat{v}_i \mid \hat{pa}^{\mathbf{T}+}_i) \Big) = \log p^{(a_l)}_{\mathbf{T}}(\hat{\mathbf{v}}) - \log p^{(a_0)}_{\mathbf{T}}(\hat{\mathbf{v}}) \quad (24)$$
Notice that the left side only involves variables in $\bar{Q} = \bigcup_{l \in [L]} Q[I^{(a_l)}, I^{(a_0)}, \mathbf{T}, \mathcal{G}^S]$, based on Def. 3.1. Thus, for any $Z \in \mathbf{V} \setminus \bar{Q}$ and all $l \in [L]$,
$$\frac{\partial \big( \log p^{(a_l)}_{\mathbf{T}}(\hat{\mathbf{v}}) - \log p^{(a_0)}_{\mathbf{T}}(\hat{\mathbf{v}}) \big)}{\partial z} = 0 \quad (25)$$
Expanding this partial derivative via the chain rule, we have, for all $l \in [L]$,
$$0 = \sum_{V_q \in \bar{Q}} \frac{\partial \big( \log p^{(a_l)}_{\mathbf{T}}(\hat{\mathbf{v}}) - \log p^{(a_0)}_{\mathbf{T}}(\hat{\mathbf{v}}) \big)}{\partial \hat{v}_q} \cdot \frac{\partial \hat{v}_q}{\partial z} \quad (27)$$
Eq. (27) is a linear system in the unknowns $\{\partial \hat{v}_q / \partial z\}_{V_q \in \bar{Q}}$. When the distributions change sufficiently, namely under Assumption 4, the rows of the coefficient matrix of the linear system are linearly independent. When $L \geq |\bar{Q}|$ (implied by condition (3)), the matrix is full rank, and thus
$$\frac{\partial \hat{v}_q}{\partial z} = 0 \quad (28)$$
Recall that $\hat{v}_q = \phi_{V_q}(\mathbf{v})$. Since Eq. (28) holds for every $Z \in \mathbf{V} \setminus \bar{Q}$, the variables in $\bar{Q}$ suffice as the input of $\phi_{V_q}$, which means there exists $\hat{V}_q = \phi_{V_q}(\bar{Q})$. $\square$

Lemma 2 (restated from Appendix B). Consider variables $\mathbf{V}^{tar} \subseteq \mathbf{V}$ and $Z \in \mathbf{V} \setminus \mathbf{V}^{tar}$, with Mem $= \{V_j \in \mathbf{V}^{tar} \mid V_j$ is ID w.r.t. $Z\}$, and $\mathcal{P}_{\mathbf{T}}$ satisfying conditions [1-2] of Prop. 3 together with the weakened condition [4]. Then $\mathbf{V}^{tar}$ is ID w.r.t. $Z$.

Proof. For all $V_m \in$ Mem, $\partial \hat{v}_m / \partial z = 0$, so the unknowns in Eq. (27) exclude $\partial \hat{v}_m / \partial z$. Then, when [4] holds, the system has only the zero solution and Eq. (28) follows. $\square$

C.5 ID within the Q set - Proof of Proposition 4 and Lemma 3

The next result provides an additional way of disentangling latent variables within the same Q factor, leveraging second-order conditions and conditional independence. We first prove the following stronger result.

Definition 6.1 (Markov Network). Let $M_{\mathbf{V}}$ be the Markov network over variables $\mathbf{V}$ with vertices $\{V_i\}_{i=1}^n$, and let $E(M_{\mathbf{V}})$ denote its set of edges. An edge $(V_i, V_j)$ is added to $E(M_{\mathbf{V}})$ if $V_i \not\perp V_j \mid \mathbf{V} \setminus \{V_i, V_j\}$.

Proposition 6 (ID of variables within Q sets). Consider the variables $\mathbf{V}^{tar} \subseteq \mathbf{V}$. Define $E$ as the set of edges within the Markov network of $\mathcal{G}_{\bar{\mathbf{T}}}(\mathbf{V}^{tar})$ that are contained within some Q set:
$$E = \{\epsilon_j = \{V_k, V_r\} \mid \text{(i) } \exists a_l, \{V_k, V_r\} \subseteq Q^{(a_l),(a_0)}; \text{ (ii) } V_k \text{ is d-connected to } V_r \text{ given } \mathbf{V}^{tar} \setminus \{V_k, V_r\} \text{ in } \mathcal{G}_{\bar{\mathbf{T}}}(\mathbf{V}^{tar})\}$$
For any pair $V_i, V_j \in \mathbf{V}^{tar}$ such that $V_i \perp V_j \mid \mathbf{V}^{tar} \setminus \{V_i, V_j\}$ in $\mathcal{G}_{\bar{\mathbf{T}}}(\mathbf{V}^{tar})$, if there exists $\mathcal{P}_{\mathbf{T}} = \{P^{(a_0)}, P^{(a_1)}, \ldots, P^{(a_L)}\} \subseteq \mathcal{P}$ that satisfies conditions (1-2) of Prop. 3 and the following condition (3'):

(3') (Enough changes occur across distributions) There exist $\{a'_1, \ldots, a'_{2d+|E|}\} \subseteq \{a_1, \ldots, a_L\}$ such that for all $V^{tar}_i \in \mathbf{V}^{tar}$: (i) $V^{tar}_i \in Q^{(a'_i),(a_0)}$; (ii) $V^{tar}_i \in Q^{(a'_{d+i}),(a_0)}$; and (iii) for all $\epsilon_j \in E$, $\epsilon_j \subseteq Q^{(a'_{2d+j}),(a_0)}$, where $d = |\mathbf{V}^{tar}|$,

then $V_i$ is ID w.r.t. $V_j$.
Proof. We denote $\mathbf{V}^{tar}$ as $\bar{Q}$ for convenience; note that Assumption 4 will be used in the proof. From Eq. (3), we have
$$\sum_{V_i \in \mathbf{V}} \log \frac{p^{(a_l)}_{\mathbf{T}}(v_i \mid pa^{\mathbf{T}+}_i)}{p^{(a_0)}_{\mathbf{T}}(v_i \mid pa^{\mathbf{T}+}_i)} = \log p^{(a_l)}_{\mathbf{T}}(\hat{\mathbf{v}}) - \log p^{(a_0)}_{\mathbf{T}}(\hat{\mathbf{v}}) \quad (30)$$
and the left side only involves variables in $\bar{Q} = \bigcup_{l \in [L]} Q[I^{(a_l)}, I^{(a_0)}, \mathbf{T}, \mathcal{G}^S]$, based on Def. 3.1. We first argue that if $V_i \perp V_j \mid \bar{Q} \setminus \{V_i, V_j\}$ in $\mathcal{G}_{\bar{\mathbf{T}}}$, then $V_i \notin Pa^{\mathbf{T}+}_j$, $V_j \notin Pa^{\mathbf{T}+}_i$, and $V_i, V_j$ do not appear together in $Pa^{\mathbf{T}+}_m \cup \{V_m\}$ for any $V_m \in \bar{Q}$. First, since $V_i \perp V_j \mid \bar{Q} \setminus \{V_i, V_j\}$, the two variables cannot be directly connected by edges in $\mathcal{G}_{\bar{\mathbf{T}}}$, which implies $V_i \notin C(V_j)$ and $V_i \notin Pa^{\mathbf{T}+}(V_j)$. Also, the outgoing edges from $V_i$ and $V_j$ cannot point into the same C-component; otherwise, the path between $V_i$ and $V_j$ would be active when conditioning on the other variables (a collider structure). Thus $V_i \notin Pa^{\mathbf{T}+}_j$, $V_j \notin Pa^{\mathbf{T}+}_i$, and $V_i, V_j$ do not both enter $Pa^{\mathbf{T}+}_k$ for any $V_k \in \bar{Q}$. This implies $V_i$ and $V_j$ never appear in the same factor $p^{(a_l)}_{\mathbf{T}}(v_m \mid pa^{\mathbf{T}+}_m)$ for any $V_m \in \mathbf{V}$. Thus,
$$\frac{\partial^2 \log p^{(a_l)}_{\mathbf{T}}(v_m \mid pa^{\mathbf{T}+}_m)}{\partial v_i \partial v_j} = 0 \quad (31)$$
Consequently, for any pair $V_k, V_r$ such that $V_k \perp V_r \mid \bar{Q} \setminus \{V_k, V_r\}$,
$$\frac{\partial^2 \big( \log p^{(a_l)}_{\mathbf{T}}(\hat{\mathbf{v}}) - \log p^{(a_0)}_{\mathbf{T}}(\hat{\mathbf{v}}) \big)}{\partial \hat{v}_k \partial \hat{v}_r} = 0 \quad (32)$$
On the other hand, when either $V_k$ or $V_r$ lies in $\bar{Q} \setminus Q^{(a_l),(a_0)}$ for some $l \in [L]$, the same mixed derivative also vanishes (33), since
$$\frac{\partial \big( \log p^{(a_l)}_{\mathbf{T}}(\hat{\mathbf{v}}) - \log p^{(a_0)}_{\mathbf{T}}(\hat{\mathbf{v}}) \big)}{\partial \hat{v}_k} = 0 \ \text{ or } \ \frac{\partial \big( \log p^{(a_l)}_{\mathbf{T}}(\hat{\mathbf{v}}) - \log p^{(a_0)}_{\mathbf{T}}(\hat{\mathbf{v}}) \big)}{\partial \hat{v}_r} = 0 \quad (34)$$
Using Eq. (31) and taking the second partial derivative $\partial^2 / \partial v_i \partial v_j$ on both sides of Eq. (30), the left side is 0, and the chain rule gives, for each $l \in [L]$, a linear relation among the terms
$$\frac{\partial^2 (\log p^{(a_l)}_{\mathbf{T}} - \log p^{(a_0)}_{\mathbf{T}})}{\partial \hat{v}_q^2} \frac{\partial \hat{v}_q}{\partial v_i} \frac{\partial \hat{v}_q}{\partial v_j}, \quad \frac{\partial (\log p^{(a_l)}_{\mathbf{T}} - \log p^{(a_0)}_{\mathbf{T}})}{\partial \hat{v}_q} \frac{\partial^2 \hat{v}_q}{\partial v_i \partial v_j}, \quad \frac{\partial^2 (\log p^{(a_l)}_{\mathbf{T}} - \log p^{(a_0)}_{\mathbf{T}})}{\partial \hat{v}_k \partial \hat{v}_r} \frac{\partial \hat{v}_k}{\partial v_i} \frac{\partial \hat{v}_r}{\partial v_j} \quad (36)$$
Eq. (36) is again a linear system. When the distributions change sufficiently, namely under Assumption 4, the rows of the coefficient matrix are linearly independent; when $L \geq 2|\bar{Q}| + |E|$ (implied by condition (3')), the matrix is full rank, and thus
$$\frac{\partial \hat{v}_i}{\partial v_i} \frac{\partial \hat{v}_i}{\partial v_j} = 0, \quad \frac{\partial \hat{v}_j}{\partial v_i} \frac{\partial \hat{v}_j}{\partial v_j} = 0 \quad (37)$$
Then we have
$$\frac{\partial \hat{v}_i}{\partial v_j} = 0, \quad \frac{\partial \hat{v}_j}{\partial v_i} = 0 \quad (38)$$
up to a permutation of $V_i$ and $V_j$. This implies that $V_i$ is ID w.r.t. $V_j$ and $V_j$ is ID w.r.t. $V_i$. $\square$

When $Q^{(a_1),(a_0)} = \cdots = Q^{(a_L),(a_0)} = \mathbf{V}^{tar}$, Prop. 4 follows directly from the above result.

Proposition 4 (ID of variables within Q sets). Consider the variables $\mathbf{V}^{tar} \subseteq \mathbf{V}$ and $\mathcal{P}_{\mathbf{T}}$ satisfying condition (1) of Prop. 3 with $Q^{(a_l),(a_0)} = \mathbf{V}^{tar}$ for all $l \in [L]$. For any pair $V_i, V_j \in \mathbf{V}^{tar}$ such that $V_i \perp V_j \mid \mathbf{V}^{tar} \setminus \{V_i, V_j\}$ in $\mathcal{G}_{\bar{\mathbf{T}}}(\mathbf{V}^{tar})$, $V_i$ is ID w.r.t. $V_j$ if $L \geq 2|\mathbf{V}^{tar}| + \delta$, where $\delta$ is the number of pairs $V_k, V_r \in \mathbf{V}^{tar}$ such that $V_k$ and $V_r$ are connected given $\mathbf{V}^{tar} \setminus \{V_k, V_r\}$ in $\mathcal{G}_{\bar{\mathbf{T}}}(\mathbf{V}^{tar})$.

Proof. Taking $Q^{(a_1),(a_0)} = \cdots = Q^{(a_L),(a_0)} = \mathbf{V}^{tar}$, condition (2) is satisfied; when $L \geq 2|\mathbf{V}^{tar}| + \delta$, condition (3') is satisfied. Thus Prop. 4 holds. $\square$

Lemma 3 (ID of variables within Q sets; restated from Appendix B, with conditions [1-2] of Prop. 3 and condition [4'])
. Then $V_i$ is ID w.r.t. $V_j$.

Proof. If $V_p$ is ID w.r.t. $V_i$ or $V_q$ is ID w.r.t. $V_j$, the corresponding unknown in the linear system vanishes:
$$\frac{\partial \hat{v}_p}{\partial v_i} \frac{\partial \hat{v}_q}{\partial v_j} = 0 \quad (39)$$
and likewise
$$\frac{\partial^2 \hat{v}_q}{\partial v_i \partial v_j} = 0 \quad (40)$$
if $V_q$ is ID w.r.t. $V_i$ or $V_j$. Even with these terms excluded, as in condition [4'], the system still has only the zero solution. $\square$

C.6 ID-reverse of existing disentangled variables - Proof of Proposition 5

The next proposition provides an additional tool for achieving identifiability, leveraging the fact that other variables may have previously been disentangled, together with independence relationships in the factorization.

Proposition 5 (ID of canceled variables w.r.t. Q sets). Suppose $\Psi$ contains perf($\mathbf{T}$). Given that $\mathbf{V} \setminus V^{tar}$ is ID w.r.t. a single variable $V^{tar}$, $V^{tar}$ is ID w.r.t. $\mathbf{V} \setminus V^{tar}$ if $V^{tar} \perp \mathbf{V} \setminus V^{tar}$ in $\mathcal{G}_{\bar{\mathbf{T}}}$.

Proof. We first introduce a lemma on distribution preservation from [20].

Lemma 4 (Lemma 2 of [20]). Let $A = C = \mathbb{R}$ and $B = \mathbb{R}^n$. Let $f : A \times B \rightarrow C$ be differentiable, and define differentiable measures $P_A$ on $A$ and $P_C$ on $C$. For every $b \in B$, let $f(\cdot, b) : A \rightarrow C$ be measure-preserving. Then $f$ is constant in $b$.

Denote $\mathbf{V} \setminus V^{tar}$ as $\mathbf{Z}$. $V^{tar} \perp \mathbf{Z}$ in $\mathcal{G}_{\bar{\mathbf{T}}}$ implies that
$$P_{\mathbf{T}}(\mathbf{V}) = P_{\mathbf{T}}(V^{tar}) P_{\mathbf{T}}(\mathbf{Z}) \quad (41)$$
With the change-of-variables formula and taking logs,
$$\log p_{\mathbf{T}}(v^{tar}) + \log p_{\mathbf{T}}(\mathbf{z}) = \log p_{\mathbf{T}}(\hat{v}^{tar}) + \log p_{\mathbf{T}}(\hat{\mathbf{z}}) + \log |J_\phi| \quad (42)$$
Since $\mathbf{Z}$ is ID w.r.t. $V^{tar}$, $\partial \hat{\mathbf{Z}} / \partial v^{tar} = 0$; in other words, the entries $\partial \phi_Z / \partial v^{tar}$ of the Jacobian matrix are 0 for every $Z \in \mathbf{Z}$, where $\phi_Z$ is the function mapping $\mathbf{V}$ to $\hat{Z}$. Then
$$\log |J_\phi| = \log |J_{\mathbf{Z}}| + \log |J_{V^{tar}}| \quad (43)$$
where $J_{\mathbf{Z}} = (\partial \phi_{Z_i} / \partial z_j)_{i,j \in [d-1]}$ and $\log |J_{V^{tar}}| = \log |\partial \phi_{V^{tar}} / \partial v^{tar}|$. Again, since $\mathbf{Z}$ is ID w.r.t. $V^{tar}$, $\hat{\mathbf{Z}} = \phi_{\mathbf{Z}}(\mathbf{Z})$. Thus,
$$\log p_{\mathbf{T}}(\mathbf{z}) = \log p_{\mathbf{T}}(\hat{\mathbf{z}}) + \log |J_{\mathbf{Z}}| \quad (44)$$
Subtracting this from Eq. (42),
$$\log p_{\mathbf{T}}(v^{tar}) = \log p_{\mathbf{T}}(\hat{v}^{tar}) + \log |J_{V^{tar}}| \quad (45)$$
Denote $\phi_{V^{tar}}(\mathbf{z}, \cdot)$ as $\phi^{\mathbf{z}}_{V^{tar}}(\cdot)$, the function $\phi_{V^{tar}}$ with the value $\mathbf{Z} = \mathbf{z}$ fixed, mapping $V^{tar}$ to $\hat{V}^{tar}$. The above states that for every $\mathbf{z}$,
$$P_{\mathbf{T}}(\hat{V}^{tar}) = P_{\mathbf{T}}(\phi^{\mathbf{z}}_{V^{tar}}(V^{tar})) \quad (46)$$
i.e., $\phi^{\mathbf{z}}_{V^{tar}}$ is measure-preserving for every $\mathbf{z}$. Applying Lemma 4 (Lemma 2 of [20]), $\phi_{V^{tar}}$ must be constant with respect to $\mathbf{Z}$. Thus $\hat{V}^{tar}$ is a function of $V^{tar}$ alone, i.e., $V^{tar}$ is ID w.r.t. $\mathbf{V} \setminus V^{tar}$. $\square$

C.7 Soundness of the Latent ID Algorithm - Proof of Thm. 1

The following provides the proof of the soundness of our proposed graphical algorithm for determining whether or not two variables are disentangleable given a collection of distributions from multiple domains and interventions.

Theorem 1 (Soundness of CRID). Consider an LSD $\mathcal{G}^S$ and intervention targets $\Psi$. Consider the target variables $\mathbf{V}^{tar}$ and $\mathbf{V}^{en} \subseteq \mathbf{V} \setminus \mathbf{V}^{tar}$. If no edges from $\mathbf{V}^{en}$ point to $\hat{\mathbf{V}}^{tar}$ in the output causal disentanglement map (CDM) $\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}$ returned by CRID, then $\mathbf{V}^{tar}$ is ID w.r.t. $\mathbf{V}^{en}$.

Proof. In CRID, each epoch iterates over the choices of $\mathbf{T}$ and the baseline distribution to execute the procedures Alg. F.3 and Alg. F.6. Any time an edge is removed, Proposition 3 and/or 4 is applied. At the end of each epoch, Alg. F.8 is executed, and edges are removed only if Proposition 5 applies. Thus all edge removals are sound. The algorithm stops when no more edges can be removed and terminates by returning the causal disentanglement map $\mathcal{G}_{\mathbf{V},\hat{\mathbf{V}}}$, which is a valid summary of what is disentangleable. $\square$

[Figure S3: Latent causal graph and the desired causal disentanglement map.]

D Examples and Discussion

D.1 Additional Example Illustrating the Motivation of Causal Disentangled Learning

In the introduction, we illustrated a medical example of why it is important to learn disentangled representations.
[Figure S2: The disentanglement requirements in the face example.]

An additional motivating example can be seen through the lens of generating realistic face images [27]. Consider an image dataset of human faces. Based on our understanding of anatomy and facial expressions, we know that Gender and Age are not causally related, while Age directly affects Hair Color. There is a strong spurious correlation between Age and Gender: the dataset contains many old males and young females. In addition, let there be face images from both a senior center and a teen center. The change in domain (i.e., population center) impacts the age distribution, as the senior-center faces are older than the teen-center faces. Given these images and knowledge of the latent causal graph, one would ultimately like to generate realistic face images given perturbations of Age. If the variable representations are entangled, then changes in Age may also spuriously change Gender. This is undesirable, and thus our goal is to achieve disentanglement of Age and Gender. Note that we do not necessarily require Age to be disentangled from Hair Color, since changing Age while simultaneously changing Hair Color would still yield a realistic image. Here, we seek the causal disentanglement map shown in Fig. S3. Given that map, we know that once the representations are fully learned, we can intervene on Age without changing the Gender of the face. This motivates the need for a general notion of identifiability, in contrast to the scaling indeterminacy of Def. 6.5, which requires all variables to be disentangled from each other.

As another motivating example, consider a marketing company creating faces for a female product. The relevant latent factors are Gender, Age, and Hair Color, with Age $\rightarrow$ Hair Color. If Gender and Age are entangled, changing Age might also alter Gender, which is undesirable. The company needs a model in which Age is disentangled from Gender, while correlation with Hair Color is allowed. Our paper addresses the problem of determining whether a given set of input data and assumptions, in the form of an LSD, is sufficient to learn such a disentangled representation.

D.2 Examples for the non-Markovian Factorization

In this section, we centralize the theoretical background for the theory presented in this paper. Unless specified otherwise, $\log$ denotes the natural logarithm. We first provide more discussion of the non-Markovian factorization, Eq. (1). The concept of a C-component is formally defined as follows.

[Figure S4: Causal graph with four C-components.]

Definition 6.2 (Confounded Component). Let $\{\mathbf{C}_1, \mathbf{C}_2, \ldots, \mathbf{C}_k\}$ be a partition over the set of variables $\mathbf{V}$, where $\mathbf{C}_i$ is said to be a confounded component (for short, C-component) of the selection diagram $\mathcal{G}_{\mathbf{V}}$ if for every $V_i, V_j \in \mathbf{C}_i$ there exists a path made entirely of bidirected edges between $V_i$ and $V_j$ in $\mathcal{G}_{\mathbf{V}}$, and $\mathbf{C}_i$ is maximal.

This construct represents clusters of variables that share the same exogenous variations, regardless of their directed connections. The selection diagram in Figure 2 has a bidirected edge indicating the presence of unobserved confounders affecting the pair $(V_1, V_2)$, and contains two C-components, namely $\mathbf{C}_1 = \{V_1, V_2\}$ and $\mathbf{C}_2 = \{V_3\}$. Akin to parents within a Markovian SCM, the C-components play a fundamental role in factorizing the joint distribution of the observed variables $\mathbf{V}$.
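Computationally, C-components are simply the connected components of the graph restricted to its bidirected edges. A minimal sketch (hypothetical encoding):

```python
def c_components(nodes, bidirected):
    """C-components (Def. 6.2): connected components under bidirected edges."""
    comps, seen = [], set()
    for v in nodes:
        if v in seen:
            continue
        comp, stack = set(), [v]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            # Push the bidirected neighbors of u.
            stack.extend(w for e in bidirected if u in e for w in e if w != u)
        comps.append(comp)
        seen |= comp
    return comps

# The two C-components described above: V1 <-> V2 confounded, V3 alone.
print(c_components(["V1", "V2", "V3"], [{"V1", "V2"}]))
# -> [{'V1', 'V2'}, {'V3'}]
```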
Let $<$ be a topological order $V_1 < \cdots < V_n$ of the variables $\mathbf{V}$ in $\mathcal{G}^S$, and define $Pa^{\mathbf{T}+}_i = Pa(\{V \in C(V_i) : V \leq V_i\}) \setminus \{V_i\}$. That is, the $Pa^+(V_i)$ set consists of the nodes in the same C-component that precede (or equal) $V_i$ in the topological order, together with their parents, minus the node $V_i$ itself. For instance, in Fig. S4, $Pa^+(E) = \{D, C, A\}$ and $Pa^+(D) = \{B, C, A\}$. The general factorization formula, Eq. (1), factorizes not only the joint observational distribution relative to a causal graph, but also interventional distributions: with a perfect intervention on $\mathbf{T}$, the factorization follows the corresponding graph $\mathcal{G}_{\bar{\mathbf{T}}}$, in which the incoming arrows into $\mathbf{T}$ are cut. This factorization encompasses both Markovian and non-Markovian SCM models; when there are no bidirected edges in the diagram, $Pa^{\mathbf{T}+}_i$ reduces to the ordinary parent set $Pa_i$ in $\mathcal{G}_{\bar{\mathbf{T}}}$.

Next, we introduce the Markov blanket, a fundamental idea for characterizing certain conditional independences in a causal graph [71, 72].

Definition 6.3 (Markov Blanket). Let $\mathcal{G}$ be a causal graph over variables $\mathbf{V}$. A Markov blanket of a random variable $Y \in \mathbf{V}$ is any subset $\mathbf{V}_1 \subseteq \mathbf{V}$ such that, conditioned on $\mathbf{V}_1$, $Y$ is independent of all other variables.

The Markov blanket is an important object that captures the conditional independences between variables when conditioning on all other variables in the graph.

Definition 6.4 ("Global" Markov property of DAGs [73]). A joint probability distribution $P$ over a set of variables $\mathbf{V}$ satisfies the Markov property with respect to a graph $\mathcal{G} = (\mathbf{V} \cup \mathbf{L}, E)$ if the following holds for all disjoint subsets $(\mathbf{X}, \mathbf{Y}, \mathbf{Z})$ of $\mathbf{V}$: $P(\mathbf{y} \mid \mathbf{x}, \mathbf{z}) = P(\mathbf{y} \mid \mathbf{z})$ if $\mathbf{Y} \perp \mathbf{X} \mid \mathbf{Z}$ in $\mathcal{G}$ (that is, $\mathbf{Y}$ is d-separated from $\mathbf{X}$ given $\mathbf{Z}$).

The global Markov property maps graphical structure in causal directed acyclic graphs (DAGs) to conditional independence (CI) statements in the corresponding probability distributions. The distributions we consider in $\mathcal{P}$ are assumed Markov w.r.t. the graph, mapping d-separations in the graph to conditional independences in the distributions; this allows us to leverage factorizations such as the one presented in Section 2.

E Examples for Proposition 2

The following example further illustrates invariant factors.

Example 14 (Example 1 continued). Choose $P^{(1)}$ as the baseline and $\mathbf{T} = \{\}$. The factorization of $P(\mathbf{V})$ is $P(V_1) P(V_2 \mid V_1) P(V_3 \mid V_2)$. The changed variable set $\bar{\mathbf{V}}[I^{(2)}, I^{(1)}, \mathcal{G}^S] = \{V_3\}$, since the S-node points to $V_3$ in $\mathcal{G}^S$, and $\bar{\mathbf{V}}[I^{(3)}, I^{(1)}, \mathcal{G}^S] = \{V_3\}$, since $V_3 \in I^{(3)}$ while $V_3 \notin I^{(1)}$. Thus, comparing $P^{(2)}$ and $P^{(3)}$ with the baseline $P^{(1)}$, $p(v_2 \mid v_1)$ and $p(v_1)$ are invariant factors, while $p(v_3 \mid v_2)$ may change.

E.1 The detailed examples of Propositions 3 and 4

We show another example of using Proposition 3 to solve the ID task of Example 1.

Example 15 (Example 14 continued). Consider $\mathbf{V}^{tar} = \{V_2, V_3\}$ and $\mathbf{V}^{en} = \mathbf{V} \setminus \{V_2, V_3\} = \{V_1\}$. When comparing $\{P^{(2)}, P^{(3)}\}$ with the baseline $P^{(1)}$, $\mathbf{T} = Perf[I^{(1)}] = \{\}$, and then
$$Q[I^{(2)}, I^{(1)}, \mathbf{T}, \mathcal{G}^S] = Q[I^{(3)}, I^{(1)}, \mathbf{T}, \mathcal{G}^S] = \mathbf{V}^{tar} \quad (48)$$
Thus these two comparisons satisfy the three conditions in Prop. 3: the number of compared distributions $\{P^{(2)}, P^{(3)}\}$ is 2, which equals $|\mathbf{V}^{tar}|$, so $\mathbf{V}^{tar}$ is ID w.r.t. $\mathbf{V}^{en}$ by Prop. 3. This demonstrates that a variable $V_2$ can be disentangled from another variable in the same C-component ($V_1$); see Ex. 16 for the detailed derivation. Propositions 3 and 4 disentangle variables by comparing distributions: with enough distributions, one can build a linear system (illustrated in Appendix C.4 and C.5).
Example 16. (Details for Example 15.) By comparing the distributions resulting from σ^(2) and σ^(3) with the baseline σ^(1),

log p^(2)(v3 | v2) − log p^(1)(v3 | v2) = log p̂^(2)(v̂3 | v̂2) − log p̂^(1)(v̂3 | v̂2)
log p^(3)(v3 | v2) − log p^(1)(v3 | v2) = log p̂^(3)(v̂3 | v̂2) − log p̂^(1)(v̂3 | v̂2)    (49)

Taking the first-order partial derivative w.r.t. V1 (the left-hand sides do not depend on v1):

0 = ∂[log p̂^(2)(v̂3 | v̂2) − log p̂^(1)(v̂3 | v̂2)]/∂v̂2 · ∂v̂2/∂v1 + ∂[log p̂^(2)(v̂3 | v̂2) − log p̂^(1)(v̂3 | v̂2)]/∂v̂3 · ∂v̂3/∂v1
0 = ∂[log p̂^(3)(v̂3 | v̂2) − log p̂^(1)(v̂3 | v̂2)]/∂v̂2 · ∂v̂2/∂v1 + ∂[log p̂^(3)(v̂3 | v̂2) − log p̂^(1)(v̂3 | v̂2)]/∂v̂3 · ∂v̂3/∂v1    (50)

In this system, notice that

log p̂^(2)(v̂3 | v̂2) − log p̂^(1)(v̂3 | v̂2) = log p̂^(2)(v̂1, v̂2, v̂3) − log p̂^(1)(v̂1, v̂2, v̂3)    (51)

Then, since the coefficients are linearly independent by Assumption 4, we have

∂v̂2/∂v1 = 0,  ∂v̂3/∂v1 = 0    (52)

Then V̂2 = τ2(V2, V3) and V̂3 = τ3(V2, V3). First, this example shows we can disentangle two variables in the same C-component (V1, V2). Second, prior comparisons with the baseline show that one can disentangle a variable V from its descendants when soft interventions are given per node, while V remains entangled with its ancestors (see Sec. F.6). The above result shows that it is additionally possible to disentangle variables from their ancestors using only soft interventions. More interestingly, no intervention is performed on V2 while we disentangle V2 from V1. Compared with [22], where one would disentangle V1 and V3 using 10 distributions, we demonstrate that 3 distributions are enough.

Example 17. (Details for Example 4.) Choose the order V1 < V3 < V2 < V4, so that

P(V) = P(V1)P(V3)P(V2 | V1, V3)P(V4 | V3)    (53)

is the factorization. By comparing the distributions resulting from σ^(2), σ^(3), and σ^(4) with the baseline σ^(1),

log p^(2)(v2 | v1, v3) − log p^(1)(v2 | v1, v3) = log p̂^(2)(v̂2 | v̂1, v̂3) − log p̂^(1)(v̂2 | v̂1, v̂3)
log p^(3)(v3) − log p^(1)(v3) + log p^(3)(v2 | v1, v3) − log p^(1)(v2 | v1, v3) = log p̂^(3)(v̂3) − log p̂^(1)(v̂3) + log p̂^(3)(v̂2 | v̂1, v̂3) − log p̂^(1)(v̂2 | v̂1, v̂3)
log p^(4)(v1) − log p^(1)(v1) = log p̂^(4)(v̂1) − log p̂^(1)(v̂1)    (54)

Taking the first-order partial derivative w.r.t. V4:

0 = h2,1 ∂v̂1/∂v4 + h2,2 ∂v̂2/∂v4 + h2,3 ∂v̂3/∂v4
0 = h3,1 ∂v̂1/∂v4 + h3,2 ∂v̂2/∂v4 + h3,3 ∂v̂3/∂v4
0 = h4,1 ∂v̂1/∂v4    (55)

where

h2,i = ∂[log p̂^(2)(v̂2 | v̂1, v̂3) − log p̂^(1)(v̂2 | v̂1, v̂3)]/∂v̂i for i = 1, 2, 3
h3,i = ∂[log p̂^(3)(v̂3) − log p̂^(1)(v̂3) + log p̂^(3)(v̂2 | v̂1, v̂3) − log p̂^(1)(v̂2 | v̂1, v̂3)]/∂v̂i for i = 1, 2, 3
h4,1 = ∂[log p̂^(4)(v̂1) − log p̂^(1)(v̂1)]/∂v̂1    (56)

Then, since the coefficients are linearly independent by Assumption 4, we have

∂v̂1/∂v4 = 0,  ∂v̂2/∂v4 = 0,  ∂v̂3/∂v4 = 0    (57)

Then V̂1 = τ1(V1, V2, V3), V̂2 = τ2(V1, V2, V3), and V̂3 = τ3(V1, V2, V3).

Example 18. The factorization based on G_S choosing T = {} is

P(V) = P(V1)P(V3)P(V2 | V1, V3)    (58)

By comparing the distributions resulting from σ^(j) with the baseline σ^(1), for j = 2, 3, 4, 5,

log p^(j)(v1) + log p^(j)(v3) − log p^(1)(v1) − log p^(1)(v3) = log p̂^(j)(v̂1) + log p̂^(j)(v̂3) − log p̂^(1)(v̂1) − log p̂^(1)(v̂3)    (59)

Taking the second-order partial derivative w.r.t. V1, V3:

0 = ∂²[log p̂^(j)(v̂1) − log p̂^(1)(v̂1)]/∂v̂1² · ∂v̂1/∂v1 · ∂v̂1/∂v3 + ∂²[log p̂^(j)(v̂3) − log p̂^(1)(v̂3)]/∂v̂3² · ∂v̂3/∂v1 · ∂v̂3/∂v3 + ∂[log p̂^(j)(v̂1) − log p̂^(1)(v̂1)]/∂v̂1 · ∂²v̂1/∂v1∂v3 + ∂[log p̂^(j)(v̂3) − log p̂^(1)(v̂3)]/∂v̂3 · ∂²v̂3/∂v1∂v3    (60)

Then, since the coefficients are linearly independent by Assumption 4, we have

∂v̂1/∂v1 · ∂v̂1/∂v3 = 0,  ∂v̂3/∂v1 · ∂v̂3/∂v3 = 0    (61)

Then, after permutation,

∂v̂1/∂v3 = 0,  ∂v̂3/∂v1 = 0    (62)

which implies that V3 is ID w.r.t. V1 and V1 is ID w.r.t. V3. The following example shows how Proposition 5 achieves disentanglement for the ID task in Example 1.

Example 19. (Example 15 continued.) Let P^(4) with intervention target I^(4) = {V2^{Π1,[1],do}} be another distribution added to the original setting. Consider Vtar = {V1}.
From Ex. 15, {V2, V3} is ID w.r.t. V1. Consider T = {V2} (from I^(4)). Since V1 ∉ {V2, V3}, V1 is ID w.r.t. {V2, V3}.

F Related Work Discussion

Disentangled representation learning aims to obtain approximations V̂ = {V̂1, . . . , V̂d} that separate the distinct, informative generative factors of variation [5] from the observations of X and the inductive bias of M. In other words, the learning goal is an unmixing function f̂X⁻¹ that maps from X to V̂ (namely V̂ = f̂X⁻¹(X)), where V̂i is some transformation of W ⊆ V. The goal of disentangled representation learning is to have V̂i be a function only of Vi, i.e., W = {Vi}. This is not always possible, and different assumptions, data, and relaxed versions of disentanglement may be studied to theoretically ground representation learning. Disentangled representation learning tasks are studied with various assumptions and inputs. In the following, we discuss related tasks and identifiability results in the context of this paper. We also present a few case studies on the nuances between the Markovian and non-Markovian ASCM settings. First, we review the main goal of identifiability in all prior works, known as scaling identifiability, which is a special case of our ID definition in Def. 2.3.

Definition 6.5 (Scaling indeterminacy). Consider a collection of ASCMs M that induces an LSD G_S and a collection of distributions P. We say V is identifiable up to scaling indeterminacy if for every M̂ ∈ M that matches G_S and P, there exist functions {h1, . . . , hd} such that V̂i = hi(Vi), ∀i ∈ [d], where hi is a diffeomorphism in ℝ.

F.1 Causal representation learning with unknown latent causal structure

In many prior works, the goal has been not only identifiability of the underlying latent variables, but also the discovery of the causal relationships among the latent variables [4, 22, 51]. That is, the latent causal graph is unknown. The work proposed in this paper is a foundation for the first step of causal representation learning, i.e., identifying the distributions of the latent causal variables. It would be interesting future work to explore how the results proposed in this paper extend to the case when the latent causal graph is unknown.

F.2 Comparisons with other identifiability criteria

We also consolidate other definitions of identifiability from the literature using the notion of an ASCM. We have already defined identifiability up to scaling ambiguity in Def. 6.5.

Corollary 2 (Scaling ID is a case of general ID). Let M be a collection of ASCMs with G_S the LSD over the latent causal variables V. If V′ ∈ V is identifiable up to scaling indeterminacy, then it is identifiable w.r.t. V \ {V′}.

Proof. The proof follows from the application of Def. 2.3 and Def. 6.5.

Definition 6.6 (Identifiability up to ancestral mixtures [21]). Let M be a collection of ASCMs with G_S the LSD over the latent causal variables V. We say a variable V′ ∈ V is identifiable up to ancestral mixtures if for every M̂ ∈ M that matches G_S and P, there exist functions {h1, . . . , hd} such that V̂i = hi(Anc(Vi)), ∀i ∈ [d].

Corollary 3 (Ancestral ID is a case of general ID). Let M be a collection of ASCMs with G_S the LSD over the latent causal variables V. If V′ ∈ V is identifiable up to ancestral mixtures, then it is identifiable w.r.t. V \ Anc(V′).

Proof. The proof follows from the application of Def. 2.3 and Def. 6.6.

The following definitions are inspired by the identifiability results from [22].

Definition 6.7 (Intimate Neighbor Set).
We say ε(M_G, Vi) := {Vj | j ≠ i, and Vj is adjacent to Vi and to all other neighbors of Vi in M_G}.

The intimate neighbor set of a variable is the set of its neighbors that are adjacent to all of that variable's other neighbors. It is used in the following definition from [22].

Definition 6.8 (Identifiability up to the intimate neighbor set of the Markov Network [22]). Let M be a collection of ASCMs with G_S the LSD over the latent causal variables V. We say a variable V′ ∈ V is identifiable up to intimate neighbors in the Markov Network if for every M̂ ∈ M that matches G_S and P, there exist functions {h1, . . . , hd} such that V̂i = hi(ε(M_G, Vi)), ∀i ∈ [d], where M_G is the Markov network of G and ε(M_G, Vi) is the intimate neighbor set of Vi in M_G.

Corollary 4 (Intimate Neighbor Markov Network ID is a case of general ID). Let M be a collection of ASCMs with G_S the LSD over the latent causal variables V. If V′ ∈ V is identifiable up to the intimate neighbor set of the Markov Network, then it is identifiable w.r.t. V \ ε(M_G, V′).

Proof. The result follows from the application of Def. 2.3 and Def. 6.8.

Thus, we showed that each of these identifiability definitions implies general ID for a non-trivial subset of latent variables with respect to some Ven ⊆ V.

F.3 Case study on challenges when disentangling variables in a non-Markovian setting

Prior results suggest that in a Markovian setting, given a perfect intervention on every node, the latent variables V are ID up to scaling indeterminacies according to Def. 6.5 [14, 21]. One would suspect that ID may still hold in non-Markovian ASCMs, but the following result states that even with one perfect intervention per node, it is not possible to disentangle latent variables within the same c-component.

Lemma 5 (Challenges of identifiability in non-Markovian causal models). Consider an ASCM M that induces the diagram V1 ↔ V2. Suppose the intervention set includes an observational distribution and perfect interventions on both V1 and V2: Ψ = ⟨σ{}, σ_M({V1}), σ_M({V2})⟩. Then V1 is not ID w.r.t. V2 and vice versa.

Proof. We prove this by constructing a counter-example. Consider an ASCM M constructed as follows:

V1 ← U_{1,2},  V2 ← U_{1,2} + U_{V2}
X1 ← V1,  X2 ← V2
U_{1,2} ~ N(0, 1),  U_{V2} ~ N(0, 3)
σ_{V1} = P(U′_{V1}),  U′_{V1} ~ N(0, 2)
σ_{V2} = P(U′_{V2}),  U′_{V2} ~ N(0, 7)

Consider a separate ASCM M^(1) constructed as follows:

V1^(1) ← U_{1,2}^(1),  V2^(1) ← 0.5 U_{1,2}^(1) + 1.5 U_{V2}^(1)
X1 ← 1/3 V1^(1) + 2/3 V2^(1),  X2 ← 2/3 V1^(1) − 2/3 V2^(1)
U_{1,2}^(1) ~ N(0, 3),  U_{V2}^(1) ~ N(0, 1)
σ_{V1} = P(U′^(1)_{V1}),  U′^(1)_{V1} ~ N(0, 6)
σ_{V2} = P(U′^(1)_{V2}),  U′^(1)_{V2} ~ N(0, 7)

M and M^(1) induce the same observational distribution P(X) ~ N(0, [[1, 1], [1, 4]]) and interventional distributions P(X; σ_{V1}) ~ N(0, [[2, 0], [0, 4]]) and P(X; σ_{V2}) ~ N(0, [[1, 0], [0, 7]]). However, V1^(1) = V1 − V2, which implies V1^(1) is not an ancestral mixture or rescaling of the original V1. Therefore, V1 is not identifiable up to ancestral mixtures, or rescaling.
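As a sanity check on the construction of M above (my own Monte-Carlo sketch, not part of the proof), one can simulate the three regimes and confirm that they induce the stated Gaussian distributions over X:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

def sample_X(regime):
    u12 = rng.normal(0.0, np.sqrt(1.0), n)   # U_{1,2} ~ N(0, 1)
    uv2 = rng.normal(0.0, np.sqrt(3.0), n)   # U_{V2}  ~ N(0, 3)
    v1 = rng.normal(0, np.sqrt(2.0), n) if regime == "do_v1" else u12
    v2 = rng.normal(0, np.sqrt(7.0), n) if regime == "do_v2" else u12 + uv2
    return np.stack([v1, v2], axis=1)        # X1 <- V1, X2 <- V2

for regime in ("obs", "do_v1", "do_v2"):
    print(regime, np.round(np.cov(sample_X(regime).T), 1))
# obs   ~ [[1, 1], [1, 4]];  do_v1 ~ [[2, 0], [0, 4]];  do_v2 ~ [[1, 0], [0, 7]]
```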
F.4 ID within c-components

Lemma 5 shows that even with one perfect intervention on each node, it is not possible to disentangle variables within the same c-component. The next lemma provides a means of doing so using two perfect interventions on the same node. This provides some intuition for the usefulness of perfect interventions in the CRID setting.

Lemma 6 (Two perfect interventions can disentangle within a c-component). Let G_S be the LSD induced from a collection of ASCMs M. Suppose Vi, Vj ∈ V are in the same c-component, and there are L + 1 perfect interventional distributions P_{Vi} = {P^(a0), P^(a1), . . . , P^(aL)} such that Vi ∈ do[I^(al)] and the sets Q[I^(al), I^(a0), Vi, G_S] are equivalent (denoted Q) for l ∈ [L]. When Vj ∉ Q and L ≥ |Q|, Vi is identifiable w.r.t. Vj. When Vj ∈ Q and L ≥ 2|Q| + δ, Vi is identifiable w.r.t. Vj.

Proof. The result follows from the application of Proposition 3 and Proposition 4.

Example 20. In the simplest case, let do[I^(j)] = do[I^(k)] = {Vi} and Q = {Vi}, where Vi, Vj ∈ Ck are two arbitrary latent variables in the same c-component. By comparing distributions, we have

log p^(2)(vi) − log p^(1)(vi) = log p̂^(2)(v̂i) − log p̂^(1)(v̂i)    (63)

Taking the partial derivative w.r.t. Vj, we have

0 = ∂[log p̂^(2)(v̂i) − log p̂^(1)(v̂i)]/∂v̂i · ∂v̂i/∂vj    (64)

which implies ∂v̂i/∂vj = 0. Notice that this is not the only way to disentangle two variables in the same C-component. In Example 6, V1 and V2 are disentangled from each other without leveraging two perfect interventions.

F.5 Case study on disentangling variables in a Markovian setting

This next example works out the algebraic derivations for analyzing Fig. 4(a). The derivation is provided to give additional intuition on the theory presented in Section 3, and on how these concepts apply in a simple 3-dimensional latent causal graph.

Example 21 (Algebraic derivation of disentanglement in a simple 3-node chain graph). Given the graph shown in Figure 4(a), we can factorize the joint observational distribution of the latent variables as

P(V) = P(V3 | V2)P(V1 | V2)P(V2)    (65)

By the probability transformation formula, we can similarly write the distribution in terms of its estimated sources via the function ϕ = f̂X⁻¹ ∘ fX:

P(V) = P(ϕ_{V3}(V) | ϕ_{V2}(V))P(ϕ_{V1}(V) | ϕ_{V2}(V))P(ϕ_{V2}(V)) |det Jϕ|    (66)

Now, consider the interventional distributions P(V; σ^(1)_{V3}) and P(V; σ^(2)_{V3}). Here, we use the shorthand ϕi to indicate ϕ_{Vi}(V). We can factorize the distribution P(V; σ^(1)_{V3}):

P(V; σ^(1)_{V3}) = P(V3 | V2; σ^(1)_{V3})P(V1 | V2; σ^(1)_{V3})P(V2; σ^(1)_{V3})  (conditional independence)
= P(ϕ3 | ϕ2; σ^(1)_{V3})P(ϕ1 | ϕ2; σ^(1)_{V3})P(ϕ2; σ^(1)_{V3}) |det Jϕ|  (probability transformation formula)

Similarly, we can decompose the interventional distribution P(V; σ^(2)_{V3}). Now, comparing the log observational distribution with the log interventional distribution under σ^(i)_{V3}, we get:

log p(V; σ^(i)_{V3}) − log p(V) = log p(V3 | V2; σ^(i)_{V3}) + log p(V1 | V2; σ^(i)_{V3}) + log p(V2; σ^(i)_{V3}) − log p(V3 | V2) − log p(V1 | V2) − log p(V2)
= log p(V3 | V2; σ^(i)_{V3}) − log p(V3 | V2)

where the last line applies the invariance P(Vi | Vj; σ_{Vk}) = P(Vi | Vj) if (Vi ⊥ Vk | Vj) in the graph where incoming arrows to V3 are cut. In the space mapped by ϕ, we similarly get:

log p(ϕ; σ^(i)_{V3}) − log p(ϕ) = log p(ϕ3 | ϕ2; σ^(i)_{V3}) + log p(ϕ1 | ϕ2; σ^(i)_{V3}) + log p(ϕ2; σ^(i)_{V3}) − log p(ϕ3 | ϕ2) − log p(ϕ1 | ϕ2) − log p(ϕ2)
= log p(ϕ3 | ϕ2; σ^(i)_{V3}) − log p(ϕ3 | ϕ2)

When comparing the distributions of V̂, interestingly, the log of the determinant of the Jacobian cancels out. Combining the two, we get:

log p(V3 | V2; σ^(i)_{V3}) − log p(V3 | V2) = log p(ϕ3 | ϕ2; σ^(i)_{V3}) − log p(ϕ3 | ϕ2)    (67)

Taking the partial derivative with respect to V1, the LHS equals 0 and the RHS becomes:

0 = ∂/∂V1 [log p(ϕ3 | ϕ2; σ^(i)_{V3}) − log p(ϕ3 | ϕ2)]
= ∂[log p(ϕ3 | ϕ2; σ^(i)_{V3}) − log p(ϕ3 | ϕ2)]/∂ϕ3 · ∂ϕ3/∂V1 + ∂[log p(ϕ3 | ϕ2; σ^(i)_{V3}) − log p(ϕ3 | ϕ2)]/∂ϕ2 · ∂ϕ2/∂V1  (chain rule)

Thus, we have two unknowns, ∂ϕ3/∂V1 and ∂ϕ2/∂V1. Given the two interventions on V3 with mechanisms different from the observational distribution, we have two such equations, resulting in a 2-dimensional linear system. We are able to determine that ∂ϕ3/∂V1 = ∂ϕ2/∂V1 = 0, thus demonstrating that our approach disentangles V̂3 = ϕ3(V) and V̂2 = ϕ2(V) from V1.
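A toy numpy illustration of this final step (with hypothetical coefficient values): one intervention yields a single equation in the two unknowns, which still admits nonzero solutions, while two interventions with linearly independent coefficient vectors (Assumption 4) force both partial derivatives to zero.

```python
import numpy as np

h1 = np.array([ 0.9, -0.4])   # coefficients from the comparison with sigma_V3^(1)
h2 = np.array([-1.1,  0.6])   # coefficients from the comparison with sigma_V3^(2)

# One equation h1 . x = 0: a whole line of nonzero solutions remains.
x_bad = np.array([h1[1], -h1[0]])
print(np.isclose(h1 @ x_bad, 0.0), x_bad)   # True -> underdetermined

# Two equations with a rank-2 coefficient matrix: only x = 0 solves them.
H = np.stack([h1, h2])
print(np.linalg.matrix_rank(H))             # 2
print(np.linalg.solve(H, np.zeros(2)))      # [0. 0.] -> dphi3/dV1 = dphi2/dV1 = 0
```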
F.6 Comparing different identifiability results

In this section, we explicitly compare our work with a non-exhaustive list of related disentangled representation learning results in the setting of causally related latent components. Different from previous literature, we do not make common assumptions such as: (1) each intervention is applied to a single node [55]; (2) idle interventions (observational distributions) are present within each domain [21, 22]; (3) exactly one intervention is applied per node [13]; (4) at least one intervention is applied per node [21, 55].

Figure S5: Reproduced Fig. 4 for convenience.

Causal component analysis [21]. The closest work to ours is [21], which also presupposes knowledge of the latent causal graph and focuses solely on learning the unmixing function and the distributions of the causal variables. The results in [21] emphasize the need for interventions that occur only on a single node in the latent causal graph. However, Lemma 5 demonstrates challenges that are not addressed in that prior work. In addition, in our work, we propose a more general concept of identifiability in Def. 2.3. As a result, Thm. 1 makes significantly weaker assumptions while still achieving identifiability. Exs. 2-6 also illustrate nuances addressed by our work, but not by [21]. Other interesting concepts introduced by [21] are "fat-hand" interventions, which intervene on groups of variables, and "block-identifiability". Here, we illustrate some examples and discussion of how our work compares with [21], which also provides sufficient conditions for identifiability given a causal graph over the latent variables. One key difference is that we do not assume Markovianity in the underlying SCM, whereas they do.

Example (Ex. 6 cont.). This example continues off of Ex. 6. Consider the motivating example in healthcare depicted in Fig. 2. In hospitals from different countries Πi and Πj, drug treatment (V1) affects length of ICU stay (V2), and ultimately whether or not the patient lives or dies (V3). Our task is to learn representations of the high-level latent variables (V1, V2, V3), which are not collected, given a collection of low-level inputs such as EMRs, imaging, and bloodwork data (high-dimensional data X). In existing work [21], there are no guarantees that variables {V2, V3} are disentangled from their ancestor V1 given soft interventions per node. However, Proposition 3 demonstrates that two comparisons are enough to disentangle both V2 and V3 from their ancestor V1. Even in the Markovian setting, where the LSD does not contain bidirected edges, our results can thus guarantee identifiability where [21] does not.

Example 22 ([21] approach). Given the graph shown in Figure 4(a), [21] requires an observational distribution and the tuple of intervention sets Ψ = ⟨{}, {V1}, {V2}, {V3}⟩. Provided these four distributions, there is still no disentanglement of V̂3 with respect to any variable Vi ∈ V.

Causal Representation Learning from Multiple Distributions: A General Setting [22]. Another approach to achieving disentanglement among the latent variables is similar to nonlinear ICA, but leverages the conditional independence properties within a Markov network of the causal graph. The proof strategy of [22] considers second-order derivatives, which leverage the conditional independence constraints. However, this requires 2d + |E(M_G)| + 1 distributions that satisfy Assump. 4. A sketch of the Markov network construction is given below.
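For intuition on the Markov-network criterion of [22], here is a small networkx sketch (illustrative, not the authors' code): for a DAG, the Markov network M_G is the moral graph, obtained by marrying co-parents and dropping edge directions.

```python
import networkx as nx

g = nx.DiGraph([("V1", "V2"), ("V3", "V2")])   # collider V1 -> V2 <- V3
mg = nx.moral_graph(g)                          # Markov network M_G
print(sorted(tuple(sorted(e)) for e in mg.edges()))
# [('V1', 'V2'), ('V1', 'V3'), ('V2', 'V3')]
# Marrying the co-parents V1 and V3 makes them adjacent in M_G, which is
# why the criterion of [22] keeps them entangled in this graph.
```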
In addition, this strategy implies that in the collider graph V1 → V2 ← V3, V1 is not ID w.r.t. V2, and V3 is not ID w.r.t. V2. The following example again continues off of Ex. 6.

Example (Ex. 6 cont.). Consider again the healthcare example depicted in Fig. 2, with drug treatment (V1), length of ICU stay (V2), and survival (V3), where we learn representations of the latent variables from high-dimensional data X. According to [22], 10 distributions can disentangle V3 from V1 when V3 ⊥ V1 | V2. However, Proposition 3 demonstrates that two comparisons are enough to disentangle both V2 and V3 from their ancestor V1.

Linear ICA. Linear ICA has been extensively studied over decades, and is applied in magnetic resonance imaging (MRI) [74], astronomy [75], image processing [76], finance [77], and document analysis [78]. In linear ICA settings, the generative factors are assumed to be independent of each other and the mixing function fX is considered to be an invertible matrix A ∈ ℝ^{d×d}. Formally, the mechanisms F and the distribution P(U) of the true ASCM M are written as:

Vj ← fj(Uj), ∀j ∈ [d]
X ← AV
Ui ⊥ Uj, ∀i ≠ j ∈ [d]    (68)

Notice that X is a d-dimensional variable here and Xi ← Σ_{j=1}^d aij Vj = ai⊤V, ∀i ∈ [d]. Given the observational distribution P(X), the goal of linear ICA is to learn Â such that each V̂j = (Â⁻¹X)j is a scaling of a true underlying generative factor. Scaling and permutation identifiability is defined as follows to denote the achievability of linear ICA tasks.

Definition 6.9 (Scaling and Permutation Identifiability). The representation V̂ is said to be identifiable up to scaling and permutation if, for every pair of ASCMs M^(1) and M^(2) such that (1) P_{M^(1)}(X) = P_{M^(2)}(X) and P_{M^(1)}(X; σ_{vk}) = P_{M^(2)}(X; σ_{vk}), and (2) M^(1) and M^(2) are constrained by the modeling process in Eq. 68, we have V^(2) = CPV^(1), where C = diag(c1, . . . , cd) is a scaling diagonal matrix and P is a permutation matrix.

Def. 6.9 says that if every pair of models M^(1) and M^(2) in the linear ICA setting matches the observational distributions, the generative variables can be transformed into one another by permutation and scaling. This implies that once one finds a proxy ASCM M̂ that matches P(X), V̂ is guaranteed to be a scaled and permuted representation of the true generative variables if identifiability is achieved. The next example illustrates ASCMs in linear ICA settings and Def. 6.9.

Example 23 (ICA Identifiability Is Not Achieved). We consider three augmented generative processes M*, M^(1), and M^(2) with linear ICA constraints:

M*:  V1 ← U1, V2 ← U2;  X1 ← V1, X2 ← V2;  (U1, U2) ~ N(0, [[1, 0], [0, 1]])
M^(1):  V1^(1) ← U1, V2^(1) ← U2;  X1 ← 2V1^(1), X2 ← 0.5V2^(1);  (U1, U2) ~ N(0, [[1/4, 0], [0, 4]])
M^(2):  V1^(2) ← U1, V2^(2) ← U2;  X1 ← (√2/2)V1^(2) + (√2/2)V2^(2), X2 ← (√2/2)V1^(2) − (√2/2)V2^(2);  (U1, U2) ~ N(0, [[1, 0], [0, 1]])

It is verifiable that (X1, X2) ~ N(0, [[1, 0], [0, 1]]) is induced by all three models. The latent generative variables in M^(1) are scaled representations of the true factors of M*, namely V1^(1) = 0.5V1 and V2^(1) = 2V2. However, the representations V1^(2) and V2^(2) in M^(2) are mixtures of the true generative factors V1 and V2, i.e., V1^(2) = (√2/2)(V1 + V2) and V2^(2) = (√2/2)(V1 − V2), which is not a scaling and permutation transformation. Thus, M^(2) demonstrates that scaling and permutation identifiability is not achieved in this setting.
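A short numeric check of Example 23 (my own sketch): with X = AV and V ~ N(0, Σ), the induced covariance is AΣAᵀ, and all three models yield the identity.

```python
import numpy as np

s = np.sqrt(2) / 2
models = {
    "M*":   (np.eye(2),                   np.diag([1.0, 1.0])),   # X = V
    "M(1)": (np.diag([2.0, 0.5]),         np.diag([0.25, 4.0])),  # axis rescaling
    "M(2)": (np.array([[s, s], [s, -s]]), np.diag([1.0, 1.0])),   # 45-degree rotation
}
for name, (A, cov_V) in models.items():
    print(name, np.round(A @ cov_V @ A.T, 10))  # identity for all three
```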
The above example illustrates a famous result for linear ICA: the representations are not identifiable if the generative factors follow a multivariate Gaussian distribution. This result comes from the symmetry of Gaussian distributions: white Gaussian variables remain white Gaussian after any orthogonal transformation. However, orthogonal transformations are not guaranteed to be scalings or permutations, so a proxy model may have generative factors that are mixtures of the true V (V^(2) in Example 23). Under a non-Gaussianity assumption, by contrast, linear ICA is identifiable up to scaling and permutation.

Nonlinear ICA [7, 10]. Compared to linear ICA, nonlinear ICA assumes the mixing function is a nonlinear bijective function (i.e., invertible and differentiable). As in linear ICA, the generative factors are assumed to be independent of each other. Formally, the mechanisms F and the distribution P(U) of the true ASCM M are written as:

Vj ← fj(Uj), ∀j ∈ [d]
X ← fX(V)
Ui ⊥ Uj, ∀i ≠ j ∈ [d]    (70)

The traditional approaches for proving identifiability from [7, 8, 10] have the following settings:

(Assumptions) A parametric exponential family is assumed in [7]. In addition, the assumed causal structure over the latent variables is the fully disconnected graph, where all variables are mutually independent. Our work assumes a nonparametric mixing model, and only requires the mixing function to be a bijection. In addition, we allow a non-Markovian causal model among the latent variables, and are, to our knowledge, the first to analyze identifiability in this general setting.

(Data) Nonlinear ICA assumes 2d + 1 distributions with mechanism changes of the latent variables such that a version of Assump. 4 holds. One instantiation of this in real-world data is time series with non-stationary changes. Our work leverages arbitrary combinations of interventional data arising from multiple domains, and does not necessarily require observational data.

(Output) The focus of nonlinear ICA has typically been on achieving disentanglement of the latent variables up to scaling indeterminacy (Def. 6.5). Our work approaches the goal of identifiability from a more general setting according to Def. 2.3.

Interventional causal representation learning [52]. Another potentially promising approach to improving identifiability results lies in assuming a parametric form for the mixing function. [52] considers the setting of a mixing function that is a composition of polynomial functions (i.e., a polynomial decoder). Thus, [52] is able to achieve identifiability of the latent variables up to an affine transformation: V̂ = AV + c, where A ∈ ℝ^{d×d} and c ∈ ℝ^d make up an invertible affine transformation of the true latent variables V. In our work, we consider a nonparametric form of the mixing function. However, future work could consider relaxing this assumption in the direction of a parametric mixing function with polynomial functions.

Learning Causal Representations from General Environments: Identifiability and Intrinsic Ambiguity [25]. In this paper, identifiability results for a linear ASCM with a linear mixing function are provided, with access to multi-distributional data arising from different environments. In addition, the authors prove identifiability up to "surrounded-nodes" in [25, Thm. 3].
Specifically, any linear proxy model that is compatible with the observed distributions P and the causal graph, and that satisfies a few technical assumptions, achieves identifiability for each variable with respect to variables not in its surrounded set. Similar to ancestral identifiability (Def. 6.6), surrounded-set disentanglement is a special case of our proposed identifiability definition (Def. 2.3). Our work proposes a graphical criterion and an algorithm for determining a causal disentanglement map, which may contain different disentanglements compared to a surrounded set. Besides our notion of identifiability (goal), our paper also allows arbitrary distributions from multiple domains (input), and non-parametric non-Markovian ASCMs (assumptions).

Definition 6.10 (Surrounded set from [79]). For two nodes Vi, Vj ∈ V in a graph G, we say that Vi ∈ sur(Vj) if Vi ∈ Paj and Chj ⊆ Chi.

Identifying Linearly-Mixed Causal Representations from Multi-Node Interventions [80]. In this paper, the authors explore identifiability results in an ASCM with a linear mixing function, where interventions occur on multiple latent variables at the same time (i.e., multi-node interventions). Further, they assume that interventions are perfect, sufficiently diverse, and have a sparse effect on the set of latent variables. Their goal of identifiability is full disentanglement, which is a special case of the general disentanglement we provide in Def. 2.3, where any variable may be ID w.r.t. a subset of the latent variables. Our paper also allows multi-node interventions within our graphical criteria (Props. 3, 4, and 5). Moreover, our paper pursues a more general notion of identifiability in the form of a causal disentanglement map (goal); allows arbitrary distributions (soft and/or perfect interventions with the same or different mechanisms) from multiple domains, compared to only single-domain perfect interventions meeting a sparsity constraint (input); and considers non-parametric non-Markovian ASCMs versus linear Markovian ASCMs (assumptions).

Linear Causal Representation Learning from Unknown Multi-node Interventions [79]. In this paper, identifiability results are provided for a linear ASCM with a linear mixing function, where soft or perfect interventions occur on multiple nodes. The authors establish full disentanglement results, or disentanglement up to ancestors, which is similar to the results demonstrated in "Causal Component Analysis" [21]. Our paper also allows multi-node interventions within our graphical criteria (Props. 3, 4, and 5). Moreover, our paper pursues a more general notion of identifiability in the form of a causal disentanglement map (goal); allows arbitrary distributions (soft and/or perfect interventions with the same or different mechanisms) from multiple domains (input); and considers non-parametric non-Markovian ASCMs versus linear Markovian ASCMs (assumptions).

Learning Causal Representations from General Environments: Identifiability and Intrinsic Ambiguity [24]. In this paper, the authors propose a partial disentanglement goal in a linear or non-parametric Markovian ASCM with a sparse causal structure. Their notion of the disentanglement goal introduces so-called entanglement graphs (output), which is, interestingly, exactly what we call the causal disentanglement map.
Though the proposed output is the same, the identifiability results are not the same, even in the Markovian case. In addition, in terms of the distributions leveraged (input), our work differs in considering arbitrary combinations of distributions (soft, perfect, or observational) from heterogeneous domains. In terms of modeling the ASCM (assumptions), our work considers completely non-parametric non-Markovian ASCMs instead of non-parametric Markovian ASCMs with sparse connectivity. In future work, we believe it will be interesting to explore the assumption of sparsity in the context of our work.

G Experimental Results

G.1 Synthetic data-generating process

We generate data according to the latent causal diagrams shown in Fig. 4. Specifically, we analyze the chain graph V1 → V2 → V3 and the collider graph V1 → V2 ← V3 with different input distributions. Each graph is constructed according to an ASCM, where the latent variables are related linearly: Vi ← Σ_{j ∈ Pai} αi,j Vj + εi, where the linear parameters are drawn from a uniform distribution αi,j ~ U(−a, a), and the noise is distributed according to the standard normal distribution εi ~ N(0, 1).

Generating Multiple Domains. To generate a new domain, where Si,j → Vi indicates a change in mechanism for Vi between the ASCMs M^i and M^j, we start from the first ASCM generated and then modify the distribution of the noise variable with a mean shift.

Generating Interventions Within Each Domain. To generate interventional datasets within each domain Πi ∈ Π, we additionally modify the SCM Mi ∈ M by shifting the mean for a variable. Therefore, for distribution k in Πi with perfect intervention I, we have:

Vk := ε′k, with ε′k ~ N(µk, σk), ∀Vk ∈ I

such that µk is not within ±1 of the corresponding mean in any other distribution for variable Vk ∈ V. This ensures the assumption of generalized distribution change (Assump. 4). With a soft intervention J that is not perfect:

Vk ← Σ_{j ∈ Pak} α′k,j Vj + ε′k, with ε′k ~ N(µk, σk), ∀Vk ∈ J

For each distribution over V ∈ ℝ^d, we generate 200,000 data points, for N total distributions. We modify the mean and the variance to ensure that the assumption of distribution change is met (Assump. 4).

Mixing function. In order to generate the low-level data X, we apply a mixing function fX to the generated latent variables V. Following [21, 51], to generate an invertible mixing function, we use a multilayer perceptron fX = σ ∘ A_M ∘ . . . ∘ σ ∘ A_1, where A_m ∈ ℝ^{d×d} for m ∈ [1, M] denotes an invertible linear matrix and σ is an element-wise invertible nonlinear function. In our case, we use the tanh function as done in [81]: σ(x) = tanh(x) + 0.1x. In addition, each sampled matrix A_m is re-drawn if |det A_m| < 0.1. This ensures that the linear maps are not ill-conditioned, i.e., close to singular. Once the mixing function is drawn for a given simulation, it is fixed across all domains and interventions according to Assump. 4, and then P is drawn according to all instantiated ASCMs. A condensed sketch of this data-generating process is shown below.
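The following is a condensed sketch of the data-generating process just described (hypothetical sizes, seed, and graph; the full pipeline additionally varies domains and interventions as described above):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, a, M = 3, 200_000, 2.0, 3
parents = {0: [], 1: [0], 2: [1]}            # chain graph V1 -> V2 -> V3

# Latents in topological order: V_i = sum_{j in Pa_i} alpha_ij V_j + eps_i.
alpha = {i: rng.uniform(-a, a, len(pa)) for i, pa in parents.items()}
V = np.zeros((n, d))
for i, pa in parents.items():
    V[:, i] = V[:, pa] @ alpha[i] + rng.normal(0.0, 1.0, n)

def draw_invertible(rng, d):
    while True:                               # re-draw near-singular matrices
        A = rng.normal(size=(d, d))
        if abs(np.linalg.det(A)) >= 0.1:
            return A

sigma = lambda x: np.tanh(x) + 0.1 * x        # element-wise invertible nonlinearity
X = V
for _ in range(M):                            # f_X = sigma o A_M o ... o sigma o A_1
    X = sigma(X @ draw_invertible(rng, d).T)
print(X.shape)                                # (200000, 3)
```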
G.2 Image Editing Using Disentangled Representations

We demonstrate qualitatively that the generalized disentanglement proposed in this work is valuable for downstream tasks, such as counterfactual image editing [27]. Consider the graph shown in Fig. S8. Specifically, we use our learned proxy model to generate initial images and perform interventions on the learned representations V̂ to edit images. We generate initial image samples from the observational distribution of Π1, and then perturb the relevant representations with random Gaussian noise to edit the image. This is done for the color of the bar, the color of the digit, and the digit representations. If the learned V̂ satisfy the CRID output disentanglement, then:

1. editing the color of the digit (σ_{V̂2}) should keep the original digit and writing style but may change the color of the bar, since V2 has a causal effect on V3;
2. editing the color of the bar (σ_{V̂3}) should keep the original digit and writing style but may change the color of the digit, since V3 is not disentangled from V2;
3. editing the digit (σ_{V̂1}) may change all variables, since no disentanglement of V1 is claimed by CRID.

The editing results are shown in Fig. S7. All editing results align with the CDM output, as expected and as illustrated above. Specifically, Fig. S7(a) shows the learned VAE-NF model can change the color of the bar without arbitrarily changing the digit or writing style. Fig. S7(b) shows the learned VAE-NF model can change the color of the digit without arbitrarily changing the digit or writing style. Finally, Fig. S7(c) shows the learned VAE-NF model did not learn a disentangled representation for "digit": when perturbing the representation for digit, sometimes the digit does not change, while the color of the bar, the color of the digit, or the writing style changes. This experiment also demonstrates one usage of CRID: before training a model that is potentially computationally and time-intensive, one can leverage CRID to determine whether the input data and assumptions are sufficient for learning a relevant disentangled representation for the downstream task.

Figure S7: Editing the image using the learned representations. The representation of the color of the digit (a), the color of the bar (b), and the digit (c) is perturbed. Only (a) and (b) show robust editing, due to the learned representation being relatively disentangled as predicted by CRID.

Figure S8: Color MNIST with bar data generation. The ground-truth LSD is shown in (a). Four distributions are generated: observations in domain Π1 (b), observations in domain Π2 (c), a soft intervention on the bar color (V3) (d), and a perfect intervention on the digit color (V2) in Π1 (e); i.e., P^(1) = P_{Π1}(X), P^(2) = P_{Π2}(X), P^(3) = P_{Π2}(X; σ_{V3}), P^(4) = P_{Π1}(X; do(V2)).

We train invertible MLPs with normalizing flows. The parameters of the causal mechanisms are learned while the causal graph is assumed to be known. We leverage the implementation in [21] and extend it for our experiments. The encoder is trained with an objective that estimates the inverse function fX⁻¹ and the latent densities P(V), reproducing the ground truth up to certain mixture ambiguities (c.f. Lemmas 3, 6). The encoder parameters are estimated by maximizing the likelihood, as sketched below.
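The following is a minimal PyTorch sketch of this maximum-likelihood objective (my own simplification: hand-rolled affine coupling layers with a standard normal base, rather than the Neural Spline Flows and learned causal base distributions described next):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Invertible layer: x2 -> x2 * exp(s(x1)) + t(x1), with x1 unchanged."""
    def __init__(self, d):
        super().__init__()
        self.d1 = d // 2
        self.net = nn.Sequential(nn.Linear(self.d1, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * (d - self.d1)))

    def forward(self, x):                      # returns output and log|det J|
        x1, x2 = x[:, :self.d1], x[:, self.d1:]
        s, t = self.net(x1).chunk(2, dim=1)
        s = torch.tanh(s)                      # keep scales well-conditioned
        return torch.cat([x1, x2 * torch.exp(s) + t], dim=1), s.sum(dim=1)

d = 4
flows = nn.ModuleList([AffineCoupling(d) for _ in range(4)])
perm = torch.randperm(d)                       # fixed permutation between layers
opt = torch.optim.Adam(flows.parameters(), lr=1e-4)

def log_likelihood(x):                         # change-of-variables formula
    log_det = torch.zeros(x.shape[0])
    for f in flows:
        x, ld = f(x)
        x, log_det = x[:, perm], log_det + ld
    log_base = -0.5 * (x ** 2 + torch.log(torch.tensor(2 * torch.pi))).sum(dim=1)
    return log_base + log_det

x_batch = torch.randn(4096, d)                 # stand-in for an observed batch X
opt.zero_grad()
loss = -log_likelihood(x_batch).mean()         # negative log-likelihood
loss.backward()
opt.step()
```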
Normalizing flows. We use a normalizing-flow architecture [82] to learn an encoder gθ : ℝ^d → ℝ^d. The observations X are therefore related to the learned latents by an invertible and differentiable transformation, X = gθ⁻¹(V̂). Specifically, gθ comprises Neural Spline Flows [83] with a 3-layer feedforward neural network with hidden dimension 128 and a permutation in each flow layer.

Base distributions. Normalizing flows require a base distribution. We leverage one base distribution per sampled dataset, (p̂kθ)_{k ∈ [d]}, over the base noise variables V. The conditional density of any variable is given by:

p̂kθ(vi | Pai) = N(Σ_{j ∈ Pai} α̂i,j vj, σ̂i)

where the parameters are replaced by their corresponding counterparts if there is a change in domain or an intervention applied. When a perfect intervention is applied, we have:

p̂kθ(vi) = N(µ̂i, σ̂i)

G.4 Training details

We use the Adam optimizer [84]. We start with a learning rate of 1e-4. We train the model for 200 epochs with a batch size of 4096. The learning objective is expressed as:

θ* = arg max_θ Σ_k Σ_{n=1}^{nk} log p̂kθ(X^(k)_n)

where nk represents the size of the dataset P^k, which is 200,000 in our simulations. We perform 10 training runs over different seeds for each experiment, and show the distributions of the mean correlation coefficient (MCC). Using the output of Alg. 1, we compare variables that are expected to be entangled and disentangled. We use NVIDIA H100 GPUs to train the neural network models.

G.5 Evaluation metrics

The output of our trained model is V̂ = gθ(X), which is a d-dimensional representation. We compare this representation with our ground-truth latent variable distributions V by computing the mean correlation coefficients (MCC) between the learned and ground-truth latents. We expect an overall lower MCC for variables that are predicted to be disentangleable by Alg. 1 relative to variables that are not deemed disentangleable. Note that our algorithm is not shown to be complete, so there may be variables that are disentangled at the end of our training process that are not captured by the output of Alg. 1. Characterizing when this occurs, and coming up with a complete theoretical characterization of disentanglement, is a line for future work. For the evaluation, we follow a standard evaluation protocol taken in prior work [18]. We expect low MCC values when comparing variables that are disentangled, and higher MCC values when comparing variables that are still entangled. A sketch of this metric is given below.
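For concreteness, here is a minimal sketch of how such an MCC can be computed (standard protocol; my own implementation, not the paper's evaluation code): absolute Pearson correlations between learned and true latents, matched by the Hungarian algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mcc(v_hat, v_true):
    d = v_true.shape[1]
    corr = np.abs(np.corrcoef(v_hat.T, v_true.T)[:d, d:])  # |corr(v_hat_i, v_j)|
    rows, cols = linear_sum_assignment(-corr)               # best one-to-one match
    return corr[rows, cols].mean(), corr

rng = np.random.default_rng(0)
v = rng.normal(size=(10_000, 3))
v_hat = v[:, [1, 0, 2]] + 0.1 * rng.normal(size=v.shape)    # permuted + noisy copy
score, corr = mcc(v_hat, v)
print(round(score, 3))    # ~1.0: well-aligned, disentangled representation
print(np.round(corr, 2))  # off-diagonal entries stay near 0
```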
G.6 Limitations

A major limitation of normalizing flows is that the input and output dimensions of the encoder must be the same, since we constrain the layers to be invertible transformations. It is easy to define invertible transformations for matching input/output dimensions, but it is non-trivial to do so when the input/output dimensions vary widely. Besides the technical limitations of the implementation, it is important to note that our theoretical results are asymptotic. The theory claims we can achieve ID when the neural network is trained to zero error. However, in practice, this is not always simple to do and may require hyperparameter tuning and a very large sample size. For example, when we consider Fig. 6, we observe that the disentanglement in (b,c) is significantly better than in (a,d). In the experiment involving the collider graph from Fig. 4(b), we sample four distributions each with 200,000 samples, and thus we have almost 2x the data points compared to the settings in Fig. 6(a,c). We illustrate this point to emphasize that there is no single correct way to set the sample sizes, hyperparameters, or model architecture, as each simulation will be different. We chose a sample size, model architecture, and default hyperparameters based on prior literature [21], instead of biasing our experimental results by tuning significantly for each simulation.

G.7 Discussion of Results

In Fig. S9, we show the MCC values for each learned latent representation V̂ and the corresponding ground-truth latents V for the three different LSDs shown in Fig. 4. Based on the causal disentanglement map (CDM) output from the CRID algorithm, the disentangled variables are shown in red, while the entangled variables are shown in gray.

In Fig. S9(a), MCC(V̂3, V1) is low relative to MCC(V̂3, V3), which is predicted by the CRID algorithm's CDM output (right plot). This suggests that V̂3 is disentangled from V1. In addition, we observe that all MCC values w.r.t. V̂1 are relatively similar, which makes sense as we do not obtain any disentanglement w.r.t. V1 (left plot). CRID also predicts that V2 is ID w.r.t. V1 (middle plot). However, we observe quite a large range of MCC values, possibly due to variance, default hyperparameter settings, or insufficient sample size. Importantly, this experiment verifies that two soft interventions on V3 in the chain graph of Fig. 4(a) can ID V3 w.r.t. V1, whereas previous literature suggested that V3 is not ID w.r.t. V1 because V1 ∈ Anc(V3) [21].

In Fig. S9(b), we now have an observational distribution, two soft interventions on V3, and a perfect intervention on V2. In addition to IDing V2 w.r.t. V3 (middle plot), we are also able to obtain full disentanglement of V1 from {V2, V3} (left plot). Interestingly, we are able to fully disentangle the representation for V1 without intervening on it. This is the first theoretical (and empirical) result, to our knowledge, that shows this in a causal representation learning setting.

In Fig. S9(c), we have an observational and four interventional distributions applied on {V1, V3}, all with different mechanisms. We observe that V1 and V3 are fully disentangled: MCC(V̂3, V3) > MCC(V̂3, {V1, V2}) and MCC(V̂1, V1) > MCC(V̂1, {V2, V3}). CRID does not predict disentanglement for the V2 representation (middle plot), yet interestingly we still see some disentanglement. [21] analyzes a similar setup using "fat-hand interventions", and the corresponding theory does predict that V1 and V3 are ID w.r.t. V2. However, we also disentangle V1 and V3 from each other using many interventions. [22] presents a similar approach by leveraging 2d + |E(M_G)| + 1 distributions that "sufficiently change" (i.e., Assumption 4) to disentangle variables. However, the corresponding theory suggests that V1 and V3 are still entangled because they are adjacent in the Markov network of G (M_G). These results demonstrate theoretically (and empirically) that V1 and V3 can in fact be disentangled from each other in a fundamentally important causal graph (i.e., the collider).

In Fig. S9(d), we consider disentanglement in a non-Markovian LSD. We leverage two perfect interventions on V3 (c.f. Lemma 6), and verify that even without observational distributions, and in the challenging setting of confounding among the latent variables, we can achieve disentanglement of V3 w.r.t. all other variables: MCC(V̂3, V3) > MCC(V̂3, {V1, V2, V4}), which is predicted by the CRID algorithm's CDM output (3rd plot from left). As expected, V1 and V2 are still fully entangled with all other variables (1st and 2nd plots from left).
Figure S9: Mean correlation coefficient (MCC) of the latent ground-truth variables with the learned representation V̂, and expected disentanglement (red) according to the CRID algorithm. Each plot corresponds to an experimental setting using the graphs shown in Fig. 4: chain graph with two interventions on V3 (a), chain graph with two interventions on V3 and a perfect intervention on V2 (b), collider graph with four interventions on {V1, V3} (c), and the non-Markovian graph with two perfect interventions on V3 (d).

H Broader Impact and Forward-Looking Statements

The development of learning disentangled causal representations has the potential to improve our understanding of complex systems and to help identify the generative factors for many important problems. By improving our ability to leverage observational and interventional data across multiple domains, this work could ultimately lead to more realistic generative AI. Beyond the machine learning and causal inference community, we expect that our results will enable fundamental contributions in various fields, including biology [85], epidemiology [86], economics [37], and neuroscience [38].

I Frequently Asked Questions

Q1. What is the learning goal of the paper? This work claims to be causal representation learning, but why do we not learn the structure over the latent variables, assuming it as given instead?

Answer. Causal representation learning may comprise two parts: i) learning the distributions of the latent variables and ii) learning the causal structure among these latent variables. Learning the distribution over latent variables is a non-trivial problem, especially in the context of non-Markovian ASCMs and the general multi-domain context. For example, consider nonlinear ICA, where the structure of the latent variables is the fully disconnected graph. It was shown to be non-ID with only i.i.d. data [9]. Although ID results eventually came about for nonlinear ICA, they were nontrivial to derive. In the same spirit, we seek to analyze the most general setting possible when assuming knowledge of the causal structure. This is analogous to the causal inference task of identification [87, 88], where the goal is to determine whether a causal effect over observed variables is estimable given infinite data from some given distributions on the observed variables. Put similarly, our work's goal is to determine whether a latent variable Vi ∈ V is disentangleable given infinite data from some given distributions over the observed variables X. In traditional causal inference, when the causal graph is unknown, one is typically interested in causal discovery, or structure learning of the graph over the observed variables given distributions over the observed variables.
Future work may assume that even the latent causal structure is unknown, and pursue structure learning of the LCG given distributions over the observed variables.

Q2. Is it reasonable to expect that the causal diagram is available? How do you get the graph?

Answer. The assumption of the causal diagram is made out of necessity. Even though some existing methods are able to learn the causal diagram at the same time, their settings are more restricted: for example, the SCM must be Markovian and interventional data per node must be given. In our setting, the underlying SCM can be non-Markovian and the given data can be any observational and interventional data from arbitrary domains. In this general setting, even when the generative factors are all observed, learning the causal diagram (the structure learning task) is still difficult. Indeed, recovering the full true diagram is in general impossible, and existing works aim to recover an equivalence class of diagrams [32, 89-91]. Thus, in this general setting for causal representation learning, we first provide identification results given a causal diagram and leave structure learning for future work. We follow closely the disentangled representation learning works that assume the causal diagram is given. ICA/nonlinear ICA assumes the diagram G is given and restricts to the setting where G has no edges. Later, [18] focuses on disentangling the content variable from the style variable and assumes knowledge of the diagram (Content is an ancestor of Style). Recently, [21] focuses on the setting where the given diagram is Markovian. We extend the setting to non-Markovian settings. Notice that our generalization is not only related to the diagram assumption but involves more general assumptions, data, and output (please see Sec. 1, Tab. 1, and Tab. S1 for details). In practice, knowledge of the latent causal graph is typically provided by domain experts or by a modeling assumption. As an example of a realistic setting where the latent causal graph can be assumed, consider generating realistic face images [27]. Here, the latent causal structure comprises Gender, Age, and Hair Color. Knowledge of the graph is provided by our understanding of what constitutes realistic changes in a face. For a detailed discussion, see Appendix Section D.1.

Q3. Why does CRID (Alg. 1) only take the intervention targets Ψ and the LSD G_S as input? Does it need the distributions P? If not, how do you learn representations?

Answer. CRID leverages the intervention targets Ψ and the LSD G_S to determine the invariant and changing factors when considering the generalized factorization of probability distributions Markov relative to the provided graph. These invariant and changing factors are what give rise to the theory we develop in Section 3. The CRID algorithm leverages this theory to provide an identifiability algorithm, which answers the question: if we fully learn a representation V̂ (given the diagram and the distributions), which variables are expected to be disentangled from which variables? This is an asymptotic question and assumes the representation is fully learned. To fully learn the representations, one can search for a proxy model that matches P and G_S; the proxy model's V̂ is then the learned representation. We do this in the experiments section, but note that we do not claim this method of learning the representations is superior to any prior work.
Specifically, we implement an approach based on normalizing flows to train a neural model that is compatible with the diagram and matches the given distributions. Recently, many graphically constrained proxy neural models have been proposed, which are trained to fit the given distributions for causal representation learning and downstream tasks [20, 27, 55, 92-94]. Without our work, one can still try to use these models to learn representations; however, there is no guarantee about how these learned representations are entangled with each other. Our work is the first to provide general answers for this identification problem. This process can be compared with the identification and estimation problems in classic causal inference. The identification of a specific query given a causal diagram can be answered symbolically [87, 95-99], and then, if the query is identifiable, one can take the distribution (or data) as input and use estimation methods to obtain the estimated query. Without the identifiability result, there are no guarantees for the estimation.

Q4. Why not just use the observational distribution in each domain as the baseline in the CRID algorithm described in Section 4?

Answer. One may surmise that enumerating baselines is inefficient and propose to always choose the observational distribution in each domain instead. However, we argue that this enumeration is needed for two reasons. First, the observational distributions, namely the idle interventions, are not always given. Second, comparing with observational distributions is not guaranteed to offer diverse Q sets. For example, consider the intervention targets I^(1) = {}^{Π1}, I^(2) = {V1^{Π1,[1]}, V2^{Π1,[1]}}, and I^(3) = {V1^{Π1,[1]}, V2^{Π1,[2]}}, all applied to the same domain Π1. Choosing T = {} and comparing I^(2) and I^(3) with the idle intervention I^(1),

Q[I^(2), I^(1), T] = Q[I^(3), I^(1), T] = {V1, V2}.    (71)

Comparing I^(1) and I^(3) with the baseline I^(2),

Q[I^(1), I^(2), T] = Q[I^(3), I^(2), T] = {V2}.    (72)

Then, using Proposition 3, it is possible to disentangle V2 from V1 with the latter choice. This demonstrates that the observational distribution is not always the best baseline. Furthermore, consider the challenge of disentangling V1 from V2 in the LCG V1 ↔ V2. As Lemma 6 demonstrates, one can compare two perfect interventional distributions on V1 to achieve ID of V1 w.r.t. V2. In this case, one would not even need the observational distribution.

Q5. Why distinguish domains and interventions? Are they not the same thing?

Answer. The literature has typically conflated domains and interventions in the context of causal inference. Many examples across scientific disciplines demonstrate that the notions of domain/environment and intervention are distinct. For example, when making inferences about humans based on data from bonobos, this distinction becomes clear. The difference between the two species is depicted as the environment/domain in this context. A scientist might perform an intervention on a bonobo's kidney (specifically, what we are representing as Z), and try to determine the effect of medication (X) on fluid equilibrium in the body (Y). Although we could intervene on Z in bonobos and observe its effect on X and Y, our ultimate goal might be to understand the effect of X on Y in humans. It is generally invalid to conflate these two qualitatively different indices, a point first noted by [62] in the context of transportability analysis. The distinct environments exist regardless of any intervention, such as medication.
Also, an intervention on kidney function is different across the two species. [62] formalized this setting, introducing clear semantics for the S-nodes (environments) that essentially offer a combined representation of both environments. With this foundation, we can now address the more general problem of analyzing data generated from interventions across multiple domains in the latent space. We point the reader to Appendix Section A.3 for a discussion and some examples of how CRID leverages this distinction. In addition, we provide the following example that we hope further motivates the necessity of distinguishing interventions and domains.

Example 24 (Disentangled representation with interventions in different domains). Consider an ASCM M over the domains bonobos (Π1) and humans (Π2) that induces the causal chain V1 → V2 → V3, with an S-node S^{1,2} pointing to V3. The latent variables are sun exposure (V1), age (V2), and hair color (V3). Sun exposure causes aging over time, and aging causes changes in hair color. Hair color looks different across species, which is represented by the S-node. We collect images of their faces, X = fX(V1, V2, V3). Assume we are able to collect images in two interventional settings, {V1}^{Π1} and {V1}^{Π2}, where we modify the level of sun exposure each participant is exposed to. Now, assume we ignore the domain index and simply treat these two distributions as interventional, since we are intervening on V1. Then prior results would state that a soft (or perfect) intervention on V1 allows it to be disentangled from V3. However, this is incorrect.

Q6. Is the relaxation of Markovianity important? Since all V are already latent, can one regard the confounders U as V to transfer a model in the non-Markovian setting to a Markovian model?

Answer. Yes, the distinction between Markovianity and non-Markovianity is important both qualitatively and quantitatively. Qualitatively, consider the following example in healthcare, where one has access to high-dimensional T1 MRI scans. Let the LCG comprise Drug Treatment → Outcome, where the two are confounded by socioeconomic status (Drug Treatment ↔ Outcome). The drug treatment and outcome are visually discernible on the MRI. However, socioeconomic status does not directly impact how the MRI appears, except through how it impacts the drug treatment efficacy or the outcome. The socioeconomic status is therefore an unaccounted-for confounder in the LCG, and it is important to model this spurious association. If unaccounted for, one may assume that it is possible to disentangle Drug Treatment and Outcome by leveraging existing ID results in the literature [11, 13, 14, 21, 22], even though those results do not apply in this setting.

Regarding modeling, an ASCM with confounding cannot be reduced to a Markovian ASCM. Although U and V are both latent, not every U is a direct parent of X, which means such a U cannot be uniquely determined by the value of X. Take the example where V1 ↔ V2 is the LCG G. Since U12 does not point to X, we cannot let U12 be another latent generative factor in V.

Regarding results, we point the reader to Lemma 5, where it is shown that even with one perfect intervention per node, it is not possible to disentangle variables within the same c-component. This is in contrast with results in the Markovian setting, where it is shown in [21] that one perfect intervention per latent variable allows us to achieve full identifiability of every latent variable up to scaling indeterminacies.
More broadly, it is noteworthy that transitioning causal reasoning from Markovian to non-Markovian settings was not trivial. For example, it is known that interventional distributions, such as P(y | do(x)), are always identifiable from the causal graph and the observational distribution in Markovian settings. Moving to non-Markovian settings, the celebrated do-calculus was developed primarily to address the decision problem of whether an interventional distribution can be uniquely computed from a combination of causal assumptions (in the form of a causal diagram) and the observational distribution [61]. Naturally, the issue of non-identifiability is much more acute in this setting, due to the existence of unobserved confounding.

NeurIPS Paper Checklist

1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?

Answer: [Yes]

Justification: We claim that we introduce graphical criteria and an algorithm for determining whether latent variables are ID or not under our general definition of ID. We then provide these results in Sections 3 and 4.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: We discuss the assumptions made in this paper in Appendix Section A.2, and how future work can potentially improve upon our limitations.
- While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes]

Justification: For each proposition and theorem stated in the main text: Propositions 1, 2, 3, 4, 5 and Theorem 1 are stated in the main text and proved in Appendix Section C.

Guidelines:
- The answer NA means that the paper does not include theoretical results.
- All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
- All assumptions should be clearly stated or referenced in the statement of any theorems.
- The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
- Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
- Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

Answer: [Yes]

Justification: We provide details of our experiments in Appendix Section G.

Guidelines:
- The answer NA means that the paper does not include experiments.
- If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
- If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
- Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
- While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example:
  (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
  (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
  (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
  (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: The code we used to run experiments is here: https://github.com/tree1111/CDRL.

Guidelines:
- The answer NA means that the paper does not include experiments requiring code.
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: We provide details of our experiments in Appendix Section G.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: We show distribution plots, and do not compute any p-values.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: We provide details of our experiments in Appendix Section G.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?

Answer: [Yes]

Justification: We have reviewed the code of ethics.

Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [Yes]

Justification: We discuss broader impacts in Appendix Section H.

Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: Our paper poses no such risks.

Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models) used in the paper properly credited, and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: We cite [21], whose code we leveraged to produce the experimental results shown in the paper. We do not repackage any datasets or code.

Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided.
- For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets

Question: Are new assets introduced in the paper well documented, and is the documentation provided alongside the assets?

Answer: [NA]

Justification: Our paper does not release new assets.

Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: Our paper does not involve crowdsourcing nor research with human subjects.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: Our paper does not involve crowdsourcing nor research with human subjects.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.