# causal_component_analysis__e8bc261b.pdf

Causal Component Analysis

Liang Wendong 1,2 Armin Keki c 1 Julius von Kügelgen 1,3 Simon Buchholz 1

Michel Besserve 1 Luigi Gresele 1 Bernhard Schölkopf 1

1 Max Planck Institute for Intelligent Systems, Tübingen, Germany 2 ENS Paris-Saclay, Gif-sur-Yvette, France 3 University of Cambridge, United Kingdom {wendong.liang,armin.kekic,jvk,simon.buchholz}@tue.mpg.de {besserve,luigi.gresele,bs}@tue.mpg.de

Independent Component Analysis (ICA) aims to recover independent latent variables from observed mixtures thereof. Causal Representation Learning (CRL) aims instead to infer causally related (thus often statistically dependent) latent variables, together with the unknown graph encoding their causal relationships. We introduce an intermediate problem termed Causal Component Analysis (Cau CA). Cau CA can be viewed as a generalization of ICA, modelling the causal dependence among the latent components, and as a special case of CRL. In contrast to CRL, it presupposes knowledge of the causal graph, focusing solely on learning the unmixing function and the causal mechanisms. Any impossibility results regarding the recovery of the ground truth in Cau CA also apply for CRL, while possibility results may serve as a stepping stone for extensions to CRL. We characterize Cau CA identifiability from multiple datasets generated through different types of interventions on the latent causal variables. As a corollary, this interventional perspective also leads to new identifiability results for nonlinear ICA a special case of Cau CA with an empty graph requiring strictly fewer datasets than previous results. We introduce a likelihood-based approach using normalizing flows to estimate both the unmixing function and the causal mechanisms, and demonstrate its effectiveness through extensive synthetic experiments in the Cau CA and ICA setting.

1 Introduction

Independent Component Analysis (ICA) [7] is a principled approach to representation learning, which aims to recover independent latent variables, or sources, from observed mixtures thereof. Whether this is possible depends on the identifiability of the model [32, 57]: this characterizes assumptions under which a learned representation provably recovers (or disentangles) the latent variables, up to some well-specified ambiguities [22, 58]. A key result shows that, when nonlinear mixtures of the latent components are observed, the model is non-identifiable based on independent and identically distributed (i.i.d.) samples from the generative process [8, 19]. Consequently, a learned model may explain the data equally well as the ground truth, even if the corresponding representation is strongly entangled, rendering the recovery of the original latent variables fundamentally impossible.

Identifiability can be recovered under deviations from the i.i.d. assumption, e.g., in the form of temporal autocorrelation [14, 18] or spatial dependence [15] among the latent components; auxiliary variables which render the sources conditionally independent [17, 21, 25, 26]; or additional, noisy views [11]. An alternative path is to restrict the class of mixing functions [4, 13, 63].

Shared last author. Code available at https://github.com/akekic/causal-component-analysis.

37th Conference on Neural Information Processing Systems (Neur IPS 2023).

f P0(Z) = P0(X)

f P1(Z) = P1(X)

f P2(Z) = P2(X)

f P3(Z) = P3(X)

Figure 1: Causal Component Analysis (Cau CA). We posit that observed variables X are generated through a nonlinear mapping f, applied to unobserved latent variables Z which are causally related. The causal structure G of the latent variables is assumed to be known, while the causal mechanisms Pi(Zi | Zpa(i)) and the nonlinear mixing function are unknown and to be estimated. (Known or observed quantities are highlighted in red.) Cau CA assumes access to multiple datasets Dk that result from stochastic interventions on the latent variables.

Despite appealing identifiability guarantees for ICA, the independence assumption can be limiting, since interesting factors of variation in real-world data are often statistically, or causally, dependent [51]. This motivates Causal Representation Learning (CRL) [45], which aims instead to infer causally related latent variables, together with a causal graph encoding their causal relationships. This is challenging if both the graph and the unmixing are unknown. Identifiability results in CRL therefore require strong assumptions such as counterfactual data [1, 3, 9, 37, 54], temporal structure [30, 33], a parametric family of latent distributions [5, 6, 30, 47, 59, 60], graph sparsity [28 30], pairs of interventions and genericity [55] or restrictions on the mixing function class [2, 49, 52]. It has been argued that knowing either the graph or the unmixing might help better recover the other, giving rise to a chicken-and-egg problem in CRL [3].

We introduce an intermediate problem between ICA and CRL which we call Causal Component Analysis (Cau CA), see Fig. 1 for an overview. Cau CA can be viewed as a generalization of ICA that models causal connections (and thus statistical dependence) among the latent components through a causal Bayesian network [41]. It can also be viewed as a special case of CRL that presupposes knowledge of the causal graph, and focuses on learning the unmixing function and causal mechanisms.

Since Cau CA is solving the CRL problem with partial ground truth information, it is strictly easier than CRL. This implies that impossibility results for Cau CA also apply for CRL. Possibility results for Cau CA, on the other hand, while not automatically generalizing to CRL, can nevertheless serve as stepping stones, highlighting potential avenues for achieving corresponding results in CRL. Note also that there are only finitely many possible directed acyclic graphs for a fixed number of nodes, but the space of spurious solutions in representation learning (e.g., in nonlinear ICA) is typically infinite. By solving Cau CA problems, we can therefore gain insights into the minimal assumptions required for addressing CRL problems. Cau CA may be applicable to scenarios in which domain knowledge can be used to specify a causal graph for the latent components. For instance, in computer vision applications, the image generation process can often be modelled based on a fixed graph [44, 50].

Structure and Contributions. We start by recapitulating preliminaries on causal Bayesian networks and interventions in 2. Next, we introduce Causal Component Analysis (Cau CA) in 3. Our primary focus lies in characterizing the identifiability of Cau CA from multiple datasets generated through various types of interventions on the latent causal variables ( 4). Importantly, all our results are applicable to the nonlinear and nonparametric case. The interventional perspective we take exploits the modularity of the causal relationships (i.e., the possibility to change one of them without affecting the others) a concept that was not previously leveraged in works on nonlinear ICA. This leads to extensions of existing results that require strictly fewer datasets to achieve the same level of identifiability. We introduce and investigate an estimation procedure for Cau CA in 5, and conclude with a summary of related work ( 6) and a discussion ( 7). We highlight the following main contributions:

We derive sufficient and necessary conditions for identifiability of Cau CA from different types of interventions (Thm. 4.2, Prop. 4.3, Thm. 4.5). We prove additional results for the special case with an empty graph, which corresponds to a novel ICA model with interventions on the latent variables (Prop. 4.6, Prop. 4.7, Corollary 4.8, Prop. 4.9). We show in synthetic experiments in both the Cau CA and ICA settings that our normalizing flow-based estimation procedure effectively recovers the latent causal components ( 5).

2 Preliminaries

Notation. We use P to denote a probability distribution, with density function p. Uppercase letters X, Y, Z denote unidimensional and bold uppercase X, Y, Z denote multidimensional random variables. Lowercase letters x, y, z denote scalars in R and x, y, z denote vectors in Rd. We use Ji, j K to denote the integers from i to j, and [d] denotes the natural numbers from 1 to d. We use common graphical notation, see App. A for details. The ancestors of i in a graph are the nodes j in G such that there is a directed path from j to i, and they are denoted by anc(i). The closure of the parents (resp. ancestors) of i is defined as pa(i) := pa(i) {i} (resp. anc(i) := anc(i) {i}).

A key definition connecting directed acyclic graphs (DAGs) and probabilistic models is the following.

Definition 2.1 (Distribution Markov relative to a DAG [41]). A joint probability distribution P is Markov relative to a DAG G if it admits the factorization P(Z1, . . . , Zd) = Qd i=1 Pi(Zi|Zpa(i)).

Defn. 2.1 is a key assumption in directed graphical models, where a distribution being Markovian relative to a graph implies that the graph encodes specific independences within the distribution, which can be exploited for efficient computation or data storage [43, 6.5].

Causal Bayesian networks and interventions. Causal systems induce multiple distributions corresponding to different interventions. Causal Bayesian networks [CBNs; 41] can be used to represent how these interventional distributions are related. In a CBN with associated graph G, arrows signify causal links among variables, and the conditional probabilities Pi Zi | Zpa(i) in the corresponding Markov factorization are called causal mechanisms.2

Interventions are modelled in CBNs by replacing a subset τk V (G) of the causal mechanisms by new, intervened mechanisms {e Pj Zj | Zpak(j) }j τk, while all other causal mechanisms are left unchanged. Here, pak(j) denotes the parents of Zj in the post-intervention graph in the interventional regime k and τk the intervention targets. We will omit the superscript k when the parent set is unchanged and assume that interventions do not add new parents, pak(j) pa(j). Unless e Pj is a point mass, we call the intervention stochastic or soft. Further, we say that e Pj is a perfect intervention if the dependence of the j-th variable from its parents is removed (pak(j) = ), corresponding to deleting all arrows pointing to i, sometimes also referred to as graph surgery [48].3 An imperfect intervention is one for which pak(j) = . We summarise this in the following definition.

Definition 2.2 (CBN ). A causal Bayesian network (CBN) consists of a graph G, a collection of causal mechanisms {Pi(Zi | Zpa(i))}i [d], and a collection of interventions {{e Pk j Zj | Zpak(j) }j τk}k [K] across K interventional regimes. The joint probability for interventional regime k is given by:

(Qd i=1 Pi(Zi | Zpa(i)) k = 0 Q j τk e Pk j Zj | Zpak(j) Q i/ τk Pi Zi | Zpa(i) k [K] (1)

where P0 is the unintervened, or observational, distribution, and Pk are interventional distributions.

Remark 2.3. The joint probabilities Pk in (1) are uniquely factorized into causal mechanisms according to G. We therefore use the equivalent notation (G, (Pk, τk)k J0,KK), where Pk is defined as in (1).

3 Problem Setting

The main object of our study is a latent variable model termed latent causal Bayesian network (CBN).

Definition 3.1 (Latent CBN). A latent CBN is a tuple (G, f, (Pk, τk)k J0,KK), where f : Rd Rd is a diffeomorphism (i.e. invertible with both f and f 1 differentiable).

2The term can also be used in structural causal models to denote deterministic functions of endogenous and exogenous variables in assignments, see [43, Def. 3.1]. A central idea in causality [41, 43] is that causal mechanisms are modular or independent, i.e., it is possible to modify some without affecting the others: after an intervention, typically only a subset of the causal mechanisms change. 3A special case of perfect interventions are hard interventions, where e Pj corresponds to a Dirac distribution: P(Z | do(Zj = zj)) = δZj=zj Q

i =j Pi Zi | Zpa(i) .

Data-generating process for Causal Component Analysis (Cau CA). In Cau CA, we assume that we are given multiple datasets {Dk}k J0,KK generated by a latent CBN (G, f, (Pk, τk)k J0,KK):

Dk := τk, n x(n,k)o Nk

, with x(n,k) = f z(n,k) and z(n,k) i.i.d. Pk, (2)

where Nk denotes the sample size for interventional regime k, see Fig. 1 for an illustration. The graph G is assumed to be known. Further, we assume that the intervention targets τk are observed, see 7 for further discussion. Both the mixing function f and the latent distributions Pk in (2) are unknown.

The problem we aim to address is the following: given only the graph G and the datasets Dk in (2), can we learn to invert the mixing function f and thus recover the latent variables z? Whether this is possible, and up to what ambiguities, depends on the identifiability of Cau CA. Definition 3.2 (Identifiability of Cau CA). A model class for Cau CA is a tuple (G, F, PG), where F is a class of functions and PG is a class of joint distributions Markov relative to G. A latent CBN (G, f, (Pk, τk)k J0,KK) is said to be in (G, F, PG) if f F and Pk PG for all k J0, KK. We say (G, F, PG) has known intervention targets if all its elements share the same G and (τk)k J0,KK.

We say that Cau CA in (G, F, PG) is identifiable up to S (a set of functions called indeterminacy set ) if for any two latent CBNs (G, f, (Pk, τk)k J0,KK) and (G , f , (Qk, τk)k J0,KK), the equality of pushforward f Pk = f Qk k J0, KK implies that h S s.t. h = f 1 f on the support of P.

We justify the definition of known intervention targets and generalize them to a more flexible scenario in App. E. Defn. 3.2 is inspired by the identifiability definition of ICA in [4, Def. 1]. Intuitively, it states that, if two models in (G, F, PG) give rise to the same distribution, then they are equal up to ambiguities specified by S. Consequently, when attempting to invert f based on the data in (2), the inversion can only be achieved up to those ambiguities.

In the following, we choose F to be the class of all C1-diffeomorphisms Rd Rd, denoted C1(Rd), and suppose the distributions in PG are absolutely continuous with full support in Rd, with the density pk differentiable.

A first question is what ambiguities are unavoidable by construction in Cau CA, similar to scaling and permutation in ICA [22, 3.1]. The following Lemma characterizes this. Lemma 3.3. For any (G, f, (Pk, τk)k J0,KK) in (G, F, PG), and for any h Sscaling with

Sscaling : = h : Rd Rd | h(z) = (h1(z1), . . . , hd(zd)), hi is a diffeomorphism in R , (3)

there exists a (G, f h, (Qk, τk)k J0,KK) in (G, F, PG) s.t. f Pk = (f h) Qk for all k J0, KK.

Lemma 3.3 states that, as in nonlinear ICA, the ambiguity up to element-wise nonlinear scaling is also unresolvable in Cau CA. However, unlike in nonlinear ICA, there is no permutation ambiguity in Cau CA: this is a consequence of the assumption of known intervention targets. The next question is under which conditions we can achieve identifiability up to (3), and when the ambiguity set is larger.

In this section, we investigate the identifiability of Cau CA. We first study the general case ( 4.1), and then consider the special case of ICA in which the graph is empty ( 4.2).

4.1 Identifiability of Cau CA

Single-node interventions. We start by characterizing the identifiability of Cau CA based on single-node interventions. For datasets Dk defined as in (2), every k > 0 corresponds to interventions on a single variable: i.e., k > 0, |τk| = 1. This is the setting depicted in Fig. 1, where each interventional dataset is generated by intervening on a single latent variable. The following assumption will play a key role in our proofs. Assumption 4.1 (Interventional discrepancy). Given k [K], let pτk denote the causal mechanism of zτk. We say that a stochastic intervention pτk satisfies interventional discrepancy if

zτk | zpa(τk) = (ln epτk)

zτk | zpak(τk) almost everywhere (a.e.). (4)

Figure 2: Violation of the Interventional Discrepancy Assumption. The shown distributions constitute a counterexample to identifiability that violates Asm. 4.1 and thus allows for spurious solutions, see App. C for technical details. (Left) Visualisation of the joint distributions of two independent latent components z1 and z2 after no intervention (red), and interventions on z1 (green) and z2 (blue). As can be seen, each distribution reaches the same plateau on some rectangular interval of the domain, coinciding within the red square. (Center/Right) Within the red square where all distributions agree, it is possible to apply a measure preserving automorphism which leaves all distributions unchanged, but non-trivially mixes the latents. The right plot shows a distance-dependent rotation around the centre of the black circle, whereas the middle plot show a reference identity transformation.

Asm. 4.1 can be applied to imperfect and perfect interventions alike (in the latter case the conditioning on the RHS disappears). Intuitively, Asm. 4.1 requires that the stochastic intervention is sufficiently different from the causal mechanism, formally expressed as the requirement that the partial derivative over zi of the ratio between pi and pi is nonzero a.e. One case in which Asm. 4.1 is violated is when pi/ zi and pi/ zi are both zero on the same open subset of their support. In Fig. 2 (Left), we provide an example of such a violation (see App. C for its construction), and apply a measure-preserving automorphism within the area where the two distributions agree (see Fig. 2 (Right)).

We can now state our main result for Cau CA with single-node interventions. Theorem 4.2. For Cau CA in (G, F, PG),

(i) Suppose for each node in [d], there is one (perfect or imperfect) stochastic intervention that satisfies Asm. 4.1. Then Cau CA in (G, F, PG) is identifiable up to

SG = n h : Rd Rd|h(z) = hi(zanc(i))

i [d] , h is C1-diffeomorphism o . (5)

(ii) Suppose for each node i in [d], there is one perfect stochastic intervention that satisfies Asm. 4.1. Then Cau CA in (G, F, PG) is identifiable up to Sscaling.

Thm. 4.2 (i) states that for single-node stochastic interventions, perfect or imperfect, we can achieve identifiability up to an indeterminacy set where each reconstructed variable can at most be a mixture of ground truth variables corresponding to nodes in the closure of the ancestors of i. While this ambiguity set is larger than the one in eq. (3), it is still a non-trivial reduction in ambiguity with respect to the spurious solutions which could be generated without Asm. 4.1. A related result in [49, Thm. 1] shows that for linear mixing, linear latent SCM and unknown graph, d interventions are sufficient and necessary for recovering G (the transitive closure of the ground truth graph G) and the latent variables up to elementwise reparametrizations. Thm. 4.2 (i) instead proves that d interventions are sufficient for identifiability up to mixing of variables corresponding to the coordinates in G for arbitrary C1-diffeomorphisms F, non-parametric PG and known graph.

Thm. 4.2 (ii) shows that if we further constrain the set of interventions to perfect single-node, stochastic interventions only, then we can achieve a much stronger identifiability i.e., identifiability up to scaling, which as discussed in 3 is the best one we can hope to achieve in our problem setting without further assumptions. In short, the unintervened distribution together with one single-node, stochastic perfect intervention per node is sufficient to give us the strongest achievable identifiability in our considered setting. In App. D, we also discuss identifiability when only imperfect stochastic interventions are available. In short, with a higher number of imperfect interventions, the ambiguity in Thm. 4.2 (i) can be further constrained to the closure of parents, instead of the closure of ancestors.

Thm. 4.2 (ii) shows that d datasets generated by single-node interventions on the latent causal variables are sufficient for identifiability up to Sscaling. We additionally prove below that d interventional datasets are necessary i.e., for Cau CA, and for any nonlinear causal representation learning problem, d 1 single-node interventions are not sufficient for identifiability up to Sscaling. Proposition 4.3. Given a DAG G, with d 1 perfect stochastic single node interventions on distinct targets, if the remaining unintervened node has any parent in G, (G, F, PG) is not identifiable up to

Sreparam := g : Rd Rd | g = P h, P is a permutation matrix, h Sscaling

A similar result in [49] shows that one intervention per node is necessary when the underlying graph is unknown. Prop. 4.3 shows that this is still the case, even when the graph is known.

Fat-hand interventions. A generalization of single-node interventions are fat-hand interventions i.e., interventions where |τk| > 1. In this section, we study this more general setting and focus on a weaker form of identification than for single-node intervention. Assumption 4.4 (Block-interventional discrepancy). We denote Q0 τk as the causal mechanism of Zτk in the unintervened regime. For each k [K], we denote Qs τk as the intervention mechanism in the s-th interventional regime that has τk [d] as targets of intervention, i.e., Qs τk is a (conditional) joint distribution over Zτk. Then the Block-interventional discrepancy for τk is defined as follows:

if there is no arrow into τk (i.e., τk has no parents in [d]\τk), suppose that there are nk interventions with target τk such that the following nk nk matrix

z1 (ln q1 τk ln q0 τk)(zτk) . . . znk (ln q1 τk ln q0 τk)(zτk) ... ... z1 (ln qnk τk ln q0 τk)(zτk) . . . znk (ln qnk τk ln q0 τk)(zτk)

is invertible for zτk Rnk almost everywhere, where qs τk,j denotes the j-th marginal of qs τk, zτk,j denotes the j-th dimension of zτk, and s = 0 denotes the unintervened (observational) regime;

otherwise, suppose that there are nk + 1 interventions with target τk such that the matrix (7) is invertible for zτk Rnk almost everywhere, where qs τk,j and zτk,j are defined as before, but s = 0, . . . , nk now indexes the nk + 1 interventions i.e., without an unintervened regime.

Asm. 4.4 is tightly connected to Asm. 4.1: if G has no arrow, if k : nk = 1, then Asm. 4.4 is reduced to Asm. 4.1. However, for any G that has arrows, the number of interventional regimes required by Asm. 4.4 is strictly larger than Asm. 4.1. Theorem 4.5. Given any DAG G. Suppose that our datasets encompass interventions over all variables in the latent graph, i.e., S

k [K] τk = [d]. For all k [K], suppose the targets of interventions are strict subsets of all variables, i.e., |τk| = nk, nk [d 1]. Suppose the interventions over τk are perfect, i.e. the intervention mechanisms Qs τk are joint distributions over Zτk without conditioning on other variables. Suppose Asm. 4.4 is satisfied for τk.

Then Cau CA in (G, F, PG) is block-identifiable (following [54]): namely, if for all k [K], f Pk = f Qk, then for φ := f 1 f, for all k [K],

[φ(z)]τk = φτk (zτk) . (8)

We illustrate the above identifiability results through an example in Tab. 1.

4.2 Special Case: ICA with stochastic interventions on the latent components

An important special case of Cau CA occurs when the graph G is empty, corresponding to independent latent components. This defines a nonlinear ICA generative model where, in addition to the mixtures, we observe a variable τk which indicates which latent distributions change in the interventional regime k, while every other distribution is unchanged.4 This nonlinear ICA generative model is closely related to similar models with observed auxiliary variables [21, 25]: it is natural to interpret

4The relationship between Cau CA and nonlinear ICA is discussed in more detail in App. F.

Table 1: Overview of identifiability results. For the DAG Z1 Z2 Z3 from Fig. 1, we summarise the guarantees provided by our theoretical analysis in 4.1 for representations learnt by maximizing the likelihoods Pk θ(X) for different sets of interventional regimes.

Requirement of interventions Learned representation ˆz = ˆf 1(x) Reference

1 intervention per node [h1(z1), h2(z1, z2), h3(z1, z2, z3)] Thm. 4.2 (i)

1 perfect intervention per node [h1(z1), h2(z2), h3(z3)] Thm. 4.2 (ii)

1 intervention per node for z1 and z2, plus |pa(3)|(|pa(3)|+1) = 2 3 imperfect interventions on z3 with variability assumption [h1(z1), h2(z2), h3(z2, z3)] Prop. D.1

1 perfect intervention on z1 and 2+1=3 perfect fat-hand interventions on (z2, z3) [h1(z1), h2(z2, z3), h3(z2, z3)] Thm. 4.5

τk itself as an auxiliary variable. As we will see, our interventional interpretation allows us to derive novel results and re-interpret existing ones. Below, we characterize identifiability for this setting.

Single-node interventions. We first focus on single-node stochastic interventions, where the following result proves that we can achieve the same level of identifiability as in Thm. 4.2 (ii), with one less intervention than in the case where the graph is non-trivial.

Proposition 4.6. Suppose that G is the empty graph, and that there are d 1 variables intervened on, with one single target per dataset, such that Asm. 4.1 is satisfied. Then Cau CA (in this case, ICA) in (G, F, PG) is identifiable up to Sscaling defined as in eq. (3).

The result above shows that identifiability can be achieved through single-node interventions on the latent variables using strictly fewer datasets (i.e., auxiliary variables) than previous results in the auxiliary variables setting (d in our case, 2d + 1 in [21, Thm. 1]). One potentially confusing aspect of Prop. 4.6 is that the ambiguity set does not contain permutations which is usually an unavoidable ambiguity in ICA. This is due to our considered setting with known targets, where a total ordering of the variables is assumed to be known. The result above can also be extended to the case of unknown intervention targets, where we only know that, in each dataset, a distinct variable is intervened on, but we do not know which one (see App. E). For that case, we prove (Prop. E.6) that ICA in (G, F, PG) is in fact identifiable up to scaling and permutation. Note that Prop. 4.6 is not a special case of Thm. 4.5 in which nk = 1 k, since it only requires d 1 interventions instead of d.

We can additionally show that for nonlinear ICA, d 1 interventions are necessary for identifiability.

Proposition 4.7. Given an empty graph G, with d 2 single-node interventions on distinct targets, with one single target per dataset, such that Asm. 4.1 is satisfied. Then Cau CA (in this case, ICA) in (G, F, PG) is not identifiable up to Sreparam.

Fat-hand interventions. For the special case with independent components, the following corollary characterises identifiability under fat-hand interventions.

Corollary 4.8. [Corollary of Thm. 4.5] Suppose G is the empty graph. Suppose that our datasets encompass interventions over all variables in the latent graph, i.e., S k [K] τk = [d]. Suppose for every k, the targets of interventions are a strict subset of all variables, i.e., |τk| = nk, nk [d 1].

Figure 3: We use the symbol together with a times symbol to represent how many interventions are required by the two assumptions. (Left) (Thm. 4.5) For Asm. 4.4, we need nk interventions to get block-identification of zτk. (Right) (Prop. 4.9) For the block-variability assumption, we need 2nk to get to elementwise identification up to scaling and permutation.

Suppose Asm. 4.4 is verified, which has a simpler form in this case: there are nk interventions with target τk such that vk(zτk, 1) vk(zτk, 0), , vk(zτk, nk) vk(zτk, 0) are linearly independent, where vk(zτk, s) := (ln qs τk,1) (zτk,1) , , (ln qs τk,nk) (zτk,nk) , (9) where qs τk is the intervention of the s-th interventional regime that has the target τk, and qs τk,j is the j-th marginal of it. zτk,j is the j-th dimension of zτk. s = 0 denotes the unintervened regime.

Then Cau CA in (G, F, PG) is block-identifiable, in the same sense as Thm. 4.5.

Our interventional perspective also allows us to re-interpret and extend a key result in the theory of nonlinear ICA with auxiliary variables, [21, Thm.1]. In particular, the following Proposition holds. Proposition 4.9. Under the assumptions of Thm. 4.5, suppose furthermore that all density functions in PG and all mixing functions in F are C2, and suppose there exist k [K] and there are 2nk interventions with targets τk such that for any zτk Rnk, wk (zτk, 1) wk (zτk, 0), . . . , wk (zτk, 2nk) wk (zτk, 0) are linearly independent, where

wk (zτk, s) :=

qs τk,1 qs τk,1

(zτk,1) , . . . , qs τk,nk qsτk,nk

(zτk,nk) , qs τk,1 qs τk,1 (zτk,1) , . . . , qs τk,nk qsτk,nk (zτk,nk)

where qs τk is the intervention of the s-th interventional regime that has the target τk, and qs τk,j is the j-th marginal of it. zτk,j is the j-th dimension of zτk. s = 0 denotes the unintervened regime. Then

φτk Sreparam := g : Rnk Rnk|g = P h where P is a permutation matrix and h Sscaling

Remark 4.10. The assumption of linear independence wk (zτk, s) , s [2nk] precisely corresponds to the assumption of variability in [21, Thm. 1]; however, we only assume it within a nk-dimensional block (not over all d variables). We refer to it as block-variability.

Note that the block-variability assumption implies block-interventional discrepancy (Asm. 4.4): i.e., it is a strictly stronger assumption, which, correspondingly, leads to a stronger identification. In fact, block-interventional discrepancy only allows block-wise identifiability within the nk-dimensional intervened blocks based on nk interventions. In contrast, the variability assumption can be interpreted as a sufficient assumption to achieve identification up to permutation and scaling within a nk-dimensional block, based on 2nk fat-hand interventions (in both cases one unintervened dataset is required), see Fig. 3 for a visualization. We summarise our results for nonlinear ICA in Tab. 2, App. F.

In [21], the variability assumption is assumed to hold over all variables, which in our setting can be interpreted as a requirement over 2d fat-hand interventions over all latent variables simultaneously (plus one unintervened distribution). In this sense, Prop. 4.9 and block-variability extend the result of [21, Thm. 1], which only considers the case where all variables are intervened, by exploiting variability to achieve a strong identification only within a subset of the variables.

5 Experiments

Our experiments aim to estimate a Cau CA model based on a known graph and a collection of interventional datasets with known targets. We focus on the scenarios with single-node, perfect interventions described in 4. For additional technical details, see App. H.

Synthetic data-generating process. We first sample DAGs G with an edge density of 0.5. To model the causal dependence among the latent variables, we use the family of CBNs induced by linear Gaussian structural causal model (SCM) consistent with G.5 For the ground-truth mixing function, we use M-layer multilayer perceptrons f = σ AM . . . σ A1, where Am Rd d for m J1, MK denote invertible linear maps (sampled from a multivariate uniform distribution), and σ is an element-wise invertible nonlinear function. We then sample observed mixtures from these latent CBNs, as described by eq. (2).

Likelihood-based estimation procedure. Our objective is to learn an encoder gθ : Rd Rd that approximates the inverse function f 1 up to tolerable ambiguities, together with latent densities (pk)k J0,d K reproducing the ground truth up to corresponding ambiguities (cf. Lemma 3.3). We

5Additional experiments with nonlinear, non-Gaussian CBNs can be found in App. I.

ICA linear i.i.d 0.2

2 3 5 No. nonlinearities

2 3 4 5 Latent dimension

Cau CA linear E(G) =

ICA linear i.i.d

1 5 10 Signal-to-noise ratio SCM

Cau CA E(G) =

Figure 4: Experimental results. Figures (a) and (e) present the mean correlation coefficients (MCC) between true and learned latents and log-probability differences between the model and ground truth ( log prob.) for Cau CA experiments. Misspecified models assuming a trivial graph (E(G)= ) and a linear encoder function class are compared. All violin plots show the distribution of outcomes for 10 pairs of CBNs and mixing functions. Figures (c) and (d) display Cau CA results with varying numbers of nonlinearities in the mixing function and latent dimension. For the ICA setting, MCC values and log probability differences are illustrated in (b) and (f). Baselines include a misspecified model (linear mixing) and a naive (single-environment) unidentifiable normalizing flow with an independent Gaussian base distribution (labelled i.i.d.). The naive baseline is trained on pooled data without using information about interventions and their targets. Figure (g) shows the median MCC for Cau CA and the misspecified baseline (E(G)= ) as the strength of the linear parameters relative to the exogenous noise in the structural causal model generating the CBN increases. The shaded areas show the range between minimum and maximum values.

estimate the encoder parameters by maximizing the likelihood, which can be derived through a change of variables from eq. (1): for an observation in dataset k > 0 taking a value x, it is given by

log pk θ(x) = log | det Jgθ(x)| + log epτk((gθ)τk(x)) + X

i =τk log pi (gθ)i(x) | (gθ)pa(i)(x) , (10)

where Jgθ(x) denotes the Jacobian of gθ evaluated at x. The learning objective can be expressed as

θ = arg maxθ PK k=0 1 Nk PNk n=1 log pk θ(x(n,k)) , with Nk representing the size of dataset k.

Model architecture. We employ normalizing flows [40] to parameterize the encoder. Instead of the typically deployed base distribution with independent components, we use the collection of densities (one per interventional regime) induced by the CBN over the latents. Following the Cau CA setting, the parameters of the causal mechanisms are learned while the causal graph is assumed known. For details on the model and training parameters, see App. H.

Settings. We investigate two learning problems: (i) Cau CA, corresponding to 4.1, and (ii) ICA, where the sampled graphs in the true latent CBN contain no arrows, as discussed in 4.2.

Results. (i) For a data-generating process with non-empty graph, experimental outcomes are depicted in Fig. 4 (a, e). We compare a well-specified Cau CA model (blue) to misspecified baselines, including a model with correctly specified latent model but employing a linear encoder (red), and a model with a nonlinear encoder but assuming a causal graph with no arrows (orange). See caption of Fig. 4 for details on the metrics. The results demonstrate that the Cau CA model accurately identifies the latent variables, benefiting from both the nonlinear encoder and the explicit modelling of causal dependence. We additionally test the effect of increasing a parameter influencing the magnitude of the sampled linear parameters in the SCM (we refer to this as signal-to-noise ratio, see App. H.1 for details) which increases the statistical dependence among the true latent components. The gap between the Cau CA model and the baseline assuming a trivial graph widens (Fig. 4 (g)), indicating that correctly

modelling the causal relationships becomes increasingly important the more (statistically) dependent the true latent variables are. Finally, we verify that the model performs well for different number of layers M in the ground-truth nonlinear mixing (c) (performance degrades slightly for higher M), and across various latent dimensionalities for the latent variable (d).

(ii) For data-generating processes where the graph contains no arrows (ICA), results are presented in Fig. 4 (b, f). Well-specified, nonlinear models (blue) are compared to misspecified linear baselines (red) and a naive normalizing flow baseline trained on pooled data (purple). The findings confirm that interventional information provides useful learning signal even in the context of nonlinear ICA.

6 Related Work

Causal representation learning. In the present work, we focus on identifiability of latent CBNs with a known graph, based on interventional data, and investigate the nonlinear and nonparametric case. In CRL (unknown graph), many studies focus on identifiability of latent SCMs instead, which requires strong assumptions such as weak supervision (i.e., counterfactual data) [1, 3, 36, 54]. Alternatively, the setting where temporal information is available, i.e., dynamic Bayesian networks, has been studied extensively [29, 30, 33, 34, 62]. In a non-temporal setting, other works assume interventional data and linear mixing functions [49, 52]; or that the latent distributions are linear Gaussian [35]. Ahuja et al. [2] identify latent representations by deterministic hard interventions, together with parametric assumptions on the mixing, and an independent support assumption [56]. Concurrent work studies the cases with non-parametric mixing and linear Gaussian latent causal mode [5] or non-parametric latent causal model under faithfulness, genericity and Asm. 4.1 [55].

Prior knowledge on the latent SCM. Other prior works also leverage prior knowledge on the causal structure for representation learning. Yang et al. [61] introduce the Causal VAE model, which aims to disentangle the endogenous and exogenous variables of an SCM, and prove identifiability up to affine transformations based on known intervention targets. Shen et al. [46] also consider the setting in which the graph is (partially) known, but their approach requires additional supervision in the form of annotations of the ground truth latent. Leeb et al. [31] embed an SCM into the latent space of an autoencoder, provided with a topological ordering allowing it to learn latent DAGs.

Statistically dependent components. Models with causal dependences among the latent variables are a special case of models where the latent variables are statistically dependent [22]. Various extensions of the ICA setting allow for dependent variables: e.g., independent subspace analysis [16]; topographic ICA [20] (see also [24]); independently modulated component analysis [26]. Morioka and Hyvärinen [39] introduce a multi-modal model where within-modality dependence is described by a Bayesian network, with joint independence across the modalities, and a mixing function for sameindex variables across these networks. Unlike our work, it encodes no explicit notions of interventions.

7 Discussion

Limitations. (i) Known intervention targets: We proved that with fully unknown targets, there are fundamental and strong limits to identifiability (see Corollary E.8). We also studied some relaxations of this assumption (App. E), and generalized our results to known targets up to graph automorphisms and matched intervention targets (see Prop. E.6 and Prop. E.6). Other relaxations are left for future work; e.g., the case with a non-trivial graph and matched intervention targets is studied in [55], under faithfulness and genericity assumptions. (ii) Estimation: More scalable estimation procedures than our likelihood-based approach ( 5) may be developed, e.g., based on variational inference.

Cau CA as a causal generalization of ICA. As pointed out in 4.2, the special case of Cau CA with a trivial graph corresponds to a novel ICA model. Beyond the fact that Cau CA allows statistical dependence described by general DAGs among the components, we argue that it can be viewed as a causal generalization of ICA. Firstly, we exploit the assumption of localized and sparse changes in the latent mechanisms [42, 45], in contrast to previous ICA works which exploit non-stationarity at the level of the entire joint distribution of the latent components [17, 21, 38], leading to strong identifiability results (e.g., in Thm. 4.2 (ii)). Secondly, we exploit the modularity of causal mechanisms: i.e., it is possible to intervene on some of the mechanisms while leaving the others invariant [41, 43]. To the best of our knowledge, our work is the first ICA extension where latent dependence can actually be interpreted in a causal sense.

Acknowledgements

The authors thank Vincent Stimper, Weiyang Liu, Siyuan Guo, Junhyung Park, Jinglin Wang, Corentin Correia, Cian Eastwood, Adrián Javaloy and the anonymous reviewers for helpful comments and discussions.

Funding Transparency Statement

This work was supported by the Tübingen AI Center. L.G. was supported by the Video Predict project, FKZ: 01IS21088.

[1] K. Ahuja, J. S. Hartford, and Y. Bengio. Weakly supervised representation learning with sparse perturbations. In Advances in Neural Information Processing Systems, volume 35, pages 15516 15528, 2022. [Cited on pages 2 and 10.]

[2] K. Ahuja, D. Mahajan, Y. Wang, and Y. Bengio. Interventional causal representation learning. In International Conference on Machine Learning, pages 372 407. PMLR, 2023. [Cited on pages 2 and 10.]

[3] J. Brehmer, P. De Haan, P. Lippe, and T. S. Cohen. Weakly supervised causal representation learning. Advances in Neural Information Processing Systems, 35:38319 38331, 2022. [Cited on pages 2, 10, and 16.]

[4] S. Buchholz, M. Besserve, and B. Schölkopf. Function classes for identifiable nonlinear independent component analysis. Advances in Neural Information Processing Systems, 35: 16946 16961, 2022. [Cited on pages 1 and 4.]

[5] S. Buchholz, G. Rajendran, E. Rosenfeld, B. Aragam, B. Schölkopf, and P. Ravikumar. Learning linear causal representations from interventions under general nonlinear mixing. In Advances in Neural Information Processing Systems, 2023. [Cited on pages 2 and 10.]

[6] R. Cai, F. Xie, C. Glymour, Z. Hao, and K. Zhang. Triad constraints for learning causal structure of latent variables. In Advances in Neural Information Processing Systems, volume 32, pages 12883 12892, 2019. [Cited on page 2.]

[7] P. Comon. Independent component analysis, a new concept? Signal processing, 36(3):287 314, 1994. [Cited on page 1.]

[8] G. Darmois. Analyse des liaisons de probabilité. In Proc. Int. Stat. Conferences 1947, page 231, 1951. [Cited on page 1.]

[9] I. Daunhawer, A. Bizeul, E. Palumbo, A. Marx, and J. E. Vogt. Identifiability results for multimodal contrastive learning. In The Eleventh International Conference on Learning Representations, 2022. [Cited on page 2.]

[10] C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios. Neural spline flows. Advances in Neural Information Processing Systems, 32, 2019. [Cited on page 38.]

[11] L. Gresele, P. K. Rubenstein, A. Mehrjou, F. Locatello, and B. Schölkopf. The Incomplete Rosetta Stone problem: Identifiability results for multi-view nonlinear ICA. In Uncertainty in Artificial Intelligence, pages 217 227. PMLR, 2019. [Cited on page 1.]

[12] L. Gresele, G. Fissore, A. Javaloy, B. Schölkopf, and A. Hyvärinen. Relative gradient optimization of the Jacobian term in unsupervised deep learning. Advances in neural information processing systems, 33:16567 16578, 2020. [Cited on page 38.]

[13] L. Gresele, J. von Kügelgen, V. Stimper, B. Schölkopf, and M. Besserve. Independent mechanism analysis, a new concept? In Advances in neural information processing systems, volume 34, pages 28233 28248, 2021. [Cited on page 1.]

[14] H. Hälvä and A. Hyvärinen. Hidden markov nonlinear ICA: Unsupervised learning from nonstationary time series. In Conference on Uncertainty in Artificial Intelligence, pages 939 948. PMLR, 2020. [Cited on page 1.]

[15] H. Hälvä, S. Le Corff, L. Lehéricy, J. So, Y. Zhu, E. Gassiat, and A. Hyvärinen. Disentangling identifiable features from noisy data with structured nonlinear ICA. Advances in Neural Information Processing Systems, 34:1624 1633, 2021. [Cited on page 1.]

[16] A. Hyvärinen and P. Hoyer. Emergence of phase-and shift-invariant features by decomposition of natural images into independent feature subspaces. Neural computation, 12(7):1705 1720, 2000. [Cited on page 10.]

[17] A. Hyvärinen and H. Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. Advances in neural information processing systems, 29, 2016. [Cited on pages 1 and 10.]

[18] A. Hyvärinen and H. Morioka. Nonlinear ICA of temporally dependent stationary sources. In Artificial Intelligence and Statistics, pages 460 469. PMLR, 2017. [Cited on page 1.]

[19] A. Hyvärinen and P. Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural networks, 12(3):429 439, 1999. [Cited on pages 1, 24, and 27.]

[20] A. Hyvärinen, P. O. Hoyer, and M. Inki. Topographic independent component analysis. Neural computation, 13(7):1527 1558, 2001. [Cited on page 10.]

[21] A. Hyvärinen, H. Sasaki, and R. Turner. Nonlinear ICA using auxiliary variables and generalized contrastive learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 859 868. PMLR, 2019. [Cited on pages 1, 6, 7, 8, 10, 25, 26, 34, and 35.]

[22] A. Hyvärinen, I. Khemakhem, and R. Monti. Identifiability of latent-variable and structuralequation models: from linear to nonlinear. ar Xiv preprint ar Xiv:2302.02672, 2023. [Cited on pages 1, 4, and 10.]

[23] A. Javaloy, P. Sánchez-Martín, and I. Valera. Causal normalizing flows: from theory to practice. In Advances in Neural Information Processing Systems, 2023. [Cited on page 37.]

[24] T. A. Keller and M. Welling. Topographic VAEs learn equivariant capsules. Advances in Neural Information Processing Systems, 34:28585 28597, 2021. [Cited on page 10.]

[25] I. Khemakhem, D. Kingma, R. Monti, and A. Hyvärinen. Variational autoencoders and nonlinear ICA: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pages 2207 2217. PMLR, 2020. [Cited on pages 1 and 6.]

[26] I. Khemakhem, R. Monti, D. Kingma, and A. Hyvärinen. Ice-Bee M: Identifiable conditional energy-based deep models based on nonlinear ICA. Advances in Neural Information Processing Systems, 33:12768 12778, 2020. [Cited on pages 1 and 10.]

[27] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Y. Bengio and Y. Le Cun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. [Cited on page 39.]

[28] B. Kivva, G. Rajendran, P. Ravikumar, and B. Aragam. Identifiability of deep generative models without auxiliary information. In Advances in Neural Information Processing Systems, volume 35, pages 15687 15701, 2022. [Cited on page 2.]

[29] S. Lachapelle and S. Lacoste-Julien. Partial disentanglement via mechanism sparsity. In UAI 2022 Workshop on Causal Representation Learning, 2022. [Cited on page 10.]

[30] S. Lachapelle, P. Rodriguez, Y. Sharma, K. E. Everett, R. Le Priol, A. Lacoste, and S. Lacoste Julien. Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ICA. In Conference on Causal Learning and Reasoning, pages 428 484. PMLR, 2022. [Cited on pages 2 and 10.]

[31] F. Leeb, G. Lanzillotta, Y. Annadani, M. Besserve, S. Bauer, and B. Schölkopf. Structure by architecture: Structured representations without regularization. Proceedings of the Eleventh International Conference on Learning Representations (ICLR), 2023. ar Xiv preprint 2006.07796. [Cited on page 10.]

[32] E. L. Lehmann and G. Casella. Theory of point estimation. Springer Science & Business Media, 2006. [Cited on page 1.]

[33] P. Lippe, S. Magliacane, S. Löwe, Y. M. Asano, T. Cohen, and S. Gavves. Citris: Causal identifiability from temporal intervened sequences. In International Conference on Machine Learning, pages 13557 13603. PMLR, 2022. [Cited on pages 2 and 10.]

[34] P. Lippe, S. Magliacane, S. Löwe, Y. M. Asano, T. Cohen, and E. Gavves. Causal representation learning for instantaneous and temporal effects in interactive systems. In The Eleventh International Conference on Learning Representations, 2023. [Cited on page 10.]

[35] Y. Liu, Z. Zhang, D. Gong, M. Gong, B. Huang, A. van den Hengel, K. Zhang, and J. Qinfeng Shi. Weight-variant latent causal models. ar Xiv e-prints, pages ar Xiv 2208, 2022. [Cited on page 10.]

[36] F. Locatello, B. Poole, G. Rätsch, B. Schölkopf, O. Bachem, and M. Tschannen. Weaklysupervised disentanglement without compromises. In International Conference on Machine Learning, pages 6348 6359. PMLR, 2020. [Cited on page 10.]

[37] Q. Lyu, X. Fu, W. Wang, and S. Lu. Understanding latent correlation-based multiview learning and self-supervision: An identifiability perspective. In International Conference on Learning Representations, 2022. [Cited on page 2.]

[38] R. P. Monti, K. Zhang, and A. Hyvärinen. Causal discovery with general non-linear relationships using non-linear ICA. In Uncertainty in artificial intelligence, pages 186 195. PMLR, 2020. [Cited on page 10.]

[39] H. Morioka and A. Hyvärinen. Connectivity-contrastive learning: Combining causal discovery and representation learning for multimodal data. In International Conference on Artificial Intelligence and Statistics, pages 3399 3426. PMLR, 2023. [Cited on page 10.]

[40] G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. The Journal of Machine Learning Research, 22(1):2617 2680, 2021. [Cited on pages 9, 37, and 38.]

[41] J. Pearl. Causality. Cambridge university press, 2009. [Cited on pages 2, 3, and 10.]

[42] R. Perry, J. von Kügelgen, and B. Schölkopf. Causal discovery in heterogeneous environments under the sparse mechanism shift hypothesis. In Advances in Neural Information Processing Systems, 2022. [Cited on page 10.]

[43] J. Peters, D. Janzing, and B. Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press, 2017. [Cited on pages 3 and 10.]

[44] A. Sauer and A. Geiger. Counterfactual generative networks. In International Conference on Learning Representations (ICLR), 2021. [Cited on page 2.]

[45] B. Schölkopf, F. Locatello, S. Bauer, N. R. Ke, N. Kalchbrenner, A. Goyal, and Y. Bengio. Toward Causal Representation Learning. Proceedings of the IEEE, 109(5):612 634, 2021. [Cited on pages 2 and 10.]

[46] X. Shen, F. Liu, H. Dong, Q. Lian, Z. Chen, and T. Zhang. Weakly supervised disentangled generative causal representation learning. Journal of Machine Learning Research, 23:1 55, 2022. [Cited on page 10.]

[47] R. Silva, R. Scheines, C. Glymour, P. Spirtes, and D. M. Chickering. Learning the structure of linear latent variable models. Journal of Machine Learning Research, 7(2), 2006. [Cited on page 2.]

[48] P. Spirtes, C. N. Glymour, and R. Scheines. Causation, prediction, and search. MIT press, 2000. [Cited on page 3.]

[49] C. Squires, A. Seigal, S. S. Bhate, and C. Uhler. Linear causal disentanglement via interventions. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 32540 32560. PMLR, 2023. [Cited on pages 2, 5, 6, and 10.]

[50] M. Tangemann, S. Schneider, J. von Kügelgen, F. Locatello, P. Gehler, T. Brox, M. Kümmerer, M. Bethge, and B. Schölkopf. Unsupervised object learning via common fate. In 2nd Conference on Causal Learning and Reasoning (CLea R). 2021. ar Xiv:2110.06562. [Cited on page 2.]

[51] F. Träuble, E. Creager, N. Kilbertus, F. Locatello, A. Dittadi, A. Goyal, B. Schölkopf, and S. Bauer. On disentangled representations learned from correlated data. In International Conference on Machine Learning, pages 10401 10412. PMLR, 2021. [Cited on page 2.]

[52] B. Varici, E. Acarturk, K. Shanmugam, A. Kumar, and A. Tajer. Score-based causal representation learning with interventions. ar Xiv preprint ar Xiv:2301.08230, 2023. [Cited on pages 2 and 10.]

[53] B. Varıcı, E. Acartürk, K. Shanmugam, and A. Tajer. General identifiability and achievability for causal representation learning. ar Xiv preprint ar Xiv:2310.15450, 2023. [Cited on page 33.]

[54] J. von Kügelgen, Y. Sharma, L. Gresele, W. Brendel, B. Schölkopf, M. Besserve, and F. Locatello. Self-supervised learning with data augmentations provably isolates content from style. Advances in neural information processing systems, 34:16451 16467, 2021. [Cited on pages 2, 6, 10, and 22.]

[55] J. von Kügelgen, M. Besserve, L. Wendong, L. Gresele, A. Keki c, E. Bareinboim, D. M. Blei, and B. Schölkopf. Nonparametric identifiability of causal representations from unknown interventions. In Advances in Neural Information Processing Systems, 2023. [Cited on pages 2, 10, and 33.]

[56] Y. Wang and M. I. Jordan. Desiderata for representation learning: A causal perspective. ar Xiv preprint ar Xiv:2109.03795, 2021. [Cited on page 10.]

[57] L. Wasserman. All of statistics: a concise course in statistical inference, volume 26. Springer, 2004. [Cited on page 1.]

[58] Q. Xi and B. Bloem-Reddy. Indeterminacy in generative models: Characterization and strong identifiability. In International Conference on Artificial Intelligence and Statistics, pages 6912 6939. PMLR, 2023. [Cited on page 1.]

[59] F. Xie, R. Cai, B. Huang, C. Glymour, Z. Hao, and K. Zhang. Generalized independent noise condition for estimating latent variable causal graphs. In Advances in Neural Information Processing Systems, volume 33, pages 14891 14902, 2020. [Cited on page 2.]

[60] F. Xie, B. Huang, Z. Chen, Y. He, Z. Geng, and K. Zhang. Identification of linear non-gaussian latent hierarchical structure. In International Conference on Machine Learning, pages 24370 24387. PMLR, 2022. [Cited on page 2.]

[61] M. Yang, F. Liu, Z. Chen, X. Shen, J. Hao, and J. Wang. Causal VAE: Disentangled representation learning via neural structural causal models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9593 9602, 2021. [Cited on page 10.]

[62] W. Yao, Y. Sun, A. Ho, C. Sun, and K. Zhang. Learning temporally causal latent processes from general temporal data. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. Open Review.net, 2022. [Cited on page 10.]

[63] K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), pages 647 655. AUAI Press, 2009. [Cited on page 1.]

App. A recapitulates the notation used in this paper. App. B contains the proofs of all theoretical statements presented in the paper. App. C contains a nontrivial counterexample of Thm. 4.2 (ii) when Asm. 4.1 is violated. App. D contains an additional result of identifiability based on Thm. 4.2 (i), when more imperfect stochastic interventions are available. App. E contains a general discussion on Cau CA with unknown intervention targets, as well as a generalization of some of the identifiability results. App. F contains a technical details on the relationship between Cau CA and nonlinear ICA. App. G contains some theoretical results which were useful in the design of the experiments. App. H contains the details of the experiments.

A Notations

Symbol Description G A directed acyclic graph with nodes V (G) = [d] and arrows E(G) (i, j) An ordered tuple representing an arrow in E(G), with i, j V (G) Ji, j K The integers i, . . . , j [d] The natural numbers 1, . . . , d pa(i) Parents of i, defined as {j V (G) | (j, i) E(G)} pak(j) Parents of j in the post-intervention graph in the intervention regime k pa(i) Closure of the parents of i, defined as pa(i) {i} anc(i) Ancestors of i, nodes j in G such that there is a directed path from j to i anc(i) Closure of the ancestors of i, defined as anc(i) {i} G Transitive closure of G defined by pa G(i) := anc G(i) X, Y, Z Unidimensional random variables X, Y, Z Multidimensional random variables x, y, z Scalars in R x, y, z Vectors in Rd z[i] The (1, . . . , i) dimensions of z φi The function that outputs the i-th dimension of the mapping φ φ[i] The mapping that outputs the (1, . . . , i) dimensions of the mapping φ τk Intervention targets in interventional regime k P, Q Probability distributions p, q Density functions of P, Q Pi Zi | Zpa(i) Causal mechanism of variable Zi e Pk i Zi | Zpak(i) Intervened mechanism of variable Zi in interventional regime k Pk (Z) k = 0: interventional distribution in interventional regime k (Defn. 2.2) k = 0: unintervened distribution PG Class of latent joint probabilities that are Markov relative to G In this paper, it is assumed to have differentiable density and to be absolutely continuous with full support in Rd F Function class of mixing function/decoders In this paper, it is assumed to be all C1-diffeomorphisms S Indeterminacy set, defined in Defn. 3.2 f Mixing function or decoder, a diffeomorphism Rd Rd f P The pushforward measure of P by f G A directed acyclic graph without indices G |= G G is an indexed graph of G, i.e. V (G) = [d] and there exists an isomorphism G G Aut G Group of automorphisms of graph G Sd Group of permutations of d elements

B.1 Lemmata

Lemma B.1 (Lemma 2 of [3]). Let A = C = R and B = Rd. Let f : A B C be differentiable. Define differentiable measures PA on A and PC on C. Let b B, f( , b) : A C be measurepreserving, i.e. PC = f( , b) PA. Then f is constant in b over B.

Lemma B.2. For any distributions P, Q of full support on R, with c.d.f F, G, there are only two diffeomorphisms T : R R such that T P = Q: they are G 1 F and G 1 F, where G(x) := 1 G(x).

Proof. T is a diffeomorphism, then T (x) = 0 x R. Then the sign of T (x) is either positive or negative everywhere. T is either strictly increasing or strictly decreasing on R. Since P and Q are full support in R, F and G are strictly increasing in R.

If T is increasing: T P = Q implies that G(x) = Q(X x) = P(T(X) x). Since T is strictly increasing, G(x) = P(T(X) x) = P(X T 1(x)) = F T 1(x). Thus x = G 1 F T 1(x). T = G 1 F.

If T is decreasing: G(x) = Q(X x) = P(T(X) x) = P(X T 1(x)) = 1 F(T 1(x)). Thus T = G 1 F.

Lemma B.3. Suppose P, Q Markov relative to G, absolutely continuous with full support in Rd. Fix any functions φ1, . . . , φd diffeomorphisms strictly monotonic in R. Let φ := (φi)i [d]. The following statements are equivalent:

(1) Q = φ P

(2) i [d], zi R, zpa(i) R|pa(i)|, pi zi | zpa(i) = qi φi (zi) | φpa(i) zpa(i) |φ i (zi)|, or equivalently, Qi( |φpa(i) zpa(i) ) = (φi) Pi | zpa(i) .

Proof. Qi( |φpa(i) zpa(i) ) denotes the conditional probability of Zi: Qi(Zi|φpa(i) zpa(i) ).

(2) (1): Multiply the equations in (2) for n indices,

i=1 pi zi | zpa(i) =

i=1 qi φi (zi) | φpa(i) zpa(i) |φ i (zi)| .

Since Qd i=1 |φ i (zi)| = | det Dφ(z)|, we obtain the equation in (1).

(1) (2): without loss of generality, choose a total order on V (G) that preserves the partial order of G: i > j if i pa(j). Since φ is a diffeomorphism, by the change of variables formula,

p(z) = q(φ(z))| det Dφ(z)|. (11)

Since P, Q are Markov relative to G, write p, q as the factorization according to G:

i=1 pi zi | zpa(i) =

i=1 qi φi (zi) | φpa(i) zpa(i) |φ i (zi)| . (12)

We will show by induction on the reverse order of [d] that for all i [d],

pi zi | zpa(i) = qi φi (zi) | φpa(i) zpa(i) |φ i (zi)| .

In eq. (12), marginalize over zd,

i=1 pi zi | zpa(i) = Z

i=1 qi φi (zi) | φpa(i) zpa(i) |φ i (zi)| dzd. (13)

Fix z[d 1], change of variable u = φd (zd), du = φ d (zd),

i=1 pi zi | zpa(i) =

i=1 qi φi (zi) | φpa(i) zpa(i) |φ i (zi)| . (15)

Cancel the two sides of equation (12) by (15),

pi zd | zpa(d) = qi φd (zd) | φpa(d) zpa(d) |φ d (zd)| .

Suppose the property is true for i + 1, , d. Then for i, i is a leaf node in the first i nodes. We use the same proof as before, marginalize over zi (same as (13) with d replaced by i) on the joint distribution of i first variables (same as (12) with d replaced by i), which is then divided by the obtained i 1 marginal equation (same as (15) with d 1 replaced by i 1).

Lemma 3.3. For any (G, f, (Pk, τk)k J0,KK) in (G, F, PG), and for any h Sscaling with

Sscaling : = h : Rd Rd | h(z) = (h1(z1), . . . , hd(zd)), hi is a diffeomorphism in R , (3)

there exists a (G, f h, (Qk, τk)k J0,KK) in (G, F, PG) s.t. f Pk = (f h) Qk for all k J0, KK.

Proof. For any (G, f, (Pk, τk)k J0,KK) in (G, F, PG), and for any h Sscaling, define g := h 1, then g Sscaling. Define Q0 := g P0. By Lemma B.3, for all i [d], Qi |hpa(i)(zpa(i)) = (gi) Pi | zpa(i) .

For k [K], define e Qk j |gpak(j)(zpak(j)) := (gj) e Pk j | zpak(j) . (16)

Define Qk := Q j τk e Qk j Q j / τk Qj, by Lemma B.3 and (16), Qk = g Pk k [K]. By definition of Q0, Qk = g Pk k J0, KK. i.e., f Pk = (f h) Qk.

B.2 Proof of Thm. 4.2

Theorem 4.2. For Cau CA in (G, F, PG),

(i) Suppose for each node in [d], there is one (perfect or imperfect) stochastic intervention that satisfies Asm. 4.1. Then Cau CA in (G, F, PG) is identifiable up to

SG = n h : Rd Rd|h(z) = hi(zanc(i))

i [d] , h is C1-diffeomorphism o . (5)

(ii) Suppose for each node i in [d], there is one perfect stochastic intervention that satisfies Asm. 4.1. Then Cau CA in (G, F, PG) is identifiable up to Sscaling.

Proof. Proof of (i): Consider two latent CBNs achieving the same likelihood across all interventional regimes: G, f, Pi, τi

and G, f , Qi, τi

. Since the intervention targets

(τi)i J0,d K are the same on both latent CBN, by rearranging the indices of G and correspondingly the indices in Pi, Qi and (τi)i J0,d K, we can suppose without loss of generality that the index of G preserves the partial order induced by E(G): i < j if (i, j) E(G). Since (τi)i [d] covers all d nodes in G, by rearranging (τi)i [d] we can suppose without loss of generality that τi = i i [d].

In the i-th interventional regime,

Pi(Z) = e Pi Zi | Zpai(i) Y

j [d]\i P Zj | Zpa(j) ,

Qi(Z) = e Qi Zi | Zpai(i) Y

j [d]\i Qj Zj | Zpa(j) ,

where pa(j) = pai(j) j = i, since intervening on i does not change the arrows towards j.

Define φ := f 1 f. Denote its i-th dimension output function as φi : Rd R. We will prove by induction that i [d], j / anc(i), z Rd, φi

zj (z) = 0.

For any i J0, d K, f Pi = f Qi. Since φ is a diffeomorphism, by the change of variable formula,

pi(z) = qi(φ(z))|det Dφ(z)|. (17)

For i = 0, factorize pi and qi according to G, then take the logarithm on both sides:

j=1 ln pj zj | zpa(j) =

j=1 ln qj φj(z) | φpa(j)(z) + ln | det Dφ(z)|. (18)

For i = 1, e Q1 has no conditionals, thus qi is factorized as

q1(z) = eq1 (z1)

j=2 qj zj | zpa(j) .

So the equation (17) for i = 1 after taking logarithm is

ln ep1 (z1) +

j=2 ln pj zj | zpa(j) = ln eq1 (φ1(z)) +

j=2 qj φj(z) | φpa(j)(z)

+ ln |det Dφ(z)| .

Subtract (19) by (18),

ln p1 (z1) ln p1 (z1) = ln q1 (φ1(z)) ln q1 (φ1(z)) . (20)

For any i = 1, take the i-th partial derivative of both sides:

0 = q 1 (φ1(z)) q1 (φ1(z)) q 1 (φ1(z)) q1 (φ1(z))

By Asm. 4.1, the term in the parenthesis is non-zero a.e. in Rd. Thus φ1

zi (z) = 0 a.e. in Rd. Since φ = f 1 f where f, f are C1-diffeomorphisms, so is φ. φ1

zi is continuous and thus equals zero everywhere.

Now suppose k [i 1], j / anc(k), z Rd, φk

zj (z) = 0. Then for interventional regime i,

ln pi zi | zpai(i) + X

j =i ln pj zj | zpa(j) = ln qi φi(z) | φpai(i)(z)

j =i qj φj(z) | φpa(j)(z) + ln | det Dφ(z)|,

Subtracted by (18),

ln pi zi | zpai(i) ln pi zi | zpa(i) = ln qi φi(z) | φpai(i)(z) ln qi φi(z) | φpa(i)(z) . (21)

For any j / anc(i), j / pa(i) pai(i) by assumption. Take partial derivative over zj :

k pai(i) qi xk φi(z) | φpai(i)(z) φk

qi φi(z) | φpai(i)(z)

k pa(i) qi xk φi(z) | φpa(i)(z) φk

qi φi(z) | φpa(i)(z) ,

where xk denotes the k-th dimension of the domain of qi and qi.

For all k pa(i), since j / anc(i), j is not in anc(k) either. By the assumption of induction, φk

zj (z) = 0. Delete the partial derivatives that are zero:

xi φi(z) | φpai(i)(z)

qi φi(z) | φpai(i)(z)

qi xi φi(z) | φpa(i)(z)

qi φi(z) | φpa(i)(z)

By Asm. 4.1, the term in parenthesis is nonzero a.e., thus φi

zj (z) = 0 a.e. Since φi

zj is continuous, it equals zero everywhere.

The induction is finished when i = d. We have proven that i [d], j / anc(i), z Rd, φi zj (z) = 0. Namely, φi only depends on zanc(i).

Proof of (ii):

By the result proved in (i), φ := f 1 f SG. Thus Dφ(z) is lower triangular for all z Rd. Thus

| det Dφ(z)| = Qd i=1 φi

zi z[i] , and for all i, φi only depends on z1, , zi. We will prove that φi only depends on zi, i.e., it is constant on other variables.

To prove the conclusion in this item, we need the following lemma:

Lemma B.4. Given any φ : Rn Rn diffeomorphism such that for all i [n 1], φi

zn is zero everywhere, and given any two distributions P, Q that are absolutely continuous and have full support in Rn such that φ P = Q, then the distributions of the first n 1 coordinates are preserved, i.e.,

(φ[n 1]) P[n 1](z[n 1]) = Q[n 1](z[n 1]).

Proof. Fix z[n 1] Rn 1. For all i [n 1], φi

zn is zero everywhere, and φ is a diffeomorphism, so φn

zn is nonzero everywhere, otherwise there will exist z such that φn

zn (z) is singular. Therefore φn

zn (z[n 1], ) is continuous and nonzero, and thus φn(z[n 1], ) is a diffeomorphism R R. So

we can apply the change of variable u = φn z[n 1], zn , du = φn

zn z[n 1], zn dzn:

R q φ[n 1] z[n 1] , φn (zn) φn

z[n 1], zn dzn = Z

R q φ[n 1] z[n 1] , u du

= q[n 1] φ[n 1] z[n 1] . (22)

In the equation p(z) = q(φ(z))|detφ(z)|, marginalize over zn: Z

R p z[n 1], zn dzn = Z

R q φ[n 1] z[n 1] , φn (zn) n Y

z[i] dzn. (23)

Using (22), we obtain

p[n 1] z[n 1] = q[n 1] φ[n 1] z[n 1] n 1 Y

= q[n 1] φ[n 1] z[n 1] | det Dφ[n 1] z[n 1] |,

which is the density equation for push-forward measures that we want to prove.

Back to the proof of (ii). For all i [d], (φ1, . . . , φi 1) is a diffeomorphism because det Dφ[i 1](z[i 1]) = Qi 1 j=1 φj

zj (z) = 0.

We will prove by induction on the reverse order of [d] that the i-th row off-diagonal entries of Dφ(z) are zero for all z Rd.

In the interventional regime d, by the assumption on the indices of V (G) = [d] in the proof (i), the node d is not a parent of any node in [d 1]. Thus the perfect stochastic intervention on zd leads to the density pd and qd factorized as follows:

p[d 1] z[d 1] pd (zd) = q[d 1] φ[d 1] z[d 1] qd (φd(z)) det Dφ[d 1] z[d 1] φd

Since Dφ is lower triangular everywhere, cancel the terms of coordinate [d 1] on both sides by Lemma B.4,

pd (zd) = qd φd z[d 1], zd φd

z[d 1], zd , (24)

which is equivalent to

z[d 1] Rd 1, e Qd = φd z[d 1],

By Lemma B.1, φd is constant in the first d 1 variables.

Suppose the off-diagonal entries are zero for the i, i + 1, . . . , d rows of Dφ(z).

By the assumption on the indices of V (G) in the proof (i), the node i is not a parent of any node in [i 1]. Thus the perfect stochastic intervention on zi leads to the density pi and qi factorized as follows:

Rd i p (z) dzi+1 dzd = Z

Rd i q(φ(z))

z[j] dzi+1 . . . dzd

Rd i q φ[i] z[i] , φi+1 (zi+1) , , φd (zd) d Y

zk (zk) dzi+1 dzd.

By a change of variables

ui+1 = φi+1 (zi+1) ... ud = φd (zd) , we get

p[i](z[i]) =

Rd i q φ[i] z[i] , ui+1, , ud dui+1 dud

= q[i] φ[i] z[i] det Dφ[i] z[i]

= q[i 1] φ[i 1] z[i 1] qi (φi(z)) det Dφ[i 1] z[i 1] φi

By Lemma B.4, p[i 1] z[i 1] = q[i 1] φ[i 1] z[i 1] | det Dφ[i 1] z[i 1] |. By Lemma B.1, φ[i] is constant in the first i 1 variables.

In addition, Dφ(z) is lower triangular for all z, so we have proven that φ Sscaling.

B.3 Proof of Prop. 4.3

Proposition 4.3. Given a DAG G, with d 1 perfect stochastic single node interventions on distinct targets, if the remaining unintervened node has any parent in G, (G, F, PG) is not identifiable up to

Sreparam := g : Rd Rd | g = P h, P is a permutation matrix, h Sscaling

Proof. Without loss of generality by rearranging Pi, τi

i J0,d 1K, suppose that the unintervened

variable is the node d. Fix any G, f, Pi, τi

, s.t. d 1 is a parent of d, and s.t. the causal

mechanism of Zd only has one conditional variable Zd 1, and Zd N (Zd 1, 1), namely,

pd (zd | zd 1) = 1

We now construct G, f , Qi, τi

such that f Pi = f Qi and f 1 f / Sreparam.

Set Qi Zi | Zpa(i) : (d) = Pi Zi | Zpa(i) , f Qi (Zi) : (d) = e Pi Zi | Zpa(i) i [d 1].

Set Qd (Zd | Zd 1) : (d) = N ( Zd 1, 1), φ(z) := (z1, , zd 1, zd 2zd 1), thus | Det Dφ(z)| = 1 z Rd.

qd (φd(z) | φd 1(z)) = 1

(φd(z) φd 1(z))2

(zd 2zd 1 + zd 1)2

= pd (zd | zd 1) .

From the above equation, we infer that for the unintervened regime,

j=1 pj zj | zpa(j) =

j=1 qj φj(z) | φpa(j)(z)

| det Dφ(z)|,

and for the d 1 interventional regimes,

i [d 1], pi (zi) Y

j =i pj zj | zpa(j) = qi (φi(z))

j =i qj φj(z) | φpa(j)(z)

| det Dφ(z)|,

i.e., φ Pi = Qi i J0, d 1K.

f Pi = f φ 1

However, f φ 1 1 f = φ / Sreparam.

B.4 Proof of Thm. 4.5

Theorem 4.5. Given any DAG G. Suppose that our datasets encompass interventions over all variables in the latent graph, i.e., S

k [K] τk = [d]. For all k [K], suppose the targets of interventions are strict subsets of all variables, i.e., |τk| = nk, nk [d 1]. Suppose the interventions over τk are perfect, i.e. the intervention mechanisms Qs τk are joint distributions over Zτk without conditioning on other variables. Suppose Asm. 4.4 is satisfied for τk.

Then Cau CA in (G, F, PG) is block-identifiable (following [54]): namely, if for all k [K], f Pk = f Qk, then for φ := f 1 f, for all k [K],

[φ(z)]τk = φτk (zτk) . (8)

Proof. Fix k [K]. Then the equation of the k-th interventional regime is pk(z) = qk(φ(z))|detφ(z)|.

If τk has no parents from [d] \ τk, then in the unintervened regime, p0 k and q0 k have no conditional. Since interventions over τk are perfect, ps k and qs k have no conditional for all s J1, nk K.

If τk has parents from [d] \ τk, since in this case there are nk + 1 perfect interventions, enumerated by s J0, nk K, ps k and qs k have no conditional for all s J0, nk K.

In both cases, ps k and qs k have no conditional for all s J0, nk K.

We write the equality of pushforward densities just as Prop. 4.6, and subtract the k-th interventional regime by the unintervened regime:

ln ps(zτk) ln p0(zτk) = ln qs(φτk(z)) ln q0(φτk(z)).

For all i [d]\τk (nonempty by assumption), take the partial derivative of zi :

xj (ln q1 τk)(φτk(zτk)) xj (ln q0 τk)(φτk(zτk)) φτk,j

By assumption, for all s [nk], there is one interventional regime in which the above equation holds. Those nk equations form a linear system 0 = Mτk(zτk) φτk

zi (zτk), where Mτk(zτk) is defined in the statement of theorem. Since Mτk(z) is invertible a. e. by assumption, the vector φτk

zi (z) = 0 z Rd a. e., which is furthermore strictly everywhere since φ := f 1 f is C1. Such a result is valid for all i [d]\τk. Since S

k I τk = [d], all the non-diagonal entries of Dφ(z) that are not in the blocks of τk τk are 0. We conclude that φτi only depends on zτi, i.e., [φ(z)]τk = φτk (zτk).

For all k [K], [d] \ τk = . Suppose there exists z Rd such that det(Dφτk(z)) = 0. Since [φ(z)]τi = φτi (zτi) i [n], the vector φτk

zi (z) = 0 for all i / τk. Thus the rows τk of Dφ(z) are linearly dependent, which implies det (Dφ(z)) = 0, which contradicts with φ invertible. Thus φτk is a diffeomorphism.

B.5 Proof of Prop. 4.6

Proposition 4.6. Suppose that G is the empty graph, and that there are d 1 variables intervened on, with one single target per dataset, such that Asm. 4.1 is satisfied. Then Cau CA (in this case, ICA) in (G, F, PG) is identifiable up to Sscaling defined as in eq. (3).

Proof. Without loss of generality by rearranging Pi, τi

i J0,d 1K, suppose that the unintervened variable is the node d. We apply the induction in the proof of Thm. 4.2 (i). Since there are d 1 interventions, the induction stops at d 1, and we can infer that for φ := f 1 f, for all i [d 1], φi zj (z) = 0 a.e. j = i.

Similar to Thm. 4.2 (i), since φi

zj is continuous, it equals zero everywhere. Thus Dφ(z) is lower triangular for all z Rd.

By Lemma B.4, φ[d 1]

P[d 1] Z[d 1] = Q[d 1] Z[d 1] . Namely,

j=1 pj (zj) =

j=1 qj (φj(z)) det Dφ[d 1] z[d 1] . (27)

Since for all i [d 1], j = i, φi

zj (z) = 0, Dφ(z) is lower triangular for all z Rd. Thus

|det Dφ(z)| = Qd j=1 | jφj(z)|. Moreover, in the unintervened dataset,

j=1 pj (zj) =

j=1 qj (φj(z)) | jφj(z)| . (28)

Divide (28) by (27),

pd (zd) = qd (φd(z)) φd

z[d 1], zd ,

which is equivalent to z[d 1] Rd 1, f Qd = φd z[d 1],

By Lemma B.1, φd is constant in the first d 1 variables. We have proven that φ Sscaling.

B.6 Proof of Prop. 4.7

Proposition 4.7. Given an empty graph G, with d 2 single-node interventions on distinct targets, with one single target per dataset, such that Asm. 4.1 is satisfied. Then Cau CA (in this case, ICA) in (G, F, PG) is not identifiable up to Sreparam.

Proof. Without loss of generality by rearranging Pi, τi

i J0,d 2K, suppose that the two unintervened

variables are the nodes d 1, d. Fix any f and Pi, e Pi

i J0,d 2K such that for all i [d 2], Pi, e Pi

have any distribution that is absolutely continuous and full support in R with a differentiable density, and such that Asm. 4.1 is satisfied. We will prove that whether we suppose independent Gaussian distributions are in the class of latent distributions or not, Cau CA in (G, F, PG) is not identifiable up to Sreparam.

Case 1: Independent Gaussian distributions are in PG.

By the famous result in linear ICA, if Pd 1, Pd form an isotropic Gaussian vector N(0, Σ), i.e., Σ is diagonal with the same variances on each dimension, then any rotation form a spurious solution. Namely, let φ(z) = (z1, . . . , zd 2, zd 1 cos(θ) zd sin(θ), zd cos(θ) + zd 1 sin(θ)), θ = kπ, then

i J0, d 2K φ Pi = Pi

f Pi = f φ 1

However, f φ 1 1 f = φ / Sreparam.

Case 2: Independent Gaussian distributions are not in PG.

Suppose that for all i {d 1, d}, Pi has the same density pa:

exp( az2) z < 0 1 0 z 1 p π

a exp a z 1 p π

a 2 z > 1 p π

a < 1. One can verify that pa is a smooth p.d.f.

We construct a measure-preserving automorphism inspired by [19].

Z ||ZJd 1,d K|| R

Z[d 2] cos(α(||ZJd 1,d K C|| R))Zd 1 sin(α(||ZJd 1,d K C|| R))Zd cos(α(||ZJd 1,d K C|| R))Zd + sin(α(||ZJd 1,d K C|| R))Zd 1

||ZJd 1,d K|| < R

where α = 0, C = 1

Now let us prove that φ preserves PJd 1,d K. By shifting the center of pa to the origin, we only need to prove that the shifted φJd 1,d K preserves the uniform distribution over [ R, R]2. One can verify that φJd 1,d K is a diffeomorphism over the 2-dimensional open disk D2(0, R) \ {0} D2(0, R)\{0} and | det(DφJd 1,d K(z))| = 1. Thus pa(z) = pa(φJd 1,d K(z))| det(DφJd 1,d K(z))| z D2(0, R)\{0}. Since φ = Id outside of the disk, this change of variables formula holds almost everywhere in R2, thus PJd 1,d K = (φJd 1,d K) PJd 1,d K, namely, φJd 1,d K preserves PJd 1,d K on R2.

Moreover, since φ[d 2] is identity, e Pi = (φi) e Pi and Pi = (φi) Pi for all i [d 2]. Thus

i J0, d 2K φ Pi = Pi,

f Pi = f φ 1

However, f φ 1 1 f = φ / Sreparam.

B.7 Further constraint on the indeterminacy set

Corollary B.5. Based on the assumption of (2) of Thm. 4.2, if in every dataset we are given the set of possible intervention mechanisms: M = (Mi)i [d], then Cau CA in (G, F, PG) is identifiable up to SM := {φ : Rn Rn|φ Sscaling, i [d], φi SMi} where SMi := { F 1 M i FMi|Mi, M i Mi}

In particular, if Mi is singleton for all i, then Cau CA in (G, F, PG) is identifiable up to Sreflexion := {h : Rd Rd| i [d], hi = Id or Id}.

Proof. Based on the conclusion of (2) of Thm. 4.2, for any i [d], choose any Mi, M i in Mi,

(φi) Mi = M i.

By Lemma B.2, the only possible φi are F 1 M i FMi and F 1 M i FMi. Thus φi SMi. In particular

if Mi is a singleton {Mi}, then F 1 Mi FMi =Id, and F 1 Mi FMi = Id.

B.8 Proof of Corollary 4.8

Corollary 4.8. [Corollary of Thm. 4.5] Suppose G is the empty graph. Suppose that our datasets encompass interventions over all variables in the latent graph, i.e., S k [K] τk = [d]. Suppose for every k, the targets of interventions are a strict subset of all variables, i.e., |τk| = nk, nk [d 1].

Suppose Asm. 4.4 is verified, which has a simpler form in this case: there are nk interventions with target τk such that vk(zτk, 1) vk(zτk, 0), , vk(zτk, nk) vk(zτk, 0) are linearly independent, where vk(zτk, s) := (ln qs τk,1) (zτk,1) , , (ln qs τk,nk) (zτk,nk) , (9)

where qs τk is the intervention of the s-th interventional regime that has the target τk, and qs τk,j is the j-th marginal of it. zτk,j is the j-th dimension of zτk. s = 0 denotes the unintervened regime.

Then Cau CA in (G, F, PG) is block-identifiable, in the same sense as Thm. 4.5.

Proof. This corollary is a special case of Thm. 4.5 when G has no arrow. Since G has no arrow, all the blocks of interventions are in the first case of block-interventional discrepancy in Thm. 4.5: for all k [K], |τk| = nk. To prove all the blocks satisfy the block-interventional discrepancy, it suffices to prove that the linearly independent vectors in the statement form the matrix Mτk. To see this, it suffices to notice that for all s J0, nk K, ln qs τk(zτk) = P

j τk ln qs τk,j(zτk,j), and therefore zj (ln qs τk,j)(zτk) = (ln qs τk,j) (zτk,j).

B.9 Proof of Prop. 4.9

Proposition 4.9. Under the assumptions of Thm. 4.5, suppose furthermore that all density functions in PG and all mixing functions in F are C2, and suppose there exist k [K] and there are 2nk interventions with targets τk such that for any zτk Rnk, wk (zτk, 1) wk (zτk, 0), . . . , wk (zτk, 2nk) wk (zτk, 0) are linearly independent, where

wk (zτk, s) :=

qs τk,1 qs τk,1

(zτk,1) , . . . , qs τk,nk qsτk,nk

(zτk,nk) , qs τk,1 qs τk,1 (zτk,1) , . . . , qs τk,nk qsτk,nk (zτk,nk)

where qs τk is the intervention of the s-th interventional regime that has the target τk, and qs τk,j is the j-th marginal of it. zτk,j is the j-th dimension of zτk. s = 0 denotes the unintervened regime. Then

φτk Sreparam := g : Rnk Rnk|g = P h where P is a permutation matrix and h Sscaling

The proof is based on Theorem 1 of [21].

Proof. Since the intervention targets are known, without loss of generality, suppose the interventions are on the first nk variables. By the result of Thm. 4.5 we have ps k (zτk) =

qs k (φτk (zτk)) |det Dφτk (zτk)| where ps k denotes the joint distribution in the s-th interventional regime such that the intervention target is τk. Factorize ps k and qs k in the change of variables formula, and take the logarithm:

l=1 ln ps k,l (zl) =

l=1 ln qs k,l (φl (zτk)) + ln |det Dφτk (zτk)| , (30)

where ps l is the l-th marginal of the intervention of the s-th interventional regime that has the target τk.

For the unintervened regime, denote the density of the l-th marginal of P as pl. By the result of Thm. 4.5 we have nk X

l=1 ln p0 l (zl) =

l=1 ln q0 l (φl (zτk)) + ln |det Dφτk (zτk)| . (31)

Subtract the equation (30) by (31):

ln ps k,l (zl) ln p0 l (zl) =

ln qs k,l (φl (zτk)) ln q0 l (φl (zτk)) . (32)

For any j τk = [nk], take the partial derivative over zj

ps k,j (zj) ps k,j (zj) p0 j (zj) p0 j (zj) =

" qs k,l (φl (zτk)) qs k,l (φl (zτk)) q0 k,l (φl (zτk)) q0 k,l (φl (zτk))

# φl zj (zτk) . (33)

For any 1 k < j, take the partial derivative over zk,

" qs k,l qs k,l

q0 k,l q0 k,l

" qs k,l (φl (zτk)) qs k,l (φl (zτk)) q0 k,l (φl (zτk)) q0 k,l (φl (zτk))

# 2φl (zτk)

zk zj . (34)

For 1 k < j nk there are nk(nk 1)

2 equations.

Define al (zτk) = φl zk (zτk) φl

1 k j nk , bl (zτk) = 2φl(zτk)

Then the nk(nk 1)

2 equations can be written as a linear system

l=1 al (zτk)

" qs k,l qs k,l

q0 k,l q0 k,l

" qs k,l (φl (zτk)) qs k,l (φl (zτk)) q0 k,l (φl (zτk)) q0 k,l (φl (zτk))

Define Mk (zτk) = (a1 (zτk) , , ank (zτk) , b1 (zτk) , , bnk (zτk)).

Collect the equations for s = 1, , 2nk, 0 = Mk (zτk) (wk (φτk (zτk) , 1) wk (φτk (zτk) , 0) , . . . , wk (φτk (zτk) , 2nk) wk (φτk (zτk) , 0)) .

By assumption, the matrix containing w is invertible. Thus Mk (zτk) = 0, which implies aℓ(zτk) are zero for all zτk. By the same reasoning as [21], each row in Dφτk (zτk) has only one non-zero term, and this does not change for different z, since otherwise by continuity there exists z such that Dφτk (zτk) is singular, contradiction with the invertibility of φτk (Thm. 4.5). Thus i τk, φi is a function of one coordinate of zτk. Since φτk is invertible, det Dφτk (zτk) = 0, so σ permutation of τk s.t. i τk, φσ(i)

zi (zτk) = 0. Thus φτk Sreparam.

C A Counterexample for Thm. 4.2 (ii) when Asm. 4.1 is violated

One trivial case of violation of Asm. 4.1 is when Pi and e Pi are the same. In that case, the interventional regime i is useless, namely, it does not constrain at all the indeterminacy set. Our counterexample is in a non-trivial case of violation, where the intervened mechanisms are not deterministically related, and do not share symmetries with the causal mechanisms.

We construct a counterexample for Thm. 4.2 (ii) when Asm. 4.1 is violated. The visualization of it is in Fig. 2. The counterexample is similar to the non-Gaussian case in the proof of Prop. 4.7.

Suppose that d = 2, E(G) = and that for all i {1, 2}, Pi has the same density pa,b:

exp( az2) z < 0 1 0 z 1 1

b ) exp( b(z (1 1

b )))2) z > 1 1

b < 1. One can verify that pa, pb are smooth p.d.f.

Suppose that for all i {1, 2}, the intervened mechanism e Pi has the same density pc,d, defined in the same way as (35), such that p π

d < 1, and c, d / {a, b}.

Set λ := min(x,y) {(a,b),(c,d)} 1 1

π y , then over (0, λ)2 all the densities are con-

stant, violating Asm. 4.1.

We construct a measure-preserving automorphism inspired by [19].

z ||z|| R cos(α(||z c|| R))z1 sin(α(||z c|| R))z2 cos(α(||z c|| R))z2 + sin(α(||z c|| R))z1

where c = ( λ

2 ) denotes the center of rotation, R (0, λ

2 ] denotes the radius of the disk, and α = 0.

Now let us prove that φ preserves Pi for all i J0, 2K. By shifting pa,b by c, we only need to prove that the shifted φ preserves the uniform distribution over [ R, R]2. One can verify that φ is a diffeomorphism over the 2-dimensional open disk D2(0, R) \ {0} D2(0, R) \ {0} and | det(φ(z))| = 1. Thus pa(z) = pa(φ(z))| det(φ(z))| z D2(0, R) \ {0}. Since φ = Id outside of the disk, this change of variables formula holds almost everywhere in R2, thus Pi = φ Pi, e Pi = φ e Pi, which implies that φ Pi = Pi i J0, 2K.

Thus for all f F, f Pi = f φ 1

Pi. However, f φ 1 1 f = φ, which is not in Sreparam or Sscaling. Remark C.1. The above example can be easily generalized to any pa,b such that the constants on the plateau are different between Pi and e Pi and the domains of the plateau intersect on a nonzero measure set. For d > 2, the above example can be generalized by constructing the same Pi and e Pi for i = 1, 2, and for i > 2 we fix any Pi and e Pi verifying Asm. 4.1. Then one spurious solution φ is as follows: let φJ1,2K be the same measure-preserving automorphism as in the previous counterexample, and φj = Id for j > 2. Remark C.2. In Cau CA, we suppose the distributions are Markov to a given graph G, but not necessarily faithful to G. This implies that independent components are in PG no matter which graph G is supposed given. Therefore, as long as Asm. 4.1 is not assumed, this counterexample applies to any Cau CA model (G, F, P) with nonlinear F and nonparametric P.

D Identifiability by structure-preserving stochastic interventions

In this section, we extend the result of Thm. 4.2 (i) to the case when we have access to a higher number of imperfect interventions. Here we focus on one special case of imperfect interventions, structure-preserving interventions, i.e., the interventions that do not change the parent set.

Proposition D.1. For Cau CA in (G, F, PG) assume the assumptions in Thm. 4.2 (i) hold. Fix any i [d] such that pa(i) = anc(i), pa(i) = , and define ni := |pa(i)|. If there are ni(ni + 1) structure-preserving interventions on node i such that the variability assumption V i holds, then Cau CA in (G, F, PG) is identifiable up to

SGi = n h C1(Rd) : h(z) = hj(zanc(j))

j [d] | hj(zanc(j)) = hj(zpa(i)) j pa(i) o .

Namely, for all φ SGi, for all the nodes j pa(i), the reconstructed Zj can at most be a mixture of variables corresponding to the nodes in the closure of parents of i, instead of the closure of ancestors of j.

The variability assumption V i means

A1 i (z) B1 i (z) ... ... Ani(ni+1) i (z) Bni(ni+1) i (z)

Rni(ni+1) ni(ni+1)

is invertible, where the symbols are defined as follows:

(gs1,t i hs1 i ) xr1 φi(z) | φpa(i)(z)

(g sk,t i h sk i ) xrm φi(z) | φpa(i)(z)

g sni ,t i h sni i

φi(z) | φpa(i)(z)

gs1,t i hs1 i φi(z) | φpa(i)(z)

... g sni,t i h sni i φi(z) | φpa(i)(z)

where rk, sk are the k-th variable in pa(i), and

gk,t i zi | zpa(i) :=

eqt i zk zi | zpa(i)

eqt i zi | zpa(i) , hk i zi | zpa(i) :=

qi zk zi | zpa(i)

qi zi | zpa(i)

where eqt i denotes the intervened mechanism in t-th interventional regime that has the interventional target i, qi denotes the causal mechanism on Zi.

Proof. Based on the assumption of Thm. 4.2(i), reuse the proof of Thm. 4.2(i) from the equation (21):

ln ept i zi | zpai(i) ln pi zi | zpa(i) = ln eqt i φi(z) | φpai(i)(z) ln qi φi(z) | φpa(i)(z)

where ept i, eqt i denote the intervened mechanism in t-th interventional regime that has the interventional target i. Notice that by the assumption of structure-preserving interventions, pai(i) = pa(i).

Thm. 4.2(i) has already concluded that jφi is constant 0 for all j / anc(i). Now we are interested in jφi j anc(i). Take the partial derivative over zj with j anc(i) \ pa(i) (non-empty by assumption):

P k pa(i) eqt i xk φi(z) | φpa(i)(z) φk

eqt i φi(z) | φpa(i)(z)

k pa(i) qi xk φi(z) | φpa(i)(z) φk

qi φi(z) | φpa(i)(z) (36)

Recall that we define

gk,t i zi | zpa(i) :=

eqt i zk zi | zpa(i)

eqt i zi | zpa(i) , hk i zi | zpa(i) :=

qi zk zi | zpa(i)

qi zi | zpa(i)

So (36) is rewritten as

h gk,t i φi(z) | φpa(i)(z) hk i φi(z) | φpa(i)(z) i jφk(z)

Choose any l pa(i) (non-empty by assumption). Take the partial derivative of zl on two sides:

m pa(i) m gk,t i hk i φi(z) | φpa(i)(z) jφk(z) lφm(z)

+ gk,t i hk i φi(z) | φpa(i)(z) l jφk(z)

which can be rewritten as 0 = At i(z)ai j,l(z) + Bt i(z)bi j,l(z) (37)

where ai j,l(z) =

jφp1(z) lφp1(z) ... jφpk(z) lφpm(z) ... jφpni(z) lφpni(z)

Rn2 i , bi j,l(z) =

l jφp1(z) ... l jφpni(z)

Collect ni (ni + 1) equations, every one corresponding to one interventional regime, in the form of (37) for all t [ni (ni + 1)]:

0 = Mi(z) ai j,l(z) bi j,l(z)

A1 i (z) B1 i (z) ... Ani(ni+1) i (z) Bni(ni+1) i (z)

Rni(ni+1) ni(ni+1) (39)

By assumption of variability V i, Mi(z) is invertible for all z Rd. Thus (38) has a unique solution, which is ai j,l = 0, bi j,l = 0.

ai j,l(z) = 0 implies k, m pa(i), jφk(z) lφm(z) = 0.

Since l pa(i), lφl(z) = 0 is in ai j,l(z), so k pa(i), jφk(z) lφl(z) = 0, which implies jφk(z) = 0. We have proven that for all j anc(i) \ pa(i), for all k pa(i), for all z Rd, jφk(z) = 0. Namely, φk only depends on pa(i). Combining with the result in Thm. 4.2(i), we obtain the conclusion.

E Known vs. unknown intervention targets

In the main paper, for simplicity, we only provided the minimal set of notation required for describing the problem of Cau CA with known intervention targets. However, we believe that a more general version of Cau CA should also be considered for cases where the targets are unknown. In fact, in the following, we will distinguish many problem settings, ranging from totally known targets to totally unknown targets. Each setting may be more or less suited to model a collection of datasets, depending on the amount and kind of prior knowledge available.

In the following, we provide a general framework in which Cau CA with unknown intervention targets can be rigorously formulated. Note that G denotes a DAG such that V (G) = [d], indexed by natural numbers, which correspond to the indices of targets τi of interventions. G denotes instead a DAG

equipped only with a partial order induced by the arrows, namely, V (G) is a set not necessarily indexed by natural numbers. For example, consider V (G) = { cloudy , sprinkle , raining , wet grass }. In this case, the probability distribution P that is Markov relative to this graph is not defined because of the lack of indices: P1 might denote the marginal of cloudy , sprinkle , raining or wet grass . However, P is Markov relative to an indexed DAG of G, denoted G |= G, which denotes that there exists a bijection σ s.t. (u, v) E(G) iff (σ(u), σ(v)) E(G).

Definition E.1. Given two DAGs G |= G and G |= G, an isomorphism from G to G is a bijection σ of V (G) = [d] such that (i, j) E(G) if and only if (σ(i), σ(j)) E(G ). An automorphism of G is an isomorphism G G.

Definition E.2 (Identifiabilities of Causal component analysis, general setting). Given G a partially ordered DAG, F a class of diffeomorphisms, PG a set of distributions such that for every P PG there exists G |= G such that P PG, we define (G, F, PG) as a class of latent CBN (G, f, (Pi, τi)i J0,KK) such that G |= G, f F and P PG.

(i) We define Cau CA with known intervention targets in (G, F, PG) as a class of latent CBN models such that all latent CBN models have the same G and (τi)i J0,KK.

(ii) We define Cau CA with known intervention targets up to graph automorphisms in (G, F, PG) as a class of latent CBN models such that for any two latent CBN models (G, f, (Pi, τi)i J0,KK) and (G, f , (Pi, τ i)i J0,KK) in the class, there exists σ an automorphism of G such that τ i = σ(τi) for all i [K].

(iii) We define Cau CA with matched intervention targets in (G, F, PG) as a class of latent CBN models such that for any two latent CBN models (G, f, (Pi, τi)i J0,KK) and (G , f , (Pi, τ i)i J0,KK) in the class with G, G |= G, there exists a permutation π Sd : V (G) V (G ) such that τ i = π(τi) for all i [K].

(iv) We define Cau CA with unknown intervention targets in (G, F, PG) as a class of latent CBN models that contains all (G, f, (Pi, τi)i J0,KK) such that G |= G, f F and P PG.

We say that the Cau CA in (G, F, PG)6 is identifiable up to S if for any (G, f, (Pi, τi)i J0,KK) and (G , f , (Qi, τ i)i J0,KK) in (G, F, PG), the relation f Pi = f Qi i J0, KK implies that there is h S such that h = f 1 f on the support of P.

Sprinkler Rain

Dataset τ τ (i) τ (ii) τ (iii) τ (iv) D1 C C C S S D2 C C C S R D3 S S R W W D4 R R S C W D5 W W W R W

Figure 5: Representative cases for the Cau CA settings described in Defn. E.2 (i)-(iv). Each row corresponds to a dataset where one perfect intervention is performed on one of the targets: Cloud (C), Sprinkler (S), Rain (R), and Wet grass (W). Each column corresponds to an admissible choice of intervention targets within each of the settings: the discrepancies between the intervention targets τ (i)-τ (iv) and the ground truth targets τ are meant to illustrate different degrees of ignorance about τ across the settings in Defn. E.2 (i)-(iv). In the known intervention targets setting, which we focused on in the main paper, τ (i) is the only possible choice: i.e., the targets must be perfectly aligned with the ground truth τ. For the settings with known intervention targets up to graph automorphisms and with matched intervention targets, τ (ii) and τ (iii) represent admissible choices: identifiability results can be proved for both settings. We also show that in the setting of totally unknown targets, where τ (iv) is an admissible choice, Cau CA is not identifiable (see Remark E.9).

Remark E.3. In this general framework, the dataset Di := {x(j)}Ni j=1 , Ti denotes the nodes in

V (G) ( cloudy , sprinkler etc) instead of nodes τi [d] in V (G) (2).

6Or (G, F, PG) for Cau CA with known intervention targets.

If all d variables are included in the known targets, then there exists a unique bijection σ : V (G) V (G). This implies that for any latent CBN (G, f, (Pi, τi)i J0,KK) in (G, F, PG), τi = σ 1(Ti) i.e. the targets τi are uniquely defined by Ti in each interventional regime. In this case, without loss of generality, we can suppose that i V (G), if i pa(j), then i < j. This can be achieved by rearranging the nodes in the graph and, correspondingly, the coordinates of (Pk, τk) k J0, KK. When the intervention targets are totally unknown, (G, F, PG) only gives us the information of an unordered graph G, and in every interventional regime the candidate latent CBN models that achieve the same likelihood might intervene on totally different variables. In this case, we cannot rearrange the nodes without loss of generality.

Our main paper has shown identifiability results about Defn. E.2 (i). In the following, we generalize Thm. 4.2 to Cau CA with known intervention targets up to graph automorphisms. We also generalize Prop. 4.6 to matched intervention targets in ICA setting. We also prove in Corollary E.8 that Cau CA with unknown intervention targets is not identifiable. An open question is whether Cau CA with a nontrivial graph is identifiable with matched intervention targets. Assumption E.4 (Interventional discrepancy, general version). Given k [K], we say that a stochastic intervention pτk satisfies general interventional discrepancy if for all i [d],

zi | zpa(i) = (ln epτk)

zτk | zpak(τk) almost everywhere (a.e.). (40)

Theorem E.5. For Cau CA with known intervention targets up to graph automorphisms in (G, F, PG),

(i) Suppose for each node in [d], there is one (perfect or structure-preserving) stochastic intervention such that Asm. E.4 is satisfied. Then Cau CA in (G, F, PG) is identifiable up to

SAut G = Pσ h : Rd Rd|σ Aut G, Pσ permutation matrix of σ,

h(z) = hi(zanc(i))

i [d] , h is C1-diffeomorphism o .

(ii) Suppose for each node i in [d], there is one perfect stochastic intervention such that Asm. E.4

is satisfied, then Cau CA in (G, F, PG) is identifiable up to

SG-scaling : = Pσ h|σ Aut G, Pσ permutation matrix of σ, h : Rd Rd,

h(z) = (h1(z1), . . . , hd(zd)) for some hi C1(R, R) with |h i(z)| > 0 z Rd .

Proof. Proof of (i): The proof is based on the proof of Thm. 4.2(i). Consider two latent models achieving the same likelihood across all interventional regimes: G, f, Pi, τi)i J0,d K and G, f , Qi, τ i)i J0,d K . By the definition of latent CBN models with known targets up to graph automorphisms, there exists σ automorphism of G s.t. the targets (τ i)i [d] = (σ(1), , σ(d)). Since (τi)i [d] covers all d nodes in G, by rearranging (τi)i [d] we can suppose without loss of generality that τi = i i [d]. Namely, in the i-th interventional regime,

Pi = e Pi Zj | Zpai(j) Y

j =i P(Zj|Zpa(j)), Qi = e Qσ(i) Zσ(i) | Zpaσ(i)(σ(i)) Y

j =i Q(Zσ(j)|Zpa(σ(j))).

Define φ := f 1 f. Denote its i-th dimension output function as φi : Rd R. We will prove by induction that i [d], j / anc(i), z Rd, φσ(i)

zj (z) = 0.

For any i J0, d K, f Pi = f Qi. Since φ is a diffeomorphism, by the change of variables formula,

pi(z) = qi(φ(z))|detφ(z)|. (41)

For i = 0, factorize pi and qi according to G, then take the logarithm on both sides:

j=1 ln pj zj | zpa(j) =

j=1 ln qj φj(z) | φ(pa(j))(z) + ln | det Dφ(z)|. (42)

For i = 1, e Q1 has no conditionals, and so does Zσ(1). Thus qi is factorized as

q1(z) = eqσ(1) zσ(1) Y

j =σ(1) qj zj | zpa(j) .

So the equation (41) for i = 1 after taking logarithm is

ln ep1 (z1) + X

j =1 ln pj zj | zpa(j) = ln eqσ(1) φσ(1)(z) | φpaσ(1)(σ(1))(z)

j =σ(1) qj φj(z) | φpa(j)(z) + ln |det Dφ(z)| . (43)

Subtract (43) by (42),

ln p1 (z1) ln p1 (z1) = ln qσ(1) φσ(1)(z) ln qσ(1) φσ(1)(z) . (44)

For any i = 1, take the i-th partial derivative of both sides:

" eq σ(1) φσ(1)(z)

eqσ(1) φσ(1)(z) q σ(1) φσ(1)(z)

qσ(1) φσ(1)(z)

By Asm. E.4, the term in the parenthesis is non-zero a.e. in Rd. Thus φσ(1)

zi (z) = 0 a.e. in Rd. Since

φ = f 1 f where f, f are C1-diffeomorphisms, so is φ. φσ(1)

zi is continuous and thus equals zero everywhere.

Now suppose k [i 1], j / anc(k), φσ(k)

zj (z) = 0, z Rd. Then for interventional regime i,

ln pi zi | zpai(i) + X

j =i ln pj zj | zpaj(j) = ln qσ(i) φσ(i)(z) | φpaσ(i)(σ(i))(z)

j =i qσ(j) φσ(j)(z) | φpa(σ(i))(z) + ln | det Dφ(z)|,

subtracted by (42),

ln pi zi | zpai(i) ln pi zi | zpa(i) = ln qσ(i) φσ(i)(z) | φpaσ(i)(σ(i))(z)

ln qσ(i) φσ(i)(z) | φpa(σ(i))(z) .

For any j / anc(i), j / pa(i) pai(i) by assumption. Take partial derivative over zj:

k paσ(i)(σ(i)) eqσ(i)

xk φσ(i)(z) | φpaσ(i)(σ(i))(z) φk

eqσ(i) φσ(i)(z) | φpaσ(i)(σ(i))(z)

k pa(σ(i)) qσ(i)

xk φσ(i)(z) | φpa(σ(i))(z) φk

qσ(i) φσ(i)(z) | φpa(σ(i))(z) .

Now prove that Gσ(i) = σ(Gi):

Gσ(i) is obtained by either a perfect stochastic or a structure-preserving stochastic intervention. For a structure-preserving stochastic intervention, G = Gσ(i) = σ(Gi). For a perfect stochastic intervention, since τ i = σ(τi), in Gσ(i) only the arrows towards σ(τi) are deleted, which correspond to deleting arrows towards τi in Gi.

Thus Gσ(i) = σ(Gi). Thus paσ(i)(σ(i)) = paσ(i)(σ(i)) = σ(pai(i)), the last equality by the definition of automorphism σ, σ(pa(i)) = pa(σ(i)).

The assumption of induction says k [i 1] j / anc(k) φσ(k)

zj (z) = 0 z Rd. For

all k pa(i) pai(i), since j / anc(i), j / anc(k) as well. By induction, φσ(k)

zj (z) = 0. So the second sum of the right-hand side of (45) can be canceled except for k = σ(i). Also, pa(σ(i)) = σ(pa(i)), paσ(i)(σ(i)) = σ(pai(i)). By the same assumption in the current induction, for all l paσ(i)(σ(i)) = σ(pai(i)), φl

zj (z) = 0. So the first sum of the right-hand side of (45) can be canceled except for k = σ(i).

The equation rewrites after deleting the partial derivatives that are zero:

eqσ(i) φσ(i) φσ(i)(z) | φpaσ(i)(σ(i))(z)

eqσ(i) φσ(i)(z) | φpaσ(i)(σ(i))(z)

qσ(i) φσ(i) φσ(i)(z) | φpa(σ(i))(z)

qσ(i) φσ(i)(z) | φpa(σ(i))(z)

By Asm. E.4, the term in parenthesis is nonzero a.e., thus φσ(i)

zj (z) = 0 a.e. Since φσ(i)

zj is continuous, it equals zero everywhere.

The induction is finished when i = d. We have proven that i [d], j / anc(i), z Rd, φσ(i)

zj (z) = 0, i.e. P 1 σ Dφ(z)

ij = (Dφ(z))σ(i)j = φσ(i)

Thus P 1 σ φ S G. φ SAut G.

Proof of (ii):

By the result of (i), ψ := f 1 f SAut G, thus there exists a permutation matrix Pσ s.t. P 1 σ ψ S G Denote φ = P 1 σ ψ. Apply Thm. 4.2(ii), we can prove that φ Sscaling. Thus ψ = Pσφ SG-scaling.

Proposition E.6. Suppose that G is the empty graph and that there are d 1 variables intervened on, with one single target per dataset, satisfying Asm. E.4. Then ICA with matched intervention targets in (G, F, PG) with single-node interventions (Defn. E.2) is identifiable up to

Sreparam := g : Rd Rd|g = P h where P is a permutation matrix and h Sscaling

Proof. For an empty graph G, Aut G = Sd. Thus ICA with matched intervention targets is the same as known intervention targets up to graph automorphisms. Consider two latent models achieving the same likelihood across all interventional regimes: G, f, Pi, τi)i J0,d K and G, f , Qi, τ i)i J0,d K . By the definition of latent CBN models with known targets up to graph automorphisms, there exists σ automorphism of G s.t. the targets (τ i)i [d] = (σ(τ1), , σ(τd)).

Apply Prop. 4.6 to G, f, Pi, τi

and G, f Pσ, Qσ 1(i), τi

, then φ :=

(f Pσ) 1 f Sscaling. Then for G, f, Pi, τi

and G, f , Qi, τ i

f 1 f = Pσ φ Sreparam.

Remark E.7. The setting of Cau CA with matched intervention targets is similar to one scenario studied in causal representation learning. See, e.g., von Kügelgen et al. [55, Thm. 3.4]: the constraint on intervention targets expressed in their Asm. (A2 ) can be rephrased within our framework as the requirement that for any two latent CBN models, there exists a permutation π : V (G) V (G ) such that the intervention targets τ i,1 = τ i,2 = π(τi,1) = π(τi,2), where τi,1 denotes the unknown target of the (i, 1)-indexed interventional regime. Subsequent work by Varıcı et al. [53] also studied the setting with coupled environments.7

7Varıcı et al. [53] additionally introduced a setting with uncoupled environments [53, Def. 2], relaxing the requirement of matched interventions. Note that the setting with uncoupled environments assumes strictly more information than Defn. E.2 (iv), since it requires that for each latent CBN model (G, f, (Pi, τi)i J0,2d K), for each node j V (G), there exist k, l J1, 2d K such that τk = τl = {j}.

Corollary E.8. [Corollary of Prop. 4.7] Given a DAG G, Cau CA with unknown intervention targets is not identifiable in (G, F, PG) up to Sreparam.

Proof. Fix any G |= G and f. Without loss of generality, suppose K > d 2. By Prop. 4.7, there exist G, f , Qi, τi

such that for all i J0, d 2K, f Pi = f Qi and φ := f 1 f / Sreparam.

Notice that for all i J0, d 2K, Pi and Qi are independent components and therefore Markov relative to G. Since the targets are unknown, we can construct a spurious solution as follows: for all i Jd 1, KK, set τi = {1}, choose Pi to be any distribution that is composed of independent components, and set Qi := φ Pi. Then for all i J0, KK, f Pi = f Qi and φ := f 1 f / Sreparam.

Remark E.9. Although the proof above is almost trivial, it clarifies that a minimal requirement for identifiability up to permutation and scaling, both in ICA and in CRL, is that each latent variable must be intervened on at least once.

Relaxing the assumption of a known causal graph. Since we only require the distributions Pk in Defn. 3.2 to be Markov to G (resp., Qk to G ), our theory suggests that it should be possible to identify latent variables up to the ambiguities characterised in our results even when the learned model assumes a denser DAG than the true one: in particular, a fully connected graph would always be a valid choice, thus partially relaxing the assumption of a known graph. A thorough characterisation of the performance of Cau CA in this misspecified setting is left for future work.

F Relationship between Cau CA and nonlinear ICA

ICA is a special case of Cau CA. One way to show that ICA is a special case of Cau CA is the following. In Cau CA, we suppose the distributions are Markov to a given graph G, but not necessarily faithful to G. This implies that any Cau CA model (G, F, PG) may allow independent components: this holds even for models with a non-empty graph (in this case, the distribution would be unfaithful to G). Since the distributions with independent components are a small subset within the set of distributions which are Markov to a non-trivial graph G, it follows that Cau CA is a strictly more general, and harder, problem than ICA, and no trivial reduction of Cau CA to ICA is possible.

Comparison to the results of Hyvärinen et al. [21]. For the special case of Cau CA where there are no edges in the causal graph (i.e., ICA), Hyvärinen et al. [21, Thm.1] proved an identifiability result based on (in our terminology) 2d interventional datasets with unknown intervention targets, plus one unintervened dataset. While our work takes inspiration from [21], our theoretical analysis presents several differences.

Our identifiability results can be compared with [21, Thm.1] in two cases:

(i) Trivial graph:

Previous work: A core step in the proof of Thm. 1 of [21] is eq. (20,21), which is essentially the same as (33), (34) in our work. We will refer to the proof of our Prop. 4.9 in the following (since the proof of this proposition specifically is similar to the one of Thm.1 of [21]). The proof proceeds then as follows: we take twice the partial derivatives of (32) i.e., we calculate the Hessian matrix of the two sides. We then obtain a system of 2d equations, and identifiability corresponds to the uniqueness of the solution of the system 0 = Ax , where A is in R2d 2d. So A needs to be invertible i.e., the variability assumption in [21, Thm.1]. Our work: Our Prop. 4.6 and Corollary 4.8 show that, if the intervention targets are known, we can provide an identifiability proof which only requires the Jacobian of the above equation to be invertible (instead of Hessian, as in the previous results). The functions in F can thus be C1, instead of C2, and the linear system will be of d equations instead of 2d. In short, exploiting knowledge on the intervention targets allows us to provide a different proof, even for the special case of ICA.

In Tab. 2, we summarize our theoretical results for nonlinear ICA and compare them to [21, Thm.1].

(ii) Nontrivial graph:

As we mentioned in the previous paragraph, Cau CA is a strictly more general problem than ICA. Our proof of Thm. 4.2 is therefore even farther from all ICA results. A peculiarity of ICA is that, after

Table 2: Overview of ICA identifiability results. For the trivial DAG over Z1, Z2, Z3 (i.e., no edges), we summarise our identifiability results ( 4.2) for representations learnt by maximizing the likelihoods Pk θ(X) based on multiple interventional regimes. π[ ] denotes a permutation.

Requirement of interventions Learned representation ˆz = ˆf 1(x) Reference

1 intervention on any two nodes respectively with Asm. 4.1 [h1(z1), h2(z2), h3(z3)] Prop. 4.6

1 intervention on z1 and 2 fat-hand interventions on (z2, z3) with Asm. 4.4 [h1(z1), h2(z2, z3), h3(z2, z3)] Corollary 4.8

1 intervention on z1 and 4 fat-hand interventions on (z2, z3) with variability assumption h h1(z1), π[h2(z2), h3(z3)] i Prop. 4.9

1 intervention per node on any two nodes respectively with unknown order ( matched intervention target , see Defn. E.2) with Asm. E.4

π [h1(z1), h2(z2), h3(z3)] Prop. E.6

6 fat-hand interventions on (z1, z2, z3) with variability assumption π [h1(z1), h2(z2), h3(z3)] [21][Thm. 1]

taking the Hessian, the left-hand side of (34) vanishes: this is exploited in the proof of [21, Thm.1]. All the proof techniques of nonlinear ICA papers we are aware of rely on this vanishing left-hand side over all coordinates of z. Unfortunately, with a non-trivial graph, it is impossible to get the same after taking the partial derivatives of every coordinate of z: in fact, for each i [d], the left-hand side of the i-th equation depends on the ancestors of zi. Our proofs for the case with a nontrivial graph (Thm. 4.2 and Thm. 4.5) therefore follow different strategies, based on first derivatives alone.

G Additional theoretical results used in the design of experiments

G.1 Multi-objective and pooled objective

Our identifiability theory implies that the ground-truth latent CBN could be learned by maximizing likelihood across all interventional regimes, i.e.,

k=0 arg max θ 1 Nk

n=1 log pk θ(x(n,k)),

where log pk θ(x(n,k)) is defined in (10). This is a multi-objective optimization problem.

In general, multi-objective optimization is hard; however, we show that, because of our assumptions, we can equivalently optimize a pooled objective. In fact, suppose

fk(θ) = log pk θ(x), f(θ) =

Define Θk := arg maxθ fk(θ). By our problem setting we know Td k=0 Θk = .

On the one hand, for all ˆθ Td k=0 Θi, ˆθ arg maxθ f(θ). On the other hand, we will prove by contradiction that ˆθ arg maxθ f(θ), ˆθ Td k=0 Θk. Suppose there exists i J0, d K such that ˆθ / Θi. Then for all θ Td k=0 Θk, fi(ˆθ) < fi(θ ). Thus f(ˆθ) = Pd k=0 fk(ˆθ) < Pd k=0 fk(θ ) = f(θ ). This yields a contradiction with ˆθ arg maxθ f(θ).

We thus conclude that Td k=0 arg maxθ fk(θ) = arg maxθ f(θ).

G.2 Expressivity in multi-intervention learning

To learn the latent CBN, we need to learn both the latent distributions (both the unintervened causal mechanisms and the intervened ones) and mixing functions. For the latent distributions, a natural

question is whether any of them could be fixed without loss of generality. This is for example possible in the context of nonlinear ICA, where due to the indeterminacy up to element-wise nonlinear scaling of the latent variables, we can arbitrarily fix their univariate distributions w.l.o.g. The following proposition elucidates this matter for Cau CA.

Proposition G.1. For a ground truth latent CBN, G, f, Pk, τk

where Pk, τk

k J1,d K are obtained by perfect stochastic intervention on every variable respectively, fix at most one element for each k in {Qk|pa(k) = } {e Qk|k [d]}, there exists G, f , Qk, τ k

s.t. f Pk =

f Qk, f 1 f Sscaling.

In practice, this means that in order to learn f and (Qk)k [d] with d perfect stochastic interventions, even if we fix all the intervention mechanisms (e Qk)k [d], we can still learn a true latent model up to scaling functions. Equivalently, we could also fix instead {Qk|pa(k) = } {e Qk|pa(k) = }.

Proof. For any k [d], Pk = e Pk Q

j =k Pj. Without loss of generality by exchanging Qk with e Qk, suppose we are given f Qk, by Lemma B.2, there are two possible diffeomorphisms for Tk st. (Tk) f Pk = f Qk. Choose one of them arbitrarily. Set T := (Tk)k [d].

Define Q0 := T P0. By Lemma B.3, i [d], zi R, zpa(i) R#pa(i),

pi zi | zpa(i) = qi Ti (zi) | Tpa(i) zpa(i) |T i (zi)| .

Multiply the causal mechanisms of all i [d] \ {k} and the intervened mechanism of k, we get

i =k pi zi | zpa(i) = eqk (Tk (zk)) |T k(zk)| Y

i =k qi Ti (zi) | Tpa(i) zpa(i) |T i (zi)| ,

which is equivalent to epk(z) = eqk(T (z)) |det T(z)|. This is the change of variable formula of the diffeomorphism T in Rd, and implies Qk = T Pk.

Define f := T f 1, then f 1 f = T Sscaling.

Remark If G is non trivial, given G, f, P0 , in general there does not exist Q0 PG, f F s.t. f P0 = f Q0, f 1 f Sscaling.

To see why, consider the graph z1 z2. Fix Q1, there are only two possible diffeomorphisms for T1 s.t. Q1 = (T1) P1 by Lemma B.2.

If Q2 (Z2 | T1(z1)) is fixed, to find T2 s.t.

(T2) P2 (Z2 | z1) = Q2 (Z2 | T1(z1)) z1 R,

by Lemma B.2 , the only possible T2 are G ( | T1 (z1)) 1 F ( | z1) and G ( | T1 (z1)) 1 F ( | z1). In general, these two functions depend on z1. For example if P1 (Z1) P2 (Z2), Q2 (Z2 | z1) N (z1, 1) then T2 depend on z1. Namely, There is no T Sscaling s.t. Q0 = T P0.

G.3 Normalizing flows for nonparametric CBN

In App. I, we learn a nonlinear mixing function and nonparametric causal mechanisms (pϕ i (zi | zpa(i)))i [d]. We model pϕ by a normalizing flow hϕ : Rd Rd from the space of Z to the space of U, while fixing Pu. The goal is to learn hϕ 1, the reduced form of an SCM which pushes forward the exogenous variables U to every causal mechanism (pϕ i (zi | zpa(i)))i [d]. (hϕ is denoted as h CBN ϕ in App. I). In the following, we prove that such hϕ exists in a special case.

Proposition G.2. For any CBN {Pϕ i (Zi|Zpa(i))}i [d] Markov and faithful to a fully connected DAG G, with Pϕ absolutely continuous and smooth in Rd, and for any independent component joint distribution Pu absolutely continuous and smooth in Rd, there exists a normalizing flow hϕ : Rd Rd such that

there exist a permutation matrix P and an autoregressive normalizing flow h such that hϕ = P h (autoregressive in the sense that D h(z) is lower triangular for all z Rd);

hϕ Pϕ = Pu;

for all i [d], pϕ i (zi|zpa(i)) = hϕ i zi (z) pui hϕ i (z) .

Proof. Without loss of generality by applying a permutation matrix P, we can suppose that the index of G preserves the partial order induced by E(G): i < j if (i, j) E(G). In the following, we can thus replace h by hϕ. Since Pϕ is faithful to a fully connected graph, Zpa(i) = Z<i.

By [40] [Sec 2.2], there exists a Darmois transformation F : Rd (0, 1)d s.t. F Pϕ = Unif(0, 1)d, and a Darmois transformation G : Rd (0, 1)d s.t. G Pu = Unif(0, 1)d. Define hϕ := G 1 F, then hϕ Pϕ = Pu.

Notice that the Darmois transformations F and G are autoregressive normalizing flows, i.e. DF(z) and DG(z) are lower triangular for all z Rd. Since the autoregressive normalizing flows are closed under composition and inverse, D(hϕ)(z) is also lower triangular for all z Rd. By induction using Lemma B.4, it is direct to prove that for all i [d], hϕ [i] is a diffeomorphism from the first i variables of Z to the first i variables of U.

By the definition of hϕ, Pu = hϕ Pϕ. By the change of variable formula, pϕ(z) = det Dhϕ(z) pu hϕ(z) .

We now prove by induction that for all i [d], pϕ i (zi|zpa(i)) = hϕ i zi (z) pui hϕ i (z) . For

i = 1, by the assumption of indices in G, z1 has no parent. Since hϕ 1 is a diffeomorphism,

pϕ 1 (z1) = hϕ 1 z1 (z1) pu1 hϕ 1(z1) .

Suppose the property is true for i:

pϕ i (zi|zpa(i)) =

hϕ i zi (z)

pui hϕ i (z) (46)

Then for i + 1, since (hϕ)[i+1] is a diffeomorphism with lower triangular Jacobian, we have

j=1 pϕ j (zj|zpa(j)) = det Dhϕ [i+1](z) pu[i+1]

hϕ [i+1](z) =

hϕ j zj (z)

puj hϕ j (z) (47)

where the last equality is because Dhϕ is lower triangular and Pu is independent components. By dividing eq. (47) by eq. (46), we obtain the result.

In the method above, we assume that the CBN is Markov and faithful to a fully connected graph. Devising a more general method for sparse graphs is left for future work.

We note that the concurrent work by [23] may also be used to learn nonparametric latent CBNs. One main difference is that in our normalizing flow, every coordinate of hϕ 1 models separately a causal mechanism, which is not the case in [23]. In our objective function (10), we need to model and learn every causal mechanism individually.

H Details Experiments

H.1 Synthetic Data Generation

Directed Acyclic Graph (DAG). In order to generate data, we begin by sampling a random DAG G QG, where QG is a distribution over DAGs. The edge density of the DAG is set to 0.5 in topological order, meaning that the edges in the DAG are constrained to follow the variable index order: an edge Zi Zj can only exist if j > i. To construct the DAG, we individually sample each

potential edge Zi Zj with j > i with a probability of 0.5, while all other edges are assigned a probability of 0. For experiments conducted in the Cau CA setting with non-trivial graphs (i.e., not empty), we reject and redraw any sampled DAGs that do not contain any edges.

Causal Bayesian network (CBN). To sample data from a CBN, we start by drawing the parameters of a linear Gaussian Structural Causal Model (SCM) with additive noise:

j pa(i) αi,j Zj + εi, (48)

where the linear parameters are drawn from a uniform distribution αi,j Uniform( a, a) and the noise variable is Gaussian εi N(0, 1). The signal-to-noise ratio for the SCM, denoted as a/std(εi) = a, describes the strength of the dependence of the causal variables relative to the exogenous noise. For most experiments, the signal-to-noise ratio is set to 1. In Fig. 4 (g), we explore values ranging from 1 to 10. To specify the latent CBNs, we can define the conditional distributions entailed by the SCMs defined as in eq. (48), see also App. H.2 and eq. (52).

Generating interventional datasets. For a given CBN, we generate d + 1 related datasets: one unintervened (observational) dataset and d interventional datasets, where the CBN was modified by a perfect stochastic intervention on one variable (i.e., one dataset for each variable in the CBN). W.l.o.g. we assume that the kth variable was intervened on in dataset k; k = 0 denotes the observational dataset. The intervention is applied in the SCM by removing the influence of the parent variables and changing the exogenous noise by shifting its mean up or down. Hence, for dataset k we have

Zk := εk, with εk N(µ, 1), (49)

where the mean of the noise µ is uniformly sampled from { 2} and fixed within each dataset. Each dataset comprises a total of 200, 000 data points, resulting in (d + 1) 200, 000 data points in total for each CBN.

Mixing function. The mixing function takes the form of a multilayer perceptron f = σ AM . . . σ A1, where Am Rd d for m J1, MK denote invertible linear maps, and σ is an elementwise invertible nonlinear function. The elements of the linear maps are sampled independently (Am)i,j Uniform(0, 1) for i, j J1, d K. A sampled matrix Am is rejected and re-drawn if | det Am| < 0.1 to rule out linear maps that are (close to) singular. The invertible element-wise nonlinearity is a leaky-tanh activation function:

σ(x) = tanh(x) + 0.1x, (50)

as used in [12].

H.2 Model architecture

Normalizing flows. We use normalizing flows [40] to learn an encoder gθ : Rd Rd. Normalizing flows model observations x as the result of an invertible, differentiable transformation gθ on some base variables z, x = gθ(z). (51)

We apply a series of L = 12 such transformations gl θl : Rd Rd, which we refer to as flow layers, such that the resulting transformation is given by gθ = g L θL . . . g1 θ1. We use Neural Spline Flows [10] for the invertible transformation, with a 3-layer feedforward neural network with hidden dimension 128 and a permutation in each flow layer.

Base distribution. We extend the typically used simple base distributions to encode information about the CBN. We have one distribution per dataset (ˆpk θk CBN)k J0,d K over the learned base noise variables z. The conditional density of latent variable i in dataset k is given by

ˆpk θk CBN(zi | zpa(i)) = N

j pa(i) ˆαi,jzj, ˆσi

when i = k, i.e., when variable i is not intervened on. For i = k, we have

ˆpk θk CBN(zi) = N(ˆµk i , ˆσk i ). (53)

In summary, the base distribution parameters are the parameters of the linear relationships between parents and children in the CBN (ˆαi,j)i,j J1,d K, the standard deviations for each variable in the observational setting (ˆσi)i J1,d K, and mean and standard deviation for the intervened variable in each dataset (ˆµk i , ˆσk i )i,k J1,d K.

Linear baseline model. For the linear baseline models shown in Fig. 4 (a, b, e, f) we replace the nonlinear transformations (gl θl)l J1,LK by a single linear transformation. The base distribution stays the same.

Graph-misspecified model. In order to test the impact of providing knowledge about the causal structure, we compare the Cau CA model to one that assumes the latents are independent. This is achieved by setting ˆαi,j = 0 i, j J1, d K in the base distribution (52).

H.3 Training and model selection

Training parameters. We use the ADAM optimizer [27] with cosine annealing learning rate scheduling, starting with a learning rate of 5 10 3 and ending with 1 10 7. We train the model for 50 200 epochs with a batch size of 4096. The number of epochs was tuned manually for each type of experiment to ensure reliable convergence of the validation log probability.

Pooled objective. The learning objective described in 5 is using the pooled rather than the multi-objective formulation. In App. G, we prove that for our problem the two are equivalent.

Fixing CBN parameters. As explained in Prop. G.1, we can fix some of the CBN parameters w.l.o.g. In our case, we fix the noise parameters for intervened mechanisms of non-root variables and observational mechanisms of root variables.

Model selection. For each drawn latent CBN, we train three models with different initializations and select the model with the highest validation log probability at the end of training.

Compute. Each training run takes 2 8 hours on NVIDIA RTX-6000 gpus. For the experiments shown in the main paper, we performed 450 training runs (30 per violin plot / point in Fig. 4 (g)) which sums up to around 2250 compute hours (assuming an average run time of 5 hours).

I Nonparametric Experiments

The main experiments in 5 were made under the restriction that the ground-truth CBN and the learned latent CBN were assumed to be linear Gaussian. In the experiments shown here, we relax those restrictions. In the following, we describe the differences with respect to the main experiments. All other settings and parameters are the same as described in 5 and App. H.

I.1 Synthetic Data Generation

Causal Bayesian network (CBN). To sample data from a CBN, we draw the parameters of a nonlinear non-additive-noise Structural Causal Model (SCM): Zi := hloc(Zpa(i)) + hscale(Zpa(i)) εi, (54) where the location and scale functions hloc, hscale : R#pa(i) R are parameterized by random 3-layer neural networks (similar to the random mixing function) and the noise variable is Gaussian εi N(0, 1).

I.2 Model architecture

Base distribution. We extend the typically used simple base distributions to encode information about the CBN. We train a second normalizing flow to learn a mapping h CBN ϕ : Rd Rd from the latent causal variables z to exogenous noise variables u, which we explain in App. G.3. We encode knowledge about the causal graph in the base distribution by passing the latent variables in causal order to the invertible transformation h CBN, whose Jacobian is lower triangular. This ensures the correct causal relationships among the elements of z. During training, we fix the distribution of u and the intervened upon latent variables, i.e. zk for dataset k, to be standard normal. We use a similar architecture based on Neural Spline Flows, with one difference: we omit the permutation layer, which would violate the topological order of the variables.

Cau CA (loc. scale)

Cau CA (lin. SCM)

linear i.i.d.

Figure 6: Experimental results with nonparametric model. The figure shows the mean correlation coefficients (MCC) between true and learned latents for Causal Component Analysis (Cau CA) experiments. The first two violin plots show the fully nonparametric model when the ground truth latent CBN is generated by a locationscale model and for the case with a linear SCM generating the ground truth CBN. Misspecified models assuming a linear encoder function class and a naive (single-environment) unidentifiable normalizing flow with an independent Gaussian base distribution (labelled i.i.d.) are compared. The misspecified models are trained on a location-scale CBN. All violin plots show the distribution of outcomes for 10 pairs of CBNs and mixing functions.

I.3 Results

In line with the experiments presented in 5, we compare the nonparametric model (blue) to misspecified baselines: one with a correctly specified nonparametric latent model but employing a linear encoder (red), and a model with a nonlinear encoder but assuming a causal graph with no arrows (orange). The results shown in Fig. 6 show that the nonparametric model accurately identifies the latent variables, benefiting from both the nonlinear encoder and the explicit modelling of causal dependence. When we compare the nonparametric model trained on the location-scale data generating process (left blue violin) to one trained on the linear Gaussian process (right blue violin), we observe that in the location-scale case it seems to be more challenging for the model to uncover the true latent factors.