# Interventional Causal Representation Learning

Kartik Ahuja¹, Divyat Mahajan², Yixin Wang³, Yoshua Bengio²⁴

¹FAIR (Meta AI)  ²Mila-Quebec AI Institute, Université de Montréal  ³University of Michigan  ⁴CIFAR Senior Fellow and CIFAR AI Chair. Correspondence to: Kartik Ahuja.

*Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).*

**Abstract.** Causal representation learning seeks to extract high-level latent factors from low-level sensory data. Most existing methods rely on observational data and structural assumptions (e.g., conditional independence) to identify the latent factors. However, interventional data is prevalent across applications. Can interventional data facilitate causal representation learning? We explore this question in this paper. The key observation is that interventional data often carries geometric signatures of the latent factors' support (i.e., the values each latent can possibly take). For example, when the latent factors are causally connected, interventions can break the dependency between the support of an intervened latent and the support of its ancestors. Leveraging this fact, we prove that the latent causal factors can be identified up to permutation and scaling given data from perfect do interventions. Moreover, we can achieve block affine identification, namely the estimated latent factors are entangled with only a few other latents, if we have access to data from imperfect interventions. These results highlight the unique power of interventional data in causal representation learning: it enables provable identification of latent factors without any assumptions about their distributions or dependency structure.

*Figure 1. Panels: (a) observational data; (b) do-interventional data; (c) an imperfect intervention that induces independent support; (d) an imperfect intervention without independent support. (a) Observational data: the support of the child ($z_2$) conditional on the parent ($z_1$) varies with the value of the parent. (b), (c): the support of the child conditional on the parent under a do intervention, a perfect intervention, and many imperfect interventions is independent of the parent. (d): an intervention on the child reduces the impact of the parent on it, which causes the support of the child conditional on the parent to take a larger set of values.*

## 1. Introduction

Modern deep learning models like GPT-3 (Brown et al., 2020) and CLIP (Radford et al., 2021) are remarkable representation learners (Bengio et al., 2013). Despite these successes, such models remain far from the human ability to adapt to new situations (distribution shifts) or to carry out new tasks (Geirhos et al., 2020; Bommasani et al., 2021; Yamada et al., 2022). Humans encapsulate their causal knowledge of the world in a highly reusable and recomposable way (Goyal & Bengio, 2020), enabling them to adapt to new tasks in an ever-distribution-shifting world. How can we empower modern deep learning models with this type of causal understanding? This question is central to the emerging field of causal representation learning (Schölkopf et al., 2021). A core task in causal representation learning is provable representation identification, i.e., developing representation learning algorithms that can provably identify natural latent factors (e.g., the location, shape, and color of different objects in a scene).
While provable representation identification is known to be impossible for an arbitrary data-generating process (DGP) (Hyvärinen & Pajunen, 1999; Locatello et al., 2019), real data often exhibits additional structure. For example, Hyvärinen et al. (2019) and Khemakhem et al. (2022) consider conditional independence between the latents given auxiliary information; Lachapelle et al. (2022) leverage the sparsity of the causal connections among the latents; Locatello et al. (2020), Klindt et al. (2020), and Ahuja et al. (2022a) rely on sparse variation in the latents over time. Most existing works rely on observational data and make assumptions on the dependency structure of the latents to achieve provable representation identification. However, in many applications, such as robotics and genomics, there is a wealth of interventional data available. For example, interventional data can be obtained from experiments such as genetic perturbations (Dixit et al., 2016) and electrical stimulations (Nejatbakhsh et al., 2021). Can interventional data help identify latent factors in causal representation learning? How can it help? We explore these questions in this work. The key observation is that interventional data often carries geometric signatures of the latent factors' support (i.e., the values each latent can possibly take). Fig. 1 illustrates these geometric signatures: perfect interventions and many imperfect interventions make the support of the intervened latent independent of the support of its ancestors. As we will show, these geometric signatures go a long way in facilitating provable representation identification in the absence of strong distributional assumptions.

**Contributions.** This work establishes representation identification guarantees without strong distributional assumptions on the latents in the following settings.

*do interventions.* We first investigate scenarios where the true latent factors are mapped to high-dimensional observations through a finite-degree multivariate polynomial. When some latent dimension undergoes a hard do intervention (Pearl, 2009), we are able to identify it up to shift and scaling. Even when the mapping is not a polynomial, approximate identification of the intervened latent is still achievable provided we have data from multiple do-interventional distributions on the same latent dimension.

*Perfect & imperfect interventions.* We achieve block affine identification under imperfect interventions (Peters et al., 2017) provided the support of the intervened latent is rendered independent of its ancestors under the intervention, as shown in Figure 1c. This result covers all perfect interventions as a special case.

*Observational data and independent support.* The independence-of-support condition above can further facilitate representation identification with observational data. We show that, if the supports of the latents are already independent in observational data, then these latents can be identified up to permutation, shift, and scaling, without the need for any interventional data. This result extends the classical identifiability results from linear independent component analysis (ICA) (Comon, 1994) to allow for dependent latent variables. It also provides theoretical justification for recent proposals of performing unsupervised disentanglement through the independent support condition (Wang & Jordan, 2021; Roth et al., 2022).

We summarize our results in Table 1.
Finally, we empirically demonstrate the practical utility of our theory. For data-generating mechanisms ranging from polynomials to image generation with a rendering engine (Shinners, 2011), we show that interventional data helps identification. The code repository can be accessed at github.com/facebookresearch/CausalRepID.

## 2. Related Work

Existing provable representation identification approaches often utilize structure in time-series data, as in the initial works of Hyvärinen & Morioka (2016; 2017). More recent studies have expanded on this approach, such as Hälvä & Hyvärinen (2020); Yao et al. (2021; 2022a;b); Lippe et al. (2022b;a); Lachapelle et al. (2022). Other forms of weak supervision, such as data augmentations, can also be used for representation identification, as in the works of Zimmermann et al. (2021); Von Kügelgen et al. (2021); Brehmer et al. (2022); Locatello et al. (2020); Ahuja et al. (2022a), which assume access to contrastive pairs of observations. A third approach, used by Khemakhem et al. (2022; 2020), relies on high-dimensional observations (e.g., an image) together with auxiliary information (e.g., a label) to identify representations.

To understand the factual and counterfactual knowledge used by different works on representation identification, we can classify them according to Pearl's ladder of causation (Bareinboim et al., 2022). In particular, our work operates with interventional data (level-two knowledge), while other studies leverage either observational data (level-one knowledge) or counterfactual data (level-three knowledge). Works such as Khemakhem et al. (2022; 2020); Ahuja et al. (2022b); Hyvärinen & Morioka (2016; 2017); Ahuja et al. (2021) use observational data and either make assumptions on the structure of the underlying causal graph of the latents or rely on auxiliary information. In contrast, works like Brehmer et al. (2022) use counterfactual knowledge to achieve identification for general DAG structures, and Lippe et al. (2022b;a); Ahuja et al. (2022a); Lachapelle et al. (2022) use pre- and post-intervention observations to achieve provable representation identification. These latter studies use instance-level temporal interventions that carry much more information than the interventional distribution alone. To summarize, these works require more information than is available with level-two data in the Pearlian ladder of causation. Finally, concurrent work by Seigal et al. (2022) also studies identification of causal representations using interventional distributions. The authors focus on linear mixing of the latents and consider perfect interventions. In contrast, our results consider nonlinear mixing functions and imperfect interventions.

Table 1. Summary of results. Existing works such as iVAE (Khemakhem et al., 2022) use observational data and make assumptions on the graphical model of the latents to achieve identification. In contrast, we use interventional data and make no assumptions on the graph.

| Input data | Assumption on $Z$ | Assumption on $g$ | Identification |
|---|---|---|---|
| Obs | $Z_r \perp Z_s \mid U$, $U$ aux. info. | Diffeomorphic | Perm. & scale (Khemakhem et al., 2020) |
| Obs | Non-empty interior | Injective polynomial | Affine (Theorem 4.4) |
| Obs | Non-empty interior | Injective polynomial | Affine (Theorem A.8) |
| Obs | Independent support | Injective polynomial | Perm., shift, & scale (Theorem 6.3) |
| Obs + do intervn. | Non-empty interior | Injective polynomial | Perm., shift, & scale (Theorem 5.3) |
| Obs + do intervn. | Non-empty interior | Diffeomorphic | Perm. & comp.-wise (Theorem A.12) |
| Obs + perfect intervn. | Non-empty interior | Injective polynomial | Block affine (Theorem 5.8) |
| Obs + imperfect intervn. | Partially indep. support | Injective polynomial | Block affine (Theorem 5.8) |
| Counterfactual | Bijection w.r.t. noise | Diffeomorphic | Perm. & comp.-wise (Brehmer et al., 2022) |
## 3. Setup: Causal Representation Learning

Causal representation learning aims to identify latent variables from high-dimensional observations. We begin with a data-generating process where high-dimensional observations $x \in \mathbb{R}^n$ are generated from latent variables $z \in \mathbb{R}^d$. We consider the task of identifying the latent $z$ assuming access to both observational and interventional datasets. The observational data is drawn from

$$z \sim P_Z; \quad x \leftarrow g(z), \qquad (1)$$

where the latent $z$ is sampled from the distribution $P_Z$ and $x$ is the observed data point rendered from the underlying latent $z$ via an injective decoder $g: \mathbb{R}^d \to \mathbb{R}^n$. The interventional data is drawn from a similar distribution, except that the latent $z$ is drawn from $P^{(i)}_Z$, namely the distribution of $z$ under an intervention on $z_i$:

$$z \sim P^{(i)}_Z; \quad x \leftarrow g(z). \qquad (2)$$

We denote by $\mathcal{Z}$ and $\mathcal{Z}^{(i)}$ the supports of $P_Z$ and $P^{(i)}_Z$ respectively (the support is the set where the probability density is greater than zero). The support of $x$ is thus $\mathcal{X} = g(\mathcal{Z})$ in observational data and $\mathcal{X}^{(i)} = g(\mathcal{Z}^{(i)})$ in interventional data.

The goal of causal representation learning is provable representation identification, i.e., to learn an encoder function that takes the observation $x$ as input and provably outputs its underlying true latent $z$. In practice, such an encoder is often learned by solving a reconstruction identity,

$$h \circ f(x) = x \quad \forall x \in \mathcal{X} \cup \mathcal{X}^{(i)}, \qquad (3)$$

where $f: \mathbb{R}^n \to \mathbb{R}^d$ and $h: \mathbb{R}^d \to \mathbb{R}^n$ are a pair of encoder and decoder that must jointly satisfy Eq. 3. The pair $(f, h)$ is referred to as the autoencoder. Given the learned encoder $f$, the resulting representation is $\hat{z} \triangleq f(x)$, the encoder's estimate of the latents.

The reconstruction identity in Eq. 3 is highly underspecified and cannot in general identify the latents. Many pairs $(f, h)$ jointly solve Eq. 3 yet yield representations $\hat{z} \triangleq f(x)$ that do not coincide with the true latents $z$. For instance, applying an invertible map $b$ to any solution $(f, h)$ yields another valid solution $(b \circ f,\ h \circ b^{-1})$. In practical applications, however, exact identification of the latents is not necessary; for example, we may not care about recovering the latent dimensions in the order in which they appear in $z$. Thus, in this work, we examine conditions under which the true latents can be identified up to certain transformations, such as affine transformations and coordinate permutations.

## 4. Stepping Stone: Affine Representation Identification with Polynomial Decoders

We first establish an affine identification result, which serves as a stepping stone towards the stronger identification guarantees in the next section. We begin with a few assumptions.

**Assumption 4.1.** The interior of the support of $z$, $\mathcal{Z} \cup \mathcal{Z}^{(i)}$, is a non-empty subset of $\mathbb{R}^d$.¹

**Assumption 4.2.** The decoder $g$ is a polynomial of finite degree $p$ whose corresponding coefficient matrix $G$ has full column rank. Specifically, the decoder $g$ is determined by the coefficient matrix $G$ as follows:

$$g(z) = G\,\big[1,\ z,\ z \otimes z,\ \ldots,\ \underbrace{z \otimes \cdots \otimes z}_{p\ \text{times}}\big]^{\top} \quad \forall z \in \mathbb{R}^d,$$

where $\otimes$ represents the Kronecker product with all distinct entries; for example, if $z = [z_1, z_2]$, then $z \otimes z = [z_1^2, z_1 z_2, z_2^2]$.

The assumption that the matrix $G \in \mathbb{R}^{n \times q}$ has full column rank $q$ guarantees that the decoder $g$ is injective; see Lemma A.1 in Appendix A.1 for a proof. This injectivity condition on $g$ is common in identifiable representation learning; without injectivity, the problem of identification becomes ill-defined, as multiple different latents $z$ can give rise to the same observation $x$. We note that the full-column-rank condition on $G$ in Assumption 4.2 imposes an implicit constraint on the dimensionality $n$ of the data: $n$ must be at least the number of terms in a polynomial of degree $p$, namely $n \geq \sum_{r=0}^{p} \binom{r+d-1}{d-1}$. In the Appendix (Theorem A.5), we show that if our data is generated from sparse polynomials, i.e., $G$ is a sparse matrix, then $n$ is allowed to be much smaller.

¹We work with $(\mathbb{R}^d, \ell_2)$ as the metric space. A point is in the interior of a set if there exists an $\epsilon$-ball for some $\epsilon > 0$ containing that point within the set. The set of all such points defines the interior.
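To make the structure in Assumption 4.2 concrete, here is a minimal sketch (our illustration, not the authors' released code) of the monomial feature vector $[1, z, z \otimes z, \ldots]$ with distinct entries and a randomly drawn coefficient matrix $G$; the function names `poly_features` and `make_decoder` are hypothetical.

```python
# Illustrative sketch of a polynomial decoder satisfying Assumption 4.2.
from itertools import combinations_with_replacement
import numpy as np

def poly_features(z, degree):
    """Return [1, z, z (x) z, ...] keeping only the distinct monomials."""
    feats = [np.ones(1)]
    for r in range(1, degree + 1):
        feats.append(np.array([np.prod([z[i] for i in idx])
                               for idx in combinations_with_replacement(range(len(z)), r)]))
    return np.concatenate(feats)

def make_decoder(d, degree, n, rng):
    """Sample a coefficient matrix G with full column rank (with high probability)."""
    q = len(poly_features(np.zeros(d), degree))   # number of distinct monomials
    assert n >= q, "need n >= number of monomials for G to have full column rank"
    G = rng.standard_normal((n, q))
    return lambda z: G @ poly_features(z, degree)

rng = np.random.default_rng(0)
g = make_decoder(d=6, degree=2, n=200, rng=rng)   # mirrors the n = 200 setup of Section 8
x = g(rng.uniform(-1, 1, size=6))                 # a single observation in R^200
```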
Under Assumptions 4.1 and 4.2, we perform causal representation learning with two constraints: a polynomial decoder and a non-collapsing encoder.

**Constraint 4.3.** The learned decoder $h$ is a polynomial of degree $p$ determined by its coefficient matrix $H$ as follows:

$$h(z) = H\,\big[1,\ z,\ z \otimes z,\ \ldots,\ \underbrace{z \otimes \cdots \otimes z}_{p\ \text{times}}\big]^{\top} \quad \forall z \in \mathbb{R}^d,$$

where $\otimes$ represents the Kronecker product with all distinct entries. The interior of the image of the encoder, $f(\mathcal{X} \cup \mathcal{X}^{(i)})$, is a non-empty subset of $\mathbb{R}^d$.

We now show that solving the reconstruction identity under these constraints provably identifies the true latent $z$ up to affine transformations.

**Theorem 4.4.** Suppose the observational and interventional data are generated from Eq. 1 and Eq. 2 respectively under Assumptions 4.1 and 4.2. The autoencoder that solves the reconstruction identity in Eq. 3 under Constraint 4.3 achieves affine identification, i.e., $\forall z \in \mathcal{Z} \cup \mathcal{Z}^{(i)}$, $\hat{z} = A z + c$, where $\hat{z}$ is the encoder $f$'s output, $z$ is the true latent, $A \in \mathbb{R}^{d \times d}$ is invertible, and $c \in \mathbb{R}^d$.

Theorem 4.4 drastically reduces the ambiguity in identifying the latent $z$ from arbitrary invertible transformations to invertible affine transformations. Moreover, Theorem 4.4 does not require any structural assumptions about the dependency between the latents. It only requires (i) the geometric assumption that the interior of the support is non-empty and (ii) that the map $g$ is a finite-degree polynomial.

The proof of Theorem 4.4 is in Appendix A.1. The idea is to write the representation $\hat{z} = f(x)$ as $\hat{z} = f \circ g(z) = a(z)$ with $a \triangleq f \circ g$, leveraging the relationship $x = g(z)$ in Eq. 1. We then show that $a$ must be an affine map. To give further intuition, consider a toy example with a one-dimensional latent $z$, a three-dimensional observation $x$, and the true decoder $g$ and the learned decoder $h$ each a degree-two polynomial. Solving the reconstruction identity on all $x$ gives $h(\hat{z}) = g(z)$, equivalently $H\,[1, \hat{z}, \hat{z}^2]^{\top} = G\,[1, z, z^2]^{\top}$. This equality implies that both $\hat{z}$ and $\hat{z}^2$ must be at most degree-two polynomials of $z$. As a consequence, $\hat{z}$ must be a degree-one polynomial of $z$, which we prove by contradiction: if $\hat{z}$ were a degree-two polynomial of $z$, then $\hat{z}^2$ would have degree four, contradicting the fact that $\hat{z}^2$ is at most degree two in $z$. Therefore, $\hat{z}$ must be a degree-one polynomial in $z$, i.e., an affine function of $z$.

**Beyond polynomial $g$.** Theorem A.8 in the Appendix extends Theorem 4.4 to a class of maps $g(\cdot)$ that are $\epsilon$-approximable by a polynomial.
## 5. Provable Representation Identification with Interventional Data

In the previous section, we derived affine identification guarantees. Next, we strengthen these guarantees by leveraging geometric signals specific to many interventions.

### 5.1. Representation identification with do interventions

We begin with a motivating example on images, where we are given data with do interventions on the latents. Consider the two balls shown in Fig. 2a. Ball 1's coordinates are $(z^1_1, z^1_2)$ and Ball 2's coordinates are $(z^2_1, z^2_2)$. We write the latent as $z = [(z^1_1, z^1_2), (z^2_1, z^2_2)]$; this latent is rendered as the image $x$ shown in Fig. 2a. The latent $z$ in the observational data follows the directed acyclic graph (DAG) in Fig. 2b, where Ball 1's coordinates cause Ball 2's coordinates. The latent $z$ under a do intervention on $z^2_2$, the second coordinate of Ball 2, follows the DAG in Fig. 2c. Our goal is to learn an encoder, using the images $x$ from the observational and interventional data, that outputs the coordinates of the balls up to permutation and scaling.

Suppose $z$ is generated from a structural causal model with an underlying DAG (Pearl, 2009). Formally, a do intervention on one latent dimension fixes it to some constant value. The distribution of the children of the intervened component is affected by the intervention, while the distribution of the remaining latents is unaltered. Based on this property of do interventions, we characterize the distribution $P^{(i)}_Z$ in Eq. 2 as

$$z_i = z^*; \quad z_{-i} \sim P^{(i)}_{Z_{-i}}, \qquad (4)$$

where $z_i$ takes a fixed value $z^*$ and the remaining variables $z_{-i}$ are sampled from $P^{(i)}_{Z_{-i}}$. The distribution $P^{(i)}_{Z_{-i}}$ in Eq. 4 encompasses many settings in practice, including (i) do interventions on causal DAGs (Pearl, 2009), i.e., $P^{(i)}_{Z_{-i}} = P_{Z_{-i} \mid \mathrm{do}(z_i = z^*)}$, (ii) do interventions on cyclic graphical models (Mooij & Heskes, 2013), and (iii) sampling $z_{-i}$ from its conditional in the observational data, $P^{(i)}_{Z_{-i}} = P_{Z_{-i} \mid z_i = z^*}$ (e.g., subsampling images in observational data with a fixed background color).
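To illustrate Eqs. 1, 2, and 4, the sketch below (ours; the SCM coefficients and variable names are illustrative assumptions, not the paper's exact setup) samples a two-dimensional latent from a linear SCM with DAG $z_1 \to z_2$, applies a do intervention on $z_2$, and renders both datasets through a degree-two polynomial decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_observational(n_samples):
    """Eq. 1 with a linear SCM z1 -> z2: the support of z2 depends on z1."""
    z1 = rng.uniform(-1, 1, n_samples)
    z2 = 0.8 * z1 + 0.2 * rng.uniform(-1, 1, n_samples)
    return np.stack([z1, z2], axis=1)

def sample_do_intervention(n_samples, value=0.5):
    """Eq. 4: do(z2 = value); the unintervened latent keeps its marginal."""
    z1 = rng.uniform(-1, 1, n_samples)
    z2 = np.full(n_samples, value)
    return np.stack([z1, z2], axis=1)

def decoder(z, G):
    """Degree-2 polynomial decoder g(z) = G [1, z1, z2, z1^2, z1*z2, z2^2]."""
    feats = np.stack([np.ones(len(z)), z[:, 0], z[:, 1],
                      z[:, 0] ** 2, z[:, 0] * z[:, 1], z[:, 1] ** 2], axis=1)
    return feats @ G.T

G = rng.standard_normal((25, 6))                     # n = 25 >> 6 monomials
x_obs = decoder(sample_observational(1000), G)       # observational dataset
x_int = decoder(sample_do_intervention(1000), G)     # do-interventional dataset (i = 2)
```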
Given interventional data from do interventions, we perform causal representation learning by leveraging the geometric signature of the do intervention in the search for the autoencoder.

*Figure 2 (panels a–c). Illustrating do interventions in image-based data in (a). The DAG of dependencies under the observational distribution (b) and under a perfect intervention on $z^2_2$ (c).*

In particular, we enforce the following constraint while solving the reconstruction identity in Eq. 3.

**Constraint 5.1.** The encoder's $k$-th component $f_k(x)$, denoted $\hat{z}_k$, is required to take some fixed value $z^{\dagger}$ for all $x \in \mathcal{X}^{(i)}$. Formally, $f_k(x) = z^{\dagger}$, $\forall x \in \mathcal{X}^{(i)}$.

In Constraint 5.1, we do not need to know which component is intervened on or the value it takes, i.e., we do not require $k = i$ or $z^{\dagger} = z^*$. We next show how this constraint helps identify the intervened latent $z_i$ under an additional assumption on the support of the unintervened latents, stated below.

**Assumption 5.2.** The interior of the support of the distribution $P^{(i)}_{Z_{-i}}$ of the unintervened latents is a non-empty subset of $\mathbb{R}^{d-1}$.

**Theorem 5.3.** Suppose the observational and interventional data are generated from Eq. 1 and Eq. 2 respectively under Assumptions 4.1 and 4.2, where $P^{(i)}_Z$ follows Eq. 4. The autoencoder that solves Eq. 3 under Constraints 4.3 and 5.1 identifies the intervened latent $z_i$ up to shift and scaling, i.e., $\hat{z}_k = e z_i + b$, where $e \in \mathbb{R}$ and $b \in \mathbb{R}$.

Theorem 5.3 immediately extends to settings where multiple interventional distributions are available, each corresponding to a hard do intervention on a distinct latent variable. Under the same assumptions as Theorem 5.3, each of the intervened latents can then be identified up to permutation, shift, and scaling. Notably, Theorem 5.3 does not rely on any distributional assumptions (e.g., parametric assumptions) on $z$, nor on the nature of the graphical model of $z$ (e.g., cyclic or acyclic). Theorem 5.3 makes two key geometric assumptions: (i) the support of $z$ in observational data has a non-empty interior, and (ii) the support of the unintervened latents $z_{-i}$ has a non-empty interior.

Theorem 5.3 combines the affine identification guarantee derived in Theorem 4.4 with the geometric signature of do interventions. For example, in Fig. 1b, the support of the true latents is axis-aligned (parallel to the x-axis). In this case, the interventional constraint also forces the support of $\hat{z}$ to be axis-aligned (parallel to the x-axis or y-axis). The proof of Theorem 5.3 is in Appendix A.2; we provide some intuition here. Given Assumptions 4.1 and 4.2 and Constraint 4.3, Theorem 4.4 already guarantees affine identification. This implies $\hat{z}_k = a_{-i}^{\top} z_{-i} + e z_i + b$, where $z_{-i}$ collects all entries of $z$ other than $z_i$ and $a_{-i}$ is the vector of corresponding coefficients. Since both $\hat{z}_k$ and $z_i$ are fixed on the interventional data, $a_{-i}^{\top} z_{-i}$ must also take a fixed value for all $z_{-i}$ in the support of $P^{(i)}_{Z_{-i}}$. We argue $a_{-i} = 0$ by contradiction: if $a_{-i} \neq 0$, then any change of $z_{-i}$ in the direction of $a_{-i}$ would also change $\hat{z}_k$, contradicting the fact that $\hat{z}_k$ takes a fixed value. Therefore, $a_{-i} = 0$ and $z_i$ is identified up to shift and scaling.

**Beyond polynomial $g$.** In Theorem 5.3, we assume the map $g$ is a polynomial. In the Appendix (Theorem A.12) we show that, even when $g$ is not a polynomial but a general diffeomorphism, the intervened latent can be approximately identified up to an invertible transform, provided sufficiently many do-interventional distributions per latent are available. That said, one interventional distribution per latent no longer suffices, unlike in the polynomial case. Our experiments on images in §8 further support this argument. We state Theorem A.12 informally below.

**Theorem (informal).** Suppose the observational data is generated from Eq. 1 and we gather multiple interventional datasets for latent $z_i$, where in each interventional dataset $z_i$ is set to a distinct fixed value under a do intervention following Eq. 4. If the number of do-interventional datasets is sufficiently large and the support of the latents satisfies certain regularity conditions (detailed in Theorem A.12), then the autoencoder that solves Eq. 3 under multiple constraints of the form of Constraint 5.1 approximately identifies $z_i$ up to an invertible transform.

### 5.2. General perfect and imperfect interventions

The discussion so far has focused on do interventions. In this section, our goal is to build identification guarantees under imperfect interventions. The example that follows motivates the class of imperfect interventions we consider.
**Motivating example of perfect & imperfect interventions on images.** First, we revisit perfect interventions in causal DAGs (Peters et al., 2017). Under a perfect intervention, the intervened latent is disconnected from its parents; do interventions are a special case of perfect interventions. Consider the two balls shown in Fig. 2a and suppose Ball 1 has a strong influence on Ball 2 in the observational DAG in Fig. 2b. As a result, the position of Ball 1 determines the region where Ball 2 can be located inside the box in Fig. 2a. Now imagine a perfect intervention is carried out as shown in Fig. 2c. Under this intervention, the second coordinate of Ball 2 is no longer restricted by Ball 1 and it takes all possible values in the box. Do we need a perfect intervention to ensure that Ball 2 can be located anywhere in the box? Even an imperfect intervention that reduces the strength of the influence of Ball 1 on Ball 2 can suffice. In this section, we consider imperfect interventions that guarantee that the range of values the intervened latent takes does not depend on its non-descendants. We formalize this below.

**Definition 5.4** (Wang & Jordan, 2021). Consider a random variable $V = [V_1, V_2]$ sampled from $P_V$. $V_1$ and $V_2$ are said to have independent support if $\mathcal{V} = \mathcal{V}_1 \times \mathcal{V}_2$, where $\mathcal{V}$ is the support of $P_V$, $\mathcal{V}_j$ is the support of the marginal distribution of $V_j$ for $j \in \{1, 2\}$, and $\times$ is the Cartesian product.

Observe that two random variables can be dependent yet have independent support. Suppose $z$ is generated from a structural causal model with an underlying DAG and $z_i$ undergoes an imperfect intervention. We consider imperfect interventions such that each pair $(z_i, z_j)$ satisfies support independence (Definition 5.4), where $z_j$ is a non-descendant of $z_i$ in the underlying DAG. Below we characterize imperfect interventions that satisfy support independence.

**Characterizing imperfect interventions that lead to support independence.** Suppose $z_i \leftarrow w(\mathrm{Pa}(z_i), u)$, where $\mathrm{Pa}(z_i)$ is the value of the set of parents of $z_i$, $u \in \mathcal{U}$ is a noise variable independent of the ancestors of $z_i$, and $w$ is the map that generates $z_i$. We carry out an imperfect intervention on $z_i$ by changing the map $w$ to $v$. If the range of values assumed by $v$ is the same for any two values assumed by the parents, then the support of $z_i$ is independent of all its non-descendants. Formally, the condition is $v(\mathrm{Pa}(z_i), \mathcal{U}) = v(\mathrm{Pa}'(z_i), \mathcal{U})$, where $\mathrm{Pa}(z_i)$ and $\mathrm{Pa}'(z_i)$ are any two sets of values assumed by the parents.

We are now ready to describe the geometric properties we require of the interventional distribution $P^{(i)}_Z$ in Eq. 2. We first introduce some notation. Let $[d] := \{1, \ldots, d\}$. For each $j \in [d]$, define $\alpha^j_{\sup}$ ($\alpha^j_{\inf}$) to be the supremum (infimum) of the set $\mathcal{Z}^{(i)}_j$, the support of component $z_j$ under the interventional distribution.

**Assumption 5.5.** Consider $z$ sampled from the interventional distribution $P^{(i)}_Z$ in Eq. 2. There exists $S \subseteq [d]$ such that the support of $z_i$ is independent of $z_j$ for all $j \in S$:

$$\mathcal{Z}^{(i)}_{i,j} = \mathcal{Z}^{(i)}_i \times \mathcal{Z}^{(i)}_j \quad \forall j \in S. \qquad (5)$$

Moreover, for all $j \in [d]$, $-\infty < \alpha^j_{\inf} \leq \alpha^j_{\sup} < \infty$, and there exists $\zeta > 0$ such that $(\alpha^j_{\sup} - \zeta,\ \alpha^j_{\sup}) \cup (\alpha^j_{\inf},\ \alpha^j_{\inf} + \zeta) \subseteq \mathcal{Z}^{(i)}_j$ for all $j \in [d]$.

The distribution $P^{(i)}_Z$ above is quite general: it encompasses (i) all perfect interventions, since they render the intervened latent independent of its non-descendants, and (ii) imperfect interventions that lead to independent support as characterized above.
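Definition 5.4 and Assumption 5.5 refer to population supports; on finite samples one can only probe them approximately. The sketch below (our heuristic, not part of the paper's method) scores how much of the Cartesian product of two estimated marginal supports is covered by the estimated joint support, using a coarse histogram; a score near one is consistent with independent support.

```python
import numpy as np

def support_independence_score(z1, z2, bins=20):
    """Fraction of the product of marginal supports covered by the joint support.
    Close to 1 when (z1, z2) have independent support (Definition 5.4)."""
    joint, _, _ = np.histogram2d(z1, z2, bins=bins)
    occupied = joint > 0
    rows = occupied.any(axis=1)              # bins in the marginal support of z1
    cols = occupied.any(axis=0)              # bins in the marginal support of z2
    product_cells = np.outer(rows, cols)     # bins in the Cartesian product of supports
    return occupied[product_cells].mean()

rng = np.random.default_rng(0)
z1 = rng.uniform(-1, 1, 5000)
print(support_independence_score(z1, 0.8 * z1 + 0.05 * rng.uniform(-1, 1, 5000)))  # low: dependent support
print(support_independence_score(z1, rng.uniform(-1, 1, 5000)))                    # close to 1
```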
The latter part of Assumption 5.5 is a regularity condition on the geometry of the support: it ensures the support of $z$ has a $\zeta$-thick boundary for some $\zeta > 0$. We now describe a constraint on the encoder that leverages the geometric signature of imperfect interventions in Assumption 5.5. Recall $\hat{z}_k = f_k(x)$. Let $\hat{\mathcal{Z}} = f(\mathcal{X})$ and $\hat{\mathcal{Z}}^{(i)} = f(\mathcal{X}^{(i)})$ denote the support of the encoder $f$'s output on observational and interventional data respectively. $\hat{\mathcal{Z}}^{(i)}_{k,m}$ denotes the joint support of $(\hat{z}_k, \hat{z}_m)$ and $\hat{\mathcal{Z}}^{(i)}_k$ the support of $\hat{z}_k$ in interventional data; $\hat{\mathcal{Z}}_{k,m}$ and $\hat{\mathcal{Z}}_k$ are defined analogously for observational data.

**Constraint 5.6.** Given a set $S'$: for each $m \in S'$, $(\hat{z}_k, \hat{z}_m)$ satisfies support independence on interventional data, i.e., $\hat{\mathcal{Z}}^{(i)}_{k,m} = \hat{\mathcal{Z}}^{(i)}_k \times \hat{\mathcal{Z}}^{(i)}_m$, $\forall m \in S'$.

In Constraint 5.6, the index $k$ and the set $S'$ are not necessarily the same as $i$ and $S$ from Assumption 5.5. In the theorem that follows, we require $|S'| \leq |S|$ to guarantee that a solution to Constraint 5.6 exists; in Appendix A.3 we explain that this requirement can be easily relaxed. Note that Constraint 5.6 bears similarity to Constraint 5.1 from the case of do interventions: both ensure that the support of the $k$-th component is independent of other components. In the theorem that follows, we show that Constraint 5.6 helps achieve block affine identification, which we define below.

**Definition 5.7.** If $\hat{z} = \Lambda \Pi z + c$ for all $z \in \mathcal{Z} \cup \mathcal{Z}^{(i)}$, where $\Pi$ is a permutation matrix and $\Lambda$ is an invertible matrix with a zero submatrix, then $\hat{z}$ is said to block-affine identify $z$.

**Theorem 5.8.** Suppose the observational and interventional data are generated from Eq. 1 and Eq. 2 respectively under Assumptions 4.1, 4.2, and 5.5. The autoencoder that solves Eq. 3 under Constraints 4.3 and 5.6 (with $|S'| \leq |S|$) achieves block affine identification. More specifically, $\forall z \in \mathcal{Z} \cup \mathcal{Z}^{(i)}$,

$$\hat{z}_k = a_k^{\top} z + c_k, \qquad \hat{z}_m = a_m^{\top} z + c_m \quad \forall m \in S',$$

where $a_k$ contains at most $d - |S'|$ non-zero elements and each component of $a_m$ is zero whenever the corresponding component of $a_k$ is non-zero, for all $m \in S'$.

From Theorem 4.4, $\hat{z} = A z + c$. The theorem above adds that $\hat{z}_k$ depends linearly on at most $d - |S'|$ of the latents, not all of them, and that each $\hat{z}_m$ with $m \in S'$ does not depend on any of the latents that $\hat{z}_k$ depends on. As a result, $|S'| + 1$ rows of $A$ (from Theorem 4.4) are sparse. Observe that if $|S'| = |S| = d - 1$, then $\hat{z}_k$ identifies some $z_j$ up to scale and shift, and the remaining components $\hat{z}_{-k}$ depend linearly on $z_{-j}$ and do not depend on $z_j$. The proof of Theorem 5.8 is in Appendix A.3.

## 6. Extensions to Identification with Observational Data & Independent Support

In the previous section, we showed that interventions induce geometric structure (independence of supports) in the support of the latents that helps achieve strong identification guarantees. In this section, we consider a special case where such geometric structure is already present in the support of the latents in the observational data. Since we only work with observational data in this section, we set the interventional supports $\mathcal{Z}^{(i)} = \mathcal{X}^{(i)} = \emptyset$. For each $j \in [d]$, define $\beta^j_{\sup}$ and $\beta^j_{\inf}$ to be the supremum and infimum of the support $\mathcal{Z}_j$ of $z_j$.

**Assumption 6.1.** The support of $P_Z$ in Eq. 1 satisfies pairwise support independence between all pairs of latents. Formally,

$$\mathcal{Z}_{r,s} = \mathcal{Z}_r \times \mathcal{Z}_s \quad \forall r \neq s,\ r, s \in [d]. \qquad (6)$$

Moreover, for all $r \in [d]$, $-\infty < \beta^r_{\inf} \leq \beta^r_{\sup} < \infty$, and there exists $\zeta > 0$ such that $(\beta^r_{\sup} - \zeta,\ \beta^r_{\sup}) \cup (\beta^r_{\inf},\ \beta^r_{\inf} + \zeta) \subseteq \mathcal{Z}_r$ for all $r \in [d]$.
Following previous sections, we state a constraint through which the learner leverages the geometric structure of the support in Assumption 6.1 to search for the autoencoder.

**Constraint 6.2.** Each pair $(\hat{z}_k, \hat{z}_m)$ with $k, m \in [d]$ and $k \neq m$ satisfies support independence on observational data, i.e., $\hat{\mathcal{Z}}_{k,m} = \hat{\mathcal{Z}}_k \times \hat{\mathcal{Z}}_m$, where $\hat{\mathcal{Z}}_{k,m}$ is the joint support of $(\hat{z}_k, \hat{z}_m)$ and $\hat{\mathcal{Z}}_k$ is the support of $\hat{z}_k$.

**Theorem 6.3.** Suppose the observational data is generated from Eq. 1 under Assumptions 4.1, 4.2, and 6.1. The autoencoder that solves Eq. 3 under Constraint 6.2 achieves identification up to permutation, shift, and scaling. Specifically, $\forall z \in \mathcal{Z}$, $\hat{z} = \Lambda \Pi z + c$, where $\hat{z}$ is the output of the encoder $f$, $z$ is the true latent, $\Pi$ is a permutation matrix, and $\Lambda$ is an invertible diagonal matrix.

The proof of Theorem 6.3 is in Appendix A.4. Theorem 6.3 says that independence of the latents' supports suffices to achieve identification up to permutation, shift, and scaling from observational data alone. Theorem 6.3 has important implications for the seminal work on linear ICA (Comon, 1994), considering the simple case of a linear $g$. Comon (1994) shows that if the latent variables are independent and non-Gaussian, then they can be identified up to permutation and scaling. Theorem 6.3 states that even if the latent variables are dependent, they can be identified up to permutation, shift, and scaling, as long as they are bounded (hence non-Gaussian) and satisfy pairwise support independence. Finally, Theorem 6.3 provides a first general theoretical justification for recent proposals of unsupervised disentanglement via the independent support condition (Wang & Jordan, 2021; Roth et al., 2022).

## 7. Learning Representations from Geometric Signatures: Practical Considerations

In this section, we describe practical algorithms to solve the constrained representation learning problems in §5 and §6. To perform constrained representation learning with do-intervention data, we proceed in two steps. In the first step, we minimize the reconstruction objective

$$f^{\dagger}, h^{\dagger} = \arg\min_{f, h}\ \mathbb{E}\big[\| h \circ f(X) - X \|^2\big],$$

where $h$ is the decoder, $f$ is the encoder, and the expectation is taken over observational and interventional data. In the experiments, we restrict $h$ to be a polynomial and show that the learned $f^{\dagger}$ achieves affine identification, as proved in Theorem 4.4. In the second step, we learn a linear map to transform the learned representation and enforce Constraint 5.1. For each interventional distribution $P^{(i)}_X$, we learn a different linear map $\gamma_i$ that projects the representation so that it takes an arbitrary fixed value $z^{\dagger}_i$ on the support of $P^{(i)}_X$. We write this objective as

$$\min_{\gamma_i}\ \mathbb{E}_{X \sim P^{(i)}_X}\Big[\big(\gamma_i^{\top} f^{\dagger}(X) - z^{\dagger}_i\big)^2\Big]. \qquad (7)$$

We construct a matrix $\Gamma$ with the different $\gamma_i^{\top}$ as its rows; the final output representation is $\Gamma f^{\dagger}(X)$. In the experiments, we show that this representation achieves permutation, shift, and scaling identification, as predicted by Theorem 5.3. A few remarks are in order: (i) $z^{\dagger}_i$ is arbitrary and the learner does not know the true do-intervention value; (ii) for ease of exposition, Eq. 7 assumes knowledge of the index of the intervened latent, which can be easily relaxed by multiplying $\Gamma$ with a permutation matrix.
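A condensed sketch of this two-step procedure is given below (ours, not the released implementation; the architectures, learning rates, and fixed target value are illustrative choices). Step 1 fits the autoencoder on pooled observational and interventional data; step 2 freezes the encoder and fits one row $\gamma_i$ of $\Gamma$ per do-interventional dataset by minimizing Eq. 7 with a non-zero target and no bias, which rules out the trivial solution $\gamma_i = 0$.

```python
# Illustrative sketch of the two-step procedure of Section 7.
import torch
import torch.nn as nn

d, n = 6, 200
enc = nn.Sequential(nn.Linear(n, 200), nn.LeakyReLU(), nn.Linear(200, d))
dec = nn.Sequential(nn.Linear(d, 200), nn.LeakyReLU(), nn.Linear(200, n))  # the paper uses a polynomial decoder here

def step1(x_all, epochs=100):
    """Minimize the reconstruction objective over observational + interventional data."""
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((dec(enc(x_all)) - x_all) ** 2).mean()
        loss.backward()
        opt.step()

def step2(x_interventions, epochs=100):
    """For each do-interventional dataset, learn gamma_i (one row of Gamma) via Eq. 7."""
    gammas = []
    for x_i in x_interventions:                       # list of tensors, one per intervention
        gamma = nn.Linear(d, 1, bias=False)
        target = torch.tensor(1.0)                    # arbitrary non-zero fixed value z_i^dagger
        opt = torch.optim.Adam(gamma.parameters(), lr=1e-2)
        with torch.no_grad():
            z_i = enc(x_i)                            # freeze the step-1 encoder
        for _ in range(epochs):
            opt.zero_grad()
            loss = ((gamma(z_i) - target) ** 2).mean()
            loss.backward()
            opt.step()
        gammas.append(gamma.weight.detach())
    return torch.cat(gammas)                          # final representation: Gamma @ enc(x)
```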
We next describe an algorithm that learns representations enforcing independence of support (leveraged in Theorems 5.8 and 6.3). To measure the (non-)independence of the latents' support, we follow Wang & Jordan (2021) and Roth et al. (2022) and measure the distance between sets using the Hausdorff distance: for $S_1 \subseteq S_2$,

$$\mathrm{HD}(S_1, S_2) = \sup_{z' \in S_2}\ \inf_{z \in S_1} \| z - z' \|.$$

To enforce the independent support constraint, we again follow a two-step algorithm. The first step remains the same, i.e., we minimize the reconstruction objective. In the second step, we transform the learned representation $f^{\dagger}(x)$ with an invertible map $\Gamma \in \mathbb{R}^{d \times d}$. The joint support obtained after the transformation is a function of $\Gamma$ and is denoted $\hat{\mathcal{Z}}(\Gamma)$; following the notation introduced earlier, the joint support along dimensions $k, m$ is $\hat{\mathcal{Z}}_{k,m}(\Gamma)$ and the marginal support along $k$ is $\hat{\mathcal{Z}}_k(\Gamma)$. We translate Constraint 6.2 into finding a $\Gamma$ that minimizes

$$\sum_{k \neq m} \mathrm{HD}\Big(\hat{\mathcal{Z}}_{k,m}(\Gamma),\ \hat{\mathcal{Z}}_k(\Gamma) \times \hat{\mathcal{Z}}_m(\Gamma)\Big). \qquad (8)$$

Constraint 5.6 can be translated similarly.
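The sketch below (ours) gives one finite-sample estimate of the Hausdorff term in Eq. 8: it compares samples from the joint support of a coordinate pair against samples from the product of their marginal supports, formed by randomly re-pairing coordinates. In practice a differentiable relaxation of this quantity would be minimized over $\Gamma$; the estimator here is only for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

def hausdorff_support_penalty(z_hat, k, m, max_pairs=2000, rng=None):
    """Finite-sample estimate of HD(Z_km, Z_k x Z_m) from Eq. 8: the distance from
    points in the product of marginal supports (S2) to the joint support (S1)."""
    rng = rng or np.random.default_rng(0)
    joint = z_hat[:, [k, m]]                                    # samples from S1
    idx_k = rng.integers(len(z_hat), size=max_pairs)
    idx_m = rng.integers(len(z_hat), size=max_pairs)
    product = np.stack([z_hat[idx_k, k], z_hat[idx_m, m]], 1)   # samples from S2 = Z_k x Z_m
    return cdist(product, joint).min(axis=1).max()              # sup_{z' in S2} inf_{z in S1} ||z - z'||
```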
## 8. Empirical Findings

In this section, we analyze how the practical implementation of the theory holds up in settings ranging from data generated by polynomial decoders to images generated by the PyGame rendering engine (Shinners, 2011). The code to reproduce the experiments can be found at https://github.com/facebookresearch/CausalRepID.

**Data generation process.** *Polynomial decoder data:* The latents for the observational data are sampled from $P_Z$, which can be (i) independent uniform, (ii) an SCM with sparse connectivity (SCM-S), or (iii) an SCM with dense connectivity (SCM-D) (Brouillard et al., 2020). The latent variables are then mapped to $x$ using a multivariate polynomial. We use an $n = 200$-dimensional $x$, two latent dimensions $d$ (six and ten), and polynomial degrees $p$ of two and three. Each element of the matrix $G$ used to generate $x$ is sampled from a standard normal distribution. *Image data:* For image-based experiments, we use the PyGame rendering engine (Shinners, 2011). We generate $64 \times 64 \times 3$ pixel images of the form shown in Fig. 2 and consider a setting with two balls. We consider three distributions for the latents: (i) independent uniform, (ii) a linear SCM with the DAG in Fig. 2, and (iii) a non-linear SCM with the DAG in Fig. 2, where the coordinates of Ball 1 are at the top layer of the DAG and the coordinates of Ball 2 are at the bottom layer. For both settings above, we carry out do interventions on each latent dimension to generate interventional data.

**Model parameters and evaluation metrics.** We follow the two-step training procedure described in §7. For image-based experiments we use a ResNet-18 encoder (He et al., 2016); for all other experiments, we use an MLP with three hidden layers and two hundred units per layer. We learn a polynomial decoder $h$, as the theory prescribes (Constraint 4.3) when $g$ is a polynomial; in App. B.3, we also present results with an MLP decoder. To check for affine identification (Theorem 4.4), we measure the $R^2$ score of a linear regression between the output representation and the true representation: a high score guarantees affine identification. To verify permutation, shift, and scaling identification (Theorem 6.3), we report the mean correlation coefficient (MCC) (Khemakhem et al., 2022). For further details on data generation, models, hyperparameters, and supplementary experiments, refer to App. B.

Table 2. Observational data with polynomial decoder $g$: mean ± s.e. (5 random seeds). $R^2$ and MCC (IOS) (for uniform) are high, as predicted by Theorem 4.4 and Theorem 6.3 respectively.

| $P_Z$ | $d$ | $p$ | $R^2$ | MCC (IOS) |
|---|---|---|---|---|
| Uniform | 6 | 2 | 1.00 ± 0.00 | 99.3 ± 0.07 |
| Uniform | 6 | 3 | 1.00 ± 0.00 | 99.4 ± 0.06 |
| Uniform | 10 | 2 | 1.00 ± 0.00 | 90.7 ± 2.92 |
| Uniform | 10 | 3 | 0.99 ± 0.00 | 94.6 ± 1.50 |
| SCM-S | 6 | 2 | 0.96 ± 0.02 | 72.6 ± 1.48 |
| SCM-S | 6 | 3 | 0.87 ± 0.07 | 70.6 ± 1.54 |
| SCM-S | 10 | 2 | 0.99 ± 0.00 | 65.9 ± 1.32 |
| SCM-S | 10 | 3 | 0.90 ± 0.05 | 58.8 ± 1.27 |
| SCM-D | 6 | 2 | 0.97 ± 0.01 | 61.6 ± 4.36 |
| SCM-D | 6 | 3 | 0.81 ± 0.11 | 65.2 ± 2.70 |
| SCM-D | 10 | 2 | 0.83 ± 0.10 | 69.6 ± 3.09 |
| SCM-D | 10 | 3 | 0.72 ± 0.15 | 60.1 ± 1.16 |

Table 3. Interventional data with polynomial decoder $g$: mean ± s.e. (5 random seeds). MCC (IL) is high, as shown in Theorem 5.3.

| $P_Z$ | $d$ | $p$ | MCC | MCC (IL) |
|---|---|---|---|---|
| Uniform | 6 | 2 | 69.1 ± 1.11 | 100.0 ± 0.00 |
| Uniform | 6 | 3 | 73.4 ± 0.49 | 100.0 ± 0.00 |
| Uniform | 10 | 2 | 59.9 ± 2.03 | 100.0 ± 0.00 |
| Uniform | 10 | 3 | 65.9 ± 0.80 | 99.9 ± 0.03 |
| SCM-S | 6 | 2 | 68.4 ± 0.90 | 99.5 ± 0.38 |
| SCM-S | 6 | 3 | 74.1 ± 2.32 | 99.3 ± 0.34 |
| SCM-S | 10 | 2 | 68.0 ± 2.36 | 99.9 ± 0.03 |
| SCM-S | 10 | 3 | 66.8 ± 1.10 | 98.8 ± 0.13 |
| SCM-D | 6 | 2 | 71.8 ± 3.77 | 99.6 ± 0.12 |
| SCM-D | 6 | 3 | 79.5 ± 3.45 | 98.2 ± 1.07 |
| SCM-D | 10 | 2 | 70.8 ± 1.89 | 95.3 ± 2.24 |
| SCM-D | 10 | 3 | 70.1 ± 2.80 | 97.2 ± 0.88 |

**Results for the polynomial decoder.** *Observational data:* We consider the setting where the true decoder $g$ is a polynomial and the learned decoder $h$ is also a polynomial. In Table 2, we report the $R^2$ between the learned representation and the true latents after the first step, where we only minimize the reconstruction loss; the $R^2$ values are high, as predicted by Theorem 4.4. In the second step, we learn a map $\Gamma$ and enforce the independence-of-support constraint by minimizing the Hausdorff distance in Eq. 8. Among the distributions $P_Z$, only the uniform distribution satisfies support independence (Assumption 6.1), so following Theorem 6.3 we expect MCC to be high only in that case. In Table 2, we report the MCC obtained by enforcing independence of support in the MCC (IOS) column. In App. B.3, we also carry out experiments on correlated uniform distributions and observe high MCC (IOS). *Interventional data:* We now consider the case where we also have access to do-intervention data in addition to observational data, with one do intervention per latent dimension. We follow the two-step procedure described in §7. In Table 3, the MCC column shows the MCC of the representation obtained after the first step. In the second step, we learn $\Gamma$ by minimizing the interventional loss (IL) in Eq. 7; the MCC of the resulting representation, reported in the MCC (IL) column, is close to one, as predicted by Theorem 5.3.

Table 4. Interventional data in image-based experiments: mean ± s.e. (5 random seeds). MCC increases with the number of do-interventional distributions per latent dimension (Theorem A.12).

| # interv. dist. | Uniform | SCM linear | SCM non-linear |
|---|---|---|---|
| 1 | 34.2 ± 0.24 | 12.8 ± 0.28 | 19.7 ± 0.31 |
| 3 | 73.9 ± 0.38 | 73.2 ± 0.33 | 59.7 ± 0.28 |
| 5 | 73.6 ± 0.21 | 83.4 ± 0.21 | 62.8 ± 0.2 |
| 7 | 72.5 ± 0.34 | 84.2 ± 0.25 | 69.3 ± 0.34 |
| 9 | 73.1 ± 0.47 | 86.2 ± 0.17 | 71.4 ± 0.26 |

**Results for the image dataset.** We follow the two-step procedure described in §7, except that in the second step we learn a non-linear map (an MLP) to minimize the interventional loss (IL) in Eq. 7. In Table 4, we show the MCC values achieved by the learned representation as we vary the number of do-interventional distributions per latent dimension. As shown in Theorem A.12, more interventional distributions per latent dimension improve the MCC.
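For reference, the evaluation metrics reported in Tables 2–4 can be computed along the following lines (a sketch of standard practice; the paper's exact implementation may differ).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.linear_model import LinearRegression

def r2_affine_fit(z_hat, z_true):
    """R^2 of a linear regression from learned to true latents;
    high values indicate affine identification (Theorem 4.4)."""
    return LinearRegression().fit(z_hat, z_true).score(z_hat, z_true)

def mcc(z_hat, z_true):
    """Mean correlation coefficient after matching components with the Hungarian
    algorithm; high values indicate permutation & scaling identification."""
    d = z_true.shape[1]
    corr = np.corrcoef(z_hat.T, z_true.T)[:d, d:]      # d x d cross-correlation block
    row, col = linear_sum_assignment(-np.abs(corr))    # best one-to-one matching
    return np.abs(corr[row, col]).mean()
```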
## 9. Conclusions

In this work, we lay down theoretical foundations for learning causal representations in the presence of interventional data. We show that geometric signatures, such as the support independence induced by many interventions, are useful for provable representation identification. Looking forward, we believe that exploring representation learning with real interventional data (Lopez et al., 2022; Liu et al., 2023) is a fruitful avenue for future work.

**Acknowledgments.** Yixin Wang acknowledges grant support from the National Science Foundation and the Office of Naval Research. Yoshua Bengio acknowledges support from CIFAR and IBM.

## References

Ahuja, K., Hartford, J., and Bengio, Y. Properties from mechanisms: an equivariance perspective on identifiable representation learning. arXiv preprint arXiv:2110.15796, 2021.

Ahuja, K., Hartford, J., and Bengio, Y. Weakly supervised representation learning with sparse perturbations. arXiv preprint arXiv:2206.01101, 2022a.

Ahuja, K., Mahajan, D., Syrgkanis, V., and Mitliagkas, I. Towards efficient representation identification in supervised learning. arXiv preprint arXiv:2204.04606, 2022b.

Ash, R. B., Robert, B., Doleans-Dade, C. A., and Catherine, A. Probability and Measure Theory. Academic Press, 2000.

Bareinboim, E., Correa, J. D., Ibeling, D., and Icard, T. On Pearl's hierarchy and the foundations of causal inference. In Probabilistic and Causal Inference: The Works of Judea Pearl, pp. 507–556. 2022.

Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.

Brehmer, J., De Haan, P., Lippe, P., and Cohen, T. Weakly supervised causal representation learning. arXiv preprint arXiv:2203.16437, 2022.

Brouillard, P., Lachapelle, S., Lacoste, A., Lacoste-Julien, S., and Drouin, A. Differentiable causal discovery from interventional data. Advances in Neural Information Processing Systems, 33:21865–21877, 2020.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. Understanding disentangling in beta-VAE. arXiv preprint arXiv:1804.03599, 2018.

Comon, P. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.

Dixit, A., Parnas, O., Li, B., Chen, J., Fulco, C. P., Jerby-Arnon, L., Marjanovic, N. D., Dionne, D., Burks, T., Raychowdhury, R., et al. Perturb-seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell, 167(7):1853–1866, 2016.

Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., and Wichmann, F. A. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020.

Goyal, A. and Bengio, Y. Inductive biases for deep learning of higher-level cognition. arXiv preprint arXiv:2011.15091, 2020.
Hälvä, H. and Hyvärinen, A. Hidden Markov nonlinear ICA: Unsupervised learning from nonstationary time series. In Conference on Uncertainty in Artificial Intelligence, pp. 939–948. PMLR, 2020.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hyvärinen, A. and Morioka, H. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. Advances in Neural Information Processing Systems, 29, 2016.

Hyvärinen, A. and Morioka, H. Nonlinear ICA of temporally dependent stationary sources. In Artificial Intelligence and Statistics, pp. 460–469. PMLR, 2017.

Hyvärinen, A. and Pajunen, P. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429–439, 1999.

Hyvärinen, A., Sasaki, H., and Turner, R. Nonlinear ICA using auxiliary variables and generalized contrastive learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 859–868. PMLR, 2019.

Khemakhem, I., Monti, R., Kingma, D., and Hyvärinen, A. ICE-BeeM: Identifiable conditional energy-based deep models based on nonlinear ICA. Advances in Neural Information Processing Systems, 33:12768–12778, 2020.

Khemakhem, I., Kingma, D., Monti, R., and Hyvärinen, A. Variational autoencoders and nonlinear ICA: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pp. 2207–2217. PMLR, 2022.

Klindt, D., Schott, L., Sharma, Y., Ustyuzhaninov, I., Brendel, W., Bethge, M., and Paiton, D. Towards nonlinear disentanglement in natural data with temporal sparse coding. arXiv preprint arXiv:2007.10930, 2020.

Lachapelle, S., Rodriguez, P., Sharma, Y., Everett, K. E., Le Priol, R., Lacoste, A., and Lacoste-Julien, S. Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ICA. In Conference on Causal Learning and Reasoning, pp. 428–484. PMLR, 2022.

Lippe, P., Magliacane, S., Löwe, S., Asano, Y. M., Cohen, T., and Gavves, E. iCITRIS: Causal representation learning for instantaneous temporal effects. arXiv preprint arXiv:2206.06169, 2022a.

Lippe, P., Magliacane, S., Löwe, S., Asano, Y. M., Cohen, T., and Gavves, E. CITRIS: Causal identifiability from temporal intervened sequences. In International Conference on Machine Learning, pp. 13557–13603. PMLR, 2022b.

Liu, Y., Alahi, A., Russell, C., Horn, M., Zietlow, D., Schölkopf, B., and Locatello, F. Causal triplet: An open challenge for intervention-centric causal representation learning. arXiv preprint arXiv:2301.05169, 2023.

Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., and Bachem, O. Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, pp. 4114–4124. PMLR, 2019.

Locatello, F., Poole, B., Rätsch, G., Schölkopf, B., Bachem, O., and Tschannen, M. Weakly-supervised disentanglement without compromises. In International Conference on Machine Learning, pp. 6348–6359. PMLR, 2020.

Lopez, R., Tagasovska, N., Ra, S., Cho, K., Pritchard, J. K., and Regev, A. Learning causal representations of single cells via sparse mechanism shift modeling. arXiv preprint arXiv:2211.03553, 2022.

Mityagin, B. The zero set of a real analytic function. arXiv preprint arXiv:1512.07276, 2015.

Mooij, J. and Heskes, T. Cyclic causal discovery from continuous equilibrium data. arXiv preprint arXiv:1309.6849, 2013.
Nejatbakhsh, A., Fumarola, F., Esteki, S., Toyoizumi, T., Kiani, R., and Mazzucato, L. Predicting perturbation effects from resting activity using functional causal flow. bioRxiv, pp. 2020–11, 2021.

Pearl, J. Causal inference in statistics: An overview. Statistics Surveys, 3:96–146, 2009.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Peters, J., Janzing, D., and Schölkopf, B. Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press, 2017.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.

Roth, K., Ibrahim, M., Akata, Z., Vincent, P., and Bouchacourt, D. Disentanglement of correlated factors via Hausdorff factorized support. arXiv preprint arXiv:2210.07347, 2022.

Schölkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., and Bengio, Y. Towards causal representation learning. arXiv preprint arXiv:2102.11107, 2021.

Seigal, A., Squires, C., and Uhler, C. Linear causal disentanglement via interventions. arXiv preprint arXiv:2211.16467, 2022.

Shinners, P. Pygame. http://pygame.org/, 2011.

Von Kügelgen, J., Sharma, Y., Gresele, L., Brendel, W., Schölkopf, B., Besserve, M., and Locatello, F. Self-supervised learning with data augmentations provably isolates content from style. Advances in Neural Information Processing Systems, 34:16451–16467, 2021.

Wang, Y. and Jordan, M. I. Desiderata for representation learning: A causal perspective. arXiv preprint arXiv:2109.03795, 2021.

Yamada, Y., Tang, T., and Ilker, Y. When are lemons purple? The concept association bias of CLIP. arXiv preprint arXiv:2212.12043, 2022.

Yao, W., Sun, Y., Ho, A., Sun, C., and Zhang, K. Learning temporally causal latent processes from general temporal data. arXiv preprint arXiv:2110.05428, 2021.

Yao, W., Chen, G., and Zhang, K. Learning latent causal dynamics. arXiv preprint arXiv:2202.04828, 2022a.

Yao, W., Sun, Y., Ho, A., Sun, C., and Zhang, K. Learning temporally causal latent processes from general temporal data. In International Conference on Learning Representations, 2022b. URL https://openreview.net/forum?id=RDlLMjLJXdq.

Zimmermann, R. S., Sharma, Y., Schneider, S., Bethge, M., and Brendel, W. Contrastive learning inverts the data generating process. In International Conference on Machine Learning, pp. 12979–12990. PMLR, 2021.

## Appendix

We organize the Appendix as follows.

- In App. A, we present the proofs of the theorems presented in the main body of the paper.
  - In App. A.1, we derive the affine identification guarantees and their approximations in various settings (Theorem 4.4).
  - In App. A.2, we derive the do-intervention-based identification guarantees and their extensions (Theorem 5.3).
  - In App. A.3, we present representation identification guarantees for imperfect interventions (Theorem 5.8).
  - In App. A.4, we present representation identification guarantees for observational data with independent support (Theorem 6.3).
- In App. B, we present supplementary material for the experiments.
  - In App. B.1, we present the pseudocode for the method used to learn the representations.
  - In App. B.2, we present the details of the setup used in the experiments with the polynomial decoder $g$.
  - In App. B.3, we present supplementary results for the setting with the polynomial decoder $g$.
  - In App. B.4, we present the details of the setup used in the experiments with image data.
  - In App. B.5, we present supplementary results for the setting with image data.

## A. Proofs and Technical Details

In this section, we provide the proofs of the theorems; we restate the theorems for convenience.

**Preliminaries and notation.** We state the formal definition of the support of a random variable. In most of this work, we operate on the measure space $(\mathbb{R}^d, \mathcal{B}, \lambda)$, where $\mathcal{B}$ is the Borel $\sigma$-field over $\mathbb{R}^d$ and $\lambda$ is the Lebesgue measure over the completion of the Borel sets on $\mathbb{R}^d$ (Ash et al., 2000). For a random variable $X$, the support is $\mathcal{X} = \{x \in \mathbb{R}^d : dP_X(x) > 0\}$, where $dP_X(x)$ is the Radon–Nikodym derivative of $P_X$ with respect to the Lebesgue measure over the completion of the Borel sets on $\mathbb{R}^d$. For the random variable $Z$, $\mathcal{Z}$ is the support of $Z$ in the observational data and $\mathcal{Z}_j$ is the support of its component $Z_j$; $\mathcal{Z}^{(i)}$ is the support of $Z$ when $Z_i$ is intervened on, and $\mathcal{Z}^{(i)}_j$ is the support of the component $Z_j$ in the intervened data.

### A.1. Affine Identification

**Lemma A.1.** If the matrix $G$ that defines the polynomial $g$ is full rank and $p > 0$, then $g$ is injective.

*Proof.* Suppose this is not the case and $g(z_1) = g(z_2)$ for some $z_1 \neq z_2$. Thus

$$G\,\big[1,\ z_1,\ z_1 \otimes z_1,\ \ldots,\ \underbrace{z_1 \otimes \cdots \otimes z_1}_{p\ \text{times}}\big]^{\top} = G\,\big[1,\ z_2,\ z_2 \otimes z_2,\ \ldots,\ \underbrace{z_2 \otimes \cdots \otimes z_2}_{p\ \text{times}}\big]^{\top}$$

$$\implies\ G\,\big[0,\ z_1 - z_2,\ z_1 \otimes z_1 - z_2 \otimes z_2,\ \ldots,\ \underbrace{z_1 \otimes \cdots \otimes z_1}_{p\ \text{times}} - \underbrace{z_2 \otimes \cdots \otimes z_2}_{p\ \text{times}}\big]^{\top} = 0.$$

Since $z_1 \neq z_2$, we have found a non-zero vector in the null space of $G$, which contradicts the fact that $G$ has full column rank. Therefore, it cannot be the case that $g(z_1) = g(z_2)$ for some $z_1 \neq z_2$, so $g$ has to be injective. ∎

**Lemma A.2.** If $v_1$ is a polynomial of degree $k_1$ and $v_2$ is a polynomial of degree $k_2$, then $v_1 v_2$ is a polynomial of degree $k_1 + k_2$.

*Proof.* We separate $v_i(z)$ into two parts: the terms of degree $k_i$, denoted $u_i(z)$, and the terms of degree less than $k_i$, denoted $w_i(z)$, for $i \in \{1, 2\}$. We obtain

$$v_1(z)\, v_2(z) = \big(u_1(z) + w_1(z)\big)\big(u_2(z) + w_2(z)\big) = u_1(z) u_2(z) + u_1(z) w_2(z) + u_2(z) w_1(z) + w_1(z) w_2(z). \qquad (10)$$

The maximum degree achievable by $u_1(z) u_2(z)$ is $k_1 + k_2$; for the other terms, the maximum degree is bounded above by $k_1 + k_2 - 1$. To prove the result, we need to show that $u_1(z) u_2(z)$ has degree exactly $k_1 + k_2$. We start with a simple case: suppose $u_1(z)$ and $u_2(z)$ do not share any component of $z$ on which they both depend. In that case, multiplying the leading-degree terms of $u_1$ and $u_2$ yields distinct terms of degree $k_1 + k_2$. Now suppose $u_1$ and $u_2$ both depend on $z_1$. We write $u_1(z)$ as

$$u_1(z) = \sum_{j:\ \sum_i d_{ji} = k_1} \theta_j \prod_{i=1}^{d} z_i^{d_{ji}} = \sum_{j:\ \sum_i d_{ji} = k_1} \theta_j\, c_j(z),$$

where $c_j(z) = \prod_i z_i^{d_{ji}}$ is a degree-$k_1$ monomial; note that for $j \neq q$, $c_j \neq c_q$. We write $u_2(z)$ as

$$u_2(z) = \sum_{j:\ \sum_i d_{ji} = k_2} \beta_j \prod_{i=1}^{d} z_i^{d_{ji}} = \sum_{j:\ \sum_i d_{ji} = k_2} \beta_j\, c_j(z).$$

We collect all the terms in $u_1$ that have the highest degree associated with $z_1$ and a non-zero coefficient $\theta_j$.
We denote this highest degree by $r$ and write these terms as

$$\sum_q \theta_q\, z_1^{r} \prod_{i=2}^{d} z_i^{d_{qi}} = \sum_q \theta_q\, z_1^{r}\, \omega_q(z),$$

where $\omega_q(z) = \prod_{i=2}^{d} z_i^{d_{qi}}$, $q \neq l \implies \omega_q \neq \omega_l$, and $r \geq 1$. From $u_2(z)$, we similarly collect the terms with the highest degree of $z_1$ and a non-zero coefficient $\beta_j$; denoting this highest degree by $s$, we write these terms as

$$\sum_t \beta_t\, z_1^{s} \prod_{i=2}^{d} z_i^{d_{ti}} = \sum_t \beta_t\, z_1^{s}\, \eta_t(z),$$

where $\eta_t(z) = \prod_{i=2}^{d} z_i^{d_{ti}}$, $t \neq l \implies \eta_t \neq \eta_l$, and $s \geq 1$. As a result, $u_1(z) u_2(z)$ contains the term

$$z_1^{r+s}\, \delta_1(z)\, \delta_2(z), \quad \text{where } \delta_1(z) = \sum_q \theta_q \omega_q(z) \text{ and } \delta_2(z) = \sum_t \beta_t \eta_t(z).$$

We use induction on the degree of the polynomials to prove the claim. We first establish the base case $k_1 = 1$ and $k_2 = 1$. Consider two polynomials $\rho_1^{\top} z$ and $\rho_2^{\top} z$; multiplying them gives $\sum_{i,j} \rho_{1i} \rho_{2j} z_i z_j$. Consider two cases. In case 1, the two polynomials have at least one non-zero coefficient for the same component $z_i$; in that case, we obtain a term $\rho_{1i}\rho_{2i} z_i^2$ with a non-zero coefficient, which establishes the base case. In case 2, the two polynomials share no non-zero coefficients; then each term with a non-zero coefficient is of the form $\rho_{1i} \rho_{2j} z_i z_j$, which establishes the base case. The other cases, with $k_1 = 0$ and $k_2 = 1$, or $k_2 = 0$ and $k_1 = 1$, or $k_1 = k_2 = 0$, are trivially true. Thus we have established the base case for all polynomials (of arbitrary dimension of $z$) of degree at most $k_1 = 1$ and $k_2 = 1$.

We can now assume that the claim is true for all polynomials $v_1$ of degree at most $k_1 - 1$ and all polynomials $v_2$ of degree at most $k_2 - 1$. As a result, the degree of $\delta_1(z)\delta_2(z)$ is $k_1 + k_2 - r - s$. We write $\delta_1 \delta_2$ as the sum of the terms of degree exactly $k_1 + k_2 - r - s$, denoted $\delta'(z)$, and the terms of degree less than $k_1 + k_2 - r - s$, denoted $\delta''(z)$. We can thus expand $z_1^{r+s}\, \delta_1(z)\delta_2(z)$ as

$$z_1^{r+s}\big(\delta'(z) + \delta''(z)\big). \qquad (11)$$

The degree of $z_1^{r+s}\delta''(z)$ is at most $k_1 + k_2 - 1$. The degree of $z_1^{r+s}\delta'(z)$ has to be $k_1 + k_2$, since $\delta'(z)$ does not depend on $z_1$ and has degree $k_1 + k_2 - r - s$. Note that this is the only term in the entire polynomial $u_1(z) u_2(z)$ associated with the highest power of $z_1$ (namely $z_1^{r+s}$), since all other products of terms ($c_j$, $c'_j$) carry a smaller power of $z_1$; hence its coefficient cannot be cancelled to zero. Therefore, the degree of $u_1 u_2$, and hence the degree of $v_1 v_2$, is $k_1 + k_2$. ∎

Recall $\hat{z} \triangleq f(x)$ and $a \triangleq f \circ g$. Since $f(x) = f \circ g(z) = a(z)$, we have $\hat{z} = a(z)$, where $a : \mathcal{Z} \cup \mathcal{Z}^{(i)} \to \hat{\mathcal{Z}} \cup \hat{\mathcal{Z}}^{(i)}$, with $\hat{\mathcal{Z}} = f(\mathcal{X})$ and $\hat{\mathcal{Z}}^{(i)} = f(\mathcal{X}^{(i)})$. We now show that $a$ is bijective.

**Lemma A.3.** Suppose the observational data and interventional data are generated from Eq. 1 and Eq. 2 respectively. The map $a$ that relates the output $\hat{z}$ of an encoder $f$ solving the reconstruction identity in Eq. 3 to the true latent $z$, via $\hat{z} = a(z)$, is bijective.

*Proof.* Observe that $a$ is surjective by construction. We now prove that $a$ is injective. Suppose it is not; then there exist $z_1, z_2 \in \mathcal{Z}$ with $z_1 \neq z_2$ and $\hat{z}_1 = a(z_1) = a(z_2) = \hat{z}_2$. Note that $a(z_1) = f(x_1)$ with $x_1 = g(z_1)$, and $a(z_2) = f(x_2)$ with $x_2 = g(z_2)$, so $f(x_1) = f(x_2)$. The decoder-encoder pair satisfies reconstruction, i.e., $h \circ f(x_1) = x_1$ and $h \circ f(x_2) = x_2$. Since $f(x_1) = f(x_2)$, we obtain $x_1 = x_2$, which implies $z_1 = z_2$ because $g$ is injective. This contradicts $z_1 \neq z_2$. Therefore, $\hat{z} = a(z)$ is bijective. ∎
2 respectively under Assumptions 4.1 and 4.2. The autoencoder that solves the reconstruction identity in Eq. 3 under Constraint 4.3 achieves affine identification, i.e., z Z Z(i), ˆz = Az + c, where ˆz is the encoder f s output, z is the true latent, A Rd d is invertible and c Rd. Proof. We start by restating the reconstruction identity. For all x X X (i) h(ˆz) = g(z) 1 ˆz ˆz ˆz ... ˆz ˆz | {z } p times 1 z z z ... z z | {z } p times Following the assumptions, h is restricted to be polynomial but f bears no restriction. If H = G and f = g 1, we get the ideal solution ˆz = z, thus a solution to the above identity exists. Since G has full column rank, we can select q rows of G such that G Rq q and rank( G) = q. Denote the corresponding matrix H that select the same rows as H. We restate the identity in Eq. 12 in terms of H and G as follows. For all z Z Z(i) Interventional Causal Representation Learning 1 ˆz ˆz ˆz ... ˆz ˆz | {z } p times 1 z z z ... z z | {z } p times 1 ˆz ˆz ˆz ... ˆz ˆz | {z } p times 1 z z z ... z z | {z } p times 1 ˆz ˆz ˆz ... ˆz ˆz | {z } p times z = A1ˆz + A2 ˆz ˆz + Ap ˆz ˆz | {z } p times +c, where A is a submatrix of G 1 H that describes the relationship between z and polynomial of ˆz, { Ai}p i=1 correspond to blocks of rows of A. Suppose at least one of A2, , Ap is non-zero. Among the matrices A2, , Ap which are non-zero, pick the matrix Ak with largest index k. Suppose row i of Ak has some non-zero element. Now consider the element in the row in the RHS of equation 13 corresponding to zp i . Observe that zp i is a polynomial of ˆz of degree kp, where k 2 (follows from Lemma A.2). In the LHS, we have a polynomial of degree at most p. In the LHS, we have a polynomial of degree at most p. The equality between LHS and RHS is true for all ˆz f(X X (i)). The difference of LHS and RHS is an analytic function. From Constraint 4.3 f(X X (i)) has a measure greater than zero. Therefore, we leverage Mityagin (2015) to conclude that the LHS is equal to RHS on entire Rd. If two polynomials are equal everywhere, then their respective coefficients have to be the same. Based on supposition, RHS has non zero coefficient for terms with degree kp while LHS has zero coefficient for terms higher than degree p. This leads to a contradiction. As a result, none of A2, , Ap can be non-zero. Thus z = A1ˆz + c. Next, we show that A1 is invertible, which immediately follows from Lemma A.3. A.1.1. EXTENSIONS TO SPARSE POLYNOMIAL g( ) Suppose g( ) is a degree p polynomial. Let us define the basis that generates g as 1 z z z ... z z | {z } p times Note that the number of terms in u(z) grows as q = Pp r=0 r+d 1 d 1 . In the previous proof, we worked with Interventional Causal Representation Learning 1 z z z ... z z | {z } p times where G Rn q was full rank. As a result, n has to be greater than q and also grow at least as Pp r=0 r+d 1 d 1 . In real data, we can imagine that the g( ) has a high degree. However, g can exhibit some structure, for instance sparsity. We now show that our entire analysis continues to work even for sparse polynomials thus significantly reducing the requirment on n to grow as the number of non-zero basis terms in the sparse polynomial. We write the basis for the sparse polynomial of degree p as u (z). u (z) consists of a subset of terms in u(z). We write the sparse polynomial g( ) as g(z) = Gu (z) We formally state the assumption on the decoder in this case as follows. Assumption A.4. 
The decoder g is a polynomial of degree p whose corresponding coefficient matrix G (a.k.a. the weight matrix) has full column rank. Specifically, the decoder g is determined by the coefficient matrix G as follows, g(z) = Gu (z) (14) where u (z) consists of a subset of terms in u(z). u (z) consists of the degree one term, i.e., z and at least one term of the form zo i , where o p+1 Theorem A.5. Suppose the observational data and interventional data are generated from Eq. 1 and Eq. 2 respectively under Assumptions 4.1, A.4. The autoencoder that solves reconstruction identity in Eq. 3 under Constraint 4.3 achieves affine identification, i.e., z Z Z(i), ˆz = Az + c, where ˆz is the output of the encoder f, z is the true latent, A is an invertible d d matrix and c Rd. Proof. We start by restating the reconstruction identity. For all x X X (i) h(ˆz) = g(z) 1 ˆz ˆz ˆz ... ˆz ˆz | {z } p times Following the assumptions, h is restricted to be polynomial but f bears no restriction. If H is equal to the matrix G for columns i where ui = u j for some j and zero in other columns and f = g 1, we get the ideal solution ˆz = z, thus a solution to the above identity exists. Since G has full column rank, we can select q rows of G such that G Rq q and rank( G) = q. Denote the corresponding matrix H that select the same rows as H. We restate the identity in Eq. 15 in terms of H and G as follows. For all z Z Z(i) Interventional Causal Representation Learning 1 ˆz ˆz ˆz ... ˆz ˆz | {z } p times 1 ˆz ˆz ˆz ... ˆz ˆz | {z } p times 1 ˆz ˆz ˆz ... ˆz ˆz | {z } p times z = A1ˆz + A2 ˆz ˆz + Ap ˆz ˆz | {z } p times +c In the simplification above, we rely on the fact that u (z) consists of the first degree term. Suppose at least one of A2, , Ap is non-zero. Among the matrices A2, , Ap which are non-zero, pick the matrix Ak with largest index k. Suppose row i of Ak has some non-zero element. Now consider the element in the row in the RHS of equation 16 corresponding to zo i . Observe that zo i is a polynomial of ˆz of degree ko, where k 2. In the LHS, we have a polynomial of degree at most p. The equality between LHS and RHS is true for all ˆz f(X X (i)). The difference of LHS and RHS is an analytic function. From Constraint 4.3 f(X X (i)) has a measure greater than zero. Therefore, we leverage Mityagin (2015) to conclude that the LHS is equal to RHS on entire Rd. If two polynomials are equal everywhere, then their respective coefficients have to be the same. Based on supposition, RHS has non zero coefficient for terms with degree p + 1 while LHS has zero coefficient for terms higher than degree p. This leads to a contradiction. As a result, none of A2, , Ap can be non-zero. Thus z = A1ˆz + c. Next, we need to show that A1 is invertible, which follows from Lemma A.3. A.1.2. EXTENSIONS TO POLYNOMIAL g( ) WITH UNKNOWN DEGREE The learner starts with solving the reconstruction identity by setting the degree of h( ) to be s; here we assume H has full rank (this implicitly requires that n is greater than the number of terms in the polynomial of degree s). 1 ˆz ˆz ˆz ... ˆz ˆz | {z } s times 1 z z z ... z z | {z } p times We can restrict H to rows such that it is a square invertible matrix H. Denote the corresponding restriction of G as G. The equality is stated as follows. Interventional Causal Representation Learning 1 ˆz ˆz ˆz ... ˆz ˆz | {z } s times 1 z z z ... z z | {z } p times If s > p, then ˆz ˆz | {z } s times is a polynomial of degree at least p + 1. 
Since the RHS contains a polynomial of degree at most p the two sides cannot be equal over a set of values of z with positive Lebesgue measure in Rd. Thus the reconstruction identity will only be satisfied when s = p. Thus we can start with the upper bound and reduce the degree of the polynomial on LHS till the identity is satisfied. A.1.3. EXTENSIONS FROM POLYNOMIALS TO ϵ-APPROXIMATE POLYNOMIALS We now discuss how to extend Theorem 4.4 to settings beyond polynomial g. Suppose g is a function that can be ϵapproximated by a polynomial of degree p on entire Z Z(i). In this section, we assume that we continue to use polynomial decoders h of degree p (with full rank matrix H) for reconstruction. We state this as follows. Constraint A.6. The learned decoder h is a polynomial of degree p and its corresponding coefficient matrix h is determined by H as follows. For all z Rd h(z) = H[1, z, z z, , z z | {z } p times ] (19) where represents the Kronecker product with all distinct entries. H has a full column rank. Since we use h as a polynomial, then satisfying the exact reconstruction is not possible. Instead, we enforce approximate reconstruction as follows. For all x X X (i), we want h f(x) x ϵ, (20) where ϵ is the tolerance on reconstruction error. Recall ˆz = f(x). We further simplify it as ˆz = f g(z) = a(z). We also assume that a can be η-approximated on entire Z Z(i) with a polynomial of sufficiently high degree say q. We write this as follows. For all z Z Z(i), z z z ... z z | {z } q times ˆz Θ1z Θ2 z z Θp z z | {z } q times We want to show that the norm of Θk for all k 2 is sufficiently small. We state some assumptions needed in theorem below. Assumption A.7. Encoder f does not take values near zero, i.e., fi(x) γη for all x X X (i) and for all i {1, , d}, where γ > 2. The absolute value of each element in H 1 G is bounded by a fixed constant. Consider the absolute value of the singular values of H; we assume that the smallest absolute value is strictly positive and bounded below by ζ. Theorem A.8. Suppose the true decoder g can be approximated by a polynomial of degree p on entire Z Z(i) with approximation error ϵ 2. Suppose a = f g can be approximated by polynomials on entire Z Z(i) with η error. If [ zmax, zmax]d Z Z(i), where zmax is sufficiently large, and Assumption 4.1, Assumption A.7 hold, then the polynomial Interventional Causal Representation Learning approximation of a (recall ˆz = a(z)) corresponding to solutions of approximate reconstruction identity in Eq. 20 under Constraint A.6 is approximately linear, i.e., the norms of the weights on higher order terms are sufficiently small. Specifically, the absolute value of the weight associated with term of degree k decays as 1 zk 1 max . Proof. We start by restating the approximate reconstruction identity. We use the fact that g can be approximated with a polynomial of say degree p to simplify the identity below. For all x X X (i) ˆz ˆz ˆz ... ˆz ˆz | {z } p times z z z ... z z | {z } p times z z z ... z z | {z } p times To obtain the second step from the first, add and subtract G[z, z z, , ˆz ˆz | {z } p times ] and use reverse triangle inequality. Since H is full rank, we select rows of H such that H is square and invertible. The corresponding selection for G is denoted as G. We write the identity in terms of these matrices as follows. ˆz ˆz ˆz ... ˆz ˆz | {z } p times z z z ... z z | {z } p times ˆz ˆz ˆz ... ˆz ˆz | {z } p times z z z ... 
z z | {z } p times 3ϵ 2|σmin( H)| where |σmin( H)| is the singular value with smallest absolute value corresponding to the matrix H. In the simplification above, we use the assumption that g is ϵ 2-approximated by a polynomial with matrix G and we also use the fact that |σmin( H)| is positive. Now we write that the polynomial that approximates ˆzi = ai(z) as follows. |ˆzi θ 1 z θ 2 z z θ q z z | {z } q times | η (24) ˆzi θ 1 z + θ 2 z z + θ q z z | {z } q times η ˆzi θ 1 z + θ 2 z z + θ q z z | {z } q times +η (25) From Assumption A.7 we know that ˆzi γη, where γ > 2. It follows from the above equation that θ 1 z + θ 2 z z + θ q z z | {z } q times +η γη = θ 1 z + θ 2 z z + + θ q z z | {z } q times (γ 1)η 0 = 1 γ 1 η θ 1 z + θ 2 z z + + θ q z z | {z } q times Interventional Causal Representation Learning For ˆzi γη, we track how ˆzp i grows below. ˆzi θ 1 z + θ 2 z z + θ q z z | {z } q times η (γ 2)η 0 ˆzp i (θ 1 z + θ 2 z z + θ q z z | {z } q times η)p ˆzp i (θ 1 z + θ 2 z z + θ q z z | {z } q times )p(1 1 γ 1)p In the last step of the above simplification, we use the condition in Eq. 26. We consider z = [zmax, , zmax]. Consider the terms θijzk max inside the polynomial in the RHS above. We assume all components of θ are positive. Suppose θij 1 zk κ 1 max , where κ (0, 1), then the RHS in Eq. 27 grows at least z(1+κ)p max γ 2 γ 1 p. From Eq. 23, ˆzp i is very close to degree p polynomial in z. Under the assumption that the terms in H 1 G are bounded by a constant, the polynomial of degree p grows at at most zp max. The difference in growth rates the Eq. 23 is an increasing function of zmax for ranges where zmax is sufficiently large. Therefore, the reconstruction identity in Eq. 23 cannot be satisfied for points in a sufficiently small neighborhood of z = [zmax, , zmax]. Therefore, θij < 1 zk κ 1 max . We can consider other vertices of the hypercube Z and conclude that |θij| < 1 zk κ 1 max . A.2. Representation identification under do interventions Theorem 5.3. Suppose the observational data and interventional data are generated from Eq. 1 and Eq. 2 respectively under Assumptions 4.1 and 4.2, where P(i) Z follows Eq. 4. The autoencoder that solves Eq. 3 under Constraint 4.3, Constraint 5.1 identifies the intervened latent zi up to shift and scaling, i.e., ˆzk = ezi + b, where e R, b R. Proof. First note that Assumptions 4.1-4.2 hold. Since we solve Eq. 3 under Constraint 4.3, we can continue to use the result from Theorem 4.4. From Theorem 4.4, it follows that the estimated latents ˆz are an affine function of the true z. ˆzk = a z + b, z Z Z(i), where a Rd, b R. We consider a z Z(i) such that z i is in the interior of the support of P(i) Z i. We write z Z(i) as [z , z i]. We can write ˆzk = aiz + a iz i + b, where a i is the vector of the values of coefficients in a other than the coefficient of ith dimension, ai is ith component of a, z i is the vector of values in z other than zi. From the constraint in Constraint 5.1 it follows that for all z Z(i), ˆzk = z . We use these expressions to carry out the following simplification. a iz i = z aiz b (28) Consider another data point z Z(i) from the same interventional distribution such that z i = z i + θej is in the interior of the support of P(i) Z i, where ej is vector with one in jth coordinate and zero everywhere else. From Assumption 5.2, we know that there exists a small enough θ such that z i is in the interior. Since the point is from the same interventional distribution z i = z . 
For z i we have a iz i = z aiz b (29) We take a difference of the two equations equation 28 and equation 29 to get a i(z i z i) = θa iej = 0. (30) From the above, we get that the jth component of a i is zero. We can repeat the above argument for all j and get that a i = 0. Therefore, ˆzk = aizi + b for all possible values of zi in Z Z(i). Interventional Causal Representation Learning A.2.1. EXTENSION OF do INTERVENTIONS BEYOND POLYNOMIALS In the main body of the paper, we studied the setting where g is a polynomial. We relax the constraint on g. We consider settings with multiple do interventional distribution on a target latent. We write the DGP for intervention j {1, , t} on latent i as z i P(i,j) Z i (31) Let T = {z ,1, , z ,t} be the set of do intervention target values. We extend the constrained representation learning setting from the main body, where the learner leverages the geometric signature of a single do intervention per latent dimension to multiple do interventional distributions per latent dimension. h f(x) = x, x X X (i,j) fk(x) = z ,j, x X (i,j), j {1, , t} (32) Recall that the ˆz = f(x) = f g(z) = a(z). Consider the kth component ˆzk = ak(z). Suppose ak(z) is invertible and only depends on zi, we can write it as ak(zi). If ˆzk only depends on zi, i.e., ˆzk = ak(zi) and ak is invertible, then the zi is identified up to an invertible transform. Another way to state the above property is z iak(z) = 0 for all z i. In what follows, we show that it is possible to approximately achieve identification up to an invertible transform. We show that if the number of interventions t is sufficiently large, then z iak(z) ϵ for all z Z. Assumption A.9. The interior of the support of z in the observational data, i.e., Z, is non-empty. The interior of the support of z i in the interventional data, i.e., Z(i,j) i , is equal to the support in observational data, i.e., Z i, for all j {1, , t}. Each intervention z ,j is sampled from a distribution Q. The support of Q is equal to the support of zi in the observational data, i.e., Zi. The density of Q is greater than ϱ (ϱ > 0) on the entire support. The above assumption states the restrictions on the support of the latents underlying the observational data and the latents underlying the interventional data. Assumption A.10. 2a(z) zi zj is bounded by L < for all z Z and for all i, j {1, , d}. Lemma A.11. If the number of interventions t log( δϵ 2(βisup+βi inf))/ log(1 ϱ ϵ 2), then maxzi Zi minz ,j T zi z ,j ϵ with probability 1 δ. Proof. Consider the interval [ βi inf, βi sup], where βi inf and βi sup are the infimum and supremum of Zi. Consider an ϵ 2 covering of [ βi inf, βi sup]. This covering consists of 2(βi sup+βi inf) ϵ equally spaced points at a separation of ϵ/2. Consider a point zi, its nearest neighbor in the cover is denoted as z l, and the nearest neighbor of zi in the set of interventions T is z ,j. The nearest neighbor of z l in the set of interventions is z ,r. Since zi z ,j zi z ,q for all q {1, , t} we can write zi z ,j zi z ,r zi z l + z l z ,r ϵ 2 + z l z ,r (33) Observe that if z l z ,r is less than ϵ 2 for all z l in the cover, then for all zi in Zi, zi z ,j is less than ϵ. We now show that z l z ,r is sufficiently small provided t is sufficiently large. Observe that P( z l z ,r > ϵ We would like that (1 ϱ ϵ 2)t δ, which implies t log(δ)/ log(1 ϱ ϵ 2). Therefore, if t log(δ)/ log(1 ϱ ϵ 2), then P( z l z ,r ϵ 2) with a probability at least 1 δ. 
If we set δ = δϵ 2(βisup+βi inf), then we obtain that for all j, P( z l z ,r ϵ 2) with probability at least 1 δ. The final expression for t log( δϵ 2(βisup+βi inf))/ log(1 ϱ ϵ Interventional Causal Representation Learning Theorem A.12. Suppose the observational data and interventional data are generated from Eq. 1 and Eq. 31 respectively. If the number of interventions t is sufficiently large, i.e., t log( δϵ 2L(βisup+βi inf))/ log(1 ϱ ϵ 2L), Assumption A.9 and Assumption A.10 are satified, then the solution to Eq. 32 identifies the intervened latent zi approximately up to an invertible tranform, i.e., z iak(z) ϵ for all z Z. Proof. Recall ˆz = f(x) = f g(z) = a(z), where a : j Z(i,j) Z j ˆZ(i,j) ˆZ. Consistent with the notation used earlier in the proof of Theorem 4.4, ˆZ(i,j) = f(X (i,j)). In Lemma A.3, we had shown that a is bijective, we can use the same recipe here and show that a is bijective. Owing to the constraint in Eq. 32, we claim that z iak(z) = 0 for all z i in the interior of Z i with zi = z ,j. Consider a ball around z i that is entirely contained in Z i, denote it as Bz. From Eq. 32, it follows that fk(x) takes the same value on this neighborhood. As a result, ak(z) is equal to a constant on the ball Bz. Therefore, it follows that z iak(z) = 0 on the ball Bz. We can extend this argument to all the points in the interior of the support of z i. As a result, z iak(z) = 0 on the interior of the support of z i. Further, z iak(z) = 0 for all z = [z ,j, z i] in j Z(i,j). Define ℵ(z) = z iak(z). Consider the jth component of ℵ(z) denoted as ℵj(z). Consider a point z Z and find its nearest neighbor in j Z(i,j) and denote it as z . Following the assumptions, z i = z i. We expand ℵj(z) around z as follows ℵj(z) = ℵj(z ) + zℵj(z ) (z z ) ℵj(z) = ℵj(z ) zi (zi z i) In the above, we use the fact that ℵj(z ) = 0. |ℵj(z)| = ℵj(z ) zi (zi z i) ℵj(z ) zi To see the last inequality in the above, use Lemma A.11 with ϵ as ϵ/L and Assumption A.10. In the discussion above, we showed that multiple do interventional distribution on target latent dimension help achieve approximate identification of a latent up to an invertible transform. The above argument extends to all latents provided we have data with multiple do interventional distributions per latent. We end this section by giving some intuition as to why multiple interventions are necessary in the absence of much structure on g. Necessitating multiple interventions We consider the case with one do intervention. Consider the set of values achieved under intervention, where z i is from the interior of Z(i) i. We call this set Z(i) Suppose a is a bijection of the following form. ( I, if z is in Z(i) a otherwise (34) where I is identity function and a is an arbitrary bijection with bounded second order derivative (satisfying Assumption A.10). Define f = a g 1 and h = g a 1. Observe that these f and h satisfy both the constraints in the representation learning problem in Constraint 5.1. In the absence of any further assumptions on g or structure of support of Z, each intervention enforces local constraints on a. A.3. Representation identification under general perfect and imperfect interventions Before proving Theorem 5.8, we prove a simpler version of the theorem, which we leverage to prove Theorem 5.8. We start with the case when the set S has one element say S = {j}. Assumption A.13. Consider the Z that follow the interventional distribution P(i) Z . 
The joint support of zi, zj satisfies factorization of support, i.e., Z(i) i,j = Z(i) i Z(i) j (35) For all j {1, , d}, < αj inf αj sup < . There exists a ζ > 0 such that the all the points in (αj sup ζ, αj sup) (αj inf, αj inf + ζ) are in Z(i) j , j {1, , d} Interventional Causal Representation Learning The above assumption only requires support independence for two random variables Zi and Zj. We now describe a constraint, where the learner enforces support independence between ˆzi and ˆzj. Constraint A.14. The pair (ˆzi, ˆzj) satisfies support independence on interventional data, i.e., ˆZ(i) i,j = ˆZ(i) i ˆZ(i) j In the above Constraint A.14, we use same indices i and j as in Assumption A.13 for convenience, the arguments extend to the case where we use a different pair. Theorem A.15. Suppose the observational data and interventional data are generated from Eq. 1 and Eq. 2 respectively under Assumptions 4.1, 4.2, A.13. The autoencoder that solves Eq. 3 under Constraint 4.3, A.14 achieves block affine identification, i.e., z Z, ˆz = Az + c, where ˆz is the output of the encoder f and z is the true latent and A is an invertible d d matrix and c Rd. Further, the matrix A has a special structure, i.e., the row ai and aj do not have a non-zero entry in the same column. Also, each row ai and aj has at least one non-zero entry. Proof. Let us first verify that there exists a solution to Eq. 3 under Constraint 4.3, A.14. If ˆZ = Z and h = g, then that suffices to guarantee that a solution exists. First note that since Assumptions 4.1, 4.2 holds and we are solving Eq. 3 under Constraint 4.3, we can continue to use the result from Theorem 4.4. From Theorem 4.4, z Z Z(i), ˆz = Az + c, where ˆz is the output of the encoder f and z is the true latent and A is an invertible d d matrix and c Rd. From Assumption A.13 we know each component k {1, , d} of z, zk is bounded above and below. Suppose the minimum and maximum value achieved by zk Z(i) k is αk inf and the maximum value achieved by zk Z(i) k is αk sup . Define a new latent z k = 2 zk αk sup+αk inf 2 αksup αk inf , k {1, , d} Notice post this linear operation, the new latent takes a maximum value of 1 and a minimum value of 1. We start with ˆz = Az + c, where z is element-wise transformation of z that brings its maximum and minimum value of each component to 1 and 1. Following the above transformation, we define the left most interval for z i as [ 1, 1 + ηi] and the rightmost interval is [1 ζi, 1], where ηi > 0 and ζi > 0. Such an interval exists owing to the Assumption A.13. Few remarks are in order. i) Here we define intervals to be closed from both ends. Our arguments also extend to the case if these intervals are open from both ends or one end, ii) We assume all the values in the interval [ 1, 1 + ηi] are in the support. The argument presented below extends to the case when all the values in [ 1, 1 + ηi] are assumed by z i except for a set of measure zero, iii) The assumption A.13 can be relaxed by replacing supremum and infimum with essential supremum and infimum. For a sufficiently small κ, we claim that the marginal distribution of ˆzi and ˆzj contain the sets defined below. Formally stated ( ai 1 + ci, ai 1 + ci + κ) ( ai 1 + ci κ, ai 1 + ci) ˆZ(i) i (36) ( aj 1 + cj, aj 1 + cj + κ) ( aj 1 + cj κ, aj 1 + cj) ˆZ(i) j (37) where ai and aj are ith and jth row in matrix A. We justify the above claim next. Suppose all elements of ai are positive. We set κ sufficiently small such that κ |aik|d ηk for all k {1, , d}. 
Since κ is sufficiently small, [ 1, 1 + κ |aik|d] in the support z k, this holds for all k {1, , d}. As a result, ( ai 1 +ci, ai 1 +ci +κ) is in the support of ˆzk. We can repeat the same argument when the signs of ai are not all positive by adjusting the signs of the elements z . This establishes ( ai 1 + ci, ai 1 + ci + κ) ˆZ(i) i . Similarly, we can also establish that ( ai 1 + ci κ, ai 1 + ci) ˆZ(i) i . Suppose the two rows ai and aj share at least q 1 non-zero entries. Without loss of generality assume that ai1 is non-zero and aj1 is non-zero. Pick an 0 < ϵ < κ Interventional Causal Representation Learning Suppose ai1 and aj1 are both positive. In this case, if ˆzi < ai 1 + ci + ϵ, then z 1 < 1 + 2ϵ To see why is the case, substitute z 1 = 1 + 2ϵ |ai1| and observe that ˆzi > ai 1 + ci + ϵ. Suppose ai1 and aj1 are both positive. In this case, if ˆzj > aj 1 + cj ϵ, then z 1 > 1 2ϵ |aj1| For sufficiently small ϵ (ϵ < 1 1/|ai1|+|aj1|) both z 1 < 1 + 2ϵ ai1 and z 1 > 1 2ϵ aj1 cannot be true simultaneously. Therefore, ˆzi < ai 1 +ci +ϵ and ˆzj > aj 1 +cj ϵ cannot be true simultaneously. Individually, ˆzi < ai 1 +ci +ϵ occurs with a probability greater than zero; see Eq. 36. Similarly, ˆzj > aj 1 + cj ϵ occurs with a probability greater than zero; see Eq. 37. This contradicts the support independence constraint. For completeness, we present the argument for other possible signs of a. Suppose ai1 is positive and aj1 is negative. In this case, if ˆzi < ai 1 + ci + ϵ, then z 1 < 1 + 2ϵ Suppose ai1 is positive and aj1 is negative. In this case, if ˆzj < aj 1 + cj + ϵ, then z 1 > 1 2ϵ |aj1| Rest of the above case is same as the previous case. We can apply the same argument to any shared non-zero component. Note that a row ai cannot have all zeros or all non-zeros (then aj has all zeros). If that is the case, then matrix A is not invertible. This completes the proof. We now use the result from Theorem A.15 to prove the Theorem 5.8. Theorem 5.8. Suppose the observational data and interventional data are generated from Eq. 1 and Eq. 2 respectively under Assumptions 4.1, 4.2, 5.5. The autoencoder that solves Eq. 3 under Constraint 4.3, 5.6 (with |S | |S|) achieves block affine identification. More specifically, z Z Z(i) ˆzk = a k z + ck, ˆzm = a mz + cm, m S , where ak contains at most d |S | non-zero elements and each component of am is zero whenever the corresponding component of ak is non-zero for all m S . Proof. Let us first verify that there exists a solution to Eq. 3 under Constraint 4.3, 5.6 (with |S | |S|). We write ˆZ = ΠZ, where Π is a permutation matrix such that ˆZk = Zi. For each m S there exists a unique j S such that ˆZm = Zj. Suppose h = g Π 1. Observe that this construction satisfies the constraints in Constraint 5.6. To show the above claim, we leverage Theorem A.15. We apply Theorem A.15 to all the pairs in {(k, m), m S }, we obtain the following. We write ˆzk = a k z + ck. Without loss of generality, assume ak is non-zero in first s elements. Now consider any ˆzm = a mz + cm, where m S . From Theorem A.15 it follows that am[1 : s] = 0. This holds true for all m S . Suppose s d |S | + 1. In this case, the first s columns cannot be full rank. Consider the submatrix formed by the first s columns. In this submatrix |S | rows are zero. The maximum rank of this matrix is d |S |. If s d |S | + 1, then this submatrix would not have a full column rank, which contradicts the fact that A is invertible. Therefore, 1 s d |S |. We can relax the assumption that |S | |S| in the above theorem. 
We follow an iterative procedure. We start by solving Constraint 5.6 with |S | = d 1. If a solution exists, then we stop. If a solution does not exist, then we reduce the size of |S | by one and repeat the procedure till we find a solution. As we reach |S | = |S| a solution has to exist. Interventional Causal Representation Learning A.4. Representation identification with observational data under independent support Theorem 6.3. Suppose the observational data is generated from Eq. 1 under Assumption 4.1, 4.2, and 6.1, The autoencoder that the solves Eq. 3 under Constraint 6.2 achieves permutation, shift and scaling identification. Specifically, z Z, ˆz = ΛΠz + c, where ˆz is the output of the encoder f and z is the true latent and Π is a permutation matrix and Λ is an invertible diagonal matrix. Proof. We will leverage Theorem A.15 to show this claim. Consider ˆzi = a T i z + ci. We know that the ai has at least one non-zero element. Suppose it has at least q 2 non-zero elements. Without loss of generality assume that these correspond to the first q components. We apply Theorem A.15 to each pair ˆzi, ˆzj for all j = i. Note here i is kept fixed and then Theorem A.15 is applied to every possible pair. From the theorem we get that aj[1 : q] is zero for all j = i. If q 2, then the span of first q columns will be one dimensional and as a result A cannot be invertible. Therefore, only one element of row i is non-zero. We apply the above argument to all i {1, , d}. We write a function π : {1, , d} {1, , d}, where π(i) is the index of the element that is non-zero in row i, i.e., ˆzi = aiπ(i)zπ(i) + ci. Note that π is injective, if two indices map to the same element, then that creates shared non-zero coefficients, which violates Theorem A.15. This completes the proof. Interventional Causal Representation Learning B. Supplementary Materials for Empirical Findings B.1. Method details Algorithm 1 Summarizing our two step approach for both the independence of support (IOS) and interventional data case. 1: {Step 1: Training autoencoder (f, h)} 2: Sample data: X X X I where X I = d i=1X (i) 3: Minimize reconstruction loss: f , h = arg minf,h E h f(X) X 2 4: 5: {Step 2: Learning transformation Γ with Independence of Support (IOS) objective} 6: Sample data: ˆZ f (X) where f is the encoder learnt in Step 1 7: Minimize reconstruction + Hausdorff loss: minΓ E Γ Γ( ˆZ) ˆZ 2 + λ P k =m HD ˆZk,m(Γ), ˆZk(Γ) ˆZm(Γ) 8: Return transformed latents: Γ( ˆZ) 9: 10: {Step 2: Learning transformation Γ = [γi]i=1:d using do-interventions} 11: for i in {1, , d} do 12: Sample data: ˆZ f (X (i)) where f is the encoder learnt in Step 1s 13: Fix intervention targets at random ˆY (i) Uniform(0, 1) 14: Minimize MSE loss: minγi E ˆ Z γi( ˆZ) ˆY (i) 2 15: end for 16: Return transformed latents: Γ( ˆZ) We provide details about our training procedure in Algorithm 1. For learning with the independence of support (IOS) objective in Step 2, we need to ensure that the map Γ is invertible, hence we minimize a combination of reconstruction loss with Hausdorff distance, i.e., min Γ E Γ Γ( ˆZ) ˆZ 2 + λ X k =m HD ˆZk,m(Γ), ˆZk(Γ) ˆZm(Γ) (38) where ˆZ denotes the output from the encoder learnt in Step 1, i.e., ˆZ = f (X). If we have data with multiple interventional distributions per latent dimension, then we sample a new target for each interventional distribution. In our polynomial decoder experiments, we use a linear γi. In our image based experiments, in Step 2, we use a non-linear map γi. B.2. 
Experiment setup details: Polynomial decoder (g)

Basic setup. We sample data following the DGP described in Assumption 4.2 with the following details:
Latent dimension: $d \in \{6, 10\}$
Degree of the decoder polynomial $g$: $p \in \{2, 3\}$
Data dimension: $n = 200$
Decoder polynomial coefficient matrix $G$: each element of the matrix is sampled i.i.d. from a standard normal distribution.

Latent distributions. Recall that $z_i$ is the $i$-th component of the latent vector $z \in \mathbb{R}^d$. The various latent distributions ($P_Z$) we use in our experiments are as follows:

Uniform: Each latent component $z_i$ is sampled from Uniform(−5, 5). All the latents $z_i$ are independent and identically distributed.

Uniform-Correlated: Consider a pair of latent variables $z_i, z_{i+1}$ and sample two confounder variables $c_1, c_2$ such that $c_1 \sim \mathrm{Bernoulli}(p = 0.5)$ and $c_2 \sim \mathrm{Bernoulli}(p = 0.9)$. We then sample $z_i, z_{i+1}$ using $c_1, c_2$ as follows:
$$z_i \sim \begin{cases} \mathrm{Uniform}(0.0, 0.5) & \text{if } c_1 = 1 \\ \mathrm{Uniform}(-0.5, 0.0) & \text{if } c_1 = 0 \end{cases}, \qquad z_{i+1} \sim \begin{cases} \mathrm{Uniform}(0.0, 0.3) & \text{if } c_1 \oplus c_2 = 1 \\ \mathrm{Uniform}(-0.3, 0.0) & \text{if } c_1 \oplus c_2 = 0 \end{cases},$$
where $\oplus$ is the xor operation. Hence, $c_1$ acts as a confounder since it is involved in the generation process of both $z_i$ and $z_{i+1}$, which leads to correlation between them. Due to the xor operation, the two random variables satisfy the independence-of-support condition. Finally, we follow this generation process to generate the latent vector $z$ by iterating over pairs, i.e., over $i \in \{1, \ldots, d\}$ with step size 2.

Gaussian-Mixture: Each $z_i$ is sampled from a Gaussian mixture model with two components and equal probability of sampling from each component:
$$z_i \sim \begin{cases} \mathcal{N}(0, 1) & \text{with prob. } 0.5 \\ \mathcal{N}(1, 2) & \text{with prob. } 0.5 \end{cases}$$
All latents in this case are independent and identically distributed, as in the Uniform case, though each follows a mixture distribution rather than a unimodal one.

SCM-S: The latent variable $z$ is sampled from a DAG with $d$ nodes generated via the Erdős–Rényi scheme with linear causal mechanisms and Gaussian noise (Brouillard et al., 2020),² with the expected density (expected number of edges per node) set to 0.5.

SCM-D: The latent variable $z$ is sampled from a DAG with $d$ nodes generated via the Erdős–Rényi scheme with linear causal mechanisms and Gaussian noise (Brouillard et al., 2020), with the expected density (expected number of edges per node) set to 1.0.

Case | Train | Validation | Test
Observational (D) | 10000 | 2500 | 20000
Interventional (D(I)) | 10000 | 2500 | 20000
Table 5. Statistics for the synthetic poly-DGP experiments.

Further details on dataset and evaluation. For the experiments in Table 2, we only use observational data (D), while for the experiments in Table 3, we use both observational and interventional data (D ∪ D(I)), with the train/val/test splits described in Table 5. We carry out do interventions on each latent, with D(i) denoting the data from interventions on $z_i$. The union of the data from interventions across all latent dimensions is denoted D(I) = ∪_{i=1:d} D(i). The index of the variable to be intervened on is sampled from Uniform({1, . . . , d}), and the selected latent variable is set to the value 2.0 under the intervention. Further, note that for learning the linear transformation $\gamma_i$ in Step 2 (Eq. 7), we only use the corresponding interventional data D(i) from the do intervention on latent variable $i$. Also, all the metrics ($R^2$, MCC (IOS), MCC, MCC (IL)) are computed only on the test split of the observational data D (no interventional data is used).

² https://github.com/slachapelle/dcdi

Model architecture.
We use the following architecture for the encoder f across all the experiments with polynomial decoder g (Table 2, Table 3) to minimize the reconstruction loss; Linear Layer (n, h); Leaky Re LU(0.5), Linear Layer (h, h); Leaky Re LU(0.5), Linear Layer (h, d), where n is the input data dimension and h is hidden units and h = 200 in all the experiments. For the architecture for the decoder (h) in Table 2, Table 3, we use the polynomial decoder (h(z) = H[1, z, z z, , z z | {z } p times ] ); where p is set to be same as that of the degree of true decoder polynomial (g(z)) and the coefficient matrix H is modeled using a single fully connected layer. For the independence of support (IOS) experiments in Table 2, we model both Γ, Γ using a single fully connected layer. For the interventional data results (Table 3), we learn the mappings γi from the corresponding interventional data (P(i) X ) using the default linear regression class from scikit-learn (Pedregosa et al., 2011) with the intercept term turned off. Finally, for the results with NN Decoder h (Table 8, Table 9), we use the following architecture for the decoder with number of hidden nodes h = 200. Linear layer (d, h); Leaky Re LU(0.5) Linear layer (h, h); Leaky Re LU(0.5) Linear layer (h, n) Hyperparameters. We use the Adam optimizer with hyperparameters defined below. We also use early stopping strategy, where we halt the training process if the validation loss does not improve over 10 epochs consecutively. Batch size: 16 Weight decay: 5 10 4 Total epochs: 200 Learning rate: optimal value chosen from grid: {10 3, 5 10 4, 10 4} For experiments with independence of support (IOS) objective in Step 2 (Table 2), we train with λ = 10 as the relative weight of Hausdorff distance in the reconstruction loss (Equation 38). B.3. Additional results: Polynomial decoder (g) Table 6 presents additional details about Table 2 in main paper. We present additional metrics like mean squared loss for autoencoder reconstruction task (Recon-MSE) and MCC computed using representations from Step 1. Note that training with independence of support objective in Step 2 leads to better MCC scores than using the representations from Step 1 on distributions that satisfy independence of support. Also, the Uniform Correlated (Uniform-C) latent case can be interpreted as another sparse SCM with confounders between latent variables. For this case, the latent variables are not independent but their support is still independent, therefore we see improvement in MCC with IOS training in Step 2. Similarly, Table 7 presents the extended results for the interventional case using polynomial decoder (Table 3 in main paper); with additional metrics like mean squared loss for autoencoder reconstruction task (Recon-MSE) and R2 to test for affine identification using representations from Step 1. We notice the same pattern for all latent distributions, that training on interventional data on Step 2 improves the MCC metric. Further, we also experiment with using a neural network based decoder to have a more standard autoencoder architecture where we do not assume access to specific polynomial structure or the degree of the polynomial. Table 8 presents the results Interventional Causal Representation Learning with NN decoder for the observational case, where we see a similar trend to that of polynomial decoder case (Table 6) that the MCC increase with IOS training in Step 2 for Uniform and Uniform-C latent distributions. 
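Since several of the results above and below depend on the Step-2 independence-of-support (IOS) objective of Eq. 38 (with λ = 10 and Γ, Γ̄ modeled as single fully connected layers, as stated in the hyperparameters above), the following is a minimal sketch of that objective. It is not the authors' code: the Monte-Carlo Hausdorff estimator and the shuffled pairing used to sample from the product of marginal supports are illustrative assumptions.

```python
# Minimal sketch of the Step-2 IOS objective (Eq. 38): reconstruction through
# (Gamma, Gamma_bar) keeps Gamma approximately invertible, while a Hausdorff
# term pushes the joint support of every latent pair towards the product of
# its marginal supports.
import torch
import torch.nn as nn

def hausdorff(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    D = torch.cdist(A, B)  # pairwise Euclidean distances, shape (|A|, |B|)
    return torch.max(D.min(dim=1).values.max(), D.min(dim=0).values.max())

def ios_loss(z_hat: torch.Tensor, Gamma: nn.Module, Gamma_bar: nn.Module,
             lam: float = 10.0) -> torch.Tensor:
    u = Gamma(z_hat)                               # transformed latents Gamma(Z_hat)
    loss = ((Gamma_bar(u) - z_hat) ** 2).mean()    # invertibility via reconstruction
    n, d = u.shape
    for k in range(d):
        for m in range(k + 1, d):                  # each unordered latent pair once
            joint = u[:, [k, m]]                   # sample of the joint support
            perm = torch.randperm(n, device=u.device)
            product = torch.stack([u[:, k], u[perm, m]], dim=1)  # product of marginals
            loss = loss + lam * hausdorff(joint, product)
    return loss
```

In the polynomial-decoder experiments above, Gamma and Gamma_bar would simply be nn.Linear(d, d) layers trained with this loss on the Step-1 encoder outputs.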
Similarly, Table 9 presents the results with NN decoder for the interventional case, where the trend is similar to that of polynomial decoder case (Table 7); though the MCC (IL) for the SCM sparse and SCM dense case are lower compared to that with polynomial decoder case. PZ d p Recon-MSE R2 MCC MCC (IOS) Uniform 6 2 1.59 0.40 1.00 0.00 66.91 2.45 99.31 0.07 Uniform 6 3 1.81 0.40 1.00 0.00 75.14 3.93 99.39 0.06 Uniform 10 2 2.04 0.76 1.00 0.00 58.49 2.26 90.73 2.92 Uniform 10 3 8.59 2.15 0.99 0.00 56.77 0.60 94.62 1.50 Uniform-C 6 2 0.36 0.07 1.00 0.00 71.19 2.29 96.81 0.11 Uniform-C 6 3 1.72 0.67 1.00 0.00 70.53 1.1 96.29 0.05 Uniform-C 10 2 0.86 0.27 1.00 0.00 64.58 1.81 85.31 2.35 Uniform-C 10 3 2.42 0.47 1.00 0.00 62.69 0.92 87.20 1.77 Gaussian-Mixture 6 2 0.86 0.27 1.0 0.0 70.53 1.25 67.43 2.01 Gaussian-Mixture 6 3 0.86 0.32 0.99 0.0 66.19 1.38 67.94 1.42 Gaussian-Mixture 10 2 1.38 0.51 1.0 0.0 59.5 2.22 58.3 0.67 Gaussian-Mixture 10 3 4.12 1.70 0.99 0.0 57.15 0.43 59.08 1.11 SCM-S 6 2 1.52 0.70 0.96 0.02 71.77 1.43 72.61 1.48 SCM-S 6 3 2.25 0.51 0.87 0.07 73.14 3.44 70.56 1.54 SCM-S 10 2 4.23 1.13 0.99 0.0 64.35 2.0 65.86 1.32 SCM-S 10 3 2.83 0.85 0.90 0.05 61.95 0.98 58.77 1.27 SCM-D 6 2 1.34 0.26 0.97 0.01 75.25 2.85 61.61 4.36 SCM-D 6 3 1.20 0.55 0.81 0.11 82.9 3.11 65.19 2.70 SCM-D 10 2 2.89 0.79 0.83 0.10 67.49 2.32 69.64 3.09 SCM-D 10 3 1.55 0.39 0.72 0.15 66.4 1.86 60.1 1.16 Table 6. Observational data with Polynomial Decoder: Mean S.E. (5 random seeds). R2 and MCC (IOS) achieve high values (for Uniform & Uniform-C) as predicted Theorem 4.4 and Theorem 6.3 respectively. PZ d p Recon-MSE R2 MCC MCC (IL) Uniform 6 2 0.29 0.08 1.0 0.0 69.11 1.11 100.0 0.0 Uniform 6 3 0.97 0.36 1.0 0.0 73.42 0.49 100.0 0.0 Uniform 10 2 2.29 0.85 1.0 0.0 59.96 2.03 100.0 0.0 Uniform 10 3 2.74 0.36 1.0 0.0 65.94 0.80 99.85 0.03 Uniform-C 6 2 0.29 0.11 1.0 0.0 71.2 2.46 100.0 0.0 Uniform-C 6 3 1.50 0.62 1.0 0.0 70.21 1.90 99.97 0.01 Uniform-C 10 2 0.79 0.24 1.0 0.0 61.02 1.03 100.0 0.0 Uniform-C 10 3 1.72 0.45 1.0 0.0 61.16 1.59 99.91 0.01 Gaussian-Mixture 6 2 0.75 0.27 1.0 0.0 67.72 2.20 99.99 0.01 Gaussian-Mixture 6 3 0.57 0.20 0.99 0.0 70.21 2.74 99.39 0.05 Gaussian-Mixture 10 2 0.61 0.16 1.0 0.0 60.77 1.60 99.98 0.01 Gaussian-Mixture 10 3 2.29 0.72 0.99 0.0 57.81 1.16 99.46 0.05 SCM-S 6 2 0.21 0.04 0.99 0.0 68.41 0.90 99.53 0.38 SCM-S 6 3 0.93 0.18 0.99 0.0 74.12 2.32 99.25 0.34 SCM-S 10 2 0.63 0.17 1.0 0.0 68.01 2.36 99.92 0.03 SCM-S 10 3 1.29 0.31 0.97 0.01 66.81 1.10 98.8 0.13 SCM-D 6 2 0.81 0.05 0.99 0.01 71.8 3.77 99.64 0.12 SCM-D 6 3 0.75 0.26 0.98 0.01 79.48 3.45 98.22 1.07 SCM-D 10 2 0.76 0.15 0.98 0.01 70.78 1.89 95.3 2.24 SCM-D 10 3 0.96 0.22 0.97 0.0 70.08 2.80 97.24 0.88 Table 7. Interventional data with Polynomial Decoder: Mean S.E. (5 random seeds). MCC(IL) is high as predicted by Theorem 5.3. 
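The MCC scores reported in these tables are mean correlation coefficients between the true and estimated latents. The snippet below is a minimal sketch of the standard way this score is computed in the nonlinear-ICA literature (match latents one-to-one with the Hungarian algorithm, then average the matched absolute correlations); the exact estimator behind the reported numbers may differ.

```python
# Minimal sketch of MCC: cross-correlate true and estimated latents, find the
# best one-to-one matching, and average the matched absolute correlations.
import numpy as np
from scipy.optimize import linear_sum_assignment

def mean_correlation_coefficient(z_true: np.ndarray, z_hat: np.ndarray) -> float:
    d = z_true.shape[1]
    corr = np.corrcoef(z_true.T, z_hat.T)[:d, d:]      # (d, d) cross-correlation block
    row, col = linear_sum_assignment(-np.abs(corr))    # maximize total |correlation|
    return float(np.abs(corr[row, col]).mean() * 100)  # reported in percent
```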
Interventional Causal Representation Learning PZ d p Recon-MSE R2 MCC MCC (IOS) Uniform 6 2 1.22 0.19 0.98 0.0 73.75 2.85 99.05 0.02 Uniform 6 3 2.79 0.20 0.92 0.0 63.29 1.06 95.74 0.12 Uniform 10 2 3.66 0.39 0.99 0.0 61.71 1.16 94.25 2.13 Uniform 10 3 33.16 3.34 0.94 0.0 59.27 1.06 91.24 4.99 Uniform-C 6 2 0.65 0.10 0.96 0.02 68.46 1.94 94.95 1.83 Uniform-C 6 3 1.39 0.30 0.91 0.0 68.09 1.56 89.14 2.38 Uniform-C 10 2 1.78 0.09 0.99 0.0 62.63 2.05 88.88 3.28 Uniform-C 10 3 12.0 1.59 0.91 0.01 59.91 1.75 81.76 3.67 Gaussian-Mixture 6 2 0.49 0.12 0.95 0.0 72.59 2.03 65.33 1.11 Gaussian-Mixture 6 3 0.79 0.16 0.84 0.01 66.25 2.86 63.43 1.27 Gaussian-Mixture 10 2 1.38 0.18 0.95 0.0 57.12 1.52 54.76 1.26 Gaussian-Mixture 10 3 7.22 1.23 0.83 0.01 55.41 1.40 52.87 0.86 SCM-S 6 2 2.24 1.11 0.59 0.18 69.77 3.87 66.04 1.34 SCM-S 6 3 2.45 0.18 0.74 0.05 73.72 1.63 67.66 2.18 SCM-S 10 2 6.41 1.71 0.78 0.08 65.99 1.14 63.52 1.11 SCM-S 10 3 4.32 1.37 0.11 0.43 66.96 2.60 62.11 1.36 SCM-D 6 2 2.7 0.39 0.63 0.22 75.19 2.62 61.89 4.0 SCM-D 6 3 1.89 0.73 0.47 0.25 77.83 3.49 65.85 1.58 SCM-D 10 2 4.46 0.76 0.46 0.11 69.81 1.43 65.35 2.72 SCM-D 10 3 3.53 0.69 0.10 0.29 65.89 2.56 61.92 1.95 Table 8. Observational data with Neural Network Decoder: Mean S.E. (5 random seeds). R2 achieves high values in many cases but MCC (IOS) achieve high values (for Uniform & Uniform-C). PZ d p Recon-MSE R2 MCC MCC (IL) Uniform 6 2 0.35 0.08 0.98 0.0 68.39 1.21 99.09 0.02 Uniform 6 3 2.02 0.28 0.91 0.0 63.2 1.33 91.67 2.50 Uniform 10 2 3.89 0.50 0.99 0.0 60.54 1.81 99.59 0.04 Uniform 10 3 29.21 2.33 0.95 0.0 61.0 1.48 93.73 0.45 Uniform-C 6 2 0.42 0.15 0.94 0.02 65.91 0.53 96.43 1.47 Uniform-C 6 3 1.05 0.19 0.91 0.0 67.92 3.48 94.8 0.28 Uniform-C 10 2 1.32 0.09 0.99 0.0 60.02 1.83 99.42 0.01 Uniform-C 10 3 10.46 1.27 0.92 0.0 61.68 1.20 93.83 0.78 Gaussian-Mixture 6 2 0.45 0.13 0.94 0.0 70.64 3.83 96.87 0.14 Gaussian-Mixture 6 3 0.62 0.12 0.83 0.01 64.43 2.36 84.53 2.60 Gaussian-Mixture 10 2 0.87 0.15 0.94 0.0 57.35 1.62 97.06 0.16 Gaussian-Mixture 10 3 5.98 0.93 0.83 0.0 57.89 2.06 80.14 1.77 SCM-S 6 2 0.27 0.07 0.94 0.02 74.68 2.28 93.07 2.16 SCM-S 6 3 0.9 0.18 0.89 0.02 71.56 3.18 88.66 2.71 SCM-S 10 2 0.93 0.23 0.98 0.0 66.08 1.04 94.14 0.39 SCM-S 10 3 1.99 0.36 0.88 0.01 63.35 1.44 76.62 6.15 SCM-D 6 2 0.69 0.07 0.95 0.02 76.99 2.53 91.63 1.90 SCM-D 6 3 0.87 0.25 0.88 0.01 75.72 1.69 88.19 3.63 SCM-D 10 2 1.05 0.29 0.95 0.01 68.71 2.16 90.14 4.35 SCM-D 10 3 1.68 0.34 0.86 0.01 68.52 2.11 81.82 3.0 Table 9. Interventional data with Neural Network Decoder: Mean S.E. (5 random seeds). MCC(IL) is high. Interventional Causal Representation Learning B.4. Experiment setup details: Synthetic image experiments The latent variable comprises of two balls and their (x, y) coordinates; hence we have d = 4 dimensional latent variable. We use Py Game (Shinners, 2011) rendering engine final images of dimension 64 64 3. Latent Distributions. We denote the (x, y) coordinates of the Ball 1 as (x1, y1), and for Ball 2 as (x2, y2). We have the following three cases for the latent distributions in case of synthetic image experiments: Uniform: Each coordinate of Ball 1 (x1, y1) and Ball 2 (x2, y2) are sampled from Uniform(0.1, 0.9). 
SCM (linear): The coordinates of Ball 1 (x1, y1) are sampled from Uniform(0.1, 0.9), which are used to sample the coordinates of Ball 2 as follows: ( Uniform(0.1, 0.5) if x1 + y1 1.0 Uniform(0.5, 0.9) if x1 + y1 < 1.0 ( Uniform(0.5, 0.9) if x1 + y1 1.0 Uniform(0.1, 0.5) if x1 + y1 < 1.0 SCM (non-linear): The coordinates of Ball 1 (x1, y1) are sampled from Uniform(0.1, 0.9), which are used to sample the coordinates of Ball 2 as follows: ( Uniform(0.1, 0.5) if 1.25 (x2 1 + y2 1) 1.0 Uniform(0.5, 0.9) if 1.25 (x2 1 + y2 1) < 1.0 ( Uniform(0.5, 0.9) if 1.25 (x2 1 + y2 1) 1.0 Uniform(0.1, 0.5) if 1.25 (x2 1 + y2 1) < 1.0 Case Train Validation Test Observational (D) 20000 5000 20000 Interventional (D(I)) 20000 5000 20000 Table 10. Statistics for the synthetic image experiments Further details on dataset and evaluation. For experiments in Table 4, the details regarding the train/val/test split are described in Table 10. Note that the interventional data (D(I)) is composed of do interventions on each latent variable (D(I) = i=1:d Di), where latent variable to be intervened is sampled from Uniform({1, , d}). Hence, each latent variable has equal probability to be intervened. While performing do-interventions on any latent variable (D(i)), we control for the total number of distinct values the latent takes under the intervention (#interv, each distinct value correpsonds to sampling data from one interventional distribution). When #interv = 1, then we set the latent variable i to value 0.5. For the case when #interv > 1, we sample the values corresponding to different do-interventions on latent variable i as total of #interv equally distant points from S = [0.25, 0.75]. Eg, when #interv = 3, then the possible values after do-intervention on latent variable i are {0.25, 0.50, 0.75}. Note that we uniformly at random sample the value of intervention from the set of intervention values. Note that we only use the observational data (D) for training the autoencoder in Step 1. while the non-linear transformations γi in Step 2 (Eq. 7) are learnt using the corresponding interventional data (D(i)). Further, the metrics (MCC, MCC (IL)) are computed only on the test split of observational data (D) (no interventional data used). Interventional Causal Representation Learning Model architecture. We use the following architecture for encoder f across all experiments (Table 4) in Step 1 of minimizing the reconstruction loss. Res Net-18 Architecture (No Pre Training): Image (64 64 3) Penultimate Layer Output (512 dimensional) Linear Layer (512, 128); Batch Norm(); Leaky Re LU() Linear Layer (128, 25); Batch Norm() We use the following architecture for decoder h across all experiments (Table 4) in Step 1 of minimizing the reconstruction loss. Our architecture for decoder is inspired from the implementation in widely used works (Locatello et al., 2019). Linear Layer (25, 128); Leaky Re LU() Linear Layer (128, 1024); Leaky Re LU() De Convolution Layer (cin: 64, cout: 64, kernel: 4; stride: 2; padding: 1); Leaky Re LU() De Convolution Layer (cin: 64, cout: 32, kernel: 4; stride: 2; padding: 1); Leaky Re LU() De Convolution Layer (cin: 32, cout: 32, kernel: 4; stride: 2; padding: 1); Leaky Re LU() De Convolution Layer (cin: 32, cout: 3, kernel: 4; stride: 2; padding: 1); Leaky Re LU() Note: Here the latent dimension of the encoder (25) is not equal to the true latent dimension (d = 4) as that would lead issues with training the autoencoder itself. 
Also, this choice is more suited towards practical scenarios where we do not know the dimension of latent beforehand. For learning the mappings γi from the corresponding interventional data (P(i) X ), we use the default MLP Regressor class from scikit-learn (Pedregosa et al., 2011) with 1000 max iterations for convergence. Hyperparameters. We use Adam optimizer with hyperparameters defined below. We also use early stopping strategy, where we halt the training process if the validation loss does not improve over 100 epochs consecutively. Batch size: 64 Weight decay: 5 10 4 Total epochs: 1000 Learning rate: 5 10 4 B.5. Additional Results: Synthetic Image Experiments Table 11 presents more details about Table 4 in the main paper, with additional metrics like mean squared loss for autoencoder reconstruction task (Recon-MSE) and and R2 to test for affine identification using representations from Step 1. Note that Recon-RMSE and R2 are computed using the autoencoder trained from Step 1, hence the results are not affected by training on varying #interv per latent in Step 2. We get high R2 values across different latent distributions indicating the higher dimensional latents ( ˆd = 25) learned by the encoder are related to the small dimensional true latents (d = 4) by a linear function. We also report a batch of reconstructed images from the trained autoencoder for the different latent distributions; Uniform (Figure 3), SCM Linear (Figure 4), and SCM Non-Linear (Figure 5). In all the cases the position and color of both the balls is accurately reconstructed. Interventional Causal Representation Learning PZ #interv Recon-RMSE R2 MCC (IL) Uniform 1 0.04 0.0 0.51 0.0 34.18 0.24 Uniform 3 0.04 0.0 0.51 0.0 73.94 0.38 Uniform 5 0.04 0.0 0.51 0.0 73.62 0.21 Uniform 7 0.04 0.0 0.51 0.0 72.54 0.34 Uniform 9 0.04 0.0 0.51 0.0 73.14 0.47 SCM (linear) 1 0.03 0.0 0.8 0.0 12.81 0.28 SCM (linear) 3 0.03 0.0 0.8 0.0 73.21 0.33 SCM (linear) 5 0.03 0.0 0.8 0.0 83.38 0.21 SCM (linear) 7 0.03 0.0 0.8 0.0 84.22 0.25 SCM (linear) 9 0.03 0.0 0.8 0.0 86.16 0.17 SCM (non-linear) 1 0.04 0.0 0.69 0.0 19.70 0.31 SCM (non-linear) 3 0.04 0.0 0.69 0.0 59.68 0.28 SCM (non-linear) 5 0.04 0.0 0.69 0.0 62.79 0.20 SCM (non-linear) 7 0.04 0.0 0.69 0.0 69.31 0.34 SCM (non-linear) 9 0.04 0.0 0.69 0.0 71.37 0.26 Table 11. Interventional data in image-based experiments: Mean S.E. (5 random seeds). MCCs increase with the number of interventions per latent dimension as predicted by Theorem A.12. Figure 3. Reconstructed images (top row) for the corresponding real images (bottom row) for the uniform latent case. Figure 4. Reconstructed images (top row) for the corresponding real images (bottom row) for the SCM (linear) latent case. Figure 5. Reconstructed images (top row) for the corresponding real images (bottom row) for the SCM (non-linear) latent case. Interventional Causal Representation Learning B.6. Experiments with independence penalty from β-VAE In this section, we provide some additional comparisons with models trained with independence prior on the latents used in β-VAEs (Burgess et al., 2018). We take a standard autoencoder that uses a reconstruction penalty and add to it the β-VAE penalty. We carry out the comparisons for both polynomial data generation experiments and also for the image-based experiments. For the polynomial data generation experiments, we use the same MLP based encoder-decoder architecture that we used earlier for Table 8 and Table 9. 
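Concretely, the penalty added to the reconstruction loss in these comparisons can be sketched as below. This is an illustrative sketch of a β-weighted KL term in the style of β-VAE (Burgess et al., 2018), with a Gaussian posterior parameterization; it is not the exact training code behind Tables 12-14.

```python
# Minimal sketch of a reconstruction loss augmented with a beta-VAE-style penalty:
# the encoder outputs (mu, logvar), a latent is sampled with the reparameterization
# trick, and a beta-weighted KL term to an isotropic Gaussian prior is added.
import torch

def beta_vae_loss(x, decoder, mu, logvar, beta: float):
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
    recon = ((decoder(z) - x) ** 2).mean()                     # reconstruction penalty
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```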
In Table 12 and Table 13, we show the results for autoencoder trained with β-VAE penalty for the same setting as was used in Table 8 and Table 9 respectively. For the image-based experiments, we use the same Res Net-based encoder-decoder architecture that we used earlier for Table 4. In Table 14, we show results for the image-based experiments using the same setting as Table 4 focusing on the case with nine interventions. PZ d p MCC (β = 0.1) MCC (β = 1.0) MCC (β = 10.0) MCC (IOS) Uniform 6 2 67.35 2.7 68.73 2.88 72.38 3.4 99.05 0.02 Uniform 6 3 70.98 2.57 69.43 1.82 71.46 3.13 95.74 0.12 Uniform 10 2 58.94 2.04 57.8 1.35 60.14 1.33 94.25 2.13 Uniform 10 3 59.29 2.45 60.94 2.17 59.22 1.24 91.24 4.99 SCM-S 6 2 65.37 2.28 61.98 4.52 65.63 4.15 66.04 1.34 SCM-S 6 3 64.53 2.38 65.23 0.99 68.61 2.74 67.66 2.18 SCM-S 10 2 62.54 1.33 62.23 1.65 63.43 0.87 63.52 1.11 SCM-S 10 3 62.44 0.56 58.04 1.64 59.5 0.89 62.11 1.36 SCM-D 6 2 60.86 2.63 59.36 2.71 62.26 1.66 61.89 4.0 SCM-D 6 3 66.32 1.36 65.49 1.97 66.43 1.4 65.85 1.58 SCM-D 10 2 61.13 1.42 62.23 1.32 61.38 2.25 65.35 2.72 SCM-D 10 3 60.39 1.83 58.64 2.01 58.43 1.08 61.92 1.95 Table 12. Observational data with Neural Network Decoder: Mean S.E. (5 random seeds). PZ d p MCC (β = 0.1) MCC (β = 1.0) MCC (β = 10.0) MCC (IL) Uniform 6 2 69.79 1.83 68.62 2.99 69.59 1.76 99.09 0.02 Uniform 6 3 68.44 1.88 71.86 0.42 68.76 2.13 91.67 2.50 Uniform 10 2 60.46 1.46 58.93 0.91 58.95 0.89 99.59 0.04 Uniform 10 3 59.85 1.46 62.92 2.98 61.34 1.66 93.73 0.45 SCM-S 6 2 71.18 2.11 71.01 1.32 67.6 2.07 93.07 2.16 SCM-S 6 3 72.67 1.29 70.9 3.63 75.6 3.04 88.66 2.71 SCM-S 10 2 65.04 1.46 64.47 1.49 65.84 2.1 94.14 0.39 SCM-S 10 3 63.2 1.83 62.31 1.1 62.16 1.68 76.62 6.15 SCM-D 6 2 72.6 2.01 76.58 2.95 71.92 2.85 91.63 1.90 SCM-D 6 3 67.79 0.97 72.93 1.81 72.98 1.27 88.19 3.63 SCM-D 10 2 69.78 3.85 69.19 2.78 66.53 1.28 90.14 4.35 SCM-D 10 3 64.9 1.68 66.86 2.61 64.25 1.61 81.82 3.0 Table 13. Interventional data with Neural Network Decoder: Mean S.E. (5 random seeds). Interventional Causal Representation Learning PZ MCC (β = 0.1) MCC (β = 1.0) MCC (β = 10.0) MCC (IL) Uniform 42.6 4.23 36.5 2.45 38.3 2.22 73.1 0.47 SCM (linear) 60.8 2.52 59.5 2.47 61.6 1.06 86.2 0.17 SCM (non-linear) 62.5 1.88 60.7 2.41 59.7 1.29 71.4 0.26 Table 14. Interventional data in image-based experiments: Mean S.E. (5 random seeds).
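For reference, the MCC (IL) columns reported throughout this appendix come from Step 2 of Algorithm 1 run on interventional data: for each latent i, a map γ_i is fit to predict a fixed random target from the Step-1 encoder outputs on data where z_i was intervened. The sketch below illustrates that step with a scikit-learn regressor; the variable names are hypothetical, and the choice of LinearRegression follows the polynomial-decoder setup in App. B.2 (the image experiments in App. B.4 instead use an MLPRegressor and one target per interventional distribution).

```python
# Minimal sketch of Step 2 of Algorithm 1 with do-interventional data.
# z_hat_per_intervention[i] holds encoder outputs f*(X^(i)) on data where
# latent i was intervened; a fixed random target Y^(i) is regressed from them.
import numpy as np
from sklearn.linear_model import LinearRegression

def learn_gamma(z_hat_per_intervention, seed: int = 0):
    rng = np.random.default_rng(seed)
    gammas = []
    for z_hat_i in z_hat_per_intervention:            # one entry per intervened latent i
        y_target = rng.uniform(0.0, 1.0)              # fixed intervention target
        y = np.full(len(z_hat_i), y_target)
        gammas.append(LinearRegression(fit_intercept=False).fit(z_hat_i, y))
    return gammas
```

The learned maps Γ = [γ_i] are then applied to the encoder outputs on the observational test split to compute MCC (IL).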