Published as a conference paper at ICLR 2024

IDENTIFYING REPRESENTATIONS FOR INTERVENTION EXTRAPOLATION

Sorawit Saengkyongam1 Elan Rosenfeld2 Pradeep Ravikumar2 Niklas Pfister3 Jonas Peters1
1ETH Zürich 2Carnegie Mellon University 3University of Copenhagen

ABSTRACT

The premise of identifiable and causal representation learning is to improve the current representation learning paradigm in terms of generalizability or robustness. Despite recent progress in questions of identifiability, more theoretical results demonstrating concrete advantages of these methods for downstream tasks are needed. In this paper, we consider the task of intervention extrapolation: predicting how interventions affect an outcome, even when those interventions are not observed at training time, and show that identifiable representations can provide an effective solution to this task even if the interventions affect the outcome nonlinearly. Our setup includes an outcome variable Y, observed features X, which are generated as a nonlinear transformation of latent features Z, and exogenous action variables A, which influence Z. The objective of intervention extrapolation is then to predict how interventions on A that lie outside the training support of A affect Y. Here, extrapolation becomes possible if the effect of A on Z is linear and the residual when regressing Z on A has full support. As Z is latent, we combine the task of intervention extrapolation with identifiable representation learning, which we call Rep4Ex: we aim to map the observed features X into a subspace that allows for nonlinear extrapolation in A. We show that the hidden representation is identifiable up to an affine transformation in Z-space, which, we prove, is sufficient for intervention extrapolation. The identifiability is characterized by a novel constraint describing the linearity assumption of A on Z. Based on this insight, we propose a flexible method that enforces the linear invariance constraint and can be combined with any type of autoencoder. We validate our theoretical findings through a series of synthetic experiments and show that our approach can indeed succeed in predicting the effects of unseen interventions.

1 INTRODUCTION

Representation learning (see, e.g., Bengio et al., 2013, for an overview) underpins the success of modern machine learning methods as evident, for example, in their application to natural language processing and computer vision. Despite the tremendous success of such machine learning methods, it is still an open question when and to what extent they generalize to unseen data distributions. It is further unclear which precise role representation learning can play in tackling this task. To us, the main motivation for identifiable and causal representation learning (e.g., Schölkopf et al., 2021) is to overcome this shortcoming. The core component of this approach involves learning a representation of the data that reflects some causal aspects of the underlying model. Identifying such a representation from the observational distribution is referred to as the identifiability problem. Without any assumptions on the data generating process, learning identifiable representations is not possible (Hyvärinen & Pajunen, 1999).
To show identifiability, previous works have explored various assumptions, including the use of auxiliary information (Hyvarinen et al., 2019; Khemakhem et al., 2020), sparsity (Moran et al., 2021; Lachapelle et al., 2022), interventional data (Brehmer et al., 2022; Seigal et al., 2022; Ahuja et al., 2022a; 2023; Buchholz et al., 2023) and structural assumptions (Hälvä et al., 2021; Kivva et al., 2022). However, this body of work has focused solely on the problem of identifiability. Despite its potential, convincing theoretical results illustrating the benefits of such identification in solving tangible downstream tasks are arguably scarce; a few recent works have provided theoretical evidence for the advantages of identifiable representations in tasks such as estimating treatment effects (Wu & Fukumizu, 2022), improving generalization in multi-task learning (Lachapelle et al., 2023a) and generating novel object compositions (Lachapelle et al., 2023b).

[Figure 1: (a) Graphical model of the problem setup; (b) example illustrating intervention extrapolation (panel titles: "learning an identifiable representation" and "predicting the effect of unseen interventions E[Y | do(A = a*)]"); during training, A, X, and Y are observed. In this paper, we consider the goal of intervention extrapolation, see (b). We are given training data (yellow) that cover only a limited range of possible values of A. During test time (grey), we would like to predict E[Y | do(A = a*)] for previously unseen values of a*. The function a* ↦ E[Y | do(A = a*)] (red) can be nonlinear in a*. We argue in Section 2 that this can be achieved using control functions if the data follow a structure like in (a) and Z is observed. We show in Section 3 that, under suitable assumptions, the problem is still solvable if we first have to reconstruct the hidden representation Z (up to a transformation) from X. The representation is used to predict E[Y | do(A = a*)], so we learn a representation for intervention extrapolation (Rep4Ex).]

In this work, we consider the task of intervention extrapolation, that is, predicting how interventions that were not present in the training data will affect an outcome. We study a setup with an outcome Y; observed features X, which are generated via a nonlinear transformation of latent predictors Z; and exogenous action variables A, which influence Z. We assume the underlying data generating process depicted in Figure 1a. The dimension of X can be larger than the dimension of Z, and we allow for potentially unobserved confounders between Y and Z (as depicted by the two-headed dotted arrow between Z and Y). Adapting notation from the independent component analysis (ICA) literature (Hyvärinen & Oja, 2000), we refer to g_0 as a mixing (and g_0^{-1} as an unmixing) function. In this setup, the task of intervention extrapolation is to predict the effect of a previously unseen intervention on the action variables A (with respect to the outcome Y). Using do-notation (Pearl, 2009), we thus aim to estimate E[Y | do(A = a*)], where a* lies outside the training support of A. Due to this extrapolation, E[Y | do(A = a*)], which may be nonlinear in a*, cannot be consistently estimated by only considering the conditional expectation of Y given A (even though A is exogenous and E[Y | do(A = a)] = E[Y | A = a] for all a in the support of A), see Figure 1b. We formally prove this in Proposition 1.
In this paper, the central assumption that permits learning identifiable representation and subsequently solving the downstream task is that the effect of A on Z is linear, that is, E[Z | A] = M0A for an unknown matrix M0. The approach we propose in this paper, Rep4Ex-CF, successfully extrapolates the effects outside the training support by performing two steps (see Figure 1a): In the first stage, we use (A, X) to learn an encoder ϕ : X Z that identifies, from the observed distribution of (A, X), the unmixing function g 1 0 up to an affine transformation and thereby obtains a feature representation ϕ(X). To do that, we propose to make use of a novel constraint based on the assumption of the linear effect of A on Z, which, as we are going to see, enables identification. Since this constraint has a simple analytical form, it can be added as a regularization term to an auto-encoder loss. In the second stage, we use (A, ϕ(X), Y ) to estimate the interventional expression effect E[Y | do(A = a )]. The model in the second stage is adapted from the method of control functions in the econometrics literature (Telser, 1964; Heckman, 1977; Newey et al., 1999), where one views A as instrumental variables. Figure 1b shows results of our proposed method (Rep4Ex-CF) on a simulated data set, together with the outputs of and a standard neural-network-based regression (MLP). Published as a conference paper at ICLR 2024 We believe that our framework provides a complementary perspective on causal representation learning. Similar to most works in that area, we also view Z as the variables that we ultimately aim to control. However, in our view, direct (or hard) interventions on Z are inherently ill-defined due to its latent nature. We, therefore, consider the action variables A as a means to modify the latent variables Z. As an example, in the context of reinforcement learning, one may view X as an observable state, Z as a latent state, A as an action, and Y as a reward. Our aim is then to identify the actions that guide us toward the desired latent state which subsequently leads to the optimal expected reward. The ability to extrapolate to unseen values of A comes (partially) from the linearity of A on Z; such extrapolation therefore becomes possible if we recover the true latent variables Z up to an affine transformation. The problem of learning identifiable representations can then be understood as the process of mapping the observed features X to a subspace that permits extrapolation in A. We refer to this task of learning a representation for intervention extrapolation as Rep4Ex. 1.1 RELATION TO EXISTING WORK Some of the recent work on representation learning for latent causal discovery also relies on (unobserved) interventions to show identifiability, sometimes with auxiliary information. These works often assume that the interventions occur on one or a fixed group of nodes in the latent DAG (Ahuja et al., 2022a; Buchholz et al., 2023; Zhang et al., 2023) or that they are exactly paired (Brehmer et al., 2022; von Kügelgen et al., 2023). Other common conditions include parametric assumptions on the mixing function (Rosenfeld et al., 2021; Seigal et al., 2022; Ahuja et al., 2023; Varici et al., 2023) or precise structural conditions on the generative model (Cai et al., 2019; Kivva et al., 2021; Xie et al., 2022; Jiang & Aragam, 2023; Kong et al., 2023). Unlike these works, we study interventions on exogenous (or "anchor") variables, akin to simultaneous soft interventions on the latents. 
Identifiability is also studied in nonlinear ICA (e.g., Hyvarinen & Morioka, 2016; Hyvarinen et al., 2019; Khemakhem et al., 2020; Schell & Oberhauser, 2023); we discuss the relation in Appendix A.

The task of predicting the effects of new interventions has been explored in several prior works. Nandy et al. (2017); Saengkyongam & Silva (2020); Zhang et al. (2023) consider learning the effects of new joint interventions based on the observational distribution and single interventions. Bravo-Hermsdorff et al. (2023) combine data from various regimes to predict intervention effects in previously unobserved regimes. Closely related to our work, Gultchin et al. (2021) focus on predicting causal responses for new interventions in the presence of high-dimensional mediators X. Unlike our work, they assume that the latent features are known and do not allow for unobserved confounders.

Our work is related to research that utilizes exogenous variables for causal effect estimation and distribution generalization. Instrumental variable (IV) approaches (Wright, 1928; Angrist et al., 1996) exploit the existence of exogenous variables to estimate causal effects in the presence of unobserved confounders. Our work draws inspiration from the control function approach in the IV literature (Telser, 1964; Heckman, 1977; Newey et al., 1999). Several works (e.g., Rojas-Carulla et al., 2018; Arjovsky et al., 2019; Rothenhäusler et al., 2021; Christiansen et al., 2021; Rosenfeld et al., 2022; Saengkyongam et al., 2022) have used exogenous variables to increase robustness and perform distribution generalization. While the use of exogenous variables enters similarly in our approach, these existing works focus on a different task and do not allow for nonlinear extrapolation.

2 INTERVENTION EXTRAPOLATION WITH OBSERVED Z

To provide better intuition and insight into our approach, we start by considering a setup in which Z is observed, which is equivalent to assuming that we are given the true underlying representation. We now focus on the intervention extrapolation part, see Figure 1a (red box) with Z observed. Consider an outcome variable Y ∈ 𝒴 ⊆ R, predictors Z ∈ 𝒵 ⊆ R^d, and exogenous action variables A ∈ 𝒜 ⊆ R^k. We assume the following structural causal model (Pearl, 2009)

S:  A := ε_A,   Z := M_0 A + V,   Y := ℓ(Z) + U,   (1)

where ε_A, V, U are noise variables and we assume that ε_A ⊥⊥ (V, U), E[U] = 0, and M_0 has full row rank. Here, V and U may be dependent.

Notation. For a structural causal model (SCM) S, we denote by P^S the observational distribution entailed by S and the corresponding expectation by E^S. When there is no ambiguity, we may omit the superscript S. Further, we employ the do-notation to denote the distribution and the expectation under an intervention. In particular, we write P^{S; do(A = a)} and E^S[· | do(A = a)] to denote the distribution and the expectation under an intervention setting A := a, respectively, and P^{do(A = a)} and E[· | do(A = a)] if there is no ambiguity. Lastly, for any random vector B, we denote by supp^S(B) the support^1 of B in the observational distribution P^S. Again, when the SCM is clear from the context, we may omit S and write supp(B) as the support in the observational distribution. Our goal is to compute the effect of an unseen intervention on the action variables A (with respect to the outcome Y), that is, E[Y | do(A = a*)], where a* ∉ supp(A).
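To make the setup concrete, the following is a minimal simulation sketch of an SCM of the form (1). All concrete choices (dimensions, M_0, the nonlinearity ℓ, and the noise distributions) are illustrative assumptions and are not taken from the paper.

```python
# Illustrative simulation of an SCM of the form (1); all concrete choices are ours.
import numpy as np

rng = np.random.default_rng(0)
n, k, d = 10_000, 2, 2

M0 = rng.normal(size=(d, k))                             # unknown in practice; full row rank a.s.
ell = lambda z: np.sin(z[:, 0]) + 0.5 * z[:, 1] ** 2     # some nonlinear outcome function

A = rng.uniform(-1.0, 1.0, size=(n, k))                  # exogenous actions eps_A
cov = 0.5 * np.eye(d + 1) + 0.5                          # (V, U) jointly Gaussian and dependent
VU = rng.multivariate_normal(np.zeros(d + 1), cov, size=n)
V, U = VU[:, :d], VU[:, d]                               # E[U] = 0 and supp(V) = R^d

Z = A @ M0.T + V                                         # Z := M0 A + V
Y = ell(Z) + U                                           # Y := l(Z) + U
```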
A naive approach to tackle this problem is to estimate the conditional expectation E[Y | A = a] by regressing Y on A using a sample from the observational distribution of (Y, A). Despite A being exogenous, from (1) we only have that E[Y | do(A = a)] = E[Y | A = a] for all a ∈ supp(A). As a* lies outside the support of A, we face the non-trivial challenge of extrapolation. The proposition below shows that in our model class E[Y | do(A = a*)] is indeed not identifiable from the conditional expectation E[Y | A] alone. Consequently, E[Y | do(A = a*)] cannot be consistently estimated by simply regressing Y on A. (The result is independent of whether Z is observed or not and applies to the setting of unobserved Z in the same way, see Section 3. Furthermore, the result still holds even when V and U are independent.) All proofs can be found in Appendix D.

Proposition 1 (Regressing Y on A does not suffice). There exist SCMs S1 and S2 of the form (1) (with the same set 𝒜) that satisfy all of the following conditions: (i) supp^{S1}(V) = supp^{S2}(V) = R; (ii) supp^{S1}(A) = supp^{S2}(A) ⊊ 𝒜; (iii) ∀a ∈ supp^{S1}(A): E^{S1}[Y | A = a] = E^{S2}[Y | A = a]; (iv) ∃B ⊆ 𝒜 with positive Lebesgue measure s.t. ∀a ∈ B: E^{S1}[Y | do(A = a)] ≠ E^{S2}[Y | do(A = a)] (the latter implies B ∩ supp^{S1}(A) = ∅).

Proposition 1 affirms that relying solely on the knowledge of the conditional expectation E[Y | A] is not sufficient to identify the effect of an intervention outside the support of A. It is, however, possible to incorporate additional information beyond the conditional expectation to help us identify E[Y | do(A = a*)]. In particular, inspired by the method of control functions in econometrics, we propose to identify E[Y | do(A = a*)] from the observational distribution of (A, Z, Y) based on the following identities:

E[Y | do(A = a*)] = E[ℓ(Z) | do(A = a*)] + E[U | do(A = a*)]
                  = E[ℓ(M_0 a* + V) | do(A = a*)] + E[U | do(A = a*)]
                  = E[ℓ(M_0 a* + V)],   (2)

where the last equality follows from E[U] = 0 and the fact that, for all a* ∈ 𝒜, P_{U,V} = P^{do(A = a*)}_{U,V}. Now, since A ⊥⊥ V, we have E[Z | A] = M_0 A and M_0 can be identified by regressing Z on A. V is then identified via V = Z − M_0 A. V is called a control variable and, as argued by Newey et al. (1999), for example, it can be used to identify ℓ: defining λ: v ↦ E[U | V = v], we have for all (z, v) ∈ supp(Z, V)

E[Y | Z = z, V = v] = E[ℓ(Z) + U | Z = z, V = v] = ℓ(z) + E[U | Z = z, V = v]
                    = ℓ(z) + E[U | V = v] = ℓ(z) + λ(v),   (3)

where in the second last equality, we have used that U ⊥⊥ Z | V (see Lemma 8 in Appendix C). In general, (3) does not suffice to identify ℓ (e.g., V and Z are not necessarily independent of each other). Only under additional assumptions, such as parametric assumptions on the function classes, are ℓ and λ identifiable up to additive constants^2. In our work, we utilize an assumption by Newey et al. (1999) that puts restrictions on the joint support of A and V and identifies ℓ on the set M_0 supp(A) + supp(V). Since M_0 and V are identifiable, too, this then allows us to compute, by (2), E[Y | do(A = a*)] for all a* s.t. M_0 a* + supp(V) ⊆ M_0 supp(A) + supp(V); thus, supp(V) = R^d is a sufficient condition to identify E[Y | do(A = a*)] for all a* ∈ 𝒜. This support assumption, together with the additivity of V in (1), is key to ensure that the nonlinear function ℓ can be inferred on all of R^d, allowing for nonlinear extrapolation. Similar ideas have been used for extrapolation in a different setting and under different assumptions by Shen & Meinshausen (2023).
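For the observed-Z case, the identification steps (2) and (3) translate directly into a two-stage estimator. The sketch below is a minimal illustration only, not the estimator analysed by Newey et al. (1999): it fits the additive model ν(z) + λ(v) with fixed polynomial bases and no z-v interaction terms, and it reuses the variable names from the simulation sketch above.

```python
# Two-stage control-function sketch with observed Z, following (2) and (3).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

def fit_control_function(A, Z, Y, degree=3):
    # Stage 1: regress Z on A; the coefficients estimate M0, the residuals estimate V.
    stage1 = LinearRegression().fit(A, Z)
    V = Z - stage1.predict(A)
    # Stage 2: additive regression E[Y | Z, V] = nu(Z) + lambda(V), here via separate
    # polynomial bases for Z and V (additive by construction, no z-v interactions).
    pz = PolynomialFeatures(degree, include_bias=False).fit(Z)
    pv = PolynomialFeatures(degree, include_bias=False).fit(V)
    stage2 = LinearRegression().fit(np.hstack([pz.transform(Z), pv.transform(V)]), Y)
    nu_coef = stage2.coef_[: pz.transform(Z[:1]).shape[1]]      # the nu(.) block of coefficients
    nu = lambda Zmat: pz.transform(Zmat) @ nu_coef              # nu is only defined up to a constant
    return stage1, nu, V

def estimate_do(a_star, stage1, nu, V, Z, Y):
    # E[Y | do(A = a*)] = E[l(M0 a* + V)] is estimated by averaging nu over the residuals;
    # the correction term removes the additive constant that nu is only identified up to.
    Z_star = stage1.predict(np.atleast_2d(a_star)) + V          # rows: M0_hat a* + V_i
    return nu(Z_star).mean() + (Y.mean() - nu(Z).mean())

# Example with the simulated data above: a* outside the training support [-1, 1]^2
# stage1, nu, V = fit_control_function(A, Z, Y)
# estimate_do(np.array([2.0, -2.0]), stage1, nu, V, Z, Y)
```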
In some applications, we may want to compute the effect of an intervention on A conditioned on Z, that is, E[Y |Z = z, do(A = a )]. We show in Appendix C.1 that this expression is identifiable, too. 1The support of a random vector B Ω Rq (for some q N+) is defined as the set of all b Ωfor which every open neighborhood of b (in Ω) has positive probability. 2The constant can be identified by using the assumption E[U] = 0. Published as a conference paper at ICLR 2024 3 INTERVENTION EXTRAPOLATION VIA IDENTIFIABLE REPRESENTATIONS Section 2 illustrates the problem of intervention extrapolation in the setting where the latent predictors Z are fully observed. We now consider the setup where we do not directly observe Z but instead we observe X which are generated by applying a nonlinear mixing function to Z. Formally, consider an outcome variable Y Y R, observable features X X Rm, latent predictors Z Z = Rd, and action variables A A Rk. We model the underlying data generating process by the following SCM. Setting 1 (Rep4Ex). We assume the SCM S : A := ϵA Z := M0A + V X := g0(Z) Y := ℓ(Z) + U, (4) where ϵA, V, U are noise variables and we assume that the covariance matrix of ϵA is full-rank, ϵA (V, U), E[U] = 0, supp(V ) = Rd, and M0 has full row rank (thus k d). Further, g0 and ℓare measurable functions and g0 is injective. In this work, we only consider interventions on A. For example, we do not require that the SCM models interventions on Z correctly. Possible relaxations of the linearity assumption between A and Z and the absence of noise in X are discussed in Remark 6 in Appendix B. Our goal is to compute E[Y | do(A = a )] for some a / supp(A). As in the case of observed Z, the naive method of regressing Y on A using a non-parametric regression fails to handle the extrapolation of a (see Proposition 1). We, however, can incorporate additional information beyond the conditional expectation to identify E[Y | do(A = a )] through the method of control functions. From (2), we have for all a A that E[Y | do(A = a )] = E[ℓ(M0a + V )]. (5) Unlike the case where we observe Z, the task of identifying the unknown components on the righthand side of (5) becomes more intricate. In what follows, we show that if we can learn an encoder ϕ : X Z that identifies g 1 0 up to an affine transformation (see Definition 2 below), we can construct a procedure that identifies the right-hand side of (5) and can thus be used to predict the effect of unseen interventions on A. Definition 2 (Affine identifiability). Assume Setting 1. An encoder ϕ : X Z is said to identify g 1 0 up to an affine transformation (aff-identify for short) if there exists an invertible matrix Hϕ Rd d and a vector cϕ Rd such that z Z : (ϕ g0)(z) = Hϕz + cϕ. (6) We denote by κϕ : z 7 Hϕz + cϕ the corresponding affine map. Under Setting 1, we show an equivalent formulation of affine identifiability in Proposition 7 stressing that Z can be reconstructed from ϕ(X). Next, let ϕ : X Z be an encoder that aff-identifies g 1 0 and κϕ : z 7 Hϕz + cϕ be the corresponding affine map. From (5), we have for all a A that E[Y | do(A = a )] = E[ℓ(M0a + V )] = E[(ℓ κ 1 ϕ )(κϕ(M0a + V ))] = E[(ℓ κ 1 ϕ )(HϕM0a + cϕ + Hϕ E[V ] + Hϕ(V E[V ]))] = E[(ℓ κ 1 ϕ )(Mϕa + qϕ + Vϕ)], (7) where we define Mϕ := HϕM0, qϕ := cϕ + Hϕ E[V ], and Vϕ := Hϕ(V E[V ]). (8) We now outline how to identify the right-hand side of (7) by using the encoder ϕ and formalize the result in Theorem 3. 
Identifying Mϕ, qϕ and Vϕ Using that ϕ aff-identifies g 1 0 , we have (almost surely) that ϕ(X) = (ϕ g0)(Z) = HϕZ + cϕ = HϕM0A + HϕV + cϕ = MϕA + qϕ + Vϕ. (9) Now, since Vϕ A (following from V A), we can identify the pair (Mϕ, qϕ) by regressing ϕ(X) on A. The control variable Vϕ can therefore be obtained as Vϕ = ϕ(X) (MϕA + qϕ). Published as a conference paper at ICLR 2024 Identifying ℓ κ 1 ϕ Defining λϕ : v 7 E[U|Vϕ = v], we have, for all ω, v supp((ϕ(X), Vϕ)), E[Y |ϕ(X) = ω, Vϕ = v] ( ) = E[Y |κϕ(Z) = ω, Vϕ = v] = E[Y |Z = κ 1 ϕ (ω), Vϕ = v] = E[ℓ(Z) + U|Z = κ 1 ϕ (ω), Vϕ = v] = (ℓ κ 1 ϕ )(ω) + E[U|Z = κ 1 ϕ (ω), Vϕ = v] ( ) = (ℓ κ 1 ϕ )(ω) + E[U|Vϕ = v] = (ℓ κ 1 ϕ )(ω) + λϕ(v), (10) where the equality ( ) holds since ϕ aff-identifies g 1 0 and ( ) holds by Lemma 9, see Appendix C. Similarly to the case in Section 2, the functions ℓ κ 1 ϕ and λϕ are identifiable (up to additive constants) under some regularity conditions on the joint support of A and Vϕ (Newey et al., 1999). We make this precise in the following theorem, which summarizes the deliberations from this section. Theorem 3. Assume Setting 1 and let ϕ : X Z be an encoder that aff-identifies g 1 0 . Further, define the optimal linear function from A to ϕ(X) as3 (Wϕ, αϕ) := argmin W Rd k,α Rd E[ ϕ(X) (WA + α) 2] (11) and the control variable Vϕ := ϕ(X) (WϕA + αϕ). Lastly, let ν : Z Y and ψ : V Y be additive regression functions such that ω, v supp((ϕ(X), Vϕ)) : E[Y |ϕ(X) = ω, Vϕ = v] = ν(ω) + ψ(v). (12) If ℓ, λϕ are differentiable and the interior of supp(A) is convex, then the following statements hold (i) a A : E[Y | do(A = a )] = E[ν(Wϕa + αϕ + Vϕ)] (E[ν(ϕ(X))] E[Y ]) (13) (ii) x Im(g0), a A : E[Y |X = x, do(A = a )] = ν(ϕ(x))+ψ(ϕ(x) (Wϕa +αϕ)). (14) 4 IDENTIFICATION OF THE UNMIXING FUNCTION g 1 0 Theorem 3 illustrates that intervention extrapolation can be achieved if one can identify the unmixing function g 1 0 up to an affine transformation. In this section, we focus on the representation part (see Figure 1a, blue box) and prove that such an identification is possible. The identification relies on two key assumptions outlined in Setting 1: (i) the exogeneity of A and (ii) the linearity of the effect of A on Z. These two assumptions give rise to a conditional moment restriction on the residuals obtained from the linear regression of g 1 0 (X) on A. Recall that for all encoders ϕ : X Z we defined (Wϕ, αϕ) := argmin W Rd k,α Rd E[ ϕ(X) (WA + α) 2]. Under Setting 1, we have a supp(A) : E[g 1 0 (X) (Wg 1 0 A + αg 1 0 ) | A = a] = 0. (15) The conditional moment restriction (15) motivates us to introduce the notion of linear invariance of an encoder ϕ (with respect to A). Definition 4 (Linear invariance). Assume Setting 1. An encoder ϕ : X Z is said to be linearly invariant (with respect to A) if the following holds a supp(A) : E[ϕ(X) (WϕA + αϕ) | A = a] = 0. (16) To establish identifiability, we consider an encoder ϕ : X Z satisfying the following constraints. (i) ϕ is linearly invariant and (ii) ϕ|Im(g0) is bijective, (17) where ϕ|Im(g0) denotes the restriction of ϕ to the image of the mixing function g0. The second constraint (invertibility) rules out trivial solutions of the first constraint (linear invariance). For instance, a constant encoder ϕ : x 7 c (for some c Rd) satisfies the linear invariance constraint but it clearly does not aff-identify g 1 0 . Theorem 5 shows that, under the assumptions listed below, the constraints (17) are necessary and sufficient conditions for an encoder ϕ to aff-identify g 1 0 . 
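The linear invariance constraint (16) can also be inspected empirically for any candidate encoder. The snippet below is an informal diagnostic, not part of the paper's method (which instead enforces the constraint through the MMR penalty of Section 5.1); `phi` is assumed to be any callable mapping a batch of observations X to R^d.

```python
# Informal check of linear invariance (16): after regressing phi(X) on A, the
# residual means within bins of A should all be close to zero.
import numpy as np
from sklearn.linear_model import LinearRegression

def residual_profile(phi, X, A, coord=0, n_bins=10):
    Phi = phi(X)
    R = Phi - LinearRegression().fit(A, Phi).predict(A)        # psi(X, A, phi)
    edges = np.quantile(A[:, coord], np.linspace(0, 1, n_bins + 1))
    idx = np.digitize(A[:, coord], edges[1:-1])                # bin index in {0, ..., n_bins - 1}
    return np.vstack([R[idx == b].mean(axis=0) for b in range(n_bins)])
```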
3Here, Wϕ, αϕ and Vϕ are equal to Mϕ, qϕ and Vϕ as shown in the proof. We introduce the new notation (e.g., (11) instead of (8)) to emphasize that the expressions are functions of the observational distribution. Published as a conference paper at ICLR 2024 Assumption 1 (Regularity conditions on g0). Assume Setting 1. The mixing function g0 is differentiable and Lipschitz continuous. Assumption 2 (Regularity conditions on V ). Assume Setting 1. First, the characteristic function of the noise variable V has no zeros. Second, the distribution PV admits a density f V w.r.t. Lebesgue measure such that f V is analytic on Rd. Assumption 3 (Regularity condition on A). Assume Setting 1. The support of A, supp(A), contains a non-empty open subset of Rk. In addition to the injectivity assumed in Setting 1, Assumption 1 imposes further regularity conditions on the mixing function g0. As for Assumption 2, the first condition is satisfied, for example, when the distribution of V is infinitely divisible. The second condition requires that the density function of V can be locally expressed as a convergent power series. Examples of such functions are the exponential functions, trigonometric functions, and any linear combinations, compositions, and products of those. Hence, Gaussians and mixture of Gaussians are examples of distributions that satisfy Assumption 2. Lastly, Assumption 3 imposes a condition on the support of M0A, that is, the support of M0A has non-zero Lebesgue measure. These assumptions are closely related to the assumptions for bounded completeness in instrumental variable problems (D Haultfoeuille, 2011). Theorem 5. Assume Setting 1 and Assumptions 1, 2, and 3. Let Φ be a class of functions from X to Z that are differentiable and Lipschitz continuous. It holds for all ϕ Φ that ϕ satisfies (17) ϕ aff-identifies g 1 0 . (18) 5 A METHOD FOR TACKLING REP4EX 5.1 FIRST-STAGE: AUTO-ENCODER WITH MMR REGULARIZATION This section illustrates how to turn the identifiability result outlined in Section 4 into a practical method that implements the linear invariance and invertibility constraints in (17). The method is based on an auto-encoder (Kramer, 1991; Goodfellow et al., 2016) with a regularization term that enforces the linear invariance constraint (16). In particular, we adopt the the framework of maximum moment restrictions (MMRs) introduced in Muandet et al. (2020) as a representation of the constraint (16). MMRs can be seen as the reproducing kernel Hilbert space (RKHS) representations of conditional moment restrictions. Formally, let H be the RKHS of vector-valued functions (Alvarez et al., 2012) from A to Z with a reproducing kernel k and define ψ := ψPX,A : (x, a, ϕ) 7 ϕ(x) (Wϕa+αϕ) (recall that Wϕ and αϕ depend on the observational distribution PX,A). We can turn the conditional moment restriction in (16) into the MMR as follows. Define the function Q(ϕ) := sup h H, h 1 (E[ψ(X, A, ϕ) h(A)])2. (19) If the reproducing kernel k is integrally strictly positive definite (see Muandet et al. (2020, Definition 2.1)), then Q(ϕ) = 0 if and only if the conditional moment restriction in (16) is satisfied. One of the main advantages of using the MMR representation is that it can be written as a closedform expression. We have by Muandet et al. (2020, Theorem 3.3) that Q(ϕ) = E[ψ(X, A, ϕ) k(A, A )ψ(X , A , ϕ)], (20) where (X , A ) is an independent copy of (X, A). We now introduce our auto-encoder objective function4 with the MMR regularization. Let ϕ : X Z be an encoder and η : Z 7 X be a decoder. 
Our (population) loss function is defined as

L(ϕ, η) := E[ ‖X − η(ϕ(X))‖² ] + λ Q(ϕ),   (21)

where λ is a regularization parameter. In practice, we parameterize ϕ and η by neural networks, use a plug-in estimator^5 for (21) to obtain an empirical loss function, and minimize that loss with a standard (stochastic) gradient descent optimizer. Here, the role of the reconstruction loss part in (21) is to enforce the bijectivity constraint of ϕ|_{Im(g0)} in (17). The regularization parameter λ controls the trade-off between minimizing the mean squared error (MSE) and satisfying the MMR. We discuss procedures to choose λ in Appendix E.2.

^4 We consider a basic auto-encoder, but one can add MMR regularization to other variants too, e.g., variational (Kingma & Welling, 2014), adversarial-based (Makhzani et al., 2015), or diffusion-based (Preechakul et al., 2022).
^5 More precisely, we replace the expectations in (21) and (11) by empirical means (the latter expression enters through ψ and Q(ϕ)).

[Figure 2 (two panels: dimension of Z = 2 and dimension of Z = 4; x-axis: intervention strength (α); methods: AE-MMR, AE-MMR-Oracle, AE-Vanilla, VAE): R-squared values for different methods as the intervention strength (α) increases. Each point represents an average over 20 repetitions, and the error bar indicates its 95% confidence interval. AE-MMR yields an R-squared close to 1 as α increases, indicating its ability to aff-identify g_0^{-1}, while the two baseline methods yield significantly lower R-squared values.]

5.2 SECOND-STAGE: CONTROL FUNCTION APPROACH

Given a learned encoder ϕ, we can now implement the control function approach for estimating E[Y | do(A = a*)], as per Theorem 3. We call the procedure Rep4Ex-CF. Algorithm 1 in Appendix E outlines the details. In summary, we first perform the linear regression of ϕ(X) on A to obtain (Ŵ_ϕ, α̂_ϕ), allowing us to compute the control variables V̂ = ϕ(X) − (Ŵ_ϕ A + α̂_ϕ). Subsequently, we employ an additive regression model on (ϕ(X), V̂) to predict Y and obtain the additive regression functions ν̂ and ψ̂. Finally, using the function ν̂, we compute an empirical average of the expectation on the right-hand side of (13).

6 EXPERIMENTS

We now conduct simulation experiments to empirically validate our theoretical findings. First, we apply the MMR-based auto-encoder introduced in Section 5.1 and show in Section 6.1 that it can successfully recover the unmixing function g_0^{-1} up to an affine transformation. Second, in Section 6.2, we apply the full Rep4Ex-CF procedure (see Section 5.2) to demonstrate that one can indeed predict previously unseen interventions as suggested by Theorem 3. The code for all experiments is included in the supplementary material.

6.1 IDENTIFYING THE UNMIXING FUNCTION g_0^{-1}

This section validates the result of affine identifiability, see Theorem 5. We consider the SCMs

S(α):  A := ε_A,   Z := α M_0 A + V,   X := g_0(Z),   (22)

where the complete specification of this SCM is given in Appendix G.1. The parameter α controls the strength of the effect of A on Z. We set the dimension of X to 10 and consider two choices d ∈ {2, 4} for the dimension of Z. Additionally, we set the dimension of A to the dimension of Z. We sample 1 000 observations from the SCM (22) and learn an encoder ϕ using the regularized auto-encoder (AE-MMR) as outlined in Section 5.1. As our baselines, we include a vanilla auto-encoder (AE-Vanilla) and a variational auto-encoder (VAE) for comparison.
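For concreteness, the following is a minimal sketch of the empirical first-stage objective, i.e., the plug-in version of (21) with the closed-form MMR penalty (20) computed on a mini-batch (cf. footnote 5). The network architecture, the Gaussian-kernel bandwidth, and the small ridge term added for numerical stability are our illustrative choices, not the authors' implementation.

```python
# Sketch of the empirical AE-MMR objective: reconstruction loss plus the MMR penalty (20),
# with expectations in (21) and (11) replaced by mini-batch means.
import torch
import torch.nn as nn

def mmr_penalty(phi_x, a, sigma=1.0):
    """Empirical Q(phi): residuals of the batch least-squares fit of phi(X) on A,
    paired through a Gaussian kernel on A (V-statistic form of (20))."""
    n = a.shape[0]
    ones = torch.ones(n, 1, dtype=a.dtype, device=a.device)
    a_aug = torch.cat([a, ones], dim=1)                          # intercept column
    gram = a_aug.T @ a_aug + 1e-6 * torch.eye(a_aug.shape[1], dtype=a.dtype, device=a.device)
    coef = torch.linalg.solve(gram, a_aug.T @ phi_x)             # (W_phi, alpha_phi) by least squares
    resid = phi_x - a_aug @ coef                                 # psi(X, A, phi)
    kernel = torch.exp(-torch.cdist(a, a) ** 2 / (2 * sigma ** 2))
    return (kernel * (resid @ resid.T)).sum() / n ** 2

class AEMMR(nn.Module):
    def __init__(self, x_dim, z_dim, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(), nn.Linear(hidden, x_dim))

    def loss(self, x, a, lam):
        z_hat = self.enc(x)
        recon = ((x - self.dec(z_hat)) ** 2).sum(dim=1).mean()   # reconstruction part of (21)
        return recon + lam * mmr_penalty(z_hat, a)
```

Training then amounts to minimizing `model.loss(x_batch, a_batch, lam)` over mini-batches with a stochastic gradient optimizer, with λ chosen as discussed in Appendix E.2.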
We also consider an oracle model (AE-MMR-Oracle) where we train the encoder and decoder using the true latent predictors Z and then use these trained models to initialize the regularized auto-encoder. We refer to Appendix G.2 for the details on the network and parameter choices. Lastly, we consider identifiability of g_0^{-1} only up to an affine transformation, see Definition 2. To measure the quality of an estimate ϕ, we therefore linearly regress the true Z on the representation ϕ(X) and report the R-squared for each candidate method. This metric is justified by Proposition 7 in Appendix C.2.

Figure 2 illustrates the results with varying intervention strength (α). As α increases, our method, AE-MMR, achieves higher R-squared values that appear to approach 1. This indicates that AE-MMR can indeed recover the unmixing function g_0^{-1} up to an affine transformation. In contrast, the two baseline methods, AE-Vanilla and VAE, achieve significantly lower R-squared values, indicating non-identifiability when the linear invariance constraint is not enforced; see also the scatter plots in Figures 5 (AE-MMR) and 6 (AE-Vanilla) in Appendix H.

6.2 PREDICTING PREVIOUSLY UNSEEN INTERVENTIONS

In this section, we focus on the task of predicting previously unseen interventions as detailed in Section 3. We use the following SCM as the data generating process:

S(γ):  A := ε_A^γ,   Z := M_0 A + V,   X := g_0(Z),   Y := ℓ(Z) + U,   (23)

where ε_A^γ ∼ Unif([−γ, γ]^k). Hence, the parameter γ determines the support of A in the observational distribution. The complete specification of this SCM is provided in Appendix G.1. Our approach, denoted by Rep4Ex-CF, follows the procedure outlined in Algorithm 1. In the first stage, we employ AE-MMR as the regularized auto-encoder. In the second stage, we use a neural network that enforces additivity in the output layer for the additive regression model. For comparison, we include a neural-network-based regression model (MLP) of Y on A as a baseline. We also include an oracle method, Rep4Ex-CF-Oracle, where we use the true latent Z instead of learning a representation in the first stage. In all experiments, we use a sample size of 10 000.

[Figure 3: Different estimates of the target of inference E[Y | do(A := a)] as the training support γ increases. The error bars represent the 95% confidence intervals over 10 repetitions. The training points displayed are subsampled for the purpose of visualization. Rep4Ex-CF demonstrates the ability to extrapolate beyond the training support, achieving nearly perfect extrapolation when γ = 1.2. In contrast, the baseline MLP shows clear limitations in its ability to extrapolate.]

Figure 3 presents the results obtained with three γ values (0.2, 0.7, 1.2), one-dimensional A, and two-dimensional X. As anticipated, the neural-network-based regression model (MLP) fails to extrapolate beyond the training support. Conversely, our approach, Rep4Ex-CF, demonstrates successful extrapolation, with performance increasing for larger γ. Furthermore, we conduct experiments with multi-dimensional A and present the results in Appendix H.1. Solving the optimization problem becomes more difficult, but the outcomes echo the results observed with one-dimensional A.

7 DISCUSSION

Our work highlights concrete benefits of identifiable representation learning. We introduce Rep4Ex, the task of learning a representation that enables nonlinear intervention extrapolation, and propose corresponding theory and methodology.
We regard this work only as a first step toward solving this task. Developing alternative methods and relaxing some of the assumptions (e.g., allowing for noise in the mixing function g0 and more flexible dependencies between A and Z) may yield more powerful methods for achieving Rep4Ex. Published as a conference paper at ICLR 2024 ACKNOWLEDGMENTS We thank Nicola Gnecco and Felix Schur for helpful discussions. NP was supported by a research grant (0069071) from Novo Nordisk Fonden. ER and PR acknowledge the support of DARPA via FA8750-23-2-1015, ONR via N00014-23-1-2368, and NSF via IIS-1909816, IIS-1955532. Kartik Ahuja, Jason S Hartford, and Yoshua Bengio. Weakly supervised representation learning with sparse perturbations. Advances in Neural Information Processing Systems, 35:15516 15528, 2022a. Kartik Ahuja, Divyat Mahajan, Vasilis Syrgkanis, and Ioannis Mitliagkas. Towards efficient representation identification in supervised learning. In Conference on Causal Learning and Reasoning, pp. 19 43. PMLR, 2022b. Kartik Ahuja, Divyat Mahajan, Yixin Wang, and Yoshua Bengio. Interventional causal representation learning. In International Conference on Machine Learning, pp. 372 407. PMLR, 2023. Mauricio A Alvarez, Lorenzo Rosasco, Neil D Lawrence, et al. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195 266, 2012. Joshua D Angrist, Guido W Imbens, and Donald B Rubin. Identification of causal effects using instrumental variables. Journal of the American statistical Association, 91(434):444 455, 1996. Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. Ar Xiv e-prints (1907.02893), 2019. Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798 1828, 2013. Gecia Bravo-Hermsdorff, David S Watson, Jialin Yu, Jakob Zeitler, and Ricardo Silva. Intervention generalization: A view from factor graph models. ar Xiv preprint ar Xiv:2306.04027, 2023. Johann Brehmer, Pim De Haan, Phillip Lippe, and Taco S Cohen. Weakly supervised causal representation learning. Advances in Neural Information Processing Systems, 35:38319 38331, 2022. Simon Buchholz, Goutham Rajendran, Elan Rosenfeld, Bryon Aragam, Bernhard Schölkopf, and Pradeep Ravikumar. Learning linear causal representations from interventions under general nonlinear mixing. ar Xiv preprint ar Xiv:2306.02235, 2023. Theo Bühler and Dietmar A Salamon. Functional analysis, volume 191. American Mathematical Society, 2018. Ruichu Cai, Feng Xie, Clark Glymour, Zhifeng Hao, and Kun Zhang. Triad constraints for learning causal structure of latent variables. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. Rune Christiansen, Niklas Pfister, Martin Emil Jakobsen, Nicola Gnecco, and Jonas Peters. A causal framework for distribution generalization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):6614 6630, 2021. Panayiota Constantinou and A Philip Dawid. Extended conditional independence and applications in causal inference. The Annals of Statistics, pp. 2618 2653, 2017. Xavier D Haultfoeuille. On the completeness condition in nonparametric instrumental problems. Econometric Theory, 27(3):460 471, 2011. Kenji Fukumizu, Arthur Gretton, Gert Lanckriet, Bernhard Schölkopf, and Bharath K Sriperumbudur. Kernel choice and classifiability for rkhs embeddings of probability distributions. 
In Advances in Neural Information Processing Systems 22 (Neur IPS). Curran Associates, Inc., 2009. Published as a conference paper at ICLR 2024 Nicola Gnecco, Jonas Peters, Sebastian Engelke, and Niklas Pfister. Boosted control functions. ar Xiv preprint ar Xiv:2310.05805, 2023. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016. Limor Gultchin, David Watson, Matt Kusner, and Ricardo Silva. Operationalizing complex causes: A pragmatic view of mediation. In International Conference on Machine Learning, pp. 3875 3885. PMLR, 2021. Hermanni Hälvä, Sylvain Le Corff, Luc Lehéricy, Jonathan So, Yongjie Zhu, Elisabeth Gassiat, and Aapo Hyvarinen. Disentangling identifiable features from noisy data with structured nonlinear ica. Advances in Neural Information Processing Systems, 34:1624 1633, 2021. James J Heckman. Dummy endogenous variables in a simultaneous equation system. Technical report, National Bureau of Economic Research, 1977. Aapo Hyvarinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ica. Advances in neural information processing systems, 29, 2016. Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural networks, 13(4-5):411 430, 2000. Aapo Hyvärinen and Petteri Pajunen. Nonlinear independent component analysis: existence and uniqueness results. Neural Networks, 12(3):429 439, 1999. Aapo Hyvarinen, Hiroaki Sasaki, and Richard Turner. Nonlinear ica using auxiliary variables and generalized contrastive learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 859 868. PMLR, 2019. Martin Emil Jakobsen and Jonas Peters. Distributional robustness of K-class estimators and the PULSE. The Econometrics Journal, 25(2):404 432, 2022. Yibo Jiang and Bryon Aragam. Learning nonparametric latent causal graphs with unknown interventions. ar Xiv preprint ar Xiv:2306.02899, 2023. Ilyes Khemakhem, Diederik Kingma, Ricardo Monti, and Aapo Hyvarinen. Variational autoencoders and nonlinear ica: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pp. 2207 2217. PMLR, 2020. Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014. Bohdan Kivva, Goutham Rajendran, Pradeep Ravikumar, and Bryon Aragam. Learning latent causal graphs via mixture oracles. Advances in Neural Information Processing Systems, 34:18087 18101, 2021. Bohdan Kivva, Goutham Rajendran, Pradeep Ravikumar, and Bryon Aragam. Identifiability of deep generative models without auxiliary information. Advances in Neural Information Processing Systems, 35:15687 15701, 2022. Lingjing Kong, Biwei Huang, Feng Xie, Eric Xing, Yuejie Chi, and Kun Zhang. Identification of nonlinear latent hierarchical models. ar Xiv preprint ar Xiv:2306.07916, 2023. Mark A Kramer. Nonlinear principal component analysis using autoassociative neural networks. AICh E journal, 37(2):233 243, 1991. Sébastien Lachapelle, Pau Rodriguez, Yash Sharma, Katie E Everett, Rémi Le Priol, Alexandre Lacoste, and Simon Lacoste-Julien. Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ica. In Conference on Causal Learning and Reasoning, pp. 428 484. PMLR, 2022. Sébastien Lachapelle, Tristan Deleu, Divyat Mahajan, Ioannis Mitliagkas, Yoshua Bengio, Simon Lacoste-Julien, and Quentin Bertrand. 
Synergies between disentanglement and sparsity: Generalization and identifiability in multi-task learning. In International Conference on Machine Learning, 2023a. Published as a conference paper at ICLR 2024 Sébastien Lachapelle, Divyat Mahajan, Ioannis Mitliagkas, and Simon Lacoste-Julien. Additive decoders for latent variables identification and cartesian-product extrapolation. ar Xiv preprint ar Xiv:2307.02598, 2023b. Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. ar Xiv preprint ar Xiv:1511.05644, 2015. Gemma E Moran, Dhanya Sridhar, Yixin Wang, and David M Blei. Identifiable deep generative models via sparse decoding. ar Xiv preprint ar Xiv:2110.10804, 2021. Krikamol Muandet, Wittawat Jitkrittum, and Jonas Kübler. Kernel conditional moment test via maximum moment restriction. In Conference on Uncertainty in Artificial Intelligence, pp. 41 50. PMLR, 2020. Preetam Nandy, Marloes H. Maathuis, and Thomas S. Richardson. Estimating the effect of joint interventions from observational data in sparse high-dimensional settings. The Annals of Statistics, 45(2):647 674, 2017. Whitney K Newey, James L Powell, and Francis Vella. Nonparametric estimation of triangular simultaneous equations models. Econometrica, 67(3):565 603, 1999. Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, USA, 2nd edition, 2009. Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion autoencoders: Toward a meaningful and decodable representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10619 10629, 2022. Thomas Richardson. Markov properties for acyclic directed mixed graphs. Scandinavian Journal of Statistics, 30(1):145 157, 2003. Geoffrey Roeder, Luke Metz, and Durk Kingma. On linear identifiability of learned representations. In International Conference on Machine Learning, pp. 9030 9039. PMLR, 2021. Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, and Jonas Peters. Invariant models for causal transfer learning. The Journal of Machine Learning Research, 19(1):1309 1342, 2018. Elan Rosenfeld, Pradeep Ravikumar, and Andrej Risteski. The risks of invariant risk minimization. In International Conference on Learning Representations, volume 9, 2021. Elan Rosenfeld, Pradeep Ravikumar, and Andrej Risteski. Domain-adjusted regression or: Erm may already learn features sufficient for out-of-distribution generalization. ar Xiv preprint ar Xiv:2202.06856, 2022. Dominik Rothenhäusler, Nicolai Meinshausen, Peter Bühlmann, and Jonas Peters. Anchor regression: Heterogeneous data meet causality. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 83(2):215 246, 2021. Walter Rudin. Real and complex analysis, 3rd Edition. Mc Graw-Hill, 1987. Sorawit Saengkyongam and Ricardo Silva. Learning joint nonlinear effects from single-variable interventions in the presence of hidden confounders. In Conference on Uncertainty in Artificial Intelligence, pp. 300 309. PMLR, 2020. Sorawit Saengkyongam, Leonard Henckel, Niklas Pfister, and Jonas Peters. Exploiting independent instruments: Identification and distribution generalization. In International Conference on Machine Learning, pp. 18935 18958. PMLR, 2022. Alexander Schell and Harald Oberhauser. Nonlinear independent component analysis for discretetime and continuous-time signals. The Annals of Statistics, 51(2):487 518, 2023. 
Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612 634, 2021. Published as a conference paper at ICLR 2024 Anna Seigal, Chandler Squires, and Caroline Uhler. Linear causal disentanglement via interventions. ar Xiv preprint ar Xiv:2211.16467, 2022. Xinwei Shen and Nicolai Meinshausen. Engression: Extrapolation for nonlinear regression? ar Xiv preprint ar Xiv:2307.00835, 2023. Lester G Telser. Iterative estimation of a set of linear regression equations. Journal of the American Statistical Association, 59(307):845 862, 1964. Burak Varici, Emre Acarturk, Karthikeyan Shanmugam, Abhishek Kumar, and Ali Tajer. Scorebased causal representation learning with interventions. ar Xiv preprint ar Xiv:2301.08230, 2023. Julius von Kügelgen, Michel Besserve, Wendong Liang, Luigi Gresele, Armin Keki c, Elias Bareinboim, David M Blei, and Bernhard Schölkopf. Nonparametric identifiability of causal representations from unknown interventions. ar Xiv preprint ar Xiv:2306.00542, 2023. Norbert Wiener. Tauberian theorems. Annals of Mathematics, pp. 1 100, 1932. Philip G Wright. The Tariff on Animal and Vegetable Oils. Investigations in International Commercial Policies. Macmillan, New York, NY, 1928. Pengzhou Abel Wu and Kenji Fukumizu. $\beta$-intact-vae: Identifying and estimating causal effects under limited overlap. In International Conference on Learning Representations, 2022. Feng Xie, Biwei Huang, Zhengming Chen, Yangbo He, Zhi Geng, and Kun Zhang. Identification of linear non-Gaussian latent hierarchical structure. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 24370 24387. PMLR, 2022. Jiaqi Zhang, Chandler Squires, Kristjan Greenewald, Akash Srivastava, Karthikeyan Shanmugam, and Caroline Uhler. Identifiability guarantees for causal disentanglement from soft interventions. ar Xiv preprint ar Xiv:2307.06250, 2023. Published as a conference paper at ICLR 2024 A RELATED WORK: NONLINEAR ICA Identifiable representation learning has been studied within the framework of nonlinear ICA (e.g., Hyvarinen & Morioka, 2016; Hyvarinen et al., 2019; Khemakhem et al., 2020; Schell & Oberhauser, 2023). Khemakhem et al. (2020) provide a unifying framework that leverages the independence structure of latent variables Z conditioned on auxiliary variables. Although our actions A could be considered auxiliary variables, the identifiability results and assumptions in Khemakhem et al. (2020) do not fit our setup and task. Concretely, a key assumption in their framework is that the components of Z are independent when conditioned on A. In contrast, our approach permits dependence among the components of Z even when conditioned on A as the components of V in our setting can have arbitrary dependencies. More importantly, Khemakhem et al. (2020) provide identifiability up to point-wise nonlinearities which is not sufficient for intervention extrapolation. The main focus of our work is to provide an identification that facilitates a solution to the task of intervention extrapolation. Some other studies in nonlinear ICA have shown identifiability beyond point-wise nonlinearities (e.g., Roeder et al., 2021; Ahuja et al., 2022b). However, the models considered in these studies are not compatible with our data generation process either. B REMARK ON THE KEY ASSUMPTIONS IN SETTING 1 Remark 6. 
(i) The assumption of linearity from Z on A can be relaxed: if there is a known nonlinear function h such that Z := M0h( A) + V , we can define A := h( A) and obtain an instance of Setting 1. Similarly, if there is an injective h such that Z := h(M0A + V ) and X := g0( Z), we can define Z := M0A + V and X := (g0 h)(Z). (ii) The assumptions of full support of V and full rank of M0 can be relaxed by considering Z Rd to be a linear subspace, with supp(V ) and M0A both being equal to Z. (iii) Our experimental results in Appendix H.3 suggest that it may be possible to relax the assumption of the absence of noise in X. C FURTHER THEORETICAL RESULTS C.1 IDENTIFYING E[Y |Z = z, do(A = a )] WITH OBSERVED Z For all z supp(Z) and a A, we have E[Y |Z = z, do(A = a )] = ℓ(z) + E[U|Z = z, do(A = a )] = ℓ(z) + E[U|M0a + V = z, do(A = a )] = ℓ(z) + E[U|V = z M0a , do(A = a )] = ℓ(z) + E[U|V = z M0a ] since PU,V = Pdo(A=a ) U,V , = ℓ(z) + λ(z M0a ), where, ℓand λ are identifiable by (3) if the interior of supp(A) is convex (we still assume that supp(V ) = Rd), and the functions ℓand λ are differentiable (Newey et al., 1999). C.2 EQUIVALENT FORMULATION OF AFFINE IDENTIFIABILITY Under Setting 1, we show an equivalent formulation of affine identifiability in Proposition 7 stressing that Z can be reconstructed from ϕ(X). In our empirical evaluation (see Section 6), we adopt this formulation to define a metric for measuring how well an encoder ϕ aff-identifies g 1 0 . Proposition 7 (Equivalent definition of affine identifiability). Assume Setting 1. An encoder ϕ : X Z aff-identifies g 1 0 if and only if there exists a matrix Jϕ Rd d and a vector dϕ Rd s.t. z Z : z = Jϕϕ(x) + dϕ, where x := g0(z). (24) Published as a conference paper at ICLR 2024 C.3 SOME LEMMATA Lemma 8. Assume the underlying SCM (1). We have that U Z | V under PS. Proof. In the SCM (1) it holds that A (U, V ) which by the weak union property of conditional independence (e.g., Constantinou & Dawid, 2017, Theorem 2.4) implies that A U | V . This in turn implies (A, V ) U | V (e.g., Constantinou & Dawid, 2017, Example 2.1). Now, by Proposition 2.3 (ii) in Constantinou & Dawid (2017) this is equivalent to the condition that for all measurable and bounded functions g : A R : R it almost surely holds that E[g(A, V ) | U, V ] = E[g(A, V ) | V ]. (25) Hence, for all f : Z R measurable and bounded it almost surely holds that E[f(Z) | U, V ] = E[f(M0A + V ) | U, V ] = E[f(M0A + V ) | V ] by (25) with g : (a, v) 7 f(M0a + v) = E[f(Z) | V ]. (26) Again by Proposition 2.3 (ii) in Constantinou & Dawid (2017), this is equivalent to U Z | V as desired. As an alternative to our proof, one can also argue using SCMs and Markov properties in ADMGs (Richardson, 2003). Lemma 9. Assume Setting 1. We have that U Z | Vϕ. Proof. Since the function v 7 Hϕ(v E[V ]) is bijective, the proof follows from the same arguments as given in the proof of Lemma 8. Lemma 10. Assume Setting 1. Let ϕ : X Z be an encoder. We have that ϕ g0 is bijective = ϕ|Im(g0) is bijective . (27) Proof. Let ϕ be an encoder such that ϕ g0 is bijective. We first show that ϕ|Im(g0) is injective by contradiction. Assume that ϕ|Im(g0) is not injective. Then, there exist x1, x2 Im(g0) such that ϕ(x1) = ϕ(x2) and x1 = x2. Now consider z1, z2 Z with x1 = g0(z1) and x2 = g0(z2); clearly, z1 = z2. Using that ϕ g0 is injective, we have (ϕ g0)(z1) = ϕ(x1) = ϕ(x2) = (ϕ g0)(z2) which leads to the contradiction. We can thus conclude that ϕ|Im(g0) is injective. Next, we show that ϕ|Im(g0) is surjective. 
Let z1, z2 Z. Since ϕ g0 is surjective, there exist z1, z2 Z such that z1 = (ϕ g0)( z1) and z2 = (ϕ g0)( z2). Let x1 := g0( z1) Im(g0) and x2 := g0( z2) Im(g0). We then have that z1 = ϕ(x1) and z2 = ϕ(x2) which shows that ϕ|Im(g0) is surjective and concludes the proof. D.1 PROOF OF PROPOSITION 1 Proof. We consider k = d = 1, that is, A R, Z R, Y R. We define the function p1 V : R R for all v R by 1 12 if v ( 4, 2) 1 4 exp( (v 2)) if v (2, ) 1 4 exp(v + 4) if v ( , 4) and the function p2 V : R R for all v R by 1 12 if v ( 2, 1) 1 24 if v ( 5, 2) 5 16 exp( (v 1)) if v (1, ) 5 16 exp(v + 5) if v ( , 5). Published as a conference paper at ICLR 2024 These two functions are valid densities as we have for all v R that p1 V (v) > 0, v R : p2 V (v) > 0, and R p1 V (v) dv = 1, R p2 V (v) dv = 1. Furthermore, these two densities p1 V (v) and p2 V (v) satisfy the following conditions, (1) for all a (0, 1), it holds that a 1 p1 V (v) dv = 1 a 2 p2 V (v) dv, (28) (2) for all a ( 3, 2) the following holds a 1 p1 V (v) dv = Z a+1 1 4 exp(v + 4) dv 2 exp((a 1) + 4) a 2 p2 V (v) dv. (29) Next, let S1 be the following SCM A := ϵA Z := A + V Y := 1(|Z| 1) + U, (30) where ϵA Uniform(0,1), V P1 V , U P1 U independent such that ϵA (V, U), and E[U] = 0. Further, we assume that V admits a density p1 V as defined above. Next, we define the second SCM S2 as follows A := ϵA Z := A + V Y := 1(|Z + 1| 1) + U, (31) where ϵA Uniform(0,1), V P2 V , U P2 U independent such that ϵA (V, U), E[U] = 0 and V has the density given by p2 V . By construction we have that supp S1(V ) = supp S2(V ) = R and supp S1(A) = supp S2(A). Now, we show that the two SCMs S1 and S2 satisfy the third statement of Proposition 1. Define c1 = 0 and c2 = 1. For i {1, 2}, we have for all a R that ESi[Y | do(A = a)] = ESi[1(|Z + ci| 1) | do(A = a)] + ESi[U| do(A = a)] = ESi[1(|V a + ci| 1) | do(A = a)] + ESi[U| do(A = a)] ( ) = ESi[1(|V a + ci| 1) | do(A = a)] ( ) = ESi[1(|V a + ci| 1)] 1(|v a + ci| 1)pi V (v) dv, (32) where ( ) holds because a A : PU = Pdo(A=a) U and ESi[U] = 0 and ( ) holds because a A : PV = Pdo(A=a) V . Since A is exogenous, we have for all i {1, 2} and a supp S1(A) = (0, 1) Published as a conference paper at ICLR 2024 that ESi[Y | do(A = a)] = ESi[Y | A = a]. From (32), we therefore have for all a (0, 1) ES1[Y | A = a] = Z 1(|v a| 1)p1 V (v) dv a 1 p1 V (v) dv a 2 p2 V (v) dv by (28) 1(|v a + 1| 1)p2 V (v) dv = ES2[Y | A = a]. We have shown that the two SCMs S1 and S2 satisfy the first statement of Proposition 1. Lastly, we show below that they also satisfy the fourth statement of Proposition 1. Define B := ( 3, 2) R which has positive measure. From (32), we then have for all a ( 3, 2) ES1[Y | do(A = a)] = Z 1(|v a| 1)p1 V (v) dv a 1 p1 V (v) dv a 2 p2 V (v) dv by (29) 1(|v a + 1| 1)p2 V (v) dv = ES2[Y | do(A = a)], which shows that S1 and S2 satisfy the forth condition of Proposition 1 and concludes the proof. D.2 PROOF OF PROPOSITION 7 Proof. We begin by showing the only if direction. Let ϕ : X Z be an encoder that aff-identifies g 1 0 . Then, by definition, there exists an invertible matrix Hϕ Rd d and a vector cϕ Rd such that z Z : (ϕ g0)(z) = Hϕz + cϕ. (33) We then have that z Z : z = H 1 ϕ ϕ(x) H 1 ϕ cϕ, where x := g0(z), (34) which shows the required statement. Next, we show the if direction. Let ϕ : X Z be an encoder for which there exists a matrix Jϕ Rd d and a vector dϕ Rd such that z Z : z = Jϕϕ(x) + dϕ, where x := g0(z). (35) Since Z = Rd, this implies that Jϕ is surjective and thus has full rank. 
We therefore have that z Z : (ϕ g0)(z) = J 1 ϕ z J 1 ϕ dϕ, (36) which shows the required statement and concludes the proof. D.3 PROOF OF THEOREM 3 Proof. Let κϕ = z 7 Hϕz + cϕ be the corresponding affine map of ϕ. From (7), we have for all a A, that E[Y | do(A = a )] = E[(ℓ κ 1 ϕ )(Mϕa + qϕ + Vϕ)], (37) Published as a conference paper at ICLR 2024 where Mϕ = HϕM0, qϕ = cϕ + Hϕ E[V ], and Vϕ = Hϕ(V E[V ]) as defined in (8). To prove the first statement, we thus aim to show that, for all a A, E[ν(Wϕa + αϕ + Vϕ)] (E[ν(ϕ(X))] E[Y ]) = E[(ℓ κ 1 ϕ )(Mϕa + qϕ + Vϕ)]. (38) To begin with, we show that Wϕ = Mϕ and αϕ = qϕ. We have for all α Rd, W Rd d E[ ϕ(X) (WA + α) 2] = E[ MϕA + qϕ + Vϕ α WA 2] from (9) = E[ (Mϕ W)A + (qϕ α) + Vϕ 2] = E[ (Mϕ W)A + (qϕ α) 2] + 2 E[((Mϕ W)A + (qϕ α)) Vϕ] + E[ Vϕ 2] = E[ (Mϕ W)A + (qϕ α) 2] + E[ Vϕ 2]. since A Vϕ and E[Vϕ] = 0 Since the covariance matrix of A has full rank, we therefore have that (αϕ, Wϕ) = argmin α Rd,W Rd k E[ ϕ(X) α WA 2] = (qϕ, Mϕ), (39) and that Vϕ = ϕ(X) (MϕA + qϕ) = Vϕ, where the last equality holds by (9). Next, we show that ν (ℓ κ 1 ϕ ). Since ℓis differentiable, the function ℓ κ 1 ϕ is also differentiable. We have supp(A, Vϕ) = supp(A, V ) = supp(A) Rd. Thus, the interior of supp(A, Vϕ) is convex (as the interior of supp(A) is convex) and its boundary has measure zero. Also, the matrix M0 has full row rank. Moreover, using aff-identifiability and (4) we can write ϕ(X) = MϕA + qϕ + Vϕ Y = ℓ κ 1 ϕ (ϕ(X)) + U, where A (Vϕ, U). This is a simultaneous equation model (over the observed variables ϕ(X), A, and Y ) for which the structural function is ℓ κ 1 ϕ and the control function is λϕ. We can therefore apply Theorem 2.3 in Newey et al. (1999) (see Gnecco et al. (2023, Proposition 3) for a complete proof, including usage of convexity, which we believe is missing in the argument of Newey et al. (1999)) to conclude that ℓ κ 1 ϕ and λϕ are identifiable from (10) up to a constant. That is, ν (ℓ κ 1 ϕ ) + δ and ψ λϕ δ (40) for some constant δ R. Combining with the fact that Wϕ = Mϕ and αϕ = qϕ, we then have, for all a A, E[ν(Wϕa + αϕ + Vϕ)] = E[(ℓ κ 1 ϕ )(Mϕa + qϕ + Vϕ)] + δ. (41) Now, we use the assumption that E[U] = 0 to deal with the constant term δ. E[Y ] = E[ℓ(g 1 0 (X))] since E[U] = 0 (42) = E[((ℓ κ 1 ϕ ) (κϕ g 1 0 ))(X)] (43) = E[(ℓ κ 1 ϕ )(ϕ(X))] since ϕ aff-identifies g 1 0 . (44) Thus, we have E[ν(ϕ(X))] E[Y ] = E[(ℓ κ 1 ϕ )(ϕ(X)) + δ] E[Y ] by (40) = E[(ℓ κ 1 ϕ )(ϕ(X)) + δ] E[(ℓ κ 1 ϕ )(ϕ(X))] by (44) Combining (45) and (41), we have for all a A that E[ν(Wϕa + αϕ + Vϕ)] (E[ν(ϕ(X))] E[Y ]) = E[(ℓ κ 1 ϕ )(Mϕa + qϕ + Vϕ)], which yields (38) and concludes the proof of the first statement. Published as a conference paper at ICLR 2024 Next, we prove the second statement. We have for all x Im(g0) and a A, that E[Y |X = x, do(A = a )] = E[ℓ(Z) | X = x, do(A = a )] + E[U|X = x, do(A = a )] = (ℓ g 1 0 )(x) + E[U|X = x, do(A = a )] = (ℓ g 1 0 )(x) + E[U|g0(Z) = x, do(A = a )] = (ℓ g 1 0 )(x) + E[U|g0(M0a + V ) = x, do(A = a )] = (ℓ g 1 0 )(x) + E[U|V = g 1 0 (x) M0a , do(A = a )] ( ) = (ℓ g 1 0 )(x) + E[U|V = g 1 0 (x) M0a ] = ((ℓ κ 1 ϕ ) (κϕ g 1 0 ))(x) + E[U|V = g 1 0 (x) M0a ] ( ) = (ℓ κ 1 ϕ )(ϕ(x)) + E[U|V = g 1 0 (x) M0a ] (46) where the equality ( ) hold because a A : PU,V = Pdo(A=a ) U,V and ( ) follows from the fact that ϕ aff-identifies g 1 0 . Next, define h := v 7 Hϕ(v E[V ]). 
Next, we prove the second statement. We have for all x ∈ Im(g0) and a* ∈ 𝒜 that

E[Y | X = x, do(A = a*)] = E[ℓ(Z) | X = x, do(A = a*)] + E[U | X = x, do(A = a*)]
  = (ℓ ∘ g0^{-1})(x) + E[U | X = x, do(A = a*)]
  = (ℓ ∘ g0^{-1})(x) + E[U | g0(Z) = x, do(A = a*)]
  = (ℓ ∘ g0^{-1})(x) + E[U | g0(M0 a* + V) = x, do(A = a*)]
  = (ℓ ∘ g0^{-1})(x) + E[U | V = g0^{-1}(x) − M0 a*, do(A = a*)]
  (*) = (ℓ ∘ g0^{-1})(x) + E[U | V = g0^{-1}(x) − M0 a*]
  = ((ℓ ∘ κ_ϕ^{-1}) ∘ (κ_ϕ ∘ g0^{-1}))(x) + E[U | V = g0^{-1}(x) − M0 a*]
  (**) = (ℓ ∘ κ_ϕ^{-1})(ϕ(x)) + E[U | V = g0^{-1}(x) − M0 a*],    (46)

where the equality (*) holds because for all a* ∈ 𝒜 we have P_{U,V} = P^{do(A=a*)}_{U,V}, and (**) follows from the fact that ϕ aff-identifies g0^{-1}. Next, define h := v ↦ H_ϕ(v − E[V]). We have for all x ∈ Im(g0) and a* ∈ 𝒜 that

h(g0^{-1}(x) − M0 a*) = H_ϕ(g0^{-1}(x) − M0 a* − E[V])
  = H_ϕ g0^{-1}(x) − H_ϕ M0 a* − H_ϕ E[V]
  = H_ϕ g0^{-1}(x) + c_ϕ − (M_ϕ a* + q_ϕ)
  = (ϕ ∘ g0 ∘ g0^{-1})(x) − (M_ϕ a* + q_ϕ)
  = ϕ(x) − (M_ϕ a* + q_ϕ)
  = ϕ(x) − (W_ϕ a* + α_ϕ)    (from (39)).    (47)

Since the function h is bijective, combining (47) and (46) yields

E[Y | X = x, do(A = a*)] = (ℓ ∘ κ_ϕ^{-1})(ϕ(x)) + E[U | h(V) = h(g0^{-1}(x) − M0 a*)]
  = (ℓ ∘ κ_ϕ^{-1})(ϕ(x)) + E[U | V_ϕ = ϕ(x) − (W_ϕ a* + α_ϕ)]
  = (ℓ ∘ κ_ϕ^{-1})(ϕ(x)) + λ_ϕ(ϕ(x) − (W_ϕ a* + α_ϕ)).

Lastly, as argued in the first part of the proof, it holds by Theorem 2.3 in Newey et al. (1999) that ν ≡ (ℓ ∘ κ_ϕ^{-1}) + δ and ψ ≡ λ_ϕ − δ for some constant δ ∈ ℝ. We thus have that

for all x ∈ Im(g0) and a* ∈ 𝒜:  E[Y | X = x, do(A = a*)] = ν(ϕ(x)) + ψ(ϕ(x) − (W_ϕ a* + α_ϕ)),

which concludes the proof of the second statement.
D.4 PROOF OF THEOREM 5

Proof. We begin the proof by showing the forward direction (ϕ satisfies (17) ⟹ ϕ satisfies (6)). Let ϕ ∈ Φ be an encoder that satisfies (17). We then have for all a ∈ supp(A)

W_ϕ a + α_ϕ = E[ϕ(X) | A = a] = E[(ϕ ∘ g0)(M0 A + V) | A = a] = E[(ϕ ∘ g0)(M0 a + V)]    (since A ⊥ V).

Define h := ϕ ∘ g0. Taking the derivative with respect to a on both sides yields

W_ϕ = ∂/∂a E[h(M0 a + V)].

Next, we interchange the expectation and the derivative using the assumption that ϕ and g0 have bounded derivatives and the dominated convergence theorem. We have for all a ∈ supp(A)

W_ϕ = E[∂/∂a h(M0 a + V)] = E[ (∂h(u)/∂u)|_{u = M0 a + V} M0 ]    (by the chain rule).    (48)

Defining h′ : z ↦ (∂h(u)/∂u)|_{u=z} and g : z ↦ h′(z) M0 − W_ϕ, we have for all a ∈ supp(A)

0 = E[h′(M0 a + V) M0 − W_ϕ] = E[g(M0 a + V)] = ∫ g(M0 a + v) f_V(v) dv.

Define t := M0 a ∈ ℝ^d and τ := t + v; we then have for all t ∈ supp(M0 A) that

0 = ∫ g(τ) f_V(τ − t) d(τ − t) = ∫ g(τ) f_V(τ − t) dτ = ∫ g(τ) f_{−V}(t − τ) dτ.    (49)

Recall that g is a function from ℝ^d to ℝ^{d×k}. Now, for an arbitrary pair (i, j) ∈ {1, …, d} × {1, …, k}, define the function g_{ij} : ℝ^d → ℝ by g_{ij}(·) := g(·)_{ij}. We then have for each element (i, j) and all t ∈ supp(M0 A) that

0 = ∫ g_{ij}(τ) f_{−V}(t − τ) dτ.    (50)

Next, let us define c_{ij} : t ∈ ℝ^d ↦ ∫ g_{ij}(τ) f_{−V}(t − τ) dτ ∈ ℝ. We now show that c_{ij} ≡ 0, adapting the proof of D'Haultfœuille (2011, Proposition 2.3). By Assumption 2, f_{−V} is analytic on ℝ^d; we thus have for all τ ∈ ℝ^d that the function t ↦ g_{ij}(τ) f_{−V}(t − τ) is analytic on ℝ^d. Moreover, since g_{ij} is bounded, the function t ↦ g_{ij}(τ) f_{−V}(t − τ) is bounded, too. Thus, by Rudin (1987, page 229), the function c_{ij} is also analytic on ℝ^d. Using that M0 is surjective, we have by the open mapping theorem (see, e.g., Bühler & Salamon, 2018, page 54) that M0 is an open map. Now, since supp(A) contains a non-empty open subset of ℝ^k and M0 is an open map, we have from (50) that c_{ij}(t) = 0 on a non-empty open subset of ℝ^d. Then, by the identity theorem, the function c_{ij} is identically zero, that is,

c_{ij} ≡ 0.    (51)

Next, we show that g_{ij} ≡ 0. Let L^1 denote the space of equivalence classes of integrable functions from ℝ^d to ℝ. For all t ∈ ℝ^d, define f_t(·) := f_{−V}(t − ·) and Q := {f_t | t ∈ ℝ^d}. By Assumption 2, the characteristic function of V does not vanish. This implies that the characteristic function of −V does not vanish either (since the characteristic function of −V is the complex conjugate of the characteristic function of V). We therefore have that the Fourier transform of f_{−V} has no real zeros. Applying Wiener's Tauberian theorem (Wiener, 1932), the linear span of Q is dense in L^1. Using this density, together with (51) and the continuity of the linear form ϕ̃ ∈ L^1 ↦ ∫ g_{ij}(τ) ϕ̃(τ) dτ (continuity follows from the boundedness of g_{ij} and Hölder's inequality), it holds that

for all ϕ̃ ∈ L^1:  ∫ g_{ij}(τ) ϕ̃(τ) dτ = 0.    (52)

From (52), we can then conclude that

g_{ij} ≡ 0.    (53)

Next, from (53) and the definition of g, we have for all a ∈ supp(A) and v ∈ ℝ^d

h′(M0 a + v) M0 = W_ϕ.    (54)

As M0 has full row rank, it thus holds that

h′(M0 a + v) = W_ϕ M0^†.    (55)

In particular, the derivative of h = ϕ ∘ g0 is constant, so h is an affine transformation. Furthermore, using that g0 is injective and ϕ|_{Im(g0)} is bijective, the composition h = ϕ ∘ g0 is also injective. Therefore, there exist an invertible matrix H ∈ ℝ^{d×d} and a vector c ∈ ℝ^d such that

for all z ∈ ℝ^d:  (ϕ ∘ g0)(z) = Hz + c,    (56)

which concludes the proof of the forward direction.

Next, we show the backward direction of the statement (ϕ satisfies (6) ⟹ ϕ satisfies (17)). Let ϕ ∈ Φ satisfy (6). Then, there exist an invertible matrix H ∈ ℝ^{d×d} and a vector c ∈ ℝ^d such that for all z ∈ ℝ^d: (ϕ ∘ g0)(z) = Hz + c. We first show the second condition of (17). By the invertibility of H, the composition ϕ ∘ g0 is bijective. By Lemma 10, we thus have that ϕ|_{Im(g0)} is bijective. Next, we show the first condition of (17). Let μ_V := E[V]. We have for all α ∈ ℝ^d and W ∈ ℝ^{d×k}

E[‖ϕ(X) − α − WA‖²] = E[‖(ϕ ∘ g0)(Z) − α − WA‖²]
  = E[‖HZ + c − α − WA‖²]
  = E[‖H(M0 A + V) + c − α − WA‖²]
  = E[‖(HM0 − W)A + (c − α) + HV‖²]
  = E[‖(HM0 − W)A + (c + Hμ_V − α) + H(V − μ_V)‖²]
  = E[‖(HM0 − W)A + (c + Hμ_V − α)‖²] + 2 E[((HM0 − W)A + (c + Hμ_V − α))^T H(V − μ_V)] + E[‖H(V − μ_V)‖²]
  = E[‖(HM0 − W)A + (c + Hμ_V − α)‖²] + E[‖H(V − μ_V)‖²]    (since A ⊥ V).

Since the covariance matrix of A has full rank, we therefore have that

(α_ϕ, W_ϕ) := argmin_{α ∈ ℝ^d, W ∈ ℝ^{d×k}} E[‖ϕ(X) − α − WA‖²] = (c + Hμ_V, HM0).    (57)

Then, we have for all a ∈ supp(A) that

E[ϕ(X) − α_ϕ − W_ϕ A | A = a]
  (*) = E[(ϕ ∘ g0)(Z) − (c + Hμ_V) − HM0 A | A = a]
  = E[HZ + c − (c + Hμ_V) − HM0 A | A = a]
  = E[H(M0 A + V) + c − (c + Hμ_V) − HM0 A | A = a]
  = E[HV − Hμ_V | A = a]
  (**) = Hμ_V − Hμ_V = 0,

where the equality (*) follows from (57) and (**) holds by A ⊥ V. This concludes the proof.

E DETAILS ON THE ALGORITHM

E.1 THE ALGORITHM FOR REP4EX

We here present pseudo-code for Rep4Ex-CF; see Algorithm 1.

Algorithm 1: An algorithm for Rep4Ex
Input: observations (x_i, a_i, y_i)_{i=1}^n, target interventions (a*_j)_{j=1}^m, auto-encoder AE, additive regression AR
// Train the auto-encoder
ϕ ← AE((x_i, a_i)_{i=1}^n)
// Regress ϕ(X) on A
(Ŵ_ϕ, α̂_ϕ) ← argmin_{W, α} Σ_{i=1}^n ‖ϕ(x_i) − (W a_i + α)‖²
// Obtain the control variables
for i = 1 to n do
    v_i ← ϕ(x_i) − (Ŵ_ϕ a_i + α̂_ϕ)
end
// Fit the additive regression
(ν̂, ψ̂) ← AR(y_i ∼ ν(ϕ(x_i)) + ψ(v_i), i = 1, …, n)
// Estimate E[Y | do(A = a*_j)]
for j = 1 to m do
    ŷ_j ← (1/n) Σ_{i=1}^n ν̂(Ŵ_ϕ a*_j + α̂_ϕ + v_i) − ((1/n) Σ_{i=1}^n ν̂(ϕ(x_i)) − (1/n) Σ_{i=1}^n y_i)
end
Output: (ŷ_j)_{j=1}^m

E.2 HEURISTIC FOR CHOOSING THE REGULARIZATION PARAMETER λ

To select the regularization parameter λ in the regularized auto-encoder objective function (21), we employ the following heuristic. Let Λ = {λ_1, …, λ_m} be our candidate regularization parameters, ordered such that λ_1 > λ_2 > ⋯ > λ_m. For each λ_i, we estimate the minimizer of (21) and calculate the reconstruction loss. Additionally, we compute the reconstruction loss when setting λ = 0 as the baseline loss. We denote the resulting reconstruction losses by R_{λ_i} (and by R_0 for the baseline loss). Algorithm 2 illustrates how λ is chosen.

Algorithm 2: Choosing the λ parameter
Input: cutoff parameter α, candidate values λ_1 > ⋯ > λ_m with reconstruction losses R_{λ_1}, …, R_{λ_m}, and baseline loss R_0
λ ← λ_m
for i = 1 to m − 1 do
    δ_i ← R_{λ_i} / R_0 − 1
    if δ_i < α then
        λ ← λ_i
        break
return λ

In our experiments, we set the cutoff parameter to 0.2 and, for each setting, execute the heuristic only during the first repetition run to save computation time. Figure 4 demonstrates the effectiveness of our heuristic: there, the algorithm would suggest choosing λ = 10², which also corresponds to the highest R-squared value. Another approach to choosing λ is to apply the conditional moment test in Muandet et al. (2020) to test whether the linear invariance constraint (16) is satisfied. Specifically, in a similar vein to Jakobsen & Peters (2022) and Saengkyongam et al. (2022), we may select the smallest possible value of λ for which the conditional moment test is not rejected.
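For concreteness, the following is a minimal NumPy sketch of Algorithm 1, assuming a trained encoder is available as a callable `phi`. The additive regression AR is instantiated with random Fourier features and ridge regression; this particular choice is an illustrative assumption (the algorithm allows any additive regression method), and the function names are ours.

```python
# Minimal NumPy sketch of Algorithm 1 (Rep4Ex-CF), assuming a trained encoder is
# available as a callable `phi`.  The additive regression AR is instantiated with
# random Fourier features + ridge regression as an illustrative choice; the paper
# allows any additive regression method here.
import numpy as np

def rff(U, omega, b):
    """Random Fourier features cos(U @ omega + b), approximating a Gaussian kernel."""
    return np.cos(U @ omega + b)

def rep4ex_cf(phi, X, A, y, A_star, n_feat=200, ridge=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    Phi = phi(X)                                     # encoded features, shape (n, d)
    n, d = Phi.shape
    k = A.shape[1]

    # Step 1: regress phi(X) on A (least squares with intercept).
    A1 = np.hstack([A, np.ones((n, 1))])
    coef, *_ = np.linalg.lstsq(A1, Phi, rcond=None)  # shape (k + 1, d)
    W_hat, alpha_hat = coef[:k].T, coef[k]

    # Step 2: control variables = residuals of the step-1 regression.
    V = Phi - (A @ W_hat.T + alpha_hat)

    # Step 3: additive regression y ~ nu(phi(x)) + psi(v), each component modeled
    # with random Fourier features (illustrative assumption).
    om_nu = rng.normal(size=(d, n_feat)); b_nu = rng.uniform(0, 2 * np.pi, n_feat)
    om_ps = rng.normal(size=(d, n_feat)); b_ps = rng.uniform(0, 2 * np.pi, n_feat)
    F = np.hstack([rff(Phi, om_nu, b_nu), rff(V, om_ps, b_ps)])
    beta = np.linalg.solve(F.T @ F + ridge * np.eye(2 * n_feat), F.T @ y)
    nu_hat = lambda U: rff(U, om_nu, b_nu) @ beta[:n_feat]

    # Step 4: estimate E[Y | do(A = a*)] as in the output step of Algorithm 1.
    offset = nu_hat(Phi).mean() - y.mean()
    return np.array([nu_hat(a_star @ W_hat.T + alpha_hat + V).mean() - offset
                     for a_star in A_star])
```

The final step implements the output line of Algorithm 1, i.e., the estimator of E[Y | do(A = a*)] given by the first statement of Theorem 3.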
F POSSIBLE WAYS OF CHECKING THE APPLICABILITY OF THE PROPOSED METHOD

Due to the nature of extrapolation problems, it is not feasible to definitively verify the method's underlying assumptions from the training data. However, we may still be able to check, and potentially falsify, the applicability of our approach in practice. To this end, we propose comparing its performance under two different cross-validation schemes: (i) standard cross-validation, where the data are randomly divided into training and test sets; and (ii) extrapolation-aware cross-validation, in which the data are split such that the support of A in the test set does not overlap with that in the training set. By comparing our method's performance across these two schemes, we can assess the applicability of our overall method. A significant performance gap may suggest that some key assumptions are not valid, and one could consider adapting the setting, e.g., by transforming A (see Remark 6). A further option for checking for potential model violations is to test for linear invariance of the fitted encoder, using, for example, the conditional moment test by Muandet et al. (2020). If the null hypothesis of linear invariance is rejected, this indicates that either the optimization was unsuccessful or the model is incorrectly specified.

[Figure 4: reconstruction MSE of AE-MMR and AE-Vanilla as a function of the regularization parameter λ, for λ ranging from 10^{-1} to 10^{5}.]

G DETAILS ON THE EXPERIMENTS

G.1 DATA GENERATING PROCESSES (DGPS) IN SECTION 6

DGP for Section 6.1. We consider the following underlying SCM

S(α):  A := ϵ_A,   Z := α M0 A + V,   X := g0(Z),    (58)

where ϵ_A ∼ Unif(−1, 1) and V ∼ N(0, Σ) are independent noise variables. Here, we use a four-layer neural network with Leaky ReLU activation functions as the mixing function g0. The parameters of the neural network and the parameters of the SCM (22), including Σ and M0, are randomly chosen; see below for more details. The parameter α controls the strength of the effect of A on Z. In this experiment, we set the dimension of X to 10 and consider two choices d ∈ {2, 4} for the dimension of Z. Additionally, we set the dimension of A equal to the dimension of Z.

DGP for Section 6.2. We consider the following underlying SCM

S(γ):  A := ϵ^γ_A,   Z := M0 A + V,   X := g0(Z),   Y := ℓ(Z) + U,    (59)

where ϵ^γ_A ∼ Unif([−γ, γ]^k) and V ∼ N(0, Σ_V) are independent noise variables. U is then generated as U := h(V) + ϵ_U, where ϵ_U ∼ N(0, 1). The parameter γ determines the support of A in the observational distribution.
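For illustration, the following NumPy sketch samples from the SCM (59) in the case of one-dimensional A. The randomly initialized Leaky-ReLU mixing network, M0 ∼ Unif(−2, 2), and the choices of h and ℓ follow the parameter details given below; the sample size, the noise covariance (simplified here to the identity), and the leaky slope of 0.01 are our own illustrative assumptions.

```python
# Illustrative NumPy sketch of sampling from the SCM (59) with one-dimensional A.
# The Leaky-ReLU mixing network, M0 ~ Unif(-2, 2), h(v) = v^3 / 5 and
# l(z) = 2z + 10 sin(z) follow the parameter details below; sample size, noise
# covariance (identity), and the leaky slope 0.01 are our own assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, k, m, gamma = 2000, 1, 1, 10, 1.0      # samples, dim(Z), dim(A), dim(X), support of A

# Mixing function g0: three hidden Leaky-ReLU layers of width 16, linear output.
sizes = [d, 16, 16, 16, m]
weights = [rng.uniform(-1, 1, size=(p, q)) for p, q in zip(sizes[:-1], sizes[1:])]

def g0(z):
    out = z
    for w in weights[:-1]:
        pre = out @ w
        out = np.maximum(0.01 * pre, pre)    # Leaky ReLU
    return out @ weights[-1]

M0 = rng.uniform(-2, 2, size=(d, k))
ell = lambda z: 2 * z + 10 * np.sin(z)       # structural function l
h = lambda v: v ** 3 / 5                     # confounding function h

A = rng.uniform(-gamma, gamma, size=(n, k))  # A = eps_A ~ Unif([-gamma, gamma]^k)
V = rng.normal(size=(n, d))                  # Sigma_V simplified to the identity
Z = A @ M0.T + V
X = g0(Z)                                    # observed features, shape (n, m)
U = h(V[:, 0]) + rng.normal(size=n)          # U = h(V) + eps_U
Y = ell(Z[:, 0]) + U                         # outcome
```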
Similar to Section 6.1, we use a four-layer neural network with Leaky ReLU activation functions as the mixing function g0, and the parameters of g0, Σ_V, and M0 are randomly chosen as detailed below.

Details on other parameters. In all experiments, we employ a neural network with the following specification as the mixing function g0:
- Activation functions: Leaky ReLU
- Architecture: three hidden layers with a hidden size of 16
- Initialization: weights are independently drawn from Unif(−1, 1).

As for the matrix M0, each element is independently drawn from Unif(−2, 2). The covariance Σ_V is generated as Σ_V := ã ã^T + diag(ṽ), where ã and ṽ are independently drawn from Unif([0, 1]^d). In Section 6.2, in the case of one-dimensional A, we specify the functions h and ℓ as

h : v ↦ (1/5) v³,    ℓ : z ↦ 2z + 10 sin(z).

In the case of multi-dimensional A (see Appendix H.1), we employ the following neural network for both h and ℓ:
- Activation functions: Tanh
- Architecture: one hidden layer with a hidden size of 64
- Initialization: weights are independently drawn from Unif(−1, 1).

Lastly, in all experiments, we use the Gaussian kernel for the MMR term in the objective function (21). The bandwidth of the Gaussian kernel is chosen by the median heuristic (see, e.g., Fukumizu et al., 2009).

G.2 AUTO-ENCODER DETAILS

We employ the following hyperparameters for all auto-encoders in our experiments. The same architecture is used for both the encoder and the decoder:
- Activation functions: Leaky ReLU
- Architecture: three hidden layers with a hidden size of 32
- Learning rate: 0.005
- Batch size: 256
- Optimizer: Adam with β_1 = 0.9, β_2 = 0.999
- Number of epochs: 1000.

For the variational auto-encoder, we employ a standard Gaussian prior with the same network architecture and hyperparameters as defined above.
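The following PyTorch sketch shows one plausible way to implement the regularized auto-encoder objective (21) with the hyperparameters above: a reconstruction loss plus λ times a kernelized (MMR) penalty on the residuals of a per-batch linear regression of ϕ(X) on A, using a Gaussian kernel with median-heuristic bandwidth. The per-batch closed-form regression and the V-statistic form of the penalty are our own modeling assumptions, not details taken from the paper.

```python
# Plausible PyTorch sketch of the MMR-regularized auto-encoder objective (21):
# reconstruction loss + lambda * kernelized penalty on the residuals of a
# per-batch linear regression of phi(X) on A (our own instantiation).
import torch
import torch.nn as nn

def mlp(sizes):
    layers = []
    for a, b in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(a, b), nn.LeakyReLU()]
    return nn.Sequential(*layers[:-1])  # drop the activation after the output layer

class AEMMR(nn.Module):
    def __init__(self, x_dim, z_dim, hidden=32):
        super().__init__()
        self.enc = mlp([x_dim, hidden, hidden, hidden, z_dim])
        self.dec = mlp([z_dim, hidden, hidden, hidden, x_dim])

    def forward(self, x):
        z = self.enc(x)
        return z, self.dec(z)

def gaussian_gram(a, bandwidth):
    return torch.exp(-torch.cdist(a, a) ** 2 / (2 * bandwidth ** 2))

def median_heuristic(a):
    d = torch.cdist(a, a)
    return torch.median(d[d > 0])

def loss_fn(model, x, a, lam):
    z, x_hat = model(x)
    recon = ((x - x_hat) ** 2).mean()
    # Closed-form least squares of z on (a, 1) within the mini-batch.
    A1 = torch.cat([a, torch.ones(a.shape[0], 1, device=a.device)], dim=1)
    gram = A1.T @ A1 + 1e-6 * torch.eye(A1.shape[1], device=a.device)
    coef = torch.linalg.solve(gram, A1.T @ z)
    resid = z - A1 @ coef                      # residuals of the regression phi(X) ~ A
    K = gaussian_gram(a, median_heuristic(a))  # Gaussian kernel, median-heuristic bandwidth
    mmr = ((resid @ resid.T) * K).mean()       # V-statistic estimate of the MMR penalty
    return recon + lam * mmr
```

With the hyperparameters listed above, training would use, for example, torch.optim.Adam(model.parameters(), lr=0.005) over mini-batches of size 256 for 1000 epochs.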
H FURTHER DETAILS ON EXPERIMENTAL RESULTS

Figures 5 and 6 show the reconstruction performance of the hidden variables for the experiment described in Section 6.1.

[Figures 5 and 6: scatter plots of the estimated latent coordinates (estimated Z1 vs. estimated Z2) for AE-MMR and AE-Vanilla, respectively.]

H.1 SECTION 6.2 CONTINUED: MULTI-DIMENSIONAL A

Here, we consider multi-dimensional variables Z and A. Similar to Section 6.1, we set the dimension of X to 10, vary the dimension d of Z, and keep the dimension of A equal to that of Z. We specify the functions h and ℓ using two-layer neural networks with hyperbolic tangent activation functions. For the training distribution, we generate A from a uniform distribution over [−1, 1]^d. To assess extrapolation performance, we generate 100 test points of A from a uniform distribution over [−3, −1]^d and calculate the mean squared error with respect to the true conditional mean. In addition to the baseline MLP, we also include an oracle method, denoted Rep4Ex-CF-Oracle, in which we directly use the true latent predictors Z instead of learning a representation in the first stage.

Figure 7: MSEs of different methods for three distinct dimensionalities of A (d ∈ {2, 4, 10}). The box plots illustrate the distribution of MSEs based on 10 repetitions. Rep4Ex-CF yields substantially lower MSEs in comparison to the baseline MLP. Furthermore, the MSEs achieved by Rep4Ex-CF are comparable to those of Rep4Ex-CF-Oracle, underscoring the effectiveness of the representation learning stage.

The outcomes for d ∈ {2, 4, 10} are depicted in Figure 7. Across all settings, our proposed method, Rep4Ex-CF, consistently achieves markedly lower mean squared errors than the baseline MLP. Furthermore, the performance of Rep4Ex-CF is on par with that of the oracle method Rep4Ex-CF-Oracle, indicating that the learned representations are close to the true latent predictors (up to an affine transformation).

H.2 SECTION 6.2 CONTINUED: IMPACT OF UNOBSERVED CONFOUNDERS

Our approach allows for unobserved confounders between Z and Y. This section empirically explores the impact of such confounders on extrapolation performance. We consider the SCM (59) from Section 6.2, where we set γ = 1.2 and generate the noise variables U and V from a joint Gaussian distribution with covariance matrix

Σ_{U,V} = [ 1  ρ ; ρ  1 ].

Here, the parameter ρ controls the dependence between U and V, representing the strength of unobserved confounding. Figure 8 presents the results for four different confounding levels ρ ∈ {0, 0.1, 0.5, 0.9}. Our method, Rep4Ex-CF, demonstrates robust extrapolation capabilities across all confounding levels.

Figure 8: Different estimates of the target of inference E[Y | do(A := ·)] as the strength of unobserved confounding (ρ) increases. Notably, the extrapolation performance of Rep4Ex-CF remains consistent across all confounding levels.

H.3 SECTION 6.2 CONTINUED: ROBUSTNESS AGAINST VIOLATING THE MODEL ASSUMPTION OF NOISELESS X

In Setting 1, we assume that the observed features X are deterministically generated from Z via the mixing function g0. However, this assumption may not hold in practice. In this section, we investigate the robustness of our method against violations of this assumption. We conduct an experiment in a setting similar to that with one-dimensional A in Section 6.2, but here we introduce independent additive Gaussian noise in X, i.e., X := g0(Z) + ϵ_X, where ϵ_X ∼ N(0, σ² I_m). The parameter σ controls the noise level. Figure 9 illustrates the results for different noise levels σ ∈ {1, 2, 4}. The results indicate that our method maintains successful extrapolation capabilities under moderate noise conditions. We therefore believe it may be possible to relax the assumption of the absence of noise in X.

Figure 9: Different estimates of the target of inference E[Y | do(A := ·)] in the presence of noise in X. Our method, Rep4Ex-CF, demonstrates the ability to extrapolate beyond the training support when the noise is not too large, suggesting the potential to relax the assumption of the absence of noise in X.
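To make the two robustness settings concrete, the following small NumPy sketch (our own illustration, with arbitrary values for n, ρ, σ, and m) generates the correlated noise pair (U, V) used in Appendix H.2 and the additive observation noise used in Appendix H.3.

```python
# Illustrative sketch of the noise mechanisms in Appendices H.2 and H.3
# (our own example; n, rho, sigma, and m are arbitrary choices).
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 10          # number of samples and dim(X)
rho, sigma = 0.5, 1.0    # confounding strength (H.2) and noise level (H.3)

# H.2: jointly Gaussian noise (U, V) with unit variances and correlation rho.
cov = np.array([[1.0, rho], [rho, 1.0]])
U, V = rng.multivariate_normal(mean=np.zeros(2), cov=cov, size=n).T

# H.3: additive observation noise eps_X ~ N(0, sigma^2 I_m), to be added to g0(Z),
# i.e., X = g0(Z) + eps_X with g0 and Z as in the DGP sketch of Appendix G.1.
eps_X = sigma * rng.normal(size=(n, m))
```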