# Exploiting Independent Instruments: Identification and Distribution Generalization

Sorawit Saengkyongam¹, Leonard Henckel¹, Niklas Pfister¹, Jonas Peters¹

¹Department of Mathematical Sciences, University of Copenhagen, Denmark. Correspondence to: Sorawit Saengkyongam. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract. Instrumental variable models allow us to identify a causal function between covariates X and a response Y, even in the presence of unobserved confounding. Most of the existing estimators assume that the error term in the response Y and the hidden confounders are uncorrelated with the instruments Z. This is often motivated by a graphical separation, an argument that also justifies independence. Positing an independence restriction, however, leads to strictly stronger identifiability results. We connect to the existing literature in econometrics and provide a practical method called HSIC-X for exploiting independence that can be combined with any gradient-based learning procedure. We see that even in identifiable settings, taking into account higher moments may yield better finite sample results. Furthermore, we exploit the independence for distribution generalization. We prove that the proposed estimator is invariant to distributional shifts on the instruments and worst-case optimal whenever these shifts are sufficiently strong. These results hold even in the under-identified case where the instruments are not sufficiently rich to identify the causal function.

1. Introduction

When estimating the causal function between a vector of covariates X and a response Y in the presence of unobserved confounding, standard regression procedures such as ordinary least squares (OLS) are biased, even asymptotically. Instrumental variable approaches (Wright, 1928; Imbens & Angrist, 1994; Newey, 2013) exploit the existence of exogenous heterogeneity in the form of an instrumental variable (IV) Z and estimate, under suitable conditions, the causal function consistently. Importantly, the errors in Y and the hidden confounders should be uncorrelated with the instruments Z. Usually, this has to be argued for with background knowledge. Often this is done by assuming that the data generating process follows a structural causal model (SCM) (Pearl, 2009; Bongers et al., 2021), so that the distribution is Markov with respect to the induced graph, and that Y and U are d-separated from Z in the graph obtained by removing the edge from X to Y (this is the case for the DAG at the beginning of Section 2, for example). In particular, this requires an argument that Z does not cause Y directly but only via X. This argument, however, implies not only that the errors in Y and U are uncorrelated with Z but also that they are independent of Z. This independence comes with several benefits. For example, even in settings where the causal function can be identified by classical approaches based on uncorrelatedness, it has been observed that the independence can be exploited to construct estimators that achieve the semiparametric efficiency bound, at least when the error distribution comes from a known, parametric family (Hansen et al., 2010).
Furthermore, the independence restriction is stronger than uncorrelatedness and therefore yields stronger identifiability results, as has been reported in the field of econometrics (e.g., Imbens & Newey, 2009; Chesher, 2003). For example, even binary instruments may be able to identify nonlinear effects (Dunker et al., 2014; Torgovitsky, 2015; Loh, 2019). In this work, we investigate the independence restriction in more detail: we add to existing identifiability results, provide methods for exploiting the restriction for finite data, and analyse the implications for distribution generalization. More precisely, in Section 2 we discuss the identifiability conditions in a general and simple form, list some of the existing identifiability results that they imply, add novel results to this list, and extend the framework to conditional IV.

We also provide a practical method that exploits the above independence for estimating causal effects from data. It relies on the Hilbert-Schmidt independence criterion (HSIC) (Gretton et al., 2008), which has become a widely used tool for testing independence in a joint distribution. Equipped with a characteristic kernel (such as the Gaussian kernel), HSIC is nonnegative and equals zero if and only if the considered distribution factorizes. We propose an easy-to-use method, called HSIC-X (X for exogenous), that minimizes HSIC directly. The underlying problem is non-convex but can be tackled using widely used methods from stochastic optimization. While theoretical guarantees are hard to obtain, it has recently been shown empirically that reliable optimization of HSIC seems possible, at least when considering the different problem of minimizing the dependence between residuals and predictors in a regression problem (Greenfeld & Shalit, 2020; Mooij et al., 2009). Furthermore, the optimization can be initialized at informative starting points, such as the two-stage least squares (2SLS) solution, or even the OLS solution. HSIC-X can be combined with any machine learning method for nonparametric regression that is optimized by (stochastic) gradient descent. The details of our method are described in Section 3.

Furthermore, the independence restriction can be exploited for distribution generalization. In Section 4, we construct an estimator HSIC-X-pen (pen for penalization) that is worst-case optimal in nonlinear settings under distributional shifts corresponding to interventions on Z. This is particularly interesting in under-identified settings, where the causal function cannot be identified from the data. Our work thereby adds to an increasing literature connecting distributional robustness and causal inference (e.g., Schölkopf et al., 2012; Rojas-Carulla et al., 2018; Magliacane et al., 2018; Rothenhäusler et al., 2021; Arjovsky et al., 2019; Christiansen et al., 2021; Pfister et al., 2021; Yuan et al., 2021; Krueger et al., 2021; Creager et al., 2021). As for HSIC-X, the estimator HSIC-X-pen is modular in that it can be used with any machine learning method for nonparametric regression.

Experiments on both simulated and real-world data in Section 5 confirm that HSIC-X can exploit the improved identifiability guarantees and can be more efficient in finite samples if wrong solutions yield both second and higher order dependencies between the residuals and the instruments. The code for all the experiments is available at https://github.com/sorawitj/HSIC-X. All proofs are provided in Appendix A.
1.1. Further Related Work

Independence between Z and $Y - f(X)$ can also be characterized differently; e.g., it is equivalent to $E[\eta(Z)\psi(Y - f(X))] = E[\eta(Z)]\,E[\psi(Y - f(X))]$ for all η and ψ in a sufficiently rich class (such as bounded continuous functions). This suggests estimating the causal function by a generalized method of moments (GMM) approach. Indeed, Poirier (2017) focuses on the derivation of optimal weights for such estimating equations and the asymptotic behaviour of the estimator. Dunker et al. (2014); Dunker (2021) phrase the problem as an inverse problem and derive convergence rates for the iteratively regularized Gauss-Newton method. The above works focus on theoretical advancements (e.g., the experiments are restricted to univariate settings). Considering the independence of residuals has also been suggested by Peters et al. (2016, Section 5), but no practical method was provided. Powerful regression techniques and machine learning methods to measure dependence in IV settings exist (Hartford et al., 2017; Singh et al., 2019; Bennett et al., 2019; Muandet et al., 2020) but, to the best of our knowledge, none of these methods exploit the independence restriction. In summary, exploiting independence does not seem to have played a major role in practice. Arguably, one of the reasons is that independence restrictions are difficult to work with in theory and practice. E.g., choosing the class of functions ψ is non-trivial.

2. Identifiability from Higher Order Moments

Unless stated otherwise, we consider the following SCM
$$Z := \epsilon_Z, \qquad U := \epsilon_U, \qquad X := g(Z, U, \epsilon_X), \qquad Y := f(X) + h(U, \epsilon_Y)$$
(the corresponding DAG, referred to in Section 1, has edges $Z \to X$, $U \to X$, $U \to Y$ and $X \to Y$), where $(\epsilon_Z, \epsilon_U, \epsilon_X, \epsilon_Y) \sim Q$ are jointly independent noise variables, $Z \in \mathbb{R}^r$ are instruments, $U \in \mathbb{R}^q$ are unobserved variables, $X \in \mathbb{R}^d$ are predictors and $Y \in \mathbb{R}$ is a response. For simplicity, we additionally assume that $E[\epsilon_Z] = E[X] = E[h(U, \epsilon_Y)] = 0$. We call a collection $M = (f, g, h, Q)$ an IV model, with $f \in \mathcal{F} \subseteq \{f \mid f : \mathbb{R}^d \to \mathbb{R}\}$ and $\mathcal{F}$ a pre-specified function class. The collection of all IV models of this form is denoted by $\mathcal{M}$. Any IV model $M \in \mathcal{M}$ induces a distribution $P_M$ over the observed variables $(X, Y, Z)$. We denote the data generating IV model by $M^0 = (f^0, g^0, h^0, Q^0)$ (parts of the data are unobserved). We refer to the function $f^0$ as the causal function. We can use it to compute the causal effect of any treatment contrast, such as the average treatment effect $f^0(1) - f^0(0)$ in case of a binary X. We assume that the causal function only depends on X, i.e., is homogeneous, and thus we do not need to distinguish between local and global treatment effects (e.g., Imbens & Angrist, 1994). Unless stated otherwise, we also assume that $f^0$ can be parameterized such that $f^0(\cdot) = \varphi(\cdot)^\top \theta^0$ for some $\theta^0 \in \Theta \subseteq \mathbb{R}^p$, with a (known) function $\varphi : \mathbb{R}^d \to \mathbb{R}^p$, and, for simplicity, we assume that $\theta \neq \theta^0$ implies $\varphi(\cdot)^\top\theta \neq \varphi(\cdot)^\top\theta^0$. In practice the true basis φ does not need to be known and can be approximated by a sufficiently flexible function approximator (e.g., neural networks), see Section 3.

Let us assume that we observe n i.i.d. observations from $(X, Y, Z) \sim P_{M^0}$. In this paper, we consider two problems: estimating the causal function $f^0$ and predicting the response Y under distributional shifts on the variables Z (see Section 4). In order to solve either of these tasks, we make use of (parts of) the causal function $f^0$. We therefore require that it is uniquely determined by the observed distribution $P_{M^0}$ (we relax this condition in Section 4). This is often termed identifiability of the causal function.
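To make this setup concrete, the following minimal sketch (not part of the original paper; it assumes only NumPy, and the particular structural equations and parameter values are illustrative) simulates one hypothetical instance of the IV model above with a linear causal function. It illustrates two facts used throughout: OLS of Y on X is biased by the hidden confounder U, whereas the residual at the true causal function, $Y - f^0(X) = h(U, \epsilon_Y)$, is independent of Z by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# One hypothetical instance of an IV model M = (f, g, h, Q) from Section 2:
# Z := eps_Z, U := eps_U, X := Z + U + eps_X, Y := f0(X) + U + eps_Y with f0(x) = 2x.
eps_Z, eps_U, eps_X, eps_Y = rng.standard_normal((4, n))
Z = eps_Z
U = eps_U
X = Z + U + eps_X
Y = 2.0 * X + U + eps_Y

# OLS of Y on X is biased: it converges to 2 + Cov(U, X)/Var(X) = 2 + 1/3.
theta_ols = np.cov(X, Y)[0, 1] / np.var(X)
print(f"OLS estimate: {theta_ols:.3f}  (true causal coefficient: 2.0)")

# The residual at the causal function is independent of Z (empirically: near-zero
# correlation with Z, and near-zero correlation between Z^2 and the squared residual),
# whereas the OLS residual is correlated with Z.
res_causal = Y - 2.0 * X
res_ols = Y - theta_ols * X
for name, r in [("causal", res_causal), ("OLS   ", res_ols)]:
    print(name, np.corrcoef(Z, r)[0, 1].round(3), np.corrcoef(Z**2, r**2)[0, 1].round(3))
```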
In a classical IV approach, identification of $f^0$ (or, equivalently, $\theta^0$) is based on the moment restriction
$$E\big[\eta(Z)\big(Y - \varphi(X)^\top\theta\big)\big] = 0, \qquad (1)$$
where $\eta : \mathbb{R}^r \to \mathbb{R}^k$ is a known function. Under the IV model $M^0$ this condition is satisfied for $\theta^0$. A sufficient and necessary condition for identifiability based on (1) is given by the (moment) identifiability condition
$$E\big[\eta(Z)\varphi(X)^\top\big]\,\tau = 0 \;\implies\; \tau = 0. \qquad (2)$$
This condition is sometimes called the rank condition because it is equivalent to the matrix $E[\eta(Z)\varphi(X)^\top]$ having rank p, which, in particular, implies that $k \geq p$ (e.g., Wooldridge, 2010). The moment restriction (1) aims to detect mean shifts of the residuals when varying the value of Z. For example, (1) can only identify $\theta^0$ if there is no $i \in \{1, \dots, p\}$ for which $E[\varphi_i(X) \mid Z] = 0$ almost surely (otherwise the function $x \mapsto f^0(x) + \varphi_i(x)$ solves (1), too). But even if there are no mean shifts and (1) is not powerful enough to identify $f^0$, we may still be able to identify $f^0$ from a stronger condition. To this end, we consider the independence restriction
$$Z \perp\!\!\!\perp Y - \varphi(X)^\top\theta. \qquad (3)$$
In this paper, we consider identifying $f^0$ based on (3), rather than (1). Under the IV model $M^0$ described above, this condition is satisfied for the true parameter $\theta^0$. A necessary and sufficient condition for identifiability using (3) is
$$Z \perp\!\!\!\perp h(U, \epsilon_Y) + \varphi(X)^\top\tau \;\implies\; \tau = 0. \qquad (4)$$
This, in particular, implies that for all $i \in \{1, \dots, p\}$ we have $Z \not\perp\!\!\!\perp h(U, \epsilon_Y) + \varphi_i(X)$ (since otherwise $\tau = e_i$ would violate the above condition). If condition (4) is violated, condition (2) is violated, too, but the implication does not hold in the other direction. As a result, the independence restriction yields strictly stronger identifiability results (see also Table 1).

Proposition 2.1 (Identifiability based on independence). Consider an IV model $M^0$ and assume $f^0(\cdot) = \varphi(\cdot)^\top\theta^0$ for some $\theta^0 \in \mathbb{R}^p$. Then, the following statements hold.
(i) If $\theta^0$ is identifiable from the moment restriction (1), it is also identifiable from the independence restriction (3).
(ii) There exist IV models such that $\theta^0$ is identifiable from the independence restriction (3) but not from (1).
(iii) In particular, there are examples of the following type that satisfy the conditions of (ii).
(a) Nonadditive Z:
$$Z := \epsilon_Z, \qquad U := \epsilon_U, \qquad X := ZU + \epsilon_X, \qquad Y := X + U + \epsilon_Y, \qquad (5)$$
with $\mathcal{F} = \{f \mid f(x) = \theta x\}$, where $(\epsilon_Z, \epsilon_U, \epsilon_X, \epsilon_Y)$ are jointly independent standard Gaussian variables.
(b) Binary Z: $Z \in \{0, 1\}$, $X \in \mathbb{R}$, $p > 2$ (i.e., $\mathcal{F}$ contains nonlinear functions).
(c) Independent Z: $Z \perp\!\!\!\perp X$.

Statements (i) and (ii) are known (e.g., Imbens & Newey, 2009; Chesher, 2003) but for completeness we nevertheless include their proofs in Appendix A. Intuitively, the causal function from the SCM (5) in (a) is not identifiable from the moment restriction because the instrument (or any transformation η(Z)) does not correlate with the mean of X, i.e., $E[\eta(Z)X] = E[\eta(Z)\,E[X \mid Z]] = 0$. Therefore, for any $\gamma \in \mathbb{R}$, the shifted causal function $f(x) := f^0(x) + \gamma x$, for $x \in \mathbb{R}$, also satisfies the moment restriction. In the proof of Proposition 2.1, we argue that identifiability using (2) is impossible for an example of type (b). It may come as a surprise that it is indeed possible to identify nonlinear functions even if the instrument Z has a discrete support with small cardinality. This observation has been reported, e.g., by Dunker et al. (2014); Torgovitsky (2015); Loh (2019).
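The failure of the moment restriction in example (a) and the extra signal carried by higher moments can be seen in a few lines of simulation. The sketch below (illustrative only, not from the paper; it assumes NumPy) checks empirically that $E[\eta(Z)X] \approx 0$ for two choices of η, while the squared residual of $Y - \theta X$ is clearly correlated with $Z^2$ for every θ except the causal value $\theta^0 = 1$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Example (a) of Proposition 2.1, SCM (5): Z, U, eps_X, eps_Y i.i.d. N(0,1),
# X := Z*U + eps_X, Y := X + U + eps_Y, so f0(x) = x (theta0 = 1).
Z, U, eps_X, eps_Y = rng.standard_normal((4, n))
X = Z * U + eps_X
Y = X + U + eps_Y

# The moment restriction carries no signal: E[eta(Z) X] = 0, checked here
# for eta(z) = z and eta(z) = z**3.
print("E[Z   X] ~", np.mean(Z * X).round(3))
print("E[Z^3 X] ~", np.mean(Z**3 * X).round(3))

# Higher moments do carry signal: for theta != 1 the residual Y - theta*X is
# dependent on Z, visible in the correlation between Z^2 and the squared residual.
for theta in [0.0, 0.5, 1.0, 1.5]:
    r = Y - theta * X
    print(f"theta={theta:3.1f}  corr(Z^2, residual^2) = {np.corrcoef(Z**2, r**2)[0, 1]:+.3f}")
```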
Even in cases where the causal function is identifiable from both (4) and (2), it may be beneficial to consider the independence restriction. This is, for example, the case when the effect of the instrument can be seen both in the conditional mean of φ(X) given Z and in higher moments. In our simulation experiments in Section 5, we see that taking into account dependencies in higher moments may yield better statistical performance.

Condition (4) depends on the unknown $h(U, \epsilon_Y)$; at first glance, one might believe that the hidden term can be dropped from condition (4), but this is not the case.

Proposition 2.2. Consider the IV model $M^0$ and assume that $\varphi(\cdot) = (\varphi_1(\cdot), \dots, \varphi_p(\cdot))$ is a collection of basis functions such that $f^0(\cdot) = \varphi(\cdot)^\top\theta^0$ for some $\theta^0 \in \mathbb{R}^p$. Then, in general, condition (4) does not imply
$$Z \perp\!\!\!\perp \varphi(X)^\top\tau \;\implies\; \tau = 0. \qquad (6)$$
Furthermore, (6) does not imply (4), either.

Table 1. Overview of some of the identifiability results, described in Section 2.

| \|supp(Z)\| | Z acts on mean of X | Z acts on higher orders of X | identif. with (1) possible | identif. with (3) possible | Comments |
|---|---|---|---|---|---|
| ∞ | yes | yes/no | yes | yes | gain efficiency with (3), cf. Sec. 5.1 |
| ∞ | no | yes | no | yes | cf. Prop. 2.1-iii-a |
| m < ∞ | yes | yes/no | m parameters | yes | identif. m par. with (1), gain efficiency & identif. with (3), cf. Prop. 2.1-iii-b, Sec. 5.1 |
| m < ∞ | no | yes | no | yes | cf. example in proof of Prop. 2.2 |

Identifiability condition (2) depends only on the joint distribution of (Z, X) (which can be estimated from data), but we do not consider this an advantage: in practice, it is usually desirable to consider empirical relaxations of the identifiability conditions and output the set of θ's that (approximately) satisfy the empirical version of (1) or (3), respectively. For example, one can invert statistical tests to construct confidence sets, see Section 3.4.

2.1. Conditional Instrumental Variables

In some applications, such as the one we consider in Section 5.3, restrictions (1) and (3) may be violated for the causal parameter, e.g., because there are confounding variables W between the instruments Z and the response Y. Under certain assumptions, however, the framework of conditional IV (CIV) (Frölich, 2007; Newey, 2013) still allows us to identify the causal function $f^0$ from the observed distribution. Both the identifiability point of view and the methodology we develop in Section 3 can be extended to CIV. More details are provided in Appendix B. Consider the following SCM
$$W := m(\epsilon_W, V, U), \quad Z := q(W, V, \epsilon_Z), \quad X := g(Z, W, U, \epsilon_X), \quad V := \epsilon_V, \quad U := \epsilon_U, \quad Y := f(X) + h(W, U, \epsilon_Y),$$
where $(\epsilon_W, \epsilon_Z, \epsilon_V, \epsilon_U, \epsilon_X, \epsilon_Y)$ are jointly independent and $W \in \mathbb{R}^t$ are observed covariates. We assume that there is no edge from V to W or no edge from U to W, i.e., either Z and W or W and Y are not confounded. Due to the unobserved confounding, (3) may not hold but, as we assume that there is either no confounding between Z and W or between W and Y, we can instead use the conditional independence restriction
$$Z \perp\!\!\!\perp Y - \varphi(X)^\top\theta \mid W \qquad (7)$$
to identify $f^0$. A corresponding sufficient and necessary identifiability condition for $f^0$ is
$$Z \perp\!\!\!\perp h(W, U, \epsilon_Y) + \varphi(X)^\top\tau \mid W \;\implies\; \tau = 0. \qquad (8)$$
We can estimate $f^0$ using a loss that is minimized if restriction (7) is satisfied, e.g., using a conditional independence measure (e.g., Fukumizu et al., 2008; Zhang et al., 2011; Berrett & Samworth, 2019; Shah & Peters, 2020). In Appendix B we discuss cases in which we can avoid using a conditional independence restriction.

3. Independence-based IV with HSIC

Starting from the independence restriction (3), our goal is to find a function $\hat f$ such that the residuals $R^{\hat f} := Y - \hat f(X)$ are independent of the instruments Z.
In the identifiable case, that is, if condition (4) is satisfied, only the causal function $f^0$ achieves independence. Thus, given an i.i.d. sample $(x_i, y_i, z_i)_{i=1}^n$ of the variables (X, Y, Z), our method aims to find a function $\hat f$ that minimizes the dependence between the residuals $(r_i^{\hat f})_{i=1}^n$, with $r_i^{\hat f} := y_i - \hat f(x_i)$, and the instruments $(z_i)_{i=1}^n$. In this work, we measure dependence using HSIC (Gretton et al., 2008). When using characteristic kernels (Fukumizu et al., 2008), this measure equals zero if and only if the considered joint distribution factorizes. HSIC has been used for optimization problems before (e.g., Greenfeld & Shalit, 2020; Mooij et al., 2009) and satisfies the conditions used to prove consistency (see Section 3.3), but other choices of independence measures are possible, too. Specifically, we consider the following learning problem:
$$\hat f := \arg\min_{f \in \mathcal{F}} \widehat{\mathrm{HSIC}}\big((r_i^f, z_i)_{i=1}^n;\; k_{R^f}, k_Z\big), \qquad (9)$$
where $\widehat{\mathrm{HSIC}}((r_i^f, z_i)_{i=1}^n; k_{R^f}, k_Z) := \tfrac{1}{n^2}\operatorname{tr}(KHLH)$ is a consistent estimator of $\mathrm{HSIC}((R^f, Z); k_{R^f}, k_Z)$ (Gretton et al., 2008); here, $K_{ij} = k_{R^f}(r_i^f, r_j^f)$ and $L_{ij} = k_Z(z_i, z_j)$ are the kernel matrices for the residuals $R^f$ and the instruments Z with positive definite kernels $k_{R^f}$ and $k_Z$, respectively, and $H_{ij} := \delta_{ij} - \tfrac{1}{n}$ is the centering matrix. We call the estimator in (9) HSIC-X.

The independence restriction (3) allows us to learn the causal function $f^0$ only up to a bias term, namely an additive constant (for any $\alpha \in \mathbb{R}$, $\mathrm{HSIC}((Y - f^0(X), Z); k_{R^{f^0}}, k_Z)$ and $\mathrm{HSIC}((Y - \alpha - f^0(X), Z); k_{R^{f^0}}, k_Z)$ are identical). Nonetheless, we can correct for this bias by using the zero-mean assumption on the noise term $h(U, \epsilon_Y)$. The final estimate is then obtained as $\bar f(\cdot) := \hat f(\cdot) + \tfrac{1}{n}\sum_{i=1}^n \big(y_i - \hat f(x_i)\big)$.

3.1. Regularizing towards Predictive Functions

In many practical applications the identifiability condition (4) (or (2) for classical IV) is satisfied, but many parameters approximately solve the empirical version of (3). This is the case, for example, if the influence of the instruments is weak, which usually yields subpar finite sample properties. Furthermore, classical estimators like 2SLS are known to only have moments up to the degree of over-identification (e.g., Mariano, 2001). To stabilize the estimation it has been proposed to regularize towards a predictive function, such as the OLS in linear settings; see, e.g., the K-class estimators (Theil, 1958; Jakobsen & Peters, 2022), which contain the OLS, 2SLS, the FULLER (Fuller, 1977) and LIML estimators (Anderson & Rubin, 1949) as special cases. We propose an analogous regularization for our estimator and call this variant HSIC-X-pen. More specifically, for a convex loss function $\ell : \mathbb{R} \to \mathbb{R}$ we modify the optimization problem (9) as follows,
$$\hat f^\lambda := \arg\min_{f \in \mathcal{F}} \;\widehat{\mathrm{HSIC}}\big((r_i^f, z_i)_{i=1}^n;\; k_{R^f}, k_Z\big) + \lambda \sum_{i=1}^n \ell\big(y_i - f(x_i)\big), \qquad (10)$$
where $\lambda \in [0, \infty)$ is a tuning parameter. Unlike in the linear settings described above (e.g., Fuller, 1977), deriving a data-driven choice of the tuning parameter λ is non-trivial. We propose to select λ following a procedure analogous to the one described in Jakobsen & Peters (2022): we select the largest possible value of λ for which an HSIC-based independence test (e.g., Gretton et al., 2008; Pfister et al., 2017) between $R^{\hat f^\lambda}$ and Z is not rejected. As discussed in Section 4, HSIC-X-pen can be understood in relation to distribution generalization, too. There, one starts from the objective of optimizing the predictive loss and adds the HSIC term as a penalty that regularizes the predictor to guard against distributional shifts.
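Before turning to the implementation details, here is a minimal, self-contained sketch of the objective in (9) and its penalized variant (10), assuming PyTorch. It is illustrative only and not the authors' implementation (which is available at the repository linked above); the Gaussian-kernel bandwidth is set by a simple median heuristic and treated as a constant during backpropagation, anticipating the choices described in the next subsection.

```python
import torch

def gaussian_kernel(x: torch.Tensor) -> torch.Tensor:
    """Gaussian kernel matrix with bandwidth set by a median heuristic.

    x has shape (n, d); the bandwidth is recomputed from the current values
    and detached, so it acts as a constant during backpropagation.
    """
    d2 = torch.cdist(x, x).pow(2)                    # squared pairwise distances
    med = d2.detach()[d2.detach() > 0].median()      # median heuristic
    return torch.exp(-d2 / (2 * med + 1e-12))

def hsic_biased(residuals: torch.Tensor, instruments: torch.Tensor) -> torch.Tensor:
    """Biased HSIC estimate tr(KHLH)/n^2 between residuals and instruments."""
    n = residuals.shape[0]
    K = gaussian_kernel(residuals.reshape(n, 1))
    L = gaussian_kernel(instruments.reshape(n, -1))
    H = torch.eye(n) - torch.full((n, n), 1.0 / n)   # centering matrix
    return torch.trace(K @ H @ L @ H) / n**2

# Example: one gradient step on an HSIC-X-pen type objective (10) for a linear f_theta,
# started at zero for brevity; lam = 0 recovers the plain HSIC-X objective (9).
x = torch.randn(256, 1)
z = torch.randn(256, 1)
y = 2.0 * x.squeeze() + torch.randn(256)
theta = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([theta], lr=0.01)

lam = 0.0
res = y - (x @ theta)
loss = hsic_biased(res, z) + lam * (res**2).mean()
loss.backward()
opt.step()
```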
3.2. Algorithm and Implementation Details

We now specify the details of HSIC-X. To solve (9), we fix any parametric function class $\mathcal{F} := \{f_\theta(\cdot) \mid \theta \in \Theta \subseteq \mathbb{R}^p\}$ (e.g., a linear combination of some basis functions or a neural network) and optimize the parameters θ by a gradient-based optimization method. We choose the Gaussian kernel $k_R$ (e.g., Schölkopf & Smola, 2002) for the residuals, and the discrete or Gaussian kernel $k_Z$ for the instruments depending on whether Z is discrete or continuous, respectively. The bandwidth parameter σ of the Gaussian kernel is chosen by the median heuristic (e.g., Sriperumbudur et al., 2009) and is recomputed during the optimization process (as the residuals change at each iteration). Since the optimization problem (9) is generally non-convex, the resulting parameter estimates may not be the globally optimal solution. We alleviate this problem by introducing the following restarting heuristic. Let $\hat\theta \in \Theta$ be a solution to the optimization problem. We conduct an independence test between the resulting residuals $(r_i^{\hat\theta} := y_i - f_{\hat\theta}(x_i))_{i=1}^n$ and the instruments $(z_i)_{i=1}^n$ using HSIC with the gamma approximation (Gretton et al., 2008). We accept the parameters $\hat\theta$ if the test is not rejected; otherwise we randomly re-initialize the parameters and restart the optimization. In the spirit of (10), we initialize the parameters in the first trial at the OLS solution. Algorithm 1 in Appendix C illustrates the whole optimization procedure with the standard gradient descent update. The gradient step can be replaced with other gradient-based optimization algorithms such as Adam (Kingma & Ba, 2015) or Adagrad (Duchi et al., 2011); in all of our experiments, we used Adam. The algorithm for HSIC-X-pen is also provided in Appendix C.

3.3. Consistency

We now prove consistency of the proposed approach in that the minimizer of (9) converges (in probability) to the causal function as the sample size increases.¹

Theorem 3.1. Consider the IV model $M^0$, assume that $f^0(\cdot) = \varphi(\cdot)^\top\theta^0$ for some bounded function $\varphi(\cdot)$ and some $\theta^0 \in \Theta$, with $\Theta \subseteq \mathbb{R}^p$ being compact, and assume that the identifiability condition (4) holds. Consider the function class $\mathcal{F} := \{f(\cdot) = \varphi(\cdot)^\top\theta \mid \theta \in \Theta\}$ and fixed bounded, continuously differentiable, characteristic kernels $k_Z$ and $k_R$ with bounded derivatives. Then, the estimator $\hat f$ defined in (9) is consistent, i.e., $\|\hat f - f^0\|_\infty \overset{P_{M^0}}{\longrightarrow} 0$. The same statement holds if in (9) we replace $\mathrm{HSIC}$ and $\widehat{\mathrm{HSIC}}$ by any independence measure $H(P_{M^0}, \theta)$ and its estimate $\hat H_n(D_n, \theta)$ (based on data $D_n$), such that (i) $H(P_{M^0}, \theta) = 0$ if and only if $(Y - \varphi(X)^\top\theta) \perp\!\!\!\perp Z$, (ii) for all $\theta \in \Theta$, we have $\hat H_n(D_n, \theta) \to H(P_{M^0}, \theta)$ in probability, and (iii) both $H(P_{M^0}, \theta)$ and $\hat H_n(D_n, \theta)$ for all n are Lipschitz continuous in θ with the same constant L.

¹There is a slight mismatch between Theorem 3.1 and the described algorithm: in practice, the kernel bandwidth is not fixed but is chosen according to the median heuristic. This difference could be accounted for by sample splitting.
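Before turning to confidence regions, the following sketch (again illustrative, not the authors' Algorithm 1 from Appendix C) puts the pieces of Section 3.2 together for a linear-in-features model $f_\theta(x) = x^\top\theta$ (pass basis-expanded features for the nonlinear case): OLS initialization, Adam updates on the HSIC objective, and the restarting heuristic. It reuses `hsic_biased` from the previous sketch and, for simplicity, uses a plain permutation test in place of the HSIC test with gamma approximation; all names and default values are assumptions.

```python
import torch

def hsic_perm_pvalue(res: torch.Tensor, z: torch.Tensor, n_perm: int = 200) -> float:
    """Permutation p-value for H0: residuals independent of instruments.

    Stand-in for the gamma-approximation HSIC test mentioned in Section 3.2;
    hsic_biased is the function from the previous sketch.
    """
    with torch.no_grad():
        stat = hsic_biased(res, z).item()
        null = [hsic_biased(res[torch.randperm(len(res))], z).item() for _ in range(n_perm)]
    return (1 + sum(s >= stat for s in null)) / (1 + n_perm)

def fit_hsic_x(x, y, z, n_steps: int = 2000, max_restarts: int = 5, alpha: float = 0.05):
    """Gradient-based HSIC minimisation with the restart heuristic of Section 3.2."""
    # First trial: initialise at the OLS solution; later trials: random restarts.
    theta_init = torch.linalg.lstsq(x, y.unsqueeze(1)).solution.squeeze(1)
    for trial in range(max_restarts):
        theta = (theta_init if trial == 0 else torch.randn_like(theta_init)).clone().requires_grad_(True)
        opt = torch.optim.Adam([theta], lr=0.01)
        for _ in range(n_steps):
            opt.zero_grad()
            loss = hsic_biased(y - x @ theta, z)
            loss.backward()
            opt.step()
        # Accept if the residuals pass the independence test, otherwise restart.
        if hsic_perm_pvalue(y - x @ theta, z) > alpha:
            break
    # Bias correction using the zero-mean assumption on the noise term.
    with torch.no_grad():
        intercept = (y - x @ theta).mean()
    return theta.detach(), intercept
```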
3.4. Confidence Regions

Suppose $T_n : \Theta \times \mathbb{R}^{r \times n} \to \{0, 1\}$ tests, for a given parameter $\theta \in \Theta$, for independence between the residuals $R^\theta := Y - f_\theta(X)$ and the instruments Z, based on the n i.i.d. observations $(X_i, Y_i, Z_i)_{i=1}^n$. Denote by $\Theta_0 := \{\theta \mid R^\theta \perp\!\!\!\perp Z\}$ the set of θ's satisfying the null hypothesis of independence. We say $T_n$ has pointwise asymptotic level α if $\sup_{\theta \in \Theta_0} \lim_{n \to \infty} P\big(T_n(\theta, (Z_i)_{i=1}^n) = 1\big) \leq \alpha$ (here, $T_n = 1$ corresponds to rejecting independence). The test has uniform asymptotic level if lim and sup in the definition can be exchanged and the statement still holds; it has finite sample level if it holds for all n when removing the lim. If $T_n$ has finite sample level, then, by construction, $\hat C_n := \{\theta \in \Theta \mid T_n(\theta, (Z_i)_{i=1}^n) = 0\}$ is a $(1-\alpha)$-confidence region for $\theta^0$:
$$P(\theta^0 \in \hat C_n) = P\big(T_n(\theta^0, (Z_i)_{i=1}^n) = 0\big) \geq 1 - \alpha.$$
Similarly for pointwise (uniform) asymptotic level. Because the independence restriction is stronger than the moment restriction (see Proposition 2.1), we can use any existing test for (1), such as the Anderson-Rubin test (Anderson & Rubin, 1949), to construct such confidence regions, too. Alternatively, we can use HSIC to construct an independence test, which we then invert. For a fixed kernel, tests can be constructed either by permutation-based procedures or by approximating the distribution of $\widehat{\mathrm{HSIC}}$ under the null hypothesis, e.g., by a gamma distribution (Gretton et al., 2008). For many tests, $\hat C_n$ has to be approximated.

4. Distribution Generalization

Here, we follow a line of work which connects distribution generalization with causality (see Section 1). In this framework, distributional shifts are modeled as interventions. We propose using HSIC-X-pen, introduced in Section 3.1, for this task and prove that it is worst-case optimal under interventions on the exogenous variables, even if the model is nonlinear and the causal function is not identifiable. To formalize the result, we again assume that the data is generated by the IV model $M^0$. Furthermore, in this section, we consider the function classes $\mathcal{F} := L^2(\mathbb{R}^d, P^X_{M^0})$, consisting of all functions $f^* : \mathbb{R}^d \to \mathbb{R}$ that satisfy $E_{M^0}[f^*(X)^2] < \infty$, and
$$\mathcal{F}_{\mathrm{inv}} := \big\{f^* \in \mathcal{F} \mid Z \perp\!\!\!\perp Y - f^*(X) \text{ under } P_{M^0}\big\}.$$
We model a distributional shift as an intervention on the exogenous Z, denoted by i; it consists of replacing the distribution of Z with a new distribution. The intervened model is denoted by $M^0(i)$ and is again an IV model. In this setting, we now prove that for any predictor $f \in \mathcal{F}$ satisfying the independence restriction $Z \perp\!\!\!\perp Y - f(X)$, that is, $f \in \mathcal{F}_{\mathrm{inv}}$, the expected loss is invariant to interventions on Z in the following sense.

Theorem 4.1 (Invariance with respect to interventions on Z). Let $\ell : \mathbb{R} \to \mathbb{R}$ be a convex loss function and I be a set of interventions on Z satisfying for all $i \in I$ that $P_{M^0(i)}$ is dominated by $P_{M^0}$. Then, for all $f \in \mathcal{F}_{\mathrm{inv}}$ it holds that
$$E_{M^0}\big[\ell(Y - f(X))\big] = \sup_{i \in I} E_{M^0(i)}\big[\ell(Y - f(X))\big].$$

We consider this setting relevant as identifiability (that is, $|\mathcal{F}_{\mathrm{inv}}| = 1$) is not achievable in most modern machine learning applications. The assumption that the intervened distributions are dominated by the observed distribution ensures that none of the interventions on Z extend the support of Z. Generalizing to distributions that extend the support is only possible under additional extrapolation assumptions that require the functions $g^0$ and $f^0$ in the IV model $M^0$ to be partially identifiable (Christiansen et al., 2021). In this sense it is not possible to strengthen our result without making additional assumptions on the identifiability of the IV model. Theorem 4.1 motivates estimating a predictor based on
$$\arg\min_{f \in \mathcal{F}_{\mathrm{inv}}} E_{M^0}\big[\ell(Y - f(X))\big], \qquad (11)$$
where $\ell : \mathbb{R} \to \mathbb{R}$ is a convex loss function. Our proposed HSIC-X-pen estimator from (10) for λ approaching 0 provides a flexible way of estimating the minimizer of (11). By Theorem 4.1, it is guaranteed to control the test error under any distributional shift generated by an intervention on Z which does not extend the support.
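Theorem 4.1 can be checked numerically. Because Z is a source node, an intervention that replaces its distribution by one with the same support only reweights Z; the intervened loss therefore equals an importance-weighted training loss, and for a predictor in $\mathcal{F}_{\mathrm{inv}}$ this weighting has no effect. The sketch below (illustrative only; the model, the intervention and all numbers are assumptions, not taken from the paper) contrasts the causal predictor with OLS in a small binary-instrument example, assuming NumPy.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Training model: binary instrument, confounded X, causal function f0(x) = x.
Z = rng.binomial(1, 0.5, n)
U, eps_X, eps_Y = rng.standard_normal((3, n))
X = (1 + 2 * Z) * eps_X + U
Y = X + U + eps_Y

theta_ols = np.cov(X, Y)[0, 1] / np.var(X)

# An intervention on Z that keeps its support (Bernoulli(0.9) instead of Bernoulli(0.5))
# is dominated by the training distribution, so the intervened loss equals a
# reweighted training loss with weights w(z) = p_i(z) / p_0(z).
w = np.where(Z == 1, 0.9 / 0.5, 0.1 / 0.5)

# The causal predictor is in F_inv, so its loss is unchanged; the OLS loss shifts.
for name, theta in [("causal (invariant)", 1.0), ("OLS", theta_ols)]:
    res2 = (Y - theta * X) ** 2
    print(f"{name:18s} train loss: {res2.mean():.3f}   loss under intervention: {np.mean(w * res2):.3f}")
```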
Moreover, we now prove that for a sufficiently rich class of distribution shifts, a predictor solving (11) on the training data is worst-case optimal. To formalize our result, define the intervened subset $S \subseteq \{1, \dots, d\}$ consisting of all $X_j$ that are descendants of Z in the causal graph. When $|S| < d$, this may contain cases in which identifiability according to (3) is impossible.²

Theorem 4.2 (Generalization to interventions on Z). Let $\ell : \mathbb{R} \to \mathbb{R}$ be a convex loss function and I be a set of interventions on Z satisfying for all $i \in I$ that $P_{M^0(i)}$ is dominated by $P_{M^0}$, which itself is absolutely continuous with respect to a product measure. If there exists $i^* \in I$ such that $X_S \perp\!\!\!\perp U \mid X_{S^c}$ under $P_{M^0(i^*)}$ and $\mathrm{supp}(P^X_{M^0(i^*)}) = \mathrm{supp}(P^X_{M^0})$, then
$$\inf_{f \in \mathcal{F}_{\mathrm{inv}}} E_{M^0}\big[\ell(Y - f(X))\big] = \inf_{f \in \mathcal{F}} \sup_{i \in I} E_{M^0(i)}\big[\ell(Y - f(X))\big].$$

The intervention $i^*$ can be called partially confounding-removing (see also Christiansen et al., 2021) in the sense that in the generated distribution $P_{M^0(i^*)}$ the variables $X_S$ affected by the exogenous Z are, conditioned on $X_{S^c}$, no longer confounded with Y via U (if Z acts additively, this implies that, in general, U does not act on both Y and $X_S$). As shown in the proof, this intervention results in the worst-case loss. An example of this type of intervention is given in Section 5.2. If $\ell(\cdot) = (\cdot)^2$ is the squared loss, the minimizer is attained at the conditional mean of Y given X under $P_{M^0(i^*)}$, that is, $f^0(x) + E_{M^0(i^*)}[h(U, \epsilon_Y) \mid X = x]$ (see proof of Theorem 4.2).

²For example, if $Y = f^0(X_1, X_2) + U + \epsilon_Y$, $X_1 = U$ and $X_2 = Z + U$, then the causal function is not identifiable from (3).

5. Experiments

5.1. Simulation: Instrumental Variable Estimation

We first evaluate the empirical performance of HSIC-X and HSIC-X-pen for estimating causal functions. To this end, we use the following IV models in our experiments:
$$M(\alpha, P_{\epsilon_Z}, f^0): \qquad Z := \epsilon_Z, \quad U := \epsilon_U, \quad X := Z\epsilon_X + \alpha Z + U, \quad Y := f^0(X) - 4U + \epsilon_Y, \qquad (12)$$
where $\epsilon_U, \epsilon_X, \epsilon_Y \overset{\mathrm{i.i.d.}}{\sim} N(0, 1)$, $\epsilon_Z \sim P_{\epsilon_Z}$ and $\epsilon_Z \perp\!\!\!\perp (\epsilon_U, \epsilon_X, \epsilon_Y)$. We consider different experiment settings by varying the three parameters α, $P_{\epsilon_Z}$ and $f^0$. The parameter α adjusts how the instruments Z influence the predictor X: the instruments Z only change the variance of X when α = 0, while both the mean and the variance of X are changed by Z when α > 0. For $P_{\epsilon_Z}$, we consider binary and Gaussian random variables. Lastly, for $f^0$, we consider a linear function $f^0_{\mathrm{lin}}(X) := 2X$ and a nonlinear function $f^0_{\mathrm{nonlin}}(X) := 1.5X - 0.2X^2 + \sum_{j=1}^{10} w_j e^{-(X - c_j)^2}$, where the $c_j$ represent a partition of $[-7, 7]$ into 10 equally spaced intervals and $w_1, \dots, w_{10} \overset{\mathrm{i.i.d.}}{\sim} N(0, 2)$. We generate 1000 observations from the IV model (12) and evaluate the performance of our methods under different settings. In all the experiments, we use Adam as the optimizer with the learning rate set to 0.01 and a batch size of 256.

Known basis functions. We first assume that the underlying function class containing the causal function is known; that is, we consider $\mathcal{F} := \{f(\cdot) = \varphi(\cdot)^\top\theta \mid \theta \in \mathbb{R}^p\}$ with p = 1 and $\varphi(x) = x$ in the linear case, and p = 12 and $\varphi(x) = [x, x^2, e^{-(x - c_1)^2}, \dots, e^{-(x - c_{10})^2}]$ in the nonlinear case ($c_1, \dots, c_{10}$ are the same as described above). We compare the performance of HSIC-X with the following baseline methods:
- 2SLS: this solves the moment restriction (1); we use the feature map φ defined above as the function η in the moment restriction.
- OLS: least squares regression of Y on X (using the basis functions φ or neural networks).
- Oracle: least squares regression of Y on X using non-confounded data (the confounder U is removed from the assignment of X).
- HSIC-Oracle: same as HSIC-X(-pen) but initialized at the true causal parameters; this is to investigate how much our method suffers from the non-convexity of the objective function.

The last two methods serve as oracle benchmarks. The performance of each method is measured by the integrated mean squared error (MSE) $E[(\hat f(X) - f^0(X))^2]$ between the estimate $\hat f$ and the causal function $f^0$, approximated using a test sample of size 10000. In all experiments, we report an average of the MSE values over 10 simulations.

Figure 1. MSEs of different estimators when the correct basis functions are used (four panels: $f^0$ = linear/non-linear, Z = Gaussian/binary; methods: 2SLS, OLS, Oracle, HSIC-X, HSIC-Oracle). Each point represents an average over 10 simulations and the error bar indicates its 95% confidence interval. 2SLS is inconsistent and underperforms OLS when α = 0, and for all α in the bottom-right setting, while HSIC-X yields a substantial improvement over OLS in such settings. Under weak instruments (small α), we observe an efficiency gain of HSIC-X over 2SLS.

Figure 1 reports the MSE as we increase the parameter α. Our method shows a significant improvement over 2SLS in almost all settings. The improvement is especially prominent in the nonlinear-binary setting (see the bottom-right panel of Figure 1), where the moment identifiability condition (2) is not satisfied (see Proposition 2.1). We still see an improvement even in the identifiable cases (when α > 0), which suggests a finite sample efficiency gain from using the independence restriction. Lastly, the performance of HSIC-X is on par with that of HSIC-Oracle, indicating that the optimization objective is reasonably well-behaved despite its non-convexity (only in the nonlinear/binary case, there is a slight deviation). We illustrate the estimated functions against the true causal function in Appendix D.2.

Approximate functions. We now consider a more flexible function class to approximate the causal function $f^0_{\mathrm{nonlin}}$ by using a neural network (NN). For a fixed width and depth, $f^0_{\mathrm{nonlin}}$ may not lie in the function class represented by the NN. Nonetheless, we expect our method to produce a reasonable estimate of the causal function. We generate 1000 observations from the IV model (12) with $f^0_{\mathrm{nonlin}}$. In addition to the OLS and Oracle baselines, we compare our method to Deep GMM (Bennett et al., 2019) and Deep IV (Hartford et al., 2017) using their publicly available implementations. To investigate the effect of the MSE regularization (see Section 3.1), we add a variant of our method (HSIC-X-pen) where the MSE regularization is employed. A neural network with one hidden layer of size 64 is used in our methods and all the baselines.

Figure 2. MSEs of different estimators when the causal function is approximated by a neural network (methods: Deep GMM, Deep IV, OLS, Oracle, HSIC-X, HSIC-X-pen; panels by instrument distribution, e.g., Z = Gaussian). HSIC-X shows a substantial gain over the baselines in all settings. HSIC-X-pen does not show a major improvement over HSIC-X.

Figure 2 outlines the simulation results. In short, both HSIC-X and HSIC-X-pen outperform the Deep GMM, Deep IV and OLS baselines and approach the Oracle's performance as the instrument strength (α) increases.
We speculate that the inferior performance of Deep GMM and Deep IV may be explained by their sole reliance on the moment restriction (1) as opposed to the full independence restriction (3). Lastly, HSIC-X-pen does not show a major improvement over HSIC-X.

Multi-dimensional setting. We also investigate the effect of the dimensionality of X and Z on our estimators. The exact experiment setup and detailed results are provided in Appendix D.1. The results suggest that for linear models with higher order effects of Z, HSIC-X outperforms 2SLS, in particular when the dimension of Z is strictly smaller than that of X.

5.2. Simulation: Distribution Generalization

To empirically verify the theoretical generalization guarantees from Section 4, we consider, for $i \in \{0.5, 1, \dots, 3.5, 3.99\}$, the collection of IV models
$$Z := \epsilon^i, \quad U_1 := \epsilon_{U_1}, \quad U_2 := \epsilon_{U_2}, \quad X_2 := U_2 + \epsilon_{X_2}, \quad X_1 := U_1 \mathbb{1}(Z \leq 3.5) + 0.1Z + 2Z\epsilon_{X_1}, \quad Y := f^0(X_1, X_2) + U_1 + U_2,$$
where $\epsilon_{U_1}, \epsilon_{U_2}, \epsilon_{X_1}, \epsilon_{X_2}$ are i.i.d. $N(0, 1)$, $\epsilon^i = W_1\mathbb{1}(K = 0) + W_2\mathbb{1}(K = 1)$ with $K \sim \mathrm{Ber}(i/4)$, $W_1 \sim \mathrm{Unif}(0, i)$, $W_2 \sim \mathrm{Unif}(i, 4)$, and $\epsilon^i \perp\!\!\!\perp (\epsilon_{U_1}, \epsilon_{U_2}, \epsilon_{X_1}, \epsilon_{X_2})$. The parameter i determines how much the distribution of Z, with support (0, 4), is skewed towards the right. In particular, for $i \to 4$ the intervention removes the confounding effect of $U_1$, see Theorem 4.2. We use M(0.5) to generate 3000 observations from the training distribution and use it to fit HSIC-X-pen, OLS and Anchor regression (AR) (Rothenhäusler et al., 2021), which comes with generalization guarantees, too, but does not cover the higher-moments influence of Z as in the SCM above. As in Section 5.1, we consider both linear and nonlinear functions for $f^0$, and employ the correct basis (Known Basis) and neural network (NN) function classes in our method and the baselines; see Appendix D.3 for details. We evaluate the fitted estimators on shifted test distributions and add a causal baseline (Causal) in which the true causal function is used as a predictor. This serves as an oracle baseline as, in this setting, the causal function is not identifiable.

Figure 3. Predictive performance of different predictors on shifted test distributions as the intervention strength (i) increases (panels: $f^0$ = linear/non-linear; methods: AR, OLS, Causal, HSIC-X-pen; function classes: Known Basis and NN). Causal is the most stable predictor but conservative, while OLS is markedly affected by the interventions. Due to the higher-moments influence from Z, HSIC-X outperforms AR in most cases and establishes the best trade-off between invariance and predictiveness.

The results are shown in Figure 3. Our method (HSIC-X-pen) and Causal are significantly more robust to the interventions compared to OLS and AR, with the causal oracle being the most invariant predictor, as described by Theorem 4.1. However, the causal oracle is conservative (it ignores the information from the hidden confounder $U_2$ that is unaffected by the interventions and is helpful in predicting Y) and yields subpar performance when the interventions i are small. Our method utilizes the invariant information from $U_2$ and yields the best trade-off among all candidates.

5.3. Real-world Application

We apply HSIC-X to estimate the causal effect of education on earnings using 3010 observations from the 1979 National Longitudinal Survey of Young Men (Card, 1995).
The response variable Y is the logarithm of wage, X is years of education, and the (discrete) instrument Z is geographic proximity to colleges (whether an individual grew up near a four-year college). Card (1995) also considered several conditioning variables W, including years of experience, race and geographic information, and studied a linear causal effect of education on earnings, so we use $\mathcal{F} = \{f \mid f(x) = \theta x\}$ and apply the conditional IV estimator described in Appendix B. In addition, we obtain a confidence region as described in Section 3.4. We then compare our results with those by 2SLS, which was used in the original study of Card (1995), and confidence intervals constructed by the Anderson-Rubin test (Anderson & Rubin, 1949).

Table 2. Estimation results for the effect of education on earnings (Card, 1995), including 95% confidence sets. The independence restriction puts stronger constraints on the parameters and yields a smaller confidence set; e.g., it does not include the OLS solution, which is not rejected by an Anderson-Rubin test.

| Method | Point estimate | Lower | Upper |
|---|---|---|---|
| OLS | 0.072 | 0.065 | 0.079 |
| 2SLS | 0.142 | 0.050 | 0.273 |
| HSIC-X | 0.160 | 0.097 | 0.208 |

Table 2 reports the point estimates and the confidence intervals at the 95% confidence level. The point estimates from 2SLS and HSIC-X differ from the OLS estimate, suggesting a potentially sizable unobserved confounding effect. Nonetheless, the difference between the OLS and 2SLS estimates is not statistically significant, which was also observed in the original study (see Card, 1995; 2001). We speculate that the linear effect of college proximity Z on education X may not be strong enough, which leads to the imprecision of 2SLS. On the other hand, the confidence region around HSIC-X does not contain the OLS estimate. This suggests that taking the full independence into account yields a finite sample efficiency gain in this study.

6. Conclusion and Future Work

Exploiting independence between exogenous variables and the residual terms of a response variable can be beneficial for the identifiability of a causal function, for the empirical performance of corresponding estimates, and for constructing estimators that perform well even under interventions on such exogenous variables. We have proposed two estimators, HSIC-X and HSIC-X-pen, that can be equipped with any machine learning regression method based on gradient descent. Empirical results on simulated and real data indicate that one may indeed benefit from considering independence restrictions in practice. We believe it could be fruitful to construct fast approximate methods for inverting an independence test and to analyze distribution generalization when the support of Z is extended.

Acknowledgments

SS, LH, JP were supported by a research grant (18968) from VILLUM FONDEN. NP was supported by a research grant (0069071) from Novo Nordisk Fonden. We thank Nicola Gnecco and Nikolaj Thams for helpful discussions.

References

Anderson, T. W. and Rubin, H. Estimation of the parameters of a single equation in a complete system of stochastic equations. Annals of Mathematical Statistics, 20(1):46–63, 1949.

Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization. arXiv e-prints (1907.02893), 2019.

Bennett, A., Kallus, N., and Schnabel, T. Deep generalized method of moments for instrumental variable analysis. In Advances in Neural Information Processing Systems 32 (NeurIPS). Curran Associates, Inc., 2019.

Berrett, T. B. and Samworth, R. J.
Nonparametric independence testing via mutual information. Biometrika, 106 (3):547 566, 2019. Bongers, S., Forre, P., Peters, J., and Mooij, J. M. Foundations of structural causal models with cycles and latent variables. Annals of Statistics, 49(5):2885 2915, 2021. Card, D. Using geographic variation in college proximity to estimate the return to schooling. Aspects of Labour Market Behaviour: Essays in Honour of John Vanderkamp, pp. 201 222, 1995. Card, D. Estimating the return to schooling: Progress on some persistent econometric problems. Econometrica, 69(5):1127 1160, 2001. Chesher, A. Identification in nonseparable models. Econometrica, 71(5):1405 1441, 2003. Christiansen, R., Pfister, N., Jakobsen, M. E., Gnecco, N., and Peters, J. A causal framework for distribution generalization. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1 1, 2021. Creager, E., Jacobsen, J.-H., and Zemel, R. Environment inference for invariant learning. In Proceedings of the 38th International Conference on Machine Learning (ICML), pp. 2189 2200. PMLR, 2021. Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Rearning Research, 12(7), 2011. Dunker, F. Adaptive estimation for some nonparametric instrumental variable models with full independence. Electronic Journal of Statistics, 15:6151 6190, 2021. Dunker, F., Florens, J.-P., Hohage, T., Johannes, J., and Mammen, E. Iterative estimation of solutions to noisy nonlinear operator equations in nonparametric instrumental regression. Journal of Econometrics, 178:444 455, 2014. Frölich, M. Nonparametric IV estimation of local average treatment effects with covariates. Journal of Econometrics, 139(1):35 75, 2007. Exploiting Independent Instruments Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems 20 (Neur IPS). Curran Associates, Inc., 2008. Fuller, W. A. Some properties of a modification of the limited information estimator. Econometrica, 45:939 53, 1977. Greenfeld, D. and Shalit, U. Robust learning with the Hilbert-schmidt independence criterion. In Proceedings of the 37th International Conference on Machine Learning (ICML), pp. 3759 3768. PMLR, 2020. Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., and Smola, A. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20 (Neur IPS). Curran Associates, Inc., 2008. Hansen, C., Mc Donald, J. B., and Newey, W. K. Instrumental variables estimation with flexible distributions. Journal of Business & Economic Statistics, 28(1):13 25, 2010. Hartford, J., Lewis, G., Leyton-Brown, K., and Taddy, M. Deep IV: A flexible approach for counterfactual prediction. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 1414 1423. PMLR, 2017. Imbens, G. W. and Angrist, J. D. Identification and estimation of local average treatment effects. Econometrica, 62 (2):467 75, 1994. Imbens, G. W. and Newey, W. K. Identification and estimation of triangular simultaneous equations models without additivity. Econometrica, 77(5):1481 1512, 2009. Jakobsen, M. and Peters, J. Distributional robustness of K-class estimators and the PULSE. The Econometrics Journal, 25(2):404 432, 2022. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR); ar Xiv:1412.6980, 2015. 
Krueger, D., Caballero, E., Jacobsen, J.-H., Zhang, A., Binas, J., Zhang, D., Priol, R., and Courville, A. Out-ofdistribution generalization via risk extrapolation (REx). In Proceedings of the 38th International Conference on Machine Learning (ICML), pp. 5815 5826. PMLR, 2021. Loh, I. Nonparametric identification and estimation with independent, discrete instruments. Ar Xiv e-prints (1906.05231), 2019. Magliacane, S., van Ommen, T., Claassen, T., Bongers, S., Versteeg, P., and Mooij, J. M. Domain adaptation by using causal inference to predict invariant conditional distributions. In Advances in Neural Information Processing Systems 31 (Neur IPS), pp. 10846 10856. Curran Associates, Inc., 2018. Mariano, R. S. Simultaneous equation model estimators: Statistical properties and practical implications. In Baltagi, B. H. (ed.), A companion to theoretical econometrics, pp. 122 43. Blackwell, Malden, MA, 2001. Mooij, J. M., Janzing, D., Peters, J., and Schölkopf, B. Regression by dependence minimization and its application to causal inference. In Proceedings of the 26th International Conference on Machine Learning (ICML), pp. 745 752. ACM Press, 2009. Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J., and Schölkopf, B. Distinguishing cause from effect using observational data: methods and benchmarks. Journal of Machine Learning Research, 17(32):1 102, 2016. Muandet, K., Mehrjou, A., Lee, S. K., and Raj, A. Dual instrumental variable regression. In Advances in Neural Information Processing Systems 33 (Neur IPS). Curran Associates, Inc., 2020. Newey, W. K. Uniform convergence in probability and stochastic equicontinuity. Econometrica: Journal of the Econometric Society, pp. 1161 1167, 1991. Newey, W. K. Nonparametric instrumental variables estimation. American Economic Review, 103(3):550 556, 2013. Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, USA, 2nd edition, 2009. Peters, J., Bühlmann, P., and Meinshausen, N. Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (with discussion), 78(5):947 1012, 2016. Pfister, N., Bühlmann, P., Schölkopf, B., and Peters, J. Kernel-based tests for joint independence. Journal of the Royal Statistical Society: Series B, 80:5 31, 2017. Pfister, N., William, E. G., Peters, J., Aebersold, R., and Bühlmann, P. Stabilizing variable selection and regression. Annals of Applied Statistics, 15(3):1220 1246, 2021. Poirier, A. Efficient estimation of models with independence restrictions. Journal of Econometrics, 196(1):1 22, 2017. Exploiting Independent Instruments Rojas-Carulla, M., Schölkopf, B., Turner, R., and Peters, J. Causal transfer in machine learning. Journal of Machine Learning Research, 19(36):1 34, 2018. Rothenhäusler, D., Bühlmann, P., Meinshausen, N., and Peters, J. Anchor regression: heterogeneous data meets causality. Journal of Royal Statistical Society, Series B, 83(2):215 246, 2021. Schölkopf, B. and Smola, A. Learning with Kernels. MIT Press, Massachusetts, 2002. Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. M. On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning (ICML). Omnipress, 2012. Shah, R. and Peters, J. The hardness of conditional independence testing and the generalised covariance measure. Annals of Statistics, 48(3):1514 1538, 2020. Singh, R., Sahani, M., and Gretton, A. Kernel instrumental variable regression. 
In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32 (Neur IPS). Curran Associates, Inc., 2019. Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Lanckriet, G., and Schölkopf, B. Kernel choice and classifiability for rkhs embeddings of probability distributions. In Advances in Neural Information Processing Systems 22 (Neur IPS). Curran Associates, Inc., 2009. Theil, H. Economic forecasts and policy. North-Holland, Amsterdam, NL, 1958. Torgovitsky, A. Identification of nonseparable models using instruments with small support. Econometrica, 83(3): 1185 1197, 2015. Wooldridge, J. M. Econometric analysis of cross section and panel data. MIT press, Cambridge, MA, 2010. Wright, P. G. The Tariff on Animal and Vegetable Oils. Investigations in International Commercial Policies. Macmillan, New York, NY, 1928. Yuan, J., Ma, X., Kuang, K., Xiong, R., Gong, M., and Lin, L. Learning domain-invariant relationship with instrumental variable for domain generalization. Ar Xiv e-prints (2110.01438), 2021. Zhang, K., Peters, J., Janzing, D., and Schölkopf, B. Kernelbased conditional independence test and application in causal discovery. In Proceedings of the 27th Annual Conference on Uncertainty in Artificial Intelligence (UAI), pp. 804 813. AUAI Press, 2011. Exploiting Independent Instruments A.1. Proof of Proposition 2.1 Proof. (i) We first show that if there exists k and η, such that condition (2) holds, then condition (4) holds. We do so by contraposition, so assume that there exists a τ = 0 violating condition (4). The following then hold for all k N and all η : Rr Rk. First, Cov[η(Z), h(U, ϵY ) + φ(X) τ] = 0. Second, as Z h(U, ϵY ) it follows that Cov[η(Z), h(U, ϵY )] = 0. Third, by combining the previous two results it follows that Cov[η(Z), φ(X) τ] = 0, which is a violation of condition (2). (ii) This statement is proven by (iii) (a), for example. (iii) We now consider the two types of examples (a), (b) and (c). (a) In the example SCM, we have E[X|Z] = Z E[U] + E[ϵX] = 0 and therefore for all k N and functions η : Rr Rk, E[η(Z)X] = 0. As a result, the identifiability condition (2) does not hold here. Specifically, any function of the form θX, θ R satisfies the moment restriction, so we cannot identify the causal function f 0(x) = x. Consider now the independence based identifiability condition (4): Z U + ϵY + τX = τ = 0. Plugging in the structural equation for X this is equivalent to Kτ := U + τZU + τϵX + ϵY Z for all τ = 0. We now show that this is true by showing that for all τ = 0, E[Z2K2 τ ] = E[Z2]E[K2 τ ]. First, E[Z2K2 τ ] = E[Z2(U + τZU + τϵX + ϵY )2] = E[Z2U 2] + τ 2E[Z4U 2] + τ 2E[Z2ϵ2 X] + E[Z2ϵ2 Y ] = 2 + 4τ 2. E[Z2]E[K2 τ ] = E[Z2]E[(U + τZU + τϵX + ϵY )2] = E[Z2](E[U 2] + τ 2E[Z2U 2] + τ 2E[ϵ2 X] + E[ϵ2 Y ]) = 2 + 2τ 2. Therefore, condition (4) is fulfilled, meaning we can identify the causal function f 0(x) = x. (b) We first give an argument why identifiability using condition (2) is impossible. For a discrete instrument Z, the number of almost surely linearly independent functions3 of Z is bounded by the cardinality of the support of Z; therefore, if the number of basis functions is larger than the support of Z, there is no k N and η : Rr Rk such that condition (2) is satisfied. We now construct an explicit example as follows: First, we show for a class of SCMs that under certain assumptions condition (2) does not hold while (4) does. 
Second, we give an explicit SCM with a binomially distributed instrument that fulfills these conditions. Fix k, p N such that p > k and consider the SCM Z := ϵZ U := ϵU X := Z + U + ϵX i=1 αi Xi + U + ϵY , where ϵZ, ϵU, ϵX and ϵY are jointly independent, real-valued errors ϵZ Unif({0, . . . , k 1}) and αi R \ {0} for all i {1, . . . , p}. Assume that the random variables {ϵi XU j|1 i + j p} are almost surely linearly independent 3Here, we say that a collection of functions l1, . . . , lk is almost surely linearly independent if for all vectors α = 0, P Pk i=1 αili(Z) = 0 = 0. Exploiting Independent Instruments (see above). We are interested in estimating the causal function f 0(x) = Pp i=1 αixi with the basis functions φ(x) = (x1, . . . , xp). Consider first the moment identifiability condition (2). Because Z is discrete, the set of functions {1{Z=i}, i {0, . . . , k 1}} form a linear basis for all functions from supp(Z) to R. As a result, for all functions η = (η1, . . . , ηq) : R Rq, with q > k, condition (2) cannot hold. This implies that while condition (2) may hold if p k, with for example η(Z) = (1{Z=0}, . . . , 1{Z=k 1}), it cannot hold if p > k, irrespective of the choice of η. Consider now the independence identifiability condition (4): i=1 τiαi Xi + h(U, ϵY ) Z = τ1 = = τp = 0. Because the support of Z is finite, this is equivalent to the statement that the distributions of the random variables i=1 τiαi(j + U + ϵX)i + U + ϵY , are the same for all j {0, . . . , k 1}. Consider the cases j = 0 and j = 1. For Pp i=1 τiαi(U + ϵX)i + U + ϵY and Pp i=1 τiαi(1 + U + ϵX)i + U + ϵY to have the same distribution, the coefficients in front of the ϵi X (when collecting all terms) for all i {1, . . . , p} must be equal in both random variables. Consider first the ϵp 1 X term. The corresponding two coefficients are τp 1αp 1 and τp 1αp 1 + pτpαp. These two coefficients are only equal if τp = 0. By iterating this argument we obtain that τ2 = = τp = 0. This implies that τ1α1(U + ϵX) + U + ϵY and τ1α1(1 + U + ϵX) + U + ϵY have the same distribution. Thus, τ1 = 0 and therefore condition (4) holds. We now give an explicit example, where k < p and where the random variables {ϵi XU j|1 i + j p} are almost surely linearly independent. Let ϵU, ϵZ, ϵX and ϵY standard normal, Z Bernoulli(0.5) and p = 3. Here, k = 2 < p and we will now show that the set of random variables {ϵi XU j|1 i + j 3} is almost sure linearly independent. Let τij, 1 i + j 3 be coefficients such that Kτ := P 1 i+j 3 τijϵi XU j = 0 almost surely. In particular, this implies that E[Kτϵi XU j] = 0 for all i, j 0. Using E[ϵi XU jϵk XU l] = 0 if either i + k or j + l are odd (which follows as ϵX and U are mean zero Gaussian and independent), we obtain the following nine equations. E[KτU] = τ01 + 3τ03 + τ21 = 0 E[KτU 2] = 3τ02 + τ20 = 0 E[KτU 3] = 3τ01 + 15τ03 + 3τ21 = 0 E[KτϵX] = τ10 + τ12 + 3τ30 = 0 E[KτϵXU] = τ11 = 0 E[KτϵXU 2] = τ10 + 3τ12 + 3τ30 = 0 E[Kτϵ2 X] = τ02 + 3τ20 = 0 E[Kτϵ2 XU] = τ01 + 3τ03 + 3τ21 = 0 E[Kτϵ3 X] = 3τ10 + 3τ12 + 15τ30 = 0. We can write this as a linear system Aτ = 0 with corresponding 9 9 matrix A. As A is invertible, τ = 0 follows. (c) Consider the SCM Z := ϵZ U := ϵU X := 2ZU U + ϵX Y := X + U + ϵY . where ϵZ, ϵU, ϵX and ϵY are jointly independent, ϵZ Bernoulli(0.5) and the remaining errors are standard normal. Consider the basis function φ(x) = x. 
Here, Kτ := h(U, ϵY ) + τX = ( (1 + τ)U + τϵX + ϵY if Z = 1 (1 τ)U + τϵX + ϵY if Z = 0 Exploiting Independent Instruments and therefore E[K2 τ |Z = 1] = 2τ 2 + 2τ + 2 and E[K2 τ |Z = 0] = 2τ 2 τ + 2. It follows that condition (4) holds. On the other hand, τX|(Z = 1) = τ(U + ϵX) and τX|(Z = 0) = τ( U + ϵX) and second, the distribution of U is symmetric around 0. Therefore, Z τX for all τ R. We can therefore identify the causal function with the independence restriction, even though Z is independent of X. A.2. Proof of Proposition 2.2 Proof. This is a proof by example. We have already constructed an SCM, such that condition (4) holds but condition (6) does not in the proof of statement (iii) (c) of Proposition 2.1. We now construct a SCM, such that condition (6) holds but condition (4) does not. Consider the SCM Z := ϵZ U := ϵU X := 2ZU + ϵX Y := X U + ϵY . where ϵZ, ϵU, ϵX and ϵY are jointly independent, ϵZ Bernoulli(0.5) and the remaining errors are standard normal. Consider the basis functions φ(x) = (x). Here, h(U, ϵY ) + X = ( U + ϵX + ϵY if Z = 1 U + ϵX + ϵY if Z = 0 and therefore condition (4) does not hold for τ = 1. On the other hand, ( τ(2U + ϵX) if Z = 1 τϵX if Z = 0 and therefore E[(τX)2|Z = 1] = 5τ 2 while E[(τX)2|Z = 0] = τ 2. It follows that condition (6) holds. A.3. Proof of Theorem 3.1 Proof. Let H(PM 0, θ) := HSIC((Y φ(X) θ, Z); k R, k Z) with non-negative bounded and Lipschitz continuous kernels k R and k Z with bounded and continuous derivatives.4 Consider an i.i.d. data set Dn = (Xi, Yi, Zi)n i=1, such that that each triple (Xi, Yi, Zi) follows the distribution of M 0. Let ˆHn(Dn, θ) := \ HSIC((Rθ i , Zi)n i=1), with Rθ i = Yi φ(Xi) θ. Then the following three results hold. First, as the kernels k R and k L are by assumption non-negative and bounded it follows by Corollary 15 in Mooij et al. (2016) that for all θ and all ϵ > 0, limn PM 0(| ˆHn(Dn, θ) H(θ)| > ϵ) = 0. Second, it follows by Lemma 16 in Mooij et al. (2016) that for all θ1, θ2 Θ and all n 2 ˆHn(Dn, θ1) ˆHn(Dn, θ2) = \ HSIC((Rθ1 i , Zi)n i=1) \ HSIC((Rθ2 i , Zi)n i=1) φ(X) (θ1 θ2) φ(X) (θ1 θ2) i=1 φ(Xi) 2 !0.5 32λCM (θ1 θ2) , 4From now on, we do not explicitly write down the kernels and write H(θ) rather than H(PM0, θ) to ease notation Exploiting Independent Instruments where Rθ = (Yi φ(Xi) θ)n i=1, φ(X) = (φ(Xi))n i=1, λ is the Lipschitz constant for k R, C is an upper bound for k Z and M is an upper bound for φ( ) 2 which exists by the assumption that φ( ) is bounded. Third, for example by Pfister et al. (2017) Proposition 2.5, the population HSIC can be written as H(θ) = HSIC(Y φ(X) θ, Z) = E[k R(Rθ1 1 , Rθ2 2 )k Z(Z1, Z2)] + E[k R(Rθ1 1 , Rθ2 2 )]E[k Z(Z1, Z2)] 2E[k R(Rθ1 1 , Rθ2 2 )k Z(Z1, Z3)], with Rθ1 1 and Rθ2 2 as well as Z1, Z2 and Z3 being i.i.d. copies of Rθ1 and Z, respectively. The function Rθ : θ 7 Y φ(X) θ is (surely) continuously differentiable, as is k R by assumption. Further, k R and k Z have bounded derivative. We can therefore conclude by dominated convergence and the chain rule that H(θ) is continuously differentiable. Therefore, H(θ) is Lipschitz on Θ, since Θ is compact. Jointly, these three results (which correspond to (i), (ii), and (iii) from the statement of the theorem) allow us to apply Corollary 2.2 from Newey (1991) and conclude that for all ε > 0, lim n PM 0(max θ Θ | ˆHn(Dn, θ) H(θ)| > ε) = 0. (13) Here and in the remainder of the proof we use that θ 7 ˆHn(Dn, θ) and θ 7 H(θ) are (surely) continuous functions that map to R. 
Restricted to any compact set, they therefore (surely) attain a maximum and a minimum. Consider now any sequence of estimators $\hat\theta^0_n \in \arg\min_{\theta\in\Theta} \hat H_n(D_n, \theta)$. By the assumption that condition (4) holds, $\theta^0$ is the unique minimizer of the continuous function $H(\cdot)$ on the compact set $\Theta$. Therefore, for all $\varepsilon > 0$, there exists a $\zeta(\varepsilon) > 0$ such that for all $\theta \notin B_\varepsilon(\theta^0) = \{\theta : \|\theta - \theta^0\| < \varepsilon\}$ it holds that $H(\theta) - H(\theta^0) > \zeta(\varepsilon)$. Note also that $\Theta \setminus B_\varepsilon(\theta^0)$ is compact. Fix $\varepsilon > 0$ and $\delta > 0$; then by (13) there exists an $n^* \in \mathbb{N}$ such that for all $n \ge n^*$,
$$P_{M^0}\Big(\max_{\theta\in\Theta}\big|\hat H_n(D_n, \theta) - H(\theta)\big| > \tfrac{1}{2}\zeta(\varepsilon)\Big) \le \delta. \qquad (14)$$
Then, for all $n \ge n^*$ it holds that
$$\begin{aligned}
P_{M^0}\big(\|\hat\theta^0_n - \theta^0\| > \varepsilon\big)
&\le P_{M^0}\Big(\min_{\theta\in\Theta\setminus B_\varepsilon(\theta^0)}\big(\hat H_n(D_n, \theta) - \hat H_n(D_n, \theta^0)\big) \le 0\Big)\\
&\le P_{M^0}\Big(\max_{\theta\in\Theta\setminus B_\varepsilon(\theta^0)}\big|\hat H_n(D_n, \theta) - H(\theta)\big| + \big|\hat H_n(D_n, \theta^0) - H(\theta^0)\big| > \zeta(\varepsilon)\Big)\\
&\le P_{M^0}\Big(\big\{\max_{\theta\in\Theta\setminus B_\varepsilon(\theta^0)}\big|\hat H_n(D_n, \theta) - H(\theta)\big| > \tfrac{1}{2}\zeta(\varepsilon)\big\} \cup \big\{\big|\hat H_n(D_n, \theta^0) - H(\theta^0)\big| > \tfrac{1}{2}\zeta(\varepsilon)\big\}\Big)\\
&\le 2\delta.
\end{aligned}$$
Here, we used the following four arguments: Firstly, the event $\{\|\hat\theta^0_n - \theta^0\| > \varepsilon\}$ can only occur if there exists a $\theta \in \Theta \setminus B_\varepsilon(\theta^0)$ such that the event $\{\hat H_n(D_n, \theta) - \hat H_n(D_n, \theta^0) \le 0\}$ occurs. Secondly, $\zeta(\varepsilon)$ is defined such that for all $\theta \in \Theta \setminus B_\varepsilon(\theta^0)$ it holds that $H(\theta) - H(\theta^0) > \zeta(\varepsilon)$. Therefore, if there exists a $\theta$ such that the event $\{\hat H_n(D_n, \theta) - \hat H_n(D_n, \theta^0) \le 0\}$ occurs, then there must exist a $\theta$ such that the event $\{|\hat H_n(D_n, \theta) - H(\theta)| + |\hat H_n(D_n, \theta^0) - H(\theta^0)| > \zeta(\varepsilon)\}$ occurs. Thirdly, for $\{\max_{\theta\in\Theta\setminus B_\varepsilon(\theta^0)}|\hat H_n(D_n, \theta) - H(\theta)| + |\hat H_n(D_n, \theta^0) - H(\theta^0)| > \zeta(\varepsilon)\}$ to occur, either $\{\max_{\theta\in\Theta\setminus B_\varepsilon(\theta^0)}|\hat H_n(D_n, \theta) - H(\theta)| > \tfrac{1}{2}\zeta(\varepsilon)\}$ or $\{|\hat H_n(D_n, \theta^0) - H(\theta^0)| > \tfrac{1}{2}\zeta(\varepsilon)\}$ must occur. Fourthly, we use the union bound together with (14) for the last inequality. Since $\delta > 0$ was arbitrary, we can therefore conclude that $\lim_{n\to\infty} P_{M^0}\big(\|\hat\theta^0_n - \theta^0\| > \varepsilon\big) = 0$.

Finally, consider our estimator $\hat f(\cdot) = \varphi(\cdot)^\top\hat\theta^0_n$ for the causal function $f^0$. By the assumption that $\varphi$ is bounded, we get that
$$\|\hat f - f^0\|_\infty = \sup_{x\in\mathbb{R}^d}\big|\varphi(x)^\top(\hat\theta^0_n - \theta^0)\big| \le \sup_{x\in\mathbb{R}^d}\|\varphi(x)\|\,\|\hat\theta^0_n - \theta^0\| \le M\,\|\hat\theta^0_n - \theta^0\|.$$
Hence we conclude that $\|\hat f - f^0\|_\infty \to 0$ in probability under $P_{M^0}$. Our second claim follows by the fact that under the conditions laid out, we can directly apply Corollary 2.2 from Newey (1991) to obtain uniform convergence in probability and then argue as in the case for $\widehat{\mathrm{HSIC}}$.

A.4. Proof of Theorem 4.1

Proof. Let $f \in \mathcal{F}_{\mathrm{inv}}$ and $i \in I$. Let $G$ be the directed graph induced by the SCM $M^0$. Then, since $Z$ is by construction a source node in $G$ and $i$ an intervention on $Z$ satisfying that $P_{M^0(i)}$ is dominated by $P_{M^0}$, it holds that
$$E_{M^0(i)}\big[\ell(Y - f(X))\big] = E_{M^0(i)}\Big[E_{M^0(i)}\big[\ell(Y - f(X)) \mid Z\big]\Big] = E_{M^0(i)}\Big[E_{M^0}\big[\ell(Y - f(X)) \mid Z\big]\Big]. \qquad (15)$$
Now, since $f \in \mathcal{F}_{\mathrm{inv}}$ it holds that $Z \perp\!\!\!\perp Y - f(X)$ under $P_{M^0}$, which implies that
$$E_{M^0(i)}\Big[E_{M^0}\big[\ell(Y - f(X)) \mid Z\big]\Big] = E_{M^0(i)}\Big[E_{M^0}\big[\ell(Y - f(X))\big]\Big] = E_{M^0}\big[\ell(Y - f(X))\big]. \qquad (16)$$
Since $i \in I$ was arbitrary, combining (15) and (16) directly implies
$$E_{M^0}\big[\ell(Y - f(X))\big] = \sup_{i\in I} E_{M^0(i)}\big[\ell(Y - f(X))\big],$$
which completes the proof of Theorem 4.1.

A.5. Proof of Theorem 4.2

Proof. We consider two optimization problems: (A) minimize $E_{M^0}[\ell(Y - f(X))]$ over all $f \in \mathcal{F}_{\mathrm{inv}}$, and (B) minimize $\sup_{i\in I} E_{M^0(i)}[\ell(Y - f(X))]$ over all $f \in \mathcal{F}$. To prove the result, we fix an arbitrary $\varepsilon > 0$ and find a function $\varphi^* \in \mathcal{F}_{\mathrm{inv}}$ that satisfies
$$E_{M^0}\big[\ell(Y - \varphi^*(X))\big] \le \inf_{f\in\mathcal{F}_{\mathrm{inv}}} E_{M^0}\big[\ell(Y - f(X))\big] + \varepsilon \qquad (17)$$
and
$$\sup_{i\in I} E_{M^0(i)}\big[\ell(Y - \varphi^*(X))\big] \le \inf_{f\in\mathcal{F}} \sup_{i\in I} E_{M^0(i)}\big[\ell(Y - f(X))\big] + \varepsilon. \qquad (18)$$
Since by (15) and (16) in the proof of Theorem 4.1 it holds that $E_{M^0}[\ell(Y - \varphi^*(X))] = \sup_{i\in I} E_{M^0(i)}[\ell(Y - \varphi^*(X))]$, the result we wish to prove follows immediately. The proof now proceeds in two steps: (1) we construct the function $\varphi^*$ using the intervention $i^*$ from the statement; (2) we use that $\varphi^* \in \mathcal{F}_{\mathrm{inv}}$ to conclude the proof.

Step (1): Let $i^* \in I$ be the intervention from the statement of the theorem.
For all $f \in \mathcal{F}$, it holds that
$$\sup_{i\in I} E_{M^0(i)}\big[\ell(Y - f(X))\big] \ge E_{M^0(i^*)}\big[\ell(Y - f(X))\big].$$
As $f \in \mathcal{F}$ was arbitrary, we can take the infimum on both sides and get
$$\inf_{f\in\mathcal{F}} \sup_{i\in I} E_{M^0(i)}\big[\ell(Y - f(X))\big] \ge \inf_{f\in\mathcal{F}} E_{M^0(i^*)}\big[\ell(Y - f(X))\big]. \qquad (19)$$
By the definition of the infimum there exists $\psi \in \mathcal{F}$ satisfying
$$E_{M^0(i^*)}\big[\ell(Y - \psi(X))\big] \le \inf_{f\in\mathcal{F}} \sup_{i\in I} E_{M^0(i)}\big[\ell(Y - f(X))\big] + \varepsilon/2. \qquad (20)$$
[Footnote 5: In the case of squared loss $\ell(\cdot) = (\cdot)^2$, the minimum is attained at the conditional mean $E_{M^0(i^*)}[Y \mid X] = f^0(X) + E_{M^0(i^*)}[h(U, \epsilon_Y) \mid X]$.]

Moreover, setting $\bar\psi := \psi - f^0$ and expanding $Y$ from the structural assignment we get that
$$E_{M^0(i^*)}\big[\ell(Y - \psi(X))\big] = E_{M^0(i^*)}\big[\ell\big(f^0(X) + h(U, \epsilon_Y) - \psi(X)\big)\big] = E_{M^0(i^*)}\big[\ell\big(h(U, \epsilon_Y) - \bar\psi(X)\big)\big].$$
Next, recall that $X_S \perp\!\!\!\perp U \mid X_{S^c}$. Since $\epsilon_Y \perp\!\!\!\perp (X, U)$, it also holds (using the properties of conditional independence) that $X_S \perp\!\!\!\perp (U, \epsilon_Y) \mid X_{S^c}$ under $P_{M^0(i^*)}$. Let $\mu$ denote the common dominating product measure and let $p^*$ denote the density corresponding to $P_{M^0(i^*)}$ with respect to $\mu$. Then, by conditional independence it holds that
$$p^*(x_S, x_{S^c}, u, e) = p^*(x_S \mid x_{S^c})\,p^*(u, e \mid x_{S^c})\,p^*(x_{S^c}).$$
Expressing the expectation as an integral and applying Jensen's inequality once ($\ell$ is convex) we get that
$$\begin{aligned}
E_{M^0(i^*)}\big[\ell\big(h(U, \epsilon_Y) - \bar\psi(X)\big)\big]
&= \int \ell\big(h(u, e) - \bar\psi(x_S, x_{S^c})\big)\,p^*(x_S \mid x_{S^c})\,p^*(u, e \mid x_{S^c})\,p^*(x_{S^c})\,d\mu(u, e, x_S, x_{S^c})\\
&\ge \int \ell\Big(\int \big(h(u, e) - \bar\psi(x_S, x_{S^c})\big)\,p^*(x_S \mid x_{S^c})\,d\mu(x_S)\Big)\,p^*(u, e \mid x_{S^c})\,p^*(x_{S^c})\,d\mu(u, e, x_{S^c})\\
&= \int \ell\Big(h(u, e) - \int \bar\psi(x_S, x_{S^c})\,p^*(x_S \mid x_{S^c})\,d\mu(x_S)\Big)\,p^*(u, e \mid x_{S^c})\,p^*(x_{S^c})\,d\mu(u, e, x_{S^c}).
\end{aligned}$$
Let $\bar\varphi \in \mathcal{F}$ be a function that satisfies for all $x_{S^c} \in \mathrm{supp}\big(P^{X_{S^c}}_{M^0(i^*)}\big)$ that
$$\bar\varphi(x_{S^c}) = \int \bar\psi(x_S, x_{S^c})\,p^*(x_S \mid x_{S^c})\,d\mu(x_S).$$
Such a $\bar\varphi \in \mathcal{F}$ exists because $\mathcal{F}$ consists of all square-integrable functions. We then get that
$$E_{M^0(i^*)}\big[\ell(Y - \psi(X))\big] \ge \int \ell\big(h(u, e) - \bar\varphi(x_{S^c})\big)\,p^*(u, e \mid x_{S^c})\,p^*(x_{S^c})\,d\mu(u, e, x_{S^c}) = E_{M^0(i^*)}\big[\ell\big(h(U, \epsilon_Y) - \bar\varphi(X_{S^c})\big)\big] = E_{M^0(i^*)}\big[\ell\big(Y - (f^0(X) + \bar\varphi(X_{S^c}))\big)\big].$$
Combining this with (20) and defining $\varphi^* \in \mathcal{F}$ for all $x \in \mathbb{R}^d$ by $\varphi^*(x) := f^0(x) + \bar\varphi(x_{S^c})$ leads to
$$E_{M^0(i^*)}\big[\ell(Y - \varphi^*(X))\big] \le \inf_{f\in\mathcal{F}} \sup_{i\in I} E_{M^0(i)}\big[\ell(Y - f(X))\big] + \varepsilon/2. \qquad (21)$$
Moreover, by construction it holds $P_{M^0(i^*)}$-almost surely that
$$Y - \varphi^*(X) = h(U, \epsilon_Y) - \bar\varphi(X_{S^c}). \qquad (22)$$
Since $P_{M^0(i^*)}$ is dominated by $P_{M^0}$ and $\mathrm{supp}(P^X_{M^0(i^*)}) = \mathrm{supp}(P^X_{M^0})$, (22) also holds $P_{M^0}$-almost surely. Finally, using that $X_{S^c}$ are not descendants of $Z$ and $Z$ is a source node, it holds that $Z \perp\!\!\!\perp (X_{S^c}, U, \epsilon_Y)$ under $P_{M^0}$, which implies $Z \perp\!\!\!\perp Y - \varphi^*(X)$ under $P_{M^0}$. Hence $\varphi^* \in \mathcal{F}_{\mathrm{inv}}$.

Step (2): Since $\varphi^* \in \mathcal{F}_{\mathrm{inv}}$ we can use (15) and (16) from the proof of Theorem 4.1 together with (21) to get that
$$\sup_{i\in I} E_{M^0(i)}\big[\ell(Y - \varphi^*(X))\big] = E_{M^0(i^*)}\big[\ell(Y - \varphi^*(X))\big] \le \inf_{f\in\mathcal{F}} \sup_{i\in I} E_{M^0(i)}\big[\ell(Y - f(X))\big] + \varepsilon/2.$$
This proves (18). Next, by the definition of the infimum there exists $f^* \in \mathcal{F}_{\mathrm{inv}}$ satisfying
$$E_{M^0}\big[\ell(Y - f^*(X))\big] \le \inf_{f\in\mathcal{F}_{\mathrm{inv}}} E_{M^0}\big[\ell(Y - f(X))\big] + \varepsilon/2. \qquad (23)$$
Now, assume that $E_{M^0}[\ell(Y - f^*(X))] \le E_{M^0}[\ell(Y - \varphi^*(X))]$; if this were not the case, then (17) would already be true and the proof would be complete. Again applying Theorem 4.1 and defining $c_{\min} := \inf_{f\in\mathcal{F}} \sup_{i\in I} E_{M^0(i)}[\ell(Y - f(X))]$, this implies
$$c_{\min} \le \sup_{i\in I} E_{M^0(i)}\big[\ell(Y - f^*(X))\big] \le \sup_{i\in I} E_{M^0(i)}\big[\ell(Y - \varphi^*(X))\big] \le c_{\min} + \varepsilon/2.$$
This implies that
$$\sup_{i\in I} E_{M^0(i)}\big[\ell(Y - \varphi^*(X))\big] \le \sup_{i\in I} E_{M^0(i)}\big[\ell(Y - f^*(X))\big] + \varepsilon/2 = E_{M^0}\big[\ell(Y - f^*(X))\big] + \varepsilon/2.$$
Combined with (23) this proves (17), which concludes the proof of Theorem 4.2.
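To illustrate the invariance property established in Theorem 4.1 (and its role in the worst-case comparison of Theorem 4.2), the following minimal simulation sketch uses a toy linear SCM of our own choosing (it is not one of the paper's experiments): the causal function has residuals independent of $Z$, so its population loss is unchanged under interventions that rescale $Z$, whereas the loss of the (non-invariant) OLS predictor grows with the intervention strength.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, z_scale):
    """Toy IV SCM (our own example): Z := z_scale * eps_Z, U := eps_U,
    X := Z + U + eps_X, Y := X + U + eps_Y, so the causal function is f0(x) = x.
    Changing z_scale plays the role of an intervention on Z."""
    Z = z_scale * rng.normal(size=n)
    U = rng.normal(size=n)
    X = Z + U + rng.normal(size=n)
    Y = X + U + rng.normal(size=n)
    return X, Y

# Fit OLS (no intercept; all variables are mean zero) on observational data, z_scale = 1.
X, Y = sample(500_000, z_scale=1.0)
b_ols = np.sum(X * Y) / np.sum(X * X)

for z_scale in [1.0, 3.0, 10.0]:
    X, Y = sample(500_000, z_scale=z_scale)
    mse_causal = np.mean((Y - X) ** 2)         # stays approximately 2 for every z_scale
    mse_ols = np.mean((Y - b_ols * X) ** 2)    # grows with the intervention strength
    print(f"z_scale={z_scale:5.1f}  causal={mse_causal:.2f}  ols={mse_ols:.2f}")
```

In this toy model, the OLS predictor has the smaller loss under the observational distribution, but once the interventions on $Z$ are sufficiently strong it is outperformed by the invariant (causal) predictor, in line with the worst-case optimality statement of Theorem 4.2.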
B. Additional details - Conditional IV

We consider the following SCM:
$$W := m(\epsilon_W, V, U), \qquad Z := q(W, V, \epsilon_Z), \qquad X := g(Z, W, U, \epsilon_X), \qquad V := \epsilon_V, \qquad U := \epsilon_U, \qquad Y := f(X) + h(W, U, \epsilon_Y),$$
where $(\epsilon_W, \epsilon_Z, \epsilon_V, \epsilon_U, \epsilon_X, \epsilon_Y) \sim Q$ are jointly independent noise variables, $Z \in \mathbb{R}^r$ are instruments, $V \in \mathbb{R}^{s_1}$ and $U \in \mathbb{R}^{s_2}$ are unobserved variables, $W \in \mathbb{R}^t$ are observed covariates, $X \in \mathbb{R}^d$ are predictors and $Y \in \mathbb{R}$ is a response. [Footnote 6: The results can be extended to other SCMs, too; we only show one for simplicity. E.g., we could also allow for the edge between $Z$ and $W$ to point towards $W$, that is, for $Z$ to be a cause of $W$.] For simplicity, we additionally assume that $E[W] = E[Z] = E[X] = E[h(W, U, \epsilon_Y)] = 0$. We consider models of the form $M = (m, q, g, f, h)$ and assume that either there is no edge from $V$ to $W$, i.e., $W = m(\epsilon_W, U)$, or no edge between $W$ and $U$, i.e., $W = m(\epsilon_W, V)$. As before, $M^0 = (m^0, q^0, g^0, f^0, h^0)$ denotes the data-generating model and $f^0(\cdot) = \varphi(\cdot)^\top\theta^0$ for some $\theta^0 \in \Theta \subseteq \mathbb{R}^p$, with a (known) function $\varphi : \mathbb{R}^d \to \mathbb{R}^p$.

Due to the unobserved confounding, it may be the case that $Y - f^0(X)$ and $Z$ are dependent and, as a result, restriction (3) may not hold for the true model $M^0$. But as we assume that there is either no confounding between $Z$ and $W$ or between $W$ and $Y$, we can instead use the conditional independence restriction
$$Z \perp\!\!\!\perp Y - \varphi(X)^\top\theta \mid W \qquad (24)$$
to identify $f^0$. A corresponding sufficient and necessary identifiability condition for $f^0$ is
$$Z \perp\!\!\!\perp h(W, U, \epsilon_Y) + \varphi(X)^\top\tau \mid W \quad\Longrightarrow\quad \tau = 0.$$
Under this condition we could proceed as for the unconditional independence restriction and estimate $f^0$ using a loss that is minimized if restriction (7) is satisfied, e.g., using a conditional independence measure.

We can avoid using a conditional independence restriction in the following two special cases. First, consider the case in which $W \perp\!\!\!\perp V$ and, in addition, assume $q(W, \epsilon_Z) = q_1(W) + \epsilon_Z$. Then we can use the following independence restriction:
$$Z - q_1(W) \perp\!\!\!\perp Y - \varphi(X)^\top\theta. \qquad (25)$$
Under our assumptions, $q^0_1(W) = E[Z \mid W]$ and therefore $q^0_1$ is identifiable. The corresponding identifiability condition becomes
$$Z - q_1(W) \perp\!\!\!\perp h(W, U, \epsilon_Y) + \varphi(X)^\top\tau \quad\Longrightarrow\quad \tau = 0. \qquad (26)$$
Second, consider the case in which we instead assume that $W \perp\!\!\!\perp U$ and that $h(W, U, \epsilon_Y) = h_1(W) + h_2(U, \epsilon_Y)$. Then we can use the independence restriction
$$Y - \varphi(X)^\top\theta - \varphi(W)^\top\gamma \perp\!\!\!\perp (Z, W), \qquad (27)$$
where we assume that $h^0_1(\cdot) = \varphi(\cdot)^\top\gamma^0$ for $\gamma^0 \in \mathbb{R}^t$. [Footnote 7: For simplicity, we use the same basis $\varphi$.] The identifiability condition becomes
$$(Z, W) \perp\!\!\!\perp h_2(U, \epsilon_Y) + \varphi(X)^\top\tau_1 + \varphi(W)^\top\tau_2 \quad\Longrightarrow\quad \tau_1 = 0. \qquad (28)$$
This identifiability condition translates to a corresponding moment restriction and, as in the case of unconditional IV, we can again show that the independence version leads to strictly stronger identifiability.

Proposition B.1 (CIV - Identifiability of the independence restriction is strictly stronger). Consider a conditional IV model $M^0$ and assume that $\varphi(\cdot) = (\varphi_1(\cdot), \dots, \varphi_p(\cdot))$ is a collection of basis functions such that $f^0(\cdot) = \varphi(\cdot)^\top\theta^0$ for some $\theta^0 \in \mathbb{R}^p$ and $h^0_1(\cdot) = \varphi(\cdot)^\top\gamma^0$ for some $\gamma^0 \in \mathbb{R}^t$. Then, if there exist $k \in \mathbb{N}$ and $\eta : \mathbb{R}^{r+t} \to \mathbb{R}^k$ such that $\theta^0$ is identifiable from the moment restriction
$$E\big[\eta(Z, W)\big(Y - \varphi(X)^\top\theta - \varphi(W)^\top\gamma\big)\big] = 0,$$
then $\theta^0$ is also identifiable from the independence restriction (27). Furthermore, there exists a conditional IV model such that $\theta^0$ is identifiable from the independence restriction (27) but not from the corresponding moment restriction.
An example for that is the model
$$V := \epsilon_V, \qquad U := \epsilon_U, \qquad W := V + \epsilon_W, \qquad Z := W + V + \epsilon_Z, \qquad X := ZU + \epsilon_X, \qquad Y := X + W + U + \epsilon_Y,$$
where $\epsilon_Z, \epsilon_V, \epsilon_U, \epsilon_W, \epsilon_X$ and $\epsilon_Y$ are jointly independent standard Gaussian variables and we consider the basis functions $\varphi(X) = (X)$ and $\varphi(W) = (W)$.

Proof. For the first claim, consider a $\theta^0$ that is not identifiable from restriction (27). Then there exists a $\tau_1 \neq 0$ for which condition (28) is violated. By the independence statement for $\tau_1$, it holds for all $k \in \mathbb{N}$, all functions $\eta : \mathbb{R}^{r+t} \to \mathbb{R}^k$ and some $\tau_2 \in \mathbb{R}$ that
$$\mathrm{Cov}\big[\eta(Z, W),\; h_2(U, \epsilon_Y) + \varphi(X)^\top\tau_1 + \varphi(W)^\top\tau_2\big] = 0.$$
But as $(Z, W) \perp\!\!\!\perp h_2(U, \epsilon_Y)$ it also holds that $\mathrm{Cov}[\eta(Z, W), h_2(U, \epsilon_Y)] = 0$. By the additivity of the covariance it follows that
$$\mathrm{Cov}\big[\eta(Z, W),\; \varphi(X)^\top\tau_1 + \varphi(W)^\top\tau_2\big] = 0,$$
that is, the identifiability condition corresponding to the moment restriction is violated as well.

For the second claim, note that in our example SCM, $E[X \mid (Z, W)] = 0$. It follows that for all $k \in \mathbb{N}$ and functions $\eta : \mathbb{R}^{r+t} \to \mathbb{R}^k$, $\mathrm{Cov}[\eta(Z, W), X] = 0$. We can therefore not uniquely identify the causal function $f^0(x) = x$ with the moment restriction. We now consider condition (28). Consider
$$K_\tau := U + \epsilon_Y + \tau_1(ZU + \epsilon_X) + \tau_2 W.$$
It holds that
$$\begin{aligned}
E[Z^2 K_\tau^2] &= E\big[Z^2(U + \epsilon_Y + \tau_1(ZU + \epsilon_X) + \tau_2 W)^2\big] = E[Z^2U^2] + E[Z^2\epsilon_Y^2] + \tau_1^2 E[Z^4U^2] + \tau_1^2 E[Z^2\epsilon_X^2] + \tau_2^2 E[Z^2W^2] = 12 + 114\tau_1^2 + 30\tau_2^2,\\
E[Z^2]\,E[K_\tau^2] &= E[Z^2]\,E\big[(U + \epsilon_Y + \tau_1(ZU + \epsilon_X) + \tau_2 W)^2\big] = E[Z^2]\big(E[U^2] + E[\epsilon_Y^2] + \tau_1^2 E[Z^2U^2] + \tau_1^2 E[\epsilon_X^2] + \tau_2^2 E[W^2]\big) = 6(2 + 7\tau_1^2 + 2\tau_2^2) = 12 + 42\tau_1^2 + 12\tau_2^2.
\end{aligned}$$
These two terms are only equal if $72\tau_1^2 + 18\tau_2^2 = 0$, which requires that $\tau_1 = 0$. Therefore, condition (28) holds, meaning we can identify the causal function $f^0(x) = x$.

We now propose two procedures, one for each of the identifiability conditions we discussed. We conjecture that combining both leads to a doubly robust estimator. For restriction (25), we propose the following estimation procedure. Given an i.i.d. sample $(x_i, y_i, z_i, w_i)_{i=1}^n$ of the variables $(X, Y, Z, W)$, we can first use any flexible nonparametric estimator to estimate $q^0_1$. Based on our estimate $\hat q_1$, we then learn
$$\hat f \in \arg\min_{f\in\mathcal{F}} \widehat{\mathrm{HSIC}}\big((r^f_i, z^{\hat q_1}_i)_{i=1}^n;\, k_R, k_Z\big)$$
as our estimate for $f^0$, where $r^f_i = y_i - f(x_i)$ and $z^{\hat q_1}_i = z_i - \hat q_1(w_i)$, and $\mathcal{F}$ is a function class. Here, $\widehat{\mathrm{HSIC}}$ is the same empirical HSIC estimator as in Section 3.

For restriction (27), we propose the following estimation procedure. Given an i.i.d. sample $(x_i, y_i, z_i, w_i)_{i=1}^n$ of the variables $(X, Y, Z, W)$, we estimate $f^0$ as the first entry of
$$\arg\min_{f\in\mathcal{F},\, k\in\mathcal{K}} \widehat{\mathrm{HSIC}}\big((r^{f,k}_i, (z_i, w_i))_{i=1}^n;\, k_R, k_{Z,W}\big),$$
where $r^{f,k}_i = y_i - f(x_i) - k(w_i)$, and $\mathcal{F}$ and $\mathcal{K}$ are function classes. Here, $\widehat{\mathrm{HSIC}}$ is again the same empirical HSIC estimator as in Section 3.

C. Algorithms: HSIC-X and HSIC-X-pen

We provide details for HSIC-X in Algorithm 1 and for HSIC-X-pen in Algorithm 2. We propose to choose the tuning parameter $\lambda$ for HSIC-X-pen as the largest possible value for which an HSIC-based independence test between the estimated residuals and the instruments is not rejected (see Section 3.1).
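Both algorithms below repeatedly evaluate and differentiate the empirical HSIC loss $L(\theta) = \mathrm{tr}(KHLH)$ on a mini-batch. The following minimal PyTorch sketch of this loss is our own illustration (it is not the released implementation): it assumes Gaussian kernels with fixed, user-supplied bandwidths (in the algorithms the bandwidths are chosen by the median heuristic) and omits any constant normalization factor, which does not affect the minimizer.

```python
import torch

def gaussian_gram(v: torch.Tensor, bandwidth: float) -> torch.Tensor:
    """Gram matrix G_ij = exp(-||v_i - v_j||^2 / (2 * bandwidth^2)) for v of shape (n, d)."""
    sq_dists = torch.cdist(v, v) ** 2
    return torch.exp(-sq_dists / (2.0 * bandwidth ** 2))

def hsic_loss(residuals: torch.Tensor, instruments: torch.Tensor,
              bw_r: float, bw_z: float) -> torch.Tensor:
    """(Biased, unnormalized) empirical HSIC tr(K H L H) between residuals of shape (n, 1)
    and instruments of shape (n, d_Z); differentiable, so it can be minimized by SGD."""
    n = residuals.shape[0]
    K = gaussian_gram(residuals, bw_r)
    L = gaussian_gram(instruments, bw_z)
    H = torch.eye(n, device=residuals.device) - torch.ones(n, n, device=residuals.device) / n
    return torch.trace(K @ H @ L @ H)

# Hypothetical usage for one gradient step of a linear model f_theta(x) = x @ theta:
#   r = (y - x @ theta).unsqueeze(1)          # residuals, shape (n, 1)
#   loss = hsic_loss(r, z, bw_r=1.0, bw_z=1.0)
#   loss.backward()                           # gradient w.r.t. theta, used in the update step
```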
Algorithm 1: HSIC-X

Input: observations $(x_i, y_i, z_i)_{i=1}^n$, kernels $k_R$ and $k_Z$, batch size $m$, learning rate $\gamma$, number of gradient steps per cycle $k$, significance level $\alpha$ and maximum number of restarts $t$
  Initialize a restart counter $\ell \leftarrow 0$
  Compute the bandwidth $\sigma_Z$ of $k_Z$ with the median heuristic
  repeat
    if $\ell = 0$ then
      Initialize parameters $\theta$ at the OLS solution
    else
      Randomly initialize parameters $\theta$
    end if
    repeat
      Compute the residuals $r^\theta_i := y_i - f_\theta(x_i)$
      Compute the bandwidth $\sigma_{R^\theta}$ of $k_R$ with the median heuristic
      repeat $k$ times
        Sample a mini-batch $(x_j, y_j, z_j)_{j=1}^m$
        Compute the residuals $r^\theta_j := y_j - f_\theta(x_j)$
        Compute the HSIC loss $L(\theta) := \mathrm{tr}(KHLH)$
        Update parameters $\theta \leftarrow \theta - \gamma\nabla L(\theta)$
      end
    until convergence
    Compute the residuals $r^\theta_i := y_i - f_\theta(x_i)$
    Compute the p-value $p$ of the independence test between $(r^\theta_i)_{i=1}^n$ and $(z_i)_{i=1}^n$
    Update the counter $\ell \leftarrow \ell + 1$
  until $p > \alpha$ or $\ell \ge t$
Output: final estimate $\hat f_\theta(\cdot) := f_\theta(\cdot) + \frac{1}{n}\sum_{i=1}^n \big(y_i - f_\theta(x_i)\big)$

Algorithm 2: HSIC-X-pen

Input: observations $(x_i, y_i, z_i)_{i=1}^n$, kernels $k_R$ and $k_Z$, batch size $m$, learning rate $\gamma$, number of gradient steps per cycle $k$, significance level $\alpha$, maximum number of restarts $t$, penalization parameter $\lambda$ and prediction loss $\ell$
  Initialize a restart counter $l \leftarrow 0$
  Compute the bandwidth $\sigma_Z$ of $k_Z$ with the median heuristic
  repeat
    if $l = 0$ then
      Initialize parameters $\theta$ at the OLS solution
    else
      Randomly initialize parameters $\theta$
    end if
    repeat
      Compute the residuals $r^\theta_i := y_i - f_\theta(x_i)$
      Compute the bandwidth $\sigma_R$ of $k_R$ with the median heuristic
      repeat $k$ times
        Sample a mini-batch $(x_j, y_j, z_j)_{j=1}^m$
        Compute the residuals $r^\theta_j := y_j - f_\theta(x_j)$
        Compute the loss $L(\theta) := \lambda \sum_j \ell(r^\theta_j) + (1 - \lambda)\,\mathrm{tr}(KHLH)$
        Update parameters $\theta \leftarrow \theta - \gamma\nabla L(\theta)$
      end
    until convergence
    Compute the residuals $r^\theta_i := y_i - f_\theta(x_i)$
    Compute the p-value $p$ of the independence test between $(r^\theta_i)_{i=1}^n$ and $(z_i)_{i=1}^n$
    Update the counter $l \leftarrow l + 1$
  until $p > \alpha$ or $l \ge t$
Output: final estimate $\hat f_\theta(\cdot) := f_\theta(\cdot) + \frac{1}{n}\sum_{i=1}^n \big(y_i - f_\theta(x_i)\big)$

D. Additional Experiment Details and Results

D.1. Multi-dimensional Setting

We consider the following IV models in our experiments:
$$Z := \epsilon_Z, \qquad U := \epsilon_U, \qquad X := J Z^2 \epsilon_X + BZ + U, \qquad Y := \beta^\top X - 2U + \epsilon_Y,$$
where $\epsilon_U, \epsilon_Y \overset{\text{i.i.d.}}{\sim} N(0, 1)$, $\epsilon_Z \sim N(0, I_{d_Z})$ and $\epsilon_X \sim N(0, I_{d_X})$ are independent noise variables, $d_X$ and $d_Z$ denote the dimensions of $X$ and $Z$, $J$ and $B$ are $d_X \times d_Z$ matrices controlling the influence of the instruments $Z$, and $\beta$ is the causal parameter. In the experiment, $J$ is a $d_X \times d_Z$ all-ones matrix, $B_{i,j} \sim \mathrm{Uniform}(-4, 4)$ and $\beta \sim N(0, I_{d_X})$.

Figure 4 reports the MSE of each method as the dimension $d_Z$ of the instruments varies for a fixed dimension $d_X$ of the predictors. When $d_Z < d_X$, the identifiability result suggests that 2SLS cannot consistently estimate the causal function (which is reflected by the high values of the MSE), while HSIC-X can still gain some improvement over the OLS solution. When $d_Z \ge d_X$, the performance of our method is on par with that of 2SLS.

D.2. Known Basis Functions

Figure 5 shows some of the functions estimated by HSIC-X, OLS and 2SLS, along with the underlying causal function, under various settings. In short, when the instrument has no effect on the mean of $X$ ($\alpha = 0$), 2SLS fails to produce sensible estimates because of the non-identifiability under the moment restriction. On the other hand, the proposed method (HSIC-X) can still identify the causal function thanks to the independence restriction and yields reasonable estimates in all of the settings.

[Figure 4 (plot): MSE against the instrument dimension $d_Z \in \{1, \dots, 5\}$ for the methods 2SLS, OLS, Oracle and HSIC-X, with one panel for $d_X = 3$ and one for $d_X = 5$; see the caption below.]

Figure 4.
MSEs of different estimators under varying size $d_Z$ of the instruments when (a) the predictors' dimension $d_X$ is 3, and (b) the predictors' dimension is 5. 2SLS is inconsistent and underperforms OLS when $d_Z < d_X$, while HSIC-X shows a substantial improvement over OLS in such settings.

D.3. Distribution Generalization

In Section 5.2, we consider both linear and nonlinear underlying causal functions. The linear function is defined as $f^0_{\mathrm{lin}}(X_1, X_2) := X_1 + X_2$, and the nonlinear function is defined as $f^0_{\mathrm{nonlin}}(X) := \sum_{j=1}^{10} w_j\, e^{-\frac{1}{3}\|X - c_j\|_2^2}$, where the $c_j$ are drawn from a uniform distribution over a two-dimensional grid on $[-5, 5] \times [-5, 5]$ and $w_1, \dots, w_{10} \overset{\text{i.i.d.}}{\sim} N(0, 4)$. We employ the correct-basis and neural-network function classes in our method and the baselines. For the correct basis, we consider $\mathcal{F} := \{f(\cdot) = \varphi(\cdot)^\top\theta \mid \theta \in \mathbb{R}^p\}$ with $p = 3$ and $\varphi(x) = [1, x_1, x_2]$ in the linear case, and $p = 11$ and $\varphi(x) = [1, e^{-\frac{1}{3}\|x - c_1\|_2^2}, \dots, e^{-\frac{1}{3}\|x - c_{10}\|_2^2}]$ in the nonlinear case. For the neural network function class, we use a neural network with one hidden layer of size 64. We optimize our model using the Adam optimizer with a learning rate of 0.005 and a batch size of 256, and use the R package Anchor Regression (https://github.com/simzim96/Anchor Regression) for the Anchor Regression baseline. Lastly, the tuning parameter $\gamma$ of Anchor Regression is set to 100, and the tuning parameter $\lambda$ for HSIC-X-pen is chosen as the largest possible value for which an HSIC-based independence test between the estimated residuals and the instruments is not rejected (see Section 3.1).

[Figure 5 (plots): eight panels of estimated functions, (a) $f^0$: Linear, $P_{\epsilon_Z}$: Binary, $\alpha = 0$; (b) $f^0$: Linear, $P_{\epsilon_Z}$: Binary, $\alpha = 0.4$; (c) $f^0$: Linear, $P_{\epsilon_Z}$: Gaussian, $\alpha = 0$; (d) $f^0$: Linear, $P_{\epsilon_Z}$: Gaussian, $\alpha = 0.4$; (e) $f^0$: Nonlinear, $P_{\epsilon_Z}$: Binary, $\alpha = 0$; (f) $f^0$: Nonlinear, $P_{\epsilon_Z}$: Binary, $\alpha = 1.0$; (g) $f^0$: Nonlinear, $P_{\epsilon_Z}$: Gaussian, $\alpha = 0$; (h) $f^0$: Nonlinear, $P_{\epsilon_Z}$: Gaussian, $\alpha = 1.0$.]

Figure 5. Estimated causal functions with different estimators under varying $\alpha$ values from the experiment in Section 5.1 (the known basis functions setting). The legend reads as follows: "y_mse" represents OLS, "y_hsic" represents HSIC-X, "y_2sls" represents 2SLS and "f_x" represents the underlying causal function. Each line represents an estimated function from one simulation run. The grey dots are the observations drawn from the corresponding IV model.
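For concreteness, the following minimal sketch (our own illustration, not the authors' code) generates a random draw of the nonlinear causal function $f^0_{\mathrm{nonlin}}$ and the corresponding "correct basis" described above; the resolution of the grid from which the centers $c_j$ are drawn is our assumption, since only the range $[-5, 5] \times [-5, 5]$ is specified.

```python
import numpy as np

rng = np.random.default_rng(0)

# Centers c_j drawn uniformly from a two-dimensional grid on [-5, 5] x [-5, 5]
# (an 11 x 11 grid is an assumption) and weights w_j ~ N(0, 4), i.e. standard deviation 2.
grid = np.linspace(-5.0, 5.0, 11)
centers = np.stack([rng.choice(grid, size=10), rng.choice(grid, size=10)], axis=1)  # (10, 2)
weights = rng.normal(loc=0.0, scale=2.0, size=10)

def f0_nonlin(x: np.ndarray) -> np.ndarray:
    """f0(x) = sum_j w_j * exp(-||x - c_j||^2 / 3) for inputs x of shape (n, 2)."""
    sq_dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)  # (n, 10)
    return (weights * np.exp(-sq_dists / 3.0)).sum(axis=1)

def correct_basis(x: np.ndarray) -> np.ndarray:
    """The 11-dimensional basis [1, exp(-||x - c_1||^2 / 3), ..., exp(-||x - c_10||^2 / 3)]."""
    sq_dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.concatenate([np.ones((x.shape[0], 1)), np.exp(-sq_dists / 3.0)], axis=1)

x = rng.uniform(-5.0, 5.0, size=(4, 2))
print(f0_nonlin(x))            # values of the causal function at four test points
print(correct_basis(x).shape)  # (4, 11)
```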