# Exploiting Independent Instruments: Identification and Distribution Generalization

Sorawit Saengkyongam¹, Leonard Henckel¹, Niklas Pfister¹, Jonas Peters¹

¹Department of Mathematical Sciences, University of Copenhagen, Denmark. Correspondence to: Sorawit Saengkyongam. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract. Instrumental variable models allow us to identify a causal function between covariates X and a response Y, even in the presence of unobserved confounding. Most of the existing estimators assume that the error term in the response Y and the hidden confounders are uncorrelated with the instruments Z. This is often motivated by a graphical separation, an argument that also justifies independence. Positing an independence restriction, however, leads to strictly stronger identifiability results. We connect to the existing literature in econometrics and provide a practical method called HSIC-X for exploiting independence that can be combined with any gradient-based learning procedure. We see that even in identifiable settings, taking into account higher moments may yield better finite sample results. Furthermore, we exploit the independence for distribution generalization. We prove that the proposed estimator is invariant to distributional shifts on the instruments and worst-case optimal whenever these shifts are sufficiently strong. These results hold even in the under-identified case where the instruments are not sufficiently rich to identify the causal function.

1. Introduction

When estimating the causal function between a vector of covariates X and a response Y in the presence of unobserved confounding, standard regression procedures such as ordinary least squares (OLS) are biased, even asymptotically. Instrumental variable approaches (Wright, 1928; Imbens & Angrist, 1994; Newey, 2013) exploit the existence of exogenous heterogeneity in the form of an instrumental variable (IV) Z and estimate, under suitable conditions, the causal function consistently. Importantly, the errors in Y and the hidden confounders should be uncorrelated with the instruments Z. Usually, this has to be argued for with background knowledge. Often this is done by assuming that the data generating process follows a structural causal model (SCM) (Pearl, 2009; Bongers et al., 2021), so that the distribution is Markov with respect to the induced graph, and that Y and U are d-separated from Z in the graph obtained by removing the edge from X to Y (this is the case for the DAG at the beginning of Section 2, for example). In particular, this requires an argument that Z does not cause Y directly but only via X. This argument, however, implies not only that the errors in Y and U are uncorrelated with Z but also that they are independent of Z. This independence comes with several benefits. For example, even in settings where the causal function can be identified by classical approaches based on uncorrelatedness, it has been observed that the independence can be exploited to construct estimators that achieve the semiparametric efficiency bound, at least when the error distribution comes from a known, parametric family (Hansen et al., 2010).
Furthermore, the independence restriction is stronger than uncorrelatedness and therefore yields stronger identifiability results, as has been reported in the field of econometrics (e.g., Imbens & Newey, 2009; Chesher, 2003). For example, even binary instruments may be able to identify nonlinear effects (Dunker et al., 2014; Torgovitsky, 2015; Loh, 2019). In this work, we investigate the independence restriction in more detail: we add to existing identifiability results, provide methods for exploiting the restriction for finite data, and analyse the implications for distribution generalization. More precisely, in Section 2 we discuss the identifiability conditions in a general and simple form, list some of the existing identifiability results that they imply, add novel results to this list, and extend the framework to conditional IV.

We also provide a practical method that exploits the above independence for estimating causal effects from data. It relies on the Hilbert-Schmidt independence criterion (HSIC) (Gretton et al., 2008), which has become a widely used tool for testing independence in a joint distribution. Equipped with a characteristic kernel (such as the Gaussian kernel), HSIC is nonnegative and equals zero if and only if the considered distribution factorizes. We propose an easy-to-use method, called HSIC-X (X for exogenous), that minimizes HSIC directly. The underlying problem is non-convex but can be tackled using widely used methods from stochastic optimization. While theoretical guarantees are hard to obtain, it has recently been shown empirically that reliable optimization of HSIC seems possible, at least when considering the different problem of minimizing the dependence between residuals and predictors in a regression problem (Greenfeld & Shalit, 2020; Mooij et al., 2009). Furthermore, the optimization can be initialized at informative starting points, such as the two-stage least squares (2SLS) solution, or even the OLS solution. HSIC-X can be combined with any machine learning method for nonparametric regression that is optimized by (stochastic) gradient descent. The details of our method are described in Section 3.

Furthermore, the independence restriction can be exploited for distribution generalization. In Section 4, we construct an estimator HSIC-X-pen (pen for penalization) that is worst-case optimal in nonlinear settings under distributional shifts corresponding to interventions on Z. This is particularly interesting in under-identified settings, where the causal function cannot be identified from the data. Our work thereby adds to an increasing literature connecting distributional robustness and causal inference (e.g., Schölkopf et al., 2012; Rojas-Carulla et al., 2018; Magliacane et al., 2018; Rothenhäusler et al., 2021; Arjovsky et al., 2019; Christiansen et al., 2021; Pfister et al., 2021; Yuan et al., 2021; Krueger et al., 2021; Creager et al., 2021). As for HSIC-X, the estimator HSIC-X-pen is modular in that it can be used with any machine learning method for nonparametric regression.

Experiments on both simulated and real-world data in Section 5 confirm that HSIC-X can exploit the improved identifiability guarantees and can be more efficient in finite samples if wrong solutions yield both second and higher order dependencies between the residuals and the instruments. The code for all the experiments is available at https://github.com/sorawitj/HSIC-X. All proofs are provided in Appendix A.
1.1. Further Related Work

Independence between Z and $Y - f(X)$ can also be characterized differently; e.g., it is equivalent to $E[\eta(Z)\psi(Y - f(X))] = E[\eta(Z)]\,E[\psi(Y - f(X))]$ for all η and ψ in a sufficiently rich class (such as bounded continuous functions). This suggests estimating the causal function by a generalized method of moments (GMM) approach. Indeed, Poirier (2017) focuses on the derivation of optimal weights for such estimating equations and the asymptotic behaviour of the estimator. Dunker et al. (2014); Dunker (2021) phrase the problem as an inverse problem and derive convergence rates for the iteratively regularized Gauss-Newton method. The above works focus on theoretical advancements (e.g., the experiments are restricted to univariate settings). Considering the independence of residuals has also been suggested by Peters et al. (2016, Section 5), but no practical method was provided. Powerful regression techniques and machine learning methods to measure dependence in IV settings exist (Hartford et al., 2017; Singh et al., 2019; Bennett et al., 2019; Muandet et al., 2020) but, to the best of our knowledge, none of these methods exploit the independence restriction. In summary, exploiting independence does not seem to have played a major role in practice. Arguably, one of the reasons is that independence restrictions are difficult to work with in theory and practice. E.g., choosing the class of functions ψ is non-trivial.

2. Identifiability from Higher Order Moments

Unless stated otherwise, we consider the following SCM
$$Z := \epsilon_Z, \qquad U := \epsilon_U, \qquad X := g(Z, U, \epsilon_X), \qquad Y := f(X) + h(U, \epsilon_Y)$$
(the corresponding DAG, referred to in Section 1, has edges $Z \to X$, $U \to X$, $U \to Y$ and $X \to Y$), where $(\epsilon_Z, \epsilon_U, \epsilon_X, \epsilon_Y) \sim Q$ are jointly independent noise variables, $Z \in \mathbb{R}^r$ are instruments, $U \in \mathbb{R}^q$ are unobserved variables, $X \in \mathbb{R}^d$ are predictors and $Y \in \mathbb{R}$ is a response. For simplicity, we additionally assume that $E[\epsilon_Z] = E[X] = E[h(U, \epsilon_Y)] = 0$. We call a collection $M = (f, g, h, Q)$ an IV model, with $f \in \mathcal{F} \subseteq \{f \mid f : \mathbb{R}^d \to \mathbb{R}\}$ and $\mathcal{F}$ a pre-specified function class. The collection of all IV models of this form is denoted by $\mathcal{M}$. Any IV model $M \in \mathcal{M}$ induces a distribution $P_M$ over the observed variables $(X, Y, Z)$. We denote the data generating IV model by $M^0 = (f^0, g^0, h^0, Q^0)$ (parts of the data are unobserved). We refer to the function $f^0$ as the causal function. We can use it to compute the causal effect of any treatment contrast, such as the average treatment effect $f^0(1) - f^0(0)$ in case of a binary X. We assume that the causal function only depends on X, i.e., is homogeneous, and thus we do not need to distinguish between local and global treatment effects (e.g., Imbens & Angrist, 1994). Unless stated otherwise, we also assume that $f^0$ can be parameterized such that $f^0(\cdot) = \varphi(\cdot)^\top \theta^0$ for some $\theta^0 \in \Theta \subseteq \mathbb{R}^p$, with a (known) function $\varphi : \mathbb{R}^d \to \mathbb{R}^p$, and, for simplicity, we assume that $\theta \neq \theta^0$ implies $\varphi(\cdot)^\top\theta \neq \varphi(\cdot)^\top\theta^0$. In practice the true basis φ does not need to be known and can be approximated by a sufficiently flexible function approximator (e.g., neural networks), see Section 3.

Let us assume that we observe n i.i.d. observations from $(X, Y, Z) \sim P_{M^0}$. In this paper, we consider two problems: estimating the causal function $f^0$ and predicting the response Y under distributional shifts on the variables Z (see Section 4). In order to solve either of these tasks, we make use of (parts of) the causal function $f^0$. We therefore require that it is uniquely determined by the observed distribution $P_{M^0}$ (we relax this condition in Section 4). This is often termed identifiability of the causal function.
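To make this setup concrete, the following minimal sketch (not part of the original paper; it assumes only NumPy, and the particular structural equations and parameter values are illustrative) simulates one hypothetical instance of the IV model above with a linear causal function. It illustrates two facts used throughout: OLS of Y on X is biased by the hidden confounder U, whereas the residual at the true causal function, $Y - f^0(X) = h(U, \epsilon_Y)$, is independent of Z by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# One hypothetical instance of an IV model M = (f, g, h, Q) from Section 2:
# Z := eps_Z, U := eps_U, X := Z + U + eps_X, Y := f0(X) + U + eps_Y with f0(x) = 2x.
eps_Z, eps_U, eps_X, eps_Y = rng.standard_normal((4, n))
Z = eps_Z
U = eps_U
X = Z + U + eps_X
Y = 2.0 * X + U + eps_Y

# OLS of Y on X is biased: it converges to 2 + Cov(U, X)/Var(X) = 2 + 1/3.
theta_ols = np.cov(X, Y)[0, 1] / np.var(X)
print(f"OLS estimate: {theta_ols:.3f}  (true causal coefficient: 2.0)")

# The residual at the causal function is independent of Z (empirically: near-zero
# correlation with Z, and near-zero correlation between Z^2 and the squared residual),
# whereas the OLS residual is correlated with Z.
res_causal = Y - 2.0 * X
res_ols = Y - theta_ols * X
for name, r in [("causal", res_causal), ("OLS   ", res_ols)]:
    print(name, np.corrcoef(Z, r)[0, 1].round(3), np.corrcoef(Z**2, r**2)[0, 1].round(3))
```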
In a classical IV approach, identification of $f^0$ (or, equivalently, $\theta^0$) is based on the moment restriction
$$E\big[\eta(Z)\big(Y - \varphi(X)^\top\theta\big)\big] = 0, \qquad (1)$$
where $\eta : \mathbb{R}^r \to \mathbb{R}^k$ is a known function. Under the IV model $M^0$ this condition is satisfied for $\theta^0$. A sufficient and necessary condition for identifiability based on (1) is given by the (moment) identifiability condition
$$E\big[\eta(Z)\varphi(X)^\top\big]\,\tau = 0 \;\implies\; \tau = 0. \qquad (2)$$
This condition is sometimes called the rank condition because it is equivalent to the matrix $E[\eta(Z)\varphi(X)^\top]$ having rank p, which, in particular, implies that $k \geq p$ (e.g., Wooldridge, 2010). The moment restriction (1) aims to detect mean shifts of the residuals when varying the value of Z. For example, (1) can only identify $\theta^0$ if there is no $i \in \{1, \dots, p\}$ for which $E[\varphi_i(X) \mid Z] = 0$ almost surely (otherwise the function $x \mapsto f^0(x) + \varphi_i(x)$ solves (1), too). But even if there are no mean shifts and (1) is not powerful enough to identify $f^0$, we may still be able to identify $f^0$ from a stronger condition. To this end, we consider the independence restriction
$$Z \perp\!\!\!\perp Y - \varphi(X)^\top\theta. \qquad (3)$$
In this paper, we consider identifying $f^0$ based on (3), rather than (1). Under the IV model $M^0$ described above, this condition is satisfied for the true parameter $\theta^0$. A necessary and sufficient condition for identifiability using (3) is
$$Z \perp\!\!\!\perp h(U, \epsilon_Y) + \varphi(X)^\top\tau \;\implies\; \tau = 0. \qquad (4)$$
This, in particular, implies that for all $i \in \{1, \dots, p\}$ we have $Z \not\perp\!\!\!\perp h(U, \epsilon_Y) + \varphi_i(X)$ (since otherwise $\tau = e_i$ would violate the above condition). If condition (4) is violated, condition (2) is violated, too, but the implication does not hold in the other direction. As a result, the independence restriction yields strictly stronger identifiability results (see also Table 1).

Proposition 2.1 (Identifiability based on independence). Consider an IV model $M^0$ and assume $f^0(\cdot) = \varphi(\cdot)^\top\theta^0$ for some $\theta^0 \in \mathbb{R}^p$. Then, the following statements hold.
(i) If $\theta^0$ is identifiable from the moment restriction (1), it is also identifiable from the independence restriction (3).
(ii) There exist IV models such that $\theta^0$ is identifiable from the independence restriction (3) but not from (1).
(iii) In particular, there are examples of the following type that satisfy the conditions of (ii).
(a) Nonadditive Z:
$$Z := \epsilon_Z, \qquad U := \epsilon_U, \qquad X := ZU + \epsilon_X, \qquad Y := X + U + \epsilon_Y, \qquad (5)$$
with $\mathcal{F} = \{f \mid f(x) = \theta x\}$, where $(\epsilon_Z, \epsilon_U, \epsilon_X, \epsilon_Y)$ are jointly independent standard Gaussian variables.
(b) Binary Z: $Z \in \{0, 1\}$, $X \in \mathbb{R}$, $p > 2$ (i.e., $\mathcal{F}$ contains nonlinear functions).
(c) Independent Z: $Z \perp\!\!\!\perp X$.

Statements (i) and (ii) are known (e.g., Imbens & Newey, 2009; Chesher, 2003) but for completeness we nevertheless include their proofs in Appendix A. Intuitively, the causal function from the SCM (5) in (a) is not identifiable from the moment restriction because the instrument (or any transformation η(Z)) does not correlate with the mean of X, i.e., $E[\eta(Z)X] = E[\eta(Z)\,E[X \mid Z]] = 0$. Therefore, for any $\gamma \in \mathbb{R}$, the shifted causal function $f(x) := f^0(x) + \gamma x$, for $x \in \mathbb{R}$, also satisfies the moment restriction. In the proof of Proposition 2.1, we argue that identifiability using (2) is impossible for an example of type (b). It may come as a surprise that it is indeed possible to identify nonlinear functions even if the instrument Z has a discrete support with small cardinality. This observation has been reported, e.g., by Dunker et al. (2014); Torgovitsky (2015); Loh (2019).
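The failure of the moment restriction in example (a) and the extra signal carried by higher moments can be seen in a few lines of simulation. The sketch below (illustrative only, not from the paper; it assumes NumPy) checks empirically that $E[\eta(Z)X] \approx 0$ for two choices of η, while the squared residual of $Y - \theta X$ is clearly correlated with $Z^2$ for every θ except the causal value $\theta^0 = 1$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Example (a) of Proposition 2.1, SCM (5): Z, U, eps_X, eps_Y i.i.d. N(0,1),
# X := Z*U + eps_X, Y := X + U + eps_Y, so f0(x) = x (theta0 = 1).
Z, U, eps_X, eps_Y = rng.standard_normal((4, n))
X = Z * U + eps_X
Y = X + U + eps_Y

# The moment restriction carries no signal: E[eta(Z) X] = 0, checked here
# for eta(z) = z and eta(z) = z**3.
print("E[Z   X] ~", np.mean(Z * X).round(3))
print("E[Z^3 X] ~", np.mean(Z**3 * X).round(3))

# Higher moments do carry signal: for theta != 1 the residual Y - theta*X is
# dependent on Z, visible in the correlation between Z^2 and the squared residual.
for theta in [0.0, 0.5, 1.0, 1.5]:
    r = Y - theta * X
    print(f"theta={theta:3.1f}  corr(Z^2, residual^2) = {np.corrcoef(Z**2, r**2)[0, 1]:+.3f}")
```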
Even in cases where the causal function is identifiable from both (4) and (2), it may be beneficial to consider the independence restriction. This is, for example, the case when the effect of the instrument can be seen both in the conditional mean of φ(X) given Z and in higher moments. In our simulation experiments in Section 5, we see that taking into account dependencies in higher moments may yield better statistical performance.

Condition (4) depends on the unknown $h(U, \epsilon_Y)$; at first glance, one might believe that the hidden term can be dropped from condition (4), but this is not the case.

Proposition 2.2. Consider the IV model $M^0$ and assume that $\varphi(\cdot) = (\varphi_1(\cdot), \dots, \varphi_p(\cdot))$ is a collection of basis functions such that $f^0(\cdot) = \varphi(\cdot)^\top\theta^0$ for some $\theta^0 \in \mathbb{R}^p$. Then, in general, condition (4) does not imply
$$Z \perp\!\!\!\perp \varphi(X)^\top\tau \;\implies\; \tau = 0. \qquad (6)$$
Furthermore, (6) does not imply (4), either.

Table 1. Overview of some of the identifiability results, described in Section 2.

| \|supp(Z)\| | Z acts on mean of X | Z acts on higher orders of X | identif. with (1) possible | identif. with (3) possible | Comments |
|---|---|---|---|---|---|
| ∞ | yes | yes/no | yes | yes | gain efficiency with (3), cf. Sec. 5.1 |
| ∞ | no | yes | no | yes | cf. Prop. 2.1-iii-a |
| m < ∞ | yes | yes/no | m parameters | yes | identif. m par. with (1), gain efficiency & identif. with (3), cf. Prop. 2.1-iii-b, Sec. 5.1 |
| m < ∞ | no | yes | no | yes | cf. example in proof of Prop. 2.2 |

Identifiability condition (2) depends only on the joint distribution of (Z, X) (which can be estimated from data), but we do not consider this an advantage: in practice, it is usually desirable to consider empirical relaxations of the identifiability conditions and output the set of θ's that (approximately) satisfy the empirical version of (1) or (3), respectively. For example, one can invert statistical tests to construct confidence sets, see Section 3.4.

2.1. Conditional Instrumental Variables

In some applications, such as the one we consider in Section 5.3, restrictions (1) and (3) may be violated for the causal parameter, e.g., because there are confounding variables W between the instruments Z and the response Y. Under certain assumptions, however, the framework of conditional IV (CIV) (Frölich, 2007; Newey, 2013) still allows us to identify the causal function $f^0$ from the observed distribution. Both the identifiability point of view and the methodology we develop in Section 3 can be extended to CIV. More details are provided in Appendix B. Consider the following SCM
$$W := m(\epsilon_W, V, U), \quad Z := q(W, V, \epsilon_Z), \quad X := g(Z, W, U, \epsilon_X), \quad V := \epsilon_V, \quad U := \epsilon_U, \quad Y := f(X) + h(W, U, \epsilon_Y),$$
where $(\epsilon_W, \epsilon_Z, \epsilon_V, \epsilon_U, \epsilon_X, \epsilon_Y)$ are jointly independent and $W \in \mathbb{R}^t$ are observed covariates. We assume that there is no edge from V to W or no edge from U to W, i.e., either Z and W or W and Y are not confounded. Due to the unobserved confounding, (3) may not hold but, as we assume that there is either no confounding between Z and W or between W and Y, we can instead use the conditional independence restriction
$$Z \perp\!\!\!\perp Y - \varphi(X)^\top\theta \mid W \qquad (7)$$
to identify $f^0$. A corresponding sufficient and necessary identifiability condition for $f^0$ is
$$Z \perp\!\!\!\perp h(W, U, \epsilon_Y) + \varphi(X)^\top\tau \mid W \;\implies\; \tau = 0. \qquad (8)$$
We can estimate $f^0$ using a loss that is minimized if restriction (7) is satisfied, e.g., using a conditional independence measure (e.g., Fukumizu et al., 2008; Zhang et al., 2011; Berrett & Samworth, 2019; Shah & Peters, 2020). In Appendix B we discuss cases in which we can avoid using a conditional independence restriction.

3. Independence-based IV with HSIC

Starting from the independence restriction (3), our goal is to find a function $\hat f$ such that the residuals $R^{\hat f} := Y - \hat f(X)$ are independent of the instruments Z.
In the identifiable case, that is, if condition (4) is satisfied, only the causal function $f^0$ achieves independence. Thus, given an i.i.d. sample $(x_i, y_i, z_i)_{i=1}^n$ of the variables (X, Y, Z), our method aims to find a function $\hat f$ that minimizes the dependence between the residuals $(r_i^{\hat f})_{i=1}^n$, with $r_i^{\hat f} := y_i - \hat f(x_i)$, and the instruments $(z_i)_{i=1}^n$. In this work, we measure dependence using HSIC (Gretton et al., 2008). When using characteristic kernels (Fukumizu et al., 2008), this measure equals zero if and only if the considered joint distribution factorizes. HSIC has been used for optimization problems before (e.g., Greenfeld & Shalit, 2020; Mooij et al., 2009) and satisfies the conditions used to prove consistency (see Section 3.3), but other choices of independence measures are possible, too. Specifically, we consider the following learning problem:
$$\hat f := \arg\min_{f \in \mathcal{F}} \widehat{\mathrm{HSIC}}\big((r_i^f, z_i)_{i=1}^n;\; k_{R^f}, k_Z\big), \qquad (9)$$
where $\widehat{\mathrm{HSIC}}((r_i^f, z_i)_{i=1}^n; k_{R^f}, k_Z) := \tfrac{1}{n^2}\operatorname{tr}(KHLH)$ is a consistent estimator of $\mathrm{HSIC}((R^f, Z); k_{R^f}, k_Z)$ (Gretton et al., 2008); here, $K_{ij} = k_{R^f}(r_i^f, r_j^f)$ and $L_{ij} = k_Z(z_i, z_j)$ are the kernel matrices for the residuals $R^f$ and the instruments Z with positive definite kernels $k_{R^f}$ and $k_Z$, respectively, and $H_{ij} := \delta_{ij} - \tfrac{1}{n}$ is the centering matrix. We call the estimator in (9) HSIC-X.

The independence restriction (3) allows us to learn the causal function $f^0$ only up to a bias term, namely an additive constant (for any $\alpha \in \mathbb{R}$, $\mathrm{HSIC}((Y - f^0(X), Z); k_{R^{f^0}}, k_Z)$ and $\mathrm{HSIC}((Y - \alpha - f^0(X), Z); k_{R^{f^0}}, k_Z)$ are identical). Nonetheless, we can correct for this bias by using the zero-mean assumption on the noise term $h(U, \epsilon_Y)$. The final estimate is then obtained as $\bar f(\cdot) := \hat f(\cdot) + \tfrac{1}{n}\sum_{i=1}^n \big(y_i - \hat f(x_i)\big)$.

3.1. Regularizing towards Predictive Functions

In many practical applications the identifiability condition (4) (or (2) for classical IV) is satisfied, but many parameters approximately solve the empirical version of (3). This is the case, for example, if the influence of the instruments is weak, which usually yields subpar finite sample properties. Furthermore, classical estimators like 2SLS are known to only have moments up to the degree of over-identification (e.g., Mariano, 2001). To stabilize the estimation it has been proposed to regularize towards a predictive function, such as the OLS in linear settings; see, e.g., the K-class estimators (Theil, 1958; Jakobsen & Peters, 2022), which contain the OLS, 2SLS, the FULLER (Fuller, 1977) and LIML estimators (Anderson & Rubin, 1949) as special cases. We propose an analogous regularization for our estimator and call this variant HSIC-X-pen. More specifically, for a convex loss function $\ell : \mathbb{R} \to \mathbb{R}$ we modify the optimization problem (9) as follows,
$$\hat f^\lambda := \arg\min_{f \in \mathcal{F}} \;\widehat{\mathrm{HSIC}}\big((r_i^f, z_i)_{i=1}^n;\; k_{R^f}, k_Z\big) + \lambda \sum_{i=1}^n \ell\big(y_i - f(x_i)\big), \qquad (10)$$
where $\lambda \in [0, \infty)$ is a tuning parameter. Unlike in the linear settings described above (e.g., Fuller, 1977), deriving a data-driven choice of the tuning parameter λ is non-trivial. We propose to select λ following a procedure analogous to the one described in Jakobsen & Peters (2022): we select the largest possible value of λ for which an HSIC-based independence test (e.g., Gretton et al., 2008; Pfister et al., 2017) between $R^{\hat f^\lambda}$ and Z is not rejected. As discussed in Section 4, HSIC-X-pen can be understood in relation to distribution generalization, too. There, one starts from the objective of optimizing the predictive loss and adds the HSIC term as a penalty that regularizes the predictor to guard against distributional shifts.
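Before turning to the implementation details, here is a minimal, self-contained sketch of the objective in (9) and its penalized variant (10), assuming PyTorch. It is illustrative only and not the authors' implementation (which is available at the repository linked above); the Gaussian-kernel bandwidth is set by a simple median heuristic and treated as a constant during backpropagation, anticipating the choices described in the next subsection.

```python
import torch

def gaussian_kernel(x: torch.Tensor) -> torch.Tensor:
    """Gaussian kernel matrix with bandwidth set by a median heuristic.

    x has shape (n, d); the bandwidth is recomputed from the current values
    and detached, so it acts as a constant during backpropagation.
    """
    d2 = torch.cdist(x, x).pow(2)                    # squared pairwise distances
    med = d2.detach()[d2.detach() > 0].median()      # median heuristic
    return torch.exp(-d2 / (2 * med + 1e-12))

def hsic_biased(residuals: torch.Tensor, instruments: torch.Tensor) -> torch.Tensor:
    """Biased HSIC estimate tr(KHLH)/n^2 between residuals and instruments."""
    n = residuals.shape[0]
    K = gaussian_kernel(residuals.reshape(n, 1))
    L = gaussian_kernel(instruments.reshape(n, -1))
    H = torch.eye(n) - torch.full((n, n), 1.0 / n)   # centering matrix
    return torch.trace(K @ H @ L @ H) / n**2

# Example: one gradient step on an HSIC-X-pen type objective (10) for a linear f_theta,
# started at zero for brevity; lam = 0 recovers the plain HSIC-X objective (9).
x = torch.randn(256, 1)
z = torch.randn(256, 1)
y = 2.0 * x.squeeze() + torch.randn(256)
theta = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([theta], lr=0.01)

lam = 0.0
res = y - (x @ theta)
loss = hsic_biased(res, z) + lam * (res**2).mean()
loss.backward()
opt.step()
```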
3.2. Algorithm and Implementation Details

We now specify the details of HSIC-X. To solve (9), we fix any parametric function class $\mathcal{F} := \{f_\theta(\cdot) \mid \theta \in \Theta \subseteq \mathbb{R}^p\}$ (e.g., a linear combination of some basis functions or a neural network) and optimize the parameters θ by a gradient-based optimization method. We choose the Gaussian kernel $k_R$ (e.g., Schölkopf & Smola, 2002) for the residuals, and the discrete or Gaussian kernel $k_Z$ for the instruments depending on whether Z is discrete or continuous, respectively. The bandwidth parameter σ of the Gaussian kernel is chosen by the median heuristic (e.g., Sriperumbudur et al., 2009) and is recomputed during the optimization process (as the residuals change at each iteration). Since the optimization problem (9) is generally non-convex, the resulting parameter estimates may not be the globally optimal solution. We alleviate this problem by introducing the following restarting heuristic. Let $\hat\theta \in \Theta$ be a solution to the optimization problem. We conduct an independence test between the resulting residuals $(r_i^{\hat\theta} := y_i - f_{\hat\theta}(x_i))_{i=1}^n$ and the instruments $(z_i)_{i=1}^n$ using HSIC with the gamma approximation (Gretton et al., 2008). We accept the parameters $\hat\theta$ if the test is not rejected; otherwise we randomly re-initialize the parameters and restart the optimization. In the spirit of (10), we initialize the parameters in the first trial at the OLS solution. Algorithm 1 in Appendix C illustrates the whole optimization procedure with the standard gradient descent update. The gradient step can be replaced with other gradient-based optimization algorithms such as Adam (Kingma & Ba, 2015) or Adagrad (Duchi et al., 2011); in all of our experiments, we used Adam. The algorithm for HSIC-X-pen is also provided in Appendix C.

3.3. Consistency

We now prove consistency of the proposed approach in that the minimizer of (9) converges (in probability) to the causal function as the sample size increases.¹

Theorem 3.1. Consider the IV model $M^0$, assume that $f^0(\cdot) = \varphi(\cdot)^\top\theta^0$ for some bounded function $\varphi(\cdot)$ and some $\theta^0 \in \Theta$, with $\Theta \subseteq \mathbb{R}^p$ being compact, and assume that the identifiability condition (4) holds. Consider the function class $\mathcal{F} := \{f(\cdot) = \varphi(\cdot)^\top\theta \mid \theta \in \Theta\}$ and fixed bounded, continuously differentiable, characteristic kernels $k_Z$ and $k_R$ with bounded derivatives. Then, the estimator $\hat f$ defined in (9) is consistent, i.e., $\|\hat f - f^0\|_\infty \overset{P_{M^0}}{\longrightarrow} 0$. The same statement holds if in (9) we replace $\mathrm{HSIC}$ and $\widehat{\mathrm{HSIC}}$ by any independence measure $H(P_{M^0}, \theta)$ and its estimate $\hat H_n(D_n, \theta)$ (based on data $D_n$), such that (i) $H(P_{M^0}, \theta) = 0$ if and only if $(Y - \varphi(X)^\top\theta) \perp\!\!\!\perp Z$, (ii) for all $\theta \in \Theta$, we have $\hat H_n(D_n, \theta) \to H(P_{M^0}, \theta)$ in probability, and (iii) both $H(P_{M^0}, \theta)$ and $\hat H_n(D_n, \theta)$ for all n are Lipschitz continuous in θ with the same constant L.

¹There is a slight mismatch between Theorem 3.1 and the described algorithm: in practice, the kernel bandwidth is not fixed but is chosen according to the median heuristic. This difference could be accounted for by sample splitting.
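Before turning to confidence regions, the following sketch (again illustrative, not the authors' Algorithm 1 from Appendix C) puts the pieces of Section 3.2 together for a linear-in-features model $f_\theta(x) = x^\top\theta$ (pass basis-expanded features for the nonlinear case): OLS initialization, Adam updates on the HSIC objective, and the restarting heuristic. It reuses `hsic_biased` from the previous sketch and, for simplicity, uses a plain permutation test in place of the HSIC test with gamma approximation; all names and default values are assumptions.

```python
import torch

def hsic_perm_pvalue(res: torch.Tensor, z: torch.Tensor, n_perm: int = 200) -> float:
    """Permutation p-value for H0: residuals independent of instruments.

    Stand-in for the gamma-approximation HSIC test mentioned in Section 3.2;
    hsic_biased is the function from the previous sketch.
    """
    with torch.no_grad():
        stat = hsic_biased(res, z).item()
        null = [hsic_biased(res[torch.randperm(len(res))], z).item() for _ in range(n_perm)]
    return (1 + sum(s >= stat for s in null)) / (1 + n_perm)

def fit_hsic_x(x, y, z, n_steps: int = 2000, max_restarts: int = 5, alpha: float = 0.05):
    """Gradient-based HSIC minimisation with the restart heuristic of Section 3.2."""
    # First trial: initialise at the OLS solution; later trials: random restarts.
    theta_init = torch.linalg.lstsq(x, y.unsqueeze(1)).solution.squeeze(1)
    for trial in range(max_restarts):
        theta = (theta_init if trial == 0 else torch.randn_like(theta_init)).clone().requires_grad_(True)
        opt = torch.optim.Adam([theta], lr=0.01)
        for _ in range(n_steps):
            opt.zero_grad()
            loss = hsic_biased(y - x @ theta, z)
            loss.backward()
            opt.step()
        # Accept if the residuals pass the independence test, otherwise restart.
        if hsic_perm_pvalue(y - x @ theta, z) > alpha:
            break
    # Bias correction using the zero-mean assumption on the noise term.
    with torch.no_grad():
        intercept = (y - x @ theta).mean()
    return theta.detach(), intercept
```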
3.4. Confidence Regions

Suppose $T_n : \Theta \times \mathbb{R}^{r \times n} \to \{0, 1\}$ tests, for a given parameter $\theta \in \Theta$, for independence between the residuals $R^\theta := Y - f_\theta(X)$ and the instruments Z, based on the n i.i.d. observations $(X_i, Y_i, Z_i)_{i=1}^n$. Denote by $\Theta_0 := \{\theta \mid R^\theta \perp\!\!\!\perp Z\}$ the set of θ's satisfying the null hypothesis of independence. We say $T_n$ has pointwise asymptotic level α if $\sup_{\theta \in \Theta_0} \lim_{n \to \infty} P\big(T_n(\theta, (Z_i)_{i=1}^n) = 1\big) \leq \alpha$ (here, $T_n = 1$ corresponds to rejecting independence). The test has uniform asymptotic level if lim and sup in the definition can be exchanged and the statement still holds; it has finite sample level if it holds for all n when removing the lim. If $T_n$ has finite sample level, then, by construction, $\hat C_n := \{\theta \in \Theta \mid T_n(\theta, (Z_i)_{i=1}^n) = 0\}$ is a $(1-\alpha)$-confidence region for $\theta^0$:
$$P(\theta^0 \in \hat C_n) = P\big(T_n(\theta^0, (Z_i)_{i=1}^n) = 0\big) \geq 1 - \alpha.$$
Similarly for pointwise (uniform) asymptotic level. Because the independence restriction is stronger than the moment restriction (see Proposition 2.1), we can use any existing test for (1), such as the Anderson-Rubin test (Anderson & Rubin, 1949), to construct such confidence regions, too. Alternatively, we can use HSIC to construct an independence test, which we then invert. For a fixed kernel, tests can be constructed either by permutation-based procedures or by approximating the distribution of $\widehat{\mathrm{HSIC}}$ under the null hypothesis, e.g., by a gamma distribution (Gretton et al., 2008). For many tests, $\hat C_n$ has to be approximated.

4. Distribution Generalization

Here, we follow a line of work which connects distribution generalization with causality (see Section 1). In this framework, distributional shifts are modeled as interventions. We propose using HSIC-X-pen, introduced in Section 3.1, for this task and prove that it is worst-case optimal under interventions on the exogenous variables, even if the model is nonlinear and the causal function is not identifiable. To formalize the result, we again assume that the data is generated by the IV model $M^0$. Furthermore, in this section, we consider the function classes $\mathcal{F} := L^2(\mathbb{R}^d, P^X_{M^0})$, consisting of all functions $f^* : \mathbb{R}^d \to \mathbb{R}$ that satisfy $E_{M^0}[f^*(X)^2] < \infty$, and
$$\mathcal{F}_{\mathrm{inv}} := \big\{f^* \in \mathcal{F} \mid Z \perp\!\!\!\perp Y - f^*(X) \text{ under } P_{M^0}\big\}.$$
We model a distributional shift as an intervention on the exogenous Z, denoted by i; it consists of replacing the distribution of Z with a new distribution. The intervened model is denoted by $M^0(i)$ and is again an IV model. In this setting, we now prove that for any predictor $f \in \mathcal{F}$ satisfying the independence restriction $Z \perp\!\!\!\perp Y - f(X)$, that is, $f \in \mathcal{F}_{\mathrm{inv}}$, the expected loss is invariant to interventions on Z in the following sense.

Theorem 4.1 (Invariance with respect to interventions on Z). Let $\ell : \mathbb{R} \to \mathbb{R}$ be a convex loss function and I be a set of interventions on Z satisfying for all $i \in I$ that $P_{M^0(i)}$ is dominated by $P_{M^0}$. Then, for all $f \in \mathcal{F}_{\mathrm{inv}}$ it holds that
$$E_{M^0}\big[\ell(Y - f(X))\big] = \sup_{i \in I} E_{M^0(i)}\big[\ell(Y - f(X))\big].$$

We consider this setting relevant as identifiability (that is, $|\mathcal{F}_{\mathrm{inv}}| = 1$) is not achievable in most modern machine learning applications. The assumption that the intervened distributions are dominated by the observed distribution ensures that none of the interventions on Z extend the support of Z. Generalizing to distributions that extend the support is only possible under additional extrapolation assumptions that require the functions $g^0$ and $f^0$ in the IV model $M^0$ to be partially identifiable (Christiansen et al., 2021). In this sense it is not possible to strengthen our result without making additional assumptions on the identifiability of the IV model. Theorem 4.1 motivates estimating a predictor based on
$$\arg\min_{f \in \mathcal{F}_{\mathrm{inv}}} E_{M^0}\big[\ell(Y - f(X))\big], \qquad (11)$$
where $\ell : \mathbb{R} \to \mathbb{R}$ is a convex loss function. Our proposed HSIC-X-pen estimator from (10) for λ approaching 0 provides a flexible way of estimating the minimizer of (11). By Theorem 4.1, it is guaranteed to control the test error under any distributional shift generated by an intervention on Z which does not extend the support.
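Theorem 4.1 can be checked numerically. Because Z is a source node, an intervention that replaces its distribution by one with the same support only reweights Z; the intervened loss therefore equals an importance-weighted training loss, and for a predictor in $\mathcal{F}_{\mathrm{inv}}$ this weighting has no effect. The sketch below (illustrative only; the model, the intervention and all numbers are assumptions, not taken from the paper) contrasts the causal predictor with OLS in a small binary-instrument example, assuming NumPy.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Training model: binary instrument, confounded X, causal function f0(x) = x.
Z = rng.binomial(1, 0.5, n)
U, eps_X, eps_Y = rng.standard_normal((3, n))
X = (1 + 2 * Z) * eps_X + U
Y = X + U + eps_Y

theta_ols = np.cov(X, Y)[0, 1] / np.var(X)

# An intervention on Z that keeps its support (Bernoulli(0.9) instead of Bernoulli(0.5))
# is dominated by the training distribution, so the intervened loss equals a
# reweighted training loss with weights w(z) = p_i(z) / p_0(z).
w = np.where(Z == 1, 0.9 / 0.5, 0.1 / 0.5)

# The causal predictor is in F_inv, so its loss is unchanged; the OLS loss shifts.
for name, theta in [("causal (invariant)", 1.0), ("OLS", theta_ols)]:
    res2 = (Y - theta * X) ** 2
    print(f"{name:18s} train loss: {res2.mean():.3f}   loss under intervention: {np.mean(w * res2):.3f}")
```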
Moreover, we now prove that for a sufficiently rich class of distribution shifts, a predictor solving (11) on the training data is worst-case optimal. To formalize our result, define the intervened subset $S \subseteq \{1, \dots, d\}$ consisting of all $X_j$ that are descendants of Z in the causal graph. When $|S| < d$, this may contain cases in which identifiability according to (3) is impossible.²

Theorem 4.2 (Generalization to interventions on Z). Let $\ell : \mathbb{R} \to \mathbb{R}$ be a convex loss function and I be a set of interventions on Z satisfying for all $i \in I$ that $P_{M^0(i)}$ is dominated by $P_{M^0}$, which itself is absolutely continuous with respect to a product measure. If there exists $i^* \in I$ such that $X_S \perp\!\!\!\perp U \mid X_{S^c}$ under $P_{M^0(i^*)}$ and $\mathrm{supp}(P^X_{M^0(i^*)}) = \mathrm{supp}(P^X_{M^0})$, then
$$\inf_{f \in \mathcal{F}_{\mathrm{inv}}} E_{M^0}\big[\ell(Y - f(X))\big] = \inf_{f \in \mathcal{F}} \sup_{i \in I} E_{M^0(i)}\big[\ell(Y - f(X))\big].$$

The intervention $i^*$ can be called partially confounding-removing (see also Christiansen et al., 2021) in the sense that in the generated distribution $P_{M^0(i^*)}$ the variables $X_S$ affected by the exogenous Z are, conditioned on $X_{S^c}$, no longer confounded with Y via U (if Z acts additively, this implies that, in general, U does not act on both Y and $X_S$). As shown in the proof, this intervention results in the worst-case loss. An example of this type of intervention is given in Section 5.2. If $\ell(\cdot) = (\cdot)^2$ is the squared loss, the minimizer is attained at the conditional mean of Y given X under $P_{M^0(i^*)}$, that is, $f^0(x) + E_{M^0(i^*)}[h(U, \epsilon_Y) \mid X = x]$ (see proof of Theorem 4.2).

²For example, if $Y = f^0(X_1, X_2) + U + \epsilon_Y$, $X_1 = U$ and $X_2 = Z + U$, then the causal function is not identifiable from (3).

5. Experiments

5.1. Simulation: Instrumental Variable Estimation

We first evaluate the empirical performance of HSIC-X and HSIC-X-pen for estimating causal functions. To this end, we use the following IV models in our experiments:
$$M(\alpha, P_{\epsilon_Z}, f^0): \qquad Z := \epsilon_Z, \quad U := \epsilon_U, \quad X := Z\epsilon_X + \alpha Z + U, \quad Y := f^0(X) - 4U + \epsilon_Y, \qquad (12)$$
where $\epsilon_U, \epsilon_X, \epsilon_Y \overset{\mathrm{i.i.d.}}{\sim} N(0, 1)$, $\epsilon_Z \sim P_{\epsilon_Z}$ and $\epsilon_Z \perp\!\!\!\perp (\epsilon_U, \epsilon_X, \epsilon_Y)$. We consider different experiment settings by varying the three parameters α, $P_{\epsilon_Z}$ and $f^0$. The parameter α adjusts how the instruments Z influence the predictor X: the instruments Z only change the variance of X when α = 0, while both the mean and the variance of X are changed by Z when α > 0. For $P_{\epsilon_Z}$, we consider binary and Gaussian random variables. Lastly, for $f^0$, we consider a linear function $f^0_{\mathrm{lin}}(X) := 2X$ and a nonlinear function $f^0_{\mathrm{nonlin}}(X) := 1.5X - 0.2X^2 + \sum_{j=1}^{10} w_j e^{-(X - c_j)^2}$, where the $c_j$ represent a partition of $[-7, 7]$ into 10 equally spaced intervals and $w_1, \dots, w_{10} \overset{\mathrm{i.i.d.}}{\sim} N(0, 2)$. We generate 1000 observations from the IV model (12) and evaluate the performance of our methods under different settings. In all the experiments, we use Adam as the optimizer with the learning rate set to 0.01 and a batch size of 256.

Known basis functions. We first assume that the underlying function class containing the causal function is known; that is, we consider $\mathcal{F} := \{f(\cdot) = \varphi(\cdot)^\top\theta \mid \theta \in \mathbb{R}^p\}$ with p = 1 and $\varphi(x) = x$ in the linear case, and p = 12 and $\varphi(x) = [x, x^2, e^{-(x - c_1)^2}, \dots, e^{-(x - c_{10})^2}]$ in the nonlinear case ($c_1, \dots, c_{10}$ are the same as described above). We compare the performance of HSIC-X with the following baseline methods:
- 2SLS: this solves the moment restriction (1); we use the feature map φ defined above as the function η in the moment restriction.
- OLS: least squares regression of Y on X (using the basis functions φ or neural networks).
- Oracle: least squares regression of Y on X using non-confounded data (the confounder U is removed from the assignment of X).
- HSIC-Oracle: same as HSIC-X(-pen) but initialized at the true causal parameters; this is to investigate how much our method suffers from the non-convexity of the objective function.

The last two methods serve as oracle benchmarks. The performance of each method is measured by the integrated mean squared error (MSE) $E[(\hat f(X) - f^0(X))^2]$ between the estimate $\hat f$ and the causal function $f^0$, approximated using a test sample of size 10000. In all experiments, we report an average of the MSE values over 10 simulations.

Figure 1. MSEs of different estimators when the correct basis functions are used (four panels: $f^0$ = linear/non-linear, Z = Gaussian/binary; methods: 2SLS, OLS, Oracle, HSIC-X, HSIC-Oracle). Each point represents an average over 10 simulations and the error bar indicates its 95% confidence interval. 2SLS is inconsistent and underperforms OLS when α = 0, and for all α in the bottom-right setting, while HSIC-X yields a substantial improvement over OLS in such settings. Under weak instruments (small α), we observe an efficiency gain of HSIC-X over 2SLS.

Figure 1 reports the MSE as we increase the parameter α. Our method shows a significant improvement over 2SLS in almost all settings. The improvement is especially prominent in the nonlinear-binary setting (see the bottom-right panel of Figure 1), where the moment identifiability condition (2) is not satisfied (see Proposition 2.1). We still see an improvement even in the identifiable cases (when α > 0), which suggests a finite sample efficiency gain from using the independence restriction. Lastly, the performance of HSIC-X is on par with that of HSIC-Oracle, indicating that the optimization objective is reasonably well-behaved despite its non-convexity (only in the nonlinear/binary case, there is a slight deviation). We illustrate the estimated functions against the true causal function in Appendix D.2.

Approximate functions. We now consider a more flexible function class to approximate the causal function $f^0_{\mathrm{nonlin}}$ by using a neural network (NN). For a fixed width and depth, $f^0_{\mathrm{nonlin}}$ may not lie in the function class represented by the NN. Nonetheless, we expect our method to produce a reasonable estimate of the causal function. We generate 1000 observations from the IV model (12) with $f^0_{\mathrm{nonlin}}$. In addition to the OLS and Oracle baselines, we compare our method to Deep GMM (Bennett et al., 2019) and Deep IV (Hartford et al., 2017) using their publicly available implementations. To investigate the effect of the MSE regularization (see Section 3.1), we add a variant of our method (HSIC-X-pen) where the MSE regularization is employed. A neural network with one hidden layer of size 64 is used in our methods and all the baselines.

Figure 2. MSEs of different estimators when the causal function is approximated by a neural network (methods: Deep GMM, Deep IV, OLS, Oracle, HSIC-X, HSIC-X-pen; panels by instrument distribution, e.g., Z = Gaussian). HSIC-X shows a substantial gain over the baselines in all settings. HSIC-X-pen does not show a major improvement over HSIC-X.

Figure 2 outlines the simulation results. In short, both HSIC-X and HSIC-X-pen outperform the Deep GMM, Deep IV and OLS baselines and approach the Oracle's performance as the instrument strength (α) increases.
We speculate that the inferior performance of Deep GMM and Deep IV may be explained by their sole reliance on the moment restriction (1) as opposed to the full independence restriction (3). Lastly, HSIC-X-pen does not show a major improvement over HSIC-X.

Multi-dimensional setting. We also investigate the effect of the dimensionality of X and Z on our estimators. The exact experiment setup and detailed results are provided in Appendix D.1. The results suggest that for linear models with higher order effects of Z, HSIC-X outperforms 2SLS, in particular when the dimension of Z is strictly smaller than that of X.

5.2. Simulation: Distribution Generalization

To empirically verify the theoretical generalization guarantees from Section 4, we consider, for $i \in \{0.5, 1, \dots, 3.5, 3.99\}$, the collection of IV models
$$Z := \epsilon^i, \quad U_1 := \epsilon_{U_1}, \quad U_2 := \epsilon_{U_2}, \quad X_2 := U_2 + \epsilon_{X_2}, \quad X_1 := U_1 \mathbb{1}(Z \leq 3.5) + 0.1Z + 2Z\epsilon_{X_1}, \quad Y := f^0(X_1, X_2) + U_1 + U_2,$$
where $\epsilon_{U_1}, \epsilon_{U_2}, \epsilon_{X_1}, \epsilon_{X_2}$ are i.i.d. $N(0, 1)$, $\epsilon^i = W_1\mathbb{1}(K = 0) + W_2\mathbb{1}(K = 1)$ with $K \sim \mathrm{Ber}(i/4)$, $W_1 \sim \mathrm{Unif}(0, i)$, $W_2 \sim \mathrm{Unif}(i, 4)$, and $\epsilon^i \perp\!\!\!\perp (\epsilon_{U_1}, \epsilon_{U_2}, \epsilon_{X_1}, \epsilon_{X_2})$. The parameter i determines how much the distribution of Z, with support (0, 4), is skewed towards the right. In particular, for $i \to 4$ the intervention removes the confounding effect of $U_1$, see Theorem 4.2. We use M(0.5) to generate 3000 observations from the training distribution and use it to fit HSIC-X-pen, OLS and Anchor regression (AR) (Rothenhäusler et al., 2021), which comes with generalization guarantees, too, but does not cover the higher-moments influence of Z as in the SCM above. As in Section 5.1, we consider both linear and nonlinear functions for $f^0$, and employ the correct basis (Known Basis) and neural network (NN) function classes in our method and the baselines; see Appendix D.3 for details. We evaluate the fitted estimators on shifted test distributions and add a causal baseline (Causal) in which the true causal function is used as a predictor. This serves as an oracle baseline as, in this setting, the causal function is not identifiable.

Figure 3. Predictive performance of different predictors on shifted test distributions as the intervention strength (i) increases (panels: $f^0$ = linear/non-linear; methods: AR, OLS, Causal, HSIC-X-pen; function classes: Known Basis and NN). Causal is the most stable predictor but conservative, while OLS is markedly affected by the interventions. Due to the higher-moments influence from Z, HSIC-X outperforms AR in most cases and establishes the best trade-off between invariance and predictiveness.

The results are shown in Figure 3. Our method (HSIC-X-pen) and Causal are significantly more robust to the interventions compared to OLS and AR, with the causal oracle being the most invariant predictor, as described by Theorem 4.1. However, the causal oracle is conservative (it ignores the information from the hidden confounder $U_2$ that is unaffected by the interventions and is helpful in predicting Y) and yields subpar performance when the interventions i are small. Our method utilizes the invariant information from $U_2$ and yields the best trade-off among all candidates.

5.3. Real-world Application

We apply HSIC-X to estimate the causal effect of education on earnings using 3010 observations from the 1979 National Longitudinal Survey of Young Men (Card, 1995).
The response variable Y is the logarithm of wage, X is years of education, and the (discrete) instrument Z is geographic proximity to colleges (whether an individual grew up near a four-year college). Card (1995) also considered several conditioning variables W, including years of experience, race and geographic information, and studied a linear causal effect of education on earnings, so we use $\mathcal{F} = \{f \mid f(x) = \theta x\}$ and apply the conditional IV estimator described in Appendix B. In addition, we obtain a confidence region as described in Section 3.4. We then compare our results with those by 2SLS, which was used in the original study of Card (1995), and confidence intervals constructed by the Anderson-Rubin test (Anderson & Rubin, 1949).

Table 2. Estimation results for the effect of education on earnings (Card, 1995), including 95% confidence sets. The independence restriction puts stronger constraints on the parameters and yields a smaller confidence set; e.g., it does not include the OLS solution, which is not rejected by an Anderson-Rubin test.

| Method | Point estimate | Lower | Upper |
|---|---|---|---|
| OLS | 0.072 | 0.065 | 0.079 |
| 2SLS | 0.142 | 0.050 | 0.273 |
| HSIC-X | 0.160 | 0.097 | 0.208 |

Table 2 reports the point estimates and the confidence intervals at the 95% confidence level. The point estimates from 2SLS and HSIC-X differ from the OLS estimate, suggesting a potentially sizable unobserved confounding effect. Nonetheless, the difference between the OLS and 2SLS estimates is not statistically significant, which was also observed in the original study (see Card, 1995; 2001). We speculate that the linear effect of college proximity Z on education X may not be strong enough, which leads to the imprecision of 2SLS. On the other hand, the confidence region around HSIC-X does not contain the OLS estimate. This suggests that taking the full independence into account yields a finite sample efficiency gain in this study.

6. Conclusion and Future Work

Exploiting independence between exogenous variables and the residual terms of a response variable can be beneficial for the identifiability of a causal function, for the empirical performance of corresponding estimates, and for constructing estimators that perform well even under interventions on such exogenous variables. We have proposed two estimators, HSIC-X and HSIC-X-pen, that can be equipped with any machine learning regression method based on gradient descent. Empirical results on simulated and real data indicate that one may indeed benefit from considering independence restrictions in practice. We believe it could be fruitful to construct fast approximate methods for inverting an independence test and to analyze distribution generalization when the support of Z is extended.

Acknowledgments

SS, LH, JP were supported by a research grant (18968) from VILLUM FONDEN. NP was supported by a research grant (0069071) from Novo Nordisk Fonden. We thank Nicola Gnecco and Nikolaj Thams for helpful discussions.

References

Anderson, T. W. and Rubin, H. Estimation of the parameters of a single equation in a complete system of stochastic equations. Annals of Mathematical Statistics, 20(1):46–63, 1949.

Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization. arXiv e-prints (1907.02893), 2019.

Bennett, A., Kallus, N., and Schnabel, T. Deep generalized method of moments for instrumental variable analysis. In Advances in Neural Information Processing Systems 32 (NeurIPS). Curran Associates, Inc., 2019.

Berrett, T. B. and Samworth, R. J.
Nonparametric independence testing via mutual information. Biometrika, 106 (3):547 566, 2019. Bongers, S., Forre, P., Peters, J., and Mooij, J. M. Foundations of structural causal models with cycles and latent variables. Annals of Statistics, 49(5):2885 2915, 2021. Card, D. Using geographic variation in college proximity to estimate the return to schooling. Aspects of Labour Market Behaviour: Essays in Honour of John Vanderkamp, pp. 201 222, 1995. Card, D. Estimating the return to schooling: Progress on some persistent econometric problems. Econometrica, 69(5):1127 1160, 2001. Chesher, A. Identification in nonseparable models. Econometrica, 71(5):1405 1441, 2003. Christiansen, R., Pfister, N., Jakobsen, M. E., Gnecco, N., and Peters, J. A causal framework for distribution generalization. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1 1, 2021. Creager, E., Jacobsen, J.-H., and Zemel, R. Environment inference for invariant learning. In Proceedings of the 38th International Conference on Machine Learning (ICML), pp. 2189 2200. PMLR, 2021. Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Rearning Research, 12(7), 2011. Dunker, F. Adaptive estimation for some nonparametric instrumental variable models with full independence. Electronic Journal of Statistics, 15:6151 6190, 2021. Dunker, F., Florens, J.-P., Hohage, T., Johannes, J., and Mammen, E. Iterative estimation of solutions to noisy nonlinear operator equations in nonparametric instrumental regression. Journal of Econometrics, 178:444 455, 2014. Frölich, M. Nonparametric IV estimation of local average treatment effects with covariates. Journal of Econometrics, 139(1):35 75, 2007. Exploiting Independent Instruments Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems 20 (Neur IPS). Curran Associates, Inc., 2008. Fuller, W. A. Some properties of a modification of the limited information estimator. Econometrica, 45:939 53, 1977. Greenfeld, D. and Shalit, U. Robust learning with the Hilbert-schmidt independence criterion. In Proceedings of the 37th International Conference on Machine Learning (ICML), pp. 3759 3768. PMLR, 2020. Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., and Smola, A. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20 (Neur IPS). Curran Associates, Inc., 2008. Hansen, C., Mc Donald, J. B., and Newey, W. K. Instrumental variables estimation with flexible distributions. Journal of Business & Economic Statistics, 28(1):13 25, 2010. Hartford, J., Lewis, G., Leyton-Brown, K., and Taddy, M. Deep IV: A flexible approach for counterfactual prediction. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 1414 1423. PMLR, 2017. Imbens, G. W. and Angrist, J. D. Identification and estimation of local average treatment effects. Econometrica, 62 (2):467 75, 1994. Imbens, G. W. and Newey, W. K. Identification and estimation of triangular simultaneous equations models without additivity. Econometrica, 77(5):1481 1512, 2009. Jakobsen, M. and Peters, J. Distributional robustness of K-class estimators and the PULSE. The Econometrics Journal, 25(2):404 432, 2022. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR); ar Xiv:1412.6980, 2015. 
Krueger, D., Caballero, E., Jacobsen, J.-H., Zhang, A., Binas, J., Zhang, D., Priol, R., and Courville, A. Out-ofdistribution generalization via risk extrapolation (REx). In Proceedings of the 38th International Conference on Machine Learning (ICML), pp. 5815 5826. PMLR, 2021. Loh, I. Nonparametric identification and estimation with independent, discrete instruments. Ar Xiv e-prints (1906.05231), 2019. Magliacane, S., van Ommen, T., Claassen, T., Bongers, S., Versteeg, P., and Mooij, J. M. Domain adaptation by using causal inference to predict invariant conditional distributions. In Advances in Neural Information Processing Systems 31 (Neur IPS), pp. 10846 10856. Curran Associates, Inc., 2018. Mariano, R. S. Simultaneous equation model estimators: Statistical properties and practical implications. In Baltagi, B. H. (ed.), A companion to theoretical econometrics, pp. 122 43. Blackwell, Malden, MA, 2001. Mooij, J. M., Janzing, D., Peters, J., and Schölkopf, B. Regression by dependence minimization and its application to causal inference. In Proceedings of the 26th International Conference on Machine Learning (ICML), pp. 745 752. ACM Press, 2009. Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J., and Schölkopf, B. Distinguishing cause from effect using observational data: methods and benchmarks. Journal of Machine Learning Research, 17(32):1 102, 2016. Muandet, K., Mehrjou, A., Lee, S. K., and Raj, A. Dual instrumental variable regression. In Advances in Neural Information Processing Systems 33 (Neur IPS). Curran Associates, Inc., 2020. Newey, W. K. Uniform convergence in probability and stochastic equicontinuity. Econometrica: Journal of the Econometric Society, pp. 1161 1167, 1991. Newey, W. K. Nonparametric instrumental variables estimation. American Economic Review, 103(3):550 556, 2013. Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, USA, 2nd edition, 2009. Peters, J., Bühlmann, P., and Meinshausen, N. Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (with discussion), 78(5):947 1012, 2016. Pfister, N., Bühlmann, P., Schölkopf, B., and Peters, J. Kernel-based tests for joint independence. Journal of the Royal Statistical Society: Series B, 80:5 31, 2017. Pfister, N., William, E. G., Peters, J., Aebersold, R., and Bühlmann, P. Stabilizing variable selection and regression. Annals of Applied Statistics, 15(3):1220 1246, 2021. Poirier, A. Efficient estimation of models with independence restrictions. Journal of Econometrics, 196(1):1 22, 2017. Exploiting Independent Instruments Rojas-Carulla, M., Schölkopf, B., Turner, R., and Peters, J. Causal transfer in machine learning. Journal of Machine Learning Research, 19(36):1 34, 2018. Rothenhäusler, D., Bühlmann, P., Meinshausen, N., and Peters, J. Anchor regression: heterogeneous data meets causality. Journal of Royal Statistical Society, Series B, 83(2):215 246, 2021. Schölkopf, B. and Smola, A. Learning with Kernels. MIT Press, Massachusetts, 2002. Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. M. On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning (ICML). Omnipress, 2012. Shah, R. and Peters, J. The hardness of conditional independence testing and the generalised covariance measure. Annals of Statistics, 48(3):1514 1538, 2020. Singh, R., Sahani, M., and Gretton, A. Kernel instrumental variable regression. 
In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32 (Neur IPS). Curran Associates, Inc., 2019. Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Lanckriet, G., and Schölkopf, B. Kernel choice and classifiability for rkhs embeddings of probability distributions. In Advances in Neural Information Processing Systems 22 (Neur IPS). Curran Associates, Inc., 2009. Theil, H. Economic forecasts and policy. North-Holland, Amsterdam, NL, 1958. Torgovitsky, A. Identification of nonseparable models using instruments with small support. Econometrica, 83(3): 1185 1197, 2015. Wooldridge, J. M. Econometric analysis of cross section and panel data. MIT press, Cambridge, MA, 2010. Wright, P. G. The Tariff on Animal and Vegetable Oils. Investigations in International Commercial Policies. Macmillan, New York, NY, 1928. Yuan, J., Ma, X., Kuang, K., Xiong, R., Gong, M., and Lin, L. Learning domain-invariant relationship with instrumental variable for domain generalization. Ar Xiv e-prints (2110.01438), 2021. Zhang, K., Peters, J., Janzing, D., and Schölkopf, B. Kernelbased conditional independence test and application in causal discovery. In Proceedings of the 27th Annual Conference on Uncertainty in Artificial Intelligence (UAI), pp. 804 813. AUAI Press, 2011. Exploiting Independent Instruments A.1. Proof of Proposition 2.1 Proof. (i) We first show that if there exists k and η, such that condition (2) holds, then condition (4) holds. We do so by contraposition, so assume that there exists a τ = 0 violating condition (4). The following then hold for all k N and all η : Rr Rk. First, Cov[η(Z), h(U, ϵY ) + φ(X) τ] = 0. Second, as Z h(U, ϵY ) it follows that Cov[η(Z), h(U, ϵY )] = 0. Third, by combining the previous two results it follows that Cov[η(Z), φ(X) τ] = 0, which is a violation of condition (2). (ii) This statement is proven by (iii) (a), for example. (iii) We now consider the two types of examples (a), (b) and (c). (a) In the example SCM, we have E[X|Z] = Z E[U] + E[ϵX] = 0 and therefore for all k N and functions η : Rr Rk, E[η(Z)X] = 0. As a result, the identifiability condition (2) does not hold here. Specifically, any function of the form θX, θ R satisfies the moment restriction, so we cannot identify the causal function f 0(x) = x. Consider now the independence based identifiability condition (4): Z U + ϵY + τX = τ = 0. Plugging in the structural equation for X this is equivalent to Kτ := U + τZU + τϵX + ϵY Z for all τ = 0. We now show that this is true by showing that for all τ = 0, E[Z2K2 τ ] = E[Z2]E[K2 τ ]. First, E[Z2K2 τ ] = E[Z2(U + τZU + τϵX + ϵY )2] = E[Z2U 2] + τ 2E[Z4U 2] + τ 2E[Z2ϵ2 X] + E[Z2ϵ2 Y ] = 2 + 4τ 2. E[Z2]E[K2 τ ] = E[Z2]E[(U + τZU + τϵX + ϵY )2] = E[Z2](E[U 2] + τ 2E[Z2U 2] + τ 2E[ϵ2 X] + E[ϵ2 Y ]) = 2 + 2τ 2. Therefore, condition (4) is fulfilled, meaning we can identify the causal function f 0(x) = x. (b) We first give an argument why identifiability using condition (2) is impossible. For a discrete instrument Z, the number of almost surely linearly independent functions3 of Z is bounded by the cardinality of the support of Z; therefore, if the number of basis functions is larger than the support of Z, there is no k N and η : Rr Rk such that condition (2) is satisfied. We now construct an explicit example as follows: First, we show for a class of SCMs that under certain assumptions condition (2) does not hold while (4) does. 
Second, we give an explicit SCM with a binomially distributed instrument that fulfills these conditions. Fix k, p N such that p > k and consider the SCM Z := ϵZ U := ϵU X := Z + U + ϵX i=1 αi Xi + U + ϵY , where ϵZ, ϵU, ϵX and ϵY are jointly independent, real-valued errors ϵZ Unif({0, . . . , k 1}) and αi R \ {0} for all i {1, . . . , p}. Assume that the random variables {ϵi XU j|1 i + j p} are almost surely linearly independent 3Here, we say that a collection of functions l1, . . . , lk is almost surely linearly independent if for all vectors α = 0, P Pk i=1 αili(Z) = 0 = 0. Exploiting Independent Instruments (see above). We are interested in estimating the causal function f 0(x) = Pp i=1 αixi with the basis functions φ(x) = (x1, . . . , xp). Consider first the moment identifiability condition (2). Because Z is discrete, the set of functions {1{Z=i}, i {0, . . . , k 1}} form a linear basis for all functions from supp(Z) to R. As a result, for all functions η = (η1, . . . , ηq) : R Rq, with q > k, condition (2) cannot hold. This implies that while condition (2) may hold if p k, with for example η(Z) = (1{Z=0}, . . . , 1{Z=k 1}), it cannot hold if p > k, irrespective of the choice of η. Consider now the independence identifiability condition (4): i=1 τiαi Xi + h(U, ϵY ) Z = τ1 = = τp = 0. Because the support of Z is finite, this is equivalent to the statement that the distributions of the random variables i=1 τiαi(j + U + ϵX)i + U + ϵY , are the same for all j {0, . . . , k 1}. Consider the cases j = 0 and j = 1. For Pp i=1 τiαi(U + ϵX)i + U + ϵY and Pp i=1 τiαi(1 + U + ϵX)i + U + ϵY to have the same distribution, the coefficients in front of the ϵi X (when collecting all terms) for all i {1, . . . , p} must be equal in both random variables. Consider first the ϵp 1 X term. The corresponding two coefficients are τp 1αp 1 and τp 1αp 1 + pτpαp. These two coefficients are only equal if τp = 0. By iterating this argument we obtain that τ2 = = τp = 0. This implies that τ1α1(U + ϵX) + U + ϵY and τ1α1(1 + U + ϵX) + U + ϵY have the same distribution. Thus, τ1 = 0 and therefore condition (4) holds. We now give an explicit example, where k < p and where the random variables {ϵi XU j|1 i + j p} are almost surely linearly independent. Let ϵU, ϵZ, ϵX and ϵY standard normal, Z Bernoulli(0.5) and p = 3. Here, k = 2 < p and we will now show that the set of random variables {ϵi XU j|1 i + j 3} is almost sure linearly independent. Let τij, 1 i + j 3 be coefficients such that Kτ := P 1 i+j 3 τijϵi XU j = 0 almost surely. In particular, this implies that E[Kτϵi XU j] = 0 for all i, j 0. Using E[ϵi XU jϵk XU l] = 0 if either i + k or j + l are odd (which follows as ϵX and U are mean zero Gaussian and independent), we obtain the following nine equations. E[KτU] = τ01 + 3τ03 + τ21 = 0 E[KτU 2] = 3τ02 + τ20 = 0 E[KτU 3] = 3τ01 + 15τ03 + 3τ21 = 0 E[KτϵX] = τ10 + τ12 + 3τ30 = 0 E[KτϵXU] = τ11 = 0 E[KτϵXU 2] = τ10 + 3τ12 + 3τ30 = 0 E[Kτϵ2 X] = τ02 + 3τ20 = 0 E[Kτϵ2 XU] = τ01 + 3τ03 + 3τ21 = 0 E[Kτϵ3 X] = 3τ10 + 3τ12 + 15τ30 = 0. We can write this as a linear system Aτ = 0 with corresponding 9 9 matrix A. As A is invertible, τ = 0 follows. (c) Consider the SCM Z := ϵZ U := ϵU X := 2ZU U + ϵX Y := X + U + ϵY . where ϵZ, ϵU, ϵX and ϵY are jointly independent, ϵZ Bernoulli(0.5) and the remaining errors are standard normal. Consider the basis function φ(x) = x. 
Here, Kτ := h(U, ϵY ) + τX = ( (1 + τ)U + τϵX + ϵY if Z = 1 (1 τ)U + τϵX + ϵY if Z = 0 Exploiting Independent Instruments and therefore E[K2 τ |Z = 1] = 2τ 2 + 2τ + 2 and E[K2 τ |Z = 0] = 2τ 2 τ + 2. It follows that condition (4) holds. On the other hand, τX|(Z = 1) = τ(U + ϵX) and τX|(Z = 0) = τ( U + ϵX) and second, the distribution of U is symmetric around 0. Therefore, Z τX for all τ R. We can therefore identify the causal function with the independence restriction, even though Z is independent of X. A.2. Proof of Proposition 2.2 Proof. This is a proof by example. We have already constructed an SCM, such that condition (4) holds but condition (6) does not in the proof of statement (iii) (c) of Proposition 2.1. We now construct a SCM, such that condition (6) holds but condition (4) does not. Consider the SCM Z := ϵZ U := ϵU X := 2ZU + ϵX Y := X U + ϵY . where ϵZ, ϵU, ϵX and ϵY are jointly independent, ϵZ Bernoulli(0.5) and the remaining errors are standard normal. Consider the basis functions φ(x) = (x). Here, h(U, ϵY ) + X = ( U + ϵX + ϵY if Z = 1 U + ϵX + ϵY if Z = 0 and therefore condition (4) does not hold for τ = 1. On the other hand, ( τ(2U + ϵX) if Z = 1 τϵX if Z = 0 and therefore E[(τX)2|Z = 1] = 5τ 2 while E[(τX)2|Z = 0] = τ 2. It follows that condition (6) holds. A.3. Proof of Theorem 3.1 Proof. Let H(PM 0, θ) := HSIC((Y φ(X) θ, Z); k R, k Z) with non-negative bounded and Lipschitz continuous kernels k R and k Z with bounded and continuous derivatives.4 Consider an i.i.d. data set Dn = (Xi, Yi, Zi)n i=1, such that that each triple (Xi, Yi, Zi) follows the distribution of M 0. Let ˆHn(Dn, θ) := \ HSIC((Rθ i , Zi)n i=1), with Rθ i = Yi φ(Xi) θ. Then the following three results hold. First, as the kernels k R and k L are by assumption non-negative and bounded it follows by Corollary 15 in Mooij et al. (2016) that for all θ and all ϵ > 0, limn PM 0(| ˆHn(Dn, θ) H(θ)| > ϵ) = 0. Second, it follows by Lemma 16 in Mooij et al. (2016) that for all θ1, θ2 Θ and all n 2 ˆHn(Dn, θ1) ˆHn(Dn, θ2) = \ HSIC((Rθ1 i , Zi)n i=1) \ HSIC((Rθ2 i , Zi)n i=1) φ(X) (θ1 θ2) φ(X) (θ1 θ2) i=1 φ(Xi) 2 !0.5 32λCM (θ1 θ2) , 4From now on, we do not explicitly write down the kernels and write H(θ) rather than H(PM0, θ) to ease notation Exploiting Independent Instruments where Rθ = (Yi φ(Xi) θ)n i=1, φ(X) = (φ(Xi))n i=1, λ is the Lipschitz constant for k R, C is an upper bound for k Z and M is an upper bound for φ( ) 2 which exists by the assumption that φ( ) is bounded. Third, for example by Pfister et al. (2017) Proposition 2.5, the population HSIC can be written as H(θ) = HSIC(Y φ(X) θ, Z) = E[k R(Rθ1 1 , Rθ2 2 )k Z(Z1, Z2)] + E[k R(Rθ1 1 , Rθ2 2 )]E[k Z(Z1, Z2)] 2E[k R(Rθ1 1 , Rθ2 2 )k Z(Z1, Z3)], with Rθ1 1 and Rθ2 2 as well as Z1, Z2 and Z3 being i.i.d. copies of Rθ1 and Z, respectively. The function Rθ : θ 7 Y φ(X) θ is (surely) continuously differentiable, as is k R by assumption. Further, k R and k Z have bounded derivative. We can therefore conclude by dominated convergence and the chain rule that H(θ) is continuously differentiable. Therefore, H(θ) is Lipschitz on Θ, since Θ is compact. Jointly, these three results (which correspond to (i), (ii), and (iii) from the statement of the theorem) allow us to apply Corollary 2.2 from Newey (1991) and conclude that for all ε > 0, lim n PM 0(max θ Θ | ˆHn(Dn, θ) H(θ)| > ε) = 0. (13) Here and in the remainder of the proof we use that θ 7 ˆHn(Dn, θ) and θ 7 H(θ) are (surely) continuous functions that map to R. 
Restricted to any compact set, they therefore (surely) attain a maximum and a minimum. Consider now any sequence of estimators $\hat\theta^0_n \in \arg\min_{\theta\in\Theta} \hat H_n(D_n, \theta)$. By the assumption that condition (4) holds, $\theta^0$ is the unique minimizer of the continuous function $H(\cdot)$ on the compact set $\Theta$. Therefore, for all $\varepsilon > 0$, there exists a $\zeta(\varepsilon) > 0$ such that for all $\theta \notin B_\varepsilon(\theta^0) = \{\theta : \|\theta - \theta^0\| < \varepsilon\}$ it holds that $H(\theta) - H(\theta^0) > \zeta(\varepsilon)$. Note also that $\Theta \setminus B_\varepsilon(\theta^0)$ is compact. Fix $\varepsilon > 0$ and $\delta > 0$; then by (13) there exists an $n^* \in \mathbb{N}$ such that for all $n \ge n^*$,
$$P_{M^0}\Big(\max_{\theta\in\Theta}\big|\hat H_n(D_n, \theta) - H(\theta)\big| > \tfrac{1}{2}\zeta(\varepsilon)\Big) \le \delta. \qquad (14)$$
Then, for all $n \ge n^*$ it holds that
$$\begin{aligned}
P_{M^0}\big(\|\hat\theta^0_n - \theta^0\| > \varepsilon\big)
&\le P_{M^0}\Big(\min_{\theta\in\Theta\setminus B_\varepsilon(\theta^0)}\big(\hat H_n(D_n, \theta) - \hat H_n(D_n, \theta^0)\big) \le 0\Big)\\
&\le P_{M^0}\Big(\max_{\theta\in\Theta\setminus B_\varepsilon(\theta^0)}\big|\hat H_n(D_n, \theta) - H(\theta)\big| + \big|\hat H_n(D_n, \theta^0) - H(\theta^0)\big| > \zeta(\varepsilon)\Big)\\
&\le P_{M^0}\Big(\big\{\max_{\theta\in\Theta\setminus B_\varepsilon(\theta^0)}\big|\hat H_n(D_n, \theta) - H(\theta)\big| > \tfrac{1}{2}\zeta(\varepsilon)\big\} \cup \big\{\big|\hat H_n(D_n, \theta^0) - H(\theta^0)\big| > \tfrac{1}{2}\zeta(\varepsilon)\big\}\Big)\\
&\le 2\delta.
\end{aligned}$$
Here, we used the following four arguments: Firstly, the event $\{\|\hat\theta^0_n - \theta^0\| > \varepsilon\}$ can only occur if there exists a $\theta \in \Theta \setminus B_\varepsilon(\theta^0)$ such that the event $\{\hat H_n(D_n, \theta) - \hat H_n(D_n, \theta^0) \le 0\}$ occurs. Secondly, $\zeta(\varepsilon)$ is defined such that for all $\theta \in \Theta \setminus B_\varepsilon(\theta^0)$ it holds that $H(\theta) - H(\theta^0) > \zeta(\varepsilon)$. Therefore, if there exists a $\theta$ such that the event $\{\hat H_n(D_n, \theta) - \hat H_n(D_n, \theta^0) \le 0\}$ occurs, then there must exist a $\theta$ such that the event $\{|\hat H_n(D_n, \theta) - H(\theta)| + |\hat H_n(D_n, \theta^0) - H(\theta^0)| > \zeta(\varepsilon)\}$ occurs. Thirdly, for $\{\max_{\theta\in\Theta\setminus B_\varepsilon(\theta^0)}|\hat H_n(D_n, \theta) - H(\theta)| + |\hat H_n(D_n, \theta^0) - H(\theta^0)| > \zeta(\varepsilon)\}$ to occur, either $\{\max_{\theta\in\Theta\setminus B_\varepsilon(\theta^0)}|\hat H_n(D_n, \theta) - H(\theta)| > \tfrac{1}{2}\zeta(\varepsilon)\}$ or $\{|\hat H_n(D_n, \theta^0) - H(\theta^0)| > \tfrac{1}{2}\zeta(\varepsilon)\}$ must occur. Fourthly, we use the union bound together with (14) for the last inequality. Since $\delta > 0$ was arbitrary, we can therefore conclude that $\lim_{n\to\infty} P_{M^0}\big(\|\hat\theta^0_n - \theta^0\| > \varepsilon\big) = 0$.

Finally, consider our estimator $\hat f(\cdot) = \varphi(\cdot)^\top\hat\theta^0_n$ for the causal function $f^0$. By the assumption that $\varphi$ is bounded, we get that
$$\|\hat f - f^0\|_\infty = \sup_{x\in\mathbb{R}^d}\big|\varphi(x)^\top(\hat\theta^0_n - \theta^0)\big| \le \sup_{x\in\mathbb{R}^d}\|\varphi(x)\|\,\|\hat\theta^0_n - \theta^0\| \le M\,\|\hat\theta^0_n - \theta^0\|.$$
Hence we conclude that $\|\hat f - f^0\|_\infty \to 0$ in probability under $P_{M^0}$. Our second claim follows by the fact that under the conditions laid out, we can directly apply Corollary 2.2 from Newey (1991) to obtain uniform convergence in probability and then argue as in the case for $\widehat{\mathrm{HSIC}}$.

A.4. Proof of Theorem 4.1

Proof. Let $f \in \mathcal{F}_{\mathrm{inv}}$ and $i \in I$. Let $G$ be the directed graph induced by the SCM $M^0$. Then, since $Z$ is by construction a source node in $G$ and $i$ an intervention on $Z$ satisfying that $P_{M^0(i)}$ is dominated by $P_{M^0}$, it holds that
$$E_{M^0(i)}\big[\ell(Y - f(X))\big] = E_{M^0(i)}\Big[E_{M^0(i)}\big[\ell(Y - f(X)) \mid Z\big]\Big] = E_{M^0(i)}\Big[E_{M^0}\big[\ell(Y - f(X)) \mid Z\big]\Big]. \qquad (15)$$
Now, since $f \in \mathcal{F}_{\mathrm{inv}}$ it holds that $Z \perp\!\!\!\perp Y - f(X)$ under $P_{M^0}$, which implies that
$$E_{M^0(i)}\Big[E_{M^0}\big[\ell(Y - f(X)) \mid Z\big]\Big] = E_{M^0(i)}\Big[E_{M^0}\big[\ell(Y - f(X))\big]\Big] = E_{M^0}\big[\ell(Y - f(X))\big]. \qquad (16)$$
Since $i \in I$ was arbitrary, combining (15) and (16) directly implies
$$E_{M^0}\big[\ell(Y - f(X))\big] = \sup_{i\in I} E_{M^0(i)}\big[\ell(Y - f(X))\big],$$
which completes the proof of Theorem 4.1.

A.5. Proof of Theorem 4.2

Proof. We consider two optimization problems: (A) minimize $E_{M^0}[\ell(Y - f(X))]$ over all $f \in \mathcal{F}_{\mathrm{inv}}$, and (B) minimize $\sup_{i\in I} E_{M^0(i)}[\ell(Y - f(X))]$ over all $f \in \mathcal{F}$. To prove the result, we fix an arbitrary $\varepsilon > 0$ and find a function $\varphi^* \in \mathcal{F}_{\mathrm{inv}}$ that satisfies
$$E_{M^0}\big[\ell(Y - \varphi^*(X))\big] \le \inf_{f\in\mathcal{F}_{\mathrm{inv}}} E_{M^0}\big[\ell(Y - f(X))\big] + \varepsilon \qquad (17)$$
and
$$\sup_{i\in I} E_{M^0(i)}\big[\ell(Y - \varphi^*(X))\big] \le \inf_{f\in\mathcal{F}} \sup_{i\in I} E_{M^0(i)}\big[\ell(Y - f(X))\big] + \varepsilon. \qquad (18)$$
Since by (15) and (16) in the proof of Theorem 4.1 it holds that $E_{M^0}[\ell(Y - \varphi^*(X))] = \sup_{i\in I} E_{M^0(i)}[\ell(Y - \varphi^*(X))]$, the result we wish to prove follows immediately. The proof now proceeds in two steps: (1) we construct the function $\varphi^*$ using the intervention $i^*$ from the statement; (2) we use that $\varphi^* \in \mathcal{F}_{\mathrm{inv}}$ to conclude the proof.

Step (1): Let $i^* \in I$ be the intervention from the statement of the theorem.
For all $f \in \mathcal{F}$, it holds that
$$\sup_{i\in I} E_{M^0(i)}\big[\ell(Y - f(X))\big] \ge E_{M^0(i^*)}\big[\ell(Y - f(X))\big].$$
As $f \in \mathcal{F}$ was arbitrary, we can take the infimum on both sides and get
$$\inf_{f\in\mathcal{F}} \sup_{i\in I} E_{M^0(i)}\big[\ell(Y - f(X))\big] \ge \inf_{f\in\mathcal{F}} E_{M^0(i^*)}\big[\ell(Y - f(X))\big]. \qquad (19)$$
By the definition of the infimum there exists $\psi \in \mathcal{F}$ satisfying
$$E_{M^0(i^*)}\big[\ell(Y - \psi(X))\big] \le \inf_{f\in\mathcal{F}} \sup_{i\in I} E_{M^0(i)}\big[\ell(Y - f(X))\big] + \varepsilon/2. \qquad (20)$$
[Footnote 5: In the case of squared loss $\ell(\cdot) = (\cdot)^2$, the minimum is attained at the conditional mean $E_{M^0(i^*)}[Y \mid X] = f^0(X) + E_{M^0(i^*)}[h(U, \epsilon_Y) \mid X]$.]

Moreover, setting $\bar\psi := \psi - f^0$ and expanding $Y$ from the structural assignment we get that
$$E_{M^0(i^*)}\big[\ell(Y - \psi(X))\big] = E_{M^0(i^*)}\big[\ell\big(f^0(X) + h(U, \epsilon_Y) - \psi(X)\big)\big] = E_{M^0(i^*)}\big[\ell\big(h(U, \epsilon_Y) - \bar\psi(X)\big)\big].$$
Next, recall that $X_S \perp\!\!\!\perp U \mid X_{S^c}$. Since $\epsilon_Y \perp\!\!\!\perp (X, U)$, it also holds (using the properties of conditional independence) that $X_S \perp\!\!\!\perp (U, \epsilon_Y) \mid X_{S^c}$ under $P_{M^0(i^*)}$. Let $\mu$ denote the common dominating product measure and let $p^*$ denote the density corresponding to $P_{M^0(i^*)}$ with respect to $\mu$. Then, by conditional independence it holds that
$$p^*(x_S, x_{S^c}, u, e) = p^*(x_S \mid x_{S^c})\,p^*(u, e \mid x_{S^c})\,p^*(x_{S^c}).$$
Expressing the expectation as an integral and applying Jensen's inequality once ($\ell$ is convex) we get that
$$\begin{aligned}
E_{M^0(i^*)}\big[\ell\big(h(U, \epsilon_Y) - \bar\psi(X)\big)\big]
&= \int \ell\big(h(u, e) - \bar\psi(x_S, x_{S^c})\big)\,p^*(x_S \mid x_{S^c})\,p^*(u, e \mid x_{S^c})\,p^*(x_{S^c})\,d\mu(u, e, x_S, x_{S^c})\\
&\ge \int \ell\Big(\int \big(h(u, e) - \bar\psi(x_S, x_{S^c})\big)\,p^*(x_S \mid x_{S^c})\,d\mu(x_S)\Big)\,p^*(u, e \mid x_{S^c})\,p^*(x_{S^c})\,d\mu(u, e, x_{S^c})\\
&= \int \ell\Big(h(u, e) - \int \bar\psi(x_S, x_{S^c})\,p^*(x_S \mid x_{S^c})\,d\mu(x_S)\Big)\,p^*(u, e \mid x_{S^c})\,p^*(x_{S^c})\,d\mu(u, e, x_{S^c}).
\end{aligned}$$
Let $\bar\varphi \in \mathcal{F}$ be a function that satisfies for all $x_{S^c} \in \mathrm{supp}\big(P^{X_{S^c}}_{M^0(i^*)}\big)$ that
$$\bar\varphi(x_{S^c}) = \int \bar\psi(x_S, x_{S^c})\,p^*(x_S \mid x_{S^c})\,d\mu(x_S).$$
Such a $\bar\varphi \in \mathcal{F}$ exists because $\mathcal{F}$ consists of all square-integrable functions. We then get that
$$E_{M^0(i^*)}\big[\ell(Y - \psi(X))\big] \ge \int \ell\big(h(u, e) - \bar\varphi(x_{S^c})\big)\,p^*(u, e \mid x_{S^c})\,p^*(x_{S^c})\,d\mu(u, e, x_{S^c}) = E_{M^0(i^*)}\big[\ell\big(h(U, \epsilon_Y) - \bar\varphi(X_{S^c})\big)\big] = E_{M^0(i^*)}\big[\ell\big(Y - (f^0(X) + \bar\varphi(X_{S^c}))\big)\big].$$
Combining this with (20) and defining $\varphi^* \in \mathcal{F}$ for all $x \in \mathbb{R}^d$ by $\varphi^*(x) := f^0(x) + \bar\varphi(x_{S^c})$ leads to
$$E_{M^0(i^*)}\big[\ell(Y - \varphi^*(X))\big] \le \inf_{f\in\mathcal{F}} \sup_{i\in I} E_{M^0(i)}\big[\ell(Y - f(X))\big] + \varepsilon/2. \qquad (21)$$
Moreover, by construction it holds $P_{M^0(i^*)}$-almost surely that
$$Y - \varphi^*(X) = h(U, \epsilon_Y) - \bar\varphi(X_{S^c}). \qquad (22)$$
Since $P_{M^0(i^*)}$ is dominated by $P_{M^0}$ and $\mathrm{supp}(P^X_{M^0(i^*)}) = \mathrm{supp}(P^X_{M^0})$, (22) also holds $P_{M^0}$-almost surely. Finally, using that $X_{S^c}$ are not descendants of $Z$ and $Z$ is a source node, it holds that $Z \perp\!\!\!\perp (X_{S^c}, U, \epsilon_Y)$ under $P_{M^0}$, which implies $Z \perp\!\!\!\perp Y - \varphi^*(X)$ under $P_{M^0}$. Hence $\varphi^* \in \mathcal{F}_{\mathrm{inv}}$.

Step (2): Since $\varphi^* \in \mathcal{F}_{\mathrm{inv}}$ we can use (15) and (16) from the proof of Theorem 4.1 together with (21) to get that
$$\sup_{i\in I} E_{M^0(i)}\big[\ell(Y - \varphi^*(X))\big] = E_{M^0(i^*)}\big[\ell(Y - \varphi^*(X))\big] \le \inf_{f\in\mathcal{F}} \sup_{i\in I} E_{M^0(i)}\big[\ell(Y - f(X))\big] + \varepsilon/2.$$
This proves (18). Next, by the definition of the infimum there exists $f^* \in \mathcal{F}_{\mathrm{inv}}$ satisfying
$$E_{M^0}\big[\ell(Y - f^*(X))\big] \le \inf_{f\in\mathcal{F}_{\mathrm{inv}}} E_{M^0}\big[\ell(Y - f(X))\big] + \varepsilon/2. \qquad (23)$$
Now, assume that $E_{M^0}[\ell(Y - f^*(X))] \le E_{M^0}[\ell(Y - \varphi^*(X))]$; if this were not the case, then (17) would already be true and the proof would be complete. Again applying Theorem 4.1 and defining $c_{\min} := \inf_{f\in\mathcal{F}} \sup_{i\in I} E_{M^0(i)}[\ell(Y - f(X))]$, this implies
$$c_{\min} \le \sup_{i\in I} E_{M^0(i)}\big[\ell(Y - f^*(X))\big] \le \sup_{i\in I} E_{M^0(i)}\big[\ell(Y - \varphi^*(X))\big] \le c_{\min} + \varepsilon/2.$$
This implies that
$$\sup_{i\in I} E_{M^0(i)}\big[\ell(Y - \varphi^*(X))\big] \le \sup_{i\in I} E_{M^0(i)}\big[\ell(Y - f^*(X))\big] + \varepsilon/2 = E_{M^0}\big[\ell(Y - f^*(X))\big] + \varepsilon/2.$$
Combined with (23) this proves (17), which concludes the proof of Theorem 4.2.
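To illustrate the invariance property established in Theorem 4.1 (and its role in the worst-case comparison of Theorem 4.2), the following minimal simulation sketch uses a toy linear SCM of our own choosing (it is not one of the paper's experiments): the causal function has residuals independent of $Z$, so its population loss is unchanged under interventions that rescale $Z$, whereas the loss of the (non-invariant) OLS predictor grows with the intervention strength.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, z_scale):
    """Toy IV SCM (our own example): Z := z_scale * eps_Z, U := eps_U,
    X := Z + U + eps_X, Y := X + U + eps_Y, so the causal function is f0(x) = x.
    Changing z_scale plays the role of an intervention on Z."""
    Z = z_scale * rng.normal(size=n)
    U = rng.normal(size=n)
    X = Z + U + rng.normal(size=n)
    Y = X + U + rng.normal(size=n)
    return X, Y

# Fit OLS (no intercept; all variables are mean zero) on observational data, z_scale = 1.
X, Y = sample(500_000, z_scale=1.0)
b_ols = np.sum(X * Y) / np.sum(X * X)

for z_scale in [1.0, 3.0, 10.0]:
    X, Y = sample(500_000, z_scale=z_scale)
    mse_causal = np.mean((Y - X) ** 2)         # stays approximately 2 for every z_scale
    mse_ols = np.mean((Y - b_ols * X) ** 2)    # grows with the intervention strength
    print(f"z_scale={z_scale:5.1f}  causal={mse_causal:.2f}  ols={mse_ols:.2f}")
```

In this toy model, the OLS predictor has the smaller loss under the observational distribution, but once the interventions on $Z$ are sufficiently strong it is outperformed by the invariant (causal) predictor, in line with the worst-case optimality statement of Theorem 4.2.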
B. Additional details - Conditional IV

We consider the following SCM:
$$W := m(\epsilon_W, V, U), \qquad Z := q(W, V, \epsilon_Z), \qquad X := g(Z, W, U, \epsilon_X), \qquad V := \epsilon_V, \qquad U := \epsilon_U, \qquad Y := f(X) + h(W, U, \epsilon_Y),$$
where $(\epsilon_W, \epsilon_Z, \epsilon_V, \epsilon_U, \epsilon_X, \epsilon_Y) \sim Q$ are jointly independent noise variables, $Z \in \mathbb{R}^r$ are instruments, $V \in \mathbb{R}^{s_1}$ and $U \in \mathbb{R}^{s_2}$ are unobserved variables, $W \in \mathbb{R}^t$ are observed covariates, $X \in \mathbb{R}^d$ are predictors and $Y \in \mathbb{R}$ is a response. [Footnote 6: The results can be extended to other SCMs, too; we only show one for simplicity. E.g., we could also allow for the edge between $Z$ and $W$ to point towards $W$, that is, for $Z$ to be a cause of $W$.] For simplicity, we additionally assume that $E[W] = E[Z] = E[X] = E[h(W, U, \epsilon_Y)] = 0$. We consider models of the form $M = (m, q, g, f, h)$ and assume that either there is no edge from $V$ to $W$, i.e., $W = m(\epsilon_W, U)$, or no edge between $W$ and $U$, i.e., $W = m(\epsilon_W, V)$. As before, $M^0 = (m^0, q^0, g^0, f^0, h^0)$ denotes the data-generating model and $f^0(\cdot) = \varphi(\cdot)^\top\theta^0$ for some $\theta^0 \in \Theta \subseteq \mathbb{R}^p$, with a (known) function $\varphi : \mathbb{R}^d \to \mathbb{R}^p$.

Due to the unobserved confounding, it may be the case that $Y - f^0(X)$ and $Z$ are dependent and, as a result, restriction (3) may not hold for the true model $M^0$. But as we assume that there is either no confounding between $Z$ and $W$ or between $W$ and $Y$, we can instead use the conditional independence restriction
$$Z \perp\!\!\!\perp Y - \varphi(X)^\top\theta \mid W \qquad (24)$$
to identify $f^0$. A corresponding sufficient and necessary identifiability condition for $f^0$ is
$$Z \perp\!\!\!\perp h(W, U, \epsilon_Y) + \varphi(X)^\top\tau \mid W \quad\Longrightarrow\quad \tau = 0.$$
Under this condition we could proceed as for the unconditional independence restriction and estimate $f^0$ using a loss that is minimized if restriction (7) is satisfied, e.g., using a conditional independence measure.

We can avoid using a conditional independence restriction in the following two special cases. First, consider the case in which $W \perp\!\!\!\perp V$ and, in addition, assume $q(W, \epsilon_Z) = q_1(W) + \epsilon_Z$. Then we can use the following independence restriction:
$$Z - q_1(W) \perp\!\!\!\perp Y - \varphi(X)^\top\theta. \qquad (25)$$
Under our assumptions, $q^0_1(W) = E[Z \mid W]$ and therefore $q^0_1$ is identifiable. The corresponding identifiability condition becomes
$$Z - q_1(W) \perp\!\!\!\perp h(W, U, \epsilon_Y) + \varphi(X)^\top\tau \quad\Longrightarrow\quad \tau = 0. \qquad (26)$$
Second, consider the case in which we instead assume that $W \perp\!\!\!\perp U$ and that $h(W, U, \epsilon_Y) = h_1(W) + h_2(U, \epsilon_Y)$. Then we can use the independence restriction
$$Y - \varphi(X)^\top\theta - \varphi(W)^\top\gamma \perp\!\!\!\perp (Z, W), \qquad (27)$$
where we assume that $h^0_1(\cdot) = \varphi(\cdot)^\top\gamma^0$ for $\gamma^0 \in \mathbb{R}^t$. [Footnote 7: For simplicity, we use the same basis $\varphi$.] The identifiability condition becomes
$$(Z, W) \perp\!\!\!\perp h_2(U, \epsilon_Y) + \varphi(X)^\top\tau_1 + \varphi(W)^\top\tau_2 \quad\Longrightarrow\quad \tau_1 = 0. \qquad (28)$$
This identifiability condition translates to a corresponding moment restriction and, as in the case of unconditional IV, we can again show that the independence version leads to strictly stronger identifiability.

Proposition B.1 (CIV - Identifiability of the independence restriction is strictly stronger). Consider a conditional IV model $M^0$ and assume that $\varphi(\cdot) = (\varphi_1(\cdot), \dots, \varphi_p(\cdot))$ is a collection of basis functions such that $f^0(\cdot) = \varphi(\cdot)^\top\theta^0$ for some $\theta^0 \in \mathbb{R}^p$ and $h^0_1(\cdot) = \varphi(\cdot)^\top\gamma^0$ for some $\gamma^0 \in \mathbb{R}^t$. Then, if there exist $k \in \mathbb{N}$ and $\eta : \mathbb{R}^{r+t} \to \mathbb{R}^k$ such that $\theta^0$ is identifiable from the moment restriction
$$E\big[\eta(Z, W)\big(Y - \varphi(X)^\top\theta - \varphi(W)^\top\gamma\big)\big] = 0,$$
then $\theta^0$ is also identifiable from the independence restriction (27). Furthermore, there exists a conditional IV model such that $\theta^0$ is identifiable from the independence restriction (27) but not from the corresponding moment restriction.
An example for that is the model
$$V := \epsilon_V, \qquad U := \epsilon_U, \qquad W := V + \epsilon_W, \qquad Z := W + V + \epsilon_Z, \qquad X := ZU + \epsilon_X, \qquad Y := X + W + U + \epsilon_Y,$$
where $\epsilon_Z, \epsilon_V, \epsilon_U, \epsilon_W, \epsilon_X$ and $\epsilon_Y$ are jointly independent standard Gaussian variables and we consider the basis functions $\varphi(X) = (X)$ and $\varphi(W) = (W)$.

Proof. For the first claim, consider a $\theta^0$ that is not identifiable from restriction (27). Then there exists a $\tau_1 \neq 0$ for which condition (28) is violated. By the independence statement for $\tau_1$, it holds for all $k \in \mathbb{N}$, all functions $\eta : \mathbb{R}^{r+t} \to \mathbb{R}^k$ and some $\tau_2 \in \mathbb{R}$ that
$$\mathrm{Cov}\big[\eta(Z, W),\; h_2(U, \epsilon_Y) + \varphi(X)^\top\tau_1 + \varphi(W)^\top\tau_2\big] = 0.$$
But as $(Z, W) \perp\!\!\!\perp h_2(U, \epsilon_Y)$ it also holds that $\mathrm{Cov}[\eta(Z, W), h_2(U, \epsilon_Y)] = 0$. By the additivity of the covariance it follows that
$$\mathrm{Cov}\big[\eta(Z, W),\; \varphi(X)^\top\tau_1 + \varphi(W)^\top\tau_2\big] = 0,$$
that is, the identifiability condition corresponding to the moment restriction is violated as well.

For the second claim, note that in our example SCM, $E[X \mid (Z, W)] = 0$. It follows that for all $k \in \mathbb{N}$ and functions $\eta : \mathbb{R}^{r+t} \to \mathbb{R}^k$, $\mathrm{Cov}[\eta(Z, W), X] = 0$. We can therefore not uniquely identify the causal function $f^0(x) = x$ with the moment restriction. We now consider condition (28). Consider
$$K_\tau := U + \epsilon_Y + \tau_1(ZU + \epsilon_X) + \tau_2 W.$$
It holds that
$$\begin{aligned}
E[Z^2 K_\tau^2] &= E\big[Z^2(U + \epsilon_Y + \tau_1(ZU + \epsilon_X) + \tau_2 W)^2\big] = E[Z^2U^2] + E[Z^2\epsilon_Y^2] + \tau_1^2 E[Z^4U^2] + \tau_1^2 E[Z^2\epsilon_X^2] + \tau_2^2 E[Z^2W^2] = 12 + 114\tau_1^2 + 30\tau_2^2,\\
E[Z^2]\,E[K_\tau^2] &= E[Z^2]\,E\big[(U + \epsilon_Y + \tau_1(ZU + \epsilon_X) + \tau_2 W)^2\big] = E[Z^2]\big(E[U^2] + E[\epsilon_Y^2] + \tau_1^2 E[Z^2U^2] + \tau_1^2 E[\epsilon_X^2] + \tau_2^2 E[W^2]\big) = 6(2 + 7\tau_1^2 + 2\tau_2^2) = 12 + 42\tau_1^2 + 12\tau_2^2.
\end{aligned}$$
These two terms are only equal if $72\tau_1^2 + 18\tau_2^2 = 0$, which requires that $\tau_1 = 0$. Therefore, condition (28) holds, meaning we can identify the causal function $f^0(x) = x$.

We now propose two procedures, one for each of the identifiability conditions we discussed. We conjecture that combining both leads to a doubly robust estimator. For restriction (25), we propose the following estimation procedure. Given an i.i.d. sample $(x_i, y_i, z_i, w_i)_{i=1}^n$ of the variables $(X, Y, Z, W)$, we can first use any flexible nonparametric estimator to estimate $q^0_1$. Based on our estimate $\hat q_1$, we then learn
$$\hat f \in \arg\min_{f\in\mathcal{F}} \widehat{\mathrm{HSIC}}\big((r^f_i, z^{\hat q_1}_i)_{i=1}^n;\, k_R, k_Z\big)$$
as our estimate for $f^0$, where $r^f_i = y_i - f(x_i)$ and $z^{\hat q_1}_i = z_i - \hat q_1(w_i)$, and $\mathcal{F}$ is a function class. Here, $\widehat{\mathrm{HSIC}}$ is the same empirical HSIC estimator as in Section 3.

For restriction (27), we propose the following estimation procedure. Given an i.i.d. sample $(x_i, y_i, z_i, w_i)_{i=1}^n$ of the variables $(X, Y, Z, W)$, we estimate $f^0$ as the first entry of
$$\arg\min_{f\in\mathcal{F},\, k\in\mathcal{K}} \widehat{\mathrm{HSIC}}\big((r^{f,k}_i, (z_i, w_i))_{i=1}^n;\, k_R, k_{Z,W}\big),$$
where $r^{f,k}_i = y_i - f(x_i) - k(w_i)$, and $\mathcal{F}$ and $\mathcal{K}$ are function classes. Here, $\widehat{\mathrm{HSIC}}$ is again the same empirical HSIC estimator as in Section 3.

C. Algorithms: HSIC-X and HSIC-X-pen

We provide details for HSIC-X in Algorithm 1 and for HSIC-X-pen in Algorithm 2. We propose to choose the tuning parameter $\lambda$ for HSIC-X-pen as the largest possible value for which an HSIC-based independence test between the estimated residuals and the instruments is not rejected (see Section 3.1).
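Both algorithms below repeatedly evaluate and differentiate the empirical HSIC loss $L(\theta) = \mathrm{tr}(KHLH)$ on a mini-batch. The following minimal PyTorch sketch of this loss is our own illustration (it is not the released implementation): it assumes Gaussian kernels with fixed, user-supplied bandwidths (in the algorithms the bandwidths are chosen by the median heuristic) and omits any constant normalization factor, which does not affect the minimizer.

```python
import torch

def gaussian_gram(v: torch.Tensor, bandwidth: float) -> torch.Tensor:
    """Gram matrix G_ij = exp(-||v_i - v_j||^2 / (2 * bandwidth^2)) for v of shape (n, d)."""
    sq_dists = torch.cdist(v, v) ** 2
    return torch.exp(-sq_dists / (2.0 * bandwidth ** 2))

def hsic_loss(residuals: torch.Tensor, instruments: torch.Tensor,
              bw_r: float, bw_z: float) -> torch.Tensor:
    """(Biased, unnormalized) empirical HSIC tr(K H L H) between residuals of shape (n, 1)
    and instruments of shape (n, d_Z); differentiable, so it can be minimized by SGD."""
    n = residuals.shape[0]
    K = gaussian_gram(residuals, bw_r)
    L = gaussian_gram(instruments, bw_z)
    H = torch.eye(n, device=residuals.device) - torch.ones(n, n, device=residuals.device) / n
    return torch.trace(K @ H @ L @ H)

# Hypothetical usage for one gradient step of a linear model f_theta(x) = x @ theta:
#   r = (y - x @ theta).unsqueeze(1)          # residuals, shape (n, 1)
#   loss = hsic_loss(r, z, bw_r=1.0, bw_z=1.0)
#   loss.backward()                           # gradient w.r.t. theta, used in the update step
```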
Algorithm 1: HSIC-X

Input: observations $(x_i, y_i, z_i)_{i=1}^n$, kernels $k_R$ and $k_Z$, batch size $m$, learning rate $\gamma$, number of gradient steps per cycle $k$, significance level $\alpha$ and maximum number of restarts $t$
  Initialize a restart counter $\ell \leftarrow 0$
  Compute the bandwidth $\sigma_Z$ of $k_Z$ with the median heuristic
  repeat
    if $\ell = 0$ then
      Initialize parameters $\theta$ at the OLS solution
    else
      Randomly initialize parameters $\theta$
    end if
    repeat
      Compute the residuals $r^\theta_i := y_i - f_\theta(x_i)$
      Compute the bandwidth $\sigma_{R^\theta}$ of $k_R$ with the median heuristic
      repeat $k$ times
        Sample a mini-batch $(x_j, y_j, z_j)_{j=1}^m$
        Compute the residuals $r^\theta_j := y_j - f_\theta(x_j)$
        Compute the HSIC loss $L(\theta) := \mathrm{tr}(KHLH)$
        Update parameters $\theta \leftarrow \theta - \gamma\nabla L(\theta)$
      end
    until convergence
    Compute the residuals $r^\theta_i := y_i - f_\theta(x_i)$
    Compute the p-value $p$ of the independence test between $(r^\theta_i)_{i=1}^n$ and $(z_i)_{i=1}^n$
    Update the counter $\ell \leftarrow \ell + 1$
  until $p > \alpha$ or $\ell \ge t$
Output: final estimate $\hat f_\theta(\cdot) := f_\theta(\cdot) + \frac{1}{n}\sum_{i=1}^n \big(y_i - f_\theta(x_i)\big)$

Algorithm 2: HSIC-X-pen

Input: observations $(x_i, y_i, z_i)_{i=1}^n$, kernels $k_R$ and $k_Z$, batch size $m$, learning rate $\gamma$, number of gradient steps per cycle $k$, significance level $\alpha$, maximum number of restarts $t$, penalization parameter $\lambda$ and prediction loss $\ell$
  Initialize a restart counter $l \leftarrow 0$
  Compute the bandwidth $\sigma_Z$ of $k_Z$ with the median heuristic
  repeat
    if $l = 0$ then
      Initialize parameters $\theta$ at the OLS solution
    else
      Randomly initialize parameters $\theta$
    end if
    repeat
      Compute the residuals $r^\theta_i := y_i - f_\theta(x_i)$
      Compute the bandwidth $\sigma_R$ of $k_R$ with the median heuristic
      repeat $k$ times
        Sample a mini-batch $(x_j, y_j, z_j)_{j=1}^m$
        Compute the residuals $r^\theta_j := y_j - f_\theta(x_j)$
        Compute the loss $L(\theta) := \lambda \sum_j \ell(r^\theta_j) + (1 - \lambda)\,\mathrm{tr}(KHLH)$
        Update parameters $\theta \leftarrow \theta - \gamma\nabla L(\theta)$
      end
    until convergence
    Compute the residuals $r^\theta_i := y_i - f_\theta(x_i)$
    Compute the p-value $p$ of the independence test between $(r^\theta_i)_{i=1}^n$ and $(z_i)_{i=1}^n$
    Update the counter $l \leftarrow l + 1$
  until $p > \alpha$ or $l \ge t$
Output: final estimate $\hat f_\theta(\cdot) := f_\theta(\cdot) + \frac{1}{n}\sum_{i=1}^n \big(y_i - f_\theta(x_i)\big)$

D. Additional Experiment Details and Results

D.1. Multi-dimensional Setting

We consider the following IV models in our experiments:
$$Z := \epsilon_Z, \qquad U := \epsilon_U, \qquad X := J Z^2 \epsilon_X + BZ + U, \qquad Y := \beta^\top X - 2U + \epsilon_Y,$$
where $\epsilon_U, \epsilon_Y \overset{\text{i.i.d.}}{\sim} N(0, 1)$, $\epsilon_Z \sim N(0, I_{d_Z})$ and $\epsilon_X \sim N(0, I_{d_X})$ are independent noise variables, $d_X$ and $d_Z$ denote the dimensions of $X$ and $Z$, $J$ and $B$ are $d_X \times d_Z$ matrices controlling the influence of the instruments $Z$, and $\beta$ is the causal parameter. In the experiment, $J$ is a $d_X \times d_Z$ all-ones matrix, $B_{i,j} \sim \mathrm{Uniform}(-4, 4)$ and $\beta \sim N(0, I_{d_X})$.

Figure 4 reports the MSE of each method as the dimension $d_Z$ of the instruments varies for a fixed dimension $d_X$ of the predictors. When $d_Z < d_X$, the identifiability result suggests that 2SLS cannot consistently estimate the causal function (which is reflected by the high values of the MSE), while HSIC-X can still gain some improvement over the OLS solution. When $d_Z \ge d_X$, the performance of our method is on par with that of 2SLS.

D.2. Known Basis Functions

Figure 5 shows some of the functions estimated by HSIC-X, OLS and 2SLS, along with the underlying causal function, under various settings. In short, when the instrument has no effect on the mean of $X$ ($\alpha = 0$), 2SLS fails to produce sensible estimates because of the non-identifiability under the moment restriction. On the other hand, the proposed method (HSIC-X) can still identify the causal function thanks to the independence restriction and yields reasonable estimates in all of the settings.

[Figure 4 (plot): MSE against the instrument dimension $d_Z \in \{1, \dots, 5\}$ for the methods 2SLS, OLS, Oracle and HSIC-X, with one panel for $d_X = 3$ and one for $d_X = 5$; see the caption below.]

Figure 4.
MSEs of different estimators under varying size $d_Z$ of the instruments when (a) the predictors' dimension $d_X$ is 3, and (b) the predictors' dimension is 5. 2SLS is inconsistent and underperforms OLS when $d_Z < d_X$, while HSIC-X shows a substantial improvement over OLS in such settings.

D.3. Distribution Generalization

In Section 5.2, we consider both linear and nonlinear underlying causal functions. The linear function is defined as $f^0_{\mathrm{lin}}(X_1, X_2) := X_1 + X_2$, and the nonlinear function is defined as $f^0_{\mathrm{nonlin}}(X) := \sum_{j=1}^{10} w_j\, e^{-\frac{1}{3}\|X - c_j\|_2^2}$, where the $c_j$ are drawn from a uniform distribution over a two-dimensional grid on $[-5, 5] \times [-5, 5]$ and $w_1, \dots, w_{10} \overset{\text{i.i.d.}}{\sim} N(0, 4)$. We employ the correct-basis and neural-network function classes in our method and the baselines. For the correct basis, we consider $\mathcal{F} := \{f(\cdot) = \varphi(\cdot)^\top\theta \mid \theta \in \mathbb{R}^p\}$ with $p = 3$ and $\varphi(x) = [1, x_1, x_2]$ in the linear case, and $p = 11$ and $\varphi(x) = [1, e^{-\frac{1}{3}\|x - c_1\|_2^2}, \dots, e^{-\frac{1}{3}\|x - c_{10}\|_2^2}]$ in the nonlinear case. For the neural network function class, we use a neural network with one hidden layer of size 64. We optimize our model using the Adam optimizer with a learning rate of 0.005 and a batch size of 256, and use the R package Anchor Regression (https://github.com/simzim96/Anchor Regression) for the Anchor Regression baseline. Lastly, the tuning parameter $\gamma$ of Anchor Regression is set to 100, and the tuning parameter $\lambda$ for HSIC-X-pen is chosen as the largest possible value for which an HSIC-based independence test between the estimated residuals and the instruments is not rejected (see Section 3.1).

[Figure 5 (plots): eight panels of estimated functions, (a) $f^0$: Linear, $P_{\epsilon_Z}$: Binary, $\alpha = 0$; (b) $f^0$: Linear, $P_{\epsilon_Z}$: Binary, $\alpha = 0.4$; (c) $f^0$: Linear, $P_{\epsilon_Z}$: Gaussian, $\alpha = 0$; (d) $f^0$: Linear, $P_{\epsilon_Z}$: Gaussian, $\alpha = 0.4$; (e) $f^0$: Nonlinear, $P_{\epsilon_Z}$: Binary, $\alpha = 0$; (f) $f^0$: Nonlinear, $P_{\epsilon_Z}$: Binary, $\alpha = 1.0$; (g) $f^0$: Nonlinear, $P_{\epsilon_Z}$: Gaussian, $\alpha = 0$; (h) $f^0$: Nonlinear, $P_{\epsilon_Z}$: Gaussian, $\alpha = 1.0$.]

Figure 5. Estimated causal functions with different estimators under varying $\alpha$ values from the experiment in Section 5.1 (the known basis functions setting). The legend reads as follows: "y_mse" represents OLS, "y_hsic" represents HSIC-X, "y_2sls" represents 2SLS and "f_x" represents the underlying causal function. Each line represents an estimated function from one simulation run. The grey dots are the observations drawn from the corresponding IV model.
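For concreteness, the following minimal sketch (our own illustration, not the authors' code) generates a random draw of the nonlinear causal function $f^0_{\mathrm{nonlin}}$ and the corresponding "correct basis" described above; the resolution of the grid from which the centers $c_j$ are drawn is our assumption, since only the range $[-5, 5] \times [-5, 5]$ is specified.

```python
import numpy as np

rng = np.random.default_rng(0)

# Centers c_j drawn uniformly from a two-dimensional grid on [-5, 5] x [-5, 5]
# (an 11 x 11 grid is an assumption) and weights w_j ~ N(0, 4), i.e. standard deviation 2.
grid = np.linspace(-5.0, 5.0, 11)
centers = np.stack([rng.choice(grid, size=10), rng.choice(grid, size=10)], axis=1)  # (10, 2)
weights = rng.normal(loc=0.0, scale=2.0, size=10)

def f0_nonlin(x: np.ndarray) -> np.ndarray:
    """f0(x) = sum_j w_j * exp(-||x - c_j||^2 / 3) for inputs x of shape (n, 2)."""
    sq_dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)  # (n, 10)
    return (weights * np.exp(-sq_dists / 3.0)).sum(axis=1)

def correct_basis(x: np.ndarray) -> np.ndarray:
    """The 11-dimensional basis [1, exp(-||x - c_1||^2 / 3), ..., exp(-||x - c_10||^2 / 3)]."""
    sq_dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.concatenate([np.ones((x.shape[0], 1)), np.exp(-sq_dists / 3.0)], axis=1)

x = rng.uniform(-5.0, 5.0, size=(4, 2))
print(f0_nonlin(x))            # values of the causal function at four test points
print(correct_basis(x).shape)  # (4, 11)
```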