# Dual Instrumental Variable Regression

Krikamol Muandet (Max Planck Institute for Intelligent Systems, Tübingen, Germany; krikamol@tuebingen.mpg.de), Arash Mehrjou (Max Planck Institute for Intelligent Systems and ETH Zürich, Zürich, Switzerland; arash.mehrjou@inf.ethz.ch), Si Kai Lee (Booth School of Business, University of Chicago, USA; sikai.lee@chicagobooth.edu), Anant Raj (Max Planck Institute for Intelligent Systems, Tübingen, Germany; anant.raj@tuebingen.mpg.de)

We present a novel algorithm for non-linear instrumental variable (IV) regression, Dual IV, which simplifies traditional two-stage methods via a dual formulation. Inspired by problems in stochastic programming, we show that two-stage procedures for non-linear IV regression can be reformulated as a convex-concave saddle-point problem. Our formulation enables us to circumvent the first-stage regression, which is a potential bottleneck in real-world applications. We develop a simple kernel-based algorithm with an analytic solution based on this formulation. Empirical results show that we are competitive with existing, more complicated algorithms for non-linear instrumental variable regression.

1 Introduction

Inferring causal relationships under the influence of unobserved confounders remains one of the most challenging problems in economics, health care, and social sciences [1, 2]. A typical example in economics is the study of returns from schooling [3], which attempts to measure the causal effect of education on labor market earnings. For each individual, the treatment variable X represents the level of education and the outcome Y represents how much they earn. However, one's level of education and income are likely confounded by socioeconomic status or other unobserved confounding factors H [1, Ch. 4]. Since randomized controlled trials are infeasible in most economic studies, economists have turned to instrumental variables (IVs), or instruments, derived from naturally occurring random experiments to overcome unobserved confounding. Informally, instrumental variables Z are defined as variables that are associated with the treatment X, affect the outcome Y only through X, and do not share common causes with Y. For instance, the season of birth was used as an instrument in [4] to estimate the impact of compulsory schooling on earnings. Because of compulsory school attendance laws, an individual's season of birth, which is likely to be random, affects how long they actually remain in school, but not their earnings. Figure 1 illustrates this example. Finding valid instruments for specific problems is an essential task in econometrics [1] and epidemiology [5].

Although IV analysis is widely used, the statistical tools employed for estimating causal effects are fairly rudimentary. Most applications of instrumental variables utilise a two-stage procedure [1, 6-8]. For instance, two-stage least squares (2SLS) relies on the assumption that the relationship between X and Y is linear [9]. It first estimates the conditional mean E[X | Z = z] via linear regression and then regresses Y on the estimate of E[X | Z = z] to obtain an estimate of the causal effect. Since the first-stage estimate is by construction independent of the confounders, the resultant causal estimate is therefore free from hidden confounding.

Figure 1: A data generating process (DGP) with a hidden confounder H and an instrument Z. In the schooling example, the treatment X is education, the outcome Y is income, H is socioeconomic status, and Z is season of birth. A variation in X comes from both H and Z. Intuitively speaking, the external source of variation from Z can help improve the estimation by removing the effect of H on X.
In the non-linear setting, however, a poorly fitted first-stage regression may result in inaccurate second-stage estimates [1, Ch. 4.6].

In this paper, we propose a novel procedure, Dual IV, to directly estimate the structural causal function. Unlike previous works, which extend 2SLS by employing non-linear models in place of their linear counterparts [7, 8], we solve the dual problem, which can be expressed as a convex-concave saddle-point problem. Based on this framework, we develop a consistent reproducing kernel Hilbert space (RKHS) based algorithm. Our formulation was inspired by the mathematical resemblance of non-linear IV to two-stage problems in stochastic programming [10-12].

The rest of the paper is organized in the following manner. Section 2 introduces the IV regression problem, reviews related work, and identifies current limitations. We present our formulation in Section 3, followed by the kernelized estimation method in Section 4. Then, Section 5 reports empirical comparisons between Dual IV and existing algorithms. Finally, we discuss the limitations of our procedure and suggest future directions in Section 6. All proofs can be found in Appendix E.

2 Instrumental variable regression

Let X, Y, and Z be treatment, outcome, and instrumental variable(s) taking values in X, Y, and Z, respectively. In this work, we assume that Y ⊆ R, and that X and Z are Polish spaces. We also assume that Y is bounded, i.e., |Y| < M < ∞ almost surely. Moreover, we denote unobserved confounder(s) by H. The underlying data generating process (DGP) is described by the causal graph in Figure 1 equipped with the following structural causal model (SCM):

Y = f(X) + ε,  E[ε] = 0,  (1)

where f is an unknown, potentially non-linear continuous function and ε denotes the additive noise which depends on the hidden confounder(s) H. If E[ε | X] = 0, we can estimate f consistently from observational data via standard least-squares regression. This allows us to identify E[Y | do(X = x)], where do(X = x) represents an intervention on X in which its value is set to x [13]. For non-expert readers, we elaborate that do(X = x) here denotes a mathematical operator which simulates physical interventions by setting the value of X to x while keeping the rest of the model unchanged [14, Sec. 3.2.1]. That is, the conditional expectation E[Y | do(X = x)] is computed with respect to the interventional distribution P(Y | do(X = x)). We can estimate P(Y | do(X = x)) if it is possible to directly manipulate X and then observe the resulting outcome Y. In Figure 1, for instance, one may assign different levels of education to people and then observe their subsequent levels of income in the labor market. Unfortunately, such an experiment is not always possible and we only have access to an observational distribution P(Y | X = x), which can be different from P(Y | do(X = x)). The discrepancy between interventional and observational distributions may result from the unobserved socioeconomic status, as illustrated in Figure 1.

When hidden confounders exist between X and Y, the error term ε in (1) is generally correlated with X. Hence, E[ε | X] ≠ 0 and it follows from (1) that

E[Y | X = x] = f(x) + E[ε | X = x],  (2)

which implies that E[Y | do(X = x)] ≠ E[Y | X = x]; the small simulation sketch below illustrates this gap numerically.
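To make the confounding problem concrete, here is a minimal numerical sketch (not taken from the paper) of a hypothetical linear DGP with a hidden confounder H and an instrument Z. Ordinary least squares estimates E[Y | X], which mixes the causal effect with the confounding path through H, whereas the simple IV (Wald/2SLS) estimator uses only the Z-induced variation in X and recovers the causal coefficient. The coefficient values and noise scales are illustrative assumptions.

```python
# A minimal sketch (not from the paper): OLS is biased under hidden confounding,
# while an instrument satisfying assumptions (i)-(iii) below recovers the effect.
import numpy as np

rng = np.random.default_rng(0)
n, beta = 100_000, 0.7

H = rng.normal(size=n)                         # hidden confounder (e.g., socioeconomic status)
Z = rng.normal(size=n)                         # instrument (e.g., season of birth)
X = Z + H + 0.1 * rng.normal(size=n)           # treatment depends on both Z and H
Y = beta * X + H + 0.1 * rng.normal(size=n)    # outcome depends on X and H

# OLS slope estimates Cov(X, Y)/Var(X), which absorbs the confounding path through H.
beta_ols = np.cov(X, Y)[0, 1] / np.var(X)

# The IV (Wald / 2SLS) estimator only uses the variation in X coming from Z,
# which is independent of H, and therefore recovers beta.
beta_iv = np.cov(Z, Y)[0, 1] / np.cov(Z, X)[0, 1]

print(f"true beta = {beta:.2f}, OLS = {beta_ols:.2f}, IV = {beta_iv:.2f}")
# Expected (approximately): true beta = 0.70, OLS = 1.20, IV = 0.70
```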
Thus, standard least-squares regression no longer provides a valid estimate of f for making a prediction about the outcome of an intervention on X [7, 8, 15]. To handle hidden confounders, we assume access to instrumental variable(s) Z which satisfy the following assumptions: (i) Relevance: Z has a causal influence on X. (ii) Exclusion restriction: Z affects Y only through X, i.e., Y ⊥⊥ Z | X, ε. (iii) Unconfounded instrument(s): Z is independent of the error, i.e., ε ⊥⊥ Z.

The properties of Z imply that E[ε | Z] = 0. Taking the expectation of (1) w.r.t. Y conditioned on Z yields the following integral equation:

E[Y | Z] = ∫ f(x) dP(x | Z),  (3)

which is a Fredholm integral equation of the first kind. Recent works in nonparametric IV regression have adopted this perspective [7, 8, 15, 16], despite the fact that solving (3) directly is an ill-posed problem as it involves inverting linear compact operators [15, 17, 18]. To illustrate the role of an instrument, we consider two special cases. When X is perfectly correlated with Z, the treatment is uncorrelated with the hidden confounder; in other words, we recover the strong ignorability assumption [19, 20] required for causal inference. When Z is independent of X, the instrument is useless as it has no predictive power over the treatment, so the structural function f is unidentifiable from the data. Therefore, the most interesting cases lie between these two extremes, especially when X and Z are weakly correlated; see, e.g., [21, 22] and [1, pp. 205-216].

2.1 Previous work

Early applications of instrumental variables often assume linear relationships between Z and X as well as between X and Y [1, 23]. When there is a single endogenous variable and a single instrument, the structural parameter can be estimated consistently by the instrumental variable (IV) estimator [23]. Interestingly, we can obtain this estimate using a two-stage procedure: regress X on Z using ordinary least squares (OLS) to calculate the predicted value of X, and use that as an explanatory variable in the structural equation to estimate the structural parameter using OLS. When there are multiple instruments, the two-stage least squares (2SLS) estimator is obtained by using all the instruments simultaneously in the first-stage regression. Wooldridge [24, Theorem 5.3] asserts that the 2SLS estimator is the most efficient IV estimator; see, e.g., [1, 24] for a detailed exposition.

Recently, several extensions of 2SLS have been proposed to overcome the linearity constraint. The first line of work replaces linear regression by a linear projection onto a set of known basis functions [15, 16, 25, 26]. Chen and Christensen [27] provide a uniform convergence rate for this approach. However, there exists no principled way of choosing the appropriate set of basis functions. The second line of work replaces the first-stage regression by a conditional density estimate of P(X|Z) [28, 29]. Despite being more flexible, such approaches are known to suffer from the curse of dimensionality [30, Ch. 1]. Other extensions of 2SLS are the Deep IV [8] and Kernel IV [7] algorithms. In [8], (3) is solved by first estimating P(X|Z) with a mixture of deep generative models, on which f is learned using another deep neural network. Instead of neural networks, Singh et al. [7] propose to model the first-stage regression using the conditional mean embedding of P(X|Z) [31-33], which is then used in the second-stage kernel ridge regression.

The curse of two-stage methods. Two-stage procedures have two fundamental issues.
First, such procedures violate Vapnik's principle [34]: "[...] when solving a problem of interest, do not solve a more general problem as an intermediate step [...]". Specifically, estimating the conditional density [8] or the conditional mean embedding [7] via regression in the first stage can be harder than estimating the parameter of interest in the second stage. The first stage is even referred to as the "forbidden regression" in econometrics [1, Ch. 4.6]. On top of that, we usually only observe a single sample from each P(X|Z = z), which further increases the difficulty of the task. Second, although two-stage procedures are asymptotically consistent, the first-stage estimate creates a finite-sample bias in the second-stage estimate [1, Sec. 4.6.4]. This bias can be alleviated through sample splitting [35], which is also used in [7, 8]. Thus, two-stage procedures are less sample efficient and could yield biased estimates when run on the smaller datasets common in economics and social sciences.

The generalized method of moments (GMM) framework provides another set of popular approaches for estimating f [36, 37]. Unlike two-stage procedures, GMM-based algorithms find a function f that satisfies the orthogonality condition E[ε | Z] = 0 directly. Specifically, if g1, g2, ..., gm are arbitrary real-valued functions, the orthogonality condition implies that E[(Y − f(X))gj(Z)] = 0 for j = 1, ..., m. The GMM estimate of f can then be obtained by minimizing the quadratic form Σ_{j=1}^m ψ(f, gj)², where ψ(f, g) := E[(Y − f(X))g(Z)]. This estimator can be interpreted as a generalization of the 2SLS estimator in the linear setting [6]. Recently, extensions of GMM-based methods where both f and g are parameterized by deep neural networks have successfully been used to solve non-linear IV regression [38, 39]. In contrast, Muandet et al. [40] consider the set of RKHS functions, which allows for an analytic formulation of the orthogonality condition.

3 Dual IV

In this section, we reformulate the integral equation (3) as an empirical risk minimization problem and present the Dual IV algorithm.

3.1 Empirical risk minimization

Let ℓ : R × R → R+ be a proper, convex, and lower semi-continuous loss function for any value in its first argument.¹ Let F be an arbitrary class of continuous functions which we assume contains the f that fulfills the integral equation (3). Then, we can formulate (3) as

min_{f∈F} R(f) := E_{YZ}[ℓ(Y, E_{X|Z}[f(X)])],  (4)

where R(f) denotes the expected risk of f. To understand how (3) and (4) are related, let us consider the squared loss ℓ(y, y′) = (y − y′)² and define h(z) := E_{X|z}[f(X)]. Then, the solution to (4) is the minimum mean square error (MMSE) estimator h*(z) := E[Y | z], which is exactly the LHS of (3). If there exists no f ∈ F for which h(z) = E[Y | z], we use h(z) as the best MMSE approximation. The key challenge here is that if f is not continuous in h(z), it is not assured to be consistently estimated even if h(z) is estimated correctly [15]. We defer further discussion to Section 3.4. In addition, it remains cumbersome to solve (4) directly because of the inner expectation. To circumvent this, we look to similar two-stage problems in stochastic programming [10, 11]. For example, in [11], the problem of learning from conditional distributions was formulated in a similar fashion to (4). Moreover, [12] proposes the deconditional mean embedding (DME), which solves the integral equation (3) by performing a closed-form inversion of the conditional mean embedding of P(X|Z) (see [33, 42] for a review).
By contrast, we solve the initial problem in (3) by resorting to the dual formulation of (4).

3.2 Dual formulation

To derive the dual form of (4), we employ two existing results, interchangeability and Fenchel duality, which we review here; see, e.g., [11, Lemma 1], [43, Ch. 14], and [10, Ch. 7] for more details.

Theorem 1 (Interchangeability). Let ω be a random variable on Ω and, for any ω ∈ Ω, let f(·, ω) : R → (−∞, +∞) be a proper and upper semi-continuous concave function. Then,

E_ω[max_{u∈R} f(u, ω)] = max_{u(·)∈U(Ω)} E_ω[f(u(ω), ω)],  (5)

where U(Ω) := {u(·) : Ω → R} is the entire space of functions defined on the support Ω.

Definition 2 (Fenchel duality). Let ℓ : R × R → R+ be a proper, convex, and lower semi-continuous loss function for any value in its first argument, and let ℓ*_y := ℓ*(y, ·) be the convex conjugate of ℓ_y := ℓ(y, ·), which is also proper, convex, and lower semi-continuous w.r.t. the second argument. Then, ℓ_y(v) = max_u {uv − ℓ*_y(u)}. The maximum is achieved at v ∈ ∂ℓ*_y(u), or equivalently u ∈ ∂ℓ_y(v).

Applying interchangeability and Fenchel duality to (4) yields the expected loss

R(f) = E_{YZ}[max_{u∈R} {E_{X|Z}[f(X)]u − ℓ*_Y(u)}]
     = max_{u∈U} E_{YZ}[E_{X|Z}[f(X)]u(Y, Z) − ℓ*_Y(u(Y, Z))]
     = max_{u∈U} E_{XYZ}[f(X)u(Y, Z)] − E_{YZ}[ℓ*_Y(u(Y, Z))],

where U is the space of continuous functions over Y × Z. Hence, (4) can be reformulated as

min_{f∈F} max_{u∈U} E_{XYZ}[f(X)u(Y, Z)] − E_{YZ}[ℓ*_Y(u(Y, Z))].  (6)

¹The function f is proper if dom f ≠ ∅ and f(x) > −∞ for all x ∈ X. It is lower (upper) semi-continuous at x₀ ∈ X if for every ε > 0 there exists a neighborhood N(x₀) of x₀ such that −ε < (>) f(x) − f(x₀) for all x ∈ N(x₀) [41].

Following [11], we will refer to u ∈ U as the dual function. Note that this function depends only on the outcome Y and the instrument Z, not on the treatment X. The advantages of our formulation (6) over (3) and (4) are twofold. First, there is no need to estimate E_{X|Z}[f(X)] or P(X|Z) explicitly. Second, the target function f appears linearly in (6), which makes the objective convex in f. Since ℓ*_y is also convex, (6) is concave in the dual function u. Hence, (6) is essentially a convex-concave saddle-point problem for which efficient solvers exist [11]. For the squared loss ℓ(y, y′) = (y − y′)², we have ℓ*_y(w) = wy + ½w² (see Appendix A for the derivation) and the saddle-point problem (6) reduces to

min_{f∈F} max_{u∈U} Ψ(f, u) := E_{XYZ}[(f(X) − Y)u(Y, Z)] − ½ E_{YZ}[u(Y, Z)²].  (7)

To solve (7), one can adopt the SGD-based algorithm developed by Dai et al. [11]. Alternatively, we propose in Section 4 a simple algorithm that solves (7) in closed form.

3.3 Interpreting the dual function

The dual function u(y, z) plays an important role in our framework. To understand its role, we consider the minimization and maximization problems in (7) separately. For any f ∈ F, the maximization problem is max_{u∈U} E_{XYZ}[(f(X) − Y)u(Y, Z)] − ½‖u(Y, Z)‖²_{L²(P_{YZ})}, where the first term can be viewed, loosely speaking, as a loss function and the second as a regularizer. Intuitively, we are seeking the u ∈ U that is least orthogonal to the residual. Given u*, the outer minimization problem min_{f∈F} E_{XYZ}[(f(X) − Y)u*(Y, Z)] finds the function f whose residual is most orthogonal to u*. Our procedure clearly differs from previous two-stage methods as the minimization and maximization stages are interdependent. From the causal inference perspective, the residual contains the variation that cannot be explained by the current estimate of f due to hidden confounding. We select u that maximally reweights the residuals according to how inconsistent they are w.r.t. the unconfounded joint distribution of Y and Z.
Given u, we then select f that minimizes the inconsistencies between the residuals and u. Hence, at the equilibrium, we are left with residuals uncorrelated with Y and Z, which can be attributed to noise due to unobserved confounding.

Lastly, we draw a connection between (7) and GMM. Let g1, g2, ..., gm be real-valued functions on Y × Z and ψ(f, g) := E[(Y − f(X))g(Y, Z)]. When U = span{g1, ..., gm}, it is not difficult to show that max_{u∈U} Ψ(f, u) = ½ ψᵀ Λ⁻¹ ψ, where Λ := E_{YZ}[g gᵀ] with g := (g1(Y, Z), ..., gm(Y, Z))ᵀ and ψ := (ψ(f, g1), ..., ψ(f, gm))ᵀ; see Appendix B. That is, minimizing the above over f yields a formulation that strongly resembles the GMM objective, with the dual function u(Y, Z) playing a role similar to that of an instrument. However, we must clarify that u cannot act as an instrument since it depends on Y and thereby violates the exclusion restriction assumption. This ambiguity has been resolved in Liao et al. [44, Appendix F] by resorting to an alternative formulation similar to (4) and (7). Furthermore, we also note that AGMM [38] and Deep GMM [39] rely on minimax optimization, similar to (7), but were formulated based on the GMM framework.

3.4 Theoretical analysis

This section provides the conditions under which the true structural function f can be identified by the optimum of the saddle-point problem (7). We lay out the assumptions needed for the optimal dual function u* to be unique and continuous, show that the saddle-point formulation (7) is equivalent to the problem (4) under the squared loss, and prove that the solution of (7) given u* is indeed f*.

Assumption 1. (i) P(X|Z) is continuous in Z for any values of X. (ii) The function class F is correctly specified, i.e., f* ∈ F.

Following [11], we define the optimal dual function for any pair (y, z) ∈ Y × Z as u*(y, z) ∈ arg max_{u∈R} {E_{X|z}[f(X) − y]u − ½u²}. Since this is an unconstrained quadratic program, u*(y, z) takes the form E_{X|z}[f(X)] − y. Given Assumption 1 and the convexity and continuous differentiability of the loss function, it follows from [11, Proposition 1] that u* is unique and continuous. Next, we show that if (f*, u*) is the saddle point of (7), then f* minimizes the original objective (4). The result follows from plugging u = E_{X|z}[f(X)] − y into the dual loss Ψ(f, u) in (7); see Appendix E.1 for the detailed proof.

Figure 2: The dual function u* w.r.t. the current estimate f in the linear setting (8). For each y and z, u* directly measures the discrepancy between y and E_{X|z}[f(X)]. (Plot legend: Data, True, OLS, Estimated.)

Proposition 3. Let ℓ(y, y′) = ½(y − y′)². Then, for any fixed f, we have R(f) = max_u Ψ(f, u).

By Proposition 3 and the convexity of the loss ℓ(y, y′), we obtain the following result.

Theorem 4. Let ℓ(y, y′) = ½(y − y′)² and assume that Assumption 1 holds. Then, (f*, u*) is the saddle point of the minimax problem min_{f∈F} max_{u∈U} Ψ(f, u).

By virtue of Theorem 4, we can identify the true function f under relatively weak assumptions. In contrast, previous work usually requires stronger assumptions, such as the completeness condition [7, 15], which specifies that the first-stage conditional expectation E_{X|z}[f(X)] is injective, or that h(z) = E_{X|z}[f(X)] is a smooth function of z [7, 26, 27]. Since we do not perform a first-stage regression, we only require that P(X|Z) be continuous in Z for any value of X. The assumption that (4) is correctly specified, i.e., f* ∈ F, is standard in the literature [7, 15, 16]. As we can see, the optimal dual function u*(y, z) = E_{X|z}[f(X)] − y acts as a residual function measuring the discrepancy between y and E_{X|z}[f(X)] [11]; a short verification of this form is sketched below.
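For completeness, the following one-line check (a restatement of the unconstrained quadratic program above, not an excerpt from the paper's appendix) shows why the inner maximization is solved by this residual form:

```latex
% First-order condition of the concave quadratic inner problem, for a fixed f:
\[
  u^*(y,z) \;\in\; \arg\max_{u \in \mathbb{R}}
  \Bigl\{ \mathbb{E}_{X|z}\bigl[f(X) - y\bigr]\, u - \tfrac{1}{2}u^2 \Bigr\}
  \;\Longrightarrow\;
  \mathbb{E}_{X|z}\bigl[f(X) - y\bigr] - u^*(y,z) = 0
  \;\Longrightarrow\;
  u^*(y,z) = \mathbb{E}_{X|z}\bigl[f(X)\bigr] - y .
\]
```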
Remarkably, this makes it possible to approximate R(f) in (4) without computing the expectation E_{X|z}[f(X)] explicitly. We will later exploit this property in selecting hyperparameters. Moreover, since E_{X|z}[f(X)] allows X and Z to have a non-linear relationship, u* can be non-linear even when the true structural function f is linear. This flexibility enables u* to accommodate a larger class of functions that map Z to X. Figure 2 illustrates this given the following generative process:

Y = Xβ + e + δ,  X = (1 − α)Z₁ + αe + γ,  (8)

where e ∼ N(0, σ²), Z₁ ∼ N(0, σ²), γ ∼ N(0, 0.1), and δ ∼ N(0, 0.1). The parameter α controls the strength of the instrument w.r.t. the hidden confounder e. Here, we set n = 300, β = 0.7, β̂ = 0.4, and α = 0.2, where β̂ is an OLS estimate of β. Under this model, we have u*(y, z) = β̂(1 − α)z − y.

4 Kernelized Dual IV

To demonstrate the effectiveness of our framework, we develop a simple kernel-based algorithm using the new formulation (7). To simplify notation, we denote by W := (Y, Z) a random variable taking values in W := Y × Z. We pick F and U to be reproducing kernel Hilbert spaces (RKHSs) associated with positive definite kernels k : X × X → R and l : W × W → R, respectively. Let φ : x ↦ k(x, ·) and ϕ : w ↦ l(w, ·) be the canonical feature maps [45]. We assume both F and U are universal and hence are dense in the space of bounded continuous functions [46, Ch. 4]. Then, for any f ∈ F and u ∈ U, we can rewrite the objective in (7) as

Ψ(f, u) = E_{XW}[f(X)u(W)] − E_{YZ}[Y u(Y, Z)] − ½ E_W[u(W)²] = ⟨C_{WX} f − b, u⟩_U − ½ ⟨u, C_W u⟩_U,  (9)

where b := E_{YZ}[Y ϕ(Y, Z)] ∈ U, C_W := E_W[ϕ(W) ⊗ ϕ(W)] ∈ U ⊗ U is a covariance operator, and C_{WX} := E_{WX}[ϕ(W) ⊗ φ(X)] ∈ U ⊗ F is a cross-covariance operator [47, 48] (see Appendix C). Since (9) is quadratic in u, the maximizer satisfies C_W u = C_{WX} f − b. Substituting u back into (9) yields

f* = arg min_{f∈F} ⟨C_{WX} f − b, C_W⁻¹(C_{WX} f − b)⟩_U = (C_{XW} C_W⁻¹ C_{WX})⁻¹ C_{XW} C_W⁻¹ b.  (10)

We can view (10) as a generalized least squares solution in an RKHS. Since C_W⁻¹ and (C_{XW} C_W⁻¹ C_{WX})⁻¹ may not exist in general, we replace them with the regularized versions (C_W + λ1 I)⁻¹ and (C_{XW} C_W⁻¹ C_{WX} + λ2 I)⁻¹, where I is the identity operator and λ1, λ2 > 0 are regularization parameters. Given an i.i.d. sample (x_i, y_i, z_i)_{i=1}^n from P(X, Y, Z), we define Φ := [φ(x_1), ..., φ(x_n)], Υ := [ϕ(y_1, z_1), ..., ϕ(y_n, z_n)], and y := [y_1, ..., y_n]ᵀ. Then, we can estimate b, C_W, and C_{XW} with their empirical counterparts b̂ := n⁻¹ Σ_{i=1}^n y_i ϕ(y_i, z_i) = n⁻¹ Υ y, Ĉ_{XW} := n⁻¹ Σ_{i=1}^n φ(x_i) ⊗ ϕ(y_i, z_i) = n⁻¹ Φ Υᵀ, and Ĉ_W := n⁻¹ Σ_{i=1}^n ϕ(y_i, z_i) ⊗ ϕ(y_i, z_i) = n⁻¹ Υ Υᵀ. We denote the empirical version of (9) by Ψ̂(f, u) and the estimate of f by f̂. Next, we show that the representer theorem [49] for Ψ̂(f, u) holds for both f and u.

Lemma 5. For any f ∈ F and u ∈ U, there exist f_β = Σ_{i=1}^n β_i k(x_i, ·) and u_α = Σ_{i=1}^n α_i l(w_i, ·) for some α, β ∈ Rⁿ such that Ψ̂(f, u) = Ψ̂(f_β, u_α).

By virtue of Lemma 5, the solution to (10) can be expressed as f̂(x) = Σ_{i=1}^n β_i k(x_i, x), where the coefficients β are given by the following proposition.

Proposition 6. Given an i.i.d. sample (x_i, y_i, z_i)_{i=1}^n from P(X, Y, Z), let K := ΦᵀΦ and L := ΥᵀΥ be the Gram matrices such that K_{ij} = k(x_i, x_j) and L_{ij} = l(w_i, w_j), where w_i := (y_i, z_i). Then, f̂ = Φβ, where β = (MK + nλ2 K)⁻¹ M y and M := K(L + nλ1 I)⁻¹ L.

Algorithm 1: Kernelized Dual IV
Input: data (x_i, y_i, z_i)_{i=1}^n, kernel functions k, l, and a parameter grid Γ.
1. Compute kernel matrices K_{ij} = k(x_i, x_j) and L_{ij} = l((y_i, z_i), (y_j, z_j)).
2. (λ1, λ2) ← SelectParams(K, L, Γ).
3. M ← K(L + nλ1 I)⁻¹ L.
4. β ← (MK + nλ2 K)⁻¹ M y.
Output: f̂(x) = Σ_{i=1}^n β_i k(x_i, x).
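As an illustration, here is a minimal NumPy sketch of the closed-form estimator in Proposition 6 (Algorithm 1). It is not the authors' reference implementation: Gaussian RBF kernels with median-heuristic bandwidths are assumed, the regularizers (λ1, λ2) are passed as fixed arguments rather than chosen by SelectParams (the heuristic is described below), and the toy check at the end uses the linear design (8) with assumed noise scales.

```python
# A minimal NumPy sketch of kernelized Dual IV (Proposition 6 / Algorithm 1),
# with fixed regularizers and median-heuristic RBF bandwidths (assumptions).
import numpy as np


def sq_dists(A, B):
    """Pairwise squared Euclidean distances between rows of A and rows of B."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)


def median_sq_bandwidth(A):
    """Median heuristic: median of the positive pairwise squared distances."""
    d2 = sq_dists(A, A)
    return np.median(d2[d2 > 0])


def rbf(A, B, s2):
    """Gaussian RBF kernel matrix exp(-||a - b||^2 / s2)."""
    return np.exp(-sq_dists(A, B) / s2)


def dual_iv_fit(x, y, z, lam1=1e-6, lam2=1e-6):
    """Return representer coefficients beta of f_hat(.) = sum_i beta_i k(x_i, .).

    M = K (L + n*lam1*I)^{-1} L  and  beta = (M K + n*lam2*K)^{-1} M y.
    """
    n = len(y)
    w = np.column_stack([y, z])                      # w_i = (y_i, z_i)
    s2_x, s2_w = median_sq_bandwidth(x), median_sq_bandwidth(w)
    K = rbf(x, x, s2_x)                              # K_ij = k(x_i, x_j)
    L = rbf(w, w, s2_w)                              # L_ij = l(w_i, w_j)
    M = K @ np.linalg.solve(L + n * lam1 * np.eye(n), L)
    beta = np.linalg.solve(M @ K + n * lam2 * K, M @ y)
    return beta, s2_x


def dual_iv_predict(beta, s2_x, x_train, x_test):
    """Evaluate f_hat at x_test via the representer expansion."""
    return rbf(x_test, x_train, s2_x) @ beta


if __name__ == "__main__":
    # Toy check on the linear design (8), f(x) = beta * x with beta = 0.7;
    # sigma = 1 and the noise scales are illustrative assumptions.
    rng = np.random.default_rng(0)
    n, beta_true, alpha = 300, 0.7, 0.2
    e = rng.normal(0, 1, n)                          # hidden confounder
    z = rng.normal(0, 1, n)                          # instrument Z_1
    x = (1 - alpha) * z + alpha * e + rng.normal(0, np.sqrt(0.1), n)
    y = beta_true * x + e + rng.normal(0, np.sqrt(0.1), n)
    coef, s2 = dual_iv_fit(x[:, None], y, z[:, None])
    f_hat = dual_iv_predict(coef, s2, x[:, None], np.array([[1.0]]))
    print(f"f_hat(1) = {f_hat[0]:.2f} (true f(1) = {beta_true:.2f})")
```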
Compared to previous work which involved conditional density estimation [8, 28, 29] and vector-valued regression [7] as the first-stage regression, estimating the dual function u, a real-valued function, is arguably easier. This is especially so when X and Z are high-dimensional.

Hyperparameter selection. Our estimator depends on two hyperparameters, λ1 and λ2. Given a dataset (x_i, y_i, z_i)_{i=1}^{2n} of size 2n, we provide a simple heuristic to determine the values of (λ1, λ2). Ideally, if we knew the optimal dual function u*, we could interpret u*(y, z)² as a loss function of f at (y, z), as discussed in Section 3.4. To this end, we first estimate f̂ via Proposition 6 and û_λ := (Ĉ_W + λI)⁻¹(Ĉ_{WX} f̂ − b̂) on the first half of the data (x_i, y_i, z_i)_{i=1}^n. Next, the out-of-sample loss of f̂ is evaluated on the second half (x_i, y_i, z_i)_{i=n+1}^{2n}:

R̂(f̂) = (1/n) Σ_{i=n+1}^{2n} (E_{X|z_i}[f̂(X)] − y_i)² ≈ (1/n) Σ_{i=n+1}^{2n} û_λ(y_i, z_i)².  (11)

Note that û_λ = Υ(L + nλI)⁻¹(Kβ − y) = Υγ, where γ := (L + nλI)⁻¹(Kβ − y) and K, L ∈ Rⁿˣⁿ are kernel matrices evaluated on (x_i, y_i, z_i)_{i=1}^n. Hence, R̂(f̂) ≈ γᵀ L̃ 𝟙 / n, where L̃_{ij} = l((y_i, z_i), (y_j, z_j)) for i = 1, ..., n and j = n + 1, ..., 2n, and 𝟙 is the all-ones column vector. In practice, we fix λ in û_λ to a small constant to stabilize the loss (11) and only optimize the (λ1, λ2) that appear in β. Note that this procedure differs from the two-stage causal validation procedures used in [7, 8]. Alternatively, one may choose the hyperparameters by cross-validation with respect to (11).

Algorithm 1 outlines the kernelized Dual IV method, whose consistency is studied in Appendix D. We note that the above algorithm involves matrix inversions (O(n³)) which become the primary computational bottleneck when scaling to large datasets. To improve the scalability of our algorithm, we can leverage the rich literature on large-scale kernel machines such as random Fourier features and the Nyström method; see, e.g., Yang et al. [50] and references therein. Alternatively, we can employ stochastic gradient descent (SGD) based algorithms similar to those proposed in Dai et al. [11, Algorithm 1] to solve the dual formulation (7) directly. This would also allow us to employ flexible models such as neural networks to parameterize the function classes F and U. Recently, Liao et al. [44] have taken an important step in this direction and provided a convergence analysis for neural networks under a similar formulation.

Table 1: Comparisons of IV regression methods in small (top) and medium (bottom) sample size regimes. We report the log10 mean squared error (MSE) and its standard deviation over 20 trials.

| n = 50 | ρ = 0.1 | ρ = 0.25 | ρ = 0.5 | ρ = 0.75 | ρ = 0.9 |
|---|---|---|---|---|---|
| 2SLS | 5.814 ± 1.214 | 6.013 ± 0.827 | 5.895 ± 0.718 | 5.625 ± 1.182 | 5.308 ± 1.031 |
| Deep IV | 5.127 ± 0.043 | 5.131 ± 0.031 | 5.133 ± 0.072 | 5.130 ± 0.124 | 5.127 ± 0.061 |
| Kernel IV | 4.481 ± 0.134 | 4.460 ± 0.095 | 4.438 ± 0.132 | 4.433 ± 0.100 | 4.462 ± 0.114 |
| Deep GMM | 3.848 ± 1.096 | 2.899 ± 1.638 | 3.952 ± 0.900 | 4.148 ± 0.556 | 3.738 ± 0.587 |
| Dual IV | 4.257 ± 0.108 | 4.210 ± 0.126 | 4.285 ± 0.170 | 4.286 ± 0.126 | 4.232 ± 0.152 |

| n = 1000 | ρ = 0.1 | ρ = 0.25 | ρ = 0.5 | ρ = 0.75 | ρ = 0.9 |
|---|---|---|---|---|---|
| 2SLS | 8.236 ± 0.117 | 7.242 ± 1.232 | 8.290 ± 1.132 | 8.371 ± 0.865 | 8.544 ± 1.109 |
| Deep IV | 4.613 ± 0.052 | 4.618 ± 0.048 | 4.614 ± 0.068 | 4.701 ± 0.040 | 4.731 ± 0.032 |
| Kernel IV | 4.189 ± 0.046 | 4.209 ± 0.040 | 4.199 ± 0.043 | 4.195 ± 0.045 | 4.194 ± 0.055 |
| Deep GMM | 4.090 ± 0.691 | 3.953 ± 1.076 | 4.392 ± 0.561 | 4.272 ± 0.595 | 4.415 ± 0.522 |
| Dual IV | 4.143 ± 0.117 | 4.221 ± 0.185 | 4.104 ± 0.102 | 4.142 ± 0.105 | 4.127 ± 0.106 |

5 Experiments

In this section, we compare kernelized Dual IV (our implementation is available at https://github.com/krikamol/DualIV-NeurIPS2020) with: (i) vanilla two-stage least squares (2SLS) [23], (ii) Deep IV [8], (iii) Kernel IV [7], and (iv) Deep GMM [39].
To provide a fair comparison, we adhered to the provided hyperparameter settings. Given the low-dimensional nature of our experiments, we used Deep GMM's settings for low-dimensional scenarios in [39, Appendix B.2.1]. We ran 20 simulations of each algorithm for sample sizes of 50 and 1000 and calculated the log10 mean squared error (MSE) and its standard deviation w.r.t. the true function f for 2800 out-of-sample test points.

Demand design. We consider the same simulation as in [7, 8]: Y = f(X) + ε, where Y is the outcome, X = (P, T, S) are inputs, and Z = (C, T, S) are instruments. Specifically, Y is sales, P is price, which is endogenous, C is a supply cost shifter (instrument), and (T, S) are time of year and customer sentiment acting as exogenous variables. The aim is to estimate the demand function f(p, t, s) = 100 + (10 + p)sψ(t) − 2p, where ψ(t) = 2[(t − 5)⁴/600 + exp(−4(t − 5)²) + t/10 − 2]. Training data is sampled according to S ∼ Unif{1, ..., 7}, T ∼ Unif[0, 10], (C, V) ∼ N(0, I₂), ε ∼ N(ρV, 1 − ρ²), and P = 25 + (C + 3)ψ(T) + V. The parameter ρ ∈ {0.1, 0.25, 0.5, 0.75, 0.9} controls the extent to which price P is confounded with outcome Y by supply-side market forces. In our notation, X = (P, T, S), Z = (C, T, S), and W = (Y, Z) = (Y, C, T, S).

For Dual IV, we used the Gaussian RBF kernel for both k and l. In the experiments, the kernels on X and Z are product kernels, i.e., k(x_i, x_j) = k_p(p_i, p_j)k_t(t_i, t_j)k_s(s_i, s_j) and k(z_i, z_j) = k_c(c_i, c_j)k_t(t_i, t_j)k_s(s_i, s_j), and l([y_i, z_i], [y_j, z_j]) = exp(−[y_i − y_j, z_i − z_j]ᵀ V_yz⁻¹ [y_i − y_j, z_i − z_j]), where V_yz is a symmetric bandwidth matrix. The values of all bandwidth parameters are determined via the median heuristic. We choose (λ1, λ2) from {10⁻¹⁰, 10⁻⁹, ..., 10⁻¹} via cross-validation. Once (λ1, λ2) is chosen, we refit f̂ on the entire training set.

Table 1 reports the results of the different methods evaluated on the test data. First, we observe that 2SLS achieves the largest MSE in both regimes, as expected, because the linearity assumption is violated here. Second, in the small sample size regime, Deep IV achieves a larger MSE than the other non-linear methods. Kernel IV, Deep GMM, and Dual IV, on the other hand, have comparable performance, with Deep GMM having the lowest MSE. However, we note that the results attained by Deep GMM were unstable out of the box and we had to reduce the variance of the initialization of the neural networks to 0.1 to obtain some degree of stability, which is reflected in the standard deviations. We can fully attribute this variability to initialization as Deep GMM's default batch size of 1024 is larger than both training datasets, so there is no sampling variability in the optimization process. This suggests that Deep GMM, like Deep IV, is relatively brittle compared to kernel-based methods in the small sample size regime. Furthermore, Deep GMM comes with an extensive hyperparameter selection process, which highlights its need for fine-tuning. Last but not least, Dual IV is competitive with Kernel IV across values of ρ, with slightly smaller MSE, which lends weight to our hypothesis that estimating the real-valued dual function is easier than vector-valued regression. In the medium sample size regime, we observe that the performance of Deep IV is in the same ballpark as the rest of the non-linear IV regression methods, and the variance of Deep GMM is reduced, albeit still the highest among the non-linear methods.
The results of Dual IV, Kernel IV, and Deep GMM are almost indistinguishable, with Dual IV having an edge as ρ increases. This could mean that accounting for both Y and Z is perhaps slightly more effective than Z alone in the presence of greater confounding.

6 Conclusion

This paper proposes a general framework for non-linear IV regression called Dual IV. Unlike previous work, Dual IV does not require the first-stage regression which is the critical bottleneck of modern two-stage procedures. By exploiting tools from stochastic programming, we were able to reformulate the two-stage problem as a convex-concave saddle-point problem which is relatively simpler to solve. Instead of a first-stage regression, Dual IV requires the dual function u(y, z) to be estimated, which is arguably easier, especially when the instruments and treatments are high-dimensional. We demonstrate the validity of our framework with a kernel-based algorithm. Results show the competitiveness of our algorithm with respect to existing ones. Finally, potential directions for future work include (i) a minimax convergence analysis which could provide additional insight into the benefits of our framework, (ii) more flexible and scalable models such as deep neural networks as dual functions with stochastic gradient descent (SGD) [11], and (iii) applications to other two-stage problems in causal inference such as double ML [51].

Broader impact

This work provides a new framework for non-linear instrumental variable regression which allows one to perform causal analysis in the presence of unobserved confounders. This could have a profound impact in other fields such as economics, social science, and epidemiology, among others. Understanding the role of instruments in the context of learning theory may also pave the way towards creating more robust and trustworthy machine learning algorithms that are capable of surviving in a world full of hidden biases.

Acknowledgments and Disclosure of Funding

We are indebted to Rahul Singh and Arthur Gretton for their help with the Kernel IV code which was used in our experiments. We thank Victor Chernozhukov, Elias Bareinboim, Sorawit Saengkyongam, Uri Shalit, Konrad Kording, Rahul Singh, Arthur Gretton, and You-Lin Chen for fruitful discussions, as well as the anonymous reviewers for their helpful feedback on our initial submission. This work is funded by the federal and state governments of Germany through the Max Planck Society (MPG).

References

[1] Joshua D. Angrist and Jörn-Steffen Pischke. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, 2008.
[2] Guido W. Imbens and Donald B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, New York, NY, USA, 2015.
[3] David Card. The causal effect of education on earnings. In Handbook of Labor Economics, volume 3, chapter 30, pages 1801-1863. Elsevier, 1999.
[4] Joshua D. Angrist and Alan B. Krueger. Does compulsory school attendance affect schooling and earnings? The Quarterly Journal of Economics, 106(4):979-1014, 1991.
[5] Stephen Burgess, Dylan S. Small, and Simon G. Thompson. A review of instrumental variable estimators for Mendelian randomization. Statistical Methods in Medical Research, 26(5):2333-2355, 2017.
[6] Halbert White. Instrumental variables regression with independent observations. Econometrica, 50(2):483-499, 1982.
[7] Rahul Singh, Maneesh Sahani, and Arthur Gretton.
Kernel instrumental variable regression. In Advances in Neural Information Processing Systems 32, pages 4593-4605, 2019.
[8] Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. Deep IV: A flexible approach for counterfactual prediction. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1414-1423. PMLR, 2017.
[9] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. 2017.
[10] Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczynski. Lectures on Stochastic Programming: Modeling and Theory, Second Edition. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2014.
[11] Bo Dai, Niao He, Yunpeng Pan, Byron Boots, and Le Song. Learning from conditional distributions via dual embeddings. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54, pages 1458-1467. PMLR, 2017.
[12] Kelvin Hsu and Fabio Ramos. Bayesian deconditional kernel mean embeddings. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 2830-2838. PMLR, 2019.
[13] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.
[14] Judea Pearl. Causal inference in statistics: An overview. Statistics Surveys, 3:96-146, 2009.
[15] Whitney K. Newey and James L. Powell. Instrumental variable estimation of nonparametric models. Econometrica, 71(5):1565-1578, 2003.
[16] Joel L. Horowitz. Applied nonparametric instrumental variables estimation. Econometrica, 79(2):347-394, 2011.
[17] Rainer Kress. Linear Integral Equations, volume 3. Springer, 1989.
[18] M. Nashed and G. Wahba. Generalized inverses in reproducing kernel spaces: An approach to regularization of linear operator equations. SIAM Journal on Mathematical Analysis, 5(6):974-987, 1974.
[19] D. B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688-701, 1974.
[20] Donald Rubin. Causal inference using potential outcomes. Journal of the American Statistical Association, 100(469):322-331, 2005.
[21] John Bound, David A. Jaeger, and Regina M. Baker. Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association, 90(430):443-450, 1995.
[22] Douglas Staiger and James H. Stock. Instrumental variables regression with weak instruments. Econometrica, 65(3):557-586, 1997.
[23] Joshua D. Angrist, Guido W. Imbens, and Donald B. Rubin. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444-455, 1996.
[24] Jeffrey M. Wooldridge. Econometric Analysis of Cross Section and Panel Data, volume 1 of MIT Press Books. The MIT Press, March 2001.
[25] Richard Blundell, Xiaohong Chen, and Dennis Kristensen. Semi-nonparametric IV estimation of shape-invariant Engel curves. Econometrica, 75(6):1613-1669, 2007.
[26] Xiaohong Chen and Demian Pouzo. Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals. Econometrica, 80(1):277-321, 2012.
[27] Xiaohong Chen and Timothy M. Christensen. Optimal sup-norm rates and uniform inference on nonlinear functionals of nonparametric IV regression. Quantitative Economics, 9(1):39-84, 2018.
[28] Peter Hall and Joel L. Horowitz.
Nonparametric methods for inference in the presence of instrumental variables. The Annals of Statistics, 33(6):2904-2929, 2005.
[29] S. Darolles, Y. Fan, J. P. Florens, and E. Renault. Nonparametric instrumental regression. Econometrica, 79(5):1541-1565, 2011.
[30] Alexandre Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition, 2008.
[31] Le Song, Jonathan Huang, Alex Smola, and Kenji Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th International Conference on Machine Learning (ICML), June 2009.
[32] Le Song, Kenji Fukumizu, and Arthur Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4):98-111, 2013.
[33] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1-141, 2017.
[34] Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
[35] Joshua D. Angrist and Alan B. Krueger. Split sample instrumental variables. Technical Report 699, Princeton University, Department of Economics, Industrial Relations Section, October 1993.
[36] Lars Peter Hansen. Large sample properties of generalized method of moments estimators. Econometrica, 50(4):1029-1054, 1982.
[37] A. R. Hall. Generalized Method of Moments. Advanced Texts in Econometrics. Oxford University Press, 2005.
[38] Greg Lewis and Vasilis Syrgkanis. Adversarial generalized method of moments, March 2018.
[39] Andrew Bennett, Nathan Kallus, and Tobias Schnabel. Deep generalized method of moments for instrumental variable analysis. In Advances in Neural Information Processing Systems 32, pages 3564-3574, 2019.
[40] Krikamol Muandet, Wittawat Jitkrittum, and Jonas Kübler. Kernel conditional moment test via maximum moment restriction. In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence, volume 124, pages 41-50. PMLR, 2020.
[41] R. Tyrrell Rockafellar. Convex Analysis. Princeton Mathematical Series. Princeton University Press, 1970.
[42] Le Song, Kenji Fukumizu, and Arthur Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30:98-111, 2013.
[43] R. Tyrrell Rockafellar and Roger J.-B. Wets. Variational Analysis. Springer Verlag, Heidelberg, Berlin, New York, 1998.
[44] Luofeng Liao, You-Lin Chen, Zhuoran Yang, Bo Dai, Zhaoran Wang, and Mladen Kolar. Provably efficient neural estimation of structural equation model: An adversarial approach. In Advances in Neural Information Processing Systems 33, 2020.
[45] Bernhard Schölkopf and Alexander Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2002.
[46] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
[47] Charles R. Baker. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273-289, 1973.
[48] Kenji Fukumizu, Francis Bach, and Michael Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73-99, 2004.
[49] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem.
In Lecture Notes in Computer Science, volume 2111, pages 416-426, 2001.
[50] Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In Advances in Neural Information Processing Systems 25, pages 476-484, 2012.
[51] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1-C68, 2018.
[52] K. Fukumizu, L. Song, and A. Gretton. Kernel Bayes' rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14:3753-3783, 2013.