# Geometry-Aware Instrumental Variable Regression

Heiner Kremer¹, Bernhard Schölkopf¹

¹Max Planck Institute for Intelligent Systems. Correspondence to: Heiner Kremer.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract. Instrumental variable (IV) regression can be approached through its formulation in terms of conditional moment restrictions (CMR). Building on variants of the generalized method of moments, most CMR estimators are implicitly based on approximating the population data distribution via reweightings of the empirical sample. While for large sample sizes, in the independent and identically distributed (IID) setting, reweightings can provide sufficient flexibility, they might fail to capture the relevant information in the presence of corrupted data or data prone to adversarial attacks. To address these shortcomings, we propose the Sinkhorn Method of Moments, an optimal transport-based IV estimator that takes into account the geometry of the data manifold through data-derivative information. We provide a simple plug-and-play implementation of our method that performs on par with related estimators in standard settings but improves robustness against data corruption and adversarial attacks.

1. Introduction

Instrumental variable regression is one of the most widespread approaches for learning in the presence of confounding (Angrist & Pischke, 2008). It is applicable in situations where one is interested in inferring the outcome Y of some treatment T, where both treatment and outcome are affected by a so-called unobserved confounder U. To eliminate the confounding bias, one can take into account an instrumental variable Z, which i) affects the treatment T, ii) affects the outcome Y only through its effect on T, and iii) is independent of the confounder U. While traditionally the problem has been addressed through the 2-stage least squares approach (Angrist & Pischke, 2008), in recent years the formulation in terms of conditional moment restrictions (CMR) has gained popularity for its potential to benefit from advances in machine learning models (Bennett et al., 2019; Dikkala et al., 2020; Muandet et al., 2020; Kremer et al., 2022; 2023; Bennett & Kallus, 2023; Zhang et al., 2023).

[Figure 1. Paradigms to approximate $P_0$ from data (red dots) in the GEL framework. φ-divergence-based estimators (left) approximate $P_0$ by reweighting (weight ≙ size) the sample (e.g., Ai & Chen, 2003; Bennett & Kallus, 2023). MMD-based estimators (middle) allow sampling additional data points (blue dots) (Kremer et al., 2023). In contrast, optimal transport-based estimators (right) allow moving the data points around (present work).]

The CMR formulation of IV regression is based on restricting the expectation of the prediction residual $Y - f(T)$ conditioned on the instruments Z, where f denotes the causal relation from T to Y that one wants to infer. In general, this leads to a zero-sum game in which one minimizes an objective with respect to the model parameters and maximizes it with respect to an adversary function that detects the moment violations (Bennett et al., 2019; Dikkala et al., 2020).
One of the most general frameworks for learning with moment restrictions is the family of generalized empirical likelihood (GEL) estimators (Owen, 1988; Qin & Lawless, 1994; Kitamura & Stutzer, 1997; Imbens et al., 1998; Owen, 2001), which includes the prominent generalized method of moments (Hansen, 1982; Hansen et al., 1996; Hall, 2004). The idea behind empirical likelihood is to learn a model via maximum likelihood estimation without specifying a parametric form of the data distribution (Owen, 2001). In practice, this is realized by learning a non-parametric approximation of the population data distribution $P_0$ along with the model f by minimizing a φ-divergence under the moment restrictions. However, by relying on φ-divergences one effectively restricts the estimator of the population distribution to reweightings of the sample. The reweighting assumption has recently been lifted by Kremer et al. (2023), who introduced an estimator based on the maximum mean discrepancy (Gretton et al., 2012). Their estimator allows for more fine-grained approximations of $P_0$ by sampling additional data points from a generative model.

While reweighting the present data or sampling additional points might suffice to find sufficiently close approximations of the population distribution in some cases, in the presence of highly complex data manifolds, e.g., image spaces, these strategies can become ineffective as they are blind towards the geometry of the data space. This is particularly relevant in the presence of poisoned (Chen et al., 2017) or adversarial (Goodfellow et al., 2014) data points, i.e., data that has been corrupted with small perturbations which lead to vastly inaccurate predictions. The key to robustness against such perturbations is to look at how the learning signal changes around the empirical data points, i.e., to take into account the geometry of the signal with respect to the data manifold. We implement this idea of geometry-aware learning with conditional moment restrictions by proposing an empirical likelihood-type estimator based on a regularized optimal transport distance, which we call the Sinkhorn Method of Moments (SMM). Figure 1 schematically compares our method to previous approaches to empirical likelihood estimation.

Our Contributions.
- We propose the Sinkhorn Method of Moments (SMM), the first geometry-aware approach to IV regression, resulting from an empirical likelihood-type estimator based on the Sinkhorn distance.
- We derive the dual form of our estimator and a leading-order expansion that lets us compute our estimator with stochastic gradient methods.
- We show that under standard assumptions, our method is consistent for models identified via conditional moment restrictions.
- We derive a kernel-based implementation of our method that can be interpreted as a geometry-aware variant of a 2-stage generalized method of moments estimator for conditional moment restrictions.
- Our experiments demonstrate that SMM is competitive with state-of-the-art IV estimators in standard settings and can provide an improvement in the presence of corrupted data and adversarial examples.

The remainder of the paper is structured as follows. Section 2 introduces empirical likelihood estimation for conditional moment restrictions, followed by the derivation of our method and its theoretical properties in Section 3. Empirical results are provided in Section 4 and related work is discussed in Section 5.
2. Empirical Likelihood Estimation for CMR

In the following, let T, Y and Z denote random variables taking values in $\mathcal{T} \subseteq \mathbb{R}^{d_t}$, $\mathcal{Y} \subseteq \mathbb{R}^{d_y}$ and $\mathcal{Z} \subseteq \mathbb{R}^{d_z}$, respectively. We denote by $E_P[\cdot]$ the expectation operator with respect to a distribution P and drop the subscript whenever we refer to the population distribution $P_0$. Conditional moment restrictions identify a function of interest $f_0 \in \mathcal{F}$ by restricting the conditional expectation of a so-called moment function $\psi : \mathcal{T} \times \mathcal{Y} \times \mathcal{F} \to \mathbb{R}^m$,

$$E[\psi(T, Y; f_0) \mid Z] = 0 \quad P_Z\text{-a.s.} \qquad (1)$$

The most prominent example of this problem is instrumental variable (IV) regression, where the moment function is given by the prediction residual $\psi(t, y; f) = y - f(t)$ and the conditioning variable Z denotes the instrument. IV regression is one of the major practical approaches to deal with endogenous variables (Pearl, 2000) and has been largely adopted by the causal machine learning community (Hartford et al., 2017; Singh et al., 2019; Xu et al., 2021; Saengkyongam et al., 2022; Zhang et al., 2023).

Learning with conditional moment restrictions is challenging mostly due to two factors. The first one is that equation (1) contains a conditional expectation over the treatments T and outcomes Y, while one generally only has access to a sample from the joint distribution over $(T, Y, Z) \sim P_0$. For a sufficiently complex data-generating process, the accurate estimation of a conditional distribution from the corresponding joint distribution can require large amounts of data (Hall et al., 1999). This can be avoided by rewriting the CMR (1) in terms of an equivalent variational formulation (Bierens, 1982)

$$E[\psi(T, Y; f_0)^\top h(Z)] = 0 \quad \forall h \in \mathcal{H}, \qquad (2)$$

where $\mathcal{H}$ is a sufficiently rich function space, e.g., the space of square-integrable functions (Bierens, 1982) or the reproducing kernel Hilbert space of a certain kind of kernel (Kremer et al., 2022). While (2) avoids the conditional expectation operator, it involves an infinite-dimensional, over-determined system of equations.

The second difficulty is the fact that the moment restrictions identify the function of interest $f_0$ via the population distribution $P_0$ of the data, about which one usually only has partial information in terms of a sample $D = \{(t_i, y_i, z_i)\}_{i=1}^n$ with empirical distribution $\hat{P}_n := \frac{1}{n} \sum_{i=1}^n \delta_{(t_i, y_i, z_i)}$, where $\delta_{(t_i, y_i, z_i)}$ denotes a point mass centered at $(t_i, y_i, z_i)$. While the true function $f_0$ is identified by the population moment restrictions (2), it might not satisfy the empirical counterpart of (2) and thus one might not retrieve $f_0$ by enforcing it.

Empirical likelihood estimation (Owen, 1988; 1990; Qin & Lawless, 1994) has been proposed as a flexible tool to solve over-determined moment restriction problems with access to only a finite sample. The idea is based on approximating the population distribution by seeking a distribution with minimal distance to the empirical one for which the moment restrictions can be fulfilled. We visualize this approach in Figure 2. The standard generalized empirical likelihood estimator (Qin & Lawless, 1994), with the extension to conditional moment restrictions of Kremer et al. (2022), takes the form $f_{\mathrm{FGEL}} = \arg\min_{f \in \mathcal{F}} R(f)$ with

$$R(f) = \min_{P \in \mathcal{P}_n} D_\phi(P \,\|\, \hat{P}_n) \quad \text{s.t.} \quad E_P[\psi(T, Y; f)^\top h(Z)] = 0 \quad \forall h \in \mathcal{H},$$

where $D_\phi(P \,\|\, Q) = \int \phi\!\left(\frac{dP}{dQ}\right) dQ$ denotes the φ-divergence between distributions P and Q, and the set $\mathcal{P}_n := \{P \ll \hat{P}_n : E_P[1] = 1\}$ contains all distributions which are absolutely continuous with respect to the empirical one, i.e., reweightings of the data points.
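As a minimal numerical illustration of (2) (not from the paper; all constants below are illustrative assumptions), the following sketch simulates a confounded linear IV problem and evaluates the empirical moment restriction $E_n[(Y - \theta T)\, h(Z)]$ for the single instrument function $h(z) = z$: the least-squares solution violates the restriction, while the simple IV estimate approximately satisfies it.

```python
# Minimal sketch: a linear IV problem and the moment restriction E[(Y - theta*T) h(Z)] = 0
# with the simple choice h(z) = z. Constants are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
U = rng.normal(size=n)                          # unobserved confounder
Z = rng.uniform(-3, 3, size=n)                  # instrument
T = 0.8 * Z + U + 0.1 * rng.normal(size=n)      # treatment, confounded by U
Y = 2.0 * T + U + 0.1 * rng.normal(size=n)      # outcome, true causal effect = 2.0

def moment_violation(theta, h_vals):
    """Empirical version of E[psi(T, Y; theta) * h(Z)] with psi = Y - theta*T."""
    return np.mean((Y - theta * T) * h_vals)

theta_ols = np.sum(T * Y) / np.sum(T * T)       # confounded least-squares estimate
theta_iv = np.sum(Z * Y) / np.sum(Z * T)        # simple IV estimate (h(z) = z)

print(f"OLS estimate: {theta_ols:.3f}, IV estimate: {theta_iv:.3f}")
print("moment violation at OLS:", moment_violation(theta_ols, Z))
print("moment violation at IV: ", moment_violation(theta_iv, Z))
```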
3. Sinkhorn Method of Moments

The goal of this work is to extend the idea of empirical likelihood estimation to optimal transport distances. Before deriving the method, we provide a brief introduction to optimal transport. Consider the random variable $\xi := (T, Y, Z)$ taking values in $\Xi := \mathcal{T} \times \mathcal{Y} \times \mathcal{Z} \subseteq \mathbb{R}^{d_\xi}$, with $d_\xi = d_t + d_y + d_z$, and let $\mathcal{P}(\Xi)$ denote the space of probability distributions over $\Xi$.

Optimal Transport. Optimal transport provides an intuitive way of comparing two distributions by measuring the minimum effort of transforming one into the other by moving probability mass at a certain cost. Let $P \in \mathcal{P}(\Xi)$ and $Q \in \mathcal{P}(\Xi)$ denote two probability distributions over $\Xi$ with densities or probability mass functions (pmf) p and q, respectively. Let $\Pi(P, Q) \subseteq \mathcal{P}(\Xi \times \Xi)$ denote the space of joint probability distributions over the product space $\Xi \times \Xi$ with marginals P and Q. Define the projection operators $P_1$ and $P_2$ with $P_1(x, y) = x$ and $P_2(x, y) = y$ and their pushforward operations $P_{i\sharp}$ such that for any element of $\Pi(P, Q)$ with density (or pmf) $\pi$ we have $(P_{1\sharp}\pi)(\xi) = \int \pi(\xi, \xi')\, d\xi' = p(\xi)$ and $(P_{2\sharp}\pi)(\xi') = \int \pi(\xi, \xi')\, d\xi = q(\xi')$. Then, for a cost function $c : \Xi \times \Xi \to \mathbb{R}$ we can define the Wasserstein distance between P and Q in the Kantorovich formulation as

$$W_c(P, Q) := \min_{\pi \in \Pi(P, Q)} \int c(\xi, \xi')\, d\pi(\xi, \xi').$$

Computation of the Wasserstein distance requires the solution of an infinite-dimensional linear program. In order to enhance its computational efficiency, Cuturi (2013) proposed to regularize the distance by penalizing the relative entropy, i.e., the Kullback-Leibler divergence, between the coupling distribution $\pi$ and a reference measure $\mu \otimes \nu \in \mathcal{P}(\Xi \times \Xi)$,

$$W^\epsilon_c(P, Q) = \min_{\pi \in \Pi(P, Q)} \int c(\xi, \xi')\, d\pi(\xi, \xi') + \epsilon H(\pi \mid \mu \otimes \nu),$$

where the relative entropy is defined as

$$H(\pi \mid \mu \otimes \nu) = \int_{\Xi \times \Xi} \log \frac{d\pi(\xi, \xi')}{d\mu(\xi)\, d\nu(\xi')}\, d\pi(\xi, \xi').$$

The resulting distance can be efficiently computed with the matrix scaling algorithm of Sinkhorn & Knopp (1967), from which it derives its name, Sinkhorn distance. We refer to Peyré et al. (2019) for a comprehensive introduction to computational optimal transport for machine learning.

In order to define an estimator for the conditional moment restriction problem (1), we first resort to the functional formulation of Kremer et al. (2022). Let $\mathcal{H}$ denote a sufficiently rich space of functions such that equivalence between (1) and (2) holds. Then we define the moment functional $\Psi$ via its action on $h \in \mathcal{H}$ as $\Psi(t, y, z; f)(h) = \psi(t, y; f)^\top h(z)$. This lets us express the CMR (1) in its equivalent functional form, $\|E[\Psi(T, Y, Z; f)]\|_{\mathcal{H}^*} = 0$, where $\|\cdot\|_{\mathcal{H}^*}$ denotes the norm in the dual space $\mathcal{H}^*$ of $\mathcal{H}$. With this at hand, we can define the primal problem of the Sinkhorn Method of Moments estimator for conditional moment restrictions as the minimizer of the Sinkhorn profile $R_\epsilon$ defined as

$$R_\epsilon(f) := \min_{P \in \mathcal{P}(\Xi)} W^\epsilon_c(P, \hat{P}_n) \quad \text{s.t.} \quad \|E_P[\Psi(T, Y, Z; f)]\|_{\mathcal{H}^*} = 0. \qquad (3)$$

[Figure 2. Sinkhorn profile. For every $f \in \mathcal{F}$, the Sinkhorn profile $R(f)$, (3), is the minimal distance between the empirical distribution $\hat{P}_n$ and the set of distributions satisfying the CMR (1).]

Using Lagrangian duality we can pass to the dual formulation of (3), as formalized by the following theorem, whose proof is inspired by the mathematically closely related Sinkhorn Distributionally Robust Optimization (DRO) method of Wang et al. (2023).

Theorem 3.1 (Duality). Consider the Sinkhorn profile (3) with reference measure $\mu \otimes \nu \in \mathcal{P}(\Xi \times \Xi)$. Then (3) has the strongly dual form $R_\epsilon(f) = \sup_{h \in \mathcal{H}} \mathcal{D}(f, h)$, where

$$\mathcal{D}(f, h) := E_{\xi' \sim \nu}\Big[{-\epsilon} \log E_{\xi \sim \mu}\Big[ e^{-\Psi(\xi; f)(h) - c(\xi, \xi')/\epsilon} \Big]\Big]. \qquad (4)$$
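For two fixed empirical distributions, the entropic-regularized transport cost above can be computed with a few lines of Sinkhorn-Knopp matrix scaling. The following numpy sketch is purely illustrative (costs, sample sizes and the regularization level are assumptions); in our estimator the regularization serves to obtain the tractable dual (4) rather than to compute a distance between two fixed samples.

```python
# Minimal Sinkhorn-Knopp sketch: entropic-regularized OT between two empirical
# point clouds with quadratic cost. Illustrative only, not part of the SMM estimator.
import numpy as np

def sinkhorn_distance(X, Y, eps=0.5, n_iter=500):
    """Entropic OT cost between uniform empirical measures on the rows of X and Y."""
    n, m = X.shape[0], Y.shape[0]
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)           # uniform marginals
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1) / 2.0  # c(x, y) = ||x - y||^2 / 2
    K = np.exp(-C / eps)                                      # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):                                   # alternating marginal scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]                           # optimal entropic coupling
    return float((P * C).sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Y = rng.normal(loc=1.0, size=(200, 2))
print("Sinkhorn transport cost:", sinkhorn_distance(X, Y))
```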
In contrast to its original purpose, in our application the goal of the entropic regularization penalty is not to make computation of the distance more efficient, but rather to arrive at the relaxed dual problem (4). The dual Sinkhorn profile (4) contains expectation operators with respect to the reference distributions $\mu$ and $\nu$ combined in a non-linear way. This makes optimization of the objective difficult, as stochastic gradient estimates will be biased. One way to proceed is to resort to de-biasing techniques as discussed by Wang et al. (2023) for their related DRO objective. However, on top of the problem of gradient estimation, computation of (4) requires sampling from two reference distributions $\mu$ and $\nu$, such that accurate gradient estimation becomes costly. To avoid these issues, we propose an alternative solution for a special choice of reference measures and cost function.

Cuturi (2013) chooses the reference measure as the product of the marginals of the coupling distribution $\pi$. For $W^\epsilon_c(P, Q)$ this corresponds to the choice $\mu \otimes \nu = P \otimes Q$. The choice of $\mu$ and $\nu$ can be interpreted as a prior for the distributions P and Q, respectively. Motivated by this, we choose $\nu = \hat{P}_n$, and in order not to restrict the form of P we use an uninformative prior and choose $\mu$ as the Lebesgue measure. The second modeling choice is the transport cost function c. Here, we use a weighted Euclidean norm,

$$c(\xi, \xi') := \frac{1}{2}(\xi - \xi')^\top \Gamma (\xi - \xi') = \frac{1}{2} \sum_{w \in \{t, y, z\}} \gamma_w \|w - w'\|_2^2, \qquad (5)$$

where the factors $\gamma_w > 0$ determine the transport cost in the spaces $\mathcal{T}$, $\mathcal{Y}$ and $\mathcal{Z}$, and we defined the block-diagonal matrix $\Gamma := \mathrm{diag}(\{\gamma_t I_{d_t}, \gamma_y I_{d_y}, \gamma_z I_{d_z}\}) \in \mathbb{R}^{d_\xi \times d_\xi}$, with $I_{d_i}$ denoting the identity matrix in $\mathbb{R}^{d_i}$. With these choices, the objective (4) becomes

$$\mathcal{D}(f, h) = E_{\xi' \sim \hat{P}_n}\Big[{-\epsilon} \log E_{\xi \sim \mathcal{N}(\xi', \epsilon \Gamma^{-1})}\Big[ e^{-\Psi(\xi; f)(h)} \Big]\Big], \qquad (6)$$

where $\mathcal{N}(\xi', \epsilon \Gamma^{-1})$ denotes a multivariate Gaussian centered at $\xi' = (t', y', z')$ with diagonal covariance $\epsilon \Gamma^{-1}$. Thus, for each value of $\xi'$ we need to carry out an expectation of the moment violation $\exp(-\Psi(\xi; f)(h))$ with respect to a narrow Gaussian distribution centered at $\xi'$. Now, as $\epsilon$ is a small regularization parameter, the integrand only provides relevant contributions in a neighborhood of $\xi'$ and thus, for a sufficiently smooth moment function $\psi$ and instrument function h, we can employ a Taylor expansion and carry out the Gaussian expectation over $\xi$ in closed form. In the following, we define the weighted Laplacian $\Delta_\xi = \nabla_\xi \cdot \Gamma^{-1} \nabla_\xi = \sum_{w \in \{t, y, z\}} \frac{1}{\gamma_w} \Delta_w$ and the weighted $\ell_2$-norm $\|\cdot\|_\Gamma$ as $\|v\|^2_\Gamma = v^\top \Gamma^{-1} v$ for $v \in \mathbb{R}^{d_\xi}$.

Theorem 3.2. Let the moment functional $\Psi(\,\cdot\,; f) : \Xi \to \mathcal{H}^*$ be continuously differentiable everywhere for any $f \in \mathcal{F}$. Consider the SMM estimator with transport cost function (5) and reference measure $\hat{P}_n \otimes \mathcal{L}$, where $\mathcal{L}$ denotes the Lebesgue measure over $\Xi$. Then, for $\epsilon/\gamma_i$, $i \in \{t, y, z\}$, sufficiently small, up to constants and rescalings the objective of the dual Sinkhorn profile (4) takes the form

$$\mathcal{D}(f, h) = E_{\xi \sim \hat{P}_n}\Big[\big(I + \tfrac{\epsilon}{2} \Delta_\xi\big) \Psi(\xi; f)(h)\Big] - \frac{\epsilon}{2}\, E_{\xi \sim \hat{P}_n}\Big[\big\|\nabla_\xi \Psi(\xi; f)(h)\big\|^2_\Gamma\Big] + O(\epsilon^{3/2}). \qquad (7)$$

Motivated by the classical 2-stage generalized method of moments (GMM) estimator (Hansen, 1982), we define the Sinkhorn Method of Moments by substituting the model in the second term of (7) by a first-stage estimate $\bar{f}$. We will show below that this does not harm the consistency and convergence properties of our method. Additionally, we add regularization $\frac{\lambda}{2}\|h\|^2_{\mathcal{H}}$ on the instrument function to ensure that the optimization over h is well behaved on finite samples.
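Theorem 3.2 reduces the nested expectations in (4) to data-derivative terms that can be estimated from samples. The following sketch (not the paper's implementation; the quadratic toy model, the instrument function $h(z) = \sin(z)$, and all constants are illustrative assumptions) evaluates the per-sample contribution to (7) with torch autograd; in the full estimator the gradient-penalty term is evaluated at a first-stage estimate and the objective is maximized over h.

```python
# Hedged sketch of the leading-order dual objective (7) at a single data point,
# using torch autograd for the gradient and weighted Laplacian of the moment
# functional Psi(xi; f)(h) = (y - f(t)) * h(z). Model, instrument function and
# constants are illustrative assumptions, not the paper's implementation.
import torch

eps = 1e-2
gamma = torch.tensor([1.0, 1.0, 1.0])   # transport costs (gamma_t, gamma_y, gamma_z)

def f(t, theta):                         # toy model f(t; theta) = theta_1 t^2 + theta_2 t
    return theta[0] * t**2 + theta[1] * t

def h(z):                                # toy instrument function
    return torch.sin(z)

def Psi(xi, theta):                      # moment functional, xi = (t, y, z)
    t, y, z = xi[0], xi[1], xi[2]
    return (y - f(t, theta)) * h(z)

def objective_term(xi, theta):
    """(I + eps/2 * Delta_xi) Psi - eps/2 * ||grad_xi Psi||_Gamma^2 at one sample.
    In the full SMM objective, the last term would use a first-stage estimate."""
    xi = xi.clone().requires_grad_(True)
    val = Psi(xi, theta)
    (grad,) = torch.autograd.grad(val, xi)
    hess = torch.autograd.functional.hessian(lambda x: Psi(x, theta), xi)
    weighted_laplacian = (torch.diagonal(hess) / gamma).sum()
    grad_norm_sq = (grad**2 / gamma).sum()
    return val + 0.5 * eps * weighted_laplacian - 0.5 * eps * grad_norm_sq

theta = torch.tensor([3.0, -0.5])
xi = torch.tensor([0.7, 1.2, -0.3])      # one sample (t, y, z)
print(objective_term(xi, theta))
```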
Definition 3.3 (SMM). Let $\bar{f} \in \mathcal{F}$ denote a first-stage estimate of $f_0 \in \mathcal{F}$. Then we define the Sinkhorn Method of Moments (SMM) estimator as the solution of the saddle-point problem

$$f_{\mathrm{SMM}} = \arg\min_{f \in \mathcal{F}} \max_{h \in \mathcal{H}} \; M(f, h) - \epsilon R(\bar{f}, h), \qquad (8)$$
$$M(f, h) = E_{\hat{P}_n}\Big[\big(I + \tfrac{\epsilon}{2}\Delta_\xi\big)\Psi(\xi; f)(h)\Big],$$
$$R(\bar{f}, h) = \frac{1}{2}\, E_{\hat{P}_n}\Big[\big\|\nabla_\xi \Psi(\xi; \bar{f})(h)\big\|^2_\Gamma\Big] + \frac{\lambda}{2}\|h\|^2_{\mathcal{H}},$$

where, as before, $\Psi(\xi; f)(h) = \psi(t, y; f)^\top h(z)$.

By using the 2-stage GMM-style estimator we shift most of the computational complexity into the optimization of the instrument function $h \in \mathcal{H}$. The optimization over the possibly high-dimensional model remains simple and is even a convex program whenever f has a convexity-preserving parameterization, e.g., for linear models. In practice, if (8) is optimized with stochastic gradient methods, one can dynamically update the first-stage estimate $\bar{f}$ using the result from the previous iteration. In the context of CMR estimation, this GMM-inspired two-stage procedure is a popular approach to stabilize training (Lewis & Syrgkanis, 2018; Bennett et al., 2019; Bennett & Kallus, 2023). Note that without the 2-stage adaptation we would obtain an estimator similar in spirit to the continuous updating GMM estimator of Hansen et al. (1996) or the FGEL estimator of Kremer et al. (2022), which can be harder to train in practice (Hall, 2004).

The objective (8) involves a gradient and a Laplacian with respect to the data, which allows the method to take into account the geometry of the moment violation with respect to the data manifold. As we maximize the objective over $h \in \mathcal{H}$, we promote instrument functions which correspond to local minima of the moment violation $\psi(t, y; f)^\top h(z)$ with respect to the data. Generally, for CMR estimators the instrument function is responsible for translating the data into a learning signal for the model f. Choosing h at a local minimum w.r.t. the data means that we attribute less importance to data points that lead to large increases in the moment violation when perturbed slightly. This makes the model less vulnerable to poisoned data and adversarial attacks. SMM's property of taking into account how the learning signal changes in the proximity of the data is unique compared to related estimators, which are blind towards the geometry of the data manifold as they are based on reweighting the existing data (Lewis & Syrgkanis, 2018; Bennett et al., 2019; Dikkala et al., 2020; Kremer et al., 2022; Bennett & Kallus, 2023) or sampling additional data points (Kremer et al., 2023), respectively.
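A minimal PyTorch sketch of such an alternating optimization of (8) with a dynamically updated first-stage estimate is given below; `smm_objective` is a hypothetical helper assumed to return the mini-batch value of $M(f, h) - \epsilon R(\bar{f}, h)$, e.g., assembled from the per-sample term sketched after Theorem 3.2. This is a sketch under these assumptions, not the paper's implementation.

```python
# Hedged sketch: alternating min-max optimization of (8) with a dynamically updated
# first-stage estimate. `smm_objective(model, model_bar, instrument, batch)` is a
# hypothetical helper assumed to return M(f, h) - eps * R(f_bar, h) on a mini-batch.
import copy
import torch

def train_smm(model, instrument, loader, smm_objective, n_stages=5, lr=1e-3):
    theta_opt = torch.optim.Adam(model.parameters(), lr=lr)
    omega_opt = torch.optim.Adam(instrument.parameters(), lr=lr)
    model_bar = copy.deepcopy(model)                 # first-stage estimate f_bar
    for _ in range(n_stages):
        for batch in loader:
            # ascent step in the instrument function h (maximize the objective)
            obj = smm_objective(model, model_bar, instrument, batch)
            omega_opt.zero_grad()
            (-obj).backward()
            omega_opt.step()
            # descent step in the model f (minimize the objective)
            obj = smm_objective(model, model_bar, instrument, batch)
            theta_opt.zero_grad()
            obj.backward()
            theta_opt.step()
        # dynamically update the first-stage estimate with the current model
        model_bar.load_state_dict(model.state_dict())
    return model
```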
3.1. Consistency

The following assumptions allow us to guarantee consistency and derive a convergence rate for our 2-stage estimator (8) in the parametric, uniquely identified setting. Suppose there exists a unique parameter $\theta_0 \in \Theta \subseteq \mathbb{R}^p$ for which $E[\psi(T, Y; \theta_0) \mid Z] = 0$ $P_Z$-a.s. In the following, let $x \in \mathcal{X} \subseteq \mathbb{R}^{d_x}$ denote the concatenation of $(t, y) \in \mathcal{T} \times \mathcal{Y}$ and let $i \in [m]$ be a shorthand for $i \in \{1, \ldots, m\}$. Further, we define the Jacobian of a vector-valued function $\psi : \mathcal{X} \times \Theta \to \mathbb{R}^m$ as $J_x\psi(x; \theta) \in \mathbb{R}^{m \times d_x}$.

Assumption 1 (Identifiability). $\theta_0 \in \Theta$ is the unique solution to $E[\psi(X; \theta) \mid Z] = 0$ $P_Z$-a.s.; $\Theta$ is compact; $\psi(X; \theta)$ is continuous in $\theta$ everywhere w.p.1.

This is a standard assumption in IV regression that provides identifiability of the true parameter $\theta_0$.

Assumption 2 (Data regularity). The space $\Xi = \mathcal{T} \times \mathcal{Y} \times \mathcal{Z} \subseteq \mathbb{R}^{d_\xi}$ is compact.

Assumption 3 (Smoothness w.r.t. data). The moment function $\psi(\,\cdot\,; \theta) : \mathcal{T} \times \mathcal{Y} \to \mathbb{R}^m$ is $C^\infty$-smooth in the data for every $\theta \in \Theta$. Further, the sets of functions $\{\psi(\,\cdot\,; \theta)_l : \theta \in \Theta\}$ and $\{(J_x\psi(\,\cdot\,; \theta))_{lr} : \theta \in \Theta\}$ are $P_0$-Donsker for every $l \in [m]$ and $r \in [d_x]$.

Assumptions 2 and 3 ensure that the moment function and its derivatives are well behaved with respect to the data. While the compactness of the data space might be violated in practice, usually one can construct a sufficiently large compact set that contains the data with high probability.

Assumption 4. The matrix $V(Z; \theta) \in \mathbb{R}^{m \times m}$ defined as

$$V(Z; \theta) = E[J_x\psi(X; \theta)\, \Gamma^{-1} J_x\psi(X; \theta)^\top \mid Z] \qquad (9)$$

is non-singular for $\theta \in \{\theta_0, \bar{\theta}\}$ w.p.1, where $\bar{\theta}$ is an initial parameter estimate defined in Assumption 6.

This corresponds to the common assumption of a non-singular covariance matrix required by related estimators (Newey & Smith, 2004; Kremer et al., 2022; Bennett & Kallus, 2023), but here imposed on the covariance of the data-Jacobian.

Assumption 5 (Instrument function). $\mathcal{H} = \bigoplus_{l=1}^m \mathcal{H}_l$ is a sufficiently rich space of vector-valued functions such that equivalence between (1) and (2) holds. Further, for $l \in [m]$, every $h \in \mathcal{H}_l$ is $C^\infty$-smooth, and the unit ball $\mathcal{H}_{l,1} := \{h \in \mathcal{H}_l : \|h\|_{\mathcal{H}_l} \le 1\}$ as well as $\{J_z h : h \in \mathcal{H}_{l,1}\}$ are $P_0$-Donsker.

This is fulfilled, for example, by choosing each $\mathcal{H}_l$ as the RKHS of a universal, integrally strictly positive definite kernel, e.g., the Gaussian kernel, which we will formalize later. For neural network instrument function classes, equivalence between the variational and conditional formulations can be shown on the basis of universal approximation theorems (Yarotsky, 2017; 2018). In this case, $C^\infty$-smoothness can be realized by using smooth activation functions.

Assumption 6 (Regularization). There is a first-stage parameter estimate $\bar{\theta}_n \to_p \bar{\theta}$ for which $E\big[\|\psi(X; \bar{\theta}_n) - \psi(X; \bar{\theta})\|\big] = O_p(n^{-\zeta})$ and $E\big[\|J_x\psi(X; \bar{\theta}_n) - J_x\psi(X; \bar{\theta})\|\big] = O_p(n^{-\zeta})$ with $0 < \zeta \le 1/2$. Choose $\lambda_n = O_p(n^{-\rho})$ with $0 < \rho < \zeta$.

For linear IV regression this implies $\|\bar{\theta}_n - \bar{\theta}\| = O_p(n^{-\zeta})$, which means $\bar{\theta}_n$ has to be an $n^{-\zeta}$-consistent estimator for $\bar{\theta}$, which can be any parameter for which (9) is non-singular, e.g., the true parameter $\theta_0$.

Assumption 7 (Smoothness w.r.t. θ). $\theta_0 \in \mathrm{int}(\Theta)$; $\psi(x; \theta)$ is continuously differentiable in a neighborhood $\bar{\Theta}$ of $\theta_0$; $E[\sup_{\theta \in \bar{\Theta}} \|J_\theta\psi(X; \theta)\|^2 \mid Z] < \infty$ w.p.1; and $\mathrm{rank}\big(E[J_\theta\psi(X; \theta_0) \mid Z]\big) = p$ w.p.1.

This allows us to translate the convergence rate of the moment functional into a rate for the parameter estimate. With these assumptions, consistency of SMM follows.

Theorem 3.4 (Consistency). Let Assumptions 1-6 be satisfied. For any $0 < \epsilon_1 < \epsilon_2$, choose $\epsilon \sim \mathrm{Uniform}([\epsilon_1, \epsilon_2])$. Then the SMM estimator $\hat{\theta}$ converges to the true parameter $\theta_0$ in probability, $\hat{\theta} \to_p \theta_0$. If additionally Assumption 7 is satisfied, then $\|\hat{\theta} - \theta_0\| = O_p(n^{-1/2})$.

The consistency result is independent of the choice of instrument function space $\mathcal{H}$ as long as it fulfills Assumption 5. Next, we discuss two different implementations of $\mathcal{H}$ based on kernel methods and neural networks.

Algorithm 1: n-stage Kernel-SMM
  Input: initial function $\bar{f}$, hyperparameters $\epsilon$, $\lambda$, $\gamma_x$
  for i = 1, ..., n do
    Compute $Q(\bar{f})$
    while not converged do
      $f \leftarrow \mathrm{GradientDescent}(f, \nabla_f R_{Q(\bar{f})}(f))$
    end while
    $\bar{f} \leftarrow f$
  end for
  Output: function estimate f
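For a parametric model, the outer structure of Algorithm 1 can be sketched in a few lines (illustrative only, not the paper's code); `sinkhorn_profile` is a hypothetical helper assumed to evaluate the profile $R_{Q(\bar{f})}(f)$ of Theorem 3.6, as in the sketch at the end of Section 3.2, and the inner gradient descent is delegated to scipy's L-BFGS.

```python
# Hedged sketch of Algorithm 1 for a parametric model f(.; theta). The helper
# `sinkhorn_profile(theta, theta_bar, t, y, z)` is hypothetical and assumed to
# evaluate the kernel Sinkhorn profile R_{Q(f_bar)}(f) of Theorem 3.6.
import numpy as np
from scipy.optimize import minimize

def kernel_smm(theta_init, data, sinkhorn_profile, n_stages=2):
    theta = np.array(theta_init, dtype=float)
    theta_bar = theta.copy()                       # first-stage estimate
    for _ in range(n_stages):
        # inner minimization of the profile with the first-stage estimate held fixed
        res = minimize(lambda th: sinkhorn_profile(th, theta_bar, *data),
                       theta, method="L-BFGS-B")
        theta = res.x
        theta_bar = theta.copy()                   # update the first-stage estimate
    return theta
```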
3.2. Kernel-SMM

Choosing $\mathcal{H}$ as the RKHS of a suitable kernel, we can guarantee equivalence between the conditional and variational moment restriction formulations (1) and (2). On top of that, for RKHS instrument functions we can employ a representer theorem and carry out the optimization over the instrument function $h \in \mathcal{H}$ in closed form. The resulting estimator can be obtained as the solution of a simple minimization problem bearing close resemblance to the optimally weighted 2-stage GMM estimator, but taking into account the geometry of the moment violation with respect to the data. Before deriving the result, we provide the necessary background on reproducing kernel Hilbert spaces (RKHS).

Reproducing Kernel Hilbert Spaces. An RKHS $\mathcal{H}$ is a Hilbert space of functions $h : \mathcal{Z} \to \mathbb{R}$ in which point evaluation is a bounded functional. With every RKHS one can associate a positive semi-definite kernel $k(\cdot, \cdot) : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ with the reproducing property, i.e., for any $h \in \mathcal{H}$ we have $h(z) = \langle h, k(z, \cdot)\rangle_{\mathcal{H}}$. A kernel is called universal if its RKHS is dense in the set of all continuous real-valued functions (Micchelli et al., 2006). Further, a kernel is called integrally strictly positive definite (ISPD) if for any $h \in \mathcal{H}$ with $0 < \|h\|^2_{\mathcal{H}} < \infty$, we have $\int_{\mathcal{Z} \times \mathcal{Z}} h(z)\, k(z, z')\, h(z')\, dz\, dz' > 0$. We refer to, e.g., Schölkopf & Smola (2002) and Berlinet & Thomas-Agnan (2011) for comprehensive introductions. The following proposition specifies the properties of an RKHS for which Assumption 5 is satisfied.

Proposition 3.5. Let $\mathcal{Z} \subseteq \mathbb{R}^{d_z}$ be compact. Then the instrument function space $\mathcal{H} = \bigoplus_{l=1}^m \mathcal{H}_l$, where each $\mathcal{H}_l$ corresponds to the RKHS of a universal, integrally strictly positive definite kernel $k_l$, $l \in [m]$, fulfills Assumption 5.

Now, for a representer theorem to hold, in the following we place infinite cost $\gamma_z = \infty$ on the transport of $z \in \mathcal{Z}$, i.e., we fix the instruments at their empirical locations. As long as $\gamma_t, \gamma_y < \infty$, this still allows for varying the functional relation between Z and T as well as between T and Y in the training data. In the following, define the block-diagonal matrix $\Gamma_x := \mathrm{diag}(\{\gamma_t I_{d_t}, \gamma_y I_{d_y}\}) \in \mathbb{R}^{d_x \times d_x}$ and the weighted Laplace operator $\Delta_x = \nabla_x \cdot \Gamma_x^{-1} \nabla_x$.

Theorem 3.6 (Kernel-SMM). Let $\mathcal{H} = \bigoplus_{l=1}^m \mathcal{H}_l$ be the direct sum of m reproducing kernel Hilbert spaces with kernels $k_l : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$. Let $\bar{f} \in \mathcal{F}$ denote a first-stage estimate of $f_0$ and let $\gamma_z = \infty$. Define $\psi^\dagger(f) \in \mathbb{R}^{nm}$, $L \in \mathbb{R}^{nm \times nm}$ and $Q(f) \in \mathbb{R}^{nm \times nm}$ with entries

$$\psi^\dagger(f)_{il} = \big(I + \tfrac{\epsilon}{2}\Delta_x\big)\psi_l(x_i; f),$$
$$L_{(il),(jr)} = \delta_{lr}\, k_l(z_i, z_j),$$
$$Q(f)_{(il),(jr)} = \frac{1}{n}\sum_{k=1}^n \sum_{s=1}^{d_x} k_l(z_i, z_k)\, \partial_{x_s}\psi_l(x_k; f)\, (\Gamma_x^{-1})_{ss}\, \partial_{x_s}\psi_r(x_k; f)\, k_r(z_k, z_j).$$

Then the Sinkhorn profile is given by

$$R_{Q(\bar{f})}(f) = \frac{1}{2n^2}\, \psi^\dagger(f)^\top L \Big(Q(\bar{f}) + \frac{\lambda}{\epsilon} L\Big)^{-1} L\, \psi^\dagger(f). \qquad (10)$$

Compared to the general saddle-point formulation (8), the kernelized version (10) has the significant advantage that it only involves a minimization over the model parameters and thus avoids the difficulties of mini-max optimization (Daskalakis et al., 2017). Algorithm 1 details the implementation of the multi-stage Kernel-SMM approach. In order to minimize the number of hyperparameters, we implement the gradient descent step with the limited-memory BFGS method (Liu & Nocedal, 1989). We empirically observed that the n-stage estimator effectively converges by the second iteration.
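As an illustration of Theorem 3.6, the following numpy sketch assembles $\psi^\dagger$, $L = K$ and $Q(\bar{f})$ for a scalar quadratic model with residual moment function and evaluates the profile (10). The model, the RBF bandwidth and all constants are illustrative assumptions rather than the paper's implementation; the function could serve as the `sinkhorn_profile` helper assumed in the sketch after Algorithm 1.

```python
# Hedged numpy sketch of the Kernel-SMM profile (10) for f(t; theta) = theta_1 t^2 +
# theta_2 t + theta_3 and psi(x; theta) = y - f(t), so m = 1, the data-gradient of psi
# is (-f'(t), 1) and the weighted Laplacian is -2 theta_1 / gamma_t. Illustrative only.
import numpy as np

def rbf_kernel(Z, eta):
    return np.exp(-eta * (Z[:, None] - Z[None, :]) ** 2)

def sinkhorn_profile(theta, theta_bar, t, y, z, eps=1e-2, lam=1e-4,
                     gamma_t=1.0, gamma_y=1.0, eta=0.5):
    n = len(t)
    K = rbf_kernel(z, eta)                                  # the matrix L with m = 1
    # psi^dagger = (I + eps/2 * Delta_x) psi, with Delta_x psi = -2 theta_1 / gamma_t
    psi = y - (theta[0] * t**2 + theta[1] * t + theta[2])
    psi_dag = psi + 0.5 * eps * (-2.0 * theta[0] / gamma_t)
    # Q(f_bar): Gamma-weighted data-gradients of psi evaluated at the first stage
    dpsi_dt = -(2.0 * theta_bar[0] * t + theta_bar[1])      # d psi / d t at theta_bar
    w = dpsi_dt**2 / gamma_t + 1.0 / gamma_y                # ||grad_x psi||_Gamma^2 per point
    Q = K @ np.diag(w) @ K / n
    A = Q + (lam / eps) * K + 1e-8 * np.eye(n)              # small jitter for stability
    v = np.linalg.solve(A, K @ psi_dag)
    return float(psi_dag @ K @ v) / (2 * n**2)

# toy usage, e.g., inside the outer loop sketched after Algorithm 1
rng = np.random.default_rng(0)
n = 200
z = rng.uniform(-3, 3, n)
t = z + rng.normal(scale=0.5, size=n)
y = 3.0 * t**2 - 0.5 * t + 0.5 + rng.normal(scale=0.1, size=n)
theta0 = np.array([3.0, -0.5, 0.5])
print(sinkhorn_profile(theta0, theta0, t, y, z))
```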
3.3. Neural-SMM

A particularly interesting alternative choice of instrument function space are neural network classes, as they can represent highly flexible functions while allowing for optimization via mini-batch stochastic gradient methods. As demonstrated by related works (Lewis & Syrgkanis, 2018; Bennett et al., 2019; Kremer et al., 2022), such neural network-based approaches can lead to powerful and scalable estimators that may outperform the corresponding kernel method on large samples. On the downside, they tend to be difficult to train due to the instability and hyperparameter sensitivity of mini-max optimization. This is particularly problematic for IV regression, as, in contrast to standard supervised learning, it is non-trivial to define suitable validation metrics to set these hyperparameters. As a result, compared to (10), those estimators require more attention and careful evaluation, which makes them less suitable as plug-and-play IV estimators for practitioners. As the primary focus of this work is to introduce a new geometry-aware learning paradigm for IV regression independent of the instrument function class, we consider the simpler kernel version in the following and defer results for the Neural-SMM estimator to Appendix B.

4. Experimental Results

We benchmark the kernel version of our method against a selection of plug-and-play IV estimators including maximum moment restrictions (MMR) (Zhang et al., 2023), sieve minimum distance (SMD) (Ai & Chen, 2003) as well as the kernel variational method of moments (VMM) (Bennett & Kallus, 2023). Results for the neural network version and related estimators can be found in Appendix B. For all kernel methods we choose a radial basis function kernel $k(z, z') = \exp(-\eta \|z - z'\|_2^2)$, where we set $\eta$ according to the median heuristic (Garreau et al., 2017). The remaining hyperparameters of all methods are set by using the MMR objective on a validation data set (see Appendix A). In all experiments we consider perturbations in the treatment variable t and fix the other variables at their empirical values by setting $\gamma_y, \gamma_z = \infty$ for SMM. Implementations of our estimators are available at https://github.com/HeinerKremer/sinkhorn-iv/.

IV Regression with Corrupted Data. We consider the Simple IV experiment of Bennett & Kallus (2023) with the following data-generating process,

$$Z = \sin(\pi Z_0 / 10), \quad T = 0.75\, Z_0 + 3.5\, U + 0.14\, \eta_1 - 0.6, \quad Y = f(T; \theta_0) - 10\, U + 0.1\, \eta_2, \qquad (11)$$

where $\eta_1, \eta_2, U \sim \mathcal{N}(0, I)$ and $Z_0 \sim \mathrm{Uniform}([-5, 5])$. The model is given by $f(t; \theta) = \theta_1 t^2 + \theta_2 t + \theta_3$ with $\theta_0 = [3.0, 0.5, 0.5]$. This is a typical IV problem, where the unobserved confounder U induces a non-causal dependence between T and Y. To investigate the robustness against corrupted data, we sample training sets of 1000 points and replace a proportion of the covariates T with random values generated according to $\mathrm{Uniform}([t_{\min}, t_{\max}])$. Figure 3 shows the mean-squared error of the models trained with different methods over the proportion of random covariates in the training data. We observe that without data corruption all estimators perform similarly, with SMM providing a small advantage. With an increasing proportion of corrupted data, SMM scales favorably compared to the baselines. We provide more details and a hyperparameter sensitivity analysis in Appendix A.

[Figure 3. Robustness against corrupted data. We generate 1000 data points from the process (11) and substitute in a proportion of the data the treatment variable T for a random value sampled uniformly over the domain; the plot shows MSE($\hat{f}$, $f_0$) over the proportion of random covariates for MMR, SMD, VMM and SMM. Lines and error bars correspond to the mean and standard error computed over 20 training datasets.]
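A sketch of this setup is given below, following the reading of the data-generating process (11) stated above; the signs of the reconstructed constants should be treated as a best-effort reading rather than a faithful reproduction. The corruption step replaces a chosen fraction of the treatments by uniform draws over their range.

```python
# Hedged sketch of the corrupted-data setup: data from process (11) as reconstructed
# above, with a fraction of the treatments replaced by uniform noise. Constants follow
# the text and should be treated as a best-effort reading.
import numpy as np

def simple_iv_data(n, theta=(3.0, 0.5, 0.5), seed=0):
    rng = np.random.default_rng(seed)
    z0 = rng.uniform(-5, 5, n)
    u, eta1, eta2 = rng.normal(size=(3, n))
    z = np.sin(np.pi * z0 / 10)
    t = 0.75 * z0 + 3.5 * u + 0.14 * eta1 - 0.6
    y = theta[0] * t**2 + theta[1] * t + theta[2] - 10 * u + 0.1 * eta2
    return t, y, z

def corrupt_treatments(t, fraction, seed=0):
    """Replace a given fraction of the covariates T by Uniform([t_min, t_max]) draws."""
    rng = np.random.default_rng(seed)
    t = t.copy()
    idx = rng.choice(len(t), size=int(fraction * len(t)), replace=False)
    t[idx] = rng.uniform(t.min(), t.max(), size=len(idx))
    return t

t, y, z = simple_iv_data(1000)
t_corrupted = corrupt_treatments(t, fraction=0.1)
```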
Adversarially Robust IV Regression. We test the adversarial robustness of different IV estimators in the following setting. Define $C = 0.2\, I \in \mathbb{R}^{5 \times 1}$, as well as $B \in \mathbb{R}^{5 \times 1}$ with fixed entries sampled from $\mathrm{Uniform}([0.1, 0.3])$. Consider the non-linear data-generating process

$$Z \sim \mathrm{Uniform}([-3, 3]), \quad T = BZ + CU + \eta_1, \quad Y = f_0(T) + U + \eta_2,$$

with $U \sim \mathcal{N}(0, 1)$, $\eta_1, \eta_2 \sim \mathcal{N}(0, 0.1)$ and $f_0(t) = 1.5\cos(At) + 0.1\, At$, where $A \in \mathbb{R}^{1 \times 5}$ with fixed entries sampled from $\mathrm{Uniform}([-1.5, 1.5])$. We approximate $f_0$ with a feed-forward neural network with [20, 20, 3] hidden units and leaky ReLU activation functions. We train the network using different plug-and-play IV estimators and evaluate the adversarial robustness by running FGSM attacks (Goodfellow et al., 2014) on the treatment t with strength $\epsilon \in [0, 1.0]$.

Figure 4 shows that all IV estimators yield comparable mean-squared errors for $\epsilon = 0$, clearly improving over the non-causal least-squares (LSQ) solution (see the table in Figure 4). Moreover, for increasing attack strengths $\epsilon$, we see that SMM demonstrates stronger adversarial robustness than the SMD and VMM estimators. Interestingly, here the MMR estimator, which performed worse in the first experiment, exhibits the least sensitivity towards adversarial perturbations. This might be understood by the fact that the MMR estimator corresponds to the limit case of SMM and VMM for $\lambda \to \infty$. Generally, strong regularization promotes flat functions which are less sensitive to the inputs, which could explain MMR's superior robustness here.

[Figure 4. Adversarial robustness of IV estimators. We use a training set of size n = 1000 and evaluate the learned models, $\|\hat{f}(t + \epsilon \Delta t) - f_0(t)\|_2^2$, over FGSM attacks with increasing strength $\epsilon$. Lines and error bars show the mean and standard error over 20 random training datasets. The table contains the MSE in the perturbation-free case ($\epsilon = 0$): LSQ 0.45, MMR 0.014, SMD 0.018, VMM 0.012, SMM 0.012.]

In Appendix B we provide results on a common modern IV benchmark which provide further evidence that SMM performs on par with state-of-the-art estimators in standard IV settings. In this context, we also provide results for a Neural-SMM estimator, which proves to be competitive with state-of-the-art deep learning approaches (Bennett et al., 2019; Kremer et al., 2022).
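The following PyTorch sketch illustrates such an FGSM evaluation, assuming the attack direction is the sign of the gradient of the squared prediction error with respect to the treatment; the network architecture, the ground-truth function and all constants are illustrative assumptions.

```python
# Hedged sketch of the FGSM robustness evaluation: perturb the treatment t in the sign
# of the gradient of the squared error and measure the error against the true function.
import torch

def fgsm_mse(model, f0, t, eps):
    """Mean of ||model(t + eps * sign(grad_t loss)) - f0(t)||^2 for attack strength eps."""
    t_adv = t.clone().requires_grad_(True)
    loss = ((model(t_adv) - f0(t)) ** 2).mean()
    (grad,) = torch.autograd.grad(loss, t_adv)
    with torch.no_grad():
        t_attacked = t + eps * grad.sign()
        return ((model(t_attacked) - f0(t)) ** 2).mean().item()

# example usage with a toy network and ground truth (illustrative)
model = torch.nn.Sequential(torch.nn.Linear(5, 20), torch.nn.LeakyReLU(),
                            torch.nn.Linear(20, 20), torch.nn.LeakyReLU(),
                            torch.nn.Linear(20, 1))
A = torch.rand(1, 5) * 3.0 - 1.5                    # entries in [-1.5, 1.5]
f0 = lambda t: 1.5 * torch.cos(t @ A.T) + 0.1 * (t @ A.T)
t = torch.randn(1000, 5)
print([round(fgsm_mse(model, f0, t, e), 3) for e in (0.0, 0.25, 0.5, 1.0)])
```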
5. Related Work

Instrumental variable regression has traditionally been addressed via the 2-stage least squares (2SLS) method, which limits both regression stages to linear models (Angrist & Pischke, 2008). Extensions to non-linear models have been provided by multiple works (Amemiya, 1974), recently based on density estimators (Hartford et al., 2017; Singh et al., 2019) and deep features (Xu et al., 2021). As an alternative to 2SLS, estimators based on the conditional moment restriction formulation have been used, based either on basis function expansions of $L^2$ (Carrasco & Florens, 2000; Ai & Chen, 2003; Carrasco et al., 2007; Otsu, 2011) or on machine learning models (Bennett et al., 2019; Dikkala et al., 2020; Muandet et al., 2020; Kremer et al., 2022; 2023; Bennett & Kallus, 2023). Related to our Kernel-SMM estimator, multiple works have used RKHS functions as instrument models (Carrasco & Florens, 2000; Singh et al., 2019; Bennett & Kallus, 2023; Zhang et al., 2023), leading to formulations similar to our (10). However, in contrast to ours, none of them take into account the geometry of the moment violation with respect to the data.

Optimization over measure spaces by means of minimizing some notion of distributional distance between the optimization variable and an empirical distribution has recently attracted significant attention in the context of distributionally robust optimization (Duchi & Namkoong, 2017; Sinha et al., 2018; Mohajerin Esfahani & Kuhn, 2018; Lam, 2019; Duchi & Namkoong, 2020; Duchi et al., 2021). On a higher level, one can distinguish between three types of approaches based on the respective distance notion (cf. Figure 1): φ-divergences restrict the optimization variable to a finite-dimensional vector of weights attributed to the data points and thus find optimal reweightings of the sample. Methods based on the maximum mean discrepancy (Gretton et al., 2007) and the Fisher-Rao metric (Bauer et al., 2016) allow for the creation and annihilation of probability mass (Zhu et al., 2021; Kremer et al., 2023; Yan et al., 2023). Finally, methods based on optimal transport distances effectively allow moving the data points around in the data space (Mohajerin Esfahani & Kuhn, 2018; Sinha et al., 2018). While CMR estimation has been based on the previous two paradigms, to the best of our knowledge our Sinkhorn Method of Moments is the first estimator based on the latter category. In a different context, empirical likelihood has previously been combined with Wasserstein distances to calibrate the radius of ambiguity sets in distributionally robust optimization (DRO) (Blanchet et al., 2019). However, their method does not extend to CMR estimation, and neither does it make use of a regularized duality structure. From a mathematical perspective, the derivation of our first duality result (Theorem 3.1) closely resembles the derivation of the dual Sinkhorn DRO estimator of Wang et al. (2023), which nevertheless addresses an entirely different problem. In addition, Wang et al. (2023) rely on de-biasing techniques to optimize their objective, whereas we provide a form that can be directly optimized via stochastic gradient methods.

6. Conclusion

Instrumental variable regression is an important concept in the field of causal inference, which motivates the development of estimators adapted to the intricacies of real-world datasets. Notwithstanding recent mini-max estimators based on neural network instrument function classes showing convincing performance on benchmarks (Bennett et al., 2019; Dikkala et al., 2020; Kremer et al., 2022; 2023), there remains a need for simple plug-and-play estimators that can be trained by practitioners without deep technical knowledge and with a manageable set of hyperparameters. We have extended the repertoire of such estimators by a method whose learning signal arises from an optimal transport geometry in the data space. We showed that our estimator exhibits favorable properties in the presence of corrupted data or adversarial examples while maintaining performance competitive with state-of-the-art approaches on standard benchmarks. The simplicity of our plug-and-play estimator partially results from its kernel-based implementation, which limits the scalability to large sample sizes. To address this, we provide a neural network-based implementation in the appendix, whose detailed analysis is left for future work.

Acknowledgements

We thank Yassine Nemmour and Frederike Lübeck for helpful initial discussions on the project as well as Frederik Träuble for insisting on using certain matrix identities.
Geometry-Aware Instrumental Variable Regression Impact Statement This paper presents work whose goal is to conceptually advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Causal analysis of systems has the potential to advance our understanding of a system s response under interventions, which may lead to more rational decision making. Ai, C. and Chen, X. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71(6):1795 1843, 2003. Amemiya, T. The nonlinear two-stage least-squares estimator. Journal of Econometrics, 2(2):105 110, 1974. Angrist, J. D. and Pischke, J.-S. Mostly harmless econometrics. Princeton university press, 2008. Bauer, M., Bruveris, M., and Michor, P. W. Uniqueness of the fisher rao metric on the space of smooth densities. Bulletin of the London Mathematical Society, 48(3):499 506, 2016. Bennett, A. and Kallus, N. The variational method of moments. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(3):810 841, 2023. Bennett, A., Kallus, N., and Schnabel, T. Deep generalized method of moments for instrumental variable analysis. Advances in neural information processing systems, 32, 2019. Berlinet, A. and Thomas-Agnan, C. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011. Bierens, H. J. Consistent model specification tests. Journal of Econometrics, 20(1):105 134, 1982. Blanchet, J., Kang, Y., and Murthy, K. Robust wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830 857, 2019. Carrasco, M. and Florens, J.-P. Generalization of gmm to a continuum of moment conditions. Econometric Theory, 16(6):797 834, 2000. ISSN 02664666, 14694360. Carrasco, M., Chernov, M., Florens, J.-P., and Ghysels, E. Efficient estimation of general dynamic models with a continuum of moment conditions. Journal of econometrics, 140(2):529 573, 2007. Chen, X., Liu, C., Li, B., Lu, K., and Song, D. Targeted backdoor attacks on deep learning systems using data poisoning. ar Xiv preprint ar Xiv:1712.05526, 2017. Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in neural information processing systems, 26, 2013. Daskalakis, C., Ilyas, A., Syrgkanis, V., and Zeng, H. Training gans with optimism. ar Xiv preprint ar Xiv:1711.00141, 2017. Dikkala, N., Lewis, G., Mackey, L., and Syrgkanis, V. Minimax estimation of conditional moment models. In Advances in Neural Information Processing Systems, volume 33, pp. 12248 12262. Curran Associates, Inc., 2020. Duchi, J. and Namkoong, H. Variance-based regularization with convex objectives. Advances in neural information processing systems, 30, 2017. Duchi, J. and Namkoong, H. Learning models with uniform performance via distributionally robust optimization, 2020. Duchi, J. C., Glynn, P. W., and Namkoong, H. Statistics of robust optimization: A generalized empirical likelihood approach. Mathematics of Operations Research, 46(3): 946 969, 2021. Garreau, D., Jitkrittum, W., and Kanagawa, M. Large sample analysis of the median heuristic. ar Xiv preprint ar Xiv:1707.07269, 2017. Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. ar Xiv preprint ar Xiv:1412.6572, 2014. Gretton, A., Fukumizu, K., Teo, C., Song, L., Sch olkopf, B., and Smola, A. A kernel statistical test of independence. 
Advances in neural information processing systems, 20, 2007. Gretton, A., Borgwardt, K. M., Rasch, M. J., Sch olkopf, B., and Smola, A. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723 773, 2012. Hall, A. Generalized method of moments. Wiley Online Library, 2004. Hall, P., Wolff, R. C., and Yao, Q. Methods for estimating a conditional distribution function. Journal of the American Statistical association, 94(445):154 163, 1999. Hansen, L. P. Large sample properties of generalized method of moments estimators. Econometrica, 50(4): 1029 1054, 1982. ISSN 00129682, 14680262. Hansen, L. P., Heaton, J., and Yaron, A. Finite-sample properties of some alternative gmm estimators. Journal of Business & Economic Statistics, 14(3):262 280, 1996. ISSN 07350015. Geometry-Aware Instrumental Variable Regression Hartford, J., Lewis, G., Leyton-Brown, K., and Taddy, M. Deep iv: A flexible approach for counterfactual prediction. In International Conference on Machine Learning, pp. 1414 1423. PMLR, 2017. Imbens, G. W., Spady, R. H., and Johnson, P. Information theoretic approaches to inference in moment condition models. Econometrica, 66(2):333 357, 1998. ISSN 00129682, 14680262. Kitamura, Y. and Stutzer, M. An information-theoretic alternative to generalized method of moments estimation. Econometrica, 65(4):861 874, 1997. ISSN 00129682, 14680262. Kosorok, M. R. Introduction to empirical processes and semiparametric inference, volume 61. Springer, 2008. Kremer, H., Zhu, J.-J., Muandet, K., and Sch olkopf, B. Functional generalized empirical likelihood estimation for conditional moment restrictions. In International Conference on Machine Learning, pp. 11665 11682. PMLR, 2022. Kremer, H., Nemmour, Y., Sch olkopf, B., and Zhu, J.-J. Estimation beyond data reweighting: Kernel method of moments. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 17745 17783. PMLR, 23 29 Jul 2023. Lam, H. Recovering best statistical guarantees via the empirical divergence-based distributionally robust optimization. Operations Research, 67(4):1090 1105, 2019. Lewis, G. and Syrgkanis, V. Adversarial generalized method of moments, 2018. Liu, D. C. and Nocedal, J. On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1-3):503 528, 1989. Micchelli, C., Xu, Y., and Zhang, H. Universal kernels. Mathematics, 7, 12 2006. Mohajerin Esfahani, P. and Kuhn, D. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1-2):115 166, 2018. Muandet, K., Mehrjou, A., Lee, S. K., and Raj, A. Dual instrumental variable regression, 2020. Newey, W. K. and Smith, R. J. Higher order properties of gmm and generalized empirical likelihood estimators. Econometrica, 72(1):219 255, 2004. ISSN 00129682, 14680262. Otsu, T. Empirical likelihood estimation of conditional moment restriction models with unknown functions. Econometric Theory, 27(1):8 46, 2011. Owen, A. Empirical likelihood ratio confidence regions. The Annals of Statistics, 18(1):90 120, 1990. ISSN 00905364. Owen, A. B. Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75(2):237 249, 1988. ISSN 00063444. Owen, A. B. Empirical likelihood. Chapman and Hall/CRC, 2001. Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, NY, USA, 2000. Peyr e, G., Cuturi, M., et al. 
Computational optimal transport: with applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355-607, 2019.

Qin, J. and Lawless, J. Empirical likelihood and general estimating equations. The Annals of Statistics, 22(1):300-325, 1994. ISSN 00905364.

Saengkyongam, S., Henckel, L., Pfister, N., and Peters, J. Exploiting independent instruments: Identification and distribution generalization. In International Conference on Machine Learning, pp. 18935-18958. PMLR, 2022.

Schölkopf, B. and Smola, A. J. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2002.

Schölkopf, B., Herbrich, R., and Smola, A. J. A generalized representer theorem. In Computational Learning Theory, pp. 416-426, 2001.

Singh, R., Sahani, M., and Gretton, A. Kernel instrumental variable regression. Advances in Neural Information Processing Systems, 32, 2019.

Sinha, A., Namkoong, H., and Duchi, J. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.

Sinkhorn, R. and Knopp, P. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343-348, 1967.

Wang, J., Gao, R., and Xie, Y. Sinkhorn distributionally robust optimization, 2023.

Xu, L., Chen, Y., Srinivasan, S., de Freitas, N., Doucet, A., and Gretton, A. Learning deep features in instrumental variable regression. In International Conference on Learning Representations, 2021.

Yan, Y., Wang, K., and Rigollet, P. Learning Gaussian mixtures using the Wasserstein-Fisher-Rao gradient flow. arXiv preprint arXiv:2301.01766, 2023.

Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103-114, 2017.

Yarotsky, D. Optimal approximation of continuous functions by very deep ReLU networks. In Conference on Learning Theory, pp. 639-649. PMLR, 2018.

Zhang, R., Imaizumi, M., Schölkopf, B., and Muandet, K. Instrumental variable regression via kernel maximum moment loss. Journal of Causal Inference, 11(1):20220073, 2023.

Zhu, J.-J., Jitkrittum, W., Diehl, M., and Schölkopf, B. Kernel distributionally robust optimization: Generalized duality theorem and stochastic approximation. In International Conference on Artificial Intelligence and Statistics, pp. 280-288. PMLR, 2021.

A. Experimental Details

Hyperparameters. For SMM we choose the hyperparameters from the grid defined by $\epsilon \in [10^{-6}, 10^{-4}, 10^{-2}]$ and $\lambda/\epsilon \in [10^{-6}, 10^{-4}, 10^{-2}, 1.0]$. Note that as $\epsilon$ and $\gamma_t$ only appear as $\epsilon/\gamma_t$, we absorb the factor $\gamma_t$ into $\epsilon$ and consider $\gamma_t = 1$ everywhere. For VMM we choose the hyperparameters from $\lambda \in [10^{-6}, 10^{-4}, 10^{-2}, 1.0]$, as done by the authors of the method (Bennett & Kallus, 2023). We pick the best hyperparameter configuration by evaluating the MMR objective (Zhang et al., 2023) on a validation data set of the same size as the training set. We visualize the dependency on the hyperparameters for the first experiment without random covariates in Figure 5. We observe that the method is rather insensitive to the choice of $\epsilon$ but admits a stronger dependence on the choice of the regularization parameter $\lambda$.

[Figure 5. Kernel-SMM dependency on hyperparameters. Heatmap of the prediction error over $\epsilon \in \{10^{-6}, 10^{-4}, 10^{-2}\}$ (rows) and $\lambda/\epsilon \in \{10^{-6}, 10^{-4}, 10^{-2}, 1.0\}$ (columns): (1.34, 1.93, 3.29, 4.39), (1.55, 3.66, 3.03, 4.40), (1.16, 2.25, 3.11, 4.40). We evaluate the SMM estimator on the first experiment without random covariates for different hyperparameter configurations. Values correspond to the mean of the prediction error $E[\|f(T; \hat{\theta}) - f(T; \theta_0)\|_2^2]$ averaged over models trained on 20 random training sets.]
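A hedged sketch of this selection procedure is given below. It assumes the V-statistic form of the MMR objective of Zhang et al. (2023), $\frac{1}{n^2}\sum_{ij}\psi_i\psi_j k(z_i, z_j)$, and `train_smm_model` is a hypothetical training routine; the grid values follow the text above.

```python
# Hedged sketch of hyperparameter selection via the MMR validation objective
# (assumed V-statistic form of the kernel maximum moment loss). `train_smm_model`
# is a hypothetical training routine; eta is an RBF bandwidth set elsewhere.
import itertools
import numpy as np

def mmr_objective(residuals, z, eta):
    K = np.exp(-eta * (z[:, None] - z[None, :]) ** 2)   # RBF kernel on the instruments
    return float(residuals @ K @ residuals) / len(z) ** 2

def select_hyperparameters(train, val, train_smm_model, eta):
    t_val, y_val, z_val = val
    best, best_score = None, np.inf
    for eps, lam_over_eps in itertools.product([1e-6, 1e-4, 1e-2],
                                               [1e-6, 1e-4, 1e-2, 1.0]):
        model = train_smm_model(train, eps=eps, lam=lam_over_eps * eps)
        score = mmr_objective(y_val - model(t_val), z_val, eta)
        if score < best_score:
            best, best_score = (eps, lam_over_eps), score
    return best
```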
B. Additional Results

Network IV. Here we consider a common modern benchmark for IV regression in the standard setting without any data corruptions. Consider the following data-generating process introduced by Bennett et al. (2019) and subsequently used by many other works (Zhang et al., 2023; Kremer et al., 2022; 2023),

$$y = f_0(t) + e + \delta, \quad t = z + e + \gamma, \quad z \sim \mathrm{Uniform}([-3, 3]), \quad e \sim \mathcal{N}(0, 1), \quad \gamma, \delta \sim \mathcal{N}(0, 0.1),$$

where the function $f_0$ is chosen from the following set of simple functions: sin: $f_0(t) = \sin(t)$; abs: $f_0(t) = |t|$; linear: $f_0(t) = t$; step: $f_0(t) = \mathbb{I}\{t \ge 0\}$. We learn a neural network $f_\theta$ with two layers of [20, 3] hidden units and leaky ReLU activation functions to approximate the function $f_0$ by imposing the conditional moment restriction $E[Y - f_\theta(T) \mid Z] = 0$ $P_Z$-a.s. Table 1 contains the results of different plug-and-play IV estimators trained on a dataset of 1000 points and averaged over 20 random training datasets. We observe that SMD, VMM and SMM perform roughly on par, whereas MMR improves over the non-causal least-squares solution (LSQ), which ignores the instruments entirely, in only one of the settings.

Neural Estimators. We explore an alternative SMM implementation where we represent the instrument function $h \in \mathcal{H}$ as a neural network $h_\omega$ parameterized by $\omega \in \Omega$. With this choice, the estimator (8) takes the form

$$\hat{f} = \arg\min_{f \in \mathcal{F}} \max_{\omega \in \Omega}\; E_{\hat{P}_n}\Big[\big(I + \tfrac{\epsilon}{2}\Delta_\xi\big)\big(\psi(\,\cdot\,; f)^\top h_\omega(\,\cdot\,)\big)(\xi)\Big] - \frac{\epsilon}{2}\, E_{\hat{P}_n}\Big[\big\|\nabla_\xi \big(\psi(\,\cdot\,; \bar{f})^\top h_\omega(\,\cdot\,)\big)(\xi)\big\|^2_\Gamma\Big] - \frac{\lambda}{2}\|h_\omega(Z)\|_2^2.$$

The Neural-SMM estimator can be trained in the same fashion as the DeepGMM (Bennett et al., 2019) or Functional GEL (Kremer et al., 2022) estimators by alternating mini-batch stochastic gradient descent steps in the model parameters and the adversary parameters $\omega$.

Table 1. Network IV experiment. Results represent the mean and standard error of the prediction error $E[\|f(T; \hat{\theta}) - f(T; \theta_0)\|_2^2]$ resulting from 20 random training datasets.

          LSQ           MMR           SMD           VMM           SMM
  sin     0.36 ± 0.03   0.40 ± 0.02   0.12 ± 0.01   0.17 ± 0.02   0.15 ± 0.01
  abs     1.94 ± 1.48   0.61 ± 0.28   0.20 ± 0.08   0.09 ± 0.04   0.12 ± 0.04
  step    0.35 ± 0.04   > 100         0.04 ± 0.01   0.05 ± 0.01   0.04 ± 0.00
  linear  0.36 ± 0.05   0.36 ± 0.09   0.07 ± 0.04   0.03 ± 0.01   0.07 ± 0.03

Table 2. Neural CMR estimators. Results represent the mean and standard error of the prediction error $E[\|f(T; \hat{\theta}) - f(T; \theta_0)\|_2^2]$ resulting from 20 random runs of the Network IV experiment.

          DeepGMM       Neural FGEL   Neural SMM
  sin     0.08 ± 0.01   0.10 ± 0.01   0.07 ± 0.01
  abs     0.04 ± 0.01   0.04 ± 0.01   0.04 ± 0.01
  step    0.07 ± 0.01   0.08 ± 0.01   0.07 ± 0.01
  linear  0.05 ± 0.01   0.06 ± 0.01   0.05 ± 0.01

We benchmark the Neural-SMM estimator against DeepGMM (Bennett et al., 2019) and Functional GEL (Kremer et al., 2022), which achieved state-of-the-art results on several benchmarks including the Network IV experiment. For all methods we use the same instrument network architecture consisting of a feed-forward neural network with [50, 20] hidden units and leaky ReLU activation functions. We optimize the objective by alternating steps with an optimistic Adam (Daskalakis et al., 2017) optimizer with parameters $\beta = (0.5, 0.9)$. We tuned the learning rates for the model and the adversary by evaluating the DeepGMM estimator for different values and fix both to $5 \times 10^{-4}$ for all methods. In the same way we fix the batch size to 200 and the number of epochs to 3000. For the Functional GEL estimator we use the Kullback-Leibler divergence version.
For all methods we choose the regularization parameter λ from [10 6, 10 4, 10 2, 1.0] and for Neural-SMM we additionally choose ϵ from [10 6, 10 4, 10 2, 1.0] by using the MMR objective on a validation set of the same size as the training set. We observe in Table 2 that Neural-SMM performs on par with these SOTA estimators on all variants of the Network IV experiment, suggesting that the geometry-awareness and additional robustness of our estimator does not come at the price of reduced performance in standard settings. It does, however, come at the price of increased computation due to the presence of the gradient and Laplace operators with respect to the data in the objective. Figure 6 visualizes the dependence of Neural-SMM on its hyperparameters. We observe that for this experiment SMM requires either one or both parameters to be chosen large for optimal performance but the performance remains stable across a range of parameters. 1e-6 1e-4 1e-2 1.0 λ/ϵ 1e-6 1e-4 1e-2 1.0 ϵ 0.29 0.28 0.06 0.05 0.18 0.16 0.05 0.04 0.07 0.08 0.05 0.05 0.04 0.04 0.04 0.04 Figure 6. Neural-SMM dependency on hyperparameters. We evaluate the Neural-SMM estimator for different hyperparameter configurations exemplarily for the abs function in the network IV experiment. Values correspond to the mean of the prediction error E[ f(T; ˆθ) f(T; θ0) 2 2] averaged over models trained on 20 random training sets. Geometry-Aware Instrumental Variable Regression C.1. Duality Results Proof of Theorem 3.1 Proof. Introducing the Lagrange parameter ρ R, the Lagrangian of (3) reads L(P, ρ, f) = min π Π(P, ˆ Pn) E(ξ,ξ ) π c(ξ, ξ ) + ϵ log dπ(ξ, ξ ) dµ(ξ)dν(ξ ) + ρ sup h H, h H=1 EP [Ψ(ξ; f)(h)]. (14) As eventually the Lagrangian will be maximized with respect to ρ, we can merge it with the optimization over the unit ball in H to obtain a Lagrangian with an unrestricted parameter h H, L(P, h, f) = min π Π(P, ˆ Pn) E(ξ,ξ ) π c(ξ, ξ ) + ϵ log dπ(ξ, ξ ) dµ(ξ)dν(ξ ) + EP [Ψ(ξ; f)(h)]. (15) Note that the Wasserstein distance is mass preserving, i.e., we do not need to explicitly impose the constraint EP [1] = 1 as this is implied directly by normalization of the empirical distribution, i.e., let p and ˆp denote the density and probability mass functions of P and ˆPn respectively, then EP [1] = R Ξ p(ξ)dξ = R Ξ Pn i=1 π(ξ, ξ i)dξ = Pn i=1 ˆp(ξ i) = Pn i=1 1 n = 1. To derive the dual problem we need to minimize the Lagrangian over the primal variable P. By definition of the coupling distribution π we have p = P1 π and thus we can collapse the minimizations over the π and P into a single minimization over π Π( ˆPn) := {P(Ξ Ξ) : P2 π = ˆPn}, D(h, f) = min π Π( ˆ Pn) E(ξ,ξ ) π c(ξ, ξ ) + ϵ log dπ(ξ, ξ ) dµ(ξ)dν(ξ ) + EP1 π[Ψ(ξ; f)(h)]. (16) Now to extract the relevant degree of freedom we can write all expectation operators as combinations of the empirical expectation and conditional expectation over π(ξ, ξ ) given its second argument ξ Ξ. To see this, note that by the product rule we have π(ξ, ξ ) =: π(ξ|ξ )ˆp(ξ ) and by the law of iterated expectation we have for any function g : Ξ Ξ R, Eπ[g(ξ, ξ )] = Eξ ˆ Pn[Eξ π|ξ [g(ξ, ξ )|ξ ]], where we defined π|ξ as the conditional distribution of ξ given ξ , with density π(ξ|ξ ). Similarly we have for any function g : Ξ R, EP1 π[g(ξ)] = Z Ξ g(ξ)(P1 π)(ξ)dξ = Z i=1 π(ξ, ξ i)dξ (17) i=1 π(ξ|ξ i)ˆp(ξ i)dξ = Z i=1 π(ξ|ξ i)dξ (18) = Eξ ˆ Pn[Eξ π|ξ [g(ξ)|ξ ]]. (19) Therefore the optimization over π Π( ˆPn) is equivalent to a sequence of optimization problems over π|ξ P(Ξ), one for each value of ξ Ξ. 
With this we can express the dual problem (16) as D(h, f) = Eξ ˆ Pn min π|ξ P(Ξ) Eξ π|ξ c(ξ, ξ ) + ϵ log d(π|ξ )(ξ) + Ψ(ξ; f)(h) Now for each ξ Ξ consider the inner optimization problem G(ξ ; h, f) := min π|ξ P(Ξ) Eξ π|ξ c(ξ, ξ ) + ϵ log d(π|ξ )(ξ) + Ψ(ξ; f)(h) . (21) Define the density of π|ξ P(Ξ) with respect to the reference measure µ P(Ξ) as r(ξ) = d(π|ξ )(ξ) dµ(ξ) , then we can rewrite the optimization problem as an optimization over r R := {r : Ξ R+ : Eµ[r(ξ)] = 1}, G(ξ ; h, f) = min r R Eξ µ [r(ξ)c(ξ, ξ ) + ϵr(ξ) log (r(ξ)) + r(ξ)Ψ(ξ; f)(h)] . (22) Geometry-Aware Instrumental Variable Regression Now introducing Lagrange parameter η R and using Lagrangian duality we get G(ξ ; h, f) = sup η R min r:Ξ R+ Eξ µ[r(ξ)c(ξ, ξ ) + ϵr(ξ) log (r(ξ)) + r(ξ)Ψ(ξ; f)(h) + η(1 r(ξ))] (23) = sup η R η ϵEξ µ sup t 0 tη c(ξ, ξ ) Ψ(ξ; f)(h) ϵ t log t (24) = sup η R η ϵEξ µ exp η c(ξ, ξ ) Ψ(ξ; f)(h) where we used that the Fenchel conjugate of the Kullback Leibler divergence t log t is supt p, t t log t = ep 1. We can eliminate the dual normalization variable η R from the problem by solving the corresponding first order optimality condition 0 = 1 eη/ϵ 1EX µ exp Ψ(ξ; f)(h) c(ξ, ξ ) which yields η = ϵ ϵ log EX µ exp Ψ(ξ; f)(h) c(ξ, ξ ) Inserting back into (25), we obtain for each ξ Ξ G(ξ ; h, f) = ϵ log Eξ µ exp Ψ(ξ; f)(h) c(ξ, ξ ) and the result follows by inserting into (20) and redefining h/ϵ h. Proof of Theorem 3.2 Proof. Using the assumptions on the reference measure and cost function, we can write the objective in the form (6), where the inner expectation is given as Eξ N(ξ ,ϵΓ 1) h e Ψ(ξ;f)(h)i = Z Ξ e Ψ(ξ;f)e 1 2ϵ ξ ξ 2 Γ 1dξ. (29) As for small ϵ the integrand only provides a finite contribution in a neighborhood of ξ , we can use that Ψ is continuously differentiable everywhere and employ a Taylor expansion, Ψ(ξ; f)(h) =Ψ(ξ ; f)(h) + (ξ ξ )T ξΨ(ξ ; f)(h) (30) 2(ξ ξ )T 2 ξΨ(ξ ; f)(h)(ξ ξ ) + O( ξ ξ 3). (31) Note that due to the Gaussian measure under the integral we have ξ ξ = O(ϵ1/2). Now defining δ := ξ ξ Ξ as well as the gradient G(ξ ) := ξΨ(ξ ; f)(h) and Hessian H(ξ ) := 2 ξΨ(ξ ; f)(h) of the evaluated moment functional we can insert back and get Eξ N(ξ ,γ) h e Ψ(ξ;f)(h)i = e Ψ(ξ ;f)(h) Z 2ϵ 2ϵδT G(ξ ) + ϵδT H(ξ )δ + δT Γδ dδ + O(ϵ3/2). (32) Define the regularized Hessian Ωϵ := Ωϵ(ξ ) := Γ + ϵH(ξ ), which is invertible w.p.1, as for sufficiently small ϵ/γ we have λmin(Γ) = minw {t,y,z} γw > ϵλmin(H(ξ )) w.p.1 and thus Ωϵ is strictly positive definite w.p.1. Then we can employ a change of variables by defining ω := Ω1/2 ϵ δ and obtain Eξ N(ξ ,γ) h e Ψ(ξ;f)(h)i (33) =e Ψ(ξ ;f)(h) Z 1 det Ω1/2 ϵ exp 1 ωT ω + 2ϵωT Ω 1/2 ϵ G(ξ ) dω + O(ϵ3/2). (34) Geometry-Aware Instrumental Variable Regression Now, completing the square we obtain Eξ N(ξ ,γ) h e Ψ(ξ;f)(h)i (35) =e Ψ(ξ ;f)(h)e ϵ 2 G(ξ )T Ω 1 ϵ G(ξ ) Z 1 det Ω1/2 ϵ exp 1 ω + ϵΩ 1/2 ϵ G(ξ )2 dω + O(ϵ3/2) (36) dξ/2 det Ω1/2 ϵ 1 e Ψ(ξ ;f)(h)e ϵ 2 G(ξ )T Ω 1 ϵ G(ξ ) + O(ϵ3/2). (37) Finally inserting back into (6) we get D(f, h) =Eξ ˆ Pn h ϵ log Eξ N(ξ ,γ) h eΨ(ξ;f)(h)ii (38) ϵΨ(ξ ; f)(h) ϵ2 2 G(ξ )T Ω 1 ϵ G(ξ ) + ϵ 2 log |det Ωϵ| ϵdξ ϵ + O(ϵ5/2). (39) Dividing by ϵ and neglecting constant terms we get D(f, h) = Eξ ˆ Pn Ψ(ξ ; f)(h) ϵ 2G(ξ )T Ω 1 ϵ G(ξ ) + 1 2 log |det Ωϵ| + O(ϵ3/2). (40) Now, for small ϵ we can Taylor expand Ω 1 ϵ as Ω 1 ϵ = (Γ + ϵH(ξ )) 1 (41) = Γ 1 I + ϵΓ 1H 1 (42) = Γ 1 I ϵΓ 1H + O(ϵ2) (43) = Γ 1 + O(ϵ). 
C.2. Proof of Theorem 3.4 (Consistency)

The objective of the SMM estimator (8) can be written as

$\hat{D}(h, \theta) = \big( I + \tfrac{\epsilon}{2} \Delta_\xi \big)\, \mathbb{E}_{\hat{P}_n}[\Psi(\xi; \theta)(h)] - \tfrac{\epsilon}{2}\, \big\langle h,\, \hat{\Omega}_{\lambda_n}(\tilde{\theta}_n)\, h \big\rangle_{\mathcal{H}}$,  (52)

where we defined the linear operator $\hat{\Omega}_{\lambda_n}(\tilde{\theta}_n): \mathcal{H} \to \mathcal{H}$ as $\hat{\Omega}_{\lambda_n}(\tilde{\theta}_n) = \mathbb{E}_{\hat{P}_n}\big[ \nabla_\xi \Psi(\xi; \tilde{\theta}_n)^T\, \Gamma^{-1}\, \nabla_\xi \Psi(\xi; \tilde{\theta}_n) \big] + \lambda_n I$. Our proof of Theorem 3.4 uses properties of the spectrum of $\hat{\Omega}_{\lambda_n}(\tilde{\theta}_n)$, which we derive in the following.

C.2.1. PREVIOUS RESULTS

Lemma C.1 (Corollary 9.31, Kosorok (2008)). Let $\mathcal{F}$ and $\mathcal{G}$ be Donsker classes of functions. Then $\mathcal{F} + \mathcal{G}$ is Donsker. If additionally $\mathcal{F}$ and $\mathcal{G}$ are uniformly bounded, then $\mathcal{F} \cdot \mathcal{G}$ is Donsker.

Lemma C.2 (Lemma 18, Bennett & Kallus (2023)). Suppose that $\mathcal{G}$ is a class of functions of the form $g: \Xi \to \mathbb{R}$, and that $\mathcal{G}$ is $P$-Donsker in the sense of Kosorok (2008). Then we have

$\sup_{g \in \mathcal{G}} \big| \mathbb{E}_{\hat{P}_n}[g(\xi)] - \mathbb{E}[g(\xi)] \big| = O_p(n^{-1/2})$.  (53)

Lemma C.3 (Lemma E.4, Kremer et al. (2023)). Let Assumptions 1-7 be satisfied. Then the matrix

$\Sigma(\theta_0) = \big\langle \mathbb{E}[\nabla_\theta \Psi(\xi; \theta_0)],\, \mathbb{E}[\nabla_{\theta^T} \Psi(\xi; \theta_0)] \big\rangle_{\mathcal{H}^*}$  (54)

is strictly positive definite and non-singular with smallest eigenvalue bounded away from zero.

C.2.2. SPECTRUM OF $\hat{\Omega}$

Lemma C.4. Let Assumptions 2 and 3 be satisfied. Then we have

$\sup_{\theta \in \Theta,\, x \in \mathcal{T} \times \mathcal{Y}} \|\psi(x; \theta)\| \leq C_\psi < \infty$  (55)

$\sup_{\theta \in \Theta,\, x \in \mathcal{T} \times \mathcal{Y}} \|J_x(\psi)(x; \theta)\| \leq L_\psi < \infty$  (56)

$\sup_{\theta \in \Theta,\, x \in \mathcal{T} \times \mathcal{Y}} \|\Delta_x \psi(x; \theta)\| \leq D_\psi < \infty$  (57)

$\sup_{h \in \mathcal{H}_1,\, z \in \mathcal{Z}} \|h(z)\| \leq C_h < \infty$  (58)

$\sup_{h \in \mathcal{H}_1,\, z \in \mathcal{Z}} \|J_z h(z)\| \leq L_h < \infty$  (59)

$\sup_{h \in \mathcal{H}_1,\, z \in \mathcal{Z}} \|\Delta_z h(z)\| \leq D_h < \infty$,  (60)

which directly implies $\|\Delta_\xi\|_{op} < \infty$ on $\mathcal{H}^*$.

Proof. The proof follows directly from the fact that a continuous function on a compact domain is bounded and both $\psi(\cdot\,; \theta)$ and $h$ are $C^\infty$-smooth by Assumptions 3 and 5.

Lemma C.5. Let $V(Z; \theta) = \mathbb{E}[J_x(\psi)(X; \theta)\, \Gamma^{-1}\, J_x(\psi)(X; \theta)^T \,|\, Z]$ be non-singular with probability 1. Then the linear operator $\Omega(\theta): \mathcal{H} \to \mathcal{H}$ defined as

$\Omega(\theta) = \mathbb{E}\big[ (\nabla_\xi \Psi(\xi; \theta))^T\, \Gamma^{-1}\, \nabla_\xi \Psi(\xi; \theta) \big]$  (61)

is non-singular.

Proof. We derive the result by showing that the smallest eigenvalue of $\Omega(\theta)$ is positive. Consider any $h \in \mathcal{H}$ with $\|h\|_{L_2(\mathcal{H}, P_0)} > 0$; then we have

$\langle h, \Omega(\theta) h \rangle_{\mathcal{H}} = \mathbb{E}\big[ h(Z)^T J_x(\psi)(X; \theta)\, \Gamma^{-1}\, J_x(\psi)(X; \theta)^T h(Z) \big]$  (62)

$= \mathbb{E}\big[ h(Z)^T\, \mathbb{E}[J_x(\psi)(X; \theta)\, \Gamma^{-1}\, J_x(\psi)(X; \theta)^T \,|\, Z]\, h(Z) \big]$  (63)

$= \mathbb{E}\big[ h(Z)^T V(Z; \theta)\, h(Z) \big]$  (64)

$\geq C\, \mathbb{E}\big[ \|h(Z)\|_2^2 \big]$  (65)

$= C\, \|h\|_{L_2(\mathcal{H}, P_0)}^2 > 0$,  (66)

where we used that by assumption $V(Z; \theta)$ is non-singular and thus its smallest eigenvalue $C$ is bounded away from zero w.p.1.

Lemma C.6 (Spectrum of $\hat{\Omega}$). Let the assumptions of Theorem 3.4 be satisfied. Then for $\tilde{\theta} \in \Theta$ with $\tilde{\theta}_n \to \tilde{\theta}$, the empirical gradient covariance operator

$\hat{\Omega}_{\lambda_n}(\tilde{\theta}_n) = \mathbb{E}_{\hat{P}_n}\big[ \nabla_\xi \Psi(\xi; \tilde{\theta}_n)^T\, \Gamma^{-1}\, \nabla_\xi \Psi(\xi; \tilde{\theta}_n) \big] + \lambda_n I$  (67)

is a positive definite operator with smallest eigenvalue $\lambda_{\min}(\hat{\Omega})$ bounded away from zero and largest eigenvalue $\lambda_{\max}(\hat{\Omega}) \leq C < \infty$ bounded from above w.p.a.1.

Proof. Let in the following $\hat{\Omega}(\theta) = \hat{\Omega}_{\lambda_n=0}(\theta)$. With Assumption 4 it follows from Lemma C.5 that the operator $\Omega(\tilde{\theta}) := \mathbb{E}\big[ \nabla_\xi \Psi(\xi; \tilde{\theta})^T\, \Gamma^{-1}\, \nabla_\xi \Psi(\xi; \tilde{\theta}) \big]$ is non-singular and thus its smallest eigenvalue is bounded away from zero. In the following we show that $\hat{\Omega}(\tilde{\theta}_n) \to_p \Omega(\tilde{\theta})$, where the convergence rate in operator norm is $O_p(n^{-\zeta})$. Therefore, by adding the identity operator with a regularization parameter $\lambda_n$ that goes to zero more slowly than $O_p(n^{-\zeta})$, we ensure that $\hat{\Omega}_{\lambda_n}(\tilde{\theta}_n)$ remains positive definite w.p.a.1. The derivation of this result follows the proof of Lemma 20 of Bennett & Kallus (2023).
By the triangle inequality we have

$\big\| \hat{\Omega}(\tilde{\theta}_n) - \Omega(\tilde{\theta}) \big\|_{op} \leq \big\| \hat{\Omega}(\tilde{\theta}_n) - \Omega(\tilde{\theta}_n) \big\| + \big\| \Omega(\tilde{\theta}_n) - \Omega(\tilde{\theta}) \big\|$.  (68)

The first term we can estimate using standard results from empirical process theory. Define $\|h\|_{\mathcal{H}}^2 = \tfrac{1}{m}\sum_{i=1}^m \|h_i\|_{\mathcal{H}_i}^2$ as well as $J_\psi(X; \theta) = J_x\psi(X; \theta)$ and $J_h(Z) = J_z h(Z)$. Let $\mathcal{H}_1 = \{h \in \mathcal{H} : \|h\|_{\mathcal{H}} \leq 1\}$ denote the unit ball in $\mathcal{H}$; then

$\big\| \hat{\Omega}(\tilde{\theta}_n) - \Omega(\tilde{\theta}_n) \big\| = \sup_{h, h' \in \mathcal{H}_1} \big\langle h',\, \big( \hat{\Omega}(\tilde{\theta}_n) - \Omega(\tilde{\theta}_n) \big) h \big\rangle_{\mathcal{H}}$  (69)

$= \sup_{h, h' \in \mathcal{H}_1} \Big\{ \mathbb{E}_{\hat{P}_n}\big[ h(Z)^T J_\psi(X; \tilde{\theta}_n)\, \Gamma_x^{-1}\, J_\psi(X; \tilde{\theta}_n)^T h'(Z) \big] - \mathbb{E}\big[ h(Z)^T J_\psi(X; \tilde{\theta}_n)\, \Gamma_x^{-1}\, J_\psi(X; \tilde{\theta}_n)^T h'(Z) \big]$  (70)-(71)

$\quad + \tfrac{1}{\gamma_z}\, \mathbb{E}_{\hat{P}_n}\big[ \psi(X; \tilde{\theta}_n)^T J_h(Z) J_{h'}(Z)^T \psi(X; \tilde{\theta}_n) \big] - \tfrac{1}{\gamma_z}\, \mathbb{E}\big[ \psi(X; \tilde{\theta}_n)^T J_h(Z) J_{h'}(Z)^T \psi(X; \tilde{\theta}_n) \big] \Big\}$  (72)-(73)

$\leq \sup_{g \in \mathcal{G}_2} \big\{ \mathbb{E}_{\hat{P}_n}[g(\xi)] - \mathbb{E}[g(\xi)] \big\} + \tfrac{1}{\gamma_z} \sup_{s \in \mathcal{S}_2} \big\{ \mathbb{E}_{\hat{P}_n}[s(\xi)] - \mathbb{E}[s(\xi)] \big\}$,  (74)

where, for $i \in [d_x]$ and $i \in [d_z]$ respectively, we define

$\mathcal{G}_i = \big\{ g_i : g_i(\xi) = \sum_{j=1}^m h_j(z)\, (J_\psi(x; \theta))_{ji}\, \Gamma_{ii}^{-1/2},\; h_j \in \mathcal{H}_{j,1},\, \theta \in \Theta \big\}$  (75)

$\mathcal{G}_2 = \big\{ g : g(\xi) = \sum_{i \in [d_x]} g_i(\xi)\, g'_i(\xi),\; g_i, g'_i \in \mathcal{G}_i \big\}$  (76)

$\mathcal{S}_i = \big\{ s_i : s_i(\xi) = \sum_{j=1}^m \psi_j(x; \theta)\, (J_h(z))_{ji},\; h_j \in \mathcal{H}_{j,1},\, \theta \in \Theta \big\}$  (77)

$\mathcal{S}_2 = \big\{ s : s(\xi) = \sum_{i \in [d_z]} s_i(\xi)\, s'_i(\xi),\; s_i, s'_i \in \mathcal{S}_i \big\}$.  (78)

Now for the first term, each $h_j \in \mathcal{H}_{j,1}$ is $P_0$-Donsker by Assumption 5 and uniformly bounded by Lemma C.4. Similarly, each entry of the Jacobian $J_\psi(\cdot\,; \theta)$ is $P_0$-Donsker by Assumption 3 and uniformly bounded by Lemma C.4. With that we can employ Lemma C.1 to conclude that $\mathcal{G}_i$ is $P_0$-Donsker, and thus, using Lemma C.1 again, it follows that $\mathcal{G}_2$ is $P_0$-Donsker. Therefore we can use Lemma C.2 to obtain $\sup_{g \in \mathcal{G}_2}\big\{ \mathbb{E}_{\hat{P}_n}[g(\xi)] - \mathbb{E}[g(\xi)] \big\} = O_p(n^{-1/2})$. For the second term in (74), each $\psi_j(\cdot\,; \theta)$ is $P_0$-Donsker by Assumption 3 and uniformly bounded by Lemma C.4. Similarly, each entry of the Jacobian $J_z h$ is $P_0$-Donsker by Assumption 5 and uniformly bounded by Lemma C.4. With that, again, we can employ Lemma C.1 to conclude that $\mathcal{S}_i$ is $P_0$-Donsker, and thus, using Lemma C.1 again, it follows that $\mathcal{S}_2$ is $P_0$-Donsker. Therefore we can use Lemma C.2 to obtain $\tfrac{1}{\gamma_z}\sup_{s \in \mathcal{S}_2}\big\{ \mathbb{E}_{\hat{P}_n}[s(\xi)] - \mathbb{E}[s(\xi)] \big\} = O_p(n^{-1/2})$. Putting these results together, we finally obtain $\big\| \hat{\Omega}(\tilde{\theta}_n) - \Omega(\tilde{\theta}_n) \big\| \leq O_p(n^{-1/2})$.

For the second term in (68) we have

$\big\| \Omega(\tilde{\theta}_n) - \Omega(\tilde{\theta}) \big\| = \sup_{h, h' \in \mathcal{H}_1} \big\langle h',\, \big( \Omega(\tilde{\theta}_n) - \Omega(\tilde{\theta}) \big) h \big\rangle_{\mathcal{H}} \leq C_x + \tfrac{1}{\gamma_z}\, C_z$,  (79)

where for the first term

$C_x = \sup_{h, h' \in \mathcal{H}_1} \mathbb{E}\Big[ h'(Z)^T \big( J_\psi(X; \tilde{\theta}_n)\, \Gamma_x^{-1}\, J_\psi(X; \tilde{\theta}_n)^T - J_\psi(X; \tilde{\theta})\, \Gamma_x^{-1}\, J_\psi(X; \tilde{\theta})^T \big) h(Z) \Big]$  (80)-(82)

$= \sup_{h, h' \in \mathcal{H}_1} \mathbb{E}\Big[ h'(Z)^T J_\psi(X; \tilde{\theta}_n)\, \Gamma_x^{-1} \big( J_\psi(X; \tilde{\theta}_n) - J_\psi(X; \tilde{\theta}) \big)^T h(Z)$  (83)

$\quad + h'(Z)^T \big( J_\psi(X; \tilde{\theta}_n) - J_\psi(X; \tilde{\theta}) \big)\, \Gamma_x^{-1}\, J_\psi(X; \tilde{\theta})^T h(Z) \Big]$  (84)

$\leq \frac{2\, m^2\, C_h^2\, L_\psi}{\min(\gamma_t, \gamma_y)}\; \mathbb{E}\big[ \big\| J_\psi(X; \tilde{\theta}_n) - J_\psi(X; \tilde{\theta}) \big\| \big]$  (85)

$= O_p(n^{-\zeta})$,  (86)

where we used that by Lemma C.4, $\sup_{\theta \in \Theta,\, x \in \mathcal{T}\times\mathcal{Y}} \|J_\psi(x; \theta)\| \leq L_\psi$ and $\sup_{h \in \mathcal{H}_1,\, z \in \mathcal{Z}} |h(z)| \leq C_h$, as well as $\mathbb{E}\big[ \| J_\psi(X; \tilde{\theta}_n) - J_\psi(X; \tilde{\theta}) \| \big] = O_p(n^{-\zeta})$ by Assumption 6. Similarly, for the second term in (79) we have

$C_z = \sup_{h, h' \in \mathcal{H}_1} \mathbb{E}\Big[ \operatorname{Tr}\big( J_{h'}(Z)^T \psi(X; \tilde{\theta}_n)\psi(X; \tilde{\theta}_n)^T J_h(Z) \big) - \operatorname{Tr}\big( J_{h'}(Z)^T \psi(X; \tilde{\theta})\psi(X; \tilde{\theta})^T J_h(Z) \big) \Big]$  (87)-(88)

$\leq L_h^2\, \Big| \mathbb{E}\big[ \mathbf{1}^T \big( \psi(X; \tilde{\theta}_n)\psi(X; \tilde{\theta}_n)^T - \psi(X; \tilde{\theta})\psi(X; \tilde{\theta})^T \big) \mathbf{1} \big] \Big|$  (89)

$= L_h^2\, \Big| \mathbb{E}\big[ \mathbf{1}^T \psi(X; \tilde{\theta}_n)\big(\psi(X; \tilde{\theta}_n) - \psi(X; \tilde{\theta})\big)^T \mathbf{1} \big] + \mathbb{E}\big[ \mathbf{1}^T \big(\psi(X; \tilde{\theta}_n) - \psi(X; \tilde{\theta})\big)\psi(X; \tilde{\theta})^T \mathbf{1} \big] \Big|$  (90)-(91)

$\leq 2\, m^2\, L_h^2\, C_\psi\, \mathbb{E}\big[ \big\| \psi(X; \tilde{\theta}_n) - \psi(X; \tilde{\theta}) \big\| \big]$  (92)

$= O_p(n^{-\zeta})$,  (93)

where again we used Lemma C.4 and Assumption 6. Combining both results we obtain $\big\| \Omega(\tilde{\theta}_n) - \Omega(\tilde{\theta}) \big\| \leq C_x + \tfrac{1}{\gamma_z} C_z \leq O_p(n^{-\zeta})$. Finally, as $0 < \zeta \leq 1/2$ it follows that

$\big\| \hat{\Omega}(\tilde{\theta}_n) - \Omega(\tilde{\theta}) \big\| \leq \big\| \hat{\Omega}(\tilde{\theta}_n) - \Omega(\tilde{\theta}_n) \big\| + \big\| \Omega(\tilde{\theta}_n) - \Omega(\tilde{\theta}) \big\|$  (94)

$\leq O_p(n^{-1/2}) + O_p(n^{-\zeta}) = O_p(n^{-\zeta})$.  (95)

In conclusion, we have shown that $\hat{\Omega}(\tilde{\theta}_n)$ converges to the non-singular operator $\Omega(\tilde{\theta})$ at rate $O_p(n^{-\zeta})$, and by Assumption 6 we have $\lambda_n = O_p(n^{-\rho})$ with $0 < \rho < \zeta$; therefore the operator $\hat{\Omega}_{\lambda_n}(\tilde{\theta}_n) = \hat{\Omega}(\tilde{\theta}_n) + \lambda_n I$ is non-singular with smallest eigenvalue bounded away from zero w.p.a.1. It remains to be shown that the largest eigenvalue of $\hat{\Omega}(\tilde{\theta}_n)$ is bounded.
This is a direct consequence of Lemma C.4. Consider any $h \in \mathcal{H}$ with $\|h\|_{\mathcal{H}} > 0$; then

$\langle h, \hat{\Omega}(\tilde{\theta}_n) h \rangle = \mathbb{E}_{\hat{P}_n}\big[ h(Z)^T J_\psi(X; \tilde{\theta}_n)\, J_\psi(X; \tilde{\theta}_n)^T h(Z) \big]$  (96)

$\leq \mathbb{E}\big[ \| J_\psi(X; \tilde{\theta}_n) \|^2 \big]\, \mathbb{E}\big[ \| h(Z) \|^2 \big]$  (97)

$\leq L_\psi^2\, C_h^2 < \infty$.  (98)

C.2.3. PROOF OF THEOREM 3.4

Lemma C.7. Let the sets of functions $\{\psi_l(\cdot\,; \theta) : \theta \in \Theta,\ l \in [m]\}$ and $\mathcal{H}_1$ be $P_0$-Donsker. Then we have for any $\theta \in \Theta$

$\big\| \mathbb{E}_{\hat{P}_n}[\Psi(\xi; \theta)] - \mathbb{E}[\Psi(\xi; \theta)] \big\|_{\mathcal{H}^*} = O_p(n^{-1/2})$.  (99)

Proof. We have

$\big\| \mathbb{E}_{\hat{P}_n}[\Psi(\xi; \theta)] - \mathbb{E}[\Psi(\xi; \theta)] \big\|_{\mathcal{H}^*} = \sup_{h \in \mathcal{H}_1} \big| \mathbb{E}_{\hat{P}_n}[\psi(X; \theta)^T h(Z)] - \mathbb{E}[\psi(X; \theta)^T h(Z)] \big|$  (100)

$= \sup_{g \in \mathcal{G}} \big| \mathbb{E}_{\hat{P}_n}[g(\xi)] - \mathbb{E}[g(\xi)] \big|$,  (101)

with

$\mathcal{G} = \big\{ g : g(\xi) = \sum_{i=1}^m \psi_i(x; \theta)\, h_i(z),\; h_i \in \mathcal{H}_{i,1},\, \theta \in \Theta \big\}$.  (102)

Now, as each $h_i$ and $\psi_i(\cdot\,; \theta)$ are $P_0$-Donsker by Assumptions 5 and 3, respectively, and uniformly bounded by Lemma C.4, we can employ Lemma C.1 to conclude that $\mathcal{G}$ is $P_0$-Donsker. From this, the result follows by application of Lemma C.2.

Lemma C.8 (Convergence of $\hat{D}$). Let the assumptions of Theorem 3.4 be satisfied. Additionally, let $\bar{\theta} \in \Theta$ be a consistent estimator for $\theta_0$, i.e., $\bar{\theta} \to_p \theta_0$ with $\|\mathbb{E}_{\hat{P}_n}[\Psi(\xi; \bar{\theta})]\|_{\mathcal{H}^*} = O_p(n^{-1/2})$. Then for $\bar{h} = \arg\max_{h \in \mathcal{H}} \hat{D}(\bar{\theta}, h)$ we have $\|\bar{h}\|_{\mathcal{H}} = O_p(n^{-1/2})$ and $\hat{D}(\bar{\theta}, \bar{h}) \leq O_p(n^{-1})$.

Proof. Let $\bar{\Psi} := \tfrac{1}{n}\sum_{i=1}^n \Psi(\xi_i; \bar{\theta})$. Then we have

$0 = \hat{D}(\bar{\theta}, 0) \leq \hat{D}(\bar{\theta}, \bar{h}) = \big( I + \tfrac{\epsilon}{2}\Delta_\xi \big)\, \bar{\Psi}(\bar{h}) - \tfrac{\epsilon}{2}\, \big\langle \bar{h},\, \hat{\Omega}_{\lambda_n}(\tilde{\theta}_n)\, \bar{h} \big\rangle_{\mathcal{H}}$  (103)-(105)

$\leq \big\| I + \tfrac{\epsilon}{2}\Delta_\xi \big\|_{op}\, \|\bar{\Psi}\|_{\mathcal{H}^*}\, \|\bar{h}\|_{\mathcal{H}} - \tfrac{\epsilon}{2}\,\lambda_{\min}\big( \hat{\Omega}_{\lambda_n}(\tilde{\theta}_n) \big)\, \|\bar{h}\|_{\mathcal{H}}^2$.  (106)-(107)

Using that $\|\Delta_\xi\|_{op} < \infty$ by Lemma C.4 and moreover $\lambda_{\min}\big( \hat{\Omega}_{\lambda_n}(\tilde{\theta}_n) \big) > 0$ by Lemma C.6, we get $\|\bar{h}\|_{\mathcal{H}} \leq C\, \|\bar{\Psi}\|_{\mathcal{H}^*}$ and thus $\|\bar{h}\|_{\mathcal{H}} = O_p(n^{-1/2})$. Inserting back into $\hat{D}$ we get $\hat{D}(\bar{\theta}, \bar{h}) \leq O_p(n^{-1})$.

Lemma C.9 (Convergence of $\|\hat{\Psi}\|_{\mathcal{H}^*}$). Let the assumptions of Theorem 3.4 be satisfied. Let $\hat{\theta} = \arg\min_{\theta \in \Theta} \sup_{h \in \mathcal{H}} \hat{D}(\theta, h)$ denote the SMM estimator for $\theta_0$. Then $\|\mathbb{E}_{\hat{P}_n}[\Psi(\xi; \hat{\theta})]\|_{\mathcal{H}^*} = O_p(n^{-1/2})$.

Proof. Let $\hat{\Psi} = \tfrac{1}{n}\sum_{i=1}^n \Psi(\xi_i; \hat{\theta})$. Let $\phi(\hat{\Psi}) \in \mathcal{H}$ denote the Riesz representer of $\hat{\Psi} \in \mathcal{H}^*$. Consider any $\sigma_n \to 0$ and define $h_{\hat{\Psi}} = \sigma_n\, \phi(\hat{\Psi})$. Using that the eigenvalues of the Laplacian $\Delta_\xi$ are bounded by Lemma C.4 and the largest eigenvalue of $\hat{\Omega}_{\lambda_n}(\tilde{\theta}_n)$ is bounded by a constant $C_2$ by Lemma C.6, we have

$\hat{D}(\hat{\theta}, h_{\hat{\Psi}}) = \big( I + \tfrac{\epsilon}{2}\Delta_\xi \big)\, \hat{\Psi}(h_{\hat{\Psi}}) - \tfrac{\epsilon}{2}\, \big\langle h_{\hat{\Psi}},\, \hat{\Omega}_{\lambda_n}(\tilde{\theta}_n)\, h_{\hat{\Psi}} \big\rangle_{\mathcal{H}}$  (108)

$\geq \big( 1 + \tfrac{\epsilon}{2}\lambda_{\min}(\Delta_\xi) \big)\, \hat{\Psi}(h_{\hat{\Psi}}) - \tfrac{\epsilon}{2}\, C_2\, \|h_{\hat{\Psi}}\|_{\mathcal{H}}^2$  (109)

$\geq C_1\, \sigma_n\, \|\hat{\Psi}\|_{\mathcal{H}^*}^2 - \tfrac{C_2\,\epsilon}{2}\, \sigma_n^2\, \|\hat{\Psi}\|_{\mathcal{H}^*}^2$,  (110)

where by the assumption on $\epsilon$ we have $C_1 = 1 + \tfrac{\epsilon}{2}\lambda_{\min}(\Delta_\xi) > 0$ w.p.1. Now, as $\hat{\theta}$ is the minimizer of the Sinkhorn profile $R(\theta) = \max_{h \in \mathcal{H}} \hat{D}(\theta, h)$, we have

$C_1\, \sigma_n\, \|\hat{\Psi}\|_{\mathcal{H}^*}^2 - \tfrac{C_2\,\epsilon}{2}\, \sigma_n^2\, \|\hat{\Psi}\|_{\mathcal{H}^*}^2 \leq \hat{D}(\hat{\theta}, h_{\hat{\Psi}}) \leq \hat{D}(\hat{\theta}, \hat{h}) \leq \max_{h \in \mathcal{H}} \hat{D}(\theta_0, h) \leq O_p(n^{-1})$,  (111)

where in the last step we used that $\|\mathbb{E}_{\hat{P}_n}[\Psi(\xi; \theta_0)]\|_{\mathcal{H}^*} = O_p(n^{-1/2})$ by Lemma C.7; thus the assumptions of Lemma C.8 are fulfilled and we get $\max_{h \in \mathcal{H}} \hat{D}(\theta_0, h) \leq O_p(n^{-1})$. Thus we have $\sigma_n\big( C_1 - \tfrac{C_2\,\epsilon}{2}\sigma_n \big)\, \|\hat{\Psi}\|_{\mathcal{H}^*}^2 = O_p(n^{-1})$, and as $\big( C_1 - \tfrac{C_2\,\epsilon}{2}\sigma_n \big)$ is bounded away from zero for all $n$ large enough, we have $\sigma_n\, \|\hat{\Psi}\|_{\mathcal{H}^*}^2 \leq O_p(n^{-1})$. As this holds for any $\sigma_n \to_p 0$, we finally have $\|\hat{\Psi}\|_{\mathcal{H}^*} = O_p(n^{-1/2})$.

Proof of Theorem 3.4

Using the result of Lemma C.9 for the convergence rate of the empirical moment functional, the proof of the consistency of our SMM estimator is identical to the ones provided by Kremer et al. (2022) and Kremer et al. (2023) for their estimators. We provide it here for completeness.

Proof. From Lemma C.7 it follows that $\|\mathbb{E}_{\hat{P}_n}[\Psi(\xi; \theta)] - \mathbb{E}[\Psi(\xi; \theta)]\|_{\mathcal{H}^*} = O_p(n^{-1/2})$ for any $\theta \in \Theta$. By Lemma C.9 we have $\|\mathbb{E}_{\hat{P}_n}[\Psi(\xi; \hat{\theta})]\|_{\mathcal{H}^*} = O_p(n^{-1/2})$, and thus using the triangle inequality we get

$\big\| \mathbb{E}[\Psi(\xi; \hat{\theta})] \big\|_{\mathcal{H}^*} = \big\| \mathbb{E}[\Psi(\xi; \hat{\theta})] - \mathbb{E}_{\hat{P}_n}[\Psi(\xi; \hat{\theta})] + \mathbb{E}_{\hat{P}_n}[\Psi(\xi; \hat{\theta})] \big\|_{\mathcal{H}^*} \leq \big\| \mathbb{E}[\Psi(\xi; \hat{\theta})] - \mathbb{E}_{\hat{P}_n}[\Psi(\xi; \hat{\theta})] \big\|_{\mathcal{H}^*} + \big\| \mathbb{E}_{\hat{P}_n}[\Psi(\xi; \hat{\theta})] \big\|_{\mathcal{H}^*} = O_p(n^{-1/2}) \to_p 0$.

As by Assumption 1, $\theta_0$ is the unique parameter for which $\mathbb{E}[\psi(T, Y; \theta)\,|\,Z] = 0$ $P_z$-a.s., and by Assumption 5 this is fulfilled if and only if $\|\mathbb{E}[\Psi(\xi; \theta)]\|_{\mathcal{H}^*} = 0$, it follows that $\hat{\theta} \to_p \theta_0$.
Under the additional Assumption 7 we can use this result to translate the convergence rate of the moment functional into a convergence rate of the estimator $\hat{\theta}$. As $\Psi(\xi; \theta)$ is continuously differentiable in its second argument, which follows immediately from Assumption 7 and the definition of $\Psi$, we can use the mean value theorem to expand $\Psi(\xi; \hat{\theta})$ about $\theta_0$, i.e., there exists $\bar{\theta} \in \operatorname{conv}(\{\theta_0, \hat{\theta}\})$ such that

$\Psi(\xi; \hat{\theta}) = \Psi(\xi; \theta_0) + (\hat{\theta} - \theta_0)^T \nabla_\theta \Psi(\xi; \bar{\theta})$.  (112)

Using this we have

$\big\| \mathbb{E}[\Psi(\xi; \hat{\theta})] \big\|_{\mathcal{H}^*}^2 = \big\| \underbrace{\mathbb{E}[\Psi(\xi; \theta_0)]}_{=0} + (\hat{\theta} - \theta_0)^T \mathbb{E}[\nabla_\theta \Psi(\xi; \bar{\theta})] \big\|_{\mathcal{H}^*}^2$  (113)

$= \big\langle (\hat{\theta} - \theta_0)^T \mathbb{E}[\nabla_\theta \Psi(\xi; \bar{\theta})],\; (\hat{\theta} - \theta_0)^T \mathbb{E}[\nabla_\theta \Psi(\xi; \bar{\theta})] \big\rangle_{\mathcal{H}^*}$  (114)

$= (\hat{\theta} - \theta_0)^T \underbrace{\big\langle \mathbb{E}[\nabla_\theta \Psi(\xi; \bar{\theta})],\, \mathbb{E}[\nabla_{\theta^T} \Psi(\xi; \bar{\theta})] \big\rangle_{\mathcal{H}^*}}_{=: \Sigma(\bar{\theta})} (\hat{\theta} - \theta_0)$  (115)

$\geq \lambda_{\min}\big( \Sigma(\bar{\theta}) \big)\, \|\hat{\theta} - \theta_0\|_2^2$.  (116)

Now, as $\hat{\theta} \to_p \theta_0$ and $\bar{\theta} \in \operatorname{conv}(\{\theta_0, \hat{\theta}\})$, we have $\bar{\theta} \to_p \theta_0$ and thus $\Sigma(\bar{\theta}) \to_p \Sigma(\theta_0) =: \Sigma_0$ by the continuous mapping theorem. By the non-negativity of the norm, $\Sigma_0$ is positive semi-definite, and it is non-singular by Lemma C.3; thus the smallest eigenvalue of $\Sigma(\bar{\theta})$, $\lambda_{\min}(\Sigma(\bar{\theta}))$, is positive and bounded away from zero w.p.a.1. Finally, as $\|\mathbb{E}[\Psi(X, Z; \hat{\theta})]\|_{\mathcal{H}^*} = O_p(n^{-1/2})$, taking the square root on both sides we have $\|\hat{\theta} - \theta_0\| = O_p(n^{-1/2})$.

Proof of Proposition 3.5

Proof. For a universal ISPD kernel, the equivalence of the conditional and the variational moment restrictions (1) and (2) follows from Theorem 3.9 of Kremer et al. (2022). The Donsker property of the unit ball in an RKHS of a smooth universal kernel with compact domain follows from Lemma 17 of Bennett & Kallus (2023). Finally, the Donsker property of the Jacobian $J_z h$ of $h$ follows by the same argument as Lemma 17 of Bennett et al. (2019), using the $C^\infty$-smoothness of $h$ and the boundedness of $J_z h$.

Proof of Theorem 3.6

Proof. Under the assumptions the Sinkhorn profile is given as

$R_\lambda(f) = \sup_{h \in \mathcal{H}} \mathbb{E}_{\hat{P}_n}\Big[ h(Z)^T \big( I + \tfrac{\epsilon}{2}\Delta_x \big)\psi(X; f) \Big]$  (117)

$\quad - \tfrac{\epsilon}{2}\, \mathbb{E}_{\hat{P}_n}\Big[ h(Z)^T J_\psi(X; f)\, \Gamma_x^{-1}\, J_\psi(X; f)^T h(Z) \Big] - \tfrac{\lambda}{2}\, \|h\|_{\mathcal{H}}^2$,  (118)

which, as the unconstrained maximization of a concave objective, is a convex optimization problem. Moreover, the conditions of the classical representer theorem (Schölkopf et al., 2001) are fulfilled, and thus the maximizer of (118) is given by $h_l = \sum_{i=1}^n \alpha_i^l\, k_l(z_i, \cdot)$ with $\alpha^l \in \mathbb{R}^n$. Inserting this into (118) and defining the kernel Gram matrices $K_l \in \mathbb{R}^{n \times n}$ with entries $(K_l)_{ij} = k_l(z_i, z_j)$, we obtain

$R_\lambda(f) = \sup_{\alpha \in \mathbb{R}^{nm}} \frac{1}{n} \sum_{l=1}^m \sum_{i,j=1}^n \alpha_i^l\, (K_l)_{ij}\, \big( I + \tfrac{\epsilon}{2}\Delta_x \big)\psi_l(x_j; f) - \frac{\lambda}{2} \sum_{l=1}^m (\alpha^l)^T K_l\, \alpha^l$  (119)

$\quad - \frac{\epsilon}{2n} \sum_{l,r=1}^m \sum_{i,j,k=1}^n \alpha_i^l\, (K_l)_{ij}\, \nabla_x \psi_l(x_j; f)^T\, \Gamma_x^{-1}\, \nabla_x \psi_r(x_j; f)\, (K_r)_{jk}\, \alpha_k^r$  (120)

$= \sup_{\alpha \in \mathbb{R}^{nm}} \frac{1}{n}\, \alpha^T L\, \psi'(f) - \frac{1}{2}\, \alpha^T \big( \epsilon\, Q(f) + \lambda L \big)\, \alpha$,  (121)

where we defined $\psi'(f) \in \mathbb{R}^{nm}$, $L \in \mathbb{R}^{nm \times nm}$ and $Q(f) \in \mathbb{R}^{nm \times nm}$ with entries

$\psi'(f)_i^l = \big( I + \tfrac{\epsilon}{2}\Delta_x \big)\psi_l(x_i; f)$  (122)

$L_{(i\,l),(j\,r)} = \delta_{lr}\, k_l(z_i, z_j)$  (123)

$Q(f)_{(i\,l),(j\,r)} = \frac{1}{n} \sum_{k=1}^n \sum_{s=1}^{d_x} k_l(z_i, z_k)\, \partial_{x_s}\psi_l(x_k; f)\, (\Gamma_x^{-1})_{ss}\, \partial_{x_s}\psi_r(x_k; f)\, k_r(z_k, z_j)$.  (124)

The first-order optimality condition for $\alpha$ reads

$0 = \frac{1}{n}\, L\, \psi'(f) - \big( \epsilon\, Q(f) + \lambda L \big)\, \alpha$,  (125)

which immediately gives

$\alpha = \big( \epsilon\, Q(f) + \lambda L \big)^{-1} \frac{1}{n}\, L\, \psi'(f)$.  (126)

Inserting back into $R_\lambda(f)$ and rescaling by $\epsilon > 0$, which leaves the minimizer over $f$ unchanged, we obtain

$R_\lambda(f) = \frac{1}{2n^2}\, \psi'(f)^T L \Big( Q(f) + \frac{\lambda}{\epsilon} L \Big)^{-1} L\, \psi'(f)$.  (127)
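For readers who prefer code, the closed form (122)-(127) can be assembled with a few matrix operations. The following Python/NumPy sketch is a minimal illustration under simplifying assumptions (a single moment function, i.e. $m = 1$, a Gaussian kernel on the instruments, and user-supplied arrays of residuals and their first- and second-order data-derivatives); it is not the released implementation, and all names and the synthetic inputs are illustrative only.

```python
import numpy as np

def gaussian_kernel(Z, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel on the instruments Z of shape (n, d_z)."""
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def sinkhorn_profile(psi, grad_psi, lap_psi, K, gamma_x, eps, lam):
    """Closed-form Sinkhorn profile, eq. (127), for a single moment function (m = 1).

    psi:      (n,)   residuals psi(x_i; f)
    grad_psi: (n, d) gradients of psi w.r.t. the data x_i
    lap_psi:  (n,)   Laplacians of psi w.r.t. the data x_i
    K:        (n, n) kernel Gram matrix on the instruments
    gamma_x:  (d,)   diagonal of Gamma_x
    """
    n = len(psi)
    psi_prime = psi + 0.5 * eps * lap_psi            # eq. (122)
    w = np.sum(grad_psi ** 2 / gamma_x, axis=1)      # per-sample weighted gradient norm
    Q = K @ np.diag(w) @ K / n                       # eq. (124) for m = 1
    K_psi = K @ psi_prime
    alpha = np.linalg.solve(eps * Q + lam * K, K_psi / n)   # eq. (126)
    # At the optimum the rescaled objective equals (eps / (2 n)) * alpha^T K psi',
    # which coincides with eq. (127).
    return 0.5 * eps * (alpha @ K_psi) / n

# Toy usage with synthetic inputs (illustrative only).
rng = np.random.default_rng(0)
n, d = 50, 2
Z = rng.standard_normal((n, 1))
psi = rng.standard_normal(n)
grad_psi = rng.standard_normal((n, d))
lap_psi = rng.standard_normal(n)
K = gaussian_kernel(Z)
print(sinkhorn_profile(psi, grad_psi, lap_psi, K, np.ones(d), eps=1e-2, lam=1e-4))
```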