Published as a conference paper at ICLR 2024

COUNTERFACTUAL DENSITY ESTIMATION USING KERNEL STEIN DISCREPANCIES

Diego Martinez-Taboada
Department of Statistics & Data Science
Carnegie Mellon University
Pittsburgh, PA 15213, USA
diegomar@andrew.cmu.edu

Edward H. Kennedy
Department of Statistics & Data Science
Carnegie Mellon University
Pittsburgh, PA 15213, USA
edward@stat.cmu.edu

ABSTRACT

Causal effects are usually studied in terms of the means of counterfactual distributions, which may be insufficient in many scenarios. Given a class of densities known up to normalizing constants, we propose to model counterfactual distributions by minimizing kernel Stein discrepancies in a doubly robust manner. This enables the estimation of counterfactuals over large classes of distributions while exploiting the desired double robustness. We present a theoretical analysis of the proposed estimator, providing sufficient conditions for consistency and asymptotic normality, as well as an examination of its empirical performance.

1 INTRODUCTION

Causal targets examine the outcomes that might have occurred if a specific treatment had been administered to a group of individuals. Generally, only the expected value of these outcomes is analyzed, as seen in the widely used average treatment effect. Nonetheless, focusing solely on means proves insufficient in many scenarios. Important attributes of the counterfactuals, such as their variance or skewness, are often disregarded. Modelling the entire distribution gives a complete picture of the counterfactual mechanisms, which opens the door to a richer analysis. For example, a multimodal structure in the counterfactual may indicate the presence of distinct subgroups with varying responses to the treatment (Kennedy et al., 2021). This profound understanding of the counterfactual may be ultimately exploited in the design of new treatments.
In order to model a distribution, one may consider estimating either its cumulative distribution function (CDF) or its probability density function (PDF). While the statistical analysis of the former is generally more straightforward, the latter tends to be more attractive for practitioners given its appealing interpretability. Although density estimation may be conducted non-parametrically, modelling probability density functions based on a parametric class of distributions has garnered significant attention. These models allow easy integration of prior data knowledge and offer interpretable parameters describing distribution characteristics. However, a number of interesting parametric density functions are only known up to their normalizing constant. Energy-based models (LeCun et al., 2006) establish probability density functions $q_\theta(y) \propto \exp(-E_\theta(y))$, where $E_\theta$ is a parametrized potential fulfilling $\int \exp(-E_\theta(y))\,dy < \infty$. Sometimes, the energy is expressed as a combination of hidden and visible variables, as in products of experts or restricted Boltzmann machines (Ackley et al., 1985; Hinton, 2002; 2010). In contrast, energy-based models may link inputs directly to outputs (Mnih & Hinton, 2005; Hinton et al., 2006). Gibbs distributions and Markov random fields (Geman & Geman, 1984; Clifford, 1990), as well as exponential random graph models (Robins et al., 2007; Lusher et al., 2013), are also examples of this form of parameterization. We highlight the generality of energy-based models, which allow for modelling distributions with outstanding flexibility. Generally, the primary difficulty in manipulating energy-based models stems from the need to precisely estimate the normalizing constant. Nonetheless, this challenge may be circumvented by using the so-called kernel Stein discrepancies, which only require the computation of the score function $\nabla_x \log q_\theta(x)$.
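The key fact above is that the score is unaffected by the normalizing constant, since any additive constant in $\log q_\theta$ vanishes under differentiation. A minimal numerical check, using a toy Gaussian potential $E_\theta(y) = (y-\theta)^2/2$ (an illustrative example, not from the paper):

```python
import numpy as np

def numerical_grad(f, y, eps=1e-5):
    # central finite-difference approximation of df/dy
    return (f(y + eps) - f(y - eps)) / (2 * eps)

theta = 0.5
# unnormalized log-density: log q_theta(y) up to an additive constant
log_q_unnormalized = lambda y: -0.5 * (y - theta) ** 2
# normalized log-density: same potential plus the log normalizing constant
log_q_normalized = lambda y: log_q_unnormalized(y) - 0.5 * np.log(2 * np.pi)

s1 = numerical_grad(log_q_unnormalized, 1.3)
s2 = numerical_grad(log_q_normalized, 1.3)
# both equal the analytic score s_theta(y) = -(y - theta)
```

The two numerical scores coincide, which is exactly why KSD-based methods can sidestep the normalizing constant.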
Given a class of distributions $\mathcal{Q} = \{q_\theta\}_{\theta \in \Theta}$ and samples from a base distribution $Q$, one may model the latter by the distribution $q_{\theta_n}$ that minimizes its kernel Stein discrepancy with respect to the empirical distribution $Q_n$. However, these minimum kernel Stein discrepancy estimators have not been explored in the challenging counterfactual setting, where outcomes are not always observed, and the treatment assignment procedure ought to be taken into account. In this work, we propose to model counterfactuals by minimizing kernel Stein discrepancies in a doubly robust manner. While the presented estimator retains the desired properties of double robustness, it enables flexible modelling of the counterfactual via density functions whose normalizing constants need not be specified. Our contributions are two-fold. First, we present a novel estimator for modelling counterfactual distributions given a parametric class of distributions, along with its theoretical analysis. We provide sufficient conditions for both consistency and asymptotic normality. Second, we illustrate the empirical performance of the estimator in a variety of scenarios.

2 RELATED WORK

Counterfactual distribution estimation has mainly been addressed through CDF approximation (Abadie, 2002; Chernozhukov & Hansen, 2005; Chernozhukov et al., 2013; Díaz, 2017), where the predominant approach reduces to counterfactual mean estimation. In contrast, counterfactual PDF estimation generally relies on kernel smoothing (Robins & Rotnitzky, 2001; Kim et al., 2018) or on projecting the empirical distribution onto a finite-dimensional model using f-divergences or $L_p$ norms (Westling & Carone, 2020; Kennedy et al., 2021; Melnychuk et al., 2023). We highlight that none of these alternatives can in principle handle families of densities with unknown normalizing constants.
For instance, using general f-divergences or $L_p$ norms in the projection stage of the estimation (Kennedy et al., 2021) requires access to exact evaluations of the densities, including their normalizing constants. This is precisely what motivated the authors of this contribution to explore kernel Stein discrepancies in the counterfactual setting. Kernel Stein discrepancies (KSD), which build on the general Stein's method (Stein, 1972; Gorham & Mackey, 2015), were first introduced for conducting goodness-of-fit tests and sample quality analysis (Liu et al., 2016; Chwialkowski et al., 2016; Gorham & Mackey, 2017); they may be understood as a kernelized version of the score-matching divergence (Hyvärinen & Dayan, 2005). Minimum kernel Stein discrepancy (MKSD) estimators were subsequently proposed (Barp et al., 2019; Matsubara et al., 2022), which project the empirical distribution onto a finite-dimensional model using the KSD. We highlight that MKSD estimators had not been proposed in the counterfactual setting prior to this work. Lam & Zhang (2023) suggested a doubly robust procedure to estimate expectations via Monte Carlo simulation using KSD, but their motivation and analysis significantly diverge from our own contributions: while MKSD estimators minimize the KSD over a class of distributions, (quasi) Monte Carlo methods exploit the KSD by transporting the sampling distribution (Oates et al., 2017; Fisher et al., 2021; Korba et al., 2021). We refer the reader to Anastasiou et al. (2023) for a review of Stein's methods, and to Oates et al. (2022) for an overview of MKSD estimators. Kernel methods have been gaining interest in causal inference for assessing whole counterfactual distributions. In order to test for (conditional) distributional treatment effects, Muandet et al. (2021) and Park et al. (2021) made use of kernel mean embeddings via inverse propensity weighting and plug-in estimators, respectively.
This line of work was later generalized to doubly robust estimators by Fawkes et al. (2022) and Martinez-Taboada et al. (2023). Beyond distributional representation, kernel regressors have found extensive use in counterfactual tasks (Singh et al., 2019; 2020; 2021; Zhu et al., 2022).

3 BACKGROUND

Let $(X, A, Y) \sim P$, where $X \in \mathcal{X}$, $Y \in \mathcal{Y} \subseteq \mathbb{R}^d$, and $A \in \{0, 1\}$ represent the covariates, outcome, and binary treatment respectively. We frame the problem in terms of the potential outcome framework (Rubin, 2005; Imbens & Rubin, 2015), assuming

A1) $Y = AY^1 + (1-A)Y^0$ (where $Y^1 \sim Q^1$ and $Y^0 \sim Q^0$ are the potential outcomes or counterfactuals),
A2) $Y^0, Y^1 \perp A \mid X$,
A3) $\epsilon < \pi(X) := P_{A|X}(A = 1 \mid X) < 1 - \epsilon$ almost surely for some $\epsilon > 0$.

Conditions A1-A3 are ubiquitous in causal inference, but other identifiability assumptions are also possible. Condition A1 holds true when potential outcomes are exclusively determined by an individual's own treatment (i.e., no interference), and Condition A2 applies when there are no unmeasured confounders. Condition A3 means treatment is not allocated deterministically. Under these three assumptions, it is known that the distribution of either counterfactual may be expressed in terms of observational data. The ultimate goal of this contribution is to model the distribution of either $Y^a$ using a parametric class of distributions $\mathcal{Q} = \{q_\theta\}_{\theta \in \Theta}$ which only needs to be specified up to normalizing constants. Without loss of generality, we restrict our analysis to the potential outcome of the treatment, $Y^1 \sim Q^1$. For conducting such a task, we draw upon minimum kernel Stein discrepancy estimators, which build on the concepts of reproducing kernel Hilbert spaces and Stein's method.

Reproducing kernel Hilbert spaces (RKHS): Consider a non-empty set $\mathcal{Y}$ and a Hilbert space $\mathcal{H}$ of functions $f: \mathcal{Y} \to \mathbb{R}$ equipped with the inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$.
The Hilbert space $\mathcal{H}$ is called an RKHS if there exists a function $k: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ (referred to as the reproducing kernel) satisfying (i) $k(\cdot, y) \in \mathcal{H}$ for all $y \in \mathcal{Y}$, and (ii) $\langle f, k(\cdot, y) \rangle_{\mathcal{H}} = f(y)$ for all $y \in \mathcal{Y}$ and $f \in \mathcal{H}$. We denote by $\mathcal{H}^d$ the product RKHS containing elements $h := (h_1, \ldots, h_d)$ with $h_i \in \mathcal{H}$ and $\langle h, \tilde{h} \rangle_{\mathcal{H}^d} = \sum_{i=1}^d \langle h_i, \tilde{h}_i \rangle_{\mathcal{H}}$.

Kernel Stein discrepancies (KSD): Assume $Q_\theta$ has density $q_\theta$ on $\mathcal{Y} \subseteq \mathbb{R}^d$, and let $\mathcal{H}$ be an RKHS with reproducing kernel $k$ such that $\nabla k(\cdot, y)$ exists for all $y \in \mathcal{Y}$. Let $s_\theta(y) = \nabla_y \log q_\theta(y)$ and define $\xi_\theta(\cdot, y) := [s_\theta(y)k(\cdot, y) + \nabla k(\cdot, y)] \in \mathcal{H}^d$, so that

$h_\theta(y, \tilde{y}) = \langle \xi_\theta(\cdot, y), \xi_\theta(\cdot, \tilde{y}) \rangle_{\mathcal{H}^d} = \langle s_\theta(y), s_\theta(\tilde{y}) \rangle_{\mathbb{R}^d}\, k(y, \tilde{y}) + \langle s_\theta(\tilde{y}), \nabla_y k(y, \tilde{y}) \rangle_{\mathbb{R}^d} + \langle s_\theta(y), \nabla_{\tilde{y}} k(y, \tilde{y}) \rangle_{\mathbb{R}^d} + \langle \nabla_y k(\cdot, y), \nabla_{\tilde{y}} k(\cdot, \tilde{y}) \rangle_{\mathcal{H}^d}$   (1)

is a reproducing kernel. The KSD is defined as $\mathrm{KSD}(Q_\theta \,\|\, Q) := (\mathbb{E}_{Y, \tilde{Y} \sim Q}[h_\theta(Y, \tilde{Y})])^{1/2}$. Under certain regularity conditions, $\mathrm{KSD}(Q_\theta \,\|\, Q) = 0$ if (and only if) $Q_\theta = Q$ (Chwialkowski et al., 2016); other properties such as weak convergence dominance may also be established (Gorham & Mackey, 2017). Further, if $\mathbb{E}_Q \sqrt{h_\theta(Y, Y)} < \infty$, then $\mathrm{KSD}(Q_\theta \,\|\, Q) = \| \mathbb{E}_{Y \sim Q}[\xi_\theta(\cdot, Y)] \|_{\mathcal{H}^d}$. In non-causal settings, given $Y_1, \ldots, Y_n \sim Q$ and the closed-form evaluation of $h_\theta$ presented in equation 1, the V-statistic

$\left\| \frac{1}{n} \sum_{i=1}^n \xi_\theta(\cdot, Y_i) \right\|^2_{\mathcal{H}^d} = \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n h_\theta(Y_i, Y_j)$   (2)

may be considered for estimating $\mathrm{KSD}^2(Q_\theta \,\|\, Q)$. Similarly, removing the diagonal elements of this V-statistic gives way to an unbiased U-statistic, which enjoys similar properties.

Minimum kernel Stein discrepancy (MKSD) estimators: Given $\mathcal{Q} = \{q_\theta\}_{\theta \in \Theta}$ known up to normalizing constants, the scores $s_\theta$ can be computed. MKSD estimators model $Q$ by $Q_{\theta_n}$, where

$\theta_n \in \arg\min_{\theta \in \Theta} g_n(\theta; Y_1, \ldots, Y_n),$

and $g_n$ is the V-statistic defined in equation 2 or its unbiased version. The MKSD estimator $\theta_n$ is consistent and asymptotically normal under regularity conditions (Oates et al., 2022).

4 MAIN RESULTS

We consider the problem of modeling the counterfactual distribution $Y^1 \sim Q^1$ given a parametric class of distributions $\mathcal{Q} = \{q_\theta\}_{\theta \in \Theta}$ with potentially unknown normalizing constants.
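Before proceeding, the V-statistic of equation 2 can be made concrete with a minimal numerical sketch (illustrative code, not the authors' implementation): it evaluates the closed form of equation 1 for an IMQ base kernel and averages it over all pairs. The function names and parameter defaults ($c = 1$, $\ell = 1$, $\beta = -1/2$) are illustrative, and the score passed in is that of a standard Gaussian model.

```python
import numpy as np

def imq_stein_kernel(x, y, score, c=1.0, ell=1.0, beta=-0.5):
    # Closed-form Stein kernel h_theta(x, y) of equation (1) for the IMQ base
    # kernel k(x, y) = (c^2 + ||x - y||^2 / ell^2)^beta (illustrative parameters).
    d = x.shape[0]
    u = x - y
    r2 = u @ u
    base = c ** 2 + r2 / ell ** 2
    k = base ** beta
    grad_x = 2 * beta / ell ** 2 * base ** (beta - 1) * u      # grad_x k(x, y)
    grad_y = -grad_x                                           # grad_y k(x, y)
    trace = (-2 * beta * d / ell ** 2 * base ** (beta - 1)     # tr grad_x grad_y k
             - 4 * beta * (beta - 1) * r2 / ell ** 4 * base ** (beta - 2))
    return score(x) @ score(y) * k + score(y) @ grad_x + score(x) @ grad_y + trace

def ksd2_vstat(samples, score, **kernel_args):
    # V-statistic of equation (2): (1/n^2) * sum_ij h_theta(Y_i, Y_j)
    n = len(samples)
    return sum(imq_stein_kernel(samples[i], samples[j], score, **kernel_args)
               for i in range(n) for j in range(n)) / n ** 2
```

Samples drawn from the model itself yield a small statistic, while samples from a shifted distribution yield a noticeably larger one, which is the discrepancy being minimized over $\theta$.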
We work under the potential outcomes framework, assuming that we have access to observations $(X_i, A_i, Y_i)_{i=1}^n \in \mathcal{X} \times \{0, 1\} \times \mathcal{Y}$ sampled as $(X, A, Y) \sim P$, such that Conditions A1-A3 hold. Throughout, we denote $Z \equiv (X, A, Y)$ and $\mathcal{Z} \equiv \mathcal{X} \times \{0, 1\} \times \mathcal{Y}$ for ease of presentation. Like the previously introduced MKSD estimators, our approach involves modeling $Q^1$ by choosing a $\theta_n$ such that

$\theta_n \in \arg\min_{\theta \in \Theta} g_n(\theta; Z_1, \ldots, Z_n),$   (3)

where $g_n$ is a proxy for $\mathrm{KSD}^2(Q_\theta \,\|\, Q^1)$. In contrast to such MKSD estimators, defining $g_n$ as the V-statistic introduced in equation 2 would lead to inconsistent estimates, given that the distribution of $Y$ may very likely differ from that of $Y^1$ (this is, indeed, the very essence of counterfactual settings). In order to define an appropriate $g_n$, we first note that, by the law of iterated expectation,

$\Psi := \mathbb{E}_{Y^1 \sim Q^1}[\xi_\theta(\cdot, Y^1)] = \mathbb{E}_{Z \sim P}[\phi(Z)], \qquad \phi(z) = \frac{a}{\pi(x)}\{\xi_\theta(\cdot, y) - \beta_\theta(x)\} + \beta_\theta(x),$   (4)

$\pi(x) = \mathbb{E}[A \mid X = x], \qquad \beta_\theta(x) = \mathbb{E}[\xi_\theta(\cdot, Y) \mid A = 1, X = x].$

An analogous result holds for $Y^0$ or any other discrete treatment level. Note that $Y^0$ affects neither $\pi$ nor $\beta_\theta$. In fact, the problem could have been posed as a missing outcome data problem, where the data from $Y^0$ is missing. The authors have posed the problem in counterfactual outcome terms simply for motivational purposes. The embedding $\phi$ induces a reproducing kernel $\bar{h}_\theta(z, \tilde{z}) = \langle \phi(z), \phi(\tilde{z}) \rangle_{\mathcal{H}^d}$, which will be used in subsequent theoretical analyses. We highlight that $\phi - \Psi$ is the efficient influence function for $\Psi$. This implies that the resulting estimators have desirable statistical properties such as double robustness, as shown later. That is, $\hat{\Psi} = \mathbb{P}_n \hat{\phi}_\theta$ will be consistent if either $\hat{\pi}$ or $\hat{\beta}_\theta$ is consistent, and its convergence rate corresponds to the product of the learning rates of $\hat{\pi}$ and $\hat{\beta}_\theta$. Nonetheless, it may not be feasible to estimate $\hat{\beta}_\theta$ individually for each $\theta$ of interest.
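The doubly robust structure of equation 4 is easiest to see in its scalar analogue, where $\xi_\theta(\cdot, y)$ is replaced by $y$ itself and $\Psi$ becomes the counterfactual mean $\mathbb{E}[Y^1]$. A minimal simulation (hypothetical setup, not from the paper; $Y^0$ is set to zero since it never enters the estimator) shows the AIPW-style average remaining consistent when one of the two nuisances is badly misspecified:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
X = rng.normal(size=n)
pi = 1 / (1 + np.exp(-X))                         # true propensity score
A = rng.binomial(1, pi)
Y = np.where(A == 1, X + rng.normal(size=n), 0.0)  # Y^1 = X + noise, so E[Y^1] = 0

beta_true = X                    # true outcome regression E[Y | A=1, X]
beta_wrong = np.full(n, 1.0)     # misspecified (constant) outcome model
pi_wrong = np.full(n, 0.5)       # misspecified (constant) propensity model

def aipw(pi_hat, beta_hat):
    # scalar analogue of phi(z) in equation (4), averaged over the sample
    return np.mean(A / pi_hat * (Y - beta_hat) + beta_hat)

est_good_pi = aipw(pi, beta_wrong)         # propensity correct, outcome wrong
est_good_beta = aipw(pi_wrong, beta_true)  # outcome correct, propensity wrong
```

Both estimates are close to the true value 0, illustrating that either correct nuisance suffices.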
In turn, let us assume for now that we have access to estimators $\hat{\pi}$ and $\hat{\beta}$, where the latter approximates $\beta(x) = \mathbb{E}[k(\cdot, Y) \mid A = 1, X = x]$ and is of the form

$\hat{\beta}(x) = \sum_{i=1}^n \hat{w}_i(x) k(\cdot, Y_i).$   (5)

We highlight that many prominent algorithms for conducting $\mathcal{H}^d$-valued regression, such as conditional mean embeddings (Song et al., 2009; Grünewälder et al., 2012) and distributional random forests (Ćevid et al., 2022; Näf et al., 2023), are of the form exhibited in equation 5. In order to avoid refitting the estimator $\hat{\beta}_\theta$ for each value of $\theta$, we propose to define $\hat{\beta}_\theta$ from $\hat{\beta}$ as follows:

$\hat{\beta}_\theta(x) = [\hat{\beta}_\theta(\hat{\beta})](x) := \sum_{i=1}^n \hat{w}_i(x) \xi_\theta(\cdot, Y_i).$   (6)

Intuitively, we could expect that if the estimator $\hat{\beta}$ is consistent, then $\hat{\beta}_\theta$ is also consistent provided the mapping $k(\cdot, y) \mapsto [s_\theta(y)k(\cdot, y) + \nabla k(\cdot, y)]$ is regular enough. As a result, we propose the statistic

$g_n(\theta; Z_1, \ldots, Z_n) = \left\langle \frac{1}{n} \sum_{i=1}^n \hat{\phi}_\theta(Z_i), \frac{1}{n} \sum_{j=1}^n \hat{\phi}_\theta(Z_j) \right\rangle_{\mathcal{H}^d},$   (7)

where $\hat{\beta}_\theta$ is constructed from $\hat{\beta}$ as established in equation 6. Although this statistic resembles the one presented in Martinez-Taboada et al. (2023) and Fawkes et al. (2022), several important differences arise between the contributions. First and foremost, they studied a testing problem, so the causal target was different. Second, their work did not extend to embeddings that depend on a parameter $\theta$. Lastly, they focused on the distribution of their statistic under the null in order to calibrate a two-sample test; in contrast, our approach minimizes a related statistic for directly modelling the counterfactual. The main theoretical contribution of Martinez-Taboada et al. (2023) is the extension of cross U-statistics to the causal setting; in stark contrast, our contribution deals with the theoretical properties of the minimizer of a debiased V-statistic. Both the motivation and theoretical challenges in our study diverge from those in the earlier works; we view our research as complementary and orthogonal to these prior contributions.
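The practical upshot of equation 6 is that the weights $\hat{w}_i(x)$ are fitted once and then re-combined for every $\theta$ at no extra fitting cost. A small sketch with Nadaraya-Watson weights, where $\xi_\theta(\cdot, y)$ is replaced by an explicit two-dimensional feature vector purely for illustration (all names and the feature map are hypothetical):

```python
import numpy as np

def nw_weights(x, X_train, bandwidth=0.5):
    # Nadaraya-Watson weights w_i(x): computed once, reused for every theta
    w = np.exp(-(x - X_train) ** 2 / (2 * bandwidth ** 2))
    return w / w.sum()

def xi_theta(y, theta):
    # stand-in for xi_theta(., y): an explicit finite-dimensional feature vector
    return np.array([theta - y, (theta - y) * y])

rng = np.random.default_rng(0)
X_train = rng.normal(size=50)
Y_train = X_train + rng.normal(size=50)

x0 = 0.3
w = nw_weights(x0, X_train)  # fitted once
# beta_hat_theta(x0) = sum_i w_i(x0) * xi_theta(., Y_i), for any theta, no refit
beta_hat = lambda theta: w @ np.stack([xi_theta(y, theta) for y in Y_train])
```

Calling `beta_hat(0.1)`, `beta_hat(2.0)`, and so on reuses the same weights, mirroring how equation 6 avoids refitting the $\mathcal{H}^d$-valued regression per $\theta$.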
While we have assumed so far that the estimators $\hat{\pi}$ and $\hat{\beta}_\theta$ are given, defining an optimal strategy for training such estimators is key when seeking to maximize performance. A simple approach, such as using half of the data to estimate $\pi$ and $\beta_\theta$ and the other half for the empirical averages of the statistic $g_n$, would lead to an increase in the variance of the latter. Hence, we use cross-fitting (Robins et al., 2008; Zheng & van der Laan, 2010; Chernozhukov et al., 2018). That is, we split the data in half, use the two folds to train different estimators separately, and then evaluate each estimator on the fold that was not used to train it. Based on these considerations, we propose a novel estimator called DR-MKSD (Doubly Robust Minimum Kernel Stein Discrepancy), outlined in Algorithm 1.

Algorithm 1 DR-MKSD
1: input: Data $D = (X_i, A_i, Y_i)_{i=1}^n$ and $\mathcal{Q} = \{q_\theta\}_{\theta \in \Theta}$ such that $s_\theta$ is known for all $\theta \in \Theta$.
2: output: A distribution $q_{\theta_n}$ that approximates the potential outcome $Y^1$.
3: Choose a kernel $k$ and estimators $\hat{\beta}$, $\hat{\pi}$.
4: Split the data into two sets $D_1 = (X_i, A_i, Y_i)_{i=1}^{n//2}$ and $D_2 = (X_i, A_i, Y_i)_{i=n//2+1}^{n}$.
5: Train $\hat{\pi}^{(1)}, \hat{\beta}^{(1)}$ on $D_2$ and $\hat{\pi}^{(2)}, \hat{\beta}^{(2)}$ on $D_1$. Define $\hat{\beta}^{(r)}_\theta$ from $\hat{\beta}^{(r)}$ as in equation 6.
6: Define $\hat{\phi}_\theta$ as $\hat{\phi}^{(r)}_\theta$ on $D_r$ for $r \in \{1, 2\}$, and $g_n$ as in equation 7.
7: Take $\theta_n \in \arg\min_{\theta \in \Theta} g_n(\theta; Z_1, \ldots, Z_n)$.
8: return $q_{\theta_n}$.

Two key observations regarding Algorithm 1 are noteworthy. Although we have presented $g_n$ as an abstract inner product in $\mathcal{H}^d$, equation 7 has a closed-form expression as long as $h_\theta$ can be evaluated; further details can be found in Appendix A. Additionally, the estimator $\theta_n$ is defined through a minimization problem, the complexity of which depends on $E_\theta$. In general, $g_n$ need not be convex with respect to $\theta$, and thus estimating $\theta_n$ may involve typical non-convex optimization challenges. However, this is inherent to our approach, which aims for flexible modeling of the counterfactual distribution.
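The sample-splitting pattern of steps 4-6 can be sketched generically as follows (a hypothetical helper, not the authors' code): each nuisance model is trained on one half of the data and only ever evaluated on the other half.

```python
import numpy as np

def cross_fit(X, fit, predict):
    # Two-fold cross-fitting: train on the opposite fold, predict on one's own fold.
    n = len(X)
    idx1, idx2 = np.arange(n // 2), np.arange(n // 2, n)
    model1 = fit(X[idx2])   # evaluated on fold 1, trained on fold 2
    model2 = fit(X[idx1])   # evaluated on fold 2, trained on fold 1
    out = np.empty(n)
    out[idx1] = predict(model1, X[idx1])
    out[idx2] = predict(model2, X[idx2])
    return out
```

For instance, with a sample-mean "model", each observation's prediction uses only the other fold's data, which is the independence that the theory below exploits.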
The potential $E_\theta$ itself could be a neural network, making non-convexity an unavoidable aspect of the problem. We note that evaluating $g_n$ has a time complexity of $O(n^2)$. Nonetheless, in stark contrast to the methods presented in Kim et al. (2018); Kennedy et al. (2021), the proposed DR-MKSD procedure only requires the nuisance estimators to be fitted once. This enables DR-MKSD to leverage computationally expensive estimators, such as deep neural networks, providing a significant advantage with respect to these previous contributions. While it may be difficult to find a global minimizer of $g_n$, we turn to studying the properties of the estimator as if that optimization task posed no problem, with the hope that this analysis sheds light on the expected behaviour of the procedure whenever the minimization problem yields a good enough estimate of $\theta$. We thus provide sufficient conditions for consistency and inference properties of the optimal $\theta_n$. We defer the proofs to Appendix C, as well as an exhaustive description of the notation used.

Theorem 1 (Consistency) Assume that $\Theta \subseteq \mathbb{R}^p$ is open, convex, and bounded. Further, let Conditions A1-A3 hold, as well as

A4) $\iint \sup_{\theta \in \Theta} \bar{h}_\theta(z, \tilde{z})\, dP(z)\, dP(\tilde{z}) < \infty$,
A5) $\int \sup_{\theta \in \Theta} \bar{h}_\theta(z, z)\, dP(z) < \infty$,
A6) $\iint \sup_{\theta \in \Theta} \|\nabla_\theta \bar{h}_\theta(z, \tilde{z})\|_{\mathbb{R}^p}\, dP(z)\, dP(\tilde{z}) < \infty$,
A7) $\int \sup_{\theta \in \Theta} \|\nabla_\theta \bar{h}_\theta(z, z)\|_{\mathbb{R}^p}\, dP(z) < \infty$.

If (i) $P(\hat{\pi}^{(r)} \in [\epsilon, 1 - \epsilon]) = 1$, (ii) $\sup_{\theta \in \Theta} \|\hat{\phi}^{(r)}_\theta - \phi_\theta\| = o_P(\sqrt{n})$, and (iii) $\|\hat{\pi}^{(r)} - \pi\| \sup_{\theta \in \Theta} \|\hat{\beta}^{(r)}_\theta - \beta_\theta\| = o_P(1)$ for $r \in \{1, 2\}$, then

$\mathrm{KSD}(Q_{\theta_n} \,\|\, Q^1) \to_p \min_{\theta \in \Theta} \mathrm{KSD}(Q_\theta \,\|\, Q^1).$

Conditions A4-A7 build on the supremum of an abstract kernel $\bar{h}_\theta$, but we note that its properties stem from those of $k$ and $s_\theta$. If $s_\theta(y)$, $\nabla_\theta s_\theta(y)$, $k(\cdot, y)$, and $\nabla_y k(\cdot, y)$ are bounded, then so are $\bar{h}_\theta$ and $\nabla_\theta \bar{h}_\theta$, and Conditions A4-A7 are fulfilled. In particular, if $s_\theta$ and $k$ are smooth and $\mathcal{Y}$ is compact, such assumptions are attained.
We highlight the weakness of Condition (i), which only requires that our estimates be bounded away from 0 and 1, and of Condition (ii), which does not even require the estimates of $\phi_\theta$ to be consistent, and is usually implied by Condition (iii). Condition (iii) implicitly conveys the so-called double robustness. That is, as long as the estimator of $\pi$ is consistent or the estimators of $\beta_\theta$ are uniformly consistent, then so will be the procedure. We highlight that we are not estimating $\beta_\theta$ independently for each $\theta$; rather, we construct all of them based on an estimate of $\beta$. This opens the door to having immediate uniform consistency as long as the estimator of $\beta$ is consistent (depending on the underlying structure between $\beta$ and $\beta_\theta$), in a similar spirit to the work on uniform consistency rates presented in Härdle et al. (1988). The consistency properties of a doubly robust estimator are highly desirable. However, a most important characteristic is the convergence rate, which we show next corresponds to the product of the convergence rates of the two estimators upon which it is constructed.

Theorem 2 (Asymptotic normality) Let $\Theta \subseteq \mathbb{R}^p$ be open, convex, and bounded, and assume $\theta_n \to_p \theta^\ast$, where $\mathrm{KSD}(Q_{\theta^\ast} \,\|\, Q^1) = \min_{\theta \in \Theta} \mathrm{KSD}(Q_\theta \,\|\, Q^1)$. Suppose Conditions A1-A5 hold, as well as

A8) the maps $\{\theta \mapsto \nabla_\theta \bar{h}_\theta(z, \tilde{z}) : z, \tilde{z} \in \mathcal{Z}\}$ are differentiable on $\Theta$,
A9) the maps $\{\theta \mapsto \nabla^2_\theta \bar{h}_\theta(z, \tilde{z}) : z, \tilde{z} \in \mathcal{Z}\}$ are uniformly continuous at $\theta^\ast$,
A10) $\iint \sup_{\theta \in \Theta} \|\nabla_\theta \bar{h}_\theta(z, \tilde{z})\|_{\mathbb{R}^p}\, dP(z)\, dP(\tilde{z}) < \infty$,
A11) $\iint \|\nabla_\theta \bar{h}_\theta(z, \tilde{z})\|^2_{\mathbb{R}^p}\, dP(z)\, dP(\tilde{z}) \,\big|_{\theta = \theta^\ast} < \infty$,
A12) $\int \sup_{\theta \in \Theta} \|\nabla^2_\theta \bar{h}_\theta(z, z)\|_{\mathbb{R}^{p \times p}}\, dP(z) < \infty$,
A13) $\iint \sup_{\theta \in \Theta} \|\nabla^2_\theta \bar{h}_\theta(z, \tilde{z})\|_{\mathbb{R}^{p \times p}}\, dP(z)\, dP(\tilde{z}) < \infty$,
A14) $\Gamma = \iint \nabla^2_\theta \bar{h}_\theta(z, \tilde{z})\, dP(z)\, dP(\tilde{z}) \,\big|_{\theta = \theta^\ast} \succ 0$.

If (i') $P(\hat{\pi}^{(r)} \in [\epsilon, 1 - \epsilon]) = 1$, (ii') $\sup_{\theta \in \Theta} \big\{ \|\hat{\phi}^{(r)}_\theta - \phi_\theta\| + \|\nabla_\theta \hat{\phi}^{(r)}_\theta - \nabla_\theta \phi_\theta\| \big\} = o_P(1)$, and (iii') $\|\hat{\pi}^{(r)} - \pi\| \sup_{\theta \in \Theta} \big\{ \|\hat{\beta}^{(r)}_\theta - \beta_\theta\| + \|\nabla_\theta \hat{\beta}^{(r)}_\theta - \nabla_\theta \beta_\theta\| \big\} = o_P(1/\sqrt{n})$ for $r \in \{1, 2\}$, then

$\sqrt{n}(\theta_n - \theta^\ast) \to_d N(0, 4\Gamma^{-1}\Sigma\Gamma^{-1}), \qquad \text{where } \Sigma := \mathrm{Cov}_{Z \sim P}\left[ \int \nabla_\theta \bar{h}_\theta(Z, \tilde{z})\, dP(\tilde{z}) \right] \Big|_{\theta = \theta^\ast}.$

As proposed in Oates et al.
(2022), a sandwich estimator $4\Gamma_n^{-1}\Sigma_n\Gamma_n^{-1}$ for the asymptotic variance $4\Gamma^{-1}\Sigma\Gamma^{-1}$ can be established (Freedman, 2006), where $\Sigma_n$ and $\Gamma_n$ are obtained by substituting $\theta^\ast$ by $\theta_n$ and replacing the expectations by empirical averages. If the estimator $4\Gamma_n^{-1}\Sigma_n\Gamma_n^{-1}$ is consistent at any rate, then $\frac{\sqrt{n}}{2}\Sigma_n^{-1/2}\Gamma_n(\theta_n - \theta^\ast) \to_d N(0, I_p)$ in view of Slutsky's theorem. Conditions A8-A13 can be satisfied under diverse regularity assumptions, analogously to Conditions A4-A7. Further, we note that assumptions akin to Condition A14 are commonly encountered in the context of asymptotic normality. While Condition (i') carries over from the consistency analysis, Condition (ii') now requires the estimators of $\phi_\theta$ and $\nabla_\theta \phi_\theta$ to be uniformly consistent. However, we highlight the weakness of this assumption; for instance, it is attained as long as $\hat{\pi}$ is consistent and $\hat{\beta}_\theta$ is uniformly bounded. Lastly, we again put the spotlight on Condition (iii'): if the product of the rates is $o_P(n^{-1/2})$, asymptotic normality of the estimate $\theta_n$ can be established. The double robustness of $g_n$ has two profound implications for the DR-MKSD procedure. First, $g_n(\theta)$ converges to $\mathrm{KSD}^2(Q_\theta \,\|\, Q^1)$ faster than if solely relying on the nuisance estimator $\hat{\pi}$ or $\hat{\beta}_\theta$, which translates to a better estimate $\theta_n$. Further, it opens the door to conducting inference if the product of the rates is $o_P(n^{-1/2})$, which may be achieved by a rich family of estimators. We underscore that $\theta_n \to_p \theta^\ast$ need not be a consequence of Theorem 1: Theorem 1 establishes consistency of $\mathrm{KSD}(Q_{\theta_n} \,\|\, Q^1)$, not of $\theta_n$ itself. Estimators that are consistent for $\mathrm{KSD}(Q_{\theta_n} \,\|\, Q^1)$ but not for $\theta_n$ may arise, for example, if the minimizer is not unique. In such a case, one cannot theoretically derive the consistency of the estimate (at least, while remaining agnostic about the optimization solver). For instance, if there are two minimizers and the optimization solver alternates between the two of them, $\mathrm{KSD}(Q_{\theta_n} \,\|\, Q^1)$ is consistent but $\theta_n$ does not even converge.
Nonetheless, we emphasize that the theorems provide sufficient, but not necessary, conditions for consistency and asymptotic normality. Furthermore, a potential lack of consistency does not necessarily translate to a poor performance of the proposed estimator. Two distributions characterized by very different parameters $\theta$ may be close in the space of distributions defined by the KSD metric (analogously to two neural networks with very different weights having similar empirical performance). Estimating the distribution characterized by the parameter, not the parameter itself, is the ultimate goal of this work. Hence, the proposed estimator can yield good models for the counterfactual distribution even with an inconsistent $\theta_n$. Lastly, we highlight that the DR-MKSD procedure can easily be redefined by estimating $\beta_{\theta_g}$ for each $\theta_g$ belonging to a finite grid $\{\theta_g\}_{g \in G}$. The uniform consistency assumptions would consequently translate into ordinary consistency for each $\theta_g$ of the set $\{\theta_g\}_{g \in G}$, and the minimization problem would reduce to taking the arg min over a finite set. The statistical and computational trade-off is clear: estimating $\beta_{\theta_g}$ for each $\theta_g$ may lead to stronger theoretical guarantees and improved performance, but the computational cost of estimating $\beta_{\theta_g}$ increases by a factor of $|G|$.

5 EXPERIMENTS

We provide a number of experiments with (semi)synthetic data. Given that this is the first estimator to handle unnormalized densities in the counterfactual setting, we found no natural benchmark against which to compare the empirical performance of the proposed estimator. Hence, we focus on illustrating the theoretical properties and exploring the empirical performance of the estimator in a variety of settings. Throughout, we take $k(x, y) = (c^2 + l^{-2}\|x - y\|^2_{\mathbb{R}^d})^\beta$ to be the inverse multi-quadratic (IMQ) kernel, based on the discussion in Gorham & Mackey (2017), with $\beta = -0.5$, $l = 0.1$ and $c = 1$. We estimate the minimizer of $g_n(\theta)$ by gradient descent.
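A self-contained toy illustration of minimizing a KSD-type V-statistic by gradient descent (non-causal, and with a Gaussian base kernel rather than the IMQ kernel, purely to keep the closed form short; all names are hypothetical): for the location family $\{N(\theta, 1)\}$, whose score is $s_\theta(y) = \theta - y$, the $\theta$-gradient of the Stein kernel in equation 1 reduces to $k(y_i, y_j)(2\theta - y_i - y_j)$, since the cross terms cancel and the trace term is free of $\theta$.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=1.5, size=300)   # samples from the "true" distribution N(1.5, 1)

def gn_grad(theta):
    # d/d(theta) of the V-statistic, using a Gaussian base kernel
    # k(y, y') = exp(-(y - y')^2 / 2); per-pair gradient is k * (2*theta - y_i - y_j)
    u = y[:, None] - y[None, :]
    k = np.exp(-u ** 2 / 2)
    return np.mean(k * (2 * theta - y[:, None] - y[None, :]))

theta = 0.0
for _ in range(200):                # plain gradient descent on g_n(theta)
    theta -= 0.5 * gn_grad(theta)
```

The iterate settles at a kernel-weighted mean of the sample, close to the true location 1.5, illustrating why gradient descent on $g_n$ is a workable default even though $g_n$ need not be convex for richer families.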
We defer an exhaustive description of all simulations and further experiments to Appendix B.

Consistency of the DR-MKSD estimator: To elucidate the double robustness of DR-MKSD, we define its inverse probability weighting and plug-in versions, given by $\hat{\phi}_{\mathrm{IPW},\theta}(z) = \frac{a}{\hat{\pi}(x)}\xi_\theta(\cdot, y)$ and $\hat{\phi}_{\mathrm{PI},\theta}(z) = \hat{\beta}_\theta(x)$ respectively. We draw $(X_i, A_i, Y_i)_{i=1}^n$ such that $Y^1 \sim N(0, 1)$. Further, we let $X \sim N(Y^1, 1)$ and we sample $A$ using a logistic model that depends on $X$. Note that this sampling procedure is consistent with Conditions A1-A3. We consider the set of normal distributions with unit variance $\mathcal{Q} = \{N(\theta, 1)\}_{\theta \in \mathbb{R}}$. Figure 1 exhibits the mean squared error of the procedures across 100 bootstrap samples, different sample sizes $n$, and various choices of estimators. The procedure shows consistency as long as either the IPW or PI version is consistent, even if the other is not. We defer to Appendix B an analysis of the confidence intervals for $\theta_n$ given by Theorem 2, where we illustrate that they have the desired empirical coverage.

Asymptotic normality of the DR-MKSD estimator: Following the examples in Liu et al. (2019) and Matsubara et al. (2022), we consider the intractable model with potential function $E_\theta(y) = -\langle \eta(\theta), J(y) \rangle_{\mathbb{R}^8}$, where $\theta \in \mathbb{R}^2$, $y \in \mathbb{R}^5$, $\eta(\theta) := (-0.5, 0.6, 0.2, 0, 0, 0, \theta)^T$, and $J(y) = (\sum_{i=1}^5 y_i^2, y_1y_2, \sum_{i=3}^5 y_1y_i, \tanh(y))^T$. The normalizing constant is not tractable except for $\theta = 0$, where we recover the density of a Gaussian distribution $N_0$. We draw $(X_i, A_i, Y_i)_{i=1}^{500}$ so that $Y^1 \sim N_0$ and $X \sim N(Y^1, I_5)$. Lastly, $A$ follows a logistic model that depends on $X \in \mathcal{X}$. Figure 2 displays the empirical estimates of $\theta = (\theta_1, \theta_2)$, which appear approximately normal.

Counterfactual Restricted Boltzmann Machines (RBM): We let $(X_i, A_i, Y_i)_{i=1}^{500}$ be such that $Y^1$ is drawn from a two-dimensional RBM with one hidden variable $h$ such that $q_\theta(y^1, h) \propto \exp(h\langle \theta, y^1 \rangle_{\mathbb{R}^2} - 2\|y^1\|_2^2)$.
Additionally, $X \sim N(Y^1, 0.25 I_2)$ and $A$ follows a logistic model that depends on $X$. Although estimating the exact $\theta$ is hopeless ($h$ is unobservable), it may still be possible to uncover the underlying structure governing the behavior of $Y^1$, which depends on the ratio between $\theta_1$ and $\theta_2$. Figure 3 shows that minimizing $g_n$ does indeed recover the direction of $\theta$.

Figure 1: Mean squared error (MSE) of $\theta_n$ approximated with 100 bootstrap samples. $\pi$ and $\beta$ are estimated via (A) boosting and conditional mean embeddings (CME), (B) logistic regression and 1-NN, (C) logistic regression and CME. The doubly robust approach proves consistent in all cases.

Figure 2: Illustration of 100 simulations of the DR-MKSD estimator $\theta_n = (\theta_{n,1}, \theta_{n,2})$, constructed on estimates of $\pi$ and $\beta$ fitted as random forests and CME for $n = 500$, with (A) scatter plot of the empirical distribution of $\theta_n$, (B) histogram of the standardized values of $\theta_{n,1}$, (C) normal QQ plot of the standardized values of $\theta_{n,1}$. We highlight the two-dimensional Gaussian behaviour of $\theta_n$.

Experimental data: Suppose one has access to a noisy version of $Y^1$, denoted as $X$, and the process of denoising this data to recover $Y^1$ is costly. Due to budget constraints, one can only afford to denoise approximately half the data. The objective is to estimate the distribution of $Y^1 \in \mathbb{R}^d$, which can be described by a model $q_\theta(Y^1) \propto \exp(\langle T(Y^1), \theta \rangle)$. Here, $T(Y^1) \in \mathbb{R}^p$ represents a lower-dimensional representation of $Y^1$. To achieve this, one randomly selects observations for denoising with a probability of 1/2 and employs DR-MKSD to model $Y^1$. For illustration, we consider ten distinct counterfactuals, denoted as $Y^{1,i} \in \mathbb{R}^{20}$ for $i \in \{1, \ldots, 10\}$. These counterfactuals are defined as $Y^{1,i} := f(W^i)$, where each $W^i$ represents an observation from the MNIST dataset corresponding to digit $i$, and $f$ is a pretrained neural network. Additionally, we utilize another pretrained neural network $T: \mathbb{R}^{20} \to \mathbb{R}^{10}$.
In Figure 4, we display random MNIST observations in the top row, alongside those with the lowest unnormalized density of $f(W^i)$, estimated by DR-MKSD independently for each digit, in the bottom row. Notably, the latter images appear more distorted, which aligns with the expectation that DR-MKSD accurately models $Y^1$.

Figure 3: Values of the statistic $g_n(\theta)$ for the RBM with (A) $\theta_1 = \theta_2 = 0$, (B) $\theta_1 = \theta_2 = 1$, (C) $\theta_1 = -\theta_2 = 1$. Minimizing $g_n$ recovers the direction of $\theta$.

Figure 4: Observations of the MNIST dataset chosen randomly (top row), and with the lowest estimated density of $f(W^i)$ (bottom row). We highlight the potential outlier detection in digits 0, 2, 4 and 6.

6 DISCUSSION

We have presented an estimator for modeling counterfactual distributions given a flexible set of distributions, which only need to be known up to normalizing constants. The procedure builds on minimizing the kernel Stein discrepancy between such a set and the counterfactual, while simultaneously accounting for sampling bias in a doubly robust manner. We have provided sufficient conditions for the consistency and asymptotic normality of the estimator, and we have illustrated its performance in various scenarios, showing the empirical validity of the procedure. There are several avenues for future research. Employing energy-based models for estimating the counterfactual enables the generation of synthetic observations using a rich representation of the potential outcome. Exploring the empirical performance of sampling methods that do not require the normalizing constants, such as Hamiltonian Monte Carlo or the Metropolis-Hastings algorithm, holds particular promise in domains where collecting additional real-world data is challenging. Furthermore, extensions could involve incorporating instrumental variables and conditional effects. Of particular interest would be the expansion of our framework to accommodate time-varying treatments.
Lastly, we highlight that, while we have framed the problem within a causal inference context, analogous scenarios arise in off-policy evaluation (Dudík et al., 2011; Thomas & Brunskill, 2016). Extending our contributions to that domain could yield intriguing insights.

REPRODUCIBILITY STATEMENT

Reproducible code for all experiments is provided in the supplementary materials.

ACKNOWLEDGMENTS

DMT gratefully acknowledges that the project that gave rise to these results received the support of a fellowship from "la Caixa" Foundation (ID 100010434). The fellowship code is LCF/BQ/EU22/11930075. EK was supported by NSF CAREER Award 2047444.

REFERENCES

Alberto Abadie. Bootstrap tests for distributional treatment effects in instrumental variable models. Journal of the American Statistical Association, 97(457):284-292, 2002.

David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147-169, 1985.

Andreas Anastasiou, Alessandro Barp, François-Xavier Briol, Bruno Ebner, Robert E Gaunt, Fatemeh Ghaderinezhad, Jackson Gorham, Arthur Gretton, Christophe Ley, Qiang Liu, et al. Stein's method meets computational statistics: A review of some recent developments. Statistical Science, 38(1):120-139, 2023.

Alessandro Barp, François-Xavier Briol, Andrew Duncan, Mark Girolami, and Lester Mackey. Minimum Stein discrepancy estimators. Advances in Neural Information Processing Systems, 32, 2019.

Domagoj Ćevid, Loris Michel, Jeffrey Näf, Peter Bühlmann, and Nicolai Meinshausen. Distributional random forests: Heterogeneity adjustment and multivariate distributional regression. The Journal of Machine Learning Research, 23(1):14987-15065, 2022.

Victor Chernozhukov and Christian Hansen. An IV model of quantile treatment effects. Econometrica, 73(1):245-261, 2005.

Victor Chernozhukov, Iván Fernández-Val, and Blaise Melly. Inference on counterfactual distributions.
Econometrica, 81(6):2205-2268, 2013.

Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1-C68, 2018.

Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton. A kernel test of goodness of fit. In International Conference on Machine Learning, pp. 2606-2615. PMLR, 2016.

Peter Clifford. Markov random fields in statistics. Disorder in Physical Systems: A Volume in Honour of John M. Hammersley, pp. 19-32, 1990.

Iván Díaz. Efficient estimation of quantiles in missing data models. Journal of Statistical Planning and Inference, 190:39-51, 2017.

Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning. Omnipress, 2011.

Jake Fawkes, Robert Hu, Robin J Evans, and Dino Sejdinovic. Doubly robust kernel statistics for testing distributional treatment effects even under one sided overlap. arXiv preprint arXiv:2212.04922, 2022.

Matthew Fisher, Tui Nolan, Matthew Graham, Dennis Prangle, and Chris Oates. Measure transport with kernel Stein discrepancy. In International Conference on Artificial Intelligence and Statistics, pp. 1054-1062. PMLR, 2021.

David A Freedman. On the so-called "Huber sandwich estimator" and "robust standard errors". The American Statistician, 60(4):299-302, 2006.

Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(6):721-741, 1984. doi: 10.1109/TPAMI.1984.4767596.

Jackson Gorham and Lester Mackey. Measuring sample quality with Stein's method. Advances in Neural Information Processing Systems, 28, 2015.

Jackson Gorham and Lester Mackey. Measuring sample quality with kernels.
In International Conference on Machine Learning, pp. 1292-1301. PMLR, 2017.

Steffen Grünewälder, Guy Lever, Luca Baldassarre, Sam Patterson, Arthur Gretton, and Massimiliano Pontil. Conditional mean embeddings as regressors. In Proceedings of the 29th International Conference on Machine Learning, pp. 1803-1810, 2012.

W Härdle, Paul Janssen, and Robert Serfling. Strong uniform consistency rates for estimators of conditional functionals. The Annals of Statistics, pp. 1428-1449, 1988.

Geoffrey Hinton. A practical guide to training restricted Boltzmann machines. Momentum, 9(1):926, 2010.

Geoffrey Hinton, Simon Osindero, Max Welling, and Yee-Whye Teh. Unsupervised discovery of nonlinear structure using contrastive backpropagation. Cognitive Science, 30(4):725-731, 2006.

Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.

Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.

Guido W Imbens and Donald B Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

Edward H Kennedy, Sivaraman Balakrishnan, and Larry Wasserman. Semiparametric counterfactual density estimation. arXiv preprint arXiv:2102.12034, 2021.

Kwangho Kim, Jisu Kim, and Edward H Kennedy. Causal effects based on distributional distances. arXiv preprint arXiv:1806.02935, 2018.

Anna Korba, Pierre-Cyril Aubin-Frankowski, Szymon Majewski, and Pierre Ablin. Kernel Stein discrepancy descent. In International Conference on Machine Learning, pp. 5719-5730. PMLR, 2021.

Henry Lam and Haofeng Zhang. Doubly robust Stein-kernelized Monte Carlo estimator: Simultaneous bias-variance reduction and supercanonical convergence. Journal of Machine Learning Research, 24(85):1-58, 2023.
Yann LeCun, Sumit Chopra, Raia Hadsell, Marc'Aurelio Ranzato, and Fu-Jie Huang. A tutorial on energy-based learning. In Predicting Structured Data. 2006.

Qiang Liu, Jason Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pp. 276-284. PMLR, 2016.

Song Liu, Takafumi Kanamori, Wittawat Jitkrittum, and Yu Chen. Fisher efficient inference of intractable models. Advances in Neural Information Processing Systems, 32, 2019.

Dean Lusher, Johan Koskinen, and Garry Robins. Exponential Random Graph Models for Social Networks: Theory, Methods, and Applications. Cambridge University Press, 2013.

Diego Martinez-Taboada, Aaditya Ramdas, and Edward H Kennedy. An efficient doubly-robust test for the kernel treatment effect. arXiv preprint arXiv:2304.13237, 2023.

Takuo Matsubara, Jeremias Knoblauch, François-Xavier Briol, and Chris J Oates. Robust generalised Bayesian inference for intractable likelihoods. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(3):997-1022, 2022.

Valentyn Melnychuk, Dennis Frauen, and Stefan Feuerriegel. Normalizing flows for interventional density estimation. In International Conference on Machine Learning, pp. 24361-24397. PMLR, 2023.

Andriy Mnih and Geoffrey Hinton. Learning nonlinear constraints with contrastive backpropagation. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, volume 2, pp. 1302-1307. IEEE, 2005.

Krikamol Muandet, Motonobu Kanagawa, Sorawit Saengkyongam, and Sanparith Marukatat. Counterfactual mean embeddings. The Journal of Machine Learning Research, 22(1):7322-7392, 2021.

Jeffrey Näf, Corinne Emmenegger, Peter Bühlmann, and Nicolai Meinshausen. Confidence and uncertainty assessment for distributional random forests. arXiv preprint arXiv:2302.05761, 2023.

Chris Oates et al. Minimum kernel discrepancy estimators. arXiv preprint arXiv:2210.16357, 2022.
Chris J Oates, Mark Girolami, and Nicolas Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society Series B: Statistical Methodology, 79(3):695-718, 2017.

Junhyung Park, Uri Shalit, Bernhard Schölkopf, and Krikamol Muandet. Conditional distributional treatment effect with kernel conditional mean embeddings and U-statistic regression. In International Conference on Machine Learning, pp. 8401-8412. PMLR, 2021.

Garry Robins, Pip Pattison, Yuval Kalish, and Dean Lusher. An introduction to exponential random graph (p*) models for social networks. Social Networks, 29(2):173-191, 2007.

James Robins and Andrea Rotnitzky. Inference for semiparametric models: Some questions and an answer - comments. Statistica Sinica, 11:920-936, 2001.

James Robins, Lingling Li, Eric Tchetgen, and Aad van der Vaart. Higher order influence functions and minimax estimation of nonlinear functionals. In Institute of Mathematical Statistics Collections, pp. 335-421. Institute of Mathematical Statistics, 2008.

Donald B Rubin. Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469):322-331, 2005.

Rahul Singh, Maneesh Sahani, and Arthur Gretton. Kernel instrumental variable regression. Advances in Neural Information Processing Systems, 32, 2019.

Rahul Singh, Liyuan Xu, and Arthur Gretton. Kernel methods for causal functions: Dose, heterogeneous, and incremental response curves. arXiv preprint arXiv:2010.04855, 2020.

Rahul Singh, Liyuan Xu, and Arthur Gretton. Kernel methods for multistage causal inference: Mediation analysis and dynamic treatment effects. arXiv preprint arXiv:2111.03950, 2021.

Le Song, Jonathan Huang, Alex Smola, and Kenji Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pp.
961-968, 2009.

Charles Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory, volume 6, pp. 583-603. University of California Press, 1972.

Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139-2148. PMLR, 2016.

Ted Westling and Marco Carone. A unified study of nonparametric inference for monotone functions. Annals of Statistics, 48(2):1001, 2020.

Wenjing Zheng and Mark J van der Laan. Asymptotic theory for cross-validated targeted maximum likelihood estimation. U.C. Berkeley Division of Biostatistics Working Paper Series, 2010.

Yuchen Zhu, Limor Gultchin, Arthur Gretton, Matt J Kusner, and Ricardo Silva. Causal inference with treatment measurement error: a nonparametric instrumental variable approach. In Uncertainty in Artificial Intelligence, pp. 2414-2424. PMLR, 2022.

A CLOSED FORM OF THE STATISTIC

Under estimators of the form $\hat\beta_\theta(x) = \sum_{i=1}^n \hat w_i(x)\,\xi_\theta(\cdot, Y_i)$, we derive that both inner products
\[
\big\langle \hat\beta_\theta(x), \hat\beta_\theta(\tilde x) \big\rangle_{\mathcal{H}^d} = \sum_{i,j=1}^n \hat w_i(x)\hat w_j(\tilde x)\,\big\langle \xi_\theta(\cdot,Y_i), \xi_\theta(\cdot,Y_j)\big\rangle_{\mathcal{H}^d} = \sum_{i,j=1}^n \hat w_i(x)\hat w_j(\tilde x)\, h_\theta(Y_i, Y_j),
\]
\[
\big\langle \hat\beta_\theta(x), \xi_\theta(\cdot, Y) \big\rangle_{\mathcal{H}^d} = \sum_{i=1}^n \hat w_i(x)\,\big\langle \xi_\theta(\cdot,Y_i), \xi_\theta(\cdot,Y)\big\rangle_{\mathcal{H}^d} = \sum_{i=1}^n \hat w_i(x)\, h_\theta(Y_i, Y),
\]
count with closed forms as long as $h_\theta$ can be computed. Consequently,
\[
g_n(\theta; Y_1, \ldots, Y_n) = \Big\langle \frac{1}{n}\sum_{i=1}^n \hat\phi_\theta(Z_i),\; \frac{1}{n}\sum_{j=1}^n \hat\phi_\theta(Z_j) \Big\rangle_{\mathcal{H}^d},
\]
where
\[
\hat\phi_\theta(Z_i) = \frac{A_i}{\hat\pi(X_i)}\big\{\xi_\theta(\cdot,Y_i) - \hat\beta_\theta(X_i)\big\} + \hat\beta_\theta(X_i) = \frac{A_i}{\hat\pi(X_i)}\,\xi_\theta(\cdot,Y_i) + \Big(1 - \frac{A_i}{\hat\pi(X_i)}\Big)\hat\beta_\theta(X_i).
\]
Expanding the inner product,
\[
g_n(\theta) = \frac{1}{n^2}\sum_{i,j=1}^n \Big[\frac{A_iA_j}{\hat\pi(X_i)\hat\pi(X_j)}\big\langle\xi_\theta(\cdot,Y_i),\xi_\theta(\cdot,Y_j)\big\rangle + \frac{A_i}{\hat\pi(X_i)}\Big(1-\frac{A_j}{\hat\pi(X_j)}\Big)\big\langle\xi_\theta(\cdot,Y_i),\hat\beta_\theta(X_j)\big\rangle + \Big(1-\frac{A_i}{\hat\pi(X_i)}\Big)\frac{A_j}{\hat\pi(X_j)}\big\langle\hat\beta_\theta(X_i),\xi_\theta(\cdot,Y_j)\big\rangle + \Big(1-\frac{A_i}{\hat\pi(X_i)}\Big)\Big(1-\frac{A_j}{\hat\pi(X_j)}\Big)\big\langle\hat\beta_\theta(X_i),\hat\beta_\theta(X_j)\big\rangle\Big],
\]
and substituting the closed forms above,
\[
g_n(\theta) = \frac{1}{n^2}\sum_{i,j=1}^n \Big[\frac{A_iA_j}{\hat\pi(X_i)\hat\pi(X_j)}\, h_\theta(Y_i,Y_j) + \frac{A_i}{\hat\pi(X_i)}\Big(1-\frac{A_j}{\hat\pi(X_j)}\Big)\sum_{i'=1}^n \hat w_{i'}(X_j)\, h_\theta(Y_{i'},Y_i) + \Big(1-\frac{A_i}{\hat\pi(X_i)}\Big)\frac{A_j}{\hat\pi(X_j)}\sum_{j'=1}^n \hat w_{j'}(X_i)\, h_\theta(Y_{j'},Y_j) + \Big(1-\frac{A_i}{\hat\pi(X_i)}\Big)\Big(1-\frac{A_j}{\hat\pi(X_j)}\Big)\sum_{i',j'=1}^n \hat w_{i'}(X_i)\hat w_{j'}(X_j)\, h_\theta(Y_{i'},Y_{j'})\Big],
\]
which is in closed form as long as $h_\theta$, $\hat\pi$ and $\hat w$ can be evaluated.

B EXPERIMENTS

B.1 CONSISTENCY OF THE DR-MKSD ESTIMATOR

We consider the set of normal distributions with unit variance $Q = \{N(\theta, 1)\}_{\theta \in \mathbb{R}}$. We draw $(X_i, A_i, Y_i)_{i=1}^n$ for $n \in \{200, 250, \ldots, 800\}$ such that $Y^1 \sim N(0, 1)$. Further, $X \sim N(Y^1, 1)$ and $A$ follows a Bernoulli distribution with log odds $X$. For obtaining Figure 1, we estimate $\pi$ with either (i) the default LogisticRegression from the scikit-learn package with C=1e5 and max_iter=1000, or (ii) the default AdaBoostClassifier from the scikit-learn package; and $\beta$ with either (i) conditional mean embeddings (CME) with the radial basis function kernel and regularization parameter 1e-3, or (ii) vector-valued one-nearest-neighbor (1-NN) regression. Parameter $\theta$ was estimated by gradient descent for 1000 steps. A total of 100 experiments with different random seeds were run in order to yield Figure 1.

We further analyze the empirical coverage of the estimated 0.95-confidence intervals $\theta_n \pm \Delta_n$, where $\Delta_n = 1.96\sqrt{4\Gamma_n^{-1}\Sigma_n\Gamma_n^{-1}/n}$ and $\Sigma_n$ and $\Gamma_n$ are defined as described in Section 4. Figure 5 exhibits the confidence intervals for $n = 200$ and $n = 300$. The respective empirical coverages for those sample sizes are 0.94 and 0.96, hence presenting the desired empirical coverage.

Figure 5: Estimated 0.95-confidence intervals $\theta_n \pm \Delta_n$, where $\Delta_n = 1.96\sqrt{4\Gamma_n^{-1}\Sigma_n\Gamma_n^{-1}/n}$, for (A) $n = 200$, and (B) $n = 300$. The empirical coverages of the confidence intervals are (A) 0.94, and (B) 0.96.
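The closed-form expression derived in Appendix A is a finite sum of kernel evaluations, so it reduces to plain matrix algebra. Below is a minimal sketch, assuming the Stein kernel matrix `H` with entries $h_\theta(Y_i, Y_j)$, the propensities $\hat\pi(X_i)$, and a weight matrix `W` with `W[i, k]` $= \hat w_k(X_i)$ have been precomputed; the function name and input layout are our own assumptions, not the paper's code.

```python
import numpy as np

def dr_mksd_statistic(H, A, pi_hat, W):
    """g_n(theta) = ||(1/n) sum_i phi_hat_theta(Z_i)||^2 in closed form.

    H      : (n, n) Stein kernel matrix, H[i, j] = h_theta(Y_i, Y_j)
    A      : (n,)   binary treatments
    pi_hat : (n,)   estimated propensities pi_hat(X_i)
    W      : (n, n) regression weights, W[i, k] = w_hat_k(X_i), so that
             beta_hat_theta(X_i) = sum_k W[i, k] * xi_theta(., Y_k)
    """
    n = len(A)
    r = A / pi_hat                        # inverse-propensity ratios
    # phi_hat_i = sum_k M[i, k] * xi_theta(., Y_k): the direct term plus
    # the regression term of the doubly robust influence-function estimator.
    M = np.diag(r) + (1 - r)[:, None] * W
    return (M @ H @ M.T).sum() / n**2

# Sanity check: with A = 1, pi_hat = 1 and W = 0, the estimator reduces
# to the plug-in V-statistic (1/n^2) * sum_{i,j} h_theta(Y_i, Y_j).
H = np.array([[1.0, 0.5], [0.5, 1.0]])
g = dr_mksd_statistic(H, A=np.ones(2), pi_hat=np.ones(2), W=np.zeros((2, 2)))
```

Since $g_n(\theta)$ is a squared RKHS norm, the returned value is nonnegative whenever `H` is positive semi-definite, which provides a cheap correctness check for any choice of weights and propensities.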
B.2 ASYMPTOTIC NORMALITY OF THE DR-MKSD ESTIMATOR

We consider the intractable model $q_\theta(y) \propto \exp(-E_\theta(y))$ with potential function $E_\theta(y) = \langle \eta(\theta), J(y) \rangle_{\mathbb{R}^8}$, where
\[
\eta(\theta) := (0.5,\, 0.6,\, 0.2,\, 0,\, 0,\, 0,\, \theta)^T, \quad \theta \in \mathbb{R}^2, \qquad J(y) := \Big(\sum_{i=1}^5 y_i^2,\; y_1y_2,\; \sum_{i=3}^5 y_1y_i,\; \tanh(y)\Big)^T, \quad y \in \mathbb{R}^5.
\]
The normalizing constant is not tractable except for $\theta = 0$, where we recover the density of a Gaussian distribution $N(0, \Sigma)$, with
\[
\Sigma = \begin{pmatrix} 1 & 0.6 & 0.2 & 0.2 & 0.2 \\ 0.6 & 1 & 0 & 0 & 0 \\ 0.2 & 0 & 1 & 0 & 0 \\ 0.2 & 0 & 0 & 1 & 0 \\ 0.2 & 0 & 0 & 0 & 1 \end{pmatrix}.
\]
We draw $(X_i, A_i, Y_i)_{i=1}^{500}$ so that $Y^1 \sim N(0, \Sigma)$ and $X \sim N(Y^1, I_5)$. Lastly, $A$ follows a Bernoulli distribution with log odds $\sum_{i=1}^5 (X_i^2 - 1)$. For obtaining Figure 2, we estimate $\pi$ with the default RandomForestClassifier from the scikit-learn package, and $\beta$ with conditional mean embeddings (CME) with the radial basis function kernel and regularization parameter 1e-3. Parameter $\theta$ was estimated by gradient descent for 1000 steps. A total of 100 experiments with different random seeds were run.

B.3 COUNTERFACTUAL RESTRICTED BOLTZMANN MACHINES (RBM)

We let $(X_i, A_i, Y_i)_{i=1}^{500}$ be such that $Y^1$ is drawn from a two-dimensional RBM with one hidden variable $h$, where
\[
q_\theta(y^1, h) \propto \exp\big(\langle h + \theta, y^1 \rangle_{\mathbb{R}^2} - \tfrac{1}{2}\|y^1\|_2^2\big),
\]
for $\theta = (0, 0)$, $\theta = (1, 1)$, and $\theta = (1, 1)$. In order to draw from such RBMs, we make use of Gibbs sampling and discard the first 1000 samples as burn-in. Additionally, $X \sim N(Y^1, 0.25 I_2)$ and $A$ follows a Bernoulli distribution with log odds $\frac{1}{5}\sum_i (X_i - 0.5)$. We estimate $\pi$ with the default LogisticRegression from the scikit-learn package with C=1e5 and max_iter=1000, and $\beta$ with conditional mean embeddings (CME) with the radial basis function kernel and regularization parameter 1e-3. Figure 3 exhibits the values of $g_n$ over the grid $\{(\theta_1, \theta_2) : \theta_1, \theta_2 \in \{-5, -4.9, -4.8, \ldots, 5\}\}$. Figure 6 exhibits the values of $g_n$ for further pairs $(\theta_1, \theta_2)$, illustrating the behaviour of $g_n$ under different magnitudes of $\theta_1$ and $\theta_2$.
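The Gibbs sampling used in B.3 can be sketched by alternating the two conditionals of the joint $q_\theta(y^1, h) \propto \exp(\langle h + \theta, y^1\rangle - \tfrac{1}{2}\|y^1\|_2^2)$. The sketch below assumes the single hidden unit takes values $h \in \{-1, +1\}$; this support, the function name, and the chosen chain lengths are illustrative assumptions, since the paper does not state them.

```python
import numpy as np

def gibbs_rbm(theta, n_samples, burn_in=1000, rng=None):
    """Gibbs sampler for q(y, h) propto exp(<h*1 + theta, y> - 0.5*||y||^2),
    assuming a single hidden unit h in {-1, +1}. The conditionals are then
        y | h ~ N(theta + h*1, I),    P(h = +1 | y) = sigmoid(2 * sum(y))."""
    rng = np.random.default_rng(rng)
    theta = np.asarray(theta, dtype=float)
    h, samples = 1, []
    for t in range(burn_in + n_samples):
        y = rng.normal(theta + h, 1.0)                 # draw y | h
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * y.sum()))  # P(h = +1 | y)
        h = 1 if rng.uniform() < p_plus else -1        # draw h | y
        if t >= burn_in:
            samples.append(y)
    return np.array(samples)

# Draws from the theta = (0, 0) RBM, a symmetric two-mode mixture.
y1 = gibbs_rbm(theta=(0.0, 0.0), n_samples=500)
```

For $\theta = (0, 0)$ the stationary marginal of $y^1$ is a symmetric mixture of $N(\mathbf{1}, I)$ and $N(-\mathbf{1}, I)$, which makes the chain's mode-switching behaviour easy to inspect visually.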
Figure 6: Values of the statistic $g_n(\theta)$ for the RBM with (A) $\theta_1 = 0.1$, $\theta_2 = 5$, (B) $\theta_1 = 1$, $\theta_2 = 0.1$, (C) $\theta_1 = 1$, $\theta_2 = 0.1$.

B.4 EXPERIMENTAL DATA

We start by training a dense neural network with layers of size [784, 100, 20, 10] on the MNIST train dataset. For this, we minimize the cross-entropy loss on 80% of the train data, and we store the parameters that minimize the cross-entropy loss on the remaining validation data (the remaining 20% of the MNIST train dataset). We then evaluate this trained neural network on the MNIST test dataset. For each digit, we take the 20-dimensional layer to be $Y^1$. The covariates $X$ are defined as $Y^1 + G$, where $G \sim N(0, I_{20})$. Treatment $A$ follows a Bernoulli distribution with probability 0.5. We consider the intractable model $q_\theta(Y^1) \propto \exp(\langle T(Y^1), \theta \rangle)$, where the lower-dimensional representation $T(Y^1)$ is taken as the evaluation of the 10-dimensional layer of the neural network. Note that $T$ is therefore differentiable, so we can apply the DR-MKSD estimator. Out of the whole MNIST test dataset, we take the first 500 observations to train the DR-MKSD estimator (i.e., find $\theta_n$) with $(X_i, A_i, Y_i)_{i=1}^{500}$. We define $\pi = 0.5$ and estimate $\beta$ with conditional mean embeddings (CME) with the radial basis function kernel and regularization parameter 1e-3. In order to obtain Figure 4, we take the minimizers of the estimated unnormalized density of $Y^1$ (the evaluation of the 20-dimensional layer) over the remaining observations of the MNIST test dataset. The top-row images are taken randomly from the same subset of the MNIST test dataset.

C PROOFS

We now present the proofs of the theorems stated in the main body of the paper. For this, we start by introducing the notation that will be used throughout. The proofs of Theorem 1 and Theorem 2 subsequently follow.

C.1 NOTATION

Given a Hilbert space $K$ and a $K$-valued function $\phi(z) \in K$, we denote its norm in the Hilbert space by $\|\phi(z)\|_K$.
Furthermore, $\|\hat\phi\|^2 = \int \|\hat\phi(z)\|_K^2 \, dP(z)$ denotes the squared $L_2(P)$ norm of the $K$-valued function. We highlight that the expectation is only taken with respect to the randomness of $Z$, while $\hat\phi$ is considered to be fixed. Note that in the case $K = \mathbb{R}$, $\|\cdot\|_2$ denotes the usual $L_2(P)$ norm of a real-valued function. We let $P_n\hat\phi = P_n\{\hat\phi(Z)\} = \frac{1}{n}\sum_i \hat\phi(Z_i)$ denote the sample average and $P(\hat\phi) = P\{\hat\phi(Z)\} = \int \hat\phi(z)\, dP(z)$ the expected value of $\hat\phi$ (Bochner integral) with respect to $Z$, treating $\hat\phi$ as fixed. If a function $k$ takes two arguments (this is the case when $k$ is a kernel), then we denote $P_n^2 k = \frac{1}{n^2}\sum_{i,j} k(Y_i, Y_j)$. Further, $\mathrm{Tr}$ denotes the trace of a matrix. Lastly, we make use of standard big-oh and little-oh notation in probability, where $X_n = O_P(r_n)$ implies that the ratio $X_n/r_n$ is bounded in probability, and $X_n = o_P(r_n)$ indicates that $X_n/r_n$ converges in probability to 0. Throughout, we make use of calculus with this stochastic order notation, such as $o_P(1) + O_P(1) = O_P(1)$ and $o_P(1)O_P(1) = o_P(1)$.

C.2 PROOF OF THEOREM 1

We prove the theorem in three steps:
1. First, we show that $\sup_{\theta \in \Theta} \|P_n\hat\phi_\theta - P_n\phi_\theta\|_{\mathcal{H}^d} \stackrel{p}{\to} 0$.
2. We then prove that $\sup_{\theta \in \Theta} |g_n(\theta) - \mathrm{KSD}(Q_\theta \,\|\, Q^1)| \stackrel{p}{\to} 0$.
3. We finally arrive at $\mathrm{KSD}(Q_{\theta_n} \,\|\, Q^1) \stackrel{p}{\to} \min_{\theta \in \Theta} \mathrm{KSD}(Q_\theta \,\|\, Q^1)$.
We now demonstrate the validity of each of the steps.

Details of step 1. For simplicity, let us assume that $n$ is even, and hence $n//2 = n/2$. We note that
\[
P_n\hat\phi_\theta - P_n\phi_\theta = \frac{1}{n}\sum_{i=1}^{n}\big\{\hat\phi_\theta(Z_i) - \phi_\theta(Z_i)\big\} = \frac{1}{2}\underbrace{P_{n/2}\big\{\hat\phi^{(1)}_\theta - \phi_\theta\big\}}_{(I)} + \frac{1}{2}\underbrace{P_{n/2}\big\{\hat\phi^{(2)}_\theta - \phi_\theta\big\}}_{(II)},
\]
where each half-sample average $P_{n/2}$ is taken over the fold not used to train the corresponding nuisance estimate (cross-fitting). Let us first work with term (I), which may be rewritten as $P_{n/2}\hat\phi^{(1)}_\theta - P_{n/2}\phi_\theta$. Additionally, $P_{n/2}\hat\phi^{(1)}_\theta - P_{n/2}\phi_\theta = T_1(\theta) + T_2(\theta)$, where $T_1(\theta) = (P_{n/2} - Q)(\hat\phi^{(1)}_\theta - \phi_\theta)$ is the empirical process term and $T_2(\theta) = Q(\hat\phi^{(1)}_\theta - \phi_\theta)$ is the bias term. Given that $\hat\phi^{(1)}_\theta$ was trained independently of the samples it is evaluated on, following Martinez-Taboada et al. (2023, Lemma C.7) and the arguments in Martinez-Taboada et al. (2023, Theorem C.9), we deduce that
\[
\|T_1(\theta)\|_{\mathcal{H}^d} = O_P\Big(\frac{\|\hat\phi^{(1)}_\theta - \phi_\theta\|}{\sqrt{n}}\Big), \qquad \|T_2(\theta)\|_{\mathcal{H}^d} \lesssim \frac{1}{\epsilon}\,\|\pi - \hat\pi^{(1)}\|\,\|\beta_\theta - \hat\beta^{(1)}_\theta\|.
\]
Given Conditions (i)-(ii), we deduce that
\[
\sup_{\theta \in \Theta} \|T_1(\theta)\|_{\mathcal{H}^d} + \sup_{\theta \in \Theta} \|T_2(\theta)\|_{\mathcal{H}^d} \lesssim \frac{\sup_{\theta \in \Theta}\|\hat\phi^{(1)}_\theta - \phi_\theta\|}{\sqrt{n}} + \frac{1}{\epsilon}\,\|\pi - \hat\pi^{(1)}\| \sup_{\theta \in \Theta}\|\beta_\theta - \hat\beta^{(1)}_\theta\| = o_P(1) + o_P(1) = o_P(1).
\]
Hence, we infer that $\sup_{\theta \in \Theta} \|P_{n/2}\hat\phi^{(1)}_\theta - P_{n/2}\phi_\theta\|_{\mathcal{H}^d} = o_P(1)$. Analogously, we deduce that $\sup_{\theta \in \Theta} \|(II)\|_{\mathcal{H}^d} = o_P(1)$. We thus conclude
\[
\sup_{\theta \in \Theta} \|P_n\hat\phi_\theta - P_n\phi_\theta\|_{\mathcal{H}^d} \le \sup_{\theta \in \Theta} \tfrac{1}{2}\|(I)\|_{\mathcal{H}^d} + \sup_{\theta \in \Theta} \tfrac{1}{2}\|(II)\|_{\mathcal{H}^d} = o_P(1) + o_P(1) = o_P(1).
\]

Details of step 2. Denoting $\chi_n(\theta) = P_n\hat\phi_\theta - P_n\phi_\theta$, we have that
\[
g_n(\theta; Y_1, \ldots, Y_n) = \Big\langle \frac{1}{n}\sum_{i=1}^n \hat\phi_\theta(Z_i), \frac{1}{n}\sum_{j=1}^n \hat\phi_\theta(Z_j) \Big\rangle_{\mathcal{H}^d} = \big\langle P_n\hat\phi_\theta, P_n\hat\phi_\theta \big\rangle_{\mathcal{H}^d} = \big\langle P_n\phi_\theta + \chi_n(\theta), P_n\phi_\theta + \chi_n(\theta) \big\rangle_{\mathcal{H}^d} = \big\langle P_n\phi_\theta, P_n\phi_\theta \big\rangle_{\mathcal{H}^d} + 2\big\langle P_n\phi_\theta, \chi_n(\theta) \big\rangle_{\mathcal{H}^d} + \big\langle \chi_n(\theta), \chi_n(\theta) \big\rangle_{\mathcal{H}^d}.
\]
We now work with these three terms separately. First, note that Assumption A5 implies that, for all $\theta$ in $\Theta$,
\[
\int h_\theta(z, z)\, dP(z) \le \int \sup_{\theta \in \Theta} h_\theta(z, z)\, dP(z) < \infty.
\]
Hence, by Jensen's inequality,
\[
\int \sqrt{h_\theta(z, z)}\, dP(z) \le \sqrt{\int h_\theta(z, z)\, dP(z)} < \infty \qquad \forall \theta \in \Theta.
\]
Given that we have also assumed Conditions A6-A7, we can apply Oates et al. (2022, Lemma 11) to establish that
\[
\sup_{\theta \in \Theta} \big|\big\langle P_n\phi_\theta, P_n\phi_\theta \big\rangle_{\mathcal{H}^d} - \mathrm{KSD}(Q_\theta \,\|\, Q^1)\big| = o_P(1).
\]
Second, we have that $\|P_n\phi_\theta\|^2_{\mathcal{H}^d} = P_n^2\{h_\theta\} \le P_n^2\{\sup_{\theta \in \Theta} h_\theta\}$. Based on the law of large numbers for V-statistics and Conditions A4-A5, we deduce that $P_n^2\{\sup_{\theta \in \Theta} h_\theta\} \stackrel{p}{\to} \iint \sup_{\theta \in \Theta} h_\theta(z, \tilde z)\, dP(z)\, dP(\tilde z) < \infty$, so
\[
\sup_{\theta \in \Theta} \|P_n\phi_\theta\|_{\mathcal{H}^d} = O_P(1). \tag{8}
\]
Further, we remind that we have proven $\sup_{\theta \in \Theta} \|\chi_n(\theta)\|_{\mathcal{H}^d} \stackrel{p}{\to} 0$ in Step 1. Consequently,
\[
\sup_{\theta \in \Theta} \big|\big\langle P_n\phi_\theta, \chi_n(\theta) \big\rangle_{\mathcal{H}^d}\big| \stackrel{(i)}{\le} \sup_{\theta \in \Theta} \big(\|P_n\phi_\theta\|_{\mathcal{H}^d} \|\chi_n(\theta)\|_{\mathcal{H}^d}\big) \le \sup_{\theta \in \Theta} \|P_n\phi_\theta\|_{\mathcal{H}^d} \sup_{\theta \in \Theta} \|\chi_n(\theta)\|_{\mathcal{H}^d} = O_P(1)\, o_P(1) = o_P(1),
\]
where (i) is obtained by the Cauchy-Schwarz inequality. Lastly, we highlight that
\[
\sup_{\theta \in \Theta} \|\chi_n(\theta)\|_{\mathcal{H}^d} = o_P(1) \implies \sup_{\theta \in \Theta} \|\chi_n(\theta)\|^2_{\mathcal{H}^d} = o_P(1).
\]
It suffices to note that $\sup_{\theta \in \Theta} |g_n(\theta) - \mathrm{KSD}(Q_\theta \,\|\, Q^1)|$ may be rewritten as
\[
\sup_{\theta \in \Theta} \big|\big\langle P_n\phi_\theta, P_n\phi_\theta \big\rangle_{\mathcal{H}^d} + 2\big\langle P_n\phi_\theta, \chi_n(\theta) \big\rangle_{\mathcal{H}^d} + \big\langle \chi_n(\theta), \chi_n(\theta) \big\rangle_{\mathcal{H}^d} - \mathrm{KSD}(Q_\theta \,\|\, Q^1)\big|,
\]
which is upper bounded by
\[
\sup_{\theta \in \Theta} \big|\big\langle P_n\phi_\theta, P_n\phi_\theta \big\rangle_{\mathcal{H}^d} - \mathrm{KSD}(Q_\theta \,\|\, Q^1)\big| + 2 \sup_{\theta \in \Theta} \big|\big\langle P_n\phi_\theta, \chi_n(\theta) \big\rangle_{\mathcal{H}^d}\big| + \sup_{\theta \in \Theta} \big|\big\langle \chi_n(\theta), \chi_n(\theta) \big\rangle_{\mathcal{H}^d}\big|,
\]
which is $o_P(1)$, concluding
\[
\sup_{\theta \in \Theta} |g_n(\theta) - \mathrm{KSD}(Q_\theta \,\|\, Q^1)| = o_P(1).
\]

Details of step 3. In order to conclude, we extend the argument presented in Oates et al. (2022, Lemma 7) to convergence in probability. Take $\theta^* \in \arg\min_{\theta \in \Theta} \mathrm{KSD}(Q_\theta \,\|\, Q^1)$. Given that $\sup_{\theta \in \Theta} |g_n(\theta) - \mathrm{KSD}(Q_\theta \,\|\, Q^1)| = o_P(1)$, for any $\epsilon > 0$ and $\delta > 0$, there exists $n^* \in \mathbb{N}$ such that $\sup_{\theta \in \Theta} |g_n(\theta) - \mathrm{KSD}(Q_\theta \,\|\, Q^1)| < \epsilon/2$ with probability at least $1 - \delta$ for all $n \ge n^*$. Consequently,
\[
\mathrm{KSD}(Q_{\theta_n} \,\|\, Q^1) \le g_n(\theta_n) + \frac{\epsilon}{2} \le g_n(\theta^*) + \frac{\epsilon}{2} \le \mathrm{KSD}(Q_{\theta^*} \,\|\, Q^1) + \epsilon
\]
for $n \ge n^*$ with probability at least $1 - \delta$. This is, $|\mathrm{KSD}(Q_{\theta_n} \,\|\, Q^1) - \mathrm{KSD}(Q_{\theta^*} \,\|\, Q^1)| \le \epsilon$ (and likewise $|g_n(\theta_n) - \mathrm{KSD}(Q_{\theta^*} \,\|\, Q^1)| \le \epsilon$) for $n \ge n^*$ with probability at least $1 - \delta$, concluding that $\mathrm{KSD}(Q_{\theta_n} \,\|\, Q^1) \stackrel{p}{\to} \min_{\theta \in \Theta} \mathrm{KSD}(Q_\theta \,\|\, Q^1)$.

C.3 PROOF OF THEOREM 2

We prove the theorem in three steps:
1. First, we prove $\sup_{\theta \in \Theta} \|P_n\hat\phi_\theta - P_n\phi_\theta\|_{\mathcal{H}^d} = o_P(1/\sqrt{n})$, and $\sup_{\theta \in \Theta} \|\frac{\partial}{\partial\theta}\{P_n\hat\phi_\theta - P_n\phi_\theta\}\|_{(\mathcal{H}^d)^p} = o_P(1/\sqrt{n})$.
2. Second, we show that $\langle \frac{\partial}{\partial\theta}\{P_n\phi_{\theta_n}\}, P_n\phi_{\theta_n}\rangle_{\mathcal{H}^d} = \Delta_n/2$, with $\|\Delta_n\|_{\mathbb{R}^p} = o_P(1/\sqrt{n})$.
3. We then conclude that $\sqrt{n}(\theta_n - \theta^*) \stackrel{d}{\to} N(0, 4\Gamma^{-1}\Sigma\Gamma^{-1})$.
We now demonstrate the validity of each of the steps.

Details of step 1. With an argument analogous to that used in Proof C.2 (Step 1), we yield
\[
P_n\hat\phi_\theta - P_n\phi_\theta = \frac{1}{2}\underbrace{P_{n/2}\big\{\hat\phi^{(1)}_\theta - \phi_\theta\big\}}_{(I)} + \frac{1}{2}\underbrace{P_{n/2}\big\{\hat\phi^{(2)}_\theta - \phi_\theta\big\}}_{(II)},
\]
with
\[
\sup_{\theta \in \Theta} \|(I)\|_{\mathcal{H}^d} = O_P\Big(\frac{\sup_{\theta \in \Theta}\|\hat\phi^{(1)}_\theta - \phi_\theta\|}{\sqrt{n}} + \frac{1}{\epsilon}\,\|\pi - \hat\pi^{(1)}\| \sup_{\theta \in \Theta}\|\beta_\theta - \hat\beta^{(1)}_\theta\|\Big),
\]
\[
\sup_{\theta \in \Theta} \|(II)\|_{\mathcal{H}^d} = O_P\Big(\frac{\sup_{\theta \in \Theta}\|\hat\phi^{(2)}_\theta - \phi_\theta\|}{\sqrt{n}} + \frac{1}{\epsilon}\,\|\pi - \hat\pi^{(2)}\| \sup_{\theta \in \Theta}\|\beta_\theta - \hat\beta^{(2)}_\theta\|\Big).
\]
Assumptions (ii') and (iii') imply
\[
\sup_{\theta \in \Theta} \|(I)\|_{\mathcal{H}^d} = o_P(1/\sqrt{n}), \qquad \sup_{\theta \in \Theta} \|(II)\|_{\mathcal{H}^d} = o_P(1/\sqrt{n}), \qquad \text{hence} \qquad \sup_{\theta \in \Theta} \|P_n\hat\phi_\theta - P_n\phi_\theta\|_{\mathcal{H}^d} = o_P(1/\sqrt{n}).
\]
Similarly, based on Assumptions (ii') and (iii'), we obtain $\sup_{\theta \in \Theta} \|\frac{\partial}{\partial\theta}\{P_n\hat\phi_\theta - P_n\phi_\theta\}\|_{(\mathcal{H}^d)^p} = o_P(1/\sqrt{n})$.

Details of step 2. Given that $\theta_n$ is the minimizer of a differentiable function $g_n$, we have that $g_n'(\theta_n) = 0$.
Denoting $\chi_n(\theta) = P_n\hat\phi_\theta - P_n\phi_\theta$, we have that
\[
0 = \frac{\partial}{\partial\theta}\big\langle P_n\phi_{\theta_n} + \chi_n(\theta_n),\; P_n\phi_{\theta_n} + \chi_n(\theta_n)\big\rangle_{\mathcal{H}^d} = 2\Big[\Big\langle \frac{\partial}{\partial\theta}\{P_n\phi_{\theta_n}\}, P_n\phi_{\theta_n}\Big\rangle + \Big\langle \frac{\partial}{\partial\theta}\{\chi_n(\theta_n)\}, \chi_n(\theta_n)\Big\rangle + \Big\langle \frac{\partial}{\partial\theta}\{P_n\phi_{\theta_n}\}, \chi_n(\theta_n)\Big\rangle + \Big\langle \frac{\partial}{\partial\theta}\{\chi_n(\theta_n)\}, P_n\phi_{\theta_n}\Big\rangle\Big]. \tag{9}
\]
Let us upper bound the last three terms of the latter sum. First, we note that
\[
\Big\|\Big\langle \frac{\partial}{\partial\theta}\{\chi_n(\theta_n)\}, \chi_n(\theta_n)\Big\rangle\Big\|_{\mathbb{R}^p} \le \sup_{\theta \in \Theta}\Big\|\Big\langle \frac{\partial}{\partial\theta}\{\chi_n(\theta)\}, \chi_n(\theta)\Big\rangle\Big\|_{\mathbb{R}^p} \stackrel{(i)}{\le} \sup_{\theta \in \Theta}\Big\|\frac{\partial}{\partial\theta}\{\chi_n(\theta)\}\Big\|_{(\mathcal{H}^d)^p} \sup_{\theta \in \Theta}\|\chi_n(\theta)\|_{\mathcal{H}^d} = o_P(1/\sqrt{n})\, o_P(1/\sqrt{n}) = o_P(1/\sqrt{n}), \tag{10}
\]
where (i) is obtained by the Cauchy-Schwarz inequality. Second,
\[
\Big\|\Big\langle \frac{\partial}{\partial\theta}\{\chi_n(\theta_n)\}, P_n\phi_{\theta_n}\Big\rangle\Big\|_{\mathbb{R}^p} \stackrel{(i)}{\le} \sup_{\theta \in \Theta}\Big\|\frac{\partial}{\partial\theta}\{\chi_n(\theta)\}\Big\|_{(\mathcal{H}^d)^p} \sup_{\theta \in \Theta}\|P_n\phi_\theta\|_{\mathcal{H}^d} \stackrel{(ii)}{=} o_P(1/\sqrt{n})\, O_P(1) = o_P(1/\sqrt{n}), \tag{11}
\]
where (i) is obtained by the Cauchy-Schwarz inequality, and (ii) is derived as in Equation (8), given Conditions A4-A5. Third, we have that
\[
\Big\|\frac{\partial}{\partial\theta}\{P_n\phi_{\theta_n}\}\Big\|^2_{(\mathcal{H}^d)^p} \le \sup_{\theta \in \Theta}\Big\|\frac{\partial}{\partial\theta}\{P_n\phi_\theta\}\Big\|^2_{(\mathcal{H}^d)^p} = \sup_{\theta \in \Theta} P_n^2\Big\{\mathrm{Tr}\Big(\frac{\partial^2}{\partial\theta^2} h_\theta\Big)\Big\} \le P_n^2\Big\{\sup_{\theta \in \Theta}\mathrm{Tr}\Big(\frac{\partial^2}{\partial\theta^2} h_\theta\Big)\Big\}.
\]
Note that $\mathrm{Tr}(\frac{\partial^2}{\partial\theta^2} h_\theta)$ is dominated by the $L_1$ norm (sum of the absolute values of the entries) of $\frac{\partial^2}{\partial\theta^2} h_\theta$. Further, the $L_1$ norm is equivalent to the $L_2$ norm on $\mathbb{R}^{p \times p}$ (all norms are equivalent in finite-dimensional Banach spaces). Hence, based on the law of large numbers for V-statistics and Conditions A12-A13, we deduce that
\[
P_n^2\Big\{\sup_{\theta \in \Theta}\Big\|\frac{\partial^2}{\partial\theta^2} h_\theta\Big\|_{\mathbb{R}^{p \times p}}\Big\} \stackrel{p}{\to} \iint \sup_{\theta \in \Theta}\Big\|\frac{\partial^2}{\partial\theta^2} h_\theta(z, \tilde z)\Big\|_{\mathbb{R}^{p \times p}}\, dP(z)\, dP(\tilde z) < \infty.
\]
Consequently, $\|\frac{\partial}{\partial\theta}\{P_n\phi_{\theta_n}\}\|^2_{(\mathcal{H}^d)^p} = O_P(1)$, and
\[
\Big\|\Big\langle \frac{\partial}{\partial\theta}\{P_n\phi_{\theta_n}\}, \chi_n(\theta_n)\Big\rangle\Big\|_{\mathbb{R}^p} \stackrel{(i)}{\le} \sup_{\theta \in \Theta}\Big\|\frac{\partial}{\partial\theta}\{P_n\phi_\theta\}\Big\|_{(\mathcal{H}^d)^p} \sup_{\theta \in \Theta}\|\chi_n(\theta)\|_{\mathcal{H}^d} = O_P(1)\, o_P(1/\sqrt{n}) = o_P(1/\sqrt{n}), \tag{12}
\]
where (i) is obtained by the Cauchy-Schwarz inequality. Combining Equation (9) with the upper bounds from Equations (10), (11), and (12), we derive that
\[
\Big\langle \frac{\partial}{\partial\theta}\{P_n\phi_{\theta_n}\}, P_n\phi_{\theta_n}\Big\rangle_{\mathcal{H}^d} = \frac{\Delta_n}{2}, \qquad \text{with } \|\Delta_n\|_{\mathbb{R}^p} = o_P(1/\sqrt{n}) + o_P(1/\sqrt{n}) + o_P(1/\sqrt{n}) = o_P(1/\sqrt{n}).
\]

Details of step 3. Further, denoting $\breve g_n(\theta) := \langle P_n\phi_\theta, P_n\phi_\theta\rangle_{\mathcal{H}^d} \in \mathbb{R}$, we arrive at
\[
\frac{\partial}{\partial\theta}\breve g_n(\theta_n) = 2\Big\langle \frac{\partial}{\partial\theta}\{P_n\phi_{\theta_n}\}, P_n\phi_{\theta_n}\Big\rangle_{\mathcal{H}^d} = \Delta_n.
\]
Based on the mean value theorem for convex open sets, there exists $\tilde\theta_n = t_n\theta_n + (1 - t_n)\theta^*$ with $t_n \in [0, 1]$ for which
\[
\frac{\partial}{\partial\theta}\breve g_n(\theta^*) + \frac{\partial^2}{\partial\theta^2}\breve g_n(\tilde\theta_n)\,(\theta_n - \theta^*) = \frac{\partial}{\partial\theta}\breve g_n(\theta_n) = \Delta_n.
\]
If $\frac{\partial^2}{\partial\theta^2}\breve g_n(\tilde\theta_n)$ is non-singular, then rewriting this expression we obtain
\[
\sqrt{n}(\theta_n - \theta^*) = \Big[\frac{\partial^2}{\partial\theta^2}\breve g_n(\tilde\theta_n)\Big]^{-1}\sqrt{n}\,\Delta_n - \Big[\frac{\partial^2}{\partial\theta^2}\breve g_n(\tilde\theta_n)\Big]^{-1}\sqrt{n}\,\frac{\partial}{\partial\theta}\breve g_n(\theta^*).
\]
Under slightly weaker assumptions than stated in Theorem 2, Oates et al. (2022, Theorem 12) showed the non-singularity of $\frac{\partial^2}{\partial\theta^2}\breve g_n(\tilde\theta_n)$ by convergence to the matrix $\Gamma$, as well as
\[
-\Big[\frac{\partial^2}{\partial\theta^2}\breve g_n(\tilde\theta_n)\Big]^{-1}\sqrt{n}\,\frac{\partial}{\partial\theta}\breve g_n(\theta^*) \stackrel{d}{\to} N(0, 4\Gamma^{-1}\Sigma\Gamma^{-1}).
\]
Given that $\|\Delta_n\|_{\mathbb{R}^p} = o_P(1/\sqrt{n})$ and $\frac{\partial^2}{\partial\theta^2}\breve g_n(\tilde\theta_n) \stackrel{p}{\to} \Gamma$ non-singular, we deduce that
\[
\Big\|\Big[\frac{\partial^2}{\partial\theta^2}\breve g_n(\tilde\theta_n)\Big]^{-1}\sqrt{n}\,\Delta_n\Big\|_{\mathbb{R}^p} = o_P(1),
\]
hence concluding $\sqrt{n}(\theta_n - \theta^*) \stackrel{d}{\to} N(0, 4\Gamma^{-1}\Sigma\Gamma^{-1})$.