# Neural Estimation of Statistical Divergences

Journal of Machine Learning Research 23 (2022) 1-75. Submitted 10/21; Revised 2/22; Published 3/22.

Sreejith Sreekumar (sreejithsreekumar@cornell.edu), Electrical and Computer Engineering Department, Cornell University, Ithaca, NY 14850, USA

Ziv Goldfeld (goldfeld@cornell.edu), Electrical and Computer Engineering Department, Cornell University, Ithaca, NY 14850, USA

Editor: Jean-Philippe Vert

© 2022 Sreejith Sreekumar and Ziv Goldfeld. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-1212.html.

Abstract

Statistical divergences (SDs), which quantify the dissimilarity between probability distributions, are a basic constituent of statistical inference and machine learning. A modern method for estimating these divergences relies on parametrizing an empirical variational form by a neural network (NN) and optimizing over parameter space. Such neural estimators are abundantly used in practice, but the corresponding performance guarantees are partial and call for further exploration. We establish non-asymptotic absolute error bounds for a neural estimator realized by a shallow NN, focusing on four popular f-divergences: Kullback-Leibler, chi-squared, squared Hellinger, and total variation. Our analysis relies on non-asymptotic function approximation theorems and tools from empirical process theory to bound the two sources of error involved: function approximation and empirical estimation. The bounds characterize the effective error in terms of the NN size and the number of samples, and reveal scaling rates that ensure consistency. For compactly supported distributions, we further show that neural estimators of the first three divergences above, with an appropriate NN growth rate, are minimax rate-optimal, achieving the parametric convergence rate.

Keywords: approximation theory, minimax estimation, empirical process theory, f-divergence, neural estimation, neural network, statistical divergence, variational form.

1. Introduction

Statistical divergences (SDs) measure the discrepancy between probability distributions. A variety of inference tasks, from generative modeling (Kingma and Welling, 2014; Nowozin et al., 2016; Arjovsky et al., 2017; Tolstikhin et al., 2018; Goldfeld et al., 2020a; Nietert et al., 2021) to homogeneity/goodness-of-fit/independence testing (Kac et al., 1955; Zhang et al., 2018b; Hallin et al., 2021), can be posed as measuring or optimizing a SD between the data distribution and the model. Popular SDs include f-divergences (Ali and Silvey, 1966; Csiszár, 1967), integral probability metrics (IPMs) (Zolotarev, 1983; Müller, 1997), and Wasserstein distances (Villani, 2008; Santambrogio, 2015). A common formulation that captures many of these is¹

D_{h,F}(µ, ν) = sup_{f ∈ F} E_µ[f] − E_ν[h ∘ f],   (1.1)

where F is a function class of discriminators and h is sometimes called a measurement function (cf., e.g., Arora et al., 2017). This variational form is at the core of various learning algorithms implemented based on SDs (Nowozin et al., 2016; Arjovsky et al., 2017), and has recently been leveraged for estimating SDs from samples, a technique termed neural estimation. While neural estimators (NEs) are popular in practice due to their computational scalability, a theoretical account of the corresponding performance guarantees is missing.
To address this deficit, this work provides a thorough study of consistency and non-asymptotic absolute error bounds for NEs realized by shallow neural networks (NNs).

1.1 Neural Estimation of Statistical Divergences

Typical applications in machine learning, e.g., generative adversarial networks (GANs) (Goodfellow et al., 2014; Arjovsky et al., 2017) or anomaly detection (Póczos et al., 2011; Zenati et al., 2018; Schlegl et al., 2019), favor estimators whose computation scales well with the number of samples and is compatible with backpropagation and minibatch-based optimization. Neural estimation is a modern technique that adheres to these requirements (Arora et al., 2017; Zhang et al., 2018a; Belghazi et al., 2018; Mroueh et al., 2021). Neural estimators (NEs) parameterize the discriminator class F in (1.1) by a NN, approximate expectations by sample means, and then optimize the resulting empirical objective over the parameter space. Denoting the samples from µ and ν by X^n := (X_1, ..., X_n) and Y^n := (Y_1, ..., Y_n), respectively, the said NE is

D̂_{h,G}(X^n, Y^n) := sup_{g ∈ G} (1/n) Σ_{i=1}^n g(X_i) − (1/n) Σ_{i=1}^n h ∘ g(Y_i),   (1.2)

where G is the class of functions realized by a NN. The performance of a NE is dictated by the quality of the NN approximation of the original function class F from (1.1), and by the sample size needed to accurately estimate the parametrized form D_{h,G}(µ, ν). The former is measured by the approximation error, |D_{h,F}(µ, ν) − D_{h,G}(µ, ν)|, whereas the latter is measured by the empirical estimation error, |D̂_{h,G}(X^n, Y^n) − D_{h,G}(µ, ν)|. While approximation requires G to be rich and expressive, efficient estimation relies on controlling its complexity.

Past works on NEs provide only a partial account of estimation performance. Belghazi et al. (2018) proved consistency of mutual information neural estimation, which boils down to estimating the KL divergence, but did not quantify approximation errors. Non-asymptotic sample complexity bounds for the parameterized form, i.e., when F in (1.1) is the NN class G to begin with, were derived in (Arora et al., 2017; Zhang et al., 2018a). These objects are known as NN distances and, by definition, overlook the approximation error. Also related is (Nguyen et al., 2010), where KL divergence estimation rates are provided under the assumption that the approximating class is large enough to contain an optimizer of (1.1). This assumption is often violated in practice, e.g., when using a NN class as done herein, or a reproducing kernel Hilbert space, as considered in (Nguyen et al., 2010).

Quantification of the approximation error, alongside the empirical estimation error, is pivotal for a complete account of neural estimation performance. This work thus studies non-asymptotic effective (approximation plus empirical estimation) error bounds for NEs realized by a k-neuron shallow NN and n samples from each distribution. Results are specialized to four popular f-divergences: the Kullback-Leibler (KL) divergence, the chi-squared (χ²) divergence, the squared Hellinger (H²) distance, and the total variation (TV) distance.

1. Specifically, (1.1) accounts for f-divergences, IPMs, and the 1-Wasserstein distance.

1.2 Contributions

This work extends our earlier conference paper (Sreekumar et al., 2021), where the first non-asymptotic effective error bounds for NEs of f-divergences were derived. Consistency results for appropriate scaling rates of the NN and the sample sizes were also provided.
However, the analysis therein resulted in sub-optimal error rates, only considered compactly supported distributions, and was not applicable for TV distance estimation. These aspects are key for a complete account of the neural estimation performance, and serve to motivate the present work, which closes all the aforementioned gaps. We first consider compactly supported distributions and show that the effective error of a NE based on k neurons and n samples for the KL divergence, χ2 divergence, or the H2 distance scales as O k 1/2 + n 1/2 . (1.3) Our bound is sharp in the sense that by scaling k proportional to n, NEs achieve minimax optimality, converging at the parametric n 1/2 rate. The results assume a spectral norm bound on the optimal potential (i.e., maximizer of (1.1)) of the SD, which, in particular, is satisfied when the distributions have sufficiently smooth densities . Notably, this condition suffices to avoid the so-called curse of dimensionality (Co D) and attain parametric rates that do not degrade exponentially with dimension.2 The derivation of (1.3) relies on two key technical results that separately account for the approximation and estimation errors. The first is a sup-norm O(k 1/2) universal approximation bound for shallow NNs (Klusowski and Barron, 2018), and the second is a O(n 1/2) bound on the empirical estimation error of the parametrized form. Derivation of the latter result leverages tools from empirical process theory and bounds the entropy integral (Van Der Vaart and Wellner, 1996) associated with the NN class. To that end, we bound the covering number of the NN class by noting that it can be represented as (a subset of) the symmetric convex hull (Van Der Vaart and Wellner, 1996)) of a composition of a monotone function with a VC subgraph class. Equipped with these results, we treat neural estimation of the KL and χ2 divergences, and the H2 and TV distances. We establish consistency and obtain (1.3) as a finite-sample absolute-error bound by combining the approximation and empirical estimation bounds and identifying the appropriate scaling of the NN width k and parameter norms with the sample size n for each f-divergence. Our analysis results in the parametric absolute-error convergence rate for the NEs of KL divergence, χ2 divergence, and H2 distance. We also show 2. A similar behavior was observed in (Kandasamy et al., 2015) for classic f-divergence estimators between densities with high (H older) smoothness. Sreekumar and Goldfeld an Ω(n 1/2) lower bound on the minimax absolute error risk for KL divergence estimation problem by reducing it to differential entropy estimation and using the lower bound from Goldfeld et al. (2020b) for the latter problem. This establishes minimax optimality of KL divergence NE, with similar claims holding for NEs of χ2 divergence and H2 distance. Our method also accounts for the mutual information neural estimator (MINE) (Belghazi et al., 2018), and provides the first non-asymptotic effective error bound and minimax optimality claim for it. Different from these, the TV distance NE requires a modified approach because the spectral norm of the optimal potential is infinite. To circumvent the issue, we apply Gaussian smoothing to this potential, and control the approximation error as the smoothing parameters shrinks with the NN size k. This results in an approximation-estimation error bound that depends on dimension, i.e., the Co D applies in this case. We then extend our results to distributions with unbounded support. 
To that end, we exploit the fact that our approximation error bound depends on the support of the target function only via its spectral norm. Thus, bounds on the effective error in the unbounded case are obtained by quantifying the spectral norm of the optimal potential inside a ball and growing its radius appropriately with k. The resulting bound depends on the scaling of the radius and the tail decay of the underlying distributions (as quantified by the Orlicz norm of the densities). The results are specialized to the aforementioned divergences, focusing on Gaussian and sub-Gaussian distributions. We note that our analysis applies to distributions whose densities need not be bounded away from zero (see also, Berrett et al., 2019; Berrett and Samworth, 2019) an assumption that is often imposed for f-divergence estimation. 1.3 Related Work Many non-parametric estimators of SDs are available in the literature (Wang et al., 2005; Perez-Cruz, 2008; Sriperumbudur et al., 2012; Krishnamurthy et al., 2014; Singh and P oczos, 2014a,b; Kandasamy et al., 2015; Singh and P oczos, 2016; Noshad et al., 2017; Moon et al., 2018; Wisler et al., 2018; Berrett et al., 2019; Berrett and Samworth, 2019; Liang, 2019; Han et al., 2020). These estimators typically rely on classic methods (or their variants) such as plug-in, kernel density estimation (KDE) or k-nearest neighbors (k NN) techniques, and are known to achieve optimal estimation error rates for specific SDs, subject to smoothness and/or regularity conditions on the densities (see Remark 21). However, kernel-based methods usually require at least one of the following to achieve optimal rates: (i) boundary bias correction mechanism such as the usage of a mirror image kernel which assumes knowledge of the boundaries of the support of the distributions (cf., e.g., Singh and P oczos, 2014a,b); (ii) assumptions ensuring smooth behaviour of densities at the boundaries (cf., e.g., Moon et al., 2018; Han et al., 2020). As will be evident later, smoothness assumptions imposed by the aforementioned spectral norm condition are sufficient for (1.3) to hold for the NE. Thus, knowledge of the support of distributions or boundary bias correction is not needed. Focussing on NEs, the tradeoffs between approximation and estimation errors was previously studied for non-parametric regression using NNs (cf., e.g., Barron, 1994; Bach, 2017; Suzuki, 2019). The goal there is to fit the best NN proxy to an (unknown) target function based on data generated from it by minimizing a prescribed loss function. Assuming that the target function satisfies certain smoothness or spectral norm constraints, the approximation-estimation tradeoffin such problems has been analyzed for different loss Neural Estimation of Statistical Divergences functions. In particular, Barron (1994) derived upper bounds on the minimax mean squared error rate for shallow NN models under a spectral norm condition on the Fourier transform of the target function. Density estimation under general loss functions was considered in (Yang and Barron, 1999), where minimax rate bounds in terms of covering/packing entropy were established. In (Suzuki, 2019), the minimax rate for non-parametric regression using deep NNs (DNNs) when the target function is Besov was determined. More recently, (Uppal et al., 2019) established the minimax rate for density estimation under a so-called Besov IPM loss. 1.4 Organization The paper is organized as follows. Section 2 provides background and preliminary definitions. 
Technical results characterizing the approximation error and empirical estimation error are stated in Section 3. In Section 4, we apply these results to obtain upper bounds on the neural estimation error of the aforementioned f-divergences. Corresponding error bounds for distributions with unbounded support are the topic of Section 5. Section 6 provides concluding remarks and discusses future research directions. Proofs are deferred to appendices. 2. Background and Definitions 2.1 Notation Let denote the Euclidean norm on Rd and x y designate the inner product. The ℓm ball of radius r 0 in Rd centered at 0 is Bm d (r); in particular, the Euclidean ball is designated as Bd(r). We use R := R { , } for the extended reals. For 1 r < , the Lr space over X Rd with respect to (w.r.t.) the measure µ is denoted by Lr(X, µ), with r,µ representing the norm. When µ is the Lebesgue measure λ, we use the shorthand Lr(X) with norm r,X , or even Lr and r when X is clear from the context. For r = , we use ,µ and ,X for the essential supremum norm and the standard sup-norm, respectively. Slightly abusing notation, for X Rd, we set X := supx X x . The probability space on which all random variables are defined is denoted by (Ω, A, P) (assumed to be sufficiently rich), with E designating the corresponding expectation. The class of Borel probability measures on X Rd is denoted by P(X). To stress that the expectation of f is taken w.r.t. µ P(X), we write Eµ[f] := R fdµ. For µ, ν P(X) with µ ν, i.e., µ is absolutely continuous w.r.t. ν, we use dµ dν for the Radon-Nikodym derivative of µ w.r.t. ν. For n N, µ n denotes the n-fold product measure of µ. We assume that all functions are Borel measurable. For a multi-index α = (α1, , αd) Zd 0, the partial derivative operator of order α 1 := Pd j=1 αj is designated by Dα := α1 α1x1 αd αdxd . For an open set U Rd and an integer m 0, the class of functions such that all partial derivatives of order m exist and are continuous on U are denoted by Cm(U). In particular, C(U) := C0(U) and C (U) denotes the class of continuous functions and infinitely differentiable functions. For b 0 and an integer m 0, Cm b (U) := f Cm(U) : maxα: α 1 m Dαf ,U b denotes the subclass of Cm(U) with partial derivatives of order up to m uniformly bounded by b. The restriction of f : Rd R to a subset X Rd is denoted by f|X . The Fourier transform of f L1(X) is denoted by F[f]. For a function Sreekumar and Goldfeld class F and a function g, g F := {g f : f F} and |g F| := {|g f| : f F}, where denotes function composition (domains assumed to be compatible for composition). We denote universal constants by c (or c1, c2, etc.) while constants that depend on a parameter x are denoted by cx. The values of c and cx may change between different instances even within the same line of an equation. We use the shorthand a x b for a cxb for some cx > 0 (a b means a cb for a universal constant c > 0); similarly, a x b stands for a = cxb. We also employ standard asymptotic notations such as O, Ω, O, etc., where the tilde designates hidden logarithmic factors. For a, b R, a b := max{a, b} and a b := min{a, b}. We proceed with preliminary definitions and technical background. 2.2 Statistical Divergences Let X Rd. A common variational formulation of a SD between µ, ν P(X) is Dh,F(µ, ν) = sup f F Eµ[f] Eν[h f], (2.1) where h : R R, and F is a class of measurable functions f : Rd R for which the last expectation is finite. 
This formulation captures f-divergences, IPMs (for h(x) = x), as well as the 1-Wasserstein distance (which is an IPM w.r.t. the 1-Lipschitz function class). We next specialize the above variational form to the f-divergences for which we derive neural estimation error bounds.

KL divergence: The KL divergence between µ, ν ∈ P(X) is

D_KL(µ‖ν) := E_µ[log(dµ/dν)] if µ ≪ ν, and D_KL(µ‖ν) := ∞ otherwise.

A variational form for D_KL(µ‖ν) is obtained via Legendre-Fenchel duality, yielding

D_KL(µ‖ν) = sup_{f: X → R} E_µ[f] − E_ν[e^f − 1],   (2.2)

where the supremum is over all measurable functions such that the last expectation in (2.2) is finite. This fits the framework of (2.1) with h(x) = h_KL(x) := e^x − 1. When µ ≪ ν, the supremum in (2.2) is achieved by f*_KL := log(dµ/dν).

χ² divergence: The χ² divergence between µ, ν ∈ P(X) is

χ²(µ‖ν) := E_ν[(dµ/dν − 1)²] if µ ≪ ν, and χ²(µ‖ν) := ∞ otherwise.

It admits the dual form

χ²(µ‖ν) = sup_{f: X → R} E_µ[f] − E_ν[f + f²/4],   (2.3)

where the supremum is over all measurable functions such that the last expectation in (2.3) is finite. This dual form corresponds to (2.1) with h(x) = h_χ²(x) := x + x²/4, and the supremum is achieved by f*_χ² := 2(dµ/dν − 1), whenever µ ≪ ν.

Squared Hellinger distance: Let η ∈ P(X) be a probability measure that dominates both µ, ν ∈ P(X), i.e., µ, ν ≪ η (e.g., η = (µ + ν)/2), and denote the corresponding densities by p = dµ/dη and q = dν/dη. The squared Hellinger distance between µ and ν is³

H²(µ, ν) := E_η[(√p − √q)²].   (2.4)

When µ ≪ ν, the above expression can be written as H²(µ, ν) = E_ν[(√(dµ/dν) − 1)²], with the corresponding dual form

H²(µ, ν) = sup_{f: X → R, f(x) < 1 ∀x ∈ X} E_µ[f] − E_ν[f/(1 − f)],   (2.5)

where the supremum is over all functions such that the expectations are finite. This form corresponds to (2.1) with h(x) = h_H²(x) := x/(1 − x), and the supremum in (2.5) is achieved by f*_H² := 1 − (dµ/dν)^{−1/2}. Also note that H := (H²)^{1/2} defines a metric on P(X) and that 0 ≤ H²(µ, ν) ≤ 2 for any µ, ν ∈ P(X).

Total variation distance: The TV distance between µ, ν ∈ P(X) is

δ_TV(µ, ν) := 2 sup_C |µ(C) − ν(C)|,   (2.6)

where the supremum is over all Borel subsets C of X. The corresponding variational form is

δ_TV(µ, ν) = sup_{f: X → R, ‖f‖_∞ ≤ 1} E_µ[f] − E_ν[f],   (2.7)

which pertains to (2.1) with h(x) = h_TV(x) := x. Denoting the densities of µ and ν w.r.t. a common dominating measure η ∈ P(X) by p and q, respectively, the supremum in (2.7) is achieved by f*_TV := 1_C − 1_{X\C}, where

C := {x ∈ X : p(x) ≥ q(x)}.   (2.8)

Furthermore, δ_TV is a metric on P(X) with 0 ≤ δ_TV(µ, ν) ≤ 2.

3. The standard definition of the squared Hellinger distance has an extra factor of 0.5. We use the current definition as it simplifies the statements of some results and proofs, while clearly having no effect on the qualitative conclusions. The same applies for the TV distance given in (2.6).

2.3 Stochastic Processes

Our analysis of the estimation error needs the following definitions.

Definition 1 (Sub-Gaussian process) A real-valued stochastic process (X_θ)_{θ∈Θ} on a metric space (Θ, d) is sub-Gaussian if it is centered, i.e., E[X_θ] = 0 for all θ ∈ Θ, and

E[e^{t(X_θ − X_θ')}] ≤ e^{(1/2) t² d(θ, θ')²}, for all θ, θ' ∈ Θ and t ≥ 0.

Definition 2 (Separable process) A stochastic process (X_θ)_{θ∈Θ} on a metric space (Θ, d) is said to be separable if there exists a null set N and a countable subset Θ_0 ⊆ Θ such that, for every ω ∉ N and θ ∈ Θ, there is a sequence (θ_m)_{m∈N} in Θ_0 with d(θ_m, θ) → 0 and X_{θ_m}(ω) → X_θ(ω).

Definition 3 (Covering and packing numbers) Let (Θ, d) be a metric space.
(i) A set Θ Θ is an ϵ-covering of (Θ, d) if for every θ Θ, there exists θ Θ such that d(θ, θ) ϵ; the ϵ-covering number is N(ϵ, Θ, d) := inf {|Θ | : Θ is an ϵ-covering of Θ}. (ii) A set Θ Θ is an ϵ-packing of (Θ, d) if d(θ, θ) > ϵ for every θ, θ Θ such that θ = θ; the ϵ-packing number is T(ϵ, Θ, d) := sup{|Θ | : Θ is an ϵ-packing of Θ}. 2.4 Function Classes Our approximation result requires the target function on X to have an extension to Rd, whose spectral norm (as introduced in (Barron, 1993) and (Klusowski and Barron, 2018)) is finite. The class of functions with such bounded spectral norm is defined next. Definition 4 (Approximation class) Let m N. Consider a function f : Rd R that has a Fourier representation f(x) = R 0 eiω x F(dω), where i = 1 is the imaginary unit and F(dω) is a complex Borel measure over Rd with magnitude |F| (dω) that satisfies Rd ω m 1 |F| (dω) < . (2.9) For c 0, m = 1, 2, and X Rd, define Bc,m,X Rd := n f : Rd R : X Sm(f) |f(0)| f(0) 1 1{m=2} c o , and for f : X R, set c (f, m, X) := inf n c : f Bc,m,X Rd , f = f|X o . We refer to Bc,1,X Rd , Bc,2,X Rd , c B(f, X) := c (f, 1, X) and c KB(f, X) := c (f, 2, X) as the Barron class, Klusowski-Barron class, Barron coefficient, and Klusowski-Barron coefficient, respectively. For TV distance neural estimation, analysis of the NN approximation error for step functions is required. Such functions naturally belong to the Lipschitz function class defined below. Definition 5 (Lipschitz class) For r (0, ], m N, and f Lr Rd , the mth modulus of smoothness of f is ξm,r(f, t) := sup u Rd, u t m u f r,Rd , (2.10) where m u f(x) = Pm j=0( 1)m jf(x + ju). For X Rd and 0 < s 1, the Lipschitz class with smoothness parameter s is Lips,r,b(X) := f Lr Rd : f Lip(s,r) b, supp (f) = X , where f Lip(s,r) := f r + supt>0 t sξ1,r(f, t) is the Lipschitz seminorm. Neural Estimation of Statistical Divergences Note that norm in (2.10) is taken over Rd (despite the assumption that f nullifies outside of X). For d = 1, the class of functions of bounded variation over X R is contained in b RLip1,1,b(X). The Vapnik-Chervonenkis (VC) type class of functions will play a prominent role in our empirical estimation error analysis. Definition 6 (VC-type class) Let F be a class of Borel measurable functions with domain X and a finite measurable envelope F, i.e., supf F |f(x)| F(x) < , x X. Then, F is a VC-type class with envelope F if there exists finite constants lvc(F) = lvc(F, F) and uvc(F) = uvc(F, F) such that sup γ P(X) N ϵ F 2,γ , F, 2,γ lvc(F)ϵ 1 uvc(F), 0 < ϵ 1. (2.11) Finally, we introduce the function class of shallow NNs. Definition 7 (NN class) Let φ : R R be a (non-linear) measurable activation function. The class of shallow NNs (i.e., with a single hidden layer) with k neurons and bounds on its parameters specified by a = (a1, a2, a3, a4) R4 0 is Gk(a, φ) := g : Rd R : g(x) = i=1 βiφ (wi x + bi) + w0 x + b0, max 1 i k wi 1 |bi| a1, max 1 i k |βi| a2, |b0| a3, w0 1 a4 Let φS(z) = (1 + e z) 1 and φR(z) = z 0 denote the logistic sigmoid 4 and the rectified linear unit (Re LU) activation functions, respectively. Further, for a 0, define the shorthands GS k(a) := Gk k1/2 log k, 2k 1a, a, 0, φS , GR k (a) := Gk 1, 2k 1a, a, a, φR , and G k(φ) := Gk a , φ with a = (1, 1, 1, 0). Throughout, we will assume φ {φS, φR}. 2.5 Minimax Estimation Risk To investigate the decision-theoretic fundamental limit of estimating a SD Dh,F as defined in (2.1), we now define the minimax risk. 
Let P2 X P(X) P(X) be a class of pairs of distributions between which Dh,F is finite and fix (µ, ν) P2 X . Let Xn := (X1, . . . , Xn) and Y n := (Y1, . . . , Yn) be n independently and identically distributed (i.i.d.) samples from µ and ν, respectively.5 An estimator of Dh,F based on these samples is denoted by ˆDh,F(Xn, Y n). The minimax absolute-error risk is R h,F(n, P2 X ) := inf ˆDh,F sup (µ,ν) P2 X E h Dh,F(µ, ν) ˆDh,F(Xn, Y n) i . (2.12) 4. The results that follow with φS as activation straightforwardly applies to any continuous monotone bounded activation, e.g., any sigmoidal activation with φ(z) 1 as z and φ(z) 0 as z . 5. For simplicity, we restrict attention to the case where an equal number of samples is available from both µ and ν, but our analysis readily extends to the mismatched scenario with the corresponding bounds obtained by replacing n 1/2 by (m 1 + n 1)1/2, where m denotes the number of samples from µ (say). Sreekumar and Goldfeld We explore the performance of the NE ˆDh,Gk(ak,φ)(Xn, Y n) := sup g Gk(ak,φ) h g(Xi) h g(Yi) i , (2.13) under the above framework. By appropriately scaling the NN size k (and parameter norm) with the sample size n, we show that NEs of KL and χ2 divergences as well as H2 distance converges at the parametric n 1 2 rate uniformly over certain classes of distribution pairs satisfying regularity conditions. We further show (see, e.g., Corollary 18) that the minimax risk is at least Ω(n 1/2) over this class, thus establishing the minimax optimality of NEs. 3. Preliminary Technical Results We next present two technical results that account for the NN approximation error and the empirical estimation error of the parametrized SD. These results are later leveraged to derive effective error bounds for neural estimation of KL and χ2 divergences, squared Hellinger distance and TV distance. 3.1 Sup-norm Function Approximation We start with a bound on the approximation error of a target function f with a compact domain X for which c (f, m, X) < , m = 1, 2. A reminiscent result for the case m = 1 was given in (Barron, 1992), albeit without explicitly quantifying the dependence on dimension or addressing how the NN parameters scale with k. The bounds for m = 2 are taken from (Klusowski and Barron, 2018). Theorem 8 (Approximation error bound) Let X be compact. Given f : X R with c KB(f, X) a, there exists g GR k (a) such that f g ad 1 2 k 1 Similarly, given f : X R such that c B(f, X) a, there exists g GS k(a) satisfying (3.1). The above theorem states that a k-neuron shallow NN can approximate a function f on X within an O(k 1/2) gap in the sup-norm, provided f is the restriction of some f from the Barron class or Klusowski-Barron class. The bound in (3.1) follows from (Klusowski and Barron, 2018, Theorem 2), up to rescaling the domain therein. The proof of the second claim pertaining to approximation by NN class GS k(a) is provided in Appendix A.1.1, and is based on ideas from (Barron, 1992, 1993; Yukich et al., 1995). The error bounds stated in Theorem 8 are representative of the approximation capabilities of shallow NNs with Re LU (unbounded) and sigmoid (bounded) activations, respectively. Note that GR k (a) has bounded parameters independent of k, albeit with an extra affine term (see Definition 7) compared to functions in GS k(a). On the other hand, achieving O(k 1/2) approximation error using the latter class requires the bounds on the hidden layer weights and biases to scale as k1/2 log k. 
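To give a concrete feel for the O(k^{-1/2}) sup-norm guarantee of Theorem 8, the following minimal numerical sketch (not part of the original analysis; it assumes PyTorch) fits shallow ReLU networks of increasing width k to a smooth target on [0,1]² by least squares on a fine grid and reports the measured sup-norm error. The target function, grid resolution, and optimizer settings are arbitrary illustrative choices, and gradient-based fitting is used as a stand-in for the constructive approximant of the theorem.

```python
# Illustrative sketch (assumes PyTorch): sup-norm error of a k-neuron shallow ReLU
# network fitted to a smooth target on [0,1]^2, for increasing width k.
import torch

torch.manual_seed(0)
d = 2
axis = torch.linspace(0.0, 1.0, 51)
X = torch.cartesian_prod(axis, axis)          # grid on [0,1]^2, shape (51**2, 2)
f = torch.exp(-(X ** 2).sum(dim=1))           # smooth target f(x) = exp(-|x|^2)

class ShallowReLU(torch.nn.Module):
    """g(x) = sum_i beta_i ReLU(w_i . x + b_i) + w0 . x + b0 (cf. Definition 7)."""
    def __init__(self, d, k):
        super().__init__()
        self.hidden = torch.nn.Linear(d, k)
        self.out = torch.nn.Linear(k, 1)
        self.skip = torch.nn.Linear(d, 1)     # affine term w0 . x + b0

    def forward(self, x):
        return (self.out(torch.relu(self.hidden(x))) + self.skip(x)).squeeze(-1)

for k in (8, 32, 128):
    torch.manual_seed(1)
    g = ShallowReLU(d, k)
    opt = torch.optim.Adam(g.parameters(), lr=1e-2)
    for _ in range(3000):                     # least-squares fit on the grid
        opt.zero_grad()
        ((g(X) - f) ** 2).mean().backward()
        opt.step()
    sup_err = (g(X) - f).abs().max().item()
    print(f"k = {k:4d}   sup-norm error on the grid = {sup_err:.4f}")
```

In such an experiment the reported error typically decreases as the width grows, in line with the k^{-1/2} scaling, although the optimization error of the fit itself is not covered by Theorem 8.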
Remark 9 (Related approximation results) Several related approximation bounds to Theorem 8 are available in the literature, which can also be leveraged to analyze the approximation error of NEs. In particular, Yukich et al. (1995, Theorem 2.2) provides sup-norm Neural Estimation of Statistical Divergences error bounds for approximating a target function and its derivatives by a sigmoidal NN with unbounded input weights and biases. A further improvement over (Barron, 1992, Theorem 2) by a k 1/2d factor is reported in (Makovoz, 1998) for NNs with step activation functions, under a different regularity condition on the Fourier transform of target function. A sup-norm approximation result for squared Re LU activation is given in (Klusowski and Barron, 2018, Theorem 3) for functions f with bounded S3(f) (see (2.9)). Also related are NN approximation bounds derived in (Domingo-Enrich and Mroueh, 2021) for a function with bounded R, U-norm, where the latter is based on R-norm introduced in (Ongie et al., 2020). The next proposition shows that a sufficiently smooth function over a compact domain can be approximated to within O(k 1/2) error by a shallow NN. Proposition 10 (Approximation of smooth functions) Let X Rd be compact and f : X R. Suppose that there exists an open set U X, b 0, and f Cs KB b (U), s KB := d/2 + 3, such that f = f|X . Then, there exists g GR k cb,d, X , where cb,d, X is given in (A.15), such that f g cb,d, X d1/2k 1/2. The same holds with s KB and GR k replaced with s B := d/2 + 2 and GS k, respectively. The proof of Proposition 10 (see Appendix A.1.2) shows that any sufficiently smooth function on X can be extended to a function in the Barron or the Klusowksi-Barron class with domain Rd. This is done by nullifying the partial derivatives of order s KB (or s B) outside X and multiplying by a smooth bump function that equals 1 on X and smoothly decays outside. Note that for an integer s 0 and a real number s s, Cs b(U) contains the H older class with smoothness s and radius b. 3.2 Estimation of Parameterized Divergences For µ, ν P(X), consider the SD Dh,F(µ, ν) defined in (2.1). Let Xn and Y n be n i.i.d. samples from µ and ν, respectively. Consider a NE for Dh,F(µ, ν) realized by a shallow NN, i.e., ˆDh,Gk(ak,φ)(Xn, Y n) (see (2.13)). Our next result provides a tail inequality for the error in estimating the parametrized divergence Dh,G k(φ)(µ, ν) by ˆDh,G k(φ)(Xn, Y n), which will be used to prove consistency of the NE. To state it, given a class of functions F with domain X, define C(F, X) := infx X,f F f(x) and C(F, X) := supx X,f F f(x). Theorem 11 (Empirical estimation error tail bound) Let µ, ν P(X) and consider the NN class G k(φ) given in Definition 7. Assume X and φ are such that C |G k(φ)|, X < , h is differentiable in C G k(φ), X , C G k(φ), X with derivative h , Dh,G k(φ)(µ, ν) < , and C h G k(φ) , X < . (3.2) Then there exists a constant c > 0 such that for any δ 0, we have sup µ,ν P(X): Dh,G k(φ)(µ,ν)< P ˆDh,G k(φ)(Xn, Y n) Dh,G k(φ)(µ, ν) δ+Ek,h,φ,X n 1 2 ce nδ2 Vk,h,φ,X , (3.3) with upper bounds for Vk,h,φ,X and Ek,h,φ,X available in (A.20) and (A.21), respectively. Sreekumar and Goldfeld The proof of Theorem 11 (see Appendix A.1.3) relies on upper bounding the estimation error by a separable sub-Gaussian process and invoking the chaining tail inequality (see Theorem 56 in Appendix A.1.3). The next theorem provides an upper bound on the expected empirical estimation error. 
It will be used to obtain effective error bounds for the NE in the forthcoming sections. Theorem 12 (Empirical estimation error bound) Let a > 0 and Gk GR k (a), GS k(a) . Suppose h is differentiable in C Gk, X , C Gk, X with derivative h , C (|h Gk| , X) C(|h Gk|, X) C |Gk|, X a,h, X 1 for all k N. Then, for all k, n N, sup µ,ν P(X) E h ˆDh,Gk(Xn, Y n) Dh,Gk(µ, ν) i h,a, X d 3 2 n 1 Theorem 12 follows from a more general result that we establish in Appendix A.1.4 (namely, Theorem 58), where Gk as above is replaced by an arbitrary VC-type class satisfying certain technical conditions. The proof of the latter relies on standard maximal inequalities from empirical process theory. To prove (3.4), we also require a bound on the entropy integral of the NN class. This is obtained by noting that Gk is a subset of the symmetric convex hull of the composition of a monotone function with a VC subgraph class, and upper bounding the covering numbers of such convex hulls. Remark 13 (NN distances) The SD Dh,Gk(a,φ)(µ, ν) is the so-called NN distance, studied in (Arora et al., 2017; Zhang et al., 2018a) in the context of GANs. Theorem 11 and 12 can thus be understood, respectively, as a tail bound and as an error bound for NN distance estimation from data, and implies that the estimation error rate is parametric in n. For Gk GR k (a), GS k(a) , we have C (|Gk|) 3a( X + 1), C (|h Gk|) sup{|h(z)| , z [ 3a( X + 1), 3a( X + 1)]} < and Dh,Gk < for all k and h {h KL, hχ2}. Similarly, C (|h Gk|) is finite and bounded by a quantity independent of k for these h (see (B.3) and (B.5)). Hence, h KL and hχ2 satisfies the assumptions in Theorem 12, and consequently, (3.4) applies for KL and χ2 divergences. These bounds also hold for H2 and TV distances for appropriate NN classes (see Theorems 27 and 32 below). In the next section, we use the above results to analyze the effective error for neural estimation of SDs. 4. Neural Estimation of f-Divergences We now turn to analyze neural estimation performance of several important f-divergences, encompassing KL, χ2, H2, and TV. Throughout this section, we assume for simplicity that X = [0, 1]d, but the results and proof techniques readily extend to arbitrary compact domains. Further, we present results for Re LU NNs, although all statements also hold for sigmoid nets with a slightly modified spectral norm condition defining the class of distributions. We comment about this once in Remark 15 below, but omit further mention to avoid repetition. 4.1 KL Divergence Let ˆDGk(ak,φ)(Xn, Y n) := ˆDh KL,Gk(ak,φ)(Xn, Y n) be a NE of DKL (µ ν), where ak R4 0 for all k N. To state performance guarantees for this NE, some definitions are needed. Let Neural Estimation of Statistical Divergences P2 KL(X) be the set of all pairs (µ, ν) P(X) P(X) such that µ ν and DKL (µ ν) < , and for any M 0 define P2 KL(M, X) := (µ, ν) P2 KL(X) : c KB(f KL, X) DKL (µ ν) M . (4.1) For appropriately chosen M, b 0, P2 KL(M, X) contains (µ, ν) P2 KL(X) for which DKL (µ ν) M and f KL = log dµ dν Cs KB b (U) for some U X. To see this, note that a smoothness order of s KB for f KL ensures that c KB(f KL, X) cb,d, X (see Proposition 10). Hence, for any (µ, ν) P2 KL(X) and M cb,d, X DKL (µ ν), (µ, ν) P2 KL(M, X). In particular, P2 KL(M, X), for sufficiently large M, contains Gaussian densities, truncated and normalized to be supported on X. 
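Before turning to the guarantees, the following sketch makes the KL divergence NE in (2.13) concrete: a k-neuron shallow ReLU network is trained by full-batch gradient ascent on the empirical objective (1/n) Σ g(X_i) − (1/n) Σ (e^{g(Y_i)} − 1), with the width chosen proportional to n in the spirit of the scaling discussed below. This is an illustrative sketch rather than the paper's implementation: PyTorch is assumed, the Gaussian data, helper names, and all hyperparameters are placeholder choices, and the supremum over the constrained class GR_k(m_k) is replaced by unconstrained gradient ascent, as is common in practice.

```python
# Minimal sketch (assumes PyTorch) of the KL divergence neural estimator (2.13):
# maximize (1/n) sum g(X_i) - (1/n) sum (e^{g(Y_i)} - 1) over a shallow ReLU network.
import torch

def shallow_relu_net(d, k):
    return torch.nn.Sequential(torch.nn.Linear(d, k), torch.nn.ReLU(), torch.nn.Linear(k, 1))

def kl_neural_estimate(x, y, k, steps=2000, lr=1e-3):
    """Neural estimate of D_KL(mu||nu) from samples x ~ mu, y ~ nu (illustrative)."""
    g = shallow_relu_net(x.shape[1], k)
    opt = torch.optim.Adam(g.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        obj = g(x).mean() - (torch.exp(g(y)) - 1.0).mean()   # E_mu[g] - E_nu[e^g - 1]
        (-obj).backward()                                    # ascend the variational objective
        opt.step()
    with torch.no_grad():
        return (g(x).mean() - (torch.exp(g(y)) - 1.0).mean()).item()

if __name__ == "__main__":
    torch.manual_seed(0)
    d, n = 2, 4000
    x = torch.randn(n, d)            # samples from mu = N(0, I_d)
    y = torch.randn(n, d) + 1.0      # samples from nu = N(1, I_d); D_KL(mu||nu) = d/2
    print("neural estimate :", kl_neural_estimate(x, y, k=n // 8))
    print("closed-form D_KL:", d / 2)
```

Scaling the width with the sample size in this manner mirrors the k ∝ n choice that yields the parametric rate in Corollary 18 below.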
Since the class P2 KL(M, X) becomes larger as M increases, it is to be expected that a larger NN class would be required for accurate neural estimation of KL divergence between distributions in this class. This means that the range of the NN parameters has to be selected depending on M. However, often it is hard to ascertain such an M for the distributions of interest. To account for this, we do not assume that M is known in advance. Instead, we take a NN class GR k (mk) for some non-decreasing positive sequence (mk)k N with mk , for obtaining neural estimation error bounds. The following theorem establishes the consistency of KL divergence NE and uniformly bounds the effective error in terms of the NN and sample sizes. Theorem 14 (KL divergence neural estimation) The following hold: (i) Let (µ, ν) P2 KL(X) be such that f KL C (X). Then, for any 0 < ρ < 1, (kn)n N with kn , kn 1 4(1 ρ) log n and Gn = G kn(φ), ˆDGn(Xn, Y n) n DKL (µ ν) , P a.s. (4.2) (ii) For any M 0, mk = log log k 1, Gk = GR k (mk), sup (µ,ν) P2 KL(M,X) E h ˆDGk(Xn, Y n) DKL (µ ν) i M d 1 2 k 1 2 + d 3 2 (log k)7n 1 The proof of Theorem 14 is presented in Appendix A.2.1. The consistency result in Part (i) relies on G k(φ) being a universal approximator for the class of continuous functions on compact sets as k and Theorem 11. For Part (ii), we derive (4.3) by utilizing Theorems 8 and 12 to bound the sum of the approximation and estimation errors. From Theorem 8, the former is O(k 1/2) if c KB (f KL, X) M and k is such that M log log k 1. On the other hand, for k violating this condition, the effective error is bounded by DKL (µ ν) M. The growing NN parameters contribute an extra polylog(k) factor to the empirical estimation error bound. Remark 15 (Effective error bound for sigmoid NN class) It can be seen from the proof of (4.3) that the same bound applies to sigmoid NN class GS k(mk) when P2 KL(M, X) is replaced by P2 KL,B(M, X) := (µ, ν) P2 KL(X) : c B(f KL, X) DKL (µ ν) M . Similar remarks apply for all the effective error bounds henceforth, which we omit to avoid repetition. Sreekumar and Goldfeld Remark 16 (Effective error bound based on M) If M in the definition of the class P2 KL(M, X) is known when picking the NN parameters (i.e., they can depend on M), then with mk = M and Gk = GR k (M), we have (see (A.44) and the last statement in the proof of Theorem 14 in Appendix A.2.1) sup (µ,ν) P2 KL(M,X) E h ˆDGk(Xn, Y n) DKL (µ ν) i M d 1 2 k 1 2 + d 3 2 n 1 which removes the polylog factor in the empirical estimation bound (2nd term in (4.3)). Remark 17 (L2 neural estimation of a function) In (Barron, 1994), a reminiscent approximation-estimation error analysis for learning a NN approximation of a bounded range function is presented. This differs from our setup since SDs are given as a supremum over a function class, as opposed to a single function. As such, our results require stronger sup-norm approximation results, as opposed to the L2 bound used in (Barron, 1994). The error bounds in (4.4) and (4.3) imply that the KL divergence NE achieves the parametric and near parametric error rates, respectively. Corollary 18 (Minimax optimality) The KL divergence NE ˆDGn(Xn, Y n) is minimax rate-optimal over P2 KL(M, X) and P2 KL,B(M, X) with Gn = GR n(M) and Gn = GS n(M), respectively, achieving the O n 1/2 minimax risk. If M is unknown, then this NE with M replaced by mn = log log n 1, is near minimax optimal achieving O n 1/2 minimax risk. 
The corollary is proven in Appendix A.2.2, where the upper bound follows directly from Theorem 16, Remark 15 and (4.4) by setting k = n. For the lower bound, we present a reduction of the KL divergence estimation problem to differential entropy estimation, and invoke the Ω(n 1/2) lower bound from Goldfeld et al. (2020b) for the latter problem. Theorem 14 and Corollary 18 impose conditions on f KL to bound the effective neural estimation error (namely, assuming that c KB(f KL, X) M, for some M). A primitive sufficient condition in terms of the densities p and q of µ and ν, respectively, w.r.t. an arbitrary common dominating measure η is given next. Proposition 19 (Sufficient condition for Theorem 14) For b 0 and s KB = d/2 + 3, consider the class e P2 KL(b, X) of pairs of distributions given by e P2 KL(b, X) := (µ, ν) P2 KL(X) : p, q Cs KB b (U) for some open set U X s.t. log p = p|X , log q = q|X Then, (4.3) and (4.4) hold with M = 2 cb,d, X 2b, where cb,d, X is given in (A.15),6 and e P2 KL(b, X) in place of P2 KL(M, X). Remark 20 (Feasible distributions) e P2 KL( , X) contains distributions (µ, ν) P2 KL(X) whose densities (p, q) are bounded (from above and below) on X with a smooth extension on an open set covering X. In particular, this includes uniform distributions, truncated Gaussians, truncated Cauchy distributions, etc. 6. Although X is taken to be [0, 1]d, we will retain the dependence of X in the error bounds, which will be used later for extending the results to the unbounded support case. Neural Estimation of Statistical Divergences Remark 21 (Relation to other works) Corollary 18 and Proposition 19 together imply that the KL divergence NE achieves the parametric minimax error rate over the class of densities with at least s KB = d/2 +3 derivatives (or s B = d/2 +2 derivatives in the case of sigmoid activation). To compare with existing results, it is known that variants of classic kernel-based estimators (Kandasamy et al., 2015; Singh and P oczos, 2014a; Moon et al., 2018; Berrett et al., 2019) achieve the optimal minimax risk of O(n 1/2) when the densities are H older smooth with at least d/2 or d derivatives. We also note that the parametric rate achieved by NE is an improvement over the n 1/4 rate shown in Sreekumar et al. (2021). Furthermore, we observe that (4.3), (4.4) and minimax rate optimality holds for the class of distributions obtained by replacing c KB(f KL, X) M in (4.1) with X f KL R,U |f KL(0)| f KL(0) 1 M, where R, U-norm is defined in (Domingo-Enrich and Mroueh, 2021, Equation 6). This follows by using (Domingo-Enrich and Mroueh, 2021, Theorem 2) in place of Theorem 8 to analyze the approximation error. Similar conclusions hold for NEs of other SDs considered below. 4.1.1 Neural Estimation via Donsker-Varadhan Formula Another well known variational representation for KL divergence is the Donsker-Varadhan (DV) formula: DKL (µ ν) = sup f F Eµ[f] log Eν ef , where the supremum is over all measurable f such that the last expectation is finite. Parametrizing F by a NN and replacing expectation with sample means leads to the DV-NE for KL, given by ˇDDV,G(Xn, Y n) := sup g G i=1 g(Xi) log 1 i=1 eg(Yi). In (Belghazi et al., 2018), the authors studied the special case of DV-NE pertaining to estimation of mutual information, termed MINE. They established consistency along with sample complexity bounds (without accounting for the approximation error). In Appendix C, we show that consistency of the DV-NE holds under similar conditions as in Theorem 14 (see (C.1)). 
We also prove that the effective error bound given in (4.3) applies to DV-NE, albeit with different constants (see (C.2)). In particular, the latter establishes the minimax optimality of DV-NE with the scaling k = n. Instantiating these results for µ = PAB and ν = PA PB (i.e., a joint probability law versus the product of its marginals), translates these performance guarantees to MINE, now accounting for finite-size NNs, the associated approximation error, and minimax convergence rates. 4.2 χ2 Divergence Let ˆχ2 Gk(ak,φ)(Xn, Y n) := ˆDhχ2,Gk(ak,φ)(Xn, Y n) denote the NE of χ2 (µ ν). Set P2 χ2(X) as the collection of all (µ, ν) P(X) P(X) such that µ ν and χ2 (µ ν) < , and let P2 χ2(M, X) := n (µ, ν) P2 χ2(X) : c KB fχ2, X χ2 (µ ν) M o . The next theorem establishes consistency of the NE and bounds its effective absolute-error. Sreekumar and Goldfeld Theorem 22 (χ2 divergence neural estimation) The following hold: (i) Let (µ, ν) P2 χ2(X) be such that fχ2 C (X). Then, for any 0 < ρ < 1, (kn)n N with kn , kn = O n(1 ρ)/5 and Gn = G kn(φ), we have ˆχ2 Gn(Xn, Y n) n χ2 (µ ν) , P a.s. (4.5) (ii) For any M 0, mk = log k, and Gk = GR k (mk), we have sup (µ,ν) P2 χ2(M,X) E ˆχ2 Gk(Xn, Y n) χ2 (µ ν) M dk 1 2 + d 3 2 (log k)2n 1 The proof strategy for Theorem 22 is similar to that of Theorem 14, with appropriate adaptations to account for the difference between fχ2 and f KL (see Appendix A.2.4). Comparing (4.5)-(4.6) to (4.2)-(4.3), we see that consistency for χ2 divergence estimation holds under milder conditions and that the effective error bound is better in terms of dependence on k than for KL divergence. Remark 23 (Effective error based on M) If the NN parameters can depend on M, then setting mk = M in (4.6) yields sup (µ,ν) P2 χ2(M,X) E ˆχ2 Gk(Xn, Y n) χ2 (µ ν) M dk 1 2 + d 3 2 n 1 Choosing k = n in (4.7) (resp. (4.6)), we have that the χ2 NE achieves the parametric (resp. near parametric) error rate over the class P2 χ2(M, X). The proof is similar to that of Corollary 18, and is omitted for brevity. Let P2 χ2,B(M, X) := (µ, ν) P2 χ2(X) : c B fχ2, X χ2 (µ ν) M . Corollary 24 (Minimax optimality) The χ2 NE ˆχ2 Gn(Xn, Y n) is minimax rate-optimal over P2 χ2(M, X) and P2 χ2,B(M, X) with Gn = GR n(M) and Gn = GS n(M), respectively, achiev- ing the O(n 1/2) risk. This NE achieves the near parametric O n 1/2 minimax risk when M is replaced by log n, which is applicable to the scenario of unknown M. Given next is the counterpart of Proposition 19 for χ2 divergence (proven in Appendix A.2.5), which provides primitive conditions in terms of densities under which the effective error bounds in Theorem 22 and Corollary 24 hold. Proposition 25 (Sufficient condition for Theorem 22) For b 0 and s KB = d/2 + 3, let e P2 χ2(b, X) := (µ, ν) P2 χ2(X) : p, q Cs KB b (U) for some open set U X s.t. p = p|X , q 1 = q|X Then, (4.6) and (4.7) hold with M = (κdd3/2 X 1) 2 + 2s KB+1 c2 b,d, X (b2 + 1), where κd and cb,d, X are given in (A.3) and (A.15), respectively, and e P2 χ2(b, X) in place of P2 χ2(M, X), Neural Estimation of Statistical Divergences Remark 26 (Feasible distributions) The class e P2 χ2( , X) contains (µ, ν) P2 χ2(X), whose densities p, q, are bounded (upper bounded for p and bounded away from zero for q) on X with an extension that is sufficiently smooth on an open set covering X. This includes the distributions mentioned in Remark 20. 
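As with the KL case, the χ² NE changes only the measurement function in the empirical objective (2.13). The following fragment (an illustrative sketch assuming PyTorch; the function name is not from the paper) spells out the objective with h_χ²(x) = x + x²/4; the network class and optimization loop from the KL sketch in Section 4.1 carry over unchanged.

```python
# Sketch (assumes PyTorch): the chi-squared NE objective from (2.13) with h(x) = x + x^2/4.
import torch

def chi2_objective(g, x, y):
    """Empirical chi-squared objective E_mu[g] - E_nu[g + g^2/4], cf. (2.3) and (2.13)."""
    gy = g(y)
    return g(x).mean() - (gy + 0.25 * gy ** 2).mean()
```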
4.3 Squared Hellinger Distance Let ˆH2 Gk,t(ak,φ)(Xn, Y n) := ˆDh H2, Gk,t(ak,φ)(Xn, Y n), where for t > 0, Gk,t(a, φ) is the NN class Gk,t (a, φ) := n g : Rd R : g(x) = (1 t) g(x), g Gk(a, φ) o . Set P2 H2(X) as the collection of all (µ, ν) P(X) P(X) such that µ ν, and P2 H2(M, X) := (µ, ν) PH2(X) : c KB(f H2, X) dµ dν Also, let G k,t(φ) := Gk,t 1, 1, 1, 0, φ , and for a 0, define GR k,t(a) := Gk,t 1, 2k 1a, a, a, φR . (4.9) The next theorem establishes consistency of the NE and bounds its effective absolute-error. Theorem 27 (Squared Hellinger distance neural estimation) The following hold: (i) If (µ, ν) P2 H2(X) is such that f H2 C (X) and there exists M > 0 such that dµ/dν ,η M, then, for any 0 < ρ < 1, (kn)n N with kn , kn = O n(1 ρ)/3 and Gn = G kn,M 1/2(φ), we have ˆH2 Gn(Xn, Y n) n H2(µ, ν), P a.s. (4.10) (ii) For any M 0, mk = log k, tk = (log k) 1, and Gk = GR k,tk(mk), we have sup (µ,ν) P2 H2(M,X) E h ˆH2 Gk(Xn, Y n) H2(µ, ν) i M d 1 2 k 1 2 log k + d 3 2 (log k)3n 1 The proof of Theorem 27 is presented in Appendix A.2.6. To prove consistency and establish effective error bounds for H2 NE, we use a truncated NN class Gk,t (a, φ) that saturates the NN output to 1 t for some t > 0. This is done since h H2(x) has a singularity at x = 1 and the NN outputs must be truncated below 1 so as to satisfy (3.2) for bounding the empirical estimation error. To get the effective error bounds under this constraint, we take t = tk for some non-increasing positive sequence tk 0. The bound in (4.11) uses tk = (log k) 1. Remark 28 (Effective error based on M) If M is known when selecting the NN parameters, then for Gk = GR k,M 1/2(M), we have (see (A.58)) sup (µ,ν) P2 H2(M,X) E h ˆH2 Gk(Xn, Y n) H2(µ, ν) i M d 1 2 k 1 2 + d 3 2 n 1 Sreekumar and Goldfeld Addressing minimax optimality, we set k = n in the above equation (resp.(4.11)) to attain the parametric (resp. near parametric) rate for the H2 NE over the class P2 H2(M, X). Let GS k,t(a) := Gk,t k1/2 log k, 2k 1a, a, 0, φS , and P2 H2,B(M, X) denote the class of distribution pairs with c KB(f H2, X) in (4.8) replaced with c B(f H2, X). Corollary 29 (Minimax optimality) The H2 NE ˆH2 Gn(Xn, Y n) is minimax rate-optimal over the class P2 H2(M, X) and P2 H2,B(M, X) with Gn = GR n,M 1/2(M) and Gn = GS n,M 1/2(M), respectively, achieving O n 1/2 minimax risk. Further, relevant to the case when M is unknown, the same NE achieves O n 1/2 risk, when M is replaced with log n and M 1/2 with (log n) 1. Below, we provide a sufficient condition in terms of densities under which the effective error bounds in Theorem 27 as well as Corollary 29 applies, similar in spirit to Proposition 19 (see Appendix A.2.7 for the proof). Proposition 30 (Sufficient condition for Theorem 27) For b 0 and s KB = d/2 + 3, consider the class e P2 H2(b, X) of pairs of distributions given by e P2 H2(b, X) := (µ, ν) P2 H2(X) : p, q Cs KB b (U) for some open set U X 2 = p|X , q 1 2 = q|X , and p q 1 ,η b Then, (4.11) and Remark 28 hold with M = κdd 3 2 X 1 1 + 2s KB c2 b,d, X b2, where cb,d, X and κd are given in (A.15) and (A.3), respectively, and e P2 H2(b, X) in place of P2 H2(M, X). Remark 31 (Feasible distributions) e P2 H2( , X) includes (µ, ν) P2 H2(X), whose densities p, q, are bounded (from above and away from zero) on X with an extension that is sufficiently smooth on an open set covering X. This contains the distributions mentioned in Remark 20. 
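The distinguishing implementation detail for the H² NE is the saturated class G_{k,t}, which keeps the network output below 1 − t so that h_H²(x) = x/(1 − x) stays bounded; Theorem 27 takes t = t_k = (log k)^{-1}. The fragment below is an illustrative sketch (PyTorch assumed, names not from the paper) that reads the truncation as clipping the output from above at 1 − t, per the saturation description following the theorem.

```python
# Sketch (assumes PyTorch): the squared Hellinger NE objective with outputs capped at 1 - t.
import torch

def hellinger_objective(g, x, y, t):
    """Empirical H^2 objective E_mu[g_t] - E_nu[g_t/(1 - g_t)], with g_t = min(g, 1 - t)."""
    gx = torch.clamp(g(x), max=1.0 - t)      # saturate the output at 1 - t (class G_{k,t})
    gy = torch.clamp(g(y), max=1.0 - t)
    return gx.mean() - (gy / (1.0 - gy)).mean()
```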
4.4 Total Variation Distance Consider the NN class obtained by truncating the functions in Gk(a, φ) to [ 1, 1], i.e., Gk(a, φ) := g : g(x) = 1{| g(x)| 1} g(x) + 1{ g(x)>1} 1{ g(x)< 1} for some g Gk(a, φ) . (4.12) Also, let ˆδ Gk(a,φ)(Xn, Y n) := ˆDh TV, Gk(a,φ)(Xn, Y n), and set GR k (a) := Gk 1, 2k 1a, a, a, φR , and G k(φ) := Gk 1, 1, 1, 0, φ . Denote the densities of µ and ν w.r.t. λ by p and q, respectively, and for M 0 define P2 TV(M, X) := n (µ, ν) P(X) P(X) : µ, ν λ, p q ,X M o . (4.13) The following theorem bounds the effective error for TV distance neural estimation. Theorem 32 (TV distance neural estimation) The following hold: Neural Estimation of Statistical Divergences (i) For any µ, ν P(X), 0 < ρ < 1, (kn)n N with kn , kn = O n(1 ρ)/2 and Gn = G kn(φ), we have ˆδGn(Xn, Y n) n δTV(µ, ν) , P a.s. (4.14) (ii) For any 0 < s 1, M 0, ck,d,s,M, X = Od,M, X k(d+2)/2(s+d+2) as defined in (A.75) and Gk = GR k ck,d,s,M, X , we have sup (µ,ν) P2 TV(M,X): f TV Lips,1,M(X) E h ˆδGk(Xn, Y n) δTV(µ, ν) i d,M,s k s 2(s+d+2) + k d+2 2(s+d+2) n 1 The proof of Theorem 32 is provided in Appendix A.2.8. A key technical challenge arises from the fact that f TV = 1C 1X\C (see (2.8)) contains step discontinuities in its domain, and hence, it does not belong to the Klusowski-Barron class. Consequently, Theorem 8 is not directly applicable for bounding the approximation error as was done for the SDs considered until now. To overcome this issue, we apply a Gaussian smoothing kernel to f TV so that the smoothed version belongs to the Klusowski-Barron class. The width of the kernel is then adjusted as a function of k such that L1 norm of the difference between f TV and its smoothed version decreases as k increases. The need for the smoothing operation results in a slower approximation and empirical estimation error rate that depends on d. Remark 33 (Curse of dimensionality) Setting k = n in (4.15), we achieve the effective error rate O n s/2(s+d+2) . Note that this rate suffers from Co D, different from NEs of other SDs considered above where the parametric rate is achieved. In practice, the condition f TV Lips,1,M(X) required for (4.15) may be hard to verify. A simple sufficient condition in terms of the densities of µ and ν is given below. To state it, we need the following definition. Definition 34 (Critical zero) Given f : X R, a point x0 X is called a critical zero of f if f(x0) = 0 and every neighbourhood Ux0 of x0 contains an x Ux0 X such that f(x) = 0. In particular, if f(x0) = 0 and f is differentiable at x0 with derivative f (x0) > 0, then x0 is a critical zero. Let Z(f) denote the set of critical zeros of f. Based on the above, for N N and b 0, define Tb,N(X) := f : X R : x x b, x, x Z(f), |Z(f)| N , (4.16) as the class of functions on X with at most N critical zeros at pairwise (Euclidean) distance of at least b from each other. We are now ready to state the sufficient condition for TV distance estimation; see Appendix A.2.9 for proof. Proposition 35 (Sufficient condition for Theorem 32) For N N and b 0, consider the class e P2 TV(b, N, X) := (µ, ν) P2 TV(b, X) : f Tb,N(X) s.t. p q = f . Then, for any 0 < s 1, (4.15) holds with M = λ(X) + 2b sλ(X) 2Nπd/2bd sΓ(d/2 + 1) 1 and supremum over (µ, ν) e P2 TV(b, N, X) in place of that over (µ, ν) P2 TV(M, X), where λ(X) is the Lebesgue measure of X and Γ is the gamma function. 
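For TV, the clipped class in (4.12) simply truncates the network output to [−1, 1], matching the constraint ‖f‖_∞ ≤ 1 in the dual form (2.7), and the measurement function is the identity. A minimal sketch of the corresponding empirical objective (PyTorch assumed; the helper name is illustrative) follows.

```python
# Sketch (assumes PyTorch): the TV distance NE objective with outputs clipped to [-1, 1].
import torch

def tv_objective(g, x, y):
    """Empirical TV objective E_mu[clip(g)] - E_nu[clip(g)], cf. (2.7) and (4.12)."""
    return torch.clamp(g(x), -1.0, 1.0).mean() - torch.clamp(g(y), -1.0, 1.0).mean()
```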
Sreekumar and Goldfeld Remark 36 (Feasible distributions) The set e P2 TV( , , X) includes generalized Gaussian distributions, Gaussian mixtures, exponential families, Cauchy distributions, etc., truncated and normalized to be supported on X. It also includes distributions whose densities are analytic functions, e.g., non-negative polynomials on X. These inclusions are easy to verify since p q has finitely many separated critical zeros for such distributions (cf., e.g., (Smale, 1986; Kalantari, 2004) for the case of analytic functions). 5. Neural Estimation for Distributions with Unbounded Support Thus far, we considered compactly supported µ and ν. In this section, we consider neural estimation of KL, χ2, H2 and TV with µ, ν P Rd . Throughout, unless stated otherwise, we will assume that µ, ν λ with p, q denoting the respective Lebesgue densities. For each SD, we first prove consistency of the NE under certain regularity conditions on the densities. Then, we present effective error bounds under an Orlicz norm constraint on the densities, which are subsequently specialized to multivariate Gaussian distributions. We next introduce the required definitions below. Definition 37 (Orlicz space) An increasing convex function ψ : [0, ) [0, ) with ψ(0) = 0 and limx ψ(x) = is called an Orlicz function. For a given ψ and M 0, the bounded Orlicz space 7 is Lψ(M) = n f : Rd R : f ψ M o , where f ψ := inf c > 0 : R Rd ψ ( x /c) f(x)dx 1 . Examples of Orlicz functions include ˆψr(z) = zr and ψr(z) = ezr 1, z R, for r 1; in particular, ψr with r = 2 correspond to the sub-Gaussian class defined next. Definition 38 (Sub-Gaussian distribution) A distribution µ P Rd is σ2-sub-Gaussian for σ > 0 if X µ satisfies E h eu (X E[X])i e σ2 u 2 For M 0, let SG(M) be the set of all σ2-sub-Gaussian distributions with σ2 E[X] M. With some abuse of notation, we henceforth use boldface letters to denote infinite sequences, e.g., v = (vk)k N; this will simplify some of the subsequent notation. In particular, we use r = (rk)k N for an increasing positive divergent sequence (i.e., rk ) with rk 1, and m = (mk)k N for a non-decreasing positive sequence with mk 1. Let ˆGk(a, φ, r) := g1Bd(r) : g Gk(a, φ) , ˆGR k (a, r) := g1Bd(r) : g GR k (a) , and ˆG k(φ, r) := g1Bd(r) : g G k(φ) denote the NN classes Gk(a, φ), GR k (a), and G k(φ), respectively, after nullifying the functions outside of Bd(r). 7. It is possible to generalize the results in this section to µ, ν γ, where γ is an arbitrary positive σ-finite Borel measure. Accordingly, the Orlicz norm in Definition 37 is replaced with f ψ,γ := inf c [0, ] : R Rd ψ x /c f(x)dγ(x) 1 . We adopt the current definition for simplicity. Neural Estimation of Statistical Divergences 5.1 KL Divergence For M 0, ℓ N, r and m as above, consider the following class of distributions: P2 KL,ψ (M, ℓ, r, m) := (µ, ν) P2 KL Rd : µ, ν λ, p, q Lψ(M), f KL ℓ,µ M, c KB f KL|Bd(rk), Bd(rk) mk, k N In words, the class above contains pairs of distributions whose (i) densities have a ψ-Orlicz norm bounded by M, (ii) f KL has Lℓ(µ) norm at most M, and (iii) the restriction of f KL to Bd(rk) has a Klusowski-Barron coefficient that is at most mk. The following is the counterpart of Theorem 14 for distributions supported on Rd; the proof is provided in Appendix A.3.1. Theorem 39 (KL divergence neural estimation) For any 0 < ρ < 1, the following hold: (i) Let (µ, ν) P2 KL(Rd) be such that f KL C Rd and f KL 1,µ < . 
Then, for kn, rn, n satisfying kn , rn , k3/2 n rnekn(rn+1) = O n(1 ρ)/2 and Gn = ˆG kn(φ, rn), ˆDGn(Xn, Y n) n DKL (µ ν) , P a.s. (ii) Let ℓ> 1, M 0, ℓ = ℓ/(ℓ 1), and m be such that 1 mk k(1 ρ)/2. Then, for Gk = ˆGR k (mk, rk), we have sup (µ,ν) P2 KL,ψ(M,ℓ,r,m) E h ˆDGk(Xn, Y n) DKL (µ ν) i d,M,ρ,ψ,ℓmkk 1 2 + mkrke3mk(rk+1) n 1 2 + ψ rk M 1 1 The proof of the consistency claim in Part (i) follows similar to (4.2) by using the universal approximation property of ˆG kn(φ, rn) on Euclidean balls, controlling the residual approximation error via integrability assumption on f KL, and using Theorem 11 to bound the empirical estimation error. The proof of (5.1) is based on the following observations. First, we note that if c KB f KL|Bd(rk), Bd(rk) can be bounded for every k, then Theorem 8 implies that the NN class ˆGR k (mk, rk) with mk, rk at an appropriate rate can approximate f KL to within an error of d1/2mkk 1/2 inside the Euclidean ball Bd(rk). An upper bound on c KB f KL|Bd(rk), Bd(rk) is guaranteed, for instance, by Proposition 10 when f KL is sufficiently smooth on Bd(rk). Moreover, since every Borel probability measure on Rd is tight, µ Bc d(rk) ν Bc d(rk) 0 for every rk . The proof then follows by an analysis of the approximation error outside Bd(rk) under the Orlicz norm constraint on the densities of µ and ν, along with an account of the empirical estimation error. The Orlicz norm constraint controls the rate of tail decay of the densities. Remark 40 (Feasible distributions) Based on Proposition 10, (5.1) holds for distributions (µ, ν) P2 KL Rd , µ, ν λ, such that their densities are sufficiently smooth and bounded (from above and away from zero) on Euclidean balls Bd(r) for any r > 0, and f KL ℓ,µ is finite for some ℓ> 1. This includes multivariate Gaussians, Gaussian mixtures, Cauchy distributions, etc., to name a few. Sreekumar and Goldfeld As an instance of an explicit effective error bound, we now specialize Theorem 39 to the important case of Gaussian distributions. Define the class ( N(mp, Σp), N(mq, Σq) : mp , mq M Σp op, Σ 1 p op, Σq op, Σ 1 q op < M of pairs of non-singular multivariate Gaussian distributions with appropriate bounded operator norm (denoted by op). The following corollary quantifies the effective error for pairs of Gaussian distributions. However, as the proof (see Appendix A.3.2) requires a tedious evaluation of a bound on the Klusowski-Barron coefficient, we restrict attention to isotropic Gaussians, i.e., whose covariance matrix is Σ = σ2Id, for some σ > 0. The (sub)class of isotropic Gaussian measures is denoted by P2 N(M). Nevertheless, we stress that the argument can be generalized to account for the entire P2 N(M) class above. Corollary 41 (Gaussian effective error) For any 1 < M < , there exists cd,M > 0 such that for mk d,M (log k)0.5(d+3), rk := 1 M+ rk, rk d,M log k and Gk = ˆGR k (mk, rk), sup (µ,ν) P2 N(M) E h ˆDGk(Xn, Y n) DKL (µ ν) i d,M (log k) d+4 2 + kcd,M(log k) d+2 Remark 42 (Gaussian error rate) Optimizing over k in the above equation yields an effective error rate of n (log n)cd,M log n for some cd,M > 1. Despite the dependence of this rate on d, in Appendix D.1 we show that for certain classes of sub-Gaussian distributions, a NE effective error rate of n 1/3 can be achieved independent of dimension. This is to stress that the NE can produce dimension-free convergence rates even when supports are unbounded. 5.2 χ2 Divergence We next consider χ2 divergence. 
Consider the following class of distributions: P2 χ2,ψ (M, ℓ, r, m) := (µ, ν) P2 χ2 Rd : µ, ν λ, p, q Lψ(M), fχ2 ℓ,µ M, c KB fχ2|Bd(rk), Bd(rk) mk, k N The following theorem states consistency of the χ2 NE and bounds the effective error. Theorem 43 (χ2 neural estimation) The following hold: (i) Let (µ, ν) P2 χ2 Rd satisfy fχ2 C Rd and fχ2 1,µ hχ2 fχ2 1,ν < . Then, for kn , rn , n satisfying k5/2 n r2 n = O n(1 ρ)/2 for some 0 < ρ < 1 and Gn = ˆG kn(φ, rn), we have ˆχ2 Gn(Xn, Y n) n χ2 (µ ν) , P a.s. (ii) For any M 0, ℓ> 1, ℓ = ℓ/(ℓ 1) and Gk = ˆGR k (mk, rk), we have sup (µ,ν) P2 χ2,ψ(M,ℓ,r,m) E ˆχ2 Gk(Xn, Y n) χ2 (µ ν) M,ψ,ℓm2 kdk 1 2 + d 3 2 m2 kr2 kn 1 + ψ rk M 1 1 Neural Estimation of Statistical Divergences The proof of Theorem 43 is similar to that of Theorem 39 and is given in Appendix A.3.3. Remark 44 (Feasible distributions) Theorem 43 (ii) holds for any distributions (µ, ν) P2 χ2 Rd , µ, ν λ, such that their densities are sufficiently smooth and bounded (from above for p and away from zero for q) on Euclidean balls, and fχ2 ℓ,µ is finite for some ℓ> 1. This encompasses the distributions mentioned in Remark 40 for certain parameter ranges. The corollary below (see Appendix A.3.4 for proof) provides effective error bounds for the following class of Gaussian distributions: P2 χ2,N(M) := ( N(mp, σ2 p Id), N(mq, σ2 q Id) : 1/M < σ2 p < 2σ2 q < M, 2σ2 q σ2 p > 1/M, mp mq M where the constraint σ2 p < 2σ2 q is required for χ2 (µ ν) to be finite. Corollary 45 (Gaussian effective error) For 1 < M < , we have with mk d,M k2M5/(4M5+1)(log k)0.5(s KB+d+1), rk = 1 M + rk, rk M log k and Gk = ˆGR k (mk, rk) that sup (µ,ν) P2 χ2,N(M) E ˆχ2 Gk(Xn, Y n) χ2 (µ ν) d,M (log k)2(s KB+d+1) k 1 2+8M5 + k Remark 46 (Gaussian error rate) The optimum in the right hand side (RHS) of the equation above over (k, n) is attained at k = n(1+4M5)/(1+8M5), and results in an effective error rate of n 1/(2+16M5) (log n)2(s KB+d+1). Note that this rate degrades with increasing d or M. Nevertheless, in Proposition 70 in Appendix D.2, we show that a dimension-free improvement of n 1/2 (up to logarithmic factors) can be achieved for a certain class of sub-Gaussian distributions with unbounded support. 5.3 Squared Hellinger Distance Next, we consider the squared Hellinger distance. For M, r, m as above, let P2 H2,ψ(M, r, m):= (µ, ν) P2 H2 Rd : µ, ν λ, p, q Lψ(M), c KB f H2|Bd(rk), Bd(rk) dµ dν ,Bd(rk) mk, k N Also, consider the following NN class obtained from GR k,t( ) (see (4.9)) by nullifying the functions outside of Bd(r): ˇG R k,t(a, r) := n g1Bd(r) : g GR k,t(a) o . (5.2) The next theorem provides conditions under which consistency holds for H2 neural estimation and bounds the effective error; see Appendix A.3.5 for the proof. Theorem 47 (Squared Hellinger distance neural estimation) Let m satisfy mk = o(k1/4). The following hold: Sreekumar and Goldfeld (i) For (µ, ν) P2 H2,ψ (M, r, m), and k, r, m, n such that kn , rkn , mkn , k1/2 n m2 knrkn = O n(1 ρ)/2 for some 0 < ρ < 1 and Gn = ˇG R kn,m 1/2 kn (mkn, rkn), we have ˆH2 Gn(Xn, Y n) n H2(µ, ν), P a.s. (ii) For any M 0 and Gk = ˇG R k,m 1/2 k (mk, rk), we obtain sup (µ,ν) P2 H2,ψ(M,r,m) E h ˆH2 Gk(Xn, Y n) H2(µ, ν) i M,ψ m2 kd 1 2 k 1 2 + d 3 2 m2 krkn 1 2 + ψ rk M 1 1 The proof of Theorem 47 follows along similar lines to Theorem 39. Notice that the NN class ˇG R k,t is used to overcome the issue of singularity of f H2 as in Theorem 27. 
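To make the role of the truncation parameter concrete, the following is a minimal sketch, under illustrative assumptions, of a shallow network whose output is clipped to lie below 1 − t; it is not the exact class defined in (4.9). Assuming the convex-dual form H²(µ, ν) = sup_{g<1} E_µ[g] − E_ν[g/(1 − g)], whose optimizer is 1 − (dµ/dν)^{−1/2} (the paper's sign conventions may differ), the map x ↦ x/(1 − x) blows up at x = 1, and clipping the output keeps it and its derivative bounded by roughly t^{−1} and t^{−2} on the range of g, mirroring the t_k^{−2} factor that enters the estimation-error analysis.

```python
import torch

class TruncatedShallowNet(torch.nn.Module):
    """Shallow ReLU network whose output is clipped to (-inf, 1 - t].

    Illustrative stand-in for a truncated class: the clipping is what keeps the
    measurement function x -> x / (1 - x), singular at x = 1, well behaved on
    the range of g (its derivative 1 / (1 - x)^2 is at most 1 / t^2 there).
    """
    def __init__(self, d, k, t):
        super().__init__()
        self.t = t
        self.net = torch.nn.Sequential(torch.nn.Linear(d, k), torch.nn.ReLU(),
                                       torch.nn.Linear(k, 1))

    def forward(self, x):
        # enforce 1 - g(x) >= t
        return torch.clamp(self.net(x), max=1.0 - self.t)

def h2_objective(g, x, y):
    # empirical version of E_mu[g] - E_nu[g / (1 - g)]; finite since g <= 1 - t
    gx, gy = g(x), g(y)
    return gx.mean() - (gy / (1.0 - gy)).mean()
```

In Theorem 47 the threshold is taken as t_k = m_k^{−1/2}, so the price paid for controlling the singularity, a factor of order t_k^{−2} = m_k in the estimation error, grows only polynomially in the Klusowski-Barron coefficient bound.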
Remark 48 (Feasible distributions) Theorem 47 applies for any distributions (µ, ν) P2 H2 Rd , µ, ν λ, such that their densities p, q are sufficiently smooth and bounded (from above and below) on Euclidean balls. To list a few, this includes multivariate Gaussians, mixture Gaussians, Cauchy distributions, etc. The next corollary provides effective error bounds for the class of isotropic Gaussian distributions with bounded parameters, P2 N(M), considered in Section 5.1; see Appendix A.3.6 for the proof. Corollary 49 (Gaussian effective error) For 1 < M < , rk = 1 M + (M + 8M2) 1/2(log k)1/2, mk d,M k2M/(1+8M)(log k)0.5(s KB+d+1), and Gk = ˇG R k,m 1/2 k (mk, rk), sup (µ,ν) P2 N(M) E h ˆH2 Gk(Xn, Y n) H2(µ, ν) i d,M (log k)s KB+d+2k 1 2+16M 1 + k 1 2 n 1 Remark 50 (Gaussian error rate) Setting k = n in the equation above yields an effective error rate of n 1/(2+16M)(log n)s KB+d+2. While this rate deteriorates with M and d, in Proposition 72 in Appendix D.3, we show that a rate of n 1/2 (up to logarithmic factors) is possible independent of dimension for a certain class of sub-Gaussian distributions with unbounded support. 5.4 TV Distance Finally, we consider neural estimation of TV distance for distributions with unbounded support. For M 0, s 0, b 0, N N, sequences r and m as above, let P2 TV,ψ(M, s, r, m) := (µ, ν) P2 TV(M, Rd) : µ, ν λ, p, q Lψ(M), f TV1Bd(rk) Lips,1,mk(Bd(rk)) ˆP2 TV(b, M, N) := n (µ, ν) P2 TV(M, Rd) : µ, ν SG(M), f Tb,N Rd s.t. p q = f o , Neural Estimation of Statistical Divergences where P2 TV(M, Rd) and Tb,N Rd are defined in (4.13) and (4.16), respectively. Also, define the NN classes G R k (a, r) := g1Bd(r) : g G R k (a) , and G k(φ, r) := g1Bd(r) : g G k(φ) , where Gk is given in (4.12). The next theorem is the analogue of Theorem 39 for TV distance neural estimation. Its proof is presented in Appendix A.3.7. Theorem 51 (TV distance neural estimation) The following hold: (i) For µ, ν P Rd , any 0 < ρ < 1, and k, r, n such that kn , rn , knr1/2 n = O n(1 ρ)/2 and Gn = G kn(φ, rn), we have ˆδGn(Xn, Y n) n δTV(µ, ν) , P a.s. (ii) For any M 0 and 0 < s 1, Gk = G R k ck,d,s,m,r, rk , where ck,d,s,m,r is given in (A.98), we obtain sup (µ,ν) P2 TV,ψ(M,s,r,m) E h ˆδGk(Xn, Y n) δTV(µ, ν) i d,M,s,ρ md+2 k rs(d+1) k k s 2 1 s+d+2 + n 1 2 mkrs+1 k k 1 2 d+2 s+d+2 + ψ (rk M 1) 1. The following corollary (see Appendix A.3.8 for proof) provides effective error bounds for sub-Gaussian distributions such that p q has finite number of critical zeros pairwise separated by Euclidean distance bounded away from zero. Corollary 52 (Sub-Gaussian effective error) For any 0 < s 1, b 0, M 0, N N, rk = M 1 + 4 d M log k, mk = cd,s,b,N,rk (see (A.101)) and Gk = G R k ck,d,s,m,r, rk , we have sup (µ,ν) ˆP2 TV(b,M,N) E h ˆδGk(Xn, Y n) δTV(µ, ν) i d,s,b,N (log k) s 2(s+d+2) + (log k) d+2 d+2 2(s+d+2) n 1 Remark 53 (Sub-Gaussian error rate) Setting k = n in the bound above, the effective error rate is n s/2(s+d+2)(log n)(d+2)/2. Remark 54 (Feasible distributions) ˆP2 TV( , , ) includes generalized Gaussian distributions, mixture Gaussians, and in general, distributions pairs with smooth bounded densities having finite number of modes and sub-Gaussian tails. 6. Concluding Remarks This paper studied neural estimation of SDs, aiming to characterize the performance of NEs via an approximation-estimation error analysis. 
We showed that NEs of f-divergences, namely the KL and χ² divergences, the squared Hellinger distance, and the TV distance, are consistent, provided the NN size k is appropriately scaled with the sample size n. We further derived non-asymptotic absolute-error upper bounds that quantify the dependence on k and n. In the compactly supported case, the derived bounds enabled us to establish the minimax optimality of NEs for the KL divergence, χ² divergence, and H² distance. The key results leading to these bounds are Theorems 8 and 12, which, respectively, bound the sup-norm approximation error by NNs and the empirical estimation error of the parametrized SD. Our theory covers distributions whose densities belong to an appropriate Orlicz class (e.g., sub-Gaussian distributions), but faster (optimal) parametric rates are attained when supports are compact. Going forward, we aim to extend our results to additional SDs such as Wasserstein distances and IPMs. While our analysis strategy extends to these examples, new approximation bounds for the appropriate function classes (e.g., 1-Lipschitz) are needed. Generalizing our results to NEs based on deep nets is another natural direction; recent results on the approximation capabilities of DNNs (e.g., Yarotsky, 2017) appear useful for this purpose. Our analysis also does not account for the optimization error, which is another important component of the overall error that we plan to examine in the future. Through the results herein and the said future directions, we hope to couple neural estimators with a theory that guarantees their performance and/or elucidates their limitations.

Acknowledgments

The work of S. Sreekumar is supported by the TRIPODS Center for Data Science National Science Foundation Grant CCF-1740822. Z. Goldfeld is supported by the NSF CRII Grant CCF-1947801, the NSF CAREER Award under Grant CCF-2046018, and the 2020 IBM Academic Award. We thank Prof. Kengo Kato for many useful suggestions that helped us improve the manuscript. We also thank Zhengxin Zhang for his involvement in an earlier version of this work.

Appendix A. Proofs

This section contains proofs of the results presented in Sections 3-5, each given in a different subsection. For fluidity, derivations of lemmas used in those proofs are relegated to Appendix B. We first state an auxiliary result that will be useful in several of the proofs that follow. For b ≥ 0 and an integer s ≥ 0, define the function classes:

L^KB_{s,b}(R^d) := { f ∈ L^1(R^d) ∩ L^2(R^d) : |f(0)| ∨ ‖∇f(0)‖_1 ≤ b, ‖D^α f‖_1 < ∞ for ‖α‖_1 ≤ s, ‖D^α f‖_2 ≤ b for ‖α‖_1 ∈ {2, s} },

L^B_{s,b}(R^d) := { f ∈ L^1(R^d) ∩ L^2(R^d) : |f(0)| ≤ b, ‖D^α f‖_1 < ∞ for ‖α‖_1 ≤ s, ‖D^α f‖_2 ≤ b for ‖α‖_1 ∈ {1, s} }.

The next lemma states that functions in L^KB_{s,b}(R^d) (resp. L^B_{s,b}(R^d)) with sufficient smoothness order s belong to the Klusowski-Barron (resp. Barron) class. Its proof is given in Appendix B.1 and borrows arguments from (Barron, 1993).

Lemma 55 (Smoothness and Klusowski-Barron class) Recall s_KB = ⌊d/2⌋ + 3 and s_B := ⌊d/2⌋ + 2. If f ∈ L^KB_{s_KB,b}(R^d), then S_2(f) ≤ b d^{3/2} κ_d, while if f ∈ L^B_{s_B,b}(R^d), then S_1(f) ≤ b d^{1/2} κ_d, where

κ_d² := d + d^{s_B} ∫_{R^d} (1 + ‖ω‖^{2(s_B − 1)})^{−1} dω < ∞.   (A.3)

Consequently, for compact X ⊂ R^d, L^KB_{s_KB,b}(R^d) ⊆ B_{c,2,X}(R^d) and L^B_{s_B,b}(R^d) ⊆ B_{c′,1,X}(R^d) with c = b ∨ b d^{3/2} κ_d ‖X‖ and c′ = b ∨ b d^{1/2} κ_d ‖X‖, respectively.
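As an informal numerical illustration (not from the paper) of the spectral quantity S_2(f) = ∫_{R^d} ‖ω‖_1² |F[f](ω)| dω that Lemma 55 controls, the sketch below approximates it for a one-dimensional Gaussian bump via the FFT. With the Fourier convention F[f](ω) = ∫ f(x) e^{−iωx} dx used here, the exact value for f(x) = e^{−x²/2} is 2π; other conventions rescale this constant, and the grid size and truncation length are illustrative choices.

```python
import numpy as np

# f(x) = exp(-x^2/2); with F[f](w) = int f(x) e^{-iwx} dx = sqrt(2*pi) * exp(-w^2/2),
# S_2(f) = int |w|^2 |F[f](w)| dw = 2*pi  (in d = 1, ||w||_1^2 = w^2).
n, L = 1 << 14, 80.0                     # grid points and domain length
x = np.linspace(-L / 2, L / 2, n, endpoint=False)
dx = x[1] - x[0]
f = np.exp(-x ** 2 / 2)

w = 2 * np.pi * np.fft.fftfreq(n, d=dx)  # angular-frequency grid
F_mag = dx * np.abs(np.fft.fft(f))       # |F[f](w)|; the grid-offset phase drops out
dw = 2 * np.pi / (n * dx)

S2 = np.sum(w ** 2 * F_mag) * dw
print(S2, 2 * np.pi)                     # the two numbers agree to several digits
```

In higher dimensions the same check applies with a d-dimensional FFT and ‖ω‖_1² in place of ω².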
A.1 Proofs for Section 3 A.1.1 Proof of Theorem 8 For a = (a1, a2, a3, a4), we denote the set of feasible parameters of Gk(a, φ) by Θk(a), i.e., {βi, wi, bi}k i=1, w0, b0 : wi Rd, bi, βi R, max 1 i k wi 1 |bi| a1, max 1 i k |βi| a2, |b0| a3, w0 1 a4 (A.4) Also, throughout this section, we write gθ(x) to denote g(x) = Pk i=1 βiφ (wi x + bi) + w0 x + b0 with θ = {βi, wi, bi}k i=1, w0, b0 , whenever the underlying θ is to be emphasized. We prove the second claim in Theorem 8. The proof relies on arguments from (Barron, 1992) and (Barron, 1993), along with the uniform central limit theorem (CLT) for uniformly bounded VC-type classes. Fix an arbitrary (small) δ > 0, and let f : Rd R be such that f = f|X and X S1( f) f(0) a + δ. Such an f exists since c B(f, X) a. Then, since X is compact, it follows from the proof of (Barron, 1993, Theorem 2) that f0(x) := f(x) f(0) = Z ω Rd\{0} ϱ(x, ω)γ(dω), ϱ(x, ω) := L f, X supx X |ω x| cos(ω x + ζ(ω)) cos(ζ(ω)) , γ(dω) := supx X |ω x| F (dω) with L f, X := R Rd supx X |ω x| F (dω). Here F (dω) and ζ(ω) are the magnitude and phase of the complex Borel measure in the Fourier representation of f, respectively. Note that γ defined above is a probability measure on Rd. Let Θ := Θ k, L f, X := Θ1 k1/2 log k, 2L f, X , 0, 0 (see (A.4)). Then, it further follows from the proofs of (Barron, 1993, Lemma 2-Lemma 4,Theorem 3) that there exists a probability measure γk Pk := P Θ (see Barron, 1993, Eqns. (28)-(32)) such that f0 Z θ Θ g θ( ) γk d θ ,X L f, X k 1 Sreekumar and Goldfeld where g θ(x) = βφS w x + b for θ = ( β, w, b, 0, 0) and φS is the logistic sigmoid. The previous step needs further elaboration. The claims in (Barron, 1993, Lemma 2Lemma 4, Theorem 3) are stated for L2 norm, but it is not hard to see from the proof therein that the same also holds for sup-norm, apart from the following subtlety. In the proof of Lemma 3, it is shown that ϱ(x, ω), ω Rd, lies in the convex closure of a certain class of step functions, whose discontinuity points are adjusted to coincide with the continuity points of the underlying measure η. While this can be shown to account for universal approximation under the essential supremum w.r.t. η, to obtain a sup-norm bound one additional step is needed. Specifically, by using modified step functions whose value at 0 is 0.5 (instead of 1), using their linear combinations for approximation of the target function in Lemma 3, and subsequently replacing each such step function by sigmoids with coinciding values at zero, it can be seen that ϱ(x, ω) lies in the point-wise closure of convex hull of the desired sigmoid function class. Next, for each fixed x, let υx : Θ R be given by υx( θ) := βφS w x + b for θ = ( β, w, b, 0, 0), and consider the function class Fk := υx, x Rd . Note that every υx Fk is a composition of an affine function in ( w, b) with the bounded monotonic function βφS( ). Hence, (Van Der Vaart and Wellner, 1996, Lemma 2.6.15, Lemma 2.6.18) yields that Fk is a VC type class with index at most d + 3 for each k N. Hence, it follows from (Van Der Vaart and Wellner, 1996, Theorem 2.6.7) that for every 0 < ϵ 1, sup γ Pk N 2ϵL f, X , Fk, 2,γ sup γ P N 2ϵL f, X , F , 2,γ (d + 3)(16e)d+3ϵ 2(d+2). Moreover, by (Van Der Vaart and Wellner, 1996, Theorem 2.8.3), Fk is a uniform Donsker class (in particular, γk-Donsker) for all probability measures γ Pk. 
Consequently, the uniform CLT (Dudley, 1999) applied to a VC-type class uniformly bounded by 2L f, X yields that there exists k parameter vectors, θi := ( βi, wi, bi, 0, 0) Θ, 1 i k, such that (see also Yukich et al., 1995, Theorem 2.1) θ Θ g θ( ) γk(d θ) 1 i=1 g θi( ) ,Rd d 1 2 L f, X k 1 The RHS above is independent of γk and depends on f and X only through L f, X . From (A.5)-(A.6) and triangle inequality, we obtain f0 1 ,X d 1 2 L f, X k 1 Setting θ = ( βi/k, wi, bi) k i=1, 0, f(0) and gθ(x) = k 1 Pk i=1 βiφS( wi x + bi) + f(0) and noting that L f, X X S1( f) by Cauchy-Schwartz, we have f gθ ,X d 1 2 X S1( f)k 1 2 d 1 2 (a + δ)k 1 Next, note that f gθ ,X = f gθ ,X and gθ GS k X S1( f) f(0) GS k (a + δ). Since δ > 0 is arbitrary and φS is continuous, we obtain that there exists gθ GS k (a) with f gθ ,X ad 1 2 k 1 Neural Estimation of Statistical Divergences A.1.2 Proof of Proposition 10 To prove the first claim, consider f Cs KB b (U) such that f = f|X . By Theorem 8, it suffices to show that there exists an extension fext of f from U to Rd such that X S2(fext) |fext(0)| fext(0) 1 cb,d, X . Let α|j denote a multi-index of order j. Consider an extension of Dα|s KB f from U to Rd, which is zero outside U. Fixing Dα|s KB f on Rd induces an extension of all lower order derivatives Dα|jf, 0 j < s KB to Rd, which can be defined recursively as Dα|1Dα|s KB j f(x) = Dα|1+α|s KB j f(x), x Rd, for all α|1, α|s KB j and 1 j s KB. Let U := x Rd : x X, x x < 1 and first assume the strict inclusion U U . In that case, the mean value theorem yields that for any x, x U and 1 j s KB, we have Dα|s KB j f(x ) Dα|s KB j f(x) + d max x U , α|1 Dα|s KB j+α|1 f( x) x x , (A.8) where we also used the fact that x x 1 d x x . Further, note that Dα|s KB f ,U b (Dα|s KB f equals zero outside U), and since f Cs KB b (U), we have Dα|s KB j f(x) ,U b. Then, for any x U , taking x X with x x 1 (such an x exists by definition of U ) in (A.8) yields Dα|s KB 1 f(x ) b + b d. Having this, we recursively apply (A.8) to obtain for 1 j s KB that Dα|s KB j f ,U b 2 + bd j 2 b1 d s KB 2 =: b. (A.9) If U U, then Dα|s KB j f ,U b by definition since f Cs KB b (U). Hence, (A.9) holds in both cases as b b. The desired final extension is fext := f fc, where fc is the smooth cut-offfunction fc(x) := 1X Ψ 1 Rd 1X (y)Ψ 1 2 (x y)dy, x Rd, (A.10) with X := x Rd : x X, x x 0.5 and Ψ(x) exp 1 0.5 x 2 1{ x <0.5} as the canonical mollifier normalized to have unit mass. Since Ψ C Rd , we have fc C Rd . Also, observe that fc(x) = 1 for x X, fc(x) = 0 for x Rd \ U and fc(x) (0, 1) for x U \ X. Hence, fext(x) = f(x) for x X, fext(x) = 0 for x Rd \ U and |fext(x)| f(x) for x U \X, thus satisfying fext|X = f|X = f as required. Moreover, for all 0 j s KB, we have Dα|jfext(x) = 0, for x / U , and Dα|jfext ,U 2j b max α: α 1 j Dαfc ,U 2s KB b max α: α 1 s KB DαΨ ,Bd(0.5) =: ˆb, (A.11) where the first inequality follows using product rule for differentiation and (A.9), while the second is due to (A.10). Consequently, for 0 j s KB and i = 1, 2, we have Dα|jfext i i = Z U (Dα|jfext)i(x)dx ˆbi λ Bd(rad(X) + 1) = ˆbi π d 2 Γ(0.5d + 1) rad(X) + 1 d, Sreekumar and Goldfeld where λ denotes the Lebesgue measure, rad(X) = 0.5 supx,x X x x , and Γ denotes the gamma function. 
Defining b := ˆbdπd/2Γ(d/2 + 1) 1 rad(X) + 1 d and noting that b ˆb, we have from (A.11)-(A.12) that fext Ls KB,b Rd , where Ls KB,b Rd := f L1 Rd L2 Rd : |f(0)| b , Dαf 2 b for 1 α 1 s KB f(0) 1 b , D αf 1< for α 1 s KB Since Ls KB,b Rd LKB s KB,b Rd (see (A.1)), Lemma 55 yields S2(fext) κdd3/2b and fext B cb,d, X ,2,X Rd Ls KB,b Rd B cb,d, X ,2,X Rd LKB s KB,b Rd , (A.14) cb,d, X := (κdd 3 2 X 1) 2 + 1 1 rad(X) + 1 d2s KBbd max α 1 s KB DαΨ ,Bd(0.5) | {z } =:b and κ2 d := d + ds B R Rd 1 + ω 2(s B 1) 1 dω. It then follows from Theorem 8 that there exists g GR k cb,d, X such that f g ,X cb,d, X d1/2k 1/2. This proves the first claim of the proposition. Repeating the same arguments starting with f Cs B b (U), the second claim follows again from Theorem 8, thus completing the proof. A.1.3 Proof of Theorem 11 We require the following theorem which gives a tail probability bound for the deviation of supremum of a sub-Gaussian process from its associated entropy integral. Theorem 56 (Van Handel, 2016, Theorem 5.29) Let (Xθ)θ Θ be a separable sub-Gaussian process on the metric space (Θ, d). Then, there exists c > 0 such that for any θ0 Θ and δ 0, we have P sup θ Θ Xθ Xθ0 c Z log N(ϵ, Θ, d)dϵ + δ ce δ2 cdiam(Θ,d)2 , where diam(Θ, d) := sup θ, θ Θ d(θ, θ). We will also use the following lemma which bounds the covering number of Gk(ak, φ) w.r.t. to metric induced by ,X . Lemma 57 Let φ be a continuous monotone activation whose Lipschitz constant is bounded by L, and Ua,X (φ) := φ a( X + 1) φ a( X + 1) . Then N ϵ, Gk(ak, φ), ,X 1 + 10ka2,k Ua1,k,X (φ)ϵ 1 k 1 + 10a4,k X ϵ 1 d 1 + 10a3,kϵ 1 Neural Estimation of Statistical Divergences 1 + 10Lka1,ka2,k X ϵ 1 dk 1 + 10Lka1,ka2,k X ϵ 1 k. In particular, for φ {φR, φS}, we have N ϵ, GR k (a), ,X 1 + 20a( X + 1)ϵ 1 (d+2)k+d+1, (A.16) N ϵ, GS k(a), ,X 1 + 20a( X + 1)k 1 2 (log k + 1)ϵ 1 (d+2)k+1 , (A.17) N ϵ, G k(φ), ,X 1 + 10k( X + 1)ϵ 1 (d+2)k+1. (A.18) The proof of Lemma 57 (see Appendix B.2) is based on the fact that the covering number of Bm d (r) w.r.t. m norm, m 1, satisfies N ϵ, Bm d (r), m 2rϵ 1 + 1 d . (A.19) Continuing with the proof of Theorem 11, we will show that the claim holds with Vk,h,φ,X C |G k(φ)|, X 2 C h G k(φ) , X + 1 2 , (A.20) Ek,h,φ,X k p d( X + 1) C h G k(φ) , X + 1 q C |G k(φ)|, X , (A.21) where we recall that C (|F|, X) := supx X,f F |f(x)|. In the following, we will suppress the dependence of φ, h, and X for simplicity (unless explicitly needed), e.g., Gk(ak) instead of Gk(ak, φ). Fix µ, ν P(X) such that Dh,Gk(ak)(µ, ν) < . We have ˆDh,Gk(ak)(xn, yn) Dh,Gk(ak)(µ, ν) = sup gθ Gk(ak) i=1 gθ(xi) 1 i=1 h gθ(yi) sup gθ Gk(ak) Eµ gθ Eν h gθ ! sup gθ Gk(ak) i=1 gθ(xi) 1 i=1 h gθ(yi) Eµ gθ + Eν h gθ . (A.22) Consider the stochastic process (Zgθ)gθ Gk(ak) defined by i=1 gθ(Xi) 1 i=1 h gθ(Yi) Eµ gθ + Eν h gθ . (A.23) To apply Theorem 56, we now show that (Zgθ)gθ Gk(ak) is a separable sub-Gaussian process on Gk(ak), dk,ak,n , where dk,ak,n will be defined below. Note that E [Zgθ] = 0 for all gθ Gk(ak), and gθ(Xi) g θ(Xi) Eµ gθ g θ h gθ(Yi) h g θ(Yi) Eν h gθ h g θ . (A.24) Sreekumar and Goldfeld By an application of the mean value theorem, we have for all gθ, g θ Gk(ak), h gθ(x) h g θ( x) C h Gk(ak) gθ(x) g θ( x) . (A.25) Hence, we have that almost surely gθ(Xi) g θ(Xi) Eµ gθ g θ + 1 h gθ(Yi) h g θ(Yi) Eν h gθ h g θ h gθ(Xi) g θ(Xi) + Eµ gθ g θ + h gθ(Yi) h g θ(Yi) + Eν h gθ h g θ i 2n 1 C h Gk(ak) + 1 gθ g θ ,X . (A.26) Let dk,ak,n gθ, g θ := Rk,ak gθ g θ ,X n 1 2 , where Rk,ak := 2 C (|h Gk(ak)|) + 1 . 
Then, it follows from (A.24) and (A.26) via Hoeffding s lemma that E et Zgθ Zg θ e 1 2 t2dk,ak,n gθ,g θ 2 . Thus, (Zgθ)gθ Gk(ak) is a separable sub-Gaussian process on the metric space Gk(ak), dk,ak,n , where the separability follows from (A.26) by the denseness of the countable subset of Gk(ak) obtained by quantizing each of the finite number of bounded NN parameters to rational numbers (recall that a finite union of countable sets is countable and the activation φ is assumed continuous). Specializing to the NN class G k(φ) := Gk(a , φ), we next bound its covering number w.r.t. dk,a ,n, where a = (1, 1, 1, 0). We have N ϵ, G k, dk,a ,n := N ϵ, G k, Rk,a n 1 = N ϵ/(Rk,a n 1 2 ), G k, ,X 1 + 10k( X + 1)Rk,a n 1 2 ϵ 1 (d+2)k+1, where the last inequality uses (A.18). Also, we have that N ϵ, G k, dk,a ,n = 1 for ϵ diam G k, dk,a ,n := maxgθ,g θ G k dk,a ,n gθ, g θ . Then, log N ϵ, G k, dk,a ,n dϵ = Z diam G k,dk,a ,n log N ϵ, G k, dk,a ,n dϵ kd Z diam G k,dk,a ,n log 1 + 10k( X + 1)Rk,a n 1 d( X + 1)Rk,a q C |G k| n 1 where the last step uses log(1+x) x, x 1, and diam G k, dk,a ,n 2Rk,a C |G k| n 1/2. Neural Estimation of Statistical Divergences It follows from Theorem 56 with Z0 = 0 and the definitions of Vk and Ek (see (A.20) and (A.21)) that there exists a constant c > 0 such that sup gθ G k Zgθ c Ekn 1 cdiam(G k,dk,a ,n) 2 = ce nδ2 Noting that this also holds with Zgθ in place of Zgθ, the union bound gives sup gθ G k |Zgθ| δ + c Ekn 1 From (A.22)-(A.23) and the above equation, we obtain that for δ 0 P Dh,G k(µ, ν) ˆDh,G k(Xn, Y n) δ+c Ekn 1 sup gθ G k |Zgθ| δ + c Ekn 1 Taking supremum over µ, ν P(X) such that Dh,G k(µ, ν) < yields (3.3). By following similar steps with (A.16) and (A.17) in place of (A.18), we have for Gk GR k (a), GS k(a) that P Dh,Gk(µ, ν) ˆDh,Gk(Xn, Y n) δ + c Ek,a,h,X n 1 2 2 c e nδ2/ Vk,a,h,X , (A.27) where Vk,a,h,X C |GR k (a)|, X 2 C h GR k (a) , X + 1 2 and Ek,a,h,X dk log ka( X + 1) C h GR k (a) , X + 1 . To establish (A.27), we use log 1 + Aϵ 1 δ q log (A + δ)/δ , (A.28) for A e and 0 δ 1, which can be shown via integration by parts. A.1.4 Proof of Theorem 12 We establish a more general upper bound with Gk replaced by an arbitrary VC-type class Fk satisfying certain assumptions. This result is also applicable to deep NNs with finite width in each layer, continuous activation and bounded parameters, and hence, may be of independent interest. Theorem 58 (Estimation error bound) Let µ, ν P(X) and Xn µ n and Y n ν n. Suppose h : R R and (Fk)k N (with domain X) satisfy the following conditions for each k N: (i) h is differentiable at every point in C(Fk, X), C (Fk, X) with derivative h ; (ii) C (|h Fk| , X) C (|Fk| , X) < ; (iii) Fk is a VC-type class with constants lvc(Fk) e and uvc(Fk) 1 satisfying (2.11) w.r.t. a constant envelope Mk (note that this implies C(|Fk|, X) Mk); Sreekumar and Goldfeld (iv) Fk is point-wise measurable, i.e., there exists a countable subclass F k Fk of measurable functions such that for any f Fk, there is a sequence of functions {fj}j N F k for which limj fj(x) = f(x), x X. Then, for every k, n N, we have sup µ,ν P(X): Dh,Fk(µ,ν)< E h ˆDh,Fk(Xn, Y n) Dh,Fk(µ, ν) i Mk C h Fk , X + 1 n 1 sup γ P(X) log N (Mkϵ, Fk, 2,γ)dϵ (A.29) uvc(Fk) log lvc(Fk) 1 2 Mk C h Fk , X + 1 n 1 The proof of this theorem is based on standard maximal inequalities from empirical process theory, and is presented in Appendix A.1.5 below. 
To prove (3.4), we first verify that the relevant assumptions given in Theorem 58 hold with Fk = GR k (a) and a constant envelope Mk = 3a( X +1). The proof for GS k(a) is similar, and hence omitted. Conditions (i) and (ii) are satisfied by the hypotheses in the theorem. Condition (iii) holds as sup γ P(X) N Mkϵ, GR k (a), 2,γ N Mkϵ, GR k (a), ,X 1 + 7ϵ 1 (d+2)k+d+1, (A.31) for any 0 < ϵ 1, where the last inequality follows from (A.16). To verify condition (iv), note that g GR k (a) is measurable since it is a finite linear combination of compositions of an affine function with a continuous activation. Moreover, point-wise measurability of GR k (a) follows by the continuity of activation and the fact that each of the finite number of parameters of GR k (a) can be approximated arbitrary well by rational numbers. Next, we evaluate the entropy integral term in (A.29) by bounding N Mkϵ, GR k (a), 2,γ . For this purpose, let G k(a) := Gk(1, 2k 1a, 0, 0, φR). For any gθ, g θ GR k (a), where gθ = Pk i=1 βiφR (wi x + bi) + w0 x + b0 and g θ = Pk i=1 βiφR wi x + bi + w0 x + b0, we have i=1 βiφR (wi x + bi) i=1 βiφR( wi x + bi) 2,γ + w0 w0 1 X +|b0 b0|. N ϵ, GR k (a), 2,γ N ϵ/3, G k(a), 2,γ N(ϵ/3, B1 d(a), X 1)N ϵ/3, B1(a), | | N ϵ/3, G k(a), 2,γ (1 + 6a( X + 1)ϵ 1)d+1, (A.32) where (A.32) uses (A.19). Consider g = Pk i=1 βiφR (wi x + bi) G k(a). Let F = 2aφR F, where F = {f : X R : f = w x+b, w Rd, b R, w 1 |b| 1}. By considering the d coordinate projections fi(x) = xi, 1 i d, and fd+1(x) = 1 spanning the finite dimensional vector space F, we Neural Estimation of Statistical Divergences have from (Van Der Vaart and Wellner, 1996, Lemma 2.6.15) that F is a VC subgraph class of index atmost d + 3. This uses the fact that if F is a VC subgraph class of index v and φ is monotone, then φ F is a VC subgraph class of index at most v, which follows from the proof of (Van Der Vaart and Wellner, 1996, Lemma 2.6.18 (viii)). Then, (Van Der Vaart and Wellner, 1996, Theorem 2.6.7) yields N 2a( X +1)ϵ, F , 2,γ (d+3)(16e)d+3ϵ 2(d+2), where F := F F. Further, by a careful inspection of the proof of (Gin e and Nickl, 2015, Theorem 3.6.17), we obtain that log N 2a( X + 1)ϵ, co(F ), 2,γ dϵ 2(d+2)/(d+3), where co(F ) denotes the sequential closure of the convex hull of F given by co(F ) := {Pk i=1 λifi : fi F , Pk i=1 λi = 1, λi 0, 1 i k, k N}. Since G k(a) co(F ), we have log N 2a( X + 1)ϵ, G k(a), 2,γ dϵ 2(d+2)/(d+3). Hence, Z 1 sup γ P(X) log N Mkϵ, GR k (a), 2,γ dϵ sup γ P(X) log N Mkϵ/3, G k(a), 2,γ dϵ + log 1 + 2ϵ 1 dϵ 0 (0.5ϵ) (d+2)/(d+3)dϵ + log 1 + 2ϵ 1 dϵ d 3 2 , (A.33) where (a) uses (A.32) and x + y x + y for x, y 0; and the second integral in the penultimate step can be evaluated by applying log(1 + x) x for x 0. This completes the proof of (3.4) via (A.29). A.1.5 Proof of Theorem 58 To simplify notation, we will denote C (|Fk| , X) by C (|Fk|). Fix µ, ν P(X) such that Dh,Fk(µ, ν) < . Note that ˆDh,Fk(Xn, Y n) Dh,Fk(µ, ν) 1 n sup f Fk f(Xi) Eµ[f] h f(Yi) + Eν[h f] . Let µn and νn denote the empirical measures n 1 Pn i=1 δXi and n 1 Pn i=1 δYi, where δx denotes the Dirac measure centered at x X. 
Then, we have E h ˆDh,Fk(Xn, Y n) Dh,Fk(µ, ν) i sup f Fk n 1 f(Xi) Eµ[f] sup f Fk n 1 i=1 h f(Yi) Eν[h f] log N ϵ, Fk, 2,µn dϵ + Z log N ϵ, h Fk, 2,νn dϵ sup γ P(X) log N (ϵ, Fk, 2,γ)dϵ sup γ P(X) log N ϵ, Fk, C (|h Fk|) 2,γ dϵ Sreekumar and Goldfeld sup γ P(X) log N (ϵ, Fk, 2,γ)dϵ 2 Z 2Mk C(|h Fk|) sup γ P(X) log N ϵ C (|h Fk|) 1, Fk, 2,γ dϵ Mk C h Fk + 1 n 1 sup γ P(X) log N (Mkϵ, Fk, 2,γ)dϵ (A.34) (d) Mk(uvc(Fk)) 1 2 C h Fk + 1 n 1 log 1 + lvc(Fk)ϵ 1 dϵ (e) Mk uvc(Fk) log lvc(Fk) 1 2 C h Fk + 1 n 1 (a) follows via an application of (Van Der Vaart and Wellner, 1996, Corollary 2.2.8) since for fixed (Xn, Y n) = (xn, yn), Hoeffding s inequality implies that n 1 2 Pn i=1 σif(xi) and n 1 2 Pn i=1 h f(yi)σi are sub-Gaussian w.r.t. pseudo-metrics 2,µn and 2,νn, respectively; (b) is due to N ϵ, h Fk, 2,γ N ϵ, Fk, C h Fk 2,γ = N ϵ C h Fk 1, Fk, 2,γ , (A.36) which in turn follows from (A.25), and taking supremum w.r.t. to γ P(X); (c) follows since N ϵ, Fk, C (|h Fk|) 2,γ = 1 for ϵ 2Mk C (|h Fk|), N (ϵ, Fk, 2,γ) = 1 for ϵ 2Mk, and N ϵ, Fk, C (|h Fk|) 2,γ = N ( C (|h Fk|)) 1ϵ, Fk, 2,γ (note that both sides equal 1 when C (|h Fk|) = 0). (d) is because Fk is assumed to be a VC-type class with constants lvc(Fk) e and uvc(Fk) corresponding to envelope Mk; (e) is since R δ 0 p log(A/ϵ)dϵ δ p log(A/δ) for A e and 0 δ 1, which in turn follows via integration by parts. Taking supremum on both sides of (A.35) over µ, ν such that Dh,Fk(µ, ν) < proves (A.30). A.2 Proofs for Section 4 A.2.1 Proof of Theorem 14 Let DGk(ak,φ)(µ, ν) := Dh KL,Gk(ak,φ)(µ, ν) be the parametrized (by the NN class Gk(ak, φ)) KL divergence. We will use the following lemma which proves consistency of parametrized KL divergence estimator. Neural Estimation of Statistical Divergences Lemma 59 (Parametrized KL divergence estimation) Let (µ, ν) P2 KL(X). Then, for any 0 < ρ < 1, and n, kn, such that k3/2 n ( X + 1)ekn( X +1) = O n(1 ρ)/2 , ˆDG k(φ)(Xn, Y n) n DG k(φ)(µ, ν), P a.s. (A.37) Lemma 59 is proven using Theorem 11; see Appendix B.3 for details. We proceed with the proof of (4.2). Since X is compact and f KL C (X), it follows from (Stinchcombe and White, 1990, Theorem 2.1 and 2.8) that for any ϵ > 0, there is a k0(ϵ) N, such that for any k k0(ϵ), there exists a gθk G k(φ) with f KL gθk ,X ϵ. (A.38) This implies lim k DG k(φ)(µ, ν) = DKL (µ ν) . (A.39) To see this, note that DG k(φ)(µ, ν) DKL (µ ν) , k N, (A.40) by (2.2) since g G k(φ) is continuous and bounded ( g ,X k( X + 1) + 1 2k + 1 for X = [0, 1]d). Moreover, the left-hand side (LHS) of (A.40) is monotonically increasing in k, and being bounded, it has a limit point. Thus, to establish (A.39), it suffices to show that this limit point is DKL (µ ν). Assume to the contrary that limk DG k(φ)(µ, ν) < DKL (µ ν). Note that G k(φ) is a compact set and hence the supremum in the variational form of the LHS of (A.40) is a maximum. Then, defining D(g) := 1 + Eµ[g] Eν[eg], it follows that there exists δ > 0 and g θk arg maxgθ G k(φ) D(gθ) such that for all k, DKL (µ ν) D g θk δ. (A.41) However, we have for all k k0(ϵ) that DKL (µ ν) D(g θk) DKL (µ ν) D(gθk) Eµ [|f KL gθk|] + Eν h ef KL egθk i Eµ [|f KL gθk|] + Eν 1 egθk f KL ,ν ϵ + eϵ 1, where the final inequality follows from (A.38) and Eν [dµ/dν] 1. Then, taking ϵ sufficiently small contradicts (A.41), thus proving (A.39). From this and (A.37) with k = kn , (4.2) follows since k3/2ek( X +1) < ek(2+δ) for X = [0, 1]d, any δ > 0, and k sufficiently large. Next, we prove (4.3). 
Fix (µ, ν) P2 KL(M, X), and with some abuse of notation, let m = (mk)k N be a non-decreasing positive divergent sequence, and note that since c KB (f KL, X) Sreekumar and Goldfeld M, we have from (3.1) that for k such that mk M, there exists gθk GR k (mk) and c > 0 satisfying f KL gθk ,X cd 1 2 Mk 1 Also, since gθk GR k (mk) is bounded, we have that DKL (µ ν) DGR k(mk)(µ, ν). Then, the following hold for k such that mk M and c2d M2 k/2: DKL (µ ν) DGR k(mk)(µ, ν) = DKL (µ ν) DGR k(mk)(µ, ν) Eµ f KL gθk + 1 egθk f KL ,νEν h ef KL i where the last bound follows from (A.42), Eν ef KL = Eν [dµ/dν] = 1, and since 1 egθk f KL ,ν cd 1 2 Mk 1 cd 1 2 Mk 1 2 j d 1 2 Mk 1 Next, note that DGR k(mk)(µ, ν) 0 as g = 0 GR k (mk). This implies that for k with mk < M or c2d M2 > k/2, we have DKL (µ ν) DGR k(mk)(µ, ν) DKL (µ ν) M. Consequently DKL (µ ν) DGR k(mk)(µ, ν) m,M d 1 2 k 1 On the other hand, since C GR k (mk) , X 3mk( X + 1) and C h KL GR k (mk) , X e3mk( X +1), it follows from the above, (A.29) and (A.33) that E h ˆDGR k(mk)(Xn, Y n) DKL (µ ν) i DGR k(mk)(µ, ν) DKL (µ ν) + E h DGR k(mk)(µ, ν) ˆDGR k(mk)(Xn, Y n) i m,M d 1 2 k 1 2 + d 3 2 mk( X + 1)e3mk( X +1) n 1 Since X = 1, choosing mk = log log k 1 in (A.44) yields E h ˆDGR k(mk)(Xn, Y n) DKL (µ ν) i M d 1 2 k 1 2 + d 3 2 (log k)7 n 1 Noting that the above bound holds independent of (µ, ν) P2 KL(M, X), the proof is completed by taking supremum w.r.t. such µ, ν. Note that setting mk = M in (A.44) and taking supremum w.r.t. such µ, ν yields (4.4) from Remark 16. A.2.2 Proof of Corollary 18 To show that the minimax risk is Ω(n 1/2), it suffices to consider d = 1. Recall that the minimax risk for differential entropy estimation over the class of one-dimensional Gaussian distributions with unknown variance in a non-empty interval is Ω(n 1/2) (cf., e.g., Appendix Neural Estimation of Statistical Divergences A, Goldfeld et al., 2020b). Take X = [a, b], for some a, b R with b > a, and let PAC(X) be the class of (Lebesgue absolutely continuous) distributions on X. The differential entropy of µ PAC(X) is defined as h(µ) := Eµ [log (dµ/dλ)], and can be equivalently written as h(µ) = log(b a) DKL µ u[a,b] , where u[a,b] is the uniform distribution on X. Hence, the minimax rate of KL divergence estimation for distributions in the class {(µ, u[a,b]) : µ PAC(X)} for any PAC(X) PAC(X) is the same as that of differential entropy estimation for distributions in PAC(X). Let PTG(X) PAC(X) be a class of truncated Gaussians supported on X with zero mean and variance in an non-empty interval. Note that the minimax rate for differential entropy estimation over PTG(X) equals to that over untrucated Gaussian distribution with zero mean and the same variance constraints. This is since both differential entropies are elementary functions of the variance parameter, when the mean (equals zero) and a, b are given. By Proposition 19 (see Remark 20), P2 KL(M, X) contains pairs of truncated Gaussians (with variance and means within an interval that depends on M) and uniform distributions, which implies that the associated KL divergence minimax estimation risk is Ω(n 1/2). The corollary then follows by noting that the NE achieves O n 1/2 error rate by setting k = n in (4.4). A.2.3 Proof of Proposition 19 The proof of Proposition 10 (see (A.14)) shows that there exists extensions pext, qext B cb,d, X ,2,X Rd LKB s KB,b Rd of p, q, respectively, where cb,d, X = (κdd3/2 X 1)b , with b as defined in (A.15). 
Set f ext KL := pext qext, and note that since pext, qext LKB s KB,b Rd , their Fourier transforms exist and the corresponding Fourier inversion formulas hold (see proof of Lemma 55). Also, we have S2 f ext KL X S2 ( pext) X + S2 ( qext) X 2 cb,d, X , where the first inequality uses the definition in (2.9) and linearity of the Fourier transform, while the second is because pext, qext B cb,d, X ,2,X Rd . Moreover, note that DKL (µ ν) = Eµ [f KL] = Eµ [log p log q] 2b, where the final inequality is due to log p = p|X and log q = q|X , for p, q Cs KB b (U). Lastly, since f KL = f ext KL X , it follows that (µ, ν) P2 KL(M, X) with M = 2 cb,d, X 2b, and the proposition then follows from Theorem 14. A.2.4 Proof of Theorem 22 Let χ2 Gk(ak,φ)(µ, ν) := Dhχ2,Gk(ak,φ)(µ, ν). We will use the lemma below which proves consistency of parametrized χ2 divergence estimator (see Appendix B.4 for proof). Lemma 60 (Parametrized χ2 divergence estimation) Let (µ, ν) P2 χ2(X). Then, for any 0 < ρ < 1, and n, kn such that k5/2 n ( X + 1)2 = O n(1 ρ)/2 , we have ˆχ2 G k(φ)(Xn, Y n) n χ2 G k(φ)(µ, ν), P a.s. (A.45) Sreekumar and Goldfeld Proceeding with the proof of Theorem 22, (4.5) follows from (A.45) using arguments similar to those used to establish (4.2) and steps leading to (A.49) below; details are omitted. To prove (4.6), fix (µ, ν) P2 χ2(M, X), and let m = (mk)k N be a non-decreasing positive divergent sequence. Since c KB fχ2, X M, we have from (3.1) that for k such that mk M, there exists gθk GR k (mk) with fχ2 gθk ,X Md 1 2 k 1 Also, χ2 (µ ν) χ2 GR k(mk)(µ, ν) because g GR k (mk) is bounded. Then, we have χ2 (µ ν) χ2 GR k(mk)(µ, ν) = χ2 (µ ν) χ2 GR k(mk)(µ, ν) χ2 (µ ν) Eµ gθk Eν h gθk + 0.25g2 θk i Eµ fχ2 gθk + Eν h fχ2 gθk + 0.25 f2 χ2 g2 θk i (A.47) 2 + Eν 0.25 fχ2 gθk fχ2 + gθk 2 + Eν h 0.25 fχ2 gθk 2+0.5 fχ2 gθk fχ2 i 2 + M2dk 1 + 0.5 fχ2 gθk ,νEν |fχ2| (A.48) M(M + 1)dk 1 where the final inequality is due to (A.46) and since Eν fχ2 Eν [2(dµ/dν) + 2] 4. Since g = 0 GR k (mk), for k such that mk < M, we have χ2 (µ ν) χ2 GR k(mk)(µ, ν) = χ2 (µ ν) χ2 GR k(mk)(µ, ν) χ2 (µ ν) M. Hence, χ2 (µ ν) χ2 GR k(mk)(µ, ν) m,M dk 1 2 , k N. (A.49) Since C GR k (mk) , X 3mk( X + 1) and C h χ2 GR k (mk) , X 1.5mk( X + 1) + 1, E h ˆχ2 GR k(mk)(Xn, Y n) χ2 (µ ν) i χ2 GR k(mk)(µ, ν) χ2 (µ ν) + E h χ2 GR k(mk)(µ, ν) ˆχ2 GR k(mk)(Xn, Y n) i 2 + d 3 2 m2 k( X + 1)2n 1 where the last inequality uses (A.29), (A.33) and (A.49). Setting mk = log k in (A.50) and taking supremum w.r.t. (µ, ν) P2 χ2(M, X) yields (4.6). A.2.5 Proof of Proposition 25 It follows from (A.14) that there exists extensions pext, qext B cb,d, X ,2,X Rd Ls KB,b Rd of p, q Cs KB b (U), respectively, where Ls KB,b Rd is defined in (A.13) and cb,d, X := Neural Estimation of Statistical Divergences (κdd3/2 X 1)b , with b from (A.15). Let f ext χ2 = 2 pext qext 1 and recall that α|j denotes a multi-index of order j. We have from the product rule of derivatives that for any j Z 0, Dα|jf ext χ2 = 2 X α|j1+α|j2=α|j α|j! α|j1! α|j2!Dα|j1 pext Dα|j2 qext Dα|j2, where α! := Qd i=1 αi!. Also, note from (A.11) and (A.12) that for 0 j s KB, pext, qext satisfies Dα|j pext ,Rd Dα|j qext ,Rd ˆb b , Dα|j pext 2,Rd Dα|j qext 2,Rd b . Combining these observations, we have for 0 j s KB that Dα|jf ext χ2 2,Rd 2 + 2 α|j1+α|j2=α|j α|j! α|j1! α|j2!Dα|j1 pext Dα|j2 qext 2 + 2j+1b 2. (A.52) Similarly, we have Dα|jf ext χ2 1 < for 0 j s KB. Hence, f ext χ2 Ls KB,2+2s KB+1b 2 Rd . From Lemma 55, it follows that S2 f ext χ2 (2 + 2s KB+1b 2)κdd3/2. 
Since fχ2 = f ext χ2 X , this implies that c KB fχ2, X (2 + 2s KB+1b 2)(κdd3/2 X 1). Also, χ2 (µ ν) = Eν h pq 1 1 2i Eν p2q 2 + 1 b2 + 1. The claim then follows from Theorem 22 by noting that b 2 c2 b,d, X and (µ, ν) P2 χ2(M, X) with M = 2 + 2s KB+1 c2 b,d, X (κdd3/2 X 1) (b2 + 1). A.2.6 Proof of Theorem 27 Let H2 Gk,t(ak,φ)(µ, ν) := Dh H2, Gk,t(ak,φ)(µ, ν). We need the following lemma (see Appendix B.5 for proof) which shows that parametrized H2 distance estimation is consistent. Lemma 61 (Parametrized H2 distance estimation) Let (µ, ν) P2 H2(X). Then, for any 0 < ρ < 1, tn 0, and n, kn, such that k3/2 n ( X + 1)t 2 n = O n(1 ρ)/2 , ˆH2 G kn,tn(φ)(Xn, Y n) n H2 G kn,tn(φ)(µ, ν), P a.s. (A.53) Continuing with the proof of Theorem 27, we first prove (4.10). Fix (µ, ν) P2 H2(X). Recall that f H2 = 1 (dµ/dν) 1/2. Since dµ/dν ,η M by assumption, we have (1 f H2) ,η M 1/2. It follows from (Stinchcombe and White, 1990, Theorem 2.1 and 2.8) and the definition of G k,t(φ) that for any ϵ > 0, there exists k0(ϵ) N and gθk G k,M 1/2(φ) such that for all k k0(ϵ), f H2 gθk ,η ϵ. (A.54) Sreekumar and Goldfeld Then, noting that H2(µ, ν) H2 G k,M 1/2(φ)(µ, ν), we have H2(µ, ν) H2 G k,M 1/2(φ)(µ, ν) = H2(µ, ν) H2 G k,M 1/2(φ)(µ, ν) Eµ h f H2 gθk i + Eν h f H2(1 f H2) 1 gθk 1 gθk 1 i Eµ h f H2 gθk i +Eν h f H2 gθk (1 f H2) 1 1 gθk 1 i ϵ + Mϵ, (A.55) where the final inequality uses (A.54), 1 f H2 ,η 1 gθk ,η M 1/2. Since ϵ > 0 is arbitrary, this implies (similarly to (A.39) in Theorem 14) that lim k H2 G k,M 1/2(φ)(µ, ν) = H2(µ, ν). Then, (4.10) follows from (A.53) and (A.55). Next, we prove (4.11). Fix (µ, ν) P2 H2(M, X). By some abuse of notation, let m = (mk)k N and t = (tk)k N denote a non-decreasing positive divergent sequence and a non-increasing sequence tending to zero, respectively. Since dµ/dν ,η M, we have 1 f H2 ,η M 1/2. Using tk 0, it then follows from (3.1) that for k such that tk M 1/2 and mk M, there exists gθk GR k,tk(mk) with f H2 gθk ,η Md 1 2 k 1 Then, following the arguments leading to the penultimate step in (A.55), we have H2(µ, ν) H2 GR k,tk(mk)(µ, ν) Eµ f H2 gθk + Eν h f H2 gθk 1 f H2 1 1 gθk 1 i f H2 gθk ,µ + f H2 gθk ,ν Eν h 1 f H2 1 1 gθk 1 i M d 1 2 (1 + t 1 k )k 1 where the final inequality is due to (A.56), 1 gθk(x) tk for any x Rd, and Eν h 1 f H2 1 i = Eν Moreover, since g = 0 GR k,tk(mk), for k such that mk < M or tk > M 1/2, we obtain H2(µ, ν) H2 GR k,tk(mk)(µ, ν) = H2(µ, ν) H2 GR k,tk(mk)(µ, ν) H2(µ, ν) 2, where the last inequality follows from H2(µ, ν) = Eν Neural Estimation of Statistical Divergences Thus, for all k, we have H2(µ, ν) H2 GR k,tk(mk)(µ, ν) M,m,t d 1 2 (1 + t 1 k )k 1 Noting that C GR k,tk(mk) , X 3mk( X +1) and C h H2 GR k,tk(mk) , X t 2 k , it follows from (A.29), (A.33) and (A.57) that E ˆH2 GR k,tk(mk)(Xn, Y n) H2(µ, ν) H2(µ, ν) H2 GR k,tk(mk)(µ, ν) + E ˆH2 GR k,tk(mk)(Xn, Y n) H2 GR k,tk(mk)(µ, ν) M,m,t d 1 2 (1 + t 1 k )k 1 2 + d 3 2 ( X + 1)mkt 2 k n 1 Noting that the above bound holds for any (µ, ν) P2 H2(M, X), and setting mk = log k, tk = (log k) 1, we obtain (4.11). A.2.7 Proof of Proposition 30 As in the proof of Proposition 25, (A.14) yields that there exists extensions pext, qext B cb,d, X ,2,X Rd Ls KB,b Rd of p, q, respectively. Let f ext H2 = 1 pext qext. Then, following steps leading to (A.52), we obtain for 0 j s KB that Dα|jf ext H2 2 1 + α|j1+α|j2=α|j α|j! α|j1! α|j2!Dα|j1 pext Dα|j2 qext 2 1 + 2jb 2. Similarly, Dα|jf ext H2 1 < for 0 j s KB. 
Hence, f ext H2 Ls KB,1+2s KBb 2 Rd , which yields via Lemma 55 that S2 f ext H2 (1 + 2s KBb 2)κdd3/2. Since f H2 = f ext H2 X , this implies that c KB (f H2, X) (1+2s KBb 2)(κdd3/2 X 1). Moreover, we have dµ/dν ,η = pq 1 ,η b2. Hence, (µ, ν) P2 H2(M, X) with M = (κdd3/2 X 1) 1 + 2s KB c2 b,d, X b2 since b 2 c2 b,d, X . The claim then follows from Theorem 27. A.2.8 Proof of Theorem 32 Let δ Gk(a,φ)(µ, ν) := Dh TV, Gk(a,φ)(µ, ν). The proof of Theorem 32 is based on the following lemma which establishes consistency of the parametrized TV distance estimator (see Appendix B.6 for proof). Lemma 62 (Parametrized TV distance estimation) Let µ, ν P(X). Then, for any 0 < ρ < 1, and n, kn such that kn( X + 1)1/2 = O n(1 ρ)/2 , ˆδ G kn(φ)(Xn, Y n) n δ G kn(φ)(µ, ν) , P a.s. (A.59) Equipped with Lemma 62, we first prove (4.14). Since f TV is not continuous, the universal approximation property of NNs used in the consistency proofs until now cannot be used directly in this case. However, we will show that there exists a continuous function Sreekumar and Goldfeld approximating f TV to any desired accuracy, which can in turn be approximated by G k(φ) arbitrary well. Fix µ, ν P(X). Let p and q denote the densities of µ and ν w.r.t. η = 0.5(µ + ν) P(X), and let C be the set defined in (2.8). Note that p q ,η 2. Also, observe that C and X \ C are Borel sets, since p(x) and q(x) are Borel measurable by definition, and hence so is p(x) q(x). Since η P(X) is a regular probability measure, for any ϵ > 0, there exists compact sets C, C, open sets U, U such that C C U, C X \ C U and η(U \ C) η( U \ C) η( U C ) η(U (X \ C )) 0.25ϵ, along with continuous (Urysohn) functions ζC : Rd [0, 1], ζX\C : Rd [0, 1] such that ( 1, x C, 0, x Rd \ U, ( 1, x C, 0, x Rd \ U. Eµ [|1C ζC |] Eν [|1C ζC |] Eη [(p q) |1C ζC |] 0.25 p q ,η ϵ, (A.61) Eµ 1X\C ζX\C Eν 1X\C ζX\C Eη (p q) 1X\C ζX\C 0.25 p q ,η ϵ. (A.62) Let ζ(x) = ζC (x) ζX\C (x). Note that ζ(x) [ 1, 1], ζ(x) = 1 for x C\ U and ζ(x) = 1 for x C \ U. Since ζ( ) is a continuous function, it follows from (Stinchcombe and White, 1990, Theorem 2.1 and 2.8) that for any ϵ > 0 and k k0(ϵ), there exists a g G k(φ) such that ζ g ,X ϵ. Since ζ 1, it then follows from the definition of G k(φ) that there exists g G k(φ) such that ζ g ,X ϵ. (A.63) Let δTV(g) := Eµ[g] Eν [g]. Then, we have for k k0(ϵ) that δTV(µ, ν) δ G k(φ)(µ, ν) = δTV(µ, ν) δ G k(φ)(µ, ν) δTV(µ, ν) δTV(g ) Eµ [|f TV g |] + Eν [|f TV g |] Eµ [|f TV ζ| + |ζ g |] + Eν [|f TV ζ| + |ζ g |] Eµ [|1C ζC |] + Eν [|1C ζC |] + Eµ 1X\C ζX\C + Eν 1X\C ζX\C + Eµ [|ζ g |] + Eν [|ζ g |] ϵ p q ,η + 2 4ϵ, (A.64) Neural Estimation of Statistical Divergences where (A.64) follows from (A.61), (A.62), (A.63) and p q ,η 2. Since ϵ > 0 is arbitrary, we have from (A.64) that lim k δ G k(φ)(µ, ν) = δTV(µ, ν) . Taking kn, n satisfying kn = O n(1 ρ)/2 , (4.14) follows from the above equation and (A.59). Next, we prove (4.15). Fix (µ, ν) P2 TV(M, X) such that f TV Lips,1,M(X). Since f TV does not belong to the Klusowski-Barron class, we consider approximation of an intermediate function f(t) TV, which is a smoothed version of f TV and belongs to this class. The smoothing parameter t is then decreased as a function of k at an appropriate rate such that the L1 error between f(t) TV and f TV vanishes as k . For this purpose, consider a non-negative smoothing kernel Φ L1 Rd , Φ 0, such that R Rd Φ(x)dx = 1. Let Φt(x) := t dΦ(t 1x), t > 0, and f(t) TV(x) := f TV Φt(x) = Z Rd f TV(x y)Φt(y)dy, denote the smoothing of f TV using Φt. 
Recalling that δTV(f) := Eµ[f] Eν [f], we have δTV(µ, ν) δ Gk(a,φ)(µ, ν) = δTV(µ, ν) δ Gk(a,φ)(µ, ν) = δTV(µ, ν) δTV f(t) TV + δTV f(t) TV δ Gk(a,φ)(µ, ν) , (A.65) The first term in (A.65) can be written as follows: δTV(µ, ν) δTV f(t) TV = Eµ h f TV f(t) TV i Eν h f TV f(t) TV i . (A.66) Denoting by p, q, the respective densities of µ, ν w.r.t. Lebesgue measure, we have Eµ h f TV f(t) TV i = Z f TV(x) t d Z Rd f TV(y)Φ (x y)t 1 dy p(x)dx Rd f TV(x tu)Φ(u)du p(x)dx Rd [f TV(x)Φ(u) f TV(x tu)Φ(u)] du p(x)dx Rd |f TV(x) f TV(x tu)| p(x)dx Φ(u)du Rd |f TV(x + tu) f TV(x)| p(x + tu)dx Φ(u)du Rd |f TV(x + tu) f TV(x)| dx Φ(u)du Rd ξ1,1 (f TV, t u ) Φ(u)du Rd ts u s Φ(u)du, (A.67) Sreekumar and Goldfeld where (a) and (b) are due to (µ, ν) P2 TV(M, X) and f TV Lips,1,M(X), respectively. Since (A.67) also holds for ν in place of µ, we have from (A.66) that δTV(µ, ν) δTV f(t) TV 2M2 Z Rd ts u s Φ(u)du. (A.68) Next, note that f(t) TV 1 Z Rd |f TV(x y)Φt(y)| dydx (a) f TV 1 Φt 1 (a) follows from Minkowski s integral inequality; (b) is due to R Rd |Φt(y)| dy = 1; (c) is since f TV L1(X). Hence, the Fourier transform of f(t) TV exists, and is given by F h f(t) TV i = F[f TV]F[Φt]. (A.69) Choose Φ to be standard Gaussian kernel, i.e., Φ = ΦN := (2π) d/2e 0.5 x 2. Then, we have F h f(t) TV i 1 Rd |F[Φt](ω)| dω (b) M Z Rd |F[Φ](tω)| dω (c) M Z 2 t2 ω 2dω < , (a) follows from (A.69) and F [f TV] f TV 1; (b) is via the formula F Φ t 1 (ω) = td F[Φ] (tω), and f TV 1 M by the definition of Lipschitz seminorm; (c) is since F ΦN (ω) = e 1 Hence, the Fourier representation in Definition 4 holds via the Fourier inversion formula for f(t) TV. Then, we can bound the spectral norm as S2 f(t) TV := Z Rd ω 2 1 F h f(t) TV i (ω) dω Rd ω 2 1 |F[Φt](ω)| dω 2 t2 ω 2dω. Evaluating the integral above by converting to hyperspherical coordinates, we obtain X S2 f(t) TV X Md Z 2 t2 ω 2dω =: cd,M, X ,t, (A.70) Neural Estimation of Statistical Divergences cd,M, X ,t := ( (2π) 1 2 X Mt 3, d = 1, 2 π X Mdt (d+2)Γ((d + 2)/2) Qd 2 j=1 R π 0 sind 1 j(ϕj)dϕj, d 2. Moreover, f TV 1 and R Rd |Φt(y)| dy = 1 implies f(t) TV(x) Z Rd |f TV(x y)Φt(y)| dy Z Rd |Φt(y)| dy = 1. (A.71) Since f(t) TV(0) f(t) TV(0) 1 (2dπ d)1/2Γ 0.5(d + 1) t 1 and (A.70) holds, there exists gθk GR k ˆcd,M, X ,t such that for all 0 < t 1, f(t) TV gθk ,X ˆcd,M, X ,td 1 2 k 1 where ˆcd,M, X ,t := cd,M, X ,t 1 (2dπ d)1/2Γ 0.5(d + 1) t 1. The existence of gθk follows by truncating g GR k ˆcd,M, X ,t satisfying (3.1) to [ 1, 1], and noting that truncation only decreases the approximation error as f(t) TV 1. Hence, we have δTV f(t) TV δ GR k(ˆcd,M, X ,t)(µ, ν) δTV f(t) TV δTV gθk Eµ h f(t) TV gθk i + Eν h f(t) TV gθk i (A.73) ˆcd,M, X ,td 1 2 k 1 Next, observe that (A.68) with Φ = ΦN yields δTV(µ, ν) δTV f(t) TV cd,M,sts, where cd,M,s := 2M2(2π) d/2 R Rd u s e 0.5 u 2du. From this, (A.65) and (A.74), we obtain δTV(µ, ν) δ GR k(ˆcd,M, X ,t)(µ, ν) = δTV(µ, ν) δ GR k(ˆcd,M, X ,t)(µ, ν) d,M,s ts + X t (d+2)k 1 Setting t = t k := k 1/2(s+d+2) and ck,d,s,M, X := ˆcd,M, X ,t k = Od,M X k(d+2)/2(s+d+2) , (A.75) yields δTV(µ, ν) δ GR k ck,d,s,M, X (µ, ν) d,M,s ( X + 1)k s/2(s+d+2). (A.76) Finally, we bound the expected empirical estimation error. Note that C GR k (a) , X 1 and C h TV GR k (a) , X = 1. 
Then, we have E ˆδ GR k ck,d,s,M, X (Xn, Y n) δ GR k ck,d,s,M, X (µ, ν) Sreekumar and Goldfeld sup γ P(X) log N ϵ, GR k ck,d,s,M, X , 2,γ dϵ sup γ P(X) log N ϵ, GR k ck,d,s,M, X , 2,γ dϵ sup γ P(X) log N ϵ/3, G k ck,d,s,M, X , 2,γ dϵ 2 (d + 1) 1 2 Z 1 log(1 + 6 ck,d,s,M, X ( X + 1))dϵ 2 d 3 2 ck,d,s,M, X ( X + 1) d+2 2 d 1 2 ck,d,s,M, X ( X + 1) 1 2 d 3 2 ck,d,s,M, X ( X + 1) + 1), (A.77) where (a) follows from (A.29); (b) is since the pointwise difference between functions in GR k (a) (with range [ 1, 1]) is less than the difference between the corresponding untruncated functions in GR k (a); (c) is due to (A.32); and (d) is via steps leading to (A.33). Then, (A.76) and (A.77) implies that E h ˆδ GR k( ck,d,s,M, X )(Xn, Y n) δTV(µ, ν) i d,M,s ( X + 1)k s/2(s+d+2) + n 1 2 k(d+2)/2(s+d+2) X 2 + 1 1 Recalling that X = [0, 1]d and taking supremum over (µ, ν) P2 TV(M, X) such that f TV Lips,1,M(X), we obtain (4.15). A.2.9 Proof of Proposition 35 Since p q Tb,N(X) and f TV = 1{p q 0} 1{p q<0}, the definition of ξ1,1(f TV, t) yields ξ1,1(f TV, t) ( 2Nλ(Bd(t)), t b 2 f TV 1 , otherwise. Hence, for any 0 < s 1, it holds that f TV Lip(s,1) = f TV 1 + sup t>0 t sξ1,1(f TV, t) = f TV 1 + sup 0b t sξ1,1(f TV, t) = f TV 1 + sup 0b t s2 f TV 1 = λ(X) + 2Nπ d 2 bd s (Γ(0.5d + 1)) 1 2b sλ(X), (A.79) where λ denotes the Lebesgue measure and Γ is the gamma function. Hence, f TV Lips,1,M(X) with M = λ(X) + 2N Γ(0.5d + 1) 1π d 2 bd s 2b sλ(X) and any 0 < s 1, thus proving the claim via Theorem 32. Neural Estimation of Statistical Divergences A.3 Proofs for Section 5 A.3.1 Proof of Theorem 39 To prove part (i), fix some 0 < ϵ < 1. Let Bc d(r) = Rd \ Bd(r), and r(ϵ) be sufficiently large such that Eµ |f KL| 1Bc d(r(ϵ)) Eν |dµ/dν 1|1Bc d(r(ϵ)) ϵ. Since f KL C(Rd), from (Stinchcombe and White, 1990, Theorem 2.1 and 2.8), there is a k0(ϵ, r(ϵ)) N, such that for any k k0(ϵ, r(ϵ)), there exists a gθk ˆG k(φ, r(ϵ)) with f KL gθk ,Bd(r(ϵ)) ϵ. (A.80) Then, we have DKL (µ ν) D ˆG k(φ,r(ϵ))(µ, ν) Eµ [|f KL gθk|] + Eν h ef KL egθk i = Eµ |f KL gθk| 1Bd(r(ϵ)) + Eµ h |f KL gθk| 1Bc d(r(ϵ)) i + Eν h ef KL egθk 1Bc d(r(ϵ)) i + Eν h ef KL egθk 1Bd(r(ϵ)) i (f KL gθk)1Bd(r(ϵ)) ,µ + Eµ h |f KL| 1Bc d(r(ϵ)) i + Eν dµ dν 1 1Bc d(r(ϵ)) + Eν h ef KL 1Bd(r(ϵ)) i 1 egθk f KL 1Bd(r(ϵ)) ,ν (A.81) where the final inequality is due to (A.80), the choice of r(ϵ), and Eν ef KL 1Bd(r(ϵ)) 1. On the other hand, for any 0 < ρ < 1, and n, kn, rn such that k3/2 n (rn + 1)ekn(rn+1) = O n(1 ρ)/2 , Lemma 59 yields ˆD ˆG kn(φ,rn)(Xn, Y n) n D ˆG kn(φ,rn)(µ, ν), P a.s. This along with (A.82) completes the proof of Part (i). To prove part (ii), we first state a general error bound for KL neural estimation based on the tail behaviour of random variables f KL(X) and h KL f KL(Y ) := ef KL(Y ) 1 outside Bd(r) for X µ and Y ν. For an increasing positive divergent sequence r = (rk)k N, rk 1, (rk ), a positive non-decreasing sequence m = (mk)k N, mk 1, and a non-increasing non-negative sequence v = (vk)k N with vk 0, set P2 KL(M, r, m, v) (µ, ν) P2 KL Rd : Eµ h |f KL| 1Bc d(rk) i Eν h |h KL f KL| 1Bc d(rk) i vk, DKL (µ ν) M, c KB f KL|Bd(rk), Bd(rk) mk, k N Then, we have the following lemma. Sreekumar and Goldfeld Lemma 63 (KL divergence neural estimation) Suppose there exists M 0, 0 < ρ < 1, and r, m, v as above satisfying 1 mk k(1 ρ)/2, such that (µ, ν) P2 KL(M, r, m, v). 
Then, for Gk = ˆGR k (mk, rk), we have sup (µ,ν) P2 KL(M,r,m,v) E h ˆDGk(Xn, Y n) DKL (µ ν) i d,M,ρ mkk 1 2 + vk + mkrke3mk(rk+1) n 1 The proof of the above lemma is based on an application of Theorem 8 to bound the NN approximation error on balls Bd(rk), leveraging tail integrability assumptions in the definition of P2 KL to bound the approximation error outside Bd(rk), and using Theorem 12 to control the empirical estimation error. Its proof is given in Appendix B.7. Continuing with the proof of the Theorem, we will show that (µ, ν) P2 KL,ψ (M, ℓ, r, m) implies (µ, ν) P2 KL(M, r, m, v) for some v that will be specified below. Then, Part (ii) will follow from Lemma 63. Note that f KL ℓ,µ M, where ℓ> 1 (or equivalently ℓ 2 since ℓ N), implies DKL (µ ν) = Eµ [f KL] q Eµ f2 KL M. (A.83) Eµ h |f KL| 1Bc d(rk) i (a) f KL ℓ,µ µ ( X > rk) 1 (b) Mµ ψ X M 1 > ψ rk M 1 1 (c) M Eµ ψ X M 1 1 ℓ ψ rk M 1 1 (d) M ψ rk M 1 1 Eν h |h KL f KL| 1Bc d(rk) i = Eν dµ dν 1 1Bc d(rk) dν + 1 1Bc d(rk) = µ Bc d(rk) + ν Bc d(rk) (A.85) (e) Eµ ψ X M 1 + Eν ψ Y M 1 ψ rk M 1 1 (f) 2 ψ rk M 1 1 , (a) follows by H older s inequality; (b) is since f KL ℓ,µ M and ψ is increasing; (c) and (e) are due to Markov s inequality; Neural Estimation of Statistical Divergences (d) and (f) are since p, q Lψ(M) implies that Eµ ψ X M 1 Eν ψ Y M 1 1. It follows that (µ, ν) P2 KL(M, r, m, v) with vk M,ψ,ℓ ψ(rk M 1) 1/ℓ since rk 1. Note that vk 0 as rk . This completes the proof of Part (ii) via Lemma 63. A.3.2 Proof of Corollary 41 Fix (µ, ν) = N(mp, σ2 p Id), N(mq, σ2 q Id) P2 N(M) and r = rk)k N = 1 M + rk k N, where rk 0, k N, will be specified below. Note that f KL(x) = d log σq 2σ2q x mp 2 DKL (µ ν) = d log σq 0.5d + 0.5dσ2 p σ2q + mp mq 2 Also, f KL is infinitely differentiable on Rd, and it can be seen by computing derivatives that for any multi-index α of dimension d and arbitrary order α 1 N, Dαf KL ,Bd(rk) b k,d,M := cd,M 1 + r2 k , for some constant cd,M (polynomial in M). Hence, f KL|Bd(rk) Cs KB b k,d,M, which implies via Proposition 10 that c KB f KL|Bd(rk), Bd(rk) m KL k := cd,M 1 + rd+3 k . By a straightforward calculation by using 1/M < σp, σq < M, mp mq M, it follows from Gaussian integral formulas that there exists some cd,M such that f KL 2,µ p ψ2 q ψ2 cd,M, where ψ2(z) = ez2 1. Hence, P2 N(M) P2 KL,ψ2 cd,M, 2, r, m KL , and we have from Part (ii) of Theorem 39 with mk = m KL k and Gk = ˆGR k m KL k , rk that E h ˆDGk(Xn, Y n) DKL (µ ν) i d,M,ρ mkk 1 2 + e r2 k c2 d,M + mk rke3mk rk n 1 Then, setting rk = r KL k := 1 M + rk with rk = (c log k/3cd,M)1/(d+4) yields for Gk = ˆGR k m KL k , r KL k that E h ˆDGk(Xn, Y n) DKL (µ ν) i d,M ck 1 2 log k + e (c log k/cd,M) 2 d+4 + ckc log k n 1 Solving for the value of c such that the first two terms in the RHS of the equation above are equal (up to logarithmic factors) yields c = cd,M2 (d+4)/2(log k)(d+2)/2. Substituting c and taking supremum over (µ, ν) P2 N(M), we obtain the claim in the Corollary. Sreekumar and Goldfeld A.3.3 Proof of Theorem 43 Let r(ϵ) be sufficiently large such that Eµ |fχ2|1Bc d(r(ϵ)) Eν |hχ2 fχ2|1Bc d(r(ϵ)) ϵ. Similar to (A.80), there exists gθk ˆG k(φ, r(ϵ)) satisfying fχ2 gθk ,Bd(r(ϵ)) ϵ for k k0(ϵ, r(ϵ)). 
Then, we have χ2 (µ ν) χ2 ˆG kn(φ,rn)(µ, ν) Eµ fχ2 gθk + Eν hχ2 fχ2 hχ2 gθk = Eµ fχ2 gθk 1Bd(r(ϵ)) + Eν hχ2 fχ2 hχ2 gθk 1Bd(r(ϵ)) + Eµ h fχ2 gθk 1Bc d(r(ϵ)) i + Eν h hχ2 fχ2 hχ2 gθk 1Bc d(r(ϵ)) i fχ2 gθk 1Bd(r(ϵ)) ,µ + Eν h hχ2 fχ2 hχ2 gθk 1Bd(r(ϵ)) i + Eµ h fχ2 1Bc d(r(ϵ)) i + Eν h hχ2 fχ2 1Bc d(r(ϵ)) i (A.86) (a) ϵ + Eν hχ2 fχ2 hχ2 gθk 1Bd(r(ϵ)) (b) ϵ + Eν fχ2 gθk 1Bd(r(ϵ)) + Eν h 0.25 fχ2 gθk 2 1Bd(r(ϵ)) i + 0.5Eν h fχ2 gθk fχ2 1Bd(r(ϵ)) i ϵ + fχ2 gθk 1Bd(r(ϵ)) ,ν Eν fχ2 where (a) follows by definition of r(ϵ) and gθk above; (b) is via steps leading to (A.48); (c) is due to definition of r(ϵ), gθk and Eν fχ2 4. From this and Lemma 60, Part (i) follows. Next, we prove Part (ii). For sequences m, r and v as in Appendix A.3.1, let P2 χ2(r, m, v) := (µ, ν) P2 χ2 Rd : Eµ h fχ2 1Bc d(rk) i Eν h hχ2 fχ2 1Bc d(rk) i vk c KB fχ2|Bd(rk), Bd(rk) mk, k N We will use the following lemma which bounds the χ2 neural estimation error for distributions satisfying general tail integrability conditions (see Appendix B.8 for proof). Lemma 64 (χ2 neural estimation error) For Gk = ˆGR k (mk, rk), we have sup (µ,ν) P2 χ2(r,m,v) E ˆχ2 Gk(Xn, Y n) χ2 (µ ν) mkd 1 2 k 1 2 + m2 kdk 1 + vk + d 3 2 m2 kr2 kn 1 Armed with Lemma 64, we next show that (µ, ν) P2 χ2,ψ (M, ℓ, r, m) implies that (µ, ν) P2 χ2(r, m, v) for some v that will be identified below. We have Eµ h fχ2 1Bc d(rk) i (a) fχ2 ℓ,µ µ ( X > rk) 1 ℓ (b) M ψ rk M 1 1 Neural Estimation of Statistical Divergences Eν h hχ2 fχ2 1Bc d(rk) i = Eν 2 dµ dν 1 1Bc d(rk) dν 1 2 1Bc d(rk) dν + 1 1Bc d(rk) 2 1Bc d(rk) + ν Bc d(rk) = 2µ Bc d(rk) + 3ν Bc d(rk) + Eµ dν 1Bc d(rk) = 2µ Bc d(rk) + 3ν Bc d(rk) + Eµ h (0.5fχ2 + 1)1Bc d(rk) i (c) = 3µ Bc d(rk) + 3ν Bc d(rk) + 0.5 fχ2 ℓ,µ µ Bc d(rk) 1 (d) 6 ψ rk M 1 1 + 0.5M ψ rk M 1 1 (a) and (c) is by H older s inequality; (b) and (d) follows via Markov s inequality since Eµ ψ X M 1 Eν ψ Y M 1 1, and fχ2 ℓ,µ M by assumption. Hence, (µ, ν) P2 χ2(r, m, v) with vk = 6 ψ rk M 1 1 + M ψ rk M 1 1 ℓ M,ψ,ℓ ψ rk M 1 1 This implies Part (ii) via Lemma 64 (since mkk 1 2 + m2 kk 1 2m2 kk 1 2 due to mk 1). A.3.4 Proof of Corollary 45 We will require the following lemma which bounds the tail probability of an isotropic Gaussian distribution outside a Euclidean ball Bd(r) of radius r. This is a straighforward consequence of Gaussian concentration (Ledoux and Talagrand, 1991, Eqn. 1.4) and the fact that is 1-Lipschitz function on the metric space (Rd, ). Lemma 65 (Gaussian tail integral bound) For any mp Rd such that mp M, σ2 > 0 and r M, Bc d(r) e x mp 2 2σ2 dx 2e (r M)2 2σ2 . (A.90) Proceeding with the proof of the corollary, fix (µ, ν) = N(mp, σ2Id), N(mq, σ2Id) P2 χ2,N(M), and r = rk)k N = 1 M + rk k N, where rk 0, k N, will be specified below. Note that since fχ2(x) = 2 p(x) 2σ2q x mp 2 Sreekumar and Goldfeld it is infinitely differentiable on Rd. A straightforward computation shows that for any multi-index α Zd 0 of order α 1 s KB, Dαfχ2 ,Bd(rk) b R k := cd,M 1 + rs KB k e2M2 r2 k. Hence, fχ2|Bd(rk) Cs KB b R k , which implies via Proposition 10 that c KB fχ2|Bd(rk), Bd(rk) mχ2 k := cd,M 1 + rs KB+d+1 k e2M2 r2 k. 
(A.91) Furthermore, letting σ 2 := σ 2 p 0.5σ 2 q 0.5σ 2 q 0.5σ 2 p and noting that σ 2 0.5M 3 by definition of P2 χ2,N(M) and M > 1, we have Eµ h fχ2 1Bc d(rk) i 2 (2πσ2p)d/2 2σ2q x mp 2 2 (2πσ2p)d/2 2σ2q x mp 2 σ2p dx + e x mp 2 (a) d,M e r2 k σ2 e r2 k 2M3 , Eν h hχ2 fχ2 1Bc d(rk) i = 1 (2πσ2q)d/2 2σ2q x mp 2 2σ2q x mp 2 Bc d(rk) e x mp 2 2σ2p dx + Z 2σ2q x mp 2 Bc d(rk) e x mq 2 (c) d,M e r2 k σ2 e r2 k 2M3 , (a) and (c) follows by an application of Lemma 65 via completion of squares since σ2 p < 2σ2 q by assumption; (b) uses (aex 1)2 a2e2x + 1 for x Rd and a 0. Hence, (µ, ν) P2 χ2 r, mχ2, vχ2 with mχ2 as defined in (A.91) and vχ2 k := cd,Me r2 k/2M3, and the error bound in (A.87) applies. Setting rk = rχ2 k := 1 M + rk = 1 M + 2 0.5M 1 c log k for some constant c in (A.87), optimizing the resulting bound w.r.t. c (achieved at c = 2M5/(4M5 + 1) < 0.5), we obtain for Gk = ˆGR k mχ2 k , rχ2 k that E ˆχ2 Gk(Xn, Y n) χ2 (µ ν) d,M (log k)2(s KB+d+1) k 1 2+8M5 + k Taking supremum over (µ, ν) P2 χ2,N(M) completes the proof. Neural Estimation of Statistical Divergences A.3.5 Proof of Theorem 47 For sequences m, r and v as in Appendix A.3.1, let P2 H2(r, m, v) := (µ, ν) P2 H2 Rd : Eµ h |f H2| 1Bc d(rk) i Eν h |h H2 f H2| 1Bc d(rk) i vk, c KB f H2|Bd(rk), Bd(rk) dµ dν ,Bd(rk) mk, k N The following lemma proves consistency of the NE for H2 estimation and bounds its effective error for distributions satisfying general tail integrability conditions; see Appendix B.9 for proof. Lemma 66 (H2 neural estimation) Let (µ, ν) P2 H2(r, m, v), where m satisfies mk = o(k1/4). Then, the following hold: (i) For kn, mkn, rkn, n satisfying kn , rkn , k1/2 n m2 knrkn = O n(1 ρ)/2 , and Gn = ˇG R kn,m 1/2 kn (mkn, rkn), we have ˆH2 Gn(Xn, Y n) n H2(µ, ν), P a.s. (A.92) (ii) For Gk = ˇG R k,m 1/2 k (mk, rk), we have sup (µ,ν) P2 H2(r,m,v) E h ˆH2 Gk(Xn, Y n) H2(µ, ν) i m2 kd 1 2 k 1 2 + vk + d 3 2 m2 krkn 1 To prove the theorem, we will show that (µ, ν) P2 H2,ψ (M, r, m) implies that (µ, ν) P2 H2(r, m, v) for some v stated below. Then, Part (i) and (ii) will follow from the corresponding Parts in the above lemma. We have Eµ h |f H2| 1Bc d(rk) i = Eµ h 1 p qp 1 1Bc d(rk) i µ Bc d(rk) + Eµ hp qp 11Bc d(rk) i (a) µ Bc d(rk) + q ν Bc d(rk) (A.94) (b) ψ rk M 1 1 + ψ rk M 1 1 Eν h |h H2 f H2| 1Bc d(rk) i = Eν h p pq 1 1 1Bc d(rk) i (c) ν Bc d(rk) + q µ Bc d(rk) (A.95) (d) ψ rk M 1 1 + ψ rk M 1 1 Sreekumar and Goldfeld (a) and (c) follows from Cauchy-Schwarz inequality and Eµ qp 1 = Eν pq 1 = 1; (b) and (d) follows from Markov s inequality as Eµ ψ X M 1 Eν ψ Y M 1 1. Hence, (µ, ν) P2 H2(r, m, v) with vk = ψ rk M 1 1 + ψ rk M 1 1/2 ψ,M ψ rk M 1 1/2 0. This completes the proof via Lemma 66. A.3.6 Proof of Corollary 49 Fix (µ, ν) = N(mp, σ2Id), N(mq, σ2Id) P2 N(M), and r = rk)k N = 1 M + rk k N, where rk 0, k N, will be specified below. Observe that f H2(x) = 1 p(x) 4σ2p x mq 2 is infinitely differentiable on Rd. Then, for any multi-index α Zd 0 of order α 1 s KB, it is easy to see by computing partial derivatives that Dαf H2 ,Bd(rk) ˆb R k := cd,M 1 + rs KB k e M2 r2 k. Hence, f H2|Bd(rk) Cs KB ˆb R k , which yields via Proposition 10 that c KB f H2|Bd(rk), Bd(rk) cd,M 1 + rs KB+d+1 k e M2 r2 k. Also, we have ,Bd(rk) = sup x Bd(rk) 2σ2q x mp 2 2σ2p cd,M 1 + e2M2r2 k . 
Furthermore, defining ˆσ2 := 4σ2 pσ2 q/(σ2 p + σ2 q) 2σ2 p 2σ2 q = 2σ2 p 2σ2 q 2M 1, we obtain Eµ h |f H2| 1Bc d(rk) i 1 (2πσ2p)d/2 4σ2p x mq 2 1 (2πσ2p)d/2 d/4 e x mq 2 4σ2q x mp 2 (a) d,M e r2 k ˆσ2 e 0.5M r2 k, Eν h |h H2 f H2| 1Bc d(rk) i = Eν 1 (2πσ2q)d/2 4σ2q x mp 2 1 (2πσ2q)d/2 d/4 e x mq 2 4σ2q x mp 2 4σ2p dx + e x mq 2 Neural Estimation of Statistical Divergences (b) d,M e r2 k ˆσ2 e 0.5M r2 k, where (a) and (b) above follows from Lemma 65. Hence, P2 N(M) P2 H2 r, m H2, v H2 with m H2 k d,M 1 + rs KB+d+1 k e2M2 r2 k and v H2 k d,M e 0.5M r2 k, and (A.93) applies. Setting rk = 1 M + rk with rk = p 2c M 1 log k, c > 0, and optimizing the resulting bound in (A.93) w.r.t. c (optimum achieved at c = 0.5/(1 + 8M)) yields with mk = m H2 k and Gk = ˇG R k,m 1/2 k (mk, rk) that E h ˆH2 Gk(Xn, Y n) H2(µ, ν) i d,M (log k)s KB+d+2k 1 2+16M 1 + k 1 2 n 1 Taking supremum over (µ, ν) P2 N(M) completes the proof. A.3.7 Proof of Theorem 51 Fix ϵ > 0 and let r(ϵ) denote r such that µ Bc d(r) ν Bc d(r) ϵ. Then, following steps leading to (A.64), there exists g G k(φ, r(ϵ)) for k k0(ϵ) such that the following holds: δTV(µ, ν) δ G k(φ,r(ϵ))(µ, ν) Eµ |f TV g | 1Bd(r(ϵ)) + Eν |f TV g | 1Bd(r(ϵ)) + Eµ |f TV g | 1Bc d(r(ϵ)) + Eν |f TV g | 1Bc d(r(ϵ)) ϵ + Eµ |f TV| 1Bc d(r(ϵ)) + Eν |f TV| 1Bc d(r(ϵ)) ϵ + µ Bc d(r) + ν Bc d(r) ϵ, This combined with (A.59) proves Part (i). Next, we prove Part (ii). Fix (µ, ν) P2 TV,ψ(M, s, r, m). For t > 0, let f TV,rk := f TV1Bd(rk) and f(t) TV,rk = f TV,rk ΦN t , where ΦN t (x) := t dΦN (t 1x) and ΦN = (2π) d/2e 0.5 x 2. Then, similar to (A.70), we have S2 f(t) TV,rk Rd ω 2 1 F h f(t) TV,rk rk f TV,rk 1 d Z Rd ω 2 |F[Φt](ω)| dω = rd+1 k dπ0.5d Γ(0.5d + 1) 2 t2 ω 2dω, =: ˇcd,rk,t, ˇcd,rk,t := 2πr2 kt 3 Γ(3/2) 1, d = 1, 2 π0.5d+1drd+1 k t (d+2) Qd 2 j=1 R π 0 sind 1 j(ϕj)dϕj, d 2. Then, noting that f(t) TV(0) f(t) TV(0) 1 (2dπ d)1/2Γ 0.5(d + 1) t 1, it follows from (3.1) that there exists gθk GR k cd,rk,t, rk such that f(t) TV,rk(x) gθk(x) ( cd,rk,td 1 2 k 1 2 , x Bd(rk), 1, otherwise, (A.96) Sreekumar and Goldfeld where cd,rk,t := ˇcd,rk,t 1 (2dπ d)1/2Γ 0.5(d + 1) t 1. On the other hand, we have similar to steps leading to (A.67) that Eµ h f TV,rk f(t) TV,rk Rd |f TV,rk(x) f TV,rk(x tu)| p(x)dx Φ(u)du Rd |f TV,rk(x + tu) f TV,rk(x)| p(x + tu)dx Φ(u)du Rd |f TV,rk(x + tu) f TV,rk(x)| dx Φ(u)du Rd ξ1,1(f, t u )Φ(u)du Rd ts u s Φ(u)du = c s,d Mmkts, where c s,d = R Rd u s Φ(u)du. Then, defining vk = µ Bc d(rk) ν Bc d(rk) , we have Eµ h f TV f(t) TV,rk i Eµ [f TV f TV,rk] + Eµ h f TV,rk f(t) TV,rk 2µ Bc d(rk) + c s,d Mmkts 2vk + c s,d Mmkts. Noting that the above holds with ν in place of µ, we obtain δTV(µ, ν) δTV f(t) TV,rk 4vk + 2c s,d Mmkts. (A.97) Recalling that δTV(g) := Eµ[g] Eν [g], we have δTV(µ, ν) δ G R k cd,rk,t,rk (µ, ν) (a) = δTV(µ, ν) δ G R k cd,rk,t,rk (µ, ν) = δTV(µ, ν) δTV f(t) TV,rk + δTV f(t) TV,rk δ G R k ( cd,rk,t)(µ, ν) (b) 4vk + 2c s,d Mmkts + Eµ h f(t) TV,rk gθk i + Eν h f(t) TV,rk gθk i (c) d,M,s vk + mkts + rd+1 k t (d+2)k 1 2 + µ Bc d(rk) + ν Bc d(rk) vk + mkts + rd+1 k t (d+2)k 1 where (a) follows since g 1 for g G R k cd,rk,t, rk and (2.7); (b) uses (A.73) and (A.97); and (c) is due to (A.96). Setting t = t k,s = rd+1 k k 1/2m 1 k 1/(s+d+2) yields δTV(µ, ν) δ G R k cd,rk,t k,s,rk (µ, ν) d,M,s m d+2 s+d+2 k r s(d+1) s+d+2 k k s 2(s+d+2) + vk. 
Neural Estimation of Statistical Divergences Then, defining ck,d,s,m,r := cd,rk,t k,s = Od rs(d+1) k k0.5(d+2)md+2 k 1 s+d+2 , (A.98) we have from the above equation and (A.77) that E ˆδ G R k ck,d,s,m,r,rk (Xn, Y n) δTV(µ, ν) d,M,s,ρ m d+2 s+d+2 k r s(d+1) s+d+2 k k s 2(s+d+2) + vk 2 mkrs+1 k k 1 2 d+2 s+d+2 . (A.99) This completes the proof of Part (ii) by taking supremum w.r.t. (µ, ν) P2 TV,ψ(M, s, r, m) and noting that vk ψ rk M 1 1 by Markov s inequality. A.3.8 Proof of Corollary 52 We will use the following relation between sub-Gaussian and norm sub-Gaussian distributions. µ P Rd is σ2-norm sub-Gaussian for σ > 0 if X µ satisfies µ X E[X] > t 2e t2 Lemma 67 (Jin et al., 2019, Lemma 1) If µ P Rd is σ2-sub-Gaussian, then it is 8dσ2norm sub-Gaussian. Continuing with the proof of the Corollary, fix (µ, ν) ˆP2 TV(b, M, N). From the above lemma, we have for µ SG(M) and t M that µ Bc d(t) µ ( X Eµ[X] + Eµ[X] > t) 2e (t Eµ[X] )2 16dσ2 2e (t M)2 16d M . (A.100) Similar bound holds with ν in place of µ. Next, since (µ, ν) ˆP2 TV(b, M, N), following the steps leading to (A.79) yields f TV,rk Lip(s,1) = λ(Bd(rk)) + 2N π d 2 bd s Γ d 2 + 1 2b sλ(Bd(rk)) = π d 2 rd k Γ d 2 + 1 + 2N π d 2 bd s Γ d 2 + 1 2b s π d 2 rd k Γ d 2 + 1 =: cd,s,b,N,rk. (A.101) Then, it follows from (A.99) with rk = M 1 + 4 d M log k, vk = 2e (rk M)2/16d M and mk = cd,s,b,N,rk that E ˆδ G R k ck,d,s,m,r,rk (Xn, Y n) δTV(µ, ν) d,s,b,N (log k) 2(s+d+2) k s 2(s+d+2) + k 1 + (log k) d+2 d+2 2(s+d+2) n 1 d,s,b,N (log k) 2(s+d+2) k s 2(s+d+2) + (log k) d+2 d+2 2(s+d+2) n 1 This completes the proof by taking supremum w.r.t. (µ, ν) ˆP2 TV(b, M, N). Sreekumar and Goldfeld Appendix B. Proofs of Lemmas in Appendix A B.1 Proof of Lemma 55 Suppose f LKB s KB,b Rd . Since f L1 Rd , its Fourier transform F[f] : Rd R is welldefined. Also, Rd |F[f](ω)| dω (a) Z Rd dω 1 + ω 2s KB 1 + ω 2s KB |F[f](ω)|2 dω 1 Rd dω 1 + ω 2s KB 2 f 2 2 + ds KB max α: α 1=s KB Dαf 2 2 (a) follows from Cauchy-Schwarz inequality; (b) is by Plancherel s theorem since F[Dαf](ω) = F[f](ω) Qd j=1(iωj)αj, α 1 s KB, where i denotes the imaginary unit 1, and f LKB s KB,b Rd . The above identity holds because Dαf 1 < for all α 1 s KB by assumption. (c) follows since the first integral is finite and f LKB s KB,b Rd . Hence, F[f] L1 Rd and the Fourier inversion formula holds (at every x Rd since f Ls KB,b Rd is necessarily continuous) with F(dω) = F[f](ω)dω, i.e., f(x) = R 0 eiω x F[f](ω)dω. Then, it follows from ω 1 Rd ω 2 1 F[f](ω) dω d Z Rd ω 2 F[f](ω) dω. (B.1) If Dαf 2 b for all α with α 1 {1, s KB}, then we have Rd ω 2 F[f](ω) dω (a) 1 + ω 2(s B 1) ω 4 + ω 2s KB F[f](ω) 2dω 1 1 + ω 2(s B 1) 2 d2 + ds KB 1 (a) follows from Cauchy-Schwarz inequality; (b) is due to Plancherel s theorem and f LKB s KB,b Rd . Combining (B.1) and (B.2) yields S2(f) bd3/2κd. Following similar steps, it can be shown that if f LB s B,b Rd , then S1(f) bd1/2κd. The final claims follows from these and definition of the classes LB s B,b Rd , LKB s KB,b Rd , Bc,1,X Rd and Bc,2,X Rd . Neural Estimation of Statistical Divergences B.2 Proof of Lemma 57 Assume that φ is monotone increasing. Let g, g Gk(ak, φ) be arbitrary, where g(x) = Pk i=1 βiφ (wi x + bi) + w0 x + b0 and g(x) = Pk i=1 βiφ wi x + bi + w0 x + b0. Define β := (β1, . . . , βk), β := ( β1, . . . , βk), w = (w1, . . . , wk), w = ( w1, . . . , wk), b = (b1, . . . , bk) and b = ( b1, . . . , bk). Note that β, β, b, b Rk and w, w Rkd. 
For any x X, we have |g(x) g(x)| i=1 βiφ (wi x + bi) i=1 βiφ wi x + bi + |(w0 w0) x| + b0 b0 i=1 βiφ (wi x + bi) i=1 βiφ (wi x + bi) + w0 w0 1 X + b0 b0 i=1 βiφ (wi x + bi) i=1 βiφ wi x + bi (a) β β 1 φ sup x X,1 i k |wi x + bi| + w0 w0 1 X + b0 b0 βi (wi wi) x + bi bi (b) β β 1 φ a1,k( X + 1) + w0 w0 1 X + b0 b0 + La2,k X w w 1 + La2,k b b 1, (a) is since φ is monotone increasing function with Lipschitz constant bounded by L; (b) is because max1 i k wi 1 |bi| a1,k and max1 i k | βi| a2,k. Defining uk = φ a1,k( X + 1) , it follows by application of (A.19) that N ϵ, Gk(ak, φ), ,X N ϵ/5, [ a2,k, a2,k]k, uk 1 N ϵ/5, B1 d(a4,k), X 1 N ϵ/5, [ a3,k, a3,k], | | N ϵ/5, B1 kd(ka1,k), La2,k X 1 N ϵ/5, B1 k(ka1,k), La2,k 1 1 + 10ka2,kukϵ 1 k 1 + 10a4,k X ϵ 1 d 1 + 10a3,kϵ 1 1 + 10Lka1,ka2,k X ϵ 1 dk 1 + 10Lka1,ka2,kϵ 1 k. If φ is monotone decreasing, the above holds with uk = φ a1,k( X + 1) . This proves the first bound in Lemma 57. Specializing to NN classes GR k (a), GS k(a), G k(φR), and G k(φS) by noting that the Lipschitz constant L 1 for φR and φS, |φR(x)| x, and |φS(x)| 1, yields (A.16)-(A.18). Sreekumar and Goldfeld B.3 Proof of Lemma 59 We will use Theorem 11 for the proof. Fix any (µ, ν) P2 KL(X). Note that for h KL(x) = ex 1, we have C |G k(φ)|, X k( X + 1) + 1, C h KL G k(φ) , X ek( X +1)+1, (B.3) Vk,h,φ,X k( X + 1) + 1 2 ek( X +1)+1 + 1 2, where h KL denotes the derivative of h KL and Vk,h,φ,X is given in (A.20). Also, observe that since g G k(φ) is continuous and bounded, DG k(φ)(µ, ν) DKL (µ ν) < . Then, since Ek,h,φ,X n 1 d( X + 1) p k( X + 1) + 1 ek( X +1)+1 + 1 n 0, for k such that k3/2( X + 1)ek( X +1) = O n(1 ρ)/2 for 0 < ρ < 1, it follows from (3.3) that for any k N, δ > 0, and n sufficiently large, P DG k(φ)(µ, ν) ˆDG k(φ)(Xn, Y n) δ ce n δ Ek,h,φ,X n 1/2 2 Hence, for kn such that k3/2 n ( X + 1)ekn( X +1) = O n(1 ρ)/2 , n=1 P DG k(φ)(µ, ν) ˆDG k(φ)(Xn, Y n) δ c n δ Ekn,h,φ,X n 1/2 2 Vkn,h,φ,X < , (B.4) where the final inequality in (B.4) can be established via integral test for sum of series. This implies (A.37) via the first Borel-Cantelli lemma by taking supremum w.r.t. (µ, ν) P2 KL(X). B.4 Proof of Lemma 60 Fix (µ, ν) P2 χ2(X). Recalling the quantities defined in Theorem 11, we have for hχ2(x) = x + 0.25x2 that C h χ2 G k(φ) , X 0.5k( X + 1) + 1.5, (B.5) Vk,h,φ,X (k( X + 1) + 1)2(0.5k( X + 1) + 1.5)2, Ek,h,φ,X k p d( X + 1) 0.5k( X + 1) + 1.5 p k( X + 1) + 1, where h χ2 denotes the derivative of hχ2. Also, note that χ2 G k(φ)(µ, ν) χ2 (µ ν) < . Then, since 0 Ek,h,φ,X n 1 2 k 5 2 ( X + 1)2n 1 for k5/2( X + 1)2 = O n(1 ρ)/2 , 0 < ρ < 1, it follows from (3.3) that for any k N, δ > 0, and n sufficiently large, P ˆχ2 G k(φ)(Xn, Y n) χ2 G k(φ)(µ, ν) δ ce n δ Ek,h,φ,X n 1/2 2 Then, (A.45) follows using similar steps used to prove (A.37) (see (B.4)). This completes the proof. Neural Estimation of Statistical Divergences B.5 Proof of Lemma 61 Fix (µ, ν) P2 H2(X). Note that h H2(x) = x/(1 x) and C h H2 G k,t(φ) = sup gθ G k,t(φ),x X (1 gθ(x)) 2 t 2, where h H2 denotes derivative of h H2. By examining the proof, it can be seen that Theorem 11 continues to hold with G k(φ) in (3.2) and (3.3) replaced with G k,t(φ). We have Vk,h,φ,X (k( X + 1) + 1)2 t 2 k + 1 2, and 0 Ek,h,φ,X n 1 d( X + 1) t 2 k + 1 p k( X + 1) + 1 n 0, for k, tk such that k3/2( X + 1)t 2 k = O n(1 ρ)/2 . Further, H2 G k,t(φ)(µ, ν) < H2(µ, ν) 2. 
It then follows from (3.3) that for any k N, δ > 0, and n sufficiently large, P ˆH2 G k,tk(φ)(Xn, Y n) H2 G k,tk(φ)(µ, ν) δ ce n δ Ek,h,φ,X n 1/2 2 Then, (A.53) follows via similar steps used to prove (A.37) (see (B.4)). B.6 Proof of Lemma 62 Fix µ, ν P(X). We have δ G k(φ)(µ, ν) δTV(µ, ν) 2, C G k(φ) 1, and C h TV G k(φ) = 1, where h TV denotes the derivative of h TV. Also, it can be seen from the proof of Theorem 11 that it holds with G k(φ) in (3.2) and (3.3) replaced by G k(φ). Further, Vk,h,φ,X 1, and 0 Ek,h,φ,X n 1 d( X + 1) n 0, for k, n such that k( X + 1)1/2 = O n(1 ρ)/2 . It follows from (3.3) that for any k N, δ > 0, and n sufficiently large, P ˆδ G k(φ)(Xn, Y n) δ G k(φ)(µ, ν) δ ce n δ Ek,h,φ,X n 1/2 2 Then, (A.59) follows using similar steps used to prove (A.37). This completes the proof. B.7 Proof of Lemma 63 Fix (µ, ν) P2 KL(M, r, m, v). Recall that ˆGR k (a, r) = {g1Bd(r) : g GR k (a)}. Since c KB f KL|Bd(rk), Bd(rk) mk, it follows from (3.1) that there exists gθk ˆGR k (mk, rk) and c > 0 such that f KL gθk ,Bd(rk) cd 1 2 mkk 1 Sreekumar and Goldfeld Then, following steps leading to (A.81), we have for k with c2dm2 k < 0.5k that DKL (µ ν) D ˆGR k(mk,rk)(µ, ν) (f KL gθk)1Bd(rk) ,µ + Eµ h |f KL| 1Bc d(rk) i + Eν dµ dν 1 1Bc d(rk) + Eν h ef KL 1Bd(rk) i 1 egθk f KL 1Bd(rk) ,ν mkd 1 2 k 1 where the final inequality is due to (B.6), ecd 1 2 mkk 1/2 1 cd 1 2 mkk 1/2 which follows similar to (A.43) (since c2dm2 k < 0.5k), Eµ |f KL|1Bc d(rk) Eν |(dµ/dν) 1|1Bc d(rk) vk, and Eν |ef KL|1Bd(rk) 1. On the other hand, for k such that c2dm2 k 0.5k, g = 0 ˆGR k (mk, rk) implies that DKL (µ ν) D ˆGR k(mk,rk)(µ, ν) = DKL (µ ν) D ˆGR k(mk,rk)(µ, ν) DKL (µ ν) M. Since m2 k k1 ρ, k such that c2dm2 k 0.5k necessarily satisfies kρ d. Thus, for all k N, DKL (µ ν) D ˆGR k(mk,rk)(µ, ν) d,M,ρ mkk 1 2 + vk. (B.7) Note that the RHS above tends to zero as k since vk 0 and m2 k k1 ρ. Next, it follows from (A.29), (A.33), and (B.7) that for k, mk satisfying m2 k k1 ρ, E h ˆD ˆGR k(mk,rk)(Xn, Y n) DKL (µ ν) i D ˆGR k(mk,rk)(µ, ν) DKL (µ ν) + E h D ˆGR k(mk,rk)(µ, ν) ˆD ˆGR k(mk,rk)(Xn, Y n) i d,M,ρ mkk 1 2 + vk + mkrke3mk(rk+1)n 1 Taking supremum w.r.t. (µ, ν) P2 KL(M, r, m, v) completes the proof. B.8 Proof of Lemma 64 Fix (µ, ν) P2 χ2(r, m, v). Since c KB fχ2|Bd(rk), Bd(rk) mk, there exists gθk ˆGR k (mk, rk) such that fχ2 gθk ,Bd(rk) d 1 2 mkk 1 Then, following steps leading to (A.86), we have for all k N that χ2 (µ ν) χ2 ˆGR k(mk,rk)(µ, ν) fχ2 gθk 1Bd(rk) ,µ + Eµ h fχ2 1Bc d(rk) i + Eν hχ2 fχ2 hχ2 gθk 1Bd(rk) + Eν h hχ2 fχ2 1Bc d(rk) i Neural Estimation of Statistical Divergences (a) d 1 2 mkk 1 2 + vk + Eν hχ2 fχ2 hχ2 gθk 1Bd(rk) (b) d 1 2 mkk 1 2 + vk + Eν fχ2 gθk 1Bd(rk) + Eν h 0.25 fχ2 gθk 21Bd(rk) i + 0.5Eν h fχ2 gθk fχ2 1Bd(rk) i d 1 2 mkk 1 2 + vk + dm2 kk 1 + (fχ2 gθk)1Bd(rk) ,ν Eν fχ2 (c) d 1 2 mkk 1 2 + dm2 kk 1 + vk, (a) follows from (B.8) and since (µ, ν) P2 χ2(r, m, v); (b) is via steps leading to (A.48); (c) is due to (B.8) and Eν fχ2 4. Then, it follows from the above equation, (A.29) and (A.33) that E h ˆχ2 ˆGR k(mk,rk)(Xn, Y n) χ2 (µ ν) i χ2 ˆGR k(mk,rk)(µ, ν) χ2 (µ ν) + E h χ2 ˆGR k(mk,rk)(µ, ν) ˆχ ˆGR k(mk,rk)(Xn, Y n) i d 1 2 mkk 1 2 + dm2 kk 1 + vk + d 3 2 m2 kr2 kn 1 Taking supremum w.r.t. (µ, ν) P2 χ2(r, m, v) completes the proof. B.9 Proof of Lemma 66 Fix (µ, ν) P2 H2(r, m, v). Since dµ dν ,Bd(rk) mk, we have 1 f H2(x) = dµ 2 k , x Bd(rk). 
(B.9) Hence, c KB f H2|Bd(rk), Bd(rk) mk implies via (3.1) and (5.2) that there exists gθk ˇG R k,m 1/2 k (mk, rk) such that f H2 gθk ,Bd(rk) mkd 1 2 k 1 Following the derivation leading to the penultimate step in (A.55), we have H2(µ, ν) H2 ˇG R k,m 1/2 k (mk,rk)(µ, ν) Eµ f H2 gθk 1Bd(rk) + Eν f H2 gθk (1 f H2)(1 gθk) + Eµ h |f H2| 1Bc d(rk) i f H2 (1 f H2) Sreekumar and Goldfeld (a) mkd 1 2 k 1 2 + m2 kd 1 2 k 1 (b) m2 kd 1 2 k 1 2 + vk, (B.11) where (a) follows from (B.9), (B.10), (µ, ν) P2 H2(r, m, v), and 1 gθk(x) m 1/2 k by the definition of ˇG R k,m 1/2 k (mk, rk), while (b) is due to mk 1. Next, using (A.27) and following steps similar to proof of Lemma 59, we obtain that for k, mk, rk, n such that k1/2m2 krk = O n(1 ρ)/2 , k,m 1/2 k (mk,rk)(Xn, Y n) n H2 ˇG R k,m 1/2 k (mk,rk)(µ, ν), P a.s. Then, (A.92) follows from this and (B.11) since mk = o(k1/4) and vk 0 by assumption. Also, k,m 1/2 k (mk,rk)(Xn, Y n) H2(µ, ν) H2(µ, ν) H2 ˇG R k,m 1/2 k (mk,rk)(µ, ν) k,m 1/2 k (mk,rk)(Xn, Y n) H2 ˇG R k,m 1/2 k (mk,rk)(µ, ν) m2 kd 1 2 k 1 2 + vk + d 3 2 m2 krkn 1 where the final inequality uses (A.29), (A.33) and (B.11) to bound the last term. Taking supremum over (µ, ν) P2 H2(r, m, v) yields (A.93). Appendix C. Consistency and effective error bounds for DV-NE Defining DDV,G(µ, ν) := supg G Eµ[g] log Eν[eg] and i=1 g(Xi) log i=1 eg(Yi) ! Eµ g + log Eν eg , we have similar to (A.22) that ˇDDV,G(Xn, Y n) DDV,G(µ, ν) sup g G Zg. Moreover, since the Lipschitz constant of logarithm is bounded by e C(|G|,X) in e C(|G|,X), e C(|G|,X) , we have almost surely that |Zg Z g| n 1 n X g(Xi) g(Xi) Eµ g g + e C(G,X) eg(Yi) e g(Yi) Eν eg e g , where each term inside the summation is bounded by 2 e2 C(|G|) + 1 gθ g θ ,X similar to (A.26). Then, following the steps in the proof of Theorem 11, we have sup µ,ν P(X): DDV,G k(φ)(µ,ν)< P ˇDDV,G k(φ)(Xn, Y n) DDV,G k(φ)(µ, ν) δ + Ek,h,φ,X n 1 2 c e nδ2 Vk,h,φ,X , Neural Estimation of Statistical Divergences where Vk,h,φ,X (k( X + 1) + 1)2e4k( X +1) and Ek,h,φ,X k3/2d1/2( X + 1)e2k( X +1). Then, similar to Lemma 59, we obtain that for any 0 < ρ < 1, and n, kn such that k3/2 n ( X + 1)e2kn( X +1) = O n(1 ρ)/2 , ˇDDV,G k(φ)(Xn, Y n) n DDV,G k(φ)(µ, ν), P a.s. Moreover, limn DDV,G k(φ)(µ, ν) = DKL (µ ν) follows identical to (A.39) provided f KL C(X). Hence, for X = [0, 1]d, we obtain that for any 0 < ρ < 1, (kn)n N with kn and kn 1 8(1 ρ) log n, we have ˇDDV,G k(φ)(Xn, Y n) n DKL (µ ν) , P a.s. (C.1) Next, we bound the expected error of the DV-NE estimator. Note that ˇDDV,GR k(a)(Xn, Y n) DDV,GR k(a)(µ, ν) = sup g GR k(a) i=1 g(Xi) log i=1 eg(Yi) ! sup g GR k(a) (Eµ[g] log Eν[eg]) sup g GR k(a) i=1 g(Xi) log i=1 eg(Yi) ! (Eµ[g] log Eν[eg]) . E h ˇDDV,GR k(a)(Xn, Y n) DDV,GR k(a)(µ, ν) i sup g GR k(a) i=1 g(Xi) Eµ[g] sup g GR k(a) i=1 eg(Yi) ! sup g GR k(a) i=1 g(Xi) Eµ[g] + e3a( X +1)E sup g GR k(a) i=1 eg(Yi) Eν[eg] (b) a( X + 1) e6a( X +1) + 1 n 1 sup γ P(X) log N 3a( X + 1)ϵ, GR k (a), 2,γ dϵ, (c) a( X + 1) e6a( X +1) + 1 d 3 2 n 1 (a) is since C(|GR k (a)|, X) 3a( X + 1) and the Lipschitz constant of log x is bounded by e3a( X +1) in e C(|GR k(a)|,X), e C(|GR k(a)|,X) ; (b) follows using steps akin to (A.34) and (Van Der Vaart and Wellner, 1996, Corollary 2.2.8); (c) is due to (A.33). Sreekumar and Goldfeld Appendix D. 
Co D-Free Error Rate in the Unbounded Support Case D.1 KL Divergence Consider the NN class ˆGS k(a, r) = g1Bd(r) : g GS k(a) (see Definition 7) and the following class of sub-Gaussian distributions: ˆP2 KL(M, ℓ) := n (µ, ν) P2 KL Rd : µ, ν SG(M), f KL ˆI(M), f KL ℓ,µ M o , ˆI(M) := {f : S1(f) |f(0)| M} . (D.1) Proposition 68 (KL Co D-free error bound) Let M 0, ℓ> 1 and ℓ = ℓ/(ℓ 1). Then, for zk = 12 ℓ d M3/2(log k) 1/2 and rk = M 1 + 4 d Mℓ log k, sup (µ,ν) ˆP2 KL(M,ℓ) E h ˆD ˆGS k(Mrk,rk)(Xn, Y n) DKL (µ ν) i d,M,ℓ(log k) 3 2 k 1 2 + kzk n 1 Setting k = n in the above bound gives an effective error bound O n 1/3 . Proof Fix (µ, ν) ˆP2 KL(M, ℓ). From (A.100), we have for µ, ν SG(M) and r M that µ Bc d(r) ν Bc d(r) 2e (r M)2 16d M . (D.2) Then, it follows from (A.84) and (A.85) that for rk M, Eµ h |f KL| 1Bc d(rk) i Eν h |h KL f KL| 1Bc d(rk) i M e Moreover, f KL ˆI(M) implies c B f KL|Bd(rk), Bd(rk) Mrk for rk 1. Next, note that C(| ˆGS k(mk, rk)|, Bd(rk)) 3Mrk, and C h KL ˆGS k(mk, rk) , Bd(rk) e3Mrk. Also, similar to (A.33), we have sup γ P(X) log N 3Mrkϵ, ˆGS k(Mrk, rk), 2,γ dϵ d 3 2 , (D.3) Then, (A.29) implies E h D ˆGS k(Mrk,rk)(µ, ν) ˆD ˆGS k(Mrk,rk)(Xn, Y n) i M d 3 2 rke3Mrkn 1 Thus, we have similar to Lemma 63 (by using Theorem 8 for the sigmoid NN class) for 1 Mrk k(1 ρ)/2 for some ρ > 0 that E h ˆD ˆGS k(Mrk,rk)(Xn, Y n) DKL (µ ν) i d,M,ρ rkk 1 2 + rke3Mrkn 1 Taking rk = M 1 + 4 d Mℓ log k and noting that 1 Mrk k1/4 (say), we obtain E h ˆD ˆGS k(Mrk,rk)(Xn, Y n) DKL (µ ν) i Neural Estimation of Statistical Divergences 2 (log k) 1 2 + k 1 + (log k) 3 2 e12M3/2 ℓ d log k n 1 2 (log k) 1 2 + k ℓ d M3/2 log k (log k) 3 2 n 1 Taking supremum w.r.t. (µ, ν) ˆP2 KL(M, ℓ) yields the claim. Remark 69 (Co D-free rate) ˆP2 KL(M, ℓ), for example, includes M-sub-Gaussian distributions (µ, ν) such that f KL ℓ,µ M and f KL LB s B,b Rd (for appropriate value of b), where s B = d/2 + 2 and LB s B,b Rd is given in (A.2). It also contains certain M-sub Gaussian distributions (µ, ν) such that f KL = c + f for some c R and f S Rd , where S Rd = f C Rd : supx Rd xαD αf(x) < , α, α Zd 0 is the Schwartz space of rapidly decreasing functions and α, α are multi-indices of dimension d. An example would be some M-sub-Gaussian distributions (µ, ν) with pq 1 = cee x2 , where c is normaliza- tion constant (e.g., take q to be multivariate Gaussian, p(x) = cee x2 q(x) and c such that R Rd q(x)dx = 1 ). We note that f S Rd implies existence of Fourier transforms and Fourier inversion formula such that S1(f) < . D.2 χ2 Divergence With ˆI(M) as defined in (D.1), let ˆP2 χ2(M, ℓ) := n (µ, ν) P2 χ2 Rd : µ, ν SG(M), fχ2 ˆI(M), fχ2 ℓ,µ M o . Proposition 70 (χ2 Co D-free error bound) Let M 0, ℓ> 1 and ℓ = ℓ/(ℓ 1). Then, for rk = M 1 + 4 d Mℓ log k, sup (µ,ν) ˆP2 χ2(M,ℓ) E h ˆχ2 ˆGS k(Mrk,rk)(Xn, Y n) χ2 (µ ν) i d,M,ℓk 1 2 (log k) 1 2 + log k n 1 Setting k = n yields an effective error bound O n 1/2 . Proof Fix (µ, ν) ˆP2 χ2(M, ℓ). From (D.2), (A.88) and (A.89), we have Eµ h fχ2 1Bc d(rk) i Eν h hχ2 fχ2 1Bc d(rk) i M e Note that c B fχ2|Bd(rk), Bd(rk) Mrk for rk 1, C(| ˆGS k(mk, rk)|, Bd(rk)) 3Mrk, and C h χ2 GS k(mk, rk) , Bd(rk) 1.5Mrk+1. Then we obtain similar to (A.87) using Theorem 8 and (D.3) that E h ˆχ2 ˆGS k(Mrk,rk)(Xn, Y n) χ2 (µ ν) i M rkd 1 2 k 1 2 + r2 kdk 1 + e 16d Mℓ + d 3 2 r2 kn 1 Setting rk = M 1 + 4 d Mℓ log k, and taking supremum w.r.t. (µ, ν) ˆP2 χ2(M, ℓ) proves the claim. 
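To make the resulting procedure concrete, the following is a minimal sketch of a shallow-NN neural estimator for $\chi^2(\mu\|\nu)$, assuming PyTorch is available. It maximizes the empirical objective $\frac{1}{n}\sum_i g(X_i) - \frac{1}{n}\sum_i h_{\chi^2}(g(Y_i))$ with $h_{\chi^2}(t) = t + t^2/4$ over a one-hidden-layer sigmoid network by gradient ascent; the unconstrained parametrization and the Adam loop are illustrative simplifications of the norm-constrained class $\hat{\mathcal{G}}^{\mathsf{S}}_k(Mr_k, r_k)$ and of the exact supremum analyzed above.

```python
# Minimal sketch (not the exact construction analyzed above): shallow-NN neural
# estimation of chi^2(mu || nu) from samples, assuming PyTorch is installed.
import torch
import torch.nn as nn

class ShallowNet(nn.Module):
    """One-hidden-layer network g_theta: R^d -> R with k neurons."""
    def __init__(self, d, k, activation=torch.sigmoid):
        super().__init__()
        self.hidden = nn.Linear(d, k)   # inner weights w_i and biases b_i
        self.output = nn.Linear(k, 1)   # outer weights beta_i and bias b_0
        self.activation = activation

    def forward(self, x):
        return self.output(self.activation(self.hidden(x))).squeeze(-1)

def chi2_neural_estimate(X, Y, k=64, steps=2000, lr=1e-3):
    """X, Y: float tensors of shape (n, d) holding samples from mu and nu.
    Returns an approximation of sup_g (1/n) sum g(X_i) - (1/n) sum [g(Y_i) + g(Y_i)^2/4],
    the supremum being replaced by gradient ascent over the NN parameters."""
    g = ShallowNet(X.shape[1], k)
    opt = torch.optim.Adam(g.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        gx, gy = g(X), g(Y)
        objective = gx.mean() - (gy + 0.25 * gy**2).mean()  # h_chi2(t) = t + t^2/4
        (-objective).backward()                             # ascend the objective
        opt.step()
    with torch.no_grad():
        gx, gy = g(X), g(Y)
        return (gx.mean() - (gy + 0.25 * gy**2).mean()).item()
```

The truncation to $B_d(r_k)$, the linear skip term, and the explicit weight constraints in the NN classes defined above are omitted from this sketch; they matter for the guarantees but not for illustrating the estimator. Replacing the measurement function by $h_{\mathrm{KL}}(t) = e^t - 1$ (or by $h_{H^2}(t) = t/(1-t)$, with the network output kept below $1$) yields the corresponding KL and $H^2$ objectives.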
Remark 71 (CoD-free rate) $\hat{\mathcal{P}}^2_{\chi^2}(M,\ell)$ contains certain $M$-sub-Gaussian distribution pairs $(\mu,\nu)$ such that $\|f_{\chi^2}\|_{\ell,\mu} \leq M$ and $f_{\chi^2} \in \mathcal{S}(\mathbb{R}^d) \cup \mathcal{L}^{\mathrm{B}}_{s_{\mathrm{B}},b}(\mathbb{R}^d)$ for an appropriate value of $b$. In particular, this includes certain Gaussian distribution pairs $\big(N(m_p,\sigma_p^2 I_d), N(m_q,\sigma_q^2 I_d)\big)$ with $0 < \sigma_p < \sigma_q \leq M$ and $\|m_p - m_q\| \leq M$. To see this, recall that $f_{\chi^2} = 2(pq^{-1}-1)$, and note that $\sigma_q > \sigma_p$ ensures $\|f_{\chi^2}\|_{\infty,\mathbb{R}^d} < \infty$, which implies $\|f_{\chi^2}\|_{\ell,\mu} < \infty$. Also, since $pq^{-1}$ is again (up to constants) a Gaussian density, $\mathcal{F}[pq^{-1}]$ exists and is itself a Gaussian density (up to constants). Hence, $\mathcal{F}[pq^{-1}]$ is integrable, which implies that the Fourier inversion formula holds. Moreover, it is easy to verify that $S_1(pq^{-1}) < \infty$. Hence, such Gaussian pairs satisfy the conditions defining $\hat{\mathcal{P}}^2_{\chi^2}(M,\ell)$ for large enough $M$, and the claim follows.

D.3 Squared Hellinger Distance

Let $\check{\mathcal{G}}^{\mathsf{S}}_{k,t}(a,r) := \big\{g\mathbb{1}_{B_d(r)} : g \in \mathcal{G}_k\big(k^{1/2}\log k,\ 2k^{-1}a,\ a,\ 0,\ \phi_{\mathsf{S}}\big)\big\}$, and
$$\hat{\mathcal{P}}^2_{H^2}(M) := \Big\{(\mu,\nu) \in \mathcal{P}^2_{H^2}\big(\mathbb{R}^d\big) : \mu,\nu \in \mathrm{SG}(M),\ f_{H^2} \in \hat{I}(M),\ \big\|d\mu/d\nu\big\|_{\infty,\mathbb{R}^d} \leq M\Big\},$$
where $\hat{I}(M)$ is given in (D.1).

Proposition 72 ($H^2$ CoD-free error bound) For $M \geq 0$, $m_k = Mr_k$ and $r_k = M \vee 1 + \sqrt{32\,d\,M\log k}$,
$$\sup_{(\mu,\nu)\in \hat{\mathcal{P}}^2_{H^2}(M)} \mathbb{E}\Big[\big|\hat{H}^2_{\check{\mathcal{G}}^{\mathsf{S}}_{k,m_k^{-1/2}}(m_k,r_k)}(X^n,Y^n) - H^2(\mu,\nu)\big|\Big] \lesssim_{d,M} k^{-\frac{1}{2}}\log k + n^{-\frac{1}{2}}\log k.$$
Setting $k = n$ yields an effective error bound $O(n^{-1/2})$.

Proof Fix $(\mu,\nu) \in \hat{\mathcal{P}}^2_{H^2}(M)$. From (D.2), (A.94) and (A.95), we obtain
$$\mathbb{E}_\mu\big[|f_{H^2}|\,\mathbb{1}_{B_d^c(r_k)}\big] \vee \mathbb{E}_\nu\big[|h_{H^2} \circ f_{H^2}|\,\mathbb{1}_{B_d^c(r_k)}\big] \lesssim e^{-\frac{(r_k - M\vee 1)^2}{32\,d\,M}}.$$
We have $c^{\mathrm{B}}\big(f_{H^2}|_{B_d(r_k)}, B_d(r_k)\big) \leq Mr_k$ for $r_k \geq 1$, $C\big(|\check{\mathcal{G}}^{\mathsf{S}}_{k,t}(m_k,r_k)|, B_d(r_k)\big) \leq 3Mr_k$, and $C\big(|h'_{H^2} \circ \check{\mathcal{G}}^{\mathsf{S}}_{k,t}(m_k,r_k)|, B_d(r_k)\big) \leq t^{-2}$. Then, for $k, r_k$ satisfying $r_k = o(k^{1/4})$, we have, similarly to (A.93), using Theorem 8 and (D.3), that
$$\mathbb{E}\Big[\big|\hat{H}^2_{\check{\mathcal{G}}^{\mathsf{S}}_{k,m_k^{-1/2}}(m_k,r_k)}(X^n,Y^n) - H^2(\mu,\nu)\big|\Big] \lesssim_M r_k^2\, d^{\frac{1}{2}}\, k^{-\frac{1}{2}} + e^{-\frac{(r_k - M\vee 1)^2}{32\,d\,M}} + d^{\frac{3}{2}}\, r_k^2\, n^{-\frac{1}{2}}.$$
Setting $r_k = M\vee 1 + \sqrt{32\,d\,M\log k}$ and taking the supremum w.r.t. $(\mu,\nu) \in \hat{\mathcal{P}}^2_{H^2}(M)$, we obtain the claim in the proposition.

Remark 73 (CoD-free rate) $\hat{\mathcal{P}}^2_{H^2}(M)$ includes certain $M$-sub-Gaussian pairs $(\mu,\nu)$ such that $\|pq^{-1}\|_{\infty,\mathbb{R}^d} \leq M$ and $qp^{-1} = (e^f + c)^2$ for some $f \in \mathcal{S}(\mathbb{R}^d)$, where $c$ is the normalization constant ensuring that $p$ and $q$ are probability densities. To see this, note that $p$ and $pq^{-1}$ are both bounded on $\mathbb{R}^d$. Moreover, $f_{H^2} = 1 - \sqrt{qp^{-1}} = 1 - c - e^f$. Noting that $1 - e^f \in \mathcal{S}(\mathbb{R}^d)$ whenever $f \in \mathcal{S}(\mathbb{R}^d)$, it follows, as discussed in Remark 69, that $S_1(f_{H^2}) < \infty$, thus implying the claim.
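For ease of reference, the variational identities behind the three divergences treated in this appendix, together with the optimal potentials and the measurement functions used in the proofs above, can be summarized as follows (a standard recap; the suprema run over measurable $f$, restricted to $f < 1$ in the $H^2$ case):
$$
\begin{aligned}
D_{\mathrm{KL}}(\mu\|\nu) &= \sup_{f}\, \mathbb{E}_\mu[f] - \mathbb{E}_\nu\big[e^{f}-1\big], & f_{\mathrm{KL}} &= \log\tfrac{d\mu}{d\nu}, & h_{\mathrm{KL}}(t) &= e^{t}-1,\\
\chi^2(\mu\|\nu) &= \sup_{f}\, \mathbb{E}_\mu[f] - \mathbb{E}_\nu\Big[f+\tfrac{f^2}{4}\Big], & f_{\chi^2} &= 2\Big(\tfrac{d\mu}{d\nu}-1\Big), & h_{\chi^2}(t) &= t+\tfrac{t^2}{4},\\
H^2(\mu,\nu) &= \sup_{f}\, \mathbb{E}_\mu[f] - \mathbb{E}_\nu\Big[\tfrac{f}{1-f}\Big], & f_{H^2} &= 1-\sqrt{\tfrac{d\nu}{d\mu}}, & h_{H^2}(t) &= \tfrac{t}{1-t}.
\end{aligned}
$$
In each case the supremum is attained at the listed potential, whose smoothness and tail behavior are what drive the bounds in Appendices A and D.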
References

S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society: Series B (Methodological), 28(1):131-142, Jan. 1966.
M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, pages 214-223, Sydney, Australia, Jul. 2017.
S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In Proceedings of the 34th International Conference on Machine Learning, pages 224-232, Sydney, Australia, Jul. 2017.
F. Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19):1-53, Apr. 2017.
A. R. Barron. Neural net approximation. In Proceedings of the 7th Yale Workshop on Adaptive and Learning Systems, CT, USA, May 1992.
A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930-945, May 1993.
A. R. Barron. Approximation and estimation bounds for artificial neural networks. Machine Learning, 14(1):115-133, Jan. 1994.
M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm. Mutual information neural estimation. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 531-540, Stockholm, Sweden, Jul. 2018.
T. B. Berrett and R. J. Samworth. Efficient two-sample functional estimation and the super-oracle phenomenon. arXiv:1904.09347, Apr. 2019.
T. B. Berrett, R. J. Samworth, and M. Yuan. Efficient multivariate entropy estimation via k-nearest neighbour distances. The Annals of Statistics, 47(1):288-318, Feb. 2019.
I. Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229-318, Jan. 1967.
C. Domingo-Enrich and Y. Mroueh. Tighter sparse approximation bounds for ReLU neural networks. arXiv:2110.03673, Oct. 2021.
R. M. Dudley. Uniform Central Limit Theorems. Cambridge University Press, 1999.
E. Giné and R. Nickl. Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press, 2015.
Z. Goldfeld, K. Greenewald, and K. Kato. Asymptotic guarantees for generative modeling based on the smooth Wasserstein distance. In Proceedings of the 34th Annual Conference on Advances in Neural Information Processing Systems, Dec. 2020a.
Z. Goldfeld, K. Greenewald, J. Niles-Weed, and Y. Polyanskiy. Convergence of smoothed empirical measures with applications to entropy estimation. IEEE Transactions on Information Theory, 66(7):4368-4391, Jul. 2020b.
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proceedings of the 28th Annual Conference on Advances in Neural Information Processing Systems, pages 2672-2680, Montréal, Canada, Dec. 2014.
M. Hallin, G. Mordant, and J. Segers. Multivariate goodness-of-fit tests based on Wasserstein distance. Electronic Journal of Statistics, 15(1):1328-1371, Mar. 2021.
Y. Han, J. Jiao, T. Weissman, and Y. Wu. Optimal rates of entropy estimation over Lipschitz balls. The Annals of Statistics, 48(6):3228-3250, Dec. 2020.
C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan. A short note on concentration inequalities for random vectors with subgaussian norm. arXiv:1902.03736, Feb. 2019.
M. Kac, J. Kiefer, and J. Wolfowitz. On tests of normality and other tests of goodness of fit based on distance methods. The Annals of Mathematical Statistics, 26(2):189-211, Jun. 1955.
B. Kalantari. An infinite family of bounds on zeros of analytic functions and relationship to Smale's bound. Mathematics of Computation, 74(250):841-852, May 2004.
K. Kandasamy, A. Krishnamurthy, B. Póczos, L. Wasserman, and J. M. Robins. Nonparametric von Mises estimators for entropies, divergences and mutual informations. In Proceedings of the 29th Annual Conference on Advances in Neural Information Processing Systems, pages 397-405, Montréal, Canada, Dec. 2015.
D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations, Banff, Canada, Apr. 2014.
J. M. Klusowski and A. R. Barron. Approximation by combinations of ReLU and squared ReLU ridge functions with ℓ1 and ℓ0 controls. IEEE Transactions on Information Theory, 64(12):7649-7656, Oct. 2018.
A. Krishnamurthy, K. Kandasamy, B. Póczos, and L. Wasserman. Nonparametric estimation of Rényi divergence and friends. In Proceedings of the 31st International Conference on Machine Learning, pages 919-927, Beijing, China, Jun. 2014.
M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer-Verlag Berlin Heidelberg, 1991.
T. Liang. Estimating certain Integral Probability Metric (IPM) is as hard as estimating under the IPM. arXiv:1911.00730, Nov. 2019.
Y. Makovoz. Uniform approximation by neural networks. Journal of Approximation Theory, 95(2):215-228, 1998.
K. R. Moon, K. Sricharan, K. Greenewald, and A. O. Hero. Ensemble estimation of information divergence. Entropy, 20(8), Aug. 2018.
Y. Mroueh, I. Melnyk, P. Dognin, J. Ross, and T. Sercu. Improved mutual information estimation. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, volume 35, pages 9009-9017, May 2021.
A. Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429-443, Jun. 1997.
X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847-5861, Oct. 2010.
S. Nietert, Z. Goldfeld, and K. Kato. Smooth p-Wasserstein distance: Structure, empirical approximation, and statistical applications. In Proceedings of the 38th International Conference on Machine Learning, pages 8172-8183, Jul. 2021.
M. Noshad, K. R. Moon, S. Y. Sekeh, and A. O. Hero. Direct estimation of information divergence using nearest neighbor ratios. In Proceedings of the 2017 IEEE International Symposium on Information Theory, pages 903-907, Jun. 2017.
S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Proceedings of the 30th Annual Conference on Advances in Neural Information Processing Systems, pages 271-279, Barcelona, Spain, Dec. 2016.
G. Ongie, R. Willett, D. Soudry, and N. Srebro. A function space view of bounded norm infinite width ReLU nets: The multivariate case. In Proceedings of the 8th International Conference on Learning Representations, Apr.-May 2020.
F. Perez-Cruz. Kullback-Leibler divergence estimation of continuous distributions. In Proceedings of the 2008 IEEE International Symposium on Information Theory, pages 1666-1670, Toronto, ON, Canada, Jul. 2008.
B. Póczos, L. Xiong, and J. Schneider. Nonparametric divergence estimation with applications to machine learning on distributions. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, pages 599-608, Barcelona, Spain, Jul. 2011.
F. Santambrogio. Optimal Transport for Applied Mathematicians. Birkhäuser, 2015.
T. Schlegl, P. Seeböck, S. M. Waldstein, G. Langs, and U. Schmidt-Erfurth. f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks. Medical Image Analysis, 54:30-44, May 2019.
S. Singh and B. Póczos. Exponential concentration of a density functional estimator. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, pages 3032-3040, Montréal, Canada, Dec. 2014a.
S. Singh and B. Póczos. Generalized exponential concentration inequality for Rényi divergence estimation. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pages 333-341, Beijing, China, Jun. 2014b.
S. Singh and B. Póczos. Finite-sample analysis of fixed-k nearest neighbor density functional estimators. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 1225-1233, Barcelona, Spain, Dec. 2016.
S. Smale. Newton's method estimates from data at one point. In Proceedings of The Merging of Disciplines: New Directions in Pure, Applied, and Computational Mathematics. Springer New York, 1986.
S. Sreekumar, Z. Zhang, and Z. Goldfeld. Non-asymptotic performance guarantees for neural estimation of f-divergences. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, volume 130, pages 3322-3330, Apr. 2021.
B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, and G. R. G. Lanckriet. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550-1599, 2012.
M. Stinchcombe and H. White. Approximating and learning unknown mappings using multilayer feedforward networks with bounded weights. In Proceedings of the International Joint Conference on Neural Networks, pages 7-16, San Diego, CA, US, Jun. 1990.
T. Suzuki. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, US, May 2019.
I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schölkopf. Wasserstein auto-encoders. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, Canada, Apr.-May 2018.
A. Uppal, S. Singh, and B. Póczos. Nonparametric density estimation and convergence of GANs under Besov IPM losses. In Proceedings of the 33rd Annual Conference on Advances in Neural Information Processing Systems, volume 32, Vancouver, Canada, Dec. 2019.
A. Van Der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer, New York, 1996.
R. Van Handel. Probability in High Dimension: Lecture Notes, Princeton University. [Online]. Available: https://web.math.princeton.edu/~rvan/APC550.pdf, 2016.
C. Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.
Q. Wang, S. R. Kulkarni, and S. Verdu. Divergence estimation of continuous distributions based on data-dependent partitions. IEEE Transactions on Information Theory, 51(9):3064-3074, Sep. 2005.
A. Wisler, K. Moon, and V. Berisha. Direct ensemble estimation of density functionals. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 2866-2870, Apr. 2018.
Y. Yang and A. Barron. Information-theoretic determination of minimax rates of convergence. The Annals of Statistics, 27(5):1564-1599, Oct. 1999.
D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103-114, Oct. 2017.
J. E. Yukich, M. B. Stinchcombe, and H. White. Sup-norm approximation bounds for networks through probabilistic methods. IEEE Transactions on Information Theory, 41(4):1021-1027, Jul. 1995.
H. Zenati, M. Romain, C.-S. Foo, B. Lecouat, and V. Chandrasekhar. Adversarially learned anomaly detection. In Proceedings of the 2018 IEEE International Conference on Data Mining, pages 727-736, Nov. 2018.
P. Zhang, Q. Liu, D. Zhou, T. Xu, and X. He. On the discrimination-generalization tradeoff in GANs. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, Canada, Apr.-May 2018a.
Q. Zhang, S. Filippi, A. Gretton, and D. Sejdinovic. Large-scale kernel methods for independence testing. Statistics and Computing, 28:113-130, Jan. 2018b.
V. M. Zolotarev. Probability metrics. Teoriya Veroyatnostei i ee Primeneniya, 28(2):264-287, 1983.