# Nonparametric Estimation of Rényi Divergence and Friends

Akshay Krishnamurthy (AKSHAYKR@CS.CMU.EDU), Kirthevasan Kandasamy (KANDASAMY@CS.CMU.EDU), Barnabás Póczos (BAPOCZOS@CS.CMU.EDU), Larry Wasserman (LARRY@STAT.CMU.EDU)
Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh PA 15213

We consider nonparametric estimation of the $L_2^2$, Rényi-$\alpha$, and Tsallis-$\alpha$ divergences between continuous distributions. Our approach is to construct estimators for particular integral functionals of two densities and translate them into divergence estimators. For the integral functionals, our estimators are based on corrections of a preliminary plug-in estimator. We show that these estimators achieve the parametric convergence rate of $n^{-1/2}$ when the smoothness $s$ of both densities is at least $d/4$, where $d$ is the dimension. We also derive minimax lower bounds for this problem which confirm that $s > d/4$ is necessary to achieve the $n^{-1/2}$ rate of convergence. We validate our theoretical guarantees with a number of simulations.

## 1. Introduction

Given samples from two distributions, one fundamental and classical question to ask is: how close are the two distributions? First, one must specify what it means for two distributions to be close, for which a number of divergences have been proposed. Then there is the statistical question: how does one estimate a divergence given samples from the two distributions? In this paper, we propose and analyze estimators for three common divergences.

Divergence estimation has a number of applications across machine learning and statistics. In statistics, one can use these estimators to construct two-sample and independence tests (Pardo, 2005). In machine learning, it is often convenient to view training data as a set of distributions and use divergences to estimate dissimilarity between examples. This idea has been used in neuroscience, where the neural response pattern of an individual is modeled as a distribution, and divergence is used to compare responses across subjects (Johnson et al., 2001). It has also enjoyed success in computer vision, where features are computed for each patch of an image and these feature vectors are modeled as independent draws from an underlying distribution (Póczos et al., 2012). For these applications and others, it is crucial to accurately estimate divergences given samples drawn independently from each distribution.

In the nonparametric setting, a number of authors have proposed various estimators which are provably consistent. However, apart from a few examples, the actual rates of convergence of these estimators and the minimax optimal rates are still unknown.

In this work, we propose three estimators for the $L_2^2$, Rényi-$\alpha$, and Tsallis-$\alpha$ divergences between two continuous distributions. Our strategy is to correct an initial plug-in estimator by estimates of the higher-order terms in the von Mises expansion of the divergence functional. We establish the rates of convergence for these estimators under the assumption that both densities belong to a Hölder class of smoothness $s$. Concretely, we show that the plug-in estimator achieves rate $n^{-s/(2s+d)}$, while correcting by the first-order terms in the expansion results in an $n^{-\min\{2s/(2s+d),\,1/2\}}$ estimator, and correcting further by the second-order terms gives an $n^{-\min\{3s/(2s+d),\,1/2\}}$ estimator.
These last two estimators achieve the parametric $n^{-1/2}$ rate as long as the smoothness $s$ is larger than $d/2$ and $d/4$, respectively, where $d$ is the dimension. Moreover, the first-order estimator, while statistically worse than the second-order estimator, is computationally very elegant. These results contribute to our fairly limited knowledge of this important problem (Nguyen et al., 2010; Singh & Póczos, 2014).

We also address the issue of statistical optimality by deriving a minimax lower bound on the convergence rate. Specifically, we show that one cannot estimate these quantities at better than an $n^{-4s/(4s+d)}$ rate when $s \le d/4$ and an $n^{-1/2}$ rate otherwise. This establishes the optimality of our best estimator in the smooth regime and shows that $d/4$ is the critical smoothness for this problem.

The remainder of this manuscript is organized as follows. After discussing some related work on divergence estimation and the closely related entropy estimation problem in Section 2, we present our estimators and main results in Sections 3 and 4. We provide proof sketches in Section 5. We present some numerical simulations in Section 6 and conclude with some open questions in Section 7. We defer many proof details and several calculations to the appendices.

### 1.1. Preliminaries

Let us begin by standardizing notation and presenting some basic definitions. We will be concerned with two densities $p, q : [0,1]^d \to \mathbb{R}_{\ge 0}$, where $d$ denotes the dimension. Formally, letting $\mu$ denote the Lebesgue measure on $[0,1]^d$, we are interested in two probability distributions $P, Q$ with Radon-Nikodym derivatives $p = dP/d\mu$, $q = dQ/d\mu$. Except for in this section, we will operate exclusively with the densities. Throughout, the samples $\{X_i\}_{i=1}^n$ will be drawn independently from $p$, while the samples $\{Y_i\}_{i=1}^n$ will be drawn independently from $q$. For simplicity, assume that we are given $n$ samples from each distribution, although it is not hard to adjust the estimators and results to unequal sample sizes. The divergences of interest are:

1. $L_2^2$-divergence: $L_2^2(p,q) = \int (p(x) - q(x))^2 \, d\mu(x)$
2. Rényi-$\alpha$ divergence (Rényi, 1961): $D_\alpha(p,q) = \frac{1}{\alpha - 1} \log \int p^\alpha(x) q^{1-\alpha}(x) \, d\mu(x)$
3. Tsallis-$\alpha$ divergence (Tsallis, 1988): $T_\alpha(p,q) = \frac{1}{\alpha - 1} \left( \int p^\alpha(x) q^{1-\alpha}(x) \, d\mu(x) - 1 \right)$

Technically, these divergences are functionals on distributions, rather than densities, but we will abuse notation and write them as above. As a unification, we consider estimating functionals of the form

$$T(p,q) = \int p^\alpha(x) q^\beta(x) \, d\mu(x)$$

for given $\alpha, \beta$. Various settings of $\alpha, \beta$ yield the main terms in the divergences, and we will verify that estimators for $T(p,q)$ result in good divergence estimators.

The sine qua non of our work is the von Mises expansion (see Chapter 20 of van der Vaart (2000) for an introduction to von Mises calculus). Given a functional $T$ mapping distributions to the reals, the first-order von Mises expansion is

$$T(F) = T(G) + dT(G; F - G) + R_2,$$

where $F$ and $G$ are distributions, $R_2$ is a remainder term, and $dT(G; F - G)$ is the Gateaux derivative of $T$ at $G$ in the direction of $F - G$:

$$dT(G; F - G) = \lim_{\tau \to 0} \frac{T(G + \tau(F - G)) - T(G)}{\tau}.$$

In our work, $T$ is always of the form $T(F) = \int \phi(f) \, d\mu$, where $f = dF/d\mu$ is the Radon-Nikodym derivative and $\phi$ is differentiable. In this case, the von Mises expansion reduces to a functional Taylor expansion on the densities (see Lemma 8 in the Appendix):

$$T(F) = T(G) + \int \frac{\partial \phi(g(x))}{\partial g} \big(f(x) - g(x)\big) \, d\mu(x) + \cdots$$

We generalize these ideas to functionals of two distributions and to higher-order expansions analogous to the Taylor expansion. We often write $T(f)$ instead of $T(F)$.
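To make the role of the expansion concrete, one can specialize it to the functional of interest, $T(p,q) = \int p^\alpha q^\beta \, d\mu$. The following display is a routine first-order expansion about a pair of preliminary density estimates $(\hat p, \hat q)$, shown here only for orientation; it anticipates the constant and linear terms that are defined formally in Section 3:

$$\begin{aligned}
\int p^\alpha q^\beta \, d\mu
&= \int \hat p^{\alpha} \hat q^{\beta} \, d\mu
+ \int \alpha\, \hat p^{\alpha-1} \hat q^{\beta}\, (p - \hat p)\, d\mu
+ \int \beta\, \hat p^{\alpha} \hat q^{\beta-1}\, (q - \hat q)\, d\mu
+ O\big(\|p - \hat p\|_2^2 + \|q - \hat q\|_2^2\big) \\
&= (1 - \alpha - \beta)\, T(\hat p, \hat q)
+ \mathbb{E}_{X \sim p}\big[\alpha\, \hat p^{\alpha-1}(X)\, \hat q^{\beta}(X)\big]
+ \mathbb{E}_{Y \sim q}\big[\beta\, \hat p^{\alpha}(Y)\, \hat q^{\beta-1}(Y)\big]
+ O\big(\|p - \hat p\|_2^2 + \|q - \hat q\|_2^2\big).
\end{aligned}$$

The two expectations can be estimated by sample means at an $n^{-1/2}$ rate, and when $\alpha + \beta = 1$ (the Rényi and Tsallis settings) the leading constant vanishes, which is what makes the first-order estimator of Section 3 so simple.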
## 2. Related Work

Divergence estimation and its applications have received considerable attention over the past several decades. Pardo (2005) provides a fairly comprehensive discussion of methods and applications in the context of discrete distributions. Only recently has attention shifted to the continuous, nonparametric setting, where a number of efforts have established consistent estimators. Many of the approaches are based on nearest-neighbor graphs (Hero & Michel, 1999; Wang et al., 2009; Póczos & Schneider, 2011; Källberg & Seleznjev, 2012). For example, Póczos and Schneider use a k-nearest-neighbor estimator and show that one does not need a consistent density estimator to consistently estimate Rényi-$\alpha$ and Tsallis-$\alpha$ divergences. A number of other authors have also proposed consistent estimators via the empirical CDF or histograms (Wang et al., 2005; Pérez-Cruz, 2008). Unfortunately, the rates of convergence for all of these methods are still unknown.

Singh and Póczos (2014) recently established a rate of convergence for an estimator based on simply plugging kernel density estimates into the divergence functional. Their estimator converges at an $n^{-s/(s+d)}$ rate when $s < d$ and $n^{-1/2}$ otherwise, which matches some existing results on estimating entropy functionals (Liu et al., 2012). In comparison, we show that corrections of the plug-in estimator lead to faster convergence rates and that the $n^{-1/2}$ rate can be achieved at the much lower smoothness of $s > d/4$. Moreover, we establish a minimax lower bound for this problem, which shows that $d/4$ is the critical smoothness index.

Nguyen et al. (2010) construct an estimator for Csiszár f-divergences via regularized M-estimation and prove a rate of convergence when the likelihood ratio $dP/dQ$ belongs to a reproducing kernel Hilbert space. Their rate depends on the complexity of this RKHS, but it is not clear how to translate these assumptions into our Hölderian one, so the results are somewhat incomparable. Källberg and Seleznjev (2012) study an ε-nearest-neighbor estimator for the $L_2^2$-divergence that enjoys the same rate of convergence as our projection-based estimator. They prove that the estimator is asymptotically normal in the $s > d/4$ regime, which one can also show for our estimator. In the more general setting of estimating polynomial functionals of the densities, they only show consistency of their estimator, while we also characterize the convergence rate.

A related and flourishing line of work is on estimating entropy functionals. The majority of the methods are graph-based, involving either nearest-neighbor graphs or spanning trees over the data (Hero et al., 2002; Leonenko et al., 2008; Leonenko & Seleznjev, 2010; Pál et al., 2010; Sricharan et al., 2010). One exception is the KDE-based estimator for mutual information and joint entropy of Liu, Lafferty, and Wasserman (2012). A number of these estimators come with provable convergence rates. While it is not clear how to port these ideas to divergence estimation, it is still worth comparing rates. The estimator of Liu et al. (2012) converges at rate $n^{-s/(s+d)}$, achieving the parametric rate when $s > d$. Similarly, Sricharan et al. (2010) show that when $s > d$, a k-NN-style estimator achieves rate $n^{-2/d}$ (in absolute error), ignoring logarithmic factors. In a follow-up work, the authors improve this result to $O(n^{-1/2})$ using an ensemble of weak estimators, but they require $s > d$ orders of smoothness (Sricharan et al., 2012).
In contrast, our estimators achieve the parametric $n^{-1/2}$ rate at lower smoothness ($s > d/2$ and $s > d/4$ for the first-order and second-order estimators, respectively) and enjoy a faster rate of convergence uniformly over smoothness. Interestingly, while many of these methods are plug-in based, the choice of tuning parameter is typically suboptimal for density estimation. This contrasts with our technique of correcting optimal density estimators.

We are not aware of any lower bounds for divergence estimation, although analogous results have been established for the entropy estimation problem. Specifically, Birgé and Massart (1995) prove an $n^{-4s/(4s+d)}$ lower bound for estimating integral functionals of a density. Hero et al. (2002) give a matching lower bound for estimating Rényi entropies.

Finally, our estimators and proof techniques are based on several classical works on estimating integral functionals of a density. The goal there is to estimate $\int \phi(f(x)) \, d\mu(x)$, for some known function $\phi$, given samples from $f$. A series of papers show that an $n^{-1/2}$ rate of convergence is attainable if and only if $s > d/4$, which is analogous to our results (Birgé & Massart, 1995; Laurent, 1996; Kerkyacharian & Picard, 1996; Bickel & Ritov, 1988). Of course, our results pertain to the two-density setting, which encompasses the divergences of interest. We also generalize some of these results to the multi-dimensional setting.

## 3. The Estimators

Recall that we are interested in estimating integral functionals of the form $T(p,q) = \int p^\alpha(x) q^\beta(x) \, d\mu(x)$. As an initial attempt, with estimators $\hat{p}$ and $\hat{q}$ for $p$ and $q$, we can use the plug-in estimator $\hat{T}_{\mathrm{pl}} = T(\hat{p}, \hat{q})$. Via the von Mises expansion of $T(p,q)$, the error is of the form:

$$\big|\hat{T}_{\mathrm{pl}} - T(p,q)\big| \le c_1 \|\hat{p} - p\|_1 + c_2 \|\hat{q} - q\|_1.$$

Classical results on density estimation then suggest that $\hat{T}_{\mathrm{pl}}$ will enjoy an $n^{-s/(2s+d)}$ rate (Devroye & Györfi, 1985).

A better convergence rate can be achieved by correcting the plug-in estimator with estimates of the linear term in the von Mises expansion. Informally speaking, the remainder of the first-order expansion is $O(\|\hat{p} - p\|_2^2)$, which decays as $n^{-2s/(2s+d)}$, while the linear terms can be estimated at an $n^{-1/2}$ rate. This estimator, which we call $\hat{T}_{\mathrm{lin}}$, enjoys a faster convergence rate than $\hat{T}_{\mathrm{pl}}$.

It is even better to augment the plug-in estimator with both the first- and second-order terms of the expansion. Here the remainder decays at rate $n^{-3s/(2s+d)}$, while the linear and quadratic terms can be estimated at $n^{-1/2}$ and $n^{-4s/(4s+d)}$ rates respectively. This corrected estimator $\hat{T}_{\mathrm{quad}}$ achieves the parametric rate whenever the smoothness $s > d/4$, which we will show to be minimax optimal.

We now formalize these heuristic developments (see Appendices A and B for details omitted in this section). Below we enumerate the terms in the first- and second-order von Mises expansions that we will estimate or compute:

$$\theta^p_{1,1} = \mathbb{E}_{X \sim p}\big[\alpha\, \hat{p}^{\alpha-1}(X)\, \hat{q}^{\beta}(X)\big], \qquad \theta^q_{1,1} = \mathbb{E}_{Y \sim q}\big[\beta\, \hat{p}^{\alpha}(Y)\, \hat{q}^{\beta-1}(Y)\big],$$
$$\theta^p_{2,1} = \mathbb{E}_{X \sim p}\big[\alpha(2 - \alpha - \beta)\, \hat{p}^{\alpha-1}(X)\, \hat{q}^{\beta}(X)\big], \qquad \theta^q_{2,1} = \mathbb{E}_{Y \sim q}\big[\beta(2 - \alpha - \beta)\, \hat{p}^{\alpha}(Y)\, \hat{q}^{\beta-1}(Y)\big],$$
$$\theta^p_{2,2} = \int \frac{\alpha(\alpha-1)}{2}\, \hat{p}^{\alpha-2} \hat{q}^{\beta}\, p^2 \, d\mu, \qquad \theta^q_{2,2} = \int \frac{\beta(\beta-1)}{2}\, \hat{p}^{\alpha} \hat{q}^{\beta-2}\, q^2 \, d\mu, \qquad \theta^{pq}_{2,2} = \int \alpha\beta\, \hat{p}^{\alpha-1} \hat{q}^{\beta-1}\, p q \, d\mu,$$

with constants $C_1 = 1 - (\alpha + \beta)$ and $C_2 = \frac{(\alpha+\beta)(\alpha+\beta+1)}{2} - 2(\alpha+\beta) + 1$. These definitions allow us to succinctly write the expansions of $T(p,q)$ about $T(\hat{p}, \hat{q})$:

$$T_0(p,q) = T(\hat{p}, \hat{q}) + R_1,$$
$$T_1(p,q) = C_1 T(\hat{p}, \hat{q}) + \theta^p_{1,1} + \theta^q_{1,1} + R_2,$$
$$T_2(p,q) = C_2 T(\hat{p}, \hat{q}) + \sum_{i=1,2} \sum_{f=p,q} \theta^f_{2,i} + \theta^{pq}_{2,2} + R_3,$$

with remainders $R_a = O\big(\|p - \hat{p}\|_a^a + \|q - \hat{q}\|_a^a\big)$.

We now turn to estimation of the $\theta^{(\cdot)}_{(\cdot),(\cdot)}$ terms. All of the $\theta^{(\cdot)}_{(\cdot),1}$ terms are linear; that is, they are of the form $\theta = \mathbb{E}_{Z \sim f}[\psi(Z)]$, where $\psi$ is known.
A natural estimator, given data $Z_1^n \sim f$, is the sample mean $\hat{\theta} = \frac{1}{n}\sum_{i=1}^n \psi(Z_i)$.

The terms $\theta^{(\cdot)}_{(\cdot),2}$ are of the form $\int \psi(x) f^2(x)\, d\mu(x)$ or $\int \psi(x) f(x) g(x)\, d\mu(x)$, again with known $\psi$. To estimate these terms, we have samples $X_1^n \sim f$ and $Y_1^n \sim g$. If $\{\phi_k\}_{k \in D}$ is an orthonormal basis for $L_2([0,1]^d)$, then the estimator for the bilinear term is:

$$\hat{\theta} = \frac{1}{n} \sum_{j=1}^n \Big( \sum_{k \in M} \hat{a}_k \phi_k(Y_j) \Big) \psi(Y_j), \qquad (1)$$

where $M \subset D$ is chosen to trade off the bias and the variance. To develop some intuition, if we knew $f$, we would simply use the sample mean $\frac{1}{n}\sum_{j=1}^n f(Y_j)\psi(Y_j)$. Since $f$ is actually unknown, we replace it with an estimator formed by truncating its Fourier expansion. Specifically, we replace $f$ with $\hat{f}(\cdot) = \sum_{k \in M} \hat{a}_k \phi_k(\cdot)$, where $\hat{a}_k = \frac{1}{n}\sum_{i=1}^n \phi_k(X_i)$.

For the quadratic functional, a projection estimator was proposed and analyzed by Laurent (1996):

$$\hat{\theta} = \frac{2}{n(n-1)} \sum_{i \ne j} \sum_{k \in M} \phi_k(X_i) \phi_k(X_j) \psi(X_j) \;-\; \frac{1}{n(n-1)} \sum_{i \ne j} \sum_{k, k' \in M} \phi_k(X_i) \phi_{k'}(X_j)\, b_{k,k'}(\psi), \qquad (2)$$

where $b_{k,k'}(\psi) = \int \phi_k(x) \phi_{k'}(x) \psi(x)\, dx$. The first term in the estimator is motivated by the same line of reasoning as in the bilinear estimator, while the second term significantly reduces the bias without impacting the variance.

Our final estimators for $T(p,q)$ are:

$$\hat{T}_{\mathrm{pl}} = T(\hat{p}, \hat{q}), \qquad \hat{T}_{\mathrm{lin}} = C_1 T(\hat{p}, \hat{q}) + \hat{\theta}^p_{1,1} + \hat{\theta}^q_{1,1}, \qquad \hat{T}_{\mathrm{quad}} = C_2 T(\hat{p}, \hat{q}) + \sum_{i=1,2} \sum_{f=p,q} \hat{\theta}^f_{2,i} + \hat{\theta}^{pq}_{2,2}.$$

Before proceeding to our theoretical analysis, we mention some algorithmic considerations. We estimate $\hat{p}, \hat{q}$ with kernel density estimators, which, except for in $\hat{T}_{\mathrm{pl}}$, we train on only half of the sample. This gives us independent samples with which to estimate the $\hat{\theta}^{(\cdot)}_{(\cdot),(\cdot)}$ terms. Second, in our analysis we will require that the KDEs are bounded above and below. Under the assumption that $p$ and $q$ are bounded above and below, we will show that clipping the original KDE does not affect the convergence rate. Another important issue with density estimation over bounded domains, which applies to our setting, is that the standard KDE suffers high bias near the boundary. To correct this bias, we adopt the strategy used by Liu et al. (2012) of mirroring the data set over the boundaries. We do not dwell too much on this issue, noting that this technique can be shown to suitably correct for boundary bias without substantially increasing the variance. This augmented estimator can be shown to match the rates of convergence in the literature (Devroye & Györfi, 1985; Tsybakov, 2009).

Lastly, the estimators all require integration of the term $T(\hat{p}, \hat{q})$, which can be computationally burdensome, particularly in high dimension. However, whenever $\alpha + \beta = 1$, as in the Rényi-$\alpha$ and Tsallis-$\alpha$ divergences, the constants $C_1, C_2$ are zero, so the first term may be omitted. In this case $\hat{T}_{\mathrm{lin}}$ is remarkably simple; it involves training KDEs and estimating a specific linear functional of them via the sample mean. Although this estimator is not minimax optimal, it enjoys a fairly fast rate of convergence while being computationally practical. Unfortunately, even when $C_2 = 0$, the quadratic estimator still involves integration of the $b_{k,k'}$ terms. We therefore advocate for $\hat{T}_{\mathrm{lin}}$ over $\hat{T}_{\mathrm{quad}}$ in practice, as $\hat{T}_{\mathrm{lin}}$ exhibits a better tradeoff between computational and statistical efficiency.
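To make Equation (1) concrete, the following is a minimal one-dimensional sketch (our illustration, not code from the paper): it uses the orthonormal cosine basis on $[0,1]$ and a fixed truncation level $M$, ignoring the bias–variance tuning of $M$ prescribed by Assumption 4 below.

```python
import numpy as np

def cosine_basis(k, x):
    """Orthonormal cosine basis on [0, 1]: phi_0 = 1, phi_k = sqrt(2) cos(pi k x)."""
    return np.ones_like(x) if k == 0 else np.sqrt(2.0) * np.cos(np.pi * k * x)

def bilinear_projection_estimate(X, Y, psi, M=20):
    """Estimate theta = int psi(x) f(x) g(x) dx from X ~ f and Y ~ g (Equation 1).

    The unknown f is replaced by its truncated expansion sum_k a_hat_k phi_k,
    with a_hat_k the empirical Fourier coefficient, and the result is averaged
    against psi over the sample from g.
    """
    f_hat_at_Y = np.zeros_like(np.asarray(Y, dtype=float))
    for k in range(M):
        a_hat_k = np.mean(cosine_basis(k, X))         # estimated coefficient of f
        f_hat_at_Y += a_hat_k * cosine_basis(k, Y)    # truncated expansion at each Y_j
    return np.mean(f_hat_at_Y * psi(Y))

# Toy check: f = g = Uniform[0,1] and psi = 1, so theta = int f*g = 1.
rng = np.random.default_rng(0)
X, Y = rng.uniform(size=2000), rng.uniform(size=2000)
print(bilinear_projection_estimate(X, Y, psi=np.ones_like))  # should be close to 1.0
```

The quadratic estimator in Equation (2) has the same flavor, except that it averages over pairs of points (a U-statistic) and subtracts the $b_{k,k'}$ correction term.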
## 4. Theoretical Results

For our theoretical analysis, we will assume that the densities $p, q$ belong to $\Sigma(s, L)$, the periodic Hölder class of smoothness $s$, defined as follows:

Definition 1. For any tuple $r = (r_1, \ldots, r_d)$, define $D^r = \frac{\partial^{r_1 + \cdots + r_d}}{\partial x_1^{r_1} \cdots \partial x_d^{r_d}}$. The periodic Hölder class $\Sigma(s, L)$ is the subset of $L_2([0,1]^d)$ such that for each $f \in \Sigma(s, L)$, the derivative $D^r f$ is periodic for every tuple $r$ with $\sum_j r_j < s$, and

$$|D^r f(x) - D^r f(y)| \le L \|x - y\|^{s - |r|} \qquad (3)$$

for all $x, y$ and for all tuples $r$ with $\sum_j r_j = \lfloor s \rfloor$, the largest integer strictly smaller than $s$.

Figure 1. Rates of convergence of the estimators $\hat{T}_{\mathrm{quad}}$, $\hat{T}_{\mathrm{lin}}$, $\hat{T}_{\mathrm{pl}}$, along with the rate of convergence in the lower bound (Theorem 2). The plot shows $\gamma$ versus the smoothness $s$ with $d = 10$, where the rate of convergence is $O(n^{-\gamma})$. The rate of convergence for each estimator is the smallest of the rates of all terms in the von Mises expansion, which translates to the value of the lowest curves in the figure.

We are now ready to state our main assumptions:

Assumption 1 (Smoothness). $p, q \in \Sigma(s, L)$ for some known smoothness $s$.

Assumption 2 (Boundedness). The densities are bounded above and below by known parameters $l, u$. Formally, $0 < l \le p(x), q(x) \le u < \infty$ for all $x \in [0,1]^d$.

Assumption 3 (Kernel Properties). The kernel $K : \mathbb{R}^d \to \mathbb{R}$ satisfies: (i) $\mathrm{supp}(K) \subset (-1, 1)^d$; (ii) $\int K(x)\, d\mu(x) = 1$; and (iii) $\int \prod_{i=1}^d x_i^{r_i}\, K(x)\, d\mu(x) = 0$ for all tuples $r \in \mathbb{N}^d$ with $1 \le \sum_j r_j \le \lfloor s \rfloor$.

Assumption 4 (Parameter Selection). Set the KDE bandwidth $h \asymp n^{-1/(2s+d)}$. For any projection-style estimator, set the number of basis elements $m \asymp n^{2d/(4s+d)}$.

The Hölderian assumption is standard in the nonparametric literature, while the periodicity assumption subsumes more standard boundary smoothness conditions (Liu et al., 2012). It is fairly straightforward to construct kernels meeting Assumption 3 (Tsybakov, 2009), while the boundedness assumption is common in the literature on estimating integral functionals of a density (Birgé & Massart, 1995).

The following theorem characterizes the rate of convergence of our estimators $\hat{T}_{\mathrm{pl}}, \hat{T}_{\mathrm{lin}}, \hat{T}_{\mathrm{quad}}$:

Theorem 1. Under Assumptions 1–4 we have:

$$\mathbb{E}\big|\hat{T}_{\mathrm{pl}} - T(p,q)\big| = O\big(n^{-s/(2s+d)}\big),$$
$$\mathbb{E}\big|\hat{T}_{\mathrm{lin}} - T(p,q)\big| = O\big(n^{-1/2} + n^{-2s/(2s+d)}\big),$$
$$\mathbb{E}\big|\hat{T}_{\mathrm{quad}} - T(p,q)\big| = O\big(n^{-1/2} + n^{-4s/(4s+d)} + n^{-3s/(2s+d)}\big).$$

All expectations are taken with respect to the samples $X_1^n, Y_1^n$. When $s = d/4$, $\hat{T}_{\mathrm{quad}}$ enjoys an $O(n^{-1/2 + \epsilon})$ rate of convergence for any $\epsilon > 0$ (the constant depends exponentially on $1/\epsilon$ and is infinite for $\epsilon = 0$). $\hat{T}_{\mathrm{lin}}$ and $\hat{T}_{\mathrm{quad}}$ achieve the parametric rate when $s > d/2$ and $s > d/4$, respectively.

Before commenting on the upper bound and presenting some consequences, we address the question of statistical efficiency. Clearly $\hat{T}_{\mathrm{pl}}$ and $\hat{T}_{\mathrm{lin}}$ are not rate-optimal, since $\hat{T}_{\mathrm{quad}}$ achieves a faster rate of convergence, but is $\hat{T}_{\mathrm{quad}}$ minimax optimal? We make some progress in this direction with a minimax lower bound on the rate of convergence.

Theorem 2. Under Assumptions 1 and 2, as long as both $\alpha, \beta \neq 0, 1$, then with $\gamma_\star = \min\{4s/(4s+d),\, 1/2\}$ and for any $\epsilon > 0$:

$$\inf_{\hat{T}_n}\ \sup_{p, q \in \Sigma(s, L)} \mathbb{E}\Big[\big|\hat{T}_n - T(p, q)\big|\Big] \ge c_\epsilon\, n^{-\gamma_\star - \epsilon},$$

where the infimum is over all estimators $\hat{T}_n$ and $c_\epsilon > 0$ is a constant depending on $\epsilon$.

For a pictorial understanding of the rates of convergence and the lower bound, we plot the exponent $\gamma$ for each of the terms in the von Mises expansion as a function of the smoothness $s$ in Figure 1. The estimator $\hat{T}_{\mathrm{quad}}$ has three terms, with rates $n^{-1/2}$, $n^{-4s/(4s+d)}$, and $n^{-3s/(2s+d)}$ respectively; it achieves the parametric rate $n^{-1/2}$ when $s > d/4$ and is $n^{-3s/(2s+d)}$ in the low-smoothness regime. The linear estimator only achieves the parametric rate when $s > d/2$, while $\hat{T}_{\mathrm{pl}}$ only approaches the parametric rate as $s \to \infty$. Consequently, these estimators are statistically inferior to $\hat{T}_{\mathrm{quad}}$. In the last plot we show the lower bound on the rate of convergence from Theorem 2, which is $n^{-4s/(4s+d)}$ when $s \le d/4$ and $n^{-1/2}$ when $s > d/4$. The lower bound rate deviates slightly from the upper bound for $\hat{T}_{\mathrm{quad}}$ in the low-smoothness regime, showing that $\hat{T}_{\mathrm{quad}}$ is also not minimax-optimal uniformly over $s$.

This sub-optimality appears even when estimating integral functionals of a single density (Birgé & Massart, 1995). In that context, achieving the optimal rate of convergence in the non-smooth regime involves further correction by the third-order term in the expansion (Kerkyacharian & Picard, 1996). It seems as if the same ideas can be adapted to the two-density setting, although we believe computational considerations would render these estimators impractical.
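Since only the caption of Figure 1 survives in this version, the following short script (ours, purely illustrative) evaluates the exponents discussed above, so the crossover points $s = d/4$ and $s = d/2$ can be checked numerically.

```python
def rate_exponents(s, d):
    """Exponents gamma such that the error of each estimator is O(n^-gamma)."""
    gamma_pl   = s / (2 * s + d)                                     # plug-in
    gamma_lin  = min(2 * s / (2 * s + d), 0.5)                       # first-order correction
    gamma_quad = min(3 * s / (2 * s + d), 4 * s / (4 * s + d), 0.5)  # second-order correction
    gamma_low  = min(4 * s / (4 * s + d), 0.5)                       # lower bound (Theorem 2)
    return gamma_pl, gamma_lin, gamma_quad, gamma_low

for s in (1.0, 2.5, 5.0, 10.0):  # d = 10, as in Figure 1
    print(s, [round(g, 3) for g in rate_exponents(s, d=10)])
# T_quad attains gamma = 1/2 once s > d/4 = 2.5; T_lin needs s > d/2 = 5; T_pl never reaches it.
```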
In the smooth regime ($s > d/4$) we see that the parametric $n^{-1/2}$ rate is both necessary and sufficient. This critical smoothness index of $s = d/4$ was also observed in the context of estimating integral functionals of densities (Birgé & Massart, 1995; Laurent, 1996). When $s = d/4$, the quadratic estimator achieves an $n^{-1/2+\epsilon}$ rate for any $\epsilon > 0$, where the constant is exponential in $1/\epsilon$, and thus deviates slightly from the lower bound. This phenomenon arises from using the projection-based estimators for the quadratic term. Establishing the rate of convergence for these estimators requires working in a Sobolev space rather than the Hölder class. In translating back to the Hölderian assumption, we lose a small factor in the smoothness, since the Sobolev space only contains the Hölder space if the former is less smooth than the latter.

The lower bound on estimating integral functionals in Theorem 2 almost immediately implies a lower bound for Tsallis-$\alpha$ divergences. For Rényi-$\alpha$, some care must be taken in the translation, but we are able to prove the same lower bound as long as $D_\alpha(p,q)$ is bounded. The idea behind these extensions is to translate an estimator $\hat{D}$ for the divergence into an estimator $\hat{T}$ for $T(p,q)$. We then argue that if $\hat{D}$ enjoyed a fast rate of convergence, so would $\hat{T}$, which leads to a contradiction of the theorem.

Unfortunately, Theorem 2 does not imply a lower bound for the $L_2^2$-divergence, since we are unable to handle the $\alpha = \beta = 1$ case, which is exactly the cross term in the $L_2^2$-divergence. Our proof requires that $\alpha$ and $\beta$ are both not 0 or 1, which is not entirely surprising. If $\alpha = \beta = 0$, $T(p,q)$ is constant in $(p, q)$, so one should not be able to prove a lower bound. Similarly, if $\alpha = 0, \beta = 1$ or vice versa, $T(p,q) = 1$ for any $p, q$, so we have efficient, trivial estimators. The only non-trivial case is $\alpha = \beta = 1$, and we conjecture that the $n^{-\gamma_\star}$ rate is minimax optimal there, although our proof does not apply. Our proof strategy involves fixing $q$ and perturbing $p$, or vice versa. In this approach, one can view the optimal estimator as having knowledge of $q$, so if $\alpha = 1$, the sample average is an $n^{-1/2}$-consistent estimator, which prevents us from achieving the $n^{-\gamma_\star}$ rate. We believe this is an artifact of our proof, and by perturbing both $p$ and $q$ simultaneously, we conjecture that one can prove a minimax lower bound of $n^{-\gamma_\star}$ when $\alpha = \beta = 1$.

### 4.1. Some examples

We now show how an estimate of $T(p,q)$ can be used to estimate the divergences mentioned above. Plugging $\hat{T}_{\mathrm{quad}}$ into the definitions of the Rényi-$\alpha$ and Tsallis-$\alpha$ divergences, we immediately have the following corollary:

Corollary 3 (Estimating Rényi-$\alpha$, Tsallis-$\alpha$ divergences). Under Assumptions 1–4, as long as $D_\alpha(p,q) \le c$ for some constant $c$, the estimators

$$\hat{D}_\alpha = \frac{1}{\alpha - 1} \log\big(\hat{T}_{\mathrm{quad}}\big), \qquad \hat{T}_\alpha = \frac{1}{\alpha - 1}\big(\hat{T}_{\mathrm{quad}} - 1\big),$$

both with $\beta = 1 - \alpha$, satisfy

$$\mathbb{E}\big|\hat{D}_\alpha - D_\alpha(p,q)\big| = O\big(n^{-1/2} + n^{-4s/(4s+d)} + n^{-3s/(2s+d)}\big), \qquad \mathbb{E}\big|\hat{T}_\alpha - T_\alpha(p,q)\big| = O\big(n^{-1/2} + n^{-4s/(4s+d)} + n^{-3s/(2s+d)}\big).$$

As we mentioned before, when $\beta = 1 - \alpha$, for both the linear and quadratic estimators one can omit the term $T(\hat{p}, \hat{q})$, as the constants $C_1, C_2 = 0$.
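As an illustration of how simple the first-order estimator becomes in this case, here is a rough one-dimensional sketch of $\hat{D}_\alpha$ built from $\hat{T}_{\mathrm{lin}}$ with $\beta = 1 - \alpha$ (our illustration, not the authors' code): it uses an off-the-shelf Gaussian KDE with a rule-of-thumb bandwidth and a crude clipping step, and it ignores the boundary mirroring and the bandwidth choice of Assumption 4, so it should be read as a schematic rather than a faithful implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

def renyi_divergence_lin(X, Y, alpha=0.5, lo=1e-3, hi=1e3):
    """Rough first-order (T_lin-style) estimate of the Renyi-alpha divergence D_alpha(p, q).

    The first halves of X ~ p and Y ~ q fit the KDEs p_hat, q_hat; the second
    halves form the two sample means that estimate the linear terms. Since
    beta = 1 - alpha, the constant C_1 vanishes and T_lin is just their sum.
    """
    beta = 1.0 - alpha
    X, Y = np.asarray(X), np.asarray(Y)
    X1, X2 = X[: len(X) // 2], X[len(X) // 2:]
    Y1, Y2 = Y[: len(Y) // 2], Y[len(Y) // 2:]
    p_hat, q_hat = gaussian_kde(X1), gaussian_kde(Y1)
    clip = lambda v: np.clip(v, lo, hi)   # keep density estimates bounded away from zero
    theta_p = np.mean(alpha * clip(p_hat(X2)) ** (alpha - 1) * clip(q_hat(X2)) ** beta)
    theta_q = np.mean(beta * clip(p_hat(Y2)) ** alpha * clip(q_hat(Y2)) ** (beta - 1))
    T_lin = theta_p + theta_q
    return np.log(T_lin) / (alpha - 1.0)

# Toy check: for N(0,1) vs N(1,1), D_alpha = alpha * (mean gap)^2 / 2 = 0.25 at alpha = 0.5.
rng = np.random.default_rng(1)
X, Y = rng.normal(0.0, 1.0, 4000), rng.normal(1.0, 1.0, 4000)
print(renyi_divergence_lin(X, Y, alpha=0.5))  # roughly 0.25
```

The Tsallis-$\alpha$ analogue simply replaces the last line of the function with `(T_lin - 1.0) / (alpha - 1.0)`.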
However, $\hat{T}_{\mathrm{quad}}$ is still somewhat impractical due to the numeric integration in the quadratic terms. On the other hand, the linear estimator $\hat{T}_{\mathrm{lin}}$ is computationally very simple, although its convergence rate is $O(n^{-1/2} + n^{-2s/(2s+d)})$.

For the $L_2^2$-divergence, instead of applying Theorem 1 directly, it is better to use the quadratic and bilinear estimators directly for the terms in its factorization. Specifically, let $\theta_p = \int p^2$ and define $\hat{\theta}_p$ by Equation 2 with $\psi(x) = 1$. Define $\theta_q, \hat{\theta}_q$ analogously, and finally define $\theta_{p,q} = 2\int pq$ with $\hat{\theta}_{p,q}$ given by Equation 1 where $\psi(x) = 2$. As a corollary of Theorem 6 below, we have:

Corollary 4 (Estimating $L_2^2$-divergence). Under Assumptions 1–4, the estimator $\hat{L} = \hat{\theta}_p + \hat{\theta}_q - \hat{\theta}_{p,q}$ for $L_2^2(p,q)$ satisfies:

$$\mathbb{E}\big|\hat{L} - L_2^2(p,q)\big| = O\big(n^{-1/2} + n^{-4s/(4s+d)}\big). \qquad (9)$$

Notice that for both quadratic terms, the $b_{k,k'}$ terms in Equation 2 are $\mathbf{1}[k = k']$, since $\psi(x) = 1$ and since $\{\phi_k\}$ is an orthonormal collection. Thus the estimator $\hat{L}$ is computationally attractive, as numeric integration is unnecessary. In addition, we do not need KDEs, removing the need for bandwidth selection, although we still must select the basis functions used in the projection.
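To illustrate Corollary 4, here is a minimal one-dimensional sketch of the projection-based $L_2^2$ estimator (our illustration): with $\psi \equiv 1$, the correction term in Equation (2) collapses because $b_{k,k'} = \mathbf{1}[k = k']$, so $\hat{\theta}_p$ reduces to a simple U-statistic. The cosine basis and the fixed truncation level are illustrative choices rather than the tuned ones of Assumption 4.

```python
import numpy as np

def cosine_features(Z, M):
    """n x M matrix of the orthonormal cosine basis on [0,1] evaluated at the sample Z."""
    Z = np.asarray(Z, dtype=float)[:, None]
    k = np.arange(M)[None, :]
    Phi = np.sqrt(2.0) * np.cos(np.pi * k * Z)
    Phi[:, 0] = 1.0
    return Phi

def l2_squared_estimate(X, Y, M=20):
    """Projection-based estimate of L2^2(p, q) = int (p - q)^2, as in Corollary 4.

    theta_p, theta_q are U-statistic estimates of int p^2 and int q^2 (Equation 2
    with psi = 1), and theta_pq estimates 2 * int p*q (Equation 1 with psi = 2).
    No KDE and no numerical integration are required.
    """
    n, m = len(X), len(Y)
    Phi_X, Phi_Y = cosine_features(X, M), cosine_features(Y, M)
    a_hat, b_hat = Phi_X.mean(axis=0), Phi_Y.mean(axis=0)      # empirical Fourier coefficients
    # U-statistics: drop the diagonal i = j from (sum_i phi_k(X_i))^2 before averaging over pairs.
    theta_p = np.sum((n * a_hat) ** 2 - np.sum(Phi_X ** 2, axis=0)) / (n * (n - 1))
    theta_q = np.sum((m * b_hat) ** 2 - np.sum(Phi_Y ** 2, axis=0)) / (m * (m - 1))
    theta_pq = 2.0 * np.sum(a_hat * b_hat)
    return theta_p + theta_q - theta_pq

# Toy check: p = Uniform[0,1], q = Beta(2,2) with density 6x(1-x); then L2^2(p, q) = 0.2.
rng = np.random.default_rng(2)
X = rng.uniform(size=4000)
Y = rng.beta(2.0, 2.0, size=4000)
print(l2_squared_estimate(X, Y))  # roughly 0.2
```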
## 5. Proof Sketches

### 5.1. Upper Bound

The rates of convergence for $\hat{T}_{\mathrm{pl}}$, $\hat{T}_{\mathrm{lin}}$, and $\hat{T}_{\mathrm{quad}}$ come from analyzing the kernel density estimators and the estimators for the $\hat{\theta}^{(\cdot)}_{(\cdot),(\cdot)}$ terms. Recall that we must use truncated KDEs $\hat{p}, \hat{q}$ with boundary correction, so standard analysis does not immediately apply. However, we do have the following theorem establishing that truncation does not affect the rate, which generalizes previous results to high dimension (Birgé & Massart, 1995).

Theorem 5. Let $f$ be a density satisfying Assumptions 1–4 and suppose we have $X_1^n \sim f$. The truncated KDE $\hat{f}_n$ satisfies, for every $p \ge 1$,

$$\mathbb{E}\,\|\hat{f}_n - f\|_p^p = O\big(n^{-ps/(2s+d)}\big).$$

It is a simple exercise to show that the linear terms can be estimated at an $n^{-1/2}$ rate. As for the quadratic terms $\theta^p_{2,2}$, $\theta^q_{2,2}$, and $\theta^{pq}_{2,2}$, we let $D$ index the multi-dimensional Fourier basis, where each function $\phi_k(x) = e^{2\pi i k^T x}$ is indexed by a $d$-dimensional integer vector (i.e. $k \in \mathbb{Z}^d$). We have:

Theorem 6. Let $f, g$ be densities in $\Sigma(s, L)$ and let $\psi$ be a known bounded function. Let $\phi_k$ be the Fourier basis and $M$ the set of basis elements with frequency not exceeding $m_0^{1/d}$, where $m_0 \asymp n^{2d/(4s'+d)}$ for some $s' < s$. If $\theta = \int \psi(x) f(x) g(x)$ and $\hat{\theta}$ is given by Equation 1, or if $\theta = \int \psi(x) f^2(x)$ and $\hat{\theta}$ is given by Equation 2, then:

$$\mathbb{E}\,\big|\hat{\theta} - \theta\big| = O\big(n^{-1/2} + n^{-4s'/(4s'+d)}\big).$$

Theorem 1 follows from these results, the von Mises expansion, and the triangle inequality.

### 5.2. Lower Bound

The first part of the lower bound is an application of Le Cam's method and generalizes a proof of Birgé and Massart (1995). We begin by reducing the estimation problem to a simple-vs.-simple hypothesis testing problem. We will use the squared Hellinger distance, defined as $h^2(P, Q) = \int \big(\sqrt{dP} - \sqrt{dQ}\big)^2$.

Lemma 7. Let $T$ be a functional defined on some subset of a parameter space which contains $(p, q)$ and $(g_\lambda, q)$ for all $\lambda$ in some index set $\Lambda$. Define $\bar{G}^n = \frac{1}{|\Lambda|}\sum_{\lambda \in \Lambda} G_\lambda^n$, where $G_\lambda$ has density $g_\lambda$. If (i) $h^2(P^n \times Q^n, \bar{G}^n \times Q^n) \le \gamma < 2$ and (ii) $T(p,q) \ge 2\beta + T(g_\lambda, q)$ for all $\lambda \in \Lambda$, then

$$\inf_{\hat{T}_n}\ \max\Big\{ \mathbb{P}_{P^n \times Q^n}\big[\,|\hat{T}_n - T(p,q)| > \beta\,\big],\ \max_{\lambda \in \Lambda}\, \mathbb{P}_{G_\lambda^n \times Q^n}\big[\,|\hat{T}_n - T(g_\lambda, q)| > \beta\,\big] \Big\} \ge c_\gamma,$$

where $c_\gamma > 0$ depends only on $\gamma$.

To construct the $g_\lambda$ functions, we partition the space $[0,1]^d$ into $m$ cubes $R_j$ and construct functions $u_j$ that are compactly supported on $R_j$. We then set $g_\lambda = p + K \sum_{j=1}^m \lambda_j u_j$ for $\lambda \in \Lambda = \{-1, 1\}^m$. By appropriately selecting the functions $u_j$, we can ensure that $g_\lambda \in \Sigma(s, L)$, that $T(p,q) - T(g_\lambda, q) = \Omega(K^2)$, and that $h^2(P^n \times Q^n, \bar{G}^n \times Q^n) = O(n^2 K^4 / m)$. Ensuring smoothness requires $K = O(m^{-s/d})$, at which point making the Hellinger distance $O(1)$ requires $m = \Omega(n^{2d/(4s+d)})$. With these choices we can apply Lemma 7 and arrive at the lower bound, since $K^2 = m^{-2s/d} = n^{-4s/(4s+d)}$.

As for the second part of the theorem, the $n^{-1/2}$ lower bound, we use a (to our knowledge) novel proof technique which we believe may be applicable in other settings. The first ingredient of our proof is a lower bound showing that one cannot estimate a wide class of quadratic functionals at better than an $n^{-1/2}$ rate. We provide a proof of this result based on Le Cam's method in the appendix, although related results appear in the literature (Donoho & Liu, 1991). Then, starting with the premise that there exists an estimator $\hat{T}$ for $T(p,q)$ with a rate faster than $n^{-1/2}$, we construct an estimator for a particular quadratic functional with a faster-than-$n^{-1/2}$ convergence rate, and thus arrive at a contradiction. A somewhat surprising facet of this proof technique is that the proof has the flavor of an upper bound proof; in particular, we apply Theorem 5 in this argument.

The proof works as follows. Suppose there exists a $\hat{T}_n$ such that $\mathbb{E}|\hat{T}_n - T(p,q)| \le c_1 n^{-1/2 - \delta}$ for all $n$ and some $\delta > 0$. If we are given $2n$ samples, we can use the first half to train KDEs $\hat{p}_n, \hat{q}_n$, and the second half to compute $\hat{T}_n$. Armed with these quantities, we can build an estimator for the first- and second-order terms in the von Mises expansion, which, once $\hat{p}_n, \hat{q}_n$ are fixed, is simply a quadratic functional of the densities. The precise estimator is $\hat{T}_n - C_2 T(\hat{p}_n, \hat{q}_n)$. The triangle inequality along with Theorem 5 shows that this estimator converges at rate $n^{-1/2 - \delta} + n^{-3s/(2s+d)}$, which is $o(n^{-1/2})$ as soon as $s > d/4$. This contradicts the minimax lower bound for estimating quadratic functionals of Hölder-smooth densities. We refer the interested reader to the appendix for details of the proof.

## 6. Experiments

We conducted some simulations to examine the empirical rates of convergence of our estimators. We plotted the error as a function of the number of samples $n$ on a log-log scale in Figure 2 for each estimator and over a number of problem settings. Since our theoretical results are asymptotic in nature, we are not concerned with some discrepancy between the empirical and theoretical rates.

Figure 2. Top row: rates of convergence for $\hat{T}_{\mathrm{pl}}, \hat{T}_{\mathrm{lin}}$ on a log-log scale for, left: $d=1, s=1$; second from left: $d=1, s=2$; second from right: $d=2, s=2$; right: $d=2, s=4$. Bottom row, left: rate of convergence for $\hat{T}_{\mathrm{quad}}$ with $d=1$, $s=1.0, 2.0$. Middle two: rates for the linear estimator of $D_{0.5}(p,q)$ and $T_{0.5}(p,q)$ (respectively). Right: rate for the $L_2^2$ estimator. Dashed lines are fitted to the curves.

In the top row of Figure 2, we plot the performance of $\hat{T}_{\mathrm{pl}}$ and $\hat{T}_{\mathrm{lin}}$ across four different problem settings: $d=1, s=1$; $d=1, s=2$; $d=2, s=2$; and $d=2, s=4$. The lines fit to the plug-in estimator's error have slopes $-0.25, -0.5, -0.1, -0.2$ from left to right, while the lines for the linear estimator have slopes $-0.7, -0.75, -0.65, -0.6$. Qualitatively, we see that $\hat{T}_{\mathrm{lin}}$ is consistently better than $\hat{T}_{\mathrm{pl}}$. We also see that increasing the smoothness $s$ appears to improve the rate of convergence of both estimators.

In the first plot on the bottom row, we record the error rate for $\hat{T}_{\mathrm{quad}}$ with $d = 1$ and $s = 1.0, 2.0$. The fitted lines have slopes $-0.82$ and $-0.93$ respectively, which demonstrates that $\hat{T}_{\mathrm{quad}}$ is indeed a better estimator than $\hat{T}_{\mathrm{lin}}$, at least statistically speaking. Recall that we studied $\hat{T}_{\mathrm{quad}}$ primarily for its theoretical properties and to establish the critical smoothness index of $s > d/4$ for this problem.
Computing this estimator is quite demanding, so we did not evaluate it for larger sample sizes or in higher dimension. Finally, in the last three plots we show the rate of convergence for our divergence estimators, that is, $\hat{T}_{\mathrm{lin}}$ plugged into the equations for $D_\alpha$ or $T_\alpha$, and the quadratic-based estimator for $L_2^2$. Qualitatively, it is clear that the estimators converge fairly quickly, and moreover we can verify that increasing the smoothness $s$ does have some effect on the rate of convergence.

## 7. Discussion

In this paper, we address the problem of divergence estimation with corrections of the plug-in estimator. We prove that our estimators enjoy parametric rates of convergence as long as the densities are sufficiently smooth. Moreover, through information-theoretic techniques, we show that our best estimator $\hat{T}_{\mathrm{quad}}$ is nearly minimax optimal. Several open questions remain.

1. Can we construct divergence estimators that are computationally and statistically efficient? Recall that $\hat{T}_{\mathrm{quad}}$ involves numeric integration and is computationally impractical, yet $\hat{T}_{\mathrm{lin}}$, while statistically inferior, is surprisingly simple when applied to the divergences we consider. At this point we advocate for the use of $\hat{T}_{\mathrm{lin}}$, in spite of its sub-optimality.
2. What other properties do these estimators enjoy? Can we construct confidence intervals and statistical tests from them? In particular, can we use our estimators to test for independence between two random variables?
3. Do our techniques yield estimators for other divergences, such as f-divergences and the Kullback-Leibler divergence?
4. Lastly, can one prove a lower bound for the case where $\alpha = \beta = 1$, i.e. the $L_2$ inner product?

We hope to address these questions in future work.

## Acknowledgements

This research is supported by DOE grant DESC0011114, NSF Grants DMS-0806009, IIS-1247658, and IIS-1250350, and Air Force Grant FA95500910373. AK is supported in part by an NSF Graduate Research Fellowship.

## References

Bickel, Peter and Ritov, Ya'acov. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā: The Indian Journal of Statistics, Series A, 1988.

Birgé, Lucien and Massart, Pascal. Estimation of integral functionals of a density. The Annals of Statistics, 1995.

van der Vaart, Aad W. Asymptotic statistics, volume 3. Cambridge University Press, 2000.

Devroye, Luc and Györfi, László. Nonparametric Density Estimation: The L1 View. Wiley, 1985.

Donoho, David L. and Liu, Richard C. Geometrizing rates of convergence, II. The Annals of Statistics, 1991.

Giné, Evarist and Nickl, Richard. A simple adaptive estimator of the integrated square of a density. Bernoulli, February 2008. ISSN 1350-7265.

Hero, Alfred O. and Michel, Olivier J. J. Estimation of Rényi information divergence via pruned minimal spanning trees. In IEEE Signal Processing Workshop on Higher-Order Statistics, 1999.

Hero, Alfred O., Costa, Jose A., and Ma, Bing. Convergence rates of minimal graphs with random vertices. Technical report, The University of Michigan, 2002.

Johnson, Don H., Gruner, Charlotte M., Baggerly, Keith, and Seshagiri, Chandran. Information-theoretic analysis of neural coding. Journal of Computational Neuroscience, 2001.

Källberg, David and Seleznjev, Oleg. Estimation of entropy-type integral functionals. arXiv:1209.2544, 2012.

Kerkyacharian, Gérard and Picard, Dominique. Estimating nonquadratic functionals of a density using Haar wavelets. The Annals of Statistics, 1996.

Laurent, Béatrice. Efficient estimation of integral functionals of a density. The Annals of Statistics, 1996.
Leonenko, Nikolai and Seleznjev, Oleg. Statistical inference for the epsilon-entropy and the quadratic Rényi entropy. Journal of Multivariate Analysis, 2010.

Leonenko, Nikolai, Pronzato, Luc, and Savani, Vippal. A class of Rényi information estimators for multidimensional densities. The Annals of Statistics, 2008.

Liu, Han, Wasserman, Larry, and Lafferty, John D. Exponential concentration for mutual information estimation with application to forests. In Advances in Neural Information Processing Systems, 2012.

Nguyen, Xuan Long, Wainwright, Martin J., and Jordan, Michael I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 2010.

Pál, Dávid, Póczos, Barnabás, and Szepesvári, Csaba. Estimation of Rényi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs. In Advances in Neural Information Processing Systems, 2010.

Pardo, Leandro. Statistical inference based on divergence measures. CRC Press, 2005.

Pérez-Cruz, Fernando. Kullback-Leibler divergence estimation of continuous distributions. In IEEE International Symposium on Information Theory, 2008.

Póczos, Barnabás and Schneider, Jeff. On the estimation of alpha-divergences. In International Conference on Artificial Intelligence and Statistics, 2011.

Póczos, Barnabás, Xiong, Liang, Sutherland, Dougal J., and Schneider, Jeff. Nonparametric kernel estimators for image classification. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.

Rényi, Alfréd. On measures of entropy and information. In Berkeley Symposium on Mathematical Statistics and Probability, 1961.

Singh, Shashank and Póczos, Barnabás. Generalized Exponential Concentration Inequality for Rényi Divergence Estimation. In International Conference on Machine Learning, 2014.

Sricharan, Kumar, Raich, Raviv, and Hero, Alfred O. Empirical estimation of entropy functionals with confidence. arXiv:1012.4188, 2010.

Sricharan, Kumar, Wei, Dennis, and Hero, Alfred O. Ensemble estimators for multivariate entropy estimation. arXiv:1203.5829, 2012.

Tsallis, Constantino. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 1988.

Tsybakov, Alexandre B. Introduction to nonparametric estimation. Springer, 2009.

Wang, Qing, Kulkarni, Sanjeev R., and Verdú, Sergio. Divergence estimation of continuous distributions based on data-dependent partitions. IEEE Transactions on Information Theory, 2005.

Wang, Qing, Kulkarni, Sanjeev R., and Verdú, Sergio. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory, 2009.