# Efficient Nonparametric Smoothness Estimation

Shashank Singh (Carnegie Mellon University, sss1@andrew.cmu.edu)
Simon S. Du (Carnegie Mellon University, ssdu@cs.cmu.edu)
Barnabás Póczos (Carnegie Mellon University, bapoczos@cs.cmu.edu)

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Abstract: Sobolev quantities (norms, inner products, and distances) of probability density functions are important in the theory of nonparametric statistics, but have rarely been used in practice, due to a lack of practical estimators. They also include, as special cases, $L^2$ quantities which are used in many applications. We propose and analyze a family of estimators for Sobolev quantities of unknown probability density functions. We bound the finite-sample bias and variance of our estimators, finding that they are generally minimax rate-optimal. Our estimators are significantly more computationally tractable than previous estimators, and exhibit a statistical/computational trade-off allowing them to adapt to computational constraints. We also draw theoretical connections to recent work on fast two-sample testing and empirically validate our estimators on synthetic data.

1 Introduction

$L^2$ quantities (i.e., inner products, norms, and distances) of continuous probability density functions are important information-theoretic quantities with many applications in machine learning and signal processing. For example, $L^2$ norm estimates can be used for goodness-of-fit testing [7], image registration and texture classification [12], and parameter estimation in semi-parametric models [36]. $L^2$ inner product estimates can generalize linear or polynomial kernel methods to inputs which are distributions rather than numerical vectors [28]. $L^2$ distance estimators are used for two-sample testing [1, 25], transduction learning [30], and machine learning on distributional inputs [27]. [29] gives applications of $L^2$ quantities to adaptive information filtering, classification, and clustering.

$L^2$ quantities are a special case of less-well-known Sobolev quantities. Sobolev norms measure global smoothness of a function in terms of integrals of squared derivatives. For example, for a non-negative integer $s$ and a function $f : \mathbb{R} \to \mathbb{R}$ with an $s$th derivative $f^{(s)}$, the $s$-order Sobolev norm $\|f\|_{H^s}$ is given by $\|f\|_{H^s} = \sqrt{\int_{\mathbb{R}} \left( f^{(s)}(x) \right)^2 \, dx}$ (when this quantity is finite). See Section 2 for more general definitions, and see [21] for an introduction to Sobolev spaces.

Estimation of general Sobolev norms has a long history in nonparametric statistics (e.g., [32, 13, 10, 2]). This line of work was motivated by the role of Sobolev norms in many semi- and non-parametric problems, including density estimation, density functional estimation, and regression (see [35], Section 1.7.1), where they dictate the convergence rates of estimators. Despite this, to our knowledge, these quantities have never been studied in real data, leaving an important gap between the theory and practice of nonparametric statistics. We suggest this is in part due to a lack of practical estimators for these quantities. For example, the only one of the above estimators that is statistically minimax-optimal [2] is extremely difficult to compute in practice, requiring numerical integration over each of $O(n^2)$ different kernel density estimates, where $n$ denotes the sample size. We know of no estimators previously proposed for Sobolev inner products and distances.
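To make these quantities concrete, consider the standard normal density $p(x) = (2\pi)^{-1/2} e^{-x^2/2}$ (a worked example of our own, not from the paper; Section 8 estimates exactly these two norms). Using the derivative form of the norm above,
$$\|p\|_{L^2}^2 = \int_{\mathbb{R}} \frac{e^{-x^2}}{2\pi} \, dx = \frac{1}{2\sqrt{\pi}} \approx 0.282, \qquad \|p\|_{H^1}^2 = \int_{\mathbb{R}} \left( p'(x) \right)^2 dx = \int_{\mathbb{R}} x^2 \, p(x)^2 \, dx = \frac{1}{4\sqrt{\pi}} \approx 0.141,$$
where the second computation uses $p'(x) = -x \, p(x)$.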
The main goal of this paper is to propose and analyze a family of computationally and statistically efficient estimators for Sobolev inner products, norms, and distances. Our specific contributions are:

1. We propose families of nonparametric estimators for all Sobolev quantities (Section 4).
2. We analyze the estimators' bias and variance. Assuming the underlying density functions lie in a Sobolev class of smoothness parametrized by $s'$, we show the estimator for Sobolev quantities of order $s < s'$ converges to the true value at the parametric rate of $O(n^{-1})$ in mean squared error when $s' \ge 2s + D/4$, and at a slower rate of $O\left(n^{-\frac{8(s'-s)}{4s'+D}}\right)$ otherwise (Section 5).
3. We validate our theoretical results on simulated data (Section 8).
4. We derive asymptotic distributions for our estimators, and we use these to derive tests for the general statistical problem of two-sample testing. We also draw theoretical connections between our test and recent work on nonparametric two-sample testing (Sections 6 and 9).

In terms of mean squared error, minimax lower bounds matching our convergence rates over Sobolev or Hölder smoothness classes have been shown by [16] for $s = 0$ (i.e., $L^2$ quantities) and by [3] for Sobolev norms with integer $s$. Since these lower bounds intuitively span the space of relevant quantities, it is a small step to conjecture that our estimators are minimax rate-optimal for all Sobolev quantities and $s \in [0, \infty)$. As described in Section 7, our estimators are computable in $O(n^{1+\varepsilon})$ time using only basic matrix operations, where $n$ is the sample size and $\varepsilon \in (0, 1)$ is a tunable parameter trading statistical and computational efficiency; the smallest value of $\varepsilon$ at which the estimator continues to be minimax rate-optimal approaches 0 as we assume more smoothness of the true density.

2 Problem setup and notation

Let $X = [-\pi, \pi]^D$ and let $\mu$ denote the Lebesgue measure on $X$. For $D$-tuples $z \in \mathbb{Z}^D$ of integers, let $\psi_z \in L^2 = L^2(X)$,¹ defined by $\psi_z(x) = e^{i \langle z, x \rangle}$ for all $x \in X$, denote the $z$th element of the $L^2$-orthonormal Fourier basis, and, for $f \in L^2$, let
$$\tilde f(z) := \langle \psi_z, f \rangle_{L^2} = \int_X \overline{\psi_z(x)} f(x) \, d\mu(x)$$
denote the $z$th Fourier coefficient of $f$.² For any $s \in [0, \infty)$, define the Sobolev space $H^s = H^s(X) \subseteq L^2$ of order $s$ on $X$ by³
$$H^s := \left\{ f \in L^2 : \sum_{z \in \mathbb{Z}^D} z^{2s} \left| \tilde f(z) \right|^2 < \infty \right\}. \qquad (1)$$

Fix a known $s \in [0, \infty)$ and unknown probability density functions $p, q \in H^s$, and suppose we have $n$ IID samples $X_1, \dots, X_n \sim p$ and $Y_1, \dots, Y_n \sim q$ from each of $p$ and $q$. We are interested in estimating the inner product
$$\langle p, q \rangle_{H^s} := \sum_{z \in \mathbb{Z}^D} z^{2s} \, \tilde p(z) \overline{\tilde q(z)}, \qquad (2)$$
defined for all $p, q \in H^s$. Estimating the inner product gives an estimate for the (squared) induced norm and distance,⁴ since
$$\|p\|_{H^s}^2 := \sum_{z \in \mathbb{Z}^D} z^{2s} |\tilde p(z)|^2 = \langle p, p \rangle_{H^s} \quad \text{and} \quad \|p - q\|_{H^s}^2 = \|p\|_{H^s}^2 - 2 \langle p, q \rangle_{H^s} + \|q\|_{H^s}^2. \qquad (3)$$

Since our theoretical results assume the samples from $p$ and $q$ are independent, when estimating $\|p\|_{H^s}^2$, we split the sample from $p$ in half to compute two independent estimates of $\tilde p$, although this may not be optimal in practice.

For a more classical intuition, we note that, in the case $D = 1$ and $s \in \{0, 1, 2, \dots\}$ (via Parseval's identity and the identity $\widetilde{f^{(s)}}(z) = (iz)^s \tilde f(z)$), one can show the following: $H^s$ includes the subspace of $L^2$ functions with at least $s$ derivatives in $L^2$ and, if $f^{(s)}$ denotes the $s$th derivative of $f$,
$$\|f\|_{H^s}^2 = 2\pi \int_X \left( f^{(s)}(x) \right)^2 dx = 2\pi \left\| f^{(s)} \right\|_{L^2}^2, \qquad \forall f \in H^s. \qquad (4)$$
In particular, when $s = 0$, $H^s = L^2$, $\|\cdot\|_{H^s} = \|\cdot\|_{L^2}$, and $\langle \cdot, \cdot \rangle_{H^s} = \langle \cdot, \cdot \rangle_{L^2}$. As we describe in the supplement, equation (4) and our results generalize trivially to weak derivatives, as well as to non-integer $s \in [0, \infty)$ via a notion of fractional derivative.

¹We suppress dependence on $X$; all function spaces are over $X$ except as discussed in Section 2.1.
²Here, $\langle \cdot, \cdot \rangle$ denotes the dot product on $\mathbb{R}^D$. For a complex number $c = a + bi$, $\bar c = a - bi$ denotes the complex conjugate of $c$, and $|c| = \sqrt{a^2 + b^2}$ denotes the modulus of $c$.
³When $D > 1$, $z^{2s} := \prod_{j=1}^D z_j^{2s}$. For $z < 0$, $z^{2s}$ should be read as $(z^2)^s$, so that $z^{2s} \in \mathbb{R}$ even when $2s \notin \mathbb{Z}$. In the $L^2$ case, we use the convention that $0^0 = 1$.
⁴$\|\cdot\|_{H^s}$ is a pseudonorm on $H^s$ because it fails to distinguish functions identical almost everywhere up to additive constants; a combination of $\|\cdot\|_{L^2}$ and $\|\cdot\|_{H^s}$ is used when a proper norm is needed. However, since probability densities integrate to 1, the induced $H^s$ distance is a proper metric on the subset of (almost-everywhere equivalence classes of) probability density functions in $H^s$, which is important for two-sample testing (see Section 9). For simplicity, we use the terms "norm", "inner product", and "distance" for the remainder of the paper.
2.1 Unbounded domains

A notable restriction above is that $p$ and $q$ are supported in $X := [-\pi, \pi]^D$. In fact, our estimators and tests are well-defined and valid for densities supported on arbitrary subsets of $\mathbb{R}^D$. In this case, they act on the $2\pi$-periodic summation $p_{2\pi} : [-\pi, \pi]^D \to [0, \infty]$ defined for $x \in X$ by $p_{2\pi}(x) := \sum_{z \in \mathbb{Z}^D} p(x + 2\pi z)$, which is itself a probability density function on $X$. For example, the estimator for $\|p\|_{H^s}$ will instead estimate $\|p_{2\pi}\|_{H^s}$, and the two-sample test for distributions $p$ and $q$ will attempt to distinguish $p_{2\pi}$ from $q_{2\pi}$. Typically, this is not problematic; for example, for most realistic probability densities, $p$ and $p_{2\pi}$ have similar orders of smoothness, and $p_{2\pi} = q_{2\pi}$ if and only if $p = q$. However, there are (meagre) sets of exceptions; for example, if $q$ is a translation of $p$ by exactly $2\pi$, then $p_{2\pi} = q_{2\pi}$, and one can craft a highly discontinuous function $p$ such that $p_{2\pi}$ is uniform on $X$ [39]. These exceptions make it difficult to extend theoretical results to densities with arbitrary support, but they are fixed, in practice, by randomly rescaling the data (as in [4]). If the densities have (known) bounded support, they can simply be shifted and scaled to be supported on $X$.
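As a minimal illustration of these two reductions (our own sketch; the helper names `to_torus` and `wrap_2pi` are hypothetical, not from the paper's code):

```python
import numpy as np

def to_torus(X, lower, upper):
    """Shift and scale samples with known bounded support [lower, upper]
    so that they are supported on X = [-pi, pi]."""
    return (np.asarray(X) - lower) / (upper - lower) * (2 * np.pi) - np.pi

def wrap_2pi(X):
    """Map arbitrary real samples into [-pi, pi). If X ~ p, the wrapped
    samples are distributed according to the 2*pi-periodic summation p_2pi."""
    return (np.asarray(X) + np.pi) % (2 * np.pi) - np.pi
```

Wrapping is valid because the density of $X \bmod 2\pi$ is exactly $p_{2\pi}$; the random rescaling of [4] mentioned above guards against the measure-zero set of adversarial cases.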
3 Related work

There is a large body of work on estimating nonlinear functionals of probability densities, with various generalizations in terms of the class of functionals considered. Table 1 gives a subset of such work, for functionals related to Sobolev quantities. As shown in Section 2, the functional form we consider is a strict generalization of $L^2$ norms, Sobolev norms, and $L^2$ inner products. It overlaps with, but is neither a special case nor a generalization of, the remaining functional forms in the table.

| Functional Name | Functional Form | References |
|---|---|---|
| $L^2$ norms | $\|p\|_{L^2}^2 = \int (p(x))^2 \, dx$ | [32, 6] |
| (Integer) Sobolev norms | $\|p\|_{H^k}^2 = \int \left( p^{(k)}(x) \right)^2 dx$ | [2] |
| Density functionals | $\int \varphi(x, p(x)) \, dx$ | [18, 19] |
| Derivative functionals | $\int \varphi\left(x, p(x), p'(x), \dots, p^{(k)}(x)\right) dx$ | [3] |
| $L^2$ inner products | $\langle p_1, p_2 \rangle_{L^2} = \int p_1(x) p_2(x) \, dx$ | [16, 17] |
| Multivariate functionals | $\int \varphi(x, p_1(x), \dots, p_k(x)) \, dx$ | [34, 14] |

Table 1: Some related functional forms for which nonparametric estimators have been developed and analyzed. $p, p_1, \dots, p_k$ are unknown probability densities, from each of which we draw $n$ IID samples, $\varphi$ is a known real-valued measurable function, and $k$ is a non-negative integer.

Nearly all of the above approaches compute an optimally smoothed kernel density estimate and then perform bias corrections based on Taylor series expansions of the functional of interest. They typically consider distributions with densities that are $\beta$-Hölder continuous and satisfy periodicity assumptions of order $\beta$ on the boundary of their support, for some constant $\beta > 0$ (see, for example, Section 4 of [16] for details of these assumptions). The Sobolev class we consider is a strict superset of this Hölder class, permitting, for example, certain small discontinuities. In this regard, our results are slightly more general than most of these prior works.

Finally, there is much recent work on estimating entropies, divergences, and mutual informations, using methods based on kernel density estimates [14, 17, 16, 24, 33, 34] or k-nearest-neighbor statistics [20, 23, 22, 26]. In contrast, our estimators are more similar to orthogonal series density estimators, which are computationally attractive because they require no pairwise operations between samples. However, they require quite different theoretical analysis; unlike prior work, our estimator is constructed and analyzed entirely in the frequency domain, and then related to the data domain via Parseval's identity. We hope our analysis can be adapted to analyze new, computationally efficient information-theoretic estimators.

4 Motivation and construction of our estimator

For a non-negative integer parameter $Z_n$ (to be specified later), let
$$p_n := \sum_{|z| \le Z_n} \tilde p(z) \psi_z \quad \text{and} \quad q_n := \sum_{|z| \le Z_n} \tilde q(z) \psi_z, \quad \text{where } |z| := \max_{j \in \{1, \dots, D\}} |z_j|, \qquad (5)$$
denote the $L^2$ projections of $p$ and $q$, respectively, onto the linear subspace spanned by the $L^2$-orthonormal family $F_n := \{\psi_z : z \in \mathbb{Z}^D, |z| \le Z_n\}$. Note that, since $\widetilde{\psi_z}(y) = 0$ whenever $y \neq z$, the Fourier basis has the special property that it is orthogonal in $\langle \cdot, \cdot \rangle_{H^s}$ as well. Hence, since $p_n$ and $q_n$ lie in the span of $F_n$ while $p - p_n$ and $q - q_n$ lie in the span of $\{\psi_z : z \in \mathbb{Z}^D\} \setminus F_n$, $\langle p - p_n, q_n \rangle_{H^s} = \langle p_n, q - q_n \rangle_{H^s} = 0$. Therefore,
$$\langle p, q \rangle_{H^s} = \langle p_n, q_n \rangle_{H^s} + \langle p - p_n, q_n \rangle_{H^s} + \langle p_n, q - q_n \rangle_{H^s} + \langle p - p_n, q - q_n \rangle_{H^s} = \langle p_n, q_n \rangle_{H^s} + \langle p - p_n, q - q_n \rangle_{H^s}. \qquad (6)$$

We propose an unbiased estimate of $S_n := \langle p_n, q_n \rangle_{H^s} = \sum_{|z| \le Z_n} z^{2s} \tilde p(z) \overline{\tilde q(z)}$. Notice that the Fourier coefficients of $p$ are the expectations $\tilde p(z) = \mathbb{E}_{X \sim p}\left[\overline{\psi_z(X)}\right]$. Thus, $\hat p(z) := \frac{1}{n} \sum_{j=1}^n \overline{\psi_z(X_j)}$ and $\hat q(z) := \frac{1}{n} \sum_{j=1}^n \overline{\psi_z(Y_j)}$ are independent unbiased estimates of $\tilde p(z)$ and $\tilde q(z)$, respectively. Since $S_n$ is bilinear in $\tilde p$ and $\tilde q$, the plug-in estimator for $S_n$ is unbiased. That is, our estimator for $\langle p, q \rangle_{H^s}$ is
$$\hat S_n := \sum_{|z| \le Z_n} z^{2s} \hat p(z) \overline{\hat q(z)}. \qquad (7)$$
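The construction above amounts to a few vectorized operations. Below is a minimal sketch of (7) for $D = 1$ (our own illustration; the function name is hypothetical, and we adopt the orthonormal convention $\psi_z(x) = e^{izx}/\sqrt{2\pi}$ so that the $s = 0$ output estimates $\int p q$; the paper's constants may differ):

```python
import numpy as np

def sobolev_inner_product_estimate(X, Y, s=0.0, Zn=10):
    """Estimator (7) for D = 1: X ~ p and Y ~ q are independent 1D samples,
    assumed supported on (or wrapped into) [-pi, pi]."""
    zs = np.arange(-Zn, Zn + 1)                       # frequencies |z| <= Zn
    # Empirical Fourier coefficients: phat(z) = (1/n) sum_j conj(psi_z(X_j))
    phat = np.exp(-1j * np.outer(zs, X)).mean(axis=1) / np.sqrt(2 * np.pi)
    qhat = np.exp(-1j * np.outer(zs, Y)).mean(axis=1) / np.sqrt(2 * np.pi)
    weights = np.abs(zs) ** (2.0 * s)                 # z^{2s}, with 0^0 = 1
    return np.real(np.sum(weights * phat * np.conj(qhat)))
```

Each coefficient is a simple average over the sample, so the whole computation is an outer product, exponentiation, and a weighted sum, in the promised $O(n Z_n^D)$ time (cf. Section 7).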
5 Finite sample bounds

Here, we present our main theoretical results, bounding the bias, variance, and mean squared error of our estimator for finite $n$. By construction, our estimator satisfies
$$\mathbb{E}\left[\hat S_n\right] = \sum_{|z| \le Z_n} z^{2s} \, \mathbb{E}[\hat p(z)] \, \overline{\mathbb{E}[\hat q(z)]} = \sum_{|z| \le Z_n} z^{2s} \, \tilde p(z) \overline{\tilde q(z)} = S_n.$$
Thus, via (6) and Cauchy-Schwarz, the bias of the estimator $\hat S_n$ satisfies
$$\left| \mathbb{E}\left[\hat S_n\right] - \langle p, q \rangle_{H^s} \right| = \left| \langle p - p_n, q - q_n \rangle_{H^s} \right| \le \|p - p_n\|_{H^s} \|q - q_n\|_{H^s}. \qquad (8)$$
$\|p - p_n\|_{H^s}$ is the error of approximating $p$ by an order-$Z_n$ trigonometric polynomial, a classic problem in approximation theory, for which Theorem 2.2 of [15] shows: if $p \in H^{s'}$ for some $s' > s$, then
$$\|p - p_n\|_{H^s} \le \|p\|_{H^{s'}} Z_n^{s - s'}. \qquad (9)$$
In combination with (8), this implies the following bound on the bias of our estimator:

Theorem 1 (Bias bound). If $p, q \in H^{s'}$ for some $s' > s$, then, for $C_B := \|p\|_{H^{s'}} \|q\|_{H^{s'}}$,
$$\left| \mathbb{E}\left[\hat S_n\right] - \langle p, q \rangle_{H^s} \right| \le C_B Z_n^{2(s - s')}. \qquad (10)$$

Hence, the bias of $\hat S_n$ decays polynomially in $Z_n$, with a power depending on the extra $s' - s$ orders of smoothness available. On the other hand, as we increase $Z_n$, the number of frequencies at which we estimate $\hat p$ increases, suggesting that the variance of the estimator will increase with $Z_n$. Indeed, this is expressed in the following bound on the variance of the estimator.

Theorem 2 (Variance bound). If $p, q \in H^{s'}$ for some $s' \ge s$, then
$$\mathbb{V}\left[\hat S_n\right] \le 2 C_1 \frac{Z_n^{4s + D}}{n^2} + \frac{2 C_2}{n}, \qquad (11)$$
where
$$C_1 := \frac{2^D \Gamma(4s + 1)}{\Gamma(4s + D + 1)} \|p\|_{L^2} \|q\|_{L^2} \quad \text{and} \quad C_2 := \left( \|p\|_{H^s} + \|q\|_{H^s} \right) \|p\|_{W^{2s,4}} \|q\|_{W^{2s,4}} + \|p\|_{H^s}^4 \|q\|_{H^s}^4$$
are constants (in $n$).

The proof of Theorem 2 is perhaps the most significant theoretical contribution of this work. Due to space constraints, the proof is given in the supplement. Combining Theorems 1 and 2 gives a bound on the mean squared error (MSE) of $\hat S_n$ via the usual decomposition into squared bias and variance:

Corollary 3 (Mean squared error bound). If $p, q \in H^{s'}$ for some $s' > s$, then
$$\mathbb{E}\left[\left(\hat S_n - \langle p, q \rangle_{H^s}\right)^2\right] \le C_B^2 Z_n^{4(s - s')} + 2 C_1 \frac{Z_n^{4s + D}}{n^2} + \frac{2 C_2}{n}. \qquad (12)$$
If, furthermore, we choose $Z_n \asymp n^{\frac{2}{4s' + D}}$ (optimizing the rate in inequality (12)), then
$$\mathbb{E}\left[\left(\hat S_n - \langle p, q \rangle_{H^s}\right)^2\right] \in O\left(n^{\max\left\{ -\frac{8(s' - s)}{4s' + D}, \, -1 \right\}}\right). \qquad (13)$$

Corollary 3 recovers the phenomenon discovered by [2]: when $s' \ge 2s + D/4$, the minimax optimal MSE decays at the semi-parametric $n^{-1}$ rate, whereas, when $s' \in (s, 2s + D/4)$, the MSE decays at a slower rate. Also, the estimator is $L^2$-consistent if $Z_n \to \infty$ and $Z_n n^{-\frac{2}{4s + D}} \to 0$ as $n \to \infty$. This is useful in practice, since $s$ is known but $s'$ is not. Finally, it is worth reiterating that, by (3), these finite-sample rates extend, with additional constant factors, to estimating Sobolev norms and distances.

6 Asymptotic distributions

In this section, we derive the asymptotic distributions of our estimator in two cases: (1) the inner product estimator and (2) the distance estimator in the case $p = q$. These results provide confidence intervals and two-sample tests without computationally intensive resampling. While (1) is more general in that it can be used with (3) to bound the asymptotic distributions of the norm and distance estimators, (2) provides a more precise result leading to a more computationally and statistically efficient two-sample test. Proofs are given in the supplementary material.

Theorem 4 shows that our estimator has a normal asymptotic distribution, assuming $Z_n \to \infty$ slowly enough as $n \to \infty$, and also gives a consistent estimator for its asymptotic variance. From this, one can easily estimate asymptotic confidence intervals for inner products, and hence also for norms.

Theorem 4 (Asymptotic normality). Suppose that, for some $s' > 2s + D/4$, $p, q \in H^{s'}$, and suppose $Z_n n^{-\frac{1}{4(s' - s)}} \to \infty$ and $Z_n n^{-\frac{1}{4s + D}} \to 0$ as $n \to \infty$. Then, $\hat S_n$ is asymptotically normal with mean $\langle p, q \rangle_{H^s}$. In particular, for $j \in \{1, \dots, n\}$ and $z \in \mathbb{Z}^D$ with $|z| \le Z_n$, define $W_{j,z} := z^s e^{i \langle z, X_j \rangle}$ and $V_{j,z} := z^s e^{i \langle z, Y_j \rangle}$, so that $W_j$ and $V_j$ are column vectors in $\mathbb{R}^{(2Z_n)^D}$. Let
$$\bar W := \frac{1}{n} \sum_{j=1}^n W_j, \quad \bar V := \frac{1}{n} \sum_{j=1}^n V_j \in \mathbb{R}^{(2Z_n)^D}, \quad \Sigma_W := \frac{1}{n - 1} \sum_{j=1}^n (W_j - \bar W)(W_j - \bar W)^T, \quad \Sigma_V := \frac{1}{n - 1} \sum_{j=1}^n (V_j - \bar V)(V_j - \bar V)^T \in \mathbb{R}^{(2Z_n)^D \times (2Z_n)^D}$$
denote the empirical means and covariances of $W$ and $V$, respectively. Then, for
$$\hat\sigma_n^2 := \begin{pmatrix} \bar V \\ \bar W \end{pmatrix}^T \begin{pmatrix} \Sigma_W & 0 \\ 0 & \Sigma_V \end{pmatrix} \begin{pmatrix} \bar V \\ \bar W \end{pmatrix},$$
we have $\frac{\sqrt{n}}{\hat\sigma_n} \left( \hat S_n - \langle p, q \rangle_{H^s} \right) \xrightarrow{D} \mathcal{N}(0, 1)$, where $\xrightarrow{D}$ denotes convergence in distribution.

Since distances can be written as a sum of three inner products (Eq. (3)), Theorem 4 might suggest an asymptotic normal distribution for Sobolev distances. However, extending asymptotic normality from inner products to their sum requires that the three estimates be independent, and hence that we split data between the three estimates. This is inefficient in practice and somewhat unnatural, as we know, for example, that distances should be non-negative.
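Under our reading of Theorem 4 (the statement is partially garbled in extraction, so the variance formula below is a hedged delta-method sketch rather than the paper's exact recipe; the function name is ours), an asymptotic confidence interval for $\langle p, q \rangle_{H^s}$ can be computed as follows for $D = 1$:

```python
import numpy as np

def sobolev_ip_with_ci(X, Y, s=0.0, Zn=5, z_crit=1.96):
    """Inner-product estimate with an asymptotic ~95% CI in the spirit of
    Theorem 4 (D = 1). Real feature vectors stack Re and Im parts of
    W_{j,z} = z^s psi_z(X_j); the plug-in variance uses the delta method."""
    n = len(X)
    zs = np.arange(-Zn, Zn + 1)
    wts = np.abs(zs) ** float(s)                           # |z|^s, with 0^0 = 1
    Wc = wts * np.exp(1j * np.outer(X, zs)) / np.sqrt(2 * np.pi)
    Vc = wts * np.exp(1j * np.outer(Y, zs)) / np.sqrt(2 * np.pi)
    W = np.hstack([Wc.real, Wc.imag])                      # shape (n, 2(2Zn+1))
    V = np.hstack([Vc.real, Vc.imag])
    Wbar, Vbar = W.mean(axis=0), V.mean(axis=0)
    SigmaW = np.cov(W, rowvar=False)                       # empirical covariances
    SigmaV = np.cov(V, rowvar=False)
    S_hat = Wbar @ Vbar                # = Re sum_z z^{2s} phat(z) conj(qhat(z))
    sigma2 = Vbar @ SigmaW @ Vbar + Wbar @ SigmaV @ Wbar   # asymptotic variance
    half = z_crit * np.sqrt(sigma2 / n)
    return S_hat, (S_hat - half, S_hat + half)
```

For norms, recall the sample-splitting caveat from Section 2: estimate $\|p\|_{H^s}^2$ by passing two independent halves of the sample from $p$ as X and Y.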
For the particular case $p = q$ (as in the null hypothesis of two-sample testing), the following theorem⁵ provides a more precise asymptotic ($\chi^2$) distribution of our Sobolev distance estimator, after an extra decorrelation step. This gives, for example, a more powerful two-sample test statistic (see Section 9 for details).

Theorem 5 (Asymptotic null distribution). Suppose that, for some $s' > 2s + D/4$, $p, q \in H^{s'}$, and suppose $Z_n n^{-\frac{1}{4(s' - s)}} \to \infty$ and $Z_n n^{-\frac{1}{4s + D}} \to 0$ as $n \to \infty$. For $j \in \{1, \dots, n\}$ and $z \in \mathbb{Z}^D$ with $|z| \le Z_n$, define $W_{j,z} := z^s \left( e^{i \langle z, X_j \rangle} - e^{i \langle z, Y_j \rangle} \right)$, so that $W_j$ is a column vector in $\mathbb{R}^{(2Z_n)^D}$. Let
$$\bar W := \frac{1}{n} \sum_{j=1}^n W_j \in \mathbb{R}^{(2Z_n)^D} \quad \text{and} \quad \Sigma := \frac{1}{n - 1} \sum_{j=1}^n \left( W_j - \bar W \right) \left( W_j - \bar W \right)^T \in \mathbb{R}^{(2Z_n)^D \times (2Z_n)^D}$$
denote the empirical mean and covariance of $W$, and define $T := n \bar W^T \Sigma^{-1} \bar W$. Then, if $p = q$, then $Q_{\chi^2((2Z_n)^D)}(T) \xrightarrow{D} \mathrm{Uniform}([0, 1])$ as $n \to \infty$, where $Q_{\chi^2(d)} : [0, \infty) \to [0, 1]$ denotes the quantile function (inverse CDF) of the $\chi^2$ distribution $\chi^2(d)$ with $d$ degrees of freedom.

⁵This result is closely related to Proposition 4 of [4]. However, in their situation, $s = 0$ and the set of test frequencies is fixed as $n \to \infty$, whereas our set is increasing.

Let $\hat M$ denote our estimator for $\|p - q\|_{H^s}^2$ (i.e., plugging $\hat S_n$ into (3)). While Theorem 5 immediately provides a valid two-sample test of desired level, it is not immediately clear how this relates to $\hat M$, nor is there any suggestion of why the test statistic ought to be a good (i.e., consistent) one. Some intuition is as follows. Notice that $\hat M = \bar W^T \bar W$. Since, by the central limit theorem, $\bar W$ has a normal asymptotic distribution, if the components of $\bar W$ were uncorrelated (and $Z_n$ were fixed), we would expect $n \hat M$ to have an asymptotic $\chi^2$ distribution with $(2Z_n)^D$ degrees of freedom. However, because we use the same data to compute each component of $\bar W$, they are not typically uncorrelated, and so the asymptotic distribution of $\hat M$ is difficult to derive. This motivates the statistic $T = n \left( \sqrt{\Sigma^{-1}} \, \bar W \right)^T \left( \sqrt{\Sigma^{-1}} \, \bar W \right)$, since the components of $\sqrt{\Sigma^{-1}} \, \bar W$ are (asymptotically) uncorrelated.
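In code, the decorrelated statistic and its $\chi^2$ p-value are a few lines (again a sketch under our reading of Theorem 5, for $D = 1$, stacking real and imaginary parts of the $2Z_n$ nonzero frequencies; the function name is ours):

```python
import numpy as np
from scipy.stats import chi2

def sobolev_two_sample_test(X, Y, s=0.0, Zn=3):
    """Statistic T = n * Wbar^T Sigma^{-1} Wbar from Theorem 5 (D = 1) and an
    upper-tail chi-squared p-value; requires n >> 2*Zn so Sigma is invertible."""
    n = len(X)
    assert len(Y) == n, "equal sample sizes, as in the problem setup"
    zs = np.arange(1, Zn + 1)          # z = 0 is degenerate for the difference
    Wc = (zs ** float(s)) * (np.exp(1j * np.outer(X, zs))
                             - np.exp(1j * np.outer(Y, zs)))
    W = np.hstack([Wc.real, Wc.imag])  # real feature vectors in R^{2 Zn}
    Wbar = W.mean(axis=0)
    Sigma = np.cov(W, rowvar=False)
    T = n * Wbar @ np.linalg.solve(Sigma, Wbar)
    return T, chi2.sf(T, df=2 * Zn)    # reject p = q at level alpha if p-value < alpha

# Example: distinguish two wrapped Gaussians with different means
rng = np.random.default_rng(0)
X = (rng.normal(0.0, 0.5, 2000) + np.pi) % (2 * np.pi) - np.pi
Y = (rng.normal(0.5, 0.5, 2000) + np.pi) % (2 * np.pi) - np.pi
T, pval = sobolev_two_sample_test(X, Y)
```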
7 Parameter selection and statistical/computational trade-off

Here, we give statistical and computational considerations for choosing the smoothing parameter $Z_n$.

Statistical perspective: In practice, of course, we do not typically know $s'$, so we cannot simply set $Z_n \asymp n^{\frac{2}{4s' + D}}$, as suggested by the mean squared error bound (12). Fortunately (at least for ease of parameter selection), when $s' \ge 2s + D/4$, the dominant term of (12) is $2C_2/n$ for $Z_n \asymp n^{\frac{1}{4s + D}}$. Hence, if we are willing to assume that the density has at least $2s + D/4$ orders of smoothness (which may be a mild assumption in practice), then we achieve statistical optimality (in rate) by setting $Z_n \asymp n^{\frac{1}{4s + D}}$, which depends only on known parameters. On the other hand, the estimator can continue to benefit from additional smoothness computationally.

Computational perspective: An attractive property of the estimator discussed is its computational simplicity and efficiency in low dimensions. Most competing nonparametric estimators, such as kernel-based or nearest-neighbor methods, either take $O(n^2)$ time or rely on complex data structures such as k-d trees or cover trees [31] for $O(2^D n \log n)$ time performance. Since computing the estimator takes $O(n Z_n^D)$ time and $O(Z_n^D)$ memory (that is, the cost of estimating each of $(2Z_n)^D$ Fourier coefficients by an average), a statistically optimal choice of $Z_n \asymp n^{\frac{2}{4s' + D}}$ gives a runtime of $O\left(n^{1 + \frac{2D}{4s' + D}}\right)$. Since the estimate requires only a vector outer product, exponentiation, and averaging, the constants involved are small and computations parallelize trivially over frequencies and data.

Under severe computational constraints, for very large data sets, or if $D$ is large relative to $s'$, we can reduce $Z_n$ to trade off statistical for computational efficiency. For example, if we want an estimator with runtime $O(n^{1 + \theta})$ and space requirement $O(n^\theta)$ for some $\theta \in \left[0, \frac{2D}{4s' + D}\right]$, setting $Z_n \asymp n^{\theta / D}$ still gives a consistent estimator, with mean squared error of the order $O\left(n^{\max\left\{ -\frac{4\theta(s' - s)}{D}, \, -1 \right\}}\right)$. Kernel- or nearest-neighbor-based methods, including nearly all of the methods described in Section 3, tend to require storing previously observed data, resulting in $O(n)$ space requirements. In contrast, orthogonal basis estimation requires storing only the $O(Z_n^D)$ estimated Fourier coefficients. The estimated coefficients can be incrementally updated with each new data point, which may make the estimator, or close approximations, feasible in streaming settings.

[Figure 1: Estimated vs. true L² distances, with asymptotic 95% confidence intervals, as the sample size grows from 10 to 10⁵: (a) 1D Gaussians with different means; (b) 1D Gaussians with different variances; (c) uniform distributions with different ranges; (d) one uniform and one triangular distribution.]

[Figure 2: (a) 3D Gaussians with different means; (b) 3D Gaussians with different variances; (c) estimation of the H⁰ norm of N(0, 1); (d) estimation of the H¹ norm of N(0, 1).]

8 Experimental results

In this section, we use synthetic data to demonstrate the effectiveness of our methods.⁶ All experiments use $10, 10^2, \dots, 10^5$ samples for estimation.

We first test our estimators on 1D $L^2$ distances. Figure 1a shows the estimated distance between $N(0, 1)$ and $N(1, 1)$; Figure 1b shows the estimated distance between $N(0, 1)$ and $N(0, 4)$; Figure 1c shows the estimated distance between Unif$[0, 1]$ and Unif$[0.5, 1.5]$; Figure 1d shows the estimated distance between Unif$[0, 1]$ and a triangular distribution whose density is highest at $x = 0.5$. Error bars indicate asymptotic 95% confidence intervals based on Theorem 4. These experiments suggest $10^5$ samples is sufficient to estimate $L^2$ distances with high confidence. Note that we need fewer samples to estimate Sobolev quantities of Gaussians than, say, of uniform distributions, consistent with our theory, since (infinitely differentiable) Gaussians are smoother than (discontinuous) uniform distributions.

Next, we test our estimators on $L^2$ distances of multivariate distributions. Figure 2a shows the estimated distance between $N([0, 0, 0]^T, I)$ and $N([1, 1, 1]^T, I)$; Figure 2b shows the estimated distance between $N([0, 0, 0]^T, I)$ and $N([0, 0, 0]^T, 4I)$. These experiments show that our estimators can also handle multivariate distributions.

Lastly, we test our estimators for $H^s$ norms. Figure 2c shows the estimated $H^0$ norm of $N(0, 1)$ and Figure 2d shows the estimated $H^1$ norm of $N(0, 1)$. Additional experiments with other distributions and larger values of $s$ are given in the supplement.

⁶MATLAB code for these experiments is available at https://github.com/sss1/SobolevEstimation.
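As an easy-to-run sanity check, the following reproduces a rough analogue of the Figure 1a experiment in Python (our own snippet, reusing the hypothetical `sobolev_inner_product_estimate` sketch from Section 4 above; for $N(0, 1)$ vs. $N(1, 1)$, the true squared $L^2$ distance is $(1 - e^{-1/4})/\sqrt{\pi} \approx 0.125$):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
wrap = lambda x: (x + np.pi) % (2 * np.pi) - np.pi        # reduce to [-pi, pi)

# Independent halves from each density, per the sample-splitting note in Section 2
Xa, Xb = rng.normal(0, 1, n), rng.normal(0, 1, n)
Ya, Yb = rng.normal(1, 1, n), rng.normal(1, 1, n)

ip = lambda A, B: sobolev_inner_product_estimate(wrap(A), wrap(B), s=0.0, Zn=10)
dist2 = ip(Xa, Xb) - 2 * ip(Xa, Ya) + ip(Ya, Yb)          # equation (3)
print(dist2, (1 - np.exp(-0.25)) / np.sqrt(np.pi))        # both near 0.125
```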
9 Connections to two-sample testing

We now discuss the use of our estimator in two-sample testing. From the large literature on nonparametric two-sample testing, we discuss only some recent approaches closely related to ours.

Let $\hat M$ denote our estimate of the Sobolev distance, computed by plugging $\hat S_n$ into equation (3). Since $\|\cdot\|_{H^s}$ is a metric on the space of probability density functions in $H^s$, computing $\hat M$ leads naturally to a two-sample test on this space. Theorem 5 suggests an asymptotic test, which is computationally preferable to a permutation test. In particular, for a desired Type I error rate $\alpha \in (0, 1)$, our test rejects the null hypothesis $p = q$ if and only if $Q_{\chi^2((2Z_n)^D)}(T) < \alpha$.

When $s = 0$, this approach is closely related to several two-sample tests in the literature based on comparing empirical characteristic functions (CFs). Originally, these tests [11, 5] computed the same statistic $T$ with a fixed number of random $\mathbb{R}^D$-valued frequencies instead of deterministic $\mathbb{Z}^D$-valued frequencies. This test runs in linear time, but is not generally consistent, since the two CFs need not differ almost everywhere. Recently, [4] suggested using smoothed CFs, i.e., the convolution of the CF with a universal smoothing kernel $k$. This is computationally easy (due to the convolution theorem) and, when $p \neq q$, $(\tilde p * k)(z) \neq (\tilde q * k)(z)$ for almost all $z \in \mathbb{R}^D$, reducing the need for carefully choosing test frequencies. Furthermore, this test is almost-surely consistent under very general alternatives. However, it is not clear what sort of assumptions would allow finite-sample analysis of the power of their test. Indeed, the convergence as $n \to \infty$ can be arbitrarily slow, depending on the random test frequencies used. Our analysis instead uses the assumption $p, q \in H^s$ to ensure that small, $\mathbb{Z}^D$-valued frequencies contain most of the power of $\tilde p$.⁷

These fixed-frequency approaches can be thought of as the extreme point $\theta = 0$ of the computational/statistical trade-off described in Section 7: they are computable in linear time and (with smoothing) are strongly consistent, but do not satisfy finite-sample bounds under general conditions. At the other extreme ($\theta = 1$) are the MMD-based tests of [8, 9], which use the entire spectrum $\tilde p$. These tests are statistically powerful and have strong guarantees for densities in an RKHS, but have $O(n^2)$ computational complexity.⁸ The computational/statistical trade-off discussed in Section 7 can be thought of as an interpolation (controlled by $\theta$) between these approaches, with runtime in the case $\theta = 1$ approaching quadratic for large $D$ and small $s'$.

⁷Note that smoothed CFs can be used in our test by replacing $\hat p(z)$ with $\frac{1}{n} \sum_{j=1}^n e^{-i \langle z, X_j \rangle} k(X_j)$, where $k$ is the inverse Fourier transform of a characteristic kernel. However, smoothing seems less desirable under Sobolev assumptions, as it spreads the power of the CF away from the small $\mathbb{Z}^D$-valued frequencies on which our test focuses.
⁸Fast MMD approximations have been proposed, including the Block MMD [37], FastMMD [38], and $\sqrt{n}$-sub-sampled MMD, but these lack the statistical guarantees of MMD.

10 Conclusions and future work

In this paper, we proposed nonparametric estimators for Sobolev inner products, norms, and distances of probability densities, for which we derived finite-sample bounds and asymptotic distributions. A natural follow-up question to our work is whether estimating the smoothness of a density can guide the choice of smoothing parameters in nonparametric estimation. When analyzing many nonparametric estimators, Sobolev norms appear as the key unknown term in error bounds. While theoretically optimal smoothing parameter values are often suggested based on optimizing these error bounds, our work may suggest a practical way of mimicking this procedure by plugging estimated Sobolev norms into these bounds. For some problems, such as estimating functionals of a density, this may be especially useful, since no error metric is typically available for cross-validation. Even when cross-validation is an option, as in density estimation or regression, estimating smoothness may be faster, or may suggest an appropriate range of parameter values.
Acknowledgments

This material is based upon work supported by a National Science Foundation Graduate Research Fellowship to the first author under Grant No. DGE-1252522.

References

[1] N. H. Anderson, P. Hall, and D. M. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50(1):41-54, 1994.
[2] P. J. Bickel and Y. Ritov. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā: The Indian Journal of Statistics, Series A, pages 381-393, 1988.
[3] L. Birgé and P. Massart. Estimation of integral functionals of a density. The Annals of Statistics, pages 11-29, 1995.
[4] K. P. Chwialkowski, A. Ramdas, D. Sejdinovic, and A. Gretton. Fast two-sample testing with analytic representations of probability measures. In NIPS, pages 1972-1980, 2015.
[5] T. Epps and K. J. Singleton. An omnibus test for the two-sample problem using the empirical characteristic function. Journal of Statistical Computation and Simulation, 26(3-4):177-203, 1986.
[6] E. Giné and R. Nickl. A simple adaptive estimator of the integrated square of a density. Bernoulli, pages 47-61, 2008.
[7] M. N. Goria, N. N. Leonenko, V. V. Mergel, and P. L. N. Inverardi. A new class of random vector entropy estimators and its applications in testing statistical hypotheses. Journal of Nonparametric Statistics, 17:277-297, 2005.
[8] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, pages 513-520, 2006.
[9] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723-773, 2012.
[10] P. Hall and J. S. Marron. Estimation of integrated squared density derivatives. Statistics & Probability Letters, 6(2):109-115, 1987.
[11] C. Heathcote. A test of goodness of fit for symmetric random variables. Australian Journal of Statistics, 14(2):172-181, 1972.
[12] A. O. Hero, B. Ma, O. J. J. Michel, and J. Gorman. Applications of entropic spanning graphs. IEEE Signal Processing Magazine, 19(5):85-95, 2002.
[13] I. Ibragimov and R. Khasminskii. On the nonparametric estimation of functionals. In Symposium in Asymptotic Statistics, pages 41-52, 1978.
[14] K. Kandasamy, A. Krishnamurthy, B. Póczos, L. Wasserman, et al. Nonparametric von Mises estimators for entropies, divergences and mutual informations. In NIPS, pages 397-405, 2015.
[15] H.-O. Kreiss and J. Oliger. Stability of the Fourier method. SIAM Journal on Numerical Analysis, 16(3):421-433, 1979.
[16] A. Krishnamurthy, K. Kandasamy, B. Póczos, and L. A. Wasserman. Nonparametric estimation of Rényi divergence and friends. In ICML, pages 919-927, 2014.
[17] A. Krishnamurthy, K. Kandasamy, B. Póczos, and L. A. Wasserman. On estimating $L_2^2$ divergence. In AISTATS, 2015.
[18] B. Laurent. Efficient estimation of integral functionals of a density. Université de Paris-sud, Département de mathématiques, 1992.
[19] B. Laurent et al. Efficient estimation of integral functionals of a density. The Annals of Statistics, 24(2):659-681, 1996.
[20] N. Leonenko, L. Pronzato, V. Savani, et al. A class of Rényi information estimators for multidimensional densities. The Annals of Statistics, 36(5):2153-2182, 2008.
[21] G. Leoni. A First Course in Sobolev Spaces, volume 105. American Mathematical Society, Providence, RI, 2009.
[22] K. Moon and A. Hero. Multivariate f-divergence estimation with confidence. In Advances in Neural Information Processing Systems, pages 2420-2428, 2014.
[23] K. R. Moon and A. O. Hero. Ensemble estimation of multivariate f-divergence. In Information Theory (ISIT), 2014 IEEE International Symposium on, pages 356-360. IEEE, 2014.
[24] K. R. Moon, K. Sricharan, K. Greenewald, and A. O. Hero III. Improving convergence of divergence functional ensemble estimators. arXiv preprint arXiv:1601.06884, 2016.
[25] L. Pardo. Statistical Inference Based on Divergence Measures. CRC Press, 2005.
[26] B. Póczos and J. G. Schneider. On the estimation of alpha-divergences. In International Conference on Artificial Intelligence and Statistics, pages 609-617, 2011.
[27] B. Póczos, L. Xiong, and J. Schneider. Nonparametric divergence estimation with applications to machine learning on distributions. arXiv preprint arXiv:1202.3758, 2012.
[28] B. Póczos, L. Xiong, D. J. Sutherland, and J. Schneider. Nonparametric kernel estimators for image classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2989-2996. IEEE, 2012.
[29] J. C. Principe. Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives. Springer Science & Business Media, 2010.
[30] N. Quadrianto, J. Petterson, and A. J. Smola. Distribution matching for transduction. In Advances in Neural Information Processing Systems, pages 1500-1508, 2009.
[31] P. Ram, D. Lee, W. March, and A. G. Gray. Linear-time algorithms for pairwise statistical problems. In Advances in Neural Information Processing Systems, pages 1527-1535, 2009.
[32] T. Schweder. Window estimation of the asymptotic variance of rank estimators of location. Scandinavian Journal of Statistics, pages 113-126, 1975.
[33] S. Singh and B. Póczos. Generalized exponential concentration inequality for Rényi divergence estimation. In Proceedings of the 31st International Conference on Machine Learning, pages 333-341, 2014.
[34] S. Singh and B. Póczos. Exponential concentration of a density functional estimator. In Advances in Neural Information Processing Systems, pages 3032-3040, 2014.
[35] A. Tsybakov. Introduction to Nonparametric Estimation. Springer, 1st edition, 2008. ISBN 9780387790510.
[36] E. Wolsztynski, E. Thierry, and L. Pronzato. Minimum-entropy estimation in semi-parametric models. Signal Processing, 85(5):937-949, 2005.
[37] W. Zaremba, A. Gretton, and M. Blaschko. B-test: A non-parametric, low variance kernel two-sample test. In Advances in Neural Information Processing Systems, pages 755-763, 2013.
[38] J. Zhao and D. Meng. FastMMD: Ensemble of circular discrepancy for efficient two-sample test. Neural Computation, 27(6):1345-1372, 2015.
[39] A. Zygmund. Trigonometric Series, volume 1. Cambridge University Press, 2002.