# Minimax Estimation of Kernel Mean Embeddings

Journal of Machine Learning Research 18 (2017) 1-47. Submitted 1/17; Revised 7/17; Published 7/17.

Ilya Tolstikhin (ilya@tuebingen.mpg.de), Department of Empirical Inference, Max Planck Institute for Intelligent Systems, Spemannstraße 38, Tübingen 72076, Germany

Bharath K. Sriperumbudur (bks18@psu.edu), Department of Statistics, Pennsylvania State University, University Park, PA 16802, USA

Krikamol Muandet (krikamol@tuebingen.mpg.de), Department of Mathematics, Faculty of Science, Mahidol University, 272 Rama VI Rd., Rajchathevi, Bangkok 10400, Thailand

Editor: Andreas Christmann

©2017 Ilya Tolstikhin, Bharath K. Sriperumbudur, and Krikamol Muandet. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/17-032.html.

Abstract

In this paper, we study the minimax estimation of the Bochner integral $\mu_k(P) := \int_{\mathcal{X}} k(\cdot, x)\, dP(x)$, also called the kernel mean embedding, based on random samples drawn i.i.d. from $P$, where $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive definite kernel. Various estimators (including the empirical estimator) $\hat{\theta}_n$ of $\mu_k(P)$ have been studied in the literature, all of which satisfy $\|\hat{\theta}_n - \mu_k(P)\|_{\mathcal{H}_k} = O_P(n^{-1/2})$, where $\mathcal{H}_k$ is the reproducing kernel Hilbert space induced by $k$. The main contribution of the paper is to show that the above rate of $n^{-1/2}$ is minimax in the $\mathcal{H}_k$- and $L^2(\mathbb{R}^d)$-norms over the class of discrete measures and the class of measures that have an infinitely differentiable density, with $k$ being a continuous translation-invariant kernel on $\mathbb{R}^d$. The interesting aspect of this result is that the minimax rate is independent of the smoothness of the kernel and of the density of $P$ (if it exists).

Keywords: Bochner integral, Bochner's theorem, kernel mean embeddings, minimax lower bounds, reproducing kernel Hilbert space, translation invariant kernel

1. Introduction

Over the last few years, kernel embedding of distributions (Smola et al., 2007; Sriperumbudur et al., 2010) has gained a lot of attention in the machine learning community due to the wide variety of applications it has been employed in. Some of these applications include kernel two-sample testing (Gretton et al., 2007, 2012), kernel independence and conditional independence tests (Gretton et al., 2008; Fukumizu et al., 2008), covariate shift (Smola et al., 2007), density estimation (Sriperumbudur, 2011), feature selection (Song et al., 2012), causal inference (Lopez-Paz et al., 2015), kernel Bayes' rule (Fukumizu et al., 2013) and distribution regression (Szabó et al., 2015).

Formally, let $\mathcal{H}_k$ be a separable reproducing kernel Hilbert space (RKHS) (Aronszajn, 1950) with a continuous reproducing kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ defined on a separable topological space $\mathcal{X}$. Given a Borel probability measure $P$ defined over $\mathcal{X}$ such that $\int_{\mathcal{X}} \sqrt{k(x, x)}\, dP(x) < \infty$, the kernel mean or the mean element is defined as the Bochner integral

$$\mu_P := \int_{\mathcal{X}} k(\cdot, x)\, dP(x) \in \mathcal{H}_k. \qquad (1)$$

We refer the reader to Diestel and Uhl (1977, Chapter 2) and Dinculeanu (2000, Chapter 1) for the definition of a Bochner integral. The mean element in (1) can be viewed as an embedding of $P$ in $\mathcal{H}_k$,

$$\mu_k : M^1_+(\mathcal{X}) \to \mathcal{H}_k, \qquad \mu_k(P) = \mu_P,$$

where $M^1_+(\mathcal{X})$ denotes the set of all Borel probability measures on $\mathcal{X}$. Hence, we also refer to $\mu_k$ as the kernel mean embedding (KME). The mean embedding can be seen as a generalization of the classical kernel feature map that embeds points of an input space $\mathcal{X}$ as elements in $\mathcal{H}_k$.
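To make (1) concrete, the following is a minimal sketch (not taken from the paper) of evaluating the plug-in estimate of the mean element obtained by replacing $P$ with the empirical measure of an i.i.d. sample, i.e., the empirical estimator discussed below. It assumes a Gaussian kernel and NumPy; the bandwidth, sample size, and sampling distribution are illustrative choices.

```python
# Minimal sketch: evaluate the empirical kernel mean embedding
# mu_{P_n}(.) = (1/n) sum_i k(., X_i) at query points, for a Gaussian kernel on R^d.
import numpy as np

def gaussian_kernel(X, Y, eta=1.0):
    """k(x, y) = exp(-||x - y||_2^2 / (2 eta^2)) for all pairs of rows of X and Y."""
    sq_dists = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * eta ** 2))

def empirical_kme(sample, query, eta=1.0):
    """mu_{P_n} evaluated at the query points: (1/n) sum_i k(query, X_i)."""
    return gaussian_kernel(query, sample, eta).mean(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                    # i.i.d. sample (illustrative P = N(0, I_2))
query = np.array([[0.0, 0.0], [2.0, 2.0]])
print(empirical_kme(X, query, eta=1.0))          # mu_{P_n} at two points of R^2
```

Since $\mu_{P_n}$ lives in $\mathcal{H}_k$, it is fully determined by its values $\mu_{P_n}(x) = \langle \mu_{P_n}, k(\cdot, x)\rangle_{\mathcal{H}_k}$ at arbitrary points $x$, which is what the sketch computes.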
The mean embedding $\mu_k$ can also be seen as a generalization of the classical notions of the characteristic function, the moment generating function (if it exists), and the Weierstrass transform of $P$ (all defined on $\mathbb{R}^d$) to an arbitrary topological space $\mathcal{X}$: the choices $k(\cdot, x) = (2\pi)^{-d/2} e^{\sqrt{-1}\langle \cdot, x\rangle}$, $e^{\langle \cdot, x\rangle}$, and $(4\pi)^{-d/2} e^{-\|\cdot - x\|_2^2/4}$, $x \in \mathbb{R}^d$, respectively reduce $\mu_k$ to these notions. The mean embedding $\mu_k$ is closely related to the maximum mean discrepancy (MMD) (Gretton et al., 2007), which is the RKHS distance between the mean embeddings of two probability measures. We refer the reader to (Sriperumbudur et al., 2010; Sriperumbudur, 2016) for more details on the properties of $\mu_k$ and the corresponding MMD.

In all the above mentioned statistical and machine learning applications, since the underlying distribution $P$ is known only through random samples $X_1, \ldots, X_n$ drawn i.i.d. from it, an estimator of $\mu_P$ is employed. The goal of this paper is to study the minimax optimal estimation of $\mu_P$. In the literature, various estimators of $\mu_P$ have been proposed. The simplest and most popular is the empirical estimator $\mu_{P_n}$, which is constructed by replacing $P$ by its empirical counterpart, $P_n := \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$, where $\delta_x$ denotes the Dirac measure at $x \in \mathcal{X}$. In fact, all the above mentioned applications deal with the empirical estimator of $\mu_P$ because of its simplicity. Using Bernstein's inequality in separable Hilbert spaces (Yurinsky, 1995, Theorem 3.3.4), it follows that for bounded continuous kernels, $\|\mu_{P_n} - \mu_P\|_{\mathcal{H}_k} = O_P(n^{-1/2})$ for any $P$, i.e., the empirical estimator is a $\sqrt{n}$-consistent estimator of $\mu_P$ in the $\mathcal{H}_k$-norm. This result is also proved in Smola et al. (2007, Theorem 2), Gretton et al. (2012), and Lopez-Paz et al. (2015) using McDiarmid's inequality, which we improve in Proposition A.1 (also see Remark A.2 in Appendix A) by providing a better constant. Assuming $\mathcal{X} = \mathbb{R}^d$ and $P$ to have a density $p$, Sriperumbudur (2016, Theorem 4.1) proposed to estimate $\mu_P = \int_{\mathbb{R}^d} k(\cdot, x)p(x)\, dx$ by replacing $p$ with a kernel density estimator, which is then shown to be $\sqrt{n}$-consistent in the $\mathcal{H}_k$-norm if $k$ is a bounded continuous translation invariant kernel (see Section 2 for its definition on $\mathbb{R}^d$). Recently, Muandet et al. (2016, Section 2.4, Theorem 7) proposed a non-parametric shrinkage estimator of $\mu_P$ and established its $\sqrt{n}$-consistency in the $\mathcal{H}_k$-norm for bounded continuous kernels on $\mathcal{X}$. Muandet et al. (2016, Section 3, Theorem 10) also proposed a penalized M-estimator for $\mu_P$ where the penalization parameter is computed in a completely data-driven manner using leave-one-out cross validation and showed that it is also $\sqrt{n}$-consistent in the $\mathcal{H}_k$-norm. In fact, the $\sqrt{n}$-consistency of all these estimators is established by showing that they are all within an $\mathcal{H}_k$-ball of size $o_P(n^{-1/2})$ around the empirical estimator $\mu_{P_n}$.

In the above discussion, it is important to note that the convergence rate of $\mu_{P_n}$ (and also of the other estimators) to $\mu_P$ in the $\mathcal{H}_k$-norm does not depend on the smoothness of $k$ or of the density $p$ (if it exists). Under some mild conditions on the kernel (defined on $\mathbb{R}^d$), it can be shown (see Section 4) that $\mathcal{H}_k$ is continuously included in $L^2(\mathbb{R}^d)$ and $\|f\|_{L^2(\mathbb{R}^d)} \le c_k \|f\|_{\mathcal{H}_k}$ for all $f \in \mathcal{H}_k$, where $c_k$ is a constant that depends only on the kernel. This means that $\|\cdot\|_{L^2(\mathbb{R}^d)}$ is a weaker norm than $\|\cdot\|_{\mathcal{H}_k}$, and therefore it could be possible that $\mu_{P_n}$ converges to $\mu_P$ in $L^2(\mathbb{R}^d)$ at a rate faster than $n^{-1/2}$ (depending on the smoothness of $k$). In Proposition A.1 (also see Remark A.3 in Appendix A) we show that $\|\mu_{P_n} - \mu_P\|_{L^2(\mathbb{R}^d)} = O_P(n^{-1/2})$.
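The following numerical sketch (not from the paper) illustrates the $O_P(n^{-1/2})$ rate stated above. It assumes a Gaussian kernel and $P = N(0, \sigma^2 I_d)$, chosen only because $\langle \mu_{P_n}, \mu_P\rangle_{\mathcal{H}_k}$ and $\|\mu_P\|_{\mathcal{H}_k}^2$ then have closed forms; the specific constants are illustrative. The last printed column, $\sqrt{n}$ times the root-mean-squared $\mathcal{H}_k$-error, stays roughly constant as $n$ grows.

```python
# Sketch of the n^{-1/2} rate of ||mu_{P_n} - mu_P||_{H_k} for a Gaussian kernel
# k(x,y) = exp(-||x-y||^2/(2 eta^2)) and P = N(0, sigma^2 I_d).
import numpy as np

d, eta, sigma = 2, 1.0, 1.0
rng = np.random.default_rng(0)

def gram(X):
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * eta**2))

def mu_P(X):
    # Closed form of mu_P(x) = E_{Y~P} k(x, Y) for a Gaussian kernel and Gaussian P.
    r = eta**2 / (eta**2 + sigma**2)
    return r ** (d / 2) * np.exp(-np.sum(X**2, 1) / (2 * (eta**2 + sigma**2)))

mu_P_sq = (eta**2 / (eta**2 + 2 * sigma**2)) ** (d / 2)   # ||mu_P||^2 = E k(Y, Y')

for n in [100, 400, 1600]:
    sq_errs = []
    for _ in range(30):
        X = sigma * rng.normal(size=(n, d))
        sq_err = gram(X).mean() - 2 * mu_P(X).mean() + mu_P_sq   # ||mu_{P_n} - mu_P||_{H_k}^2
        sq_errs.append(max(sq_err, 0.0))
    rmse = np.sqrt(np.mean(sq_errs))
    print(n, round(rmse, 4), round(rmse * np.sqrt(n), 4))        # last column ~ constant
```

In this toy setting one can verify that $\mathbb{E}\|\mu_{P_n} - \mu_P\|^2_{\mathcal{H}_k} = (1 - \|\mu_P\|^2_{\mathcal{H}_k})/n$ exactly, so the printed $\sqrt{n}$-scaled error concentrates around $\sqrt{1 - \|\mu_P\|^2_{\mathcal{H}_k}}$.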
Now, given these results, it is of interest to understand whether these rates are optimal in a minimax sense, i.e., whether the above mentioned estimators are minimax rate optimal or whether they can be improved upon. Therefore, the goal of this work is to obtain minimax rates for the estimation of $\mu_P$ in $\mathcal{H}_k$ and $L^2(\mathbb{R}^d)$. Formally, we would like to find the minimax rate $r_{n,k}(F, \mathcal{P})$ and a positive constant $c_k(F, \mathcal{P})$ (independent of $n$) such that

$$\inf_{\hat{\theta}_n} \sup_{P \in \mathcal{P}} P^n\left\{ r_{n,k}^{-1}(F, \mathcal{P})\, \|\hat{\theta}_n - \mu_P\|_F \ge c_k(F, \mathcal{P}) \right\} > 0, \qquad (2)$$

where $F$ is either $\mathcal{H}_k$ or $L^2(\mathbb{R}^d)$, $\mathcal{P}$ is a suitable subset of Borel probability measures on $\mathcal{X}$, and the infimum is taken over all estimators $\hat{\theta}_n$ mapping the i.i.d. sample $X_1, \ldots, X_n$ to $F$.

Suppose $k(x, y) = \langle x, y\rangle$, $x, y \in \mathbb{R}^d$. The norms of $\mathcal{H}_k$ and $L^2(\mathbb{R}^d)$ match for this choice of $k$ and the corresponding RKHS is finite dimensional, i.e., $\mathcal{H}_k = \mathbb{R}^d$. For a distribution $P$ on $\mathbb{R}^d$ satisfying $\int_{\mathbb{R}^d} \|x\|_2\, dP(x) < \infty$, this choice of kernel yields $\mu_P = \int x\, dP(x)$ as the mean embedding of $P$, which is simply the mean of $P$. It is well known (Lehmann and Casella, 2008, Chapter 5, Example 1.14) that the minimax rate of estimating $\mu_P \in \mathbb{R}^d$ based on $(X_i)_{i=1}^n$ is $r_{n,k}(F, \mathcal{P}) = n^{-1/2}$ for the class $\mathcal{P}$ of Gaussian distributions on $\mathbb{R}^d$. In fact, this rate is attained by the empirical estimator $\mu_{P_n} = \frac{1}{n}\sum_{i=1}^n X_i$, which is the sample mean. Based on this observation, while one can intuitively argue that the minimax rate of estimating $\mu_P$ is $n^{-1/2}$ even if $\mathcal{H}_k$ is an infinite dimensional RKHS, it is difficult to extend the finite dimensional argument in a rigorous manner to the estimation of the infinite dimensional object $\mu_P$. In this paper, through a key inequality (see (3)), we rigorously show that this is indeed the case.

The main result of the paper is that if $k$ is translation invariant on $\mathcal{X} = \mathbb{R}^d$ (see Theorems 1 and 9 for precise conditions on the kernel) and $\mathcal{P}$ is the set of all Borel discrete probability measures on $\mathbb{R}^d$, then the minimax rate $r_{n,k}(F, \mathcal{P})$ is $n^{-1/2}$ for both $F = \mathcal{H}_k$ and $F = L^2(\mathbb{R}^d)$. Next, we show in Theorems 6 and 12 that the minimax rate for the estimation of $\mu_P$ in both $\mathcal{H}_k$ and $L^2(\mathbb{R}^d)$ still remains $n^{-1/2}$ even when $\mathcal{P}$ is restricted to the class of Borel probability measures whose densities $p$ are continuously infinitely differentiable. The reason for considering such a class of distributions with smooth densities is that $\mu_P$, which is the convolution of $k$ and $p$, is smoother than $k$. Therefore, one might wonder whether it could be possible to estimate $\mu_P$ at a rate faster than $n^{-1/2}$ that depends on the smoothness of $k$ and $p$. Our result establishes that even for the class of distributions with very smooth densities, the minimax rate is independent of the smoothness of $k$ and of the density of $P$. The key ingredient in the proofs of Theorems 6 and 12 is the non-trivial inequality (see Proposition 3)

$$\|\mu_{G_0} - \mu_{G_1}\|_F \ge c_{k,\sigma^2}\, \|\tau_0 - \tau_1\|_2, \qquad (3)$$

which relates the $F$-distance between the mean embeddings of the Gaussian distributions $G_0 = N(\tau_0, \sigma^2 I)$ and $G_1 = N(\tau_1, \sigma^2 I)$ to the Euclidean distance between the means of these Gaussians, where $c_{k,\sigma^2}$ is a constant that depends only on $\sigma^2$ and the translation invariant characteristic kernel $k$. Combining (3) with Le Cam's method (see Appendix B) implies that the estimation of the infinite dimensional object $\mu_P$ is at least as hard as the estimation of the finite dimensional mean of a Gaussian distribution, thereby establishing the minimax rate to be $n^{-1/2}$. These results show that the empirical estimator, as well as the other estimators of $\mu_P$ discussed above, is minimax rate optimal. Ramdas et al.
(2015, Corollary 1) derived a special case of (3) for the Gaussian kernel k by ignoring small terms in the Taylor series expansion of µG0 µG1 Hk (refer to Remark 4). They used this result to show that the MMD between G0 and G1 decreases to zero exponentially/polynomially fast in d even when the Kullback-Leibler divergence between the two is kept constant, which in turn sheds some light on the decaying power of MMD-based hypothesis tests in high dimensions. Proposition 3 is more general, as it holds for any translation-invariant kernel k and does not require a truncation of small reminder terms. The paper is organized as follows. Various notations used throughout the paper and definitions are collected in Section 2. The main results on minimax estimation of µP in Hk and L2(Rd) for translation invariant kernels (and also radial kernels) on Rd are presented in Sections 3 and 4 respectively. The proofs of the results are provided in Section 5 while some supplementary results needed in the proofs are collected in appendices. 2. Definitions & Notation Define a 2 := q Pd i=1 a2 i and a, b := Pd i=1 aibi, where a := (a1, . . . , ad) Rd and b := (b1, . . . , bd) Rd. C(Rd) (resp. Cb(Rd)) denotes the space of all continuous (resp. bounded continuous) functions on Rd. f C(Rd) is said to vanish at infinity if for every ϵ > 0 the set {x : |f(x)| ϵ} is compact. The class of all continuous f on Rd which vanish at infinity is denoted as C0(Rd). For f Cb(Rd), f := supx Rd |f(x)| denotes the supremum norm of f. Mb(Rd) (resp. Mb +(Rd)) denotes the set of all finite (resp. finite non-negative) Borel measures on Rd. supp(µ) denotes the support of µ Mb(Rd) which is defined as supp(µ) = {x Rd | for any open set U such that x U, |µ|(U) = 0}, where |µ| is the total-variation of µ. M1 +(Rd) denotes the set of Borel probability measures on Rd. For µ Mb +(Rd), Lr(Rd, µ) denotes the Banach space of r-power (r 1) µ-integrable functions and we will use Lr(Rd) for Lr(Rd, µ) if µ is a Lebesgue measure on Rd. For f Lr(Rd, µ), f Lr(Rd,µ) := R Rd |f|r dµ 1/r denotes the Lr-norm of f for 1 r < and we denote it as Lr(Rd) if µ is the Lebesgue measure. The convolution f g of two measurable functions Minimax Estimation of Kernel Mean Embeddings f and g on Rd is defined as (f g)(x) := Z Rd f(y)g(x y) dy, provided the integral exists for all x Rd. The Fourier transforms of f L1(Rd) and µ Mb(Rd) are defined as f (y) := F[f](y) = (2π) d/2 Z Rd f(x) e i y,x dx, y Rd and µ (y) := F[µ](y) = (2π) d/2 Z Rd e i y,x dµ(x), y Rd respectively, where i denotes the imaginary unit 1. A kernel k: Rd Rd R is called translation invariant if there exists a symmetric positive definite function, ψ such that k(x, y) = ψ(x y) for all x, y Rd. Bochner s theorem (see Wendland, 2005, Theorem 6.6) provides a complete characterization for a positive definite function ψ: A continuous function ψ: Rd R is positive definite if and only if it is the Fourier transform of Λψ Mb +(Rd), i.e., ψ(x) = 1 (2π)d/2 Rd e i x,w dΛψ(w), x Rd. (4) A kernel k is called radial if there exists φ : R+ R such that k(x, y) = φ( x y 2 2) for all x, y Rd. From Sch onberg s representation (Schoenberg, 1938; Wendland, 2005, Theorems 7.13 & 7.14) it is known that a kernel k is radial on every Rd if and only if there exists ν Mb +([0, )) such that the following holds for all x, y Rd: k(x, y) = φ( x y 2) = Z 0 e t x y 2dν(t). (5) Some examples of reproducing kernels on Rd (in fact all these are radial) that appear throughout the paper are: 1. Gaussian: k(x, y) = exp x y 2 2 2η2 , η > 0; 2. 
Mixture of Gaussians: k(x, y) = PM i=1 βi exp x y 2 2 2η2 i , where M 2, η2 1 η2 2 η2 M > 0, and positive constants β1, . . . , βM such that PM i=1 βi = CM < ; 3. Inverse Multiquadrics: k(x, y) = (c2 + x y 2 2) γ, c, γ > 0; 4. Mat ern: k(x, y) = c2τ d 2 )2τ 1 d/2 x y 2 2 τ(c x y 2), τ > d/2, c > 0, where Kα is the modified Bessel function of the third kind of order α and Γ is the Gamma function. A kernel k is said to be characteristic if the mean embedding, µk : P µP is injective, where µP is defined in (1). Various characterizations for the injectivity of µk (or k Tolstikhin, Sriperumbudur, and Muandet being characteristic) are known in literature (for details, see Sriperumbudur et al., 2011 and references therein). If k is a bounded continuous translation invariant positive definite kernel on Rd, a simple characterization can be obtained for it to be characteristic (Sriperumbudur et al., 2010, Theorem 9): k is characteristic if and only if supp(Λψ) = Rd where Λψ is defined in (4). This characterization implies that the above mentioned examples are characteristic kernels. Examples of non-characteristic kernels of translation invariant type include k(x, y) = sin(x y) x y , x, y R and k(x, y) = cos(x y), x, y R. More generally, polynomial kernels of any finite order are non-characteristic. 3. Minimax Estimation of µP in the RKHS Norm In this section, we present our main results related to the minimax estimation of kernel mean embeddings (KMEs) in the RKHS norm. As discussed in Section 1, various estimators of µk(P) are known in literature and all these have a convergence rate of n 1/2 if the kernel is bounded. The main goal of this section is to show that the rate n 1/2 is actually minimax optimal for different choices of P (see (2)) under some mild conditions on k. First, choosing P to be the set of all discrete probability measures on Rd, in Section 3.1 (see Theorem 1 and Corollary 2), we present the minimax lower bounds of order Ω(n 1/2) with constant factors depending only on the properties of the kernel for translation invariant and radial kernels respectively. Next we will show in Section 3.2 that the rate n 1/2 remains minimax optimal for translation invariant and radial kernels even if we choose the class P to contain only probability distributions with infinitely continuously differentiable densities. For translation invariant kernels the result (see Theorem 6) is based on a key inequality, which relates the RKHS distance between embeddings of Gaussian distributions to the Euclidean distance between the mean vectors of these distributions (see Proposition 3). The minimax lower bound for radial kernels (see Theorem 8) is derived using a slightly different argument. Instead of applying the bound of Theorem 6 to the particular case of radial kernels, we will present a direct analysis based on the special properties of radial kernels. This will lead us to the lower bound with almost optimal constant factors, depending only on the shape of Borel measure ν corresponding to the kernel. Our analysis is based on the following simple idea: if a kernel k is characteristic, there is a one-to-one correspondence between any given set of Borel probability measures P defined over Rd and a set µk(P) of their embeddings into the RKHS Hk. 
This means that distributions in P are indexed by their embeddings Θ := µk(P) and so (2) can be equivalently written as inf ˆθn sup θ Θ Pθ n r 1 n,k(Hk, P) ˆθn θ Hk ck(Hk, P) o > 0, (6) where the goal is to find the minimax rate rn,k(Hk, P) and a positive constant ck(Hk, P) (independent of n) such that (6) holds and Pθ = P n when θ = µk(P). Using this equivalence, we obtain the minimax rates by employing Le Cam s method (Tsybakov, 2008) see Theorems B.1 and B.2 for a reference. Minimax Estimation of Kernel Mean Embeddings 3.1 Lower Bounds for Discrete Probability Measures The following result (proved in Section 5.1) presents a minimax rate of n 1/2 for estimating µk(P), where k is assumed to be translation invariant on Rd. Theorem 1 (Translation invariant kernels) Let P be the set of all Borel discrete probability measures on Rd. Suppose k(x, y) = ψ(x y), where ψ Cb(Rd) is positive definite and k is characteristic. Assume there exists z Rd and β > 0, such that ψ(0) ψ(z) β. Then the following holds: inf ˆθn sup P P P n ( ˆθn µk(P) Hk 1 The result is based on Le Cam s method involving two hypotheses (see Theorem B.1), where we choose them to be KMEs of discrete measures, both supported on the same pair of points separated by z in Rd. Remark (Choosing z and β) As discussed in Sriperumbudur et al. (2010, Section 3.4), if k is translation invariant and characteristic on Rd, then it is also strictly positive definite. This means that ψ(0) > 0. Moreover, the following hold: (a) Since ψ is positive definite, we have |ψ(x)| ψ(0) for all x Rd and (b) since ψ is characteristic, it cannot be a constant function. Together these facts show that there always exist z Rd and β > 0 satisfying the assumptions of Theorem 1. For instance, a Gaussian kernel k(x, v) = exp x v 2 2/(2η2) satisfies ψ(0) ψ(z) z 2 2/(4η2) if z 2 2 2η2, where we used a simple fact that 1 e x x/2 for 0 x 1. While Theorem 1 dealt with general translation invariant kernels, the following result (proved in Section 5.2) specializes it to radial kernels, i.e., kernels of the form in (5), by providing a simple condition on ν under which Theorem 1 holds. Corollary 2 (Radial kernels) Let P be the set of all Borel discrete probability measures on Rd and k be radial on Rd, i.e., k(x, y) = ψν(x y) := R 0 e t x y 2 2dν(t), where ν Mb +([0, )) such that supp(ν) = {0}. Assume there exist 0 < t1 < and α > 0 satisfying ν([t1, )) α. Then the following holds: inf ˆθn sup P P P n ˆθn µk(P) Hk 1 Remark (Choosing t1 and α) Since supp(ν) = {0} the assumption of ν[t1, ) α is always satisfied. For instance, if ν is a probability measure with positive median η then we can set t1 = η and α = 1 2. Based on this, it is easy to verify (see Appendix D.1) that α = 1 for Gaussian, α = CM for mixture of Gaussian kernels, α = c 2γ 2 for inverse multiquadrics and α = 1 2 for Mat ern kernels. 3.2 Lower Bounds for Probability Measures with Smooth Densities So far, we have shown that the rate n 1/2 is minimax optimal for the problem of KME estimation (both for translation invariant and radial kernels). As discussed in Section 1, Tolstikhin, Sriperumbudur, and Muandet since this rate is independent of the smoothness of the estimand (which is determined by the smoothness of the kernel), one might wonder whether the minimax rate can be improved by restricting P to distributions with smooth densities. 
We show in this section (see Theorems 6 and 8) that this is not the case by restricting P to contain only distributions with infinitely continuously differentiable densities and proving the minimax lower bound of order n 1/2. We will start the analysis with translation invariant kernels and present a corresponding lower bound in Theorem 6. The proof of this result is again based on an application of Le Cam s method involving two hypotheses (see Theorem B.1), where this time these hypotheses are chosen to be embeddings of the d-dimensional Gaussian distributions. One of the main steps, when applying Theorem B.1, is to lower bound the distance between these embeddings. This is done in the following result (proved in Section 5.3), which essentially shows that if we take two Gaussian distributions G(µ0, σ2I) and G(µ1, σ2I) with the mean vectors µ0, µ1 Rd which are close enough to each other, then the RKHS distance between the corresponding embeddings can be lower bounded by the Euclidean distance µ0 µ1 2. Proposition 3 Let σ > 0. Suppose k(x, y) = ψ(x y), where ψ Cb(Rd) is positive definite and k is characteristic. Then there exist constants ϵψ,σ2, cψ,σ2 > 0 depending only on ψ and σ2, such that the following condition holds for any a Rd with a 2 2 ϵψ,σ2: cψ,σ2 min ez Sd 1 2 (2π)d/2 Rd e σ2 w 2 2 ez, w 2 cos ( a, w ) dΛψ(w) < , (7) where Sd 1 is a unit sphere in Rd and Λψ Mb +(Rd) is defined in (4). Moreover, for all vectors µ0, µ1 Rd satisfying µ0 µ1 2 2 ϵψ,σ2, the following holds: 2 µ0 µ1 2, (8) where θ0 and θ1 are KMEs of Gaussian measures G(µ0, σ2I) and G(µ1, σ2I) respectively. Remark 4 (KME expands small distances) For a Gaussian kernel, it is possible to show (Sriperumbudur et al., 2012, Example 3; Ramdas et al., 2015, Proposition 1) that θ0 θ1 2 Hk = C1 1 exp( C2 µ0 µ1 2 2) , where C1 and C2 are positive constants that depend only on σ2 and η2. This shows that (8) holds for µ0 µ1 2 [0, D], where D satisfies C1 1 exp( C2D2) = 1 2D2cψ,σ2. In other words, Proposition 3 states that the mapping fσ2 : Rd Hk defined by fσ2(x) := µk G(x, σ2I) expands small distances. Remark 5 (Computing cψ,σ2 and ϵψ,σ2) Generally it may be very hard to compute (or bound) the constants cψ,σ2 and ϵψ,σ2 appearing in the statement of Proposition 3. However, in some cases this may be still possible. In Appendix E we will provide an extensive analysis for the case of radial kernels. Based on Proposition 3, the following result shows that the rate of n 1/2 remains minimax optimal for the problem of KME estimation with translation invariant kernels, even if we restrict the class of distributions P to contain only measures with smooth densities. Minimax Estimation of Kernel Mean Embeddings Theorem 6 (Translation invariant kernels) Let P be the set of distributions over Rd whose densities are continuously infinitely differentiable. Suppose k(x, y) = ψ(x y), where ψ Cb(Rd) is positive definite and k is characteristic. Define cψ := cψ,1 and ϵψ := ϵψ,1 where cψ,1 and ϵψ,1 are positive constants that satisfy (7) in Proposition 3. Then for any n 1 ϵψ , the following holds: inf ˆθn sup P P P n ˆθn µk(P) Hk 1 Proof The proof will be based on Theorem B.1. For this we need to find two probability measures P0 and P1 on Rd and corresponding KMEs θ0 and θ1, such that θ0 θ1 Hk is of the order Ω(n 1/2), while KL(P n 0 P n 1 ) is upper bounded by a constant independent of n. 
Here KL(P0 P1) denotes the Kullback-Leibler divergence between P0 and P1, which is defined as KL(P0 P1) = R log d P0 d P1 d P0 where P0 is absolutely continuous w.r.t. P1. Pick two Gaussian distributions G0 := G(µ0, σ2I) and G1 := G(µ1, σ2I) for µ0, µ1 Rd, and σ2 > 0. It is known that (Tsybakov, 2008, Section 2.4) KL(Gn 0 Gn 1) = n µ0 µ1 2 2 2σ2 , (9) where Gn 0 and Gn 1 are n-fold product distributions. Choose µ0 and µ1 such that µ0 µ1 2 2 = 1 Denote KMEs of G0 and G1 using θ0 and θ1 respectively. Next we will take σ2 = 1 and apply Proposition 3. Since cψ and ϵψ satisfy (7) in Proposition 3, it follows from Proposition 3 that for 1/n ϵψ, θ0 θ1 2 Hk cψ 2 µ0 µ1 2 2 = cψ This shows that the first condition of Theorem B.1 is satisfied for θ0 and θ1 with s := 1 2 p cψ/(2n). Moreover, using (9) we can show that the second condition of Theorem B.1 is satisfied with α = 1 2. We conclude the proof with an application of Theorem B.1. Remark 7 (Lower bound on the sample size n) Note that Theorem 6 holds only for large enough sample size n (i.e., n 1/ϵψ). This assumption on n can be dropped if we set µ0 µ1 2 2 = ϵψ/n in the proof. In this case, the lower bound 1 cψ/(2n) will be replaced with 1 cψϵψ/(2n), while the lower bound on the minimax probability 1/4 will be replaced with The latter is generally undesirable, especially if ϵψ grows with d , since we want the minimax probability to be lower bounded by some universal non-zero constant that does not depend on the properties of the problem at hand. Tolstikhin, Sriperumbudur, and Muandet Since radial kernels are particular instances of translation invariant kernels, Theorem 6 can be specialized by explicitly computing the constants cψ and ϵψ to derive a minimax lower bound of order Ω(n 1/2). Unfortunately, the resulting lower bound will depend on the dimensionality d in a rather bad way and, as a consequence, is suboptimal in some situations. For instance, if we consider a Gaussian kernel k(x, y) = exp 1 2η2 x y 2 2 , then a straightforward computation of cψ shows that the lower bound in Theorem 6 has the form p (1 + 2/η2) d/2/n which shrinks to zero as d , while Proposition A.1 (also see Remark A.2) provides a dimension independent upper bound of the order Op(n 1/2). Therefore, instead of specializing Theorem 6 to radial kernels, we obtain the following result for radial kernels by using a refined analysis which yields a minimax rate of Ω(n 1/2) that matches the upper bound of Proposition A.1 up to constant factors that depend only on the shape of Borel measure ν. In particular, when specialized to the Gaussian kernel, the result matches the upper bound up to a constant factor independent of d. Theorem 8 (Radial kernels) Let k be radial on Rd, i.e., k(x, y) = R 0 e t x y 2 2 dν(t), where ν Mb +([0, )) and P be the set of distributions over Rd whose densities are continuously infinitely differentiable. Assume that supp(ν) = {0} and there exist 0 < t0 t1 < , 0 < β < such that ν([t0, t1]) β. Then the following holds: inf ˆθn sup P P P n ( ˆθn µk(P) Hk 1 Proof The proof, which is presented in Section 5.4, is based on an application of Le Cam s method involving multiple hypotheses (see Theorem B.2), where we use exponential (in d) number of Gaussian distributions with variances decaying as 1 Remark (Non-trivial lower bound as d ) The proof of Theorem 8 is based on Gaussian distributions with variances decaying as 1/d. 
As d , it is obvious the densities of these distributions do not have uniformly bounded Lipschitz constants, i.e., they are arbitrarily peaky . Hence, if we choose P to be class of distributions with infinitely differentiable densities that have uniformly bounded Lipschitz constants, then as d , the densities considered in the proof of Theorem 8 do not belong to P. On the other hand, the densities considered in the proof of Theorem 6 still belong to P but yielding an uninteresting result since cψ 0 when d . Therefore, it is an open question whether a non-trivial lower bound can be obtained for radial kernels (or any other translation invariant kernels) if we choose P to contain only distributions with densities having uniformly bounded Lipschitz constants. Remark (Alternative Proof) For completeness, we also present an alternative proof of Theorem 8 in Appendix E. It is based on Proposition 3, which holds for any translation invariant kernel. As a result, this proof leads to slightly worse constants compared to Theorem 8 (where we used an analysis specific to radial kernels), as well as a superfluous condition on the minimal sample size n. In Appendix D.2, we compute the positive constant Bk := βt0 t1 that appears in the lower bound in Theorem 8 in a closed form for Gaussian, mixture of Gaussian, inverse multiquadric and Mat ern kernels. Minimax Estimation of Kernel Mean Embeddings 4. Minimax Estimation of µP in the L2(Rd) Norm So far, we have discussed the minimax estimation of the kernel mean embedding (KME) in the RKHS norm. In this section, we investigate the minimax estimation of KME in L2(Rd) norm. The reason for this investigation is as follows. Let k(x, y) = ψ(x y), x, y Rd, where ψ L1(Rd) C(Rd) is strictly positive definite. The corresponding RKHS is given by (see Wendland, 2005, Theorem 10.12) f L2(Rd) C(Rd) : Z Rd |f (ω)|2 which is endowed with the inner product f, g Hk = R Rd f (ω)g (ω) ψ (ω) dω with f being the Fourier transform of f in the L2-sense. It follows from (10) that for any f Hk, f 2 L2(Rd) ( ) = f 2 L2(Rd) = Z f (ω) 2 dω = Z Rd |f (ω)|2 ψ (ω) ψ (ω) dω ( ) ψ f 2 Hk (11) where ( ) follows from Plancherel theorem (Wendland, 2005, Corollary 5.25), f Hk is defined in (10), ( ) follows from H older s inequality, and ( ) holds since ψ C0(Rd) (by Riemann-Lebesgue lemma, Folland, 1999, Theorem 8.22). Note that ψ is non-negative (Wendland, 2005, Theorem 6.11) and so the inequality in ( ) is valid. It therefore follows from (11) that Hk is continuously included in L2(Rd) and L2(Rd) is a weaker norm than Hk.1 This means it is possible that the minimax rate of estimating µP in L2(Rd) could be faster than its RKHS counterpart with the rate possibly depending on the smoothness of k. Hence, it is of interest to analyze the minimax rates of estimating µP in L2(Rd). Interestingly, we show in this section that the minimax rate in the L2 setting is still n 1/2. The analysis in the L2 setting follows ideas similar to those of the RKHS setting wherein, first, in Section 4.1, we consider the minimax rate of estimating µP for translation invariant and radial kernels when P is the set of all Borel discrete probability measures on Rd (see Theorem 9 and Corollary 10). Next, in Section 4.2, we choose P to be the set of all probability distributions that have infinitely continuously differentiable densities and study the question of minimax rates for translation invariant (see Theorem 12) and radial kernels (see Theorem 13). 
For both these choices of P, we show that the rate is n 1/2 irrespective of the smoothness of k. Exploiting the injectivity of mean embedding for characteristic kernels (see the paragraph below and the paragraph around (6)), these results are derived using Le Cam s method (see Theorems B.1 and B.2). Combined with Proposition A.1 (also see Remark A.3), these results show that the empirical estimator, µPn is minimax optimal. Finally, in Section 4.3 we discuss the relation between our results and some classical results of nonparametric density estimation, particularly, those of the kernel density estimator. 1. The continuous inclusion of Hk in L2(Rd) is known for Gaussian kernels on Rd (e.g., see Vert and Vert, 2006, Lemma 11). Similar result is classical for Sobolev spaces in general (e.g., see Folland, 1999, Section 9.3, p. 302) and particularly for those induced by Mat ern kernels. Steinwart and Christmann (2008, Theorem 4.26) provides a general result for continuous inclusion of Hk in L2(µ) assuming R k(x, x) dµ(x) < where µ is a σ-finite measure. However, the result does not hold for translation invariant kernels on Rd as the integrability condition is violated. Tolstikhin, Sriperumbudur, and Muandet Before we proceed to the main results of this section, we briefly discuss the difference between estimation in RKHS and L2(Rd) norms. Suppose k(x, y) = ψ(x y), x, y Rd where ψ L1(Rd) L2(Rd) C(Rd) is positive definite and characteristic. It is easy to verify that µP L1(Rd) L2(Rd). Since µP = ψ P, (10) implies µP 2 Hk = Z Rd |(ψ P) |2 ψ (ω) dω = Z Rd |φP (ω)|2ψ (ω) dω = φP 2 L2(Rd,ψ ) (12) µP 2 L2(Rd) ( ) = Z Rd |µ P (ω)|2 dω = Z Rd |φP (ω)|2(ψ )2(ω) dω = φP 2 L2(Rd,(ψ )2), (13) where φP (ω) := R e iωT x d P(x) is the characteristic function of P and ( ) follows from Plancherel s theorem. It follows from (12) and (13) that the RKHS norm emphasizes the high frequencies of φP compared to that of the L2-norm. Since ψ is characteristic, i.e., P 7 µk(P) Hk is injective, which is guaranteed if and only if supp(ψ ) = Rd (Sriperumbudur et al., 2010, Theorem 9), it follows from (13) that P 7 µk(P) L1(Rd) L2(Rd) is injective. Therefore (2) can be equivalently written as (6) by replacing Hk with L2(Rd) (see the discussion around (6)) and we obtain minimax rates by employing Le Cam s method as we did in the previous section. 4.1 Lower Bounds for Discrete Probability Measures The following result (proved in Section 5.5) for translation invariant kernels is based on an application of Le Cam s method involving two hypotheses (see Theorem B.1), where we choose them to be KMEs of discrete measures, both supported on the same pair of points separated by a vector z in Rd. Theorem 9 (Translation invariant kernels) Let P be the set of all Borel discrete probability measures on Rd. Suppose k(x, y) = ψ(x y), x, y Rd where ψ L2(Rd) C(Rd) is positive definite and k is characteristic. Define Cψ z := 2 ψ 2 L2(Rd) Z Rd ψ(y)ψ(y + z)dy (14) for some z Rd \ {0}. Then Cψ z > 0 and inf ˆθn sup P P P n ˆθn µk(P) L2(Rd) 1 Using Cauchy-Schwartz inequality, the constant Cψ z in Theorem 9 can be shown (see the proof of Lemma 15 in Section 5.5) to be positive for every z Rd\{0} if k is characteristic, i.e., supp(Λψ) = Rd (see (4) for Λψ). The following result (proved in Section 5.6) specializes Theorem 9 to radial kernels. 
Minimax Estimation of Kernel Mean Embeddings Corollary 10 (Radial kernels) Let P be the set of all Borel discrete probability measures on Rd and k be radial on Rd, i.e., k(x, y) = ψν(x y) := R 0 e t x y 2 2dν(t), where ν Mb +([0, )) such that supp(ν) = {0} and 0 t d/2dν(t) < . (15) Assume that there exist 0 < δ0 δ1 < and β > 0 such that ν([δ0, δ1]) β. Then the following holds: inf ˆθn sup P P P n ˆθn µk(P) L2(Rd) β In Corollary 10, since supp(ν) = {0}, the assumption of ν([δ0, δ1]) β is always satisfied. In addition, the condition (15) on ν is satisfied by Gaussian, mixture of Gaussians, inverse multiquadric (while (15) is satisfied for γ > d/2, the result in Corollary 10 holds for γ > d/4) and Mat ern kernels refer to Remark A.3 for more details. Also, for these examples of kernels, the positive constant Ak := β2δ d/2 1 in the lower bound in Corollary 10 can be computed in a closed form (see Appendix D.3 for details). 4.2 Lower Bounds for Probability Measures with Smooth Densities Next, as we did in Section 3.2, we choose P to be the set of all probability measures that have infinitely continuously differentiable densities and show that the minimax rate of estimating µP in L2-norm for translation invariant (see Theorem 12) and radial kernels (see Theorem 13) is n 1/2. The proof of these results are again based on an application of Le Cam s method involving two (see Theorem B.1) and multiple hypotheses (see Theorem B.2), where these hypotheses are chosen to be embeddings of the d-dimensional Gaussian distributions. As in Section 3.2, the results of this section are based on the following result (proved in Section 5.7), which is conceptually similar to that of Proposition 3. Proposition 11 Let σ > 0. Suppose k(x, y) = ψ(x y), where ψ L1(Rd) Cb(Rd) is positive definite and k is characteristic. Then there exist constants ϵψ,σ2, cψ,σ2 > 0 depending only on ψ and σ2, such that the following condition holds for any a Rd with a 2 2 ϵψ,σ2: cψ,σ2 min ez Sd 1 2 Z Rd e σ2 w 2 2 ez, w 2 cos ( a, w ) ψ (w) 2dw < , (16) where Sd 1 is a unit sphere in Rd. Moreover, for all vectors µ0, µ1 Rd satisfying µ0 µ1 2 2 ϵψ,σ2, the following holds: θ0 θ1 L2(Rd) where θ0 and θ1 are KMEs of the Gaussian measures G(µ0, σ2I) and G(µ1, σ2I) respectively. Tolstikhin, Sriperumbudur, and Muandet The following result for translation invariant kernels is established using the above result wherein the proof is exactly the same as that of Theorem 6 except for an application of Proposition 11 in place of Proposition 3. Theorem 12 (Translation invariant kernels) Let P be the set of distributions over Rd whose densities are continuously infinitely differentiable. Suppose k(x, y) = ψ(x y), where ψ L1(Rd) Cb(Rd) is positive definite and k is characteristic. Define cψ := cψ,1 and ϵψ := ϵψ,1 where cψ,1 and ϵψ,1 are positive constants that satisfy (16) in Proposition 11. Then for any n 1 ϵψ , the following holds: inf ˆθn sup P P P n ˆθn µk(P) L2(Rd) 1 As discussed in Remark 7, it is possible to remove the requirement of minimal sample size in Theorem 12. Also, as discussed in Remark 5 and in the paragraph following Remark 7, the constants cψ and ϵψ appearing in the bound in Theorem 12 are not only difficult to compute but also may depend on the dimensionality d in a sup-optimal manner, particularly as d . 
Therefore, similar to what was done in Section 3.2, we will not specialize Theorem 12 to radial kernels but instead present the following result (proved in Section 5.8 and the proof closely follows that of Theorem 8), which is based on a direct analysis involving the properties of radial kernels. For the particular case of a Gaussian kernel, this lower bound matches the upper bound of Proposition A.1 (also see Remark A.3) up to a constant factor independent of d. Theorem 13 (Radial kernels) Let k be radial on Rd, i.e., k(x, y) = R 0 e t x y 2 2 dν(t), where ν Mb +([0, )) and P be the set of distributions over Rd whose densities are continuously infinitely differentiable. Assume that (15) holds, supp(ν) = {0} and there exist 0 < δ0 δ1 < , 0 < β < such that ν([δ0, δ1]) β. Then the following holds: inf ˆθn sup P P P n ˆθn µk(P) L2(Rd) 1 The constant Bk := β2δ0δ d+2 2 1 in the lower bound in the above result can be computed in a closed form for Gaussian, mixture of Gaussian, inverse multiquadric, and Mat ern kernels (see Appendix D.4 for details). The factor (π/2)d/4 can be eliminated from the lower bound by considering a rescaled kernel (π/2) d/4ψ(x y). Nevertheless, the bound will still depend on d exponentially as captured by the constant Bk. This can be further overcome by using the normalized kernel k(x, y)/ ψ L2(Rd). In the particular case of normalized Gaussian kernels (πη2) d/2 exp 1 2η2 x y 2 2 this will lead to dimension-free lower bounds. 4.3 Relation to Kernel Density Estimation In this section, we discuss the relation between the estimation of µP and density estimation. The problem of density estimation deals with estimating an unknown density, p based on Minimax Estimation of Kernel Mean Embeddings random samples (Xi)n i=1 drawn i.i.d. from it. One of the popular non-parametric methods for density estimation is kernel density estimation (KDE), where the estimator is of the form (Tsybakov, 2008, Section 1.2) ˆpn(x1, . . . , xd) = 1 n Qd i=1 hi i=1 K Xi,1 x1 h1 , , Xi,d xd Here K : Rd R is the smoothing kernel (this kernel should not be confused with the reproducing kernel k which we used throughout the paper), h1, . . . , hd > 0 are bandwidths, and Xi,j is the j-th coordinate of the i-th sample point. Assuming p L2(Rd), the consistency of ˆpn is usually studied in the sense of mean integrated squared error (MISE) E ˆpn p 2 L2(Rd), which can be decomposed into variance and bias terms as: E ˆpn p 2 L2(Rd) = E ˆpn E[ˆpn] 2 L2(Rd) + p E[ˆpn] 2 L2(Rd). (17) Assume K to be bounded and h1 = = hd = h. Define Kh := h d K( /h). Then for any fixed x Rd, ˆpn(x) = 1 nhd i=1 Kh(Xi x) = Z Rd Kh(z x) d Pn(z) E[ˆpn(x)] = 1 p(z)dz = (Kh p)(x). This shows that ˆpn = µKh(Pn) and E[ˆpn] = µKh(P) where P is the distribution with p as its density w.r.t. the Lebesgue measure and Pn is the empirical measure constructed based on samples (Xi)n i=1 drawn from p. Therefore the results of Section 4 (and more generally of this paper) are about the minimax rates for E[ˆpn]. However, note that Kh need not be positive definite (and therefore need not be the reproducing kernel of some RKHS). On the other hand, K has to be positive, i.e., K(x) 0, x Rd and normalized, i.e., R Rd K(x) dx = 1 to yield an estimator that is a valid density, unlike in kernel mean estimation where k need not be positive nor normalized. 
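The identity $\hat{p}_n = \mu_{K_h}(P_n)$ can be checked numerically. The sketch below (not from the paper; the Gaussian smoothing kernel, bandwidth, sample, and query points are illustrative assumptions) evaluates the kernel density estimate once in the coordinate-wise product form displayed above and once as the empirical mean embedding of the rescaled kernel $K_h := h^{-d} K(\cdot/h)$, and confirms that the two coincide.

```python
# Sketch: the KDE at x equals the empirical mean embedding mu_{K_h}(P_n)(x),
# where K_h(u) = h^{-d} K(u/h) and K is a multivariate Gaussian smoothing kernel.
import numpy as np

rng = np.random.default_rng(1)
d, h, n = 2, 0.5, 1000
X = rng.normal(size=(n, d))                       # i.i.d. sample from some density p
x = np.array([[0.3, -0.7], [1.0, 1.0]])           # query points

def phi(u):                                       # one-dimensional Gaussian kernel
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

# (a) product form of the KDE: (1/(n h^d)) sum_i prod_j phi((X_{i,j} - x_j)/h)
kde = np.array([np.mean(np.prod(phi((X - xi) / h), axis=1)) / h**d for xi in x])

# (b) the same quantity read as mu_{K_h}(P_n)(x) = (1/n) sum_i K_h(X_i - x) with
#     K_h(u) = h^{-d} (2 pi)^{-d/2} exp(-||u||^2 / (2 h^2))
def K_h(u):
    return np.exp(-np.sum(u**2, axis=-1) / (2 * h**2)) / ((2 * np.pi) ** (d / 2) * h**d)

kme = np.array([np.mean(K_h(X - xi)) for xi in x])

print(np.allclose(kde, kme))                      # True: p_hat_n = mu_{K_h}(P_n)
```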
The minimax rate of n 1/2 for estimating E[ˆpn] is achieved by the kernel density estimator ˆpn (which is nothing but the empirical estimator of µKh(P)) as it is known (based on a straightforward generalization of Tsybakov, 2008, Proposition 1.4 for multiple dimensions) that E ˆpn E[ˆpn] 2 L2(Rd) K 2 L2(Rd) nhd , where we assume K L2(Rd). The bandwidth parameter h is immaterial in the estimation of µKh(P) and can be treated as a constant (independent of n) unlike in the problem of estimating p where h should decay to zero at an appropriate rate for the bias p E[ˆpn] L2(Rd) to converge to zero as n . In particular, if p lies in a Sobolev space of smoothness index s, then the bias-squared term in (17) behaves as h2s, which combined with the above bound on the variance yields a rate of n 2s 2s+1 for h = n 1 2s+1 . This rate is known to be minimax optimal for the problem of estimating p while our rates are minimax optimal for the problem of smoothed density estimation where the smoothing is carried out by the kernel. Tolstikhin, Sriperumbudur, and Muandet In this section we present all the missing proofs of results of Sections 3 and 4. 5.1 Proof of Theorem 1 Pick two discrete distributions P0 = p0δx + (1 p0)δv and P1 = p1δx + (1 p1)δv, where x, v Rd, 0 < p0 < 1, 0 < p1 < 1 and δx denotes a Dirac measure supported at x. Define θ0 = µk(P0) and θ1 = µk(P1). Since θ0 2 Hk = R R k(x, y) d P0(x) d P0(y), which follows from the reproducing property of k, it is easy to verify that θ0 θ1 2 Hk = E[k(ξ, ξ )] + E[k(η, η )] 2 E[k(ξ, η)], where ξ and η are random variables distributed according to P0 and P1 respectively, and ξ and η are independent copies of ξ and η. Since k is translation invariant, we have k(v, v) = k(x, x) = ψ(0) and k(x, v) = k(v, x) = ψ(x v), which imply θ0 θ1 2 Hk = 2(p0 p1)2 ψ(0) ψ(x v) . (18) Also note that KL(P0 P1) = p0 log p0 p1 + (1 p0) log 1 p0 = p0 log 1 + p0 p1 + (1 p0) log 1 + p1 p0 + (1 p0) 1 + p1 p0 = log 1 + (p0 p1) p0 where we used Jensen s inequality in ( ) for the logarithmic function, which is concave. Next, using a simple inequality log(1 + x) x, which holds for all x > 1, we get KL(P0 P1) (p0 p1) p0 Note that a maximal value of denominator is achieved when p1 = 1 2. Setting p1 = 1 2 we get the following upper bound: KL(P0 P1) 4 p0 1 2 2 , which when used in the chain rule of KL-divergence yields KL(P n 0 P n 1 ) 4n p0 1 Choosing p0 such that (p0 1 2)2 = 1 9n yields KL(P n 0 P n 1 ) 4 9 and θ0 θ1 2 Hk = 2 9n ψ(0) ψ(x v) . Choose x and v in such a way that x v = z, where z Rd is a point for which ψ(0) ψ(z) β and β > 0. This yields θ0 θ1 2 Hk 2β Minimax Estimation of Kernel Mean Embeddings which shows that the assumptions of Theorem B.1 are satisfied with s := 1 n and α := 4 The result follows from an application of Theorem B.1 by noticing that 1 α/2 2 > 1/4. Remark (Measures with bounded support) It is evident from the above proof that exactly the same lower bound holds if we restrict P to contain only probability measures with bounded support. We can proceed further and assume that for each P P the radius of supp(P) is upper bounded by some positive constant R. In this case the same reasoning will work as long as ψ is not flat on the ball of radius R centered around origin. 5.2 Proof of Corollary 2 The proof is based on application of Theorem 1. Since supp(ν) = {0}, it follows from (Sriperumbudur et al., 2011, Proposition 5) that k is characteristic. We now show that there exist z Rd and β > 0, such that ψν(0) ψν(z) β. 
Note that for any x Rd ψν(0) ψν(x) = Z 1 e t x 2 2 dν(t) Z 1 e t x 2 2 dν(t) 1 e t1 x 2 2 dν(t) = ν([t1, )) 1 e t1 x 2 2 α 1 e t1 x 2 2 αt1 where the last inequality holds whenever x 2 2 1 t1 . Choosing z such that z 2 2 = 1 t1 yields ψν(0) ψν(z) α 2 . The result therefore follows from Theorem 1 by choosing β = α 5.3 Proof of Proposition 3 Before we prove Proposition 3, first we will derive a closed form expression for the RKHS distance between KMEs of two d-dimensional Gaussian distributions with the kernel being translation invariant, i.e., k(x, y) = ψ(x y). Throughout this section Λψ will denote a finite non-negative Borel measure corresponding to the positive-definite function ψ from (4). Lemma 14 Let θ0 and θ1 be KME of Gaussian measures G(µ0, σ2I) and G(µ1, σ2I) for µ0, µ1 Rd and σ2 > 0. Suppose k(x, y) = ψ(x y), where ψ Cb(Rd) is positive definite. Then θ0 θ1 2 Hk = 2 (2π)d/2 Rd e σ2 w 2 2 (1 cos ( µ0 µ1, w )) dΛψ(w). (19) Proof Note that θ0 θ1 2 Hk = θ0 2 Hk + θ1 2 Hk 2 θ0, θ1 Hk, (20) where θ0, θ1 Hk = EXEY [k(X, Y )] with X G(µ0, σ2I) and Y G(µ1, σ2I). We will now derive the closed form for the inner product: θ0, θ1 Hk = Z Rd ψ(x y) 1 (2πσ2)d e 1 2σ2 x µ0 2 2 1 2σ2 y µ1 2 2dxdy Tolstikhin, Sriperumbudur, and Muandet Rd 1 (2π)d/2 Rd e i x y,w dΛψ(w) 1 (2πσ2)d e 1 2σ2 x µ0 2 2 1 2σ2 y µ1 2 2dx dy, where we used (4). The function appearing under the integral is absolutely integrable and so by Tonelli-Fubini theorem (Dudley, 2002, Theorem 4.4.5) we obtain θ0, θ1 Hk = Z Rd ei y,w e 1 2σ2 y µ1 2 2 (2πσ2)d/2(2π)d/2 (2πσ2)d/2 e 1 2σ2 x µ0 2 2dx Rd 1 (2πσ2)d/2 1 (2π)d/2 ei y,w e 1 2σ2 y µ1 2 2e i µ0,w σ2 w 2 2 2 dy dΛψ(w) = 1 (2π)d/2 Rd 1 (2πσ2)d/2 ei y,w e 1 2σ2 y µ1 2 2dy e i µ0,w σ2 w 2 2 2 dΛψ(w) = 1 (2π)d/2 Rd ei µ1,w σ2 w 2 2 2 e i µ0,w σ2 w 2 2 2 dΛψ(w), where we used Lemma C.1 to compute the Fourier transform for a Gaussian density. Using Euler s formula and the fact that Λψ is symmetric according to Lemma C.2, while sin(x) is an odd function, we get θ0, θ1 Hk = 1 (2π)d/2 Rd cos ( µ0 µ1, w ) e σ2 w 2 2 dΛψ(w). (21) The result in (19) follows by using (21) in (20). Proof of Proposition 3: Define a := µ0 µ1 and G(a) := 2 (2π)d/2 Rd e σ2 w 2 2 (1 cos a, w ) dΛψ(w). Note that G(0) = 0. Next, since for any i = 1, . . . , d ai e σ2 w 2 2(1 cos a, w ) = e σ2 w 2 2wi sin a, w e σ2 w 2 2wi L1(Λψ), we can differentiate G under the integral sign (Folland, 1999, Theorem 2.27) and get G(0) = 0. If a function f : Rd R is strongly convex with parameter m > 0 on some set A Rd, then for all x, y A: f(x) f(y) + f(y), x y + m If we can show that G is strongly convex on Bϵ := {b Rd : b 2 2 ϵ} for some ϵ > 0, then we can apply previous inequality with y = 0 and x = a to obtain 2 a 2 2, a Bϵ. It is known that a twice continuously differentiable function f is strongly convex on A Rd with parameter m > 0 if the matrix 2f(x) m I is positive definite for all x A, where Minimax Estimation of Kernel Mean Embeddings 2f is the Hessian and I Rd d is an identity matrix. Next we compute the Hessian of G by once again employing differentiation under the integral sign (justified in the similar way as above) to obtain ai aj = 2 (2π)d/2 Rd e σ2 w 2 2wiwj cos ( a, w ) dΛψ(w), 0 i, j d. 2G(a) = 2 (2π)d/2 Rd e σ2 w 2 2ww T cos ( a, w ) dΛψ(w). In order to prove that G is strongly convex on Bϵ Rd we need to show that 2G(a) m I is positive definite for each a Bϵ and some m > 0. 
In other words, we need to show that there is m > 0 such that for each z Rd \ {0} and a Bϵ the following holds: z, 2G(a)z m z 2 2, or, equivalently, Rd e σ2 w 2 2 ez, w 2e i a,w dΛψ(w) m, (22) where ez := z/ z 2 Rd is a vector of unit length pointed in the direction of z. Note that l.h.s. of (22) is the Fourier transform of a measure Tz on Rd, which is absolutely continuous with respect to Λψ with Radon-Nikodym derivative 2e σ2 w 2 2 ez, w 2. Fix any z Rd. We will first show that we can apply Bochner s Theorem (see (4)) for the measure Tz. For this we need to check that it is (a) non-negative and (b) finite. Part (a) is apparent from the facts that Λψ is non-negative and Tz has a non-negative density with respect to Λψ. To check (b) we write Z Rd d Tz(x) = 2 (2π)d/2 Rd e σ2 w 2 2 ez, w 2dΛψ(w) < , as e σ2 w 2 2 ez, w 2 is positive and bounded for any z Rd, while Λψ is finite. We conclude from Bochner s Theorem, that function ψz(x) defined in the following way: Rd e i x,w d Tz(w), is positive-definite. Moreover it is well known (Dudley, 2002, Theorem 9.4.4) that ψz Cb(Rd) as ψz is the characteristic function of Tz. Finally, it follows from the discussion in (Sriperumbudur et al., 2011, Section 3.3), that if supp(Tz) = Rd, then a bounded and continuous function ψz(x) is strictly positive definite. To check the condition supp(Tz) = Rd we note that supp(Λψ) = Rd since ψ is characteristic (Sriperumbudur et al., 2010, Theorem 9), and no open sets of Rd are contained in the region where e σ2 w 2 2 ez, w 2 = 0. Summarizing, we have established that the l.h.s. of (22) is equal to ψz(a), where ψz : Rd R is a bounded, continuous and strictly positive definite function for each Tolstikhin, Sriperumbudur, and Muandet z Rd \ {0}. In particular, we have ψz(0) > 0 for all z Rd. Note that ψz(0) depends on z only through its direction. Next we want to show that inf z Sd ψz(0) > 0, (23) where the infimum is over the unit sphere Sd := {b Rd : b 2 2 = 1}. Note that the function F : z ψz(0) defined on Sd is continuous. Since Sd is closed and bounded, we know that F attains its minimum on it. In other words, there is z Sd, such that inf z Sd ψz(0) = ψz (0). Thus, if infz Sd ψz(0) = 0, we will also get ψz (0) = 0, which will contradict the fact that ψz(0) > 0 for each z Rd\{0}. This proves (23). Using Lemma C.3 we also conclude that infz Sd ψz : Rd R is a continuous function. Now we may finally conclude that there are constants cψ,σ2, ϵψ,σ2 > 0 such that inf z Sd ψz(a) cψ,σ2 for all a Bϵψ,σ2. Finally, we take m = cψ,σ2 and this concludes the proof. 5.4 Proof of Theorem 8 The proof is based on application of Theorem B.2 where we choose θ0, . . . , θM to be KMEs of d-dimensional Gaussian measures with variances decaying to zero as d . Let G(µ0, σ2I) and G(µ1, σ2I) be two d-dimensional Gaussian distributions with mean vectors µ0, µ1 Rd and variance σ2 > 0. Define θ0 and θ1 to be the embeddings of G(µ0, σ2I) and G(µ1, σ2I) respectively. (A) Deriving a closed form expression for θ0 θ1 2 Hk. Using Lemma 14, presented in Section 5.3, we have θ0 θ1 2 Hk = 2 (2π)d/2 Rd e σ2 w 2 2 (1 cos ( µ0 µ1, w )) dΛψ(w), (24) where Λψ is a finite non-negative Borel measure from the Bochner s Theorem corresponding to the kernel k. We now show that Λψ is absolutely continuous with respect to the Lebesgue measure on Rd and has the following density: 1 (2t)d/2 e w 2 2 4t dν(t), w Rd. 
Indeed, by noticing that Z e i w,x 1 (2t)d/2 e w 2 2 4t dw dν(t) < Minimax Estimation of Kernel Mean Embeddings we may apply Tonelli-Fubini theorem (Dudley, 2002, Theorem 4.4.5) to interchange the order of integration and get 1 (2t)d/2 e w 2 2 4t dν(t) dw = Z (2π)d/2 1 (2t)d/2 e w 2 2 4t dw 0 e t x 2 2dν(t). Substituting the form of λψ into (24) we can write θ0 θ1 2 Hk = 2 (2π)d/2 0 e σ2 w 2 2 (1 cos ( µ0 µ1, w )) 1 (2t)d/2 e w 2 2 4t dν(t) dw. Applying Tonelli-Fubini theorem once again and using Lemma C.1 together with Euler s formula we obtain θ0 θ1 2 Hk = 2 (2π)d/2 Rd e w 2 2 2 (2σ2+ 1 2t) (1 cos ( µ0 µ1, w )) dw dν(t) 2 (4σ2t + 1)d/2 dν(t) Z 2 (4σ2t + 1)d/2 e t µ0 µ1 2 2 4σ2t+1 dν(t) 0 2 1 1 + 4tσ2 d/2 1 exp t µ0 µ1 2 2 1 + 4tσ2 dν(t). (25) (B) Lower bounding θ0 θ1 2 Hk in terms of µ0 µ1 2 2. It follows from (25) that θ0 θ1 2 Hk Z t1 t0 2 1 1 + 4tσ2 d/2 1 exp t µ0 µ1 2 2 1 + 4tσ2 where 0 < t0 t1 < . Note that 1 e x x 2 for 0 x 1. Using this we get 1 exp t µ0 µ1 2 2 1 + 4tσ2 t µ0 µ1 2 2 2(1 + 4tσ2) , t [t0, t1] as long as t1 µ0 µ1 2 2 1 + 4t1σ2. (26) Thus, as long as (26) holds, we can lower bound the RKHS distance as: θ0 θ1 2 Hk Z t1 d/2 t µ0 µ1 2 2 1 + 4tσ2 dν(t) = µ0 µ1 2 2 t (1 + 4tσ2)(d+2)/2 dν(t). (27) Tolstikhin, Sriperumbudur, and Muandet Note that the function t 7 t (1+4tσ2)(d+2)/2 monotonically increases on [0, 1 2dσ2 ], reaches its global maximum at t = 1 2dσ2 and then decreases on [ 1 2dσ2 , ). Thus we have t (1 + 4tσ2)(d+2)/2 dν(t) β min t0 (1 + 4t0σ2)(d+2)/2 , t1 (1 + 4t1σ2)(d+2)/2 σ2 = 1 2t1d (28) yields that t = t1 is the global maximum of the function t 7 t 1+ 2t t1d (d+2)/2 , in which case t0 (1 + 4t0σ2)(d+2)/2 t1 (1 + 4t1σ2)(d+2)/2 . With this choice of σ2 we get Z t1 t (1 + 4tσ2)(d+2)/2 dν(t) βt0 1 + 2t0 t1d (d+2)/2 βt0 1 + 2 (d+2)/2 βt0 where we used the fact that (1 1 x)x 1 monotonically decreases to 1 e. Using (29) in (27), we obtain θ0 θ1 2 Hk βt0 µ0 µ1 2 2. (30) (C.1) Application of Theorem B.2: Choosing θ0, . . . , θM. Now we are going to apply Theorem B.2. First of all, we need to choose M + 1 embeddings. Recall that Theorem B.2 requires these embeddings to be sufficiently distant from each other, while the corresponding distributions should be close. We will choose the embeddings {θ0, . . . , θM} to be KMEs of Gaussian distributions G(µi, σ2I) for specific choice of σ2 > 0 and µi Rd, i = 0, . . . , M. Mean vectors {µi}M i=0 will be constrained to live in the ball B(cν, n) := x Rd : x 2 2 cν/n , where cν is a positive constant to be specified later. This guarantees that KL-divergences between the Gaussian distributions will remain small. At the same time, it was shown in (30) that the RKHS distance between embeddings θi and θj is lower bounded by the Euclidean distance between µi and µj. In other words, in order for the embeddings θ0, . . . , θM to be sufficiently separated we need to make sure that the mean vectors µ0, . . . , µM are not too close to each other. Summarizing, we face the problem of choosing a finite collection of pairwise distant points in the Euclidean ball. This question is closely related to the concepts of packing and covering numbers. For any set A Rd its ϵ-packing number M(A, ϵ) is the largest number of points in A separated from each other by at least a distance of ϵ. An ϵ-covering number N(A, ϵ) of A Minimax Estimation of Kernel Mean Embeddings is the minimal number of balls of radius ϵ needed to cover A. Packing numbers are lower bounded by the covering numbers (Dudley, 1999, Theorem 1.2.1): N(A, ϵ) M(A, ϵ). 
Also, it is well known that the ϵ-covering number of a unit d-dimensional Euclidean ball is lower bounded by ϵ d . Together, these facts state that we can find at least ϵ d points in the d-dimensional unit ball, which are at least ϵ away from each other. Similarly (just by a simple scaling) we can find at least ϵ d points in the d-dimensional ball of radius R > 0, which are at least R ϵ away from each other. Applying this fact to B(cν, n) we can finally argue that there are at least Nd points in B(cν, n) which are at least N 1p cν/n away from each other, where N 3 is an integer to be specified later. Now, take M = Nd 1 2 (which explains the lower bound N 3) and fix µ0, . . . , µM to be these M + 1 points. (C.2) Application of Theorem B.2: Lower bounding θi θj Hk. With this choice of parameters µ0, . . . , µM, for any 0 i < j M, we have θi θj 2 Hk cν N2n βt0 where we used (30) and the lower bound on µi µj 2. Setting for some C > 0 we obtain θi θj 2 Hk C N2en This satisfies the first assumption of Theorem B.2 with s := 1 2N C en 1 2 2+d . (C.3) Application of Theorem B.2: Upper bounding KL(Pθi Pθj). Note that for any 0 i < j M we have KL Gn(µi, σ2I) Gn(µj, σ2I) = n µi µj 2 2 2σ2 2cν σ2 = 4C t1d where the inequality holds since µi B(cν, n) and the equality follows from (28) and (31). Here we used the fact that for any points x and y contained in a ball of radius R we obviously have x y 2R. Also note that we chose M = Nd 1 2 and thus log(M) = d log(N) + log(1 N d) d log(N) + 1 Nd Nd 1 d log(N) 1 N 1 Tolstikhin, Sriperumbudur, and Muandet for d 1, where we used the inequality log(x) 1 1 x which holds for x 0. Taking log N 1 N 1 d log N d N 1 d log N 1 N 1 for any d 1. Concluding, we get KL Gn(µi, σ2I) Gn(µj σ2I) 1 and thus the second assumption of Theorem B.2 is satisfied with α = 1 8. Finally, it is easy to check that if we take N = 5 then condition (26) will be satisfied. Indeed, t1 µ0 µ1 2 2 4t1cν βt0n βt0 32t1 log N 1 N 1 log N 1 N 1 while 1 + 4t1σ2 1. Thus (26) holds whenever log N 1 N 1 8n, which obviously holds for N = 5 and any n 1. To conclude the proof we insert (32) into s := 1 2N C en 1 2 2+d , lower bound this value 8 log N 1 N 1 4 25, and notice that 5.5 Proof of Theorem 9 The proof is based on the following result, which gives a closed form expression for the L2(Rd) distance between embeddings of two discrete distributions supported on the same pair of points in Rd. Lemma 15 Suppose k(x, y) = ψ(x y), where ψ L2(Rd) Cb(Rd) is positive definite and k is characteristic. Define P0 = p0δx + (1 p0)δv and P1 = p1δx + (1 p1)δv, where 0 < p0 < 1, 0 < p1 < 1, x, v Rd, and x = v. Then µk(P0) µk(P1) 2 L2(Rd) = Cψ x,v(p0 p1)2, Cψ x,v := 2 ψ 2 L2(Rd) Z Rd ψ(y)ψ(y + x v)dy > 0. Minimax Estimation of Kernel Mean Embeddings Proof We have µk(P0) µk(P1) 2 L2(Rd) = (p0 p1)2 k(x, ) k(v, ) 2 L2(Rd) = (p0 p1)2 Z ψ(x y) ψ(v y) 2dy ( ) = (p0 p1)2 Z ψ(y) ψ(y + x v) 2dy = 2(p0 p1)2 Z Rd ψ(y) ψ(y) ψ(y + x v) dy = 2(p0 p1)2 ψ 2 L2(Rd) Z Rd ψ(y)ψ(y + x v)dy , where we used the symmetry of ψ in ( ). Cauchy-Schwartz inequality states that Rd ψ(y)ψ(y + x v)dy ( ) Rd ψ2(y + x v)dy = ψ 2 L2(Rd), where the equality in ( ) holds if and only if ψ(y) = λ ψ(y + x v) for some constant λ and all y Rd. If we take y = v x the above condition implies ψ(v x) = λ ψ(0) and with y = 0 we get ψ(0) = λ ψ(x v). Together these identities show that ψ(0) = λ2 ψ(0). Since the kernel is characteristic and translation invariant, ψ is strictly positive definite, which means ψ(0) > 0. We conclude that λ = 1. Assume that λ = 1. 
In this case ψ(x v) = ψ(0) < 0. Repeating the argument we can show that ψ 2(x v) = ψ(0) and generally ψ m(x v) = ( 1)mψ(0) for all m N. Since ψ L2(Rd) we need ψ2 to be integrable on Rd. Summarizing, we showed that a non-negative, integrable, and continuous function takes the same strictly positive value ψ2(0) > 0 infinitely many times, leading to a contradiction. Arguing similarly for λ = 1 will result in a contradiction. This means the equality in ( ) is never attained which concludes the proof. The proof of Theorem 9 is carried out by simply repeating the proof of Theorem 1 but replacing (18) with the result in Lemma 15 and using x v := z. 5.6 Proof of Corollary 10 The proof will be based on Theorem 9. The moment condition (15) on ν is sufficient for ψν L2(Rd) to hold (see Remark A.3). Thus we only need to compute the expression ψν 2 L2(Rd) R Rd ψν(y)ψν(y + z)dy appearing in (14). Note that Z Rd ψν(y)ψν(y + z)dy = Z 0 e t1 y 2 2 t2 y+z 2 2dν(t1)dν(t2)dy. (33) Rd e t1 y 2 2 t2 y+z 2 2dy dν(t1)dν(t2) = Z d/2 e t1t2 z 2 2 t1+t2 dν(t1)dν(t2) d/2 dν(t1) < , Tolstikhin, Sriperumbudur, and Muandet we may apply Tonelli-Fubini theorem to switch the order of integration in (33) and get Rd ψν(y)ψν(y + z)dy = Z d/2 e t1t2 z 2 2 t1+t2 dν(t1)dν(t2). Using this we get ψν 2 L2(Rd) Z Rd ψν(y)ψν(y + z)dy = Z 1 e t1t2 z 2 2 t1+t2 dν(t1)dν(t2) 1 e t1t2 z 2 2 t1+t2 dν(t1)dν(t2). Since t1 and t2 are bounded below by δ0 > 0 we may take z 2 large enough so that the following will hold: ψν 2 L2(Rd) Z Rd ψν(y)ψν(y + z)dy β2 5.7 Proof of Proposition 11 The proof is based on the following result, which provides a closed form expression for the L2(Rd) distance between the embeddings of Gaussian measures. Lemma 16 Let θ0 and θ1 be KME of Gaussian measures G(µ0, σ2I) and G(µ1, σ2I) for µ0, µ1 Rd and σ2 > 0. Suppose k(x, y) = ψ(x y), where ψ L1(Rd) Cb(Rd) is positive definite and and k is characteristic. Then θ0 θ1 2 L2(Rd) = 2 Z 1 e i w,µ0 µ1 ψ (w) 2e σ2 w 2dw. (34) Proof First of all, note that ψ L2(Rd) since ψ L1(Rd) and ψ is bounded. This shows that θ0, θ1 L2(Rd). We will use P0 and P1 to denote the corresponding Gaussian distributions G(µ0, σ2I) and G(µ1, σ2I). By definition we have θ0, θ1 L2(Rd) = Z Rd k(x, y) d P0(x) Z Rd k(z, y) d P1(z) dy Rd k(x, y)k(z, y) d P0(x) d P1(z) dy. Using the fact that ψ is bounded (Wendland, 2005, Theorem 6.2) we get Z k(x, y)k(z, y) dy d P0(x) d P1(z) ψ(0) Z Z Z k(x, y) dy d P0(x) d P1(z) = ψ(0) Z Z Z ψ(x y) dy d P0(x)d P1(z) = ψ(0) ψ L1(Rd) Rd d P0(x)d P1(z) < . Minimax Estimation of Kernel Mean Embeddings This allows us to use Tonelli-Fubini theorem (Dudley, 2002, Theorem 4.4.5) and get θ0, θ1 L2(Rd) = Z Rd k(x, y)k(z, y)dy d P0(x)d P1(z) Rd ψ(y)ψ(y + z x)dy d P0(x)d P1(z) ( ) = 1 (2π)d/2 Rd e i y+z x,w dΛψ(w)dy d P0(x)d P1(z) = 1 (2π)d/2 Rd ψ(y)e i y+z x,w dΛψ(w)dy d P0(x)d P1(z), where we used (4) in ( ). Since ψ L1(Rd) we have Z e i y,w ψ(y)e i z x,w dy dΛψ(w) = Z Rd |ψ(y)| dy dΛψ(w) < and thus we can use Tonelli-Fubini theorem to switch the order of integration: θ0, θ1 L2(Rd) = 1 (2π)d/2 Rd ψ(y)e i y+z x,w dy dΛψ(w) d P0(x)d P1(z) Rd ψ (w)e i z x,w dΛψ(w) d P0(x)d P1(z). (35) Next we are going to argue that if both ψ and ψ belong to L1(Rd) (the latter is true as it follows from Wendland, 2005, Corollary 6.12) then ψ is the Radon-Nikodym derivative of Λψ with respect to the Lebesgue measure. To this end, since ψ L1(Rd), Fourier inversion theorem (Wendland, 2005, Corollary 5.24) yields that for all x Rd, the following holds: ψ(x) = 1 (2π)d/2 Rd ei w,x ψ (w)dw. 
On the other hand, using (4) and Lemma C.2, we also have ψ(x) = 1 (2π)d/2 Rd ei w,x dΛψ(w). These two identities show that for all x Rd Z Rd ei w,x ψ (w)dw = Z Rd ei w,x dΛψ(w). (36) Note that since k is translation invariant and characteristic, ψ is a strictly positive definite function (Sriperumbudur et al., 2010, Section 3.4) and therefore it follows from (Wendland, 2005, Theorem 6.11) that ψ is non-negative (and nonvanishing). Since ψ L1(Rd) we conclude that ψ is the Radon-Nikodym derivative of a finite non-negative measure Tψ on Rd, which is absolutely continuous with respect to the Lebesgue measure. (36) (after proper normalization) shows that the characteristic functions of measures Λψ and Tψ coincide. We finally conclude from (Dudley, 2002, Theorem 9.5.1) that Λψ = Tψ, which means that Λψ is absolutely continuous with respect to the Lebesgue measure and has a density ψ . Tolstikhin, Sriperumbudur, and Muandet Returning to (35) we can write it as θ0, θ1 L2(Rd) = Z ψ (w) 2e i z x,w dw d P0(x)d P1(z). We already showed that ψ L2(Rd). From Plancherel s theorem (Wendland, 2005, Corollary 5.25), we have ψ L2(Rd) = ψ L2(Rd) and thus ψ L2(Rd). Another application of Tonelli-Fubini theorem yields θ0, θ1 L2(Rd) = Z Rd e i z x,w d P0(x)d P1(z) dw ψ (w) 2ei w,µ0 µ1 e σ2 w 2 2dw. Noticing that the Fourier transform of a real and even function is also even, we conclude that ψ is also even. This finishes the proof since θ0 θ1 2 L2(Rd) = θ1 2 L2(Rd)+ θ0 2 L2(Rd) 2 θ0, θ1 L2(Rd). Now we turn to the proof of Theorem 11. We will write Λψ to denote a non-negative finite measure, absolutely continuous with respect to the Lebesgue measure with density (2π)d/2(ψ )2. Then (34) in Lemma 16 can be written as θ0 θ1 2 L2(Rd) = 2 (2π)d/2 1 ei w,µ0 µ1 e σ2 w 2 2d Λψ(w). = 2 (2π)d/2 Rd (1 cos( w, µ0 µ1 )) e σ2 w 2 2d Λψ(w), which is exactly of the form in Lemma 14 but with Λψ replaced by Λψ. From the proof of Lemma 16, since ψ is even and non-vanishing, the corresponding measure Λψ is symmetric and supp( Λψ) = Rd. The result therefore follows by carrying out the proof of Proposition 3 verbatim but for replacing Λψ with Λψ. 5.8 Proof of Theorem 13 The proof will closely follow that of Theorem 8. Let G(µ0, σ2I) and G(µ1, σ2I) be two d-dimensional Gaussian distributions with mean vectors µ0, µ1 Rd and variance σ2 > 0. Let θ0 and θ1 denote the kernel mean embeddings of G(µ0, σ2I) and G(µ1, σ2I) respectively. (A) Deriving a closed form expression for θ0 θ1 2 L2(Rd). The condition in (15) ensures that ψν L2(Rd) (see Remark A.3). In fact, using a similar argument it can be shown that ψν L1(Rd). Also it is easy to verify that ψν Cb(Rd). Next, under this moment condition we may apply Tonelli-Fubini theorem to compute the Fourier transform of ψν: ψ ν (w) = 1 (2π)d/2 Rd e i w,x Z 0 e t x 2 2dν(t)dx = Z 1 (2t)d/2 e w 2 2 4t dν(t). (37) Minimax Estimation of Kernel Mean Embeddings It is immediate to see that ψ ν L1(Rd). Therefore Lemma 16 yields θ0 θ1 2 L2(Rd) = 2 Z 1 e i w,µ0 µ1 ψ ν (w) 2 e σ2 w 2 2dw. (38) Denoting G(w) := ψ ν (w)e σ2 w 2 2 2 and using a well-known property of the Fourier transform we get (G2) (τ) = 1 (2π)d/2 Rd e i w,τ G2(w)dw = 1 (2π)d/2 Rd G (x)G (τ x)dx. (39) Next we compute the Fourier transform of G using (37): G (x) = 1 (2π)d/2 Rd e i w,x Z 1 (2t)d/2 e w 2 2 4t dν(t) e σ2 w 2 2 2 dw. Using the moment condition on ν we have Rd e x 2 2 4t e σ2 x 2 2 2 dxdν(t) = 1 (2σ2)d/2 0 t d/2dν(t) < . 
(40) This allows us to use Tonelli-Fubini theorem and write G (x) = 1 (2π)d/2 Rd e i w,x e w 2 2 4t e σ2 w 2 2 2 dw dν(t) 1 (2tσ2 + 1)d/2 exp t x 2 2 2tσ2 + 1 Returning to (39) and denoting 1 := 2t1σ2 + 1, 2 := 2t2σ2 + 1 we obtain (G2) (τ) = Z 1 (2π 1 2)d/2 exp t1 x 2 2 1 t2 τ x 2 2 2 dν(t1) dν(t2) dx. Using a simple identity a x 2 2 + b x y 2 2 = (a + b) x b a + by 2 + ab a + b y 2 2, which holds for any x, y Rd and a, b R with a + b = 0, we obtain (G2) (τ) = Z 1 (2π 1 2)d/2 exp x t2 1τ t1 2 + t2 1 exp t1t2 τ 2 2 t1 2 + t2 1 dν(t1) dν(t2) dx. Tolstikhin, Sriperumbudur, and Muandet Using an argument similar to (40) we can show that Tonelli-Fubini theorem is applicable to the r.h.s. of the above equation. Therefore, changing the order of integration we get (G2) (τ) = Z 1 2(t1 2 + t2 1) d/2 exp t1t2 τ 2 2 t1 2 + t2 1 dν(t1) dν(t2). Noticing that (G2) (0) = 1 (2π)d/2 R Rd G2(w) dw and returning to (38) we get θ0 θ1 2 L2(Rd) = 2(2π)d/2 (G2) (0) (G2) (µ0 µ1) π t1 2 + t2 1 d/2 1 exp t1t2 µ0 µ1 2 2 t1 2 + t2 1 dν(t1)dν(t2). (B) Lower bounding θ0 θ1 2 L2(Rd) in terms of µ0 µ1 2 2. θ0 θ1 2 L2(Rd) π t1 2 + t2 1 d/2 1 exp t1t2 µ0 µ1 2 2 t1 2 + t2 1 dν(t1)dν(t2) π 4t1t2σ2 + t1 + t2 d/2 1 exp t1t2 µ0 µ1 2 2 4t1t2σ2 + t1 + t2 dν(t1)dν(t2). Using the fact that 1 e x x 2 for 0 x 1, we obtain θ0 θ1 2 L2(Rd) Z δ1 π 4t1t2σ2 + t1 + t2 d/2 t1t2 µ0 µ1 2 2 4t1t2σ2 + t1 + t2 dν(t1) dν(t2) (41) whenever t1t2 µ0 µ1 2 2 4t1t2σ2 + t1 + t2 1. Note that the expression on the left hand side of the previous inequality is increasing both in t1 and t2. This means that for t1, t2 [δ0, δ1] we have: t1t2 µ0 µ1 2 2 4t1t2σ2 + t1 + t2 δ1 µ0 µ1 2 2 4δ1σ2 + 2 and thus (41) holds whenever δ1 µ0 µ1 2 2 4δ1σ2 + 2 which will be satisfied later. (41) can be rewritten as θ0 θ1 2 L2(Rd) Z δ1 (π)d/2t1t2 µ0 µ1 2 2 (4t1t2σ2 + t1 + t2)d/2+1 dν(t1) dν(t2) t1t2 t1+t2 µ0 µ1 2 2 4 t1t2 t1+t2 σ2 + 1 d/2+1 dν(t1) dν(t2) S(t1, t2) µ0 µ1 2 2 4S(t1, t2)σ2 + 1 d/2+1 dν(t1) dν(t2), (42) Minimax Estimation of Kernel Mean Embeddings where S(t1, t2) := t1t2 t1+t2 . Note that S(t1, t2) takes values in [δ0 2 ] as t1 and t2 varies in [δ0, δ1]. We can now repeat part of the proof of Theorem 8 where we showed that the function t 7 t (1+4tσ2)(d+2)/2 monotonically increases on [0, 1 2dσ2 ], reaches its global maximum at t = 1 2dσ2 , and then decreases on [ 1 2dσ2 , ). Using this fact we have S(t1, t2) 4S(t1, t2)σ2 + 1 d 2 +1 dν(t1)dν(t2) β2 ( δ0 (1 + 2δ0σ2) d 2 +1 , δ1 (1 + 2δ1σ2) d 2 +1 By setting σ2 := 1 δ1d we ensure that t = δ1 2 is the global maximum of the function t 7 t 1+ 4t 2 +1 and thus δ0 (1+2δ0σ2) d 2 +1 δ1 (1+2δ1σ2) d 2 +1 . Combining this with (42) we have θ0 θ1 2 L2(Rd) β2δ0 d/2 µ0 µ1 2 2 β2δ0 2(1 + 2 d/2 µ0 µ1 2 2 β2δ0 d/2 µ0 µ1 2 2, where we used an analysis similar to (29). (C) Application of Theorem B.2. We finish by repeating all the remaining steps carried out in the proof of Theorem 8 (steps C.1, C.2, and C.3), where we set σ2 := 1 δ1d and cν := 1 16δ1 log N 1 N 1 Acknowledgements The authors thank the action editor and two anonymous reviewers for their detailed comments, which helped to improve the presentation. The authors would also like to thank David Lopez-Paz, Jonas Peters, Bernhard Sch olkopf, and Carl-Johann Simon-Gabriel for useful discussions. Appendix A. n-consistency of µk(Pn) In the following, we present a general result whose special cases establishes the convergence rate of n 1/2 for µk(Pn) µk(P) F when F = Hk and F = L2(Rd). Proposition A.1 Let (Xi)n i=1 be random samples drawn i.i.d. 
from P defined on a separable topological space X. Suppose r : X H is continuous and sup x X r(x) 2 H Ck < , (43) Tolstikhin, Sriperumbudur, and Muandet where H is a separable Hilbert space of real-valued functions. Then for any 0 < δ 1 with probability at least 1 δ we have X r(x) d Pn(x) Z X r(x) d P(x) H 2Ck log(1/δ) Proof Note that r : X H is a H-valued measurable function as r is continuous and H is separable (Steinwart and Christmann, 2008, Lemma A.5.18). The condition in (43) ensures that R r(x) Hd Q(x) Ck < for any Q M1 +(X) and therefore R r(x) d Q(x) is well defined as a Bochner integral for any Q M1 +(X) (Diestel and Uhl, 1977, Theorem 2, p.45). By Mc Diarmid s inequality, it is easy to verify that with probability at least 1 δ, X r(x) d Pn(x) Z X r(x) d P(x) H E X r(x) d Pn(x) Z X r(x) d P(x) H 2Ck log(1/δ) X r(x) d Pn(x) Z X r(x) d P(x) H X r(x) d Pn(x) Z X r(x) d P(x) X r(x) d Pn(x) X r(x) d P(x) X r(x) d Pn(x), Z X r(x) d P(x) X r(x) d Pn(x) X r(x) d P(x) i=1 E r(Xi), Z X r(x) d P(x) To simplify the r.h.s. of (45), we make the following observation. Note that for any g H, Tg : H R, f 7 g, f H is a bounded linear functional on H. Choose f = R X r(y) d P(y). It follows from (Diestel and Uhl, 1977, Theorem 6, p.47) that X r(y) d P(y) X r(y) d P(y) = Z X Tg(r(y)) d P(y) = Z X g, r(y) H d P(y). (46) Applying (46) to the third term in the r.h.s. of (45) with g = R X r(x) d P(x), we obtain X r(x) d P(x) X r(xi), g H d P(xi) = Z X r(xi) d P(xi), g and so (45) reduces to X r(x) d Pn(x) Z X r(x) d P(x) H X r(x) d Pn(x) X r(x) d P(x) Minimax Estimation of Kernel Mean Embeddings X r(x) d Pn(x) i,j=1 E r(Xi), r(Xj) H i=j E r(Xi), r(Xj) H + 1 i =j E r(Xi), r(Xj) H n EX P r(X) 2 H + n 1 n EX P,Y P r(X), r(Y ) H. (48) Using (46), the second term in (48) can be equivalently written as EX P,Y P r(X), r(Y ) H = Z X r(x), r(y) H d P(y) d P(x) X r(y) d P(y) X r(x) d P(x), Z X r(y) d P(y) X r(x) d P(x) where we invoked (46) in ( ). Combining the above with (48) and using the result in (47) yields X r(x) d Pn(x) Z X r(x) d P(x) H EX P r(X) 2 H R X r(x) d P(x) 2 H n and the result follows. Remark A.2 Suppose H is an RKHS with a reproducing kernel k that is continuous and satisfies supx X k(x, x) < . Choosing r(x) = k( , x), x X in Proposition A.1 yields a concentration inequality for µk(Pn) µk(P) H with Ck := supx X k(x, x), thereby establishing a convergence rate of n 1/2 for µk(Pn) µk(P) H. While such a result has already appeared in Smola et al. (2007, Theorem 2), Gretton et al. (2012) and Lopez-Paz et al. (2015), the result derived from Proposition A.1 improves upon them by providing better constants. While all these works including Proposition A.1 are based on Mc Diarmid s inequality (see (44)), the latter obtains better constants by carefully bounding the expectation term in (44). It is easy to verify that Ck = 1 for Gaussian and Ck = CM for mixture of Gaussian kernels, Ck = c 2γ for inverse multiquadrics, and Ck = 1 for Mat ern kernels. Remark A.3 Assuming X = Rd, H = L2(Rd) and r(x) = k( , x), x Rd, where k is a continuous positive definite kernel on Rd, Proposition A.1 establishes a convergence rate of n 1/2 for µk(Pn) µk(P) L2(Rd) under the condition that supx Rd k(x, ) 2 L2(Rd) < . If k is translation invariant on Rd, i.e., k(x, y) = ψ(x y), x, y Rd where ψ C(Rd) is positive definite, then ψ L2(Rd) ensures that supx Rd k(x, ) 2 L2(Rd) = supx Rd ψ(x ) 2 L2(Rd) = ψ 2 L2(Rd) and therefore Propositions A.1 holds with Ck := ψ 2 L2(Rd). 
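Before specializing to radial kernels, the n^{-1/2} rate of Proposition A.1 (see Remark A.2) can be observed in a small simulation. The following sketch is illustrative and not part of the paper: it uses the Gaussian kernel k(x, y) = exp(−‖x − y‖²/(2η²)) and P = N(0, s²I_d), for which ⟨μ_{P_n}, μ_P⟩ and ‖μ_P‖² have closed forms, so ‖μ_{P_n} − μ_P‖_{H_k} can be evaluated exactly for each draw; the bandwidth, dimension, sample sizes and number of repetitions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta2, s2 = 3, 1.0, 1.0        # dimension, kernel bandwidth eta^2, variance s^2 of P

def kme_error(X):
    """Exact ||mu_Pn - mu_P||_{H_k} for the Gaussian kernel and P = N(0, s2 * I_d)."""
    sq_norms = np.sum(X ** 2, axis=1)
    pair_sq = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T    # ||X_i - X_j||^2
    term_nn = np.exp(-pair_sq / (2 * eta2)).mean()                   # <mu_Pn, mu_Pn>
    term_np = ((eta2 / (eta2 + s2)) ** (d / 2)
               * np.exp(-sq_norms / (2 * (eta2 + s2)))).mean()       # <mu_Pn, mu_P>
    term_pp = (eta2 / (eta2 + 2 * s2)) ** (d / 2)                    # <mu_P, mu_P>
    return np.sqrt(max(term_nn - 2 * term_np + term_pp, 0.0))

for n in [100, 400, 1600]:
    errs = [kme_error(np.sqrt(s2) * rng.standard_normal((n, d))) for _ in range(50)]
    print(f"n = {n:5d}   mean error = {np.mean(errs):.4f}"
          f"   sqrt(n) * mean error = {np.sqrt(n) * np.mean(errs):.3f}")
# The last column stays roughly constant across n, in line with the O_P(n^{-1/2}) rate.
```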
On the other hand, for radial kernels on Rd, i.e., kernels of the form in (5), the condition in (43) is ensured if Z 0 t d/2dν(t) < (49) Tolstikhin, Sriperumbudur, and Muandet sup x Rd k(x, ) 2 L2(Rd) = sup x Rd 0 e t x y 2dν(t) 2 dy ( ) ν([0, )) sup x Rd 0 e 2t x y 2dν(t)dy ( ) = ν([0, )) sup x Rd Rd e 2t x y 2dy dν(t) = ν([0, )) where we used Jensen s inequality in ( ) and Fubini s theorem in ( ). Therefore the bound in Proposition A.1 holds with Ck := ν([0, )) R 0 π 2t d/2 dν(t). (49) is satisfied by Gaussian, mixture of Gaussian and Mat ern (see Sriperumbudur, 2016, Equation 6.17) kernels. For inverse multiquadrics, while (49) holds for γ > d/2 since ν = c 2γGamma(γ, c2) (see Wendland, 2005, Theorem 7.15), in fact the condition in (43) holds for γ > d/4 (see Lemma C.4). Appendix B. Minimax Lower Bounds and Le Cam s Method Let Θ be a set of parameters (or functions) containing the element θ which we want to estimate. Assume there is a class P = {Pθ : θ Θ} of probability measures on Rd indexed by Θ. Suppose d: Θ Θ [0, ) is a metric on Θ. Le Cam s method provides a lower bound on the minimax probability, inf ˆθn supθ Θ P n θ (d(ˆθn, θ) s) for s > 0, where the infimum is taken over all possible estimators ˆθn : Rd Θ that are constructed from an i.i.d. sample (Xi)n i=1 drawn from Pθ. The following two results which we used throughout this work are based on Le Cam s method and they provide a lower bound on the minimax probability. The first one follows from Theorem 2.2 and Equation (2.9) of Tsybakov (2008). It requires a construction of two sufficiently distant elements of the set Θ corresponding to the probability distributions similar in the Kullback-Leibler (KL) divergence sense, where the KL divergence between two distributions P and Q with P absolutely continuous w.r.t. Q is defined as KL(P Q) = R log d P Theorem B.1 (Lower bound based on two hypotheses) Assume Θ contains θ0 and θ1 such that d(θ0, θ1) 2s and KL(P n θ0 P n θ1) α for some s > 0 and 0 < α < . Then inf ˆθn sup θ Θ P n θ n d(ˆθn, θ) s o max 1 4e α, 1 p Note that the second condition of the theorem bounds the distance between the n-fold product distributions by a constant independent of n. Recalling the chain rule of the KL-divergence, which states that KL(P n θ0 P n θ1) = n KL(Pθ0 Pθ1), we can see that this condition is rather restrictive and requires the marginal distributions to satisfy KL(Pθ0 Pθ1) = O(n 1). This condition is slightly relaxed in the following result, which follows from Theorem 2.5 of Tsybakov (2008). Minimax Estimation of Kernel Mean Embeddings Theorem B.2 (Lower bound based on many hypotheses) Assume M 2 and suppose that there exist θ0, . . . , θM Θ such that (i) d(θi, θj) 2s > 0, 0 i < j M; (ii) Pθj is absolutely continuous w. r. t. Pθ0 for all j = 1, . . . , M, and 1 M PM i=1 KL(P n θj P n θ0) α log M with 0 < α < 1/8. Then inf ˆθn sup θ Θ P n θ n d(ˆθn, θ) s o 1 2α r 2α log M The above result is commonly used with M tending to infinity as n . In this case the second condition on the KL-divergence indeed becomes less restrictive than the one of Theorem B.1, since the upper bound α log M may now grow with the sample size n. At the same time, Theorem B.2 still provides a lower bound on the minimax probability independent of n, since M) and log M can be lower bounded by 1/2 and log 2 respectively. Appendix C. Technical Lemmas The following technical results are used to prove the main results of Sections 3 and 4. 
Lemma C.1 (Theorem 5.18, Wendland, 2005) For any µ Rd and σ2 > 0 the following holds: 1 (2πσ2)d/2 e x µ 2 2 2σ2 (w) = 1 (2π)d/2 exp i µ, w σ2 w 2 2 2 Lemma C.2 Let ψ: Rd R be a symmetric and positive definite function. Let Λψ be the corresponding finite non-negative Borel measure from (4). Then Λψ is symmetric, i.e., Λψ(A) = Λψ( A) for all A Rd. Proof From the definition of Λψ we know that it is finite, non-negative, and Rd e i w,x Λψ(dw) = Z Rd cos( w, x )Λψ(dw) i Z Rd sin( w, x )Λψ(dw). Since ψ( x) = ψ(x) for all x Rd, we get R Rd sin( w, x )Λψ(dw) = 0. Note that ψ( x) is by definition a characteristic function of measure Λψ, and we have just proved that it is real-valued. It is known (Bogachev, 2007, Corollary 3.8.7) that in this case the measure Λψ is invariant under the mapping x x. Lemma C.3 Assume X, Y Rd. If f : X Y R is a continuous function and Y is a compact set then g(x) := infy Y f(x, y) is continuous. Proof First, the map g: X R is well defined since fx(y) := f(x, y) is a continuous function for any x X and thus fx achieves its infimum since Y is a compact set. We will prove that the map g: X R is continuous by showing that g 1( , a) and g 1(a, ) are open sets for all a R (Dudley, 2002, Corollary 2.2.7 (a)). Tolstikhin, Sriperumbudur, and Muandet Now we will show that g 1( , a) is open for any a R. It suffices to show that for any x g 1( , a) there is an open neighborhood Ux of x which also belongs to g 1( , a). The set g 1( , a) consists of elements x X for which g(x) < a. In other words, it consists of such elements x X for which there is corresponding yx Y satisfying f(x, yx) < a. Take any x g 1( , a). Since f is continuous, f 1( , a) is open and contains (x, yx). Moreover f 1( , a) contains Ux Vy, where Ux and Vy are open sets with x Ux and yx Vy. Now suppose x Ux. Then for any y Vy we have f(x , y) < a. In particular, f(x , yx) < a, which means that g(x ) < a and x g 1( , a). This shows that g 1( , a) is open. Next we will show that g 1(a, ) is also an open set for any a R. Assume this is not the case. Then there is x g 1(a, ) such that for any neighborhood Ux of x there is a point x Ux such that x g 1(a, ). This means that for any such x there is yx satisfying f(x , yx ) a. Using this we can construct a sequence {xn, yn} from X Y , such that xn / g 1(a, ) for every n, limn xn = x and for any n it holds that f(xn, yn) a. Since Y is compact we conclude that {yn} has a converging subsequence {yn(k)} (Dudley, 2002, Theorem 2.3.1) with limit y Y . We just showed that there is a sequence {xn(k), yn(k)} in X Y , which converges to (x, y ), such that limk f(xn(k), yn(k)) a. Since f is continuous, this also means that limk f(xn(k), yn(k)) = f(x, y ) a. This means that infy Y f(x, y) f(x, y ) a. In other words, this shows that x g 1(a, ) leading to a contradiction and therefore g 1(a, ) is open. Lemma C.4 (L2 norm of inverse multiquadrics kernels) For any c > 0 and γ > d Rd(c2 + x 2 2) 2γ dx = cd 4γπd/2 Γ(2γ d Rd(c2 + x 2 2) 2γdx = c 4γ Z 2γ dx = cd 4γ Z 1 + x 2 2 2γ dx = cd 4γ 2πd/2 1 + r2 2γ rd 1dr = cd 4γ πd/2 0 (1 + x) 2γ xd/2 1dx = cd 4γ πd/2 Γ(d/2) Γ(d/2)Γ(2γ d/2) where last identity can be found in (Gradshteyn and Ryzhik, 2000, 3.194.3). Appendix D. Bounds on Constants for Various Radial Kernels In this appendix, we present bounds on the constants that appear in Corollaries 2, 10 and Theorems 8, 13. for various radial kernels. 
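Before listing the kernel-specific constants, the closed form in Lemma C.4 admits a quick numerical sanity check. The following sketch is illustrative only (the values of c and γ and the choice d = 1 are arbitrary, subject to γ > d/4) and compares direct quadrature of ∫(c² + ‖x‖²)^{−2γ} dx with the stated Gamma-function expression.

```python
import numpy as np
from math import gamma, pi
from scipy.integrate import quad

c, gam, d = 1.3, 1.0, 1          # illustrative values with gam > d / 4
lhs, _ = quad(lambda x: (c ** 2 + x ** 2) ** (-2 * gam), -np.inf, np.inf)
rhs = c ** (d - 4 * gam) * pi ** (d / 2) * gamma(2 * gam - d / 2) / gamma(2 * gam)
print(lhs, rhs)                  # both approximately 0.715; they agree up to quadrature error
```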
Minimax Estimation of Kernel Mean Embeddings D.1 α in Corollary 2 In Corollary 2, we assumed that there exist 0 < t1 < and α > 0 such that ν([t1, )) α. In the following, we present the values of t1 and α for various radial kernels. (i) Gaussian kernel: ν = δ 1 2η2 and so for any t1 < 1 2η2 , we obtain α = 1. (ii) Mixture of Gaussians: ν = PM i=1 βiδ 1 2η2 i and so α = CM for any t1 < 1 2η2 1 . (iii) Inverse multiquadric kernel: It follows from (Wendland, 2005, Theorem 7.15) that k(x, y) = Z 0 e t x y 2 2 tγ 1e c2t Γ(γ) dt, (50) and so ν = c 2γGamma(γ, c2) (51) where the density of a Gamma distribution with parameters a, b > 0 is defined as Gamma(t; a, b) = ba Γ(a)ta 1e tb, t 0. Therefore choosing t1 to be the median of ν, we obtain α = c 2γ (iv) Mat ern kernel: We know from (Wendland, 2005, Theorem 6.13) that Mat ern kernel is related to the Fourier transform of the inverse multiquadric kernel as Rd e i v,w c2 + v 2 2 τdv = 21 τ Γ(τ)Kd/2 τ(c w 2) w 2 where c > 0 and τ > d/2. Using this together with the representation (50) of an inverse multiquadrics kernel we obtain the following identity, which already appeared in (Sriperumbudur, 2016, Equation (72)): k(x, y) = Γ(τ)c2τ d2d/2 Γ(τ d/2) 1 (2π)d/2 Rd e i v,x y "Z 0 e t v 2 2 tτ 1e c2t ( ) = c2τ d2d/2 0 tτ 1e c2t 1 (2π)d/2 Rd e i v,x y e t v 2 2dv dt = c2τ d2d/2 0 tτ 1e c2t 1 (2t)d/2 e x y 2 2 4t dt 0 tτ d/2 1e c2te x y 2 2 4t dt, where we invoked Tonelli-Fubini theorem (Dudley, 2002, Theorem 4.4.5) in ( ) since c2 + 2 2 τ L1(Rd) for τ > d/2. After change of variables we finally obtain k(x, y) = 1 Γ τ d 0 e t x y 2td/2 τ 1e c2 Tolstikhin, Sriperumbudur, and Muandet which shows that Mat ern kernel is a particular instance of radial kernels with ν = Inv Gamma τ d where the density of an inverse-Gamma distribution with parameters a, b > 0 has the form Inv Gamma(t; a, b) = ba Γ(a)t a 1e b/t. Therefore α = 1 2 for the choice of t1 to be the median of ν. t1 in Theorem 8 In Theorem 8, we assumed that there exist 0 < t0 t1 < and 0 < β < such that ν([t0, t1]) β. Define Bk := βt0 t1 . In the following, we present the values of Bk for various radial kernels. (i) Gaussian kernel: Choose t0 = t1 = 1 2η2 so that β = 1 and Bk = 1. (ii) Mixture of Gaussians: Set t0 = 1 2η2 1 , t1 = 1 2η2 M so that β = CM implying Bk = CMη2 M η2 1 . (iii) Inverse multiquadric kernel: From (51), we have ν = c 2γGamma(γ, c2). Therefore γ/(2c2) tγ 1e tc2dt 2c2 γ 1 exp γ c2 c2 γ 2c2 , for γ 1; c2 γ 1 exp γ c2 c2 γ 2c2 , for γ (0, 1). Therefore with t0 = γ 2c2 , t1 = γ 2e γ , for γ 1; c 2γ 2Γ(γ) γ e γ , for γ (0, 1) , we obtain 2e γ , for γ 1; c 2γ 4Γ(γ) γ e γ , for γ (0, 1) . (iv) Mat ern kernel: It is easy to check that if X Gamma(a, b) and Y Inv Gamma(a, b) for a, b > 0 then for any 0 < x y < the following holds: P{x X y} = P{1/y Y 1/x}. This means, the above calculations for inverse multiquadrics can be used to obtain the following for the Mat ern kernel: 2 , for τ d 2 , for τ d Minimax Estimation of Kernel Mean Embeddings D.3 β2δ d/2 1 in Corollary 10 In Corollary 10, we assumed that there exist 0 < δ0 δ1 < and 0 < β < such that ν([δ0, δ1]) β. Define Ak := β2δ d/2 1 . Based on the analysis carried out in Appendix D.2, in the following, we present the values of Ak for various radial kernels. (i) Gaussian kernel: Choose δ0 = δ1 = 1 2η2 so that β = 1 and Ak = (2η2)d/2. (ii) Mixture of Gaussians: Set δ0 = 1 2η2 1 , δ1 = 1 2η2 M so that β = CM implying Ak = C2 M(2η2 M)d/2. 
(iii) Inverse multiquadric kernels: Choosing δ0 = t0 and δ1 = t1 as in Appendix D.2, we obtain cd 4γ Γ2(γ) γ2γ d 2 (2e)2γ , for γ 1; cd 4γ 4Γ2(γ) γ2γ d 2 e2γ , for γ (0, 1) . (iv) Mat ern kernel: Define γ := τ d 2 and c := c 2. Choosing δ0 = c2 γ and δ1 = 2 c2 γ , we obtain 1 Γ( γ) γ 2e γ , for γ 1; 1 2Γ( γ) γ e γ , for γ (0, 1) , using the analysis in Appendix D.2. Therefore, Γ2( γ) γ 2 2 γ+ d 2 , for γ 1; Γ2( γ) γ2 γ+ d 2 , for γ (0, 1) . D.4 β2δ0δ d+2 2 1 in Theorem 13 In Theorem 13, we assumed that there exist 0 < δ0 δ1 < and 0 < β < such that ν([δ0, δ1]) β. Define Bk := β2δ0δ d+2 2 1 . Based on the analysis carried out in Appendix D.2, in the following, we present the values of Bk for various radial kernels. (i) Gaussian kernel: Choose δ0 = δ1 = 1 2η2 so that β = 1 and Bk = (2η2)d/2. (ii) Mixture of Gaussians: Set δ0 = 1 2η2 1 , δ1 = 1 2η2 M so that β = CM implying Bk = C2 M2d/2ηd+2 M η2 1 . (iii) Inverse multiquadric kernels: Choosing δ0 = t0 and δ1 = t1 as in Appendix D.2, we obtain cd 4γ 2Γ2(γ) γ2γ d 2 (2e)2γ , for γ 1; cd 4γ 8Γ2(γ) γ2γ d 2 e2γ , for γ (0, 1) . Tolstikhin, Sriperumbudur, and Muandet (iv) Mat ern kernel: With the choice of δ0 and δ1 as in Appendix D.3, we obtain 2Γ2( γ) γ 2 2 γ+ d 2 , for γ 1; Γ2( γ) γ2 γ+ d 2 , for γ (0, 1) . Appendix E. Alternate Proof of Theorem 8 In Theorem 8 we presented a minimax lower bound for radial kernels based on an appropriate construction of d-dimensional Gaussian distributions. By a clever choice of the variance σ2, which decays to zero as d , we obtained a lower bound of the order Ω(n 1/2) independent of d. This result was based on the direct analysis and special properties of radial kernels. In this appendix we will show that we can recover almost the same result using only Proposition 3, which holds for any translation invariant kernel. As we will see, this leads to slightly worse constant factors and an additional lower bound on the sample size n in terms of the properties of distribution ν, which specifies the kernel. Essentially we will repeat the main steps of the proof of Theorem 8. However, we will use Proposition 3 instead of direct computations (based on the form of radial kernels) to lower bound the RKHS distance between embeddings of Gaussian distributions with the Euclidean distance between their mean vectors. Theorem E.1 Let P be the set of distributions over Rd whose densities are continuously infinitely differentiable and k be radial on Rd, i.e., k(x, y) = Z 0 e t x y 2 2 dν(t), where ν Mb +([0, )) such that supp(ν) = {0}. Assume that there exist 0 < t0 t1 < and 0 < β < such that ν([t0, t1]) β. Suppose n 24t1Zν βt0 where Zν := ν([0, )). Then inf ˆθn sup P P P n ( ˆθn µk(P) Hk 1 Proof We apply Proposition 3 to the radial kernel k. In order to do so, we need to lower bound the quantity appearing in r.h.s. of Condition (7), which we do as follows. We already saw in the proof of Theorem 8 that in our case Λψ is absolutely continuous with respect to the Lebesgue measure on Rd and has the following density: 1 (2t)d/2 e w 2 2 4t dν(t), w Rd. Therefore the r.h.s. of (7) reduces to Rd e σ2 w 2 2 ez, w 2 cos ( a, w ) dΛψ(w) = 2 (2π)d/2 Rd e σ2 w 2 2 ez, w 2 cos ( a, w ) Z 1 (2t)d/2 e w 2 2 4t dν(t) dw Minimax Estimation of Kernel Mean Embeddings = 2 (2π)d/2 2t w 2 2 ez, w 2e i a,w dw | {z } dν(t), (52) where we used Euler s formula and Tonelli-Fubini theorem in the last equality. Denoting δ := 2σ2 + 1 2t, we have j=1 (ez)2 jw2 j + X j =ℓ (ez)j(ez)ℓwjwℓ 2 δ w 2 2(ez)2 jw2 je i a,w dw 2 δ w 2 2(ez)j(ez)lwjwle i a,w dw. 
2 δ w 2 2w2 je i a,w dw = (ez)2 j 2 w2 ℓe iaℓwℓdwℓ 2 w2 j w2 je iajwjdwj δ e a2 ℓ 2δ 2 w2 j w2 je iajwjdwj where we used Lemma C.1. It follows from (Folland, 1999, Theorem 8.22(d)) that if g = x2f L1(R), then f is twice differentiable and g (y) = 2f (y) which together with Lemma C.1 shows that Z 2 w2 j w2 je iajwjdwj = 1 δ e a2 j 2δ Therefore, we get 2 δ w 2 2w2 je i a,w dw = (ez)2 j δ e a2 ℓ 2δ δ 1 δ e a2 j 2δ = (ez)2 j δ d/2 e a 2 2 2δ Tolstikhin, Sriperumbudur, and Muandet Summing over j = 1, . . . , d we get j=1 (ez)2 j 2 δ w 2 2w2 je i a,w dw = 1 d/2 e a 2 2 2δ (ez)2 ja2 j δ2 d/2 e a 2 2 2δ . Next, for any j = ℓwe compute (ez)j(ez)ℓwjwℓexp = (ez)j(ez)ℓ 2 w2 qe iaqwqdwq 2 w2 qwqe iaqwqdwq = (ez)j(ez)ℓ = (ez)j(ez)ℓ d/2 e a 2 2 2δ ajaℓ Summing over j = ℓwe get d/2 e a 2 2 2δ + (ez)2 ja2 j δ2 d/2 e a 2 2 2δ . Returning to (52), we get Rd e σ2 w 2 2 ez, w 2 cos ( a, w ) dΛψ(w) = 2 (2π)d/2 1 (2t)d/2 e a 2 2 2δ 2π 2 2t a 2 2 4σ2t + 1 t (4σ2t + 1)1+d/2 1 2t ez, a 2 In order to apply Proposition 3 we need to lower bound the following value, appearing in Condition (7): (a) := min z Rd\{0} 2 (2π)d/2 Rd e σ2 w 2 2 ez, w 2 cos ( a, w ) dΛψ(w) 2 2t a 2 2 4σ2t + 1 t (4σ2t + 1)1+d/2 1 2t a 2 2 4σ2t + 1 Next we will separately treat two different cases. Case 1: d > 2. Note that the function ρ(t) = t(4σ2t + 1) (d+2)/2 is positive and bounded on [0, ) for any d > 0. Thus, we can define a non-negative and finite measure Minimax Estimation of Kernel Mean Embeddings τ, absolutely continuous with respect to ν with density ρ(t). If we denote Zτ := R 0 1 d τ(t) and write τ for the normalized version of τ, then we can rewrite 2 2t a 2 2 4σ2t + 1 1 2t a 2 2 4σ2t + 1 2 2t a 2 2 4σ2t + 1 1 2t a 2 2 4σ2t + 1 2 2t a 2 2 4σ2t + 1 2 2t a 2 2 4σ2t + 1 2t a 2 2 4σ2t + 1 Note that for d > 2, Et τ[|t|] is finite, since in this case t 7 t2 (4σ2t+1)(d+2)/2 is bounded and ν is a finite measure. Denote µτ := Et τ[t] and note that t 7 exp 1 2 2t a 2 2 4σ2t+1 is a convex function on [0, ). Thus, for d > 2 we can use Jensen s inequality to get 2 2t a 2 2 4σ2t + 1 2 2µτ a 2 2 4σ2µτ + 1 Also note that 2 2t a 2 2 4σ2t + 1 2t a 2 2 4σ2t + 1 2t a 2 2 4σ2t + 1 4 Zτ 2µτ a 2 2 4σ2µτ + 1, where we used inequality e x 1, which holds for x 0, together with Jensen s inequality and the fact that t 7 2t a 2 2 4σ2t+1 is concave on [0, ). Summarizing, we have 2 2µτ a 2 2 4σ2µτ + 1 2µτ a 2 2 4σ2µτ + 1 2 2µτ a 2 2 4σ2µτ + 1 2µτ a 2 2 4σ2µτ + 1 2 3 2µτ a 2 2 4σ2µτ + 1 where we used a simple inequality ex 1 + x. If the following condition is satisfied: 2µτ a 2 2 4σ2µτ + 1 1 then we get (a) 2Zτ = 2 Z t (4σ2t + 1)1+d/2 dν(t). (54) Together with Proposition 3 this leads to the following lower bound, which holds for any µ0, µ1 Rd and σ2 > 0 satisfying (53) with a := µ0 µ1: θ0 θ1 2 Hk Z t µ0 µ1 2 2 (4σ2t + 1)1+d/2 dν(t), Tolstikhin, Sriperumbudur, and Muandet where θ0 and θ1 are KME s of Gaussian measures G(µ0, σ2I) and G(µ1, σ2I) respectively. Note that this lower bound is identical to the one in (27), which we obtained using direct analysis for the radial kernels. However, condition (26) is now replaced with the stronger one in (53). We can now repeat the proof of Theorem 8 starting from inequality (27) and making sure that condition (53) is satisfied when we choose constants appearing in definitions of µ0, µ1 and σ2. In order to check condition (53) we need to upper bound the expectation µτ. It is easily seen that for d > 2, t 7 t2 (4σ2t+1)(d+2)/2 achieves its maximum on [0, ) for t = 1 σ2(d 2). 
Using this fact, denoting Zν = R 0 1 dν(t), and setting σ2 = 1 2t1d we get (4σ2t + 1)1+d/2 dν(t) Zν (4σ2t + 1)1+d/2 = 4t2 1Zν Zτ (d 2)2 1 ( 4 d 2 + 1)1+d/2 = 4t2 1Zν Zτ 2 1 4 d + 2 (d+2)/4!2 4t2 1Zν Zτe2 We may finally use (29) to get which leads to the following upper bound on the expectation µτ: µτ 4t2 1Zν βt0e d(d + 2) (d 2)2 . This upper bound shows that the condition (53) is satisfied if the following holds: 3t1σ2 + βt0e 24t1Zν d(d + 2). (55) We conclude the proof by repeating the remaining steps of the proof of Theorem 8 and replacing condition (26) on the value µ0 µ1 2 2 with (55) specified to a = µ0 µ1. Case 2: d 2. We can use a simple inequality e x/2(1 x) 1 3x/2 which holds for any x and get the following lower bound: t (4σ2t + 1)1+d/2 1 3t a 2 2 4σ2t + 1 Assuming a 2 2 σ2 we further get 1 3t a 2 2 4σ2t + 1 Minimax Estimation of Kernel Mean Embeddings and as a consequence, we also get t (4σ2t + 1)1+d/2 dν(t), which coincides with (54) up to an additional factor of 2. We can now repeat all the steps for the previous case, and it is also easy to check that in this case µ0 µ1 2 2 σ2 will be indeed satisfied. This concludes the proof. Remark E.2 This result should be compared to Theorem 8, which was based on the direct analysis for radial kernels. We see that apart from an extra factor 2 appearing under the square root in the lower bound, Theorem E.1 also requires a superfluous condition on the minimal sample size n, which depends on properties of ν. For instance, for Gaussian kernel with ν concentrated on a single point 1 2η2 for some η2 > 0, the result holds as long as n 24, because in this case we can take t0 = t1 = 1 2η2 and β = 1. However, other choices of ν may lead to quite restrictive lower bounds on n. Remark E.3 Conceptually, the main difference between the proofs of Theorems 8 and E.1 lies in the way we lower bound the RKHS distance between embeddings of Gaussian measures with the Euclidean distance between their mean vectors. In Theorem 8 we derived a closedform expression for the RKHS distance in (25) and then lower bounded it directly using the properties specific to its form. On the other hand, in Theorem E.1 we resorted to the lower bound of Lemma 3, which holds for any translation invariant kernel and hence is less tight. N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337 404, 1950. V. I. Bogachev. Measure Theory, volume 1. Springer, 2007. J. Diestel and J. J. Uhl. Vector Measures. American Mathematical Society, Providence, 1977. N. Dinculeanu. Vector Integration and Stochastic Integration in Banach Spaces. Wiley, 2000. R. M. Dudley. Uniform Central Limit Theorems. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 1999. R. M. Dudley. Real Analysis and Probability. Cambridge University Press, 2002. G. B. Folland. Real Analysis: Modern Techniques and Their Applications. Wiley, 1999. K. Fukumizu, A. Gretton, X. Sun, and B. Sch olkopf. Kernel measures of conditional dependence. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 489 496, Cambridge, MA, 2008. MIT Press. K. Fukumizu, L. Song, and A. Gretton. Kernel Bayes rule: Bayesian inference with positive definite kernels. J. Mach. Learn. Res., 14:3753 3783, 2013. Tolstikhin, Sriperumbudur, and Muandet I. S. Gradshteyn and I. M. Ryzhik. Table of Integrals, Series, and Products. Academic Press, San Diego, USA, 2000. A. Gretton, K. M. Borgwardt, M. Rasch, B. Sch olkopf, and A. Smola. 
A kernel method for the two sample problem. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 513–520, Cambridge, MA, 2007. MIT Press.
A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A kernel statistical test of independence. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 585–592. MIT Press, 2008.
A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.
E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer-Verlag, New York, 2008.
D. Lopez-Paz, K. Muandet, B. Schölkopf, and I. Tolstikhin. Towards a learning theory of cause-effect inference. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, 2015.
K. Muandet, B. Sriperumbudur, K. Fukumizu, A. Gretton, and B. Schölkopf. Kernel mean shrinkage estimators. Journal of Machine Learning Research, 2016. To appear.
A. Ramdas, S. Reddi, B. Póczos, A. Singh, and L. Wasserman. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In AAAI Conference on Artificial Intelligence, 2015.
I. J. Schoenberg. Metric spaces and completely monotone functions. The Annals of Mathematics, 39(4):811–841, 1938.
A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proceedings of the 18th International Conference on Algorithmic Learning Theory (ALT), pages 13–31. Springer-Verlag, 2007.
L. Song, A. Smola, A. Gretton, J. Bedo, and K. Borgwardt. Feature selection via dependence maximization. Journal of Machine Learning Research, 13:1393–1434, 2012.
B. K. Sriperumbudur. Mixture density estimation via Hilbert space embedding of measures. In Proceedings of International Symposium on Information Theory, pages 1027–1030, 2011.
B. K. Sriperumbudur. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3):1839–1893, 2016.
B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.
B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12:2389–2410, 2011.
B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, and G. R. G. Lanckriet. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.
I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
Z. Szabó, A. Gretton, B. Póczos, and B. K. Sriperumbudur. Two-stage sampled learning theory on distributions. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38, pages 948–957. JMLR Workshop and Conference Proceedings, 2015.
A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, NY, 2008.
R. Vert and J.-P. Vert. Consistency and convergence rates of one-class SVMs and related algorithms. Journal of Machine Learning Research, 7:817–854, 2006.
H. Wendland. Scattered Data Approximation. Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, 2005.
V. Yurinsky. Sums and Gaussian Vectors, volume 1617 of Lecture Notes in Mathematics. Springer-Verlag, Berlin, 1995.