# kernel_mean_shrinkage_estimators__99a11930.pdf Journal of Machine Learning Research 17 (2016) 1-41 Submitted 5/14; Revised 9/15; Published 4/16 Kernel Mean Shrinkage Estimators Krikamol Muandet krikamol@tuebingen.mpg.de Empirical Inference Department, Max Planck Institute for Intelligent Systems Spemannstraße 38, T ubingen 72076, Germany Bharath Sriperumbudur bks18@psu.edu Department of Statistics, Pennsylvania State University University Park, PA 16802, USA Kenji Fukumizu fukumizu@ism.ac.jp The Institute of Statistical Mathematics 10-3 Midoricho, Tachikawa, Tokyo 190-8562 Japan Arthur Gretton arthur.gretton@gmail.com Gatsby Computational Neuroscience Unit, CSML, University College London Alexandra House, 17 Queen Square, London - WC1N 3AR, United Kingdom ORCID 0000-0003-3169-7624 Bernhard Sch olkopf bs@tuebingen.mpg.de Empirical Inference Department, Max Planck Institute for Intelligent Systems Spemannstraße 38, T ubingen 72076, Germany Editor: Ingo Steinwart A mean function in a reproducing kernel Hilbert space (RKHS), or a kernel mean, is central to kernel methods in that it is used by many classical algorithms such as kernel principal component analysis, and it also forms the core inference step of modern kernel methods that rely on embedding probability distributions in RKHSs. Given a finite sample, an empirical average has been used commonly as a standard estimator of the true kernel mean. Despite a widespread use of this estimator, we show that it can be improved thanks to the wellknown Stein phenomenon. We propose a new family of estimators called kernel mean shrinkage estimators (KMSEs), which benefit from both theoretical justifications and good empirical performance. The results demonstrate that the proposed estimators outperform the standard one, especially in a large d, small n paradigm. Keywords: covariance operator, James-Stein estimators, kernel methods, kernel mean, shrinkage estimators, Stein effect, Tikhonov regularization 1. Introduction This paper aims to improve the estimation of the mean function in a reproducing kernel Hilbert space (RKHS), or a kernel mean, from a finite sample. A kernel mean is defined with respect to a probability distribution P over a measurable space X by X k(x, ) d P(x) H, (1) . Contributed equally c 2016 Krikamol Muandet, Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, and Bernhard Sch olkopf. Muandet, Sriperumbudur, Fukumizu, Gretton, and Sch olkopf where µP is a Bochner integral (see, e.g., Diestel and Uhl (1977, Chapter 2) and Dinculeanu (2000, Chapter 1) for a definition of Bochner integral) and H is a separable RKHS endowed with a measurable reproducing kernel k : X X R such that R k(x, x) d P(x) < .1 Given an i.i.d sample x1, x2, . . . , xn from P, the most natural estimate of the true kernel mean is empirical average i=1 k(xi, ) . (2) We refer to this estimator as a kernel mean estimator (KME). Though it is the most commonly used estimator of the true kernel mean, the key contribution of this work is to show that there exist estimators that can improve upon this standard estimator. The kernel mean has recently gained attention in the machine learning community, thanks to the introduction of Hilbert space embedding for distributions (Berlinet and Thomas-Agnan, 2004; Smola et al., 2007). Representing the distribution as a mean function in the RKHS has several advantages. First, if the kernel k is characteristic, the map P 7 µP is injective.2 That is, it preserves all information about the distribution (Fukumizu et al., 2004; Sriperumbudur et al., 2008). Second, basic operations on the distribution can be carried out by means of inner products in RKHS, e.g., EP[f(x)] = f, µP H for all f H, which is an essential step in probabilistic inference (see, e.g., Song et al., 2011). Lastly, no intermediate density estimation is required, for example, when testing for homogeneity from finite samples. Thus, the algorithms become less susceptible to the curse of dimensionality; see, e.g., Wasserman (2006, Section 6.5) and Sriperumbudur et al. (2012). The aforementioned properties make Hilbert space embedding of distributions appealing to many algorithms in modern kernel methods, namely, two-sample testing via maximum mean discrepancy (MMD) (Gretton et al., 2007, 2012), kernel independence tests (Gretton et al., 2008), Hilbert space embedding of HMMs (Song et al., 2010), and kernel Bayes rule (Fukumizu et al., 2011). The performance of these algorithms relies directly on the quality of the empirical estimate ˆµP. In addition, the kernel mean has played much more fundamental role as a basic building block of many kernel-based learning algorithms (Vapnik, 1998; Sch olkopf et al., 1998). For instance, nonlinear component analyses, such as kernel principal component analysis (KPCA), kernel Fisher discriminant analysis (KFDA), and kernel canonical correlation analysis (KCCA), rely heavily on mean functions and covariance operators in RKHS (Sch olkopf et al., 1998; Fukumizu et al., 2007). The kernel K-means algorithm performs clustering in feature space using mean functions as representatives of the clusters (Dhillon et al., 2004). Moreover, the kernel mean also served as a basis in early development of algorithms for classification, density estimation, and anomaly detection (Shawe-Taylor and Cristianini, 2004, Chapter 5). All of these employ the empirical average in (2) as an estimate of the true kernel mean. 1. The separability of H and measurability of k ensures that k( , x) is a H-valued measurable function for all x X (Steinwart and Christmann, 2008, Lemma A.5.18). The separability of H is guaranteed by choosing X to be a separable topological space and k to be continuous (Steinwart and Christmann, 2008, Lemma 4.33). 2. The notion of characteristic kernel is closely related to the notion of universal kernel. In brief, if the kernel is universal, it is also characteristic, but the reverse direction is not necessarily the case. See, e.g., Sriperumbudur et al. (2011), for more detailed accounts on this topic. Kernel Mean Shrinkage Estimators We show in this work that the empirical estimator in (2) is, in a certain sense, not optimal, i.e., there exist better estimators (more below), and then propose simple estimators that outperform the empirical estimator. While it is reasonable to argue that ˆµP is the best possible estimator of µP if nothing is known about P (in fact ˆµP is minimax in the sense of van der Vaart (1998, Theorem 25.21, Example 25.24)), in this paper we show that better estimators of µP can be constructed if mild assumptions are made on P. This work is to some extent inspired by Stein s seminal work in 1955, which showed that the maximum likelihood estimator (MLE) of the mean, θ of a multivariate Gaussian distribution N(θ, σ2I) is inadmissible (Stein, 1955) i.e., there exists a better estimator though it is minimax optimal. In particular, Stein showed that there exists an estimator that always achieves smaller total mean squared error regardless of the true θ Rd, when d 3. Perhaps the best known estimator of such kind is James-Steins estimator (James and Stein, 1961). Formally, if X N(θ, σ2I) with d 3, the estimator δ(X) = X for θ is inadmissible in mean squared sense and is dominated by the following estimator δJS(X) = 1 (d 2)σ2 i.e., E δJS(X) θ 2 E δ(X) θ 2 for all θ and there exists at least one θ for which E δJS(X) θ 2 < E δ(X) θ 2. Interestingly, the James-Stein estimator is itself inadmissible, and there exists a wide class of estimators that outperform the MLE, see, e.g., Berger (1976). Ultimately, Stein s result suggests that one can construct estimators better than the usual empirical estimator if the relevant parameters are estimated jointly and if the definition of risk ultimately looks at all of these parameters (or coordinates) together. This finding is quite remarkable as it is counter-intuitive as to why joint estimation should yield better estimators when all parameters are mutually independent (Efron and Morris, 1977). Although the Stein phenomenon has been extensively studied in the statistics community, it has not received much attention in the machine learning community. The James-Stein estimator is a special case of a larger class of estimators known as shrinkage estimators (Gruber, 1998). In its most general form, the shrinkage estimator is a combination of a model with low bias and high variance, and a model with high bias but low variance. For example, one might consider the following estimator: ˆθshrink λ θ + (1 λ)ˆθML, where λ [0, 1], ˆθML denotes the usual maximum likelihood estimate of θ, and θ is an arbitrary point in the input space. In the case of James-Stein estimator, we have θ = 0. Our proposal of shrinkage estimator to estimate µP will rely on the same principle, but will differ fundamentally from the Stein s seminal works and those along this line in two aspects. First, our setting is non-parametric in the sense that we do not assume any parametric form for the distribution, whereas most of traditional works focus on some specific distributions, e.g., the Gaussian distribution. The non-parametric setting is very important in most applications of kernel means because it allows us to perform statistical inference without making any assumption on the parametric form of the true distribution P. Second, our setting involves a non-linear feature map into a high-dimensional space. Muandet, Sriperumbudur, Fukumizu, Gretton, and Sch olkopf For example, if we use the Gaussian RBF kernel (see (6)), the mean function µP lives in an infinite-dimensional space. As a result, higher moments of the distribution come into play and therefore one cannot adopt Stein s setting straightforwardly as it involves only the first moment. A direct generalization of James-Stein estimator to infinite-dimensional Hilbert space has been considered, for example, in Berger and Wolpert (1983); Mandelbaum and Shepp (1987); Privault and Rveillac (2008). In those works, the parameter to be estimated is assumed to be the mean of a Gaussian measure on the Hilbert space from which samples are drawn. In contrast, our setting involves samples that are drawn from P defined on an arbitrary measurable space, and not from a Gaussian measure defined on a Hilbert space. 1.1 Contributions In the following, we present the main contributions of this work. 1. In Section 2.2, we propose kernel mean shrinkage estimators and show that these estimators can theoretically improve upon the standard empirical estimator, ˆµP in terms of the mean squared error (see Theorem 1 and Proposition 4), however, requiring the knowledge of the true kernel mean. We relax this condition in Section 2.3 (see Theorem 5) where without requiring the knowledge of the true kernel mean, we construct shrinkage estimators that are uniformly better (in mean squared error) than the empirical estimator over a class of distributions P. For bounded continuous translation invariant kernels, we show that P reduces to a class of distributions whose characteristic functions have an L2-norm bounded by a given constant. Through concrete choices for P in Examples 1 and 2, we discuss the implications of the proposed estimator. 2. While the proposed estimators in Section 2.2 and 2.3 are theoretically interesting, they are not useful in practice as they require the knowledge of the true data generating distribution. In Section 2.4 (see Theorem 7), we present a completely data-dependent estimator (say ˇµP) referred to as B-KMSE that is n-consistent and satisfies E ˇµP µP 2 H < E ˆµP µP 2 H + O(n 3/2) as n . (4) 3. In Section 3, we present a regularization interpretation for the proposed shrinkage estimator, wherein the shrinkage parameter is shown to be directly related to the regularization parameter. Based on this relation, we present an alternative approach to choosing the shrinkage parameter (different from the one proposed in Section 2.4) through leave-one-out cross-validation, and show that the corresponding shrinkage estimator (we refer to it as R-KMSE) is also n-consistent and satisfies (4). 4. The regularization perspective also sheds light on constructing new shrinkage estimators that incorporate specific information about the RKHS, based on which we present a new n-consistent shrinkage estimator referred to as S-KMSE in Section 4 (see Theorem 13 and Remark 14) that takes into account spectral information of the covariance operator in RKHS. We establish the relation of S-KMSE to the problem of learning smooth operators (Gr unew alder et al., 2013) on H, and propose a leave-one-out cross-validation method to obtain a data-dependent shrinkage parameter. However, unlike B-KMSE and R-KMSE, it remains an open question as Kernel Mean Shrinkage Estimators to whether S-KMSE with a data-dependent shrinkage parameter is consistent and satisfies an inequality similar to (4). The difficulty in answering these questions lies with the complex form of the estimator, µP which is constructed so as to capture the spectral information of the covariance operator. 5. In Section 6, we empirically evaluate the proposed shrinkage estimators of kernel mean on both synthetic data and several real-world scenarios including Parzen window classification, density estimation and discriminative learning on distributions. The experimental results demonstrate the benefits of our shrinkage estimators over the standard one. While a shorter version of this work already appeared in Muandet et al. (2014a,b) particularly, the ideas in Sections 2.2, 3 and 4 , this extended version provides a rigorous theoretical treatment (through Theorems 5, 7, 10, 13 and Proposition 15 which are new) for the proposed estimators and also contains additional experimental results. 2. Kernel Mean Shrinkage Estimators In this section, we first provide some definitions and notation that are used throughout the paper, following which we present a shrinkage estimator of µP. The rest of the section presents various properties including the inadmissibility of the empirical estimator. 2.1 Definitions & Notation For a (a1, . . . , ad) Rd, a 2 q Pd i=1 a2 i . For a topological space X, C(X) (resp. Cb(X)) denotes the space of all continuous (resp. bounded continuous) functions on X. For a locally compact Hausdorffspace X, f C(X) is said to vanish at infinity if for every ϵ > 0 the set {x : |f(x)| ϵ} is compact. The class of all continuous f on X which vanish at infinity is denoted as C0(X). Mb(X) (resp. M1 +(X)) denotes the set of all finite Borel (resp. probability) measures defined on X. For X Rd, Lr(X) denotes the Banach space of r-power (r 1) Lebesgue integrable functions. For f Lr(X), f Lr R X |f(x)|r dx 1/r denotes the Lr-norm of f for 1 r < . The Fourier transform of f L1(Rd) is defined as f (ω) (2π) d/2 R Rd f(x)e 1ω x dx, ω Rd. The characteristic function of P M1 +(Rd) is defined as φP(ω) R e 1ω x d P(x), ω Rd. An RKHS over a set X is a Hilbert space H consisting of functions on X such that for each x X there is a function kx H with the property f, kx H = f(x), f H. (5) The function kx( ) k(x, ) is called the reproducing kernel of H and the equality (5) is called the reproducing property of H. The space H is endowed with inner product , H and norm H. Any symmetric and positive semi-definite kernel function k : X X R uniquely determines an RKHS (Aronszajn, 1950). One of the most popular kernel functions is the Gaussian radial basis function (RBF) kernel on X = Rd, k(x, y) = exp x y 2 2 2σ2 , x, y X, (6) Muandet, Sriperumbudur, Fukumizu, Gretton, and Sch olkopf where 2 denotes the Euclidean norm and σ > 0 is the bandwidth. For x H1 and y H2, x y denotes the tensor product of x and y, and can be seen as an operator from H2 to H1 as (x y)z = x y, z H2 for any z H2, where H1 and H2 are Hilbert spaces. We assume throughout the paper that we observe a sample x1, x2, . . . , xn X of size n drawn independently and identically (i.i.d.) from some unknown distribution P defined over a separable topological space X. Denote by µ and ˆµ the true kernel mean (1) and its empirical estimate (2) respectively. We remove the subscript for ease of notation, but we will use µP (resp. ˆµP) and µ (resp. ˆµ) interchangeably. For the well-definedness of µ as a Bochner integral, throughout the paper we assume that k is continuous and R X k(x, x) d P(x) < (see Footnote 1). We measure the quality of an estimator µ H of µ by the risk function, R : H H R, R(µ, µ) = E µ µ 2 H, where E denotes the expectation over the choice of random sample of size n drawn i.i.d. from the distribution P. When µ = ˆµ, for the ease of notation, we will use to denote R(µ, ˆµ), which can be rewritten as = E ˆµ µ 2 H = E ˆµ 2 H µ 2 H = 1 i,j=1 Exi,xjk(xi, xj) µ 2 H i=1 Exik(xi, xi) + 1 i =j Exi,xjk(xi, xj) µ 2 H n (Exk(x, x) Ex, xk(x, x)) , (7) where µ 2 H = Ex, x[k(x, x)] Ex P[E x P[k(x, x)]] with x and x being independent copies. An estimator ˆµ1 is said to be as good as ˆµ2 if R(µ, ˆµ1) R(µ, ˆµ2) for any P, and is better than ˆµ2 if it is as good as ˆµ2 and R(µ, ˆµ1) < R(µ, ˆµ2) for at least one P. An estimator is said to be inadmissible if there exists a better estimator. 2.2 Shrinkage Estimation of µP We propose the following kernel mean estimator ˆµα αf + (1 α)ˆµ (8) where α 0 and f is a fixed, but arbitrary function in H. Basically, it is a shrinkage estimator that shrinks the empirical estimator toward a function f by an amount specified by α. The choice of f can be arbitrary, but we will assume that f is chosen independent of the sample. If α = 0, the estimator ˆµα reduces to the empirical estimator ˆµ. We denote by α the risk of the shrinkage estimator in (8), i.e., α R(µ, ˆµα). Our first theorem asserts that the shrinkage estimator ˆµα achieves smaller risk than that of the empirical estimator ˆµ given an appropriate choice of α, regardless of the function f . Theorem 1 Let X be a separable topological space. Then for all distributions P and continuous kernel k satisfying R k(x, x) d P(x) < , α < if and only if α 0, 2 + f µ 2 H In particular, arg minα R( α ) is unique and is given by α + f µ 2 H . Kernel Mean Shrinkage Estimators Proof Note that α = E ˆµα µ 2 H = E[ˆµα] µ 2 H + E ˆµα Eˆµα 2 H = Bias(ˆµα) 2 H + Var(ˆµα), where Bias(ˆµα) = E[ˆµα] µ = E[αf + (1 α)ˆµ] µ = α(f µ) and Var(ˆµα) = (1 α)2E ˆµ µ 2 H = (1 α)2 . Therefore, α = α2 f µ 2 H + (1 α)2 , (10) i.e., α = α2 + f µ 2 H 2α . This is clearly negative if and only if (9) holds and is uniquely minimized at α + f µ 2 H . Remark 2 (i) The shrinkage estimator always improves upon the standard one regardless of the direction of shrinkage, as specified by the choice of f . In other words, there exists a wide class of kernel mean estimators that achieve smaller risk than the standard one. (ii) The range of α depends on the choice of f . The further f is from µ, the smaller the range of α becomes. Thus, the shrinkage gets smaller if f is chosen such that it is far from the true kernel mean. This effect is akin to James-Stein estimator. (iii) From (9), since 0 < α < 2, i.e., 0 < (1 α)2 < 1, it follows that Var(ˆµα) < Var(ˆµ) = , i.e., the shrinkage estimator always improves upon the empirical estimator in terms of the variance. Further improvement can be gained by reducing the bias by incorporating the prior knowledge about the location of µ via f . This implies that we can potentially gain twice by adopting the shrinkage estimator: by reducing variance of the estimator and by incorporating prior knowledge in choosing f such that it is close to the true kernel mean. While Theorem 1 shows ˆµ to be inadmissible by providing a family of estimators that are better than ˆµ, the result is not useful as all these estimators require the knowledge of µ (which is the parameter of interest) through the range of α given in (9). In Section 2.3, we investigate Theorem 1 and show that ˆµα can be constructed under some weak assumptions on P, without requiring the knowledge of µ. From (9), the existence of positive α is guaranteed if and only if the risk of the empirical estimator is non-zero. Under some assumptions on k, the following result shows that = 0 if and only if the distribution P is a Dirac distribution, i.e., the distribution P is a point mass. This result ensures, in many non-trivial cases, a non-empty range of α for which α < 0. Proposition 3 Let k(x, y) = ψ(x y), x, y Rd be a characteristic kernel where ψ Cb(Rd) is positive definite. Then = 0 if and only if P = δx for some x Rd. Proof See Section 5.1. Muandet, Sriperumbudur, Fukumizu, Gretton, and Sch olkopf 2.2.1 Positive-part Shrinkage Estimator Similar to James-Stein estimator, we can show that the positive-part version of ˆµα also outperforms ˆµ, where the positive-part estimator is defined by ˆµ+ α αf + (1 α)+ˆµ (11) with (a)+ a if a > 0 and zero otherwise. Equation (11) can be rewritten as ( αf + (1 α)ˆµ, 0 α 1 αf 1 < α < 2. (12) Let + α E ˆµ+ α µ 2 H be the risk of the positive-part estimator. Then, the following result shows that + α α, given that α satisfies (9). Proposition 4 For any α satisfying (9), we have that + α α < . Proof According to (12), we decompose the proof into two parts. First, if 0 α 1, ˆµα and ˆµ+ α behave exactly the same. Thus, + α = α. On the other hand, when 1 < α < 2, the bias-variance decomposition of these estimators yields α = α2 f µ 2 H + (1 α)2E ˆµ µ 2 H and + α = α2 f µ 2 H. It is easy to see that + α < α when 1 < α < 2. This concludes the proof. Proposition 4 implies that, when estimating α, it is better to restrict the value of α to be smaller than 1, although it can be greater than 1, as suggested by Theorem 1. The reason is that if 0 α 1, the bias is an increasing function of α, whereas the variance is a decreasing function of α. On the other hand, if α > 1, both bias and variance become increasing functions of α. We will see later in Section 3 that ˆµα and ˆµ+ α can be obtained naturally as a solution to a regularized regression problem. 2.3 Consequences of Theorem 1 As mentioned before, while Theorem 1 is interesting from the perspective of showing that the shrinkage estimator, ˆµα performs better in the mean squared sense than the empirical estimator, it unfortunately relies on the fact that µP (i.e., the object of interest) is known, which makes ˆµα uninteresting. Instead of knowing µP, which requires the knowledge of P, in this section, we show that a shrinkage estimator can be constructed that performs better than the empirical estimator, uniformly over a class of probability distributions. To this end, we introduce the notion of an oracle upper bound. Let P be a class of probability distributions P defined on a measurable space X. We define an oracle upper bound as Uk,P inf P P 2 + f µ 2 H . It follows immediately from Theorem 1 and the definition of Uk,P that if Uk,P = 0, then for any α (0, Uk,P), α < 0 holds uniformly for all P P. Note that by virtue Kernel Mean Shrinkage Estimators of Proposition 3, the class P cannot contain the Dirac measure δx (for any x Rd) if the kernel k is translation invariant and characteristic on Rd. Below we give concrete examples of P for which Uk,P = 0 so that the above uniformity statement holds. In particular, we show in Theorem 5 below that for X = Rd, if a non-trivial bound on the L2-norm of the characteristic function of P is known, it is possible to construct shrinkage estimators that are better (in mean squared error) than the empirical average. In such a case, unlike in Theorem 1, α does not depend on the individual distribution P, but only on an upper bound associated with a class P. Theorem 5 Let k(x, y) = ψ(x y), x, y Rd with ψ Cb(Rd) L1(Rd) and ψ is a positive definite function with ψ(0) > 0. For a given constant A (0, 1), let Aψ := A(2π)d/2ψ(0) Pk,A n P M1 +(Rd) : φP L2 p where φP denotes the characteristic function of P. Then for all P Pk,A, α < if 1 + (n 1)A + n f 2 H ψ(0) + 2n Proof By Theorem 1, we have that α < , α 0, 2 + f µ 2 H + f µ 2 H = Exk(x, x) Ex, xk(x, x) Exk(x, x) Ex, xk(x, x) + n f µ 2 H ( ) = 1 Ex, xk(x, x) 1 + (n 1)Ex, xk(x, x) Exk(x,x) + n f 2 H Exk(x,x) 2n f ,µ H 1 Ex, xk(x, x) 1 + (n 1)Ex, xk(x, x) Exk(x,x) + n f 2 H Exk(x,x) + 2n f H Ex, xk(x, x) Exk(x,x) where the division by Exk(x, x) in ( ) is valid since Exk(x, x) = ψ(0) > 0. Note that the numerator in the r.h.s. of (14) is non-negative since Ex, xk(x, x) Ex p k(x, x)E x p k( x, x) Exk(x, x) with equality holding if and only if P = δy for some y Rd (see Proposition 3). However, for any A (0, 1) and y Rd, it is easy to verify that δy / Pk,A, which implies the numerator in fact positive. The denominator is clearly positive since Ex, xk(x, x) 0 and therefore the r.h.s. of (14) is positive. Also note that Ex, xk(x, x) = Z Z ψ(x y) d P(x) d P(y) ( ) = Z |φP(ω)|2ψ (ω) dω Muandet, Sriperumbudur, Fukumizu, Gretton, and Sch olkopf sup ω Rd ψ (ω) φP 2 L2 (2π) d/2 ψ L1 φP 2 L2, (15) where ψ is the Fourier transform of ψ and ( ) follows see (16) in the proof of Proposition 5 in Sriperumbudur et al. (2011) by invoking Bochner s theorem (Wendland, 2005, Theorem 6.6), which states that ψ is Fourier transform of a non-negative finite Borel measure with density (2π) d/2ψ , i.e., ψ(x) = (2π) d/2 R e ix ωψ (ω) dω, x Rd. As Exk(x, x) = ψ(0), we have that Ex, xk(x, x) Exk(x, x) A φP 2 L2 Aψ and therefore for any P Pk,A, Ex, xk(x, x) Exk(x,x) A. Using this in (14) and combining it with (13) yields the result. Remark 6 (i) Theorem 5 shows that for any P Pk,A, it is possible to construct a shrinkage estimator that dominates the empirical estimator, i.e., the shrinkage estimator has a strictly smaller risk than that of the empirical estimator. (ii) Suppose that P has a density, denoted by p, with respect to the Lebesgue measure and φP L2(Rd). By Plancherel s theorem, p L2(Rd) as p L2 = φP L2, which means that Pk,A includes distributions with square integrable densities (note that in general not every p is square integrable). Since φP 2 L2 = Z |φP(ω)|2 dω sup ω Rd |φP(ω)| Z |φP(ω)| dω = φP L1, where we used the fact that supω Rd |φP(ω)| = 1, it is easy to check that ( P M1 +(Rd) : φP L1 A(2π)d/2ψ(0) This means bounded densities belong to Pk,A as φP L1(Rd) implies that P has a density, p C0(Rd). Moreover, it is easy to check that larger the value of A, larger is the class Pk,A and smaller is the range of α for which α < and vice-versa. In the following, we present some concrete examples to elucidate Theorem 5. Example 1 (Gaussian kernel and Gaussian distribution) Define N P M1 +(Rd) d P(x) = 1 (2πσ2)d/2 e x θ 2 2 2σ2 dx, θ Rd, σ > 0 , where ψ(x) = e x 2 2/2τ 2, x Rd and τ > 0. For P N, it is easy to verify that φP(ω) = e 1θ ω 1 2 σ2 ω 2 2, ω Rd and φP 2 L2 = Z e σ2 ω 2 2 dω = (π/σ2)d/2. Also, ψ L1 = (2πτ 2)d/2. Therefore, for Pk,A {P N : σ2 πτ 2/A2/d}, assuming f = 0, we obtain the result in Theorem 5, i.e., the result in Theorem 5 holds for all Gaussian distributions that are smoother (having larger variance) than that of the kernel. Kernel Mean Shrinkage Estimators Example 2 (Linear kernel) Suppose f = 0 and k(x, y) = x y. While the setting of Theorem 5 does not fit this choice of k, an inspection of its proof shows that it is possible to construct a shrinkage estimator that improves upon µP for an appropriate class of distributions. To this end, let ϑ and Σ represent the mean vector and covariance matrix of a distribution P defined on Rd. Then it is easy to check that Ex, xk(x, x) Exk(x,x) = ϑ 2 2 trace(Σ)+ ϑ 2 2 and therefore for a given A (0, 1), define Pk,A P M1 +(Rd) ϑ 2 2 trace(Σ) A 1 A From (13) and (14), it is clear that for any P Pk,A, α < if α 0, 2(1 A) 1+(n 1)A i . Note that this choice of kernel yields the setting similar to classical James-Stein estimation. In James-Stein estimation, P N (see Example 1 for the definition of N) and ϑ is estimated as (1 α)ˆϑ which improves upon ˆϑ where α depends on the sample (xi)n i=1 and ˆϑ is the sample mean. In our case, for all P Pk,A = n P N : ϑ 2 σ q d A 1 A o , α < if α 0, 2(1 A) 1+(n 1)A i . In addition, in contrast to the James-stein estimator which improves upon the empirical estimator ( i.e., sample mean) for only d 3, we note here that the proposed estimator improves for any d as long as P Pk,A. On the other hand, the proposed estimator requires some knowledge about the distribution (particularly a bound on ϑ 2), which the James-Stein estimator does not (see Section 2.5 for more details). 2.4 Data-Dependent Shrinkage Parameter The discussion so far showed that the shrinkage estimator in (8) performs better than the empirical estimator if the data generating distribution satisfies a certain mild condition (see Theorem 5; Examples 1 and 2). However, since this condition is usually not checkable in practice, the shrinkage estimator lacks applicability. In this section, we present a completely data driven shrinkage estimator by estimating the shrinkage parameter α from data so that the estimator does not require any knowledge of the data generating distribution. Since the maximal difference between α and occurs at α (see Theorem 1), given an i.i.d. sample X = {x1, x2, . . . , xn} from P, we propose to estimate µ using ˆµ α = (1 α)ˆµ (i.e., assuming f = 0) where α is an estimator of α = /( + µ 2 H) given by α = ˆ ˆ + ˆµ 2 H , (16) with ˆ and ˆµ being the empirical versions of and µ, respectively (see Theorem 7 for precise definitions). The following result shows that α is a n n-consistent estimator of α and ˆµ α µ H concentrates around ˆµα µ H. In addition, we show that α α α + O(n 3/2) as n , which means the performance of ˆµ α is similar to that of the best estimator (in mean squared sense) of the form ˆµα. In what follows, we will call the estimator ˆµ α an empirical-bound kernel mean shrinkage estimator (B-KMSE). Muandet, Sriperumbudur, Fukumizu, Gretton, and Sch olkopf Theorem 7 Suppose n 2 and f = 0. Let k be a continuous kernel on a separable topological space X satisfying R X k(x, x) d P(x) < . Define ˆ ˆEk(x, x) ˆEk(x, x) n and ˆµ 2 H 1 i,j=1 k(xi, xj) where ˆEk(x, x) 1 n Pn i=1 k(xi, xi) and ˆEk(x, x) 1 n(n 1) Pn i =j k(xi, xj) are the empirical estimators of Exk(x, x) and Ex, xk(x, x) respectively. Assume there exist finite constants κ1 > 0, κ2 > 0, σ1 > 0 and σ2 > 0 such that E k( , x) µ m H m! 2 σ2 1κm 2 1 , m 2. (17) and E|k(x, x) Exk(x, x)|m m! 2 σ2 2κm 2 2 , m 2. (18) Then | α α | = OP(n 3/2) and ˆµ α µ H ˆµα µ H = OP(n 3/2) as n . In particular, min α E ˆµα µ 2 H E ˆµ α µ 2 H min α E ˆµα µ 2 H + O(n 3/2) (19) Proof See Section 5.2. Remark 8 (i) ˆµ α is a n-consistent estimator of µ. This follows from ˆµ α µ H ˆµα µ H + OP(n 3/2) (1 α ) ˆµ µ H + α µ H + OP(n 3/2) α = + µ 2 H = Exk(x, x) Ex, xk(x, x) Exk(x, x) + (n 1)Ex, xk(x, x) = O(n 1) as n . Using (38), we obtain ˆµ α µ H = OP(n 1/2) as n , which implies that ˆµ α is a n-consistent estimator of µ. (ii) Equation (19) shows that α α +O(n 3/2) where α < (see Theorem 1) and therefore for any P satisfying (17) and (18), α < + O(n 3/2) as n . (iii) Suppose the kernel is bounded, i.e., supx,y X |k(x, y)| κ < . Then it is easy to verify that (17) and (18) hold with σ1 = κ, κ1 = 2 κ, σ2 = κ and κ2 = 2κ and therefore the claims in Theorem 7 hold for bounded kernels. (iv) For k(x, y) = x y, we have E k( , x) µ m H = E k( , x) µ 2 H m/2 = E x Exx 2 2 m/2 = E x Exx m 2 Kernel Mean Shrinkage Estimators and E|k(x, x) Exk(x, x)|m = E| x 2 2 Ex x 2 2|m. The conditions in (17) and (18) hold for P N where N is defined in Example 1. With P N and k(x, y) = x y, the problem of estimating µ reduces to estimating θ, for which we have presented a James-Stein-like estimator, ˆµ α that satisfies the oracle inequality in (19). (v) While the moment conditions in (17) and (18) are obviously satisfied by bounded kernels, for unbounded kernels, these conditions are quite stringent as they require all the higher moments to exist. These conditions can be weakened and the proof of Theorem 7 can be carried out using Chebyshev inequality instead of Bernstein s inequality but at the cost of a slow rate in (19). 2.5 Connection to James-Stein Estimator In this section, we explore the connection of our proposed estimator in (8) to the James Stein estimator. Recall that Stein s setting deals with estimating the mean of the Gaussian distribution N(θ, σ2Id), which can be viewed as a special case of kernel mean estimation when we restrict to the class of distributions P {N(θ, σ2Id) | θ Rd} and a linear kernel k(x, y) = x y, x, y Rd (see Example 2). In this case, it is easy to verify that = dσ2/n and α < for dσ2 + n θ 2 Let us assume that n = 1, in which case, we obtain α < for α 0, 2dσ2 Ex x 2 as Ex x 2 = θ 2 + dσ2. Note that the choice of α is dependent on P through Ex x 2 which is not known in practice. To this end, we replace it with the empirical version x 2 that depends only on the sample x. For an arbitrary constant c (0, 2d), the shrinkage estimator (assuming f = 0) can thus be written as ˆµα = (1 α)ˆµ = 1 cσ2 which is exactly the James-Stein estimator in (3). This particular way of estimating the shrinkage parameter α has an intriguing consequence, as shown in Stein s seminal works (Stein, 1955; James and Stein, 1961), that the shrinkage estimator ˆµα can be shown to dominate the maximum likelihood estimator ˆµ uniformly over all θ. While it is compelling to see that there is seemingly a fundamental principle underlying both these settings, this connection also reveals crucial difference between our approach and classical setting of Stein notably, original James-Stein estimator improves upon the sample mean even when α is data-dependent (see ˆµα above), however, with the crucial assumption that x is normally distributed. 3. Kernel Mean Estimation as Regression Problem In Section 2, we have shown that James-Stein-like shrinkage estimator, i.e., Equation (8), improves upon the empirical estimator in estimating the kernel mean. In this section, Muandet, Sriperumbudur, Fukumizu, Gretton, and Sch olkopf we provide a regression perspective to shrinkage estimation. The starting point of the connection between regression and shrinkage estimation is the observation that the kernel mean µP and its empirical estimate ˆµP can be obtained as minimizers of the following risk functionals, X k( , x) g 2 H d P(x) and b E(g) 1 i=1 k( , xi) g 2 H , respectively (Kim and Scott, 2012). Given these formulations, it is natural to ask if minimizing the regularized version of b E(g) will give a better estimator. While this question is interesting, it has to be noted that in principle, there is really no need to consider a regularized formulation as the problem of minimizing b E is not ill-posed, unlike in function estimation or regression problems. To investigate this question, we consider the minimization of the following regularized empirical risk functional, b Eλ(g) b E(g) + λΩ( g H) = 1 i=1 k( , xi) g 2 H + λΩ( g H), (20) where Ω: R+ R+ denotes a monotonically increasing function and λ > 0 is the regularization parameter. By representer theorem (Sch olkopf et al., 2001), any function g H that is a minimizer of (20) lies in a subspace spanned by {k( , x1), . . . , k( , xn)}, i.e., g = Pn j=1 βjk( , xj) for some β [β1, . . . , βn] Rn. Hence, by setting Ω( g H) = g 2 H, we can rewrite (20) in terms of β as b E(g) + λΩ( g H) = β Kβ 2β K1n + λβ Kβ + c, (21) where K is an n n Gram matrix such that Kij = k(xi, xj), c is a constant that does not depend on β, and 1n = [1/n, 1/n, . . . , 1/n] . Differentiating (21) with respect to β and setting it to zero yields an optimal weight vector β = 1 1+λ 1n and so the minimizer of (20) is given by ˆµλ = 1 1 + λ ˆµ = 1 λ 1 + λ ˆµ (1 α)ˆµ, (22) which is nothing but the shrinkage estimator in (8) with α = λ 1+λ and f = 0. This provides a nice relation between shrinkage estimation and regularized risk minimization, wherein the regularization helps in shrinking the estimator ˆµ towards zero although it is not required from the point of view of ill-posedness. In particular, since 0 < 1 α < 1, ˆµλ corresponds to a positive-part estimator proposed in Section 2.2.1 when f = 0. Note that ˆµλ is a consistent estimator of µ as λ 0 and n , which follows from ˆµλ µ H 1 1 + λ ˆµ µ H + λ 1 + λ µ H OP(n 1/2) + O(λ). In particular λ = τn 1/2 (for some constant τ > 0) yields the slowest possible rate for λ 0 such that the best possible rate of n 1/2 is obtained for ˆµλ µ H 0 as n . In addition, following the idea in Theorem 5, it is easy to show that E ˆµλ µ 2 H < if Kernel Mean Shrinkage Estimators τ 0, 2 n µ 2 H . Note that ˆµλ is not useful in practice as λ is not known a priori. However, by choosing λ = ˆ ˆµ 2 H , it is easy to verify (see Theorem 7 and Remark 8) that E ˆµλ µ 2 H < E ˆµ µ 2 H + O(n 3/2) (23) as n . Owing to the connection of ˆµλ to a regression problem, in the following, we present an alternate data-dependent choice of λ obtained from leave-one-out cross validation (LOOCV) that also satisfies (23), and we refer to the corresponding estimator as regularized kernel mean shrinkage estimator (R-KMSE). To this end, for a given shrinkage parameter λ, denote by ˆµ( i) λ as the kernel mean estimated from {xj}n j=1\{xi}. We will measure the quality of ˆµ( i) λ by how well it approximates k( , xi) with the overall quality being quantified by the cross-validation score, LOOCV (λ) = 1 k( , xi) ˆµ( i) λ 2 The LOOCV formulation in (24) differs from the one used in regression, wherein instead of measuring the deviation of the prediction made by the function on the omitted observation, we measure the deviation between the feature map of the omitted observation and the function itself. The following result shows that the shrinkage parameter in ˆµλ (see (22)) can be obtained analytically by minimizing (24) and requires O(n2) operations to compute. Proposition 9 Let n 2, ρ := 1 n2 Pn i,j=1 k(xi, xj) and ϱ := 1 n Pn i=1 k(xi, xi). Assuming nρ > ϱ, the unique minimizer of LOOCV (λ) is given by λr = n(ϱ ρ) (n 1)(nρ ϱ). (25) Proof See Section 5.3. It is instructive to compare αr = λr λr + 1 = ϱ ρ (n 2)ρ + ϱ/n (26) to the one in (16), where the latter can be shown to be ϱ ρ ϱ+(n 2)ρ, by noting that ˆEk(x, x) = ϱ and ˆEk(x, x) = nρ ϱ n 1 (in Theorem 7, we employ the U-statistic estimator of Ex, xk(x, x), whereas ρ in Proposition 9 can be seen as a V -statistic counterpart). This means αr obtained from LOOCV will be relatively larger than the one obtained from (16). Like in Theorem 7, the requirement that n 2 in Theorem 9 stems from the fact that at least two data points are needed to evaluate the LOOCV score. Note that nρ > ϱ if and only if ˆEk(x, x) > 0, which is guaranteed if the kernel is positive valued. We refer to ˆµλr as R-KMSE, whose n-consistency is established by the following result, which also shows that ˆµλr satisfies (23). Muandet, Sriperumbudur, Fukumizu, Gretton, and Sch olkopf Theorem 10 Let n 2, nρ > ϱ where ρ and ϱ are defined in Proposition 9 and k satisfies the assumptions in Theorem 7. Then ˆµλr µ H = OP(n 1/2), min α E ˆµα µ 2 H E ˆµλr µ 2 H min α E ˆµα µ 2 H + O(n 3/2) (27) where ˆµα = (1 α)ˆµ and therefore E ˆµλr µ 2 H < E ˆµ µ 2 H + O(n 3/2) (28) Proof See Section 5.4. 4. Spectral Shrinkage Estimators Consider the following regularized risk minimization problem arg inf F H H Ex P k(x, ) F[k(x, )] 2 H + λ F 2 HS, (29) where the minimization is carried over the space of Hilbert-Schmidt operators, F on H with F HS being the Hilbert-Schmidt norm of F. As an interpretation, we are finding a smooth operator F that maps k(x, ) to itself (see Gr unew alder et al. (2013) for more details on this smooth operator framework). It is not difficult to show that the solution to (29) is given by F = ΣXX (ΣXX + λI) 1 where ΣXX = R k( .x) k( , x) d P(x) is a covariance operator defined on H (see, e.g., Gr unew alder et al., 2012). Note that ΣXX is a Bochner integral, which is well-defined as a Hilbert-Schmidt operator if X is a separable topological space and k is a continuous kernel satisfying R k(x, x) d P(x) < . Consequently, let us define µλ = Fµ = ΣXX (ΣXX + λI) 1µ, which is an approximation to µ as it can be shown that µλ µ H 0 as λ 0 (see the proof of Theorem 13). Given an i.i.d. sample x1, . . . , xn from P, the empirical counterpart of (29) is given by arg min F H H 1 n i=1 k(xi, ) F[k(xi, )] 2 H + λ F 2 HS (30) resulting in ˇµλ Fˆµ = ˆΣXX (ˆΣXX + λI) 1ˆµ (31) where ˆΣXX is the empirical covariance operator on H given by i=1 k( .xi) k( , xi). Unlike ˆµλ in (22), ˇµλ shrinks ˆµ differently in each coordinate by taking the eigenspectrum of ˆΣXX into account (see Proposition 11) and so we refer to it as the spectral kernel mean shrinkage estimator (S-KMSE). Kernel Mean Shrinkage Estimators Proposition 11 Let {(γi, φi)}n i=1 be eigenvalue and eigenfunction pairs of ˆΣXX . Then γi γi + λ ˆµ, φi Hφi. Proof Since ˆΣXX is a finite rank operator, it is compact. Since it is also a self-adjoint operator on H, by Hilbert-Schmidt theorem (Reed and Simon, 1972, Theorems VI.16, VI.17), we have ˆΣXX = Pn i=1 γi φi, Hφi. The result follows by using this in (31). As shown in Proposition 11, the effect of S-KMSE is to reduce the contribution of high frequency components of ˆµ (i.e., contribution of ˆµ along the directions corresponding to smaller γi) when ˆµ is expanded in terms of the eigenfunctions of the empirical covariance operator, which are nothing but the kernel PCA basis (Rasmussen and Williams, 2006, Section 4.3). This means, similar to R-KMSE, S-KMSE also shrinks ˆµ towards zero, however, the difference being that while R-KMSE shrinks equally in all coordinates, S-KMSE controls the amount of shrinkage by the information contained in each coordinate. In particular, S-KMSE takes into account more information about the kernel by allowing for different amount of shrinkage in each coordinate direction according to the value of γi, wherein the shrinkage is small in the coordinates whose γi are large. Moreover, Proposition 11 reveals that the effect of shrinkage is akin to spectral filtering (Bauer et al., 2007) which in our case corresponds to Tikhonov regularization wherein S-KMSE filters out the high-frequency components of the spectral representation of the kernel mean. Muandet et al. (2014b) leverages this observation and generalizes S-KMSE to a family of shrinkage estimators via spectral filtering algorithms. The following result presents an alternate representation for ˇµλ, using which we relate the smooth operator formulation in (30) to the regularization formulation in (20). Proposition 12 Let Φ : Rn H, a 7 Pn i=1 aik( , xi) where a (a1, . . . , an). Then ˇµλ = ˆΣXX (ˆΣXX + λI) 1ˆµ = Φ(K + nλI) 1K1n, where K is the Gram matrix, I is an identity operator on H, I is an n n identity matrix and 1n [1/n, . . . , 1/n] . Proof See Section 5.5. From Proposition 12, it is clear that j=1 (βs)jk( , xj) (32) where βs n(K + nλI) 1K1n. Given the form of ˇµλ in (32), it is easy to verify that βs is the minimizer of (20) when b Eλ is minimized over {g = 1 n Pn j=1(β)jk( , xj) : β Rn} with Ω( g H) β 2 2. The following result, discussed in Remark 14, establishes the consistency and convergence rate of S-KMSE, ˇµλ. Muandet, Sriperumbudur, Fukumizu, Gretton, and Sch olkopf Theorem 13 Suppose X is a separable topological space and k is a continuous, bounded kernel on X. Then the following hold. (i) If µ R(ΣXX ), then ˇµλ µ H 0 as λ n , λ 0 and n . (ii) If µ R(ΣXX ), then ˇµλ µ H = OP(n 1/2) for λ = cn 1/2 with c > 0 being a constant independent of n. Proof See Section 5.6. Remark 14 While Theorem 13(i) shows that S-KMSE, ˇµλ is not universally consistent, i.e., S-KMSE is not consistent for all P but only for those P that satisfies µ R(ΣXX ), under some additional conditions on the kernel, the universal consistency of S-KMSE can be guaranteed. This is achieved by assuming that constant functions are included in H, i.e., 1 H. Note that if 1 H, then it is easy to check that there exists g H (choose g = 1) such that µ = ΣXX g = R k( , x)g(x) d P(x), i.e., µ R(ΣXX ), and, therefore, by Theorem 13, ˇµλ is not only universally consistent but also achieves a rate of n 1/2. Choosing k(x, y) = k(x, y) + b, x, y X, b > 0 where k is any bounded, continuous positive definite kernel ensures that 1 H. Note that the estimator ˇµλ requires the knowledge of the shrinkage or regularization parameter, λ. Similar to R-KMSE, below, we present a data dependent approach to select λ using leave-one-out cross validation. While the shrinkage parameter for R-KMSE can be obtained in a simple closed form (see Proposition 9), we will see below that finding the corresponding parameter for S-KMSE is more involved. Evaluating the score function (i.e., (24)) na ıvely requires one to solve for ˆµ( i) λ explicitly for every i, which is computationally expensive. The following result provides an alternate expression for the score, which can be evaluated efficiently. We would like to point out that a variation of Proposition 15 already appeared in Muandet et al. (2014a, Theorem 4). However, Theorem 4 in Muandet et al. (2014a) uses an inappropriate choice of ˆµ( i) λ , which we fixed in the following result. Proposition 15 The LOOCV score of S-KMSE is given by LOOCV (λ) = 1 ntr (K + λn I) 1K(K + λn I) 1Aλ 2 ntr (K + λn I) 1Bλ i=1 k(xi, xi), where λn (n 1)λ, Aλ 1 (n 1)2 Pn i=1 ci,λc i,λ, Bλ 1 n 1 Pn i=1 ci,λk i , di,λ k i (K + λn I) 1ei, ci,λ K1 ki eik i 1 + eik(xi, xi) + eik i (K + λn I) 1K1 1 di,λ eik i (K + λn I) 1ki eik i (K + λn I) 1eik i 1 1 di,λ + eik i (K + λn I) 1eik(xi, xi) ki is the ith column of K, 1 (1, . . . , 1) and ei (0, 0, . . . , 1, . . . , 0) with 1 being in the ith position. Here tr(A) denotes the trace of a square matrix A. Kernel Mean Shrinkage Estimators Proof See Section 5.7. Unlike R-KMSE, a closed form expression for the minimizer of LOOCV (λ) in Proposition 15 is not possible and so proving the consistency of S-KMSE along with results similar to those in Theorem 10 are highly non-trivial. Hence, we are not able to provide any theoretical comparison of ˇµλ (with λ being chosen as a minimizer of LOOCV (λ) in Proposition 15) with ˆµ. However, in the next section, we provide an empirical comparison through simulations where we show that the S-KMSE outperforms the empirical estimator. In this section, we present the missing proofs of the results of Sections 2 4. 5.1 Proof of Proposition 3 ( ) If P = δx for some x X, then ˆµ = µ = k( , x) and thus = 0. ( ) Suppose = 0. It follows from (7) that RR (k(x, x) k(x, y)) d P(x) d P(y) = 0. Since k is translation invariant, this reduces to ZZ (ψ(0) ψ(x y)) d P(x) d P(y) = 0. By invoking Bochner s theorem (Wendland, 2005, Theorem 6.6), which states that ψ is the Fourier transform of a non-negative finite Borel measure Λ, i.e., ψ(x) = R e ix ω dΛ(ω), x Rd, we obtain (see (16) in the proof of Proposition 5 in Sriperumbudur et al. (2011)) ZZ ψ(x y) d P(x) d P(y) = Z |φP(ω)|2 dΛ(ω), thereby yielding Z (|φP(ω)|2 1) dΛ(ω) = 0, (33) where φP is the characteristic function of P. Note that φP is uniformly continuous and |φP| 1. Since k is characteristic, Theorem 9 in Sriperumbudur et al. (2010) implies that supp(Λ) = Rd, using which in (33) yields |φP(ω)| = 1 for all ω Rd. Since φP is positive definite on Rd, it follows from Sasv ari (2013, Lemma 1.5.1) that φP(ω) = e 1ω x for some x Rd and thus P = δx. 5.2 Proof of Theorem 7 Before we prove Theorem 7, we present Bernstein s inequality in separable Hilbert spaces, quoted from Yurinsky (1995, Theorem 3.3.4), which will be used to prove Theorem 7. Theorem 16 (Bernstein s inequality) Let (Ω, A, P) be a probability space, H be a separable Hilbert space, B > 0 and θ > 0. Furthermore, let ξ1, . . . , ξn : Ω H be zero mean independent random variables satisfying i=1 E ξi m H m! 2 θ2Bm 2. (34) Muandet, Sriperumbudur, Fukumizu, Gretton, and Sch olkopf Then for any τ > 0, (ξ1, . . . , ξn) : Proof (of Theorem 7) Consider α α = ˆ ˆ + ˆµ 2 H + µ 2 H = ˆ µ 2 H ˆµ 2 H ( ˆ + ˆµ 2 H)( + µ 2 H) = ˆ ( µ 2 H ˆµ 2 H) ( + µ 2 H)( ˆ + ˆµ 2 H) + ( ˆ ) ˆµ 2 H ( + µ 2 H)( ˆ + ˆµ 2 H) = α( µ 2 H ˆµ 2 H) ( + µ 2 H) + ( ˆ )(1 α) ( + µ 2 H) . Rearranging α, we obtain α α = α ( µ 2 H ˆµ 2 H) + (1 α )( ˆ ) ˆ + ˆµ 2 H . | α α | α | µ 2 H ˆµ 2 H| + (1 + α )| ˆ | ( + µ 2 H) ( µ 2 H ˆµ 2 H) + ( ˆ ) , (35) where it is easy to verify that | ˆ | |Ex, xk(x, x) ˆEk(x, x)| n + |ˆEk(x, x) Exk(x, x)| In the following we obtain bounds on |ˆEk(x, x) Exk(x, x)|, |Ex, xk(x, x) ˆEk(x, x)| and | µ 2 H ˆµ 2 H| when the kernel satisfies (17) and (18). Bound on |ˆEk(x, x) Exk(x, x)|: Since k is a continuous kernel on a separable topological space X, it follows from Lemma 4.33 of Steinwart and Christmann (2008) that H is separable. By defining ξi k(xi, xi) Exk(x.x), it follows from (18) that θ = nσ2 and B = κ2 and so by Theorem 16, for any τ > 0, with probability at least 1 2e τ, |ˆEk(x, x) Exk(x, x)| 2σ2 2τ n + 2κ2τ Bound on ˆµ µ H: By defining ξi k( , xi) µ and using (17), we have θ = nσ1 and B = κ1. Therefore, by Theorem 16, for any τ > 0, with probability at least 1 2e τ, 2σ2 1τ n + 2κ1τ Bound on | ˆµ 2 H µ 2 H|: Kernel Mean Shrinkage Estimators Since ˆµ 2 H µ 2 H ( ˆµ H + µ H) ˆµ µ H ( ˆµ µ H + 2 µ H) ˆµ µ H, it follows from (38) that for any τ > 0, with probability at least 1 2e τ, ˆµ 2 H µ 2 H D1 where (Di)4 i=1 are positive constants that depend only on σ2 1, κ and µ H, and not on n and τ. Bound on |ˆEk(x, x) Ex, xk(x, x)|: ˆEk(x, x) Ex, xk(x, x) = n2( ˆµ 2 H µ 2 H) + n(Exk(x, x) ˆEk(x, x)) + n( µ 2 H Exk(x, x)) n(n 1) , it follows from (37) and (39) that for any τ > 0, with probability at least 1 4e τ, |ˆEk(x, x) Ex, xk(x, x)| F1 where (Fi)5 i=1 and (F i)4 i=1 are positive constants that do not depend on n and τ. Bound on | α α |: Using (37) and (40) in (36), for any τ > 0, with probability at least 1 4e τ, | ˆ | F 1 n 3/2 + F 4 n using which in (35) along with (39), we obtain that for any τ > 0, with probability at least 1 4e τ, P4 i=1 Gi1α + Gi2 n (1 + α ) 1+τ θn P4 i=1 Gi1 + Gi2 n i/2 , (41) where θn + µ 2 H and (Gi1)4 i=1, (Gi2)4 i=1 are positive constants that do not depend on n and τ. Since α = + µ 2 H = Exk(x,x) Ex, xk(x, x) Exk(x,x)+(n 1)Ex, xk(x, x) = O(n 1) and θn = Exk(x,x)+(n 1) µ 2 H n = O(1) as n , it follows from (41) that | α α | = OP(n 3/2) as n . Bound on | ˆµ α µ H ˆµα µ H|: Using (38) and (41) in | ˆµ α µ H ˆµα µ H| ˆµ α ˆµα H | α α | ˆµ µ H + | α α | µ H, Muandet, Sriperumbudur, Fukumizu, Gretton, and Sch olkopf for any τ > 0, with probability at least 1 4e τ, we have | ˆµ α µ H ˆµα µ H| P6 i=1 G i1α + G i2 n (1 + α ) 1+τ θn P4 i=1 Gi1 + Gi2 n i/2 , (42) where (G i1)6 i=1 and (G i2)6 i=1 are positive constants that do not depend on n and τ. From (42), it is easy to see that | ˆµ α µ H ˆµα µ H| = OP(n 3/2) as n . Bound on E ˆµ α µ 2 H E ˆµα µ 2 H: ˆµ α µ 2 H ˆµα µ 2 H ( ˆµ α µ H + ˆµα µ H) | ˆµ α µ H ˆµα µ H| 2( ˆµ H + µ H) | ˆµ α µ H ˆµα µ H| 2( ˆµ µ H + 2 µ H) | ˆµ α µ H ˆµα µ H| , for any τ > 0, with probability at least 1 4e τ, ˆµ α µ 2 H ˆµα µ 2 H P8 i=1 G i1α + G i2 n (1 + α ) 1+τ θn P4 i=1 Gi1 + Gi2 P8 i=1 G i1α + G i2 n (1 + α ) 1+τ θn P4 i=1 Gi1 + Gi2 n , 0 < τ n 1 n 4 , τ n 1 , where γn H1α + H2 n (1 + α ), φn θn P4 i=1 Gi1 + Gi2 n i/2 and (Hi)2 i=1 are positive constants that do not depend on n and τ. In other words, P ˆµ α µ 2 H ˆµα µ 2 H > ϵ 4 exp 1 n ϵφn 2 , γn φn n ϵ γn 4 exp 1 n ϵφn E ˆµ α µ 2 H E ˆµα µ 2 H = Z 0 P ˆµ α µ 2 H ˆµα µ 2 H > ϵ dϵ γn φn n + 4 Z γn φn γn φn n exp = γn φn n + 2γn φn n e t t + 1 dt + 16eγn n t3e t dt. Kernel Mean Shrinkage Estimators Since R n 1 0 e t t+1 dt R 0 e t dt = 1 and R n t3e t dt R 0 t3e t dt = 6, we have E ˆµ α µ 2 H E ˆµα µ 2 H 3γn φn n + 96eγn The claim in (19) follows by noting that γn = O(n 1) and φn = O(1) as n . 5.3 Proof of Proposition 9 Define α λ λ+1 and φ(xi) k( , xi). Note that LOOCV (λ) 1 j =i φ(xj) φ(xi) n 1φ(xi) φ(xi) n 1 φ(xi), n(1 α) (n 1)2 2n(n α)(1 α) ˆµ 2 H + (n α)2 i=1 k(xi, xi) = 1 (n 1)2 α2(n2ρ 2nρ + ϱ) + 2nα(ρ ϱ) + n2(ϱ ρ) F(α) (n 1)2 . Since d dλLOOCV (λ) = (n 1) 2 d dλ = (n 1) 2(1 + λ) 2 d dαF(α), equating it zero yields (25). It is easy to show that the second derivative of LOOCV (λ) is positive implying that LOOCV (λ) is strictly convex and so λr is unique. 5.4 Proof of Theorem 10 Since ˆµλr = ˆµ 1+λr = (1 αr)ˆµ, we have ˆµλr µ H αr ˆµ H + ˆµ µ H. Note that αr = n(ϱ ρ) n(n 2)ρ + ϱ = n ˆ ˆ + (n 1) ˆµ 2 H = ˆEk(x, x) ˆEk(x, x) ˆEk(x, x) + (n 2)ˆEk(x, x) , where ˆ , ˆµ 2 H, ˆEk(x, x) and ˆEk(x, x) are defined in Theorem 7. Consider |αr α | |αr α|+| α α | where α is defined in (16). From Theorem 7, we have | α α | = OP(n 3/2) as n and αr α = ˆEk(x, x) ˆEk(x, x) ˆEk(x, x) + (n 2)ˆEk(x, x) ˆEk(x, x) ˆEk(x, x) 2ˆEk(x, x) + (n 2)ˆEk(x, x) = αˆEk(x, x) ˆEk(x, x) + (n 2)ˆEk(x, x) = ( α α )β + α β, where β ˆEk(x,x) ˆEk(x,x)+(n 2)ˆEk(x, x). Therefore, |αr α| | α α ||β|+α |β|, where α = O(n 1) as n , which follows from Remark 8(i). Since |ˆEk(x, x) Exk(x, x)| = OP(n 1/2) and Muandet, Sriperumbudur, Fukumizu, Gretton, and Sch olkopf |ˆEk(x, x) Ex, xk(x, x)| = OP(n 1/2), which follow from (37) and (40) respectively, we have |β| = OP(n 1) as n . Combining the above, we have |αr α| = OP(n 2), thereby yielding |αr α | = OP(n 3/2). Proceeding as in Theorem 7, we have | ˆµλr µ H ˆµα µ H| ˆµλr µα H |αr α | ˆµ µ H + |αr α | µ H, which from the above follows that | ˆµλr µ H ˆµα µ H| = OP(n 3/2) as n . By arguing as in Remark 8(i), it is easy to show that ˆµλr is a n-consistent estimator of µ. (27) follows by carrying out the analysis as in the proof of Theorem 7 verbatim by replacing α with αr, while (28) follows by appealing to Remark 8(ii). 5.5 Proof of Proposition 12 First note that for any i {1, . . . , n}, ˆΣXX k( , xi) = 1 j=1 k( , xj)k(xi, xj) = 1 with ki being the ith row of K. This implies for any a Rn, ˆΣXX Φa = ˆΣXX i=1 aik( , xi) i=1 ai ˆΣXX k( , xi) = 1 i=1 aiΦk i , where ( ) holds since ˆΣXX is a linear operator. Also, since Φ is a linear operator, we obtain ˆΣXX Φa = 1 To prove the result, let us define a (K + nλI) 1K1n and consider (ˆΣXX + λI)Φa (43) = n 1ΦKa + λΦa = Φ(n 1K + λI)a = 1 nΦK1n (43) = ˆΣXX Φ1n = ˆΣXX ˆµ. Multiplying to the left on both sides of the above equation by (ˆΣXX + λI) 1, we obtain Φ(K + nλI) 1K1n = (ˆΣXX + λI) 1 ˆΣXX ˆµ and the result follows by noting that (ˆΣXX + λI) 1 ˆΣXX = ˆΣXX (ˆΣXX + λI) 1. 5.6 Proof of Theorem 13 By Proposition 12, we have ˇµλ = (ˆΣXX + λI) 1 ˆΣXX ˆµ. Define µλ (ΣXX + λI) 1ΣXX µ. Let us consider the decomposition ˇµλ µ = (ˇµλ µλ) + (µλ µ) with ˇµλ µλ = (ˆΣXX + λI) 1(ˆΣXX ˆµ ˆΣXX µλ λµλ) ( ) = (ˆΣXX + λI) 1(ˆΣXX ˆµ ˆΣXX µλ ΣXX µ + ΣXX µλ) = (ˆΣXX + λI) 1 ˆΣXX (ˆµ µ) (ˆΣXX + λI) 1 ˆΣXX (µλ µ) +(ˆΣXX + λI) 1ΣXX (µλ µ), Kernel Mean Shrinkage Estimators where we used λµλ = ΣXX µ ΣXX µλ in ( ). By defining A(λ) µλ µ H, we have ˇµλ µ H (ˆΣXX + λI) 1 ˆΣXX (ˆµ µ) H + (ˆΣXX + λI) 1 ˆΣXX (µλ µ) H + (ˆΣXX + λI) 1ΣXX (µλ µ) H + A(λ) (ˆΣXX + λI) 1 ˆΣXX ( ˆµ µ H + A(λ)) + (ˆΣXX + λI) 1ΣXX A(λ) +A(λ), (44) where for any bounded linear operator B, B denotes its operator norm. We now bound (ˆΣXX + λI) 1ΣXX as follows. It is easy to show that (ˆΣXX + λI) 1ΣXX = I (ΣXX + λI) 1(ΣXX ˆΣXX ) 1 (ΣXX + λI) 1ΣXX (ΣXX + λI) 1(ΣXX ˆΣXX ) j (ΣXX + λI) 1ΣXX , where the last line denotes the Neumann series and therefore (ˆΣXX + λI) 1ΣXX (ΣXX + λI) 1(ΣXX ˆΣXX ) j (ΣXX + λI) 1ΣXX (ΣXX + λI) 1(ΣXX ˆΣXX ) j where we used (ΣXX + λI) 1ΣXX 1 and the fact that ΣXX and ˆΣXX are Hilbert Schmidt operators on H as ΣXX HS κ < and ˆΣXX HS κ < with κ being the bound on the kernel. Define η : X HS(H), η(x) = (ΣXX + λI) 1(ΣXX Σx), where HS(H) is the space of Hilbert-Schmidt operators on H and Σx k( , x) k( , x). Observe that E 1 n Pn i=1 η(xi) = 0. Also, for all i {1, . . . , n}, η(xi) HS (ΣXX + λI) 1 ΣXX Σx HS 2κ λ and E η(xi) 2 HS 4κ2 λ2 . Therefore, by Bernstein s inequality (see Theorem 16), for any τ > 0, with probability at least 1 2e τ over the choice of {xi}n i=1, (ΣXX + λI) 1(ΣXX ˆΣXX ) HS κ 2τ λ n + 2κτ 2τ + 1) λ n . 2τ+1) n , we obtain that (ΣXX + λI) 1(ΣXX ˆΣXX ) HS 1 2 and therefore (ˆΣXX + λI) 1ΣXX 2. Using this along with (ˆΣXX + λI) 1 ˆΣXX 1 and (38) in (44), we obtain that for any τ > 0 and λ κ 2τ+1) n , with probability at least 1 2e τ over the choice of {xi}n i=1, 2κτ + 4τ κ n + 4A(λ). (45) We now analyze A(λ). Since k is continuous and X is separable, H is separable (Steinwart and Christmann, 2008, Lemma 4.33). Also ΣXX is compact since it is Hilbert-Schmidt. The consistency result therefore follows from Sriperumbudur et al. (2013, Proposition A.2) which ensures A(λ) 0 as λ 0. The rate also follows from Sriperumbudur et al. (2013, Proposition A.2) which yields A(λ) Σ 1 XX µ Hλ, thereby obtaining ˇµλ µ H = OP(n 1/2) for λ = cn 1/2 with c > 0 being a constant independent of n. Muandet, Sriperumbudur, Fukumizu, Gretton, and Sch olkopf 5.7 Proof of Proposition 15 From Proposition 12, we have ˇµ( i) λ = (ˆΣ( i) + λI) 1 ˆΣ( i)ˆµ( i) where ˆΣ( i) 1 n 1 j =i k( , xj) k( , xj). and ˆµ( i) 1 n 1 P j =i k( , xj). Define a k( , xi). It is easy to verify that ˆΣ( i) = n n 1 and ˆµ( i) = n n 1 ˇµ( i) λ = n n 1 (ˆΣ + λ n I) a a which after using Sherman-Morrison formula3 reduces to ˇµ( i) λ = n n 1 (ˆΣ + λ n I) 1 + (ˆΣ + λ n I) 1(a a)(ˆΣ + λ n I) 1 n a, (ˆΣ + λ n I) 1a H where λ n n 1 n λ. Using the notation in the proof of Proposition 12, the following can be proved: (i) (ˆΣ + λ n I) 1 ˆΣˆµ = n 1Φ(K + λn I) 1K1. (ii) (ˆΣ + λ n I) 1 ˆΣa = Φ(K + λn I) 1ki. (iii) (ˆΣ + λ n I) 1a = nΦ(K + λn I) 1ei. Based on the above, it is easy to show that (iv) (ˆΣ + λ n I) 1(a a)ˆµ = (ˆΣ + λ n I) 1a a, ˆµ H = Φ(K + λn I) 1eik i 1. (v) (ˆΣ + λ n I) 1(a a)a = (ˆΣ + λ n I) 1a a, a H = nΦ(K + λn I) 1eik(xi, xi). (vi) (ˆΣ + λ n I) 1(a a)(ˆΣ + λ n I) 1 ˆΣˆµ = Φ(K + λn I) 1eik i (K + λn I) 1K1. (vii) (ˆΣ + λ n I) 1(a a)(ˆΣ + λ n I) 1 ˆΣa = nΦ(K + λn I) 1eik i (K + λn I) 1ki. (viii) (ˆΣ + λ n I) 1(a a)(ˆΣ + λ n I) 1(a a)ˆµ = nΦ(K + λn I) 1eik i (K + λn I) 1eik i 1. (ix) (ˆΣ+λ n I) 1(a a)(ˆΣ+λ n I) 1(a a)a = n2Φ(K+λn I) 1eik i (K+λn I) 1eik(xi, xi). (x) a, (ˆΣ + λ n I) 1a H = nk i (K + λn I) 1ei. Using the above in ˇµ( i) λ , we obtain ˇµ( i) λ = 1 n 1Φ(K + λn I) 1ci,λ. Substituting the above in (24) yields the result. 3. The Sherman-Morrison formula states that (A + uv ) 1 = A 1 A 1uv A 1 1+v A 1u where A is an invertible square matrix, u and v are column vectors such that 1 + v A 1u = 0. Kernel Mean Shrinkage Estimators 0.5 0 0.5 1 0.8 Shrinkage parameter (α) Risk (10000 iterations) µ = 1, Σ = I, n = 20, d = 1, f*=0 Detail at α=0 0.5 0 0.5 1 2 Shrinkage parameter (α) Risk (10000 iterations) µ = (1 0), Σ = I, n = 20, d = 2, f*=(0 0) Detail at α=0 0.5 0 0.5 1 3.1 Shrinkage parameter (α) Risk (10000 iterations) µ = (1 0 0), Σ = I, n = 20, d = 3, f*=(0 0 0) Detail at α=0 Figure 1: The comparison between standard estimator, ˆµ and shrinkage estimator, ˆµα (with f = 0) of the mean of the Gaussian distribution N(µ, Σ) on Rd where d = 1, 2, 3. 6. Experiments In this section, we empirically compare the proposed shrinkage estimators to the standard estimator of the kernel mean on both synthetic and real-world datasets. Specifically, we consider the following estimators: i) empirical/standard kernel mean estimator (KME), ii) KMSE whose parameter is obtained via empirical bound (B-KMSE), iii) regularized KMSE whose parameter is obtained via Proposition 9 (R-KMSE), and iv) spectral KMSE whose parameter is obtained via Proposition 15 (S-KMSE). 6.1 Synthetic Data Given the true data-generating distribution P and the i.i.d. sample X = {x1, x2, . . . , xn} from P, we evaluate different estimators using the loss function i=1 βik(xi, ) Ex P[k(x, )] where β is the weight vector associated with different estimators. Then, we can estimate the risk of the estimator by averaging over m independent copies of X, i.e., b R = 1 m Pm j=1 L(βj, Xj, P). To allow for an exact calculation of L(β, X, P), we consider P to be a mixture-of Gaussians distribution and k being one of the following kernel functions: i) linear kernel k(x, x ) = x x , ii) polynomial degree-2 kernel k(x, x ) = (x x + 1)2, iii) polynomial degree-3 kernel k(x, x ) = (x x + 1)3 and iv) Gaussian RBF kernel k(x, x ) = exp x x 2/2σ2 . We refer to them as LIN, POLY2, POLY3, and RBF, respectively. The analytic forms of Ex P[k(x, )] for Gaussian distribution are given in Song et al. (2008) and Muandet et al. (2012). Unless otherwise stated, we set the bandwidth parameter of the Gaussian kernel as σ2 = median xi xj 2 : i, j = 1, . . . , n , i.e., the median heuristic. Muandet, Sriperumbudur, Fukumizu, Gretton, and Sch olkopf 0.5 0 0.5 1 0.8 Shrinkage parameter (α) Risk (10000 iterations) µ = 1, Σ = I, n = 20, d = 1, f*=2 Detail at α=0 0.5 0 0.5 1 2 Shrinkage parameter (α) Risk (10000 iterations) µ = (1 0), Σ = I, n = 20, d = 2, f*=(2 0) Detail at α=0 0.5 0 0.5 1 3.1 Shrinkage parameter (α) Risk (10000 iterations) µ = (1 0 0), Σ = I, n = 20, d = 3, f*=(2 0 0) Detail at α=0 Figure 2: The risk comparison between standard estimator, ˆµ and shrinkage estimator, ˆµα (with f {2, (2, 0) , (2, 0, 0) }) of the mean of the Gaussian distribution N(µ, Σ) on Rd where d = 1, 2, 3. 6.1.1 Gaussian Distribution We begin our empirical studies by considering the simplest case in which the distribution P is a Gaussian distribution N(µ, I) on Rd where d = 1, 2, 3 and k is a linear kernel. In this case, the problem of kernel mean estimation reduces to just estimating the mean µ of the Gaussian distribution N(µ, I). We consider only shrinkage estimators of form ˆµα = αf +(1 α)ˆµ. The true mean µ of the distribution is chosen to be 1, (1, 0) , and (1, 0, 0) , respectively. Figure 1 depicts the comparison between the standard estimator and the shrinkage estimator, ˆµα when the target f is the origin. We can clearly see that even in this simple case, an improvement can be gained by applying a small shrinkage. Furthermore, the improvement becomes more substantial as we increase the dimensionality of the underlying space. Figure 2 illustrates similar results when f = 0 but f {2, (2, 0) , (2, 0, 0) }. Interestingly, we can still observe similar improvement, which demonstrates that the choice of target f can be arbitrary when no prior knowledge about µP is available. 6.1.2 Mixture of Gaussians Distributions To simulate a more realistic case, let y be a sample from P P4 i=1 πi N(θi, Σi). In the following experiments, the sample x is generated from the following generative process: x = y + ε, θij U( 10, 10), Σi W(2 Id, 7), ε N(0, 0.2 Id), where U(a, b) and W(Σ0, df) represent the uniform distribution and Wishart distribution, respectively. We set π = (0.05, 0.3, 0.4, 0.25) . The choice of parameters here is quite arbitrary; we have experimented using various parameter settings and the results are similar to those presented here. Figure 3(a) depicts the comparison between the standard kernel mean estimator and the shrinkage estimator, ˆµα when the kernel k is the Gaussian RBF kernel. For shrinkage estimator ˆµα, we consider f = C k(x, ) where C is a scaling factor and each element of x is a realization of uniform random variable on (0, 1). That is, we allow the target f to change Kernel Mean Shrinkage Estimators 1 0.5 0 0.5 1 1.5 2 0.038 Scaling Factor (C) Risk (10000 iterations) n = 20, d = 30 KME KMSE (f = C φ(x)) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0.6 Probability of Improvement (1000 iterations) The value of α as a proportion of upper bound 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0.2 Risk Difference ( α )/ Figure 3: (a) The risk comparison between ˆµ (KME) and ˆµ α (KMSE) where α = ˆ /( ˆ + f ˆµ 2 H). We consider when f = C k(x, ) where x is drawn uniformly from a pre-specified range and C is a scaling factor. (b) The probability of improvement and the risk difference as a function of shrinkage parameter α averaged over 1,000 iterations. As the value of α increases, we get more improvement in term of the risk, whereas the probability of improvement decreases as a function of α. depending on the value of C. As the absolute value of C increases, the target function f will move further away from the origin. The shrinkage parameter α is determined using the empirical bound, i.e., α = ˆ /( ˆ + f ˆµ 2 H). As we can see in Figure 3(a), the results reveal how important the choice of f is. That is, we may get substantial improvement over the empirical estimator if appropriate prior knowledge is incorporated through f , which in this case suggests that f should lie close to the origin. We intend to investigate the topic of prior knowledge in more detail in our future work. Previous comparisons between standard estimator and shrinkage estimator is based entirely on the notion of a risk, which is in fact not useful in practice as we only observe a single copy of sample from the probability distribution. Instead, one should also look at the probability that, given a single copy of sample, the shrinkage estimator outperforms the standard one in term of a loss. To this end, we conduct an experiment comparing the standard estimator and shrinkage estimator using the Gaussian RBF kernel. In addition to the risk comparison, we also compare the probability that the shrinkage estimator gives smaller loss than that of the standard estimator. To be more precise, the probability is defined as a proportion of the samples drawn from the same distribution whose shrinkage loss is smaller than the loss of the standard estimator. Figure 3(b) illustrates the risk difference ( α ) and the probability of improvement (i.e., the fraction of times α < ) as a function of shrinkage parameter α. In this case, the value of α is specified as a proportion of empirical upper bound 2 ˆ /( ˆ + ˆµ 2 H). The results suggest that the shrinkage parameter α controls the trade-offbetween the amount of improvement in terms of risk and the probability that the shrinkage estimator will improve upon the standard one. However, this trade-offonly holds up to a certain value of α. As α becomes too large, both the probability of improvement and the amount of improvement itself decrease, which coincides with the intuition given for the positive-part shrinkage estimators (cf. Section 2.2.1). Muandet, Sriperumbudur, Fukumizu, Gretton, and Sch olkopf 6.1.3 Shrinkage Estimators via Leave-One-Out Cross-Validation In addition to the empirical upper bound, one can alternatively compute the shrinkage parameter using leave-one-out cross-validation proposed in Section 3. Our goal here is to compare the B-KMSE, R-KMSE and S-KMSE on synthetic data when the shrinkage parameter λ is chosen via leave-one-out cross-validation procedure. Note that the only difference between B-KMSE and R-KMSE is the way we compute the shrinkage parameter. Figure 4 shows the empirical risk of different estimators using different kernels as we increase the value of shrinkage parameter λ (note that R-KMSE and S-KMSE in Figure 4 refer to those in (22) and (31) respectively). Here we scale the shrinkage parameter by the smallest non-zero eigenvalue γ0 of the kernel matrix K. In general, we find that R-KMSE and S-KMSE outperforms KME. Nevertheless, as the shrinkage parameter λ becomes large, there is a tendency that the specific shrinkage estimate might actually perform worse than the KME, e.g., see LIN kernel and outliers in Figure 4. The result also supports our previous observation regarding Figure 3(b), which suggests that it is very important to choose the parameter λ appropriately. To demonstrate the leave-one-out cross-validation procedure, we conduct similar experiments in which the parameter λ is chosen by the proposed LOOCV procedure. Figure 5 depicts the percentage of improvement (with respect to the empirical risk of the KME4) as we vary the sample size and dimension of the data. Clearly, B-KMSE, R-KMSE and S-KMSE outperform the standard estimator. Moreover, both R-KMSE and S-KMSE tend to outperform the B-KMSE. We can also see that the performance of S-KMSE depends on the choice of kernel. This makes sense intuitively because S-KMSE also incorporates the eigen-spectrum of K, whereas R-KMSE does not. The effects of both sample size and data dimensionality are also transparent from Figure 5. While it is intuitive to see that the improvement gets smaller with increase in sample size, it is a bit surprising to see that we can gain much more in high-dimensional input space, especially when the kernel function is non-linear, because the estimation happens in the feature space associated with the kernel function rather than in the input space. Lastly, we note that the improvement is more substantial in the large d, small n paradigm. 6.2 Real Data To evaluate the proposed estimators on real-world data, we consider several benchmark applications, namely, classification via Parzen window classifier, density estimation via kernel mean matching (Song et al., 2008), and discriminative learning on distributions (Muandet et al., 2012; Muandet and Sch olkopf, 2013). For some of these tasks we employ datasets from the UCI repositories. We use only real-valued features, each of which is normalized to have zero mean and unit variance. 6.2.1 Parzen Window Classifiers One of the oldest and best-known classification algorithms is the Parzen window classifier (Duda et al., 2000). It is easy to implement and is one of the powerful non-linear supervised 4. If we denote the loss of KME and KMSE as ℓKME and ℓKMSE, respectively, the percentage of improvement is calculated as 100 (ℓKME ℓKMSE)/ℓKME. Kernel Mean Shrinkage Estimators KME R KMSE S KMSE KME R KMSE S KMSE KME R KMSE S KMSE KME R KMSE S KMSE 2 KME R KMSE S KMSE KME R KMSE S KMSE KME R KMSE S KMSE KME R KMSE S KMSE KME R KMSE S KMSE KME R KMSE S KMSE KME R KMSE S KMSE KME R KMSE S KMSE 0 Figure 4: The average loss of KME (left), R-KMSE (middle) and S-KMSE (right) estimators with different values of shrinkage parameter. We repeat the experiments over 30 different distributions with n = 10 and d = 30. Muandet, Sriperumbudur, Fukumizu, Gretton, and Sch olkopf 0 20 40 60 80 0 Sample Size (d=20) Percentage of Improvement 0 20 40 60 80 10 Sample Size (d=20) 0 20 40 60 80 40 Sample Size (d=20) 0 20 40 60 80 20 Sample Size (d=20) B KMSE R KMSE S KMSE 0 20 40 60 5 Dimension (n=20) Percentage of Improvement 0 20 40 60 30 Dimension (n=20) 0 20 40 60 10 Dimension (n=20) 0 20 40 60 10 Dimension (n=20) Figure 5: The percentage of improvement compared to KME over 30 different distributions of B-KMSE, R-KMSE and S-KMSE with varying sample size (n) and dimension (d). For B-KMSE, we calculate α using (16), whereas R-KMSE and S-KMSE use LOOCV to choose λ. learning techniques. Suppose we have data points from two classes, namely, positive class and negative class. For positive class, we observe X {x1, x2, . . . , xn} X, while for negative class we have Y {y1, y2, . . . , ym} X. Following Shawe-Taylor and Cristianini (2004, Sec. 5.1.2), the Parzen window classifier is given by i=1 k(z, xi) 1 j=1 k(z, yj) + b = sgn (ˆµX(z) ˆµY(z) + b) , (46) where b is a bias term given by b = 1 2( ˆµY 2 H ˆµX 2 H). Note that f(z) is a threshold linear function in H with weight vector w = (1/n) Pn i=1 φ(xi) (1/m) Pm j=1 φ(yj) (see Shawe Taylor and Cristianini (2004, Sec. 5.1.2) for more detail). This algorithm is often referred to as the lazy algorithm as it does not require training. In brief, the classifier (46) assigns the data point z to the class whose empirical kernel mean ˆµ is closer to the feature map k(z, ) of the data point in the RKHS. On the other hand, we may view the empirical kernel mean ˆµX 1 n Pn i=1 k(xi, ) (resp. ˆµY 1 m Pm j=1 k(yj, )) as a standard empirical estimate, i.e., KME, of the true kernel mean representation of the class-conditional distribution P(X|Y = +1) (resp. P(X|Y = 1)). Given the improvement of shrinkage estimators over the empirical estimator of kernel mean, it is natural to expect Kernel Mean Shrinkage Estimators Dataset Classification Error Rate KME B-KMSE R-KMSE S-KMSE Climate Model 0.0348 0.0118 0.0348 0.0118 0.0348 0.0118 0.0348 0.0118 Ionosphere 0.2873 0.0343 0.2768 0.0359 0.2749 0.0341 0.2800 0.0367 Parkinsons 0.1318 0.0441 0.1250 0.0366 0.1157 0.0395 0.1309 0.0396 Pima 0.2951 0.0462 0.2921 0.0442 0.2937 0.0458 0.2943 0.0471 SPECTF 0.2583 0.0829 0.2597 0.0817 0.2263 0.0626 0.2417 0.0651 Iris 0.1079 0.0379 0.1071 0.0389 0.1055 0.0389 0.1040 0.0383 Wine 0.1301 0.0381 0.1183 0.0445 0.1161 0.0414 0.1183 0.0431 Table 1: The classification error rate of Parzen window classifier via different kernel mean estimators. The boldface represents the result whose difference from the baseline, i.e., KME, is statistically significant. that the performance of Parzen window classifier can be improved by employing shrinkage estimators of the true mean representation. Our goal in this experiment is to compare the performance of Parzen window classifier using different kernel mean estimators. That is, we replace ˆµX and ˆµY by their shrinkage counterparts and evaluate the resulting classifiers across several datasets taken from the UCI machine learning repository. In this experiment, we only consider the Gaussian RBF kernel whose bandwidth parameter is chosen by cross-validation procedure over a uniform grid σ [0.1, 2]. We use 30% of each dataset as a test set and the rest as a training set. We employ a simple pairwise coupling and majority vote for multi-class classification. We repeat the experiments 100 times and perform the paired-sample t-test on the results at 5% significance level. Table 1 reports the classification error rates of the Parzen window classifiers with different kernel mean estimators. Although the improvement is not substantial, we can see that the shrinkage estimators consistently give better performance than the standard estimator. 6.2.2 Density Estimation We perform density estimation via kernel mean matching (Song et al., 2008), wherein we fit the density Q = Pm j=1 πj N(θj, σ2 j I) to each dataset by the following minimization problem: min π,θ,σ ˆµ µQ 2 H subject to j=1 πj = 1, πj 0 . (47) The empirical mean map ˆµ is obtained from samples using different estimators, whereas µQ is the kernel mean embedding of the density Q. Unlike experiments in Song et al. (2008), our goal is to compare different estimators of µP (where P is the true data distribution), by replacing ˆµ in (47) with different shrinkage estimators. A better estimate of µP should lead to better density estimation, as measured by the negative log-likelihood of Q on the test set, which we choose to be 30% of the dataset. For each dataset, we set the number of mixture components m to be 10. The model is initialized by running 50 random initializations using Muandet, Sriperumbudur, Fukumizu, Gretton, and Sch olkopf the k-means algorithm and returning the best. We repeat the experiments 30 times and perform the paired sign test on the results at 5% significance level.5 The average negative log-likelihood of the model Q, optimized via different estimators, is reported in Table 2. In most cases, both R-KMSE and S-KMSE consistently achieve smaller negative log-likelihood when compared to KME. B-KMSE also tends to outperform the KME. However, in few cases the KMSEs achieve larger negative log-likelihood, especially when we use linear and degree-2 polynomial kernels. This highlight the potential of our estimators in a non-linear setting. 6.2.3 Discriminative Learning on Probability Distributions The last experiment involves the discriminative learning on a collection of probability distributions via the kernel mean representation. A positive semi-definite kernel between distributions can be defined via their kernel mean embeddings. That is, given a training sample (b P1, y1), . . . , (b Pm, ym) P { 1, +1} where b Pi := 1 ni Pni p=1 δxip and xi p Pi, the linear kernel between two distributions is approximated by ˆµPi, ˆµPj H = p=1 βi pφ(xi p), q=1 βj qφ(xj q) q=1 βi pβj qk(xi p, xj q), where the weight vectors βi and βj come from the kernel mean estimates of µPi and µPj, respectively. The non-linear kernel can then be defined accordingly, e.g., κ(Pi, Pj) = exp( ˆµPi ˆµPj 2 H/2σ2), see Christmann and Steinwart (2010). Our goal in this experiment is to investigate if the shrinkage estimators of the kernel mean improve the performance of discriminative learning on distributions. To this end, we conduct experiments on natural scene categorization using support measure machine (SMM) (Muandet et al., 2012) and group anomaly detection on a high-energy physics dataset using one-class SMM (OCSMM) (Muandet and Sch olkopf, 2013). We use both linear and non-linear kernels where the Gaussian RBF kernel is employed as an embedding kernel (Muandet et al., 2012). All hyper-parameters are chosen by 10-fold cross-validation.6 For our unsupervised problem, we repeat the experiments using several parameter settings and report the best results. Table 3 reports the classification accuracy of SMM and the area under ROC curve (AUC) of OCSMM using different kernel mean estimators. All shrinkage estimators consistently lead to better performance on both SMM and OCSMM when compared to KME. In summary, the proposed shrinkage estimators outperform the standard KME. While BKMSE and R-KMSE are very competitive compared to KME, S-KMSE tends to outperform both B-KMSE and R-KMSE, however, sometimes leading to poor estimates depending on the dataset and the kernel function. 5. The paired sign test is a nonparametric test that can be used to examine whether two paired samples have the same distribution. In our case, we compare B-KMSE, R-KMSE and S-KMSE against KME. 6. In principle one can incorporate the shrinkage parameter into the cross-validation procedure. In this work we are only interested in the value of λ returned by the proposed LOOCV procedure. Kernel Mean Shrinkage Estimators Dataset LIN POLY2 POLY3 RBF KME B-KMSE R-KMSE S-KMSE KME B-KMSE R-KMSE S-KMSE KME B-KMSE R-KMSE S-KMSE KME B-KMSE R-KMSE S-KMSE 1. ionosphere 39.878 40.038 39.859 39.823 34.651 34.352 34.390 34.009 35.943 35.575 35.543 34.617 41.601 40.976 40.817 41.229 2. sonar 72.240 72.044 72.198 72.157 100.420 99.573 97.844 97.783 72.294 71.933 72.003 71.835 98.540 95.815 93.458 93.010 3. Australian 18.277 18.280 18.294 18.293 18.357 18.381 18.391 18.429 18.611 18.463 18.466 18.495 19.428 19.325 19.418 19.393 4. specft 57.444 57.2808 57.218 57.224 67.018 66.979 66.431 66.391 59.585 58.969 60.006 60.616 65.674 65.138 65.039 64.699 5. wdbc 31.801 31.759 31.776 31.781 32.421 32.310 32.373 32.316 31.183 31.167 31.127 31.110 36.471 36.453 36.335 35.898 6. wine 16.019 16.000 16.039 16.009 17.070 16.920 16.886 16.960 16.393 16.300 16.309 16.202 17.569 17.546 17.498 17.498 7. satimage 25.258 25.317 25.219 25.186 24.214 24.111 24.132 24.259 25.284 25.276 25.239 25.263 23.741 23.753 23.728 24.384 8. segment 18.326 17.868 18.055 18.124 18.571 18.292 18.277 18.631 19.642 19.549 19.404 19.628 21.946 21.598 21.580 21.822 9. vehicle 16.633 16.519 16.521 16.499 16.096 15.998 16.031 16.041 16.288 16.278 16.281 16.263 18.260 18.056 18.119 17.911 10. svmguide2 27.298 27.273 27.281 27.276 27.812 28.030 27.985 27.975 28.014 28.177 28.321 28.250 28.132 28.122 28.119 28.020 11. vowel 12.632 12.626 12.629 12.656 12.532 12.471 12.479 12.472 13.069 13.061 13.056 13.054 13.526 13.486 13.462 13.453 12. housing 14.637 14.441 14.469 14.296 15.543 15.467 15.414 15.390 15.592 15.543 15.509 15.408 16.487 16.239 16.424 16.019 13. bodyfat 17.527 17.362 17.348 17.396 17.386 17.358 17.356 17.329 16.418 16.393 16.305 16.194 17.875 17.652 17.607 17.651 14. abalone 5.706 5.665 5.708 5.722 7.281 7.116 7.185 7.025 5.864 5.847 5.853 5.832 6.068 6.039 6.049 5.910 15. glass 9.245 9.211 9.198 9.217 8.571 8.473 8.457 8.414 9.050 8.991 9.012 8.737 9.606 9.605 9.575 9.573 Table 2: Average negative log-likelihood of the model Q on test points over 30 randomizations. The boldface represents the result whose difference from the baseline, i.e., KME, is statistically significant. Estimator Linear Kernel Non-linear Kernel SMM OCSMM SMM OCSMM KME 0.5432 0.6955 0.6017 0.9085 B-KMSE 0.5455 0.6964 0.6106 0.9088 R-KMSE 0.5521 0.6970 0.6303 0.9105 S-KMSE 0.5606 0.6970 0.6412 0.9063 Table 3: The classification accuracy of SMM and the area under ROC curve (AUC) of OCSMM using different estimators to construct the kernel on distributions. Muandet, Sriperumbudur, Fukumizu, Gretton, and Sch olkopf 7. Conclusion and Discussion Motivated by the classical James-Stein phenomenon, in this paper, we proposed a shrinkage estimator for the kernel mean µ in a reproducing kernel Hilbert space H and showed they improve upon the empirical estimator ˆµ in the mean squared sense. We showed the proposed shrinkage estimator µ (with the shrinkage parameter being learned from data) to be n-consistent and satisfies E µ µ 2 H < E ˆµ µ 2 H + O(n 3/2) as n . We also provided a regularization interpretation to shrinkage estimation, using which we also presented two shrinkage estimators, namely regularized shrinkage estimator and spectral shrinkage estimator, wherein the first one is closely related to µ while the latter exploits the spectral decay of the covariance operator in H. We showed through numerical experiments that the proposed estimators outperform the empirical estimator in various scenarios. Most importantly, the shrinkage estimators not only provide more accurate estimation, but also lead to superior performance on many real-world applications. In this work, while we focused mainly on an estimation of the mean function in RKHS, it is quite straightforward to extend the shrinkage idea to estimate covariance (and crosscovariance) operators and tensors in RKHS (see Appendix A for a brief description). The key observation is that the covariance operator can be viewed as a mean function in a tensor RKHS. Covariance operators in RKHS are ubiquitous in many classical learning algorithms such as kernel PCA, kernel FDA, and kernel CCA. Recently, a preliminary investigation with some numerical results on shrinkage estimation of covariance operators is carried out in Muandet et al. (2014a) and Wehbe and Ramdas (2015). In the future, we intend to carry out a detailed study on the shrinkage estimation of covariance (and cross-covariance) operators. Acknowledgments The authors thanks the reviewers and the action editor for their detailed comments that signficantly improved the manuscript. This work was partly done while Krikamol Muandet was visiting the Institute of Statistical Mathematics, Tokyo, and New York University, New York; and while Bharath Sriperumbudur was visiting the Max Planck Institute for Intelligent Systems, Germany. The authors wish to thank David Hogg and Ross Fedely for reading the first draft and giving valuable comments. We also thank Motonobu Kanagawa, Yu Nishiyama, and Ingo Steinwart for fruitful discussions. Kenji Fukumizu has been supported in part by MEXT Grant-in-Aid for Scientific Research on Innovative Areas 25120012. Appendix A. Shrinkage Estimation of Covariance Operator Let (HX, k X) and (HY , k Y ) be separable RKHSs of functions on measurable spaces X and Y, with measurable reproducing kernels k X and k Y (with corresponding feature maps φ and ϕ), respectively. We consider a random vector (X, Y ) : Ω X Y with distribution PXY . The marginal distributions of X and Y are denoted by PX and PY , respectively. If EXk X(X, X) < and EY k Y (Y, Y ) < , then there exists a unique cross-covariance Kernel Mean Shrinkage Estimators operator ΣYX : HX HY such that g, ΣYX f HY = EXY [(f(X) EX[f(X)])(g(Y ) EY [g(Y )])] = Cov(f(X), g(Y )) holds for all f HX and g HY (Baker, 1973; Fukumizu et al., 2004). If X is equal to Y , we obtain the self-adjoint operator ΣXX called the covariance operator. Given i.i.d sample {(xi, yi)}n i=1 from PXY , we can write the empirical cross-covariance operator bΣYX as i=1 φ(xi) ϕ(yi) ˆµX ˆµY (48) where ˆµX = 1 n Pn i=1 φ(xi) and ˆµY = 1 n Pn i=1 ϕ(yi).7 Let φ and ϕ be the centered version of the feature map φ and ϕ defined as φ(x) = φ(x) ˆµX and ϕ(y) = ϕ(y) ˆµY , respectively. Then, the empirical cross-covariance operator in (48) can be rewritten as i=1 φ(xi) ϕ(yi), and therefore a shrinkage estimator of ΣYX , e.g., an equivalent of B-KMSE, can be constructed based on the ideas presented in this paper. That is, by the inner product property in product space, we have φ(x) ϕ(y), φ(x ) ϕ(y ) HX HY = φ(x), φ(x ) HX ϕ(y), ϕ(y ) HY = k X(x, x ) k Y (y, y ). where k X and k Y denote the centered kernel functions. As a result, we can obtain the shrinkage estimators for ΣYX by plugging the above kernel into the KMSEs. Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337 404, 1950. Charles R. Baker. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:pp. 273 289, 1973. Frank Bauer, Sergei Pereverzev, and Lorenzo Rosasco. On regularization algorithms in learning theory. Journal of Complexity, 23(1):52 72, 2007. ISSN 0885-064X. James Berger and Robert Wolpert. Estimating the mean function of a Gaussian process and the Stein effect. Journal of Multivariate Analysis, 13(3):401 424, 1983. James O. Berger. Admissible minimax estimation of a multivariate normal mean with arbitrary quadratic loss. Annals of Statistics, 4(1):223 226, 1976. 7. Although it is possible to estimate ˆµX and ˆµY using our shrinkage estimators, the key novelty here is to directly shrink the centered covariance operator. Muandet, Sriperumbudur, Fukumizu, Gretton, and Sch olkopf Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2004. Andreas Christmann and Ingo Steinwart. Universal kernels on Non-Standard input spaces. In Advances in Neural Information Processing Systems (NIPS), pages 406 414. 2010. Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis. Kernel k-means: Spectral clustering and normalized cuts. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 551 556, New York, NY, USA, 2004. Joseph Diestel and John J. Uhl. Vector Measures. American Mathematical Society, Providence, 1977. Nicolae Dinculeanu. Vector Integration and Stochastic Integration in Banach Spaces. Wiley, 2000. Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification (2nd Edition). Wiley-Interscience, 2000. Bradley Efron and Carl N. Morris. Stein s paradox in statistics. Scientific American, 236 (5):119 127, 1977. Kenji Fukumizu, Francis R. Bach, and Michael I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73 99, 2004. Kenji Fukumizu, Francis R. Bach, and Arthur Gretton. Statistical consistency of kernel canonical correlation analysis. Journal of Machine Learning Research, 8:361 383, 2007. Kenji Fukumizu, Le Song, and Arthur Gretton. Kernel Bayes rule. In Advances in Neural Information Processing Systems (NIPS), pages 1737 1745. 2011. Arthur Gretton, Karsten M. Borgwardt, Malte Rasch, Bernhard Sch olkopf, and Alexander J. Smola. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems (NIPS), 2007. Arthur Gretton, Kenji Fukumizu, Choon Hui Teo, Le Song, Bernhard Sch olkopf, and Alexander J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, pages 585 592. MIT Press, 2008. Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Sch olkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723 773, 2012. Marvin Gruber. Improving Efficiency by Shrinkage: The James-Stein and Ridge Regression Estimators. Statistics Textbooks and Monographs. Marcel Dekker, 1998. Steffen Gr unew alder, Guy Lever, Arthur Gretton, Luca Baldassarre, Sam Patterson, and Massimiliano Pontil. Conditional mean embeddings as regressors. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012. Kernel Mean Shrinkage Estimators Steffen Gr unew alder, Arthur Gretton, and John Shawe-Taylor. Smooth operators. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013. W. James and James Stein. Estimation with quadratic loss. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, pages 361 379. University of California Press, 1961. Joo Seuk Kim and Clayton D. Scott. Robust kernel density estimation. Journal of Machine Learning Research, 13:2529 2565, Sep 2012. Avi Mandelbaum and L. A. Shepp. Admissibility as a touchstone. Annals of Statistics, 15 (1):252 268, 1987. Krikamol Muandet and Bernhard Sch olkopf. One-class support measure machines for group anomaly detection. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI). AUAI Press, 2013. Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, and Bernhard Sch olkopf. Learning from distributions via support measure machines. In Advances in Neural Information Processing Systems (NIPS), pages 10 18. 2012. Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Arthur Gretton, and Bernhard Sch olkopf. Kernel mean estimation and Stein effect. In ICML, pages 10 18, 2014a. Krikamol Muandet, Bharath Sriperumbudur, and Bernhard Sch olkopf. Kernel mean estimation via spectral filtering. In Advances in Neural Information Processing Systems 27, pages 1 9. Curran Associates, Inc., 2014b. Nicolas Privault and Anthony Rveillac. Stein estimation for the drift of Gaussian processes using the Malliavin calculus. Annals of Statistics, 36(5):2531 2550, 2008. Carl E. Rasmussen and Christopher Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. Michael Reed and Barry Simon. Functional Analysis. Academic Press, New York, 1972. Zolt an Sasv ari. Multivariate Characteristic and Correlation Functions. De Gruyter, Berlin, Germany, 2013. Bernhard Sch olkopf, Alexander Smola, and Klaus-Robert M uller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299 1319, July 1998. Bernhard Sch olkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory and and 5th European Conference on Computational Learning Theory, COLT 01/Euro COLT 01, pages 416 426, London, UK, UK, 2001. Springer-Verlag. John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004. Muandet, Sriperumbudur, Fukumizu, Gretton, and Sch olkopf Alexander J. Smola, Arthur Gretton, Le Song, and Bernhard Sch olkopf. A Hilbert space embedding for distributions. In Proceedings of the 18th International Conference on Algorithmic Learning Theory (ALT), pages 13 31. Springer-Verlag, 2007. Le Song, Xinhua Zhang, Alex Smola, Arthur Gretton, and Bernhard Sch olkopf. Tailoring density estimation via reproducing kernel moment matching. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 992 999, 2008. Le Song, Byron Boots, Sajid M. Siddiqi, Geoffrey Gordon, and Alexander J. Smola. Hilbert space embeddings of hidden Markov models. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010. Le Song, Ankur P. Parikh, and Eric P. Xing. Kernel embeddings of latent tree graphical models. In Advances in Neural Information Processing Systems (NIPS), pages 2708 2716, 2011. Bharath Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Gert Lanckriet, and Bernhard Sch olkopf. Injective Hilbert space embeddings of probability measures. In The 21st Annual Conference on Learning Theory (COLT), 2008. Bharath Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Sch olkopf, and Gert Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 99:1517 1561, 2010. Bharath Sriperumbudur, Kenji Fukumizu, and Gert Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12: 2389 2410, 2011. Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Sch olkopf, and Gert R. G. Lanckriet. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550 1599, 2012. doi: 10.1214/12-EJS722. Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Aapo Hyv arinen, and Revant Kumar. Density estimation in infinite dimensional exponential families. 2013. http: //arxiv.org/pdf/1312.3516. Charles Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the 3rd Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 197 206. University of California Press, 1955. Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, New York, 2008. A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, Cambridge, 1998. Vladimir Vapnik. Statistical learning theory. Wiley, 1998. ISBN 978-0-471-03003-4. Larry Wasserman. All of Nonparametric Statistics. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006. Kernel Mean Shrinkage Estimators Leila Wehbe and Aaditya Ramdas. Nonparametric independence testing for small sample sizes. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI), pages 3777 3783, July 2015. Holger Wendland. Scattered Data Approximation. Cambridge University Press, Cambridge, UK, 2005. Vadim Yurinsky. Sums and Gaussian Vectors, volume 1617 of Lecture Notes in Mathematics. Springer-Verlag, Berlin, 1995.