Journal of Machine Learning Research 26 (2025) 1-48 Submitted 2/23; Revised 3/25; Published 3/25

Variance-Aware Estimation of Kernel Mean Embedding

Geoffrey Wolfer geo.wolfer@aoni.waseda.jp
Center for Data Science, Waseda University, 1-6-1 Nishiwaseda, Shinjuku-ku, Tokyo 169-8050, Japan

Pierre Alquier alquier@essec.edu
ESSEC Business School, Asia-Pacific campus, 5 Nepal Park, 575749, Singapore

Editor: Bharath Sriperumbudur

Abstract

An important feature of kernel mean embeddings (KME) is that the rate of convergence of the empirical KME to the true distribution KME can be bounded independently of the dimension of the space, properties of the distribution and smoothness features of the kernel. We show how to speed up convergence by leveraging variance information in the reproducing kernel Hilbert space. Furthermore, we show that even when such information is a priori unknown, we can efficiently estimate it from the data, recovering the desiderata of a distribution-agnostic bound that enjoys acceleration in fortuitous settings. We further extend our results from independent data to stationary mixing sequences and illustrate our methods in the context of hypothesis testing and robust parametric estimation.

Keywords: kernel mean embedding, maximum mean discrepancy, distribution learning, empirical Bernstein

1. Introduction

Estimating a probability distribution P over a space X from n iid samples X1, . . . , Xn ∼ P is a central problem in computer science and statistics (Kearns et al., 1994; Tsybakov, 2008). To formalize the question, one selects a distance (or at least a contrast function) between distributions and oftentimes introduces assumptions on the underlying probability space (for instance finitely supported, probability function absolutely continuous with respect to the Lebesgue measure, Hölder continuous, . . . ). Increasing the stringency of assumptions generally leads to substantially faster minimax rates.
For instance, in the finite support case |X| < ∞ and with respect to total variation, whilst the expectation risk evolves roughly as √(|X|/n), it is known that this rate can be sharpened by replacing |X| with a bound on the half-norm¹ ‖P‖_{1/2} := (Σ_{x∈X} √P(x))² (Berend and Kontorovich, 2013), when defined, and which corresponds to some measure of entropy of the underlying distribution. Furthermore, even without prior knowledge of ‖P‖_{1/2}, one can construct confidence intervals around P that depend on ‖P̂n‖_{1/2}, the half-norm of the empirical distribution P̂n(x) = (1/n) Σ_{t=1}^n δ[Xt = x] (Cohen et al., 2020, Theorem 2.1). Namely, when P is supported on N, with probability at least 1 − δ, it holds that

‖P̂n − P‖_TV ≤ (1/√n) ( √‖P̂n‖_{1/2} + √(2 log(2/δ)) ).   (1)

The advantage of the above expression is twofold: (a) we do not need to make assumptions on ‖P‖_{1/2}, which could be non-convergent², and (b) in favorable cases where ‖P̂n‖_{1/2} is small, the intervals will be narrower. In this paper, we set out to explore the question of whether analogues of (1) are possible for general probability spaces and with respect to maximum mean discrepancy.

1. The half-norm is not a norm in the proper sense as it violates the triangle inequality.

©2025 Geoffrey Wolfer and Pierre Alquier. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/23-0161.html.

1.1 Notation and Background

The set of integers up to n ∈ N is denoted by [n] := {1, 2, . . . , n}. Let X be a separable topological space, and P(X) the set of all Borel probability measures over X. For a bounded function f : X → R, we define for convenience

f̄ := sup_{x∈X} f(x),  f̲ := inf_{x∈X} f(x),  Δf := f̄ − f̲.   (2)

Let k : X × X → R be a continuous positive definite kernel and Hk be the associated reproducing kernel Hilbert space (RKHS) (Berlinet and Thomas-Agnan, 2011). We assume that the kernel is bounded³ in the sense that sup_{x∈X} k(x, x) < ∞.
Letting P ∈ P(X), the kernel mean embedding (KME) of P is defined as

µP := E_{X∼P} [k(X, ·)] = ∫_X k(x, ·) dP(x) ∈ Hk,

which is interpreted as a Bochner integral (Diestel and Uhl, 1977, Chapter 2). Let n ∈ N and X1, . . . , Xn be a sequence of observations sampled independently⁴ from P. We write

µ̂P(X1, . . . , Xn) = µ̂P := (1/n) Σ_{t=1}^n k(Xt, ·) ∈ Hk,   (3)

for the KME of the empirical distribution P̂n := (1/n) Σ_{t=1}^n δ_{Xt}, and call the distance ‖µ̂P − µP‖_{Hk} the maximum mean discrepancy (MMD) between the true mean embedding and its empirical estimator. A kernel is called characteristic if the map µ : P ↦ µP is injective. This property ensures that ‖µP − µP′‖_{Hk} = 0 if and only if P = P′. A kernel is called translation invariant⁵ (TI) when there exists a positive definite function ψ : X → R such that for any x, x′ ∈ X,

k(x, x′) = ψ(x′ − x).   (4)

In particular, ψ̄ = ψ(0) = k̄. When X = R^d, a kernel k is said to be a radial basis function (RBF) when for any x, x′ ∈ X, k(x, x′) = φ(‖x′ − x‖₂) for some function φ : R₊ → R. Noticeably, k being positive definite does not preclude it from taking negative values. However, when k is RBF, the following lower bound on φ holds (see for instance Genton 2001; Stein 1999),

φ̲ ≥ φ̄ inf_{t≥0} 2^{(d−2)/2} Γ(d/2) t^{−(d−2)/2} J_{(d−2)/2}(t),

where Γ is the Gamma function and J_β is the Bessel function of the first kind of order β, showing that |φ̲| becomes evanescent as the dimension increases.

2. As opposed to the empirical proxy ‖P̂n‖_{1/2}, which is always a finite quantity.
3. Note that the supremum of the kernel is always reached on the diagonal, that is, sup_{(x,x′)∈X²} k(x, x′) = sup_{x∈X} k(x, x).
4. Unless otherwise specified, the data will be considered iid. We will address the case of dependent data in Section 5.
5. Also sometimes referred to as an anisotropic stationary kernel.

1.2 Related Work

From an asymptotic standpoint, the weak law of large numbers asserts that µ̂P converges to the true µP.
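As a concrete aside (not part of the paper's own exposition), the empirical KME (3) is simply an average of kernel sections and can be evaluated pointwise from a sample. The sketch below assumes a Gaussian kernel; the helper names `gaussian_kernel` and `empirical_kme` are ours.

```python
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 gamma^2)), a bounded characteristic TI kernel
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2, axis=-1) / (2 * gamma**2))

def empirical_kme(sample, gamma=1.0):
    # Returns the function x -> (1/n) sum_t k(X_t, x), i.e. the empirical KME of (3)
    sample = np.asarray(sample, dtype=float)

    def mu_hat(x):
        return float(np.mean(gaussian_kernel(sample, x, gamma)))

    return mu_hat
```

Since the Gaussian kernel is bounded by 1, the empirical KME evaluated at any point always lies in (0, 1].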
Furthermore, √n (µ̂P − µP) converges in distribution to a zero-mean Gaussian process on Hk (Berlinet and Thomas-Agnan, 2011, Section 9.1). This work, however, is more concerned with the finite sample theory, and more specifically with the rate of convergence of µ̂P towards µP with respect to the RKHS norm. Conveniently and perhaps surprisingly, it is possible to derive a rate that depends neither on the smoothness of the considered kernel k, nor on the properties of the true distribution. To obtain a distribution-independent rate in O_P(n^{−1/2}), a typical strategy (see for example Lopez-Paz et al. 2015, Section B.1) consists in first expressing the dual relationship between the norm in the RKHS and the uniform norm of an empirical process,

‖µ̂P − µP‖_{Hk} = sup_{f∈Hk, ‖f‖_{Hk}≤1} ⟨f, µ̂P − µP⟩_{Hk} = sup_{f∈Hk, ‖f‖_{Hk}≤1} ( (1/n) Σ_{t=1}^n f(Xt) − E f ).   (5)

A classical symmetrization argument (Mohri et al., 2018) followed by an application of McDiarmid's inequality (McDiarmid, 1989) yields that with probability at least 1 − δ,

sup_{f∈Hk, ‖f‖_{Hk}≤1} ( (1/n) Σ_{t=1}^n f(Xt) − E f ) ≤ 2 Rn + √(2 k̄ log(1/δ)/n),   (6)

where Rn is the Rademacher complexity (Mohri et al., 2018, Definition 3.1) of the class of unit-norm functions in the RKHS. The bound Rn² ≤ k̄/n (Bartlett and Mendelson, 2002) and an application of Jensen's inequality conclude the claim. With a more careful analysis, Tolstikhin et al. (2017, Proposition A.1) halved the constant of the first term in (6), hence showed that with probability at least 1 − δ,

‖µ̂P − µP‖_{Hk} ≤ √(k̄/n) + √(2 k̄ log(1/δ)/n).   (7)

What is more, Tolstikhin et al. (2017, Theorems 1, 6, 8) provide corresponding lower bounds in Ω_P(n^{−1/2}), showing that the embedding of the empirical measure achieves an unimprovable rate of convergence. Results similar to (7) (with worse constants) can be derived when the observations are not independent, see Chérief-Abdellatif and Alquier (2022, Lemma 7.2). We note that other estimators have been proposed in the literature for the case where more information is available about the underlying distribution P.
See for instance Muandet et al. (2014, 2016), who propose a shrinkage estimator inspired by the James-Stein estimator. In this work, we stress that our estimator will be the KME of the empirical distribution, defined in (3).

1.3 Main Contributions

In this paragraph, we briefly summarize our findings. The tilde notation suppresses constants and logarithmic factors in 1/δ, and the reader is referred to Theorem 1 and Theorem 12 for the full expression of the second-order terms. Our first contribution (Theorem 1) consists in deriving a bound on ‖µ̂P − µP‖_{Hk} that can leverage additional information about the underlying distribution P and the selected kernel k. Namely, we show that with probability at least 1 − δ,

‖µ̂P − µP‖_{Hk} ≤ √(2 v_k(P) log(2/δ)/n) + Õ(1/n),

where v_k(P), defined in (8), corresponds to some notion of variance in the RKHS. Notably, it holds that v_k(P) ≤ k̄, hence our upper bound is able to recover known bounds in most settings and to improve on them in favorable ones (Remark 2). Our chief technical contribution (Theorem 12) is in establishing that, at least when k is translation invariant, the above confidence interval can be made independent of the underlying distribution by replacing the variance with an empirical proxy. Specifically, with probability at least 1 − δ,

‖µ̂P − µP‖_{Hk} ≤ √(2 v̂_k(X1, . . . , Xn) log(4/δ)/n) + Õ(1/n),

where v̂_k, defined in (12), (13), is a quantity that only depends on the chosen kernel and the sample. We also extend the latter result to kernels without the translation invariance property in Theorem 29. Expanding beyond the iid setting, we obtain convergence rates for stationary mixing sequences. For φ-mixing processes, Theorem 17 establishes the following rate of convergence,

‖µ̂P − µP‖_{Hk} ≤ √(2 (v_k(P) + Σn) ‖Γ‖₂² log(1/δ)/n) + Õ(1/n),

where Σn is a total measure of covariance between observations in the RKHS (refer to Lemma 16) and ‖Γ‖₂ is the spectral norm of a coupling matrix defined in (17). We additionally analyze the broader class of β-mixing processes in Theorem 18.
1.4 Outline

In Section 2, for the problem of estimating a distribution with respect to maximum mean discrepancy, we give a convergence rate with a dominant term in O_P(n^{−1/2}) involving a variance term v which depends on both the chosen kernel k and the underlying distribution P. We then illustrate how this rate can beat the minimax lower bounds when v is favorable. In particular, we show that for large collections of kernels, and when X ⊆ R^d, the variance v can be controlled by a quantity that decouples the influence of the kernel from a measure of total variance of P. In Section 3, we proceed to show that even if v is unknown, it is possible to efficiently estimate it from the data with an empirical Bernstein approach whenever the kernel is translation invariant, or at least enjoys a mildly varying diagonal. In Section 4, we illustrate the benefits of the variance-aware bounds on the classical two-sample test problem. In Section 5, we establish convergence rates for stationary φ- and β-mixing sequences. In Section 6.2, we put our methods into practice, first in the context of hypothesis testing, and second by improving the results of Briol et al. (2019) and Chérief-Abdellatif and Alquier (2022) in the context of robust parametric maximum mean discrepancy estimation. All proofs are deferred to Section 7 for clarity of the exposition.

2. Variance-Aware Convergence Rates

The central quantity we propose to consider is the following variance term in the RKHS,

v_k(P) := E_{X∼P} ‖k(X, ·) − µP‖²_{Hk}.   (8)

It is clear that v_k(P) depends both on the choice of kernel k and on the underlying distribution, and we will simply write v = v_k(P) to avoid encumbering notation. Simple calculations (refer to Lemma 16) show that

E ‖µ̂P − µP‖²_{Hk} = v/n,

where the expectation is taken with respect to an independent sample X1, . . .
, Xn ∼ P, thus by applications of Jensen's and Chebyshev's inequalities, we readily obtain a rate on the expected risk in terms of v,

E ‖µ̂P − µP‖_{Hk} ≤ √(v/n),

and a deviation bound

P ( ‖µ̂P − µP‖_{Hk} > (1 + τ) √(v/n) ) ≤ 1/(1 + τ)²,  τ ∈ (0, ∞).

One basic feature of RKHS is that new kernels can be constructed by linearly combining existing kernels; see for instance Gretton (2013). It is therefore noteworthy that v is linear with respect to reproducing kernels. Indeed, reformulate

v = E_{X∼P} [k(X, X)] − E_{X,X′∼P} [k(X, X′)],

and suppose that k = Σ_{ki∈K} αi ki, where K = {k1, . . . , k_{|K|}} is a collection of reproducing kernels and α ∈ R^{|K|}. It then holds that

v_k = Σ_{ki∈K} αi v_{ki}.

Moving on to high-probability confidence bounds, an application of Bernstein's inequality in Hilbert spaces (Pinelis and Sakhanenko, 1985; Yurinsky, 1995) not only recovers the rate of convergence, as alluded to by Tolstikhin et al. (2017), but in fact yields a maximal inequality.

Theorem 1 (Variance-aware confidence interval) Let X1, . . . , Xn ∼iid P, let k : X × X → R be a reproducing kernel, and let v := E_{X∼P} ‖k(X, ·) − µP‖²_{Hk}. With probability at least 1 − δ, it holds that

max_{t∈[n]} t ‖µ̂P(X1, . . . , Xt) − µP‖_{Hk} ≤ √(2 v n log(2/δ)) + 4 √k̄ log(2/δ),

where k̄ = sup_{x∈X} k(x, x). In particular, with probability at least 1 − δ, it holds that

‖µ̂P(X1, . . . , Xn) − µP‖_{Hk} ≤ B_{k,δ}(P, n),  B_{k,δ}(P, n) = B_δ := √(2 v log(2/δ)/n) + 4 √k̄ log(2/δ)/n.

Remark 2 Since the reproducing kernel is bounded, it is always the case that

v ≤ E_{X∼P} ‖k(X, ·)‖²_{Hk} = E_{X∼P} k(X, X) ≤ k̄,

where the first inequality can be found for example in (Chérief-Abdellatif and Alquier, 2022, Lemma 7.1, Proof). As a result, Theorem 1 strictly supersedes (7) as soon as n is larger than a moderate threshold that depends only on δ. For instance, for δ = 0.05, it suffices that n ≥ 46.

Remark 3 The Bernstein approach, in contrast to bounded differences, has the additional advantage of yielding a maximal inequality, further opening the door for early stopping methods.
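To make the variance term concrete: on a finite space, v = E k(X, X) − E k(X, X′) can be computed exactly from the Gram matrix of the support. A minimal sketch (the function name is ours), which also illustrates the inequality v ≤ k̄ of Remark 2:

```python
import numpy as np

def variance_term(probs, K):
    # v = E[k(X, X)] - E[k(X, X')] for P supported on finitely many points,
    # where K[i, j] = k(x_i, x_j) is the kernel Gram matrix of the support.
    probs = np.asarray(probs, dtype=float)
    K = np.asarray(K, dtype=float)
    return float(probs @ np.diag(K) - probs @ K @ probs)

# For K = I (see Section 2.3), v = 1 - ||P||_2^2: uniform P on two points gives 1/2.
v = variance_term([0.5, 0.5], np.eye(2))
```

Here k̄ = 1, so `v` must lie below 1, in accordance with Remark 2.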
Remark 4 (Sharpness of the constants) We note but do not pursue that there exist other families of concentration inequalities which are known to dominate Bennett- and Bernstein-type inequalities. For instance, the class of inequalities pioneered by Bentkus et al. (2006), with empirical counterparts derived by Kuchibhotla and Zheng (2021), which is known to be nearly optimal for sample averages of independent observations from a log-concave probability distribution. We also mention the family of inequalities obtained by Waudby-Smith and Ramdas (2024) based on martingale methods. However, our problem requires a bound for norms of sums of vectors in Hilbert spaces, or alternatively for the supremum of empirical processes, which we could not locate in the aforementioned references. Additionally, even in the simpler case of sample averages, computation of the bound requires effort, thus a concentration bound in the more complicated Hilbert space setting could be challenging to compute. Finally, Bernstein-type bounds have known extensions for the case of time-dependent data, enabling us to analyze the estimation problem from stationary mixing sequences of observations (refer to Section 5).

We immediately observe that:

(O1) While the bound in (7) depends solely on chosen quantities and is computable without any knowledge of P, this is not the case for B_δ in Theorem 1.

(O2) Perhaps even more concerning, it is a priori unclear how v depends on properties of k and P, thus making it difficult to convert assumptions on P into a bound for v.

We defer (O1) to Section 3 and first address (O2) by pointing out that when X = R^d, and for numerous hyper-parametrized families of kernels, it is possible to promptly obtain upper bounds on v that decouple the influence of the hyper-parameter and some measure of spread of the underlying distribution.

2.1 Gaussian Kernel

For a TI kernel (see (4)), we can rewrite v as v = ψ̄ − E_{X,X′∼P} ψ(X − X′).
The Gaussian kernel with lengthscale parameter γ > 0, defined by

k_γ(x, x′) = exp( −‖x − x′‖₂² / (2γ²) ),

is the prototypical example of a characteristic translation invariant kernel, and satisfies ψ̄ = 1. For x ∈ R^d, let x^i denote the ith component of x. The function z ↦ e^{−z} is convex, thus from Jensen's inequality,

v_γ = E_{X∼P} ‖k_γ(X, ·) − µP‖²_{H_{kγ}} ≤ 1 − exp( −E_{X,X′∼P} ‖X − X′‖₂² / (2γ²) ).

We can further rewrite the expectation on the right side of this inequality as

E_{X,X′∼P} ‖X − X′‖₂² = Σ_{i=1}^d E_{X,X′∼P} (X^i − X′^i)² = Σ_{i=1}^d 2 V_{X∼P} X^i = 2 Tr Σ_P,

where Σ_P := V_{X∼P} X = E_{X∼P} [(X − EX)(X − EX)^⊤] is the covariance matrix of P, and its trace is interpreted as a measure of total variance, agnostic to correlations between distinct components. As a result, for any fixed γ > 0, we obtain from Theorem 1 that with probability at least 1 − δ,

‖µ̂P − µP‖_{H_{kγ}} ≤ √( 2 (1 − e^{−Tr Σ_P/γ²}) log(2/δ)/n ) + 4 log(2/δ)/n.   (10)

While γ has no influence over the right-hand side in (7), it is clear that for γ → ∞, the rate of convergence will be accelerated in (10).

Example 1 (Gaussian location model with known variance) Assume that P = P_θ = N(θ, σ²I_d), the Gaussian distribution with unknown location parameter θ ∈ R^d, but with known covariance matrix σ²I_d. It holds that Tr Σ_{P_θ} = σ²d. (i) Fixing γ and taking σ → 0, the bound in (10) vanishes, unlike (7). (ii) Setting γ² = λσ²d with λ > 0, we readily obtain the variance upper bound v ≤ 1 − e^{−1/λ} ≤ 1/λ, enabling uncomplicated tuning of the convergence rate by λ.

Remark 5 Example 1 highlights that if we allow the lengthscale parameter to vary with the sample size n, Theorem 1 speeds up the convergence rate dramatically. For instance, setting λ_n = −1/log(1 − 1/n), we achieve a rate in O_P(n^{−1}), subverting the lower bounds of Tolstikhin et al. (2017). In fact, we can reach any prescribed rate of convergence between O_P(n^{−1/2}) and O_P(n^{−1}) for a suitable choice of λ_n. This stands in sharp contrast with bounds obtained through a bounded differences approach such as (7).
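The bound (10) rests on the inequality v ≤ 1 − exp(−Tr Σ_P/γ²). A Monte Carlo sketch of this inequality for a Gaussian P (all parameter values below are illustrative choices of ours, not from the text):

```python
import numpy as np

# Monte Carlo check of v <= 1 - exp(-Tr(Sigma_P)/gamma^2) for the Gaussian kernel,
# with P = N(0, sigma^2 I_d).
rng = np.random.default_rng(0)
d, n, gamma, sigma = 3, 20000, 2.0, 0.5
X = rng.normal(0.0, sigma, size=(n, d))
Xp = rng.normal(0.0, sigma, size=(n, d))

# v = psi(0) - E psi(X - X'), with psi(0) = 1 for the Gaussian kernel
sq = np.sum((X - Xp) ** 2, axis=1)
v_mc = 1.0 - np.mean(np.exp(-sq / (2 * gamma**2)))

bound = 1.0 - np.exp(-sigma**2 * d / gamma**2)  # 1 - exp(-Tr(Sigma_P)/gamma^2)
```

With these values, `v_mc` is close to 0.16 while `bound` is close to 0.17, and as γ grows both shrink, which is the acceleration mechanism behind (10).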
However, it is important to remember that a larger lengthscale parameter will flatten the kernel, hence make the left-hand side in (10) less informative. Depending on the application, it is possible to achieve the optimal balance between these two effects; some examples are discussed in Section 6.

2.2 Convex Radial Square Basis Functions

In fact, a technique similar to that used for obtaining (10) yields a bound on the variance for any (convex) radial kernel.

Lemma 6 Assume that for any x, x′ ∈ X, k(x, x′) = r(‖x − x′‖₂²) for some convex function r : R₊ → R. Then

v ≤ r(0) − r(2 Tr Σ_P),

where Σ_P is the covariance matrix that pertains to P.

2.3 Positive Definite Matrix on a Finite Space

Consider |X| < ∞ and a symmetric positive definite matrix K of size |X|. Write K = Σ_{x∈X} λ_x v_x v_x^⊤, where for any x ∈ X, K v_x = λ_x v_x, and by positive definiteness, λ_x > 0. Then the feature map can be expressed as Φ(x0) = (√λ_x v_x(x0))_{x∈X}, and by direct calculation,

‖µ̂P − µP‖²_{H_K} = Σ_{x∈X} λ_x ( (1/n) Σ_{t=1}^n v_x(Xt) − E[v_x(X)] )²,  v = Σ_{x∈X} λ_x V[v_x(X)].

In particular, for K = I,

‖µ̂P − µP‖_{H_K} = ‖P̂n − P‖₂,  v = 1 − ‖P‖₂²,

recovering that with probability at least 1 − δ,

‖P̂n − P‖₂ ≤ √( 2 (1 − ‖P‖₂²) log(2/δ)/n ) + 4 log(2/δ)/n.

So far, we have shown how to obtain, for a natural class of kernels, an upper bound on v that decouples the choice of k from some measure of total variance of P. Provided an upper bound on the latter is available (see Example 1), we obtain an improved convergence rate. However, we understandably may not have a bound on the variance of P, or could have insight about its variance, but only have access to contaminated data (see later Section 6.2.1). Recovering a bound that still enjoys the discussed speed-up without a priori knowledge of P is the problem we set out to solve in the next section.

3. Convergence Rates with Empirical Variance Proxy

We solve the issue (O1) of not knowing v by estimating it from the data, simultaneously to µP.
A short heuristic analysis of the Epanechnikov function (which is not typically a kernel according to our definition) that we conduct in Section 3.1 hints at an empirical Bernstein (Audibert et al., 2007; Maurer and Pontil, 2009) approach replacing the variance term by some empirical proxy, which we explore formally in Section 3.2. The argument is structured around the pivotal definition of a weakly self-bounding function, which we first recall.

Definition 7 (Maurer 2006; Boucheron et al. 2009) Let n ∈ N, t0 ∈ [n], x = (x1, . . . , x_{t0}, . . . , xn) ∈ X^n, x′_{t0} ∈ X, and write

x^{(t0)} = (x1, . . . , x_{t0−1}, x′_{t0}, x_{t0+1}, . . . , xn),

for the vector where x_{t0} has been replaced with x′_{t0}. Let f : X^n → R₊.

(i) The function f is called weakly (α, β)-self-bounding when for all x ∈ X^n,

Σ_{t0=1}^n ( f(x) − inf_{x′_{t0}∈X} f(x^{(t0)}) )² ≤ α f(x) + β.

(ii) The function f is said to have the bounded differences property when for all x ∈ X^n and all t0 ∈ [n],

f(x) − inf_{x′_{t0}∈X} f(x^{(t0)}) ≤ 1.

3.1 Intuition in the Hypercube

Let us take X = {0, 1}^d, the binary hypercube for d ≥ 1, and consider the Epanechnikov (parabolic) function q(x, x′) = 1 − ‖x − x′‖₂²/d. Note that q does not define a proper kernel (Cuturi, 2009, p. 9). Nevertheless, the function q is TI, with ψ(t) = 1 − ‖t‖₂²/d, ψ̄ = 1, and we can extend the definition of the variance term

v = ψ̄ − E_{X,X′∼P} ψ(X − X′) = 2 Tr Σ_P / d.

It is natural to define for i ∈ [d],

v̂_i(X1, . . . , Xn) := (1/(2n(n−1))) Σ_{t≠s} (X^i_t − X^i_s)²,

that will act as an unbiased empirical proxy for v_i := V_{X∼P} X^i, and to introduce Tr Σ̂ := Σ_{i=1}^d v̂_i as an estimator for the trace of the covariance matrix Σ_P.

Lemma 8 The function X^n → R₊, x ↦ Tr Σ̂(x)/d, is weakly (n/(n−1), 0)-self-bounding and has the bounded differences property in the sense of Definition 7.

As a consequence of Lemma 8, the technique of Maurer and Pontil (2009, Theorem 10) provides the following deviation bounds on the square root of the total variance,

P ( b ( √(Tr Σ_P / d) − √(Tr Σ̂(X)/d) ) > √(2 log(1/δ)/n) ) ≤ δ,  b ∈ {−1, 1}.
(11) In other words, with high confidence, we can replace the trace of the covariance matrix with its empirical proxy in the convergence bounds.

3.2 Systematic Approach

Our goal is to rigorously develop the approach intuited in the previous section to include a large class of reproducing kernels. We propose the following empirical proxy for the variance term v,

v̂_k(X1, . . . , Xn) := (1/(n−1)) Σ_{t=1}^n ( k(Xt, Xt) − (1/n) Σ_{s=1}^n k(Xt, Xs) ),   (12)

write more simply v̂ for v̂_k when k is clear from the context, and promptly verify that v̂ is an unbiased estimator for v.

Lemma 9 (Unbiasedness) It holds that E v̂ = v, where the expectation is taken over the sample X1, . . . , Xn ∼iid P.

The remainder of this section is devoted to analyzing the self-boundedness properties of v̂ under mild conditions on k and deriving corresponding empirical confidence intervals around µP. We henceforth let k be a characteristic TI kernel defined by the positive definite function ψ. We also obtain partial results for the more general case of kernels which are not translation invariant. This analysis is deferred to the appendix, Section A (refer to Theorem 29). In the TI setting, the expression of v̂ can be simplified as

v̂(X1, . . . , Xn) := ψ̄ − (1/(n(n−1))) Σ_{t≠s} ψ(Xt − Xs).   (13)

Since the kernel is characteristic, ψ cannot be a constant function, that is, Δψ > 0. Under our assumptions, we introduce a function involving v̂ and ψ that is weakly self-bounding and has the bounded differences property.

Lemma 10 Let ψ define a characteristic TI kernel. The function X^n → R, x ↦ (n/(2Δψ)) v̂(x), is weakly (2, 0)-self-bounding and has bounded differences in the sense of Definition 7.

This property leads to concentration of v̂ around v.

Lemma 11 For b ∈ {−1, 1}, with probability at least 1 − δ,

b ( √v − √v̂ ) ≤ √(2 Δψ log(1/δ)/n).

Theorem 12 (Confidence interval with empirical variance for TI kernel) Let n ∈ N, X1, . . .
, Xn ∼iid P, and let k be a characteristic translation invariant kernel defined from a positive definite function ψ [see (4)]. Then with probability at least 1 − δ, it holds that

‖µ̂P − µP‖_{Hk} ≤ B̂_{k,δ}(X1, . . . , Xn),  B̂_{k,δ}(X1, . . . , Xn) = B̂_δ := √( 2 v̂(X1, . . . , Xn) log(4/δ)/n ) + ( √(2 Δψ) + 4 √k̄ ) log(4/δ)/n,

where v̂ is the empirical variance proxy defined in (13).

Example 2 Theorem 12 immediately holds for Gaussian kernels, regardless of the lengthscale parameter.

Remark 13 We make the following observations. (i) It still almost surely holds that v̂ ≤ Δψ ≤ 2k̄, thus the fully empirical bound can never be more than a constant factor away from (7) or Theorem 1. (ii) Crucially, self-boundedness of v̂, and hence the applicability of Theorem 12 (or Theorem 29 for non-TI kernels), only depends on the choice of kernel, and not on the properties of the underlying distribution. (iii) Theorem 12 does not depend on smoothness properties of the kernel (this is also true of Theorem 29 for non-TI kernels).

3.3 Computability of the Empirical Variance Proxy

The proxy v̂ is computable from data with O(n²) calls to the kernel function. In Section 6, we will discuss how to use variance-aware bounds to improve confidence bounds and test procedures based on the minimum-MMD estimator. In general, this estimator is computed by stochastic gradient descent (SGD) (Dziugaite et al., 2015; Li et al., 2015; Chérief-Abdellatif and Alquier, 2022) or variants like stochastic natural gradient descent (Briol et al., 2019). A single step of SGD requires to sample m iid random variables from P_θ, where θ is the current estimate, and to compute an unbiased estimate of the gradient, which requires (among others) O(mn) calls to the kernel function. In the natural version, one must also compute a Jacobian at each step, which requires O(n²) calls to the kernel. Moreover, in the discussion following Theorem 2 of Briol et al. (2019), it is argued that we should take m ≥ n in these algorithms.
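For concreteness, the proxy (13) for a Gaussian (hence TI) kernel costs exactly the O(n²) kernel evaluations just discussed; a minimal sketch with our own function name:

```python
import numpy as np

def empirical_variance_proxy(X, gamma=1.0):
    # v-hat from (13) for the Gaussian (TI) kernel:
    #   psi(0) - (1/(n(n-1))) * sum over t != s of psi(X_t - X_s),  with psi(0) = 1.
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # (n, n) squared distances
    psi = np.exp(-sq / (2 * gamma**2))
    off_diag_sum = psi.sum() - np.trace(psi)   # excludes the t = s terms
    return 1.0 - off_diag_sum / (n * (n - 1))
```

A sample of identical points has zero empirical variance, and since 0 < ψ ≤ 1 here, the proxy always lies in [0, 1).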
Thus, both SGD and its natural variant will require at least O(n²) calls to the kernel at each iteration. In this light, the computation of v̂ does not significantly affect the computational burden. Note however that, in a few situations where the gradient of the MMD is available in closed form, it is possible to use a non-stochastic gradient descent. Such examples include Gaussian mean estimation (Chérief-Abdellatif and Alquier, 2022) and Gaussian copula estimation (Alquier et al., 2023). Each step of the gradient descent requires only O(n) calls to the kernel and convergence is typically very fast. In these situations, the computation of v̂ can increase the computational cost when n is large.

4. Convergence Rates for the Difference of Two Means

We now extend the results of the previous section to the estimation of the norm of the difference of two means ‖µP − µQ‖_{Hk}. One of the applications of these results is two-sample testing, where we test the hypothesis P = Q, as in Gretton et al. (2012). We will cover this application in Section 6. However, these results also have an interest of their own, to illustrate the benefits of variance-aware bounds. In all this section, we assume that two samples are observed: X1, . . . , Xn iid from P, and Y1, . . . , Ym iid from Q. We want to estimate the norm of the difference ‖µP − µQ‖_{Hk}.

4.1 First Approach: Estimation by a U-Statistic

A classical approach is to estimate the squared norm ‖µP − µQ‖²_{Hk} using a U-statistic, as proposed by Gretton et al. (2012), and then to use a Hoeffding- or Bernstein-type inequality for U-statistics. While Gretton et al. (2012) used the U-statistics version of Hoeffding's inequality to control the risk of this estimator, using a Bernstein inequality instead might allow to derive variance-aware bounds. The first versions of such inequalities were proven by Hoeffding (1963) and Arcones (1995). More recently, the results of these two papers are included in Theorem 2 of Peel et al.
(2010), while their Theorem 3 gives an empirical version of each. We now provide more details on this approach. First, note that

‖µP − µQ‖²_{Hk} = E_{X,X′∼P} k(X, X′) + E_{Y,Y′∼Q} k(Y, Y′) − 2 E_{X∼P,Y∼Q} k(X, Y)

can be directly estimated by the U-statistic

U_{n,m} = (1/(n(n−1))) Σ_{i≠j} k(Xi, Xj) + (1/(m(m−1))) Σ_{i≠j} k(Yi, Yj) − (2/(mn)) Σ_{i=1}^n Σ_{j=1}^m k(Xi, Yj),

as proposed by Gretton et al. (2012). This is an unbiased estimator: E U_{n,m} = ‖µP − µQ‖²_{Hk}. In the special case m = n, we can replace U_{n,n} by the simpler

Û_n = (1/(n(n−1))) Σ_{i≠j} [k(Xi, Xj) + k(Yi, Yj) − k(Xi, Yj) − k(Xj, Yi)].

We will illustrate the U-statistics approach in this simpler setting. Using Theorem 2 in Peel et al. (2010), with their q((Xi, Yi), (Xj, Yj)) = k(Xi, Xj) + k(Yi, Yj) − k(Xi, Yj) − k(Xj, Yi) and their m = 2, we obtain the first part of the following statement. Using their Theorem 3, we obtain the second one.

Theorem 14 Put

σ²_{P,Q} = V_{X∼P,Y∼Q} [ E_{X′∼P} k(X, X′) + E_{Y′∼Q} k(Y, Y′) − E_{Y′∼Q} k(X, Y′) − E_{X′∼P} k(X′, Y) ],

σ̂²_{P,Q} = (1/(n(n−1)(n−2))) Σ_{i,j,l distinct} [ k(Xi, Xj) + k(Yi, Yj) − k(Xi, Yj) − k(Xj, Yi) ] [ k(Xi, Xl) + k(Yi, Yl) − k(Xi, Yl) − k(Xl, Yi) ].

Note that E σ̂²_{P,Q} = σ²_{P,Q} + ‖µP − µQ‖⁴_{Hk}. For any δ ∈ (0, 1], with probability at least 1 − δ,

| ‖µP − µQ‖²_{Hk} − Û_n | ≤ 2 √k̄ √( 8 σ²_{P,Q} log(4/δ)/n ) + 2 k̄ (64 + 1/3) log(4/δ)/n.

Moreover, with probability at least 1 − δ,

| ‖µP − µQ‖²_{Hk} − Û_n | ≤ 2 √k̄ √( 8 σ̂²_{P,Q} log(8/δ)/n ) + 2 k̄ (64 + 1/3) log(8/δ)/n.

Observe that, when P ≠ Q, the bound states that √Û_n will be of the order of ‖µP − µQ‖_{Hk}. On the other hand, when P = Q, we have both ‖µP − µQ‖_{Hk} = 0 and σ²_{P,Q} = 0, and thus the first inequality in the theorem gives

Û_n ≤ 2 k̄ (64 + 1/3) log(4/δ)/n,

which does not depend on the variances of P and Q.

4.2 Variance-Aware Control of the Fluctuations for Each Sample

An alternative approach is to apply the triangle inequality to upper bound separately the fluctuations for each sample:

| ‖µP − µQ‖_{Hk} − ‖µ̂P(X1, . . . , Xn) − µ̂Q(Y1, . . . , Ym)‖_{Hk} | ≤ ‖µP − µ̂P(X1, . . . , Xn)‖_{Hk} + ‖µQ − µ̂Q(Y1, . . . , Ym)‖_{Hk}.

For example, a direct application of Theorem 1 with a union bound gives the following corollary.

Corollary 15 Let X1, .
. . , Xn ∼ P, Y1, . . . , Ym ∼ Q, let k : X × X → R be a reproducing kernel, and let v_P := E_{X∼P} ‖k(X, ·) − µP‖²_{Hk}, v_Q := E_{Y∼Q} ‖k(Y, ·) − µQ‖²_{Hk} and k̄ = sup_{x∈X} k(x, x). With probability at least 1 − δ, it holds that

| ‖µP − µQ‖_{Hk} − ‖µ̂P(X1, . . . , Xn) − µ̂Q(Y1, . . . , Ym)‖_{Hk} | ≤ √(2 v_P log(4/δ)/n) + √(2 v_Q log(4/δ)/m) + 4 √k̄ log(4/δ) (1/n + 1/m).

Of course, we can also state results with the empirical variance instead of the true variance, by using Theorems 12 and 29. In order to compare this result to the U-statistics approach, consider the case n = m. Corollary 15 gives the following upper bound:

| ‖µP − µQ‖_{Hk} − ‖µ̂P(X1, . . . , Xn) − µ̂Q(Y1, . . . , Yn)‖_{Hk} | ≤ √(2 v_P log(4/δ)/n) + √(2 v_Q log(4/δ)/n) + 8 √k̄ log(4/δ)/n.   (15)

In particular, when P = Q, the bound becomes:

‖µ̂P(X1, . . . , Xn) − µ̂Q(Y1, . . . , Yn)‖_{Hk} ≤ 2 √(2 v_P log(4/δ)/n) + 8 √k̄ log(4/δ)/n.   (16)

Even though ‖µ̂P(X1, . . . , Xn) − µ̂Q(Y1, . . . , Yn)‖²_{Hk} is not an unbiased estimator of ‖µP − µQ‖²_{Hk}, as Û_n is, (15) and (16) show that the fluctuations of ‖µ̂P(X1, . . . , Xn) − µ̂Q(Y1, . . . , Yn)‖_{Hk} when P = Q are upper bounded by √(v_P/n) + 1/n, while those of √Û_n are only upper bounded by √(1/n). This can be a serious improvement if v_P is small. Note that we do not claim superiority of the plug-in estimator, but rather that the currently available non-asymptotic bounds for this estimator are tighter. This is an illustration of the power of the variance-aware bounds. A way to compare both estimators would be through an accurate study of their asymptotic fluctuations when P = Q. So far, this analysis is only available for the U-statistic, see Theorem 12 in Gretton et al. (2012).

5. Convergence Rates with Time-Dependent Data

In this section, we establish convergence rates for cases where the data X1, . . . , Xn is not independent. Namely, we will assume the data to be a stationary mixing sequence (Bradley, 2005; Doukhan, 2012). In this setting, the observations are identically distributed with marginal distribution P, but exhibit time dependencies that diminish as the time interval increases.

Lemma 16 Suppose that X1, .
. . , Xn is a stationary sequence of possibly dependent random variables. Then

E ‖µ̂P − µP‖²_{Hk} = (1/n) (v + Σn),  Σn := (2/n) Σ_{t=2}^n (n − t + 1) ρ_t,

where ρ_t := E ⟨k(Xt, ·) − µP, k(X1, ·) − µP⟩_{Hk} are the covariance coefficients in the RKHS introduced in Chérief-Abdellatif and Alquier (2022). In particular, when the process is iid, the expression above simplifies to E ‖µ̂P − µP‖²_{Hk} = v/n.

5.1 For φ-Mixing Processes

The first flavor of mixing we consider is φ-mixing, introduced by Ibragimov (1962). Recall that the φ-mixing coefficient (Bradley, 2005; Doukhan, 2012) is defined for two σ-fields A and B by

φ(A, B) := sup_{A∈A, B∈B, P(A)>0} |P(B|A) − P(B)|.

For s ∈ N, we further define

φ(s) := sup_{r∈N} φ ( σ({Xt : t ≤ r}), σ({Xt : t ≥ r + s}) ),

where for T ⊆ N, σ({Xt}_{t∈T}) is the σ-field generated by the random variables {Xt}_{t∈T}. The random process is then called φ-mixing when lim_{s→∞} φ(s) = 0. Additionally, we define the triangular coupling matrix Γ (Samson, 2000) as follows. For t, s ∈ [n],

Γ_{t,s} := 1 when t = s, 0 when t > s, √( 2 φ(σ(X1, . . . , Xt), σ(Xs, . . . , Xn)) ) otherwise.   (17)

Theorem 17 (Variance-aware confidence interval with φ-mixing data) Let δ ∈ (0, 1), and X1, . . . , Xn be a stationary φ-mixing sequence with marginal distribution P. We let k : X × X → R be a reproducing kernel with k̄ = sup_{x∈X} k(x, x). With probability at least 1 − δ, it holds that

‖µ̂P(X1, . . . , Xn) − µP‖_{Hk} ≤ B^φ_{k,δ}(P, n),  B^φ_{k,δ}(P, n) = B^φ_δ := √( 2 (v + Σn) ‖Γ‖₂² log(1/δ)/n ) + 8 k̄ ‖Γ‖₂² log(1/δ)/n,

where Σn is defined in Lemma 16, Γ is defined in Eq. (17) and ‖·‖₂ is the spectral norm. In particular, when the process is iid, Γ is the identity matrix, hence ‖Γ‖₂ = 1.

Example 3 (Uniformly ergodic Markov chains) Suppose that there exist φ0, φ1 ∈ R₊ such that for all t ∈ N, φ(t) ≤ φ0 exp(−φ1 t). Then it holds (refer for example to Samson (2000)) that

‖Γ‖₂ ≤ √(2 φ0) / (1 − e^{−φ1/2}).

In particular, this is known to hold for a uniformly ergodic Markov chain (Roberts and Rosenthal, 2004) with transition operator P and stationary distribution π.
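Example 3 can be checked numerically: build the coupling matrix Γ of (17) with entries bounded via φ(t) ≤ φ0 e^{−φ1 t}, and compare its spectral norm against √(2φ0)/(1 − e^{−φ1/2}). The parameter values below are illustrative choices of ours:

```python
import numpy as np

# Numerical sketch of Example 3: Gamma is upper triangular with unit diagonal and
# off-diagonal entries sqrt(2 * phi(s - t)), here bounded by phi0 * exp(-phi1 * (s - t)).
phi0, phi1, n = 4.0, np.log(2.0) / 5.0, 60
G = np.eye(n)
for t in range(n):
    for s in range(t + 1, n):
        G[t, s] = np.sqrt(2.0 * phi0 * np.exp(-phi1 * (s - t)))

spec = np.linalg.norm(G, 2)                          # spectral norm ||Gamma||_2
bound = np.sqrt(2.0 * phi0) / (1.0 - np.exp(-phi1 / 2.0))
```

With geometric decay the spectral norm stays bounded as n grows, which is what makes the φ-mixing rate of Theorem 17 comparable to the iid rate up to the factor ‖Γ‖₂.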
In this case we can choose, φ0 = 4 and φ1 = log(2)/tmix, where tmix .= min t N P t(x, ) π( ) TV < 1/4 , is called the mixing time of the chain with respect to total variation (Levin et al., 2009); here we obtain Γ 2 10tmix. 5.2 For β-Mixing Processes We next consider β-mixing, a common assumption in the machine learning literature (Mohri and Rostamizadeh, 2010). Recall that the β-mixing coefficient (Bradley, 2005; Doukhan, 2012) is defined for two σ-fields A and B, β(A, B) .= sup 1 j=1 |P(Ai Bj) P(Ai)P(Bj)| , where the supremum is taken over all pairs of finite partitions {A1, . . . , AI} and {B1, . . . , BJ} of X such that for any 1 i I and any 1 j J, Ai A and Bj B (Bradley, 2005, Equation 7). For s N, we further define β(s) .= sup r N β (σ ({Xt : t r}) , σ ({Xt : t r + s})) , where for T N, σ({Xt}t T ) is the σ-field generated by the random variables {Xt}t T . The random process is then called β-mixing6 when lims β(s) = 0. In this case, for ξ (0, 1), we define the β-mixing time as tβ mix(ξ) .= arg min t N {β(t) < ξ} . We observe that when the process is a stationary time-homogeneous Markov chain, tβ mix corresponds to the definition of the average-mixing time (M unch and Salez, 2023; Wolfer 6. A β-mixing process is also called absolutely regular in the literature. Variance-Aware Estimation of Kernel Mean Embedding and Alquier, 2024). Note that β(A, B) φ(A, B) (Bradley, 2005, Equation 1.11), and no reverse inequality holds for any universal constant, thus β-mixing is a strictly weaker notion than φ-mixing. Theorem 18 (Variance-aware confidence interval with β-mixing data) We let δ (0, 1), and X1, . . . , Xn be a stationary β-mixing sequence. We let k: X X R be a reproducing kernel with k = supx X k(x, x). Finally, we suppose for simplicity7 that n is a multiple of τ β n,δ where τ β n,δ .= arg min s N δ 6(n/(2s) 1) With probability at least 1 δ, it holds that bµP(X1, . . . 
, Xn) µP Hk Bβ k,δ(P, n), Bβ k,δ(P, n) = Bβ δ .= 2 s v + Στ β n,δ where Σs is defined in Lemma 16. Example 4 (Countable state time-homogeneous Markov chains) Irreducible, aperiodic and stationary countable Markov chains are always β-mixing (Bradley, 2005). Assume that there exist β1 R+ and b (1, ) such that for any t N, β(t) β1/tb, that is, the chain is only known to be algebraically mixing. Interestingly, such a chain may not be φ-mixing, thus this example cannot be recovered from Theorem 17. 6. Applications We put the apparatus developed in Section 2 and Section 3 to application in the context of hypothesis testing and robust parametric estimation. In this section, all considered kernels will be TI unless otherwise specified, and X1, . . . , Xn are i.i.d. from P P(X). We introduce a statistical model M = {Pθ : θ Θ} indexed by the parameter space Θ. Examples of models studied in the literature on MMD include parametric models such as the Gaussian model Pθ = N(θ, σ2Id) (with σ2 known) and mixture of Gaussians (Briol et al., 2019; Ch erief-Abdellatif and Alquier, 2022), copulas Alquier et al. (2023) but also more complex models such as generative adversarial networks (Dziugaite et al., 2015; Li et al., 2015), stochastic volatility models and stochastic differential equations (Briol et al., 2019). Following Briol et al. (2019), our estimator will be the closest element8 of the model 7. An adaptation of the proof removes this assumption and recovers similar bounds up to universal constants. 8. Note that it might be that the infimum is reached by multiple elements of Θ, or is not reached. The first case does not lead to any difficulty, we must simply define a rule to break ties (for example, we can equip Θ with a total order and chose the smallest minimizer according to this order). While Briol et al. 
(2019) provide sufficient conditions to ensure that the infimum is reached, the non-existence of a minimizer is also not a problem in practice: all the non-asymptotic results in Briol et al. (2019) and Chérief-Abdellatif and Alquier (2022) can easily be extended to any ϵ-minimizer, for ϵ small enough. M to the empirical measure obtained from X, where the distance is measured in the RKHS, ‖µPbθn(X1,...,Xn) − bµP(X1, . . . , Xn)‖Hk = inf θ∈Θ ‖µPθ − bµP(X1, . . . , Xn)‖Hk . (18) The computation of bθn(X1, . . . , Xn) is usually done via stochastic gradient descent and variants, see Remark 13 above. 6.1 Hypothesis Testing In this subsection we study hypothesis testing based on MMD. In the literature, two kinds of tests were proposed and studied: two-sample tests, and goodness-of-fit tests. In the two-sample test problem, we are given X1, . . . , Xn iid from some PX and Y1, . . . , Ym iid from some PY , and we want to test H0 : PX = PY against the alternative H1 : PX ≠ PY . In the goodness-of-fit problem, we are given X1, . . . , Xn iid from P and we wish to test the hypothesis H0 : P ∈ M = {Pθ : θ ∈ Θ} against the alternative H1 : P ∉ {Pθ : θ ∈ Θ}. In the MMD literature, two-sample tests are more prevalent (Gretton et al., 2009, 2012). Recent work tackles goodness-of-fit testing (Chwialkowski et al., 2016; Jitkrittum et al., 2017), albeit with an asymptotic treatment. We will study both problems here. In both cases, we propose a non-asymptotic treatment. That is, the level of the test is smaller than α for a finite sample size, and not only asymptotically. Moreover, in both cases, our data-dependent bound (Theorem 12) allows us to increase the power of the test, when compared to the procedure that would be based on the McDiarmid-based bound in (7). 6.1.1 Goodness-of-Fit Test Recall that we define the significance level α ∈ [0, 1] of a test as the probability of outputting H1 when H0 is true. Taking advantage of a non-asymptotic bound B(X1, . . .
, Xn, α) (for example B(X1, . . . , Xn, α) = b Bα given in Theorem 12), we can design a test with prescribed level α for any n as follows. We introduce the test statistic T(X1, . . . , Xn) .= inf θ∈Θ ‖bµP − µPθ‖Hk , and reject H0 on the critical set C(X1, . . . , Xn) .= {T > B(X1, . . . , Xn, α)} . We show that the probability of rejection under the null hypothesis is at most α and that the test is consistent. Theorem 19 Let n ∈ N, X1, . . . , Xn ∼ P, and k a translation invariant kernel. Let B(·, α) be such that (a) for any α, with probability at least 1 − α, ‖bµP − µP‖Hk ≤ B(X1, . . . , Xn, α). Then (i) PH0 (C(X1, . . . , Xn)) ≤ α. If, moreover, (b) for any fixed α, B(X1, . . . , Xn, α) → 0 a.s. as n → ∞, and (c) there is a sequence αn → 0 such that B(X1, . . . , Xn, αn) → 0 a.s. as n → ∞, then (ii) when the model M is closed with respect to the MMD metric9, limn→∞ PH1 (C(X1, . . . , Xn)) = 1; more precisely, 1 − PH1 (C(X1, . . . , Xn)) = O(αn). Obviously, the empirical bound b Bk,α of Theorem 12 satisfies (a) and (b) (regardless of P), and (c) with αn = exp(−n^{1−ϵ}) for any fixed ϵ > 0. So does the McDiarmid-based bound in (7). On the other hand, we claimed that the empirical bound is smaller than bounds that do not take the variance into account. In other words, the power of the test PH1 (C(X1, . . . , Xn)) for a finite n will be larger if we use the variance-aware bound. Example 5 We first consider a single hypothesis test (that is, a special case of goodness-of-fit when Θ is a singleton). We consider data in R²: here P = N(0, σ²I2) with Pθ = N(θ, I2), and Θ = {(1, 1)}. In other words, H0 is true iff σ = 1. Note that in this case, if Y1, . . . , Yn are iid from Pθ = P(1,1), then q1−α, defined as the (1 − α)-quantile of ‖µP(1,1) − bµP(Y1, . . . , Yn)‖Hk , allows us to define a test with rejection zone given by C′(X1, . . . , Xn) = {T > q1−α} that satisfies PH0 (C′(X1, . . . , Xn)) = α by definition.
It is a very natural procedure to estimate q1−α by Monte-Carlo, by sampling Y1, . . . , Yn from Pθ = P(1,1) multiple times. This leads to a Monte-Carlo estimator ˆq1−α of q1−α. In our simulations, we sample X1, . . . , Xn from P, and perform the test based on ˆq1−α, the test based on the empirical bound, and the test based on the McDiarmid bound. This is repeated 100 times for each value σ ∈ {0, 1/50, 2/50, . . . , 1}. We report the frequency of rejections in Figure 1. The kernel used is a Gaussian kernel with γ = 1, and we consider sample sizes n ∈ {16, 40, 100, 250}. We observe that, when compared to the test based on ˆq1−α, both tests based on bounds have weak power (note that we did not try to optimize γ for now; this question will be tackled later). However, the test based on the empirical bound indeed rejects H0 more often when H1 is true. The improvement is clearer for small sample sizes. Example 6 We now consider a proper goodness-of-fit test in R²: we still consider P = N(0, σ²I2) and Pθ = N(θ, I2), with Θ = R². It is important to observe that in this case, because H0 is composite, we do not have a natural definition for q1−α as in the previous example. We sample X1, . . . , Xn from P, and perform the test based on the empirical bound and the test based on the McDiarmid bound. This is repeated 100 times for each value σ ∈ {0, 1/50, 2/50, . . . , 1}. We report the frequency of rejections in Figure 2. The kernel used is a Gaussian kernel with γ = 1, and we consider sample sizes n ∈ {16, 40, 100, 250}. Similar comments to the previous case apply; in the absence of a Monte-Carlo baseline, this makes the test based on the empirical Bernstein bound the best test available here. 9. For any Q, if there is a sequence (θh)h∈N of elements of Θ such that ‖µQ − µPθh‖Hk → 0 as h → ∞, then Q = Pθ for some θ ∈ Θ.
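The Monte-Carlo calibration of Example 5 can be sketched as follows. For simplicity we approximate ‖µP(1,1) − bµP(Y1, . . . , Yn)‖Hk by a plug-in MMD against a large reference sample from P(1,1); this approximation, the kernel parametrization k(x, y) = exp(−‖x − y‖²/(2γ²)), and all numerical values are illustrative choices, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def mmd(X, Y, gamma=1.0):
    """Plug-in (biased) MMD between empirical measures of X and Y with
    the Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 gamma^2))."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * gamma ** 2))
    return np.sqrt(max(gram(X, X).mean() + gram(Y, Y).mean()
                       - 2 * gram(X, Y).mean(), 0.0))

# Monte-Carlo estimate of the (1 - alpha)-quantile of the statistic under H0.
theta, n, alpha, n_mc = np.array([1.0, 1.0]), 40, 0.05, 100
ref = rng.normal(size=(500, 2)) + theta      # large reference sample from P_(1,1)
stats = [mmd(rng.normal(size=(n, 2)) + theta, ref) for _ in range(n_mc)]
q = np.quantile(stats, 1 - alpha)

# Decision rule: reject H0 when the test statistic exceeds the estimated quantile.
X = rng.normal(size=(n, 2)) + theta          # data drawn from P_(1,1), i.e. H0 true
reject = mmd(X, ref) > q
```

Repeating the last two lines over many draws of X recovers the rejection frequencies reported in Figure 1 for the quantile-based test.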
Figure 1: Comparison of the test based on the Bernstein empirical (Emp Ber) bound, versus the test based on the McDiarmid bound (Mc Dia), and the test based on the Monte-Carlo estimation of the quantile q1−α. Frequency of rejection of H0 : P ∈ {N((1, 1), I2)} as a function of σ with P = N(0, σ²I2). Figure 2: Comparison of the test based on the Bernstein empirical (Emp Ber) bound, versus the test based on the McDiarmid bound (Mc Dia). Frequency of rejection of H0 : P ∈ {N(θ, I2), θ ∈ R²} as a function of σ with P = N(0, σ²I2). 6.1.2 Two-Sample Test Let (n, m) ∈ N², X1, . . . , Xn ∼ PX and Y1, . . . , Ym ∼ PY . Here, we wish to test H0 : PX = PY against H1 : PX ≠ PY . This time, we use the test statistic T2(X1, . . . , Xn, Y1, . . . , Ym) .= ‖bµP(X1, . . . , Xn) − bµP(Y1, . . . , Ym)‖Hk , and reject H0 on the critical set C2(X1, . . . , Xn, Y1, . . . , Ym) .= {T2 > B(X1, . . . , Xn, α/2) + B(Y1, . . . , Ym, α/2)} . Theorem 20 Let (n, m) ∈ N², X1, . . . , Xn ∼ PX and Y1, . . . , Ym ∼ PY , and k a translation invariant kernel. Let B(·, α) be such that (a) for any α, each of these inequalities holds with probability at least 1 − α: ‖bµP(X1, . . . , Xn) − µPX‖Hk ≤ B(X1, . . . , Xn, α) and ‖bµP(Y1, . . . , Ym) − µPY‖Hk ≤ B(Y1, . . . , Ym, α). Then (i) PH0 (C2(X1, . . . , Xn, Y1, . . . , Ym)) ≤ α. Moreover, if (b) for any fixed α, B(X1, . . . , Xn, α) → 0 a.s. as n → ∞ and B(Y1, . . . , Ym, α) → 0 a.s. as m → ∞, and (c) there is a sequence αn → 0 such that B(X1, . . . , Xn, αn) → 0 a.s. as n → ∞ and B(Y1, . . . , Ym, αm) → 0 a.s. as m → ∞, then (ii) limn,m→∞ PH1 (C2(X1, . . . , Xn, Y1, . . . , Ym)) = 1. Here again, the empirical bound b Bk,α of Theorem 12 satisfies (a) and (b), and (c) with αn = exp(−n^{1−ϵ}) for any ϵ > 0.
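The two-sample decision rule above can be sketched in a few lines. The bound B below is a placeholder of the generic McDiarmid-style form √(k∞/n)(1 + √(2 log(1/α))); its constants may differ from those of the bound in (7), and the data and kernel are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def mmd(X, Y, gamma=1.0):
    """Plug-in MMD with the Gaussian kernel exp(-||x - y||^2 / (2 gamma^2))."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * gamma ** 2))
    return np.sqrt(max(gram(X, X).mean() + gram(Y, Y).mean()
                       - 2 * gram(X, Y).mean(), 0.0))

def B_mcdiarmid(sample, alpha, k_inf=1.0):
    # Illustrative McDiarmid-style bound sqrt(k_inf / n) (1 + sqrt(2 log(1/alpha)));
    # the paper's Eq. (7) may carry different constants.
    n = len(sample)
    return np.sqrt(k_inf / n) * (1 + np.sqrt(2 * np.log(1 / alpha)))

def two_sample_test(X, Y, alpha, bound=B_mcdiarmid):
    """Reject H0 : P_X = P_Y when T2 > B(X, alpha/2) + B(Y, alpha/2)."""
    t2 = mmd(X, Y)
    return t2 > bound(X, alpha / 2) + bound(Y, alpha / 2)

X = rng.normal(size=(500, 2))
Y = rng.normal(size=(500, 2)) + 2.0     # well-separated alternative
same = rng.normal(size=(500, 2))        # second sample from the same law
```

With these samples, `two_sample_test(X, Y, 0.05)` rejects while `two_sample_test(X, same, 0.05)` does not; plugging in a variance-aware bound for `bound` tightens the threshold and increases power, as discussed above.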
6.2 Robust Parametric Estimation under Huber Contamination Chérief-Abdellatif and Alquier (2022, Proof of Theorem 3.1) show how to upper-bound (18) by the estimation error of the empirical measure in the RKHS and the approximation error of the model, ‖µPbθn − µP‖Hk ≤ inf θ∈Θ ‖µPθ − µP‖Hk + 2 ‖bµP − µP‖Hk . (19) Recall that in the Huber contamination model (Huber, 2011), the observations X1, . . . , Xn are drawn independently from the mixture P = (1 − ξ)Pθ0 + ξH, where ξ ∈ [0, 1/2) is the contamination rate, H is some unknown noise distribution and θ0 ∈ Θ. In this setting, we can control the approximation error of the model: starting from (19), and adapting the argument of Chérief-Abdellatif and Alquier (2022, Corollary 3.3), we obtain a bound of the form ‖µPbθn − µPθ0‖Hk ≤ (term in ξ and ψ) + 2 ‖bµP − µP‖Hk . We wish to apply Theorem 1 to the second term. However, when the sample is contaminated, the variance term v(P) we collect also depends on the properties of H, which are unknown; for all we know, H does not even have finite moments. Fortunately, we now see how to bound v(P) in terms of v(Pθ0) and ξ. 6.2.1 Improved Confidence Bounds in the Huber Contamination Model Lemma 21 In the Huber contamination model, that is, P = (1 − ξ)Pθ0 + ξH, writing v = v(P) and v0 = v(Pθ0), it holds that v ≤ v0 + 2ξ(ψ − v0) + ξ²(v0 − ψ). (20) The bound can be streamlined; when ψ ≥ 0, v ≤ (1 − 2ξ)v0 + 2ξψ, and otherwise v ≤ (1 − 2ξ)v0 + 2ξ ψ . Remark 22 In (20), observe that when ψ ≥ 0 and ξ → 1, the first term vanishes, and v ≤ ψ, recovering the distribution-independent rates. Conversely, when ξ → 0, we confirm that v ≤ v0, which could be further bounded using properties of the model. In practice, we will focus on small values ξ ≪ 1/2, and the simplified bounds will be sufficient. An application of Theorem 1 yields the following.
Corollary 23 In the Huber contamination model P = (1 ξ)Pθ0 + ξH, writing v = v(P) and v0 = v(Pθ0), with probability at least 1 δ, 2[(1 2ξ)v0 + 2ξψ]log(2/δ) ψ 3 log(2/δ) When ψ = 1, v0 1, ξ 1, the bound offers a significant improvement over Ch erief Abdellatif and Alquier (2022, Corollary 3.4). 6.2.2 Improved Confidence Bounds in the Parameter Space In this subsection we still employ the Gaussian kernel kγ (9). It is instructive to analyze how robust estimation with respect to the MMD distance translates into what happens in the space of parameters. In fact, since we have freedom over the lengthscale parameter, we may understandably want to select γ such that the distance in the space of parameters is kept small as well. Wolfer and Alquier Definition 24 (Link function F) Let M = {Pθ : θ Θ} indexed by the parameter space Θ P(Rd). We say that a non-decreasing function Fk : R+ R+ is a link function for the model M and kernel k when for any θ, θ Θ, θ θ 2 Fk µPθ µPθ Hk Remark 25 Note that a link function Fk will only provide nontrivial information if Fk(0) = 0 and Fk is continuous at 0. The existence of such a nontrivial link function implies that the model is identifiable: if µPθ = µPθ then 0 = Fk( µPθ µPθ Hk) θ θ 2, hence θ = θ . In this case, there is a unique function p : M Θ such that p(Pθ) = θ, and Fk is a modulus of continuity of p. A direct application of Corollary 23 gives the following. Corollary 26 In the setting of Corollary 23, assume that M and k are such that a link function Fk exists. With probability at least 1 δ, bθn θ0 2 Fk 2[(1 2ξ)v0 + 2ξψ]log(2/δ) ψ 3 log(2/δ) Gaussian location model, continued. We continue here Example 1. 
A direct computation (see for instance Chérief-Abdellatif and Alquier 2022) shows that ‖µPθ − µPθ′‖²Hkγ = 2 (γ²/(2σ² + γ²))^{d/2} (1 − exp(−‖θ − θ′‖²₂/(4σ² + 2γ²))). In other words, a link function for the Gaussian location model is explicitly given by F²kγ(h) = 2(2σ² + γ²) log( 1/(1 − (h²/2)(1 + 2σ²/γ²)^{d/2}) ). An application of Corollary 26, and setting γ = λσ√d, for some λ > 0, proves that, with probability at least 1 − δ, ‖bθn − θ0‖²₂ ≤ 2σ²(2 + dλ²) log( 1/(1 − (1 + 2/(dλ²))^{d/2} · 2[(1 − 2ξ)(1 − e^{−1/λ²}) + 2ξ] log(2/δ)/n) ) .= Gd,σ,n(λ, ξ). (21) For comparison, we provide the bound of Chérief-Abdellatif and Alquier (2022, Proposition 4.1-Equation 2) that is obtained by plugging (7) into the link function: ‖bθn − θ0‖²₂ ≤ 2σ²(2 + dλ²) log( 1/(1 − (1 + 2/(dλ²))^{d/2} · 2(1 + √(2 log(1/δ)))²/n) ) .= Hd,σ,n(λ, ξ). (22) Figure 3: Comparison of the bounds in (21) versus Chérief-Abdellatif and Alquier (2022, Proposition 4.1-Equation 2) as a function of γ (left panel: ξ = 0.01; right panel: ξ = 0.2; in both, d = 32, δ = 0.1, n = 10000, σ = 3). Figure 4: Comparison of Chérief-Abdellatif and Alquier (2022) and (21) for the optimal hyper-parameter γ as a function of n (δ = 0.1, σ = 3, d = 32, ξ = 0.1). In order to compare the tightness of the bounds in Eq. (22) and Eq. (21), we run experiments for fixed sample size (n = 10000), fixed confidence level (δ = 0.1), fixed variance parameter (σ = 3) and two different contamination levels (ξ = 0.01 and ξ = 0.2). We compute and plot the bounds obtained when varying the scale parameter γ (see Figure 3). For a contamination level ξ = 0.1, we also compare the quantitative decay of the two bounds optimized over γ as the sample size n grows and report our results in Figure 4. Based on our experiments, we make the following observations.
(i) The new bound is always tighter than the one of Chérief-Abdellatif and Alquier (2022). (ii) When the contamination level increases, the two bounds get closer together. This is expected, since we lose the benefit of the variance factor. (iii) Especially under weak contamination, the new bound is much flatter, in the sense that overshooting the value of γ does not lead to a catastrophic degradation of the bound. (iv) The new bound performs exceptionally well in the small-sample setting. We conclude this section by showing heuristically that the bound in (21) can always be made smaller than the bound in (22), at least asymptotically in n, with an adequate choice of λ. For the sake of simplicity, we work in the non-contaminated setting ξ = 0. First, Hd,σ,n(λ, 0) = 4σ²(2 + dλ²)(1 + 2/(dλ²))^{d/2} (1 + √(2 log(1/δ)))²/n · (1 + o(1)), and the first-order term is exactly minimized for λ = 1, which leads to Hd,σ,n(1, 0) = 4σ²(2 + d)(1 + 2/d)^{d/2} (1 + √(2 log(1/δ)))²/n · (1 + o(1)). An exact minimization of the bound in (21) is not feasible, but we will propose a choice of λ = λn that will lead to an improvement on Hd,σ,n(1, 0). First, let us assume that λ1 > 0 and that (λn)n∈N is a non-decreasing sequence. We obtain: Gd,σ,n(λn, 0) = 4σ²(2 + dλn²)(1 + 2/(dλn²))^{d/2} log(2/δ)/(λn² n) · (1 + o(1)) = 4dσ²(1 + 2/(dλn²))^{1+d/2} log(2/δ)/n · (1 + o(1)). It is clear that choosing λn → ∞ when n → ∞ gives: Gd,σ,n(λn, 0) = 4dσ² log(2/δ)/n · (1 + o(1)). Figure 5: Comparison of the bounds in (21) versus Chérief-Abdellatif and Alquier (2022, Proposition 4.1-Equation 2) as a function of γ, versus the empirical MSE on 50 experiments, n = 500, log-scale, ξ = 0, δ = 0.05. Right panel: we zoom on the MSE. Finally, observe that Hd,σ,n(1, 0)/Gd,σ,n(λn, 0) = 2 (1 + 2/d)^{1+d/2} · [(1 + √(2 log(1/δ)))²/(2 log(2/δ))] · (1 + o(1)) ≥ 2e(1 + o(1)), where the bracketed factor is at least 1 for δ ∈ (0, 9/10).
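As a numerical check of this heuristic, one can compare the λ-dependent prefactors of the two first-order terms, dropping all confidence-level factors; the prefactor shapes below follow the expansions above and should be read as a sketch rather than an exact transcription of the bounds.

```python
import numpy as np

# lambda-dependent prefactors of the first-order terms (confidence-level
# factors log(1/delta), log(2/delta) omitted); shapes follow the expansions.

def h_prefactor(lam, d):
    # prefactor of H: (2 + d lam^2) (1 + 2/(d lam^2))^{d/2}, minimized at lam = 1
    return (2 + d * lam ** 2) * (1 + 2 / (d * lam ** 2)) ** (d / 2)

def g_prefactor(lam, d):
    # prefactor of G: d (1 + 2/(d lam^2))^{1 + d/2}, decreasing to d as lam -> inf
    return d * (1 + 2 / (d * lam ** 2)) ** (1 + d / 2)

d = 32
lams = np.linspace(0.5, 50, 2000)
h_vals = h_prefactor(lams, d)
g_vals = g_prefactor(lams, d)

print(h_vals.min() >= h_prefactor(1.0, d) - 1e-9)   # H minimized at lambda = 1
print(g_prefactor(50.0, d) / d)                      # close to 1 for large lambda
```

This reproduces the qualitative picture of Figure 5: the H-type prefactor has an interior minimum at λ = 1, while the G-type prefactor decreases monotonically toward d as λ grows.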
Unfortunately, this first-order analysis is too crude to provide an accurate recommendation on the choice of λn. On the other hand, it shows that our variance-aware bound leads to a significant improvement over the bound of Chérief-Abdellatif and Alquier (2022) for Gaussian mean estimation. Moreover, the choice λn → ∞ is in accordance with the considerations on the asymptotic variance in Briol et al. (2019) in this model. We illustrate this with an experiment in Figure 5. We observe that, indeed, the variance-aware bound decreases with γ (equivalently, with λ), while the McDiarmid bound has a minimum. Interestingly, the true MSE also has a minimum. Note however that the MSE is actually very flat as a function of γ: in other words, in this experiment, the choice of γ does not matter much for the performance of the MMD estimator. The possibility of using the variance-aware bound to calibrate γ in practice should be investigated further in other models. 7. Proofs In this section, we provide the detailed technical proofs that were deferred in earlier parts of the paper. 7.1 Proof of Theorem 1 In Tolstikhin et al. (2017), the authors mention, but do not pursue, the idea that Bernstein's inequality in separable Hilbert spaces yields the proper rate of OP(n^{−1/2}). We follow up on their idea in this proof. First recall that bµP = (1/n) Σ_{t=1}^n k(Xt, ·) and µP .= EX∼P [k(X, ·)], and that we can write ‖bµP − µP‖Hk = (1/n) ‖Σ_{t=1}^n Zt‖Hk, where for 1 ≤ t ≤ n, Zt .= k(Xt, ·) − EX∼P [k(X, ·)] is centered. For ε > 0, we are thus interested in bounding the probability of the deviation event ‖Σ_{t=1}^n Zt‖Hk > nε. We rely on a Bernstein-type inequality in Hilbert spaces, credited to Pinelis and Sakhanenko (1985) (see also Yurinsky 1995, Theorem 3.3.4), which is reported in Theorem 32. We first observe that the RKHS Hk is separable by separability of the topological space X and continuity of k. In order to invoke Theorem 32, we need to control from above the quantities Σ_{t=1}^n E‖Zt‖^p_Hk, for any p ≥ 3. For p = 2, Σ_{t=1}^n E‖k(Xt, ·) − µP‖²Hk = nv.
Moving on to higher moments, for p 3, we have that Zt Hk k(Xt, ) Hk + EX P [k(X, )] Hk p k(Xt, Xt) + sup x X As a result, t=1 E Zt p Hk t=1 E Zt 2 Hk Invoking Theorem 32 with G2 = nv, H = 2 k/3, we obtain that for any ε > 0, 2(nv + (εn)2 Variance-Aware Estimation of Kernel Mean Embedding 7.2 Proof of Lemma 10 We first verify the bounded-differences property. bv(x) inf x t0 X bv x(t0) =ψ 1 (n 1)n t =s ψ(xt xs) inf x t0 X t =s ψ x(t0) t x(t0) s t =s ψ x(t0) t x(t0) s t =s ψ(xt xs) t [n] t =t0 t [n] t =t0 t [n] t =t0 t [n] t =t0 Wolfer and Alquier It follows from ( ) that, bv(x) inf x t0 X bv x(t0) !2 t [n] t =t0 t [n] t =t0 ψ ψ(xt0 xt) t [n] t =t0 ψ ψ(xt0 xt) 2 t [n] t =t0 ψ ψ(xt0 xt) where the second inequality is Jensen s. 7.3 Proof of Lemma 11 From Lemma 10, the function X n R, x 7 n 2 ψ bv(x), is weakly (2, 0)-self-bounding and has the bounded differences property in the sense of Definition 7. From the second statement of Maurer (2006, Theorem 13), P E n 2 ψ bv(X) n 2 ψ bv(X) > t exp 4E h n 2 ψbv(X) i that can be rewritten, P (v bv(X) > t) exp nt2 In other words, with probability at least 1 δ, 2 ψ log(1/δ) Completing the square, and by sub-additivity of the square root function, we successively have with probability 1 δ, v 2 ψ log(1/δ) !2 bv(X) + 2 ψ log(1/δ) Variance-Aware Estimation of Kernel Mean Embedding 2 ψ log(1/δ) bv(X) + 2 ψ log(1/δ) 2 ψ log(1/δ) which proves the lemma for b = 1. On the other hand, from the first inequality in Maurer (2006, Theorem 13), P n 2 ψ bv(X) E n 2 ψ bv(X) > t exp 4E h n 2 ψbv(X) i + 2t that can be rewritten as P (bv(X) v > t) exp nt2 8 ψ(v + t/2) The positive solution t+ for nt2 = 8 ψ(v + t/2) log(1/δ) is readily given by t+ = 2 ψ log(1/δ) 2 log(1/δ) ψ 2 log(1/δ) + nv . 
Thus, it holds with probability at least 1 δ that bv(X) v + t+, and from sub-additivity of the square root, successively, with probability at least 1 δ, we have bv(X) v + 4 ψ log(1/δ) 2v ψ log(1/δ) 2 ψ log(1/δ) !2 + 2 ψ log(1/δ) 2 ψ log(1/δ) !2 + 2 ψ log(1/δ) bv(X) v + 2 2 ψ log(1/δ) which finishes proving the lemma10 for b = 1. 10. In this proof, Maurer (2006) is sufficient. We do not need the stronger results of Boucheron et al. (2009). Wolfer and Alquier 7.4 Proof of Theorem 12 From Theorem 1, with probability at least 1 δ/2, it holds that bµP(X1, . . . , Xn) µP Hψ Invoking Lemma 11 for b = 1, at confidence 1 δ/2 it holds that 2 ψ log(2/δ) 2 ψ log(4/δ) thus by combining (23) and (24) and a union bound yields that with probability at least 1 δ, bµP(X1, . . . , Xn) µP Hψ 2 ψ log(4/δ) 2bv(X)log(4/δ) 2bv(X)log(4/δ) 7.5 Proof of Lemma 16 E bµP µP 2 Hk =E 1 t=1 k(Xt, ) µP, 1 s=1 k(Xs, ) µP t=1 E k(Xt, ) µP, k(Xs, ) µP Hk s B(X1, . . . , Xn, α)) inf θ Θ c µP µPθ Hk > B(X1, . . . , Xn, α) Hk > B(X1, . . . , Xn, α) Wolfer and Alquier thanks to (a). Now, assume that H0 is not true, that is P / {Pθ, θ Θ}. Note that under the assumption that the model is closed, we have 0 < := inf θ Θ µP µPθ Hk . By the triangle inequality, for any θ, µPθ bµP Hk µPθ µP Hk µP bµP Hk and thus, taking the infimum w.r.t θ on both sides, T µP bµP Hk . We apply (a) with confidence level αn and obtain PH1 µP bµP Hk B(X1, . . . , Xn, αn) αn PH1 (T B(X1, . . . , Xn, αn)) αn. As B(X1, . . . , Xn, αn) 0 when n thanks to (c), there is a N large enough such that, for any n N, B(X1, . . . , Xn, αn) /2, and thus Moreover, as we also have B(X1, . . . , Xn, αn) 0 when n thanks to (b), there is a N large enough such that, for any n N , B(n, α) /2 and thus PH1 (T B(X1, . . . , Xn, αn)) αn. PH1 (C) = 1 PH1 (T B(X1, . . . , Xn, αn)) 1 αn as soon as n max(N, N ). Variance-Aware Estimation of Kernel Mean Embedding 7.8 Proof of Theorem 20 First, under H0, we have PX = PY and thus PH0 (C2) = PH0 (T2 > B(X1, . . . 
, Xn, α/2) + B(Y1, . . . , Ym, α/2)) = PH0 ( ‖c µP(X1, . . . , Xn) − c µP(Y1, . . . , Ym)‖Hk > B(X1, . . . , Xn, α/2) + B(Y1, . . . , Ym, α/2) ) = PH0 ( ‖c µP(X1, . . . , Xn) − µPX + µPX − c µP(Y1, . . . , Ym)‖Hk > B(X1, . . . , Xn, α/2) + B(Y1, . . . , Ym, α/2) ) ≤ PH0 ( ‖c µP(X1, . . . , Xn) − µPX‖Hk > B(X1, . . . , Xn, α/2) ) + PH0 ( ‖c µP(Y1, . . . , Ym) − µPX‖Hk > B(Y1, . . . , Ym, α/2) ) ≤ α/2 + α/2 = α, thanks to (a). Now, assume that H0 is not true, that is PX ≠ PY , and put ∆ .= ‖µPX − µPY‖Hk > 0. By the triangle inequality, T2 = ‖c µP(X1, . . . , Xn) − c µP(Y1, . . . , Ym)‖Hk ≥ ∆ − ‖c µP(X1, . . . , Xn) − µPX‖Hk − ‖c µP(Y1, . . . , Ym) − µPY‖Hk . We apply (a) with respective confidence levels αn and αm to get PH1 ( ‖µPX − bµP(X1, . . . , Xn)‖Hk ≥ B(X1, . . . , Xn, αn) ) ≤ αn and PH1 ( ‖µPY − bµP(Y1, . . . , Ym)‖Hk ≥ B(Y1, . . . , Ym, αm) ) ≤ αm. Thus, PH1 (T2 ≤ ∆ − B(X1, . . . , Xn, αn) − B(Y1, . . . , Ym, αm)) ≤ αn + αm. Thanks to (c), there is an N large enough such that, as soon as both n, m ≥ N, both B(X1, . . . , Xn, αn) ≤ ∆/4 and B(Y1, . . . , Ym, αm) ≤ ∆/4 hold, and thus PH1 (T2 ≤ ∆/2) ≤ αn + αm. Moreover, thanks to (b), there is an N′ large enough such that, when n, m ≥ N′, both B(X1, . . . , Xn, α/2) ≤ ∆/4 and B(Y1, . . . , Ym, α/2) ≤ ∆/4 hold, and thus PH1 (T2 ≤ B(X1, . . . , Xn, α/2) + B(Y1, . . . , Ym, α/2)) ≤ αn + αm. Hence PH1 (C2) = 1 − PH1 (T2 ≤ B(X1, . . . , Xn, α/2) + B(Y1, . . . , Ym, α/2)) ≥ 1 − αn − αm as soon as n ≥ max(N, N′), and thus PH1 (C2) ≥ 1 − αn − αm → 1 as n, m → ∞. 7.9 Proof of Lemma 21 v(P) = EX∼P ‖k(X, ·) − µP‖²Hk = ψ − ∫∫ ψ(x − x′) d(P ⊗ P)(x, x′) = ψ − ∫∫ ψ(x − x′) dP(x) dP(x′) = ψ − ∫∫ ψ(x − x′) d((1 − ξ)Pθ0(x) + ξH(x)) d((1 − ξ)Pθ0(x′) + ξH(x′)) = ψ − [ (1 − ξ)² ∫∫ ψ(x − x′) dPθ0(x) dPθ0(x′) + 2ξ(1 − ξ) ∫∫ ψ(x − x′) dPθ0(x) dH(x′) + ξ² ∫∫ ψ(x − x′) dH(x) dH(x′) ] ≤ ψ − (1 − ξ)²(ψ − v0) + 2ξ(1 − ξ)ψ + ξ²ψ = ψ − (1 − 2ξ + ξ²)(ψ − v0) + ξ(2 − ξ)ψ = ψ − (ψ − v0) + 2ξ(ψ − v0) − ξ²(ψ − v0) + ξ(2 − ξ)ψ ≤ (1 − 2ξ)v0 + 2ξψ + ξ(2 − ξ)ψ, and the lemma follows by disjunction of cases. Acknowledgments The authors declare that there is no conflict of interest.
Part of this research was conducted when GW was supported by the Special Postdoctoral Researcher Program (SPDR) of RIKEN. Part of the research in this paper was done when PA was working at RIKEN AIP. We thank Bastien Dussap for pointing out a missing factor two in an earlier version Variance-Aware Estimation of Kernel Mean Embedding of this manuscript. We thank the anonymous reviewers for their useful comments. In particular, the comparison to the approach based on Bernstein s inequality for U-statistics in Section 4 was suggested by Reviewer 1 and the analysis for time-dependent processes in Section 5 was suggested by Reviewer 2. All the remaining mistakes are ours. Appendix A. Extension to Non Translation Invariant Kernels Let us define diag k: X R by diag k(x) = k(x, x), and use the shorthands diag k, diag k and diag k introduced in (2) accordingly. We show that the properties obtained in Section 3 can be extended to non TI kernel when diag k is controlled from above. Lemma 27 Let k be a characteristic reproducing kernel. The function X n R, x 7 n diag k + 2 k bv(x), is weakly 2, 2n diag k diag k + 2 k -self-bounding, and has bounded differences in the sense of Definition 7. Lemma 28 For b { 1, 1}, with probability at least 1 δ, bv(X) + diag k p v + diag k i 2 ( diag k + 2 k) log(1/δ) Attempting to recover concentration of bv around v with self-boundedness leads to an additional term in O n 1/4 in (26) when diag k = 0. We thus settle for concentration of p bv + diag k around v + diag k instead. Theorem 29 (Confidence interval with empirical variance for general kernel) Let n N, X1, . . . , Xn P, let k be a characteristic reproducing kernel defined from a positive definite function ψ [see (4)]. Then with probability at least 1 δ, it holds that bµP µP Hk b Bk,δ(X1, . . . , Xn), b Bk,δ(X1, . . . , Xn) = b Bδ .= 2(bv(X1, . . . , Xn) + diag k)log(4/δ) diag k log(4/δ) and where bv is the empirical variance proxy defined in (12). 
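Theorem 29 is stated in terms of the empirical variance proxy bv of (12). A minimal sketch of this proxy, in the form reconstructed from the computation of Appendix B.2, bv(x) = (1/(n−1)) Σt [k(xt, xt) − (1/n) Σs k(xt, xs)], checked on a case where v is available in closed form; the Gaussian kernel and sample size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Empirical variance proxy (form reconstructed from Appendix B.2):
# an unbiased estimator of v = E k(X, X) - E k(X, X').

def v_hat(K):
    """K: n x n Gram matrix of the sample."""
    n = K.shape[0]
    return (np.trace(K) - K.mean(axis=1).sum()) / (n - 1)

n = 2000
x = rng.normal(size=n)                              # X_1, ..., X_n iid N(0, 1)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2)     # Gaussian kernel, gamma = 1

est = v_hat(K)
exact = 1.0 - 1.0 / np.sqrt(3)   # closed-form v for N(0,1) and this kernel
print(est, exact)
```

For a translation invariant kernel, k(x, x) is constant, so the diag k correction of Theorem 29 vanishes and the proxy reduces to the quantity used in Theorem 12.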
Remark 30 In particular, for a TI kernel diag k = 0, recovering the results of the previous section. It is currently unclear whether the term in diag k is necessary or if it is an artifact of our proof. Wolfer and Alquier Appendix B. Additional Proofs In this section, we present supplementary technical proofs that were postponed in previous sections of the paper B.1 Proof of Lemma 8 For some t0, we let x(t0) = (x1, . . . , xt0 1, x t0, xt0+1, . . . , xn) where xt0 was replaced with x t0. 1 d Tr bΣ(x) 1 d Tr bΣ x(t0) xi t xi s 2 x(t0)i t x(t0)i s 2 ! xi t0 xi s 2 ! 1 d Tr bΣ(x) inf x t0 X 1 d Tr bΣ x(t0) 1. Furthermore, d Tr bΣ(x) 1 d Tr bΣ x(t0) 2 (i) xi t0 xi s 2 !!2 xi t0 xi s 2 !2 n n 1bvi(X) = n n 1 1 d Tr bΣ(X), where (i) follows from (27), (ii) is Jensen s inequality, and (iii) stems from Maurer and Pontil (2009, Corollary 9). Variance-Aware Estimation of Kernel Mean Embedding B.2 Proof of Lemma 9 We verify that bv is an unbiased estimator for v. EX1,...,Xn P [bv(X1, . . . , Xn)] =EX1,...,Xn P k(Xt, Xt) 1 s=1 k(Xt, Xs) EX P [k(X, X)] 1 s=1 EXt,Xs P [k(Xt, Xs)] = n n 1EX P [k(X, X)] 1 n(n 1) s=1 EXt,Xs P [k(Xt, Xs)] = n n 1EX P [k(X, X)] 1 (n 1)EX P [k(X, X)] n(n 1)EX,X P k(X, X ) =EX P [k(X, X)] EX,X P k(X, X ) B.3 Proof of Theorem 18 For t N, we write Zt .= k(Xt, ) µP, thus bµP(X1, . . . , Xn) µP Hk = 1 We assume that n = 2Bs, where B, s N will be determined later. For b [B], we denote Z[2b] .= Z[2b] 1 , . . . , Z[2b] s .= Z(2b 1)s+1, . . . , Z2bs, Z[2b 1] .= Z[2b 1] 1 , . . . , Z[2b 1] s .= Z(2b 2)s+1, . . . , Z(2b 1)s, and we decompose the random process into blocks Z[1], Z[2], . . . , Z[2B]. From sub-additivity of the norm and the blocking method described in Yu (1994, Corollary 2.7), it holds that t=1 Z[2b σ] t ! Hk > nε/2 t=1 e Z[2b σ] t ! Hk > nε/2 + 2 (B 1) β(s) Wolfer and Alquier where for b [2], the blocks e Z[2b] = e Z[2b] 1 , . . . , e Z[2b] s are mutually independent and a similar fact holds for the blocks e Z[2b 1] = e Z[2b 1] 1 , . . . 
, e Z[2b 1] s . As a result, for b [B], the Ps t=1 e Z[2b σ] t are independent, centered and valued in Hk, thus the problem above has been reduced to controlling the norm of a sum of B iid random vectors in the Hilbert space Hk. Let us fix σ {0, 1}. Similar to our proof in the iid setting, our goal is to invoke Theorem 32 (Yurinsky, 1995, Theorem 3.3.4). In order to do that, we need to control from above the PB b=1 E Ps t=1 e Z[2b σ] t p Hk , for any p 3. For p = 2, by linearity of the expectation, we t=1 e Z[2b σ] t r=1 E e Z[2b σ] t , e Z[2b σ] r Hk. By stationarity, we can write more simply t=1 e Z[2b σ] t r=1 E k(X(2b σ 1)s+t, ) µP, k(X(2b σ 1)s+r, ) µP Hk r=1 E k(Xt r+1, ) µP, k(X1, ) µP Hk t=1 k(X1, ) µP, k(X1, ) µP Hk t=2 (s t + 1)E k(Xt, ) µP, k(X1, ) µP Hk It is then natural to consider the following covariance coefficients in the RKHS introduced by Ch erief-Abdellatif and Alquier (2022), ρt .= E k(Xt, ) µP, k(X1, ) µP Hk, and in particular, observe that we recover ρ1 = v. It follows that t=1 e Z[2b σ] t t=2 (s t + 1)ρt 2 (v + Σs) , where the mixing coefficient Σs was introduced in Lemma 16. In particular, Σs vanishes for iid processes. For p 3, by sub-additivity of the norm, t=1 e Z[2b σ] t e Z[2b σ] t Hk 2s p Variance-Aware Estimation of Kernel Mean Embedding Therefore, invoking Theorem 32 with G2 = n 2 (v + Σs) and H = 2s k/3, we obtain for σ {0, 1} and for any ε > 0 that, t=1 e Z[2b σ] t 2(n(v + Σs)/2 + ((ε/2)n)2s 4 v + Σs + 2εs It remains to control the term involving the β-mixing coefficient by suitable choosing the blocking size. For τ β n,δ .= arg min s N δ 6(n/(2s) 1) it holds that 2 (B 1) β(s) δ/3, which concludes the proof. B.4 Proof of Lemma 27 We will rely on the following property for general kernels. Lemma 31 Let x = (x1, . . . , xt0, . . . , xn) X n and x t0 X. Writing x(t) = (x1, . . . , xt0 1, x t0, xt0+1, . . . , xn), it holds that bv(x) bv x(t0) = 1 n k (xt0, xt0) k x t0, x t0 t [n],t =t0 k x t0, xt k (xt0, xt) . 
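The identity of Lemma 31 can be verified numerically. The sketch below uses the same reconstructed form of bv as in Appendix B.2; the Laplacian kernel and the replaced coordinate are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

# Numerical check of Lemma 31 with
# v_hat(x) = (1/(n-1)) sum_t k(x_t, x_t) - (1/(n(n-1))) sum_{t,s} k(x_t, x_s).

def k(a, b):
    return np.exp(-np.abs(a - b))    # Laplacian kernel, an illustrative choice

def v_hat(x):
    K = k(x[:, None], x[None, :])
    n = len(x)
    return K.trace() / (n - 1) - K.sum() / (n * (n - 1))

n, t0 = 50, 7
x = rng.normal(size=n)
y = x.copy()
y[t0] = 2.5                          # replace coordinate t0 by x'_{t0}

lhs = v_hat(x) - v_hat(y)
others = np.delete(np.arange(n), t0)
# Right-hand side of Lemma 31:
rhs = (k(x[t0], x[t0]) - k(y[t0], y[t0])) / n \
      + 2.0 / ((n - 1) * n) * (k(y[t0], x[others]) - k(x[t0], x[others])).sum()
print(abs(lhs - rhs))
```

The two sides agree up to floating-point error, confirming that changing a single coordinate moves bv by a diagonal term of order 1/n plus an off-diagonal sum of order 1/n, which is what drives the bounded-differences and self-bounding properties used above.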
From Lemma 31, it follows that
\[
\widehat v(x) - \inf_{x'_{t_0} \in \mathcal{X}}\widehat v\left(x^{(t_0)}\right) \overset{(\ast)}{\leq} \frac{1}{n}\left(\|\mathrm{diag}\,k\| + \frac{2}{n-1}\sum_{t \in [n], t \neq t_0}\left(\|k\|_\infty - k(x_{t_0},x_t)\right)\right) \overset{(\ast\ast)}{\leq} \frac{\|\mathrm{diag}\,k\| + 2\|k\|_\infty}{n},
\]
and the bounded differences property follows directly from $(\ast\ast)$. Furthermore, as a result of $(\ast)$, expanding the square, bounding one factor by $(\ast\ast)$, and using at $(\ast\ast\ast)$ both the identity $\sum_{t_0, t \in [n]}\left(k(x_{t_0},x_{t_0}) - k(x_{t_0},x_t)\right) = n(n-1)\,\widehat v(x)$ and the fact that $\|k\|_\infty = \sup_{x \in \mathcal{X}} k(x,x)$ for a positive definite kernel, we obtain
\[
\sum_{t_0 = 1}^{n}\left(\widehat v(x) - \inf_{x'_{t_0} \in \mathcal{X}}\widehat v\left(x^{(t_0)}\right)\right)^2 \overset{(\ast\ast\ast)}{\leq} \frac{2\left(\|\mathrm{diag}\,k\| + 2\|k\|_\infty\right)}{n}\left(\widehat v(x) + \|\mathrm{diag}\,k\|\right).
\]

B.5 Proof of Lemma 28

Lemma 27 established that the function
\[
\mathcal{X}^n \to \mathbb{R}, \qquad x \mapsto \frac{n}{\|\mathrm{diag}\,k\| + 2\|k\|_\infty}\,\widehat v(x),
\]
is weakly self-bounding and has the bounded differences property. From an application of Boucheron et al. (2009, Theorem 1),
\begin{align*}
P\left(\widehat v - v > t\right) &\leq \exp\left(-\frac{n t^2}{4\left(\|\mathrm{diag}\,k\| + 2\|k\|_\infty\right)\left(v + \|\mathrm{diag}\,k\| + t/2\right)}\right), \\
P\left(v - \widehat v > t\right) &\leq \exp\left(-\frac{n t^2}{4\left(\|\mathrm{diag}\,k\| + 2\|k\|_\infty\right)\left(v + \|\mathrm{diag}\,k\|\right)}\right).
\end{align*}
The claim follows from rewriting $\widehat v - v = \left[\widehat v + \|\mathrm{diag}\,k\|\right] - \left[v + \|\mathrm{diag}\,k\|\right]$, and from solving quadratic inequalities as in Lemma 11.

B.6 Proof of Theorem 29

From Theorem 1, and since $\|\mathrm{diag}\,k\| > 0$, with probability at least $1 - \delta/2$ it holds that
\[
\left\|\widehat\mu_P(X_1,\dots,X_n) - \mu_P\right\|_{H_k} \leq \sqrt{\frac{2\left(v + \|\mathrm{diag}\,k\|\right)\log(4/\delta)}{n}}. \tag{28}
\]
By Lemma 28 for $b = 1$, at confidence $1 - \delta/2$ it holds that
\[
\sqrt{v + \|\mathrm{diag}\,k\|} \leq \sqrt{\widehat v + \|\mathrm{diag}\,k\|} + 2\sqrt{\frac{\left(\|\mathrm{diag}\,k\| + 2\|k\|_\infty\right)\log(4/\delta)}{n}}, \tag{29}
\]
thus combining (28) and (29) with a union bound yields that with probability at least $1 - \delta$,
\[
\left\|\widehat\mu_P - \mu_P\right\|_{H_k} \leq \sqrt{\frac{2\log(4/\delta)}{n}}\left(\sqrt{\widehat v + \|\mathrm{diag}\,k\|} + 2\sqrt{\frac{\left(\|\mathrm{diag}\,k\| + 2\|k\|_\infty\right)\log(4/\delta)}{n}}\right) \leq \sqrt{\frac{2\left(\widehat v + \|\mathrm{diag}\,k\|\right)\log(4/\delta)}{n}} + \frac{2\sqrt{2\left(\|\mathrm{diag}\,k\| + 2\|k\|_\infty\right)}\,\log(4/\delta)}{n},
\]
where the last inequality is by sub-additivity of the square root.

B.7 Proof of Lemma 31

Starting from the definition of $\widehat v$,
\[
\widehat v(x) - \widehat v\left(x^{(t_0)}\right) = \frac{1}{n-1}\sum_{r=1}^{n}\left(k(x_r,x_r) - k\left(x^{(t_0)}_r, x^{(t_0)}_r\right)\right) - \frac{1}{n(n-1)}\sum_{(r,s) \in [n]^2}\left(k(x_r,x_s) - k\left(x^{(t_0)}_r, x^{(t_0)}_s\right)\right).
\]
Splitting the double sum into its diagonal pairs ($r = s$) and off-diagonal pairs ($r \neq s$), and the off-diagonal pairs further according to whether $r = t_0$ or $s = t_0$ (all the remaining terms cancel),
\begin{align*}
\widehat v(x) - \widehat v\left(x^{(t_0)}\right) &= \frac{1}{n-1}\left(k(x_{t_0},x_{t_0}) - k(x'_{t_0},x'_{t_0})\right) - \frac{1}{n(n-1)}\Bigg[\left(k(x_{t_0},x_{t_0}) - k(x'_{t_0},x'_{t_0})\right) \\
&\qquad + \sum_{s \neq t_0}\left(k(x_{t_0},x_s) - k(x'_{t_0},x_s)\right) + \sum_{r \neq t_0}\left(k(x_r,x_{t_0}) - k(x_r,x'_{t_0})\right)\Bigg] \\
&= \frac{1}{n}\left(k(x_{t_0},x_{t_0}) - k(x'_{t_0},x'_{t_0})\right) + \frac{2}{n(n-1)}\sum_{t \neq t_0}\left(k(x'_{t_0},x_t) - k(x_{t_0},x_t)\right),
\end{align*}
where the last step uses the symmetry of the kernel.

Appendix C. Tools from the Literature

We report here tools from the literature.

Theorem 32 (Yurinsky, 1995, Theorem 3.3.4) Let $Z_1, \dots, Z_n$ be independent random variables in a separable Hilbert space $H$ that are centered, that is, $\mathbf{E}[Z_t] = 0$ for all $t \in [n]$. If there exist real numbers $G, H \geq 0$ such that for any $p \geq 2$,
\[
\sum_{t=1}^{n}\mathbf{E}\left\|Z_t\right\|_{H}^{p} \leq \frac{p!}{2}\,G^2 H^{p-2},
\]
then it holds for any $\varepsilon > 0$ that
\[
P\left(\left\|\sum_{t=1}^{n} Z_t\right\|_{H} \geq \varepsilon\right) \leq 2\exp\left(-\frac{\varepsilon^2/2}{G^2 + \varepsilon H}\right).
\]

Theorem 33 (Samson, 2000, Theorem 3) Let $X_1, \dots, X_n$ be a stationary $\varphi$-mixing sequence. For every $\varepsilon > 0$,
\[
P\left(Z \geq \mathbf{E} Z + \varepsilon\right) \leq \exp\left(-\frac{1}{8\|\Gamma\|^2}\min\left(\frac{\varepsilon^2}{\mathbf{E}[V]}, \frac{\varepsilon}{C}\right)\right),
\]
where
\[
Z := \sup_{g \in \mathcal{F}}\sum_{t=1}^{n} g(X_t),
\]
with $\mathcal{F}$ an arbitrary class of real bounded functions with $|g| \leq C$, where the random variable $V$ is defined as
\[
V := \sum_{t=1}^{n}\sup_{g \in \mathcal{F}} g(X_t)^2,
\]
and $\Gamma$ is the coupling matrix (refer to Equation 17).

References

P. Alquier, B.-E. Chérief-Abdellatif, A. Derumigny, and J.-D. Fermanian. Estimation of copulas via maximum mean discrepancy. Journal of the American Statistical Association, 118(543):1997–2012, 2023.

M. A. Arcones. A Bernstein-type inequality for U-statistics and U-processes. Statistics & Probability Letters, 22(3):239–247, 1995.

J.-Y. Audibert, R. Munos, and C. Szepesvári. Tuning bandit algorithms in stochastic environments. In International Conference on Algorithmic Learning Theory, pages 150–165. Springer, 2007.

P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

V. Bentkus, N. Kalosha, and M. van Zuijlen. On domination of tail probabilities of (super)martingales: explicit bounds. Lithuanian Mathematical Journal, 46(1):1–43, 2006.

D. Berend and A. Kontorovich. A sharp estimate of the binomial mean absolute deviation with applications. Statistics & Probability Letters, 83(4):1254–1259, 2013.

A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media, 2011.

S. Boucheron, G. Lugosi, and P. Massart. On concentration of self-bounding functions. Electronic Journal of Probability, 14:1884–1899, 2009.

R. C. Bradley. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2:107–144, 2005.

F.-X. Briol, A. Barp, A. B. Duncan, and M. Girolami. Statistical inference for generative models with maximum mean discrepancy. arXiv preprint arXiv:1906.05944, 2019.

B.-E. Chérief-Abdellatif and P. Alquier. Finite sample properties of parametric MMD estimation: robustness to misspecification and dependence. Bernoulli, 28(1):181–213, 2022.

K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2606–2615. PMLR, 2016.

D. Cohen, A. Kontorovich, and G. Wolfer. Learning discrete distributions with infinite support. In Advances in Neural Information Processing Systems, volume 33, pages 3942–3951, 2020.

M. Cuturi. Positive definite kernels in machine learning. arXiv preprint arXiv:0911.5367, 2009.

J. Diestel and J. Uhl. Vector Measures, volume 15 of Mathematical Surveys. American Mathematical Society, Providence, RI, USA, 1977.

P. Doukhan. Mixing: Properties and Examples, volume 85. Springer Science & Business Media, 2012.

G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 258–267, 2015.

M. G. Genton. Classes of kernels for machine learning: a statistics perspective. Journal of Machine Learning Research, 2(Dec):299–312, 2001.

A. Gretton. Introduction to RKHS, and some simple kernel algorithms. Advanced Topics in Machine Learning, lecture notes, University College London, 2013.

A. Gretton, K. Fukumizu, Z. Harchaoui, and B. K. Sriperumbudur. A fast, consistent kernel two-sample test. Advances in Neural Information Processing Systems, 22, 2009.

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.

W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

P. J. Huber. Robust statistics. In International Encyclopedia of Statistical Science, pages 1248–1251. Springer, 2011.

I. A. Ibragimov. Some limit theorems for stationary processes. Theory of Probability & Its Applications, 7(4):349–382, 1962.

W. Jitkrittum, W. Xu, Z. Szabó, K. Fukumizu, and A. Gretton. A linear-time kernel goodness-of-fit test. Advances in Neural Information Processing Systems, 30, 2017.

M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. E. Schapire, and L. Sellie. On the learnability of discrete distributions. In Proceedings of the Twenty-Sixth Annual ACM Symposium on Theory of Computing, pages 273–282, 1994.

A. K. Kuchibhotla and Q. Zheng. Near-optimal confidence sequences for bounded random variables. In International Conference on Machine Learning, pages 5827–5837. PMLR, 2021.

D. A. Levin, Y. Peres, and E. L. Wilmer. Markov Chains and Mixing Times, second edition. American Mathematical Society, 2009.

Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In International Conference on Machine Learning, pages 1718–1727. PMLR, 2015.

D. Lopez-Paz, K. Muandet, B. Schölkopf, and I. Tolstikhin. Towards a learning theory of cause-effect inference. In International Conference on Machine Learning, pages 1452–1461. PMLR, 2015.

A. Maurer. Concentration inequalities for functions of independent variables. Random Structures & Algorithms, 29(2):121–138, 2006.

A. Maurer and M. Pontil. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.

C. McDiarmid. On the method of bounded differences. Surveys in Combinatorics, 141(1):148–188, 1989.

M. Mohri and A. Rostamizadeh. Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research, 11(2), 2010.

M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2018.

K. Muandet, K. Fukumizu, B. Sriperumbudur, A. Gretton, and B. Schölkopf. Kernel mean estimation and Stein effect. In International Conference on Machine Learning, pages 10–18. PMLR, 2014.

K. Muandet, B. Sriperumbudur, K. Fukumizu, A. Gretton, and B. Schölkopf. Kernel mean shrinkage estimators. Journal of Machine Learning Research, 17, 2016.

F. Münch and J. Salez. Mixing time and expansion of non-negatively curved Markov chains. Journal de l'École polytechnique – Mathématiques, 10:575–590, 2023. doi: 10.5802/jep.226.

T. Peel, S. Anthoine, and L. Ralaivola. Empirical Bernstein inequalities for U-statistics. Advances in Neural Information Processing Systems, 23, 2010.

I. F. Pinelis and A. I. Sakhanenko. Remarks on inequalities for the probabilities of large deviations. Teoriya Veroyatnostei i ee Primeneniya, 30(1):127–131, 1985.

G. O. Roberts and J. S. Rosenthal. General state space Markov chains and MCMC algorithms. Probability Surveys, 1:20–71, 2004. doi: 10.1214/154957804100000024.

P.-M. Samson. Concentration of measure inequalities for Markov chains and Φ-mixing processes. The Annals of Probability, 28(1):416–461, 2000. doi: 10.1214/aop/1019160125.

M. L. Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer Science & Business Media, 1999.

I. Tolstikhin, B. K. Sriperumbudur, and K. Muandet. Minimax estimation of kernel mean embeddings. The Journal of Machine Learning Research, 18(1):3002–3048, 2017.

A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Science & Business Media, 2008.

I. Waudby-Smith and A. Ramdas. Estimating means of bounded random variables by betting. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86(1):1–27, 2024.

G. Wolfer and P. Alquier. Optimistic estimation of convergence in Markov chains with the average-mixing time. arXiv preprint arXiv:2402.10506, 2024.

B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, pages 94–116, 1994.

V. Yurinsky. Sums and Gaussian Vectors. Lecture Notes in Mathematics. Springer Berlin, Heidelberg, 1995.