Journal of Machine Learning Research 26 (2025) 1-48 Submitted 2/23; Revised 3/25; Published 3/25

Variance-Aware Estimation of Kernel Mean Embedding

Geoffrey Wolfer geo.wolfer@aoni.waseda.jp
Center for Data Science, Waseda University, 1-6-1 Nishiwaseda, Shinjuku-ku, Tokyo 169-8050, Japan

Pierre Alquier alquier@essec.edu
ESSEC Business School, Asia-Pacific campus, 5 Nepal Park, 575749, Singapore

Editor: Bharath Sriperumbudur

Abstract

An important feature of kernel mean embeddings (KME) is that the rate of convergence of the empirical KME to the true distribution KME can be bounded independently of the dimension of the space, properties of the distribution and smoothness features of the kernel. We show how to speed up convergence by leveraging variance information in the reproducing kernel Hilbert space. Furthermore, we show that even when such information is a priori unknown, we can efficiently estimate it from the data, recovering the desiderata of a distribution-agnostic bound that enjoys acceleration in fortuitous settings. We further extend our results from independent data to stationary mixing sequences and illustrate our methods in the context of hypothesis testing and robust parametric estimation.

Keywords: kernel mean embedding, maximum mean discrepancy, distribution learning, empirical Bernstein

1. Introduction

Estimating a probability distribution P over a space X from n iid samples X1, . . . , Xn ∼ P is a central problem in computer science and statistics (Kearns et al., 1994; Tsybakov, 2008). To formalize the question, one selects a distance (or at least a contrast function) between distributions and oftentimes introduces assumptions on the underlying probability space (for instance finitely supported, probability function absolutely continuous with respect to the Lebesgue measure, Hölder continuous, . . . ). Increasing the stringency of assumptions generally leads to substantially faster minimax rates.
For instance, in the finite support case |X| < ∞ and with respect to total variation, whilst the expectation risk evolves roughly as √(|X|/n), it is known that this rate can be sharpened by replacing |X| with a bound on the half-norm¹ ‖P‖_{1/2} := (Σ_{x∈X} √P(x))² (Berend and Kontorovich, 2013), when defined, and which corresponds to some measure of entropy of the underlying distribution. Furthermore, even without prior knowledge of ‖P‖_{1/2}, one can construct confidence intervals around P that depend on ‖P̂n‖_{1/2}, the half-norm of the empirical distribution P̂n(x) = (1/n) Σ_{t=1}^n δ[Xt = x] (Cohen et al., 2020, Theorem 2.1). Namely, when P is supported on N, with probability at least 1 − δ, it holds that

‖P̂n − P‖_TV ≤ (1/√n) ( √‖P̂n‖_{1/2} + √(2 log(2/δ)) ).   (1)

The advantage of the above expression is twofold: (a) we do not need to make assumptions on ‖P‖_{1/2}, which could be non-convergent², and (b) in favorable cases where ‖P̂n‖_{1/2} is small, the intervals will be narrower. In this paper, we set out to explore the question of whether analogues of (1) are possible for general probability spaces and with respect to maximum mean discrepancy.

1. The half-norm is not a norm in the proper sense as it violates the triangle inequality.

©2025 Geoffrey Wolfer and Pierre Alquier. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/23-0161.html.

1.1 Notation and Background

The set of integers up to n ∈ N is denoted by [n] := {1, 2, . . . , n}. Let X be a separable topological space, and P(X) the set of all Borel probability measures over X. For a bounded function f : X → R, we define for convenience

f̄ := sup_{x∈X} f(x),  f̲ := inf_{x∈X} f(x),  Δf := f̄ − f̲.   (2)

Let k : X × X → R be a continuous positive definite kernel and Hk be the associated reproducing kernel Hilbert space (RKHS) (Berlinet and Thomas-Agnan, 2011). We assume that the kernel is bounded³ in the sense that sup_{x∈X} k(x, x) < ∞.
Letting P ∈ P(X), the kernel mean embedding (KME) of P is defined as

µP := E_{X∼P} [k(X, ·)] = ∫_X k(x, ·) dP(x) ∈ Hk,

which is interpreted as a Bochner integral (Diestel and Uhl, 1977, Chapter 2). Let n ∈ N and X1, . . . , Xn be a sequence of observations sampled independently⁴ from P. We write

µ̂P(X1, . . . , Xn) = µ̂P := (1/n) Σ_{t=1}^n k(Xt, ·) ∈ Hk,   (3)

for the KME of the empirical distribution P̂n := (1/n) Σ_{t=1}^n δ_{Xt}, and call the distance ‖µ̂P − µP‖_{Hk} the maximum mean discrepancy (MMD) between the true mean embedding and its empirical estimator. A kernel is called characteristic if the map µ : P ↦ µP is injective. This property ensures that ‖µP − µP′‖_{Hk} = 0 if and only if P = P′. A kernel is called translation invariant⁵ (TI) when there exists a positive definite function ψ : X → R such that for any x, x′ ∈ X,

k(x, x′) = ψ(x′ − x).   (4)

In particular, ψ̄ = ψ(0) = k̄. When X = R^d, a kernel k is said to be a radial basis function (RBF) when for any x, x′ ∈ X, k(x, x′) = φ(‖x′ − x‖₂) for some function φ : R₊ → R. Noticeably, k being positive definite does not preclude it from taking negative values. However, when k is RBF, the following lower bound on φ holds (see for instance Genton 2001; Stein 1999),

φ̲ ≥ φ̄ inf_{t≥0} 2^{(d−2)/2} Γ(d/2) t^{−(d−2)/2} J_{(d−2)/2}(t),

where Γ is the Gamma function and J_β is the Bessel function of the first kind of order β, showing that |φ̲| becomes evanescent as the dimension increases.

2. As opposed to the empirical proxy ‖P̂n‖_{1/2}, which is always a finite quantity.
3. Note that the supremum of the kernel is always reached on the diagonal, that is, sup_{(x,x′)∈X²} k(x, x′) = sup_{x∈X} k(x, x).
4. Unless otherwise specified, the data will be considered iid. We will address the case of dependent data in Section 5.
5. Also sometimes referred to as an anisotropic stationary kernel.

1.2 Related Work

From an asymptotic standpoint, the weak law of large numbers asserts that µ̂P converges to the true µP.
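As a concrete aside (not part of the paper's own exposition), the empirical KME (3) is simply an average of kernel sections and can be evaluated pointwise from a sample. The sketch below assumes a Gaussian kernel; the helper names `gaussian_kernel` and `empirical_kme` are ours.

```python
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 gamma^2)), a bounded characteristic TI kernel
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2, axis=-1) / (2 * gamma**2))

def empirical_kme(sample, gamma=1.0):
    # Returns the function x -> (1/n) sum_t k(X_t, x), i.e. the empirical KME of (3)
    sample = np.asarray(sample, dtype=float)

    def mu_hat(x):
        return float(np.mean(gaussian_kernel(sample, x, gamma)))

    return mu_hat
```

Since the Gaussian kernel is bounded by 1, the empirical KME evaluated at any point always lies in (0, 1].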
Furthermore, √n (µ̂P − µP) converges in distribution to a zero-mean Gaussian process on Hk (Berlinet and Thomas-Agnan, 2011, Section 9.1). This work, however, is more concerned with the finite sample theory, and more specifically with the rate of convergence of µ̂P towards µP with respect to the RKHS norm. Conveniently and perhaps surprisingly, it is possible to derive a rate that depends neither on the smoothness of the considered kernel k, nor on the properties of the true distribution. To obtain a distribution-independent rate in O_P(n^{−1/2}), a typical strategy (see for example Lopez-Paz et al. 2015, Section B.1) consists in first expressing the dual relationship between the norm in the RKHS and the uniform norm of an empirical process,

‖µ̂P − µP‖_{Hk} = sup_{f∈Hk, ‖f‖_{Hk}≤1} ⟨f, µ̂P − µP⟩_{Hk} = sup_{f∈Hk, ‖f‖_{Hk}≤1} ( (1/n) Σ_{t=1}^n f(Xt) − E f ).   (5)

A classical symmetrization argument (Mohri et al., 2018) followed by an application of McDiarmid's inequality (McDiarmid, 1989) yields that with probability at least 1 − δ,

sup_{f∈Hk, ‖f‖_{Hk}≤1} ( (1/n) Σ_{t=1}^n f(Xt) − E f ) ≤ 2 Rn + √(2 k̄ log(1/δ)/n),   (6)

where Rn is the Rademacher complexity (Mohri et al., 2018, Definition 3.1) of the class of unit-norm functions in the RKHS. The bound Rn² ≤ k̄/n (Bartlett and Mendelson, 2002) and an application of Jensen's inequality conclude the claim. With a more careful analysis, Tolstikhin et al. (2017, Proposition A.1) halved the constant of the first term in (6), hence showed that with probability at least 1 − δ,

‖µ̂P − µP‖_{Hk} ≤ √(k̄/n) + √(2 k̄ log(1/δ)/n).   (7)

What is more, Tolstikhin et al. (2017, Theorems 1, 6, 8) provide corresponding lower bounds in Ω_P(n^{−1/2}), showing that the embedding of the empirical measure achieves an unimprovable rate of convergence. Results similar to (7) (with worse constants) can be derived when the observations are not independent, see Chérief-Abdellatif and Alquier (2022, Lemma 7.2). We note that other estimators have been proposed in the literature for the case where more information is available about the underlying distribution P.
See for instance Muandet et al. (2014, 2016), who propose a shrinkage estimator inspired by the James-Stein estimator. In this work, we stress that our estimator will be the KME of the empirical distribution, defined in (3).

1.3 Main Contributions

In this paragraph, we briefly summarize our findings. The tilde notation suppresses constants and logarithmic factors in 1/δ, and the reader is referred to Theorem 1 and Theorem 12 for the full expression of the second-order terms. Our first contribution (Theorem 1) consists in deriving a bound on ‖µ̂P − µP‖_{Hk} that can leverage additional information about the underlying distribution P and the selected kernel k. Namely, we show that with probability at least 1 − δ,

‖µ̂P − µP‖_{Hk} ≤ √(2 v_k(P) log(2/δ)/n) + Õ(1/n),

where v_k(P), defined in (8), corresponds to some notion of variance in the RKHS. Notably, it holds that v_k(P) ≤ k̄, hence our upper bound is able to recover known bounds in most settings and to improve on them in favorable ones (Remark 2). Our chief technical contribution (Theorem 12) is in establishing that, at least when k is translation invariant, the above confidence interval can be made independent of the underlying distribution by replacing the variance with an empirical proxy. Specifically, with probability at least 1 − δ,

‖µ̂P − µP‖_{Hk} ≤ √(2 v̂_k(X1, . . . , Xn) log(4/δ)/n) + Õ(1/n),

where v̂_k, defined in (12), (13), is a quantity that only depends on the chosen kernel and the sample. We also extend the latter result to kernels without the translation invariance property in Theorem 29. Expanding beyond the iid setting, we obtain convergence rates for stationary mixing sequences. For φ-mixing processes, Theorem 17 establishes the following rate of convergence,

‖µ̂P − µP‖_{Hk} ≤ √(2 (v_k(P) + Σn) ‖Γ‖₂² log(1/δ)/n) + Õ(1/n),

where Σn is a total measure of covariance between observations in the RKHS (refer to Lemma 16) and ‖Γ‖₂ is the spectral norm of a coupling matrix defined in (17). We additionally analyze the broader class of β-mixing processes in Theorem 18.
1.4 Outline

In Section 2, for the problem of estimating a distribution with respect to maximum mean discrepancy, we give a convergence rate with a dominant term in O_P(n^{−1/2}) involving a variance term v which depends on both the chosen kernel k and the underlying distribution P. We then illustrate how this rate can beat the minimax lower bounds when v is favorable. In particular, we show that for large collections of kernels, and when X ⊆ R^d, the variance v can be controlled by a quantity that decouples the influence of the kernel from a measure of total variance of P. In Section 3, we proceed to show that even if v is unknown, it is possible to efficiently estimate it from the data with an empirical Bernstein approach whenever the kernel is translation invariant, or at least enjoys a mildly varying diagonal. In Section 4, we illustrate the benefits of the variance-aware bounds on the classical two-sample test problem. In Section 5, we establish convergence rates for stationary φ- and β-mixing sequences. In Section 6.2, we put our methods into practice, first in the context of hypothesis testing, and second by improving the results of Briol et al. (2019) and Chérief-Abdellatif and Alquier (2022) in the context of robust parametric maximum mean discrepancy estimation. All proofs are deferred to Section 7 for clarity of the exposition.

2. Variance-Aware Convergence Rates

The central quantity we propose to consider is the following variance term in the RKHS,

v_k(P) := E_{X∼P} ‖k(X, ·) − µP‖²_{Hk}.   (8)

It is clear that v_k(P) depends both on the choice of kernel k and on the underlying distribution, and we will simply write v = v_k(P) to avoid encumbering notation. Simple calculations (refer to Lemma 16) show that

E ‖µ̂P − µP‖²_{Hk} = v/n,

where the expectation is taken with respect to an independent sample X1, . . .
, Xn ∼ P, thus by applications of Jensen's and Chebyshev's inequalities, we readily obtain a rate on the expected risk in terms of v,

E ‖µ̂P − µP‖_{Hk} ≤ √(v/n),

and a deviation bound

P ( ‖µ̂P − µP‖_{Hk} > (1 + τ) √(v/n) ) ≤ 1/(1 + τ)²,  τ ∈ (0, ∞).

One basic feature of RKHS is that new kernels can be constructed by linearly combining existing kernels; see for instance Gretton (2013). It is therefore noteworthy that v is linear with respect to reproducing kernels. Indeed, reformulate

v = E_{X∼P} [k(X, X)] − E_{X,X′∼P} [k(X, X′)],

and suppose that k = Σ_{ki∈K} αi ki, where K = {k1, . . . , k_{|K|}} is a collection of reproducing kernels and α ∈ R^{|K|}. It then holds that

v_k = Σ_{ki∈K} αi v_{ki}.

Moving on to high-probability confidence bounds, an application of Bernstein's inequality in Hilbert spaces (Pinelis and Sakhanenko, 1985; Yurinsky, 1995) not only recovers the rate of convergence, as alluded to by Tolstikhin et al. (2017), but in fact yields a maximal inequality.

Theorem 1 (Variance-aware confidence interval) Let X1, . . . , Xn ∼iid P, let k : X × X → R be a reproducing kernel, and let v := E_{X∼P} ‖k(X, ·) − µP‖²_{Hk}. With probability at least 1 − δ, it holds that

max_{t∈[n]} t ‖µ̂P(X1, . . . , Xt) − µP‖_{Hk} ≤ √(2 v n log(2/δ)) + 4 √k̄ log(2/δ),

where k̄ = sup_{x∈X} k(x, x). In particular, with probability at least 1 − δ, it holds that

‖µ̂P(X1, . . . , Xn) − µP‖_{Hk} ≤ B_{k,δ}(P, n),  B_{k,δ}(P, n) = B_δ := √(2 v log(2/δ)/n) + 4 √k̄ log(2/δ)/n.

Remark 2 Since the reproducing kernel is bounded, it is always the case that

v ≤ E_{X∼P} ‖k(X, ·)‖²_{Hk} = E_{X∼P} k(X, X) ≤ k̄,

where the first inequality can be found for example in (Chérief-Abdellatif and Alquier, 2022, Lemma 7.1, Proof). As a result, Theorem 1 strictly supersedes (7) as soon as n is larger than a moderate threshold that depends only on δ. For instance, for δ = 0.05, it suffices that n ≥ 46.

Remark 3 The Bernstein approach, in contrast to bounded differences, has the additional advantage of yielding a maximal inequality, further opening the door for early stopping methods.
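To make the variance term concrete: on a finite space, v = E k(X, X) − E k(X, X′) can be computed exactly from the Gram matrix of the support. A minimal sketch (the function name is ours), which also illustrates the inequality v ≤ k̄ of Remark 2:

```python
import numpy as np

def variance_term(probs, K):
    # v = E[k(X, X)] - E[k(X, X')] for P supported on finitely many points,
    # where K[i, j] = k(x_i, x_j) is the kernel Gram matrix of the support.
    probs = np.asarray(probs, dtype=float)
    K = np.asarray(K, dtype=float)
    return float(probs @ np.diag(K) - probs @ K @ probs)

# For K = I (see Section 2.3), v = 1 - ||P||_2^2: uniform P on two points gives 1/2.
v = variance_term([0.5, 0.5], np.eye(2))
```

Here k̄ = 1, so `v` must lie below 1, in accordance with Remark 2.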
Remark 4 (Sharpness of the constants) We note but do not pursue that there exist other families of concentration inequalities which are known to dominate Bennett- and Bernstein-type inequalities. For instance, the class of inequalities pioneered by Bentkus et al. (2006), with empirical counterparts derived by Kuchibhotla and Zheng (2021), which is known to be nearly optimal for sample averages of independent observations from a log-concave probability distribution. We also mention the family of inequalities obtained by Waudby-Smith and Ramdas (2024) based on martingale methods. However, our problem requires a bound for norms of sums of vectors in Hilbert spaces, or alternatively for the supremum of empirical processes, which we could not locate in the aforementioned references. Additionally, even in the simpler case of sample averages, computation of the bound requires effort, thus a concentration bound in the more complicated Hilbert space setting could be challenging to compute. Finally, Bernstein-type bounds have known extensions for the case of time-dependent data, enabling us to analyze the estimation problem from stationary mixing sequences of observations (refer to Section 5).

We immediately observe that:

(O1) While the bound in (7) depends solely on chosen quantities and is computable without any knowledge of P, this is not the case for B_δ in Theorem 1.

(O2) Perhaps even more concerning, it is a priori unclear how v depends on properties of k and P, thus making it difficult to convert assumptions on P into a bound for v.

We defer (O1) to Section 3 and first address (O2) by pointing out that when X = R^d, and for numerous hyper-parametrized families of kernels, it is possible to promptly obtain upper bounds on v that decouple the influence of the hyper-parameter and some measure of spread of the underlying distribution.

2.1 Gaussian Kernel

For a TI kernel (see (4)), we can rewrite v as v = ψ̄ − E_{X,X′∼P} ψ(X − X′).
The Gaussian kernel with lengthscale parameter γ > 0, defined by

k_γ(x, x′) = exp( −‖x − x′‖₂² / (2γ²) ),

is the prototypical example of a characteristic translation invariant kernel, and satisfies ψ̄ = 1. For x ∈ R^d, let x^i denote the ith component of x. The function z ↦ e^{−z} is convex, thus from Jensen's inequality,

v_γ = E_{X∼P} ‖k_γ(X, ·) − µP‖²_{H_{kγ}} ≤ 1 − exp( −E_{X,X′∼P} ‖X − X′‖₂² / (2γ²) ).

We can further rewrite the expectation on the right side of this inequality as

E_{X,X′∼P} ‖X − X′‖₂² = Σ_{i=1}^d E_{X,X′∼P} (X^i − X′^i)² = Σ_{i=1}^d 2 V_{X∼P} X^i = 2 Tr Σ_P,

where Σ_P := V_{X∼P} X = E_{X∼P} [(X − EX)(X − EX)^⊤] is the covariance matrix of P, and its trace is interpreted as a measure of total variance, agnostic to correlations between distinct components. As a result, for any fixed γ > 0, we obtain from Theorem 1 that with probability at least 1 − δ,

‖µ̂P − µP‖_{H_{kγ}} ≤ √( 2 (1 − e^{−Tr Σ_P/γ²}) log(2/δ)/n ) + 4 log(2/δ)/n.   (10)

While γ has no influence over the right-hand side in (7), it is clear that for γ → ∞, the rate of convergence will be accelerated in (10).

Example 1 (Gaussian location model with known variance) Assume that P = P_θ = N(θ, σ²I_d), the Gaussian distribution with unknown location parameter θ ∈ R^d, but with known covariance matrix σ²I_d. It holds that Tr Σ_{P_θ} = σ²d. (i) Fixing γ and taking σ → 0, the bound in (10) vanishes, unlike (7). (ii) Setting γ² = λσ²d with λ > 0, we readily obtain the variance upper bound v ≤ 1 − e^{−1/λ} ≤ 1/λ, enabling uncomplicated tuning of the convergence rate by λ.

Remark 5 Example 1 highlights that if we allow the lengthscale parameter to vary with the sample size n, Theorem 1 speeds up the convergence rate dramatically. For instance, setting λ_n = −1/log(1 − 1/n), we achieve a rate in O_P(n^{−1}), subverting the lower bounds of Tolstikhin et al. (2017). In fact, we can reach any prescribed rate of convergence between O_P(n^{−1/2}) and O_P(n^{−1}) for a suitable choice of λ_n. This stands in sharp contrast with bounds obtained through a bounded differences approach such as (7).
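The bound (10) rests on the inequality v ≤ 1 − exp(−Tr Σ_P/γ²). A Monte Carlo sketch of this inequality for a Gaussian P (all parameter values below are illustrative choices of ours, not from the text):

```python
import numpy as np

# Monte Carlo check of v <= 1 - exp(-Tr(Sigma_P)/gamma^2) for the Gaussian kernel,
# with P = N(0, sigma^2 I_d).
rng = np.random.default_rng(0)
d, n, gamma, sigma = 3, 20000, 2.0, 0.5
X = rng.normal(0.0, sigma, size=(n, d))
Xp = rng.normal(0.0, sigma, size=(n, d))

# v = psi(0) - E psi(X - X'), with psi(0) = 1 for the Gaussian kernel
sq = np.sum((X - Xp) ** 2, axis=1)
v_mc = 1.0 - np.mean(np.exp(-sq / (2 * gamma**2)))

bound = 1.0 - np.exp(-sigma**2 * d / gamma**2)  # 1 - exp(-Tr(Sigma_P)/gamma^2)
```

With these values, `v_mc` is close to 0.16 while `bound` is close to 0.17, and as γ grows both shrink, which is the acceleration mechanism behind (10).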
However, it is important to remember that a larger lengthscale parameter will flatten the kernel, hence make the left-hand side in (10) less informative. Depending on the application, it is possible to achieve the optimal balance between these two effects; some examples are discussed in Section 6.

2.2 Convex Radial Square Basis Functions

In fact, a technique similar to that used for obtaining (10) yields a bound on the variance for any (convex) radial kernel.

Lemma 6 Assume that for any x, x′ ∈ X, k(x, x′) = r(‖x − x′‖₂²) for some convex function r : R₊ → R. Then

v ≤ r(0) − r(2 Tr Σ_P),

where Σ_P is the covariance matrix that pertains to P.

2.3 Positive Definite Matrix on a Finite Space

Consider |X| < ∞ and a symmetric positive definite matrix K of size |X|. Write K = Σ_{x∈X} λ_x v_x v_x^⊤, where for any x ∈ X, K v_x = λ_x v_x, and by positive definiteness, λ_x > 0. Then the feature map can be expressed as Φ(x0) = (√λ_x v_x(x0))_{x∈X}, and by direct calculation,

‖µ̂P − µP‖²_{H_K} = Σ_{x∈X} λ_x ( (1/n) Σ_{t=1}^n v_x(Xt) − E[v_x(X)] )²,  v = Σ_{x∈X} λ_x V[v_x(X)].

In particular, for K = I,

‖µ̂P − µP‖_{H_K} = ‖P̂n − P‖₂,  v = 1 − ‖P‖₂²,

recovering that with probability at least 1 − δ,

‖P̂n − P‖₂ ≤ √( 2 (1 − ‖P‖₂²) log(2/δ)/n ) + 4 log(2/δ)/n.

So far, we have shown how to obtain, for a natural class of kernels, an upper bound on v that decouples the choice of k from some measure of total variance of P. Provided an upper bound on the latter is available (see Example 1), we obtain an improved convergence rate. However, we understandably may not have a bound on the variance of P, or could have insight about its variance, but only have access to contaminated data (see later Section 6.2.1). Recovering a bound that still enjoys the discussed speed-up without a priori knowledge of P is the problem we set out to solve in the next section.

3. Convergence Rates with Empirical Variance Proxy

We solve the issue (O1) of not knowing v by estimating it from the data, simultaneously to µP.
A short heuristic analysis of the Epanechnikov function (which is not typically a kernel according to our definition) that we conduct in Section 3.1 hints at an empirical Bernstein (Audibert et al., 2007; Maurer and Pontil, 2009) approach replacing the variance term by some empirical proxy, which we explore formally in Section 3.2. The argument is structured around the pivotal definition of a weakly self-bounding function, which we first recall.

Definition 7 (Maurer 2006; Boucheron et al. 2009) Let n ∈ N, t0 ∈ [n], x = (x1, . . . , x_{t0}, . . . , xn) ∈ X^n, x′_{t0} ∈ X, and write

x^{(t0)} = (x1, . . . , x_{t0−1}, x′_{t0}, x_{t0+1}, . . . , xn),

for the vector where x_{t0} has been replaced with x′_{t0}. Let f : X^n → R₊.

(i) The function f is called weakly (α, β)-self-bounding when for all x ∈ X^n,

Σ_{t0=1}^n ( f(x) − inf_{x′_{t0}∈X} f(x^{(t0)}) )² ≤ α f(x) + β.

(ii) The function f is said to have the bounded differences property when for all x ∈ X^n and all t0 ∈ [n],

f(x) − inf_{x′_{t0}∈X} f(x^{(t0)}) ≤ 1.

3.1 Intuition in the Hypercube

Let us take X = {0, 1}^d, the binary hypercube for d ≥ 1, and consider the Epanechnikov (parabolic) function q(x, x′) = 1 − ‖x − x′‖₂²/d. Note that q does not define a proper kernel (Cuturi, 2009, p. 9). Nevertheless, the function q is TI, with ψ(t) = 1 − ‖t‖₂²/d, ψ̄ = 1, and we can extend the definition of the variance term

v = ψ̄ − E_{X,X′∼P} ψ(X − X′) = 2 Tr Σ_P / d.

It is natural to define for i ∈ [d],

v̂_i(X1, . . . , Xn) := (1/(2n(n−1))) Σ_{t≠s} (X^i_t − X^i_s)²,

that will act as an unbiased empirical proxy for v_i := V_{X∼P} X^i, and to introduce Tr Σ̂ := Σ_{i=1}^d v̂_i as an estimator for the trace of the covariance matrix Σ_P.

Lemma 8 The function X^n → R₊, x ↦ Tr Σ̂(x)/d, is weakly (n/(n−1), 0)-self-bounding and has the bounded differences property in the sense of Definition 7.

As a consequence of Lemma 8, the technique of Maurer and Pontil (2009, Theorem 10) provides the following deviation bounds on the square root of the total variance,

P ( b ( √(Tr Σ_P / d) − √(Tr Σ̂(X)/d) ) > √(2 log(1/δ)/n) ) ≤ δ,  b ∈ {−1, 1}.
(11) In other words, with high confidence, we can replace the trace of the covariance matrix with its empirical proxy in the convergence bounds.

3.2 Systematic Approach

Our goal is to rigorously develop the approach intuited in the previous section to include a large class of reproducing kernels. We propose the following empirical proxy for the variance term v,

v̂_k(X1, . . . , Xn) := (1/(n−1)) Σ_{t=1}^n ( k(Xt, Xt) − (1/n) Σ_{s=1}^n k(Xt, Xs) ),   (12)

write more simply v̂ for v̂_k when k is clear from the context, and promptly verify that v̂ is an unbiased estimator for v.

Lemma 9 (Unbiasedness) It holds that E v̂ = v, where the expectation is taken over the sample X1, . . . , Xn ∼iid P.

The remainder of this section is devoted to analyzing the self-boundedness properties of v̂ under mild conditions on k and deriving corresponding empirical confidence intervals around µP. We henceforth let k be a characteristic TI kernel defined by the positive definite function ψ. We also obtain partial results for the more general case of kernels which are not translation invariant. This analysis is deferred to the appendix, Section A (refer to Theorem 29). In the TI setting, the expression of v̂ can be simplified as

v̂(X1, . . . , Xn) := ψ̄ − (1/(n(n−1))) Σ_{t≠s} ψ(Xt − Xs).   (13)

Since the kernel is characteristic, ψ cannot be a constant function, that is, Δψ > 0. Under our assumptions, we introduce a function involving v̂ and ψ that is weakly self-bounding and has the bounded differences property.

Lemma 10 Let ψ define a characteristic TI kernel. The function X^n → R, x ↦ (n/(2Δψ)) v̂(x), is weakly (2, 0)-self-bounding and has bounded differences in the sense of Definition 7.

This property leads to concentration of v̂ around v.

Lemma 11 For b ∈ {−1, 1}, with probability at least 1 − δ,

b ( √v − √v̂ ) ≤ √(2 Δψ log(1/δ)/n).

Theorem 12 (Confidence interval with empirical variance for TI kernel) Let n ∈ N, X1, . . .
, Xn ∼iid P, and let k be a characteristic translation invariant kernel defined from a positive definite function ψ [see (4)]. Then with probability at least 1 − δ, it holds that

‖µ̂P − µP‖_{Hk} ≤ B̂_{k,δ}(X1, . . . , Xn),  B̂_{k,δ}(X1, . . . , Xn) = B̂_δ := √( 2 v̂(X1, . . . , Xn) log(4/δ)/n ) + ( √(2 Δψ) + 4 √k̄ ) log(4/δ)/n,

where v̂ is the empirical variance proxy defined in (13).

Example 2 Theorem 12 immediately holds for Gaussian kernels, regardless of the lengthscale parameter.

Remark 13 We make the following observations. (i) It still almost surely holds that v̂ ≤ Δψ ≤ 2k̄, thus the fully empirical bound can never be more than a constant factor away from (7) or Theorem 1. (ii) Crucially, self-boundedness of v̂, and hence the applicability of Theorem 12 (or Theorem 29 for non-TI kernels), only depends on the choice of kernel, and not on the properties of the underlying distribution. (iii) Theorem 12 does not depend on smoothness properties of the kernel (this is also true of Theorem 29 for non-TI kernels).

3.3 Computability of the Empirical Variance Proxy

The proxy v̂ is computable from data with O(n²) calls to the kernel function. In Section 6, we will discuss how to use variance-aware bounds to improve confidence bounds and test procedures based on the minimum-MMD estimator. In general, this estimator is computed by stochastic gradient descent (SGD) (Dziugaite et al., 2015; Li et al., 2015; Chérief-Abdellatif and Alquier, 2022) or variants like stochastic natural gradient descent (Briol et al., 2019). A single step of SGD requires to sample m iid random variables from P_θ, where θ is the current estimate, and to compute an unbiased estimate of the gradient, which requires (among others) O(mn) calls to the kernel function. In the natural version, one must also compute a Jacobian at each step, which requires O(n²) calls to the kernel. Moreover, in the discussion following Theorem 2 of Briol et al. (2019), it is argued that we should take m ≥ n in these algorithms.
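For concreteness, the proxy (13) for a Gaussian (hence TI) kernel costs exactly the O(n²) kernel evaluations just discussed; a minimal sketch with our own function name:

```python
import numpy as np

def empirical_variance_proxy(X, gamma=1.0):
    # v-hat from (13) for the Gaussian (TI) kernel:
    #   psi(0) - (1/(n(n-1))) * sum over t != s of psi(X_t - X_s),  with psi(0) = 1.
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # (n, n) squared distances
    psi = np.exp(-sq / (2 * gamma**2))
    off_diag_sum = psi.sum() - np.trace(psi)   # excludes the t = s terms
    return 1.0 - off_diag_sum / (n * (n - 1))
```

A sample of identical points has zero empirical variance, and since 0 < ψ ≤ 1 here, the proxy always lies in [0, 1).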
Thus, both SGD and its natural variant will require at least O(n²) calls to the kernel at each iteration. In this light, the computation of v̂ does not significantly affect the computational burden. Note however that, in a few situations where the gradient of the MMD is available in closed form, it is possible to use a non-stochastic gradient descent. Such examples include Gaussian mean estimation (Chérief-Abdellatif and Alquier, 2022) and Gaussian copula estimation (Alquier et al., 2023). Each step of the gradient descent requires only O(n) calls to the kernel and convergence is typically very fast. In these situations, the computation of v̂ can increase the computational cost when n is large.

4. Convergence Rates for the Difference of Two Means

We now extend the results of the previous section to the estimation of the norm of the difference of two means ‖µP − µQ‖_{Hk}. One of the applications of these results is two-sample testing, where we test the hypothesis P = Q, as in Gretton et al. (2012). We will cover this application in Section 6. However, these results also have an interest of their own, to illustrate the benefits of variance-aware bounds. In all this section, we assume that two samples are observed: X1, . . . , Xn iid from P, and Y1, . . . , Ym iid from Q. We want to estimate the norm of the difference ‖µP − µQ‖_{Hk}.

4.1 First Approach: Estimation by a U-Statistic

A classical approach is to estimate the squared norm ‖µP − µQ‖²_{Hk} using a U-statistic, as proposed by Gretton et al. (2012), and then to use a Hoeffding- or Bernstein-type inequality for U-statistics. While Gretton et al. (2012) used the U-statistics version of Hoeffding's inequality to control the risk of this estimator, using a Bernstein inequality instead might allow to derive variance-aware bounds. The first versions of such inequalities were proven by Hoeffding (1963) and Arcones (1995). More recently, the results of these two papers are included in Theorem 2 of Peel et al.
(2010), while their Theorem 3 gives an empirical version of each. We now provide more details on this approach. First, note that

‖µP − µQ‖²_{Hk} = E_{X,X′∼P} k(X, X′) + E_{Y,Y′∼Q} k(Y, Y′) − 2 E_{X∼P,Y∼Q} k(X, Y)

can be directly estimated by the U-statistic

U_{n,m} = (1/(n(n−1))) Σ_{i≠j} k(Xi, Xj) + (1/(m(m−1))) Σ_{i≠j} k(Yi, Yj) − (2/(mn)) Σ_{i=1}^n Σ_{j=1}^m k(Xi, Yj),

as proposed by Gretton et al. (2012). This is an unbiased estimator: E U_{n,m} = ‖µP − µQ‖²_{Hk}. In the special case m = n, we can replace U_{n,n} by the simpler

Û_n = (1/(n(n−1))) Σ_{i≠j} [k(Xi, Xj) + k(Yi, Yj) − k(Xi, Yj) − k(Xj, Yi)].

We will illustrate the U-statistics approach in this simpler setting. Using Theorem 2 in Peel et al. (2010), with their q((Xi, Yi), (Xj, Yj)) = k(Xi, Xj) + k(Yi, Yj) − k(Xi, Yj) − k(Xj, Yi) and their m = 2, we obtain the first part of the following statement. Using their Theorem 3, we obtain the second one.

Theorem 14 Put

σ²_{P,Q} = V_{X∼P,Y∼Q} [ E_{X′∼P} k(X, X′) + E_{Y′∼Q} k(Y, Y′) − E_{Y′∼Q} k(X, Y′) − E_{X′∼P} k(X′, Y) ],

σ̂²_{P,Q} = (1/(n(n−1)(n−2))) Σ_{i,j,l distinct} [ k(Xi, Xj) + k(Yi, Yj) − k(Xi, Yj) − k(Xj, Yi) ] [ k(Xi, Xl) + k(Yi, Yl) − k(Xi, Yl) − k(Xl, Yi) ].

Note that E σ̂²_{P,Q} = σ²_{P,Q} + ‖µP − µQ‖⁴_{Hk}. For any δ ∈ (0, 1], with probability at least 1 − δ,

| ‖µP − µQ‖²_{Hk} − Û_n | ≤ 2 √k̄ √( 8 σ²_{P,Q} log(4/δ)/n ) + 2 k̄ (64 + 1/3) log(4/δ)/n.

Moreover, with probability at least 1 − δ,

| ‖µP − µQ‖²_{Hk} − Û_n | ≤ 2 √k̄ √( 8 σ̂²_{P,Q} log(8/δ)/n ) + 2 k̄ (64 + 1/3) log(8/δ)/n.

Observe that, when P ≠ Q, the bound states that √Û_n will be of the order of ‖µP − µQ‖_{Hk}. On the other hand, when P = Q, we have both ‖µP − µQ‖_{Hk} = 0 and σ²_{P,Q} = 0, and thus the first inequality in the theorem gives

Û_n ≤ 2 k̄ (64 + 1/3) log(4/δ)/n,

which does not depend on the variances of P and Q.

4.2 Variance-Aware Control of the Fluctuations for Each Sample

An alternative approach is to apply the triangle inequality to upper bound separately the fluctuations for each sample:

| ‖µP − µQ‖_{Hk} − ‖µ̂P(X1, . . . , Xn) − µ̂Q(Y1, . . . , Ym)‖_{Hk} | ≤ ‖µP − µ̂P(X1, . . . , Xn)‖_{Hk} + ‖µQ − µ̂Q(Y1, . . . , Ym)‖_{Hk}.

For example, a direct application of Theorem 1 with a union bound gives the following corollary.

Corollary 15 Let X1, .
. . , Xn ∼ P, Y1, . . . , Ym ∼ Q, let k : X × X → R be a reproducing kernel, and let v_P := E_{X∼P} ‖k(X, ·) − µP‖²_{Hk}, v_Q := E_{Y∼Q} ‖k(Y, ·) − µQ‖²_{Hk} and k̄ = sup_{x∈X} k(x, x). With probability at least 1 − δ, it holds that

| ‖µP − µQ‖_{Hk} − ‖µ̂P(X1, . . . , Xn) − µ̂Q(Y1, . . . , Ym)‖_{Hk} | ≤ √(2 v_P log(4/δ)/n) + √(2 v_Q log(4/δ)/m) + 4 √k̄ log(4/δ) (1/n + 1/m).

Of course, we can also state results with the empirical variance instead of the true variance, by using Theorems 12 and 29. In order to compare this result to the U-statistics approach, consider the case n = m. Corollary 15 gives the following upper bound:

| ‖µP − µQ‖_{Hk} − ‖µ̂P(X1, . . . , Xn) − µ̂Q(Y1, . . . , Yn)‖_{Hk} | ≤ √(2 v_P log(4/δ)/n) + √(2 v_Q log(4/δ)/n) + 8 √k̄ log(4/δ)/n.   (15)

In particular, when P = Q, the bound becomes:

‖µ̂P(X1, . . . , Xn) − µ̂Q(Y1, . . . , Yn)‖_{Hk} ≤ 2 √(2 v_P log(4/δ)/n) + 8 √k̄ log(4/δ)/n.   (16)

Even though ‖µ̂P(X1, . . . , Xn) − µ̂Q(Y1, . . . , Yn)‖²_{Hk} is not an unbiased estimator of ‖µP − µQ‖²_{Hk}, as Û_n is, (15) and (16) show that the fluctuations of ‖µ̂P(X1, . . . , Xn) − µ̂Q(Y1, . . . , Yn)‖_{Hk} when P = Q are upper bounded by √(v_P/n) + 1/n, while those of √Û_n are only upper bounded by √(1/n). This can be a serious improvement if v_P is small. Note that we do not claim superiority of the plug-in estimator, but rather that the currently available non-asymptotic bounds for this estimator are tighter. This is an illustration of the power of the variance-aware bounds. A way to compare both estimators would be through an accurate study of their asymptotic fluctuations when P = Q. So far, this analysis is only available for the U-statistic, see Theorem 12 in Gretton et al. (2012).

5. Convergence Rates with Time-Dependent Data

In this section, we establish convergence rates for cases where the data X1, . . . , Xn is not independent. Namely, we will assume the data to be a stationary mixing sequence (Bradley, 2005; Doukhan, 2012). In this setting, the observations are identically distributed with marginal distribution P, but exhibit time dependencies that diminish as the time interval increases.

Lemma 16 Suppose that X1, .
. . , Xn is a stationary sequence of possibly dependent random variables. Then

E ‖µ̂P − µP‖²_{Hk} = (1/n) (v + Σn),  Σn := (2/n) Σ_{t=2}^n (n − t + 1) ρ_t,

where ρ_t := E ⟨k(Xt, ·) − µP, k(X1, ·) − µP⟩_{Hk} are the covariance coefficients in the RKHS introduced in Chérief-Abdellatif and Alquier (2022). In particular, when the process is iid, the expression above simplifies to E ‖µ̂P − µP‖²_{Hk} = v/n.

5.1 For φ-Mixing Processes

The first flavor of mixing we consider is φ-mixing, introduced by Ibragimov (1962). Recall that the φ-mixing coefficient (Bradley, 2005; Doukhan, 2012) is defined for two σ-fields A and B by

φ(A, B) := sup_{A∈A, B∈B, P(A)>0} |P(B|A) − P(B)|.

For s ∈ N, we further define

φ(s) := sup_{r∈N} φ ( σ({Xt : t ≤ r}), σ({Xt : t ≥ r + s}) ),

where for T ⊆ N, σ({Xt}_{t∈T}) is the σ-field generated by the random variables {Xt}_{t∈T}. The random process is then called φ-mixing when lim_{s→∞} φ(s) = 0. Additionally, we define the triangular coupling matrix Γ (Samson, 2000) as follows. For t, s ∈ [n],

Γ_{t,s} := 1 when t = s, 0 when t > s, √( 2 φ(σ(X1, . . . , Xt), σ(Xs, . . . , Xn)) ) otherwise.   (17)

Theorem 17 (Variance-aware confidence interval with φ-mixing data) Let δ ∈ (0, 1), and X1, . . . , Xn be a stationary φ-mixing sequence with marginal distribution P. We let k : X × X → R be a reproducing kernel with k̄ = sup_{x∈X} k(x, x). With probability at least 1 − δ, it holds that

‖µ̂P(X1, . . . , Xn) − µP‖_{Hk} ≤ B^φ_{k,δ}(P, n),  B^φ_{k,δ}(P, n) = B^φ_δ := √( 2 (v + Σn) ‖Γ‖₂² log(1/δ)/n ) + 8 k̄ ‖Γ‖₂² log(1/δ)/n,

where Σn is defined in Lemma 16, Γ is defined in Eq. (17) and ‖·‖₂ is the spectral norm. In particular, when the process is iid, Γ is the identity matrix, hence ‖Γ‖₂ = 1.

Example 3 (Uniformly ergodic Markov chains) Suppose that there exist φ0, φ1 ∈ R₊ such that for all t ∈ N, φ(t) ≤ φ0 exp(−φ1 t). Then it holds (refer for example to Samson (2000)) that

‖Γ‖₂ ≤ √(2 φ0) / (1 − e^{−φ1/2}).

In particular, this is known to hold for a uniformly ergodic Markov chain (Roberts and Rosenthal, 2004) with transition operator P and stationary distribution π.
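Example 3 can be checked numerically: build the coupling matrix Γ of (17) with entries bounded via φ(t) ≤ φ0 e^{−φ1 t}, and compare its spectral norm against √(2φ0)/(1 − e^{−φ1/2}). The parameter values below are illustrative choices of ours:

```python
import numpy as np

# Numerical sketch of Example 3: Gamma is upper triangular with unit diagonal and
# off-diagonal entries sqrt(2 * phi(s - t)), here bounded by phi0 * exp(-phi1 * (s - t)).
phi0, phi1, n = 4.0, np.log(2.0) / 5.0, 60
G = np.eye(n)
for t in range(n):
    for s in range(t + 1, n):
        G[t, s] = np.sqrt(2.0 * phi0 * np.exp(-phi1 * (s - t)))

spec = np.linalg.norm(G, 2)                          # spectral norm ||Gamma||_2
bound = np.sqrt(2.0 * phi0) / (1.0 - np.exp(-phi1 / 2.0))
```

With geometric decay the spectral norm stays bounded as n grows, which is what makes the φ-mixing rate of Theorem 17 comparable to the iid rate up to the factor ‖Γ‖₂.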
In this case we can choose, φ0 = 4 and φ1 = log(2)/tmix, where tmix .= min t N P t(x, ) π( ) TV < 1/4 , is called the mixing time of the chain with respect to total variation (Levin et al., 2009); here we obtain Γ 2 10tmix. 5.2 For β-Mixing Processes We next consider β-mixing, a common assumption in the machine learning literature (Mohri and Rostamizadeh, 2010). Recall that the β-mixing coefficient (Bradley, 2005; Doukhan, 2012) is defined for two σ-fields A and B, β(A, B) .= sup 1 j=1 |P(Ai Bj) P(Ai)P(Bj)| , where the supremum is taken over all pairs of finite partitions {A1, . . . , AI} and {B1, . . . , BJ} of X such that for any 1 i I and any 1 j J, Ai A and Bj B (Bradley, 2005, Equation 7). For s N, we further define β(s) .= sup r N β (σ ({Xt : t r}) , σ ({Xt : t r + s})) , where for T N, σ({Xt}t T ) is the σ-field generated by the random variables {Xt}t T . The random process is then called β-mixing6 when lims β(s) = 0. In this case, for ξ (0, 1), we define the β-mixing time as tβ mix(ξ) .= arg min t N {β(t) < ξ} . We observe that when the process is a stationary time-homogeneous Markov chain, tβ mix corresponds to the definition of the average-mixing time (M unch and Salez, 2023; Wolfer 6. A β-mixing process is also called absolutely regular in the literature. Variance-Aware Estimation of Kernel Mean Embedding and Alquier, 2024). Note that β(A, B) φ(A, B) (Bradley, 2005, Equation 1.11), and no reverse inequality holds for any universal constant, thus β-mixing is a strictly weaker notion than φ-mixing. Theorem 18 (Variance-aware confidence interval with β-mixing data) We let δ (0, 1), and X1, . . . , Xn be a stationary β-mixing sequence. We let k: X X R be a reproducing kernel with k = supx X k(x, x). Finally, we suppose for simplicity7 that n is a multiple of τ β n,δ where τ β n,δ .= arg min s N δ 6(n/(2s) 1) With probability at least 1 δ, it holds that bµP(X1, . . . 
, Xn) µP Hk Bβ k,δ(P, n), Bβ k,δ(P, n) = Bβ δ .= 2 s v + Στ β n,δ where Σs is defined in Lemma 16. Example 4 (Countable state time-homogeneous Markov chains) Irreducible, aperiodic and stationary countable Markov chains are always β-mixing (Bradley, 2005). Assume that there exist β1 R+ and b (1, ) such that for any t N, β(t) β1/tb, that is, the chain is only known to be algebraically mixing. Interestingly, such a chain may not be φ-mixing, thus this example cannot be recovered from Theorem 17. 6. Applications We put the apparatus developed in Section 2 and Section 3 to application in the context of hypothesis testing and robust parametric estimation. In this section, all considered kernels will be TI unless otherwise specified, and X1, . . . , Xn are i.i.d. from P P(X). We introduce a statistical model M = {Pθ : θ Θ} indexed by the parameter space Θ. Examples of models studied in the literature on MMD include parametric models such as the Gaussian model Pθ = N(θ, σ2Id) (with σ2 known) and mixture of Gaussians (Briol et al., 2019; Ch erief-Abdellatif and Alquier, 2022), copulas Alquier et al. (2023) but also more complex models such as generative adversarial networks (Dziugaite et al., 2015; Li et al., 2015), stochastic volatility models and stochastic differential equations (Briol et al., 2019). Following Briol et al. (2019), our estimator will be the closest element8 of the model 7. An adaptation of the proof removes this assumption and recovers similar bounds up to universal constants. 8. Note that it might be that the infimum is reached by multiple elements of Θ, or is not reached. The first case does not lead to any difficulty, we must simply define a rule to break ties (for example, we can equip Θ with a total order and chose the smallest minimizer according to this order). While Briol et al. 
(2019) provide sufficient conditions to ensure that the infimum is reached, the non-existence of a minimizer is also not a problem in practice: all the non-asymptotic results in Briol et al. (2019) and Chérief-Abdellatif and Alquier (2022) can easily be extended to any ϵ-minimizer, for ϵ small enough. M to the empirical measure obtained from X, where the distance is measured in the RKHS, ‖µPbθn(X1,...,Xn) − bµP(X1, . . . , Xn)‖Hk = inf θ∈Θ ‖µPθ − bµP(X1, . . . , Xn)‖Hk . (18) The computation of bθn(X1, . . . , Xn) is usually done via stochastic gradient descent and variants, see Remark 13 above. 6.1 Hypothesis Testing In this subsection we study hypothesis testing based on MMD. In the literature, two kinds of tests were proposed and studied: two-sample tests, and goodness-of-fit tests. In the two-sample test problem, we are given X1, . . . , Xn iid from some PX and Y1, . . . , Ym iid from some PY , and we want to test H0 : PX = PY against the alternative H1 : PX ≠ PY . In the goodness-of-fit problem, we are given X1, . . . , Xn iid from P and we wish to test the hypothesis H0 : P ∈ M = {Pθ : θ ∈ Θ} against the alternative H1 : P ∉ {Pθ : θ ∈ Θ}. In the MMD literature, two-sample tests are more prevalent (Gretton et al., 2009, 2012). Recent work tackles goodness-of-fit testing (Chwialkowski et al., 2016; Jitkrittum et al., 2017), albeit with an asymptotic treatment. We will study both problems here. In both cases, we propose a non-asymptotic treatment. That is, the level of the test is smaller than α for a finite sample size, and not only asymptotically. Moreover, in both cases, our data-dependent bound (Theorem 12) allows us to increase the power of the test, when compared to the procedure that would be based on the McDiarmid-based bound in (7). 6.1.1 Goodness-of-Fit Test Recall that we define the significance level α ∈ [0, 1] of a test as the probability of outputting H1 when H0 is true. Taking advantage of a non-asymptotic bound B(X1, . . .
, Xn, α) (for example B(X1, . . . , Xn, α) = b Bα given in Theorem 12), we can design a test with prescribed level α for any n as follows. We introduce the test statistic T(X1, . . . , Xn) .= inf θ∈Θ ‖bµP − µPθ‖Hk , and reject H0 on the critical set C(X1, . . . , Xn) .= {T > B(X1, . . . , Xn, α)} . We show that the probability of rejection under the null hypothesis is at most α and that the test is consistent. Theorem 19 Let n ∈ N, X1, . . . , Xn ∼ P, and k a translation invariant kernel. Let B(·, α) be such that (a) for any α, with probability at least 1 − α, ‖bµP − µP‖Hk ≤ B(X1, . . . , Xn, α). Then (i) PH0 (C(X1, . . . , Xn)) ≤ α. If, moreover, (b) for any fixed α, B(X1, . . . , Xn, α) → 0 a.s. as n → ∞, and (c) there is a sequence αn → 0 such that B(X1, . . . , Xn, αn) → 0 a.s. as n → ∞, then (ii) when the model M is closed with respect to the MMD metric9, limn→∞ PH1 (C(X1, . . . , Xn)) = 1; more precisely, 1 − PH1 (C(X1, . . . , Xn)) = O(αn). Obviously, the empirical bound b Bk,α of Theorem 12 satisfies (a) and (b) (regardless of P), and (c) with αn = exp(−n^{1−ϵ}) for any fixed ϵ > 0. So does the McDiarmid-based bound in (7). On the other hand, we claimed that the empirical bound is smaller than bounds that do not take the variance into account. In other words, the power of the test PH1 (C(X1, . . . , Xn)) for a finite n will be larger if we use the variance-aware bound. Example 5 We first consider a single hypothesis test (that is, a special case of goodness-of-fit when Θ is a singleton). We consider data in R²: here P = N(0, σ²I2) with Pθ = N(θ, I2), and Θ = {(1, 1)}. In other words, H0 is true iff σ = 1. Note that in this case, if Y1, . . . , Yn are iid from Pθ = P(1,1), then q1−α, defined as the (1 − α)-quantile of ‖µP(1,1) − bµP(Y1, . . . , Yn)‖Hk , allows us to define a test with rejection zone given by C′(X1, . . . , Xn) = {T > q1−α} that satisfies PH0 (C′(X1, . . . , Xn)) = α by definition.
It is a very natural procedure to estimate q1−α by Monte-Carlo, by sampling Y1, . . . , Yn from Pθ = P(1,1) multiple times. This leads to a Monte-Carlo estimator ˆq1−α of q1−α. In our simulations, we sample X1, . . . , Xn from P, and perform the test based on ˆq1−α, the test based on the empirical bound, and the test based on the McDiarmid bound. This is repeated 100 times for each value σ ∈ {0, 1/50, 2/50, . . . , 1}. We report the frequency of rejections in Figure 1. The kernel used is a Gaussian kernel with γ = 1, and we consider sample sizes n ∈ {16, 40, 100, 250}. We observe that, when compared to the test based on ˆq1−α, both tests based on bounds have weak power (note that we did not try to optimize γ for now; this question will be tackled later). However, the test based on the empirical bound indeed rejects H0 more often when H1 is true. The improvement is clearer for small sample sizes. Example 6 We now consider a proper goodness-of-fit test in R²: we still consider P = N(0, σ²I2) and Pθ = N(θ, I2), with Θ = R². It is important to observe that in this case, because H0 is composite, we do not have a natural definition for q1−α as in the previous example. We sample X1, . . . , Xn from P, and perform the test based on the empirical bound and the test based on the McDiarmid bound. This is repeated 100 times for each value σ ∈ {0, 1/50, 2/50, . . . , 1}. We report the frequency of rejections in Figure 2. The kernel used is a Gaussian kernel with γ = 1, and we consider sample sizes n ∈ {16, 40, 100, 250}. Similar comments to the previous case apply; in the absence of a Monte-Carlo baseline, this makes the test based on the empirical Bernstein bound the best test available here. 9. For any Q, if there is a sequence (θh)h∈N of elements of Θ such that ‖µQ − µPθh‖Hk → 0 as h → ∞, then Q = Pθ for some θ ∈ Θ.
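The Monte-Carlo calibration of Example 5 can be sketched as follows. For simplicity we approximate ‖µP(1,1) − bµP(Y1, . . . , Yn)‖Hk by a plug-in MMD against a large reference sample from P(1,1); this approximation, the kernel parametrization k(x, y) = exp(−‖x − y‖²/(2γ²)), and all numerical values are illustrative choices, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def mmd(X, Y, gamma=1.0):
    """Plug-in (biased) MMD between empirical measures of X and Y with
    the Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 gamma^2))."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * gamma ** 2))
    return np.sqrt(max(gram(X, X).mean() + gram(Y, Y).mean()
                       - 2 * gram(X, Y).mean(), 0.0))

# Monte-Carlo estimate of the (1 - alpha)-quantile of the statistic under H0.
theta, n, alpha, n_mc = np.array([1.0, 1.0]), 40, 0.05, 100
ref = rng.normal(size=(500, 2)) + theta      # large reference sample from P_(1,1)
stats = [mmd(rng.normal(size=(n, 2)) + theta, ref) for _ in range(n_mc)]
q = np.quantile(stats, 1 - alpha)

# Decision rule: reject H0 when the test statistic exceeds the estimated quantile.
X = rng.normal(size=(n, 2)) + theta          # data drawn from P_(1,1), i.e. H0 true
reject = mmd(X, ref) > q
```

Repeating the last two lines over many draws of X recovers the rejection frequencies reported in Figure 1 for the quantile-based test.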
Figure 1: Comparison of the test based on the Bernstein empirical (Emp Ber) bound, versus the test based on the McDiarmid bound (Mc Dia), and the test based on the Monte-Carlo estimation of the quantile q1−α. Frequency of rejection of H0 : P ∈ {N((1, 1), I2)} as a function of σ with P = N(0, σ²I2). Figure 2: Comparison of the test based on the Bernstein empirical (Emp Ber) bound, versus the test based on the McDiarmid bound (Mc Dia). Frequency of rejection of H0 : P ∈ {N(θ, I2), θ ∈ R²} as a function of σ with P = N(0, σ²I2). 6.1.2 Two-Sample Test Let (n, m) ∈ N², X1, . . . , Xn ∼ PX and Y1, . . . , Ym ∼ PY . Here, we wish to test H0 : PX = PY against H1 : PX ≠ PY . This time, we use the test statistic T2(X1, . . . , Xn, Y1, . . . , Ym) .= ‖bµP(X1, . . . , Xn) − bµP(Y1, . . . , Ym)‖Hk , and reject H0 on the critical set C2(X1, . . . , Xn, Y1, . . . , Ym) .= {T2 > B(X1, . . . , Xn, α/2) + B(Y1, . . . , Ym, α/2)} . Theorem 20 Let (n, m) ∈ N², X1, . . . , Xn ∼ PX and Y1, . . . , Ym ∼ PY , and k a translation invariant kernel. Let B(·, α) be such that (a) for any α, each of these inequalities holds with probability at least 1 − α: ‖bµP(X1, . . . , Xn) − µPX‖Hk ≤ B(X1, . . . , Xn, α) and ‖bµP(Y1, . . . , Ym) − µPY‖Hk ≤ B(Y1, . . . , Ym, α). Then (i) PH0 (C2(X1, . . . , Xn, Y1, . . . , Ym)) ≤ α. Moreover, if (b) for any fixed α, B(X1, . . . , Xn, α) → 0 a.s. as n → ∞ and B(Y1, . . . , Ym, α) → 0 a.s. as m → ∞, and (c) there is a sequence αn → 0 such that B(X1, . . . , Xn, αn) → 0 a.s. as n → ∞ and B(Y1, . . . , Ym, αm) → 0 a.s. as m → ∞, then (ii) limn,m→∞ PH1 (C2(X1, . . . , Xn, Y1, . . . , Ym)) = 1. Here again, the empirical bound b Bk,α of Theorem 12 satisfies (a) and (b), and (c) with αn = exp(−n^{1−ϵ}) for any ϵ > 0.
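The two-sample decision rule above can be sketched in a few lines. The bound B below is a placeholder of the generic McDiarmid-style form √(k∞/n)(1 + √(2 log(1/α))); its constants may differ from those of the bound in (7), and the data and kernel are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def mmd(X, Y, gamma=1.0):
    """Plug-in MMD with the Gaussian kernel exp(-||x - y||^2 / (2 gamma^2))."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * gamma ** 2))
    return np.sqrt(max(gram(X, X).mean() + gram(Y, Y).mean()
                       - 2 * gram(X, Y).mean(), 0.0))

def B_mcdiarmid(sample, alpha, k_inf=1.0):
    # Illustrative McDiarmid-style bound sqrt(k_inf / n) (1 + sqrt(2 log(1/alpha)));
    # the paper's Eq. (7) may carry different constants.
    n = len(sample)
    return np.sqrt(k_inf / n) * (1 + np.sqrt(2 * np.log(1 / alpha)))

def two_sample_test(X, Y, alpha, bound=B_mcdiarmid):
    """Reject H0 : P_X = P_Y when T2 > B(X, alpha/2) + B(Y, alpha/2)."""
    t2 = mmd(X, Y)
    return t2 > bound(X, alpha / 2) + bound(Y, alpha / 2)

X = rng.normal(size=(500, 2))
Y = rng.normal(size=(500, 2)) + 2.0     # well-separated alternative
same = rng.normal(size=(500, 2))        # second sample from the same law
```

With these samples, `two_sample_test(X, Y, 0.05)` rejects while `two_sample_test(X, same, 0.05)` does not; plugging in a variance-aware bound for `bound` tightens the threshold and increases power, as discussed above.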
6.2 Robust Parametric Estimation under Huber Contamination Chérief-Abdellatif and Alquier (2022, Proof of Theorem 3.1) show how to upper-bound (18) by the estimation error of the empirical measure in the RKHS and the approximation error of the model, ‖µPbθn − µP‖Hk ≤ inf θ∈Θ ‖µPθ − µP‖Hk + 2 ‖bµP − µP‖Hk . (19) Recall that in the Huber contamination model (Huber, 2011), the observations X1, . . . , Xn are drawn independently from the mixture P = (1 − ξ)Pθ0 + ξH, where ξ ∈ [0, 1/2) is the contamination rate, H is some unknown noise distribution and θ0 ∈ Θ. In this setting, we can control the approximation error of the model: starting from (19), and adapting the argument of Chérief-Abdellatif and Alquier (2022, Corollary 3.3), we obtain a bound of the form ‖µPbθn − µPθ0‖Hk ≤ (term in ξ and ψ) + 2 ‖bµP − µP‖Hk . We wish to apply Theorem 1 to the second term. However, when the sample is contaminated, the variance term v(P) we collect also depends on the properties of H, which are unknown; for all we know, H does not even have finite moments. Fortunately, we now see how to bound v(P) in terms of v(Pθ0) and ξ. 6.2.1 Improved Confidence Bounds in the Huber Contamination Model Lemma 21 In the Huber contamination model, that is, P = (1 − ξ)Pθ0 + ξH, writing v = v(P) and v0 = v(Pθ0), it holds that v ≤ v0 + 2ξ(ψ − v0) + ξ²(v0 − ψ). (20) The bound can be streamlined; when ψ ≥ 0, v ≤ (1 − 2ξ)v0 + 2ξψ, and otherwise v ≤ (1 − 2ξ)v0 + 2ξ ψ . Remark 22 In (20), observe that when ψ ≥ 0 and ξ → 1, the first term vanishes, and v ≤ ψ, recovering the distribution-independent rates. Conversely, when ξ → 0, we confirm that v ≤ v0, which could be further bounded using properties of the model. In practice, we will focus on small values ξ ≪ 1/2, and the simplified bounds will be sufficient. An application of Theorem 1 yields the following.
Corollary 23 In the Huber contamination model P = (1 ξ)Pθ0 + ξH, writing v = v(P) and v0 = v(Pθ0), with probability at least 1 δ, 2[(1 2ξ)v0 + 2ξψ]log(2/δ) ψ 3 log(2/δ) When ψ = 1, v0 1, ξ 1, the bound offers a significant improvement over Ch erief Abdellatif and Alquier (2022, Corollary 3.4). 6.2.2 Improved Confidence Bounds in the Parameter Space In this subsection we still employ the Gaussian kernel kγ (9). It is instructive to analyze how robust estimation with respect to the MMD distance translates into what happens in the space of parameters. In fact, since we have freedom over the lengthscale parameter, we may understandably want to select γ such that the distance in the space of parameters is kept small as well. Wolfer and Alquier Definition 24 (Link function F) Let M = {Pθ : θ Θ} indexed by the parameter space Θ P(Rd). We say that a non-decreasing function Fk : R+ R+ is a link function for the model M and kernel k when for any θ, θ Θ, θ θ 2 Fk µPθ µPθ Hk Remark 25 Note that a link function Fk will only provide nontrivial information if Fk(0) = 0 and Fk is continuous at 0. The existence of such a nontrivial link function implies that the model is identifiable: if µPθ = µPθ then 0 = Fk( µPθ µPθ Hk) θ θ 2, hence θ = θ . In this case, there is a unique function p : M Θ such that p(Pθ) = θ, and Fk is a modulus of continuity of p. A direct application of Corollary 23 gives the following. Corollary 26 In the setting of Corollary 23, assume that M and k are such that a link function Fk exists. With probability at least 1 δ, bθn θ0 2 Fk 2[(1 2ξ)v0 + 2ξψ]log(2/δ) ψ 3 log(2/δ) Gaussian location model, continued. We continue here Example 1. 
A direct computation (see for instance Chérief-Abdellatif and Alquier 2022) shows that ‖µPθ − µPθ′‖²Hkγ = 2 (γ²/(2σ² + γ²))^{d/2} (1 − exp(−‖θ − θ′‖²₂/(4σ² + 2γ²))). In other words, a link function for the Gaussian location model is explicitly given by F²kγ(h) = 2(2σ² + γ²) log( 1/(1 − (h²/2)(1 + 2σ²/γ²)^{d/2}) ). An application of Corollary 26, and setting γ = λσ√d, for some λ > 0, proves that, with probability at least 1 − δ, ‖bθn − θ0‖²₂ ≤ 2σ²(2 + dλ²) log( 1/(1 − (1 + 2/(dλ²))^{d/2} · 2[(1 − 2ξ)(1 − e^{−1/λ²}) + 2ξ] log(2/δ)/n) ) .= Gd,σ,n(λ, ξ). (21) For comparison, we provide the bound of Chérief-Abdellatif and Alquier (2022, Proposition 4.1-Equation 2) that is obtained by plugging (7) into the link function: ‖bθn − θ0‖²₂ ≤ 2σ²(2 + dλ²) log( 1/(1 − (1 + 2/(dλ²))^{d/2} · 2(1 + √(2 log(1/δ)))²/n) ) .= Hd,σ,n(λ, ξ). (22) Figure 3: Comparison of the bounds in (21) versus Chérief-Abdellatif and Alquier (2022, Proposition 4.1-Equation 2) as a function of γ (left panel: ξ = 0.01; right panel: ξ = 0.2; in both, d = 32, δ = 0.1, n = 10000, σ = 3). Figure 4: Comparison of Chérief-Abdellatif and Alquier (2022) and (21) for the optimal hyper-parameter γ as a function of n (δ = 0.1, σ = 3, d = 32, ξ = 0.1). In order to compare the tightness of the bounds in Eq. (22) and Eq. (21), we run experiments for fixed sample size (n = 10000), fixed confidence level (δ = 0.1), fixed variance parameter (σ = 3) and two different contamination levels (ξ = 0.01 and ξ = 0.2). We compute and plot the bounds obtained when varying the scale parameter γ (see Figure 3). For a contamination level ξ = 0.1, we also compare the quantitative decay of the two bounds optimized over γ as the sample size n grows and report our results in Figure 4. Based on our experiments, we make the following observations.
(i) The new bound is always tighter than the one of Chérief-Abdellatif and Alquier (2022). (ii) When the contamination level increases, the two bounds get closer together. This is expected, since we lose the benefit of the variance factor. (iii) Especially under weak contamination, the new bound is much flatter, in the sense that overshooting the value of γ does not lead to a catastrophic degradation of the bound. (iv) The new bound performs exceptionally well in the small-sample setting. We conclude this section by showing heuristically that the bound in (21) can always be made smaller than the bound in (22), at least asymptotically in n, with an adequate choice of λ. For the sake of simplicity, we work in the non-contaminated setting ξ = 0. First, Hd,σ,n(λ, 0) = 4σ²(2 + dλ²)(1 + 2/(dλ²))^{d/2} (1 + √(2 log(1/δ)))²/n · (1 + o(1)), and the first-order term is exactly minimized for λ = 1, which leads to Hd,σ,n(1, 0) = 4σ²(2 + d)(1 + 2/d)^{d/2} (1 + √(2 log(1/δ)))²/n · (1 + o(1)). An exact minimization of the bound in (21) is not feasible, but we will propose a choice of λ = λn that will lead to an improvement on Hd,σ,n(1, 0). First, let us assume that λ1 > 0 and that (λn)n∈N is a non-decreasing sequence. We obtain: Gd,σ,n(λn, 0) = 4σ²(2 + dλn²)(1 + 2/(dλn²))^{d/2} log(2/δ)/(λn² n) · (1 + o(1)) = 4dσ²(1 + 2/(dλn²))^{1+d/2} log(2/δ)/n · (1 + o(1)). It is clear that choosing λn → ∞ when n → ∞ gives: Gd,σ,n(λn, 0) = 4dσ² log(2/δ)/n · (1 + o(1)). Figure 5: Comparison of the bounds in (21) versus Chérief-Abdellatif and Alquier (2022, Proposition 4.1-Equation 2) as a function of γ, versus the empirical MSE on 50 experiments, n = 500, log-scale, ξ = 0, δ = 0.05. Right panel: we zoom on the MSE. Finally, observe that Hd,σ,n(1, 0)/Gd,σ,n(λn, 0) = 2 (1 + 2/d)^{1+d/2} · [(1 + √(2 log(1/δ)))²/(2 log(2/δ))] · (1 + o(1)) ≥ 2e(1 + o(1)), where the bracketed factor is at least 1 for δ ∈ (0, 9/10).
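As a numerical check of this heuristic, one can compare the λ-dependent prefactors of the two first-order terms, dropping all confidence-level factors; the prefactor shapes below follow the expansions above and should be read as a sketch rather than an exact transcription of the bounds.

```python
import numpy as np

# lambda-dependent prefactors of the first-order terms (confidence-level
# factors log(1/delta), log(2/delta) omitted); shapes follow the expansions.

def h_prefactor(lam, d):
    # prefactor of H: (2 + d lam^2) (1 + 2/(d lam^2))^{d/2}, minimized at lam = 1
    return (2 + d * lam ** 2) * (1 + 2 / (d * lam ** 2)) ** (d / 2)

def g_prefactor(lam, d):
    # prefactor of G: d (1 + 2/(d lam^2))^{1 + d/2}, decreasing to d as lam -> inf
    return d * (1 + 2 / (d * lam ** 2)) ** (1 + d / 2)

d = 32
lams = np.linspace(0.5, 50, 2000)
h_vals = h_prefactor(lams, d)
g_vals = g_prefactor(lams, d)

print(h_vals.min() >= h_prefactor(1.0, d) - 1e-9)   # H minimized at lambda = 1
print(g_prefactor(50.0, d) / d)                      # close to 1 for large lambda
```

This reproduces the qualitative picture of Figure 5: the H-type prefactor has an interior minimum at λ = 1, while the G-type prefactor decreases monotonically toward d as λ grows.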
Unfortunately, this first-order analysis is too crude to provide an accurate recommendation on the choice of λn. On the other hand, it shows that our variance-aware bound leads to a significant improvement over the bound of Chérief-Abdellatif and Alquier (2022) for Gaussian mean estimation. Moreover, the choice λn → ∞ is in accordance with the considerations on the asymptotic variance in Briol et al. (2019) in this model. We illustrate this with an experiment in Figure 5. We observe that, indeed, the variance-aware bound decreases with γ (equivalently, with λ), while the McDiarmid bound has a minimum. Interestingly, the true MSE also has a minimum. Note however that the MSE is actually very flat as a function of γ: in other words, in this experiment, the choice of γ does not matter much for the performance of the MMD estimator. The possibility of using the variance-aware bound to calibrate γ in practice should be investigated further in other models. 7. Proofs In this section, we provide the detailed technical proofs that were deferred in earlier parts of the paper. 7.1 Proof of Theorem 1 In Tolstikhin et al. (2017), the authors mention, but do not pursue, the idea that Bernstein's inequality in separable Hilbert spaces yields the proper rate of OP(n^{−1/2}). We follow up on their idea in this proof. First recall that bµP = (1/n) Σ_{t=1}^n k(Xt, ·) and µP .= EX∼P [k(X, ·)], and that we can write ‖bµP − µP‖Hk = (1/n) ‖Σ_{t=1}^n Zt‖Hk, where for 1 ≤ t ≤ n, Zt .= k(Xt, ·) − EX∼P [k(X, ·)] is centered. For ε > 0, we are thus interested in bounding the probability of the deviation event ‖Σ_{t=1}^n Zt‖Hk > nε. We rely on a Bernstein-type inequality in Hilbert spaces, credited to Pinelis and Sakhanenko (1985) (see also Yurinsky 1995, Theorem 3.3.4), which is reported in Theorem 32. We first observe that the RKHS Hk is separable by separability of the topological space X and continuity of k. In order to invoke Theorem 32, we need to control from above the quantities Σ_{t=1}^n E‖Zt‖^p_Hk, for any p ≥ 3. For p = 2, Σ_{t=1}^n E‖k(Xt, ·) − µP‖²Hk = nv.
Moving on to higher moments, for p 3, we have that Zt Hk k(Xt, ) Hk + EX P [k(X, )] Hk p k(Xt, Xt) + sup x X As a result, t=1 E Zt p Hk t=1 E Zt 2 Hk Invoking Theorem 32 with G2 = nv, H = 2 k/3, we obtain that for any ε > 0, 2(nv + (εn)2 Variance-Aware Estimation of Kernel Mean Embedding 7.2 Proof of Lemma 10 We first verify the bounded-differences property. bv(x) inf x t0 X bv x(t0) =ψ 1 (n 1)n t =s ψ(xt xs) inf x t0 X t =s ψ x(t0) t x(t0) s t =s ψ x(t0) t x(t0) s t =s ψ(xt xs) t [n] t =t0 t [n] t =t0 t [n] t =t0 t [n] t =t0 Wolfer and Alquier It follows from ( ) that, bv(x) inf x t0 X bv x(t0) !2 t [n] t =t0 t [n] t =t0 ψ ψ(xt0 xt) t [n] t =t0 ψ ψ(xt0 xt) 2 t [n] t =t0 ψ ψ(xt0 xt) where the second inequality is Jensen s. 7.3 Proof of Lemma 11 From Lemma 10, the function X n R, x 7 n 2 ψ bv(x), is weakly (2, 0)-self-bounding and has the bounded differences property in the sense of Definition 7. From the second statement of Maurer (2006, Theorem 13), P E n 2 ψ bv(X) n 2 ψ bv(X) > t exp 4E h n 2 ψbv(X) i that can be rewritten, P (v bv(X) > t) exp nt2 In other words, with probability at least 1 δ, 2 ψ log(1/δ) Completing the square, and by sub-additivity of the square root function, we successively have with probability 1 δ, v 2 ψ log(1/δ) !2 bv(X) + 2 ψ log(1/δ) Variance-Aware Estimation of Kernel Mean Embedding 2 ψ log(1/δ) bv(X) + 2 ψ log(1/δ) 2 ψ log(1/δ) which proves the lemma for b = 1. On the other hand, from the first inequality in Maurer (2006, Theorem 13), P n 2 ψ bv(X) E n 2 ψ bv(X) > t exp 4E h n 2 ψbv(X) i + 2t that can be rewritten as P (bv(X) v > t) exp nt2 8 ψ(v + t/2) The positive solution t+ for nt2 = 8 ψ(v + t/2) log(1/δ) is readily given by t+ = 2 ψ log(1/δ) 2 log(1/δ) ψ 2 log(1/δ) + nv . 
Thus, it holds with probability at least 1 δ that bv(X) v + t+, and from sub-additivity of the square root, successively, with probability at least 1 δ, we have bv(X) v + 4 ψ log(1/δ) 2v ψ log(1/δ) 2 ψ log(1/δ) !2 + 2 ψ log(1/δ) 2 ψ log(1/δ) !2 + 2 ψ log(1/δ) bv(X) v + 2 2 ψ log(1/δ) which finishes proving the lemma10 for b = 1. 10. In this proof, Maurer (2006) is sufficient. We do not need the stronger results of Boucheron et al. (2009). Wolfer and Alquier 7.4 Proof of Theorem 12 From Theorem 1, with probability at least 1 δ/2, it holds that bµP(X1, . . . , Xn) µP Hψ Invoking Lemma 11 for b = 1, at confidence 1 δ/2 it holds that 2 ψ log(2/δ) 2 ψ log(4/δ) thus by combining (23) and (24) and a union bound yields that with probability at least 1 δ, bµP(X1, . . . , Xn) µP Hψ 2 ψ log(4/δ) 2bv(X)log(4/δ) 2bv(X)log(4/δ) 7.5 Proof of Lemma 16 E bµP µP 2 Hk =E 1 t=1 k(Xt, ) µP, 1 s=1 k(Xs, ) µP t=1 E k(Xt, ) µP, k(Xs, ) µP Hk s B(X1, . . . , Xn, α)) inf θ Θ c µP µPθ Hk > B(X1, . . . , Xn, α) Hk > B(X1, . . . , Xn, α) Wolfer and Alquier thanks to (a). Now, assume that H0 is not true, that is P / {Pθ, θ Θ}. Note that under the assumption that the model is closed, we have 0 < := inf θ Θ µP µPθ Hk . By the triangle inequality, for any θ, µPθ bµP Hk µPθ µP Hk µP bµP Hk and thus, taking the infimum w.r.t θ on both sides, T µP bµP Hk . We apply (a) with confidence level αn and obtain PH1 µP bµP Hk B(X1, . . . , Xn, αn) αn PH1 (T B(X1, . . . , Xn, αn)) αn. As B(X1, . . . , Xn, αn) 0 when n thanks to (c), there is a N large enough such that, for any n N, B(X1, . . . , Xn, αn) /2, and thus Moreover, as we also have B(X1, . . . , Xn, αn) 0 when n thanks to (b), there is a N large enough such that, for any n N , B(n, α) /2 and thus PH1 (T B(X1, . . . , Xn, αn)) αn. PH1 (C) = 1 PH1 (T B(X1, . . . , Xn, αn)) 1 αn as soon as n max(N, N ). Variance-Aware Estimation of Kernel Mean Embedding 7.8 Proof of Theorem 20 First, under H0, we have PX = PY and thus PH0 (C2) = PH0 (T2 > B(X1, . . . 
, Xn, α/2) + B(Y1, . . . , Ym, α/2)) = PH0 ( ‖c µP(X1, . . . , Xn) − c µP(Y1, . . . , Ym)‖Hk > B(X1, . . . , Xn, α/2) + B(Y1, . . . , Ym, α/2) ) = PH0 ( ‖c µP(X1, . . . , Xn) − µPX + µPX − c µP(Y1, . . . , Ym)‖Hk > B(X1, . . . , Xn, α/2) + B(Y1, . . . , Ym, α/2) ) ≤ PH0 ( ‖c µP(X1, . . . , Xn) − µPX‖Hk > B(X1, . . . , Xn, α/2) ) + PH0 ( ‖c µP(Y1, . . . , Ym) − µPX‖Hk > B(Y1, . . . , Ym, α/2) ) ≤ α/2 + α/2 = α, thanks to (a). Now, assume that H0 is not true, that is PX ≠ PY , and put ∆ .= ‖µPX − µPY‖Hk > 0. By the triangle inequality, T2 = ‖c µP(X1, . . . , Xn) − c µP(Y1, . . . , Ym)‖Hk ≥ ∆ − ‖c µP(X1, . . . , Xn) − µPX‖Hk − ‖c µP(Y1, . . . , Ym) − µPY‖Hk . We apply (a) with respective confidence levels αn and αm to get PH1 ( ‖µPX − bµP(X1, . . . , Xn)‖Hk ≥ B(X1, . . . , Xn, αn) ) ≤ αn and PH1 ( ‖µPY − bµP(Y1, . . . , Ym)‖Hk ≥ B(Y1, . . . , Ym, αm) ) ≤ αm. Thus, PH1 (T2 ≤ ∆ − B(X1, . . . , Xn, αn) − B(Y1, . . . , Ym, αm)) ≤ αn + αm. Thanks to (c), there is an N large enough such that, as soon as both n, m ≥ N, both B(X1, . . . , Xn, αn) ≤ ∆/4 and B(Y1, . . . , Ym, αm) ≤ ∆/4 hold, and thus PH1 (T2 ≤ ∆/2) ≤ αn + αm. Moreover, thanks to (b), there is an N′ large enough such that, when n, m ≥ N′, both B(X1, . . . , Xn, α/2) ≤ ∆/4 and B(Y1, . . . , Ym, α/2) ≤ ∆/4 hold, and thus PH1 (T2 ≤ B(X1, . . . , Xn, α/2) + B(Y1, . . . , Ym, α/2)) ≤ αn + αm. Hence PH1 (C2) = 1 − PH1 (T2 ≤ B(X1, . . . , Xn, α/2) + B(Y1, . . . , Ym, α/2)) ≥ 1 − αn − αm as soon as n ≥ max(N, N′), and thus PH1 (C2) ≥ 1 − αn − αm → 1 as n, m → ∞. 7.9 Proof of Lemma 21 v(P) = EX∼P ‖k(X, ·) − µP‖²Hk = ψ − ∫∫ ψ(x − x′) d(P ⊗ P)(x, x′) = ψ − ∫∫ ψ(x − x′) dP(x) dP(x′) = ψ − ∫∫ ψ(x − x′) d((1 − ξ)Pθ0(x) + ξH(x)) d((1 − ξ)Pθ0(x′) + ξH(x′)) = ψ − [ (1 − ξ)² ∫∫ ψ(x − x′) dPθ0(x) dPθ0(x′) + 2ξ(1 − ξ) ∫∫ ψ(x − x′) dPθ0(x) dH(x′) + ξ² ∫∫ ψ(x − x′) dH(x) dH(x′) ] ≤ ψ − (1 − ξ)²(ψ − v0) + 2ξ(1 − ξ)ψ + ξ²ψ = ψ − (1 − 2ξ + ξ²)(ψ − v0) + ξ(2 − ξ)ψ = ψ − (ψ − v0) + 2ξ(ψ − v0) − ξ²(ψ − v0) + ξ(2 − ξ)ψ ≤ (1 − 2ξ)v0 + 2ξψ + ξ(2 − ξ)ψ, and the lemma follows by disjunction of cases. Acknowledgments The authors declare that there is no conflict of interest.
Part of this research was conducted when GW was supported by the Special Postdoctoral Researcher Program (SPDR) of RIKEN. Part of the research in this paper was done when PA was working at RIKEN AIP. We thank Bastien Dussap for pointing out a missing factor two in an earlier version Variance-Aware Estimation of Kernel Mean Embedding of this manuscript. We thank the anonymous reviewers for their useful comments. In particular, the comparison to the approach based on Bernstein s inequality for U-statistics in Section 4 was suggested by Reviewer 1 and the analysis for time-dependent processes in Section 5 was suggested by Reviewer 2. All the remaining mistakes are ours. Appendix A. Extension to Non Translation Invariant Kernels Let us define diag k: X R by diag k(x) = k(x, x), and use the shorthands diag k, diag k and diag k introduced in (2) accordingly. We show that the properties obtained in Section 3 can be extended to non TI kernel when diag k is controlled from above. Lemma 27 Let k be a characteristic reproducing kernel. The function X n R, x 7 n diag k + 2 k bv(x), is weakly 2, 2n diag k diag k + 2 k -self-bounding, and has bounded differences in the sense of Definition 7. Lemma 28 For b { 1, 1}, with probability at least 1 δ, bv(X) + diag k p v + diag k i 2 ( diag k + 2 k) log(1/δ) Attempting to recover concentration of bv around v with self-boundedness leads to an additional term in O n 1/4 in (26) when diag k = 0. We thus settle for concentration of p bv + diag k around v + diag k instead. Theorem 29 (Confidence interval with empirical variance for general kernel) Let n N, X1, . . . , Xn P, let k be a characteristic reproducing kernel defined from a positive definite function ψ [see (4)]. Then with probability at least 1 δ, it holds that bµP µP Hk b Bk,δ(X1, . . . , Xn), b Bk,δ(X1, . . . , Xn) = b Bδ .= 2(bv(X1, . . . , Xn) + diag k)log(4/δ) diag k log(4/δ) and where bv is the empirical variance proxy defined in (12). 
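Theorem 29 is stated in terms of the empirical variance proxy bv of (12). A minimal sketch of this proxy, in the form reconstructed from the computation of Appendix B.2, bv(x) = (1/(n−1)) Σt [k(xt, xt) − (1/n) Σs k(xt, xs)], checked on a case where v is available in closed form; the Gaussian kernel and sample size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Empirical variance proxy (form reconstructed from Appendix B.2):
# an unbiased estimator of v = E k(X, X) - E k(X, X').

def v_hat(K):
    """K: n x n Gram matrix of the sample."""
    n = K.shape[0]
    return (np.trace(K) - K.mean(axis=1).sum()) / (n - 1)

n = 2000
x = rng.normal(size=n)                              # X_1, ..., X_n iid N(0, 1)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2)     # Gaussian kernel, gamma = 1

est = v_hat(K)
exact = 1.0 - 1.0 / np.sqrt(3)   # closed-form v for N(0,1) and this kernel
print(est, exact)
```

For a translation invariant kernel, k(x, x) is constant, so the diag k correction of Theorem 29 vanishes and the proxy reduces to the quantity used in Theorem 12.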
Remark 30 In particular, for a TI kernel diag k = 0, recovering the results of the previous section. It is currently unclear whether the term in diag k is necessary or if it is an artifact of our proof. Wolfer and Alquier Appendix B. Additional Proofs In this section, we present supplementary technical proofs that were postponed in previous sections of the paper B.1 Proof of Lemma 8 For some t0, we let x(t0) = (x1, . . . , xt0 1, x t0, xt0+1, . . . , xn) where xt0 was replaced with x t0. 1 d Tr bΣ(x) 1 d Tr bΣ x(t0) xi t xi s 2 x(t0)i t x(t0)i s 2 ! xi t0 xi s 2 ! 1 d Tr bΣ(x) inf x t0 X 1 d Tr bΣ x(t0) 1. Furthermore, d Tr bΣ(x) 1 d Tr bΣ x(t0) 2 (i) xi t0 xi s 2 !!2 xi t0 xi s 2 !2 n n 1bvi(X) = n n 1 1 d Tr bΣ(X), where (i) follows from (27), (ii) is Jensen s inequality, and (iii) stems from Maurer and Pontil (2009, Corollary 9). Variance-Aware Estimation of Kernel Mean Embedding B.2 Proof of Lemma 9 We verify that bv is an unbiased estimator for v. EX1,...,Xn P [bv(X1, . . . , Xn)] =EX1,...,Xn P k(Xt, Xt) 1 s=1 k(Xt, Xs) EX P [k(X, X)] 1 s=1 EXt,Xs P [k(Xt, Xs)] = n n 1EX P [k(X, X)] 1 n(n 1) s=1 EXt,Xs P [k(Xt, Xs)] = n n 1EX P [k(X, X)] 1 (n 1)EX P [k(X, X)] n(n 1)EX,X P k(X, X ) =EX P [k(X, X)] EX,X P k(X, X ) B.3 Proof of Theorem 18 For t N, we write Zt .= k(Xt, ) µP, thus bµP(X1, . . . , Xn) µP Hk = 1 We assume that n = 2Bs, where B, s N will be determined later. For b [B], we denote Z[2b] .= Z[2b] 1 , . . . , Z[2b] s .= Z(2b 1)s+1, . . . , Z2bs, Z[2b 1] .= Z[2b 1] 1 , . . . , Z[2b 1] s .= Z(2b 2)s+1, . . . , Z(2b 1)s, and we decompose the random process into blocks Z[1], Z[2], . . . , Z[2B]. From sub-additivity of the norm and the blocking method described in Yu (1994, Corollary 2.7), it holds that t=1 Z[2b σ] t ! Hk > nε/2 t=1 e Z[2b σ] t ! Hk > nε/2 + 2 (B 1) β(s) Wolfer and Alquier where for b [2], the blocks e Z[2b] = e Z[2b] 1 , . . . , e Z[2b] s are mutually independent and a similar fact holds for the blocks e Z[2b 1] = e Z[2b 1] 1 , . . . 
, e Z[2b 1] s . As a result, for b [B], the Ps t=1 e Z[2b σ] t are independent, centered and valued in Hk, thus the problem above has been reduced to controlling the norm of a sum of B iid random vectors in the Hilbert space Hk. Let us fix σ {0, 1}. Similar to our proof in the iid setting, our goal is to invoke Theorem 32 (Yurinsky, 1995, Theorem 3.3.4). In order to do that, we need to control from above the PB b=1 E Ps t=1 e Z[2b σ] t p Hk , for any p 3. For p = 2, by linearity of the expectation, we t=1 e Z[2b σ] t r=1 E e Z[2b σ] t , e Z[2b σ] r Hk. By stationarity, we can write more simply t=1 e Z[2b σ] t r=1 E k(X(2b σ 1)s+t, ) µP, k(X(2b σ 1)s+r, ) µP Hk r=1 E k(Xt r+1, ) µP, k(X1, ) µP Hk t=1 k(X1, ) µP, k(X1, ) µP Hk t=2 (s t + 1)E k(Xt, ) µP, k(X1, ) µP Hk It is then natural to consider the following covariance coefficients in the RKHS introduced by Ch erief-Abdellatif and Alquier (2022), ρt .= E k(Xt, ) µP, k(X1, ) µP Hk, and in particular, observe that we recover ρ1 = v. It follows that t=1 e Z[2b σ] t t=2 (s t + 1)ρt 2 (v + Σs) , where the mixing coefficient Σs was introduced in Lemma 16. In particular, Σs vanishes for iid processes. For p 3, by sub-additivity of the norm, t=1 e Z[2b σ] t e Z[2b σ] t Hk 2s p Variance-Aware Estimation of Kernel Mean Embedding Therefore, invoking Theorem 32 with G2 = n 2 (v + Σs) and H = 2s k/3, we obtain for σ {0, 1} and for any ε > 0 that, t=1 e Z[2b σ] t 2(n(v + Σs)/2 + ((ε/2)n)2s 4 v + Σs + 2εs It remains to control the term involving the β-mixing coefficient by suitable choosing the blocking size. For τ β n,δ .= arg min s N δ 6(n/(2s) 1) it holds that 2 (B 1) β(s) δ/3, which concludes the proof. B.4 Proof of Lemma 27 We will rely on the following property for general kernels. Lemma 31 Let x = (x1, . . . , xt0, . . . , xn) X n and x t0 X. Writing x(t) = (x1, . . . , xt0 1, x t0, xt0+1, . . . , xn), it holds that bv(x) bv x(t0) = 1 n k (xt0, xt0) k x t0, x t0 t [n],t =t0 k x t0, xt k (xt0, xt) . 
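The identity of Lemma 31 can be verified numerically. The sketch below uses the same reconstructed form of bv as in Appendix B.2; the Laplacian kernel and the replaced coordinate are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

# Numerical check of Lemma 31 with
# v_hat(x) = (1/(n-1)) sum_t k(x_t, x_t) - (1/(n(n-1))) sum_{t,s} k(x_t, x_s).

def k(a, b):
    return np.exp(-np.abs(a - b))    # Laplacian kernel, an illustrative choice

def v_hat(x):
    K = k(x[:, None], x[None, :])
    n = len(x)
    return K.trace() / (n - 1) - K.sum() / (n * (n - 1))

n, t0 = 50, 7
x = rng.normal(size=n)
y = x.copy()
y[t0] = 2.5                          # replace coordinate t0 by x'_{t0}

lhs = v_hat(x) - v_hat(y)
others = np.delete(np.arange(n), t0)
# Right-hand side of Lemma 31:
rhs = (k(x[t0], x[t0]) - k(y[t0], y[t0])) / n \
      + 2.0 / ((n - 1) * n) * (k(y[t0], x[others]) - k(x[t0], x[others])).sum()
print(abs(lhs - rhs))
```

The two sides agree up to floating-point error, confirming that changing a single coordinate moves bv by a diagonal term of order 1/n plus an off-diagonal sum of order 1/n, which is what drives the bounded-differences and self-bounding properties used above.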
From Lemma 31, it follows that
\[
\widehat v(x) - \inf_{x'_{t_0} \in \mathcal{X}}\widehat v\left(x^{(t_0)}\right) \overset{(\ast)}{\leq} \frac{1}{n}\left(\|\mathrm{diag}\,k\| + \frac{2}{n-1}\sum_{t \in [n], t \neq t_0}\left(\|k\|_\infty - k(x_{t_0},x_t)\right)\right) \overset{(\ast\ast)}{\leq} \frac{\|\mathrm{diag}\,k\| + 2\|k\|_\infty}{n},
\]
and the bounded differences property follows directly from $(\ast\ast)$. Furthermore, as a result of $(\ast)$, expanding the square, bounding one factor by $(\ast\ast)$, and using at $(\ast\ast\ast)$ both the identity $\sum_{t_0, t \in [n]}\left(k(x_{t_0},x_{t_0}) - k(x_{t_0},x_t)\right) = n(n-1)\,\widehat v(x)$ and the fact that $\|k\|_\infty = \sup_{x \in \mathcal{X}} k(x,x)$ for a positive definite kernel, we obtain
\[
\sum_{t_0 = 1}^{n}\left(\widehat v(x) - \inf_{x'_{t_0} \in \mathcal{X}}\widehat v\left(x^{(t_0)}\right)\right)^2 \overset{(\ast\ast\ast)}{\leq} \frac{2\left(\|\mathrm{diag}\,k\| + 2\|k\|_\infty\right)}{n}\left(\widehat v(x) + \|\mathrm{diag}\,k\|\right).
\]

B.5 Proof of Lemma 28

Lemma 27 established that the function
\[
\mathcal{X}^n \to \mathbb{R}, \qquad x \mapsto \frac{n}{\|\mathrm{diag}\,k\| + 2\|k\|_\infty}\,\widehat v(x),
\]
is weakly self-bounding and has the bounded differences property. From an application of Boucheron et al. (2009, Theorem 1),
\begin{align*}
P\left(\widehat v - v > t\right) &\leq \exp\left(-\frac{n t^2}{4\left(\|\mathrm{diag}\,k\| + 2\|k\|_\infty\right)\left(v + \|\mathrm{diag}\,k\| + t/2\right)}\right), \\
P\left(v - \widehat v > t\right) &\leq \exp\left(-\frac{n t^2}{4\left(\|\mathrm{diag}\,k\| + 2\|k\|_\infty\right)\left(v + \|\mathrm{diag}\,k\|\right)}\right).
\end{align*}
The claim follows from rewriting $\widehat v - v = \left[\widehat v + \|\mathrm{diag}\,k\|\right] - \left[v + \|\mathrm{diag}\,k\|\right]$, and from solving quadratic inequalities as in Lemma 11.

B.6 Proof of Theorem 29

From Theorem 1, and since $\|\mathrm{diag}\,k\| > 0$, with probability at least $1 - \delta/2$ it holds that
\[
\left\|\widehat\mu_P(X_1,\dots,X_n) - \mu_P\right\|_{H_k} \leq \sqrt{\frac{2\left(v + \|\mathrm{diag}\,k\|\right)\log(4/\delta)}{n}}. \tag{28}
\]
By Lemma 28 for $b = 1$, at confidence $1 - \delta/2$ it holds that
\[
\sqrt{v + \|\mathrm{diag}\,k\|} \leq \sqrt{\widehat v + \|\mathrm{diag}\,k\|} + 2\sqrt{\frac{\left(\|\mathrm{diag}\,k\| + 2\|k\|_\infty\right)\log(4/\delta)}{n}}, \tag{29}
\]
thus combining (28) and (29) with a union bound yields that with probability at least $1 - \delta$,
\[
\left\|\widehat\mu_P - \mu_P\right\|_{H_k} \leq \sqrt{\frac{2\log(4/\delta)}{n}}\left(\sqrt{\widehat v + \|\mathrm{diag}\,k\|} + 2\sqrt{\frac{\left(\|\mathrm{diag}\,k\| + 2\|k\|_\infty\right)\log(4/\delta)}{n}}\right) \leq \sqrt{\frac{2\left(\widehat v + \|\mathrm{diag}\,k\|\right)\log(4/\delta)}{n}} + \frac{2\sqrt{2\left(\|\mathrm{diag}\,k\| + 2\|k\|_\infty\right)}\,\log(4/\delta)}{n},
\]
where the last inequality is by sub-additivity of the square root.

B.7 Proof of Lemma 31

Starting from the definition of $\widehat v$,
\[
\widehat v(x) - \widehat v\left(x^{(t_0)}\right) = \frac{1}{n-1}\sum_{r=1}^{n}\left(k(x_r,x_r) - k\left(x^{(t_0)}_r, x^{(t_0)}_r\right)\right) - \frac{1}{n(n-1)}\sum_{(r,s) \in [n]^2}\left(k(x_r,x_s) - k\left(x^{(t_0)}_r, x^{(t_0)}_s\right)\right).
\]
Splitting the double sum into its diagonal pairs ($r = s$) and off-diagonal pairs ($r \neq s$), and the off-diagonal pairs further according to whether $r = t_0$ or $s = t_0$ (all the remaining terms cancel),
\begin{align*}
\widehat v(x) - \widehat v\left(x^{(t_0)}\right) &= \frac{1}{n-1}\left(k(x_{t_0},x_{t_0}) - k(x'_{t_0},x'_{t_0})\right) - \frac{1}{n(n-1)}\Bigg[\left(k(x_{t_0},x_{t_0}) - k(x'_{t_0},x'_{t_0})\right) \\
&\qquad + \sum_{s \neq t_0}\left(k(x_{t_0},x_s) - k(x'_{t_0},x_s)\right) + \sum_{r \neq t_0}\left(k(x_r,x_{t_0}) - k(x_r,x'_{t_0})\right)\Bigg] \\
&= \frac{1}{n}\left(k(x_{t_0},x_{t_0}) - k(x'_{t_0},x'_{t_0})\right) + \frac{2}{n(n-1)}\sum_{t \neq t_0}\left(k(x'_{t_0},x_t) - k(x_{t_0},x_t)\right),
\end{align*}
where the last step uses the symmetry of the kernel.

Appendix C. Tools from the Literature

We report here tools from the literature.

Theorem 32 (Yurinsky, 1995, Theorem 3.3.4) Let $Z_1, \dots, Z_n$ be independent random variables in a separable Hilbert space $H$ that are centered, that is, $\mathbf{E}[Z_t] = 0$ for all $t \in [n]$. If there exist real numbers $G, H \geq 0$ such that for any $p \geq 2$,
\[
\sum_{t=1}^{n}\mathbf{E}\left\|Z_t\right\|_{H}^{p} \leq \frac{p!}{2}\,G^2 H^{p-2},
\]
then it holds for any $\varepsilon > 0$ that
\[
P\left(\left\|\sum_{t=1}^{n} Z_t\right\|_{H} \geq \varepsilon\right) \leq 2\exp\left(-\frac{\varepsilon^2/2}{G^2 + \varepsilon H}\right).
\]

Theorem 33 (Samson, 2000, Theorem 3) Let $X_1, \dots, X_n$ be a stationary $\varphi$-mixing sequence. For every $\varepsilon > 0$,
\[
P\left(Z \geq \mathbf{E} Z + \varepsilon\right) \leq \exp\left(-\frac{1}{8\|\Gamma\|^2}\min\left(\frac{\varepsilon^2}{\mathbf{E}[V]}, \frac{\varepsilon}{C}\right)\right),
\]
where
\[
Z := \sup_{g \in \mathcal{F}}\sum_{t=1}^{n} g(X_t),
\]
with $\mathcal{F}$ an arbitrary class of real bounded functions with $|g| \leq C$, where the random variable $V$ is defined as
\[
V := \sum_{t=1}^{n}\sup_{g \in \mathcal{F}} g(X_t)^2,
\]
and $\Gamma$ is the coupling matrix (refer to Equation 17).

References

P. Alquier, B.-E. Chérief-Abdellatif, A. Derumigny, and J.-D. Fermanian. Estimation of copulas via maximum mean discrepancy. Journal of the American Statistical Association, 118(543):1997–2012, 2023.

M. A. Arcones. A Bernstein-type inequality for U-statistics and U-processes. Statistics & Probability Letters, 22(3):239–247, 1995.

J.-Y. Audibert, R. Munos, and C. Szepesvári. Tuning bandit algorithms in stochastic environments. In International Conference on Algorithmic Learning Theory, pages 150–165. Springer, 2007.

P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

V. Bentkus, N. Kalosha, and M. van Zuijlen. On domination of tail probabilities of (super)martingales: explicit bounds. Lithuanian Mathematical Journal, 46(1):1–43, 2006.

D. Berend and A. Kontorovich. A sharp estimate of the binomial mean absolute deviation with applications. Statistics & Probability Letters, 83(4):1254–1259, 2013.

A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media, 2011.

S. Boucheron, G. Lugosi, and P. Massart. On concentration of self-bounding functions. Electronic Journal of Probability, 14:1884–1899, 2009.

R. C. Bradley. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2:107–144, 2005.

F.-X. Briol, A. Barp, A. B. Duncan, and M. Girolami. Statistical inference for generative models with maximum mean discrepancy. arXiv preprint arXiv:1906.05944, 2019.

B.-E. Chérief-Abdellatif and P. Alquier. Finite sample properties of parametric MMD estimation: robustness to misspecification and dependence. Bernoulli, 28(1):181–213, 2022.

K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2606–2615. PMLR, 2016.

D. Cohen, A. Kontorovich, and G. Wolfer. Learning discrete distributions with infinite support. In Advances in Neural Information Processing Systems, volume 33, pages 3942–3951, 2020.

M. Cuturi. Positive definite kernels in machine learning. arXiv preprint arXiv:0911.5367, 2009.

J. Diestel and J. Uhl. Vector Measures, volume 15 of Mathematical Surveys. American Mathematical Society, Providence, RI, USA, 1977.

P. Doukhan. Mixing: Properties and Examples, volume 85. Springer Science & Business Media, 2012.

G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 258–267, 2015.

M. G. Genton. Classes of kernels for machine learning: a statistics perspective. Journal of Machine Learning Research, 2(Dec):299–312, 2001.

A. Gretton. Introduction to RKHS, and some simple kernel algorithms. Advanced Topics in Machine Learning, lecture notes, University College London, 2013.

A. Gretton, K. Fukumizu, Z. Harchaoui, and B. K. Sriperumbudur. A fast, consistent kernel two-sample test. Advances in Neural Information Processing Systems, 22, 2009.

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.

W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

P. J. Huber. Robust statistics. In International Encyclopedia of Statistical Science, pages 1248–1251. Springer, 2011.

I. A. Ibragimov. Some limit theorems for stationary processes. Theory of Probability & Its Applications, 7(4):349–382, 1962.

W. Jitkrittum, W. Xu, Z. Szabó, K. Fukumizu, and A. Gretton. A linear-time kernel goodness-of-fit test. Advances in Neural Information Processing Systems, 30, 2017.

M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. E. Schapire, and L. Sellie. On the learnability of discrete distributions. In Proceedings of the Twenty-Sixth Annual ACM Symposium on Theory of Computing, pages 273–282, 1994.

A. K. Kuchibhotla and Q. Zheng. Near-optimal confidence sequences for bounded random variables. In International Conference on Machine Learning, pages 5827–5837. PMLR, 2021.

D. A. Levin, Y. Peres, and E. L. Wilmer. Markov Chains and Mixing Times, second edition. American Mathematical Society, 2009.

Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In International Conference on Machine Learning, pages 1718–1727. PMLR, 2015.

D. Lopez-Paz, K. Muandet, B. Schölkopf, and I. Tolstikhin. Towards a learning theory of cause-effect inference. In International Conference on Machine Learning, pages 1452–1461. PMLR, 2015.

A. Maurer. Concentration inequalities for functions of independent variables. Random Structures & Algorithms, 29(2):121–138, 2006.

A. Maurer and M. Pontil. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.

C. McDiarmid. On the method of bounded differences. Surveys in Combinatorics, 141(1):148–188, 1989.

M. Mohri and A. Rostamizadeh. Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research, 11(2), 2010.

M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2018.

K. Muandet, K. Fukumizu, B. Sriperumbudur, A. Gretton, and B. Schölkopf. Kernel mean estimation and Stein effect. In International Conference on Machine Learning, pages 10–18. PMLR, 2014.

K. Muandet, B. Sriperumbudur, K. Fukumizu, A. Gretton, and B. Schölkopf. Kernel mean shrinkage estimators. Journal of Machine Learning Research, 17, 2016.

F. Münch and J. Salez. Mixing time and expansion of non-negatively curved Markov chains. Journal de l'École polytechnique – Mathématiques, 10:575–590, 2023. doi: 10.5802/jep.226.

T. Peel, S. Anthoine, and L. Ralaivola. Empirical Bernstein inequalities for U-statistics. Advances in Neural Information Processing Systems, 23, 2010.

I. F. Pinelis and A. I. Sakhanenko. Remarks on inequalities for the probabilities of large deviations. Teoriya Veroyatnostei i ee Primeneniya, 30(1):127–131, 1985.

G. O. Roberts and J. S. Rosenthal. General state space Markov chains and MCMC algorithms. Probability Surveys, 1:20–71, 2004. doi: 10.1214/154957804100000024.

P.-M. Samson. Concentration of measure inequalities for Markov chains and Φ-mixing processes. The Annals of Probability, 28(1):416–461, 2000. doi: 10.1214/aop/1019160125.

M. L. Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer Science & Business Media, 1999.

I. Tolstikhin, B. K. Sriperumbudur, and K. Muandet. Minimax estimation of kernel mean embeddings. The Journal of Machine Learning Research, 18(1):3002–3048, 2017.

A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Science & Business Media, 2008.

I. Waudby-Smith and A. Ramdas. Estimating means of bounded random variables by betting. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86(1):1–27, 2024.

G. Wolfer and P. Alquier. Optimistic estimation of convergence in Markov chains with the average-mixing time. arXiv preprint arXiv:2402.10506, 2024.

B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, pages 94–116, 1994.

V. Yurinsky. Sums and Gaussian Vectors. Lecture Notes in Mathematics. Springer Berlin, Heidelberg, 1995.