# Nyström Kernel Mean Embeddings

Antoine Chatalic¹, Nicolas Schreuder¹, Alessandro Rudi², Lorenzo Rosasco¹ ³

¹MaLGA & DIBRIS, Università di Genova. ²Inria, École normale supérieure, PSL Research University. ³CBMM, MIT, IIT. Correspondence to: Antoine Chatalic.

*Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).*

## Abstract

Kernel mean embeddings are a powerful tool to represent probability distributions over arbitrary spaces as single points in a Hilbert space. Yet, the cost of computing and storing such embeddings prohibits their direct use in large-scale settings. We propose an efficient approximation procedure based on the Nyström method, which exploits a small random subset of the dataset. Our main result is an upper bound on the approximation error of this procedure. It yields sufficient conditions on the subsample size to obtain the standard $n^{-1/2}$ rate while reducing computational costs. We discuss applications of this result to the approximation of the maximum mean discrepancy and to quadrature rules, and illustrate our theoretical findings with numerical experiments.

## 1. Introduction

Owing to the increasing complexity of modern datasets, designing compact and meaningful representations of data collections and probability distributions is a key problem in machine learning. Kernel methods, which have proven over the last decades to be a convenient and powerful tool to design feature spaces capturing complex relations between data points via the choice of a single kernel function, can be used towards this goal. Initially introduced by Smola et al. (2007), kernel mean embeddings (KME) indeed allow to represent a probability distribution via a single mean element in such feature spaces (Muandet et al. 2017). They have found applications in various areas such as anomaly detection (Zou et al. 2014), approximate Bayesian computation (Park et al. 2016), domain adaptation (Zhang et al. 2013), imitation learning (Kim et al. 2018), nonparametric inference in graphical models (Song et al. 2013), functional data analysis (Hayati et al. 2020), discriminative learning for probability measures (Muandet et al. 2012) and differential privacy (Balog et al. 2018; Chatalic et al. 2021).

In machine learning applications, one is typically interested in computing the kernel mean embedding of the data distribution, which is most often unknown and approximated using empirical data. Yet, in its basic form, the empirical estimator still requires storing and manipulating the whole dataset, which can be impractical or even, in the worst case, impossible. Finite-dimensional approximations of the feature map can be used to reduce the memory requirement of this approach (Liu et al. 2021; Rahimi et al. 2008). However, these methods are either generic, and thus fail to capture the complexity of the data distribution with a low-dimensional representation, or adaptive, in which case costly data-dependent quantities need to be computed and/or stored (e.g., leverage scores (Shahrampour et al. 2019)). Furthermore, the resulting embeddings belong to a different space than the kernel mean embeddings and thus these approaches cannot be compared directly.

In this paper, we tackle the problem of efficiently approximating the kernel mean embedding of an unknown probability distribution for which samples are available.
We introduce an estimator based on Nyström approximation, which can be computed efficiently using a random subset of the data, and provide an upper bound on its approximation error. Under mild assumptions on the kernel and data distribution, we show that the size of this random subset can be taken much smaller than the number of available samples while still yielding an optimal error rate. We characterize this fact using the spectral properties of the covariance operator associated to the considered kernel. For instance, assuming an exponential decay of this spectrum, we show that a subsample size of the order of $\sqrt{n}\log(\sqrt{n})$ is sufficient to obtain an error of order $O(1/\sqrt{n})$, that is, the same as that of the empirical estimator, thus demonstrating that it is possible to preserve statistical performance while drastically reducing computational costs. We then show that our main result can be used to derive bounds on the approximation of the maximum mean discrepancy, which is the standard metric between probability distributions in the context of kernel methods. Finally, we present numerical experiments supporting our theory on both synthetic and real datasets.

The paper is organized as follows. We begin by formalizing the problem of mean embedding estimation and presenting related works in Section 2. Then, in Section 3, we introduce our Nyström estimator and detail its computational aspects. Section 4 contains our main result, a bound on the approximation error of our estimator. We discuss applications of this result in Section 5. Finally, we present the results of our numerical experiments in Section 6 and provide some perspectives in Section 7.

## 2. Mean embeddings estimation

Let $\rho$ be a probability measure with support in some space $X$. We consider the problem of approximating its kernel mean embedding

$$\int_X \varphi(x)\, d\rho(x) \qquad (1)$$

where $\varphi : X \to H$ is a feature map taking values in a Hilbert feature space $H$ with inner product $\langle \cdot, \cdot \rangle_H$ and norm $\|\cdot\|$. By abuse of notation, we denote by $\mu = \mu(\rho)$ the embedding of the data distribution. We assume that $\varphi$ is bounded so that the integral is well defined, and provide further technical details in Section 4. The choice of the feature map $\varphi$ implicitly defines a positive definite kernel function $\kappa : X \times X \to \mathbb{R}$ via the equation $\kappa(x, y) = \langle \varphi(x), \varphi(y) \rangle_H$. Strictly speaking, a kernel mean embedding corresponds to the case where $\varphi$ is the canonical feature map of the kernel, i.e. $\varphi(x) = \kappa(x, \cdot)$; however, (1) can be defined for any integrable feature map and our results hold in this general setting.

**Maximum mean discrepancy.** Mean embeddings naturally induce a semi-metric on the space of probability distributions $\mathcal{P}(X)$ known as the maximum mean discrepancy (Smola et al. 2007) and defined as $\mathrm{MMD}(\rho_1, \rho_2) := \|\mu(\rho_1) - \mu(\rho_2)\|$ for any two probability distributions $\rho_1$ and $\rho_2$. It satisfies all the properties of a metric except, in general, the definiteness, depending on whether the mean embedding $\rho \mapsto \mu(\rho)$ is injective or not (we refer the interested reader to Sriperumbudur et al. (2010) for more details). Such metrics have found applications in many contexts such as, to cite a few, two-sample testing (Borgwardt et al. 2006; Gretton et al. 2012), neural network optimization (Borgwardt et al. 2006) and generative models (Li et al. 2017; Sutherland et al. 2017). Given their wide applicability, maximum mean discrepancies are naturally one of the main motivations for better approximating mean embeddings.
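As a concrete illustration of this metric, the squared MMD between two samples expands into kernel sums over all pairs of points, since the embedding of an empirical sample is the average of the features of its points. The sketch below is our own minimal NumPy illustration of this expansion (the Gaussian kernel and the function names are assumptions, not code from the paper):

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def mmd_empirical(X1, X2, sigma=1.0):
    """MMD between the empirical embeddings of two samples X1, X2.

    ||mu(rho1_hat) - mu(rho2_hat)||^2 expands into three kernel sums,
    which is why the cost grows quadratically with the sample size.
    """
    k11 = gaussian_kernel(X1, X1, sigma).mean()
    k22 = gaussian_kernel(X2, X2, sigma).mean()
    k12 = gaussian_kernel(X1, X2, sigma).mean()
    return np.sqrt(max(k11 + k22 - 2 * k12, 0.0))

rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, size=(1000, 5))
X2 = rng.normal(0.5, 1.0, size=(1000, 5))
print(mmd_empirical(X1, X2))
```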
An interesting property of the MMD is that it is an integral probability metric (Müller 1997), a class of metrics which uses test functions to compare distributions. More precisely, we have

$$\mathrm{MMD}(\rho_1, \rho_2) = \sup_{f \in H : \|f\| \le 1} \big| \mathbb{E}_{X_1 \sim \rho_1} f(X_1) - \mathbb{E}_{X_2 \sim \rho_2} f(X_2) \big|,$$

where $H$ denotes the reproducing kernel Hilbert space associated to the chosen kernel. These two representations of the MMD allow to leverage the wide set of tools from both kernel methods and integral probability metric theories (see Sriperumbudur et al. 2009, 2012b for examples of the latter).

**Empirical estimator.** Given a collection of samples $X = \{X_1, \dots, X_n\}$ with empirical distribution $\hat\rho_n = \frac{1}{n}\sum_{1\le i\le n}\delta(X_i)$, where $\delta(\cdot)$ denotes the Dirac delta, a natural empirical estimator of (1) is given by

$$\hat\mu := \mu(\hat\rho_n) = \frac{1}{n}\sum_{i=1}^{n}\varphi(X_i), \qquad (2)$$

which approximates the true kernel mean embedding at the rate $O(1/\sqrt{n})$ in norm, as discussed in Section 2.2. The time complexity of this evaluation grows linearly with the number $n$ of samples in the dataset. When the feature map $\varphi$ is finite-dimensional, this estimator can be explicitly computed and stored, which might be reasonable when the feature map can be computed efficiently. However, when this is not the case, the whole dataset $X$ must be stored in memory, which not only becomes prohibitive when the amount of available memory is limited, but also makes the manipulation of such embeddings impractical. For instance, computing the maximum mean discrepancy between two empirical estimators computed from $n$ samples has a $\Theta(n^2)$ cost, which is not reasonable in applications where such distances must be computed repeatedly.

### 2.1. Problem

In this paper, we consider the problem of designing a new estimator $\hat\mu_m$ computed from $m$ points $\tilde X_1, \dots, \tilde X_m$, which

1. can be computed more efficiently than $\hat\mu$;
2. preserves the statistical accuracy of $\hat\mu$.

The first requirement implies that $m$ should be smaller than the actual sample size $n$, while the second requirement can be expressed more formally as

$$\|\mu - \hat\mu_m\| = O(\|\mu - \hat\mu\|), \qquad (3)$$

provided that $m$ is large enough. These two requirements induce a trade-off between statistical accuracy and computational efficiency of the desired estimator. We consider in particular the setting where the landmark points $(\tilde X_j)_{1\le j\le m}$ are randomly sampled from the dataset $X$, so that the bound (3) must hold with high probability on the draw of these points.

A related problem, which we tackle in Section 5, is the approximation of the maximum mean discrepancy. The approach that we explore consists in finding computationally efficient approximations $\hat\mu_m(\rho_1)$, $\hat\mu_m(\rho_2)$ of the kernel mean embeddings such that, with high probability, $\|\hat\mu_m(\rho_1) - \hat\mu_m(\rho_2)\|$ is close to $\mathrm{MMD}(\rho_1, \rho_2) = \|\mu(\rho_1) - \mu(\rho_2)\|$.

### 2.2. Related work

The kernel mean embedding (1) can be approximated via its empirical counterpart (2) given $n$ samples from the distribution $\rho$, yielding an approximation error decreasing in $O(n^{-1/2})$. Without any assumption on the probability distribution $\rho$, the empirical average is asymptotically efficient and minimax optimal (Van der Vaart 2000, Theorem 25.21, Example 25.24) (see also (Lopez-Paz et al. 2015, Theorem 4)), thus there is no hope to get a better rate without any extra assumption on $\rho$. Tolstikhin et al. (2017) also showed that the minimax optimal rate for estimating kernel mean embeddings defined by continuous translation-invariant kernels on $\mathbb{R}^d$ is of order $O(n^{-1/2})$ for discrete measures and measures with infinitely differentiable densities (e.g., Gaussian and exponential).
Approximation of the maximum mean discrepancy using empirical embeddings has also been studied by Sriperumbudur et al. (2012a). Muandet et al. (2014) introduced shrinkage estimators of the form $\theta f + (1-\theta)\hat\mu$, where $\theta \in (0, 1)$ and $f \in H$ is independent of the data, and showed that those estimators perform better than the empirical average under mild assumptions on the probability distribution of interest. Such shrinkage strategies are complementary to our approach in the sense that they can be combined with any estimator of $\hat\mu$. In a different direction, Lerasle et al. (2019) proposed an outlier-robust estimator of the kernel mean embedding based on the median-of-means technique.

The principal way to efficiently approximate a kernel mean embedding is to consider an estimator of the form

$$\sum_{1\le j\le m} \alpha_j \varphi(y_j) \qquad (4)$$

with $m \ll n$. Multiple strategies exist to choose the points $(y_j)_{1\le j\le m}$. For instance, Cortes et al. (2014) proposed a greedy heuristic based on a notion of incoherence. Kernel herding (Y. Chen et al. 2010) is another deterministic way to select the points, which can be interpreted as an instance of the Frank-Wolfe algorithm (Bach et al. 2012) and also has connections with generalizations of orthogonal matching pursuit used in compressive sensing (Keriven et al. 2017). Depending on the context, the weights $(\alpha_j)_{1\le j\le m}$ might be chosen uniform or directly optimized to minimize the approximation error or another related criterion. In a slightly different setting, Grünewälder et al. (2012) derive an equivalence between conditional mean embeddings and vector-valued regressors, which allows them to use vector-valued regression methods to learn such embeddings.

These techniques are closely related to the quadrature problem, which consists in finding points $(y_j)_{1\le j\le m}$ and weights $(\alpha_j)_{1\le j\le m}$ approximating the integral

$$\int f(x)\, d\rho(x) \approx \sum_{1\le j\le m} \alpha_j f(y_j).$$

In the context of kernel methods, one typically wants to find an approximation scheme minimizing this error for all functions $f$ belonging to a reproducing kernel Hilbert space $H$. In this case the approximation should not depend on the function $f$, and we have, for any $f \in H$ such that $\|f\| \le 1$, via the reproducing property,

$$\text{Quadrature Error} := \left| \int f(x)\, d\rho(x) - \sum_{1\le j\le m} \alpha_j f(y_j) \right| = \left| \Big\langle f,\ \mu - \sum_{1\le j\le m} \alpha_j \varphi(y_j) \Big\rangle_H \right| \le \Big\| \mu - \sum_{1\le j\le m} \alpha_j \varphi(y_j) \Big\|,$$

where $\mu$ is the mean embedding associated with $\rho$. This simple manipulation shows that approximate kernel mean embeddings are a way to design quadrature rules when the function $f$ to integrate is taken in a reproducing kernel Hilbert space. Kernel quadrature bounds have in particular been obtained when sampling the points $(y_j)_{1\le j\le m}$ randomly and i.i.d. (Bach 2017), which includes approximation in the reproducing kernel Hilbert space as a special case. The sampling procedure in this work is however based on leverage scores, which are not exactly computable. Although algorithms exist to approximate these scores, the existing analysis does not cover this setting and further work would be needed in this direction. Bounds on the quadrature error have also been obtained for points sampled according to a determinantal point process (Belhadji et al. 2019); however, sampling from such processes is also an expensive operation.

More generally, one can also consider mean embeddings associated to a finite-dimensional feature map $\varphi_m$ approximating the kernel, in the sense that $\langle \varphi_m(x), \varphi_m(y) \rangle \approx \kappa(x, y)$ for any $x, y$. One example would be to build $\varphi_m$ using random features (Liu et al. 2021; Rahimi et al. 2008).
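As an illustration of this last option, the sketch below is our own minimal example of a random-Fourier-feature mean embedding for the Gaussian kernel in the spirit of Rahimi et al. (2008); the data source, helper names and parameters are assumptions made for the example:

```python
import numpy as np

def rff_features(X, W, b):
    """Random Fourier features phi_m(x) = sqrt(2/m) cos(W x + b) for the Gaussian kernel."""
    m = W.shape[0]
    return np.sqrt(2.0 / m) * np.cos(X @ W.T + b)

rng = np.random.default_rng(0)
d, m, sigma = 10, 500, 1.0
W = rng.normal(0.0, 1.0 / sigma, size=(m, d))   # frequencies drawn from N(0, sigma^-2 I)
b = rng.uniform(0.0, 2 * np.pi, size=m)

# Online accumulation of the mean embedding: a single pass over the data, O(m) memory.
mean_embedding = np.zeros(m)
n = 0
for _ in range(100):                       # stream of mini-batches (placeholder data source)
    batch = rng.normal(size=(50, d))
    phi = rff_features(batch, W, b)
    n += len(batch)
    mean_embedding += (phi.sum(axis=0) - len(batch) * mean_embedding) / n
print(mean_embedding[:5])
```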
This setting is particularly interesting from an algorithmic perspective, as the mean embedding can then be computed online and efficiently stored. As such representations lie in a different space, it is in general not possible to relate them to the original mean embedding. Yet, they remain a highly practical way to manipulate distributions, and the distance between such embeddings can still be interpreted as an estimator of the maximum mean discrepancy. However, because they are generic, these methods might not adapt well to the data at hand, and we expect that our Nyström estimator might lead to a lower error for an equivalent complexity.

## 3. The Nyström estimator

As suggested above, our strategy to design an efficient algorithm is to consider an approximation of the form (4). More precisely, in our approach the points on which the approximation is built are sampled uniformly and independently from the dataset. For this reason we denote them in the sequel $\tilde X_1, \dots, \tilde X_m$, and our estimator will thus be in the space

$$H_m := \mathrm{span}\big\{ \varphi(\tilde X_1), \dots, \varphi(\tilde X_m) \big\}$$

spanned by their features. Denoting $P_m$ the orthogonal projection onto this subspace, the best estimator of the kernel mean embedding $\mu$ of (1) in this space is $P_m\mu$. As this quantity can obviously not be computed when the distribution $\rho$ is unknown, we use instead $\hat\mu_m := P_m\hat\mu$.

**Computation.** It turns out that the weights of this embedding can be computed exactly. Denoting $K_m \in \mathbb{R}^{m\times m}$ and $K_{mn} \in \mathbb{R}^{m\times n}$ the partial kernel matrices with entries $(K_m)_{ij} = \langle \varphi(\tilde X_i), \varphi(\tilde X_j)\rangle_H$ for any $1\le i, j\le m$ and $(K_{mn})_{ij} = \langle \varphi(\tilde X_i), \varphi(X_j)\rangle_H$ for any $1\le i\le m$ and $1\le j\le n$, one can indeed check (see Appendix B) that our Nyström kernel mean embedding estimator can be expressed as

$$\hat\mu_m = \sum_{1\le j\le m} \alpha_j \varphi(\tilde X_j) \quad \text{with} \quad \alpha = \frac{1}{n} K_m^{+} K_{mn}\mathbf{1}_n, \qquad (5)$$

where $\mathbf{1}_n$ denotes the $n$-dimensional vector of ones and $K_m^{+}$ the (Moore-Penrose) pseudo-inverse of $K_m$.

**Complexity.** The space complexity of the method is $\Theta(m^2 + md)$ for storing $K_m$ and the landmarks. Note that $K_{mn}$ does not need to be stored, as $K_{mn}\mathbf{1}_n$ can be computed sequentially in $\Theta(m)$ space. The time complexity is $\Theta(nm c_\kappa + m^3)$, where $c_\kappa$ corresponds to the cost of a kernel evaluation. The first term corresponds to the computation of $K_{mn}\mathbf{1}_n$, while the second corresponds to computing the pseudo-inverse of $K_m$ (numerically stable algorithms can be used instead, but the complexity will be of this order regardless). When $X = \mathbb{R}^d$, we will most often have $c_\kappa = d$.

## 4. Theoretical analysis

### 4.1. Setting and notations

We consider a probability space $(X, \mathcal{B}, \rho)$ where $X$ is a locally compact second countable topological space, $\mathcal{B}$ the Borel $\sigma$-algebra and $\rho$ the data probability distribution with support in $X$. We define the (uncentered) covariance operator $C : H \to H$ as

$$C := \int \varphi(x) \otimes_H \varphi(x)\, d\rho(x),$$

where $\varphi(x) \otimes_H \varphi(x)$ denotes the rank-one operator $(\varphi(x) \otimes_H \varphi(x))(f) = \langle f, \varphi(x)\rangle_H\, \varphi(x)$, and denote by $C_\lambda = C + \lambda I$ its regularized version, for $\lambda > 0$. We now define, for any $\lambda > 0$, the functions

$$\forall x \in X, \quad \mathcal{N}_x(\lambda) := \langle \varphi(x), C_\lambda^{-1}\varphi(x)\rangle_H, \qquad (6a)$$
$$\mathcal{N}(\lambda) := \mathbb{E}_x\, \mathcal{N}_x(\lambda) = \mathrm{tr}(C C_\lambda^{-1}), \qquad (6b)$$
$$\mathcal{N}_\infty(\lambda) := \sup_{x\in X} \mathcal{N}_x(\lambda). \qquad (6c)$$

The quantity $\mathcal{N}(\lambda)$ is known as the effective dimension, and is a measure of the interaction between the kernel (or feature map) and the data probability distribution. It is tightly linked to the notion of leverage scores and has been shown to constitute a proper measure of hardness of kernel ridge regression problems (Alaoui et al. 2015). Finally, we denote by $L(H)$ the set of bounded linear operators on $H$, endowed with the operator norm $\|\cdot\|_{L(H)}$.
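Before turning to the analysis, the following sketch illustrates how the estimator of Section 3 (Eq. (5)) can be computed in practice: the landmarks are sampled uniformly with replacement, $K_{mn}\mathbf{1}_n$ is accumulated in $\Theta(m)$ memory by streaming over the data, and the weights are obtained through the pseudo-inverse of $K_m$. This is our own NumPy illustration, not the authors' code; the Gaussian kernel and the helper names are assumptions made for the example.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def nystrom_kme(X, m, sigma=1.0, seed=None, batch=4096):
    """Nystrom kernel mean embedding: landmarks and weights alpha = K_m^+ K_mn 1_n / n."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.integers(0, n, size=m)          # uniform sampling with replacement
    landmarks = X[idx]
    K_m = gaussian_kernel(landmarks, landmarks, sigma)
    # Accumulate K_mn 1_n in O(m) memory by streaming over the dataset.
    kmn_ones = np.zeros(m)
    for start in range(0, n, batch):
        kmn_ones += gaussian_kernel(landmarks, X[start:start + batch], sigma).sum(axis=1)
    alpha = np.linalg.pinv(K_m) @ kmn_ones / n
    return landmarks, alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 10))
landmarks, alpha = nystrom_kme(X, m=200, seed=0)
print(alpha.shape)  # (200,)
```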
The main assumption we make concerns the boundedness of the feature map.

**Assumption 4.1.** There exists a positive constant $K < \infty$ such that $\sup_{x\in X} \|\varphi(x)\| \le K$.

This assumption ensures in particular that $\varphi$ is integrable for any probability distribution over $X$, and thus the kernel mean embedding (1) is well defined, interpreting the integral in (1) as a Bochner integral (Diestel et al. 1977, Chapter 2). Furthermore, it implies the two inequalities $\mathcal{N}(\lambda) \le \mathcal{N}_\infty(\lambda) \le K^2/\lambda < \infty$ for any $\lambda > 0$, the latter quantity being crucial to control the error of Nyström sampling. Finally, Assumption 4.1 implies that the operator $C$ is a positive trace-class operator on $H$, which allows to leverage tools from spectral theory. We refer the reader to Rudi et al. (2015, Assumption 3) for a more detailed discussion around a similar assumption. Assumption 4.1 is satisfied for feature maps derived from a large class of standard kernels such as, e.g., Gaussian and Laplacian radial basis function kernels on the Euclidean space $\mathbb{R}^d$. It is also satisfied for polynomial kernels on a bounded space $X$.

### 4.2. Error decomposition

In order to break down the approximation error, we introduce the quantity

$$\tilde\mu_m := \frac{1}{m}\sum_{j=1}^{m} \varphi(\tilde X_j),$$

which is an unbiased estimate of the empirical kernel mean embedding $\hat\mu$. Note that, by definition, $\tilde\mu_m$ is an element of $H_m$ and it corresponds to choosing uniform weights to average the landmarks. Furthermore, we introduce the orthogonal projection operator onto the orthogonal complement of $H_m$, $P_m^\perp := I - P_m$. Our main result relies on the following deterministic error decomposition.

**Lemma 4.1** (Error decomposition). For any $\lambda > 0$, it holds (almost surely) that

$$\|\mu - \hat\mu_m\| \le \|\mu - \hat\mu\| + \big\|P_m^\perp C_\lambda^{1/2}\big\|_{L(H)} \big\|C_\lambda^{-1/2}(\hat\mu - \tilde\mu_m)\big\|.$$

*Proof of Lemma 4.1:* We use the decomposition

$$\|\mu - \hat\mu_m\| \le \|\mu - \hat\mu\| + \|\hat\mu - \hat\mu_m\|, \qquad \hat\mu - \hat\mu_m = (I - P_m)\hat\mu = P_m^\perp(\hat\mu - \tilde\mu_m),$$

where the last equality follows from $P_m^\perp \tilde\mu_m = 0$. Hence we get

$$\|\mu - \hat\mu_m\| \le \|\mu - \hat\mu\| + \|P_m^\perp(\hat\mu - \tilde\mu_m)\| \le \|\mu - \hat\mu\| + \big\|P_m^\perp C_\lambda^{1/2}\big\|_{L(H)} \big\|C_\lambda^{-1/2}(\hat\mu - \tilde\mu_m)\big\|,$$

which concludes the proof.

Our strategy to bound the error $\|\mu - \hat\mu_m\|$ essentially consists in bounding these three terms separately using probabilistic concentration results in Hilbert spaces.

### 4.3. Main result

We now state our main result, and then specialize it using additional knowledge on the spectral properties of the covariance operator.

**Theorem 4.1.** Let Assumption 4.1 hold, and $m \ge 4$. Furthermore, assume that the data points $X_1, \dots, X_n$ are drawn i.i.d. from the distribution $\rho$ and that $m \le n$ subsamples $\tilde X_1, \dots, \tilde X_m$ are drawn uniformly with replacement from the dataset $\{X_1, \dots, X_n\}$. Then, it holds with probability at least $1-\delta$ that

$$\|\mu - \hat\mu_m\| \le \frac{c_1}{\sqrt{n}} + \frac{c_2}{m} + \frac{c_3}{m}\sqrt{\log(m/\delta)\,\mathcal{N}\!\left(\frac{12K^2\log(m/\delta)}{m}\right)}$$

provided that

$$m \ge \max\big(67,\ 12K^2\|C\|_{L(H)}^{-1}\big)\log\frac{m}{\delta},$$

where $c_1, c_2, c_3$ are constants (made explicit in the proof) of order $K\log(1/\delta)$.

A few remarks are in order regarding Theorem 4.1. First of all, denoting by $W_{-1}$ the smallest branch of the Lambert $W$ function on $]-e^{-1}, 0[$ (Weisstein 2002), the condition on the sub-sample size $m$ can also be expressed as $m \ge -W_{-1}(-\delta/c)\,c$ with $c = \max(67, 12K^2\|C\|_{L(H)}^{-1})$, and can easily be checked on a computer. Then, the bound on the error is split in three parts: the first part corresponds to the usual rate one gets when estimating the kernel mean embedding by its standard empirical counterpart (as discussed in Section 2.2), while the second part and the third part result from our Nyström approximation scheme.
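The third part is governed by the effective dimension $\mathcal{N}(\lambda)$ evaluated at $\lambda = 12K^2\log(m/\delta)/m$. For intuition, the empirical counterpart of this quantity, $\mathrm{tr}\big(\hat C_n(\hat C_n + \lambda I)^{-1}\big) = \sum_i \hat\sigma_i/(\hat\sigma_i + \lambda)$ where the $\hat\sigma_i$ are the eigenvalues of $K_n/n$, can be computed directly; the sketch below is our own illustration of this standard identity (Gaussian kernel and function names are assumptions), not part of the paper's method:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def empirical_effective_dimension(X, lam, sigma=1.0):
    """N_hat(lambda) = tr(C_n (C_n + lambda I)^-1) = sum_i s_i / (s_i + lambda),
    where the s_i are the eigenvalues of the empirical covariance operator,
    i.e. the eigenvalues of K_n / n."""
    K = gaussian_kernel(X, X, sigma)
    s = np.linalg.eigvalsh(K) / len(X)
    s = np.clip(s, 0.0, None)          # guard against tiny negative eigenvalues
    return np.sum(s / (s + lam))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
for lam in [1e-1, 1e-2, 1e-3]:
    print(lam, empirical_effective_dimension(X, lam))
```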
Note that the first two terms already illustrate (at least partially) the trade-off between computational cost and statistical performance of our estimator: a small value of $m$ (i.e., $m \ll \sqrt{n}$), while lightening the computational burden, yields a worse rate than the $O(1/\sqrt{n})$ rate; a large value of $m \gg \sqrt{n}$ does not improve the error rate, as it is governed by the first term in this case, and requires more computational and storage resources. This is expected, since our estimator does not use more information on the distribution than the empirical estimator from which it is derived. The precise trade-off should be settled by the third term. However, in its current form, it depends simultaneously on the subsample size $m$ and on the effective dimension $\mathcal{N}$. Extra assumptions about the effective dimension, i.e., about the interaction between the kernel and the probability distribution, are needed to obtain a more explicit and workable bound. To get a clearer picture, we derive in the following corollary sufficient conditions to get a $O(n^{-1/2})$ rate under various assumptions on the decay of the effective dimension.

**Corollary 4.1.** Let the assumptions of Theorem 4.1 hold. Assume furthermore that the effective dimension $\mathcal{N}$ satisfies, for some $c > 0$, either $\mathcal{N}(\lambda) \le c\lambda^{-\gamma}$ for some $\gamma \in\, ]0, 1]$, or $\mathcal{N}(\lambda) \le \log(1 + c/\lambda)/\beta$ for some $\beta > 0$. Then, choosing the subsample size $m$ to be $m = n^{1/(2-\gamma)}\log(n/\delta)$ in the first case, or $m = \sqrt{n}\log\big(\sqrt{n}\max(1/\delta, c/(6K^2))\big)$ in the second case, we obtain

$$\|\mu - \hat\mu_m\| = O\!\left(\frac{1}{\sqrt{n}}\right).$$

Let us comment on the assumptions on the decay of the effective dimension. These assumptions can be linked to the decay of the eigenvalues $(\sigma_i)_i$ of the covariance operator. First, note that the $\lambda^{-\gamma}$ rate always holds for $\gamma = 1$. More generally, it holds when the eigenvalues decay polynomially as $\sigma_i \lesssim i^{-1/\gamma}$, see e.g. Della Vecchia et al. (2021, Proposition 4). Similarly, the second rate holds for instance when the eigenvalues decay exponentially, $\sigma_i \le c\, e^{-\beta i}$, see e.g. Della Vecchia et al. (2021, Proposition 5). Corollary 4.1 shows that, under mild assumptions on the kernel and on the decay of the effective dimension, our estimator fulfills the goals set in Section 2.1: it preserves the $O(1/\sqrt{n})$ error rate of the empirical estimator while reducing computational and storage costs.

Finally, we stress that our result only holds for uniform subsampling. While other sampling strategies, and in particular sampling proportionally to the so-called leverage scores (Alaoui et al. 2015), are known to be more effective in practice (in the sense that the same error might be obtained with a comparatively smaller value of $m$), it is not clear if and how our proof can be adapted to this setting. Note that the quantity $\|\hat\mu - \hat\mu_m\|^2$ can also be bounded by $\frac{1}{n}\|K_n - K_{\mathrm{nys}}\|$, where $K_n$ and $K_{\mathrm{nys}}$ denote respectively the full kernel matrix and its Nyström approximation, and $\|\cdot\|$ the operator norm. However, the works providing bounds in this context, such as Kumar et al. (2012) and Musco et al. (2017), work in a fixed design. For this reason, such results would not directly be applicable in our setting, and it does not seem at first sight that either of the results implies the other.

## 5. Application: efficient estimation of the maximum mean discrepancy

The maximum mean discrepancy being built from kernel mean embeddings, its empirical estimation suffers as well from the computational issues discussed above.
In this section, we introduce a sample-efficient estimator of the maximum mean discrepancy based on the kernel mean embedding estimator introduced in (5), and derive a high-probability upper bound on its approximation error. We begin by stating the formal setting of this section. Let $\rho_1$ and $\rho_2$ be two probability measures whose supports are in the same space $X$ (but not necessarily identical). We assume that we are given $n_k \ge 1$ i.i.d. samples $X^k_1, \dots, X^k_{n_k}$ from $\rho_k$, for $k \in \{1, 2\}$. Let $P_k$ denote the orthogonal projection onto $H_k = \mathrm{span}\{\varphi(\tilde X^k_1), \dots, \varphi(\tilde X^k_{m_k})\}$, where the landmarks $\tilde X^k_1, \dots, \tilde X^k_{m_k}$ are independently sampled from the empirical distribution $\hat\rho_k = \frac{1}{n_k}\sum_{i=1}^{n_k}\delta(X^k_i)$.

Since the maximum mean discrepancy between two probability distributions corresponds to the norm of the difference between the mean embeddings of those distributions, our Nyström mean embedding estimator from (5) yields a natural estimator for the MMD:

$$\widehat{\mathrm{MMD}} := \big\| \hat\mu^{(\rho_1)}_{m_1} - \hat\mu^{(\rho_2)}_{m_2} \big\|, \quad \text{where, for } k \in \{1, 2\},\ \hat\mu^{(\rho_k)}_{m_k} = P_k \hat\mu(\rho_k).$$

The next result provides a high-probability upper bound on the difference (measured w.r.t. the norm of the feature space) between the true maximum mean discrepancy and our estimator,

$$\mathrm{Err}_{(n_k, m_k)_k} := \big| \mathrm{MMD}(\rho_1, \rho_2) - \widehat{\mathrm{MMD}} \big|.$$

**Theorem 5.1** (Estimation error for the approximation of the MMD). Let Assumption 4.1 hold. Furthermore, assume that for $k \in \{1, 2\}$, the data points $X^k_1, \dots, X^k_{n_k}$ are drawn i.i.d. from the distribution $\rho_k$ and that $m_k \le n_k$ sub-samples $\tilde X^k_1, \dots, \tilde X^k_{m_k}$ are drawn uniformly with replacement from the dataset $\{X^k_1, \dots, X^k_{n_k}\}$. Then, for any $\delta \in\, ]0, 1[$, it holds with probability at least $1 - 2\delta$ that

$$\mathrm{Err}_{(n_k, m_k)_k} \le \sum_{k\in\{1,2\}} \left( \frac{c_1}{\sqrt{n_k}} + \frac{c_2}{m_k} + \frac{c_3}{m_k}\sqrt{12\log(m_k/\delta)\,\mathcal{N}^{(\rho_k)}\!\left(\frac{12K^2\log(m_k/\delta)}{m_k}\right)} \right)$$

provided that, for $k \in \{1, 2\}$,

$$m_k \ge \max\big(67,\ 12K^2\|C_k\|_{L(H)}^{-1}\big)\log\frac{m_k}{\delta},$$

where $c_1 = 2K\sqrt{2\log(6/\delta)}$, $c_2 = 4\sqrt{3}K\log(12/\delta)$, $c_3 = 6K\sqrt{\log(12/\delta)}$, and $C_k$ denotes the covariance operator associated to $\rho_k$. The notation $\mathcal{N}^{(\rho_k)}$ denotes the effective dimension associated to the distribution $\rho_k$.

The high-probability upper bound provided in the above theorem is similar in spirit to that of Theorem 4.1. It follows by replicating the proof of Theorem 4.1 for each measure $\rho_1$ and $\rho_2$ separately. As in Corollary 4.1, the third term can be made more explicit by making assumptions on the effective dimensions $\mathcal{N}^{(\rho_k)}$, $k \in \{1, 2\}$, and setting the right subsample sizes $(m_k)_{k\in\{1,2\}}$. For more flexibility, we allowed the sample and subsample sizes to differ between the two probability distributions. In the case of large sample imbalance, it might be interesting to choose a larger subsample size for the smallest sample to improve the statistical error rate, since it is guided by the smallest (sub)sample sizes.

As a direction for future work, we note that, in a similar spirit to Gretton et al. (2012), it is possible to derive a two-sample test from our bound in Theorem 5.1. Designing efficient MMD-based tests remains an important research direction, and the simplicity of our approach should allow to reduce computational costs compared to state-of-the-art algorithms (Jitkrittum et al. 2016). However, this requires further developments that are outside the scope of this paper, and for this reason we do not compare our approach to works focusing on efficient testing in the following section.

## 6. Numerical experiments

In this section, we provide experimental results to illustrate our theoretical findings on the decay of the approximation error. We provide a proof of concept in a simple experimental setting; extending these results to broader families of datasets and kernel types would be an interesting direction for future work.
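In practice, the estimator $\widehat{\mathrm{MMD}}$ of Section 5, as well as the distance between any two finitely supported embeddings (e.g., comparing the empirical and Nyström estimators), reduce to the RKHS norm of a difference of two weighted feature sums, which expands into three small kernel matrices. The following sketch is our own illustration of this expansion (Gaussian kernel and function names are assumptions), not code from the paper:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def rkhs_distance(Y, alpha, Z, beta, sigma=1.0):
    """|| sum_j alpha_j phi(y_j) - sum_j beta_j phi(z_j) || via kernel expansions.

    Cost is O((m1 + m2)^2) kernel evaluations, independent of the original
    sample sizes, which is the point of the Nystrom compression.
    """
    sq_norm = (alpha @ gaussian_kernel(Y, Y, sigma) @ alpha
               + beta @ gaussian_kernel(Z, Z, sigma) @ beta
               - 2 * alpha @ gaussian_kernel(Y, Z, sigma) @ beta)
    return np.sqrt(max(sq_norm, 0.0))

# Example: distance between two embeddings with uniform weights on their landmarks.
rng = np.random.default_rng(0)
Y, Z = rng.normal(0.0, 1.0, (150, 5)), rng.normal(0.3, 1.0, (200, 5))
print(rkhs_distance(Y, np.full(150, 1 / 150), Z, np.full(200, 1 / 200)))
```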
### 6.1. Synthetic data

We first generate data according to a Gaussian mixture, i.e., we choose $X = \mathbb{R}^d$ and $\rho = \sum_{1\le i\le p} \frac{1}{p}\mathcal{N}(\mu_i, I)$, where the centers $(\mu_i)_{1\le i\le p}$ are themselves drawn according to a multivariate normal distribution $\mathcal{N}(0, 5I)$. For our experiments, we choose $d = 10$, $p = 8$. We use the Gaussian kernel $k(x, y) = \exp(-\|x - y\|_2^2/(2\sigma_k^2))$, which satisfies Assumption 4.1. To efficiently compute the approximation error of an embedding $\sum_{j=1}^m \alpha_j\varphi(x_j)$, we use the decomposition

$$\Big\| \mu - \sum_{j=1}^m \alpha_j\varphi(x_j) \Big\|^2 = \iint k(x, y)\, d\rho(x)\, d\rho(y) + \alpha^T K_X\alpha - 2\sum_{j=1}^m \alpha_j \int k(x, x_j)\, d\rho(x),$$

which follows from the definition of $\mu$ in (1), where $K_X$ denotes the kernel matrix of the $(x_j)_{1\le j\le m}$ and $\alpha = [\alpha_1, \dots, \alpha_m]$. Both integrals in this expression can be computed in closed form for the considered distribution and kernel, as detailed in Appendix G.

Figure 1 shows the obtained error for both the empirical estimator and the Nyström estimator when varying the size of the support $m$. Both estimators are computed using a sample of $n = 10^3, 10^4, 10^5$ points drawn from $\rho$, and the standard deviation $\sigma_k$ of the kernel is chosen to be the median of the inter-point distance, estimated for efficiency on a random subset of 1000 points. In this setting the data distribution is sub-Gaussian, and as we are using a Gaussian kernel the spectrum of the covariance operator decays exponentially (Widom 1963), i.e., we are in the second setting of Corollary 4.1 and we expect to need $m \approx \sqrt{n}\log(\sqrt{n})$ to achieve the same rate as the empirical estimator. This is indeed quite close to what is observed in practice. We recall that the time complexity of our method is $\Theta(nmd + m^3)$, and we also include, in the settings where $n$ is moderate, results for the greedy methods of Cortes et al. (2014) (which runs in $\Theta(nmd + m^3)$ time and $\Theta(nm)$ space), L. Chen et al. (2018) ($\Theta(nm(m+d))$) and Keriven et al. (2017) ($\Theta(nd^2m + Id^2m^3)$, where $I$ is a fixed number of iterations, using a sketch of size $md$; although this method makes use of random features, we use the recovered landmarks to build an estimate in the reproducing kernel Hilbert space). Our method performs similarly to the first two, which come with no statistical approximation guarantees, while the method of Keriven et al. (2017) yields better approximations but at a significantly higher cost. Although we only plot the error of the mean embedding estimation for conciseness, the error of the MMD estimation has exactly the same behavior.

### 6.2. Real datasets

We then perform experiments with data from the Fasttext¹ (Bojanowski et al. 2016) (English features), FMA² (Defferrard et al. 2016) (MFCC features), Intel Lab³ and Gowalla⁴ (Cho et al. 2011) datasets, which respectively consist of text features in dimension 300, audio features in dimension 20, physical measurements in dimension 5 and geolocation data in dimension 2. We use the whole dataset for FMA, resulting in N = 106574 points, while for Fasttext, Gowalla and Intel Lab we limit ourselves to N = 100000 randomly selected points. For each dataset, we consider $\rho$ to be the uniform distribution over these points, and we build the empirical estimator using a random sample of size $n = 10^4$. Results are presented in Figure 2, and display the same behavior as with synthetic data. While we don't have any particular guarantees on the spectrum's decay in this setting, it is still verified empirically that choosing $m$ of the order of $\sqrt{n}\log(\sqrt{n})$ is sufficient to obtain an error of the same order as the empirical estimator for all datasets but Fasttext. Note that we did not optimize the kernel choice for our experiments.
It could be the case that an even faster decay of the error happens for an optimized choice of kernel. The slower error decrease for the Nyström estimator on the Fasttext dataset could be explained by many factors such as, for instance, a slow decay of the covariance operator eigenvalues. This issue requires further empirical exploration, which is out of scope for this paper. On the other side, on the Intel Lab dataset it seems that the error of the Nyström estimator decays even faster, and only $m \approx 20$ features are sufficient to obtain the same error as the empirical estimator. Here again, similar results can be obtained for the maximum mean discrepancy but are omitted for conciseness.

¹ https://fasttext.cc/docs/en/english-vectors.html
² https://github.com/mdeff/fma
³ http://db.csail.mit.edu/labdata/labdata.html
⁴ https://snap.stanford.edu/data/loc-gowalla.html

Figure 1: Estimation errors $\|\mu - \hat\mu\|$ (empirical, in blue) and $\|\mu - \hat\mu_m\|$ (Nyström, in orange), as well as for the other heuristics (greedy coherence-based (Cortes 2014), greedy determinant-based (Chen 2018), continuous OMP (Keriven 2017)), for synthetic Gaussian data; a $\sqrt{n}\log(\sqrt{n})$ reference is also shown. Means (solid lines), 5-percentiles and 95-percentiles over 100 trials (shaded regions). The data distribution is fixed and identical for the 3 values of $n$.

Figure 2: Estimation errors for the empirical estimator $\|\mu - \hat\mu\|$ (in blue) and the Nyström estimator $\|\mu - \hat\mu_m\|$ (in orange) for real datasets; a $\sqrt{n}\log(\sqrt{n})$ reference is also shown. Means (solid lines), 5-percentiles and 95-percentiles over 100 trials (shaded regions). The data distribution is fixed and identical for the 3 values of $n$.

## 7. Conclusion

In this article, we introduced a simple and efficient estimator of the kernel mean embedding based on the Nyström approximation. Both theoretical and empirical results show that, in most settings, an approximation supported on the features of $m \le n$ points (typically $m \ll n$) will have the same error as that of the empirical estimator while enjoying much better computational properties and being efficiently storable. An interesting question for future work is to see whether our results can be improved with other sampling strategies: while it is well known that leverage score sampling allows in many related contexts to achieve the same error with a smaller number of landmarks, it is not clear for now how to obtain meaningful theoretical guarantees under this sampling scheme.

## Acknowledgments

L.R. acknowledges the financial support of the European Research Council (grant SLING 819789), the AFOSR project FA9550-18-1-7009 (European Office of Aerospace Research and Development), the EU H2020-MSCA-RISE project NoMADS (DLV-777826), and the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. We thank the anonymous referees for their helpful comments and suggestions.

## References

Alaoui, Ahmed and Michael W. Mahoney (2015). "Fast Randomized Kernel Ridge Regression with Statistical Guarantees". In: Advances in Neural Information Processing Systems, pp. 775–783.
Bach, Francis (2017). "On the Equivalence between Kernel Quadrature Rules and Random Feature Expansions". In: The Journal of Machine Learning Research 18.1, pp. 714–751.
Bach, Francis, Simon Lacoste-Julien, and Guillaume Obozinski (2012). "On the Equivalence between Herding and Conditional Gradient Algorithms". In: Proceedings of the 29th International Conference on Machine Learning. ICML '12. Madison, WI, USA: Omnipress, pp. 1355–1362. ISBN: 978-1-4503-1285-1.
Balog, Matej, Ilya Tolstikhin, and Bernhard Schölkopf (2018). "Differentially Private Database Release via Kernel Mean Embeddings". In: Proceedings of the 35th International Conference on Machine Learning. Ed. by Jennifer Dy and Andreas Krause. Vol. 80. Proceedings of Machine Learning Research. PMLR, pp. 414–422.
Belhadji, Ayoub, Rémi Bardenet, and Pierre Chainais (2019). "Kernel Quadrature with DPPs". In: Advances in Neural Information Processing Systems. Vol. 32, pp. 12927–12937. arXiv: 1906.07832 [cs, stat].
Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov (2016). "Enriching Word Vectors with Subword Information". arXiv: 1607.04606.
Borgwardt, Karsten M., Arthur Gretton, Malte J. Rasch, Hans-Peter Kriegel, Bernhard Schölkopf, and Alexander J. Smola (2006). "Integrating structured biological data by Kernel Maximum Mean Discrepancy". In: Proceedings of the 14th International Conference on Intelligent Systems for Molecular Biology 2006, Fortaleza, Brazil, August 6-10, 2006, pp. 49–57.
Chatalic, Antoine, Vincent Schellekens, Florimond Houssiau, Yves-Alexandre De Montjoye, Laurent Jacques, and Rémi Gribonval (2021). "Compressive Learning with Privacy Guarantees". In: Information and Inference: A Journal of the IMA (iaab005). ISSN: 2049-8772.
Chen, Laming, Guoxin Zhang, and Eric Zhou (2018). "Fast Greedy MAP Inference for Determinantal Point Process to Improve Recommendation Diversity". In: Advances in Neural Information Processing Systems 31, pp. 5627–5638.
Chen, Yutian, Max Welling, and Alex Smola (2010). "Super-Samples from Kernel Herding". In: Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence. UAI '10. Arlington, Virginia, USA: AUAI Press, pp. 109–116. ISBN: 978-0-9749039-6-5.
Cho, Eunjoon, Seth A. Myers, and Jure Leskovec (2011). "Friendship and Mobility: User Movement in Location-Based Social Networks". In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1082–1090.
Cortes, Efren Cruz and Clayton Scott (2014). "Scalable Sparse Approximation of a Sample Mean". In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Florence, Italy: IEEE, pp. 5237–5241. ISBN: 978-1-4799-2893-4.
Defferrard, Michaël, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson (2016). "FMA: A Dataset for Music Analysis". In: ISMIR. arXiv: 1612.01840.
Della Vecchia, Andrea, Jaouad Mourtada, Ernesto De Vito, and Lorenzo Rosasco (2021). "Regularized ERM on Random Subspaces". arXiv: 2006.10016 [cs, stat].
Diestel, Joseph and John Jerry Uhl (1977). Vector Measures. Mathematical Surveys and Monographs 15. American Mathematical Soc. ISBN: 0-8218-1515-6, 978-0-8218-1515-1.
Gretton, Arthur, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander J. Smola (2012). "A Kernel Two-Sample Test". In: J. Mach. Learn. Res. 13, pp. 723–773.
Grünewälder, Steffen, Guy Lever, Luca Baldassarre, Sam Patterson, Arthur Gretton, and Massimiliano Pontil (2012). "Conditional Mean Embeddings as Regressors - Supplementary". arXiv: 1205.4656 [cs, stat].
Hayati, Saeed, Kenji Fukumizu, and Afshin Parvardeh (2020). "Kernel Mean Embedding of Probability Measures and its Applications to Functional Data Analysis". In: arXiv preprint arXiv:2011.02315.
Jitkrittum, Wittawat, Zoltán Szabó, Kacper P. Chwialkowski, and Arthur Gretton (2016). "Interpretable Distribution Features with Maximum Testing Power". In: Advances in Neural Information Processing Systems. Vol. 29. Curran Associates, Inc.
Keriven, Nicolas, Nicolas Tremblay, Yann Traonmilin, and Rémi Gribonval (2017). "Compressive K-means". In: International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Kim, Kee-Eung and Hyun Soo Park (2018). "Imitation Learning via Kernel Mean Embedding". In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018. Ed. by Sheila A. McIlraith and Kilian Q. Weinberger. AAAI Press, pp. 3415–3422.
Kumar, Sanjiv, Mehryar Mohri, and Ameet Talwalkar (2012). "Sampling Methods for the Nyström Method". In: The Journal of Machine Learning Research 13.1, pp. 981–1006.
Laub, Alan J. (2004). Matrix Analysis for Scientists and Engineers. SIAM: Society for Industrial and Applied Mathematics. ISBN: 978-0-89871-576-7.
Lerasle, Matthieu, Zoltán Szabó, Timothée Mathieu, and Guillaume Lecué (2019). "MONK – Outlier-Robust Mean Embedding Estimation by Median-of-Means". In: Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA. Ed. by Kamalika Chaudhuri and Ruslan Salakhutdinov. Vol. 97. Proceedings of Machine Learning Research. PMLR, pp. 3782–3793.
Li, Chun-Liang, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos (2017). "MMD GAN: Towards Deeper Understanding of Moment Matching Network". In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. Ed. by Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, pp. 2203–2213.
Liu, Fanghui, Xiaolin Huang, Yudong Chen, and Johan A. K. Suykens (2021). "Random Features for Kernel Approximation: A Survey on Algorithms, Theory, and Beyond". arXiv: 2004.11154 [cs, stat].
Lopez-Paz, David, Krikamol Muandet, Bernhard Schölkopf, and Ilya Tolstikhin (2015). "Towards a Learning Theory of Cause-Effect Inference". In: Proceedings of the 32nd International Conference on Machine Learning. Vol. PMLR 37, pp. 1452–1461.
Muandet, Krikamol, Kenji Fukumizu, Francesco Dinuzzo, and Bernhard Schölkopf (2012). "Learning from Distributions via Support Measure Machines". In: Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, Lake Tahoe, Nevada, United States. Ed. by Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger, pp. 10–18.
Muandet, Krikamol, Kenji Fukumizu, Bharath Sriperumbudur, Arthur Gretton, and Bernhard Schölkopf (2014). "Kernel Mean Estimation and Stein Effect". In: Proceedings of the 31st International Conference on Machine Learning. PMLR, pp. 10–18.
Muandet, Krikamol, Kenji Fukumizu, Bharath K. Sriperumbudur, and Bernhard Schölkopf (2017). "Kernel Mean Embedding of Distributions: A Review and Beyond". In: Found. Trends Mach. Learn. 10.1-2, pp. 1–141.
Müller, Alfred (1997). "Integral probability metrics and their generating classes of functions". In: Advances in Applied Probability 29.2, pp. 429–443.
Musco, Cameron and Christopher Musco (2017). "Recursive Sampling for the Nyström Method". In: Advances in Neural Information Processing Systems, pp. 3833–3845.
Park, Mijung, Wittawat Jitkrittum, and Dino Sejdinovic (2016). "K2-ABC: Approximate Bayesian Computation with Kernel Embeddings". In: Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016. Ed. by Arthur Gretton and Christian C. Robert. Vol. 51. JMLR Workshop and Conference Proceedings. JMLR.org, pp. 398–407.
Petersen, Kaare Brandt, Michael Syskind Pedersen, et al. (2008). "The Matrix Cookbook". In: Technical University of Denmark 7.15, p. 510.
Pinelis, Iosif (1994). "Optimum Bounds for the Distributions of Martingales in Banach Spaces". In: The Annals of Probability, pp. 1679–1706.
Rahimi, Ali and Benjamin Recht (2008). "Random Features for Large-Scale Kernel Machines". In: Advances in Neural Information Processing Systems, pp. 1177–1184.
Rudi, Alessandro, Raffaello Camoriano, and Lorenzo Rosasco (2015). "Less Is More: Nyström Computational Regularization". In: Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1. NIPS'15. Cambridge, MA, USA: MIT Press, pp. 1657–1665.
Shahrampour, Shahin and Soheil Kolouri (2019). "On Sampling Random Features from Empirical Leverage Scores: Implementation and Theoretical Guarantees". In: arXiv preprint arXiv:1903.08329.
Smola, Alexander J., Arthur Gretton, Le Song, and Bernhard Schölkopf (2007). "A Hilbert Space Embedding for Distributions". In: Algorithmic Learning Theory, 18th International Conference, ALT 2007, Sendai, Japan, October 1-4, 2007, Proceedings. Ed. by Marcus Hutter, Rocco A. Servedio, and Eiji Takimoto. Vol. 4754. Lecture Notes in Computer Science. Springer, pp. 13–31.
Song, Le, Kenji Fukumizu, and Arthur Gretton (2013). "Kernel Embeddings of Conditional Distributions: A Unified Kernel Framework for Nonparametric Inference in Graphical Models". In: IEEE Signal Process. Mag. 30.4, pp. 98–111.
Sriperumbudur, Bharath K., Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert R. G. Lanckriet (2009). "On integral probability metrics, φ-divergences and binary classification". In: arXiv preprint arXiv:0901.2698.
Sriperumbudur, Bharath K., Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert R. G. Lanckriet (2012a). "On the Empirical Estimation of Integral Probability Metrics". In: Electronic Journal of Statistics 6, pp. 1550–1599.
Sriperumbudur, Bharath K., Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert R. G. Lanckriet (2012b). "On the empirical estimation of integral probability metrics". In: Electronic Journal of Statistics 6, pp. 1550–1599.
Sriperumbudur, Bharath K., Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert R. G. Lanckriet (2010). "Hilbert Space Embeddings and Metrics on Probability Measures". In: Journal of Machine Learning Research 11, pp. 1517–1561. ISSN: 1533-7928.
Sutherland, Danica J., Hsiao-Yu Tung, Heiko Strathmann, Soumyajit De, Aaditya Ramdas, Alexander J. Smola, and Arthur Gretton (2017). "Generative Models and Model Criticism via Optimized Maximum Mean Discrepancy". In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.
Tolstikhin, Ilya, Bharath K. Sriperumbudur, and Krikamol Muandet (2017). "Minimax Estimation of Kernel Mean Embeddings". In: The Journal of Machine Learning Research 18.1, pp. 3002–3048.
Van der Vaart, Aad W. (2000). Asymptotic Statistics. Vol. 3. Cambridge University Press.
Weisstein, Eric W. (2002). "Lambert W-Function". In: https://mathworld.wolfram.com/.
Widom, Harold (1963). "Asymptotic Behavior of the Eigenvalues of Certain Integral Equations". In: Transactions of the American Mathematical Society 109.2, pp. 278–295.
Yurinsky, Vadim (1995). Sums and Gaussian Vectors. 1st ed. Lecture Notes in Mathematics 1617. Springer-Verlag Berlin Heidelberg. ISBN: 978-3-540-60311-5.
Zhang, Kun, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang (2013). "Domain Adaptation under Target and Conditional Shift". In: Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013. Vol. 28. JMLR Workshop and Conference Proceedings. JMLR.org, pp. 819–827.
Zou, Shaofeng, Yingbin Liang, H. Vincent Poor, and Xinghua Shi (2014). "Nonparametric Detection of Anomalous Data via Kernel Mean Embedding". In: CoRR abs/1405.2294. arXiv: 1405.2294.

## Structure of the Appendix

We begin by introducing additional notations in Appendix A. Then, in Appendix B, we prove the claim made in Eq. (5). Appendix C contains the proof of Theorem 4.1 as well as its corollaries. Appendix D contains the proof of Theorem 5.1. We gather in Appendix E the concentration results that our proof of Theorem 4.1 relies on. We recall in Appendix F a key result for Nyström subsampling. Finally, we detail some computations used in the numerical experiments in Appendix G.

## A. Additional notations

We define the empirical covariance operator as

$$\hat C_n := \frac{1}{n}\sum_{i=1}^{n} \varphi(X_i) \otimes_H \varphi(X_i).$$

For any operator $Q : H \to H$ and any real number $\lambda > 0$, we denote by $Q_\lambda : H \to H$ the regularized operator $Q_\lambda = Q + \lambda I$. We denote the (Moore-Penrose) pseudo-inverse of an operator $A$ by $A^+$. Given a random variable $X$, we write $\operatorname{ess\,sup} X$ to denote its essential supremum. $\mathbf{1}_n \in \mathbb{R}^n$ denotes the $n$-dimensional vector of ones.

## B. Derivation of the weights

This section provides a proof of the claim in Eq. (5). For ease of exposition, let us introduce the operators

$$\Phi_m : \mathbb{R}^m \to H_m,\ \alpha \mapsto \sum_{j=1}^m \alpha_j\varphi(\tilde X_j), \qquad \Phi_n : \mathbb{R}^n \to H,\ \alpha \mapsto \sum_{i=1}^n \alpha_i\varphi(X_i).$$

Since, by definition, $\hat\mu_m$ is the orthogonal projection of $\hat\mu$ onto the space $H_m$, it can be expressed as $\hat\mu_m = \Phi_m\alpha$, where the weights $\alpha \in \mathbb{R}^m$ minimize the mapping $\beta \mapsto \|\hat\mu - \Phi_m\beta\|^2$. Setting the gradient of this mapping to zero, we obtain that $\alpha$ must satisfy $\Phi_m^*\Phi_m\alpha = \Phi_m^*\hat\mu$. The minimum-norm solution of the above equation is given by $\alpha = (\Phi_m^*\Phi_m)^+\Phi_m^*\hat\mu$ (Laub 2004). Noting that the empirical embedding $\hat\mu$ can be expressed as $\hat\mu = \frac{1}{n}\Phi_n\mathbf{1}_n$ and using the fact that $\Phi_m^*\Phi_m = K_m$ and $\Phi_m^*\Phi_n = K_{mn}$, we obtain the claimed equality

$$\alpha = K_m^+\Phi_m^*\Big(\tfrac{1}{n}\Phi_n\mathbf{1}_n\Big) = \frac{1}{n}K_m^+ K_{mn}\mathbf{1}_n.$$

## C. Proof of the main result

### C.1. Generic bound

Theorem 4.1 is a consequence of a generic bound which we state now.

**Lemma C.1.** Let Assumption 4.1 hold. Furthermore, assume that the data points $X_1, \dots, X_n$ are drawn i.i.d. from the distribution $\rho$ and that $m \le n$ sub-samples $\tilde X_1, \dots, \tilde X_m$ are drawn uniformly with replacement from the dataset $\{X_1, \dots, X_n\}$. Then, for any $\lambda \in\, ]0, \|C\|_{L(H)}]$ and $\delta \in\, ]0, 1[$, with probability at least $1-\delta$,

$$\|\mu - \hat\mu_m\| \le 2K\sqrt{\frac{2\log(6/\delta)}{n}} + 4\sqrt{3\lambda\,\mathcal N_\infty(\lambda)}\,\frac{\log(12/\delta)}{m} + 6\sqrt{\frac{\lambda\,\mathcal N(\lambda)\log(12/\delta)}{m}},$$

provided that

$$m \ge \max\big(67,\ 5\mathcal N_\infty(\lambda)\big)\log\frac{12K^2}{\lambda\delta} \qquad \text{and} \qquad \lambda n \ge 12K^2\log(4/\delta).$$

*Proof of Lemma C.1:* Let $\delta \in (0, 1)$ be the desired confidence level.
Let $\lambda > 0$, $m \in \mathbb{N}$ and $n \in \mathbb{N}$ satisfy the conditions of the lemma. Using the error decomposition of Lemma 4.1, we get

$$\|\mu - \hat\mu_m\| \le \|\mu - \hat\mu\| + \big\|P_m^\perp C_\lambda^{1/2}\big\|_{L(H)}\big\|C_\lambda^{-1/2}(\hat\mu - \tilde\mu_m)\big\|.$$

Controlling the first term amounts to measuring the concentration of an empirical mean around its true mean in a Hilbert space. Multiple variants of such results can be found in the literature (see, e.g., (Pinelis 1994)). We apply here Lemma E.1 to the random variables $\eta_i := \varphi(X_i) - \mu$, $1 \le i \le n$. Note that they are indeed bounded since, for any index $1 \le i \le n$, $\|\eta_i\| \le 2\sup_{x\in X}\|\varphi(x)\| = 2K$. Thus, it holds with probability at least $1 - \delta/3$ on the draw of the dataset $X_1, \dots, X_n$ that

$$\|\mu - \hat\mu\| \le 2K\sqrt{\frac{2\log(6/\delta)}{n}}.$$

Next, we rely on Lemma F.1 to bound the term $\|P_m^\perp C_\lambda^{1/2}\|_{L(H)}$ with high probability. Since the Nyström landmarks are uniformly drawn and $m \ge \max(67, 5\mathcal N_\infty(\lambda))\log\frac{12K^2}{\lambda\delta}$, we have, for any $\lambda > 0$, with probability at least $1 - \delta/3$ on the draw of the landmarks $\tilde X_1, \dots, \tilde X_m$,

$$\big\|P_m^\perp C_\lambda^{1/2}\big\|_{L(H)} \le \sqrt{3\lambda}.$$

Finally, the last term can be bounded using Lemma E.5, which implies that, since $\lambda$ satisfies $0 < \lambda \le \|C\|_{L(H)}$ and $\lambda n \ge 12K^2\log(4/\delta)$, it holds with probability at least $1 - \delta/3$ that

$$\big\|C_\lambda^{-1/2}(\hat\mu - \tilde\mu_m)\big\| \le \frac{4\sqrt{\mathcal N_\infty(\lambda)}\log(12/\delta)}{m} + \sqrt{\frac{12\mathcal N(\lambda)\log(12/\delta)}{m}}.$$

Taking the union bound over the three events yields the desired result: with probability at least $1-\delta$ (over all sources of randomness), it holds that

$$\|\mu - \hat\mu_m\| \le 2K\sqrt{\frac{2\log(6/\delta)}{n}} + \sqrt{3\lambda}\left(\frac{4\sqrt{\mathcal N_\infty(\lambda)}\log(12/\delta)}{m} + \sqrt{\frac{12\mathcal N(\lambda)\log(12/\delta)}{m}}\right).$$

### C.2. Proof of Theorem 4.1

*Proof of Theorem 4.1:* Assuming that our choice of $m$ and $\lambda$ satisfies the constraints

$$m \ge \max\big(67, 5\mathcal N_\infty(\lambda)\big)\log\frac{12K^2}{\lambda\delta}, \qquad \lambda n \ge 12K^2\log(4/\delta), \qquad 0 < \lambda \le \|C\|_{L(H)}, \qquad (7)$$

we can apply Lemma C.1 and use the fact that $\mathcal N_\infty(\lambda) \le K^2/\lambda$ to get

$$\|\mu - \hat\mu_m\| \le 2K\sqrt{\frac{2\log(6/\delta)}{n}} + \frac{4\sqrt{3}K\log(12/\delta)}{m} + 6\sqrt{\frac{\lambda\,\mathcal N(\lambda)\log(12/\delta)}{m}}.$$

Setting $\lambda = \frac{12K^2\log(m/\delta)}{m}$, we obtain the claimed result with constants $c_1 = 2K\sqrt{2\log(6/\delta)}$, $c_2 = 4\sqrt{3}K\log(12/\delta)$, and $c_3 = 12\sqrt{3\log(12/\delta)}\,K$.

Let us now check that our choices are consistent with the constraints. We will also obtain a more user-friendly expression for the constraints and express the sub-sample size $m$ as a function of the sample size $n$. Using the fact that $\mathcal N_\infty(\lambda) \le K^2/\lambda$, one can easily check that a sufficient set of conditions to satisfy (7) is given by

$$m \ge 67\log\frac{m}{\delta\log(m/\delta)}, \qquad m \ge \frac{5m}{12\log(m/\delta)}\log\frac{m}{\delta\log(m/\delta)}, \qquad n\log(m/\delta) \ge m\log(4/\delta), \qquad m \ge 12K^2\|C\|_{L(H)}^{-1}\log\frac{m}{\delta}.$$

As $m \le n$, the third condition is satisfied as soon as $m \ge 4$. Moreover, with this choice of $m$, we have $\log(m/\delta) > 1$, hence the second constraint always holds, and we are left with

$$m \ge \max\big(67, 12K^2\|C\|_{L(H)}^{-1}\big)\log\frac{m}{\delta}.$$

Note that this can be rewritten $y - c\log(y) \ge 0$ with $y = m/\delta$ and $c = \max(67, 12K^2\|C\|_{L(H)}^{-1})/\delta$. By looking at asymptotic rates, it is clear that it is always possible to satisfy this inequality by taking $m$ large enough. To find the threshold, we look for real solutions of $y - c\log(y) = 0$, whose largest solution is $y = -cW_{-1}(-1/c)$, where $W_{-1}$ denotes the smallest branch of the Lambert $W$ function. Since $\delta < \max(67, 12K^2\|C\|_{L(H)}^{-1})/e$ always holds, $-1/c$ indeed belongs to $]-e^{-1}, 0[$, and the previous inequality is therefore satisfied when $m \ge -W_{-1}(-\delta/\tilde c)\,\tilde c$ with $\tilde c = c\delta = \max(67, 12K^2\|C\|_{L(H)}^{-1})$. This concludes the proof.

*Proof of Corollary 4.1:* Under the conditions of Theorem 4.1, the latter gives

$$\|\mu - \hat\mu_m\| \le \frac{c_1}{\sqrt n} + \frac{c_2}{m} + \frac{c_3}{m}\sqrt{\log(m/\delta)\,\mathcal N\!\left(\frac{12K^2\log(m/\delta)}{m}\right)}. \qquad (8)$$

We split the proof of the corollary in two parts, one for each family of assumptions on the effective dimension $\mathcal N$ (polynomial and logarithmic growth).

**Polynomial growth assumption:** $\mathcal N(\lambda) \le c\lambda^{-\gamma}$. Setting $m = n^{1/(2-\gamma)}\log(n/\delta)$, we get

$$\|\mu - \hat\mu_m\| \le \frac{c_1}{\sqrt n} + \frac{c_2}{m} + \frac{c_3\sqrt{c}}{(12K^2)^{\gamma/2}}\,\frac{\log(m/\delta)^{(1-\gamma)/2}}{m^{1-\gamma/2}} \le \frac{1}{\sqrt n}\left(c_1 + \frac{c_2\,n^{1/2}}{\log(n/\delta)\,n^{1/(2-\gamma)}} + \frac{c_3\sqrt{c}\,(12K^2)^{-\gamma/2}}{\log(n/\delta)^{1/2}}\right) = O\!\left(\frac{1}{\sqrt n}\right).$$

**Logarithmic growth assumption:** $\mathcal N(\lambda) \le \log(1 + c/\lambda)/\beta$.
Since $m \ge \frac{12K^2\log(m/\delta)}{c}$, using the fact that $\log(1+x) \le \log(2x)$ for $x \ge 1$ we get

$$\frac{1}{m}\sqrt{\frac{\log(m/\delta)}{\beta}\log\!\left(1 + \frac{cm}{12K^2\log(m/\delta)}\right)} \le \frac{1}{m}\sqrt{\frac{\log(m/\delta)}{\beta}\log\frac{cm}{6K^2\log(m/\delta)}} \le \frac{1}{m\sqrt{\beta}}\log\big(m\max(1/\delta, c/(6K^2))\big).$$

One can easily check that the choice $m = \sqrt n\log\big(\sqrt n\max(1/\delta, c/(6K^2))\big)$ yields the bound

$$\frac{1}{m}\sqrt{\frac{\log(m/\delta)}{\beta}\log\!\left(1 + \frac{cm}{12K^2\log(m/\delta)}\right)} \le \frac{2}{\sqrt{\beta n}}.$$

Plugging the latter bound into (8), we obtain

$$\|\mu - \hat\mu_m\| \le \frac{c_1}{\sqrt n} + \frac{c_2}{\sqrt n\log\big(\sqrt n\max(1/\delta, c/(6K^2))\big)} + \frac{2c_3}{\sqrt{\beta n}} = O\!\left(\frac{1}{\sqrt n}\right).$$

This concludes the proof.

## D. Maximum Mean Discrepancy

**Theorem D.1** (MMD). Let Assumption 4.1 hold. Furthermore, assume that for $k \in \{1, 2\}$, the data points $X^k_1, \dots, X^k_{n_k}$ are drawn i.i.d. from the distribution $\rho_k$ and that $m_k \le n_k$ sub-samples $\tilde X^k_1, \dots, \tilde X^k_{m_k}$ are drawn uniformly with replacement from the dataset $\{X^k_1, \dots, X^k_{n_k}\}$. Then, for any $\lambda_k \in\, ]0, \|C_k\|_{L(H)}]$ and $\delta \in\, ]0, 1[$, with probability at least $1-2\delta$,

$$\big|\mathrm{MMD}(\rho_1, \rho_2) - \widehat{\mathrm{MMD}}\big| \le \sum_{k\in\{1,2\}}\left( 2K\sqrt{\frac{2\log(6/\delta)}{n_k}} + 4\sqrt{3\lambda_k\,\mathcal N_\infty^{(\rho_k)}(\lambda_k)}\,\frac{\log(12/\delta)}{m_k} + 6\sqrt{\frac{\lambda_k\,\mathcal N^{(\rho_k)}(\lambda_k)\log(12/\delta)}{m_k}} \right),$$

provided that, for $k \in \{1, 2\}$,

$$m_k \ge \max\big(67, 5\mathcal N_\infty^{(\rho_k)}(\lambda_k)\big)\log\frac{12K^2}{\lambda_k\delta} \qquad \text{and} \qquad \lambda_k n_k \ge 12K^2\log(4/\delta).$$

*Proof of Theorem D.1:* By definition of $\widehat{\mathrm{MMD}}$, a reverse triangle inequality followed by a triangle inequality yields

$$\big|\mathrm{MMD}(\rho_1, \rho_2) - \widehat{\mathrm{MMD}}\big| = \Big|\,\|\mu(\rho_1) - \mu(\rho_2)\| - \|P_1\hat\mu(\rho_1) - P_2\hat\mu(\rho_2)\|\,\Big| \le \big\|\mu(\rho_1) - \mu(\rho_2) - P_1\hat\mu(\rho_1) + P_2\hat\mu(\rho_2)\big\| \le \sum_{k\in\{1,2\}}\big\|\mu(\rho_k) - P_k\hat\mu(\rho_k)\big\|.$$

We now apply Lemma C.1 twice, and conclude via a union bound.

## E. Concentration inequalities

This section contains the concentration results that we rely on to prove our main result. The first lemma provides a high-probability control on the norm of the average of centered, bounded random variables taking values in a separable Hilbert space.

**Lemma E.1.** Let $X_1, \dots, X_n$ be i.i.d. centered random variables on a separable Hilbert space $(X, \|\cdot\|)$ such that $\sup_{i=1,\dots,n}\|X_i\| \le A$ almost surely, for some real number $A > 0$. Then, for any $\delta \in (0, 1)$, it holds with probability at least $1-\delta$ that

$$\Big\|\frac1n\sum_{i=1}^n X_i\Big\| \le A\sqrt{\frac{2\log(2/\delta)}{n}}.$$

The proof of Lemma E.1 relies on (Pinelis 1994, Theorem 3.5), which we recall now for clarity of exposition.

**Lemma E.2.** Let $M = (M_i)_{i\in\mathbb{N}}$ be a martingale on a $(2, D)$-smooth separable Banach space $(X, \|\cdot\|)$ such that

$$\sum_{j\ge 1} \operatorname{ess\,sup}\|M_j - M_{j-1}\|^2 \le b^2,$$

for some real number $b > 0$. Then, for all $r \ge 0$,

$$\Pr\Big(\sup_{j\in\mathbb{N}}\|M_j\| \ge r\Big) \le 2\exp\left(-\frac{r^2}{2D^2b^2}\right).$$

We now prove Lemma E.1.

*Proof of Lemma E.1:* Since $X$ is a Hilbert space, it is 2-smooth with 2-smoothness constant $D = 1$. We define the martingale $(M_k)_{k\in\mathbb{N}}$ as $M_0 = 0$, $M_k = \sum_{1\le i\le k} X_i$ for $1 \le k \le n$ and $M_k = M_n$ for $k \ge n$, so that

$$d_k := M_k - M_{k-1} = \begin{cases} X_k & \text{if } 1 \le k \le n, \\ 0 & \text{otherwise.} \end{cases}$$

As a consequence, we have $\sum_{j\ge 1}\operatorname{ess\,sup}\|d_j\|^2 = \sum_{j=1}^n \operatorname{ess\,sup}\|X_j\|^2 \le nA^2$. Applying Pinelis' inequality (Lemma E.2) with $b^2 = nA^2$ yields

$$\Pr\Big(\Big\|\frac1n\sum_{i=1}^n X_i\Big\| > \epsilon\Big) = \Pr\big(\|M_n\| > n\epsilon\big) \le 2\exp\left(-\frac{n\epsilon^2}{2A^2}\right).$$

We get the desired result by choosing $\epsilon = A\sqrt{2\log(2/\delta)}\,n^{-1/2}$.

The next result is a Bernstein-type inequality for random vectors defined in a Hilbert space.

**Lemma E.3** (Bernstein inequality for Hilbert space-valued random vectors). Let $X_1, \dots, X_n$ be i.i.d. random variables in a Hilbert space $(H, \|\cdot\|)$ such that $\mathbb{E}X_i = \mu$ for all $i \in [n]$, and such that there exist $\sigma > 0$ and $H > 0$ satisfying, for all $i \in [n]$ and all $p \ge 2$,

$$\mathbb{E}\|X_i - \mu\|^p \le \tfrac12\, p!\, \sigma^2 H^{p-2}.$$

Then, for any $\delta \in\, ]0, 1[$, we have with probability at least $1-\delta$,

$$\Big\|\frac1n\sum_{i=1}^n X_i - \mu\Big\| \le \frac{2H\log(2/\delta)}{n} + \sqrt{\frac{2\sigma^2\log(2/\delta)}{n}}.$$

*Proof of Lemma E.3:* Fix a confidence level $\delta \in (0, 1)$. Applying (Yurinsky 1995, Theorem 3.3.4) to the i.i.d. centered random variables $\xi_i = X_i - \mu$ with $B^2 = \sigma^2 n$, we get

$$\Pr\left(\max_{1\le k\le n}\Big\|\sum_{i=1}^k \xi_i\Big\| > tn\right) \le 2\exp\left(-\frac{t^2n^2}{2B^2 + 2tHn}\right).$$

The RHS of the above is smaller than $\delta$ if and only if $t^2n^2 - t\big(2Hn\log(2/\delta)\big) - 2B^2\log(2/\delta) \ge 0$.
Denoting $\Delta = 4H^2n^2\log(2/\delta)^2 + 8n^2B^2\log(2/\delta) > 0$, this holds in particular if

$$t \ge \frac{2Hn\log(2/\delta) + \sqrt{\Delta}}{2n^2},$$

and thus a fortiori (using $\sqrt{\Delta} \le \sqrt{4H^2n^2\log(2/\delta)^2} + \sqrt{8n^2B^2\log(2/\delta)}$) when

$$t \ge \frac{2H\log(2/\delta)}{n} + \sqrt{\frac{2\sigma^2\log(2/\delta)}{n}}.$$

The following lemma provides a Bernstein-type bound for the empirical mean of Hilbert space-valued centered random variables whitened by a regularized linear operator.

**Lemma E.4.** Let $X_1, \dots, X_n$ be i.i.d. random variables taking values in a separable Hilbert space $(H, \langle\cdot,\cdot\rangle)$ with associated norm $\|\cdot\|$. We denote their mean by $\mu_X := \mathbb{E} X_1$ and their covariance by $C := \mathbb{E}[X_1 \otimes X_1]$. Let $Q : H \to H$ be a positive linear operator. For any $\lambda > 0$ and $\delta \in\, ]0, 1[$, it holds with probability at least $1-\delta$ that

$$\Big\| Q_\lambda^{-1/2}\Big(\frac1n\sum_{i=1}^n X_i - \mu_X\Big) \Big\| \le \frac{4\operatorname{ess\,sup}\|Q_\lambda^{-1/2}X_1\|\log(2/\delta)}{n} + \sqrt{\frac{4\,\mathrm{tr}(Q_\lambda^{-1}C)\log(2/\delta)}{n}}.$$

*Proof of Lemma E.4:* To prove the stated result we will apply Lemma E.3 to the random variables $(\zeta_i)_{1\le i\le n}$ defined by $\zeta_i = Q_\lambda^{-1/2}X_i$. Let $\mathcal N_Q(\lambda) := \mathrm{tr}(Q_\lambda^{-1}C)$ and $\mathcal N_{Q,\infty}(\lambda) := \operatorname{ess\,sup}\|Q_\lambda^{-1/2}X_1\|^2$. For any index $1 \le i \le n$, we have $\mathbb{E}\zeta_1 = Q_\lambda^{-1/2}\mu_X$,

$$\operatorname{ess\,sup}\|\zeta_i - \mathbb{E}[\zeta_i]\| \le 2\operatorname{ess\,sup}\|\zeta_i\| = 2\mathcal N_{Q,\infty}(\lambda)^{1/2},$$

$$\mathbb{E}\|\zeta_i - \mathbb{E}[\zeta_i]\|^2 = \mathrm{tr}\big(\mathbb{E}\,(\zeta_i - \mathbb{E}[\zeta_i]) \otimes (\zeta_i - \mathbb{E}[\zeta_i])\big) = \mathrm{tr}\big(\mathbb{E}[\zeta_i\otimes\zeta_i] - \mathbb{E}\zeta_i\otimes\mathbb{E}\zeta_i\big) \le \mathrm{tr}(\mathbb{E}[\zeta_i\otimes\zeta_i]) = \mathrm{tr}(Q_\lambda^{-1/2}CQ_\lambda^{-1/2}) = \mathcal N_Q(\lambda).$$

Moreover, for any $p \ge 2$,

$$\mathbb{E}\|\zeta_i - \mathbb{E}[\zeta_i]\|^p \le \big(\mathbb{E}\|\zeta_i - \mathbb{E}[\zeta_i]\|^2\big)\big(\operatorname{ess\,sup}\|\zeta_i - \mathbb{E}[\zeta_i]\|\big)^{p-2} \le \tfrac12\big(2\mathcal N_Q(\lambda)\big)\big(2\mathcal N_{Q,\infty}(\lambda)^{1/2}\big)^{p-2} \le \tfrac12\, p!\,\big(2\mathcal N_Q(\lambda)\big)\big(2\mathcal N_{Q,\infty}(\lambda)^{1/2}\big)^{p-2}.$$

The result follows from Lemma E.3 with constants $\sigma^2 = 2\mathcal N_Q(\lambda)$ and $H = 2\mathcal N_{Q,\infty}(\lambda)^{1/2}$.

Lemma E.5 is a specialization of Lemma E.4 used to bound the last term appearing in Lemma 4.1 in our setting of Nyström uniform sampling.

**Lemma E.5.** Assume that the $m \ge 1$ Nyström landmarks are sampled uniformly with replacement from the dataset $X_1, \dots, X_n$. If $0 < \lambda \le \|C\|_{L(H)}$ and $\lambda n \ge 12K^2\log(4/\delta)$, it holds with probability at least $1-\delta$ that

$$\big\|C_\lambda^{-1/2}(\hat\mu - \tilde\mu_m)\big\| \le \frac{4\sqrt{\mathcal N_\infty(\lambda)}\log(4/\delta)}{m} + \sqrt{\frac{12\mathcal N(\lambda)\log(4/\delta)}{m}}.$$

*Proof of Lemma E.5:* Fix the desired confidence level $\delta \in (0, 1)$. Let us begin by conditioning on the dataset $X_1, \dots, X_n$. As the landmarks are assumed to be drawn i.i.d., we can apply Lemma E.4 with $Q = C$ to the i.i.d. random variables $h_j := \varphi(\tilde X_j)$, $1 \le j \le m$, which satisfy $\mathbb{E}[h_1] = \hat\mu$, $\mathbb{E}[h_1 \otimes h_1] = \hat C_n$ and $\operatorname{ess\,sup}\|C_\lambda^{-1/2}h_1\|^2 \le \mathcal N_\infty(\lambda)$: it holds with probability at least $1-\delta/2$ (over the draw of the landmarks) that

$$\big\|C_\lambda^{-1/2}(\tilde\mu_m - \hat\mu)\big\| \le \frac{4\sqrt{\mathcal N_\infty(\lambda)}\log(4/\delta)}{m} + \sqrt{\frac{4\,\mathrm{tr}(C_\lambda^{-1}\hat C_n)\log(4/\delta)}{m}}.$$

Then, since we assumed $\lambda \le \|C\|_{L(H)}$ and $\lambda n \ge 12K^2\log(4/\delta)$, Lemma E.6 ensures that $\mathrm{tr}(C_\lambda^{-1}\hat C_n) \le 3\mathcal N(\lambda)$ with probability at least $1-\delta/2$ w.r.t. the dataset $X_1, \dots, X_n$. Finally, since the drawing of the dataset and that of the indices of the landmarks are independent, the claimed bound holds with probability at least $(1-\delta/2)(1-\delta/2) \ge 1-\delta$.

The next lemma controls the quantity $\mathrm{tr}(C_\lambda^{-1}\hat C_n)$ used in the proof above.

**Lemma E.6.** Let $\delta > 0$, $\lambda > 0$ and $n \in \mathbb{N}$ be such that $\lambda \le \|C\|_{L(H)}$ and $n \ge 12\mathcal N_\infty(\lambda)\log(2/\delta)$. Then it holds with probability at least $1-\delta$ that $\mathrm{tr}(C_\lambda^{-1}\hat C_n) \le 3\mathcal N(\lambda)$.

*Proof of Lemma E.6:* Let us control the deviation of $\mathrm{tr}(C_\lambda^{-1}\hat C_n)$ from its expectation $\mathcal N(\lambda)$. We have

$$\mathrm{tr}(C_\lambda^{-1}\hat C_n) - \mathcal N(\lambda) = \mathrm{tr}\big(C_\lambda^{-1}(\hat C_n - C)\big) = \frac1n\sum_{i=1}^n \big(\xi_i - \mathbb{E}[\xi_i]\big),$$

where we define $\xi_i := \mathrm{tr}\big(C_\lambda^{-1}\varphi(X_i)\otimes\varphi(X_i)\big)$, $i = 1, \dots, n$. The random variables $\xi_i$, $1 \le i \le n$, satisfy

$$|\xi_i - \mathbb{E}[\xi_i]| = \big|\mathrm{tr}\big(C_\lambda^{-1}(\varphi(X_i)\otimes\varphi(X_i) - C)\big)\big| \le \big\|C_\lambda^{-1/2}\varphi(X_i)\big\|^2 + \mathcal N(\lambda) \le 2\mathcal N_\infty(\lambda),$$

$$\mathbb{E}\big[(\xi_i - \mathbb{E}[\xi_i])^2\big] = \mathbb{E}[\xi_i^2] - (\mathbb{E}\xi_i)^2 \le \operatorname{ess\,sup}|\xi_i|\;\mathbb{E}[\xi_i] \le 2\mathcal N_\infty(\lambda)\,\mathcal N(\lambda).$$

Lemma E.3 with $H = 2\mathcal N_\infty(\lambda)$ and $\sigma^2 = 2\mathcal N_\infty(\lambda)\mathcal N(\lambda)$ ensures that, with probability at least $1-\delta$,

$$\big|\mathrm{tr}(C_\lambda^{-1}\hat C_n) - \mathcal N(\lambda)\big| \le \frac{4\mathcal N_\infty(\lambda)\log(2/\delta)}{n} + \sqrt{\frac{4\mathcal N_\infty(\lambda)\mathcal N(\lambda)\log(2/\delta)}{n}}.$$

Since $\lambda \le \|C\|_{L(H)}$, we have $\mathcal N(\lambda) = \mathrm{tr}(CC_\lambda^{-1}) \ge \|CC_\lambda^{-1}\|_{L(H)} = \frac{\|C\|_{L(H)}}{\|C\|_{L(H)} + \lambda} \ge \frac12$. Furthermore, using the assumption $n \ge 12\mathcal N_\infty(\lambda)\log(2/\delta)$, it holds with probability at least $1-\delta$ that

$$\mathrm{tr}(C_\lambda^{-1}\hat C_n) \le \mathcal N(\lambda)\left(1 + \frac{1}{3\mathcal N(\lambda)} + \frac{1}{\sqrt{3\mathcal N(\lambda)}}\right) \le 3\mathcal N(\lambda).$$
## F. Nyström approximation result

To control the term involving $P_m^\perp$, we rely on the following lemma from Rudi et al. (2015, Lemma 6).

**Lemma F.1** (Uniform Nyström approximation). When the $m$ landmarks are sampled uniformly at random from the dataset, for any $\lambda \in\, ]0, \|C\|_{L(H)}]$ we have

$$\big\|P_m^\perp (C + \lambda I)^{1/2}\big\|_{L(H)}^2 \le 3\lambda$$

with probability at least $1-\delta$, provided that $m \ge \max(67, 5\mathcal N_\infty(\lambda))\log\frac{4K^2}{\lambda\delta}$.

## G. Experiments

We provide here two expressions which are used to compute the exact error in Section 6.1. We denote by $\mathcal N_x(\mu, \Gamma)$ the evaluation at point $x$ of the Gaussian density with mean $\mu$ and covariance matrix $\Gamma$. Let $d \ge 1$, $\mu, \tilde\mu \in \mathbb{R}^d$, and $\Gamma = \mathrm{diag}(\gamma_1, \dots, \gamma_d)$, $\tilde\Gamma = \mathrm{diag}(\tilde\gamma_1, \dots, \tilde\gamma_d)$. For any $x, \tilde x \in \mathbb{R}^d$, writing the Gaussian kernel as $k(x, \tilde x) = (2\pi\tau^2)^{d/2}\mathcal N_x(\tilde x, \tau^2 I_d)$ and using the Gaussian product and marginalization identities (Petersen et al. 2008),

$$\mathbb{E}_{x\sim\mathcal N(\mu,\Gamma)}\, k(x, \tilde x) = \int_{\mathbb{R}^d} (2\pi\tau^2)^{d/2}\mathcal N_x(\tilde x, \tau^2 I_d)\,\mathcal N_x(\mu, \Gamma)\, dx = (2\pi\tau^2)^{d/2}\mathcal N_{\tilde x}(\mu, \Gamma + \tau^2 I_d)\int \mathcal N_x(m_c, \Sigma_c)\, dx = (2\pi\tau^2)^{d/2}\mathcal N_{\tilde x}(\mu, \Gamma + \tau^2 I_d),$$

$$\mathbb{E}_{\tilde x\sim\mathcal N(\tilde\mu,\tilde\Gamma)}\,\mathbb{E}_{x\sim\mathcal N(\mu,\Gamma)}\, k(x, \tilde x) = (2\pi\tau^2)^{d/2}\int_{\mathbb{R}^d} \mathcal N_{\tilde x}(\mu, \Gamma + \tau^2 I_d)\,\mathcal N_{\tilde x}(\tilde\mu, \tilde\Gamma)\, d\tilde x = (2\pi\tau^2)^{d/2}\mathcal N_{\tilde\mu}(\mu, \Gamma + \tilde\Gamma + \tau^2 I_d),$$

where $m_c$ and $\Sigma_c$ denote the mean and covariance of the Gaussian density arising from the product formula.
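As a sanity check of the first closed form, the following sketch (our own, not part of the paper; it assumes the Gaussian kernel $k(x, \tilde x) = \exp(-\|x-\tilde x\|^2/(2\tau^2))$ and a diagonal covariance) compares $(2\pi\tau^2)^{d/2}\,\mathcal N_{\tilde x}(\mu, \Gamma + \tau^2 I_d)$ with a Monte Carlo estimate of $\mathbb{E}_{x\sim\mathcal N(\mu,\Gamma)}\, k(x, \tilde x)$:

```python
import numpy as np

def gaussian_density(x, mean, diag_cov):
    """N_x(mean, diag(diag_cov)) for a diagonal covariance matrix."""
    d = len(mean)
    quad = np.sum((x - mean) ** 2 / diag_cov)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.prod(diag_cov))

rng = np.random.default_rng(0)
d, tau = 3, 0.8
mu, gamma = rng.normal(size=d), rng.uniform(0.5, 2.0, size=d)
x_tilde = rng.normal(size=d)

# Closed form: E_{x ~ N(mu, Gamma)} k(x, x_tilde) = (2 pi tau^2)^{d/2} N_{x_tilde}(mu, Gamma + tau^2 I)
closed = (2 * np.pi * tau**2) ** (d / 2) * gaussian_density(x_tilde, mu, gamma + tau**2)

# Monte Carlo estimate of the same expectation.
samples = mu + np.sqrt(gamma) * rng.normal(size=(200000, d))
mc = np.exp(-np.sum((samples - x_tilde) ** 2, axis=1) / (2 * tau**2)).mean()

print(closed, mc)   # the two values should agree up to Monte Carlo error
```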