# MONK: Outlier-Robust Mean Embedding Estimation by Median-of-Means

Matthieu Lerasle¹,² Zoltán Szabó³ Timothée Mathieu¹ Guillaume Lecué⁴

¹Laboratoire de Mathématiques d'Orsay, Univ. Paris-Sud, France. ²CNRS, Université Paris-Saclay, France. ³CMAP, École Polytechnique, Palaiseau, France. ⁴CREST, ENSAE ParisTech, France. Correspondence to: Matthieu Lerasle, Zoltán Szabó.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Mean embeddings provide an extremely flexible and powerful tool in machine learning and statistics to represent probability distributions and to define a semi-metric (MMD, maximum mean discrepancy; also called N-distance or energy distance), with numerous successful applications. The representation is constructed as the expectation of the feature map defined by a kernel. As a mean, however, its classical empirical estimator can be arbitrarily severely affected by even a single outlier in the case of unbounded features. To the best of our knowledge, even the consistency of the few existing techniques that try to alleviate this serious sensitivity bottleneck is unknown. In this paper, we show how the recently emerged principle of median-of-means can be used to design estimators for the kernel mean embedding and MMD with strong resistance to outliers, and optimal sub-Gaussian deviation bounds under mild assumptions.

1. Introduction

Kernel methods (Aronszajn, 1950) form the backbone of a tremendous number of successful applications in machine learning thanks to their power in capturing complex relations (Schölkopf & Smola, 2002; Steinwart & Christmann, 2008). The main idea behind these techniques is to map the data points to a feature space (RKHS, reproducing kernel Hilbert space) determined by the kernel, and to apply linear methods in the feature space, without the need to explicitly compute the map. One crucial component contributing to this flexibility and efficiency (beyond the solid theoretical foundations) is the versatility of domains where kernels exist; examples include trees (Collins & Duffy, 2001; Kashima & Koyanagi, 2002), time series (Cuturi, 2011), strings (Lodhi et al., 2002), mixture models, hidden Markov models or linear dynamical systems (Jebara et al., 2004), sets (Haussler, 1999; Gärtner et al., 2002), fuzzy domains (Guevara et al., 2017), distributions (Hein & Bousquet, 2005; Martins et al., 2009; Muandet et al., 2011), groups (Cuturi et al., 2005) such as specific constructions on permutations (Jiao & Vert, 2016), or graphs (Vishwanathan et al., 2010; Kondor & Pan, 2016).

Given a kernel-enriched domain $(\mathcal{X}, K)$, one can represent a probability distribution $\mathbb{P}$ on $\mathcal{X}$ as the mean

$$\mu_{\mathbb{P}} = \int_{\mathcal{X}} \varphi(x)\,\mathrm{d}\mathbb{P}(x) \in \mathcal{H}_K, \qquad \varphi(x) := K(\cdot, x),$$

which is a point in the RKHS $\mathcal{H}_K$ determined by $K$. This representation, called mean embedding (Berlinet & Thomas-Agnan, 2004; Smola et al., 2007), induces a semi-metric¹ on distributions called maximum mean discrepancy (MMD) (Smola et al., 2007; Gretton et al., 2012):

$$\mathrm{MMD}(\mathbb{P}, \mathbb{Q}) = \|\mu_{\mathbb{P}} - \mu_{\mathbb{Q}}\|_{\mathcal{H}_K}. \qquad (1)$$

With an appropriate choice of the kernel, classical integral transforms widely used in probability theory and statistics can be recovered from $\mu_{\mathbb{P}}$; for example, if $\mathcal{X}$, equipped with the scalar product $\langle\cdot,\cdot\rangle$, is a Hilbert space, the kernel $K(x, y) = e^{\langle x, y\rangle}$ gives the moment-generating function, and $K(x, y) = e^{-\gamma\|x - y\|_2^2}$ ($\gamma > 0$) the Weierstrass transform.

¹ Fukumizu et al. (2008) and Sriperumbudur et al. (2010) provide conditions under which MMD is a metric, i.e., $\mu$ is injective.
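To make the integral-transform remark concrete, here is a minimal NumPy sketch (illustrative only, not part of the paper's code; the distribution and all parameter values are arbitrary choices): with the exponential kernel $K(x, y) = e^{xy}$ on $\mathcal{X} = \mathbb{R}$, evaluating the empirical mean embedding at a point $t$ estimates the moment-generating function $\mathbb{E}\,e^{tx}$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N = 1.0, 0.5, 100_000
x = rng.normal(mu, sigma, size=N)        # sample from P = N(mu, sigma^2)

def empirical_embedding(t, x):
    """Empirical mean embedding under K(x, y) = exp(x * y), evaluated at t:
    mu_hat_P(t) = (1/N) sum_n K(t, x_n)."""
    return np.mean(np.exp(t * x))

t = 0.3
print(empirical_embedding(t, x))                 # ~ E exp(t x), the MGF at t
print(np.exp(mu * t + 0.5 * sigma**2 * t**2))    # closed-form MGF of N(mu, sigma^2)
```

With the Gaussian kernel, the analogous evaluation gives (up to normalization) the Weierstrass transform mentioned above.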
As has been shown by Sejdinovic et al. (2013), the energy distance (Baringhaus & Franz, 2004; Székely & Rizzo, 2004; 2005), also known as N-distance (Zinger et al., 1992; Klebanov, 2005) in the statistical literature, coincides with MMD. Mean embedding and maximum mean discrepancy have been applied successfully in kernel Bayesian inference (Song et al., 2011; Fukumizu et al., 2013), approximate Bayesian computation (Park et al., 2016), model criticism (Lloyd et al., 2014; Kim et al., 2016), two-sample testing (Baringhaus & Franz, 2004; Székely & Rizzo, 2004; 2005; Harchaoui et al., 2007; Gretton et al., 2012) and its differentially private variant (Raj et al., 2018), independence testing (Gretton et al., 2008; Pfister et al., 2017) and goodness-of-fit testing (Jitkrittum et al., 2017; Balasubramanian et al., 2017), domain adaptation (Zhang et al., 2013) and generalization (Blanchard et al., 2017), change-point detection (Harchaoui & Cappé, 2007), probabilistic programming (Schölkopf et al., 2015), post-selection inference (Yamada et al., 2018), distribution classification (Muandet et al., 2011; Zaheer et al., 2017) and regression (Szabó et al., 2016; Law et al., 2018), causal discovery (Mooij et al., 2016; Pfister et al., 2017), generative adversarial networks (Dziugaite et al., 2015; Li et al., 2015; Binkowski et al., 2018), understanding the dynamics of complex dynamical systems (Klus et al., 2018; 2019), or topological data analysis (Kusano et al., 2016), among many others; Muandet et al. (2017) provide a recent in-depth review of the topic.

Crucial to the success of these applications is the efficient and robust approximation of the mean embedding and MMD. As a mean, the most natural approach to estimating $\mu_{\mathbb{P}}$ is the empirical average. Plugging this estimate into Eq. (1) directly produces an approximation of MMD, which can also be made unbiased (by a small correction) or approximated recursively; these are the V-statistic, U-statistic and online approaches (Gretton et al., 2012). Kernel mean shrinkage estimators (Muandet et al., 2016) represent another successful direction: they improve the efficiency of mean embedding estimation by exploiting the Stein phenomenon. Minimax results have recently been established: the optimal rate of mean embedding estimation given $N$ samples from $\mathbb{P}$ is $N^{-1/2}$ (Tolstikhin et al., 2017) for discrete measures and for the class of measures with infinitely differentiable density when $K$ is a continuous, shift-invariant kernel on $\mathcal{X} = \mathbb{R}^d$. For MMD, using $N_1$ and $N_2$ samples from $\mathbb{P}$ and $\mathbb{Q}$, it is $N_1^{-1/2} + N_2^{-1/2}$ (Tolstikhin et al., 2016) in the case of radial universal kernels defined on $\mathcal{X} = \mathbb{R}^d$.

A critical property of an estimator is its robustness to contaminated data, i.e., to outliers, which are omnipresent in currently available massive and heterogeneous datasets. To the best of our knowledge, systematically designing outlier-robust mean embedding and MMD estimators has hardly been touched in the literature; this is the focus of the current paper. The issue is particularly serious in the case of unbounded kernels, where for example even a single outlier can completely ruin a classical empirical-average-based estimator. Examples of unbounded kernels include the exponential kernel (see the example above about moment-generating functions), polynomial kernels, and various string, time series or graph kernels.
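This sensitivity is easy to verify numerically. The sketch below (illustrative only; the kernel, sample sizes and outlier value are arbitrary choices, not the paper's experimental setup) computes the plug-in V-statistic MMD estimate under the unbounded quadratic kernel $K(x, y) = (xy)^2$ for two standard normal samples, once on clean data and once after corrupting a single point: one outlier inflates the estimate by several orders of magnitude even though the population MMD is zero.

```python
import numpy as np

rng = np.random.default_rng(1)

def quad_kernel(a, b):
    """Unbounded polynomial (quadratic) kernel K(x, y) = (x * y)^2 on R."""
    return np.outer(a, b) ** 2

def mmd_v(x, y, kernel):
    """Plug-in (V-statistic) MMD estimate ||mu_hat_P - mu_hat_Q||_K."""
    kxx, kyy, kxy = kernel(x, x), kernel(y, y), kernel(x, y)
    return np.sqrt(max(kxx.mean() + kyy.mean() - 2.0 * kxy.mean(), 0.0))

N = 500
x = rng.normal(0.0, 1.0, N)
y = rng.normal(0.0, 1.0, N)              # P = Q, so MMD(P, Q) = 0

print("clean data :", mmd_v(x, y, quad_kernel))
x[-1] = 2000.0                           # corrupt a single sample
print("one outlier:", mmd_v(x, y, quad_kernel))
```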
Existing related techniques comprise robust kernel density estimation (KDE) (Kim & Scott, 2012): the authors elegantly combine ideas from the KDE and M-estimator literature to arrive at a robust KDE estimate of density functions. They assume that the underlying smoothing kernels² are shift-invariant on $\mathcal{X} = \mathbb{R}^d$ and reproducing, and interpret KDE as a weighted mean in $\mathcal{H}_K$. The idea has been (i) adapted to construct outlier-robust covariance operators in RKHSs in the context of kernel canonical correlation analysis (Alam et al., 2018), and (ii) relaxed to general Hilbert spaces (Sinova et al., 2018). Unfortunately, the consistency of the investigated empirical M-estimators is unknown, except for finite-dimensional feature maps (Sinova et al., 2018) or as density function estimators (Vandermeulen & Scott, 2013).

² Smoothing kernels, extensively studied in the non-parametric statistics literature (Györfi et al., 2002), are assumed to be non-negative functions integrating to one.

To achieve our goal, we leverage the idea of Median-Of-meaNs (MON). Intuitively, MONs replace the linear operation of expectation with the median of averages taken over non-overlapping blocks of the data, in order to get a robust estimate thanks to the median step. MONs date back to Jerrum et al. (1986); Alon et al. (1999); Nemirovski & Yudin (1983) for the estimation of the mean of real-valued random variables. Their concentration properties have recently been studied by Devroye et al. (2016); Minsker & Strawn (2017), following the approach of Catoni (2012) for M-estimators. These studies focusing on the estimation of the mean of real-valued random variables are important as they can be used to tackle more general prediction problems in learning theory, via the classical empirical risk minimization approach (Vapnik, 2000) or by more sophisticated approaches such as the minmax procedure (Audibert & Catoni, 2011).

In parallel to the minmax approach, there have been several attempts to extend the use of MON estimators from $\mathbb{R}$ to more general settings. For example, Minsker (2015); Minsker & Strawn (2017) consider the problem of estimating the mean of a Banach-space-valued random variable using geometric MONs. The estimators constructed by Minsker (2015); Minsker & Strawn (2017) are computationally tractable, but the deviation bounds are suboptimal compared to those one can prove for the empirical mean under sub-Gaussian assumptions. In regression problems, Lugosi & Mendelson (2019a); Lecué & Lerasle (2018) proposed to combine the classical MON estimators on $\mathbb{R}$ in a test procedure that can be seen as a Le Cam test estimator (Le Cam, 1973). The achievement of Lugosi & Mendelson (2019a); Lecué & Lerasle (2018) is that they were able to obtain optimal deviation bounds for the resulting estimator using the powerful so-called small-ball method of Koltchinskii & Mendelson (2015); Mendelson (2015). This approach was then extended to mean estimation in $\mathbb{R}^d$ by Lugosi & Mendelson (2019b), providing the first rate-optimal sub-Gaussian deviation bounds under minimal $L_2$-assumptions. The constants of Lugosi & Mendelson (2019a); Lecué & Lerasle (2018); Lugosi & Mendelson (2019b) have been improved by Catoni & Giulini (2017) for the estimation of the mean in $\mathbb{R}^d$ under an $L_4$-moment assumption, and in least-squares regression under an $L_4/L_2$ condition that is stronger than the small-ball assumption used by Lugosi & Mendelson (2019a); Lecué & Lerasle (2018).
Unfortunately, these estimators are computationally intractable; their risk bounds, however, serve as an important baseline for computable estimators such as the minmax MON estimators in regression (Lecué & Lerasle, 2019). Motivated by the computational intractability of the tournament procedure underlying the first rate-optimal sub-Gaussian deviation bound holding under minimal assumptions in $\mathbb{R}^d$ (Lugosi & Mendelson, 2019b), Hopkins (2018) proposed a convex relaxation with polynomial, $O(N^{24})$, complexity, where $N$ denotes the sample size. Cherapanamjeri et al. (2019) have recently designed an alternative convex relaxation requiring $O(N^4 + dN^2)$ computation, which is still rather restrictive for large sample sizes and infeasible in infinite dimension.

Our goal is to extend the theoretical insight of Lugosi & Mendelson (2019b) from $\mathbb{R}^d$ to kernel-enriched domains. In particular, we prove optimal sub-Gaussian deviation bounds for MON-based mean estimators in RKHSs which hold under minimal second-order moment assumptions. In order to achieve this goal, we use a different (minmax (Audibert & Catoni, 2011; Lecué & Lerasle, 2019)) construction which, combined with properties specific to RKHSs (the mean-reproducing property of the mean embedding and the integral probability metric representation of MMD), gives rise to our practical MONK procedures. Thanks to the use of medians, the MONK estimators are also robust to contamination.

Section 2 contains definitions and the problem formulation. Our main results are given in Section 3. The implementation of the MONK estimators is the focus of Section 4, with numerical illustrations in Section 5.

2. Definitions & Problem Formulation

In this section, we formally introduce the goal of our paper.

Notations: $\mathbb{Z}^+$ is the set of positive integers, $[M] := \{1, \dots, M\}$ and $u_S := (u_m)_{m \in S}$ for $S \subseteq [M]$. For a set $S$, $|S|$ denotes its cardinality. $\mathbb{E}$ stands for expectation. $\mathrm{med}_{q \in [Q]}\{z_q\}$ is the median of the numbers $(z_q)_{q \in [Q]}$. Let $\mathcal{X}$ be a separable topological space endowed with its Borel $\sigma$-field; $x_{1:N}$ denotes a sequence of i.i.d. random variables on $\mathcal{X}$ with law $\mathbb{P}$ (shortly, $x_{1:N} \sim \mathbb{P}$). $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a continuous (reproducing) kernel on $\mathcal{X}$, $\mathcal{H}_K$ is the reproducing kernel Hilbert space associated to $K$, and $\langle\cdot,\cdot\rangle_K := \langle\cdot,\cdot\rangle_{\mathcal{H}_K}$, $\|\cdot\|_K := \|\cdot\|_{\mathcal{H}_K}$.³ The reproducing property of the kernel means that evaluation of functions in $\mathcal{H}_K$ can be represented by inner products: $f(x) = \langle f, K(\cdot, x)\rangle_K$ for all $x \in \mathcal{X}$, $f \in \mathcal{H}_K$.

³ $\mathcal{H}_K$ is separable by the separability of $\mathcal{X}$ and the continuity of $K$ (Steinwart & Christmann, 2008, Lemma 4.33). These assumptions on $\mathcal{X}$ and $K$ are assumed to hold throughout the paper.

The mean embedding of a probability measure $\mathbb{P}$ is defined as

$$\mu_{\mathbb{P}} = \int_{\mathcal{X}} K(\cdot, x)\,\mathrm{d}\mathbb{P}(x) \in \mathcal{H}_K, \qquad (2)$$

where the integral is meant in Bochner sense; $\mu_{\mathbb{P}}$ exists iff $\int_{\mathcal{X}} \|K(\cdot, x)\|_K\,\mathrm{d}\mathbb{P}(x) = \int_{\mathcal{X}} \sqrt{K(x, x)}\,\mathrm{d}\mathbb{P}(x) < \infty$. It is well known that the mean embedding has the mean-reproducing property $\mathbb{P}f := \mathbb{E}_{x \sim \mathbb{P}} f(x) = \langle f, \mu_{\mathbb{P}}\rangle_K$ for all $f \in \mathcal{H}_K$, and that it is the unique solution of the problem

$$\mu_{\mathbb{P}} = \operatorname*{argmin}_{f \in \mathcal{H}_K} \int_{\mathcal{X}} \|f - K(\cdot, x)\|_K^2\,\mathrm{d}\mathbb{P}(x). \qquad (3)$$

The solution of this task can also be obtained by solving the minmax optimization

$$\mu_{\mathbb{P}} = \operatorname*{argmin}_{f \in \mathcal{H}_K} \sup_{g \in \mathcal{H}_K} J(f, g), \qquad (4)$$

with $J(f, g) = \mathbb{E}_{x \sim \mathbb{P}}\big[\|f - K(\cdot, x)\|_K^2 - \|g - K(\cdot, x)\|_K^2\big]$. The equivalence of (3) and (4) is obvious since the expectation is linear: $\sup_{g} J(f, g) = \mathbb{E}_{x \sim \mathbb{P}}\|f - K(\cdot, x)\|_K^2 - \inf_{g} \mathbb{E}_{x \sim \mathbb{P}}\|g - K(\cdot, x)\|_K^2$, and the second term does not depend on $f$. Nevertheless, this equivalence is essential in the construction of our estimators, because below we replace the expectation by a non-linear estimator of this quantity. More precisely, the unknown expectations are computed using the Median-Of-meaN (MON) estimator.
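Before the formal definition given next, here is a minimal sketch of the MON principle for a real-valued sample (illustrative only; block formation by a random shuffle and the parameter values are assumptions): the data are split into Q equal-size blocks and the median of the block means replaces the overall mean, so a handful of corrupted points can spoil only a few blocks and is discarded by the median.

```python
import numpy as np

def mon(z, Q, rng):
    """Median-of-means (MON) of a real-valued sample z over Q equal-size blocks.
    Assumes len(z) is divisible by Q; blocks are formed after a random shuffle."""
    blocks = rng.permutation(z).reshape(Q, -1)   # Q blocks of size N / Q
    return np.median(blocks.mean(axis=1))

rng = np.random.default_rng(2)
z = rng.normal(0.0, 1.0, 1000)
z[:5] = 1e6                                      # five adversarially corrupted points

print("empirical mean:", z.mean())               # ruined by the corruption
print("MON, Q = 20   :", mon(z, Q=20, rng=rng))  # close to the true mean 0
```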
Given a partition of the dataset into blocks, the MON estimator is the median of the empirical means over the blocks. MON estimators are naturally robust thanks to the median step. More precisely, the procedure goes as follows. For any map $h : \mathcal{X} \to \mathbb{R}$ and any non-empty subset $S \subseteq [N]$, denote by $\mathbb{P}_S := |S|^{-1}\sum_{i \in S}\delta_{x_i}$ the empirical measure associated to the subset $x_S$ and $\mathbb{P}_S h := |S|^{-1}\sum_{i \in S} h(x_i)$; we will use the shorthand $\mu_S := \mu_{\mathbb{P}_S}$. Assume that $N \in \mathbb{Z}^+$ is divisible by $Q \in \mathbb{Z}^+$ and let $(S_q)_{q \in [Q]}$ denote a partition of $[N]$ into subsets of the same cardinality, $|S_q| = N/Q$ ($\forall q \in [Q]$). The Median-Of-meaN (MON) is defined as

$$\mathrm{MON}_Q[h] = \mathrm{med}_{q \in [Q]}\big(\mathbb{P}_{S_q} h\big) = \mathrm{med}_{q \in [Q]}\big\langle h, \mu_{S_q}\big\rangle_K,$$

where, assuming that $h \in \mathcal{H}_K$, the second equality is a consequence of the mean-reproducing property. In particular, in the case $Q = 1$ the MON operation reduces to the classical mean: $\mathrm{MON}_1[h] = N^{-1}\sum_{n=1}^N h(x_n)$.

We define the minmax MON-based estimator associated to the kernel $K$ (MONK) as

$$\hat\mu_{\mathbb{P},Q} = \hat\mu_{\mathbb{P},Q}(x_{1:N}) \in \operatorname*{argmin}_{f \in \mathcal{H}_K} \sup_{g \in \mathcal{H}_K} J_Q(f, g), \qquad (5)$$

where for all $f, g \in \mathcal{H}_K$

$$J_Q(f, g) = \mathrm{MON}_Q\big[x \mapsto \|f - K(\cdot, x)\|_K^2 - \|g - K(\cdot, x)\|_K^2\big].$$

When $Q = 1$, since $\mathrm{MON}_1[h]$ is the empirical mean, we obtain the classical empirical-mean-based estimator: $\hat\mu_{\mathbb{P},1} = \frac{1}{N}\sum_{n=1}^N K(\cdot, x_n)$.

One can use the mean embedding (2) to get a semi-metric on probability measures: the maximum mean discrepancy (MMD) of $\mathbb{P}$ and $\mathbb{Q}$ is

$$\mathrm{MMD}(\mathbb{P}, \mathbb{Q}) := \|\mu_{\mathbb{P}} - \mu_{\mathbb{Q}}\|_K = \sup_{f \in B_K}\,\langle f, \mu_{\mathbb{P}} - \mu_{\mathbb{Q}}\rangle_K,$$

where $B_K = \{f \in \mathcal{H}_K : \|f\|_K \le 1\}$ is the closed unit ball around the origin in $\mathcal{H}_K$. The second equality shows that MMD is a specific integral probability metric (Müller, 1997; Zolotarev, 1983).

Assume that we have access to samples $x_{1:N} \sim \mathbb{P}$ and $y_{1:N} \sim \mathbb{Q}$, where for simplicity the two sample sizes are taken to be equal. Denote by $\mathbb{P}_{S,x} := \frac{1}{|S|}\sum_{i \in S}\delta_{x_i}$ the empirical measure associated to the subset $x_S$ ($\mathbb{P}_{S,y}$ is defined similarly for $y$), and let $\mu_{S_q,\mathbb{P}} := \mu_{\mathbb{P}_{S_q,x}}$, $\mu_{S_q,\mathbb{Q}} := \mu_{\mathbb{P}_{S_q,y}}$. We propose the following MON-based MMD estimator:

$$\widehat{\mathrm{MMD}}_Q(\mathbb{P}, \mathbb{Q}) = \sup_{f \in B_K}\,\mathrm{med}_{q \in [Q]}\,\big\langle f, \mu_{S_q,\mathbb{P}} - \mu_{S_q,\mathbb{Q}}\big\rangle_K. \qquad (6)$$

Again, with the choice $Q = 1$, the classical V-statistic-based MMD estimator (Gretton et al., 2012) is recovered:

$$\widehat{\mathrm{MMD}}(\mathbb{P}, \mathbb{Q}) = \sup_{f \in B_K}\Big[\frac{1}{N}\sum_{n \in [N]} f(x_n) - \frac{1}{N}\sum_{n \in [N]} f(y_n)\Big] = \frac{1}{N}\sqrt{\sum_{i,j \in [N]}\big(K^x_{ij} + K^y_{ij} - 2K^{xy}_{ij}\big)}, \qquad (7)$$

where $K^x_{ij} = K(x_i, x_j)$, $K^y_{ij} = K(y_i, y_j)$ and $K^{xy}_{ij} = K(x_i, y_j)$ for all $i, j \in [N]$. Changing the summation $\sum_{i,j \in [N]}$ in Eq. (7) to $\sum_{i,j \in [N], i \ne j}$ (with the matching normalization) for the $K^x_{ij}$ and $K^y_{ij}$ terms gives the (unbiased) U-statistic-based MMD estimator

$$\widehat{\mathrm{MMD}}_U(\mathbb{P}, \mathbb{Q}) = \sqrt{\frac{1}{N(N-1)}\sum_{i,j \in [N],\, i \ne j}\big(K^x_{ij} + K^y_{ij}\big) - \frac{2}{N^2}\sum_{i,j \in [N]} K^{xy}_{ij}}. \qquad (8)$$

Our goal is to lay down the theoretical foundations of the MONK estimators $\hat\mu_{\mathbb{P},Q}$ and $\widehat{\mathrm{MMD}}_Q(\mathbb{P}, \mathbb{Q})$: to study their finite-sample behaviour (prove optimal sub-Gaussian deviation bounds) and to establish their outlier-robustness properties.
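As a concrete reference point, the following sketch implements the U-statistic estimator of Eq. (8), which serves as the U-Stat baseline in Section 5 (an illustrative re-implementation from the formula, not the authors' code; the RBF kernel and its bandwidth are arbitrary choices). The MONK estimator of Eq. (6) additionally takes a supremum over $B_K$ of a block-wise median; its computation is the subject of Section 4.

```python
import numpy as np

def mmd_u(Kx, Ky, Kxy):
    """U-statistic MMD estimate of Eq. (8) from the Gram matrices
    Kx = [K(x_i, x_j)], Ky = [K(y_i, y_j)], Kxy = [K(x_i, y_j)]."""
    N = Kx.shape[0]
    sxx = (Kx.sum() - np.trace(Kx)) / (N * (N - 1))   # off-diagonal average
    syy = (Ky.sum() - np.trace(Ky)) / (N * (N - 1))
    mmd2 = sxx + syy - 2.0 * Kxy.mean()
    return np.sqrt(max(mmd2, 0.0))       # the unbiased statistic can be slightly negative

def rbf(a, b, gamma=1.0):
    """RBF kernel K(u, v) = exp(-gamma * ||u - v||^2) for row-wise data."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, (400, 1))
y = rng.normal(0.5, 1.0, (400, 1))
print(mmd_u(rbf(x, x), rbf(y, y), rbf(x, y)))
```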
A few additional notations will be needed throughout the paper. $S_1 \setminus S_2$ is the difference of the sets $S_1$ and $S_2$. For any linear operator $A : \mathcal{H}_K \to \mathcal{H}_K$, denote by $\|A\| := \sup_{0 \ne f \in \mathcal{H}_K} \|Af\|_K / \|f\|_K$ the operator norm of $A$, and let $L(\mathcal{H}_K) = \{A : \mathcal{H}_K \to \mathcal{H}_K \text{ linear} : \|A\| < \infty\}$ be the space of bounded linear operators. For any $A \in L(\mathcal{H}_K)$, let $A^* \in L(\mathcal{H}_K)$ denote the adjoint of $A$, that is, the operator such that $\langle Af, g\rangle_K = \langle f, A^*g\rangle_K$ for all $f, g \in \mathcal{H}_K$. An operator $A \in L(\mathcal{H}_K)$ is called non-negative if $\langle Af, f\rangle_K \ge 0$ for all $f \in \mathcal{H}_K$. By the separability of $\mathcal{H}_K$, there exists a countable orthonormal basis (ONB) $(e_i)_{i \in I}$ in $\mathcal{H}_K$. $A \in L(\mathcal{H}_K)$ is called trace-class if $\|A\|_1 := \sum_{i \in I}\big\langle (A^*A)^{1/2} e_i, e_i\big\rangle_K < \infty$, and in this case $\mathrm{Tr}(A) := \sum_{i \in I}\langle Ae_i, e_i\rangle_K < \infty$. If $A$ is non-negative and self-adjoint, then $A$ is trace-class iff $\mathrm{Tr}(A) < \infty$; this will hold for the covariance operator ($\Sigma_{\mathbb{P}}$, see Eq. (9)). $A \in L(\mathcal{H}_K)$ is called Hilbert-Schmidt if $\|A\|_2^2 := \mathrm{Tr}(A^*A) = \sum_{i \in I}\langle Ae_i, Ae_i\rangle_K < \infty$. One can show that the definitions of trace-class and Hilbert-Schmidt operators are independent of the particular choice of the ONB $(e_i)_{i \in I}$. Denote by $L_1(\mathcal{H}_K) := \{A \in L(\mathcal{H}_K) : \|A\|_1 < \infty\}$ and $L_2(\mathcal{H}_K) := \{A \in L(\mathcal{H}_K) : \|A\|_2 < \infty\}$ the class of trace-class operators and the (Hilbert) space of Hilbert-Schmidt operators on $\mathcal{H}_K$, respectively. The tensor product of $a, b \in \mathcal{H}_K$ is defined by $(a \otimes b)(c) = a\,\langle b, c\rangle_K$ ($\forall c \in \mathcal{H}_K$); $a \otimes b \in L_2(\mathcal{H}_K)$ and $\|a \otimes b\|_2 = \|a\|_K\,\|b\|_K$. Moreover, $L_2(\mathcal{H}_K) = \mathcal{H}_K \otimes \mathcal{H}_K$, where the r.h.s. denotes the tensor product of Hilbert spaces, defined as the closure of $\big\{\sum_{i=1}^n a_i \otimes b_i : a_i, b_i \in \mathcal{H}_K\ (i \in [n]),\ n \in \mathbb{Z}^+\big\}$. Whenever $\int_{\mathcal{X}} \|K(\cdot, x) \otimes K(\cdot, x)\|_2\,\mathrm{d}\mathbb{P}(x) = \int_{\mathcal{X}} K(x, x)\,\mathrm{d}\mathbb{P}(x) < \infty$, let $\Sigma_{\mathbb{P}}$ denote the covariance operator

$$\Sigma_{\mathbb{P}} = \mathbb{E}_{x \sim \mathbb{P}}\big(\big[K(\cdot, x) - \mu_{\mathbb{P}}\big] \otimes \big[K(\cdot, x) - \mu_{\mathbb{P}}\big]\big) \in L_2(\mathcal{H}_K), \qquad (9)$$

where the expectation (integral) is again meant in Bochner sense. $\Sigma_{\mathbb{P}}$ is non-negative and self-adjoint; moreover, it has the covariance-reproducing property $\langle f, \Sigma_{\mathbb{P}} f\rangle_K = \mathbb{E}_{x \sim \mathbb{P}}\big[f(x) - \mathbb{P}f\big]^2$. It is known that $\|A\| \le \|A\|_2 \le \|A\|_1$.

3. Main Results

Below we present our main results on the MONK estimators, followed by a discussion. We allow $N_c$ elements $(x_{n_j})_{j=1}^{N_c}$ of the sample $x_{1:N}$ to be arbitrarily corrupted (in MMD estimation, the pairs $\{(x_{n_j}, y_{n_j})\}_{j=1}^{N_c}$ can be contaminated). The number of corrupted samples can be (almost) half of the number of blocks; in other words, there exists $\delta \in (0, 1/2]$ such that $N_c \le Q(1/2 - \delta)$. If the data are free from contamination, then $N_c = 0$ and $\delta = 1/2$. Using these notations, we can prove the following optimal sub-Gaussian deviation bounds on the MONK estimators.

Theorem 1 (Consistency & outlier-robustness of $\hat\mu_{\mathbb{P},Q}$). Assume that $\Sigma_{\mathbb{P}} \in L_1(\mathcal{H}_K)$. Then, for any $\eta \in (0, 1)$ such that $Q = \lceil 72\,\delta^{-2}\ln(1/\eta)\rceil$ satisfies $Q \in (N_c/(1/2 - \delta),\, N/2)$, with probability at least $1 - \eta$,

$$\|\hat\mu_{\mathbb{P},Q} - \mu_{\mathbb{P}}\|_K \le C_\delta\left(\sqrt{\frac{\mathrm{Tr}(\Sigma_{\mathbb{P}})}{N}} + \sqrt{\frac{6\,\|\Sigma_{\mathbb{P}}\|\ln(1/\eta)}{N}}\right),$$

where $C_\delta > 0$ is a constant depending only on $\delta$.

Theorem 2 (Consistency & outlier-robustness of $\widehat{\mathrm{MMD}}_Q(\mathbb{P}, \mathbb{Q})$). Assume that $\Sigma_{\mathbb{P}}, \Sigma_{\mathbb{Q}} \in L_1(\mathcal{H}_K)$. Then, for any $\eta \in (0, 1)$ such that $Q = \lceil 72\,\delta^{-2}\ln(1/\eta)\rceil$ satisfies $Q \in (N_c/(1/2 - \delta),\, N/2)$, with probability at least $1 - \eta$,

$$\big|\widehat{\mathrm{MMD}}_Q(\mathbb{P}, \mathbb{Q}) - \mathrm{MMD}(\mathbb{P}, \mathbb{Q})\big| \le C_\delta\left(\sqrt{\frac{\mathrm{Tr}(\Sigma_{\mathbb{P}}) + \mathrm{Tr}(\Sigma_{\mathbb{Q}})}{N}} + \sqrt{\frac{\big(\|\Sigma_{\mathbb{P}}\| + \|\Sigma_{\mathbb{Q}}\|\big)\ln(1/\eta)}{N}}\right),$$

with $C_\delta > 0$ again depending only on $\delta$.

Proof (sketch). The technical challenge is to get the optimal deviation bounds under the (mild) trace-class assumption. The reasonings for the mean embedding and MMD follow a similar high-level idea; here we focus on the former. First we show that the analysis can be reduced to the unit ball in $\mathcal{H}_K$ by proving that $\|\hat\mu_{\mathbb{P},Q} - \mu_{\mathbb{P}}\|_K \le (1 + \sqrt{2})\,r_{Q,N}$, where $r_{Q,N} = \sup_{f \in B_K}\mathrm{MON}_Q\big[x \mapsto \langle f, K(\cdot, x) - \mu_{\mathbb{P}}\rangle_K\big] = \sup_{f \in B_K}\mathrm{med}_{q \in [Q]}\{r(f, q)\}$ with $r(f, q) = \langle f, \mu_{S_q} - \mu_{\mathbb{P}}\rangle_K$. The Chebyshev inequality with a Lipschitz argument allows us to control the probability of the event $\{r_{Q,N} \ge \epsilon\}$ using the variable $Z = \sup_{f \in B_K}\sum_{q \in U}\big[\varphi\big(2r(f, q)/\epsilon\big) - \mathbb{E}\varphi\big(2r(f, q)/\epsilon\big)\big]$, where $U$ stands for the indices of the uncorrupted blocks and $\varphi(t) = (t - 1)\mathbb{1}_{1 \le t \le 2} + \mathbb{1}_{t > 2}$. The bounded-difference property of $Z$, a supremum of empirical processes, guarantees its concentration around its expectation via the McDiarmid inequality. The symmetrization technique, combined with Talagrand's contraction principle for Rademacher processes (thanks to the Lipschitz property of $\varphi$) and followed by another symmetrization, leads to the deviation bound. Details are provided in Sections A.1 and A.2 of the supplementary material (for Theorems 1 and 2, respectively).

Dependence on N: These finite-sample guarantees show that the MONK estimators have the optimal $N^{-1/2}$ rate, recalling the previously discussed results of Tolstikhin et al.
(2016; 2017), and that they are robust to outliers, providing consistent estimates with high probability even under arbitrary adversarial contamination (affecting fewer than half of the samples).

Dependence on δ: Recall that a larger $\delta$ corresponds to fewer outliers, i.e., cleaner data, in which case the bounds above become tighter. In other words, by making use of medians, the MONK estimators show robustness to outliers; this property is a nice byproduct of our optimal sub-Gaussian deviation bounds. Whether this robustness to outliers is optimal in the studied setting is an open question.

Dependence on Σ: It is worth contrasting the rates obtained in Theorem 1 with those of the tournament procedures (Lugosi & Mendelson, 2019b) derived for the finite-dimensional case. The latter paper elegantly resolved a long-standing open question concerning the optimal dependency in terms of $\Sigma$. Theorem 1 proves the same dependency in the infinite-dimensional case, while giving rise to computationally tractable algorithms (Section 4).

Separation rate: Theorem 2 also shows that, fixing the traces of the covariance operators of $\mathbb{P}$ and $\mathbb{Q}$, the MON-based MMD estimator can separate $\mathbb{P}$ and $\mathbb{Q}$ at the rate $N^{-1/2}$.

Breakdown point: Our finite-sample bounds imply that the proposed MONK estimators using $Q$ blocks are resistant to $Q/2$ outliers. Since $Q$ is allowed to grow with $N$ (it can be chosen to be almost $N/2$), this specifically means that the breakdown point of our estimators can be 25%.

4. Computing the MONK Estimator

This section is dedicated to the computation⁴ of the analyzed MONK estimators; in particular, we focus on the MMD estimator given in Eq. (6). Numerical illustrations are provided in Section 5.

⁴ The Python code reproducing our numerical experiments is available at https://bitbucket.org/TimotheeMathieu/monk-mmd; it relies on the ITE toolbox (Szabó, 2014).

Recall that the MONK estimator for MMD [Eq. (6)] is given by

$$\widehat{\mathrm{MMD}}_Q(\mathbb{P}, \mathbb{Q}) = \sup_{f \in B_K}\,\mathrm{med}_{q \in [Q]}\Big[\frac{1}{|S_q|}\sum_{j \in S_q} f(x_j) - \frac{1}{|S_q|}\sum_{j \in S_q} f(y_j)\Big]. \qquad (10)$$

By the representer theorem (Schölkopf et al., 2001), the optimal $f$ can be expressed as

$$f(a, b) = \sum_{n \in [N]} a_n K(\cdot, x_n) + \sum_{n \in [N]} b_n K(\cdot, y_n), \qquad (11)$$

where $a = (a_n)_{n \in [N]} \in \mathbb{R}^N$ and $b = (b_n)_{n \in [N]} \in \mathbb{R}^N$. Denote $c = [a; b] \in \mathbb{R}^{2N}$, $\mathbf{K} = [\mathbf{K}^{xx}, \mathbf{K}^{xy}; \mathbf{K}^{yx}, \mathbf{K}^{yy}] \in \mathbb{R}^{2N \times 2N}$, $\mathbf{K}^{xx} = [K(x_i, x_j)]_{i,j \in [N]} \in \mathbb{R}^{N \times N}$, $\mathbf{K}^{xy} = [K(x_i, y_j)]_{i,j \in [N]} = (\mathbf{K}^{yx})^\top \in \mathbb{R}^{N \times N}$, $\mathbf{K}^{yy} = [K(y_i, y_j)]_{i,j \in [N]} \in \mathbb{R}^{N \times N}$. With these notations, the optimisation problem (10) can be rewritten as

$$\max_{c \in \mathbb{R}^{2N} :\, c^\top \mathbf{K} c \le 1}\ \mathrm{med}_{q \in [Q]}\Big\{|S_q|^{-1}\,[\mathbf{1}_q; -\mathbf{1}_q]^\top \mathbf{K}\, c\Big\}, \qquad (12)$$

where $\mathbf{1}_q \in \mathbb{R}^N$ is the indicator vector of the block $S_q$. To enable efficient optimization we follow a block-coordinate descent (BCD)-type scheme: choose the index $q_m \in [Q]$ at which the median is attained in (12), and solve

$$\max_{c \in \mathbb{R}^{2N} :\, c^\top \mathbf{K} c \le 1}\ |S_{q_m}|^{-1}\,[\mathbf{1}_{q_m}; -\mathbf{1}_{q_m}]^\top \mathbf{K}\, c. \qquad (13)$$

This optimization problem can be solved analytically:

$$c = \frac{[\mathbf{1}_{q_m}; -\mathbf{1}_{q_m}]}{\big\|\mathbf{L}^\top[\mathbf{1}_{q_m}; -\mathbf{1}_{q_m}]\big\|_2},$$

where $\mathbf{L}$ is the Cholesky factor of $\mathbf{K}$ ($\mathbf{K} = \mathbf{L}\mathbf{L}^\top$). The observations are shuffled after each iteration. The pseudo-code of the final MONK BCD estimator is summarized in Algorithm 1.

Algorithm 1: MONK BCD estimator for MMD
Input: aggregated Gram matrix $\mathbf{K}$ with Cholesky factor $\mathbf{L}$ ($\mathbf{K} = \mathbf{L}\mathbf{L}^\top$).
for all t = 1, ..., T do
  Generate a random permutation $\sigma$ of $[N]$.
  Shuffle the samples according to $\sigma$: for $q \in [Q]$, $S_q = \big\{\sigma\big((q-1)\tfrac{N}{Q} + 1\big), \dots, \sigma\big(q\tfrac{N}{Q}\big)\big\}$.
  Find the block $q_m$ attaining the median: $\frac{[\mathbf{1}_{q_m}; -\mathbf{1}_{q_m}]^\top \mathbf{K} c}{|S_{q_m}|} = \mathrm{med}_{q \in [Q]}\frac{[\mathbf{1}_q; -\mathbf{1}_q]^\top \mathbf{K} c}{|S_q|}$.
  Compute the coefficient vector: $c = \frac{[\mathbf{1}_{q_m}; -\mathbf{1}_{q_m}]}{\|\mathbf{L}^\top[\mathbf{1}_{q_m}; -\mathbf{1}_{q_m}]\|_2}$.
end for
Output: $\mathrm{med}_{q \in [Q]}\,\frac{1}{|S_q|}[\mathbf{1}_q; -\mathbf{1}_q]^\top \mathbf{K} c$.
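A compact NumPy sketch of Algorithm 1, written directly from the reconstruction above (illustrative; the reference implementation is the Python code linked in the footnote). The zero initialization of $c$, the jitter added before the Cholesky factorization, and the tie-breaking at the median for even $Q$ are assumptions not specified in the pseudo-code.

```python
import numpy as np

def monk_bcd_mmd(Kx, Ky, Kxy, Q, T=50, seed=0):
    """Sketch of the MONK BCD estimator for MMD (Algorithm 1).
    Kx, Ky, Kxy: N x N Gram matrices; Q: number of blocks; T: iterations."""
    N = Kx.shape[0]
    K = np.block([[Kx, Kxy], [Kxy.T, Ky]])               # aggregated Gram matrix (2N x 2N)
    L = np.linalg.cholesky(K + 1e-10 * np.eye(2 * N))    # K ~ L L^T (jitter: assumption)
    rng = np.random.default_rng(seed)
    c = np.zeros(2 * N)                                  # initialization: assumption

    def signed_indicator(S):
        v = np.zeros(2 * N)
        v[np.asarray(S)] = 1.0                           # +1 on the x-coordinates of block S
        v[np.asarray(S) + N] = -1.0                      # -1 on the y-coordinates of block S
        return v

    def block_values(c, blocks):
        return np.array([signed_indicator(S) @ K @ c / len(S) for S in blocks])

    for _ in range(T):
        blocks = np.array_split(rng.permutation(N), Q)   # shuffle, then form Q blocks
        vals = block_values(c, blocks)
        qm = np.argsort(vals)[(Q - 1) // 2]              # block attaining the median (lower middle if Q is even)
        v = signed_indicator(blocks[qm])
        c = v / np.linalg.norm(L.T @ v)                  # closed-form maximiser over c^T K c <= 1
    return np.median(block_values(c, blocks))
```

With $Q = 1$ the returned value reduces to the V-statistic estimate of Eq. (7), which is a convenient sanity check for the closed-form update.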
Notice that computing $\mathbf{L}$ in MONK BCD costs $O(N^3)$, which can be prohibitive for large sample sizes. In order to alleviate this bottleneck we also consider an approximate version of MONK BCD (referred to as MONK BCD-Fast), where the $\sum_{n \in [N]}$ summation obtained after plugging (11) into (10) is restricted to the current block, i.e., replaced by $\sum_{n \in S_q}$, leading to

$$\max_{c = [a; b] \in \mathbb{R}^{2N} :\, c^\top \mathbf{K} c \le 1}\ \mathrm{med}_{q \in [Q]}\,\frac{1}{|S_q|}\Big[\sum_{j, n \in S_q}\big(a_n K(x_j, x_n) + b_n K(x_j, y_n)\big) - \sum_{j, n \in S_q}\big(a_n K(y_j, x_n) + b_n K(y_j, y_n)\big)\Big].$$

This modification allows computations that are local to the blocks and improves the running time. The samples are shuffled periodically (e.g., every 10th iteration) to renew the blocks. The resulting method is presented in Algorithm 2. The computational complexities of the different MMD estimators are summarized in Table 1.

Algorithm 2: MONK BCD-Fast estimator for MMD
Input: aggregated Gram matrix $\mathbf{K}$; indices $J$ at which we shuffle.
for all t = 1, ..., T do
  if $t \in J$ then
    Generate a random permutation $\sigma$ of $[N]$.
    Shuffle the samples according to $\sigma$: for $q \in [Q]$, $S_q = \big\{\sigma\big((q-1)\tfrac{N}{Q} + 1\big), \dots, \sigma\big(q\tfrac{N}{Q}\big)\big\}$.
    Compute the Gram matrices $\mathbf{K}_q$ and the Cholesky factors $\mathbf{L}_q$ on each block, $q \in [Q]$.
  end if
  Find the block$^a$ $q_m$ attaining the median: $\frac{[\mathbf{1}_{q_m}; -\mathbf{1}_{q_m}]^\top \mathbf{K}_{q_m} c_{q_m}}{|S_{q_m}|} = \mathrm{med}_{q \in [Q]}\frac{[\mathbf{1}_q; -\mathbf{1}_q]^\top \mathbf{K}_q c_q}{|S_q|}$.
  Update the coefficient vector: $c_{q_m} = \frac{[\mathbf{1}_{q_m}; -\mathbf{1}_{q_m}]}{\|\mathbf{L}_{q_m}^\top[\mathbf{1}_{q_m}; -\mathbf{1}_{q_m}]\|_2}$.
end for
Output: $\mathrm{med}_{q \in [Q]}\,\frac{1}{|S_q|}[\mathbf{1}_q; -\mathbf{1}_q]^\top \mathbf{K}_q c_q$.
$^a$ Here $\mathbf{1}_q \in \mathbb{R}^{|S_q|}$ denotes the vector of ones of size $|S_q|$.

Table 1: Computational complexity of the MMD estimators. $N$: sample size, $Q$: number of blocks, $T$: number of iterations.

| Method | Complexity |
| --- | --- |
| U-Stat | $O(N^2)$ |
| MONK BCD | $O\big(N^3 + T\,[N^2 + Q\log Q]\big)$ |
| MONK BCD-Fast | $O\big(N^3/Q^2 + T\,[N^2/Q + Q\log Q]\big)$ |

5. Numerical Illustrations

In this section, we demonstrate the performance of the proposed MONK estimators. We exemplify the idea on the MMD estimator [Eq. (6)] with the BCD optimization schemes (MONK BCD and MONK BCD-Fast) discussed in Section 4. Our baseline is the classical U-statistic-based MMD estimator [Eq. (8); referred to as U-Stat in the sequel].

The primary goal of the first set of experiments is to understand and demonstrate various aspects of the estimators for $(K, \mathbb{P}, \mathbb{Q})$ triplets (Muandet et al., 2017, Table 3.3) for which an analytical expression is available for MMD. This is the case for polynomial and RBF kernels ($K$) with Gaussian distributions ($\mathbb{P}$, $\mathbb{Q}$). Notice that in the first (second) case the features are unbounded (bounded). Our second numerical example illustrates the applicability of the studied MONK estimators in a biological context, in discriminating DNA subsequences with a string kernel.

Experiment-1: We used the quadratic kernel and the RBF kernel with bandwidth $\sigma = 1$ for demonstration purposes, and investigated the estimation error compared to the true MMD value: $|\widehat{\mathrm{MMD}}_Q(\mathbb{P}, \mathbb{Q}) - \mathrm{MMD}(\mathbb{P}, \mathbb{Q})|$. The errors were aggregated over 100 Monte Carlo simulations and are summarized by their median and quartile values. The number of samples ($N$) was chosen from $\{200, 400, \dots, 2000\}$. We considered three different experimental settings for $(\mathbb{P}, \mathbb{Q})$ and the absence/presence of outliers:

1. Gaussian distributions with no outliers: In this case $\mathbb{P} = \mathcal{N}(\mu_1, \sigma_1^2)$ and $\mathbb{Q} = \mathcal{N}(\mu_2, \sigma_2^2)$ were normal with $(\mu_1, \sigma_1) \ne (\mu_2, \sigma_2)$; $\mu_1, \sigma_1, \mu_2, \sigma_2$ were randomly chosen from the interval $[0, 1]$ and then their values were fixed. The estimators had access to $(x_n)_{n=1}^N$ i.i.d. from $\mathbb{P}$ and $(y_n)_{n=1}^N$ i.i.d. from $\mathbb{Q}$.

2. Gaussian distributions with outliers: This setting is a corrupted version of the first one. In particular, the dataset consisted of $(x_n)_{n=1}^{N-5}$ i.i.d. from $\mathbb{P}$ and $(y_n)_{n=1}^{N-5}$ i.i.d. from
$\mathbb{Q}$, while the remaining five samples in each set were set to $x_{N-4} = \dots = x_N = 2000$ and $y_{N-4} = \dots = y_N = 4000$.

3. Pareto distribution without outliers: In this case $\mathbb{P} = \mathbb{Q} = \mathrm{Pareto}(3)$, hence $\mathrm{MMD}(\mathbb{P}, \mathbb{Q}) = 0$, and the estimators used $(x_n)_{n=1}^N$ i.i.d. from $\mathbb{P}$ and $(y_n)_{n=1}^N$ i.i.d. from $\mathbb{Q}$.

The three experiments were constructed to probe different aspects of the estimators: how a few outliers can ruin classical estimators (as we move from Experiment-1 to Experiment-2), and, in Experiment-3, how the heaviness of the tail of the Pareto distribution makes the task non-trivial.

Our results on the three datasets with various choices of $Q$ are summarized in Fig. 1. As we can see from Fig. 1a and Fig. 1d, in the outlier-free case the MONK estimators are slower than the U-statistic-based one; the accuracy is of the same order for both kernels. As demonstrated by Fig. 1b, in the corrupted setup even a small number of outliers can completely ruin traditional MMD estimators for unbounded features, while the MONK estimators are naturally robust to outliers with a suitable choice of $Q$;⁵ this is precisely the setting the MONK estimators were designed for. In the case of bounded kernels (Fig. 1e), traditional MMD estimators are by construction resistant to outliers; the MONK BCD-Fast method achieves comparable performance. In the final Pareto experiment (Fig. 1c and Fig. 1f), where the distribution produces natural "outliers", the MONK estimators are again more robust with respect to corruption than the one relying on U-statistics in the case of the polynomial kernel. These experiments illustrate the power of the studied MONK schemes: the estimators achieve comparable performance in the case of bounded features, while for unbounded features they can efficiently cope with the presence of outliers.

⁵ In case of unknown $N_c$, one could choose $Q$ adaptively by the Lepski method (see for example Devroye et al. (2016)) at the price of increased computational effort. Though the resulting $Q$ would increase the computational time, it would be adaptive thanks to its data-driven nature, and would benefit from the same guarantees as the fixed $Q$ appearing in Theorems 1-2.

Experiment-2 (discrimination of DNA subsequences): In order to demonstrate the applicability of our estimators in a biological context, we chose a DNA benchmark from the UCI repository (Dheeru & Karra Taniskidou, 2017), the Molecular Biology (Splice-junction Gene Sequences) Data Set. The dataset consists of 3190 instances of 60-character-long DNA subsequences. The problem is to recognize, given a sequence of DNA, the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out). This task consists of two subproblems: identifying the exon/intron boundaries (referred to as EI sites) and the intron/exon boundaries (IE sites).⁶ We took 1532 of these samples by selecting 766 instances from both the EI and the IE classes (the class of sequences that are neither EI nor IE is more heterogeneous, and thus we dropped it from the study), and investigated the discriminability of the EI and IE categories. We represented the DNA sequences as strings ($\mathcal{X}$), chose $K$ to be the String Subsequence Kernel (Lodhi et al., 2002) to compute MMD, and performed two-sample testing based on MMD using the MONK BCD, MONK BCD-Fast and U-Stat estimators. For completeness, the pseudo-code of the hypothesis test is detailed in Algorithm 3 (Section D). $Q$, the number of blocks in the MONK techniques, was set to 5. The significance level was $\alpha = 0.05$.

⁶ In the biological community, IE borders are referred to as "acceptors" while EI borders are referred to as "donors".
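The test compares the MMD estimate against a resampled $(1-\alpha)$-quantile $\hat q_{1-\alpha}$ and rejects when $\widehat{\mathrm{MMD}} - \hat q_{1-\alpha} > 0$; the paper's exact procedure is Algorithm 3 in the supplement and is not reproduced here. The sketch below is therefore only a generic permutation-threshold test in that spirit (the pooling/permutation scheme, the function names and the numeric kernel interface are assumptions; $B = 150$ and $\alpha = 0.05$ follow the text), and it works with any MMD estimator, for example the mmd_u or monk_bcd_mmd sketches above.

```python
import numpy as np

def mmd_threshold_test(x, y, kernel, mmd_estimator, B=150, alpha=0.05, seed=0):
    """Two-sample test: reject H0 (P = Q) iff the MMD estimate exceeds the
    (1 - alpha)-quantile of its permutation (resampling) distribution."""
    rng = np.random.default_rng(seed)
    N = len(x)
    stat = mmd_estimator(kernel(x, x), kernel(y, y), kernel(x, y))
    pooled = np.concatenate([x, y])
    null_stats = np.empty(B)
    for b in range(B):
        perm = rng.permutation(2 * N)                # relabel the pooled sample
        xs, ys = pooled[perm[:N]], pooled[perm[N:]]
        null_stats[b] = mmd_estimator(kernel(xs, xs), kernel(ys, ys), kernel(xs, ys))
    q_hat = np.quantile(null_stats, 1.0 - alpha)
    return stat - q_hat > 0, stat - q_hat            # (reject?, MMD_hat - q_hat)
```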
To assess the variability of the results, 400 Monte Carlo simulations were performed, each time uniformly sampling $N$ points without replacement, resulting in $(X_n)_{n \in [N]}$ and $(Y_n)_{n \in [N]}$. To provide more detailed insight, the aggregated values of $\widehat{\mathrm{MMD}}(\mathrm{EI}, \mathrm{IE}) - \hat q_{1-\alpha}$, $\widehat{\mathrm{MMD}}(\mathrm{EI}, \mathrm{EI}) - \hat q_{1-\alpha}$ and $\widehat{\mathrm{MMD}}(\mathrm{IE}, \mathrm{IE}) - \hat q_{1-\alpha}$ are summarized in Fig. 2, where $\hat q_{1-\alpha}$ is the $(1-\alpha)$-quantile estimated via $B = 150$ bootstrap permutations. In the ideal case, $\widehat{\mathrm{MMD}} - \hat q_{1-\alpha}$ is positive (negative) in the inter-class (intra-class) experiments. As Fig. 2 shows, all three techniques are able to solve the task, both in the inter-class experiment (when the null hypothesis does not hold; Fig. 2a) and in the intra-class experiments (the null holds; Fig. 2b and Fig. 2c), and they converge to a good and stable performance as a function of the sample size. It is important to note that the MONK BCD-Fast method is especially well adapted to problems where the kernel computation (such as the String Subsequence Kernel) or the sample size is a bottleneck, as its computation is often significantly faster than that of the U-Stat technique. For example, taking all the samples ($N = 766$) in the DNA benchmark with $Q = 15$, computing MONK BCD-Fast takes 32s versus 1m28s for U-Stat. These results illustrate the applicability of our estimators in gene analysis.

Acknowledgements

Guillaume Lecué is supported by a grant of the French National Research Agency (ANR), Investissements d'Avenir (LabEx Ecodec/ANR-11-LABX-0047).

Figure 1: Performance of the MMD estimators: median and quartiles of $\ln\big(|\widehat{\mathrm{MMD}}_Q(\mathbb{P}, \mathbb{Q}) - \mathrm{MMD}(\mathbb{P}, \mathbb{Q})|\big)$. Columns from left to right: Experiment-1 to Experiment-3. Top row: quadratic kernel; bottom row: RBF kernel. Panels: (a) Gaussian distribution, $N_c = 0$ (no outlier), quadratic kernel; (b) Gaussian distribution, $N_c = 5$ outliers, quadratic kernel; (c) Pareto distribution, quadratic kernel; (d) Gaussian distribution, $N_c = 0$ (no outlier), RBF kernel; (e) Gaussian distribution, $N_c = 5$ outliers, RBF kernel; (f) Pareto distribution, RBF kernel.

Figure 2: Inter-class and intra-class MMD estimates as a function of the sample size, compared to the bootstrap-estimated $(1-\alpha)$-quantile: $\widehat{\mathrm{MMD}} - \hat q_{1-\alpha}$; mean ± std. The null hypothesis is rejected iff $\widehat{\mathrm{MMD}} - \hat q_{1-\alpha} > 0$. Panels: (a) inter-class EI-IE; (b) intra-class EI-EI; (c) intra-class IE-IE. Note the different scales of $\widehat{\mathrm{MMD}} - \hat q_{1-\alpha}$ in the inter-class and intra-class experiments.

References

Alam, M. A., Fukumizu, K., and Wang, Y.-P. Influence function and robust variant of kernel canonical correlation analysis. Neurocomputing, 304:12-29, 2018.

Alon, N., Matias, Y., and Szegedy, M. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1, part 2):137-147, 1999.

Aronszajn, N. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337-404, 1950.

Audibert, J.-Y. and Catoni, O. Robust linear least squares regression. The Annals of Statistics, 39(5):2766-2794, 2011.

Balasubramanian, K., Li, T., and Yuan, M. On the optimality of kernel-embedding based goodness-of-fit tests. Technical report, 2017. (https://arxiv.org/abs/1709.08148).

Baringhaus, L. and Franz, C. On a new multivariate two-sample test. Journal of Multivariate Analysis, 88:190-206, 2004.

Berlinet, A. and Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics.
Kluwer, 2004. Binkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. Demystifying MMD GANs. In ICLR, 2018. Blanchard, G., Deshmukh, A. A., Dogan, U., Lee, G., and Scott, C. Domain generalization by marginal transfer learning. Technical report, 2017. (https://arxiv. org/abs/1711.07910). Catoni, O. Challenging the empirical mean and empirical variance: a deviation study. Annales de l Institut Henri Poincar e Probabilit es et Statistiques, 48(4):1148 1185, 2012. Catoni, O. and Giulini, I. Dimension-free PAC-Bayesian bounds for matrices, vectors, and linear least squares regression. Technical report, 2017. (https://arxiv. org/abs/1712.02747). Cherapanamjeri, Y., Flammarion, N., and Bartlett, P. L. Fast mean estimation with sub-Gaussian rates. Technical report, 2019. (https://arxiv.org/abs/1902. 01998). Collins, M. and Duffy, N. Convolution kernels for natural language. In NIPS, pp. 625 632, 2001. Cuturi, M. Fast global alignment kernels. In ICML, pp. 929 936, 2011. Cuturi, M., Fukumizu, K., and Vert, J.-P. Semigroup kernels on measures. Journal of Machine Learning Research, 6: 1169 1198, 2005. Devroye, L., Lerasle, M., Lugosi, G., and Oliveira, R. I. Sub-Gaussian mean estimators. The Annals of Statistics, 44(6):2695 2725, 2016. Dheeru, D. and Karra Taniskidou, E. UCI machine learning repository, 2017. (http://archive.ics.uci. edu/ml). Dziugaite, G. K., Roy, D. M., and Ghahramani, Z. Training generative neural networks via maximum mean discrepancy optimization. In UAI, pp. 258 267, 2015. Fukumizu, K., Gretton, A., Sun, X., and Sch olkopf, B. Kernel measures of conditional dependence. In NIPS, pp. 498 496, 2008. Fukumizu, K., Song, L., and Gretton, A. Kernel Bayes rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14:3753 3783, 2013. G artner, T., Flach, P. A., Kowalczyk, A., and Smola, A. Multi-instance kernels. In ICML, pp. 179 186, 2002. Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Sch olkopf, B., and Smola, A. J. A kernel statistical test of independence. In NIPS, pp. 585 592, 2008. Gretton, A., Borgwardt, K. M., Rasch, M. J., Sch olkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13:723 773, 2012. Guevara, J., Hirata, R., and Canu, S. Cross product kernels for fuzzy set similarity. In FUZZ-IEEE, pp. 1 6, 2017. Gy orfi, L., Kohler, M., Krzyzak, A., and Walk, H. A Distribution-Free Theory of Nonparametric Regression. Springer, New-york, 2002. Harchaoui, Z. and Capp e, O. Retrospective mutiple changepoint estimation with kernels. In IEEE/SP 14th Workshop on Statistical Signal Processing, pp. 768 772, 2007. Harchaoui, Z., Bach, F., and Moulines, E. Testing for homogeneity with kernel Fisher discriminant analysis. In NIPS, pp. 609 616, 2007. Haussler, D. Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz, 1999. (http://cbse.soe.ucsc.edu/sites/ default/files/convolutions.pdf). Hein, M. and Bousquet, O. Hilbertian metrics and positive definite kernels on probability measures. In AISTATS, pp. 136 143, 2005. MONK Outlier-Robust Mean Embedding Estimation by Median-of-Means Hopkins, S. B. Mean estimation with sub-Gaussian rates in polynomial time. Technical report, 2018. (https: //arxiv.org/abs/1809.07425). Jebara, T., Kondor, R., and Howard, A. Probability product kernels. Journal of Machine Learning Research, 5:819 844, 2004. Jerrum, M. R., G.Valiant, L., and V.Vazirani, V. 
Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43(2-3):169 188, 1986. Jiao, Y. and Vert, J.-P. The Kendall and Mallows kernels for permutations. In ICML (PMLR), volume 37, pp. 2982 2990, 2016. Jitkrittum, W., Xu, W., Szab o, Z., Fukumizu, K., and Gretton, A. A linear-time kernel goodness-of-fit test. In NIPS, pp. 261 270, 2017. Kashima, H. and Koyanagi, T. Kernels for semi-structured data. In ICML, pp. 291 298, 2002. Kim, B., Khanna, R., and Koyejo, O. O. Examples are not enough, learn to criticize! criticism for interpretability. In NIPS, pp. 2280 2288, 2016. Kim, J. and Scott, C. D. Robust kernel density estimation. Journal of Machine Learning Research, 13:2529 2565, 2012. Klebanov, L. N-Distances and Their Applications. Charles University, Prague, 2005. Klus, S., Schuster, I., and Muandet, K. Eigendecompositions of transfer operators in reproducing kernel Hilbert spaces. Technical report, 2018. (https://arxiv. org/abs/1712.01572). Klus, S., Bittracher, A., Schuster, I., and Sch utte, C. A kernel-based approach to molecular conformation analysis. The Journal of Chemical Physics, 149:244109, 2019. Koltchinskii, V. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Springer, 2011. Koltchinskii, V. and Mendelson, S. Bounding the smallest singular value of a random matrix without concentration. International Mathematics Research Notices, (23):12991 13008, 2015. Kondor, R. and Pan, H. The multiscale Laplacian graph kernel. In NIPS, pp. 2982 2990, 2016. Kusano, G., Fukumizu, K., and Hiraoka, Y. Persistence weighted Gaussian kernel for topological data analysis. In ICML, pp. 2004 2013, 2016. Law, H. C. L., Sutherland, D. J., Sejdinovic, D., and Flaxman, S. Bayesian approaches to distribution regression. AISTATS (PMLR), 84:1167 1176, 2018. Le Cam, L. Convergence of estimates under dimensionality restrictions. The Annals of Statistics, 1:38 53, 1973. Lecu e, G. and Lerasle, M. Learning from MOM s principles: Le Cam s approach. Stochastic Processes and their Applications, 2018. (https://doi.org/10.1016/ j.spa.2018.11.024). Lecu e, G. and Lerasle, M. Robust machine learning via median of means: theory and practice. Annals of Statistics, 2019. (To appear; preprint: https://arxiv.org/ abs/1711.10306). Ledoux, M. and Talagrand, M. Probability in Banach spaces. Springer-Verlag, 1991. Li, Y., Swersky, K., and Zemel, R. Generative moment matching networks. In ICML (PMLR), pp. 1718 1727, 2015. Lloyd, J. R., Duvenaud, D., Grosse, R., Tenenbaum, J. B., and Ghahramani, Z. Automatic construction and naturallanguage description of nonparametric regression models. In AAAI Conference on Artificial Intelligence, pp. 1242 1250, 2014. Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., and Watkins, C. Text classification using string kernels. Journal of Machine Learning Research, 2:419 444, 2002. Lugosi, G. and Mendelson, S. Risk minimization by median-of-means tournaments. Journal of the European Mathematical Society, 2019a. (To appear; preprint: https://arxiv.org/abs/1608.00757). Lugosi, G. and Mendelson, S. Sub-gaussian estimators of the mean of a random vector. Annals of Statistics, 47(2): 783 794, 2019b. Martins, A. F. T., Smith, N. A., Xing, E. P., Aguiar, P. M. Q., and Figueiredo, M. A. T. Nonextensive information theoretic kernels on measures. The Journal of Machine Learning Research, 10:935 975, 2009. Mendelson, S. Learning without concentration. Journal of the ACM, 62(3):21:1 21:25, 2015. Minsker, S. 
Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308 2335, 2015. Minsker, S. and Strawn, N. Distributed statistical estimation and rates of convergence in normal approximation. Technical report, 2017. (https://arxiv.org/ abs/1704.02658). MONK Outlier-Robust Mean Embedding Estimation by Median-of-Means Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J., and Sch olkopf, B. Distinguishing cause from effect using observational data: Methods and benchmarks. Journal of Machine Learning Research, 17:1 102, 2016. Muandet, K., Fukumizu, K., Dinuzzo, F., and Sch olkopf, B. Learning from distributions via support measure machines. In NIPS, pp. 10 18, 2011. Muandet, K., Sriperumbudur, B. K., Fukumizu, K., Gretton, A., and Sch olkopf, B. Kernel mean shrinkage estimators. Journal of Machine Learning Research, 17:1 41, 2016. Muandet, K., Fukumizu, K., Sriperumbudur, B., and Sch olkopf, B. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1 141, 2017. M uller, A. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29: 429 443, 1997. Nemirovski, A. S. and Yudin, D. B. Problem complexity and method efficiency in optimization. John Wiley & Sons Ltd., 1983. Park, M., Jitkrittum, W., and Sejdinovic, D. K2-ABC: Approximate Bayesian computation with kernel embeddings. In AISTATS (PMLR), volume 51, pp. 51:398 407, 2016. Pfister, N., B uhlmann, P., Sch olkopf, B., and Peters, J. Kernel-based tests for joint independence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2017. Raj, A., Law, H. C. L., Sejdinovic, D., and Park, M. A differentially private kernel two-sample test. Technical report, 2018. (https://arxiv.org/abs/1808. 00380). Sch olkopf, B. and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002. Sch olkopf, B., Herbrich, R., and Smola, A. J. A generalized representer theorem. In COLT, pp. 416 426, 2001. Sch olkopf, B., Muandet, K., Fukumizu, K., Harmeling, S., and Peters, J. Computing functions of random variables via reproducing kernel Hilbert space representations. Statistics and Computing, 25(4):755 766, 2015. Sejdinovic, D., Sriperumbudur, B. K., Gretton, A., and Fukumizu, K. Equivalence of distance-based and RKHSbased statistics in hypothesis testing. Annals of Statistics, 41:2263 2291, 2013. Sinova, B., Gonz alez-Rodr ıguez, G., and Aelst, S. V. Mestimators of location for functional data. Bernoulli, 24: 2328 2357, 2018. Smola, A., Gretton, A., Song, L., and Sch oolkopf, B. A Hilbert space embedding for distributions. In ALT, pp. 13 31, 2007. Song, L., Gretton, A., Bickson, D., Low, Y., and Guestrin, C. Kernel belief propagation. In AISTATS, pp. 707 715, 2011. Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Sch olkopf, B., and Lanckriet, G. R. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517 1561, 2010. Steinwart, I. and Christmann, A. Support Vector Machines. Springer, 2008. Szab o, Z. Information theoretical estimators toolbox. Journal of Machine Learning Research, 15:283 287, 2014. Szab o, Z., Sriperumbudur, B., P oczos, B., and Gretton, A. Learning theory for distribution regression. Journal of Machine Learning Research, 17(152):1 40, 2016. Sz ekely, G. J. and Rizzo, M. L. Testing for equal distributions in high dimension. Inter Stat, 5, 2004. Sz ekely, G. J. and Rizzo, M. L. 
A new test for multivariate normality. Journal of Multivariate Analysis, 93:58 80, 2005. Tolstikhin, I., Sriperumbudur, B. K., and Sch olkopf, B. Minimax estimation of maximal mean discrepancy with radial kernels. In NIPS, pp. 1930 1938, 2016. Tolstikhin, I., Sriperumbudur, B. K., and Muandet, K. Minimax estimation of kernel mean embeddings. Journal of Machine Learning Research, 18:1 47, 2017. Vandermeulen, R. and Scott, C. Consistency of robust kernel density estimators. In COLT (PMLR), volume 30, pp. 568 591, 2013. Vapnik, V. N. The nature of statistical learning theory. Statistics for Engineering and Information Science. Springer-Verlag, New York, second edition, 2000. Vishwanathan, S. N., Schraudolph, N. N., Kondor, R., and Borgwardt, K. M. Graph kernels. Journal of Machine Learning Research, 11:1201 1242, 2010. Yamada, M., Umezu, Y., Fukumizu, K., and Takeuchi, I. Post selection inference with kernels. In AISTATS (PMLR), volume 84, pp. 152 160, 2018. MONK Outlier-Robust Mean Embedding Estimation by Median-of-Means Zaheer, M., Kottur, S., Ravanbakhsh, S., P oczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In NIPS, pp. 3394 3404, 2017. Zhang, K., Sch olkopf, B., Muandet, K., and Wang, Z. Domain adaptation under target and conditional shift. Journal of Machine Learning Research, 28(3):819 827, 2013. Zinger, A. A., Kakosyan, A. V., and Klebanov, L. B. A characterization of distributions by mean values of statistics and certain probabilistic metrics. Journal of Soviet Mathematics, 1992. Zolotarev, V. M. Probability metrics. Theory of Probability and its Applications, 28:278 302, 1983.