# distribution_regression_with_sliced_wasserstein_kernels__70201513.pdf Distribution Regression with Sliced Wasserstein Kernels Dimitri Meunier 1 Massimiliano Pontil 2 3 Carlo Ciliberto 3 The problem of learning functions over spaces of probabilities or distribution regression is gaining significant interest in the machine learning community. A key challenge behind this problem is to identify a suitable representation capturing all relevant properties of the underlying functional mapping. A principled approach to distribution regression is provided by kernel mean embeddings, which lifts kernel-induced similarity on the input domain at the probability level. This strategy effectively tackles the two-stage sampling nature of the problem, enabling one to derive estimators with strong statistical guarantees, such as universal consistency and excess risk bounds. However, kernel mean embeddings implicitly hinge on the maximum mean discrepancy (MMD), a metric on probabilities, which may fail to capture key geometrical relations between distributions. In contrast, optimal transport (OT) metrics, are potentially more appealing. In this work, we propose an OT-based estimator for distribution regression. We build on the Sliced Wasserstein distance to obtain an OT-based representation. We study the theoretical properties of a kernel ridge regression estimator based on such representation, for which we prove universal consistency and excess risk bounds. Preliminary experiments complement our theoretical findings by showing the effectiveness of the proposed approach and compare it with MMD-based estimators. 1Gatsby Computational Neuroscience Unit, University College London, London, United Kingdom 2Italian Institute of Technology, Genoa, Italy 3Department of Computer Science, University College London, London, United Kingdom. Correspondence to: Dimitri Meunier . Proceedings of the 39 th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s). 1. Introduction Distribution regression studies the problem of learning a real-valued function defined on the space of probability distributions from a sequence of input/output observations. A peculiar feature of this setting is that, typically, the input distributions are not observed directly but only accessed through finite samples, adding complexity to the problem (P oczos et al., 2013). A typical example in the literature (see Szab o et al., 2016; Fang et al., 2020) is the task of predicting some health indicator of patients, labeled by t, based on their distribution Pt over results of clinical investigations, e.g. blood measurements. The underlying distribution of test results for each patient is observed only through repeated medical tests xt,i Pt, i = 1, . . . , nt, forming an approximation of Pt through ˆPt,nt = 1 nt Pnt i=1 δxi. Given labeled examples (ˆPt,nt, yt)T t=1, where yt is the health indicator, we seek to learn an accurate mapping f : ˆPn y for a new patient identified by P. Other examples of interest for the machine learning community include the task of predicting summary statistics of distributions such as the entropy or the number of modes (Oliva et al., 2014) or for conditional meta-learning (Denevi et al., 2020). Historically, distribution regression has been tackled by means of a two-stage procedure, that build upon the kernel mean embedding machinery (Smola et al., 2007; Muandet et al., 2017). A first positive definite (p.d.) 
kernel (Smola & Sch olkopf, 1998) defined on the underlying data space is used to embed the distribution ˆPt,nt into an Hilbert space, and then a second kernel is used as a similarity measure between the embedded distributions to perform Kernel Ridge Regression (KRR). This approach has been theoretically investigated in Szab o et al. (2016) and further analyzed in Fang et al. (2020). More recently, a SGD variant (M ucke, 2021) and a robust variant (Yu et al., 2021) have been suggested. The two stages above allow to build a p.d. kernel on the space of probability distributions and provide an example of distance substitution kernel, a class of kernels that can be built from distances on metric spaces (Haasdonk & Bahlmann, 2004). In that particular setting the chosen distance is the Maximum Mean Discrepancy (MMD), a metric on probability distributions (Gretton et al., 2012). A main theme of this paper is that distance substitution kernels Distribution Regression with Sliced Wasserstein Kernels based on optimal transport (with the Wasserstein distance) might be preferable over MMD-based ones, as they may better capture the geometry of the space of distributions (Peyr e et al., 2019). The above observations lead us to introduce a Wassersteinbased p.d. kernel for distribution regression. As the original Wasserstein distance cannot be used to build a p.d. kernel on the space of distributions our construction exploits the Sliced-Wasserstein (SW) distance (Bonneel et al., 2015), a computationally cheapest alternative to the standard Wasserstein distance. SW is raising significant interest in the machine learning community due to its attractive topological and statistical properties (Nadjahi et al., 2019; 2020; Nadjahi, 2021) and its strong numerical performances (Kolouri et al., 2018; 2019; Deshpande et al., 2018; 2019; Lee et al., 2019) and references therein. A SW-based kernel was first introduced in Kolouri et al. (2016), but their construction was restricted to the subset of absolutely continuous distributions. Hence, this approach is not suitable for the distribution regression problem where we need to handle empirical distributions of the form ˆPt,n. In parallel, a SW-based kernel has been developed for persistence diagrams in topological data analysis Carriere et al. (2017). Bachoc et al. (2017); Trang et al. (2021) defined a p.d. kernel based on the Wasserstein distance valid for any probability measure defined on R; while Bachoc et al. (2020) used a similar extension to ours to Rr, r 1 focusing on Gaussian process distribution regression. In this paper however, we focus on statistical properties of the estimator within a kernel method setting. Finally, there are deep learning heuristic methods to tackle the problem of learning from distributions such as deep sets Zaheer et al. (2017) and references therein. While they might perform well in practice, they are harder to analyse theoretically and fall outside the scope of kernel methods which is the focus of this present work. Contributions. This paper provides four main contributions: 1) we present a Wasserstein-based p.d. 
kernel valid for any probability distribution on Rr; 2) we show that such kernel is universal in the sense of Christmann & Steinwart (2010), allowing us to asymptotically learn any distribution regression function; 3) we provide excess risk bounds for the KRR estimator based on such kernels, which implies its consistency; 4) finally, we provide empirical evidences that the SW kernels behave better than MMD kernels on a number of distribution regression tasks. Paper organisation. The rest of the paper is organized as follows: Sec. 2 introduces the distribution regression setting and Sec. 3 reviews distance substitution kernels as a way to approach such problems. Sec. 4 shows that the SW distance can be used to obtain universal distance substitution kernels. Sec. 5 characterizes the generalization properties of distribution regression estimators based on distance substitution kernels. Sec. 6 reports empirical evidences supporting the adoption of SW kernels in practical applications. 2. Distribution regression Tools and notations. By convention, vectors v Rr are seen as r 1 matrices and v denotes their Euclidean norm. A denotes the transpose of any r k matrix A, Ir Rr r the identity matrix and I the identity operator. For a measurable space (Ω, A) we denote by M1 +(Ω) the space of probability measures on (Ω, A). Throughout most of the paper we use M1 +(Ωr) with Ωr a compact subset of Rr. For two measurable spaces (Ω, A), (X, B) and a measurable map T : Ω X, the push-forward operator T# : M1 +(Ω) M1 +(X) is defined such that for all α M1 +(Ω) and for all measurable set B B, T#(α)(B) = α(T 1(B)). Sr 1 = {x Rr | x 2 = 1} denotes the unit sphere in Rr. λr is the Lebesgue measure on Rr and σr is the uniform probability measure on the unit sphere, we use dx instead of λr(dx) and dθ instead of σr(dθ) when there is no ambiguity. For a vector θ Sr 1, θ : Rr R, x 7 x, θ . For a univariate probability measure α M1 +(R), we denote by Fα its cumulative distribution function, Fα(x) = α(( , x]) [0, 1], by F [ 1] α its generalised inverse cumulative distribution function, F [ 1] α (x) = inf{t R : Fα(t) x}, and by F 1 α its inverse cumulative distribution function, if it exists. On a measure space (Ω, A, µ) we denote by L2(Ω, µ), abbreviated L2 µ if the context is clear, the Hilbert space of square-integrable functions with respect to µ from Ω to R with norm L2µ induced by the inner product f, g L2µ = R Ωf(w)g(w)dµ(w). Finally, for a compact set Ω, U(Ω) denotes the uniform distribution on Ω. Problem set-up. Linear regression considers Euclidean vectors as covariate while kernel regression extends the framework to any covariate space X on which we can define a p.d. kernel K : X X R (Hofmann et al., 2008). In distribution regression, X is a space of probability measures with the additional difficulty that the covariates P X cannot be directly observed (P oczos et al., 2013). Each P X is only accessed through i.i.d. samples {xi}n i=1 P, which forms a noisy measurement of P. The formal problem set up is as follows. X := M1 +(Ωr) is a space of probability distributions, we refer to X as the input space and Ωr as the (compact) data space. Y R is a real subset and ρ is an unknown distribution on Z := X Y . ρX is its marginal distribution on X and ρY |X is its conditional distribution: ρ(P, y) = ρY |X(y | P)ρX(P). The goal is to find a measurable map f : X Y with low expected risk Z (f(P) y)2dρ(P, y). (1) Distribution Regression with Sliced Wasserstein Kernels The minimizer of this functional (see Prop. 
11) is the regression function
$$f_\rho(P) = \int_Y y \, \mathrm{d}\rho_{Y|X}(y \mid P), \qquad P \in X, \qquad (2)$$
which is not accessible since $\rho$ is unknown. To approximate $f_\rho$ we draw a set of input-label pairs $D = (P_t, y_t)_{t=1}^T \overset{\text{i.i.d.}}{\sim} \rho$. We call this the first stage sampling. If we were in a standard regression setting, i.e. if the inputs $P_t$ were directly observable, given a p.d. kernel $K$ on $X$, one approach would be to apply KRR by minimizing the regularized empirical risk:
$$\mathcal{E}_{T,\lambda}(f) = \frac{1}{T}\sum_{t=1}^T \big(f(P_t) - y_t\big)^2 + \lambda \|f\|_{\mathcal{H}_K}^2 \qquad (3)$$
where $\mathcal{H}_K$ is the Reproducing Kernel Hilbert Space (RKHS) induced by $K$. We refer to this quantity as the first stage empirical risk and denote its minimizer by $f_{D,\lambda}$. The difference with standard regression comes from the second stage of sampling where, instead of directly observing the covariates, we receive for each input a bag of samples: $\hat{D} = \{((x_{t,i})_{i=1}^n, y_t)\}_{t=1}^T$, $(x_{t,i})_{i=1}^n \overset{\text{i.i.d.}}{\sim} P_t$. For ease of notation, we assume without loss of generality that each bag has the same number of points $n$. The strategy is then to plug the empirical distributions $\hat{P}_{t,n} = \frac{1}{n}\sum_{i=1}^n \delta_{x_{t,i}}$ into the first stage risk, leading to the second stage risk:
$$\mathcal{E}_{T,n,\lambda}(f) = \frac{1}{T}\sum_{t=1}^T \big(f(\hat{P}_{t,n}) - y_t\big)^2 + \lambda \|f\|_{\mathcal{H}_K}^2. \qquad (4)$$
We denote by $f_{\hat{D},\lambda}$ its minimizer.

Kernel distribution regression. Equipped with a p.d. kernel $K : X \times X \to \mathbb{R}$, minimizing $\mathcal{E}_{T,n,\lambda}$ over $\mathcal{H}_K$ leads to a practical algorithm. For a new test point $P \in X$,
$$f_{\hat{D},\lambda}(P) = (y_1, \ldots, y_T)\,(\mathbf{K}_T + \lambda T I_T)^{-1} \mathbf{k}_P \qquad (5)$$
where $[\mathbf{K}_T]_{t,l} = K(\hat{P}_{t,n}, \hat{P}_{l,n}) \in \mathbb{R}^{T \times T}$ and $\mathbf{k}_P = \big(K(P, \hat{P}_{1,n}), \ldots, K(P, \hat{P}_{T,n})\big)^\top \in \mathbb{R}^T$ (see Prop. 12). Let us note that in practice the new test points will also have the form of empirical distributions $\hat{P}_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$. The algorithms differ only in which kernel is used on $X$.

3. Distance substitution kernels

In this section, we derive two types of kernels that can be defined on $X$. We approach this through the prism of distance substitution kernels. Let us assume that we have a set $M$ on which we want to define a p.d. kernel and we are given a pseudo-distance $d$ on $M$. Pseudo means that $d$ might not satisfy $d(x, y) = 0 \Rightarrow x = y$ for all $(x, y) \in M^2$. One could think of using $d$ to define a p.d. kernel on $M$ by substituting $d$ for the Euclidean distance in the usual Euclidean kernels (Haasdonk & Bahlmann, 2004). For example,
$$K_{\text{Gauss}}(x, y) = e^{-\gamma d(x,y)^2}, \qquad (6)$$
(with $x, y \in M$) generalizes the well-known Gaussian kernel $K_{\text{Gauss}}(x, y) = e^{-\gamma \|x-y\|^2}$ (with $x, y \in \mathbb{R}^r$). Analogously, given an origin point $x_0 \in M$, the kernel
$$\langle x, y \rangle_d^{x_0} := \tfrac{1}{2}\big(d(x, x_0)^2 + d(y, x_0)^2 - d(x, y)^2\big), \qquad (7)$$
generalizes the linear kernel (with origin in $0$)
$$\langle x, y \rangle = \tfrac{1}{2}\big(\|x\|^2 + \|y\|^2 - \|x - y\|^2\big). \qquad (8)$$
However, this type of substitution does not always lead to a valid kernel. In fact, the class of distances such that $K_{\text{Gauss}}$ and $\langle \cdot, \cdot \rangle_d^{x_0}$ are p.d. is completely characterised by the notion of Hilbertian (pseudo-)distance (Hein & Bousquet, 2005). A pseudo-distance $d$ is said to be Hilbertian if there exists a Hilbert space $F$ and a feature map $\Phi : M \to F$ such that $d(x, y) = \|\Phi(x) - \Phi(y)\|_F$ for all $x, y \in M$. The functions $K_{\text{Gauss}}$ and $\langle \cdot, \cdot \rangle_d^{x_0}$ are p.d. kernels if and only if the corresponding $d$ is Hilbertian. We refer to such kernels as distance substitution kernels and we denote them by $K(d)$ (Proposition 13 in the Appendix provides more details on distance substitution kernels).

Remark 1. A Hilbertian pseudo-distance $d$ is a proper distance if and only if $\Phi$ is injective.

Remark 2. For a Hilbertian distance $d$, $\langle x, y \rangle_d^{x_0} = \langle \Phi(x) - \Phi(x_0), \Phi(y) - \Phi(x_0) \rangle_F$.

Hence defining p.d. kernels on $M = X$ boils down to finding Hilbertian distances on $X$. Below we give two examples of such distances.

MMD-based kernel.
For a bounded kernel KΩdefined on the data space Ωr and its RKHS HΩ, the Maximum Mean Discrepancy (MMD) pseudo-distance is for all (P, Q) X2 d MMD(P, Q) = µKΩ(P) µKΩ(Q)) HΩ, (9) where µKΩ: P 7 R Ωr KΩ(x, )d P(x) HΩis the kernel mean embedding operator (Berlinet & Thomas-Agnan, 2011). d MMD is by construction Hilbertian with F = HΩ and Φ = µKΩ. Note that using MMD to build a kernel on X requires two kernels, a kernel on Ωr to build d MMD and the outer kernel K(d MMD). Previous theoretical studies of distribution regression focused exclusively on MMD-based kernels (Szab o et al., 2016; Fang et al., 2020). We will show that the established theory for MMD-based distribution regression can be extended to a larger class of distance substitution kernels. Remark 3. d MMD is a proper distance if and only if KΩis characteristic, i.e. µKΩis injective. Distribution Regression with Sliced Wasserstein Kernels Fourier Kernel. Let P X, its Fourier transform F (P) is defined by F (P) (t) = R ei x,t d P(x), t Rr. In Christmann & Steinwart (2010) the distance d F (P, Q) = F (P) F (Q) L2µ (10) was considered, where µ is a finite Borel measure with support over Rr. d F is clearly Hilbertian and is a proper distance since the Fourier transform is injective. Universality of Gaussian-like kernels. Kernels of the form (6) with both the MMD and the Fourier distances can be shown to be universal for X equipped with the topology of the weak convergence in probability1. Definition 1. A continuous kernel K on a compact metric space (X, d X ) is called universal if the RKHS HK of K is dense in C(X, d X ), i.e. for every function g : X R continuous with respect to d X and all ε > 0 there exists an f HK such that f g := supx X |f(x) g(x)| ε. Universality is an important property that indicates when a RKHS is powerful enough to approximate any continuous function with arbitrarily precision. We recall here a fundamental result regarding universality that we will use in the next section. Theorem 4 (Christmann & Steinwart (2010), Theorem 2.2). On a compact metric space (X, d X ) and for a continuous and injective map Φ : X F where F is a separable Hilbert space, K(x, y) = e γ Φ(x) Φ(x ) 2 F is universal. For X = X and d P the Prokhorov distance, which induced the topology describing the weak convergence of probability measures, Christmann & Steinwart (2010) uses this result to show that MMD (when the inner kernel is characteristic) and Fourier based Gaussian-like kernels are universal. 4. Sliced Wasserstein Kernels We introduce the tools needed to define the Sliced Wasserstein kernels and give the first proof that SW Gaussian-like kernels are universal. Arising from the field of optimal transport (Villani, 2009), the Wasserstein metric is a natural distance to compare probability distributions. Wasserstein distance. For p 1, the p Wasserstein distance on X = M1 +(Ωr) is d Wp(P, Q)p := inf π U(P,Q) Rr Rr x y p 2dπ(x, y), (11) where U(P, Q) := {π X X | P 1 #π = P, P 2 #π = Q} and P i : Rr Rr Rr, (x1, x2) 7 xi, i = 1, 2. 1(Pn)n N XN is said to converge weakly to P X if for any continuous function f : Ωr R (recall that Ωr is compact), limn + R f d Pk = R f d P. One dimensional optimal transport. The Wasserstein distance usually cannot be computed in closed form. For empirical distributions, it requires to solve a linear program that becomes significantly expensive in high-dimension. 
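To make this cost concrete, here is a minimal sketch (ours, not from the paper; `scipy.optimize.linprog` and the variable names are assumptions) of the linear program behind $d_{W_p}$ for two empirical measures: the unknown is the $n \times m$ transport plan, so the program has $nm$ variables and $n + m$ equality constraints, which is what becomes expensive as the supports grow.

```python
# Hedged sketch: Wasserstein distance between two uniform empirical measures
# via the linear-programming formulation (illustrative, not the paper's code).
import numpy as np
from scipy.optimize import linprog

def wasserstein_lp(x, y, p=2):
    """Return d_Wp(alpha, beta)^p for x of shape (n, r) and y of shape (m, r)."""
    n, m = len(x), len(y)
    # cost matrix C[i, j] = ||x_i - y_j||^p, flattened to match the plan vector
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2) ** p
    # marginal constraints: each row of the plan sums to 1/n, each column to 1/m
    a_eq = np.zeros((n + m, n * m))
    for i in range(n):
        a_eq[i, i * m:(i + 1) * m] = 1.0      # row-sum constraints
    for j in range(m):
        a_eq[n + j, j::m] = 1.0               # column-sum constraints
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])
    res = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# toy usage: two small point clouds in R^3
rng = np.random.default_rng(0)
print(wasserstein_lp(rng.normal(size=(5, 3)), rng.normal(size=(6, 3))))
```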
However, when $r = 1$, the following closed form expression holds (Santambrogio, 2015, Proposition 2.17)
$$d_{W_p}(P, Q) = \big\| F^{[-1]}_P - F^{[-1]}_Q \big\|_{L^p(0,1)}, \qquad (12)$$
where $L^p(0, 1)$ is the set of $p$-integrable functions w.r.t. the Lebesgue measure on $(0, 1)$. If $P$ and $Q$ are empirical distributions, this can be handily computed by sorting the points and computing the average distance between the sorted samples; see Appendix D for more details. In what follows, we will consider either $p = 1$ or $p = 2$.

Sliced Wasserstein distance. The closed-form expression of the one-dimensional Wasserstein distance can be lifted to $r > 1$ using the mechanism of sliced divergences (Nadjahi et al., 2020). The Sliced Wasserstein (SW) distance is defined as
$$d_{SW_p}(P, Q)^p = \int_{\mathbb{S}^{r-1}} d_{W_p}\big(\theta^*_\# P, \theta^*_\# Q\big)^p \, \mathrm{d}\theta. \qquad (13)$$
The distance averages the one-dimensional distances obtained by projecting the data in every direction $\theta \in \mathbb{S}^{r-1}$. In practice, the integral over $\mathbb{S}^{r-1}$ is approximated by sampling directions uniformly, see Appendix D. $d_{SW_p}$ was introduced in Bonneel et al. (2015) in the context of Wasserstein barycenters and has since been heavily used as a computationally cheaper alternative to the Wasserstein distance. In addition to keeping a closed form expression, $d_{SW_p}$ inherits useful topological properties from $d_{W_p}$: it defines a distance on $\mathcal{M}^1_+(\Omega_r)$ whose convergence implies the weak convergence of probability measures (Nadjahi et al., 2019; 2020). More importantly, $d_{SW_1}$ and $d_{SW_2}$ can be used to build a p.d. kernel on $\mathcal{M}^1_+(\Omega_r)$, while this would not be possible with the original Wasserstein distance, since the induced Wasserstein space has curvature (Feragen et al., 2015). Indeed, we now show that $\sqrt{d_{SW_1}}$ and $d_{SW_2}$ are Hilbertian distances. Leveraging the observations in Sec. 3, we conclude that $e^{-\gamma d_{SW_2}^2(\cdot,\cdot)}$ and $e^{-\gamma d_{SW_1}(\cdot,\cdot)}$ are valid kernels on probability measures.

SW2-kernel. To show that $d_{SW_2}$ is Hilbertian we build an explicit feature map and feature space.

Proposition 5. $d_{SW_2}$ is Hilbertian.

Proof. Taking $F = L^2\big((0, 1) \times \mathbb{S}^{r-1}, \lambda_1 \otimes \sigma_r\big)$ and
$$\Phi : X \to F, \quad P \mapsto \Big( (t, \theta) \mapsto F^{[-1]}_{\theta^*_\# P}(t) \Big), \qquad (14)$$
we have
$$d_{SW_2}(P, Q)^2 = \int_{\mathbb{S}^{r-1}} \big\| F^{[-1]}_{\theta^*_\# P} - F^{[-1]}_{\theta^*_\# Q} \big\|^2_{L^2([0,1])} \mathrm{d}\theta = \int_{\mathbb{S}^{r-1}} \int_0^1 \Big( F^{[-1]}_{\theta^*_\# P}(t) - F^{[-1]}_{\theta^*_\# Q}(t) \Big)^2 \mathrm{d}t \, \mathrm{d}\theta = \| \Phi(P) - \Phi(Q) \|^2_F,$$
as desired. Note that since we already know that $d_{SW_2}$ is a proper distance, it implies that $\Phi$ is injective.

Comparison with Kolouri et al. (2016). Distance substitution kernels with $d_{SW_2}$ were originally proposed in Kolouri et al. (2016), where it was first shown that such a distance is Hilbertian. However, those results were limited to probability distributions admitting a density with respect to the Lebesgue measure. This was due to a different choice of feature map $\Phi$ in the proof of Prop. 5, based on the Radon transform. In dimension one, when a probability $P \in \mathcal{M}^1_+(\Omega_1)$ admits a density $f = \frac{\mathrm{d}P}{\mathrm{d}\lambda_1}$, the 2-Wasserstein distance is
$$d_{W_2}(P, Q)^2 = \int_{\mathbb{R}} \big( F^{[-1]}_Q \circ F_P(t) - t \big)^2 f(t) \, \mathrm{d}t.$$
If $P \in \mathcal{M}^1_+(\Omega_r)$ admits a density $f$, then for any direction $\theta \in \mathbb{S}^{r-1}$, $\theta^*_\# P$ admits the density $\mathcal{R}f(\cdot, \theta)$ on $\mathbb{R}$, with $\mathcal{R}$ the Radon transform. The SW distance is then
$$d_{SW_2}(P, Q)^2 = \int_{\mathbb{S}^{r-1}} \int_{\mathbb{R}} \big( F^{[-1]}_{\theta^*_\# Q} \circ F_{\theta^*_\# P}(t) - t \big)^2 \mathcal{R}f(t, \theta) \, \mathrm{d}t \, \mathrm{d}\theta.$$
Showing that $d_{SW_2}$ is Hilbertian using the above characterization is less direct than our approach, since it requires introducing a reference measure $P_0$ admitting a density. By defining $\Phi_{P_0}(P) : (t, \theta) \mapsto \big( F^{[-1]}_{\theta^*_\# P} \circ F_{\theta^*_\# P_0}(t) - t \big)\sqrt{\mathcal{R}f_{P_0}(t, \theta)}$, with $f_{P_0}$ the density of $P_0$, it can be shown that $d_{SW_2}$ is Hilbertian, namely $d_{SW_2}(P, Q) = \| \Phi_{P_0}(P) - \Phi_{P_0}(Q) \|_{L^2(\mathbb{R} \times \mathbb{S}^{r-1})}$. In practice, this raises the question of how to choose the reference measure (and how it will impact the downstream learning algorithm).
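In contrast, the feature map (14) needs no reference measure. As a purely illustrative aside (our own sketch, assuming NumPy; names are ours, and the general unequal-size case is treated in Appendix D), the Monte Carlo approximation of $d_{SW_2}$ between two empirical measures with the same number of points reduces to projecting, sorting, and averaging, after which the Gaussian-like substitution kernel of Sec. 3 is one line.

```python
# Hedged sketch of d_SW2 between two empirical measures (equal sample sizes)
# and of the Gaussian SW2 substitution kernel built from it.
import numpy as np

def sliced_wasserstein2(x, y, n_proj=100, rng=None):
    """Monte Carlo estimate of d_SW2 for point clouds x, y of shape (n, r)."""
    rng = np.random.default_rng(rng)
    r = x.shape[1]
    theta = rng.normal(size=(n_proj, r))                 # random directions
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # project onto S^{r-1}
    # sorted 1D projections play the role of the quantile functions in (12)
    x_proj = np.sort(x @ theta.T, axis=0)                # shape (n, n_proj)
    y_proj = np.sort(y @ theta.T, axis=0)
    return np.sqrt(np.mean((x_proj - y_proj) ** 2))      # average over k and theta

def gaussian_sw2_kernel(x, y, gamma=1.0, n_proj=100, rng=0):
    """Distance substitution kernel exp(-gamma * d_SW2^2) as in (6)."""
    return np.exp(-gamma * sliced_wasserstein2(x, y, n_proj, rng) ** 2)
```

Sharing the same sampled directions across all pairs of bags is the natural choice when filling a Gram matrix, since it amounts to a single finite-dimensional embedding as in (15).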
More importantly, the result in (Kolouri et al., 2016) is only valid for absolutely continuous distributions and is not applicable when dealing with empirical distributions $\hat{P}_n$ as in distribution regression (this would require a density estimation step that might be expensive and introduce additional errors). Our strategy avoids this.

SW1-kernel. For the SW1 distance we have the following.

Proposition 6. $\sqrt{d_{SW_1}}$ is Hilbertian.

$d_{SW_1}$ itself is not Hilbertian, while its square root $\sqrt{d_{SW_1}}$ is. This was originally observed in Carriere et al. (2017), where the authors used it to build a kernel between persistence diagrams, but without a complete proof. For the sake of completeness, we provide a proof in Appendix B. The result hinges on Schoenberg's theorem. To our knowledge, this is the first time that $K(d_{SW_1})$ is used in the context of distribution regression. Note that the Gaussian-like kernel $K(P, Q) = \exp(-\gamma d_{SW_1}(P, Q))$ has no square in the exponential, since it is the square root of the distance that is Hilbertian and not the distance itself. The result implies that there is some Hilbert space $F$ and feature map $\Phi : X \to F$ such that $\sqrt{d_{SW_1}(P, Q)} = \| \Phi(P) - \Phi(Q) \|_F$. Interestingly, it implies that $\sqrt{d_{SW_1}}$ is also a distance.

Alternative Optimal Transport embeddings. Linear Optimal Transportation (LOT) (Wang et al., 2013; Mérigot et al., 2020; Moosmüller & Cloninger, 2020) defines an alternative transport-based Hilbertian distance. It requires the choice of a reference measure and a discretization scheme to be used in practice, which add further complexity. Nonetheless, a separate study of LOT estimators would be an interesting future direction.

Algorithm. We refer to kernels built on $X$ by substitution with either $d_{SW_2}$ or $\sqrt{d_{SW_1}}$ as Sliced-Wasserstein (SW) kernels. Evaluating such kernels requires evaluating (13) for $p = 1, 2$. Instead of approximating the outer integral by sampling from $U(\mathbb{S}^{r-1})$ and evaluating the inner integral exactly, we suggest approximating both integrals as follows:
$$\hat{d}_p(P, Q)^p := \frac{1}{MN} \sum_{m=1}^{M} \sum_{\ell=1}^{N} \Big| F^{[-1]}_{(\theta_m)^*_\# P}(t_\ell) - F^{[-1]}_{(\theta_m)^*_\# Q}(t_\ell) \Big|^p, \qquad \theta_1, \ldots, \theta_M \overset{\text{i.i.d.}}{\sim} U(\mathbb{S}^{r-1}), \quad t_1, \ldots, t_N \overset{\text{i.i.d.}}{\sim} U(0, 1).$$
This amounts to building a finite dimensional representation $\hat{\Phi}_{M,N}(P) = (MN)^{-1/p} \big( F^{[-1]}_{(\theta_m)^*_\# P}(t_\ell) \big)_{m,\ell=1}^{M,N} \in \mathbb{R}^{MN}$ and setting
$$\hat{d}_p(P, Q) = \big\| \hat{\Phi}_{M,N}(P) - \hat{\Phi}_{M,N}(Q) \big\|_{\ell^p(\mathbb{R}^{MN})}. \qquad (15)$$
For empirical distributions $\hat{P}_n = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}$, we can make use of the closed form evaluation of $F^{[-1]}_{\theta^*_\# \hat{P}_n}$, and we emphasize that the empirical distributions we evaluate are not required to have the same number of points (see Appendix D). The full procedure to evaluate $\hat{\Phi}_{M,N}(\hat{P}_n)$ is summarized in Alg. 1. It can then be plugged into (15), and the approximated distance itself can be plugged into the distance substitution kernel of choice.

Universality of Gaussian SW kernels. We exploit Thm. 4 to show that SW kernels are universal.

Proposition 7. $K_1 : X^2 \to \mathbb{R}$, $(P, Q) \mapsto e^{-\gamma d_{SW_1}(P, Q)}$ and $K_2 : X^2 \to \mathbb{R}$, $(P, Q) \mapsto e^{-\gamma d_{SW_2}^2(P, Q)}$ are universal for the topology induced by the weak convergence of probability measures.

Algorithm 1: Evaluation of $\hat{\Phi}_{M,N}(\hat{P}_n)$
Input: data $\hat{P}_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i} \in \mathcal{M}^1_+(\Omega_r)$, $M, N \geq 1$
Init: $\Phi = 0$, a matrix of size $M$ by $N$
for $m = 1$ to $M$ do
  Sample $\theta_m \sim U(\mathbb{S}^{r-1})$
  Compute the projected sequence $x^{\theta_m}_i := \langle \theta_m, x_i \rangle$, $i = 1, \ldots, n$
  Sort it into the ordered sequence $x^{\theta_m}_{(1)} \leq \ldots \leq x^{\theta_m}_{(n)}$
  for $\ell = 1$ to $N$ do
    Sample $t_\ell \sim U(0, 1)$
    Set $\Phi_{m,\ell} = x^{\theta_m}_{(k)}$ for the $k$ such that $t_\ell \in \big[\tfrac{k-1}{n}, \tfrac{k}{n}\big)$, $1 \leq k \leq n$
  end for
end for
Output: $\mathrm{Flatten}(\Phi) \in \mathbb{R}^{MN}$

Proof. We prove the result for $K_2$; the proof for $K_1$ requires a slightly longer argument that is deferred to Prop. 18.
Firstly, the compactness of $\Omega_r$ implies that $X$ is weakly compact (Thm. 6.4, Parthasarathy, 2005). Secondly, $d_{SW_2}$ is Hilbertian with feature space $F = L^2\big((0, 1) \times \mathbb{S}^{r-1}, \lambda_1 \otimes \sigma_r\big)$, which is separable, and feature map given by (14). Since $d_{SW_2}(P, Q) = \|\Phi(P) - \Phi(Q)\|_F$, $\Phi$ is continuous w.r.t. $d_{SW_2}$ and injective. Indeed, $d_{SW_2}$ satisfying the identity of indiscernibles implies that $\Phi$ is injective, and $\Phi$ as a map from $(X, d_{SW_2})$ to $(F, \|\cdot\|_F)$ is 1-Lipschitz, hence continuous. Finally, it has been shown in Nadjahi et al. (2020) that $d_{SW_2}$ metrizes the weak convergence, hence continuity and compactness w.r.t. $d_{SW_2}$ are equivalent to weak continuity and weak compactness. By Thm. 4, the universality of $K_2$ follows.

5. Consistency in distribution regression with distance substitution kernels

To our knowledge, every theoretical work on distribution regression has focused on MMD-based kernels. Here, we provide the first analysis of distribution regression for any distance substitution kernel $K(d)$ on $X$ with known sample complexity for the chosen distance $d$. We follow the approach of Fang et al. (2020), which slightly improved the results of Szabó et al. (2016). Using recent results on the sample complexity of the Sliced Wasserstein distance (Nadjahi et al., 2020), we apply our excess risk bound to show that the KRR estimator with the SW kernel for distribution regression is consistent. Given a Hilbertian distance $d$ (e.g. MMD or SW) on $X$ with feature map $\Phi : X \to F$, we consider a radial distance substitution kernel $K : X \times X \to \mathbb{R}$, $(P, Q) \mapsto q(d(P, Q)) = q(\|\Phi(P) - \Phi(Q)\|_F)$ with $q : \mathbb{R} \to \mathbb{R}$. Valid functions $q$ are the ones for which $K$ is a valid substitution kernel, for example $q : x \mapsto e^{-\gamma x^{2\beta}}$, $\gamma > 0$, $\beta \in [0, 1]$ (Prop. 13). The fact that $d$ is assumed to be a distance implies that $d(P, Q) = 0$ if and only if $P = Q$, and therefore $K(P, Q) = q(0)$ when $P = Q$. We denote by $\mathcal{H}_K$ the induced RKHS and by $K_P \in \mathcal{H}_K$ the canonical feature map of $K$: $K(P, Q) = \langle K_P, K_Q \rangle_{\mathcal{H}_K}$ for all $(P, Q) \in X^2$. Equipped with this kernel, we consider the minimizer (5) of the second stage risk (4) and search for a bound on the excess risk $\mathcal{E}(f_{\hat{D},\lambda}) - \mathcal{E}(f_\rho) = \|f_{\hat{D},\lambda} - f_\rho\|^2_{L^2_{\rho_X}}$. In the context of MMD-based distribution regression, Szabó et al. (2016) provided high probability bounds on the excess risk, while Fang et al. (2020) provided bounds on the expected excess risk with a slightly different analysis. The key assumption behind both approaches is that $K$ must be Hölder continuous with respect to the MMD; more precisely, that there exist constants $L > 0$ and $h \in (0, 1]$ such that for all $(P, Q) \in X^2$ we have $\|K_P - K_Q\|_{\mathcal{H}_K} \leq L \, d_{\mathrm{MMD}}(P, Q)^h$. This assumption is instrumental in bounding quantities that depend on the discrepancy between $P$ and $\hat{P}_n$ by the sample complexity of the MMD distance. Here, we advocate extending this assumption to a generic Hilbertian distance $d$,
$$\|K_P - K_Q\|_{\mathcal{H}_K} \leq L \, d(P, Q)^h. \qquad (16)$$
Let us note that (16) does not depend on $d$ but rather on the regularity of $q$ (see the discussion in Appendix E.1 for more details). We now provide the main result of this section, first introducing the integral operator as a central tool in learning theory for kernel algorithms.

Integral operator. Since $X$ is compact, under the assumptions that $K$ is p.d., continuous and $\sup_{P \in X} K(P, P) \leq \kappa^2$ (i.e. $K$ is a Mercer kernel, Aronszajn (1950)), the integral operator $L_K : L^2_{\rho_X} \to L^2_{\rho_X}$ defined by
$$L_K f = \int_X K_P \, f(P) \, \mathrm{d}\rho_X(P), \qquad f \in L^2_{\rho_X},$$
is compact, positive and self-adjoint.
Furthermore HK is composed of continuous functions from X to R and L1/2 K forms an isometry between L2 ρX and HK: every function f HK can be written as f = L1/2 K g for some g L2 ρX and f HK = g L2ρX (section 1,2,3, chapter 3 Cucker & Smale (2002)). Theorem 8. Under the assumption that A > 0, |y| A, ρ a.s., fρ HK, q is such that K is a valid kernel, K satisfies the H older condition (16), K is continuous and sup P X K(P, P) κ2, if the expected sample complexity of d is controlled such that E[d(P, ˆPn)2] α(n), α : N Distribution Regression with Sliced Wasserstein Kernels R+, P X, then E f ˆ D,λ fρ L2ρX C κ fρ HK+ AB2 T,λ λ +ABT,λ Tλ + Lκα(n)h/2 κ BT,λ + BT,λ + where N(λ) := Tr LK (LK + λI) 1 , λ > 0 is the ef- fective dimension, BT,λ := 2κ N(λ) and C is a constant the does not depend on any other quantity. Proof. We sketch the proof strategy here while Appendix E.2 provides a complete proof. We start by decomposing f ˆ D,λ fρ L2ρX f ˆ D,λ f D,λ L2ρX + f D,λ fρ L2ρX . The term f D,λ fρ L2 ρX corresponds to the excess risk of a KRR estimator that has access to the true input distributions (namely P rather than a sample ˆPn, see discussion in Sec. 2). Bounds for such quantity have been thoroughly analysed in the literature (see e.g. Caponnetto & De Vito, 2007). This term is responsible for the last row in the bound of Thm. 8. The term f ˆ D,λ f D,λ L2ρX is specific to the two-stage sampling nature of distribution regression and measures how much we loose by accessing each distribution through bag of samples. A leading term controlling this quantity is KˆPn KPn HK, which is further bounded by the sample complexity of d using the H older assumption in (16). fρ HK is an indicator of the regularity of fρ. The bound indicates that the excess risk goes down with the easiness of the learning problem that is reflected in the RKHS norm of fρ. Bounds as in Thm. 8 are often refined using the so called source condition: gρ L2 ρX such that fρ = Lϵ Kgρ, ϵ > 0. The larger ϵ the smoother is fρ and for ϵ 1/2 we are in the well-specified setting, fρ HK. The effective dimension N(λ) is the other parameter controlling the bound and is used to describe the complexity of HK. A common assumption is that N(λ) cλ ν, ν (0, 1] Caponnetto & De Vito (2007). Note that, in the worst case scenario, this is always true with ν = 1. λi λi + λ Tr(LK) Plugging this bound and an explicit control over α(n) allows to get more explicit rates. Corollary 9. Under the assumptions of Thm. 8, for α(n) = cn β, β, c > 0, taking λ = max( 1 T , 1 nhβ/2 ) the bound can be simplified as E f ˆ D,λ fρ L2ρX C 1 T + 1 nhβ/4 where C is a constant that depends only on A, κ and L. We defer the proof to Appendix E.4. For d MMD it is known that β = 1, for completeness the proof is given in Appendix E.3. Plugging this rate in Cor. 9, for the Gaussian-like kernel for example (h=1), leads to the same result as presented in Fang et al. (2020), the excess risk is in 1 4 T + 1 4 n. For SW, it is shown in Nadjahi et al. (2020) that both E[d SW2(P, ˆPn)2] and E[d SW1(P, ˆPn)] are in O(n 1/2), hence β = 1/2. Leading for h = 1 to an expected excess risk in the order of 1 4 T + 1 8 n. Due to its slower sample complexity, the bound appears worse for SW than for MMD, when we neglect fρ HK. However, such term captures the regularity of the solution fρ in terms of the underlying metric used to compare probability distribution and can potentially differ significantly between MMD and SW-based kernels. 
In particular, we argue that when fρ needs to be sensitive to geometric properties of distributions in M1 +(Ωr), then fρ HK might be significantly smaller for HK the space associated to a SW kernel rather than the MMD based one. In the following, we provide empirical experiments supporting this perspective, showing that SW kernels are often superior to MMD ones in real and synthetic applications. We postpone the theoretical investigation of this important question to future work. 6. Experiments In this section we compare the numerical performance of SW kernels versus MMD kernels on two tasks. First, we look at the synthetic task that consists of counting the number of components of Gaussian mixture distributions. Then, we present results for distribution regression on the MNIST and Fashion MNIST datasets. Remark 10. In both tasks, considered as classification as the labels are discrete, we use KRR hence the mean squared error loss. According to the theory of surrogate methods Bartlett et al. (2006), the squared loss is classification calibrated (by using 1-hot encoding and sign or argmax decoding), namely it is a valid surrogate loss for classification problems, similarly to hinge and logistic losses (see e.g. Mroueh et al. (2012) and Meanti et al. (2020) for theoretical and empirical analyses, respectively). In particular, excess risk bounds for KRR (e.g. our Thm. 8) automatically yield bounds for classification (see [2]). Synthetic Mode classification. Taken from Oliva et al. (2014), we consider the problem of counting the number of Distribution Regression with Sliced Wasserstein Kernels Table 1. Test RMSE and standard deviation for the mode classification task with different configuration for T, n, C and r (tested 5 times each). Comparison of a doubly Gaussian MMD kernel and Gaussian SW2 and SW1 kernels. T n C r MMD SW2 SW1 100 50 2 2 0.64 (0.09) 0.48 (0.03) 0.47 (0.06) 100 50 10 2 3.66 (0.88) 2.92 (0.32) 3.01 (0.24) 100 50 2 10 0.68 (0.02) 0.54 (0.05) 0.53 (0.05) 100 50 10 10 3.48 (0.95) 3.17 (0.32) 3.19 (0.14) 500 50 2 2 0.66 (0.08) 0.44 (0.07) 0.42 (0.07) 500 50 10 2 4.1 (1.03) 2.63 (0.27) 2.68 (0.24) 500 50 2 10 0.64 (0.13) 0.44 (0.08) 0.42 (0.04) 500 50 10 10 3.72 (1.14) 3.07 (0.42) 3.08 (0.35) 500 250 2 2 0.64 (0.1) 0.39 (0.03) 0.42 (0.04) 500 250 10 2 3.8 (0.1) 2.4 (0.15) 2.41 (0.16) 500 250 2 10 0.65 (0.15) 0.39 (0.01) 0.41 (0.03) 500 250 10 10 3.64 (0.83) 2.84 (0.17) 2.8 (0.31) modes in a Gaussian Mixture Model (GMM). This can be seen as a model selection process where given a set of points we want to fit a GMM but first we learn a mapping from the set of points to the number of components p. The data are simulated as follows. Let T be the number of GMM distributions, n the number of points drawn from each GMM, C the maximum number of components and r the input dimension. For each t = 1, . . . , T we sample independently p U({1, . . . , C}), µ1, . . . , µp i.i.d U([ 5, 5]r), and Σj = aj Aj A j + Bj, j = 1, . . . , p where aj U([1, 4]), Aj is a r r matrix with entries sampled from U([ 1, 1]) and Bj is a diagonal matrix with entries sampled from U([0, 1]). Fig. 1 presents examples of GMM distributions sampled as above for r = 2, C = 3 and n = 100. We compare the Gaussian SW1 and SW2 kernels (with M = 100, N = 100 (15)) against the doubly Gaussian MMD kernel. In Table 1 we report the RMSE for different configurations of T, n, C and r. 
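For reproducibility, here is a minimal sketch of one draw of the generative process just described (our own code, assuming NumPy, not the authors' implementation; in particular, the mixture weights are taken uniform here, which is an assumption on our part since they are not specified above).

```python
# Hedged sketch of the synthetic mode-classification generator described above.
import numpy as np

def sample_gmm_bag(n=50, c_max=10, r=2, rng=None):
    """Return (points, p): n draws from a random GMM and its number of modes p."""
    rng = np.random.default_rng(rng)
    p = int(rng.integers(1, c_max + 1))                  # p ~ U({1, ..., C})
    means = rng.uniform(-5.0, 5.0, size=(p, r))          # mu_j ~ U([-5, 5]^r)
    covs = []
    for _ in range(p):
        a = rng.uniform(1.0, 4.0)
        A = rng.uniform(-1.0, 1.0, size=(r, r))
        B = np.diag(rng.uniform(0.0, 1.0, size=r))
        covs.append(a * A @ A.T + B)                     # Sigma_j = a_j A_j A_j^T + B_j
    labels = rng.integers(0, p, size=n)                  # uniform weights (assumption)
    points = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in labels])
    return points, p

bag, n_modes = sample_gmm_bag(n=50, c_max=10, r=2, rng=0)
```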
A validation set of size 50 is used to select the bandwidth of the Gaussian kernels and the regularisation parameter λ (see Appendix F for details) and a test set of size 100 is used for the results. Each experiment is repeated 5 times. MNIST Classification. MNIST (Le Cun et al., 2010) and Fashion MNIST (Xiao et al., 2017) are constituted of gray images of size 28 by 28. Classes in MNIST are the digits from 0 to 9 and classes in Fashion MNIST are 10 categories of clothes. SW and MMD kernels take as inputs probability distributions, we therefore convert each image to an histogram. For an image t, let (xt i, yt i)nt i=1 denotes the positions of the active pixels (i.e. with an intensity greater than 0), nt is the number of active pixels. The positions are renormalised to be in the grid [ 1, 1]2. We therefore have Ω2 = [ 1, 1]2. Let (It i)nt i=1 {0, . . . 255} be the associated pixel intensities. We consider the weighted histograms ˆPt = Pnt i=1 ct iδ{xt i,yt i} M1 +([ 1, 1]2) where ct i = It i/ Pnt j=1 It j. We consider Gaussian distance sub- Figure 1. Samples and level curves for sample GMMs in the mode classification experiment, with r = 2, C = 3 and n = 100. stitution kernels (6) for d MMD, d SW2 and p d SW1 (with M = 100, N = 100 (15)). The kernel chosen to build d MMD is also Gaussian. For comparison we also apply an Euclidean Gaussian kernel on the flatten images. We consider the mean squared error loss and build KRR estimators (5). We use a train set of 1000 images, a validation set of 300 images and a test set of 500 images to evaluate the estimators. Each set is balanced. We use the train-validation split to select the bandwidth of the Gaussian kernels and the regularisation parameter λ (see Appendix F for details). In order to make the task more challenging we test the kernels in three configurations: on the raw images and on images randomly rotated (with a maximum angle of either 15 or 30 degrees) and translated, Fig. 2. We run the experiment 5 times on different part of the MNIST and Fashion MNIST datasets. Results are given in Table 2. Comparison. SW kernels consistently beat MMD kernels in both experiments, which indicates that the SW distance is a better measure of similarity between histograms than the MMD distance. On the Fashion MNIST dataset for which the images are very well centered both MMD and SW are beaten by the standard Gaussian kernel on the flatten images. However, this kernel is less robust and its performance deteriorates far quicker than SW kernels when we perturb the images, the MMD kernel is even less robust. While the focus of this work is to suggest an alternative to MMDbased distribution regression, in Appendix G we also make a comparison to KRR with the square root total variation and Hellinger distances, which are both c.n.d. Distribution Regression with Sliced Wasserstein Kernels Figure 2. Sample transformations applied to MNIST (Top row) and Fashion MNIST (Bottom row) images. (Left column) raw 28 28 pixel images. Images are padded to 34 34 pixels and randomly roto-translated by degree drawn uniformly (for each image) on [ 15, 15] (Middle column) and [ 30, 30] (Right column. Table 2. Mean accuracy and standard deviation on rotated MNIST and Fashion MNIST with maximum rotation θ = 0, π/12, π/6 respectively. Comparison between KRR estimators with standard Gaussian kernel (RBF), doubly Gaussian MMD kernel (MMD), a Gaussian SW2 kernel and a Gaussian SW1. 
MNIST RBF MMD SW2 SW1 RAW 0.9 (0.01) 0.79 (0.03) 0.93 (0.01) 0.91 (0.03) θ = π 12 0.51 (0.03) 0.40 (0.05) 0.85 (0.02) 0.66 (0.02) θ = π 6 0.47 (0.03) 0.34 (0.04) 0.82 (0.01) 0.61 (0.01) FASHION RBF MMD SW2 SW1 RAW 0.82 (0.01) 0.75 (0.01) 0.81 (0.01) 0.77 (0.01) θ = π 12 0.66 (0.01) 0.48 (0.03) 0.74 (0.01) 0.60 (0.02) θ = π 6 0.64 (0.01) 0.45 (0.01) 0.70 (0.01) 0.58 (0.01) 7. Conclusion The main goal of this work was to address distribution regression from the perspective of Optimal Transport (OT), in contrast to the more common and well-established MMDbased strategies. For this purpose we proposed the first kernel ridge regression estimator for distribution regression, using Sliced Wasserstain distances to define a distance substitution kernel. We proved finite learning bounds for these estimators, extending previous work from Fang et al. (2020). We empirically evaluated our estimator on a number of real and simulations experiments, showing that tackling distribution regression with OT-based methods outperforms MMD-based approaches in several settings. We identify two main directions for future research: firstly, we plan to better identify which regimes are more favorable for OT-based and MMD-based distribution regression. This is an open question in the literature and requires delving deeper in the analysis of the approximation properties of the distance substitution kernels induced by the two distances respectively. Secondly, our theoretical results, do not account for the fact that the Sliced Wasserstein distance needs to be approximated via Monte-Carlo methods. This analysis was beyond the scope of this work, which focuses on comparing OT and MMD based distribution regression. However, we plan to account for this further (third) sampling stage in future work. Acknowledgements We would like to thank Saverio Salzo for helpful discussions. Most of this work was done when D.M. was a research assistant at the Italian Institute of Technology. D.M. is supported by the Ph D fellowship from the Gatsby Charitable Foundation and the Wellcome Trust. D.M. is also an ELLIS Ph D student. C.C. acknowledges the support of the Royal Society (grant SPREM RGS\R1\201149) and Amazon.com Inc. (Amazon Research Award ARA). Software and Data Code is available at https://github.com/ Dim Sum2k/DRSWK. Aronszajn, N. Theory of reproducing kernels. Transactions of the American mathematical society, 68(3):337 404, 1950. Bachoc, F., Gamboa, F., Loubes, J.-M., and Venet, N. A gaussian process regression model for distribution inputs. IEEE Transactions on Information Theory, 64(10):6620 6637, 2017. Bachoc, F., Suvorikova, A., Ginsbourger, D., Loubes, J.-M., and Spokoiny, V. Gaussian processes with multidimensional distribution inputs via optimal transport and hilbertian embedding. Electronic journal of statistics, 14(2): 2742 2772, 2020. Bartlett, P. L., Jordan, M. I., and Mc Auliffe, J. D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138 156, 2006. Berg, C., Christensen, J. P. R., and Ressel, P. Harmonic analysis on semigroups: theory of positive definite and related functions, volume 100. Springer, 1984. Berlinet, A. and Thomas-Agnan, C. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011. Bonneel, N., Rabin, J., Peyr e, G., and Pfister, H. Sliced and Distribution Regression with Sliced Wasserstein Kernels radon wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22 45, 2015. 
Caponnetto, A. and De Vito, E. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331 368, 2007. Carriere, M., Cuturi, M., and Oudot, S. Sliced wasserstein kernel for persistence diagrams. In International conference on machine learning, pp. 664 673. PMLR, 2017. Christmann, A. and Steinwart, I. Universal kernels on nonstandard input spaces. Advances in Neural Information Processing Systems, 23:406 414, 2010. Cucker, F. and Smale, S. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39:1 49, 2002. Denevi, G., Pontil, M., and Ciliberto, C. The advantage of conditional meta-learning for biased regularization and fine tuning. Advances in Neural Information Processing Systems, 33:964 974, 2020. Deshpande, I., Zhang, Z., and Schwing, A. G. Generative modeling using the sliced wasserstein distance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3483 3491, 2018. Deshpande, I., Hu, Y.-T., Sun, R., Pyrros, A., Siddiqui, N., Koyejo, S., Zhao, Z., Forsyth, D., and Schwing, A. G. Max-sliced wasserstein distance and its use for gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10648 10656, 2019. Fang, Z., Guo, Z.-C., and Zhou, D.-X. Optimal learning rates for distribution regression. Journal of Complexity, 56:101426, 2020. Feragen, A., Lauze, F., and Hauberg, S. Geodesic exponential kernels: When curvature and linearity conflict. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3032 3042, 2015. Gretton, A., Borgwardt, K. M., Rasch, M. J., Sch olkopf, B., and Smola, A. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723 773, 2012. Haasdonk, B. and Bahlmann, C. Learning with distance substitution kernels. In Joint pattern recognition symposium, pp. 220 227. Springer, 2004. Hein, M. and Bousquet, O. Hilbertian metrics and positive definite kernels on probability measures. In International Workshop on Artificial Intelligence and Statistics, pp. 136 143. PMLR, 2005. Hofmann, T., Sch olkopf, B., and Smola, A. J. Kernel methods in machine learning. The annals of statistics, 36(3): 1171 1220, 2008. Kolouri, S., Zou, Y., and Rohde, G. K. Sliced wasserstein kernels for probability distributions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5258 5267, 2016. Kolouri, S., Pope, P. E., Martin, C. E., and Rohde, G. K. Sliced wasserstein auto-encoders. In International Conference on Learning Representations, 2018. Kolouri, S., Nadjahi, K., Simsekli, U., Badeau, R., and Rohde, G. Generalized sliced wasserstein distances. Advances in Neural Information Processing Systems, 32, 2019. Le Cun, Y., Cortes, C., and Burges, C. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010. Lee, C.-Y., Batra, T., Baig, M. H., and Ulbricht, D. Sliced wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10285 10295, 2019. Lin, S.-B., Guo, X., and Zhou, D.-X. Distributed learning with regularized least squares. The Journal of Machine Learning Research, 18(1):3202 3232, 2017. Meanti, G., Carratino, L., Rosasco, L., and Rudi, A. Kernel methods through the roof: handling billions of points efficiently. Advances in Neural Information Processing Systems, 33:14410 14422, 2020. 
M erigot, Q., Delalande, A., and Chazal, F. Quantitative stability of optimal transport maps and linearization of the 2-wasserstein space. In International Conference on Artificial Intelligence and Statistics, pp. 3186 3196. PMLR, 2020. Moosm uller, C. and Cloninger, A. Linear optimal transport embedding: Provable wasserstein classification for certain rigid transformations and perturbations. ar Xiv preprint ar Xiv:2008.09165, 2020. Mroueh, Y., Poggio, T., Rosasco, L., and Slotine, J.-J. Multiclass learning with simplex coding. Advances in Neural Information Processing Systems, 25, 2012. Muandet, K., Fukumizu, K., Sriperumbudur, B., Sch olkopf, B., et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1 141, 2017. Distribution Regression with Sliced Wasserstein Kernels M ucke, N. Stochastic gradient descent meets distribution regression. In International Conference on Artificial Intelligence and Statistics, pp. 2143 2151. PMLR, 2021. Nadjahi, K. Sliced-Wasserstein distance for large-scale machine learning: theory, methodology and extensions. Ph D thesis, Institut Polytechnique de Paris, 2021. Nadjahi, K., Durmus, A., Simsekli, U., and Badeau, R. Asymptotic guarantees for learning generative models with the sliced-wasserstein distance. Advances in Neural Information Processing Systems, 32, 2019. Nadjahi, K., Durmus, A., Chizat, L., Kolouri, S., Shahrampour, S., and Simsekli, U. Statistical and topological properties of sliced probability divergences. Advances in Neural Information Processing Systems, 33:20802 20812, 2020. Oliva, J., Neiswanger, W., P oczos, B., Schneider, J., and Xing, E. Fast distribution to real regression. In International Conference on Artificial Intelligence and Statistics, pp. 706 714. PMLR, 2014. Parthasarathy, K. R. Probability measures on metric spaces, volume 352. American Mathematical Soc., 2005. Peyr e, G., Cuturi, M., et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355 607, 2019. P oczos, B., Singh, A., Rinaldo, A., and Wasserman, L. Distribution-free distribution regression. In International Conference on Artificial Intelligence and Statistics, pp. 507 515. PMLR, 2013. Santambrogio, F. Optimal transport for applied mathematicians. Birk auser, NY, 55(58-63):94, 2015. Sch olkopf, B., Herbrich, R., and Smola, A. J. A generalized representer theorem. In International Conference on Computational Learning Theory, pp. 416 426. Springer, 2001. Smola, A., Gretton, A., Song, L., and Sch olkopf, B. A hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory, pp. 13 31. Springer, 2007. Smola, A. J. and Sch olkopf, B. Learning with kernels. Citeseer, 1998. Steinwart, I. and Christmann, A. Support vector machines. Springer Science & Business Media, 2008. Szab o, Z., Gretton, A., P oczos, B., and Sriperumbudur, B. Two-stage sampled learning theory on distributions. In International Conference on Artificial Intelligence and Statistics, pp. 948 957. PMLR, 2015. Szab o, Z., Sriperumbudur, B. K., P oczos, B., and Gretton, A. Learning theory for distribution regression. The Journal of Machine Learning Research, 17(1):5272 5311, 2016. Trang, B. T. T., Loubes, J.-M., Risser, L., and Balaresque, P. Distribution regression model with a reproducing kernel hilbert space approach. Communications in Statistics - Theory and Methods, 50(9):1955 1977, 2021. Villani, C. 
Optimal transport: old and new, volume 338. Springer, 2009. Wang, W., Slepˇcev, D., Basu, S., Ozolek, J. A., and Rohde, G. K. A linear optimal transportation framework for quantifying and visualizing variations in sets of images. International journal of computer vision, 101(2):254 269, 2013. Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. ar Xiv preprint ar Xiv:1708.07747, 2017. Yu, Z., Ho, D. W., Shi, Z., and Zhou, D.-X. Robust kernelbased distribution regression. Inverse Problems, 37(10): 105014, 2021. Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. Advances in Neural Information Processing Systems, 30, 2017. Distribution Regression with Sliced Wasserstein Kernels Supplementary Material Below we give an overview of the structure of the supplementary material. Appendix A contains two derivations related to the distribution regression set up: the decomposition of the excess risk and the closed-form expression of the minimizer of the second-stage empirical risk. In Appendix B we present more examples of distance substitution kernels and the proof that p d SW1 is Hilbertian. Appendix C contains the proof of Prop. 7 for SW1. In Appendix D we show how the Sliced Wasserstein distance can be computed and approximated in practice. Appendix E contains a discussion around the H older assumption and shows that it holds for q : x 7 e γx2β. Specifically, Appendix E.2 contains the proof of Thm. 8 and Appendix E.3 contains the proof of Cor. 9. Appendix F contains the details of the choice of hyperparameters for the experimental section. Appendix G contains additional experiments with the Hellinger and total variation distances. A. Distribution Kernel Ridge Regression Proposition 11. For all f : X R measurable, X Y (f(P) y)2dρ(P, y) = f fρ 2 L2ρX + E(fρ). The second term is independent of f and the first term is minimised for f = fρ a.e. which implies fρ(P) = argmin f E(f) = argmin f f fρ L2ρX . X Y (f(P) y)2dρ(P, y) = Z X Y (f(P) fρ(P) + fρ(P) y)2dρ(P, y) X (f(P) fρ(P))2dρX(P) + 2 Z X Y (f(P) fρ(P))(fρ(P) y)dρ(P, y) + E(fρ) = f fρ 2 L2ρX + 2 Z X (f(P) fρ(P))(fρ(P) Z Y ydρY |X(y | P) | {z } =fρ(P) )dρX(P) + E(fρ) = f fρ 2 L2ρX + E(fρ) The following proposition shows how to derive the explicit form of the minimizer of the second-stage empirical risk (4). Proposition 12. Let K : X X R be a p.d. kernel with RKHS denoted as HK, then f ˆ D,λ = argmin f HK f ˆPt,nt yt 2 + λ f 2 HK (17) takes the form f ˆ D,λ(P) = (y1, . . . , y T )(KT + λTIT ) 1k P, P X where [KT ]t,l = K ˆPt,nt, ˆPl,nl RT T and k P = K P, ˆP1,n1 , . . . , K P, ˆPT,n T RT . Distribution Regression with Sliced Wasserstein Kernels Proof. By the representer theorem Sch olkopf et al. (2001) a minimizer of the empirical risk takes the form fα( ) = PT t=1 αt K(ˆPt,nt, ). The minimisation of the empirical risk is therefore equivalent to α = argmin α RT KT α y 2 RT + λTα KT α = argmin α RT αT KT (KT + λTIT ) α 2y KT α. Taking the gradient with respect to α, we obtain the first order condition KT [(KT + λTIT )α y] = 0. Hence any solution takes the form α = (KT + λTIT ) 1y | {z } =:α +ϵ, with KT ϵ = 0. When KT is singular we have an infinity of solutions, however, they all lead to the same solution in HK. Indeed, if α = α + ϵ with KT ϵ = 0, then: fα fα 2 HK = (α α ) KT (α α ) = 0 therefore fα = fα . It shows that Kernel Ridge Regression has a unique solution f HK defined by f (P) = (y1, . . . 
, y T )(KT + λTIT ) 1k P, (P X) which can potentially be expressed by several α s if KT is singular. B. Distance substitution kernels Proposition 13 (Proposition 1 (Haasdonk & Bahlmann, 2004)). For a set M and a pseudo-distance distance d on M, the following statements are equivalent: d is a Hilbertian pseudo-distance; Klin(x, y) = x, y x0 d , for all x0 M, (x, y) M 2 is p.d.; Kpoly(x, y) = (c + x, y x0 d )ℓfor all c 0 and ℓ N, (x, y) M 2 is p.d.; K(x, y) = exp( γd2β(x, y)) for all γ 0, β [0, 1], (x, y) M 2 is p.d.; We can therefore embed Hilbertian pseudo-distances in polynomial-like, Laplace-like (β = 1/2) and Gaussian-like (β = 1) kernels. Furthermore Szab o et al. (2016) provides examples of Cauchy-like, t-student-like and inverse multiquadratic-like kernels. We now prove that p d SW1 is Hilbertian by showing that d SW1 is conditionally negative definite (c.n.d). We first recall the definition of a c.n.d function and how it relates to Hilbertian pseudo-distances. Definition 2. Let M be a set. A map f : M M R is called conditionally negative definite (c.n.d.) if and only if it is symmetric and satisfies: n X i,j=1 aiajf(xi, xj) 0 for any n N, (x1, . . . , xn) M n and (a1, . . . , an) Rn with Pn i=1 ai = 0. Square of Hilbertian pseudo-distances are c.n.d. Proposition 14. Let Φ : M F be a mapping to a Hilbert space F and f(x, y) = Φ(x) Φ(y) 2 F, then f is c.n.d. Proof. f is symmetric and for n N, (a1, . . . , an) Rn, (x1, . . . , xn) M n, we have: X 1 i,j n aiajf(xi, xj) = X 1 i,j n aiaj Φ(xi) 2 + Φ(xj) 2 2 Φ(xi), Φ(xj) i=1 ai Φ(xi) 2 ! i=1 aiΦ(xi) Distribution Regression with Sliced Wasserstein Kernels Under the constraint Pn i=1 ai = 0, the first term disappears and only a nonpositive term remains. Hence, f : (x, y) 7 Φ(x) Φ(y) 2 is c.n.d. A corollary of Schoenberg s theorem (see chapter 3 in (Berg et al., 1984)) shows that the converse holds. Proposition 15. Let f be a c.n.d. function on M such that f(x, x) = 0 for any x M. Then, there exists a Hilbert space F and a mapping Φ : M F such that for any (x, y) M 2, f(x, y) = ||Φ(x) Φ(y)||2 F. As a consequence showing that p d SW1 is Hilbertian is equivalent to showing that d SW1 is c.n.d. We start with the following lemma. Lemma 16. Let M be a set, (Ω, A, µ) a measurable space and φ : M Ω R be such that φ(x, ) L1(µ) for all x M. Then, f : M M R, (x, y) 7 R Ω|φ(x, ω) φ(y, ω)|dµ(ω) is c.n.d. Proof. First, notice that for all (u, v) [a, + ), min(u, v) = Z + a 1t u1t vdt+a. Secondly, for all u, v R, |u v| = u+v 2 min(u, v). For all n N , (x1, . . . , xn) M n and ω Ωdefine aω n := a(ω; x1, . . . , xn) := mini=1,...,n φ(xi, ω). For all (a1, . . . , an) Rn such that Pn i=1 ai = 0, i,j=1 aiajf(xi, xj) = Z i,j=1 aiaj {φ(xi, ω) + φ(xj, ω) 2 min(φ(xi, ω), φ(xj, ω))} dµ(ω) i,j=1 aiaj min(φ(xi, ω), φ(xj, ω))dµ(ω) since aω n 1t φ(xi,ω)1t φ(xj,ω)dt + aω n i=1 ai1t φ(xi,ω) i=1 ai1t φ(xi,ω) dtdµ(ω) since which proves that f is c.n.d. Applying the result to d W1(P, Q) = R 1 0 |F [ 1] P (x) F [ 1] Q (x)|dx shows that d W1 is c.n.d. We conclude that d SW1 is c.n.d by linearity of the integral. Proposition 17. d SW1(P, Q) = R Sr 1 d W1 θ #P, θ #Q dθ is c.n.d. Proof. For all n N, (a1, . . . , an) Rn such that Pn i=1 ai = 0 and (P1, . . . , Pn) Xn, 1 i,j n aiajd SW1(Pi, Pj) = Z 1 i,j n aiajd W1 θ #Pi, θ #Pj Hence by Prop. 15, p d SW1 is Hilbertian. It is important to observe that the proof that p d SW1 is Hilbertian does not lead to an explicit feature map and feature space while for d SW2 we have an explicit construction. 
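As a quick empirical illustration of Prop. 17 (our own sketch, not part of the paper, and not a proof), one can check on random empirical measures that the Gram matrix of $K(P, Q) = \exp(-\gamma \, d_{SW_1}(P, Q))$ has no significantly negative eigenvalues; sharing a fixed set of projection directions across all pairs keeps the finite-sample approximation itself conditionally negative definite, so the check passes up to numerical round-off.

```python
# Hedged numerical check that exp(-gamma * d_SW1) yields a p.d. Gram matrix
# on random empirical measures (Monte Carlo illustration of Prop. 17).
import numpy as np

def sliced_wasserstein1(x, y, theta):
    """d_SW1 estimate for equal-size samples x, y with fixed directions theta."""
    x_proj = np.sort(x @ theta.T, axis=0)
    y_proj = np.sort(y @ theta.T, axis=0)
    return np.mean(np.abs(x_proj - y_proj))

rng = np.random.default_rng(0)
r, n, n_meas, n_proj, gamma = 3, 40, 20, 200, 1.0
theta = rng.normal(size=(n_proj, r))
theta /= np.linalg.norm(theta, axis=1, keepdims=True)    # shared directions
bags = [rng.normal(loc=rng.uniform(-2, 2, r), size=(n, r)) for _ in range(n_meas)]
gram = np.array([[np.exp(-gamma * sliced_wasserstein1(p, q, theta))
                  for q in bags] for p in bags])
print("smallest eigenvalue:", np.linalg.eigvalsh(gram).min())  # expected ~ >= -1e-8
```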
Distribution Regression with Sliced Wasserstein Kernels C. Universality with SW1 Proposition 18. K1 : X X R, (P, Q) 7 e γSW1(P,Q) is universal for (X, d LP ) where d LP is the L evy Prokhorov distance. Since we do not know explicitely the feature map nor the feature space for p d SW1 we need to take a little detour for the separability requirement. Proof. Since p d SW1 is Hilbertian, there is a feature map Φ : M1 +(Ωr) F where F is a Hilbert space such that p d SW1(P, Q) = Φ(P) Φ(Q) F. Since d SW1 satisfies the identity of indiscernibles, p d SW1 too, which implies that Φ is injective and hence p d SW1 is a distance. (Nadjahi et al., 2019) showed that d SW1 metrizes the weak convergence and by continuity of the square root p d SW1 too. Therefore, we can reason as for K2 except that we need to show that F is separable. By Proposition 2.2 (Smola & Sch olkopf, 1998), F can be chosen as the RKHS induced by the kernel k(P, Q) = d SW1(P0, P) + d SW1(P0, Q) d SW1(P, Q) = 2 Φ(P) Φ(P0), Φ(Q) Φ(P0) F. for P0 M1 +(Ωr). By Lemma 4.33 in (Steinwart & Christmann, 2008), if the input space is separable and k is continuous then the RKHS is separable. Here the input space is M1 +(Ωr) which is weakly compact (since we have assumed Ωr compact) hence weakly separable. Furthermore Φ continuous w.r.t p d SW1 implies k continuous (Lemma 4.29 in (Steinwart & Christmann, 2008)). D. Practical implementation of the Sliced Wasserstein kernel Computing SW kernel values requires to evaluate the SW distance. In this section we detail how. The following lemma explain how the generalised inverse of the cumulative distribution function of an empirical measure can be evaluated. Lemma 19. We consider the distribution ˆα = P+ k=1 pkδxk where (xk)k 1 RN and (pk)k 1 RN + such that P 1. We refer to the sorted sequence as (x(k))k 1 RN with reordered weights (p(k))k 1 and define sk = i=1 p(i), s0 = 0. Then, we have Fˆα(x) = Pˆα(X x) = k=1 pk1xk x = smax(k|x(k) x) (x R) , F [ 1] ˆα (t) = inf k=1 pk1xk x t = {x(k) | sk 1 < t sk, k 1}, F [ 1] ˆα (0) = . In particular, for a uniform empirical distribution on a set of n points ˆα = 1 i=1 δxi, we get Fˆα(x) = max(k | x(k) x) F [ 1] ˆα (t) = x(k) if k 1 n 1 k n if t = 0 (t [0, 1]) . (18) F [ 1] ˆα (t) is a piece-wise constant function on the uniform grid with step k 1 We can use (18) to compute the W distance and approximate the SW distance between empirical distributions. Distribution Regression with Sliced Wasserstein Kernels Balanced setting. For two empirical distributions with the same number of points: ˆα = 1 i=1 δxi, ˆβ = 1 xi, yi R exploiting (18), we get d Wp(ˆα, ˆβ)p = Z 1 F [ 1] ˆα (t) F [ 1] ˆβ (t) p dt F [ 1] ˆα (t) F [ 1] ˆβ (t) p dt x(k) y(k) p dt x(k) y(k) p Now we consider ˆα = 1 i=1 δxi, ˆβ = 1 i=1 δyi, with points in Rr instead of R. We recall that for θ Sr 1, we denote by θ the map θ : Rr R, x 7 x, θ and by θ # the associated push-forward operator2. We have θ #α = 1 n Pn i=1 δ xi,θ . We denote by (xθ (i))n i=1 the sorted sequence after projection, i.e. xθ i := xi, θ for all i = 1, . . . , n and xθ (1) . . . xθ (n) (20) By (19) the Sliced Wasserstein distance is given by d SWp(ˆα, ˆβ)p = 1 xθ (k) yθ (k) p dθ (21) which can be approximated as d SWp(ˆα, ˆβ)p 1 n M xθm (k) yθm (k) p (22) where θm i.i.d U(Sd 1). Unalanced setting. 
E. Main proofs of Sec. 5

E.1. Discussion around the Hölder assumption

Assumption (16) might make the reader think that its validity depends on the chosen Hilbertian metric d, while it does not. Indeed, as noticed in Szabó et al. (2015), we can write
$$\|K_P - K_Q\|^2_{\mathcal{H}_K} = K(P, P) + K(Q, Q) - 2K(P, Q) = 2\big(q(0) - q(d(P, Q))\big).$$
Therefore, the Hölder condition on K can be replaced by a regularity condition on q only: there are constants $L > 0$ and $h \in (0, 1]$ such that
$$q(0) - q(x) \le L\, x^{2h}, \quad \forall x \in \mathbb{R}_+, \quad (23)$$
which is independent of the underlying distance d. For example, it holds with $h = \beta$ for $q : x \mapsto e^{-\gamma x^{2\beta}}$, as shown below in Prop. 20. Inequality (16) is important as the derivations for the excess risk bound involve multiple times the difference $\|K_P - K_{\hat P_n}\|_{\mathcal{H}_K}$, which can be controlled by $d(P, \hat P_n)$ thanks to the regularity assumption.

Proposition 20. For $\gamma > 0$ and $\beta \in [0, 1]$, $q : \mathbb{R}_+ \to \mathbb{R}$, $x \mapsto e^{-\gamma x^{2\beta}}$ satisfies $q(0) - q(x) \le L\, x^{2h}$ for all $x \in \mathbb{R}_+$, for $h = \beta$ and some $L > 0$.

Proof. The map $x \mapsto \frac{q(0) - q(x)}{x^{2h}}$ is continuous on $(0, +\infty)$ and goes to 0 as x goes to $+\infty$. Therefore, it remains to show that the limit exists when $x \to 0^+$. By L'Hôpital's rule,
$$\lim_{x \to 0^+} \frac{q(0) - q(x)}{x^{2h}} = \lim_{x \to 0^+} \frac{1 - e^{-\gamma x^{2\beta}}}{x^{2h}} = \lim_{x \to 0^+} \frac{2\gamma\beta\, x^{2\beta - 1} e^{-\gamma x^{2\beta}}}{2h\, x^{2h - 1}} = \frac{\gamma\beta}{h}\lim_{x \to 0^+} x^{2(\beta - h)} e^{-\gamma x^{2\beta}}.$$
The largest h for which the limit exists is $h = \beta$.

E.2. Proof of the main theorem

Before proving the main theorem, we introduce the approximations of the integral operator.

Integral operator approximations. Given the first- and second-stage sampling inputs, we introduce the operators
$$\Phi_D : \mathbb{R}^T \to \mathcal{H}_K,\ \alpha \mapsto \sum_{t=1}^T \alpha_t K_{P_t}, \qquad \Phi_{\hat D} : \mathbb{R}^T \to \mathcal{H}_K,\ \alpha \mapsto \sum_{t=1}^T \alpha_t K_{\hat P_{t,n}};$$
their adjoint operators are the first- and second-stage sampling operators
$$\Phi_D^* : \mathcal{H}_K \to \mathbb{R}^T,\ f \mapsto \big(f(P_t)\big)_{t=1}^T, \qquad \Phi_{\hat D}^* : \mathcal{H}_K \to \mathbb{R}^T,\ f \mapsto \big(f(\hat P_{t,n})\big)_{t=1}^T.$$
The integral operator is approximated at first and second stage respectively by $L_{K,D} := \frac{1}{T}\Phi_D \Phi_D^*$ and $L_{K,\hat D} := \frac{1}{T}\Phi_{\hat D} \Phi_{\hat D}^*$:
$$L_{K,D} : \mathcal{H}_K \to \mathcal{H}_K,\ f \mapsto \frac{1}{T}\sum_{t=1}^T f(P_t)\, K_{P_t}, \qquad L_{K,\hat D} : \mathcal{H}_K \to \mathcal{H}_K,\ f \mapsto \frac{1}{T}\sum_{t=1}^T f(\hat P_{t,n})\, K_{\hat P_{t,n}}.$$
The introduction of the integral operator and its approximations from finite samples is convenient, as we can express the solutions of the regularized empirical risk minimization problems as
$$f_{D,\lambda} = (L_{K,D} + \lambda I)^{-1}\frac{\Phi_D\, y}{T}, \qquad f_{\hat D,\lambda} = (L_{K,\hat D} + \lambda I)^{-1}\frac{\Phi_{\hat D}\, y}{T}. \quad (24)$$

Plan of the proof. Our goal is to control the expected excess risk $\mathcal{E}(f_{\hat D,\lambda}) - \mathcal{E}(f_\rho)$. Since $\mathcal{E}$ can be decomposed as
$$\mathcal{E}(f) = \|f - f_\rho\|^2_{L^2_{\rho_X}} + \mathcal{E}(f_\rho), \quad (25)$$
we consider the decomposition
$$\big(\mathcal{E}(f_{\hat D,\lambda}) - \mathcal{E}(f_\rho)\big)^{1/2} = \|f_{\hat D,\lambda} - f_\rho\|_{L^2} \le \underbrace{\|f_{\hat D,\lambda} - f_{D,\lambda}\|_{L^2}}_{=:B} + \underbrace{\|f_{D,\lambda} - f_\rho\|_{L^2}}_{=:A}.$$
Both terms can be written in terms of the integral operator and its approximations as defined previously,
$$A = \|f_{D,\lambda} - f_\rho\|_{L^2} = \Big\|L_K^{1/2}\Big[(L_{K,D} + \lambda I)^{-1}\frac{\Phi_D\, y}{T} - (L_{K,D} + \lambda I)^{-1}(L_{K,D} + \lambda I) f_\rho\Big]\Big\|_{\mathcal{H}_K} \le \Big\|L_K^{1/2}(L_{K,D} + \lambda I)^{-1}\Big(\frac{\Phi_D\, y}{T} - L_{K,D} f_\rho\Big)\Big\|_{\mathcal{H}_K} + \lambda\,\Big\|L_K^{1/2}(L_{K,D} + \lambda I)^{-1} L_K^{1/2} g_\rho\Big\|_{\mathcal{H}_K}, \quad (27)$$
where we used the isometry $L_K^{1/2}$ and that, since $f_\rho \in \mathcal{H}_K$, $f_\rho = L_K^{1/2}(g_\rho)$ for some $g_\rho \in L^2_{\rho_X}$. These quantities do not depend on the sampling rate of the chosen distance for the kernel.
Bounding the expected value of A boils down to controlling $\mathbb{E}\big\|(L_K + \lambda I)(L_{K,D} + \lambda I)^{-1}\big\|$ and $\mathbb{E}\big\|(L_K + \lambda I)^{-1}\big(\frac{\Phi_D\, y}{T} - L_{K,D} f_\rho\big)\big\|$, see Proposition 5 in Fang et al. (2020). This can be obtained through concentration inequalities. Term A is the excess risk for standard Kernel Ridge Regression (one-stage sampling) and we can directly plug in known results, for example Lemma 4 in Fang et al. (2020):
$$\mathbb{E}\big[\|f_{D,\lambda} - f_\rho\|_{L^2}\big] \le C\big(\kappa A\, \mathcal{B}_{T,\lambda} + \sqrt{\lambda}\,\|g_\rho\|_{L^2}\big),$$
where C is a constant that does not depend on any parameter, $\mathcal{B}_{T,\lambda} = \frac{2\kappa}{\sqrt{T}}\big(\frac{\kappa}{\sqrt{T\lambda}} + \sqrt{\mathcal{N}(\lambda)}\big)$, $\mathcal{N}(\lambda)$ is the effective dimension, $\mathcal{N}(\lambda) = \mathrm{Tr}\big(L_K (L_K + \lambda I)^{-1}\big)$, $\lambda > 0$, and A is such that $|y| \le A$, $\rho$-a.s.

Term B is the error induced by the fact that we only access the covariates $(P_t)_{t=1}^T$ through empirical approximations $(\hat P_{t,n})_{t=1}^T$; this term is specific to distribution regression. We follow a decomposition of B similar to the one used in Fang et al. (2020), but while they focused on MMD-based kernels, we adapt the decomposition to incorporate the sample rate of any distance substitution kernel:
$$B = \|f_{\hat D,\lambda} - f_{D,\lambda}\|_{L^2_{\rho_X}} = \Big\|(L_{K,\hat D} + \lambda I)^{-1}\frac{\Phi_{\hat D}\, y}{T} - (L_{K,D} + \lambda I)^{-1}\frac{\Phi_D\, y}{T}\Big\|_{L^2} = \Big\|(L_{K,\hat D} + \lambda I)^{-1}\frac{\Phi_{\hat D}\, y - \Phi_D\, y}{T} + \big[(L_{K,\hat D} + \lambda I)^{-1} - (L_{K,D} + \lambda I)^{-1}\big]\frac{\Phi_D\, y}{T}\Big\|_{L^2}$$
$$= \Big\|(L_{K,\hat D} + \lambda I)^{-1}\frac{\Phi_{\hat D}\, y - \Phi_D\, y}{T} + (L_{K,\hat D} + \lambda I)^{-1}\big(L_{K,D} - L_{K,\hat D}\big) f_{D,\lambda}\Big\|_{L^2} \le \underbrace{\Big\|L_K^{1/2}(L_{K,\hat D} + \lambda I)^{-1}\frac{\Phi_{\hat D}\, y - \Phi_D\, y}{T}\Big\|_{\mathcal{H}_K}}_{=:C} + \underbrace{\Big\|L_K^{1/2}(L_{K,\hat D} + \lambda I)^{-1}\big(L_{K,D} - L_{K,\hat D}\big) f_{D,\lambda}\Big\|_{\mathcal{H}_K}}_{=:D},$$
where we used that for two invertible operators $A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1}$, (24), and the isometry $L_K^{1/2}$. We aim at bounding $\mathbb{E}\|f_{\hat D,\lambda} - f_{D,\lambda}\|_{L^2}$ in expectation, and we will see that it involves the sample complexity $d(P, \hat P_n)$. Let us start with term C. By the Cauchy-Schwarz inequality,
$$\mathbb{E}[C] \le \mathbb{E}\Big[\big\|L_K^{1/2}(L_{K,\hat D} + \lambda I)^{-1}\big\|_{op}\,\Big\|\frac{\Phi_{\hat D}\, y - \Phi_D\, y}{T}\Big\|_{\mathcal{H}_K}\Big] \le \mathbb{E}\Big[\big\|L_K^{1/2}(L_{K,\hat D} + \lambda I)^{-1}\big\|_{op}^2\Big]^{1/2}\,\mathbb{E}\Big[\Big\|\frac{\Phi_{\hat D}\, y - \Phi_D\, y}{T}\Big\|^2_{\mathcal{H}_K}\Big]^{1/2},$$
where $\|\cdot\|_{op}$ denotes the operator norm. For the second factor we have
$$\mathbb{E}\Big[\Big\|\frac{\Phi_{\hat D}\, y - \Phi_D\, y}{T}\Big\|^2_{\mathcal{H}_K}\Big]^{1/2} = \mathbb{E}\Big[\Big\|\frac{1}{T}\sum_{t=1}^T y_t\big(K(\hat P_{t,n}, \cdot) - K(P_t, \cdot)\big)\Big\|^2_{\mathcal{H}_K}\Big]^{1/2} \le \frac{A}{T}\sum_{t=1}^T \mathbb{E}\Big[\big\|K(P_t, \cdot) - K(\hat P_{t,n}, \cdot)\big\|^2_{\mathcal{H}_K}\Big]^{1/2} \le \frac{AL}{T}\sum_{t=1}^T \mathbb{E}\big[d(P_t, \hat P_{t,n})^{2h}\big]^{1/2} \le \frac{AL}{T}\sum_{t=1}^T \mathbb{E}\big[d(P_t, \hat P_{t,n})^{2}\big]^{h/2} \le A L\, \alpha(n)^{h/2},$$
where we used $|y| \le A$ a.s., the Jensen inequality ($h \le 1$), (16) and the control over the sample complexity of d. For the first factor, Lemma 8 in Fang et al. (2020) shows
$$\big\|L_K^{1/2}(L_{K,\hat D} + \lambda I)^{-1}\big\|^2_{op} \le \frac{3\Sigma^2_D}{\lambda^2} + \frac{3}{\lambda^3}\big\|L_{K,D} - L_{K,\hat D}\big\|^2_{op} + \frac{3}{\lambda},$$
where $\Sigma_D = \big\|(L_K + \lambda I)^{-\frac{1}{2}}(L_K - L_{K,D})\big\|_{op}$. We now show that the middle term can be controlled with $\alpha(n)$. For all $f \in \mathcal{H}_K$,
$$\big\|(L_{K,D} - L_{K,\hat D})(f)\big\|^2_{\mathcal{H}_K} = \Big\|\frac{1}{T}\sum_{t=1}^T \big(K(P_t, \cdot)f(P_t) - K(\hat P_{t,n}, \cdot)f(\hat P_{t,n})\big)\Big\|^2_{\mathcal{H}_K} \le \frac{1}{T}\sum_{t=1}^T \big\|K(P_t, \cdot)f(P_t) - K(\hat P_{t,n}, \cdot)f(\hat P_{t,n})\big\|^2_{\mathcal{H}_K}$$
$$\le \frac{2}{T}\sum_{t=1}^T \Big[\big(f(P_t) - f(\hat P_{t,n})\big)^2\,\|K(P_t, \cdot)\|^2_{\mathcal{H}_K} + f(\hat P_{t,n})^2\,\big\|K(P_t, \cdot) - K(\hat P_{t,n}, \cdot)\big\|^2_{\mathcal{H}_K}\Big] \le \frac{2\kappa^2}{T}\sum_{t=1}^T \Big[\big(f(P_t) - f(\hat P_{t,n})\big)^2 + \|f\|^2_{\mathcal{H}_K}\,\big\|K(P_t, \cdot) - K(\hat P_{t,n}, \cdot)\big\|^2_{\mathcal{H}_K}\Big] \le \frac{4\kappa^2 L^2 \|f\|^2_{\mathcal{H}_K}}{T}\sum_{t=1}^T d(P_t, \hat P_{t,n})^{2h},$$
where we used that $K(P, P)$ is bounded by $\kappa^2$ for all $P \in X$, so that $f(\hat P_{t,n})^2 \le \|f\|^2_{\mathcal{H}_K}\|K(\hat P_{t,n}, \cdot)\|^2_{\mathcal{H}_K} \le \|f\|^2_{\mathcal{H}_K}\kappa^2$, together with
$$\big(f(P_t) - f(\hat P_{t,n})\big)^2 \le \|f\|^2_{\mathcal{H}_K}\,\big\|K(P_t, \cdot) - K(\hat P_{t,n}, \cdot)\big\|^2_{\mathcal{H}_K} \le L^2\,\|f\|^2_{\mathcal{H}_K}\, d(P_t, \hat P_{t,n})^{2h}, \quad (34)$$
by the Hölder assumption. Hence,
$$\big\|L_{K,D} - L_{K,\hat D}\big\|^2_{op} \le \frac{4\kappa^2 L^2}{T}\sum_{t=1}^T d(P_t, \hat P_{t,n})^{2h}, \quad (35)$$
which implies
$$\mathbb{E}\Big[\big\|L_{K,D} - L_{K,\hat D}\big\|^2_{op}\Big] \le \frac{4\kappa^2 L^2}{T}\sum_{t=1}^T \mathbb{E}\big[d(P_t, \hat P_{t,n})^{2h}\big] \le 4\kappa^2 L^2\,\alpha(n)^h, \quad (36)$$
$$\mathbb{E}\Big[\big\|L_K^{1/2}(L_{K,\hat D} + \lambda I)^{-1}\big\|^2_{op}\Big] \le \frac{3\,\mathbb{E}[\Sigma^2_D]}{\lambda^2} + \frac{12}{\lambda^3}L^2\kappa^2\,\alpha(n)^h + \frac{3}{\lambda}.$$
Putting the bounds together, we have
$$\mathbb{E}[C] \le A L\,\alpha(n)^{h/2}\sqrt{\frac{3\,\mathbb{E}[\Sigma^2_D]}{\lambda^2} + \frac{12}{\lambda^3}L^2\kappa^2\,\alpha(n)^h + \frac{3}{\lambda}} \le \sqrt{3}\,A L\,\alpha(n)^{h/2}\Big(\frac{\mathbb{E}[\Sigma^2_D]^{1/2}}{\lambda} + \frac{2}{\lambda^{3/2}}L\kappa\,\alpha(n)^{h/2} + \frac{1}{\sqrt{\lambda}}\Big) \le \sqrt{3}\,A L\,\alpha(n)^{h/2}\Big(\frac{\kappa}{\lambda}\sqrt{\frac{\mathcal{N}(\lambda)}{T}} + \frac{2}{\lambda^{3/2}}L\kappa\,\alpha(n)^{h/2} + \frac{1}{\sqrt{\lambda}}\Big),$$
where we used Lemma 17 in Lin et al. (2017): $\mathbb{E}[\Sigma^2_D] \le \frac{\kappa^2 \mathcal{N}(\lambda)}{T}$.
Now for D,
$$\mathbb{E}[D] = \mathbb{E}\Big[\big\|L_K^{1/2}(L_{K,\hat D} + \lambda I)^{-1}\big(L_{K,D} - L_{K,\hat D}\big) f_{D,\lambda}\big\|_{\mathcal{H}_K}\Big] \le \mathbb{E}\Big[\big\|L_K^{1/2}(L_{K,\hat D} + \lambda I)^{-1}\big\|_{op}\,\big\|L_{K,D} - L_{K,\hat D}\big\|_{op}\,\|f_{D,\lambda}\|_{\mathcal{H}_K}\Big] \le \mathbb{E}_D\Big[\mathbb{E}_{\hat D|D}\big[\big\|L_K^{1/2}(L_{K,\hat D} + \lambda I)^{-1}\big\|_{op}^2\big]^{1/2}\,\mathbb{E}_{\hat D|D}\big[\big\|L_{K,D} - L_{K,\hat D}\big\|_{op}^2\big]^{1/2}\,\|f_{D,\lambda}\|_{\mathcal{H}_K}\Big].$$
All terms have already been dealt with except $\|f_{D,\lambda}\|_{\mathcal{H}_K}$: plugging in $\mathbb{E}_{\hat D|D}\big[\|L_{K,D} - L_{K,\hat D}\|_{op}^2\big]^{1/2} \le 2\kappa L\,\alpha(n)^{h/2}$ and the bound on $\mathbb{E}_{\hat D|D}\big[\|L_K^{1/2}(L_{K,\hat D} + \lambda I)^{-1}\|_{op}^2\big]^{1/2}$,
$$\mathbb{E}[D] \le 2\sqrt{3}\,\kappa L\,\alpha(n)^{h/2}\,\mathbb{E}_D\Big[\Big(\frac{\Sigma_D}{\lambda} + \frac{2}{\lambda^{3/2}}L\kappa\,\alpha(n)^{h/2} + \frac{1}{\sqrt{\lambda}}\Big)\|f_{D,\lambda}\|_{\mathcal{H}_K}\Big] \le 2\sqrt{3}\,\kappa L\,\alpha(n)^{h/2}\Big(\frac{\mathbb{E}_D[\Sigma^2_D]^{1/2}\,\mathbb{E}_D\big[\|f_{D,\lambda}\|^2_{\mathcal{H}_K}\big]^{1/2}}{\lambda} + \Big(\frac{2}{\lambda^{3/2}}L\kappa\,\alpha(n)^{h/2} + \frac{1}{\sqrt{\lambda}}\Big)\mathbb{E}_D\big[\|f_{D,\lambda}\|_{\mathcal{H}_K}\big]\Big).$$
We conclude with Proposition 7 in Fang et al. (2020), which states
$$\mathbb{E}_D\big[\|f_{D,\lambda}\|^2_{\mathcal{H}_K}\big] \le \frac{C}{\lambda} + 6\,\|g_\rho\|_\rho, \quad (41)$$
so that
$$\mathbb{E}[D] \le C\Big(\frac{1}{\lambda}\,\alpha(n)^{h/2} + \frac{1}{\sqrt{\lambda}}\Big)\alpha(n)^{h/2}. \quad (42)$$
Putting C and D together, we get
$$\mathbb{E}[B] \le \mathbb{E}[C] + \mathbb{E}[D] \le \frac{C L\,\alpha(n)^{h/2}}{\sqrt{\lambda}}\Big(\frac{L\kappa\,\alpha(n)^{h/2}}{\lambda} + 1\Big)\big(A + 2\kappa\,\mathcal{B}_{T,\lambda}\big) + \frac{C L\,\alpha(n)^{h/2}}{\sqrt{\lambda}}\Big(\frac{L\kappa\,\alpha(n)^{h/2}}{\lambda} + 1\Big)\big(A + 12\kappa\,\|g_\rho\|_\rho + 50\,A\,\mathcal{B}_{T,\lambda}\big).$$

E.3. Sample complexity MMD

Proposition 21. If $K_\Omega$ is bounded, $\sup_{x \in \Omega} K_\Omega(x, x) \le B$, then $d_{MMD}(P, Q) = \|\mu_{K_\Omega}(P) - \mu_{K_\Omega}(Q)\|_{\mathcal{H}_\Omega}$ is such that $\mathbb{E}\big[d_{MMD}(P, \hat P_n)^2\big] \le \frac{B}{n}$, where $P \in X$, $\hat P_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$, $x_i \overset{\text{i.i.d.}}{\sim} P$.

Proof. We denote by $\psi$ the canonical feature map of the kernel $K_\Omega$: $K_\Omega(x, y) = \langle \psi(x), \psi(y)\rangle_{\mathcal{H}_\Omega}$. Then,
$$\mathbb{E}\big[d_{MMD}(P, \hat P_n)^2\big] = \mathbb{E}\Big[\big\|\mu_{K_\Omega}(P) - \mu_{K_\Omega}(\hat P_n)\big\|^2_{\mathcal{H}_\Omega}\Big] = \mathbb{E}\Big[\Big\|\frac{1}{n}\sum_{i=1}^n \big(\psi(X_i) - \mu_{K_\Omega}(P)\big)\Big\|^2_{\mathcal{H}_\Omega}\Big] = \frac{1}{n^2}\Big(\sum_{i=1}^n \mathbb{E}\big\|\psi(X_i) - \mu_{K_\Omega}(P)\big\|^2_{\mathcal{H}_\Omega} + 2\sum_{1 \le i < j \le n}\mathbb{E}\big\langle \psi(X_i) - \mu_{K_\Omega}(P), \psi(X_j) - \mu_{K_\Omega}(P)\big\rangle_{\mathcal{H}_\Omega}\Big) = \frac{1}{n}\,\mathbb{E}\big\|\psi(X_1) - \mu_{K_\Omega}(P)\big\|^2_{\mathcal{H}_\Omega} \le \frac{B}{n},$$
since the cross terms vanish by independence. Alternatively, one could integrate a concentration inequality for $d_{MMD}(P, \hat P_n)$ via $\mathbb{E}\big[d_{MMD}(P, \hat P_n)^2\big] = \int_0^{+\infty} \mathbb{P}\big(d_{MMD}(P, \hat P_n)^2 > t\big)\, dt$, which leads to the less optimal bound
$$\mathbb{E}\big[d_{MMD}(P, \hat P_n)^2\big] \le \frac{2(2 + \pi)B}{n}.$$

E.4. Proof of Cor. 9

In this section, $\vee$ denotes the maximum and $\wedge$ the minimum between two elements.

Corollary 22. Under the assumptions of Thm. 8, for $\alpha(n) = c\, n^{-\beta}$, $\beta, c > 0$, taking $\lambda = \frac{1}{\sqrt{T}} \vee \frac{1}{n^{h\beta/2}}$, the bound can be simplified as
$$\mathbb{E}\big[\|f_{\hat D,\lambda} - f_\rho\|_{L^2_{\rho_X}}\big] \le C\Big(\frac{1}{T^{1/4}} + \frac{1}{n^{h\beta/4}}\Big),$$
where C is a constant that depends only on M, κ and L.

Proof. The bound $\mathcal{N}(\lambda) \le \kappa^2/\lambda$ allows to bound $\mathcal{B}_{T,\lambda} \le \frac{4\kappa^2}{\sqrt{T\lambda}}$. Plugging $\alpha(n) = c\, n^{-\beta}$ into the bound of Thm. 8 leads to
$$\mathbb{E}\big[\|f_{\hat D,\lambda} - f_\rho\|_{L^2_{\rho_X}}\big] \le C\Big[\frac{1}{n^{h\beta/2}\lambda^{1/2}}\Big(\frac{1}{\lambda\, n^{h\beta/2}} + 1\Big)\big(\|f_\rho\|_{\mathcal{H}_K} + 1\big) + \Big(\frac{1}{\sqrt{T\lambda}} + \sqrt{\lambda}\Big)\big(\|f_\rho\|_{\mathcal{H}_K} + 1\big)\Big].$$
Note that $\frac{1}{\sqrt{T\lambda}} \le \frac{1}{T^{1/4}}$ since $\lambda \ge \frac{1}{\sqrt{T}}$; furthermore $\sqrt{\lambda} = \frac{1}{T^{1/4}} \vee \frac{1}{n^{h\beta/4}} \le \frac{1}{T^{1/4}} + \frac{1}{n^{h\beta/4}}$, so that
$$\mathbb{E}\big[\|f_{\hat D,\lambda} - f_\rho\|_{L^2_{\rho_X}}\big] \le C\Big[\frac{1}{n^{h\beta/2}\lambda^{1/2}}\Big(\frac{1}{\lambda\, n^{h\beta/2}} + 1\Big)\big(\|f_\rho\|_{\mathcal{H}_K} + 1\big) + \frac{1}{T^{1/4}} + \frac{1}{n^{h\beta/4}}\Big].$$
We also have $\frac{1}{\lambda} = \sqrt{T} \wedge n^{h\beta/2} \le n^{h\beta/2}$, thus $\frac{1}{\lambda\, n^{h\beta/2}} \le 1$ and $\frac{1}{\lambda^{1/2}\, n^{h\beta/2}} \le \frac{1}{n^{h\beta/4}}$, hence
$$\mathbb{E}\big[\|f_{\hat D,\lambda} - f_\rho\|_{L^2_{\rho_X}}\big] \le C\Big[\frac{1}{n^{h\beta/4}}\big(\|f_\rho\|_{\mathcal{H}_K} + 1\big) + \frac{1}{T^{1/4}} + \frac{1}{n^{h\beta/4}}\Big] \le C\Big(\frac{1}{T^{1/4}} + \frac{1}{n^{h\beta/4}}\Big),$$
which concludes the proof. The notation C absorbs all the terms independent of n, T and λ throughout the proof.

F. Experimental details

The hyperparameters tested in both experiments are as follows: regularization for the kernel ridge regression (λ): 25 points between $10^{-8}$ and $10^{0}$ in logarithmic scale; γ for the Euclidean Gaussian kernel: 14 points between $10^{-3}$ and 1 in logarithmic scale; γ for the inner Gaussian MMD kernel: 14 points between $10^{-6}$ and $10^{0}$ in logarithmic scale; γ for the outer Gaussian MMD kernel: 7 points between $10^{-3}$ and $10^{0}$ in logarithmic scale; γ for the Gaussian Sliced Wasserstein kernels: 14 points between $10^{-5}$ and $10^{0}$ in logarithmic scale.

G. Additional experiments: Hellinger and total variation distances

For two probability measures P and Q, both absolutely continuous with respect to a third probability measure µ, the Hellinger distance between P and Q is
$$d_H(P, Q)^2 = 1 - \int \sqrt{\frac{dP}{d\mu}\,\frac{dQ}{d\mu}}\, d\mu.$$
It can be shown that this definition does not depend on µ, so the Hellinger distance between P and Q does not change if µ is replaced with a different probability measure with respect to which both P and Q are absolutely continuous. The Hellinger distance is bounded by 1, and it is not difficult to show that for empirical distributions with disjoint supports it is always equal to 1.
This explains why we cannot use the Hellinger distance on the Synthetic Mode classification experiment without a density estimation step, as the raw empirical distributions built from the mixtures will have disjoint supports with probability 1. For two empirical distributions with the same support (as is the case in the MNIST classification experiment), the Hellinger distance between $\hat P = \sum_{i=1}^d \alpha_i \delta_{x_i}$ and $\hat Q = \sum_{i=1}^d \beta_i \delta_{x_i}$, $x_i \in \mathbb{R}^r$, is
$$d_H\big(\hat P, \hat Q\big)^2 = 1 - \big\langle \sqrt{\alpha}, \sqrt{\beta}\big\rangle = 1 - \sum_{i=1}^d \sqrt{\alpha_i \beta_i},$$
where the square root is applied component-wise. Similarly, the total variation distance is given by
$$d_{TV}\big(\hat P, \hat Q\big) = \frac{1}{2}\,\|\alpha - \beta\|_1 = \frac{1}{2}\sum_{i=1}^d |\alpha_i - \beta_i|.$$
Both $d_H$ and $d_{TV}$ are Hilbertian; we can therefore plug them into a Gaussian kernel to perform KRR. Table 3 extends the results of Table 2 to these two distances. The setting of the experiment is the same as in the main text. Both $d_H$ and $d_{TV}$ do not incorporate information about differences in the support of the distributions. Therefore, they can perform well when the images are well aligned (no rotation or translation), but their performance degrades significantly when the images are perturbed. This is similar to the performance of the RBF kernel.

Table 3. Mean accuracy and standard deviation on rotated MNIST and Fashion MNIST with maximum rotation θ = 0, π/12, π/6, respectively. Comparison between KRR estimators with standard Gaussian kernel (RBF), doubly Gaussian MMD kernel (MMD) and Gaussian kernels with the distances SW2, SW1, Hellinger and total variation (TV).

MNIST       RBF          MMD          SW2          SW1          TV           HELLINGER
RAW         0.9 (0.01)   0.79 (0.03)  0.93 (0.01)  0.91 (0.03)  0.87 (0.01)  0.92 (0.01)
θ = π/12    0.51 (0.03)  0.40 (0.05)  0.85 (0.02)  0.66 (0.02)  0.44 (0.02)  0.58 (0.02)
θ = π/6     0.47 (0.03)  0.34 (0.04)  0.82 (0.01)  0.61 (0.01)  0.4 (0.03)   0.55 (0.02)

FASHION     RBF          MMD          SW2          SW1          TV           HELLINGER
RAW         0.82 (0.01)  0.75 (0.01)  0.81 (0.01)  0.77 (0.01)  0.83 (0.01)  0.84 (0.02)
θ = π/12    0.66 (0.01)  0.48 (0.03)  0.74 (0.01)  0.60 (0.02)  0.67 (0.02)  0.67 (0.02)
θ = π/6     0.64 (0.01)  0.45 (0.01)  0.70 (0.01)  0.58 (0.01)  0.65 (0.02)  0.65 (0.02)
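For completeness, a minimal sketch (ours, with hypothetical function names, not the code used for Table 3) of the shared-support Hellinger and total variation computations above, plugged into a Gaussian kernel $\exp(-\gamma d^2)$ as for the other distance substitution kernels:

```python
import numpy as np

def hellinger(alpha, beta):
    """Hellinger distance between two histograms on the same support:
    d_H(P, Q)^2 = 1 - sum_i sqrt(alpha_i * beta_i)."""
    a = np.asarray(alpha, dtype=float)
    b = np.asarray(beta, dtype=float)
    return np.sqrt(max(0.0, 1.0 - float(np.sum(np.sqrt(a * b)))))

def total_variation(alpha, beta):
    """Total variation distance: d_TV(P, Q) = 0.5 * sum_i |alpha_i - beta_i|."""
    a = np.asarray(alpha, dtype=float)
    b = np.asarray(beta, dtype=float)
    return 0.5 * float(np.sum(np.abs(a - b)))

# Example: two normalized images flattened into histograms over pixel locations.
p = np.array([0.2, 0.3, 0.5])
q = np.array([0.1, 0.4, 0.5])
gamma = 1.0
K_hellinger = np.exp(-gamma * hellinger(p, q) ** 2)   # Gaussian kernel on d_H
K_tv = np.exp(-gamma * total_variation(p, q) ** 2)    # Gaussian kernel on d_TV
```

Since both distances only compare the weight vectors on a fixed common support, any perturbation that moves mass to different pixel locations (rotation, translation) changes the histograms abruptly, which is consistent with the degradation observed in Table 3.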