# kernel_quantile_embeddings_and_associated_probability_metrics__c39eec93.pdf

Kernel Quantile Embeddings and Associated Probability Metrics

Masha Naslidnyk 1 Siu Lun Chau 2 François-Xavier Briol 3 Krikamol Muandet 4

Embedding probability distributions into reproducing kernel Hilbert spaces (RKHS) has enabled powerful nonparametric methods such as the maximum mean discrepancy (MMD), a statistical distance with strong theoretical and computational properties. At its core, the MMD relies on kernel mean embeddings to represent distributions as mean functions in RKHS. However, it remains unclear if the mean function is the only meaningful RKHS representation. Inspired by generalised quantiles, we introduce the notion of kernel quantile embeddings (KQEs). We then use KQEs to construct a family of distances that: (i) are probability metrics under weaker kernel conditions than MMD; (ii) recover a kernelised form of the sliced Wasserstein distance; and (iii) can be efficiently estimated with near-linear cost. Through hypothesis testing, we show that these distances offer a competitive alternative to MMD and its fast approximations.

1. Introduction

Many machine learning and statistical methods rely on representing, comparing, and measuring the distance between probability distributions. Kernel mean embeddings (KMEs) have been shown to be a mathematically and computationally convenient approach for this task (Berlinet and Thomas Agnan, 2004; Smola et al., 2007; Muandet et al., 2016). At its core, a KME represents a distribution as a mean function in a reproducing kernel Hilbert space (RKHS). When the kernel function is sufficiently regular and satisfies a condition called characteristic (Sriperumbudur et al., 2010), the

1Department of Computer Science, University College London, London, UK 2College of Computing & Data Science, Nanyang Technological University, Singapore 3Department of Statistical Science, University College London, London, UK 4CISPA Helmholtz Center for Information Security, Saarbrücken, Germany. Correspondence to: Masha Naslidnyk <masha.naslidnyk.21@ucl.ac.uk>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

representation of a distribution as a KME is unique, capturing all information about the distribution. The probability metric constructed by comparing KMEs, called maximum mean discrepancy (MMD) (Borgwardt et al., 2006; Gretton et al., 2012), has received significant attention due to its computational tractability. Its most common estimator has cost O(n2) and can be estimated with error O(n 1/2) in the number of data points n, but cheaper alternatives have also been proposed (Gretton et al., 2012; Chwialkowski et al., 2015a; Bodenham and Kawahara, 2023; Schrab et al., 2022). For this reason, KMEs and the MMD have been used to tackle a broad range of tasks from hypothesis testing (Gretton et al., 2012) to parameter estimation (Briol et al., 2019; Chérief Abdellatif and Alquier, 2020), causal inference (Muandet et al., 2021; Sejdinovic, 2024), feature attribution (Chau et al., 2022; 2023), and learning on distributions (Muandet et al., 2012; Szabó et al., 2016).

Nevertheless, the question of whether alternative kernelbased embeddings, particularly nonlinear counterparts, could exhibit desirable properties has long remained underexplored, in part due to the associated computational challenges. Recently, this gap has begun to be addressed, with works investigating kernelised medians (Nienkötter and Jiang, 2023), cumulants (Bonnier et al., 2023), and variances (Makigusa, 2024a). In this paper, we consider an alternative based on the concept of quantiles in an RKHS, which we term kernel quantile embeddings (KQEs). Similarly to the construction of KMEs, KQEs are obtained by considering the directional quantiles of a feature map obtained from a reproducing kernel. KQEs also lead naturally to a family of distances which we call kernel quantile discrepancies (KQDs). This approach is motivated from the statistics and econometrics literature (Kosorok, 1999; Dominicy and Veredas, 2013; Ranger et al., 2020; Stolfi et al., 2022), where matching quantiles has been shown effective in constructing statistical estimators and hypothesis tests.

Our paper identifies several desirable properties of KQEs. Firstly, from a theoretical point of view, we show in Theorem 1 and Theorem 2 that KQEs can represent distributions on any space for which we can define a kernel, and that the conditions to make a kernel quantile-characteristic, that is for KQEs to be a one-to-one representation of a probability distribution, are weaker than for the classical notion of characteristic, which we now call mean-characteristic. We

Kernel Quantile Embeddings and Associated Probability Metrics

then show in Theorem 3 that KQEs can be estimated at a rate of O(n 1/2) in the number of samples n; the same rate as that of the empirical estimator of KMEs (Tolstikhin et al., 2017). As a result, KQDs are probability metrics under much weaker conditions than the MMD (see Theorem 4), while maintaining comparable computational guarantees, including a finite-sample consistency with rate O(n 1/2) (up to log terms) for their empirical estimators (see Theorem 5).

Secondly, we establish a number of connections between KQDs, Wasserstein distances (Kantorovich, 1942; Villani et al., 2009), and generalisations or approximations thereof. In particular, special cases of our KQDs recover existing sliced Wasserstein (SW) distances (Bonneel et al., 2015; Wang et al., 2022; 2024a) and can interpolate between the Wasserstein distance and MMD similarly to Sinkhorn divergences (Cuturi, 2013; Genevay et al., 2019). These results are presented in Connections 1, 2, and 3.

Finally, we consider a specific instance of KQDs based on Gaussian averaging over kernelised quantile directions, which we name the Gaussian expected kernel quantile discrepancy (e-KQD). Beyond the desirable theoretical properties described above, we show that the Gaussian e-KQD also has attractive computational properties. In particular, we show that it has a natural estimator which only requires sampling from a Gaussian measure on the RKHS, and which can be computed with complexity O(n log2(n)). It is studied empirically in Section 5 with experiments on two-sample hypothesis testing, where we show that it is competitive with the MMD: it often outperforms estimators of the MMD of the same asymptotic complexity, and in some cases even outperforms MMD at higher computational costs.

2. Background

Let PX denote the set of Borel probability measures on a Borel space X. We begin by reviewing existing definitions of quantiles, followed by a summary of relevant work on probability metrics, including the MMD and SW distances.

2.1. Quantiles

Univariate quantiles. Let X R. For α [0, 1], the α-quantile of P PX is defined as ρα P = inf{y X : Pr Y P [Y y] α}. When P has a continuous and strictly monotonic cumulative distribution function FP , quantiles can also be defined through the inverse of that function ρα P := F 1 P (α). Notable special cases include α = 0.5, corresponding to the median, and α = 0.25, 0.75, corresponding to lowerand upper-quartiles respectively. Importantly, P is fully characterised by its quantiles {ρα P }α [0,1].

From a computational viewpoint, univariate quantiles can be straightforwardly estimated using order statistics. Suppose y1:n = [y1 . . . yn] P, and denote by [y1:n]j the jth

order statistic of y1:n (i.e. the jth largest value in the vector

density of u#P

=0.05 =0.2 =0.5 =0.7

Figure 1: Illustration of bivariate quantiles. Left: Bivariate distribution P. Center: Density of the projection of P onto direction u on the unit circle, with ϕu(x) = u, x . Right: different quantiles for all possible directions u.

[y1 . . . yn] ). The α-quantile of P, denoted ρα P , can be estimated using [y1:n] αn where denotes the ceiling function. This estimator is known to converge at a rate of O(n 1/2) (Serfling, 2009, Section 2.3.2).

Multivariate quantiles. Suppose now that X Rd for d > 1. The previous definition of quantiles depends on the existence of an ordering in X, and its natural generalisation to d > 1 is therefore not unique (Serfling, 2002). In this paper, we will focus on the notion of α-directional quantile of P along some direction u in the unit sphere Sd 1 (Kong and Mizera, 2012),

ρα,u P := ρα ϕu#P u, ϕu(y) = u, y .

Here, ϕu : X R is the projection map onto u, and ρα ϕu#P is the standard one-dimensional α quantile of ϕu#P the law of ϕu(X) for X P. We note that this quantile is now a d-dimensional vector as opposed to a scalar. The α-directional quantiles for d = 2 are illustrated in Figure 1, in which the probability measure P is projected onto some line; see the left and middle plots. Once again, we can use quantiles to characterise P, although we must now consider all α quantiles over a sufficiently rich family of projections {ρα,u P : α [0, 1], u Sd 1}; see Theorem 5 of (Kong and Mizera, 2012) for sufficient regularity conditions.

Although these multivariate quantiles satisfy scale equivariance and rotation equivariance, they do not satisfy location equivariance. To remedy this issue, Fraiman and Pateiro López (2012) introduced a related notion, the centered αdirectional quantile:

ρα,u P := ρα ϕu#P ϕu(EX P [X]) u + EX P [X], (1)

Further details are provided in Appendix B.

2.2. Probability Metrics

Kernel mean embeddings and MMD. Let X be some Borel space, and (H, , H) be a reproducing kernel Hilbert space (RKHS) induced by a real-valued kernel k : X X R (Schölkopf and Smola, 2002; Berlinet and Thomas Agnan, 2004), the kernel mean embedding (KME) µP : X R of any P PX is defined as the Bochner integral

Kernel Quantile Embeddings and Associated Probability Metrics

µP ( ) = EX P [k(X, )] H. The integral can be shown to exist provided EX P [ p

k(X, X)] < ; this, in turn, holds for all P PX if and only if k is bounded (Smola et al., 2007). If the mapping P µP is injective, the kernel k is said to be mean-characteristic. Many standard kernels the Matérn family, Gaussian, Laplacian have been shown to be characteristic on sufficiently regular spaces (Sriperumbudur et al., 2011; Ziegel et al., 2024). KMEs with mean-characteristic kernels lead to the squared maximum mean discrepancy (MMD) defined for any P, Q PX as

MMD2(P, Q) := µP µQ 2 H = EX,X P [k(X, X )]

2EX P,X Q[k(X, X )] + EX,X Q[k(X, X )].

This can be computed in rare cases (Briol et al., 2025), but typically needs to be estimated. Given n i.i.d. realisations from P and Q, MMD2 is most commonly estimated with a U-statistic, which converges to MMD2(P, Q) as O(n 1/2) and has computational complexity of O(n2) although linear-cost alternatives are also available (Gretton et al., 2012, Lemma 14.). These are discussed in Section 5 and Appendix A.

Wasserstein distances. Let c : X X R be a metric on X, and Γ(P, Q) PX X denote the space of joint distributions on X X with first and second marginals P and Q, respectively. The p-Wasserstein distance (Kantorovich, 1942; Villani et al., 2009) quantifies the cost of optimally transporting one distribution to another under cost c : X X R. It is a probability metric under mild conditions (Villani et al., 2009, Section 6), and is defined as

Wp(P, Q) = inf π Γ(P,Q) E(X,Y ) π [c(X, Y )p] 1/p .

When X Rd, the metric c is typically taken to be the Euclidean distance c(x, y) = x y 2. The Wasserstein distance can then be estimated by solving an optimal transport problem using empirical measures constructed through samples of P and Q, an approach that suffers from a high computational cost of O(n3) and, when P, Q have at least 2p moments, slow convergence of O(n 1/ max(d,2p)) when X Rd for d > 1 (Fournier and Guillin, 2015).

However, when d = 1, Wp can be computed at lower cost of O(n log n) with convergence of O(n 1/2p) when P, Q have at least 2p moments. This motivated the introduction of the sliced Wasserstein (SW) distance (Bonneel et al., 2015). Recall that ϕu(x) = u x. The SW distance projects high-dimensional distributions P, Q onto elements on the unit sphere u Sd 1 sampled uniformly, computes the Wasserstein distance between the projected distributions, now in R, and averages over the projections:

SWp(P, Q) = Eu U(Sd 1) W p p (ϕu#P, ϕu#Q) 1/p .

A further refinement, the max-sliced Wasserstein (max-SW) distance (Deshpande et al., 2018), aims to identify the optimal projection that maximises the 1D Wasserstein distance:

max-SWp(P, Q) = sup u Sd 1 W p p (ϕu#P, ϕu#Q) 1/p .

Both slicing distances reduce the computational complexity to O(ln log n) and the convergence rate to O(l 1/2 + n 1/2p), where l is either the number of projections, or the number of iterations of the optimiser. A further extension is the Generalised Sliced Wasserstein (GSW, Kolouri et al. (2019)), which replaces the linear projection ϕu with a nonlinear mapping. While the conditions for GSW to be a probability metric are highly non-trivial to verify, the authors showed that they hold for polynomials of odd degree.

Another approximation of the Wasserstein distance involves the introduction of an entropic regularisation term (Cuturi, 2013), which reduces the cost to O(n2) and can be estimated with sample complexity O(n 1/2) (Genevay et al., 2019). The solution to this regularised problem is referred to as the Sinkhorn divergence. Interestingly, Ramdas et al. (2017); Feydy et al. (2019) demonstrated that by varying the strength of the regularisation, the Sinkhorn divergence interpolates between the Wasserstein distance and the MMD with a kernel corresponding to the energy distance.

3. Kernel Quantile Embeddings and Discrepancies

We introduce directional quantiles in the RKHS and the corresponding discrepancies. Unlike in Section 2.1, the measures and their quantiles now live in different spaces: the measures are on X, and the quantiles are in the RKHS H induced by a kernel on X. This leads to greater flexibility: the approach works for any space a kernel can be defined on. Throughout, we assume the kernel k is measurable.

3.1. Kernel Quantile Embeddings

Let SH = {u H : u H = 1} be the unit sphere of an RKHS H induced by the kernel k. For P PX , we define its α-quantile along RKHS direction u SH as a function ρα,u P : X R in H with

ρα,u P (x) := ρα u#P u(x) (2)

By the reproducing property, it holds that ρα u#P u(x) = ρα ϕu#[ψ#P ]u(x), where ψ(x) = k(x, ) is the canonical feature map X H, and ϕu(h) = u, h H is the H H equivalent of the projection operator onto u defined in Section 2.1. Thus, when dim(H) < , the RKHS quantiles of P on X are exactly the multivariate quantiles of the measure of k(X, ), X P, on H. In other words, KQEs can be thought of as two-step embeddings: we first embed

Kernel Quantile Embeddings and Associated Probability Metrics

X P PX as an RKHS element and then compute its directional quantiles to obtain the KQEs.

Centered vs uncentered quantiles. Just as done for multivariate quantiles in Equation (1), a centered version of RKHS quantiles can be defined as

ρα,u P (x) := ρα u#P u, µP H u(x) + µP (x),

where µP is the KME of P. This coincides with Equation (1) for the measure being the law of k(X, ) with X P. The impact of centering is examined in detail in Appendix B, but two key observations are relevant here: (1) omitting centering eliminates the computational overhead of calculating means; (2) the only equivariance violated for the uncentered directional quantile is location equivariance: shifting k(X, ) by h shifts the quantile by h, u Hu, rather than by h itself. However, when KQEs are used to compare two distributions, the additional term h, u Hu cancels out as it does not depend on the measure. For these reasons, we primarily work with the uncentered RKHS quantiles.

Quantile-characteristic kernels. The kernel k is said to be quantile-characteristic if the mapping P 7 {ρα,u P : α [0, 1], u SH} is injective for P PX . In Rd, the Cramér-Wold theorem (Cramér and Wold, 1936) states that the set of all one-dimensional projections (or, equivalently, all quantiles of all one-dimensional projections) determines the measure. One may therefore recognise our next theorem as an RKHS-specific extension of the Cramér-Wold theorem. Earlier Hilbert space extensions required higherdimensional projections and imposed restrictive moment assumptions (Cuesta-Albertos et al., 2007). Being concerned with the RKHS case specifically allows us to prove the result under mild assumptions, as stated below.

Assumption A1. X is Hausdorff, separable, and σ-compact.

Being Hausdorff ensures points in X can be separated, and separability says X has a countable dense subset. σcompactness means X is a union of countably many compact sets. These are mild conditions, notably satisfied by Polish spaces including discrete topological spaces with at most countably many elements and topological manifolds.

It is possible to drop the σ-compactness and separability. When X is Hausdorff and completely regular, one can still get quantile-characteristic properties on Radon probability measures the "non-pathological" Borel probability measures. We discuss this in Appendix C.1 and refer to Willard (1970) for a review of general topological properties.

Assumption A2. The kernel k is continuous, and separating on X: for any x = y X, it holds that k(x, ) = k(y, ).

This is a mild condition: most commonly used kernels such as the Matérn, Gaussian, and Laplacian kernels are separating. The constant kernel k(x, x ) = c is an example of a non-separating kernel. Trivially, a non-separating kernel

0.0 0.2 0.4 0.6 0.8 1.0

Figure 2: Illustration of the impact of the slicing direction on KQEs. Suppose X P, the KQEs ρα,u P (x) := ρα u#P u(x) are obtained by considering the αth quantile of u(X). Clearly, these quantiles might vary significantly depending on the slicing direction used.

for which k(x, ) = k(y, ) will not be able to distinguish between Dirac measures δx and δy. The proof of the following result uses characteristic functionals, an extension of characteristic functions to measures on spaces beyond Rd. Unlike moments, characteristic functionals are defined for any probability measure which is the key to generality of KQEs. Further discussion and proof are in Appendix C.1. Theorem 1 (Cramér-Wold Theorem in RKHS). Under A1 and A2, the kernel k is quantile-characteristic, meaning the mapping P 7 {ρα,u P : α [0, 1], u SH} is injective.

The mildness of the assumptions in Theorem 1 naturally raises the question: is being quantile-characteristic a less restrictive condition than being mean-characteristic? This indeed holds, as shown in the result below. Theorem 2. Every mean-characteristic kernel k is also quantile-characteristic. The converse does not hold.

This result, proven in Appendix C.2, has a powerful implication. For any discrepancy D(P, Q) that aggregates the KQEs injectively (i.e D(P, Q) = 0 ρα,u P = ρα,u Q for all α, u), it holds that MMD(P, Q) > 0 D(P, Q) > 0, but D(P, Q) > 0 MMD(P, Q) > 0. This means D can tell apart every pair of measures MMD can, and sometimes more (see the proof for examples). This is intuitive: MMD is an injective aggregation of means (MMD(P, Q) = 0 EP [u] = EQ[u] for all u), and the set of all quantiles captures all the information in the mean, but not vice versa. Before introducing a specific family of quantile discrepancies, we discuss sample versions of KQEs.

Estimating KQEs. For fixed α [0, 1] and u SH, estimating the directional quantile ρα,u P with samples x1:n X boils down to estimating the R-quantile ρα u#P using samples u(x1:n). We employ the classic, model-free approach to estimate a quantile by using the order statistic estimator:

ρα,u Pn (x) := ρα u#Pnu(x) = [u(x1:n)] αn u(x), (3)

where Pn = 1/n Pn i=1 δxi. In other words, Equation (3) uses the α-quantile of the set u(x1:n) meaning, the αn - th largest element of u(x1:n). We now state an RKHS version of a classic result on convergence of quantile estimators; the proof is provided in Appendix C.3.

Kernel Quantile Embeddings and Associated Probability Metrics

Theorem 3 (Finite-Sample Consistency for Empirical KQEs). Suppose the PDF of u#P is bounded away from zero, fu#P (x) cu > 0, and x1:n P. Then, with probability at least 1 δ, and C(δ, u) = O( p

ρα,u Pn ρα,u P H C(δ, u)n 1/2.

We do not need to assume A1 and A2 to prove consistency; this was only needed to establish that k is quantilecharacteristic, and we may still have a consistent estimator when the kernel is not quantile-characteristic. The condition fu#P (x) cu > 0 lets us avoid making any assumptions on X, other than the existence of a kernel k on X.

3.2. Kernel Quantile Discrepancies

We propose to quantify the difference between P, Q PX in unit-norm direction u through a ν-weighted expectation of power-p distance (in the RKHS) between KQEs,

τp(P, Q; ν, u) = Z 1

ρα,u P ρα,u Q p Hν(dα) 1/p .

Figure 2 illustrates how u#P and u#Q vary depending on direction u and the impact it has on τp. The weighting measure ν on [0, 1] assigns importance to each α-quantile. For example, the Lebesgue measure ν µ treats all quantiles as equally important, whereas a partial-supported measure would allow us to ignore certain quantiles.

Based on τp(P, Q; ν, u), we introduce a novel family of Kernel Quantile Discrepancies (KQDs) that aggregate the directional differences τp(P, Q; ν, u) over u SH: the Lptype distance expected KQD (e-KQD) that uses the average as the aggregate function, and the L -type distance supremum KQD (sup-KQD) that aggregates with the supremum:

e-KQDp(P, Q; ν, γ) = Eu γ τ p p (P, Q; ν, u) 1/p ,

sup-KQDp(P, Q; ν) = sup u SH τ p p (P, Q; ν, u) 1/p, (4)

where γ is a measure on the unit sphere SH of the RKHS.

Next, we demonstrate that under mild conditions e-KQD and sup-KQD are indeed distances, and establish connections with existing methods. The proof is in Appendix C.4.

Theorem 4 (KQDs as Probability Metrics). Under A1, A2, and if ν has full support on [0, 1], sup-KQDp is a distance. Further, if γ has full support on SH, e-KQDp is a distance.

As discussed in Section 3, A1 and A2 are minor. The assumptions on the support of ν and γ ensure that no quantile level in [0, 1] and no parts of SH are missed entirely. This is satisfied, for example, for the uniform ν (that considers all quantiles to be equally important), and when H is separable, for any centered Gaussian γ = N(0, S)

with a non-degenerate S by (Kukush, 2020, Corollary 5.3). For example, an H 7 H covariance operator S[f](x) = R

X k(x, y)f(y)β(dy) is non-degenerate and well-defined provided (1) β on X has full support, and (2) R

k(x, x)β(dx) < . This choice of γ also happens to be computationally convenient, as discussed in Section 4.

In contrast, while conditions under which MMD is a distance are well-understood for bounded translation-invariant kernels on Euclidean spaces (Sriperumbudur et al., 2011), they are challenging to establish beyond this setting. For instance, it is known that commonly used graph kernels are not characteristic (Kriege et al., 2020).

When ν is chosen as the Lebesgue measure µ, an important connection emerges between e-KQD, sup-KQD, and sliced Wasserstein distances. This connection is formalised in the next result, with a proof provided in Appendix C.6.

Connection 1 (SW). Suppose P, Q have p-finite moments. Then, e-KQDp(P, Q; ν, γ) for ν µ corresponds to a kernel expected sliced p-Wasserstein distance, which has not been introduced in the literature. For X Rd, linear k(x, y) = x y, and uniform γ, this recovers the expected sliced p-Wasserstein distance (Bonneel et al., 2015).

Connection 2 (Max-SW). Suppose P, Q have p-finite moments. Then, sup-KQDp(P, Q; ν) for ν µ is the kernel max-sliced p-Wasserstein distance (Wang et al., 2022). For X Rd, linear k(x, y) = x y, and uniform γ, it recovers the max-sliced p-Wasserstein (Deshpande et al., 2018).

For d = 1, we recover standard Wasserstein. When k is non-linear but induces a finite-dimensional RKHS, e-KQD is connected to the Generalised Sliced Wasserstein distances of Kolouri et al. (2022) we explore this in Appendix C.6.

Lastly, we establish a connection to Sinkhorn divergence.

Connection 3 (Sinkhorn). Sinkhorn divergence (Cuturi, 2013), like e-KQD and sup-KQD, combines the strengths of kernel embeddings and Wasserstein distances. Furthermore, for p = 2 and ν µ, the centered version of e-KQD and sup-KQD developed in Appendix B can be represented as a sum of MMD and kernelised expected or max-sliced Wasserstein distances, thus positioning these measures as mid-point interpolants between MMD and SW distances.

It is important to note that the MMD term within the Sinkhorn divergence is restricted to a specific kernel tied to the energy distance in contrast, e-KQD and sup-KQD offer much greater flexibility in the choice of kernel. Moreover, as will be shown empirically in Section 5, the computational complexity of e-KQD for a particular choice of γ can be made significantly lower than that of Sinkhorn divergences, which have a cost of O(n2).

Estimating e-KQD. We propose a Monte-Carlo estimator for e-KQD, and refer to Wang et al. (2022) for an

Kernel Quantile Embeddings and Associated Probability Metrics

optimisation-based, O(n3 log(n)) estimator for sup-KQD. Let x1:n P, y1:n Q, the u1, . . . , ul SH to be l unitnorm functions sampled from γ, and fν to be the density of ν. Denote Pn = 1/n Pn i=1 δxi, Qn = 1/n Pn i=1 δyi. Then, similarly to the order statistic estimator of the quantiles in Equation (3), e-KQDp p(Pn, Qn; ν, γl) is the estimator of e-KQDp p(P, Q; ν, γ), where

e-KQDp p(Pn, Qn; ν, γl) (5)

p fν ( j/n )

Here, [ui(x1:n)]j is the j-th order statistics, meaning the j-th smallest element of ui(x1:n) = [ui(x1), . . . , ui(xn)] . For p = 1, we get the following result, proven in Appendix C.5.

Theorem 5 (Finite-Sample Consistency for Empirical KQDs). Let ν have a density, P, Q be measures on X s.t. EX P p

k(X, X) < and EX Q p

k(X, X) < , and x1:n P, y1:n Q. Then, with probability at least 1 δ, and C(δ) = O( p

log(1/δ)) that depends only on δ, k, ν,

|e-KQD1(Pn, Qn; ν, γl) e-KQD1(P, Q; ν, γ)|

C(δ)(l 1/2 + n 1/2).

The rate does not depend on dim(X) this is a major advantage of projection/slicing-based discrepancies (Nadjahi et al., 2020), which comes at the cost of dependence on the number or projections l. Setting l = n/ log n recovers the MMD rate (up to log-terms), at matching complexity (see Section 4). Here, we do not need e-KQD to be a distance indeed, we did not assume A1 and A2. The condition of square root integrability of k(X, X) under P, Q is immediately satisfied when k is bounded, and can in fact be further weakened to EX P EY Q p

k(X, X) 2k(X, Y ) + k(Y, Y ) < . Requiring that ν has a density is mild and necessary to reduce the problem to CDF convergence which, by the classic Dvoretzky-Kiefer-Wolfowitz inequality of Dvoretzky et al. (1956) has rate n 1/2 under no assumptions on the underlying distributions. The strength of this inequality allows us to assume nothing more of X than the fact that it is possible to define a kernel on it.

Further, for any integer p > 1, the n 1/2 rate still holds if and only if it holds that for Jp(R) := (Fu#R(t)(1 Fu#R(t)))p/2 /f p 1 u#R(t), both Jp(P) and Jp(Q) are integrable over u γ and Lebesgue measure on u(X). In turn, this may be reduced to a problem of controlling d 1 volumes of level sets of u. We discuss this extension further in Conjecture 1 in Appendix C.5.

4. Gaussian Kernel Quantile Discrepancy

Algorithm 1 Gaussian e-KQD

Input: Data x1:n P, y1:n Q, samples from the reference measure z1:m ξ, kernel k, density fν, number of projections l, power p. Initialise e-KQDp 0 and τ p p,i 0 for i = 1 . . . l. for i = 1 to l do

Sample λ1:m N(0, Idm) Compute fi(x1:n) λ 1:mk(z1:m, x1:n)/ m, fi(y1:n) λ 1:mk(z1:m, y1:n)/ m Compute fi H p

λ 1:mk(z1:m, z1:m)λ1:m/m Compute ui(x1:n) fi(x1:n)/ fi H, ui(y1:n) fi(y1:n)/ fi H Sort ui(x1:n) and ui(y1:n) for j = 1 to n do

τ p p,i τ p p,i+ [ui(x1:n)]j [ui(y1:n)]j pfν( j/n ) end for e-KQDp e-KQDp + τ p p,i/l end for Return e-KQDp

We now conduct further empirical study of the squared kernel distance e-KQDp. Unlike its supremum-based counterpart sup-KQD, e-KQD can be approximated simply by drawing samples from γ on SH, avoiding the challenges associated with optimising for the supremum. Although a uniform γ is a natural choice, no such measure exists when dim(H) is infinite (Kukush, 2020, Section 1.3). Instead, we follow a well-established strategy from the inverse problems literature (Stuart, 2010) and take γ to be the projection onto SH of a Gaussian measure on H. Using established techniques for sampling Gaussian measures, we then build an efficient estimator for e-KQDp(P, Q; ν, γ). Gaussian measures on Hilbert spaces are a natural extension of the familiar Gaussian measures on Rd: a measure N(0, C) on H is said to be a centered Gaussian measure with covariance operator C : H H if, for every f H, the pushforward of N(0, C) under the H R projection map ϕf( ) = f, H is the Gaussian measure N(0, C[f], f H) on R. For further details on Gaussian measures in Hilbert spaces, we refer to Kukush (2020).

Let γ be a centered Gaussian measure on H whose covariance function C : H H is an integral operator with some reference measure ξ on X,

γ = N(0, C), C[f](x) = Z

X k(x, y)f(y)ξ(dy),

and let γ be the pushforward of γ by the projection H SH that maps any f H to f/ f H SH. By the change of variables formula for pushforward measures (Bogachev,

Kernel Quantile Embeddings and Associated Probability Metrics

2007, Theorem 3.6.1), it holds that

e-KQDp p(P, Q; ν, γ) = Eu γ τ p p (P, Q; ν, u)

= Ef γ τ p p (P, Q; ν, f/ f H) .

This equality reduces sampling from γ to sampling from a centered Gaussian measure with an integral operator covariance function. The next proposition reduces sampling from (a finite-sample approximation of) γ to sampling from the standard Gaussian on the real line; proof is in Appendix C.7.

Proposition 1 (Sampling from a Gaussian measure). Let z1:m ξ, and γ m to be the estimate of γ based on the Monte Carlo estimate Cm of the covariance operator C,

γ m = N(0, Cm), Cm[g](x) = 1

j=1 k(x, zj)g(zj).

Suppose f(x) = m 1/2 Pm j=1 λjk(x, zj) with λ1:m N(0, 1). Then, f γ m.

Algorithm 1 brings together the e-KQD estimator in Equation (5), and the procedure for sampling from the Gaussian measure in Proposition 1. The ν choice is left up to the user; the uniform ν remains a default choice. We proceed to analyse the cost. This estimator has complexity O(l max(nm, m2, n log n)): O(l) for iterating over directions i {1, . . . , l}; O(nm) for computing fi(x1:n) and fi(y1:n); O(m2) for computing fi H; and O(n log n) for sorting ui(x1:n) and ui(y1:n). For l := log n and m := log n, the complexity therefore reduces to O(n log2 n); i.e. near-linear (up to log-terms).

5. Experiments

We empirically demonstrate the effectiveness of KQDs for nonparametric two-sample hypothesis testing which aims at determining whether two arbitrary probability distributions, P and Q, differ statistically based on their respective i.i.d. samples. Two-sample testing is widely adopted in scientific discovery fields, such as model verification (Gao et al., 2024), out-of-domain detection (Magesh et al., 2023), and comparing epistemic uncertainties (Chau et al., 2025). Specifically, we test the null hypothesis H0 : P = Q against the alternative H1 : P = Q. In such tests, (estimators of) probability metrics are commonly used as test statistics, including the Kolmogorov-Smirnov distances (Kolmogorov, 1960), Wasserstein distance (Wang et al., 2022), energydistances (Székely and Rizzo, 2005; Sejdinovic et al., 2013), and most relevant to our work, the MMD (Gretton et al., 2006; 2009; 2012). For an excellent overview of kernelbased two sample testing, we refer readers to Schrab (2025).

Experiments are repeated to calculate the rejection rate, which is the proportion of tests where the null hypothesis is rejected. A high rejection rate indicates better performance

at distinguishing between distributions. It is equally important to ensure proper control of Type I error, defined as the rejection rate when the null hypothesis H0 is true. Specifically, the Type I error rate should not exceed the specified level. Without controlling for Type I error, an inflated rejection rate might not reflect the estimator s ability to detect genuine differences but instead indicate the test rejects more often than it should. We consider a significance level α of 0.05 throughout and report on Type I control in Appendix D.

To determine the rejection threshold for each test statistic, we employ a permutation-based approach: for each trial we pool the two samples, randomly reassign labels 300 times to simulate draws under H0, compute the test statistic on each permuted split, and take the 95th percentile of this empirical null distribution as our threshold. This fully nonparametric thresholding ensures Type I error control without additional distributional assumptions (Lehmann et al., 1986).

Our experiments aim to demonstrate that, within a comparable computational budget, statistics computed using quantile-characteristic kernels can deliver results competitive with those of MMD tests based on mean-characteristic kernels. Additionally, we seek to explore the inherent tradeoffs of the proposed methods. We focus on the nonparametric two-sample testing problem, as it represents one of the most successful applications of the mean-embeddingbased MMD and its variants. The code is available at https://github.com/Masha Naslidnyk/kqe.

5.1. Benchmarking

We consider the following distances as test statistics in our experiments. Detail descriptions of these estimators are provided in Appendix A. For KQDs, we take the reference measure ξ (c.f. Proposition 1) to be 1/2Pn + 1/2Qn, where Pn corresponds to the empirical distribution 1/n Pn i=1 δxi, analogously for Qn. Such ξ is a general choice that is appropriate in the absence of additional information about the space X. We take power p = 2 for all KQD-based discrepancies in our experiments; identical experiments for p = 1 lead to the same conclusions and are presented for completeness in Appendix D.2. Other than in the second experiment, we use the RBF kernel k(x, x ) = exp( x x 2/2σ2) with σ the bandwidth chosen using the median heuristic method, i.e. σ = Median({ xi xj 2 2, i, j 1, . . . , n}) (Gretton et al., 2012). Due to space constraints, we present all methods on the same plot, regardless of their computational complexity. However, it is important to note that directly comparing test power across methods with varying sampling complexities may be unfair and misleading.

e-KQD (ours). For e-KQD, we set the number of projections to l = log n and the number of samples drawn from the Gaussian reference to m = log n. Consequently, the overall computational complexity is O(n log2(n)).

Kernel Quantile Embeddings and Associated Probability Metrics

32 64 128 256 512 Dimension

Rejection rate

(a) Power Decay

100 500 2000 5000 10000 Number of samples

Rejection rate

(b) Laplace v.s. Gaussian

100 500 1000 1500 2000 2500 Number of samples

Rejection rate

(c) Galaxy MNIST

MMD-Lin (O(n)) MMD-Multi (O(nlog2 n)) sup-KQD2 (O(nlog2 n)) e-KQD2 (O(nlog2 n)) MMD (O(n2)) e-KQD2-Centered (O(n2))

100 500 1000 1500 2000 Number of samples

Rejection rate

(d) CIFAR-10 v.s. CIFAR-10.1

Figure 3: Experimental results comparing our proposed methods with baseline approaches. Methods represented by dotted lines exhibit quadratic complexity for a single computation of the test statistic, while the remaining methods achieve near-linear or linear computational efficiency. A higher rejection rate indicates better performance in distinguishing between distributions. Overall, quadratic-time quantile-based estimators perform comparably to quadratic-time MMD estimators, while near-linear time quantile-based estimators often outperform their MMD-based counterparts.

e-KQD-centered (ours). The centered version of e-KQD, as discussed in Appendix B, can be expressed as the sum of an e-KQD term and the classical MMD. While the e-KQD component follows the same sampling configuration as above, the MMD computation is the dominant factor in complexity, leading to an overall cost of O(n2).

sup-KQD (ours). sup-KQD adopts the same sampling configuration as e-KQD (thus cost O(n log2(n))). Instead of averaging over projections, it selects the maximum across all projections. This approach serves as a fast approximation of the kernel max-sliced Wasserstein distance of Wang et al. (2022), where a Riemannian block coordinate descent method is used to optimise an entropic regularised objective at a computational cost of O(n3 log(n)). In contrast, our approach identifies the largest directional quantile difference across the sampled projections. While we do not claim that this provides an accurate estimate of the true distance, this approach allows for controlled complexity and facilitates comparisons between averaging or taking the supremum.

MMD. The MMD is included as a benchmark to be compared with e-KQD-centered and has complexity O(n2). The MMD is estimated using the U-statistic formulation.

MMD-Multi. A fast MMD approximation based on incomplete U-statistic introduced in Schrab et al. (2022) is included to benchmark against our e-KQD distance. Configurations of MMD-Multi are chosen as to match the complexity of e-KQD for a fair comparison.

MMD-Lin. MMD-Linear from Gretton et al. (2012, Lemma 14.) estimates the MMD with complexity O(n).

5.2. Experimental Setup and Results

We conduct four experiments: two using synthetic data, allowing full control over the simulation environment, and two based on high-dimensional image data to showcase the

practicality and competitiveness of our proposed methods. Additional experiments are reported in Appendix D, specifically: studying the impact of changing the measures ν and ξ, comparing with sliced Wasserstein distances, and comparing with MMD based on other KME approximations.

1. Power-decay experiment. This experiment investigates the effect of the curse of dimensionality on our tests, following the setup of Experiment A in Wang et al. (2022). Prior work by Ramdas et al. (2015) has shown that MMD-based methods are particularly vulnerable to the curse of dimensionality. Here, we assess whether our quantile-based test statistic exhibits similar limitations.

We fix n = 200 and take P to be an isotropic Gaussian distribution of dimension d. Similarly, we take Q to be a d-dimensional Gaussian distribution with a diagonal covariance matrix Σ = diag({4, 4, 4, 1, . . . , 1}). As we increase the dimension d [32, 64, 128, 256, 512], the testing problem becomes increasingly challenging. Figure 3a presents the results. We observe that e-KQD exhibits the slowest decline in test power among all methods, irrespective of their computational complexity. Notably, it maintains its performance significantly better than its O(n log2(n)) benchmark, MMD-Multi. These results suggest that quantile-based discrepancies exhibit greater robustness to high-dim data.

2. Laplace v.s. Gaussian. This experiment aims to illustrate Theorem 2 by demonstrating that while a kernel may not be mean-characteristic meaning it cannot distinguish between two distributions using standard KMEs and MMDs it can still be quantile-characteristic. In such cases, the distributions can still be effectively distinguished using our KQEs and KQDs. To demonstrate this, we take P to be a standard Gaussian in d = 1, and Q to be a Laplace distribution with matching first and second moment. We vary n {100, 500, 2000, 5000, 10000} and select a polynomial kernel of degree 3, i.e. k(x, x ) = ( x, x + 1)3, for all our

Kernel Quantile Embeddings and Associated Probability Metrics

methods. This ensures that k cannot distinguish between the two distributions due to their matching first and second moments, which leads to their KMEs being identical.

Figure 3b shows that our KQDs, irrespective of their computational complexity, exhibit increasing test power as the sample size grows. In contrast, MMD-based methods fail entirely to detect any differences between P and Q. Notably, although e-KQD-centered can be expressed as the sum of an MMD term and an a e-KQD term, the underperformance of the MMD component in this scenario is effectively compensated by the e-KQD term, enabling successful testing.

3. Galaxy MNIST. We examine performance on real-world data through galaxy images (Walmsley et al., 2022) in dimension d = 3 64 64 = 12288, following the setting from Biggs et al. (2024). These images consist of four classes. P corresponds to images sampled uniformly from the first three classes, while Q consists of samples from the same classes with probability 0.85 and from the fourth class with probability 0.15. A Gaussian RBF kernel with bandwidth chosen using the median heuristic method is chosen for all estimators. Sample sizes are chosen from n {100, 500, 1000, 1500, 2000, 2500}.

Figure 3c presents the results. e-KQD-centered and MMD exhibit nearly identical performance, suggesting that the MMD term is dominating in the e-KQD-centered estimator. Among the near-linear time test statistics, e-KQD and sup-KQD show a slight advantage over MMD-Multi in distinguishing between the distributions of Galaxy images.

4. CIFAR-10 v.s. CIFAR-10.1. We conclude with an experiment on telling apart the CIFAR-10 (Krizhevsky et al., 2012) and CIFAR-10.1 (Recht et al., 2019) test sets, following again Liu et al. (2020) and Biggs et al. (2024). The dimension is d = 3 32 32 = 3072. This is a challenging task, as CIFAR-10.1 was designed to provide new samples from the CIFAR-10 distribution, making it an alternative test set for models trained on CIFAR-10. We conduct the test by drawing n samples from CIFAR-10, and n samples from CIFAR-10.1, with n {100, 500, 1000, 1500, 2000}.

Figure 3d presents the results. Consistent with previous observations, test statistics with quadratic computational complexity exhibit nearly identical performance. However, our quantile discrepancy estimators with near-linear complexity significantly outperform the fast MMD estimators (MMDMulti) of the same complexity, highlighting the practical advantages of our methods in real-world testing scenarios where computational efficiency is a critical consideration.

An empirical runtime comparison of all methods is presented in Figure 4, which shows the time (in seconds) required to complete this experiment. The empirical results align with our complexity analysis: the near-linear estimators exhibit comparable performance, while the quadratic estimators are significantly slower. The proposed near-linear

100 500 1000 1500 2000 Number of Samples

Seconds (Log-Scale)

vs CIFAR-10.1

e-KQD -Centered

MMD-Lin MMD sup-KQD

Runtime comparisons on the CIFAR-10

Figure 4: Comparing the time (in seconds) required to complete the CIFAR-10 vs. CIFAR-10.1 experiment, plotted on a logarithmic scale. A shorter time indicates a faster algorithm. These results align with our complexity analysis.

KQD estimator makes it suitable for larger-scale datasets.

6. Discussion and Future Work

This work explores representations of distributions in a RKHS beyond the mean, using functional quantiles to capture richer distributional characteristics. We introduce kernel quantile embeddings (KQEs) and their associated kernel quantile discrepancies (KQDs), and establish that the conditions required for KQD to define a distance are strictly more general than those needed for MMD to be a distance. Additionally, we propose an efficient estimator for the expected KQD based on Gaussian measures, and demonstrate its effectiveness compared to MMD and its fast approximations through extensive experiments in two-sample testing. Our findings demonstrates the potential of KQEs as a powerful alternative to traditional mean-based representations.

Several promising avenues remain. Firstly, future work could explore more sophisticated methods for improving the empirical estimates of KQEs. The study of optimal kernel selection to maximize test power when using KQD for hypothesis testing, analogous to existing work on MMDs (Jitkrittum et al., 2020; Liu et al., 2020; Schrab et al., 2023) could also be explored. Secondly, considering the demonstrated potential of functional quantiles for representing marginal distributions, it is natural to ask whether they could provide a powerful alternative to conditional mean embeddings (CMEs) (Song et al., 2009; Park and Muandet, 2020), the Hilbert space representation of conditional distributions. These complementary developments will unlock new avenues for enhancing existing applications of KMEs across a wide range of domains, including not only nonparametric two-sample testing, but also (conditional) independence testing, causal inference, reinforcement learning, learning on distributions, generative modeling, robust parameter estimation, and Bayesian representations of distributions via kernel mean embeddings, as explored in Flaxman et al. (2016); Chau et al. (2021a;b), among others.

Kernel Quantile Embeddings and Associated Probability Metrics

Acknowledgements

The authors are grateful to Carlo Ciliberto and Antonin Schrab for fruitful discussions on Gaussian measures and MMD two-sample testing respectively. MN acknowledges support from the U.K. Research and Innovation under grant number EP/S021566/1, and from the Helmholtz Information & Data Science Academy (HIDA) for providing financial support enabling a short-term research stay at CISPA (Application No. 14773).

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

F. Bach. Information theory with kernel methods. IEEE Transactions on Information Theory, 69(2):752 775, 2022.

A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science+Business Media, New York, 2004.

A. Bharti, M. Naslidnyk, O. Key, S. Kaski, and F.-X. Briol. Optimally-weighted estimators of the maximum mean discrepancy for likelihood-free inference. In International Conference on Machine Learning, pages 2289 2312, 2023.

F. Biggs, A. Schrab, and A. Gretton. MMD-FUSE: Learning and combining kernels for two-sample testing without data splitting. Advances in Neural Information Processing Systems, 36, 2024.

S. Bobkov and M. Ledoux. One-dimensional empirical measures, order statistics, and Kantorovich transport distances, volume 261. American Mathematical Society, 2019.

D. A. Bodenham and Y. Kawahara. eu MMD: efficiently computing the MMD two-sample test statistic for univariate data. Statistics and Computing, 33(5):1 14, 2023.

V. I. Bogachev. Measure theory, volume 1. Springer, 2007.

N. Bonneel, J. Rabin, G. Peyré, and H. Pfister. Sliced and radon wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51:22 45, 2015.

P. Bonnier, H. Oberhauser, and Z. Szabó. Kernelized cumulants: Beyond kernel mean embeddings. Advances in Neural Information Processing Systems, 36, 2023.

K. Borgwardt, A. Gretton, M. Rasch, H.-P. Kriegel, B. Schölkopf, and A. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):49 57, 2006.

F.-X. Briol, A. Barp, A. B. Duncan, and M. Girolami. Statistical inference for generative models with maximum mean discrepancy. ar Xiv:1906.05944, 2019.

F.-X. Briol, A. Gessner, T. Karvonen, and M. Mahsereci. A dictionary of closed-form kernel mean embeddings. ar Xiv:2504.18830, 2025.

A. Chatalic, N. Schreuder, L. Rosasco, and A. Rudi. Nyström kernel mean embeddings. In International Conference on Machine Learning, pages 3006 3024. PMLR, 2022.

S. L. Chau, S. Bouabid, and D. Sejdinovic. Deconditional downscaling with gaussian processes. Advances in Neural Information Processing Systems, 34:17813 17825, 2021a.

S. L. Chau, J.-F. Ton, J. González, Y. Teh, and D. Sejdinovic. Bayesimp: Uncertainty quantification for causal data fusion. Advances in Neural Information Processing Systems, 34:3466 3477, 2021b.

S. L. Chau, R. Hu, J. Gonzalez, and D. Sejdinovic. Rkhsshap: Shapley values for kernel methods. Advances in neural information processing systems, 35:13050 13063, 2022.

S. L. Chau, K. Muandet, and D. Sejdinovic. Explaining the uncertain: Stochastic shapley values for gaussian process models. Advances in Neural Information Processing Systems, 36:50769 50795, 2023.

S. L. Chau, A. Schrab, A. Gretton, D. Sejdinovic, and K. Muandet. Credal two-sample tests of epistemic uncertainty. In International Conference on Artificial Intelligence and Statistics, pages 127 135. PMLR, 2025.

B.-E. Chérief-Abdellatif and P. Alquier. MMD-Bayes: Robust Bayesian estimation via maximum mean discrepancy. In Proceedings of The 2nd Symposium on Advances in Approximate Bayesian Inference (AABI), pages 1 21, 2020.

K. Chwialkowski, A. Ramdas, D. Sejdinovic, and A. Gretton. Fast two-sample testing with analytic representations of probability measures. In Advances in Neural Information Processing Systems, pages 1981 1989, 2015a.

K. P. Chwialkowski, A. Ramdas, D. Sejdinovic, and A. Gretton. Fast two-sample testing with analytic representations of probability measures. Advances in Neural Information Processing Systems, 28, 2015b.

Kernel Quantile Embeddings and Associated Probability Metrics

H. Cramér and H. Wold. Some theorems on distribution functions. Journal of the London Mathematical Society, 1(4):290 294, 1936.

J. A. Cuesta-Albertos, R. Fraiman, and T. Ransford. A sharp form of the cramér wold theorem. Journal of Theoretical Probability, 20(2):201 209, 2007.

M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems, 26, 2013.

I. Deshpande, Z. Zhang, and A. Schwing. Generative modeling using the sliced Wasserstein distance. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3483 3491, 2018.

Y. Dominicy and D. Veredas. The method of simulated quantiles. Journal of Econometrics, 172(2):235 247, 2013.

A. Dvoretzky, J. Kiefer, and J. Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, pages 642 669, 1956.

J. Feydy, T. Séjourné, F.-X. Vialard, S.-I. Amari, A. Trouvé, and G. Peyré. Interpolating between optimal transport and MMD using Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, number PMLR 89, pages 2681 2690, 2019.

S. Flaxman, D. Sejdinovic, J. P. Cunningham, and S. Filippi. Bayesian learning of kernel embeddings. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence. Association for Computing Machinery, 2016.

N. Fournier and A. Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4):707 738, 2015.

R. Fraiman and B. Pateiro-López. Quantiles for finite and infinite dimensional data. Journal of Multivariate Analysis, 108:1 14, 2012.

I. Gao, P. Liang, and C. Guestrin. Model equality testing: Which model is this api serving? ar Xiv preprint ar Xiv:2410.20247, 2024.

A. Genevay, L. Chizat, F. Bach, M. Cuturi, and G. Peyré. Sample complexity of Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, 2019.

A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample-problem. Advances in Neural Information Processing Systems, 19: 513 520, 2006.

A. Gretton, K. Fukumizu, Z. Harchaoui, and B. K. Sriperumbudur. A fast, consistent kernel two-sample test. Advances in Neural Information Processing Systems, 22, 2009.

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(1):723 773, 2012.

M. W. Hirsch. Differential Topology, volume 33 of Graduate Texts in Mathematics. Springer-Verlag, New York, 1976.

P. J. Huber. Robust estimation of a location parameter. Ann. Math. Statist., 35(4):73 101, 1964.

W. Jitkrittum, H. Kanagawa, and B. Schölkopf. Testing goodness of fit of conditional density models with kernels. In Conference on Uncertainty in Artificial Intelligence, pages 221 230. PMLR, 2020.

L. V. Kantorovich. On the translocation of masses. Dokl. Akad. Nauk SSSR, 37(7-8):227 229, 1942. ISSN 10723374.

A. N. Kolmogorov. Foundations of the Theory of Probability. Chelsea Pub Co, 2 edition, 1960.

S. Kolouri, K. Nadjahi, U. Simsekli, R. Badeau, and G. K. Rohde. Generalized sliced Wasserstein distances. In Advances in Neural Information Processing Systems, pages 261 272, 2019.

S. Kolouri, K. Nadjahi, S. Shahrampour, and U. Sim sekli. Generalized Sliced Probability Metrics. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4513 4517, May 2022. doi: 10.1109/ICASSP43922. 2022.9746016. URL https://ieeexplore.ieee. org/document/9746016. ISSN: 2379-190X.

L. Kong and I. Mizera. Quantile tomography: Using quantiles with multivariate data. Statistica Sinica, 22(4):1589 1610, 2012.

M. R. Kosorok. Two-sample quantile tests under general conditions. Biometrika, 86(4):909 921, 1999.

N. M. Kriege, F. D. Johansson, and C. Morris. A survey on graph kernels. Applied Network Science, 5:1 42, 2020.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. Image Net classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, volume 25, 2012.

A. Kukush. Gaussian measures in Hilbert space: construction and properties. John Wiley & Sons, 2020.

Kernel Quantile Embeddings and Associated Probability Metrics

E. L. Lehmann, J. P. Romano, and G. Casella. Testing statistical hypotheses, volume 3. Springer, 1986.

M. Lerasle, Z. Szabó, T. Mathieu, and G. Lecué. Monk outlier-robust mean embedding estimation by median-ofmeans. In International conference on machine learning, pages 3782 3793. PMLR, 2019.

Y. Li, Y. Liu, and J. Zhu. Quantile regression in reproducing kernel Hilbert spaces. Journal of the American Statistical Association, 102(477):255 268, 2007.

F. Liu, W. Xu, J. Lu, G. Zhang, A. Gretton, and D. J. Sutherland. Learning deep kernels for non-parametric twosample tests. In International conference on machine learning, pages 6316 6326. PMLR, 2020.

A. Magesh, V. V. Veeravalli, A. Roy, and S. Jha. Principled out-of-distribution detection via multiple testing. Journal of Machine Learning Research, 24(378):1 35, 2023.

N. Makigusa. Two-sample test based on maximum variance discrepancy. Communications in Statistics - Theory and Methods, 53(15):5421 5438, 2024a.

N. Makigusa. Two-sample test based on maximum variance discrepancy. Communications in Statistics-Theory and Methods, 53(15):5421 5438, 2024b.

S. Minsker. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4), Nov. 2015. ISSN 13507265. doi: 10.3150/14-BEJ645. URL http://arxiv. org/abs/1308.1334. ar Xiv:1308.1334 [math, stat].

K. Muandet, K. Fukumizu, F. Dinuzzo, and B. Schölkopf. Learning from distributions via support measure machines. In Advances in Neural Information Processing Systems 25, pages 10 18. 2012.

K. Muandet, K. Fukumizu, B. K. Sriperumbudur, and B. Schölkopf. Kernel mean embedding of distributions: A review and beyonds. Foundations and Trends in Machine Learning, 10(1-2):1 141, 2016.

K. Muandet, M. Kanagawa, S. Saengkyongam, and S. Marukatat. Counterfactual mean embeddings. Journal of Machine Learning Research, 22(162):1 71, 2021.

A. Müller. Integral probability metrics and their generating classes of functions. Advances in applied probability, 29 (2):429 443, 1997.

K. Nadjahi, A. Durmus, L. Chizat, S. Kolouri, S. Shahrampour, and U. Sim sekli. Statistical and topological properties of sliced probability divergences. In Neural Information Processing Systems, 2020.

K. Nadjahi, A. Durmus, L. Chizat, S. Kolouri, S. Shahrampour, and U. Sim sekli. Statistical and Topological Properties of Sliced Probability Divergences, Jan. 2022. URL http://arxiv.org/abs/2003. 05783. ar Xiv:2003.05783 [cs, stat].

A. Nienkötter and X. Jiang. Kernel-based generalized median computation for consensus learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45 (5):5872 5888, 2022.

A. Nienkötter and X. Jiang. Kernel-based generalized median computation for consensus learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45 (5):5872 5888, 2023.

Z. Niu, J. Meier, and F.-X. Briol. Discrepancy-based inference for intractable generative models using quasi-Monte Carlo. Electronic Journal of Statistics, 17(1):1411 1456, 2023.

J. Park and K. Muandet. A measure-theoretic approach to kernel conditional mean embeddings. In Advances in Neural Information Processing Systems, volume 33, pages 21247 21259. Curran Associates, Inc., 2020.

G. Peyré, M. Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355 607, 2019.

J. Rabin, G. Peyré, J. Delon, and M. Bernot. Wasserstein barycenter and its application to texture mixing. In International conference on scale space and variational methods in computer vision, pages 435 446. Springer, 2011.

A. Ramdas, S. J. Reddi, B. Póczos, A. Singh, and L. Wasserman. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.

A. Ramdas, N. García Trillos, and M. Cuturi. On Wasserstein two-sample testing and related families of nonparametric tests. Entropy, 19(2):47, 2017.

J. Ranger, J.-T. Kuhn, and C. Szardenings. Minimum distance estimation of multidimensional diffusion-based item response theory models. Multivariate Behavioral Research, 55(6):941 957, 2020.

B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do imagenet classifiers generalize to imagenet? In International Conference on Machine Learning, pages 5389 5400, 2019.

B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2002.

Kernel Quantile Embeddings and Associated Probability Metrics

A. Schrab. A unified view of optimal kernel hypothesis testing. ar Xiv preprint ar Xiv:2503.07084, 2025.

A. Schrab, I. Kim, B. Guedj, and A. Gretton. Efficient aggregated kernel tests using incomplete U-statistics. In Advances in Neural Information Processing Systems, pages 18793 18807, 2022.

A. Schrab, I. Kim, M. Albert, B. Laurent, B. Guedj, and A. Gretton. MMD aggregated two-sample test. Journal of Machine Learning Research, 24(194):1 81, 2023.

D. Sejdinovic. An overview of causal inference using kernel embeddings. ar Xiv:2410.22754, 2024.

D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, pages 2263 2291, 2013.

R. Serfling. Quantile functions for multivariate analysis: Approaches and applications. Statistica Neerlandica, 56 (2):214 232, 2002.

R. J. Serfling. Approximation theorems of mathematical statistics. John Wiley & Sons, 2009.

S. J. Sheather and J. S. Marron. Kernel quantile estimators. Journal of the American Statistical Association, 85(410): 410 416, 1990.

A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proceedings of the 18th International Conference on Algorithmic Learning Theory (ALT), pages 13 31, 2007.

L. Song, J. Huang, A. Smola, and K. Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th International Conference on Machine Learning (ICML), June 2009.

B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11, 2010.

B. K. Sriperumbudur, K. Fukumizu, and G. R. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12 (7), 2011.

P. Stolfi, M. Bernardi, and L. Petrella. Sparse simulationbased estimator built on quantiles. Econometrics and Statistics, 2022.

A. M. Stuart. Inverse problems: a bayesian perspective. Acta numerica, 19:451 559, 2010.

Z. Szabó, B. K. Sriperumbudur, B. Póczos, and A. Gretton. Learning theory for distribution regression. Journal of Machine Learning Research, 17(1):5272 5311, 2016.

G. J. Székely and M. L. Rizzo. A new test for multivariate normality. Journal of Multivariate Analysis, 93(1):58 80, 2005.

I. Tolstikhin, B. K. Sriperumbudur, K. Mu, et al. Minimax estimation of kernel mean embeddings. Journal of Machine Learning Research, 18(86):1 47, 2017.

N. Vakhania, V. Tarieladze, and S. Chobanyan. Probability distributions on Banach spaces, volume 14. Springer Science & Business Media, 1987.

C. Villani et al. Optimal transport: old and new, volume 338. Springer, 2009.

M. Walmsley, C. Lintott, T. Géron, S. Kruk, C. Krawczyk, K. W. Willett, S. Bamford, L. S. Kelvin, L. Fortson, Y. Gal, et al. Galaxy zoo decals: Detailed visual morphology measurements from volunteers and deep learning for 314 000 galaxies. Monthly Notices of the Royal Astronomical Society, 509(3):3966 3988, 2022.

J. Wang, R. Gao, and Y. Xie. Two-Sample Test with Kernel Projected Wasserstein Distance. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, pages 8022 8055. PMLR, 2022.

J. Wang, M. Boedihardjo, and Y. Xie. Statistical and computational guarantees of kernel max-sliced Wasserstein distances. ar Xiv:2405.15441, 2024a.

J. Wang, M. Boedihardjo, and Y. Xie. Statistical and computational guarantees of kernel max-sliced wasserstein distances. ar Xiv preprint ar Xiv:2405.15441, 2024b.

S. Willard. General Topology. Addison-Wesley Series in Mathematics. Addison Wesley Longman Publishing, New York, NY, Jan. 1970.

J. Ziegel, D. Ginsbourger, and L. Dümbgen. Characteristic kernels on Hilbert spaces, Banach spaces, and on sets of measures. Bernoulli, 30(2):1441 1457, 2024.

Kernel Quantile Embeddings and Associated Probability Metrics

Supplementary Material

This supplementary material is structured as follows. In Appendix A, we recall existing probability metrics, define alternative KQDs, and then describe their respective finite-sample estimators. In Appendix C, we provide the proofs of all theoretical results in the paper. In Appendix D, we provide additional numerical experiments to complement the main text.

A. Probability Metrics and Their Estimators

A.1. Maximum Mean Discrepancy

We first recall that the MMD is an integral probability metric (Müller, 1997) where the supremum can be obtained in closed-form:

MMD(P, Q) := sup f H 1 |EX P [f(X)] EY Q[f(X)]| = µP µQ H

Using the reproducing property, we can then expressed the squared-MMD as

MMD2(P, Q) := EX,X P [k(X, X )] 2EX P,Y Q[k(X, Y )] + EY,Y Q[k(Y, Y )] (6)

This section describes the most widely used estimators for this quantity based on i.i.d. samples x1:n P and y1:n Q. The first is a biased V-statistic estimator with computational complexity O(n2) and convergence rate O(n 1/2):

MMD2 V (P, Q) := 1

j=1 k(xi, xj) 2

j=1 k(xi, yj) + 1

j=1 k(yi, yj)

Alternatively, an unbiased estimator can be constructed through a U-statistic. Such an estimator also has computational complexity O(n2) and convergence rate O(n 1/2) (Gretton et al., 2012, Lemma 6;Corollary 16), and is given by

MMD2 U(P, Q) := 1 n(n 1)

i =j k(xi, xj) 2

j=1 k(xi, yj) + 1 n(n 1)

i =j k(yi, yj)

A cheaper estimator was proposed in Lemma 14 of (Gretton et al., 2012). This estimator has computational complexity O(n) and convergence rate O(n 1/2), and we will refer to it as MMD-Lin. The estimator is given by:

MMD2 lin(P, Q) := 1 n/2

i=1 k(x2i 1, x2i) + k(y2i 1, y2i) k(x2i 1, y2i) k(x2i, y2i 1)

Finally, we also studied estimators whose computational complexity are between that of MMD-lin and the Uor V-statistics estimators. These estimators are due to Schrab et al. (2022) and we refer to them as MMD-Multi, and takes the following form

MMD2 Multi(P, Q) := 2 r(2n r 1)

i=1 k(xi, xi+j) + k(yi, yi+j) k(xi, yi+j) k(xi+j, yi)

where r is the number of subdiagonal considered. In our experiments, to match the complexity with e-KQD, we set r = log2(n). They have computational complexity O(rn).

Note that several estimators with faster convergence rates exist (Niu et al., 2023; Bharti et al., 2023), but these have computational cost ranging from O(n2) to O(n3) and require more regularity conditions on k, P and Q, and we therefore omit them from our benchmark. Bodenham and Kawahara (2023) also introduced an estimator with computational complexity of O(n log(n)) (and convergence O(n 1/2)) using slices/projections to d = 1. However, their approach is restrictive in that it can only be used for the Laplace kernel, and we therefore also do not compare to it.

Kernel Quantile Embeddings and Associated Probability Metrics

A.2. Wasserstein Distance

The p-Wasserstein distance (Kantorovich, 1942; Villani et al., 2009) is defined as

Wp(P, Q) := inf π Π(P,Q) E(X,Y ) π [ρ(X, Y )p] 1/p .

Given samples x1:n P and y1:n Q, this distance can be approximated using a plug-in Wp(1/n Pn i=1 δxi, 1/n Pn i=1 δyi), which can be computed in closed-form at a cost of O(n3), but converges to Wp(P, Q) with a convergence rate O(n 1/d).

When X Rd and p = 1, we obtain the 1-Wasserstein distance which, similarly to the MMD, can be written as an integral probability metric (Müller, 1997):

W1(P, Q) := sup f Lip 1 |EX P [f(X)] EX Q[f(X)]|

where f Lip = supx,y X,x =y |f(x) f(y)|/ x y denotes the Lipschitz norm.

When P, Q are distributions on a one dimensional space X R that have p-finite moments, the p-Wasserstein distance can be expressed in terms of distance between quantiles of P and Q (see for instance Peyré et al. (2019, Remark 2.30))

Wp(P, Q) = Z 1

0 |ρα P ρα Q|pdα 1/p (7)

A natural estimator for the Wasserstein distance is therefore based on approximating these one-dimensional quantiles using order statistics. Given x1:n P and y1:n Q, denote by Pn = 1/n Pn i=1 δxi and Qn = 1/n Pn i=1 δyi the corresponding empirical approximations to P and Q. Then, the j/n -th quantiles of Pn and Qn are exactly the j-th order statistics [x1:n]j and [y1:n]j, meaning the j th smallest elements of x1:n and y1:n respectively. Then Wp(Pn, Qn) takes the exact form

Wp(Pn, Qn) =

j=1 |[x1:n]j [y1:n]j|p

and is an estimator of Wp(P, Q). This estimator costs O(n log(n)) to compute (due to the cost of sorting n data points), and a convergence rate of O(n 1/2) for p = 1, and minimax convergence rate O(n 1/2p) for integer p > 1 when P, Q have at least 2p finite moments. In some cases, the p > 1 rate can be improved upon to match the O(n 1/2) rate of p = 1: we refer to Bobkov and Ledoux (2019) for a thorough overview.

A.3. Sliced Wasserstein

The sliced Wasserstein (SW) distances (Rabin et al., 2011; Bonneel et al., 2015) between two distributions P, Q on Rd use one-dimensional projections to reduce computational cost.

Expected SW. For an integer p 1, expected SW is defined as

SWp(P, Q) := Eu U(Sd 1)[W p p (ϕu#P, ϕu#Q)] 1/p ,

where U(Sd 1) is the uniform distribution on the unit sphere Sd 1, the measures ϕu#P, ϕu#Q are pushforwards under the projection operator ϕu(x) = u, x , and Wp is the one-dimensional p-Wasserstein distance as in Equation (7). Given x1:n P and y1:n Q, the integral over the sphere is approximated by Monte Carlo sampling of l directions u1:l, which together with the estimator in Equation (8) gives

d SW p p(P, Q) = 1

i=1 W p p (ϕuj#Pn, ϕuj#Qn) = 1

Here, [ ui, x1:n ]j is the j-th order statistics, meaning the j-th smallest element of ui, x1:n = [ ui, x1 , . . . , ui, xn ] . This estimator can be computed in O(ln log n) time (the cost on sorting n samples, for l directions) and was shown to converge at rate O(l 1/2 + n 1/2) for p = 1 (Nadjahi et al., 2022).

Kernel Quantile Embeddings and Associated Probability Metrics

Max SW. The max-sliced Wasserstein (max-SW) distance (Deshpande et al., 2018) replaces the average over projections in expected SW with a supremum over directions,

max-SWp(P, Q) := sup u Sd 1 W p p (ϕu#P, ϕu#Q) 1/p ,

where, ϕu(x) = u, x is again the projection operator, and Wp is the one-dimensional p-Wasserstein distance of Equation (7). Max-SW emphasizes the direction of greatest dissimilarity between the two measures.

Given x1:n P and y1:n P, max-SW is estimated as Wp(ϕu #P, ϕu #Q), for u the projection that maximises W p p (ϕu#Pn, ϕu#Qn) as given in Equation (8). In (Deshpande et al., 2018), u was approximated by optimising a heuristic, rather than the actual W p p (ϕu#Pn, ϕu#Qn). Then, Kolouri et al. (2022) approached the actual problem of

u = argmax u =1 W p p (ϕu#Pn, ϕu#Qn).

by running projected gradient descent on Sd 1, where each gradient step requires computing the derivative of the 1D Wasserstein distance w.r.t. u. Concretely, they initialise u1 randomly and iterate

ut+1 = Proj Sd 1 Optim u W p p (ϕut#P, ϕut#Q), u1:t , (9)

where Proj Sd 1(x) = x/ x is the operator projecting onto the unit sphere, and Optim is an optimiser of choice, such as ADAM. Each evaluation of Wp and its gradient in one dimension costs O(n log n), so the overall complexity is O(Tn log n) for T gradient steps. It is important to point out the optimisation may be noisy, with the value objective getting worse after some iterations. Indeed, if zt+1 is the solution to Optim u W p p (ϕut#P, ϕut#Q), u1:t , is it an improvement over ut, meaning W p p (ϕut#P, ϕut#Q) W p p (ϕzt+1#P, ϕzt+1#Q). Written out explicitly,

j=1 |[ ut, x1:n ]j [ ut, y1:n ]j|p

j=1 |[ zt+1, x1:n ]j [ zt+1, y1:n ]j|p ,

Then, ut+1 = Proj Sd 1(zt+1) = zt+1/ zt+1 , and it may happen that W p p (ϕut#P, ϕut#Q) > W p p (ϕut+1#P, ϕut+1#Q). The desired W p p (ϕut#P, ϕut#Q) W p p (ϕut+1#P, ϕut+1#Q) is guaranteed when zt+1 p 1, which may not happen.

A.4. Generalised Sliced Wasserstein

The generalised (max-)sliced Wasserstein (GSW and max-GSW) distances (Kolouri et al., 2022) extend SW and max-SW by using a family of nonlinear feature maps {fθ : Rd R}θ Θ instead of linear projections. Formally,

GSWp(P, Q) := Eθ µW p p (fθ#P, fθ#Q) 1/p , max-GSWp(P, Q) := sup θ Θ W p p (fθ#P, fθ#Q) 1/p ,

where fθ#P denotes the pushforward of P by fθ and µ is a probability measure over the parameter space Θ. For fθ(x) = θ, x and Θ = Sd 1 with uniform µ, GSW reduce to the standard SW distances. For expected GSW, sampling {θi}l i=1 µ yields an estimator with the same O(ln log n) computational complexity as expected SW (Kolouri et al., 2022). For max-GSW, the projected gradient descent approach of Equation (9) applies, at the same complexity of O(Tn log n) as for max-SW.

Statistical and topological properties of GSW depend completely on the choice of the family {fθ : θ Θ}. Kolouri et al. (2022) consider the specific case of polynomial fθ, and show GSW is then a metric on probability distributions on Rd.

A.5. Kernel Sliced Wasserstein

A special case of the GSW arises when the feature maps fθ are drawn from a reproducing kernel Hilbert space (RKHS). Let k : Rd Rd R be a positive definite kernel that induces the RKHS H with unit sphere SH. Then, the kernel sliced Wasserstein (KSW) can be introduced as

e-KSWp(P, Q) := Eu γW p p (u#P, u#Q) 1/p , max-KSWp(P, Q) := sup u SH W p p (u#P, u#Q) 1/p ,

Kernel Quantile Embeddings and Associated Probability Metrics

where γ is some probability measure on SH. The expected KSW is a new construct, while max-KSW was introduced in Wang et al. (2022), and studied further in Wang et al. (2024b); in both papers k was assumed to be universal. Finding the optimal u for max-KSW was shown to be NP-hard in Wang et al. (2024b); they propose an estimator at cost O(T 3/2n2). Though still more expensive than computing the V-statistic estimator of MMD, this is an improvement over O(Tn3) in the original work of Wang et al. (2022).

As pointed out in the main text, the choice of a uniform γ in e-KSW, while seemingly natural, may not be feasible as there is no uniform or Lebesgue measure in infinite dimensional spaces. In the main paper, we propose a practical choice of γ that facilitates an efficient estimator, and study computational cost. Further, we establish statistical and topological properties that apply to both expected and max-KSW and do not assume a universal kernel.

A.6. Sinkhorn Divergence

The entropic regularisation of optimal transport leads to the Sinkhorn divergence (Cuturi, 2013; Genevay et al., 2019). For distributions P, Q and regularisation parameter ε > 0, the entropic OT cost is defined as

Wp,ε(P, Q) := inf π Π(P,Q) E(X,Y ) π[ X Y p] + εKL(π P Q) 1/p .

The Sinkhorn divergence then corrects for the entropic bias:

Sp,ε(P, Q) := Wp,ε(P, Q) Wp,ε(P, P)/2 Wp,ε(Q, Q)/2. (10)

This quantity interpolates between MMD-like behavior for large ε and true Wasserstein for ε 0, and can be computed efficiently via Sinkhorn iterations at cost O(n2) per iteration (Cuturi, 2013).

A.7. Kernel covariance embeddings

Kernel covariance (operator) embeddings (KCE, Makigusa (2024b)) represent the distribution P as the second-order moment of the function k(X, ), for X P, as an alternative to the first-order moment (the kernel mean embedding). Due to being moments of the same distribution, the two share key positives and drawbacks: KCE for kernel k exists if and only if KME for k2 exists, and the kernel k is covariance characteristic if and only if k2 is mean-characteristic (Bach, 2022). The divergence proposed in Makigusa (2024b) is the distance between the KCE, and is estimated at O(n3) due to the need to compute full eigendecomposition of the KCE in order to compute the norm. In contrast, our proposed kernel quantile embeddings (KQE) embed quantiles, and therefore the relation to the KCE comes down to matching quantiles (which always exist, and come with an efficient estimator), compared to matching the second moment in the infinite-dimensional RKHS (which may not exist, and requires eigenvalue decomposition).

A.8. Kernel median embeddings

The median embedding (Nienkötter and Jiang, 2022) of P is the geometric median of k(X, ), X P in the RKHS, meaning the RKHS element which, on average, is L1-closest to the point k(X, ). Explicitly put, it is the function med P H defined through

med P = argmin f H

H f( ) k(x, ) HP(dx).

The median exists for any separable Hilbert space (Minsker, 2015). However, even for an empirical Pn = 1/n Pn i=1 δxi, there is no closed-form solution to this L1-problem, and the median is typically approximated using iterative algorithms like Weiszfeld s algorithm. The estimator proposed in Nienkötter and Jiang (2022) has a computational complexity of O(n2). The property of being median-characteristic, as far as the authors are aware, has not been explored, and no theoretical guarantees are available.

The connection to 1D-projected quantiles as done in KQE, even specifically the 1D-projected median, is also unclear. Expanding the understanding of geometric median embeddings is an area for future research.

A.9. Other Related Work

Kernel methods have also been studied in the context of quantile estimation and regression (Sheather and Marron, 1990; Li et al., 2007). These methods, however, focus on using either kernel density estimation or kernel ridge regression to estimate

Kernel Quantile Embeddings and Associated Probability Metrics

univariate quantiles. In contrast, our focus lies in exploring directional quantiles in the RKHS, and using them to estimate distances between distributions. We introduce this idea in the following section.

B. Connection between Centered and Uncentered Quantiles

Proposition 2 (Centered e-KQD2). The e-KQD2 and sup-KQD2 correspondence derived based on centered directional quantiles, now expressed as e-KQD2(P, Q; µ, γ)2 and sup-KQD2(P, Q; µ, γ)2 can be expressed as follows,

e-KQD2(P, Q; µ, γ)2 = e-KQD2(P, Q; µ, γ) + MMD2(P, Q) Eu γ[(EX P [u(X)] EY Q[u(Y )])2],

e-KQD2(P, Q; µ, γ) + MMD2(P, Q)

sup-KQD2(P, Q; µ, γ)2 = sup u SH

τ 2 2 (P, Q, µ, u) (EX P [u(X)] EY Q[u(Y )])2 + MMD2(P, Q)

sup-KQD2(P, Q; µ, γ) + MMD2(P, Q)

Proof. Let P, Q PX be measures on some instance space X. Further, define ψ : x 7 k(x, ), and write Pψ = ψ#P and Qψ = ψ#Q. Now Pψ and Qψ are measures on the RKHS Hk. Recall the definition of centered directional quantiles in Section 2.1,

ρα,u Pψ = ρα ϕu#Pψ ϕu(EY Pψ[Y ]) u + EY Pψ[Y ]

Now since we are working in the RKHS Hk, the expectation term EY Pψ[Y ] corresponds to the kernel mean embedding µP := EP [k(X, )], thus we can rewrite the above expression as,

ρα,u Pψ = ρϕu#Pψ u, µP u + µP

ρα,u Qψ can be defined analogously. Now consider integrating the difference between the two centered directional quantiles along all quantile levels, leading to

τ2(P, Q, µ, u) = Z 1

0 ρα,u Pψ ρα,u Qψ 2 Hkµ(dα) 1

We now proceed to show τ 2 2 (P, Q, µ, u) ,where µ is the Lebesgue measure, can be expressed as a sum between an uncentered e-KQD2 term with the MMD. Starting with expanding the RKHS norm inside the integrand,

ρα,u Pψ ρα,u Qψ 2 Hk = (ρα ϕu#Pψ ρα ϕu#Qψ u, µP µQ ) | {z } =:A R

u + µP µQ 2 Hk

= Au + (µP µQ) 2 Hk = 2 Au, µP µQ + Au 2 Hk + µP µQ 2 Hk = 2A u, µP µQ + A2 + MMD2(P, Q) (12)

Plugging the expression from Equation (12) into Equation (11), we get the following,

τ 2 2 (P, Q, µ, u) = Z 1

0 (2A u, µP µQ + A2)µ(dα) + MMD2(P, Q)

= 2 u, µP µQ Z 1

0 Aµ(dα) + Z 1

0 A2µ(dα) + MMD2(P, Q) (13)

For the first term on the right hand side, notice that,

0 Aµ(dα) = Z 1

0 (ρα ϕu#Pψ ρα ϕu#Qψ u, µP µQ )µ(dα) (14)

Kernel Quantile Embeddings and Associated Probability Metrics

Recall standard results from probability theory that integrating the quantile function between 0 to 1 with the Lebesgue measure returns you the expectation, specifically, that is, Z 1

0 ρα ϕu#Pψµ(dα) = EX P [u(X)] = u, µP .

Using this fact, the terms in Equation (14) cancels out, leaving R 1 0 Aµ(dα) = 0. Therefore, continuing from Equation (13), we have,

τ 2 2 (P, Q, µ, u) = Z 1

0 A2µ(dα) + MMD2(P, Q)

0 (ρα ϕu#Pψ ρα ϕu#Qψ u, µP µQ )2µ(dα) + MMD2(P, Q)

0 (ρα sµP ,u#(ϕu#Pψ) ρα sµQ,u#(ϕu#Qψ))u 2µ(dα) + MMD2(P, Q)

where sµP ,u : R R is a shifting function defined as sµP ,u(r) = r u, µp for r R. Alternatively, after expanding the terms in A2, we can express τ 2 2 (P, Q, µ, u) as,

τ 2 2 (P, Q, µ, u) = Z 1

0 (ρϕu#Pψ ρϕu#Qψ)2µ(dα) (E[u(X) u(Y )])2 + MMD2(P, Q)

= τ 2 2 (P, Q, µ, u) + MMD2(P, Q) (E[u(X) u(Y )])2

As a result, for γ a measure on the unit sphere of Hk, the centered version of e-KQD2 and sup-KQD2, now expressed as e-KQD2 and sup-KQD2, are given by,

e-KQD2(P, Q; µ, γ)2 = Eu γ τ 2 2 (P, Q; µ, u)

= e-KQD2(P, Q; µ, γ)2 + MMD2(P, Q) Eu γ[(EX P [u(X)] EY Q[u(Y )])2],

e-KQD2(P, Q; µ, γ)2 + MMD2(P, Q)

sup-KQD2(P, Q; µ, γ)2 = sup u SH τ 2 2 (P, Q; µ, u)

τ 2 2 (P, Q, µ, u) (E[u(X)] E[u(Y )])2 + MMD2(P, Q)

sup u SH τ 2 2 (P, Q; µ, u) sup u SH (E[u(X)] E[u(Y )])2 + MMD2(P, Q)

sup-KQD2(P, Q; µ, γ)2 + MMD2(P, Q).

When ν µ and the connections to Sliced Wasserstein explored in Connection 1 and Connection 2 emerges, the meanshifting property of Wasserstein distances allows us to express centered KQD as a sum of uncentered KQD, and MMD a curious interpretation of centering.

C. Proof of Theoretical Results

This section now provides the proof of all theoretical results in the main text.

C.1. Proof of Theorem 1

The main result in this section, Proposition 3, shows that the set of R measures {u#P : u SH} fully determines the distribution P. Since quantiles determine the distribution, Theorem 1 follows immediately.

Being concerned with the RKHS case specifically allows us to prove the result under mild conditions by using characteristic functionals, an extension of characteristic functions to measures on spaces beyond Rd. Characteristic functionals describe Borel probability measures as operators acting on some function space F : X R.

Kernel Quantile Embeddings and Associated Probability Metrics

Definition 1 (Vakhania et al. (1987), Section IV.2.1). The characteristic functional φP : F C of a Borel probability measure P on X is defined as

X eif(x)P(dx).

Theorem 2.2(a) in Vakhania et al. (1987, Chapter 4) establishes that a P-characteristic functional on F uniquely determines the distribution P on the smallest σ-algebra under which all function f F are measurable. Therefore, when F is such that this σ-algebra coincides with the Borel σ-algebra, the distribution is fully determined by P-characteristic functional on F. We show that, indeed, this holds in our setting, for F = H.

Lemma 1. Suppose A1 and A2 holds. Then, the Borel σ-algebra B(X) is the smallest σ-algebra on X under which all functions f H are measurable.

Proof. Denote by ˆC(X, H) the smallest σ-algebra on X under which all functions f H are measurable, and recall that the Borel σ-algebra is the σ-algebra that contains all closed sets. Therefore, we need to show that ˆC(X, H) contains every closed set in X. We split the proof into two parts: (1) show that H contains a countable separating subspace, and (2) show that this implies that every closed set lies in ˆC(X, H).

H contains a countable separating subspace. Recall that a function space F on X is said to be separating when for any x1 = x2 X, there is a function f F such that f(x1) = f(x2). Since k is separating, H is separating. Since H is separable, it contains a countable dense subspace H0 H. By H0 being dense in H, it must also be separating.

Every closed set lies in ˆC(X, H). By Vakhania et al. (1987, Section I.1, Exercise 9), all compact sets in X lie in ˆC(X, H0), by H0 being countable, continuous, separating space of real-valued functions. By definition, ˆC(X, H0) ˆC(X, H), and so ˆC(X, H) contains all compact sets. We now show this means every closed set must also lie in ˆC(X, H).

By X being σ-compact, there is a family of compact sets {Xi} i=1 such that X = i=1Xi. Take any closed K X; then, K = i=1(Xi K). Since Xi K is compact as the intersection of a compact set and a closed set, and σ-algebras are closed under countable unions, K must lie in ˆC(X, H). As this holds for every closed K, we conclude B(X) = ˆC(X, H).

We now restate the RKHS-specific version of the Vakhania result here for completeness.

Theorem 6 (Theorem 2.2(a) in Vakhania et al. (1987) for RKHS). Suppose A1 and A2 holds, and for Borel probability measures P, Q on X, it holds that φP (f) = φQ(f) for every f H. Then, P = Q.

We are now ready to prove the distribution of projections uniquely determines the distribution.

Proposition 3. Under A1 and A2, it holds that

u#P = u#Q for all u SH P = Q.

Proof. The main idea of the proof is to show that equality of u#P and u#Q implies equality of characteristic functionals, φP (f) = φQ(f) for all f H such that f(x) = tu(x) for some t R and u in the unit sphere. Since such f form the entire H, the result immediately follows.

First, recall that u#P = u#Q for all u if and only if their characteristic functions coincide, meaning Z

R eitzu#P(dz) = Z

R eitzu#Q(dz) u SH, t R. (15)

Notice that the measure u#P is a pushforward of P under the map x u(x). Then, for any measurable g it holds that Z

X g(u(x))P(dx) = Z

R g(z)u#P(dz) u SH. (16)

Take g(z) = eitz, for some t R. Then, for all u it holds that R

R eitzu#P(dz) = R

R eitzu#Q(dz), and consequently by (15) we have that Z

X eitu(x)P(dx) = Z

X eitu(x)Q(dx) u SH, t R. (17)

Kernel Quantile Embeddings and Associated Probability Metrics

Finally, let us pick an f H and show that φP (f) = φQ(f). Define u = f/ f , and t = f ; then,

X eif(x)P(dx) = Z

X eitu(x)P(dx),

and by (17), we arrive at the equality of characteristic functionals, φP (f) = φQ(f). By Theorem 6 characteristic functionals uniquely determine the underlying distribution, meaning P = Q.

For the sake of clarity, we give the proof of the original result.

Proof of Theorem 1. Suppose {ρα,u P : α [0, 1], u SH} = {ρα,u Q : α [0, 1], u SH} for some Borel probability measures P, Q. For any fixed u, since every quantile of of u#P and u#Q coincide, the measures coincide as well, u#P = u#Q. As that holds for every u, by Proposition 3, P = Q.

Lastly, we point out A1 may be relaxed. Provided X is a Tychonoff space meaning, a completely regular Hausdorff space part (b) of Theorem 2.2 in Vakhania et al. (1987) says the following.

Theorem 7 (Theorem 2.2(b) in Vakhania et al. (1987) for RKHS). Suppose X is Tychonoff, A2 holds, and for Radon probability measures P, Q on X, it holds that φP (f) = φQ(f) for every f H. Then, P = Q.

Therefore, when A1 is replaced with X being Tychonoff, Theorem 1 continues to hold but only for Radon P, Q, not any Borel P, Q. Radon probability measures can be intuitively seen as the "non-pathological" Borel measures a restriction employed in order to drop the regularity assumptions of X being separable and σ-compact.

C.2. Proof of Theorem 2

We prove that every mean-characteristic kernel is quantile-characteristic, and give an example quantile-characteristic kernel that is not mean-characteristic.

mean-characteristic quantile-characteristic. Suppose k on X is mean-characteristic, and P = Q are any probability measures on X. We will identify a unit-norm u for which the sets of quantiles of u#P and u#Q differ.

Since k is mean characteristic, µP = µQ, and MMD2(P, Q) = µP µQ 2 H > 0. Recall that MMD can be expressed as

MMD2(P, Q) = sup u H, u H 1 |EX P u(X) EY Qu(Y )| ,

and the supremum is attained at u = (µP µQ)/ µP µQ H (Gretton et al., 2012). In other words, EX P u (X) = EY Qu (Y ) the means of u #P and u #Q don t coincide. Therefore, the measures u #P and u #Q don t coincide, or equivalently {ρα u #P : α [0, 1]} = {ρα u #Q : α [0, 1]}. Then, {ρu,α P : α [0, 1], u SH} = {ρα Q : α [0, 1], u SH}. And since this holds for any arbitrary P = Q, the kernel k is quantile-characteristic.

quantile-characteristic mean-characteristic. To show the converse implication does not hold, we provide an example when k is quantile-characteristic but not mean-characteristic. Take X = Rd, and let k be a degree T polynomial kernel, k(x, x ) = (x x + 1)T . Since A1 and A2 hold Rd is Polish, and k is trivially continuous and separating by Theorem 1 the kernel k is quantile-characteristic.

Now, we show k is not mean-characteristic. Suppose P and Q are such that EX P Xi = EY P Y i for i {1, . . . , T} for example, the Gaussian and Laplace distribution with matching expectation and variance and T = 2, as is done in Section 5.2. Then, EX P (X x )i = EY P (Y x )i for any x Rd, and since

µP (x ) := EX P k(X, x ) = EX P [(X x + 1)T ] = EX P

EX P (X x )i ,

it holds that µP = µQ. The kernel is not mean-characteristic.

Kernel Quantile Embeddings and Associated Probability Metrics

C.3. Proof of Theorem 3

By the Theorem in Serfling (2009, Section 2.3.2), for any ε > 0 it holds that

P(|ρα u#Pn ρα u#P | > ε) 2e 2nδ2 ε, for δε := min

(Z ρα u#P +ε

ρα u#P fu#P (t)dt, Z ρα u#P

ρα u#P ε fu#P (t)dt

Since it was assumed fu#P (x) cu > 0, it holds that δε cuε, and P(|ρα u#Pn ρα u#P | > ε) 2e 2nc2 uε2, or equivalently,

P(|ρα u#Pn ρα u#P | ε) 1 2e 2nc2 uε2.

Take δ := 2e 2nc2 uε2. Then,

P(|ρα u#Pn ρα u#P | C(δ, u)n 1/2) 1 δ, for C(δ, u) =

Since ρα,u Pn ρα,u P H = |ρα u#Pn ρα u#P |, the proof is complete.

C.4. Proof of Theorem 4

We prove e-KQD and sup-KQD, defined in Equation (4) as

e-KQDp(P, Q; ν, γ) = Eu γτ p p (P, Q; ν, u) 1/p ,

sup-KQDp(P, Q; ν) = sup u SH τ p p (P, Q; ν, u) 1/p,

are probability metrics on the set of Borel probability measures on X. Symmetry and non-negativity hold trivially.

Triangle inequality. By Minkowski inequality, for any P, P , Q,

ρα P ρα P pν(dα)

ρα P ρα Q pν(dα) 1/p

ρα Q ρα P pν(dα) 1/p !p

Plugging this in and using Minkowski inequality again on the outermost integral, we get

e-KQDp(P, P ; ν, γ) = Eu γ

ρα P ρα P pν(dα) 1/p

ρα P ρα Q pν(dα) 1/p

ρα Q ρα P pν(dα) 1/p !p!1/p

ρα P ρα Q pν(dα)

ρα Q ρα P pν(dα)

= e-KQDp(P, Q; ν, γ) + e-KQDp(Q, P ; ν, γ).

Kernel Quantile Embeddings and Associated Probability Metrics

Similarly, since supx f p(x) = (supx |f(x)|)p for any f,

sup-KQDp(P, P ; ν, γ) = sup u SH

ρα P ρα P pν(dα) 1/p

ρα P ρα Q pν(dα) 1/p

ρα Q ρα P pν(dα) 1/p !p!1/p

ρα P ρα Q pν(dα) 1/p

ρα Q ρα P pν(dα) 1/p

= sup-KQDp(P, Q; ν, γ) + sup-KQDp(Q, P ; ν, γ).

Identity of indiscernibles. In the rest of this section, we show that

e-KQDp(P, Q; ν, γ) = 0 P = Q; and sup-KQDp(P, Q; ν, γ) = 0 P = Q.

Necessity (meaning the direction) holds trivially quantiles of identical measure are identical. To prove sufficiency, we only need to show that both discrepancies aggregate the directions in a way that preserves injectivity, meaning

e-KQDp(P, Q) = 0 ρα,u P = ρα,u Q for all α, u; and sup-KQDp(P, Q) = 0 ρα,u P = ρα,u Q for all α, u.

Together with Theorem 1, this will complete the proof of sufficiency.

First, we show that for any pair of probability measures, a ν-aggregation over the quantiles is injective.

Lemma 2. Let ν have full support, meaning ν(A) > 0 for any open A [0, 1]. For any Borel probability measures P , Q ,

0 |ρα P ρα Q |2ν(dα) = 0 ρα P = ρα Q for all α [0, 1].

Proof. Suppose R 1 0 |ρα P ρα Q |2ν(dα) = 0, but there is an α0 such that ρα0 P = ρα0 Q . We will show that this implies the existence of an open set (containing α0) over which |ρα0 P ρα0 Q |2 > 0 which will contradict ν having full support.

Since |ρα0 P ρα0 Q |2 > 0 and the quantile function α 7 qα P is left-continuous (by definition) for any probability measure P, there is a α1 < α0 such that |ρα P ρα Q |2 > 0 for all α (α1, α0]. Take some α2 (α1, α0). Then, for all α (α1, α2), we have |ρα P ρα Q |2 > 0. We arrive at a contradiction. Such α0 cannot exist, and therefore ρα P = ρα Q for all α [0, 1].

This result applies directly to the directional differences τp. Provided ν has full support,

τp(P, Q; ν, u) = 0 ρα P = ρα Q for all α [0, 1].

Since supremum aggregation simply considers u that corresponds to the largest τ p p (P, Q; ν, u), this concludes the proof for sup-KQD. Expectation aggregation over the directions u needs an extra result, given below.

Lemma 3. Let γ have full support on SH, and ν have full support on [0, 1]. For any Borel probability measures P, Q on X,

Eu γτ p p (P, Q; ν, u) = 0 P = Q.

Proof. Same as in the proof Theorem 1, we will use the technique of characteristic functionals φP , φQ, to carefully prove equality almost everywhere with respect to a full support measure γ implies full equality. Consider the function

f 7 φP (f) φQ(f),

Kernel Quantile Embeddings and Associated Probability Metrics

which is continuous by continuity of characteristic functionals. Define f0 0, the zero function in H. The set

H\0 := {f H \ {f0} : φP (f) φQ(f) R \ {0}} = {f H \ {f0} : φP (f) = φQ(f)}

is open, as a preimage of an open set R \ {0}, intersected with an open set {H \ {f0}}. Since the projection map f 7 f/ f H is open on H \ {f0}, the projection of H\0 onto SH is open. In other words, the set

S\0 H := {u SH : φP (tuu) = φQ(tuu) for some tu R}

is open in SH. Then, by definition of characteristic functionals, for u S\0 H it holds that

φu#P (tu) = φP (tuu) = φQ(tuu) = φu#Q(tu),

meaning the characteristic functions of u#P and u#Q are not identical, and therefore u#P = u#Q. Since ν has full support on [0, 1], it follows that

τ p p (P, Q; ν, u) = Z 1

0 |ρα u#P ρα u#Q|pν(dα) > 0, for all u S\0 H

We arrive at a contradiction: since γ has full support on SH and S\0 H SH was shown to be an open set, it holds that

Eu γτ p p (P, Q; ν, u) Z

S\0 H τ p p (P, Q; ν, u)γ(du) > 0.

Therefore, for Eu γτ p p (P, Q; ν, u) to be zero, S\0 H must be empty which, by construction, can only happen when H\0 is empty, i.e. φP (f) = φQ(f) for all f H \ f0, where f0 0. Since φP (f0) = φQ(f0) holds trivially for any P, Q, the characteristic functionals of P and Q are identical. By Theorem 6, P = Q. This concludes the proof.

C.5. Proof of Theorem 5

We start with two auxiliary lemmas that, when combined, bound e-KQD approximation error due to replacing P, Q with Pn, Qn in n 1/2. This will be crucial in showing convergence of the approximate e-KQD to the true e-KQD.

Lemma 4. For any measure ν on [0, 1] and any measure γ on SH, it holds that

|e-KQD1(Pn, Qn; ν, γ) e-KQD1(P, Q; ν, γ)| e-KQD1(Pn, P; ν, γ) + e-KQD1(Qn, Q; ν, γ)).

Proof. By the definition of e-KQD1 and Jensen inequality for the absolute value,

|e-KQD1(Pn, Qn; ν, γ) e-KQD1(P, Q; ν, γ)| = Eu γ

|ρα u#Pn ρα u#Qn| |ρα u#P ρα u#Q| dα

|ρα u#Pn ρα u#Qn| |ρα u#P ρα u#Q| dα

By the reverse triangle inequality followed by the triangle inequality, |ρα u#Pn ρα u#Qn| |ρα u#P ρα u#Q| |ρα u#Pn ρα u#P + ρα u#Q ρα u#Qn|

|ρα u#Pn ρα u#P | + |ρα u#Qn ρα u#Q|, (18)

and the statement of the lemma follows.

Lemma 5. Let ν be a measure on [0, 1] with density fν bounded above by Cν > 0. With probability at least 1 δ/4, for C (δ) = 2Cν p

log(8/δ)/2, it holds that

e-KQD1(Pn, P; ν, γ) C (δ)

Kernel Quantile Embeddings and Associated Probability Metrics

Proof. Recall that

e-KQD1(Pn, P; ν, γ) = Eu γ [τ1(Pn, P; ν, u)] , τ1(Pn, P; ν, u) = Z 1

0 |ρα u#Pn ρα u#P |ν(dα).

Let Fu#P and Fu#Pn be the CDFs of u#P and u#Pn respectively. Then,

0 |ρα u#Pn ρα u#P |ν(dα) Cν

0 |ρα u#Pn ρα u#P |dα = Cν

u(X) |Fu#Pn(t) Fu#P (t)|dt

Cν sup t u(X) |Fu#Pn(t) Fu#P (t)|,

where the last equality is the well known fact that integrated difference between quantiles is equal to integrated difference between CDFs (see, for instance, Bobkov and Ledoux (2019, Theorem 2.9)). By the Dvoretzky-Kiefer-Wolfowitz inequality, with probability at least 1 δ/4 it holds that,

sup |Fu#Pn(t) Fu#P (t)| < p

log(8/δ)/2n 1/2,

and therefore, with probability at least 1 δ/4 for C (δ) = 2Cν p

log(8/δ)/2,

τ1(Pn, P; ν, u) = Z 1

0 |ρα u#Pn ρα u#P |ν(dα) C (δ)

In other words, the random variable τ1(Pn, P; ν, u) is sub-Gaussian with sub-Gaussian constant Cτ := C2 ν/(2n), meaning

Pr [τ1(Pn, P; ν, u) ε] 2 exp{ ε2/C2 τ }

One of the equivalent definitions for a sub-Gaussian random variable is the moment condition: for any p 1,

Ex1:n [τ1(Pn, P; ν, u)p] 2Cp τ Γ(p/2 + 1).

An application of Jensen inequality and Fubini s theorem shows that the moment condition holds for Eu γτ1(Pn, P; ν, u),

Ex1:n [(Eu γτ1(Pn, P; ν, u))p] Ex1:n Eu γ [τ1(Pn, P; ν, u)p] = Eu γEx1:n [τ1(Pn, P; ν, u)p] 2Cp τ Γ(p/2 + 1).

Therefore, Eu γτ1(Pn, P; ν, u) is sub-Gaussian with constant Cτ = C2 ν/(2n), meaning it holds with probability at least 1 δ/4 that

e-KQD1(Pn, P; ν, γ) = Eu γτ1(Pn, P; ν, u) C (δ)

We are now ready to prove the full result.

Proof of Theorem 5. Let Cν be an upper bound on the density of ν. By triangle inequality, the full error can be upper bounded by Rl, the error due to approximation of γ with γl, plus Rn, the error due to approximation of P, Q with Pn, Qn,

|e-KQD1(Pn, Qn; ν, γl) e-KQD1(P, Q; ν, γ)| |e-KQD1(Pn, Qn; ν, γl) e-KQD1(Pn, Qn; ν, γ)|

+ |e-KQD1(Pn, Qn; ν, γ) e-KQD1(P, Q; ν, γ)|

=: Rl + Rn.

We bound Rl in l 1/2, and Rn in n 1/2, with high probability.

Kernel Quantile Embeddings and Associated Probability Metrics

Bounding Rl. Recall that e-KQD1(Pn, Qn; ν, γ) = Eu γ h R 1 0 |ρα u#P ρα u#Q|ν(dα) i . Therefore, we may apply Mc Di-

armid s inequality provided for any u, u SH we upper bound the difference

ρα u#P ρα u#Q ρα u #P ρα u #Q ν(dα) .

We have that

ρα u#P ρα u#Q ρα u #P ρα u #Q ν(dα)

ρα u#P ρα u#Q ν(dα) + Z 1

ρα u #P ρα u #Q ν(dα)

(B) 2Cν sup u SH W1(u#P, u#Q)

(C) 2Cν sup u SH EX P EY Q|u(X) u(Y )|

(D) 2CνEX P EY Q p

k(X, X) 2k(X, Y ) + k(Y, Y )

where (A) holds by Jensen s and triangle inequalities; (B) uses boundedness of the density of ν by Cν and the property of the Wasserstein distance in R from Equation (7); (C) uses the infimum definition of the Wasserstein distance; and (D) holds by the reasoning we employed multiple times through the paper, via reproducing property, Cauchy-Schwarz, and having u, u SH. So we arrive at a bound

ρα u#P ρα u#Q ρα u #P ρα u #Q ν(dα) 2CνEX P EY Q p

k(X, X) 2k(X, Y ) + k(Y, Y ) =: 2CνCk.

Now that boundedness of the difference has been established, by Mc Diarmid s inequality, with probability at least 1 δ/2 and for C (δ) = p

2CνCk log(4/δ) it holds that

|e-KQD1(Pn, Qn; ν, γl) e-KQD1(Pn, Qn; ν, γ)| C (δ)l 1/2.

Bounding Rn. By Lemma 4,

|e-KQD1(Pn, Qn; ν, γ) e-KQD1(P, Q; ν, γ)| e-KQD1(Pn, P; ν, γ) + e-KQD1(Qn, Q; ν, γ))

By Lemma 5 and the union bound, with probability at least 1 δ/2 and for C (δ) = 2Cν p

log(8/δ)/2, it holds that

Rn = |e-KQD1(Pn, Qn; ν, γ) e-KQD1(P, Q; ν, γ)| C (δ)n 1/2.

Combining bounds. By applying the union bound again, to Rl + Rn, we get that, with probability at least 1 δ,

|e-KQD1(Pn, Qn; ν, γl) e-KQD1(P, Q; ν, γ)| Rl + Rn C (δ)l 1/2 + C (δ)n 1/2 C(δ)(l 1/2 + n 1/2),

for C(δ) = max{C (δ), C (δ)} = O( p

log(1/δ)). This completes the proof

As pointed out in the main text, EX P EY Q p

k(X, X) 2k(X, Y ) + k(Y, Y ) < holds immediately when EX P p

k(X, X) and EX Q p

k(X, X) are finite, and even more specifically, when the kernel k is bounded. Unbounded k and finite expectations, for example, happens when the tails of both P and Q decay fast enough to "compensate" for the growth of k(x, x). For instance, when k is a polynomial kernel of any order (which is unbounded), and P and Q are laws of sub-exponential random variables. For clarity, note that EX P EY Q p

k(X, X) 2k(X, Y ) + k(Y, Y ) does not compare to MMD, which integrates k(X, X ) rather than k(X, X) (see Equation (6)).

For integer p > 1, proving the n 1/2 convergence rate is feasible if more involved primarily because we can no longer reduce the problem to convergence of empirical CDFs to true CDFs. In general, for p > 1, Z 1

0 |ρα u#Pn ρα u#P |pdα = Z

u(X) |Fu#Pn(t) Fu#P (t)|pdt.

The following result, restated in our notation, makes the added complexity explicit.

Kernel Quantile Embeddings and Associated Probability Metrics

Lemma 6 (Theorem 5.3 in Bobkov and Ledoux (2019)). Suppose k : Rd Rd R is a bounded kernel, and ν has a density 0 < cν fν Cν on [0, 1]. Then, for any u SH, and for any p 1 and n 1,

Ex1:n P τ p p (Pn, P; ν, u) 5p Cν n + 2

p Jp(u#P), for Jp(u#P) = Z

(Fu#P (t)(1 Fu#P (t)))p/2

f p 1 u#P (x) dt.

Further, it holds that Ex1:n P τ p p (Pn, P; ν, u) = O(n p/2) if and only if Jp(u#P) < .

We now state a likely result for p > 1 as a conjecture, and outline the proof.

Conjecture 1 (Finite-Sample Consistency for Empirical KQDs for p > 1). Let X Rd, ν have a density, P, Q be measures on X with densities bounded away from zero, f P (x) c P > 0 and f P (x) c Q > 0. Suppose EX P [k(X, X) p/2] < and EX Q[k(X, X) p/2] < , and x1:n P, y1:n Q. Then,

Ex1:n P y1:n Q |e-KQDp(Pn, Qn; ν, γl) e-KQDp(P, Q; ν, γ)| = O(l 1/2 + n 1/2).

Sketch proof. Analogously to the proof of Theorem 5, we can decompose the term of interest as

Ex1:n P y1:n Q |e-KQDp(Pn, Qn; ν, γl) e-KQDp(P, Q; ν, γ)|

Ex1:n P y1:n Q |e-KQDp(Pn, Qn; ν, γl) e-KQDp(Pn, Qn; ν, γ)|

+ Ex1:n P e-KQDp p(Pn, P; ν, γ) 1/p + Ey1:n Qe-KQDp p(Qn, Q; ν, γ)) 1/p

The first term can be, same as in the proof of Theorem 5, bounded by Mc Diarmid s inequality. The second term (to the power p) takes the form Ex1:n P e-KQDp p(Pn, P; ν, γ) = Ex1:n P Eu γτ p p (Pn, P; ν, u).

Then, by Lemma 6 (possibly modified to account for an extra expectation), to get the result we will need to show that Eu γJp(u#P) < ,

Eu γJp(u#P) = Eu γ

(Fu#P (t)(1 Fu#P (t)))p/2

f p 1 u#P (x) dt

The nominator is upper bounded by 2 p. The denominator may get arbitrarily small without the nominator getting arbitrarily small: when the PDF f p 1 u#P (x) is small, the CDF Fu#P (x) need not be close to zero or one. Therefore, it is necessary and sufficient to show

1 f p 1 u#P (x) dt

We proceed to outline key elements of the proof of such result, and leave a rigorous proof for future work. By the coarea formula, and since f P (x) c P > 0,

fu#P (t) = Z

f P (x) | u(x)|Hd 1(dx) c0

1 | u(x)|Hd 1(dx), for | u(x)| =

where u 1(t) = {x X : u(x) = t}, and Hd 1 is the d 1-dimensional Hausdorff measure, which within X Rd is equal to d 1 dimensional Lebesgue measure, scaled by a constant that only depends on d 1.

Therefore, the integral in Equation (19) may diverge if the integral Z

1 | u(x)|Hd 1(dx) (20)

gets very small over "large" parts of u(X) on average over u γ. Trivially, if u is constant over some interval or more generally, u has infinitely many critical points the integral diverges. Fortunately, the more general condition is easy to

Kernel Quantile Embeddings and Associated Probability Metrics

control: if u is a Morse function and X is compact, then u has only a finite number of critical points. It is a classic result (see, for instance, Hirsch (1976, Theorem 1.2)) that Morse functions form a dense open subset of twice differentiable real-valued functions on Rd, denoted C2(Rd). Therefore, if H C2(X) (which can be reduced to smoothness of the kernel k it holds for instance, for the Matérn-5/2 kernel), we get that u γ has a finite number of critical points almost surely under mild regularity assumptions on γ.

The final ingredient is to use the Morse lemma to lower bound Equation (20) in the epsilon-ball of each critical point. Morse lemma says u is quadratic around each critical point which yields bounds on both the volume of u 1(t), and 1/| u(x)| in terms of the eigenvalues of the Hessian. Careful analysis of the eigenvalues will be needed to ensure the expectation with respect to u γ is finite.

C.6. Proof of Connections 1 and 2

The equality in Equation (7) immediately gives the connection of e-KQD and sup-KQD to the expected-SW and max-SW respectively previously only defined on X = Rd.

Further, for X = Rd, viewing x 7 k(x, ) as a transformation on X reveals a connection to Generalised Sliced Wasserstein (GSW, Kolouri et al. (2022)). In particular, the polynomial kernel k(x, x ) = (x x + 1)T of odd degree T recovers the polynomial transformation for which GSW was proven to be a probability metric. Outside of the case of the polynomial case, proving that GSW is a metric is highly challenging. This is easier under the kernel framework, as we showed in Theorem 4. In Kolouri et al. (2022), the authors investigate learning transformations with neural networks (NNs). An interesting direction for future work is the relationship between said NNs and the kernels they induce.

C.7. Proof of Proposition 1

Recall that by definition of Gaussian measures in Hilbert spaces (Kukush, 2020), a random element f H has the law of a Gaussian measure N(0, Cm) on H when for any g H,

f, g H N(0, Cm[g], g ). (21)

Since Cm[g](x) = 1/m Pm j=1 g(zj)k(zj, x), by the reproducing property,

Cm[g], g = 1

j=1 g(zi)2. (22)

Take f(x) = 1/ m Pm j=1 λjk(zj, x), for λ1, . . . , λm N(0, Id). Then, for any g H, by the reproducing property it holds that

f, g H = 1 m

j=1 λjg(zj) N

i=1 g(zi)2 !

which is exactly the Gaussian measure with covariance operator Cm, as per Equations (21) and (22).

D. Additional Numerical Results

D.1. Type I control

We report the Type I control experiments for the CIFAR-10 v.s. CIFAR-10.1 experiment. Results are shown in Figure 5.

D.2. Figure 3 for e-KQD1 It is common in power p-parametrised methods to select p = 2, to balance out sensitivity to outliers (which is higher for larger p, to the point of methods becoming brittle for p > 2), and robustness (which tends to be highest for p = 1); this trade-off, for instance, inspired the introduction of the Huber loss (Huber, 1964). However, for completeness, we now repeat experiments in the main paper for p = 1. The relationship to baseline approaches MMD, MMD-Multi, and MMD-Lin remains the same as observed for p = 2. However, it is evident that e-KQD1 performed better than e-KQD2 at the power decay and galaxy MNIST experiments, but the centered e-KQD1 performed worse than centered e-KQD2 at the Laplace v.s. Gaussian experiment. The implications of choosing p warrants a deeper investigation, left to future work.

Kernel Quantile Embeddings and Associated Probability Metrics

100 500 1000 1500 2000 Number of Samples

Rejection rate

Type I control on the CIFAR vs CIFAR10.1 experiment

e-KQD e-KQD-Centered MMD-Multi MMD-Lin

MMD sup-KQD Type I error rate

Figure 5: Type I control results for our experiment on CIFAR-10 v.s. CIFAR-10.1. We see all methods control their Type I error around or below the specified Type I error rate 0.05, thus confirming our tests in the main text are valid testing procedures.

32 64 128 256 512 Dimension

Rejection rate

(a) Power Decay

100 500 2000 5000 10000 Number of samples

Rejection rate

(b) Laplace v.s. Gaussian

100 500 1000 1500 2000 2500 Number of samples

Rejection rate

(c) Galaxy MNIST

MMD-Lin (O(n)) MMD-Multi (O(nlog2 n)) sup-KQD1 (O(nlog2 n)) e-KQD1 (O(nlog2 n)) MMD (O(n2)) e-KQD1-Centered (O(n2))

100 500 1000 1500 2000 Number of samples

Rejection rate

(d) CIFAR-10 v.s. CIFAR-10.1

Figure 6: The experiments in Figure 3 repeated for p = 1. Experimental results comparing our proposed methods with baseline approaches. A higher rejection rate indicates better performance in distinguishing between distributions. Same as for p = 2, quadratic-time quantile-based estimators perform comparably to quadratic-time MMD estimators, while near-linear time quantile-based estimators often outperform their MMD-based counterparts.

D.3. Comparison of weighting measures

The Gaussian Kernel Quantile Discrepancy introduced in Section 4 has multiple weighting measures that determine properties of the distance: the measure ν on the quantile levels, the measure ξ within the covariance operator, and the measure γ on the unit sphere SH. We investigate the impact of varying these.

Varying ν. We conducted the following experiment using the Galaxy MNIST and CIFAR datasets. We varied ν, from assigning more weight to the extreme quantiles to down-weighting them. The results are presented in Figure 7, where the reverse triangle \/ stands for up-weighing extreme quantiles, and the triangle /\ stands for down-weighing them. We observed some improvement over the uniform ν: for Galaxy MNIST, test power improved when ν assigned less weight to extremes, whereas for CIFAR, the opposite was true, with higher test power when more weight was given to extremes. Uniform weighting of the quantiles remained a good choice. This suggests that tuning ν beyond the uniform is problem-dependent and can enhance performance. The difference likely arises from the nature of the problems: CIFAR datasets, where samples are expected to be similar, benefit from emphasising extremes, while Galaxy MNIST, which contains fundamentally different galaxy images, performs better when robustified, i.e., focusing on differences away from the tails. Exploring this further presents an exciting avenue for future work.

Varying ξ. The reference measure ξ in the covariance operator C serves to "cover the input space" and is typically set to a "default" measure on the space for Rd, the standard Gaussian measure. The choice (Pn + Qn)/2 made in the main body of the paper is aiming to adhere to the most general setting, when no default measure may be available only Pn and Qn.

Kernel Quantile Embeddings and Associated Probability Metrics

100 500 1000 1500 2000 2500 Number of samples

Rejection rate

Galaxy MNIST

100 500 1000 1500 2000 Number of samples

Rejection rate

CIFAR-10 v.s. CIFAR-10.1

e-KQD2 (O(nlog2 n))

MMD (O(n2))

e-KQD2-/\ (O(nlog2 n))

e-KQD2-\/ (O(nlog2 n))

e-KQD2, = (IQR/1.349) N(0, 1) (O(nlog2 n))

e-KQD2, = IQR U[0, 1] (O(nlog2 n))

100 500 1000 1500 2000 Number of samples

Rejection rate

CIFAR-10 v.s. CIFAR-10.1

Figure 7: Gaussian KQD test power under different weighting measures. Left, middle: Varying measure ν: down-weighing (/\) extremes boosts power on Galaxy MNIST, while up-weighing (\/) them helps on CIFAR. Uniform weighting remains a strong default, with optimal ν depending on the dataset. Right: Varying measure ξ: using an IQR-scaled Gaussian or uniform default reference measure ξ both outperform MMD indicating potential advantage of a "default" ξ over the problem-based ξ = (Pn + Qn)/2.

We report a comparison on performance when the reference measure is: (1) (Pn + Qn)/2; (2) a standard Gaussian measure, scaled by IQR/1.349 to match the spread of the data, where IQR is the interquantile range of Pn + Qn, and 1.349 is the interquantile range of the standard Gaussian; and (3) a uniform measure on [ 1, 1]d, scaled by IQR.

The results, presented in Figure 7, show performance superior to MMD for the standard/uniform ξ. This indicates value in picking a "default" measure when one is available.

Varying γ Varying the measure on the sphere beyond a Gaussian is extremely challenging in infinite-dimensional spaces due to the complexity of both its theoretical definition and practical sampling. Since no practically relevant alternative has been proposed, we leave this direction unexplored.

D.4. Comparison to sliced Wasserstein distances

We extend the power decay experiment to include sliced Wasserstein and max-sliced Wasserstein distances, with directions (1) sampled uniformly on the sphere, and (2) sampled from (Pn + Qn)/2 and projected onto the sphere. The results are plotted in Figure 8, and show that sliced Wasserstein distances perform significantly worse than e-KQD. This outcome is expected as noted in Connections 1 and 2, sliced Wasserstein is equivalent to e-KQD with the linear kernel, which is less expressive than the Gaussian kernel.

D.5. Comparison with MMD based on Other KME Approximations

There are several efficient kernel mean embedding methods available in the literature, and no single approach has emerged as definitively superior. To complement experiments in the main body of the paper, we compare the e-KQD (at matching cost) with (1) The Mean Embedding (ME) approximation of MMD of Chwialkowski et al. (2015b), which was identified as the best-performing method in their numerical study; (2) the Nyström-MMD method of Chatalic et al. (2022), and (3) the Median-of-Means (MOM) approximation of Lerasle et al. (2019), specifically, their faster method (MONK BCD-Fast) that achieves matching cost to our e-KQD at the number of blocks Q = n/ log n.

The results are presented in Figure 8. ME performs at the level of MMD-multi, while Nyström has extremely high Type II error, likely due to sensitivity to hyperparameters. Due to Median-of-Means still being considerably slower than e-KQD (with the number of optimiser iterations set to T = 100), we apply it to a cheaper Power Decay problem (rather than the larger and more complicated Galaxy MNIST), where it performs at the level of the linear approximation of MMD. This may be due to MOM primarily being robustness-enforcing method, rather than a method aiming to build an efficient approximation of MMD.

Kernel Quantile Embeddings and Associated Probability Metrics

32 64 128 256 512 Dimension

Rejection rate

Power decay

100 500 1000 1500 2000 2500 Number of samples

Rejection rate

Galaxy MNIST

e-SW2, u = f/ f for f (Pn + Qn)/2

e-SW2, u Unif

e-KQD2 MMD (O(n2))

MMD-Multi ME

Nyström MOM

MMD-Lin (O(n))

32 64 128 256 512 Dimension

Rejection rate

Power decay

Figure 8: All methods are cost O(n log2 n) unless specified otherwise. Left: Gaussian KQD compared with sliced Wasserstein with uniform or data-driven directions, on the power decay problem. Sliced Wasserstein fall well below KQD consistent with their equivalence to KQD using a less expressive linear kernel. Middle: Comparison with alternative approximate KME methods, at matching cost. ME matches MMD-multi power, while Nyström-MMD suffers high Type II error. Right: Comparison with Median-of-Means (MOM) KME approximation, at matching cost. MOM is primarily a robustness-enforcing method, not a cheap-approximation method, and doesn t perform well at set cost of O(n log2 n).