Journal of Machine Learning Research 25 (2024) 1-50. Submitted 10/22; Revised 9/24; Published 12/24.

Targeted Separation and Convergence with Kernel Discrepancies

Alessandro Barp* (alessandro.barp@ucl.ac.uk), University College London & The Alan Turing Institute, GB
Carl-Johann Simon-Gabriel*† (cjsg@mirelo.ai), Mirelo AI
Mark Girolami (mag92@cam.ac.uk), University of Cambridge & The Alan Turing Institute, GB
Lester Mackey (lmackey@microsoft.com), Microsoft Research, New England, US

*These authors contributed equally. †Work done while at AWS Lablets.

Editor: Bharath Sriperumbudur

© 2024 Alessandro Barp, Carl-Johann Simon-Gabriel, Mark Girolami, and Lester Mackey. License: CC-BY 4.0; see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/22-1123.html.

Abstract

Maximum mean discrepancies (MMDs) like the kernel Stein discrepancy (KSD) have grown central to a wide range of applications, including hypothesis testing, sampler selection, distribution approximation, and variational inference. In each setting, these kernel-based discrepancy measures are required to (i) separate a target $P$ from other probability measures or even (ii) control weak convergence to $P$. In this article we derive new sufficient and necessary conditions to ensure (i) and (ii). For MMDs on separable metric spaces, we characterize those kernels that separate Bochner embeddable measures and introduce simple conditions for separating all measures with unbounded kernels and for controlling convergence with bounded kernels. We use these results on $\mathbb{R}^d$ to substantially broaden the known conditions for KSD separation and convergence control and to develop the first KSDs known to exactly metrize weak convergence to $P$. Along the way, we highlight the implications of our results for hypothesis testing, measuring and improving sample quality, and sampling with Stein variational gradient descent.

Keywords: maximum mean discrepancy, kernel Stein discrepancy, targeted separation, targeted weak convergence control, enforcing tightness

1. Introduction

Maximum mean discrepancies (MMDs) like the Langevin kernel Stein discrepancy (KSD) are kernel-based discrepancy measures widely used for hypothesis testing (Gretton et al., 2012; Liu et al., 2016; Chwialkowski et al., 2016), sampler selection and tuning (Gorham and Mackey, 2017), parameter estimation (Briol et al., 2019; Barp et al., 2019; Dziugaite et al., 2015), generalized Bayesian inference (Chérief-Abdellatif and Alquier, 2020; Matsubara et al., 2021, 2022; Dellaporta et al., 2022), discrete approximation and numerical integration (Chen et al., 2019, 2018; Barp et al., 2022b), control variate design (Oates et al., 2014, 2019; Sun et al., 2023), compression (Riabiz et al., 2022), and bias correction (Liu and Lee, 2017; Hodgkinson et al., 2020; Riabiz et al., 2022). Each MMD uses a kernel function to measure the integration error between a pair of probability measures $Q$ and $P$, and, in each setting above, their successful application relies on either $P$-separation, that is, $\mathrm{MMD}(Q, P) > 0$ whenever $Q \neq P$, or $P$-convergence control, namely $\mathrm{MMD}(Q_n, P) \to 0$ implies $Q_n \to P$ weakly.
Unfortunately, these properties have so far only been established under overly restrictive assumptions, e.g., for $Q$ with continuously differentiable log densities (Chwialkowski et al., 2016; Liu et al., 2016; Barp et al., 2019), for $P$ with strongly log concave tails and Lipschitz log density gradients $s_p = \nabla \log p$ (Gorham and Mackey, 2017), or for bounded MMD kernels (Sriperumbudur et al., 2010; Sriperumbudur, 2016; Simon-Gabriel and Schölkopf, 2018; Simon-Gabriel et al., 2023). In this work, by fixing $P$ as the target measure and allowing $Q$ to vary, we establish new broadly applicable conditions for $P$-separation and $P$-convergence control. Our main results include:

- Bochner P-separation with MMDs: Theorem 2 exactly characterizes those MMDs that separate $P$ from Bochner embeddable measures on general Radon spaces. For MMDs with bounded kernels, this result exposes an important relationship between separation and convergence: separating $P$ from all probability measures is equivalent to controlling $P$-convergence for tight sequences $(Q_n)_n$.

- Score P-separation with KSDs: Theorem 3 shows that KSDs with standard characteristic kernels separate $P$ from all measures $Q$ that finitely integrate the score $s_p$. This strengthens past work that only established separation from $Q$ with continuously differentiable log densities (Chwialkowski et al., 2016; Barp et al., 2019).

- L2 P-separation with KSDs: Theorems 4 and 5 show that KSDs with standard translation-invariant kernels separate $P$ from all measures with densities $q$ and square-integrable $q s_q$ and $q s_p$. This strengthens past work that provided no examples of $L^2$-separating kernels (Liu et al., 2016).

- General P-separation with MMDs: Theorem 6 provides a simple sufficient condition for general $P$-separation: any MMD, even one with an unbounded kernel, separates $P$ from all probability measures and controls tight convergence to $P$ if the bounded functions in its associated reproducing kernel Hilbert space (RKHS) are $P$-separating. All of our remaining results explicitly check this new convenient condition.

- General P-separation with KSDs: Theorem 9 shows that KSDs with standard translation-invariant kernels separate $P$ from all probability measures and control tight $P$-convergence whenever $s_p$ is continuous and grows at most root-exponentially. Prior $P$-separation results applied only to a small subset of these targets, those with strongly log concave tails and Lipschitz $s_p$ (Gorham and Mackey, 2017; Huggins and Mackey, 2018; Chen et al., 2018).

- Enforcing tightness with MMDs: Theorem 10 provides a new sufficient condition for enforcing tightness, i.e., for ensuring that $(Q_n)_n$ is tight whenever $\mathrm{MMD}(Q_n, P) \to 0$: an MMD enforces tightness if elements of its RKHS suitably bound the indicators of compact sets. Prior tightness-enforcing guarantees relied on a much stronger condition: the presence of a coercive (and hence unbounded) function in the RKHS (Gorham and Mackey, 2017; Huggins and Mackey, 2018; Chen et al., 2018; Hodgkinson et al., 2020).

- Metrizing P-convergence with KSDs: Building on Theorem 10, Theorem 12 develops the first KSDs known to metrize weak convergence to $P$ (i.e., $\mathrm{KSD}(Q_n, P) \to 0 \iff Q_n \to P$ weakly) by constructing bounded convergence-controlling Stein kernels. Since all prior convergence-controlling KSDs featured unbounded Stein kernels, these are also the first KSDs known to satisfy the Stein variational gradient descent convergence assumptions of Liu (2017) (see Application 4).
- Failing to control P-convergence: Finally, Theorem 13 provides new necessary conditions for an MMD to control $P$-convergence, which notably fail to be satisfied when standard KSDs are paired with heavy-tailed targets.

As we highlight in the sections to follow, these results have immediate implications for a variety of inferential tasks in machine learning and statistics, including goodness-of-fit testing (Applications 1 and 2), measuring and improving sample quality (Application 3), and variational inference (Application 4).

Notation. For a given separable metric space $X$, we let $C(\mathbb{R}^d)$ denote the space of continuous $\mathbb{R}^d$-valued functions on $X$. When $X = \mathbb{R}^d$, we say that the derivative of a set of $\mathbb{R}^\ell$-valued functions exists if the functions in that set are differentiable, and we additionally denote by $C^\ell(\mathbb{R}^d)$ the space of $\ell$-times continuously differentiable $\mathbb{R}^d$-valued functions on $X$ (i.e., $f \in C^\ell(\mathbb{R}^d)$ if the partial derivatives of order $\ell$ of $f_i$ exist and are continuous for $i \in [d] \equiv \{1, \dots, d\}$). We let $\nabla f$ denote the vector of partial derivatives of a function $f$ and, for each multi-index $p$, let $\partial^p f$ denote the $p$-th partial derivatives of $f$. When $d = 1$ or $\ell = 0$ we will use the abbreviations $C^\ell \equiv C^\ell(\mathbb{R}^1)$ or $C(\mathbb{R}^d) \equiv C^0(\mathbb{R}^d)$. Decay requirements will appear as subscripts: $C_b(\mathbb{R}^d)$, $C_c(\mathbb{R}^d)$, and $C_0(\mathbb{R}^d)$ will respectively denote the spaces of $\mathbb{R}^d$-valued continuous functions that are bounded, compactly supported, and vanishing at infinity. Analogously, for each function $h : X \to [0, \infty)$, $C_h(\mathbb{R}^d)$ and $C_{0,h}(\mathbb{R}^d)$ respectively denote the spaces of $\mathbb{R}^d$-valued continuous functions $f$ with $f/(1 + h)$ bounded or vanishing at infinity. Recall that a function $f : X \to \mathbb{R}^a$ vanishes at infinity if for every $\epsilon > 0$ there exists a compact set $C$ such that $\sup_{x \in C^c} \|f(x)\| \leq \epsilon$, where $C^c$ is the set complement of $C$ and $\|\cdot\|$ the Euclidean norm. For any function of two arguments $K(y, x)$, we write $K_x \equiv K(\cdot, x)$, and $K \in C_b^{(1,1)}(\mathbb{R}^d)$ if $\partial_y^{p_y} \partial_x^{p_x} K(y, x)$ exists, is bounded, and is separately continuous for multi-indices satisfying $\|p_x\|_1 \leq 1$ and $\|p_y\|_1 \leq 1$, where $\|\cdot\|_1$ is the Euclidean 1-norm (i.e., the multi-index absolute value). Given a map $T : S_1 \to S_2$ between sets, we denote the image of $T$ by $T(S_1) \equiv \{T(s) : s \in S_1\}$. Given a measure $\mu$ and a $\mu$-integrable function $h$, we denote integration by $\mu h \equiv \int h(x)\, \mu(dx)$, and we shall omit the domain of integration, which is always $X$. Some additional notation for the appendices is presented in Appendix A.

2. Maximum Mean Discrepancies and Kernel Stein Discrepancies

We begin by extending the usual notions of maximum mean discrepancy and kernel Stein discrepancy to accommodate both arbitrary probability measures $Q$ and unbounded kernels. Throughout, we let $\mathcal{P}$ denote the set of (Borel) probability measures on a separable metric space $X$. Moreover, for any function $f : X \to \mathbb{R}^\ell$, we let $\mathcal{P}_f \equiv \{Q \in \mathcal{P} : \|f\| \in L^1(Q)\}$ denote the set of probability measures that finitely integrate $\|f\|$.

2.1 Maximum mean discrepancies

Consider a (reproducing) kernel $k$ on $X$ with reproducing kernel Hilbert space $H_k$ (Aronszajn, 1950; Schwartz, 1964). Traditionally, the associated kernel MMD is defined as the worst-case integration error across test functions in the RKHS unit norm ball $B_k$ (Gretton et al., 2012):

$\mathrm{MMD}_k(Q, P) \equiv \sup_{h \in B_k} |Qh - Ph|. \quad (1)$

However, the expression $Qh - Ph$ is not well defined when either (i) both $Qh$ and $Ph$ are infinite or (ii) $h$ is not integrable under $Q$. Unfortunately, both of these cases can occur when $k$ is unbounded, as $B_k$ then necessarily contains an unbounded test function (see Lemma 3).
Since we are interested in a fixed target measure $P$, we address the first issue by focusing on kernels with finitely $P$-integrable test functions, i.e., with $B_k \subset L^1(P)$. To address the second issue, we extend the MMD definition (1) to all probability measures $Q$ by taking the supremum only over the $Q$-integrable elements of $B_k$, that is, over $h$ with either $h_+ \equiv \max(h, 0) \in L^1(Q)$ or $h_- \equiv \max(-h, 0) \in L^1(Q)$. In fact, since $B_k$ is a symmetric set, considering only $h$ with $h_+ \in L^1(Q)$ suffices to ensure $|Qh - Ph|$ is well defined and belongs to $[0, \infty]$.

Definition 1 (Maximum mean discrepancy (MMD)). For a given kernel $k$, define the set of embeddable probability measures $\mathcal{P}_{H_k} \equiv \{Q \in \mathcal{P} : H_k \subset L^1(Q)\}$. For any target measure $P \in \mathcal{P}_{H_k}$, we define the maximum mean discrepancy $\mathrm{MMD}_k(\cdot, P) : \mathcal{P} \to [0, \infty]$ by

$\mathrm{MMD}_k(Q, P) \equiv \sup_{h \in B_k : h_+ \in L^1(Q)} |Qh - Ph|. \quad (2)$

Remark 1 (Embeddability). We show in Appendix C that (i) the embeddability condition $P \in \mathcal{P}_{H_k}$ holds if and only if $x \mapsto k(\cdot, x)$ is Pettis integrable with respect to $P$ and (ii) Pettis integrability in turn implies that the kernel mean $\int k(\cdot, x)\, dP(x)$ belongs to the RKHS $H_k$. See Definition 6 for the definition of Pettis integrability.

As we show in Appendix C.4, one user-friendly sufficient condition for $Q \in \mathcal{P}_{H_k}$ is Bochner embeddability, that is, $Q \in \mathcal{P}_{\sqrt{k}}$, where $\sqrt{k}$ represents the function $x \mapsto \sqrt{k(x, x)}$. When $H_k$ is separable, Carmeli et al. (2006) proved that one can alternatively check the weaker condition $\iint |k(x, y)|\, dQ(x)\, dQ(y) < \infty$. The next proposition summarizes these convenient embeddability conditions.

Proposition 1 (Embeddability conditions). The following claims hold true.

(a) $\mathcal{P}_{\sqrt{k}} \subset \mathcal{P}_{H_k}$.

(b) If $H_k$ is separable, $\iint |k(x, y)|\, dQ(x)\, dQ(y) < \infty$ implies $Q \in \mathcal{P}_{H_k}$ (Carmeli et al., 2006, Cor. 4.3).

Remark 2 (Sufficient condition for separability). Note that when $X$ is a locally compact topological space, for $H_k$ to be separable it is sufficient that $H_k \subset C$ (Carmeli et al., 2006, Cor. 5.2). Moreover, $H_k \subset C$ iff $k$ is locally bounded and $k_x \in C$ for each $x$ (Carmeli et al., 2006, Prop. 5.1). (Recall that a function $f$ from a topological space to a normed space is locally bounded if every point in its domain has a neighbourhood $U$ for which the restriction of $f$ to $U$ is bounded.)

Moreover, when both $Q$ and $P$ are embeddable, the MMD can be re-expressed as a convenient double integral (Simon-Gabriel and Schölkopf, 2018, Prop. 13).

Proposition 2 (MMD as a double integral). If $P \in \mathcal{P}_{H_k}$ and $Q \in \mathcal{P}_{H_k}$, then $\mathrm{MMD}_k^2(Q, P) = \iint k(x, y)\, d(Q - P)(x)\, d(Q - P)(y)$.
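To make Proposition 2 concrete, here is a minimal sketch (ours, not from the paper) that estimates $\mathrm{MMD}_k^2(Q, P)$ by substituting empirical measures into the double integral; the Gaussian kernel, sample sizes, and all function names are illustrative choices.

```python
import numpy as np

def gaussian_kernel(X, Y, lengthscale=1.0):
    """Bounded kernel k(x, y) = exp(-||x - y||^2 / (2 * lengthscale^2))."""
    sq_dists = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * lengthscale ** 2))

def mmd_squared(X, Y, kernel=gaussian_kernel):
    """Plug-in estimate of MMD_k^2(Q, P) = integral of k d(Q - P) d(Q - P)
    with Q and P replaced by the empirical measures of the samples X and Y."""
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2 * kernel(X, Y).mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 1))   # sample approximating Q
Y = rng.normal(0.5, 1.0, size=(500, 1))   # sample approximating P
print(mmd_squared(X, Y))                  # strictly positive since Q != P
```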
2.2 Kernel Stein discrepancies

Building on the Stein discrepancy formalism of Gorham and Mackey (2015) and the zero-mean reproducing kernel theory of Oates et al. (2014), Chwialkowski et al. (2016), Liu et al. (2016), and Gorham and Mackey (2017) concurrently developed special MMDs that can be computed without any explicit integration under the target $P$. When discussing these Langevin KSDs we will restrict our focus to $X = \mathbb{R}^d$ and assume the target $P$ has a strictly positive density $p$ with respect to Lebesgue measure. We will also make use of a matrix-valued kernel $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^{d \times d}$ which generates an RKHS $H_K$ of vector-valued functions; for an introduction to vector-valued RKHSes, please see Appendix B.

The Langevin KSD is defined in terms of a matrix-valued base kernel $K$ and the differential operator $S_p(v) \equiv \frac{1}{p} \nabla \cdot (p\, v)$, known as the Langevin Stein operator in the machine learning and statistics communities (Gorham and Mackey, 2015; Anastasiou et al., 2023), which, under mild conditions, maps $\mathbb{R}^d$-valued functions $v = (v_1, \dots, v_d) : \mathbb{R}^d \to \mathbb{R}^d$ to $\mathbb{R}$-valued functions with mean zero under the target, $P S_p(v) = 0$. Specifically, for $K$ chosen so that $S_p(v)$ has expectation zero under $P$ for each $v \in H_K$, Chwialkowski et al. (2016), Gorham and Mackey (2017), and Barp et al. (2019) defined the Langevin KSD as an integral probability metric (Müller, 1997) over $S_p(H_K)$:

$\mathrm{KSD}_{K,P}(Q) \equiv \sup_{v \in B_K} |Q S_p(v)| = \sup_{v \in B_K} |Q S_p(v) - P S_p(v)|. \quad (3)$

(The distinct definition of Liu et al. (2016) coincides with (3) under the assumptions of their Thm. 3.8.) However, $S_p(v)$ is often unbounded so that, for the same reasons described in Section 2.1, the expression (3) need not be well defined for all $Q \in \mathcal{P}$. To enable meaningful KSD evaluation for all probability measures, we follow the recipe of Definition 1 to extend the definition of the KSD to all $Q \in \mathcal{P}$.

Definition 2 (Kernel Stein discrepancy (KSD)). Consider a target $P \in \mathcal{P}$ with density $p > 0$ and matrix-valued base kernel $K$ for which the set $p\, H_K \equiv \{p h : h \in H_K\}$ consists of partially differentiable functions. When $S_p(H_K) \subset L^1(P)$ and $P(S_p(H_K)) = \{0\}$, we define the kernel Stein discrepancy $\mathrm{KSD}_{K,P} : \mathcal{P} \to [0, \infty]$ by

$\mathrm{KSD}_{K,P}(Q) \equiv \sup_{v \in B_K : S_p(v)_+ \in L^1(Q)} |Q S_p(v)|. \quad (4)$

Remark 3 (Relation to prior definitions of KSD). For scalar kernels, Definition 2 is identical to the definition of KSD given in two of the papers that originally defined KSDs, Chwialkowski et al. (2016, Sec. 2.1) and Gorham and Mackey (2017, Sec. 3.1 with $\|\cdot\| = \|\cdot\|_2$), except for the extra constraint $S_p(v)_+ \in L^1(Q)$, which we include simply to ensure that $\mathrm{KSD}_{K,P}(Q)$ is well defined for all probability measures $Q$. Moreover, for probability measures $Q$ satisfying the constraint $S_p(v)_+ \in L^1(Q)$ for all $v \in B_K$, our definition exactly recovers those given by Chwialkowski et al. and Gorham and Mackey. However, unlike Definition 2, the prior definitions of KSD from Chwialkowski et al. and Gorham and Mackey are not well defined for probability measures $Q$ failing to satisfy the extra constraint, even though this restriction is not discussed explicitly in either work.

Under additional assumptions, like Bochner embeddability of $P$ and $Q$ and continuous differentiability of $K$ and $p$, prior work showed that the KSD (4) is equivalent to an MMD with a scalar Stein kernel $k_p$ and that $S_p(H_K)$ defines a Stein RKHS $H_{k_p}$ of scalar-valued functions (Oates et al., 2014; Chwialkowski et al., 2016; Liu et al., 2016; Gorham and Mackey, 2017; Barp et al., 2019). Our next result, proved in Appendix C.5, shows that no additional assumptions are necessary: $\mathrm{KSD}_{K,P}(Q) = \mathrm{MMD}_{k_p}(Q, P)$ and $S_p(H_K) = H_{k_p}$ whenever the left-hand side quantities are well defined.

Theorem 1 (KSD as MMD). Consider a target $P \in \mathcal{P}$ with density $p > 0$ and matrix-valued base kernel $K$ for which $p\, H_K$ consists of partially differentiable functions. Then $S_p(H_K)$ is the Stein RKHS $H_{k_p}$ induced by the Stein kernel

$k_p(x, y) \equiv \frac{1}{p(x) p(y)} \nabla_y \cdot \nabla_x \cdot (p(x)\, K(x, y)\, p(y)), \quad (5)$

that is, $k_p(x, y) = \sum_{i,j=1}^d \frac{1}{p(x) p(y)} \partial_{y_j} \partial_{x_i} (p(x)\, K_{ij}(x, y)\, p(y))$. Moreover, for target measures with zero-mean Stein RKHSes, i.e., for $P$ in

$\mathcal{P}_{K,0} \equiv \{Q \in \mathcal{P}$ with density $q > 0$ : $q\, H_K$ consists of partially differentiable functions, $S_q(H_K) \subset L^1(Q)$, and $Q(S_q(H_K)) = \{0\}\}$,

the KSD matches the Stein kernel MMD: $\mathrm{KSD}_{K,P}(Q) = \mathrm{MMD}_{k_p}(Q, P)$ for all $Q \in \mathcal{P}$.

Remark 4 (Scalar kernel KSD). When $K = k\, I_d$ for a scalar kernel $k$, we will say that $k_p$ is induced by $k$ and write $\mathcal{P}_{k,0} \equiv \mathcal{P}_{k I_d, 0}$. In this case,

$k_p(x, y) = \sum_{i=1}^d \frac{1}{p(x) p(y)} \partial_{x_i} \partial_{y_i} (p(x)\, k(x, y)\, p(y))$.
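To illustrate (5) and Remark 4, the following sketch (ours; the IMQ base kernel and standard Gaussian target are assumed, illustrative choices) evaluates the scalar Stein kernel through the standard expansion $k_p(x, y) = \langle s_p(x), s_p(y) \rangle k(x, y) + \langle s_p(x), \nabla_y k(x, y) \rangle + \langle s_p(y), \nabla_x k(x, y) \rangle + \sum_i \partial_{x_i} \partial_{y_i} k(x, y)$ and then averages it over a sample, anticipating the double-integral form of Corollary 1 below.

```python
import numpy as np

def stein_kernel(X, Y, score, c=1.0, gamma=0.5):
    """Stein kernel induced by the IMQ base kernel k(x,y) = (c^2 + ||x-y||^2)^(-gamma),
    via the expansion of Remark 4 for K = k I_d."""
    diff = X[:, None, :] - Y[None, :, :]                 # (n, m, d) pairwise x - y
    d2 = np.sum(diff ** 2, axis=-1)                      # ||x - y||^2
    s = c ** 2 + d2
    k = s ** (-gamma)
    sx, sy = score(X), score(Y)                          # target scores s_p at each point
    grad_x_k = -2 * gamma * (k / s)[..., None] * diff    # grad_x k(x, y)
    grad_y_k = -grad_x_k                                 # grad_y k(x, y)
    trace_k = (2 * gamma * X.shape[1] / s - 4 * gamma * (gamma + 1) * d2 / s ** 2) * k
    return (np.einsum('id,jd->ij', sx, sy) * k
            + np.einsum('id,ijd->ij', sx, grad_y_k)
            + np.einsum('jd,ijd->ij', sy, grad_x_k)
            + trace_k)

score = lambda X: -X                                     # s_p for P = N(0, I_d)
rng = np.random.default_rng(1)
Q = rng.normal(1.0, 1.0, size=(300, 2))                  # off-target sample
print(stein_kernel(Q, Q, score).mean())                  # KSD^2 estimate, clearly > 0
```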
The zero-mean condition $P \in \mathcal{P}_{K,0}$ ensures that all functions in the Stein RKHS integrate to zero under the target measure, so that the KSD can be evaluated without any explicit integration under $P$. Moreover, by Proposition 2 and Theorem 1, when $Q$ embeds into the Stein RKHS, the KSD takes on its more familiar double-integral form.

Corollary 1 (KSD as a double integral). If $P \in \mathcal{P}_{K,0}$ and $Q \in \mathcal{P}_{H_{k_p}}$, then $\mathrm{KSD}^2_{K,P}(Q) = \iint k_p(x, y)\, dQ(x)\, dQ(y)$.

Finally, the following result, proved in Appendix C.6, provides user-friendly sufficient conditions for verifying that $P \in \mathcal{P}_{K,0}$, which requires verifying that $H_{k_p} = S_p(H_K) \subset L^1(P)$ and $P(H_{k_p}) = \{0\}$. Hereafter, we let $s_p \equiv \nabla \log p$ denote the score function of $P$ whenever $\log p$ is partially differentiable.

Proposition 3 (Stein embeddability conditions). Consider a target $P \in \mathcal{P}$ with density $p > 0$ and matrix-valued base kernel $K$ for which $p\, H_K$ consists of partially differentiable functions. The following claims hold true.

(a) $S_p(H_K) \subset L^1(P) \iff P \in \mathcal{P}_{H_{k_p}}$.

(b) If $P \in \mathcal{P}_{\sqrt{k_p}}$, then $P \in \mathcal{P}_{H_{k_p}}$.

(c) If $P \in \mathcal{P}_{s_p}$ and all $v$ in $H_K$ are bounded with bounded partial derivatives, then $P \in \mathcal{P}_{\sqrt{k_p}}$.

(d) If $P \in \mathcal{P}_{H_{k_p}}$, then $\iint k_p(x, y)\, dP(x)\, dP(y) = 0 \iff P \in \mathcal{P}_{K,0}$.

(e) If $P \in \mathcal{P}_{H_{k_p}}$ and $H_K \subset L^1(P) \cap C^1(\mathbb{R}^d)$, then $P \in \mathcal{P}_{K,0}$.

Remark 5 (User-friendly conditions on $H_K$). The requirements on $H_K$ in Proposition 3 (c) and (e) can often be verified by examining simple properties of the base kernel $K$. For example, by Lemma 3, all $v$ in $H_K$ are bounded iff $x \mapsto \|K(x, x)\|$ is bounded, and all $v$ in $H_K$ have bounded $x_i$-partial derivatives if $(x, y) \mapsto \partial_{x_i} \partial_{y_i} K(x, y)$ exists and is bounded for any matrix norm $\|\cdot\|$. In particular, if $K \in C_b^{(1,1)}(\mathbb{R}^d)$, then $H_K \subset C_b^1(\mathbb{R}^d)$. Moreover, by Micheli and Glaunès (2013, Thm. 2.11), $H_K \subset C^1(\mathbb{R}^d)$ iff $(x, y) \mapsto \partial_{x_i} \partial_{y_i} K(x, y)$ is separately continuous and locally bounded.

3. Conditions for Separating Measures

Our first goal is to identify when an MMD distinguishes $P$ from other measures. Given a set of probability measures $M \subset \mathcal{P}$, we will say that $k$ separates $P$ from $M$ if, for any $Q \in M$, $\mathrm{MMD}_k(Q, P) = 0$ implies that $Q = P$. When $k$ separates $P$ from all probability measures $\mathcal{P}$, we say simply that $k$ is $P$-separating. We will first discuss restricted $P$-separation, that is, separation from a distinguished subset of measures $M \subsetneq \mathcal{P}$, in Sections 3.1 and 3.2 and then turn to general $P$-separation, separation from all probability measures $\mathcal{P}$, in Section 3.4.

3.1 Bochner P-separation with MMDs

Our first result, proved in Appendix D, exactly characterizes the kernels that separate $P$ from Bochner embeddable measures on Radon spaces (Ambrosio et al., 2005, Def. 5.1.4). Recall that a set of probability measures $M \subset \mathcal{P}$ is tight when for each $\epsilon > 0$ there exists a compact set $S \subset X$ such that $Q(S^c) \leq \epsilon$ for all $Q \in M$. We also say that a measurable function $\phi : X \to \mathbb{R}$ is uniformly integrable by $M \subset \mathcal{P}$ if for each $\epsilon > 0$ there exists $r > 0$ such that $\sup_{\mu \in M} \int_{\{x : |\phi|(x) > r\}} |\phi|\, d\mu < \epsilon$.

Theorem 2 (Bochner P-separation with MMDs). Let $k$ be a continuous kernel over a Radon space $X$ (for example, a Polish space). Then $k$ separates $P \in \mathcal{P}_{\sqrt{k}}$ from $\mathcal{P}_{\sqrt{k}}$ iff, for any sequence $(Q_n)_n \subset \mathcal{P}_{\sqrt{k}}$,

$Q_n h \to P h$ for all $h \in C_{\sqrt{k}}$ whenever (a) $\mathrm{MMD}_k(Q_n, P) \to 0$, (b) $(Q_n)_n$ is tight, and (c) $(Q_n)_n$ uniformly integrates $\sqrt{k}$. $\quad (6)$

Theorem 2 exposes an important relationship between our two goals of separation and convergence control. In particular, when $k$ is bounded, the uniform integrability condition (c) always holds, $\mathcal{P}_{\sqrt{k}}$ is the set of all probability measures $\mathcal{P}$, $C_{\sqrt{k}}$ is the set of all bounded continuous functions, and the convergence on the left-hand side of (6) is the usual weak convergence in $\mathcal{P}$.
Hence for bounded kernels we obtain Corollary 2: separating $P$ from all probability measures is equivalent to controlling tight $P$-convergence, i.e., having $Q_n \to P$ weakly whenever $\mathrm{MMD}_k(Q_n, P) \to 0$ and $(Q_n)_n$ is tight.

Corollary 2 (P-separation with bounded kernels). Let $k$ be a continuous bounded kernel over a Radon space $X$ (for example, a Polish space). Then $k$ separates $P \in \mathcal{P}$ from $\mathcal{P}$ iff, for any sequence $(Q_n)_n \subset \mathcal{P}$, $Q_n h \to P h$ for all $h \in C_b$ whenever (a) $\mathrm{MMD}_k(Q_n, P) \to 0$ and (b) $(Q_n)_n$ is tight.

Remark 6 (Comparison with Simon-Gabriel et al. (2023)). When $X$ is also locally compact and Hausdorff, for instance when $X = \mathbb{R}^d$, Theorem 9 of Simon-Gabriel et al. (2023) implies that, if $H_k \subset C_0$ and $k$ separates every finite measure $\mu$ from the set of finite measures, then $k$ metrizes the weak convergence of probability measures (i.e., for every probability measure $P$, $\mathrm{MMD}_k(Q_n, P) \to 0 \iff Q_n \to P$ weakly). Comparing with Corollary 2, we observe that no explicit tightness requirement appears. This is because the assumption of separation for every finite measure $\mu$ (instead of separation of a single finite measure $P$ from $\mathcal{P}$) implicitly does the work of enforcing tightness. In the proof of Theorem 2 we can see that the role of tightness is to ensure relative compactness, which in turn allows us to use the existence of convergent subsequences to promote the separation assumption into convergence control. But $\mathcal{P}$ is a bounded and thus relatively compact subset of the space of finite measures (Trèves, 1967, Thm. 33.2). Hence, by Trèves (1967, Prop. 32.5), the assumption of $k$ separating all finite measures is enough to guarantee the equivalence between $Q_n h \to P h$ for all $h \in H_k$ and $Q_n h \to P h$ for all $h \in C_0$. The latter is further equivalent to $Q_n h \to P h$ for all $h \in C_b$ by Berg et al. (1984, Cor. 2.4.3).

3.2 Score P-separation with KSDs

The standard practice in the KSD literature is to identify easily-verified properties of the base kernel $K$, target $P$, and alternative measure $Q$ that ensure separation. One class of KSD separation conditions, introduced by Chwialkowski et al. (2016, Thm. 2.2) and generalized by Barp et al. (2019, Prop. 1), applies to measures that finitely integrate the score $s_p$ but additionally requires $Q$ to have a continuously differentiable log-density. The first main result of this work, proved in Appendix E, removes the extraneous continuity conditions and extends $P$-separation to all measures $Q \in \mathcal{P}_{s_p}$ under a standard separating assumption on the base kernel, $\mathcal{D}^1_{L^1}(\mathbb{R}^d)$-characteristicness.

Theorem 3 (Score P-separation with KSDs). Suppose a matrix-valued kernel $K$ with $H_K \subset C_b^1(\mathbb{R}^d)$ is $\mathcal{D}^1_{L^1}(\mathbb{R}^d)$-characteristic. If $P \in \mathcal{P}_{K,0}$, then $k_p$ separates $P$ from $\mathcal{P}_{s_p}$.

We provide formal definitions of $\mathcal{D}^1_{L^1}(\mathbb{R}^d)$ and characteristicness in Definition 8 and Definition 7 respectively. In brief, $\mathcal{D}^1_{L^1}(\mathbb{R}^d)$ is the $d$-dimensional product of the space $\mathcal{D}^1_{L^1}$ of finite measures and their distributional derivatives, and a $\mathcal{D}^1_{L^1}(\mathbb{R}^d)$-characteristic kernel is one that can separate any pair of $\mathcal{D}^1_{L^1}(\mathbb{R}^d)$ elements. (Distributional derivatives extend the usual notion of derivative to objects that are not smooth, in particular to non-smooth distributions $Q$. When $Q$ has a differentiable Lebesgue density $q$, we recover the usual derivative, $\partial_{x_j} Q = \partial_{x_j} q\, dx$, while in general $\partial_{x_j} Q$ will be a Schwartz distribution; Schwartz, 1978.) Our proof of Theorem 3 builds on the kernel Schwartz distribution theory of Simon-Gabriel and Schölkopf (2018), wherein the space $\mathcal{D}^1_{L^1}$ naturally arises from the construction of the Stein RKHS via the Langevin Stein operator $S_p$. Specifically, we show in Appendix P that the Stein kernel $k_p$ separates $P$ from $Q \in \mathcal{P}_{H_{k_p}}$ if and only if the base kernel $K$ separates the Schwartz distribution $\sum_j (s_p^j Q - \partial_{x_j} Q)\, e_j$ from the zero measure. Moreover, $\sum_j (s_p^j Q - \partial_{x_j} Q)\, e_j \in \mathcal{D}^1_{L^1}(\mathbb{R}^d)$ when $Q \in \mathcal{P}_{s_p}$, which yields Theorem 3.
Application 1: Goodness-of-fit testing. In goodness-of-fit (GOF) testing, one uses a sequence of datapoints $X_1, \dots, X_n$ generated from a Markov chain to test whether the chain's stationary distribution $Q$ coincides with a target distribution $P$. KSDs with $\mathcal{D}^1_{L^1}$-characteristic translation-invariant base kernels are commonly used as GOF test statistics, and such tests are known to consistently reject $Q \neq P$ whenever $\mathrm{KSD}(Q, P) > 0$ (Chwialkowski et al., 2016; Liu et al., 2016; Gorham and Mackey, 2017). However, prior to this work, the separating condition $\mathrm{KSD}(Q, P) > 0$ had only been established for a restricted class of alternatives (continuous $Q \in \mathcal{P}_{\sqrt{k_p}}$ with differentiable log densities satisfying $Q(\|s_p - s_q\|) < \infty$; Barp et al. 2019, Prop. 1) or a restricted class of targets ($P$ with Lipschitz $s_p$ and strongly log concave tails; Gorham and Mackey 2017, Thm. 7). The former restriction excludes discrete and discontinuous $Q$, as well as $Q$ with tails heavier than $P$ or non-differentiable densities. Meanwhile, the latter restriction excludes $P$ with tails heavier than or lighter than a Gaussian. Theorem 3 in the present work ensures that $\mathrm{KSD}(Q, P) > 0$ for any $P \in \mathcal{P}_{K,0}$ and $Q \in \mathcal{P}_{s_p}$. In particular, this accommodates discontinuous or non-smooth $Q$ and all targets $P$ for which the KSD (4) is defined. Moreover, Theorem 3 holds for all $\mathcal{D}^1_{L^1}$-characteristic kernels, a strict superset of the $C_0^1$-universal kernels (Carmeli et al., 2010, Def. 4.1) assumed in prior work.

Indeed, Simon-Gabriel and Schölkopf (2018, Thm. 12, Tab. 1, and Cor. 38) showed that any $C_0^1$-universal $k$ and any $C^{(1,1)}$ translation-invariant $k$ with fully supported spectral measure is $\mathcal{D}^1_{L^1}$-characteristic. These results already cover all of the translation-invariant base kernels commonly used with KSDs, including Gaussian, inverse multiquadric (IMQ), log inverse, sech, Matérn, B-spline, and Wendland's compactly supported kernels. Moreover, as we prove in Appendix F.1, characteristicness to $\mathcal{D}^1_{L^1}$ is preserved under the following operations, which allows one to construct even more flexible base kernels.

Proposition 4 (Preserving characteristicness). Suppose a matrix-valued kernel $K$ with $H_K \subset C_b^1(\mathbb{R}^d)$ is $\mathcal{D}^1_{L^1}(\mathbb{R}^d)$-characteristic. Then the following claims hold true.

(a) If $a \in C_b^1$ is strictly positive, then $a(x) K(x, y) a(y)$ is $\mathcal{D}^1_{L^1}(\mathbb{R}^d)$-characteristic.

(b) If $b : \mathbb{R}^d \to \mathbb{R}^d$ is a Lipschitz $C^1(\mathbb{R}^d)$-diffeomorphism, then the composition kernel $K(b(x), b(y))$ is $\mathcal{D}^1_{L^1}(\mathbb{R}^d)$-characteristic.

(c) If $k_j$ is $\mathcal{D}^1_{L^1}$-characteristic for each $j \in [d]$, then $\mathrm{diag}(k_1, \dots, k_d)$ is $\mathcal{D}^1_{L^1}(\mathbb{R}^d)$-characteristic.

As a final remark on Theorem 3, we note that the score embedding measures $\mathcal{P}_{s_p}$ and the Bochner embeddable measures $\mathcal{P}_{\sqrt{k_p}}$ exactly coincide under mild conditions satisfied by every $C^{(1,1)}$ translation-invariant base kernel $K$. See Appendix C.3 for the proof of this result.

Proposition 5 (Score vs. Bochner embeddability). Under the assumptions of Theorem 3, $\mathcal{P}_{s_p} \subset \mathcal{P}_{\sqrt{k_p}}$. If, in addition, $x \mapsto \sqrt{\langle s_p(x), K(x, x) s_p(x) \rangle} / \|s_p(x)\|$ is bounded away from zero, then $\mathcal{P}_{s_p} = \mathcal{P}_{\sqrt{k_p}}$.
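As a quick illustration of the kernel transforms of Proposition 4, the sketch below (ours; the Gaussian base kernel, tilt $a$, and diffeomorphism $b$ are illustrative assumptions) implements the tilt, composition, and diagonal constructions for scalar building blocks.

```python
import numpy as np

def tilt(kernel, a):
    """Prop. 4(a): (x, y) -> a(x) k(x, y) a(y) for strictly positive a in C^1_b."""
    return lambda x, y: a(x) * kernel(x, y) * a(y)

def compose(kernel, b):
    """Prop. 4(b): (x, y) -> k(b(x), b(y)) for a Lipschitz C^1-diffeomorphism b."""
    return lambda x, y: kernel(b(x), b(y))

def diag_kernel(kernels):
    """Prop. 4(c): matrix-valued kernel diag(k_1, ..., k_d)."""
    return lambda x, y: np.diag([k(x, y) for k in kernels])

gauss = lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2)   # D^1_{L^1}-characteristic
a = lambda x: 1.0 / (1.0 + np.sum(x ** 2))               # strictly positive, C^1_b
b = lambda x: x + 0.5 * np.tanh(x)                       # Lipschitz C^1-diffeomorphism

x, y = np.array([0.3, -1.2]), np.array([1.0, 0.4])
print(tilt(gauss, a)(x, y), compose(gauss, b)(x, y))
print(diag_kernel([gauss, gauss])(x, y))
```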
3.3 L2 P-separation with KSDs

Liu et al. (2016) introduced a second class of KSD separation conditions based on an $L^2$ separating property of the base kernel. We say that a matrix-valued kernel $K$ is $L^2(\mathbb{R}^d)$-integrally strictly positive definite (ISPD) if $H_K \subset L^2(\mathbb{R}^d)$ and

$g \in L^2(\mathbb{R}^d)$ and $g \neq 0 \implies \iint g(x)^T K(x, y)\, g(y)\, dx\, dy > 0$.

Unfortunately, the $L^2$ requirement on $H_K$ excludes certain popular base kernels like slowly decaying IMQ and log inverse kernels, and Liu et al. (2016) did not provide any examples of kernels satisfying the $L^2$-ISPD conditions. Our next result fills this gap by showing that many standard kernels are $L^2$-ISPD, including Gaussian, Matérn, sech, B-spline, faster decaying IMQ, and Wendland's compactly supported kernels, along with their tilted variants. The proof can be found in Appendix G.

Theorem 4 (L2-ISPD conditions). The following claims hold true for a matrix-valued kernel $K$.

(a) Suppose $(k_j)_{j=1}^d$ are translation-invariant continuous kernels with $H_{k_j} \subset L^2$. If the spectral measure of each $k_j$ is fully supported, then $K = \mathrm{diag}(k_j)$ is $L^2(\mathbb{R}^d)$-ISPD.

(b) If $K$ is $L^2(\mathbb{R}^d)$-ISPD and $A : \mathbb{R}^d \to \mathbb{R}^{d \times d}$ is bounded and measurable with $A(x)$ invertible for each $x$, then the tilted kernel $A(x) K(x, y) A(y)^T$ is also $L^2(\mathbb{R}^d)$-ISPD.

(c) If $H_K$ is separable, $\sup_x \|K_x u\|_{L^1} < \infty$, and $K_x u \in L^2(\mathbb{R}^d)$ for each $x$ and $u \in \mathbb{R}^d$, then $H_K \subset L^2(\mathbb{R}^d)$.

(d) Suppose $K_x u \in L^1(\mathbb{R}^d)$ for some $u \in \mathbb{R}^d$. If $K$ is translation-invariant or, more generally, if $K_x u$ is bounded, then $K_x u \in L^2(\mathbb{R}^d)$.

Previous results in the literature have focused on properties similar to but distinct from the $L^2(\mathbb{R}^d)$-ISPD condition. These include conditions under which kernels are (i) $L^p(\mu)$-ISPD for $p \in [1, \infty)$ with respect to a probability measure $\mu$ in place of Lebesgue measure (Carmeli et al., 2010); (ii) ISPD, meaning that $\iint k(x, y)\, d\mu(x)\, d\mu(y) > 0$ for all non-zero finite measures $\mu$ (Sriperumbudur et al., 2011); (iii) $L^1$ integrally non-strictly positive definite (INPD), meaning $g \in L^1 \implies \iint g(x)^T K(x, y) g(y)\, dx\, dy \geq 0$ (Bochner, 1932; Stewart, 1976); (iv) $L^2_c$-INPD, where $L^2_c$ is the space of compactly supported $L^2$ functions (Cooper, 1960); or (v) $L^2$-INPD for continuous $k \in L^2$ when $d = 1$ (Buescu et al., 2004, Rem. 2.10) or translation-invariant $k$ with $k_x \in L^1$ (Phillips, 2018, Thm. 2.5.1).

Liu et al. (2016, Prop. 3.3 & Thm. 3.8) showed that KSDs with $L^2$-ISPD base kernels separate certain measures with continuously differentiable densities $q$. Theorem 5, proved in Appendix H, generalizes this finding to matrix-valued $K$ and partially differentiable $q$ and provides user-friendly $L^2$ conditions for ensuring that $Q$ can be separated.

Theorem 5 (L2 P-separation with KSDs). Suppose $P \in \mathcal{P}_{K,0}$ for a matrix-valued kernel $K$. The following claims hold true.

(a) If $K$ is $L^2(\mathbb{R}^d)$-ISPD, then $k_p$ separates $P$ from $\{Q \in \mathcal{P}_{H_{k_p}} \cap \mathcal{P}_{K,0} : (s_p - s_q)\, q \in L^2(\mathbb{R}^d)\}$.

(b) If $Q \in \mathcal{P}_{H_{k_p}}$, $(s_p - s_q)\, q \in L^2(\mathbb{R}^d)$, and $H_K \subset L^2(\mathbb{R}^d) \cap L^\infty(\mathbb{R}^d)$, then $Q \in \mathcal{P}_{K,0}$.

(c) If $H_K \subset L^2(\mathbb{R}^d)$, $H_K \subset L^\infty(\mathbb{R}^d)$, and $\|s_p\|\, q \in L^2(\mathbb{R}^d)$, then $Q \in \mathcal{P}_{H_{k_p}}$.

While Theorem 5 only applies to continuous $Q$, it does cover certain measures excluded by Theorem 3. For example, Theorem 5 implies that Cauchy alternatives $Q$ are separated from Gaussian targets $P$, since $q\, \|s_p\|_2$ and $q\, \|s_q\|_2$ are bounded and hence $q s_p, q s_q \in L^2(\mathbb{R}^d)$. Meanwhile, Theorem 3 cannot be applied to these $(Q, P)$ pairings, as the heavy tails of a Cauchy cannot finitely integrate a Gaussian score $s_p$.
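The Gaussian-target, Cauchy-alternative example above is easy to probe numerically. The sketch below (ours, reusing the `stein_kernel` helper from the Section 2.2 sketch; all parameter choices are illustrative) compares off-diagonal (U-statistic) KSD² estimates for Cauchy draws against draws from the target itself; the former is typically bounded away from zero while the latter shrinks with the sample size.

```python
import numpy as np

def ksd2_ustat(X, score, **kw):
    """Off-diagonal (U-statistic) estimate of KSD^2 = double integral of k_p dQ dQ."""
    G = stein_kernel(X, X, score, **kw)       # helper defined in the Section 2.2 sketch
    n = len(X)
    return (G.sum() - np.trace(G)) / (n * (n - 1))

rng = np.random.default_rng(2)
score = lambda X: -X                                       # standard Gaussian target
print(ksd2_ustat(rng.standard_cauchy(size=(2000, 1)), score))  # clearly positive
print(ksd2_ustat(rng.normal(size=(2000, 1)), score))           # near zero
```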
3.4 General P-separation

The results in the preceding sections only yield general $P$-separation when applied to bounded kernels, and indeed this has been the standard in much of the MMD literature (Sriperumbudur et al., 2010; Sriperumbudur, 2016; Simon-Gabriel and Schölkopf, 2018; Simon-Gabriel et al., 2023). To accommodate the unbounded Stein kernels that often arise in KSDs, our next definition and result (proved in Appendix J) provide a new, convenient means to check that unbounded kernels separate $P$ from $\mathcal{P}$.

Definition 3 (Bounded P-separating property). We say a set of functions $F$ is bounded $P$-separating if $L^\infty \cap F$ is $P$-separating, i.e., if $Q \in \mathcal{P}$ and $Qh = Ph$ for all $h \in L^\infty \cap F$, then $Q = P$.

Theorem 6 (Controlling tight convergence with bounded separation). If $H_k$ is bounded $P$-separating, then $k$ is $P$-separating and controls tight $P$-convergence.

According to Theorem 6, to establish general $P$-separation it suffices to restrict focus to the bounded functions in an RKHS. Moreover, Theorem 6 suggests a convenient strategy for proving $P$-separation with unbounded kernels $k$: (i) identify a sub-RKHS of bounded functions that belongs to $H_k$ and (ii) appeal to a broadly applicable bounded-kernel result to establish the $P$-separation of the bounded sub-RKHS. To apply this strategy to KSDs, we first show in Appendix K that any suitably tilted $\mathcal{D}^1_{L^1}(\mathbb{R}^d)$-characteristic base kernel yields a bounded and $P$-separating Stein kernel:

Theorem 7 (Controlling tight convergence with bounded Stein kernels). Suppose a matrix-valued kernel $K$ with $H_K \subset C_b^1(\mathbb{R}^d)$ is $\mathcal{D}^1_{L^1}(\mathbb{R}^d)$-characteristic. If $\|s_p(x)\| \leq \theta(x)$ for $\theta \in C^1$ with $1/\theta \in C_b^1$, then the Stein kernel induced by the tilted base kernel $K(x, y)/(\theta(x)\theta(y))$ is bounded and $P$-separating and controls tight $P$-convergence.

Next we show that standard translation-invariant base kernels have sub-RKHSes of precisely the form needed by Theorem 7:

Theorem 8 (Translation-invariant kernels have rapidly decreasing sub-RKHSes). Suppose a kernel $k$ with $H_k \subset C^1$ is translation invariant with a spectral density bounded away from zero on compact sets. Then there exist a translation-invariant, $\mathcal{D}^1_{L^1}$-characteristic kernel $k_s \in C^{(1,1)}$ and, for each $c > 0$, a positive-definite function $f$ with $1/f \in C^1$, $\max(|f(x)|, \|\nabla f(x)\|) = O(e^{-c \sum_{i=1}^d \sqrt{|x_i|}})$, and

$H_{k_f} \subset H_{k_{fs}} \subset H_k$ for $k_f(x, y) \equiv f(x)\, k_s(x, y)\, f(y)$ and $k_{fs}(x, y) \equiv k_s(x, y)\, f(x - y)$.

Theorem 8 applies to all of the translation-invariant base kernels commonly used with KSDs, including Gaussian, IMQ, log inverse, sech, Matérn, B-spline, and Wendland's compactly supported kernels. Moreover, our proof in Appendix L explicitly constructs the $\mathcal{D}^1_{L^1}$-characteristic kernel $k_s$ and the rapidly decreasing tilt function $f$ and may be of independent interest.

We now apply our Stein operator to the base kernels of Theorem 8 and invoke Theorem 7 to deduce the second main result of this work: KSDs based on standard translation-invariant kernels achieve general $P$-separation, even when their Stein kernels are unbounded. The proof of this result can be found in Appendix M.

Theorem 9 (Controlling tight convergence with KSDs). For $k$ as in Theorem 8, define the tilted kernel $k_a(x, y) \equiv a(x)\, k(x, y)\, a(y)$ for each strictly positive $a \in C^1$. (Here a function $a$ has at most root exponential growth if $a(x) = O(\exp(c \sum_{i=1}^d \sqrt{|x_i|}))$ for some $c > 0$.)

(a) If $P \in \mathcal{P}_{k,0}$ and $s_p$ has at most root exponential growth, then the Stein kernel induced by $k$ is bounded $P$-separating and controls tight $P$-convergence.

(b) Moreover, if $P \in \mathcal{P}_{k_a,0}$ and $a$, $\|\nabla a\|$, and $a\, \|s_p\|$ have at most root exponential growth, then the Stein kernel induced by $k_a$ is bounded $P$-separating and controls tight $P$-convergence.
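The tilting recipe of Theorem 7 is easy to experiment with numerically. In the minimal 1-d sketch below (ours; the Gaussian base kernel, Gaussian target, and finite-difference scheme are illustrative assumptions), dividing the base kernel by $\theta(x)\theta(y)$ with $\theta$ dominating $|s_p|$ keeps the induced Stein kernel bounded on the diagonal, while the untilted Stein kernel grows like $s_p(x)^2$.

```python
import numpy as np

def stein_kernel_fd(kernel, score, x, y, h=1e-3):
    """1-d Langevin Stein kernel via the expanded form of (5); the base-kernel
    derivatives are approximated by central finite differences."""
    k = kernel(x, y)
    dx = (kernel(x + h, y) - kernel(x - h, y)) / (2 * h)
    dy = (kernel(x, y + h) - kernel(x, y - h)) / (2 * h)
    dxy = (kernel(x + h, y + h) - kernel(x + h, y - h)
           - kernel(x - h, y + h) + kernel(x - h, y - h)) / (4 * h * h)
    return score(x) * score(y) * k + score(x) * dy + score(y) * dx + dxy

score = lambda x: -x                                      # s_p for P = N(0, 1)
theta = lambda x: np.sqrt(1.0 + x ** 2)                   # theta(x) >= |s_p(x)| = |x|
base = lambda x, y: np.exp(-0.5 * (x - y) ** 2)
tilted = lambda x, y: base(x, y) / (theta(x) * theta(y))  # Theorem 7 tilt

for t in [0.0, 5.0, 30.0]:
    # tilted diagonal stays O(1); untilted diagonal grows like t^2 + 1
    print(t, stein_kernel_fd(tilted, score, t, t), stein_kernel_fd(base, score, t, t))
```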
Application 2: Goodness-of-fit testing, continued. In the testing setting of Application 1, Theorem 9 extends the reach of KSD GOF testing by guaranteeing $\mathrm{KSD}(Q, P) > 0$ for all alternatives $Q$ whenever $s_p$ has at most root exponential growth. Since the Stein kernels of Theorem 9 are also bounded $P$-separating, the same consistency guarantees immediately extend to the computationally efficient stochastic KSDs of Gorham et al. (2020, Thm. 4).

4. Conditions for Convergence Control

Having derived sufficient conditions on the RKHS to separate measures and control tight convergence, we now present both sufficient and necessary conditions to ensure that an MMD controls weak convergence to $P$. Hereafter, we will say that $k$ controls weak convergence to $P$, or controls $P$-convergence, whenever $\mathrm{MMD}_k(Q_n, P) \to 0$ implies $Q_n \to P$ weakly. Moreover, we will say that $k$ enforces tightness whenever $\mathrm{MMD}_k(Q_n, P) \to 0$ implies that $(Q_n)_n$ is tight. Enforcing tightness is central to our developments as, if $k$ controls tight weak convergence to $P$ and enforces tightness, then it also controls weak convergence to $P$.

4.1 Sufficient conditions

We begin by introducing a new sufficient condition to ensure that MMDs, and integral probability metrics more generally, enforce tightness.

Definition 4 (P-dominating indicators). Consider a set of functions $F \subset L^1(P)$. We say that $F$ $P$-dominates indicators if, for each $\epsilon > 0$, there exist a compact set $S \subset X$ and a function $h \in F$ that satisfy

$h - Ph \geq \mathbb{I}[S^c] - \epsilon. \quad (7)$

Definition 4 ensures that a sequence $(Q_n)_n$ can only approximate $P$ well if it places uniformly little mass outside of a compact set $S$. As we show in Appendix N, this is sufficient to ensure that integral probability metrics like the MMD enforce tightness.

Theorem 10 (Controlling P-convergence by dominating indicators). If $F \subset L^1(P)$ $P$-dominates indicators, then $(Q_n)_n$ is tight whenever the integral probability metric

$d_F(Q_n, P) \equiv \sup_{h \in F :\, h_+ \in L^1(Q_n) \text{ or } h_- \in L^1(Q_n)} |Q_n h - Ph| \to 0$.

Hence, if $P \in \mathcal{P}_{H_k}$ and $H_k$ $P$-dominates indicators, then $(Q_n)_n$ is tight whenever $\mathrm{MMD}_k(Q_n, P) \to 0$. If, in addition, $k$ controls tight $P$-convergence, then $k$ also controls $P$-convergence.

We can now combine Theorem 10 with any of our KSD tight convergence results to immediately obtain $P$-convergence control for KSDs.

Corollary 3 (Controlling P-convergence with KSDs). Under the conditions of Theorem 3, 7, or 9, if $H_{k_p}$ $P$-dominates indicators, then $k_p$ controls $P$-convergence.

Before we discuss applications of these results, let us compare them to existing results in the literature. Prior work relied on a stronger, coercive function condition to establish that KSDs enforce tightness with generalized multiquadric (Gorham and Mackey, 2017, Lem. 16), IMQ score (Chen et al. 2018, Thm. 4; Hodgkinson et al. 2020, Ex. 6), log inverse (Chen et al., 2018, Thm. 3), or unbounded tilted translation-invariant (Huggins and Mackey, 2018, Thm. 3.2) base kernels. Hodgkinson et al. (2020) used the following general definition of coercivity.

Definition 5 (Coercive function (Hodgkinson et al., 2020, Assump. 1)). We say a function $h : X \to \mathbb{R}$ is coercive if, for any $M > 0$, there exists a compact set $S \subset X$ such that $\inf_{x \in S^c} h(x) > M$.

Remark 7 (Bounded coercive functions). Any continuous coercive function is also bounded below, as continuous functions are bounded on compact sets.

Our next result, proven in Appendix O, shows that this coercive function condition is stronger than our $P$-dominating indicator condition.
Lemma 1 (Coercive functions dominate indicators). If $h \in H_k$ is coercive and bounded below and $P \in \mathcal{P}_{H_k}$, then $H_k$ $P$-dominates indicators.

As a first application of Corollary 3, we show that KSDs with IMQ base kernels enforce tightness and control convergence whenever the dissipativity rate of the target dominates the decay rate of the kernel. Generalizing the argument in Gorham and Mackey (2017, Lem. 16), our proof in Appendix Q explicitly constructs a coercive function in the associated Stein RKHS.

Theorem 11 (IMQ KSDs control P-convergence). Consider a target measure $P \in \mathcal{P}$ with score $s_p \in C(\mathbb{R}^d) \cap L^1(P)$ that, for some dissipativity rate $u > 1/2$ and $r_0, r_1, r_2 > 0$, satisfies the generalized dissipativity condition

$\langle s_p(x), x \rangle \leq r_0\, \|s_p(x)\|_1 - r_1\, \|x\|^{2u} + r_2$ for all $x \in \mathbb{R}^d$. $\quad (8)$

If $k(x, y) = (c^2 + \|x - y\|^2)^{-\gamma}$ for $c > 0$ and $\gamma \in (0, 2u - 1)$, then $H_{k_p}$ $P$-dominates indicators and enforces tightness. If, in addition, $s_p$ has at most root exponential growth, then $k_p$ controls $P$-convergence.

Application 3: Measuring and improving sample quality. Because the KSD provides a computable quality measure that requires no explicit integration under $P$, KSDs are now commonly used to select and tune MCMC sampling algorithms (Gorham and Mackey, 2017), generate accurate discrete approximations to $P$ (Liu and Wang, 2016; Chen et al., 2018, 2019; Futami et al., 2019), compress Markov chain output (Riabiz et al., 2022), and correct for biased or off-target sampling (Liu and Lee, 2017; Hodgkinson et al., 2020; Riabiz et al., 2022). Each of these applications relies on KSD convergence control, but past work only established convergence control for $P$ with Lipschitz $s_p$ and strongly log concave tails (Gorham and Mackey 2017, Lem. 16; Chen et al. 2018, Thm. 3; Huggins and Mackey 2018, Thm. 3.2). Notably, these conditions imply generalized dissipativity (8) with $u = 1$ but exclude all $P$ with tails lighter than a Gaussian. Corollary 3 and Theorem 11 significantly relax these requirements by providing convergence control for all dissipative $P$ with lighter-than-Laplace tails.

Much of the difficulty in analyzing KSDs stems from the fact that all known convergence-controlling KSDs are based on unbounded Stein kernels $k_p$. As a second illustration of the power of Corollary 3, Theorem 12 develops the first KSDs known to metrize $P$-convergence (i.e., $\mathrm{KSD}(Q_n, P) \to 0 \iff Q_n \to P$ weakly) by constructing bounded convergence-controlling Stein kernels. The following theorem is proved in Appendix R.

Theorem 12 (Metrizing P-convergence with bounded Stein kernels). Consider a target measure $P \in \mathcal{P}$ with score $s_p$ that, for some dissipativity rate $u > 1/2$ and $r_0, r_1, r_2 > 0$, satisfies the generalized dissipativity condition (8). Define the Stein kernel with base kernel $K(x, y) = \mathrm{diag}\big(a(\|x\|)(x_i y_i + k(x, y))\, a(\|y\|)\big)_{i=1}^d$, i.e.,

$k_p(x, y) = \sum_{i=1}^d \frac{1}{p(x) p(y)} \partial_{x_i} \partial_{y_i} \big(p(x)\, a(\|x\|)(x_i y_i + k(x, y))\, a(\|y\|)\, p(y)\big)$,

for $k$ characteristic to $\mathcal{D}^1_{L^1}$ with $H_k \subset C_0^1$ and $a(\|x\|) \equiv (c^2 + \|x\|^2)^{-\gamma}$ a tilting function with $c > 0$ and $\gamma \geq u$. The following statements hold true:

(a) If $P \in \mathcal{P}_{K,0}$, then $H_{k_p}$ $P$-dominates indicators and enforces tightness.

(b) If $P \in \mathcal{P}_{K,0}$, $\gamma \geq 0$, and $\|s_p(x)\| \leq (c^2 + \|x\|^2)^{\gamma}$, then $k_p$ is bounded $P$-separating and controls $P$-convergence.

(c) If $\|s_p(x)\|\, \|x\| \leq (c^2 + \|x\|^2)^{\gamma}$ and $s_p \in C$, then $H_{k_p} \subset C_b$ and $k_p$ metrizes $P$-convergence.
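To see Theorem 12's construction in action, the sketch below (ours; $d = 1$, a Gaussian base kernel and target, and the finite-difference helper from the Theorem 9 sketch are all illustrative assumptions) builds the base kernel $a(|x|)(xy + k(x, y))\, a(|y|)$ with $a(r) = (c^2 + r^2)^{-\gamma}$ and checks that the induced Stein kernel stays bounded far into the tails, in contrast to the unbounded Stein kernels used previously.

```python
import numpy as np

# Reuses stein_kernel_fd and score (Gaussian target, u = 1) from the Theorem 9 sketch.
a = lambda r: (1.0 + r ** 2) ** (-1.0)              # tilt with c = 1, gamma = 1 >= u
k0 = lambda x, y: np.exp(-0.5 * (x - y) ** 2)       # D^1_{L^1}-characteristic base
K12 = lambda x, y: a(np.abs(x)) * (x * y + k0(x, y)) * a(np.abs(y))  # Theorem 12 base

for t in [0.0, 10.0, 50.0]:
    print(t, stein_kernel_fd(K12, score, t, t))     # diagonal stays bounded in t
```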
Application 4: Sampling with Stein variational gradient descent. Stein variational gradient descent (SVGD) is a popular technique for approximating a target distribution $P$ with a collection of $n$ representative particles. The algorithm proceeds by iteratively updating the locations of the particles according to a simple rule determined by a user-selected KSD. Liu (2017) showed that the SVGD approximation converges weakly to $P$ as the number of particles and iterations tend to infinity, provided that the chosen KSD controls $P$-convergence and that the Stein kernel is bounded. However, prior to this work, no bounded convergence-controlling Stein kernels were known. Theorem 12 therefore provides the first instance of a Stein kernel satisfying the SVGD convergence assumptions of Liu (2017).

4.2 Necessary conditions

We finally conclude with a necessary condition for an MMD to control weak convergence to $P$, which recovers and broadens the KSD failure derived by Gorham and Mackey (2017, Thm. 6). For each RKHS $H_k \subset L^1(P)$, define the $P$-centered RKHS $H_{k_P} \equiv \{h - Ph : h \in H_k\}$ with $P$-centered kernel

$k_P(x, y) \equiv k(x, y) - \int k(x, \tilde{y})\, dP(\tilde{y}) - \int k(\tilde{x}, y)\, dP(\tilde{x}) + \iint k(\tilde{x}, \tilde{y})\, dP(\tilde{x})\, dP(\tilde{y})$.

Theorem 13 shows that $k$ fails to control $P$-convergence whenever its $P$-centered RKHS functions all vanish at infinity; notably, this occurs whenever $k_P$ is bounded with $k_{P,x} \in C_0$ for each $x$ (Simon-Gabriel and Schölkopf, 2018, Prop. 3). The proof in Appendix S relies on the fact that $k$ and $k_P$ induce exactly the same MMD.

Theorem 13 (Decaying P-centered kernels fail to control P-convergence). Suppose that $X$ is locally compact but not compact. If $H_{k_P} \subset C_0$, then $k$ does not control $P$-convergence.

Implication 1: Standard KSDs fail for heavy-tailed P! Since Stein kernels are already $P$-centered by design (i.e., $(k_p)_P = k_p$), Theorem 13 holds dire consequences for standard KSDs with heavy-tailed targets $P$. As noted by Gorham and Mackey (2017, Thm. 10), if the score function is bounded (as is common for super-Laplace distributions), then the KSD fails to control $P$-convergence whenever a $C_0^1$ base kernel is used. Moreover, our more general Theorem 13 implies that if the score function is decaying (as is true for any Student's t distribution), then the KSD fails to control $P$-convergence for any bounded base kernel. This result suggests that the standard KSD practice of using a $C_0^1$ base kernel is unsuitable for heavy-tailed targets and that one should instead choose a base kernel with growth sufficient to counteract the decay of $s_p$.
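The $P$-centering above is straightforward to approximate by Monte Carlo. The sketch below (ours; the Gaussian kernel and Gaussian target are illustrative assumptions) builds $k_P$ from a sample of $P$ and checks that $k_P(\cdot, y)$ integrates to approximately zero under $P$, the decay-inheriting property that makes Theorem 13 bite for bounded base kernels.

```python
import numpy as np

def centered_kernel(kernel, P_sample):
    """Monte Carlo approximation of the P-centered kernel k_P of Section 4.2:
    k_P(x,y) = k(x,y) - E_P k(x,Y') - E_P k(X',y) + E_P E_P k(X',Y')."""
    kPP = kernel(P_sample[:, None], P_sample[None, :]).mean()   # double integral
    def kP(x, y):
        return (kernel(x, y) - kernel(x, P_sample).mean()
                - kernel(P_sample, y).mean() + kPP)
    return kP

gauss = lambda x, y: np.exp(-0.5 * (x - y) ** 2)   # bounded kernel with k_x in C_0
rng = np.random.default_rng(3)
Z = rng.normal(size=2000)                          # draws from P = N(0, 1)
kP = centered_kernel(gauss, Z)
print(kP(0.2, -0.4))                               # a centered kernel evaluation
print(np.mean([kP(z, 1.3) for z in Z[:500]]))      # ~ 0: P k_P(., 1.3) vanishes
```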
5. Discussion

This article derived new sufficient and necessary conditions for kernel discrepancies to enforce $P$-separation and control $P$-convergence. We characterized all MMDs that separate $P$ from Bochner embeddable measures, proposed novel sufficient conditions for separating all measures and enforcing tightness, strengthened all prior guarantees for KSD separation and convergence control on $\mathbb{R}^d$, and derived the first KSD known to exactly metrize (as opposed to strictly dominating) weak $P$-convergence on $\mathbb{R}^d$.

These developments point to several opportunities for further advances. First, while we have focused on weak convergence in this article, we believe many of the tools and constructions can be adapted to study the control of other modes of convergence. Natural candidates include $\alpha$-Wasserstein convergence (Ambrosio et al., 2005), i.e., weak convergence plus the convergence of $\alpha$ moments, and $C_{\sqrt{k}}$ convergence, i.e., expectation convergence for all continuous test functions bounded by $1 + \sqrt{k}$. When $k$ is unbounded, Theorem 2 exposes an important relationship between separation and $C_{\sqrt{k}}$ convergence: $P$-separating Bochner embeddable measures is equivalent to controlling $C_{\sqrt{k}}$ convergence to $P$ for tight sequences that uniformly integrate $\sqrt{k}$. Hence, to control $C_{\sqrt{k}}$ convergence, it remains to identify those kernels that simultaneously separate and enforce uniform integrability.

Second, while we have focused on canonical KSDs defined by the Langevin Stein operator and a bounded base kernel, our tools are amenable to analyzing other kernel-based Stein discrepancies, like the diffusion KSDs of Gorham et al. (2019) and Barp et al. (2019), the second-order KSDs studied in Barp et al. (2022b); Liu and Zhu (2018); Hodgkinson et al. (2020); Barp et al. (2022a), the gradient-free KSDs of Han and Liu (2018) and Fisher et al. (2022), and the random feature Stein discrepancies of Huggins and Mackey (2018). In fact, employing a diffusion KSD with unbounded diffusion coefficients is one promising way to overcome the heavy-tailed-target failure mode highlighted in Implication 1.
We mention that in Lemma 10 we will similarly construct B1 θ(Rd), a Banach space that plays a similar role to C1 b,θ(Rd) but is simpler to work with (however it is not general enough for our purposes). Given a topological vector space (TVS) F, its (continuous) dual will be denoted F . Given a subset M F , and Dα, D M we will write Dα M D when Dα(f) D(f), f F (i.e., Dα converges to D in weak star topology). When F = Cb, and M = P, we say that Dα converges weakly to D. More generally, we define weak convergence in P k (notice the in P k part!) using C k (resp. C0, k) is the space of continuous functions f with 1 + k growth, i.e., such that f/(1+ k) is bounded, (resp. in C0). Thus k P Qn, P P k , and Qn(f) P(f) f C Notice that Cb C k P with equality if and only if (iff) k is bounded. Recall here that, for a Rℓ-valued function f such as k, Pf {Q P : f L1(Q)}. Given TVSs F1 and F2, we denote by B(F1, F2) the set of continuous linear functionals from F1 to F2. The transpose of a continuous linear functional T is denoted T . Given a Radon measure µ on Rd, its distributional xi-derivative will be denoted xiµ : C c R. Recall that the distributional derivative is equal to xiµ = µ xi on C c . Appendix B. Vector-Valued RKHSes and Stein RKHSes Let X an open subset of Rd. Let Γ(Y ) denotes the set of maps X Y . Matrix-valued kernels are typically defined via a feature map, i.e., a map ξ : X B(H, Rd) (see Definition 6), which generates the kernel K(x, y) ξ (x) ξ(y). In particular if H Γ(Rd) is a RKHS of Rd-valued functions, i.e., a Hilbert space on which the evaluation functionals δx B(HK, Rd) are continuous, then H HK where K(x, y) δx δ y Rd d. The transpose of δy is usually denoted Ky δ y, and Kv y δ y(v) HK, so Targeted Separation and Convergence with Kernel Discrepancies Kxv(y) = δy Kxv = δyδ xv = K(x, y) v = K(y, x)v for any v Rd, thus Kx = K( , x). We can tilt matrix-valued kernels via a matrix-valued function m Γ(Rd d), indeed ξ m m ξ is a new feature map, and its kernel is Km(x, y) ξ m(x) ξm(y) = m(x) ξ (x) ξ(y) m T (y) = m(x)K(x, y)m(y)T . Given an RKHS HK of continuously differentiable Rd-valued functions, we can obtain a scalar-valued kernel via the Stein operator Sp.6 Let ξm P Sp m ξ : HK Γ(R), where ξ(h)(x) ξ (x)(h). Then ξm P : X HK is a feature map for the Stein kernel kp (Barp et al., 2019), i.e., kp(x, y) = ξm P (x), ξm P (y) K. Since the matrix m just corresponds to a change of matrix kernel K 7 Km, we can restrict to the identity case ξP ξId P . In other words, for the family of diffusion Stein operators (Gorham et al., 2019) the matrix-valued function m can be thought of as a transformation of the base RKHS HK into Hm Km T , i.e., Sm p (HK) = SId p (m HK) = SId p (Hm Km T ). Since K is arbitrary, without loss of generality we may choose m = Id, Sp SId p . Note that the matrix functions m obtained by the generator of P-preserving diffusions can be characterized on any manifold (Barp et al., 2021). Similarly, the Stein kernel obtained via the second-order Stein operator (Barp et al., 2022b) can be recovered by setting K to be the diagonal matrix kernel of partial derivatives of a scalar kernel. We finally recall the equivalence between universality, characteristicness, and strict positive definiteness of (scalar-valued) kernels (Simon-Gabriel and Sch olkopf, 2018, Thm. 6), noting it carries on to the case of matrix-valued kernels. Appendix C. 
Embedding Schwartz Distributions in an RKHS Given a continuous linear map T between TVS, we denote by T its transpose, and, similarly, if h belongs to a Hilbert space, we will denote by h the associated element in the dual space, i.e., h (f) h , f for any f in that Hilbert space. Definition 6 (Kernel embeddings and Pettis integrals). Let D be a linear functional on a vector space F containing the RKHS HK of a matrix-valued kernel K. (a) We say that D embeds into HK if D|HK is continuous, i.e., if there exists a function ΦK(D) HK such that for all h HK: D(h) = ΦK(D), h K. We call ΦK the kernel embedding and ΦK(D) the (kernel or RKHS) embedding of D. It is given by ΦK(D)(x) = P i ei D(Kei x ). 6. Note Sp is a special instance of the canonical operator associated to measures equivalent to the Lebesgue one with differentiable densities (or more precisely, the canonical operator induced by positive 1-densities) Barp et al. (2022a). Barp, Simon-Gabriel, Girolami, and Mackey (b) Given a feature map, i.e., a function ξ : X B(Rd, HK), we denote by ξ : X B(HK, Rd) the map x 7 ξ(x) and define the feature operator ξ : HK (X Rd) as ξ(h)( ) ξ ( )(h). We say ξ is Pettis-integrable with respect to D if ξ(HK) F and the linear functional D ξ embeds into HK. The RKHS embedding, ΦK(D ξ), of D ξ is known as the Pettis-integral of ξ with respect to D. We will also call the map from D 7 ΦK(D ξ) the RKHS embedding of ξ. When M is a set of embeddable linear functionals, for any D, D M we can define MMDK(D, D) D D K ΦK(D) ΦK( D) K, where ΦK : M HK is the kernel embedding, which recovers (2) when Q and P are embeddable probability measures. In that case, k separates P from M iffΦk( P)|M vanishes only at P. Hereafter, we will say that a kernel is characteristic to a set of embeddable linear functionals M when the RKHS embeddings of two distinct elements in M are always distinct. Definition 7 (Characteristicness). Given a set M of embeddable linear functionals (see Definition 6), we say K is characteristic to M when ΦK is injective over M. When µ is a finite (R-valued) measure on X, then a natural set of functions that µ can act on is the set of finitely µ-integrable functions L1(|µ|). Now, if a function ξ : X Hk is to be Pettis-integrable by µ, then the very least is that the functions ξ(h) be contained in L1(|µ|) for every h Hk. Interestingly, we will now see that, because Hk is a Hilbert space (not just Banach), this condition is also sufficient to guarantee µ-Pettis integrability. Proposition 6 (Finite measures embed into Hk iffHk is finitely integrable). Let µ be a finite R-valued measure (e.g., a probability measure), seen as a linear functional over L1(|µ|). Then a function ξ : X Hk is µ-Pettis integrable if and only if ξ(Hk) L1(|µ|). In particular, if µ = Q P, then the following claims hold. 1. Using ξ : x kx, it follows that Q is embeddable into Hk iffHk L1(Q). 2. If xi Hk exists, then via ξ : x xikx we obtain that xi Q embeds iff xi Hk L1(Q). Proof Since Hilbert spaces are canonically isomorphic to their dual (i.e., H = H), Gelfandintegration and Pettis-integration coincide. Therefore, Proposition 3.4 in Musia l (2002) which asserts that every scalarly µ-integrable function ξ : X Hk = H k is Gelfand µ-integrable concludes the first part. Then (1) follows directly from kx , h k = h(x). For (2), note that if xi Hk exists, then ξ : x 7 xikx Hk and h, xikx k = xih(x) by Lemma 4. 
Thus ξ = xi so xi Hk L1(Q) iffit is Gelfand Q-integrable, in which case i Qh = R ihd Q = R h, xikx kd Q(x) = h, R ξd Q k where R ξd Q is the Pettis integral. Hence i Q embeds into Hk. The embeddability of distribution in a Stein RKHS can be analysed in terms of the embeddability of the associated (via pull-back) Schwartz distributions in the base RKHS, as Lemma 2 shows, by generalizing (Simon-Gabriel and Sch olkopf, 2018, Prop. 14). Targeted Separation and Convergence with Kernel Discrepancies Lemma 2 (Embedding functionals on RKHS defined by feature maps). Let H be a Hilbert space, HK an RKHS of Rℓ-valued functions on X, and ξ : X B(Rℓ, H) be a feature map for K, i.e., K(x, y) = ξ (x) ξ(y). Then a linear functional D : HK R embeds into HK iffD ξ : H R embeds into H. Here ξ : H HK is the feature operator (see Definition 6). For any Q P, Q( k) = Q( ξ H) so Q is Bochner integrable in Hk iff ξ H L1(Q). Moreover, if D embeds into HK then the transpose of ξ is an isometry: D HK = D ξ H. In particular, if D = Q P and Hk L1(Q), then Q Hk = R ξ d Q H. See Appendix C.1 for the proof. Applying this result to a Stein RKHS, we immediately obtain the following corollary. Corollary 4 (Embedding measure in Stein RKHS and base RKHS). Consider a Stein kernel kp (5) with base kernel K, and fix any Q P. The following are equivalent: (a) Q embeds into Hkp. (b) Q embeds into HK via the feature map ξP : X HK with ξP(x) = Kxsp(x)+ x Kx. If either holds and P PK,0, then KSDK,P(Q) = R ξP d Q HK, where R ξP d Q is the Pettis integral. We now formally introduce D1 L1(Rd), the d-dimensional product space of finite measures and their distributional derivatives. Definition 8 (The space D1 L1(Rd)). We write D1 L1(Rd) to represent the vector space of continuous linear functionals on C1 0(Rd) or, equivalently, on C1 b (Rd)β and define D1 L1 D1 L1(R1). Notably, D D1 L1(Rd) iffit can be expressed as a finite sum D = Pl j=1 Djej where each Dj is a finite Radon measure on Rd or a distributional derivative thereof. The topology on D1 L1(Rd) is the canonical dual topology induced by C1 0(Rd) (Schwartz, 1978, pg. 200). Following Definition 7, when the elements of D1 L1(Rd) embed into HK, for instance when HK C1 b (Rd), we shall say that K is characteristic to D1 L1(Rd) when the kernel embedding ΦK : D1 L1(Rd) HK is injective. Importantly, for embeddable probability measures Q, the KSD is given by the norm of a vector DQ that can be understood as a distributional derivative of Q with respect to a differential operator induced by P. When Q is smooth DQ, will be a vector measure, Barp, Simon-Gabriel, Girolami, and Mackey but when Q is not assumed to be smooth, DQ will be a more general (vector) Schwartz distribution. The space D1 L1(Rd) assumes a central role in analysing DQ and determining when kernel discrepancies separate DQ from zero. This in turn helps us understand when KSDs effectively distinguish the target P from alternatives Q. To define DQ, note that since the feature operator of ξP in Corollary 4 is the Stein operator Sp, setting DQ|HK Q Sp : HK R we obtain that the KSD is given by evaluating the norm of DQ in the base RKHS, KSDK,P(Q) = DQ|HK HK. More generally, DQ can act on any function f C1 such that Sp(f) L1(Q), and we will omit |HK when we do not specify its domain of definition. In addition, observe that when sp is integrable with respect to the probability measure Q, then both si PQ and Q are finite measures. 
Consequently, using the distributional derivative, we can write DQ = P i(si p Q xi Q)ei P i Diei with Di D1 L1, the space of finite measures and their distributional derivatives. Hence, DQ is a (vector) Schwartz distribution that belongs to the space D1 L1(Rd). When Q also has a strictly positive differentiable density with respect to the Lebesgue measure, then DQ simplifies to a vector measure absolutely continuous with respect to the Lebesgue measure, DQ = P i(si p si q)Q ei. The following lemma provides bounds on Q( p kp) in terms of the base kernel K and the target score sp, and thus sufficient conditions for a probability measure to be able to Bochner integrate kp. See Appendix C.2 for the proof. Proposition 7 (Bochner embeddability vs. score integrability). Consider a Stein kernel kp (5) with base kernel K, and fix any Q P. We have kp) R ( Kx op sp(x) Rd + x Kx K) Q(dx), and sp(x), K(x, x)sp(x) Rd x Kx K Q(dx). Now suppose HK C1 b (Rd). Then the following claims hold. The maps x 7 Kx op = p K(x, x) and x 7 x Kx K are bounded. If Q( sp ) < , then kp is Bochner integrable by Q, i.e., Q( p If x 7 K(x, x) is uniformly positive definite (i.e., c > 0 such that for all v = 0 Rd, v T K(x, x)v c v 2 Rd > 0 for all x), then Q( p kp) < implies Q( sp ) < . Hence, if K is a diagonal kernel with each component satisfying infx ki(x, x) > 0, then Q( p kp) < iffQ( sp ) < , i.e., P kp = Psp. In particular, if K = k Id where k is translation-invariant (and not equal to the null function), then P Remark 8 (Scalar base kernel norms). When K = k Id, we have k(x, x), and p sp(x), K(x, x)sp(x) Rd = p k(x, x) sp(x) Rd. Targeted Separation and Convergence with Kernel Discrepancies C.1 Proof of Lemma 2: Embedding functionals on RKHS defined by feature maps By (Carmeli et al., 2010, Prop 1), ξ is a surjective partial isometry from H onto HK. Hence it is continuous, and ξ|ker ξT : ker ξT HK is an isometric isomorphism, where ker ξT is the orthogonal complement to the kernel of ξ. If D is continuous, so is D ξ since it is the composition of continuous maps. For the converse, note that ξ ( ξ|ker ξT ) 1 : HK HK is the identity so D = D ξ ( ξ|ker ξT ) 1, which is continuous if D ξ is. For the second claim, we apply the first claim to D Q|Hk. Noting that the RKHS embedding Φk(Q) Hk is the function x 7 Qkx, we have Q 2 Hk = Q(x 7 Qkx) = RR k(x, y)Q(dy)Q(dx) = RR ξ(x), ξ(y) HQ(dy)Q(dx) = RR ξ(ξ(x))(y)Q(dy)Q(dx) = R Q ξ(ξ(x))Q(dx) = R (Q ξ) , ξ(x) HQ(dx) = R ξ((Q ξ) )(x)Q(dx) = Q ξ 2 H, where as usual (Q ξ) denoted the embedding of Q ξ into H. To generalise the above to Q being any embeddable functional D: letting ξδ : X B(Rd, HK) denote the canonical feature map, ξδ(x) = K( , x), then since for any x, y X, c Rd (ξδ(y)c)(x) = ξ δ(x)ξδ(y)c = K(x, y)c = ξ (x)ξ(y)c = ξ(ξ(y)c)(x). Moreover Kei x = ξξ(x)ei and if S is embeds into H and ξf : X B(Rd, H) is a feature map for K, then ξf(S ) = ei S ξf( )ei, since ξf(S )(x) = ξ f(x)(S ) = ei(ξ f(x)(S ))i = ei ei, ξ f(x)(S ) = ei Sξf(x)ei. Hence D ξ 2 H = D ξ(D ξ) = Dei D ξ ξ( )ei = Dei DKei = DD = D 2 Hk. C.2 Proof of Proposition 7: Bochner embeddability vs. score integrability The fact that x 7 Kx op and x 7 x K K are bounded follows from Lemma 3: Lemma 3 (RKHS boundedness conditions). If HK is a RKHS of Rd-valued functions, then the following claims hold. (a) HK L (Rd) iffx 7 K(x, x) is bounded. (b) If xℓHK exists, then xℓHK L (Rd) iffx 7 xℓ yℓK(x, y) is bounded. (c) If xℓ yℓK exists, then xℓHK exists. (d) If K C (1,1) b (Rd), then HK C1 b (Rd). 
Barp, Simon-Gabriel, Girolami, and Mackey Proof (a) If HK L (Rd), proceeding as in Appendix P.1, we have K xh = h(x) h for any h HK, so the Banach Steinhaus Theorem implies supx K x = supx p K(x, x) is finite. Conversely, when x 7 K(x, x) is bounded, then h(x) = K xh h K K x h K p K(x, x) h K supx p K(x, x) . (b) Similarly, if HK is a RKHS of differentiable functions, then xℓHK is a RKHS with matrix-valued kernel (x, y) 7 xℓ yℓK(x, y) by Lemma 4. Thus, from above, xℓHK L (Rd) iffx 7 1ℓ 2ℓK(x, x) is bounded. (c) If 1ℓ 2ℓK exists, then the argument of Micheli and Glaunes (2013, p. 8 near Eq. (5)) shows ℓh exists for all h HK. (d) If K C (1,1) b (Rd), then HK C1(Rd) by Micheli and Glaunes (2013, Thm. 2.11), and by above HK Cb(Rd). Proceeding as above, we have of any v Rd, | v, ph(x) | = | h, p 2Kv(., x) | h K p 2Kv(., x) K = h K p v T p 1 p 2K(x, x)v which is bounded in x, and thus ph is bounded. Now, by definition, since ξP is a feature map for kp, ξP(x), ξP(x) Kd Q = Q( ξP K). Recall ξP(x) = Kxsp(x) + x K. By the triangle inequalities R | Kxsp(x) K x K K| Q(dx) Q( ξP K) R ( Kxsp(x) K + x K K) Q(dx). The result follows by continuity of Kx = δ x B(Rd, HK), and the assumptions on K: Kx op sp(x) Rd δ xsp(x) K = q δ xsp(x), δ xsp(x) K = q sp(x), K(x, x)sp(x) Rd. C.3 Proof of Proposition 5: Score vs. Bochner embeddability The result follows by Proposition 7. C.4 Proof of Proposition 1: Embeddability conditions k PHk follows directly from the embeddability of the Dirac measures: P|h| = P|ξ δ(h)| h k P ξ δ op = h k P k, where ξ δ : x 7 δx|Hk is the canonical feature map. When Hk is separable, then by Carmeli et al. (2006, Cor. 4.3 and Prop. 4.4) a sufficient condition for Q to embed is |k| L1(Q Q). Note that Q P k implies |k| L1(Q Q), as RR |k(x, y)|d Q(x)d Q(y) = RR | kx, ky k|d Q(x)d Q(y) RR kx k ky kd Q(x)d Q(y) = RR p k(y, y)d Q(x)d Q(y) = (Q k)2. The following example is adapted from (Berlinet and Thomas-Agnan, 2004, p.204). Take k(i, j) = I [i = j] for i, j N , i.e. Hk = ℓ2(N ), and consider the Radon measure µ(i) = 1/i. Then it is easy to see that µ is Pettis-embeddable (and satisfies |µ| |µ|(k) < ), but that it is not Bochner-embeddable into Hk, since |µ|( C.5 Proof of Theorem 1: KSD as MMD First let us show the following differential reproducing property, which is a mild generalization of results in (Steinwart and Christmann, 2008; Zhou, 2008; Micheli and Glaunes, 2013). Targeted Separation and Convergence with Kernel Discrepancies In contrast to the results provided in these references, Lemma 4 does not assume continuity of the derivatives. We note that the proof is essentially identical to the continuous derivative case in Micheli and Glaunes (2013, Thm. 2.11) (which also deals with matrix-valued kernels). This generalized form is important for our results as it allows us to establish that MMD and KSD coincide under no additional conditions than those required for them to be well-defined. Lemma 4 (Differential reproducing property). Suppose that xℓHK exists. Then for all c Rd and h HK xℓK( , x)c, h K = c, xℓh(x) . Moreover xℓHK is a RKHS with kernel (x, y) 7 xℓ yℓK(x, y). Proof Given a sequence ϵn > 0 with ϵn 0, define ϵn (K( , x + ϵneℓ)c K( , x)c) /ϵn HK. Since K( , x)c HK its partial derivative in direction eℓexists, and thus K(y, )c also has a partial derivative in direction eℓ, hence ϵn converges pointwise to xℓK( , x)c. Moreover, for any h HK, ϵn, h K = c, (h(x + ϵneℓ) h(x))/ϵn converges to c, xℓh(x) as ϵn 0 (since xℓh exists). 
By the Banach Steinhaus theorem { ϵn}n is thus a bounded subset of HK, and by Micheli and Glaunes (2013, Cor. 2.8) it follows that xℓK( , x)c HK and c, xℓh(x) = limn ϵn, h K = xℓK( , x)c, h K for all h HK. We now show ξ xℓ: X B(HK, Rd), with ξ (x) xℓ|x, is a feature map with associated kernel yℓ xℓK. By above ei, ξ (x) is continuous for all x, i, so indeed ξ is B(HK, Rd)-valued. Note ξ = xℓ, so by Carmeli et al. (2006, Prop. 2.4) xℓHK is a RKHS with kernel K s.t., K(x, y)c = ξ (x)ξ(y)c = xℓ|x yℓK( , y)c = xℓ yℓK(x, y)c. Now we apply Lemma 4 to the RKHS p HK whose kernel is p(x)K(x, y)p(y) (see Appendix B), in order to show that ξp : X HK defined by ξp(x) 1 p(x) x (p K) 1 p(x) P i xi (p K( , x)ei) is a feature map for kp. Indeed by Lemma 4, and using (Carmeli et al., 2010, Prop. 1) to relate the inner products of K and p Kp, Sp(h)(x) = P i 1 p(x) xi(phi) = P i 1 p(x) xi(p( )K( , x)p(x)ei), ph p Kp = P i 1 p(x) xi(K( , x)p(x)ei), h K = P i 1 p(x) xi(K( , x)pei), h K. This shows Sp is the feature operator associated to ξp, and thus Hkp Sp(HK) is a RKHS with kernel kp(x, y) ξp(x), ξp(y) K (Carmeli et al., 2010, Prop. 1). The following lemma Lemma 5 concludes. Lemma 5 (Feature operators preserve unit balls). Suppose ξ : HK HK is a feature operator. Then ξ(BK) = BK. Barp, Simon-Gabriel, Girolami, and Mackey Proof Recall ξ is a surjective partial isometry (Carmeli et al., 2010, Prop. 1), in particular ξ(h) K h K, so ξ(BK) BK. Moreover, since ξ|A is an isometric isomorphism from the orthogonal complement of its kernel A ker ξ onto HK, it follows that for any g BK there exists g A s.t., 1 g K = h K, which concludes. C.6 Proof of Proposition 3: Stein embeddability conditions The result is an immediate consequence of Proposition 7 and the following proposition. Proposition 8 (Stein RKHS with vanishing P-expectations). Suppose Hkp and HK C1 are subsets of L1(P). Then Ph = 0 for all h Hkp. Proof The result follows from Pigola and Setti (2014, Thm. 2.36), after observing that the distributional and usual derivatives of C1 functions coincide. Appendix D. Proof of Theorem 2: Bochner P-separation with MMDs In the proof we will use the fact that if we define the tilted reproducing kernel k(x, y) k(x,y)/(1+ k(y)), whose RKHS is Hk/1+ k, we have the following immediate relation between the MMDs of k and k, which, for instance, may be used to generalize some results from bounded to unbounded kernels: Proposition 9 (Kernel tilting). With the notation above MMDk(Qn, P) = MMD k( Qn, P), where Q (1 + k)Q for any Q P Qnf Pf for any f C k Qng Pg for any g Cb. The rationale for the above proposition is that the map f 7 (1+ k)f is a vector space isomorphism from Cb (resp. C0) to C k (resp. C0, k), which induces TVS and isometric isomorphisms once appropriate topologies have been introduced. In that case the map Q 7 Q identifies C k (resp. C 0, k) with C b (resp. C 0). Coming back to the proof of Theorem 2, note the kernel k needs to be characteristic to P P k, since if it was not, there would be a measure Q P k not equal to P such that Q P k = 0, hence (Qn) (Q) would satisfy (a), and (b) since every distribution is tight on a Radon space, while (c) holds since (1 + k)Q is a finite measure; yet (Qn) does not converge weakly to P in P k, since it does not converge weakly in P as a result of the fact Cb is a separating set on any metrisable space. Conversely, let us assume that k is characteristic to P P k. We will show that, given (c), (a)-(b) is equivalent to (usual!) weak convergence in P. 
So, applying Lemma 5.1.7 of Ambrosio et al. (2005) to every f C k, gives the equivalence in (6) and concludes. Intuitively speaking, (c) lifts weak convergence in P to weak convergence in P Targeted Separation and Convergence with Kernel Discrepancies Assume (a)-(c). By (b) any subsequence of (Qn) is tight, so, by Prokhorov s theorem Ambrosio et al. (2005, Thm. 5.1.3), it is relatively compact in P (equipped with the weak topology) and thus contains yet another subsequence (Pl) that converges weakly in P to some probability distribution P . Since 1 + k is continuous, using condition (c) and (5.1.23b) in Ambrosio et al. (2005, Lem. 5.1.7) further implies that P P k. Moreover, by (a) and continuity of the inner product, P , f k = liml Pl , f k = P , f k for any f Hk. So, by the Pettis property, the embeddings of P and P coincide: Φk(P ) = Φk(P). Since k is characteristic to P P k, we get P = P. So we have shown that, out of any subsequence of (Qn), we can extract a (sub)subsequence that converges weakly to P in P. By a classical argument, the original sequence (Qn)n thus converges weakly to P in P. For the converse we essentially rely on Proposition 9. Note that if we assume that Qn P in P, then, by Lemma 5.1.7 of Ambrosio et al. (2005), (c) is equivalent to weak convergence in P k. Define the measures P and Qn as P(A) R k) d P and Qn(A) R k) d Qn for any measurable Borel set A X, and let k(x, y) k(x,y)/(1+ k(y)). By Le Cam (1957) since X is Radon, weak convergence in P implies tightness, i.e. (b). Since Qn(f) P(f) for any f C k can be re-written as Qn(g) P(g) for any g Cb, it follows that ( Qn) converges weakly to P (in the usual sense). Moreover, (c) also shows that Qn and P are finite (non-negative) measures. So we can apply Prop. 2.3.3 of Berg et al. (1984), which says that the tensor product of finite, non-negative measures is weakly continuous, and get MMDk(Qn, P)2 = (Qn P) (Qn P)(k) = ( Qn P) ( Qn P)( k) = P P( k) 2 Qn P( k) + Qn Qn( k) 0. D.1 Weak convergence in P k and Wasserstein metric The following proposition first gives an alternative characterization of C k and weak convergence in P k. See also (Kanagawa et al., 2022, Section 3.1). Proposition 10 (Wasserstein vs C k convergence). Let k be a continuous strictly positive definite kernel over a separable metric space X. Let dk(x, y) δx δy k be the metric induced by k over X. Then C k is the set of functions with 1-growth and P k the probability measures with finite first-order moments, both w.r.t. the metric dk. Let W 1 dk denote the Wasserstein-1 distance over P k w.r.t. dk. Then (X, dk) is a separable metric space, and W 1 dk metrizes weak convergence in P k, i.e., for Pn, P P k W 1 dk(Pn, P) 0. Moreover (P k, W 1 dk) is complete whenever (X, dk) is. Note that (X, dk) is complete whenever dk is stronger than the original metric d (i.e. whenever there exists C > 0 such that d(x, y) Cdk(x, y)), which is for example the case when X = Rd equipped with its usual Euclidian metric, and k is polynomial kernel of order 1. Proof We will prove that there exists constants C, C > 0 such that k(x, x)) 1 + dk(x, x0) C (1 + k(x, x)) (9) Barp, Simon-Gabriel, Girolami, and Mackey for all x, x0 X. This shows that C k is indeed the set of functions with 1-growth for the metric dk in the sense of (5.1.21) in Ambrosio et al. (2005); and that P k is indeed the set of probability measures P with finite first-order moments, i.e. such that for an arbitrary (and then any) x0 X, P(dk(., x0)) < . 
The space (X, dk) is separable whenever X is (independently of characteristicness), because, since k is continuous, dk is also continuous, so the topology defined by dk is weaker than the original one. Finally, when k is characteristic over P k, then dk becomes a metric (i.e. additionally satisfies dk(x, y) = 0 iffx = y). So Theorem 7.1.5 of Ambrosio et al. (2005) concludes on the completeness condition, and on the equivalence between weak convergence in P k and Wasserstein-1 convergence W 1 dk. We know prove (9). Let k0 k(x0, x0) and kx k(x, x). First, notice that dk(x, x0)2 = k0 2k(x, x0) + kx k0 + 2 kx + kx = ( Therefore 1+dk(x, x0) 1+| kx) with C 1+ k0. Conversely and similarly, dk(x, x0)2 = ( k(x, x))2. Therefore 1 + dk(x, x0) 1 + | k0| 1 + max( k0, 0) C(1 + with C = 1/(1 + Appendix E. Proof of Theorem 3: Score P-separation with KSDs When sp is finitely integrable with respect to a probability measure Q, then both si p Q and Q are finite measures, so DQ = P i(si p Q xi Q)ei P i Diei where Di D1 L1 and thus DQ D1 L1(Rd). Using Corollary 4 KSDK,P(Q) = R ξPQ HK = DQ HK. Hence KSDK,P(Q) = 0 iffDQ = 0. Now since the matrix kernel K is characteristic to D1 L1(Rd), and DP = 0 by the divergence theorem, we finally obtain KSDK,P(Q) = 0 iff DQ = 0 iffQ = P. Appendix F. Proof of Theorem 14: Characteristicness of transformed kernel Theorem 14 (Characteristicness of transformed kernel). Let φ : E F be a linear continuous map that restricts to a feature operator φ : HK HK. Suppose the following diagram commutes, where all maps are continuous Targeted Separation and Convergence with Kernel Discrepancies If φ(E) is dense in F, then K is characteristic to F when K is characteristic to E . Proof Taking the transpose of the commutative diagram φ ιK = ιK φ : HK F yields ι K φ = φ ι We want to show ι K injective given that ι K is injective. Note that φ ι K injective implies ι K injective, so it is sufficient to show ι K φ injective. Hence it is sufficient to show φ : F E injective, which is equivalent to φ(E) dense in F by Treves (1967, Chap 18 Cor. 5). Concretely, it will usually suffice to verify that the image φ(E) contains the smooth compactly supported functions, since these typically form a dense subset of F. F.1 Proof of Proposition 4: Preserving characteristicness The result follows from our general theorem on characteristicness-preserving transformations Theorem 14. For the first claim, note that φ : f 7 af is a continuous map from C1 b (Rd)β to itself. Moreover φ(C1 c(Rd)) = C1 c(Rd) since f/a C1 c(Rd) for all f C1 c(Rd), and C1 c(Rd) is dense in the predual C1 b (Rd)β of D1 L1(Rd). Recall the family of semi-norms defining the C1 b (Rd)β are parametrized by γ C0 and have the form f 7 γf , f 7 γ f . We first show that φ : f 7 f b is continuous from C1 b (Rd)β to itself. Note that if γ C0 then γ b 1 C0, since for any ϵ > 0 we have b γ 1 1[ϵ, ) = b γ 1 1[ϵ, ) because b is a homeomorhism, and the resulting set is compact since γ 1 1[ϵ, ) is compact (because γ is C0) and b preserves compactness by continuity. Thus γf b = γ b 1 b f b = γ b 1f which is a semi-norm in C1 b (Rd)β. The same proof works for the family of semi-norms γ (f b) once we have observed that (f b) = f b b (where denotes matrix multiplication). Indeed γ (f b) = γ f b b = γ b 1 f b b 1 b γ b 1 f since b is bounded as b is Lipschitz. Thus φ is continuous. It remains to show that φ(C1 c(Rd)) = C1 c(Rd), which follows from the fact that f C1 c(Rd) implies f b 1 C1 c(Rd) since supp(f b 1) = b f 1({0}c) = b f 1({0}c) which is compact. 
Finally, (c) follows from the more general statement Proposition 11 (Scalar vs. vector characteristicness). Consider a matrix-valued kernel K with HK , Fd for some topological vector space F. Then K is universal to Fd iffKii is universal to F for all i. Proof Recall that h HK iffhi HKii for all i. Note (f1, . . . , fd) Fd ifffi F for any i [d], and hi n fi in F for all i iff(h1 n, . . . , hd n) (f1, . . . , fd) in Fd. Barp, Simon-Gabriel, Girolami, and Mackey Appendix G. Proof of Theorem 4: L2-ISPD conditions (a) Let ˆκjdx be the Bochner measure of κj(x y) kj(x, y). Then ˆκjdx, has full support, i.e., supp ˆκjdx = Rd, since this is equivalent to the characteristicness of κj (Simon-Gabriel and Sch olkopf, 2018, Thm.17). Moreover ˆκj L2 since κj L2. Since Hkj L2, then any measure of the form fdx, with f L2 embeds into Hkj by Proposition 6. Then, if g L2(Rd), using Barp et al. (2019, Appendix 4) ΦK(gdx) 2 K = P i ei, ΦK(gdx) 2 ki = P i Φki(gidx) 2 ki = P i RR gi(x)κi(x y)gi(y)dxdy. Moreover, since Plancherel theorem and the convolution theorem are valid for L2 functions (Schwartz, 1978, Remarque pg. 270), using the fact that gj, κj and κ gj are in L2 by Carmeli et al. (2006, Prop. 4.4), then RR gi(x)κi(x y)gi(y)dydx = R gi(x)κi gi(x)dx = R ˆgi(w)ˆκi(w)ˆgi(w)dw = R |ˆgi(w)|2ˆκi(w)dw. Hence, whenever gdx is non-zero, i.e., g L2 > 0, then ΦK(gdx) 2 K = P i R |ˆgi(w)|2ˆκi(w)dw > 0 since ˆκi(w)dw is a fully supported non-negative measure. (b) This follows directly by the definition of ISPD, together with the fact htat A L2(Rd) L2(Rd) by boundedness, and that if Ag L2(Rd) = 0 then Ag = 0 a.e. so g = 0 a.e. and thus g L2(Rd) = 0. (c) Let us first discuss the scalar (and matrix diagonal case), i.e., we show that for a scalar reproducing kernel k, if Hk is separable, supx kx L1 < , and kx L2 for each x, then Hk Id L2(Rd). Write k f(x) (k f)(x) R k(x, y)f(y)dy when the integral is well-defined. Note, for each y, k ky L1, since k ky L1 = R |k ky|(x)dx = R | R k(x, z)k(z, y)dz|dx = RR |k(x, z)k(z, y)|dxdz = R |k(z, y)| kz L1dz ky L1 supz kz L1 < . Note the integral swap is justified by Fubini s theorem. Moreover, supy k ky L1 supz kz 2 L1. Now, if f L2, then RR |f(x)2k kx(z)|dzdx = R f(x)2 k kx L1dx supx k kx L1 f L2, so by Fubini RR |f(x)2k kx(z)|dxdz < and thus R |f(x)2k kx( )|dx is finite a.e., i..e, p |f2k kz| L2 for almost all z. It follows that z 7 p |f2k kz| L2 L2, since R p |f2k kz| L2dz = RR |f(x)2k kx(z)|dzdx, and similarly z 7 |f(z)| p |k kz| L2 L2. Hence, R R |f(x)k kz(x)f(z)|dxdz < since fk kz L1 (being the product of the L2 functions p |f2k kz| and p |k kz|) and RR |f(x)k kz(x)f(z)|dxdz R p |f2k kz| L2|f(z)| p |k kz| L2 dz < . Targeted Separation and Convergence with Kernel Discrepancies k f 2 L2 = R (k f)2(y)dy = RR k(y, x)f(x)dx R k(y, z)f(z)dzdy = RR f(x)f(z)k kz(x)dxdz RR |f(x)f(z)k kz(x)|dxdz < . It follows by (Carmeli et al., 2006, Prop. 4.4) that Hk L2. For a general matrix-valued kernel K, set Gij(x, z) R ei, K(x, y)K(y, z)ej dy = P l R Kil(x, y)Klj(y, z)dy. Note Gij(x, z) = Gji(z, x), and set Gz ij Gij( , z). Now for f L2(Rd), if RR |fi(x)fj(z)Gz ij(x)|dxdz < we have K f 2 L2(Rd) = R K f 2(y)dy = P l R | el, K f |2(y)dy = P l R el, K f |(y) el, K f |(y)dy = P lij RRR Kli(y, x)fi(x)Klj(y, z)fj(z)dxdzdy = P lij RRR Kil(x, y)fi(x)Klj(y, z)fj(z)dxdzdy = P ij RR fi(x)fj(z)Gz ij(x)dxdz. We can now proceed as in the scalar case with Gz ij taking the place of k kz. Indeed Gz ij L1 Kz ij L1 sup x Kx ij L1, and note |Kz ij| = | ei, Kzej ei Kzej so Kz ij L1 ei Kzej L1(Rd). 
(d) Note that if K is a matrix-valued kernel, then |Kij(x, y)| = | ei, ξ(x) ξ(y)ej | ξ(x)ei ξ(y)ej = p Kii(x, x) p Kjj(y, y) (for any feature map ξ), so if K is translationinvariant then K is bounded. In general, if Kxu is both finitely-integrable and bounded, then Kxu L2(Rd) Kxu L1(Rd) Kxu L (Rd). Appendix H. Proof of Theorem 5: L2 P-separation with KSDs (a) Let us first show that under the assumptions of the result, the alternative definition of KSD used in Liu et al. (2016) is equivalent to ours. By Lemma 2, since Q embeds we have KSDK,P(Q) = DQ K. Moreover since Q PK,0, we know TQ Q Sq embeds to zero in HK, ΦK(TQ) = 0, and for any h HK, since Sp(h) and Sq(h) are finitely Q-integrable, we have DQ TQ(h) = QSp(h) QSq(h) = Q(Sp(h) Sq(h)) = (sp sq)Q(h). Hence KSD2 K,P(Q) = DQ 2 K = DQ TQ 2 K = (sp sq)Q 2 K = RR sp(y) sq(y), K(y, x)(sp(x) sq(x)) d Q(y)d Q(x). Now, recall K is L2(Rd) ISPD iffit is characteristic to L2(Rd) (Simon-Gabriel and Sch olkopf, 2018, Thm. 6), so the result follows from KSDK,P(Q) = MMDK((sp sq)Q, 0), and (sp sq)Q L2(Rd). (b) We want to show that HK and Sq(HK) are subsets of L1(Q) in order to apply (g) in Proposition 3. By assumption HK L (Rd) L1(Q), and Sp(HK) L1(Q). Note Sq(h) = Sp(h) + sq sp, h , so we have for any h K Q|Sq(h)| = Q|Sp(h) + sq sp, h | Q|Sp(h)| + Q| sq sp, h |. Barp, Simon-Gabriel, Girolami, and Mackey Moreover Q| sq sp, h | = q(sq sp), h L1(Rd) q(sq sp) L2(Rd) h L2(Rd), which concludes. (c) For any h HK we have Q|Sp(h)| P i Q| ihi| + Q| sp, h | P i ihi + qsp L2(Rd) h L2(Rd) < . Appendix I. Fourier Transforms If µ is a finite measure, its Fourier transform is ˆµ(x) R e ix T wdµ(w), which is a positive definite function when µ is a non-negative measure. More generally, if T is a tempered distribution (a.k.a. slowly increasing distribution, see Treves 1967, Chap.25), i.e., an element of the dual of the Schwartz space (a.k.a. space of rapidly decaying functions, see Treves 1967, Chap.10, Example IV), we define its distributional Fourier transform ˆT by ˆT(γ) T(ˆγ) , for any function γ in the Schwartz space. In particular if T = Φ(x)dx, with Φ continuous and slowly increasing, and there exists f L2 loc(Rd/{0}) such that ˆT = fdx, then f is known as the generalized Fourier transform (of order 0) of Φ, denoted ˆΦ (Wendland, 2004, Def. 8.9). The above formula then reads R ˆΦ(x)γ(x)dx = R ˆγ(x)Φ(x)dx. Appendix J. Proof of Theorem 6: Controlling tight convergence with bounded separation Since k is continuous, H C, and since h Hb Cb is integrable by any finite measure, we have (with 0/0 = 0) |Qh Ph| Q h h k P h h k h k suph Bk L1(Q) |Qh Ph| h k MMDk(Q, P) h k. Hence if MMDk(Qn, P) 0 then Qnh Ph for all h Hb. Taking Qn = Q for all n, the P-bounded separating assumption implies that k is characteristic to P P. On the other hand, if (Qn) is a tight sequence, then it is sequentially compact, i.e., any subsequence has a Cb-convergent subsequence, whose weak limit must be P by P-characteristicness (Ethier and Kurtz, 2009, Lemma 4.3), which in turn implies that Qn Cb P (see also proof of Theorem 2: Bochner P-separation with MMDs for additional details). Hence k controls tight convergence. Appendix K. Proof of Theorem 7: Controlling tight convergence with bounded Stein kernels By Lemma 11 the shifted kernel K(x, y)/θ(x)θ(y) is universal to (i.e., dense in) C1 b,θ(Rd). Moreover, by assumption on the score growth and base kernel the associated Stein RKHS consists of bounded functions. 
The following lemma concludes: Lemma 6 (Controlling tight convergence with bounded Stein kernels). Suppose a matrixvalued kernel K with HK C1 b (Rd) is characteristic to C1 b,θ(Rd) β. If P PK,0 and kp is bounded, then kp is P-separating and controls tight P-convergence. Moreover, kp is bounded iffx 7 p sp(x), K(x, x)sp(x) is bounded. Targeted Separation and Convergence with Kernel Discrepancies Proof We apply Proposition 12 with θ(x) sp(x) + 1. Note that if sp(x) HK is bounded, then HK C1 b,θ(Rd). Moreover sp(x) HK = H K where K(x, y) sp(x) K(x, y) sp(y) , and H K is a RKHS of bounded functions iffx 7 sp(x) p K(x, x) is bounded by Lemma 3. Moreover, since kp consists of bounded functions, it then follows by Theorem 6 that when k P is P-separating then it controls tight convergence to P. Appendix L. Proof of Theorem 8: Translation-invariant kernels have rapidly decreasing sub-RKHSes We will use the following result proved in Appendix L.1. Lemma 7 (Convolution decay bound). Fix any u, v L1 and any subadditive function ρ satisfying |u(x)| U(ρ(x)) and |v(x)| V (ρ(x)) for non-increasing U, V with U ρ, V ρ finitely integrable. Then u v (x) inf α [0,1] u L1V (αρ(x)) + v L1U((1 α)ρ(x)) Let us quote Bochner s theorem (Wendland, 2004, Thm. 6.6) (we refer to Appendix I for definitions of Fourier transforms). Theorem 15 (Bochner s theorem). A continuous R-valued function on Rd is positive definite if and only if it is the Fourier transform of a non-negative finite measure. Moreover we will use the following lemma, which follows by combining Lemma 4 with the fact that vector-valued RKHSes of continuous functions have locally bounded kernels (Carmeli et al., 2006, Prop. 5.1), or by the closed graph theorem as shown Appendix L.2. Lemma 8 (Continuity of RKHS inclusion). Let F be a complete metrizable TVS, continuously included in the space of functions X Rd. Then HK F implies HK , F. In particular Cs(Rd) is a complete metrizable TVS. We now show k is characteristic to D1 L1. Indeed, since Hk C1, then Hk , C1, so we know i i+dk exists and is separately continuous for all i by Micheli and Glaunes (2013, Thm. 2.11). Since i i+dk is translation-invariant, it is further continuous. Thus k(x, y) is C(1,1), and is characteristic to D1 L1 by Simon-Gabriel and Sch olkopf (2018, Thm. 17). Let κ(x) k(x, 0). Now, given the spectral density ˆκ : Rd [0, ], we define the ironed radial kernel κiron on Rd by ˆκiron(y) inf w Rd: 0 w y ˆκ(w) and show it is characteristic to D1 L1(Rd). Note ˆκiron : Rd [0, ], and is finite-valued except at the origin (since ˆκ is). Since ˆκiron ˆκ, ˆκ L1(Rd), > R ˆκ(w)dw R ˆκiron(w)dw = ˆκiron L1(Rd), so ˆκiron(w)dw is a finite non-negative measure whose Fourier transform defines Barp, Simon-Gabriel, Girolami, and Mackey a continuous positive definite (radial) kernel κiron by Theorem 15. Moreover ˆκiron is strictly positive since ˆκ is bounded away from zero on the compact set Br for any r 0. In particular κiron is characteristic to D0 L1 (Simon-Gabriel and Sch olkopf, 2018, Thm. 17). Moreover, by Steinwart and Christmann (2008, Sec. 4.3) xi yik(x, y) is a continuous translation-invariant kernel, and xi yik(x, y) = 2 i κ(x y). This implies that R w 2ˆκ(w)dw < . Indeed \ ( 2 i κ)(w) = w2 i ˆκ(w) (Treves, 1967, Thm. 25.7), and by Theorem 15 the generalized Fourier transform of a continuous translation-invariant kernel is integrable, i.e., w 7 w2 i ˆκ(w) L1(Rd). 
From κ κiron, it follows that w 7 w 2ˆκiron(w) L1(Rd), and thus κiron C2, since by Leibniz integral rule the second partial derivatives exist and are continuous. Hence ki is characteristic to D1 L1 by Simon Gabriel and Sch olkopf (2018, Thm. 17). Now define a radial kernel κs on Rd by ˆκs(x) = ˆκiron(2x) which is strictly positive so also characteristic to D0 L1 (more generally we can compose ˆκiron with any homeomorphism, as their preimage commutes with closure), and κs C2 so it characteristic to D1 L1 (by an argument analogous to that of the previous paragraph). We now discuss a general mechanism to construct a Schwartz function f that is strictly positive, and has a non-negative Fourier transform with compact support (which even makes f an entire function). Later on we will apply this construction to obtain a particular f with root exponential decay. Let us first choose a function g C that is non-negative and compactly supported (we will identify an explicit choice of such a function later in the proof). Then we set f ˆG ˆG where G g g. Note that G is non-negative, smooth and compactly supported with non-negative Fourier transform since by the convolution theorem ˆG = (ˆg)2. Moreover the Schwartz s Paley Wiener theorem (Treves, 1967, Thm. 29.1) then asserts that its Fourier transform (more precisely, its real part restricted to real inputs, i.e., complex numbers with zero imaginary part) ˆG is an indefinitely (real) differentiable function that decays faster than any polynomial, that is for all positive integer m we have constant Cm > 0 such that | ˆG(x)| Cm (1+ x )m . Hence so does f by Lemma 7 applied with U, V of the form r 7 (1 + r) m. Moreover ˆf(x) = (G( x))2 which is non-negative with compact support, and f is strictly positive since ˆG (viewed as function of arbitrary complex variables) is entire by the Schwartz s Paley Wiener theorem, and thus holomorphic, and thus has finitely many isolated zeros, the set of which has Lebesgue-measure zero - hence f is the integral of an almost-everywhere strictly positive function. Moreover, the derivative of f decays faster than any polynomial. Indeed, since f = ˆG ˆG, where ˆG is a smooth function decaying faster than any polynomial, Leibniz integral rule yields xjf = ( xj ˆG) ˆG. By Wendland (2004, Thm. 5.16 (6)), xj ˆG(x) = \ ( iyj G(y))(x), and since y 7 iyj G(y) is smooth and compactly supported, Schwartz s Paley Wiener theorem implies that its Fourier transform will decay faster than any polynomial. The convolution bound Lemma 7 then implies that xj ˆG ˆG will also decay faster than any polynomial. The above argument can then be iterated to show that f belongs to the Schwartz space. If in the above we specifically choose the function g = ψ ψ with ψ(x) ϕ(x1) ϕ(xd) where ϕ(x1) exp( 1/(1 c2|x1|2))I |x1| < 1 for some c > 0, then ˆψ(x) = ˆϕ(x1) ˆϕ(xd), which implies | ˆψ(x)| = O(e c P i |xi|) by Johnson (2018). It follows that |ˆg(x)| = O(e 2c P i |xi|), and thus ˆG = (ˆg)2 = O(e 4c P i |xi|), so f = ˆG ˆG = O(e 2c P i |xi|) by the convolution bound Lemma 7 with α = 1/2 and the subadditive function ρ(x) = P i p Targeted Separation and Convergence with Kernel Discrepancies Similarly, | xj ˆG| = 2|ˆg xj ˆg| 2 xj ˆg |ˆg|, and xj ˆg is bounded by the Schwartz s Paley Wiener theorem as above, so xj ˆG = O(e 2c P i |xi|), so xjf = ( xj ˆG) ˆG = O(e c P i |xi|) by the convolution bound Lemma 7 with α = 1/2 and the subadditive function ρ(x) = P i p |xi|. 
Now define kf(x, y) f(x)ks(x, y)f(y), and we will show that Hkf Hk, by leveraging the translation-invariance of the kernel kfs(x, y) κs(x y)f(x y), which is a kernel since f is a positive definite function (since its Fourier transform is a positive function) and thus defines a reproducing kernel. Then Hkf Hkfs. Indeed, the former RKHS is simply the set of functions f Hks (Paulsen and Raghupathi, 2016, Prop. 6.2). On the other hand, Aronszajn (1950, Thm. II Sec. 8) implies the latter product RKHS Hkfs = Hks Hf consists of the functions in the tensor product RKHS Hks Hf restricted to the diagonal set {(x, x)} Rd Rd, while Berlinet and Thomas-Agnan (2004, Thm. 13) shows the tensor product RKHS contains the functions of the form (x, y) 7 g(x)h(y) where g Hks and h Hf (i.e., it is the pullback RKHS defined by the diagonal map x 7 (x, x)), and hence Hkfs contains all the functions of the form x 7 g(x)h(x). Restricting to h(x) = f(x 0) Hf yields the subset inclusion Hkf Hkfs, which is moreover a continuous inclusion Hkf , Hkfs because inclusions of RKHS are always continuous (Schwartz, 1964, Prop. 2). Hence to show that Hkf is a subset of Hk, it is sufficient to show that Hkfs Hk. But, conveniently, since kfs is translation invariant, we can now apply Lemma 9, proved in Appendix L.3, to verify this inclusion. Lemma 9 (RKHS inclusion of product RKHS). Let k, k2 be kernels and denote the convolution operator. The following claims hold true. (a) If there exists a λ 0 for which λk kk2 is a kernel, then h Hk Hk for any h Hk2. (b) Suppose k, k2 are continuous translation invariant kernels. By Bochner s theorem (Theorem 15), such kernels are the Fourier transform of some finite positive measures which we will call µ and ν. If µ ν µ and the density dµ ν dµ belongs to L (µ), then h Hk Hk for any h Hk2. (c) Moreover, if µ (resp. ν) above is equivalent to (resp. absolutely continuous with respect to) the Lebesgue measure on Rd, with density qµ (resp. qν), then h Hk Hk for any h Hk2 if qµ qν/qµ L (Rd). (d) Similarly, if f : Rd R is a continuous positive definite function with generalized Fourier transform ˆf, and qµ ˆf/qµ L (Rd), then h Hk Hk for any h Hk2, where k2(x, y) f(x y). Since ˆk is strictly positive, the result states that Hkfs Hk iffˆκfs/ˆκ L , which, by the convolution theorem, can be written as ˆκs ˆf/ˆκ L . To show this we will use Lemma 7 to obtain the convolution bound |( bκs ˆf)(w)| bκs L1U(1 2 w ) + ˆf L1V (1 where U and V are non-increasing functions that upper bound ˆf and ˆκs respectively. We let U be the envelope above f, U(r) sup{ ˆf(w) : w r}, and since by construction ˆκs Barp, Simon-Gabriel, Girolami, and Mackey is non-increasing, we can set V (r) ˆκs(r). Thus |( bκs ˆf)/ˆκ(w)| bκs L1U(1 2 w )/ˆκ(w) + ˆf L1ˆκs(1 2 w )/ˆκ(w) . By construction, for any w Rd we have ˆκs(1 2w) = ˆκiron(w) ˆκ(w), and thus ˆκs(1 2 )/ˆκ( ) L . Moreover ˆf has compact support, and thus so does U, hence |( bκs ˆf)/ˆκ| L and Hkfs Hk as claimed. Moreover Hkf , Hk (Schwartz, 1964, Prop. 2). L.1 Proof of Lemma 7: Convolution decay bound Fix any α [0, 1] and let Sx {y Rd : ρ(x y) αρ(x)}. On this set ρ(y) (1 α)ρ(x) by subadditivity. Now u v(x) = R u(y)v(x y)dy = R Sx u(y)v(x y)dy + R Scx u(y)v(x y)dy. Sx |u(y)v(x y)|dy R Sx U(ρ(y))|v(x y)|dy R Sx U((1 α)ρ(x))|v(x y)|dy U((1 α)ρ(x)) v L1. On the other hand Scx |u(y)v(x y)|dy R Scx |u(y)|V (ρ(x y))dy R Scx |u(y)|V (αρ(x))dy V (αρ(x)) u L1. L.2 Proof of Lemma 8: Continuity of RKHS inclusion Consider a convergent sequence (hn, hn) (h, f) in HK F. 
Since HK and F are continuously included in the space of functions X Rd, (hn) converges pointwise to both h and f, hence h = f HK, and the graph of ι : HK F is closed. Thus (Treves, 1967, Cor. 4, Chap. 17) implies it is continuous, since the product of metrizable (resp. complete) TVS is metrizable (resp. complete). In particular Cs(Rd) is a complete metrizable space by (Treves, 1967, Ex. 1 Chap. 10). L.3 Proof of Lemma 9: RKHS inclusion of product RKHS The first result follows by the characterization of Aronszajn (1950), once we note that the product kernel (x, y) 7 k(x, y)k2(x, y) contains the functions of the form x 7 h(x)f(x) with h Hk2 and f Hk, since it is the pullback under the diagonal map x 7 (x, x) of the tensor product kernel k k2, and the latter is the completion of the inner product space of functions (x, y) 7 h(x)f(y). The second and third result follow by Zhang and Zhao (2013, Prop. 3.1) and the convolution theorem, which implies that the (translation invariant) product kernel k(r)k2(r) = ˆµ(r)ˆν(r) = [ µ ν(r). Finally, for the final result, observe that Theorem 15 implies f is the Fourier transform of a non-negative finite measure ν, which satisfies for any γ in the Schwartz space ˆfdx[γ] fdx[ˆγ] = ˆνdx[ˆγ] = ν[γ R] R ν[γ], where R is the pushforward, and thus R ν = ˆfdx (i.e., R ν is the generalized Fourier transform of f), which implies ν = ˆf Rdx. By Wendland (2004, Thm. 6.2) f is even. In Targeted Separation and Convergence with Kernel Discrepancies fact ˆf R is also the generalized Fourier transform of f, and ˆf R = ˆf almost everywhere. Indeed, on the one hand R ˆfγdx = R ˆf Rγ RR dx = R ˆf Rγ Rdx. On the other hand R ˆfγdx = R fˆγdx = R f Rˆγdx = R f Rˆγ R Rdx = R fˆγ Rdx = R f [ γ Rdx. Since composition with R is a bijection from the Schwartz space to itself, this shows that ˆf R is the generalized Fourier transform of f. The result then follows by the third result. Appendix M. Proof of Theorem 9: Controlling tight convergence with KSDs Before proving the result, let us introduce the Banach space B1 θ(Rd), a generalization of C1 0(Rd), which is easier to handle than the topological vector space C1 b,θ(Rd)β (the analogous generalization of C1 b (Rd)β). Lemma 10 (Definition of B1 θ(Rd)). Given a continuous function θ : Rd [c, ), for some c > 0, let B1 θ(Rd) be the completion of C1 c (Rd) with respect to f B1 θ sup θ(x)f(x) + X |p|=1 sup p xf . (10) Then B1 θ(Rd) = {f C1(Rd) : θf C0(Rd), f C0(Rd d)}. Proof We first show that if f {f : C1(Rd) : θf C0(Rd), f C0(Rd d)}, then cn C1 c(Rd) such that cn B1 θ f. By definition ϵ > 0 there exists compact subsets S1, S2 such that θ(x)f(x) < ϵ for x Sc 1 and f(x) < ϵ for x Sc 2. Since S1 S2 is compact, there exists a ball of radius r such that S S1 S2 Br. Using Lemma 14 in Gorham and Mackey (2017) we can find a function cϵ C1 c(Rd) with cϵ : Rd [0, 1], cϵ|Br = 1, cϵ|Br+2δ c = 0, cϵ I[Br+2δ/Br] for some δ > 0. In particular cϵ = 0 and cϵ = 1 on S Br. Now let fϵ fcϵ C1 c(Rd). Then on S we have θ(x)f(x) θ(x)f(x)cϵ(x) = 0 and f (fcϵ) = f cϵ f f cϵ = f cϵ f = 0. On Sc, we have |θ(x)f(x) θ(x)f(x)cϵ(x)| 2|θ(x)f(x)| 2ϵ and f (fcϵ) f + f cϵ + cϵ f 3ϵ. Thus C1 c(Rd) is dense in {f : C1 : θf C0, f C0}. On the other hand, suppose we have a Cauchy sequence cn C1 c(Rd) for the norm (10). Then, since θ c > 0, (cn)n is a fortriori a C1 0(Rd)-Cauchy sequence, and thus C1 0converges to a function f C1 0(Rd). Now we show that cn also converges to f in the norm defined in (10). 
Indeed ϵ > 0 ℓsuch that n, m ℓimplies θ(x)cn(x) θ(x)cm(x) ϵ for all x, and thus taking m gives θ(x)cn(x) θ(x)f(x) ϵ for all x, i.e., θ(cn f) ϵ. An analogous argument shows cn f 0, and thus cn B1 θ f. Barp, Simon-Gabriel, Girolami, and Mackey Finally, note that θf θcn 0 and θcn Cc(Rd) imply θf C0(Rd). Similarly f cn 0 implies f C0(Rd d) since cn Cc(Rd d). We will now first prove the non-tilted case, with a(x) = 1. We will use the following result, proved in Appendix M.1. Lemma 11 (Characteristicness of tilted bounded kernels). Using the notations of Theorem 14, let φ be the multiplication by 1/θ, where θ is a strictly positive C1 function such that 1/θ, (1/θ) are bounded. If K is universal to C1 0(Rℓ) (resp. C1 b (Rℓ)β), then K(x, y)/(θ(x)θ(y)) is universal to B1 θ(Rℓ) (resp. C1 b,θ(Rℓ)β). Construct ks and kf satisfying the conditions of Theorem 8. Note kf is characteristic to (C1 b,θ) β, as can been seen by applying Lemma 11 to ks with θ(x) 1/f(x), and recalling that the RKHS of kf(x, y) = f(x)ks(x, y)f(y) is f Hks. Summarizing, we have proven that Hkf Hk and that kf is characteristic to (C1 b,θ) β. Hence, H K HK, where K kf Id is characteristic to C1 b,θ(Rd) β by Proposition 11. The Stein RKHS associated to K consists of bounded functions and is characteristic to P P by Proposition 12, because DQ C1 b,θ(Rd) β for any probability measure Q. Moreover, since Hkp is a superset of a bounded P-separating sub-RKHS, the result then follows from Theorem 6. Finally, consider a tilting function a. In the Appendix L we have constructed a Schwartz function f that is strictly positive and s.t., f and its partial derivatives have root exponential decay. Moreover we have shown that Hkf Hk, where kf(x, y) f(x)ks(x, y)f(y) with ks a kernel obtained by ironing and scaling k, and shown that kf is universal to (C1 b,θ(Rd))β (with θ(x) 1/f(x)). Hence a Hkf a Hk. Since af and (af) are bounded, and ks is universal in (C1 b (Rd))β, then by Lemma 11 a(x)kf(x, y)a(y) is universal to (C1 b, θ M.1 Proof of Lemma 11: Characteristicness of tilted bounded kernels Note φ : C1 0(Rℓ) B1 θ(Rℓ) is continuous, indeed (here θ 1 1/θ) φ(f) B1 θ f + θ 1 f+ θ 1 f f + θ 1 f + θ 1 f C f C1 0, for some C > 0 (where we have used the boundedness of |θ 1| and θ 1 ), where is the outer product. Similarly, φ is continuous as a map C1 b (Rℓ)β C1 b,θ(Rℓ)β, since for any γ C0 γθφ(f) = γf γ iφ(f) γθ 1 if + γf iθ 1 θ 1 γ if + iθ 1 γf . Moreover φ(C1 c (Rℓ)) = C1 c (Rℓ) since θ 1 C1 is strictly positive, and C1 c(Rℓ) is dense in C1 b,θ(Rℓ), and in B1 θ(Rℓ) since the latter is its completion (and metric spaces are dense in their completion). The result then follows by Theorem 14. Targeted Separation and Convergence with Kernel Discrepancies Appendix N. Proof of Theorem 10: Controlling P-convergence by dominating indicators Fix any ϵ > 0, and pick any function h F and compact set C satisfying h Ph I [Cc] ϵ/2. Moreover, suppose d F(Qn, P) 0. For each n, we have (note h is bounded below, so h+ L1(Q) for all Q P) Qn(Cc) ϵ/2 + Qnh Ph, and, since h F and d F(Qn, P) 0, we further have |Qnh Ph| ϵ/2 for all n larger than some N. Hence, Qn(Cc) ϵ for all n sufficiently large. Since ϵ > 0 was arbitrary, (Qn)n 1 is tight. Finally, if k enforces tightness and controls tight weak convergence, then MMDk(Qn, P) 0 implies (Qn) is tight, so Qn P, i.e., k controls weak convergence. Appendix O. Proof of Lemma 1: Coercive functions dominate indicators Since P PHk, Ph is finite and hence h Ph is also coercive and bounded below. 
Since h Ph is bounded below, there exists C > 0 such that (h Ph)/C 1. Moreover, for any ϵ > 0, writing hϵ hϵ/C Hk, then hϵ Phϵ ϵ, and since hϵ Phϵ is coercive, there exists a compact set S for which infx Sc hϵ Phϵ 1 ϵ, and therefore Hk P-dominates indicators. Appendix P. From Separating Measures in Hkp to Separating Schwartz Distributions in HK For a bounded RKHS we have the following result shown in Appendix P.2: Proposition 12 (Separating measures with bounded Stein RKHSes). Suppose sp(x) θ(x), and HK C1 b,θ(Rd). Then k P is P-separating iffK is characteristic to 0 in {DQ : Q P} (C1 b,θ(Rd)β) , that is for any Q P, DQ|HK = 0 = Q = P. In order to prove this it will be convenient to first prove in Appendix P.1 the analogous but simpler result, Proposition 13, which relies on the Banach space defined in Lemma 10. Proposition 13 (Separating measures with C0 Stein RKHSes). Suppose sp(x) θ(x), and HK B1 θ(Rd). Then k P is P-separating iffK separates 0 from {DQ : Q P} B1 θ(Rd) , i.e., for any Q P, DQ|HK = 0 = Q = P. P.1 Proof of Proposition 13: Separating measures with C0 Stein RKHSes Proceeding as in Proposition 9, we can define a Banach space B0 θ(Rd) such that division by θ yields an isometric isomorphism C0 0(Rd) = B0 θ(Rd). Note the continuous inclusion B1 θ(Rd) , B0 θ(Rd).7 In particular θQ, is a continuous linear functional on B1 θ(Rd), and 7. When θ 1 we also have B1 θ(Rd) , C1 0(Rd). Barp, Simon-Gabriel, Girolami, and Mackey hence so is sp Q, since |sp Q(f)| | X i Q(si Pfi)| X i Q(|si Pfi|) C X i Q(θ|fi|) C sup θf C f B1 θ for some constants C, C > 0 (that arise from the equivalence of norms on Rd). Moreover, i Q, and hence DQ, acts continuously on B1 θ(Rd), since | i Q(f)| = |Q if| if f B1 θ(Rd). Moreover HK B1 θ(Rd) implies HK , B1 θ(Rd). Indeed, recalling that K x B(HK, Rd) is the evaluation functional, θ(x)K xh = θ(x)h(x) h B0 θ(Rd) for all h HK, so by the Banach Steinhaus Theorem θ(x)K x C for some C < . From this we find that supx θ(x)h(x) = supx θ(x)K xh supx θ(x)K x h HK C h HK. Proceeding analogously with the derivative contribution, we can use | p 2Kei(., x), h k| h B1 θ to show p 2Kei(., x), k Ai, for some Ai < , and then supx ph(x) B supx maxi | phi(x)| d B maxi Ai h HK, which yields the continuity of the inclusion. Now, note Hk P C0, so any probability measure Q embeds into the Stein RKHS by Proposition 6. Moreover, since by assumption the embedding of P into Hkp is the null function, from Corollary 4 KSDK,P(Q) = Q Hk P = Q Sp HK. We thus want to show that Q Sp|HK = 0 implies Q Sp|B1 θ = 0 iff KSDK,P(Q) = 0 implies Q = P . If KSDK,P(Q) = 0 implies Q = P, then Q Sp|HK = 0 implies Q = P, so we want to show P Sp|B1 θ = 0. For this we can use P Sp|C1c = 0 by the divergence theorem and the fact C1 c (Rd) is dense in B1 θ(Rd) since the latter is its completion (and metric spaces are dense in their completion). Conversely, we have that KSDK,P(Q) = 0 implies DQ|B1 θ = 0, that is the distributional Stein equation P i(si p Q xi Q)ei = 0, where (ei)i=d i=1 is the dual basis to the canonical basis of Rd. Applying to this vectorial distributional PDE compactly supported smooth vector fields of the form f = (0, . . . , 0, l, 0, . . .) with l C c , yields the system of (scalar) distributional PDEs xi Q = si PQ. In particular, solving for the function q : Rd R the classical PDE xiq = si Pq, implies q is the target probability density, q = p. We then look for solutions via the method of variation of constants. We write the the form Q = p T. 
Subbing in and using xi(p T) = xip T + p xi T we obtain the equivalent distributional PDE xi T = 0, which implies that T is a translationinvariant measure, and hence proportional to the Lebesgue measure, T = Cdx, by Schwartz (1978, Thm. VI of Chap. II). Since Q is a probability measure we must have C = 1, and thus Q = P. Targeted Separation and Convergence with Kernel Discrepancies P.2 Proof of Proposition 12: Separating measures with bounded Stein RKHSes Since HK C1 b,θ(Rd), then Hkp Cb, indeed |Sp(h)(x)| sp(x) h(x) + | x h| θ(x) h(x) + | x h| and the latter is a bounded function of x. Hence any probability measure embeds into Hkp by Proposition 6, and we can proceed as above once we have shown that DQ is continuous on C1 b,θ(Rd)β. First note that, using the fact that the dual of a finite product of TVS is isomorphic to the finite product of their duals (see, e.g., Treves (1967, p. 259)) D1 L1(Rd) (C1 b (Rd)) β = (Qd i=1 C1 b (R)β) = d i=1D1 L1 C1 b,θ(Rd) β, indeed, C1 b,θ(Rd)β , C1 b (Rd)β since θ c > 0, and C1 b (Rd) β = D1 L1(Rd) by Conway (1965, Sec. 1). Thus xi Q ei (C1 b,θ(Rd) β. 8 Moreover, we have C1 b,θ(Rd)β , Cb,θ(Rd)β = Cb(Rd)β, where the latter isomorphism of topological vector spaces is given by multiplication by θ. Since (Cb) β is the space of finite Radon measures, this shows that θQei Cb,θ(Rd) β C1 b,θ(Rd) β. Appendix Q. Proof of Theorem 11: IMQ KSDs control P-convergence Our proof parallels that of Gorham and Mackey (2017, Lem. 16). Fix any c > 0, γ (0, 2u 1), a > c/2, and α (1 u, 1 2(1 γ)), and consider the functions gj(x) = xj(a2 + x 2)α 1 for 1 j d. By Gorham and Mackey (2017, proof of Lem. 16), g = (g1, . . . , gd) HK for K = k Id. Moreover, the Stein operator applied to g takes the form Sp(g)(x) = sp(x),x (a2+ x 2)1 α d (a2+ x 2)1 α + 2(1 α) x 2 (a2+ x 2)2 α . Since α < 1, the final two terms in this expression are uniformly bounded in x. Meanwhile, our generalized dissipativity assumption (8) implies that sp(x), x = Ω( x 2u) as x , so sp(x),x (a2+ x 2)1 α = Ω( x 2u 2+2α) = ω(1) since α > 1 u. Hence, Sp(g) is coercive. In addition, the generalized disspativity condition (8) implies that sp(x), x is bounded below and hence that Sp(g) is bounded below. Let K = k Id. Since sp is well defined and continuous on Rd, the density p is strictly positive and continuously differentiable. In addition, since P Psp, K C (1,1) b (Rd), and K L1(P), Proposition 3 and Theorem 1 imply that p HK C1(Rd), P PK,0, KSDK,P = MMDkp( , P), and Sp(HK) = Hkp. Since Sp(g) Hkp, Hkp P-dominates indicators by Lemma 1 and enforces tightness by Theorem 10. Finally, since k C (1,1) b is translation-invariant with a spectral density bounded away from zero in a neighborhood around the origin (Wendland, 2004, Thm. 8.15), we conclude that Hk C1 b by Proposition 3 and that kp controls P convergence by Corollary 3. 8. A more direct proof that establishes the continuity of i Qei on C1 b,θ(Rd)β reads as follows: if f C1 b,θ(Rd), then | i Qeif| = |Q if i| A maxj [n] γj if i for some γj C0, n N, A > 0, by continuity of Q on (Cb)β. Since f 7 γj f are semi-norms on C1 b,θ(Rd)β the result follows. Barp, Simon-Gabriel, Girolami, and Mackey Appendix R. Proof of Theorem 12: Metrizing P-convergence with bounded Stein kernels Our aim is to identify a function in Hkp that satisfies the indicator bounding property (7) for each ϵ > 0. 
To this end, for each m N, define the compact set Cm = {x Rd : x m}, and fix any m > 1 for which sp(x), x is nonnegative on Cc m 1 and sp(x), x r0 sp(x) 1 1 r0 2|γ|(1 + d/(c2 + m)) (r1/2) x 2u (11) holds on Cc m. These properties hold for all m sufficiently large (specifically, for all m such that 1 2r1m2u 1 + r0 r2 + 2|γ|(1 + d c2+m)) due to generalized dissipativity (8) with u > 0. Fix also any ϵm (0, s] satisfying ϵm supx Cm max sp(x) 1 a( x ), 2|γ| x 1 c2+ x 2 a( x ), x 1 c2+ x 2 , a( x ) a(m) Consider the smoothed indicator function fm(x) = σ(m x ) for σ(r) = 2 max(0, r)2I [r < .5] + (1 2 max(0, 1 r)2)I [r .5] which satisfies fm(x) = 1 on Cm 1, fm(x) = 0 on Cc m, I [x Cm 1] fm(x) I [x Cm] , and I [x Cm\Cm 1] xi fm(x) 0. Moreover, for each i {1, . . . , d}, x 7 xi fm(x) C1 0. Since Hk C1 0, then Hk , C1 0, so by Simon-Gabriel and Sch olkopf (2018, Thm. 6) Hk is dense in C1 0. Hence, for each i {1, . . . , d} there exists gmi Hk satisfying supx Rd max(| gmi(x) xi fm(x)|, | xi gm(x) xi(xi fm(x))|) ϵm. (13) Moreover, the function wi(x) = xi belongs to H ki for ki(x, y) xiyi. Since Hk Hk+ ki and H ki Hk+ ki (see for example Carmeli et al. (2010, Prop. 5)), the functions gmi, wi, and gmi = wi gmi are all elements of Hk+ ki. Consider now the Stein function hm(x) = Pd i=1 xi(p(x)a( x )gmi(x))) p(x) = sp(x), gm(x) a( x ) | {z } (i) gm(x), a( x ) | {z } (ii) a( x ) gm(x) | {z } (iii) where gm is the vector valued function (gmi)d i=1. By construction, hm Sp HK, and thus in Hkp by Theorem 1. Therefore, the zero-mean embedding assumption P PK,0 and Proposition 3 imply that Phm = 0. We will show that a rescaled version of hm satisfies the indicator bound property (7) for a choice of ϵm that decays to 0 as m . We begin by lower-bounding each of the components in the expansion (14). To lower-bound term (i), we first record several properties of gm. First, our approximation guarantee (13) implies supx Rd |gmi(x) xifm(x)| ϵm for each i {1, . . . , d}, (15) Targeted Separation and Convergence with Kernel Discrepancies where fm 1 fm satisfies I x Cc m 1 fm(x) I [x Cc m] and I [x Cm\Cm 1] xifm(x) 0. (16) Since a is nonnegative and fm(x) = 0 on Cm 1, H older s inequality, the guarantee (15), the assumed nonnegativity of sp(x), x on Cc m 1, generalized dissipativity, and our choice (12) of ϵm implies that sp(x), gm(x) a( x ) = sp(x), x fm(x)a( x ) sp(x), gm(x) xfm(x) a( x ) sp(x), x fm(x)a( x ) sp(x) 1 gm(x) xfm(x) a( x ) sp(x), x fm(x)a( x ) sp(x) 1 a( x )ϵm sp(x), x I [x Cc m] a( x ) sp(x) 1 a( x )ϵm = ( sp(x), x sp(x) 1 ϵm)I [x Cc m] a( x ) sp(x) 1 I [x Cm] a( x )ϵm ( sp(x), x sp(x) 1 s)I [x Cc m] a( x ) a(m)/m. To lower bound (ii), we again employ H older s inequality, the approximation guarantee (15), and the ϵm properties (12) to find that gm(x), a( x ) = 2γ gm(x), x /(c2 + x 2)γ+1 = 2γ x 2 fm(x)/(c2 + x 2)γ+1 + 2γ gm(x) xfm(x), x /(c2 + x 2)γ+1 2γ x 2 fm(x)/(c2 + x 2)γ+1 2γ gm(x) xfm(x) x 1 /(c2 + x 2)γ+1 2γ x 2 fm(x)/(c2 + x 2)γ+1 2|γ| x 1 ϵm/(c2 + x 2)γ+1 2|γ| x 2 I x Cc m 1 /(c2 + x 2)γ+1 2|γ| x 1 ϵm/(c2 + x 2)γ+1 c2+ x 2 a( x )(I [x Cc m] + I [x Cm\Cm 1]) c2+ x 2 a( x )(I [x Cc m] + I [x Cm]) 2|γ| x 2+ x 1ϵm a( x )I [x Cc m] 2|γ| max(a(m 1), a(m)) a(m)/m d/(c2 + m))a( x )I [x Cc m] 2|γ| max(a(m 1), a(m)) a(m)/m. To lower bound (iii), we first note that the derivative approximation (13) implies supx Rd | xigmi(x) xi(xifm(x))| ϵm for each i {1, . . . , d}. Moreover, we have xi(xifm(x)) = fm(x) + xi xifm(x) I x Cc m 1 + |xi|I [x Cm\Cm 1] by our xifm constraints (16). 
Therefore, the nonnegativity of a and the ϵm properties (12) give the bound a( x ) gm(x) a( x )(I x Cc m 1 + x 1 I [x Cm\Cm 1] + ϵm) = a( x )(1 + ϵm)I [x Cc m] a( x )(1 + x 1)I [x Cm\Cm 1] a( x )ϵm I [x Cm] a( x )(1 + s)I [x Cc m] max(a(m 1), a(m))(1 + dm) a(m)/m. Barp, Simon-Gabriel, Girolami, and Mackey Our assumption γ u implies that x 2u a( x ) m2ua(m) whenever x m. This fact combined with our collected results and the assumed growth (11) induced by our choice of m now imply that hm(x) ( sp(x), x sp(x) 1 r0 1 r0 2|γ|(1 + d/(c2 + m)))I [x Cc m] a( x ) 3a(m)/m (1 + dm + 2|γ|) max(a(m 1), a(m)) (r1/2) x 2u I [x Cc m] a( x ) 3a(m)/m (1 + dm + 2|γ|) max(a(m 1), a(m)) (r1/2)m2ua(m)I [x Cc m] 3a(m)/m (1 + dm + 2|γ|) max(a(m 1), a(m)). Hence, the rescaled Stein function hm = hm/((r1/2)m2ua(m)), satisfies the indicator approximation property (7) for the compact set Cm and the approximation factor ϵm = 6/(r1m2u+1) + (1 + dm + 2|γ|) max(a(m 1), a(m))/((r1/2)m2ua(m)). Since u > 1/2, ϵm vanishes as m , and hence Hkp P-dominates indicators. Thus by Theorem 10 the Stein kernel enforces tightness. For (b), we use Lemma 12: Lemma 12 (Universal KSDs tilted by score growth control tight convergence). Suppose that sp(x) (c2 + x 2)γ, where c = 0, γ 0, and K is characteristic to D1 L1(Rd). Then the Stein kernel induced by (c2 + x 2) γK(x, y)(c2 + y 2) γ is P-separating and controls tight P-convergence. Proof The result follows by Theorem 7. Indeed the function θ(x) (c + x 2)γ has 1/θ(x) = 2γx(c+ x 2) γ 1 satisfies the assumption of Theorem 7, so the result follows. This shows that we can easily construct bounded Stein kernels that control tight weak convergence to P in P by simply tilting the base kernel through a function that bounds the score. By Lemma 12 the Stein kernel induced by the tilted base kernel a( x )k(x, y)a( y ) controls tight weak convergence, and thus so does the overall Stein kernel (which further controls weak convergence since it enforces tightness) as it may be viewed as the sum of two Stein kernels. Indeed, as proved in Appendix R.1, we have the following general bound between MMDs when an RKHS contains another one: Lemma 13 (MMD controls subset MMDs). Suppose Hk H k and that P PH k. Then c 0 such that for all Q P MMDk(Q, P) c MMD k(Q, P). (i) If k is P-separating then k is P-separating. (ii) If k controls (tight) weak P-convergence, then k controls (tight) weak P-convergence. Targeted Separation and Convergence with Kernel Discrepancies Finally, for (c), first note that Hkp Cb. Indeed for any h Hkp we have h(x) = Sp(ag) = sp(x), ag + (ag) = sp, ag + a g + g, a , for some vectorvalued function g with gi Hk+ ki, so h is continuous. Moreover it is bounded since (i) |gi(x)| gi k+ ki( p k(x, x) + |xi|), implies | g, a | gi k+ ki P i 2γ |xi| supx k(x,x)+|xi|2 (c2+ x 2)γ+1 . (ii) a g is bounded since ig H i i+dk+1 Cb. (iii) sp(x), ag is bounded since | sp(x), a(x)g(x) | sp(x) |a(x)| g(x) sp(x) x |a(x)| 1. Finally, P PK,0 since Hkp Cb L1(P) by above, and HK L1(P) as for large enough x sp x sp(x), x r1 x 2u r2 A x 2u for some A > 0, so sp A x 2u 1 for x large enough, so x a(x) 1/ sp(x) 1 A x 2u 1 for x large enough, which implies that HK Cb. R.1 Proof of Lemma 13: MMD controls subset MMDs By Schwartz (1964, Prop. 2), Hk H k implies that there exists c 0 such that, for all h Hk, h H k c h Hk, so c 1Bk B k. Hence MMDk(Q, P) sup h Bk : h+ L1(Q) |Qh Ph| c MMD k(Q, P). Since Hk H k, the P-separation and tightness results are immediate. Appendix S. 
Proof of Theorem 13: Decaying P-centered kernels fail to control P-convergence We will use the following result based on a construction from Simon-Gabriel et al. (2023, Section 5). Theorem 16 (Vanishing mean-zero kernels fail to control P-convergence). Suppose that X is locally compact but not compact. If Hk C0 and k maps P P to 0 Hk, i.e., Φk(P) = 0, then k cannot control weak convergence to P P. Proof Since P is a regular measure, we can find a compact set C X for which P(C) 1/2. By Simon-Gabriel et al. (2023, Lemma 9 and 10), we can find an open set U and compact set C such that C U C and sequence of probability measures (Qn)n such that Qn k 0 and Qn(C ) = 0 for all n. Then Qn P k = Qn k 0, so Qn converges to P in maximum mean discrepancy but not in weak convergence since P(U) P(C) > Qn(U) = 0. Now note that Φk(P) = 0 implies that every function in Hk has vanishing P-integral. Given any RKHS Hk L1(P), we can construct a new RKHS whose functions have zero Barp, Simon-Gabriel, Girolami, and Mackey expectation under P and has the same MMD between embeddable measures, as we now show by simply applying the projection operator ΠP(h) = h Ph. Since |h(x) Ph| = | h , kx k Φk(P) , h k | ( kx k + Φk(P) k) h k, (Carmeli et al., 2006, Prop. 2.4) implies ΠP(Hk) is a RKHS with kernel (using the fact ξ P(x)(h) = ΠP(h)(x) where ξ P(x) δx P) k P(x, y) = Φk(δx P) , Φk(δy P) k . Thus, the elements of Hk P have the form h Ph for some h Hk, and hence P(Hk P) = {0}. Importantly, k P and k generate the same MMD. First let us show this for embeddable measures: since for any finite measure µ with R µ = 0 that embeds into Hk, we have using Lemma 2 and µ ΠP|Hk = µ|Hk (from R µ = 0) that µ k P = µ ΠP k = µ k. Hence for any two embeddable probability measures Q, P we have MMDk P(Q, P) = MMDk(Q, P). In general, for any Q P, note that h+ L1(Q) iff(ΠP(h))+ L1(Q). Moreover Bk P = ΠP(Bk) by Lemma 5. Thus, writing Sk(Q) {h Bk : h+ L1(Q)}, we have Sk P(Q) = ΠP(Sk(Q)) MMDk P(Q, P) = sup f Sk P(Q) |Q(f) P(f)| = sup f ΠPSk(Q) |Q(f) P(f)| = sup h Sk(Q) |Q(ΠPh) P(ΠPh)| = sup h Sk(Q) |Qh Ph| = MMDk(Q, P). Combining with Theorem 16 we obtain the other advertised result. Luigi Ambrosio, Nicola Gigli, and Giuseppe Savar e. Gradient Flows - In Metric Spaces and in the Space of Probability Measures. Birkh auser Verlag, Springer, 2005. Andreas Anastasiou, Alessandro Barp, Fran cois-Xavier Briol, Bruno Ebner, Robert E Gaunt, Fatemeh Ghaderinezhad, Jackson Gorham, Arthur Gretton, Christophe Ley, Qiang Liu, et al. Stein s method meets computational statistics: A review of some recent developments. Statistical Science, 38(1):120 139, 2023. Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 1950. Alessandro Barp, Francois-Xavier Briol, Andrew Duncan, Mark Girolami, and Lester Mackey. Minimum Stein discrepancy estimators. In Advances in Neural Information Processing Systems, pages 12964 12976, 2019. Alessandro Barp, So Takao, Michael Betancourt, Alexis Arnaudon, and Mark Girolami. A unifying and canonical description of measure-preserving diffusions. ar Xiv:2105.02845, 2021. Targeted Separation and Convergence with Kernel Discrepancies Alessandro Barp, Lancelot Da Costa, Guilherme Fran ca, Karl Friston, Mark Girolami, Michael I Jordan, and Grigorios A Pavliotis. Geometric methods for sampling, optimization, inference, and adaptive agents. In Handbook of Statistics, volume 46, pages 21 78. Elsevier, 2022a. Alessandro Barp, Chris J Oates, Emilio Porcu, and Mark Girolami. 
References

Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Birkhäuser, 2005.
Andreas Anastasiou, Alessandro Barp, François-Xavier Briol, Bruno Ebner, Robert E. Gaunt, Fatemeh Ghaderinezhad, Jackson Gorham, Arthur Gretton, Christophe Ley, Qiang Liu, et al. Stein's method meets computational statistics: A review of some recent developments. Statistical Science, 38(1):120–139, 2023.
Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 1950.
Alessandro Barp, François-Xavier Briol, Andrew Duncan, Mark Girolami, and Lester Mackey. Minimum Stein discrepancy estimators. In Advances in Neural Information Processing Systems, pages 12964–12976, 2019.
Alessandro Barp, So Takao, Michael Betancourt, Alexis Arnaudon, and Mark Girolami. A unifying and canonical description of measure-preserving diffusions. arXiv:2105.02845, 2021.
Alessandro Barp, Lancelot Da Costa, Guilherme França, Karl Friston, Mark Girolami, Michael I. Jordan, and Grigorios A. Pavliotis. Geometric methods for sampling, optimization, inference, and adaptive agents. In Handbook of Statistics, volume 46, pages 21–78. Elsevier, 2022a.
Alessandro Barp, Chris J. Oates, Emilio Porcu, and Mark Girolami. A Riemann–Stein kernel method. Bernoulli, 2022b.
Christian Berg, Jens P. R. Christensen, and Paul Ressel. Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions. Springer, 1984.
Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer, 2004.
Salomon Bochner. Vorlesungen über Fouriersche Integrale, volume 12. Akademische Verlagsgesellschaft, 1932.
François-Xavier Briol, Alessandro Barp, Andrew B. Duncan, and Mark Girolami. Statistical inference for generative models with maximum mean discrepancy. arXiv:1906.05944, 2019.
R. Creighton Buck. Bounded continuous functions on a locally compact space. Michigan Mathematical Journal, 1958.
Jorge Buescu, A. C. Paixão, F. Garcia, and I. Lourtie. Positive-definiteness, integral equations and Fourier transforms. The Journal of Integral Equations and Applications, pages 33–52, 2004.
Claudio Carmeli, Ernesto De Vito, and Alessandro Toigo. Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Analysis and Applications, 2006.
Claudio Carmeli, Ernesto De Vito, Alessandro Toigo, and Veronica Umanità. Vector valued reproducing kernel Hilbert spaces and universality. Analysis and Applications, 2010.
Wilson Ye Chen, Lester Mackey, Jackson Gorham, François-Xavier Briol, and Chris J. Oates. Stein points. In ICML, 2018.
Wilson Ye Chen, Alessandro Barp, François-Xavier Briol, Jackson Gorham, Mark Girolami, Lester Mackey, Chris Oates, et al. Stein point Markov chain Monte Carlo. arXiv:1905.03673, 2019.
Badr-Eddine Chérief-Abdellatif and Pierre Alquier. MMD-Bayes: Robust Bayesian estimation via maximum mean discrepancy. In Symposium on Advances in Approximate Bayesian Inference, 2020.
Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton. A kernel test of goodness of fit. In ICML, 2016.
John Bligh Conway. The Strict Topology and Compactness in the Space of Measures. PhD thesis, Louisiana State University and Agricultural & Mechanical College, 1965.
J. L. B. Cooper. Positive definite functions of a real variable. Proceedings of the London Mathematical Society, 3(1):53–66, 1960.
Charita Dellaporta, Jeremias Knoblauch, Theodoros Damoulas, and François-Xavier Briol. Robust Bayesian inference for simulator-based models via the MMD posterior bootstrap. In AISTATS, 2022.
Gintare K. Dziugaite, Daniel M. Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In UAI, 2015.
Stewart N. Ethier and Thomas G. Kurtz. Markov Processes: Characterization and Convergence, volume 282. John Wiley & Sons, 2009.
Matthew A. Fisher, Chris Oates, et al. Gradient-free kernel Stein discrepancy. arXiv:2207.02636, 2022.
Futoshi Futami, Zhenghang Cui, Issei Sato, and Masashi Sugiyama. Bayesian posterior approximation via greedy particle optimization. In AAAI, 2019.
Jackson Gorham and Lester Mackey. Measuring sample quality with Stein's method. In NeurIPS, 2015.
Jackson Gorham and Lester Mackey. Measuring sample quality with kernels. In ICML, 2017.
Jackson Gorham, Andrew B. Duncan, Sebastian J. Vollmer, and Lester Mackey. Measuring sample quality with diffusions. The Annals of Applied Probability, 2019.
Jackson Gorham, Anant Raj, and Lester Mackey. Stochastic Stein discrepancies. In Advances in Neural Information Processing Systems, 2020.
Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. JMLR, 2012.
Jun Han and Qiang Liu. Stein variational gradient descent without gradient. In ICML, 2018.
Liam Hodgkinson, Robert Salomone, and Fred Roosta. The reproducing Stein kernel approach for post-hoc corrected sampling. arXiv:2001.09266, 2020.
Jonathan Huggins and Lester Mackey. Random feature Stein discrepancies. In NeurIPS, 2018.
Steven G. Johnson. Saddle-point integration of C∞ bump functions. arXiv:1508.04376, 2018.
Heishiro Kanagawa, Alessandro Barp, Arthur Gretton, and Lester Mackey. Controlling moments with kernel Stein discrepancies. arXiv:2211.05408, 2022.
Lucien Le Cam. Convergence in distribution of stochastic processes. University of California Publications in Statistics, 1957.
Chang Liu and Jun Zhu. Riemannian Stein variational gradient descent for Bayesian inference. In AAAI, 2018.
Qiang Liu. Stein variational gradient descent as gradient flow. In NeurIPS, 2017.
Qiang Liu and Jason Lee. Black-box importance sampling. In AISTATS, 2017.
Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In NeurIPS, 2016.
Qiang Liu, Jason Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In ICML, 2016.
Takuo Matsubara, Jeremias Knoblauch, François-Xavier Briol, Chris Oates, et al. Robust generalised Bayesian inference for intractable likelihoods. arXiv:2104.07359, 2021.
Takuo Matsubara, Jeremias Knoblauch, François-Xavier Briol, and Chris Oates. Generalised Bayesian inference for discrete intractable likelihood. arXiv:2206.08420, 2022.
Mario Micheli and Joan Alexis Glaunès. Matrix-valued kernels for shape deformation analysis. arXiv:1308.5739, 2013.
Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 1997.
Kazimierz Musiał. Handbook of Measure Theory, Chapter 12: Pettis Integral. Elsevier, 2002.
Chris J. Oates, Mark Girolami, and Nicolas Chopin. Control functionals for Monte Carlo integration. arXiv:1410.2392, 2014.
Chris J. Oates, Jon Cockayne, François-Xavier Briol, and Mark Girolami. Convergence rates for a class of estimators based on Stein's method. Bernoulli, 2019.
Vern I. Paulsen and Mrinal Raghupathi. An Introduction to the Theory of Reproducing Kernel Hilbert Spaces. Cambridge University Press, 2016.
Tomos Phillips. On positive and conditionally negative definite functions with a singularity at zero, and their applications in potential theory. PhD thesis, Cardiff University, 2018.
Stefano Pigola and Alberto G. Setti. Global divergence theorems in nonlinear PDEs and geometry. Ensaios Matemáticos, 2014.
Marina Riabiz, Wilson Ye Chen, Jon Cockayne, Pawel Swietach, Steven A. Niederer, Lester Mackey, and Chris J. Oates. Optimal thinning of MCMC output. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2022.
Laurent Schwartz. Sous-espaces hilbertiens d'espaces vectoriels topologiques et noyaux associés (noyaux reproduisants). Journal d'Analyse Mathématique, 1964.
Laurent Schwartz. Théorie des Distributions. Hermann, 1978.
Carl-Johann Simon-Gabriel and Bernhard Schölkopf. Kernel distribution embeddings: Universal kernels, characteristic kernels and kernel metrics on distributions. JMLR, 2018.
Carl-Johann Simon-Gabriel, Alessandro Barp, Bernhard Schölkopf, and Lester Mackey. Metrizing weak convergence with maximum mean discrepancies. Journal of Machine Learning Research, 24(184):1–20, 2023.
Bharath K. Sriperumbudur. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 2016.
Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. JMLR, 2010.
Bharath K. Sriperumbudur, Kenji Fukumizu, and Gert R. G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. JMLR, 2011.
Ingo Steinwart and Andreas Christmann. Support Vector Machines. Information Science and Statistics. Springer, 2008.
James Stewart. Positive definite functions and generalizations, an historical survey. The Rocky Mountain Journal of Mathematics, 6(3):409–434, 1976.
Zhuo Sun, Alessandro Barp, and François-Xavier Briol. Vector-valued control variates. In International Conference on Machine Learning, pages 32819–32846. PMLR, 2023.
François Trèves. Topological Vector Spaces, Distributions and Kernels. Academic Press, 1967.
Holger Wendland. Scattered Data Approximation. Cambridge University Press, 2004.
George Wynne, Mikołaj Kasprzak, and Andrew B. Duncan. A spectral representation of kernel Stein discrepancy with application to goodness-of-fit tests for measures on infinite dimensional Hilbert spaces. arXiv:2206.04552, 2022.
Haizhang Zhang and Liang Zhao. On the inclusion relation of reproducing kernel Hilbert spaces. Analysis and Applications, 2013.
Ding-Xuan Zhou. Derivative reproducing properties for kernel methods in learning theory. Journal of Computational and Applied Mathematics, 220(1–2):456–463, 2008.