
Journal of Machine Learning Research 24 (2023) 1-20 Submitted 6/21; Revised 3/23; Published 4/23

Metrizing Weak Convergence with Maximum Mean Discrepancies

Carl-Johann Simon-Gabriel cjsg@ethz.ch Institute for Machine Learning ETH Zürich, Switzerland

Alessandro Barp ab2286@cam.ac.uk Department of Engineering University of Cambridge, Alan Turing Institute, United Kingdom

Bernhard Schölkopf bs@tue.mpg.de Empirical Inference Department MPI for Intelligent Systems, Tübingen, Germany

Lester Mackey lmackey@microsoft.com Microsoft Research Cambridge, MA, USA

Editor: Ingo Steinwart

This paper characterizes the maximum mean discrepancies (MMD) that metrize the weak convergence of probability measures for a wide class of kernels. More precisely, we prove that, on a locally compact, non-compact, Hausdorff space, the MMD of a bounded continuous Borel measurable kernel $k$, whose RKHS-functions vanish at infinity (i.e., $\mathcal{H}_k \subset C_0$), metrizes the weak convergence of probability measures if and only if $k$ is continuous and integrally strictly positive definite ($\int$ s.p.d.) over all signed, finite, regular Borel measures. We also correct a prior result of Simon-Gabriel and Schölkopf (JMLR 2018, Thm. 12) by showing that there exist both bounded continuous $\int$ s.p.d. kernels that do not metrize weak convergence and bounded continuous non-$\int$ s.p.d. kernels that do metrize it.

Keywords: Maximum Mean Discrepancy, Metrization of weak convergence, Kernel mean embeddings, Characteristic kernels, Integrally strictly positive definite kernels

1. Introduction

Although the mathematical and statistical literature has studied kernel mean embeddings (KMEs) and maximum mean discrepancies (MMDs) at least since the 1970s (Guilbart, 1978), the machine learning community rediscovered and applied them only in the late 2000s (Smola et al., 2007). A KME with reproducing kernel $k$ is a map from measures $\mu$ (in particular, probability distributions) to functions $f_\mu$ in the reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ of $k$. The RKHS distance between two embeddings then yields a semi-metric $d_k$ on measures, called the maximum mean discrepancy (MMD), which can be used to compare two measures or distributions $\mu$ and $\nu$: $d_k(\mu, \nu) := \|f_\mu - f_\nu\|_k$. Their theoretical tractability and computational flexibility have allowed MMDs to flourish in many areas of machine learning that require comparing probability distributions, such as

2023 Carl-Johann Simon-Gabriel, Alessandro Barp, Bernhard Schölkopf, and Lester Mackey.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-0599.html.


two-sample testing (comparing two discrete distributions; Gretton et al., 2012), sample quality measurement and goodness-of-fit testing (comparing a discrete distribution to a reference distribution, as in Chwialkowski et al. 2016; Liu et al. 2016; Gorham and Mackey 2017; Jitkrittum et al. 2017; Huggins and Mackey 2018), generative model fitting (comparing distributions of fake and real data; see Dziugaite et al. 2015; Sutherland et al. 2017; Feng et al. 2017; Pu et al. 2017; Briol et al. 2019), de novo sampling and quadrature (Chen et al., 2010; Huszár and Duvenaud, 2012; Liu and Wang, 2016; Chen et al., 2018; Futami et al., 2019; Chen et al., 2019), importance sampling (Liu and Lee, 2017; Hodgkinson et al., 2020), and thinning (Riabiz et al., 2022).

For most applications, one seeks a kernel $k$ whose MMD can separate all probability distributions $P, Q$, meaning that $d_k(P, Q) = 0$ (if and) only if $Q = P$. Such kernels are said to be characteristic (to the set of probability distributions $\mathcal{P}$). If, for example, we optimize a parametric distribution $Q$ to match a target $P$ by minimizing their MMD $d_k(P, Q)$, it is rather natural to require that it be minimized only if $Q$ perfectly matches $P$, i.e. $Q = P$. Another natural, but a priori stronger, requirement is that when $Q$ gets closer to $P$ in MMD, such as when $d_k(Q, P) \to 0$, we would like $Q$ to truly converge to $P$, where "truly" means for some other standard and/or more familiar notion of convergence. Although several standard notions may come to mind (convergence in KL-divergence, in total variation or in Hellinger distance), many are too strong for our purposes, which often require handling discrete data. For example, even if $x \to \xi$, the Dirac masses $\delta_x$ will not converge to $\delta_\xi$ in total variation or KL-divergence unless $x$ is eventually equal to $\xi$. Said differently, a sequence of deterministic variables would not converge in total variation unless it was eventually constant.

Since in practice MMDs are frequently used to compare samples or empirical (hence discrete) distributions, it comes as no surprise that MMD convergence cannot, in general, ensure these strong types of convergence. Instead we will opt for a standard, yet comparatively weak notion of convergence, known as weak or narrow convergence or convergence in distribution. Specifically, the central question of this paper will be

When is convergence in MMD metric equivalent to weak convergence on P?

In that case, we will say that the kernel $k$ metrizes the weak convergence of probability measures. This question lies at the heart of the learning applications described above, as the quality of these inferences depends on the metrization properties of the chosen kernel (Zhu et al., 2019, 2021; Ansari et al., 2020; Li et al., 2017). For example, Zhu et al. (2019, 2021) establish that kernel MMD tests have the universal hypothesis testing property introduced by Hoeffding (1965) provided that their kernels control weak convergence. Conversely, when the kernel MMD fails to reflect the convergence of distributions, the results are at best inaccurate and at worst invalid.
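The Dirac-mass example from the introduction can be checked numerically. The following is a minimal sketch, assuming a Gaussian kernel on the real line with unit bandwidth; the function names `gaussian_kernel`, `mmd_dirac` and `tv_dirac` are hypothetical and chosen for illustration only. It uses the identity $d_k(\delta_x, \delta_\xi)^2 = k(x, x) + k(\xi, \xi) - 2k(x, \xi)$.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel on the real line (assumed example of a bounded, continuous kernel)."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def mmd_dirac(x, xi, kernel=gaussian_kernel):
    """MMD between the Dirac masses delta_x and delta_xi.

    d_k(delta_x, delta_xi)^2 = k(x, x) + k(xi, xi) - 2 k(x, xi).
    """
    squared = kernel(x, x) + kernel(xi, xi) - 2.0 * kernel(x, xi)
    return math.sqrt(max(squared, 0.0))  # clip tiny negative rounding errors

def tv_dirac(x, xi):
    """Total variation distance between delta_x and delta_xi."""
    return 0.0 if x == xi else 1.0

xi = 0.0
for n in (1, 10, 100, 1000):
    x_n = xi + 1.0 / n  # x_n converges to xi but never equals it
    print(n, tv_dirac(x_n, xi), mmd_dirac(x_n, xi))
```

In this sketch the total variation distance stays at 1 along the whole sequence, while the MMD shrinks as $x_n$ approaches $\xi$, which is the behaviour one expects from a distance that is compatible with weak convergence.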

1.1 Previous results

The aforementioned question was studied as early as 1978 by Guilbart (1978) in his thesis. On separable metric spaces, he characterized the kernels for which weak convergence implies convergence in MMD (Thm. 1.D.I). Conversely, he showed that, in some cases, MMD convergence can also imply weak convergence, meaning that there do exist kernels that metrize weak convergence. He provided a concrete recipe to construct such kernels (Thm. 1.E.I & Lem. 3.E.I) and used it to exhibit some examples. However, Guilbart (1978) did not characterize these kernels and left most standard kernels (Gaussian, Laplacian, etc.) aside. These initial results went largely unnoticed by the ML community, and it is only much later, with the emergence and the new applications of MMDs in applied statistics, that the important question of weak convergence metrization resurfaced. Sriperumbudur et al. (2010) in particular presented sufficient conditions under which the MMD metrizes weak convergence when the underlying input space is either $\mathbb{R}^d$ (Thm. 24) or a compact metric space (Thm. 23). Sriperumbudur (2016, Thm. 3.2) then considerably improved these results and showed the following theorem.

Theorem 1 (Sriperumbudur 2016) A continuous, bounded, integrally strictly positive definite ($\int$ s.p.d.) kernel over a locally compact Polish space $X$ such that $\mathcal{H}_k \subset C_0$ metrizes weak convergence.

Let us explain and discuss this result, as it will provide context for the new results of this work. First, the theorem assumes that the underlying input space is locally compact and Polish. Either assumption taken separately is quite general: all topological manifolds (e.g. $\mathbb{R}^d$) and all discrete spaces are locally compact, and all separable, complete, metric spaces are, by definition, Polish, which includes any separable Banach space. This generality made locally compact spaces on the one hand and Polish spaces on the other standard choices for carrying out general measure and probability theory. However, when the two assumptions are combined, the result can be quite restrictive. A Banach space, for example, is locally compact only if it has finite dimension. Therefore, combining both assumptions yields an important constraint that limits the applicability of the result: one would hope for one or the other but not both.

Second, $\mathcal{H}_k \subset C_0$ means that the RKHS functions $f$ are assumed to be continuous and vanish at infinity, i.e., for any $\epsilon > 0$, there exists a compact $K \subset X$ for which $\sup_{X \setminus K} |f| \le \epsilon$. Many standard kernels satisfy this assumption, which is typically easy to verify (see Lem. 8 below). The assumption that $f$ be in $C_0$ is also rather natural in the context of locally compact spaces $X$, since, by the Riesz representation theorem, the set of finite, signed and regular measures (a.k.a. finite Radon measures) can be identified with the continuous dual of $C_0$ (Villani, 2010, Def. VI-66 & Thm. VI-61). This, in turn, can be advantageously leveraged in many proofs and theorems (e.g. for the equivalence between universal and characteristic kernels). However, that same assumption $\mathcal{H}_k \subset C_0$ is often inadequate on Polish spaces, because, on Polish spaces, $C_0$ is typically very small: for example, on an infinite dimensional Banach space (hence not locally compact), $C_0$ contains only the null function.

This suggests that it might be more natural to remove the Polish assumption than the locally compact assumption, which is what we will do in this paper. However, by dropping the Polish assumption, we need to pay a bit more attention to the sets of measures that we manipulate. Specifically, on Polish spaces, all signed Borel measures happen to be regular (see definition in Section 1.4), meaning that the set of finite Borel and finite Radon measures coincide there. On locally compact Hausdorff spaces however, this need not be the case. So, when dropping the Polish assumption, we also need to decide on which measures we want to focus and in particular to which measures we would like to be characteristic. As shown in Section 2, it turns out that in cases where Borel and Radon measures do not match, no kernel can be characteristic to all Borel measures. Hence, the only sensible choice is to focus on Radon measures.

Third, the theorem assumes that the kernel is $\int$ s.p.d., meaning that its MMD separates all finite signed measures $\mathcal{M}$: for any $\mu, \nu \in \mathcal{M}$, $d_k(\mu, \nu) = 0$ only if $\mu = \nu$. It is easy to see that an MMD that metrizes weak convergence on the set of probability measures $\mathcal{P}$ must separate $\mathcal{P}$. But by assuming that it even separates $\mathcal{M}$, which is bigger than $\mathcal{P}$, Sriperumbudur (2016)'s Thm. 1 leaves open the case of any MMD that separates $\mathcal{P}$ but not $\mathcal{M}$.

In 2018, Simon-Gabriel and Schölkopf (2018, Thm. 12) seemed to finally address all weaknesses mentioned above by characterizing the metrization of weak convergence of probability measures on locally compact spaces as follows.

Claim 2 (Simon-Gabriel and Schölkopf 2018) On a locally compact Hausdorff space, a bounded, Borel measurable kernel metrizes the weak convergence of probability measures if and only if it is continuous and characteristic (to the set of probability measures).

This statement weakens the sufficient condition of Thm. 1 from separation of $\mathcal{M}$ ($\int$ s.p.d. kernel) to separation of $\mathcal{P}$ (characteristic kernel), which, as discussed, immediately yields the converse direction. It gets rid of the Polish assumption and, surprisingly, also drops the assumption $\mathcal{H}_k \subset C_0$.

1.2 Our contributions

Unfortunately, Claim 2 turns out to be wrong when the input space $X$ is not compact. Our main result, Thm. 9, provides a correction under the additional assumption that $\mathcal{H}_k \subset C_0$. Crucially, we find that the compact and non-compact cases are inherently different. Metrizing weak convergence on non-compact spaces requires strictly stronger conditions, since the MMD needs to separate not only the probability measures (as in the compact case or in Claim 2) but all finite signed measures. Put differently, Thm. 9 drops the Polish assumption from Thm. 1 and proves that its converse, which is too strong when $X$ is compact (see Thm. 7 & Prop. 13), does hold when $X$ is non-compact. An important implication is that any $C_0$ kernel that maps a probability measure to 0 fails to metrize weak convergence; in particular, this establishes that large classes of Stein kernels are unable to metrize convergence (see Rem. 10).

Additionally, Cor. 15 shows that Thm. 9 does not hold without the assumption $\mathcal{H}_k \subset C_0$, while Cor. 17 provides a sufficient condition to metrize weak convergence when $\mathcal{H}_k \not\subset C_0$. Our results also complete the findings of Chevyrev and Oberhauser (2022), who constructed a counter-example showing that Claim 2 does not hold on Polish spaces. Overall, our findings show that the old quest to characterize weak-convergence metrizing MMDs (which we close under the quite general assumption that $X$ is locally compact and $\mathcal{H}_k \subset C_0$) depends in much more subtle ways on the properties of the underlying space $X$ (being compact or not, Polish or not, etc.) and the kernel $k$ ($\mathcal{H}_k$ contained in $C_0$ or not) than was previously thought.


1.3 Paper structure

Section 1.4 fixes notation and makes a few important reminders and remarks. Section 3 then extends Sriperumbudur (2016)'s Thm. 1 and gives a general sufficient condition to metrize weak convergence when $\mathcal{H}_k \subset C_0$. We then investigate whether this condition is also necessary, first when the input space $X$ is compact (Sec. 4), where it turns out to be too strong (Thm. 7); then when $X$ is not compact, but locally compact (Sec. 5), in which case the sufficient condition turns out to be necessary (Thm. 9). We finish with a few results in the general case (Sec. 6), when $\mathcal{H}_k \not\subset C_0$: first a negative result (Cor. 15) showing that the assumption $\mathcal{H}_k \subset C_0$ cannot be dropped without replacement; then a result that generalizes the condition $\mathcal{H}_k \subset C_0$. Section 7 concludes.

1.4 Notation, deﬁnitions, and reminders

We use the letter $k$ to denote a reproducing kernel (i.e. a positive definite function) over a locally compact Hausdorff (LCH) space $X$, and $\mathcal{H}_k$ denotes its RKHS. $C_b$ is the space of bounded, continuous and real-valued functions $f$ over $X$ (see footnote 1). $C_0$ is its subspace of functions that vanish at infinity, i.e. such that for any $\epsilon > 0$, there exists a compact $K \subset X$ such that $|f| \le \epsilon$ on $X \setminus K$. We say that $k$ is a $C_0$-kernel if $\mathcal{H}_k \subset C_0$ and that it is $C_0$-universal if its $\mathcal{H}_k$ is also dense in $C_0$. We denote by $\mathcal{M}_B$ the set of finite, signed Borel measures, and by $\mathcal{M}$ the set of finite Radon measures, i.e., the subset of signed measures in $\mathcal{M}_B$ that are also regular. Recall that a positive Borel measure $\mu$ is said to be regular if for any Borel measurable set $A$ and any $\epsilon > 0$, there exist a compact $K$ and an open set $O$ in $X$ such that $K \subset A \subset O$, $|\mu(A) - \mu(K)| \le \epsilon$ and $|\mu(O) - \mu(A)| \le \epsilon$. Said differently, a measure is regular if any measurable set can be approximated (in terms of measure) from the inside by the compacts it contains and from the outside by the open sets that contain it. A signed Borel measure is regular if its positive and negative parts are. Except for Section 2, where we discuss the differences between Borel and Radon measures, this work focuses on finite Radon measures. When used without further specification, the word measure designates an element in $\mathcal{M}$. We denote by $(C_0)'$ the continuous dual of $C_0$ which, by the Riesz representation theorem (a.k.a. Riesz-Markov-Kakutani theorem; Villani 2010, VI-61), can be identified with $\mathcal{M}$. $L^1(\mu)$ denotes the set of $\mu$-integrable functions (i.e. verifying $\int_X |f| \, d|\mu| < \infty$) and for any such function $f$ we write $\mu(f) := \int_X f \, d\mu$. We denote by $\mathcal{M}_+$, $\mathcal{P}$ and $\mathcal{M}_0$ the subsets of $\mathcal{M}$ consisting of non-negative measures, of probability measures, and of signed measures $\mu$ such that $\mu(X) = 0$ respectively.

Definition of KMEs and MMDs. For a continuous, bounded kernel $k$ and any $\mu \in \mathcal{M}$,

$$\int_X \|k(\cdot, x)\|_k \, d|\mu|(x) = \int_X \sqrt{k(x, x)} \, d|\mu|(x) < \infty.$$

By standard properties of the so-called Bochner integral (Schwabik, 2005), the (Bochner-)integral

$$f_\mu := \int_X k(\cdot, x) \, d\mu(x)$$

is a well-defined function in the RKHS $\mathcal{H}_k$ of $k$, and all functions $f \in \mathcal{H}_k$ are $\mu$-integrable and verify what we call the Pettis property: $\mu(f) = \langle f_\mu, f \rangle_k$. In particular, for any $\mu, \nu \in \mathcal{M}$,

$$\langle \mu, \nu \rangle_k := \langle f_\mu, f_\nu \rangle_k = \mu \otimes \nu(k) \quad \text{and} \quad \|\mu\|_k^2 = \mu \otimes \mu(k),$$

where $\mu \otimes \nu$ denotes the (tensor) product measure between $\mu$ and $\nu$. The maximum mean discrepancy (MMD) $d_k(\mu, \nu)$ between $\mu$ and $\nu$ is then defined as the RKHS distance between their embeddings:

$$d_k(\mu, \nu) := \|\mu - \nu\|_k = \|f_\mu - f_\nu\|_k.$$

1. Our results extend to complex-valued functions modulo some obvious slight modifications.
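For finitely supported measures, the inner product $\langle \mu, \nu \rangle_k = \mu \otimes \nu(k)$ reduces to a finite double sum, so the MMD can be computed exactly. The following is a minimal sketch under that assumption, with a Gaussian kernel chosen purely for illustration; the helper names (`inner_product`, `mmd`, `empirical_measure`) are hypothetical and not taken from the paper.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel on the real line (assumed example kernel)."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def inner_product(mu, nu, kernel=gaussian_kernel):
    """<mu, nu>_k = (mu x nu)(k) for measures given as lists of (point, weight) pairs."""
    return sum(a * b * kernel(x, y) for x, a in mu for y, b in nu)

def mmd(mu, nu, kernel=gaussian_kernel):
    """d_k(mu, nu) = ||mu - nu||_k, expanded with the bilinearity of the inner product."""
    squared = (inner_product(mu, mu, kernel)
               + inner_product(nu, nu, kernel)
               - 2.0 * inner_product(mu, nu, kernel))
    return math.sqrt(max(squared, 0.0))  # clip tiny negative rounding errors

def empirical_measure(sample):
    """Uniform probability measure on the points of a sample."""
    return [(x, 1.0 / len(sample)) for x in sample]

mu = empirical_measure([0.0, 0.5, 1.0])
nu = empirical_measure([0.1, 0.6, 0.9, 1.2])
print(mmd(mu, nu))
```

Because weights may be negative, the same `inner_product` also covers finitely supported signed measures, which is the setting in which separation of $\mathcal{M}$ (rather than only $\mathcal{P}$) is discussed.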

Why bounded kernels? In all our results, we will assume that the kernel $k$ is bounded. One may wonder if those results could be generalized to unbounded kernels. To do so, one would need a definition of KMEs and MMDs that allows unbounded kernels. Such generalizations do exist (see e.g. Def. 1 in Simon-Gabriel and Schölkopf 2018), but they all at least require that $\mathcal{H}_k \subset L^1(\mu)$ for any embeddable measure $\mu$. But if $k$ is unbounded, then $\mathcal{H}_k$ contains an unbounded function $f$ (Simon-Gabriel and Schölkopf, 2018, Cor. 3), and therefore, it is easy to construct a probability measure $P$ such that $f \notin L^1(P)$. So $P$ does not embed into $\mathcal{H}_k$ and the MMD is not defined over all probability measures and cannot, a fortiori, metrize weak convergence there.

Equivalence of universal, characteristic and $\int$ s.p.d. kernels. Let $F$ be a normed set of functions and $D$ a subset of $\mathcal{M}$. A kernel $k$ is said to be universal to $F$ if $\mathcal{H}_k$ is a dense subset of $F$. It is characteristic to $D$ (or just characteristic when $D = \mathcal{P}$) if the KME is well-defined and injective over $D$. It is said to be integrally strictly positive definite ($\int$ s.p.d.) to $D$ (or just $\int$ s.p.d. when $D = \mathcal{M}$) if its MMD separates all measures in $D$. It will be useful to remember that a kernel is universal to $F$ (e.g. to $C_0$) if and only if it is characteristic to its dual ($(C_0)' = \mathcal{M}$) (Simon-Gabriel and Schölkopf, 2018, Thm. 6 & Tab. 1). Also, it is characteristic to a set if and only if it is $\int$ s.p.d. to that same set (which is almost immediate to see). The distinction between characteristicness and $\int$ s.p.d. is mostly due to historical reasons. We advise simply thinking in terms of separation of $D$.

2. Radon versus Borel measures

As explained in the introduction, we would like to drop the Polish assumption in Thm. 1 and focus on LCH spaces. However, while on Polish spaces all finite, signed Borel measures happen to be regular, i.e. $\mathcal{M} = \mathcal{M}_B$ by Ulam's lemma (Villani, 2010, Thm. I-54), on LCH spaces, this need not be the case. So, if we drop the Polish assumption in Thm. 1, should we focus on characteristicness to Borel or to Radon measures? The following theorem answers this question. It is a direct consequence of the proof of Thm. 3.13 in Steinwart and Ziegel (2021).

Theorem 3 (Steinwart and Ziegel 2021) Let $X$ be a locally compact Hausdorff space. If a $C_0$-kernel is characteristic to a set of measures $D \subset \mathcal{M}_B(X)$, then $D \subset \mathcal{M}(X)$. In particular, if $\mathcal{M}_B(X) \neq \mathcal{M}(X)$, then no $C_0$-kernel is characteristic to $\mathcal{M}_B(X)$.

Said differently, the biggest set of finite Borel measures that a $C_0$-kernel can be characteristic to is the set of finite Radon measures $\mathcal{M}$. However, how common is it that $\mathcal{M} = \mathcal{M}_B$? First, we note that the conclusion of Ulam's lemma ($\mathcal{M} = \mathcal{M}_B$) also holds for LCH spaces $X$ if one additionally assumes that $X$ is $\sigma$-compact, i.e., that it can be covered by at most countably many compact sets (Villani, 2010, Thm. I-56). However, how restrictive is the $\sigma$-compact assumption for an LCH space? Examples of non-$\sigma$-compact LCH spaces such as the long line (a.k.a. Alexandroff line) exist, but they may seem irrelevant to the working machine learner. In contrast, the following theorem shows that, on an LCH space, (a) considering a continuous $C_0$-universal kernel (i.e., a continuous $C_0$-kernel that is characteristic to $\mathcal{M}$) amounts to assuming that $X$ is metrizable and (b) in this context, assuming $\sigma$-compactness amounts to adding a separability assumption on $X$. The proof is in Appendix B.

Theorem 4 Let X be an LCH space.

(a) A $C_0$-universal kernel $k$ on $X$ is continuous if and only if (iff) it metrizes $X$, i.e. if $d_k(x, y) := \|k(\cdot, x) - k(\cdot, y)\|_k$ is a metric for the topology of $X$. In particular, if there exists a continuous $C_0$-universal kernel on $X$, then $X$ is metrizable.

(b) Moreover, the following are equivalent.

(i) $X$ is metrizable and separable.

(ii) $X$ is $\sigma$-compact and there exists a continuous $C_0$-universal kernel $k$ on $X$.

Point (a) shows that even if we drop the Polish assumption, whenever we consider a continuous $C_0$-universal kernel, we are still assuming that $X$ is metrizable. Point (b) adds that, if additionally $X$ is assumed to be $\sigma$-compact, then the only missing assumption for $X$ to be Polish is its completeness. To finish this section, let us discuss some of the hypotheses made in Thm. 4. First, Guilbart (1978, Thm. 4.D.I) shows that, even without the LCH assumption on $X$, (i) implies the existence of a kernel that is characteristic to $\mathcal{M}$ (i.e., to $\mathcal{M}_B$). Second, separability (or $\sigma$-compactness) is not a required condition for the existence of a $C_0$-universal kernel on a (metrizable) LCH space. For example, the discrete kernel $k_\delta(x, y) = \mathbb{1}(x = y)$ is a $C_0(\mathbb{R}_\delta)$-universal kernel over the discrete real line $\mathbb{R}_\delta$, i.e., the real line equipped with the discrete topology. (To see this, notice that the compact sets in $\mathbb{R}_\delta$ are the finite subsets of $\mathbb{R}$, and hence that $C_0(\mathbb{R}_\delta)$ is the set of real functions for which only finitely many points have an absolute value $\ge \epsilon$, whatever $\epsilon > 0$ you choose.) We do not know whether, more generally, the converse of (a) is true, i.e., whether on an LCH space, metrizability alone suffices to guarantee the existence of a $C_0$-universal kernel. Finally, we note that in some publications, the continuity assumption on $k$ is hidden in the definition of $C_0$-universality (see e.g. Thm. 2 in Steinwart et al. 2006). However, this need not be the case (see Simon-Gabriel and Schölkopf, 2018, Cor. 3 & Def. 5) and so one may wonder if a non-continuous $C_0$-universal kernel exists. We do not know.

3. Suﬃcient conditions to metrize weak convergence

We start with a lemma that extends Thm. 1. Its main message is the same: bounded, continuous, $\int$ s.p.d. kernels metrize weak convergence of probability measures. But, importantly, it drops the Polish assumption and adds a few interesting details. For one thing, it shows that weak and MMD convergence also coincide with (the a priori even weaker) vague and weak RKHS convergence. For another, it adds a form of converse: weak convergence implies MMD convergence if and only if the kernel is bounded and continuous. Since most usual kernels are bounded and continuous, this lemma also confirms what we mentioned earlier: convergence in MMD is often rather weak and can, at best, metrize weak convergence, but not convergence in total variation or KL divergence (since those are known to be strictly stronger than weak convergence).


Lemma 5 Let $k$ be an $\int$ s.p.d. kernel such that $\mathcal{H}_k \subset C_0$ and let $(P_\alpha)$ (sequence or net) and $P$ be probability measures. If $k$ is continuous, then the following are equivalent.

(i) $\|P_\alpha - P\|_k \to 0$ (convergence in strong RKHS topology)

(ii) $P_\alpha(f) \to P(f)$ for all $f \in \mathcal{H}_k$ (convergence in weak RKHS topology)

(iii) $P_\alpha(f) \to P(f)$ for all $f \in C_0$ (convergence in weak-$*$ or vague topology)

(iv) $P_\alpha(f) \to P(f)$ for all $f \in C_b$ (convergence in weak topology)

Conversely, if (iv) implies (i) for any probability measures $(P_\alpha)$ and $P$, then $k$ is continuous.

When (i) and (iv) are equivalent for all sequences of probability measures, we say that $k$ metrizes the weak convergence of probability measures.

Proof. Since $\mathcal{H}_k \subset C_0 \subset C_b$, (iv) $\Rightarrow$ (iii) $\Rightarrow$ (ii). Moreover, strong RKHS convergence implies weak RKHS convergence, that is (i) $\Rightarrow$ (ii), since $P(f) = \langle P, f \rangle_k$ for any $f \in \mathcal{H}_k$. Now assume $k$ is continuous. If (iv), then the product measures $P_\alpha \otimes P$, $P \otimes P_\alpha$ and $P_\alpha \otimes P_\alpha$ converge weakly to $P \otimes P$ (Berg et al., 1984, Thm. 2.3.3). Hence

$$\|P_\alpha - P\|_k^2 = P_\alpha \otimes P_\alpha(k) + P \otimes P(k) - P_\alpha \otimes P(k) - P \otimes P_\alpha(k) \to 0,$$

i.e. (iv) $\Rightarrow$ (i). Summing up so far: (iv) $\Rightarrow$ (i) $\Rightarrow$ (ii) and (iv) $\Rightarrow$ (iii) $\Rightarrow$ (ii). Conversely, assume (ii). Since $k$ is $\int$ s.p.d. and $\mathcal{H}_k \subset C_0$, by Cor. 3 and Thm. 8 in Simon-Gabriel and Schölkopf (2018), $\mathcal{H}_k$ is dense in $C_0$. And since $\mathcal{P}$ is a bounded subset of the dual $\mathcal{M}$ of $C_0$ (which is a Banach, hence barreled space), by Thm. 33.2 in Treves (1967), $\mathcal{P}$ is equicontinuous. So, by Prop. 32.5 in Treves (1967), (ii) implies vague convergence, i.e. (iii). Cor. 2.4.3 in Berg et al. (1984) then yields (iv). Hence the equivalence of (i) to (iv). Now assume (iv) $\Rightarrow$ (i) on $\mathcal{P}$, and suppose that $x \to \xi$ and $y \to \zeta$ in $X$. Then the Dirac point masses $\delta_x$ and $\delta_y$ converge weakly to $\delta_\xi$ and $\delta_\zeta$, which, by assumption, implies convergence in RKHS norm. Since the inner product is continuous (for the RKHS norm/topology), we get

$$k(x, y) = \langle \delta_x, \delta_y \rangle_k \to \langle \delta_\xi, \delta_\zeta \rangle_k = k(\xi, \zeta),$$

so $k$ is continuous.

Remark 6 The proof shows that (ii) and (iii) are even equivalent on any bounded subset of $\mathcal{M}$ (Treves, 1967, Prop. 32.5) (even without continuity of $k$) and that (i) to (iv) are actually equivalent on any bounded subset of $\mathcal{M}_+$ whenever $P_\alpha(X) \to P(X)$ (which is always true for probability measures).

The previous lemma gives sufficient conditions to metrize weak convergence. We now investigate whether they are necessary. To do so, we have to distinguish the case where the input space $X$ is compact, where the conditions turn out to be too strong, from the one where $X$ is locally compact but not compact (and $\mathcal{H}_k \subset C_0$), where they are necessary.


4. Necessary condition for compact input space X

When the underlying space $X$ is not just locally compact but compact, the equivalence given in Claim 2 actually turns out to hold: contrary to the general case, here, a continuous kernel only needs to separate the probability measures to also metrize their weak convergence. The reason for this difference is essentially that, because $X$ is compact, measures cannot diffuse to 0 at infinity (see Section 5).

Theorem 7 On a compact Hausdorff space, a bounded, measurable kernel metrizes the weak convergence of probability measures if and only if it is continuous and characteristic to $\mathcal{P}$.

Proof. If $k$ metrizes weak convergence, then the RKHS metric needs to separate all probability measures, i.e. $k$ is characteristic to $\mathcal{P}$. And the last sentence of Lem. 5 shows that $k$ is continuous. Conversely, if $k$ is characteristic to $\mathcal{P}$, then the kernel $\kappa := k + 1$ is $\int$ s.p.d. (Simon-Gabriel and Schölkopf, 2018, Thm. 8). Also, since $k$ is continuous, $\kappa$ is continuous. Thus $\mathcal{H}_\kappa$ is a continuous subspace of $C = C_b = C_0$ (Simon-Gabriel and Schölkopf 2018, Cor. 3 and compactness). By Lem. 5, $\kappa$ metrizes weak convergence on $\mathcal{P}$, and by Thm. 8 of Simon-Gabriel and Schölkopf (2018), $\kappa$ and $k$ induce the same metric on $\mathcal{P}$.
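The last step of the proof, that $\kappa = k + 1$ and $k$ induce the same metric on $\mathcal{P}$, can be seen from $d_\kappa(P, Q)^2 = d_k(P, Q)^2 + ((P - Q)(X))^2$ together with $P(X) = Q(X) = 1$. The following minimal sketch checks this numerically for two empirical probability measures; the Gaussian kernel, the sample values and the function names are assumptions made for illustration only.

```python
import math

def gaussian_kernel(x, y):
    """Gaussian kernel with unit bandwidth (assumed example kernel k)."""
    return math.exp(-0.5 * (x - y) ** 2)

def shifted_kernel(x, y):
    """kappa = k + 1, the shifted kernel considered in the proof."""
    return gaussian_kernel(x, y) + 1.0

def mmd_squared(sample_p, sample_q, kernel):
    """Squared MMD between the uniform empirical measures on two samples."""
    def mean_kernel(a, b):
        return sum(kernel(x, y) for x in a for y in b) / (len(a) * len(b))
    return (mean_kernel(sample_p, sample_p)
            + mean_kernel(sample_q, sample_q)
            - 2.0 * mean_kernel(sample_p, sample_q))

sample_p = [0.0, 0.2, 0.9]
sample_q = [0.1, 0.4, 0.5, 1.0]
print(mmd_squared(sample_p, sample_q, gaussian_kernel))
print(mmd_squared(sample_p, sample_q, shifted_kernel))
```

The two printed values should agree up to floating-point rounding, because the constant term contributes $1 + 1 - 2 = 0$ whenever both measures have total mass one. For signed measures with different total masses the two distances would differ.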

What is surprising here is that, on a compact space and for a continuous kernel, it suffices to separate probability measures to also metrize their weak convergence, which, a priori, may have seemed a strictly stronger requirement. We will see that when $X$ is not compact, this need not be the case.

5. Necessary condition when X is locally compact but non-compact and $\mathcal{H}_k \subset C_0$

Since the condition $\mathcal{H}_k \subset C_0$ is at the heart of this section, we would like to remind the reader that, by the following lemma (Simon-Gabriel and Schölkopf, 2018, Cor. 3), it is satisfied by many standard kernels: Gaussian, Laplacian, Matérn, inverse multiquadric kernels, etc.

Lemma 8 $\mathcal{H}_k \subset C_0$ if and only if $k$ is bounded (i.e. $\sup_{x \in X} k(x, x) < \infty$) and for all $x \in X$, $k(x, \cdot) \in C_0$.

We now turn to our main theorem, which corrects Claim 2 when $X$ is non-compact and $\mathcal{H}_k \subset C_0$.

Theorem 9 Suppose that the locally compact Hausdorff space $X$ is not compact and that, for some kernel $k$ on $X \times X$, $\mathcal{H}_k(X) \subset C_0(X)$. Then $k$ metrizes the weak convergence of probability measures if and only if $k$ is continuous and $\int$ s.p.d. (i.e. characteristic to $\mathcal{M}(X)$).

We see that, contrary to the compact case, it is not enough to separate all probability measures $\mathcal{P}$ to metrize their weak convergence: $d_k$ must separate all finite measures $\mathcal{M}$, which strictly contains $\mathcal{P}$. Moreover, Prop. 13 below confirms that there are indeed kernels that separate $\mathcal{P}$ but not $\mathcal{M}$. Hence, Thms. 7 and 9 show that, surprisingly, the converse of Sriperumbudur's Thm. 1 is generally too restrictive when $X$ is compact but does hold when it is not. Also, they confirm that the Polish assumption made in Thm. 1 is superfluous.


Remark 10 (On the significance of Thm. 9) One advantage of dropping the Polish assumption is that our result may cover more sets, e.g. non-complete ones. Besides, we believe that dropping unnecessary hypotheses helps clarify the role of each remaining assumption. However, in our view, the main contribution of Thm. 9 is its converse part, which implies that many popular kernels fail to metrize weak convergence. For example, it rules out any RKHS contained in $C_0$ that maps some probability measure(s) to 0. This has important implications for the Stein kernels adopted in Liu and Wang (2016); Jitkrittum et al. (2017); Gorham and Mackey (2017); Huggins and Mackey (2018); Feng et al. (2017); Pu et al. (2017); Chen et al. (2018, 2019); Hodgkinson et al. (2020), which, by design, map a particular target distribution to 0 and which, if one is not careful, will also induce RKHSes in $C_0$.

We now turn towards the proof. While it is almost obvious that metrization of weak convergence implies separation of $\mathcal{P}$, showing that it also implies separation of $\mathcal{M}$ will require some work and, in light of Lem. 5, is essentially all that remains to be proven. To do so, we will use the following two lemmata. The first one is a straightforward extension of a basic property of locally compact sets (every point has a compact neighborhood) from points to compact sets (every compact set has a compact neighborhood). The second shows that when $\mathcal{H}_k \subset C_0$ and $X$ is not compact, then the RKHS metric cannot prevent some positive measures from diffusing to the null measure. This will imply that if $k$ is not characteristic to all finite measures, one can construct a sequence of probability measures that converges in RKHS norm but has some of its mass diffusing to 0.

Lemma 11 Let $K$ be a compact subset of a locally compact space $X$. Then there exists an open neighborhood of $K$ with compact closure. Equivalently, there exist an open set $O$ and a compact set $K'$ in $X$ such that $K \subset O \subset K'$.

Lemma 12 Suppose that the locally compact Hausdorff space $X$ is not compact and that $k$ is continuous with $\mathcal{H}_k \subset C_0$. Then there exists a sequence of probability measures $P_n$ such that $\|P_n\|_k \to 0$. Moreover, for any compact $K \subset X$, one can additionally impose that $P_n(K) = 0$ for all $n$.

As a side remark, note that Lem. 12 complements Lem. 3.2 of Steinwart and Ziegel (2021), which states that, if the constant-1 function $\mathbb{1}$ is in $\mathcal{H}_k$, then $\mathcal{M}_0$ is $\mathcal{H}_k$-closed. In contrast, Lem. 12 from above shows that, if $\mathcal{H}_k \subset C_0$ (in which case $\mathbb{1} \notin \mathcal{H}_k$), then $\mathcal{P}$ is not $\mathcal{H}_k$-closed and neither is $\mathcal{M}_0$ (to see this, replace $(P_n)_n$ by $(P_n - P)_n$ for some arbitrary $P \in \mathcal{P}$ in Lem. 12).

Proof of Lem. 11. Since $X$ is locally compact, every point has a compact neighborhood. So let us consider the set of all compact neighborhoods of the points contained in $K$. Their interiors form an open cover of $K$, and, since $K$ is compact, a finite number of them suffices to cover $K$. Let $O$ be the finite union of these interiors and $K'$ the union of their closures (i.e., the union of the corresponding compact supersets). Then $O$ is open, $K'$ is compact, and $K \subset O \subset K'$ as advertised. We finally note that this property is equivalent to the first claim (that there exists an open neighborhood of $K$ with compact closure) as $O$ is contained in a compact set if and only if its closure is compact.


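Proof [Proof of Lem. 12] First we show that for any $\epsilon > 0$ and any integer $n > 0$, we can construct a sequence of $n$ points $x_1, \ldots, x_n$ in $X \setminus K$ such that $|k(x_i, x_j)| \le \epsilon$ for any $1 \le i \ne j \le n$. We construct it one point at a time. Choose a point $x_1 \in X \setminus K$. Since $k(\cdot, x_1) \in \mathcal{H}_k \subset C_0$, there exists a compact $K_1 \subset X$ such that $|k(x, x_1)| \le \epsilon$ for any point $x \in X \setminus K_1$. Choose $x_2$ outside of both $K$ and $K_1$, i.e. $x_2 \in X \setminus (K \cup K_1)$ (non-empty, since $K \cup K_1$ is compact and $X$ is not). There exists a compact $K_2 \subset X$ such that $|k(x, x_2)| \le \epsilon$ for any point $x \in X \setminus K_2$. Let $x_3$ be any point in $X \setminus (K \cup K_1 \cup K_2)$ (non-empty because $X$ is not compact). Continue this procedure until point $x_n$. By construction, the sequence satisfies the requirement.

Now, for any integer $n > 0$, construct a finite sequence $x_1^{(n)}, \ldots, x_n^{(n)}$ such that $|k(x_i^{(n)}, x_j^{(n)})| \le 1/n$ for any $1 \le i \ne j \le n$. Define the probability measures $P_n := \frac{1}{n} \sum_{i=1}^n \delta_{x_i^{(n)}}$. Then $P_n(K) = 0$ for all $n$, since all $x_i^{(n)} \in X \setminus K$, and, writing $x_i$ for $x_i^{(n)}$:

$$\|P_n\|_k^2 = \frac{1}{n^2} \sum_{1 \le i \le n} k(x_i, x_i) + \frac{1}{n^2} \sum_{1 \le i \ne j \le n} k(x_i, x_j) \le \frac{n}{n^2} \|k\|_\infty + \frac{n(n-1)}{n^2} \cdot \frac{1}{n} \longrightarrow 0 .$$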
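As a numerical illustration of the construction in the proof of Lem. 12 (a minimal sketch, not part of the proof), take $X = \mathbb{R}$, the Gaussian kernel, and $K = [-R, R]$. These choices and all function names below are assumptions made for the example. Points spaced at least $\sqrt{2 \log n}$ apart have pairwise kernel values at most $1/n$, so the squared norm of the uniform measure on them is at most $2/n$.

```python
import math


def gaussian_kernel(x, y):
    """Gaussian kernel on R (illustrative choice); its RKHS lies in C_0."""
    return math.exp(-0.5 * (x - y) ** 2)


def far_apart_points(n, radius):
    """Return n points outside K = [-radius, radius] with k(x_i, x_j) <= 1/n
    for i != j (up to floating-point rounding)."""
    # For the Gaussian kernel, k(x, y) <= 1/n iff |x - y| >= sqrt(2 log n).
    gap = math.sqrt(2.0 * math.log(n)) if n > 1 else 1.0
    return [radius + 1.0 + i * gap for i in range(n)]


def squared_norm_uniform(points, kernel):
    """||P_n||_k^2 for the uniform measure P_n = (1/n) sum_i delta_{x_i}."""
    n = len(points)
    return sum(kernel(x, y) for x in points for y in points) / n ** 2


if __name__ == "__main__":
    for n in (2, 10, 50, 200):
        pts = far_apart_points(n, radius=3.0)
        print(n, squared_norm_uniform(pts, gaussian_kernel))
```

The printed values decrease with $n$ even though every $P_n$ puts zero mass on $K$, which is the behaviour Lem. 12 asserts.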

Proof [Proof of Thm. 9] Lem. 5 yields the "if" part and the continuity of the kernel in the converse. Assume now that $k$ is not characteristic to $\mathcal{M}$. Then there exists a non-zero, finite measure $\mu$ such that $f_\mu = 0$. Let $\mu_+, \mu_-$ be its positive and negative parts respectively, which are mutually singular (Hahn decomposition). By renormalizing $\mu$ if needed, we can assume without loss of generality that $\mu_-(X) \le \mu_+(X) = 1$. If $\mu_-(X) = \mu_+(X)$, then $\mu_-$ and $\mu_+$ are two non-equal probability measures that are at RKHS distance 0, hence $k$ does not metrize weak convergence. So, for the sequel, assume that $\mu_-(X) < \mu_+(X)$. Now, let $K$ be a compact subset of $X$ that satisfies $\mu_+(K) \ge (\mu_-(X) + \mu_+(X))/2$, which exists because $\mu_+$ is regular and $\mu_-(X) < \mu_+(X)$. Select now an open set $O$ and a compact set $K'$ satisfying $K \subset O \subset K'$, which exist by Lem. 11. Then, since $K \subset O$, $\mu_+(O) \ge \mu_+(K)$. Let now $P_n$ be probability measures as in Lem. 12 such that $P_n(K') = 0$ (and hence $P_n(O) = 0$) for all $n$. Consider the sequence of probability measures $\mu_n := \mu_- + (1 - \mu_-(X)) P_n$. Then

$$\|\mu_n - \mu_+\|_k = \|\mu_n - \mu_-\|_k \quad (\text{because } f_{\mu_-} = f_{\mu_+})$$

$$= (1 - \mu_-(X)) \, \|P_n\|_k \to 0 ,$$

hence $\mu_n$ converges to $\mu_+$ in the RKHS metric. But $\mu_n$ does not converge weakly to $\mu_+$ since

$$\mu_+(O) \ge \mu_+(K) \ge (\mu_-(X) + \mu_+(X))/2 > \mu_-(X) \ge \mu_-(O) = \mu_n(O) ,$$

which contradicts the Portmanteau lemma ($\limsup_n \mu_n(O) \ge \mu_+(O)$).

To prove that the initial claim (Claim 2) is indeed wrong when $X$ is not compact, it remains to show that being characteristic to $\mathcal{M}$ is not equivalent to being characteristic to $\mathcal{P} \subset \mathcal{M}$, i.e. that there exists a kernel $k$ with $\mathcal{H}_k \subset C_0$ that is characteristic to $\mathcal{P}$ but not to $\mathcal{M}$. We show this under the assumption that there already exists a kernel on $X$ that is characteristic to $\mathcal{M}$, which is in particular satisfied when $X$ is metrizable and separable (Thm. 4(b) or Thm. 4.D.I in Guilbart 1978), such as when $X$ is an open subset of $\mathbb{R}^d$.


Proposition 13 If there exists a bounded continuous kernel over a locally compact Hausdorff space $X$ that is characteristic to $\mathcal{M}$, then there also exists a kernel $k$ with $\mathcal{H}_k \subset C_0(X)$ that is characteristic to $\mathcal{P}$ but not characteristic to $\mathcal{M}$. In particular, this $k$ does not metrize the weak convergence of probability measures.

Proof Let $\kappa$ be any bounded kernel over $X$ that is $\int$ s.p.d., i.e., characteristic to $\mathcal{M}$, let $\xi \in X$, and let $g \in C_0$ be such that $g(\xi) = 0$ and $g(x) > 0$ for any $x \ne \xi$. Consider $k(x, y) := g(x) \kappa(x, y) g(y)$. Then $k$ is a kernel such that $\mathcal{H}_k \subset C_0$ (Lem. 8) and $f_{\delta_\xi}$ is the null function, hence $\|\delta_\xi\|_k = 0$, so $k$ is not $\int$ s.p.d. But we will now show that $k$ is characteristic to $\mathcal{M}_0$, i.e. to $\mathcal{P}$. Indeed, let $\mu \in \mathcal{M}_0$ be such that $\iint k(x, y) \, d\mu(x) \, d\mu(y) = 0$. Since the product $g\mu$ is a finite measure and $\kappa$ is $\int$ s.p.d., the previous equality implies that $g\mu$ is the null measure. Since $g > 0$ at any $x \ne \xi$, we get $|\mu|(O) = 0$ for any open set $O \subset X \setminus \{\xi\}$. Hence the support of $\mu$ (well-defined, because $\mu$ is regular) is contained in $\{\xi\}$, i.e. $\mu$ is proportional to the Dirac point mass in $\xi$. Hence, if $\mu \in \mathcal{M}_0$, i.e. if $\mu(X) = 0$, then $\mu$ is the null measure.
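The construction $k(x, y) = g(x) \kappa(x, y) g(y)$ from this proof can be checked on finitely supported measures. The sketch below assumes $X = \mathbb{R}$, a Gaussian $\kappa$, $\xi = 0$ and $g(x) = x^2 / (1 + x^4)$; these are illustrative choices, not the paper's, and the helper names are hypothetical.

```python
import math

XI = 0.0  # assumed location of the point xi


def kappa(x, y):
    """Gaussian kernel on R (illustrative stand-in for an int-s.p.d. kernel)."""
    return math.exp(-0.5 * (x - y) ** 2)


def g(x):
    """A function in C_0 with g(XI) = 0 and g > 0 elsewhere (assumed choice)."""
    d = x - XI
    return d * d / (1.0 + d ** 4)


def k(x, y):
    """Weighted kernel k(x, y) = g(x) kappa(x, y) g(y)."""
    return g(x) * kappa(x, y) * g(y)


def squared_norm(points, weights, kernel):
    """||mu||^2 for the discrete signed measure mu = sum_i w_i delta_{x_i}."""
    return sum(wi * wj * kernel(xi, xj)
               for xi, wi in zip(points, weights)
               for xj, wj in zip(points, weights))


if __name__ == "__main__":
    # The Dirac mass at xi has norm 0 under k but not under kappa.
    print(squared_norm([XI], [1.0], k), squared_norm([XI], [1.0], kappa))
    # A zero-mass measure delta_xi - delta_1 is still separated from 0 by k.
    print(squared_norm([XI, 1.0], [1.0, -1.0], k))
```

The first line shows why $k$ cannot be $\int$ s.p.d., while the second is consistent with $k$ still separating probability measures.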

Prop. 13 has two implications. First, it shows that the metrization condition in the non-compact case is strictly stronger than in the compact case: on compact spaces, some kernels do metrize weak convergence without separating all finite signed measures. Second, combining it with Thm. 9 shows that the alleged proof of Claim 2 must be flawed. Another confirmation will be given by point (i) in Cor. 15, with an explicit counter-example constructed in its proof. However, to strengthen our claim, we now explicitly point out the flaw in the proof of Claim 2 by Simon-Gabriel and Schölkopf (2018).

5.1 Flaw in the proof of Claim 2 of Simon-Gabriel and Schölkopf

The flaw in the proof of Theorem 12 of Simon-Gabriel and Schölkopf (2018) (our Claim 2) resides in their auxiliary Lemma 20, which is essentially our Lem. 5, but without the assumption $\mathcal{H}_k \subset C_0$. Their proof essentially consists in saying that, since $(P_\alpha)$ (denoted $(\mu_\alpha)$ there) is bounded, it is relatively vaguely compact, so one can extract a subnet $(P_\beta)$ that converges vaguely to a measure $P'$ (denoted $\mu'$ there). They then try to identify the vague limit $P'$ with the MMD- (or weak RKHS-) limit $P$ (denoted $\mu$ there) of the original net $(P_\alpha)$, by arguing that weak and vague convergence coincide on $\mathcal{P}$, and that weak convergence implies MMD-convergence. Unfortunately, $\mathcal{P}$ is not closed in $\mathcal{M}$ for the vague topology, so nothing guarantees a priori that $P' \in \mathcal{P}$. And if $P' \notin \mathcal{P}$, then vague convergence to $P'$ does not imply weak convergence to $P'$ (Berg et al., 1984, Thm. 2.4.2), which is why the proof fails irremediably.

We can go further and exhibit a counter-example for the previous failure, i.e. a bounded, continuous, $\int$ s.p.d. kernel and a sequence $(P_n)$ that converges to $P \in \mathcal{P}$ in MMD, but converges vaguely to another measure $P' \ne P$ in $\mathcal{M}$. Indeed, consider the kernel $\kappa := k + 1$ from the proof of Cor. 15(i) below. Let $K$ be a compact neighborhood of $\xi$ (which exists because $X$ is locally compact) and choose a sequence $(P_n) \subset \mathcal{P}$ as in Lem. 12, i.e. such that $\|P_n\|_k \to 0$ and $P_n(K) = 0$ for all $n$. By using the vague compactness of $B_+ := \{\mu \in \mathcal{M}_+ \mid \mu(X) \le 1\}$ (Berg et al., 1984, Prop. 2.4.6) and extracting a subsequence if needed, we may assume that $(P_n)$ converges vaguely to a measure $P' \in B_+$. Applying Urysohn's lemma (Villani, 2010, Thm. I-33) to the compact set $\{\xi\}$ and an open neighborhood $O \subset K$ of $\xi$, we get a continuous function $f$ whose support is contained in $K$ and such that $f(\xi) = 1$. Since $f \in C_0$ and $P_n(f) = 0 < 1 = f(\xi) = \delta_\xi(f)$, $P_n$ does not converge vaguely to $\delta_\xi$, i.e. $P' \ne \delta_\xi$. Now $\kappa$ is bounded, continuous and $\int$ s.p.d., and induces the same metric as $k$ on $\mathcal{P}$. So, since the KME of $k$ maps the Dirac measure $\delta_\xi$ to the null function in $\mathcal{H}_k$ (see proof of Prop. 13), we get

$$\|P_n - \delta_\xi\|_\kappa = \|P_n - \delta_\xi\|_k = \|P_n\|_k \to 0 .$$

Hence $(P_n) \to \delta_\xi$ in MMD, but $(P_n)$ converges vaguely to a different measure $P'$.

Remark 14 The sequence $(P_n)$ converges neither weakly to $P'$ nor weakly to $\delta_\xi$, since weak convergence would imply vague and MMD convergence to the same limit, i.e. would imply $P' = \delta_\xi$. Hence $P'(X) \ne 1$ (otherwise, vague convergence would imply weak convergence, since both coincide on $\mathcal{P}$ (Berg et al., 1984, Cor. 2.4.3)), and since $P' \in B_+$, we get $P'(X) < 1$. So $(P_n)$ illustrates a phenomenon called mass escaping at infinity, which vague convergence, contrary to weak convergence, cannot prevent.
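The escaping-mass phenomenon can be observed numerically. The sketch below is an illustration under stated assumptions, not the paper's construction verbatim: $X = \mathbb{R}$, $\xi = 0$, $K = [-1, 1]$, a Gaussian base kernel, the weight $g(x) = x^2/(1 + x^4)$, and a triangular bump in place of the Urysohn function $f$; all names are hypothetical.

```python
import math

XI = 0.0  # assumed point xi; K = [XI - 1, XI + 1] is its compact neighborhood


def g(x):
    """Assumed weight in C_0 with g(XI) = 0 and g > 0 elsewhere."""
    d = x - XI
    return d * d / (1.0 + d ** 4)


def k(x, y):
    """Kernel with H_k in C_0 for which the embedding of delta_xi is zero."""
    return g(x) * math.exp(-0.5 * (x - y) ** 2) * g(y)


def kappa(x, y):
    """kappa = k + 1, as in the counter-example."""
    return k(x, y) + 1.0


def bump(x):
    """Continuous function supported in K with bump(XI) = 1."""
    return max(0.0, 1.0 - abs(x - XI))


def uniform_escaping(n):
    """Atoms and weights of a uniform measure P_n supported outside K."""
    gap = math.sqrt(2.0 * math.log(n)) if n > 1 else 1.0
    return [2.0 + i * gap for i in range(n)], [1.0 / n] * n


def squared_mmd_to_dirac(n, kernel):
    """Squared MMD between P_n and delta_xi for the given kernel."""
    points, weights = uniform_escaping(n)
    atoms, signed = points + [XI], weights + [-1.0]
    return sum(a * b * kernel(p, q)
               for p, a in zip(atoms, signed)
               for q, b in zip(atoms, signed))


def integral_of_bump(n):
    """P_n(f) for the bump function f."""
    points, weights = uniform_escaping(n)
    return sum(w * bump(p) for p, w in zip(points, weights))


if __name__ == "__main__":
    for n in (10, 200):
        print(n, squared_mmd_to_dirac(n, kappa), integral_of_bump(n))
```

The MMD to $\delta_\xi$ shrinks with $n$, while $P_n(f)$ stays at 0 and $\delta_\xi(f) = 1$, so the sequence cannot converge vaguely (let alone weakly) to $\delta_\xi$.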

6. General case: $X$ locally compact but non-compact and $\mathcal{H}_k \not\subset C_0$

All previous sections assumed that $\mathcal{H}_k \subset C_0$ (automatically satisfied when $k$ is continuous and $X$ is compact). So one may naturally wonder whether this assumption could be dropped without replacement or at least extended. Cor. 15 shows that dropping it without replacement is not possible; but Cor. 17 proposes a slight extension.

Corollary 15 Let $X$ be a locally compact Hausdorff space that is not compact and for which there exists a $C_0$-universal kernel (such as when $X$ is metrizable and separable). Then

(i) there exists a bounded continuous kernel that is $\int$ s.p.d., but does not metrize the weak convergence of probability measures;

(ii) there exists a bounded, continuous, characteristic (to $\mathcal{P}$) kernel that is not $\int$ s.p.d. but metrizes the weak convergence of probability measures.

Remark 16 Note, however, that some kernels with non-vanishing RKHS functions do satisfy the characterization of Thm. 9. For example, Thm. 9 extends to any kernel of the form $k_c = k + c$ for $c > 0$ and $\mathcal{H}_k \subset C_0$, since $k_c$ and $k$ induce the same MMD on $\mathcal{P}$.

Proof (i) By assumption, there exists a $C_0$-universal kernel. Since that kernel is continuous and characteristic to $\mathcal{M}$ (see Section 1.4), by Prop. 13, there also exists a kernel $k$ that is characteristic to $\mathcal{P}$ but not characteristic to $\mathcal{M}$, with $\mathcal{H}_k \subset C_0$. Consider the new kernel $\kappa := k + 1$. Then $\kappa$ is $\int$ s.p.d. (Simon-Gabriel and Schölkopf, 2018, Thm. 8), but $\kappa$ induces the same metric as $k$ on the set of probability measures $\mathcal{P}$. Since $k$ does not metrize their weak convergence (Prop. 13), neither does $\kappa$.

(ii) Let $\xi$ be a point in $X$. Let $k$ be a $C_0$-universal kernel on $X$. Then $k$ is characteristic to $\mathcal{M}$ (Section 1.4), so, by Thm. 9, $k$ metrizes the weak convergence over $\mathcal{P}$. Now, consider the kernel $\kappa(x, y) := \langle \delta_x - \delta_\xi , \delta_y - \delta_\xi \rangle_k$. $\kappa$ is not $\int$ s.p.d. (since the KME of $\delta_\xi$ is the null function) but it induces the same RKHS metric as $k$ on $\mathcal{P}$, that is, $\|P - Q\|_\kappa = \|P - Q\|_k$ for any $P, Q \in \mathcal{P}$. Hence $\kappa$ metrizes weak convergence on $\mathcal{P}$. (Remark: this implies that $\mathcal{H}_\kappa \not\subset C_0$, which is also easy to check directly.) The existence of a $C_0$-universal kernel when $X$ is a metrizable and separable LCH space is given by Thm. 4(b).


Let us mention that, in a side remark, Guilbart (1978, p. 18) already exhibits a theoretical construction of kernels on $\mathbb{R}$ that are $\int$ s.p.d. but do not metrize weak convergence. Hence, Claim 2 was actually disproved before being written.

We finish with a slight generalization of Thm. 9 that encompasses some kernels whose RKHS is not contained in $C_0$. The result builds on the same idea as the proof of Cor. 15(ii).

Corollary 17 Suppose that $X$ is not compact and that $\mathcal{H}_k \subset C_0$. Fix $a \ge 0$ and $P \in \mathcal{P}$ and define

$$k_P^a(x, y) := \langle \delta_x - P , \delta_y - P \rangle_k + a = (\delta_x - P) \otimes (\delta_y - P)(k) + a .$$

Then $k_P^a$ metrizes weak convergence of probability measures if and only if $k$ is continuous and $\int$ s.p.d.

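Proof Since $k_P^a(x, y) = k(x, y) - f_P(x) - f_P(y) + \|P\|_k^2 + a$, and since the terms that depend on at most one variable integrate to zero against $(S - T) \otimes (S - T)$ whenever $(S - T)(X) = 0$, for any probability measures $S, T \in \mathcal{P}$, we get

$$\|S - T\|_{k_P^a}^2 = (S - T) \otimes (S - T)(k_P^a) = (S - T) \otimes (S - T)(k) = \|S - T\|_k^2 .$$

Hence $k$ and $k_P^a$ define the same metric on $\mathcal{P}$ and Thm. 9 concludes.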
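The identity in this proof is easy to check on finitely supported probability measures. The following sketch assumes a Gaussian $k$ on $\mathbb{R}$ and a discrete $P$; the helper names and the specific measures are hypothetical and chosen only for illustration.

```python
import math


def k(x, y):
    """Gaussian kernel on R (illustrative choice of base kernel)."""
    return math.exp(-0.5 * (x - y) ** 2)


def make_shifted_kernel(p_points, p_weights, a):
    """Build k_P^a(x, y) = k(x, y) - f_P(x) - f_P(y) + ||P||_k^2 + a
    for a discrete probability measure P = sum_i p_i delta_{z_i}."""
    def embed(x):
        # Kernel mean embedding f_P evaluated at x.
        return sum(w * k(x, z) for z, w in zip(p_points, p_weights))

    norm_sq = sum(w * embed(z) for z, w in zip(p_points, p_weights))

    def shifted(x, y):
        return k(x, y) - embed(x) - embed(y) + norm_sq + a

    return shifted


def squared_mmd(xs, ws, ys, vs, kernel):
    """Squared MMD between sum_i w_i delta_{x_i} and sum_j v_j delta_{y_j}."""
    atoms = list(xs) + list(ys)
    signed = list(ws) + [-v for v in vs]
    return sum(a * b * kernel(p, q)
               for p, a in zip(atoms, signed)
               for q, b in zip(atoms, signed))


if __name__ == "__main__":
    S = ([0.0, 1.0, 2.0], [1.0 / 3.0] * 3)
    T = ([-1.0, 3.0], [0.5, 0.5])
    shifted = make_shifted_kernel([0.5, 4.0], [0.3, 0.7], a=2.0)
    print(squared_mmd(*S, *T, k), squared_mmd(*S, *T, shifted))
```

Both printed values agree up to rounding, since the extra terms of $k_P^a$ cancel on differences of probability measures.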

7. Conclusion

MMDs are at the heart of machine learning solutions to a variety of fundamental tasks including two-sample testing, sample quality measurement and goodness-of-fit testing, learning generative models, de novo sampling and quadrature, importance sampling, and thinning. While these applications benefit from the tractability of MMDs compared to more classical probability metrics, the validity of their results depends critically on the MMD's ability to ensure weak convergence. Simon-Gabriel and Schölkopf (2018) developed their Theorem 12 to provide a complete characterization of weak-convergence metrization for MMDs with bounded continuous kernels. However, our work shows that their characterization was incorrect and provides an alternative result that fully characterizes the weak-convergence metrization of MMDs with bounded $C_0$ kernels. Surprisingly, we find that the compact and non-compact cases are inherently different, the latter requiring strictly stronger conditions for the metrization. This suggests that the question of weak-convergence metrization by MMDs is more subtle than was previously thought. Our main results can also be seen as a converse to Sriperumbudur's Thm. 1; in particular, they show that many popular kernels, particularly Stein kernels, can fail to metrize weak convergence if one is not careful enough. In that spirit, we hope that our work will inform the selection of appropriate kernels and MMDs in the future and launch new inquiries into the metrization properties of other classes of MMDs.


Acknowledgments

CJSG was supported by the ETH Foundations of Data Science postdoctoral fellowship and is associate fellow of the Center for Learning Systems (ETH/MPI Tübingen). AB was supported by the Department of Engineering at the University of Cambridge, and this material is based upon work supported by, or in part by, the U.S. Army Research Laboratory and the U. S. Army Research Oﬃce, and by the U.K. Ministry of Defence and the U.K. Engineering and Physical Sciences Research Council (EPSRC) under grant number [EP/R018413/2]. We declare no conﬂict of interests.

References

Charalambos D. Aliprantis and Kim C. Border. Infinite Dimensional Analysis: A Hitchhiker's Guide. Springer, 3rd edition, 2006.

Abdul Fatir Ansari, Jonathan Scarlett, and Harold Soh. A characteristic function approach to deep implicit generative modeling. In CVPR, 2020.

Christian Berg, Jens P. R. Christensen, and Paul Ressel. Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions. Springer, 1984.

Francois-Xavier Briol, Alessandro Barp, Andrew B Duncan, and Mark Girolami. Statistical inference for generative models with maximum mean discrepancy. arXiv:1906.05944, 2019.

Wilson Y. Chen, Lester Mackey, Jackson Gorham, François-Xavier Briol, and Chris J. Oates. Stein points. In ICML, 2018.

Wilson Ye Chen, Alessandro Barp, François-Xavier Briol, Jackson Gorham, Mark Girolami, Lester Mackey, Chris Oates, et al. Stein point Markov chain Monte Carlo. In ICML, 2019.

Yutian Chen, Max Welling, and Alex Smola. Super-samples from kernel herding. In UAI, 2010.

Ilya Chevyrev and Harald Oberhauser. Signature moments to characterize laws of stochastic processes. Journal of Machine Learning Research, 23(176):1–42, 2022.

Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton. A kernel test of goodness of fit. In NeurIPS, 2016.

John B. Conway. A Course in Functional Analysis. Springer, New York, 2nd edition, 1994.

Gintare K. Dziugaite, Daniel M. Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In UAI, 2015.

Yihao Feng, Dilin Wang, and Qiang Liu. Learning to draw samples with amortized Stein variational gradient descent. In UAI, 2017.

Futoshi Futami, Zhenghang Cui, Issei Sato, and Masashi Sugiyama. Bayesian posterior approximation via greedy particle optimization. In AAAI, 2019.


Jackson Gorham and Lester Mackey. Measuring sample quality with kernels. In ICML, 2017.

Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. JMLR, 13:723–773, 2012.

Christian Guilbart. Etude des Produits Scalaires sur l'Espace des Mesures: Estimation par Projections. PhD thesis, Université des Sciences et Techniques de Lille, 1978.

Liam Hodgkinson, Robert Salomone, and Fred Roosta. The reproducing Stein kernel approach for post-hoc corrected sampling. arXiv:2001.09266, 2020.

Wassily Hoeffding. Asymptotically optimal tests for multinomial distributions. The Annals of Mathematical Statistics, pages 369–401, 1965.

Jonathan Huggins and Lester Mackey. Random feature Stein discrepancies. In NeurIPS, 2018.

Ferenc Huszár and David Duvenaud. Optimally-weighted herding is bayesian quadrature. In UAI, 2012.

Wittawat Jitkrittum, Wenkai Xu, Zoltan Szabo, Kenji Fukumizu, and Arthur Gretton. A linear-time kernel goodness-of-fit test. In NeurIPS, 2017.

Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabas Poczos. MMD GAN: Towards deeper understanding of moment matching network. In NeurIPS, 2017.

Qiang Liu and Jason D. Lee. Black-box importance sampling. In AISTATS, 2017.

Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In NeurIPS, 2016.

Qiang Liu, Jason Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In ICML, 2016.

Nate Eltredge. Stackoverflow proof. https://math.stackexchange.com/questions/3346313/prove-that-c-0x-is-separable-given-that-x-is-locally-compact-metric-space, 2019. Accessed: 2023-03-10.

Yuchen Pu, Zhe Gan, Ricardo Henao, Chunyuan Li, Shaobo Han, and Lawrence Carin. VAE learning via Stein variational gradient descent. In NeurIPS, 2017.

Marina Riabiz, Wilson Chen, Jon Cockayne, Pawel Swietach, Steven A Niederer, Lester Mackey, Chris Oates, et al. Optimal thinning of MCMC output. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84(4):1059–1081, 2022.

Štefan Schwabik. Topics in Banach Space Integration. Number 10 in Series in Real Analysis. World Scientiﬁc, 2005.

C.-J. Simon-Gabriel and B. Schölkopf. Kernel Distribution Embeddings: Universal Kernels, Characteristic Kernels and Kernel Metrics on Distributions. JMLR, 2018.


Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A Hilbert space embedding for distributions. In ALT, 2007.

Bharath K. Sriperumbudur. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3):1839–1893, 2016.

Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. JMLR, 11:1517–1561, 2010.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. Information Science and Statistics. Springer, 2008.

Ingo Steinwart and Johanna F. Ziegel. Strictly proper kernel scores and characteristic kernels on compact spaces. Applied and Computational Harmonic Analysis, 51:510–542, 2021.

Ingo Steinwart, Don Hush, and Clint Scovel. Function classes that approximate the bayes risk. In COLT, 2006.

Dougal J. Sutherland, Hsiao-Yu Tung, Heiko Strathmann, Soumyajit De, Aaditya Ramdas, Alex Smola, and Arthur Gretton. Generative models and model criticism via optimized maximum mean discrepancy. In ICLR, 2017.

François Treves. Topological Vector Spaces, Distributions and Kernels. Academic Press, 1967.

Cédric Villani. Intégration et analyse de Fourier. ENS de Lyon, 2010.

Shengyu Zhu, Biao Chen, Pengfei Yang, and Zhitang Chen. Universal hypothesis testing with kernels: Asymptotically optimal tests for goodness of ﬁt. In AISTATS, 2019.

Shengyu Zhu, Biao Chen, Zhitang Chen, and Pengfei Yang. Asymptotically optimal one- and two-sample testing with kernels. IEEE Transactions on Information Theory, 67(4):2074–2092, 2021.


Appendix A. Translation of Some Results from Non-English References

For the convenience of the reader, we translate here some of the important results from Villani 2010 that we cite, since the original manuscript is in French.

Theorem 18 (Urysohn's Lemma. Translation of Thm. I-33 in Villani 2010) Let $X$ be a locally compact Hausdorff space, $O$ an open and $K$ a compact subset of $X$, with $K \subset O$. Then there exists a continuous function $f$ with values in $[0, 1]$, that is equal to the constant 1 on a neighborhood of $K$, and whose support is compact and included in $O$. In particular,

$$\mathbb{1}_K \le f \le \mathbb{1}_O ,$$

where $\mathbb{1}_K$ and $\mathbb{1}_O$ designate the functions that are equal to 1 on $K$ and $O$ respectively, and 0 otherwise.

Theorem 19 (Ulam's Lemma. Thm. I-54 in Villani 2010) Let $X$ be a Polish space equipped with a $\sigma$-finite non-negative Borel measure $\mu$ (i.e., $X$ is the countable union of sets $A_k$ that satisfy $\mu(A_k) < \infty$). Then $\mu$ is regular (and concentrated on a $\sigma$-compact set).

Theorem 20 (Ulam's Lemma for LCH spaces. Thm. I-56 in Villani 2010) Let $X$ be an LCH space where every open set is $\sigma$-compact, equipped with a non-negative Borel measure $\mu$ that is finite on compact sets. Then $\mu$ is regular.

Theorem 21 (Riesz-Markov-Kakutani Representation. Thm. VI-61 in Villani 2010) Let $X$ be an LCH space. Then one can identify (i.e., find an isometric bijection between)

the continuous linear forms $\Lambda$ on the space $C_0(X)$ of continuous functions on $X$ that converge to 0 at infinity, equipped with the supremum norm;

the set of signed, regular, finite Borel measures on $X$, i.e., measures that can be written as $\mu_+ - \mu_-$, where $\mu_+$ and $\mu_-$ are non-negative, regular, finite Borel measures which are orthogonal to each other;

via the following formula:

$$\Lambda f = \int f \, d\mu := \int f \, d\mu_+ - \int f \, d\mu_- .$$

In short, the dual of $C_0$ is $\mathcal{M}(X)$: $(C_0)' = \mathcal{M}(X)$.

Definition 22 (Radon Measures. Def. VI-66 in Villani 2010) Let $X$ be an LCH space equipped with its Borel $\sigma$-algebra, and let $\Omega$ be an open set in $X$. A measure on $\Omega$ is a Radon measure if it is signed, locally finite (i.e., finite on any compact in $\Omega$) and regular.

Appendix B. Proof of Thm. 4

B.1 Proof of Point (a)

By Lem. 5 below, if $k$ is continuous, then it metrizes the weak-$*$ topology on the set $\mathcal{P}$ of Radon probability measures. But, by Theorem V.5.1 in Conway (1994), $X$ is homeomorphic to the subset of Dirac measures in $\mathcal{P}$, i.e. to $\{\delta_x \mid x \in X\}$, when equipped with the weak-$*$ topology. Hence $d_k(x, y) = \|\delta_x - \delta_y\|_k$ metrizes $X$. Conversely, if $k$ metrizes $X$, then Lemma 4.29 (point (iv) $\Rightarrow$ (i)) in Steinwart and Christmann (2008) shows that $k$ is continuous.


B.2 Proof of Point (b)

To prove (b), we are going to prove the following, more complete set of equivalences.

Theorem 23 On an LCH space $X$ the following are equivalent.

(i) $X$ is metrizable and separable.

(ii) $X$ is $\sigma$-compact and there exists a continuous $C_0$-universal kernel $k$ on $X$.

(iii) $X$ is second countable.

(iv) $C_0(X)$ is separable.

Proof (iii) $\Leftrightarrow$ (i). Since $X$ is LCH, $X$ is completely regular (Aliprantis and Border, 2006, Cor. 2.74) and hence regular. Urysohn's metrization theorem concludes (Aliprantis and Border, 2006, Thm. 3.40).

(i) $\Leftrightarrow$ (iv). An LCH space is completely regular (Aliprantis and Border, 2006, Cor. 2.74). So, if $X$ is compact, then Theorem V.6.6 by Conway (1994) concludes. Otherwise, let $X_\infty$ be the one-point compactification of $X$ (Aliprantis and Border, 2006, Thm. 2.72). We saw above that $X$ is metrizable and separable iff $X$ is second countable, and, by Thm. 3.44 of the same reference, the latter holds iff $X_\infty$ is metrizable. Since $X_\infty$ is compact and Hausdorff (hence completely regular), this is equivalent to $C_b(X_\infty)$ (equipped with its canonical supremum norm topology) being separable (Conway, 1994, Thm. V.6.6), which in turn happens iff its hyperplane $H := \{f \in C_b(X_\infty) : f(\infty) = 0\}$ is separable as well. Conclude by noting that $H$ is homeomorphic to $C_0(X)$.

(iv) $\Rightarrow$ (ii). We have already shown that, since $C_0$ is separable, $X$ is second countable, which, in turn, implies that $X$ is $\sigma$-compact (Aliprantis and Border, 2006, Lem. 2.76). To show that there exists a universal kernel, we now follow the proof of Thm. 2 in Steinwart et al. (2006). Let $\{f_n\}_n$ be an at most countable dense subset of $C_0$. For any integer $n \ge 0$, define $\Phi_n(x) := 2^{-n} f_n(x) / \|f_n\|_\infty$ if $f_n \ne 0$ and $\Phi_n := 0$ otherwise. Then, clearly, $\Phi(x) := (\Phi_n(x))_n$ satisfies $\Phi(x) \in \ell_2$ for all $x \in X$, hence $k(x, y) := \langle \Phi(x) , \Phi(y) \rangle_{\ell_2}$, where $x, y \in X$, defines a kernel on $X$ with feature map $\Phi : X \to \ell_2$. Fix $f \in C_0$ and $\epsilon > 0$. There exists an integer $n$ such that $\|f_n - f\|_\infty \le \epsilon$. Define $w := 2^n \|f_n\|_\infty e_n$, where $(e_n)_n$ is the canonical orthonormal basis of $\ell_2$. This gives $\langle w , \Phi(x) \rangle = f_n(x)$ for all $x \in X$, and since $\{\langle v , \Phi(\cdot) \rangle : v \in \ell_2\}$ is the RKHS of $k$ (Steinwart and Christmann, 2008, Thm. 4.21), we obtain the universality of $k$.

It remains to be shown that $k$ is continuous. To do so, we show that $\Phi$ is continuous. Indeed, let $(x_\alpha)_\alpha$ be a net that converges to $x$ in $X$. Fix $\epsilon > 0$. Then

$$\|\Phi(x_\alpha) - \Phi(x)\|_2^2 = \sum_{n \ge 0} |\Phi_n(x_\alpha) - \Phi_n(x)|^2 = \sum_{n \ge 0} \frac{1}{2^{2n}} |g_n(x_\alpha) - g_n(x)|^2 ,$$

where we defined $g_n(x) := f_n(x) / \|f_n\|_\infty$ (or 0 if $f_n = 0$). Since $|g_n| \le 1$, the summands satisfy $|g_n(x_\alpha) - g_n(x)|^2 / 2^{2n} \le 1 / 2^{2(n-1)}$ for $n \ge 1$. So let $N$ be an integer such that $\sum_{n > N} 1 / 2^{2(n-1)} \le \epsilon / 2$ and let $A$ be an index such that $\sum_{n=0}^{N} |g_n(x_\alpha) - g_n(x)|^2 / 2^{2n} \le \epsilon / 2$ whenever $\alpha > A$ (which exists, since we are considering a finite sum of continuous functions). The continuity then follows from the following bound, valid whenever $\alpha > A$:

$$\|\Phi(x_\alpha) - \Phi(x)\|_2^2 \le \sum_{n=0}^{N} \frac{1}{2^{2n}} |g_n(x_\alpha) - g_n(x)|^2 + \sum_{n > N} \frac{1}{2^{2(n-1)}} \le \epsilon .$$
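The truncation index $N$ used in this continuity argument can be made explicit, since the tail is a geometric series: $\sum_{n > N} 2^{-2(n-1)} = \tfrac{4}{3} \, 4^{-N}$. The short sketch below (function names are hypothetical) computes the smallest $N$ that makes the tail at most $\epsilon / 2$.

```python
def tail_bound(N):
    """Closed form of sum_{n > N} 1 / 2^(2(n-1)) = (4/3) * 4^(-N)."""
    return (4.0 / 3.0) * 4.0 ** (-N)


def smallest_truncation(eps):
    """Smallest integer N >= 0 such that tail_bound(N) <= eps / 2."""
    N = 0
    while tail_bound(N) > eps / 2.0:
        N += 1
    return N


if __name__ == "__main__":
    for eps in (1.0, 0.1, 0.001):
        N = smallest_truncation(eps)
        print(eps, N, tail_bound(N))
```

Because the tail decays like $4^{-N}$, the required $N$ grows only logarithmically in $1/\epsilon$.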


(ii) $\Rightarrow$ (iv). We adapt the proof of Thm. 2, point (i) $\Rightarrow$ (ii), given by Steinwart et al. (2006). Let $k$ be a continuous universal kernel on $X$ and let $\Phi : X \to \mathcal{H}_k$, $x \mapsto k(\cdot, x)$ be its canonical feature map. Then $\Phi$ is continuous (Steinwart and Christmann, 2008, Lem. 4.29). Since $X$ is $\sigma$-compact, let $(K_i)_i$ be an at most countable compact cover of $X$. For each $i$, $\Phi(K_i)$ is compact, and, since $\mathcal{H}_k$ is a metric space, $\Phi(K_i)$ is separable. Hence $\Phi(X) = \bigcup_i \Phi(K_i)$ is separable, and consequently, so is $\mathcal{H}_k = \mathrm{cl}(\mathrm{span}\, \Phi(X))$, the closed span of $\Phi(X)$ in $\mathcal{H}_k$. Since $\mathcal{H}_k$ is dense in $C_0$, we then obtain that $C_0$ is separable.

Alternative proof of (ii) $\Rightarrow$ (iv). Point (a) shows that $X$ is metrizable. Conclude by noting that a $\sigma$-compact metrizable space is separable, since it can be covered by countably many compact sets and, in a metrizable space, every compact set is separable. So (i) holds, which implies (iv) as shown above.

Proof [Alternative proof of (i) $\Leftrightarrow$ (iv)] (i) $\Rightarrow$ (iv). We will adapt the proof given by Conway (1994, Thm. V.6.6) for compact spaces. Let $d$ be a metric that metrizes the topology of $X$. Since $X$ is separable, let $(x_k)_k$ be a dense sequence in $X$. For any positive integer $n$, let $B_k^n$ be the open ball of radius $1/n$ centered on $x_k$. For any $n$, $(B_k^n)_k$ is an open cover of $X$. Since $X$ is a metric space, apply Theorems 3.22 and 2.90 in Aliprantis and Border (2006) to construct a continuous locally finite partition of unity $(f_k^n)_k$ subordinated to $(B_k^n)_k$ (see Def. 2.89 therein). Let $Y$ be the rational linear span of $(f_k^n)_{k,n}$, i.e., the finite linear combinations of functions $f_k^n$ with rational coefficients. $Y$ is countable. We will show that $Y$ is dense in $C_0(X)$. Fix $f \in C_0$ and $\epsilon > 0$. Since $f$ vanishes at infinity, it is uniformly continuous. So there is a $\delta > 0$ such that $|f(x) - f(y)| \le \epsilon / 2$ whenever $d(x, y) < \delta$. Choose $n > 1/\delta$. Consider the cover $(B_k^n)_k$. If $x \in B_k^n$, then $d(x, x_k) < 1/n < \delta$; hence $|f(x) - f(x_k)| \le \epsilon / 2$. Let $\alpha_k$ be a rational number such that $|\alpha_k - f(x_k)| \le \epsilon / 2$. Let $g := \sum_k \alpha_k f_k^n$; so $g \in Y$. For every $x \in X$,

$$|f(x) - g(x)| = \Big| \sum_k \big( f(x) f_k^n(x) - \alpha_k f_k^n(x) \big) \Big| \le \sum_k |f(x) - \alpha_k| \, f_k^n(x) .$$

Examine each of these summands. If $x \in B_k^n$, then $|f(x) - \alpha_k| \le |f(x) - f(x_k)| + |f(x_k) - \alpha_k| \le \epsilon$; otherwise $f_k^n(x) = 0$. In both cases $|f(x) - \alpha_k| f_k^n(x) \le \epsilon f_k^n(x)$, hence $|f(x) - g(x)| \le \epsilon \sum_k f_k^n(x) = \epsilon$. Thus $\|f - g\|_\infty \le \epsilon$ and $Y$ is dense in $C_0$. Hence $C_0$ is separable.

(iv) $\Rightarrow$ (i). We will prove that, if $C_0$ is separable, then $X$ is second countable, which concludes, since we already showed that (iii) and (i) are equivalent. The proof follows Nate Eltredge (2019). Let $\{f_n\}_n$ be a countable dense subset of $C_0$, and for each $n$ let $U_n = \{x \in X : f_n(x) > 1/2\}$, which is an open subset of $X$. We claim that $\{U_n\}_n$ is a countable base for the topology of $X$. For let $x \in X$ and let $V$ be an open neighborhood of $x$. Then by Urysohn's lemma for locally compact Hausdorff spaces, there exists a function $f$ compactly supported inside $V$ with $f(x) = 1$. In particular $f \in C_c(X) \subset C_0$, so by density, we can find some $f_n$ with $\|f - f_n\|_\infty < 1/2$. Then we have $f_n(x) > 1/2$, so $x \in U_n$. Moreover, if $y \in U_n$ then $f_n(y) > 1/2$ and so $f(y) > 0$, which implies $y \in V$. Therefore $U_n \subset V$. This proves that $\{U_n\}_n$ is a base.
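The partition-of-unity approximation in the (i) $\Rightarrow$ (iv) argument can be sketched in one dimension. The example below is a simplification under stated assumptions: $X = \mathbb{R}$, a uniform grid of centers $jh$ stands in for the dense sequence $(x_k)_k$, hat functions supported in balls of radius $h$ play the role of $(f_k^n)_k$, and $f(x) = e^{-x^2}$ is an arbitrary test function in $C_0$. All names are hypothetical.

```python
import math
from fractions import Fraction


def f(x):
    """Test function in C_0(R) (illustrative choice)."""
    return math.exp(-x * x)


def hat(x, center, h):
    """Hat function supported in the open ball of radius h around center.
    The hats centered at the grid points j * h sum to 1 everywhere."""
    return max(0.0, 1.0 - abs(x - center) / h)


def rational_coefficient(value, denom=1000):
    """Rational number within 1 / (2 * denom) of value."""
    return Fraction(round(value * denom), denom)


def approximate(x, h):
    """Evaluate g(x) = sum_j alpha_j hat_j(x) with rational alpha_j close to
    f(j * h). Only the two hats around x can be non-zero."""
    j = math.floor(x / h)
    total = 0.0
    for index in (j, j + 1):
        center = index * h
        total += float(rational_coefficient(f(center))) * hat(x, center, h)
    return total


if __name__ == "__main__":
    grid = [-5.0 + 0.013 * i for i in range(770)]
    print(max(abs(f(x) - approximate(x, h=0.05)) for x in grid))
```

With $h = 0.05$ the observed sup-error on the test grid stays well below $\epsilon = 0.1$, in line with the bound $|f(x) - g(x)| \le \epsilon$ derived in the proof.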