# (De)-regularized Maximum Mean Discrepancy Gradient Flow

Journal of Machine Learning Research 26 (2025) 1-77. Submitted 9/24; Revised 8/25; Published 10/25.

Zonghao Chen (zonghao.chen.22@ucl.ac.uk), Department of Computer Science, University College London, London, WC1V 6LJ, UK
Aratrika Mustafi (abm6733@psu.edu), Department of Statistics, Pennsylvania State University, University Park, PA, 16802, USA
Pierre Glaser (pierreglaser@msn.com), Gatsby Computational Neuroscience Unit, University College London, London, WC1V 6LJ, UK
Anna Korba (anna.korba@ensae.fr), ENSAE, CREST, Institut Polytechnique de Paris, Palaiseau, France
Arthur Gretton (arthur.gretton@gmail.com), Gatsby Computational Neuroscience Unit, University College London, London, WC1V 6LJ, UK
Bharath K. Sriperumbudur (bks18@psu.edu), Department of Statistics, Pennsylvania State University, University Park, PA, 16802, USA

Editor: Quentin Berthet

© 2025 Zonghao Chen, Aratrika Mustafi, Pierre Glaser, Anna Korba, Arthur Gretton, and Bharath K. Sriperumbudur. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-1574.html.

Abstract

We introduce a (de)-regularization of the Maximum Mean Discrepancy (DrMMD) and its Wasserstein gradient flow. Existing gradient flows that transport samples from a source distribution to a target distribution using only target samples either lack a tractable numerical implementation (f-divergence flows) or require strong assumptions and modifications, such as noise injection, to ensure convergence (Maximum Mean Discrepancy flows). In contrast, the DrMMD flow can simultaneously (i) guarantee near-global convergence for a broad class of targets in both continuous and discrete time, and (ii) be implemented in closed form using only samples. The former is achieved by leveraging the connection between the DrMMD and the χ2-divergence, while the latter comes from treating DrMMD as MMD with a de-regularized kernel. Our numerical scheme employs an adaptive de-regularization schedule throughout the flow to optimally balance the trade-off between discretization errors and deviations from the χ2 regime. The potential application of the DrMMD flow is demonstrated across several numerical experiments, including a large-scale setting of training student/teacher networks.

Keywords: Wasserstein gradient flow, reproducing kernel Hilbert space, maximum mean discrepancy, f-divergences, spectral regularization

1. Introduction

Many applications in computational statistics and machine learning involve approximating a probability distribution π on Rd (in terms of samples) when only partial information about π is accessible. For example, in Bayesian inference, π is known only up to an intractable normalizing constant for complex models. The setting of interest in this work is the so-called generative modeling setting (Brock et al., 2019; Ho et al., 2020; Song et al., 2021; Franceschi et al., 2024), where one assumes access to a set of samples from the target distribution π, with the goal of generating new samples from π. Recently, a popular framework for this task involves solving a minimization problem in P(Rd), the space of probability distributions over Rd, by choosing the objective function to be a dissimilarity function D(· ∥ π) (a distance or divergence) between probability distributions that satisfies D(ν ∥ π) = 0 if and only if ν = π.
Since only samples from π are available, this problem is solved approximately yet, the approximate minimizers may converge to π as the number of available samples increases. In particular, in the space of probability distributions with bounded second moment P2(Rd), a common approach is to solve this optimization problem by running a (sample-based) approximation of the Wasserstein gradient flow of the functional F = D( π), which defines a path of distributions with steepest descent for F with respect to the Wasserstein-2 distance. In generative modeling, the choice of D( π) depends on two crucial aspects: First, its flow should admit consistent and preferably tractable numerical implementations using only samples from π, and second, under reasonable assumptions, it should guarantee convergence of its flow to π, the unique global minimizer. Combined, these two properties ensure that, in the large sample limit, this algorithm generates new samples from the target. Unfortunately, verifying these two properties simultaneously has proved to be a surprisingly challenging task. For instance, recent approaches based on Maximum Mean Discrepancy (MMD) (Arbel et al., 2019; Hertrich et al., 2024b), the sliced-Wasserstein distance (Liutkus et al., 2019) and the Sinkhorn divergence (Genevay et al., 2018) typically admit consistent finite-sample implementations, but their global convergence guarantees when they exist do not apply to most practical targets π of interest. To guarantee global convergence, one could instead choose D as an f-divergence. For instance, the (reverse) KL divergence and χ2-divergence are geodesically convex (Villani et al., 2009, Definition 16.5) when the target is log-concave (i.e. π e V with V convex) (Ohta and Takatsu, 2011), and hence their flows enjoy better convergence behaviour. However, while the population Wasserstein gradient flows of the χ2 and KL divergences are well-defined (Jordan et al., 1998; Chewi et al., 2020), they do not come naturally with consistent and tractable sample-based implementations. Multiple approaches propose to solve a surrogate optimization problem with samples at each iteration of the flow (Gao et al., 2019; Ansari et al., 2021; Simons et al., 2022; Birrell et al., 2022; Gu et al., 2024; Liu et al., 2024a); however, it remains to be formally established whether these surrogate problems preserve the desirable convergence guarantees of f-divergence flows. In the face of the trade-offs present in the current approaches, a natural question arises: Does there exist a divergence functional D( π) whose gradient flow both globally converges, and admits a tractable, consistent sample-based implementation? In this work, we take a step towards a positive answer by constructing a de-regularized variant of the Maximum Mean Discrepancy (Dr MMD) and its associated Wasserstein gradient flow. We prove that the Dr MMD gradient flow converges exponentially to the global minimum up to a controllable barrier term for targets π that satisfy a Poincaré inequality, in both continuous and discrete time regimes. To do so, we establish and leverage a connection between the Dr MMD and the χ2 divergence, an f-divergence whose gradient flow benefits from strong convergence guarantees. By alternatively viewing Dr MMD as MMD with a regularized kernel, Dr MMD flow comes with a consistent and tractable implementation when only samples from the target π are available. 
In addition, given the empirical success of using adaptive kernels in MMD-based generative models (Galashov et al., 2025; Li et al., 2017; Arbel et al., 2018), our paper shows theoretically that using adaptive kernels through adaptive regularization indeed improves the convergence of MMD gradient flow. (De)-regularized Maximum Mean Discrepancy Gradient Flow This paper is organized as follows. Section 2 introduces the necessary background on reproducing kernel Hilbert spaces (RKHS), the MMD, χ2-divergence, and Wasserstein gradient flows. Section 3 introduces Dr MMD and shows that Dr MMD is a valid probability divergence that metrizes weak convergence. Section 4 uses Dr MMD as the optimization objective to define a Wasserstein gradient flow in P2(Rd), and analyzes the convergence of Dr MMD flow in continuous time. Sections 5 and 6 define an implementable Dr MMD particle descent scheme with both time and space discretization and analyze its convergence. Section 7 discusses other Wasserstein gradient flows related to our Dr MMD flow. Section 8 shows experiments that confirm our theoretical results. The proofs of all results are provided in Section 10, with the technical results being relegated to an appendix. 2. Background In this section, we present the definitions and notation used throughout the paper. 2.1 Notations Let Ld be the Lebesgue measure on Rd. P2(Rd) denotes the set of all Borel probability measures µ on Rd with finite second moment. For µ P2(Rd), µ π denotes that µ is absolutely continuous with respect to π. We use dµ dπ to denote the Radon-Nikodym derivative. We recall the standard definition of the Kullback-Leibler divergence, KL(µ π) = R log( dµ dπ)dµ if µ π, + else. For a continuous mapping T : Rd Rd, T#µ denotes the push-forward measure of µ by T. For any π P2(Rd), L2(π) is the Hilbert space of (equivalence class of) functions f : Rd R such that R |f|2dπ < . We denote by L2(π) and , L2(π) the norm and the inner product of L2(π). We denote by C c (Rd) the space of infinitely differentiable functions from Rd to R with compact support. For a vector valued functions g : Rd Rp, we abuse the notation of L2(π) and claim g L2(π) if gi L2(π) for all i = 1, . . . , p along with g 2 L2(π) := Pp i=1 gi 2 L2(π). If f : Rd R is differentiable, we denote by f the gradient of f and Hf its Hessian. f is α-strongly convex if Hf αI, i.e, Hf(x) αI is positive semi-definite for any x, where I is the identity matrix (also denotes an identity operator depending on the context). For a vector valued function g : Rd Rd, if gi is differentiable for all i = 1, , d, g denotes the divergence of g. We also denote by g the Laplacian of g, where g = g. We use F to denote the matrix Frobenius norm. a b and a b denote the minimum and maximum of a and b, respectively. 2.2 Reproducing kernel Hilbert spaces For a positive semi-definite kernel k : Rd Rd R, its corresponding reproducing kernel Hilbert space (RKHS) H is a Hilbert space with inner product , H and norm H (Aronszajn, 1950), such that (i) k(x, ) H for all x Rd, and (ii) the reproducing property holds, e.g. for all f H, x Rd, f(x) = f, k(x, ) H. We denote by Hd the Cartesian product RKHS consisting of elements f = (f1, . . . , fd) with fi H with inner product f, g Hd = Pd i=1 fi, gi H. Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur When R k(x, x)dπ(x) < , H can be canonically injected into L2(π) using the operator ιπ : H L2(π), f 7 f with adjoint ι π : L2(π) H given by ι πf( ) = Z k(x, )f(x)dπ(x). 
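To make the embedding ι_π and its adjoint concrete, the following minimal sketch (the Gaussian kernel, sample size, and function names are our illustrative choices, not part of the paper) approximates (ι*_π f)(·) = ∫ k(x, ·) f(x) dπ(x) by a Monte Carlo average over samples from π; evaluating the result pointwise gives a plug-in approximation of the integral operator T_π f introduced next.

```python
import jax
import jax.numpy as jnp

def gauss_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)): bounded, continuous, c0-universal
    return jnp.exp(-jnp.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def adjoint_embedding(f, X, sigma=1.0):
    """Plug-in approximation of (iota_pi^* f)(z) = int k(x, z) f(x) dpi(x),
    with pi replaced by the empirical measure of the samples X."""
    fX = jax.vmap(f)(X)                                   # f evaluated at the samples
    return lambda z: jnp.mean(jax.vmap(lambda x: gauss_kernel(x, z, sigma))(X) * fX)

key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (500, 2))      # samples standing in for pi
f = lambda x: jnp.sin(x[0])               # a test function in L2(pi)
Tf = adjoint_embedding(f, X)              # pointwise values of (T_pi f) = (iota_pi iota_pi^* f)
print(Tf(jnp.zeros(2)))                   # Monte Carlo estimate of (T_pi f)(0)
```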
The operator ιπ and its adjoint can be composed to form an L2(π) endomorphism Tπ := ιπι π called the integral operator, and a H endomorphism Σπ := ι πιπ = R k( , x) k( , x)dπ(x) (where (a b)c := b, c Ha for a, b, c H) called the covariance operator. Tπ is compact, positive, self-adjoint, and can thus be diagonalized into an orthonormal system in {ei}i 1 of L2(π) with associated eigenvalues ϱ1 ϱi 0. In this paper, we make the following assumption on our kernel. Assumption 1 k : Rd Rd R is a continuous and c0-universal kernel, and there exists K > 0 such that supx k(x, x) K. We refer the reader to Carmeli et al. (2010) for the definition of c0-universal kernel. The implication of Assumption 1 is that the RKHS H is compactly embedded into L2(π) (Steinwart and Scovel, 2012, Lemma 2.3), and hence k(x, x ) has a absolute, uniform and pointwise convergent Mercer representation (Steinwart and Scovel, 2012, Corollary 3.5), k(x, x ) = X i 1 ϱiei(x)ei(x ), (1) for any x and x in the support of π. Since the kernel is c0-universal, the RKHS H is dense in L2(π) for all Borel probability measures π (Sriperumbudur et al., 2011, Section 3.1) and {ei}i 1 becomes an orthornormal basis of L2(π) (Steinwart and Scovel, 2012, Theorem 3.1). The power of the integral operator T r π is defined as T r π f := P i 1 ϱr i f, ei L2(π) ei, f L2(π). For f Ran(T r π ), there exists q L2(π) such that f = T r π q. The exponent r quantifies the smoothness of the range space relative to the original RKHS H with 0 < r < 1 2 (resp. r > 1 2) yields spaces that are less (resp. more) smooth than H with Ran(T 1/2 π ) being isometrically isomorphic to H (Cucker and Zhou, 2007; Fischer and Steinwart, 2020). We make an additional assumption commonly employed in the kernel-based gradient flow literature (Glaser et al., 2021; He et al., 2024; Korba et al., 2020; Arbel et al., 2019) on the regularity of the kernel that will be employed in studying the Dr MMD gradient flow. Assumption 2 k : Rd Rd R is twice differentiable in the sense of (Steinwart and Christmann, 2008, Definition 4.35), i.e., for i, j {1, , d}, both i i+dk and i j i+d j+dk exist and are continuous. There exist constants K1d, K2d > 0 such that 1k(x, ) Hd := Pd i=1 ik(x, ) H K1d and H1k(x, ) Hd d := Pd i,j=1 i jk(x, ) H K2d for all x Rd. Many kernels satisfy both Assumption 1 and 2, including the class of bounded, continuous, and translation invariant kernels on Rd whose Fourier transforms have finite second and fourth moments. This is easy to verify by employing the Fourier transform representation of the RKHS (Wendland, 2004, Theorem 10.12) and noting that the finiteness of the RKHS norm of 1k( , x) and H1k( , x) for all x corresponds to the existence and finiteness of the second and fourth moments of the Fourier transform of the kernel, respectively. This condition is satisfied by the Gaussian kernel, Matérn kernels of order ν with ν + d 2 2 and the inverse multiquadratic kernel. (De)-regularized Maximum Mean Discrepancy Gradient Flow 2.3 Maximum mean discrepancy and χ2-divergence The Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) between µ and π is defined as the RKHS norm of the difference between the mean embeddings1 mµ := R k(x, )dµ(x) and mπ := R k(x, )dπ(x). MMD(µ π) := Z k(x, )dµ(x) Z k(x, )dπ(x) H = mµ mπ H . The function mµ mπ is often referred to as the witness function . 
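The squared MMD just defined admits a simple plug-in estimate from samples, obtained by replacing µ and π with empirical measures (this estimator is revisited in the next paragraph). A minimal sketch with a Gaussian kernel (the kernel, bandwidth, and sample sizes are illustrative choices) computes ‖m̂_µ − m̂_π‖²_H from three Gram matrices:

```python
import jax
import jax.numpy as jnp

def gauss_gram(X, Y, sigma=1.0):
    # Gram matrix [k(X_i, Y_j)]_ij for the Gaussian kernel
    sq = jnp.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-sq / (2.0 * sigma ** 2))

def mmd_sq(X, Y, sigma=1.0):
    """Plug-in (V-statistic) estimate of MMD^2(mu, pi) = ||m_mu - m_pi||_H^2,
    with mu and pi replaced by the empirical measures of X and Y."""
    Kxx, Kyy, Kxy = gauss_gram(X, X, sigma), gauss_gram(Y, Y, sigma), gauss_gram(X, Y, sigma)
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()

X = jax.random.normal(jax.random.PRNGKey(0), (200, 2))         # samples from mu
Y = jax.random.normal(jax.random.PRNGKey(1), (200, 2)) + 1.0   # samples from pi
print(mmd_sq(X, Y))   # positive, since the two sample sets come from different laws
```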
When the kernel k is c0-universal, MMD(µ π) = 0 if and only if µ = π, and the MMD metrizes the weak topology between probability measures (Sriperumbudur et al., 2010; Sriperumbudur, 2016). Given samples (x1, . . . , xn) and (y1, . . . , ym) from µ and π respectively, the MMD can be consistently estimated in multiple ways (Gretton et al., 2012). For instance, one can compute its plug-in estimator, e.g. MMD(ˆµ ˆπ), where bµ := 1 n Pn i=1 δxi and bπ := 1 n Pn i=1 δyi. The χ2-divergence a member of the family of f-divergences (Rényi, 1961) is defined as the variance of the Radon-Nikodym derivative dµ dπ under π: χ2(µ π) := Z dµ when µ π, and + otherwise. The χ2-divergence has a variational form (Nowozin et al., 2016; Nguyen et al., 2010): χ2(µ π) = sup h M(Rd) Z hdµ Z h + 1 where M(Rd) denote the set of all measurable functions from Rd to R. When µ π, we prove in Lemma B.1 that the optimal h = dµ dπ 1 L2(π) so that it is sufficient to restrict the variational set to L2(π) in contrast to M(Rd) for general f-divergences. Since in most cases, χ2(ˆµ ˆπ) = + , the χ2-divergence does not admit plug-in estimators, and estimating it consistently involves more complicated strategies (Nguyen et al., 2010). 2.4 Wasserstein gradient flows Gradient flows are dynamics that use local (e.g. differential) information about a given functional in order to minimize it as fast as possible. Their exact definition depends on the nature of the input space; in the familiar case of Euclidean space Rd, the gradient flow of a sufficiently regular F : Rd R given some initial condition x0 is given by the solution (xt)t 0 of txt = v(xt), where v is the Fréchet subdifferential of F, a generalization of the notion of derivative to non-smooth functions (Kruger, 2003). Gradient flows can be extended from Euclidean spaces to the more general class of metric spaces (Ambrosio et al., 2005). When the metric space in question is P2(Rd) endowed with the Wasserstein-2 distance, this gradient flow is called the Wasserstein gradient flow (µt)t 0. The Wasserstein gradient flow of F : P2(Rd) R takes the particular form (Ambrosio et al., 2005, Lemma 10.4.1): tµt + (µtvt) = 0, (2) 1. Such mean embeddings are well-defined under Assumption 1. Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur where vt is the Fréchet subdifferential of F : P2(Rd) R evaluated at µt (Ambrosio et al., 2005, Definition 11.1.1). (2) is an instance of the continuity equation with velocity field vt: under these dynamics, the mass of µt is transported in the direction vt that decreases F at the fastest rate at each time t. While (2) can be time-discretized in various ways (Santambrogio, 2017; Ambrosio et al., 2005), in this work, we will focus on the forward Euler scheme, defined as µn+1 := (I γvn)#µn where γ > 0 is a step size parameter. Such a scheme is also known as the Wasserstein Gradient Descent of F. Just as gradient descent in Euclidean spaces, an instrumental property to characterize the convergence of the Wasserstein gradient descent of a functional F is given by its geodesic convexity and smoothness. Among various ways, one can consider to characterize convexity and smoothness through lower and upper bounds on the Wasserstein Hessian of the functional F (Villani et al., 2009, Proposition 16.2). 
Given any φ C c (Rd) that defines a constant speed geodesic2 starting at µ: ρt = (I + t φ)#µ for 0 t 1, the Wasserstein Hessian of a functional F : P2(Rd) R at µ, denoted as Hess F|µ, is an operator from L2(µ) to L2(µ):3 Hess F|µ φ, φ L2(µ) = d2 A functional F is said to be geodesically M-smooth at µ if Hess F|µ φ, φ L2(µ) M φ L2(µ), and is said to be geodesically Λ-convex at µ if Hess F|µ φ, φ L2(µ) Λ φ L2(µ). Additionally, F is geodesically semiconvex if < Λ < 0 and geodesically strongly convex if Λ > 0. Generally, a functional F that is both smooth and strongly convex is preferred, because its Wasserstein gradient descent has an exponential rate of convergence under a small enough step size γ (Boyd and Vandenberghe, 2004, Section 9.3.1)(Bonet et al., 2024). Given some probability measure π P2(Rd), the MMD flow (resp. χ2 flow) is the Wasserstein gradient flow of the functional FMMD( ) = MMD( π) (resp. Fχ2( ) = χ2( π)). As with the MMD, the MMD flow has an analytic finite sample implementation and may be used to construct generative modeling algorithms (Hertrich et al., 2024b,a). The Wasserstein Hessian of FMMD for smooth kernels is not positively lower bounded (Arbel et al., 2019, Proposition 5), however, so MMD flow only converges up to an unknown barrier (Arbel et al., 2019, Theorem 6), with global convergence only under a strong (and unverifiable) assumption (Arbel et al., 2019, Proposition 7). More recent works (Boufadène and Vialard, 2024) have demonstrated the global convergence of the MMD flow when using the Coulomb kernel. This kernel is non-smooth, however, which complicates numerical implementations. In contrast, the Wasserstein Hessian of Fχ2 is positively lower bounded (Ohta and Takatsu, 2011) for log-concave targets π, so Fχ2 is geodesically strongly convex and χ2 flow enjoys exponential rate of convergence. The exponential convergence of χ2 flow towards the global minimum in fact holds for a broader class of targets π that satisfy a Poincaré inequality (Chewi et al., 2020). The χ2 flow has so far lacked a tractable sample-based implementation, however, so it has not been widely used in practice. 2. See Appendix A for the definition. 3. Strictly speaking, Hess F|µ is an operator over the tangent space TµP2(Rd) which is a subset of L2(µ) (Villani et al., 2009). (De)-regularized Maximum Mean Discrepancy Gradient Flow In the following sections, we will introduce a new Wasserstein gradient flow that combines the computational advantages of the MMD flow with the convergence properties of the χ2 3. (De)-regularized Maximum Mean Discrepancy (Dr MMD) In this section, we introduce a (de)-regularized version of maximum mean discrepancy, or Dr MMD in short. The Dr MMD is rooted in a unified representation of the MMD and the χ2-divergence, given in the following proposition, which is proved in Section 10.1. Proposition 3.1 (MMD and χ2-divergence) Suppose dµ dπ 1 L2(π) for µ, π P2(Rd). Then MMD2(µ π) = T 1/2 π L2(π) and χ2(µ π) = I dµ Remark 3.1 The χ2 identity follows from the definition but is provided for comparison purposes. Together, these identities express both the MMD and the χ2-divergence as functionals of the (centered) density ratio dµ dπ 1. While the χ2-divergence directly computes the L2(π) norm of the centred ratio, the MMD2 first computes the image by the operator T 1/2 π before taking the L2(π) norm. 
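Proposition 3.1 can be made tangible with a toy spectral computation: choosing a (made-up) Mercer spectrum (ρ_i) for T_π and coefficients c_i = ⟨dµ/dπ − 1, e_i⟩_{L2(π)}, the χ2-divergence is the unweighted sum of squared coefficients, while the squared MMD down-weights the i-th coefficient by ρ_i. The numbers below are purely illustrative.

```python
import jax.numpy as jnp

# Illustrative spectrum and coefficients (made up for this sketch, not from the paper):
# rho_i are Mercer eigenvalues of T_pi, c_i = <d mu/d pi - 1, e_i>_{L2(pi)}.
i = jnp.arange(1, 51)
rho = 1.0 / i ** 2           # polynomially decaying eigenvalues of T_pi
c = 1.0 / i                  # coefficients of the centred density ratio

chi2 = jnp.sum(c ** 2)               # chi^2(mu || pi) = ||d mu/d pi - 1||^2_{L2(pi)}
mmd2 = jnp.sum(rho * c ** 2)         # MMD^2(mu || pi) = ||T_pi^{1/2}(d mu/d pi - 1)||^2_{L2(pi)}
print(chi2, mmd2)                    # mmd2 << chi2: T_pi^{1/2} damps high-frequency components
```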
The smoothing effect of the compact operator T 1/2 π note that Tπ is compact if k is bounded as assumed in Assumption 1 has both positive and negative consequences: FMMD( ) = MMD2( π) admits finite sample estimators but is not geodesically convex, making the first-order optimization of MMD objective (as done in generative modeling) challenging (Arbel et al., 2019). In contrast, Fχ2( ) = χ2( π) is geodesically convex for log-concave targets π (Ohta and Takatsu, 2011) but is hard to estimate with samples. With these facts in mind, we introduce a divergence whose purpose is to combine the beneficial properties of both the χ2-divergence and MMD. To do so, this divergence computes the L2(π) norm of the image of dµ dπ 1 by an alternative operator which interpolates between I and T 1/2 π . We set this operator to be ((Tπ + λI) 1Tπ)1/2, where λ > 0 is a regularization parameter. The operator ((Tπ + λI) 1Tπ)1/2 can be seen as a (de)-regularization of the operator T 1/2 π used by the MMD - a similar idea has been used in kernel Fisher discriminant analysis (Mika et al., 1999), goodness-of-fit testing (Balasubramanian et al., 2021; Hagrass et al., 2024b), and two-sample testing (Harchaoui et al., 2007; Hagrass et al., 2024a). We call the resulting divergence the (De)-regularized Maximum Mean Discrepancy (Dr MMD). Definition 1 (Dr MMD) Suppose dµ dπ 1 L2(π) where µ, π P2(Rd). Then the (de)- regularized maximum mean discrepancy (Dr MMD) between µ, π P2(Rd) is defined as Dr MMD(µ||π) = (1 + λ) (Tπ + λI) 1Tπ 1/2 dµ L2(π) , (3) where λ > 0. Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur While all three operators, I, Tπ and (Tπ + λI) 1Tπ are diagonalizable in the same eigenbasis of L2(π), the key difference between them lies in the behavior of their eigenvalues. The identity operator I has all eigenvalues 1, the integral operator Tπ has eigenvalues (ϱi)i 1 which decay to zero as i and the (de)-regularized integral operator (Tπ + λI) 1Tπ has eigenvalues (ϱi/(ϱi + λ))i 1 which either decay to zero or converge to 1 depending on the choice of λ as i . (Tπ + λI) 1Tπ is known in the statistical estimation literature as Tikhonov regularization; alternative definitions of Dr MMD could be obtained by using other regularizing operators, such as Showalter regularization in Engl et al. (1996). In this paper, we primarily focus on Tikhonov regularization and leave other types of regularization for future work. One of the stated purposes of the Dr MMD is to retain the computational benefits of the MMD, which are crucial for its use in particle algorithms for generative modeling. To this end, we provide an alternative representation of Dr MMD which does not involve the density ratio dµ dπ directly, but only kernel expectations. Proposition 3.2 (Density ratio free and variational formulations) Dr MMD can be alternately represented as Dr MMD(µ||π) = (1 + λ) (Σπ + λI) 1 2 (mµ mπ) 2 = (1 + λ) sup h H Z h dµ Z h2 with h µ,π = 2 (Σπ + λI) 1 (mµ mπ) being the witness function. The proof is in Section 10.2. The density ratio-free representation (4) contains three expectations under µ and π: the mean embeddings mµ and mπ, and the covariance operator Σπ. Given samples {xi}M i=1 µ and {yi}N i=1 π, we can construct a plug-in finite sample estimator in (71) by replacing µ and π with their empirical counterparts: ˆµ = 1 M PM i=1 δxi and ˆπ = 1 N PN i=1 δyi. The density ratio-free representation (4) frames Dr MMD as acting on the difference of mµ and mπ similarly to MMD. 
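The plug-in estimator referred to above (equation (71), not reproduced in this section) can be sketched directly from the density-ratio-free form (4). The reduction to Gram matrices below uses the standard identity (Σ̂_π + λI)^{-1} = λ^{-1}(I − S*(K_xx + MλI)^{-1}S), where S evaluates an RKHS function at the M target samples; the Gaussian kernel and all constants are our illustrative choices, not the paper's reference implementation.

```python
import jax
import jax.numpy as jnp

def gauss_gram(X, Y, sigma=1.0):
    sq = jnp.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-sq / (2.0 * sigma ** 2))

def drmmd(Y, X, lam, sigma=1.0):
    """Plug-in sketch of DrMMD(mu || pi) = (1 + lam) <w, (Sigma_pi + lam I)^{-1} w>_H,
    where w = m_mu - m_pi, and mu, pi are replaced by the empirical measures of Y and X.
    Uses (Sigma_pi_hat + lam I)^{-1} = (1/lam) (I - S^* (K_xx + M lam I)^{-1} S)."""
    M = X.shape[0]
    Kxx, Kyy, Kxy = gauss_gram(X, X, sigma), gauss_gram(Y, Y, sigma), gauss_gram(X, Y, sigma)
    w_normsq = Kyy.mean() - 2.0 * Kxy.mean() + Kxx.mean()     # ||w||_H^2, the squared MMD
    Sw = Kxy.mean(axis=1) - Kxx.mean(axis=1)                  # (S w)_i = w(x_i)
    quad = Sw @ jnp.linalg.solve(Kxx + M * lam * jnp.eye(M), Sw)
    return (1.0 + lam) / lam * (w_normsq - quad)

Y = jax.random.normal(jax.random.PRNGKey(0), (200, 2))         # samples from mu
X = jax.random.normal(jax.random.PRNGKey(1), (300, 2)) + 1.0   # samples from pi
print(drmmd(Y, X, lam=1e-2), drmmd(Y, X, lam=1e2))   # large lam approaches the squared MMD
```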
In fact, up to a multiplicative factor (1 + λ), Dr MMD is MMD computed with respect to another kernel k defined as k(x, x ) = D (Σπ + λI) 1 2 k(x, ), (Σπ + λI) 1 2 k(x , ) E The derivations are provided in Section 10.3. The kernel k is symmetric and positive semidefinite by construction, and its associated reproducing kernel Hilbert space is H. The density ratio-free representation of Dr MMD in Proposition 3.2 is already known in (Balasubramanian et al., 2021; Harchaoui et al., 2007; Hagrass et al., 2024a,b) in the context of non-parametric hypothesis testing. 3.1 Properties of Dr MMD In this section, we establish various properties of Dr MMD. As discussed earlier, Dr MMD is constructed to interpolate between χ2-divergence and MMD to exploit the advantages associated with each. The following result formalizes the interpolation property. (De)-regularized Maximum Mean Discrepancy Gradient Flow Proposition 3.3 (Interpolation property) Let µ, π P2(Rd). If Assumption 1 holds and dµ dπ 1 L2(π), then lim λ 0 Dr MMD(µ π) = χ2(µ π), and lim λ Dr MMD(µ π) = MMD2(µ π). Proposition 3.3, whose proof can be found in Section 10.4, shows that Dr MMD asymptotically becomes a probability divergence in the small and large λ regimes. We seek to use Dr MMD as a minimizing objective in generative modeling algorithms; however, we need to ensure that Dr MMD is a probability divergence for any fixed value of λ. This result holds, as shown next. Proposition 3.4 (Dr MMD is a probability divergence) Under Assumption 1, for any λ (0, ), Dr MMD is a probability divergence, i.e., Dr MMD(µ π) 0, with equality iff µ = π. Moreover, Dr MMD metrizes the weak topology between probability measures, i.e., Dr MMD(µn π) 0 iffµn converges weakly to π P2(Rd) as n . As MMD with c0-universal kernels metrizes the weak convergence of distributions (Sriperumbudur, 2016; Simon-Gabriel et al., 2023), Proposition 3.4, whose proof can be found in Section 10.5, shows that for any λ > 0, Dr MMD is MMD like topologically speaking, and is different from the χ2-divergence which induces a strong topology (Agrawal and Horel, 2021). Remark 3.2 Dr MMD is a specific case of a so-called Moreau envelopes of f-divergences in reproducing kernel Hilbert spaces introduced in Stein et al. (2025), when the f-divergence is taken as the χ2-divergence. This connection is uncovered by the variational formulation of Dr MMD in (5). In contrast to general f-divergences, the Moreau envelope of the χ2divergence enjoys a closed-form expression, as we highlight in this paper with various analytical formulas for Dr MMD. The interpolation property (Proposition 3.3) and the metrization of weak convergence (Proposition 3.4) are proved concurrently in Corollaries 12 and 13 of Stein et al. (2025), relying on formulation via Moreau envelopes. In our case, we use direct computations thanks to the closed form of Dr MMD.4 4. Wasserstein Gradient Flow of Dr MMD Having introduced the Dr MMD in the previous section, we now construct and analyze its Wasserstein Gradient Flow (WGF). As discussed in Section 2, WGFs define dynamics (µt)t 0 in Wasserstein-2 space that minimize a given functional F by transporting µt in the direction of steepest descent, given by the Fréchet subdifferential of F evaluated at µt. Given some target distribution π from which we wish to sample, the WGF of FDr MMD( ) = Dr MMD( π), called Dr MMD flow, has the potential to form the basis of a generative modeling algorithm, since µt progressively minimizes its distance (in the Dr MMD sense) to the target π. 
To fulfill this potential, two additional ingredients are necessary. The first is to formally establish that µt reaches the global minimizer π, and the second is to design a tractable finite-sample algorithm that inherits the convergence properties of the original Dr MMD flow. 4. We would like to clarify that Stein et al. (2025) appeared on arxiv when our paper was already under review at ICML 2024. Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur We defer the second point to Section 5 and focus in this section on showing how Dr MMD benefits from its interpolation towards χ2-divergence such that the Dr MMD flow achieves near-global convergence for a large class of target distributions. 4.1 Dr MMD flow: Definition, existence, and uniqueness To prove that the Dr MMD flow is well-defined and admits solutions, the key is to show that the FDr MMD admits Fréchet subdifferentials, as formalized in the following proposition. Proposition 4.1 (Dr MMD gradient flow) Let λ > 0, and µ0, π P2 Rd . Under Assumption 1 and 2, the functional FDr MMD admits Fréchet subdifferential of the form (1 + λ) h µt,π, where h µt,π is the witness function defined in Proposition 3.2. Consequently, the Dr MMD flow is well-defined and is the solution to the following equation tµt µt(1 + λ) h µt,π = 0. (7) In addition, the Dr MMD flow starting at µ0 is unique because FDr MMD is semiconvex, i.e., for any φ C c (Rd) and µ P2(Rd), Hess FDr MMD|µ φ, φ 2(1 + λ)2 KK2d + K1d λ φ 2 L2(µ). (8) The proof is in Section 10.6. By recalling the discussion of Wasserstein Hessian in Section 2.4, (8) indicates that FDr MMD is both geodesically smooth and geodesically semiconvex, which is expected because Dr MMD is equivalent to MMD with a regularized kernel k defined in (6) and FMMD is both geodesically smooth and geodesically semiconvex (Arbel et al., 2019, Proposition 5). Proposition 4.1 is proved concurrently in Corollaries 14 and 20 of Stein et al. (2025) relying on formulation via Moreau envelopes, while our proof uses the closed-form expression for Dr MMD. 4.2 Near-global convergence of Dr MMD flow Having defined the Dr MMD flow, we are now concerned with its convergence to the target π. Since Dr MMD is constructed to interpolate between the MMD and the χ2-divergence, Dr MMD flow is expected to recover the convergence properties of the χ2 flow. With this goal in mind, we first study the Wasserstein Hessian of Dr MMD and prove that it becomes asymptotically positive as λ 0 for strongly log-concave targets π. Next, to obtain a non-asymptotic convergence rate, we take another route and show that the Dr MMD flow converges to π exponentially fast in KL divergence up to a barrier term that vanishes in the small λ regime, provided that π satisfies a Poincaré inequality. 4.2.1 Near-Geodesic Convexity of FDr MMD One popular approach to proving that the Dr MMD flow (µt)t 0 converges to the target π in terms of Dr MMD is to show that FDr MMD is geodesically convex (Ambrosio et al., 2005, Theorem 4.0.4), or, equivalently in our definition in Section 2.4, its Wasserstein Hessian is positive definite. In the next proposition, we show that the Wasserstein Hessian of FDr MMD is indeed positive definite for small enough λ; however, as we will see below (remark 4.2), the form of the result will not allow us to show convergence besides in the limit λ 0, which leads us to take a different approach in subsequent sections. 
(De)-regularized Maximum Mean Discrepancy Gradient Flow Proposition 4.2 (Near-geodesic convexity of FDr MMD) Let µ, π P2(Rd), µ, π Ld and φ C c (Rd). Under Assumption 1 and 2, let π be α strongly log-concave, i.e., π exp( V ), HV αI, and assume additionally that x 7 HV (x) is continuous. Then for all µ such that x 7 log µ(x) is continuous and dµ Hess FDr MMD|µ φ, φ L2(µ) α(1 + λ) Z dµ dπ(x) φ(x) 2dµ(x) R(λ, µ, φ), (9) where limλ 0 R(λ, µ, φ) = 0. The proof can be found in Section 10.7. To obtain this result, we relate the Wasserstein Hessian of FDr MMD with that of Fχ2. When dµ dπ 1 H, they coincide asymptotically as λ 0, showing that the interpolation properties of Dr MMD to the χ2-divergence hold at the level of Wasserstein derivatives. Together with the fact that the Wasserstein Hessian of Fχ2 is positive definite for α-strongly log-concave π (Ohta and Takatsu, 2011), we obtain the lower bound in (9). It is noteworthy that although Dr MMD can be viewed as squared MMD with a regularized kernel k, the near-geodesic convexity in Proposition 4.2 is not observed for standard MMD with a fixed kernel k because the latter does not interpolate towards χ2-divergence. Remark 4.1 (Geodesic convexity/smoothness trade-offin FDr MMD) The geodesic smoothness and near-convexity of the functional FDr MMD are characterized by (8) and (9) respectively via upper and lower bounds on the Wasserstein Hessian of FDr MMD. However, (8) and (9) impose contradictory conditions on the (de)-regularization parameter λ: (8) indicates that Dr MMD is smoother if λ is larger while (9) indicates that Dr MMD is more convex if λ is small enough. Consequently, there is a trade-offbetween the geodesic convexity and smoothness of FDr MMD, which will play an important role in Section 5. Remark 4.2 Proposition 4.2 shows that for fixed µ and φ, there exists λ small enough yet positive such that Hess FDr MMD|µ φ, φ L2(µ) > 0 at µ. The remainder term R(λ, µ, φ) is only controlled in the limit as λ 0, however, which complicates the use of Proposition 4.2 to show global convergence of the Dr MMD flow. In the next section, we employ a different set of techniques that rely on the Poincaré condition on π, a condition which, as we show, will ensure a sufficient dissipation of KL divergence along the Dr MMD flow to obtain non-asymptotic near-global convergence. 4.2.2 Near-Global Convergence of Dr MMD flow via Poincaré inequality Even when a functional F is not geodesically convex, convergence guarantees for its Wasserstein gradient flow (µt)t 0 can still be obtained if the target π satisfies certain functional inequalities. Consider the χ2 flow (νt)t 0, for example: if π satisfies the Poincaré inequality, then (νt)t 0 converges exponentially fast to π in terms of KL divergence (Chewi et al., 2020, Theorem 1), i.e., KL(νT π) exp 2T KL(ν0 π). (10) Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur Recall that π satisfies a Poincaré inequality (Pillaud-Vivien et al., 2020, Definition 1) if for all functions f : Rd R such that f, f L2(π), there exists a constant CP > 0 such that Z f(x)2dπ(x) Z f(x)dπ(x) 2 CP f 2 L2(π). (11) The smallest constant CP for which (11) holds is called the Poincaré constant. The Poincaré inequality is widely used for studying the convergence of Langevin diffusions (Chewi et al., 2024) and χ2 flow (Chewi et al., 2020; Trillos and Sanz-Alonso, 2020). The Poincaré condition is implied by the strong log-concavity of π, and is weaker than strong log-concavity because it also allows for nonconvex potentials. 
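For intuition about the condition (11), here is a quick Monte Carlo sanity check for the standard Gaussian target, whose Poincaré constant is C_P = 1; the test function and sample size are arbitrary choices made for this sketch.

```python
import jax
import jax.numpy as jnp

X = jax.random.normal(jax.random.PRNGKey(0), (100_000,))   # samples from pi = N(0, 1), C_P = 1

f = lambda x: jnp.sin(2.0 * x) + 0.3 * x ** 2               # arbitrary smooth test function
grad_f = jax.vmap(jax.grad(f))

var_f = jnp.var(f(X))                        # Var_pi(f)
dirichlet = jnp.mean(grad_f(X) ** 2)         # E_pi ||grad f||^2
print(var_f, dirichlet, bool(var_f <= dirichlet))   # Poincare: Var_pi(f) <= C_P E_pi ||grad f||^2
```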
The set of probability measures satisfying a Poincaré inequality includes distributions with sub-gaussian tails or with exponential tails. This set is also closed under bounded perturbations and finite mixtures (see Vempala and Wibisono (2019) and Chewi et al. (2024) for a more detailed discussion). Given the interpolation property of Dr MMD to χ2-divergence, (10) suggests investigating the convergence of the Dr MMD flow (µt)t 0 in KL divergence under a Poincaré inequality. To this end, we first derive an upper bound on KL(µt π) along the Dr MMD flow. Theorem 4.1 (KL control of the Dr MMD flow) Suppose k satisfies Assumptions 1 and 2, and the target π and Dr MMD gradient flow (µt)t 0 satisfy the following conditions: 1. π satisfies a Poincaré inequality with constant CP . 2. µt, π Ld. dπ 1 Ran(T r π ) with r > 0, i.e., there exists qt L2(π) such that dµt dπ 1 = T r π qt. (log π) dµt dπ L2(π) Jt and dµt dπ L2(π) It. 5. For all i = 1, . . . , d, lim x h µt,π(x) 2dµt dπ (x) i dµt dπ (x) π(x) 0. Then, for any T 0, KL(µT π) exp 2(1 + λ) CP T KL(µ0 π) + 4(1 + λ)λr Z T 0 exp 2(1 + λ) CP (T t) qt L2(π) (Jt + It)dt. (12) The proof, which can be found in Section 10.8, leverages the fact that the Dr MMD can approximate not only the χ2-divergence, but also its Wasserstein gradient. The Dr MMD s approximation properties can be combined with functional inequalities to obtain an upperbound for the continuous-time dissipation of KL divergence along the flow, given by: d dt KL(µt π) 2(1 + λ) CP KL(µt π) + 4(1 + λ)λr qt L2(π) (Jt + It) | {z } Approximation error from which Theorem 4.1 follows upon applying the Growall s lemma (Gronwall, 1919). The first term is strictly negative, while the second term is an approximation error term arising (De)-regularized Maximum Mean Discrepancy Gradient Flow from Dr MMD not perfectly matching the χ2-divergence for λ > 0. When λ = 0, Theorem 4.1 recovers the exponential decay of KL divergence along χ2 flow in (10). Remark 4.3 (i) The second condition assumes that µt and π have densities. (ii) The third condition that dµt dπ 1 Ran(T r π ) is a regularity condition on the density ratio so that it can be well approximated by the witness function h µt,π. This assumption is known as the range assumption in the literature of kernel ridge regression (Cucker and Zhou, 2007; Fischer and Steinwart, 2020). We posit that this assumption can be relaxed to dµt dπ 1 L2(π) as in Proposition 3.3 if only asymptotic convergence is needed with no explicit rate. (iii) The fourth condition is another regularity condition on the density ratio along the flow. This condition is automatically satisfied under a stronger range condition (r = 1 2) in the third condition, i.e., dµt dπ 1 H, along with a moment condition on the score function log π. To see this, notice that we can further write (derivations are provided in Section 10.8) (log π) dµt K1d qt L2(π) log π L2(π) dµt K2d qt L2(π). (14) As an illustration, we examine the Dr MMD flow under a Gaussian approximation in Appendix C, following Lambert et al. (2022); Liu et al. (2024b), with a Gaussian kernel and Gaussian target π, for which explicit upper bounds It and Jt can be derived. (iv) The fifth condition is a boundary condition that allows integration by parts equality in the proof. We highlight that many works (e.g., Theorem 1 of He et al. (2024), Theorem 2 of Nitanda et al. (2022), Lemma 6 of Vempala and Wibisono (2019)) on Wasserstein gradient flow apply integration by parts without explicitly stating this condition. 
Additionally, if qt L2(π) Q, Jt J , It I for all 0 t T, then Theorem 4.1 will ensure KL convergence of the Dr MMD flow up to a controllable barrier term, e.g. near global convergence. Corollary 4.1 (Near global convergence of the Dr MMD flow) In addition to the assumptions of Theorem 4.1, if qt L2(π) Q, Jt J , It I for all 0 t T, where Q, J , and I are universal constants independent of λ, then for any T 0, KL(µT π) exp 2(1 + λ) CP T KL(µ0 π) + 2λr CP Q (J + I) . The proof of Corollary 4.1 follows directly from upper-bounding the second term of (12) with universal constants Q, J , and I, and using the closed-form expression of the resulting integral. Corollary 4.1 provides a condition under which the Dr MMD flow will exhibit an exponential rate of convergence (linear convergence) in terms of KL divergence up to an extra approximation error term which vanishes as λ 0. If qt L2(π) Q, Jt J , It I for all 0 t T as shown in Corollary 4.1, the approximation error is of explicit order O(λr). Therefore, in the continuous time regime, to have a smaller approximation error, it is beneficial to use a small (de)-regularization parameter λ, so that Dr MMD flow operates Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur closer to the regime of χ2 flow. However, as we will see in the next section, when it comes to time-discretized Dr MMD flow, i.e., Dr MMD gradient descent, there is a trade-offbetween the approximation error and the time discretization error such that the selection of λ would require more careful analysis to strike a good balance. Finally, a smaller Poincaré constant CP results in both a faster rate of convergence and a smaller barrier. Previously, Arbel et al. (2019) established (sublinear) global convergence of the MMD flow in terms of MMD distance by assuming that a Lojasiewicz inequality (or a variant of it if additionally performing noise injection, see Arbel et al., 2018, Proposition 8) holds along the flow. Our result thus complements that of Arbel et al. (2019) by showing that MMD-type functionals can achieve near-global convergence for targets satisfying a Poincaré inequality regardless of whether such inequalities hold, by studying their behavior in the f-divergence interpolation regime. Remark 4.4 Since Dr MMD is asymmetric in its arguments, the reader may wonder why we focus on the gradient flow of Dr MMD( ||π) instead of Dr MMD(π|| ). From the convergence standpoint, Theorem 4.1 shows that Dr MMD( ||π) converges globally with an exponential rate up to a small barrier when π satisfies Poincaré inequality along with an extra regularity condition on the density ratio. This favorable convergence property is no longer true for Dr MMD(π|| ). Practically speaking, the Wasserstein gradient of Dr MMD requires inverting a kernel integral operator. It is more efficient to do this just once with a kernel integral operator with respect to π as in (3), rather than with respect to µt, which would happen at each time (as done by He et al. 2024). Computing the Dr MMD flow is intractable since the dynamics are in continuous-time, and in practice, we do not have access to π, but only samples from it. In the next two sections, we build a tractable approximation of the Dr MMD flow which provably achieves near-global convergence under similar assumptions as the ones in Section 4. Compared to the Dr MMD flow, this approximation combines a discretization in time (introduced in Section 5) with a particle-based space-discretization (introduced in Section 6). 
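Before moving to the discrete-time scheme, the continuous-time picture can be summarised numerically by evaluating the right-hand side of Corollary 4.1: exponential decay plus an O(λ^r) barrier, so in continuous time smaller λ is always better. The constants below are placeholders chosen for illustration, not values derived in the paper.

```python
import jax.numpy as jnp

def kl_bound(lam, T, kl0=1.0, C_P=1.0, Q=1.0, J=1.0, I=1.0, r=0.5):
    """Right-hand side of Corollary 4.1: exponential decay plus an O(lam^r) barrier.
    All constants are illustrative placeholders, not values computed in the paper."""
    rate = jnp.exp(-2.0 * (1.0 + lam) * T / C_P) * kl0
    barrier = 2.0 * lam ** r * C_P * Q * (J + I)
    return rate + barrier

for lam in [1.0, 1e-1, 1e-2, 1e-3]:
    print(lam, kl_bound(lam, T=10.0))   # in continuous time, smaller lam gives a smaller barrier
```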
While the space-discretization techniques that we used are well-known in the Wasserstein gradient flow literature, our time-discretized scheme deviates from standard approaches, which are insufficient to guarantee near-global convergence in our case. 5. Time-discretized Dr MMD flow In this section, we first construct and analyze the forward Euler scheme of (7), a simple time-discretization of the Dr MMD flow which we call Dr MMD Descent. Compared to the Dr MMD flow, the convergence of Dr MMD descent is affected by an additional smoothnessrelated time-discretization error that blows up as λ approaches 0, thus preventing near-global convergence. To address this issue, we propose in Section 5.2 an alternative discrete-time scheme that adapts the value of the regularization coefficient λ across the descent iterates, which we call Adaptive Dr MMD Descent. We show that this scheme converges in the KL divergence up to a barrier term that vanishes as the discretization step size goes to zero. 5.1 Dr MMD Descent The forward Euler discretization of the Dr MMD flow (or Dr MMD Descent in short) with (De)-regularized Maximum Mean Discrepancy Gradient Flow step size γ > 0 consists of sequence of probabilities (µn)n N defined by the recursion µn+1 = I γ(1 + λ) h µn,π # µn, µn=0 = µ0, (15) where h µn,π = 2 (Σπ + λI) 1 (mµn mπ) is the Dr MMD witness function. This scheme was previously considered in Arbel et al. (2019); in particular, Proposition 4 of Arbel et al. (2019) shows that the discrete-time MMD dissipation rate along the MMD Descent iterates follows the (continous time) rate of the MMD flow up to an error term proportional to the step size γ and the smoothness parameters of the problem5. Next, we turn to study the convergence of Dr MMD descent. One way is to treat Dr MMD as MMD with a regularized kernel k and follow Proposition 4 of Arbel et al. (2019), however this does not quantitatively take into account the role of (de)-regularization parameter λ that balances the trade-offbetween geodesic convexity and smoothness of FDr MMD. Instead, we adopt the same strategy of Section 4 that exploits the interpolation property of Dr MMD towards χ2-divergence: in the following proposition, we study the dissipation of KL divergence along the Dr MMD Descent when the target π satisfies a Poincaré inequality, in which the role of (de)-regularization parameter λ is highlighted in the approximation and discretization errors. Proposition 5.1 (Descent lemma in KL) Suppose k satisfies Assumption 1 and 2, and suppose the target π and Dr MMD gradient descent iterates (µn)n N satisfy the following: 1. π satisfies a Poincaré inequality with constant CP and its potential is β-smooth, i.e., π exp( V ) with HV βI. 2. µn, π Ld. dπ 1 Ran(T r π ) with r > 0, i.e., there exists qn L2(π) such that dµn dπ 1 = T r π qn. 4. qn L2(π) Q, V dµn dπ L2(π) J , dµn dπ L2(π) I for all n = 1, , nmax. 5. For all i = 1, . . . , d, lim x h µn,π(x) 2dµn dπ (x) i dµn dπ (x) π(x) 0. 6. There exists a constant 1 < ζ < 2 such that for all n = 1, . . . , nmax, the step size γ satisfies 2ζ(1 + λ) q χ2 (µn π) K2d Then for all 0 n nmax and 0 < λ 1, KL(µn+1 π) KL(µn π) 2 CP χ2(µn π)γ + 4γλr Q (J + I) | {z } Approximation error + 8γ2(β + ζ2)χ2(µn π)K1d + K2d λ | {z } Discretization error 5. In the MMD flow case, the main smoothness parameters is the Lipschitz constant of the kernel. Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur The proof is provided in Section 10.9. 
Conditions 1-5 of Proposition 5.1 are similar to conditions 1-5 of Theorem 4.1, which assumes a functional inequality on the target π and regularity on the density ratio dµn dπ 1. In a similar spirit to Theorem 4.1, the fourth regularity condition is automatically satisfied under a stronger range assumption on dµn dπ 1. For the sake of brevity, we directly assume uniform upper bounds in the fourth condition rather than writing it out in a separate corollary like Corollary 4.1. Compared to the continuous time regime, two extra conditions are necessary. The first one is a smoothness condition on the potential, HV βI, which is commonly used in the convergence analysis of discrete-time Langevin-based samplers (Dalalyan, 2017; Dalalyan and Karagulyan, 2019; Durmus et al., 2019; Dalalyan et al., 2022; Vempala and Wibisono, 2019). It can be relaxed to V being Hölder-continuous with exponent s [0, 1] (Chatterji et al., 2020). The second one, (16), is an upper bound on the step size γ, aligning with the principle that step size should be small enough for the discrete-time scheme to inherit the properties of its continuous analog. This condition will be more thoroughly discussed in Remark 5.4 when all the conditions on γ in Theorem 5.1 and Theorem 5.2 are presented. Remark 5.1 (Approximation-discretization trade-offof Dr MMD Descent) If we compare the discrete-time KL dissipation of (17) with its continuous-time counterpart in (13), we see that the first two terms on the RHS of (17) admit continuous-time analogs present in (13). The discrete-time KL dissipation contains an additional (positive) term representing the time discretization error: unlike the approximation error term that vanishes as λ approaches 0, this term actually diverges as λ approaches 0. Therefore, replicating the arguments of the continuous-time result of Theorem 4.1 in the discrete-time regime would yield a barrier that does not vanish as λ 0, hinting at a trade-offbetween approximation and discretization similar to that of Remark 4.1. In the next section, we propose a refined adaptive discrete-time descent scheme that addresses the convergence issues. 5.2 Adaptive Dr MMD Descent The Dr MMD flow and descent dynamics are defined for a value of λ that remains fixed throughout time. The KL dissipation provided in Proposition 5.1 is a function of λ, however; thus, to obtain a sequence of measures with better convergence guarantees than the Dr MMD descent, we now construct and analyze a sequence of iterates obtained by selecting, at each iteration, the value of λ minimizing the sum of the approximation error and time-discretization error presented in (17). This sequence, which we term Adaptive Dr MMD descent, is given by µn+1 = I γ(1 + λ) h µn,π # µn, λn = 2γχ2(µn π)(β + ζ2)(K1d + K2d) 1 r+1 . (18) The particular choice of λn above minimizes the sum of the approximation error and timediscretization error presented in (17). This optimal choice indicates that λn should shrink towards 0 as χ2(µn π) decreases along Dr MMD gradient descent: at the early stages of the scheme, it is desirable to have a larger λn, corresponding to a smoother objective functional and enabling larger step sizes; then as Dr MMD gradient descent iterates µn get closer to π, a smaller λn enables the scheme to operate closer to the χ2 flow regime, which metrizes a stronger topology and can better witness the difference between µn and π. 
Additionally, (De)-regularized Maximum Mean Discrepancy Gradient Flow as the potential V becomes less smooth, i.e., β gets larger, then λn should also increase to account for the loss of smoothness from V . Unlike the Dr MMD descent in Section 5.1, since the Wasserstein gradient updating µn comes from a different Dr MMD at each iteration, the adaptive scheme of (18) constitutes a significant departure from related works in Wasserstein gradient descent (Glaser et al., 2021; Arbel et al., 2019; Korba et al., 2021; Hertrich et al., 2024b, 2023; Chewi et al., 2020). Remark 5.2 (Adaptive kernel) Recent applications of MMD-based generative modeling algorithms with adaptive kernels (in particular, time-dependent kernel hyperparameters) demonstrate improved empirical performance over fixed kernels in both Wasserstein gradient flow on MMD (Galashov et al., 2025) and generative adversarial networks with an MMD critic (Li et al., 2017; Arbel et al., 2018): the latter can be related to gradient flow on the critic where µn is restricted to the output of a generator network (see e.g. Franceschi et al., 2024). As the Dr MMD is an MMD with a regularized kernel k that depends on λn, the Adaptive Dr MMD Descent thus falls into the former category. Galashov et al. (2025, Proposition 3.1) demonstrates faster convergence for MMD gradient flow with an adaptive kernel, for the parametric setting of Gaussian distributions π and µt. Our analysis is the first to prove theoretically that adaptive kernels can result in improved convergence for more general nonparametric settings. We believe that the theoretical analysis of adaptivity by varying other hyperparameters (such as the kernel bandwidth for RBF kernels) remains an interesting avenue for future work. By leveraging the quasi-descent lemma in KL divergence in Proposition 5.1, we are able to establish the following theorem, which provides a near-global convergence result of the Adaptive Dr MMD gradient descent iterates in KL divergence. Theorem 5.1 (Near-global convergence of adaptive Dr MMD gradient descent) Suppose k satisfies Assumption 1 and 2 and K 1, and suppose the target π and adaptive Dr MMD gradient descent iterates (µn)n N satisfy the following conditions: 1. π satisfies a Poincaré inequality with constant CP and its potential is β-smooth, i.e. π exp( V ) with HV βI. 2. µn, π Ld. dπ 1 Ran(T r π ) with r > 0, i.e., there exists qn L2(π) such that dµn dπ 1 = T r π qn. 4. qn L2(π) Q, V dµn dπ L2(π) J , dµn dπ L2(π) I for all n = 1, , nmax. 5. For all i = 1, . . . , d, lim x h µn,π(x) 2dµn dπ (x) i dµn dπ (x) π(x) 0. 6. There exists a constant 1 < ζ < 2 such that the step size γ satisfies ζ 1 Q2K2d CP Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur Then by taking (de)-regularization parameter λn = 2γχ2(µn π)(β+ζ2)(K1d+K2d) Q(J +I) 1 r+1 1, we have KL(µnmax π) exp 2nmaxγ + 4γ r r+1 CP Q 2r+1 r+1 (K1d + K2d)(β + ζ2) r r+1 (J + I) 1 r+1 . (20) The proof can be found in Section 10.10. The conditions 1-5 of Theorem 5.1 are the same as conditions 1-5 of Proposition 5.1. Since the RHS of (19) are all constants, condition 6 is satisfied when the step size γ is small enough. The implication of Theorem 5.1 is that Dr MMD gradient descent exhibits an exponential rate of convergence (linear convergence) in terms of KL divergence up to an extra barrier of order O(γ r r+1 ). The barrier term shows up as the result of picking the optimal regularization parameter λn that best trades offthe approximation error and discretization error in Proposition 5.1. 
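The approximation/discretization trade-off behind the schedule (18) and the O(γ^{r/(r+1)}) barrier can also be checked numerically: the sketch below grid-searches the per-step error of Proposition 5.1 over λ (an approximation term of order γλ^r plus a discretization term of order γ²/λ) and compares the minimiser with the closed-form schedule, which matches it up to constant factors. All constants are illustrative placeholders.

```python
import jax.numpy as jnp

# Illustrative placeholder constants (not values derived in the paper).
gamma, beta, zeta, r = 1e-2, 1.0, 1.5, 0.5
K1d, K2d, Q, J, I = 1.0, 1.0, 1.0, 1.0, 1.0
chi2_n = 0.5                                   # stand-in for chi^2(mu_n || pi)

def per_step_error(lam):
    approx = 4.0 * gamma * lam ** r * Q * (J + I)                                   # approximation term
    discret = 8.0 * gamma ** 2 * (beta + zeta ** 2) * chi2_n * (K1d + K2d) / lam    # discretization term
    return approx + discret

lams = jnp.logspace(-6, 0, 2000)
lam_grid = lams[jnp.argmin(per_step_error(lams))]

# Closed-form schedule (18); it matches the grid-search minimiser up to constant factors.
lam_closed = (2.0 * gamma * chi2_n * (beta + zeta ** 2) * (K1d + K2d) / (Q * (J + I))) ** (1.0 / (r + 1.0))
print(lam_grid, lam_closed)
```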
Theorem 5.1 is reminiscent of the convergence result of the Langevin Monte Carlo sampling algorithm, whose KL divergence also decreases exponentially up to an extra barrier, but of order O(γ) (Vempala and Wibisono, 2019, Theorem 2). Unlike the continuous-time result of Theorem 4.1, in which the barrier can be made arbitrarily small by taking small enough regularization, taking the step size γ in Theorem 5.1 to be arbitrarily small will significantly impact the rate of convergence, even though it is exponential in terms of nmax. By making the step sizes adaptive with the number of iterations and imposing an extra condition, the barrier term actually vanishes, as demonstrated in the following theorem. Theorem 5.2 (Global convergence of Dr MMD gradient descent) Suppose that k satisfies Assumptions 1 and 2, and that the conditions in Theorem 5.1 on Dr MMD gradient descent iterates (µn)n N, target distribution π, regularization coefficient λn and step size γn are satisfied. If additionally, the step size γn satisfies γn 1 (K1d + K2d)(β + ζ2) r χ2(µn π) 1 r , (21) for all n = 1, , nmax, then KL(µnmax π) KL(µ0 π). (22) The proof can be found in Section 10.11. Compared with (20), (22) provides a cleaner upper bound without the barrier term and leads to global convergence. Remark 5.3 (Iteration complexity) We now turn to analyze the iteration complexity of Dr MMD gradient descent from Theorems 5.1 and 5.2. (i) From Theorem 5.1, given an error threshold δ > 0, Dr MMD descent would reach KL(µnmax π) δ after nmax CP 2γ log KL(µ0 π) δ) iterations. By comparison, when π satisfies a Poincaré inequality, Langevin Monte Carlo (LMC) has an iteration (De)-regularized Maximum Mean Discrepancy Gradient Flow complexity of O(1 δ) up to logarithmic terms (Chewi et al., 2024, Theorem 7). As an approximation to the χ2 flow, the iteration complexity of Dr MMD flow is worse than LMC, because at each iterate of Dr MMD flow, there is an extra approximation error in addition to time-discretization error (see our Proposition 5.1). With the optimal choice of regularization λ that balances these two errors, the overall per-step error is of order O(γ1+ r r+1 ). In contrast, at each iteration, LMC only incurs the time-discretization error, which is of order O(γ2), smaller than that of Dr MMD flow. As a result, LMC exhibits better iteration complexity than our Dr MMD flow; however, LMC requires knowledge of the score of π, while Dr MMD flow only requires samples from π. (ii) From Theorem 5.2, we consider two cases. On the one hand, if there exists a threshold N0 such that χ2(µn π) n r holds for all n N0, then we select step size γn CP n 1 for all n N0 such that both (19) and (21) are satisfied, and consequently Qnmax n=N0 1 1 CP γn = 0 so the iteration complexity of Dr MMD gradient descent is O 1 δ . On the other hand, if such a threshold N0 does not exist, then there exists a subsequence n1, n2, . . . , n S, . . . such that χ2(µns π) n r s for all s 1. Since KL divergence is smaller than χ2-divergence (Van Erven and Harremos, 2014), we have KL(µns π) n r s for all s 1. Notice that KL divergence is monotonically decreasing based on (63), we have KL(µn π) n r s for all ns n ns+1 so that limn KL(µn π) = 0. Unfortunately, we are not able to derive iteration complexity in this case because the growth rate of {ns}s 1 is unknown. Remark 5.4 (Step size γ) Theorem 5.1 imposes a condition on the step size γ in (19) and Theorem 5.2 imposes an additional condition in (21). 
These conditions subsume the condition (16) on step size in Proposition 5.1. (See derivations in Section 10.10.) The conditions (19) and (21) become more stringent as the potential V becomes less smooth, i.e., when β gets larger, similar to the analysis in Langevin Monte Carlo (Balasubramanian et al., 2022; Vempala and Wibisono, 2019) and Stein Variational Gradient Descent (Korba et al., 2020). The condition also becomes more stringent as the density ratio becomes less regular, i.e., when r gets closer to 0 and Q, J , I get larger, similar to He et al. (2024). The adaptive Dr MMD descent schemes of (15) and (18) defined via push-forward operations can be equivalently expressed by the following update scheme that defines a trajectory of samples (yn)n N whose distributions are precisely the Adaptive Dr MMD descent iterates (µn)n N, yn+1 = yn γ(1 + λn) h µn,π(yn), y0 µ0. (23) Unfortunately, (23) is still intractable in practice because h µn,π depends on the unknown distribution µn. Therefore, an additional discretization in space is needed to approximate (23) within a tractable algorithm. We propose to do so in the next section through a system of interacting particles, i.e., Dr MMD particle descent. 6. Dr MMD particle descent Suppose we have M samples from the target distribution {x(i)}M i=1 π and N samples from the initial distribution {y(i) 0 }N i=1 µ0. The Dr MMD particle descent is defined as: Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur y(i) n+1 = y(i) n γ(1 + λn) h ˆµn,ˆπ(y(i) n ), (24) where ˆµn = 1 N PN i=1 δy(i) n and ˆπ = 1 M PM i=1 δx(i) denote respectively the empirical distribution of the particles at time step n and the target, and where h ˆµn,ˆπ = 2 (Σˆπ + λn I) 1 (mˆµn mˆπ). Unfortunately, the KL divergence is ill-defined on empirical distributions ˆµn, which means the analysis of Theorem 5.1 is no longer applicable to study the convergence of Dr MMD particle descent. Therefore, in the next theorem, we instead resort to the Wasserstein-2 distance to analyze the convergence of Dr MMD particle descent. For simplicity of presentation below, we assume K 1. Theorem 6.1 Suppose that k satisfies Assumptions 1 and 2 with K 1, and that all the conditions in Theorem 5.1 on Dr MMD gradient descent iterates (µn)n N, target distribution π, regularization coefficient λn and step size γ are satisfied. In addition, suppose (µn)n N has bounded fourth moment and the target π satisfies a Talagrand-2 inequality with constant CT . Let the number of samples M, N satisfy 1 min i=1,...,nmax KL(µi π)Z 1 8nmaxγ r r+1 R min i=1,...,nmax KL(µi π)Z 1 r+1 1 8nmaxγ r r+1 R min i=1,...,nmax KL(µi π)Z 1 r+1 1 2r+2 , (25) where means up to constants, R = K1d + KK2d is a constant that only depends on the kernel, and Z is a constant that only depends on β, ζ, K1d, K2d, Q, J , I. Then we have E [W2 (ˆµnmax, π)] p 2CT exp nmaxγ KL (µ0 π) + O γ r 2r+2 , where the expectation is taken over initial samples {y(i) 0 }N i=1 drawn from µ0. The proof is provided in Section 10.12. Theorem 6.1 shows that for sufficiently large sample size M and N, Dr MMD particle descent exhibits an exponential rate of convergence (linear convergence) in terms of Wasserstein-2 distance up to an extra barrier of order O γ r 2r+2 . Remark 6.1 (Talagrand-2 inequality) We say that the target distribution π satisfies a Talagrand-2 inequality with constant CT if for any ν P2(Rd), 2CT KL(µ π). 
A Talagrand-2 inequality implies the Poincaré inequality in (11) with constant CP CT , so the condition that π satisfies a Talagrand-2 inequality is stronger than the condition in (De)-regularized Maximum Mean Discrepancy Gradient Flow Theorems 4.1 and 5.1. A Talagrand-2 inequality allows linking of the two key components of Theorem 6.1: the population convergence in terms of KL divergence proved in Theorem 5.1, and the finite-particle propagation of chaos bound in terms of Wasserstein-2 distance proved in Proposition 10.1. Talagrand inequality is widely used in finite-particle convergence analysis of Wasserstein gradient flows (Shi and Mackey, 2024). Remark 6.2 (Iteration and sample complexity) The choice of γ = O δ 2r+2 E [W2 (ˆµnmax, π)] δ with an iteration complexity of nmax = O 1 δ , which equals the square of the iteration complexity in Theorem 5.1 because Wasserstein-2 distance is of the same order as the square root of KL divergence. From Theorem 5.1, we have min i=1,...,nmax KL(µi π) KL(µnmax π) = O δ2 . Therefore, from (25), the sample complexity is at least M = poly exp δ 2 r 2 r+1 , N = poly exp δ 2 r 2 r+1 O δ d 4 . The poly-exponential sample complexity originates from the propagation of chaos in interacting particle systems (Kac, 1956). Although O δ d 4 is subsumed by the poly-exponential term, we still make it explicit to show that N suffers from the curse of dimensionality as expected from the Wasserstein-2 distance between an empirical distribution and a continuous distribution (Kloeckner, 2012; Lei, 2020). For comparison, the sample complexity of SVGD in Theorem 3 of Shi and Mackey (2024) is O(exp exp δ 2 ). Similar results have also been established for mean-field Langevin dynamics (Suzuki et al., 2023; Chen et al., 2024). Recently, Banerjee et al. (2025) proposed a refined finite-particle analysis of Stein Variational Gradient Descent (SVGD), which gives a uniform-in-time convergence bound, i.e., the bound does not blow up exponentially fast as the number of iterations nmax grows. The analysis of Banerjee et al. (2025), however, heavily relies on the relation that the time derivative of the KL divergence in the course of SVGD equals the squared kernel Stein discrepancy (KSD), a fact which does not hold for our Dr MMD flow. Recently, Chen et al. (2025) extended that analysis to finite-particle MMD gradient descent, but their analysis requires noise injection. We leave a more refined analysis of our finite-particle convergence result to future work. Having established the convergence of Dr MMD gradient flow/descent, we next show that Dr MMD particle descent admits a closed-form implementation. Proposition 6.1 shows that h ˆµn,ˆπ in (24), defined through the inverse of covariance operators, is computable using Gram matrices. Proposition 6.1 Given empirical distributions ˆµn = 1 N PN i=1 y(i) n , ˆπ = 1 M PM i=1 x(i) and Gram matrices Kxx = k(x1:M, x1:M) RM M and Kxy = k(x1:M, y1:N n ) RM N, the witness function h ˆµn,ˆπ can be computed as: h ˆµn,ˆπ( ) = 2 Nλn k( , y1:N)1N 2 Mλn k( , x1:M)1M 2 Nλn k( , x1:M)(Mλn I + Kxx) 1Kxy1N + 2 Mλn k( , x1:M)(Mλn I + Kxx) 1Kxx1M, (26) where 1M RM, 1N RN are column vectors of ones. Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur Algorithm 1 Dr MMD particle descent Input: Target samples {x(i)}M i=1 π and initial source samples {y(i) 0 }N i=1 µ0. Hyperparameters: step size γ, initial (de)-regularization coefficient λ0, maximum number of iterations nmax and regularity r. For n = 0 to nmax: 1. 
Compute the witness function h^{*}_{\hat\mu_n,\hat\pi} from (26). 2. Compute Dr MMD(\hat\mu_n‖\hat\pi) with h^{*}_{\hat\mu_n,\hat\pi} from (71). 3. Rescale the regularization coefficient: λ_n ∝ Dr MMD(\hat\mu_n‖\hat\pi)^{1/(r+1)}. 4. Update the particles using (24): y^{(i)}_{n+1} = y^{(i)}_n − γ(1+λ_n)∇h^{*}_{\hat\mu_n,\hat\pi}(y^{(i)}_n). End For. Output: {y^{(i)}_{n_max}}_{i=1}^N.

The proof of Proposition 6.1 can be found in Section 10.14. The gradient of h^{*}_{\hat\mu_n,\hat\pi} can be obtained using automatic differentiation libraries such as JAX (Bradbury et al., 2018). As indicated in (18), the regularization parameter λ_n should be chosen proportional to χ²(µ_n‖π)^{1/(r+1)}. In practice, however, neither χ²(µ_n‖π) nor r is accessible, and χ²(µ_n‖π) is not even well-defined for the particle descent algorithm. To address this, we use Dr MMD(\hat\mu_n‖\hat\pi), which admits a closed-form expression with particles, as a proxy for χ²(µ_n‖π) (see Algorithm 1). The parameter r is picked via a search over the pre-defined set {0.1, 0.5, 1.0}. The step size γ should be chosen to satisfy the upper bound (21), which contains several constants that cannot generally be computed; in practice, our approach has been to select γ sufficiently small for the flow to converge empirically. The final algorithm is summarized in Algorithm 1.

At every iteration, computing h^{*}_{\hat\mu_n,\hat\pi} with adaptive regularization λ_n has a time complexity of O(M³ + NM + N²) due to matrix inversion and multiplication. For Dr MMD particle descent with fixed λ, however, the total per-iteration cost can be reduced to O(NM + N²), which is exactly the same as for MMD flow, because the inversion of the M × M Gram matrix is only required once and can therefore be pre-computed at initialization (see Remark 4.4). In contrast, when N = M, the complexity of the Sinkhorn flow is O(N²/ε³) (Feydy et al., 2019), with ε the hyperparameter of the Sinkhorn divergence, and the complexity of the KALE flow is O(N³) (Glaser et al., 2021).

7. Related Work

In this section, we discuss the works in the literature that are related to our proposed Dr MMD flow and spectral (de)-regularization.

7.1 Gradient flows

Stein Variational Gradient Descent (SVGD) is a popular algorithm for sampling from distributions using only an unnormalized density. It can be written either as a gradient flow of the Kullback-Leibler (KL) divergence in which the Wasserstein gradient of the KL is preconditioned by T_{µ_t} (Liu and Wang, 2016; Liu, 2017; Korba et al., 2020), or as a gradient flow of the χ²-divergence whose Wasserstein gradient is preconditioned by T_π (Chewi et al., 2020),
\[
\partial_t\mu_t \;=\; \nabla\cdot\!\Big(\mu_t\, T_{\mu_t}\nabla\log\tfrac{d\mu_t}{d\pi}\Big)
\;=\; \nabla\cdot\!\Big(\mu_t\, \nabla T_{\pi}\tfrac{d\mu_t}{d\pi}\Big).
\]
Since SVGD may smooth the trajectory too much, He et al. (2024) considered a (de)-regularized SVGD flow,
\[
\partial_t\mu_t \;=\; \nabla\cdot\!\Big(\mu_t\,(T_{\mu_t}+\lambda I)^{-1}T_{\mu_t}\nabla\log\tfrac{d\mu_t}{d\pi}\Big),
\tag{27}
\]
which approaches the KL gradient flow (Langevin diffusion) as λ → 0, and which demonstrates faster convergence than SVGD. A key difference between the (de)-regularization in (27) of He et al. (2024) and our Dr MMD flow is that the flow in (27) is driven by a regularized version of the Wasserstein gradient of the KL divergence, while the Dr MMD flow is driven by the Wasserstein gradient of the regularized χ²-divergence. An alternative interpretation of this difference is that the flow in (27) is the gradient flow of the KL divergence w.r.t. the regularized Stein geometry (Duncan et al., 2023), whereas the Dr MMD flow is the gradient flow of the regularized χ²-divergence w.r.t. the Wasserstein geometry.
In addition to sampling from unnormalized distributions, Wasserstein gradient flows (particularly MMD flows) are widely used in the field of generative modelling (Birrell et al., 2022; Gu et al., 2024; Hertrich et al., 2023, 2024b,a; Galashov et al., 2025). The MMD flow (with a smooth kernel) (Arbel et al., 2019) can be written as
\[
\partial_t\mu_t \;=\; 2\,\nabla\cdot\!\Big(\mu_t\,\nabla T_{\pi}\Big(\tfrac{d\mu_t}{d\pi}-1\Big)\Big).
\tag{28}
\]
The MMD flow is known to get trapped in local minima, and several modifications have been proposed to avoid this in practice, such as noise injection (see Arbel et al., 2019, Proposition 8) or non-smooth kernels, e.g., based on negative distances (Sejdinovic et al., 2013). MMD gradient flows with non-smooth kernels have better empirical performance (Hertrich et al., 2024a,b) and have no local minima apart from the global one (Boufadène and Vialard, 2024), but they do not preserve discrete measures and rely on approximating implicit time discretizations (Hertrich et al., 2024a) or on slicing (Hertrich et al., 2024b). Recall that our Dr MMD flow takes the form
\[
\partial_t\mu_t \;=\; 2(1+\lambda)\,\nabla\cdot\!\Big(\mu_t\,\nabla (T_{\pi}+\lambda I)^{-1}T_{\pi}\Big(\tfrac{d\mu_t}{d\pi}-1\Big)\Big),
\]
which (de)-regularizes the MMD flow similarly to how (27) (de)-regularizes SVGD. It is both proved theoretically, in Theorems 4.1, 5.1 and 6.1, and verified empirically, in Section 8, that this (de)-regularization results in faster convergence than the MMD flow. Another closely related flow, called LAWGD, is considered in Chewi et al. (2020); it swaps the gradient and integral operators of SVGD, leading to the following flow:
\[
\partial_t\mu_t \;=\; \nabla\cdot\!\Big(\mu_t\, T_{\pi}\nabla\tfrac{d\mu_t}{d\pi}\Big).
\]
LAWGD closely resembles the MMD flow in (28), but Chewi et al. (2020) propose to replace T_π with an inverse diffusion operator, which requires computing the eigenspectrum of the latter and is unlikely to scale to high dimensions. KALE (Glaser et al., 2021) kernelizes the variational formulation of the KL divergence in a similar way as Dr MMD kernelizes the χ²-divergence in (5), but the KALE witness function does not have a closed-form expression; it requires solving a convex optimization problem, which makes the simulation of the KALE gradient flow with particles computationally more expensive. Recently, Stein et al. (2025) studied kernelized variational formulations of f-divergences (referred to as Moreau envelopes of f-divergences in an RKHS), which subsume both KALE (Glaser et al., 2021) and Dr MMD. They prove that these functionals are lower semi-continuous and that their Wasserstein gradient flows are well-defined for smooth kernels. They do not study the convergence properties of their proposed flows, however.

The proposed Dr MMD interpolates between MMD and the χ²-divergence by using a specific spectral regularization known as Tikhonov regularization, (1+λ)(T+λI)^{-1}T, which interpolates between the identity operator I as λ → 0 and T as λ → ∞. Dr MMD and its associated Wasserstein gradient flow can easily be extended to other spectral regularization strategies, such as Showalter regularization, Landweber iteration, or cutoff regularization (Engl et al., 1996), using the techniques in Bauer et al. (2007) and Hagrass et al. (2024a). Alternative approximations to the χ²-divergence have been proposed in the literature based on the idea of mollifiers, for which Wasserstein gradient flows have been constructed and their convergence analyzed (Li et al., 2023; Craig et al., 2023, 2025). Compared to the Dr MMD flow, these gradient flows rely on additional approximations, such as the use of log-sum-exp in Li et al. (2023) and the use of numerical integration to estimate convolution in Craig et al.
(2023, 2025), and are not directly applicable in generative modeling settings where only samples are available.

7.2 Comparison with diffusion-based generative models

Diffusion-based generative models have been widely adopted in practice and are closely related to Wasserstein gradient flows (Song et al., 2021; Ho et al., 2020). These models generate high-quality samples by reversing a pre-defined forward diffusion process, which gradually corrupts data with noise. To implement the reverse process, the established practice is to estimate the score function via denoising score matching (Song et al., 2021). In contrast, Wasserstein gradient flows directly construct a trajectory by descending the objective in the steepest direction with respect to the Wasserstein metric. In particular, our proposed Dr MMD gradient flow offers a tractable velocity field with a consistent finite-sample estimator, without solving an additional optimization problem such as score matching. We emphasize that the main contribution of our paper is to establish convergence of the Dr MMD flow, and that diffusion models for image generation require additional implementation details, most notably the inductive biases introduced by deep neural networks. One possible avenue for future work is to simulate Dr MMD gradient flows with kernels induced by learned deep neural network features on the data. In the case of MMD, this idea has been explored by Galashov et al. (2025), who demonstrate generation performance comparable to established diffusion models. Another promising direction, proposed by Hertrich et al. (2024b), involves first using MMD gradient flow with Riesz kernels to generate particle trajectories, and then distilling these trajectories into a neural network-based generator. Both approaches would be of interest for extending our Dr MMD gradient flow to domains such as image generation, as a topic for future work.

7.3 (De)-regularization for supervised learning and hypothesis testing

The idea of (de)-regularization is not new, and has been used in kernel Fisher discriminant analysis (Mika et al., 1999) and kernel ridge regression (Caponnetto and De Vito, 2007; Schölkopf and Smola, 2002). Subsequently, Harchaoui et al. (2007) employed this idea in two-sample testing, where they constructed a test statistic that (de)-regularizes MMD(µ‖π) with both covariance operators Σ_µ and Σ_π. This work has recently been generalized in Hagrass et al. (2024a) to more general spectral regularizations. A (de)-regularized statistic is also employed by Balasubramanian et al. (2021) and Hagrass et al. (2024b) in the context of goodness-of-fit testing; Balasubramanian et al. (2021) refer to the (de)-regularized MMD as the "Moderated MMD". To the best of our knowledge, the present work represents the first instance of the (de)-regularized MMD being used as a distance functional in a Wasserstein gradient flow. By only (de)-regularizing with Σ_π, Dr MMD approaches the χ²-divergence in the λ → 0 limit, a crucial property that is exploited in the proofs of the convergence results of Theorems 4.1 and 5.1.

8. Experiments

In this section, we demonstrate the superior empirical performance of the proposed Dr MMD descent in various experimental settings.

8.1 Three ring experiment

We follow the experimental set-up of Glaser et al. (2021), in which the target distribution π is defined on a manifold in R² consisting of three non-overlapping rings.
The initial source distribution µ_0 is a Gaussian distribution concentrated in the vicinity of the first ring. In this setting, all f-divergence gradient flows, including Langevin diffusion, are ill-defined, because the initial source µ_0 is not absolutely continuous with respect to the target π. Nevertheless, as one of the baseline methods, we simulate the χ² flow with an existing implementation of Liu et al. (2024a), which estimates the velocity field with a local linear estimator. In contrast, kernel-based gradient flows, namely the MMD, KALE, and Dr MMD flows, are well-defined in this setting; the MMD and KALE flows serve as the remaining baselines. We draw N = M = 300 samples from the initial source and the target distributions and run Dr MMD descent with adaptive λ for n_max = 100,000 iterations, at which point all methods have converged. As in Glaser et al. (2021), we use a Gaussian kernel k(x, x′) = exp(−0.5‖x − x′‖²/l²) with bandwidth l = 0.3. The step size for MMD descent is γ = 10^{−2}, and the step size for KALE and Dr MMD descent is γ = 10^{−3}. We enforce a positive lower bound of 10^{−3} on λ for numerical stability, and the regularity hyperparameter r is optimized over the set {0.1, 0.5, 1.0}.

From Figure 2, Left and Middle, we see that Dr MMD descent outperforms MMD, KALE, and χ² descent in terms of both dissimilarity metrics with respect to the target π: MMD and the Wasserstein-2 distance.

Figure 1: Animation of MMD, KALE, χ² and Dr MMD gradient descent on the three-ring dataset (snapshots at T = 0, 2, 30, 99).

Figure 1 visualizes the evolution of particles under these descent schemes, and demonstrates that both KALE and Dr MMD descent are sensitive to the mismatch of support and stay concentrated in the support of the target π, while particles of MMD descent can diffuse outside the support of π. Note that "χ²" denotes an alternative estimate of the χ²-divergence due to Liu et al. (2024a): being an f-divergence, we would expect "χ² descent" to match the support of the target (as KALE and Dr MMD do). This is not the case, owing to bias in the velocity field learned from samples. Compared to KALE descent, Dr MMD descent does not suffer from the numerical approximation error of the optimization routine used to obtain the KALE velocity field, which explains its improved performance.

8.2 Gradient flow for training student/teacher networks

Next, we consider a large-scale setting following Arbel et al. (2019), where a student network is trained to imitate the outputs of a teacher network. We consider a two-layer neural network of the form ψ(z, x) = G(b₁ + W₁ σ(W₀ z + b₀)), where σ is the ReLU non-linearity and x ∈ R^d is the concatenation of all network parameters b₁, W₁, b₀, W₀. Here G : R → R is an element-wise non-linear function, G(u) = exp(−u²/4). The teacher network is of the form Ψ_T(z, π) = ∫ψ(z, x)dπ(x), where π denotes the teacher distribution, and the student network is Ψ_S(z, µ) = ∫ψ(z, x)dµ(x), where µ denotes the student distribution. Here we consider Gaussian-distributed µ and π for simplicity.

Figure 2: Left & Middle: Comparison of Dr MMD descent with adaptive λ, MMD and KALE descent on three-ring synthetic data in terms of MMD and Wasserstein-2 distance with respect to the target π.
Right: Comparison of MMD descent and Dr MMD descent, each with and without noise injection, on training student/teacher networks in terms of validation MMD².

The student network can imitate the behavior of the teacher network by minimizing the objective⁶
\[
\min_{\mu\in\mathcal{P}_2(\mathbb{R}^d)}\; \mathbb{E}_{z\sim P_{\mathrm{data}}}\big[\big(\Psi_T(z,\pi)-\Psi_S(z,\mu)\big)^2\big],
\tag{29}
\]
where P_data is the distribution of the input data.

6. Note that our setting is slightly different from Chizat and Bach (2018), in which µ, π are measures over the hidden neurons, while our setting follows Arbel et al. (2019), in which µ, π are measures over all the network parameters.

If we define the kernel as the inner product of the neural network feature maps, k(x, x′) = E_{z∼P_data}[ψ(z, x)ψ(z, x′)], then the objective in (29) can be equivalently expressed as
\[
\min_{\mu\in\mathcal{P}_2(\mathbb{R}^d)}\; \iint k(x,x')\,d(\pi-\mu)(x)\,d(\pi-\mu)(x'),
\]
which is precisely MMD(µ‖π)² under the kernel k. Since G(u) = exp(−u²/4), the kernel is bounded and so the MMD is well-defined. Moreover, since the kernel k has bounded first- and second-order derivatives, it satisfies Assumption 2. Therefore, training the student network with the objective (29) can be treated as minimization of the squared MMD over P₂(R^d), i.e., as an MMD gradient flow. It is shown in Arbel et al. (2019) that the MMD flow and its descent scheme will generally get stuck in local optima, because the MMD is not geodesically convex; noise injection has therefore been proposed to escape these local optima. Supported by Theorems 4.1 and 5.1 on the convergence of the Dr MMD flow and its descent scheme, we propose to minimize Dr MMD(µ‖π) with Dr MMD descent, rather than minimizing MMD(µ‖π)² directly. Although this does not directly minimize the objective in (29), the favorable convergence behaviour of Dr MMD descent should result in a smaller MMD(µ‖π)² at convergence.

In our experimental setting, we are given M = 10 particles x^{(1)}, …, x^{(M)} from the teacher distribution π = N(0, I) and N = 1000 particles y_0^{(1)}, …, y_0^{(N)} from the initial student distribution µ_0 = N(0, 10^{−3}I). The teacher particles are fixed, while the student particles are updated according to Algorithm 1 at each time step. The initial (de)-regularization parameter is λ_0 = 0.1, the step size is γ = 0.1, we apply a lower bound of 10^{−3} on λ, and the regularity r is optimized over the set {0.1, 0.5, 1.0}. The neural network has 3 neurons in the hidden layer and output dimension 1. The data distribution P_data is the uniform distribution on the sphere in R^p with p = 50. A total of 2000 data points are sampled from P_data, with 1000 forming the training dataset and another 1000 the validation dataset. The kernel k(x, x′) = E_{z∼P_data}[ψ(z, x)ψ(z, x′)] is estimated at each iteration by an average over 100 randomly selected samples from the training dataset. Dr MMD and MMD descent are stopped after n_max = 15,000 iterations, by which point both have converged. The final performance is evaluated in terms of MMD²(µ_{n_max}‖π), with the kernel k estimated by an average over the 1000 samples in the validation dataset.

In Figure 2, Right, we report the performance of MMD descent (with and without noise injection) along with Dr MMD descent (with and without noise injection) in terms of the MMD on the validation dataset. We can see that Dr MMD descent does not get stuck in a local optimum, and leads to a much lower validation MMD(µ_{n_max}‖π)², even without noise injection.
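For reference, a single iteration of the Dr MMD particle update used in both experiments (Algorithm 1, with the closed-form witness function of Proposition 6.1) can be sketched as follows. This is a minimal illustrative sketch in JAX: the Gaussian kernel, array shapes, particle counts and hyperparameter values below are stand-in assumptions rather than the exact experimental configuration.

```python
# Minimal sketch of one Dr MMD particle-descent step (cf. Proposition 6.1 and
# Algorithm 1). Kernel, shapes, and hyperparameters are illustrative assumptions.
import jax
import jax.numpy as jnp

def gaussian_kernel(x, y, bandwidth=0.3):
    # k(x, y) = exp(-0.5 ||x - y||^2 / l^2) for a pair of points in R^d.
    return jnp.exp(-0.5 * jnp.sum((x - y) ** 2) / bandwidth ** 2)

def gram(kernel, A, B):
    # Gram matrix [k(a_i, b_j)]_{ij} between two sample sets.
    return jax.vmap(lambda a: jax.vmap(lambda b: kernel(a, b))(B))(A)

def witness(z, Y, X, lam, kernel=gaussian_kernel):
    # Closed-form witness function from (26), evaluated at a single point z:
    #   (2/(N lam)) k(z, Y) 1_N - (2/(M lam)) k(z, X) 1_M
    #   - (2/(N lam)) k(z, X) (M lam I + Kxx)^{-1} Kxy 1_N
    #   + (2/(M lam)) k(z, X) (M lam I + Kxx)^{-1} Kxx 1_M.
    # Gram matrices are recomputed here for brevity; an efficient implementation
    # would compute them once per iteration.
    N, M = Y.shape[0], X.shape[0]
    Kxx = gram(kernel, X, X)                      # (M, M)
    Kxy = gram(kernel, X, Y)                      # (M, N)
    kzY = jax.vmap(lambda y: kernel(z, y))(Y)     # (N,)
    kzX = jax.vmap(lambda x: kernel(z, x))(X)     # (M,)
    solve = lambda b: jnp.linalg.solve(M * lam * jnp.eye(M) + Kxx, b)
    return ((2.0 / (N * lam)) * kzY.sum()
            - (2.0 / (M * lam)) * kzX.sum()
            - (2.0 / (N * lam)) * kzX @ solve(Kxy @ jnp.ones(N))
            + (2.0 / (M * lam)) * kzX @ solve(Kxx @ jnp.ones(M)))

def drmmd_step(Y, X, lam, step_size):
    # Particle update (24): y <- y - gamma (1 + lam) grad h*(y), with the gradient
    # of the witness function obtained by automatic differentiation.
    grad_h = jax.vmap(jax.grad(lambda z: witness(z, Y, X, lam)))(Y)
    return Y - step_size * (1.0 + lam) * grad_h

# Toy usage with illustrative target and source particles in R^2.
X = jax.random.normal(jax.random.PRNGKey(0), (300, 2))            # target samples
Y = 2.0 + 0.1 * jax.random.normal(jax.random.PRNGKey(1), (300, 2))  # source samples
for _ in range(10):
    Y = drmmd_step(Y, X, lam=1e-1, step_size=1e-3)
```

In the fixed-λ case, the linear systems involving (Mλ I + Kxx) can additionally be pre-factorized once at initialization, which recovers the O(NM + N²) per-iteration cost discussed in Section 6.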
We also run Dr MMD descent with the noise injection scheme and find that noise injection can further improve the performance of Dr MMD descent and outperforms MMD descent with noise injection. Although it is unclear whether the density ratio has enough regularity to meet the condition of Theorem 5.1, the kernel k satisfies the boundedness and smoothness conditions of Assumption 2 and the target π satisfies the Poincaré inequality since it is Gaussian. The Dr MMD descent benefits from more favorable convergence properties, which explains its superior performance. The code to reproduce all the experiments can be found in the following Git Hub repository. https://github.com/hudsonchen/Dr MMD. 9. Discussion In this paper, we introduced (de)-regularization of the MMD (called Dr MMD) and its associated Wasserstein gradient flow. As an interpolation between the MMD and χ2divergence, the Dr MMD gradient flow inherits strengths from both sides: it is easy to simulate in closed form with particles, and it has an exponential rate of convergence towards the global minimum up to a controllable barrier term when the target π satisfies a Poincaré inequality. Additionally, we provide the optimal adaptive selection of a regularization coefficient that best balances the approximation and time discretization errors in Dr MMD gradient descent. Our work is the first to prove theoretically that an adaptive kernel through adaptive regularization can result in improved convergence of MMD gradient flow. The theoretical results are consistent with the empirical evidence in several numerical experiments. Following our work, there remain a number of interesting open problems. For example, (i) Since the kernel bandwidth has been known to play an important role in the performance of kernel-based algorithms, it is of interest to study the adaptive choice of kernel bandwidth in the context of Dr MMD gradient flow. (ii) To generalize our convergence analysis to the Wasserstein gradient flow of all Moreau envelopes of f-divergences in reproducing kernel Hilbert space, even when they do not have a closed-form expression as Dr MMD. (iii) While the current work proposes an approximation to the χ2-squared flow in the generative (De)-regularized Maximum Mean Discrepancy Gradient Flow modeling setting, i.e., where the target distribution π is known only through samples, it will be interesting to construct approximations to χ2-flow in the sampling setting, i.e., where π is known in closed form (at least up to normalization). In this section, we present the proofs of the results presented in Sections 3 6. 10.1 Proof of Proposition 3.1 MMD(µ π)2 = mµ mπ 2 H = Z k( , x)dµ(x) Z k( , x)dπ(x) Z k( , x) dµ dπ(x) 1 dπ(x) Also recall that the χ2-divergence between µ and π is χ2(µ π) = Z dµ dπ 1 2 dπ = I dµ 10.2 Proof of Proposition 3.2 Let µ π. In order to prove the alternative form of Dr MMD in (4), we start from (4) and show that it recovers (3). Dr MMD(µ||π) = (1 + λ) (Σπ + λI) 1 2 (mµ mπ) 2 = (1 + λ) (ι πιπ + λI) 1 = (1 + λ) (ι πιπ + λI) 1 dπ 1 , (ι πιπ + λI) 1 = (1 + λ) ιπ (ι πιπ + λI) 1 ι π = (1 + λ) ιπι π (ιπι π + λI) 1 dµ L2(π) , (30) where the last equality follows by noticing ιπ (ι πιπ + λI) 1 ι π = ιπι π (ιπι π + λI) 1. Therefore, Dr MMD(µ||π) = (1 + λ) Tπ (Tπ + λI) 1 dµ = (1 + λ) (Tπ + λI) 1 Tπ 1/2 dµ Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur which follows from the positivity and self-adjointness of Tπ (Tπ + λI) 1. So (4) is proved. Next, we are going to prove the variational formulation in (5). 
Similarly, we start from (5) and show it recovers (3). Consider (1 + λ) sup h H = (1 + λ) inf h H 4 dπ Z h(dµ dπ) + λ = (1 + λ) inf h H 4 h, Σπh H h, mµ mπ H + λ = (1 + λ) inf h H 4 I 1/2 h 1 4 I 1/2 (mµ mπ) 4 I 1/2 (mµ mπ) The last equality follows from completing the squares, based on which it is easy to see that the infimum is achieved at h µ,π = 2 (Σπ + λI) 1 (mµ mπ). For µ π, following the same derivations in (30), h µ,π can be alternatively expressed as h µ,π = 2 (Tπ + λI) 1 Tπ Plugging h µ,π back into (31) recovers (3), so (5) is proved. 10.3 Dr MMD is MMD with a regularized kernel k Given the definition of k(x, x ) = D (Σπ + λI) 1 2 k( , x), (Σπ + λI) 1 2 k( , x ) E H, it is clear that k is symmetric and positive definite so it has a unique associated reproducing kernel Hilbert space H (Steinwart and Christmann, 2008, Theorem 4.21) with canonical feature map k(x, ). Therefore, Dr MMD(µ π) = (1 + λ) (Σπ + λI) 1 2 (mµ mπ) 2 = (1 + λ) (Σπ + λI) 1 2 Z k(x, )d (π µ) (x), (Σπ + λI) 1 2 Z k(x , )d (π µ) (x ) = (1 + λ) Z (Σπ + λI) 1 2 k(x, )d (π µ) (x), Z (Σπ + λI) 1 2 k(x , )d (π µ) (x ) = (1 + λ) ZZ D (Σπ + λI) 1 2 k(x, ), (Σπ + λI) 1 2 k(x , ) E H d (π µ) (x)d (π µ) (x ) = (1 + λ) ZZ k(x, x )d (π µ) (x)d (π µ) (x ) Z k(x, )d(µ π)(x) (De)-regularized Maximum Mean Discrepancy Gradient Flow In the third and fourth equality above, we are using the fact that (Σπ + λI) 1 2 k(x, ) H is Bochner integrable and the Bochner integral preserves inner product structure. So Dr MMD is essentially MMD2 with a different kernel k up to a multiplicative factor of 1 + λ. Next, we present the Mercer decomposition of k. Notice that {ei}i 1 are the eigenfunctions of Tπ, so { ϱiei}i 1 are the eigenfunctions of Σπ. For x and x in the support of π, k also enjoys a pointwise convergent Mercer decomposition k(x, x ) = D (Σπ + λI) 1 2 k( , x), (Σπ + λI) 1 2 k( , x ) E (Σπ + λI) 1 i 1 ϱiei(x)ei , (Σπ + λI) 1 i 1 ϱiei(x )ei ϱi ϱi + λei(x)ei, X ϱi ϱi + λei(x )ei ϱi ϱi + λei(x)ei(x ). (33) More properties of the regularized kernel k are provided in Lemma B.3. 10.4 Proof of Proposition 3.3 Given that dµ dπ 1 L2(π), so Dr MMD(µ π) = (1 + λ) (Tπ + λI) 1 Tπ 1/2 dµ L2(π) (1 + λ) dµ dπ 1 which is finite for any λ 0. We are allowed to interchange the limit and integration according to the dominated convergence theorem (Rudin, 1976) to achieve, lim λ 0 Dr MMD(µ π) = lim λ 0 (Tπ + λI) 1 Tπ 1/2 dµ L2(π) = χ2(µ π). From (4), we have that, p Dr MMD(µ π) = 1 + λ (Σπ + λI) 1 2 (mµ mπ) H 1 + λ (Σπ + λI) 1 2 op mµ mπ H λ MMD(µ, π), (34) where the last inequality follows by noticing that Σπ = ι πιπ shares the same eigenvalues as Tπ = ιπι π, and hence the eigenvalues of (Σπ + λI) 1 are (ϱi + λ) 1 which all smaller than 1 λ. Therefore, Dr MMD(µ π) 1+λ λ MMD2(µ π). On the other hand, using Lemma A.10 from (Hagrass et al., 2024a), we have MMD (µ π) = mµ mπ H (Σπ + λI) 1 2 op (Σπ + λI) 1 2 (mµ mπ) H Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur 1 1 + λ (Σπ + λI) 1 2 op p Dr MMD (µ π) Dr MMD (µ π), (35) where K is the upper bound on kernel k in Assumption 2 and hence an upper bound on the operator norm of Σπ. Combining (34) and (35), we have 1 + λ K + λ MMD2(µ π) Dr MMD (µ π) 1 + λ λ MMD2(µ π). Therefore, limλ Dr MMD(µ π) = MMD2(µ π), and the proposition is proved. 10.5 Proof of Proposition 3.4 To show that Dr MMD is a probability divergence, we need to show that Dr MMD enjoys non-negativity and definiteness. It is easy to see that Dr MMD(µ π) is non-negative from its definition in Definition 1. 
Then, we prove definiteness, i.e., Dr MMD(µ||π) = 0 if and only if µ = π. For the first direction, assume Dr MMD(µ π) = 0, so (Σπ +λI) 1/2 (mµ mπ) 2 H = 0. Since (Σπ + λI) 1/2 is a non-singular operator, we must have that mµ = mπ which implies µ = π as k is c0-universal and hence characteristic (Sriperumbudur et al., 2011). For the other direction, when µ = π, immediately we can see Dr MMD(µ||π) = 0. Then we prove that Dr MMD metrizes weak convergence. For the first direction, we know from (34) that Dr MMD(µn π) 1+λ λ MMD2(µn π) and MMD2(µn π) 0 as µn converges weakly to π (Simon-Gabriel et al., 2023). For the converse direction, we assume that Dr MMD (µn π) 0. From (35), we know that MMD2 (µ π) K+λ 1+λ Dr MMD (µ π), therefore Dr MMD (µn π) 0 implies MMD (µn π) 0, implying the weak convergence of µn to π, if k is characteristic (Simon-Gabriel et al., 2023). 10.6 Proof of Proposition 4.1 In order to show that FDr MMD( ) = Dr MMD( π) admits a well-defined gradient flow, we follow the same techniques in Proposition 7 of Glaser et al. (2021) and Lemma B.2 of Chizat and Bach (2018), where the key is to show that (1 + λ) h µ,π is the Fréchet subdifferential of FDr MMD evaluated at µ.7 According to Definition 10.1.1 of Ambrosio et al. (2005), it is equivalent to prove that, for any µ P2(Rd) and φ C c (Rd), Dr MMD ((I + φ)#µ π) Dr MMD (µ π) (1 + λ) Z φ(x) h µ,π(x)dµ(x) + o φ L2(µ) . (36) Define ρt = (I + t φ)#µ, ϕt : Rd Rd, x 7 x + t φ(x) and g(t) = Dr MMD(ρt||π). Then from Lemma B.6 we know that g(t) is continuous and differentiable with respect to t and t=0g(t) = (1 + λ) Z φ(x) h µ,π(x)dµ(x). 7. Although Dr MMD can be viewed as squared MMD with a regularized kernel k, we are not using the technique in Arbel et al. (2019) because it relies on Lemma 10.4.1 of Ambrosio et al. (2005) which only provides the Fréchet subdifferential on probability measures µ that admit density functions. To construct the Wasserstein gradient flow of FDr MMD up to full generality, we resort to the techniques of Glaser et al. (2021) and Chizat and Bach (2018) instead. (De)-regularized Maximum Mean Discrepancy Gradient Flow Since t 7 g(t) is differentiable, using Taylor s theorem and mean value theorem (Rudin, 1976), we know that there exists 0 < κ < 1 such that Dr MMD ((I + φ)#µ π) Dr MMD (µ π) = g(1) g(0) = d t=0g(t) + d2 Therefore, to prove (36), the goal is to prove that d2 dt2 t=κg(t) o φ L2(µ) . To this end, since we know from Lemma B.6 that t 7 d dtg(t) is continuous and differentiable, we have dt2 Dr MMD(ρt||π) = 2(1 + λ) ZZ φ(x) 1 2 k (ϕt(x), ϕt(y)) φ(y)dµ(x)dµ(y) + 2(1 + λ) Z φ(x) Z H1 k (ϕt(x), ϕt(y)) dµ(y) Z H1 k (ϕt(x), y) dπ(y) φ(x)dµ(x), where k is the regularized kernel defined in (6) and H is the associated RKHS. Using Lemma B.3, the first term above can be upper-bounded by, 2(1 + λ) ZZ φ(x) 1 2 k (ϕt(x), ϕt(y)) φ(y)dµ(x)dµ(y) 2(1 + λ) ZZ φ(x) 1 2 k (ϕt(x), ϕt(y)) F φ(y) dµ(x)dµ(y) 2(1 + λ)K1d Z φ(y) dµ(y) 2 2(1 + λ)K1d λ φ 2 L2(µ). Using Lemma B.3 again, the second term can be upper-bounded by Z φ(x) Z H1 k (ϕt(x), ϕt(y)) dµ(y) Z H1 k (ϕt(x), y) dπ(y) φ(x)dµ(x) Z φ(x) Z H1 k (ϕt(x), ϕt(y)) dµ(y) φ(x)dµ(x) Z φ(x) Z H1 k (ϕt(x), y) dπ(y) φ(x)dµ(x) Z φ(x) Z H1 k (ϕt(x), ϕt(y)) F dµ(y) φ(x) dµ(x) + Z φ(x) Z H1 k (ϕt(x), y) F dπ(y) φ(x) dµ(x) λ φ 2 L2(µ). Combining the above two inequalities to have d2 dt2 Dr MMD(ρt||π) 2(1 + λ)K1d λ φ 2 L2(µ) + 4(1 + λ) KK2d λ φ 2 L2(µ). (37) Therefore, we have d2 dt2 t=κg(t) = O( φ 2 L2(µ)) = o( φ L2(µ)) as φ L2(µ) 0. 
So (36) is proved, which means that (1 + λ) h µ,π(x) is the Fréchet subdifferential of FDr MMD Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur evaluated at µ. According to Definition 11.1.1 of Ambrosio et al. (2005), there exists a solution (µt)t 0 such that the following equation holds in the sense of distributions, tµt (1 + λ) µt h µt,π = 0, and such (µt)t 0 is indeed the Dr MMD gradient flow, so existence is proved. Next, we are going to prove uniqueness. (37) indicates that FDr MMD is geodesically (1 + λ)4 KK2d+2K1d λ semiconvex. Therefore the uniqueness of (µt)t follows from Theorem 11.2.1 of Ambrosio et al. (2005). (8) is proved in (37). 10.7 Proof of Proposition 4.2 Given µ P2 Rd and φ C c Rd , consider the path (ρt)0 t 1 from µ to (I + φ)#µ given by ρt = (I + t φ)#µ. (ρt)0 t 1 is a constant-time geodesic in the Wasserstein-2 space by construction (Appendix A). Define ϕt : Rd Rd, x 7 x+t φ(x). We know from (Villani et al., 2009, Example 15.9) (by taking m = 2) that, t=0χ2(ρt π) = Z dµ dπ(x) V (x) φ(x) φ(x) 2 dµ(x) dπ(x) φ(x) HV (x) φ(x)dµ(x) dπ(x) Hφ(x) 2 F dµ(x). (38) V is twice differentiable, so V as a function from Rd to Rd is continuous. φ C c (Rd) has compact support, so x 7 V (x) φ(x) φ(x) is a continuous function over a compact domain, so its image is also compact and hence bounded. Using similar arguments, x 7 φ(x) HV (x) φ(x) and x 7 Hφ(x) 2 F are also continuous functions over compact domains, so they are all bounded. Since dµ dπ 1 H L2(π) and R dµ dπ(x)dµ(x) = dµ dπ 1 2 L2(π) + 1 < , we have dt2 t=0χ2(ρt π) < . (39) Next, from Lemma B.9 we know that d2 dt2 t=0χ2(ρt π) can be alternatively expressed as, t=0χ2(ρt π) = 2 Z ( φ(x)µ(x)) 1 π(x) + 2 Z φ(x) H µ(x) φ(x)µ(x)dx. (40) Recall from Lemma B.6 that t=0 Dr MMD(ρt||π) = 2(1 + λ) ZZ φ(x) 1 2 k (x, y) φ(y)dµ(x)dµ(y) (De)-regularized Maximum Mean Discrepancy Gradient Flow + 2(1 + λ) Z φ(x) Z H1 k (x, y) dµ(y) Z H1 k (x, y) dπ(y) φ(x)dµ(x). (41) To prove (9), our aim then is to compare and bound the difference of d2 dt2 t=0χ2(ρt π) in (40) and 1 1+λ d2 dt2 t=0 Dr MMD(ρt||π) in (41), so we compare and bound their first and second term separately. The first term of (40) can be rewritten as Z ( φ(x)µ(x)) 1 π(x) 2 π(x)dx = X Z ( φ(x)µ(x)) ei(x)dx 2 = X Z φ(x) ei(x)µ(x)dx 2 , (42) where we use integration by parts in the last line since φ C c (Rd). And the first term of (41) after rescaling by 1 1+λ can be rewritten as, 2 ZZ φ(x) 1 2 k (x, y) φ(y)dµ(x)dµ(y) = 2 ZZ φ(x) x y ϱi ϱi + λei(x)ei(y) φ(y)dµ(x)dµ(y) = 2 ZZ ( φ(x)µ(x)) ( φ(y)µ(y)) X ϱi ϱi + λei(x)ei(y)dxdy. (43) Since π Ld so (33) is true for all x, y Rd, hence the second equality is true, and the last equality uses integration by parts since φ C c (Rd). Notice that ZZ ei(x)ei(y) ( φ(x)µ(x)) ( φ(y)µ(y)) dxdy Z ( φ(x)µ(x)) ei(x) dx 2 Z ( φ(x)µ(x)) ! Z ei(x)2µ(x)dx Z log µ(x) φ(x) + φ(x) 2 dµ(x) i 1 ϱiei(x)2dµ(x) Z log µ(x) φ(x) + φ(x) 2 dµ(x) < + . The first inequality uses Cauchy-Schwartz, the second inequality uses P i 1 ϱi(ei(x))2 = k(x, x) K. The last quantity is finite because x log µ(x) is a continuous function and φ C c (Rd) has compact support, hence the integral of a continuous function over a Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur compact domain is always finite. 
Then, by using Fubini s theorem (Rudin, 1976), we are allowed to interchange the infinite sum and integration of (43) to reach, 2 ZZ ( φ(x)µ(x)) ( φ(y)µ(y)) X ϱi ϱi + λei(x)ei(y)dxdy ZZ ( φ(x)µ(x)) ( φ(y)µ(y)) ei(x)ei(y)dxdy Z ( φ(x)µ(x)) ei(x)dx 2 Z φ(x) ei(x)µ(x)dx 2 , where the last equality uses integration by parts. So the difference between the first term of (40) and (41) rescaled by 1 1+λ is, Z φ(x) ei(x)µ(x)dx 2 X Z φ(x) ei(x)µ(x)dx 2 Z φ(x) ei(x)µ(x)dx 2 . (44) Now we turn to the second term. The second term of (40) can be rewritten as 2 Z φ(x) H dµ dπ(x) φ(x)µ(x)dx = 2 Z φ(x) H dµ dπ(x) 1 φ(x)µ(x)dx = 2 Z φ(x) H L2(π) ei(x) φ(x)µ(x)dx, (45) and the second term of (41) rescaled by 1 1+λ can be rewritten as, 2 Z φ(x) Z H1 k (x, y) dµ(y) Z H1 k (x, y) dπ(y) φ(x)dµ(x) = 2 Z φ(x) H Z k (x, y) dµ(y) Z k (x, y) dπ(y) φ(x)dµ(x) = 2 Z φ(x) H ϱi ϱi + λei(x) Z ei(y)d(µ π)(y) = 2 Z φ(x) H L2(π) ei(x) φ(x)dµ(x). (46) (De)-regularized Maximum Mean Discrepancy Gradient Flow Since π Ld, so (33) is true for all x, y Rd hence the third equality is true. From Lemma B.3, we have R k(x, y)dµ(y) K λ , x 7 k(x, y) is second-order differentiable, supx |H1 k(x, y)| KK2d λ . So we are allowed to interchange integration and Hessian in the second equality using the differentiation lemma (Klenke, 2013, Theorem 6.28). Consider the difference of (46) and (45), we have L2(π) ei(x) L2(π) ei(x) L2(π) ei(x) dπ 1 H, there exists q L2(π) such that dµ dπ 1 = T 1/2 π q so that dµ ϱ1/2 i q, ei for all i. For j, r {1, , d}, we have L2(π) j rei(x) λ ϱi + λϱ1/2 i q, ei L2(π) j rei(x) 2 q, ei 2 L2(π) i M0 ϱi ( j rei(x))2 i M0 q, ei 2 L2(π) i M0 ϱi ( j rei(x))2 i M0 q, ei 2 L2(π) The final inequality holds because, i 1 ϱi ( j rei(x))2 i 1 ϱi j rk(x, ), ei 2 H = i 1 j rk(x, ), ϱiei 2 H j,r=1 j rk(x, ) 2 H = H1k(x, ) 2 Hd d K2d. (48) Since sdq L2(π) is bounded, so P i M0 q, ei 2 L2(π) converges to 0 uniformly as M0 and hence g M0(x) converge to 0 uniformly. Therefore, we are allowed to interchange the Hessian and the infinite sum (Rudin, 1976) in (47) to achieve, L2(π) ei(x) Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur λ ϱi + λϱ1/2 i q, ei L2(π)Hei(x) 2 q, ei 2 L2(π) Z φ(x) Hei(x) φ(x)µ(x)dx 2 2 q, ei 2 L2(π) i 1 ϱi Hei(x) 2 op Z φ(x) φ(x)µ(x)dx 2 2 q, ei 2 L2(π) i 1 ϱi Hei(x) 2 op 2 q, ei 2 L2(π) i 1 ϱi Hei(x) 2 F 2 q, ei 2 L2(π) K2d φ 2 L2(µ). (49) The first inequality uses Cauchy Schwartz, the second last inequality uses that matrix operator norm is smaller than matrix Frobenius norm, and the last inequality uses (48). Combining together (44) and (49), we reach 1 1 + λ d2 t=0 Dr MMD(ρt||π) d2 t=0χ2(ρt||π) Z φ(x) ei(x)µ(x)dx 2 2 q, ei 2 L2(π) K2d φ 2 L2(µ) =: R(λ, µ, φ). t=0 Dr MMD(ρt||π) (1 + λ) d2 t=0χ2(ρt||π) (1 + λ) R(λ, µ, φ) (1 + λ) Z dµ dπ(x) φ(x) HV (x) φ(x)dµ(x) (1 + λ) R(λ, µ, φ) (1 + λ)α Z dµ dπ(x) φ(x) 2dµ(x) (1 + λ) R(λ, µ, φ), where the second inequality is using (38) and the last inequality is using HV αI. So (9) is proved. (De)-regularized Maximum Mean Discrepancy Gradient Flow Define R(λ, µ, φ) := (1+λ) R(λ, µ, φ). The final thing to check is limλ 0 R(λ, µ, φ) = 0, which is equivalent to check that limλ 0 R(λ, µ, φ) = 0. 
Since we know from (39) and (42) that Z φ(x) ei(x)µ(x)dx 2 < d2 t=0χ2(ρt π) < , using the dominated convergence theorem (Rudin, 1976), we are allowed to interchange infinite sum and taking limits, Z φ(x) ei(x)µ(x)dx 2 = X i 1 lim λ 0 λ ϱi + λ Z φ(x) ei(x)µ(x)dx 2 Similarly, because P i 1 λ ϱi+λ 2 q, ei 2 L2(π) < q 2 L2(π) < , using dominated convergence theorem (Rudin, 1976) again, we have, 2 q, ei 2 L2(π) = X i 1 lim λ 0 2 q, ei 2 L2(π) = 0. Therefore, we have that lim λ 0 R(λ, µ, φ) = lim λ 0 R(λ, µ, φ) = 0. And the proof of the proposition is finished. 10.8 Proof of Theorem 4.1 Considering the time derivative of KL(µt π), we have d dt KL(µt π) = (1 + λ) Z h µt,π(x) log dµt dπ (x)µt(x)dx = (1 + λ) Z h µt,π(x) dµt dπ (x)π(x)dx = (1 + λ) Z h µt,π(x) 2 dµt dπ (x)π(x)dx 2(1 + λ) Z π(x) dµt = (1 + λ) Z h µt,π(x) 2 dµt dπ (x) 1 dµt dπ (x)π(x)dx 2(1 + λ) Z π(x) dµt Case one: R π(x) dµt dπ (x) 2 dx < . We use integration by parts for the first term in (50) and we can safely ignore the boundary term due to condition 5 that for i = 1, . . . , d, Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur h µt,π(x) 2dµt dπ (x) i dµt dπ (x) π(x) 0. So, we obtain d dt KL(µt π) = (1 + λ) Z h µt,π(x) 2 dµt dπ (x) 1 π(x) dµt 2(1 + λ) Z π(x) dµt = (1 + λ) Z h µt,π(x) 2 dµt dπ (x) 1 π(x) dµt π(x) π(x)dx 2(1 + λ) Z π(x) dµt (1 + λ) h µt,π 2 dµt CP KL(µt π), (51) where the first part of the last inequality holds by using Cauchy Schwartz, and the second part holds by the fact that KL(µt π) χ2(µt π) (Van Erven and Harremos, 2014) and by applying the Poincaré inequality with f = dµt dπ 1 (notice that f L2(π) < from Case one and f L2(π) < from condition 3), 2 dx 1 CP χ2(µt π) 1 CP KL(µt π). dπ 1 Ran (T r π ) with r > 0, using Lemma B.5 we have h µt,π 2 dµt dπ 1 L2(π) 2λr qt L2(π) . (52) Then, notice that = Z h π(x) dµt π(x)2 dπ(x) = Z π(x) dµt dπ (x) + π(x) dµt π(x)2 dπ(x) = Z log π(x) dµt dπ (x) + dµt dπ (x) 2 dπ(x) Z 2 log π(x) dµt dπ (x) 2 + 2 dµt dπ (x) 2 dπ(x) = 2 (log π) dµt L2(π) + 2 dµt L2(π) 2J 2 t + 2I2 t . (53) (De)-regularized Maximum Mean Discrepancy Gradient Flow Therefore, plugging (52) and (53) back to (51), we have d dt KL(µt π) 4(1 + λ)λr qt L2(π) (Jt + It) 2(1 + λ) CP KL(µt π). (54) Case two: R π(x) dµt dπ (x) 2 dx = . The first term of (50) remains the same, and the second term of (50) now equals infinity which is larger than the finite 2(1+λ) CP KL(µt π), so we also obtain (54) as in Case one. Therefore, both Case one and Case two result in (54). Using the Gronwall lemma, we obtain that for any T > 0, KL(µT π) exp 2(1 + λ) CP T KL(µ0 π) + 4(1 + λ)λr Z T 0 exp 2(1 + λ) CP (T t) qt L2(π) (Jt + It)dt, which concludes the proof of Theorem 4.1. 10.8.1 Derivation of (14) under stronger range assumption r = 1 Notice that for any x Rd, since k is differentiable dπ 1 (x) = k(x, ), dµt And since dµt dπ 1 H, there exists qt L2(π) such that D dµt L2(π) = ϱ1/2 i qt, ei L2(π) for all i, so dµt i 1 qt, ei 2 L2(π) = qt 2 L2(π). Combining the above two equations, we have (log π) dµt dπ 1 L2(π) p K1d qt L2(π) log π L2(π) . Also, for the other one, notice that dπ 1 (x) F = Hk(x, ), dµt Hk(x, ) Hd d K2d qt L2(π). Therefore, dµt K2d qt L2(π). Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur 10.9 Proof of Proposition 5.1 We know that µn+1 = (I γ(1 + λ) h)#µn where we drop the subscripts of the witness function hµn,π when it causes no ambiguity. Denote ρt = (I t(1 + λ) h)#µn, so ρ0 = µn, and ργ = µn+1. 
Consider the difference of KL divergence between the two iterates µn+1 and µn along the time-discretized Dr MMD flow: KL(µn+1 π) KL(µn π) = KL(ργ π) KL(ρ0 π) t=0 KL(ρt π)γ + Z γ dt2 KL(ρt π)dt. (55) For the first term of (55), t=0 KL(ρt π) = (1 + λ) Eµn = (1 + λ) Eπ = 2(1 + λ) dµn L2(π) + (1 + λ) Eπ = 2(1 + λ) dµn L2(π) + (1 + λ) Z dµn dπ (x) 2 dµn dπ (x) 1 h(x) π(x)dx = 2(1 + λ) dµn L2(π) (1 + λ) Z 2 dµn dπ (x) 1 h(x) π(x) dµn CP χ2(µn π) + (1 + λ) h 2 dµn where the fourth equality uses an integration by parts, and the last inequality uses Poincaré inequality for the first term under similar arguments in Section 10.8 and uses Cauchy-Schwarz for the second term. Using Lemma B.5 and the derivations in (53), (56) can be further upper bounded by t=0 KL(ρt π) (1 + λ) 2 CP χ2(µn π) + 2(1 + λ)λr Q (J + I) . (57) Then, for the second term of (55), we know from Example 15.9 of Villani et al. (2009) (taking m = 1) that, dt2 KL(ρt π) = (1 + λ)2 Z h(x) HV (ϕt(x)) h(x)dµn(x) + (1 + λ)2 Z Hh(x) (I t(1 + λ)Hh(x)) 1 2 Because 2(1 + λ)γ q χ2 (µ0 π) K2d ζ for 1 < ζ < 2, applying Lemma B.7 we have, dt2 KL(ρt π) 4(1 + λ)2βχ2(µn π)K1d λ + 4(1 + λ)2ζ2χ2(µn π)K2d (De)-regularized Maximum Mean Discrepancy Gradient Flow Combining the above two inequalities (57) and (58) and plugging them back into (55), we obtain KL(µn+1 π) KL(µn π) 2 CP (1 + λ)χ2(µn π)γ + 2(1 + λ)γλr Q (J + I) + 2(1 + λ)2γ2(β + ζ2)χ2(µn π)K1d + K2d CP χ2(µn π)γ + 4γλr Q (J + I) + 8γ2(β + ζ2)χ2(µn π)K1d + K2d where the last inequality holds by using 0 < λ 1, and the result follows. 10.10 Proof of Theorem 5.1 In order to use Proposition 5.1 in the proof, first we are going to show that Proposition 5.1 holds under the conditions of Theorem 5.1. Notice that conditions 1-4 of Proposition 5.1 are precisely the conditions 1-4 of Theorem 5.1. To use Proposition 5.1 in the proof of Theorem 5.1, the only thing left is to check that the condition of step size γ in (16) is satisfied. In Theorem 5.1, λn is selected to be 2γ χ2(µn π)(β+ζ2)(K1d+K2d) Q(J +I) 1 r+1 1. If λn is taken to be the former, then χ2(µn π)K2d χ2(µn π)K2d = 2(2γ) 2r+1 2r+2 χ2 (µn π) r 2r+2 (Q(J + I)) 1 2r+2 1 K1d + K2d ( ) (8γ) 2r+1 2r+2 Q 2r+1 2r+2 (J + I) 1 2r+2 1 β + ζ2 ( ) holds because K2d K1d+K2d 1 and χ2 (µn π) = dµn L2(π) = T r π qn 2 L2(π) K2r Q2 Q2. (60) The second last inequality of (59) holds due to the constraint on γ in (19), and the last inequality of (59) holds because β + ζ2 1. On the other hand, if λn is chosen to be 1, similarly based on the constraint on γ in (19), we have 2γ(1 + λn)χ2(µn π)K2d λn = 4γχ2(µn π)K2d ζ 1 Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur Therefore, all the conditions of Proposition 5.1 have been verified. So, if we select λn = 2γ χ2(µn π)(β+ζ2)(K1d+K2d) Q(J +I) 1 r+1 1, then KL(µn+1 π) KL(µn π) γ 2 CP χ2(µn π) + 4γλr n Q (J + I) | {z } ( 1) + 8γ2(β + ζ2)χ2(µn π)K1d + K2d λn | {z } ( 2) By observing (61), the first term on the right-hand side 2 CP χ2(µn π)γ is strictly negative and is decreasing KL divergence at each iteration n of the Dr MMD gradient descent. In contrast, the second term ( 1) := 4γλr n Q(J + I) and the third term ( 2) := 8γ2(β + ζ2)χ2(µn π)K1d+K2d λn are positive and prevent the KL divergence from decreasing. Denote G(λn) = ( 1) + ( 2) and the optimal λn is achieved by taking d dλn G(λn) = 0, which leads to λn = 2γ χ2(µn π)(β+ζ2)(K1d+K2d) Q(J +I) 1 r+1 . 
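This choice can equivalently be read off by balancing the two terms of G(λ_n): equating the term that increases in λ_n with the term that decreases in λ_n gives
\[
4\gamma\lambda_n^{\,r}\,Q(\mathcal{J}+\mathcal{I})
\;=\;
8\gamma^{2}(\beta+\zeta^{2})\,\chi^{2}(\mu_n\|\pi)\,\frac{K_{1d}+K_{2d}}{\lambda_n}
\;\;\Longleftrightarrow\;\;
\lambda_n^{\,r+1}
\;=\;
\frac{2\gamma\,\chi^{2}(\mu_n\|\pi)(\beta+\zeta^{2})(K_{1d}+K_{2d})}{Q(\mathcal{J}+\mathcal{I})},
\]
which is precisely the value of λ_n above; the exact stationary point of G differs from this value only by a constant factor depending on r.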
Plugging the value of λn back to (61) to obtain, KL(µn+1 π) KL(µn π) CP χ2(µn π)γ + 4γ (2γ) r r+1 χ2(µn π)(β + ζ2)(K1d + K2d) r r+1 (Q(J + I)) 1 r+1 CP χ2(µn π)γ + 8γ1+ r r+1 χ2(µn π)(β + ζ2)(K1d + K2d) r r+1 (Q(J + I)) 1 r+1 CP χ2(µn π)γ + 8γ1+ r r+1 Q 2r+1 r+1 (K1d + K2d)(β + ζ2) r r+1 (J + I) 1 r+1 , (62) where the last inequality holds because of (60). Since χ2(µn π) KL(µn π) (Van Erven and Harremos, 2014, Equation (7)), we have KL(µn+1 π) 1 γ 2 + 8γ1+ r r+1 Q 2r+1 r+1 (K1d + K2d)(β + ζ2) r r+1 (J + I) 1 r+1 . After iterating, we obtain KL(µnmax π) nmax KL(µ0 π) + 4γ r r+1 CP Q 2r+1 r+1 (K1d + K2d)(β + ζ2) r r+1 (J + I) 1 r+1 KL(µ0 π) + 4γ r r+1 CP Q 2r+1 r+1 (K1d + K2d)(β + ζ2) r r+1 (J + I) 1 r+1 and the result follows. 10.11 Proof of Theorem 5.2 The proof of Theorem 5.2 is also based on Proposition 5.1 proved in the last section. Recalling (62) yet with adaptive step size γn, we have KL(µn+1 π) KL(µn π) (De)-regularized Maximum Mean Discrepancy Gradient Flow CP χ2(µn π)γn + 8γ 1+ r r+1 n χ2(µn π)(β + ζ2)(K1d + K2d) r r+1 (Q(J + I)) 1 r+1 . From (21), we have r r+1 n (K1d + K2d)(β + ζ2) r r+1 (Q(J + I)) 1 r+1 1 CP χ2(µn π) 1 r+1 , so that we have KL(µn+1 π) KL(µn π) 1 CP χ2(µn π)γn 1 CP KL(µn π)γn. KL(µn+1 π) 1 1 KL(µn π). (63) After iterating n from 1 to nmax, the theorem is proved. 10.12 Proof of Theorem 6.1 In order to analyze the error of space discretization, we introduce another particle descent scheme using the population witness function h µn,π defined in (15) starting from the same initialization as that of (24), y(i) n+1 = y(i) n γ(1 + λn) h µn,π(y(i) n ), y(i) 0 = y(i) 0 . (64) The corresponding empirical distribution of the particles at time step n is defined as µn = 1 N PN i=1 y(i) n . Note that (64) is an unbiased sampled version (since it is composed of N i.i.d. realizations) of (15). The following proposition shows that W2(µn, ˆµn) 0 as N, M , i.e., with a sufficient number of samples from µ0 and π, (24) can approximate (64) with arbitrary precision. The proof of Proposition 10.1 is provided in Section 10.13. Proposition 10.1 Suppose k satisfies Assumption 1 and 2. Given initial particles {y(i) 0 }N i=1 that are i.i.d sampled from µ0, a sequence (µn)n N of empirical distributions arising from (64), and a sequence (ˆµn)n N arising from (24), then for all n 1, we have E[W2 (ˆµn, µn)] A K γn2(1 + λ)R where A = 2 KK1d KK2d+K1d and R = K1d + KK2d are constants that only depend on the kernel, and λ = min i=1,...,n λi denotes the smallest regularization coefficient. Now we are ready to prove Theorem 6.1. By the triangle inequality, we have, E [W2 (ˆµn, π)] E [W2 (ˆµn, µn)] + E [W2 ( µn, µn)] + W2 (µn, π) . Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur From Proposition 10.1, the first term is upper bounded by E[W2 (ˆµn, µn)] A K nγ 2(1 + λ)R where the second inequality uses λ 1 and K 1. Since (µn)n has finite fourth moment, then by taking p = 2, q = 4 in (Lei, 2020, Theorem 3.1) and (Fournier and Guillin, 2015, Theorem 1), the second term is upper bounded by, E [W2 ( µn, µn)] = O N 1 d 4 . For the third term, since the Wasserstein-2 distance is upper bounded by the square root of the KL divergence, if the target π that satisfies Talagrand-2 inequality with constant CT (Villani et al., 2009, Definition 22.1), we have W2 (µn, π) p KL (µn π) p KL (µ0 π) + O γ r 2r+2 , where the last inequality follows from Theorem 5.1. Combining the above three terms, we obtain E [W2 (ˆµn, π)] A 1 + O N 1 d 4 KL (µ0 π) + O γ r 2r+2 , (65) where A = 2 KK1d KK2d+K1d . 
Recall from Theorem 5.1 that λi = γχ2(µi π)Z 1 r+1 1 for i = 1, . . . , n where Z is the constant that depends on β, ζ, K1d, K2d, Q, J , I. From the condition on the number of samples M and N in (25), we obtain that if λ = λj = γχ2(µj π)Z 1 r+1 for some j {1, . . . , n}, 1 min i=1,...,n KL(µi π)Z 1 4nγ r r+1 R min i=1,...,n KL(µi π)Z 1 r+1 1 1 min i=1,...,n KL(µi π)Z 4nγ r r+1 R min i=1,...,n KL(µi π)Z 1 r+1 1 KL(µj π)Z 4nγ r r+1 R (KL(µj π)Z) 1 r+1 (De)-regularized Maximum Mean Discrepancy Gradient Flow 1 χ2(µj π)Z 4nγ r r+1 R (χ2(µj π)Z) 1 r+1 = Aγ r r+1 1 On the other hand, if λ = 1, since γ 1, 1 min i=1,...,n KL(µi π)Z 1 4nγ r r+1 R min i=1,...,n KL(µi π)Z 1 r+1 1 γ exp 4nγ r r+1 R γ r 2r+2 exp (4nγR) = γ r 2r+2 1 Similarly for N, we have N γ r 2r+2 exp 4nγR and N 1 d 4 γ r 2r+2 . Plugging them back to (65), we obtain E [W2 (ˆµn, π)] p KL (µ0 π) + O γ r 2r+2 , which completes the proof. 10.13 Proof of Proposition 10.1 Since the proof below works for any regularization coefficient λ, we use a fixed λ for the majority of the analysis and resort back to adaptive λn at the end of the proof. For empirical distributions ˆµn = 1 N PN i=1 y(i) n and µn = 1 N PN i=1 y(i) n defined in (24) and (64), note that E W 2 2 (ˆµn, µn) 1 i=1 E y(i) n y(i) n 2 := c2 n. i=1 E y(i) n+1 y(i) n+1 2 i=1 E y(i) n y(i) n γ(1 + λ) h ˆµn,ˆπ(y(i) n ) h µn,π(y(i) n ) 2 i=1 E y(i) n y(i) n 2 Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur γ(1 + λ) h ˆµn,ˆπ(y(i) n ) h µn,π(y(i) n ) 2 = cn + γ(1 + λ) i=1 E h ˆµn,ˆπ(y(i) n ) h µn,π(y(i) n ) 2 , where we used Minkowski s inequality, v u u t i=1 ai + bi 2 in the above inequalities. Again, by Minkowski s inequality, we have cn+1 cn + γ(1 + λ) i=1 E h ˆµn,ˆπ(y(i) n ) h ˆµn,ˆπ(y(i) n ) 2 i=1 E h ˆµn,ˆπ(y(i) n ) h µn,π(y(i) n ) 2 | {z } (ii) 10.13.1 Controlling (i): i=1 E h ˆµn,ˆπ(y(i) n ) h ˆµn,ˆπ(y(i) n ) 2 = D jk(y(i) n , ) jk(y(i) n , ), h ˆµn,ˆπ E2 jk(y(i) n , ) jk(y(i) n , ) 2 h ˆµn,ˆπ 2 H 4KK2d i=1 E y(i) n y(i) n 2 , where the second inequality uses Cauchy-Schwarz inequality and the third follows from using Lemma B.3 and Lemma B.4. So we have i=1 E h ˆµn,ˆπ(y(i) n ) h ˆµn,ˆπ(y(i) n ) 2 2 KK2d 10.13.2 Controlling (ii): First, we introduce some auxiliary witness functions, h µn,ˆπ = 2 (Σˆπ + λI) 1 (m µn mˆπ) , h n = 2 (Σπ + λI) 1 (m µn mˆπ) , and for completeness, we recall the witness function we are interested in: h µn,π = 2 (Σπ + λI) 1 (mµn mπ) , h ˆµn,ˆπ = 2 (Σˆπ + λI) 1 (mˆµn mˆπ) . (De)-regularized Maximum Mean Discrepancy Gradient Flow We know that i=1 E h ˆµn,ˆπ(y(i) n ) h µn,π(y(i) n ) 2 i=1 E h ˆµn,ˆπ(y(i) n ) h µn,ˆπ(y(i) n ) 2 i=1 E h µn,ˆπ(y(i) n ) h n(y(i) n ) 2 i=1 E h n(y(i) n ) h µn,π(y(i) n ) 2 i=1 K1d E h ˆµn,ˆπ h µn,ˆπ 2 i=1 K1d E h µn,ˆπ h n 2 i=1 K1d E h n h µn,π 2 E h ˆµn,ˆπ h µn,ˆπ 2 E h µn,ˆπ h n 2 E h n h µn,π 2 where the first inequality follows from Minkowski s inequality, and the second inequality uses the fact that for h0, h1 H, h1 h0 2 Hd 1k(x, ) 2 Hd h1 h0 2 H K1d h1 h0 2 H . Next, we will bound q E h ˆµn,ˆπ h µn,ˆπ 2 H, q E h µn,ˆπ h n 2 H, and q E h n h µn,π 2 H separately. 
First, by noticing that h ˆµn,ˆπ = 2 (Σˆπ + λI) 1 (mˆµn mˆπ) is the witness function associated with Dr MMD(ˆµn||ˆπ), and h µn,ˆπ = 2 (Σˆπ + λI) 1 (m µn mˆπ) is the witness function associated with Dr MMD( µn||ˆπ), by using Lemma B.4, we have r E h ˆµn,ˆπ h µn,ˆπ 2 λ2 W 2 2 ( µn, ˆµn) 2 K1d E h µn,ˆπ h n 2 E 2 (Σˆπ + λI) 1 (m µn mˆπ) 2 (Σπ + λI) 1 (m µn mˆπ) 2 E (Σˆπ + λI) 1 (Σπ + λI) 1 2 HS m µn mˆπ 2 H E (Σˆπ + λI) 1 (Σπ + λI) 1 2 Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur E (Σˆπ + λI) 1 (Σˆπ + λI) (Σπ + λI) (Σπ + λI) 1 2 E Σˆπ Σπ 2 HS 4 where the last inequality follows from using Lemma B.8 and the fact that k(x, ) k(x, ) HS K. Third, r E h n h µn,π 2 E 2 (Σπ + λI) 1 (m µn mˆπ) 2 (Σπ + λI) 1 (mµn mπ) 2 E (m µn mˆπ) (mµn mπ) 2 H E m µn mµn 2 H + q E mˆπ mπ 2 H where the first inequality follows from Cauchy-Schwartz, and the last inequality from Lemma B.8 since k(x, ) H K. Therefore, combining (66), (67) and (68), we have λ cn + 2K3/2 Combining (i) and (ii), we have 1 + γ(1 + λ)2 KK2d + 2K1d + 2γ(1 + λ) p Denoting A = 2 KK1d KK2d+K1d and R = K1d + KK2d as constants that only depend on the kernel, and using the discrete Gronwall lemma (Lemma 26 from Arbel et al. 2019) along with c0 = 0, we obtain exp γnmax 2(1 + λ)R Since E W2(ˆµn, µn) p E W 2 2 (ˆµn, µn) cn, we reach E W2(ˆµnmax, µnmax) A K exp γnmax 2(1 + λ)R Finally, the proof is completed by noting that the r.h.s. is monotonically decreasing in λ and therefore the r.h.s. can be bounded by replacing λ with λ = min i=1,...,nmax λi. 10.14 Proof of Proposition 6.1 By defining the following operators, Sx : H RM, f 1 h f(x(1)), . . . , f(x(M)) i , (De)-regularized Maximum Mean Discrepancy Gradient Flow S x : RM H, α 1 i=1 αik(x(i), ), Sy : H RN, f 1 h f(y(1)), . . . , f(y(N)) i , S y : RN H, α 1 i=1 αik(y(i), ). Then we have Σˆπ = S x Sx, Kxx = MSx S x, Kxy = Using these, note that h ˆµ,ˆπ = 2 (Σˆπ + λI) 1 (mˆµ mˆπ) i=1 k x(i), k x(i), + λI i=1 k y(i), 1 i=1 k x(i), ! = 2 S x Sx + λI 1 1 From the Woodbury inversion lemma, we have that S x Sx + λI 1 = 1 λS x(Sx S x + λI) 1Sx. Plugging the above into (69), we obtain h ˆµ,ˆπ = 2 1 λS x(Sx S x + λI) 1Sx M Kxx + λI 1 1 N M Kxx + λI 1 1 M = 2 Nλk , y1:N 1N 2 Mλk , x1:M 1M 2 Nλk , x1:M (Kxx + MλI) 1 Kxy1N + 2 Mλk , x1:M (Kxx + MλI) 1 Kxx1M. (70) Obtaining Dr MMD(ˆµ ˆπ) is then easy with h ˆµ,ˆπ shown in (70). Dr MMD(ˆµ ˆπ) = (1 + λ) (Σˆπ + λI) 1 2 (mˆµ mˆπ) 2 = (1 + λ) 1 2h ˆµ,ˆπ, mˆµ mˆπ * 1 2h ˆµ,ˆπ, 1 i=1 k y(i), 1 i=1 k x(i), + Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur 1 N2 1 NKyy1N + 1 M2 1 MKxx1M 2 MN 1 MKxy1N N2 1 NK xy (Kxx + MλI) 1 Kxy1N + 2 NM 1 MKxx (Kxx + MλI) 1 Kxy1N 1 M2 1 MKxx (Kxx + MλI) 1 Kxx1M Acknowledgments We would like to thank Gabriele Steidl and Viktor Stein for fruitful discussions on Dr MMD and related functionals. Zonghao Chen is supported by the Engineering and Physical Sciences Research Council (EPSRC) through grant [EP/S021566/1]. Aratrika Mustafiand Bharath K. Sriperumbudur are partially supported by the National Science Foundation (NSF) grant DMS-2413425 and NSF CAREER award DMS-1945396. Pierre Glaser and Arthur Gretton are supported by the Gatsby Charitable Foundation. Anna Korba thanks Google for their academic support in the form of a gift in support of her academic research. Rohit Agrawal and Thibaut Horel. Optimal bounds between f-divergences and integral probability metrics. Journal of Machine Learning Research, 22(1):5662 5720, 2021. Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. 
Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Springer Science & Business Media, 2005.
Abdul Fatir Ansari, Ming Liang Ang, and Harold Soh. Refining deep generative models via discriminator gradient flow. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Zbc-ue9p_rE.
Michael Arbel, Danica J. Sutherland, Mikolaj Binkowski, and Arthur Gretton. On gradient regularizers for MMD GANs. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
Michael Arbel, Anna Korba, Adil Salim, and Arthur Gretton. Maximum mean discrepancy gradient flow. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337-404, 1950.
Krishna Balasubramanian, Tong Li, and Ming Yuan. On the optimality of kernel-embedding based goodness-of-fit tests. Journal of Machine Learning Research, 22(1):1-45, 2021.
Krishna Balasubramanian, Sinho Chewi, Murat A. Erdogdu, Adil Salim, and Shunshi Zhang. Towards a theory of non-log-concave sampling: First-order stationarity guarantees for Langevin Monte Carlo. In Po-Ling Loh and Maxim Raginsky, editors, Conference on Learning Theory, pages 2896-2923. PMLR, 2022.
Sayan Banerjee, Krishna Balasubramanian, and Promit Ghosal. Improved finite-particle convergence rates for Stein variational gradient descent. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=sbG8qhMjkZ.
Frank Bauer, Sergei Pereverzev, and Lorenzo Rosasco. On regularization algorithms in learning theory. Journal of Complexity, 23(1):52-72, 2007.
A. Belafhal, Z. Hricha, L. Dalil-Essakali, and T. Usman. A note on some integrals involving Hermite polynomials and their applications. Advanced Mathematical Models and Applications, 5(3):313-319, 2020.
Jeremiah Birrell, Paul Dupuis, Markos A. Katsoulakis, Yannis Pantazis, and Luc Rey-Bellet. (f, Γ)-divergences: Interpolating between f-divergences and integral probability metrics. Journal of Machine Learning Research, 23(39):1-70, 2022.
Clément Bonet, Théo Uscidda, Adam David, Pierre-Cyril Aubin-Frankowski, and Anna Korba. Mirror and preconditioned gradient descent in Wasserstein space. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 25311-25374. Curran Associates, Inc., 2024.
Siwan Boufadène and François-Xavier Vialard. On the global convergence of Wasserstein gradient flow of the Coulomb discrepancy, 2024. URL https://arxiv.org/abs/2312.00800.
Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.
Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7:331-368, 2007.
Claudio Carmeli, Ernesto De Vito, Alessandro Toigo, and Veronica Umanitá. Vector-valued reproducing kernel Hilbert spaces and universality. Analysis and Applications, 8(01):19-61, 2010.
Niladri Chatterji, Jelena Diakonikolas, Michael I. Jordan, and Peter Bartlett. Langevin Monte Carlo without smoothness. In S. Chiappa and R. Calandra, editors, International Conference on Artificial Intelligence and Statistics, pages 1716-1726. PMLR, 2020.
Fan Chen, Yiqing Lin, Zhenjie Ren, and Songbo Wang. Uniform-in-time propagation of chaos for kinetic mean field Langevin dynamics. Electronic Journal of Probability, 29:1-43, 2024.
Zonghao Chen, Toni Karvonen, Heishiro Kanagawa, François-Xavier Briol, and Chris Oates. Stationary MMD points for cubature, 2025. URL https://arxiv.org/abs/2505.20754.
Sinho Chewi, Thibaut Le Gouic, Chen Lu, Tyler Maunu, and Philippe Rigollet. SVGD as a kernelized Wasserstein gradient flow of the chi-squared divergence. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 2098-2109. Curran Associates, Inc., 2020.
Sinho Chewi, Murat A. Erdogdu, Mufan Li, Ruoqi Shen, and Matthew S. Zhang. Analysis of Langevin Monte Carlo from Poincaré to log-Sobolev. Foundations of Computational Mathematics, pages 1-51, 2024.
Lénaïc Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
Katy Craig, Karthik Elamvazhuthi, Matt Haberland, and Olga Turanova. A blob method for inhomogeneous diffusion with applications to multi-agent control and sampling. Mathematics of Computation, 2023.
Katy Craig, Matt Jacobs, and Olga Turanova. Nonlocal approximation of slow and fast diffusion. Journal of Differential Equations, 426:782-852, 2025.
Felipe Cucker and Ding Xuan Zhou. Learning Theory: An Approximation Theory Viewpoint, volume 24. Cambridge University Press, 2007.
Arnak Dalalyan. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. In S. Kale and O. Shamir, editors, Conference on Learning Theory, pages 678-689. PMLR, 2017.
Arnak S. Dalalyan and Avetik Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):5278-5311, 2019.
Arnak S. Dalalyan, Avetik Karagulyan, and Lionel Riou-Durand. Bounding the error of discretized Langevin algorithms for non-strongly log-concave targets. Journal of Machine Learning Research, 23(235):1-38, 2022.
Andrew Duncan, Nikolas Nüsken, and Lukasz Szpruch. On the geometry of Stein variational gradient descent. Journal of Machine Learning Research, 24(56):1-39, 2023.
Alain Durmus, Szymon Majewski, and Błażej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(1):2666-2711, 2019.
Heinz Werner Engl, Martin Hanke, and Andreas Neubauer. Regularization of Inverse Problems, volume 375. Springer Science & Business Media, 1996.
Vassiliy A. Epanechnikov. Non-parametric estimation of a multivariate probability density. Theory of Probability & Its Applications, 14(1):153-158, 1969.
Jean Feydy, Thibault Séjourné, François-Xavier Vialard, Shun-ichi Amari, Alain Trouvé, and Gabriel Peyré. Interpolating between optimal transport and MMD using Sinkhorn divergences. In K. Chaudhuri and M. Sugiyama, editors, International Conference on Artificial Intelligence and Statistics, pages 2681-2690. PMLR, 2019.
Simon Fischer and Ingo Steinwart. Sobolev norm learning rates for regularized least-squares algorithms. Journal of Machine Learning Research, 21(205):1-38, 2020.
Nicolas Fournier and Arnaud Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4):707-738, 2015.
Jean-Yves Franceschi, Mike Gartrell, Ludovic Dos Santos, Thibaut Issenhuth, Emmanuel de Bézenac, Mickaël Chen, and Alain Rakotomamonjy. Unifying GANs and score-based diffusion as generative particle models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 36. Curran Associates, Inc., 2024.
Alexandre Galashov, Valentin De Bortoli, and Arthur Gretton. Deep MMD gradient flow without adversarial training. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=Pf85K2wtz8.
Yuan Gao, Yuling Jiao, Yang Wang, Yao Wang, Can Yang, and Shunkang Zhang. Deep generative learning via variational gradient flow. In K. Chaudhuri and R. Salakhutdinov, editors, International Conference on Machine Learning, pages 2093-2101. PMLR, 2019.
Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorn divergences. In A. Storkey and F. Perez-Cruz, editors, International Conference on Artificial Intelligence and Statistics, pages 1608-1617. PMLR, 2018.
Pierre Glaser, Michael Arbel, and Arthur Gretton. KALE flow: A relaxed KL gradient flow for probabilities with disjoint support. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 8018-8031. Curran Associates, Inc., 2021.
Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(1):723-773, 2012.
Thomas Hakon Gronwall. Note on the derivatives with respect to a parameter of the solutions of a system of differential equations. Annals of Mathematics, 20(4):292-296, 1919.
Hyemin Gu, Panagiota Birmpa, Yannis Pantazis, Luc Rey-Bellet, and Markos A. Katsoulakis. Lipschitz-regularized gradient flows and generative particle algorithms for high-dimensional scarce data. SIAM Journal on Mathematics of Data Science, 6(4):1205-1235, 2024.
Omar Hagrass, Bharath K. Sriperumbudur, and Bing Li. Spectral regularized kernel two-sample tests. The Annals of Statistics, 52(3):1076-1101, 2024a.
Omar Hagrass, Bharath K. Sriperumbudur, and Bing Li. Spectral regularized kernel goodness-of-fit tests. Journal of Machine Learning Research, 25(309):1-52, 2024b.
Zaïd Harchaoui, Francis Bach, and Eric Moulines. Testing for homogeneity with kernel Fisher discriminant analysis. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007.
Ye He, Krishna Balasubramanian, Bharath K. Sriperumbudur, and Jianfeng Lu. Regularized Stein variational gradient flow. Foundations of Computational Mathematics, pages 1-59, 2024.
Johannes Hertrich, Robert Beinert, Manuel Gräf, and Gabriele Steidl. Wasserstein gradient flows of the discrepancy with distance kernel on the line. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 431-443. Springer, 2023.
Johannes Hertrich, Manuel Gräf, Robert Beinert, and Gabriele Steidl. Wasserstein steepest descent flows of discrepancies with Riesz kernels. Journal of Mathematical Analysis and Applications, 531(1):127829, 2024a.
Johannes Hertrich, Christian Wald, Fabian Altekrüger, and Paul Hagemann. Generative sliced MMD flows with Riesz kernels. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=VdkGRV1vcf.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6840-6851. Curran Associates, Inc., 2020.
Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker-Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1-17, 1998.
Mark Kac. Foundations of kinetic theory. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 3, pages 171-197, 1956.
Achim Klenke. Probability Theory: A Comprehensive Course. Springer Science & Business Media, 2013.
Benoit Kloeckner. Approximation by finitely supported measures. ESAIM: Control, Optimisation and Calculus of Variations, 18(2):343-359, 2012.
Anna Korba, Adil Salim, Michael Arbel, Giulia Luise, and Arthur Gretton. A non-asymptotic analysis for Stein variational gradient descent. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 4672-4682. Curran Associates, Inc., 2020.
Anna Korba, Pierre-Cyril Aubin-Frankowski, Szymon Majewski, and Pierre Ablin. Kernel Stein discrepancy descent. In M. Meila and T. Zhang, editors, International Conference on Machine Learning, pages 5719-5730. PMLR, 2021.
A. Ya. Kruger. On Fréchet subdifferentials. Journal of Mathematical Sciences, 116(3):3325-3358, 2003.
Marc Lambert, Sinho Chewi, Francis Bach, Silvère Bonnabel, and Philippe Rigollet. Variational inference via Wasserstein gradient flows. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 14434-14447. Curran Associates, Inc., 2022.
Jing Lei. Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces. Bernoulli, 26(1):767-798, 2020.
Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabas Poczos. MMD GAN: Towards deeper understanding of moment matching network. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
Lingxiao Li, Qiang Liu, Anna Korba, Mikhail Yurochkin, and Justin Solomon. Sampling with mollified interaction energy descent. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=zWy7dqOcel.
Tengyuan Liang and Hai Tran-Bach. Mehler's formula, branching process, and compositional kernels of deep neural networks. Journal of the American Statistical Association, 117(539):1324-1337, 2022.
Qiang Liu. Stein variational gradient descent as gradient flow. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
Song Liu, Jiahao Yu, Jack Simons, Mingxuan Yi, and Mark Beaumont. Minimizing f-divergences by interpolating velocity fields. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp, editors, International Conference on Machine Learning, pages 32308-32331. PMLR, 2024a.
Tianle Liu, Promit Ghosal, Krishna Balasubramanian, and Natesh Pillai. Towards understanding the dynamics of Gaussian-Stein variational gradient descent. In Advances in Neural Information Processing Systems, volume 36, 2024b.
Antoine Liutkus, Umut Simsekli, Szymon Majewski, Alain Durmus, and Fabian-Robert Stöter. Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. In K. Chaudhuri and R. Salakhutdinov, editors, International Conference on Machine Learning, pages 4104-4113. PMLR, 2019.
Sebastian Mika, Gunnar Rätsch, Jason Weston, Bernhard Schölkopf, and Klaus-Robert Müller. Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop (Cat. No.98TH8468), pages 41-48, 1999.
XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847-5861, 2010.
Atsushi Nitanda, Denny Wu, and Taiji Suzuki. Convex analysis of the mean field Langevin dynamics. In G. Camps-Valls, F. J. R. Ruiz, and I. Valera, editors, International Conference on Artificial Intelligence and Statistics, pages 9741-9757. PMLR, 2022.
Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
Shin-ichi Ohta and Asuka Takatsu. Displacement convexity of generalized relative entropies. Advances in Mathematics, 228(3):1742-1787, 2011.
Victor M. Panaretos and Yoav Zemel. An Invitation to Statistics in Wasserstein Space. Springer, 2020.
Loucas Pillaud-Vivien, Francis Bach, Tony Lelièvre, Alessandro Rudi, and Gabriel Stoltz. Statistical estimation of the Poincaré constant and application to sampling multimodal distributions. In S. Chiappa and R. Calandra, editors, International Conference on Artificial Intelligence and Statistics, pages 2753-2763. PMLR, 2020.
Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pages 547-562. University of California Press, 1961.
Walter Rudin. Principles of Mathematical Analysis, volume 3. McGraw-Hill, New York, 1976.
Filippo Santambrogio. Euclidean, metric, and Wasserstein gradient flows: An overview. Bulletin of Mathematical Sciences, 7:87-154, 2017.
Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton, and Kenji Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41(5):2263-2291, 2013.
Jiaxin Shi and Lester Mackey. A finite-particle convergence rate for Stein variational gradient descent. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 36. Curran Associates, Inc., 2024.
Tao Shi, Mikhail Belkin, and Bin Yu. Data spectroscopy: Eigenspaces of convolution operators and clustering. The Annals of Statistics, pages 3960-3984, 2009.
Carl-Johann Simon-Gabriel, Alessandro Barp, Bernhard Schölkopf, and Lester Mackey. Metrizing weak convergence with maximum mean discrepancies. Journal of Machine Learning Research, 24(184):1-20, 2023.
Jack Simons, Song Liu, and Mark Beaumont. Variational likelihood-free gradient descent. In Fourth Symposium on Advances in Approximate Bayesian Inference, 2022.
Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
Bharath K. Sriperumbudur. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3):1839-1893, 2016.
Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517-1561, 2010.
Bharath K. Sriperumbudur, Kenji Fukumizu, and Gert R. G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12(7):2389-2410, 2011.
Viktor Stein, Sebastian Neumayer, Nicolaj Rux, and Gabriele Steidl. Wasserstein gradient flows for Moreau envelopes of f-divergences in reproducing kernel Hilbert spaces. Analysis and Applications, pages 1-45, 2025.
Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.
Ingo Steinwart and Clint Scovel. Mercer's theorem on general domains: On the interaction between measures, kernels, and RKHSs. Constructive Approximation, 35:363-417, 2012.
Taiji Suzuki, Atsushi Nitanda, and Denny Wu. Uniform-in-time propagation of chaos for the mean-field gradient Langevin dynamics. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=_JScUk9TBUn.
Nicolas Garcia Trillos and Daniel Sanz-Alonso. The Bayesian update: Variational formulations and gradient flows. Bayesian Analysis, 15(1):29-56, 2020.
Tim Van Erven and Peter Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797-3820, 2014.
Santosh Vempala and Andre Wibisono. Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R.
Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
Cédric Villani et al. Optimal Transport: Old and New, volume 338. Springer, 2009.
Holger Wendland. Scattered Data Approximation, volume 17. Cambridge University Press, 2004.

Appendix A: Further Background on (P2(Rd), W2)

Let µ and π be two probability measures in P2(Rd) and let Π(µ, π) denote the set of all admissible transport plans between µ and π, i.e., Π(µ, π) = {Γ ∈ P(Rd × Rd) : (proj1)#Γ = µ, (proj2)#Γ = π}, where proj1 and proj2 respectively stand for the projection maps (x, y) ↦ x and (x, y) ↦ y, and # is the pushforward operator. The Wasserstein-2 distance between µ and π is then defined as

W2(µ, π) = ( inf_{Γ ∈ Π(µ,π)} ∫ ‖x − y‖² dΓ(x, y) )^{1/2},

and (P2(Rd), W2) is a metric space called the Wasserstein space (Panaretos and Zemel, 2020). Brenier's theorem guarantees that if µ is an absolutely continuous measure, then the optimal transport plan is unique and is induced by a map, Γ = (I, T)#µ with T#µ = π (Santambrogio, 2017). T can also be expressed as T(x) = x + ∇φ(x), where φ is known as the Kantorovich potential and is differentiable µ-a.e. For an absolutely continuous µ and the optimal transport map T such that T#µ = π, the shortest path (ρt)_{0≤t≤1} from µ to π, called the (Wasserstein) geodesic, is given by

ρt = ((1 − t)I + tT)#µ = (I + t∇φ)#µ.

Therefore, in this paper, we always use (I + t∇φ)#µ with φ ∈ C_c^∞(Rd)⁸ to define a geodesic curve that starts at µ. Define ϕt : Rd → Rd, x ↦ x + t∇φ(x); then ωt : Rd → Rd, x ↦ ∇φ(ϕt⁻¹(x)) becomes the optimal transport map from ρt to π. Notice that ‖ωt‖_{L²(ρt)} = ‖∇φ‖_{L²(µ)} for all t ∈ [0, 1], so (ρt)_{0≤t≤1} is also a constant-speed geodesic. The notion of a constant-speed geodesic is crucial in the introduction of geodesic convexity below. A functional F : P2(Rd) → R is geodesically convex if for any µ and π, the following inequality holds:

F(ρt) ≤ (1 − t)F(µ) + tF(π) for all t ∈ [0, 1], (A.1)

where (ρt)_{t∈[0,1]} is the constant-speed geodesic between µ and π. The geodesic convexity of F can be equivalently characterized through the Wasserstein Hessian (Villani et al., 2009). Geodesic convexity ought not to be confused with mixture convexity, which replaces the displacement geodesic ρt in (A.1) with the mixture geodesic νt = (1 − t)µ + tπ.

8. φ is compactly supported because the tangent space of µ ∈ P2(Rd) is the closure of {∇ψ : ψ ∈ C_c^∞(Rd)} in L²(µ) (Ambrosio et al., 2005, Definition 8.4.1).

Appendix B: Auxiliary Results

In this appendix, we collect all technical results required to prove the main results of the paper.

Lemma B.1 For µ ≪ π, the χ²-divergence admits the following variational form:

χ²(µ‖π) = sup_{h ∈ L²(π)} ∫ h dµ − ∫ (h²/4 + h) dπ,

where it is sufficient to restrict the variational set to L²(π), in contrast to the set of all measurable functions for general f-divergences (Nowozin et al., 2016; Nguyen et al., 2010).

Proof For µ ≪ π, we have

χ²(µ‖π) = sup_h ∫ h dµ − ∫ (h²/4 + h) dπ = sup_h ∫ (h dµ/dπ − h²/4 − h) dπ.

Pointwise maximization shows that the supremum is attained at h = 2(dµ/dπ − 1), with value χ²(µ‖π) = ∫ (dµ/dπ − 1)² dπ, which is finite if and only if dµ/dπ − 1 ∈ L²(π). Therefore, it is sufficient to consider the above maximization over L²(π).

Lemma B.2 Under Assumptions 1 and 2, the mappings x ↦ k(x, ·) and x ↦ ∇₁k(x, ·) are differentiable and Lipschitz:

‖k(x, ·) − k(y, ·)‖_H ≤ √(K1d) ‖x − y‖,   ‖∇₁k(x, ·) − ∇₁k(y, ·)‖_{H^d} ≤ √(K2d) ‖x − y‖.

Proof This is Lemma 7 from Glaser et al. (2021).

Lemma B.3 Under Assumptions 1 and 2, the regularized kernel k defined in (6) satisfies the following properties: 1. |k(x, y)| ≤ K/λ for all x, y ∈ Rd. 2.
i k(x, y) = (Σπ + λI) 1 ik(x, ), k(y, ) 1 k(x, y) 2 = Pd i=1 i k(x, y)2 KK1d Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur 4. i i+d k(x, y) = (Σπ + λI) 1 ik(x, ), ik(y, ) 1 2 k(x, y) 2 F = Pd i=1 i i+d k(x, y)2 K1d 6. i j k(x, y) = (Σπ + λI) 1 i jk(x, ), k(y, ) H1 k(x, y) 2 F = Pd i,j=1 i j k(x, y)2 KK2d 1 k (x, x ) 1 k (y, y ) KK2d λ ( x y + x y ). Proof Notice that k(x, y) = D (Σπ + λI) 1 k(x, ), k(y, ) E λ k(x, ) H k(y, ) H K so the first bullet point is proved. Before we prove the second bullet point, we first prove the differentiability of x 7 k(x, y). For i {1, , d}, consider h R and denote i Rd as a vector of all 0 except the value at i being equal to h. Then, for any y Rd, k(x + i, y) k(x, y) h = lim h 0 (Σπ + λI) 1 (k(x + i, ) k(x, )) , k(y, ) K λ k(x, ) k(x + i, ) H So x 7 k(x, y) is differentiable for any y Rd. Since the kernel k is differentiable per Assumption 2, for any f H, xif(x) = ik(x, ), f i k(x, y) = xi D k(x, ), (Σπ + λI) 1 k(y, ) E H = D ik(x, ), (Σπ + λI) 1 k(y, ) E So the second bullet point is proved. Next, notice that 1 k(x, y) 2 = D ik(x, ), (Σπ + λI) 1 k(y, ) E2 i=1 ik(x, ) 2 H k(y, ) 2 H λ2 1k(x, ) 2 Hd k(y, ) 2 H KK1d So the third bullet point is proved. Similar arguments above lead to bullet points 4 to 7. Finally, to prove bullet point 8, notice that 1 k(x, x ) 1 k(y, x ) 2 = D ( ik(x, ) ik(y, )) , (Σπ + λI) 1 k(x , ) E2 j=1 ik(x, ) ik(y, ) 2 H k(x , ) 2 H 1 λ2 1k(x, ) 1k(y, ) 2 Hd k(x , ) 2 H (De)-regularized Maximum Mean Discrepancy Gradient Flow where the last inequality uses Lemma B.2. Therefore, 1 k(x, x ) 1 k(y, y ) 1 k(x, x ) 1 k(y, x ) + 1 k(y, x ) 1 k(y, y ) λ x y + x y . So the bullet point 8 is proved. Lemma B.4 For any two distributions µ0, µ1 P2(Rd), with associated Dr MMD witness functions h µ0,π, h µ1,π defined in Proposition 3.2, we have h µ1,π h µ0,π H 2 K1d λ W2 (µ0, µ1) , and h µ0,π H 2 Both the witness function h µ,π and its gradient h µ,π are Lipschitz continuous, i.e., h µ1,π(x) h µ0,π(y) L W2(µ0, µ1) + x y ; h µ1,π(x) h µ0,π(y) L W2(µ0, µ1) + x y , where the constant L = 1 λ max 2 KK1d, 2 KK2d, 2K1d . Proof Let γ Γ(µ0, µ1) be the optimal coupling between µ0 and µ1. Then h µ1,π h µ0,π 2 H = 4 Σπ + λI 1 Z k(x, )d(µ1 µ0) Z k(x, ) k(y, )dγ(x, y) Z x y 2 dγ(x, y) = 4K1d λ2 W 2 2 (µ0, µ1), where the first inequality holds because Σπ is a positive and self-adjoint operator, and the second inequality uses Lemma B.2. Also note that h µ0,π H = 2 Σπ + λI 1 Z k(x, )dµ0 Z k(x, )dπ H 2 So the first part has been proved. Furthermore, note that h µ1,π(x) h µ0,π(y) h µ1,π(x) h µ0,π(x) + h µ0,π(x) h µ0,π(y) k(x, ) H h µ1,π h µ0,π H + k(x, ) k(y, ) H h µ0,π H λ W2(µ0, µ1) + 2 KK1d W2(µ0, µ1) + x y L W2(µ0, µ1) + x y and h µ1,π(x) h µ0,π(y) h µ1,π(x) h µ0,π(x) + h µ0,π(x) h µ0,π(y) Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur 1k(x, ) Hd h µ1,π h µ0,π H + 1k(x, ) 1k(y, ) Hd h µ0,π H λ W2(µ0, µ1) + 2 KK2d λ x y L W2(µ0, µ1) + x y and the result follows. Lemma B.5 Given two probability measures µ π that are both absolutely continuous with respect to Lebesgue measure. If dµ dπ 1 Ran(T r π ) with r > 0, i.e., there exists q L2(π) such that dµ dπ 1 = T r π q, then dπ 1 L2(π) 2λr q L2(π) , where h = 2(Σπ + λI) 1(mπ mµ) is defined in Proposition 3.2. Proof Given the assumption that dµ dπ 1 Ran (T r π ) with r > 0, there exists q L2(π) such that dµ dπ 1 = T r π q and dµ dπ 1, ei L2(π) = ϱr i q, ei L2(π). 
Since µ π, the Mercer decomposition of k in (1) holds for any x in the support of µ and in the support of π, mµ mπ = Z k(x, )d(µ π)(x) = X Z ei(x)d(µ π)(x) ei = ϱi So we have, h 2 dµ dπ 1 L2(π) = 2 (Σπ + λI) 1 (mµ mπ) dµ λϱr i ϱi + λ q, ei L2(π)ei L2(π) 2λr q L2(π) , where the last inequality is obtained by using λϱr i ϱi + λ = ϱi ϱi + λ Hence the proof. (De)-regularized Maximum Mean Discrepancy Gradient Flow Lemma B.6 Let ρ P2 Rd and φ C c Rd . Consider the path (ρs)0 s 1 from ρ to (I + φ)#ρ given by ρs = (I + s φ)#ρ. Define ϕs : Rd Rd, x 7 x + s φ(x). Let k be the regularized kernel defined in (6) along with its associated RKHS H. The mapping s 7 Dr MMD(ρs π) is continuous and differentiable, and its first-order time derivative is given by d ds Dr MMD(ρs||π) = 2(1 + λ) Z φ(x) Z 1 k(ϕs(x), ϕs(z))dρ(z) Z 1 k(ϕs(x), z)dπ(z) dρ(x). (B.2) s=0 Dr MMD(ρs||π) = 2(1 + λ) Z φ(x) Z 1 k(x, z)dρ(z) Z 1 k(x, z)dπ(z) dρ(x). (B.3) Additionally, the mapping s d ds Dr MMD(ρs π) is continuous and differentiable, and the second-order time derivative of Dr MMD(ρs π) is given by ds2 Dr MMD(ρs||π) = 2(1 + λ) ZZ φ(x) 1 2 k(ϕs(x), ϕs(z)) φ(z)dρ(x)dρ(z) (B.4) + 2(1 + λ) Z φ(x) Z H1 k (ϕs(x), ϕs(z)) dρ(z) Z H1 k (ϕs(x), z) dπ(z) φ(x)dρ(x), s=0 Dr MMD(ρs||π) = 2(1 + λ) ZZ φ(x) 1 2 k(x, z) φ(z)dρ(x)dρ(z) + 2(1 + λ) Z φ(x) Z H1 k (x, z) dρ(z) Z H1 k (x, z) dπ(z) φ(x)dρ(x). (B.5) Proof Recall that Dr MMD is, up to a multiplicative factor of (1 + λ), MMD2 of the regularized kernel k defined in (6) and the associated RKHS H. From Lemma B.3, we know that assumptions (A) and (B) of Arbel et al. (2019) are satisfied for the regularized kernel k, so using Lemma 22 and Lemma 23 from Arbel et al. (2019), (B.2) and (B.4) are proved. Then (B.3) and (B.5) are subsequently proved by taking s = 0. Lemma B.7 For µ0 π, define h = 2(Σπ + λI) 1(mπ mµ0) and ϕt : Rd Rd, x 7 x t(1 + λ) h(x). Suppose π exp( V ), HV βI, and the step size γ satisfies χ2(µ0 π)K2d for some constant 1 < ζ < 2. Then, for t [0, γ], the following inequalities hold, Z h(x) HV (ϕt(x)) h(x)dµ0(x) 4βχ2(µ0 π)K1d λ ; (B.6) Z Hh(x) (I t(1 + λ)Hh(x)) 1 2 F dµ0(x) 4ζ2χ2(µ0 π)K2d Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur Proof First, recall (B.1), h(x) 2 = 4 (Σπ + λI) 1(mπ mµ0)(x) 2 L2(π) ei(x) For any j {1, , d}, consider g M0(x) := X L2(π) jei(x) i M0 ϱi jei(x)2 i M0 ϱi jei(x)2 dπ 1 L2(π) . j=1 ϱi ( jei(x))2 = X j=1 ϱi jk(x, ), ei 2 H = X j=1 jk(x, ), ϱiei 2 H j=1 jk(x, ) 2 H = 1k(x, ) 2 H K1d, (B.9) so P i M0 ϱi( jei(x))2 converges uniformly to 0, and hence g M0(x) also converges uniformly to 0. Therefore, we are allowed to interchange the derivative and the infinite sum (Rudin, 1976) in (B.8) to achieve, L2(π) ei(x) ϱi (ϱi + λ)2 i 1 ϱi ei(x) 2 i 1 ϱi ei(x) 2 4χ2(µ0 π)K1d The first inequality follows from Cauchy Schwartz, the penultimate inequality follows by noting that ϱi (ϱi+λ)2 1 λ, and the last inequality follows from (B.9). Given HV βI, (B.6) is proved by the following, Z h(x) HV (ϕt(x)) h(x)dµ0(x) β Z h(x) 2 dµ0(x) 4βχ2(µ0 π)K1d (De)-regularized Maximum Mean Discrepancy Gradient Flow We now turn to proving the second statement. Similarly to (B.10), we have Hh(x) 2 F 4χ2(µ0 π)K2d Using 2(1 + λ)γ q χ2(µ0 π)K2d ζ for some constant 1 < ζ < 2, the inverse of I t(1 + λ)Hh(x) can be represented by the Neumann series, and hence (I t(1 + λ)Hh(x)) 1 F X m 0 t(1 + λ)Hh(x) m F X χ2(µ0 π)K2d m = ζ. 
(B.12) Therefore, (B.7) is proved by combining (B.11) and (B.12), Z Hh(x) (I t(1 + λ)Hh(x)) 1 2 Z Hh(x) 2 F (I t(1 + λ)Hh(x)) 1 2 F dµ0(x) 4ζ2χ2(µ0 π)K2d and the result follows. Lemma B.8 Let H be a separable Hilbert space, ξ1, . . . , ξn : Ω H are n identical independent H-valued random variables satisfying ξi H B. Then i=1 ξi E[ξ1] 2πB 2 n , and E i=1 ξi E[ξ1] Proof We know from Corollary 6.15 of Steinwart and Christmann (2008) that i=1 ξi E [ξ1] 2 exp 2nt2/B2 . Then, denote R := 1 n Pn i=1 ξi E[ξ1] H, we know that 0 P(R t)dt 2 Z 0 exp( 2nt2/B2)dt = The other part is proved similarly. Lemma B.9 (Wasserstein Hessian of Fχ2) Let ρ P2(Rd) and φ C c (Rd). Consider the path (ρs)0 s 1 from ρ to (I + φ)#ρ given by ρs = (I + s φ)#ρ. Define ϕs : Rd Rd, x 7 x + s φ(x) and ωs : Rd Rd, x 7 φ ϕ 1 s (x). For π(x) exp( V (x)), the second order derivative of s χ2(ρs π) is given by ds2 χ2(ρs π) = Z ρs(x) V (x) ωs(x) ωs(x) 2 dρs(x) Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur π(x) ωs(x) HV (x)ωs(x)dρs(x) + Z ρs(x) π(x) ωs(x) 2 F dρs(x). (B.13) Equivalently, the second order derivative of s χ2(ρs π) can also be written as ds2 χ2(ρs π) = 2 Z (ωs(x)ρs(x)) 1 π(x) + 2 Z ωs(x) H ρs(x) ωs(x)ρs(x)dx. (B.14) Proof (B.13) is provided in the Example 15.9 of Villani et al. (2009) by taking m = 2. Now, we are going to prove (B.14). For ease of notation in the following derivations, we are going to drop the function input x in ρs, π, and ωs. We introduce colors to picture grouping of terms that will carry over during chains of calculation. In order to prove (B.14), we need to expand the terms in (B.13) accordingly. We denote the three terms in the RHS of (B.13) as (A), (B) and (C). Consider (A) = Z ρ2 s π V ωs ωs 2 dx Z ρ2 s π3 ( π ωs)2dx | {z } (A1) + 2 Z ρ2 s π2 ( π ωs) ωsdx | {z } (A2) Z ρ2 s π ( ωs)2dx | {z } (A3) Then we are going to use integration by parts for (A2) and (A3). (A2) = 2 Z ρ2 s π2 ( π ωs) ωsdx = 2 Z ω s ρ2 s π2 ( π ωs) dx = 4 Z (ω s ρs)( π ωs) ρs π2 dx + 4 Z (ω s π)2 ρ2 s π3 dx 2 Z (ω s Hπωs) ρ2 s π2 dx 2 Z (ω s ωs π) ρ2 s π2 dx, (A3) = Z ρ2 s π ( ωs)2dx = Z (ωs) ωs ρ2 s π dx = Z ω s ωs ρ2 s π = Z ω s ( ωs)ρ2 s π dx | {z } (A31) Z ωs(ω s ρs)2ρs π dx | {z } (A32) + Z ωs(ω s π) ρ2 s π2 dx | {z } (A33) Furthermore, we use integration by parts for (A32) and (A33), we have (A32) = 2 Z ωs(ω s ρs)ρs π dx = 2 Z (ω s (ω s ρs)ρs = 2 Z (ω s ωs ρs)ρs π dx + 2 Z (ω s Hρsωs)ρs π dx + 2 Z (ω s ρs)2 1 (De)-regularized Maximum Mean Discrepancy Gradient Flow 2 Z (ω s ρs)(ω s π) ρs (A33) = Z ωs(ω s π) ρ2 s π2 dx = Z (ω s (ω s π) ρ2 s π2 = Z (ω s ωs π) ρ2 s π2 dx Z (ω s Hπωs) ρ2 s π2 dx 2 Z (ω s ρs)(ω s π) ρs + 2 Z (ω s π)2 ρ2 s π3 dx. Having completed (A), now we turn to (B). (B) = Z ρ2 s π ω s HV ωsdx = Z ω s Hπωs ρ2 s π2 dx + Z ( π ωs)2 ρ2 s π3 dx. So, combining (A), (B) and (C), we have ds2 χ2(ρs π) = (A) + (B) + (C) = (A1) + (A2) + (A31) + (A32) + (A33) + (B) + (C) Z ρ2 s π3 ( π ωs)2dx 4 Z (ω s ρs)( π ωs) ρs π2 dx + 4 Z (ω s π)2 ρ2 s π3 dx 2 Z (ω s Hπωs) ρ2 s π2 dx 2 Z (ω s ωs π) ρ2 s π2 dx + (A31) + (A32) + (A33) Z (ω s Hπωs) ρ2 s π2 dx + Z ( π ωs)2 ρ2 s π3 dx + Z ρ2 s π ωs 2 F dx. Since ωs = φ ϕ 1 s is a mapping from Rd to Rd, denote ωs,i := φ ϕ 1 s i which is a mapping from Rd to R. Notice that ωs,i jωs,i vanishes at boundary because φ C c . Hence, ωs,i jωs,i ρ2 s π Z jωs,i jωs,i ρ2 s π dx + X Z ωs,i jjωs,i ρ2 s π dx Z ωs,i jωs,i Z ρ2 s π ωs 2 F dx + Z ρ2 s π (ω s ( ωs))dx + Z (ω s ωs ρs)2ρs Z (ω s ωs π) ρ2 s π2 dx. 
Therefore, by replacing R ρ2 s π ωs 2 F dx, and noticing that R ρ2 s π (ω s ( ωs))dx is exactly (A31), we have the following: ds2 χ2(ρs π) = Z ρ2 s π3 ( π ωs)2dx 4 Z (ω s ρs)( π ωs) ρs π2 dx + 4 Z (ω s π)2 ρ2 s π3 dx Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur 2 Z (ω s Hπωs) ρ2 s π2 dx 2 Z (ω s ωs π) ρ2 s π2 dx + (A31) + (A32) + (A33) Z (ω s Hπωs) ρ2 s π2 dx + Z ( π ωs)2 ρ2 s π3 dx Z ρ2 s π (ω s ( ωs))dx Z (ω s ωs ρs)2ρs π dx + Z (ω s ωs π) ρ2 s π2 dx. Next, we combine the terms and obtain = 2((A31) + (A32) + (A33)) + 6 Z ρ2 s π3 ( π ωs)2dx 4 Z (ω s ρs)( π ωs) ρs 3 Z (ω s Hπωs) ρ2 s π2 dx Z (ω s ωs π) ρ2 s π2 dx 2 Z (ω s ωs ρs)ρs π dx (A32) (A33). Recall that (A31)+(A32)+(A33) = (A3), and replacing (A32) and (A33) with their expressions, we have Z ρ2 s π ( ωs)2dx + 6 Z ρ2 s π3 ( π ωs)2dx 4 Z (ω s ρs)( π ωs) ρs 3 Z (ω s Hπωs) ρ2 s π2 dx Z (ω s ωs π) ρ2 s π2 dx 2 Z (ω s ωs ρs)ρs 2 Z (ω s ωs ρs)ρs π dx + 2 Z (ω s Hρsωs)ρs π dx + 2 Z (ω s ρs)2 1 π dx 2 Z (ω s ρs)(ω s π) ρs Z (ω s ωs π) ρ2 s π2 dx Z (ω s Hπωs) ρ2 s π2 dx 2 Z (ω s ρs)(ω s π) ρs π2 dx + 2 Z (ω s π)2 ρ2 s π3 dx = 2 Z (ω s Hρsωs)ρs π dx 2 Z (ω s Hπωs) ρ2 s π2 dx 4 Z (ω s ρs)(ω s π) ρs π2 dx + 4 Z ρ2 s π3 ( π ωs)2dx | {z } (M) + 2 Z ρ2 s π ( ωs)2dx 2 Z (ω s ρs)2 1 π dx 4 Z (ω s ωs ρs)ρs π dx | {z } (N1) 4 Z (ω s Hρsωs)ρs π dx + 4 Z (ω s ρs)(ω s π) ρs π2 dx | {z } (N2) Now we analyze (M) and (N1) + (N2) separately. Notice that (M) = 2 Z ω s H ρs (N1) + (N2) = 2 Z ρ2 s π ( ωs)2dx + 2 Z (ω s ρs)2 1 πdx 4 Z (ω s ωs ρs)ρs 4 Z (ω s Hρsωs)ρs π dx 4 Z (ω s ρs)2 1 πdx + 4 Z (ω s ρs)(ω s π) ρs = 2 Z ρ2 s π ( ωs)2dx + 2 Z (ω s ρs)2 1 πdx 4 Z (ω s (ω s ρs))ρs 4 Z (ω s ρs)(ω s ρs (De)-regularized Maximum Mean Discrepancy Gradient Flow = 2 Z ρ2 s π ( ωs)2dx + 2 Z (ω s ρs)2 1 πdx + 4 Z ωs(ω s ρs)ρs = 2 Z (ωsρs) 1 Since d2 ds2 χ2(ρs π) = (M) + (N1) + (N2), Lemma B.9 is proved. Appendix C: An Illustrative Example for Explicit Forms of It, Jt, qt L2(π) Consider an illustrative example where we simulate the Dr MMD gradient flow when the target is a one-dimensional Gaussian target distribution π = N(0, σ2) and the initialization is also a one-dimensional Gaussian µ0 = N(0, 1 2 σ2). We take a Gaussian kernel k(x, y) = exp( 1 2(x y)2) whose eigenvalues and eigenfunctions in its Mercer decomposition have the following closed form expressions (Shi et al., 2009, Proposition 1), 1 2 σ2 1 2 σ2 + c + 0.5 i , ei(x) = β i!2i exp( cx2)Hi where β = (1 + 4 σ2)1/4, c = β2 1 4 σ2 and Hi is the i-th Hermite polynomial function. We pick σ2 > 2 so β2 > 3. The Gaussian kernel is continuous, bounded and c0-universal as required in Assumption 1. It also possesses bounded firstand second-order derivatives, thereby satisfying Assumption 2. Consider the Dr MMD gradient flow (µt)t 0 defined in (7) along with its particle update scheme dxt = (1 + λ) hµt,π(xt)dt, where hµt,π is the witness function defined in (32) and λ is a positive regularization parameter. Denote mt, σ2 t as the mean and covariance of µt, respectively, then we have the following update scheme for mt, σ2 t proved in Lemma C.1: dmt = (1 + λ) Ext µt[ hµt,π(xt)] dt, d(σ2 t ) = 2(1 + λ) Ext µt[ hµt,π(xt) xt] dt + 2(1 + λ) Ext µt[ hµt,π(xt)] mt dt, where the expectations are taken over xt µt. 
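To make the particle scheme above concrete, here is a minimal NumPy sketch (not the implementation used for the paper's experiments) that evaluates the witness function and its gradient from samples alone, using the kernel-matrix expression obtained from the Woodbury identity in the proof of Proposition 6.1, and then applies the explicit Euler update x ← x − γ(1 + λ)∇h(x) for the one-dimensional Gaussian example above, with target variance 6 as in the empirical check later in this appendix. The helper names, the fixed λ, the step size, and the (deliberately small) sample sizes and iteration count are illustrative choices; the experiments in the text use an adaptive λ schedule and many more particles.

```python
import numpy as np

def gauss_kernel(a, b, ell=1.0):
    # k(a_i, b_j) = exp(-(a_i - b_j)^2 / (2 ell^2)) for one-dimensional inputs.
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def drmmd_witness(y, x, lam, ell=1.0):
    """Sample-based DrMMD witness h = 2 (Sigma_pi_hat + lam I)^{-1}(m_mu_hat - m_pi_hat):
    returns h at the particles y (~ mu, size N), h at the target samples x (~ pi, size M),
    and dh/dy at the particles, via the kernel-matrix form after Woodbury inversion."""
    N, M = len(y), len(x)
    Kxx = gauss_kernel(x, x, ell)   # M x M
    Kxy = gauss_kernel(x, y, ell)   # M x N
    Kyy = gauss_kernel(y, y, ell)   # N x N
    Kyx = Kxy.T                     # N x M
    # beta solves (Kxx + M*lam*I) beta = (1/N) Kxy 1_N - (1/M) Kxx 1_M.
    beta = np.linalg.solve(Kxx + M * lam * np.eye(M),
                           Kxy.mean(axis=1) - Kxx.mean(axis=1))
    h_y = (2.0 / lam) * (Kyy.mean(axis=1) - Kyx.mean(axis=1) - Kyx @ beta)
    h_x = (2.0 / lam) * (Kxy.mean(axis=1) - Kxx.mean(axis=1) - Kxx @ beta)
    # Gaussian-kernel derivative in the first argument: d/dz k(z, w) = -(z - w)/ell^2 * k(z, w).
    dKyy = -(y[:, None] - y[None, :]) / ell**2 * Kyy
    dKyx = -(y[:, None] - x[None, :]) / ell**2 * Kyx
    grad_h_y = (2.0 / lam) * (dKyy.mean(axis=1) - dKyx.mean(axis=1) - dKyx @ beta)
    return h_y, h_x, grad_h_y

def drmmd_value(h_y, h_x, lam):
    # DrMMD(mu_hat || pi_hat) = (1 + lam) <h/2, m_mu_hat - m_pi_hat>, via the reproducing property.
    return 0.5 * (1.0 + lam) * (h_y.mean() - h_x.mean())

# Particle descent for the 1-D Gaussian example: pi = N(0, 6), mu_0 = N(0, 3).
rng = np.random.default_rng(0)
sigma2 = 6.0
x_tgt = rng.normal(0.0, np.sqrt(sigma2), size=500)       # samples from pi
parts = rng.normal(0.0, np.sqrt(sigma2 / 2), size=300)   # particles from mu_0
lam, gamma = 1e-1, 1e-2                                   # illustrative fixed values
for n in range(200):
    _, _, g = drmmd_witness(parts, x_tgt, lam)
    parts = parts - gamma * (1.0 + lam) * g               # x <- x - gamma*(1+lam)*grad h(x)
h_y, h_x, _ = drmmd_witness(parts, x_tgt, lam)
print("DrMMD estimate:", drmmd_value(h_y, h_x, lam), "particle variance:", parts.var())
```

The descent convention dx_t = −(1 + λ)∇h_{µt,π}(x_t) dt is used here, with the witness written, as in the proof of Proposition 6.1, as h = 2(Σ̂π + λI)⁻¹(m_µ̂ − m_π̂); the kernel lengthscale is 1, matching the Gaussian kernel of this example.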
While the resulting distribution is not necessarily Gaussian, we may follow the existing analysis of Stein variational gradient descent (Liu et al., 2024b) and Langevin Monte Carlo dynamics (Lambert et al., 2022) and approximate xt µt with a Gaussian random variable yt νt = N(mt, σ2 t ); this yields the update scheme: dmt = (1 + λ) Eyt νt[ hνt,π(yt)] dt, d(σ2 t ) = 2(1 + λ) Eyt νt[ hνt,π(yt) yt] dt + 2(1 + λ) Eyt νt[ hνt,π(yt)] mt dt, (C.2) which gives an evolution of Gaussian distributions νt. From (32), the witness function hνt,π admits a decomposition with eigenvalues ϱi and eigenfunctions ei: hνt,π(y) = P i 1 ϱi ϱi+λ dνt dπ 1, ei L2(π)ei(y). Therefore, the velocity field d dyhνt,π(y) can be written as, d dyhνt,π(y) = X Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur Notice that if mt = 0, then for odd i, dνt dπ 1, ei L2(π) = R eidνt R eidπ = 0 because ei is an odd function. When i is even, Eνt[ d dyei(y)] = 0 because y 7 d dyei(y) is an odd function. As a result, if mt = 0, then E[ hνt,π(yt)] = 0 and hence dmt dt = 0 from the update scheme in (C.2). Therefore, as long as we initialize the Dr MMD gradient flow with ν0 = N(0, 1 2 σ2), a zero mean Gaussian distribution, the entire trajectory will remain a zero mean Gaussian distribution N(0, σ2 t ). Next, observe that for a zero-mean Gaussian trajectory, if the initial variance satisfies σ2 0 < σ2, it is natural to expect that the variance increases monotonically toward the target variance σ2 as the Dr MMD flow evolves, i.e. σ2 0 < σ2 t σ2 for all t. In Lemma C.2, we provide a rigorous proof of this claim in the cases λ = 0 and λ = , by showing that (1+λ) Eyt νt[ hνt,π(yt) yt] < 0, which implies that the variance update in (C.2) is monotone increasing. The cases λ = 0 and λ = correspond respectively to the χ2 flow and the MMD flow regimes. For general λ > 0, however, we are unable to establish a rigorous proof. Our argument in Lemma C.2 relies heavily on Mehler s formula (Liang and Tran-Bach, 2022, Proposition 2.2), which requires exponential decay of the spectrum (ϱi)i 1. This condition is not satisfied for Dr MMD, whose spectrum is modified to ( ϱi ϱi+λ)i 1. Nevertheless, we conjecture that the monotonicity property continues to hold for all λ > 0. Checking assumptions in Theorem 4.1 and Theorem 5.1: Now we check the assumptions from Theorem 4.1 and Theorem 5.1. The target π is a Gaussian distribution which automatically satisfies a Poincaré inequality, and its potential V is a quadratic function, hence satisfies HV βI. νt, π are Gaussians, hence absolutely continuous with respect to Lebesgue on R. And most importantly, we have dνt dπ 1 Ran(T 0.25 π ), i.e., there exists qt L2(π) such that dνt dπ 1 = T 0.25 π qt. To see why, we first need to upper bound dνt dπ 1, ei 2 L2(π). From Belafhal et al. (2020, Corollary 2), we have the following closed-form expressions for dνt dπ 1, ei L2(π): 1 1 + 2σ2 t c c + 1 2σ2 t 1 1 + 2 σ2c when i is even and 0 otherwise. Therefore, we have L2(π) 1 1 + σ2c β i!2i |Hi(0)|2 2 c + 1 2σ2 t 1 c + 1 2 σ2 1 By the monotonicity of the variance, σ2 0 = 1 2 σ2 σ2 t < σ2 and β2 > 3, we have c + 1 2σ2 t 1 < c + 1 2 σ2 1 = 2β2 So we have the following upper bound L2(π) 4 1 + σ2c β i!2i |Hi(0)|2 2β2 β2 + 1 1 i . (De)-regularized Maximum Mean Discrepancy Gradient Flow Now we are ready to study the L2(π)-norm of qt. 
Recall the formulas of ϱi in (C.1), we have dπ 1, ei 2 L2(π) ϱ0.5 i 4β 1 + σ2c 1 i!2i |Hi(0)|2 β2 1 = 4β 1 + σ2c 1 i!2i |Hi(0)|2 β2 1 0.5!i = 4β 1 + σ2c The last equality holds by Mehler s formula (Liang and Tran-Bach, 2022, Proposition 2.2). A quick sanity check for the above derivations is to take r = 0.5 and see that P i=1 ϱ 1 i dνt dπ 1, ei 2 L2(π) is divergent. This indicates that dνt dπ 1 / H, which verifies the fact that the Gaussian RKHS does not contain constant functions (Steinwart and Christmann, 2008, Corollary 4.44). Finally, It and Jt admit the following explicit formulas as well as uniform upper bounds Jt = (log π) dνt 2π σ3σ2 t Γ(7/2) 1 σ2 t 1 2 σ2 7/2 4Γ(7/2) Γ(1/2) 1 σ2 t 1 2 σ2 1/2 + 2Γ(3/2) 1 σ2 t 1 1 σ2 t 1 2 σ2 3/2 + Γ(5/2) 1 σ2 t 1 1 σ2 t 1 2 σ2 5/2 2 5/2 + Γ(3/2) 2 3/2 + Γ(5/2) Empirical verification: While the above derivations demonstrate that a (Gaussian projected) Dr MMD gradient flow satisfies all the required assumptions, it is instructive to demonstrate that a particle implementation in the discrete time setting, without explicit Gaussian projection, shows the behavior consistent with the theory. We therefore simulate our finite particle Dr MMD gradient descent ˆµn with a step size γ = 0.01 and particle number N = 10, 000 and empirically inspect its convergence properties. We take σ2 = 6 so β2 = 5. We estimate the density of the Dr MMD descent µn from the particles with a kernel density estimator using a Gaussian kernel with lengthscale 0.1 (Epanechnikov, 1969). Based on the estimated densities µn, we compute the following two quantities: In = V (dµn dπ ) L2(π) and Jn = ( dµn dπ ) L2(π). Their evolution along the Dr MMD gradient descent is shown in Figure 3. We observe that both quantities decrease over time as desired, which is a consequence of increasing smoothness of the density ratio dµn dπ as µn converges to π. We also report the evolution of the KL divergence c KL(ˆµn ˆπ) along the flow, estimated from particles. Lemma C.1 Given the Dr MMD gradient flow update scheme dxt = (1 + λ) hµt,π(xt)dt, its mean mt = E[xt] and variance σ2 t = E[x2 t ] E[xt]2 update scheme can be expressed as dmt = (1 + λ) Ext µt[ hµt,π(xt)] dt, d(σ2 t ) = 2(1 + λ) Ext µt[ hµt,π(xt) xt] dt + 2(1 + λ) Ext µt[ hµt,π(xt)] mt dt, Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur 0 20 40 60 80 100 Iteration c KL(ˆµn ˆπ) b Jn b In Figure 3: Evolution of c KL(ˆµn ˆπ), b In and b Jn along Dr MMD particle descent, where all three terms are estimated with samples. Proof For the mean update, d dtmt = E[ d dtxt] = (1+λ) Ext µt[ hµt,π(xt)]. For the variance update, d dtσ2 t = d dt E[x2 t ] E[xt]2 = 2 E[xt d dtxt] 2 E[xt] d = 2(1 + λ) Ext µt[ hµt,π(xt) xt] + 2(1 + λ) Ext µt[ hµt,π(xt)] mt. Hence the result. Lemma C.2 Let νt be a zero mean normal distribution N(0, σ2 t ) and 0 < σ2 t < σ2. For the eigenvalues (ϱi)i 1 and eigenfunctions (ei)i 1 defined in (C.1), we have L2(π) Ex νt dxei(x)x < 0, L2(π) Ex νt dxei(x)x < 0. which correspond to the variance update in (C.2) when λ = 0 and λ = , respectively. Proof For the inner product term, we know from Corollary 2 of Belafhal et al. (2020) that dνt 1 1 + 2σ2 t c c + 1 2σ2 t 1 1 + 2 σ2c For the expectation term, notice that (De)-regularized Maximum Mean Discrepancy Gradient Flow 2σ2 t x2 1 x2 2σ2 t x2 exp( cx2)Hi 2σ2 t x2 exp( cx2)Hi =: Ei,1 + Ei,2. Here in ( ), we use integration by parts in which the boundary term vanishes, because limx exp( x2)Hi(x) = 0. 
Next, notice that the second term Ei,2 equals precisely the derivative of Ei,1 with respect to c, rescaled by 1 σ2 t . To distinguish it from the other c that will show up later in dνt dπ 1, ei L2(π), we denote it as c. The original P i 1 dνt dπ 1, ei L2(π) Ex νt[ d dxei(x)x] can be written as the sum of two components F1 + F2: 1 1 + 2σ2 t c c + 1 2σ2 t 1 1 + 2 σ2c 1 1 + 2σ2 t c c + 1 2σ2 t 1 1 + 2 σ2c From Corollary 2 of Belafhal et al. (2020), we know 1 1 + 2σ2 t c c + 1 2σ2 t and Ei,2 = 1 σ2 t d dc Ei,1. The relation between Ei,1 and Ei,2 would carry over to F1 and F2 as well, i.e., F2 = 1 σ2 t d dc F1. Next, we compute the first half of F1. 1 1 + 2σ2 t c c + 1 2σ2 t 1 1 + 2σ2 t c 1 1 + 2σ2 t c c + 1 2σ2 t c + 1 2σ2 t 1 1 + 2σ2 t c 1 1 + 2σ2 t c c + 1 2σ2 t c + 1 2σ2 t = β (1 + 2σ2 t c)(1 + 2σ2 t c) (1 + 2σ2 t c σ2 t σ2 β2)(1 + 2σ2 t c σ2 t σ2 β2) 1 σ2 t σ2 (2 + 2σ2 t c + 2σ2 t c) σ2 t σ2 β 2! 1 Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur The second last equality holds by the Mehler s formula (Liang and Tran-Bach, 2022, Proposition 2.2). Similarly, we can also compute the second half of F1. 1 1 + 2 σ2c = β (1 + 2 σ2c)(1 + 2σ2 t c) (1 + 2 σ2c β2)(1 + 2σ2 t c σ2 t σ2 β2) 1 = 1 + 2σ2 t c + (1 + 2 σ2c)σ2 t σ2 σ2 t σ2 β2 1 Combining the above two equations, we obtain the following formula of F1. σ2 t σ2 (2 + 2σ2 t c + 2σ2 t c) σ2 t σ2 β 2! 1 2 + 1 + 2σ2 t c + (1 + 2 σ2c)σ2 t σ2 σ2 t σ2 β2 1 Since F2 = 1 σ2 t d dc F1, we have σ2 t σ2 (2 + 2σ2 t c + 2σ2 t c) σ2 t σ2 β 2! 3 2 σ2 t σ2 1 + 2σ2 t c + (1 + 2 σ2c)σ2 t σ2 σ2 t σ2 β2 3 Recall that the original P i 1 dνt dπ 1, ei L2(π) Ex νt[ d dxei(x)x] can be written as the sum of two components F1 + F2, and recall that c = c and β2 = 1 + 4 σ2c by definition. We obtain L2(π) Ex νt dxei(x)x = F1 + F2 = σ2 t σ2 1 + 2 σ2 t σ2 which is negative when 0 < σt < σ. So we have concluded the proof of the first claim. Now, we are about to prove the second claim. Following the same derivations as above, we can write P i 1 ϱi dνt dπ 1, ei L2(π) Ex νt[ d dxei(x)x] as the sum of two terms G1 and G2. i 1 ϱi Hi(0) 1 1 + 2σ2 t c c + 1 2σ2 t 1 1 + 2 σ2c i 1 ϱi Hi(0) 1 1 + 2σ2 t c c + 1 2σ2 t 1 1 + 2 σ2c Since ϱi (β2 1 β2+1)i defined in (C.1) has exponential decay, hence Mehler s formula still hold. Up to some positive multiplier coefficient that do not change the sign, G1 can be written as the following formula G1 = (2 σ2c) 2(1 + 2σ2 t c)2 + 2σ2 t σ2 σ4 t σ4 2 + (2 σ2c) 2(1 + 2 σ2c)(1 + 2σ2 t c) + 1 1 (De)-regularized Maximum Mean Discrepancy Gradient Flow And similar to F1, F2, we have G2 = 1 σ2 t d dc G1. As a result, G2 = (2 σ2c) 2(1 + 2σ2 t c)2 + 2σ2 t σ2 σ4 t σ4 2 1 + 2σ2 t c (2 σ2c)2 + σ2 t σ2 (2 σ2c) 2(1 + 2 σ2c)(1 + 2σ2 t c) + 1 3 2 1 + 2 σ2c (2 σ2c)2 + 1 . Combined, we obtain G1 + G2 = (2 σ2c) 2(1 + 2σ2 t c)2 + 2σ2 t σ2 σ4 t σ4 2 (1 + 2σ2 t c) 2σ2 t c (2 σ2c)2 + σ2 t σ2 σ4 t σ4 + (2 σ2c) 2(1 + 2 σ2c)(1 + 2σ2 t c) + 1 3 2 (1 + 2 σ2c) 2σ2 t c (2 σ2c)2 . When 0 < σ2 t < σ2, we have the following relations (2 σ2c) 2(1 + 2σ2 t c)2 + 2σ2 t σ2 σ4 t σ4 < (2 σ2c) 2(1 + 2 σ2c)(1 + 2σ2 t c) + 1, (1 + 2σ2 t c) 2σ2 t c (2 σ2c)2 + σ2 t σ2 σ4 t σ4 > (1 + 2 σ2c) 2σ2 t c (2 σ2c)2 . L2(π) Ex νt dxei(x)x = G1 + G2 < 0, proving the second claim.
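As an informal companion to Lemma C.2 and the conjecture for general λ > 0, one can also track the sample variance along the finite-particle descent and check that it increases toward the target variance. The short script below reuses the hypothetical drmmd_witness helper from the sketch given with the update scheme earlier in this appendix; it is a numerical illustration for one arbitrary fixed λ, not a proof.

```python
import numpy as np

# Monitor the sample variance along the 1-D Gaussian DrMMD particle descent
# (drmmd_witness is the helper defined in the earlier sketch) and report how
# often it increases; with sigma_0^2 < sigma^2 = 6 it should trend upward.
rng = np.random.default_rng(1)
sigma2 = 6.0
x_tgt = rng.normal(0.0, np.sqrt(sigma2), size=500)
parts = rng.normal(0.0, np.sqrt(sigma2 / 2), size=300)
lam, gamma = 1e-1, 1e-2          # illustrative fixed values
variances = [parts.var()]
for n in range(200):
    _, _, g = drmmd_witness(parts, x_tgt, lam)
    parts = parts - gamma * (1.0 + lam) * g
    variances.append(parts.var())
increasing = np.mean(np.diff(variances) > 0)
print(f"fraction of increasing steps: {increasing:.2f}, "
      f"final variance: {variances[-1]:.2f} (target {sigma2})")
```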