# (De)-regularized Maximum Mean Discrepancy Gradient Flow

Journal of Machine Learning Research 26 (2025) 1-77. Submitted 9/24; Revised 8/25; Published 10/25.

Zonghao Chen (zonghao.chen.22@ucl.ac.uk), Department of Computer Science, University College London, London, WC1V 6LJ, UK
Aratrika Mustafi (abm6733@psu.edu), Department of Statistics, Pennsylvania State University, University Park, PA, 16802, USA
Pierre Glaser (pierreglaser@msn.com), Gatsby Computational Neuroscience Unit, University College London, London, WC1V 6LJ, UK
Anna Korba (anna.korba@ensae.fr), ENSAE, CREST, Institut Polytechnique de Paris, Palaiseau, France
Arthur Gretton (arthur.gretton@gmail.com), Gatsby Computational Neuroscience Unit, University College London, London, WC1V 6LJ, UK
Bharath K. Sriperumbudur (bks18@psu.edu), Department of Statistics, Pennsylvania State University, University Park, PA, 16802, USA

Editor: Quentin Berthet

© 2025 Zonghao Chen, Aratrika Mustafi, Pierre Glaser, Anna Korba, Arthur Gretton, and Bharath K. Sriperumbudur. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-1574.html.

Abstract

We introduce a (de)-regularization of the Maximum Mean Discrepancy (DrMMD) and its Wasserstein gradient flow. Existing gradient flows that transport samples from a source distribution to a target distribution using only target samples either lack a tractable numerical implementation (f-divergence flows) or require strong assumptions and modifications, such as noise injection, to ensure convergence (Maximum Mean Discrepancy flows). In contrast, the DrMMD flow can simultaneously (i) guarantee near-global convergence for a broad class of targets in both continuous and discrete time, and (ii) be implemented in closed form using only samples. The former is achieved by leveraging the connection between the DrMMD and the χ2-divergence, while the latter comes from treating DrMMD as MMD with a de-regularized kernel. Our numerical scheme employs an adaptive de-regularization schedule throughout the flow to optimally balance the trade-off between discretization errors and deviations from the χ2 regime. The potential application of the DrMMD flow is demonstrated across several numerical experiments, including a large-scale setting of training student/teacher networks.

Keywords: Wasserstein gradient flow, reproducing kernel Hilbert space, maximum mean discrepancy, f-divergences, spectral regularization

1. Introduction

Many applications in computational statistics and machine learning involve approximating a probability distribution π on Rd (in terms of samples) when only partial information about π is accessible. For example, in Bayesian inference, π is known only up to an intractable normalizing constant for complex models. The setting of interest in this work is the so-called generative modeling setting (Brock et al., 2019; Ho et al., 2020; Song et al., 2021; Franceschi et al., 2024), where one assumes access to a set of samples from the target distribution π, with the goal of generating new samples from π. Recently, a popular framework for this task involves solving a minimization problem in P(Rd), the space of probability distributions over Rd, by choosing the objective function to be a dissimilarity function D(· ∥ π) (a distance or divergence) between probability distributions that satisfies D(ν ∥ π) = 0 if and only if ν = π.
Since only samples from π are available, this problem is solved approximately yet, the approximate minimizers may converge to π as the number of available samples increases. In particular, in the space of probability distributions with bounded second moment P2(Rd), a common approach is to solve this optimization problem by running a (sample-based) approximation of the Wasserstein gradient flow of the functional F = D( π), which defines a path of distributions with steepest descent for F with respect to the Wasserstein-2 distance. In generative modeling, the choice of D( π) depends on two crucial aspects: First, its flow should admit consistent and preferably tractable numerical implementations using only samples from π, and second, under reasonable assumptions, it should guarantee convergence of its flow to π, the unique global minimizer. Combined, these two properties ensure that, in the large sample limit, this algorithm generates new samples from the target. Unfortunately, verifying these two properties simultaneously has proved to be a surprisingly challenging task. For instance, recent approaches based on Maximum Mean Discrepancy (MMD) (Arbel et al., 2019; Hertrich et al., 2024b), the sliced-Wasserstein distance (Liutkus et al., 2019) and the Sinkhorn divergence (Genevay et al., 2018) typically admit consistent finite-sample implementations, but their global convergence guarantees when they exist do not apply to most practical targets π of interest. To guarantee global convergence, one could instead choose D as an f-divergence. For instance, the (reverse) KL divergence and χ2-divergence are geodesically convex (Villani et al., 2009, Definition 16.5) when the target is log-concave (i.e. π e V with V convex) (Ohta and Takatsu, 2011), and hence their flows enjoy better convergence behaviour. However, while the population Wasserstein gradient flows of the χ2 and KL divergences are well-defined (Jordan et al., 1998; Chewi et al., 2020), they do not come naturally with consistent and tractable sample-based implementations. Multiple approaches propose to solve a surrogate optimization problem with samples at each iteration of the flow (Gao et al., 2019; Ansari et al., 2021; Simons et al., 2022; Birrell et al., 2022; Gu et al., 2024; Liu et al., 2024a); however, it remains to be formally established whether these surrogate problems preserve the desirable convergence guarantees of f-divergence flows. In the face of the trade-offs present in the current approaches, a natural question arises: Does there exist a divergence functional D( π) whose gradient flow both globally converges, and admits a tractable, consistent sample-based implementation? In this work, we take a step towards a positive answer by constructing a de-regularized variant of the Maximum Mean Discrepancy (Dr MMD) and its associated Wasserstein gradient flow. We prove that the Dr MMD gradient flow converges exponentially to the global minimum up to a controllable barrier term for targets π that satisfy a Poincaré inequality, in both continuous and discrete time regimes. To do so, we establish and leverage a connection between the Dr MMD and the χ2 divergence, an f-divergence whose gradient flow benefits from strong convergence guarantees. By alternatively viewing Dr MMD as MMD with a regularized kernel, Dr MMD flow comes with a consistent and tractable implementation when only samples from the target π are available. 
In addition, given the empirical success of using adaptive kernels in MMD-based generative models (Galashov et al., 2025; Li et al., 2017; Arbel et al., 2018), our paper shows theoretically that using adaptive kernels through adaptive regularization indeed improves the convergence of MMD gradient flow. (De)-regularized Maximum Mean Discrepancy Gradient Flow This paper is organized as follows. Section 2 introduces the necessary background on reproducing kernel Hilbert spaces (RKHS), the MMD, χ2-divergence, and Wasserstein gradient flows. Section 3 introduces Dr MMD and shows that Dr MMD is a valid probability divergence that metrizes weak convergence. Section 4 uses Dr MMD as the optimization objective to define a Wasserstein gradient flow in P2(Rd), and analyzes the convergence of Dr MMD flow in continuous time. Sections 5 and 6 define an implementable Dr MMD particle descent scheme with both time and space discretization and analyze its convergence. Section 7 discusses other Wasserstein gradient flows related to our Dr MMD flow. Section 8 shows experiments that confirm our theoretical results. The proofs of all results are provided in Section 10, with the technical results being relegated to an appendix. 2. Background In this section, we present the definitions and notation used throughout the paper. 2.1 Notations Let Ld be the Lebesgue measure on Rd. P2(Rd) denotes the set of all Borel probability measures µ on Rd with finite second moment. For µ P2(Rd), µ π denotes that µ is absolutely continuous with respect to π. We use dµ dπ to denote the Radon-Nikodym derivative. We recall the standard definition of the Kullback-Leibler divergence, KL(µ π) = R log( dµ dπ)dµ if µ π, + else. For a continuous mapping T : Rd Rd, T#µ denotes the push-forward measure of µ by T. For any π P2(Rd), L2(π) is the Hilbert space of (equivalence class of) functions f : Rd R such that R |f|2dπ < . We denote by L2(π) and , L2(π) the norm and the inner product of L2(π). We denote by C c (Rd) the space of infinitely differentiable functions from Rd to R with compact support. For a vector valued functions g : Rd Rp, we abuse the notation of L2(π) and claim g L2(π) if gi L2(π) for all i = 1, . . . , p along with g 2 L2(π) := Pp i=1 gi 2 L2(π). If f : Rd R is differentiable, we denote by f the gradient of f and Hf its Hessian. f is α-strongly convex if Hf αI, i.e, Hf(x) αI is positive semi-definite for any x, where I is the identity matrix (also denotes an identity operator depending on the context). For a vector valued function g : Rd Rd, if gi is differentiable for all i = 1, , d, g denotes the divergence of g. We also denote by g the Laplacian of g, where g = g. We use F to denote the matrix Frobenius norm. a b and a b denote the minimum and maximum of a and b, respectively. 2.2 Reproducing kernel Hilbert spaces For a positive semi-definite kernel k : Rd Rd R, its corresponding reproducing kernel Hilbert space (RKHS) H is a Hilbert space with inner product , H and norm H (Aronszajn, 1950), such that (i) k(x, ) H for all x Rd, and (ii) the reproducing property holds, e.g. for all f H, x Rd, f(x) = f, k(x, ) H. We denote by Hd the Cartesian product RKHS consisting of elements f = (f1, . . . , fd) with fi H with inner product f, g Hd = Pd i=1 fi, gi H. Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur When R k(x, x)dπ(x) < , H can be canonically injected into L2(π) using the operator ιπ : H L2(π), f 7 f with adjoint ι π : L2(π) H given by ι πf( ) = Z k(x, )f(x)dπ(x). 
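To make the embedding ι_π and its adjoint concrete, the following minimal sketch (the Gaussian kernel, sample size, and function names are our illustrative choices, not part of the paper) approximates (ι*_π f)(·) = ∫ k(x, ·) f(x) dπ(x) by a Monte Carlo average over samples from π; evaluating the result pointwise gives a plug-in approximation of the integral operator T_π f introduced next.

```python
import jax
import jax.numpy as jnp

def gauss_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)): bounded, continuous, c0-universal
    return jnp.exp(-jnp.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def adjoint_embedding(f, X, sigma=1.0):
    """Plug-in approximation of (iota_pi^* f)(z) = int k(x, z) f(x) dpi(x),
    with pi replaced by the empirical measure of the samples X."""
    fX = jax.vmap(f)(X)                                   # f evaluated at the samples
    return lambda z: jnp.mean(jax.vmap(lambda x: gauss_kernel(x, z, sigma))(X) * fX)

key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (500, 2))      # samples standing in for pi
f = lambda x: jnp.sin(x[0])               # a test function in L2(pi)
Tf = adjoint_embedding(f, X)              # pointwise values of (T_pi f) = (iota_pi iota_pi^* f)
print(Tf(jnp.zeros(2)))                   # Monte Carlo estimate of (T_pi f)(0)
```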
The operator ιπ and its adjoint can be composed to form an L2(π) endomorphism Tπ := ιπι π called the integral operator, and a H endomorphism Σπ := ι πιπ = R k( , x) k( , x)dπ(x) (where (a b)c := b, c Ha for a, b, c H) called the covariance operator. Tπ is compact, positive, self-adjoint, and can thus be diagonalized into an orthonormal system in {ei}i 1 of L2(π) with associated eigenvalues ϱ1 ϱi 0. In this paper, we make the following assumption on our kernel. Assumption 1 k : Rd Rd R is a continuous and c0-universal kernel, and there exists K > 0 such that supx k(x, x) K. We refer the reader to Carmeli et al. (2010) for the definition of c0-universal kernel. The implication of Assumption 1 is that the RKHS H is compactly embedded into L2(π) (Steinwart and Scovel, 2012, Lemma 2.3), and hence k(x, x ) has a absolute, uniform and pointwise convergent Mercer representation (Steinwart and Scovel, 2012, Corollary 3.5), k(x, x ) = X i 1 ϱiei(x)ei(x ), (1) for any x and x in the support of π. Since the kernel is c0-universal, the RKHS H is dense in L2(π) for all Borel probability measures π (Sriperumbudur et al., 2011, Section 3.1) and {ei}i 1 becomes an orthornormal basis of L2(π) (Steinwart and Scovel, 2012, Theorem 3.1). The power of the integral operator T r π is defined as T r π f := P i 1 ϱr i f, ei L2(π) ei, f L2(π). For f Ran(T r π ), there exists q L2(π) such that f = T r π q. The exponent r quantifies the smoothness of the range space relative to the original RKHS H with 0 < r < 1 2 (resp. r > 1 2) yields spaces that are less (resp. more) smooth than H with Ran(T 1/2 π ) being isometrically isomorphic to H (Cucker and Zhou, 2007; Fischer and Steinwart, 2020). We make an additional assumption commonly employed in the kernel-based gradient flow literature (Glaser et al., 2021; He et al., 2024; Korba et al., 2020; Arbel et al., 2019) on the regularity of the kernel that will be employed in studying the Dr MMD gradient flow. Assumption 2 k : Rd Rd R is twice differentiable in the sense of (Steinwart and Christmann, 2008, Definition 4.35), i.e., for i, j {1, , d}, both i i+dk and i j i+d j+dk exist and are continuous. There exist constants K1d, K2d > 0 such that 1k(x, ) Hd := Pd i=1 ik(x, ) H K1d and H1k(x, ) Hd d := Pd i,j=1 i jk(x, ) H K2d for all x Rd. Many kernels satisfy both Assumption 1 and 2, including the class of bounded, continuous, and translation invariant kernels on Rd whose Fourier transforms have finite second and fourth moments. This is easy to verify by employing the Fourier transform representation of the RKHS (Wendland, 2004, Theorem 10.12) and noting that the finiteness of the RKHS norm of 1k( , x) and H1k( , x) for all x corresponds to the existence and finiteness of the second and fourth moments of the Fourier transform of the kernel, respectively. This condition is satisfied by the Gaussian kernel, Matérn kernels of order ν with ν + d 2 2 and the inverse multiquadratic kernel. (De)-regularized Maximum Mean Discrepancy Gradient Flow 2.3 Maximum mean discrepancy and χ2-divergence The Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) between µ and π is defined as the RKHS norm of the difference between the mean embeddings1 mµ := R k(x, )dµ(x) and mπ := R k(x, )dπ(x). MMD(µ π) := Z k(x, )dµ(x) Z k(x, )dπ(x) H = mµ mπ H . The function mµ mπ is often referred to as the witness function . 
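The squared MMD just defined admits a simple plug-in estimate from samples, obtained by replacing µ and π with empirical measures (this estimator is revisited in the next paragraph). A minimal sketch with a Gaussian kernel (the kernel, bandwidth, and sample sizes are illustrative choices) computes ‖m̂_µ − m̂_π‖²_H from three Gram matrices:

```python
import jax
import jax.numpy as jnp

def gauss_gram(X, Y, sigma=1.0):
    # Gram matrix [k(X_i, Y_j)]_ij for the Gaussian kernel
    sq = jnp.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-sq / (2.0 * sigma ** 2))

def mmd_sq(X, Y, sigma=1.0):
    """Plug-in (V-statistic) estimate of MMD^2(mu, pi) = ||m_mu - m_pi||_H^2,
    with mu and pi replaced by the empirical measures of X and Y."""
    Kxx, Kyy, Kxy = gauss_gram(X, X, sigma), gauss_gram(Y, Y, sigma), gauss_gram(X, Y, sigma)
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()

X = jax.random.normal(jax.random.PRNGKey(0), (200, 2))         # samples from mu
Y = jax.random.normal(jax.random.PRNGKey(1), (200, 2)) + 1.0   # samples from pi
print(mmd_sq(X, Y))   # positive, since the two sample sets come from different laws
```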
When the kernel k is c0-universal, MMD(µ π) = 0 if and only if µ = π, and the MMD metrizes the weak topology between probability measures (Sriperumbudur et al., 2010; Sriperumbudur, 2016). Given samples (x1, . . . , xn) and (y1, . . . , ym) from µ and π respectively, the MMD can be consistently estimated in multiple ways (Gretton et al., 2012). For instance, one can compute its plug-in estimator, e.g. MMD(ˆµ ˆπ), where bµ := 1 n Pn i=1 δxi and bπ := 1 n Pn i=1 δyi. The χ2-divergence a member of the family of f-divergences (Rényi, 1961) is defined as the variance of the Radon-Nikodym derivative dµ dπ under π: χ2(µ π) := Z dµ when µ π, and + otherwise. The χ2-divergence has a variational form (Nowozin et al., 2016; Nguyen et al., 2010): χ2(µ π) = sup h M(Rd) Z hdµ Z h + 1 where M(Rd) denote the set of all measurable functions from Rd to R. When µ π, we prove in Lemma B.1 that the optimal h = dµ dπ 1 L2(π) so that it is sufficient to restrict the variational set to L2(π) in contrast to M(Rd) for general f-divergences. Since in most cases, χ2(ˆµ ˆπ) = + , the χ2-divergence does not admit plug-in estimators, and estimating it consistently involves more complicated strategies (Nguyen et al., 2010). 2.4 Wasserstein gradient flows Gradient flows are dynamics that use local (e.g. differential) information about a given functional in order to minimize it as fast as possible. Their exact definition depends on the nature of the input space; in the familiar case of Euclidean space Rd, the gradient flow of a sufficiently regular F : Rd R given some initial condition x0 is given by the solution (xt)t 0 of txt = v(xt), where v is the Fréchet subdifferential of F, a generalization of the notion of derivative to non-smooth functions (Kruger, 2003). Gradient flows can be extended from Euclidean spaces to the more general class of metric spaces (Ambrosio et al., 2005). When the metric space in question is P2(Rd) endowed with the Wasserstein-2 distance, this gradient flow is called the Wasserstein gradient flow (µt)t 0. The Wasserstein gradient flow of F : P2(Rd) R takes the particular form (Ambrosio et al., 2005, Lemma 10.4.1): tµt + (µtvt) = 0, (2) 1. Such mean embeddings are well-defined under Assumption 1. Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur where vt is the Fréchet subdifferential of F : P2(Rd) R evaluated at µt (Ambrosio et al., 2005, Definition 11.1.1). (2) is an instance of the continuity equation with velocity field vt: under these dynamics, the mass of µt is transported in the direction vt that decreases F at the fastest rate at each time t. While (2) can be time-discretized in various ways (Santambrogio, 2017; Ambrosio et al., 2005), in this work, we will focus on the forward Euler scheme, defined as µn+1 := (I γvn)#µn where γ > 0 is a step size parameter. Such a scheme is also known as the Wasserstein Gradient Descent of F. Just as gradient descent in Euclidean spaces, an instrumental property to characterize the convergence of the Wasserstein gradient descent of a functional F is given by its geodesic convexity and smoothness. Among various ways, one can consider to characterize convexity and smoothness through lower and upper bounds on the Wasserstein Hessian of the functional F (Villani et al., 2009, Proposition 16.2). 
Given any φ C c (Rd) that defines a constant speed geodesic2 starting at µ: ρt = (I + t φ)#µ for 0 t 1, the Wasserstein Hessian of a functional F : P2(Rd) R at µ, denoted as Hess F|µ, is an operator from L2(µ) to L2(µ):3 Hess F|µ φ, φ L2(µ) = d2 A functional F is said to be geodesically M-smooth at µ if Hess F|µ φ, φ L2(µ) M φ L2(µ), and is said to be geodesically Λ-convex at µ if Hess F|µ φ, φ L2(µ) Λ φ L2(µ). Additionally, F is geodesically semiconvex if < Λ < 0 and geodesically strongly convex if Λ > 0. Generally, a functional F that is both smooth and strongly convex is preferred, because its Wasserstein gradient descent has an exponential rate of convergence under a small enough step size γ (Boyd and Vandenberghe, 2004, Section 9.3.1)(Bonet et al., 2024). Given some probability measure π P2(Rd), the MMD flow (resp. χ2 flow) is the Wasserstein gradient flow of the functional FMMD( ) = MMD( π) (resp. Fχ2( ) = χ2( π)). As with the MMD, the MMD flow has an analytic finite sample implementation and may be used to construct generative modeling algorithms (Hertrich et al., 2024b,a). The Wasserstein Hessian of FMMD for smooth kernels is not positively lower bounded (Arbel et al., 2019, Proposition 5), however, so MMD flow only converges up to an unknown barrier (Arbel et al., 2019, Theorem 6), with global convergence only under a strong (and unverifiable) assumption (Arbel et al., 2019, Proposition 7). More recent works (Boufadène and Vialard, 2024) have demonstrated the global convergence of the MMD flow when using the Coulomb kernel. This kernel is non-smooth, however, which complicates numerical implementations. In contrast, the Wasserstein Hessian of Fχ2 is positively lower bounded (Ohta and Takatsu, 2011) for log-concave targets π, so Fχ2 is geodesically strongly convex and χ2 flow enjoys exponential rate of convergence. The exponential convergence of χ2 flow towards the global minimum in fact holds for a broader class of targets π that satisfy a Poincaré inequality (Chewi et al., 2020). The χ2 flow has so far lacked a tractable sample-based implementation, however, so it has not been widely used in practice. 2. See Appendix A for the definition. 3. Strictly speaking, Hess F|µ is an operator over the tangent space TµP2(Rd) which is a subset of L2(µ) (Villani et al., 2009). (De)-regularized Maximum Mean Discrepancy Gradient Flow In the following sections, we will introduce a new Wasserstein gradient flow that combines the computational advantages of the MMD flow with the convergence properties of the χ2 3. (De)-regularized Maximum Mean Discrepancy (Dr MMD) In this section, we introduce a (de)-regularized version of maximum mean discrepancy, or Dr MMD in short. The Dr MMD is rooted in a unified representation of the MMD and the χ2-divergence, given in the following proposition, which is proved in Section 10.1. Proposition 3.1 (MMD and χ2-divergence) Suppose dµ dπ 1 L2(π) for µ, π P2(Rd). Then MMD2(µ π) = T 1/2 π L2(π) and χ2(µ π) = I dµ Remark 3.1 The χ2 identity follows from the definition but is provided for comparison purposes. Together, these identities express both the MMD and the χ2-divergence as functionals of the (centered) density ratio dµ dπ 1. While the χ2-divergence directly computes the L2(π) norm of the centred ratio, the MMD2 first computes the image by the operator T 1/2 π before taking the L2(π) norm. 
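Proposition 3.1 can be made tangible with a toy spectral computation: choosing a (made-up) Mercer spectrum (ρ_i) for T_π and coefficients c_i = ⟨dµ/dπ − 1, e_i⟩_{L2(π)}, the χ2-divergence is the unweighted sum of squared coefficients, while the squared MMD down-weights the i-th coefficient by ρ_i. The numbers below are purely illustrative.

```python
import jax.numpy as jnp

# Illustrative spectrum and coefficients (made up for this sketch, not from the paper):
# rho_i are Mercer eigenvalues of T_pi, c_i = <d mu/d pi - 1, e_i>_{L2(pi)}.
i = jnp.arange(1, 51)
rho = 1.0 / i ** 2           # polynomially decaying eigenvalues of T_pi
c = 1.0 / i                  # coefficients of the centred density ratio

chi2 = jnp.sum(c ** 2)               # chi^2(mu || pi) = ||d mu/d pi - 1||^2_{L2(pi)}
mmd2 = jnp.sum(rho * c ** 2)         # MMD^2(mu || pi) = ||T_pi^{1/2}(d mu/d pi - 1)||^2_{L2(pi)}
print(chi2, mmd2)                    # mmd2 << chi2: T_pi^{1/2} damps high-frequency components
```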
The smoothing effect of the compact operator T 1/2 π note that Tπ is compact if k is bounded as assumed in Assumption 1 has both positive and negative consequences: FMMD( ) = MMD2( π) admits finite sample estimators but is not geodesically convex, making the first-order optimization of MMD objective (as done in generative modeling) challenging (Arbel et al., 2019). In contrast, Fχ2( ) = χ2( π) is geodesically convex for log-concave targets π (Ohta and Takatsu, 2011) but is hard to estimate with samples. With these facts in mind, we introduce a divergence whose purpose is to combine the beneficial properties of both the χ2-divergence and MMD. To do so, this divergence computes the L2(π) norm of the image of dµ dπ 1 by an alternative operator which interpolates between I and T 1/2 π . We set this operator to be ((Tπ + λI) 1Tπ)1/2, where λ > 0 is a regularization parameter. The operator ((Tπ + λI) 1Tπ)1/2 can be seen as a (de)-regularization of the operator T 1/2 π used by the MMD - a similar idea has been used in kernel Fisher discriminant analysis (Mika et al., 1999), goodness-of-fit testing (Balasubramanian et al., 2021; Hagrass et al., 2024b), and two-sample testing (Harchaoui et al., 2007; Hagrass et al., 2024a). We call the resulting divergence the (De)-regularized Maximum Mean Discrepancy (Dr MMD). Definition 1 (Dr MMD) Suppose dµ dπ 1 L2(π) where µ, π P2(Rd). Then the (de)- regularized maximum mean discrepancy (Dr MMD) between µ, π P2(Rd) is defined as Dr MMD(µ||π) = (1 + λ) (Tπ + λI) 1Tπ 1/2 dµ L2(π) , (3) where λ > 0. Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur While all three operators, I, Tπ and (Tπ + λI) 1Tπ are diagonalizable in the same eigenbasis of L2(π), the key difference between them lies in the behavior of their eigenvalues. The identity operator I has all eigenvalues 1, the integral operator Tπ has eigenvalues (ϱi)i 1 which decay to zero as i and the (de)-regularized integral operator (Tπ + λI) 1Tπ has eigenvalues (ϱi/(ϱi + λ))i 1 which either decay to zero or converge to 1 depending on the choice of λ as i . (Tπ + λI) 1Tπ is known in the statistical estimation literature as Tikhonov regularization; alternative definitions of Dr MMD could be obtained by using other regularizing operators, such as Showalter regularization in Engl et al. (1996). In this paper, we primarily focus on Tikhonov regularization and leave other types of regularization for future work. One of the stated purposes of the Dr MMD is to retain the computational benefits of the MMD, which are crucial for its use in particle algorithms for generative modeling. To this end, we provide an alternative representation of Dr MMD which does not involve the density ratio dµ dπ directly, but only kernel expectations. Proposition 3.2 (Density ratio free and variational formulations) Dr MMD can be alternately represented as Dr MMD(µ||π) = (1 + λ) (Σπ + λI) 1 2 (mµ mπ) 2 = (1 + λ) sup h H Z h dµ Z h2 with h µ,π = 2 (Σπ + λI) 1 (mµ mπ) being the witness function. The proof is in Section 10.2. The density ratio-free representation (4) contains three expectations under µ and π: the mean embeddings mµ and mπ, and the covariance operator Σπ. Given samples {xi}M i=1 µ and {yi}N i=1 π, we can construct a plug-in finite sample estimator in (71) by replacing µ and π with their empirical counterparts: ˆµ = 1 M PM i=1 δxi and ˆπ = 1 N PN i=1 δyi. The density ratio-free representation (4) frames Dr MMD as acting on the difference of mµ and mπ similarly to MMD. 
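The plug-in estimator referred to above (equation (71), not reproduced in this section) can be sketched directly from the density-ratio-free form (4). The reduction to Gram matrices below uses the standard identity (Σ̂_π + λI)^{-1} = λ^{-1}(I − S*(K_xx + MλI)^{-1}S), where S evaluates an RKHS function at the M target samples; the Gaussian kernel and all constants are our illustrative choices, not the paper's reference implementation.

```python
import jax
import jax.numpy as jnp

def gauss_gram(X, Y, sigma=1.0):
    sq = jnp.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-sq / (2.0 * sigma ** 2))

def drmmd(Y, X, lam, sigma=1.0):
    """Plug-in sketch of DrMMD(mu || pi) = (1 + lam) <w, (Sigma_pi + lam I)^{-1} w>_H,
    where w = m_mu - m_pi, and mu, pi are replaced by the empirical measures of Y and X.
    Uses (Sigma_pi_hat + lam I)^{-1} = (1/lam) (I - S^* (K_xx + M lam I)^{-1} S)."""
    M = X.shape[0]
    Kxx, Kyy, Kxy = gauss_gram(X, X, sigma), gauss_gram(Y, Y, sigma), gauss_gram(X, Y, sigma)
    w_normsq = Kyy.mean() - 2.0 * Kxy.mean() + Kxx.mean()     # ||w||_H^2, the squared MMD
    Sw = Kxy.mean(axis=1) - Kxx.mean(axis=1)                  # (S w)_i = w(x_i)
    quad = Sw @ jnp.linalg.solve(Kxx + M * lam * jnp.eye(M), Sw)
    return (1.0 + lam) / lam * (w_normsq - quad)

Y = jax.random.normal(jax.random.PRNGKey(0), (200, 2))         # samples from mu
X = jax.random.normal(jax.random.PRNGKey(1), (300, 2)) + 1.0   # samples from pi
print(drmmd(Y, X, lam=1e-2), drmmd(Y, X, lam=1e2))   # large lam approaches the squared MMD
```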
In fact, up to a multiplicative factor (1 + λ), Dr MMD is MMD computed with respect to another kernel k defined as k(x, x ) = D (Σπ + λI) 1 2 k(x, ), (Σπ + λI) 1 2 k(x , ) E The derivations are provided in Section 10.3. The kernel k is symmetric and positive semidefinite by construction, and its associated reproducing kernel Hilbert space is H. The density ratio-free representation of Dr MMD in Proposition 3.2 is already known in (Balasubramanian et al., 2021; Harchaoui et al., 2007; Hagrass et al., 2024a,b) in the context of non-parametric hypothesis testing. 3.1 Properties of Dr MMD In this section, we establish various properties of Dr MMD. As discussed earlier, Dr MMD is constructed to interpolate between χ2-divergence and MMD to exploit the advantages associated with each. The following result formalizes the interpolation property. (De)-regularized Maximum Mean Discrepancy Gradient Flow Proposition 3.3 (Interpolation property) Let µ, π P2(Rd). If Assumption 1 holds and dµ dπ 1 L2(π), then lim λ 0 Dr MMD(µ π) = χ2(µ π), and lim λ Dr MMD(µ π) = MMD2(µ π). Proposition 3.3, whose proof can be found in Section 10.4, shows that Dr MMD asymptotically becomes a probability divergence in the small and large λ regimes. We seek to use Dr MMD as a minimizing objective in generative modeling algorithms; however, we need to ensure that Dr MMD is a probability divergence for any fixed value of λ. This result holds, as shown next. Proposition 3.4 (Dr MMD is a probability divergence) Under Assumption 1, for any λ (0, ), Dr MMD is a probability divergence, i.e., Dr MMD(µ π) 0, with equality iff µ = π. Moreover, Dr MMD metrizes the weak topology between probability measures, i.e., Dr MMD(µn π) 0 iffµn converges weakly to π P2(Rd) as n . As MMD with c0-universal kernels metrizes the weak convergence of distributions (Sriperumbudur, 2016; Simon-Gabriel et al., 2023), Proposition 3.4, whose proof can be found in Section 10.5, shows that for any λ > 0, Dr MMD is MMD like topologically speaking, and is different from the χ2-divergence which induces a strong topology (Agrawal and Horel, 2021). Remark 3.2 Dr MMD is a specific case of a so-called Moreau envelopes of f-divergences in reproducing kernel Hilbert spaces introduced in Stein et al. (2025), when the f-divergence is taken as the χ2-divergence. This connection is uncovered by the variational formulation of Dr MMD in (5). In contrast to general f-divergences, the Moreau envelope of the χ2divergence enjoys a closed-form expression, as we highlight in this paper with various analytical formulas for Dr MMD. The interpolation property (Proposition 3.3) and the metrization of weak convergence (Proposition 3.4) are proved concurrently in Corollaries 12 and 13 of Stein et al. (2025), relying on formulation via Moreau envelopes. In our case, we use direct computations thanks to the closed form of Dr MMD.4 4. Wasserstein Gradient Flow of Dr MMD Having introduced the Dr MMD in the previous section, we now construct and analyze its Wasserstein Gradient Flow (WGF). As discussed in Section 2, WGFs define dynamics (µt)t 0 in Wasserstein-2 space that minimize a given functional F by transporting µt in the direction of steepest descent, given by the Fréchet subdifferential of F evaluated at µt. Given some target distribution π from which we wish to sample, the WGF of FDr MMD( ) = Dr MMD( π), called Dr MMD flow, has the potential to form the basis of a generative modeling algorithm, since µt progressively minimizes its distance (in the Dr MMD sense) to the target π. 
To fulfill this potential, two additional ingredients are necessary. The first is to formally establish that µt reaches the global minimizer π, and the second is to design a tractable finite-sample algorithm that inherits the convergence properties of the original Dr MMD flow. 4. We would like to clarify that Stein et al. (2025) appeared on arxiv when our paper was already under review at ICML 2024. Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur We defer the second point to Section 5 and focus in this section on showing how Dr MMD benefits from its interpolation towards χ2-divergence such that the Dr MMD flow achieves near-global convergence for a large class of target distributions. 4.1 Dr MMD flow: Definition, existence, and uniqueness To prove that the Dr MMD flow is well-defined and admits solutions, the key is to show that the FDr MMD admits Fréchet subdifferentials, as formalized in the following proposition. Proposition 4.1 (Dr MMD gradient flow) Let λ > 0, and µ0, π P2 Rd . Under Assumption 1 and 2, the functional FDr MMD admits Fréchet subdifferential of the form (1 + λ) h µt,π, where h µt,π is the witness function defined in Proposition 3.2. Consequently, the Dr MMD flow is well-defined and is the solution to the following equation tµt µt(1 + λ) h µt,π = 0. (7) In addition, the Dr MMD flow starting at µ0 is unique because FDr MMD is semiconvex, i.e., for any φ C c (Rd) and µ P2(Rd), Hess FDr MMD|µ φ, φ 2(1 + λ)2 KK2d + K1d λ φ 2 L2(µ). (8) The proof is in Section 10.6. By recalling the discussion of Wasserstein Hessian in Section 2.4, (8) indicates that FDr MMD is both geodesically smooth and geodesically semiconvex, which is expected because Dr MMD is equivalent to MMD with a regularized kernel k defined in (6) and FMMD is both geodesically smooth and geodesically semiconvex (Arbel et al., 2019, Proposition 5). Proposition 4.1 is proved concurrently in Corollaries 14 and 20 of Stein et al. (2025) relying on formulation via Moreau envelopes, while our proof uses the closed-form expression for Dr MMD. 4.2 Near-global convergence of Dr MMD flow Having defined the Dr MMD flow, we are now concerned with its convergence to the target π. Since Dr MMD is constructed to interpolate between the MMD and the χ2-divergence, Dr MMD flow is expected to recover the convergence properties of the χ2 flow. With this goal in mind, we first study the Wasserstein Hessian of Dr MMD and prove that it becomes asymptotically positive as λ 0 for strongly log-concave targets π. Next, to obtain a non-asymptotic convergence rate, we take another route and show that the Dr MMD flow converges to π exponentially fast in KL divergence up to a barrier term that vanishes in the small λ regime, provided that π satisfies a Poincaré inequality. 4.2.1 Near-Geodesic Convexity of FDr MMD One popular approach to proving that the Dr MMD flow (µt)t 0 converges to the target π in terms of Dr MMD is to show that FDr MMD is geodesically convex (Ambrosio et al., 2005, Theorem 4.0.4), or, equivalently in our definition in Section 2.4, its Wasserstein Hessian is positive definite. In the next proposition, we show that the Wasserstein Hessian of FDr MMD is indeed positive definite for small enough λ; however, as we will see below (remark 4.2), the form of the result will not allow us to show convergence besides in the limit λ 0, which leads us to take a different approach in subsequent sections. 
(De)-regularized Maximum Mean Discrepancy Gradient Flow Proposition 4.2 (Near-geodesic convexity of FDr MMD) Let µ, π P2(Rd), µ, π Ld and φ C c (Rd). Under Assumption 1 and 2, let π be α strongly log-concave, i.e., π exp( V ), HV αI, and assume additionally that x 7 HV (x) is continuous. Then for all µ such that x 7 log µ(x) is continuous and dµ Hess FDr MMD|µ φ, φ L2(µ) α(1 + λ) Z dµ dπ(x) φ(x) 2dµ(x) R(λ, µ, φ), (9) where limλ 0 R(λ, µ, φ) = 0. The proof can be found in Section 10.7. To obtain this result, we relate the Wasserstein Hessian of FDr MMD with that of Fχ2. When dµ dπ 1 H, they coincide asymptotically as λ 0, showing that the interpolation properties of Dr MMD to the χ2-divergence hold at the level of Wasserstein derivatives. Together with the fact that the Wasserstein Hessian of Fχ2 is positive definite for α-strongly log-concave π (Ohta and Takatsu, 2011), we obtain the lower bound in (9). It is noteworthy that although Dr MMD can be viewed as squared MMD with a regularized kernel k, the near-geodesic convexity in Proposition 4.2 is not observed for standard MMD with a fixed kernel k because the latter does not interpolate towards χ2-divergence. Remark 4.1 (Geodesic convexity/smoothness trade-offin FDr MMD) The geodesic smoothness and near-convexity of the functional FDr MMD are characterized by (8) and (9) respectively via upper and lower bounds on the Wasserstein Hessian of FDr MMD. However, (8) and (9) impose contradictory conditions on the (de)-regularization parameter λ: (8) indicates that Dr MMD is smoother if λ is larger while (9) indicates that Dr MMD is more convex if λ is small enough. Consequently, there is a trade-offbetween the geodesic convexity and smoothness of FDr MMD, which will play an important role in Section 5. Remark 4.2 Proposition 4.2 shows that for fixed µ and φ, there exists λ small enough yet positive such that Hess FDr MMD|µ φ, φ L2(µ) > 0 at µ. The remainder term R(λ, µ, φ) is only controlled in the limit as λ 0, however, which complicates the use of Proposition 4.2 to show global convergence of the Dr MMD flow. In the next section, we employ a different set of techniques that rely on the Poincaré condition on π, a condition which, as we show, will ensure a sufficient dissipation of KL divergence along the Dr MMD flow to obtain non-asymptotic near-global convergence. 4.2.2 Near-Global Convergence of Dr MMD flow via Poincaré inequality Even when a functional F is not geodesically convex, convergence guarantees for its Wasserstein gradient flow (µt)t 0 can still be obtained if the target π satisfies certain functional inequalities. Consider the χ2 flow (νt)t 0, for example: if π satisfies the Poincaré inequality, then (νt)t 0 converges exponentially fast to π in terms of KL divergence (Chewi et al., 2020, Theorem 1), i.e., KL(νT π) exp 2T KL(ν0 π). (10) Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur Recall that π satisfies a Poincaré inequality (Pillaud-Vivien et al., 2020, Definition 1) if for all functions f : Rd R such that f, f L2(π), there exists a constant CP > 0 such that Z f(x)2dπ(x) Z f(x)dπ(x) 2 CP f 2 L2(π). (11) The smallest constant CP for which (11) holds is called the Poincaré constant. The Poincaré inequality is widely used for studying the convergence of Langevin diffusions (Chewi et al., 2024) and χ2 flow (Chewi et al., 2020; Trillos and Sanz-Alonso, 2020). The Poincaré condition is implied by the strong log-concavity of π, and is weaker than strong log-concavity because it also allows for nonconvex potentials. 
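For intuition about the condition (11), here is a quick Monte Carlo sanity check for the standard Gaussian target, whose Poincaré constant is C_P = 1; the test function and sample size are arbitrary choices made for this sketch.

```python
import jax
import jax.numpy as jnp

X = jax.random.normal(jax.random.PRNGKey(0), (100_000,))   # samples from pi = N(0, 1), C_P = 1

f = lambda x: jnp.sin(2.0 * x) + 0.3 * x ** 2               # arbitrary smooth test function
grad_f = jax.vmap(jax.grad(f))

var_f = jnp.var(f(X))                        # Var_pi(f)
dirichlet = jnp.mean(grad_f(X) ** 2)         # E_pi ||grad f||^2
print(var_f, dirichlet, bool(var_f <= dirichlet))   # Poincare: Var_pi(f) <= C_P E_pi ||grad f||^2
```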
The set of probability measures satisfying a Poincaré inequality includes distributions with sub-gaussian tails or with exponential tails. This set is also closed under bounded perturbations and finite mixtures (see Vempala and Wibisono (2019) and Chewi et al. (2024) for a more detailed discussion). Given the interpolation property of Dr MMD to χ2-divergence, (10) suggests investigating the convergence of the Dr MMD flow (µt)t 0 in KL divergence under a Poincaré inequality. To this end, we first derive an upper bound on KL(µt π) along the Dr MMD flow. Theorem 4.1 (KL control of the Dr MMD flow) Suppose k satisfies Assumptions 1 and 2, and the target π and Dr MMD gradient flow (µt)t 0 satisfy the following conditions: 1. π satisfies a Poincaré inequality with constant CP . 2. µt, π Ld. dπ 1 Ran(T r π ) with r > 0, i.e., there exists qt L2(π) such that dµt dπ 1 = T r π qt. (log π) dµt dπ L2(π) Jt and dµt dπ L2(π) It. 5. For all i = 1, . . . , d, lim x h µt,π(x) 2dµt dπ (x) i dµt dπ (x) π(x) 0. Then, for any T 0, KL(µT π) exp 2(1 + λ) CP T KL(µ0 π) + 4(1 + λ)λr Z T 0 exp 2(1 + λ) CP (T t) qt L2(π) (Jt + It)dt. (12) The proof, which can be found in Section 10.8, leverages the fact that the Dr MMD can approximate not only the χ2-divergence, but also its Wasserstein gradient. The Dr MMD s approximation properties can be combined with functional inequalities to obtain an upperbound for the continuous-time dissipation of KL divergence along the flow, given by: d dt KL(µt π) 2(1 + λ) CP KL(µt π) + 4(1 + λ)λr qt L2(π) (Jt + It) | {z } Approximation error from which Theorem 4.1 follows upon applying the Growall s lemma (Gronwall, 1919). The first term is strictly negative, while the second term is an approximation error term arising (De)-regularized Maximum Mean Discrepancy Gradient Flow from Dr MMD not perfectly matching the χ2-divergence for λ > 0. When λ = 0, Theorem 4.1 recovers the exponential decay of KL divergence along χ2 flow in (10). Remark 4.3 (i) The second condition assumes that µt and π have densities. (ii) The third condition that dµt dπ 1 Ran(T r π ) is a regularity condition on the density ratio so that it can be well approximated by the witness function h µt,π. This assumption is known as the range assumption in the literature of kernel ridge regression (Cucker and Zhou, 2007; Fischer and Steinwart, 2020). We posit that this assumption can be relaxed to dµt dπ 1 L2(π) as in Proposition 3.3 if only asymptotic convergence is needed with no explicit rate. (iii) The fourth condition is another regularity condition on the density ratio along the flow. This condition is automatically satisfied under a stronger range condition (r = 1 2) in the third condition, i.e., dµt dπ 1 H, along with a moment condition on the score function log π. To see this, notice that we can further write (derivations are provided in Section 10.8) (log π) dµt K1d qt L2(π) log π L2(π) dµt K2d qt L2(π). (14) As an illustration, we examine the Dr MMD flow under a Gaussian approximation in Appendix C, following Lambert et al. (2022); Liu et al. (2024b), with a Gaussian kernel and Gaussian target π, for which explicit upper bounds It and Jt can be derived. (iv) The fifth condition is a boundary condition that allows integration by parts equality in the proof. We highlight that many works (e.g., Theorem 1 of He et al. (2024), Theorem 2 of Nitanda et al. (2022), Lemma 6 of Vempala and Wibisono (2019)) on Wasserstein gradient flow apply integration by parts without explicitly stating this condition. 
Additionally, if qt L2(π) Q, Jt J , It I for all 0 t T, then Theorem 4.1 will ensure KL convergence of the Dr MMD flow up to a controllable barrier term, e.g. near global convergence. Corollary 4.1 (Near global convergence of the Dr MMD flow) In addition to the assumptions of Theorem 4.1, if qt L2(π) Q, Jt J , It I for all 0 t T, where Q, J , and I are universal constants independent of λ, then for any T 0, KL(µT π) exp 2(1 + λ) CP T KL(µ0 π) + 2λr CP Q (J + I) . The proof of Corollary 4.1 follows directly from upper-bounding the second term of (12) with universal constants Q, J , and I, and using the closed-form expression of the resulting integral. Corollary 4.1 provides a condition under which the Dr MMD flow will exhibit an exponential rate of convergence (linear convergence) in terms of KL divergence up to an extra approximation error term which vanishes as λ 0. If qt L2(π) Q, Jt J , It I for all 0 t T as shown in Corollary 4.1, the approximation error is of explicit order O(λr). Therefore, in the continuous time regime, to have a smaller approximation error, it is beneficial to use a small (de)-regularization parameter λ, so that Dr MMD flow operates Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur closer to the regime of χ2 flow. However, as we will see in the next section, when it comes to time-discretized Dr MMD flow, i.e., Dr MMD gradient descent, there is a trade-offbetween the approximation error and the time discretization error such that the selection of λ would require more careful analysis to strike a good balance. Finally, a smaller Poincaré constant CP results in both a faster rate of convergence and a smaller barrier. Previously, Arbel et al. (2019) established (sublinear) global convergence of the MMD flow in terms of MMD distance by assuming that a Lojasiewicz inequality (or a variant of it if additionally performing noise injection, see Arbel et al., 2018, Proposition 8) holds along the flow. Our result thus complements that of Arbel et al. (2019) by showing that MMD-type functionals can achieve near-global convergence for targets satisfying a Poincaré inequality regardless of whether such inequalities hold, by studying their behavior in the f-divergence interpolation regime. Remark 4.4 Since Dr MMD is asymmetric in its arguments, the reader may wonder why we focus on the gradient flow of Dr MMD( ||π) instead of Dr MMD(π|| ). From the convergence standpoint, Theorem 4.1 shows that Dr MMD( ||π) converges globally with an exponential rate up to a small barrier when π satisfies Poincaré inequality along with an extra regularity condition on the density ratio. This favorable convergence property is no longer true for Dr MMD(π|| ). Practically speaking, the Wasserstein gradient of Dr MMD requires inverting a kernel integral operator. It is more efficient to do this just once with a kernel integral operator with respect to π as in (3), rather than with respect to µt, which would happen at each time (as done by He et al. 2024). Computing the Dr MMD flow is intractable since the dynamics are in continuous-time, and in practice, we do not have access to π, but only samples from it. In the next two sections, we build a tractable approximation of the Dr MMD flow which provably achieves near-global convergence under similar assumptions as the ones in Section 4. Compared to the Dr MMD flow, this approximation combines a discretization in time (introduced in Section 5) with a particle-based space-discretization (introduced in Section 6). 
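Before moving to the discrete-time scheme, the continuous-time picture can be summarised numerically by evaluating the right-hand side of Corollary 4.1: exponential decay plus an O(λ^r) barrier, so in continuous time smaller λ is always better. The constants below are placeholders chosen for illustration, not values derived in the paper.

```python
import jax.numpy as jnp

def kl_bound(lam, T, kl0=1.0, C_P=1.0, Q=1.0, J=1.0, I=1.0, r=0.5):
    """Right-hand side of Corollary 4.1: exponential decay plus an O(lam^r) barrier.
    All constants are illustrative placeholders, not values computed in the paper."""
    rate = jnp.exp(-2.0 * (1.0 + lam) * T / C_P) * kl0
    barrier = 2.0 * lam ** r * C_P * Q * (J + I)
    return rate + barrier

for lam in [1.0, 1e-1, 1e-2, 1e-3]:
    print(lam, kl_bound(lam, T=10.0))   # in continuous time, smaller lam gives a smaller barrier
```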
While the space-discretization techniques that we used are well-known in the Wasserstein gradient flow literature, our time-discretized scheme deviates from standard approaches, which are insufficient to guarantee near-global convergence in our case. 5. Time-discretized Dr MMD flow In this section, we first construct and analyze the forward Euler scheme of (7), a simple time-discretization of the Dr MMD flow which we call Dr MMD Descent. Compared to the Dr MMD flow, the convergence of Dr MMD descent is affected by an additional smoothnessrelated time-discretization error that blows up as λ approaches 0, thus preventing near-global convergence. To address this issue, we propose in Section 5.2 an alternative discrete-time scheme that adapts the value of the regularization coefficient λ across the descent iterates, which we call Adaptive Dr MMD Descent. We show that this scheme converges in the KL divergence up to a barrier term that vanishes as the discretization step size goes to zero. 5.1 Dr MMD Descent The forward Euler discretization of the Dr MMD flow (or Dr MMD Descent in short) with (De)-regularized Maximum Mean Discrepancy Gradient Flow step size γ > 0 consists of sequence of probabilities (µn)n N defined by the recursion µn+1 = I γ(1 + λ) h µn,π # µn, µn=0 = µ0, (15) where h µn,π = 2 (Σπ + λI) 1 (mµn mπ) is the Dr MMD witness function. This scheme was previously considered in Arbel et al. (2019); in particular, Proposition 4 of Arbel et al. (2019) shows that the discrete-time MMD dissipation rate along the MMD Descent iterates follows the (continous time) rate of the MMD flow up to an error term proportional to the step size γ and the smoothness parameters of the problem5. Next, we turn to study the convergence of Dr MMD descent. One way is to treat Dr MMD as MMD with a regularized kernel k and follow Proposition 4 of Arbel et al. (2019), however this does not quantitatively take into account the role of (de)-regularization parameter λ that balances the trade-offbetween geodesic convexity and smoothness of FDr MMD. Instead, we adopt the same strategy of Section 4 that exploits the interpolation property of Dr MMD towards χ2-divergence: in the following proposition, we study the dissipation of KL divergence along the Dr MMD Descent when the target π satisfies a Poincaré inequality, in which the role of (de)-regularization parameter λ is highlighted in the approximation and discretization errors. Proposition 5.1 (Descent lemma in KL) Suppose k satisfies Assumption 1 and 2, and suppose the target π and Dr MMD gradient descent iterates (µn)n N satisfy the following: 1. π satisfies a Poincaré inequality with constant CP and its potential is β-smooth, i.e., π exp( V ) with HV βI. 2. µn, π Ld. dπ 1 Ran(T r π ) with r > 0, i.e., there exists qn L2(π) such that dµn dπ 1 = T r π qn. 4. qn L2(π) Q, V dµn dπ L2(π) J , dµn dπ L2(π) I for all n = 1, , nmax. 5. For all i = 1, . . . , d, lim x h µn,π(x) 2dµn dπ (x) i dµn dπ (x) π(x) 0. 6. There exists a constant 1 < ζ < 2 such that for all n = 1, . . . , nmax, the step size γ satisfies 2ζ(1 + λ) q χ2 (µn π) K2d Then for all 0 n nmax and 0 < λ 1, KL(µn+1 π) KL(µn π) 2 CP χ2(µn π)γ + 4γλr Q (J + I) | {z } Approximation error + 8γ2(β + ζ2)χ2(µn π)K1d + K2d λ | {z } Discretization error 5. In the MMD flow case, the main smoothness parameters is the Lipschitz constant of the kernel. Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur The proof is provided in Section 10.9. 
Conditions 1-5 of Proposition 5.1 are similar to conditions 1-5 of Theorem 4.1, which assumes a functional inequality on the target π and regularity on the density ratio dµn dπ 1. In a similar spirit to Theorem 4.1, the fourth regularity condition is automatically satisfied under a stronger range assumption on dµn dπ 1. For the sake of brevity, we directly assume uniform upper bounds in the fourth condition rather than writing it out in a separate corollary like Corollary 4.1. Compared to the continuous time regime, two extra conditions are necessary. The first one is a smoothness condition on the potential, HV βI, which is commonly used in the convergence analysis of discrete-time Langevin-based samplers (Dalalyan, 2017; Dalalyan and Karagulyan, 2019; Durmus et al., 2019; Dalalyan et al., 2022; Vempala and Wibisono, 2019). It can be relaxed to V being Hölder-continuous with exponent s [0, 1] (Chatterji et al., 2020). The second one, (16), is an upper bound on the step size γ, aligning with the principle that step size should be small enough for the discrete-time scheme to inherit the properties of its continuous analog. This condition will be more thoroughly discussed in Remark 5.4 when all the conditions on γ in Theorem 5.1 and Theorem 5.2 are presented. Remark 5.1 (Approximation-discretization trade-offof Dr MMD Descent) If we compare the discrete-time KL dissipation of (17) with its continuous-time counterpart in (13), we see that the first two terms on the RHS of (17) admit continuous-time analogs present in (13). The discrete-time KL dissipation contains an additional (positive) term representing the time discretization error: unlike the approximation error term that vanishes as λ approaches 0, this term actually diverges as λ approaches 0. Therefore, replicating the arguments of the continuous-time result of Theorem 4.1 in the discrete-time regime would yield a barrier that does not vanish as λ 0, hinting at a trade-offbetween approximation and discretization similar to that of Remark 4.1. In the next section, we propose a refined adaptive discrete-time descent scheme that addresses the convergence issues. 5.2 Adaptive Dr MMD Descent The Dr MMD flow and descent dynamics are defined for a value of λ that remains fixed throughout time. The KL dissipation provided in Proposition 5.1 is a function of λ, however; thus, to obtain a sequence of measures with better convergence guarantees than the Dr MMD descent, we now construct and analyze a sequence of iterates obtained by selecting, at each iteration, the value of λ minimizing the sum of the approximation error and time-discretization error presented in (17). This sequence, which we term Adaptive Dr MMD descent, is given by µn+1 = I γ(1 + λ) h µn,π # µn, λn = 2γχ2(µn π)(β + ζ2)(K1d + K2d) 1 r+1 . (18) The particular choice of λn above minimizes the sum of the approximation error and timediscretization error presented in (17). This optimal choice indicates that λn should shrink towards 0 as χ2(µn π) decreases along Dr MMD gradient descent: at the early stages of the scheme, it is desirable to have a larger λn, corresponding to a smoother objective functional and enabling larger step sizes; then as Dr MMD gradient descent iterates µn get closer to π, a smaller λn enables the scheme to operate closer to the χ2 flow regime, which metrizes a stronger topology and can better witness the difference between µn and π. 
Additionally, (De)-regularized Maximum Mean Discrepancy Gradient Flow as the potential V becomes less smooth, i.e., β gets larger, then λn should also increase to account for the loss of smoothness from V . Unlike the Dr MMD descent in Section 5.1, since the Wasserstein gradient updating µn comes from a different Dr MMD at each iteration, the adaptive scheme of (18) constitutes a significant departure from related works in Wasserstein gradient descent (Glaser et al., 2021; Arbel et al., 2019; Korba et al., 2021; Hertrich et al., 2024b, 2023; Chewi et al., 2020). Remark 5.2 (Adaptive kernel) Recent applications of MMD-based generative modeling algorithms with adaptive kernels (in particular, time-dependent kernel hyperparameters) demonstrate improved empirical performance over fixed kernels in both Wasserstein gradient flow on MMD (Galashov et al., 2025) and generative adversarial networks with an MMD critic (Li et al., 2017; Arbel et al., 2018): the latter can be related to gradient flow on the critic where µn is restricted to the output of a generator network (see e.g. Franceschi et al., 2024). As the Dr MMD is an MMD with a regularized kernel k that depends on λn, the Adaptive Dr MMD Descent thus falls into the former category. Galashov et al. (2025, Proposition 3.1) demonstrates faster convergence for MMD gradient flow with an adaptive kernel, for the parametric setting of Gaussian distributions π and µt. Our analysis is the first to prove theoretically that adaptive kernels can result in improved convergence for more general nonparametric settings. We believe that the theoretical analysis of adaptivity by varying other hyperparameters (such as the kernel bandwidth for RBF kernels) remains an interesting avenue for future work. By leveraging the quasi-descent lemma in KL divergence in Proposition 5.1, we are able to establish the following theorem, which provides a near-global convergence result of the Adaptive Dr MMD gradient descent iterates in KL divergence. Theorem 5.1 (Near-global convergence of adaptive Dr MMD gradient descent) Suppose k satisfies Assumption 1 and 2 and K 1, and suppose the target π and adaptive Dr MMD gradient descent iterates (µn)n N satisfy the following conditions: 1. π satisfies a Poincaré inequality with constant CP and its potential is β-smooth, i.e. π exp( V ) with HV βI. 2. µn, π Ld. dπ 1 Ran(T r π ) with r > 0, i.e., there exists qn L2(π) such that dµn dπ 1 = T r π qn. 4. qn L2(π) Q, V dµn dπ L2(π) J , dµn dπ L2(π) I for all n = 1, , nmax. 5. For all i = 1, . . . , d, lim x h µn,π(x) 2dµn dπ (x) i dµn dπ (x) π(x) 0. 6. There exists a constant 1 < ζ < 2 such that the step size γ satisfies ζ 1 Q2K2d CP Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur Then by taking (de)-regularization parameter λn = 2γχ2(µn π)(β+ζ2)(K1d+K2d) Q(J +I) 1 r+1 1, we have KL(µnmax π) exp 2nmaxγ + 4γ r r+1 CP Q 2r+1 r+1 (K1d + K2d)(β + ζ2) r r+1 (J + I) 1 r+1 . (20) The proof can be found in Section 10.10. The conditions 1-5 of Theorem 5.1 are the same as conditions 1-5 of Proposition 5.1. Since the RHS of (19) are all constants, condition 6 is satisfied when the step size γ is small enough. The implication of Theorem 5.1 is that Dr MMD gradient descent exhibits an exponential rate of convergence (linear convergence) in terms of KL divergence up to an extra barrier of order O(γ r r+1 ). The barrier term shows up as the result of picking the optimal regularization parameter λn that best trades offthe approximation error and discretization error in Proposition 5.1. 
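The approximation/discretization trade-off behind the schedule (18) and the O(γ^{r/(r+1)}) barrier can also be checked numerically: the sketch below grid-searches the per-step error of Proposition 5.1 over λ (an approximation term of order γλ^r plus a discretization term of order γ²/λ) and compares the minimiser with the closed-form schedule, which matches it up to constant factors. All constants are illustrative placeholders.

```python
import jax.numpy as jnp

# Illustrative placeholder constants (not values derived in the paper).
gamma, beta, zeta, r = 1e-2, 1.0, 1.5, 0.5
K1d, K2d, Q, J, I = 1.0, 1.0, 1.0, 1.0, 1.0
chi2_n = 0.5                                   # stand-in for chi^2(mu_n || pi)

def per_step_error(lam):
    approx = 4.0 * gamma * lam ** r * Q * (J + I)                                   # approximation term
    discret = 8.0 * gamma ** 2 * (beta + zeta ** 2) * chi2_n * (K1d + K2d) / lam    # discretization term
    return approx + discret

lams = jnp.logspace(-6, 0, 2000)
lam_grid = lams[jnp.argmin(per_step_error(lams))]

# Closed-form schedule (18); it matches the grid-search minimiser up to constant factors.
lam_closed = (2.0 * gamma * chi2_n * (beta + zeta ** 2) * (K1d + K2d) / (Q * (J + I))) ** (1.0 / (r + 1.0))
print(lam_grid, lam_closed)
```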
Theorem 5.1 is reminiscent of the convergence result of the Langevin Monte Carlo sampling algorithm, whose KL divergence also decreases exponentially up to an extra barrier, but of order O(γ) (Vempala and Wibisono, 2019, Theorem 2). Unlike the continuous-time result of Theorem 4.1, in which the barrier can be made arbitrarily small by taking small enough regularization, taking the step size γ in Theorem 5.1 to be arbitrarily small will significantly impact the rate of convergence, even though it is exponential in terms of nmax. By making the step sizes adaptive with the number of iterations and imposing an extra condition, the barrier term actually vanishes, as demonstrated in the following theorem. Theorem 5.2 (Global convergence of Dr MMD gradient descent) Suppose that k satisfies Assumptions 1 and 2, and that the conditions in Theorem 5.1 on Dr MMD gradient descent iterates (µn)n N, target distribution π, regularization coefficient λn and step size γn are satisfied. If additionally, the step size γn satisfies γn 1 (K1d + K2d)(β + ζ2) r χ2(µn π) 1 r , (21) for all n = 1, , nmax, then KL(µnmax π) KL(µ0 π). (22) The proof can be found in Section 10.11. Compared with (20), (22) provides a cleaner upper bound without the barrier term and leads to global convergence. Remark 5.3 (Iteration complexity) We now turn to analyze the iteration complexity of Dr MMD gradient descent from Theorems 5.1 and 5.2. (i) From Theorem 5.1, given an error threshold δ > 0, Dr MMD descent would reach KL(µnmax π) δ after nmax CP 2γ log KL(µ0 π) δ) iterations. By comparison, when π satisfies a Poincaré inequality, Langevin Monte Carlo (LMC) has an iteration (De)-regularized Maximum Mean Discrepancy Gradient Flow complexity of O(1 δ) up to logarithmic terms (Chewi et al., 2024, Theorem 7). As an approximation to the χ2 flow, the iteration complexity of Dr MMD flow is worse than LMC, because at each iterate of Dr MMD flow, there is an extra approximation error in addition to time-discretization error (see our Proposition 5.1). With the optimal choice of regularization λ that balances these two errors, the overall per-step error is of order O(γ1+ r r+1 ). In contrast, at each iteration, LMC only incurs the time-discretization error, which is of order O(γ2), smaller than that of Dr MMD flow. As a result, LMC exhibits better iteration complexity than our Dr MMD flow; however, LMC requires knowledge of the score of π, while Dr MMD flow only requires samples from π. (ii) From Theorem 5.2, we consider two cases. On the one hand, if there exists a threshold N0 such that χ2(µn π) n r holds for all n N0, then we select step size γn CP n 1 for all n N0 such that both (19) and (21) are satisfied, and consequently Qnmax n=N0 1 1 CP γn = 0 so the iteration complexity of Dr MMD gradient descent is O 1 δ . On the other hand, if such a threshold N0 does not exist, then there exists a subsequence n1, n2, . . . , n S, . . . such that χ2(µns π) n r s for all s 1. Since KL divergence is smaller than χ2-divergence (Van Erven and Harremos, 2014), we have KL(µns π) n r s for all s 1. Notice that KL divergence is monotonically decreasing based on (63), we have KL(µn π) n r s for all ns n ns+1 so that limn KL(µn π) = 0. Unfortunately, we are not able to derive iteration complexity in this case because the growth rate of {ns}s 1 is unknown. Remark 5.4 (Step size γ) Theorem 5.1 imposes a condition on the step size γ in (19) and Theorem 5.2 imposes an additional condition in (21). 
These conditions subsume the condition (16) on step size in Proposition 5.1. (See derivations in Section 10.10.) The conditions (19) and (21) become more stringent as the potential V becomes less smooth, i.e., when β gets larger, similar to the analysis in Langevin Monte Carlo (Balasubramanian et al., 2022; Vempala and Wibisono, 2019) and Stein Variational Gradient Descent (Korba et al., 2020). The condition also becomes more stringent as the density ratio becomes less regular, i.e., when r gets closer to 0 and Q, J , I get larger, similar to He et al. (2024). The adaptive Dr MMD descent schemes of (15) and (18) defined via push-forward operations can be equivalently expressed by the following update scheme that defines a trajectory of samples (yn)n N whose distributions are precisely the Adaptive Dr MMD descent iterates (µn)n N, yn+1 = yn γ(1 + λn) h µn,π(yn), y0 µ0. (23) Unfortunately, (23) is still intractable in practice because h µn,π depends on the unknown distribution µn. Therefore, an additional discretization in space is needed to approximate (23) within a tractable algorithm. We propose to do so in the next section through a system of interacting particles, i.e., Dr MMD particle descent. 6. Dr MMD particle descent Suppose we have M samples from the target distribution {x(i)}M i=1 π and N samples from the initial distribution {y(i) 0 }N i=1 µ0. The Dr MMD particle descent is defined as: Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur y(i) n+1 = y(i) n γ(1 + λn) h ˆµn,ˆπ(y(i) n ), (24) where ˆµn = 1 N PN i=1 δy(i) n and ˆπ = 1 M PM i=1 δx(i) denote respectively the empirical distribution of the particles at time step n and the target, and where h ˆµn,ˆπ = 2 (Σˆπ + λn I) 1 (mˆµn mˆπ). Unfortunately, the KL divergence is ill-defined on empirical distributions ˆµn, which means the analysis of Theorem 5.1 is no longer applicable to study the convergence of Dr MMD particle descent. Therefore, in the next theorem, we instead resort to the Wasserstein-2 distance to analyze the convergence of Dr MMD particle descent. For simplicity of presentation below, we assume K 1. Theorem 6.1 Suppose that k satisfies Assumptions 1 and 2 with K 1, and that all the conditions in Theorem 5.1 on Dr MMD gradient descent iterates (µn)n N, target distribution π, regularization coefficient λn and step size γ are satisfied. In addition, suppose (µn)n N has bounded fourth moment and the target π satisfies a Talagrand-2 inequality with constant CT . Let the number of samples M, N satisfy 1 min i=1,...,nmax KL(µi π)Z 1 8nmaxγ r r+1 R min i=1,...,nmax KL(µi π)Z 1 r+1 1 8nmaxγ r r+1 R min i=1,...,nmax KL(µi π)Z 1 r+1 1 2r+2 , (25) where means up to constants, R = K1d + KK2d is a constant that only depends on the kernel, and Z is a constant that only depends on β, ζ, K1d, K2d, Q, J , I. Then we have E [W2 (ˆµnmax, π)] p 2CT exp nmaxγ KL (µ0 π) + O γ r 2r+2 , where the expectation is taken over initial samples {y(i) 0 }N i=1 drawn from µ0. The proof is provided in Section 10.12. Theorem 6.1 shows that for sufficiently large sample size M and N, Dr MMD particle descent exhibits an exponential rate of convergence (linear convergence) in terms of Wasserstein-2 distance up to an extra barrier of order O γ r 2r+2 . Remark 6.1 (Talagrand-2 inequality) We say that the target distribution π satisfies a Talagrand-2 inequality with constant CT if for any ν P2(Rd), 2CT KL(µ π). 
A Talagrand-2 inequality implies the Poincaré inequality in (11) with constant CP CT , so the condition that π satisfies a Talagrand-2 inequality is stronger than the condition in (De)-regularized Maximum Mean Discrepancy Gradient Flow Theorems 4.1 and 5.1. A Talagrand-2 inequality allows linking of the two key components of Theorem 6.1: the population convergence in terms of KL divergence proved in Theorem 5.1, and the finite-particle propagation of chaos bound in terms of Wasserstein-2 distance proved in Proposition 10.1. Talagrand inequality is widely used in finite-particle convergence analysis of Wasserstein gradient flows (Shi and Mackey, 2024). Remark 6.2 (Iteration and sample complexity) The choice of γ = O δ 2r+2 E [W2 (ˆµnmax, π)] δ with an iteration complexity of nmax = O 1 δ , which equals the square of the iteration complexity in Theorem 5.1 because Wasserstein-2 distance is of the same order as the square root of KL divergence. From Theorem 5.1, we have min i=1,...,nmax KL(µi π) KL(µnmax π) = O δ2 . Therefore, from (25), the sample complexity is at least M = poly exp δ 2 r 2 r+1 , N = poly exp δ 2 r 2 r+1 O δ d 4 . The poly-exponential sample complexity originates from the propagation of chaos in interacting particle systems (Kac, 1956). Although O δ d 4 is subsumed by the poly-exponential term, we still make it explicit to show that N suffers from the curse of dimensionality as expected from the Wasserstein-2 distance between an empirical distribution and a continuous distribution (Kloeckner, 2012; Lei, 2020). For comparison, the sample complexity of SVGD in Theorem 3 of Shi and Mackey (2024) is O(exp exp δ 2 ). Similar results have also been established for mean-field Langevin dynamics (Suzuki et al., 2023; Chen et al., 2024). Recently, Banerjee et al. (2025) proposed a refined finite-particle analysis of Stein Variational Gradient Descent (SVGD), which gives a uniform-in-time convergence bound, i.e., the bound does not blow up exponentially fast as the number of iterations nmax grows. The analysis of Banerjee et al. (2025), however, heavily relies on the relation that the time derivative of the KL divergence in the course of SVGD equals the squared kernel Stein discrepancy (KSD), a fact which does not hold for our Dr MMD flow. Recently, Chen et al. (2025) extended that analysis to finite-particle MMD gradient descent, but their analysis requires noise injection. We leave a more refined analysis of our finite-particle convergence result to future work. Having established the convergence of Dr MMD gradient flow/descent, we next show that Dr MMD particle descent admits a closed-form implementation. Proposition 6.1 shows that h ˆµn,ˆπ in (24), defined through the inverse of covariance operators, is computable using Gram matrices. Proposition 6.1 Given empirical distributions ˆµn = 1 N PN i=1 y(i) n , ˆπ = 1 M PM i=1 x(i) and Gram matrices Kxx = k(x1:M, x1:M) RM M and Kxy = k(x1:M, y1:N n ) RM N, the witness function h ˆµn,ˆπ can be computed as: h ˆµn,ˆπ( ) = 2 Nλn k( , y1:N)1N 2 Mλn k( , x1:M)1M 2 Nλn k( , x1:M)(Mλn I + Kxx) 1Kxy1N + 2 Mλn k( , x1:M)(Mλn I + Kxx) 1Kxx1M, (26) where 1M RM, 1N RN are column vectors of ones. Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur Algorithm 1 Dr MMD particle descent Input: Target samples {x(i)}M i=1 π and initial source samples {y(i) 0 }N i=1 µ0. Hyperparameters: step size γ, initial (de)-regularization coefficient λ0, maximum number of iterations nmax and regularity r. For n = 0 to nmax: 1. 
Compute the witness function h^{*}_{\hat\mu_n,\hat\pi} from (26). 2. Compute Dr MMD(\hat\mu_n‖\hat\pi) with h^{*}_{\hat\mu_n,\hat\pi} from (71). 3. Rescale the regularization coefficient: λ_n ∝ Dr MMD(\hat\mu_n‖\hat\pi)^{1/(r+1)}. 4. Update the particles using (24): y^{(i)}_{n+1} = y^{(i)}_n − γ(1+λ_n)∇h^{*}_{\hat\mu_n,\hat\pi}(y^{(i)}_n). End For. Output: {y^{(i)}_{n_max}}_{i=1}^N.

The proof of Proposition 6.1 can be found in Section 10.14. The gradient of h^{*}_{\hat\mu_n,\hat\pi} can be obtained using automatic differentiation libraries such as JAX (Bradbury et al., 2018). As indicated in (18), the regularization parameter λ_n should be chosen proportional to χ²(µ_n‖π)^{1/(r+1)}. In practice, however, neither χ²(µ_n‖π) nor r is accessible, and χ²(µ_n‖π) is not even well-defined for the particle descent algorithm. To address this, we use Dr MMD(\hat\mu_n‖\hat\pi), which admits a closed-form expression with particles, as a proxy for χ²(µ_n‖π) (see Algorithm 1). The parameter r is picked via a search over the pre-defined set {0.1, 0.5, 1.0}. The step size γ should be chosen to satisfy the upper bound (21), which contains several constants that cannot generally be computed; in practice, our approach has been to select γ sufficiently small for the flow to converge empirically. The final algorithm is summarized in Algorithm 1.

At every iteration, computing h^{*}_{\hat\mu_n,\hat\pi} with adaptive regularization λ_n has a time complexity of O(M³ + NM + N²) due to matrix inversion and multiplication. For Dr MMD particle descent with fixed λ, however, the total per-iteration cost can be reduced to O(NM + N²), which is exactly the same as for MMD flow, because the inversion of the M × M Gram matrix is only required once and can therefore be pre-computed at initialization (see Remark 4.4). In contrast, when N = M, the complexity of the Sinkhorn flow is O(N²/ε³) (Feydy et al., 2019), with ε the hyperparameter of the Sinkhorn divergence, and the complexity of the KALE flow is O(N³) (Glaser et al., 2021).

7. Related Work

In this section, we discuss the works in the literature that are related to our proposed Dr MMD flow and spectral (de)-regularization.

7.1 Gradient flows

Stein Variational Gradient Descent (SVGD) is a popular algorithm for sampling from distributions using only an unnormalized density. It can be written either as a gradient flow of the Kullback-Leibler (KL) divergence in which the Wasserstein gradient of the KL is preconditioned by T_{µ_t} (Liu and Wang, 2016; Liu, 2017; Korba et al., 2020), or as a gradient flow of the χ²-divergence whose Wasserstein gradient is preconditioned by T_π (Chewi et al., 2020),
\[
\partial_t\mu_t \;=\; \nabla\cdot\!\Big(\mu_t\, T_{\mu_t}\nabla\log\tfrac{d\mu_t}{d\pi}\Big)
\;=\; \nabla\cdot\!\Big(\mu_t\, \nabla T_{\pi}\tfrac{d\mu_t}{d\pi}\Big).
\]
Since SVGD may smooth the trajectory too much, He et al. (2024) considered a (de)-regularized SVGD flow,
\[
\partial_t\mu_t \;=\; \nabla\cdot\!\Big(\mu_t\,(T_{\mu_t}+\lambda I)^{-1}T_{\mu_t}\nabla\log\tfrac{d\mu_t}{d\pi}\Big),
\tag{27}
\]
which approaches the KL gradient flow (Langevin diffusion) as λ → 0, and which demonstrates faster convergence than SVGD. A key difference between the (de)-regularization in (27) of He et al. (2024) and our Dr MMD flow is that the flow in (27) is driven by a regularized version of the Wasserstein gradient of the KL divergence, while the Dr MMD flow is driven by the Wasserstein gradient of the regularized χ²-divergence. An alternative interpretation of this difference is that the flow in (27) is the gradient flow of the KL divergence w.r.t. the regularized Stein geometry (Duncan et al., 2023), whereas the Dr MMD flow is the gradient flow of the regularized χ²-divergence w.r.t. the Wasserstein geometry.
In addition to sampling from unnormalized distributions, Wasserstein gradient flows (particularly MMD flows) are widely used in the field of generative modelling (Birrell et al., 2022; Gu et al., 2024; Hertrich et al., 2023, 2024b,a; Galashov et al., 2025). The MMD flow (with a smooth kernel) (Arbel et al., 2019) can be written as
\[
\partial_t\mu_t \;=\; 2\,\nabla\cdot\!\Big(\mu_t\,\nabla T_{\pi}\Big(\tfrac{d\mu_t}{d\pi}-1\Big)\Big).
\tag{28}
\]
The MMD flow is known to get trapped in local minima, and several modifications have been proposed to avoid this in practice, such as noise injection (see Arbel et al., 2019, Proposition 8) or non-smooth kernels, e.g., based on negative distances (Sejdinovic et al., 2013). MMD gradient flows with non-smooth kernels have better empirical performance (Hertrich et al., 2024a,b) and have no local minima apart from the global one (Boufadène and Vialard, 2024), but they do not preserve discrete measures and rely on approximating implicit time discretizations (Hertrich et al., 2024a) or on slicing (Hertrich et al., 2024b). Recall that our Dr MMD flow takes the form
\[
\partial_t\mu_t \;=\; 2(1+\lambda)\,\nabla\cdot\!\Big(\mu_t\,\nabla (T_{\pi}+\lambda I)^{-1}T_{\pi}\Big(\tfrac{d\mu_t}{d\pi}-1\Big)\Big),
\]
which (de)-regularizes the MMD flow similarly to how (27) (de)-regularizes SVGD. It is both proved theoretically, in Theorems 4.1, 5.1 and 6.1, and verified empirically, in Section 8, that this (de)-regularization results in faster convergence than the MMD flow. Another closely related flow, called LAWGD, is considered in Chewi et al. (2020); it swaps the gradient and integral operators of SVGD, leading to the following flow:
\[
\partial_t\mu_t \;=\; \nabla\cdot\!\Big(\mu_t\, T_{\pi}\nabla\tfrac{d\mu_t}{d\pi}\Big).
\]
LAWGD closely resembles the MMD flow in (28), but Chewi et al. (2020) propose to replace T_π with an inverse diffusion operator, which requires computing the eigenspectrum of the latter and is unlikely to scale to high dimensions. KALE (Glaser et al., 2021) kernelizes the variational formulation of the KL divergence in a similar way as Dr MMD kernelizes the χ²-divergence in (5), but the KALE witness function does not have a closed-form expression; it requires solving a convex optimization problem, which makes the simulation of the KALE gradient flow with particles computationally more expensive. Recently, Stein et al. (2025) studied kernelized variational formulations of f-divergences (referred to as Moreau envelopes of f-divergences in an RKHS), which subsume both KALE (Glaser et al., 2021) and Dr MMD. They prove that these functionals are lower semi-continuous and that their Wasserstein gradient flows are well-defined for smooth kernels. They do not study the convergence properties of their proposed flows, however.

The proposed Dr MMD interpolates between MMD and the χ²-divergence by using a specific spectral regularization known as Tikhonov regularization, (1+λ)(T+λI)^{-1}T, which interpolates between the identity operator I as λ → 0 and T as λ → ∞. Dr MMD and its associated Wasserstein gradient flow can easily be extended to other spectral regularization strategies, such as Showalter regularization, Landweber iteration, or cutoff regularization (Engl et al., 1996), using the techniques in Bauer et al. (2007) and Hagrass et al. (2024a). Alternative approximations to the χ²-divergence have been proposed in the literature based on the idea of mollifiers, for which Wasserstein gradient flows have been constructed and their convergence analyzed (Li et al., 2023; Craig et al., 2023, 2025). Compared to the Dr MMD flow, these gradient flows rely on additional approximations, such as the use of log-sum-exp in Li et al. (2023) and the use of numerical integration to estimate convolution in Craig et al.
(2023, 2025), and are not directly applicable in generative modeling settings where only samples are available.

7.2 Comparison with diffusion-based generative models

Diffusion-based generative models have been widely adopted in practice and are closely related to Wasserstein gradient flows (Song et al., 2021; Ho et al., 2020). These models generate high-quality samples by reversing a pre-defined forward diffusion process, which gradually corrupts data with noise. To implement the reverse process, the established practice is to estimate the score function via denoising score matching (Song et al., 2021). In contrast, Wasserstein gradient flows directly construct a trajectory by descending the objective in the steepest direction with respect to the Wasserstein metric. In particular, our proposed Dr MMD gradient flow offers a tractable velocity field with a consistent finite-sample estimator, without solving an additional optimization problem such as score matching. We emphasize that the main contribution of our paper is to establish convergence of the Dr MMD flow, and that diffusion models for image generation require additional implementation details, most notably the inductive biases introduced by deep neural networks. One possible avenue for future work is to simulate Dr MMD gradient flows with kernels induced by learned deep neural network features on the data. In the case of MMD, this idea has been explored by Galashov et al. (2025), who demonstrate generation performance comparable to established diffusion models. Another promising direction, proposed by Hertrich et al. (2024b), involves first using MMD gradient flow with Riesz kernels to generate particle trajectories, and then distilling these trajectories into a neural network-based generator. Both approaches would be of interest for extending our Dr MMD gradient flow to domains such as image generation, as a topic for future work.

7.3 (De)-regularization for supervised learning and hypothesis testing

The idea of (de)-regularization is not new, and has been used in kernel Fisher discriminant analysis (Mika et al., 1999) and kernel ridge regression (Caponnetto and De Vito, 2007; Schölkopf and Smola, 2002). Subsequently, Harchaoui et al. (2007) employed this idea in two-sample testing, where they constructed a test statistic that (de)-regularizes MMD(µ‖π) with both covariance operators Σ_µ and Σ_π. This work has recently been generalized in Hagrass et al. (2024a) to more general spectral regularizations. A (de)-regularized statistic is also employed by Balasubramanian et al. (2021) and Hagrass et al. (2024b) in the context of goodness-of-fit testing; Balasubramanian et al. (2021) refer to the (de)-regularized MMD as the "Moderated MMD". To the best of our knowledge, the present work represents the first instance of the (de)-regularized MMD being used as a distance functional in a Wasserstein gradient flow. By only (de)-regularizing with Σ_π, Dr MMD approaches the χ²-divergence in the λ → 0 limit, a crucial property that is exploited in the proofs of the convergence results of Theorems 4.1 and 5.1.

8. Experiments

In this section, we demonstrate the superior empirical performance of the proposed Dr MMD descent in various experimental settings.

8.1 Three ring experiment

We follow the experimental set-up of Glaser et al. (2021), in which the target distribution π is defined on a manifold in R² consisting of three non-overlapping rings.
The initial source distribution µ_0 is a Gaussian distribution concentrated in the vicinity of the first ring. In this setting, all f-divergence gradient flows, including Langevin diffusion, are ill-defined, because the initial source µ_0 is not absolutely continuous with respect to the target π. Nevertheless, as one of the baseline methods, we simulate the χ² flow with an existing implementation of Liu et al. (2024a), which estimates the velocity field with a local linear estimator. In contrast, kernel-based gradient flows, namely the MMD, KALE, and Dr MMD flows, are well-defined in this setting; the MMD and KALE flows serve as the remaining baselines. We draw N = M = 300 samples from the initial source and the target distributions and run Dr MMD descent with adaptive λ for n_max = 100,000 iterations, at which point all methods have converged. As in Glaser et al. (2021), we use a Gaussian kernel k(x, x′) = exp(−0.5‖x − x′‖²/l²) with bandwidth l = 0.3. The step size for MMD descent is γ = 10^{−2}, and the step size for KALE and Dr MMD descent is γ = 10^{−3}. We enforce a positive lower bound of 10^{−3} on λ for numerical stability, and the regularity hyperparameter r is optimized over the set {0.1, 0.5, 1.0}.

From Figure 2, Left and Middle, we see that Dr MMD descent outperforms MMD, KALE, and χ² descent in terms of both dissimilarity metrics with respect to the target π: MMD and the Wasserstein-2 distance.

Figure 1: Animation of MMD, KALE, χ² and Dr MMD gradient descent on the three-ring dataset (snapshots at T = 0, 2, 30, 99).

Figure 1 visualizes the evolution of particles under these descent schemes, and demonstrates that both KALE and Dr MMD descent are sensitive to the mismatch of support and stay concentrated in the support of the target π, while particles of MMD descent can diffuse outside the support of π. Note that "χ²" denotes an alternative estimate of the χ²-divergence due to Liu et al. (2024a): being an f-divergence, we would expect "χ² descent" to match the support of the target (as KALE and Dr MMD do). This is not the case, owing to bias in the velocity field learned from samples. Compared to KALE descent, Dr MMD descent does not suffer from the numerical approximation error of the optimization routine used to obtain the KALE velocity field, which explains its improved performance.

8.2 Gradient flow for training student/teacher networks

Next, we consider a large-scale setting following Arbel et al. (2019), where a student network is trained to imitate the outputs of a teacher network. We consider a two-layer neural network of the form ψ(z, x) = G(b₁ + W₁ σ(W₀ z + b₀)), where σ is the ReLU non-linearity and x ∈ R^d is the concatenation of all network parameters b₁, W₁, b₀, W₀. Here G : R → R is an element-wise non-linear function, G(u) = exp(−u²/4). The teacher network is of the form Ψ_T(z, π) = ∫ψ(z, x)dπ(x), where π denotes the teacher distribution, and the student network is Ψ_S(z, µ) = ∫ψ(z, x)dµ(x), where µ denotes the student distribution. Here we consider Gaussian-distributed µ and π for simplicity.

Figure 2: Left & Middle: Comparison of Dr MMD descent with adaptive λ, MMD and KALE descent on three-ring synthetic data in terms of MMD and Wasserstein-2 distance with respect to the target π.
Right: Comparison of MMD descent and Dr MMD descent, each with and without noise injection, on training student/teacher networks in terms of validation MMD².

The student network can imitate the behavior of the teacher network by minimizing the objective⁶
\[
\min_{\mu\in\mathcal{P}_2(\mathbb{R}^d)}\; \mathbb{E}_{z\sim P_{\mathrm{data}}}\big[\big(\Psi_T(z,\pi)-\Psi_S(z,\mu)\big)^2\big],
\tag{29}
\]
where P_data is the distribution of the input data.

6. Note that our setting is slightly different from Chizat and Bach (2018), in which µ, π are measures over the hidden neurons, while our setting follows Arbel et al. (2019), in which µ, π are measures over all the network parameters.

If we define the kernel as the inner product of the neural network feature maps, k(x, x′) = E_{z∼P_data}[ψ(z, x)ψ(z, x′)], then the objective in (29) can be equivalently expressed as
\[
\min_{\mu\in\mathcal{P}_2(\mathbb{R}^d)}\; \iint k(x,x')\,d(\pi-\mu)(x)\,d(\pi-\mu)(x'),
\]
which is precisely MMD(µ‖π)² under the kernel k. Since G(u) = exp(−u²/4), the kernel is bounded and so the MMD is well-defined. Moreover, since the kernel k has bounded first- and second-order derivatives, it satisfies Assumption 2. Therefore, training the student network with the objective (29) can be treated as minimization of the squared MMD over P₂(R^d), i.e., as an MMD gradient flow. It is shown in Arbel et al. (2019) that the MMD flow and its descent scheme will generally get stuck in local optima, because the MMD is not geodesically convex; noise injection has therefore been proposed to escape these local optima. Supported by Theorems 4.1 and 5.1 on the convergence of the Dr MMD flow and its descent scheme, we propose to minimize Dr MMD(µ‖π) with Dr MMD descent, rather than minimizing MMD(µ‖π)² directly. Although this does not directly minimize the objective in (29), the favorable convergence behaviour of Dr MMD descent should result in a smaller MMD(µ‖π)² at convergence.

In our experimental setting, we are given M = 10 particles x^{(1)}, …, x^{(M)} from the teacher distribution π = N(0, I) and N = 1000 particles y_0^{(1)}, …, y_0^{(N)} from the initial student distribution µ_0 = N(0, 10^{−3}I). The teacher particles are fixed, while the student particles are updated according to Algorithm 1 at each time step. The initial (de)-regularization parameter is λ_0 = 0.1, the step size is γ = 0.1, we apply a lower bound of 10^{−3} on λ, and the regularity r is optimized over the set {0.1, 0.5, 1.0}. The neural network has 3 neurons in the hidden layer and output dimension 1. The data distribution P_data is the uniform distribution on the sphere in R^p with p = 50. A total of 2000 data points are sampled from P_data, with 1000 forming the training dataset and another 1000 the validation dataset. The kernel k(x, x′) = E_{z∼P_data}[ψ(z, x)ψ(z, x′)] is estimated at each iteration by an average over 100 randomly selected samples from the training dataset. Dr MMD and MMD descent are stopped after n_max = 15,000 iterations, by which point both have converged. The final performance is evaluated in terms of MMD²(µ_{n_max}‖π), with the kernel k estimated by an average over the 1000 samples in the validation dataset.

In Figure 2, Right, we report the performance of MMD descent (with and without noise injection) along with Dr MMD descent (with and without noise injection) in terms of the MMD on the validation dataset. We can see that Dr MMD descent does not get stuck in a local optimum, and leads to a much lower validation MMD(µ_{n_max}‖π)², even without noise injection.
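For reference, a single iteration of the Dr MMD particle update used in both experiments (Algorithm 1, with the closed-form witness function of Proposition 6.1) can be sketched as follows. This is a minimal illustrative sketch in JAX: the Gaussian kernel, array shapes, particle counts and hyperparameter values below are stand-in assumptions rather than the exact experimental configuration.

```python
# Minimal sketch of one Dr MMD particle-descent step (cf. Proposition 6.1 and
# Algorithm 1). Kernel, shapes, and hyperparameters are illustrative assumptions.
import jax
import jax.numpy as jnp

def gaussian_kernel(x, y, bandwidth=0.3):
    # k(x, y) = exp(-0.5 ||x - y||^2 / l^2) for a pair of points in R^d.
    return jnp.exp(-0.5 * jnp.sum((x - y) ** 2) / bandwidth ** 2)

def gram(kernel, A, B):
    # Gram matrix [k(a_i, b_j)]_{ij} between two sample sets.
    return jax.vmap(lambda a: jax.vmap(lambda b: kernel(a, b))(B))(A)

def witness(z, Y, X, lam, kernel=gaussian_kernel):
    # Closed-form witness function from (26), evaluated at a single point z:
    #   (2/(N lam)) k(z, Y) 1_N - (2/(M lam)) k(z, X) 1_M
    #   - (2/(N lam)) k(z, X) (M lam I + Kxx)^{-1} Kxy 1_N
    #   + (2/(M lam)) k(z, X) (M lam I + Kxx)^{-1} Kxx 1_M.
    # Gram matrices are recomputed here for brevity; an efficient implementation
    # would compute them once per iteration.
    N, M = Y.shape[0], X.shape[0]
    Kxx = gram(kernel, X, X)                      # (M, M)
    Kxy = gram(kernel, X, Y)                      # (M, N)
    kzY = jax.vmap(lambda y: kernel(z, y))(Y)     # (N,)
    kzX = jax.vmap(lambda x: kernel(z, x))(X)     # (M,)
    solve = lambda b: jnp.linalg.solve(M * lam * jnp.eye(M) + Kxx, b)
    return ((2.0 / (N * lam)) * kzY.sum()
            - (2.0 / (M * lam)) * kzX.sum()
            - (2.0 / (N * lam)) * kzX @ solve(Kxy @ jnp.ones(N))
            + (2.0 / (M * lam)) * kzX @ solve(Kxx @ jnp.ones(M)))

def drmmd_step(Y, X, lam, step_size):
    # Particle update (24): y <- y - gamma (1 + lam) grad h*(y), with the gradient
    # of the witness function obtained by automatic differentiation.
    grad_h = jax.vmap(jax.grad(lambda z: witness(z, Y, X, lam)))(Y)
    return Y - step_size * (1.0 + lam) * grad_h

# Toy usage with illustrative target and source particles in R^2.
X = jax.random.normal(jax.random.PRNGKey(0), (300, 2))            # target samples
Y = 2.0 + 0.1 * jax.random.normal(jax.random.PRNGKey(1), (300, 2))  # source samples
for _ in range(10):
    Y = drmmd_step(Y, X, lam=1e-1, step_size=1e-3)
```

In the fixed-λ case, the linear systems involving (Mλ I + Kxx) can additionally be pre-factorized once at initialization, which recovers the O(NM + N²) per-iteration cost discussed in Section 6.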
We also run Dr MMD descent with the noise injection scheme and find that noise injection can further improve the performance of Dr MMD descent and outperforms MMD descent with noise injection. Although it is unclear whether the density ratio has enough regularity to meet the condition of Theorem 5.1, the kernel k satisfies the boundedness and smoothness conditions of Assumption 2 and the target π satisfies the Poincaré inequality since it is Gaussian. The Dr MMD descent benefits from more favorable convergence properties, which explains its superior performance. The code to reproduce all the experiments can be found in the following Git Hub repository. https://github.com/hudsonchen/Dr MMD. 9. Discussion In this paper, we introduced (de)-regularization of the MMD (called Dr MMD) and its associated Wasserstein gradient flow. As an interpolation between the MMD and χ2divergence, the Dr MMD gradient flow inherits strengths from both sides: it is easy to simulate in closed form with particles, and it has an exponential rate of convergence towards the global minimum up to a controllable barrier term when the target π satisfies a Poincaré inequality. Additionally, we provide the optimal adaptive selection of a regularization coefficient that best balances the approximation and time discretization errors in Dr MMD gradient descent. Our work is the first to prove theoretically that an adaptive kernel through adaptive regularization can result in improved convergence of MMD gradient flow. The theoretical results are consistent with the empirical evidence in several numerical experiments. Following our work, there remain a number of interesting open problems. For example, (i) Since the kernel bandwidth has been known to play an important role in the performance of kernel-based algorithms, it is of interest to study the adaptive choice of kernel bandwidth in the context of Dr MMD gradient flow. (ii) To generalize our convergence analysis to the Wasserstein gradient flow of all Moreau envelopes of f-divergences in reproducing kernel Hilbert space, even when they do not have a closed-form expression as Dr MMD. (iii) While the current work proposes an approximation to the χ2-squared flow in the generative (De)-regularized Maximum Mean Discrepancy Gradient Flow modeling setting, i.e., where the target distribution π is known only through samples, it will be interesting to construct approximations to χ2-flow in the sampling setting, i.e., where π is known in closed form (at least up to normalization). In this section, we present the proofs of the results presented in Sections 3 6. 10.1 Proof of Proposition 3.1 MMD(µ π)2 = mµ mπ 2 H = Z k( , x)dµ(x) Z k( , x)dπ(x) Z k( , x) dµ dπ(x) 1 dπ(x) Also recall that the χ2-divergence between µ and π is χ2(µ π) = Z dµ dπ 1 2 dπ = I dµ 10.2 Proof of Proposition 3.2 Let µ π. In order to prove the alternative form of Dr MMD in (4), we start from (4) and show that it recovers (3). Dr MMD(µ||π) = (1 + λ) (Σπ + λI) 1 2 (mµ mπ) 2 = (1 + λ) (ι πιπ + λI) 1 = (1 + λ) (ι πιπ + λI) 1 dπ 1 , (ι πιπ + λI) 1 = (1 + λ) ιπ (ι πιπ + λI) 1 ι π = (1 + λ) ιπι π (ιπι π + λI) 1 dµ L2(π) , (30) where the last equality follows by noticing ιπ (ι πιπ + λI) 1 ι π = ιπι π (ιπι π + λI) 1. Therefore, Dr MMD(µ||π) = (1 + λ) Tπ (Tπ + λI) 1 dµ = (1 + λ) (Tπ + λI) 1 Tπ 1/2 dµ Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur which follows from the positivity and self-adjointness of Tπ (Tπ + λI) 1. So (4) is proved. Next, we are going to prove the variational formulation in (5). 
Similarly, we start from (5) and show it recovers (3). Consider (1 + λ) sup h H = (1 + λ) inf h H 4 dπ Z h(dµ dπ) + λ = (1 + λ) inf h H 4 h, Σπh H h, mµ mπ H + λ = (1 + λ) inf h H 4 I 1/2 h 1 4 I 1/2 (mµ mπ) 4 I 1/2 (mµ mπ) The last equality follows from completing the squares, based on which it is easy to see that the infimum is achieved at h µ,π = 2 (Σπ + λI) 1 (mµ mπ). For µ π, following the same derivations in (30), h µ,π can be alternatively expressed as h µ,π = 2 (Tπ + λI) 1 Tπ Plugging h µ,π back into (31) recovers (3), so (5) is proved. 10.3 Dr MMD is MMD with a regularized kernel k Given the definition of k(x, x ) = D (Σπ + λI) 1 2 k( , x), (Σπ + λI) 1 2 k( , x ) E H, it is clear that k is symmetric and positive definite so it has a unique associated reproducing kernel Hilbert space H (Steinwart and Christmann, 2008, Theorem 4.21) with canonical feature map k(x, ). Therefore, Dr MMD(µ π) = (1 + λ) (Σπ + λI) 1 2 (mµ mπ) 2 = (1 + λ) (Σπ + λI) 1 2 Z k(x, )d (π µ) (x), (Σπ + λI) 1 2 Z k(x , )d (π µ) (x ) = (1 + λ) Z (Σπ + λI) 1 2 k(x, )d (π µ) (x), Z (Σπ + λI) 1 2 k(x , )d (π µ) (x ) = (1 + λ) ZZ D (Σπ + λI) 1 2 k(x, ), (Σπ + λI) 1 2 k(x , ) E H d (π µ) (x)d (π µ) (x ) = (1 + λ) ZZ k(x, x )d (π µ) (x)d (π µ) (x ) Z k(x, )d(µ π)(x) (De)-regularized Maximum Mean Discrepancy Gradient Flow In the third and fourth equality above, we are using the fact that (Σπ + λI) 1 2 k(x, ) H is Bochner integrable and the Bochner integral preserves inner product structure. So Dr MMD is essentially MMD2 with a different kernel k up to a multiplicative factor of 1 + λ. Next, we present the Mercer decomposition of k. Notice that {ei}i 1 are the eigenfunctions of Tπ, so { ϱiei}i 1 are the eigenfunctions of Σπ. For x and x in the support of π, k also enjoys a pointwise convergent Mercer decomposition k(x, x ) = D (Σπ + λI) 1 2 k( , x), (Σπ + λI) 1 2 k( , x ) E (Σπ + λI) 1 i 1 ϱiei(x)ei , (Σπ + λI) 1 i 1 ϱiei(x )ei ϱi ϱi + λei(x)ei, X ϱi ϱi + λei(x )ei ϱi ϱi + λei(x)ei(x ). (33) More properties of the regularized kernel k are provided in Lemma B.3. 10.4 Proof of Proposition 3.3 Given that dµ dπ 1 L2(π), so Dr MMD(µ π) = (1 + λ) (Tπ + λI) 1 Tπ 1/2 dµ L2(π) (1 + λ) dµ dπ 1 which is finite for any λ 0. We are allowed to interchange the limit and integration according to the dominated convergence theorem (Rudin, 1976) to achieve, lim λ 0 Dr MMD(µ π) = lim λ 0 (Tπ + λI) 1 Tπ 1/2 dµ L2(π) = χ2(µ π). From (4), we have that, p Dr MMD(µ π) = 1 + λ (Σπ + λI) 1 2 (mµ mπ) H 1 + λ (Σπ + λI) 1 2 op mµ mπ H λ MMD(µ, π), (34) where the last inequality follows by noticing that Σπ = ι πιπ shares the same eigenvalues as Tπ = ιπι π, and hence the eigenvalues of (Σπ + λI) 1 are (ϱi + λ) 1 which all smaller than 1 λ. Therefore, Dr MMD(µ π) 1+λ λ MMD2(µ π). On the other hand, using Lemma A.10 from (Hagrass et al., 2024a), we have MMD (µ π) = mµ mπ H (Σπ + λI) 1 2 op (Σπ + λI) 1 2 (mµ mπ) H Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur 1 1 + λ (Σπ + λI) 1 2 op p Dr MMD (µ π) Dr MMD (µ π), (35) where K is the upper bound on kernel k in Assumption 2 and hence an upper bound on the operator norm of Σπ. Combining (34) and (35), we have 1 + λ K + λ MMD2(µ π) Dr MMD (µ π) 1 + λ λ MMD2(µ π). Therefore, limλ Dr MMD(µ π) = MMD2(µ π), and the proposition is proved. 10.5 Proof of Proposition 3.4 To show that Dr MMD is a probability divergence, we need to show that Dr MMD enjoys non-negativity and definiteness. It is easy to see that Dr MMD(µ π) is non-negative from its definition in Definition 1. 
Then, we prove definiteness, i.e., Dr MMD(µ||π) = 0 if and only if µ = π. For the first direction, assume Dr MMD(µ π) = 0, so (Σπ +λI) 1/2 (mµ mπ) 2 H = 0. Since (Σπ + λI) 1/2 is a non-singular operator, we must have that mµ = mπ which implies µ = π as k is c0-universal and hence characteristic (Sriperumbudur et al., 2011). For the other direction, when µ = π, immediately we can see Dr MMD(µ||π) = 0. Then we prove that Dr MMD metrizes weak convergence. For the first direction, we know from (34) that Dr MMD(µn π) 1+λ λ MMD2(µn π) and MMD2(µn π) 0 as µn converges weakly to π (Simon-Gabriel et al., 2023). For the converse direction, we assume that Dr MMD (µn π) 0. From (35), we know that MMD2 (µ π) K+λ 1+λ Dr MMD (µ π), therefore Dr MMD (µn π) 0 implies MMD (µn π) 0, implying the weak convergence of µn to π, if k is characteristic (Simon-Gabriel et al., 2023). 10.6 Proof of Proposition 4.1 In order to show that FDr MMD( ) = Dr MMD( π) admits a well-defined gradient flow, we follow the same techniques in Proposition 7 of Glaser et al. (2021) and Lemma B.2 of Chizat and Bach (2018), where the key is to show that (1 + λ) h µ,π is the Fréchet subdifferential of FDr MMD evaluated at µ.7 According to Definition 10.1.1 of Ambrosio et al. (2005), it is equivalent to prove that, for any µ P2(Rd) and φ C c (Rd), Dr MMD ((I + φ)#µ π) Dr MMD (µ π) (1 + λ) Z φ(x) h µ,π(x)dµ(x) + o φ L2(µ) . (36) Define ρt = (I + t φ)#µ, ϕt : Rd Rd, x 7 x + t φ(x) and g(t) = Dr MMD(ρt||π). Then from Lemma B.6 we know that g(t) is continuous and differentiable with respect to t and t=0g(t) = (1 + λ) Z φ(x) h µ,π(x)dµ(x). 7. Although Dr MMD can be viewed as squared MMD with a regularized kernel k, we are not using the technique in Arbel et al. (2019) because it relies on Lemma 10.4.1 of Ambrosio et al. (2005) which only provides the Fréchet subdifferential on probability measures µ that admit density functions. To construct the Wasserstein gradient flow of FDr MMD up to full generality, we resort to the techniques of Glaser et al. (2021) and Chizat and Bach (2018) instead. (De)-regularized Maximum Mean Discrepancy Gradient Flow Since t 7 g(t) is differentiable, using Taylor s theorem and mean value theorem (Rudin, 1976), we know that there exists 0 < κ < 1 such that Dr MMD ((I + φ)#µ π) Dr MMD (µ π) = g(1) g(0) = d t=0g(t) + d2 Therefore, to prove (36), the goal is to prove that d2 dt2 t=κg(t) o φ L2(µ) . To this end, since we know from Lemma B.6 that t 7 d dtg(t) is continuous and differentiable, we have dt2 Dr MMD(ρt||π) = 2(1 + λ) ZZ φ(x) 1 2 k (ϕt(x), ϕt(y)) φ(y)dµ(x)dµ(y) + 2(1 + λ) Z φ(x) Z H1 k (ϕt(x), ϕt(y)) dµ(y) Z H1 k (ϕt(x), y) dπ(y) φ(x)dµ(x), where k is the regularized kernel defined in (6) and H is the associated RKHS. Using Lemma B.3, the first term above can be upper-bounded by, 2(1 + λ) ZZ φ(x) 1 2 k (ϕt(x), ϕt(y)) φ(y)dµ(x)dµ(y) 2(1 + λ) ZZ φ(x) 1 2 k (ϕt(x), ϕt(y)) F φ(y) dµ(x)dµ(y) 2(1 + λ)K1d Z φ(y) dµ(y) 2 2(1 + λ)K1d λ φ 2 L2(µ). Using Lemma B.3 again, the second term can be upper-bounded by Z φ(x) Z H1 k (ϕt(x), ϕt(y)) dµ(y) Z H1 k (ϕt(x), y) dπ(y) φ(x)dµ(x) Z φ(x) Z H1 k (ϕt(x), ϕt(y)) dµ(y) φ(x)dµ(x) Z φ(x) Z H1 k (ϕt(x), y) dπ(y) φ(x)dµ(x) Z φ(x) Z H1 k (ϕt(x), ϕt(y)) F dµ(y) φ(x) dµ(x) + Z φ(x) Z H1 k (ϕt(x), y) F dπ(y) φ(x) dµ(x) λ φ 2 L2(µ). Combining the above two inequalities to have d2 dt2 Dr MMD(ρt||π) 2(1 + λ)K1d λ φ 2 L2(µ) + 4(1 + λ) KK2d λ φ 2 L2(µ). (37) Therefore, we have d2 dt2 t=κg(t) = O( φ 2 L2(µ)) = o( φ L2(µ)) as φ L2(µ) 0. 
So (36) is proved, which means that (1 + λ) h µ,π(x) is the Fréchet subdifferential of FDr MMD Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur evaluated at µ. According to Definition 11.1.1 of Ambrosio et al. (2005), there exists a solution (µt)t 0 such that the following equation holds in the sense of distributions, tµt (1 + λ) µt h µt,π = 0, and such (µt)t 0 is indeed the Dr MMD gradient flow, so existence is proved. Next, we are going to prove uniqueness. (37) indicates that FDr MMD is geodesically (1 + λ)4 KK2d+2K1d λ semiconvex. Therefore the uniqueness of (µt)t follows from Theorem 11.2.1 of Ambrosio et al. (2005). (8) is proved in (37). 10.7 Proof of Proposition 4.2 Given µ P2 Rd and φ C c Rd , consider the path (ρt)0 t 1 from µ to (I + φ)#µ given by ρt = (I + t φ)#µ. (ρt)0 t 1 is a constant-time geodesic in the Wasserstein-2 space by construction (Appendix A). Define ϕt : Rd Rd, x 7 x+t φ(x). We know from (Villani et al., 2009, Example 15.9) (by taking m = 2) that, t=0χ2(ρt π) = Z dµ dπ(x) V (x) φ(x) φ(x) 2 dµ(x) dπ(x) φ(x) HV (x) φ(x)dµ(x) dπ(x) Hφ(x) 2 F dµ(x). (38) V is twice differentiable, so V as a function from Rd to Rd is continuous. φ C c (Rd) has compact support, so x 7 V (x) φ(x) φ(x) is a continuous function over a compact domain, so its image is also compact and hence bounded. Using similar arguments, x 7 φ(x) HV (x) φ(x) and x 7 Hφ(x) 2 F are also continuous functions over compact domains, so they are all bounded. Since dµ dπ 1 H L2(π) and R dµ dπ(x)dµ(x) = dµ dπ 1 2 L2(π) + 1 < , we have dt2 t=0χ2(ρt π) < . (39) Next, from Lemma B.9 we know that d2 dt2 t=0χ2(ρt π) can be alternatively expressed as, t=0χ2(ρt π) = 2 Z ( φ(x)µ(x)) 1 π(x) + 2 Z φ(x) H µ(x) φ(x)µ(x)dx. (40) Recall from Lemma B.6 that t=0 Dr MMD(ρt||π) = 2(1 + λ) ZZ φ(x) 1 2 k (x, y) φ(y)dµ(x)dµ(y) (De)-regularized Maximum Mean Discrepancy Gradient Flow + 2(1 + λ) Z φ(x) Z H1 k (x, y) dµ(y) Z H1 k (x, y) dπ(y) φ(x)dµ(x). (41) To prove (9), our aim then is to compare and bound the difference of d2 dt2 t=0χ2(ρt π) in (40) and 1 1+λ d2 dt2 t=0 Dr MMD(ρt||π) in (41), so we compare and bound their first and second term separately. The first term of (40) can be rewritten as Z ( φ(x)µ(x)) 1 π(x) 2 π(x)dx = X Z ( φ(x)µ(x)) ei(x)dx 2 = X Z φ(x) ei(x)µ(x)dx 2 , (42) where we use integration by parts in the last line since φ C c (Rd). And the first term of (41) after rescaling by 1 1+λ can be rewritten as, 2 ZZ φ(x) 1 2 k (x, y) φ(y)dµ(x)dµ(y) = 2 ZZ φ(x) x y ϱi ϱi + λei(x)ei(y) φ(y)dµ(x)dµ(y) = 2 ZZ ( φ(x)µ(x)) ( φ(y)µ(y)) X ϱi ϱi + λei(x)ei(y)dxdy. (43) Since π Ld so (33) is true for all x, y Rd, hence the second equality is true, and the last equality uses integration by parts since φ C c (Rd). Notice that ZZ ei(x)ei(y) ( φ(x)µ(x)) ( φ(y)µ(y)) dxdy Z ( φ(x)µ(x)) ei(x) dx 2 Z ( φ(x)µ(x)) ! Z ei(x)2µ(x)dx Z log µ(x) φ(x) + φ(x) 2 dµ(x) i 1 ϱiei(x)2dµ(x) Z log µ(x) φ(x) + φ(x) 2 dµ(x) < + . The first inequality uses Cauchy-Schwartz, the second inequality uses P i 1 ϱi(ei(x))2 = k(x, x) K. The last quantity is finite because x log µ(x) is a continuous function and φ C c (Rd) has compact support, hence the integral of a continuous function over a Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur compact domain is always finite. 
Then, by using Fubini s theorem (Rudin, 1976), we are allowed to interchange the infinite sum and integration of (43) to reach, 2 ZZ ( φ(x)µ(x)) ( φ(y)µ(y)) X ϱi ϱi + λei(x)ei(y)dxdy ZZ ( φ(x)µ(x)) ( φ(y)µ(y)) ei(x)ei(y)dxdy Z ( φ(x)µ(x)) ei(x)dx 2 Z φ(x) ei(x)µ(x)dx 2 , where the last equality uses integration by parts. So the difference between the first term of (40) and (41) rescaled by 1 1+λ is, Z φ(x) ei(x)µ(x)dx 2 X Z φ(x) ei(x)µ(x)dx 2 Z φ(x) ei(x)µ(x)dx 2 . (44) Now we turn to the second term. The second term of (40) can be rewritten as 2 Z φ(x) H dµ dπ(x) φ(x)µ(x)dx = 2 Z φ(x) H dµ dπ(x) 1 φ(x)µ(x)dx = 2 Z φ(x) H L2(π) ei(x) φ(x)µ(x)dx, (45) and the second term of (41) rescaled by 1 1+λ can be rewritten as, 2 Z φ(x) Z H1 k (x, y) dµ(y) Z H1 k (x, y) dπ(y) φ(x)dµ(x) = 2 Z φ(x) H Z k (x, y) dµ(y) Z k (x, y) dπ(y) φ(x)dµ(x) = 2 Z φ(x) H ϱi ϱi + λei(x) Z ei(y)d(µ π)(y) = 2 Z φ(x) H L2(π) ei(x) φ(x)dµ(x). (46) (De)-regularized Maximum Mean Discrepancy Gradient Flow Since π Ld, so (33) is true for all x, y Rd hence the third equality is true. From Lemma B.3, we have R k(x, y)dµ(y) K λ , x 7 k(x, y) is second-order differentiable, supx |H1 k(x, y)| KK2d λ . So we are allowed to interchange integration and Hessian in the second equality using the differentiation lemma (Klenke, 2013, Theorem 6.28). Consider the difference of (46) and (45), we have L2(π) ei(x) L2(π) ei(x) L2(π) ei(x) dπ 1 H, there exists q L2(π) such that dµ dπ 1 = T 1/2 π q so that dµ ϱ1/2 i q, ei for all i. For j, r {1, , d}, we have L2(π) j rei(x) λ ϱi + λϱ1/2 i q, ei L2(π) j rei(x) 2 q, ei 2 L2(π) i M0 ϱi ( j rei(x))2 i M0 q, ei 2 L2(π) i M0 ϱi ( j rei(x))2 i M0 q, ei 2 L2(π) The final inequality holds because, i 1 ϱi ( j rei(x))2 i 1 ϱi j rk(x, ), ei 2 H = i 1 j rk(x, ), ϱiei 2 H j,r=1 j rk(x, ) 2 H = H1k(x, ) 2 Hd d K2d. (48) Since sdq L2(π) is bounded, so P i M0 q, ei 2 L2(π) converges to 0 uniformly as M0 and hence g M0(x) converge to 0 uniformly. Therefore, we are allowed to interchange the Hessian and the infinite sum (Rudin, 1976) in (47) to achieve, L2(π) ei(x) Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur λ ϱi + λϱ1/2 i q, ei L2(π)Hei(x) 2 q, ei 2 L2(π) Z φ(x) Hei(x) φ(x)µ(x)dx 2 2 q, ei 2 L2(π) i 1 ϱi Hei(x) 2 op Z φ(x) φ(x)µ(x)dx 2 2 q, ei 2 L2(π) i 1 ϱi Hei(x) 2 op 2 q, ei 2 L2(π) i 1 ϱi Hei(x) 2 F 2 q, ei 2 L2(π) K2d φ 2 L2(µ). (49) The first inequality uses Cauchy Schwartz, the second last inequality uses that matrix operator norm is smaller than matrix Frobenius norm, and the last inequality uses (48). Combining together (44) and (49), we reach 1 1 + λ d2 t=0 Dr MMD(ρt||π) d2 t=0χ2(ρt||π) Z φ(x) ei(x)µ(x)dx 2 2 q, ei 2 L2(π) K2d φ 2 L2(µ) =: R(λ, µ, φ). t=0 Dr MMD(ρt||π) (1 + λ) d2 t=0χ2(ρt||π) (1 + λ) R(λ, µ, φ) (1 + λ) Z dµ dπ(x) φ(x) HV (x) φ(x)dµ(x) (1 + λ) R(λ, µ, φ) (1 + λ)α Z dµ dπ(x) φ(x) 2dµ(x) (1 + λ) R(λ, µ, φ), where the second inequality is using (38) and the last inequality is using HV αI. So (9) is proved. (De)-regularized Maximum Mean Discrepancy Gradient Flow Define R(λ, µ, φ) := (1+λ) R(λ, µ, φ). The final thing to check is limλ 0 R(λ, µ, φ) = 0, which is equivalent to check that limλ 0 R(λ, µ, φ) = 0. 
Since we know from (39) and (42) that Z φ(x) ei(x)µ(x)dx 2 < d2 t=0χ2(ρt π) < , using the dominated convergence theorem (Rudin, 1976), we are allowed to interchange infinite sum and taking limits, Z φ(x) ei(x)µ(x)dx 2 = X i 1 lim λ 0 λ ϱi + λ Z φ(x) ei(x)µ(x)dx 2 Similarly, because P i 1 λ ϱi+λ 2 q, ei 2 L2(π) < q 2 L2(π) < , using dominated convergence theorem (Rudin, 1976) again, we have, 2 q, ei 2 L2(π) = X i 1 lim λ 0 2 q, ei 2 L2(π) = 0. Therefore, we have that lim λ 0 R(λ, µ, φ) = lim λ 0 R(λ, µ, φ) = 0. And the proof of the proposition is finished. 10.8 Proof of Theorem 4.1 Considering the time derivative of KL(µt π), we have d dt KL(µt π) = (1 + λ) Z h µt,π(x) log dµt dπ (x)µt(x)dx = (1 + λ) Z h µt,π(x) dµt dπ (x)π(x)dx = (1 + λ) Z h µt,π(x) 2 dµt dπ (x)π(x)dx 2(1 + λ) Z π(x) dµt = (1 + λ) Z h µt,π(x) 2 dµt dπ (x) 1 dµt dπ (x)π(x)dx 2(1 + λ) Z π(x) dµt Case one: R π(x) dµt dπ (x) 2 dx < . We use integration by parts for the first term in (50) and we can safely ignore the boundary term due to condition 5 that for i = 1, . . . , d, Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur h µt,π(x) 2dµt dπ (x) i dµt dπ (x) π(x) 0. So, we obtain d dt KL(µt π) = (1 + λ) Z h µt,π(x) 2 dµt dπ (x) 1 π(x) dµt 2(1 + λ) Z π(x) dµt = (1 + λ) Z h µt,π(x) 2 dµt dπ (x) 1 π(x) dµt π(x) π(x)dx 2(1 + λ) Z π(x) dµt (1 + λ) h µt,π 2 dµt CP KL(µt π), (51) where the first part of the last inequality holds by using Cauchy Schwartz, and the second part holds by the fact that KL(µt π) χ2(µt π) (Van Erven and Harremos, 2014) and by applying the Poincaré inequality with f = dµt dπ 1 (notice that f L2(π) < from Case one and f L2(π) < from condition 3), 2 dx 1 CP χ2(µt π) 1 CP KL(µt π). dπ 1 Ran (T r π ) with r > 0, using Lemma B.5 we have h µt,π 2 dµt dπ 1 L2(π) 2λr qt L2(π) . (52) Then, notice that = Z h π(x) dµt π(x)2 dπ(x) = Z π(x) dµt dπ (x) + π(x) dµt π(x)2 dπ(x) = Z log π(x) dµt dπ (x) + dµt dπ (x) 2 dπ(x) Z 2 log π(x) dµt dπ (x) 2 + 2 dµt dπ (x) 2 dπ(x) = 2 (log π) dµt L2(π) + 2 dµt L2(π) 2J 2 t + 2I2 t . (53) (De)-regularized Maximum Mean Discrepancy Gradient Flow Therefore, plugging (52) and (53) back to (51), we have d dt KL(µt π) 4(1 + λ)λr qt L2(π) (Jt + It) 2(1 + λ) CP KL(µt π). (54) Case two: R π(x) dµt dπ (x) 2 dx = . The first term of (50) remains the same, and the second term of (50) now equals infinity which is larger than the finite 2(1+λ) CP KL(µt π), so we also obtain (54) as in Case one. Therefore, both Case one and Case two result in (54). Using the Gronwall lemma, we obtain that for any T > 0, KL(µT π) exp 2(1 + λ) CP T KL(µ0 π) + 4(1 + λ)λr Z T 0 exp 2(1 + λ) CP (T t) qt L2(π) (Jt + It)dt, which concludes the proof of Theorem 4.1. 10.8.1 Derivation of (14) under stronger range assumption r = 1 Notice that for any x Rd, since k is differentiable dπ 1 (x) = k(x, ), dµt And since dµt dπ 1 H, there exists qt L2(π) such that D dµt L2(π) = ϱ1/2 i qt, ei L2(π) for all i, so dµt i 1 qt, ei 2 L2(π) = qt 2 L2(π). Combining the above two equations, we have (log π) dµt dπ 1 L2(π) p K1d qt L2(π) log π L2(π) . Also, for the other one, notice that dπ 1 (x) F = Hk(x, ), dµt Hk(x, ) Hd d K2d qt L2(π). Therefore, dµt K2d qt L2(π). Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur 10.9 Proof of Proposition 5.1 We know that µn+1 = (I γ(1 + λ) h)#µn where we drop the subscripts of the witness function hµn,π when it causes no ambiguity. Denote ρt = (I t(1 + λ) h)#µn, so ρ0 = µn, and ργ = µn+1. 
Consider the difference of KL divergence between the two iterates µn+1 and µn along the time-discretized Dr MMD flow: KL(µn+1 π) KL(µn π) = KL(ργ π) KL(ρ0 π) t=0 KL(ρt π)γ + Z γ dt2 KL(ρt π)dt. (55) For the first term of (55), t=0 KL(ρt π) = (1 + λ) Eµn = (1 + λ) Eπ = 2(1 + λ) dµn L2(π) + (1 + λ) Eπ = 2(1 + λ) dµn L2(π) + (1 + λ) Z dµn dπ (x) 2 dµn dπ (x) 1 h(x) π(x)dx = 2(1 + λ) dµn L2(π) (1 + λ) Z 2 dµn dπ (x) 1 h(x) π(x) dµn CP χ2(µn π) + (1 + λ) h 2 dµn where the fourth equality uses an integration by parts, and the last inequality uses Poincaré inequality for the first term under similar arguments in Section 10.8 and uses Cauchy-Schwarz for the second term. Using Lemma B.5 and the derivations in (53), (56) can be further upper bounded by t=0 KL(ρt π) (1 + λ) 2 CP χ2(µn π) + 2(1 + λ)λr Q (J + I) . (57) Then, for the second term of (55), we know from Example 15.9 of Villani et al. (2009) (taking m = 1) that, dt2 KL(ρt π) = (1 + λ)2 Z h(x) HV (ϕt(x)) h(x)dµn(x) + (1 + λ)2 Z Hh(x) (I t(1 + λ)Hh(x)) 1 2 Because 2(1 + λ)γ q χ2 (µ0 π) K2d ζ for 1 < ζ < 2, applying Lemma B.7 we have, dt2 KL(ρt π) 4(1 + λ)2βχ2(µn π)K1d λ + 4(1 + λ)2ζ2χ2(µn π)K2d (De)-regularized Maximum Mean Discrepancy Gradient Flow Combining the above two inequalities (57) and (58) and plugging them back into (55), we obtain KL(µn+1 π) KL(µn π) 2 CP (1 + λ)χ2(µn π)γ + 2(1 + λ)γλr Q (J + I) + 2(1 + λ)2γ2(β + ζ2)χ2(µn π)K1d + K2d CP χ2(µn π)γ + 4γλr Q (J + I) + 8γ2(β + ζ2)χ2(µn π)K1d + K2d where the last inequality holds by using 0 < λ 1, and the result follows. 10.10 Proof of Theorem 5.1 In order to use Proposition 5.1 in the proof, first we are going to show that Proposition 5.1 holds under the conditions of Theorem 5.1. Notice that conditions 1-4 of Proposition 5.1 are precisely the conditions 1-4 of Theorem 5.1. To use Proposition 5.1 in the proof of Theorem 5.1, the only thing left is to check that the condition of step size γ in (16) is satisfied. In Theorem 5.1, λn is selected to be 2γ χ2(µn π)(β+ζ2)(K1d+K2d) Q(J +I) 1 r+1 1. If λn is taken to be the former, then χ2(µn π)K2d χ2(µn π)K2d = 2(2γ) 2r+1 2r+2 χ2 (µn π) r 2r+2 (Q(J + I)) 1 2r+2 1 K1d + K2d ( ) (8γ) 2r+1 2r+2 Q 2r+1 2r+2 (J + I) 1 2r+2 1 β + ζ2 ( ) holds because K2d K1d+K2d 1 and χ2 (µn π) = dµn L2(π) = T r π qn 2 L2(π) K2r Q2 Q2. (60) The second last inequality of (59) holds due to the constraint on γ in (19), and the last inequality of (59) holds because β + ζ2 1. On the other hand, if λn is chosen to be 1, similarly based on the constraint on γ in (19), we have 2γ(1 + λn)χ2(µn π)K2d λn = 4γχ2(µn π)K2d ζ 1 Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur Therefore, all the conditions of Proposition 5.1 have been verified. So, if we select λn = 2γ χ2(µn π)(β+ζ2)(K1d+K2d) Q(J +I) 1 r+1 1, then KL(µn+1 π) KL(µn π) γ 2 CP χ2(µn π) + 4γλr n Q (J + I) | {z } ( 1) + 8γ2(β + ζ2)χ2(µn π)K1d + K2d λn | {z } ( 2) By observing (61), the first term on the right-hand side 2 CP χ2(µn π)γ is strictly negative and is decreasing KL divergence at each iteration n of the Dr MMD gradient descent. In contrast, the second term ( 1) := 4γλr n Q(J + I) and the third term ( 2) := 8γ2(β + ζ2)χ2(µn π)K1d+K2d λn are positive and prevent the KL divergence from decreasing. Denote G(λn) = ( 1) + ( 2) and the optimal λn is achieved by taking d dλn G(λn) = 0, which leads to λn = 2γ χ2(µn π)(β+ζ2)(K1d+K2d) Q(J +I) 1 r+1 . 
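This choice can equivalently be read off by balancing the two terms of G(λ_n): equating the term that increases in λ_n with the term that decreases in λ_n gives
\[
4\gamma\lambda_n^{\,r}\,Q(\mathcal{J}+\mathcal{I})
\;=\;
8\gamma^{2}(\beta+\zeta^{2})\,\chi^{2}(\mu_n\|\pi)\,\frac{K_{1d}+K_{2d}}{\lambda_n}
\;\;\Longleftrightarrow\;\;
\lambda_n^{\,r+1}
\;=\;
\frac{2\gamma\,\chi^{2}(\mu_n\|\pi)(\beta+\zeta^{2})(K_{1d}+K_{2d})}{Q(\mathcal{J}+\mathcal{I})},
\]
which is precisely the value of λ_n above; the exact stationary point of G differs from this value only by a constant factor depending on r.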
Plugging the value of λn back to (61) to obtain, KL(µn+1 π) KL(µn π) CP χ2(µn π)γ + 4γ (2γ) r r+1 χ2(µn π)(β + ζ2)(K1d + K2d) r r+1 (Q(J + I)) 1 r+1 CP χ2(µn π)γ + 8γ1+ r r+1 χ2(µn π)(β + ζ2)(K1d + K2d) r r+1 (Q(J + I)) 1 r+1 CP χ2(µn π)γ + 8γ1+ r r+1 Q 2r+1 r+1 (K1d + K2d)(β + ζ2) r r+1 (J + I) 1 r+1 , (62) where the last inequality holds because of (60). Since χ2(µn π) KL(µn π) (Van Erven and Harremos, 2014, Equation (7)), we have KL(µn+1 π) 1 γ 2 + 8γ1+ r r+1 Q 2r+1 r+1 (K1d + K2d)(β + ζ2) r r+1 (J + I) 1 r+1 . After iterating, we obtain KL(µnmax π) nmax KL(µ0 π) + 4γ r r+1 CP Q 2r+1 r+1 (K1d + K2d)(β + ζ2) r r+1 (J + I) 1 r+1 KL(µ0 π) + 4γ r r+1 CP Q 2r+1 r+1 (K1d + K2d)(β + ζ2) r r+1 (J + I) 1 r+1 and the result follows. 10.11 Proof of Theorem 5.2 The proof of Theorem 5.2 is also based on Proposition 5.1 proved in the last section. Recalling (62) yet with adaptive step size γn, we have KL(µn+1 π) KL(µn π) (De)-regularized Maximum Mean Discrepancy Gradient Flow CP χ2(µn π)γn + 8γ 1+ r r+1 n χ2(µn π)(β + ζ2)(K1d + K2d) r r+1 (Q(J + I)) 1 r+1 . From (21), we have r r+1 n (K1d + K2d)(β + ζ2) r r+1 (Q(J + I)) 1 r+1 1 CP χ2(µn π) 1 r+1 , so that we have KL(µn+1 π) KL(µn π) 1 CP χ2(µn π)γn 1 CP KL(µn π)γn. KL(µn+1 π) 1 1 KL(µn π). (63) After iterating n from 1 to nmax, the theorem is proved. 10.12 Proof of Theorem 6.1 In order to analyze the error of space discretization, we introduce another particle descent scheme using the population witness function h µn,π defined in (15) starting from the same initialization as that of (24), y(i) n+1 = y(i) n γ(1 + λn) h µn,π(y(i) n ), y(i) 0 = y(i) 0 . (64) The corresponding empirical distribution of the particles at time step n is defined as µn = 1 N PN i=1 y(i) n . Note that (64) is an unbiased sampled version (since it is composed of N i.i.d. realizations) of (15). The following proposition shows that W2(µn, ˆµn) 0 as N, M , i.e., with a sufficient number of samples from µ0 and π, (24) can approximate (64) with arbitrary precision. The proof of Proposition 10.1 is provided in Section 10.13. Proposition 10.1 Suppose k satisfies Assumption 1 and 2. Given initial particles {y(i) 0 }N i=1 that are i.i.d sampled from µ0, a sequence (µn)n N of empirical distributions arising from (64), and a sequence (ˆµn)n N arising from (24), then for all n 1, we have E[W2 (ˆµn, µn)] A K γn2(1 + λ)R where A = 2 KK1d KK2d+K1d and R = K1d + KK2d are constants that only depend on the kernel, and λ = min i=1,...,n λi denotes the smallest regularization coefficient. Now we are ready to prove Theorem 6.1. By the triangle inequality, we have, E [W2 (ˆµn, π)] E [W2 (ˆµn, µn)] + E [W2 ( µn, µn)] + W2 (µn, π) . Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur From Proposition 10.1, the first term is upper bounded by E[W2 (ˆµn, µn)] A K nγ 2(1 + λ)R where the second inequality uses λ 1 and K 1. Since (µn)n has finite fourth moment, then by taking p = 2, q = 4 in (Lei, 2020, Theorem 3.1) and (Fournier and Guillin, 2015, Theorem 1), the second term is upper bounded by, E [W2 ( µn, µn)] = O N 1 d 4 . For the third term, since the Wasserstein-2 distance is upper bounded by the square root of the KL divergence, if the target π that satisfies Talagrand-2 inequality with constant CT (Villani et al., 2009, Definition 22.1), we have W2 (µn, π) p KL (µn π) p KL (µ0 π) + O γ r 2r+2 , where the last inequality follows from Theorem 5.1. Combining the above three terms, we obtain E [W2 (ˆµn, π)] A 1 + O N 1 d 4 KL (µ0 π) + O γ r 2r+2 , (65) where A = 2 KK1d KK2d+K1d . 
Recall from Theorem 5.1 that λi = γχ2(µi π)Z 1 r+1 1 for i = 1, . . . , n where Z is the constant that depends on β, ζ, K1d, K2d, Q, J , I. From the condition on the number of samples M and N in (25), we obtain that if λ = λj = γχ2(µj π)Z 1 r+1 for some j {1, . . . , n}, 1 min i=1,...,n KL(µi π)Z 1 4nγ r r+1 R min i=1,...,n KL(µi π)Z 1 r+1 1 1 min i=1,...,n KL(µi π)Z 4nγ r r+1 R min i=1,...,n KL(µi π)Z 1 r+1 1 KL(µj π)Z 4nγ r r+1 R (KL(µj π)Z) 1 r+1 (De)-regularized Maximum Mean Discrepancy Gradient Flow 1 χ2(µj π)Z 4nγ r r+1 R (χ2(µj π)Z) 1 r+1 = Aγ r r+1 1 On the other hand, if λ = 1, since γ 1, 1 min i=1,...,n KL(µi π)Z 1 4nγ r r+1 R min i=1,...,n KL(µi π)Z 1 r+1 1 γ exp 4nγ r r+1 R γ r 2r+2 exp (4nγR) = γ r 2r+2 1 Similarly for N, we have N γ r 2r+2 exp 4nγR and N 1 d 4 γ r 2r+2 . Plugging them back to (65), we obtain E [W2 (ˆµn, π)] p KL (µ0 π) + O γ r 2r+2 , which completes the proof. 10.13 Proof of Proposition 10.1 Since the proof below works for any regularization coefficient λ, we use a fixed λ for the majority of the analysis and resort back to adaptive λn at the end of the proof. For empirical distributions ˆµn = 1 N PN i=1 y(i) n and µn = 1 N PN i=1 y(i) n defined in (24) and (64), note that E W 2 2 (ˆµn, µn) 1 i=1 E y(i) n y(i) n 2 := c2 n. i=1 E y(i) n+1 y(i) n+1 2 i=1 E y(i) n y(i) n γ(1 + λ) h ˆµn,ˆπ(y(i) n ) h µn,π(y(i) n ) 2 i=1 E y(i) n y(i) n 2 Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur γ(1 + λ) h ˆµn,ˆπ(y(i) n ) h µn,π(y(i) n ) 2 = cn + γ(1 + λ) i=1 E h ˆµn,ˆπ(y(i) n ) h µn,π(y(i) n ) 2 , where we used Minkowski s inequality, v u u t i=1 ai + bi 2 in the above inequalities. Again, by Minkowski s inequality, we have cn+1 cn + γ(1 + λ) i=1 E h ˆµn,ˆπ(y(i) n ) h ˆµn,ˆπ(y(i) n ) 2 i=1 E h ˆµn,ˆπ(y(i) n ) h µn,π(y(i) n ) 2 | {z } (ii) 10.13.1 Controlling (i): i=1 E h ˆµn,ˆπ(y(i) n ) h ˆµn,ˆπ(y(i) n ) 2 = D jk(y(i) n , ) jk(y(i) n , ), h ˆµn,ˆπ E2 jk(y(i) n , ) jk(y(i) n , ) 2 h ˆµn,ˆπ 2 H 4KK2d i=1 E y(i) n y(i) n 2 , where the second inequality uses Cauchy-Schwarz inequality and the third follows from using Lemma B.3 and Lemma B.4. So we have i=1 E h ˆµn,ˆπ(y(i) n ) h ˆµn,ˆπ(y(i) n ) 2 2 KK2d 10.13.2 Controlling (ii): First, we introduce some auxiliary witness functions, h µn,ˆπ = 2 (Σˆπ + λI) 1 (m µn mˆπ) , h n = 2 (Σπ + λI) 1 (m µn mˆπ) , and for completeness, we recall the witness function we are interested in: h µn,π = 2 (Σπ + λI) 1 (mµn mπ) , h ˆµn,ˆπ = 2 (Σˆπ + λI) 1 (mˆµn mˆπ) . (De)-regularized Maximum Mean Discrepancy Gradient Flow We know that i=1 E h ˆµn,ˆπ(y(i) n ) h µn,π(y(i) n ) 2 i=1 E h ˆµn,ˆπ(y(i) n ) h µn,ˆπ(y(i) n ) 2 i=1 E h µn,ˆπ(y(i) n ) h n(y(i) n ) 2 i=1 E h n(y(i) n ) h µn,π(y(i) n ) 2 i=1 K1d E h ˆµn,ˆπ h µn,ˆπ 2 i=1 K1d E h µn,ˆπ h n 2 i=1 K1d E h n h µn,π 2 E h ˆµn,ˆπ h µn,ˆπ 2 E h µn,ˆπ h n 2 E h n h µn,π 2 where the first inequality follows from Minkowski s inequality, and the second inequality uses the fact that for h0, h1 H, h1 h0 2 Hd 1k(x, ) 2 Hd h1 h0 2 H K1d h1 h0 2 H . Next, we will bound q E h ˆµn,ˆπ h µn,ˆπ 2 H, q E h µn,ˆπ h n 2 H, and q E h n h µn,π 2 H separately. 
First, by noticing that h ˆµn,ˆπ = 2 (Σˆπ + λI) 1 (mˆµn mˆπ) is the witness function associated with Dr MMD(ˆµn||ˆπ), and h µn,ˆπ = 2 (Σˆπ + λI) 1 (m µn mˆπ) is the witness function associated with Dr MMD( µn||ˆπ), by using Lemma B.4, we have r E h ˆµn,ˆπ h µn,ˆπ 2 λ2 W 2 2 ( µn, ˆµn) 2 K1d E h µn,ˆπ h n 2 E 2 (Σˆπ + λI) 1 (m µn mˆπ) 2 (Σπ + λI) 1 (m µn mˆπ) 2 E (Σˆπ + λI) 1 (Σπ + λI) 1 2 HS m µn mˆπ 2 H E (Σˆπ + λI) 1 (Σπ + λI) 1 2 Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur E (Σˆπ + λI) 1 (Σˆπ + λI) (Σπ + λI) (Σπ + λI) 1 2 E Σˆπ Σπ 2 HS 4 where the last inequality follows from using Lemma B.8 and the fact that k(x, ) k(x, ) HS K. Third, r E h n h µn,π 2 E 2 (Σπ + λI) 1 (m µn mˆπ) 2 (Σπ + λI) 1 (mµn mπ) 2 E (m µn mˆπ) (mµn mπ) 2 H E m µn mµn 2 H + q E mˆπ mπ 2 H where the first inequality follows from Cauchy-Schwartz, and the last inequality from Lemma B.8 since k(x, ) H K. Therefore, combining (66), (67) and (68), we have λ cn + 2K3/2 Combining (i) and (ii), we have 1 + γ(1 + λ)2 KK2d + 2K1d + 2γ(1 + λ) p Denoting A = 2 KK1d KK2d+K1d and R = K1d + KK2d as constants that only depend on the kernel, and using the discrete Gronwall lemma (Lemma 26 from Arbel et al. 2019) along with c0 = 0, we obtain exp γnmax 2(1 + λ)R Since E W2(ˆµn, µn) p E W 2 2 (ˆµn, µn) cn, we reach E W2(ˆµnmax, µnmax) A K exp γnmax 2(1 + λ)R Finally, the proof is completed by noting that the r.h.s. is monotonically decreasing in λ and therefore the r.h.s. can be bounded by replacing λ with λ = min i=1,...,nmax λi. 10.14 Proof of Proposition 6.1 By defining the following operators, Sx : H RM, f 1 h f(x(1)), . . . , f(x(M)) i , (De)-regularized Maximum Mean Discrepancy Gradient Flow S x : RM H, α 1 i=1 αik(x(i), ), Sy : H RN, f 1 h f(y(1)), . . . , f(y(N)) i , S y : RN H, α 1 i=1 αik(y(i), ). Then we have Σˆπ = S x Sx, Kxx = MSx S x, Kxy = Using these, note that h ˆµ,ˆπ = 2 (Σˆπ + λI) 1 (mˆµ mˆπ) i=1 k x(i), k x(i), + λI i=1 k y(i), 1 i=1 k x(i), ! = 2 S x Sx + λI 1 1 From the Woodbury inversion lemma, we have that S x Sx + λI 1 = 1 λS x(Sx S x + λI) 1Sx. Plugging the above into (69), we obtain h ˆµ,ˆπ = 2 1 λS x(Sx S x + λI) 1Sx M Kxx + λI 1 1 N M Kxx + λI 1 1 M = 2 Nλk , y1:N 1N 2 Mλk , x1:M 1M 2 Nλk , x1:M (Kxx + MλI) 1 Kxy1N + 2 Mλk , x1:M (Kxx + MλI) 1 Kxx1M. (70) Obtaining Dr MMD(ˆµ ˆπ) is then easy with h ˆµ,ˆπ shown in (70). Dr MMD(ˆµ ˆπ) = (1 + λ) (Σˆπ + λI) 1 2 (mˆµ mˆπ) 2 = (1 + λ) 1 2h ˆµ,ˆπ, mˆµ mˆπ * 1 2h ˆµ,ˆπ, 1 i=1 k y(i), 1 i=1 k x(i), + Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur 1 N2 1 NKyy1N + 1 M2 1 MKxx1M 2 MN 1 MKxy1N N2 1 NK xy (Kxx + MλI) 1 Kxy1N + 2 NM 1 MKxx (Kxx + MλI) 1 Kxy1N 1 M2 1 MKxx (Kxx + MλI) 1 Kxx1M Acknowledgments We would like to thank Gabriele Steidl and Viktor Stein for fruitful discussions on Dr MMD and related functionals. Zonghao Chen is supported by the Engineering and Physical Sciences Research Council (EPSRC) through grant [EP/S021566/1]. Aratrika Mustafiand Bharath K. Sriperumbudur are partially supported by the National Science Foundation (NSF) grant DMS-2413425 and NSF CAREER award DMS-1945396. Pierre Glaser and Arthur Gretton are supported by the Gatsby Charitable Foundation. Anna Korba thanks Google for their academic support in the form of a gift in support of her academic research. Rohit Agrawal and Thibaut Horel. Optimal bounds between f-divergences and integral probability metrics. Journal of Machine Learning Research, 22(1):5662 5720, 2021. Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. 
Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Springer Science & Business Media, 2005.
Abdul Fatir Ansari, Ming Liang Ang, and Harold Soh. Refining deep generative models via discriminator gradient flow. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Zbc-ue9p_rE.
Michael Arbel, Danica J. Sutherland, Mikolaj Binkowski, and Arthur Gretton. On gradient regularizers for MMD GANs. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
Michael Arbel, Anna Korba, Adil Salim, and Arthur Gretton. Maximum mean discrepancy gradient flow. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337-404, 1950.
Krishna Balasubramanian, Tong Li, and Ming Yuan. On the optimality of kernel-embedding based goodness-of-fit tests. Journal of Machine Learning Research, 22(1):1-45, 2021.
Krishna Balasubramanian, Sinho Chewi, Murat A. Erdogdu, Adil Salim, and Shunshi Zhang. Towards a theory of non-log-concave sampling: First-order stationarity guarantees for Langevin Monte Carlo. In Po-Ling Loh and Maxim Raginsky, editors, Conference on Learning Theory, pages 2896-2923. PMLR, 2022.
Sayan Banerjee, Krishna Balasubramanian, and Promit Ghosal. Improved finite-particle convergence rates for Stein variational gradient descent. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=sbG8qhMjkZ.
Frank Bauer, Sergei Pereverzev, and Lorenzo Rosasco. On regularization algorithms in learning theory. Journal of Complexity, 23(1):52-72, 2007.
A. Belafhal, Z. Hricha, L. Dalil-Essakali, and T. Usman. A note on some integrals involving Hermite polynomials and their applications. Advanced Mathematical Models and Applications, 5(3):313-319, 2020.
Jeremiah Birrell, Paul Dupuis, Markos A. Katsoulakis, Yannis Pantazis, and Luc Rey-Bellet. (f, Γ)-divergences: Interpolating between f-divergences and integral probability metrics. Journal of Machine Learning Research, 23(39):1-70, 2022.
Clément Bonet, Théo Uscidda, Adam David, Pierre-Cyril Aubin-Frankowski, and Anna Korba. Mirror and preconditioned gradient descent in Wasserstein space. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 25311-25374. Curran Associates, Inc., 2024.
Siwan Boufadène and François-Xavier Vialard. On the global convergence of Wasserstein gradient flow of the Coulomb discrepancy, 2024. URL https://arxiv.org/abs/2312.00800.
Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.
Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7:331-368, 2007.
Claudio Carmeli, Ernesto De Vito, Alessandro Toigo, and Veronica Umanitá. Vector-valued reproducing kernel Hilbert spaces and universality. Analysis and Applications, 8(01):19-61, 2010.
Niladri Chatterji, Jelena Diakonikolas, Michael I. Jordan, and Peter Bartlett. Langevin Monte Carlo without smoothness. In S. Chiappa and R. Calandra, editors, International Conference on Artificial Intelligence and Statistics, pages 1716-1726. PMLR, 2020.
Fan Chen, Yiqing Lin, Zhenjie Ren, and Songbo Wang. Uniform-in-time propagation of chaos for kinetic mean field Langevin dynamics. Electronic Journal of Probability, 29:1-43, 2024.
Zonghao Chen, Toni Karvonen, Heishiro Kanagawa, François-Xavier Briol, and Chris Oates. Stationary MMD points for cubature, 2025. URL https://arxiv.org/abs/2505.20754.
Sinho Chewi, Thibaut Le Gouic, Chen Lu, Tyler Maunu, and Philippe Rigollet. SVGD as a kernelized Wasserstein gradient flow of the chi-squared divergence. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 2098-2109. Curran Associates, Inc., 2020.
Sinho Chewi, Murat A. Erdogdu, Mufan Li, Ruoqi Shen, and Matthew S. Zhang. Analysis of Langevin Monte Carlo from Poincaré to log-Sobolev. Foundations of Computational Mathematics, pages 1-51, 2024.
Lénaïc Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
Katy Craig, Karthik Elamvazhuthi, Matt Haberland, and Olga Turanova. A blob method for inhomogeneous diffusion with applications to multi-agent control and sampling. Mathematics of Computation, 2023.
Katy Craig, Matt Jacobs, and Olga Turanova. Nonlocal approximation of slow and fast diffusion. Journal of Differential Equations, 426:782-852, 2025.
Felipe Cucker and Ding Xuan Zhou. Learning Theory: An Approximation Theory Viewpoint, volume 24. Cambridge University Press, 2007.
Arnak Dalalyan. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. In S. Kale and O. Shamir, editors, Conference on Learning Theory, pages 678-689. PMLR, 2017.
Arnak S. Dalalyan and Avetik Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):5278-5311, 2019.
Arnak S. Dalalyan, Avetik Karagulyan, and Lionel Riou-Durand. Bounding the error of discretized Langevin algorithms for non-strongly log-concave targets. Journal of Machine Learning Research, 23(235):1-38, 2022.
Andrew Duncan, Nikolas Nüsken, and Lukasz Szpruch. On the geometry of Stein variational gradient descent. Journal of Machine Learning Research, 24(56):1-39, 2023.
Alain Durmus, Szymon Majewski, and Błażej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(1):2666-2711, 2019.
Heinz Werner Engl, Martin Hanke, and Andreas Neubauer. Regularization of Inverse Problems, volume 375. Springer Science & Business Media, 1996.
Vassiliy A. Epanechnikov. Non-parametric estimation of a multivariate probability density. Theory of Probability & Its Applications, 14(1):153-158, 1969.
Jean Feydy, Thibault Séjourné, François-Xavier Vialard, Shun-ichi Amari, Alain Trouvé, and Gabriel Peyré. Interpolating between optimal transport and MMD using Sinkhorn divergences. In K. Chaudhuri and M. Sugiyama, editors, International Conference on Artificial Intelligence and Statistics, pages 2681-2690. PMLR, 2019.
Simon Fischer and Ingo Steinwart. Sobolev norm learning rates for regularized least-squares algorithms. Journal of Machine Learning Research, 21(205):1-38, 2020.
Nicolas Fournier and Arnaud Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4):707-738, 2015.
Jean-Yves Franceschi, Mike Gartrell, Ludovic Dos Santos, Thibaut Issenhuth, Emmanuel de Bézenac, Mickaël Chen, and Alain Rakotomamonjy. Unifying GANs and score-based diffusion as generative particle models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 36. Curran Associates, Inc., 2024.
Alexandre Galashov, Valentin De Bortoli, and Arthur Gretton. Deep MMD gradient flow without adversarial training. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=Pf85K2wtz8.
Yuan Gao, Yuling Jiao, Yang Wang, Yao Wang, Can Yang, and Shunkang Zhang. Deep generative learning via variational gradient flow. In K. Chaudhuri and R. Salakhutdinov, editors, International Conference on Machine Learning, pages 2093-2101. PMLR, 2019.
Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorn divergences. In A. Storkey and F. Perez-Cruz, editors, International Conference on Artificial Intelligence and Statistics, pages 1608-1617. PMLR, 2018.
Pierre Glaser, Michael Arbel, and Arthur Gretton. KALE flow: A relaxed KL gradient flow for probabilities with disjoint support. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 8018-8031. Curran Associates, Inc., 2021.
Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(1):723-773, 2012.
Thomas Hakon Gronwall. Note on the derivatives with respect to a parameter of the solutions of a system of differential equations. Annals of Mathematics, 20(4):292-296, 1919.
Hyemin Gu, Panagiota Birmpa, Yannis Pantazis, Luc Rey-Bellet, and Markos A. Katsoulakis. Lipschitz-regularized gradient flows and generative particle algorithms for high-dimensional scarce data. SIAM Journal on Mathematics of Data Science, 6(4):1205-1235, 2024.
Omar Hagrass, Bharath K. Sriperumbudur, and Bing Li. Spectral regularized kernel two-sample tests. The Annals of Statistics, 52(3):1076-1101, 2024a.
Omar Hagrass, Bharath K. Sriperumbudur, and Bing Li. Spectral regularized kernel goodness-of-fit tests. Journal of Machine Learning Research, 25(309):1-52, 2024b.
Zaïd Harchaoui, Francis Bach, and Eric Moulines. Testing for homogeneity with kernel Fisher discriminant analysis. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007.
Ye He, Krishna Balasubramanian, Bharath K. Sriperumbudur, and Jianfeng Lu. Regularized Stein variational gradient flow. Foundations of Computational Mathematics, pages 1-59, 2024.
Johannes Hertrich, Robert Beinert, Manuel Gräf, and Gabriele Steidl. Wasserstein gradient flows of the discrepancy with distance kernel on the line. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 431-443. Springer, 2023.
Johannes Hertrich, Manuel Gräf, Robert Beinert, and Gabriele Steidl. Wasserstein steepest descent flows of discrepancies with Riesz kernels. Journal of Mathematical Analysis and Applications, 531(1):127829, 2024a.
Johannes Hertrich, Christian Wald, Fabian Altekrüger, and Paul Hagemann. Generative sliced MMD flows with Riesz kernels. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=VdkGRV1vcf.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6840-6851. Curran Associates, Inc., 2020.
Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker-Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1-17, 1998.
Mark Kac. Foundations of kinetic theory. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 3, pages 171-197, 1956.
Achim Klenke. Probability Theory: A Comprehensive Course. Springer Science & Business Media, 2013.
Benoit Kloeckner. Approximation by finitely supported measures. ESAIM: Control, Optimisation and Calculus of Variations, 18(2):343-359, 2012.
Anna Korba, Adil Salim, Michael Arbel, Giulia Luise, and Arthur Gretton. A non-asymptotic analysis for Stein variational gradient descent. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 4672-4682. Curran Associates, Inc., 2020.
Anna Korba, Pierre-Cyril Aubin-Frankowski, Szymon Majewski, and Pierre Ablin. Kernel Stein discrepancy descent. In M. Meila and T. Zhang, editors, International Conference on Machine Learning, pages 5719-5730. PMLR, 2021.
A. Ya. Kruger. On Fréchet subdifferentials. Journal of Mathematical Sciences, 116(3):3325-3358, 2003.
Marc Lambert, Sinho Chewi, Francis Bach, Silvère Bonnabel, and Philippe Rigollet. Variational inference via Wasserstein gradient flows. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 14434-14447. Curran Associates, Inc., 2022.
Jing Lei. Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces. Bernoulli, 26(1):767-798, 2020.
Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabas Poczos. MMD GAN: Towards deeper understanding of moment matching network. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
Lingxiao Li, Qiang Liu, Anna Korba, Mikhail Yurochkin, and Justin Solomon. Sampling with mollified interaction energy descent. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=zWy7dqOcel.
Tengyuan Liang and Hai Tran-Bach. Mehler's formula, branching process, and compositional kernels of deep neural networks. Journal of the American Statistical Association, 117(539):1324-1337, 2022.
Qiang Liu. Stein variational gradient descent as gradient flow. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
Song Liu, Jiahao Yu, Jack Simons, Mingxuan Yi, and Mark Beaumont. Minimizing f-divergences by interpolating velocity fields. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp, editors, International Conference on Machine Learning, pages 32308-32331. PMLR, 2024a.
Tianle Liu, Promit Ghosal, Krishna Balasubramanian, and Natesh Pillai. Towards understanding the dynamics of Gaussian-Stein variational gradient descent. In Advances in Neural Information Processing Systems, volume 36, 2024b.
Antoine Liutkus, Umut Simsekli, Szymon Majewski, Alain Durmus, and Fabian-Robert Stöter. Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. In K. Chaudhuri and R. Salakhutdinov, editors, International Conference on Machine Learning, pages 4104-4113. PMLR, 2019.
Sebastian Mika, Gunnar Rätsch, Jason Weston, Bernhard Schölkopf, and Klaus-Robert Müller. Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop (Cat. No.98TH8468), pages 41-48, 1999.
XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847-5861, 2010.
Atsushi Nitanda, Denny Wu, and Taiji Suzuki. Convex analysis of the mean field Langevin dynamics. In G. Camps-Valls, F. J. R. Ruiz, and I. Valera, editors, International Conference on Artificial Intelligence and Statistics, pages 9741-9757. PMLR, 2022.
Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
Shin-ichi Ohta and Asuka Takatsu. Displacement convexity of generalized relative entropies. Advances in Mathematics, 228(3):1742-1787, 2011.
Victor M. Panaretos and Yoav Zemel. An Invitation to Statistics in Wasserstein Space. Springer, 2020.
Loucas Pillaud-Vivien, Francis Bach, Tony Lelièvre, Alessandro Rudi, and Gabriel Stoltz. Statistical estimation of the Poincaré constant and application to sampling multimodal distributions. In S. Chiappa and R. Calandra, editors, International Conference on Artificial Intelligence and Statistics, pages 2753-2763. PMLR, 2020.
Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pages 547-562. University of California Press, 1961.
Walter Rudin. Principles of Mathematical Analysis, volume 3. McGraw-Hill, New York, 1976.
Filippo Santambrogio. Euclidean, metric, and Wasserstein gradient flows: An overview. Bulletin of Mathematical Sciences, 7:87-154, 2017.
Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton, and Kenji Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41(5):2263-2291, 2013.
Jiaxin Shi and Lester Mackey. A finite-particle convergence rate for Stein variational gradient descent. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 36. Curran Associates, Inc., 2024.
Tao Shi, Mikhail Belkin, and Bin Yu. Data spectroscopy: Eigenspaces of convolution operators and clustering. The Annals of Statistics, pages 3960-3984, 2009.
Carl-Johann Simon-Gabriel, Alessandro Barp, Bernhard Schölkopf, and Lester Mackey. Metrizing weak convergence with maximum mean discrepancies. Journal of Machine Learning Research, 24(184):1-20, 2023.
Jack Simons, Song Liu, and Mark Beaumont. Variational likelihood-free gradient descent. In Fourth Symposium on Advances in Approximate Bayesian Inference, 2022.
Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
Bharath K. Sriperumbudur. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3):1839-1893, 2016.
Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517-1561, 2010.
Bharath K. Sriperumbudur, Kenji Fukumizu, and Gert R. G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12(7):2389-2410, 2011.
Viktor Stein, Sebastian Neumayer, Nicolaj Rux, and Gabriele Steidl. Wasserstein gradient flows for Moreau envelopes of f-divergences in reproducing kernel Hilbert spaces. Analysis and Applications, pages 1-45, 2025.
Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.
Ingo Steinwart and Clint Scovel. Mercer's theorem on general domains: On the interaction between measures, kernels, and RKHSs. Constructive Approximation, 35:363-417, 2012.
Taiji Suzuki, Atsushi Nitanda, and Denny Wu. Uniform-in-time propagation of chaos for the mean-field gradient Langevin dynamics. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=_JScUk9TBUn.
Nicolas Garcia Trillos and Daniel Sanz-Alonso. The Bayesian update: Variational formulations and gradient flows. Bayesian Analysis, 15(1):29-56, 2020.
Tim Van Erven and Peter Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797-3820, 2014.
Santosh Vempala and Andre Wibisono. Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R.
Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
Cédric Villani et al. Optimal Transport: Old and New, volume 338. Springer, 2009.
Holger Wendland. Scattered Data Approximation, volume 17. Cambridge University Press, 2004.

Appendix A: Further Background on (P2(Rd), W2)

Let µ and π be two probability measures in P2(Rd) and let Π(µ, π) denote the set of all admissible transport plans between µ and π, i.e., Π(µ, π) = {Γ ∈ P(Rd × Rd) : (proj1)#Γ = µ, (proj2)#Γ = π}, where proj1 and proj2 respectively stand for the projection maps (x, y) ↦ x and (x, y) ↦ y, and # is the pushforward operator. The Wasserstein-2 distance between µ and π is then defined as

W2(µ, π) = ( inf_{Γ ∈ Π(µ,π)} ∫ ‖x − y‖² dΓ(x, y) )^{1/2},

and (P2(Rd), W2) is a metric space called the Wasserstein space (Panaretos and Zemel, 2020). Brenier's theorem guarantees that if µ is an absolutely continuous measure, then the optimal transport plan is unique and is induced by a map, Γ = (I, T)#µ with T#µ = π (Santambrogio, 2017). T can also be expressed as T(x) = x + ∇φ(x), where φ is known as the Kantorovich potential and is differentiable µ-a.e. For an absolutely continuous µ and the optimal transport map T such that T#µ = π, the shortest path (ρt)_{0≤t≤1} from µ to π, called the (Wasserstein) geodesic, is given by

ρt = ((1 − t)I + tT)#µ = (I + t∇φ)#µ.

Therefore, in this paper, we always use (I + t∇φ)#µ with φ ∈ C_c^∞(Rd)⁸ to define a geodesic curve that starts at µ. Define ϕt : Rd → Rd, x ↦ x + t∇φ(x); then ωt : Rd → Rd, x ↦ ∇φ(ϕt⁻¹(x)) becomes the optimal transport map from ρt to π. Notice that ‖ωt‖_{L²(ρt)} = ‖∇φ‖_{L²(µ)} for all t ∈ [0, 1], so (ρt)_{0≤t≤1} is also a constant-speed geodesic. The notion of a constant-speed geodesic is crucial in the introduction of geodesic convexity below. A functional F : P2(Rd) → R is geodesically convex if for any µ and π, the following inequality holds:

F(ρt) ≤ (1 − t)F(µ) + tF(π) for all t ∈ [0, 1], (A.1)

where (ρt)_{t∈[0,1]} is the constant-speed geodesic between µ and π. The geodesic convexity of F can be equivalently characterized through the Wasserstein Hessian (Villani et al., 2009). Geodesic convexity ought not to be confused with mixture convexity, which replaces the displacement geodesic ρt in (A.1) with the mixture geodesic νt = (1 − t)µ + tπ.

8. φ is compactly supported because the tangent space of µ ∈ P2(Rd) is the closure of {∇ψ : ψ ∈ C_c^∞(Rd)} in L²(µ) (Ambrosio et al., 2005, Definition 8.4.1).

Appendix B: Auxiliary Results

In this appendix, we collect all technical results required to prove the main results of the paper.

Lemma B.1 For µ ≪ π, the χ²-divergence admits the following variational form:

χ²(µ‖π) = sup_{h ∈ L²(π)} ∫ h dµ − ∫ (h²/4 + h) dπ,

where it is sufficient to restrict the variational set to L²(π), in contrast to the set of all measurable functions for general f-divergences (Nowozin et al., 2016; Nguyen et al., 2010).

Proof For µ ≪ π, we have

χ²(µ‖π) = sup_h ∫ h dµ − ∫ (h²/4 + h) dπ = sup_h ∫ (h dµ/dπ − h²/4 − h) dπ.

Pointwise maximization shows that the supremum is attained at h = 2(dµ/dπ − 1), with value χ²(µ‖π) = ∫ (dµ/dπ − 1)² dπ, which is finite if and only if dµ/dπ − 1 ∈ L²(π). Therefore, it is sufficient to consider the above maximization over L²(π).

Lemma B.2 Under Assumptions 1 and 2, the mappings x ↦ k(x, ·) and x ↦ ∇₁k(x, ·) are differentiable and Lipschitz:

‖k(x, ·) − k(y, ·)‖_H ≤ √(K1d) ‖x − y‖,   ‖∇₁k(x, ·) − ∇₁k(y, ·)‖_{H^d} ≤ √(K2d) ‖x − y‖.

Proof This is Lemma 7 from Glaser et al. (2021).

Lemma B.3 Under Assumptions 1 and 2, the regularized kernel k defined in (6) satisfies the following properties: 1. |k(x, y)| ≤ K/λ for all x, y ∈ Rd. 2.
i k(x, y) = (Σπ + λI) 1 ik(x, ), k(y, ) 1 k(x, y) 2 = Pd i=1 i k(x, y)2 KK1d Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur 4. i i+d k(x, y) = (Σπ + λI) 1 ik(x, ), ik(y, ) 1 2 k(x, y) 2 F = Pd i=1 i i+d k(x, y)2 K1d 6. i j k(x, y) = (Σπ + λI) 1 i jk(x, ), k(y, ) H1 k(x, y) 2 F = Pd i,j=1 i j k(x, y)2 KK2d 1 k (x, x ) 1 k (y, y ) KK2d λ ( x y + x y ). Proof Notice that k(x, y) = D (Σπ + λI) 1 k(x, ), k(y, ) E λ k(x, ) H k(y, ) H K so the first bullet point is proved. Before we prove the second bullet point, we first prove the differentiability of x 7 k(x, y). For i {1, , d}, consider h R and denote i Rd as a vector of all 0 except the value at i being equal to h. Then, for any y Rd, k(x + i, y) k(x, y) h = lim h 0 (Σπ + λI) 1 (k(x + i, ) k(x, )) , k(y, ) K λ k(x, ) k(x + i, ) H So x 7 k(x, y) is differentiable for any y Rd. Since the kernel k is differentiable per Assumption 2, for any f H, xif(x) = ik(x, ), f i k(x, y) = xi D k(x, ), (Σπ + λI) 1 k(y, ) E H = D ik(x, ), (Σπ + λI) 1 k(y, ) E So the second bullet point is proved. Next, notice that 1 k(x, y) 2 = D ik(x, ), (Σπ + λI) 1 k(y, ) E2 i=1 ik(x, ) 2 H k(y, ) 2 H λ2 1k(x, ) 2 Hd k(y, ) 2 H KK1d So the third bullet point is proved. Similar arguments above lead to bullet points 4 to 7. Finally, to prove bullet point 8, notice that 1 k(x, x ) 1 k(y, x ) 2 = D ( ik(x, ) ik(y, )) , (Σπ + λI) 1 k(x , ) E2 j=1 ik(x, ) ik(y, ) 2 H k(x , ) 2 H 1 λ2 1k(x, ) 1k(y, ) 2 Hd k(x , ) 2 H (De)-regularized Maximum Mean Discrepancy Gradient Flow where the last inequality uses Lemma B.2. Therefore, 1 k(x, x ) 1 k(y, y ) 1 k(x, x ) 1 k(y, x ) + 1 k(y, x ) 1 k(y, y ) λ x y + x y . So the bullet point 8 is proved. Lemma B.4 For any two distributions µ0, µ1 P2(Rd), with associated Dr MMD witness functions h µ0,π, h µ1,π defined in Proposition 3.2, we have h µ1,π h µ0,π H 2 K1d λ W2 (µ0, µ1) , and h µ0,π H 2 Both the witness function h µ,π and its gradient h µ,π are Lipschitz continuous, i.e., h µ1,π(x) h µ0,π(y) L W2(µ0, µ1) + x y ; h µ1,π(x) h µ0,π(y) L W2(µ0, µ1) + x y , where the constant L = 1 λ max 2 KK1d, 2 KK2d, 2K1d . Proof Let γ Γ(µ0, µ1) be the optimal coupling between µ0 and µ1. Then h µ1,π h µ0,π 2 H = 4 Σπ + λI 1 Z k(x, )d(µ1 µ0) Z k(x, ) k(y, )dγ(x, y) Z x y 2 dγ(x, y) = 4K1d λ2 W 2 2 (µ0, µ1), where the first inequality holds because Σπ is a positive and self-adjoint operator, and the second inequality uses Lemma B.2. Also note that h µ0,π H = 2 Σπ + λI 1 Z k(x, )dµ0 Z k(x, )dπ H 2 So the first part has been proved. Furthermore, note that h µ1,π(x) h µ0,π(y) h µ1,π(x) h µ0,π(x) + h µ0,π(x) h µ0,π(y) k(x, ) H h µ1,π h µ0,π H + k(x, ) k(y, ) H h µ0,π H λ W2(µ0, µ1) + 2 KK1d W2(µ0, µ1) + x y L W2(µ0, µ1) + x y and h µ1,π(x) h µ0,π(y) h µ1,π(x) h µ0,π(x) + h µ0,π(x) h µ0,π(y) Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur 1k(x, ) Hd h µ1,π h µ0,π H + 1k(x, ) 1k(y, ) Hd h µ0,π H λ W2(µ0, µ1) + 2 KK2d λ x y L W2(µ0, µ1) + x y and the result follows. Lemma B.5 Given two probability measures µ π that are both absolutely continuous with respect to Lebesgue measure. If dµ dπ 1 Ran(T r π ) with r > 0, i.e., there exists q L2(π) such that dµ dπ 1 = T r π q, then dπ 1 L2(π) 2λr q L2(π) , where h = 2(Σπ + λI) 1(mπ mµ) is defined in Proposition 3.2. Proof Given the assumption that dµ dπ 1 Ran (T r π ) with r > 0, there exists q L2(π) such that dµ dπ 1 = T r π q and dµ dπ 1, ei L2(π) = ϱr i q, ei L2(π). 
Since µ π, the Mercer decomposition of k in (1) holds for any x in the support of µ and in the support of π, mµ mπ = Z k(x, )d(µ π)(x) = X Z ei(x)d(µ π)(x) ei = ϱi So we have, h 2 dµ dπ 1 L2(π) = 2 (Σπ + λI) 1 (mµ mπ) dµ λϱr i ϱi + λ q, ei L2(π)ei L2(π) 2λr q L2(π) , where the last inequality is obtained by using λϱr i ϱi + λ = ϱi ϱi + λ Hence the proof. (De)-regularized Maximum Mean Discrepancy Gradient Flow Lemma B.6 Let ρ P2 Rd and φ C c Rd . Consider the path (ρs)0 s 1 from ρ to (I + φ)#ρ given by ρs = (I + s φ)#ρ. Define ϕs : Rd Rd, x 7 x + s φ(x). Let k be the regularized kernel defined in (6) along with its associated RKHS H. The mapping s 7 Dr MMD(ρs π) is continuous and differentiable, and its first-order time derivative is given by d ds Dr MMD(ρs||π) = 2(1 + λ) Z φ(x) Z 1 k(ϕs(x), ϕs(z))dρ(z) Z 1 k(ϕs(x), z)dπ(z) dρ(x). (B.2) s=0 Dr MMD(ρs||π) = 2(1 + λ) Z φ(x) Z 1 k(x, z)dρ(z) Z 1 k(x, z)dπ(z) dρ(x). (B.3) Additionally, the mapping s d ds Dr MMD(ρs π) is continuous and differentiable, and the second-order time derivative of Dr MMD(ρs π) is given by ds2 Dr MMD(ρs||π) = 2(1 + λ) ZZ φ(x) 1 2 k(ϕs(x), ϕs(z)) φ(z)dρ(x)dρ(z) (B.4) + 2(1 + λ) Z φ(x) Z H1 k (ϕs(x), ϕs(z)) dρ(z) Z H1 k (ϕs(x), z) dπ(z) φ(x)dρ(x), s=0 Dr MMD(ρs||π) = 2(1 + λ) ZZ φ(x) 1 2 k(x, z) φ(z)dρ(x)dρ(z) + 2(1 + λ) Z φ(x) Z H1 k (x, z) dρ(z) Z H1 k (x, z) dπ(z) φ(x)dρ(x). (B.5) Proof Recall that Dr MMD is, up to a multiplicative factor of (1 + λ), MMD2 of the regularized kernel k defined in (6) and the associated RKHS H. From Lemma B.3, we know that assumptions (A) and (B) of Arbel et al. (2019) are satisfied for the regularized kernel k, so using Lemma 22 and Lemma 23 from Arbel et al. (2019), (B.2) and (B.4) are proved. Then (B.3) and (B.5) are subsequently proved by taking s = 0. Lemma B.7 For µ0 π, define h = 2(Σπ + λI) 1(mπ mµ0) and ϕt : Rd Rd, x 7 x t(1 + λ) h(x). Suppose π exp( V ), HV βI, and the step size γ satisfies χ2(µ0 π)K2d for some constant 1 < ζ < 2. Then, for t [0, γ], the following inequalities hold, Z h(x) HV (ϕt(x)) h(x)dµ0(x) 4βχ2(µ0 π)K1d λ ; (B.6) Z Hh(x) (I t(1 + λ)Hh(x)) 1 2 F dµ0(x) 4ζ2χ2(µ0 π)K2d Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur Proof First, recall (B.1), h(x) 2 = 4 (Σπ + λI) 1(mπ mµ0)(x) 2 L2(π) ei(x) For any j {1, , d}, consider g M0(x) := X L2(π) jei(x) i M0 ϱi jei(x)2 i M0 ϱi jei(x)2 dπ 1 L2(π) . j=1 ϱi ( jei(x))2 = X j=1 ϱi jk(x, ), ei 2 H = X j=1 jk(x, ), ϱiei 2 H j=1 jk(x, ) 2 H = 1k(x, ) 2 H K1d, (B.9) so P i M0 ϱi( jei(x))2 converges uniformly to 0, and hence g M0(x) also converges uniformly to 0. Therefore, we are allowed to interchange the derivative and the infinite sum (Rudin, 1976) in (B.8) to achieve, L2(π) ei(x) ϱi (ϱi + λ)2 i 1 ϱi ei(x) 2 i 1 ϱi ei(x) 2 4χ2(µ0 π)K1d The first inequality follows from Cauchy Schwartz, the penultimate inequality follows by noting that ϱi (ϱi+λ)2 1 λ, and the last inequality follows from (B.9). Given HV βI, (B.6) is proved by the following, Z h(x) HV (ϕt(x)) h(x)dµ0(x) β Z h(x) 2 dµ0(x) 4βχ2(µ0 π)K1d (De)-regularized Maximum Mean Discrepancy Gradient Flow We now turn to proving the second statement. Similarly to (B.10), we have Hh(x) 2 F 4χ2(µ0 π)K2d Using 2(1 + λ)γ q χ2(µ0 π)K2d ζ for some constant 1 < ζ < 2, the inverse of I t(1 + λ)Hh(x) can be represented by the Neumann series, and hence (I t(1 + λ)Hh(x)) 1 F X m 0 t(1 + λ)Hh(x) m F X χ2(µ0 π)K2d m = ζ. 
(B.12) Therefore, (B.7) is proved by combining (B.11) and (B.12), Z Hh(x) (I t(1 + λ)Hh(x)) 1 2 Z Hh(x) 2 F (I t(1 + λ)Hh(x)) 1 2 F dµ0(x) 4ζ2χ2(µ0 π)K2d and the result follows. Lemma B.8 Let H be a separable Hilbert space, ξ1, . . . , ξn : Ω H are n identical independent H-valued random variables satisfying ξi H B. Then i=1 ξi E[ξ1] 2πB 2 n , and E i=1 ξi E[ξ1] Proof We know from Corollary 6.15 of Steinwart and Christmann (2008) that i=1 ξi E [ξ1] 2 exp 2nt2/B2 . Then, denote R := 1 n Pn i=1 ξi E[ξ1] H, we know that 0 P(R t)dt 2 Z 0 exp( 2nt2/B2)dt = The other part is proved similarly. Lemma B.9 (Wasserstein Hessian of Fχ2) Let ρ P2(Rd) and φ C c (Rd). Consider the path (ρs)0 s 1 from ρ to (I + φ)#ρ given by ρs = (I + s φ)#ρ. Define ϕs : Rd Rd, x 7 x + s φ(x) and ωs : Rd Rd, x 7 φ ϕ 1 s (x). For π(x) exp( V (x)), the second order derivative of s χ2(ρs π) is given by ds2 χ2(ρs π) = Z ρs(x) V (x) ωs(x) ωs(x) 2 dρs(x) Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur π(x) ωs(x) HV (x)ωs(x)dρs(x) + Z ρs(x) π(x) ωs(x) 2 F dρs(x). (B.13) Equivalently, the second order derivative of s χ2(ρs π) can also be written as ds2 χ2(ρs π) = 2 Z (ωs(x)ρs(x)) 1 π(x) + 2 Z ωs(x) H ρs(x) ωs(x)ρs(x)dx. (B.14) Proof (B.13) is provided in the Example 15.9 of Villani et al. (2009) by taking m = 2. Now, we are going to prove (B.14). For ease of notation in the following derivations, we are going to drop the function input x in ρs, π, and ωs. We introduce colors to picture grouping of terms that will carry over during chains of calculation. In order to prove (B.14), we need to expand the terms in (B.13) accordingly. We denote the three terms in the RHS of (B.13) as (A), (B) and (C). Consider (A) = Z ρ2 s π V ωs ωs 2 dx Z ρ2 s π3 ( π ωs)2dx | {z } (A1) + 2 Z ρ2 s π2 ( π ωs) ωsdx | {z } (A2) Z ρ2 s π ( ωs)2dx | {z } (A3) Then we are going to use integration by parts for (A2) and (A3). (A2) = 2 Z ρ2 s π2 ( π ωs) ωsdx = 2 Z ω s ρ2 s π2 ( π ωs) dx = 4 Z (ω s ρs)( π ωs) ρs π2 dx + 4 Z (ω s π)2 ρ2 s π3 dx 2 Z (ω s Hπωs) ρ2 s π2 dx 2 Z (ω s ωs π) ρ2 s π2 dx, (A3) = Z ρ2 s π ( ωs)2dx = Z (ωs) ωs ρ2 s π dx = Z ω s ωs ρ2 s π = Z ω s ( ωs)ρ2 s π dx | {z } (A31) Z ωs(ω s ρs)2ρs π dx | {z } (A32) + Z ωs(ω s π) ρ2 s π2 dx | {z } (A33) Furthermore, we use integration by parts for (A32) and (A33), we have (A32) = 2 Z ωs(ω s ρs)ρs π dx = 2 Z (ω s (ω s ρs)ρs = 2 Z (ω s ωs ρs)ρs π dx + 2 Z (ω s Hρsωs)ρs π dx + 2 Z (ω s ρs)2 1 (De)-regularized Maximum Mean Discrepancy Gradient Flow 2 Z (ω s ρs)(ω s π) ρs (A33) = Z ωs(ω s π) ρ2 s π2 dx = Z (ω s (ω s π) ρ2 s π2 = Z (ω s ωs π) ρ2 s π2 dx Z (ω s Hπωs) ρ2 s π2 dx 2 Z (ω s ρs)(ω s π) ρs + 2 Z (ω s π)2 ρ2 s π3 dx. Having completed (A), now we turn to (B). (B) = Z ρ2 s π ω s HV ωsdx = Z ω s Hπωs ρ2 s π2 dx + Z ( π ωs)2 ρ2 s π3 dx. So, combining (A), (B) and (C), we have ds2 χ2(ρs π) = (A) + (B) + (C) = (A1) + (A2) + (A31) + (A32) + (A33) + (B) + (C) Z ρ2 s π3 ( π ωs)2dx 4 Z (ω s ρs)( π ωs) ρs π2 dx + 4 Z (ω s π)2 ρ2 s π3 dx 2 Z (ω s Hπωs) ρ2 s π2 dx 2 Z (ω s ωs π) ρ2 s π2 dx + (A31) + (A32) + (A33) Z (ω s Hπωs) ρ2 s π2 dx + Z ( π ωs)2 ρ2 s π3 dx + Z ρ2 s π ωs 2 F dx. Since ωs = φ ϕ 1 s is a mapping from Rd to Rd, denote ωs,i := φ ϕ 1 s i which is a mapping from Rd to R. Notice that ωs,i jωs,i vanishes at boundary because φ C c . Hence, ωs,i jωs,i ρ2 s π Z jωs,i jωs,i ρ2 s π dx + X Z ωs,i jjωs,i ρ2 s π dx Z ωs,i jωs,i Z ρ2 s π ωs 2 F dx + Z ρ2 s π (ω s ( ωs))dx + Z (ω s ωs ρs)2ρs Z (ω s ωs π) ρ2 s π2 dx. 
Therefore, by replacing R ρ2 s π ωs 2 F dx, and noticing that R ρ2 s π (ω s ( ωs))dx is exactly (A31), we have the following: ds2 χ2(ρs π) = Z ρ2 s π3 ( π ωs)2dx 4 Z (ω s ρs)( π ωs) ρs π2 dx + 4 Z (ω s π)2 ρ2 s π3 dx Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur 2 Z (ω s Hπωs) ρ2 s π2 dx 2 Z (ω s ωs π) ρ2 s π2 dx + (A31) + (A32) + (A33) Z (ω s Hπωs) ρ2 s π2 dx + Z ( π ωs)2 ρ2 s π3 dx Z ρ2 s π (ω s ( ωs))dx Z (ω s ωs ρs)2ρs π dx + Z (ω s ωs π) ρ2 s π2 dx. Next, we combine the terms and obtain = 2((A31) + (A32) + (A33)) + 6 Z ρ2 s π3 ( π ωs)2dx 4 Z (ω s ρs)( π ωs) ρs 3 Z (ω s Hπωs) ρ2 s π2 dx Z (ω s ωs π) ρ2 s π2 dx 2 Z (ω s ωs ρs)ρs π dx (A32) (A33). Recall that (A31)+(A32)+(A33) = (A3), and replacing (A32) and (A33) with their expressions, we have Z ρ2 s π ( ωs)2dx + 6 Z ρ2 s π3 ( π ωs)2dx 4 Z (ω s ρs)( π ωs) ρs 3 Z (ω s Hπωs) ρ2 s π2 dx Z (ω s ωs π) ρ2 s π2 dx 2 Z (ω s ωs ρs)ρs 2 Z (ω s ωs ρs)ρs π dx + 2 Z (ω s Hρsωs)ρs π dx + 2 Z (ω s ρs)2 1 π dx 2 Z (ω s ρs)(ω s π) ρs Z (ω s ωs π) ρ2 s π2 dx Z (ω s Hπωs) ρ2 s π2 dx 2 Z (ω s ρs)(ω s π) ρs π2 dx + 2 Z (ω s π)2 ρ2 s π3 dx = 2 Z (ω s Hρsωs)ρs π dx 2 Z (ω s Hπωs) ρ2 s π2 dx 4 Z (ω s ρs)(ω s π) ρs π2 dx + 4 Z ρ2 s π3 ( π ωs)2dx | {z } (M) + 2 Z ρ2 s π ( ωs)2dx 2 Z (ω s ρs)2 1 π dx 4 Z (ω s ωs ρs)ρs π dx | {z } (N1) 4 Z (ω s Hρsωs)ρs π dx + 4 Z (ω s ρs)(ω s π) ρs π2 dx | {z } (N2) Now we analyze (M) and (N1) + (N2) separately. Notice that (M) = 2 Z ω s H ρs (N1) + (N2) = 2 Z ρ2 s π ( ωs)2dx + 2 Z (ω s ρs)2 1 πdx 4 Z (ω s ωs ρs)ρs 4 Z (ω s Hρsωs)ρs π dx 4 Z (ω s ρs)2 1 πdx + 4 Z (ω s ρs)(ω s π) ρs = 2 Z ρ2 s π ( ωs)2dx + 2 Z (ω s ρs)2 1 πdx 4 Z (ω s (ω s ρs))ρs 4 Z (ω s ρs)(ω s ρs (De)-regularized Maximum Mean Discrepancy Gradient Flow = 2 Z ρ2 s π ( ωs)2dx + 2 Z (ω s ρs)2 1 πdx + 4 Z ωs(ω s ρs)ρs = 2 Z (ωsρs) 1 Since d2 ds2 χ2(ρs π) = (M) + (N1) + (N2), Lemma B.9 is proved. Appendix C: An Illustrative Example for Explicit Forms of It, Jt, qt L2(π) Consider an illustrative example where we simulate the Dr MMD gradient flow when the target is a one-dimensional Gaussian target distribution π = N(0, σ2) and the initialization is also a one-dimensional Gaussian µ0 = N(0, 1 2 σ2). We take a Gaussian kernel k(x, y) = exp( 1 2(x y)2) whose eigenvalues and eigenfunctions in its Mercer decomposition have the following closed form expressions (Shi et al., 2009, Proposition 1), 1 2 σ2 1 2 σ2 + c + 0.5 i , ei(x) = β i!2i exp( cx2)Hi where β = (1 + 4 σ2)1/4, c = β2 1 4 σ2 and Hi is the i-th Hermite polynomial function. We pick σ2 > 2 so β2 > 3. The Gaussian kernel is continuous, bounded and c0-universal as required in Assumption 1. It also possesses bounded firstand second-order derivatives, thereby satisfying Assumption 2. Consider the Dr MMD gradient flow (µt)t 0 defined in (7) along with its particle update scheme dxt = (1 + λ) hµt,π(xt)dt, where hµt,π is the witness function defined in (32) and λ is a positive regularization parameter. Denote mt, σ2 t as the mean and covariance of µt, respectively, then we have the following update scheme for mt, σ2 t proved in Lemma C.1: dmt = (1 + λ) Ext µt[ hµt,π(xt)] dt, d(σ2 t ) = 2(1 + λ) Ext µt[ hµt,π(xt) xt] dt + 2(1 + λ) Ext µt[ hµt,π(xt)] mt dt, where the expectations are taken over xt µt. 
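To make the particle scheme above concrete, here is a minimal NumPy sketch (not the implementation used for the paper's experiments) that evaluates the witness function and its gradient from samples alone, using the kernel-matrix expression obtained from the Woodbury identity in the proof of Proposition 6.1, and then applies the explicit Euler update x ← x − γ(1 + λ)∇h(x) for the one-dimensional Gaussian example above, with target variance 6 as in the empirical check later in this appendix. The helper names, the fixed λ, the step size, and the (deliberately small) sample sizes and iteration count are illustrative choices; the experiments in the text use an adaptive λ schedule and many more particles.

```python
import numpy as np

def gauss_kernel(a, b, ell=1.0):
    # k(a_i, b_j) = exp(-(a_i - b_j)^2 / (2 ell^2)) for one-dimensional inputs.
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def drmmd_witness(y, x, lam, ell=1.0):
    """Sample-based DrMMD witness h = 2 (Sigma_pi_hat + lam I)^{-1}(m_mu_hat - m_pi_hat):
    returns h at the particles y (~ mu, size N), h at the target samples x (~ pi, size M),
    and dh/dy at the particles, via the kernel-matrix form after Woodbury inversion."""
    N, M = len(y), len(x)
    Kxx = gauss_kernel(x, x, ell)   # M x M
    Kxy = gauss_kernel(x, y, ell)   # M x N
    Kyy = gauss_kernel(y, y, ell)   # N x N
    Kyx = Kxy.T                     # N x M
    # beta solves (Kxx + M*lam*I) beta = (1/N) Kxy 1_N - (1/M) Kxx 1_M.
    beta = np.linalg.solve(Kxx + M * lam * np.eye(M),
                           Kxy.mean(axis=1) - Kxx.mean(axis=1))
    h_y = (2.0 / lam) * (Kyy.mean(axis=1) - Kyx.mean(axis=1) - Kyx @ beta)
    h_x = (2.0 / lam) * (Kxy.mean(axis=1) - Kxx.mean(axis=1) - Kxx @ beta)
    # Gaussian-kernel derivative in the first argument: d/dz k(z, w) = -(z - w)/ell^2 * k(z, w).
    dKyy = -(y[:, None] - y[None, :]) / ell**2 * Kyy
    dKyx = -(y[:, None] - x[None, :]) / ell**2 * Kyx
    grad_h_y = (2.0 / lam) * (dKyy.mean(axis=1) - dKyx.mean(axis=1) - dKyx @ beta)
    return h_y, h_x, grad_h_y

def drmmd_value(h_y, h_x, lam):
    # DrMMD(mu_hat || pi_hat) = (1 + lam) <h/2, m_mu_hat - m_pi_hat>, via the reproducing property.
    return 0.5 * (1.0 + lam) * (h_y.mean() - h_x.mean())

# Particle descent for the 1-D Gaussian example: pi = N(0, 6), mu_0 = N(0, 3).
rng = np.random.default_rng(0)
sigma2 = 6.0
x_tgt = rng.normal(0.0, np.sqrt(sigma2), size=500)       # samples from pi
parts = rng.normal(0.0, np.sqrt(sigma2 / 2), size=300)   # particles from mu_0
lam, gamma = 1e-1, 1e-2                                   # illustrative fixed values
for n in range(200):
    _, _, g = drmmd_witness(parts, x_tgt, lam)
    parts = parts - gamma * (1.0 + lam) * g               # x <- x - gamma*(1+lam)*grad h(x)
h_y, h_x, _ = drmmd_witness(parts, x_tgt, lam)
print("DrMMD estimate:", drmmd_value(h_y, h_x, lam), "particle variance:", parts.var())
```

The descent convention dx_t = −(1 + λ)∇h_{µt,π}(x_t) dt is used here, with the witness written, as in the proof of Proposition 6.1, as h = 2(Σ̂π + λI)⁻¹(m_µ̂ − m_π̂); the kernel lengthscale is 1, matching the Gaussian kernel of this example.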
While the resulting distribution is not necessarily Gaussian, we may follow the existing analysis of Stein variational gradient descent (Liu et al., 2024b) and Langevin Monte Carlo dynamics (Lambert et al., 2022) and approximate xt µt with a Gaussian random variable yt νt = N(mt, σ2 t ); this yields the update scheme: dmt = (1 + λ) Eyt νt[ hνt,π(yt)] dt, d(σ2 t ) = 2(1 + λ) Eyt νt[ hνt,π(yt) yt] dt + 2(1 + λ) Eyt νt[ hνt,π(yt)] mt dt, (C.2) which gives an evolution of Gaussian distributions νt. From (32), the witness function hνt,π admits a decomposition with eigenvalues ϱi and eigenfunctions ei: hνt,π(y) = P i 1 ϱi ϱi+λ dνt dπ 1, ei L2(π)ei(y). Therefore, the velocity field d dyhνt,π(y) can be written as, d dyhνt,π(y) = X Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur Notice that if mt = 0, then for odd i, dνt dπ 1, ei L2(π) = R eidνt R eidπ = 0 because ei is an odd function. When i is even, Eνt[ d dyei(y)] = 0 because y 7 d dyei(y) is an odd function. As a result, if mt = 0, then E[ hνt,π(yt)] = 0 and hence dmt dt = 0 from the update scheme in (C.2). Therefore, as long as we initialize the Dr MMD gradient flow with ν0 = N(0, 1 2 σ2), a zero mean Gaussian distribution, the entire trajectory will remain a zero mean Gaussian distribution N(0, σ2 t ). Next, observe that for a zero-mean Gaussian trajectory, if the initial variance satisfies σ2 0 < σ2, it is natural to expect that the variance increases monotonically toward the target variance σ2 as the Dr MMD flow evolves, i.e. σ2 0 < σ2 t σ2 for all t. In Lemma C.2, we provide a rigorous proof of this claim in the cases λ = 0 and λ = , by showing that (1+λ) Eyt νt[ hνt,π(yt) yt] < 0, which implies that the variance update in (C.2) is monotone increasing. The cases λ = 0 and λ = correspond respectively to the χ2 flow and the MMD flow regimes. For general λ > 0, however, we are unable to establish a rigorous proof. Our argument in Lemma C.2 relies heavily on Mehler s formula (Liang and Tran-Bach, 2022, Proposition 2.2), which requires exponential decay of the spectrum (ϱi)i 1. This condition is not satisfied for Dr MMD, whose spectrum is modified to ( ϱi ϱi+λ)i 1. Nevertheless, we conjecture that the monotonicity property continues to hold for all λ > 0. Checking assumptions in Theorem 4.1 and Theorem 5.1: Now we check the assumptions from Theorem 4.1 and Theorem 5.1. The target π is a Gaussian distribution which automatically satisfies a Poincaré inequality, and its potential V is a quadratic function, hence satisfies HV βI. νt, π are Gaussians, hence absolutely continuous with respect to Lebesgue on R. And most importantly, we have dνt dπ 1 Ran(T 0.25 π ), i.e., there exists qt L2(π) such that dνt dπ 1 = T 0.25 π qt. To see why, we first need to upper bound dνt dπ 1, ei 2 L2(π). From Belafhal et al. (2020, Corollary 2), we have the following closed-form expressions for dνt dπ 1, ei L2(π): 1 1 + 2σ2 t c c + 1 2σ2 t 1 1 + 2 σ2c when i is even and 0 otherwise. Therefore, we have L2(π) 1 1 + σ2c β i!2i |Hi(0)|2 2 c + 1 2σ2 t 1 c + 1 2 σ2 1 By the monotonicity of the variance, σ2 0 = 1 2 σ2 σ2 t < σ2 and β2 > 3, we have c + 1 2σ2 t 1 < c + 1 2 σ2 1 = 2β2 So we have the following upper bound L2(π) 4 1 + σ2c β i!2i |Hi(0)|2 2β2 β2 + 1 1 i . (De)-regularized Maximum Mean Discrepancy Gradient Flow Now we are ready to study the L2(π)-norm of qt. 
Recall the formulas of ϱi in (C.1), we have dπ 1, ei 2 L2(π) ϱ0.5 i 4β 1 + σ2c 1 i!2i |Hi(0)|2 β2 1 = 4β 1 + σ2c 1 i!2i |Hi(0)|2 β2 1 0.5!i = 4β 1 + σ2c The last equality holds by Mehler s formula (Liang and Tran-Bach, 2022, Proposition 2.2). A quick sanity check for the above derivations is to take r = 0.5 and see that P i=1 ϱ 1 i dνt dπ 1, ei 2 L2(π) is divergent. This indicates that dνt dπ 1 / H, which verifies the fact that the Gaussian RKHS does not contain constant functions (Steinwart and Christmann, 2008, Corollary 4.44). Finally, It and Jt admit the following explicit formulas as well as uniform upper bounds Jt = (log π) dνt 2π σ3σ2 t Γ(7/2) 1 σ2 t 1 2 σ2 7/2 4Γ(7/2) Γ(1/2) 1 σ2 t 1 2 σ2 1/2 + 2Γ(3/2) 1 σ2 t 1 1 σ2 t 1 2 σ2 3/2 + Γ(5/2) 1 σ2 t 1 1 σ2 t 1 2 σ2 5/2 2 5/2 + Γ(3/2) 2 3/2 + Γ(5/2) Empirical verification: While the above derivations demonstrate that a (Gaussian projected) Dr MMD gradient flow satisfies all the required assumptions, it is instructive to demonstrate that a particle implementation in the discrete time setting, without explicit Gaussian projection, shows the behavior consistent with the theory. We therefore simulate our finite particle Dr MMD gradient descent ˆµn with a step size γ = 0.01 and particle number N = 10, 000 and empirically inspect its convergence properties. We take σ2 = 6 so β2 = 5. We estimate the density of the Dr MMD descent µn from the particles with a kernel density estimator using a Gaussian kernel with lengthscale 0.1 (Epanechnikov, 1969). Based on the estimated densities µn, we compute the following two quantities: In = V (dµn dπ ) L2(π) and Jn = ( dµn dπ ) L2(π). Their evolution along the Dr MMD gradient descent is shown in Figure 3. We observe that both quantities decrease over time as desired, which is a consequence of increasing smoothness of the density ratio dµn dπ as µn converges to π. We also report the evolution of the KL divergence c KL(ˆµn ˆπ) along the flow, estimated from particles. Lemma C.1 Given the Dr MMD gradient flow update scheme dxt = (1 + λ) hµt,π(xt)dt, its mean mt = E[xt] and variance σ2 t = E[x2 t ] E[xt]2 update scheme can be expressed as dmt = (1 + λ) Ext µt[ hµt,π(xt)] dt, d(σ2 t ) = 2(1 + λ) Ext µt[ hµt,π(xt) xt] dt + 2(1 + λ) Ext µt[ hµt,π(xt)] mt dt, Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur 0 20 40 60 80 100 Iteration c KL(ˆµn ˆπ) b Jn b In Figure 3: Evolution of c KL(ˆµn ˆπ), b In and b Jn along Dr MMD particle descent, where all three terms are estimated with samples. Proof For the mean update, d dtmt = E[ d dtxt] = (1+λ) Ext µt[ hµt,π(xt)]. For the variance update, d dtσ2 t = d dt E[x2 t ] E[xt]2 = 2 E[xt d dtxt] 2 E[xt] d = 2(1 + λ) Ext µt[ hµt,π(xt) xt] + 2(1 + λ) Ext µt[ hµt,π(xt)] mt. Hence the result. Lemma C.2 Let νt be a zero mean normal distribution N(0, σ2 t ) and 0 < σ2 t < σ2. For the eigenvalues (ϱi)i 1 and eigenfunctions (ei)i 1 defined in (C.1), we have L2(π) Ex νt dxei(x)x < 0, L2(π) Ex νt dxei(x)x < 0. which correspond to the variance update in (C.2) when λ = 0 and λ = , respectively. Proof For the inner product term, we know from Corollary 2 of Belafhal et al. (2020) that dνt 1 1 + 2σ2 t c c + 1 2σ2 t 1 1 + 2 σ2c For the expectation term, notice that (De)-regularized Maximum Mean Discrepancy Gradient Flow 2σ2 t x2 1 x2 2σ2 t x2 exp( cx2)Hi 2σ2 t x2 exp( cx2)Hi =: Ei,1 + Ei,2. Here in ( ), we use integration by parts in which the boundary term vanishes, because limx exp( x2)Hi(x) = 0. 
Next, notice that the second term Ei,2 equals precisely the derivative of Ei,1 with respect to c, rescaled by 1 σ2 t . To distinguish it from the other c that will show up later in dνt dπ 1, ei L2(π), we denote it as c. The original P i 1 dνt dπ 1, ei L2(π) Ex νt[ d dxei(x)x] can be written as the sum of two components F1 + F2: 1 1 + 2σ2 t c c + 1 2σ2 t 1 1 + 2 σ2c 1 1 + 2σ2 t c c + 1 2σ2 t 1 1 + 2 σ2c From Corollary 2 of Belafhal et al. (2020), we know 1 1 + 2σ2 t c c + 1 2σ2 t and Ei,2 = 1 σ2 t d dc Ei,1. The relation between Ei,1 and Ei,2 would carry over to F1 and F2 as well, i.e., F2 = 1 σ2 t d dc F1. Next, we compute the first half of F1. 1 1 + 2σ2 t c c + 1 2σ2 t 1 1 + 2σ2 t c 1 1 + 2σ2 t c c + 1 2σ2 t c + 1 2σ2 t 1 1 + 2σ2 t c 1 1 + 2σ2 t c c + 1 2σ2 t c + 1 2σ2 t = β (1 + 2σ2 t c)(1 + 2σ2 t c) (1 + 2σ2 t c σ2 t σ2 β2)(1 + 2σ2 t c σ2 t σ2 β2) 1 σ2 t σ2 (2 + 2σ2 t c + 2σ2 t c) σ2 t σ2 β 2! 1 Chen, Mustafi, Glaser, Korba, Gretton and Sriperumbudur The second last equality holds by the Mehler s formula (Liang and Tran-Bach, 2022, Proposition 2.2). Similarly, we can also compute the second half of F1. 1 1 + 2 σ2c = β (1 + 2 σ2c)(1 + 2σ2 t c) (1 + 2 σ2c β2)(1 + 2σ2 t c σ2 t σ2 β2) 1 = 1 + 2σ2 t c + (1 + 2 σ2c)σ2 t σ2 σ2 t σ2 β2 1 Combining the above two equations, we obtain the following formula of F1. σ2 t σ2 (2 + 2σ2 t c + 2σ2 t c) σ2 t σ2 β 2! 1 2 + 1 + 2σ2 t c + (1 + 2 σ2c)σ2 t σ2 σ2 t σ2 β2 1 Since F2 = 1 σ2 t d dc F1, we have σ2 t σ2 (2 + 2σ2 t c + 2σ2 t c) σ2 t σ2 β 2! 3 2 σ2 t σ2 1 + 2σ2 t c + (1 + 2 σ2c)σ2 t σ2 σ2 t σ2 β2 3 Recall that the original P i 1 dνt dπ 1, ei L2(π) Ex νt[ d dxei(x)x] can be written as the sum of two components F1 + F2, and recall that c = c and β2 = 1 + 4 σ2c by definition. We obtain L2(π) Ex νt dxei(x)x = F1 + F2 = σ2 t σ2 1 + 2 σ2 t σ2 which is negative when 0 < σt < σ. So we have concluded the proof of the first claim. Now, we are about to prove the second claim. Following the same derivations as above, we can write P i 1 ϱi dνt dπ 1, ei L2(π) Ex νt[ d dxei(x)x] as the sum of two terms G1 and G2. i 1 ϱi Hi(0) 1 1 + 2σ2 t c c + 1 2σ2 t 1 1 + 2 σ2c i 1 ϱi Hi(0) 1 1 + 2σ2 t c c + 1 2σ2 t 1 1 + 2 σ2c Since ϱi (β2 1 β2+1)i defined in (C.1) has exponential decay, hence Mehler s formula still hold. Up to some positive multiplier coefficient that do not change the sign, G1 can be written as the following formula G1 = (2 σ2c) 2(1 + 2σ2 t c)2 + 2σ2 t σ2 σ4 t σ4 2 + (2 σ2c) 2(1 + 2 σ2c)(1 + 2σ2 t c) + 1 1 (De)-regularized Maximum Mean Discrepancy Gradient Flow And similar to F1, F2, we have G2 = 1 σ2 t d dc G1. As a result, G2 = (2 σ2c) 2(1 + 2σ2 t c)2 + 2σ2 t σ2 σ4 t σ4 2 1 + 2σ2 t c (2 σ2c)2 + σ2 t σ2 (2 σ2c) 2(1 + 2 σ2c)(1 + 2σ2 t c) + 1 3 2 1 + 2 σ2c (2 σ2c)2 + 1 . Combined, we obtain G1 + G2 = (2 σ2c) 2(1 + 2σ2 t c)2 + 2σ2 t σ2 σ4 t σ4 2 (1 + 2σ2 t c) 2σ2 t c (2 σ2c)2 + σ2 t σ2 σ4 t σ4 + (2 σ2c) 2(1 + 2 σ2c)(1 + 2σ2 t c) + 1 3 2 (1 + 2 σ2c) 2σ2 t c (2 σ2c)2 . When 0 < σ2 t < σ2, we have the following relations (2 σ2c) 2(1 + 2σ2 t c)2 + 2σ2 t σ2 σ4 t σ4 < (2 σ2c) 2(1 + 2 σ2c)(1 + 2σ2 t c) + 1, (1 + 2σ2 t c) 2σ2 t c (2 σ2c)2 + σ2 t σ2 σ4 t σ4 > (1 + 2 σ2c) 2σ2 t c (2 σ2c)2 . L2(π) Ex νt dxei(x)x = G1 + G2 < 0, proving the second claim.
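As an informal companion to Lemma C.2 and the conjecture for general λ > 0, one can also track the sample variance along the finite-particle descent and check that it increases toward the target variance. The short script below reuses the hypothetical drmmd_witness helper from the sketch given with the update scheme earlier in this appendix; it is a numerical illustration for one arbitrary fixed λ, not a proof.

```python
import numpy as np

# Monitor the sample variance along the 1-D Gaussian DrMMD particle descent
# (drmmd_witness is the helper defined in the earlier sketch) and report how
# often it increases; with sigma_0^2 < sigma^2 = 6 it should trend upward.
rng = np.random.default_rng(1)
sigma2 = 6.0
x_tgt = rng.normal(0.0, np.sqrt(sigma2), size=500)
parts = rng.normal(0.0, np.sqrt(sigma2 / 2), size=300)
lam, gamma = 1e-1, 1e-2          # illustrative fixed values
variances = [parts.var()]
for n in range(200):
    _, _, g = drmmd_witness(parts, x_tgt, lam)
    parts = parts - gamma * (1.0 + lam) * g
    variances.append(parts.var())
increasing = np.mean(np.diff(variances) > 0)
print(f"fraction of increasing steps: {increasing:.2f}, "
      f"final variance: {variances[-1]:.2f} (target {sigma2})")
```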