Published as a conference paper at ICLR 2025

DEEP MMD GRADIENT FLOW WITHOUT ADVERSARIAL TRAINING

Alexandre Galashov (UCL Gatsby, Google DeepMind) agalashov@google.com
Valentin De Bortoli (Google DeepMind) vdebortoli@google.com
Arthur Gretton (UCL Gatsby, Google DeepMind) gretton@google.com

We propose a gradient flow procedure for generative modeling by transporting particles from an initial source distribution to a target distribution, where the gradient field on the particles is given by a noise-adaptive Wasserstein Gradient of the Maximum Mean Discrepancy (MMD). The noise-adaptive MMD is trained on data distributions corrupted by increasing levels of noise, obtained via a forward diffusion process, as commonly used in denoising diffusion probabilistic models. The result is a generalization of MMD Gradient Flow, which we call Diffusion MMD-Gradient Flow or DMMD. The divergence training procedure is related to discriminator training in Generative Adversarial Networks (GAN), but does not require adversarial training. We obtain competitive empirical performance in unconditional image generation on CIFAR10, MNIST, CELEB-A (64 × 64) and LSUN Church (64 × 64). Furthermore, we demonstrate the validity of the approach when MMD is replaced by a lower bound on the KL divergence.

1 INTRODUCTION

In recent years, generative models have achieved impressive capabilities in image (Saharia et al., 2022), audio (Le et al., 2023), and video generation (Ho et al., 2022), as well as protein modeling (Watson et al., 2023) and 3D generation (Poole et al., 2023). Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021; Rombach et al., 2022) underpin these new methods. In these models, we learn a backward denoising diffusion process via denoising score matching (Hyvärinen, 2005; Vincent, 2011). This backward process corresponds to the time-reversal of a forward noising process.
At sampling time, starting from random Gaussian noise, diffusion models produce samples by discretizing the backward process. One challenge that arises when applying these models in practice is that the Stein score (that is, the gradient of the log of the current noisy density) becomes ill-behaved near the data distribution (Yang et al., 2024): the diffusion process needs to be slowed down at this point, which incurs a large number of sampling steps near the data distribution. Indeed, if the manifold hypothesis holds (Tenenbaum et al., 2000; Fefferman et al., 2016; Brown et al., 2022) and the data is supported on a lower dimensional space, it is expected that the score will explode for noise levels close to zero, to ensure that the backward process concentrates on this lower dimensional manifold (De Bortoli, 2022; Pidstrigach, 2022; Chen et al., 2023). While strategies exist to mitigate these issues, they trade off the quality of the output against inference speed, see for instance (Song et al., 2023; Xu et al., 2024; Sauer et al., 2023).

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) represent an alternative popular generative modelling framework (Brock et al., 2019; Karras et al., 2020a). Candidate samples are produced by a generator: a neural net mapping low dimensional noise to high dimensional images. The generator is trained in alternation with a discriminator, which is a measure of discrepancy between the generator and target images. An advantage of GANs is that image generation is fast once the GAN is trained (Xiao et al., 2022), although image samples are of lower quality than for the best diffusion models (Ho et al., 2020; Rombach et al., 2022). When learning a GAN model, the main challenge arises due to the presence of the generator, which must be trained adversarially alongside the discriminator.
This requires careful hyperparameter tuning (Brock et al., 2019; Karras et al., 2020b; Liu et al., 2021), without which GANs may suffer from training instability and mode collapse (Arora et al., 2017; Kodali et al., 2017; Salimans et al., 2016). Nonetheless, the process of GAN design has given rise to a strong understanding of discriminator functions, and a wide variety of different divergence measures have been applied. These fall broadly into two categories: the integral probability metrics (among which, the Wasserstein distance (Arjovsky et al., 2017; Gulrajani et al., 2017; Genevay et al., 2018) and the Maximum Mean Discrepancy (Li et al., 2017; Binkowski et al., 2018; Arbel et al., 2018)) and the f-divergences (Goodfellow et al., 2014; Nowozin et al., 2016; Mescheder et al., 2018; Brock et al., 2019). While it would appear that f-divergences ought to suffer from the same shortcomings as diffusions when the target distribution is supported on a submanifold (Arjovsky et al., 2017), the divergences used in GANs are in practice variational lower bounds on their corresponding f-divergences (Nowozin et al., 2016), and in fact behave closer to IPMs in that they do not require overlapping support of the target and generator samples, and can metrize weak convergence (Arbel et al., 2021, Proposition 14; Zhang et al., 2018) (there remain important differences, however: notably, f-divergences and their variational lower bounds need not be symmetric in their arguments).

A natural question then arises: is it possible to define a Wasserstein gradient flow (Ambrosio et al., 2008; Santambrogio, 2015) using a GAN discriminator as a divergence measure? In this setting, the divergence (discriminator) provides a gradient field directly onto a set of particles (rather than to a generator), transporting them to the target distribution. Contributions in this direction include the MMD flow Arbel et al. (2019); Hertrich et al.
(2024), which defines a Wasserstein Gradient Flow on the Maximum Mean Discrepancy (Gretton et al., 2012); and the KALE (KL approximate lower-bound estimator) flow Glaser et al. (2021), which defines a Wasserstein gradient flow on a KL lower bound of the kind used as a GAN discriminator based on an f-divergence (Nowozin et al., 2016). We describe the MMD and its corresponding Wasserstein gradient flow in Section 2. These approaches employ fixed function classes (namely, reproducing kernel Hilbert spaces) for the divergence, and are thus not suited to high dimensional settings such as images. Moreover, we show in this work that even for simple examples in low dimensions, an adaptive discriminator ensures faster convergence of a source distribution to the target, see Section 3.

A number of more recent approaches employ trained neural net features in divergences for a subsequent gradient flow (e.g. Fan et al., 2022; Franceschi et al., 2023). Broadly speaking, these works use adversarial means to train a series of discriminator functions, which are then applied in sequence to a population of particles. While more successful on images than kernel divergences, these approaches retain two shortcomings: they still require adversarial training (on their own prior output), with all the challenges that this entails; and their empirical performance falls short in comparison with modern diffusions and GANs (see related work in Section 6 for details).

In the present work, we propose a novel Wasserstein gradient flow on a noise-adaptive MMD divergence measure, leveraging insights from both GANs and diffusion models. To train the discriminator, we start with clean data, and use a forward diffusion process from (Ho et al., 2020) to produce noisy versions of the data with given levels of noise (data with high levels of noise are analogous to the output of a poorly trained generator, whereas low noise is analogous to a well trained generator). The added noise is always Gaussian.
For a given level of noise, we train a noise conditional MMD discriminator to distinguish between the clean and the noisy data, using a single network across all noise levels. This allows us to have better control over the discriminator training procedure than would be achievable with a GAN generator at different levels of refinement, where this control is implicit and hard to characterize. To draw new samples, we propose a novel noise-adaptive version of MMD gradient flow (Arbel et al., 2019). Starting from particles drawn from a Gaussian distribution, we move them in the direction of the target distribution by following the MMD gradient flow (Arbel et al., 2019), adapting our MMD discriminator to the corresponding level of noise. See Section 4 for details. This allows us to have fine grained control over the sampling process.

As a final challenge, MMD gradient flows have previously required large populations of interacting particles for the generation of novel samples, which is expensive (quadratic in the number of particles) and impractical. In Section 5, we propose a scalable approximate sampling procedure for the case of a linear base kernel, which allows single samples to be generated with very little loss in quality, at cost independent of the number of particles used in training. The MMD is an instance of an integral probability metric; however, many GANs have been designed using discriminators derived from f-divergences. Section E demonstrates how our approach can be applied to such divergences, using a lower bound on the KL divergence as an illustration. Section 6 contains a review of alternative approaches to using GAN discriminators for sample generation. Finally, in Section 7, we show that our method, Diffusion-MMD-gradient flow (DMMD), yields competitive performance in generative modeling on 2-D datasets as well as in unconditional image generation on CIFAR10 (Krizhevsky & Hinton, 2009), MNIST, CELEB-A, and LSUN Church.
2 BACKGROUND

In this section, we define the MMD as a GAN discriminator, then describe Wasserstein gradient flow as it applies for this divergence measure.

MMD GAN. Let X ⊂ R^D and P(X) be the set of probability distributions on X. Let P ∈ P(X) be the target (data) distribution and Q_ψ ∈ P(X) be a distribution associated with a generator parameterized by ψ ∈ R^L. Let H be a Reproducing Kernel Hilbert Space (RKHS), see (Schölkopf & Smola, 2018) for details, for some kernel k : X × X → R. The Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) between Q_ψ and P is defined as

MMD(Q_ψ, P) = sup_{\|f\|_H ≤ 1} { E_{Q_ψ}[f(X)] − E_P[f(X)] }.

We refer to the function f_{Q_ψ,P} that attains the supremum as the witness function,

f_{Q_ψ,P}(z) ∝ ∫ k(x, z) dQ_ψ(x) − ∫ k(y, z) dP(y),   (1)

which will be essential in defining our gradient flow. Given X^N = {x_i}_{i=1}^N ∼ Q_ψ^N and Y^M = {y_i}_{i=1}^M ∼ P^M, the empirical witness function is known in closed form,

\hat{f}_{Q_ψ,P}(x) ∝ (1/N) Σ_{i=1}^N k(x_i, x) − (1/M) Σ_{j=1}^M k(y_j, x),

and an unbiased estimate of MMD² (Gretton et al., 2012) is likewise straightforward. In the MMD GAN (Binkowski et al., 2018; Li et al., 2017), the kernel is

k(x, y) = k_base(ϕ(x; θ), ϕ(y; θ)),   (2)

where k_base is a base kernel and ϕ(·; θ) : X → R^K are neural network discriminator features with parameters θ ∈ R^H. We use the modified notation MMD²_u[X^N, Y^M; θ] to highlight the functional dependence on the discriminator parameters. The MMD is an Integral Probability Metric (IPM) (Müller, 1997), and thus well defined on distributions with disjoint support: this argument was made in favor of IPMs by Arjovsky et al. (2017). Note further that the Wasserstein GAN discriminators of Arjovsky et al. (2017); Gulrajani et al. (2017) can be understood in the MMD framework, when the base kernel is linear. Indeed, it was observed by Genevay et al.
(2018) that requiring closer approximation to a true Wasserstein distance resulted in decreased performance in GAN image generation, likely due to the exponential dependence of sample complexity on dimension for the exact computation of the Wasserstein distance; this motivates an interpretation of these discriminators simply as IPMs using a class of linear functions of learned features. We further note that the variational lower bounds used in approximating f-divergences for GANs share with IPMs the property of being well defined on distributions with disjoint support (Nowozin et al., 2016; Arbel et al., 2021), although they need not be symmetric in their arguments. Finally, while Q_ψ and θ are trained adversarially in GANs, our setting will only require us to learn the discriminator parameter θ.

Wasserstein gradient flows. Instead of a GAN generator, we can move a sample of particles along the Wasserstein gradient flow associated with the discriminator (Ambrosio et al., 2008). Let P_2(X) be the set of probability distributions on X with a finite second moment, equipped with the 2-Wasserstein distance. Let F : P_2(X) → R be a functional defined over P_2(X) with the property that arg inf_ν F(ν) = P. We consider the problem of transporting mass from an initial distribution ν_0 = Q to a target distribution µ = P, finding a continuous path (ν_t)_{t≥0} starting from ν_0 that converges to µ. This problem is studied in Optimal Transport theory (Villani, 2008; Santambrogio, 2015). This path can be discretized as a sequence of random variables (X_n)_{n∈N} such that X_n ∼ ν_n,

X_{n+1} = X_n − γ ∇F′(ν_n)(X_n),  X_0 ∼ Q,   (3)

where γ > 0 is a step size and F′(ν_n) is the first variation of F associated with the Wasserstein gradient, see (Ambrosio et al., 2008; Arbel et al., 2019) for precise definitions. As n → ∞ and γ → 0, depending on the conditions on F, the process (3) will converge to the gradient flow as a continuous time limit (Ambrosio et al., 2008).

MMD gradient flow.
For the choice F(ν) = MMD²[ν, P] and a fixed kernel, conditions for convergence of the process in (3) to P are given by Arbel et al. (2019). Moreover, the first variation is F′(ν) = f_{ν,P} ∈ H, the witness function defined earlier.¹ Using (1)-(3), the discretized MMD gradient flow for any n ∈ N is given by

X_{n+1} = X_n − γ ∇f_{ν_n,P}(X_n),  X_0 ∼ Q.   (4)

This provides an algorithm to (approximately) sample from the target distribution P. We remark that Arbel et al. (2019); Hertrich et al. (2024) used a kernel with fixed hyperparameters. In the next section, we will argue that even for RBF kernels (where only the bandwidth is chosen), faster convergence will be attained using kernels that adapt during the gradient flow. Details of kernel choice for alternative approaches are given in related work (Section 6).

3 A MOTIVATION FOR ADAPTIVE KERNELS

In this section, we demonstrate the benefit of using an adaptive kernel when performing MMD gradient flow. We show that even in the simple setting of Gaussian sources and targets, an adaptive kernel improves the convergence of the flow. Let k_α(x, y) = α^{−d} exp[−‖x − y‖²/(2α²)] be the normalized Gaussian kernel. For any µ ∈ R^d and σ > 0, we denote by π_{µ,σ} the Gaussian distribution with mean µ and covariance matrix σ²I_d. We denote by MMD_α the MMD associated with k_α.

Proposition 3.1. For any µ_0 ∈ R^d and σ > 0, let α* be given by α* = argmax_{α ≥ 0} ‖∇_{µ_0} MMD²_α(π_{0,σ}, π_{µ_0,σ})‖. Then, we have that

α* = ReLU(‖µ_0‖²/(d + 2) − 2σ²)^{1/2}.   (5)

The result is proved in Appendix K. The quantity ‖∇_{µ_0} MMD²_α(π_{0,σ}, π_{µ_0,σ})‖ represents how much the mean of the Gaussian π_{µ_0,σ} is displaced by a flow w.r.t. MMD²_α. We want ‖∇_{µ_0} MMD²_α(π_{0,σ}, π_{µ_0,σ})‖ to be as large as possible, as it denotes the maximum displacement possible. We show that the α* maximizing this displacement is given by (5). Assuming σ > 0 is fixed, it is notable that this quantity depends on ‖µ_0‖, i.e. the distance between the two distributions. This observation justifies our approach of following an adaptive MMD flow at inference time.
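The closed form (5) and the discretized update (4) are simple enough to check numerically. Below is a minimal NumPy sketch (the function names are ours, not the paper's): `optimal_bandwidth` implements the ReLU formula of Proposition 3.1, and `witness_grad` evaluates the gradient of the empirical witness (1) for an RBF kernel, so that one descent step reproduces (4).

```python
import numpy as np

def optimal_bandwidth(mu0, d, sigma):
    """alpha* = ReLU(||mu0||^2/(d+2) - 2 sigma^2)^(1/2), Proposition 3.1."""
    return np.sqrt(max(np.dot(mu0, mu0) / (d + 2) - 2.0 * sigma**2, 0.0))

def witness_grad(z, X, Y, alpha):
    """Gradient at z of the empirical witness (1) for an RBF kernel of width
    alpha: f(z) = mean_i k(x_i, z) - mean_j k(y_j, z)."""
    def mean_kernel_grad(A):
        diff = z - A                                           # (n, d)
        w = np.exp(-np.sum(diff**2, axis=1) / (2 * alpha**2))  # kernel weights
        return (w[:, None] * (-diff) / alpha**2).mean(axis=0)
    return mean_kernel_grad(X) - mean_kernel_grad(Y)

# One discretized flow step (4): a particle descends the witness toward data.
mu0, sigma, d = np.array([5.0, 0.0]), 1.0, 2
alpha_star = optimal_bandwidth(mu0, d, sigma)
z = np.zeros(d)
z_next = z - 0.5 * witness_grad(z, z[None, :], mu0[None, :], alpha_star)
```

Consistent with the phase transition discussed next, `optimal_bandwidth` returns 0 as soon as ‖µ_0‖² ≤ 2σ²(d + 2).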
We further highlight the phase transition behaviour of Proposition 3.1: once the Gaussians are sufficiently close, the optimal kernel width is zero (note that this phase transition would not be observed in the simpler Dirac GAN example of Mescheder et al. (2018), where the source and target distributions are Dirac masses with no variance). This phase transition suggests that the flow associated with MMD benefits less from adaptivity as the supports of the distributions overlap. We exploit this observation by introducing an optional denoising stage to our procedure; see the end of Section 4. In practice, it is not desirable to approximate the distributions of interest by Gaussians, and richer neural network kernel features ϕ(x; θ) are used (see Section 7). Arbel et al. (2018) describe approaches to optimize the MMD parameters for GAN training, which serve as proxies for convergence speed: it is not sufficient simply to maximize the MMD, since the witness function should remain Lipschitz to ensure convergence (Arbel et al., 2018, Proposition 2). Regularization of the witness is achieved in practice by controlling the gradient of the witness function; we take a similar approach in Section 4.

4 DIFFUSION MAXIMUM MEAN DISCREPANCY GRADIENT FLOW

In this section, we present Diffusion Maximum Mean Discrepancy gradient flow (DMMD), a new generative model with a training procedure for the MMD discriminator which does not rely on adversarial training, and leverages ideas from diffusion models. The sampling part of DMMD consists of following a noise-adaptive variant of MMD gradient flow.

¹In the case of variational lower bounds on f-divergences, the witness function is still well defined, and the first variation takes the same form in respect of this witness function: see Glaser et al. (2021) for the case of the KL divergence.

Adversarial-free training of noise conditional discriminators.
In order to train a discriminator without adversarial training, we propose to use insights from GAN training. In a GAN setting, at the beginning of training, the generator is randomly initialized and therefore produces samples close to random noise. This yields a coarse discriminator, since it is trained to distinguish clean data from random noise. As the training progresses and the generator improves, so too does the discriminative power of the discriminator. This discriminator behavior is central in the training of GANs (Goodfellow et al., 2014). We propose a way to replicate this gradually improving behavior without adversarial training, relying instead on principles from diffusion models (Ho et al., 2020). The forward process in diffusion models allows us to generate a probability path P_t, t ∈ [0, 1], such that P_0 = P, where P is our target distribution, and P_1 = N(0, I_d) is Gaussian noise. Given samples x_0 ∼ P_0 = P, the samples x_t|x_0 are given by

x_t = α_t x_0 + β_t ε,  ε ∼ N(0, I_d),   (6)

with α_0 = β_1 = 1 and α_1 = β_0 = 0.² From the form of x_t|x_0, we observe that for a low noise level t, the samples x_t are very close to the original data x_0, whereas for large values of t they are close to a unit Gaussian random variable. In GAN terminology, x_t can be thought of as the output of a generator, where a high/low noise level t corresponds to an undertrained/well-trained generator. Using this insight, for each noise level t ∈ [0, 1], we define a discriminator MMD²(P_t, P; t, θ) using a kernel of type (2) with noise-conditional discriminator features ϕ(x; t; θ) parameterized by a neural network with learned parameters θ. We consider the following noise-conditional loss function

L(θ, t) = −MMD²(P_t, P; t, θ),   (7)

where the minus sign comes from the fact that our aim is to maximize the squared MMD.
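As a concrete illustration of (6) and (7), the sketch below pairs a variance-preserving forward corruption with the unbiased MMD² estimator of Gretton et al. (2012). It is an assumption-laden toy: the features are the identity with a linear base kernel, and the cosine schedule is an illustrative stand-in for the paper's actual (α_t, β_t).

```python
import numpy as np

def forward_noise(x0, t, rng):
    """VP forward process (6): x_t = alpha_t x0 + beta_t eps, alpha_t^2 + beta_t^2 = 1.
    The cosine schedule here is illustrative, not the paper's exact choice."""
    alpha_t, beta_t = np.cos(np.pi * t / 2), np.sin(np.pi * t / 2)
    return alpha_t * x0 + beta_t * rng.standard_normal(x0.shape)

def mmd2_unbiased(X, Y, k):
    """Unbiased estimate of MMD^2 (Gretton et al., 2012); k maps two sample
    arrays to their kernel matrix."""
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    n, m = len(X), len(Y)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2.0 * Kxy.mean())

linear_k = lambda A, B: A @ B.T   # linear base kernel on (here, identity) features

rng = np.random.default_rng(0)
X0 = rng.standard_normal((256, 2)) + 5.0   # toy "clean" data
loss_high = -mmd2_unbiased(forward_noise(X0, 0.9, rng), X0, linear_k)   # (7) at large t
loss_low = -mmd2_unbiased(forward_noise(X0, 0.01, rng), X0, linear_k)   # (7) at small t
```

At a high noise level the discrepancy from clean data is large (a coarse discrimination task), while at a small noise level it nearly vanishes, mirroring an undertrained versus well-trained generator.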
In addition, we regularize this loss with an ℓ2-penalty (Binkowski et al., 2018), denoted L_{ℓ2}(θ, t), as well as with the gradient penalty (Binkowski et al., 2018; Gulrajani et al., 2017), denoted L_∇(θ, t); see Appendix C.2 for the precise definition of these two losses. The total noise-conditional loss is then given as

L_tot(θ, t) = L(θ, t) + λ_{ℓ2} L_{ℓ2}(θ, t) + λ_∇ L_∇(θ, t),   (8)

for a suitable choice of hyperparameters λ_{ℓ2} ≥ 0, λ_∇ ≥ 0. Finally, the total loss is given as L_tot(θ) = E_{t∼U[0,1]}[L_tot(θ, t)], where U[0, 1] is the uniform distribution. In practice, we use a sample-based unbiased estimator of MMD, see Appendix C.2. The procedure is described in Algorithm 1.

Adaptive gradient flow sampling. In order to produce samples from P, we use the adaptive MMD gradient flow with noise conditional discriminators MMD²[P_t, P; t; θ*], where θ* are the discriminator parameters obtained using Algorithm 1. Let t_i = t_min + iΔt, i = 0, . . . , T, be the noise discretisation, where Δt = (t_max − t_min)/T, such that t_0 = t_min, t_T = t_max for some t_min = ε and t_max = 1 − ε, where ε ≪ 1. We sample N_p initial particles {Z^i | Z^i ∼ N(0, I_d)}_{i=1}^{N_p}. For each t, we follow the MMD gradient flow (4) for N_s steps with learning rate η > 0,

Z^{i,n+1}_t = Z^{i,n}_t − η ∇f_{ν^t_{N_p,n},P}(Z^{i,n}_t, t; θ*).   (9)

Here ν^t_{N_p,n} = (1/N_p) Σ_{i=1}^{N_p} δ_{Z^{i,n}_t} is the empirical distribution of the particles {Z^{i,n}_t}_{i=1}^{N_p} at noise level t and iteration n, where δ is a Dirac mass. The function f_{ν^t_{N_p,n},P}(z, t; θ*) is adapted from equation (1), where ν is replaced by this empirical distribution. After following the gradient flow (9) for N_s steps, we initialize a new gradient flow with initial particles Z^{i,0}_{t−Δt} = Z^{i,N_s}_t for each i = 1, . . . , N_p, at the decreased noise level t − Δt. The recurrence is initialized with Z^{i,0}_{t_max} = Z^i, where {Z^i}_{i=1}^{N_p} are the initial particles. This procedure corresponds to running T + 1 consecutive MMD gradient flows for N_s iterations each, gradually decreasing the noise level t from t_max to t_min.
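The procedure above can be sketched end-to-end on a toy problem. In this illustration (our own simplification, not the paper's learned discriminator) the noise-conditional kernel is an RBF whose width σ(t) is hand-picked to shrink with the noise level, standing in for ϕ(·, t; θ*); the two loops implement the T + 1 consecutive flows of updates (9).

```python
import numpy as np

def witness_grad(Z, X0, sigma):
    """Gradient of the empirical witness (1) at every particle, RBF kernel:
    term(Z) is the particle interaction, term(X0) the attraction to data."""
    def term(A):
        diff = Z[:, None, :] - A[None, :, :]                   # (Np, n, d)
        w = np.exp(-np.sum(diff**2, axis=-1) / (2 * sigma**2)) # (Np, n)
        return np.einsum('pn,pnd->pd', w, -diff / sigma**2) / len(A)
    return term(Z) - term(X0)

def dmmd_sample(X0, T=8, Ns=25, eta=1.0, Np=64, seed=0):
    """Noise-adaptive MMD flow in the spirit of Algorithm 2: Ns steps of (9)
    per noise level, warm-starting each level from the previous one."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((Np, X0.shape[1]))   # initial Gaussian particles
    for t in np.linspace(1.0, 0.0, T + 1):       # t_max -> t_min
        sigma = 0.5 + 3.5 * t                    # coarse kernel at high noise
        for _ in range(Ns):
            Z = Z - eta * witness_grad(Z, X0, sigma)   # update (9)
    return Z
```

With a shifted-Gaussian target, the particle cloud is carried from N(0, I) to the data; a fixed small σ would leave distant particles with vanishing gradients, which is the failure mode the noise-adaptive schedule avoids.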
The resulting particles {Z^{i,N_s}_{t_min}}_{i=1}^{N_p} are used as samples from P. See Algorithm 2. In practice, we sample (once) a large batch N_c of {X^j_0}_{j=1}^{N_c} ∼ P^{N_c} from the data distribution and denote by \hat{P}_{N_c}(X_0) the corresponding empirical distribution. Then we use the empirical witness function f_{ν^t_{N_p,n}, \hat{P}_{N_c}(X_0)}(z, t; θ*), given by

(1/N_p) Σ_{i=1}^{N_p} k_base(ϕ(Z^{n,i}_t, t; θ*), ϕ(z, t; θ*)) − (1/N_c) Σ_{j=1}^{N_c} k_base(ϕ(X^j_0, t; θ*), ϕ(z, t; θ*)).   (10)

²Different schedules (α_t, β_t) are available in the literature. We focus on the Variance Preserving SDE schedule of Song et al. (2021) here.

Algorithm 1 Train noise-conditional MMD discriminator
Input: dataset D = {x_i}_{i=1}^N; discriminator features ϕ(x, t; θ) with parameters θ ∈ R^K; gradient and ℓ2 penalty coefficients λ_∇ ≥ 0, λ_{ℓ2} ≥ 0; learning rate γ > 0; number of iterations N_iter; batch size B; number of noise levels per batch N_noise.
for i = 1 to N_iter do
  Sample a batch B of clean particles X_0 ∼ P(X_0)
  for n = 1 to N_noise do
    Sample noise level t_n ∼ U[0, 1]
    Sample X_{t_n} ∼ p(X_{t_n}|X_0, t_n)
    Let the clean and noisy features be ϕ^{X_0}_{t_n} = ϕ(X_0, t_n; θ) and ϕ^{X_{t_n}}_{t_n} = ϕ(X_{t_n}, t_n; θ)
    For the linear base kernel (11), use the optimized form (19) to compute the MMD loss (7)
    Compute the loss L_tot(θ, t_n) using (8)
  end for
  Compute the total loss L_tot(θ) = (1/N_noise) Σ_{n=1}^{N_noise} L_tot(θ, t_n)
  Update discriminator features θ ← ADAM(θ, ∇L_tot(θ), γ)
end for

Algorithm 2 Noise-adaptive MMD gradient flow
Inputs: number of noise levels T; maximum/minimum noise levels t_max, t_min; number of gradient flow steps per noise level N_s; gradient flow learning rate η > 0; number of noisy particles N_p; batch of clean particles X_0 ∼ P_0.
Sample initial particles Z ∼ N(0, I_d)
Set Δt = (t_max − t_min)/T
for i = T to 0 do
  Set the noise level t = t_min + iΔt
  Set Z^0_t = Z
  for n = 0 to N_s − 1 do
    Use (10) to compute ∇f_{ν^t_{N_p,n}, \hat{P}_{N_c}(X_0)}(Z^n_t, t; θ*)
    Z^{n+1}_t = Z^n_t − η ∇f_{ν^t_{N_p,n}, \hat{P}_{N_c}(X_0)}(Z^n_t, t; θ*)
  end for
  Set Z = Z^{N_s}_t
end for
Output Z

Final denoising.
In diffusion models (Ho et al., 2020), it is common to use a denoising step at the end to improve sample quality. We found empirically that a few MMD gradient flow steps at the end of sampling, with a higher learning rate η, reduce noise and improve performance.

5 SCALABLE DMMD WITH LINEAR KERNEL

The computational complexity of the MMD estimate on two sets of N samples is O(N²). This is likewise the cost of computing the witness function (10) at N particles, when computed using N clean and noisy particles. However, using the linear base kernel (see (2))

k_base(x, y) = ⟨x, y⟩,   (11)

allows us to reduce the computational complexity of both quantities to O(N), see Appendix C.3. We consider the average noise conditional discriminator features on the whole dataset,

\bar{ϕ}(X_0, t; θ*) = (1/N) Σ_{i=1}^N ϕ(X^i_0, t; θ*).   (12)

With the linear kernel (11), we can use the average features (12) in the second term of (10). In practice, we can precompute these K-dimensional features for T timesteps and store them in memory for later use at sampling time, with a storage cost of O(TK). This removes the need to store the data sample {X^j_0}_{j=1}^N, since we only retain the average feature vector of this sample.

Figure 1: Samples from MMD Gradient flow with different parameters for the RBF kernel.

Approximate sampling procedure. MMD gradient flow (9) requires us to use multiple interacting particles Z to produce samples, where the interaction is captured by the first term in (10). In practice this means that the performance will depend on the number of these particles. In this section, we propose an approximation to MMD gradient flow with a linear base kernel (11) which allows us to sample particles individually and independently, therefore removing the need for multiple particles. For a linear kernel, the interaction term in (10) for a particle Z equals ⟨(1/N_p) Σ_{i=1}^{N_p} ϕ(Z^{n,i}_t, t; θ*), ϕ(Z, t; θ*)⟩.
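The linear-kernel identity underlying this section is easy to verify numerically: with k_base(x, y) = ⟨x, y⟩, the sums in (10) collapse to a single inner product against mean features, which is what makes the O(N) cost and the O(TK) storage possible. The feature map below is a random stand-in for the learned ϕ(·, t; θ*).

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((2, 16))
phi = lambda x: np.tanh(x @ W)      # stand-in for learned features phi(., t; theta*)

Z = rng.standard_normal((50, 2))    # noisy particles
X0 = rng.standard_normal((80, 2))   # clean data
z = rng.standard_normal(2)          # query point

# Witness (10) with explicit sums over all kernel evaluations:
f_sum = (phi(Z) @ phi(z)).mean() - (phi(X0) @ phi(z)).mean()

# Same quantity from precomputed mean features (12): one K-dim inner product.
f_mean = phi(z) @ (phi(Z).mean(axis=0) - phi(X0).mean(axis=0))

assert np.allclose(f_sum, f_mean)
```

Only the two K-dimensional mean feature vectors need to be stored per timestep; the samples themselves can be discarded.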
For a large number of particles N_p, the contribution of each particle Z^{n,i}_t to the interaction term with Z will be small. For sufficiently large N_p, we hypothesize that (1/N_p) Σ_{i=1}^{N_p} ϕ(Z^{n,i}_t, t; θ*) ≈ (1/N) Σ_{j=1}^N ϕ(X^j_t, t; θ*), where N is the size of the dataset and X^j_t are produced by the forward diffusion process (6) applied to each X^j_0. Thus, we consider an approximate witness function

\hat{f}_{P_t,P}(z) = ⟨ϕ(z, t; θ*), \bar{ϕ}(X_t, t; θ*) − \bar{ϕ}(X_0, t; θ*)⟩,   (13)

with \bar{ϕ}(X_t, t; θ*) precomputed using (12). Once again, crucially, we have no need to store the sample {X^j_t}_{j=1}^N, since we need only retain its feature mean. We may then sample a single particle Z ∼ N(0, I_d) and follow the noise-adaptive MMD gradient flow with (13), i.e. Z^{n+1}_t = Z^n_t − η ∇\hat{f}_{P_t,P}(Z^n_t). The corresponding algorithm is described in Appendix C.4.

6 RELATED WORK

Adversarial training and MMD-GAN. Integral Probability Metrics (IPMs) are good candidates to define discriminators in the context of generative modeling, since they are well defined even in the case of distributions with non-overlapping support (Müller, 1997). Moreover, implementations of f-divergence discriminators in GANs rely on variational lower bounds (Nowozin et al., 2016): as noted earlier, these share useful properties of IPMs in theory and in practice (notably, they remain well defined for distributions with disjoint support, and may metrize weak convergence for sufficiently rich witness function classes (Arbel et al., 2021, Proposition 14; Zhang et al., 2018)). Several works (Arjovsky et al., 2017; Gulrajani et al., 2017; Genevay et al., 2018; Li et al., 2017; Binkowski et al., 2018) have exploited IPMs as discriminators for training GANs, where the IPMs are MMDs using (linear or nonlinear) kernels defined on learned neural net features, making them suited to high dimensional settings such as image generation. Interpreting the IPM-based GAN discriminator as a squared MMD yields an interesting theoretical insight: Franceschi et al.
(2022) show that training a GAN with an IPM objective implicitly optimizes MMD² in the Neural Tangent Kernel (NTK) limit (Jacot et al., 2018). IPM GAN discriminators are trained jointly with the generator in a min-max game. Adversarial training is challenging, and can suffer from instability, mode collapse, and misconvergence (Xiao et al., 2022; Binkowski et al., 2018; Li et al., 2017; Arora et al., 2017; Kodali et al., 2017; Salimans et al., 2016). Note that once a GAN has been trained, the samples can be refined via MCMC sampling in the generator latent space (e.g., using kinetic Langevin dynamics; see Ansari et al., 2021; Che et al., 2020; Arbel et al., 2021).

Discriminator flows for generative modeling. Wasserstein gradient flows (Ambrosio et al., 2008; Santambrogio, 2015) applied to a GAN discriminator are informally called discriminator flows, see (Franceschi et al., 2023). A number of recent works have focused on replacing a GAN generator by a discriminator flow. Fan et al. (2022) propose a discretisation of the JKO scheme (Jordan et al., 1998) to define a Kullback-Leibler (KL) divergence gradient flow. Other approaches have used a discretized interactive particle-based approach instead of JKO, similar to (3). Heng et al. (2023) build such a flow based on f-divergences, whereas Franceschi et al. (2023) focus on MMD gradient flow. In all these works, an explicit generator is replaced by a corresponding discriminator flow. The sampling process during training is as follows: let Y_k be the samples produced at training iteration k by the gradient flow F_k induced by the discriminator D_k applied to samples Y_{k−1} from the previous iteration. We denote this by Y_k = F_k(D_k, Y_{k−1}). Then, the discriminator at iteration k + 1 is trained on samples Y_k.
A challenge of this process is that the training sample for the next discriminator will be determined by the previous discriminators, and thus the generation process is still adversarial: particle transport minimizes the previous discriminator value, and the subsequent discriminator is maximized on these particles. Consequently, it is difficult to control or predict the overall sample trajectory from the initial distribution to the target, which might explain the performance shortfall of these methods in image generation settings. By contrast, we have explicit control over the training particle trajectory via the forward noising diffusion process. Furthermore, these approaches (except for Heng et al., 2023) require storing all intermediate discriminators D_1, . . . , D_N throughout training (N is the total number of training iterations). These discriminators are then used to produce new samples by applying the sequence of gradient flows F_N(D_N, ·) ∘ . . . ∘ F_1(D_1, ·) to Y_0 sampled from the initial distribution. This creates a large memory overhead. An alternative is to use pretrained features obtained elsewhere, or a fixed kernel with empirically selected hyperparameters (see Hertrich et al., 2024; Hagemann et al., 2024; Altekrüger et al., 2023); however, this limits the applicability of the method. To the best of our knowledge, our approach is the first to demonstrate the possibility of training a discriminator without adversarial training, such that this discriminator can then be used to produce samples with a gradient flow. Unlike the alternatives, our approach does not require storing intermediate discriminators.

MMD for diffusion refinement/regularization. MMD has been used to either regularize the training of diffusion models (Li & van der Schaar, 2024) or to finetune them (Aiello et al., 2024) for fast sampling. The MMD kernel in these works has the form (2) with Inception features (Szegedy et al., 2015).
Our method removes the need for pretrained features by training the MMD discriminator.

Diffusion models. Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021) represent a powerful new family of generative models due to their strong empirical performance in many domains (Saharia et al., 2022; Le et al., 2023; Ho et al., 2022; Watson et al., 2023; Poole et al., 2023). Unlike GANs, diffusion models do not require adversarial training. At training time, a denoiser is learned for multiple noise levels. As noted above, our work borrows from the training of diffusion models, as we train a discriminator on multiple noise levels of the forward diffusion process (Ho et al., 2020). This gives better control of the training samples for the (noise adapted) discriminator than using an incompletely trained GAN generator. We may also consider the setting of flow matching, which is related to the diffusion setting. The potential advantages of our approach for flow matching are discussed in Appendix G.

Predictor-corrector. Backward diffusion might produce samples at time t which do not correspond to the forward process at the same time. To fix this discrepancy, one can leverage the predictor-corrector scheme (Song et al., 2021). At sampling time t, we run a Langevin algorithm for a few iterations targeting P_t (corrector) before performing the jump to another noise level (predictor). Our Algorithm 2 can be interpreted as a corrector-only scheme, where instead of using the Langevin algorithm, we use MMD flows.

7 EXPERIMENTS

Understanding DMMD behavior in 2-D. Our aim is to get an understanding of the behavior of DMMD described in Section 4. We expect DMMD to mimic GAN discriminator training via noise conditional discriminator learning.
To see whether this manifests in practice, we design an experiment with a Radial Basis Function (RBF) kernel for MMD, kt(x, y) = exp(−∥x − y∥² / (2σ²(t; θ))), where the noise-dependent kernel width function σ(·; θ) : [0, 1] → [0, +∞) is parameterized by θ ∈ R^K. This parameter controls the coarseness of the MMD discriminator. We consider a 2-D checkerboard dataset, see Figure 1, left. We learn noise-conditional kernel widths σ(t; θ) using a neural network. As baselines, we train MMD-GAN where the discriminator learns σ, as well as MMD gradient flow with fixed values of σ and with a manually selected noise-dependent σ(t) = 0.1(1 − t) + 0.5t, i.e. linear interpolation. All experimental details are provided in Appendix H. We report the learned RBF kernel widths for DMMD in Figure 2, left. As expected, as the noise level goes from high to low, the kernel width σ(t) decreases. In Figure 2, center, we show the learned MMD-GAN kernel width parameter σ as a function of training iterations. As training progresses, this parameter decreases, since the corresponding generator produces samples close to the target distribution. The behaviors of DMMD and MMD-GAN are quite similar, and the ranges of values of the kernel widths are also similar. This highlights our point that DMMD mimics the training of a GAN discriminator. The exact dynamics of σ(t) in DMMD depend on the parameters of the forward diffusion process (6). The sharp phase transition is consistent with the phase transition highlighted in Section 3. In addition, we report MMD²(Pt, P; t) for the different methods in Figure 2, right. We see that DMMD behaves similarly to linear interpolation, but is more nuanced at higher noise levels. The samples are reported in Figure 1. DMMD produces samples which are visually better than those of the other baselines. For the RBF kernel, we noticed the presence of outliers.
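The toy setup above can be sketched in a few lines of NumPy: `sigma_linear` is the manually selected linear-interpolation schedule from the text, and `mmd2_unbiased` is the standard unbiased MMD² estimate (Gretton et al., 2012) with the RBF kernel. Function names are ours, not the paper's code.

```python
import numpy as np

def sigma_linear(t):
    # Manually selected noise-dependent kernel width from the 2-D experiment:
    # linear interpolation between 0.1 (t = 0, clean data) and 0.5 (t = 1, pure noise).
    return 0.1 * (1.0 - t) + 0.5 * t

def rbf_kernel(x, y, sigma):
    # k_t(x, y) = exp(-||x - y||^2 / (2 sigma^2)), evaluated pairwise.
    sq = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def mmd2_unbiased(x, y, sigma):
    # Unbiased MMD^2 estimate: off-diagonal averages of the within-sample
    # Gram matrices minus twice the cross-sample average.
    n, m = len(x), len(y)
    kxx = rbf_kernel(x, x, sigma)
    kyy = rbf_kernel(y, y, sigma)
    kxy = rbf_kernel(x, y, sigma)
    return ((kxx.sum() - np.trace(kxx)) / (n * (n - 1))
            + (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
            - 2.0 * kxy.mean())
```

In the DMMD variant, σ in `mmd2_unbiased` would be replaced by the learned σ(t; θ) at the current noise level t.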
The number of outliers generally depends on the kernel; see the Appendix of (Hertrich et al., 2024) for more details. Image generation. We study the performance of DMMD on unconditional image generation on CIFAR10 (Krizhevsky & Hinton, 2009). We use the same forward diffusion process as in (Ho et al., 2020) to produce noisy images. We use a U-Net (Ronneberger et al., 2015) backbone for the discriminator feature network ϕ(x, t; θ), with a slightly different architecture from (Ho et al., 2020), see Appendix I. For all image-based experiments, we use a linear base kernel (11). We explored other kernels such as RBF and Rational Quadratic (RQ), but did not find an improvement in performance. We use FID (Heusel et al., 2017) and Inception Score (Salimans et al., 2016) for evaluation, see Appendix I. Unless specified otherwise, we use Np = 200 particles for Algorithm 2. We perform an ablation over the number of particles in Appendix I.3. The total number of iterations for DMMD equals T · Ns, where T is the number of noise levels and Ns is the number of steps per noise level. For consistency with diffusion models, we call this the number of function evaluations (NFE), and we show the performance of DMMD with different NFEs. As we show in Appendix J (see Table 8), FID improves as we increase the NFE, but only up to a point (NFE = 250).

Table 1: Unconditional generation on CIFAR10. For MMD GAN (orig.), we used the mixed RQ kernel (see (Binkowski et al., 2018)). Orig. = original paper, impl. = our implementation. For JKO-Flow (Fan et al., 2022), the NFE is taken from their Figure 12.

Method             FID     IS     NFE
MMD GAN (orig.)    39.90   6.51   -
MMD GAN (impl.)    13.62   8.93   -
DDPM (orig.)        3.17   9.46   1000
DDPM (impl.)        5.19   8.90   100
Discriminator flow baselines
DGGF-KL            28.80   -      110
JKO-Flow           23.10   7.48   150
MMD flow baselines
MMD-GAN-Flow       450     1.21   100
GS-MMD-RK          55.00   -      86
DMMD (ours)         8.31   9.09   100
DMMD (ours)         7.74   9.12   250

Figure 2: Toy experiment. Left, learned RBF kernel widths σ(t) for DMMD. Center, σ for MMD-GAN as a function of training iterations. Right, MMD²(Pt, P; t) for different methods.

Table 2: Approximate sampling performance on CIFAR10. IS stands for Inception Score.

Method      FID     IS     NFE
DMMD        8.31    9.09   100
DMMD-e      8.21    8.99   102
a-DMMD      24.86   9.10   50
a-DMMD-e    9.185   8.70   52
a-DMMD-a    11.22   9.00   52

As baselines, we reimplement MMD-GAN (Binkowski et al., 2018) with a linear base kernel and DDPM (Ho et al., 2020) using the same NN backbones as for DMMD. We also report results from the original papers. We further consider baselines based on discriminator flows: JKO-Flow (Fan et al., 2022), which uses the JKO (Jordan et al., 1998) scheme for the KL gradient flow; and Deep Generative Wasserstein Gradient Flows (DGGF-KL) (Heng et al., 2023), which uses a particle-based approach (as in (3)) for the KL gradient flow. These methods use adversarial training to train discriminators (see Section 6). We further compare against Generative Sliced MMD Flows with Riesz Kernels (GS-MMD-RK) (Hertrich et al., 2024), which uses a particle-based approach similar to DGGF-KL to construct an MMD flow, but with a fixed (kernel) discriminator. We also report results using a discriminator flow defined on a trained MMD-GAN discriminator, which we call MMD-GAN-Flow. Experimental details are given in Appendix I. The results are provided in Table 1. We see that DMMD performs better than MMD GAN. As expected, MMD-GAN-Flow does not work.
This is because the MMD-GAN discriminator at convergence was trained on samples close to the target distribution. For the RBF kernel experiment, this means that the gradient of MMD will be very small on samples far away from the target distribution. This highlights the benefit of adaptive MMD discriminators. Moreover, DMMD performs better than GS-MMD-RK, which uses a fixed kernel. This highlights the advantage of learning discriminator features in DMMD. DMMD achieves superior performance compared to the other discriminator flow baselines. We believe that one of the reasons for the under-performance of these methods is adversarial training, which makes the hyperparameter choice tricky. DMMD, on the other hand, relies on the simple non-adversarial training procedure of Algorithm 1. Finally, we see that DDPM performs better than DMMD. This is not surprising, since both the U-Net architecture and the forward diffusion process (6) were optimized for DDPM performance. Nevertheless, DMMD demonstrates strong empirical performance as a discriminator flow method trained without adversarial training. Samples from our method are provided in Appendix L.1. We provide results on CELEB-A, LSUN Church and MNIST below. Approximate sampling. We run the approximate MMD gradient flow (see Section 5) with the same discriminator as for DMMD. We call this variant a-DMMD, where a stands for approximate. We then refine the solution of a-DMMD in a denoising step by taking two additional MMD gradient flow steps with a higher learning rate, using either the approximate gradient flow, which we call a-DMMD-a, or the exact gradient flow (9) applied to a single particle, which we call a-DMMD-e, where e stands for exact. Finally, for reference, we add a final denoising step to the original DMMD flow, which we call DMMD-e. Results are provided in Table 2. We observe that a-DMMD performs worse than DMMD, as expected.
Applying a denoising step improves the performance of a-DMMD, bringing it closer to DMMD. This suggests that the approximation (13) moves the particles close to the target distribution; but once close to the target, a more refined procedure is required. By contrast, we see that denoising helps DMMD only marginally. This suggests that the exact noise-conditional witness function (10) accurately captures fine detail close to the target distribution. Results on MNIST, CELEB-A (64x64) and LSUN-Church (64x64). Besides CIFAR-10, we study the performance of DMMD on MNIST (Lecun et al., 1998), CELEB-A (64x64) (Liu et al., 2015) and LSUN-Church (64x64) (Yu et al., 2015). For MNIST and CELEB-A, we consider the same splits and evaluation regime as in (Franceschi et al., 2023). For LSUN Church, the splits and the evaluation regime are taken from (Ho et al., 2020). For more details, see Appendix I.1, where we also provide results for the high-resolution dataset CELEB-A-HQ (128x128), and results on CELEB-A-HQ (256x256) using DMMD in latent space. As baselines, we consider our implementations of DDPM (Ho et al., 2020) and MMD-GAN (Binkowski et al., 2018). In addition to DMMD, we report the performance of the discriminator flow baseline from (Franceschi et al., 2023), with numbers taken from the corresponding paper. This baseline uses adversarial training together with MMD gradient flow to produce samples. The results are provided in Table 3. We see that DMMD performs better than the discriminator flow and MMD-GAN, which is consistent with our findings on CIFAR-10. It underperforms compared to DDPM. The corresponding samples are provided in Appendix L.2.

Table 3: Unconditional generation on additional datasets. The metric used is FID.

Dataset   MMD-GAN   DDPM   DMMD   Disc. flow
MNIST     7.0       1.94   3.0    4.0
CELEB-A   12.1      6.72   8.3    41.0
LSUN      8.4       3.84   6.1    -

8 CONCLUSION In this paper we have presented a method to train a noise-conditional discriminator without adversarial training, using a forward diffusion process. We use this noise-conditional discriminator to generate samples using a noise-adaptive MMD gradient flow. We provide theoretical insight into why an adaptive gradient flow can provide faster convergence than the non-adaptive variant. We demonstrate strong empirical performance of our method on unconditional image generation on CIFAR10, as well as on additional, similar image datasets. We propose a scalable approximation of our approach which comes close to the original empirical performance. A number of questions remain open for future work. The empirical performance of DMMD will be of interest in regimes where diffusion models can be ill-behaved, such as generative modeling on Riemannian manifolds, as well as on larger datasets such as ImageNet. DMMD provides a way of training a discriminator which may be applicable in other areas where a domain-adaptive discriminator is required. Finally, it will be of interest to establish theoretical foundations for DMMD in general settings, and to derive convergence results for the associated flow. REFERENCES Aiello, E., Valsesia, D., and Magli, E. Fast inference in denoising diffusion models via MMD finetuning. IEEE Access, 12:106912–106923, 2024. doi: 10.1109/ACCESS.2024.3436698. URL https://doi.org/10.1109/ACCESS.2024.3436698. Albergo, M. S., Boffi, N. M., and Vanden-Eijnden, E. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023. Altekrüger, F., Hertrich, J., and Steidl, G. Neural Wasserstein gradient flows for maximum mean discrepancies with Riesz kernels, 2023. Ambrosio, L., Gigli, N., and Savaré, G. Gradient Flows in Metric Spaces and in the Space of Probability Measures.
Lectures in Mathematics ETH Zürich. Birkhäuser, 2nd edition, 2008. ISBN 978-3-7643-8722-8, 978-3-7643-8721-1. OCLC: 254181287. Ansari, A. F., Ang, M. L., and Soh, H. Refining deep generative models via discriminator gradient flow. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=Zbc-ue9p_rE. Arbel, M., Sutherland, D. J., Binkowski, M., and Gretton, A. On gradient regularizers for MMD GANs. In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp. 6701–6711, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/07f75d9144912970de5a09f5a305e10c-Abstract.html. Arbel, M., Korba, A., Salim, A., and Gretton, A. Maximum mean discrepancy gradient flow. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 6481–6491, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/944a5ae3483ed5c1e10bbccb7942a279-Abstract.html. Arbel, M., Zhou, L., and Gretton, A. Generalized energy based models. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=0PtUPB9z6qK. Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 214–223. PMLR, 06–11 Aug 2017.
URL https://proceedings.mlr.press/v70/arjovsky17a.html. Arora, S., Ge, R., Liang, Y., Ma, T., and Zhang, Y. Generalization and equilibrium in generative adversarial nets (GANs). In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pp. 224–232. PMLR, 2017. URL http://proceedings.mlr.press/v70/arora17a.html. Bansal, A., Borgnia, E., Chu, H.-M., Li, J., Kazemi, H., Huang, F., Goldblum, M., Geiping, J., and Goldstein, T. Cold diffusion: Inverting arbitrary image transforms without noise. Advances in Neural Information Processing Systems, 36:41259–41282, 2023. Binkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. Demystifying MMD GANs. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=r1lUOzWCW. Bortoli, V. D. Convergence of denoising diffusion models under the manifold hypothesis. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://openreview.net/forum?id=MhK5aXo3gB. Expert Certification. Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=B1xsqj09Fm. Brown, B. C., Caterini, A. L., Ross, B. L., Cresswell, J. C., and Loaiza-Ganem, G. The union of manifolds hypothesis and its implications for deep generative modelling. arXiv preprint arXiv:2207.02862, 2022. Che, T., Zhang, R., Sohl-Dickstein, J., Larochelle, H., Paull, L., Cao, Y., and Bengio, Y. Your GAN is secretly an energy-based model and you should use discriminator driven latent sampling.
In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546. Chen, S., Chewi, S., Li, J., Li, Y., Salim, A., and Zhang, A. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=zyLVMgsZ0U_. Daras, G., Delbracio, M., Talebi, H., Dimakis, A., and Milanfar, P. Soft diffusion: Score matching with general corruptions. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=W98rebBxlQ. Fan, J., Zhang, Q., Taghvaei, A., and Chen, Y. Variational Wasserstein gradient flow. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 6185–6215. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/fan22d.html. Fefferman, C., Mitter, S., and Narayanan, H. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049, 2016. Franceschi, J., de Bézenac, E., Ayed, I., Chen, M., Lamprier, S., and Gallinari, P. A neural tangent kernel perspective of GANs. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., and Sabato, S. (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp. 6660–6704. PMLR, 2022. URL https://proceedings.mlr.press/v162/franceschi22a.html. Franceschi, J.-Y., Gartrell, M., Santos, L. D., Issenhuth, T., de Bézenac, E., Chen, M., and Rakotomamonjy, A. Unifying GANs and score-based diffusion as generative particle models.
In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS'23, Red Hook, NY, USA, 2023. Curran Associates Inc. Gao, R., Hoogeboom, E., Heek, J., Bortoli, V. D., Murphy, K. P., and Salimans, T. Diffusion meets flow matching: Two sides of the same coin. 2024. URL https://diffusionflow.github.io/. Genevay, A., Peyré, G., and Cuturi, M. Learning generative models with Sinkhorn divergences. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pp. 1608–1617. PMLR, 2018. Glaser, P., Arbel, M., and Gretton, A. KALE flow: A relaxed KL gradient flow for probabilities with disjoint support. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=ZBeCVICs1Ua. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper_files/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf. Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13(25):723–773, 2012. URL http://jmlr.org/papers/v13/gretton12a.html. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. Improved training of Wasserstein GANs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pp. 5769–5779, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964. Hagemann, P., Hertrich, J., Altekrüger, F., Beinert, R., Chemseddine, J., and Steidl, G.
Posterior sampling based on gradient flows of the MMD with negative distance kernel. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=YrXHEb2qMb. Heng, A., Ansari, A. F., and Soh, H. Deep generative Wasserstein gradient flows, 2023. URL https://openreview.net/forum?id=zjSeBTEdXp1. Hertrich, J., Wald, C., Altekrüger, F., and Hagemann, P. Generative sliced MMD flows with Riesz kernels. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VdkGRV1vcf. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pp. 6629–6640, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964. Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546. Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022. Hyvärinen, A. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(24):695–709, 2005. URL http://jmlr.org/papers/v6/hyvarinen05a.html. Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: convergence and generalization in neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pp. 8580–8589, Red Hook, NY, USA, 2018. Curran Associates Inc. Jordan, R., Kinderlehrer, D., and Otto, F. The variational formulation of the Fokker–Planck equation.
SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998. doi: 10.1137/S0036141096303359. URL https://doi.org/10.1137/S0036141096303359. Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., and Aila, T. Training generative adversarial networks with limited data. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Red Hook, NY, USA, 2020a. Curran Associates Inc. ISBN 9781713829546. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. Analyzing and improving the image quality of StyleGAN. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8107–8116, 2020b. doi: 10.1109/CVPR42600.2020.00813. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Bengio, Y. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980. Kodali, N., Abernethy, J., Hays, J., and Kira, Z. On convergence and stability of GANs, 2017. URL https://arxiv.org/abs/1705.07215. Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical Report 0, University of Toronto, Toronto, Ontario, 2009. URL https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf. Le, M., Vyas, A., Shi, B., Karrer, B., Sari, L., Moritz, R., Williamson, M., Manohar, V., Adi, Y., Mahadeokar, J., and Hsu, W.-N. Voicebox: text-guided multilingual universal speech generation at scale. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS'23, Red Hook, NY, USA, 2023. Curran Associates Inc. Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791. Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., and Póczos, B.
MMD GAN: Towards deeper understanding of moment matching network. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pp. 2200–2210, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964. Li, Y. and van der Schaar, M. On error propagation of diffusion models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=RtAct1E2zS. Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t. Liu, M.-Y., Huang, X., Yu, J., Wang, T.-C., and Mallya, A. Generative adversarial networks for image and video synthesis: Algorithms and applications. Proceedings of the IEEE, 109(5):839–862, 2021. doi: 10.1109/JPROC.2021.3049196. Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 3730–3738, 2015. doi: 10.1109/ICCV.2015.425. Mescheder, L., Geiger, A., and Nowozin, S. Which training methods for GANs do actually converge? In International Conference on Machine Learning, pp. 3481–3490. PMLR, 2018. Müller, A. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29:429–443, 1997. Nowozin, S., Cseke, B., and Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pp. 271–279, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819. Pidstrigach, J. Score-based generative models detect manifolds. Advances in Neural Information Processing Systems, 35:35852–35865, 2022. Poole, B., Jain, A., Barron, J. T., and Mildenhall, B. DreamFusion: Text-to-3D using 2D diffusion.
In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=FjNys5c7VyY. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10674–10685, Los Alamitos, CA, USA, June 2022. IEEE Computer Society. doi: 10.1109/CVPR52688.2022.01042. URL https://doi.ieeecomputersociety.org/10.1109/CVPR52688.2022.01042. Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Navab, N., Hornegger, J., Wells, W. M., and Frangi, A. F. (eds.), Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015, pp. 234–241, Cham, 2015. Springer International Publishing. ISBN 978-3-319-24574-4. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pp. 2234–2242, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819. Santambrogio, F. Optimal Transport for Applied Mathematicians. Birkhäuser, NY, 55(58-63):94, 2015. Sauer, A., Lorenz, D., Blattmann, A., and Rombach, R. Adversarial diffusion distillation, 2023. URL https://arxiv.org/abs/2311.17042. Schölkopf, B. and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, 06 2018. ISBN 9780262256933. doi: 10.7551/mitpress/4175.001.0001. URL https://doi.org/10.7551/mitpress/4175.001.0001.
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 2256–2265, Lille, France, 07–09 Jul 2015. PMLR. URL https://proceedings.mlr.press/v37/sohl-dickstein15.html. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRHS. Song, Y., Dhariwal, P., Chen, M., and Sutskever, I. Consistency models. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 32211–32252. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/song23a.html. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9, Los Alamitos, CA, USA, June 2015. IEEE Computer Society. doi: 10.1109/CVPR.2015.7298594. URL https://doi.ieeecomputersociety.org/10.1109/CVPR.2015.7298594. Tenenbaum, J. B., de Silva, V., and Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000. doi: 10.1126/science.290.5500.2319. URL https://www.science.org/doi/abs/10.1126/science.290.5500.2319. van den Oord, A., Vinyals, O., and Kavukcuoglu, K. Neural discrete representation learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pp. 6309–6318, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964. Villani, C.
Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2008. ISBN 9783540710509. URL https://books.google.co.uk/books?id=hV8o5R7_5tkC. Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011. Watson, J. L., Juergens, D., Bennett, N. R., Trippe, B. L., Yim, J., Eisenach, H. E., Ahern, W., Borst, A. J., Ragotte, R. J., Milles, L. F., Wicky, B. I. M., Hanikel, N., Pellock, S. J., Courbet, A., Sheffler, W., Wang, J., Venkatesh, P., Sappington, I., Torres, S. V., Lauko, A., De Bortoli, V., Mathieu, E., Ovchinnikov, S., Barzilay, R., Jaakkola, T. S., Di Maio, F., Baek, M., and Baker, D. De novo design of protein structure and function with RFdiffusion. Nature, 620(7976):1089–1100, Jul 2023. ISSN 1476-4687. doi: 10.1038/s41586-023-06415-8. URL http://dx.doi.org/10.1038/s41586-023-06415-8. Xiao, Z., Kreis, K., and Vahdat, A. Tackling the generative learning trilemma with denoising diffusion GANs. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=JprM0p-q0Co. Xu, Y., Zhao, Y., Xiao, Z., and Hou, T. UFOGen: You forward once large scale text-to-image generation via diffusion GANs. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8196–8206, 2024. doi: 10.1109/CVPR52733.2024.00783. Yang, Z., Feng, R., Zhang, H., Shen, Y., Zhu, K., Huang, L., Zhang, Y., Liu, Y., Zhao, D., Zhou, J., and Cheng, F. Lipschitz singularities in diffusion models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=WNkW0cOwiz. Yu, F., Zhang, Y., Song, S., Seff, A., and Xiao, J. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop (v2017). arXiv preprint arXiv:1506.03365, 2015. Zhang, P., Liu, Q., Zhou, D., Xu, T., and He, X.
On the discrimination-generalization tradeoff in GANs. In 6th International Conference on Learning Representations, 2018. A ORGANIZATION OF THE SUPPLEMENTARY MATERIAL In Appendix B, we briefly describe adversarial training in the context of Generative Adversarial Networks (GANs). In Appendix C, we describe in detail the training and sampling procedures for DMMD. In Appendix D, we provide more details on the computational complexity of DMMD during training and sampling. In Appendix E, we explain how our approach can be applied to f-divergences, resulting in DKALE-Flow, and in Appendix F, we provide more details about this method. In Appendix G, we discuss the connection to Flow Matching. In Appendix H, we describe the 2-D experiments in more detail. In Appendix I, we provide experimental details for the image datasets as well as additional results. In Appendix K, we provide proofs for the theoretical results described in Section 3 of the main paper. Finally, in Appendix L we present samples from DMMD on the different image datasets. B ADVERSARIAL TRAINING We briefly describe the notion of adversarial training in the context of Generative Adversarial Networks (GANs). The objective function for GANs is F(θ, ψ) = E_{Z∼U[−1,1]}[D(X, G(Z; ψ); θ)], where G(Z; ψ) is a generator with parameters ψ and D(X, G(Z; ψ); θ) is a discriminator divergence with parameters θ. The objective for the generator is ψ* ∈ arg min_ψ max_θ F(θ, ψ). The objective for the discriminator is θ* ∈ arg max_θ min_ψ F(θ, ψ). In practice, we alternate updates of the objective F for the generator and the discriminator. Let n be the iteration number. Then, the generator parameters are updated as ψ_{n+1} ← ψ_n − α_ψ ∇_ψ F(θ_n, ψ_n), and the discriminator parameters are updated as θ_{n+1} ← θ_n + α_θ ∇_θ F(θ_n, ψ_{n+1}), where α_ψ, α_θ are learning rates.
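The alternating updates above can be sketched as follows. Here `grad_psi` and `grad_theta` are placeholders for ∇ψF and ∇θF, and the scalar parameters are purely illustrative (a real GAN would carry neural-network parameter tensors):

```python
def adversarial_training(grad_psi, grad_theta, psi, theta,
                         n_iters, lr_psi, lr_theta):
    # Alternating GAN-style updates from Appendix B:
    # the generator descends F in psi, the discriminator ascends F in theta.
    for _ in range(n_iters):
        # Generator step: psi_{n+1} <- psi_n - lr * grad_psi F(theta_n, psi_n)
        psi = psi - lr_psi * grad_psi(theta, psi)
        # Discriminator step: theta_{n+1} <- theta_n + lr * grad_theta F(theta_n, psi_{n+1})
        theta = theta + lr_theta * grad_theta(theta, psi)
    return psi, theta
```

With a toy quadratic F(θ, ψ) = −θ² + θψ, whose unique equilibrium is (0, 0), the alternating iteration converges; in general, as the paper stresses, such min/max dynamics are much harder to stabilize than the non-adversarial DMMD training.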
When we train DMMD, it can be interpreted as having a fixed generator given by the forward diffusion process, and the training is no longer adversarial since there is no min/max optimization.

C DMMD TRAINING AND SAMPLING

C.1 MMD DISCRIMINATOR

Let X ⊆ R^D and P(X) be the set of probability distributions defined on X. Let P ∈ P(X) be the target or data distribution and Q_ψ ∈ P(X) be a distribution associated with a generator parameterized by ψ ∈ R^L. Let H be a Reproducing Kernel Hilbert Space (RKHS), see (Schölkopf & Smola, 2018) for details, for some kernel k : X × X → R. The Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) between Q_ψ and P is defined as

MMD(Q_ψ, P) = sup_{f ∈ H, ‖f‖_H ≤ 1} {E_{Q_ψ}[f(X)] − E_P[f(X)]}.

Given X^N = {x_i}_{i=1}^N ∼ Q_ψ^N and Y^M = {y_i}_{i=1}^M ∼ P^M, an unbiased estimate of MMD² (Gretton et al., 2012) is given by

MMD²_u[X^N, Y^M] = 1/(N(N−1)) Σ_{i≠j} k(x_i, x_j) + 1/(M(M−1)) Σ_{i≠j} k(y_i, y_j) − 2/(NM) Σ_{i=1}^N Σ_{j=1}^M k(x_i, y_j).    (14)

In MMD GAN (Binkowski et al., 2018; Li et al., 2017), the kernel in the objective (14) is given as

k(x, y) = k_base(ϕ(x; θ), ϕ(y; θ)),    (15)

where k_base is a base kernel and ϕ(·; θ) : X → R^K are neural network discriminator features with parameters θ ∈ R^H. We use the modified notation MMD²_u[X^N, Y^M; θ] for equation (14) to highlight the functional dependence on the discriminator parameters. MMD is an instance of an Integral Probability Metric (IPM) (see (Arjovsky et al., 2017)), which is well defined on distributions with disjoint support, unlike f-divergences (Nowozin et al., 2016). An advantage of using MMD over other IPMs (see, for example, Wasserstein GAN (Arjovsky et al., 2017)) is the flexibility to choose a kernel k.
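The unbiased estimator (14) can be sketched directly. This is an illustrative implementation with a fixed Gaussian base kernel in place of the learned feature kernel (15); the bandwidth value is an assumption.

```python
import numpy as np

# Unbiased MMD^2 estimate of equation (14), here with a plain Gaussian
# kernel (a stand-in for the learned kernel k_base(phi(x), phi(y))).

def gaussian_kernel(X, Y, bandwidth=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def mmd2_unbiased(X, Y, bandwidth=1.0):
    N, M = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, bandwidth)
    Kyy = gaussian_kernel(Y, Y, bandwidth)
    Kxy = gaussian_kernel(X, Y, bandwidth)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (N * (N - 1))  # sums over i != j
    term_y = (Kyy.sum() - np.trace(Kyy)) / (M * (M - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(size=(500, 2)), rng.normal(size=(500, 2)))
diff = mmd2_unbiased(rng.normal(size=(500, 2)), rng.normal(3.0, 1.0, size=(500, 2)))
print(abs(same) < 0.05, diff > 0.1)
```

Because the estimator is unbiased, it is close to zero (and may be slightly negative) when both samples come from the same distribution, and clearly positive when the distributions differ.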
Another form of MMD is expressed as the norm of a witness function:

MMD(Q_ψ, P) = sup_{f ∈ H, ‖f‖_H ≤ 1} {E_{Q_ψ}[f(X)] − E_P[f(X)]} = ‖f*_{Q_ψ,P}‖_H,

where the witness function f*_{Q_ψ,P} is given as

f*_{Q_ψ,P}(z) = ∫ k(x, z) dQ_ψ(x) − ∫ k(y, z) dP(y).

Given two sets of samples X^N = {x_i}_{i=1}^N ∼ Q_ψ^N and Y^M = {y_i}_{i=1}^M ∼ P^M, and the kernel (15), the empirical witness function is given as

f̂_{Q_ψ,P}(z) = 1/N Σ_{i=1}^N k_base(ϕ(x_i; θ), ϕ(z; θ)) − 1/M Σ_{j=1}^M k_base(ϕ(y_j; θ), ϕ(z; θ)).

The ℓ2 penalty (Binkowski et al., 2018) is defined as

L_ℓ2(θ) = 1/N Σ_{i=1}^N ‖ϕ(x_i; θ)‖²₂ + 1/N Σ_{i=1}^N ‖ϕ(y_i; θ)‖²₂.

Assuming that M = N and following (Binkowski et al., 2018; Gulrajani et al., 2017), for α_i ∼ U[0, 1], where U[0, 1] is the uniform distribution on [0, 1], we construct z_i = α_i x_i + (1 − α_i) y_i for all i = 1, ..., N. Then, the gradient penalty (Binkowski et al., 2018; Gulrajani et al., 2017) is defined as

L_∇(θ) = 1/N Σ_{i=1}^N (‖∇ f̂_{Q_ψ,P}(z_i)‖₂ − 1)².

We denote by L(θ) the MMD discriminator loss, given as

L(θ) = MMD²_u[X^N, Y^M; θ] = 1/(N(N−1)) Σ_{i≠j} k_base(ϕ(x_i; θ), ϕ(x_j; θ)) + 1/(M(M−1)) Σ_{i≠j} k_base(ϕ(y_i; θ), ϕ(y_j; θ)) − 2/(NM) Σ_{i=1}^N Σ_{j=1}^M k_base(ϕ(x_i; θ), ϕ(y_j; θ)).

Then, the total loss for the discriminator on the two samples of data, assuming that N = M, is given as

L_tot(θ) = L(θ) + λ_∇ L_∇(θ) + λ_ℓ2 L_ℓ2(θ),

for some constants λ_∇ ≥ 0 and λ_ℓ2 ≥ 0.

C.2 NOISE-DEPENDENT MMD

In Section 4, we describe the approach to train the MMD discriminator from forward diffusion using noise-dependent discriminators. For that, we assume that we are given a noise level t ∼ U[0, 1], where U[0, 1] is the uniform distribution on [0, 1], and a set of clean data X^N = {x_i}_{i=1}^N ∼ P^N. Then we produce a set of noisy samples x_t^i using the forward diffusion process (6). We denote these samples by X_t^N = {x_t^i}_{i=1}^N. We define the noise-conditional kernel

k(x, y; t, θ) = k_base(ϕ(x, t; θ), ϕ(y, t; θ)),

with noise-conditional features ϕ(x, t; θ).
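The empirical witness function above is a simple kernel-mean difference, and its sign indicates which distribution has excess mass. A minimal sketch, assuming identity features ϕ(x) = x and a Gaussian base kernel in place of the learned discriminator features:

```python
import numpy as np

# Empirical witness function f_hat_{Q,P}(z) with identity features and a
# Gaussian base kernel (an illustrative stand-in for the learned features).

def k_base(a, b, bw=1.0):
    return np.exp(-((a - b) ** 2).sum(-1) / (2 * bw ** 2))

def witness(z, X, Y, bw=1.0):
    # mean kernel evaluation against Q-samples minus against P-samples
    return k_base(X, z, bw).mean() - k_base(Y, z, bw).mean()

rng = np.random.default_rng(1)
X = rng.normal(-2.0, 0.5, size=(400, 1))  # samples from Q
Y = rng.normal(+2.0, 0.5, size=(400, 1))  # samples from P
# The witness is positive where Q has excess mass and negative where P does.
print(witness(np.array([-2.0]), X, Y) > 0, witness(np.array([2.0]), X, Y) < 0)
```

Following the negative gradient of this witness is exactly what moves particles from Q toward P in the MMD gradient flow.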
This allows us to define the noise-conditional discriminator loss

L(θ, t) = MMD²_u[X^N, X_t^N; t, θ] = 1/(N(N−1)) Σ_{i≠j} k_base(ϕ(x_t^i; t, θ), ϕ(x_t^j; t, θ)) + 1/(N(N−1)) Σ_{i≠j} k_base(ϕ(x_i; t, θ), ϕ(x_j; t, θ)) − 2/N² Σ_{i=1}^N Σ_{j=1}^N k_base(ϕ(x_i; t, θ), ϕ(x_t^j; t, θ)).    (16)

The noise-conditional ℓ2 penalty is given as

L_ℓ2(θ, t) = 1/N Σ_{i=1}^N ‖ϕ(x_t^i; t, θ)‖²₂ + 1/N Σ_{i=1}^N ‖ϕ(x_i; t, θ)‖²₂.

The noise-conditional gradient penalty is given as

L_∇(θ, t) = 1/N Σ_{i=1}^N (‖∇ f̂_{P,t}(z_i)‖₂ − 1)²,

where z_i = α_i x_t^i + (1 − α_i) x_i for α_i ∼ U[0, 1], and the noise-conditional witness function is

f̂_{P,t}(z) = 1/N Σ_{i=1}^N k_base(ϕ(x_t^i; t, θ), ϕ(z; t, θ)) − 1/N Σ_{j=1}^N k_base(ϕ(x_j; t, θ), ϕ(z; t, θ)).    (17)

Therefore, the total noise-conditional loss is given as

L_tot(θ, t) = L(θ, t) + λ_∇ L_∇(θ, t) + λ_ℓ2 L_ℓ2(θ, t),    (18)

for some constants λ_∇ ≥ 0 and λ_ℓ2 ≥ 0.

C.3 LINEAR KERNEL FOR SCALABLE MMD

The computational complexity of (18) is O(N²). Here, we assume that the base kernel is linear, i.e.

k_base(x, y) = ⟨x, y⟩.

This allows us to simplify the MMD computation (16) as

MMD²_u[X^N, X_t^N; t, θ] = 1/(N(N−1)) [ϕ_t(X_t^N)ᵀ ϕ_t(X_t^N) − ϕ_t²(X_t^N)] + 1/(N(N−1)) [ϕ_t(X^N)ᵀ ϕ_t(X^N) − ϕ_t²(X^N)] − 2/N² ϕ_t(X_t^N)ᵀ ϕ_t(X^N),    (19)

where

ϕ_t(X_t^N) = Σ_{i=1}^N ϕ(x_t^i; t, θ),  ϕ_t(X^N) = Σ_{i=1}^N ϕ(x_i; t, θ),
ϕ_t²(X_t^N) = Σ_{i=1}^N ‖ϕ(x_t^i; t, θ)‖²,  ϕ_t²(X^N) = Σ_{i=1}^N ‖ϕ(x_i; t, θ)‖².

Algorithm 3 Approximate noise-adaptive MMD gradient flow for a single particle
Inputs:
  T: the number of noise levels
  t_max, t_min: maximum and minimum noise levels
  N_s: the number of gradient flow steps per noise level
  η > 0: the gradient flow learning rate
  ϕ̄(X_0, t, θ*): precomputed clean features for all t = 1, ..., T, with (20)
  ϕ̄(X_t, t, θ*): precomputed noisy features for all t = 1, ..., T, with (20)
Steps:
  Sample initial noisy particle Z ∼ N(0, I_d)
  for i = T to 0 do
    Set the noise level t = i Δt and Z_0^t = Z
    for n = 0 to N_s − 1 do
      Z_{n+1}^t = Z_n^t − η ∇_z ⟨ϕ(Z_n^t, t; θ*), ϕ̄(X_t, t, θ*) − ϕ̄(X_0, t, θ*)⟩
    end for
    Set Z = Z_{N_s}^t
  end for
  Output Z

Therefore, we can pre-compute the quantities ϕ_t(X_t^N), ϕ_t(X^N), ϕ_t²(X_t^N), ϕ_t²(X^N), which takes O(N), and compute MMD²_u[X^N, X_t^N; t, θ] in O(1) time. This also leads to O(1) computational complexity for L_ℓ2 and O(N) complexity for L_∇. This means that we reduce the computational complexity from O(N²) to O(N). At sampling time, following (9) requires computing the witness function (17) for each particle, which for a general kernel takes O(N²) in total. Using the linear kernel above simplifies the witness as follows:

f̂_{P,t}(z) = ⟨ϕ_t(Z^N) − ϕ_t(X^N), ϕ(z; t, θ)⟩,

where Z^N is a set of N noisy particles. We can precompute ϕ_t(Z^N) in O(N) time. Therefore, one evaluation of the witness function takes O(1) time, and for N noisy particles this makes O(N).

C.4 APPROXIMATE SAMPLING PROCEDURE

In this section, we provide an algorithm for the approximate sampling procedure. The only change with respect to the original Algorithm 2 is the approximate witness function

f̂_{P_t,P}(z) = ⟨ϕ(z, t; θ*), ϕ̄(X_t, t, θ*) − ϕ̄(X_0, t, θ*)⟩,

where

ϕ̄(X_0, t, θ*) = 1/N Σ_{i=1}^N ϕ(x_0^i, t; θ*),  ϕ̄(X_t, t, θ*) = 1/N Σ_{i=1}^N ϕ(x_t^i, t; θ*).    (20)

Here x_0^i, i = 1, ..., N correspond to the whole training set of clean samples, and x_t^i, i = 1, ..., N correspond to the noisy versions of these clean samples produced by the forward diffusion process (6) for a given noise level t. These features can be precomputed once for every noise level t. The resulting algorithm is given in Algorithm 3. A second crucial difference with respect to the original algorithm is that it generates each particle Z independently.
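The approximate sampling loop can be sketched in one dimension under toy assumptions: identity features ϕ(z, t) = z stand in for the learned U-Net features, and sqrt(ᾱ_t) = 1 − t is a made-up forward-process schedule. With a linear kernel, the approximate witness gradient is then just the difference of the precomputed feature means, so each flow step shifts the particle toward the clean data mean as the noise level decreases.

```python
import numpy as np

# Minimal 1-D sketch of Algorithm 3 (all modeling choices here are toy
# assumptions, not the paper's trained discriminator).

T, Ns, eta = 10, 5, 0.036
data_mean = 5.0                       # mean of the clean data x_0
t_levels = np.linspace(1.0, 0.1, T)   # decreasing noise levels
sqrt_alpha_bar = 1.0 - t_levels       # toy schedule, not the paper's

# Precomputed per-level feature means phi_bar(X_t) and phi_bar(X_0):
phi_bar_clean = data_mean
phi_bar_noisy = sqrt_alpha_bar * data_mean  # the added noise has zero mean

Z = 0.0  # the paper draws Z ~ N(0, I_d); we start at 0 for determinism
for i in range(T):
    # grad_z <phi(z, t), phi_bar(X_t) - phi_bar(X_0)> with phi(z, t) = z
    grad = phi_bar_noisy[i] - phi_bar_clean
    for _ in range(Ns):
        Z = Z - eta * grad
print(abs(Z - data_mean) < 0.2)
```

Note that because the features here are precomputed data-side means, the particle never interacts with other particles, which is the point of the approximate procedure: each Z can be generated independently.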
D COMPUTATIONAL COMPLEXITY

D.1 TRAINING-TIME COMPLEXITY

Computing the total loss in a training iteration (see Algorithm 1) on a batch of B samples is O(B²) for an arbitrary kernel and O(B) for a linear kernel (see Appendix C.3). We compute it for N_n noise levels, meaning that the cost scales as O(N_n B²) for an arbitrary kernel and O(N_n B) for a linear kernel. This is more expensive than DDPM (Ho et al., 2020), whose cost scales as O(B).

D.2 SAMPLING-TIME COMPLEXITY

At sampling time (see Algorithm 2), we use T noise levels and perform N_s steps per noise level with N_p noisy particles. Let O(F) denote the cost of a forward pass of the backbone network. The gradient takes O(2F) to compute. Each sampling step for DDPM (Ho et al., 2020) costs O(F), while one sampling step for DMMD costs O(F N_p²) for an arbitrary kernel, O(F N_p) for a linear kernel, and O(F) when using the approximate sampling procedure.

E F-DIVERGENCES

The approach described in Section 4 can be applied to any divergence that has a well-defined Wasserstein Gradient Flow given by the gradient of an associated witness function. Such divergences include the variational lower bounds on f-divergences described by (Nowozin et al., 2016), which are popular in GAN training and were indeed the basis of the original GAN discriminator (Goodfellow et al., 2014). One such f-divergence bound is the KL Approximate Lower bound Estimator (KALE; Glaser et al., 2021). Unlike the original KL divergence, which requires a density ratio, KALE remains well defined for distributions with non-overlapping support. Similarly to MMD, the Wasserstein Gradient of KALE is given by the gradient of a learned witness function. Thus, we train a noise-conditional KALE discriminator and use the corresponding noise-conditional Wasserstein gradient flow, as with DMMD. We call this method Diffusion KALE flow (D-KALE-Flow). This approach is described in Appendix F.
We found this approach to lead to reasonable empirical results; however, unlike DMMD, it achieved worse performance than a corresponding GAN, see Appendix J.1.

F D-KALE-FLOW

In this section, we describe the D-KALE-Flow algorithm mentioned in Section E. Let X ⊆ R^D and P(X) be the set of probability distributions defined on X. Let P ∈ P(X) be the target or data distribution and Q ∈ P(X) be some distribution. The KALE objective (see (Glaser et al., 2021)) is defined as

KALE(Q, P|λ) = (1 + λ) max_{h ∈ H} {1 + ∫ h dQ − ∫ e^h dP − (λ/2) ‖h‖²_H},    (21)

where λ ≥ 0 is a positive constant and H is the RKHS with a kernel k. In practice, the KALE divergence (21) can be replaced by a corresponding parametric objective

KALE(Q, P|λ, θ, α) = (1 + λ) [1 + ∫ h(X; θ, α) dQ(X) − ∫ e^{h(Y; θ, α)} dP(Y) − (λ/2) ‖α‖²],    (22)

where h(X; θ, α) = ϕ(X; θ)ᵀα, with ϕ(X; θ) ∈ R^D and α ∈ R^D. The objective function (22) can then be maximized with respect to θ and α for given Q and P. Similarly to DMMD, we consider a noise-conditional witness function

h(x; t, θ, α, ψ) = ϕ(x; t, θ)ᵀ α(t; ψ).

From here, the noise-conditional KALE objective is given as

L(θ, ψ, t|λ) = KALE(P_t, P|λ, θ, α),

where P_t is the distribution corresponding to the forward diffusion process, see Section 4. Then, the total noise-conditional objective is given as

L_tot(θ, ψ, t|λ) = L(θ, ψ, t|λ) + λ_∇ L_∇(θ, ψ, t) + λ_ℓ2 L_ℓ2(θ, t),

where the gradient penalty has a similar form to WGAN-GP (Gulrajani et al., 2017),

L_∇(θ, ψ, t) = E_Z (‖∇_Z h(Z; t, θ, α, ψ)‖₂ − 1)²,

where Z = βX + (1 − β)Y, with β ∼ U[0, 1], X ∼ P_t and Y ∼ P.
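The variational objective inside (21) can be made concrete with a small numerical sketch. We take λ = 0, in which case the maximand 1 + ∫h dQ − ∫e^h dP is the standard variational lower bound on KL(Q‖P), and use a hypothetical linear witness h(x) = a·x + b in place of the neural witness. For Q = N(1, 1) and P = N(0, 1), the true KL is 1/2, attained at h = log(dQ/dP).

```python
import numpy as np

# KALE-type variational lower bound on KL(Q || P), equation (21) with
# lambda = 0 and a toy linear witness (an illustrative assumption).

rng = np.random.default_rng(0)
xq = rng.normal(1.0, 1.0, size=4000)  # samples from Q = N(1, 1)
xp = rng.normal(0.0, 1.0, size=4000)  # samples from P = N(0, 1)

a, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    eh = np.exp(a * xp + b)
    a += lr * (xq.mean() - (xp * eh).mean())  # ascend d/da of the bound
    b += lr * (1.0 - eh.mean())               # ascend d/db of the bound

# Evaluate the bound 1 + E_Q[h] - E_P[e^h] at the fitted witness.
kale = 1.0 + (a * xq + b).mean() - np.exp(a * xp + b).mean()
print(abs(kale - 0.5) < 0.2)  # close to KL(Q || P) = 0.5
```

The objective is concave in (a, b), so plain gradient ascent converges; the fitted witness approaches the log density ratio h(x) = x − 1/2, and the bound approaches the true KL.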
The ℓ2 penalty is given as

L_ℓ2(θ, t) = 1/2 [E_{X∼P_t} ‖ϕ(X; t, θ)‖² + E_{Y∼P} ‖ϕ(Y; t, θ)‖²].

Algorithm 4 Noise-adaptive KALE flow for a single particle
Inputs:
  T: the number of noise levels
  t_max, t_min: maximum and minimum noise levels
  N_s: the number of gradient flow steps per noise level
  η > 0: the gradient flow learning rate
  Trained witness function h(·; t, θ*, ψ*)
Steps:
  Sample initial noisy particle Z ∼ N(0, I_d)
  Set Δt = (t_max − t_min)/T
  for i = T to 0 do
    Set the noise level t = t_min + i Δt and Z_0^t = Z
    for n = 0 to N_s − 1 do
      Z_{n+1}^t = Z_n^t − η ∇ h(Z_n^t; t, θ*, ψ*)
    end for
    Set Z = Z_{N_s}^t
  end for
  Output Z

Therefore, the final objective function to train the discriminator is

L_tot(θ, ψ|λ) = E_{t∼U[0,1]} [L_tot(θ, ψ, t|λ)].

This objective function depends on the RKHS regularization λ, on the gradient penalty regularization λ_∇, and on the ℓ2-penalty regularization λ_ℓ2. Unlike in DMMD, we do not use an explicit form for the witness function and do not use the RKHS parameterisation. On the one hand, this allows for a more scalable approach, since we can compute KALE and the witness function for each individual particle. On the other hand, the explicit form of the witness function in DMMD introduces a beneficial inductive bias. In DMMD, when we train the discriminator, we only learn the kernel features, i.e. the corresponding RKHS. In D-KALE, we need to learn both the kernel features ϕ(x; t, θ) and the RKHS projections α(t; ψ). This makes the learning problem for D-KALE more complex. The corresponding noise-adaptive gradient flow for the KALE divergence is described in Algorithm 4. An advantage over the original DMMD gradient flow is the ability to run this flow individually for each particle.

G CONNECTION TO FLOW MATCHING

Diffusion models (Ho et al., 2020) reverse a Gaussian noising dynamic, while flow matching methods (Lipman et al., 2023) match an interpolation dynamic between the data distribution and a Gaussian distribution.
Both methods can be shown to be equivalent for an appropriate choice of schedule (Gao et al., 2024). Our approach is agnostic to the destruction process and does not rely on an underlying Gaussian assumption on the prior distribution.³ In particular, one could design destruction processes that are better suited to the problem at hand. While extensions have been proposed for such destruction processes in the case of Gaussian diffusion (see (Bansal et al., 2023; Daras et al., 2023) for instance), they either rely on a soft Gaussian assumption for the noising process (Daras et al., 2023) or do not enjoy the theoretical properties of our framework (Bansal et al., 2023): at optimality we recover the data distribution, up to discretization error, regardless of the destruction process or prior distribution.

H TOY 2-D DATASET EXPERIMENTS

For the 2-D experiments, we train DMMD using Algorithm 1 for N_iter = 50000 steps with a batch size of B = 256 and a number of noise levels per batch equal to N_noise = 128. The gradient penalty constant is λ_∇ = 0.1, whereas the ℓ2 penalty is not used. To learn the noise-conditional MMD for DMMD, we use a 4-layer MLP g(t; θ) with ReLU activations to encode σ(t; θ) = σ_min + ReLU(g(t; θ)) with σ_min = 0.001, which ensures σ(t; θ) > 0. The MLP layers have sizes [64, 32, 16, 1].

³Note that this might also not be necessary in the case of flow matching and stochastic interpolants, see (Albergo et al., 2023) for instance.

Before passing the noise level t ∈ [0, 1] to the MLP, we use a sinusoidal embedding similar to the one used in (Ho et al., 2020), which produces a feature vector of size 1024. The forward diffusion process from (Ho et al., 2020) has modified parameters, with β_1 = 10⁻⁴ and β_T = 0.0002. On top of that, we discretize the corresponding process using only 1000 possible noise levels, with t_min = 0.05 and t_max = 1.0.
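A sinusoidal embedding of the noise level, in the spirit of the timestep embedding of Ho et al. (2020), can be sketched as follows; the exact frequency scaling and maximum period here are illustrative assumptions.

```python
import numpy as np

# Sinusoidal embedding of a scalar noise level t into a 1024-dim feature
# vector (frequency scaling chosen for illustration).

def sinusoidal_embedding(t, dim=1024, max_period=10000.0):
    half = dim // 2
    # geometrically spaced frequencies from 1 down to 1/max_period
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

emb = sinusoidal_embedding(0.5)
print(emb.shape == (1024,), np.all(np.abs(emb) <= 1.0))
```

The geometric frequency spacing lets the MLP resolve both coarse and fine differences in t from a single fixed-size vector.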
At sampling time for Algorithm 2, we use t_min = 0.05, t_max = 1.0, N_s = 10 and T = 100. The learning rate is η = 1.0. As baselines, we consider MMD-GAN with a generator parameterised by a 3-layer MLP with ELU activations. The architecture of the MLP is [256, 256, 2]. The initial noise for the generator is sampled from a uniform distribution U[−1, 1] with a dimensionality of 128. The gradient penalty coefficient equals 0.1. As for the discriminator, the only learnable parameter is σ. We train MMD-GAN for 250000 iterations with a batch size of B = 256. Other variants of MMD gradient flow use the same sampling parameters as DMMD. We used one A100 GPU with 40GB of memory to run these experiments. In total, all the experiments took less than 2 hours.

I IMAGE GENERATION EXPERIMENTS

For the image experiments, we use the CIFAR10 (Krizhevsky & Hinton, 2009) dataset. We use the same forward diffusion process as in (Ho et al., 2020). As a neural network backbone, we use a U-Net (Ronneberger et al., 2015) with a slightly modified architecture from (Ho et al., 2020). On top of the backbone, we output intermediate features at four levels: before downsampling, after downsampling, before upsampling, and at the final layer. Each of these feature vectors is processed by group normalization, the activation function, and a linear layer producing an output vector of size 32. To produce the discriminator features, these four feature vectors are concatenated into a final output feature vector of size 128. The noise level is processed via a sinusoidal time embedding similar to (Ho et al., 2020). We use a dropout of 0.2. DMMD is trained for N_iter = 250000 iterations with a batch size B = 64 and N_noise = 16 noise levels per batch. We use a gradient penalty λ_∇ = 1.0 and ℓ2 regularisation strength λ_ℓ2 = 0.1.
As evaluation metrics, we use FID (Heusel et al., 2017) and Inception Score (Salimans et al., 2016), using the same evaluation regime as in (Ho et al., 2020). To select hyperparameters and track performance during training, we use FID evaluated on a subset of 1024 images from the CIFAR10 training set. For CIFAR10, we use random-flip data augmentation. In DMMD we have two sets of hyperparameters: one used for training in Algorithm 1 and one used for sampling in Algorithm 2. During training, we fix the sampling parameters and always use these to select the best set of training-time hyperparameters. We use gradient flow learning rate η = 0.1, T = 10 noise levels, N_p = 200 noisy particles, N_s = 5 gradient flow steps per noise level, t_min = 0.001 and t_max = 1 − 0.001. We use a batch of 400 clean particles during training. For hyperparameters, we do a grid search over λ_∇ ∈ {0, 0.001, 0.01, 0.1, 1.0, 10.0}, λ_ℓ2 ∈ {0, 0.001, 0.01, 0.1, 1.0, 10.0}, dropout rate ∈ {0, 0.1, 0.2, 0.3}, and batch size ∈ {16, 32, 64}. To train the model, we use the same optimization procedure as in (Ho et al., 2020), notably the Adam (Kingma & Ba, 2015) optimizer with a learning rate of 0.0001. We also swept over the dimensionality of the output layer in {32, 64, 128}, such that each of the four feature vectors got equal dimension. Moreover, we swept over the number of channels for the U-Net in {32, 64, 128} (the original was 32), and found that 128 gave the best empirical results. After having selected the training-time hyperparameters and trained the model, we ran a sweep over the sampling-time hyperparameters: η ∈ {1, 0.5, 0.1, 0.04, 0.01}, T ∈ {1, 5, 10, 50}, N_s ∈ {1, 5, 10, 50}, t_min ∈ {0.001, 0.01, 0.1, 0.2}, t_max ∈ {0.9, 1 − 0.001}. We found that the best hyperparameters for DMMD were η = 0.1, N_s = 10, T = 10, t_min = 0.1 and t_max = 0.9. On top of that, we ran a variant of DMMD with T = 50 and N_s = 5.
For the a-DMMD method, we used the same pretrained discriminator as for DMMD, but ran an additional sweep over sampling-time hyperparameters, since in principle these could differ. We found that the best hyperparameters for a-DMMD are η = 0.04, t_min = 0.2, t_max = 0.9, T = 5, N_s = 10. For the denoising step (see Table 2), for DMMD-e we used 2 steps of DMMD gradient flow with a higher learning rate η′ = 0.5, with t_max = 0.1 and t_min = 0.001. For a-DMMD-e, we used 2 steps of DMMD gradient flow with a higher learning rate of η′ = 0.5, with t_max = 0.2 and t_min = 0.001. For a-DMMD-a, we used 2 steps of DMMD gradient flow with a higher learning rate of η′ = 0.1, with t_max = 0.2 and t_min = 0.001. The only parameter we swept over in this experiment was the higher learning rate η′. After having found the best sampling hyperparameters, we ran the evaluation to compute FID on the whole CIFAR10 dataset using the same regime as described in (Ho et al., 2020). For the MMD-GAN experiment, we use the same discriminator as for DMMD, but additionally train a generator using the same U-Net architecture as for DMMD, with the exception that we do not use the four levels of features. We use a higher gradient penalty of λ_∇ = 10 and the same ℓ2 penalty λ_ℓ2 = 0.1. We use a batch size of B = 64 and the same learning rate as for DMMD. We use a dropout of 0.2. We train MMD-GAN for 250000 iterations. For each generator update, we do 5 discriminator updates, following (Brock et al., 2019). For the MMD-GAN-Flow experiment, we take the pretrained discriminator from MMD-GAN and run a gradient flow of type (4) on it, starting from random Gaussian noise. We swept over parameters such as the learning rate η and the number of iterations N_iter. We found that none of the settings led to any reasonable performance. The results in Table 1 are reported using η = 0.1 and N_iter = 100.
I.1 ADDITIONAL DATASETS

I.1.1 RESULTS ON CELEB-A, LSUN-CHURCH, MNIST

We study the performance of DMMD on additional datasets: MNIST (Lecun et al., 1998), CELEB-A (64x64) (Liu et al., 2015), and LSUN-Church (64x64) (Yu et al., 2015). For MNIST and CELEB-A, we use the same training/test split as well as the evaluation protocol as in (Franceschi et al., 2023). For LSUN-Church, we compute FID on 50000 samples, similar to DDPM (Ho et al., 2020). For MNIST, we used the same hyperparameters during training and sampling as for CIFAR-10 with NFE=100, see Appendix I. For CELEB-A and LSUN, we ran a sweep over λ_ℓ2 ∈ {0, 0.001, 0.01, 0.1, 1.0, 10.0} and found that λ_ℓ2 = 0.001 led to the best results. For sampling, we used the same parameters as for CIFAR-10 with NFE=100. We use the same architecture for DMMD as for the CIFAR-10 experiments. The results, given with NFE=100, are reported in Table 4. In addition to DMMD, we report the performance of the Discriminator flow baseline from (Franceschi et al., 2023), with numbers taken from the corresponding paper. We see that DMMD performance is significantly better than the discriminator flow, which is consistent with our findings on CIFAR-10. The corresponding samples are provided in Appendix L.2.

Table 4: Unconditional image generation on additional datasets. The metric used is FID. The number of gradient flow steps for DMMD is 100 and NFE for DDPM is 100.

Dataset        MMD-GAN   DDPM   DMMD   Disc. flow (Franceschi et al., 2023)
CELEB-A        12.1      6.72   8.3    41.0
LSUN-Church    8.4       3.84   6.1    -
MNIST          7.0       1.94   3.0    4.0

I.1.2 RESULTS ON CELEB-A-HQ

We study the performance of DMMD on the higher-resolution dataset CELEB-A-HQ (128x128), which contains 30000 samples. During training, we track performance on a subset of 2048 samples. We use the same backbone architecture as in the DDPM paper (Ho et al., 2020) applied to the CELEB-A-HQ (256x256) dataset.
For DMMD, we ran a sweep over λ_ℓ2 ∈ {0, 0.001, 0.01, 0.1, 1.0, 10.0} and found that λ_ℓ2 = 0.001 led to the best results. At sampling time, we use T = 10 noise levels with N_s = 10, gradient flow learning rate η = 0.001, and N_p = 50 noisy particles (see Algorithm 2 for more details). We use the same architecture for DMMD as for the CIFAR-10 experiments. For both methods, NFE=100. The results are reported in Table 5. In addition to DMMD, we also train DDPM (Ho et al., 2020) and report the corresponding performance. The results suggest that DMMD still has a performance gap compared to DDPM, as observed for the lower-resolution datasets. The corresponding samples are provided in Appendix L.3.

Table 5: Unconditional image generation on CELEB-A-HQ (128x128). The metric used is FID. The number of gradient flow steps for DMMD is 100 and we use NFE=100 for DDPM.

DDPM    DMMD
11.67   14.09

I.1.3 RESULTS ON LATENT-SPACE CELEB-A-HQ

For memory-efficient sample generation in higher dimensions, we study the performance of DMMD in the latent space of a VQ-VAE (van den Oord et al., 2017) trained on CELEB-A-HQ (256x256). We follow the approach of the latent diffusion paper (Rombach et al., 2022) and train a VQ-VAE with the corresponding LDM-4 architecture, which produces a latent space of shape (64 × 64 × 3). Our baseline for comparison is a latent diffusion model with the LDM-4 architecture. We trained DMMD on the same latent space, where we swept over λ_ℓ2 ∈ {0.05, 0.1, 0.2} and found that λ_ℓ2 = 0.2 led to the best results. We used a gradient penalty λ_∇ = 1, and the same architecture for DMMD as for the CIFAR-10 experiments. At sampling time, we used T = 10 noise levels with N_s = 10 steps per noise level, a gradient flow learning rate η = 0.1, and N_p = 200 noisy particles (see Algorithm 2 for details). The results are given in Table 6, with NFE=100 for both methods. The corresponding samples are provided in Appendix L.4.
Table 6: Unconditional image generation on the latent space (64x64x3) of CELEB-A-HQ (256x256). The metric used is FID. The number of gradient flow steps for DMMD is 100 and NFE=100 for DDPM.

Latent Diffusion   Latent DMMD

I.2 D-KALE-FLOW DETAILS

We study the performance of D-KALE-Flow on CIFAR10. We use the same architectural setting as in DMMD, with the only difference of adding a mapping α(t; ψ) from the noise level to a D-dimensional feature vector, represented by a 2-layer MLP with hidden dimensionality 64 and GELU activation. We use a batch size B = 256 and a dropout rate of 0.3. For the sampling-time parameters during training, we use η = 0.5, a total number of noise levels T = 20, and N_s = 5 steps per noise level. At training, we sweep over the RKHS regularization λ ∈ {0, 1, 10, 100, 500, 1000, 2000}, the gradient penalty λ_∇ ∈ {0, 0.1, 1.0, 10.0, 50.0, 100.0, 250.0, 500.0, 1000.0}, and the ℓ2 penalty in {0, 0.1, 0.01, 0.001}.

I.3 NUMBER OF PARTICLES ABLATION

In Table 7, we report the performance of DMMD depending on the number of particles N_p at sampling time. As expected, the FID score decreases as the number of particles increases, and the overall performance is sensitive to the number of particles. This motivates the approximate sampling procedure from Section 5.

J PERFORMANCE VS. NUMBER OF GRADIENT FLOW STEPS TRADE-OFF

Here, we provide a table showing the dependence of DMMD performance on the total number of DMMD gradient flow steps, which we call NFE. The NFE is the total number of gradient flow iterations, which equals N_s × T, where N_s is the number of steps per noise level and T is the number of noise levels. By default, we use the gradient flow learning rate η = 0.1, see (9).

Table 7: Number of particles ablation, FID on CIFAR10.

N_p = 50   N_p = 100   N_p = 200
9.76       8.55        8.31
We also found that as we increase the total number of gradient flow steps, it was sometimes beneficial to use a smaller learning rate, η = 0.05. Results are given in Table 8. We see that as we increase NFE, the FID improves up to a point (NFE = 250); beyond NFE = 250, we do not see further improvement. Moreover, as noted in our experiments, increasing the total compute at sampling time may require readjusting the gradient flow learning rate.

Table 8: Dependence of FID on CIFAR-10 on the total number of gradient flow steps (NFE). η is the gradient flow learning rate, see (9).

Total number of steps (NFE)   FID
10 (η = 0.1)                  377.5
50 (η = 0.1)                  36.4
100 (η = 0.1)                 8.5
250 (η = 0.1)                 12.1
250 (η = 0.05)                7.74
500 (η = 0.05)                8.6
1000 (η = 0.05)               9.1

J.1 RESULTS WITH F-DIVERGENCE

We study the performance of D-KALE-Flow, described in Section E and Appendix F, in the setting of unconditional image generation for CIFAR-10. We compare against a GAN baseline which uses the KALE divergence in the discriminator but has an adversarially trained generator. More details are given in Appendix F and Appendix I.2. The results are given in Table 9. We see that, unlike DMMD, D-KALE-Flow achieves worse performance than the corresponding KALE-GAN, indicating that the inductive bias provided by the generator may be more helpful in this case; this is a topic for future study.

Table 9: Unconditional image generation on CIFAR-10 with the KALE divergence. The number of gradient flow steps is 100.

Method        FID    Inception score
D-KALE-Flow   15.8   8.5
KALE-GAN      12.7   8.7

J.2 COMPUTE RESOURCES FOR IMAGE EXPERIMENTS

For all experiments, we used A100 GPUs with 40 GB of memory. Training DMMD for 250k steps took around 24 hours.
The total hyperparameter sweep for DMMD required 36 runs to determine the regularization constants, 12 runs for the batch size and dropout rate, 3 runs for the dimensionality of the U-Net, and the same 3 runs where the U-Net features came only from the last layer. This amounts to 54 runs in total. Running inference on a small subset of CIFAR-10 required around 2 minutes of GPU time, and we ran a full grid search to select the best sampling-time parameters, which is around 1080 configurations. We did this sweep for DMMD and a-DMMD. For DMMD-e, we additionally swept over the higher learning rate at the second stage, which required 5 more runs. For a-DMMD-e and a-DMMD-a, we swept over the learning rates at the second stage, which required 10 more runs. After having found the best parameters, we ran sampling with the best parameters on full CIFAR-10, which takes about 1 hour for NFE = 100. For the additional datasets: for MNIST, we used the same best parameters as for CIFAR-10, which required a single run, since we saw very good performance out of the box. For CELEB-A and LSUN, we ran an additional sweep over the regularization strength, which required 6 training runs per dataset and 2 additional runs for sampling the whole datasets. For MMD-GAN, the training runs were faster, by around a factor of 2. We did a grid search over the regularization strengths, which took 36 training runs, plus 12 runs for the batch size and dropout rate. For D-KALE-Flow, the experiments were as fast as MMD-GAN, and we ran a grid search with 67 runs for regularization and 4 runs for dropout. The same was done for KALE-GAN.

K OPTIMAL KERNEL WITH GAUSSIAN MMD

In this section, we prove the results of Section 3. We consider the following unnormalized Gaussian kernel

k_α(x, y) = α^{−d} exp[−‖x − y‖²/(2α²)].

For any µ ∈ R^d and σ > 0, we denote by π_{µ,σ} the Gaussian distribution with mean µ and covariance matrix σ²I_d. We denote by MMD²_α the MMD² associated with k_α.
More precisely, for any µ1, µ2 ∈ R^d and σ1, σ2 > 0, we have

MMD²_α(π_{µ1,σ1}, π_{µ2,σ2}) = E_{π_{µ1,σ1} ⊗ π_{µ1,σ1}}[k_α(X, X′)] − 2 E_{π_{µ1,σ1} ⊗ π_{µ2,σ2}}[k_α(X, Y)] + E_{π_{µ2,σ2} ⊗ π_{µ2,σ2}}[k_α(Y, Y′)].    (23)

In this section, we prove the following result.

Proposition K.1. For any µ0 ∈ R^d and σ > 0, let α* be given by

α* = argmax_{α ≥ 0} ‖∇_{µ0} MMD²_α(π_{0,σ}, π_{µ0,σ})‖.

Then, we have that

α* = ReLU(‖µ0‖²/(d + 2) − 2σ²)^{1/2}.    (24)

Before proving Proposition K.1, let us provide some insight into the result. The quantity ‖∇_{µ0} MMD²_α(π_{0,σ}, π_{µ0,σ})‖ represents how much the mean of the Gaussian π_{µ0,σ} is displaced when considering a flow on the mean of the Gaussian w.r.t. MMD²_α. Intuitively, we aim for ‖∇_{µ0} MMD²_α(π_{0,σ}, π_{µ0,σ})‖ to be as large as possible, as this represents the maximum displacement possible. Hence, this justifies maximizing ‖∇_{µ0} MMD²_α(π_{0,σ}, π_{µ0,σ})‖ with respect to the width parameter α. We show that the optimal width α* has a closed form given by (24). It is notable that, assuming σ > 0 is fixed, this quantity depends on ‖µ0‖, i.e. on how far apart the modes of the two distributions are. This observation justifies our approach of following an adaptive MMD flow at inference time. Finally, we observe that there exists a threshold, i.e. ‖µ0‖²/(d + 2) = 2σ², such that lower values of ‖µ0‖ yield α* = 0. This phase transition behavior is also observed in our experiments. We define D(α, σ, µ0, µ1) for any α, σ > 0 and µ0, µ1 ∈ R^d by

D(α, σ, µ0, µ1) = ∫_{R^d × R^d} k_α(x, y) dπ_{µ0,σ}(x) dπ_{µ1,σ}(y).

Proposition K.2. For any α, σ > 0 and µ0, µ1 ∈ R^d, we have

D(α, σ, µ0, µ1) = [α²σ²(1/κ² + 1/α²)]^{−d/2} exp[‖µ̂0‖²/(2κ²) + ‖µ̂1‖²/(2κ²) − ⟨µ̂0, µ̂1⟩/α² − ‖µ0‖²/(2σ²) − ‖µ1‖²/(2σ²)],

where

µ̂1 = (α²µ1 + κ²µ0)/(κ² + α²),  µ̂0 = (α²µ0 + κ²µ1)/(κ² + α²),

and κ = (1/σ² + 1/α²)^{−1/2}.

Proof.
In what follows, we start by computing D(α, σ, µ0, µ1) for any α, σ > 0 and µ0, µ1 ∈ R^d:

D(α, σ, µ0, µ1) = ∫_{R^d × R^d} k_α(x, y) dπ_{µ0,σ}(x) dπ_{µ1,σ}(y)
= 1/(2πσ²α)^d ∫∫ exp[−‖x − y‖²/(2α²)] exp[−‖x − µ0‖²/(2σ²)] exp[−‖y − µ1‖²/(2σ²)] dx dy
= 1/(2πσ²α)^d ∫∫ exp[−‖x − y‖²/(2α²) − ‖x − µ0‖²/(2σ²) − ‖y − µ1‖²/(2σ²)] dx dy.

In what follows, we denote κ = (1/σ² + 1/α²)^{−1/2}. We have

D(α, σ, µ0, µ1) = C(µ0, µ1) ∫∫ exp[−‖x‖²/(2κ²) − ‖y‖²/(2κ²) + ⟨x, y⟩/α² + ⟨x, µ0⟩/σ² + ⟨y, µ1⟩/σ²] dx dy,

with C(µ0, µ1) = 1/(2πσ²α)^d exp[−‖µ0‖²/(2σ²) − ‖µ1‖²/(2σ²)]. In what follows, we denote by P(x, y) the second-order polynomial given by

P(x, y) = ‖x‖²/(2κ²) + ‖y‖²/(2κ²) − ⟨x, y⟩/α² − ⟨x, µ0⟩/σ² − ⟨y, µ1⟩/σ².

Note that we have

D(α, σ, µ0, µ1) = C(µ0, µ1) ∫∫ exp[−P(x, y)] dx dy.    (25)

Next, for any µ̂0, µ̂1 ∈ R^d, we consider Q(x, y) given by

Q(x, y) = ‖x − µ̂0‖²/(2κ²) + ‖y − µ̂1‖²/(2κ²) − ⟨x − µ̂0, y − µ̂1⟩/α²
= ‖x‖²/(2κ²) + ‖µ̂0‖²/(2κ²) + ‖y‖²/(2κ²) + ‖µ̂1‖²/(2κ²) − ⟨x, µ̂0⟩/κ² − ⟨y, µ̂1⟩/κ² − ⟨x, y⟩/α² − ⟨µ̂0, µ̂1⟩/α² + ⟨y, µ̂0⟩/α² + ⟨x, µ̂1⟩/α²
= P(x, y) + ‖µ̂0‖²/(2κ²) + ‖µ̂1‖²/(2κ²) − ⟨µ̂0, µ̂1⟩/α² + ⟨x, µ0/σ² − µ̂0/κ² + µ̂1/α²⟩ + ⟨y, µ1/σ² − µ̂1/κ² + µ̂0/α²⟩.

In what follows, we set µ̂0, µ̂1 such that

µ1/σ² = µ̂1/κ² − µ̂0/α²,  µ0/σ² = µ̂0/κ² − µ̂1/α².

We get that

µ̂1 = (µ1/(σ²κ²) + µ0/(σ²α²))/(1/κ⁴ − 1/α⁴),  µ̂0 = (µ0/(σ²κ²) + µ1/(σ²α²))/(1/κ⁴ − 1/α⁴).

We have that

σ²(1/κ⁴ − 1/α⁴) = σ²(1/σ⁴ + 2/(σ²α²)) = 1/σ² + 2/α² = 1/κ² + 1/α².    (26)

Therefore, we get that

µ̂1 = (µ1/κ² + µ0/α²)/(1/κ² + 1/α²),  µ̂0 = (µ0/κ² + µ1/α²)/(1/κ² + 1/α²).

Finally, we get that

µ̂1 = (α²µ1 + κ²µ0)/(κ² + α²),  µ̂0 = (α²µ0 + κ²µ1)/(κ² + α²).
With this choice, we get that

$$P(x, y) = Q(x, y) - \|\hat{\mu}_0\|^2/(2\kappa^2) - \|\hat{\mu}_1\|^2/(2\kappa^2) + \langle\hat{\mu}_0, \hat{\mu}_1\rangle/\alpha^2. \quad (27)$$

We also have that for any $x, y \in \mathbb{R}^d$,

$$Q(x, y) = \frac{1}{2}\begin{pmatrix} x - \hat{\mu}_0 \\ y - \hat{\mu}_1 \end{pmatrix}^\top \begin{pmatrix} \mathrm{Id}/\kappa^2 & -\mathrm{Id}/\alpha^2 \\ -\mathrm{Id}/\alpha^2 & \mathrm{Id}/\kappa^2 \end{pmatrix} \begin{pmatrix} x - \hat{\mu}_0 \\ y - \hat{\mu}_1 \end{pmatrix}.$$

Using this result, we have that

$$\int_{\mathbb{R}^d \times \mathbb{R}^d} \exp[-Q(x, y)]\, \mathrm{d}x\, \mathrm{d}y = (2\pi)^d \det(\Sigma^{-1})^{-1/2}, \qquad \Sigma^{-1} = \begin{pmatrix} \mathrm{Id}/\kappa^2 & -\mathrm{Id}/\alpha^2 \\ -\mathrm{Id}/\alpha^2 & \mathrm{Id}/\kappa^2 \end{pmatrix}. \quad (28)$$

Using (26), we get that $\det(\Sigma^{-1}) = [(1/\sigma^2)(1/\kappa^2 + 1/\alpha^2)]^d$. Combining this result and (28), we get that

$$\int_{\mathbb{R}^d \times \mathbb{R}^d} \exp[-Q(x, y)]\, \mathrm{d}x\, \mathrm{d}y = (2\pi)^d[(1/\sigma^2)(1/\kappa^2 + 1/\alpha^2)]^{-d/2}.$$

Combining this result, (27) and (25), we get that

$$D(\alpha, \sigma, \mu_0, \mu_1) = C(\mu_0, \mu_1)\exp[\|\hat{\mu}_0\|^2/(2\kappa^2) + \|\hat{\mu}_1\|^2/(2\kappa^2) - \langle\hat{\mu}_0, \hat{\mu}_1\rangle/\alpha^2](2\pi)^d[(1/\sigma^2)(1/\kappa^2 + 1/\alpha^2)]^{-d/2}.$$

Therefore, we get that

$$D(\alpha, \sigma, \mu_0, \mu_1) = [\alpha^2\sigma^2(1/\kappa^2 + 1/\alpha^2)]^{-d/2}\exp[\|\hat{\mu}_0\|^2/(2\kappa^2) + \|\hat{\mu}_1\|^2/(2\kappa^2) - \langle\hat{\mu}_0, \hat{\mu}_1\rangle/\alpha^2 - \|\mu_0\|^2/(2\sigma^2) - \|\mu_1\|^2/(2\sigma^2)].$$

We investigate two special cases of Proposition K.2. First, we show that if $\mu_0 = \mu_1$ then $D(\alpha, \sigma, \mu_0, \mu_0)$ does not depend on $\mu_0$.

Proposition K.3. For any $\alpha, \sigma > 0$ and $\mu_0 \in \mathbb{R}^d$ we have $D(\alpha, \sigma, \mu_0, \mu_0) = (\alpha^2 + 2\sigma^2)^{-d/2}$.

Proof. We have that $\hat{\mu}_0 = \hat{\mu}_1 = \mu_1 = \mu_0$ in Proposition K.2. In addition, we have that $1/(2\kappa^2) + 1/(2\kappa^2) - 1/\alpha^2 - 1/(2\sigma^2) - 1/(2\sigma^2) = 0$. Therefore, we have that

$$\exp[\|\hat{\mu}_0\|^2/(2\kappa^2) + \|\hat{\mu}_1\|^2/(2\kappa^2) - \langle\hat{\mu}_0, \hat{\mu}_1\rangle/\alpha^2 - \|\mu_0\|^2/(2\sigma^2) - \|\mu_1\|^2/(2\sigma^2)] = 1,$$

which concludes the proof upon using that $1/\kappa^2 = 1/\alpha^2 + 1/\sigma^2$.

Proposition K.3 might seem surprising at first, but it simply highlights the fact that when comparing a Gaussian measure with itself, the result is independent of the location of the Gaussian and depends only on its scale. Then, we study the case where $\mu_1 = 0$.

Proposition K.4. For any $\alpha, \sigma > 0$ and $\mu_0 \in \mathbb{R}^d$ we have $D(\alpha, \sigma, \mu_0, 0) = (\alpha^2 + 2\sigma^2)^{-d/2}\exp[-\|\mu_0\|^2/(2(\alpha^2 + 2\sigma^2))]$.

Proof. First, we have that $\hat{\mu}_0 = \alpha^2\mu_0/(\kappa^2 + \alpha^2)$ and $\hat{\mu}_1 = \kappa^2\mu_0/(\kappa^2 + \alpha^2)$. Therefore, we get that

$$D(\alpha, \sigma, \mu_0, 0) = [\alpha^2\sigma^2(1/\kappa^2 + 1/\alpha^2)]^{-d/2}\exp[(1/2)\{(\alpha^4/\kappa^2 - \kappa^2)/(\kappa^2 + \alpha^2)^2 - 1/\sigma^2\}\|\mu_0\|^2].$$

Using (26), we get that $\alpha^4/\kappa^2 - \kappa^2 = \alpha^2(\alpha^2 + \kappa^2)/\sigma^2$.
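Pausing the derivation for a moment, the closed form of Proposition K.2 (with the special case of Proposition K.3) and the optimal width (24) of Proposition K.1 can be sanity-checked numerically. The sketch below is illustrative and not part of the paper; it assumes the kernel normalisation $k_\alpha(x, y) = \alpha^{-d}\exp(-\|x - y\|^2/(2\alpha^2))$ implied by the constant $C(\mu_0, \mu_1)$ in the proof above, uses the gradient-norm expression established later in this appendix (Proposition K.5), and finds $\alpha^\star$ by brute-force grid search:

```python
import numpy as np

def D_closed_form(alpha, sigma, mu0, mu1):
    # Closed form from Proposition K.2
    d = mu0.shape[0]
    kappa2 = 1.0 / (1.0 / sigma**2 + 1.0 / alpha**2)  # kappa^2
    mu0_hat = (alpha**2 * mu0 + kappa2 * mu1) / (kappa2 + alpha**2)
    mu1_hat = (alpha**2 * mu1 + kappa2 * mu0) / (kappa2 + alpha**2)
    pref = (alpha**2 * sigma**2 * (1 / kappa2 + 1 / alpha**2)) ** (-d / 2)
    expo = (mu0_hat @ mu0_hat / (2 * kappa2) + mu1_hat @ mu1_hat / (2 * kappa2)
            - mu0_hat @ mu1_hat / alpha**2
            - mu0 @ mu0 / (2 * sigma**2) - mu1 @ mu1 / (2 * sigma**2))
    return pref * np.exp(expo)

rng = np.random.default_rng(0)
d, alpha, sigma = 2, 1.5, 0.7
mu0, mu1 = np.array([1.0, -0.5]), np.array([0.3, 0.2])

# Monte Carlo estimate of D with X ~ N(mu0, sigma^2 I), Y ~ N(mu1, sigma^2 I)
X = mu0 + sigma * rng.standard_normal((200_000, d))
Y = mu1 + sigma * rng.standard_normal((200_000, d))
mc = (alpha ** (-d) * np.exp(-np.sum((X - Y) ** 2, axis=-1) / (2 * alpha**2))).mean()
cf = D_closed_form(alpha, sigma, mu0, mu1)

# Proposition K.3: D(alpha, sigma, mu0, mu0) = (alpha^2 + 2 sigma^2)^{-d/2}
k3 = (alpha**2 + 2 * sigma**2) ** (-d / 2)

def grad_norm(a, s, mu0_norm2, dim):
    # ||grad_{mu0} MMD^2|| = 2 c^{-dim/2-1} exp(-||mu0||^2/(2c)) ||mu0||, c = a^2 + 2 s^2
    c = a**2 + 2 * s**2
    return 2 * c ** (-(dim / 2 + 1)) * np.exp(-mu0_norm2 / (2 * c)) * np.sqrt(mu0_norm2)

# Grid search for the maximising width, compared with the closed form (24)
alphas = np.linspace(0.0, 10.0, 100_001)
d1, sigma1, mu0n2 = 1, 0.5, 9.0  # above threshold: ||mu0||^2/(d+2) > 2 sigma^2
num = alphas[np.argmax(grad_norm(alphas, sigma1, mu0n2, d1))]
closed = np.sqrt(max(mu0n2 / (d1 + 2) - 2 * sigma1**2, 0.0))
num0 = alphas[np.argmax(grad_norm(alphas, 2.0, 1.0, d1))]  # below threshold
```

Above the threshold the grid argmax `num` matches the closed form `closed` $= (\|\mu_0\|^2/(d+2) - 2\sigma^2)^{1/2}$, while below the threshold the maximiser `num0` is pinned at $\alpha = 0$, illustrating the phase transition discussed after Proposition K.1.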
Therefore, we get that

$$(\alpha^4/\kappa^2 - \kappa^2)/(\kappa^2 + \alpha^2)^2 - 1/\sigma^2 = (\alpha^2/(\alpha^2 + \kappa^2) - 1)/\sigma^2 = -1/(\alpha^2(1 + 2\sigma^2/\alpha^2)) = -1/(\alpha^2 + 2\sigma^2),$$

which concludes the proof.

Using Proposition K.3, Proposition K.4 and definition (23), we have the following result.

Figure 3: Evolution of the norm of the mean $\mu_t$ of the Gaussian distribution $\pi_{\mu_t,\sigma}$ according to a gradient flow on the mean $\mu_t$ w.r.t. $\mathrm{MMD}_{\alpha_t}$. In the adaptive case, $\alpha_t$ is given by Proposition 3.1, while in the non-adaptive case $\alpha_t = \alpha_0 = 1$. In our experiment we consider $d = 1$ and $\sigma = 1$, for illustration purposes.

Proposition K.5. For any $\alpha, \sigma > 0$ and $\mu_0 \in \mathbb{R}^d$ we have

$$\mathrm{MMD}_\alpha^2(\pi_{0,\sigma}, \pi_{\mu_0,\sigma}) = 2(\alpha^2 + 2\sigma^2)^{-d/2}\big(1 - \exp[-\|\mu_0\|^2/(2(\alpha^2 + 2\sigma^2))]\big).$$

In addition, we have

$$\nabla_{\mu_0}\mathrm{MMD}_\alpha^2(\pi_{0,\sigma}, \pi_{\mu_0,\sigma}) = 2(\alpha^2 + 2\sigma^2)^{-d/2-1}\exp[-\|\mu_0\|^2/(2(\alpha^2 + 2\sigma^2))]\,\mu_0.$$

Finally, we have the following proposition.

Proposition K.6. For any $\mu_0 \in \mathbb{R}^d$ and $\sigma > 0$, let $\alpha^\star$ be given by $\alpha^\star = \mathrm{argmax}_{\alpha \geq 0}\, \|\nabla_{\mu_0}\mathrm{MMD}_\alpha^2(\pi_{0,\sigma}, \pi_{\mu_0,\sigma})\|$. Then, we have that $\alpha^\star = \mathrm{ReLU}(\|\mu_0\|^2/(d+2) - 2\sigma^2)^{1/2}$.

Proof. Let $\sigma > 0$ and $\mu_0 \in \mathbb{R}^d$. First, using Proposition K.5, we have that

$$\|\nabla_{\mu_0}\mathrm{MMD}_\alpha^2(\pi_{0,\sigma}, \pi_{\mu_0,\sigma})\|^2 = 4\|\mu_0\|^2(\alpha^2 + 2\sigma^2)^{-d-2}\exp[-\|\mu_0\|^2/(\alpha^2 + 2\sigma^2)].$$

Next, setting $t = 1/(\alpha^2 + 2\sigma^2)$, we study the function $f : [0, t_0] \to \mathbb{R}$ given for any $t \in [0, t_0]$ by $f(t) = t^{d+2}\exp[-t\|\mu_0\|^2]$, with $t_0 = 1/(2\sigma^2)$. We have that

$$f'(t) = t^{d+1}\exp[-t\|\mu_0\|^2]\big((d + 2) - \|\mu_0\|^2 t\big).$$

We then consider two cases. First, if $t_0 \leq (d+2)/\|\mu_0\|^2$, i.e. $\sigma^2 \geq \|\mu_0\|^2/(2(d+2))$, then $f$ is increasing on $[0, t_0]$ and attains its maximum at $t = t_0$. Hence, if $\sigma^2 \geq \|\mu_0\|^2/(2(d+2))$, we have that $\alpha^\star = 0$. Second, if $t_0 \geq (d+2)/\|\mu_0\|^2$, i.e. $\sigma^2 \leq \|\mu_0\|^2/(2(d+2))$, then $f$ is increasing on $[0, t^\star]$ and non-increasing on $[t^\star, t_0]$ with $t^\star = (d+2)/\|\mu_0\|^2$, and $f$ attains its maximum at $t = t^\star$. Hence, if $\sigma^2 \leq \|\mu_0\|^2/(2(d+2))$, we have that $\alpha^\star = (\|\mu_0\|^2/(d+2) - 2\sigma^2)^{1/2}$, which concludes the proof.

K.1 PHASE TRANSITION BEHAVIOUR

L IMAGE GENERATION SAMPLES

L.1 CIFAR10 SAMPLES

Samples from DMMD with NFE=100 and NFE=250 are given in Figure 4. Samples from DMMD with NFE=100 and from a-DMMD with NFE=50 are given in Figure 5.
L.2 ADDITIONAL DATASETS SAMPLES

Samples for MNIST are given in Figure 6, for CELEB-A (64x64) in Figure 7, and for LSUN Church (64x64) in Figure 8.

Figure 4: CIFAR-10 samples from DMMD with NFE=250 on the left and with NFE=100 on the right.

Figure 5: CIFAR-10 samples from DMMD with NFE=100 on the left and samples from a-DMMD-e with NFE=50 on the right.

Figure 6: DMMD samples for MNIST.

Figure 7: DMMD samples for CELEB-A (64x64).

Figure 8: DMMD samples for LSUN Church (64x64).

Figure 9: CELEB-A-HQ (128x128) samples. Left: samples from DMMD. Right: samples from DDPM.

Figure 10: Latent (64x64x3) CELEB-A-HQ (256x256x3) samples for latent DMMD.

L.3 ADDITIONAL DATASETS SAMPLES ON CELEB-A-HQ

The samples for CELEB-A-HQ (128x128) are given in Figure 9.

L.4 ADDITIONAL DATASETS SAMPLES ON LATENT (64X64X3) CELEB-A-HQ (256X256X3)

The samples for latent DMMD are given in Figure 10 and the samples for latent diffusion are given in Figure 11.

Figure 11: Latent (64x64x3) CELEB-A-HQ (256x256x3) samples for latent diffusion.