# generative_sliced_mmd_flows_with_riesz_kernels__def1af9b.pdf

Published as a conference paper at ICLR 2024

GENERATIVE SLICED MMD FLOWS WITH RIESZ KERNELS

Johannes Hertrich¹, Christian Wald², Fabian Altekrüger³, Paul Hagemann²
¹ University College London, ² Technische Universität Berlin, ³ Humboldt-Universität zu Berlin
Correspondence to: j.hertrich@ucl.ac.uk

ABSTRACT

Maximum mean discrepancy (MMD) flows suffer from high computational costs in large scale computations. In this paper, we show that MMD flows with Riesz kernels $K(x, y) = -\|x - y\|^r$, $r \in (0, 2)$, have exceptional properties which allow their efficient computation. We prove that the MMD of Riesz kernels, which is also known as energy distance, coincides with the MMD of their sliced version. As a consequence, the computation of gradients of MMDs can be performed in the one-dimensional setting. Here, for $r = 1$, a simple sorting algorithm can be applied to reduce the complexity from $O(MN + N^2)$ to $O((M + N)\log(M + N))$ for two measures with $M$ and $N$ support points. As another interesting follow-up result, the MMD of compactly supported measures can be estimated from above and below by the Wasserstein-1 distance. For the implementations, we approximate the gradient of the sliced MMD by using only a finite number $P$ of slices. We show that the resulting error has complexity $O(\sqrt{d/P})$, where $d$ is the data dimension. These results enable us to train generative models by approximating MMD gradient flows by neural networks even for image applications. We demonstrate the efficiency of our model by image generation on MNIST, Fashion MNIST and CIFAR10.

1 INTRODUCTION

With the rise of generative models, the field of gradient flows in measure spaces has received increasing attention. Based on classical Markov chain Monte Carlo methods, Welling & Teh (2011) proposed to apply Langevin dynamics for inferring samples from a known probability density function. This corresponds to simulating a Wasserstein gradient flow with respect to the Kullback-Leibler divergence, see Jordan et al. (1998). Closely related to this approach are current state-of-the-art image generation methods like score-based models (Song & Ermon, 2019; 2020) or diffusion models (Ho et al., 2020; Song et al., 2021), which significantly outperform classical generative models like GANs (Goodfellow et al., 2014) or VAEs (Kingma & Welling, 2014). A general aim of such algorithms (Arbel et al., 2021; Ho et al., 2020; Wu et al., 2020) is to establish a path between input and target distribution, where "unseen" data points are generated via the randomness of the input distribution. Several combinations of such Langevin-type Markov chain Monte Carlo methods with other generative models were proposed in (Ben-Hamu et al., 2022; Hagemann et al., 2023; Wu et al., 2020). Gradient flows on measure spaces with respect to other metrics are considered in (di Langosco et al., 2022; Dong et al., 2023; Grathwohl et al., 2020; Liu, 2017; Liu & Wang, 2016) under the name Stein variational gradient descent. For approximating gradient flows with respect to functionals other than the KL divergence, the authors of (Altekrüger et al., 2023; Ansari et al., 2021; Alvarez-Melis et al., 2022; Bunne et al., 2022; Fan et al., 2022; Gao et al., 2019; Garcia Trillos et al., 2023; Heng et al., 2023; Mokrov et al., 2021; Peyré, 2015) proposed the use of suitable forward and backward discretizations.
To reduce the computational effort of evaluating distance measures on high-dimensional probability distributions, the sliced Wasserstein metric was introduced in (Rabin et al., 2012). The main idea of the sliced Wasserstein distance is to compare one-dimensional projections of the corresponding probability distributions instead of the distributions themselves. This approach can be generalized to more general probability metrics (Kolouri et al., 2022) and was applied in the context of Wasserstein gradient flows in (Bonet et al., 2022b; Liutkus et al., 2019).

Figure 1: Left: Comparison of run time for 1000 gradient evaluations of naive MMD and sliced MMD with different numbers of projections $P$ in the case $d = 100$. Middle and right: Relative error of the gradients of sliced MMD and MMD with respect to the number $P$ of projections and the dimension $d$. The results show that the relative error behaves asymptotically as $O(\sqrt{d/P})$, as shown in Theorem 4.

For many generative gradient-flow methods it is required that the considered functional can be evaluated based on samples. For divergence-based functionals like the Kullback-Leibler or the Jensen-Shannon divergence, a variational formulation leading to a GAN-like evaluation procedure is provided in (Fan et al., 2022). In contrast, the authors of (Altekrüger et al., 2023; Arbel et al., 2019; Glaser et al., 2021) use functionals based on the maximum mean discrepancy (MMD), which can be directly evaluated based on empirical measures. For positive definite kernels, it can be shown under some additional assumptions that MMD defines a metric on the space of probability distributions, see e.g., (Gretton et al., 2012; Sriperumbudur et al., 2011; 2010). If the considered kernel is smooth, then Arbel et al. (2019) proved that Wasserstein gradient flows can be fully described by particles. Even though this is no longer true for non-smooth kernels (Hertrich et al., 2023b), Altekrüger et al. (2023) pointed out that particle flows are Wasserstein gradient flows at least with respect to a restricted functional. In particular, we can expect that particle flows provide an accurate approximation of Wasserstein gradient flows as long as the number of particles is large enough.

Contributions. The computational complexity of the MMD between two empirical measures with $N$ and $M$ support points depends quadratically on $N$ and $M$, which makes large scale computations impossible. In this paper, we focus on the MMD with Riesz kernels

$$K(x, y) = -\|x - y\|^r, \quad r \in (0, 2), \qquad (1)$$

also known as energy distance (Sejdinovic et al., 2013; Székely, 2002; Székely & Rizzo, 2013). We show that Riesz kernels have the outstanding property that their MMD coincides with the sliced MMD of univariate Riesz kernels. It is this property that enables us to reduce the computation of (gradients of) MMD to the one-dimensional setting. In the case of $r = 1$, we propose a simple and computationally very efficient sorting algorithm for computing the gradient of the one-dimensional MMD with complexity $O((M + N)\log(M + N))$.
Considering that our numerical examples use between 10.000 and 50.000 particles, this leads to an incredible speed-up for gradient computations of (sliced) MMD as illustrated in the left plot of Figure 1. Our approach opens the door to applications in image processing, where we have usually to cope with high dimensional data. In practice, sliced probability metrics are evaluated by replacing the expectation over all projections by the empirical expectation resulting in a finite sum. In the case of sliced MMD with Riesz kernels and r = 1, we prove that the error induced by this approximation behaves asymptotically as O( p d/P), where d is the data dimension and P the number of projections, see the middle plot in Figure 1 for an illustration. The square root scaling of the error in the dimension d ensures that an accurate computation of the sliced MMD with Riesz kernels is possible even in very high dimensions as demonstrated in the right plot in Figure 1. Taking into account the number of projections, the overall complexity of the computation of the derivatives of MMD is O(d P(M + N) log(M + N)). We apply the cheap evaluation of MMD gradients to compute MMD particle flows starting with samples from an initial probability measure µ0 to samples from a predefined target distribution ν, which is given by samples. Finally, we derive a generative model by training a sequence (Φl)L l=1 of neural networks, where each Φl approximates a certain number of steps of the particle flow. This allows us to train our network iteratively. In particular, during the training and evaluation procedure, we always consider only one of the networks Φl at the same time. This allows an efficient training with relatively low resources even though all networks Φl together have a large number of parameters. Published as a conference paper at ICLR 2024 We demonstrate the efficiency of our generative sliced MMD flows for image generation on MNIST, Fashion MNIST and CIFAR10. Related Work. Gradient flows with respect to MMD functionals are considered in (Altekrüger et al., 2023; Arbel et al., 2019; Hertrich et al., 2023b; Kolouri et al., 2022). However, due to the quadratic complexity of the computation of the derivative of MMD functionals in the number of support points of the involved measures, these papers have a rather theoretical scope and applications are limited to measures supported on a few hundred points. In order to reduce the dimension of the problem, Kolouri et al. (2022) consider a sliced version of MMD. This is motivated by the success of sliced Wasserstein distances (Rabin et al., 2012), which were used for deriving gradient flows in (Bonet et al., 2022b; Liutkus et al., 2019; Nguyen et al., 2023; 2021). In particular, Kolouri et al. (2022) observe that the sliced MMD is again a MMD functional with a different kernel. We use this result in Section 2. Vayer & Gribonval (2023) bound Wasserstein distances and MMD against each other. However, they use strong assumptions on the kernel, which are not fulfilled for the negative distance kernel. In very low dimensions, fast evaluations of MMD and their gradients were proposed in (Gräf et al., 2012; Teuber et al., 2011) based on fast Fourier summation using the non-equispaced fast Fourier transforms (NFFT), see (Plonka et al., 2018, Sec. 7) and references therein. Unfortunately, since the complexity of the NFFT depends exponentially on the data-dimension, these approaches are limited to applications in dimension four or smaller. 
In a one-dimensional setting, the energy distance is related to the Cramer distance, see (Székely, 2002). In the context of reinforcement learning, Lhéritier & Bondoux (2021) developed fast evaluation algorithms for the latter based on the calculation of cumulative distribution functions. Finally, the authors of (Bi nkowski et al., 2018; Dziugaite et al., 2015; Li et al., 2017; 2015) apply MMD for generative modelling by constructing so-called MMD-GANs. However, this is conceptionally a very different approach since in MMD-GANs the discriminator in the classical GAN framework (Goodfellow et al., 2014) is replaced by a MMD distance with a variable kernel. Also relevant is the direction of Sobolev-GANs (Mroueh et al., 2018) in which the discriminator is optimized in a Sobolev space, which is related to the RKHS of the Riesz kernel. Similar to GAN ideas this results in a max-min problem which is solved in an alternating fashion and is not related to gradient flows. Outline of the Paper. In Section 2, we prove that the sliced MMD with the one-dimensional Riesz kernel coincides with MMD of the scaled d-dimensional kernel. This can be used to establish an interesting lower bound on the MMD by the Wasserstein-1 distance. Then, in Section 3 we propose a sorting algorithm for computing the derivative of the sliced MMD in an efficient way. We apply the fast evaluation of MMD gradients to simulate MMD flows and to derive a generative model in Section 4 . Section 5 shows numerical experiments on image generation. Finally, conlusions are drawn in Section 6. The appendices contain the proofs and supplementary material. 2 SLICED MMD FOR RIESZ KERNELS Let P(Rd) denote the set of probability measures on Rd and Pp(Rd) its subset of measures with finite p-th moment, i.e., R Rd x pdµ(x) < . Here denotes the Euclidean norm on Rd. For a symmetric, positive definite kernel K : Rd Rd R, the maximum mean discrepancy (MMD) DK : P(Rd) P(Rd) R is the square root of D2 K(µ, ν) := EK(µ ν), where EK is the interaction energy of signed measures on Rd defined by Rd K(x, y) dη(x)dη(y). Due to its favorable properties, see Appendix E, we are interested in Riesz kernels K(x, y) = x y r, r (0, 2). These kernels are only conditionally positive definite, but can be extended to positive definite kernels by K(x, y) = K(x, y) K(x, 0) K(0, y), see also Remark 13. Then it holds for µ, ν Pr(Rd) that DK(µ, ν) = D K(µ, ν), see (Neumayer & Steidl, 2021, Lemma 3.3 iii)). Moreover, for Riesz kernels, DK is a metric on Pr(Rd), which is also known as so-called energy distance (Sejdinovic et al., 2013; Székely & Rizzo, 2013). Note that we exclude the case r = 2, since DK is no longer a metric in this case. Published as a conference paper at ICLR 2024 However, computing MMDs on high dimensional spaces is computationally costly. Therefore, the sliced MMD SD2 k : P2(Rd) P2(Rd) R was considered in the literature, see e.g., Kolouri et al. (2022). For a symmetric 1D kernel k: R R R it is given by SD2 k(µ, ν) := Eξ USd 1[D2 k(Pξ#µ, Pξ#ν)] with the push-forward measure Pξ#µ := µ P 1 ξ of the projection Pξ(x) := ξ, x and the uniform distribution USd 1 on the sphere Sd 1. By interchanging the integrals from the expectation and the definition of MMD, Kolouri et al. (2022) observed that the sliced MMD is equal to the MMD with an associate kernel K: Rd Rd R. More precisely, it holds SD2 k(µ, ν) = D2 K(µ, ν), with K(x, y) := Eξ USd 1[k(Pξ(x), Pξ(y))]. 
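For concreteness, the following is a minimal PyTorch sketch (the function name is ours) of the naive squared MMD $D_K^2(\mu, \nu)$ for the Riesz kernel between two empirical measures, evaluated directly from the definition above. Its quadratic cost in the number of support points is exactly what the slicing and sorting developed in the next sections avoids.

```python
import torch

def mmd2_riesz_naive(x, y, r=1.0):
    """Naive squared MMD D_K^2(mu, nu) for the Riesz kernel K(x, y) = -||x - y||^r between
    empirical measures mu = (1/N) sum_i delta_{x_i} and nu = (1/M) sum_j delta_{y_j}.
    Cost: O(N^2 + M^2 + NM) pairwise distances. For r = 1 this is half the classical energy distance."""
    dxx = torch.cdist(x, x).pow(r).mean()  # mean includes the zero diagonal, matching the double integral
    dyy = torch.cdist(y, y).pow(r).mean()
    dxy = torch.cdist(x, y).pow(r).mean()
    return dxy - 0.5 * dxx - 0.5 * dyy

# Example: two Gaussian samples in d = 50.
x = torch.randn(500, 50)
y = 0.5 * torch.randn(700, 50) + 1.0
print(mmd2_riesz_naive(x, y).item())
```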
By the following theorem, this relation becomes more simple when dealing with Riesz kernels, since in this case the associate kernel is a Riesz kernel as well. Theorem 1 (Sliced Riesz Kernels are Riesz Kernels). Let k(x, y) := |x y|r, r (0, 2). Then, for µ, ν Pr(Rd), it holds SD2 k(µ, ν) = D2 K(µ, ν) with the associated scaled Riesz kernel K(x, y) := c 1 d,r x y r, cd,r := πΓ( d+r The proof is given in Appendix A. The constant cd,r depends asymptotically with O(dr/2) on the dimension. In particular, it should be harder to estimate the MMD or its gradients in higher dimensions via slicing. We will discuss this issue more formally later in Remark 5. For r = 1, we just write cd := cd,1 and can consider measures in P1(Rd). Interestingly, based on Theorem 1, we can establish a relation between the MMD and the Wasserstein-1 distance on P1(Rd) defined by W1(µ, ν) := min π Π(µ,ν) Z x y dπ(x, y), where Π(µ, ν) denotes the set of measures in P1(Rd Rd) with marginals µ and ν. This also shows that Conjecture 1 in (Modeste & Dombry, 2023) can only hold for non-compactly supported measures. The proof is given in Appendix B. Theorem 2 (Relation between DK and W1 for Distance Kernels). Let K(x, y) := x y . Then, it holds for µ, ν P1(Rd) that 2D2 K(µ, ν) W1(µ, ν). If µ and ν are additionally supported on the ball BR(0), then there exists a constant Cd > 0 such that W1(µ, ν) Cd R 2d+1 2d+2 DK(µ, ν) 1 d+1 . The fact that the sample complexities of MMD and Wasserstein-1 are O(n 1/2) (Gretton et al., 2012, Chapter 4.1) and O(n 1/d) (Peyré & Cuturi, 2020, Chapter 8.4.1) suggests, that the exponent of DK in Theorem 2 cannot be improved over 1/d. 3 GRADIENTS OF SLICED MMD Next, we consider the functional Fν : P2(Rd) R given by Fν(µ) := EK(µ) + VK,ν(µ) = D2 K(µ, ν) + constν, (2) where VK,ν(µ) is the so-called potential energy VK,ν(µ) := Z Rd K(x, y) dν(y) dµ(x) acting as an attraction term between the masses of µ and ν, while the interaction energy EK is a repulsion term enforcing a proper spread of µ. For the rest of the paper, we always consider the negative distance kernel K(x, y) := x y , which is the Riesz kernel (1) with r = 1. Then, we obtain directly from the metric property of MMD that the minimizer of the non-convex functional Fν is given by ν. We are interested in computing gradient flows of Fν towards this minimizer. However, the computation of gradients in measure spaces for discrepancy functionals with non-smooth kernels is highly non-trivial and computationally costly, see e.g., (Altekrüger et al., 2023; Carrillo et al., 2020; Hertrich et al., 2023b). Published as a conference paper at ICLR 2024 As a remedy, we focus on a discrete form of the d-dimensional MMD. More precisely, we assume that µ and ν are empirical measures, i.e., they are of the form µ = 1 N PN i=1 δxi and ν = 1 M PM j=1 δyj for some xj, yj Rd. Let x := (x1, . . . , x N) and y := (y1, . . . , y M). Then the functional Fν reduces to the function Fd( |y): RNd R given by Fd(x|y) = 1 2N 2 j=1 xi xj + 1 MN j=1 xi yj (3) j=1 δyj + consty. In order to evaluate the gradient of Fd with respect to the support points x, we use Theorem 1 to rewrite Fd(x|y) as Fd(x|y) = cd Eξ USd 1[F1( ξ, x1 , ..., ξ, x N | ξ, y1 , ..., ξ, y M )]. Then, the gradient of Fd with respect to xi is given by xi Fd(x|y) = cd Eξ USd 1[ i F1( ξ, x1 , ..., ξ, x N | ξ, y1 , ..., ξ, y M )ξ], (4) where i F1 denotes the derivative of F1 with respect to the i-th component of the input. Consequently, it suffices to compute gradients of F1 in order to evaluate the gradient of Fd. 
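Theorem 1 and the rewriting of $F_d$ as a scaled expectation over projections can be checked numerically. The following minimal PyTorch sketch (all names are ours) compares a direct evaluation of $F_d$ from (3) with the Monte Carlo estimate $c_{d,r}\,\mathbb{E}_\xi[F_1]$; the closed form used for the constant, $c_{d,r} = \sqrt{\pi}\,\Gamma((d+r)/2) / (\Gamma(d/2)\,\Gamma((r+1)/2))$, is our reconstruction of the garbled expression in Theorem 1 and Appendix A.

```python
import math
import torch

def c_dr(d, r=1.0):
    # Scaling constant of Theorem 1 (our reconstruction):
    # c_{d,r} = sqrt(pi) * Gamma((d+r)/2) / (Gamma(d/2) * Gamma((r+1)/2)).
    return math.sqrt(math.pi) * math.gamma((d + r) / 2) / (math.gamma(d / 2) * math.gamma((r + 1) / 2))

def F_d(x, y):
    # Discrete functional (3): -1/(2N^2) sum_{i,j} ||x_i - x_j|| + 1/(NM) sum_{i,j} ||x_i - y_j||.
    N, M = x.shape[0], y.shape[0]
    return -torch.cdist(x, x).sum() / (2 * N**2) + torch.cdist(x, y).sum() / (N * M)

def F_1(px, py):
    # One-dimensional functional, evaluated for each projection (row) of px (P, N) and py (P, M).
    N, M = px.shape[1], py.shape[1]
    exx = (px.unsqueeze(-1) - px.unsqueeze(-2)).abs().sum(dim=(-2, -1)) / (2 * N**2)
    vxy = (px.unsqueeze(-1) - py.unsqueeze(-2)).abs().sum(dim=(-2, -1)) / (N * M)
    return -exx + vxy  # shape (P,)

d, N, M, P = 20, 80, 80, 2000
x, y = torch.randn(N, d), torch.randn(M, d) + 1.0
xi = torch.randn(P, d)
xi = xi / xi.norm(dim=1, keepdim=True)             # xi_p approximately uniform on S^{d-1}
sliced = c_dr(d) * F_1(xi @ x.T, xi @ y.T).mean()  # Monte Carlo estimate of c_d * E_xi[F_1]
print(F_d(x, y).item(), sliced.item())             # the two values should be close
```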
A Sorting Algorithm for the 1D-Case. Next, we derive a sorting algorithm to compute the gradient of F1 efficiently. In particular, the proposed algorithm has complexity O((M + N) log(M + N)) even though the definition of F1 in (3) involves N 2 + MN summands. To this end, we split the functional F1 into interaction and potential energy, i.e., F1(x|y) = E(x) + V (x|y) with E(x) := 1 2N 2 j=1 |xi xj|, V (x|y) := 1 NM j=1 |xi yj|. (5) Then, we can compute the derivatives of E and V by the following theorem which proof is given in Appendix C. Theorem 3 (Derivatives of Interaction and Potential Energy). Let x1, ..., x N R be pairwise disjoint and y1, ..., y M R such that xi = yj for all i = 1, ..., N and j = 1, ..., M. Then, E and V are differentiable with xi E(x) = N + 1 2σ 1(i) N 2 , xi V (x|y) = 2 #{j {1, ..., M} : yj < xi} M where σ: {1, ..., N} {1, ..., N} is the permutation with xσ(1) < ... < xσ(N). Since V is convex, we can show with the same proof that 2 #{j {1, ..., M} : yj < xi} M MN xi V (x|y), where xi is the subdifferential ov V with respect to xi whenever xi = yj for some i, j. By Theorem 3, we obtain that F1(x|y) = E(x) + V (x|v) can be computed by Algorithm 1 and Algorithm 2 with complexity O(N log(N)) and O((M + N) log(M + N)), respectively. The complexity is dominated by the sorting procedure. Both algorithms can be implemented in a vectorized form for computational efficiency. Note that by Lemma 9 from the appendix, the discrepancy with Riesz kernel and r = 1 can be represented by the cumulative distribution functions (cdfs) of the involved measures. Since the cdf of an one-dimensional empirical measure can be computed via sorting, we also obtain an O((N + M) log(M + N)) algorithm for computing the one-dimensional MMD itself and not only for its derivative. Stochastic Approximation of Sliced MMD Gradients for r = 1. To evaluate the gradient of Fd efficiently, we use a stochastic gradient estimator. For x1, ..., x N, y1, ..., y M Rd, we define for P N the stochastic gradient estimator of (4) as the random variable P Fd(x|y) = P,xi Fd(x|y) N Published as a conference paper at ICLR 2024 Algorithm 1 Derivative of the interaction energy E from (5). Input: x1, ..., x N R with xi = xj for i = j. Algorithm: Compute σ1, ...σN = argsort(x1, ..., x N). Compute gi = 2σ 1 i 1 N N2 . Output: (g1, ..., g N) = E(x1, ..., x N). Algorithm 2 Derivative of the potential energy V from (5). Input: x1, ..., x N R, y1, ..., y M R with xi = yj. Algorithm: Compute σ1, ..., σN+M = argsort(x1, ..., x N, y1, ..., y M) Initialize h1 = = h M+N = 0. for j = 1, ..., M do Set hσ(N+j) = 1. end for Set h = 2 cumsum( h) 1 for i = 1, ..., N do Set gi = hσ 1(i) MN , end for Output: (g1, ..., g N) = V (x1, ..., x N|y1, . . . , y M). P,xi Fd(x|y) := cd p=1 i F1( ξp, x1 , ..., ξp, x N | ξp, y1 , ..., ξp, y M )ξp, for independent random variables ξ1, ..., ξP USd 1. We obtain by (4) that Fd is unbiased, i.e., E[ P Fd(x|y)] = Fd(x|y). Moreover, the following theorem shows that the error of P Fd converges to zero for a growing number P of projections. The proof uses classical concentration inequalities and follows directly from Corollary 12 in Appendix D. Theorem 4 (Error Bound for Stochastic MMD Gradients). Let x1, ..., x N, y1, ..., y M Rd. Then, it holds E[ P Fd(x|y) Fd(x|y) ] O r To verify this convergence rate numerically, we draw N = 1000 samples x1, ..., x N from a Gaussian mixture model with two components and M = 1000 samples y1, ..., y M from a Gaussian mixture model with ten components. 
The means are chosen randomly following a uniform distribution in [ 1, 1]d and the standard deviation is set to 0.01. Then, we compute numerically the expected relative approximation error between P Fd and Fd for different choices of P and d. The results are illustrated in the middle and in the right plot of Figure 1. We observe that this numerical evaluation underlines the convergence rate of O q d P . In particular, the error scales with O( p d/P), which makes the approach applicable for high-dimensional problems. Remark 5 (Computational Complexity of Gradient Evaluations). The appearance of the d in the error bound is due to the scale factor cd between the MMD and the sliced MMD, which can be seen in the proof of Theorem 4. In particular, we require O(d) projections in order to approximate Fd(x|y) by P Fd(x|y) up to a fixed expected error of ϵ. Together with the computational complexity of O(d P(N + M) log(N + M)) for P Fd(x, y), we obtain an overall complexity of O(d2(N +M) log(N +M)) in order to approximate Fd(x|y) up to an expected error of ϵ. On the other hand, the naive computation of (gradients of) Fd(x|y) has a complexity of O(d(N 2 + MN)). Published as a conference paper at ICLR 2024 Consequently, we improve the quadratic complexity in the number of samples to O(N log(N)). Here, we pay the price of quadratic instead of linear complexity in the dimension. 4 GENERATIVE MMD FLOWS In this section, we use MMD flows with the negative distance kernel for generative modelling. Throughout this section, we assume that we are given independent samples y1, ..., y M Rd from a target measure ν P2(Rd) and define the empirical version of ν by νM := 1 M PM i=1 δyi. 4.1 MMD PARTICLE FLOWS In order to derive a generative model approximating ν, we simulate a gradient flow of the functional Fν from (2). Unfortunately, the computation of gradient flows in measure spaces for Fν is highly non-trivial and computationally costly, see (Altekrüger et al., 2023; Hertrich et al., 2023b). Therefore, we consider the (rescaled) gradient flow with respect to the functional Fd instead. More precisely, we simulate for Fd from (3), the (Euclidean) differential equation x = N Fd(x|y), x(0) = (x(0) 1 , ..., x(0) N ), (7) where the initial points x(0) i are drawn independently from some measure µ0 P2(Rd). In our numerical experiments, we set µ0 to the uniform distribution on [0, 1]d. Then, for any solution x(t) = (x1(t), ..., x N(t)) of (7), it is proven in (Altekrüger et al., 2023, Proposition 14) that the curve γM,N : (0, ) P2(Rd) defined by γM,N(t) = 1 N PN i=1 δxi(t) is a Wasserstein gradient flow with respect to the function F : P2(Rd) R { }, µ 7 ( FνM , if µ = 1 N PN i=1 δxi for some xi = xj Rd, + , otherwise. Hence, we can expect for M, N , that the curve γM,N approximates the Wasserstein gradient flow with respect to Fν. Consequently, we can derive a generative model by simulating the gradient flow (7). To this end, we use the explicit Euler scheme x(k+1) = x(k) τN Fd(x(k)|y), (8) where x(k) = (x(k) 1 , ..., x(k) N ) and τ > 0 is some step size. Here, the gradient on the right-hand side can be evaluated very efficiently by the stochastic gradient estimator from (6). Momentum MMD Flows. To reduce the required number of steps in (8), we introduce a momentum parameter. 
More precisely, for some given momentum parameter m [0, 1) we consider the momentum MMD flow defined by the following iteration v(k+1) = Fd(x(k)|y) + m v(k) x(k+1) = x(k) τN v(k+1), (9) where τ > 0 is some step size, x(0) i are independent samples from a initial measure µ0 and v(0) i = 0. Note that the MMD flow (8) is a special case of the momentum MMD flow (9) with m = 0. In Figure 2, we illustrate the momentum MMD flow (9) and MMD flow (8) without momentum from a uniform distribution on [0, 1]d to MNIST (Le Cun et al., 1998) and CIFAR10 (Krizhevsky, 2009). The momentum is set to m = 0.9 for MNIST and to m = 0.6 for CIFAR10. We observe that the momentum MMD flow (9) converges indeed faster than the MMD flow (8) without momentum. 4.2 GENERATIVE MMD FLOWS The (momentum) MMD flows from (8) and (9) transform samples from the initial distribution µ0 into samples from the target distribution ν. Therefore, we propose to train a generative model which approximates these schemes. The main idea is to approximate the Wasserstein gradient flow γ : [0, ) P2(Rd) with respect to Fν from (2) starting at some latent distribution µ0 = γ(0). Then, we iteratively train neural networks Φ1, ..., ΦL such that γ(tl) Φl#γ(tl 1) for some tl with Published as a conference paper at ICLR 2024 Figure 2: Samples and their trajectories from MNIST (left) and CIFAR10 (right) in the MMD flow with momentum (9, top) and without momentum (8, bottom) starting in the uniform distribution on [0, 1]d after 2k steps with k {0, ..., 16} (for MNIST) and k {3, ..., 19} (for CIFAR10). We observe that the momentum MMD flow (9) converges faster than the MMD flow (8) without momentum. 0 = t0 < t1 < < t L. Then, for t L large enough, it holds ν γ(t L) (ΦL Φ1)#γ(0) with γ(0) = µ0. Such methods learning iteratively an "interpolation path" are exploited several times in literature, e.g., Arbel et al. (2021); Fan et al. (2022); Ho et al. (2020). To implement this numerically, we train each network Φl such that it approximates Tl number of steps from (8) or (9). The training procedure of our generative MMD flow is summarized in Algorithm 3 in Appendix F. Once the networks Φ1, ..., ΦL are trained, we can infer a new sample x from our (approximated) target distribution ν as follows. We draw a sample x(0) from µ0, compute x(l) = x(l 1) Φl(x(l 1)) for l = 1, ..., L and set x = x(L). In particular, this allows us to simulate paths of the discrepancy flow we have not trained on. Remark 6 (Iterative Training and Sampling). Since the networks are not trained in an end-to-end fashion but separately, their GPU memory load is relatively low despite a high number of trainable parameters of the full model (Φl)L l=1. This enables training of our model on an 8 GB GPU. Moreover, the training can easily be continued by adding additional networks Φl, l = L+1, ..., L to an already trained generative MMD flow (Φl)L l=1, which makes applications more flexible. 5 NUMERICAL EXAMPLES In this section, we apply generative MMD flows for image generation on MNIST, Fashion MNIST (Xiao et al., 2017),CIFAR10 and Celeb A (Liu et al., 2015). The images from MNIST and Fashion MNIST are 28 28 gray-value images, while CIFAR10 consists of 32 32 RGB images resulting in the dimensions d = 784 and d = 3072, respectively. For Celeb A, we centercrop the images to 140 140 and then bicubicely resize them to 64 64. We run all experiments either on a single NVIDIA Ge Force RTX 3060 or a RTX 4090 GPU with 12GB or 24GB memory,respectively. 
To evaluate our results, we use the Fréchet inception distance (FID) (Heusel et al., 2017)¹ between 10K generated samples and the test dataset. Here, a smaller FID value indicates a higher similarity between generated and test samples. We choose the networks $(\Phi_l)_{l=1}^L$ to be UNets (Ronneberger et al., 2015), where we use the implementation from (Huang et al., 2021) based on (Ho et al., 2020). Then, we run the generative MMD flow with L = 55 (MNIST), L = 67 (Fashion MNIST), L = 86 (CIFAR10) and L = 71 (CelebA) networks. The exact setup is described in Appendix H. We compare the resulting FIDs with other gradient-flow-based models and various further methods in Table 1. We computed the standard deviations by independently sampling ten times from one training run and computing the corresponding FID. We observe that we achieve excellent performance on MNIST and Fashion MNIST as well as very good results on CIFAR10 and CelebA. Generated samples are given in Figure 3 and more samples are given in Appendix I. The L2-nearest neighbors of the generated samples on MNIST are also illustrated in Figure 8 in Appendix I.

¹ We use the implementation from https://github.com/mseitzer/pytorch-fid.

Figure 3: Generated samples of our generative MMD Flow.

Table 1: FID scores for different datasets and various methods.

| Method | MNIST | Fashion MNIST | CIFAR10 | CelebA |
|---|---|---|---|---|
| *Auto-encoder based* | | | | |
| CWAE (Knop et al., 2020) | 23.6 | 50.0 | 120.0 | 49.7 |
| SWF + Autoencoder + RealNVP (Bonet et al., 2022b) | 17.8 | 40.6 | - | 90.9 |
| 2-stage VAE (Dai & Wipf, 2019) | 12.6 | 29.3 | 72.9 | 44.4 |
| GLF (Xiao et al., 2019) | 8.2 | 21.3 | 88.3 | 53.2 |
| *Adversarial* | | | | |
| WGAN (Arjovsky et al., 2017; Lucic et al., 2018) | 6.7 | 21.5 | 55.2 | 41.3 |
| MMD GAN (Bińkowski et al., 2018) | 4.2 | - | 48.1 | 29.2 |
| *Score-based* | | | | |
| NCSN (Song & Ermon, 2019) | - | - | 25.3 | - |
| *Flow based* | | | | |
| SWF (Liutkus et al., 2019)³ | 225.1 | 207.6 | - | 91.2 |
| SIG (Dai & Seljak, 2021) | 4.5 | 13.7 | 66.5 | 37.3 |
| ℓ-SWF (Du et al., 2023) | - | - | 59.7 | 38.3 |
| Generative Sliced MMD Flow (ours) | 3.1 ± 0.06 | 11.3 ± 0.07 | 54.8 ± 0.26 | 32.1 ± 0.17 |

6 CONCLUSIONS

Discussion. We introduced an algorithm to compute (gradients of) the MMD with the Riesz kernel efficiently via slicing and sorting, reducing the dependence of the computational complexity on the number of particles from $O(NM + N^2)$ to $O((N + M)\log(N + M))$. For the implementations, we approximated the gradient of the sliced MMD by a finite number of slices and proved that the corresponding approximation error scales with the square root of the dimension. We applied our algorithm for computing MMD flows and approximated them by neural networks. Here, a sequential learning approach ensures computational efficiency. We included numerical examples for image generation on MNIST, Fashion MNIST and CIFAR10.

Limitations. One of the disadvantages of interacting particle methods is that batching is not easily possible: the particle flow for one set of training points does not give a helpful approximation for another set of training points. This is due to the interaction energy and a general problem of particle flows. Furthermore, taking the projections involves the multiplication of every data point with a "full" projection and therefore scales with the dimension d. Taking "local" projections as in (Du et al., 2023; Nguyen & Ho, 2022) can be much more efficient.

Outlook. Our paper is the first work which utilizes sliced MMD flows for generative modelling. Consequently, the approach can be extended in several directions. Other kernels are considered in the context of slicing in the follow-up paper (Hertrich, 2024).
From a theoretical viewpoint, the derivative formulas from Theorem 3 can be extended to the non-discrete case by the use of quantile functions, see (Bonaschi et al., 2015; Hertrich et al., 2023a) for some first approaches into this direction. Towards applications, we could extend the framework to posterior sampling in Bayesian inverse problems. In this context, the fast computation of MMD gradients can be also of interest for applications which are not based on gradient flows, see e.g., Ardizzone et al. (2019). Finally, the consideration of sliced probability metrics is closely related to the Radon transform and is therefore of interest also for non-Euclidean domains like the sphere, see e.g., (Bonet et al., 2022a; Quellmalz et al., 2023). 3values taken from Bonet et al. (2022b) Published as a conference paper at ICLR 2024 ACKNOWLEDGEMENTS Many thanks to J. Chemseddine for providing parts of the proof of Theorem 2, and to R. Beinert and G. Steidl for fruitful discussions. We thank Mitchell Krock for finding a typo. J.H. acknowledges funding by the German Research Foundation (DFG) within the project STE 571/16-1 and by the EPSRC programme grant The Mathematics of Deep Learning with reference EP/V026259/1, C.W. by the DFG within the SFB Tomography Across the Scales (STE 571/19-1, project number: 495365311), F.A. by the DFG under Germany s Excellence Strategy The Berlin Mathematics Research Center MATH+ (project AA 5-6), and P.H. from the DFG within the project SPP 2298 "Theoretical Foundations of Deep Learning". Fabian Altekrüger, Johannes Hertrich, and Gabriele Steidl. Neural Wasserstein gradient flows for maximum mean discrepancies with Riesz kernels. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 664 690. PMLR, 2023. David Alvarez-Melis, Yair Schiff, and Youssef Mroueh. Optimizing functionals on the space of probabilities with input convex neural networks. Transactions on Machine Learning Research, 2022. Abdul Fatir Ansari, Ming Liang Ang, and Harold Soh. Refining deep generative models via discriminator gradient flow. In International Conference on Learning Representations, 2021. Michael Arbel, Anna Korba, Adil Salim, and Arthur Gretton. Maximum mean discrepancy gradient flow. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. Michael Arbel, Alex Matthews, and Arnaud Doucet. Annealed flow transport Monte Carlo. In International Conference on Machine Learning. PMLR, 2021. Lynton Ardizzone, Jakob Kruse, Carsten Rother, and Ullrich Köthe. Analyzing inverse problems with invertible neural networks. In International Conference on Learning Representations, 2019. Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, 2017. Kendall Atkinson and Weimin Han. Spherical harmonics and approximations on the unit sphere: an introduction, volume 2044. Springer Science & Business Media, 2012. Heli Ben-Hamu, Samuel Cohen, Joey Bose, Brandon Amos, Maximillian Nickel, Aditya Grover, Ricky T. Q. Chen, and Yaron Lipman. Matching normalizing flows and probability paths on manifolds. 
In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 1749 1763. PMLR, 2022. Mikołaj Bi nkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In International Conference on Learning Representations, 2018. Giovanni A Bonaschi, José A Carrillo, Marco Di Francesco, and Mark A Peletier. Equivalence of gradient flows and entropy solutions for singular nonlocal interaction equations in 1D. ESAIM: Control, Optimisation and Calculus of Variations, 21(2):414 441, 2015. Clément Bonet, Paul Berg, Nicolas Courty, François Septier, Lucas Drumetz, and Minh-Tan Pham. Spherical sliced-Wasserstein. ar Xiv preprint ar Xiv:2206.08780, 2022a. Clément Bonet, Nicolas Courty, François Septier, and Lucas Drumetz. Efficient gradient flows in sliced-Wasserstein space. Transactions on Machine Learning Research, 2022b. Nicolas Bonnotte. Unidimensional and evolution methods for optimal transportation. Ph D Thesis, Univ. Paris-Sud, 2013. Published as a conference paper at ICLR 2024 Charlotte Bunne, Laetitia Papaxanthos, Andreas Krause, and Marco Cuturi. Proximal optimal transport modeling of population dynamics. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera (eds.), Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pp. 6511 6528. PMLR, 2022. José Antonio Carrillo, Marco Di Francesco, Antonio Esposito, Simone Fagioli, and Markus Schmidtchen. Measure solutions to a system of continuity equations driven by newtonian nonlocal interactions. Discrete and Continuous Dynamical Systems, 40(2):1191 1231, 2020. Djalil Chafaï, Adrien Hardy, and Mylène Maïda. Concentration for Coulomb gases and Coulomb transport inequalities. Journal of Functional Analysis, 275(6):1447 1483, 2018. Bin Dai and David Wipf. Diagnosing and enhancing VAE models. In International Conference on Learning Representations, 2019. Biwei Dai and Uros Seljak. Sliced iterative normalizing flows. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 2352 2364. PMLR, 2021. Lauro Langosco di Langosco, Vincent Fortuin, and Heiko Strathmann. Neural variational gradient descent. In Fourth Symposium on Advances in Approximate Bayesian Inference, 2022. Hanze Dong, Xi Wang, LIN Yong, and Tong Zhang. Particle-based variational inference with preconditioned functional gradient flow. In The Eleventh International Conference on Learning Representations, 2023. Chao Du, Tianbo Li, Tianyu Pang, Shuicheng Yan, and Min Lin. Nonparametric generative modeling with conditional sliced-Wasserstein flows. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202, pp. 8565 8584. PMLR, 2023. Gintare Karolina Dziugaite, Daniel M Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pp. 258 267, 2015. Jiaojiao Fan, Qinsheng Zhang, Amirhossein Taghvaei, and Yongxin Chen. Variational Wasserstein gradient flow. 
In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 6185 6215. PMLR, 2022. Yuan Gao, Yuling Jiao, Yang Wang, Yao Wang, Can Yang, and Shunkang Zhang. Deep generative learning via variational gradient flow. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2093 2101. PMLR, 2019. Nicolas Garcia Trillos, Bamdad Hosseini, and Daniel Sanz-Alonso. From optimization to sampling through gradient flows. ar Xiv preprint ar Xiv:2302.11449, 2023. Pierre Glaser, Michael Arbel, and Arthur Gretton. KALE Flow: A relaxed KL gradient flow for probabilities with disjoint support. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 8018 8031. Curran Associates, Inc., 2021. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. Manuel Gräf, Daniel Potts, and Gabriele Steidl. Quadrature errors, discrepancies, and their relations to halftoning on the torus and the sphere. SIAM Journal on Scientific Computing, 34(5):A2760 A2791, 2012. Published as a conference paper at ICLR 2024 Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Kristjanson Duvenaud, and Richard S. Zemel. Learning the stein discrepancy for training and evaluating energy-based models without sampling. In International Conference on Machine Learning, 2020. Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723 773, 2012. Paul Lyonel Hagemann, Johannes Hertrich, and Gabriele Steidl. Generalized normalizing flows via Markov chains. Cambridge University Press, 2023. Alvin Heng, Abdul Fatir Ansari, and Harold Soh. Deep generative Wasserstein gradient flows. Preprint on Open Review: https://openreview.net/forum?id=zj Se BTEd Xp1, 2023. Johannes Hertrich. Fast kernel summation in high dimensions via slicing and Fourier transforms. ar Xiv preprint ar Xiv:2401.08260, 2024. Johannes Hertrich, Robert Beinert, Manuel Gräf, and Gabriele Steidl. Wasserstein gradient flows of the discrepancy with distance kernel on the line. In Scale Space and Variational Methods in Computer Vision, pp. 431 443. Springer, 2023a. Johannes Hertrich, Manuel Gräf, Robert Beinert, and Gabriele Steidl. Wasserstein steepest descent flows of discrepancies with Riesz kernels. ar Xiv preprint ar Xiv:2211.01804, 2023b. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local nash equilibrium. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. 
Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 6840 6851. Curran Associates, Inc., 2020. Chin-Wei Huang, Jae Hyun Lim, and Aaron Courville. A variational perspective on diffusion-based generative models and score matching. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1 17, 1998. David Kershaw. Some extensions of W. Gautschi s inequalities for the Gamma function. Mathematics of Computation, 41(164):607 611, 1983. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014. Szymon Knop, Przemysław Spurek, Jacek Tabor, Igor Podolak, Marcin Mazur, and Stanisław Jastrz ebski. Cramer-wold auto-encoder. Journal of Machine Learning Research, 21(164):1 28, 2020. Jonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for non-convex optimization. In International Conference on Machine Learning, pp. 1895 1904. PMLR, 2017. Soheil Kolouri, Kimia Nadjahi, Shahin Shahrampour, and Umut Sim sekli. Generalized sliced probability metrics. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4513 4517, 2022. Alex Krizhevsky. Learning multiple layers of features from tiny images. Master s thesis, University Toronto, ON, Canada, 2009. Yann Le Cun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Published as a conference paper at ICLR 2024 Alix Lhéritier and Nicolas Bondoux. A Cramer distance perspective on quantile regression based distributional reinforcement learning. ar Xiv preprint ar Xiv:2110.00535, 2021. Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabas Poczos. MMD GAN: Towards deeper understanding of moment matching network. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In International conference on machine learning, pp. 1718 1727. PMLR, 2015. Qiang Liu. Stein variational gradient descent as gradient flow. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the ICCV 15, pp. 3730 3738, 2015. Antoine Liutkus, Umut Simsekli, Szymon Majewski, Alain Durmus, and Fabian-Robert Stöter. Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. 
In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 4104 4113. PMLR, 2019. Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs created equal? A large-scale study. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. Thibault Modeste and Clément Dombry. Characterization of translation invariant MMD on Rd and connections with Wasserstein distances. HAL Preprint, hal-03855093, 2023. Petr Mokrov, Alexander Korotin, Lingxiao Li, Aude Genevay, Justin M Solomon, and Evgeny Burnaev. Large-scale Wasserstein gradient flows. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 15243 15256, 2021. Youssef Mroueh, Chun-Liang Li, Tom Sercu, Anant Raj, and Yu Cheng. Sobolev GAN. In International Conference on Learning Representations, 2018. Sebastian Neumayer and Gabriele Steidl. From optimal transport to discrepancy. Handbook of Mathematical Models and Algorithms in Computer Vision and Imaging: Mathematical Imaging and Vision, pp. 1 36, 2021. Khai Nguyen and Nhat Ho. Revisiting sliced Wasserstein on images: From vectorization to convolution. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. Khai Nguyen, Nhat Ho, Tung Pham, and Hung Bui. Distributional sliced-Wasserstein and applications to generative modeling. In International Conference on Learning Representations, 2021. Khai Nguyen, Tongzheng Ren, Huy Nguyen, Litu Rout, Tan Minh Nguyen, and Nhat Ho. Hierarchical sliced Wasserstein distance. In The Eleventh International Conference on Learning Representations, 2023. Erich Novak and Henryk Wozniakowski. Tractability of Multivariate Problems. Volume II, volume 12 of EMS Tracts in Mathematics. EMS Publishing House, Zürich, 2010. Gabriel Peyré. Entropic approximation of Wasserstein gradient flows. SIAM Journal on Imaging Sciences, 8(4):2323 2351, 2015. Published as a conference paper at ICLR 2024 Gabriel Peyré and Marco Cuturi. Computational optimal transport. ar Xiv preprint ar Xiv:1803.00567, 2020. Gerlind Plonka, Daniel Potts, Gabriele Steidl, and Manfred Tasche. Numerical Fourier Analysis. Springer, 2018. Michael Quellmalz, Robert Beinert, and Gabriele Steidl. Sliced optimal transport on the sphere. ar Xiv preprint ar Xiv:2304.09092, 2023. Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its application to texture mixing. In Scale Space and Variational Methods in Computer Vision, pp. 435 446. Springer, 2012. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention, pp. 234 241. Springer, 2015. Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton, and Kenji Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41(5): 2263 2291, 2013. Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. 
Curran Associates, Inc., 2019. Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 12438 12448. Curran Associates, Inc., 2020. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. Bharath K Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert RG Lanckriet. Hilbert space embeddings and metrics on probability measures. The Journal of Machine Learning Research, 11:1517 1561, 2010. Bharath K Sriperumbudur, Kenji Fukumizu, and Gert RG Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12(7), 2011. Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008. Gabor Székely. E-statistics: The energy of statistical samples. Techical Report, Bowling Green University, 2002. Gábor J. Székely and Maria L. Rizzo. Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8):1249 1272, 2013. Tanja Teuber, Gabriele Steidl, Pascal Gwosdek, Christian Schmaltz, and Joachim Weickert. Dithering by differences of convex functions. SIAM Journal on Imaging Sciences, 4(1):79 108, 2011. Titouan Vayer and Rémi Gribonval. Controlling wasserstein distances by kernel norms with application to compressive statistical learning. Journal of Machine Learning Research, 24:149 1, 2023. Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018. Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, pp. 681 688, 2011. Hao Wu, Jonas Köhler, and Frank Noe. Stochastic normalizing flows. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 5933 5944. Curran Associates, Inc., 2020. Published as a conference paper at ICLR 2024 Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. ar Xiv preprint ar Xiv:1708.07747, 2017. Zhisheng Xiao, Qing Yan, and Yali Amit. Generative latent flow. ar Xiv preprint ar Xiv:1905.10485, 2019. Published as a conference paper at ICLR 2024 A PROOF OF THEOREM 1 Let USd 1 be the uniform distribution on Sd 1 and let k(x, y) = |x y|r for x, y R, x = y and r (0, 2). Moreover, denote by e = (1, ..., 0) Sd 1 the first unit vector. Then, we have for x, y Rd that K(x, y) := Z Sd 1 | ξ, x ξ, y |rd USd 1(ξ) = x y r Z r d USd 1(ξ) Sd 1 | ξ, e |rd USd 1(ξ) = x y r Z Sd 1 |ξ1|rd USd 1(ξ) | {z } =:c 1 d,r It remains to compute the constant cd,r which is straightforward for d = 1. For d > 1 the map (t, η) 7 (t, η p defined on [ 1, 1) Sd 2 is a parametrization of Sd 1. The surface measure on Sd 1 is then given by dσSd 1(ξ) = (1 t2) d 3 2 dσSd 2(η)dt, see (Atkinson & Han, 2012, Eq. 1.16). Furthermore, the uniform surface measure USd 1 reads as d USd 1(ξ) = 1 sd 1 (1 t2) d 3 2 dσSd 2(η)dt, where sd 1 is the volume of Sd 1. 
Hence c 1 d,r = Z Sd 1 |ξ1|rd USd 1(ξ) = 1 sd 1 1 |t|r(1 t2) d 3 2 dtdσSd 2(η) (10) 0 tr(1 t2) d 3 2 dt = sd 2 sd 1 B r + 1 where B(z1, z2) is the beta function and we used the integral identity B(z1, z2) = 2 Z 1 0 t2z1 1(1 t2)z2 1dt. Finally, noting that sd 1 = 2πd/2 2 ) and B(z1, z2) = Γ(z1)Γ(z2) Γ(z1+z2) , (10) can be computed as c 1 d,r = Γ( d 2 ) πΓ( d+r Taking the inverse gives the claim. Remark 7 (Extension to P r 2 (Rd)). We can extend Theorem 1 to P r 2 (Rd). To this end, we first show, how we can deduce from (Modeste & Dombry, 2023, Prop 2.14) that the MMD D K with respect to the extended Riesz kernel K(x, y) = x y r + x r + y r defines a metric on P r 2 (Rd). Second, we will see that Theorem 1 can be extended to P r 2 (Rd) as well. (i) By (Steinwart & Christmann, 2008, Thm 4.26) the MMD D2 K(µ, ν) = Z Z K(x, y)d(µ ν)(x)d(µ ν)(y) is finite on M K = {µ M(Rd) : R q K(x, x)d|µ| < }, where M(Rd) is the space of all signed measures on Rd. Since K(x, x) = 2 x r, we have that M K = {µ M(Rd) : R x r 2 d|µ| < } such that M K P(Rd) = P r 2 (Rd). Now, inserting K (Modeste & Dombry, 2023, Prop 2.14) states that D K is a metric on M K P(Rd) = P r 2 (Rd), where the assumptions of the proposition are checked in (Modeste & Dombry, 2023, Ex 2, Ex 3) and K is named k H with H = r Published as a conference paper at ICLR 2024 (ii) Let K = c 1 d,r K be a rescaled and k(x, y) = |x y|r + |x|r + |y|r be the one-dimensional version of the extended Riesz kernel K. With the same calculations as in the proof of Theorem 1 we can show that K(x, y) = Z Sd 1 k( ξ, x , ξ, y )d USd 1(ξ). Thus, we also have that SD k = D K on P r B PROOF OF THEOREM 2 In the first part of the proof, we will use properties of reproducing kernel Hilbert spaces (RKHS), see (Steinwart & Christmann, 2008) for an overview on RKHS. For the second part, we will need two lemmata. The first one gives a definition of the MMD in terms of the Fourier transform (characteristic function) of the involved measures, where ˆµ(ξ) = e ixξ, µ = R Rd e i ξ,x dµ(x). Its proof can be found, e.g., in (Székely & Rizzo, 2013, Proposition 2). Note that the constant on the right hand-side differs from (Székely & Rizzo, 2013, Proposition 2), since we use a different notion of the Fourier transform and the constant 1 2 in the MMD. Lemma 8. Let µ P1(R) and K(x, y) = x y . Then its holds D2 K(µ, ν) = Γ( d+1 Rd |ˆµ(ξ) ˆν(ξ)|2 For the next lemma, recall that the cumulative density functions (cdf) of µ P(R) is the function Fµ : R [0, 1] defined by Fµ(x) := µ(( , x]) = Z R χ( ,x](y)dµ(y), χA(x) = 1, if x A, 0, otherwise. We will need that the Cramer distance between probability measures µ, ν with cdfs Fµ and Fν defined by ℓp(µ, ν) := Z R |Fµ Fν|pdx 1 if it exists. The Cramer distance does not exist for arbitrary probability measures. However, for µ, ν P1(R) is well-known that ℓ1(µ, ν) = W1(µ, ν) (11) and we have indeed Fµ Fν L2(R). The following relation can be found in the literature, see, e.g., Székely (2002). However, we prefer to add a proof which clearly shows which assumptions are necessary. Lemma 9. Let µ, ν P1(R) and k(x.y) = |x y|. Then it holds ℓ2(µ, ν) = Dk(µ, ν). Proof. By Lemma 8, we know that D2 k(µ, ν) = 1 R |x y| d(µ ν)(x)d(µ ν)(y) = 1 |ˆµ(ξ) ˆν(ξ)|2 ξ2 dξ. (12) For µ, ν P1(R), we have Fµ Fν L2(R) and can apply Parseval s equality ℓ2 2(µ, ν) = Z R |Fµ(t) Fν(t)|2 dt = 1 R | ˆFµ(ξ) ˆFν(ξ)|2 dξ. 
(13) Now we have for the distributional derivative of Fµ that DFµ = µ, which can be seen as follows: using Fubini s theorem, we have for any Schwartz function ϕ S(R) that DFµ, ϕ = Fµ, ϕ = Z R χ( ,x](y) ϕ (x) dµ(y) dx = Z R χ( ,x](y) ϕ (x) dx dµ(y) y ϕ (x) dx dµ(y) = Z R ϕ(y) dµ(y) = µ, ϕ . Published as a conference paper at ICLR 2024 Then we obtain by the differentiation property of the Fourier transform (Plonka et al., 2018) that ˆµ(ξ) = [ DFµ(ξ) = iξ ˆFµ(ξ). Finally, (13) becomes ℓ2 2(µ, ν) = 1 |ˆµ(ξ) ˆν(ξ)|2 which yields the assertion by (12). Now we can prove Theorem 2. Proof. 1. To prove of the first inequality, we use the reproducing property f, K(x, ) H K = f(x) of the kernel in the associated RKHS H K. For any f H K with f H K 1 and any π Π(µ, ν), we use the estimation Z Rd f(x)d(µ ν)(x) = Z Rd f(x) f(y)dπ(x, y) Z Rd |f(x) f(y)|dπ(x, y) Rd | f, K(x, ) K(y, ) H K|dπ(x, y) Rd K(x, ) K(y, ) H Kdπ(x, y), which is called coupling bound in (Sriperumbudur et al., 2010, Prop. 20). Then, since K(x, ) K(y, ) 2 H K = K(x, x) + K(y, y) 2 K(x, y) = 2 x y and using Jensen s inequality for the concave function , we obtain Z Rd f(x)d(µ ν)(x) Rd x y 1 2 dπ(x, y) 2 Z Rd x y dπ(x, y) 1 By the dual definition of the discrepancy 2DK(µ, ν) = sup f H K 1 R Rd fd(µ ν), see Novak & Wozniakowski (2010), and taking the supremum over all such f and the infimum over all π Π(µ, ν), we finally arrive at 2D2 K(µ, ν) W1(µ, ν). 2. The second inequality can be seen as follows: by Bonnotte (2013, Lemma 5.1.4), there exists a constant cd > 0 such that W1(µ, ν) cd R d d+1 SW1(µ, ν) 1 d+1 = cd R d d+1 Eξ USd 1 W1 Pξ#µ, Pξ#ν 1 d+1 . Further, we obtain by (11), the Cauchy-Schwarz inequality and Lemma 9 that Eξ USd 1 W1 Pξ#µ, Pξ#ν = Eξ USd 1 l1 Pξ#µ, Pξ#ν Eξ USd 1 h (2R) 1 2 l2 Pξ#µ, Pξ#ν i = (2R) 1 2 Eξ USd 1 Dk Pξ#µ, Pξ#ν (2R) 1 2 Eξ USd 1 D2 k Pξ#µ, Pξ#ν 1 2 , and finally by Theorem 1 that Eξ USd 1 W1 Pξ#µ, Pξ#ν (2R) 1 2 DK(µ, ν) = (2R) 1 2 2 DK(µ, ν). In summary, this results in W1(µ, ν) cd ! 1 2(d+1) R 2d+1 2d+2 DK(µ, ν) 1 d+1 . Under some additional assumptions, similar bounds have been considered in Chafaï et al. (2018) for the Coloumb kernel K(x, y) = x y 2 d. Published as a conference paper at ICLR 2024 C PROOF OF THEOREM 3 Interaction Energy: This part of the proof is similar to (Teuber et al., 2011, Sec. 3). Using that σ is a permutation and by reordering the terms in the double sum, we can rewrite the interaction energy by E(x) = 1 2N 2 j=1 |xi xj| = 1 2N 2 j=1 |xσ(i) xσ(j)| j=i+1 xσ(j) xσ(i) = N 2 xσ(i) = N (2σ 1(i) 1) Since the xiare pairwise disjoint, the sorting permutation σ is constant in a neighborhood of x. Hence, E is differentiable with derivative xi E(x) = N + 1 2σ 1(i) Potential Energy: For any x = y R it holds x|x y| = χ(x, y), where χ(x, y) = 1, if x > y, 1, if x < y. Thus, we have that xi V (x|y) = 1 MN j=1 χ(xi, yj) = 1 MN #{j {1, ..., M} : yj < xi} #{j {1, ..., M} : yj > xi} Using that #{j {1, ..., M} : yj > xi} = M #{j {1, ..., M} : yj < xi} the above expression is equal to 1 MN 2 #{j {1, ..., M} : yj < xi} M = 2 #{j {1, ..., M} : yj < xi} M D PROOF OF THEOREM 4 In this section, we derive error bounds for the stochastic estimators for the gradient of MMD as defined in (6) for the Riesz kernel with r = 1. To this end, we employ concentration inequalities (Vershynin, 2018), which were generalized to vector-valued random variables in Kohler & Lucchi (2017). We will need the following Lemma which is the Bernstein inequality and is stated in (Kohler & Lucchi, 2017, Lemma 18). 
D PROOF OF THEOREM 4

In this section, we derive error bounds for the stochastic estimator of the gradient of the MMD as defined in (6) for the Riesz kernel with r = 1. To this end, we employ concentration inequalities (Vershynin, 2018), which were generalized to vector-valued random variables in Kohler & Lucchi (2017). We will need the following lemma, which is the Bernstein inequality as stated in (Kohler & Lucchi, 2017, Lemma 18).

Lemma 10 (Bernstein Inequality). Let $X_1, \dots, X_P$ be independent random vectors with mean zero, $\|X_i\|_2 \le \mu$ and $\mathbb{E}[\|X_i\|_2^2] \le \sigma^2$. Then it holds for $0 < t < \sigma^2/\mu$ that
$$\mathbb{P}\Big[\Big\|\frac{1}{P}\sum_{i=1}^P X_i\Big\|_2 > t\Big] \le \exp\Big(-\frac{P t^2}{8 \sigma^2} + \frac14\Big).$$

Now we can use Lemma 10 to show convergence of the finite sum approximation to the exact gradient.

Theorem 11 (Concentration Inequality). Let $x_1, \dots, x_N, y_1, \dots, y_M \in \mathbb{R}^d$. Then, it holds
$$\mathbb{P}\big[\|\tilde\nabla_P F_d(x|y) - \nabla F_d(x|y)\|_2 > t\big] \le \exp\Big(-\frac{P t^2}{32\,(c_d + 1)^2} + \frac14\Big).$$

Proof. Let $\xi_1, \dots, \xi_P \sim U_{\mathbb{S}^{d-1}}$ be the independent random variables from the definition of $\tilde\nabla_P$ in (6). We set
$$X_{i,l} := c_d\, \partial_l F_1\big(\langle \xi_i, x_1\rangle, \dots, \langle \xi_i, x_N\rangle \,\big|\, \langle \xi_i, y_1\rangle, \dots, \langle \xi_i, y_M\rangle\big)\, \xi_i$$
and define the $dN$-dimensional random vector $X_i = (X_{i,1}, \dots, X_{i,N})$. Then, we have by (4) that
$$\mathbb{E}[X_i] = \nabla F_d(x|y) = \big(\nabla_{x_1} F_d(x|y), \dots, \nabla_{x_N} F_d(x|y)\big).$$
By Theorem 3, we know that $\|X_{i,l}\|_2 \le \frac{2 c_d}{N}$, and by (3) it holds $\|\mathbb{E}[X_{i,l}]\|_2 = \|\nabla_{x_l} F_d(x|y)\|_2 \le \frac{2}{N}$. Let $\bar X_i = X_i - \mathbb{E}[X_i]$. Then it holds
$$\|\bar X_i\|_2 \le \|X_i\|_2 + \|\mathbb{E}[X_i]\|_2 \le \frac{2 c_d + 2}{\sqrt N}$$
and thus $\mathbb{E}[\|\bar X_i\|_2^2] \le 4 (c_d + 1)^2$. Since we have $\mathbb{E}[\bar X_i] = 0$ for all $i = 1, \dots, P$, we can apply Lemma 10 and obtain
$$\mathbb{P}\Big[\Big\|\frac{1}{P}\sum_{i=1}^P X_i - \nabla F_d(x|y)\Big\|_2 > t\Big] = \mathbb{P}\Big[\Big\|\frac{1}{P}\sum_{i=1}^P \bar X_i\Big\|_2 > t\Big] \le \exp\Big(-\frac{P t^2}{32\,(c_d+1)^2} + \frac14\Big).$$
Since we have by definition that $\frac{1}{P}\sum_{i=1}^P X_i = \tilde\nabla_P F_d(x|y)$, this yields the assertion.

Finally, we can draw a corollary which immediately shows Theorem 4.

Corollary 12 (Error Bound for Stochastic Gradients). For $x_1, \dots, x_N, y_1, \dots, y_M \in \mathbb{R}^d$, it holds
$$\mathbb{E}\big[\|\tilde\nabla_P F_d(x|y) - \nabla F_d(x|y)\|_2\big] \le \exp(1/4)\, \frac{\sqrt{32 \pi}\,(c_d + 1)}{2 \sqrt{P}}.$$

Proof. Denote by $X$ the random variable $X = \|\tilde\nabla_P F_d(x|y) - \nabla F_d(x|y)\|_2$. Then, we have by Theorem 11 that
$$\mathbb{P}[X > t] \le \exp\Big(-\frac{P t^2}{32\,(c_d+1)^2} + \frac14\Big).$$
Thus, we obtain
$$\mathbb{E}[X] = \int_0^\infty \mathbb{P}[X > t]\, \mathrm{d}t \le \exp(1/4) \int_0^\infty \exp\Big(-\frac{P t^2}{32\,(c_d+1)^2}\Big)\, \mathrm{d}t = \exp(1/4)\, \frac{\sqrt{32\pi}\,(c_d+1)}{2\sqrt P},$$
where the last step follows from the identity $\int_0^\infty \exp(-t^2)\, \mathrm{d}t = \frac{\sqrt\pi}{2}$. Now we proceed to bound the constant $c_d$ in terms of the dimension. By Theorem 1 we have that $c_d = \frac{\sqrt\pi\,\Gamma(\frac{d+1}{2})}{\Gamma(\frac{d}{2})}$. Now the claim follows from the bound
$$\frac{\Gamma\big(\frac{d+1}{2}\big)}{\Gamma\big(\frac{d}{2}\big)} \le \Big(\frac{d}{2} + \sqrt{\tfrac34} - 1\Big)^{\frac12}$$
proven by Kershaw (1983).
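For illustration, a Monte Carlo version of the stochastic gradient analyzed above can be written in a few lines. The exact definition of $\tilde\nabla_P$ is (6) in the main text, so the sketch below simply assumes the form used in the proof of Theorem 11, namely the average of $c_d\,\nabla F_1(\langle \xi_i, x\rangle \,|\, \langle \xi_i, y\rangle)\,\xi_i$ over $P$ directions, and reuses the 1D routine `grad_mmd_1d` from the sketch in Appendix C (all names are ours).

```python
import numpy as np
from scipy.special import gammaln

def sliced_mmd_grad(x, y, P, rng=None):
    """Monte Carlo estimate of grad F_d(x|y) from P random directions on S^{d-1}.
    x has shape (N, d), y has shape (M, d)."""
    rng = np.random.default_rng() if rng is None else rng
    N, d = x.shape
    # c_d = sqrt(pi) * Gamma((d+1)/2) / Gamma(d/2), computed stably via log-gammas
    c_d = np.sqrt(np.pi) * np.exp(gammaln((d + 1) / 2) - gammaln(d / 2))
    grad = np.zeros_like(x, dtype=float)
    for _ in range(P):
        xi = rng.standard_normal(d)
        xi /= np.linalg.norm(xi)                 # xi ~ uniform on the unit sphere
        g1 = grad_mmd_1d(x @ xi, y @ xi)         # 1D gradient of the projected particles
        grad += np.outer(g1, xi)
    return c_d * grad / P
```

By Corollary 12 and the growth $c_d = O(\sqrt d)$, the expected error of such an estimator behaves like $O(\sqrt{d/P})$, which is the scaling visible in Figure 1.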
E COMPARISON OF DIFFERENT KERNELS IN MMD

Next we compare the MMD flows of the Riesz kernel with those of the positive definite kernels
$$K_G(x,y) := \exp\Big(-\frac{\|x-y\|^2}{2\sigma^2}\Big) \ \text{(Gaussian)}, \qquad K_{IM}(x,y) := \frac{1}{\sqrt{\|x-y\|^2 + c}} \ \text{(Inverse Multiquadric)},$$
$$K_L(x,y) := \exp\Big(-\frac{\|x-y\|}{\sigma}\Big) \ \text{(Laplacian)}.$$
The target distribution is defined as the uniform distribution on three non-overlapping circles and the initial particles are drawn from a Gaussian distribution with standard deviation 0.01, compare Glaser et al. (2021). We recognize in Figures 4 and 5 that, in contrast to the Riesz kernel, the other MMD flows heavily depend on the parameters $\sigma$ and $c$ (stability against the parameter choice) and cannot perfectly recover the uniform distribution; zooming into the middle right circles shows that small blue parts are not covered (approximation of the target distribution). Moreover, in contrast to the other kernels, the Riesz kernel with r = 1 is positively homogeneous such that the MMD flow is equivariant against scalings of the initial and target measure. Finally, it is interesting that the Riesz kernel is related to the Brownian motion by the following remark, see also Modeste & Dombry (2023) for the more general fractional Brownian motion.

Figure 4: Comparison of the MMD flow with the Gaussian kernel (top) and the inverse multiquadric kernel (bottom) for different hyperparameters.

Figure 5: Comparison of the MMD flow with the Laplacian kernel (top) and the Riesz kernel (bottom) for different hyperparameters.

Remark 13. In the one-dimensional case, the extended Riesz kernel with r = 1 reads for $x, y \ge 0$ as $\tilde K(x,y) = -|x-y| + |x| + |y| = 2\min(x,y)$, which is twice the covariance kernel of the Brownian motion. More precisely, let $(W_t)_{t>0}$ be a Brownian motion and $s, t > 0$. Then, it holds
$$\mathrm{Cov}(W_s, W_t) = \min(s,t) = \tfrac12 \tilde K(s,t).$$

F TRAINING ALGORITHM OF THE GENERATIVE SLICED MMD FLOW

In Algorithm 3 we state the detailed training algorithm of our proposed method.

G ABLATION STUDY

We consider the FID for different numbers of networks and different numbers of projections. We run the same experiment as in Section 5 on MNIST, where we choose different numbers of projections P between 10 and 1000. In Figure 6 we illustrate the progress of the FID value for an increasing number of networks and different numbers of projections. Obviously, the gradient of the MMD functional is not well approximated by using only P = 10 or P = 100 projections and thus the MMD flow does not converge. Once the gradient of the functional is well approximated, a higher number of projections leads only to a small improvement, see the difference between P = 500 and P = 1000.

Figure 6: Illustration of the FID value of the sliced MMD flow on MNIST for different numbers of projections.

H IMPLEMENTATION DETAILS

The code is available online at https://github.com/johertrich/sliced_MMD_flows. We use UNets $(\Phi_l)_{l=1}^L$ (modified from https://github.com/hojonathanho/diffusion/blob/master/diffusion_tf/models/unet.py) with 3409633 trainable parameters for MNIST and Fashion MNIST and 2064035 trainable parameters for CIFAR10. The networks are trained using Adam (Kingma & Ba, 2015) with a learning rate of 0.001. All flows are simulated with a step size $\tau = 1$. We stop the training of our generative sliced MMD flow when the FID between the generated samples and some validation samples does not decrease twice. Then we take the network with the best FID value with respect to the validation set. The validation samples are the last 10000 training samples from the corresponding dataset which were not used for training the generative sliced MMD flow. The training of the generative MMD flow takes between 1.5 and 3 days on a NVIDIA GeForce RTX 2060 Super GPU, depending on the current GPU load by other processes. To avoid overfitting, we choose a relatively small number of optimization steps within the training of the networks $\Phi_l$, which corresponds to an early-stopping technique.

MNIST. We draw the first M = 20000 target samples from the MNIST training set and N = 20000 initial samples uniformly from $[0,1]^d$. Then we simulate the momentum MMD flow using P = 1000 projections for 32 steps and train the network for 2000 optimizer steps with a batch size of 100. After each training of the network, we increase the number of flow steps by $\min(2^{5+l}, 2048)$ up to a maximal number of 30000 steps, where $l$ is the iteration of the training procedure, see Algorithm 3. We choose the momentum parameter m = 0.7 and stop the whole training after L = 55 networks.

Fashion MNIST. Here we draw the first M = 20000 target samples from the Fashion MNIST training set and N = 20000 initial samples uniformly from $[0,1]^d$. Then we simulate the momentum MMD flow using P = 1000 projections for 32 steps and train the network for 2000 optimizer steps with a batch size of 100. After each training of the network, we increase the number of flow steps by $\min(2^{5+l}, 2048)$ up to a maximal number of 50000 steps, where $l$ is the iteration of the training procedure. The momentum parameter m is set to 0.8. We stop the whole training after L = 67 networks.

Algorithm 3 Training of generative MMD flows
Input: Independent initial samples $x_1^{(0)}, \dots, x_N^{(0)}$ from $\mu_0$, momentum parameters $m_l \in [0,1)$ for $l = 1, \dots, L$.
Initialize $(v_1, \dots, v_N) = 0$.
for $l = 1, \dots, L$ do
  - Set $(\tilde x_1^{(0)}, \dots, \tilde x_N^{(0)}) = (x_1^{(l-1)}, \dots, x_N^{(l-1)})$.
  - Simulate $T_l$ steps of the (momentum) MMD flow:
  for $t = 1, \dots, T_l$ do
    - Update $v$ by $(v_1, \dots, v_N) \leftarrow \tilde\nabla_P F_d(\tilde x_1^{(t-1)}, \dots, \tilde x_N^{(t-1)} \,|\, y_1, \dots, y_M) + m_l (v_1, \dots, v_N)$.
    - Update the flow samples: $(\tilde x_1^{(t)}, \dots, \tilde x_N^{(t)}) = (\tilde x_1^{(t-1)}, \dots, \tilde x_N^{(t-1)}) - \tau N (v_1, \dots, v_N)$.
  end for
  - Train $\Phi_l$ such that $\tilde x_i^{(T_l)} \approx \tilde x_i^{(0)} - \Phi_l(\tilde x_i^{(0)})$ by minimizing the loss $\sum_{i=1}^N \|\Phi_l(\tilde x_i^{(0)}) - (\tilde x_i^{(0)} - \tilde x_i^{(T_l)})\|^2$.
  - Set $(x_1^{(l)}, \dots, x_N^{(l)}) = (x_1^{(l-1)}, \dots, x_N^{(l-1)}) - (\Phi_l(x_1^{(l-1)}), \dots, \Phi_l(x_N^{(l-1)}))$.
end for
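To make the structure of Algorithm 3 explicit, the following PyTorch-style sketch shows one possible implementation of the outer loop. It is a simplified illustration under our own naming (`make_network`, `grad_fn`), assuming a fixed number of flow steps $T$ and a fixed momentum $m$ instead of the schedules described in this section, and it omits the FID-based stopping criterion; it is not the released code.

```python
import torch

def train_generative_mmd_flow(x0, y, make_network, grad_fn, L, T,
                              tau=1.0, m=0.7, opt_steps=2000, batch_size=100, lr=1e-3):
    """Sketch of Algorithm 3. `grad_fn(x, y)` returns a sliced MMD gradient
    estimate (e.g. a torch version of the estimator sketched in Appendix D) and
    `make_network()` returns a fresh UNet predicting the flow displacement."""
    x = x0.clone()
    N = x.shape[0]
    networks = []
    for _ in range(L):
        # simulate T steps of the momentum MMD particle flow from the current samples
        with torch.no_grad():
            x_start = x.clone()
            x_flow = x.clone()
            v = torch.zeros_like(x_flow)
            for _ in range(T):
                v = grad_fn(x_flow, y) + m * v
                x_flow = x_flow - tau * N * v
            target = x_start - x_flow            # displacement that Phi_l has to predict
        # train Phi_l on the displacement with Adam (MSE, proportional to the loss above)
        phi = make_network()
        opt = torch.optim.Adam(phi.parameters(), lr=lr)
        for _ in range(opt_steps):
            idx = torch.randint(0, N, (batch_size,))
            loss = ((phi(x_start[idx]) - target[idx]) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        networks.append(phi)
        # advance the training samples with the trained network
        with torch.no_grad():
            x = x - phi(x)
    return networks
```

New samples are then obtained by drawing noise from $\mu_0$ and applying $x \leftarrow x - \Phi_l(x)$ for $l = 1, \dots, L$, in the same way the training samples are advanced in the last step of the loop.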
CIFAR10. We draw the first M = 30000 target samples from the CIFAR10 training set. Here we consider a pyramidal schedule, where the key idea is to run the particle flow on different resolutions sequentially, from low to high. First, we downsample the target samples by a factor 8 and draw N = 30000 initial samples uniformly from $[0,1]^{d/64}$. Then we simulate the momentum MMD flow in dimension d = 48 using P = 500 projections for 32 steps and train the network for 5000 optimizer steps with a batch size of 100. After each training of the network, we increase the number of flow steps by $\min(2^{5+l}, 1024)$ up to a maximal number of 30000 steps, where $l$ is the iteration of the training procedure. The momentum parameter m is increased after each network training by 0.01 up to 0.8, beginning with m = 0 in the first flow step. We increase the resolution of the flow after 600000 flow steps by a factor 2 and add Gaussian noise to the particles in order to increase the intrinsic dimension of the images, such that the second resolution is of dimension d = 192. Following the same procedure for the second resolution and the third resolution of d = 768, we change the projections in the final resolution of d = 3072. Instead of using projections uniformly sampled from $\mathbb{S}^{d-1}$, we consider locally-connected projections as in (Du et al., 2023; Nguyen & Ho, 2022). The idea is to extract patches of the images $x_k$ at a random location in each step $k$ and to consider the particle flow in the patch dimension instead. In order to apply these locally-connected projections at different resolutions, we also upsample the projections to different scales. Here we choose a patch size of 7 × 7 and consider the resolutions 7, 14, 21, 28. We stop the whole training after L = 86 networks. Note that herewith we introduce an inductive bias, since the projections are no longer sampled uniformly, but empirically this leads to a significant acceleration of the flow. A more comprehensive discussion can be found in (Du et al., 2023; Nguyen & Ho, 2022).

CelebA. We draw the first M = 20000 target samples from the CelebA training set. Again, we consider a pyramidal schedule as for CIFAR10, but we increase the number of flow steps by $\min(2^{5+l}, 8192)$ up to a maximal number of 100000 steps. We increase the resolutions of the flow after 700000, 900000 and 700000 flow steps by a factor 2 and add Gaussian noise to the particles in order to increase the intrinsic dimension of the images. We also use locally-connected projections with a patch size of 7 × 7 and consider the resolutions 7, 14, 21, 28 for resolution 32 × 32 and 7, 14, 21, 28, 35, 42, 49 for resolution 64 × 64. We stop the whole training after L = 71 networks.
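The locally-connected projections are only described verbally above, so the following NumPy sketch shows one plausible way to construct them; it is our own illustration following this description and Du et al. (2023); Nguyen & Ho (2022), not necessarily the construction used in the released code. A projection is a unit vector supported on a single patch: a random direction in the 7 × 7 patch space is upsampled to one of the admissible scales and embedded at a random spatial location.

```python
import numpy as np

def locally_connected_projection(img_shape, patch_size=7, scales=(7, 14, 21, 28),
                                 rng=None):
    """Sample one locally-connected projection for images of shape (C, H, W)
    and return it as a unit vector of length C*H*W."""
    rng = np.random.default_rng() if rng is None else rng
    C, H, W = img_shape
    # random direction in the patch dimension
    patch = rng.standard_normal((C, patch_size, patch_size))
    # upsample the patch to a randomly chosen admissible scale (nearest-neighbour)
    scale = rng.choice([s for s in scales if s <= min(H, W)])
    factor = scale // patch_size
    patch = np.kron(patch, np.ones((1, factor, factor)))
    # embed the (C, scale, scale) patch at a random spatial location
    xi = np.zeros((C, H, W))
    top = rng.integers(0, H - scale + 1)
    left = rng.integers(0, W - scale + 1)
    xi[:, top:top + scale, left:left + scale] = patch
    xi = xi.reshape(-1)
    return xi / np.linalg.norm(xi)
```

In the particle flow, such vectors simply replace the uniform directions $\xi \sim U_{\mathbb{S}^{d-1}}$ in the sliced gradient estimator.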
Table 2: FID scores of generated samples for the training set and the test set.

               MNIST   Fashion MNIST   CIFAR10   CelebA
training set    2.7        10.6          53.0      32.7
test set        3.1        11.3          54.8      32.1

I ADDITIONAL EXAMPLES

In Figure 7 we show more generated samples of our method for MNIST, Fashion MNIST, CIFAR10 and CelebA. In Figure 8, we compare generated samples with the closest samples from the training set. We observe that they are significantly different. Hence, our method generates genuinely new samples and does not just reproduce the samples from the training set. In contrast, in Figure 9 we compare the particle flow samples with the closest samples from the training set. Obviously, the samples of the particle flow reproduce the training samples almost exactly. This highlights the important role of the networks: they allow us to interpolate between the training points in order to generalize beyond the dataset.

Figure 7: Additional generated samples (from top to bottom) of MNIST, Fashion MNIST, CIFAR10 and CelebA.

Figure 8: Generated samples (top), L2-closest samples from the training set (middle) and the pixelwise distance between them (bottom).

Figure 9: Samples of the particle flow (top), closest samples from the training set (middle) and the pixelwise distance between them (bottom). The mean PSNR between these flow samples and the corresponding closest training images is 82.29.