# efficient_gradient_flows_in_slicedwasserstein_space__fdbc934b.pdf

Published in Transactions on Machine Learning Research (11/2022)

Efficient Gradient Flows in Sliced-Wasserstein Space

Clément Bonet clement.bonet@univ-ubs.fr Université Bretagne Sud, CNRS, LMBA, Vannes, France

Nicolas Courty nicolas.courty@irisa.fr Université Bretagne Sud, CNRS, IRISA, Vannes, France

François Septier francois.septier@univ-ubs.fr Université Bretagne Sud, CNRS, LMBA, Vannes, France

Lucas Drumetz lucas.drumetz@imt-atlantique.fr IMT Atlantique, CNRS, Lab-STICC, Brest, France

Reviewed on Open Review: https: // openreview. net/ forum? id= Au1LNKm Rvh

Minimizing functionals in the space of probability distributions can be done with Wasserstein gradient flows. To solve them numerically, a possible approach is to rely on the Jordan Kinderlehrer Otto (JKO) scheme which is analogous to the proximal scheme in Euclidean spaces. However, it requires solving a nested optimization problem at each iteration, and is known for its computational challenges, especially in high dimension. To alleviate it, very recent works propose to approximate the JKO scheme leveraging Brenier s theorem, and using gradients of Input Convex Neural Networks to parameterize the density (JKO-ICNN). However, this method comes with a high computational cost and stability issues. Instead, this work proposes to use gradient flows in the space of probability measures endowed with the sliced-Wasserstein (SW) distance. We argue that this method is more flexible than JKO-ICNN, since SW enjoys a closed-form differentiable approximation. Thus, the density at each step can be parameterized by any generative model which alleviates the computational burden and makes it tractable in higher dimensions.

1 Introduction

Minimizing functionals with respect to probability measures is a ubiquitous problem in machine learning. Important examples are generative models such as GANs (Goodfellow et al., 2014; Arjovsky et al., 2017; Lin et al., 2021), VAEs (Kingma & Welling, 2013) or normalizing flows (Papamakarios et al., 2019).

To that aim, one can rely on Wasserstein gradient flows (WGF) (Ambrosio et al., 2008) which are curves decreasing the functional as fast as possible (Santambrogio, 2017). For particular functionals, these curves are known to be characterized by the solution of some partial differential equation (PDE) (Jordan et al., 1998). Hence, to solve Wasserstein gradient flows numerically, we can solve the related PDE when it is available. However, solving a PDE can be a difficult and computationally costly task, especially in high dimension (Han et al., 2018). Fortunately, several alternatives exist in the literature. For example, one can approximate instead a counterpart stochastic differential equation (SDE) related to the PDE followed by the gradient flow. For the Kullback-Leibler divergence, it comes back to the so called unadjusted Langevin algorithm (ULA) (Roberts & Tweedie, 1996; Wibisono, 2018), but it has also been proposed for other functionals such as the Sliced-Wasserstein distance with an entropic regularization (Liutkus et al., 2019).

Another way to solve Wasserstein gradient flows numerically is to approximate the curve in discrete time. By using the well-known forward Euler scheme, particle schemes have been derived for diverse functionals such as the Kullback-Leibler divergence (Feng et al., 2021; Wang et al., 2021; 2022), the maximum mean discrepancy

Published in Transactions on Machine Learning Research (11/2022)

(Arbel et al., 2019), the kernel Stein discrepancy (Korba et al., 2021) or KALE (Glaser et al., 2021). Salim et al. (2020) propose instead a forward-backward discretization scheme analogously to the proximal gradient algorithm (Bauschke et al., 2011). Yet, these methods only provide samples approximately following the gradient flow, but without any information about the underlying density.

Another time discretization possible is the so-called JKO scheme introduced in (Jordan et al., 1998), which is analogous in probability space to the well-known proximal operator (Parikh & Boyd, 2014) in Hilbertian space and which corresponds to the backward Euler scheme. However, as a nested minimization problem, it is a difficult problem to handle numerically. Some works use a discretization in space (e.g. a grid) and the entropic regularization of the Wasserstein distance (Peyré, 2015; Carlier et al., 2017), which benefits from specific resolution strategies. However, those approaches do not scale to high dimensions, as the discretization of the space scales exponentially with the dimension. Very recently, it was proposed in several concomitant works (Alvarez-Melis et al., 2021; Mokrov et al., 2021; Bunne et al., 2022) to take advantage of Brenier s theorem (Brenier, 1991) and model the optimal transport map (Monge map) as the gradient of a convex function with Input Convex Neural Networks (ICNN) (Amos et al., 2017). By solving the JKO scheme with this parametrization, these models, called JKO-ICNN, handle higher dimension problems well. Yet, a drawback of JKO-ICNN is the training time due to a number of evaluations of the gradient of each ICNN that is quadratic in the number of JKO iterations. It also requires to backpropagate through the gradient which is challenging in high dimension, even though stochastic methods were proposed in (Huang et al., 2020) to alleviate it. Moreover, it has also been observed in several works that ICNNs have a poor expressiveness (Rout et al., 2021; Korotin et al., 2019; 2021) and that we should rather directly estimate the gradient of convex functions by neural networks (Saremi, 2019; Richter-Powell et al., 2021). Other recent works proposed to use the JKO scheme by either exploiting variational formulations of functionals in order to avoid the evaluation of densities and allowing to use more general neural networks in (Fan et al., 2021), or by learning directly the density in (Hwang et al., 2021).

In parallel, it was proposed to endow the space of probability measures with other distances than Wasserstein. For example, Gallouët & Monsaingeon (2017) study a JKO scheme in the space endowed by the Kantorovitch-Fisher-Rao distance. However, this still requires a costly JKO step. Several particles schemes were derived as gradient flows into this space (Lu et al., 2019; Zhang et al., 2021). We can also cite Kalman-Wasserstein gradient flows (Garbuno-Inigo et al., 2020) or the Stein variational gradient descent (Liu & Wang, 2016; Liu, 2017; Duncan et al., 2019) which can be seen as gradient flows in the space of probabilities endowed by a generalization of the Wasserstein distance. However, the JKO schemes of these different metrics are not easily tractable in practice.

Contributions. In the following, we propose to study the JKO scheme in the space of probability distributions endowed with the sliced-Wasserstein (SW) distance (Rabin et al., 2011). This novel and simple modification of the original problem comes with several benefits, mostly linked to the fact that this distance is easily differentiable and computationally more tractable than the Wasserstein distance. We first derive some properties of this new class of flows and discuss links with Wasserstein gradient flows. Notably, we observe empirically for both gradient flows the same dynamic, up to a time dilation of parameter the dimension of the space. Then, we show that it is possible to minimize functionals and learn the stationary distributions in high dimensions, on toy datasets as well as real image datasets, using e.g. neural networks. In particular, we propose to use normalizing flows for functionals which involve the density, such as the negative entropy. Finally, we exhibit several examples for which our strategy performs better than JKO-ICNN, either w.r.t. to computation times and/or w.r.t. the quality of the final solutions.

2 Background

In this paper, we are interested in finding a numerical solution to gradient flows in probability spaces. Such problems generally arise when minimizing a functional F defined on P(Rd), the set of probability measures on Rd:

min µ P(Rd) F(µ), (1)

Published in Transactions on Machine Learning Research (11/2022)

but they can also be defined implicitly through their dynamics, expressed as partial differential equations. JKO schemes are implicit optimization methods that operate on particular discretizations of these problems and consider the natural metric of P(Rd) to be Wasserstein. Recalling our goal is to study similar schemes with an alternative, computationally friendly metric (SW), we start by defining formally the notion of gradient flows in Euclidean spaces, before switching to probability spaces. We finally give a rapid overview of existing numerical schemes.

2.1 Gradient Flows in Euclidean Spaces

Let F : Rd R be a functional. A gradient flow of F is a curve (i.e. a continuous function from R+ to Rd) which decreases F as much as possible along it. If F is differentiable, then a gradient flow x : [0, T] Rd

solves the following Cauchy problem (Santambrogio, 2017) ( dx(t)

dt = F(x(t)), x(0) = x0. (2)

Under conditions on F (e.g. F Lipschitz continuous, F convex or semi-convex), this problem admits a unique solution which can be approximated using numerical schemes for ordinary differential equations such as the explicit or the implicit Euler scheme. For the former, we recover the regular gradient descent, and for the latter, we recover the proximal point algorithm (Parikh & Boyd, 2014): let τ > 0,

xτ k+1 arg min x x xτ k 2 2 2τ + F(x) = proxτF (xτ k). (3)

This formulation does not use any gradient, and can therefore be used in any metric space by replacing 2 with the right distance.

2.2 Gradient Flows in Probability Spaces

To define gradient flows in the space of probability measures, we first need a metric. We restrict our analysis to probability measures with moments of order 2: P2(Rd) = {µ P(Rd), R x 2dµ(x) < + }. Then, a possible distance on P2(Rd) is the Wasserstein distance (Villani, 2008), let µ, ν P2(Rd),

W 2 2 (µ, ν) = min γ Π(µ,ν)

Z x y 2 2 dγ(x, y), (4)

where Π(µ, ν) is the set of probability measures on Rd Rd with marginals µ and ν.

Now, by endowing the space of measures with W2, we can define the Wasserstein gradient flow of a functional F : P2(Rd) R by plugging W2 in (3) which becomes

µτ k+1 arg min µ P2(Rd)

W 2 2 (µ, µτ k) 2τ + F(µ). (5)

The gradient flow is then the limit of the sequence of minimizers when τ 0. This scheme was introduced in the seminal work of Jordan et al. (1998) and is therefore referred to as the JKO scheme. In this work, the authors showed that gradient flows are linked to PDEs, and in particular with the Fokker-Planck equation when the functional F is of the form

F(µ) = Z V dµ + H(µ) (6)

where V is some potential function and H is the negative entropy: let σ denote the Lebesgue measure,

(R ρ(x) log(ρ(x)) dx if dµ = ρdσ + otherwise. (7)

Published in Transactions on Machine Learning Research (11/2022)

Then, the limit of (µτ)τ when τ 0 is a curve t 7 µt such that for all t > 0, µt has a density ρt. The curve ρ satisfies (weakly) the Fokker-Planck PDE

t = div(ρ V ) + ρ. (8)

For more details on gradient flows in metric space and in Wasserstein space, we refer to (Ambrosio et al., 2008). Note that many other functional can be plugged in equation 5 defining different PDEs. We introduce here the Fokker-Planck PDE as a classical example, since the functional is connected to the Kullback-Leibler (KL) divergence and its Wasserstein gradient flow is connected to many classical algorithms such as the unadjusted Langevin algorithm (ULA) (Wibisono, 2018). But we will also use other functionals in Section 4 such as the interaction functional defined for regular enough W as

ZZ W(x y) dµ(x)dµ(y), (9)

which admits as Wasserstein gradient flow the aggregation equation (Santambrogio, 2015, Chapter 8)

t = div ρ( W ρ) (10)

where denotes the convolution operation.

2.3 Numerical Methods to solve the JKO Scheme

Solving (5) is not an easy problem as it requires to solve an optimal transport problem at each step and hence is composed of two nested minimization problems.

There are several strategies which were used to tackle this problem. For example, Laborde (2016) rewrites (5) as a convex minimization problem using the Benamou-Brenier dynamic formulation of the Wasserstein distance (Benamou & Brenier, 2000). Peyré (2015) approximates the JKO scheme by using the entropic regularization and rewriting the problem with respect to the Kullback-Leibler proximal operator. The problem becomes easier to solve using Dykstra s algorithm (Dykstra, 1985). This scheme was proved to converge to the right PDE in (Carlier et al., 2017). It was proposed to use the dual formulation in other works such as (Caluya & Halder, 2019) or (Frogner & Poggio, 2020). Cancès et al. (2020) proposed to linearize the Wasserstein distance using the weighted Sobolev approximation (Villani, 2003; Peyre, 2018).

More recently, following (Benamou et al., 2016), Alvarez-Melis et al. (2021) and Mokrov et al. (2021) proposed to exploit Brenier s theorem by rewriting the JKO scheme as

uτ k+1 arg min u convex 1 2τ

Z u(x) x 2 2 dµτ k(x) + F ( u)#µτ k (11)

and model the probability measures as µτ k+1 = ( uτ k+1)#µτ k where # is the push forward operator, defined as ( u)#µτ k(A) = µτ k(( u) 1(A)) for all A B(Rd). Then, to solve it numerically, they model convex functions using ICNNs (Amos et al., 2017):

θτ k+1 arg min θ {θ,uθ ICNN}

Z xuθ(x) x 2 2 dµτ k(x) + F ( xuθ)#µτ k . (12)

In the remainder, this method is denoted as JKO-ICNN. Bunne et al. (2022) also proposed to use ICNNs into the JKO scheme, but with a different objective of learning the functional from samples trajectories along the timesteps. Lastly, Fan et al. (2021) proposed to learn directly the Monge map T by solving at each step the following problem:

T τ k+1 arg min T

Z T(x) x 2 2 dµτ k(x) + F(T#µτ k) (13)

and using variational formulations of functionals involving the density. This formulation requires only to use samples from the measure. However, it needs to be derived for each functional, and involves minimax optimization problems which are notoriously hard to train (Arjovsky & Bottou, 2017; Bond-Taylor et al., 2021).

Published in Transactions on Machine Learning Research (11/2022)

3 Sliced-Wasserstein Gradient Flows

As seen in the previous section, solving numerically (5) is a challenging problem. To tackle high-dimensional settings, one could benefit from neural networks, such as generative models, that are known to model highdimensional distributions accurately. The problem being not directly differentiable, previous works relied on Brenier s theorem and modeled convex functions through ICNNs, which results in JKO-ICNN. However, this method is very costly to train. For a JKO scheme of k steps, it requires O(k2) evaluations of gradients (Mokrov et al., 2021) which can be a huge price to pay when the dynamic is very long. Moreover, it requires to backpropagate through gradients, and to compute the determinant of the Jacobian when we need to evaluate the likelihood (assuming the ICNN is strictly convex). The method of Fan et al. (2021) also requires O(k2) evaluations of neural networks, as well as to solve a minimax optimization problem at each step.

Here, we propose instead to use the space of probability measures endowed with the sliced-Wasserstein (SW) distance by modifying adequately the JKO scheme. Surprisingly enough, this class of gradient flows, which are very easy to compute, has never been considered numerically in the literature. Close to our work, Wasserstein gradient flows using SW as a functional (called Sliced-Wasserstein flows) have been considered in (Liutkus et al., 2019). Our method differs significantly from this work, since we propose to compute sliced-Wasserstein gradient flows of different functionals.

We first introduce SW along with motivations to use this distance. We then study some properties of the scheme and discuss links with Wasserstein gradient flows. Since this metric is known in closed-form, the JKO scheme is more tractable numerically and can be approximated in several ways that we describe in Section 3.3.

3.1 Motivations

Sliced-Wasserstein Distance. The Wasserstein distance (4) is generally intractable. However, in one dimension, for µ, ν P2(R), we have the following closed-form (Peyré et al., 2019, Remark 2.30)

W 2 2 (µ, ν) = Z 1

F 1 µ (u) F 1 ν (u) 2 du (14)

where F 1 µ (resp. F 1 ν ) is the quantile function of µ (resp. ν). It motivated the construction of the sliced Wasserstein distance (Rabin et al., 2011; Bonnotte, 2013), as for µ, ν P2(Rd),

SW 2 2 (µ, ν) = Z

Sd 1 W 2 2 (P θ #µ, P θ #ν) dλ(θ) (15)

where P θ(x) = x, θ and λ is the uniform distribution on Sd 1 = {θ Rd, θ 2 = 1}.

Computational Properties. Firstly, SW2 is very easy to compute by a Monte-Carlo approximation (see Appendix C.1). It is also differentiable, and hence using e.g. the Python Optimal Transport (POT) library (Flamary et al., 2021), we can backpropagate w.r.t. parameters or weights parametrizing the distributions (see Section 3.3). Note that some libraries allow to directly backpropagate through Wasserstein. However, theoretically, we only have access to a subgradient in that case (Cuturi & Doucet, 2014, Proposition 1), and the computational complexity is bigger (O(n3 log n) versus O(n log n) for SW). Moreover, contrary to W2, its sample complexity does not depend on the dimension (Nadjahi et al., 2020) which is important to overcome the curse of dimensionality. However, it is known to be hard to approximate in high-dimension (Deshpande et al., 2019) since the error of the Monte-Carlo estimates is impacted by the number of projections in practice (Nadjahi et al., 2020). Nevertheless, there exist several variants which could also be used. Moreover, a deterministic approach using a concentration of measure phenomenon (and hence being more accurate in high dimension) was recently proposed by Nadjahi et al. (2021) to approximate SW2.

Link with Wasserstein. The sliced-Wasserstein distance has also many properties related to the Wasserstein distance. First, they actually induce the same topology (Nadjahi et al., 2019; Bayraktar & Guoï, 2021) which might justify to use SW as a proxy of Wasserstein. Moreover, as showed in Chapter 5 of Bonnotte

Published in Transactions on Machine Learning Research (11/2022)

(2013), they can be related on compact sets by the following inequalities, let R > 0, for all µ, ν P(B(0, R)),

SW 2 2 (µ, ν) c2 d W 2 2 (µ, ν) C2 d SW

1 d+1 2 (µ, ν), (16)

with c2 d = 1

d and Cd some constant.

Hence, from these properties, we can wonder whether their gradient flows are related or not, or even better, whether they are the same or not. This property was initially conjectured by Filippo Santambrogio1. Some previous works started to gather some hints on this question. For example, Candau-Tilh (2020) showed that, while (P2(Rd), SW2) is not a geodesic space, the minimal length (in metric space, Definition 2.4 in (Santambrogio, 2017)) connecting two measures is W2 up to a constant (which is actually cd). We report in Appendix A some results introduced by Candau-Tilh (2020).

3.2 Definition and Properties of Sliced-Wasserstein Gradient Flows

Instead of solving the regular JKO scheme (5), we propose to introduce a SW-JKO scheme, let µ0 P2(Rd),

k 0, µτ k+1 arg min µ P2(Rd)

SW 2 2 (µ, µτ k) 2τ + F(µ) (17)

in which we replaced the Wasserstein distance by SW2.

To study gradient flows and show that they are well defined, we first have to check that discrete solutions of the problem (17) indeed exist. Then, we have to check that we can pass to the limit τ 0 and that the limit satisfies gradient flows properties. These limit curves will be called Sliced-Wasserstein gradient flows (SWGFs).

In the following, we restrain ourselves to measures on P2(K) where K Rd is a compact set. We report some properties of the scheme (17) such as existence and uniqueness of the minimizer, and refer to Appendix B for the proofs.

Proposition 1. Let F : P2(K) R be a lower semi continuous functional, then the scheme (17) admits a minimizer. Moreover, it is unique if µτ k is absolutely continuous and F convex or if F is strictly convex.

This proposition shows that the problem is well defined for convex lower semi continuous functionals since we can find at least a minimizer at each step. The assumptions on F are fairly standard and will apply for diverse functionals such as for example (6) or (9) for V and W regular enough.

Proposition 2. The functional F is non increasing along the sequence of minimizers (µτ k)k.

As the ultimate goal is to find the minimizer of the functional, this proposition assures us that the solution will decrease F along it at each step. If F is bounded below, then the sequence F(µτ k)

k will converge (since it is non increasing).

More generally, by defining the piecewise constant interpolation as µτ(0) = µ0 and for all k 0, t ]kτ, (k + 1)τ], µτ(t) = µτ k+1, we can show that for all t < s, SW2(µτ(t), µτ(s)) C |t s| 1 2 + τ 1 2 . Following Santambrogio (2017), we can apply the Ascoli-Arzelà theorem (Santambrogio, 2015, Box 1.7) and extract a converging subsequence. However, the limit when τ 0 is possibly not unique and has a priori no relation with F. Since (P2(Rd), SW2) is not a geodesic space, but rather a pseudo-geodesic space whose true geodesics are cd W2 (Candau-Tilh, 2020) (see Appendix A.1), we cannot directly apply the theory introduced in (Ambrosio et al., 2008). We leave for future works the study of the theoretical properties of the limit. Nevertheless, we conjecture that in the limit t , SWGFs converge toward the same measure as for WGFs. We will study it empirically in Section 4 by showing that we are able to find as good minima as WGFs for different functionals.

Limit PDE. We discuss here some possible links between SWGFs and WGFs. Candau-Tilh (2020) shows that the Euler-Lagrange equation of the functional (6) has a similar form (up to the first variation of the

1in private communications

Published in Transactions on Machine Learning Research (11/2022)

(a) F(µ) = W 2 2 (µ, 1

n Pn i=1 δxi)

0.5 0.0 0.5

0.5 0.0 0.5

(b) Interaction Functional (9)

Figure 1: Comparison of the trajectories of (dilated by d = 2) Sliced-Wasserstein gradient flows (SWGF) and Wasserstein gradient flows (WGF) of different functionals. On the left, the stationary solution is a uniform discrete distributions. On the right, the stationary solution is a Dirac ring of radius 0.5. Blue points represent the initial positions, red points the final positions, and green points the target particles.

distance, see Appendix A.2). Hence, he conjectures that there is a correlation between the two gradient flows. We identify here some cases for which we can relate the Sliced-Wasserstein gradient flows to the Wasserstein gradient flows.

We first notice that for one dimensional supported measures, W2 and SW2 are the same up to a constant

d, i.e. let µ, ν P2(Rd) be supported on the same line, then SW 2 2 (µ, ν) = W 2 2 (µ, ν)/d. Interestingly enough, this is the same constant as between geodesics. This property is actually still true in any dimension for Gaussians with a covariance matrix of the form c Id with c > 0. Therefore, we argue that for these classes of measures, provided that the minimum at each step stays in the same class, we would have a dilation of factor d between the WGF and the SWGF. For example, for the Fokker-Planck functional, the PDE followed by the SWGF would become ρ

t = d div(ρ V ) + ρ . And, by correcting the SW-JKO scheme as

µτ k+1 arg min µ P2(Rd)

d 2τ SW 2 2 (µ, µτ k) + F(µ), (18)

we would have the same dynamic. For more general measures, it is not the case anymore. But, by rewriting SW 2 2 and W 2 2 w.r.t. the centered measures µ and ν, as well as the means mµ = R x dµ(x) and mν = R x dν(x), we have:

W 2 2 (µ, ν) = mµ mν 2 2 + W 2 2 ( µ, ν), SW 2 2 (µ, ν) = mµ mν 2 2 d + SW 2 2 ( µ, ν). (19)

Hence, for measures characterized by their mean and variance (e.g. Gaussians), there will be a constant d between the optimal mean of the SWGF and of the WGF. However, such a direct relation is not available between variances, even on simple cases like Gaussians. We report in Appendix B.4 the details of the calculations.

We draw on Figure 1 trajectories of SWGFs and WGFs with F(µ) = W 2 2 (µ, 1

n Pn i=1 δxi) and F as in (9) with W chosen as in Section 4.2. For the former, the target is a discrete measure with uniform weights and, using the same number of particles in the approximation ˆµn, and performing gradient descent on the particles as explained in Section 3.3, we expect the Wasserstein gradient flow to push each particle on the closest target particle. This is indeed what we observe. For the latter, the stationary solution is a Dirac ring, as further explained in Section 4.2. In both cases, by using a dilation parameter of d, we observe almost the same trajectories between SWGF and WGF, which is an additional support of the conjecture that the trajectories of the gradient flows in both spaces are alike. We also report in Appendix D.1 evolutions along the approximated WGF and SWGF of different functionals.

3.3 Solving the SW-JKO Scheme in Practice

As a Monte-Carlo approximate of SW can be computed in closed-form, equation 17 is not a nested minimization problem anymore and is differentiable. We present here a few possible parameterizations of probability

Published in Transactions on Machine Learning Research (11/2022)

Algorithm 1 SW-JKO with Generative Models

Input: µ0 the initial distribution, K the number of SW-JKO steps, τ the step size, F the functional, Ne the number of epochs to solve each SW-JKO step, N the batch size for k = 1 to K do

Initialize a neural network gk+1 θ e.g. with gk θ for i = 1 to Ne do

Sample z(k) j , z(k+1) j p Z i.i.d

x(k) j = gk θ(z(k) j ), x(k+1) j = gk+1 θ (z(k+1) j ) // Denote ˆµτ k = 1

N PN j=1 δx(k) j , ˆµτ k+1 = 1

N PN j=1 δx(k+1) j J(ˆµτ k+1) = 1 2τ SW 2 2 (ˆµτ k, ˆµτ k+1) + F(ˆµτ k+1) Backpropagate through J w.r.t. θ Perform a gradient step using e.g. Adam end for end for

distributions which we can use in practice through SW-JKO to approximate the gradient flow. We further state, as an example, how to approximate the Fokker-Planck functional (6). Indeed, classical other functionals can be approximated using the same method since they often only require to approximate an integral w.r.t. the measure of interests and to evaluate its density as for (6). Then, from these parameterizations, we can apply gradient-based optimization algorithms by using backpropagation over the loss at each step.

Discretized Grid. A first proposition is to model the distribution on a regular fixed grid, as it is done e.g. in (Peyré, 2015). If we approximate the distribution by a discrete distribution with a fixed grid on which the different samples are located, then we only have to learn the weights. Let us denote µτ k = PN i=1 ρ(k) i δxi where we use N samples located at (xi)N i=1, and PN i=1 ρi = 1. Let ΣN denote the simplex, then the optimization problem (17) becomes:

min (ρi)i ΣN SW 2 2 (PN i=1 ρiδxi, µτ k) 2τ + F N X

i=1 ρiδxi . (20)

The entropy is only defined for absolutely continuous distributions. However, following (Carlier et al., 2017; Peyré, 2015), we can approximate the Lebesgue measure as: L = l PN i=1 δxi where l represents a volume of each grid point (we assume that each grid point represents a volume element of uniform size). In that case, the Lebesgue density can be approximated by ( ρi

l )i. Hence, for the Fokker-Planck (6) example, we approximate the potential and internal energies as

V(µ) = Z V (x)ρ(x)dx

i=1 V (xi)ρi, H(µ) = Z log ρ(x) ρ(x)dx

To stay on the simplex, we use a projected gradient descent (Condat, 2016). A drawback of discretizing the grid is that it becomes intractable in high dimension.

With Particles. We can also optimize over the position of a set of particles, assigning them uniform weights: µτ k = 1

N PN i=1 δx(k) i . The problem (17) becomes:

N PN i=1 δxi, µτ k) 2τ + F 1

i=1 δxi . (22)

In that case however, we do not have access to the density and cannot directly approximate H (or more generally internal energies). A workaround is to use nonparametric estimators (Beirlant et al., 1997), which is however impractical in high dimension.

Published in Transactions on Machine Learning Research (11/2022)

Generative Models. Another solution to model the distribution is to use generative models. Let us denote gθ : Z X such a model, with Z a latent space, θ the parameters of the model that will be learned, and let p Z be a simple distribution (e.g. Gaussian). Then, we will denote µτ k+1 = (gk+1 θ )#p Z. The SW-JKO scheme (17) will become in this case

min θ SW 2 2 (gk+1 θ )#p Z, µτ k

2τ + F (gk+1 θ )#p Z . (23)

To approximate the negative entropy, we have to be able to evaluate the density. A straightforward choice that we use in our experiments is to use invertible neural networks with a tractable density such as normalizing flows (Papamakarios et al., 2019; Kobyzev et al., 2020). Another solution could be to use the variational formulation as in (Fan et al., 2021) as we only need samples in that case, but at the cost of solving a minimax problem.

To perform the optimization, we can sample points of the different distributions at each step and use a Monte Carlo approximation in order to approximate the integrals. Let zi p Z i.i.d, then gθ(zi) (gθ)#p Z = µ and

i=1 V (gθ(zi)), H(µ) 1

log(p Z(zi)) log | det(Jgθ(zi))| . (24)

using the change of variable formula in H.

We sum up the procedure when modeling distributions with generative models in Algorithm 1. We provide the algorithms for the discretized grid and for the particles in Appendix C.2.

Complexity. Denoting by d the dimension, K the number of outer iterations, Ne the number of inner optimization step, N the batch size and L the number of projections to approximate SW, SW-JKO has a complexity of O(KNe LN log N) versus O(KNe((K + d)N + d3)) for JKO-ICNN (Mokrov et al., 2021) and O(K2Ne NNm) for the variational formulation of Fan et al. (2021) where Nm denotes the number of maximization iteration. Hence, we see that the SW-JKO scheme is more appealing for problems which will require very long dynamics.

Direct Minimization. A straightforward way to minimize a functional µ 7 F(µ) would be to parameterize the distributions as described in this section and then to perform a direct minimization of the functional by performing a gradient descent on the weights, i.e. for instance with a generative model, solving minθ F (gθ)#pz . While it is a viable solution, we noted that this is not much discussed in related papers implementing Wasserstein gradient flows with neural networks via the JKO scheme. This problem is theoretically not well defined as a gradient flow on the space of probability measures. And hence, it has less theoretical guarantees of convergence than Wasserstein gradient flows. In our experiments, we noted that the direct minimization suffers from more numerical instabilities in high dimensions, while SW acts as a regularizer. For simpler problems however, the performances can be pretty similar.

4 Experiments

In this section, we show that by approximating sliced-Wasserstein gradient flows using the SW-JKO scheme (17), we are able to minimize functionals as well as Wasserstein gradient flows approximated by the JKOICNN scheme and with a better computational complexity. We first evaluate the ability to learn the stationary density for the Fokker-Planck equation (8) in the Gaussian case, and in the context of Bayesian Logistic Regression. Then, we evaluate it on an Aggregation equation. Finally, we use SW as a functional with image datasets as target, and compare the results with Sliced-Wasserstein flows introduced in (Liutkus et al., 2019).

For these experiments, we mainly use generative models. When it is required to evaluate the density (e.g. to estimate H), we use Real Non Volume Preserving (Real NVP) normalizing flows (Dinh et al., 2016). Our experiments were conducted using Py Torch (Paszke et al., 2019).

Published in Transactions on Machine Learning Research (11/2022)

2 4 6 8 10 12 d

log10Sym KL

Stationary distribution (t=8d)

EM 1K EM 10K

EM 50K Real NVP-SWGF

0 1 2 3 4 5 t

log10Sym KL( t, t)

Figure 2: On the left, Sym KL divergence between solutions at time t = 8d (using τ = 0.1 and 80 steps in equation 17) and stationary measure. On the right, Sym KL between the true WGF µt and the approximation with JKO-ICNN ˆµt, run through 3 Gaussians with τ = 0.1. We observe instabilities at some point.

4.1 Convergence to Stationary Distribution for the Fokker-Planck Equation

We first focus on the functional (6). Its Wasserstein gradient flow is solution of a PDE of the form of (8). In this case, it is well known that the solution converges as t towards a unique stationary measure µ e V (Risken, 1996). Hence, we focus here on learning this target distribution. First, we will choose as target a Gaussian, and then in a second experiment, we will learn a posterior distribution in a bayesian logistic regression setting.

Gaussian Case. Taking V of the form V (x) = 1

2(x m)T A(x b) for all x Rd, with A a symmetric positive definite matrix and m Rd, then the stationary distribution is µ = N(m, A 1). We plot in Figure 2 the symmetric Kullback-Leibler (Sym KL) divergence over dimensions between approximated distributions and the true stationary distribution. We choose τ = 0.1 and performed 80 SW-JKO steps. We take the mean over 15 random gaussians for dimensions d {2, . . . , 12} for randomly generated positive semi-definite matrices A using make_spd_matrix from scikit-learn (Pedregosa et al., 2011). Moreover, we use Real NVPs in SW-JKO. We compare the results with the Unadjusted Langevin Algorithm (ULA) (Roberts & Tweedie, 1996), called Euler-Maruyama (EM) since it is the EM approximation of the Langevin equation, which corresponds to the counterpart SDE of the PDE (8). We see that, in dimension higher than 2, the results of the SWGF with Real NVP are better than with this particle scheme obtained with a step size of 10 3

and with either 103, 104 or 5 104 particles. We do not plot the results for JKO-ICNN as we observe many instabilities (right plot in Figure 2). Moreover, we notice a very long training time for JKO-ICNN. We add more details in Appendix D.2. We further note that SW acts here as a regularizer. Indeed, by training normalizing flows with the reverse KL (which is equal to equation 6 up to a constant), we obtain similar results, but with much more instabilities in high dimensions.

0 200 400 600 800 1000 Number of projections

log10Sym KL

d=10 d=20 d=50 d=100

Figure 3: Impact of the number of projections for a fixed number of epochs.

Curse of Dimensionality. Even though the sliced-Wasserstein distance sample complexity does not suffer from the curse of dimensionality, it appears through the Monte-Carlo approximation (Nadjahi et al., 2020). Here, since SW plays a regularizer role, the objective is not necessarily to approximate it well but rather to minimize the given functional. Nevertheless, the number of projections can still have an impact on the minimization, and we report on Figure 3 the evolution of the found minima w.r.t. the number of projections, averaged over 15 random Gaussians. We observe that we do not need much projections to have fairly good results, even in higher dimension. Indeed, with more than 200 projections, the performances stay pretty stables.

Published in Transactions on Machine Learning Research (11/2022)

(a) Steady state on the discretized grid

(b) Steady state for the fully connected neural network

(c) Steady state for particles

(d) Steady state for JKOICNN (with τ = 0.1)

Figure 4: Steady state of the aggregation equation for a = 4, b = 2. From left to right, we plot it for the discretized grid, for the FCNN, for particles and for JKO-ICNN. We observe that JKO-ICNN does not recover the ring correctly as the particles are not evenly distributed on it.

Table 1: Accuracy and Training Time for Bayesian Logistic Regression over 5 runs

JKO-ICNN SWGF+Real NVP

Dataset Acc t Acc t

covtype 0.755 5 10 4 33702s 0.755 3 10 3 103s german 0.679 5 10 3 2123s 0.68 5 10 3 82s diabetis 0.777 7 10 3 4913s 0.778 2 10 3 122s twonorm 0.981 2 10 4 6551s 0.981 6 10 4 301s ringnorm 0.736 10 3 1228s 0.741 6 10 4 82s banana 0.55 10 2 1229s 0.559 10 2 66s splice 0.847 2 10 3 2290s 0.85 2 10 3 113s waveform 0.782 8 10 4 856s 0.776 8 10 4 120s image 0.822 10 3 1947s 0.821 3 10 3 72s

Bayesian Logistic Regression. Following the experiment of Mokrov et al. (2021) in Section 4.3, we propose to tackle the Bayesian Logistic Regression problem using SWGFs. For this task, we want to sample from p(x|D) where D represent data and x = (w, log α) with w the regression weights on which we apply a Gaussian prior p0(w|α) = N(w; 0, α 1) and with p0(α) = Γ(α; 1, 0.01). In that case, we use V (x) = log p(x|D) to learn p(x|D). We refer to Appendix D.2 for more details on the experiments, as well as hyperparameters. We report in Table 1 the accuracy results obtained on different datasets with SWGFs and compared with JKO-ICNN. We also report the training time and see that SWGFs allow to obtain results as good as with JKO-ICNN for most of the datasets but for shorter training times which underlines the better complexity of our scheme.

4.2 Convergence to Stationary Distribution for an Aggregation Equation

We also show the possibility to find the stationary solution of different PDEs than Fokker-Planck. For example, using an interaction functional of the form

ZZ W(x y) dµ(x)dµ(y). (25)

We notice here that we do not need to evaluate the density. Therefore, we can apply any neural network. For example, in the following, we will use a simple fully connected neural network (FCNN) and compare the results obtained with JKO-ICNN. We also show the results when learning directly over the particles and when learning weights over a regular grid.

Carrillo et al. (2021) use a repulsive-attractive interaction potential W(x) = x 4 2 4 x 2 2 2 . In this case, they showed empirically that the solution is a Dirac ring with radius 0.5 and centered at the origin when starting from µ0 = N(0, 0.252I2). With τ = 0.05, we show on Figure 4 that we recover this result with SWGFs for different parametrizations of the probabilities. More precisely, we first use a discretized grid of 50 50 samples of [ 1, 1]2. Then, we show the results when directly learning the particles and when using a FCNN. We also compare with the results obtained with JKO-ICNN. The densities reported for the last three methods are obtained through a kernel density estimator (KDE) with a bandwidth manually chosen since we either do not have access to the density, or we observed for JKO-ICNN that the likelihood exploded (see Appendix D.4). It may be due to the fact that the stationary solution does not admit a density with respect to the Lebesgue measure. For JKO-ICNN, we observe that the ring shape is recovered, but the samples are not evenly distributed on it.

Published in Transactions on Machine Learning Research (11/2022)

Figure 5: Generated sample obtained through a pretrained decoder + Real NVP.

Table 2: FID scores on some datasets (lower is better)

Methods MNIST Fashion Celeb A

Ambient Space

SWF (Liutkus et al., 2019) 225.1 207.6 - SWGF + Real NVP 88.1 95.5 - SWGF + CNN 69.3 102.3 -

Latent Space

AE (golden score) 15.55 31 77

SWGF + AE + Real NVP 17.8 40.6 90.9 SWGF + AE + FCNN 18.3 41.7 88 SWF 22.5 56.4 91.2

We report the solution at time t = 10, and used τ = 0.05 for SW-JKO and τ = 0.1 for JKO-ICNN. As JKO-ICNN requires O(k2) evaluations of gradients of ICNNs, the training is very long for such a dynamic. Here, the training took around 5 hours on a RTX 2080 TI (for 100 steps), versus 20 minutes for the FCNN and 10 minutes for 1000 particles (for 200 steps).

This underlines again the better training complexity of SW-JKO compared to JKO-ICNN, which is especially appealing when we are only interested in learning the optimal distribution. One such task is generative modeling in which we are interested in learning a target distribution ν from which we have access through samples.

4.3 Application on Real Data

In what follows, we show that the SW-JKO scheme can generate real data, and perform better than the associated particle scheme. To perform generative modeling, we can use different functionals. For example, GANs use the Jensen-Shannon divergence (Goodfellow et al., 2014) and WGANs the Wasserstein-1 distance (Arjovsky et al., 2017). To compare with an associated particle scheme, we focus here on the regularized SW distance as functional, defined as

2SW 2 2 (µ, ν) + λH(µ), (26)

where ν is some target distribution, for which we should have access to samples. The Wasserstein gradient flow of this functional was first introduced and study by Bonnotte (2013) for λ = 0, and by Liutkus et al. (2019) with the negative entropy term. Liutkus et al. (2019) showcased a particle scheme called SWF (Sliced Wasserstein Flow) to approximate the WGF of equation 26. Applied on images such as MNIST (Le Cun & Cortes, 2010), Fashion MNIST (Xiao et al., 2017) or Celeb A (Liu et al., 2015), SWFs need a very long convergence due to the curse of dimensionality and the trouble approximating SW. Hence, they used instead a pretrained autoencoder (AE) and applied the particle scheme in the latent space. Likewise, we use the AE proposed by Liutkus et al. (2019) with a latent space of dimension d = 48, and we perform SW-JKO steps on thoses images. We report on Figure 5 samples obtained with Real NVPs and on Table 2 the Fréchet Inception distance (FID) (Heusel et al., 2017) obtained between 104 samples. We denote golden score the FID obtained with the pretrained autoencoder. Hence, we cannot obtain better results than this. We compared the results in the latent and in the ambient space with SWFs and see that we obtain fairly better results using generative models within the SW-JKO scheme, especially in the ambient space, although the results are not really competitive with state-of-the-art methods. This may be due more to the curse of dimensionality in approximating the objective SW than in approximating the regularizer SW.

To sum up, an advantage of the SW-JKO scheme is to be able to use easier, yet powerful enough, architectures to learn the dynamic. This is cheaper in training time and less memory costly. Furthermore, we can tune the architecture with respect to the characteristics of the problem and add inductive biases (e.g. using CNN for images) or learn directly over the particles for low dimensional problems.

5 Conclusion

In this work, we derive a new class of gradient flows in the space of probability measures endowed with the sliced-Wasserstein metric, and the corresponding algorithms. To the best of our knowledge, and despite its

Published in Transactions on Machine Learning Research (11/2022)

simplicity, this is the first time that this class of flows is proposed in a machine learning context. We showed that it has several advantages over state-of-the-art approaches such as the recent JKO-ICNN. Aside from being less computationally intensive, it is more versatile w.r.t. the different practical solutions for modeling probability distributions, such as normalizing flows, generative models or sets of evolving particles.

Regarding the theoretical aspects, several challenges remain ahead: First, its connections with Wasserstein gradient flows are still unclear. Second, one needs to understand if, regarding the optimization task, convergence speeds or guarantees are changed with this novel formulation, revealing potentially interesting practical properties. Lastly, it is natural to study if popular variants of the sliced-Wasserstein distance such as Max-sliced (Deshpande et al., 2019), Distributional sliced (Nguyen et al., 2021), Subspace robust (Paty & Cuturi, 2019), generalized sliced (Kolouri et al., 2019) or projection Wasserstein distances (Rowland et al., 2019) can also be used in similar gradient flow schemes. The study of higher-order approximation schemes such as BDF2 (Matthes & Plazotta, 2019; Plazotta, 2018) could also be of interest.

Acknowledgments

This research was funded by project Dyna Learn from Labex Comin Labs and Région Bretagne ARED DLearn Me, and by the project OTTOPIA ANR-20-CHIA-0030 of the French National Research Agency (ANR).

David Alvarez-Melis, Yair Schiff, and Youssef Mroueh. Optimizing functionals on the space of probabilities with input convex neural networks. ar Xiv preprint ar Xiv:2106.00774, 2021. (Cited on p. 2, 4, 29)

Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008. (Cited on p. 1, 4, 6)

Brandon Amos, Lei Xu, and J Zico Kolter. Input convex neural networks. In International Conference on Machine Learning, pp. 146 155. PMLR, 2017. (Cited on p. 2, 4)

Michael Arbel, Anna Korba, Adil Salim, and Arthur Gretton. Maximum mean discrepancy gradient flow. Advances in Neural Information Processing Systems, 32, 2019. (Cited on p. 2)

Martín Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. Open Review.net, 2017. URL https://openreview.net/forum?id= Hk4_qw5xe. (Cited on p. 4)

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pp. 214 223. PMLR, 2017. (Cited on p. 1, 12)

Daniel Balagué, José A Carrillo, Thomas Laurent, and Gaël Raoul. Nonlocal interactions by repulsive attractive potentials: radial ins/stability. Physica D: Nonlinear Phenomena, 260:5 25, 2013. (Cited on p. 32)

Heinz H Bauschke, Patrick L Combettes, et al. Convex analysis and monotone operator theory in Hilbert spaces, volume 408. Springer, 2011. (Cited on p. 2)

Erhan Bayraktar and Gaoyue Guoï. Strong equivalence between metrics of wasserstein type. Electronic Communications in Probability, 26:1 13, 2021. (Cited on p. 5)

Jan Beirlant, Edward J Dudewicz, László Györfi, Edward C Van der Meulen, et al. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 6(1):17 39, 1997. (Cited on p. 8)

Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to the monge-kantorovich mass transfer problem. Numerische Mathematik, 84(3):375 393, 2000. (Cited on p. 4)

Published in Transactions on Machine Learning Research (11/2022)

Jean-David Benamou, Guillaume Carlier, Quentin Mérigot, and Edouard Oudet. Discretization of functionals involving the monge ampère operator. Numerische mathematik, 134(3):611 636, 2016. (Cited on p. 4)

Sam Bond-Taylor, Adam Leach, Yang Long, and Chris G Willcocks. Deep generative modelling: A comparative review of vaes, gans, normalizing flows, energy-based and autoregressive models. ar Xiv preprint ar Xiv:2103.04922, 2021. (Cited on p. 4)

Nicolas Bonnotte. Unidimensional and evolution methods for optimal transportation. Ph D thesis, Paris 11, 2013. (Cited on p. 5, 12, 21)

Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on pure and applied mathematics, 44(4):375 417, 1991. (Cited on p. 2)

Charlotte Bunne, Laetitia Papaxanthos, Andreas Krause, and Marco Cuturi. Proximal optimal transport modeling of population dynamics. In International Conference on Artificial Intelligence and Statistics, pp. 6511 6528. PMLR, 2022. (Cited on p. 2, 4)

Kenneth F Caluya and Abhishek Halder. Proximal recursion for solving the fokker-planck equation. In 2019 American Control Conference (ACC), pp. 4098 4103. IEEE, 2019. (Cited on p. 4)

Clément Cancès, Thomas O Gallouët, and Gabriele Todeschi. A variational finite volume scheme for wasserstein gradient flows. Numerische Mathematik, 146(3):437 480, 2020. (Cited on p. 4)

Jules Candau-Tilh. Wasserstein and sliced-wasserstein distances. Master s thesis, Université Pierre et Marie Curie, 2020. (Cited on p. 6, 20, 22)

Guillaume Carlier, Vincent Duval, Gabriel Peyré, and Bernhard Schmitzer. Convergence of entropic schemes for optimal transport and gradient flows. SIAM Journal on Mathematical Analysis, 49(2):1385 1418, 2017. (Cited on p. 2, 4, 8)

José A Carrillo, Alina Chertock, and Yanghong Huang. A finite-volume method for nonlinear nonlocal equations with a gradient flow structure. Communications in Computational Physics, 17(1):233 258, 2015. (Cited on p. 36)

Jose A Carrillo, Katy Craig, Li Wang, and Chaozhen Wei. Primal dual methods for wasserstein gradient flows. Foundations of Computational Mathematics, pp. 1 55, 2021. (Cited on p. 11, 32, 35, 36)

Yuxin Chen and Theodore Kolokolnikov. A minimal model of predator swarm interactions. Journal of The Royal Society Interface, 11(94):20131208, 2014. (Cited on p. 36)

Laurent Condat. Fast projection onto the simplex and the l1 ball. Mathematical Programming, 158(1): 575 585, 2016. (Cited on p. 8, 26)

Rob Cornish, Anthony Caterini, George Deligiannidis, and Arnaud Doucet. Relaxing bijectivity constraints with continuously indexed normalising flows. In International Conference on Machine Learning, pp. 2133 2143. PMLR, 2020. (Cited on p. 35)

Marco Cuturi and Arnaud Doucet. Fast computation of wasserstein barycenters. In International conference on machine learning, pp. 685 693. PMLR, 2014. (Cited on p. 5)

Biwei Dai and Uros Seljak. Sliced iterative normalizing flows. In ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2021. (Cited on p. 37)

Ishan Deshpande, Yuan-Ting Hu, Ruoyu Sun, Ayis Pyrros, Nasir Siddiqui, Sanmi Koyejo, Zhizhen Zhao, David Forsyth, and Alexander G Schwing. Max-sliced wasserstein distance and its use for gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10648 10656, 2019. (Cited on p. 5, 13, 31)

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. ar Xiv preprint ar Xiv:1605.08803, 2016. (Cited on p. 9, 27, 29, 38)

Published in Transactions on Machine Learning Research (11/2022)

Andrew Duncan, Nikolas Nüsken, and Lukasz Szpruch. On the geometry of stein variational gradient descent. ar Xiv preprint ar Xiv:1912.00894, 2019. (Cited on p. 2)

Richard L Dykstra. An iterative procedure for obtaining i-projections onto the intersection of convex sets. The annals of Probability, pp. 975 984, 1985. (Cited on p. 4)

Jiaojiao Fan, Amirhossein Taghvaei, and Yongxin Chen. Variational wasserstein gradient flow. ar Xiv preprint ar Xiv:2112.02424, 2021. (Cited on p. 2, 4, 5, 9)

Kai-Tai Fang, Samuel Kotz, and Kai Wang Ng. Symmetric multivariate and related distributions. Chapman and Hall/CRC, 1992. (Cited on p. 25)

Xingdong Feng, Yuan Gao, Jian Huang, Yuling Jiao, and Xu Liu. Relative entropy gradient sampler for unnormalized distributions. ar Xiv preprint ar Xiv:2110.02787, 2021. (Cited on p. 1)

Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong, and Titouan Vayer. Pot: Python optimal transport. Journal of Machine Learning Research, 22(78):1 8, 2021. URL http://jmlr.org/papers/ v22/20-451.html. (Cited on p. 5)

Charlie Frogner and Tomaso Poggio. Approximate inference with wasserstein gradient flows. In International Conference on Artificial Intelligence and Statistics, pp. 2581 2590. PMLR, 2020. (Cited on p. 4)

Thomas O Gallouët and Leonard Monsaingeon. A jko splitting scheme for kantorovich fisher rao gradient flows. SIAM Journal on Mathematical Analysis, 49(2):1100 1130, 2017. (Cited on p. 2)

Alfredo Garbuno-Inigo, Franca Hoffmann, Wuchen Li, and Andrew M Stuart. Interacting langevin diffusions: Gradient structure and ensemble kalman sampler. SIAM Journal on Applied Dynamical Systems, 19(1): 412 441, 2020. (Cited on p. 2)

Clark R Givens and Rae Michael Shortt. A class of wasserstein metrics for probability distributions. Michigan Mathematical Journal, 31(2):231 240, 1984. (Cited on p. 24)

Pierre Glaser, Michael Arbel, and Arthur Gretton. Kale flow: A relaxed kl gradient flow for probabilities with disjoint support. Advances in Neural Information Processing Systems, 34, 2021. (Cited on p. 2)

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27:2672 2680, 2014. (Cited on p. 1, 12)

Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505 8510, 2018. (Cited on p. 1)

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017. (Cited on p. 12, 37)

Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design. In International Conference on Machine Learning, pp. 2722 2730. PMLR, 2019. (Cited on p. 38)

Chin-Wei Huang, Ricky TQ Chen, Christos Tsirigotis, and Aaron Courville. Convex potential flows: Universal probability distributions with optimal transport and convex optimization. ar Xiv preprint ar Xiv:2012.05942, 2020. (Cited on p. 2, 29)

Hyung Ju Hwang, Cheolhyeong Kim, Min Sue Park, and Hwijae Son. The deep minimizing movement scheme. ar Xiv preprint ar Xiv:2109.14851, 2021. (Cited on p. 2, 23)

Published in Transactions on Machine Learning Research (11/2022)

Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the fokker planck equation. SIAM journal on mathematical analysis, 29(1):1 17, 1998. (Cited on p. 1, 2, 3)

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. (Cited on p. 29)

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013. (Cited on p. 1)

Ivan Kobyzev, Simon Prince, and Marcus Brubaker. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. (Cited on p. 9)

Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland Badeau, and Gustavo K Rohde. Generalized sliced wasserstein distances. ar Xiv preprint ar Xiv:1902.00434, 2019. (Cited on p. 13)

Anna Korba, Pierre-Cyril Aubin-Frankowski, Szymon Majewski, and Pierre Ablin. Kernel stein discrepancy descent. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 5719 5730. PMLR, 2021. (Cited on p. 2, 21)

Alexander Korotin, Vage Egiazarian, Arip Asadulaev, Alexander Safin, and Evgeny Burnaev. Wasserstein-2 generative networks. ar Xiv preprint ar Xiv:1909.13082, 2019. (Cited on p. 2, 28)

Alexander Korotin, Lingxiao Li, Aude Genevay, Justin Solomon, Alexander Filippov, and Evgeny Burnaev. Do neural optimal transport solvers work? a continuous wasserstein-2 benchmark. ar Xiv preprint ar Xiv:2106.01954, 2021. (Cited on p. 2)

Maxime Laborde. Interacting particles systems, Wasserstein gradient flow approach. Ph D thesis, PSL Research University, 2016. (Cited on p. 4)

Jean-François Le Gall. Brownian motion, martingales, and stochastic calculus. Springer, 2016. (Cited on p.

Yann Le Cun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun. com/exdb/mnist/. (Cited on p. 12, 37)

Alex Tong Lin, Wuchen Li, Stanley Osher, and Guido Montúfar. Wasserstein proximal of gans. ar Xiv preprint ar Xiv:2102.06862, 2021. (Cited on p. 1)

Qiang Liu. Stein variational gradient descent as gradient flow. Advances in neural information processing systems, 30, 2017. (Cited on p. 2)

Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. ar Xiv preprint ar Xiv:1608.04471, 2016. (Cited on p. 2, 30, 31)

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pp. 3730 3738, 2015. (Cited on p. 12, 37)

Antoine Liutkus, Umut Simsekli, Szymon Majewski, Alain Durmus, and Fabian-Robert Stöter. Slicedwasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. In International Conference on Machine Learning, pp. 4104 4113. PMLR, 2019. (Cited on p. 1, 5, 9, 12, 21, 37, 38, 40)

Yulong Lu, Jianfeng Lu, and James Nolen. Accelerating langevin sampling with birth-death. ar Xiv preprint ar Xiv:1905.09863, 2019. (Cited on p. 2)

Michael C Mackey. Time s arrow: The origins of thermodynamic behavior. Courier Corporation, 1992. (Cited on p. 26)

Published in Transactions on Machine Learning Research (11/2022)

Daniel Matthes and Simon Plazotta. A variational formulation of the bdf2 method for metric gradient flows. ESAIM: Mathematical Modelling and Numerical Analysis, 53(1):145 172, 2019. (Cited on p. 13)

Sebastian Mika, Gunnar Ratsch, Jason Weston, Bernhard Scholkopf, and Klaus-Robert Mullers. Fisher discriminant analysis with kernels. In Neural networks for signal processing IX: Proceedings of the 1999 IEEE signal processing society workshop (cat. no. 98th8468), pp. 41 48. Ieee, 1999. (Cited on p. 31)

Petr Mokrov, Alexander Korotin, Lingxiao Li, Aude Genevay, Justin Solomon, and Evgeny Burnaev. Largescale wasserstein gradient flows. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021. (Cited on p. 2, 4, 5, 9, 11, 27, 28, 29, 30, 31)

Kimia Nadjahi, Alain Durmus, Umut Simsekli, and Roland Badeau. Asymptotic guarantees for learning generative models with the sliced-wasserstein distance. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. (Cited on p. 5)

Kimia Nadjahi, Alain Durmus, Lénaïc Chizat, Soheil Kolouri, Shahin Shahrampour, and Umut Simsekli. Statistical and topological properties of sliced probability divergences. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 20802 20812. Curran Associates, Inc., 2020. (Cited on p. 5, 10, 31)

Kimia Nadjahi, Alain Durmus, Pierre E Jacob, Roland Badeau, and Umut Şimşekli. Fast approximation of the sliced-wasserstein distance using concentration of random projections. ar Xiv preprint ar Xiv:2106.15427, 2021. (Cited on p. 5, 24, 25)

Khai Nguyen, Nhat Ho, Tung Pham, and Hung Bui. Distributional sliced-wasserstein and applications to generative modeling. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Open Review.net, 2021. (Cited on p. 13)

George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. ar Xiv preprint ar Xiv:1705.07057, 2017. (Cited on p. 38)

George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. ar Xiv preprint ar Xiv:1912.02762, 2019. (Cited on p. 1, 9)

Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in optimization, 1(3):127 239, 2014. (Cited on p. 2, 3)

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary De Vito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 8024 8035. Curran Associates, Inc., 2019. (Cited on p. 9)

François-Pierre Paty and Marco Cuturi. Subspace robust wasserstein distances. In International Conference on Machine Learning, pp. 5072 5081. PMLR, 2019. (Cited on p. 13)

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011. (Cited on p. 10, 27)

Gabriel Peyré. Entropic approximation of wasserstein gradient flows. SIAM Journal on Imaging Sciences, 8 (4):2323 2351, 2015. (Cited on p. 2, 4, 8)

Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355 607, 2019. (Cited on p. 5)

Published in Transactions on Machine Learning Research (11/2022)

Rémi Peyre. Comparison between w2 distance and H 1 norm, and localization of wasserstein distance. ESAIM: Control, Optimisation and Calculus of Variations, 24(4):1489 1501, 2018. (Cited on p. 4)

Simon Plazotta. A bdf2-approach for the non-linear fokker-planck equation. ar Xiv preprint ar Xiv:1801.09603, 2018. (Cited on p. 13)

Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pp. 435 446. Springer, 2011. (Cited on p. 2, 5)

Jack Richter-Powell, Jonathan Lorraine, and Brandon Amos. Input convex gradient networks. ar Xiv preprint ar Xiv:2111.12187, 2021. (Cited on p. 2)

Hannes Risken. Fokker-planck equation. In The Fokker-Planck Equation, pp. 63 95. Springer, 1996. (Cited on p. 10)

Gareth O Roberts and Richard L Tweedie. Exponential convergence of langevin distributions and their discrete approximations. Bernoulli, pp. 341 363, 1996. (Cited on p. 1, 10, 27)

Litu Rout, Alexander Korotin, and Evgeny Burnaev. Generative modeling with optimal transport maps. ar Xiv preprint ar Xiv:2110.02999, 2021. (Cited on p. 2)

Mark Rowland, Jiri Hron, Yunhao Tang, Krzysztof Choromanski, Tamas Sarlos, and Adrian Weller. Orthogonal estimation of wasserstein distances. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 186 195. PMLR, 2019. (Cited on p. 13)

Adil Salim, Anna Korba, and Giulia Luise. The wasserstein proximal gradient algorithm. ar Xiv e-prints, pp. ar Xiv 2002, 2020. (Cited on p. 2)

Filippo Santambrogio. Optimal transport for applied mathematicians. Birkäuser, NY, 55(58-63):94, 2015. (Cited on p. 4, 6, 20, 21, 22)

Filippo Santambrogio. {Euclidean, metric, and Wasserstein} gradient flows: an overview. Bulletin of Mathematical Sciences, 7(1):87 154, 2017. (Cited on p. 1, 3, 6)

Saeed Saremi. On approximating f with neural networks. ar Xiv preprint ar Xiv:1910.12744, 2019. (Cited on p. 2)

P Vatiwutipong and N Phewchean. Alternative way to derive the distribution of the multivariate ornstein uhlenbeck process. Advances in Difference Equations, 2019(1):1 7, 2019. (Cited on p. 27)

Cédric Villani. Topics in optimal transportation, volume 58. American Mathematical Soc., 2003. (Cited on p. 4)

Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008. (Cited on p. 3, 22)

Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake Vander Plas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and Sci Py 1.0 Contributors. Sci Py 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261 272, 2020. doi: 10.1038/s41592-019-0686-2. (Cited on p. 29)

Yifei Wang, Peng Chen, and Wuchen Li. Projected wasserstein gradient descent for high-dimensional bayesian inference. ar Xiv preprint ar Xiv:2102.06350, 2021. (Cited on p. 1)

Published in Transactions on Machine Learning Research (11/2022)

Yifei Wang, Peng Chen, Mert Pilanci, and Wuchen Li. Optimal neural network approximation of wasserstein gradient direction via convex optimization. ar Xiv preprint ar Xiv:2205.13098, 2022. (Cited on p. 1)

Andre Wibisono. Sampling as optimization in the space of measures: The langevin dynamics as a composite optimization problem. In Conference on Learning Theory, pp. 2093 3027. PMLR, 2018. (Cited on p. 1, 4, 27)

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. ar Xiv preprint ar Xiv:1708.07747, 2017. (Cited on p. 12, 37)

Chao Zhang, Zhijian Li, Hui Qian, and Xin Du. Dpvi: A dynamic-weight particle-based variational inference framework. ar Xiv preprint ar Xiv:2112.00945, 2021. (Cited on p. 2)

Published in Transactions on Machine Learning Research (11/2022)

A Some Results of (Candau-Tilh, 2020)

In this section, we report some interesting theoretical results of (Candau-Tilh, 2020). More precisely, in Appendix A.1, we report the geodesics in SW space. Then, in Appendix A.2, we report the Euler-Lagrange equation of the Wasserstein gradient flow and the Sliced-Wasserstein gradient flow of the Fokker-Planck functional. Finally, in Appendix A.3, we report some theoretical results that we will use in our proofs.

A.1 Geodesics in Sliced-Wasserstein Space

We recall here first some notions in metric spaces. Let (X, d) be some metric space. In our case, we will have X = P2(Ω) with Ωa bounded, open convex set and d = SW2.

We first need to define an absolutely continuous curve.

Definition 1 (Absolutely continuous curve). A curve w : [0, 1] X is said to be absolutely continuous if there exists g L1([0, 1]) such that

t0 < t1, d w(t0), w(t1) Z t1

t0 g(s)ds. (27)

We denote by AC(X, d) the set of absolutely continuous measures. Moreover, denote ACx,y(X, d) the set of curves in AC(X, d) starting at x and ending at y. Then, we can define the length of an absolutely continuous curve w AC(X, d) as

Ld(w) = sup

k=0 d w(tk), w(tk+1) , n 1, 0 = t0 < t1 < < tn = 1

Then, we say that a space X is a geodesic space if for any x, y X,

d(x, y) = min {Ld(w), w AC(X, d), w(0) = x, w(1) = y} . (29)

Finally, Candau-Tilh (2020) showed in Theorem 2.4 that (P2(Ω), SW2) is not a geodesic space but rather a pseudo-geodesic space since for µ, ν P2(Ω),

inf {LSW2(w), w ACµ,ν(P2(Ω), SW2)} = cd,2W2(µ, ν). (30)

We see that the infimum of the length in the SW space is the Wasserstein distance. Hence, it suggests that the geodesics in SW space are related to the one in Wasserstein space, which are well known since they correspond to the Mc Cann s interpolation (see e.g. (Santambrogio, 2015, Theorem .27)).

A.2 Euler-Lagrange Equations in Wasserstein and Sliced-Wasserstein Spaces

To prove that the Wasserstein gradient flow of a functional satisfies a PDE, one step consists at computing the first variations of the functional in the JKO scheme, i.e. of x 7 d2(x, y)/2τ + F(x). Evaluating the first variation at the minimizer actually gives a corresponding Euler-Lagrange equation (Santambrogio, 2015; Candau-Tilh, 2020).

Hence, Candau-Tilh (2020) also provided a similar form of Euler-Lagrange equations for Wasserstein and Sliced-Wasserstein gradient flows of the Fokker-Planck functional

F(µ) = Z V dµ + H(µ). (31)

Proposition 3 (Proposition 3.9 in (Candau-Tilh, 2020)).

Published in Transactions on Machine Learning Research (11/2022)

Let µτ k+1 be the optimal measure in

arg min µ P2(K)

W 2 2 (µ, µτ k) 2τ + F(µ). (32)

Then, it satisfies the following Euler-Lagrange equation:

log(ρτ k+1) + V + ψ

τ = constant a.e., (33)

where ρτ k+1 is the density of µτ k+1 and ψ is the Kantorovitch potential from µτ k+1 to µτ k.

Let µτ k+1 be the optimal measure in

arg min µ P2(K)

SW 2 2 (µ, µτ k) 2τ + F(µ). (34)

Then, it satisfies the following Euler-Lagrange equation:

log(ρτ k+1) + V + 1

Sd 1 ψθ P θ dλ(θ) = constant a.e., (35)

where for θ Sd 1, ψθ is the Kantorovitch potential form P θ #µτ k+1 to P θ #µτ k.

From this proposition, we see that the Euler-Lagrange equations share similar forms, since only the first variation of the distance changes.

Informally, one method to find the corresponding PDE for Wasserstein gradient flows is to show that the derivative of the first variation of Wasserstein is equal (weakly, i.e. in the sense of distributions by integrating w.r.t. test functions) to ρ

t . In this case, we recover well from (32) the PDE. Indeed, ρ satisfies the PDE

t = div(ρ V ) + ρ (36)

in a weak sense if for all ξ C (]0, + [ Rd), Z +

t (t, x) + V (x), xξ(t, x) ξ(t, x) dρt(x)dt = Z ξ(0, x) dρ0(x). (37)

By using the Euler-Lagrange equation (33), informally, we have ρ log(ρ) + V + ψ

τ = ρ + ρ V + ρ ψ

Then, we integrate ξ w.r.t. to this distribution and we use that (by integration by parts) Z ξ(x), ρ(x) dx = Z ξ(x) dρ(x), (38)

and we obtain Z

ξ(x) + ξ(x), V (x) + 1

τ ξ(x), ψ(x) dρt(x) = 0. (39)

Hence, to obtain the right equation, we require to show that in the limit τ 0, the term containing the first variation of Wasserstein is equal to the term representing the derivation in time, i.e.

lim τ 0 1 τ

ZZ ξ(x), ψ(x) dρt(x)dt = ZZ ξ

t dρt(x)dt. (40)

This is done e.g. in (Bonnotte, 2013, Theorem 5.3.6) or (Liutkus et al., 2019, Theorem S6). For the Sliced-Wasserstein distance, it is still an open question whether we have the same connection or not.

Note that we recover the divergence operator by using an integration by parts (see e.g. (Korba et al., 2021, Appendix A.1)) Z ξ(x), V (x) dρ(x) = Z ξ(x)div ρ(x) V (x) dx. (41)

For a complete derivation of the Wasserstein gradient flows of the Fokker-Planck functional, see e.g. (Santambrogio, 2015, Section 4.5).

Published in Transactions on Machine Learning Research (11/2022)

A.3 Results on the SW-JKO Scheme

Finally, we report here some results about the continuity and convexity of the Sliced-Wasserstein distance as well as on the existence of the minimizer at each step of the SW-JKO scheme. We will use these results in our proofs in Appendix B.

In the following, we restrain ourselves to measures supported on a compact domain K.

Proposition 4 (Proposition 3.4 in (Candau-Tilh, 2020)). Let ν P2(K). Then, µ 7 SW 2 2 (µ, ν) is continuous w.r.t. the weak convergence.

Proposition 5 (Proposition 3.5 in (Candau-Tilh, 2020)). Let ν P2(K), then µ 7 SW 2 2 (µ, ν) is convex and stricly convex whenever ν is absolutely continuous w.r.t. the Lebesgue measure.

Proposition 6 (Proposition 3.7 in (Candau-Tilh, 2020)). Let τ > 0 and µτ k P2(K). Then, there exists a unique solution µτ k+1 P2(K) to the minimization problem

min µ P2(K) SW 2 2 (µ, µτ k) 2τ + Z V dµ + H(µ). (42)

The solution is even absolutely continuous.

B.1 Proof of Proposition 1

We refer to the proposition 3.7 in (Candau-Tilh, 2020).

Let τ > 0, k N, µτ k P2(K). Let s note J(µ) = SW 2 2 (µ,µτ k) 2τ + F(µ).

According to Proposition 3.4 in (Candau-Tilh, 2020), µ 7 SW 2 2 (µ, µτ k) is continuous with respect to the

weak convergence. Indeed, let µ P2(K) and let (µn)n converging weakly to µ, i.e. µn L n µ. Then, by the reverse triangular inequality, we have

|SW2(µn, µτ k) SW2(µ, µτ k)| SW2(µn, µ) W2(µn, µ). (43)

Since the Wasserstein distance metrizes the weak convergence (Villani, 2008), we have that W2(µn, µ) 0 . And therefore, µ 7 SW2(µ, µτ k) and µ 7 SW 2 2 (µ, µτ k) are continuous w.r.t. the weak convergence.

By hypothesis, F is lower semi continuous, hence µ 7 J(µ) is lower semi continuous. Moreover, P2(K) is compact for the weak convergence, thus we can apply the Weierstrass theorem (Box 1.1 in (Santambrogio, 2015)) and there exists a minimizer µτ k+1 of J.

By Proposition 3.5 in (Candau-Tilh, 2020), µ 7 SW 2 2 (µ, ν) is convex and strictly convex whenever ν is absolutely continuous w.r.t. the Lebesgue measure. Hence, for the uniqueness, if F is strictly convex then µ 7 J(µ) is also strictly convex and the minimizer is unique. And if ρτ k is absolutely continuous, then according to Proposition 3.5 in (Candau-Tilh, 2020), µ 7 SW 2 2 (µ, µτ k) is strictly convex, and hence µ 7 J(µ) is also strictly convex since F was taken convex by hypothesis.

B.2 Proof of Proposition 2

Let k N, then since µτ k+1 is the minimizer of equation 17,

F(µτ k+1) + SW 2 2 (µτ k+1, µτ k) 2τ F(µτ k) + SW 2 2 (µτ k, µτ k) 2τ = F(µτ k). (44)

Hence, as SW 2 2 (µτ k+1, µτ k) 0, F(µτ k+1) F(µτ k). (45)

Published in Transactions on Machine Learning Research (11/2022)

B.3 Upper Bound on the Errors

Following Hwang et al. (2021), we can also derive an upper bound on the error made at each step. Proposition 7. Let k N, µ0 P2(Rd), C some constant, and assume that F admits a lower bound, then

SW2 (gk+1,τ θ )#p Z, µτ k+1 SW2 (gk,τ θ )#p Z, µτ k + Cτ 1 2

(k + 1)Cτ 1 2 + SW2(µ0, µτ 1). (46)

Proof. The bound can be found by applying Theorem 3.1 in (Hwang et al., 2021) with X = (P2(Rd, SW2), and then by applying a straightforward induction. We report here the proof for the sake of completeness.

Let k N, then since µτ k+1 is the minimizer of equation 17,

F(µτ k+1) + SW 2 2 (µτ k+1, µτ k) 2τ F(µτ k) + SW 2 2 (µτ k, µτ k) 2τ = F(µτ k). (47)

Therefore, using that F is non increasing along (µτ k)k (Proposition 2), we have F(µτ k) F(µ0). Moreover, using that F admits an infimum, we find

SW 2 2 (µτ k, µτ k+1) 2τ F(µτ k) F(µτ k+1) 2τ F(µ0) inf µ F(µ) . (48)

Let A = F(µ0) infµ F(µ). By the same reasoning, we have that

SW 2 2 (gk+1,τ θ )#p Z, (gk,τ θ )#p Z 2τA. (49)

Now, using the triangular inequality, we have that

SW2 (gk+1,τ θ )#p Z, µτ k+1 SW2 (gk+1,τ θ )#p Z, (gk θ)#p Z + SW2 (gk θ)#p Z, µτ k + SW2(µτ k, µτ k+1)

2τA + SW2 (gk,τ θ )#p Z, µτ k . (50)

2A, then by induction we find

SW2 (gk+1,τ θ )#p Z, µτ k+1 (k + 1)C τ + SW2(µ0, µτ 1). (51)

B.4 Sliced-Wasserstein results

Link for 1D supported measures. Let µ, ν P(Rd) supported on a line. For simplicity, we suppose that the measures are supported on an axis, i.e. µ(x) = µ1(x1) Qd i=2 δ0(xi) and ν(x) = ν1(x1) Qd i=2 δ0(xi).

In this case, we have that

W 2 2 (µ, ν) = W 2 2 (P e1 # µ, P e1 # ν) = Z 1

0 |F 1 P e1 # µ(x) F 1 P e1 # ν(x)|2 dx. (52)

On the other hand, let θ Sd 1, then we have

y R, FP θ #µ(y) = Z

R 1] ,y](x) P θ #µ(dx)

Rd 1] ,y]( θ, x ) µ(dx)

R 1] ,y](x1θ1) µ1(dx1)

θ1 ](x1) µ1(dx1)

= FP e1 # µ

Published in Transactions on Machine Learning Research (11/2022)

Therefore, F 1 P θ #µ(z) = θ1F 1 P e1 # µ(z) and

W 2 2 (P θ #µ, P θ #ν) = Z 1

0 |θ1F 1 P e1 # µ(z) θ1F 1 P e1 # ν(z)|2 dz

0 |F 1 P e1 # µ(z) F 1 P e1 # ν(z)|2 dz

= θ2 1W 2 2 (µ, ν).

Finally, using that R

Sd 1 θθT dλ(θ) = 1

d Id, we can conclude that

SW 2 2 (µ, ν) = Z

Sd 1 θ2 1W 2 2 (µ, ν)dθ = W 2 2 (µ, ν)

Closed-form between Gaussians. It is well known that there is a closed-form for the Wasserstein distance between Gaussians (Givens & Shortt, 1984). If we take α = N(µ, Σ) and β = N(m, Λ) with m, µ Rd and Σ, Λ Rd d two symmetric positive definite matrices, then

W 2 2 (α, β) = m µ 2 2 + Tr Σ + Λ 2(Σ 1 2 ΛΣ 1 2 ) 1 2 . (55)

Let α = N(µ, σ2Id) and β = N(m, s2Id) two isotropic Gaussians. Here, we have

W 2 2 (α, β) = µ m 2 2 + Tr(σ2Id + s2Id 2(σs2σId) 1 2 )

= µ m 2 2 + (σ s)2 Tr(Id)

= µ m 2 2 + d(σ s)2.

On the other hand, Nadjahi et al. (2021) showed (Equation 73) that

SW 2 2 (α, β) = µ m 2 2 d + (σ s)2 = W 2 2 (α, β)

In that case, the dilation of factor d between WGF and SWGF clearly appears.

For more complicated gaussians, we may not have this equality. For example, let α = N(µ, D), β = N(m, ) with D and diagonal. Then, P θ #α = N( µ, θ , PN i=1 θ2 i Di), P θ #β = N( m, θ , PN i=1 θ2 i i) and

W 2 2 (P θ # α, P θ # β) =

i θ2 i Di s X

with α, β the centered measures (noting T α : x 7 x µ, then α = T α #α). Hence, we have

SW 2 2 (α, β) = µ m 2 2 d + SW 2 2 ( α, β)

= µ m 2 2 d + Z

i θ2 i Di s X

= µ m 2 2 d + Z

i=1 θ2 i Di +

i=1 θ2 i i 2 s X

i,j θ2 i θ2 j Di j

= µ m 2 2 d +

Sd 1 θ2 i dλ(θ) +

Sd 1 θ2 i dλ(θ) 2 Z

i,j θ2 i θ2 j Di jdλ(θ)

= µ m 2 2 d + 1

i=1 (Di + i) 2 Z

i,j θ2 i θ2 j Di jdλ(θ),

Published in Transactions on Machine Learning Research (11/2022)

using that R

Sd 1 θθT dλ(θ) = 1 d Id and by applying Proposition 2 in (Nadjahi et al., 2021) to decompose

SW 2 2 (α, β) = µ m 2 2 d + SW 2 2 ( α, β).

On the other hand, we have

W 2 2 (α, β) = µ m 2 2 + Tr D + 2(D 1 2 D 1 2 ) 1 2

= µ m 2 2 + Tr(D + 2(D ) 1 2 )

= µ m 2 2 +

i=1 (Di + i 2D

= µ m 2 2 +

Since SW 2 2 (α, β) 1

d W 2 2 (α, β), we have Pd i=1 Di i d R

i,j θ2 i θ2 j Di jdλ(θ).

Let d = 2, σ, s > 0 and D = diag(σ2, σ2

2 ), = diag( s2

2 , s2). In this case, on the one hand, we have

On the other hand,

i,j θ2 i θ2 j Di jdλ(θ) =

(θ2 1 + θ2 2)2 + 1

2θ2 1θ2 2 dλ(θ)

S1 {θ1 =0,θ2 =0}

(θ2 1 + θ2 2)2 + 1

2θ2 1θ2 2 dλ(θ)

S1(θ2 1 + θ2 2) dλ(θ)

using that λ is absolutely continuous with respect to the Lebesgue measure and hence λ({θ1 = 0} {θ2 =

0}) = 0 and the fact that for every θ S1 {θ1 = 0, θ2 = 0}, q

(θ2 1 + θ2 2)2 + 1

2θ2 1θ2 2 > p

(θ2 1 + θ2 2)2 = θ2 1 + θ2 2. From this strict inequality, we deduce that W2 and d SW2 are not always equal, even in this restricted case. Hence, even in this simple case, we cannot directly conclude that we have a dilation term of d between Wasserstein and Sliced-Wasserstein gradient flows.

C Computation of the SW-JKO scheme in practice

C.1 Approximation of SW

For each inner optimization problem

µτ k+1 arg min µ P2(Rd)

SW 2 2 (µ, µτ k) 2τ + F(µ), (63)

we need to approximate the sliced-Wasserstein distance. To do that, we used Monte-Carlo approximate by sampling nθ directions (θi)nθ i=1 following the uniform distribution on the hypersphere Sd 1 (which can be done by using the stochastic representation, i.e. let Z N(0, Id), then θ = Z Z 2 Unif(Sd 1) (Fang et al., 1992)). Let µ, ν P(Rd), we approximate the sliced-Wasserstein distance as

d SW 2 2(µ, ν) = 1

i=1 W 2 2 (P θi # µ, P θi # ν). (64)

Published in Transactions on Machine Learning Research (11/2022)

Algorithm 2 SW-JKO with Discrete Grid

Input: µ0 the initial distribution with density ρ0, K the number of SW-JKO steps, τ the step size, F the functional, Ne the number of epochs to solve each SW-JKO step, (xj)N j=1 the grid

Let ρ(0) = ρ0(xj) PN

j=1 for k = 1 to K do

Initialize the weights ρ(k+1) (with for example a copy of ρ(k)) // Denote µτ k+1 = PN j=1 ρ(k+1) j δxj and µτ k = PN j=1 ρ(k) j δxj for i = 1 to Ne do

Compute J(µτ k+1) = 1 2τ SW 2 2 (µτ k, µτ k+1) + F(µτ k+1) Backpropagate through J with respect to ρ(k+1)

Perform a gradient step Project on the simplex ρ(k+1) using the algorithm of Condat (2016) end for end for

In practice, we compute it for empirical distributions ˆµn and ˆνm, and we approximate the one dimensional Wasserstein distance

W 2 2 (P θ #ˆµn, P θ #ˆνm) = Z 1

0 |F 1 P θ # ˆµn(u) F 1 P θ #ˆνm(u)|2 du (65)

by the rectangle method.

Overall, the complexity is in O(nθ(n log n + m log m)) where n (resp. m) denotes the number of particles of ˆµn (resp. ˆνm).

C.2 Algorithms to solve the SW-JKO scheme

We provide here the algorithms used to solve the SW-JKO scheme (17) for the discrete grid (Section 3.3) and for the particles (Section 3.3).

Discrete grid. We recall that in that case, we model the distributions as µτ k = PN i=1 ρ(k) i δxi where we use N samples located at (xi)N i=1 and (ρ(k) i )N i=1 belongs to the simplex Σn. Hence, the SW-JKO scheme at step k + 1 rewrites

min (ρi)i ΣN SW 2 2 (PN i=1 ρiδxi, µτ k) 2τ + F(

i=1 ρiδxi). (66)

We report in Algorithm 2 the whole procedure.

Particle scheme. In this case, we model the distributions as empirical distributions and we try to optimize the positions of the particles. Hence, we have µτ k = 1

N PN i=1 δx(k) i and the problem (17) becomes

N PN i=1 δxi, µτ k) 2τ + F( 1

i=1 δxi). (67)

In this case, we provide the procedure in Algorithm 3.

D Additional experiments

D.1 Dynamic of Sliced-Wasserstein gradient flows

The Fokker-Planck equation (8) is the Wasserstein gradient flow of the functional (6). Moreover, it is wellknown to have a counterpart stochastic differential equation (SDE) (see e.g. Mackey (1992, Chapter 11)) of

Published in Transactions on Machine Learning Research (11/2022)

Algorithm 3 SW-JKO with Particles

Input: µ0 the initial distribution, K the number of SW-JKO steps, τ the step size, F the functional, Ne the number of epochs to solve each SW-JKO step, N the batch size Sample (x(0) j )N j=1 µ0 i.i.d for k = 1 to K do

Initialize N particles (x(k+1) j )N j=1 (with for example a copy of (x(k) j )N j=1) // Denote µτ k+1 = 1

N PN j=1 δx(k+1) j and µτ k = 1

N PN j=1 δx(k) j for i = 1 to Ne do

Compute J(µτ k+1) = 1 2τ SW 2 2 (µτ k, µτ k+1) + F(µτ k+1)

Backpropagate through J with respect to (x(k+1) j )N j=1 Perform a gradient step end for end for

the form d Xt = V (Xt)dt + p

2β d Wt (68)

with (Wt)t a Wiener process. This SDE is actually the well-known Langevin equation. Hence, by approximating it using the Euler-Maruyama scheme, we recover the Unadjusted Langevin Algorithm (ULA) (Roberts & Tweedie, 1996; Wibisono, 2018).

For V (x) = 1

2(x m)T A(x m), (69)

with A symmetric and definite positive, we obtain an Ornstein-Uhlenbeck process (Le Gall, 2016, Chapter 8). If we choose µ0 as a Gaussian N(m0, Σ0), then we know the Wasserstein gradient flow µt in closed form (Wibisono, 2018; Vatiwutipong & Phewchean, 2019), for all t > 0, µt = N(mt, Σt) with ( mt = m + e t A(m0 m) Σt = e t AΣ0(e t A)T + A 1

2 (I e 2t A)(A 1

2 )T . (70)

Comparison of the evolution of the diffusion between SWGFs and WGFs. For this experiment, we model the density using Real NVPs (Dinh et al., 2016). More precisely, we use Real NVPs with 5 affine coupling layers, using FCNN for the scaling and shifting networks with 100 hidden units and 5 layers. In both experiments, we always start the scheme with µ0 = N(0, I) and take nθ = 1000 projections to approximate the sliced-Wasserstein distance. We randomly generate a target Gaussian (using make_spd_matrix from scikit-learn (Pedregosa et al., 2011) to generate a random covariance with 42 as seed).

We look at the evolution of the distributions learned between t = 0 and t = 4 with a time step of τ = 0.1. We compare it with the true Wasserstein gradient flow. On Figure 6a, we observe that they do not seem to match. However, they do converge to the same stationary value. On Figure 6b, we plot the functional along the true WGF dilated of a factor d = 2. We see here that the two curves are matching and we observed the same behaviour in higher dimension. Even though we cannot conclude on the PDE followed by SWGFs, this reinforces the conjecture that the SWGF obtained with a step size of τ

d (i.e. using the scheme (18)) is very close to the WGF obtained with a step size of τ. We also report here the evolution of the mean (Fig. 7) and of the variance (Fig. 8). For the mean, it follows as expected the same diffusion. For the variance, it is less clear but it is hard to conclude since there are potentially optimization errors.

Comparison between JKO-ICNN and SW-JKO. Following the experiment conducted by Mokrov et al. (2021) in section 4.2, we plot in Figure 9 the symmetric Kullback-Leibler (Sym KL) divergence over dimensions between approximated distributions and the true WGF at times t = 0.5 and t = 0.9. We take the mean over 15 random gaussians (generated using the scikit-learn function (Pedregosa et al., 2011) make_spd_matrix for the covariance matrices, and generating the means with a standard normal distribution) for dimensions d {2, . . . , 12}.

Published in Transactions on Machine Learning Research (11/2022)

0 1 2 3 4 t

(a) Without dilation

0 1 2 3 4 t

1.50 F( 2t)

(b) With dilation

Figure 6: Evolution of the functional equation 6 along the WGF µt and the learned SWGF ˆµt. We observe a dilation of parameter 2 between the WGF and the SWGF.

0 1 2 3 4 t

0 1 2 3 4 t

Figure 7: Evolution of the mean taking into account the dilation parameter. µ denotes the true mean of WGF, ˆµ the mean obtained through SW-JKO (17) with τ = 0.05 and µ the mean of the stationary measure. We observe that the mean of approximated measure obtained through SW-JKO seems to follow the one of the WGF.

For each target Gaussian, we run the SW-JKO dilated scheme (18) with τ = 0.05 for a Real NVP normalizing flow. We compare it with JKO-ICNN with also τ = 0.05 and with Euler-Maruyama with 103, 104 and 5 104 particles and a step size of 10 3. For JKO-ICNN, we use, as Mokrov et al. (2021), Dense ICNN with convex quadratic layers introduced in (Korotin et al., 2019) and available at https://github.com/ iamalexkorotin/Wasserstein2Barycenters. For the JKO-ICNN scheme, we use our own implementation.

We compute the symmetric Kullback-Leibler divergence between the ground truth of WGF µ and the distribution ˆµ approximated by the different schemes at times t = 0.5 and t = 0.9. The symmetric Kullback Leibler divergence is obtained as

Sym KL(µ , ˆµ) = KL(µ ||ˆµ) + KL(ˆµ||µ ). (71)

To approximate it, we generate 104 samples of each distribution and evaluate the density at those samples.

If we note gθ a normalizing flows, p Z the distribution in the latent space and ρ = (gθ)#p Z, then we can evaluate the log density of ρ by using the change of variable formula. Let x = gθ(z), then

log(ρ(x)) = log(p Z(z)) log | det Jgθ(z)|. (72)

Published in Transactions on Machine Learning Research (11/2022)

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 t

1.0 [ (2t)]00

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 t

0.0 [ (2t)]01

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 t

0.0 [ (2t)]10

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 t

Figure 8: Evolution of the components of the covariance matrix taking into account the dilation parameter. Σ denotes the true covariance matrix of WGF, ˆΣ the covariance matrix obtained through SW-JKO (17) with τ = 0.05 and Σ the covariance matrix of the stationary distribution. We observe some difference between WGF and SWGF.

We choose Real NVPs (Dinh et al., 2016) for the simplicity of the transformations and the fact that we can compute efficiently the determinant of the Jacobian (since we have a closed-form). A Real NVP flow is a composition of transformations T of the form

z Rd, x = T(z) = z1, exp(s(z1)) z2 + t(z1) (73)

where we write z = (z1, z2) and with s and t some neural networks. To modify all the components, we use also swap transformations (i.e. (z1, z2) 7 (z2, z1)). This transformation is invertible with log det JT (z) = P

i s(z1 i ).

For JKO-ICNN, we choose strictly convex ICNNs, and can hence invert them as well as compute the density. In this case, we do not have access to a closed-form for the Jacobian. Therefore, we used backpropagation to compute it. As this experiment is in low dimension, the computational cost is not too heavy. However, there exist stochastic methods to approximate it in greater dimension. We refer to (Huang et al., 2020) and (Alvarez-Melis et al., 2021) for more explanations.

We approximate the functional by using Monte-Carlo approximation as in Section 3.3.

For Euler-Maruyama, as in (Mokrov et al., 2021), we use kernel density estimation in order to approximate the density. We use the scipy implementation (Virtanen et al., 2020) gaussian_kde with the Scott s rule to choose the bandwidth.

Finally, we report on the Figure 9 the mean of the log of the symmetric Kullback-Leibler divergence over 15 Gaussians in each dimension and the 95% confidence interval.

For the training of the neural networks, we use an Adam optimizer (Kingma & Ba, 2014) with a learning rate of 10 4 for Real NVP (except for the 1st iteration where we take a learning rate of 5 10 3) and of 5 10 3

for JKO-ICNN. At each inner optimization step, we start from a deep copy of the last neural network, and optimize Real NVP for 200 epochs and ICNNs for 500 epochs, with a batch size of 1024.

We see on Figure 9 that the results are better than the particle schemes obtained with Euler-Maruyama (EM) with a step size of 10 3 and with either 103, 104 or 5 104 particles in dimension higher than 2. However, JKO-ICNN obtained better results.

Published in Transactions on Machine Learning Research (11/2022)

2 4 6 8 10 12 d

log10Sym KL

2 4 6 8 10 12 d

2 4 6 8 10 12 d

|F( t) F( t)|

2 4 6 8 10 12 d

EM 1K EM 10K EM 50K Real NVP-SWGF JKO-ICNN

Figure 9: On the left: Sym KL divergence at time 0.5 and 0.9 between the groundtruth of the Fokker-Planck equation at time t and the solution of the SW Gradient Flow at time t. On the right: Absolute error between the functionals evaluated for WGF and SWGF at time t.

2 10 20 30 40 50 60 70 80 90 100 d

log10Sym KL

Stationary distribution estimated at t = 8

EM 1K EM 10K

EM 50K Real NVP-SWGF

Figure 10: Symmetric KL divergence between the learned distribution at time t = 8 and the true stationary solution on Gaussians

D.2 Convergence to stationary distribution

Here, we want to demonstrate that, through the SW-JKO scheme, we are able to find good minima of functionals using simple generative models.

Gaussian. For this experiment, we place ourselves in the same setting of Section 4.1. We start from µ0 = N(0, I) and use a step size of τ = 0.1 for 80 iterations in order to match the stationary distribution. In this case, the functional is

F(µ) = Z V (x)dµ(x) + H(µ) (74)

with V (x) = 1

2(x b)T A(x b), and the stationary distribution is ρ (x) e V (x), hence ρ = N(b, A 1).

We generate 15 Gaussians for d between 2 and 12, and d {20, 30, 40, 50, 75, 100}. Due to the length of the diffusion, and to numerical unstabilities, we do not report results obtained with JKO-ICNN. In Figure 2, we showed the results in low dimension (for d {2, . . . , 12}) and the unstability of JKO-ICNN. We report on Figure 10 the Sym KL also in higher dimension.

We use 200 epochs of each inner optimization and an Adam optimizer with a learning rate of 5 10 3 for the first iteration and 10 3 for the rest. We also use a batch size of 1000 sample.

Bayesian logistic regression. For the Bayesian logistic regression, we have access to covariates s1, . . . , sn Rd with their associated labels y1, . . . , yn { 1, 1}. Following (Liu & Wang, 2016; Mokrov et al., 2021), we put as prior on the regression weights w, p0(w|α) = N(w; 0, 1

α) with p0(α) = Γ(α; 1, 0.01).

Published in Transactions on Machine Learning Research (11/2022)

Table 3: Number of features, of samples and batch size of each dataset.

covtype german diabetis twonorm ringnorm banana splice waveform image

features 54 20 8 20 20 2 60 21 18 samples 581012 1000 768 7400 7400 5300 2991 5000 2086 batch size 512 800 614 1024 1024 1024 512 512 1024

Table 4: Hyperparameters for SWGFs with Real NVPs. nl: number of coupling layers in Real NVP, nh: number of hidden units of conditioner neural networks, lr: learning rate using Adam, JKO steps: number of SW-JKO steps, Iters by step: number of epochs for each SW-JKO step, τ: the time step, batch size: number of samples taken to approximate the functional.

covtype german diabetis twonorm ringnorm banana splice waveform image

nl 2 2 2 2 2 2 5 5 2 nh 512 512 512 512 512 512 128 128 512 lr 2e 5 1e 4 5e 4 1e 4 5e 5 1e 4 5e 4 1e 4 5e 5

JKO steps 5 5 10 20 5 5 5 5 5 Iters by step 1000 500 500 500 1000 500 500 500 500 τ 0.1 10 6 5 10 6 10 8 10 6 0.1 10 6 10 8 0.1 batch size 1024 1024 1024 1024 1024 1024 1024 512 1024

Therefore, we aim at learning the posterior p(w, α|y):

p(w, α|y) p(y|w, α)p0(w|α)p0(α) = p0(α)p0(w|α)

i=1 p(yi|w, α)

where p(yi|w, α) = σ(w T si)

2 (1 σ(w T si)) 1 y

2 with σ the sigmoid. To evaluate V(µ) = R V (x) dµ(x), we resample data uniformly.

In our context, let V (x) = log p0(α)p0(w|α)p(y|w, α) , then using F(µ) = R V dµ + H(µ) as functional, we know that the limit of the stationary solution of Fokker-Planck is proportional to e V = p(w, α|y).

Following Mokrov et al. (2021); Liu & Wang (2016), we use the 8 datasets of Mika et al. (1999) and the covertype dataset (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html).

We report in Table 3 the characteristics of the different datasets. The datasets are loaded using the code of Mokrov et al. (2021) (https://github.com/Petr Mokrov/Large-Scale-Wasserstein-Gradient-Flows). We split the dataset between train set and test set with a 4:1 ratio.

We report in Table 4 the hyperparameters used for the results reported in Table 1. We also tuned the time step τ since for too big τ, we observed bad results, which the SW-JKO scheme should be a good approximation of the SWGF only for small enough τ.

Moreover, we reported in Table 1 the mean over 5 training. For the results obtained with JKO-ICNN, we used the same hyperparameters as Mokrov et al. (2021).

D.3 Influence of the number of projections

It is well known that the approximation of Sliced-Wasserstein is subject to the curse of dimensionaly through the Monte-Carlo approximation (Nadjahi et al., 2020). We provide here some experiment to quantify this influence. However, first note that the goal is not to minimize the Sliced-Wasserstein distance, but rather the functional, SW playing mostly a regularizer role. Experiments on the influence of the number of experiments to approximate the SW have already been conducted (see e.g. Figure 2 in (Nadjahi et al., 2020) or Figure 1 in (Deshpande et al., 2019)).

Published in Transactions on Machine Learning Research (11/2022)

Figure 11: Density over time of the solution of the aggregation equation learned over the discre grid.

Figure 12: Density over time of the solution of the aggregation equation by learning particles.

Here, we take the same setting of Section 4.1, i.e. we generate 15 random Gaussians, and then vary the number of projections and report the Symmetric Kullback-Leibler divergence on Figure 3. We observe that the results seem to improve with the number of projections until it reach a certain plateau. The plateau seems to be attained for a bigger number of dimension in high dimension.

D.4 Aggregation equations

Here, we use as functional

W(µ) = ZZ W(x y)dµ(x)dµ(y). (75)

Carrillo et al. (2021) use a repulsive-attractive interaction potential, for a > b 0,

using the convention x 0

0 = ln( x ). For some values of a and b, there is existence of stable equilibrium state (Balagué et al., 2013).

Dirac Ring. First, for the Dirac ring example, we take a = 4 and b = 2. Then, W is a repulsive-attractive interaction potential (repulsive in the short range, and attractive in the long range because b < a (Balagué

Published in Transactions on Machine Learning Research (11/2022)

Figure 13: Density over time of the solution of the aggregation equation approximated with a FCNN.

Figure 14: Density over time of the solution of the aggregation equation approximated with JKO-ICNN.

et al., 2013)), we show the densities over time obtained on a discrete grid (Figure 11), by learning the position of the particles (Figure 12) and with the FCNN (Figure 13). For the last two, the density reported is obtained with kernel density estimation where we chose by hand the bandwidth to match well the sampled points. We also report the evolution of the density for the JKO-ICNN scheme. In Figure 14, we show the evolution of the density obtained by the change of variable formula. We observe that values seem to explode which may due to the fact that the stationary solution may not have a density or to numerical unstabilities. We report in Figure 15 the densities obtained with a kernel density estimation. We also report the particles generated through a FCNN, the particle evolution and JKO-ICNN in Figure 16. We observe that the particles match perfectly the ring. For the FCNN, there seem to be some noise. JKO-ICNN recover also well the ring but particles seem to not be uniformly distributed over the ring.

For the SW-JKO scheme, we take τ = 0.05 and run it for 200 steps (from t = 0 to t = 10) starting from µ0 = N(0, I). For JKO-ICNN, we choose τ = 0.1 and run it for 100 steps as the diffusion was really long.

We take a grid of 50 50 samples on [ 1, 1]2. To optimize the weights, we used an SGD optimizer with a momentum of 0.9 with a learning rate of 10 4 for 300 epochs by inner optimization scheme. For particles, we optimized 1000 particles (sampled initially from µ0) with an SGD optimizer with a momentum of 0.9, a learning rate of 1 and 500 epochs by JKO step. We take a FCNN composed of 5 hidden layers with 64 units and leaky relu activation functions. They are optimized with an Adam optimizer with a learning rate of 10 4 (except for the first iteration where we take a learning rate of 5 10 3) for 400 epochs by JKO step. For

Published in Transactions on Machine Learning Research (11/2022)

Figure 15: Density over time of the solution of the aggregation equation approximated with JKO-ICNN and kernel density estimator.

(a) Samples from the FCNN

(b) Samples learned by optimizing over particles

(c) Samples from JKO-ICNN

Figure 16: Samples of the stationary distribution

JKO-ICNN, we choose the same parameters as for the previous experiments. In any case, we take nθ = 104

projections to approximate SW2 and a batch size of 1000.

On Figure 17, we plot on the two first columns the stationary density learned by the different methods. On the third column, we plot the evolution of the functional along the flows.

The functionals seem to converge towards the same value. However, JKO-ICNN seems to not be able to capture the right distribution. We also note that the curve for the FCNN is not very smooth, which can probably be explained by the fact that we take independent samples at each step.

On Figure 18, we show the evolution of the functionals along different learned flows. We observe that for the discretized grid, we obtain the worse results which is understandable since we have discretization error. The particle is the most stable as particle s position do not move anymore once the stationary state is reached. For the FCNN, we observe oscillations which are due to the fact that we take independent samples at each time t. Finally, the JKO-ICNN scheme seems to converge toward the same value as the SW-JKO scheme with FCNNs.

We also tried to use normalizing flows with this functional. We observed that the training seems harder than with the FCNN. Indeed, with simple flows such as Real NVP, the model has a lot of troubles of learning the Dirac ring, probably because of the hole. More generally, since normalizing flows are bijective transformations, they must preserve topological properties, and therefore do not perform well when the

Published in Transactions on Machine Learning Research (11/2022)

(a) Steady state on the discretized grid

(b) Steady state on the discretized grid

(c) Evolution of the functional

(d) Steady state for the fully connected neural network

(e) Steady state for the fully connected neural network

(f) Evolution of the functional

(g) Steady state for particles

(h) Steady state for particles

(i) Evolution of the functional

(j) Steady state for JKO-ICNN with KDE

(k) Steady state for JKOICNN with KDE

(l) Evolution of the functional

Figure 17: Steady state and evolution of the functional of the aggregation equation for a = 4, b = 2.

standard distribution and the target distribution do not share the same topology (e.g. do not have the same number of connected components or "holes" as it is explained in (Cornish et al., 2020)). For CPF, it worked slightly better, but was not able to fully recover the ring (at least at time t = 5) as we can see on Figure 19. Morover, the training time for CPF was really huge compared to the FCNN.

Other functionals. As in section 3 of Carrillo et al. (2021), we also tried to use

2 log( x ) (77)

as interaction potential with a FCNN. We find well that the steady state is an indicator function on the centered disk of radius 1 (Figure 20).

Published in Transactions on Machine Learning Research (11/2022)

0 2 4 6 8 10 t

Grid Particles JKO-ICNN FCNN

5 6 7 8 9 10 t

Grid Particles JKO-ICNN FCNN

Figure 18: Evolution of the aggregation functional along different flows.

(a) Density learned by SWGF with CPF at time t = 5

(b) Density learned by JKO-ICNN at time t = 4

Figure 19: Density learned for the aggregation equation for CPF and JKO-ICNN.

Another possible functional without using internal energies is to add a drift term Z V (x)ρ(x)dx. (78)

Then, the Wasserstein gradient flow is solution to

tρt = div(ρ (W ρ)) + div(ρ V ). (79)

Carrillo et al. (2021) use W(x) = x 2

2 log( x ) and V (x) = α

β log( x ) with α = 1 and β = 4. Then, it can be shown (see Carrillo et al. (2021); Chen & Kolokolnikov (2014); Carrillo et al. (2015)) that the steady state is an indicator function on a torus of inner radius Ri = q

α β and outer radius Ro = q

α β + 1 which we

observe on Figure 21 and Figure 22.

For this last experiment, we observed some unstabily issues in the training phase. It may be due to nonlocality of the interaction potential W as it is stated in (Carrillo et al., 2021).

D.5 Sliced-Wasserstein flows

In this experiment, we aim at minimizing the following functional:

2SW 2 2 (µ, ν) + λH(µ) (80)

Published in Transactions on Machine Learning Research (11/2022)

(a) Density of the steady state

(b) Density of the steady state

(c) Evolution of the functional over time

Figure 20: Steady state and evolution of the functional for W(x) = x 2

2 log( x ).

Figure 21: Density over time of the solution of the aggregation-drift equation approximated with a FCNN.

where ν is some target distribution from which we have access to samples. In this section, we use MNIST (Le Cun & Cortes, 2010), Fashion MNIST (Xiao et al., 2017) and Celeb A (Liu et al., 2015).

We report the Fréchet Inception Distance (FID) (Heusel et al., 2017) for MNIST and Fashion MNIST between 104 test samples and 104 generated samples. As they are gray images, we duplicate the gray levels into 3 channels. Moreover, we use the code of Dai & Seljak (2021) (available at https://github.com/biweidai/ SINF) and reported their result for SWF (in the ambient space) and SWAE in Table 2. For SWF in the latent space, we used our own implementation.

With a pretrained autoencoder. First, we optimize the functional in the latent space of a pretrained autoencoder (AE). We choose the same AE as Liutkus et al. (2019) which is available at https://github. com/aliutkus/swf/blob/master/code/networks/autoencoder.py. We report results for a latent space of dimension d = 48.

In this latent space, we applied a FCNN and a Real NVP. In either case, we used nθ = 103 projections to approximate SW2. We chose a batch size of 128 samples. The FCNN was chosen with 5 hidden layers of 512 units with leaky relu activation function. The Real NVP was composed of 5 coupling layers, with FCNN as scaling and shifting networks (with 5 layers and 100 hidden units). We trained the networks at each inner optimization step with a learning rate of 5 10 3 for Real NVP (and 10 2 for the first iteration) and of 10 3

for the FCNN (and 5 10 3 for the first itration) during 1000 epochs.

Published in Transactions on Machine Learning Research (11/2022)

(a) Density of the steady state

(b) Density of the steady state

(c) Evolution of the functional over time

Figure 22: Steady state and evolution of the functional for the aggregation-drift equation.

(a) FID=19.3

(b) FID=41.7

Figure 23: Generated sample obtained through a pretrained decoder + FCNN.

Following Liutkus et al. (2019), we choose τ = 0.5 and λ = 0. We run it for 10 outer iterations.

We report in Figure 23 the result obtained with FCNN on MNIST and Fashion MNIST. Overall, the results seem quite comparable and our method seems to perform well compared to other methods applied in latent spaces.

On Figure 24, we report samples obtained with Real NVP on MNIST, Fashion MNIST and Celeb A. For Celeb A, we choose the same autoecoder with a latent space of dimension 48 and we ran the SW-JKO scheme for 20 steps with τ = 0.1.

We can also optimize directly particles in the latent space and we show it on Celeb A on Figure 25 for τ = 0.1 and for 10 steps.

In the Original Space. We report here the results obtained by running the SW-JKO scheme in the original spaces of images, which are very high dimensional (d = 784 for MNIST and Fashion MNIST). We obtained worse results than in the latent space. Notice that we used only 1000 projections, and ran it for 50 outer iterations with a step size of τ = 5.

On Figure 26, we use a Real NVP and add a uniform dequantization (Ho et al., 2019) and learned it in the logit space as it is done in (Dinh et al., 2016; Papamakarios et al., 2017) because using normalizing flows need continuous data. The Real NVP is composed here of 2 coupling layers with FCNN also composed of 2 layers and 512 hidden units. We choose τ = 5, and run it for 20 steps. At each inner optimization problem, the neural networks are optimized with an Adam optimizer with a learning rate of 10 2 and for 500 epochs.

Published in Transactions on Machine Learning Research (11/2022)

(a) FID=17.8

(b) FID=40.6

(c) FID=90.8

Figure 24: Generated sample obtained through a pretrained decoder + Real NVP.

Figure 25: Particle through SW-JKO with τ = 0.1 for 10 steps.

We also report results obtained using a convolutional neural network (CNN) in Figure 27. The idea here is that we can capture inductive bias, as it is well known that CNNs are efficient for image-related tasks. We obtained a slightly better FID on MNIST, but worse results on Fashion MNIST. In term of quality of image, the generated samples do not seem better.

For the CNN, we choose a latent space of dimension 100, and we first apply a linear layer into a size of 128 7 7. Then we apply 3 convolutions layers of (kernel_size, stride, padding) being respectively (4, 2, 1), (4, 2, 1), (3, 1, 1). All layers are followed by a leaky Re LU activation, and a sigmoid is applied on the output.

Published in Transactions on Machine Learning Research (11/2022)

(a) FID=88.1

(b) FID=95.5

Figure 26: Generated sample obtained in the original space with Real NVP.

(b) FID=102

Figure 27: Generated sample obtained in the original space with CNN.

Nevertheless, samples obtained in the space of image look better than those obtained through the particle scheme induced in (Liutkus et al., 2019).