# the_wasserstein_proximal_gradient_algorithm__3039dd80.pdf

The Wasserstein Proximal Gradient Algorithm

Adil Salim Visual Computing Center KAUST adil.salim@kaust.edu.sa

Anna Korba Gatsby Computational Neuroscience Unit University College London a.korba@ucl.ac.uk

Giulia Luise Computer Science Department University College London g.luise16@ucl.ac.uk

Wasserstein gradient ﬂows are continuous time dynamics that deﬁne curves of steepest descent to minimize an objective function over the space of probability measures (i.e., the Wasserstein space). This objective is typically a divergence w.r.t. a ﬁxed target distribution. In recent years, these continuous time dynamics have been used to study the convergence of machine learning algorithms aiming at approximating a probability distribution. However, the discrete-time behavior of these algorithms might differ from the continuous time dynamics. Besides, although discretized gradient ﬂows have been proposed in the literature, little is known about their minimization power. In this work, we propose a Forward Backward (FB) discretization scheme that can tackle the case where the objective function is the sum of a smooth and a nonsmooth geodesically convex terms. Using techniques from convex optimization and optimal transport, we analyze the FB scheme as a minimization algorithm on the Wasserstein space. More precisely, we show under mild assumptions that the FB scheme has convergence guarantees similar to the proximal gradient algorithm in Euclidean spaces.

1 Introduction

The task of transporting an initial distribution µ0 to a target distribution µ is common in machine learning. This task can be reformulated as the minimization of a cost functional deﬁned over the set of probability distributions. Wasserstein gradient ﬂows [2] are suitable continuous time dynamics to minimize such cost functionals. These ﬂows have found applications in various ﬁelds of machine learning such as reinforcement learning [32,40], sampling [5,12,16,38] and neural networks optimization [14, 26]. Indeed, Wasserstein gradient ﬂows can be seen as the continuous limit of several discrete time machine learning algorithms. The analysis of continuous time dynamics is often easier than the analysis of their discrete time counterparts. Therefore, many works focus solely on continuous time analyses of machine learning algorithms such as variants of gradient descent [7,14,15,26,34,37]. Besides, although discretized Wasserstein gradient ﬂows have been proposed in the literature [2,8,11,19,24,38], most of them have not been studied as minimization algorithms.

In this paper, we focus on the resolution, by a discrete time algorithm, of a minimization problem deﬁned on the set P2(X) of probability measures µ over X = Rd such that R x 2dµ(x) < . More precisely, µ is deﬁned as a solution to

min µ P2(X) G(µ) := EF (µ) + H(µ), (1)

34th Conference on Neural Information Processing Systems (Neur IPS 2020), Vancouver, Canada.

where EF is a potential energy EF (µ) = R F(x)dµ(x) tied to a smooth convex function F : Rd R, and H is a nonsmooth term convex along the generalized geodesics deﬁned by the Wasserstein distance. The potential EF plays the role of a data ﬁtting term whereas H can be seen as a regularizer.

Motivation for studying the template problem (1). Many free energy minimization problems can be cast as Problem (1), see [1, Proposition 7.7] or more generally [2, Section 9]. For instance, H can be an internal energy [2, Example 9.3.6]. In particular, if H is the negative entropy, then G boils down to the Kullback Leibler divergence (up to an additive constant) w.r.t. the Gibbs measure µ exp( F). This remark has been used by several authors to study sampling tasks as minimizing Problem (1) [5,12,16,20,38]. Another example where H is an internal energy is the case where H is a higher order entropy. In this case, µ follows a Barenblatt proﬁle. Moreover, in the context of optimization of inﬁnitely wide two layers neural networks [14,26], µ denotes the optimal distribution over the parameters of the network. In this setting, F(x) = 2 R k(x, y)dµ (y) is non convex and H(µ) = R k(x, y)dµ(x)dµ(y) is an interaction energy [2, Example 9.3.4], with k depending on the activation functions of the network. Moreover, G boils down to a Maximum Mean Discrepancy w.r.t. µ [3] under a well-posedness condition. Alternatively, F can be a regularizer of the distribution on the parameters of the network (see Appendix).

Related works. Wasserstein gradient ﬂows are continuous time dynamics that can minimize (1). Several time discretizations of such ﬂows have been considered [2,33,38]. However, these discretization schemes have been mainly analyzed as numerical schemes to approximate the continuous gradient ﬂow, rather than as optimization algorithms.

Numerous optimization algorithms to solve (1), relying on different time-discretization schemes of the Wasserstein gradient ﬂow of G, have been proposed previously. For instance, [17,33,40] rely on the implementation of the JKO (Jordan-Kinderlehrer-Otto) scheme [19], which can be seen as a proximal scheme (i.e., backward discretization) in the Wasserstein distance. In this case, each step of the algorithm relies on evaluating the JKO operator of G, exploiting efﬁcient subroutines for this operator. When G is smooth, some gradient descent algorithms over the P2(X) (i.e., forward discretizations) have also been proposed [13,20,21]. However, their analysis is notably challenging without further convexity assumptions on G. For example [13], which tackles the resolution of the Wasserstein barycenter problem (a nonconvex problem) via gradient descent, considers Gaussian input distributions. Another time-discretization of Wasserstein gradient ﬂows can be found in the Langevin Monte Carlo (LMC) algorithm in the sampling literature. LMC can be seen as a splitting algorithm involving a gradient step for EF and an exact solving of the ﬂow of H (Forward-Flow discretization [38]) to solve (1) in the case where H is the negative entropy. Indeed, the Wasserstein gradient ﬂow of the negative entropy can be computed exactly in discrete time.1 Several works provide non asymptotic analyses of LMC in discrete time, see [5,12,16,22,29,35,39] among others. However, the convergence rate of the LMC algorithm does not match the convergence rate of the associated continuous time dynamics and LMC is a biased algorithm.

A natural approach to minimize the sum of a smooth and a nonsmooth convex functions over a Hilbert space is to apply the proximal gradient algorithm [4], which implements a gradient step for the smooth term and a proximal step for the nonsmooth one. Indeed, many nonsmooth regularizers admit closed form proximity operators2. In this work we propose and analyze a splitting algorithm, the Forward-Backward (FB) scheme, to minimize the functional G over P2(X). The FB scheme can be seen as an analogue to the proximal gradient algorithm in the Wasserstein space. This algorithm has already been studied (but no convergence rate has been established) in [38] in the particular case where H is the negative entropy. Notably, [38, Section F.1] explains why the FB scheme should be unbiased, unlike Forward-Flow discretizations like LMC.

More precisely, the proposed algorithm implements a forward (gradient step) for the smooth term EF and relies on the JKO operator for the nonsmooth term H [19] only, which plays the role of a proximity operator. This approach provides an algorithm with lower iteration complexity compared to the "full" JKO scheme applied to EF + H.

Contributions. In summary, the Wasserstein gradient ﬂows to minimize (1) are well understood and modelize many machine learning algorithms. Discretized Wasserstein gradient ﬂows have been

1The addition of a gaussian noise corresponds to a process whose distribution is solution of the gradient ﬂow of the relative entropy (the heat ﬂow). 2see www.proximity-operator.net

proposed but little is known about their minimization power. In this work, we propose a natural optimization algorithm to solve Problem (1), which is a Forward-Backward discretization of the Wasserstein gradient ﬂow. This algorithm is a generalization of a discretization scheme proposed in [38]. Our main contribution is to prove non asymptotic rates for the proposed scheme, under the assumptions that F is smooth and convex and that H is convex along the generalized geodesics deﬁned by the Wasserstein distance. We show that the obtained rates fortunately match the ones of the proximal gradient algorithm over Hilbert spaces.

The remainder is organized as follows. In Section 2 we provide some background knowledge in optimal transport and gradient ﬂows. In Section 3, we introduce the Forward-Backward Euler discretization scheme. We study the FB scheme as an optimization algorithm and present our main results, i.e., non-asymptotic rates for the resolution of (1) in Section 4. In Section 5 we illustrate the performance of our algorithm for a simple sampling task. The convergence proofs are postponed to the appendix.

2 Preliminaries

In this section, we introduce the notations and recall fundamental deﬁnitions and properties on optimal transport and gradient ﬂows that will be used throughout the paper.

2.1 Notations

In the sequel, P2(X) is the space of probability measures µ on X with ﬁnite second order moment. Denote B(X) the Borelian σ-ﬁeld over X. For any µ P2(X), L2(µ) is the space of functions f : (X, B(X)) (X, B(X)) such that R f 2dµ < . Note that the identity map I is an element of L2(µ). For any µ P2(X), we denote by µ and , µ respectively the norm and the inner product of the space L2(µ). For any measures µ, ν, we write µ ν if µ is absolutely continuous with respect to ν, and we denote Leb the Lebesgue measure over X. The set of regular distributions of the Wasserstein space is denoted by Pr 2(X) := {µ P2(X), µ Leb}. If f, g : X X, the composition f g of g by f is sometimes denoted f(g).

2.2 Optimal transport

For every measurable map T deﬁned on (X, B(X)) and for every µ P2(X), we denote T#µ the pushforward measure of µ by T characterized by the transfer lemma , i.e.: Z φ(y)d T#µ(y) = Z φ(T(x))dµ(x) for any measurable and bounded function φ. (2)

Consider the 2-Wasserstein distance deﬁned for every µ, ν P2(X) by

W 2(µ, ν) := inf υ Γ(µ,ν)

Z x y 2dυ(x, y), (3)

where Γ(µ, ν) is the set of couplings between µ and ν [36], i.e. the set of nonnegative measures υ over X X such that P#υ = µ (resp. Q#υ = ν) where P : (x, y) 7 x (resp. Q : (x, y) 7 y) is the projection onto the ﬁrst (resp. second) component.

We now recall the well-known Brenier theorem [9], [2, Section 6.2.3].

Theorem 1. Let µ Pr 2(X) and ν P2(X). Then,

1. There exists a unique minimizer υ of (3). Besides, there exists a uniquely determined µ-almost everywhere (a.e.) map T ν µ : X X such that υ = (I, T ν µ)#µ where (I, T ν µ) : (x, y) 7 (x, T ν µ(x)). Finally, there exists a convex function f : X R such that T ν µ = f µ-a.e.

2. As a corollary,

W 2(µ, ν) = Z x f(x) 2dµ(x) = inf T : T#µ=ν

Z x T(x) 2dµ(x). (4)

3. If g : X R is convex, then g is well deﬁned µ-a.e. and if ν = g#µ, then T ν µ = g µ-a.e.

4. If ν Pr 2(X), then T µ ν T ν µ = I µ-a.e. and T ν µ T µ ν = I ν-a.e.

Under the assumptions of Theorem 1, the map T ν µ is called the optimal transport (OT) map from µ to ν. In this paper, as it is commonly the case in the literature, we may refer to the space of probability distributions P2(X) equipped with the 2-Wasserstein distance as the Wasserstein space.

2.3 Review of Gradient Flows and their discretizations

2.3.1 In an Euclidean space

Assume that X is a Euclidean space, consider a proper lower semi-continuous function G : X ( , + ] and denote by D(G) = {x X, G(x) < } its domain. We assume that G is convex, i.e., for every x, z X and for every ε [0, 1], we have:

G(εz + (1 ε)x) εG(z) + (1 ε)G(x). (5)

Given x X, recall that y X is a subgradient of G at x if for every z X,

G(x) + y, z x G(z).

The (possibly empty) set of subgradients of G at x is denoted by G(x), and the map x 7 G(x) is called the subdifferential. If G is differentiable at x, then G(x) = { G(x)} where G(x) is the gradient of G at x. The subdifferential of the convex function G allows to deﬁne the gradient ﬂow of G: for every initial condition x(0) = a such that G(a) = , there exists a unique absolutely continuous function x : [0, + ) X that solves the differential inclusion [10, Th. 3.1], [30, Th. 2.7] x (t) G(x(t)). (6) One can check that the gradient ﬂow of G is also characterized by the following system of Evolution Variational Inequalities (EVI) :

z D(G), d dt x(t) z 2 2 (G(x(t)) G(z)) .

In contrast to (6), the former characterization allows to deﬁne the gradient ﬂow without using the notion of subdifferential, a property that can be practical in nonsmooth settings. Moreover, the non-asymptotic analysis of discretized gradient ﬂows in the optimization literature often relies on discrete versions of the EVI.

The existence of Gradient Flows can be established as the limit of a proximal scheme [30, Th. 2.14], [6, Th. 5.1] when the step-size γ 0. Deﬁning the proximity operator of G as:

proxγG(x) := arg min y X G(y) + 1

2γ y x 2, (7)

the proximal scheme is written as xn+1 = proxγG(xn), (8) which corresponds to the proximal point algorithm to minimize the function G, see [23]. The proximal scheme can be seen as a Backward Euler discretization of the gradient ﬂow. Indeed, writing the ﬁrst order conditions of (8), we have

xn+1 xn γ G(xn+1), or equivalently xn+1 xn

Hence, each iteration of the proximal scheme requires solving an equation which can be intractable in many cases. The Forward Euler scheme is a more tractable integrator of the gradient ﬂow of G, but is less stable and requires the differentiability of G. Under this assumption, this scheme is written xn+1 xn

γ = G(xn) or equivalently xn+1 = xn γ G(xn), (9)

which corresponds to the well-known gradient descent algorithm to minimize the function G. Consider now the case where the function G can be decomposed as G = F +H, where F is convex and smooth

and H is convex and nonsmooth. To integrate the gradient ﬂow of G = F + H, another approach is to use the Forward and the Backward Euler schemes for the smooth term and nonsmooth term respectively [30]. This approach is also motivated by the fact that in many situations, the function H is simple enough to implement its proximity operator proxγH. If G = F +H, the Forward Backward Euler scheme is written as xn+1 xn

γ F(xn) H(xn+1). (10)

Recalling the deﬁnition of the proximity operator, this scheme can be rewritten as

xn+1 = proxγH(xn γ F(xn)), (11)

which corresponds to the proximal gradient algorithm to minimize the composite function G.

2.3.2 In the Wasserstein space

Consider a proper lower semi-continuous functional G : P2(X) ( , + ] and denote D(G) = {µ P2(X), G(µ) < } its domain. We assume that G is convex along generalized geodesics deﬁned by the 2-Wasserstein distance [2, Chap. 9], i.e. for every µ, π P2(X), ν Pr 2(X) and for every ε [0, 1], G((εT π ν + (1 ε)T µ ν )#ν) εG(π) + (1 ε)G(µ), (12) where T π ν and T µ ν are the optimal transport maps from ν to π and from ν to µ respectively. Note that the curve ε 7 (εT π ν + (1 ε)T µ ν )#ν interpolates between µ (ε = 0) and π (ε = 1). Moreover, if ν = π or ν = µ, then this curve is a geodesic in the Wasserstein space. Given µ P2(X), ξ L2(µ) is a strong Fréchet subgradient of G at µ [2, Chap. 10] if for every φ L2(µ),

G(µ) + ε ξ, φ µ + o(ε) G((I + εφ)#µ).

The (possibly empty) set of strong Fréchet subgradients of G at µ is denoted G(µ). The map µ 7 G(µ) is called the strong Fréchet subdifferential. Conveniently, the strong Fréchet subdifferential enables to deﬁne the gradient ﬂow of the functional G [2, Chap. 11]. However in the nonsmooth setting that will be considered in this paper, the characterization of gradient ﬂows through EVI will be more practical. The gradient ﬂow of G is the solution of the following system of EVI [2, Th. 11.1.4]:

π D(G), d dt W 2(µ(t), π) 2 (G(µ(t)) G(π)) .

We shall perform a non-asymptotic analysis of a discretized gradient ﬂow scheme to minimize the functional G. Our approach is to prove a discrete EVI for this scheme.

The existence of gradient ﬂows can be established as the limit of a minimizing movement scheme [2, Th. 11.2.1], [19]. Deﬁning the JKO operator of G as:

JKOγG(µ) := arg min ν P2(X) G(ν) + 1

2γ W 2(ν, µ), (13)

the JKO scheme is written as µn+1 JKOγG(µn). The JKO operator can be seen as a proximity operator by replacing the Wasserstein distance by the Euclidean distance. Moreover, the JKO scheme can be seen as a Backward Euler discretization of the gradient ﬂow. Indeed, under some assumptions on the functional G, using [2, Lemma 10.1.2] we have

T µn µn+1 I

Using Brenier s theorem, since T µn µn+1 T µn+1 µn = I µn-a.e., there exists a strong Fréchet subgradient of G at µn+1 denoted by W G(µn+1) such that

µn+1 = I γ W G(µn+1) T µn+1 µn

Each iteration of the JKO scheme thus requires the minimization of a function which can be intractable in many cases. As previously, the Forward Euler scheme is more tractable and enjoys additionally

a simpler geometrical interpretation. Assume G(µ) = { G(µ)} is a singleton for any µ D(G) (some examples are given [2, Section 10.4]). The Forward Euler scheme for the gradient ﬂow of G is written: µn+1 = (I γ G(µn))#µn, (14) and corresponds to the iterations of the gradient descent algorithm over the Wasserstein space to minimize G. Although the Wasserstein space is not a Riemannian manifold, it can still be equipped with a Riemannian structure and interpretation [25,28]. In particular, the Forward Euler scheme can be seen as a Riemannian gradient descent where the exponential map at µ is the map φ 7 (I + φ)#µ deﬁned on L2(µ).

3 The Forward Backward Euler scheme

Recall that our goal is to minimize EF + H, where EF (µ) = R F(x)dµ(x) for any µ P2(X), and H is a nonsmooth functional. Throughout this paper, we assume the following on the potential function F: there exists L, λ 0 such that

A1. F is L-smooth i.e. F is differentiable and F is L-Lipschitz continuous; for all (x, y) X 2:

F(y) F(x) + F(x), y x + L

2 x y 2. (15)

A2. F is λ-strongly convex (we allow λ = 0); for all (x, y) X 2:

F(x) F(y) F(x), y x λ

2 x y 2. (16)

Moreover, we assume the following on the function H.

B1. H : P2(X) ( , + ] is proper and lower semi-continuous. Moreover, D(H) Pr 2(X). B2. There exists γ0 > 0 such that γ (0, γ0), JKOγH(µ) = for every µ P2(X). B3. H is convex along generalized geodesics.

Assumptions B1 and B2 are general technical assumptions, used in [2] (see [2, Eq. 10.1.1a, 10.1.1b]) that are satisﬁed in relevant cases. Then [2, Prop. 9.3.2, 9.3.5 and 9.3.9] gives broad examples of (potential, interaction and internal) energies satisfying B3, e.g., potential (resp. interaction) energies if the potential (resp. interaction) term is convex, and entropies (see Appendix and [2, Remark 9.3.10]). Moreover, H is often nonsmooth, see e.g. [38, Section 3.1.1]. Therefore, we use a Forward Backward Euler scheme to integrate the gradient ﬂow of G. Let γ > 0 a step size. The proposed Forward Backward Euler scheme is written, for n 0, as:

νn+1 = (I γ F)#µn (17) µn+1 JKOγH(νn+1). (18)

This scheme can be seen as a proximal gradient algorithm over the Wasserstein space to minimize the composite function G = EF + H. Remark 1. To our knowledge, the two cases where JKOγH can be computed in closed form are the case where H(µ) = R Rdµ and R is a proximable function 3, see [8]; and the case where H is the negative entropy and F is quadratic, see [38, Example 8] and Section 5. However, there exist subroutines to compute JKOs of generic functionals, as well documented in the optimal transport and PDE literature (see the review of different strategies in [33]). The implementation of the FB scheme relies on these subroutines, similarly to proximal splitting algorithms in optimization that rely on subroutines to compute the proximity operator. Moreover, as several proximity operators admit a close form4 one can hope to be able to compute JKOγH for simple functionals H.

For instance, when H is the negative entropy (deﬁned by H(µ) = R log (µ(x)) dµ(x) if µ Leb with density µ and H(µ) := + else), we conjecture that a technique similar to [18] can be

3i.e., R is a lower semi-continuous proper convex function whose proximity operator has a closed form 4see www.proximity-operator.net

applied. Again, for this particular functional, the JKO is known in closed form in the Gaussian case [38, Example 8]. Other works of interest for the proposed FB scheme have investigated efﬁcient methods to implement the JKO of a generic functional with respect to the entropy-regularized Wasserstein distance [31].

In the next section, we study the non asymptotic properties of the FB scheme.

4 Non asymptotic analysis

In this section, we provide rates for the Wasserstein proximal gradient algorithm. The main technical challenge is to handle the fact that H is only convex along generalized geodesics (and not along any interpolating curve). We overcome this challenge by using the intuition that the Wasserstein proximal gradient algorithm is a discretization of the associated Wasserstein gradient ﬂow. More precisely, the main step is to establish a discrete EVI to prove the rates. The proof differs from the one of the proximal gradient algorithm because the convexity inequality can be used with optimal transport maps only, see Lemma 4.

We consider a ﬁxed step size γ < 1/L and a probability distribution π P2(X). Our main result (Proposition 8) combines several ingredients: the identiﬁcation of the optimal transport maps between µn, νn+1 and µn+1 (see Equations (17) and (18)), the proof of a generic lemma regarding generalized geodesic convexity (Lemma 4) and a proof of a discrete EVI (Lemma 6).

4.1 Identiﬁcation of optimal transport maps

Lemmas 2,3 identify the optimal transport maps from µn to νn+1 and from νn+1 to µn+1 in the Forward Backward Euler scheme, as soon as the step size is sufﬁciently small. In particular, Lemma 3 is a consequence of [2, Lemma 10.1.2]. Lemma 2. Assume A1. Let µ Pr 2(X) and ν = (I γ F)#µ. Then if γ < 1/L, the optimal transport map from µ to ν corresponds to T ν µ = I γ F. Moreover, ν belongs to Pr 2(X). Lemma 3. Assume B1, B2. Let ν P2(X). If µ JKOγH(ν), then µ D(H) Pr 2(X) and the optimal transport map T ν µ from µ to ν satisﬁes T ν µ I + γ H(µ). In other words, there exists a strong Fréchet subgradient at µ denoted by W H(µ) such that T ν µ = I + γ W H(µ). (19)

Using Lemmas 2,3, if µ0 Pr 2(X), then µn, νn Pr 2(X) for every n by induction. This remark allows to consider optimal transport maps from µn and νn to any π P2(X). The next lemma extends [2, 10.1.1.B] to functionals H convex along generalized geodesics. Lemma 4. Assume B1, B2, B3. Let ν Pr 2(X), µ, π P2(X) and T µ ν , T π ν the optimal transport maps from ν to µ and from ν to π respectively. If ξ H(µ), then ξ T µ ν , T π ν T µ ν ν H(π) H(µ). (20)

Lemma 4 is natural, holds for any functional convex along generalized geodesics, and it is novel, to the best of our knowledge. The following results rely on this lemma.

4.2 A descent lemma

Without using any convexity assumption on F, we ﬁrst obtain a descent lemma. We denote Yn+1 := T µn+1 νn+1 the optimal transport map between νn+1 and µn+1 in the Forward Backward Euler scheme (17), (18), and Xn+1 := Yn+1 (I γ F). Note that Xn+1 is a pushforward from µn to µn+1. Theorem 5 (Descent). Assume µ0 Pr 2(X), γ < 1/L and A1, B1, B2, B3. Then for n 0, there exists a strong Fréchet subgradient at µn+1 denoted by W H(µn+1) such that:

G(µn+1) G(µn) γ Å 1 Lγ

ã F + W H(µn+1)(Xn+1) 2 µn,

where we use the notation W H(µn+1)(Xn+1) to denote W H(µn+1) Xn+1.

Hence, the sequence (G(µn))n is decreasing as soon as the step-size is small enough.

4.3 Discrete EVI

To prove a discrete EVI and obtain convergence rates, we need the additional convexity assumption A2 on the potential function F. We ﬁrstly prove the two following lemmas. Lemma 6. Assume µ0 Pr 2(X), γ < 1/L, and B1, B2, B3. Then for n 0 and π P2(X), there exists a strong Fréchet subgradient at µn+1 denoted W H(µn+1) such that:

W 2(µn+1, π) W 2(νn+1, π) 2γ (H(µn+1) H(π)) γ2 W H(µn+1) 2 µn+1.

Lemma 7. Assume µ0 Pr 2(X), γ 1/L, and A1, A2 with λ 0. Then for n 0, and π P2(X)

W 2(νn+1, π) (1 γλ)W 2(µn, π) 2γ (EF (µn) EF (π)) + γ2 F 2 µn.

We can now provide a discrete EVI for the functional G = EF + H. Proposition 8 (discrete EVI). Assume µ0 Pr 2(X), γ < 1/L, and A1 B3 with λ 0. Then for n 0 and π P2(X), there exists a strong Fréchet subgradient at µn+1 denoted by W H(µn+1) such that the Forward Backward Euler scheme veriﬁes:

W 2(µn+1, π) (1 γλ)W 2(µn, π) 2γ (G(µn+1) G(π)) . (21)

4.4 Convergence rates

When the potential function F is convex, we easily get rates from the discrete EVI inequality provided above. Theorem 9 is a direct consequence of Proposition 8 by taking π = µ , and its corollaries provide rates depending on the strong convexity parameter of F. Theorem 9. Assume µ0 Pr 2(X), γ < 1/L, and A1 B3 with λ 0. Then for every n 0,

W 2(µn+1, µ ) (1 γλ)W 2(µn, µ ) 2γ(G(µn+1) G(µ )).

Corollary 10 (Convex case rate). Under the assumptions of Theorem 9, for n 0:

G(µn) G(µ ) W 2(µ0, µ )

Corollary 11 (Strongly convex case rate). Under the assumptions of Theorem 9, if λ > 0, then for n 0: W 2(µn, µ ) (1 γλ)n W 2(µ0, µ ).

Hence, as soon as F is convex, we get sublinear rates in terms of the objective function G, while when F is λ-strongly convex with λ > 0, we get linear rates in the squared Wasserstein distance for the iterates of the Forward Backward Euler scheme. These rates match those of the proximal gradient algorithm in Hilbert space in the convex and strongly convex cases. [27]. Moreover, these rates are discrete time analogues of the continuous time rates obtained in [2, Th. 11.2.1] for the gradient ﬂow of G. In the particular case where H is the negative entropy, we can compare the convergence rates of the FB scheme to those of LMC. Although, the complexity of one iteration of the FB scheme is higher, the convergence rates of the FB scheme are faster, see e.g. [16].

5 Numerical experiments

We provide numerical experiments with a ground truth target distribution µ to illustrate the dynamical behavior of the FB scheme, similarly to [34, Section 4.1]. We consider F(x) = 0.5|x|2, and H the negative entropy. In this case, G(µ) is (up to an additive constant) the Kullback-Leibler divergence of µ w.r.t. the standard Gaussian distribution µ . We denote by m the mean and σ the variance, and ﬁx m = 0 and σ2 = 1.0. We use the closed-form particle implementation of the FB scheme [38, Section G.1]. This allows to show the dynamical behavior of the FB scheme when γ = 0.1, and µ0 is Gaussian with m0 = 10 and σ0 = 100, in Figure 1. Note that λ = 1.0.

More precisely, the position of a set of particles initially distributed according to µ0 is updated iteratively. Using [38, Example 8], if µ and µ0 are Gaussian, then µn is Gaussian for every n. Moreover, the mean (resp. the covariance matrix) of µn+1 can be expressed as a function of the

Figure 1: The particle implementation of the FB scheme illustrate the convergence of µn to µ .

mean (resp. the covariance matrix) of µn. Finally, the update of the position of the particles can be computed using these expressions, see the update formulas in [38, Example 8]. These expressions also allow to compute W 2(µn, µ ): since µn and µ are Gaussian, W 2(µn, µ ) can be computed in closed form and is also known as Bures-Wasserstein distance.

The empirical distribution of the particles, represented by histograms, approximates µn. We see that µn matches µ after few iterations on Figure 1. We also illustrate the particles, from their initial position (distributed according to µ0) to their ﬁnal position (distributed according to µ100) in a high probability region of the target distribution µ . This shows that µ100 is close to µ . The linear convergence of µn to µ in the squared Wasserstein distance is illustrated in a multidimensional case in Figure 2 (see Appendix).

6 Conclusion

We proposed an uniﬁed analysis of a Forward-Backward discretization of Wasserstein gradient ﬂows in the convex case. We showed that the Forward-Backward discretization has convergence rates that are similar to the ones of the proximal gradient algorithm in Euclidean spaces.

Note that the implementation of the JKO operators is independent from the analysis of the FB scheme. However, an important problem raised by our work is whether we can ﬁnd efﬁcient implementations of the JKO operators of some speciﬁc functionals H relevant in machine learning applications.

7 Acknowledgement

Adil Salim thanks Pascal Bianchi for suggesting to use the JKO operator for optimization purposes.

8 Broader impact

The results that we showed, together with efﬁcient implementations of some speciﬁc JKOs could be very impactful for many machine learning tasks.

[1] Martial Agueh and Guillaume Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904 924, 2011. [2] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient ﬂows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008. [3] Michael Arbel, Anna Korba, Adil Salim, and Arthur Gretton. Maximum mean discrepancy gradient ﬂow. In Advances in Neural Information Processing Systems, pages 6481 6491, 2019. [4] Heinz H Bauschke, Patrick L Combettes, et al. Convex analysis and monotone operator theory in Hilbert spaces, volume 408. Springer, 2011. [5] Espen Bernton. Langevin Monte Carlo and JKO splitting. In Conference on Learning Theory, pages 1777 1798, 2018. [6] Pascal Bianchi, Walid Hachem, and Adil Salim. A constant step Forward-Backward algorithm involving random maximal monotone operators. Journal of Convex Analysis, 2019. [7] Adrien Blanchet and Jérôme Bolte. A family of functional inequalities: Łojasiewicz inequalities and displacement convex functions. Journal of Functional Analysis, 275(7):1650 1673, 2018. [8] Malcolm Bowles and Martial Agueh. Weak solutions to a fractional Fokker Planck equation via splitting and Wasserstein gradient ﬂow. Applied Mathematics Letters, 42:30 35, 2015. [9] Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4):375 417, 1991. [10] Haïm Brézis. Opérateurs maximaux monotones et semi-groupes de contractions dans les espaces de Hilbert. North-Holland mathematics studies. Elsevier Science, Burlington, MA, 1973. [11] Guillaume Carlier and Maxime Laborde. A splitting method for nonlinear diffusions with nonlocal, nonpotential drifts. Nonlinear Analysis: Theory, Methods & Applications, 150:1 18, 2017. [12] Xiang Cheng and Peter Bartlett. Convergence of Langevin MCMC in KL-divergence. In Algorithmic Learning Theory, pages 186 211, 2018. [13] Sinho Chewi, Tyler Maunu, Philippe Rigollet, and Austin J Stromme. Gradient descent algorithms for Bures-Wasserstein barycenters. In Conference on Learning Theory, pages 1276 1304, 2020. [14] Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for overparameterized models using optimal transport. In Advances in Neural Information Processing Systems, pages 3036 3046, 2018. [15] Andrew Duncan, Nikolas Nüsken, and Lukasz Szpruch. On the geometry of stein variational gradient descent. ar Xiv preprint ar Xiv:1912.00894, 2019. [16] Alain Durmus, Szymon Majewski, and Blazej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(73):1 46, 2019. [17] Charlie Frogner and Tomaso Poggio. Approximate inference with Wasserstein gradient ﬂows. In International Conference on Artiﬁcial Intelligence and Statistics, pages 2581 2590, 2020. [18] Aude Genevay, Marco Cuturi, Gabriel Peyré, and Francis Bach. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems, pages 3440 3448, 2016. [19] Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1 17, 1998. [20] Qiang Liu. Stein variational gradient descent as gradient ﬂow. In Advances in Neural Information Processing Systems, pages 3115 3123, 2017. [21] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2378 2386, 2016. [22] Yi-An Ma, Niladri Chatterji, Xiang Cheng, Nicolas Flammarion, Peter Bartlett, and Michael I Jordan. Is there an analog of Nesterov acceleration for MCMC? ar Xiv preprint ar Xiv:1902.00996, 2019.

[23] Bernard Martinet. Brève communication. régularisation d inéquations variationnelles par approximations successives. Revue française d informatique et de recherche opérationnelle. Série rouge, 4(R3):154 158, 1970. [24] Bertrand Maury, Aude Roudneff-Chupin, Filippo Santambrogio, and Juliette Venel. Handling congestion in crowd motion modeling. Networks and Heterogeneous Media, pages 485 519, 2011. [25] Robert J Mc Cann. Polar factorization of maps on riemannian manifolds. Geometric & Functional Analysis GAFA, 11(3):589 608, 2001. [26] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-ﬁeld theory of two-layers neural networks: dimension-free bounds and kernel limit. In Conference on Learning Theory, pages 1 -77, 2019. [27] Yurii Nesterov. Lectures on convex optimization, volume 137. Springer, 2018. [28] Felix Otto. The geometry of dissipative evolution equations: the porous medium equation. Communications in Partial Differential Equations, 26(1-2):101 174, 2001. [29] Marcelo Pereyra. Proximal Markov chain Monte Carlo algorithms. Statistics and Computing, 26(4):745 760, 2016. [30] Juan Peypouquet and Sylvain Sorin. Evolution equations for maximal monotone operators: asymptotic analysis in continuous and discrete time. Journal of Convex Analysis, pages 1113 1163, 2010. [31] Gabriel Peyré. Entropic approximation of Wasserstein gradient ﬂows. SIAM Journal on Imaging Sciences, 8(4):2323 2351, 2015. [32] Pierre H Richemond and Brendan Maginnis. On Wasserstein reinforcement learning and the Fokker-Planck equation. ar Xiv preprint ar Xiv:1712.07185, 2017. [33] Filippo Santambrogio. {Euclidean, metric, and Wasserstein} gradient ﬂows: an overview. Bulletin of Mathematical Sciences, 7(1):87 154, 2017. [34] Amirhossein Taghvaei and Prashant G Mehta. Accelerated gradient ﬂow for probability distributions. In International Conference on Machine Learning, 2019. [35] Santosh Vempala and Andre Wibisono. Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry sufﬁces. In Advances in Neural Information Processing Systems, pages 8092 8104, 2019. [36] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008. [37] Yifei Wang and Wuchen Li. Accelerated information gradient ﬂow. ar Xiv preprint ar Xiv:1909.02102, 2019. [38] Andre Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Conference on Learning Theory, page 2093 3027, 2018. [39] Andre Wibisono. Proximal Langevin algorithm: Rapid convergence under isoperimetry. ar Xiv preprint ar Xiv:1911.01469, 2019. [40] Ruiyi Zhang, Changyou Chen, Chunyuan Li, and Lawrence Carin. Policy optimization as Wasserstein gradient ﬂows. In International Conference on Machine Learning, pages 5737 5746, 2018.