Non-geodesically-convex optimization in the Wasserstein space

Hoang Phuc Hau Luu, Hanlin Yu, Bernardo Williams, Petrus Mikkola, Marcelo Hartmann, Kai Puolamäki, Arto Klami
Department of Computer Science, University of Helsinki

Abstract

We study a class of optimization problems in the Wasserstein space (the space of probability measures) where the objective function is nonconvex along generalized geodesics. Specifically, the objective exhibits some difference-of-convex structure along these geodesics. The setting also encompasses sampling problems where the logarithm of the target distribution is difference-of-convex. We derive multiple convergence insights for a novel semi Forward-Backward Euler scheme under several nonconvex (and possibly nonsmooth) regimes. Notably, the semi Forward-Backward Euler is just a slight modification of the Forward-Backward Euler, whose convergence is, to our knowledge, still unknown in our very general non-geodesically-convex setting.

1 Introduction

Sampling and optimization are intertwined. For example, the (overdamped) Langevin dynamics, typically considered a sampling algorithm, can also be viewed as gradient descent optimization in which a suitable amount of Gaussian noise is injected at each step. There are also deeper connections. In the limit of infinitesimal stepsize, the law of the Langevin dynamics is governed by the Fokker-Planck equation, which describes a diffusion over time of probability measures. In the seminal paper [39], Jordan, Kinderlehrer, and Otto reinterpreted the Fokker-Planck equation as the gradient flow of the relative entropy functional, a.k.a. the Kullback-Leibler (KL) divergence, in the (Wasserstein) space of finite second-moment probability measures equipped with the Wasserstein metric. This discovery connects the two fields and encourages optimization in the Wasserstein space, even conceptually, as it directly gives insight into the sampling context. Studies in continuous-time dynamics [21, 12, 66, 30] seem natural and enjoy nice theoretical properties without discretization errors. Another line of research studies discretizations of Wasserstein gradient flow, either by quantifying the discretization error between the continuous-time flow and the discrete-time flow [39, 67, 27, 23, 28] or by viewing discrete-time flows as iterative optimization schemes in the Wasserstein space [65, 26, 70, 11], where the primary focus is on (geodesically) convex optimization problems.

Nonconvex, nonsmooth optimization is challenging, even in Euclidean space; quoting Rockafellar [63]: "In fact the great watershed in optimization isn't between linearity and nonlinearity, but convexity and nonconvexity." The landscape of nonconvex problems is mostly underexplored in the Wasserstein space. In the sampling language, it amounts to sampling from a non-log-concave and possibly non-log-Lipschitz-smooth target distribution. Recently, Balasubramanian et al. [9] advocated the need for a sound theory for non-log-concave sampling and provided some guarantees for the unadjusted Langevin algorithm (ULA) when sampling from log-smooth (Lipschitz/Hölder smooth) densities. These results are preliminary, covering the ULA (and its variants) for a specific class of densities (smooth). Theoretical understanding of other classes of algorithms and densities is needed.
We approach the subject through the lens of nonconvex optimization in the space of probability distributions and pose discretized Wasserstein gradient flows as iterative minimization algorithms. This allows us to, on the one hand, use and extend tools from classical nonconvex optimization and, on the other hand, derive more connections between sampling and optimization. We study the following non-geodesically-convex optimization problem defined over the space P2(X) of probability measures µ over X = R^d with finite second moment, i.e., ∫ ‖x‖² dµ(x) < +∞:

min_{µ∈P2(X)} F(µ) := EF(µ) + H(µ) = EG−H(µ) + H(µ),   (1)

where F : X → R is a nonconvex function that can be represented as a difference of two convex functions G and H, EF(µ) := ∫ F(x) dµ(x) is the potential energy, and H : P2(X) → R ∪ {+∞} plays the role of a regularizer, assumed to be convex along generalized geodesics.

Why difference-of-convex structure? Nonconvexity lies in the difference-of-convex (DC) structure F = G − H, where G and H are called the first and second DC components, respectively. F being nonconvex implies that EF is non-geodesically-convex in general. First, the class of DC functions is very rich, and DC structures are present everywhere in real-world applications [60, 46, 47, 1, 22, 56, 58]. Weakly convex and Lipschitz smooth (L-smooth or simply smooth) functions are two subclasses of DC functions. Furthermore, any continuous function can be approximated by a sequence of DC functions over a compact, convex domain [8]. We also remark that many nonconvex functions admit quite natural DC decompositions; for example, an L-smooth function F has the splittings F(x) = α‖x‖² − (α‖x‖² − F(x)) and F(x) = (F(x) + α‖x‖²) − α‖x‖² whenever α ≥ L/2. Second, DC functions preserve enough structure to extend convex analysis. Such structure is key in classic DC programming [60] and in our Wasserstein-space analysis with optimal transport tools.

Context Many problems in machine learning and sampling fall into the spectrum of problem (1); for example, see the discussion in [65] that inspired our work. The regularizer H can be the internal energy [4, Sect. 10.4.3]. Under McCann's condition, the internal energy is convex along generalized geodesics [4, Prop. 9.3.9]. In particular, the negative entropy, H(µ) = ∫ log(µ(x)) dµ(x) if µ is absolutely continuous w.r.t. the Lebesgue measure and +∞ otherwise, is a special case of internal energy satisfying McCann's condition. In the latter case, F(µ) = DKL(µ‖µ*) + const, where µ*(x) ∝ exp(−F(x)), and the optimization problem reduces to a sampling problem with a log-DC target distribution. In Bayesian inference, the posterior structure depends on both the prior and the likelihood. If the likelihood is log-smooth, it exhibits the aforementioned DC splittings. Log-priors, often nonsmooth to capture sparsity or low rank, typically also have explicit DC structures [48, 32, 25]. In the context of infinitely wide one-layer neural networks and the Maximum Mean Discrepancy [51, 7, 21], let µ* be the optimal distribution over a network's parameters and k be a given kernel; the regularizer is then the interaction energy H(µ) = ∫∫ k(x, y) dµ(x)dµ(y) and F(x) = −2 ∫ k(x, y) dµ*(y). In general, this H is not convex along generalized geodesics and F is nonconvex but not necessarily DC. When the kernel has a Lipschitz gradient, we can adjust both H and F as H(µ) = ∫∫ [k(x, y) + α‖x‖² + α‖y‖²] dµ(x)dµ(y) and F(x) = −2 ∫ k(x, y) dµ*(y) − 2α‖x‖² for some α > 0, making H convex along generalized geodesics and F concave (hence DC); see Appx. A.2.
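To make the smooth-case DC splitting above concrete, the following is a minimal numerical sketch (the function F, the choice of α, and the convexity check are illustrative assumptions, not from the paper) verifying that G = F + α‖·‖² and H = α‖·‖² are both convex and recover F = G − H once α ≥ L/2.

```python
import numpy as np

# Illustrative L-smooth, nonconvex F: F(x) = sum_i cos(x_i). Its Hessian has
# eigenvalues in [-1, 1], so L = 1 and any alpha >= L/2 = 0.5 works.
L = 1.0
alpha = 0.6  # assumed choice with alpha >= L/2

def F(x): return np.sum(np.cos(x))
def G(x): return F(x) + alpha * np.dot(x, x)   # first DC component (convex)
def H(x): return alpha * np.dot(x, x)          # second DC component (convex)

rng = np.random.default_rng(0)
d = 5
for _ in range(1000):
    x, y = rng.normal(size=d), rng.normal(size=d)
    assert np.isclose(F(x), G(x) - H(x))                 # F = G - H
    # Midpoint convexity of G along a random segment (up to round-off):
    assert G(0.5 * (x + y)) <= 0.5 * (G(x) + G(y)) + 1e-9
print("DC splitting F = G - H verified on random points and segments.")
```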
Our idea is to minimize (1) in the space of probability distributions by discretization of the gradient flow of F, leveraging on the JKO (Jordan, Kinderlehrer, and Otto) operator (2). In the previous work [70], this has been done with the Forward-Backward (FB) Euler discretization, but it lacks convergence analysis. Recently, Salim et al. [65] did some study on FB Euler, but their results do not apply here because F is nonconvex and possibly nonsmooth. Further leveraging on the DC structure of F and inspired by classical DC programming literature [60], we subtly modify the FB Euler to give rise to a scheme named semi FB Euler that enjoys major theoretical advantages as we can provide a wide range of convergence analysis. Regarding the name, "semi" addresses the splitting of the potential energy. The scheme can be nevertheless reinterpreted as FB Euler. Contributions To our knowledge, no prior work studies problem (1) when F is DC. Therefore, most of the derived results in this paper are novel. We propose and analyze the semi FB Euler (4) leveraging on the classic DC optimization proof template [60, 45, 46] with a substantial accommodation of Wasserstein geometry and derive the following set of new insights: Thm. 1 We show that if the H is continuously differentiable, every cluster point of the sequence of distributions {µn}n N generated by semi FB Euler is a critical point to F. Note that criticality is a notion from the DC programming literature [60] and it is a necessary condition for local optimality; See Sect. 3.3. Thm. 2 We provide convergence rate of O(N 1) in terms of Wasserstein (sub)gradient mapping in the general nonsmooth setting. The notion of gradient mapping [33, 38, 55] is from the context of proximal algorithms in Euclidean space that is applicable to nonconvex programs where the notion of distance to global solution is in general not possible to work out. Thm. 3 Under the extra assumption that H is continuously twice differentiable and has bounded Hessian, we provide a convergence rate of O(N 1 2 ) in terms of distance of 0 to the Fréchet subdifferential of F. One can think of this as convergence rate to Fréchet stationarity, i.e., if µ is a Fréchet stationary point of F, then, by definition, 0 is in the Fréchet subdifferential of F at µ . Fréchet stationarity is a relatively sharp necessary condition for local optimality. Thm. 4, 5 Under the assumptions of Thm. 3 and additionally F satisfying the Łojasciewicz-type inequality for some Łojasciewicz exponent of θ [0, 1), we show that {µn}n N is a Cauchy sequence under Wasserstein topology, and thanks to the completeness of the Wasserstein space, the whole sequence {µn}n N converges to some µ . We show that µ is in fact a global minimizer to F. Furthermore, we provide convergence rate of µn µ in three different regimes (W2 denotes the Wasserstein metric): (1) if θ = 0, W2(µn, µ ) converges to 0 after a finite number of steps; (2) if θ (0, 1/2], both F(µn) F(µ ) and W2(µn, µ ) converges to 0 exponentially fast; (3) if θ (1/2, 1), both F(µn) F(µ ) and W2(µn, µ ) converges sublinearly to 0 with rates O n 1 2θ 1 and O n 1 θ 2θ 1 , respectively. When H is the negative entropy, F(µn) F(µ ) = DKL(µn µ ); Therefore, in the sampling context, we provide convergence guarantees in both Wasserstein and KL distances. See Sect. 4.3 for additional observations and implications. 
2 Preliminaries 2.1 Notations and basic results in measure theory and functional analysis We denote by X = Rd, B(X) the Borel σ-algebra over X, and L d the Lebesgue measure on X. P(X) is the set of Borel probability measures on X. For µ P(X), we denote its second-order moment by m2(µ) := R X x 2dµ(x), where m2(µ) can be infinity. P2(X) P(X) denotes a set of finite second-order moment probability measures. P2,abs(X) P2(X) is the set of measures that are absolutely continuous w.r.t. L d. Here µ-a.e. stands for almost everywhere w.r.t. µ. Let Cp(X), C c (X), Cb(X) be the classes of p-time continuously differentiable functions, infinitely differentiable functions with compact support, bounded and continuous functions, respectively. From functional analysis [20], for each p 1, Lp(X, µ) denotes the Banach space of measurable (where measurable is understood as Borel measurable from now on) functions f such that R X |f(x)|pdµ(x) < + . We shall consider an element of Lp(X, µ) as an equivalent class of functions that agree µ-a.e. on X rather than a sole function. The norm of f Lp(X, µ) is f Lp(X,µ) = ( R X |f(x)|pdµ(x))1/p. When p = 2, L2(X, µ) is actually a Hilbert space with the inner product f, g L2(X,µ) = R X f(x)g(x)dµ(x) which induces the mentioned norm. These results can be extended to vector-valued functions. In particular, we denote by L2(X, X, µ) the Hilbert space of ξ : X X in which ξ L2(X, µ). The norm ξ L2(X,X,µ) := ( R X ξ(x) 2dµ(x))1/2. We say that f : X R has quadratic growth if there exists a > 0 such that |f(x)| a( x 2 + 1) for all x X. It is clear that if f has quadratic growth and µ P2(X), then f L1(X, µ). The pushforward of a measure µ P(X) through a Borel map T : X Rm, denoted by T#µ is defined by (T#µ)(A) := µ(T 1(A)) for every Borel sets A Rm. 2.2 Optimal transport [4, 5, 69, 68] Given µ, ν P(X), the principal problem in optimal transport is to find a transport map T pushing µ to ν, i.e., T#µ = ν, in the most cost-efficient way, i.e., minimizing x T(x) 2 on µ-average. Monge s formulation for this problem is inf T :T#µ=ν R X x T(x) 2dµ(x), where the optimal solution, if exists, is denoted by T ν µ and called the optimal (Monge) map. Monge s problem can be ill-posed, e.g., no such T ν µ exists when µ is a Dirac mass and ν is absolutely continuous [5]. By relaxing Monge s formulation, Kantorovich considers minγ Γ(µ,ν) R X X x y 2dγ(x, y), where Γ(µ, ν) denotes the set of probabilities over X X whose marginals are µ and ν, i.e, γ Γ(µ, ν) iff proj1#γ = µ, proj2#γ = ν where proj1, proj2 are the projections onto the first X space and the second X space, respectively. Such γ is called a plan. Kantorovich s formulation is well-posed because Γ(µ, ν) is non-empty (at least µ ν Γ(µ, ν)) and the arg min element actually exists (see [5, Sect. 2.2]). The set of optimal plans between µ and ν is denoted by Γo(µ, ν). In terms of random variables, any pairs (X, Y ) where X µ, Y ν is called a coupling of µ and ν while it is called an optimal coupling if the joint law of X and Y is in Γo(µ, ν). In P2(X), the min value in Kantorovich s problem specifies a valid metric referred to as Wasserstein distance, W2(µ, ν) = ( R X X x y 2dγ(x, y))1/2 for some, and thus all, γ Γo(µ, ν). The metric space (P2(X), W2) is then called the Wasserstein space. In P2(X), beside the convergence notion induced by the Wasserstein metric, there is a weaker notion of convergence called narrow convergence: we say a sequence {µn}n N P2(X) converges narrowly to µ P2(X) if R X ϕ(x)dµn(x) R X ϕ(x)dµ(x) for all ϕ Cb(X). 
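The Wasserstein distance just defined is easy to compute exactly in one dimension, where the optimal coupling of two empirical measures with equally many, equally weighted atoms is the monotone (sorted) matching. The sketch below (a purely illustrative example, not from the paper) compares this empirical estimate against the closed form W2²(N(m1, s1²), N(m2, s2²)) = (m1 − m2)² + (s1 − s2)².

```python
import numpy as np

def w2_empirical_1d(x, y):
    """W2 between two 1-D empirical measures with n equally weighted atoms each;
    for the quadratic cost the optimal plan is the monotone (sorted) matching."""
    return np.sqrt(np.mean((np.sort(x) - np.sort(y)) ** 2))

rng = np.random.default_rng(0)
n = 200_000
m1, s1, m2, s2 = 0.0, 1.0, 2.0, 0.5                  # illustrative Gaussian parameters
x = rng.normal(m1, s1, size=n)                       # samples from mu
y = rng.normal(m2, s2, size=n)                       # samples from nu

w2_hat = w2_empirical_1d(x, y)
w2_true = np.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)   # closed form for 1-D Gaussians
print(f"empirical W2 = {w2_hat:.3f}, closed form = {w2_true:.3f}")
```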
Convergence in the Wasserstein metric implies narrow convergence but the converse is not necessarily true. The extra condition to make it true is m2(µn) m2(µ). We denote Wasserstein and narrow convergence by Wass and narrow , respectively. If µ P2,abs(X), ν P2(X), Monge s formulation is well-posed and the unique (µ-a.e.) solution exists, and in this case, it is safe to talk about (and use) the optimal transport map T ν µ. Moreover, there exists some convex function f such that T ν µ = f µ-a.e. Kantorovich s problem also has a unique solution γ and it is given by γ = (I, T ν µ)#µ where I is the identity map. This is known as Brenier theorem or polar factorization theorem [18]. 2.3 Subdifferential calculus in the Wasserstein space Apart from being a metric space, (P2(X), W2) also enjoys some pre-Riemannian structure making subdifferential calculus on it possible. Let us have a picture of a manifold in mind. Firstly, the tangent space [4] of P2(X) at µ is Tanµ P2(X) := { ψ : ψ C c (X)} L2(X,X,µ), where the closure is w.r.t. the L2(X, X, µ)-topology. Intuitively, for ψ C c (X), I + ϵ ψ is an optimal transport map if ϵ > 0 is small enough [43], so ψ plays a role as "tangent vector". Let ϕ : P2(X) R {+ }, we denote dom(ϕ) = {µ P2(X) : ϕ(µ) < + }. Let µ dom(ϕ), we say that a map ξ L2(X, X, µ) belongs to the Fréchet subdifferential [15, 43] F ϕ(µ) if ϕ(ν) ϕ(µ) supγ Γo(µ,ν) R X X ξ(x), y x dγ(x, y) + o(W2(µ, ν)) for all ν P2(X), where the little-o notation means lims 0 o(s)/s = 0. If F ϕ(µ) = , we say ϕ is Fréchet subdifferentiable at µ. We also denote dom( F ϕ) = {µ P2(X) : F ϕ(µ) = }. Similarly, we say that ξ L2(X, X, µ) belongs to the (Fréchet) superdifferential + F ϕ(µ) of ϕ at µ if ξ F ( ϕ)(µ). In other words, F ( ϕ)(µ) = + F ϕ(µ). We say ϕ is Wassertein differentiable [15, 43] at µ dom(ϕ) if F ϕ(µ) + F ϕ(µ) = . We call an element of the intersection, denoted by W ϕ(µ), a Wasserstein gradient of ϕ at µ, and it holds ϕ(ν) ϕ(µ) = R X X W ϕ(µ)(x), y x dγ(x, y) + o(W2(µ, ν)), for all ν P2(X) and any γ Γo(µ, ν). The Wasserstein gradient is not unique in general, but its parallel component in Tanµ P2(X) is unique, and this parallel component is again a valid Wasserstein gradient as the orthogonal component plays no role in the above definitions, i.e., if ξ Tanµ P2(X) , it holds R X X ξ (x), y x dγ(x, y) = 0 for any ν P2(X) and γ Γo(µ, ν) [43, Prop. 2.5]. We may refer to this parallel component as the unique Wasserstein gradient of ϕ at µ. 2.4 Optimization in the Wasserstein space A function ϕ : P2(X) R {+ } is called proper if dom(ϕ) = , while it is called lower semicontinuous (l.s.c) if for any sequence µn Wass µ, it holds lim infn ϕ(µn) ϕ(µ). We next recall (a simplified version of) generalized geodesic convexity. Definition 1. [65] Let ϕ : P2(X) R {+ }. We say ϕ is convex along generalized geodesics if µ, π P2(X), ν P2,abs(X), ϕ((t T µ ν + (1 t)T π ν )#ν) tϕ(µ) + (1 t)ϕ(π), t [0, 1]. The curve t 7 (t T µ ν + (1 t)T π ν )#ν (called a generalized geodesic) interpolates from π to µ as t runs from 0 to 1. The definition says that ϕ is convex along these curves. If µ P2,abs(X) and ν = µ, the curve is a geodesic in (P2(X), W2). If the definition is relaxed to the class of geodesics only, we say that ϕ is convex along geodesics. An important characterization of Fréchet subdifferential of a geodesically convex function is that we can drop the little-o notation in its definition in Sect. 2.3 [4, Sect 10.1.1]. 
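Brenier's theorem is easiest to see between Gaussians, where the optimal map is the gradient of a convex quadratic. The sketch below (our own illustration; the specific means and covariances are assumptions) builds the map for µ = N(m1, Σ1), ν = N(m2, Σ2) and checks that it indeed pushes µ onto ν.

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)
d = 3
m1, m2 = rng.normal(size=d), rng.normal(size=d)
B1, B2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
S1, S2 = B1 @ B1.T + np.eye(d), B2 @ B2.T + np.eye(d)   # source/target covariances

# Optimal (Brenier) map between Gaussians: T(x) = m2 + A (x - m1) with
# A = S1^{-1/2} (S1^{1/2} S2 S1^{1/2})^{1/2} S1^{-1/2}. A is symmetric positive
# definite, so T is the gradient of a convex quadratic, as Brenier's theorem asserts.
S1_half = np.real(sqrtm(S1))
S1_inv_half = np.linalg.inv(S1_half)
A = S1_inv_half @ np.real(sqrtm(S1_half @ S2 @ S1_half)) @ S1_inv_half

X = rng.multivariate_normal(m1, S1, size=100_000)   # samples from mu
Y = m2 + (X - m1) @ A.T                             # samples from T#mu

print("max |cov(T#mu) - S2| =", np.abs(np.cov(Y.T) - S2).max())     # ~ sampling error
print("A symmetric and positive definite:",
      np.allclose(A, A.T), bool(np.all(np.linalg.eigvalsh(A) > 0)))
```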
As a convention, for a geodesically convex function ϕ, the Fréchet subdifferential F will be simply written as . First-order optimality conditions Let ϕ : P2(X) R {+ } be a proper function. µ P2(X) is a global minimizer of ϕ if ϕ(µ ) ϕ(µ), µ P2(X). For local optimality, we shall use the Wasserstein metric to define neighborhoods. µ P2(X) is a local minimizer if there exists r > 0 such that ϕ(µ ) ϕ(µ) for all µ : W2(µ, µ ) < r. We shall denote B(µ , r) := {µ P2(X) : W2(µ, µ ) < r} the (open) Wasserstein ball centered at µ with radius r. If we replace < by we obtain the notion of a closed Wasserstein ball. We call µ a Fréchet stationary point of ϕ if 0 F ϕ(µ ). Fréchet stationarity is a necessary condition for local optimality. In other words, if µ is a local minimizer, it is a Fréchet stationary point (Lem. 5 in Appendix). In addition, if ϕ is Wasserstein differentiable at µ , W ϕ(µ )(x) = 0 µ -a.e. [43]. When ϕ is geodesically convex, Fréchet stationarity is a sufficient condition for global optimality (Lem. 6 in Appendix). 3 Semi Forward-Backward Euler for difference-of-convex structures 3.1 Wasserstein gradient flows: different types of discretizations To neatly present the idea of minimizing F via discretized gradient flow, we first assume for a moment that F is infinitely differentiable and H is the negative entropy. See also a discussion in [65]. We wish to minimize (1) in the space of probability distributions. A natural idea is to apply discretizations of the gradient flow of F, where the gradient flow is defined (under some technical assumptions [39]) as the limit η 0+ of the following scheme with some simple time-interpolation µn+1 JKOηF(µn), where JKOηF(µ) := arg min ν P2(X) F(ν) + 1 2η W 2 2 (µ, ν). (2) Straightforwardly, given a fixed η > 0, (2) gives back a discretization for this flow known as Backward Euler. On the other hand, if F is Wasserstein differentiable (Sect. 2.2), the Forward Euler discretization reads [70] µn+1 = (I η W F(µn))#µn, which is reinterpreted as doing gradient descent in the space of probability distributions. These are optimization methods that work directly on the objective function F itself. However, the composite structure of F (a sum of several terms) can also be exploited. One such scheme is the unadjusted Langevin algorithm (ULA), where it first takes a gradient step w.r.t. the potential part, then follows the heat flow corresponding to the entropy part [70]: νn+1 = (I η F)#µn, and µn+1 = N(0, 2ηI) νn+1, where is the convolution. This ULA is "viewed" in the space of distributions (Eulerian approach), a more familiar and equivalent form of the ULA from the particle perspective (Lagrangian approach) goes like xn+1 = xn η F(xn) + 2ηzk where zk N(0, I). The ULA is known to be asymptotically biased even for Gaussian target measure (Ornstein-Uhlenbeck process). To correct this bias, the Metropolis-Hasting accept-reject step [62] is sometimes introduced. Metropolis-Hasting algorithm [52, 36] is a much more general framework that works with quite any proposal (e.g., a random walk) whose convergence analysis is based on the Markov kernel satisfying the detailed balance condition. This convergence framework is different from what is considered in this work: we are more interested in the underlying dynamics of the chain. Metropolis-Hasting algorithm is indeed another story. In optimization, for composite structure, Forward-Backward (FB) Euler and its variants are methods of choice [59, 10]. 
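For reference, here is a minimal particle-level (Lagrangian) sketch of the ULA update x_{n+1} = x_n − η∇F(x_n) + √(2η) z_n discussed above; the non-log-concave double-well target, stepsize, and particle count are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Double-well potential F(x) = (x^2 - 1)^2 / 4, so grad F(x) = x^3 - x and the
# target mu*(x) ~ exp(-F(x)) is bimodal (non-log-concave).
def grad_F(x):
    return x ** 3 - x

rng = np.random.default_rng(0)
eta, n_particles, n_steps = 1e-2, 50_000, 2_000

x = rng.normal(size=n_particles)                     # particles drawn from mu_0 = N(0, 1)
for _ in range(n_steps):
    z = rng.normal(size=n_particles)
    x = x - eta * grad_F(x) + np.sqrt(2 * eta) * z   # one ULA step

# The particle cloud approximates mu* only up to the ULA's asymptotic bias.
print("fraction of particles near the modes x = +/-1:",
      np.mean(np.abs(np.abs(x) - 1.0) < 0.5))
```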
The corresponding FB Euler for F will take the gradient step (forward) according to the potential, and JKO step (backward) w.r.t. the negative entropy (FB Euler) νn+1 = (I η F)#µn, and µn+1 JKOηH (νn+1). (3) This scheme appears in [70] without convergence analysis, and later on [65] derives non-asymptotic convergence guarantees under the assumption F being convex and Lipschitz smooth. In this work, as F is nonconvex and nonsmooth, the theory in [65] does not apply, and the convergence (if any) of (3) remains mysterious. The DC structure of F can be further exploited. In DC programming [60], the forward step should be applied to the concave part, while the backward step should be applied to the convex part. We hence propose the following semi FB Euler (semi FB Euler) νn+1 = (I + η H)#µn, and µn+1 JKOη(H +EG)(νn+1) (4) for which we can provide convergence guarantees. Apparently, the difference between semi FB Euler and FB Euler is subtle: while FB Euler does forward on EG H = EG EH and backward on H , semi FB Euler does forward on EH and backward on H + EG; recall that F = EG EH + H . Theoretically, semi FB Euler enjoys some advantages compared to FB Euler. Thanks to Brenier theorem (Sect. 2.2), the pushing step in semi FB Euler is optimal since H is convex; Meanwhile, the pushing in FB Euler is non-optimal whose optimal Monge map is not identifiable in general. The convergence of FB Euler is still an open question, even when F is (DC) differentiable. In contrast, we can provide a solid theoretical guarantee for semi FB Euler, especially when H is differentiable. Additionally, we also offer convergence guarantees when H is nonsmooth. 3.2 Problem setting Our goal is to minimize the non-geodesically-convex functional F(µ) = EF (µ)+H (µ) over P2(X), where F = G H is a DC function. We make Assumption 1 throughout the paper: Assumption 1. (i) The objective function F is bounded below. (ii) G, H : X R are convex functions and have quadratic growth. (iii) H : P2(X) R {+ } is proper, l.s.c, and convex along generalized geodesics in (P2(X), W2), and dom(H ) P2,abs(X). (iv) There exists η0 > 0 such that η (0, η0), JKOη(EG+H )(µ) = for every µ P2(X). Note that Assumption 1(iv) is a commonly-used assumption to simplify technical complication when working with the JKO operator [4, 15, 65]. Assumption 1(ii) implies EG and EH are continuous w.r.t. Wasserstein topology [3, Prop. 2.4] (G, H are continuous [54, Cor. 2.27] and have quadratic growth). 3.3 Optimality charactizations First, it follows from Assumption 1(iii), dom(F) P2,abs(X). By analogy to DC programming in Euclidean space, we call µ dom(F) a critical point of F = H + EG EH if (H + EG)(µ ) EH(µ ) = . Criticality is a necessary condition for local optimality (Lem. 7). Moreover, if EH is Wasserstein differentiable at µ , criticality becomes Fréchet stationarity (Lem. 8). 3.4 Semi FB Euler: a general setting We allow H to be non-differentiable in some derivations, meaning that H (convex subdifferential [54]) contains multiple elements in general. We first pick a selector S of H, i.e., S : X X, such that S(x) H(x). By the axiom of choice (Zermelo, 1904, see, e.g., [37]), such selection always exists. However, an arbitrary selector can behave badly, e.g., not measurable. We shall first restrict ourselves to the class of Borel measurable selectors (see Appx. A.1 for an existence discussion). Assumption 2 (Measurability). The selector S is Borel measurable. 
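To illustrate Assumption 2, here is a minimal sketch (our own example, not from the paper) of a Borel measurable selector for the nonsmooth convex choice H(x) = λ‖x‖₁: S(x) = λ sign(x) is Borel measurable and satisfies S(x) ∈ ∂H(x) everywhere, since at zero coordinates 0 ∈ [−λ, λ]. On particles, the push-forward step of the scheme recalled below then reads x ↦ x + ηS(x).

```python
import numpy as np

lam = 0.5   # illustrative weight of the nonsmooth second DC component H(x) = lam*||x||_1

def selector_S(x):
    """Borel measurable selector of the subdifferential of H(x) = lam * ||x||_1;
    np.sign returns 0 at 0, and 0 is a valid subgradient of |.| there."""
    return lam * np.sign(x)

def forward_step(particles, eta):
    """Push-forward step nu_{n+1} = (I + eta * S)_# mu_n, realized on particles."""
    return particles + eta * selector_S(particles)

rng = np.random.default_rng(0)
mu_n = rng.normal(size=(10_000, 2))          # samples standing in for mu_n
nu_next = forward_step(mu_n, eta=0.1)        # samples standing in for nu_{n+1}

# Sanity check of the subgradient inequality H(y) >= H(x) + <S(x), y - x>.
H = lambda x: lam * np.abs(x).sum(axis=-1)
y = rng.normal(size=(10_000, 2))
gap = H(y) - H(mu_n) - np.sum(selector_S(mu_n) * (y - mu_n), axis=-1)
print("subgradient inequality holds:", bool(np.all(gap >= -1e-12)))
```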
We recall the semi FB scheme (4) but for nonsmooth F as follows: start with an initial distribution µ0 P2,abs(X), given a discretization stepsize 0 < η < η0, we repeat the following two steps: νn+1 = (I + ηS)#µn push forward step; µn+1 = JKOη(EG+H )(νn+1) JKO step. Well-definiteness and properties: Given µn P2(X), it follows from Lem. (4) that νn+1 P2(X). The two generated sequences are then in P2(X). Moreover, it follows from Assumption 1 that {µn}n N are in P2,abs(X), so are {νn}n N using Lem. 9 by noting that I + ηS is subgradient of a strongly convex function x 7 (1/2) x 2 + ηH(x). 4 Convergence analysis 4.1 Asymptotic analysis Lemma 1 (Descent lemma). Under Assumptions 1 and 2, let {µn}n N be the sequence of distributions produced by semi FB Euler starting from some µ0 P2,abs(X) with 0 < η < η0. Then it holds F(µn+1) F(µn) 1 X T µn νn+1(x) T µn+1 νn+1 (x) 2dνn+1(x), n N. Lem. 1 shows that the objective does not increase along semi FB Euler s iterates. Proof of Lem. 1 is in Appx. A.3. By using Lem. 1, we establish asymptotic convergence for semi FB Euler as follows. For the asymptotic convergence analysis, we need the following assumption on H. Assumption 3. H is continuously differentiable. Theorem 1 (Asymptotic convergence). Under Assumptions 1, 3, let {µn}n N and {νn}n N be sequences produced by semi FB Euler starting from some µ0 P2,abs(X) with 0 < η < η0. If {µn}n N is relatively compact with respect to the Wasserstein topology and supn N H (νn) < + , then every cluster point of {µn}n N is a critical point of F. Proof of Thm.1 is in Appx. A.4. Thm. 1 does not ensure convergence of the whole sequence {µn}n N; Rather, it guarantees subsequential convergence to critical points of F. Remark 1. In the Euclidean space, the compactness assumption of the generated sequence is usually enforced via the coercivity assumption: f(x) + whenever x + . A striking difference in the Wasserstein space is that closed Wasserstein balls are not compact in the Wasserstein topology [43, Prop. 4.2], making coercivity not sufficient to induce (Wasserstein) compactness. For Thm. 1, we simply assume the sequence {µn}n N to be relatively compact. 4.2 Non asymptotic analysis To measure how fast the algorithm converges, we need some convergence measurement. First, for proximal-type algorithms in Euclidean space, the notion of gradient mapping Gη(xn) is usually used (see, e.g., [33, 55] and [38, Eq. (5)]) and we measure the rate Gη(xn) 2 0. In analogy as in Euclidean space, we define the Wasserstein (sub)gradient mapping as follows Gη(µ) := 1 η I T JKOη(EG+H )((I+ηS)#µ) µ , and we measure the rate of Gη(µn) 2 L2(X,X,µn) 0. Theorem 2 (Convergence rate: Wasserstein (sub)gradient mapping). Under Assumptions 1, 2, let {µn}n N be the sequence of distributions produced by semi FB Euler starting from some µ0 P2,abs(X) with 0 < η < η0. Then it holds minn=1,N Gη(µn) 2 L2(X,X,µn) = O(N 1). Proof of Thm. 2 is in Appx. A.5. This theorem holds without requiring G and H to be differentiable. Next, if H is twice differentiable with uniformly bounded Hessian, we can derive a stronger convergence guarantee based on Fréchet stationarity (see Sect. 2.4). In other words, we evaluate the rate of dist (0, F F(µn)) := infξ F F(µn) ξ L2(X,X;µn) 0. Assumption 4. H C2(X) whose Hessian is bounded uniformly (H is then LH-smooth). Theorem 3 (Convergence rate: Fréchet subdifferentials). 
Under Assumptions 1, 4, let {µn}n N be the sequence of distributions produced by semi FB Euler starting from some µ0 P2,abs(X) with 0 < η < η0, then minn=1,N dist (0, F F(µn)) = O N 1 Proof of Thm. 3 is in Appx. A.6. 4.3 Fast convergence under isoperimetry and beyond Fast convergence can be obtained under isoperimetry, e.g., log-Sobolev inequality (LSI). There are certain connections between LSI in sampling and the Łojasiewicz condition in optimization allowing linear convergence. In nonconvex optimization in Euclidean space, analytic and subanalytic functions are a large class satisfying Łojasiewicz condition [44, 14]. Subanalytic DC programs are studied in [45]. In the infinite-dimensional setting of the Wasserstein space, the Łojasiewicz condition should be regarded as functional inequalities [12]. Assumption 5 (Łojasiewicz condition in the Wasserstein space). Assume that F is the optimal value of F, and assume there exist r0 (F , + ], θ [0, 1), and c > 0 such that for all µ P2(X), F(µ) F < r0 c (F(µ) F )θ inf{ ξ L2(X,X,µ) : ξ F F(µ)}, where the conventions 00 = 0 and inf = + are used. We call θ [0, 1) the Łojasiewicz exponent of F at optimality. Remark 2. If H is the is negative entropy, F C2(X) whose Hessian is bounded uniformly, then F is Wasserstein differentiable at µ P2,abs(X) with gradient W F(µ) = µ µ + F provided that all terms are well-defined [43, Prop. 2.12, E.g. 2.3]. We have W F(µ) 2 L2(X,X,µ) = R µ(x) µ(x) + F(x) 2 dµ(x) = R µ(x) log µ(x) µ (x) 2 dx, where µ exp( F). On the other hand, F(µ) F = DKL(µ µ ). The log-Sobolev inequality with parameter α > 0 inequality reads [57] DKL(µ µ ) 1 2α FI(µ µ ) := 1 2α R µ(x) log µ(x) µ (x) 2 dx, where FI(µ µ ) is the relative Fisher information of µ w.r.t. µ . Therefore, log-Sobolev inequality is a special case of Łojasiewicz condition with θ = 1/2. In another case, when the objective function is the Maximum Mean Discrepancy, under some regularity assumption of the kernel, it holds [7] 2(F(µ) F(µ )) µ µ H 1(µ) Z W F(µ) 2dµ(x) where µ µ H 1(µ) is the weighted negative Sobolev distance. This is "nearly" the Łojasiewicz condition, with a caveat that µ µ H 1(µ) may be unbounded. Nevertheless, assuming the boundedness of this term along the algorithm s iterates is sufficient for convergence. Theorem 4. Under Assumptions 1, 4 and Assumption 5 with parameters (r0, c, θ). Let {µn}n N be the sequence of distributions produced by semi FB Euler starting from some sufficiently warm-up µ0 P2,abs(X) such that F(µ0) < r0 and with stepsize 0 < η < η0, then (i) if θ = 0, F(µn) F converges to 0 in a finite number of steps; (ii) if θ (0, 1/2], F(µn) F = O M M+1 n where M = 2(η2L2 H+1) c2η ; (iii) if θ (1/2, 1), F(µn) F converges sublinearly to 0, i.e., F(µn) F = O n 1 2θ 1 . Proof of Thm. 4 is in Appx. A.7. Remark 3. In the usual sampling case, i.e., H is the negative entropy, and under log-Sobolev condition, r0 = + . Therefore, µ0 can be arbitrarily in P2,abs(X). In the general case, however, a good enough starting point (i.e., F(µ0) < r0) is needed to guarantee we are in the region where Łojasiewicz condition comes into play. In such a case, F(µn) F = DKL(µn µ ) where µ (x) exp( F(x)) is the target distribution (see Rmk. 2), so Thm. 4 provides convergence rate of {µn}n N to µ in terms of KL divergence and this convergence is exponentially fast if θ (0, 1/2]. Theorem 5. Under the same set of assumptions as in Thm. 4, the sequence {µn}n N is a Cauchy sequence under Wasserstein topology. 
Furthermore, as the Wasserstein space (P2(X), W2) is complete [5, Thm. 2.2], every Cauchy sequence is convergent, i.e., there exists µ P2(X) such that µn Wass µ . The limit distribution µ is indeed the global minimizer of F. In addition: (i) if θ = 0, W2(µn, µ ) converges to 0 in a finite number of steps; (ii) if θ (0, 1/2], W2(µn, µ ) = O M M+1 n , where M = 1 + (2(η2L2 H+1)) 1 2θ (iii) if θ (1/2, 1), W2(µn, µ ) = O n 1 θ Proof of Thm. 5 is in Appx. A.8. This theorem provides convergence to optimality in terms of Wasserstein distance. Remark 4. If H is the negative entropy and θ = 1/2, under some technical assumptions on F (e.g., continuously twice differentiable), LSI implies Talagrand inequality [57] (in optimization, known as Łojasiewicz implies quadratic growth [40]), meaning that KL divergence controls squared Wasserstein distance, so fast convergence under KL divergence implies fast convergence under Wasserstein distance. 5 Practical implementations The push-forward step νn+1 = (I + η H)#µn is rather straightforward: if Z are samples from µn then Z + η H(Z) are samples from νn+1. On the other hand, to move from νn+1 to µn+1 we have to work out the JKO operator. Recent advances [53, 2] propose using the gradient of an input-convex neural network (ICNN) [6] to approximate the optimal Monge map pushing νn+1 to µn+1, which we briefly describe as follows. This approach is inspired by Brenier theorem asserting that an optimal Monge map has to be the (sub)gradient field of some convex function. Therefore, one can "parametrize" µ P2,abs(X) as µ = ψ#νn+1 for some convex function ψ. We then write the JKO objective as H ( ψ#νn+1) + Z X G( ψ(x))dνn+1(x) + 1 X x ψ(x) 2dνn+1(x). (5) While the two last terms (potential energy and squared Wasserstein distance) in (5) can be handled efficiently by the Monte Carlo method using samples from νn+1, the first term H might be complicated as it possibly involves the (unavailable) density of νn+1. We remark that the easy case would be H being another potential energy or an interaction energy. In such a case, Monte Carlo approximations are again readily applicable. The tricky case would be H being the negative entropy that requires the density of νn+1. Fortunately, we have the following change of entropy formula: for any T : X X diffeomorphic, any ρ P2,abs(X), it holds H (T#ρ) = H (ρ) + R X log | det T(x)|dρ(x). Therefore, (5) can be written as (up to a constant that does not depend on ψ) Z log det 2ψ(x) + G( ψ(x)) + 1 2η x ψ(x) 2 dνn+1(x). Note that this entropy formula can be extended naturally to the case of general internal energy [2], which means we can also handle this general case. Let us now consider the entropy case for simplicity. We can leverage on a class of input convex neural networks [6] ψθ(x) (θ is the neural network s parameters, x is the input) in which x 7 ψθ(x) is convex. Optimizing over θ can then be solved effectively by standard deep learning optimizers (e.g. Adam). The complete scheme is given in Alg. 1. We also remark that [2] further proposes fast approximation for log det 2ψ and [31] leverages on the variational formula of the KL to propose an even faster scheme. Nevertheless, these schemes are generally expensive. For illustrative purposes, we adopt the vanilla version of [53]. The iteration complexity is thus cubic in d, linear in the size of the ICNN, and linear in the iteration count k [53]. 
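To complement the description above, the following is a minimal PyTorch sketch, in the spirit of [53, 2], of an ICNN parametrization and one inner optimization step of the JKO objective (5) in the negative-entropy case. The network architecture, the choice of G, the stepsize, and the batch (drawn directly as a stand-in for samples of νn+1) are illustrative assumptions; the entropy term enters as −log det ∇²ψ, the sign coming from the change-of-variables formula for the push-forward density.

```python
import torch
import torch.nn as nn
import torch.nn.functional as TF   # aliased to avoid clashing with the paper's F

class TinyICNN(nn.Module):
    """Minimal input-convex network psi_theta(x): softplus activations (convex and
    non-decreasing), non-negative hidden-to-hidden/output weights, and a small
    quadratic term keeping the Hessian positive definite."""
    def __init__(self, d, h=32, eps=0.05):
        super().__init__()
        self.Wx0, self.Wx1 = nn.Linear(d, h), nn.Linear(d, h)
        self.A1_raw = nn.Parameter(0.1 * torch.randn(h, h))   # mapped to >= 0 below
        self.a_raw = nn.Parameter(0.1 * torch.randn(h))       # mapped to >= 0 below
        self.eps = eps

    def forward(self, x):                                      # x: (d,) -> scalar
        z1 = TF.softplus(self.Wx0(x))
        z2 = TF.softplus(TF.softplus(self.A1_raw) @ z1 + self.Wx1(x))
        return TF.softplus(self.a_raw) @ z2 + 0.5 * self.eps * x.dot(x)

def jko_inner_loss(psi, batch, G, eta):
    """Monte Carlo estimate of the JKO objective (5), negative-entropy case:
    E_{x~nu}[ ||grad psi(x) - x||^2/(2 eta) + G(grad psi(x)) - log det Hess psi(x) ]."""
    losses = []
    for xi in batch:
        x = xi.clone().requires_grad_(True)
        g = torch.autograd.grad(psi(x), x, create_graph=True)[0]      # grad_x psi(x)
        rows = [torch.autograd.grad(g[i], x, create_graph=True)[0]
                for i in range(x.numel())]                            # Hessian rows
        logdet = torch.linalg.slogdet(torch.stack(rows)).logabsdet
        losses.append((g - x).pow(2).sum() / (2 * eta) + G(g) - logdet)
    return torch.stack(losses).mean()

torch.manual_seed(0)
d, eta = 2, 0.1
psi, G = TinyICNN(d), (lambda y: 0.5 * y.dot(y))        # G is an illustrative choice
opt = torch.optim.Adam(psi.parameters(), lr=1e-3)

batch = torch.randn(64, d)                              # stand-in for nu_{n+1} samples
loss = jko_inner_loss(psi, batch, G, eta)
opt.zero_grad(); loss.backward(); opt.step()            # one inner step of Alg. 1
print("inner JKO loss:", float(loss))
```

This corresponds to one pass of the inner loop of Alg. 1 below, except that, for brevity, the batch is not pushed through the previously learned maps (I + η∇H) ∘ ∇ψθ as a full implementation would do.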
Algorithm 1 Semi FB Euler for sampling
Input: initial measure µ0 ∈ P2,abs(X), discretization step size η > 0, number of steps K > 0, batch size B.
for k = 1 to K do
    for i = 1, 2, . . . do
        Draw a batch of samples Z ∼ µ0 of size B;
        Ξ ← (I + η∇H) ∘ ∇xψθk ∘ (I + η∇H) ∘ ∇xψθk−1 ∘ · · · ∘ (I + η∇H)(Z);
        Ŵ2² ← (1/B) Σξ∈Ξ ‖∇xψθ(ξ) − ξ‖²;
        Û ← (1/B) Σξ∈Ξ G(∇xψθ(ξ));
        Ĥ ← −(1/B) Σξ∈Ξ log det ∇²xψθ(ξ);
        L̂ ← (1/(2η)) Ŵ2² + Û + Ĥ;
        Apply an optimization step (e.g., Adam) over θ using ∇θL̂.
    end for
    θk+1 ← θ.
end for

6 Numerical illustrations

We perform numerical sampling experiments on non-log-concave distributions: a Gaussian mixture distribution and the distance-to-set-prior [61] relaxed von Mises-Fisher distribution. Both are log-DC, and the latter has a non-differentiable logarithmic probability density (see Appx. C). Fig. 1 presents the sampling results. Experiment details are in Appx. B and Appx. C.¹

Figure 1: (a) and (b): Mixture of Gaussians. (a) shows samples obtained from semi FB Euler at iteration 40 and (b) shows the KL divergence along the training process (x-axis: number of iterations; y-axis: KL divergence): semi FB Euler, with sound theory, is as fast as FB Euler. We also show the ULA's final result as a horizontal line for reference. (c) and (d): Relaxed von Mises-Fisher. (c) shows the true probability density, and (d) shows the sample histogram obtained from semi FB Euler. In this experiment, FB Euler fails to work, attributed to the high curvature of the relaxed von Mises-Fisher.

7 Discussion and related work

We first narrow down our discussion to FB Euler and its variants in the Wasserstein space. When H is the negative entropy, Wibisono [70] provides some insightful discussion on why FB Euler should be consistent (no asymptotic bias): the backward step is adjoint to the forward step and hence preserves stationarity. However, no convergence theory is presented for FB Euler in the Wasserstein space in [70]. Recently, Salim et al. [65] provide a convergence guarantee for FB Euler within the following setting: H is convex along generalized geodesics, F is Lipschitz smooth and convex/strongly convex. This setting remains a "convex + convex" structure, while ours has a "convex + concave" structure. A natural extension of our work would be a full-fledged study of DC programming in the Wasserstein space, F = G − H, where the functionals G and H are convex along generalized geodesics and H is not necessarily a potential energy EH. This problem poses another implementation challenge regarding the Wasserstein gradient of H. Other works that bear a tangential relation to ours involve the use of forward-only Euler and the ULA. Arbel et al. [7] study Wasserstein gradient flows with forward Euler for the maximum mean discrepancy. This objective function is also nonconvex (specifically, weakly convex). Durmus et al. [26] analyze the ULA from the convex optimization perspective. Vempala et al. [67] show that LSI and Hessian boundedness suffice for fast convergence of the ULA, where "fast" is understood as fast convergence to the biased target, since the ULA is a biased algorithm. Balasubramanian et al. [9] analyze the ULA under quite mild conditions: the log-density is Lipschitz/Hölder smooth. Bernton [11] studies the proximal-ULA, also under convexity assumptions, where the difference from the ULA is in the first step: gradient descent is replaced by the proximal operator. Similar to the ULA, the proximal-ULA is asymptotically biased.
To address nonsmoothness, another line of research utilizes Moreau-Yosida envelopes to create smooth approximations of the ULA dynamics [29, 50]. This approach is also applicable to certain classes of non-log-concave distributions [50] and is more of a flavour of discretization error quantification. 8 Conclusion We propose a new semi FB Euler scheme as a discretization of Wasserstein gradient flow and show that it has favourably theoretical guarantees that the commonly used FB Euler does not yet have if the objective function is not convex along generalized geodesics. Our theoretical analysis opens up interesting avenues for future work. Given the ubiquity of nonconvexity, we hope that the idea can be reused in various contexts, such as with different optimal transport cost functions, different base spaces in the Wasserstein space [42], or submanifolds of the Wasserstein space (e.g., Bures-Wasserstein [42, 24]). 1Our code is available at https://github.com/MCS-hub/OW24 Acknowledgments and Disclosure of Funding This work is supported by the Research Council of Finland s Flagship programme: Finnish Center for Artificial Intelligence (FCAI), and additionally by grants 345811, 348952, and 346376 (VILMA: Virtual Laboratory for Molecular Level Atmospheric Transformations). The authors wish to thank the Finnish Computing Competence Infrastructure (FCCI) for supporting this project with computational and data storage resources. We thank the anonymous reviewers for their insightful comments and suggestions. H.P.H. Luu specifically thanks Michel Ledoux, Luigi Ambrosio, Alain Durmus, and Shuailong Zhu for helpful information. [1] M. Ahn, J.-S. Pang, and J. Xin. Difference-of-convex learning: directional stationarity, optimality, and sparsity. SIAM Journal on Optimization, 27(3):1637 1665, 2017. [2] D. Alvarez-Melis, Y. Schiff, and Y. Mroueh. Optimizing functionals on the space of probabilities with input convex neural networks. Transactions on Machine Learning Research, 2022. [3] L. Ambrosio, A. Bressan, D. Helbing, A. Klar, E. Zuazua, L. Ambrosio, and N. Gigli. A user s guide to optimal transport. Modelling and Optimisation of Flows on Networks: Cetraro, Italy 2009, Editors: Benedetto Piccoli, Michel Rascle, pages 1 155, 2013. [4] L. Ambrosio, N. Gigli, and G. Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2005. [5] L. Ambrosio and G. Savaré. Gradient flows of probability measures. Handbook of differential equations: evolutionary equations, 3:1 136, 2006. [6] B. Amos, L. Xu, and J. Z. Kolter. Input convex neural networks. In International Conference on Machine Learning, pages 146 155. PMLR, 2017. [7] M. Arbel, A. Korba, A. Salim, and A. Gretton. Maximum mean discrepancy gradient flow. Advances in Neural Information Processing Systems, 32, 2019. [8] M. Baˇcák and J. M. Borwein. On difference convexity of locally Lipschitz functions. Optimization, 60(8-9):961 978, 2011. [9] K. Balasubramanian, S. Chewi, M. A. Erdogdu, A. Salim, and S. Zhang. Towards a theory of non-log-concave sampling: first-order stationarity guarantees for Langevin Monte Carlo. In Conference on Learning Theory, pages 2896 2923. PMLR, 2022. [10] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183 202, 2009. [11] E. Bernton. Langevin Monte Carlo and JKO splitting. In Conference on learning theory, pages 1777 1798. PMLR, 2018. [12] A. Blanchet and J. Bolte. 
A family of functional inequalities: Łojasiewicz inequalities and displacement convex functions. Journal of Functional Analysis, 275(7):1650 1673, 2018. [13] V. I. Bogachev and M. A. S. Ruas. Measure theory, volume 2. Springer, 2007. [14] J. Bolte, A. Daniilidis, and A. Lewis. The łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205 1223, 2007. [15] B. Bonnet. A Pontryagin maximum principle in Wasserstein spaces for constrained optimal control problems. ESAIM: Control, Optimisation and Calculus of Variations, 25:52, 2019. [16] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer, 2006. [17] G. E. Bredon. Topology and geometry, volume 139. Springer Science & Business Media, 2013. [18] Y. Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on pure and applied mathematics, 44(4):375 417, 1991. [19] brett1479 (https://math.stackexchange.com/users/62876/brett1479). Borel sigma algebra of one point compactification. Mathematics Stack Exchange. URL:https://math.stackexchange.com/q/3532983 (version: 2020-02-03). [20] H. Brézis. Functional analysis, Sobolev spaces and partial differential equations, volume 2. Springer, 2011. [21] L. Chizat and F. Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in neural information processing systems, 31, 2018. [22] Y. Cui, J.-S. Pang, and B. Sen. Composite difference-max programs for modern statistical estimation problems. SIAM Journal on Optimization, 28(4):3344 3374, 2018. [23] A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society Series B: Statistical Methodology, 79(3):651 676, 2017. [24] M. Z. Diao, K. Balasubramanian, S. Chewi, and A. Salim. Forward-backward gaussian variational inference via jko in the bures-wasserstein space. In International Conference on Machine Learning, pages 7960 7991. PMLR, 2023. [25] X. V. Doan and S. Vavasis. Low-rank matrix recovery with Ky Fan 2-k-norm. Journal of Global Optimization, 82(4):727 751, 2022. [26] A. Durmus, S. Majewski, and B. Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(73):1 46, 2019. [27] A. Durmus and É. Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551 1587, 2017. [28] A. Durmus and É. Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25(4A):2854 2882, 2019. [29] A. Durmus, E. Moulines, and M. Pereyra. Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau. SIAM Journal on Imaging Sciences, 11(1):473 506, 2018. [30] M. Erbar. The heat equation on manifolds as a gradient flow in the Wasserstein space. In Annales de l IHP Probabilités et statistiques, volume 46, pages 1 23, 2010. [31] J. Fan, Q. Zhang, A. Taghvaei, and Y. Chen. Variational Wasserstein gradient flow. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 6185 6215. PMLR, 2022. [32] J. Geng, L. Wang, and Y. Wang. A non-convex algorithm framework based on DC programming and DCA for matrix completion. Numerical Algorithms, 68:903 921, 2015. [33] S. Ghadimi, G. Lan, and H. Zhang. 
Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1):267 305, 2016. [34] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723 773, 2012. [35] P. Hajlasz. Is there a Borel measurable f : Rd Rd such that f(x) φ(x) for all x? Math Overflow. URL:https://mathoverflow.net/q/453991 (version: 2023-12-02). [36] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. 1970. [37] H. Herrlich. Axiom of choice, volume 1876. Springer, 2006. [38] S. J Reddi, S. Sra, B. Poczos, and A. J. Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. Advances in neural information processing systems, 29, 2016. [39] R. Jordan, D. Kinderlehrer, and F. Otto. The variational formulation of the Fokker Planck equation. SIAM journal on mathematical analysis, 29(1):1 17, 1998. [40] H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-łojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part I 16, pages 795 811. Springer, 2016. [41] A. Korotin, V. Egiazarian, A. Asadulaev, A. Safin, and E. Burnaev. Wasserstein-2 generative networks. In International Conference on Learning Representations, 2021. [42] M. Lambert, S. Chewi, F. Bach, S. Bonnabel, and P. Rigollet. Variational inference via wasserstein gradient flows. Advances in Neural Information Processing Systems, 35:14434 14447, 2022. [43] N. Lanzetti, S. Bolognani, and F. Dörfler. First-order conditions for optimization in the Wasserstein space. ar Xiv preprint ar Xiv:2209.12197, 2022. [44] S. law Łojasiewicz. Ensembles semi-analytiques. IHES notes, page 220, 1965. [45] H. A. Le Thi, V. N. Huynh, and T. Pham Dinh. Convergence analysis of difference-of-convex algorithm with subanalytic data. Journal of Optimization Theory and Applications, 179(1):103 126, 2018. [46] H. A. Le Thi, V. N. Huynh, T. Pham Dinh, and H. P. H. Luu. Stochastic difference-ofconvex-functions algorithms for nonconvex programming. SIAM Journal on Optimization, 32(3):2263 2293, 2022. [47] H. A. Le Thi and T. Pham Dinh. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of operations research, 133:23 46, 2005. [48] H. A. Le Thi, T. Pham Dinh, H. M. Le, and X. T. Vo. Dc approximation approaches for sparse optimization. European Journal of Operational Research, 244(1):26 46, 2015. [49] J. M. Lee. Smooth Manifolds. In Introduction to Smooth Manifolds, pages 1 31. Springer. [50] T. D. Luu, J. Fadili, and C. Chesneau. Sampling from non-smooth distributions through Langevin diffusion. Methodology and Computing in Applied Probability, 23:1173 1201, 2021. [51] S. Mei, T. Misiakiewicz, and A. Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Conference on Learning Theory, pages 2388 2464. PMLR, 2019. [52] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The journal of chemical physics, 21(6):1087 1092, 1953. [53] P. Mokrov, A. Korotin, L. Li, A. Genevay, J. M. Solomon, and E. Burnaev. Large-scale Wasserstein gradient flows. 
Advances in Neural Information Processing Systems, 34:15243 15256, 2021. [54] B. Mordukhovich and M. N. Nguyen. An easy path to convex analysis and applications. Springer Nature, 2023. [55] Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013. [56] M. Nouiehed, J.-S. Pang, and M. Razaviyayn. On the pervasiveness of difference-convexity in optimization and statistics. Mathematical Programming, 174(1):195 222, 2019. [57] F. Otto and C. Villani. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. Journal of Functional Analysis, 173(2):361 400, 2000. [58] J.-S. Pang, M. Razaviyayn, and A. Alvarado. Computing B-stationary points of nonsmooth DC programs. Mathematics of Operations Research, 42(1):95 118, 2017. [59] N. Parikh, S. Boyd, et al. Proximal algorithms. Foundations and trends in Optimization, 1(3):127 239, 2014. [60] T. Pham Dinh and H. A. Le Thi. Convex analysis approach to DC programming: theory, algorithms and applications. Acta mathematica vietnamica, 22(1):289 355, 1997. [61] R. Presman and J. Xu. Distance-to-set priors and constrained Bayesian inference. In International Conference on Artificial Intelligence and Statistics, pages 2310 2326. PMLR, 2023. [62] G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pages 341 363, 1996. [63] R. T. Rockafellar. Lagrange multipliers and optimality. SIAM review, 35(2):183 238, 1993. [64] R. T. Rockafellar. Convex analysis, volume 11. Princeton university press, 1997. [65] A. Salim, A. Korba, and G. Luise. The Wasserstein proximal gradient algorithm. Advances in Neural Information Processing Systems, 33:12356 12366, 2020. [66] A. Taghvaei and P. Mehta. Accelerated flow for probability distributions. In International conference on machine learning, pages 6076 6085. PMLR, 2019. [67] S. Vempala and A. Wibisono. Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices. Advances in neural information processing systems, 32, 2019. [68] C. Villani. Optimal transport: old and new, volume 338. Springer, 2009. [69] C. Villani. Topics in optimal transportation, volume 58. American Mathematical Soc., 2021. [70] A. Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Conference on Learning Theory, pages 2093 3027. PMLR, 2018. [71] H. Zhang and Y.-S. Niu. A Boosted-DCA with power-sum-DC decomposition for linearly constrained polynomial programs. Journal of Optimization Theory and Applications, pages 1 40, 2024. [72] X. Zhou. On the Fenchel duality between strong convexity and Lipschitz continuous gradient. ar Xiv preprint ar Xiv:1803.06573, 2018. Lemma 2 (Transfer lemma). [3, Sect. 1] Let T : Rm Rn be a measurable map, and µ P(Rm), then T#µ P(Rn) and R f(y)d(T#µ)(y) = R (f T)(x)dµ(x) for every measurable function f : Rn R, where the above identity has to be understood that: one of the integrals exits (potentially ) iff the other one exists, and in such a case they are equal. Consequently, for a bounded function f, the above integrals exist as real numbers that are equal. Lemma 3. [4, Rmk. 6.2.11] Let µ, ν P2,abs(X), then T µ ν T ν µ = I µ-a.e. and T ν µ T µ ν = I ν-a.e. Theorem 6 (Characterization of Fréchet subdifferential for geodesically convex functions). [4, Section 10.1] Suppose ϕ : P2(X) R {+ } is proper, l.s.c, convex on geodesics. 
Let µ dom( ϕ) P2,abs(X), then a vector ξ L2(X, X, µ) belongs to the Fréchet subdifferential of ϕ at µ if and only if ϕ(ν) ϕ(µ) Z X ξ(x), T ν µ(x) x dµ(x) ν dom(ϕ). Lemma 4. Let H : X R be a convex function having quadratic growth and ξ be a measurable selector of H, i.e., ξ(x) H(x) for all x X. Then, for all µ P2(X), Z X ξ(x) 2dµ(x) < + . (6) In other words, ξ L2(X, X, µ) for all µ P2(X). Proof. Since ξ(x) H(x), by tangent inequality for convex functions, we have H(y) H(x) + ξ(x), y x , y X. By picking y = x + ϵξ(x) for some ϵ > 0, we get H(x + ϵξ(x)) H(x) ξ(x), ϵξ(x) = ϵ ξ(x) 2. (7) Since H has quadratic growth, for some a > 0, H(x + ϵξ(x)) H(x) |H(x + ϵξ(x))| + |H(x)| a x + ϵξ(x) 2 + 1 + a( x 2 + 1) 2a + a x 2 + a( x + ϵ ξ(x) )2 2a + a x 2 + 2a( x 2 + ϵ2 ξ(x) 2) = 2a + 3a x 2 + 2aϵ2 ξ(x) 2. Combining with (7), it holds ϵ(1 2aϵ) ξ(x) 2 2a + 3a x 2. By choosing 0 < ϵ < 1/(2a), we obtain ξ(x) 2 2a ϵ(1 2aϵ) + 3a ϵ(1 2aϵ) x 2. Therefore, ξ(x) 2 has quadratic growth and - as a consequence - (6) holds for any µ P2(X). Lemma 5. Let ϕ : P2(X) R {+ } be a proper function. Let µ be a local minimizer of ϕ, then µ is a Fréchet stationary point of ϕ. Proof. There exists r > 0 such that ϕ(µ ) ϕ(µ) for all µ P2(X) : W2(µ, µ ) < r. It follows that W2(µ, µ ) 0, so 0 F ϕ(µ ), or µ is a Fréchet stationary point of ϕ. Lemma 6. Let ϕ : P2(X) R {+ } be a proper, l.s.c, geodesically convex function. Suppose that µ P2,abs(X) is a Fréchet stationary point of ϕ. Then, µ is a global minimizer of ϕ. Proof. By definition of Fréchet stationarity, 0 ϕ(µ ). By characterization of subdifferential of geodesically convex functions (Thm. 6), it holds ϕ(µ) ϕ(µ ) for all µ dom(ϕ), or µ is a global minimizer of ϕ. Lemma 7. Under Assumption 1, let µ dom(F) be a local minimizer of F, then µ is a critical point of F, i.e., (H + EG)(µ ) EH(µ ) = . Proof. Since µ is a local minimizer of F, there exists r > 0 such that F(µ ) F(µ), µ P2(X) : W2(µ, µ ) < r. (8) Let ξ be a measurable selector of H. Thanks to Lemma 4, ξ L2(X, X, µ ). According to [5, Prop. 4.13], ξ EH(µ ). It follows from Thm. 6 that EH(µ) EH(µ ) + Z X ξ(x), T µ µ (x) x dµ (x), µ P2(X). (9) From (8) and (9), for µ B(µ , r), H (µ) + EG(µ) H (µ ) + EG(µ ) + Z X ξ(x), T µ µ (x) x dµ (x). Therefore, ξ (H + EG)(µ ) since H + EG is geodesically convex. It follows that µ is a critical point of F. Lemma 8. Let U, V : P2(X) R {+ }. Assume that U and V are Fréchet subdifferentiable at µ, the following statements hold a. F (U + V)(µ) F U(µ) + F V(µ). b. If V is Wasserstein differentiable of µ, then F (U + V)(µ) = F U(µ) + W V(µ). (10) Proof. Item a. is trivial from the definition of Fréchet subdifferential. For item b., from item a., we first see that, F (U + V)(µ) F U(µ) + W V(µ). (11) On the other hand, we apply item a. for U + V and V to obtain F U(µ) F (U + V)(µ) + F ( V)(µ). Since W V(µ) F ( V)(µ), it follows that F U(µ) F (U + V)(µ) W V(µ). (12) From (11) and (12), we derive (10). Lemma 9. Let µ L d, and g is a strongly convex function. Then g#µ L d. Proof. Let Ω= {x : 2g(x) exists}, then L d(Rd \Ω) = 0 (Aleksandrov, see, e.g., [4, Thm. 5.5.4]). Since g is strongly convex, g is injective on Ωand | det 2g| > 0 on Ω. By applying Lemma 5.5.3 [4], g#µ L d. Lemma 10. [65] Let G : P2(X) R {+ } be proper and l.s.c. Suppose that G is convex along generalized geodesics. Let ν P2,abs(X), µ, π P2(X). If ξ G(µ), then Z X ξ T µ ν (x), T π ν (x) T µ ν (x) dν(x) G(π) G(µ). 
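As a quick numerical sanity check of the transfer lemma (Lem. 2), the sketch below (with an assumed map, measure, and test function) confirms that integrating f against T#µ agrees with integrating f ∘ T against µ; here µ = N(0, 1), T = exp, and f is the identity, so that T#µ is log-normal with a known mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# mu = N(0, 1), T(x) = exp(x)  =>  T#mu = LogNormal(0, 1)      (illustrative choices)
X = rng.normal(size=2_000_000)
f = lambda y: y                                                 # test function

rhs = np.mean(f(np.exp(X)))                             # Monte Carlo of  int (f o T) dmu
lhs = stats.lognorm(s=1.0, scale=np.exp(0.0)).mean()    # closed form of  int f d(T#mu)
print(f"int f d(T#mu) = {lhs:.4f},  int (f o T) dmu = {rhs:.4f}")   # both about e^0.5 = 1.6487
```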
A.1 Existence of a Borel measurable selector of the subdifferential of a convex function Given a convex function H : X R, we prove that there exists a Borel measurable selector S(x) H(x). Although this problem is of natural interest, we are not aware of it as well as its proof at least in standard textbooks in convex analysis. Credits go to a quite recent Math Overflow thread [35], from which we give detailed proof as follows. Firstly, we recall Alexandroff s compactification of a topological space (X, τ). From set theory, X is strictly smaller than 2X, which is the set of all subsets of X, i.e., there is no bijection from X to 2X. So 2X cannot be contained in X. So, there is an element named that is not in X. We denote X = X { }. One-point Alexandroff compactification states that (1) there exists a topology τ in X accepting X as a topological subspace, i.e., the original topology τ in X is inherited from τ , and (2) (X , τ ) is compact. The topology τ can be specifically described as follows: open sets of τ are either open sets of τ or the complements of the form (X \ S) { } where S are closed compact subsets of X. In our case, X = Rd and τ is the standard Euclidean topology, the Alexandroff compactification of X is also metrizable [17, Thm. 12.12]. It is, in fact, homeomorphic to the sphere Sd whose topology is inherited from the ambient space Rd+1. Moreover, the mentioned metric is the Riemannian metric of Sd [49, Thm. 13.29]. Secondly, since ψ(x) := H(x) + (1/2) x 2 is 1-strongly convex, its Fenchel conjugate y 7 ψ (y) defined as ψ (y) = sup x Rd{ x, y ψ(x)} is 1-smooth [72, Thm. 1]. By [64, Cor. 23.5.1], ( ψ) 1 = ψ in the sense that y ψ( ψ (y)) y Rd. (13) On the other hand, ψ is strongly monotone [54, Ex. 3.9] in the following sense, x2 x1, y2 y1 x2 x1 2 (14) for all x1, x2 Rd and y1 ψ(x1), y2 ψ(x2). We see that ψ is subjective. Indeed we show that for every x Rd, there exists y Rd such that ψ (y) = x. We show that relation holds for any y ψ(x). By contradiction, suppose that ψ (y) = x. From the strong monotonicity of ψ as in (14), ψ( ψ (y)) ψ(x) = . However, from (13) and by the choice of y, it holds y ψ( ψ (y)) ψ(x). This is a contradiction. Thirdly, we recall a fundamental result on the compactness of the subdifferential of a convex function: if C is compact, then ψ(C) is compact [64, Thm. 24.7]. Fourthly, we need the Federer-Morse theorem [13] as follows: Theorem 7. Let Z be a compact metric space, Y be a Hausdorff topological space and f : Z Y be a continuous mapping. Then, there exists a Borel set B Z such that f(B) = f(Z) and f is injective on B. Furtheremore, f 1 : f(Z) B is Borel. Now we observe that ψ (x) was x . Otherwise, by using the compactness of ψ and (13), we will get a contradiction immediately. We then can extend ψ in a continuous way in the Alexandorff compactification space X = Rd { } by simply putting ψ ( ) = . We shall show that this extension of ψ is continuous from (X , τ ) to (X , τ ), or ( ψ ) 1(V ) τ for all V τ . Recall that, by construction, open sets of τ are either open sets of τ or the complements of the form (X \ S) { } where S are compact subsets of X. The former type of open sets is handled easily since ψ is already continuous in (X, τ). For the latter type, let U = ( ψ ) 1((X \ S) { }) = ( ψ ) 1(X \ S) { } = (X \ ( ψ ) 1(S)) { } for a compact set S. Proving U open boils down to proving ( ψ ) 1(S) compact. Indeed, it is closed since S is closed. It is bounded. Otherwise, it will be contradictory to ψ (x) as x . 
We can now apply the Federer-Morse theorem with $Z = Y = (X_\infty, \tau_\infty)$, noting that $(X_\infty, \tau_\infty)$ is metrizable and every metric space is Hausdorff, and with $f$ the extended map $\nabla\psi^* : X_\infty \to X_\infty$: there exists a Borel set $B \subset X_\infty$ such that $\nabla\psi^*|_B : B \to X_\infty$ is a bijection and the inverse mapping $(\nabla\psi^*|_B)^{-1} : X_\infty \to B$ is Borel measurable (here Borel sets and measurability are with respect to $\tau_\infty$, not yet $\tau$). This is the Borel (w.r.t. $\tau_\infty$) selector of $\partial\psi$.

Finally, we need to convert Borel measurability w.r.t. $\tau_\infty$ into Borel measurability w.r.t. $\tau$. In terms of the mapping, $\infty$ is mapped to $\infty$ either way around, so we only need to show that $(\nabla\psi^*|_B)^{-1} : X \to X$ is Borel measurable w.r.t. $\tau$. Take any Borel set (w.r.t. $\tau$) $E \subset X$; then $(\nabla\psi^*|_B)(B \cap E)$ is a Borel set w.r.t. $\tau_\infty$ and does not contain $\infty$. We shall prove that $(\nabla\psi^*|_B)(B \cap E)$ is a Borel set w.r.t. $\tau$. This follows directly from the following claim, which is from a Mathematics Stack Exchange thread [19].

Claim [19]:
$$\sigma(\tau_\infty) = \sigma\big(\tau \cup \{\{\infty\}\}\big) = \sigma(\tau) \cup \{V \cup \{\infty\} : V \in \sigma(\tau)\}. \qquad (15)$$

A sketch of the proof of the claim goes as follows. For the first equality in (15), we have $\sigma(\tau \cup \{\{\infty\}\}) \subset \sigma(\tau_\infty)$ because (1) $\tau \subset \tau_\infty$ and (2) $\{\infty\} = X_\infty \setminus X \in \sigma(\tau_\infty)$ as $X, X_\infty \in \tau_\infty$. On the other hand, $\sigma(\tau_\infty) \subset \sigma(\tau \cup \{\{\infty\}\})$ because, again, of the construction of $\tau_\infty$: let $U \in \tau_\infty$; if $U \in \tau$ then $U \in \sigma(\tau \cup \{\{\infty\}\})$; otherwise $U = (X \setminus S) \cup \{\infty\}$ for some compact set $S \subset X$. As $X = \mathbb{R}^d$, $S$ is closed, so $X \setminus S \in \tau$, implying $U \in \sigma(\tau \cup \{\{\infty\}\})$. For the second equality in (15), it is straightforward to verify that $\mathcal{G} := \sigma(\tau) \cup \{V \cup \{\infty\} : V \in \sigma(\tau)\}$ is a sigma-algebra on $X_\infty$. Since $\tau \cup \{\{\infty\}\} \subset \mathcal{G}$, it holds that $\sigma(\tau \cup \{\{\infty\}\}) \subset \mathcal{G}$. Conversely, as $\sigma(\tau) \subset \sigma(\tau \cup \{\{\infty\}\})$ and, consequently, $V \cup \{\infty\} \in \sigma(\tau \cup \{\{\infty\}\})$ for all $V \in \sigma(\tau)$, it holds that $\mathcal{G} \subset \sigma(\tau \cup \{\{\infty\}\})$.

We conclude that $(\nabla\psi^*|_B)^{-1} : X \to X$ is a Borel (w.r.t. $\tau$) selector of $\partial\psi$. As a consequence, $S := (\nabla\psi^*|_B)^{-1} - I$ is a Borel measurable selector of $\partial H$.

A.2 Maximum Mean Discrepancy

A.2.1 Definition

The notion of Maximum Mean Discrepancy (MMD) between two distributions was introduced in [34]. Given a kernel $k : X \times X \to \mathbb{R}$, denote by $\mathcal{K}$ its reproducing kernel Hilbert space. Given two distributions $\mu$ and $\nu$, the maximum mean discrepancy between them is defined as
$$D_{\mathrm{MMD}}(\mu, \nu) = \|f_{\mu,\nu}\|_{\mathcal{K}}, \quad \text{where} \quad f_{\mu,\nu}(z) = \int k(x, z)\,d\nu(x) - \int k(x, z)\,d\mu(x), \quad z \in X,$$
and $f_{\mu,\nu}$ is called the witness function. Given some target (or optimal) distribution $\mu^*$, we seek to optimize $\mathcal{F}(\mu) = \tfrac{1}{2} D_{\mathrm{MMD}}(\mu, \mu^*)^2$. Note that $\mathcal{F}(\mu)$ admits the following free-energy expression [7]
$$\mathcal{F}(\mu) = \int F(x)\,d\mu(x) + \frac{1}{2}\iint k(x, x')\,d\mu(x)\,d\mu(x') + C,$$
$$F(x) = -\int k(x, x')\,d\mu^*(x'), \qquad C = \frac{1}{2}\iint k(x, x')\,d\mu^*(x)\,d\mu^*(x').$$
In general, $\mathcal{F}$ is not convex along generalized geodesics. Rather, it exhibits a weakly convex structure [7] that falls within the DC spectrum, as detailed in Subsection A.2.3.

A.2.2 Connection with infinite-width one-hidden-layer neural networks

See also [7, 65]. We include the discussion here for completeness. Consider a one-hidden-layer neural network
$$f(x) = \frac{1}{n}\sum_{i=1}^{n} \psi(x, z_i), \qquad \psi(x, z_i) = w_i\,\sigma(\langle x, \theta_i\rangle),$$
where $z_i = (\theta_i, w_i) \in Z$ collects the (input and output) parameters associated with the $i$-th hidden neuron, $n$ is the number of hidden neurons, and $\sigma$ is the activation function. When the number of hidden neurons $n$ tends to infinity, $f$ can be written as $f(x) = \int \psi(x, z)\,d\mu(z)$ for some distribution $\mu$ over the parameter space $Z$. Given data $(x, y) \sim p(x, y)$ and assuming the quadratic loss, the optimization problem reads
$$\min_{\mu \in \mathcal{P}_2(Z)} \mathcal{L}(\mu) := \mathbb{E}_{(x,y)\sim p(x,y)}\Big[\Big(y - \int \psi(x, z)\,d\mu(z)\Big)^2\Big].$$
We assume that the model is well-specified, i.e., there exists $\mu^* \in \mathcal{P}_2(Z)$ such that
$$\mathbb{E}_{y\sim p(y|x)}(y) = \int \psi(x, z)\,d\mu^*(z). \qquad (16)$$
Expanding $\mathcal{L}$, we get
$$\mathcal{L}(\mu) = -2\int \mathbb{E}_{(x,y)\sim p(x,y)}\big(y\,\psi(x, z)\big)\,d\mu(z) + \iint \mathbb{E}_{x\sim p(x)}\big(\psi(x, z)\,\psi(x, z')\big)\,d\mu(z)\,d\mu(z') + C,$$
where $C$ does not depend on $\mu$.
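Before continuing the neural-network connection, here is a minimal Monte Carlo sketch of the quantity defined in A.2.1: a plug-in (V-statistic) estimate of $D_{\mathrm{MMD}}(\mu, \mu^*)^2$, i.e., of $2\mathcal{F}(\mu)$, from samples. The Gaussian kernel and the two sample distributions are our own toy choices and are not prescribed by the paper.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), a kernel with Lipschitz gradient (cf. A.2.3)
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def mmd_squared(X, Y, sigma=1.0):
    # Plug-in estimate of D_MMD(mu, mu*)^2 from X ~ mu and Y ~ mu*
    return (gaussian_kernel(X, X, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean()
            - 2.0 * gaussian_kernel(X, Y, sigma).mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))      # samples from the current iterate mu
Y = rng.normal(0.7, 1.0, size=(500, 2))      # samples from the target mu*
print("estimated D_MMD^2 =", mmd_squared(X, Y))   # equals 2 F(mu) in the notation of A.2.1
```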
Naturally, we define the kernel k(z, z ) = Ex p(x)(ψ(x, z)ψ(x, z )) z, z Z. Using the assumption that the model is well-specified (16), we obtain Z k(z, z )dµ (z ) = Z Ex p(x)(ψ(x, z)ψ(x, z ))dµ (z ) Z ψ(x, z)ψ(x, z )dµ (z ) ψ(x, z) Z ψ(x, z )dµ (z ) = Ex p(x) ψ(x, z)Ey p(y|x)(y) = E(x,y) p(x,y) (yψ(x, z)) . Therefore, L can be written as L(µ) = 2 Z Z k(z, z )dµ (z ) dµ(z) + ZZ k(z, z )dµ(z)dµ(z ) + C. This is exactly the MMD setting in Sect. A.2.1. A.2.3 DC structure Let k be a kernel whose gradient is Lipschitz continuous, i.e., for some L > 0, k(x, y) k(x , y ) L x x 2 + y y 2 1 2 , x, x , y, y X, which can be expressed equivalently as xk(x, y) xk(x , y ) 2 + yk(x, y) yk(x , y ) 2 L2 x x 2 + y y 2 . Let µ be some target distribution and consider the free-energy functional F(µ) = ZZ k(x, y)dµ(x)dµ(y) 2 ZZ k(x, y)dµ (y)dµ(x). Let α > 0, we can rewrite F as follows F(µ) = ZZ α x 2 + α y 2 + k(x, y) dµ(x)dµ(y) 2 Z α x 2 + Z k(x, y)dµ (y) dµ(x). As k is Lipschitz smooth w.r.t. (x, y), x 7 R k(x, y)dµ (y) is also Lipschitz smooth. Indeed, x Z k(x, y)dµ (y) x Z k(x , y)dµ (y) Z ( xk(x, y) xk(x , y))dµ (y) Z xk(x, y) xk(x , y) 2 dµ (y) 1 L Z x x 2 dµ (y) 1 Next, as a standard result, if f is an L-smooth function, x 7 (α/2) x 2 f(x) are convex whenever α L. Therefore, for α L, W(x, y) := α x 2 + α y 2 + k(x, y) is convex and F(x) = 2 α x 2 + R X k(x, y)dµ (y) is concave. From [4, Prop. 9.3.5], the interaction energy corresponding to W is generalized geodesically convex. A.3 Proof of Lemma 1 Since I + ηS is a subgradient selector of a convex function, the optimal transport between µn and νn+1 is given by T νn+1 µn = I + ηS. (17) and between µn+1 and νn+1 [4, Lem. 10.1.2] T νn+1 µn+1 I + η (EG + H ) (µn+1). (18) Since EH is convex along generalized geodesics [4, Prop. 9.3.2] and S is a subgradient of EH at µn [5, Proposition 4.13], by Lem. 10 it holds, for any ν P2,abs(X), EH(µn+1) EH(µn) + Z X S T µn ν (x), T µn+1 ν (x) T µn ν (x) dν(x). By choosing ν = νn+1 (note that νn+1 P2,abs(X)), EH(µn+1) EH(µn) + Z X S T µn νn+1(x), T µn+1 νn+1 (x) T µn νn+1(x) dνn+1(x) = EH(µn) + 1 X (T νn+1 µn I) T µn νn+1(x), T µn+1 νn+1 (x) T µn νn+1(x) dνn+1(x) = EH(µn) + 1 X x T µn νn+1(x), T µn+1 νn+1 (x) T µn νn+1(x) dνn+1(x) (19) where the second equality uses (17) and the last one uses Lem. 3. On the other hand, since EG + H is convex along generalized geodesics, by applying Lem. 10 for EG + H at µn+1 with a subgradient η 1(T νn+1 µn+1 I) (EG + H )(µn+1) (from (18)), (EG + H )(µn) (EG + H )(µn+1) + Z (T νn+1 µn+1 I) η T µn+1 νn+1 (x), T µn νn+1(x) T µn+1 νn+1 (x) dνn+1(x) = (EG + H )(µn+1) + 1 D x T µn+1 νn+1 (x), T µn νn+1(x) T µn+1 νn+1 (x) E dνn+1(x) (20) where the last equality uses Lem. 3. By adding (20) and (19) side by side, F(µn) F(µn+1) + 1 X T µn νn+1(x) T µn+1 νn+1 (x) 2dνn+1(x). (21) A.4 Proof of Theorem 1 Let µ P2(X) be a cluster point of {µn}n N. There exists a subsequence µnk Wass µ . It holds lim inf k F(µnk) = lim inf k (H (µnk) + EF (µnk)) = lim inf k H (µnk) + EF (µ ) H (µ ) + EF (µ ), since H is l.s.c. and EF is continuous w.r.t. Wasserstein topology. Therefore, H (µ ) < + , which further implies that µ P2,abs(X). X T µn νn+1(x) T µn+1 νn+1 (x) 2dνn+1(x) = Z X T µn νn+1(x) T µn+1 νn+1 (x) 2d T νn+1 µn #µn(x) X x T µn+1 νn+1 T νn+1 µn (x) 2dµn(x). (22) We observe that T µn+1 νn+1 T νn+1 µn is a (possibly non-optimal) transport pushing µn to µn+1, by the optimality of T µn+1 µn , Z X x T µn+1 νn+1 T νn+1 µn (x) 2dµn(x) Z X x T µn+1 µn (x) 2dµn(x) = W 2 2 (µn, µn+1). (23) By Lem. 
1 and (22), (23), F(µn) F(µn+1) + 1 η W 2 2 (µn, µn+1). (24) Note that F is bounded below (Assumption 1), telescoping (24) gives us n=0 W 2 2 (µn, µn+1) < + . (25) In particular, W2(µn, µn+1) 0. This together with µnk Wass µ implies µnk+1 Wass µ . Under Assumption 3, S = H and S is continuous. Next, recall νnk+1 = (I + ηS)#µnk, we show νnk+1 narrow ν := (I + ηS)#µ as k + . (26) Thus, let f be a continuous and bounded test functional in X, by using transfer lemma 2, X f(x)dνnk+1(x) = lim k X f(x)d(I + ηS)#µnk(x) X f(x + ηS(x))dµnk(x) (27) X f(x + ηS(x))dµ (x) X f(x)dν (x), (28) since S is continuous. So νnk+1 narrow ν . We go one step further and prove that νnk+1 actually converges to ν in the Wasserstein metric. This boils down to showing convergence in second-order moments, i.e., m2(νnk+1) m2(ν ), which is equivalent to showing that Z X x + ηS(x) 2dµnk(x) Z X x + ηS(x) 2dµ (x). On the other hand, ψ(x) := x+ηS(x) 2 has quadratic growth (follows from Lem. 4) and µnk µ in the Wasserstein metric, so [3, Prop. 2.4] X x + ηS(x) 2dµnk(x) = Z X x + ηS(x) 2dµ (x) Therefore, νnk+1 ν in Wasserstein metric. To proceed further, we need the following theorem stating that the graph of the subdifferential of a geodesically convex function is closed under the product of Wasserstein and weak topologies. Theorem 8 (Closedness of subdifferential graph). [4, Lemma 10.1.3] Let ϕ be a geodesically convex functional satisfying dom( ϕ) P2,abs(X). Let {µn}n N be a sequence converging in Wasserstein metric to µ dom(ϕ). Let ξn ϕ(µn) be satisfying X ξn(x) 2dµn(x) < + , (29) and converging weakly to ξ L2(X, X, µ) in the following sense: X ζ(x)ξn(x)dµn(x) = Z X ζ(x)ξ(x)dµ(x), ζ C c (X). (30) Then ξ ϕ(µ). As a side note, we need the notion of weak convergence in the above theorem because unlike subdifferentials in flat Euclidean space each ξn lives in its own L2(X, X, µn) space. Back to our proof, for item (29), we show that T νn µn (x) x 2dµn(x) < + . (31) We proceed as follows to prove (31). We first show that supn N m2(µn) < + . By contradiction, by assuming supn N m2(µn) = + , we can extract a subsequence {µnk}k N such that lim k m2(µnk) = + . (32) By compactness assumption, there further exists a subsequence {µnki}i N such that µnki converges (in Wasserstein metric) to some µ P2(X), implying that limi m2(µnki) = m2(µ ), which contradicts (32). Therefore, supn N m2(µn) < + . We next show that supn N m2(νn) < + . Indeed, as S 2 has quadratic growth, S(x) 2 c( x 2 + 1) for some c > 0. So m2(νn+1) = Z X x 2dνn+1(x) X x 2d(I + ηS)#µn(x) X x + ηS(x) 2dµn(x) X x 2dµn(x) + 2η2 Z X S(x) 2dµn(x) (2 + 2cη2)m2(µn) + 2η2c, implying that supn N m2(νn) < + . This in conjunction with G having quadratic growth implies that supn N |EG(νn)| < + . Furthermore, infn (EG + H )(µn) > otherwise by lower semicontinuity of H and compactness of {µn}n N we get a contradiction. Now, as η 1(T νn+1 µn+1 I) (EG + H ) (µn+1) and EG + H is geodesically convex, by applying Thm. 6, it holds (EG + H )(νn+1) (EG + H )(µn+1) + 1 X T νn+1 µn+1 (x) x 2dµn+1(x). (33) The finiteness as in (31) then follows from supn N |EG(νn)| < + , supn N (EG + H )(µn) < + as proved and supn N H (νn) < + as assumed. We next prove that there is a subsequence of {µnk}k N such that T νnkj +1 µnkj +1 I T ν µ I weakly. (34) We consider the sequence of optimal plans between µn and νn as follows ρn = (I, T νn µn )#µn, n N. We observe that X X ( x 2 + y 2)dρn(x, y) = Z X x 2dµn(x) + Z X y 2dνn(y) < + , so ρn P2(X X) for all n N. 
Since supn N m2(νn) < + as proved, and as a Wasserstein ball is relatively compact under narrow topology, {νn}n N is relatively compact under narrow topology. The same property holds for {µn}n N. According to Prokhorov [3, Theorem 1.3], {µn}n N and {νn}n N are tight. By [3, Remark 1.4], {ρn}n N is also tight, hence relatively compact under narrow topology in P2(X X). Consequently, {ρnk+1}k N admits a subsequence converging narrowly to some ρ P(X X). Let s say ρnki+1 narrow ρ as i . (35) We can see that proj1#ρ = µ , proj2#ρ = ν . We further show that ρ P2(X X), or equivalently, Z X X ( x 2 + y 2)dρ (x, y) < + . (36) Let C N, thanks to the narrow convergence in (35), we have Z X X min{ x 2 + y 2, C}dρnki+1(x, y) Z X X min{ x 2 + y 2, C}dρ (x, y). Furthermore, Z X X min{ x 2 + y 2, C}dρnki+1(x, y) sup n N m2(µn) + sup n N m2(νn) := M < + . Passing to the limit, we get Z X X min{ x 2 + y 2, C}dρ (x, y) M for all C N. Sending C to and applying Monotone Convergence Theorem we derive (36). Back to the main proof, since {ρnki+1}i N is a sequence of optimal plans, its limit, ρ is also optimal [3, Proposition 2.5]. Therefore, ρ = (I, T ν µ )#µ . Moreover, as m2(ρnki+1) = m2(µnki+1) + m2(νnki+1) m2(µ ) + m2(ν ) = m2(ρ ), ρnki+1 Wass ρ as i . Now let s take any test function ζ C c (X), we show X ζ(x)T νnkj +1 µnkj +1 (x)dµnkj +1(x) = Z X ζ(x)T ν µ (x)dµ (x). (37) Indeed, (x, y) 7 ζ(x) proji(y) where proji is the projection into the i-th coordinate is continuous and has quadratic growth since ζ(x) is bounded and proji(y) is linear. Since ρnkj +1 Wass ρ , it holds: for each i [d], X ζ(x) proji T νnkj +1 µnkj +1 (x) dµnkj +1(x) X ζ(x) proji(y)d(I, T νnkj +1 µnkj +1 )#µnkj +1(x, y) X ζ(x) proji(y)dρnkj +1(x, y) X ζ(x) proji(y)dρ (x, y) X ζ(x) proji(y)d(I, T ν µ )#µ (x, y) X ζ(x) proji(T ν µ (x))dµ (x), so (37) holds. Consequently, (34) also holds by noticing that Z X xζ(x)dµnkj +1(x) Z X xζ(x)dµ (x) as j . By the closedness of subdifferential graph of (EG + H ) (Thm. 8), we obtain η (EG + H )(µ ). (38) On the other hand, by the definition of ν in (26), we get T ν µ = I + ηS. Together with (38), S (EG + H )(µ ). Noting that S EH(µ ), it holds S (EG + H )(µ ) EH(µ ) and we conclude µ is a critical point of F. A.5 Proof of Theorem 2 We see that Gη(µn) 2 L2(X,X;µn) = 1 x T JKOη(EG+H )((I+ηS)#µn) µn (x) 2 dµn(x) η2 W 2 2 (µn, JKOη(EG+H )((I + ηS)#µn)) η2 W 2 2 (µn, µn+1). On the other hand, it follows from (25) of the proof of Thm. 1 that min i=1,N W 2 2 (µn, µn+1) = O(N 1). A.6 Proof of Theorem 3 H has uniformly bounded Hessian, by [43, Prop. 2.12], EH is Wasserstein differentiable and W EH(µ) = H for all µ P2(X). According to Lem. 8 and (18), F F(µn+1) = (EG + H )(µn+1) H T νn+1 µn+1 I We then have the following evaluations: dist (0, F F(µn+1)) = inf ξ F F(µn+1) ξ L2(X,X,µn+1) T νn+1 µn+1 I η H L2(X,X,µn+1) T νn+1 µn+1 (x) x X T νn+1 µn+1 (x) x η H(x) 2dµn+1(x) 1 By transfer lemma 2, Z T νn+1 µn+1 (x) x η H(x) 2 dµn+1(x) T νn+1 µn+1 (x) x η H(x) 2 d T µn+1 νn+1 #νn+1(x) T νn+1 µn+1 T µn+1 νn+1 (x) (I + η H) T µn+1 νn+1 (x) 2 dνn+1(x) x (I + η H) T µn+1 νn+1 (x) 2 dνn+1(x). (41) On the other hand, by using the trivial identity H = (I + η H) I η we compute, Z X H(T µn νn+1(x)) H(T µn+1 νn+1 (x)) 2dνn+1(x) (I + η H) T µn νn+1(x) T µn νn+1(x) η (I + η H) T µn+1 νn+1 (x) T µn+1 νn+1 (x) η x T µn νn+1(x) (I + η H) T µn+1 νn+1 (x) + T µn+1 νn+1 (x) 2 dνn+1(x), (42) where the last equality uses (I + η H) T µn νn+1 = I νn+1-a.e. 
The Hessian of H is bounded uniformly, H is Lipschitz, let s say H(x) H(y) LH x y for all x, y X. We continue evaluating (41) as follows Z X x (I + η H) T µn+1 νn+1 (x) 2dνn+1(x) x T µn νn+1(x) (I + η H) T µn+1 νn+1 (x) + T µn+1 νn+1 (x) + T µn νn+1(x) T µn+1 νn+1 (x) 2 dνn+1(x) x T µn νn+1(x) (I + η H) T µn+1 νn+1 (x) + T µn+1 νn+1 (x) 2 dνn+1(x) X T µn νn+1(x) T µn+1 νn+1 (x) 2dνn+1(x) X H(T µn νn+1(x)) H(T µn+1 νn+1 (x)) 2dνn+1(x) + 2 Z X T µn νn+1(x) T µn+1 νn+1 (x) 2dνn+1(x) 2(η2L2 H + 1) Z X T µn νn+1(x) T µn+1 νn+1 (x) 2dνn+1(x) (43) where the third equality uses (42). From (40), (41), and (43), we derive dist (0, F F(µn+1)) 2(η2L2 H + 1) η X T µn νn+1(x) T µn+1 νn+1 (x) 2dνn+1(x) 1 On the other hand, by telescoping Lem. 1, we obtain X T µn νn+1(x) T µn+1 νn+1 (x) 2dνn+1(x) < + . n=0 dist (0, F F(µn+1)) 2(η2L2 H + 1) η X T µn νn+1(x) T µn+1 νn+1 (x) 2dνn+1(x) 1 2(η2L2 H + 1) η X T µn νn+1(x) T µn+1 νn+1 (x) 2dνn+1(x) 2(η2L2 H + 1)N η X T µn νn+1(x) T µn+1 νn+1 (x) 2dνn+1(x) min n=1,N dist (0, F F(µn)) = O 1 A.7 Proof of Theorem 4 Convergence in terms of objective values Since H C2(X) whose Hessian is uniformly bounded, recall from (39) that F F(µn+1) = (EG + H )(µn+1) H T νn+1 µn+1 I Since F(µ0) F < r0 and the sequence {F(µn)}n N is not increasing (Lem. 1), F(µn) F < r0 for all n N. Łojasiewicz condition implies c(F(µn+1) F )θ T νn+1 µn+1 I η H L2(X,X,µn+1) T νn+1 µn+1 (x) x 2(η2L2 H + 1) η X T µn νn+1(x) T µn+1 νn+1 (x) 2dνn+1(x) 1 where the last inequality follows from (41) and (43) and LH is the Lipschitz constant of H. Combining with Lem. 1, we derive c (F(µn+1) F )θ 2(η2L2 H + 1) η (F(µn) F(µn+1)) (F(µn+1) F )2θ 2(η2L2 H + 1) c2η ((F(µn) F ) (F(µn+1) F )) . (46) We then use the following lemma [71, Lem. 4]. Lemma 11. Let {sk}k N be a nonincreasing and nonnegative real sequence. Assume that there exist α 0 and β > 0 such that for all sufficiently large k, sα k+1 β(sk sk+1). (47) (i) if α = 0, the sequence {sk}k N converges to 0 in a finite number of steps; (ii) if α (0, 1], the sequence {sk}k N converges linearly to 0 with rate β β+1; (iii) if α > 1, the sequence {sk}k N converges sublinearly to 0, i.e., there exists τ > 0: sk τk 1 α 1 for sufficiently large k. Compared to [71, Lem. 4], we have dropped the assumption sk 0 in Lem. 11 because this assumption is vacuous, i.e., it can be induced by (47) and nonnegativity of {sk}k N. We now apply Lem. 11 for sk = F(µk) F using (46) to derive the followings (i) if θ = 0, F(µn) F converges to 0 in a finite number of steps; (ii) if θ (0, 1/2], F(µn) F converges to 0 linearly (exponentially fast) with rate F(µn) F = O M M + 1 n where M = 2(η2L2 H + 1) c2η ; (iii) if θ (1/2, 1), F(µn) F converges sublinearly to 0, i.e., F(µn) F = O n 1 2θ 1 . A.8 Proof of Theorem 5 Cauchy sequence. By replacing n := n 1 in (45) and rearranging 2(η2L2 H + 1) cη X T µn 1 νn (x) T µn νn (x) 2dνn(x) 1 2 (F(µn) F ) θ . (48) It follows from Lem. 1 and (48) that Z X T µn νn+1(x) T µn+1 νn+1 (x) 2dνn+1(x) η(F(µn) F(µn+1)) 2(η2L2 H + 1) c X T µn 1 νn (x) T µn νn (x) 2dνn(x) 1 2 (F(µn) F ) θ (F(µn) F(µn+1)). Since the function s : R+ R, s(t) = t1 θ is concave if θ [0, 1), tangent inequality holds s (a)(a b) s(a) s(b). Note that s (t) = (1 θ)t θ, the above inequality further implies (1 θ) (F(µn) F ) θ (F(µn) F(µn+1)) (F(µn) F )1 θ (F(µn+1) F )1 θ. 
(50) From (49) and (50) Z X T µn νn+1(x) T µn+1 νn+1 (x) 2dνn+1(x) 2(η2L2 H + 1) (1 θ)c X T µn 1 νn (x) T µn νn (x) 2dνn(x) 1 (F(µn) F )1 θ (F(µn+1) F )1 θ or equivalently, X T µn νn+1(x) T µn+1 νn+1 (x) 2dνn+1(x) R X T µn 1 νn (x) T µn νn (x) 2dνn(x) 1 2(η2L2 H + 1) (1 θ)c (F(µn) F )1 θ (F(µn+1) F )1 θ (51) where rn := R X T µn νn+1(x) T µn+1 νn+1 (x) 2dνn+1(x). By telescoping (51) from n = 1 to + we obtain rn rn 1 < + . On the other hand rn rn 1 + rn 1 2 rn, (52) we derive P+ n=0 rn < + . From (22), (23) of the proof of Thm. 1, rn W 2 2 (µn, µn+1), we obtain n=0 W2(µn, µn+1) < + . or, in other words, {µn}n N is a Cauchy sequence under Wasserstein topology. The Wasserstein space (P2(X), W2) is complete [5, Thm. 2.2], every Cauchy sequence is convergent, i.e., there exists µ P2(X) such that µn Wass µ . We prove that µ is actually an optimal solution of F by showing F(µ ) = F . Indeed, firstly, as G and H have quadratic growth, it holds EG(µn) EG(µ ), EH(µn) EH(µ ). On the other hand, F = lim n F(µn) = lim inf n F(µn) = lim inf n H (µn) + EG(µ ) EH(µ ) H (µ ) + EG(µ ) EH(µ ) = F(µ ) since H is l.s.c. The equality has to occur, i.e., F = F(µ ), due to the optimality of F . Convergence rate of {µn}n N. (i) If θ = 0 From item (i) of Thm. 4, there exists n0 N such that F(µn) = F for all n n0. It then follows from (24) that µn0 = µn0+1 = µn0+2 = . . ., which further implies that µn = µ for all n n0. (ii) If θ (0, 1/2] Let si = P n=i rn. We have n=i W2(µn, µn+1) W2(µi, µ ) (53) where the last inequality uses triangle inequality n=i W2(µn, µn+1) + W2(µN, µ ) and lets N with a notice that µN Wass µ . From (51) and (52), 2 rn rn 1 + 2(η2L2 H + 1) (1 θ)c (F(µn) F )1 θ (F(µn+1) F )1 θ . (54) Telescope (54) for n = i to + , 2(η2L2 H + 1) (1 θ)c (F(µi) F )1 θ ri 1 + (2(η2L2 H + 1)) 1 2θ 2θ i 1 . (55) where the last inequality uses (45). Since ri 0 as i , ri < 1 for i sufficiently large. It follows from (55) that: for i sufficiently large si M ri 1 = M(si 1 si) M = 1 + (2(η2L2 H + 1)) 1 2θ θ c 1 θ . (56) Rewriting as si M M+1si 1, we derive W2(µi, µ ) = O M M+1 i . (iii) If θ (1/2, 1) (55) implies: for all i sufficiently large, 2θ i 1 = M(si 1 si) 1 θ where M is the same as in (56). Applying Lem. 11(iii), si = O i 1 θ 2θ 1 , which implies (by (53)) W2(µi, µ ) = O i 1 θ B Implementation of FB Euler Following a similar approach to the semi FB Euler s implementation outlined in Alg. 1, we present a practical implementation of FB Euler for the sampling context in Alg. 2. Algorithm 2 FB Euler for sampling Input: Initial measure µ0 P2,abs(X), discretization stepsize η > 0, number of steps K > 0, batch size B. for k = 1 to K do for i = 1, 2, . . . do Draw a batch of samples Z µ0 of size B; Ξ (I η F) xψθk (I η F) xψθk 1 . . . (I η F)(Z); d W 2 2 1 ξ Ξ xψθ(ξ) ξ 2; [ H 1 ξ Ξ log det 2 xψθ(ξ). b L 1 2η d W 2 2 + [ H . Apply an optimization step (e.g., Adam) over θ using θ b L. end for θk+1 θ. end for C Numerical illustrations We perform numerical experiments in a high-performance computing cluster with GPU support. We use Python version 3.8.0. We allocate 8G memory for the experiments. The total running time for all experiments is a couple of hours. Our implementation is based on the code of [53] (MIT license) with the Dense ICNN architecture [41]. C.1 Gaussian mixture Consider a target Gaussian mixture of the following form: π(x) exp( F(x)) := i=1 πi exp x xi 2 i=1 πi exp x xi 2 i=1 πi exp x 2 + xi 2 2 x, xi i=1 πi exp x 2 exp xi 2 2 x, xi i=1 πi exp xi 2 2 x, xi | {z } convex which is DC. 
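As a numerical sanity check of the DC splitting just derived (a toy script with arbitrary means and weights of our own choosing, not the paper's experiment code), the sketch below verifies that $G(x) - H(x)$, with $G(x) = \|x\|^2/(2\sigma^2)$ and $H(x)$ a log-sum-exp of affine functions of $x$, reproduces $F(x) = -\log\sum_i \pi_i \exp\big(-\|x - x_i\|^2/(2\sigma^2)\big)$; here $\sigma = 1$ matches the experiment below.

```python
import numpy as np
from scipy.special import logsumexp

# Check of the DC splitting F = G - H for the Gaussian-mixture potential.
# Means, weights and sigma are arbitrary toy choices; only the identity F = G - H is verified.
rng = np.random.default_rng(0)
means = rng.normal(size=(5, 2))                 # x_1, ..., x_5
weights = np.full(5, 1.0 / 5.0)                 # pi_i
sigma = 1.0

def F(x):        # F(x) = -log sum_i pi_i exp(-||x - x_i||^2 / (2 sigma^2))
    d2 = ((x - means) ** 2).sum(-1)
    return -logsumexp(-d2 / (2 * sigma ** 2), b=weights)

def G(x):        # first DC component: ||x||^2 / (2 sigma^2)
    return (x ** 2).sum() / (2 * sigma ** 2)

def H(x):        # second DC component: log-sum-exp of affine functions of x (convex)
    affine = (-(means ** 2).sum(-1) + 2 * x @ means.T) / (2 * sigma ** 2)
    return logsumexp(affine, b=weights)

x = rng.normal(size=2)
print("F(x) =", F(x), "  G(x) - H(x) =", G(x) - H(x))   # the two values agree
```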
Note that the convexity of the second DC component above follows from (a) the convexity of log-sum-exp and (b) the fact that the composition of a convex function with an affine map is convex.

Experiment details. We set $K = 5$ and randomly generate $x_1, x_2, \ldots, x_5 \in \mathbb{R}^2$. We set $\sigma = 1$. The initial distribution is $\mu_0 = N(0, 16I)$. We use $\eta = 0.1$ for both FB Euler and semi FB Euler. We train both algorithms for 40 iterations using the Adam optimizer with a batch size of 512, where the first 20 iterations use a learning rate of $5 \times 10^{-3}$ and the latter 20 iterations use $2 \times 10^{-3}$. For the baseline ULA, we run 10000 chains in parallel for 4000 iterations with a learning rate of $10^{-3}$. We run the above experiment 5 times, with $x_1, x_2, \ldots, x_5$ randomly generated each time. Fig. 1(b) reports the KL divergence along the training process, where the mean curves are in bold and individual curves are displayed in a faded style. The final KL divergence (averaged across 5 runs) of the ULA is reported as a horizontal line.

C.2 Distance-to-set prior

Let $\pi$ be the original prior and $\Theta$ be the constraint set that we want to impose. The distance-to-set prior [61] is defined, for some $\rho > 0$, by
$$\tilde{\pi}(\theta) \propto \pi(\theta)\exp\Big(-\frac{\rho}{2}\,d(\theta, \Theta)^2\Big),$$
which exponentially penalizes deviations of $\theta$ from the constraint set. Given data $y$ and using this distance-to-set prior, the posterior reads
$$\pi(\theta\,|\,y) \propto L(\theta\,|\,y)\,\pi(\theta)\exp\Big(-\frac{\rho}{2}\,d(\theta, \Theta)^2\Big),$$
where $L(\theta\,|\,y)$ is the likelihood. The structure of $\pi(\theta\,|\,y)$ depends on three separate components: the original prior, the likelihood, and the constraint set $\Theta$. In the ideal case, $\pi(\theta)$ and $L(\theta\,|\,y)$ are given in nice forms (e.g., log-concave) and $\Theta$ is a convex set. In fact, if $\Theta$ is a convex set, then $\theta \mapsto d(\theta, \Theta)^2$ is convex, making the whole posterior log-concave. If $\Theta$ is additionally closed, $d(\theta, \Theta)^2$ is L-smooth. However, whenever $\Theta$ is nonconvex, the function $\theta \mapsto d(\theta, \Theta)^2$ is not continuously differentiable. This follows from the Motzkin-Bunt theorem [16, Thm. 9.2.5], which asserts that any Chebyshev set (a set $S \subset X$ is called Chebyshev if every point in $X$ has a unique nearest point in $S$) has to be closed and convex. On the other hand, $d(\theta, \Theta)^2$ is always DC, regardless of the geometric structure of $\Theta$:
$$d(\theta, \Theta)^2 = \inf_{x \in \Theta}\|\theta - x\|^2 = \inf_{x \in \Theta}\big(\|\theta\|^2 + \|x\|^2 - 2\langle x, \theta\rangle\big) = \|\theta\|^2 + \inf_{x \in \Theta}\big(\|x\|^2 - 2\langle x, \theta\rangle\big) = \|\theta\|^2 - \sup_{x \in \Theta}\big(-\|x\|^2 + 2\langle x, \theta\rangle\big). \qquad (57)$$
Note that the supremum of an arbitrary family of affine functions is convex. Therefore, the log-DC structure of the whole posterior only depends on whether the original prior and the likelihood are log-DC, which is likely to be the case.

Distance-to-set prior relaxed von Mises-Fisher. In directional statistics, the von Mises-Fisher distribution is a distribution over unit-length vectors (the unit sphere). It can be described as the restriction of a Gaussian distribution to the sphere. Using the distance-to-set prior, we can relax the spherical constraint as
$$\pi(\theta) \propto \exp(-F(\theta)) := \exp\Big(-\frac{\kappa}{2}(\theta - \mu)^\top(\theta - \mu) - \frac{\rho}{2}\,\mathrm{dist}(\theta, S)^2\Big),$$
where $S$ denotes the unit sphere in $\mathbb{R}^d$. By the DC structure (57) of the distance function, $F$ is DC with the decomposition
$$F(\theta) = \Big(\frac{\kappa}{2}\|\theta - \mu\|^2 + \frac{\rho}{2}\|\theta\|^2\Big) - \frac{\rho}{2}\sup_{x \in S}\big(-\|x\|^2 + 2\langle x, \theta\rangle\big) := G(\theta) - H(\theta).$$
We also note that $H$ is not continuously differentiable because $S$ is nonconvex. Furthermore, $\rho\,\mathrm{proj}_S(\theta) \in \partial H(\theta)$, where $\mathrm{proj}_S(\theta)$ is the projection of $\theta$ onto $S$, which can be computed explicitly in this case.

Experiment details. We consider a unit circle in $\mathbb{R}^2$ with centre $\mu = (1, 1.5)$. We set $\kappa = 1$ and $\rho = 100$. The initial distribution is $\mu_0 = N(0, 16I)$. We use $\eta = 0.1$ for both FB Euler and semi FB Euler.
We train both algorithms for 40 iterations using Adam optimizer with a batch size of 512 in which the first 20 iterations use a learning rate of 5 10 3 while the latter 20 iterations use 2 10 3. Neur IPS Paper Checklist Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: We briefly mention our contribution in the abstract (propose a new scheme named semi Forward-Backward Euler and derive multiple convergence insights) and provide a detailed contribution paragraph in Sect. 1 where each claim is pointed to the corresponding theorem. Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: We mentioned the expensiveness of the method, especially since the JKO operator is not scalable at least for now. See Sect. 5. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. 
Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: We provide a full set of assumptions used in each theorem/claim, see Sect. 3.2 for the problem setting and Sect. 4 for the convergence theorems. Proofs for all theorems are provided in the Appendix A with strong mathematical rigour. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: We provide a detailed implementation of the proposed method in Sect. 5 and Appendix B, and the experiment setting/parameters in Appendix C. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). 
(d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide our code/results in a Git Hub repo mentioned in Sect. 6. We also instruct how to reproduce our experiments in README.md. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: We specify all the training details in Appendix C: initial distribution, target distribution, discretization stepsize, learning rate, minibatch size, optimizer, etc. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: We report in the Numerical illustration section 6 the mean curves in bold and individual run curves shown faintly in the background playing the same role as error bars. We provide sufficient information in Appendix C. Guidelines: The answer NA means that the paper does not include experiments. 
The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We provide resource information in Appendix C. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: This is standard machine learning research without any sensitive or humanrelated data. We do not see any negative impact on the society. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: This is theoretical research on an optimization problem defined in the Wasserstein space. We do not see any societal impact that our work may bring about. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. 
Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: the paper poses no such risks. It is a theoretical paper. Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: We explicitly mention in Sect. 5 that we use the deep learning approach proposed in [53] to compute the JKO. We also acknowledge that we leverage the code from [53] in Section C and in README.md in the Git Hub repo; the license information is explicitly mentioned in Section C. Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. 
If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [Yes] Justification: the detailed description of the algorithms in Sect. 5. We also include LICENSE.txt in the Git Hub repo. Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: the paper does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: the paper does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.