# modified_frank_wolfe_in_probability_space__330b9bfb.pdf

Modiﬁed Frank Wolfe in Probability Space

Carson Kent Stanford University crkent@stanford.edu

José Blanchet Stanford University jose.blanchet@stanford.edu

Jiajin Li The Chinese University of Hong Kong gerrili1996@gmail.com

Peter Glynn Stanford University glynn@stanford.edu

We propose a novel Frank-Wolfe (FW) procedure for the optimization of inﬁnitedimensional functionals of probability measures - a task which arises naturally in a wide range of areas including statistical learning (e.g. variational inference) and artiﬁcial intelligence (e.g. generative adversarial networks). Our FW procedure takes advantage of Wasserstein gradient ﬂows and strong duality results recently developed in Distributionally Robust Optimization so that gradient steps (in the Wasserstein space) can be efﬁciently computed using ﬁnite-dimensional, convex optimization methods. We show how to choose the step sizes in order to guarantee exponentially fast iteration convergence, under mild assumptions on the functional to optimize. We apply our algorithm to a range of functionals arising from applications in nonparametric estimation.

1 Introduction

Problems in artiﬁcial intelligence, statistics, and optimization often ﬁnd a common root as an inﬁnite dimensional optimization problem in the form

inf J (µ) : µ P Rd , (1)

for the space P Rd of Borel probability measures over Rd. In recent years, quantitative statistical and algorithmic treatments of these formulations have produced insights into modern computational methods resulting in novel approaches to difﬁcult, open problems. Recent works in robust optimization [6, 44, 54, 55], probabilistic fairness [57, 53], reinforcement learning [63, 64], and generative adversarial networks [42, 18, 19] highlight these gains and are linked by the following theme: problems in the form of (1) provide access to rich inﬁnite dimensional structure that sidesteps brittle artifacts of ﬁnite dimensional formulations.

This paper provides the construction and analysis of a modiﬁed Frank-Wolfe algorithm for (1) that operates from this inﬁnite dimensional perspective and yields concrete convergence and complexity guarantees for a sub-family of problems (1) which are well-behaved with respect to the Wasserstein distance of order 2 (see Algorithm 1 and Theorem 1). Under canonical conditions of smoothness and convexity we recover linear rates of convergence while, even for functionals which exhibit low degrees of smoothness and for conditions that go beyond convexity, we recover sublinear rates that are to be expected from ﬁnite dimensional analogues [36] (see Section 2.2).

The vanilla Frank-Wolfe method cannot work in probability space, in general, since the planar derivative (i.e. the ﬁrst variation also known as the inﬂuence function) can be unbounded when distributions do not have compact support. To overcome this issue, we conduct a natural modiﬁcation

35th Conference on Neural Information Processing Systems (Neur IPS 2021).

to introduce a tractable, local, linear constraint inspired by efforts in DRO [44, 6, 25, 54]. Speciﬁcally, the modiﬁed Frank-Wolfe step admits the prototypical DRO formulation as below,

sup Z f dµ : Dc(µ, µ0) δ , (2)

where Dc(µ, µ0) is the optimal transport cost between µ and µ0 (a reference measure) under some cost function c. The form of (2), itself, immediately suggests the basis of an inﬁnite dimensional Frank Wolfe procedure since it provides a linear objective subject to a local, trust-region constraint centered at µ0. The relevance of (2) in distributionally robust optimization (DRO) and mathematical ﬁnance has resulted in a multitude of computational schemes [44, 34, 37, 54, 61] for solving (2). However, hitherto, such works have failed to consider (2) within the context of a general variational method for (1). What makes these efforts notable, within the scope of this work, is that they emphasize that the solution of (2) can be highly non-trivial. Indeed, without particular assumptions, (2) can disguise an computationally hard problem despite being convex in a Banach sense on P(Rd). Even in the case where the cost is the squared Euclidean norm c(x, y) = x y 2 (the case of primary concern for this work), computational trouble can lie dormant an artifact of inherently difﬁcult problems in unconstrained optimization [14]. This should not be surprising, however, given specters of computational hardness dating back to early formulations of DRO [22]. To resolve these issues, we also provide a novel analysis of techniques from distributionally robust optimization (DRO) which illustrate how such formulations can be used, in a computationally tractable way, to construct ﬁrst-order, variational methods. By localizing the problem to a Wasserstein ball, the new Local Linear Minimization Oracle (LLMO) that we make in this paper not only makes the vanilla Frank-Wolfe problem sensible, but it also renders the problem computationally tractable via ﬁnite dimensional convex optimization.

1.1 Previous work

Distributionally robust optimization To navigate such pitfalls, one can consider particular instances of (2) where the objective and constraints are sufﬁciently structured to preclude computational intractability and permit solution via methods adapted to the provided structure. Early work with this line [27, 22, 60], has recently been supplemented by approaches [13, 26, 7, 46, 37, 67, 35] which focus directly on DRO formulations from particular contexts in machine learning and operations research. Unfortunately, the techniques offered by these efforts require assumptions which are too restrictive for this work. These assumptions typically relate to a speciﬁc form for the objective function or constraints in (2) (e.g. linear/convex functions/piecewise-convex objectives or constraints with support or density requirements, see [31, 66, 44, 34, 65, 4, 58] for additional examples). In this instance, such limitations preclude their applicability since, in general, a gradient object for a functional J (see Section 2) need not satisfy these conditions.

A second, more relevant, approach to compute DRO problems (2) is to restrict the level of robustness δ for which the problem is solved. This technique has been used in works such as [6, 54] and we apply this principle in a similar spirit to [54]. In that work, smoothness of the objective in (2) is used to, qualitatively, argue that a sufﬁciently small δ provides a computationally-tractable optimization problem. In contrast, we provide quantiﬁcation of the level of robustness required to achieve this tractability and we demonstrate that this level robustness is sufﬁciently large to achieve canonical convergence rates for an inﬁnite-dimensional Frank-Wolfe algorithm.

Variational methods Although formulation of a Frank-Wolfe method for (1) (with quantitative bounds on complexity and convergence) has not appeared in previous literature, certain, tangentiallyrelated variational methods offer conceptual similarity. The ﬁrst of these methods is [41], which draws similar inspiration from ﬁnite-dimensional Frank-Wolfe procedures. However, the notion of ﬁrst-order variation in [41] appears to be induced by the total variation distance requiring compact support assumptions for the problem. Alternatively, we exploit Wasserstein geometry to provide a weaker notion of ﬁrst-order variation, namely, Wasserstein differentiability allowing us to eliminate restrictions to compact support. Hence, by balancing tractability of the analysis and ﬁdelity of the procedure, the proposed algorithm scheme can be applied a broad class of computational examples. Indeed, [41] only discussed the Sinkhorn barycenter problem and it is still unclear how it can be applied to our setting. The second, more closely related effort, is [39] where similar, inﬁnitedimensional conditions (Section 2.2) are used to study particle-based methods for computing Nash equilibria of zero-sum games. While the setting and procedures developed in this work are completely

different, we note that, as a special case, our Frank-Wolfe method can also produce a particle-based optimization procedure. This hints at possible insightful connections with other particle methods [40, 24, 11, 10] which are left for future consideration.

2 Wasserstein geometry

This work considers the problem min ν P2(Rd) J(ν) (3)

for functionals J : P2(Rd) R over the subset of Borel measures P(Rd) deﬁned by

P2(Rd) := µ P(Rd) : Z

Rd x 2 dµ(x) < (4)

In particular, we consider functionals J that are differentiable (Deﬁnition 1) with respect to Wasserstein distance of order 2

W2(µ, ν) := inf γ Π(µ,ν)

Rd Rd x y 2 dγ(x, y) (5)

where Π(µ, ν) is the set of all joint couplings with marginals µ, ν P2(Rd). Common examples of functionals that can be cast within this framework are as follows. Example 1 (Divergences). For any convex, lower-semicontinuous function f : R+ R such that f(1) = 0, one can consider a f-divergence of the form

J(µ) := Df(µ||ν) = Z

For instance, if f(t) = t log t, (6) reduces to Kullback-Leibler divergence DKL. Example 2 (Integral Probability Metrics). For a set of real valued functions F on Rd one can deﬁne the discrepancy

J(µ) := IPM (µ, ν) = sup f F

Rd f dν (7)

for µ, ν P Rd , where ν is a ﬁxed, reference measure. Such discrepancies are termed Integral Probability Metrics (IPMs), although they may not strictly satisfy the requirements of a metric say, by failing to distinguish all pairs of measures. Instead, for a pair of measures µ, ν P Rd , IPMs can be interpreted as measuring the extent to which µ and ν differ on functions in F or, rather, measuring the extent to which µ and ν can be distinguished by F. Concretely, consider F = {f : f H 1} where H is a reproducing kernel hilbert space (RKHS). In this case, one obtains the dual formulation of Maximum Mean Discrepancy (MMD).

2.1 Properties and differentiability for Wasserstein space Under the Wasserstein distance, P2(Rd) is a Polish space [59] and, more importantly, it is a geodesic space. That is, for every µ, ν P2(Rd), there exists a constant-speed geodesic curve µt : [0, 1] P2(Rd) where µ0 = µ, µ1 = ν and W(µt, µs) = |t s|W(µ0, µ1) (8) Moreover, there is a bijection between constant-speed geodesics and optimal transport plans [1, Theorem 7.2.2]. Every geodesic corresponds to a unique, optimal transport plan γ Π (µ, ν) such that µt = ((1 t)x + ty)# γ where W2(µ, ν) = Z

Rd Rd x y 2 dγ(x, y), (9)

and µt is the distribution of the random variable (1 t)X + t Y with the pair (X, Y ) following distribution γ. Conversely, every optimal transport plan gives rise to a unique geodesic via (9).

Since our Frank-Wolfe method minimizes a sequence of linear approximations, one must deﬁne the notion of a gradient (of a functional J) to be compatible with respect to the geometry of these geodesics. In particular, as P2(Rd) is curved under W (see Appendix A), gradients must be deﬁned in terms of a selection in an appropriate cotangent bundle. For Wasserstein space, this cotangent bundle (denoted Co Tan P2(Rd)) is essentially the set of vector ﬁelds on Rd that can be approximated by gradients of smooth functions (see Appendix A for details). This results in the following deﬁnition.

Deﬁnition 1 (Wasserstein differentiability). Let S be a geodesically convex set; that is, µt S for any geodesic µt between µ, ν S. A functional J : P2(Rd) R is Wasserstein differentiable on S if there is a map F : P2(Rd) Co Tan P2(Rd) such that for all µ, ν S and a geodesic µt : [0, 1] P2 Rd between µ and ν, one has

lim α 0+ J (µα) J(µ)

Rd Rd F(µ; x)T (y x) dγ(x, y), (10)

where γ is the unique optimal transport plan (9) corresponding to µt. Note that F(µ; x) = (F(µ)) (x) provides an aesthetic way of representing the evaluation at x Rd of the output of F at µ. The map F is called the Wasserstein derivative of J. Remark 1. The description of differentiability provided by Deﬁnition 1 falls within the general framework of metric derivatives and Wasserstein gradient ﬂows, largely codiﬁed in [1]. This framework is now a well-established component of the theory of Wasserstein spaces, while the relation (10), itself, presents only a narrow structuring of ideas from this framework. Deﬁnition 1, however, is often how works in statistical and algorithmic ﬁelds interact with this broader area [56, 16, 38, 39]. Moreover, this literature demonstrates the most motivating feature of (10): a large number of functionals of interest for machine learning and statistical inference exhibit Wasserstein gradients in the sense of (10). The curious reader is referred to [1, 52, 9] for precise statements of conditions under which (10) is guaranteed. However, F is intimately relative to the Gateaux differential for J [52, 56]. Recall, the Gateaux differential for a functional J exists when there is an appropriate, dual space D on a closed subspace D P(Rd) such that

d J(µ), ν µ = lim α 0+ J (µ + α(ν µ)) J(µ)

for some d J(µ) D and all µ in some set S such that S S D. In instances where the Gauteaux differential d J(µ) exists, the Wasserstein derivative F will usually exist [39] and be given by d J(µ). Here, we use the ﬁnite dimensional gradient operator ( ) formally, and omit a rigorous exposition on this operation in the context of Co Tan P2(Rd).

It should be noted that computation of the Wasserstein derivative might be difﬁcult. Indeed, for a J in a variational form such as (7), computation of the Wasserstein derivative is equivalent to ﬁnding a witness function that achieves the supremum [52]. In the case of a pathological sets (in (7)), such a task might be intractable. To resolve this issue, and to simplify our treatment, this work utilizes the existence of an oracle for the computation of a Wasserstein gradient. This oracle permits a uniﬁed description of our Frank-Wolfe algorithm and abstracts away variation in functional-speciﬁc computational cost. Recall that a function in the Hölder space C1(Rd) is called L-smooth if has L-Lipschitz gradients. Deﬁnition 2 (Wasserstein Derivative Oracle). Let J : P2(Rd) R be a Wasserstein differentiable functional on a set S with Wasserstein derivative F : P2(Rd) Co Tan P2(Rd). A L-smooth Wasserstein derivative oracle over S is an oracle which, given sample access to a distribution µ S and an error parameter ϵ, returns an L-smooth function bφµ C1(Rd) satisfying bφµ F(µ) L2(µ) ϵ (12)

where L2(µ) is the canonical norm on the space L2(µ) of square integrable functions with respect to µ P(Rd). In this work, the output of this oracle is represented as Θ(µ, ϵ). Remark 2. The qualiﬁcation that the Wasserstein derivative oracle return an L-smooth function is necessary to exclude the, aforementioned, possibility of a pathological Wasserstein derivative that would be intractable for use in a computational procedure. In some ways, this is representative of the fact that the cotangent space Co Tan(µ) at a point µ is too large. Such a condition is common in other variational methods [3, 19, 64, 20] and is relatively superﬁcial when coupled with the degree of approximation afforded by ϵ. Via smoothing techniques [51, 11, 38], functionals can often be assumed to have Wasserstein derivatives which are C1(Rd) or are well-approximable by C1(Rd) functions.

2.2 Smoothness and Łojasiewicz inequalities In ﬁnite dimensions, iterative, gradient-based methods typically require the speciﬁcation of two conditions in order to achieve convergence.

The accuracy of local, linear approximations that are provided by the gradient.

The extent to which local descent makes global progress on the objective.

Here, we state these conditions in the context of functionals over Wasserstein space.

Deﬁnition 3 (α-Hölder smoothness). Let S be a geodesically convex set and let J : P2 Rd R be a functional which is continuously Wasserstein differentiable on the set S. J is said to be locally αHolder smooth on S with parameters T and if for all µ S and all ν S such that W(µ, ν) , there exists an optimal transport plan γ P2(Rd Rd) such that

J(ν) J(µ) + Z

Rd Rd F(µ; x)T (y x) dγ(x, y) + T 1 + αW1+α (ν, µ) (13)

Deﬁnition 4 (Łojasiewicz inequality). A Wasserstein differentiable functional J on a set S P2(Rd) is said to satisfy a Łojasiewicz inequality with parameter τ and exponent θ if for all µ S and J := infµ S J(µ) τ (J(µ) J )θ F(µ) L2(µ) (14)

where F is the Wasserstein derivative (10) of J.

Remark 3. More restrictive versions of both (13) and (14) commonly appear in previous literature to establish the explicit converence rate [3, 32, 39, 19, 15]. In most cases, the α-Hölder smoothness condition (13) is stated for α = 1 and required to hold globally ( = ). This smoothness criterion is considerably weaker since it requires that the Wasserstein gradient only provide a local approximation that is slightly more than ﬁrst-order accurate. Moreover, such a condition can be necessary when the Wasserstein derivative (10) is not Lipschitz with respect to W in the cotangent space norm on Co Tan P2(Rd) see [43] for such an example. Additionally, statement of the Łojasiewicz inequality (14) is broader than canonical treatments due to the presence of the auxiliary power θ. Most often, the speciﬁc instances of either θ = 1/2 or θ = 1 are considered. The case θ = 1 is implied for (geodesically) convex functionals J with a W-bounded level set, while θ = 1/2 is implied for strongly convex J [1, 32]. Although the notions of Holder smoothness condition (14) and Łojasiewicz inequalities (13) have been well-studied in the literature [33, 5], the use of both of these conditions with explicitly determined exponents α and θ to provide concrete convergence rates for a computationally-implemented, inﬁnite-dimensional descent method does not appear in related literature as far as the authors are aware.

3 Modiﬁed Frank-Wolfe algorithm

Algorithm 1 provides our modiﬁed Frank-Wolfe procedure along with its associated convergence guarantees and sample complexities in Theorem 1 and Proposition 1. It is worth mentioning that the algorithm itself only requires a much weaker notion of differentiability to be applicable, namely, Gateaux differentiability (i.e., d J(µ) in (11) exists). In Section 4, we will provide several computational examples which may not satisfy the following assumptions but work well in practice. We have to admit there is a theoretical and computational gap. We will leave it as an open question for our future work. However, to conduct the convergence analysis, we require the following assumptions phrased in the language from Section 2.

Assumption 1 (Smoothness assumption). The functional J : P2(Rd) R is Wasserstein differentiable (Deﬁnition 1) and locally α-Holder smooth (Deﬁnition 3) on a set S P2(Rd) with parameters T and 1 > 0 (Deﬁnition 3). Further, an L-smooth Wasserstein derivative oracle (Deﬁnition 2) for J exists.

Assumption 2 (Local richness). The set S is rich enough to contain the solution to (2) for µ S, L-smooth f, and δ 2.

Remark 4. When S is not the whole set, it is necessary to invoke Assumption 2 to guarantee that the iterates produced by our algorithm remain in S. This is a result of the fact that the solution of (15) is optimal for some Wasserstein ball of size δ δ. Hence, each of these iterates is guaranteed to lie in S so long as δ 2.

Assumption 3 (Łojasiewicz assumption). The functional J satisﬁes a Łojasiewicz inequality (14) on S P2(Rd) with parameters τ > 0 and θ.

Algorithm 1 Modiﬁed Frank Wolfe for (3) Input: Wasserstein derivative oracle Θ, initial distribution µ0, smoothness parameter α, gradient error ˆϵ, estimation error ϵ, iterate error eϵ, stopping threshold r, step sizes (β1, β2, β3), number of iterations k for 1 i k do

Let bφµi 1 Θ(µi 1, ˆϵ) ( ) Deﬁnition 2

Compute bφµi 1 L2(µi 1) ϵ s bφµi 1 L2(µi 1) if s r, then break else δ min β1, β2s, β3s1/α , ζ δeϵ Compute µi satisfying W(µi, µi 1) δ and ( ) Proposition 1 Z bφµi 1 dµi inf W(ν,µi 1) δ

Z bφµi 1 dν ζ (15) return µi

Theorem 1. Under Assumptions 1, 2, 3, and an appropriate choice of input parameters, Algorithm 1 computes a distribution µ satisfying

r(µ ) := J(µ ) inf µ S J(µ) ϵ (16)

in at most k = e O ϵ p (17)

iterations, where p denotes the negative part of p = 1 α θ for the dual exponent α = (1 + α)/α. The notation e O( ) omits logarithmic factors in it s arguments.

Remark 5. In the case of a (geodescially) strongly-convex and 1-Hölder smooth functional J, the Łojasiewicz inequality (14) holds with θ = 1/2 and one obtains standard e O(1) complexity (in terms of ϵ). This is to be expected from ﬁnite dimensional analogues [36]. Similarly, for J which is only convex (with a W-bounded level set), (14) holds with θ = 1 and (17) yields a e O(ϵ 1) complexity that mimics canonical results. The step size required to achieve these complexities is illustrated by the choice of δ in Algorithm 1.

For functionals which are α-Hölder smooth for α < 1, the dependence on the dual exponent α in (17) can be rather punishing for small α. It is natural to ask if this exponent could be improved within the scope of Assumptions 1, 2, and 3. Moreover, in ﬁnite dimensions, it is well known that ﬁrst-order methods for convex and α-Hölder smooth functions (also known as weakly smooth functions) can obtain ϵ-optimal solutions in O(ϵ 2/(1+3α)) iterations [47]. Hence, it could even be considered whether, given geodesic-convexity assumptions on J, a better iteration complexity for Algorithm 1 would be obtainable.

We conjecture that such improvements are unlikely - particularly those that would draw on analogy from ﬁnite dimensional techniques. The motivation for this is as follows. A common approach to establishing improved iteration complexities for convex, α-Hölder smooth functions, in ﬁnite dimensions is to consider their gradient oracles as inexact oracles for convex, 1-Hölder smooth functions [23]. Using either averaging arguments or accelerated methods, more rapid progress on an underlying objective can then be made with these inexact oracles. Our Frank-Wolfe method already utilizes an inexact step (15) so, conceptually, such an approach could be applied to Algorithm 1. Unfortunately, however, such ﬁnite dimensional analogies fail due to the difﬁculty of averaging distributions in Wasserstein space. Indeed, in ﬁnite dimensions, averaging is crucial to prevent error accumulation from outpacing objective progress. Since Wasserstein space is positively curved (Appendix A) computing analogous convex combinations of the µi in Algorithm 1 is, itself, a variational problem that might be as expensive to compute as the original problem (3).

3.1 Computation of the Frank-Wolfe step

An algorithm for computation of the Frank Wolfe step (15) is given in Appendix C and the net result of this procedure is as follows.

Proposition 1. Under the assumptions of Theorem 1, there exists a stochastic algorithm which (with high probability) provides sample access to a distribution satisfying (15). Moreover, in the setting of (17), this algorithm requires e O ϵ 2α θ samples and gradient evaluations.

Crucially, the property enabling Proposition 1 is duality. The Frank-Wolfe problem (15) exhibits a dual of the form

inf y Rd f(y) + λ

2 x y 2 dµ(x) δ2λ

and (18) is amenable to computation using techniques from ﬁnite dimensional optimization. This approach derives from distributionally robust optimization [6, 54, 25, 34] where such techniques have been used to produce methods in optimization and machine learning that are robust to adversarial perturbations.

In general, solution of (15) for any level of δ could be computationally hard [6, 54]. However, since a Frank-Wolfe procedure need only solve local problems, not global ones, we show: there is a δ which is, simultaneously, small enough to enable the efﬁcient computation of (18)), yet large enough to produce (17). The techniques used to achieve these results most closely resemble ideas from [54]. However, we provide a precise quantiﬁcation of the δ in (15) that is required to achieve computation tractability and develop a procedure which yields guarantees on the primal-dual gap of (15) and (18). In turn, the theoretical insights that we obtain suggest an empirical procedure in which λ in (18) can be updated relatively infrequently within our Frank-Wolfe algorithm, provided that it is chosen on a proper scale. Such an implementation is investigated in Section 4 and yields signiﬁcant computational savings.

It is also worth noting that the implementation in Appendix C requires only sample access to µ0 in order to provide sample access to a distribution satisfying (16). Thus, all operations in Algorithm 1 can be implemented using only sample access to µ0. Practically, however, it is often more efﬁcient to maintain approximations to the iterates µi via a non-parametric estimator. When this is done, it results in an additional, additive error in the residual (16) at each step of Algorithm 1. If this error is on the order of the error produced by the Wasserstein derivative oracle Θ, the iteration complexity (17) remains unaffected. Moreover, analysis of the error induced by a particular non-parametric approximation of the µi is highly problem dependent so we do not consider it in the context of Theorem 1.

4 Computational examples

In this section, we demonstrate the application our Frank-Wolfe algorithm to several non-parametric estimation problems in statistics and machine learning. All simulations are implemented using Python 3.8 on a high performance computing server running Ubuntu 18.04 with a Gen10 Quad Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz processor. As we mentioned in Section 3, the proposed algorithm just needs a very weak of differentiability to be applied (i.e., Gateaux differentiability). The computational examples we conduct in this section aim to show attractive performance in applications where the assumptions required for theoretical convergence are unknown, instead of corroborating our theoretical results.

4.1 Gaussian deconvolution

A classical task in nonparametric statistics [12, 8] is to infer a latent, data-generating distribution µ P2 Rd from a set of observations that are corrupted by independent, additive Gaussian noise. For observations Y1, . . . , Yn such that Yi = Xi + Zi where Xi µ, Zi N 0, σ2 . One seeks to compute a non-parametric estimate of µ the variance of the noise σ2 is considered known. Since Zi is independent of Xi, this task amounts to deconvolving µ from the distribution of Zi. A natural candidate for µ is the maximum-likelihood estimator (MLE),

bµ := arg max µ P2(Rd)

Rd gσ (Yi x) dµ(x) (19)

where gσ is the density of Zi. We refer the reader to [50, Section 3] for further details but note that it was shown in [50] that bµ has an equivalent characterization

ˆµ = arg min µ P2(Rd) Dσ2(µ, ˆPY ) (20)

Dσ2(µ1, µ2) := inf π Π(µ1,µ2) 1 2

Z x y 2 dπ(x, y) + σ2DKL(π || µ1 µ2) (21)

is the entropic optimal transportation distance [21] and ˆPY is the empirical distribution of the Yi. The problem (20) readily lies within the framework of (3) for J(µ) := Dσ2(µ, ˆPY ). Moreover, it is known [41] that the Wasserstein derivative (10) of Dσ2(µ, ˆPY ) with respect µ is given by

ˆφµ(x) = σ2 log

i=1 exp v i x yi 2/2 /σ2 !

where v Rn is dual variable (corresponding to ˆPY ) which is optimal for Dσ2(µ, ˆPY ). This provides a Wasserstein derivative oracle for (20) as the vector v can be readily approximated using sinkhorn or stochastic gradient algorithms [49].

Toy example On 2D Gaussian mixture A simple, two dimensional instance of (19) is shown in Figure 1 on a dataset Yi of 50 samples with mixture of 4 Gaussians illustrated by the kernel density estimator of the Yi, shown in red. The behavior of our Frank-Wolfe Algorithm is depicted over the course of several iterations, where the foreground contours provide the density of the iterate, µi, that is maintained by the algorithm. It can be easily observed that despite the small overlap between our initial distribution and target one, our method reaches the global optima very quickly.

0 10 98.682

Iteration 0

0 10 61.367

Iteration 3

0 10 40.837

Iteration 5

0 10 18.594

Iteration 8

Iteration 11

Iteration 14

Iteration 16

Iteration 25

Figure 1: Toy Example on a 2D Gaussian Mixture. Note that the initial distribution is set as N([10, 10], σ2I) for σ2 = 0.4 and the number of particles is 200. The bisection method of Appendix C is used with tolerance set to 1e 3. The objective value Dσ2(µi, ˆPy) is below to each sub-ﬁgure.

Uniform strategy for λ and sensitivity analysis on high-dimensional examples Since the number of bisection ascent step for λ in (18) that arises during the search can be large, the original algorithm may be computationally rather demanding for high-dimensional cases. Thus, we are

0 10 20 30 40 50 60 70 Iteration

Frank-Wolfe Convergence

Bisection Ascent

Constant = 1

(a) Objective Value

0 5 10 15 20 Iteration

log2(D 2( , PY))

Sensitivity Analysis for Scaling Factor

= 0.5 = 1 = 5 = 10

(b) Scaling Factor ε

0 5 10 15 20 Iteration

log2(D 2( , PY))

Sensitivity Analysis for Step Size

= 0.5 = 1 = 2 = 5 = 10 = 20

(c) Step size δ

Figure 2: High-dimensional Gaussian deconvolution for d = 64. For 50 data points, Yi, sampled from a mixture of 7-Gaussians, 200 particles are used in a non-parametric estimate the µi. In Figure 2(a), the tolerance for the ascent method is 1e 3 and the shaded bands show the standard derivation over 10 independent runs with random initializations. In Figure 2(b), the step size δ is 0.5.

motivated to develop an approach that makes λ as a constant which only depends on the step size λ = ε/δ1/2 without other one-dimensional optimization methods as an inner solver, where ε is the scaling factor. Figure 2(a) provides a convergence behavior comparison between the vanilla Algorithm 1 and the modiﬁcation leveraging this new uniform strategy. Not surprisingly, the modiﬁed Frank-Wolfe algorithm can get to the local region around the global optima faster but with a larger variance, as the uniform strategy is indeed a more aggressive strategy at the early stage. To further support the uniform strategy and our Frank-Wolfe framework, we conduct extensive sensitivity experiments on the hyperparameter (e.g., step size δ and scaling factor ε). Both Figure 2(b) and 2(c) demonstrate that our algorithm is robust to these crucial hyperparamter. We can also observe that it is better not to choose a relatively large step size although we can converge faster at the beginning but suffer from the risk of divergence, as the maximum step size is controlled by the smooth parameter in theory. Hence, our experiment results here also corroborate the theoretical ﬁndings.

It is worth mentioning that this new uniform strategy can make our Frank-Wolfe framework be extended to an asynchronous decentralized parallel setting easily and thus can further meet the requirements of large-scale applications. Based on the superior performance, we left its rigorous convergence analysis as an open question.

4.2 Nonparametric learning with student-teacher networks

The rise of generative adversarial networks (GANs) [28] and efforts connecting neural networks and kernel regression [17], have generated interest in maximum mean discrepancy (MMD), particularly with respect to it s role in constructing high-dimensional, distributional embeddings [20, 45]. This development is predicated on the observation that any neural network (x, θ) ψ(x, θ), which produces an output ψ(x, θ) Rd from input data x X Rd and parameters θ Θ Rm, yields a kernel on the parameter set Θ:

k(θ1, θ2) := Ex ψ(x, θ1)T ψ(x, θ2) (23)

where the expectation over x is taken with respect to a data generating distribution. Via MMD, k( , ) induces a natural discrepancy measure between distributions over network parameters θ. Thus, learning of a generative image model can be expressed as minimizing MMD with respect to latent, generative distribution for ν. We refer to [45, 3] for further descriptions of these applications.

Being an integral probability metric (7), squared MMD lies well within the framework of this paper

J(µ) := MMD2(µ, ν). (24)

and the Gateaux derivative (i.e., inﬂuence function) of J admits a natural expression [3] as the difference between the mean embeddings of µ and ν

f µ(x) = Ez µ [k(z, x)] Ez ν [k(z, x)] (25)

Indeed, f µ can be readily computed via sampling methods even when µ or ν are continuous or are large, discrete distributions [29]. Note that, as discussed in Remark 1, the Wasserstein

50000 100000 150000 Gradient Evaluations

Validation Error

Frank-Wolfe (Uniform)

Mean (over 10 runs) Standard Deviation

0 100000 200000 300000 400000 Gradient Evaluations

Validation Error

MMD Flow (with noise injection)

Mean (over 10 runs) Standard Deviation

0 10000 20000 30000 40000 Gradient Evaluations

Validation Error

Kernel Sobolev Descent

Mean (over 10 runs) Standard Deviation

Figure 3: Student-Teacher Network; The detailed implementation set up is same as [3, Appendix G]. The left one is the result for our Frank-Wolfe method with the uniform strategy λ = 0.05

δ and the step size δ is 0.5. The number of particle is 200.

derivative is, under sufﬁcient regularity, the gradient of the Gateaux differential f µ( ). Perhaps the most advantageous consequence of (25), however, is that the Wasserstein gradient directly inherits regularity present in k. Indeed, should xk(x, y) be L-Lipschitz in x (uniformly for all y), J (24) is naturally L-smooth [3]. This has led to the development of several variational or particle-based methods for minimizing (24) [3, 45, 20]. Remark 6. For general MMD functionals, the smoothness and Łojasiewicz inequalities (i.e., Assumptions 1 and 3 ) are shown in [3]. Nevertheless, the MMD experiments in our paper, following the setup in [3], fail to satisfy the differentiability assumptions in [3] due to the Re LU terms present in the network deﬁning the kernel.

However, despite a possible violation of the assumptions, Figure 3 demonstrates the competitive performance of our method with two of baselines showcased in [3] on Student-Teacher network problem. Our method is shown on the left, the center plot shows the MMD gradient ﬂow algorithm from [3], and the right plot provides the Sobolev Descent algorithm of [45]. Performance is evaluated in terms of MMD error on a validation dataset and is shown as a function of the total gradient evaluations performed by each method. This provides a better proxy for relative performance and convergence since an iteration of Algorithm 1 performs multiple solves that are, each, similar in terms of gradient complexity to a single iteration of MMD gradient ﬂow or Sobolev descent. Further, the total number of gradient evaluations should not be viewed as a proxy for wall-time as, for each gradient evaluation, the number of operations performed by each method can vary widely. Indeed, for each gradient evaluation in Sobolev descent an entire linear system solve is performed, which is computational demanding in practice. Also, note that, as both MMD gradient ﬂow and Sobolev descent are particle-based, Algorithm 1 was, for the purposes of comparison, instantiated with a particle distribution of equal size.

5 Conclusion

This paper introduces and studies a Frank-Wolfe procedure for the minimization of functionals of probability measures. While these methods have been widely studied in the ﬁnite-dimensional setting; our current environment presents both signiﬁcant beneﬁts and opportunities. First, many problems of interest can be posed in terms of the types of formulations that we study [18, 19, 6, 54, 20, 39, 64, 55, 57, 11]. Second, our algorithm can naturally be asynchronously parallelized. This is a research avenue of signiﬁcant promise that we plan to explore in future work, especially in connection with the wide range of applications mentioned earlier.

Acknowledgements Material in this paper is based upon work supported by the Air Force Ofﬁce of Scientiﬁc Research under award number FA9550-20-1-0397. Additional support is gratefully acknowledged from NSF grants 1915967, 1820942 and 1838576.

1. For all authors...

(a) Do the main claims made in the abstract and introduction accurately reﬂect the paper s contributions and scope? [Yes] (b) Did you describe the limitations of your work? [Yes]

(c) Did you discuss any potential negative societal impacts of your work? [No] The contributions of this work are mainly theoretical. We do not foresee any negative societal impact as a consequence. (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] 2. If you are including theoretical results...

(a) Did you state the full set of assumptions of all theoretical results? [Yes] (b) Did you include complete proofs of all theoretical results? [Yes] See the supplemental material 3. If you ran experiments...

(a) Did you include the code, data, and instructions needed to reproduce the main experimental results? [Yes] See the supplemental material (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...

(a) If your work uses existing assets, did you cite the creators? [N/A] (b) Did you mention the license of the assets? [N/A]

(c) Did you include any new assets either in the supplemental material or as a URL? [N/A]

(d) Did you discuss whether and how consent was obtained from people whose data you re using/curating? [N/A] (e) Did you discuss whether the data you are using/curating contains personally identiﬁable information or offensive content? [N/A] 5. If you used crowdsourcing or conducted research with human subjects...

(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

[1] L. Ambrosio, N. Gigli, and G. Savare. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Lectures in Mathematics. ETH Zürich. Birkhäuser Basel, 2005.

[2] Luigi Ambrosio and Nicola Gigli. Construction of the parallel transport in the wasserstein space. Methods Appl. Anal., 15(1):1 30, 03 2008.

[3] Michael Arbel, Anna Korba, Adil SALIM, and Arthur Gretton. Maximum mean discrepancy gradient ﬂow. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

[4] Daniel Bartl, Samuel Drapeau, and Ludovic Tangpi. Computational aspects of robust optimized certainty equivalents and option pricing. Mathematical Finance, 30(1):287 309, 2020.

[5] Adrien Blanchet and Jérôme Bolte. A family of functional inequalities: Łojasiewicz inequalities and displacement convex functions. Journal of Functional Analysis, 275(7):1650 1673, 2018.

[6] Jose Blanchet and Karthyek Murthy. Quantifying distributional model risk via optimal transport. SSRN Electronic Journal, 04 2016.

[7] Jose Blanchet, Karthyek Murthy, and Fan Zhang. Optimal transport based distributionally robust optimization: Structural properties and iterative schemes. 2018.

[8] Claire Caillerie, Frédéric Chazal, Jérôme Dedecker, and Bertrand Michel. Deconvolution for the Wasserstein metric and geometric inference. Electronic Journal of Statistics, 5(none):1394 1423, 2011.

[9] R. Carmona and F. Delarue. Probabilistic Theory of Mean Field Games with Applications II: Mean Field Games with Common Noise and Master Equations. Probability Theory and Stochastic Modelling. Springer International Publishing, 2018.

[10] J. A. Carrillo, K. Craig, L. Wang, and Chaozhen Wei. Primal dual methods for wasserstein gradient ﬂows. ar Xiv: Numerical Analysis, 2019.

[11] José Antonio Carrillo, Katy Craig, and Francesco S. Patacchini. A blob method for diffusion. Calculus of Variations and Partial Differential Equations, 58(2):53, Feb 2019.

[12] Raymond J. Carroll and Peter Hall. Optimal rates of convergence for deconvolving a density. Journal of the American Statistical Association, 83(404):1184 1186, 1988.

[13] Z. Chen, D. Kuhn, and W. Wiesemann. Data-driven chance constrained programs over wasserstein balls. ar Xiv: Optimization and Control, 2018.

[14] T.C.E. Cheng and Mikhail Y. Kovalyov. An unconstrained optimization problem is np-hard given an oracle representation of its objective function: a technical note. Computers & Operations Research, 29(14):2087 2091, 2002.

[15] Sinho Chewi, Tyler Maunu, Philippe Rigollet, and Austin J. Stromme. Gradient descent algorithms for Bures-Wasserstein barycenters. In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 1276 1304. PMLR, 09 12 Jul 2020.

[16] Lénaïc Chizat and Francis Bach. On the global convergence of gradient descent for overparameterized models using optimal transport. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

[17] Lénaïc Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

[18] Casey Chu, Jose Blanchet, and Peter Glynn. Probability functional descent: A unifying perspective on GANs, variational inference, and reinforcement learning. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1213 1222. PMLR, 09 15 Jun 2019.

[19] Casey Chu, Kentaro Minami, and Kenji Fukumizu. Smoothness and stability in gans. In International Conference on Learning Representations, 2020.

[20] Samuel Cohen, Michael Arbel, and Marc P. Deisenroth. Estimating barycenters of measures in high dimensions. ar Xiv:2007.07105, 2020.

[21] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.

[22] Erick Delage and Yinyu Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595 612, 2010.

[23] Olivier Devolder, François Glineur, and Yurii Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1):37 75, Aug 2014.

[24] Futoshi Futami, Zhenghang Cui, Issei Sato, and Masashi Sugiyama. Bayesian posterior approximation via greedy particle optimization. Proceedings of the AAAI Conference on Artiﬁcial Intelligence, 33(01):3606 3613, Jul. 2019.

[25] Rui Gao and A. Kleywegt. Distributionally robust stochastic optimization with wasserstein distance. ar Xiv: Optimization and Control, 2016.

[26] Soumyadip Ghosh, Mark Squillante, and Ebisa Wollega. Efﬁcient stochastic gradient descent for learning with distributionally robust optimization. 2018.

[27] Joel Goh and Melvyn Sim. Distributionally robust optimization and its tractable approximations. Operations Research, 58(4-part-1):902 917, 2010.

[28] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS 14, page 2672 2680, Cambridge, MA, USA, 2014. MIT Press.

[29] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. J. Mach. Learn. Res., 13(null):723 773, March 2012.

[30] O. Güler. Foundations of Optimization. Graduate Texts in Mathematics. Springer New York, 2010.

[31] Grani A. Hanasusanto and Daniel Kuhn. Conic programming reformulations of two-stage distributionally robust linear programs over wasserstein balls. Operations Research, 66(3):849 869, 2018.

[32] Daniel Hauer and José Mazón. Kurdyka łojasiewicz simon inequality for gradient ﬂows in metric spaces. Transactions of the American Mathematical Society, 372(7):4917 4976, 2019.

[33] Daniel Hauer and José Mazón. Kurdyka łojasiewicz simon inequality for gradient ﬂows in metric spaces. Transactions of the American Mathematical Society, 372(7):4917 4976, 2019.

[34] Daniel Kuhn, Peyman Mohajerin Esfahani, V. Nguyen, and Soroosh Shaﬁeezadeh-Abadeh. Wasserstein distributionally robust optimization: Theory and applications in machine learning. Ar Xiv, abs/1908.08729, 2019.

[35] Daniel Levy, Yair Carmon, John C Duchi, and Aaron Sidford. Large-scale methods for distributionally robust optimization. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 8847 8860. Curran Associates, Inc., 2020.

[36] Guoyin Li and Ting Kei Pong. Calculus of the exponent of kurdyka łojasiewicz inequality and its applications to linear convergence of ﬁrst-order methods. Foundations of Computational Mathematics, 18(5):1199 1232, Oct 2018.

[37] Jiajin Li, Sen Huang, and Anthony Man-Cho So. A ﬁrst-order algorithmic framework for distributionally robust logistic regression. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

[38] Chang Liu, Jingwei Zhuo, Pengyu Cheng, Ruiyi Zhang, and Jun Zhu. Understanding and accelerating particle-based variational inference. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4082 4092. PMLR, 09 15 Jun 2019.

[39] L. Liu, Y. Zhang, Z. Yang, R. Babanezhad, and Z. Wang. Inﬁnite-dimensional game optimization via variational transport. OPT 2020, 2020.

[40] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.

[41] Giulia Luise, Saverio Salzo, Massimiliano Pontil, and Carlo Ciliberto. Sinkhorn barycenters with free support via frank-wolfe algorithm. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

[42] S. Mahdian, J. Blanchet, and P. Glynn. Optimal transport relaxations with application to wasserstein gans. Ar Xiv, abs/1906.03317, 2019.

[43] Quentin Mérigot, Alex Delalande, and Frederic Chazal. Quantitative stability of optimal transport maps and linearization of the 2-wasserstein space. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artiﬁcial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 3186 3196. PMLR, 26 28 Aug 2020.

[44] Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the wasserstein metric: Performance guarantees and tractable reformulations. Math. Program., 171(1 2):115 166, September 2018.

[45] Youssef Mroueh, Tom Sercu, and Anant Raj. Sobolev descent. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of the Twenty-Second International Conference on Artiﬁcial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 2976 2985. PMLR, 16 18 Apr 2019.

[46] Hongseok Namkoong and John C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS 16, page 2216 2224, Red Hook, NY, USA, 2016. Curran Associates Inc.

[47] A. S. Nemirovski ı and Yu. E. Nesterov. Optimal methods for smooth convex minimization. Zh. Vychisl. Mat. i Mat. Fiz., 25(3):356 369, 477, 1985.

[48] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization. Springer US, 2003.

[49] Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355 607, 2019.

[50] Philippe Rigollet and Jonathan Weed. Entropic optimal transport is maximum-likelihood deconvolution. Comptes Rendus Mathematique, 356(11):1228 1235, 2018.

[51] Maziar Sanjabi, Jimmy Ba, Meisam Razaviyayn, and Jason D. Lee. On the convergence and robustness of training gans with regularized optimal transport. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS 18, page 7091 7101, Red Hook, NY, USA, 2018. Curran Associates Inc.

[52] F. Santambrogio. Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling. Progress in Nonlinear Differential Equations and Their Applications. Springer International Publishing, 2015.

[53] M. Scetbon, Laurent Meunier, J. Atif, and Marco Cuturi. Equitable and optimal transport with multiple agents. ar Xiv: Machine Learning, 2020.

[54] Aman Sinha, Hongseok Namkoong, and John Duchi. Certiﬁable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.

[55] Matthew Staib and Stefanie Jegelka. Distributionally robust optimization and generalization in kernel methods. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

[56] Amirhossein Taghvaei and Prashant Mehta. Accelerated ﬂow for probability distributions. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6076 6085. PMLR, 09 15 Jun 2019.

[57] Bahar Taskesen, V. Nguyen, Daniel Kuhn, and J. Blanchet. A distributionally robust approach to fair classiﬁcation. Ar Xiv, abs/2007.09530, 2020.

[58] Bart P. G. Van Parys, Peyman Mohajerin Esfahani, and Daniel Kuhn. From data to decisions: Distributionally robust optimization is optimal. Management Science, 0(0):null, 0.

[59] C. Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2008.

[60] Wolfram Wiesemann, Daniel Kuhn, and Melvyn Sim. Distributionally robust convex optimization. Operations Research, 62(6):1358 1376, 2014.

[61] Weijun Xie. On distributionally robust chance constrained programs with wasserstein distance. Mathematical Programming, 186(1):115 155, Mar 2021.

[62] K. Yoshida. Functional Analysis. Classics in mathematics / Springer. World Publishing Company, 1980.

[63] Junyu Zhang, Alec Koppel, Amrit Singh Bedi, Csaba Szepesvari, and Mengdi Wang. Variational policy gradient method for reinforcement learning with general utilities. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 4572 4583. Curran Associates, Inc., 2020.

[64] Junyu Zhang, Alec Koppel, Amrit Singh Bedi, Csaba Szepesvari, and Mengdi Wang. Variational policy gradient method for reinforcement learning with general utilities. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 4572 4583. Curran Associates, Inc., 2020.

[65] Chaoyue Zhao and Yongpei Guan. Data-driven risk-averse stochastic optimization with wasserstein metric. Operations Research Letters, 46(2):262 267, 2018.

[66] X. Zheng and H. Chen. Data-driven distributionally robust unit commitment with wasserstein metric: Tractable formulation and efﬁcient solution method. IEEE Transactions on Power Systems, 35(6):4940 4943, 2020.

[67] A. Zhou, M. Yang, M. Wang, and Y. Zhang. A linear programming approximation of distributionally robust chance-constrained dispatch with wasserstein distance. IEEE Transactions on Power Systems, 35(5):3366 3377, 2020.