# Sinkhorn Barycenter via Functional Gradient Descent

Zebang Shen, Zhenfu Wang, Alejandro Ribeiro, Hamed Hassani
Department of Electrical and Systems Engineering and Department of Mathematics, University of Pennsylvania
{zebang@seas, zwang423@math, aribeiro@seas, hassani@seas}.upenn.edu

Abstract

In this paper, we consider the problem of computing the barycenter of a set of probability distributions under the Sinkhorn divergence. This problem has recently found applications across various domains, including graphics, learning, and vision, as it provides a meaningful mechanism to aggregate knowledge. Unlike previous approaches that operate directly in the space of probability measures, we recast the Sinkhorn barycenter problem as an instance of unconstrained functional optimization and develop a novel functional gradient descent method named Sinkhorn Descent (SD). We prove that SD converges to a stationary point at a sublinear rate and, under reasonable assumptions, we further show that it asymptotically finds a global minimizer of the Sinkhorn barycenter problem. Moreover, by providing a mean-field analysis, we show that SD preserves the weak convergence of empirical measures. Importantly, the computational complexity of SD scales linearly in the dimension $d$, and we demonstrate its scalability by solving a 100-dimensional Sinkhorn barycenter problem.

1 Introduction

Computing a nonlinear interpolation between a set of probability measures is a foundational task across many disciplines. This problem is typically referred to as the barycenter problem and, as it provides a meaningful metric to aggregate knowledge, it has found numerous applications. Examples include distribution clustering [Ye et al., 2017], Bayesian inference [Srivastava et al., 2015], texture mixing [Rabin et al., 2011], and graphics [Solomon et al., 2015]. The barycenter problem can be naturally cast as the minimization of the average distance between the target measure (the barycenter) and the source measures, and the choice of the distance metric can significantly impact the quality of the barycenter [Feydy et al., 2019]. In this regard, the Optimal Transport (OT) distance (a.k.a. the Wasserstein distance) and its entropy-regularized variant (a.k.a. the Sinkhorn divergence) are the most suitable geometrically faithful metrics, with the latter being more computationally friendly. In this paper, we provide efficient and provable methods for the Sinkhorn barycenter problem.

The prior work in this domain has mainly focused on finding the barycenter by optimizing directly in the space of (discrete) probability measures. We can divide these previous methods into three broad classes depending on how the support of the barycenter is determined: (i) The first class assumes a fixed and prespecified support set for the barycenter and only optimizes the corresponding weights [Staib et al., 2017, Dvurechenskii et al., 2018, Kroshnin et al., 2019]. Accordingly, the problem reduces to minimizing a convex objective subject to a simplex constraint. However, fixing the support without any prior knowledge creates undesired bias and affects the quality of the final solution. While increasing the support size (possibly exponentially in the dimension $d$) can help to mitigate the bias, it renders the procedure computationally prohibitive as $d$ grows.
(ii) To reduce the bias, the second class considers optimizing the support and the weights through an alternating procedure [Cuturi and Doucet, 2014, Claici et al., 2018]. Since the barycenter objective is not jointly convex with respect to the support and the weights, these methods in general only converge to a stationary point, which can be far from the true minimizers. (iii) Unlike the aforementioned classes, Luise et al. [2019] recently proposed a conditional gradient method with a growing support set. This method enjoys sublinear convergence to the global optimum under the premise that a $d$-dimensional nonconvex subproblem can be globally minimized per iteration. However, nonconvex optimization is generally intractable in high-dimensional problems (large $d$) and only stationary points can be efficiently reached. Hence, the guarantee of [Luise et al., 2019] has limited applicability as the dimension grows.

In this paper, we provide a new perspective on the Sinkhorn barycenter problem: instead of operating in the space of probability measures, we view the barycenter as the push-forward measure of a given initial measure under an unknown mapping. We thus recast the barycenter problem as an unconstrained functional optimization over the space of mappings. Equipped with this perspective, we make the following contributions:

- We develop a novel functional gradient descent method, called Sinkhorn Descent (SD), which operates by finding the push-forward mapping in a Reproducing Kernel Hilbert Space that allows the fastest descent, and consequently solves the Sinkhorn barycenter problem iteratively. We then define the Kernelized Sinkhorn Barycenter Discrepancy (KSBD) to characterize the non-asymptotic convergence of SD. In particular, we prove that KSBD vanishes under the SD iterates at the rate of $O(1/t)$, where $t$ is the iteration number.
- We prove that SD preserves the weak convergence of empirical measures. Concretely, using $\mathrm{SD}_t(\cdot)$ to denote the output of SD after $t$ iterations and letting $\alpha_N$ be an empirical measure of $\alpha$ with $N$ samples, we have $\lim_{N\to\infty} \mathrm{SD}_t(\alpha_N) = \mathrm{SD}_t(\alpha)$. Such asymptotic analysis allows us to jointly study the behavior of SD under either discrete or continuous initialization.
- Under a mild assumption, we prove that KSBD is a valid discrepancy to characterize the optimality of the solution, i.e. the vanishing of KSBD implies that the output measure of SD converges to the set of globally optimal solutions of the Sinkhorn barycenter problem.
- Further, we show the efficiency and efficacy of SD by comparing it with the prior art on several problems. We note that the computational complexity of SD depends linearly on the dimension $d$. We hence validate the scalability of SD by solving a 100-dimensional barycenter problem, which cannot be handled by previous methods due to their exponential dependence on the problem dimension.

Notations. Let $X \subset \mathbb{R}^d$ be a compact ground set, endowed with a symmetric ground metric $c: X \times X \to \mathbb{R}_+$. Without loss of generality, we assume $c(x,y) = \infty$ if $x \notin X$ or $y \notin X$. We use $\nabla_1 c(\cdot,\cdot): X^2 \to \mathbb{R}^d$ to denote its gradient w.r.t. its first argument. Let $\mathcal{M}_1^+(X)$ and $C(X)$ be the spaces of probability measures and continuous functions on $X$, respectively. We denote the support of a probability measure $\alpha \in \mathcal{M}_1^+(X)$ by $\mathrm{supp}(\alpha)$, and we use "$\alpha$-a.e." to denote "almost everywhere w.r.t. $\alpha$". For a vector $a \in \mathbb{R}^d$, we denote its $\ell_2$ norm by $\|a\|$. For a function $f: X \to \mathbb{R}$, we denote its $L^\infty$ norm by $\|f\|_\infty := \max_{x \in X} |f(x)|$ and denote its gradient by $\nabla f(\cdot): X \to \mathbb{R}^d$.
For a vector function $f: X \to \mathbb{R}^d$, we denote its $(2,\infty)$ norm by $\|f\|_{2,\infty} := \max_{x \in X} \|f(x)\|$. For an integer $n$, denote $[n] := \{1,\dots,n\}$. Given a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ with a kernel function $k: X \times X \to \mathbb{R}_+$, we say a vector function $\psi = [[\psi]_1,\dots,[\psi]_d]$ belongs to $\mathcal{H}^d$ if each component $[\psi]_i$ is in $\mathcal{H}$. The space $\mathcal{H}$ has a natural inner product structure and an induced norm, and so does $\mathcal{H}^d$, i.e. $\langle f,g\rangle_{\mathcal{H}^d} = \sum_{i=1}^d \langle [f]_i,[g]_i\rangle_{\mathcal{H}}$ for $f,g \in \mathcal{H}^d$, and $\|f\|_{\mathcal{H}^d}^2 = \langle f,f\rangle_{\mathcal{H}^d}$. The reproducing property of the RKHS $\mathcal{H}$ reads that, given $f \in \mathcal{H}^d$, one has $[f]_i(x) = \langle [f]_i, k_x\rangle_{\mathcal{H}}$ with $k_x(y) = k(x,y)$, which by the Cauchy-Schwarz inequality implies that there exists a constant $M_{\mathcal{H}} > 0$ such that

$$\|f\|_{2,\infty} \le M_{\mathcal{H}} \|f\|_{\mathcal{H}^d}, \quad \forall f \in \mathcal{H}^d. \qquad (1)$$

Additionally, for a functional $F: \mathcal{H}^d \to \mathbb{R}$, the Fréchet derivative of $F$ is defined as follows.

Definition 1.1 (Fréchet derivative in RKHS). For a functional $F: \mathcal{H}^d \to \mathbb{R}$, its Fréchet derivative $DF[\psi]$ at $\psi \in \mathcal{H}^d$ is a function in $\mathcal{H}^d$ satisfying the following: for any $\xi \in \mathcal{H}^d$ with $\|\xi\|_{\mathcal{H}^d} < \infty$,

$$\lim_{\epsilon \to 0} \frac{F[\psi + \epsilon\xi] - F[\psi]}{\epsilon} = \langle DF[\psi], \xi \rangle_{\mathcal{H}^d}.$$

Note that the Fréchet derivative at $\psi$, i.e. $DF[\psi]$, is a bounded linear operator from $\mathcal{H}^d$ to $\mathbb{R}$. It can be written in the form $DF[\psi](\xi) = \langle DF[\psi], \xi\rangle_{\mathcal{H}^d}$ due to the Riesz-Fréchet representation theorem.

1.1 Related Work on Functional Gradient Descent

A related functional gradient descent type method is the Stein Variational Gradient Descent (SVGD) method by Liu and Wang [2016]. SVGD considers the problem of minimizing the Kullback-Leibler (KL) divergence between a variable distribution and a posterior $p$. Note that SVGD updates the positions of a set of $N$ particles using the score function of the posterior $p$, i.e. $\nabla \log p$. Consequently, it requires access to the target distribution function. Later, Liu [2017] proves that SVGD has a convergence guarantee in its continuous-time limit (taking an infinitesimal step size) using an infinite number of particles ($N \to \infty$). In comparison, SD is designed to solve the significantly more complicated Sinkhorn barycenter problem and has a stronger convergence guarantee. More precisely, while SD updates the measure using only a sampling machinery for the target measures (no score functions), it is guaranteed to converge sub-linearly to a stationary point when $\alpha$ is a discrete measure, using discrete time steps. This is in sharp contrast to the results for SVGD.

In another work, Mroueh et al. [2019] consider minimizing the Maximum Mean Discrepancy (MMD) between a source measure and a variable measure. They solve this problem by incrementally following a Sobolev critic function and propose the Sobolev Descent (SoD) method. To show the global convergence of the measure sequence generated by SoD, Mroueh et al. [2019] assume that the entire sequence satisfies certain spectral properties, which is in general difficult to verify. Later, Arbel et al. [2019] consider the same MMD minimization problem from a gradient flow perspective. They propose two assumptions such that, if either one holds, the MMD gradient flow converges to the global solution. However, similarly to [Mroueh et al., 2019], these assumptions have to be satisfied by the entire measure sequence. We note that the Sinkhorn barycenter problem is a strict generalization of the above MMD minimization problem and is hence much more challenging: by setting the number of source measures $n = 1$ and the entropy regularization parameter $\gamma = \infty$, problem (4) degenerates to the special case of MMD. Further, the MMD between two probability measures has a closed-form expression, while the Sinkhorn divergence can only be described via a set of optimization problems. Consequently, the Sinkhorn barycenter is significantly more challenging. To guarantee global convergence, the proposed SD algorithm only requires one of the accumulation points of the measure sequence to be fully supported on $X$, with no restriction on the entire sequence.
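Because the Fréchet derivative of Definition 1.1 is the workhorse of the method developed in Section 3, a toy numerical illustration may help fix ideas. The functional below, the Gaussian kernel, and all variable names are illustrative assumptions that do not come from the paper; the point is only that, via the reproducing property, the directional derivative reduces to a pairing with a kernel-weighted function in $\mathcal{H}^d$.

```python
import numpy as np

# Toy illustration of Definition 1.1 (hypothetical functional, not the Sinkhorn objective).
# Take F[psi] = sum_j ||x_j + psi(x_j)||^2 for fixed particles x_j. By the reproducing
# property [f]_i(x) = <[f]_i, k_x>_H, the Frechet derivative at psi = 0 is the H^d function
# DF[0](y) = sum_j 2 x_j k(x_j, y), so <DF[0], xi>_{H^d} = sum_j 2 x_j^T xi(x_j).
# We check this pairing against a finite-difference approximation of the limit in Definition 1.1.

rng = np.random.default_rng(0)
d, n_particles = 3, 5
X = rng.normal(size=(n_particles, d))            # particles x_j

def k(a, b, sigma=1.0):                          # Gaussian kernel (an assumed choice)
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

Z = rng.normal(size=(4, d))                      # centers defining a direction xi in H^d
A = rng.normal(size=(4, d))                      # coefficients: xi(y) = sum_m A_m k(z_m, y)

def xi(y):
    return sum(A[m] * k(Z[m], y) for m in range(len(Z)))

def F(eps):                                      # F[eps * xi]
    return sum(np.sum((X[j] + eps * xi(X[j])) ** 2) for j in range(n_particles))

pairing = sum(2.0 * X[j] @ xi(X[j]) for j in range(n_particles))  # <DF[0], xi>_{H^d}
finite_diff = (F(1e-6) - F(0.0)) / 1e-6                           # (F[eps*xi] - F[0]) / eps
print(pairing, finite_diff)                                       # the two values should match
```

The same mechanism is what later turns the derivative of the Sinkhorn objective into the kernel-integral form (13).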
2 Sinkhorn Barycenter

We first introduce the entropy-regularized optimal transport distance and its debiased version, a.k.a. the Sinkhorn divergence. Given two probability measures $\alpha,\beta \in \mathcal{M}_1^+(X)$, use $\Pi(\alpha,\beta)$ to denote the set of joint distributions over $X^2$ with marginals $\alpha$ and $\beta$. For $\pi \in \Pi(\alpha,\beta)$, use $\langle c,\pi\rangle$ to denote the integral $\langle c,\pi\rangle = \int_{X^2} c(x,y)\,d\pi(x,y)$, and use $\mathrm{KL}(\pi\,\|\,\alpha\otimes\beta)$ to denote the Kullback-Leibler divergence between the candidate transport plan $\pi$ and the product measure $\alpha\otimes\beta$. The entropy-regularized optimal transport distance $\mathrm{OT}_\gamma(\alpha,\beta): \mathcal{M}_1^+(X)\times\mathcal{M}_1^+(X)\to\mathbb{R}_+$ is defined as

$$\mathrm{OT}_\gamma(\alpha,\beta) = \min_{\pi\in\Pi(\alpha,\beta)} \langle c,\pi\rangle + \gamma\,\mathrm{KL}(\pi\,\|\,\alpha\otimes\beta). \qquad (2)$$

Here, $\gamma>0$ is a regularization parameter. Note that $\mathrm{OT}_\gamma(\alpha,\beta)$ is not a valid metric, as there exists $\alpha\in\mathcal{M}_1^+(X)$ such that $\mathrm{OT}_\gamma(\alpha,\alpha)\neq 0$ when $\gamma\neq 0$. To remove this bias, Peyré et al. [2019] introduced the Sinkhorn divergence $S_\gamma(\alpha,\beta): \mathcal{M}_1^+(X)\times\mathcal{M}_1^+(X)\to\mathbb{R}_+$:

$$S_\gamma(\alpha,\beta) := \mathrm{OT}_\gamma(\alpha,\beta) - \tfrac{1}{2}\mathrm{OT}_\gamma(\alpha,\alpha) - \tfrac{1}{2}\mathrm{OT}_\gamma(\beta,\beta), \qquad (3)$$

which is a debiased version of $\mathrm{OT}_\gamma(\alpha,\beta)$. It is further proved that $S_\gamma(\alpha,\beta)$ interpolates between the Wasserstein distance and the Maximum Mean Discrepancy (MMD), and that it is nonnegative, bi-convex, and metrizes the convergence in law when the ground set $X$ is compact and the metric $c$ is Lipschitz. Now, given a set of probability measures $\{\beta_i\}_{i=1}^n$, the Sinkhorn barycenter is the measure $\alpha\in\mathcal{M}_1^+(X)$ that minimizes the average of the Sinkhorn divergences,

$$\min_{\alpha\in\mathcal{M}_1^+(X)} S_\gamma(\alpha) := \frac{1}{n}\sum_{i=1}^n S_\gamma(\alpha,\beta_i). \qquad (4)$$

We will next focus on the properties of $\mathrm{OT}_\gamma$, since $S_\gamma(\alpha)$ is a linear combination of such terms.

Algorithm 1 Sinkhorn Descent (SD)
Input: measures $\{\beta_i\}_{i=1}^n$, a discrete initial measure $\alpha_0$, a step size $\eta$, and a number of iterations $S$.
Output: a measure $\alpha_S$ that approximates the Sinkhorn barycenter of $\{\beta_i\}_{i=1}^n$.
for $t = 0$ to $S-1$ do
  $\alpha_{t+1} := T[\alpha_t]_\#\alpha_t$, with $T[\alpha_t]$ defined in (11);
end for

The Dual Formulation of $\mathrm{OT}_\gamma$. As a convex program, the entropy-regularized optimal transport problem $\mathrm{OT}_\gamma$ (2) has an equivalent dual formulation, given as follows:

$$\mathrm{OT}_\gamma(\alpha,\beta) = \max_{f,g\in C(X)} \langle f,\alpha\rangle + \langle g,\beta\rangle - \gamma\,\big\langle \exp\big((f\oplus g - c)/\gamma\big) - 1,\; \alpha\otimes\beta\big\rangle, \qquad (5)$$

where we denote $[f\oplus g](x,y) = f(x) + g(y)$. The maximizers $f_{\alpha,\beta}$ and $g_{\alpha,\beta}$ of (5) are called the Sinkhorn potentials of $\mathrm{OT}_\gamma(\alpha,\beta)$. Define the Sinkhorn mapping $A: C(X)\times\mathcal{M}_1^+(X)\to C(X)$ by

$$A(f,\alpha)(y) = -\gamma\log\int_X \exp\big((f(x) - c(x,y))/\gamma\big)\,d\alpha(x). \qquad (6)$$

The following lemma states the optimality condition for the Sinkhorn potentials $f_{\alpha,\beta}$ and $g_{\alpha,\beta}$.

Lemma 2.1 (Optimality, Peyré et al. [2019]). The pair $(f,g)$ are the Sinkhorn potentials of the entropy-regularized optimal transport problem (5) if they satisfy

$$f = A(g,\beta),\ \alpha\text{-a.e.} \quad \text{and} \quad g = A(f,\alpha),\ \beta\text{-a.e.} \qquad (7)$$

The Sinkhorn potential is the cornerstone of the entropy-regularized OT problem. In the discrete case, it can be computed by a standard method from Genevay et al. [2016]. In particular, when $\alpha$ is discrete, $f$ can simply be represented by a finite-dimensional vector, since only its values on $\mathrm{supp}(\alpha)$ matter. We describe such a method in Appendix A.1 for completeness. In the following, we treat the computation of the Sinkhorn potentials as a black box and refer to it as $\mathrm{SP}_\gamma(\alpha,\beta)$.
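As a concrete reference for the $\mathrm{SP}_\gamma(\alpha,\beta)$ black box, here is a minimal log-domain sketch for discrete measures that simply iterates the fixed-point conditions of Lemma 2.1, $f = A(g,\beta)$ and $g = A(f,\alpha)$. The squared-Euclidean cost, the fixed iteration budget, and all names are illustrative assumptions; in practice one would use a convergence criterion or an off-the-shelf implementation such as the POT library mentioned in Section 5.

```python
import numpy as np
from scipy.special import logsumexp

def sq_dist(x, y):
    """Squared-Euclidean ground cost c(x_j, y_l) for point sets x: (N, d), y: (M, d)."""
    return np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)

def sinkhorn_potentials(x, a, y, b, gamma, cost=sq_dist, n_iters=500):
    """Approximate the Sinkhorn potentials (f, g) of OT_gamma(alpha, beta) for discrete
    measures alpha = sum_j a_j delta_{x_j} and beta = sum_l b_l delta_{y_l}, by iterating
    the fixed point of Lemma 2.1 in the log domain:
        f <- A(g, beta),   g <- A(f, alpha),
    where A(f, alpha)(y) = -gamma * log sum_j a_j exp((f_j - c(x_j, y)) / gamma)."""
    C = cost(x, y)                                   # (N, M) cost matrix
    f, g = np.zeros(len(x)), np.zeros(len(y))
    log_a, log_b = np.log(a), np.log(b)
    for _ in range(n_iters):                         # fixed budget; use a tolerance in practice
        f = -gamma * logsumexp((g[None, :] - C) / gamma + log_b[None, :], axis=1)
        g = -gamma * logsumexp((f[:, None] - C) / gamma + log_a[:, None], axis=0)
    return f, g

# Toy usage with uniform weights (shapes and values are placeholders):
# x, y = np.random.randn(50, 2), np.random.randn(80, 2)
# f, g = sinkhorn_potentials(x, np.full(50, 1/50), y, np.full(80, 1/80), gamma=0.1)
```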
3 Methodology

We present the Sinkhorn Descent (SD) algorithm for the Sinkhorn barycenter problem (4) in two steps: we first reformulate (4) as an unconstrained functional minimization problem, and then derive the descent direction as the negative functional gradient over an RKHS $\mathcal{H}^d$. Operating in an RKHS allows us to measure the quality of the iterates using a so-called kernelized discrepancy, which we introduce in Definition 4.1; this quantity will be crucial for our convergence analysis. The restriction of a functional optimization problem to an RKHS is common in the literature, as discussed in Remark 3.1.

Alternative Formulation. Instead of directly solving the Sinkhorn barycenter problem in the probability space $\mathcal{M}_1^+(X)$, we reformulate it as a functional minimization over all mappings on $X$:

$$\min_{P: X\to X} S_\gamma(P_\#\alpha_0) := \frac{1}{n}\sum_{i=1}^n S_\gamma(P_\#\alpha_0,\beta_i), \qquad (8)$$

where $\alpha_0\in\mathcal{M}_1^+(X)$ is some given initial measure and $P_\#\alpha$ is the push-forward measure of $\alpha\in\mathcal{M}_1^+(X)$ under the mapping $P: X\to X$. When $\alpha_0$ is sufficiently regular, e.g. absolutely continuous, for any $\alpha\in\mathcal{M}_1^+(X)$ there always exists a mapping $P$ such that $\alpha = P_\#\alpha_0$ (see Theorem 1.33 of [Ambrosio and Gigli, 2013]). Consequently, problems (8) and (4) are equivalent with appropriate initialization.

Algorithm Derivation. For a probability measure $\alpha$, define the functional $S_\alpha: \mathcal{H}^d\to\mathbb{R}$,

$$S_\alpha[\psi] = S_\gamma\big((I+\psi)_\#\alpha\big), \quad \psi\in\mathcal{H}^d. \qquad (9)$$

Here $I$ is the identity mapping and $S_\gamma$ is the objective defined in (4). Let $\alpha_t$ be the estimate of the Sinkhorn barycenter at the $t$-th iteration. Sinkhorn Descent (SD) iteratively updates the measure $\alpha_{t+1}$ as

$$\alpha_{t+1} = T[\alpha_t]_\#\alpha_t, \qquad (10)$$

via the push-forward mapping (with $\eta>0$ being a step size)

$$T[\alpha_t](x) = x - \eta\, DS_{\alpha_t}[0](x). \qquad (11)$$

Recall that $DS_\alpha[0]$ is the Fréchet derivative of $S_\alpha$ at $\psi=0$ (see Definition 1.1). Note that $(I+\psi)_\#\alpha = \alpha$ when $\psi=0$. Our choice of the negative Fréchet derivative in $T[\alpha_t]$ allows the objective $S_\gamma(\alpha)$ to have the fastest descent at the current measure $\alpha = \alpha_t$. We outline the details of SD in Algorithm 1. Consequently, a solution of (8) is found by finite-step compositions and then formally passing to the limit $P = \lim_{t\to\infty} P_t$, where $P_t := T[\alpha_t]\circ\cdots\circ T[\alpha_0]$.

Remark 3.1. We restrict $\psi$ in (9) to the space $\mathcal{H}^d$ to avoid the inherent difficulty arising when the perturbation of the Sinkhorn potentials introduced by the mapping $(I+\psi)$ can no longer be properly bounded (for $\psi\in\mathcal{H}^d$, we always have the upper bound (1), which is necessary in our convergence analysis). This restriction potentially introduces error into the minimization of (8). However, such a restriction is common practice for general functional optimization problems: both SVGD [Liu and Wang, 2016] and SoD [Mroueh et al., 2019] explicitly make such an RKHS restriction on their transport mappings, and [Arbel et al., 2019] constructs the transport mapping using the witness function of the Maximum Mean Discrepancy (MMD), which also lies in an RKHS.

In what follows, we first derive a formula for the Fréchet derivative $DS_{\alpha_t}[0]$ (see (13)) and then explain how it can be computed efficiently. The proof of the next proposition requires an additional continuity study of the Sinkhorn potentials and is deferred to Appendix C.5.

Proposition 3.1. Recall the Fréchet derivative in Definition 1.1. Given $\alpha,\beta\in\mathcal{M}_1^+(X)$, for $\psi\in\mathcal{H}^d$ denote $F_1[\psi] = \mathrm{OT}_\gamma\big((I+\psi)_\#\alpha,\beta\big)$ and $F_2[\psi] = \mathrm{OT}_\gamma\big((I+\psi)_\#\alpha,(I+\psi)_\#\alpha\big)$. Under Assumptions 4.1 and 4.2 (described below), we can compute

$$DF_1[0](y) = \int_X \nabla f_{\alpha,\beta}(x)\,k(x,y)\,d\alpha(x), \qquad DF_2[0](y) = 2\int_X \nabla f_{\alpha,\alpha}(x)\,k(x,y)\,d\alpha(x), \qquad (12)$$

where $\nabla f_{\alpha,\beta}$ and $\nabla f_{\alpha,\alpha}$ are the gradients of the Sinkhorn potentials of $\mathrm{OT}_\gamma(\alpha,\beta)$ and $\mathrm{OT}_\gamma(\alpha,\alpha)$ respectively, and $k$ is the kernel function of the RKHS $\mathcal{H}$. Consequently, the Fréchet derivative of the Sinkhorn barycenter objective (9) can be computed as

$$DS_\alpha[0](y) = \int_X \Big[\frac{1}{n}\sum_{i=1}^n \nabla f_{\alpha,\beta_i}(x) - \nabla f_{\alpha,\alpha}(x)\Big]\,k(x,y)\,d\alpha(x). \qquad (13)$$

This quantity can be computed efficiently when $\alpha$ is discrete. Consider an individual term $\nabla f_{\alpha,\beta}$ and define

$$h(x,y) := \exp\Big(\tfrac{1}{\gamma}\big(f_{\alpha,\beta}(x) + A[f_{\alpha,\beta},\alpha](y) - c(x,y)\big)\Big).$$

Lemma 2.1 implies $\int_X h(x,y)\,d\beta(y) = 1$. Taking the derivative with respect to $x$ on both sides and rearranging terms, we have

$$\nabla f_{\alpha,\beta}(x) = \frac{\int_X h(x,y)\,\nabla_1 c(x,y)\,d\beta(y)}{\int_X h(x,y)\,d\beta(y)} = \int_X h(x,y)\,\nabla_1 c(x,y)\,d\beta(y), \qquad (14)$$

which is itself an expectation. Note that to evaluate (13), we only need $\nabla f_{\alpha,\beta}(x)$ on $\mathrm{supp}(\alpha)$. Using $\mathrm{SP}_\gamma(\alpha,\beta)$ (see the end of Section 2), the values of $f_{\alpha,\beta}$ on $\mathrm{supp}(\alpha)$ can be computed efficiently. Together with the expression in (14), the gradients $\nabla f_{\alpha,\beta}(x)$ at $x\in\mathrm{supp}(\alpha)$ can then be obtained by a simple Monte-Carlo integration with respect to $\beta$.
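To make the discrete-case computation concrete, the following is a minimal sketch of one SD update, combining the Monte-Carlo gradient formula (14), the Fréchet derivative (13) evaluated at the particles, and the push-forward step (10)-(11). It assumes uniform weights, a Gaussian kernel for $\mathcal{H}$, and the squared-Euclidean ground cost, and it reuses the sinkhorn_potentials and sq_dist helpers sketched at the end of Section 2; none of these specific choices or names come from the paper.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Kernel matrix k(x_j, y_l) for the RKHS H (Gaussian kernel is an assumed choice)."""
    sq = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

def grad_potential(x, y, b, f, g, gamma):
    """Gradient of the Sinkhorn potential f_{alpha,beta} at the alpha-particles x via (14):
        grad f(x_j) = sum_l h_{jl} b_l grad_1 c(x_j, y_l) / sum_l h_{jl} b_l,
    with h_{jl} = exp((f_j + g_l - c(x_j, y_l)) / gamma) and c the squared-Euclidean cost,
    so grad_1 c(x, y) = 2 (x - y)."""
    C = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    log_w = (f[:, None] + g[None, :] - C) / gamma + np.log(b)[None, :]
    w = np.exp(log_w - log_w.max(axis=1, keepdims=True))     # numerically stabilized h * b
    w /= w.sum(axis=1, keepdims=True)                        # the normalization in (14)
    grad_c = 2.0 * (x[:, None, :] - y[None, :, :])
    return np.einsum('jl,jld->jd', w, grad_c)

def sd_step(x, betas, gamma, eta, sigma=1.0):
    """One Sinkhorn Descent update (10)-(11) for a uniform discrete iterate
    alpha_t = (1/N) sum_j delta_{x_j}; betas is a list of discrete sources (y_i, b_i).
    Reuses sinkhorn_potentials and sq_dist from the sketch in Section 2."""
    N = len(x)
    a = np.full(N, 1.0 / N)
    # xi(x_j) = (1/n) sum_i grad f_{alpha,beta_i}(x_j) - grad f_{alpha,alpha}(x_j)
    xi = np.zeros_like(x)
    for y, b in betas:
        f, g = sinkhorn_potentials(x, a, y, b, gamma)
        xi += grad_potential(x, y, b, f, g, gamma) / len(betas)
    f_aa, g_aa = sinkhorn_potentials(x, a, x, a, gamma)
    xi -= grad_potential(x, x, a, f_aa, g_aa, gamma)
    # Frechet derivative (13) at the particles: DS[0](x_j) = (1/N) sum_m xi(x_m) k(x_m, x_j)
    descent = gaussian_kernel(x, x, sigma) @ xi / N
    return x - eta * descent                                  # the push-forward map (11)
```

Iterating sd_step for $S$ rounds reproduces the loop of Algorithm 1 at the level of particle positions.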
In this section, we analyze the finite-time convergence and the mean-field limit of SD under the following assumptions on the ground cost function $c$ and the kernel function $k$ of the RKHS $\mathcal{H}$.

Assumption 4.1. The ground cost function $c(x,y)$ is bounded, i.e. $\forall x,y\in X$, $c(x,y)\le M_c$; $G_c$-Lipschitz continuous, i.e. $\forall x,x',y\in X$, $|c(x,y)-c(x',y)|\le G_c\|x-x'\|$; and $L_c$-Lipschitz smooth, i.e. $\forall x,x',y\in X$, $\|\nabla_1 c(x,y)-\nabla_1 c(x',y)\|\le L_c\|x-x'\|$.

Assumption 4.2. The kernel function $k(x,y)$ is bounded, i.e. $\forall x,y\in X$, $k(x,y)\le D_k$; and $G_k$-Lipschitz continuous, i.e. $\forall x,x',y\in X$, $|k(x,y)-k(x',y)|\le G_k\|x-x'\|$.

4.1 Finite Time Convergence Analysis

In this section, we prove that Sinkhorn Descent converges to a stationary point of problem (4) at the rate of $O(1/t)$, where $t$ is the number of iterations. We first introduce a discrepancy quantity.

Definition 4.1. Recall the definition of the functional $S_\alpha$ in (9) and the definition of the Fréchet derivative in Definition 1.1. Given a probability measure $\alpha\in\mathcal{M}_1^+(X)$, the Kernelized Sinkhorn Barycenter Discrepancy (KSBD) for the Sinkhorn barycenter problem is defined as

$$S(\alpha,\{\beta_i\}_{i=1}^n) := \big\| DS_\alpha[0] \big\|_{\mathcal{H}^d}^2. \qquad (15)$$

Note that in each round $t$, $S(\alpha_t,\{\beta_i\}_{i=1}^n)$ measures the stationarity of SD and can be used to quantify the per-iteration improvement.

Lemma 4.1 (Sufficient Descent). Recall the definition of the Sinkhorn barycenter problem in (4) and the sequence of measures $\{\alpha_t\}_{t\ge 0}$ in (10) generated by SD (Algorithm 1). Under Assumption 4.1, if $\eta \le \min\{1/(8L_f M_{\mathcal{H}}^2),\, 1/(8 d L_T M_{\mathcal{H}}^2)\}$, the Sinkhorn objective always decreases:

$$S_\gamma(\alpha_{t+1}) \le S_\gamma(\alpha_t) - \frac{\eta}{2}\, S(\alpha_t,\{\beta_i\}_{i=1}^n). \qquad (16)$$

Here $M_{\mathcal{H}}$ is defined in (1), $L_f := 4G_c^2/\gamma + L_c$, and $L_T := 2G_c^2\exp(3M_c/\gamma)/\gamma$.¹

¹ We acknowledge that the factor $\exp(1/\gamma)$ is non-ideal, but such a quantity constantly appears in the literature related to the Sinkhorn divergence, e.g. Theorem 5 in [Luise et al., 2019] and Theorem 3 in [Genevay et al., 2019a]. It would be interesting future work to remove this factor.

The proof of the lemma is given in Appendix C.7. Based on this result, we can derive the following convergence result, demonstrating that SD converges to a stationary point at a sublinear rate.

Theorem 4.1 (Convergence). Suppose SD is initialized with $\alpha_0\in\mathcal{M}_1^+(X)$ and outputs $\alpha_t\in\mathcal{M}_1^+(X)$ after $t$ iterations. Under Assumption 4.1, we have

$$\min_{0\le s< t}\; S(\alpha_s,\{\beta_i\}_{i=1}^n) \le \frac{2\, S_\gamma(\alpha_0)}{\eta t}, \qquad (17)$$

where $0<\eta\le\min\{1/(8L_f M_{\mathcal{H}}^2),\, 1/(8 d L_T M_{\mathcal{H}}^2)\}$ is the step size. With a slight change to SD, we can also establish last-iterate convergence, as elaborated in Appendix B.3.
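Since Theorem 4.1 is stated in terms of KSBD, this quantity is also a natural convergence monitor in practice. For a discrete iterate $\alpha_t = \frac{1}{N}\sum_j \delta_{x_j}$, expanding (15) with (13) and the reproducing property gives $S(\alpha_t,\{\beta_i\}_{i=1}^n) = \frac{1}{N^2}\sum_{j,l}\xi(x_j)^\top k(x_j,x_l)\,\xi(x_l)$, where $\xi$ denotes the bracketed integrand of (13). The following minimal sketch computes this plug-in estimate; it reuses the (assumed) Gaussian kernel from the SD sketch above, and all names are illustrative.

```python
import numpy as np

def ksbd_estimate(x, xi, sigma=1.0):
    """Plug-in estimate of KSBD (15) for a uniform discrete measure supported on x,
    given the particle values xi[j] of the bracketed integrand of (13):
        S(alpha, {beta_i}) = (1/N^2) sum_{j,l} xi(x_j)^T k(x_j, x_l) xi(x_l).
    gaussian_kernel is the (assumed) Gaussian kernel from the SD sketch above."""
    K = gaussian_kernel(x, x, sigma)
    return np.einsum('jl,jd,ld->', K, xi, xi) / len(x) ** 2
```

One might, for instance, stop SD once this estimate falls below a tolerance, mirroring the min-over-iterates guarantee in (17).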
Remark 4.1 (Exponential dependence on $1/\gamma$). It is difficult to remove the exponential dependence on $1/\gamma$ in the above convergence result. Specifically, the term $\exp(1/\gamma)$ appears in bounding the smoothness of the Sinkhorn potential, which is a key factor in bounding the sample complexity of the Sinkhorn divergence (see Theorem 2 and Lemma 3 of [Genevay et al., 2019a]). Consequently, the sample complexity in [Genevay et al., 2019a] could be improved if one managed to remove this factor. However, this would potentially violate the lower bound on the sample complexity of the hard-to-compute Wasserstein distance, which is the limit of the Sinkhorn divergence as $\gamma\to 0$. Surprisingly, the empirical performance of SD does not suffer much from this factor: in our experiments (Section 5), to produce good visual results we pick $\gamma = 10^{-4}$, and we still observe that SD converges quickly (even in the high-dimensional Gaussian barycenter task).

Remark 4.2 (Implicit exponential dependence on the problem dimension). As shown in Lemma 4.1 and Theorem 4.1, our results depend on $\exp(M_c/\gamma)$, where $M_c$ is the upper bound on the ground cost on the domain $X$, which contains an implicit dependence on the problem dimension.

4.2 Mean Field Limit Analysis

While Sinkhorn Descent accepts both discrete and continuous measures as initialization, in practice we start from a discrete initial measure $\alpha_N^0$ with $|\mathrm{supp}(\alpha_N^0)| = N$. If $\alpha_N^0$ is an empirical measure sampled from an underlying measure $\alpha_\infty^0$, we have weak convergence at time $t=0$, i.e. $\alpha_N^0 \rightharpoonup \alpha_\infty^0$ as $N\to\infty$. The mean-field limit analysis demonstrates that Sinkhorn Descent preserves such weak convergence for any finite time $t$:

$$\alpha_N^0 \rightharpoonup \alpha_\infty^0 \;\Longrightarrow\; \alpha_N^t = \mathrm{SD}_t(\alpha_N^0) \rightharpoonup \alpha_\infty^t = \mathrm{SD}_t(\alpha_\infty^0),$$

where we use $\mathrm{SD}_t$ to denote the output of SD after $t$ steps and $\rightharpoonup$ to denote weak convergence.

Lemma 4.2. Recall the push-forward mapping $T[\alpha](x)$ of SD from (11) and recall $L_f$ from Lemma 4.1. Under Assumptions 4.1 and 4.2, for two probability measures $\alpha$ and $\alpha'$, we have

$$d_{bl}\big(T[\alpha]_\#\alpha,\; T[\alpha']_\#\alpha'\big) \le (1+\eta C)\, d_{bl}(\alpha,\alpha'), \qquad (18)$$

where $d_{bl}$ denotes the bounded-Lipschitz distance, $C = G_c G_k + \max\{d L_f D_k + d G_c G_k,\; D_k L_{bl}\}$, and $L_{bl} := 8G_c^2\exp(6M_c/\gamma)$.

The proof is presented in Appendix C.6. This is a discrete version of Dobrushin's estimate (Section 1.4 in [Golse, 2016]). As a result, we directly obtain the following large-$N$ characterization of $\mathrm{SD}_t(\alpha_N^0)$.

Theorem 4.2 (Mean Field Limit). Let $\alpha_N^0$ be an empirical initial measure with $|\mathrm{supp}(\alpha_N^0)| = N$ and let $\alpha_\infty^0$ be the underlying measure such that $\alpha_N^0 \rightharpoonup \alpha_\infty^0$. Use $\mathrm{SD}_t(\alpha_N^0)$ and $\mathrm{SD}_t(\alpha_\infty^0)$ to denote the outputs of SD after $t$ iterations under the initializations $\alpha_N^0$ and $\alpha_\infty^0$, respectively. Under Assumptions 4.1 and 4.2, for any finite time $t$ we have

$$d_{bl}\big(\mathrm{SD}_t(\alpha_N^0),\, \mathrm{SD}_t(\alpha_\infty^0)\big) \le (1+\eta C)^t\, d_{bl}(\alpha_N^0,\alpha_\infty^0), \qquad (19)$$

and hence, as $N\to\infty$, we have $\alpha_N^t = \mathrm{SD}_t(\alpha_N^0) \rightharpoonup \alpha_\infty^t = \mathrm{SD}_t(\alpha_\infty^0)$.

4.3 KSBD as Discrepancy Measure

In this section, we show that, under additional assumptions, KSBD is a valid discrepancy measure, i.e. the vanishing of KSBD implies that $\alpha$ is a globally optimal solution to the Sinkhorn barycenter problem (4). The proof is provided in Appendix D. First, we introduce the following positivity condition.

Definition 4.2. A kernel $k(x,x')$ is said to be integrally strictly positive definite (ISPD) w.r.t. a measure $\alpha\in\mathcal{M}_1^+(X)$ if, for every $\xi: X\to\mathbb{R}^d$ with $0 < \int_X \|\xi(x)\|^2\,d\alpha(x) < \infty$, it holds that

$$\int_{X^2} \xi(x)^\top k(x,x')\,\xi(x')\,d\alpha(x)\,d\alpha(x') > 0. \qquad (20)$$

Theorem 4.3. Recall the Fréchet derivative of the Sinkhorn barycenter problem in (13) and KSBD in (15).
Denote $\xi(x) := \frac{1}{n}\sum_{i=1}^n \nabla f_{\alpha,\beta_i}(x) - \nabla f_{\alpha,\alpha}(x)$. We have $\int_X \|\xi(x)\|^2\,d\alpha(x) < \infty$. (i) If the kernel function $k(x,x')$ is ISPD w.r.t. $\alpha\in\mathcal{M}_1^+(X)$ and $\alpha$ is fully supported on $X$, then the vanishing of KSBD, i.e. $S(\alpha,\{\beta_i\}_{i=1}^n) = 0$, implies that $\alpha$ globally minimizes problem (4). (ii) Use $\alpha_t$ to denote the output of SD after $t$ iterations. If, further, one of the accumulation points of the sequence $\{\alpha_t\}$ is fully supported on $X$, then $\lim_{t\to\infty} S_\gamma(\alpha_t) = S_\gamma(\alpha^\star)$, where $\alpha^\star$ denotes a global minimizer of (4).

We show in Appendix D.2 that, under an absolutely continuous (a.c.) and fully supported (f.s.) initialization, $\alpha_t$ remains a.c. and f.s. for any finite $t$. This leads to our assumption in (ii): one of the accumulation points of $\{\alpha_t\}$ is f.s. However, rigorously analyzing the support of $\alpha_t$ in the asymptotic case ($t\to\infty$) requires a separate proof. Establishing the global convergence of functional gradient descent is known to be difficult in the literature, even for settings much easier than our problem (4). For instance, [Mroueh et al., 2019, Arbel et al., 2019] prove the global convergence of their MMD descent algorithms, and both works require additional assumptions on the entire measure sequence $\{\alpha_t\}$, as detailed in Appendix D.3. See also the convergence analysis of SVGD in [Lu et al., 2019] under very strong assumptions on the score functions.

Remark 4.3 (Initialization with finite particles). In practice, we start with a sufficient number of particles. In this case, we cannot prove that SD converges globally, since (ii) of Theorem 4.3 does not hold. However, we observe that increasing the number of particles reduces the Sinkhorn divergence at convergence, albeit with diminishing returns.

5 Experiments

We conduct experimental studies to show the efficiency and efficacy of Sinkhorn Descent by comparing it with the recently proposed functional Frank-Wolfe method (FW) from [Luise et al., 2019]². Note that in round $t$, FW requires globally minimizing the nonconvex function

$$Q(x) := \frac{1}{n}\sum_{i=1}^n f_{\alpha_t,\beta_i}(x) - f_{\alpha_t,\alpha_t}(x) \qquad (21)$$

in order to choose the next Dirac measure to be added to the support. Here, $f_{\alpha_t,\beta_i}$ and $f_{\alpha_t,\alpha_t}$ are the Sinkhorn potentials. This operation is implemented by an exhaustive grid search so that FW returns a reasonably accurate solution. Consequently, FW is computationally expensive even for low-dimensional problems, and we only compare SD with FW in the first two image experiments, where $d=2$ (the grid size used in FW grows exponentially with $d$). Importantly, the size of the support $N$ affects the computational efficiency as well as the solution quality of both methods.

² [Claici et al., 2018] is not included as it only applies to the Wasserstein barycenter problem ($\gamma = 0$).

Figure 1: Sinkhorn barycenter loss versus the number of iterations for (a) Concentric Ellipses, (b) Distribution Sketching (log-scale loss), and (c) Gaussians; $N$ is the support size. FW is not included in (c) as it is impractical in high-dimensional problems (here, the dimension is 100).

Figure 2: Visual results of the ellipses and sketching problems. (a1) SD on ellipses and (b1) SD on sketching, shown from left to right using 1 to 9 SD steps and 1 to 201 SD steps, respectively; (a2) FW on ellipses and (b2) FW on sketching, shown from left to right using 411 to 491 FW steps and 9901 to 19901 FW steps, respectively.
A large support size usually means higher computational complexity but allows a more accurate approximation of the barycenter. However, since SD and FW have different support-size patterns, it is hard to compare them directly: the support size of SD is fixed after initialization, while FW starts from a small initial support and gradually increases it during the optimization procedure. We hence fix the support size of the output measure of FW and vary the support size of SD for a more comprehensive comparison. In the following, the entropy regularization parameter $\gamma$ is set to $10^{-4}$ in all tasks to produce results of good visual quality.

Barycenter of Concentric Ellipses. We compute the barycenter of 30 randomly generated concentric ellipses, similarly to [Cuturi and Doucet, 2014, Luise et al., 2019]. We run FW for 500 iterations, and hence the output measure of FW has support size $N=600$ (FW increases its support size by 1 in each iteration). SD is initialized with a discrete uniform distribution with support size varying over $N\in\{20,40,80\}$. Note that in these experiments the chosen support size for SD is even smaller than the initial support size of FW. The result is reported in Figure 1(a). In terms of convergence rate, we observe that SD is much faster than FW; even 20 iterations are sufficient for SD to find a good solution. More importantly, in terms of solution quality, SD with support size $N=20$ outperforms FW with final support size $N=600$. In fact, FW cannot find a solution of better quality even with a larger support size. This phenomenon is due to an inevitable limitation of the FW optimization procedure: each FW step requires globally minimizing the nonconvex function (21) via an exhaustive grid search. This introduces an inherent error into the procedure, as the actual minimizer of (21) may reside outside the grid points. Such error limits the accuracy of FW even when the number of particles grows. In contrast, SD adjusts the particles to minimize the objective without any inherent error. As a result, we observe that SD outperforms FW in both efficiency and accuracy.

Distribution Sketching. We consider a special case of the barycenter problem where we only have one source distribution, similarly to [Luise et al., 2019]. This problem can be viewed as approximating a given distribution with a fixed support-size budget and is hence called distribution sketching. Specifically, a natural image of a cheetah is used as the source measure in $\mathbb{R}^2$. We run FW for 20000 iterations, and the support size of SD is $N\in\{2000,4000\}$. The result is reported in Figure 1(b). Since we only have one source measure, the Sinkhorn barycenter loss is very small and hence we use a log scale on the y-axis. We observe that SD outperforms FW in terms of both the quality of the solution and the convergence rate.

Barycenter of Gaussians. To demonstrate the efficiency of SD on high-dimensional problems, we consider the problem of finding the barycenter of multivariate Gaussian distributions. Concretely, we pick 5 isotropic Gaussians in $\mathbb{R}^{100}$ with different means. For each of them, we sample an empirical measure with 50000 points and use the obtained empirical measures as source measures. We initialize SD with an empirical measure sampled from the uniform distribution with support size $N=5000$.
We did not compare with FW, as the global minimizer of $Q(x)$ in (21) cannot be computed in $\mathbb{R}^{100}$. The result is reported in Figure 1(c). We can see that, just as in the previous two experiments, SD converges in fewer than 20 iterations.

Visual Results on Ellipses and Sketching. To compare SD with FW visually, we allow SD and FW to have a similar number of particles in the ellipses and sketching tasks, and report the results in Figure 2. Specifically, in (a1) SD has 500 particles while in (a2) FW has 511 to 591 particles (recall that the support size of FW grows over iterations); in (b1) SD has 8000 particles while in (b2) FW has 10001 to 20001 particles. In all cases FW has at least as many particles as SD while taking significantly more steps. However, the visual results produced by SD are clearly better than those of FW: in (a1), the circle is very clear in the last picture, while in (a2) all pictures remain blurred; in (b1), the eyes of the cheetah are clear, but in (b2) the eyes remain blurry.

Additional Visual Results on MNIST. We provide additional results on the MNIST dataset. In this problem, we view the (low-resolution) images of each individual digit (0-9) as one of the target distributions, and our goal is to compute their (high-resolution) Sinkhorn barycenter (we randomly select 100 images of each digit in MNIST). We compare SD with the free-support method in [Cuturi and Doucet, 2014] (we directly use the implementation from the Python OT library, https://pythonot.github.io/index.html). Both methods use 2500 particles and are run for 100 iterations. In the 2D-histogram images of Figure 3, brighter pixels mean more particles in a local region. We can observe that the particles of SD are more concentrated on the digits compared to those of [Cuturi and Doucet, 2014]. We only report the visual results but do not compare the exact barycenter loss, since [Claici et al., 2018] considers the exact Wasserstein barycenter problem with no entropy regularization. We do not compare with [Chewi et al., 2020] since it only applies to the barycenter problem of Gaussian distributions.

Figure 3: Visual results on MNIST. The first row contains results from SD and the second row contains results from the free-support method in [Cuturi and Doucet, 2014].

6 Broader Impact

This work has the following potential positive impact on society: we propose the first algorithm for the Sinkhorn barycenter problem that is scalable with respect to the problem dimension $d$ (linear dependence), while existing works all have an exponential dependence on $d$. Further, we expect that this functional gradient descent method can be applied to more general optimization problems involving distribution sampling: in principle, the negative gradient of the dual variables instructs the particles in the measure to search the landscape of the minimizer.

Acknowledgment

This work is supported by NSF CPS-1837253.

References

L. Ambrosio and N. Gigli. A user's guide to optimal transport. pages 1-155, 2013.

M. Arbel, A. Korba, A. Salim, and A. Gretton. Maximum mean discrepancy gradient flow. In Advances in Neural Information Processing Systems, pages 6481-6491, 2019.

S. Chewi, T. Maunu, P. Rigollet, and A. J. Stromme. Gradient descent algorithms for Bures-Wasserstein barycenters. arXiv preprint arXiv:2001.01700, 2020.

S. Claici, E. Chien, and J. Solomon. Stochastic Wasserstein barycenters. In International Conference on Machine Learning, pages 999-1008, 2018.

M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292-2300, 2013.
M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, pages 685-693, 2014.

P. Dvurechenskii, D. Dvinskikh, A. Gasnikov, C. Uribe, and A. Nedich. Decentralize and randomize: Faster algorithm for Wasserstein barycenters. In Advances in Neural Information Processing Systems, pages 10760-10770, 2018.

J. Feydy, T. Séjourné, F.-X. Vialard, S.-i. Amari, A. Trouvé, and G. Peyré. Interpolating between optimal transport and MMD using Sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2681-2690, 2019.

A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems, pages 3440-3448, 2016.

A. Genevay, L. Chizat, F. Bach, M. Cuturi, and G. Peyré. Sample complexity of Sinkhorn divergences. In Proc. AISTATS'19, 2019a.

A. Genevay, L. Chizat, F. Bach, M. Cuturi, and G. Peyré. Sample complexity of Sinkhorn divergences. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of Machine Learning Research, volume 89, pages 1574-1583. PMLR, 16-18 Apr 2019b.

F. Golse. On the dynamics of large particle systems in the mean field limit. In Macroscopic and Large Scale Phenomena: Coarse Graining, Mean Field Limits and Ergodicity, pages 1-144. Springer, 2016.

A. Kroshnin, N. Tupitsa, D. Dvinskikh, P. Dvurechensky, A. Gasnikov, and C. Uribe. On the complexity of approximating Wasserstein barycenters. In International Conference on Machine Learning, pages 3530-3540, 2019.

B. Lemmens and R. Nussbaum. Nonlinear Perron-Frobenius Theory, volume 189. Cambridge University Press, 2012.

Q. Liu. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pages 3115-3123, 2017.

Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2378-2386, 2016.

J. Lu, Y. Lu, and J. Nolen. Scaling limit of the Stein variational gradient descent: The mean field regime. SIAM Journal on Mathematical Analysis, 51(2):648-671, 2019.

G. Luise, S. Salzo, M. Pontil, and C. Ciliberto. Sinkhorn barycenters with free support via Frank-Wolfe algorithm. In Advances in Neural Information Processing Systems 32, 2019.

Y. Mroueh, T. Sercu, and A. Raj. Sobolev descent. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2976-2985, 2019.

G. Peyré, M. Cuturi, et al. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355-607, 2019.

J. Rabin, G. Peyré, J. Delon, and M. Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435-446. Springer, 2011.

J. Solomon, F. De Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4):1-11, 2015.

S. Srivastava, V. Cevher, Q. Dinh, and D. Dunson. WASP: Scalable Bayes via barycenters of subset posteriors. In Artificial Intelligence and Statistics, pages 912-920, 2015.

M. Staib, S. Claici, J. M. Solomon, and S. Jegelka. Parallel streaming Wasserstein barycenters. In Advances in Neural Information Processing Systems, pages 2647-2658, 2017.

A. W. Van Der Vaart and J. A. Wellner. Weak convergence. In Weak Convergence and Empirical Processes, pages 16-28. Springer, 1996.
J. Ye, P. Wu, J. Z. Wang, and J. Li. Fast discrete distribution clustering using Wasserstein barycenter with sparse support. IEEE Transactions on Signal Processing, 65(9):2317-2332, 2017.