# Sinkhorn Barycenter via Functional Gradient Descent

Zebang Shen, Zhenfu Wang, Alejandro Ribeiro, Hamed Hassani
Department of Electrical and Systems Engineering and Department of Mathematics, University of Pennsylvania
{zebang@seas, zwang423@math, aribeiro@seas, hassani@seas}.upenn.edu

Abstract

In this paper, we consider the problem of computing the barycenter of a set of probability distributions under the Sinkhorn divergence. This problem has recently found applications across various domains, including graphics, learning, and vision, as it provides a meaningful mechanism to aggregate knowledge. Unlike previous approaches that operate directly in the space of probability measures, we recast the Sinkhorn barycenter problem as an instance of unconstrained functional optimization and develop a novel functional gradient descent method named Sinkhorn Descent (SD). We prove that SD converges to a stationary point at a sublinear rate and, under reasonable assumptions, we further show that it asymptotically finds a global minimizer of the Sinkhorn barycenter problem. Moreover, by providing a mean-field analysis, we show that SD preserves the weak convergence of empirical measures. Importantly, the computational complexity of SD scales linearly in the dimension $d$, and we demonstrate its scalability by solving a 100-dimensional Sinkhorn barycenter problem.

1 Introduction

Computing a nonlinear interpolation between a set of probability measures is a foundational task across many disciplines. This problem is typically referred to as the barycenter problem and, as it provides a meaningful metric to aggregate knowledge, it has found numerous applications. Examples include distribution clustering [Ye et al., 2017], Bayesian inference [Srivastava et al., 2015], texture mixing [Rabin et al., 2011], and graphics [Solomon et al., 2015]. The barycenter problem can be naturally cast as the minimization of the average distance between the target measure (the barycenter) and the source measures, and the choice of the distance metric can significantly impact the quality of the barycenter [Feydy et al., 2019]. In this regard, the Optimal Transport (OT) distance (a.k.a. the Wasserstein distance) and its entropy-regularized variant (a.k.a. the Sinkhorn divergence) are the most suitable geometrically faithful metrics, with the latter being more computationally friendly. In this paper, we provide efficient and provable methods for the Sinkhorn barycenter problem.

The prior work in this domain has mainly focused on finding the barycenter by optimizing directly in the space of (discrete) probability measures. We can divide these previous methods into three broad classes depending on how the support of the barycenter is determined: (i) The first class assumes a fixed and prespecified support set for the barycenter and only optimizes the corresponding weights [Staib et al., 2017, Dvurechenskii et al., 2018, Kroshnin et al., 2019]. Accordingly, the problem reduces to minimizing a convex objective subject to a simplex constraint. However, fixing the support without any prior knowledge creates undesired bias and affects the quality of the final solution. While increasing the support size (possibly exponentially in the dimension $d$) can help to mitigate the bias, it renders the procedure computationally prohibitive as $d$ grows.
(ii) To reduce the bias, the second class considers optimizing the support and the weights through an alternating procedure [Cuturi and Doucet, 2014, Claici et al., 2018]. Since the barycenter objective is not jointly convex with respect to the support and the weights, these methods in general only converge to a stationary point, which can be far from the true minimizers. (iii) Unlike the aforementioned classes, Luise et al. [2019] recently proposed a conditional gradient method with a growing support set. This method enjoys sublinear convergence to the global optimum under the premise that a $d$-dimensional nonconvex subproblem can be globally minimized per iteration. However, nonconvex optimization is generally intractable in high-dimensional problems (large $d$) and only stationary points can be efficiently reached. Hence, the guarantee of [Luise et al., 2019] has limited applicability as the dimension grows.

In this paper, we provide a new perspective on the Sinkhorn barycenter problem: instead of operating in the space of probability measures, we view the barycenter as the push-forward measure of a given initial measure under an unknown mapping. We thus recast the barycenter problem as an unconstrained functional optimization over the space of mappings. Equipped with this perspective, we make the following contributions:

- We develop a novel functional gradient descent method, called Sinkhorn Descent (SD), which operates by finding the push-forward mapping in a Reproducing Kernel Hilbert Space that allows the fastest descent, and consequently solves the Sinkhorn barycenter problem iteratively. We then define the Kernelized Sinkhorn Barycenter Discrepancy (KSBD) to characterize the non-asymptotic convergence of SD. In particular, we prove that KSBD vanishes under the SD iterates at the rate of $O(1/t)$, where $t$ is the iteration number.
- We prove that SD preserves the weak convergence of empirical measures. Concretely, using $\mathrm{SD}_t(\cdot)$ to denote the output of SD after $t$ iterations and letting $\alpha_N$ be an empirical measure of $\alpha$ with $N$ samples, we have $\lim_{N\to\infty} \mathrm{SD}_t(\alpha_N) = \mathrm{SD}_t(\alpha)$. Such asymptotic analysis allows us to jointly study the behavior of SD under either discrete or continuous initialization.
- Under a mild assumption, we prove that KSBD is a valid discrepancy to characterize the optimality of the solution, i.e. the vanishing of KSBD implies that the output measure of SD converges to the set of globally optimal solutions of the Sinkhorn barycenter problem.
- Further, we show the efficiency and efficacy of SD by comparing it with the prior art on several problems. We note that the computational complexity of SD depends linearly on the dimension $d$. We hence validate the scalability of SD by solving a 100-dimensional barycenter problem, which cannot be handled by previous methods due to their exponential dependence on the problem dimension.

Notations. Let $X \subset \mathbb{R}^d$ be a compact ground set, endowed with a symmetric ground metric $c: X \times X \to \mathbb{R}_+$. Without loss of generality, we assume $c(x,y) = \infty$ if $x \notin X$ or $y \notin X$. We use $\nabla_1 c(\cdot,\cdot): X^2 \to \mathbb{R}^d$ to denote its gradient w.r.t. its first argument. Let $\mathcal{M}_1^+(X)$ and $C(X)$ be the spaces of probability measures and continuous functions on $X$, respectively. We denote the support of a probability measure $\alpha \in \mathcal{M}_1^+(X)$ by $\mathrm{supp}(\alpha)$, and we use "$\alpha$-a.e." to denote "almost everywhere w.r.t. $\alpha$". For a vector $a \in \mathbb{R}^d$, we denote its $\ell_2$ norm by $\|a\|$. For a function $f: X \to \mathbb{R}$, we denote its $L^\infty$ norm by $\|f\|_\infty := \max_{x \in X} |f(x)|$ and denote its gradient by $\nabla f(\cdot): X \to \mathbb{R}^d$.
For a vector function $f: X \to \mathbb{R}^d$, we denote its $(2,\infty)$ norm by $\|f\|_{2,\infty} := \max_{x \in X} \|f(x)\|$. For an integer $n$, denote $[n] := \{1,\dots,n\}$. Given a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ with a kernel function $k: X \times X \to \mathbb{R}_+$, we say a vector function $\psi = [[\psi]_1,\dots,[\psi]_d]$ belongs to $\mathcal{H}^d$ if each component $[\psi]_i$ is in $\mathcal{H}$. The space $\mathcal{H}$ has a natural inner product structure and an induced norm, and so does $\mathcal{H}^d$, i.e. $\langle f,g\rangle_{\mathcal{H}^d} = \sum_{i=1}^d \langle [f]_i,[g]_i\rangle_{\mathcal{H}}$ for $f,g \in \mathcal{H}^d$, and $\|f\|_{\mathcal{H}^d}^2 = \langle f,f\rangle_{\mathcal{H}^d}$. The reproducing property of the RKHS $\mathcal{H}$ reads that, given $f \in \mathcal{H}^d$, one has $[f]_i(x) = \langle [f]_i, k_x\rangle_{\mathcal{H}}$ with $k_x(y) = k(x,y)$, which by the Cauchy-Schwarz inequality implies that there exists a constant $M_{\mathcal{H}} > 0$ such that

$$\|f\|_{2,\infty} \le M_{\mathcal{H}} \|f\|_{\mathcal{H}^d}, \quad \forall f \in \mathcal{H}^d. \qquad (1)$$

Additionally, for a functional $F: \mathcal{H}^d \to \mathbb{R}$, the Fréchet derivative of $F$ is defined as follows.

Definition 1.1 (Fréchet derivative in RKHS). For a functional $F: \mathcal{H}^d \to \mathbb{R}$, its Fréchet derivative $DF[\psi]$ at $\psi \in \mathcal{H}^d$ is a function in $\mathcal{H}^d$ satisfying the following: for any $\xi \in \mathcal{H}^d$ with $\|\xi\|_{\mathcal{H}^d} < \infty$,

$$\lim_{\epsilon \to 0} \frac{F[\psi + \epsilon\xi] - F[\psi]}{\epsilon} = \langle DF[\psi], \xi \rangle_{\mathcal{H}^d}.$$

Note that the Fréchet derivative at $\psi$, i.e. $DF[\psi]$, is a bounded linear operator from $\mathcal{H}^d$ to $\mathbb{R}$. It can be written in the form $DF[\psi](\xi) = \langle DF[\psi], \xi\rangle_{\mathcal{H}^d}$ due to the Riesz-Fréchet representation theorem.

1.1 Related Work on Functional Gradient Descent

A related functional gradient descent type method is the Stein Variational Gradient Descent (SVGD) method by Liu and Wang [2016]. SVGD considers the problem of minimizing the Kullback-Leibler (KL) divergence between a variable distribution and a posterior $p$. Note that SVGD updates the positions of a set of $N$ particles using the score function of the posterior $p$, i.e. $\nabla \log p$. Consequently, it requires access to the target distribution function. Later, Liu [2017] proves that SVGD has a convergence guarantee in its continuous-time limit (taking an infinitesimal step size) using an infinite number of particles ($N \to \infty$). In comparison, SD is designed to solve the significantly more complicated Sinkhorn barycenter problem and has a stronger convergence guarantee. More precisely, while SD updates the measure using only a sampling machinery for the target measures (no score functions), it is guaranteed to converge sub-linearly to a stationary point when $\alpha$ is a discrete measure, using discrete time steps. This is in sharp contrast to the results for SVGD.

In another work, Mroueh et al. [2019] consider minimizing the Maximum Mean Discrepancy (MMD) between a source measure and a variable measure. They solve this problem by incrementally following a Sobolev critic function and propose the Sobolev Descent (SoD) method. To show the global convergence of the measure sequence generated by SoD, Mroueh et al. [2019] assume that the entire sequence satisfies certain spectral properties, which is in general difficult to verify. Later, Arbel et al. [2019] consider the same MMD minimization problem from a gradient flow perspective. They propose two assumptions such that, if either one holds, the MMD gradient flow converges to the global solution. However, similarly to [Mroueh et al., 2019], these assumptions have to be satisfied by the entire measure sequence. We note that the Sinkhorn barycenter problem is a strict generalization of the above MMD minimization problem and is hence much more challenging: by setting the number of source measures $n = 1$ and the entropy regularization parameter $\gamma = \infty$, problem (4) degenerates to the special case of MMD. Further, the MMD between two probability measures has a closed-form expression, while the Sinkhorn divergence can only be described via a set of optimization problems. Consequently, the Sinkhorn barycenter is significantly more challenging. To guarantee global convergence, the proposed SD algorithm only requires one of the accumulation points of the measure sequence to be fully supported on $X$, with no restriction on the entire sequence.
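Because the Fréchet derivative of Definition 1.1 is the workhorse of the method developed in Section 3, a toy numerical illustration may help fix ideas. The functional below, the Gaussian kernel, and all variable names are illustrative assumptions that do not come from the paper; the point is only that, via the reproducing property, the directional derivative reduces to a pairing with a kernel-weighted function in $\mathcal{H}^d$.

```python
import numpy as np

# Toy illustration of Definition 1.1 (hypothetical functional, not the Sinkhorn objective).
# Take F[psi] = sum_j ||x_j + psi(x_j)||^2 for fixed particles x_j. By the reproducing
# property [f]_i(x) = <[f]_i, k_x>_H, the Frechet derivative at psi = 0 is the H^d function
# DF[0](y) = sum_j 2 x_j k(x_j, y), so <DF[0], xi>_{H^d} = sum_j 2 x_j^T xi(x_j).
# We check this pairing against a finite-difference approximation of the limit in Definition 1.1.

rng = np.random.default_rng(0)
d, n_particles = 3, 5
X = rng.normal(size=(n_particles, d))            # particles x_j

def k(a, b, sigma=1.0):                          # Gaussian kernel (an assumed choice)
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

Z = rng.normal(size=(4, d))                      # centers defining a direction xi in H^d
A = rng.normal(size=(4, d))                      # coefficients: xi(y) = sum_m A_m k(z_m, y)

def xi(y):
    return sum(A[m] * k(Z[m], y) for m in range(len(Z)))

def F(eps):                                      # F[eps * xi]
    return sum(np.sum((X[j] + eps * xi(X[j])) ** 2) for j in range(n_particles))

pairing = sum(2.0 * X[j] @ xi(X[j]) for j in range(n_particles))  # <DF[0], xi>_{H^d}
finite_diff = (F(1e-6) - F(0.0)) / 1e-6                           # (F[eps*xi] - F[0]) / eps
print(pairing, finite_diff)                                       # the two values should match
```

The same mechanism is what later turns the derivative of the Sinkhorn objective into the kernel-integral form (13).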
2 Sinkhorn Barycenter

We first introduce the entropy-regularized optimal transport distance and its debiased version, a.k.a. the Sinkhorn divergence. Given two probability measures $\alpha,\beta \in \mathcal{M}_1^+(X)$, use $\Pi(\alpha,\beta)$ to denote the set of joint distributions over $X^2$ with marginals $\alpha$ and $\beta$. For $\pi \in \Pi(\alpha,\beta)$, use $\langle c,\pi\rangle$ to denote the integral $\langle c,\pi\rangle = \int_{X^2} c(x,y)\,d\pi(x,y)$, and use $\mathrm{KL}(\pi\,\|\,\alpha\otimes\beta)$ to denote the Kullback-Leibler divergence between the candidate transport plan $\pi$ and the product measure $\alpha\otimes\beta$. The entropy-regularized optimal transport distance $\mathrm{OT}_\gamma(\alpha,\beta): \mathcal{M}_1^+(X)\times\mathcal{M}_1^+(X)\to\mathbb{R}_+$ is defined as

$$\mathrm{OT}_\gamma(\alpha,\beta) = \min_{\pi\in\Pi(\alpha,\beta)} \langle c,\pi\rangle + \gamma\,\mathrm{KL}(\pi\,\|\,\alpha\otimes\beta). \qquad (2)$$

Here, $\gamma>0$ is a regularization parameter. Note that $\mathrm{OT}_\gamma(\alpha,\beta)$ is not a valid metric, as there exists $\alpha\in\mathcal{M}_1^+(X)$ such that $\mathrm{OT}_\gamma(\alpha,\alpha)\neq 0$ when $\gamma\neq 0$. To remove this bias, Peyré et al. [2019] introduced the Sinkhorn divergence $S_\gamma(\alpha,\beta): \mathcal{M}_1^+(X)\times\mathcal{M}_1^+(X)\to\mathbb{R}_+$:

$$S_\gamma(\alpha,\beta) := \mathrm{OT}_\gamma(\alpha,\beta) - \tfrac{1}{2}\mathrm{OT}_\gamma(\alpha,\alpha) - \tfrac{1}{2}\mathrm{OT}_\gamma(\beta,\beta), \qquad (3)$$

which is a debiased version of $\mathrm{OT}_\gamma(\alpha,\beta)$. It is further proved that $S_\gamma(\alpha,\beta)$ interpolates between the Wasserstein distance and the Maximum Mean Discrepancy (MMD), and that it is nonnegative, bi-convex, and metrizes the convergence in law when the ground set $X$ is compact and the metric $c$ is Lipschitz. Now, given a set of probability measures $\{\beta_i\}_{i=1}^n$, the Sinkhorn barycenter is the measure $\alpha\in\mathcal{M}_1^+(X)$ that minimizes the average of the Sinkhorn divergences,

$$\min_{\alpha\in\mathcal{M}_1^+(X)} S_\gamma(\alpha) := \frac{1}{n}\sum_{i=1}^n S_\gamma(\alpha,\beta_i). \qquad (4)$$

We will next focus on the properties of $\mathrm{OT}_\gamma$, since $S_\gamma(\alpha)$ is a linear combination of such terms.

Algorithm 1 Sinkhorn Descent (SD)
Input: measures $\{\beta_i\}_{i=1}^n$, a discrete initial measure $\alpha_0$, a step size $\eta$, and a number of iterations $S$.
Output: a measure $\alpha_S$ that approximates the Sinkhorn barycenter of $\{\beta_i\}_{i=1}^n$.
for $t = 0$ to $S-1$ do
  $\alpha_{t+1} := T[\alpha_t]_\#\alpha_t$, with $T[\alpha_t]$ defined in (11);
end for

The Dual Formulation of $\mathrm{OT}_\gamma$. As a convex program, the entropy-regularized optimal transport problem $\mathrm{OT}_\gamma$ (2) has an equivalent dual formulation, given as follows:

$$\mathrm{OT}_\gamma(\alpha,\beta) = \max_{f,g\in C(X)} \langle f,\alpha\rangle + \langle g,\beta\rangle - \gamma\,\big\langle \exp\big((f\oplus g - c)/\gamma\big) - 1,\; \alpha\otimes\beta\big\rangle, \qquad (5)$$

where we denote $[f\oplus g](x,y) = f(x) + g(y)$. The maximizers $f_{\alpha,\beta}$ and $g_{\alpha,\beta}$ of (5) are called the Sinkhorn potentials of $\mathrm{OT}_\gamma(\alpha,\beta)$. Define the Sinkhorn mapping $A: C(X)\times\mathcal{M}_1^+(X)\to C(X)$ by

$$A(f,\alpha)(y) = -\gamma\log\int_X \exp\big((f(x) - c(x,y))/\gamma\big)\,d\alpha(x). \qquad (6)$$

The following lemma states the optimality condition for the Sinkhorn potentials $f_{\alpha,\beta}$ and $g_{\alpha,\beta}$.

Lemma 2.1 (Optimality, Peyré et al. [2019]). The pair $(f,g)$ are the Sinkhorn potentials of the entropy-regularized optimal transport problem (5) if they satisfy

$$f = A(g,\beta),\ \alpha\text{-a.e.} \quad \text{and} \quad g = A(f,\alpha),\ \beta\text{-a.e.} \qquad (7)$$

The Sinkhorn potential is the cornerstone of the entropy-regularized OT problem. In the discrete case, it can be computed by a standard method from Genevay et al. [2016]. In particular, when $\alpha$ is discrete, $f$ can simply be represented by a finite-dimensional vector, since only its values on $\mathrm{supp}(\alpha)$ matter. We describe such a method in Appendix A.1 for completeness. In the following, we treat the computation of the Sinkhorn potentials as a black box and refer to it as $\mathrm{SP}_\gamma(\alpha,\beta)$.
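As a concrete reference for the $\mathrm{SP}_\gamma(\alpha,\beta)$ black box, here is a minimal log-domain sketch for discrete measures that simply iterates the fixed-point conditions of Lemma 2.1, $f = A(g,\beta)$ and $g = A(f,\alpha)$. The squared-Euclidean cost, the fixed iteration budget, and all names are illustrative assumptions; in practice one would use a convergence criterion or an off-the-shelf implementation such as the POT library mentioned in Section 5.

```python
import numpy as np
from scipy.special import logsumexp

def sq_dist(x, y):
    """Squared-Euclidean ground cost c(x_j, y_l) for point sets x: (N, d), y: (M, d)."""
    return np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)

def sinkhorn_potentials(x, a, y, b, gamma, cost=sq_dist, n_iters=500):
    """Approximate the Sinkhorn potentials (f, g) of OT_gamma(alpha, beta) for discrete
    measures alpha = sum_j a_j delta_{x_j} and beta = sum_l b_l delta_{y_l}, by iterating
    the fixed point of Lemma 2.1 in the log domain:
        f <- A(g, beta),   g <- A(f, alpha),
    where A(f, alpha)(y) = -gamma * log sum_j a_j exp((f_j - c(x_j, y)) / gamma)."""
    C = cost(x, y)                                   # (N, M) cost matrix
    f, g = np.zeros(len(x)), np.zeros(len(y))
    log_a, log_b = np.log(a), np.log(b)
    for _ in range(n_iters):                         # fixed budget; use a tolerance in practice
        f = -gamma * logsumexp((g[None, :] - C) / gamma + log_b[None, :], axis=1)
        g = -gamma * logsumexp((f[:, None] - C) / gamma + log_a[:, None], axis=0)
    return f, g

# Toy usage with uniform weights (shapes and values are placeholders):
# x, y = np.random.randn(50, 2), np.random.randn(80, 2)
# f, g = sinkhorn_potentials(x, np.full(50, 1/50), y, np.full(80, 1/80), gamma=0.1)
```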
3 Methodology

We present the Sinkhorn Descent (SD) algorithm for the Sinkhorn barycenter problem (4) in two steps: we first reformulate (4) as an unconstrained functional minimization problem, and then derive the descent direction as the negative functional gradient over an RKHS $\mathcal{H}^d$. Operating in an RKHS allows us to measure the quality of the iterates using a so-called kernelized discrepancy, which we introduce in Definition 4.1; this quantity will be crucial for our convergence analysis. The restriction of a functional optimization problem to an RKHS is common in the literature, as discussed in Remark 3.1.

Alternative Formulation. Instead of directly solving the Sinkhorn barycenter problem in the probability space $\mathcal{M}_1^+(X)$, we reformulate it as a functional minimization over all mappings on $X$:

$$\min_{P: X\to X} S_\gamma(P_\#\alpha_0) := \frac{1}{n}\sum_{i=1}^n S_\gamma(P_\#\alpha_0,\beta_i), \qquad (8)$$

where $\alpha_0\in\mathcal{M}_1^+(X)$ is some given initial measure and $P_\#\alpha$ is the push-forward measure of $\alpha\in\mathcal{M}_1^+(X)$ under the mapping $P: X\to X$. When $\alpha_0$ is sufficiently regular, e.g. absolutely continuous, for any $\alpha\in\mathcal{M}_1^+(X)$ there always exists a mapping $P$ such that $\alpha = P_\#\alpha_0$ (see Theorem 1.33 of [Ambrosio and Gigli, 2013]). Consequently, problems (8) and (4) are equivalent with appropriate initialization.

Algorithm Derivation. For a probability measure $\alpha$, define the functional $S_\alpha: \mathcal{H}^d\to\mathbb{R}$,

$$S_\alpha[\psi] = S_\gamma\big((I+\psi)_\#\alpha\big), \quad \psi\in\mathcal{H}^d. \qquad (9)$$

Here $I$ is the identity mapping and $S_\gamma$ is the objective defined in (4). Let $\alpha_t$ be the estimate of the Sinkhorn barycenter at the $t$-th iteration. Sinkhorn Descent (SD) iteratively updates the measure $\alpha_{t+1}$ as

$$\alpha_{t+1} = T[\alpha_t]_\#\alpha_t, \qquad (10)$$

via the push-forward mapping (with $\eta>0$ being a step size)

$$T[\alpha_t](x) = x - \eta\, DS_{\alpha_t}[0](x). \qquad (11)$$

Recall that $DS_\alpha[0]$ is the Fréchet derivative of $S_\alpha$ at $\psi=0$ (see Definition 1.1). Note that $(I+\psi)_\#\alpha = \alpha$ when $\psi=0$. Our choice of the negative Fréchet derivative in $T[\alpha_t]$ allows the objective $S_\gamma(\alpha)$ to have the fastest descent at the current measure $\alpha = \alpha_t$. We outline the details of SD in Algorithm 1. Consequently, a solution of (8) is found by finite-step compositions and then formally passing to the limit $P = \lim_{t\to\infty} P_t$, where $P_t := T[\alpha_t]\circ\cdots\circ T[\alpha_0]$.

Remark 3.1. We restrict $\psi$ in (9) to the space $\mathcal{H}^d$ to avoid the inherent difficulty arising when the perturbation of the Sinkhorn potentials introduced by the mapping $(I+\psi)$ can no longer be properly bounded (for $\psi\in\mathcal{H}^d$, we always have the upper bound (1), which is necessary in our convergence analysis). This restriction potentially introduces error into the minimization of (8). However, such a restriction is common practice for general functional optimization problems: both SVGD [Liu and Wang, 2016] and SoD [Mroueh et al., 2019] explicitly make such an RKHS restriction on their transport mappings, and [Arbel et al., 2019] constructs the transport mapping using the witness function of the Maximum Mean Discrepancy (MMD), which also lies in an RKHS.

In what follows, we first derive a formula for the Fréchet derivative $DS_{\alpha_t}[0]$ (see (13)) and then explain how it can be computed efficiently. The proof of the next proposition requires an additional continuity study of the Sinkhorn potentials and is deferred to Appendix C.5.

Proposition 3.1. Recall the Fréchet derivative in Definition 1.1. Given $\alpha,\beta\in\mathcal{M}_1^+(X)$, for $\psi\in\mathcal{H}^d$ denote $F_1[\psi] = \mathrm{OT}_\gamma\big((I+\psi)_\#\alpha,\beta\big)$ and $F_2[\psi] = \mathrm{OT}_\gamma\big((I+\psi)_\#\alpha,(I+\psi)_\#\alpha\big)$. Under Assumptions 4.1 and 4.2 (described below), we can compute

$$DF_1[0](y) = \int_X \nabla f_{\alpha,\beta}(x)\,k(x,y)\,d\alpha(x), \qquad DF_2[0](y) = 2\int_X \nabla f_{\alpha,\alpha}(x)\,k(x,y)\,d\alpha(x), \qquad (12)$$

where $\nabla f_{\alpha,\beta}$ and $\nabla f_{\alpha,\alpha}$ are the gradients of the Sinkhorn potentials of $\mathrm{OT}_\gamma(\alpha,\beta)$ and $\mathrm{OT}_\gamma(\alpha,\alpha)$ respectively, and $k$ is the kernel function of the RKHS $\mathcal{H}$. Consequently, the Fréchet derivative of the Sinkhorn barycenter objective (9) can be computed as

$$DS_\alpha[0](y) = \int_X \Big[\frac{1}{n}\sum_{i=1}^n \nabla f_{\alpha,\beta_i}(x) - \nabla f_{\alpha,\alpha}(x)\Big]\,k(x,y)\,d\alpha(x). \qquad (13)$$

This quantity can be computed efficiently when $\alpha$ is discrete. Consider an individual term $\nabla f_{\alpha,\beta}$ and define

$$h(x,y) := \exp\Big(\tfrac{1}{\gamma}\big(f_{\alpha,\beta}(x) + A[f_{\alpha,\beta},\alpha](y) - c(x,y)\big)\Big).$$

Lemma 2.1 implies $\int_X h(x,y)\,d\beta(y) = 1$. Taking the derivative with respect to $x$ on both sides and rearranging terms, we have

$$\nabla f_{\alpha,\beta}(x) = \frac{\int_X h(x,y)\,\nabla_1 c(x,y)\,d\beta(y)}{\int_X h(x,y)\,d\beta(y)} = \int_X h(x,y)\,\nabla_1 c(x,y)\,d\beta(y), \qquad (14)$$

which is itself an expectation. Note that to evaluate (13), we only need $\nabla f_{\alpha,\beta}(x)$ on $\mathrm{supp}(\alpha)$. Using $\mathrm{SP}_\gamma(\alpha,\beta)$ (see the end of Section 2), the values of $f_{\alpha,\beta}$ on $\mathrm{supp}(\alpha)$ can be computed efficiently. Together with the expression in (14), the gradients $\nabla f_{\alpha,\beta}(x)$ at $x\in\mathrm{supp}(\alpha)$ can then be obtained by a simple Monte-Carlo integration with respect to $\beta$.
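To make the discrete-case computation concrete, the following is a minimal sketch of one SD update, combining the Monte-Carlo gradient formula (14), the Fréchet derivative (13) evaluated at the particles, and the push-forward step (10)-(11). It assumes uniform weights, a Gaussian kernel for $\mathcal{H}$, and the squared-Euclidean ground cost, and it reuses the sinkhorn_potentials and sq_dist helpers sketched at the end of Section 2; none of these specific choices or names come from the paper.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Kernel matrix k(x_j, y_l) for the RKHS H (Gaussian kernel is an assumed choice)."""
    sq = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

def grad_potential(x, y, b, f, g, gamma):
    """Gradient of the Sinkhorn potential f_{alpha,beta} at the alpha-particles x via (14):
        grad f(x_j) = sum_l h_{jl} b_l grad_1 c(x_j, y_l) / sum_l h_{jl} b_l,
    with h_{jl} = exp((f_j + g_l - c(x_j, y_l)) / gamma) and c the squared-Euclidean cost,
    so grad_1 c(x, y) = 2 (x - y)."""
    C = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    log_w = (f[:, None] + g[None, :] - C) / gamma + np.log(b)[None, :]
    w = np.exp(log_w - log_w.max(axis=1, keepdims=True))     # numerically stabilized h * b
    w /= w.sum(axis=1, keepdims=True)                        # the normalization in (14)
    grad_c = 2.0 * (x[:, None, :] - y[None, :, :])
    return np.einsum('jl,jld->jd', w, grad_c)

def sd_step(x, betas, gamma, eta, sigma=1.0):
    """One Sinkhorn Descent update (10)-(11) for a uniform discrete iterate
    alpha_t = (1/N) sum_j delta_{x_j}; betas is a list of discrete sources (y_i, b_i).
    Reuses sinkhorn_potentials and sq_dist from the sketch in Section 2."""
    N = len(x)
    a = np.full(N, 1.0 / N)
    # xi(x_j) = (1/n) sum_i grad f_{alpha,beta_i}(x_j) - grad f_{alpha,alpha}(x_j)
    xi = np.zeros_like(x)
    for y, b in betas:
        f, g = sinkhorn_potentials(x, a, y, b, gamma)
        xi += grad_potential(x, y, b, f, g, gamma) / len(betas)
    f_aa, g_aa = sinkhorn_potentials(x, a, x, a, gamma)
    xi -= grad_potential(x, x, a, f_aa, g_aa, gamma)
    # Frechet derivative (13) at the particles: DS[0](x_j) = (1/N) sum_m xi(x_m) k(x_m, x_j)
    descent = gaussian_kernel(x, x, sigma) @ xi / N
    return x - eta * descent                                  # the push-forward map (11)
```

Iterating sd_step for $S$ rounds reproduces the loop of Algorithm 1 at the level of particle positions.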
In this section, we analyze the finite-time convergence and the mean-field limit of SD under the following assumptions on the ground cost function $c$ and the kernel function $k$ of the RKHS $\mathcal{H}$.

Assumption 4.1. The ground cost function $c(x,y)$ is bounded, i.e. $\forall x,y\in X$, $c(x,y)\le M_c$; $G_c$-Lipschitz continuous, i.e. $\forall x,x',y\in X$, $|c(x,y)-c(x',y)|\le G_c\|x-x'\|$; and $L_c$-Lipschitz smooth, i.e. $\forall x,x',y\in X$, $\|\nabla_1 c(x,y)-\nabla_1 c(x',y)\|\le L_c\|x-x'\|$.

Assumption 4.2. The kernel function $k(x,y)$ is bounded, i.e. $\forall x,y\in X$, $k(x,y)\le D_k$; and $G_k$-Lipschitz continuous, i.e. $\forall x,x',y\in X$, $|k(x,y)-k(x',y)|\le G_k\|x-x'\|$.

4.1 Finite Time Convergence Analysis

In this section, we prove that Sinkhorn Descent converges to a stationary point of problem (4) at the rate of $O(1/t)$, where $t$ is the number of iterations. We first introduce a discrepancy quantity.

Definition 4.1. Recall the definition of the functional $S_\alpha$ in (9) and the definition of the Fréchet derivative in Definition 1.1. Given a probability measure $\alpha\in\mathcal{M}_1^+(X)$, the Kernelized Sinkhorn Barycenter Discrepancy (KSBD) for the Sinkhorn barycenter problem is defined as

$$S(\alpha,\{\beta_i\}_{i=1}^n) := \big\| DS_\alpha[0] \big\|_{\mathcal{H}^d}^2. \qquad (15)$$

Note that in each round $t$, $S(\alpha_t,\{\beta_i\}_{i=1}^n)$ measures the stationarity of SD and can be used to quantify the per-iteration improvement.

Lemma 4.1 (Sufficient Descent). Recall the definition of the Sinkhorn barycenter problem in (4) and the sequence of measures $\{\alpha_t\}_{t\ge 0}$ in (10) generated by SD (Algorithm 1). Under Assumption 4.1, if $\eta \le \min\{1/(8L_f M_{\mathcal{H}}^2),\, 1/(8 d L_T M_{\mathcal{H}}^2)\}$, the Sinkhorn objective always decreases:

$$S_\gamma(\alpha_{t+1}) \le S_\gamma(\alpha_t) - \frac{\eta}{2}\, S(\alpha_t,\{\beta_i\}_{i=1}^n). \qquad (16)$$

Here $M_{\mathcal{H}}$ is defined in (1), $L_f := 4G_c^2/\gamma + L_c$, and $L_T := 2G_c^2\exp(3M_c/\gamma)/\gamma$.¹

¹ We acknowledge that the factor $\exp(1/\gamma)$ is non-ideal, but such a quantity constantly appears in the literature related to the Sinkhorn divergence, e.g. Theorem 5 in [Luise et al., 2019] and Theorem 3 in [Genevay et al., 2019a]. It would be interesting future work to remove this factor.

The proof of the lemma is given in Appendix C.7. Based on this result, we can derive the following convergence result, demonstrating that SD converges to a stationary point at a sublinear rate.

Theorem 4.1 (Convergence). Suppose SD is initialized with $\alpha_0\in\mathcal{M}_1^+(X)$ and outputs $\alpha_t\in\mathcal{M}_1^+(X)$ after $t$ iterations. Under Assumption 4.1, we have

$$\min_{0\le s< t}\; S(\alpha_s,\{\beta_i\}_{i=1}^n) \le \frac{2\, S_\gamma(\alpha_0)}{\eta t}, \qquad (17)$$

where $0<\eta\le\min\{1/(8L_f M_{\mathcal{H}}^2),\, 1/(8 d L_T M_{\mathcal{H}}^2)\}$ is the step size. With a slight change to SD, we can also establish last-iterate convergence, as elaborated in Appendix B.3.
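Since Theorem 4.1 is stated in terms of KSBD, this quantity is also a natural convergence monitor in practice. For a discrete iterate $\alpha_t = \frac{1}{N}\sum_j \delta_{x_j}$, expanding (15) with (13) and the reproducing property gives $S(\alpha_t,\{\beta_i\}_{i=1}^n) = \frac{1}{N^2}\sum_{j,l}\xi(x_j)^\top k(x_j,x_l)\,\xi(x_l)$, where $\xi$ denotes the bracketed integrand of (13). The following minimal sketch computes this plug-in estimate; it reuses the (assumed) Gaussian kernel from the SD sketch above, and all names are illustrative.

```python
import numpy as np

def ksbd_estimate(x, xi, sigma=1.0):
    """Plug-in estimate of KSBD (15) for a uniform discrete measure supported on x,
    given the particle values xi[j] of the bracketed integrand of (13):
        S(alpha, {beta_i}) = (1/N^2) sum_{j,l} xi(x_j)^T k(x_j, x_l) xi(x_l).
    gaussian_kernel is the (assumed) Gaussian kernel from the SD sketch above."""
    K = gaussian_kernel(x, x, sigma)
    return np.einsum('jl,jd,ld->', K, xi, xi) / len(x) ** 2
```

One might, for instance, stop SD once this estimate falls below a tolerance, mirroring the min-over-iterates guarantee in (17).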
Remark 4.1 (Exponential dependence on $1/\gamma$). It is difficult to remove the exponential dependence on $1/\gamma$ in the above convergence result. Specifically, the term $\exp(1/\gamma)$ appears in bounding the smoothness of the Sinkhorn potential, which is a key factor in bounding the sample complexity of the Sinkhorn divergence (see Theorem 2 and Lemma 3 of [Genevay et al., 2019a]). Consequently, the sample complexity in [Genevay et al., 2019a] could be improved if one managed to remove this factor. However, this would potentially violate the lower bound on the sample complexity of the hard-to-compute Wasserstein distance, which is the limit of the Sinkhorn divergence as $\gamma\to 0$. Surprisingly, the empirical performance of SD does not suffer much from this factor: in our experiments (Section 5), to produce good visual results we pick $\gamma = 10^{-4}$, and we still observe that SD converges quickly (even in the high-dimensional Gaussian barycenter task).

Remark 4.2 (Implicit exponential dependence on the problem dimension). As shown in Lemma 4.1 and Theorem 4.1, our results depend on $\exp(M_c/\gamma)$, where $M_c$ is the upper bound on the ground cost on the domain $X$, which contains an implicit dependence on the problem dimension.

4.2 Mean Field Limit Analysis

While Sinkhorn Descent accepts both discrete and continuous measures as initialization, in practice we start from a discrete initial measure $\alpha_N^0$ with $|\mathrm{supp}(\alpha_N^0)| = N$. If $\alpha_N^0$ is an empirical measure sampled from an underlying measure $\alpha_\infty^0$, we have weak convergence at time $t=0$, i.e. $\alpha_N^0 \rightharpoonup \alpha_\infty^0$ as $N\to\infty$. The mean-field limit analysis demonstrates that Sinkhorn Descent preserves such weak convergence for any finite time $t$:

$$\alpha_N^0 \rightharpoonup \alpha_\infty^0 \;\Longrightarrow\; \alpha_N^t = \mathrm{SD}_t(\alpha_N^0) \rightharpoonup \alpha_\infty^t = \mathrm{SD}_t(\alpha_\infty^0),$$

where we use $\mathrm{SD}_t$ to denote the output of SD after $t$ steps and $\rightharpoonup$ to denote weak convergence.

Lemma 4.2. Recall the push-forward mapping $T[\alpha](x)$ of SD from (11) and recall $L_f$ from Lemma 4.1. Under Assumptions 4.1 and 4.2, for two probability measures $\alpha$ and $\alpha'$, we have

$$d_{bl}\big(T[\alpha]_\#\alpha,\; T[\alpha']_\#\alpha'\big) \le (1+\eta C)\, d_{bl}(\alpha,\alpha'), \qquad (18)$$

where $d_{bl}$ denotes the bounded-Lipschitz distance, $C = G_c G_k + \max\{d L_f D_k + d G_c G_k,\; D_k L_{bl}\}$, and $L_{bl} := 8G_c^2\exp(6M_c/\gamma)$.

The proof is presented in Appendix C.6. This is a discrete version of Dobrushin's estimate (Section 1.4 in [Golse, 2016]). As a result, we directly obtain the following large-$N$ characterization of $\mathrm{SD}_t(\alpha_N^0)$.

Theorem 4.2 (Mean Field Limit). Let $\alpha_N^0$ be an empirical initial measure with $|\mathrm{supp}(\alpha_N^0)| = N$ and let $\alpha_\infty^0$ be the underlying measure such that $\alpha_N^0 \rightharpoonup \alpha_\infty^0$. Use $\mathrm{SD}_t(\alpha_N^0)$ and $\mathrm{SD}_t(\alpha_\infty^0)$ to denote the outputs of SD after $t$ iterations under the initializations $\alpha_N^0$ and $\alpha_\infty^0$, respectively. Under Assumptions 4.1 and 4.2, for any finite time $t$ we have

$$d_{bl}\big(\mathrm{SD}_t(\alpha_N^0),\, \mathrm{SD}_t(\alpha_\infty^0)\big) \le (1+\eta C)^t\, d_{bl}(\alpha_N^0,\alpha_\infty^0), \qquad (19)$$

and hence, as $N\to\infty$, we have $\alpha_N^t = \mathrm{SD}_t(\alpha_N^0) \rightharpoonup \alpha_\infty^t = \mathrm{SD}_t(\alpha_\infty^0)$.

4.3 KSBD as Discrepancy Measure

In this section, we show that, under additional assumptions, KSBD is a valid discrepancy measure, i.e. the vanishing of KSBD implies that $\alpha$ is a globally optimal solution to the Sinkhorn barycenter problem (4). The proof is provided in Appendix D. First, we introduce the following positivity condition.

Definition 4.2. A kernel $k(x,x')$ is said to be integrally strictly positive definite (ISPD) w.r.t. a measure $\alpha\in\mathcal{M}_1^+(X)$ if, for every $\xi: X\to\mathbb{R}^d$ with $0 < \int_X \|\xi(x)\|^2\,d\alpha(x) < \infty$, it holds that

$$\int_{X^2} \xi(x)^\top k(x,x')\,\xi(x')\,d\alpha(x)\,d\alpha(x') > 0. \qquad (20)$$

Theorem 4.3. Recall the Fréchet derivative of the Sinkhorn barycenter problem in (13) and KSBD in (15).
Denote $\xi(x) := \frac{1}{n}\sum_{i=1}^n \nabla f_{\alpha,\beta_i}(x) - \nabla f_{\alpha,\alpha}(x)$. We have $\int_X \|\xi(x)\|^2\,d\alpha(x) < \infty$. (i) If the kernel function $k(x,x')$ is ISPD w.r.t. $\alpha\in\mathcal{M}_1^+(X)$ and $\alpha$ is fully supported on $X$, then the vanishing of KSBD, i.e. $S(\alpha,\{\beta_i\}_{i=1}^n) = 0$, implies that $\alpha$ globally minimizes problem (4). (ii) Use $\alpha_t$ to denote the output of SD after $t$ iterations. If, further, one of the accumulation points of the sequence $\{\alpha_t\}$ is fully supported on $X$, then $\lim_{t\to\infty} S_\gamma(\alpha_t) = S_\gamma(\alpha^\star)$, where $\alpha^\star$ denotes a global minimizer of (4).

We show in Appendix D.2 that, under an absolutely continuous (a.c.) and fully supported (f.s.) initialization, $\alpha_t$ remains a.c. and f.s. for any finite $t$. This leads to our assumption in (ii): one of the accumulation points of $\{\alpha_t\}$ is f.s. However, rigorously analyzing the support of $\alpha_t$ in the asymptotic case ($t\to\infty$) requires a separate proof. Establishing the global convergence of functional gradient descent is known to be difficult in the literature, even for settings much easier than our problem (4). For instance, [Mroueh et al., 2019, Arbel et al., 2019] prove the global convergence of their MMD descent algorithms, and both works require additional assumptions on the entire measure sequence $\{\alpha_t\}$, as detailed in Appendix D.3. See also the convergence analysis of SVGD in [Lu et al., 2019] under very strong assumptions on the score functions.

Remark 4.3 (Initialization with finite particles). In practice, we start with a sufficient number of particles. In this case, we cannot prove that SD converges globally, since (ii) of Theorem 4.3 does not hold. However, we observe that increasing the number of particles reduces the Sinkhorn divergence at convergence, albeit with diminishing returns.

5 Experiments

We conduct experimental studies to show the efficiency and efficacy of Sinkhorn Descent by comparing it with the recently proposed functional Frank-Wolfe method (FW) from [Luise et al., 2019]². Note that in round $t$, FW requires globally minimizing the nonconvex function

$$Q(x) := \frac{1}{n}\sum_{i=1}^n f_{\alpha_t,\beta_i}(x) - f_{\alpha_t,\alpha_t}(x) \qquad (21)$$

in order to choose the next Dirac measure to be added to the support. Here, $f_{\alpha_t,\beta_i}$ and $f_{\alpha_t,\alpha_t}$ are the Sinkhorn potentials. This operation is implemented by an exhaustive grid search so that FW returns a reasonably accurate solution. Consequently, FW is computationally expensive even for low-dimensional problems, and we only compare SD with FW in the first two image experiments, where $d=2$ (the grid size used in FW grows exponentially with $d$). Importantly, the size of the support $N$ affects the computational efficiency as well as the solution quality of both methods.

² [Claici et al., 2018] is not included as it only applies to the Wasserstein barycenter problem ($\gamma = 0$).

Figure 1: Sinkhorn barycenter loss versus the number of iterations for (a) Concentric Ellipses, (b) Distribution Sketching (log-scale loss), and (c) Gaussians; $N$ is the support size. FW is not included in (c) as it is impractical in high-dimensional problems (here, the dimension is 100).

Figure 2: Visual results of the ellipses and sketching problems. (a1) SD on ellipses and (b1) SD on sketching, shown from left to right using 1 to 9 SD steps and 1 to 201 SD steps, respectively; (a2) FW on ellipses and (b2) FW on sketching, shown from left to right using 411 to 491 FW steps and 9901 to 19901 FW steps, respectively.
A large support size usually means higher computational complexity but allows a more accurate approximation of the barycenter. However, since SD and FW have different support-size patterns, it is hard to compare them directly: the support size of SD is fixed after initialization, while FW starts from a small initial support and gradually increases it during the optimization procedure. We hence fix the support size of the output measure of FW and vary the support size of SD for a more comprehensive comparison. In the following, the entropy regularization parameter $\gamma$ is set to $10^{-4}$ in all tasks to produce results of good visual quality.

Barycenter of Concentric Ellipses. We compute the barycenter of 30 randomly generated concentric ellipses, similarly to [Cuturi and Doucet, 2014, Luise et al., 2019]. We run FW for 500 iterations, and hence the output measure of FW has support size $N=600$ (FW increases its support size by 1 in each iteration). SD is initialized with a discrete uniform distribution with support size varying over $N\in\{20,40,80\}$. Note that in these experiments the chosen support size for SD is even smaller than the initial support size of FW. The result is reported in Figure 1(a). In terms of convergence rate, we observe that SD is much faster than FW; even 20 iterations are sufficient for SD to find a good solution. More importantly, in terms of solution quality, SD with support size $N=20$ outperforms FW with final support size $N=600$. In fact, FW cannot find a solution of better quality even with a larger support size. This phenomenon is due to an inevitable limitation of the FW optimization procedure: each FW step requires globally minimizing the nonconvex function (21) via an exhaustive grid search. This introduces an inherent error into the procedure, as the actual minimizer of (21) may reside outside the grid points. Such error limits the accuracy of FW even when the number of particles grows. In contrast, SD adjusts the particles to minimize the objective without any inherent error. As a result, we observe that SD outperforms FW in both efficiency and accuracy.

Distribution Sketching. We consider a special case of the barycenter problem where we only have one source distribution, similarly to [Luise et al., 2019]. This problem can be viewed as approximating a given distribution with a fixed support-size budget and is hence called distribution sketching. Specifically, a natural image of a cheetah is used as the source measure in $\mathbb{R}^2$. We run FW for 20000 iterations, and the support size of SD is $N\in\{2000,4000\}$. The result is reported in Figure 1(b). Since we only have one source measure, the Sinkhorn barycenter loss is very small and hence we use a log scale on the y-axis. We observe that SD outperforms FW in terms of both the quality of the solution and the convergence rate.

Barycenter of Gaussians. To demonstrate the efficiency of SD on high-dimensional problems, we consider the problem of finding the barycenter of multivariate Gaussian distributions. Concretely, we pick 5 isotropic Gaussians in $\mathbb{R}^{100}$ with different means. For each of them, we sample an empirical measure with 50000 points and use the obtained empirical measures as source measures. We initialize SD with an empirical measure sampled from the uniform distribution with support size $N=5000$.
We did not compare with FW, as the global minimizer of $Q(x)$ in (21) cannot be computed in $\mathbb{R}^{100}$. The result is reported in Figure 1(c). We can see that, just as in the previous two experiments, SD converges in fewer than 20 iterations.

Visual Results on Ellipses and Sketching. To compare SD with FW visually, we allow SD and FW to have a similar number of particles in the ellipses and sketching tasks, and report the results in Figure 2. Specifically, in (a1) SD has 500 particles while in (a2) FW has 511 to 591 particles (recall that the support size of FW grows over iterations); in (b1) SD has 8000 particles while in (b2) FW has 10001 to 20001 particles. In all cases FW has at least as many particles as SD while taking significantly more steps. However, the visual results produced by SD are clearly better than those of FW: in (a1), the circle is very clear in the last picture, while in (a2) all pictures remain blurred; in (b1), the eyes of the cheetah are clear, but in (b2) the eyes remain blurry.

Additional Visual Results on MNIST. We provide additional results on the MNIST dataset. In this problem, we view the (low-resolution) images of each individual digit (0-9) as one of the target distributions, and our goal is to compute their (high-resolution) Sinkhorn barycenter (we randomly select 100 images of each digit in MNIST). We compare SD with the free-support method in [Cuturi and Doucet, 2014] (we directly use the implementation from the Python OT library, https://pythonot.github.io/index.html). Both methods use 2500 particles and are run for 100 iterations. In the 2D-histogram images of Figure 3, brighter pixels mean more particles in a local region. We can observe that the particles of SD are more concentrated on the digits compared to those of [Cuturi and Doucet, 2014]. We only report the visual results but do not compare the exact barycenter loss, since [Claici et al., 2018] considers the exact Wasserstein barycenter problem with no entropy regularization. We do not compare with [Chewi et al., 2020] since it only applies to the barycenter problem of Gaussian distributions.

Figure 3: Visual results on MNIST. The first row contains results from SD and the second row contains results from the free-support method in [Cuturi and Doucet, 2014].

6 Broader Impact

This work has the following potential positive impact on society: we propose the first algorithm for the Sinkhorn barycenter problem that is scalable with respect to the problem dimension $d$ (linear dependence), while existing works all have an exponential dependence on $d$. Further, we expect that this functional gradient descent method can be applied to more general optimization problems involving distribution sampling: in principle, the negative gradient of the dual variables instructs the particles in the measure to search the landscape of the minimizer.

Acknowledgment

This work is supported by NSF CPS-1837253.

References

L. Ambrosio and N. Gigli. A user's guide to optimal transport. pages 1-155, 2013.

M. Arbel, A. Korba, A. Salim, and A. Gretton. Maximum mean discrepancy gradient flow. In Advances in Neural Information Processing Systems, pages 6481-6491, 2019.

S. Chewi, T. Maunu, P. Rigollet, and A. J. Stromme. Gradient descent algorithms for Bures-Wasserstein barycenters. arXiv preprint arXiv:2001.01700, 2020.

S. Claici, E. Chien, and J. Solomon. Stochastic Wasserstein barycenters. In International Conference on Machine Learning, pages 999-1008, 2018.

M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292-2300, 2013.
M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, pages 685-693, 2014.

P. Dvurechenskii, D. Dvinskikh, A. Gasnikov, C. Uribe, and A. Nedich. Decentralize and randomize: Faster algorithm for Wasserstein barycenters. In Advances in Neural Information Processing Systems, pages 10760-10770, 2018.

J. Feydy, T. Séjourné, F.-X. Vialard, S.-i. Amari, A. Trouvé, and G. Peyré. Interpolating between optimal transport and MMD using Sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2681-2690, 2019.

A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems, pages 3440-3448, 2016.

A. Genevay, L. Chizat, F. Bach, M. Cuturi, and G. Peyré. Sample complexity of Sinkhorn divergences. In Proc. AISTATS'19, 2019a.

A. Genevay, L. Chizat, F. Bach, M. Cuturi, and G. Peyré. Sample complexity of Sinkhorn divergences. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of Machine Learning Research, volume 89, pages 1574-1583. PMLR, 16-18 Apr 2019b.

F. Golse. On the dynamics of large particle systems in the mean field limit. In Macroscopic and Large Scale Phenomena: Coarse Graining, Mean Field Limits and Ergodicity, pages 1-144. Springer, 2016.

A. Kroshnin, N. Tupitsa, D. Dvinskikh, P. Dvurechensky, A. Gasnikov, and C. Uribe. On the complexity of approximating Wasserstein barycenters. In International Conference on Machine Learning, pages 3530-3540, 2019.

B. Lemmens and R. Nussbaum. Nonlinear Perron-Frobenius Theory, volume 189. Cambridge University Press, 2012.

Q. Liu. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pages 3115-3123, 2017.

Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2378-2386, 2016.

J. Lu, Y. Lu, and J. Nolen. Scaling limit of the Stein variational gradient descent: The mean field regime. SIAM Journal on Mathematical Analysis, 51(2):648-671, 2019.

G. Luise, S. Salzo, M. Pontil, and C. Ciliberto. Sinkhorn barycenters with free support via Frank-Wolfe algorithm. In Advances in Neural Information Processing Systems 32, 2019.

Y. Mroueh, T. Sercu, and A. Raj. Sobolev descent. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2976-2985, 2019.

G. Peyré, M. Cuturi, et al. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355-607, 2019.

J. Rabin, G. Peyré, J. Delon, and M. Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435-446. Springer, 2011.

J. Solomon, F. De Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4):1-11, 2015.

S. Srivastava, V. Cevher, Q. Dinh, and D. Dunson. WASP: Scalable Bayes via barycenters of subset posteriors. In Artificial Intelligence and Statistics, pages 912-920, 2015.

M. Staib, S. Claici, J. M. Solomon, and S. Jegelka. Parallel streaming Wasserstein barycenters. In Advances in Neural Information Processing Systems, pages 2647-2658, 2017.

A. W. Van Der Vaart and J. A. Wellner. Weak convergence. In Weak Convergence and Empirical Processes, pages 16-28. Springer, 1996.
J. Ye, P. Wu, J. Z. Wang, and J. Li. Fast discrete distribution clustering using Wasserstein barycenter with sparse support. IEEE Transactions on Signal Processing, 65(9):2317-2332, 2017.