# On the Complexity of Approximating Wasserstein Barycenters

Alexey Kroshnin¹ ² ³, Darina Dvinskikh⁴ ¹, Pavel Dvurechensky⁴ ¹, Alexander Gasnikov⁵ ¹ ², Nazarii Tupitsa¹ ⁵, César A. Uribe⁶

¹Institute for Information Transmission Problems RAS, Moscow, Russia; ²National Research University Higher School of Economics, Moscow, Russia; ³Université Claude Bernard Lyon 1, Villeurbanne, France; ⁴Weierstrass Institute for Applied Analysis and Stochastics, Berlin, Germany; ⁵Moscow Institute of Physics and Technology, Moscow, Russia; ⁶Massachusetts Institute of Technology, Cambridge, USA. Correspondence to: Pavel Dvurechensky, Alexey Kroshnin.

*Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).*

## Abstract

We study the complexity of approximating the Wasserstein barycenter of $m$ discrete measures, or histograms of size $n$, by contrasting two alternative approaches that use entropic regularization. The first approach is based on the Iterative Bregman Projections (IBP) algorithm, for which our novel analysis gives a complexity bound proportional to $mn^2/\varepsilon^2$ to approximate the original non-regularized barycenter. On the other hand, using an approach based on accelerated gradient descent, we obtain a complexity proportional to $mn^{2.5}/\varepsilon$. As a byproduct, we show that the regularization parameter in both approaches has to be proportional to $\varepsilon$, which causes instability of both algorithms when the desired accuracy is high. To overcome this issue, we propose a novel proximal-IBP algorithm, which can be seen as a proximal gradient method that uses IBP on each iteration to make a proximal step. We also consider the question of scalability of these algorithms using approaches from distributed optimization and show that the first algorithm can be implemented in a centralized distributed setting (master/slave), while the second one is amenable to a more general decentralized distributed setting with an arbitrary network topology.

## Introduction

Optimal transport (OT) (Monge, 1781; Kantorovich, 1942) is becoming increasingly popular in the statistics, machine learning and optimization communities. Statistical methods based on optimal transport are readily available (Bigot et al., 2012; Del Barrio et al., 2015; Ebert et al., 2017; Le Gouic & Loubes, 2017), as well as many applications in unsupervised learning (Arjovsky et al., 2017; Bigot et al., 2017), semi-supervised learning (Solomon et al., 2014), clustering (Ho et al., 2017), and text classification (Kusner et al., 2015), among others. Optimal transport distances lead to the concept of the Wasserstein barycenter, which allows one to define a mean of a set of complex objects, e.g. images, preserving their geometric structure (Cuturi & Doucet, 2014). In this paper, we focus on the computational aspects of optimal transport, namely on the complexity of approximating a Wasserstein barycenter of a set of histograms. Starting with Altschuler et al. (2017), several groups of authors addressed the question of the complexity of approximating the Wasserstein distance (Chakrabarty & Khanna, 2018; Dvurechensky et al., 2018b; Blanchet et al., 2018; Lin et al., 2019). Implementable schemes based on Sinkhorn's algorithm were first applied to OT in Cuturi (2013), see also (Genevay et al., 2016). Accelerated gradient descent methods were proposed as an alternative in Dvurechensky et al. (2018b).
Much less is known about the complexity of approximating the Wasserstein barycenter. The works (Staib et al., 2017; Dvurechensky et al., 2018a) are in some sense close, but do not provide an explicit answer. Following Dvurechensky et al. (2018b), we study two alternative approaches for approximating Wasserstein barycenters based on entropic regularization (Cuturi, 2013). The first approach is based on the Iterative Bregman Projections (IBP) algorithm (Benamou et al., 2015), which can be considered as a general alternating projections algorithm. The second approach is based on constructing a dual problem and solving it by primal-dual accelerated gradient descent. For both approaches, we show how the regularization parameter should be chosen in order to approximate the original, non-regularized barycenter. We also address the question of scalability in the Big Data regime, i.e., when the size of the histograms $n$ and the number of histograms $m$ are large. In this case, the dataset of $m$ histograms can be produced or stored distributedly in a network of agents/sensors/computers whose structure is given by an arbitrary connected graph. In the special case of a centralized architecture, i.e., if there is a central master node connected to slave nodes, parallel algorithms such as (Staib et al., 2017) can be applied. In the more general setup of arbitrary networks it makes sense to use decentralized distributed algorithms in the spirit of distributed optimization algorithms (Scaman et al., 2017; Nedić et al., 2017).

**Related Work.** It is very hard to cover the ever-increasing stream of works on OT, and we mention the books Villani (2008); Santambrogio (2015); Peyré & Cuturi (2018) as a starting point, together with the references therein. Approximation of the Wasserstein barycenter was considered in Cuturi & Doucet (2014); Bonneel et al. (2015); Benamou et al. (2015); Staib et al. (2017); Puccetti et al. (2018); Claici et al. (2018); Uribe et al. (2018); Dvurechensky et al. (2018a). Considering the primal-dual approach based on accelerated gradient descent, our paper shares some similarities with (Cuturi & Peyré, 2016), with the main difference that we focus on the complexity and scalability of computations and explicitly analyze the algorithm applied to the dual problem. There is a vast literature on accelerated gradient descent, with the canonical reference being (Nesterov, 1983). Primal-dual extensions can be found in (Lan et al., 2011; Tran-Dinh et al., 2018; Yurtsever et al., 2015; Chernov et al., 2016; Dvurechensky et al., 2016; Gasnikov et al., 2016; Dvurechensky et al., 2017; Anikin et al., 2017; Nesterov et al., 2018; Lin et al., 2019). We focus on the extensions amenable to decentralized distributed optimization, so that these algorithms can be scaled to large problems. Distributed optimization algorithms were considered by many authors, with the classical reference being Bertsekas & Tsitsiklis (1989). Initial algorithms, such as Distributed Gradient Descent (Nedic & Ozdaglar, 2009), were relatively slow compared with their centralized counterparts. However, recent work has made significant advances towards a better understanding of the optimal rates of such algorithms and their explicit dependence on the function and network parameters (Lan et al., 2017; Scaman et al., 2017; Uribe et al., 2018).
These approaches have been extended to other scenarios such as time-varying graphs (Rogozin et al., 2018; Maros & Jaldén, 2018; Wu & Lu, 2017). The distributed setup is particularly interesting for machine learning applications in the big data regime, where the number of data points and the dimensionality are large, due to its flexibility to handle intrinsically distributed storage and limited communication, as well as privacy constraints (He et al., 2018; Wai et al., 2018).

**Our contributions.**

1. We consider the $\gamma$-regularized Wasserstein barycenter problem and obtain complexity bounds for finding an approximation to the regularized barycenter by two algorithms. The first one is the Iterative Bregman Projections algorithm (Benamou et al., 2015), for which we prove a complexity proportional to $1/(\gamma\varepsilon)$ to achieve accuracy $\varepsilon$. The second one is based on accelerated gradient descent (AGD) and has complexity proportional to $1/(\sqrt{\gamma}\,\varepsilon)$. The benefit of the second algorithm is that it is better scalable and can be implemented in the decentralized distributed optimization setting over an arbitrary network.

2. We show how to choose the regularization parameter in order to find an $\varepsilon$-approximation of the non-regularized Wasserstein barycenter. The resulting complexity for IBP is proportional to $mn^2/\varepsilon^2$, and for AGD it is proportional to $mn^{2.5}/\varepsilon$.

3. We solve the stability issues of the IBP and AGD approaches, present when the desired accuracy is high, or conversely when $\varepsilon$ is small, by proposing a proximal-IBP method, which can be considered as a proximal method using IBP on each iteration to find the next iterate.

The full version of the paper with the proofs can be found as supplementary material and as (Kroshnin et al., 2019).

## 1. Problem Statement and Preliminaries

### 1.1. Notation

We define the probability simplex as $S_n(1) = \{q \in \mathbb{R}^n_+ \mid \sum_{i=1}^n q_i = 1\}$. Given two discrete measures $p$ and $q$ in $S_n(1)$, we introduce the set of coupling measures as $\Pi(p,q) = \{\pi \in \mathbb{R}^{n\times n}_+ : \pi\mathbf{1} = p,\ \pi^T\mathbf{1} = q\}$. For a coupling measure $\pi \in \mathbb{R}^{n\times n}_+$, we denote the negative entropy (up to an additive constant) as
$$H(\pi) = \sum_{i,j=1}^n \pi_{ij}\left(\ln\pi_{ij} - 1\right) = \langle \pi, \ln\pi - \mathbf{1}\mathbf{1}^T\rangle.$$
We denote by $\ln(A)$ ($\exp(A)$) the element-wise logarithm (exponent) of a matrix or vector $A$, and $\langle A, B\rangle := \sum_{i,j=1}^n A_{ij}B_{ij}$ for any $A, B \in \mathbb{R}^{n\times n}$. We use the symbol $\mathbf{1}$ for a column of ones. For two matrices $A$ and $B$, we define element-wise multiplication and element-wise division as $A\odot B$ and $A/B$, respectively. The Kullback–Leibler divergence for measures $\pi, \pi' \in \mathbb{R}^{n\times n}_+$ is defined as the Bregman divergence associated with $H(\cdot)$:
$$\mathrm{KL}(\pi\,|\,\pi') := \langle \pi, \ln\pi - \ln\pi'\rangle + \langle \pi' - \pi, \mathbf{1}\mathbf{1}^T\rangle.$$
We also define a cost matrix $C \in \mathbb{R}^{n\times n}_+$, whose element $c_{ij}$ corresponds to the cost of moving an element of bin $i$ to bin $j$; $\|C\|_\infty$ denotes the maximal element of this matrix. We refer to $\lambda_{\max}(W)$ as the maximum eigenvalue of a symmetric matrix $W$, and $\lambda^+_{\min}(W)$ as its minimal nonzero eigenvalue, and define the condition number of the matrix $W$ as $\chi(W) = \lambda_{\max}(W)/\lambda^+_{\min}(W)$. We say that a function $f : \mathbb{R}^d \to \mathbb{R}$ has $L$-Lipschitz-continuous gradient w.r.t. a norm $\|\cdot\|$ if $\|\nabla f(x) - \nabla f(y)\|_* \le L\|x - y\|$ for all $x, y \in \mathbb{R}^d$, where $\|\cdot\|_*$ is the dual norm defined by $\|g\|_* = \max_{\|x\|\le 1}\langle g, x\rangle$.

### 1.2. Wasserstein barycenters and entropic regularization

Given two probability measures $p, q \in S_n(1)$ and a cost matrix $C \in \mathbb{R}^{n\times n}$, following Cuturi (2013), we define the entropy-regularized OT distance for $\gamma \ge 0$:
$$W_\gamma(p,q) := \min_{\pi\in\Pi(p,q)} \left\{\langle\pi, C\rangle + \gamma H(\pi)\right\}. \quad (1)$$
For $\gamma = 0$ we use the shortcut notation $W(p,q)$ and refer to it as the non-regularized distance.
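As a minimal numerical illustration of definition (1) (our sketch, not code from the paper), $W_\gamma(p,q)$ can be approximated by plain Sinkhorn iterations; all names are illustrative, and no log-domain stabilization or stopping criterion is included:

```python
import numpy as np

def entropic_ot(p, q, C, gamma, n_iter=1000):
    """Approximate W_gamma(p, q) from Eq. (1) by plain Sinkhorn iterations.

    p, q are histograms in S_n(1), C is the cost matrix, gamma > 0.
    A bare sketch: no stabilization, fixed iteration budget.
    """
    K = np.exp(-C / gamma)               # Gibbs kernel
    u = np.ones_like(p)
    v = np.ones_like(q)
    for _ in range(n_iter):              # alternating marginal fitting
        u = p / (K @ v)
        v = q / (K.T @ u)
    pi = u[:, None] * K * v[None, :]     # coupling, approximately in Pi(p, q)
    # objective <pi, C> + gamma * H(pi), with H(pi) = <pi, ln(pi) - 1>
    H = np.sum(pi * (np.log(pi + 1e-300) - 1.0))
    return np.sum(pi * C) + gamma * H, pi

# toy usage: two random histograms with a squared-distance cost on a 1D grid
rng = np.random.default_rng(0)
n = 50
p = rng.random(n); p /= p.sum()
q = rng.random(n); q /= q.sum()
x = np.linspace(0, 1, n)
C = (x[:, None] - x[None, :]) ** 2
val, plan = entropic_ot(p, q, C, gamma=1e-2)
print(val, np.abs(plan.sum(1) - p).sum(), np.abs(plan.sum(0) - q).sum())
```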
For a given set of probability measures $\{p_1,\dots,p_m\}$ and cost matrices $C_1,\dots,C_m \in \mathbb{R}^{n\times n}_+$, we define their weighted regularized barycenter with weights $w \in S_m(1)$ as a solution of the following problem:
$$\min_{q\in S_n(1)} \sum_{l=1}^m w_l W_\gamma(p_l, q), \quad (2)$$
where a solution of this problem for $\gamma = 0$ is referred to as a non-regularized barycenter.

## 2. Complexity of WB by Iterative Bregman Projections

In this section, we provide a theoretical analysis of the Iterative Bregman Projections algorithm (Benamou et al., 2015) for the approximation of the regularized Wasserstein barycenter and obtain the iteration complexity $O(c/(\gamma\varepsilon))$, where $c := \max_{l=1,\dots,m}\|C_l\|_\infty$. Then we estimate the bias introduced by the regularization and estimate the value of $\gamma$ needed to obtain an $\varepsilon$-approximation of the non-regularized barycenter. Combining this result with the iteration complexity of IBP, we obtain a complexity $\widetilde O\left(c^2 mn^2/\varepsilon^2\right)$ for approximating a non-regularized barycenter by the IBP algorithm. This algorithm can be implemented in a centralized distributed manner such that each node performs $\widetilde O\left(c^2 n^2/\varepsilon^2\right)$ arithmetic operations and the number of communication rounds is $\widetilde O\left(c^2/\varepsilon^2\right)$. We also introduce the proximal-IBP algorithm and discuss its complexity and scalability.

### 2.1. Convergence of IBP for the regularized barycenter

In this subsection, we analyze the Iterative Bregman Projection algorithm (Benamou et al., 2015, Section 3.2) and its complexity for solving problem (2). We reformulate this problem as
$$\min_{\substack{q\in S_n(1),\ \pi_l\in\Pi(p_l,q),\\ l=1,\dots,m}} \sum_{l=1}^m w_l\left(\langle\pi_l, C_l\rangle + \gamma H(\pi_l)\right) = \min_{\substack{\pi_l\mathbf{1} = p_l,\ \pi_l^T\mathbf{1} = \pi_{l+1}^T\mathbf{1},\\ \pi_l\in\mathbb{R}^{n\times n}_+,\ l=1,\dots,m}} \sum_{l=1}^m w_l\left(\langle\pi_l, C_l\rangle + \gamma H(\pi_l)\right), \quad (3)$$
and construct its dual (see Lemma 1). To solve the dual problem, we reformulate the IBP algorithm as a blockwise minimization, as shown in Algorithm 1 (this equivalence is a general fact for Dykstra's algorithm). Notably, our reformulation of the IBP algorithm allows us to solve the primal and dual problems simultaneously, and it has an adaptive stopping criterion (see line 9), which does not require calculating any objective values.

Our first step is to recall the IBP algorithm from Benamou et al. (2015). Following their approach, we present problem (3) in a Kullback–Leibler projection form, i.e.,
$$\min_{\pi\in\mathcal{C}_1\cap\mathcal{C}_2} \sum_{l=1}^m w_l\,\mathrm{KL}(\pi_l\,|\,K_l), \quad (4)$$
where $K_l = \exp(-C_l/\gamma)$ and the affine convex sets $\mathcal{C}_1$ and $\mathcal{C}_2$ are
$$\mathcal{C}_1 = \left\{\pi = [\pi_1,\dots,\pi_m] : \forall l\ \ \pi_l\mathbf{1} = p_l\right\},\qquad \mathcal{C}_2 = \left\{\pi = [\pi_1,\dots,\pi_m] : \pi_1^T\mathbf{1} = \dots = \pi_m^T\mathbf{1}\right\}. \quad (5)$$
The IBP algorithm consists in alternating projections onto the sets $\mathcal{C}_1$ and $\mathcal{C}_2$ w.r.t. the Kullback–Leibler divergence; it is a generalization of Sinkhorn's algorithm and a particular case of Dykstra's projection algorithm. This algorithm is equivalent to alternating minimization of the dual problem of (3) derived in Lemma 1, and leads to Algorithm 1.

**Lemma 1.** The dual problem of (3) is (up to a multiplicative constant)
$$\min_{u,v:\ \sum_{l=1}^m w_l v_l = 0} f(u,v) := \sum_{l=1}^m w_l\left(\langle\mathbf{1}, B_l(u_l,v_l)\mathbf{1}\rangle - \langle u_l, p_l\rangle\right), \quad (6)$$
where $u = [u_1,\dots,u_m]$, $v = [v_1,\dots,v_m]$, $u_l, v_l \in \mathbb{R}^n$, and
$$B_l(u_l,v_l) := \mathrm{diag}\left(e^{u_l}\right) K_l\, \mathrm{diag}\left(e^{v_l}\right),\qquad K_l := \exp\left(-\frac{C_l}{\gamma}\right). \quad (7)$$
Moreover, the solution $\pi^*_\gamma$ to (3) is given by the formula $[\pi^*_\gamma]_l = B_l(u^*_l, v^*_l)$, where $(u^*, v^*)$ is the solution of problem (6).

The proof of the Lemma is given in the supplementary materials.

**Algorithm 1** Dual Iterative Bregman Projection

Input: $C_1,\dots,C_m$, $p_1,\dots,p_m$, $\gamma > 0$, $\varepsilon' > 0$
1: $u^0_l := 0$, $v^0_l := 0$, $K_l := \exp(-C_l/\gamma)$, $l = 1,\dots,m$
2: repeat
3: if $t \bmod 2 = 0$ then
4: &nbsp;&nbsp; $u^{t+1}_l := \ln p_l - \ln\left(K_l e^{v^t_l}\right)$, $\quad v^{t+1} := v^t$
5: else
6: &nbsp;&nbsp; $v^{t+1}_l := \sum_{k=1}^m w_k \ln\left(K_k^T e^{u^t_k}\right) - \ln\left(K_l^T e^{u^t_l}\right)$, $\quad u^{t+1} := u^t$
7: end if
8: $t := t + 1$
9: until $\sum_{l=1}^m w_l\left\|B_l(u^t_l, v^t_l)^T\mathbf{1} - q^t\right\|_1 \le \varepsilon'$ and $\sum_{l=1}^m w_l\left\|B_l(u^t_l, v^t_l)\mathbf{1} - p_l\right\|_1 \le \varepsilon'$, where $q^t := \sum_{l=1}^m w_l B_l(u^t_l, v^t_l)^T\mathbf{1}$

Output: $B_1(u^t_1, v^t_1),\dots,B_m(u^t_m, v^t_m)$

Before we move to the analysis of the algorithm, let us discuss its scalability using the framework of centralized distributed computations. This framework includes a master node and slave nodes. Each $l$-th slave node stores $p_l$, $C_l$, $K_l$ and the variables $u^t_l$, $v^t_l$. At each iteration $t$, a slave node calculates $K_l^T e^{u^t_l}$ and sends it to the master node, which aggregates these products as $\sum_{k=1}^m w_k\ln\left(K_k^T e^{u^t_k}\right)$ and sends this sum back to the slave nodes. Based on this information, the slave nodes update $v^t_l$ and $u^t_l$. Thus, the main computational cost, multiplying a matrix by a vector, can be distributed over $m$ slave nodes, and the total working time becomes smaller. It is not clear how this algorithm could be implemented on a general network, for example when the data is produced by a distributed network of sensors without a master node. In contrast, as we illustrate in Section 3, the alternative accelerated-gradient-based approach can be implemented on an arbitrary network.

**Theorem 1.** For a given $\varepsilon' > 0$, Algorithm 1 stops in a number of iterations $N$ satisfying
$$N \le 4 + \frac{88\max_l\|C_l\|_\infty}{\gamma\varepsilon'} = O\left(\frac{\max_l\|C_l\|_\infty}{\gamma\varepsilon'}\right).$$

*Proof.* The proof mainly follows the ideas from (Dvurechensky et al., 2018b) for the Sinkhorn algorithm. First, one can show that the following results hold:

1. for any $t \ge 0$ and $l = 1,\dots,m$,
$$\max_j [v^t_l]_j - \min_j [v^t_l]_j \le R_v := \frac{2\max_l\|C_l\|_\infty}{\gamma};$$

2. for any even $t \ge 2$ we have
$$\delta_t := f(u^t, v^t) - f(u^*, v^*) \le R_v\sum_{l=1}^m w_l\left\|q^t_l - q^t\right\|_1, \quad (8)$$
where $q^t_l := B_l(u^t_l, v^t_l)^T\mathbf{1}$ and $q^t := \sum_{l=1}^m w_l q^t_l$;

3. for any odd $t \ge 1$ the following bound on the change of the objective $f(\cdot,\cdot)$ holds:
$$f(u^t, v^t) - f(u^{t+1}, v^{t+1}) \ge \frac{1}{22}\left(\sum_{l=1}^m w_l\left\|q^t_l - q^t\right\|_1\right)^2. \quad (9)$$

Now let us move to the proof of the complexity bound. To simplify the derivation, we define $\delta_t := f(u^t, v^t) - f(u^*, v^*)$. If $t \ge 2$ is even, then (8), (9), and the stopping criterion give us the bound
$$\delta_t - \delta_{t+1} \ge \max\left\{\frac{(\varepsilon')^2}{22},\ \frac{\delta_t^2}{22 R_v^2}\right\}.$$
If $t$ is odd, then we have at least $\delta_{t+1} \le \delta_t$. These inequalities result in the following estimates:
$$t \le 2 + \frac{22 R_v^2}{\delta_t}, \quad (10)$$
$$k \le 1 + \frac{22(\delta_t - \delta_k)}{(\varepsilon')^2} \le 1 + \frac{22\delta_t}{(\varepsilon')^2}. \quad (11)$$
To combine the two estimates (10) and (11), we consider a switching strategy parametrized by a number $s \in (0, \delta_1)$: for the first $t$ iterations we use (10), until $\delta_t$ becomes smaller than $s$; then we use $s$ as a starting point and estimate the remaining number of iterations $k$ by (11). The quantity $s$ can be found from the minimization of
$$N = t + k \le 4 + \frac{22 s}{(\varepsilon')^2} + \frac{22 R_v^2}{s}.$$
Minimizing the r.h.s. of the latter inequality in $s$ leads to
$$N \le 4 + \frac{44 R_v}{\varepsilon'} \le 4 + \frac{88\max_l\|C_l\|_\infty}{\gamma\varepsilon'}. \qquad\square$$

### 2.2. Approximating the non-regularized WB by IBP

To find an approximate non-regularized barycenter, i.e. a solution to problem (2) with $\gamma = 0$, we apply Algorithm 1 with a suitable choice of $\gamma$ and $\varepsilon'$ and average the marginals $q_1,\dots,q_m$ with weights $w_l$; this leads to Algorithm 2.

**Algorithm 2** Finding Wasserstein barycenter by IBP

Input: Accuracy $\varepsilon$; cost matrices $C_1,\dots,C_m$; marginals $p_1,\dots,p_m$
1: Set $\gamma := \frac{\varepsilon}{4\ln n}$, $\varepsilon' := \frac{\varepsilon}{4\max_l\|C_l\|_\infty}$
2: Find $B_1 := B_1(u^t_1, v^t_1),\dots,B_m := B_m(u^t_m, v^t_m)$ by Algorithm 1 with accuracy $\varepsilon'$
3: $\hat q := \frac{1}{\sum_{l=1}^m w_l\langle\mathbf{1}, B_l\mathbf{1}\rangle}\sum_{l=1}^m w_l B_l^T\mathbf{1}$

Output: $\hat q$

Theorem 2 below presents the complexity bound for Algorithm 2.
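For concreteness, here is a minimal sketch (ours, not the authors' implementation) of the dual updates of Algorithm 1 together with the averaging step of Algorithm 2, assuming strictly positive histograms so that the logarithms are well defined:

```python
import numpy as np

def ibp_barycenter(P, Cs, w, gamma, eps, max_iter=10000):
    """Sketch of Algorithms 1-2: dual IBP for the regularized barycenter.

    P is an (m, n) array of strictly positive histograms p_l, Cs a list of
    cost matrices C_l, w the weights. Returns the approximate barycenter
    and the matrices B_l (useful, e.g., for rounding to transport plans).
    """
    m, n = P.shape
    K = [np.exp(-Cs[l] / gamma) for l in range(m)]   # Gibbs kernels K_l
    u, v = np.zeros((m, n)), np.zeros((m, n))
    ones = np.ones(n)
    for t in range(max_iter):
        if t % 2 == 0:   # fit the row marginals p_l (line 4)
            for l in range(m):
                u[l] = np.log(P[l]) - np.log(K[l] @ np.exp(v[l]))
        else:            # equalize column marginals via weighted geometric mean (line 6)
            s = [np.log(K[l].T @ np.exp(u[l])) for l in range(m)]
            ln_q = sum(w[l] * s[l] for l in range(m))
            for l in range(m):
                v[l] = ln_q - s[l]
        # adaptive stopping criterion (line 9)
        B = [np.exp(u[l])[:, None] * K[l] * np.exp(v[l])[None, :] for l in range(m)]
        q = sum(w[l] * (B[l].T @ ones) for l in range(m))
        err_q = sum(w[l] * np.abs(B[l].T @ ones - q).sum() for l in range(m))
        err_p = sum(w[l] * np.abs(B[l] @ ones - P[l]).sum() for l in range(m))
        if err_q <= eps and err_p <= eps:
            break
    # averaging step of Algorithm 2 (line 3)
    denom = sum(w[l] * B[l].sum() for l in range(m))
    return q / denom, B
```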
**Theorem 2.** For any $\varepsilon > 0$, Algorithm 2 returns $\hat q \in S_n(1)$ s.t.
$$\sum_{l=1}^m w_l W(p_l, \hat q) - \sum_{l=1}^m w_l W(p_l, q^*) \le \varepsilon,$$
where $q^*$ is a solution of the non-regularized problem (2) with $\gamma = 0$. Moreover, it requires
$$O\left(\frac{\max_l\|C_l\|^2_\infty \ln n}{\varepsilon^2}\, M_{m,n} + mn\right)$$
arithmetic operations, where $M_{m,n}$ is the time complexity of one iteration of Algorithm 1.

**Remark 1.** As each iteration of Algorithm 1 requires $m$ matrix-vector multiplications, the general bound is $M_{m,n} = O(mn^2)$. However, for some specific forms of the matrices $C_l$ it is possible to achieve better complexity, e.g. $M_{m,n} = O(mn\log n)$ via FFT¹ (Peyré & Cuturi, 2018), or $M_{m,n} = O\left(n\sum_l \mathrm{rank}(C_l)\right)$ for low-rank matrices.

¹ The FFT approach is stable only for large enough $\gamma$, which is the case for the proximal method; see Subsection 2.3.

*Proof.* Let $\pi^* = [\pi^*_1,\dots,\pi^*_m]$ be a solution to the non-regularized problem. Equation (7) and duality yield
$$\sum_{l=1}^m w_l\langle C_l, B_l\rangle \le \sum_{l=1}^m w_l\left(\langle C_l, \pi^*_l\rangle + \gamma H(\pi^*_l) - \gamma H(B_l)\right) + \max_l\|C_l\|_\infty\,\varepsilon' \le \sum_{l=1}^m w_l W(p_l, q^*) + 2\gamma\ln n + \max_l\|C_l\|_\infty\,\varepsilon'.$$
Here we used the inequalities $-2\ln n \le H(\pi) + 1 \le 0$ holding on $S_{n\times n}(1)$. Consider $\check B_l \in \Pi(p_l, \hat q)$ s.t.
$$\left\|\check B_l - B_l\right\|_1 \le \left\|B_l\mathbf{1} - p_l\right\|_1 + 2\sum_j\left[B_l^T\mathbf{1} - \hat q\right]^+_j$$
for all $l = 1,\dots,m$. Their existence immediately follows from the proof of Theorem 4 in (Altschuler et al., 2017). If the stopping time $t$ is even, then $B_l\mathbf{1} = p_l$, therefore $\hat q = q^t$ and $\|B_l - \check B_l\|_1 \le \|q^t_l - q^t\|_1$. If $t$ is odd, then $B_l^T\mathbf{1} = q^t = \hat q$ and $\|B_l - \check B_l\|_1 \le \|B_l\mathbf{1} - p_l\|_1$. In both cases it follows from the stopping criterion that $\sum_{l=1}^m w_l\|B_l - \check B_l\|_1 \le \varepsilon'$. Since $\check B_l \in \Pi(p_l, \hat q)$ for all $1 \le l \le m$, one has
$$\sum_{l=1}^m w_l W(p_l, \hat q) \le \sum_{l=1}^m w_l\langle C_l, \check B_l\rangle \le \sum_{l=1}^m w_l\langle C_l, B_l\rangle + \max_l\|C_l\|_\infty\sum_{l=1}^m w_l\left\|B_l - \check B_l\right\|_1 \le \sum_{l=1}^m w_l W(p_l, q^*) + \varepsilon.$$
The complexity bound for the algorithm is a simple corollary of Theorem 1. $\square$

**Remark 2.** Notice that, according to the proof of Theorem 2, one can also reconstruct approximate optimal transport plans $\check B_l$ between $p_l$ and the approximate barycenter $\hat q$ using Algorithm 2 from (Altschuler et al., 2017).

### 2.3. Proximal IBP for the Wasserstein barycenter problem

As we see from Theorems 1 and 2, to obtain an $\varepsilon$-approximation of the non-regularized barycenter, the regularization parameter $\gamma$ should be chosen proportional to the desired accuracy $\varepsilon$, and the complexity of IBP is inversely proportional to $\gamma$, which leads to a large working time and instability issues. To overcome this obstacle, we propose a novel proximal-IBP algorithm. Similarly to (Xie et al., 2018), where this idea is used for the Wasserstein distance, our method is inspired by the proximal point algorithm with a general Bregman divergence $V(x,y)$ (Chen & Teboulle, 1993). The idea of this algorithm for the minimization of a function $f(x)$ is to perform steps $x^{k+1} = \mathrm{prox}(x^k) = \arg\min_{x\in Q}\{f(x) + \gamma V(x, x^k)\}$. We use the KL divergence as the Bregman divergence, since in this case the proximal step leads to a problem similar to the entropic-regularized WB problem (2). Given the sets $\mathcal{C}_1, \mathcal{C}_2$ defined in (5), we define the proximal operator $\mathrm{prox} : \mathcal{C}_1\cap\mathcal{C}_2 \to \mathcal{C}_1\cap\mathcal{C}_2$ for the function $\sum_{l=1}^m w_l W(p_l, q_l)$ as follows:
$$\mathrm{prox}(\pi^k) = \operatorname*{argmin}_{\pi\in\mathcal{C}_1\cap\mathcal{C}_2}\sum_{l=1}^m w_l\left(\langle C_l, \pi_l\rangle + \gamma\,\mathrm{KL}(\pi_l\,|\,\pi^k_l)\right) = \operatorname*{argmin}_{\pi\in\mathcal{C}_1\cap\mathcal{C}_2}\sum_{l=1}^m w_l\,\mathrm{KL}\!\left(\pi_l\,\Big|\,\pi^k_l\odot\exp\left(-\frac{C_l}{\gamma}\right)\right).$$
The proximal point method then has the form
$$\pi^{k+1} = \mathrm{prox}(\pi^k). \quad (12)$$
We then use Iterative Bregman Projections for finding the barycenter. We underline that in this setting there is no need to choose $\gamma$ small, as is prescribed by Theorem 2. Algorithm 3 has two loops: an external loop of proximal gradient steps, and an inner loop computing the next iterate $\pi^t$ by IBP and, as a byproduct, an approximation $q^t$ to the barycenter. The number of external iterations is proportional to $\gamma/\varepsilon$, see (Chen & Teboulle, 1993), and the complexity of the inner loop is $O\left(\|C^t_l\|_\infty/(\gamma\varepsilon')\right)$.
By slightly modifying the algorithm Round and the vectors $p_l$, we can ensure that all $[\pi^t_l]_{ij} \ge \varepsilon/n^2$; then it is enough to choose $\varepsilon'$ proportional to $\varepsilon^3/n^2$, and the inner loop complexity is $\widetilde O\left(n^2\|C_l\|^2_\infty/(\gamma\varepsilon^3)\right)$. However, experiments show that this estimate is too pessimistic, and in practice the number of inner iterations is much smaller, see Section 4. In practice, one should try to find the optimal $\gamma$ by using a restart procedure on the first external iteration. That is, we start with a large enough $\gamma$ and solve the internal problem by IBP, then put $\gamma := \gamma/2$ and solve the internal problem once again. We stop repeating this procedure at the moment when the complexity of the internal problem grows significantly. This moment allows us to detect the optimal value of $\gamma$; on the subsequent external iterations one may use this $\gamma$. Algorithm 3 can be implemented in the centralized distributed setting in the same way as Algorithm 1.

**Algorithm 3** Finding Wasserstein barycenter by proximal IBP

Input: number of iterations $T$, $\gamma > 0$, accuracy $\varepsilon$ for the inner problem, starting transport plans $\pi^0_l := \frac{1}{n}p_l\mathbf{1}^T$, $l = 1,\dots,m$
1: for $t = 0,\dots,T-1$ do
2: &nbsp;&nbsp; Run Algorithm 1 with cost matrices $C^t_l := C_l - \gamma\ln\pi^t_l$, parameter $\gamma$, and accuracy $\varepsilon' := \frac{\varepsilon}{\max_l\|C^t_l\|_\infty}$, and obtain matrices $B_1,\dots,B_m$
3: &nbsp;&nbsp; $\pi^{t+1}_l := \mathrm{Round}\left(B_l, p_l, q^{t+1}\right) \in \Pi(p_l, q^{t+1})$, where Round is Algorithm 2 from (Altschuler et al., 2017) and $q^{t+1} := \sum_{l=1}^m w_l B_l^T\mathbf{1}$
4: end for

Output: $q^T$
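To make the structure of Algorithm 3 concrete, here is an outline (ours, not the authors' implementation) that reuses the `ibp_barycenter` sketch given after Algorithm 2; `round_plan` follows Algorithm 2 of Altschuler et al. (2017), and the clipping of $\pi^t_l$ away from zero stands in for the modification keeping $[\pi^t_l]_{ij} \ge \varepsilon/n^2$ discussed above:

```python
import numpy as np

def round_plan(B, p, q):
    """Rounding onto Pi(p, q), following Algorithm 2 of Altschuler et al. (2017)."""
    x = np.minimum(p / np.maximum(B.sum(1), 1e-300), 1.0)   # scale rows down
    F = B * x[:, None]
    y = np.minimum(q / np.maximum(F.sum(0), 1e-300), 1.0)   # scale columns down
    F = F * y[None, :]
    err_r, err_c = p - F.sum(1), q - F.sum(0)               # nonnegative deficits
    s = err_r.sum()
    return F if s <= 0 else F + np.outer(err_r, err_c) / s  # redistribute the deficit

def prox_ibp(P, Cs, w, gamma, eps_inner, T):
    """Outline of Algorithm 3: proximal point steps (12), each solved by IBP."""
    m, n = P.shape
    pi = [np.outer(P[l], np.ones(n)) / n for l in range(m)]  # pi^0_l = p_l 1^T / n
    q = None
    for _ in range(T):
        # shifted costs C_l - gamma * ln(pi_l); the clip is a stand-in for the
        # modification that keeps pi_l bounded away from zero
        Ct = [Cs[l] - gamma * np.log(np.maximum(pi[l], 1e-12)) for l in range(m)]
        q, B = ibp_barycenter(P, Ct, w, gamma, eps_inner)    # inner IBP run (line 2)
        pi = [round_plan(B[l], P[l], q) for l in range(m)]   # project onto Pi(p_l, q) (line 3)
    return q
```

Note that the sketch returns the normalized barycenter estimate from `ibp_barycenter` rather than the raw sum $\sum_l w_l B_l^T\mathbf{1}$; for a near-feasible $B_l$ the two coincide up to the targeted accuracy.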
## 3. Complexity by Primal-Dual Accelerated Gradient Descent

In this section, we consider the primal-dual accelerated gradient descent method for approximating Wasserstein barycenters. First, we consider the regularized barycenter, construct a dual problem to (2), and apply primal-dual accelerated gradient descent to solve it and approximate the regularized barycenter. Our dual problem is constructed via a matrix $\bar W$, which can be quite general. We explain how the choice of this matrix is connected to distributed optimization and allows the algorithm to be implemented in the decentralized distributed setting. Then we show how the regularization parameter should be chosen in order to obtain an $\varepsilon$-approximation of the non-regularized Wasserstein barycenter, and estimate the complexity of the resulting algorithm. These algorithms can be implemented in a decentralized distributed manner such that each node fulfils $\widetilde O(n^{2.5}/\varepsilon)$ arithmetic operations and the number of communication rounds is $\widetilde O(\sqrt{n}/\varepsilon)$.

### 3.1. Consensus view on the Wasserstein barycenter problem

We rewrite problem (2) in an equivalent way as
$$\min_{\substack{q_1,\dots,q_m\in S_n(1)\\ q_1=\dots=q_m}} \mathcal{W}_\gamma(p,q) := \sum_{l=1}^m w_l W_{\gamma(l)}(p_l, q_l), \quad (13)$$
where $p = [p_1,\dots,p_m]^T$ and $q = [q_1,\dots,q_m]^T$; we also use a different regularizer $\gamma_l = \gamma(l)$ for each $l$-th Wasserstein distance. Next, we write a dual problem by dualizing the equality constraints $q_1 = \dots = q_m$. This can be done in many different ways and, following (Lan et al., 2017; Scaman et al., 2017; Uribe et al., 2018), we do it by introducing a symmetric positive semidefinite matrix $\bar W \in \mathbb{R}^{m\times m}$ s.t. $\mathrm{Ker}(\bar W) = \mathrm{span}(\mathbf{1})$. Then, defining $W = \bar W \otimes I_n$ and using the fact that $q_1 = \dots = q_m \Leftrightarrow \sqrt{W}q = 0$, we equivalently rewrite problem (13) as
$$\min_{\substack{q_1,\dots,q_m\in S_n(1)\\ \sqrt{W}q = 0}} \sum_{l=1}^m w_l W_{\gamma(l)}(p_l, q_l). \quad (14)$$
Therefore, we obtain the dual problem
$$\min_{u\in\mathbb{R}^{mn}}\max_{q\in\mathbb{R}^{mn}}\left\{\langle u, \sqrt{W}q\rangle - \sum_{l=1}^m w_l W_{\gamma(l)}(p_l, q_l)\right\} = \min_{u\in\mathbb{R}^{mn}} \sum_{l=1}^m w_l\, W^*_{\gamma(l),p_l}\!\left([\sqrt{W}u]_l/w_l\right), \quad (15)$$
where $W^*_{\gamma(l),p_l}(\cdot)$ is the Fenchel–Legendre transform of $W_{\gamma(l)}(p_l,\cdot)$ and $[\sqrt{W}u]_l$ represents the $l$-th $n$-dimensional block of the vector $\sqrt{W}u$.

Importantly, the objective in the dual problem (15) has $L$-Lipschitz-continuous gradient, where the constant $L$ is estimated below in Lemma 2. Since the dual problem is smooth, we apply primal-dual accelerated gradient descent, Algorithm 4, to solve the constructed pair of primal and dual problems. Before we move to the theoretical analysis of the algorithm, let us discuss the scalability of Algorithm 4. Assume that we have an arbitrary network of agents given by a connected undirected graph $G = (V, E)$ without self-loops, with a set $V$ of $m$ vertices and a set of edges $E = \{(i,j) : i,j\in V\}$. Then the matrix $\bar W$ can be chosen as the Laplacian matrix of this graph, defined by a) $[\bar W]_{ij} = -1$ if $(i,j)\in E$; b) $[\bar W]_{ij} = \deg(i)$ if $i = j$; c) $[\bar W]_{ij} = 0$ otherwise. Here $\deg(i)$ is the degree of node $i$, i.e., the number of its neighbors. We assume that agent $i$ can communicate with agent $j$ if and only if $(i,j)\in E$. In particular, the Laplacian matrix of the star graph, which corresponds to the centralized distributed computations discussed in Section 2, is
$$\bar W:\quad \bar W_{ii} = 1,\ \ \bar W_{im} = \bar W_{mi} = -1\ \ \text{for } i = 1,\dots,m-1,\qquad \bar W_{mm} = m - 1. \quad (16)$$
Algorithm 4 allows calculations to be performed over an arbitrary connected undirected network of agents. This is in contrast to the IBP algorithm, as discussed in Section 2. For simplicity and for comparison with the complexity of the IBP algorithm, we analyze the complexity of Algorithm 4 as if it were implemented on a single machine, disregarding that it can be used in the distributed setup.

**Algorithm 4** Accelerated Distributed Computation of Wasserstein Barycenter

Input: Each agent $l\in V$ is assigned its measure $p_l$ and an upper bound $L$ for the Lipschitz constant of the gradient of the dual objective.
1: Each agent finds $\tilde p_l \in S_n(1)$ s.t. $\|p_l - \tilde p_l\|_1 \le \varepsilon/4$ and $\min_i[\tilde p_l]_i \ge \varepsilon/(8n)$, e.g. $\tilde p_l = \left(1 - \frac{\varepsilon}{8}\right)\left(p_l + \frac{\varepsilon}{n(8-\varepsilon)}\mathbf{1}\right)$, and redefines $p_l := \tilde p_l$; each agent sets $\gamma(l) = \frac{\varepsilon}{4mw_l\ln n}$, $\eta^0_l = \zeta^0_l = \lambda^0_l = \hat q^0_l = 0 \in \mathbb{R}^n$, $A_0 = \alpha_0 = 0$, and $N$.
2: For each agent $l\in V$:
3: for $k = 0,\dots,N-1$ do
4: &nbsp;&nbsp; Find $\alpha_{k+1}$ as the largest root of the equation $A_{k+1} := A_k + \alpha_{k+1} = 2L\alpha^2_{k+1}$.
5: &nbsp;&nbsp; $\lambda^{k+1}_l = \left(\alpha_{k+1}\zeta^k_l + A_k\eta^k_l\right)/A_{k+1}$.
6: &nbsp;&nbsp; Calculate $\nabla W^*_{\gamma(l),p_l}(\lambda^{k+1}_l)$:
$$\left[\nabla W^*_{\gamma(l),p_l}(\lambda^{k+1}_l)\right]_i = \sum_{j=1}^n [p_l]_j\,\frac{\exp\left(\left([\lambda^{k+1}_l]_i - [C_l]_{ij}\right)/\gamma(l)\right)}{\sum_{r=1}^n \exp\left(\left([\lambda^{k+1}_l]_r - [C_l]_{rj}\right)/\gamma(l)\right)},$$
where $[\lambda]_i$ denotes the $i$-th component of a vector $\lambda$.
7: &nbsp;&nbsp; Share $\nabla W^*_{\gamma(l),p_l}(\lambda^{k+1}_l)$ with $\{j \mid (l,j)\in E\}$.
8: &nbsp;&nbsp; $\zeta^{k+1}_l = \zeta^k_l - \alpha_{k+1}\sum_{j=1}^m \bar W_{lj}\nabla W^*_{\gamma(j),p_j}(\lambda^{k+1}_j)$. {Gradient step}
9: &nbsp;&nbsp; $\eta^{k+1}_l = \left(\alpha_{k+1}\zeta^{k+1}_l + A_k\eta^k_l\right)/A_{k+1}$. {Extrapolation step}
10: &nbsp;&nbsp; $\hat q^{k+1}_l = \frac{1}{A_{k+1}}\sum_{i=0}^{k+1}\alpha_i q_l(\lambda^i_l) = \left(\alpha_{k+1}q_l(\lambda^{k+1}_l) + A_k\hat q^k_l\right)/A_{k+1}$, where $q_l(\cdot) = \nabla W^*_{\gamma(l),p_l}(\cdot)$ is defined in step 6. {Primal update}
11: end for

Output: $\hat q^N = [(\hat q^N_1)^T,\dots,(\hat q^N_m)^T]^T$.

**Theorem 3.** After
$$N = \frac{\sqrt{64\,\chi(\bar W)\,mn\ln n\,\sum_{l=1}^m w_l^2\|C_l\|^2_\infty}}{\varepsilon}$$
iterations, Algorithm 4 generates an $\varepsilon$-solution of problem (2) with $\gamma = 0$, i.e. finds a vector $\hat q^N = [(\hat q^N_1)^T,\dots,(\hat q^N_m)^T]^T$ s.t.
$$\sum_{l=1}^m w_l W(p_l, \hat q^N_l) - \sum_{l=1}^m w_l W(p_l, q^*) \le \varepsilon,\qquad \left\|\sqrt{W}\hat q^N\right\|_2 \le \varepsilon/(2R), \quad (17)$$
where $q^*$ is a non-regularized barycenter, i.e. a solution to (2) with $\gamma = 0$, and $R$ is the bound on the solution of the dual problem given in Lemma 3. The total number of arithmetic operations is
$$O\left(N\,n\left(mn + \mathrm{nnz}(\bar W)\right)\right).$$
Moreover, there exists a choice of the matrix $\bar W$ such that, if, without loss of generality, $\|C_l\|_\infty \le 1$ and the weights are $w_l = 1/m$, $l = 1,\dots,m$, the complexity of approximating the non-regularized barycenter by Algorithm 4 is $\widetilde O(mn^{2.5}/\varepsilon)$.
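Before turning to the proof, note that step 6 is the computational core of Algorithm 4: the gradient of the Fenchel–Legendre transform is a $p_l$-weighted mixture of column-wise softmax distributions and costs $O(n^2)$ per evaluation. A minimal sketch of this single step (our code; names are illustrative):

```python
import numpy as np
from scipy.special import softmax

def grad_W_star(lam, p, C, gamma):
    """Gradient of the Fenchel-Legendre transform W*_{gamma,p} at lam (step 6).

    Column j of S is the softmax over i of (lam_i - C_ij) / gamma, so the
    result is the p-weighted mixture sum_j p_j * S[:, j]; scipy's softmax
    applies the usual max-subtraction for numerical stability.
    """
    S = softmax((lam[:, None] - C) / gamma, axis=0)
    return S @ p

# each agent l computes grad_W_star(lam_l, p_l, C_l, gamma_l) locally, then
# exchanges the resulting n-vector with its neighbors for the gradient step
# (line 8); the same vector is the primal iterate q_l(lambda_l) averaged in line 10
```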
The proof is based on the following complexity theorem for primal-dual accelerated gradient descent applied to the particular pair of primal-dual problems (13)–(15).

**Theorem 4** (see Theorem 2 in (Dvurechensky et al., 2017)). Let accelerated primal-dual gradient descent be applied to the pair of problems (13)–(15). Then the inequalities
$$\sum_{l=1}^m w_l W_{\gamma(l)}(p_l, \hat q^N_l) - \sum_{l=1}^m w_l W_{\gamma(l)}(p_l, q^*_\gamma) \le \varepsilon/2,\qquad \left\|\sqrt{W}\hat q^N\right\|_2 \le \varepsilon/(2R) \quad (18)$$
hold no later than after $N = \sqrt{32LR^2/\varepsilon}$ iterations, where $L$ is the Lipschitz constant of the gradient of the dual objective and $R$ is such that $\|u^*\|_2 \le R$ for an optimal dual solution $u^*$.

Our next steps are to bound $L$ in the next lemma and $R$ in Lemma 3, inspired by (Lan et al., 2017).

**Lemma 2.** Let in (13) $\gamma(l) = \gamma/w_l$ for some $\gamma > 0$, and let $\mathcal{W}^*_\gamma(u)$ denote the dual objective in (15). Then its gradient is $L$-Lipschitz continuous w.r.t. the 2-norm with $L = \lambda_{\max}(W)/\gamma$.

**Lemma 3.** Let $q^*_\gamma$ be the optimal solution of problem (2) with minimal 2-norm. Then there exists an optimal dual solution $u^* = [u^*_1,\dots,u^*_m]$ of problem (15) satisfying $\|u^*\|_2 \le R$ with
$$R^2 = \frac{2n\sum_{l=1}^m w_l^2\|C_l\|^2_\infty}{\lambda^+_{\min}(W)}, \quad (19)$$
where $\lambda^+_{\min}(W)$ is the minimal positive eigenvalue of the matrix $W$.

*Proof of Theorem 3.* Using Theorem 4 and the fact that $\mathrm{KL}(\pi\,|\,\theta) \in [0, 2\ln n]$, we get the following inequality:
$$\sum_{l=1}^m w_l W(p_l, \hat q^N_l) - \sum_{l=1}^m w_l W(p_l, q^*) \le \varepsilon/2 + 2\ln n\sum_{l=1}^m w_l\gamma(l). \quad (20)$$
Since $\gamma(l) = \gamma/w_l$ with $\gamma = \varepsilon/(4m\ln n)$, we obtain that inequality (17) holds. Combining the value of $\gamma$ and the bounds on $L$ from Lemma 2 and on $R$ from Lemma 3 with the estimate for $N$ in Theorem 4 and the fact that $\chi(W) = \chi(\bar W)$, we obtain the desired estimate for the number of iterations of the algorithm.

Let us estimate the complexity of the algorithm. For each $l$ we need to calculate the gradient $\nabla W^*_{\gamma(l),p_l}(\cdot)$, which requires $O(n^2)$ arithmetic operations. To calculate $\sum_{j=1}^m \bar W_{lj}\nabla W^*_{\gamma(j),p_j}(\lambda^{k+1}_j)$, one needs $O(n\,\mathrm{nnz}(\bar W_l))$ arithmetic operations, where $\mathrm{nnz}(\bar W_l)$ is the number of nonzero elements in the $l$-th row of the matrix $\bar W$: indeed, the vector $\nabla W^*_{\gamma(j),p_j}(\cdot)$ has dimension $n$, and the block $W_{lj}$ is diagonal for each $l, j = 1,\dots,m$. All other operations require $O(n)$ arithmetic operations. Hence, the complexity of one iteration is
$$O\left(\sum_{l=1}^m\left(n^2 + n\,\mathrm{nnz}(\bar W_l)\right)\right) = O\left(mn^2 + n\,\mathrm{nnz}(\bar W)\right),$$
and the total complexity follows from multiplying this value by $N$. As for the choice of $\bar W$, one can show (by using graph sparsifiers) that it can be chosen such that $\chi(W) = \chi(\bar W) = O(\mathrm{Poly}(\ln m))$ and $\mathrm{nnz}(\bar W) = O(m\,\mathrm{Poly}(\ln m))$. For details on graph sparsifiers we refer to (Vaidya, 1990; Bern et al., 2006; Spielman & Teng, 2014). Substituting the weights $w_l = 1/m$, $l = 1,\dots,m$, into the bound for $N$, we obtain that the complexity of approximating the non-regularized barycenter by Algorithm 4 is $\widetilde O(mn^{2.5}/\varepsilon)$. In the distributed setting, each of the $m$ nodes makes $\widetilde O(n^{2.5}/\varepsilon)$ arithmetic operations, while the number of communication rounds is $\widetilde O(\sqrt{n}/\varepsilon)$. $\square$

## 4. Numerical Analysis

In this section, we provide a numerical analysis of the three algorithms for the computation of approximate Wasserstein barycenters. We compare their iteration performance on the problem of computing the barycenter of a set of 15 discrete truncated Gaussian distributions. Figure 1 (Left) shows the distance to optimality versus the iteration count for the IBP method and the Prox-IBP method. For the Prox-IBP method, we show the performance for four different cases, namely: $\gamma = 1$, $\gamma = 0.1$, $\gamma = 0.01$, and a varying schedule with $\gamma_{k+1} = \gamma_k/2$ if $\gamma_k \ge 10^{-3}$ ($\gamma_0 = 10$). Figure 1 (Right) shows the number of iterations required in the inner loop step of Algorithm 3 (line 2) to reach the desired accuracy $\varepsilon'$ for the same choices of $\gamma$. The results show that for smaller values of $\gamma$ the inner problem requires a larger number of iterations. In particular, for $\gamma = 1$ the inner problem is relatively computationally inexpensive, but the convergence of the overall method is slow. On the other hand, with varying values of $\gamma$ an accurate barycenter is found with low computational cost from the start.

*Figure 1. Distance to an optimal barycenter for the IBP method and the Prox-IBP method.*
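The iteration bound of Theorem 3 depends on the network only through $\sqrt{\chi(\bar W)}$. As a small illustration (our sketch, not the paper's experimental code), the following builds graph Laplacians $\bar W$ for some of the topologies compared in Figure 2 below and prints their condition numbers:

```python
import numpy as np

def laplacian(adj):
    """Graph Laplacian bar{W} = D - A, cf. the definition in Section 3.1."""
    return np.diag(adj.sum(axis=1)) - adj

def chi(L):
    """Condition number chi(bar{W}) = lambda_max / lambda_min^+ ."""
    ev = np.linalg.eigvalsh(L)
    pos = ev[ev > 1e-9]
    return pos.max() / pos.min()

m = 15
complete = np.ones((m, m)) - np.eye(m)
star = np.zeros((m, m)); star[:-1, -1] = star[-1, :-1] = 1   # master = node m, cf. Eq. (16)
path = np.zeros((m, m))
for i in range(m - 1):
    path[i, i + 1] = path[i + 1, i] = 1

for name, A in [("complete", complete), ("star", star), ("path", path)]:
    print(name, chi(laplacian(A)))   # chi grows from 1 (complete) to O(m^2) (path)
```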
Figure 2 shows the performance of the primal-dual accelerated gradient descent method. Recall that this method is particularly suited for decentralized distributed approaches, where the computation is performed over an arbitrary network. We show the distance to optimality and the distance to consensus for the approximate barycenters generated by Algorithm 4.

*Figure 2. Optimality gap and consensus gap for the primal-dual accelerated gradient descent method for four classes of networks: complete, star, path and Erdős–Rényi random graph.*

Table 1 shows the numerical values of the optimality gap for a subset of the experiments shown above. The Prox-IBP algorithm converges much faster than the other two, in exchange for a higher computational load per iteration.

*Table 1. Optimality gap for the approximate barycenter.*

| Iter. | Prox IBP, $\gamma\to\gamma/2$ | Prox IBP, $\gamma = 0.01$ | Prox IBP, $\gamma = 1$ | IBP | Algo. 4, complete | Algo. 4, Erdős–Rényi |
|---|---|---|---|---|---|---|
| 50 | 8.797e-4 | 1.779e-3 | 9.856e-2 | 0.2585 | 1.286 | 1.294 |
| 1000 | 4.17e-07 | 6.818e-2 | — | 0.2585 | 0.471 | 1.041 |
| 2000 | — | 4.201e-2 | — | 0.0741 | 0.111 | 0.463 |
| 3000 | — | 1.830e-2 | — | 0.0691 | 4.814e-2 | 0.226 |
| 4000 | — | 6.408e-3 | — | 0.0534 | 2.797e-2 | 0.135 |

## 5. Conclusion

In this paper, we show that the IBP algorithm from Benamou et al. (2015) for the Wasserstein barycenter problem can be implemented in a centralized distributed manner such that each node requires $\widetilde O(n^2/\varepsilon^2)$ arithmetic operations and the number of communication rounds is $\widetilde O(1/\varepsilon^2)$. We note that a proper proximal envelope of this algorithm can sometimes give a significant acceleration. We also describe an accelerated primal-dual gradient algorithm for the same problem. The proposed algorithm can be implemented in a more general decentralized distributed setting such that each node fulfils $\widetilde O(n^{2.5}/\varepsilon)$ arithmetic operations and the number of communication rounds is $\widetilde O(\sqrt{n}/\varepsilon)$.

## Acknowledgements

This research was funded by the Russian Science Foundation (project 18-71-10108).

## References

Altschuler, J., Weed, J., and Rigollet, P. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 1961–1971. Curran Associates, Inc., 2017. arXiv:1705.09634.

Anikin, A. S., Gasnikov, A. V., Dvurechensky, P. E., Tyurin, A. I., and Chernov, A. V. Dual approaches to the minimization of strongly convex functionals with a simple structure under affine constraints. Computational Mathematics and Mathematical Physics, 57(8):1262–1276, 2017.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN. arXiv:1701.07875, 2017.

Benamou, J.-D., Carlier, G., Cuturi, M., Nenna, L., and Peyré, G. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.

Bern, M., Gilbert, J. R., Hendrickson, B., Nguyen, N., and Toledo, S. Support-graph preconditioners.
SIAM Journal on Matrix Analysis and Applications, 27(4):930–951, 2006.

Bertsekas, D. P. and Tsitsiklis, J. N. Parallel and Distributed Computation: Numerical Methods, volume 23. Prentice Hall, Englewood Cliffs, NJ, 1989.

Bigot, J., Klein, T., et al. Consistent estimation of a population barycenter in the Wasserstein space. ArXiv e-prints, 2012.

Bigot, J., Gouet, R., Klein, T., and López, A. Geodesic PCA in the Wasserstein space by convex PCA. Ann. Inst. H. Poincaré Probab. Statist., 53(1):1–26, 2017.

Blanchet, J., Jambulapati, A., Kent, C., and Sidford, A. Towards optimal running times for optimal transport. arXiv:1810.07717, 2018.

Bonneel, N., Rabin, J., Peyré, G., and Pfister, H. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015. ISSN 1573-7683. doi: 10.1007/s10851-014-0506-3. URL https://doi.org/10.1007/s10851-014-0506-3.

Chakrabarty, D. and Khanna, S. Better and simpler error analysis of the Sinkhorn–Knopp algorithm for matrix scaling. arXiv:1801.02790, 2018.

Chen, G. and Teboulle, M. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM Journal on Optimization, 3(3):538–543, 1993.

Chernov, A., Dvurechensky, P., and Gasnikov, A. Fast primal-dual gradient method for strongly convex minimization problems with linear constraints. In Kochetov, Y., Khachay, M., Beresnev, V., Nurminski, E., and Pardalos, P. (eds.), Discrete Optimization and Operations Research: 9th International Conference, DOOR 2016, Vladivostok, Russia, September 19-23, 2016, Proceedings, pp. 391–403. Springer International Publishing, 2016.

Claici, S., Chien, E., and Solomon, J. Stochastic Wasserstein barycenters. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 999–1008. PMLR, 2018. URL http://proceedings.mlr.press/v80/claici18a.html.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 26, pp. 2292–2300. Curran Associates, Inc., 2013.

Cuturi, M. and Doucet, A. Fast computation of Wasserstein barycenters. In Xing, E. P. and Jebara, T. (eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 685–693, Bejing, China, 2014. PMLR. URL http://proceedings.mlr.press/v32/cuturi14.html.

Cuturi, M. and Peyré, G. A smoothed dual approach for variational Wasserstein problems. SIAM Journal on Imaging Sciences, 9(1):320–343, 2016.

Del Barrio, E., Lescornel, H., and Loubes, J.-M. A statistical analysis of a deformation model with Wasserstein barycenters: estimation procedure and goodness of fit test. arXiv:1508.06465, 2015.

Dvurechensky, P., Gasnikov, A., Gasnikova, E., Matsievsky, S., Rodomanov, A., and Usik, I. Primal-dual method for searching equilibrium in hierarchical congestion population games. In Supplementary Proceedings of the 9th International Conference on Discrete Optimization and Operations Research and Scientific School (DOOR 2016), Vladivostok, Russia, September 19-23, 2016, pp. 584–595, 2016. arXiv:1606.08988.

Dvurechensky, P., Gasnikov, A., Omelchenko, S., and Tiurin, A.
Adaptive similar triangles method: a stable alternative to Sinkhorn's algorithm for regularized optimal transport. arXiv:1706.07622, 2017.

Dvurechensky, P., Dvinskikh, D., Gasnikov, A., Uribe, C. A., and Nedić, A. Decentralize and randomize: Faster algorithm for Wasserstein barycenters. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, NeurIPS 2018, pp. 10783–10793. Curran Associates, Inc., 2018a. arXiv:1802.04367.

Dvurechensky, P., Gasnikov, A., and Kroshnin, A. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn's algorithm. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1367–1376, 2018b. arXiv:1802.04367.

Ebert, J., Spokoiny, V., and Suvorikova, A. Construction of non-asymptotic confidence sets in 2-Wasserstein space. arXiv:1703.03658, 2017.

Gasnikov, A. V., Gasnikova, E. V., Nesterov, Y. E., and Chernov, A. V. Efficient numerical methods for entropy-linear programming problems. Computational Mathematics and Mathematical Physics, 56(4):514–524, 2016.

Genevay, A., Cuturi, M., Peyré, G., and Bach, F. Stochastic optimization for large-scale optimal transport. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 3440–3448. Curran Associates, Inc., 2016.

He, L., Bian, A., and Jaggi, M. COLA: Decentralized linear learning. In Advances in Neural Information Processing Systems, pp. 4541–4551, 2018.

Ho, N., Nguyen, X., Yurochkin, M., Bui, H. H., Huynh, V., and Phung, D. Multilevel clustering via Wasserstein means. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 1501–1509. PMLR, 2017.

Kantorovich, L. On the translocation of masses. Doklady Acad. Sci. USSR (N.S.), 37:199–201, 1942.

Kroshnin, A., Dvinskikh, D., Dvurechensky, P., Gasnikov, A., Tupitsa, N., and Uribe, C. On the complexity of approximating Wasserstein barycenter. arXiv:1901.08686, 2019.

Kusner, M. J., Sun, Y., Kolkin, N. I., and Weinberger, K. Q. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pp. 957–966. PMLR, 2015.

Lan, G., Lu, Z., and Monteiro, R. D. C. Primal-dual first-order methods with O(1/ε) iteration-complexity for cone programming. Mathematical Programming, 126(1):1–29, 2011.

Lan, G., Lee, S., and Zhou, Y. Communication-efficient algorithms for decentralized and stochastic optimization. arXiv:1701.03961, 2017.

Le Gouic, T. and Loubes, J.-M. Existence and consistency of Wasserstein barycenters. Probability Theory and Related Fields, 168(3-4):901–917, 2017.

Lin, T., Ho, N., and Jordan, M. I. On efficient optimal transport: An analysis of greedy and accelerated mirror descent algorithms. http://www-personal.umich.edu/~minhnhat/Arxiv_submission.pdf, 2019.

Maros, M. and Jaldén, J. PANDA: A dual linearly converging method for distributed optimization over time-varying undirected graphs. arXiv preprint arXiv:1803.08328, 2018.

Monge, G. Mémoire sur la théorie des déblais et des remblais. Histoire de l'Académie Royale des Sciences de Paris, 1781.

Nedic, A. and Ozdaglar, A. Distributed subgradient methods for multi-agent optimization.
IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

Nedić, A., Olshevsky, A., Shi, W., and Uribe, C. A. Geometrically convergent distributed optimization with uncoordinated step-sizes. In American Control Conference (ACC), 2017, pp. 3950–3955. IEEE, 2017.

Nesterov, Y. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.

Nesterov, Y., Gasnikov, A., Guminov, S., and Dvurechensky, P. Primal-dual accelerated gradient methods with small-dimensional relaxation oracle. arXiv:1809.05895, 2018.

Peyré, G. and Cuturi, M. Computational optimal transport. arXiv:1803.00567, 2018.

Puccetti, G., Rüschendorf, L., and Vanduffel, S. On the computation of Wasserstein barycenters. https://ssrn.com/abstract=3276147, 2018.

Rogozin, A., Uribe, C. A., Gasnikov, A., Malkovsky, N., and Nedić, A. Optimal distributed optimization on slowly time-varying graphs. arXiv preprint arXiv:1805.06045, 2018.

Santambrogio, F. Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling. Progress in Nonlinear Differential Equations and Their Applications. Springer International Publishing, 2015. ISBN 9783319208282.

Scaman, K., Bach, F., Bubeck, S., Lee, Y. T., and Massoulié, L. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 3027–3036, International Convention Centre, Sydney, Australia, 2017. PMLR.

Solomon, J., Rustamov, R. M., Guibas, L., and Butscher, A. Wasserstein propagation for semi-supervised learning. In Proceedings of the 31st International Conference on Machine Learning, ICML'14, pp. I-306–I-314. PMLR, 2014.

Spielman, D. A. and Teng, S.-H. Nearly linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. SIAM Journal on Matrix Analysis and Applications, 35(3):835–885, 2014.

Staib, M., Claici, S., Solomon, J. M., and Jegelka, S. Parallel streaming Wasserstein barycenters. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 2647–2658. Curran Associates, Inc., 2017.

Stonyakin, F., Dvinskikh, D., Dvurechensky, P., Kroshnin, A., Kuznetsova, O., Agafonov, A., Gasnikov, A., Tyurin, A., Uribe, C., Pasechnyuk, D., and Artamonov, S. Gradient methods for problems with inexact model of the objective. arXiv:1902.09001, 2019a.

Stonyakin, F., Gasnikov, A., Tyurin, A., Pasechnyuk, D., Agafonov, A., Dvurechensky, P., Dvinskikh, D., Kroshnin, A., and Piskunova, V. Inexact model: A framework for optimization and variational inequalities. arXiv:1902.00990, 2019b.

Tran-Dinh, Q., Fercoq, O., and Cevher, V. A smooth primal-dual optimization framework for nonsmooth composite convex minimization. SIAM Journal on Optimization, 28(1):96–134, 2018. doi: 10.1137/16M1093094. URL https://doi.org/10.1137/16M1093094. arXiv:1507.06243.

Uribe, C. A., Dvinskikh, D., Dvurechensky, P., Gasnikov, A., and Nedić, A. Distributed computation of Wasserstein barycenters over networks. In 2018 IEEE Conference on Decision and Control (CDC), pp. 6544–6549, 2018. arXiv:1803.02933.

Uribe, C. A., Lee, S., Gasnikov, A., and Nedić, A.
A dual approach for optimal algorithms in distributed optimization over networks. arXiv preprint arXiv:1809.00710, 2018.

Vaidya, P. Solving linear equations with diagonally dominant matrices by constructing good preconditioners. Technical report, Department of Computer Science, University of Illinois, 1990.

Villani, C. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

Wai, H.-T., Freris, N. M., Nedic, A., and Scaglione, A. SUCAG: Stochastic unbiased curvature-aided gradient method for distributed optimization. arXiv preprint arXiv:1803.08198, 2018.

Wu, X. and Lu, J. Fenchel dual gradient methods for distributed convex optimization over time-varying networks. In Decision and Control (CDC), 2017 IEEE 56th Annual Conference on, pp. 2894–2899. IEEE, 2017.

Xie, Y., Wang, X., Wang, R., and Zha, H. A fast proximal point method for computing Wasserstein distance. arXiv:1802.04307, 2018.

Yurtsever, A., Tran-Dinh, Q., and Cevher, V. A universal primal-dual convex optimization framework. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS'15, pp. 3150–3158, Cambridge, MA, USA, 2015. MIT Press.