# Parallel Streaming Wasserstein Barycenters

Matthew Staib (MIT CSAIL, mstaib@mit.edu) · Sebastian Claici (MIT CSAIL, sclaici@mit.edu) · Justin Solomon (MIT CSAIL, jsolomon@mit.edu) · Stefanie Jegelka (MIT CSAIL, stefje@mit.edu)

Efficiently aggregating data from different sources is a challenging problem, particularly when samples from each source are distributed differently. These differences can be inherent to the inference task or present for other reasons: sensors in a sensor network may be placed far apart, affecting their individual measurements. Conversely, it is computationally advantageous to split Bayesian inference tasks across subsets of data, but data need not be identically distributed across subsets. One principled way to fuse probability distributions is via the lens of optimal transport: the Wasserstein barycenter is a single distribution that summarizes a collection of input measures while respecting their geometry. However, computing the barycenter scales poorly and requires discretization of all input distributions and the barycenter itself. Improving on this situation, we present a scalable, communication-efficient, parallel algorithm for computing the Wasserstein barycenter of arbitrary distributions. Our algorithm can operate directly on continuous input distributions and is optimized for streaming data. Our method is even robust to nonstationary input distributions and produces a barycenter estimate that tracks the input measures over time. The algorithm is semi-discrete, needing to discretize only the barycenter estimate. To the best of our knowledge, we also provide the first bounds on the quality of the approximate barycenter as the discretization becomes finer. Finally, we demonstrate the practical effectiveness of our method, both in tracking moving distributions on a sphere, as well as in a large-scale Bayesian inference task.

## 1 Introduction

A key challenge when scaling up data aggregation occurs when data comes from multiple sources, each with its own inherent structure. Sensors in a sensor network may be configured differently or placed far apart, but each individual sensor simply measures a different view of the same quantity. Similarly, user data collected by a server in California will differ from that collected by a server in Europe: the data samples may be independent but are not identically distributed.

One reasonable approach to aggregation in the presence of multiple data sources is to perform inference on each piece independently and fuse the results. This is possible when the data can be distributed randomly, using methods akin to distributed optimization [52, 53]. However, when the data is not split in an i.i.d. way, Bayesian inference on different subsets of observed data yields slightly different subset posterior distributions for each subset that must be combined [33]. Further complicating matters, data sources may be nonstationary.

How can we fuse these different data sources for joint analysis in a consistent and structure-preserving manner? We address this question using ideas from the theory of optimal transport. Optimal transport gives us a principled way to measure distances between measures that takes into account the underlying space on which the measures are defined. Intuitively, the optimal transport distance between two distributions measures the amount of work one would have to do to move all mass from one distribution to the other.
Given J input measures {µ_j}_{j=1}^J, it is natural, in this setting, to ask for a measure that minimizes the total squared distance to the input measures. This measure is called the Wasserstein barycenter of the input measures [1], and should be thought of as an aggregation of the input measures which preserves their geometry. This particular aggregation enjoys many nice properties: in the earlier Bayesian inference example, aggregating subset posterior distributions via their Wasserstein barycenter yields guarantees on the original inference task [47].

If the measures µ_j are discrete, their barycenter can be computed relatively efficiently via either a sparse linear program [2] or regularized projection-based methods [16, 7, 51, 17]. However, (1) these techniques scale poorly with the support of the measures, and quickly become impractical as the support becomes large; and (2) when the input measures are continuous, to the best of our knowledge the only option is to discretize them via sampling, but the rate of convergence to the true (continuous) barycenter is not well understood. These two confounding factors make it difficult to utilize barycenters in scenarios like parallel Bayesian inference where the measures are continuous and a fine approximation is needed. These are the primary issues we work to address in this paper.

Given sample access to J potentially continuous distributions µ_j, we propose a communication-efficient, parallel algorithm to estimate their barycenter. Our method can be parallelized across J worker machines, and the messages sent between machines are merely single integers. We require a discrete approximation only of the barycenter itself, making our algorithm semi-discrete, and our algorithm scales well to fine approximations (e.g. n ≈ 10⁶). In contrast to previous work, we provide guarantees on the quality of the approximation as n increases. These rates apply to the general setting in which the µ_j are defined on manifolds, with applications to directional statistics [46]. Our algorithm is based on stochastic gradient descent as in [22] and hence is robust to gradual changes in the distributions: as the µ_j change over time, we maintain a moving estimate of their barycenter, a task which is not possible using current methods without solving a large linear program in each iteration.

We emphasize that we aggregate the input distributions into a summary, the barycenter, which is itself a distribution. Instead of performing any single domain-specific task such as clustering or estimating an expectation, we can simply compute the barycenter of the inputs and process it later in any arbitrary way. This generality, coupled with the efficiency and parallelism of our algorithm, yields immediate applications in fields from large-scale Bayesian inference to streaming sensor fusion.

**Contributions.**
1. We give a communication-efficient and fully parallel algorithm for computing the barycenter of a collection of distributions. Although our algorithm is semi-discrete, we stress that the input measures can be continuous, and even nonstationary.
2. We give bounds on the quality of the recovered barycenter as our discretization becomes finer. These are the first such bounds we are aware of, and they apply to measures on arbitrary compact and connected manifolds.
3. We demonstrate the practical effectiveness of our method, both in tracking moving distributions on a sphere, as well as in a real large-scale Bayesian inference task.

### 1.1 Related work

**Optimal transport.** A comprehensive treatment of optimal transport and its many applications is beyond the scope of our work. We refer the interested reader to the detailed monographs by Villani [49] and Santambrogio [42]. Fast algorithms for optimal transport have been developed in recent years via Sinkhorn's algorithm [15] and in particular stochastic gradient methods [22], on which we build in this work. These algorithms have enabled several applications of optimal transport and Wasserstein metrics to machine learning, for example in supervised learning [21], unsupervised learning [34, 5], and domain adaptation [14]. Wasserstein barycenters in particular have been applied to a wide variety of problems including fusion of subset posteriors [47], distribution clustering [51], shape and texture interpolation [45, 40], and multi-target tracking [6].

When the distributions µ_j are discrete, transport barycenters can be computed relatively efficiently via either a sparse linear program [2] or regularized projection-based methods [16, 7, 51, 17]. In settings like posterior inference, however, the distributions µ_j are likely continuous rather than discrete, and the most obvious viable approach requires discrete approximation of each µ_j. The resulting discrete barycenter converges to the true, continuous barycenter as the approximations become finer [10, 28], but the rate of convergence is not well understood, and finely approximating each µ_j yields a very large linear program.

**Scalable Bayesian inference.** Scaling Bayesian inference to large datasets has become an important topic in recent years. There are many approaches to this, ranging from parallel Gibbs sampling [38, 26] to stochastic and streaming algorithms [50, 13, 25, 12]. For a more complete picture, we refer the reader to the survey by Angelino et al. [3]. One promising method is via subset posteriors: instead of sampling from the posterior distribution given by the full data, the data is split into smaller tractable subsets. Performing inference on each subset yields several subset posteriors, which are biased but can be combined via their Wasserstein barycenter [47], with provable guarantees on approximation quality. This is in contrast to other methods that rely on summary statistics to estimate the true posterior [33, 36] and that require additional assumptions. In fact, our algorithm works with arbitrary measures and on manifolds.

## 2 Background

Let (X, d) be a metric space. Given two probability measures µ, ν ∈ P(X) and a cost function c : X × X → [0, ∞), the Kantorovich optimal transport problem asks for a solution to

$$\mathrm{OT}_c(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \int_{X \times X} c(x, y)\, d\gamma(x, y), \qquad (1)$$

where Π(µ, ν) is the set of measures on the product space X × X whose marginals evaluate to µ and ν, respectively. Under mild conditions on the cost function (lower semi-continuity) and the underlying space (completeness and separability), problem (1) admits a solution [42]. Moreover, if the cost function is of the form c(x, y) = d(x, y)^p, the optimal transportation cost is a distance metric on the space of probability measures. This is known as the Wasserstein distance and is given by

$$W_p(\mu, \nu) = \left( \inf_{\gamma \in \Pi(\mu, \nu)} \int_{X \times X} d(x, y)^p\, d\gamma(x, y) \right)^{1/p}. \qquad (2)$$

Optimal transport has recently attracted much attention in machine learning and adjacent communities [21, 34, 14, 39, 41, 5]. When µ and ν are discrete measures, problem (2) is a linear program, although faster regularized methods based on Sinkhorn iteration are used in practice [15]. Optimal transport can also be computed using stochastic first-order methods [22].
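As a concrete reference point for problem (2), the sketch below solves the discrete Kantorovich problem as a linear program with SciPy. This is purely illustrative (the paper's implementation is C++/MPI, and practical solvers use Sinkhorn iteration or the stochastic methods of [22]); the function name and example data are ours.

```python
# Minimal sketch: the discrete Kantorovich problem (2) as a linear program.
import numpy as np
from scipy.optimize import linprog

def wasserstein_p(x, y, mu_w, nu_w, p=2):
    """W_p between discrete measures sum_i mu_w[i] d_{x[i]} and sum_j nu_w[j] d_{y[j]}."""
    m, n = len(mu_w), len(nu_w)
    # Cost matrix c(x_i, y_j) = ||x_i - y_j||^p.
    C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2) ** p
    # Marginal constraints on the transport plan gamma (flattened row-major).
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0      # sum_j gamma_ij = mu_w[i]
    for j in range(n):
        A_eq[m + j, j::n] = 1.0               # sum_i gamma_ij = nu_w[j]
    b_eq = np.concatenate([mu_w, nu_w])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun ** (1.0 / p)

# Example: two small point clouds in R^2 with uniform weights.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(5, 2)), rng.normal(size=(6, 2)) + 1.0
print(wasserstein_p(x, y, np.full(5, 0.2), np.full(6, 1 / 6)))
```

The quadratic growth of the plan (m × n variables) is exactly the scaling bottleneck the semi-discrete approach below is designed to avoid.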
Now let µ_1, …, µ_J be measures on X. The Wasserstein barycenter problem, introduced by Agueh and Carlier [1], is to find a measure ν ∈ P(X) that minimizes the functional

$$F[\nu] = \frac{1}{J} \sum_{j=1}^{J} W_2^2(\mu_j, \nu). \qquad (3)$$

Finding the barycenter is the primary problem we address in this paper. When each µ_j is a discrete measure, the exact barycenter can be found via linear programming [2], and many of the regularization techniques apply for approximating it [16, 17]. However, the problem size grows quickly with the size of the support. When the measures µ_j are truly continuous, we are aware of only one strategy: sample from each µ_j in order to approximate it by the empirical measure, and then solve the discrete barycenter problem.

We directly address the problem of computing the barycenter when the input measures can be continuous. We solve a semi-discrete problem, where the target measure is a finite set of points, but we do not discretize any other distribution.

## 3 Algorithm

We first provide some background on the dual formulation of optimal transport. Then we derive a useful form of the barycenter problem, provide an algorithm to solve it, and prove convergence guarantees. Finally, we demonstrate how our algorithm can easily be parallelized.

### 3.1 Mathematical preliminaries

The primal optimal transport problem (1) admits a dual problem [42]:

$$\mathrm{OT}_c(\mu, \nu) = \sup_{v\ \text{1-Lipschitz}} \left\{ \mathbb{E}_{Y \sim \nu}[v(Y)] + \mathbb{E}_{X \sim \mu}[v^c(X)] \right\}, \qquad (4)$$

where $v^c(x) = \inf_{y \in X} \{c(x, y) - v(y)\}$ is the c-transform of v [49]. When $\nu = \sum_{i=1}^{n} w_i \delta_{y_i}$ is discrete, problem (4) becomes the semi-discrete problem

$$\mathrm{OT}_c(\mu, \nu) = \max_{v \in \mathbb{R}^n} \left\{ \langle w, v \rangle + \mathbb{E}_{X \sim \mu}[h(X, v)] \right\}, \qquad (5)$$

where we define $h(x, v) = v^c(x) = \min_{i=1,\dots,n} \{c(x, y_i) - v_i\}$. Semi-discrete optimal transport admits efficient algorithms [31, 29]; Genevay et al. [22] in particular observed that given sample oracle access to µ, the semi-discrete problem can be solved via stochastic gradient ascent. Hence optimal transport distances can be estimated even in the semi-discrete setting.
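To make the semi-discrete dual (5) concrete, here is a small stochastic gradient ascent sketch in the spirit of Genevay et al. [22]. It uses the fact that, at a sample x, a stochastic supergradient of the objective in (5) is w − e_{i*} with i* = argmin_i {c(x, y_i) − v_i}. The squared Euclidean cost, step-size schedule, and example distributions are illustrative assumptions, not the paper's code.

```python
import numpy as np

def semidiscrete_ot(sample_mu, y, w, steps=20000, step0=1.0, seed=0):
    """Estimate OT_c(mu, nu) for nu = sum_i w[i] delta_{y[i]}, c = squared Euclidean."""
    rng = np.random.default_rng(seed)
    v = np.zeros(len(w))
    running = 0.0
    for t in range(1, steps + 1):
        x = sample_mu(rng)
        costs = np.sum((y - x) ** 2, axis=1) - v   # c(x, y_i) - v_i
        i_star = int(np.argmin(costs))
        running += w @ v + costs[i_star]           # <w, v> + h(x, v): one sample of (5)
        g = w.copy()                               # supergradient: w - e_{i_star}
        g[i_star] -= 1.0
        v += (step0 / np.sqrt(t)) * g              # stochastic ascent step
    return v, running / steps

# Example: mu = standard Gaussian in R^2, nu = 50 random atoms with uniform weights.
atoms = np.random.default_rng(1).normal(size=(50, 2))
v, ot_est = semidiscrete_ot(lambda rng: rng.normal(size=2), atoms, np.full(50, 0.02))
print(ot_est)
```

The barycenter algorithm below repeats essentially this computation once per input measure, with the weights w themselves being optimized.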
### 3.2 Deriving the optimization problem

Absolutely continuous measures can be approximated arbitrarily well by discrete distributions with respect to Wasserstein distance [30]. Hence one natural approach to the barycenter problem (3) is to approximate the true barycenter via discrete approximation: we fix n support points {y_i}_{i=1}^n ⊆ X and search over assignments of the mass w_i on each point y_i. In this way we wish to find the discrete distribution ν_n = Σ_{i=1}^n w_i δ_{y_i} with support on those n points which optimizes

$$\min_{w \in \Delta_n} F(w) = \min_{w \in \Delta_n} \frac{1}{J} \sum_{j=1}^{J} W_2^2(\mu_j, \nu_n) \qquad (6)$$

$$\qquad\qquad\quad = \min_{w \in \Delta_n} \frac{1}{J} \sum_{j=1}^{J} \max_{v^j \in \mathbb{R}^n} \left\{ \langle w, v^j \rangle + \mathbb{E}_{X_j \sim \mu_j}[h(X_j, v^j)] \right\}, \qquad (7)$$

where we have defined F(w) := F[ν_n] = F[Σ_{i=1}^n w_i δ_{y_i}] and used the dual formulation from equation (5). In Section 4, we discuss the effect of different choices for the support points {y_i}_{i=1}^n.

Noting that the variables v^j are uncoupled, we can rearrange to get the following problem:

$$\min_{w \in \Delta_n} \max_{v^1, \dots, v^J} \frac{1}{J} \sum_{j=1}^{J} \left\{ \langle w, v^j \rangle + \mathbb{E}_{X_j \sim \mu_j}[h(X_j, v^j)] \right\}. \qquad (8)$$

Problem (8) is convex in w and jointly concave in the v^j, and we can compute an unbiased gradient estimate for each by sampling X_j ∼ µ_j. Hence, we could solve this saddle-point problem via simultaneous (sub)gradient steps as in Nemirovski and Rubinstein [37]. Such methods are simple to implement, but in the current form we must project onto the simplex Δ_n at each iteration. This requires only O(n log n) time [24, 32, 19] but makes it hard to decouple the problem across each distribution µ_j.

Fortunately, we can reformulate the problem in a way that avoids projection entirely. By strong duality, problem (8) can be written as

$$\max_{v^1, \dots, v^J} \min_{w \in \Delta_n} \frac{1}{J} \sum_{j=1}^{J} \left\{ \langle w, v^j \rangle + \mathbb{E}_{X_j \sim \mu_j}[h(X_j, v^j)] \right\} \qquad (9)$$

$$\qquad\quad = \max_{v^1, \dots, v^J} \frac{1}{J} \left\{ \min_{i} \sum_{j=1}^{J} v_i^j + \sum_{j=1}^{J} \mathbb{E}_{X_j \sim \mu_j}[h(X_j, v^j)] \right\}. \qquad (10)$$

Note how the variable w disappears: for any fixed vector b, minimization of ⟨b, w⟩ over w ∈ Δ_n is equivalent to finding the minimum element of b. The optimal w can also be computed in closed form when the barycentric cost is entropically regularized as in [9], which may yield better convergence rates but requires dense updates that, e.g., need more communication in the parallel setting.

In either case, we are left with a concave maximization problem in v^1, …, v^J, to which we can directly apply stochastic gradient ascent. Unfortunately the gradients are still not sparse and decoupled. We obtain sparsity after one final transformation of the problem: by replacing each sum Σ_{j=1}^J v_i^j with a variable s_i and enforcing this equality with a constraint, we turn problem (10) into the constrained problem

$$\max_{s, v^1, \dots, v^J} \frac{1}{J} \left\{ \min_{i} s_i + \sum_{j=1}^{J} \mathbb{E}_{X_j \sim \mu_j}[h(X_j, v^j)] \right\} \quad \text{subject to} \quad s = \sum_{j=1}^{J} v^j. \qquad (11)$$

### 3.3 Algorithm and convergence

**Algorithm 1: Subgradient Ascent**

    s, v^1, ..., v^J ← 0_n
    loop
        Draw j ∼ Unif[1, ..., J]
        Draw x ∼ µ_j
        i_W ← argmin_i { c(x, y_i) − v^j_i }
        i_M ← argmin_i s_i
        v^j_{i_W} ← v^j_{i_W} − γ          ▷ Gradient update
        s_{i_M} ← s_{i_M} + γ/J            ▷ Gradient update
        v^j_{i_W} ← v^j_{i_W} + γ/2        ▷ Projection
        v^j_{i_M} ← v^j_{i_M} + γ/(2J)     ▷ Projection
        s_{i_W} ← s_{i_W} − γ/2            ▷ Projection
        s_{i_M} ← s_{i_M} − γ/(2J)         ▷ Projection
    end loop

We can now solve this problem via stochastic projected subgradient ascent. This is described in Algorithm 1; note that the sparse adjustments after the gradient step are actually projections onto the constraint set with respect to the ℓ1 norm. Derivation of this sparse projection step is given rigorously in Appendix A.

Not only do we have an optimization algorithm with sparse updates, but we can even recover the optimal weights w from standard results in online learning [20]. Specifically, in a zero-sum game where one player plays a no-regret learning algorithm and the other plays a best-response strategy, the average strategies of both players converge to optimal:

**Theorem 3.1.** Perform T iterations of stochastic subgradient ascent on u = (s, v^1, …, v^J) as in Algorithm 1, and use step size γ = R/(4√T), assuming ‖u_t − u*‖_1 ≤ R for all t. Let i_t be the minimizing index chosen at iteration t, and write w̄_T = (1/T) Σ_{t=1}^T e_{i_t}. Then we can bound

$$\mathbb{E}\left[F(\bar{w}_T) - F(w^*)\right] \le 4R/\sqrt{T}.$$

The expectation is with respect to the randomness in the subgradient estimates g_t. Theorem 3.1 is proved in Appendix B. The proof combines the zero-sum game idea above, which itself comes from [20], with a regret bound for online gradient descent [54, 23].
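The following serial sketch mirrors Algorithm 1: it applies the gradient-plus-projection updates in their combined (net) form and recovers the barycenter weights from the histogram of minimizing indices i_M, as in Theorem 3.1. The squared Euclidean cost, Gaussian example inputs, and fixed step size are illustrative assumptions; the parallel C++/MPI version is described in Section 3.4.

```python
import numpy as np

def barycenter_weights(samplers, support, steps=200000, gamma=0.01, seed=0):
    """Weights w on fixed support points approximating the barycenter of the samplers."""
    rng = np.random.default_rng(seed)
    J, n = len(samplers), len(support)
    v = np.zeros((J, n))      # one dual vector per input measure
    s = np.zeros(n)           # coupling variable, kept equal to sum_j v[j]
    counts = np.zeros(n)      # histogram of minimizing indices i_M
    for _ in range(steps):
        j = int(rng.integers(J))
        x = samplers[j](rng)
        i_w = int(np.argmin(np.sum((support - x) ** 2, axis=1) - v[j]))
        i_m = int(np.argmin(s))
        counts[i_m] += 1
        # Net effect of Algorithm 1's gradient step followed by the sparse projection:
        v[j, i_w] -= gamma / 2
        v[j, i_m] += gamma / (2 * J)
        s[i_w] -= gamma / 2
        s[i_m] += gamma / (2 * J)
    return counts / counts.sum()

# Example: barycenter of two Gaussians in R^2 on a 20x20 grid of support points.
xs = np.linspace(-3, 3, 20)
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)
samplers = [lambda rng: rng.normal(loc=(-1.0, 0.0), scale=0.5),
            lambda rng: rng.normal(loc=(+1.0, 0.0), scale=0.5)]
w = barycenter_weights(samplers, grid)
```

Only the two argmin indices change per iteration, which is exactly the sparsity the parallel implementation exploits.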
### 3.4 Parallel Implementation

The key realization which makes our barycenter algorithm truly scalable is that the variables s, v^1, …, v^J can be separated across different machines. In particular, the sum or coupling variable s is maintained on a master thread which runs Algorithm 2, and each v^j is maintained on a worker thread running Algorithm 3. Each projected gradient step requires first selecting distribution j. The algorithm then requires computing only i_W = argmin_i {c(x_j, y_i) − v^j_i} and i_M = argmin_i s_i, and then updating s and v^j in only those coordinates. Hence only a small amount of information (i_W and i_M) needs to pass between threads. Note also that this algorithm can be adapted to the parallel shared-memory case, where s is a variable shared between threads which make sparse updates to it.

**Algorithm 2: Master Thread**

    Input: atoms {y_i}_{i=1,...,n}, number J of distributions, step size γ
    Output: barycenter weights w
    c ← 0_n; s ← 0_n; i_M ← 1
    loop
        i_W ← message from worker j
        Send i_M to worker j
        c_{i_M} ← c_{i_M} + 1
        s_{i_M} ← s_{i_M} + γ/(2J)
        s_{i_W} ← s_{i_W} − γ/2
        i_M ← argmin_i s_i
    end loop
    return w ← c / (Σ_{i=1}^n c_i)

**Algorithm 3: Worker Thread**

    Input: index j, distribution µ_j, atoms {y_i}_{i=1,...,n}, number J of distributions, step size γ
    v ← 0_n
    loop
        Draw x ∼ µ_j
        i_W ← argmin_i { c(x, y_i) − v_i }
        Send i_W to master
        i_M ← message from master
        v_{i_M} ← v_{i_M} + γ/(2J)
        v_{i_W} ← v_{i_W} − γ/2
    end loop

Here we will focus on the former, master/worker scenario for simplicity. Where are the bottlenecks? When there are n points in the discrete approximation, each worker's task of computing argmin_i {c(x_j, y_i) − v^j_i} requires O(n) computations of c(x, y). The master must iteratively find the minimum element s_{i_M} in the vector s, then update s_{i_M}, and decrease element s_{i_W}. These can be implemented respectively as the find-min, delete-min-then-insert, and decrease-key operations in a Fibonacci heap. All these operations together take amortized O(log n) time. Hence, it takes O(n) time for all J workers to each produce one gradient sample in parallel, and only O(J log n) time for the master to process them all. Of course, communication is not free, but the messages are small and our approach should scale well for J ≪ n.

This parallel algorithm is particularly well-suited to the Wasserstein posterior (WASP) [48] framework for merging Bayesian subset posteriors. In this setting, we split the dataset X_1, …, X_k into J subsets S_1, …, S_J, each with k/J data points, and distribute those subsets to J different machines; each machine then runs Markov chain Monte Carlo (MCMC) to sample from p(θ | S_i), and we aggregate these posteriors via their barycenter. The most expensive subroutine in the worker thread is actually sampling from the posterior, and everything else is cheap in comparison. In particular, the machines need not even share samples from their respective MCMC chains.

One subtlety is that selecting worker j truly uniformly at random each iteration requires more synchronization, hence our gradient estimates are not actually independent as usual. Selecting worker threads as they become available will fail to yield a uniform distribution over j: at the moment worker j finishes one gradient step, the probability that worker j is the next available is much less than 1/J, since worker j must resample and recompute i_W, whereas other threads would have a head start. If workers all took precisely the same amount of time, the ordering of worker threads would be deterministic, and guarantees for without-replacement sampling variants of stochastic gradient ascent would apply [44]. In practice, we have no issues with our approach.
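Below is a minimal single-process sketch of the master's bookkeeping in Algorithm 2, using Python's heapq with lazy invalidation as a stand-in for the Fibonacci heap mentioned above (heapq has no decrease-key, so stale entries are simply skipped when they surface). The class and method names are ours; the actual implementation is C++ with MPI.

```python
import heapq

class MasterState:
    """Master-side state from Algorithm 2: s, the counts c, and a heap over s."""
    def __init__(self, n, J, gamma):
        self.s = [0.0] * n
        self.counts = [0] * n
        self.J, self.gamma = J, gamma
        self.heap = [(0.0, i) for i in range(n)]   # (value, index); valid heap as built

    def _argmin(self):
        # Pop stale entries until the top matches the current value of s at that index.
        while self.heap[0][0] != self.s[self.heap[0][1]]:
            heapq.heappop(self.heap)
        return self.heap[0][1]

    def handle(self, i_w):
        """Process one message i_W from a worker; return the index i_M to send back."""
        i_m = self._argmin()
        self.counts[i_m] += 1
        self.s[i_m] += self.gamma / (2 * self.J)
        self.s[i_w] -= self.gamma / 2
        # Push the updated values; old entries stay in the heap as stale garbage.
        heapq.heappush(self.heap, (self.s[i_m], i_m))
        heapq.heappush(self.heap, (self.s[i_w], i_w))
        return i_m

    def weights(self):
        total = sum(self.counts)
        return [c / total for c in self.counts] if total else self.counts
```

In the MPI setting, `handle` would be driven by a receive loop, with the returned index sent back to the originating worker, exactly as in Algorithm 2.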
## 4 Consistency

Prior methods for estimating the Wasserstein barycenter of continuous measures µ_j ∈ P(X) involve first approximating each µ_j by a measure µ_{j,n} that has finite support on n points, then computing the barycenter ν̄_n of {µ_{j,n}} as a surrogate for the true barycenter ν*. This approach is consistent, in that if µ_{j,n} → µ_j as n → ∞, then also ν̄_n → ν*. This holds even if the barycenter is not unique, both in the Euclidean case [10, Theorem 3.1] as well as when X is a Riemannian manifold [28, Theorem 5.4]. However, it is not known how fast the approximation ν̄_n approaches the true barycenter ν*, or even how fast the barycentric distance F[ν̄_n] approaches F[ν*].

In practice, not even the approximation ν̄_n is computed exactly: instead, support points are chosen and ν̄_n is constrained to have support on those points. There are various heuristic methods for choosing these support points, ranging from mesh grids of the support, to randomly sampling points from the convex hull of the supports of the µ_j, or even optimizing over the support point locations. Yet we are unaware of any rigorous guarantees on the quality of these approximations.

While our approach still involves approximating the barycenter by a measure ν_n with fixed support, we are able to provide bounds on the quality of this approximation as n → ∞. Specifically, we bound the rate at which F[ν_n] → F[ν*]. The result is intuitive, and appeals to the notion of an ε-cover of the support of the barycenter:

**Definition 4.1** (Covering Number). The ε-covering number of a compact set K ⊆ X, with respect to the metric g, is the minimum number N_ε(K) of points {x_i}_{i=1}^{N_ε(K)} ⊆ K needed so that for each y ∈ K, there is some x_i with g(x_i, y) ≤ ε. The set {x_i} is called an ε-covering.

**Definition 4.2** (Inverse Covering Radius). Fix n ∈ Z_+. We define the n-inverse covering radius of a compact set K ⊆ X as the value ε_n(K) = inf{ε > 0 : N_ε(K) ≤ n}, when n is large enough so the infimum exists.

Suppose throughout this section that K ⊆ R^d is endowed with a Riemannian metric g, where K has diameter D. In the specific case where g is the usual Euclidean metric, there is an ε-cover for K with at most C_1 ε^{-d} points, where C_1 depends only on the diameter D and dimension d [43]. Reversing the inequality, K has an n-inverse covering radius of at most C_2 n^{-1/d} when n takes the correct form. We now present and then prove our main result:

**Theorem 4.1.** Suppose the measures µ_j are supported on K, and suppose µ_1 is absolutely continuous with respect to volume. Then the barycenter ν* is unique. Moreover, for each empirical approximation size n, if we choose support points {y_i}_{i=1,…,n} that constitute a 2ε_n(K)-cover of K, it follows that

$$F[\nu_n] - F[\nu^*] \le O\!\left(\varepsilon_n(K) + n^{-1/d}\right),$$

where ν_n = Σ_i w*_i δ_{y_i} for w* solving Problem (8).

**Remark 4.1.** Absolute continuity is only needed to reason about approximating the barycenter with an N-point discrete distribution. If the input distributions are themselves discrete distributions, so is the barycenter, and we can strengthen our result: for large enough n, we actually have W_2(ν_n, ν*) ≤ 2ε_n(K) and therefore F[ν_n] − F[ν*] ≤ O(ε_n(K)).

**Corollary 4.1** (Convergence to ν*). Suppose the measures µ_j are supported on K, with µ_1 absolutely continuous with respect to volume. Let ν* be the unique minimizer of F. Then we can choose support points {y_i}_{i=1,…,n} such that some subsequence of ν_n = Σ_i w*_i δ_{y_i} converges weakly to ν*.

*Proof.* By Theorem 4.1, we can choose support points so that F[ν_n] → F[ν*]. By compactness, the sequence ν_n admits a convergent subsequence ν_{n_k} → ν for some measure ν. Continuity of F allows us to pass to the limit lim_{k→∞} F[ν_{n_k}] = F[lim_{k→∞} ν_{n_k}]. On the other hand, lim_{k→∞} F[ν_{n_k}] = F[ν*], and F is strictly convex [28], thus ν_{n_k} → ν* weakly. ∎

Before proving Theorem 4.1, we need smoothness of the barycenter functional F with respect to the Wasserstein-2 distance:

**Lemma 4.1.** Suppose we are given measures {µ_j}_{j=1}^J, ν, and {ν_n}_{n=1}^∞ supported on K, with ν_n → ν. Then F[ν_n] → F[ν], with |F[ν_n] − F[ν]| ≤ 2D W_2(ν_n, ν).

*Proof of Theorem 4.1.* Uniqueness of ν* follows from Theorem 2.4 of [28]. From Theorem 5.1 in [28] we know further that ν* is absolutely continuous with respect to volume. Let N > 0, and let ν_N be the discrete distribution on N points, each with mass 1/N, which minimizes W_2(ν_N, ν*).
This distribution satisfies W_2(ν_N, ν*) ≤ C N^{-1/d} [30], where C depends on K, the dimension d, and the metric. With our budget of n support points, we can construct a 2ε_n(K)-cover as long as n is sufficiently large. Then define a distribution ν_{n,N} with support on the 2ε_n(K)-cover as follows: for each x in the support of ν_N, map x to the closest point x′ in the cover, and add mass 1/N to x′. Note that this defines not only the distribution ν_{n,N}, but also a transport plan between ν_N and ν_{n,N}. This map moves N points of mass 1/N each a distance at most 2ε_n(K), so we may bound

$$W_2(\nu_{n,N}, \nu_N) \le \sqrt{N \cdot \tfrac{1}{N}\,\bigl(2\varepsilon_n(K)\bigr)^2} = 2\varepsilon_n(K).$$

Combining these two bounds, we see that

$$W_2(\nu_{n,N}, \nu^*) \le W_2(\nu_{n,N}, \nu_N) + W_2(\nu_N, \nu^*) \qquad (13)$$
$$\le 2\varepsilon_n(K) + C N^{-1/d}. \qquad (14)$$

For each n, we choose to set N = n, which yields W_2(ν_{n,n}, ν*) ≤ 2ε_n(K) + C n^{-1/d}. Applying Lemma 4.1, and recalling that ν* is the minimizer of F, we have

$$F[\nu_{n,n}] - F[\nu^*] \le 2D \left( 2\varepsilon_n(K) + C n^{-1/d} \right) = O\!\left(\varepsilon_n(K) + n^{-1/d}\right). \qquad (15)$$

However, we must have F[ν_n] ≤ F[ν_{n,n}], because both are measures on the same n-point 2ε_n(K)-cover, but ν_n has weights chosen to minimize F. Thus we must also have

$$F[\nu_n] - F[\nu^*] \le F[\nu_{n,n}] - F[\nu^*] \le O\!\left(\varepsilon_n(K) + n^{-1/d}\right). \qquad \blacksquare$$

The high-level view of the above result is that choosing support points y_i to form an ε-cover with respect to the metric g, and then optimizing over their weights w_i via our stochastic algorithm, will give us a consistent picture of the behavior of the true barycenter. Also note that the proof above requires an ε-cover only of the support of ν*, not all of K. In particular, an ε-cover of the convex hull of the supports of the µ_j is sufficient, as this must contain the barycenter. Other heuristic techniques to efficiently focus a limited budget of n points only on the support of ν* are advantageous and justified.

While Theorem 4.1 is a good start, ideally we would also be able to provide a bound on W_2(ν_n, ν*). This would follow readily from sharpness of the functional F[ν], or even the discrete version F(w), but it is not immediately clear how to achieve such a result.

## 5 Experiments

We demonstrate the applicability of our method on two experiments, one synthetic and one performing a real inference task. Together, these showcase the positive traits of our algorithm: speed, parallelization, robustness to non-stationarity, applicability to non-Euclidean domains, and immediate performance benefit to Bayesian inference. We implemented our algorithm in C++ using MPI, and our code is posted at github.com/mstaib/stochastic-barycenter-code. Full experiment details are given in Appendix D.

Figure 1: The Wasserstein barycenter of four von Mises-Fisher distributions on the unit sphere S². From left to right, the figures show the initial distributions merging into the Wasserstein barycenter. As the input distributions are moved along parallel paths on the sphere, the barycenter accurately tracks the new locations, as shown in the final three figures.

### 5.1 Von Mises-Fisher Distributions with Drift

We demonstrate computation and tracking of the barycenter of four drifting von Mises-Fisher distributions on the unit sphere S². Note that W_2 and the barycentric cost are now defined with respect to geodesic distance on S². The distributions are randomly centered, and we move the center of each distribution 3 × 10⁻⁵ radians (in the same direction for all distributions) each time a sample is drawn. A snapshot of the results is shown in Figure 1. Our algorithm is clearly able to track the barycenter as the distributions move.
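On the sphere the algorithm is unchanged except for the ground cost: squared geodesic (great-circle) distance replaces squared Euclidean distance in the workers' argmin computation. A minimal sketch of that cost, with names of our own choosing:

```python
import numpy as np

def sq_geodesic_cost(x, Y):
    """c(x, y_i) = arccos(<x, y_i>)^2 for a unit vector x and unit rows Y[i]."""
    cos = np.clip(Y @ x, -1.0, 1.0)      # clip for numerical safety
    return np.arccos(cos) ** 2

# Inside a worker's update, i_W = argmin_i { c(x, y_i) - v_i } becomes:
#   i_w = int(np.argmin(sq_geodesic_cost(x, Y) - v))
# with the support points Y placed on the sphere (e.g., a spherical mesh).
```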
### 5.2 Large Scale Bayesian Inference

Figure 2: Convergence of our algorithm with n ≈ 10⁴ for different stepsizes. In each case we recover a better approximation than what was possible with the LP for any n, in as little as 30 seconds.

We run logistic regression on the UCI skin segmentation dataset [8]. The 245,057 data points are colors represented in R³, each with a binary label determining whether that color is a skin color. We split consecutive blocks of the dataset into 127 subsets, and due to locality in the dataset, the data in each subset is not identically distributed. Each subset is assigned one thread of an InfiniBand cluster, on which we simultaneously sample from the subset posterior via MCMC and optimize the barycenter estimate. This is in contrast to [47], where the barycenter can be computed via a linear program (LP) only after all samplers are run. Since the full dataset is tractable, we can compare the two methods via W_2 distance to the posterior of the full dataset, which we can estimate via the large-scale optimal transport algorithm in [22] or by LP, depending on the support size. For each method, we fix n barycenter support points on a mesh determined by samples from the subset posteriors.

After 317 seconds, or about 10000 iterations per subset posterior, our algorithm has produced a barycenter on n ≈ 10⁴ support points with W_2 distance about 26 from the full posterior. Similarly competitive results hold even for n ≈ 10⁵ or 10⁶, though tuning the stepsize becomes more challenging. Even in the 10⁶ case, no individual 16-thread node used more than 2 GB of memory. For n ≈ 10⁴, over a wide range of stepsizes we can in seconds approximate the full posterior better than is possible with the LP, as seen in Figure 2, by terminating early.

In comparison, in Table 1 we attempt to compute the barycenter LP as in [47] via Mosek [4], for varying values of n. Even n = 480 is not possible on a system with 16 GB of memory, and feasible values of n result in meshes too sparse to accurately and reliably approximate the barycenter. Specifically, there are several cases where n increases but the approximation quality actually decreases: the subset posteriors are spread far apart, and the barycenter is so small relative to the required bounding box that likely only one grid point is close to it, and how close this grid point is depends on the specific mesh. To avoid this behavior, one must either use a dense grid (our approach), or invent a better method for choosing support points that will still cover the barycenter. In terms of compute time, entropy-regularized methods may have fared better than the LP for finer meshes, but would still not give the same result as our method. Note also that the LP timings include only optimization time, whereas in 317 seconds our algorithm produces samples and optimizes.

Table 1: Number of support points n versus computation time and W_2 distance to the true posterior. The first eight columns use the linear program from [47]; the final column is this paper. Compared to prior work, our algorithm handles much finer meshes, producing much better estimates.

| n | 24 | 40 | 60 | 84 | 189 | 320 | 396 | 480 | ≈10⁴ (ours) |
|---|---|---|---|---|---|---|---|---|---|
| time (s) | 0.5 | 0.97 | 2.9 | 6.1 | 34 | 163 | 176 | out of memory | 317 |
| W_2 | 41.1 | 59.3 | 50.0 | 34.3 | 44.3 | 53.7 | 45 | out of memory | 26.3 |
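For the evaluation above, the W_2 distance between a discrete barycenter estimate (support Y, weights w) and samples from the full-data posterior can be computed with any discrete optimal transport solver. The sketch below uses the POT library's exact solver as a stand-in for the LP and stochastic solvers cited in the text; the function name and variable names are ours, and the arrays are placeholders.

```python
import numpy as np
import ot  # Python Optimal Transport (POT)

def w2_to_full_posterior(Y, w, full_samples):
    """W_2 between sum_i w[i] delta_{Y[i]} and the empirical measure of full_samples."""
    a = np.asarray(w, dtype=float)
    b = np.full(len(full_samples), 1.0 / len(full_samples))
    M = ot.dist(Y, full_samples)          # pairwise squared Euclidean costs
    return np.sqrt(ot.emd2(a, b, M))      # W_2 = sqrt of optimal squared-cost

# Usage sketch (hypothetical arrays): Y is the n x d mesh of support points, w the
# learned weights, and full_samples MCMC draws from the full-data posterior.
# print(w2_to_full_posterior(Y, w, full_samples))
```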
## 6 Conclusion and Future Directions

We have proposed an original algorithm for computing the Wasserstein barycenter of arbitrary measures given a stream of samples. Our algorithm is communication-efficient, highly parallel, easy to implement, and enjoys consistency results that, to the best of our knowledge, are new. Our method has immediate impact on large-scale Bayesian inference and sensor fusion tasks: for Bayesian inference in particular, we obtain far finer estimates of the Wasserstein-averaged subset posterior (WASP) [47] than was possible before, enabling faster and more accurate inference.

There are many directions for future work: we have barely scratched the surface in terms of new applications of large-scale Wasserstein barycenters, and there are still many possible algorithmic improvements. One implication of Theorem 3.1 is that a faster algorithm for solving the concave problem (11) immediately yields faster convergence to the barycenter. Incorporating variance reduction [18, 27] is a promising direction, provided we maintain communication efficiency. Recasting problem (11) as distributed consensus optimization [35, 11] would further help scale up the barycenter computation to huge numbers of input measures.

**Acknowledgements.** We thank the anonymous reviewers for their helpful suggestions. We also thank MIT Supercloud and the Lincoln Laboratory Supercomputing Center for providing computational resources. M. Staib acknowledges Government support under and awarded by DoD, Air Force Office of Scientific Research, National Defense Science and Engineering Graduate (NDSEG) Fellowship, 32 CFR 168a. J. Solomon acknowledges funding from the MIT Research Support Committee ("Structured Optimization for Geometric Problems"), as well as Army Research Office grant W911NF-12-R-0011 ("Smooth Modeling of Flows on Graphs"). This research was supported by NSF CAREER award 1553284 and the Defense Advanced Research Projects Agency (grant number N66001-17-1-4039). The views, opinions, and/or findings contained in this article are those of the author and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense.

## References

[1] M. Agueh and G. Carlier. Barycenters in the Wasserstein Space. SIAM J. Math. Anal., 43(2):904–924, January 2011. doi: 10.1137/100805741.

[2] Ethan Anderes, Steffen Borgwardt, and Jacob Miller. Discrete Wasserstein barycenters: Optimal transport for discrete data. Math. Meth. Oper. Res., 84(2):389–409, October 2016. doi: 10.1007/s00186-016-0549-x.

[3] Elaine Angelino, Matthew James Johnson, and Ryan P. Adams. Patterns of scalable Bayesian inference. Foundations and Trends in Machine Learning, 9(2-3):119–247, 2016. doi: 10.1561/2200000052. URL http://dx.doi.org/10.1561/2200000052.

[4] MOSEK ApS. The MOSEK optimization toolbox for MATLAB manual. Version 8.0.0.53, 2017. URL http://docs.mosek.com/8.0/toolbox/index.html.

[5] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. 2017.

[6] M. Baum, P. K. Willett, and U. D. Hanebeck. On Wasserstein Barycenters and MMOSPA Estimation. IEEE Signal Process. Lett., 22(10):1511–1515, October 2015. doi: 10.1109/LSP.2015.2410217.

[7] J. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman Projections for Regularized Transportation Problems. SIAM J. Sci. Comput., 37(2):A1111–A1138, January 2015. doi: 10.1137/141000439.

[8] Rajen Bhatt and Abhinav Dhall. Skin segmentation dataset. UCI Machine Learning Repository.

[9] Jérémie Bigot, Elsa Cazelles, and Nicolas Papadakis. Regularization of barycenters in the Wasserstein space. arXiv:1606.01025 [math, stat], June 2016.
[10] Emmanuel Boissard, Thibaut Le Gouic, and Jean-Michel Loubes. Distribution's template estimate with Wasserstein metrics. Bernoulli, 21(2):740–759, May 2015. doi: 10.3150/13-BEJ585.

[11] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[12] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. Streaming Variational Bayes. In Advances in Neural Information Processing Systems 26, pages 1727–1735. Curran Associates, Inc., 2013.

[13] Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1683–1691, Beijing, China, June 2014. PMLR. URL http://proceedings.mlr.press/v32/cheni14.html.

[14] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal Transport for Domain Adaptation. IEEE Trans. Pattern Anal. Mach. Intell., PP(99):1–1, 2016. doi: 10.1109/TPAMI.2016.2615921.

[15] Marco Cuturi. Sinkhorn Distances: Lightspeed Computation of Optimal Transport. In Advances in Neural Information Processing Systems 26, pages 2292–2300. Curran Associates, Inc., 2013.

[16] Marco Cuturi and Arnaud Doucet. Fast Computation of Wasserstein Barycenters. In Proceedings of the 31st International Conference on Machine Learning, pages 685–693, 2014.

[17] Marco Cuturi and Gabriel Peyré. A Smoothed Dual Approach for Variational Wasserstein Problems. SIAM J. Imaging Sci., 9(1):320–343, January 2016. doi: 10.1137/15M1032600.

[18] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

[19] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279. ACM, 2008.

[20] Yoav Freund and Robert E. Schapire. Adaptive Game Playing Using Multiplicative Weights. Games and Economic Behavior, 29(1):79–103, October 1999. doi: 10.1006/game.1999.0738.

[21] Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A. Poggio. Learning with a Wasserstein Loss. In Advances in Neural Information Processing Systems 28, pages 2053–2061. Curran Associates, Inc., 2015.

[22] Aude Genevay, Marco Cuturi, Gabriel Peyré, and Francis Bach. Stochastic Optimization for Large-scale Optimal Transport. In Advances in Neural Information Processing Systems 29, pages 3440–3448. Curran Associates, Inc., 2016.

[23] Elad Hazan. Introduction to Online Convex Optimization. Foundations and Trends in Optimization, 2(3-4):157–325, August 2016. doi: 10.1561/2400000013.

[24] Michael Held, Philip Wolfe, and Harlan P. Crowder. Validation of subgradient optimization. Mathematical Programming, 6(1):62–88, December 1974. doi: 10.1007/BF01580223.
[25] Matthew D. Hoffman, David M. Blei, Chong Wang, and John William Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.

[26] Matthew Johnson, James Saunderson, and Alan Willsky. Analyzing Hogwild parallel Gaussian Gibbs sampling. In Advances in Neural Information Processing Systems 26, pages 2715–2723. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/5043-analyzing-hogwild-parallel-gaussian-gibbs-sampling.pdf.

[27] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[28] Young-Heon Kim and Brendan Pass. Wasserstein barycenters over Riemannian manifolds. Advances in Mathematics, 307:640–683, February 2017. doi: 10.1016/j.aim.2016.11.026.

[29] Jun Kitagawa, Quentin Mérigot, and Boris Thibert. Convergence of a Newton algorithm for semi-discrete optimal transport. arXiv:1603.05579 [cs, math], March 2016.

[30] Benoît Kloeckner. Approximation by finitely supported measures. ESAIM Control Optim. Calc. Var., 18(2):343–359, 2012.

[31] Bruno Lévy. A Numerical Algorithm for L2 Semi-Discrete Optimal Transport in 3D. ESAIM Math. Model. Numer. Anal., 49(6):1693–1715, November 2015. doi: 10.1051/m2an/2015055.

[32] C. Michelot. A finite algorithm for finding the projection of a point onto the canonical simplex of R^n. J. Optim. Theory Appl., 50(1):195–200, July 1986. doi: 10.1007/BF00938486.

[33] Stanislav Minsker, Sanvesh Srivastava, Lizhen Lin, and David Dunson. Scalable and Robust Bayesian Inference via the Median Posterior. In PMLR, pages 1656–1664, January 2014.

[34] Grégoire Montavon, Klaus-Robert Müller, and Marco Cuturi. Wasserstein Training of Restricted Boltzmann Machines. In Advances in Neural Information Processing Systems 29, pages 3718–3726. Curran Associates, Inc., 2016.

[35] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

[36] Willie Neiswanger, Chong Wang, and Eric P. Xing. Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI '14, pages 623–632, Arlington, Virginia, United States, 2014. AUAI Press. URL http://dl.acm.org/citation.cfm?id=3020751.3020816.

[37] Arkadi Nemirovski and Reuven Y. Rubinstein. An Efficient Stochastic Approximation Algorithm for Stochastic Saddle Point Problems. In Modeling Uncertainty, number 46 in International Series in Operations Research & Management Science, pages 156–184. Springer US, 2005. doi: 10.1007/0-306-48102-2_8.

[38] David Newman, Padhraic Smyth, Max Welling, and Arthur U. Asuncion. Distributed inference for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 20, pages 1081–1088. Curran Associates, Inc., 2008. URL http://papers.nips.cc/paper/3330-distributed-inference-for-latent-dirichlet-allocation.pdf.
[39] Gabriel Peyré, Marco Cuturi, and Justin Solomon. Gromov-Wasserstein Averaging of Kernel and Distance Matrices. In PMLR, pages 2664–2672, June 2016.

[40] Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein Barycenter and Its Application to Texture Mixing. In Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, Berlin, Heidelberg, May 2011. doi: 10.1007/978-3-642-24785-9_37.

[41] Antoine Rolet, Marco Cuturi, and Gabriel Peyré. Fast Dictionary Learning with a Smoothed Wasserstein Loss. In PMLR, pages 630–638, May 2016.

[42] Filippo Santambrogio. Optimal Transport for Applied Mathematicians, volume 87 of Progress in Nonlinear Differential Equations and Their Applications. Springer International Publishing, Cham, 2015. doi: 10.1007/978-3-319-20828-2.

[43] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[44] Ohad Shamir. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information Processing Systems 29, pages 46–54. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6245-without-replacement-sampling-for-stochastic-gradient-methods.pdf.

[45] Justin Solomon, Fernando de Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein Distances: Efficient Optimal Transportation on Geometric Domains. ACM Trans. Graph., 34(4):66:1–66:11, July 2015. doi: 10.1145/2766963.

[46] Suvrit Sra. Directional Statistics in Machine Learning: A Brief Review. arXiv:1605.00316 [stat], May 2016.

[47] Sanvesh Srivastava, Volkan Cevher, Quoc Dinh, and David Dunson. WASP: Scalable Bayes via barycenters of subset posteriors. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 912–920, San Diego, California, USA, May 2015. PMLR. URL http://proceedings.mlr.press/v38/srivastava15.html.

[48] Sanvesh Srivastava, Cheng Li, and David B. Dunson. Scalable Bayes via Barycenter in Wasserstein Space. arXiv:1508.05880 [stat], August 2015.

[49] Cédric Villani. Optimal Transport: Old and New. Number 338 in Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 2009.

[50] Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.

[51] J. Ye, P. Wu, J. Z. Wang, and J. Li. Fast Discrete Distribution Clustering Using Wasserstein Barycenter With Sparse Support. IEEE Trans. Signal Process., 65(9):2317–2332, May 2017. doi: 10.1109/TSP.2017.2659647.

[52] Yuchen Zhang, John C. Duchi, and Martin J. Wainwright. Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14:3321–3363, 2013.

[53] Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. Journal of Machine Learning Research, 16:3299–3340, 2015. URL http://jmlr.org/papers/v16/zhang15d.html.

[54] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.