# Alleviating Label Switching with Optimal Transport

Pierre Monteiller (ENS Ulm, pierre.monteiller@ens.fr), Sebastian Claici (MIT CSAIL & MIT-IBM Watson AI Lab, sclaici@mit.edu), Edward Chien (MIT CSAIL & MIT-IBM Watson AI Lab, edchien@mit.edu), Farzaneh Mirzazadeh (IBM Research & MIT-IBM Watson AI Lab, farzaneh@ibm.com), Justin Solomon (MIT CSAIL & MIT-IBM Watson AI Lab, jsolomon@mit.edu), Mikhail Yurochkin (IBM Research & MIT-IBM Watson AI Lab, mikhail.yurochkin@ibm.com)

Abstract

Label switching is a phenomenon arising in mixture model posterior inference that prevents one from meaningfully assessing posterior statistics using standard Monte Carlo procedures. This issue arises due to invariance of the posterior under actions of a group; for example, permuting the ordering of mixture components has no effect on the likelihood. We propose a resolution to label switching that leverages machinery from optimal transport. Our algorithm efficiently computes posterior statistics in the quotient space of the symmetry group. We give conditions under which there is a meaningful solution to label switching and demonstrate advantages over alternative approaches on simulated and real data.

1 Introduction

Mixture models are powerful tools for understanding multimodal data. In the Bayesian setting, to fit a mixture model to such data, we typically assume a prior number of components and optimize or sample from the posterior distribution over the component parameters. If prior components are exchangeable, this leads to an identifiability issue known as label switching. In particular, permuting the ordering of mixture components does not change the likelihood, since it produces the same model. The underlying problem is that a group acts on the parameters of the mixture model; posterior probabilities are invariant under the action of the group.

To formalize this intuition, suppose our input is a data set X and a parameter K denoting the number of mixture components. In the most common application, we want to fit a mixture of K Gaussians to the data; our parameter set is Θ = {θ_1, . . . , θ_K}, where θ_k = {µ_k, Σ_k, π_k} gives the parameters of each component. The likelihood of x ∈ X conditioned on Θ is

p(x | Θ) = ∑_{k=1}^K π_k f(x; µ_k, Σ_k),

where f(x; µ_k, Σ_k) is the density function of N(µ_k, Σ_k). Any permutation of the labels k = 1, . . . , K yields the same likelihood. The prior is also permutation invariant.

When we compute statistics of the posterior p(Θ | x), however, this permutation invariance leads to K! symmetric regions in the posterior landscape. Sampling and inference algorithms behave poorly as the number of modes increases, and this problem is only exacerbated in this context since increasing the number of components in the mixture model leads to a super-exponential increase in the number of modes of the posterior. Previous methods such as the invariant losses of Celeux et al. (2000) and pivot alignments of Marin et al. (2005) do not identify modes in a principled manner.

To combat this issue, we leverage the theory of optimal transport. In particular, one way to avoid the multimodal nature of the posterior distribution is to replace each sample with its orbit under the action of the symmetry group, seen as a distribution over K! points. While this symmetrized distribution is invariant to group actions, we cannot average several such distributions using standard Euclidean metrics.
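To make the difficulty concrete, the following minimal sketch (NumPy/SciPy, with hypothetical toy values not taken from our experiments) checks that permuting the component labels of a two-component 1D mixture leaves the likelihood unchanged, while naively averaging two label-permuted samples collapses the distinct component means:

```python
# Minimal sketch (hypothetical toy values): permuting labels of a 2-component 1D Gaussian
# mixture leaves the likelihood unchanged, yet naive averaging of label-permuted samples
# collapses the components.
import numpy as np
from scipy.stats import norm

x = np.array([-2.1, -1.9, 2.0, 2.2])                   # toy observations
def log_lik(mu, sigma, pi):
    # log p(x | Theta) = sum_i log sum_k pi_k N(x_i; mu_k, sigma_k^2)
    dens = pi[None, :] * norm.pdf(x[:, None], mu[None, :], sigma[None, :])
    return np.log(dens.sum(axis=1)).sum()

theta1 = (np.array([-2.0, 2.0]), np.array([0.5, 0.5]), np.array([0.5, 0.5]))
theta2 = tuple(a[::-1] for a in theta1)                 # same model, labels permuted

print(np.isclose(log_lik(*theta1), log_lik(*theta2)))   # True: the likelihood is invariant
print((theta1[0] + theta2[0]) / 2)                      # naive average of the means: [0. 0.]
```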
We use the notion of a Wasserstein barycenter to calculate a mean in this space, which we can project to a mean in the parameter space via the quotient map. We show conditions under which our optimization can be performed efficiently on the quotient space, thus circumventing the need to store and manipulate orbit distributions with large support.

Contributions. We give a practical and simple algorithm to solve the label switching problem. To justify our algorithm, we demonstrate that a group-invariant Wasserstein barycenter exists when the distributions being averaged are group-invariant. We give conditions under which the Wasserstein barycenter can be written as the orbit of a single point, and we explain how failure modes of our algorithm correspond to ill-posed problems. We show that the problem can be cast as computing the expected value of the quotient distribution, and we give an SGD algorithm to solve it.

2 Related work

Mixture models. Gaussian mixture models are powerful for modeling a wide range of phenomena (McLachlan et al., 2019). These models assume that a sample is drawn from one of the latent states (or components), but that the particular component assigned to any given sample is unknown. In a Bayesian setup, Markov chain Monte Carlo can sample from the posterior distribution over the parameters of the mixture model. Hamiltonian Monte Carlo (HMC) has proven particularly successful for this task. Introduced for lattice quantum chromodynamics (Duane et al., 1987), HMC has become a popular option for statistical applications (Neal et al., 2011). Recent high-performance software offers practitioners easy access to HMC and other sampling algorithms (Carpenter et al., 2017).

Label switching. Label switching arises when we take a Bayesian approach to parameter estimation in mixture models (Diebolt & Robert, 1994). Jasra et al. (2005) and Papastamoulis (2015) overview the problem. Label switching can happen even when samplers do not explore all K! possible modes, e.g., for Gibbs sampling. Documentation for modern sampling tools mentions that it arises in practice.[1] Label switching can also occur when using parallel HMC, since tools like Stan run multiple chains at once. While a single chain may only explore one mode, several chains are likely to yield different label permutations. Jasra et al. (2005, §6) mention a few loss functions invariant to the different labelings. Most relevant is the loss proposed by Celeux et al. (2000, §5). Beyond our novel theoretical connections to optimal transport, in contrast to their method, our algorithm uses optimal rather than greedy matching to resolve elements of the symmetric group, applies to general groups and quotient manifolds, and uses stochastic gradient descent instead of simulated annealing. Somewhat ad hoc but also related is the pivotal reordering algorithm (Marin et al., 2005), which uses a sample drawn from the distribution as a pivot point to break the symmetry; as we will see in our experiments, a poorly chosen pivot seriously degrades performance.

Optimal transport. Optimal transport (OT) has seen a surge of interest in learning, from applications in generative models (Arjovsky et al., 2017; Genevay et al., 2018), Bayesian inference (Srivastava et al., 2015), and natural language (Kusner et al., 2015; Alvarez-Melis & Jaakkola, 2018) to technical underpinnings for optimization methods (Chizat & Bach, 2018).
See Solomon (2018); Peyré & Cuturi (2018) for discussion of computational OT and Santambrogio (2015); Villani (2009) for theory. The Wasserstein distance from optimal transport (§3.1) induces a metric on the space of probability distributions from the geometry of the underlying domain. This leads to a notion of a Wasserstein barycenter of several probability distributions (Agueh & Carlier, 2011). Scalable algorithms have been proposed for barycenter computation, including methods that exploit entropic regularization (Cuturi & Doucet, 2014), use parallel computing (Staib et al., 2017), apply stochastic optimization (Claici et al., 2018), and distribute the computation across several machines (Uribe et al., 2018).

[1] https://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html

3 Optimal Transport under Group Actions

Before delving into technical details, we will illustrate our approach with a simple example. Assume we have some data to which we wish to fit a Gaussian mixture model with K components. We can now draw samples from the posterior distribution, and we would like to obtain a point estimate of the mean of the posterior. We draw two samples Θ¹ = (θ¹_1, . . . , θ¹_K) and Θ² = (θ²_1, . . . , θ²_K). We cannot average them due to the ambiguity of label switching; see Figure 1(a) and §1.3 of the supplementary for a simple example. However, we can explicitly encode this multimodality as a uniform distribution over all K! states:

(1/K!) ∑_{σ ∈ S_K} δ_{σ·Θ¹}  and  (1/K!) ∑_{σ ∈ S_K} δ_{σ·Θ²},

where S_K is the symmetry group on K points that acts by permuting the elements of Θ¹ and Θ². These distributions are now invariant to permutations, so we can ask if there exists an average in this space. In this section, we prove that this is possible through the machinery of optimal transport.

We provide theoretical results relevant to optimal transport between measures supported on the quotient space under actions of some group G. This theory is fairly general and requires only basic assumptions about the underlying space X and the action of G. For each theoretical result, we will use italics to highlight key assumptions, since they vary somewhat from proposition to proposition.

3.1 Preliminaries: Optimal transport

Let (X, d) be a complete and separable metric space. We define the p-Wasserstein distance on the space P(X) of probability distributions over X as a minimization over matchings between µ and ν:

W_p^p(µ, ν) = inf_{π ∈ Π(µ,ν)} ∫_{X×X} d(x, y)^p dπ(x, y).

Here Π(µ, ν) is the set of couplings between measures µ and ν, defined as

Π(µ, ν) = {π ∈ P(X × X) | π(A × X) = µ(A), π(X × B) = ν(B) for all measurable A, B ⊆ X}.

W_p induces a metric on the set P_p(X) of measures with finite p-th moments (Villani, 2009). We will focus on P_2(X), endowed with the metric W_2. This metric structure allows us to define meaningful statistics for sets of distributions. In particular, a Fréchet mean (or Wasserstein barycenter) of a set of distributions ν_1, . . . , ν_n ∈ P_2(X) is defined as a minimizer

µ* = argmin_{µ ∈ P_2(X)} (1/n) ∑_{i=1}^n W_2^2(µ, ν_i). (1)

We follow Kim & Pass (2017) and generalize this notion slightly, by placing a measure itself on the space P_2(X). We will use P_2(P_2(X)) to denote the space of probability measures on P_2(X) that have finite second moments, and we let Ω be a member of this set. Then the following functional, which generalizes (1) from finite sums to infinite sets of measures, is finite:

B(µ) := ∫_{P_2(X)} W_2^2(µ, ν) dΩ(ν) = E_{ν∼Ω}[W_2^2(µ, ν)]. (2)

In analogy to (1), a natural task is to search for a minimizer of the map µ ↦ B(µ).
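Before turning to existence of minimizers, here is a minimal numerical sketch of the W_2 distance just defined (SciPy assumed, hypothetical point sets): for two uniform empirical measures on equally many points, an optimal coupling can be taken to be a permutation, so the infimum over couplings reduces to a linear sum assignment problem.

```python
# Minimal sketch (hypothetical point sets): squared 2-Wasserstein distance between two
# uniform empirical measures on n points each, computed as a linear sum assignment.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def w2_squared(P, Q):
    """P, Q: (n, d) arrays of support points of uniform empirical measures."""
    C = cdist(P, Q, metric="sqeuclidean")      # pairwise ground costs d(x, y)^2
    rows, cols = linear_sum_assignment(C)      # optimal permutation coupling
    return C[rows, cols].mean()                # average matched cost under weights 1/n

P = np.array([[0.0, 0.0], [1.0, 0.0]])
Q = np.array([[1.1, 0.0], [0.1, 0.0]])
print(w2_squared(P, Q))                        # 0.01: matches the points across the swap
```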
For existence of such a minimizer, we simply require that supp(Ω) is tight.

Definition 1 (Tightness of measures). A collection C of measures on X is called tight if for any ε > 0 there exists a compact set K ⊆ X such that for all µ ∈ C, we have µ(K) > 1 − ε.

Here are three examples of tight collections: P_2(X) if X is compact, the set of all Gaussian distributions with means supported on a compact space and of bounded variance, or any set of measures with a uniform bound on second moments (argued in the supplementary). This assumption is fairly mild and covers many application scenarios. Prokhorov's theorem (deferred to the supplementary) implies the existence of a barycenter:

Theorem 1 (Existence of minimizers). B(µ) has at least one minimizer in P_2(X) if supp(Ω) is tight.

3.2 Optimal transport with group invariances

Let G be a finite group that acts by isometries on X. We define the set of measures invariant under the group action

P_2(X)^G = {µ ∈ P_2(X) | g_#µ = µ for all g ∈ G},

where the pushforward of µ by g is defined as g_#µ(B) = µ(g^{-1}(B)) for B a measurable set. We are interested in the relation between the space P_2(X)^G and the space of measures on the quotient space, P_2(X/G). If all of the measures in the support of Ω in (2) are invariant under the group action, we can show that there exists a barycenter with the same property:

Lemma 1. If Ω ∈ P_2(P_2(X)^G) is supported on the set of group-invariant measures on X and supp(Ω) is tight, then there exists a minimizer of B(µ) in P_2(X) that is invariant under the group action.

Proof. Let µ* ∈ P_2(X) denote the minimizer from Theorem 1. Define a new distribution µ^G = (1/|G|) ∑_{g∈G} g_#µ*. We verify that µ^G has the same cost as µ*:

B(µ^G) = E_{ν∼Ω}[W_2^2(µ^G, ν)]
 ≤ E_{ν∼Ω}[(1/|G|) ∑_{g∈G} W_2^2(g_#µ*, ν)]          (by convexity of µ ↦ W_2^2(µ, ν))
 = E_{ν∼Ω}[(1/|G|) ∑_{g∈G} W_2^2(µ*, (g^{-1})_#ν)]    (since g acts by isometry)
 = (1/|G|) ∑_{g∈G} E_{ν∼Ω}[W_2^2(µ*, ν)] = E_{ν∼Ω}[W_2^2(µ*, ν)]  (by linearity of expectation and group invariance of ν).

But µ* is a minimizer, so the inequality in the first line must be an equality.

Remark: If X is a compact Riemannian manifold and Ω gives positive weight to the set of absolutely continuous measures, then Theorem 3.1 of Kim & Pass (2017) provides uniqueness (and this may be extended to other non-compact cases with suitable decay conditions). However, in our setting, Ω is supported on samples, measures consisting of delta functions. In this case, a simple counterexample is presented in the supplementary (§1.4), which arises in the case where X consists of two points in R² and S_2 acts to swap the points (S_K is the group of permutations of a finite set of K points). This is accompanied by a study of the case of K points in R^d (see supplementary §1.3), relevant to mixture models where components are evenly weighted and identical with a single mean parameter. Via this study we see that counterexamples seem to require a high degree of symmetry, which is unlikely to happen in applied scenarios and does not arise empirically in our experiments.

An analogous proof technique can be used to show the following lemma, needed later:

Lemma 2. If ν_1 and ν_2 are two measures invariant under the group action, then there exists an optimal transport plan π ∈ Π(ν_1, ν_2) that is invariant under the diagonal group action g·(x, y) = (g·x, g·y), i.e., (g, g)_#π = π for all g ∈ G.

The above suggests that we might instead search for barycenters in the quotient space. Consider:

Lemma 3 (Lott & Villani 2009, Lemma 5.36). Let p : X → X/G be the quotient map. The map p_# : P_2(X) → P_2(X/G) restricts to an isometric isomorphism between the set P_2(X)^G of G-invariant elements of P_2(X) and P_2(X/G).
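As a quick numerical check of the isometry in Lemma 3 (a sketch assuming SciPy, with G = S_K acting on X = (R^d)^K by permuting the K blocks), the W_2 distance between two symmetrized orbit measures coincides with the distance in the quotient, i.e., the cost of the best single alignment of one tuple to the other:

```python
# Numerical check (sketch): for G = S_K acting on (R^d)^K by permuting blocks, the W_2
# distance between two uniform orbit measures equals min_sigma ||p - sigma(q)||.
import itertools
import numpy as np
from scipy.optimize import linear_sum_assignment

K, d = 3, 2
rng = np.random.default_rng(0)
p, q = rng.normal(size=(K, d)), rng.normal(size=(K, d))   # two tuples of component parameters

perms = list(itertools.permutations(range(K)))
orbit_p = np.stack([p[list(s)].ravel() for s in perms])   # K! atoms in R^{K*d}
orbit_q = np.stack([q[list(s)].ravel() for s in perms])

# W_2 between the two uniform orbit measures, via optimal assignment of their K! atoms.
C = ((orbit_p[:, None, :] - orbit_q[None, :, :]) ** 2).sum(-1)
rows, cols = linear_sum_assignment(C)
w2 = np.sqrt(C[rows, cols].mean())

# Quotient distance: best single alignment of q to p over all permutations.
quotient = min(np.linalg.norm(p - q[list(s)]) for s in perms)
print(np.isclose(w2, quotient))   # True
```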
We now introduce additional structure relevant to label switching. Assume that all measures ν in the support of Ω are the orbits of individual delta distributions, as they are samples of parameter values, i.e., ν = (1/|G|) ∑_{g∈G} δ_{g·x} for some x ∈ X. In the simple example of a mixture of two Gaussians fit to 1D data with means at µ_1, µ_2 ∈ R, ν is of the following form:

ν = (1/2) δ_{(µ_1, µ_2)} + (1/2) δ_{(µ_2, µ_1)}.

Under this assumption and by Lemmas 1 and 3, minimization of B(µ) is equivalent to finding the Wasserstein barycenter of delta distributions on X/G. Letting Ω′ := p_#Ω, we aim to find

argmin_{µ ∈ P_2(X/G)} E_{δ_x ∼ Ω′}[W_2^2(µ, δ_x)]. (3)

From properties of Wasserstein barycenters (Carlier et al. 2015, Equation (2.9)), the support of µ lies in the set of solutions to

min_{z ∈ X/G} E_{δ_x ∼ Ω′}[d(x, z)²], (4)

where d is the metric on the quotient space X/G (see e.g. Santambrogio 2015, §5.5.5). As Ω has finite second moments, so does Ω′, giving us existence of the expectation. The existence of minimizers of z ↦ E_{δ_x ∼ Ω′}[d(x, z)²] is established in §2.1 of the supplementary, giving the following lemma:

Lemma 4. The map z ↦ E_{δ_x ∼ Ω′}[d(x, z)²] has a minimizer.

Uniqueness of minimizers is not guaranteed (see §1.4 of the supplementary), but we can rewrite (3) as

argmin_{µ ∈ P_2(X/G)} E_{δ_x ∼ Ω′}[W_2^2(µ, δ_x)] = argmin_{µ ∈ P_2(X/G)} ∫ ∫_{X/G} d(x, y)² dµ(y) dΩ′(δ_x) = argmin_{µ ∈ P_2(X/G)} ∫_{X/G} ( ∫ d(x, y)² dΩ′(δ_x) ) dµ(y).

By Lemma 4, the map y ↦ ∫ d(x, y)² dΩ′(δ_x) has a (potentially non-unique) minimizer. Call this function b(y). We are left with

argmin_{µ ∈ P_2(X/G)} ∫_{X/G} b(y) dµ(y).

Any minimizer y* of b leads to a minimizing distribution µ = δ_{y*}, and we can conclude:

Theorem 2 (Single Orbit Barycenters). There is a barycenter solution of (2) that can be written as µ* = (1/|G|) ∑_{g∈G} δ_{g·z*}.

Returning to our example of a Gaussian mixture model, we see that this theorem implies there is a barycenter (a mean in distribution space) that has the same form as the symmetrized sample distributions. Any point in the support of the barycenter is an estimate for the mean of the posterior distribution. As an aside, we mention that our proofs do not require finite groups. In fact, we prove Theorem 2 for compact groups G endowed with a Haar measure in the supplement.

To summarize: Label switching leads to issues when computing posterior statistics because we work in the full space X, when we ought to work in the quotient space X/G. Theorem 2 relates means in X/G to barycenters of measures on X, which gives us a principled method for computing statistics backed by a convex problem in the space of measures: take a quotient, find a mean in X/G, and then pull the result back to X. We will see below in concrete detail that we do not need to explicitly construct and average in X/G, but may leverage group invariance of the transport to perform these steps in X. The crux of this theory is that the Wasserstein barycenter in the setting of Lemma 1 is a point estimate for the mean of the symmetrized posterior distribution. The results leading to Theorem 2 should be understood then as a reduction of the problem of finding an estimate of the mean to that of minimizing a distance function on the quotient space; this latter minimization problem can then be solved via Riemannian gradient descent.

4 Algorithms

Table 1: Notation for our algorithm.
  M: Riemannian manifold
  g_p: inner product at p ∈ M
  d(p, q): geodesic distance between p, q ∈ M
  M^K: K-fold product manifold with the product metric
  c(p, q): transport cost, c(p, q) = (1/2) d(p, q)²
  exp_p, log_p: exponential and logarithm maps at p ∈ M
  S_K: symmetric group on K symbols
  C_K: cyclic group on K symbols
  M/G: quotient space of equivalence classes [p] = {g·p | g ∈ G}
Label switching usually occurs due to symmetries of certain Bayesian models. Posteriors with label-switching symmetries make it difficult to compute meaningful summary statistics, e.g., posterior expectations for the parameters of interest. Any attempt to compute posterior statistics in this regime must account for the orbits of samples under the symmetry group. Continuing in the case of expectations, based on the previous section we can extract a meaningful notion of averaging by taking the image of each posterior sample under the symmetry group and computing a barycenter with respect to the Wasserstein metric. This resolves the ambiguity regarding which points in orbits should match, without symmetry-breaking heuristics like pivoting (Marin et al., 2005).

Figure 1: (a) Ambiguity: suppose we wish to update our estimate of the average (blue) given a new sample (red) from Ω; due to label switching, other points (light shade) have equal likelihood to our sample, causing ambiguity. (b) Orbit empirical distribution: Theorem 2 suggests an unambiguous update by constructing |G|-point orbits as empirical distributions and doing gradient descent with respect to the Wasserstein metric. (c) Quotient update: this algorithm is equivalent to moving one point, with a careful choice of update functions. This schematic arises for a mean-only model with three means in R (§1.3 of the supplementary); G = S_3, with action generated by reflections over the dashed lines.

Algorithm 1: Riemannian Barycenter of Ω
Input: Distribution Ω, exp and log maps on M
Output: Estimate of the barycenter of Ω
1: Initialize the barycenter p ∼ Ω.
2: for t = 1, 2, . . . do
3:   Draw q ∼ Ω
4:   D_p c(p, q) := −log_p(q)
5:   p ← exp_p(−(1/t) D_p c(p, q))
6: end for

In this section, we provide an algorithm for computing the W_2 barycenters above, extracting a symmetry-invariant notion of expectation for distributions with label switching. As input, we are given a sampler from a distribution Ω over a space M subject to label switching, as well as its (finite) symmetry group G. Our goal is to output a barycenter of the form (1/|G|) ∑_{g∈G} δ_{g·x} for some x ∈ M, using stochastic gradient descent on (2). Our approach can be interpreted in two ways, echoing the derivation of Theorem 2: The most direct interpretation, shown in Figure 1(b), is that we push forward Ω to a distribution over empirical distributions of the form (1/|G|) ∑_{g∈G} δ_{g·x}, where x ∼ Ω, and then compute the barycenter as a |G|-point empirical distribution whose support points move according to stochastic gradient descent, similar to the method by Claici et al. (2018). Since |G| can grow extremely quickly, we argue that this algorithm is equivalent to one that moves a single representative x, so long as the gradient with respect to x accounts for the objective function; this is illustrated in Figure 1(c). Although our final algorithm has cosmetic similarity to pivoting and other algorithms that compute a single representative point, the details of our approach show an equivalence to a well-posed transport problem. Moreover, our stochastic gradient algorithm invokes a sampler from Ω in every iteration, rather than precomputing a finite sample, i.e., our algorithm deals with samples as they come in, rather than collecting multiple samples and then trying to cluster or break the symmetry a posteriori. Table 1 gives a reference for the notation used in this section. Note that the Riemannian gradient of c(p, q) has a particularly simple form: D_p c(p, q) = −log_p(q) (Kim & Pass, 2017).
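As a simple illustration, consider Algorithm 1 in the Euclidean case M = R^d, where exp_p(v) = p + v and log_p(q) = q − p. The sketch below (with a hypothetical stand-in sampler for Ω) makes the iteration explicit; with step size 1/t it reduces to a running average of the draws.

```python
# Minimal sketch of Algorithm 1 for M = R^d: exp_p(v) = p + v, log_p(q) = q - p.
# The sampler below is a hypothetical stand-in for draws from Omega.
import numpy as np

rng = np.random.default_rng(0)
def sample_omega():                  # stand-in for a posterior sampler over M = R^2
    return rng.normal(loc=[1.0, -2.0], scale=0.3, size=2)

p = sample_omega()                   # 1: initialize the barycenter p ~ Omega
for t in range(1, 5000):             # 2: for t = 1, 2, ...
    q = sample_omega()               # 3: draw q ~ Omega
    grad = -(q - p)                  # 4: D_p c(p, q) = -log_p(q)
    p = p + (1.0 / t) * (-grad)      # 5: p <- exp_p(-(1/t) D_p c(p, q))
print(p)                             # approaches the mean [1.0, -2.0]
```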
Gradient descent on the quotient space. For simplicity of exposition, we introduce a few additional assumptions on our problem; our algorithm can generalize to other cases, but these assumptions are the most relevant to the experiments and applications in §5. In particular, we assume we are trying to infer a mixture model with K components. The parameters of our model are tuples (p_1, . . . , p_K), where p_i ∈ M for all i and some Riemannian manifold M. We can think of the space of parameters as the product M^K. Typically it is undesirable for two components to match exactly in a mixture model, so we additionally excise any tuple (p_1, . . . , p_K) with matching elements (together a set of measure zero). Since the parameters of a mixture model can be represented as a point configuration, it is natural to work with the Kth ordered configuration space of M considered in physics and algebraic topology (Fadell & Husseini, 2001):

Conf_K(M) := M^K \ {(p_1, . . . , p_K) | p_i = p_j for some i ≠ j} ⊂ M^K.

Algorithm 2: Barycenter of Ω on quotient space
Input: Distribution Ω, exp and log maps on M
Output: Barycenter [(p_1, . . . , p_K)]
1: Initialize the barycenter (p_1, . . . , p_K) ∼ Ω.
2: for t = 1, 2, . . . do
3:   Draw (q_1, . . . , q_K) ∼ Ω
4:   Compute σ in (5)
5:   for i = 1, . . . , K do
6:     D_{p_i} c(p_i, q_{σ(i)}) := −log_{p_i}(q_{σ(i)})
7:     p_i ← exp_{p_i}(−(1/t) D_{p_i} c(p_i, q_{σ(i)}))
8:   end for
9: end for

Let Ω ∈ P(Conf_K(M)) be the Bayesian posterior distribution restricted to Conf_K(M) (assuming the posterior in P(M^K) is absolutely continuous with respect to the volume measure, this restriction does essentially nothing). If K = 1, we can compute the expected value of Ω using classical stochastic gradient descent (Algorithm 1). If K > 1, however, label switching may occur: there may be a group G acting on {1, 2, . . . , K} that reindexes the elements of the product Conf_K(M) without affecting likelihood. This invalidates the expectation computed by Algorithm 1. In this case, we need to work in the quotient Conf_K(M)/G. Two key examples for G will be the symmetric group S_K of permutations and the cyclic group C_K of cyclic permutations. When G = S_K, we simply recover the Kth unordered configuration space, typically denoted UConf_K(M). UConf_K(M) is a Riemannian manifold with structure inherited from the product metric on Conf_K(M), and it has the property

d_{UConf_K(M)}([(p_1, . . . , p_K)], [(q_1, . . . , q_K)]) = min_{σ ∈ S_K} d_{M^K}((p_1, . . . , p_K), (q_{σ(1)}, . . . , q_{σ(K)})). (5)

The analogous fact holds for Conf_K(M)/G for other finite G via standard arguments (see e.g. Kobayashi (1995)). Thus, we may step in the gradient direction on the quotient by solving a suitable optimal transport matching problem. Since G is finite, the map σ minimizing (5) is computable algorithmically. When G = C_K, we simply enumerate all K cyclic permutations of (q_1, . . . , q_K) and choose the one closest to p. When G = S_K, we can recover σ by solving a linear assignment problem with cost c_{ij} = d(p_i, q_j)².

Algorithm 3: Barycenter for Gaussian Mixtures
Input: Distribution Ω
Output: Barycenter p* = ((µ*_1, Σ*_1), . . . , (µ*_K, Σ*_K))
1: Initialize the barycenter p* ∼ Ω.
2: for t = 1, 2, . . . do
3:   Draw ((µ_1, Σ_1), . . . , (µ_K, Σ_K)) ∼ Ω
4:   Compute σ in (5)
5:   for i = 1, . . . , K do
6:     µ*_i ← µ*_i − η (µ*_i − µ_{σ(i)})
7:     L*_i ← L*_i − η (I − T_{Σ*_i Σ_{σ(i)}}) L*_i
8:   end for
9: end for

These properties suggest an adjustment of Algorithm 1 to account for G. Given a barycenter estimate p = (p_1, . . . , p_K) and a draw q = (q_1, . . . , q_K) ∼ Ω: (1) align p and q by minimizing the right-hand side of (5); (2) compute component-wise Riemannian gradients from p_i to q_{σ(i)}; and (3) step p toward q using the exponential map. Algorithm 2 summarizes our approach.
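The following is a minimal sketch of Algorithm 2 for the special case of Euclidean components (M = R^d) and G = S_K, with a hypothetical stand-in sampler playing the role of Ω; the matching σ in line 4 is obtained from a linear sum assignment.

```python
# Minimal sketch of Algorithm 2 for Euclidean components (M = R^d) and G = S_K.
# Each draw is a K-tuple of component parameters; we align it to the current iterate with
# a linear sum assignment (eq. (5)) and take a gradient step. Sampler is a stand-in.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(1)
true_means = np.array([[-2.0, 0.0], [0.0, 2.0], [2.0, 0.0]])    # hypothetical K=3, d=2

def sample_omega():
    """Stand-in posterior sampler: noisy component means in a random label order."""
    perm = rng.permutation(len(true_means))
    return true_means[perm] + 0.1 * rng.normal(size=true_means.shape)

p = sample_omega()                                   # 1: initialize (p_1, ..., p_K) ~ Omega
for t in range(1, 2000):                             # 2: for t = 1, 2, ...
    q = sample_omega()                               # 3: draw (q_1, ..., q_K) ~ Omega
    cost = ((p[:, None, :] - q[None, :, :]) ** 2).sum(-1)   # c_ij = d(p_i, q_j)^2
    rows, sigma = linear_sum_assignment(cost)        # 4: compute sigma in (5)
    p = p + (1.0 / t) * (q[sigma] - p)               # 5-8: p_i <- exp_{p_i}((1/t) log_{p_i}(q_sigma(i)))
print(np.round(p, 2))                                # recovers the component means up to labeling
```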
Algorithm 2 can be understood as stochastic gradient descent for z in (4), working in the space Conf_K(M) rather than the quotient Conf_K(M)/G. Theorem 2, however, gives an alternative interpretation. Construct a |G|-point empirical distribution µ = (1/|G|) ∑_{σ∈G} δ_{σ·p} from the iterate p. After drawing q ∼ Ω, we do the same to obtain ν ∈ P_2(Conf_K(M)). Then, our update can be understood as a stochastic Wasserstein gradient descent step of µ toward ν for problem (2). While this equivalent formulation would require O(|G|) rather than O(1) memory, it imparts the theoretical perspective in §3, in particular a connection to the (convex) problem of Wasserstein barycenter computation. In the supplementary, we prove the following theorem:

Theorem 3 (Ordering Recovery). If M = R with the standard metric, then UConf_K(R) ≅ {(u_1, . . . , u_K) ∈ Conf_K(R) | u_1 < · · · < u_K} ⊂ R^K. Additionally, the single-orbit barycenter of Theorem 2 is unique and our algorithm provably converges.

This setting occurs when one's mixture model consists of evenly weighted components with only a single mean parameter for each in R. The result relates our method to the classical approach of ordering these means for correspondence and shows that it is well-justified. The convergence of our algorithm leverages the convexity of UConf_K(M). The supplementary contains additional discussion (§2.3) about such mean-only models in R^d for d > 1. They lack the niceness of the d = 1 case, due to positive curvature. This curvature is problematic for convergence arguments (as it leads to potential non-uniqueness of barycenters), but we empirically find that our algorithm converges to reasonable results.

Mixtures of Gaussians. One particularly useful example involves estimating the parameters of a Gaussian mixture over R^d. For simplicity, assume that all the mixture weights are equal. The manifold M is the set of all (µ, Σ) pairs: M = R^d × P_d, with P_d the set of positive definite symmetric matrices. This space can be endowed with the W_2 metric:

d((µ_1, Σ_1), (µ_2, Σ_2))² = W_2^2(N(µ_1, Σ_1), N(µ_2, Σ_2)) = ‖µ_1 − µ_2‖_2² + B²(Σ_1, Σ_2), (6)

where B² is the squared Bures metric B²(Σ_1, Σ_2) = Tr[Σ_1 + Σ_2 − 2(Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2}]. As the mean components inherit the structure of Euclidean space, we only need to compute Riemannian gradients and exponential maps for the Bures metric. Muzellec & Cuturi (2018) leverage the Cholesky decomposition to parameterize Σ_i = L_i L_i^⊤. The gradient of the Bures metric then becomes

∇_{L_1} (1/2) B²(Σ_1, Σ_2) = (I − T_{Σ_1 Σ_2}) L_1,  with  T_{Σ_1 Σ_2} = Σ_1^{−1/2} (Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2} Σ_1^{−1/2}.

The 2-Wasserstein exponential map for SPD matrices is exp_Σ(ξ) = (I + L_Σ(ξ)) Σ (I + L_Σ(ξ)), where L_Σ(ξ) is the solution of the Lyapunov equation L_Σ(ξ) Σ + Σ L_Σ(ξ) = ξ.
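To make the covariance updates concrete, the sketch below (NumPy/SciPy, hypothetical covariance matrices) computes the squared Bures distance, the transport map T_{Σ_1Σ_2}, and a single gradient step on the Cholesky factor L_1, as in line 7 of Algorithm 3.

```python
# Minimal sketch (hypothetical covariances): Bures distance, transport map, and one
# gradient step on the Cholesky factor of Sigma1 toward Sigma2 (Algorithm 3, line 7).
import numpy as np
from scipy.linalg import sqrtm, cholesky

def bures_sq(S1, S2):
    root = sqrtm(sqrtm(S1) @ S2 @ sqrtm(S1))
    return np.trace(S1 + S2 - 2 * root).real

def transport_map(S1, S2):
    r1 = sqrtm(S1)
    mid = sqrtm(r1 @ S2 @ r1)
    r1_inv = np.linalg.inv(r1)
    return (r1_inv @ mid @ r1_inv).real            # T maps N(0, S1) onto N(0, S2)

S1 = np.array([[1.0, 0.2], [0.2, 0.5]])
S2 = np.array([[0.8, -0.1], [-0.1, 1.2]])
L1 = cholesky(S1, lower=True)

eta = 0.5
T = transport_map(S1, S2)
L1_new = L1 - eta * (np.eye(2) - T) @ L1           # L_i <- L_i - eta (I - T) L_i
S1_new = L1_new @ L1_new.T
print(bures_sq(S1, S2), ">", bures_sq(S1_new, S2)) # the step decreases the Bures distance
```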
Figure 2: True covariances in blue, covariances from SGD in green, and pivot in red.

5 Experiments

In §4, we gave a symmetry-invariant, simple, and efficient algorithm for computing a Wasserstein barycenter to summarize a distribution subject to label switching. To verify empirically that our algorithm can efficiently address label switching, we test on two natural examples: estimating the parameters of a Gaussian mixture model and a Bayesian instance of multi-reference alignment.

Estimating components of a Gaussian mixture. Our first scenario is estimating the parameters of a Gaussian mixture with K > 1 components. We use Hamiltonian Monte Carlo (HMC) to sample from the posterior distribution of a Gaussian mixture model. Naïve averaging does not yield a meaningful barycenter estimate, since the samples are not guaranteed to have the same label ordering. To resolve this ambiguity, we apply our method and two baselines: the pivotal reordering method (Marin et al., 2005) and Stephens' method (Stephens, 2000). Both the Stephens and pivot methods relabel samples. Stephens' method minimizes the Kullback–Leibler divergence between the average classification distribution and the classification distribution of each MCMC sample. The pivot method aligns every sample to a prespecified sample (the pivot) by solving a series of linear sum assignment problems. The pivot method requires pre-selecting a single sample for alignment: a poor choice of the pivot sample leads to bad estimation quality, while making a good pivot choice may be highly non-trivial in practice. The default pivot choice is the MAP. Stephens' method is more accurate; however, it is computationally expensive and has a large memory requirement.

Table 2: Absolute error & timings.
              Pivot   Stephens   SGD
  Error (abs)  1.65    1.26      1.47
  Time (s)     1.4     54        7.5

To illustrate why pivoting fails, consider samples drawn from a mixture of five Gaussians with mean 0 and covariances R_θ M R_θ^⊤, where M = diag(1, 0.1) and R_θ is a rotation by angle θ ∈ {−π/12, −π/24, 0, π/12, π/24} (Figure 2). The resulting pivot is uninformative for certain components. The underlying issue is that the pivot is chosen to maximize the posterior distribution. If this sample lies on the boundary of Conf_K(M)/S_K, the pivot cannot be effectively used to realign samples. Quantitative results for this test case are in Table 2.

To get a better handle on the performance/accuracy trade-off for the three methods, we run an additional experiment. We draw samples from a mixture of five Gaussians over R^5 with means 0.5 e_i, where e_i ∈ R^5 is the i-th standard basis vector, i ∈ {1, . . . , 5}, and covariances 0.4 I_{5×5}. We implement the HMC sampler using Stan (Carpenter et al., 2017), with four chains, discarding 500 burn-in samples and keeping 500 per chain. Then we compare the three methods, increasing the number of samples to which they have access. We measure relative error as a function of wall-clock time and number of samples (Figure 3). The resulting plots align with our intuition: pivoting obtains a suboptimal solution quickly, but if a more accurate solution is desired, it is better to run our SGD algorithm.

Figure 3: Relative error as a function of (a) number of samples and (b) time.

Multi-reference alignment. A different problem to which we can apply our methods is multi-reference alignment (Zwart et al., 2003; Bandeira et al., 2014). We wish to reconstruct a template signal x ∈ R^K given noisy and cyclically shifted samples y ∼ g·x + N(0, σ²I), where g ∈ C_K acts by cyclic permutation. These observations correspond to a mixture model with K components N(g·x, σ²I) for g ∈ C_K (Perry et al., 2017). We simulated draws from this distribution using Markov chain Monte Carlo (MCMC), where each draw applies a random cyclic permutation and adds Gaussian noise (Figure 4a). The sampler we used was a Gibbs sampler (Casella & George, 1992). To reconstruct the signal, we first compute a barycenter using Algorithm 2, giving a reference point to which we can align the noisy signals; we then average the aligned samples.
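The align-then-average step admits a short sketch (NumPy, synthetic signal; the reference below stands in for the barycenter produced by Algorithm 2):

```python
# Minimal sketch (synthetic data) of the align-then-average step for multi-reference
# alignment: each noisy, cyclically shifted observation is aligned to a reference signal
# by searching over the K cyclic shifts (the action of C_K), then the aligned copies are
# averaged. The reference stands in for the Algorithm 2 barycenter.
import numpy as np

rng = np.random.default_rng(0)
K, n_obs, sigma = 64, 200, 0.5
x = np.sin(2 * np.pi * np.arange(K) / K) + 0.5 * np.cos(6 * np.pi * np.arange(K) / K)

obs = np.stack([np.roll(x, rng.integers(K)) + sigma * rng.normal(size=K)
                for _ in range(n_obs)])                        # y ~ g.x + N(0, sigma^2 I)

reference = obs[0]                                             # stand-in for the barycenter
def align(y, ref):
    shifts = [np.roll(y, s) for s in range(K)]                 # enumerate the C_K orbit of y
    return shifts[int(np.argmin([np.sum((s - ref) ** 2) for s in shifts]))]

reconstruction = np.mean([align(y, reference) for y in obs], axis=0)
rel_err = min(np.linalg.norm(np.roll(reconstruction, s) - x) for s in range(K)) / np.linalg.norm(x)
print(rel_err)   # small relative error (up to a global cyclic shift)
```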
Reconstructed signals for different σ's are in Figure 4b. To evaluate quantitatively, we compute the relative error of the reconstruction as a function of the signal-to-noise ratio SNR = ‖x‖²/(Kσ²) (Figure 4c).

Figure 4: Reconstruction of a signal from shifted and noisy observations. (a) The true signal is plotted in blue against a random shifted and noisy draw from the MCMC chain. (b) Reconstructed signals at varying values of noise. (c) Relative error as a function of SNR.

6 Discussion and Conclusion

The issue underlying label switching is the existence of a group acting on the space of parameters. This group-theoretic abstraction allows us to relate a widely recognized problem in Bayesian inference to Wasserstein barycenters from optimal transport. Beyond theoretical interest, this connection suggests a well-posed and easily solved optimization method for alleviating label switching in practice. The new structure we have revealed in the label switching problem opens several avenues for further inquiry. Most importantly, (4) yields a simple algorithm, but this algorithm is only well-understood when the Fréchet mean is unique. This leads to two questions: When can we prove uniqueness of the mean? More generally, are there efficient algorithms for computing barycenters in P_2(X)^G? Finding faster algorithms for computing barycenters under the constraints of Lemma 1 provides an unexplored and highly structured instance of the barycenter problem. Current approaches, such as those by Cuturi & Doucet (2014) and Claici et al. (2018), are too slow and not tailored to the demands of our application, since each measure is supported on K! points and the barycenter may not share support with the input measures. Moreover, after incorporating an HMC sampler or similar piece of machinery, our task likely requires taking the barycenter of an infinitely large set of distributions. The key to this problem is to exploit the symmetry of the support of the input measures and the barycenter.

Acknowledgements. J. Solomon acknowledges the generous support of Army Research Office grant W911NF1710068, Air Force Office of Scientific Research award FA9550-19-1-031, National Science Foundation grant IIS-1838071, an Amazon Research Award, the MIT-IBM Watson AI Laboratory, the Toyota-CSAIL Joint Research Center, the QCRI-CSAIL Computer Science Research Program, and a gift from Adobe Systems. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of these organizations.

References

Agueh, M. and Carlier, G. Barycenters in the Wasserstein space. SIAM J. Math. Anal., 43(2):904–924, January 2011. ISSN 0036-1410. doi: 10.1137/100805741.

Alvarez-Melis, D. and Jaakkola, T. S. Gromov-Wasserstein alignment of word embedding spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 1881–1890, 2018.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 214–223, 2017.

Bandeira, A. S., Charikar, M., Singer, A., and Zhu, A. Multireference alignment using semidefinite programming. In Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, pp. 459–470. ACM, 2014.
Carlier, G., Oberman, A., and Oudet, E. Numerical methods for matching for teams and Wasserstein barycenters. ESAIM: M2AN, 49(6):1621–1642, November 2015. ISSN 0764-583X, 1290-3841. doi: 10.1051/m2an/2015033.

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017.

Casella, G. and George, E. I. Explaining the Gibbs sampler. The American Statistician, 46(3):167–174, 1992.

Celeux, G., Hurn, M., and Robert, C. P. Computational and inferential difficulties with mixture posterior distributions. Journal of the American Statistical Association, 95(451):957–970, 2000.

Chizat, L. and Bach, F. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 3040–3050, 2018.

Claici, S., Chien, E., and Solomon, J. Stochastic Wasserstein barycenters. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 998–1007, 2018.

Cuturi, M. and Doucet, A. Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pp. 685–693, 2014.

Diebolt, J. and Robert, C. P. Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society: Series B (Methodological), 56(2):363–375, 1994.

Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.

Genevay, A., Peyré, G., and Cuturi, M. Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pp. 1608–1617, 2018.

Jasra, A., Holmes, C. C., and Stephens, D. A. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statistical Science, pp. 50–67, 2005.

Kim, Y.-H. and Pass, B. Wasserstein barycenters over Riemannian manifolds. Advances in Mathematics, 307:640–683, February 2017. ISSN 0001-8708. doi: 10.1016/j.aim.2016.11.026.

Kobayashi, S. Isometries of Riemannian Manifolds, pp. 39–76. Springer Berlin Heidelberg, Berlin, Heidelberg, 1995. ISBN 978-3-642-61981-6. doi: 10.1007/978-3-642-61981-6_2.

Kusner, M. J., Sun, Y., Kolkin, N. I., and Weinberger, K. Q. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 957–966, 2015.

Lott, J. and Villani, C. Ricci curvature for metric-measure spaces via optimal transport. Annals of Mathematics, pp. 903–991, 2009.

Marin, J.-M., Mengersen, K., and Robert, C. P. Bayesian modelling and inference on mixtures of distributions. Handbook of Statistics, 25:459–507, 2005.

McLachlan, G. J., Lee, S. X., and Rathnayake, S. I. Finite mixture models. Annual Review of Statistics and its Application, 6:355–378, 2019.

Muzellec, B. and Cuturi, M. Generalizing point embeddings using the Wasserstein space of elliptical distributions. In Advances in Neural Information Processing Systems 31, pp. 10237–10248. Curran Associates, Inc., 2018.

Neal, R. M. et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2011.

Papastamoulis, P.
label.switching: An R package for dealing with the label switching problem in MCMC outputs. arXiv preprint arXiv:1503.02271, 2015.

Perry, A., Weed, J., Bandeira, A. S., Rigollet, P., and Singer, A. The sample complexity of multi-reference alignment. CoRR, abs/1707.00943, 2017.

Peyré, G. and Cuturi, M. Computational Optimal Transport. Submitted, 2018.

Fadell, E. R. and Husseini, S. Geometry and Topology of Configuration Spaces. 2001. doi: 10.1007/978-3-642-56446-8.

Santambrogio, F. Optimal Transport for Applied Mathematicians, volume 87 of Progress in Nonlinear Differential Equations and Their Applications. Springer International Publishing, 2015. ISBN 978-3-319-20827-5 978-3-319-20828-2. doi: 10.1007/978-3-319-20828-2.

Solomon, J. Optimal Transport on Discrete Domains. AMS Short Course on Discrete Differential Geometry, 2018.

Srivastava, S., Cevher, V., Dinh, Q., and Dunson, D. WASP: Scalable Bayes via barycenters of subset posteriors. In Lebanon, G. and Vishwanathan, S. V. N. (eds.), Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pp. 912–920, San Diego, California, USA, 09-12 May 2015. PMLR.

Staib, M., Claici, S., Solomon, J. M., and Jegelka, S. Parallel streaming Wasserstein barycenters. In Advances in Neural Information Processing Systems, NIPS 2017, pp. 2644–2655, 2017.

Stephens, M. Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(4):795–809, 2000.

Uribe, C. A., Dvinskikh, D., Dvurechensky, P., Gasnikov, A., and Nedic, A. Distributed computation of Wasserstein barycenters over networks. In 57th IEEE Conference on Decision and Control, CDC 2018, Miami, FL, USA, December 17-19, 2018, pp. 6544–6549, 2018. doi: 10.1109/CDC.2018.8619160.

Villani, C. Optimal Transport: Old and New. Number 338 in Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 2009. ISBN 978-3-540-71049-3. OCLC: ocn244421231.

Zwart, J. P., van der Heiden, R., Gelsema, S., and Groen, F. Fast translation invariant classification of HRR range profiles in a zero phase representation. IEE Proceedings - Radar, Sonar and Navigation, 150(6):411–418, 2003.