# Measure Estimation in the Barycentric Coding Model

Matthew Werenski 1, Ruijie Jiang 2, Abiy Tasissa 3, Shuchin Aeron 2, James M. Murphy 3

This paper considers the problem of measure estimation under the barycentric coding model (BCM), in which an unknown measure is assumed to belong to the set of Wasserstein-2 barycenters of a finite set of known measures. Estimating a measure under this model is equivalent to estimating the unknown barycentric coordinates. We provide novel geometrical, statistical, and computational insights for measure estimation under the BCM, consisting of three main results. Our first main result leverages the Riemannian geometry of Wasserstein-2 space to provide a procedure for recovering the barycentric coordinates as the solution to a quadratic optimization problem, assuming access to the true reference measures. The essential geometric insight is that the parameters of this quadratic problem are determined by inner products between the optimal displacement maps from the given measure to the reference measures defining the BCM. Our second main result then establishes an algorithm for solving for the coordinates in the BCM when all the measures are observed empirically via i.i.d. samples. We prove precise rates of convergence for this algorithm, determined by the smoothness of the underlying measures and their dimensionality, thereby guaranteeing its statistical consistency. Finally, we demonstrate the utility of the BCM and the associated estimation procedures in three application areas: (i) covariance estimation for Gaussian measures; (ii) image processing; and (iii) natural language processing.

1 Department of Computer Science, Tufts University. 2 Department of Electrical and Computer Engineering, Tufts University. 3 Department of Mathematics, Tufts University. Correspondence to: Matthew Werenski.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

1. Introduction

A number of recent machine learning applications, including computer vision (Schmitz et al., 2018; Bonneel et al., 2016), domain adaptation (Montesuma & Mboula, 2021; Redko et al., 2019), natural language processing (NLP) (Singh et al., 2020; Colombo et al., 2021; Xu et al., 2018), and unsupervised segmentation of multivariate time series data (Cheng et al., 2021), have shown the utility of representing and modeling high-dimensional data as probability distributions. The essential insight in these applications is to utilize the Riemannian geometry of the space of probability distributions induced by the Wasserstein-2 metric, with the Wasserstein-2 metric and Wasserstein-2 barycenters serving as alternatives to the Euclidean distance and linear combinations, respectively. In this context, methods that use Wasserstein barycenters for data modeling and inference involve solving two core problems, which we term the synthesis problem and the analysis problem.

To be precise, let P_{2,ac}(R^d) denote the space of absolutely continuous distributions on R^d with finite second moment. For µ, ν ∈ P_{2,ac}(R^d),

W_2^2(µ, ν) := min_{T : T_# µ = ν} ∫_{R^d} ||T(x) − x||_2^2 dµ(x)    (1)

defines the Wasserstein-2 metric, where the minimum is over all measurable maps T : R^d → R^d and the pushforward constraint T_# µ = ν means that ν[B] = µ[T^{−1}(B)] for all Borel sets B (Villani, 2003; Santambrogio, 2015). Under the assumption µ, ν ∈ P_{2,ac}(R^d), a unique minimizing T is known to exist and the resulting W_2 distance is finite (Villani, 2003).
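As a concrete instance of (1): for zero-mean Gaussians N(0, S_0) and N(0, S_1), the metric has the classical closed form W_2^2 = Tr(S_0 + S_1 − 2(S_0^{1/2} S_1 S_0^{1/2})^{1/2}), a special case that plays a central role later in the paper. A minimal numpy sketch of this closed form, with illustrative covariance matrices:

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_sq_gaussian(S0, S1):
    """W_2^2 between N(0, S0) and N(0, S1):
    Tr(S0 + S1 - 2 (S0^{1/2} S1 S0^{1/2})^{1/2})."""
    r = sqrtm(S0)
    return float(np.trace(S0 + S1 - 2.0 * sqrtm(r @ S1 @ r)).real)

# illustrative 2x2 SPD covariances
S0 = np.array([[2.0, 0.3], [0.3, 1.0]])
S1 = np.array([[1.0, -0.2], [-0.2, 1.5]])
print(w2_sq_gaussian(S0, S1))   # nonnegative, and exactly 0 when S0 == S1
```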
We refer to the space P_{2,ac}(R^d) equipped with the metric W_2 as the Wasserstein-2 space, denoted W_2. Let ∆_p := {λ = (λ_1, . . . , λ_p) ∈ R^p : λ_i ≥ 0, Σ_{i=1}^p λ_i = 1} and let {µ_i}_{i=1}^p be known reference measures. The W_2 barycenter for coordinates λ ∈ ∆_p with respect to {µ_i}_{i=1}^p is defined as

ν_λ := argmin_{ν ∈ P_{2,ac}(R^d)} Σ_{i=1}^p λ_i W_2^2(ν, µ_i).    (2)

The measure ν_λ is an advection of {µ_i}_{i=1}^p and blends their geometric features; it may be thought of as a "displacement interpolation" (McCann, 1997; Villani, 2003). In contrast, the linear mixture Σ_{i=1}^p λ_i µ_i simply sets λ_i to be the rate at which we draw samples from µ_i. Figure 1 demonstrates the important difference between linear mixture models and Wasserstein barycenters on Gaussian distributions: the linear mixture is a standard Gaussian mixture model (Murphy, 2012), while the Wasserstein barycenter is itself Gaussian. The important takeaway is that Wasserstein barycenters offer a more geometrically meaningful interpolation between measures compared to mixtures, and as indicated above they have been successfully leveraged in a number of applications.

Figure 1 (panels: Euclidean Mixture, Wasserstein Barycenter). Top four: reference Gaussians. Bottom left: linear mixture with λ = (1/4, . . . , 1/4). Bottom right: Wasserstein barycenter with the same λ. (Intensities adjusted to show detail.)

Now, let Bary({µ_i}_{i=1}^p) := {ν_λ : λ ∈ ∆_p} denote the set of all possible barycenters of {µ_i}_{i=1}^p. The synthesis problem (2) seeks to combine the reference measures {µ_i}_{i=1}^p into a notion of weighted average, with relative contributions determined by λ. This problem is well-studied (McCann, 1997; Agueh & Carlier, 2011; Kroshnin et al., 2019) and known to have a unique solution under mild assumptions on the {µ_i}_{i=1}^p. Moreover, there exist fast algorithms for computing or approximating ν_λ when samples from the {µ_i}_{i=1}^p are available (Rabin et al., 2011; Cuturi & Doucet, 2014; Bonneel et al., 2015; Claici et al., 2018; Yang et al., 2021), which have spurred the application of Wasserstein barycenters to machine learning problems.

In this paper, our focus is on the less well-studied analysis problem: given µ_0 and reference measures {µ_i}_{i=1}^p, solve

argmin_{λ ∈ ∆_p} W_2^2(µ_0, ν_λ).    (3)

Instead of combining the measures {µ_i}_{i=1}^p according to the weights λ ∈ ∆_p as in (2), the analysis problem (3) decomposes a given measure µ_0 into its optimal representation in the set Bary({µ_i}_{i=1}^p), as parameterized by the learned barycentric coordinates λ ∈ ∆_p. If the measure µ_0 ∈ Bary({µ_i}_{i=1}^p), then min_{λ ∈ ∆_p} W_2^2(µ_0, ν_λ) = 0 and solving (3) is equivalent to learning the barycentric coordinates λ ∈ ∆_p such that ν_λ = µ_0. We refer to this general model of parameterizing measures µ_0 ∈ Bary({µ_i}_{i=1}^p) through the associated coordinates λ ∈ ∆_p as the barycentric coding model (BCM).

Main Contributions: This paper addresses the analysis problem (3) as a measure estimation problem under the BCM. For this general problem, we make the following geometrical, statistical, and computational contributions:

(1) When all the measures are directly observed, Theorem 1 provides a geometric characterization of the solution to (3) as corresponding to a solution of a quadratic program. Our analysis leverages the Riemannian geometry of W_2 space and establishes that the quadratic program is determined by the angles between the optimal displacement maps between the measures. When specialized to the case of Gaussian measures, Corollary 1 provides a closed-form characterization of the solution to the analysis problem.
(2) When the measures are indirectly observed via i.i.d. samples, we provide an algorithm for estimating the barycentric coordinates λ of a given measure with respect to the reference measures. Theorem 3 gives precise rates of convergence of the angles between the optimal displacement map by a computationally tractable estimator that exploits entropic regularization, thereby establishing the statistical consistency of our approach in Corollary 2. (3) We showcase the utility of the BCM and the proposed algorithm on several data analysis problems: covariance estimation for Gaussian measures; image inpainting and denoising; and document classification from limited labeled data. We show that the BCM provides a simple yet effective approach that outperforms competing methods in the regimes considered.1 1.1. Related Work The synthesis problem (2) is well-studied. (Agueh & Carlier, 2011) propose W2 barycenters for a finite set of reference measures {µi}p i=1 and prove their existence and uniqueness in a very general setting. (Le Gouic & Loubes, 2017) extend this to barycenters constructed from infinitely many reference measures. Statistical rates of estimation in the setting of i.i.d. samples from the measures {µi}p i=1 are established in (Kroshnin et al., 2019). Building on fundamental advances in estimating transport maps between measures using entropic regularization (Cuturi, 2013), fast estimation methods for solving the synthesis problem have been developed for entroptically regularized barycenters (Cuturi & Doucet, 2014). Approaches to computing a barycenter based on al- 1Code to reproduce results is available at https://github. com/Matt Werenski/BCM Measure Estimation in the Barycentric Coding Model ternating direction of multipliers on the dual problem (Yang et al., 2021) and parallelization schemes (Claici et al., 2018) have also been considered. The literature on the analysis problem (3) is much sparser. To the best of our knowledge, our geometrical characterization of solutions to the analysis problem (Theorem 1) and sample complexity results for estimating the optimal coordinates λ (Theorem 3, Corollary 2) are the first of their kind. Note that our approach extends (Bonneel et al., 2016; Schmitz et al., 2018) which consider measure estimation under the BCM but only for measures with finite support in three fundamental ways. First, we enable the application of the BCM to absolutely continuous measures. Second, we prove precise theoretical characterizations of the solutions to (3). Third, we avoid the expensive procedure of differentiating through an iterative algorithm as required in (Bonneel et al., 2016) and instead provide the direct Algorithm 1 that enjoys statistical consistency (Corollary 2). At the level of technical analysis, our Theorem 1 relies on an alternative characterization of W2 barycenters that leverages known conditions under which Karcher means are guaranteed to be W2 barycenters (Panaretos & Zemel, 2020). For the special case when all the measures are zero-mean Gaussians, the synthesis problem is again well-studied, with a fixed-point algorithm for computing the barycenter given in (Álvarez-Esteban et al., 2016) that is shown to have a linear convergence rate in (Chewi et al., 2020; Altschuler et al., 2021). The corresponding analysis problem for the specific case of Gaussian measures has only been recently studied in (Musolas et al., 2021) but limited to the case when p = 2. 
When specialized to this case, our results extend the model considered in (Musolas et al., 2021) to the p 3 case and characterize solutions to barycenter parameterized covariance estimation. In Section 4, we use the proposed BCM framework to model and address several problems in image processing and natural language processing. In this context our work is related to a long lineage of methods in signal processing and machine learning that use coefficients in a basis or a dictionary for data modeling and processing (Mallat, 1999; Donoho, 2006; Toši c & Frossard, 2011). Our approach of representing a general measure in terms of its barycentric coordinates with respect to a fixed set of reference measures is a novel approach in this tradition, and our results in Section 4 suggest its utility for data modeled as probability measures. 2. Convex Optimization for the Analysis Problem To solve the analysis problem (3), we propose to solve a convex optimization problem based on an alternative characterization of W2-barycenters as minimizers of a certain functional. We leverage this alternative characterization to show in Theorem 1 that, under mild assumptions on {µi}p i=1, one can use the optimal value of a convex program to both check if µ0 Bary({µi})p i=1 and if so, recover its barycentric coordinate. Before proceeding, we must introduce some background on the Riemannian geometry of the W2 space. The geodesic curve from ν0 to ν1 in W2 is given by [νt]1 t=0 where νt = (t T + (1 t) Id)#ν0 where T is the optimal transport map from ν0 to ν1 in (1) (Santambrogio, 2015). These facts together with Brenier s Theorem (see Supplementary Material Section 1) motivate the definition of the tangent space at ν: Tν P2,ac(Rd) {β( ϕ Id) : β > 0, ϕ C c (Rd), ϕ convex}, where the closure is in L2(ν) (Ambrosio et al., 2005). For u, v Tν P2,ac(Rd), the inner product is defined by u, v ν R Rd u(x), v(x) dν(x), which also gives the standard norm of u Tν P2,ac(Rd), ||u||ν p 2.1. Alternative Characterization of Barycenters Given the reference measures {µi}p i=1 and barycentric coordinates λ p, the variance functional is Gλ : P2,ac(Rd) R defined by 2 W 2 2 (ν, µi), so that the barycenter for {µi}p i=1 and λ p is νλ = arg min ν P2,ac(Rd) Gλ(ν). In many unconstrained optimization problems, the most direct way of checking if a point is optimal is to use firstorder optimality conditions: compute the gradient at the point and check if it is zero. Without further assumptions this is a necessary, but not sufficient, condition for optimality. Applying this to a general variance functional, one arrives at the definition of a Karcher mean (Karcher, 1977). Definition 1. Let (X, ρ) be a metric space furnished with a tangent space at every x X. For a collection of points {xi}p i=1 X and coordinates λ p, a point y X is said to be a Karcher mean of the set for λ if Gλ = Pp i=1 λiρ2( , xi) is differentiable at y and || Gλ(y)||y = 0 where Gλ(y) is understood as an element of the tangent space at y. Any barycenter y is a global minimizer of Gλ. Using first order optimality conditions, this implies || Gλ(y)||y = 0, and therefore y must be a Karcher mean. Special care must be made to ensure that being a Karcher mean is sufficient to Measure Estimation in the Barycentric Coding Model Figure 2: Visualization of the tangent space Tν P2,ac(Rd) and Gλ. The red arc is the path along which Gλ is maximized, starting at ν and the green arrow points in that direction in the tangent space at ν. be a barycenter. 
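Anticipating the derivative formula (4) given just below, checking the Karcher-mean condition numerically amounts to estimating ||Σ_i λ_i (T_i − Id)||^2 in L2(ν) from samples of ν. A minimal Monte Carlo sketch, in which the maps are placeholder linear maps rather than maps estimated from data:

```python
import numpy as np

def grad_norm_sq(samples, maps, lam):
    """Monte Carlo estimate of || sum_i lam_i (T_i - Id) ||^2 in L2(nu),
    where `samples` are i.i.d. draws from nu; by (4) this is the squared
    norm of the gradient of the variance functional at nu."""
    disp = sum(l * (T(samples) - samples) for l, T in zip(lam, maps))
    return np.mean(np.sum(disp ** 2, axis=1))

# Toy check: nu = N(0, I_2) with placeholder maps T_i(x) = c_i * x, which are
# the optimal maps from nu to N(0, 0.25 I) and N(0, 4 I).
rng = np.random.default_rng(0)
X = rng.standard_normal((50_000, 2))
maps = [lambda x, c=c: c * x for c in (0.5, 2.0)]
print(grad_norm_sq(X, maps, [0.5, 0.5]))   # clearly nonzero (about 0.125)
print(grad_norm_sq(X, maps, [2/3, 1/3]))   # near zero: N(0, I) is the
                                           # barycenter for these weights
```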
When it is sufficient, it immediately gives an alternative test for y to be a barycenter for λ: simply check if || Gλ(y)||y = 0. An example of a Karcher mean which is not a barycenter is given in the Supplementary Material Section 7. To ensure that being a Karcher mean is sufficient to be a barycenter we require the following assumptions: A1: The measures {µi}p i=0 are absolutely continuous and supported on either all of Rd or a bounded open convex subset. Call this shared support set Ω. A2: The measures {µi}p i=0 have respective densities {gi}p i=0 which are bounded above and g1, ..., gp are strictly positive on Ω. A3: If Ω= Rd then {gi}p i=0 are locally Hölder continuous. Otherwise {gi}p i=0 are bounded away from zero on Ω. Under these assumptions if µ0 is a Karcher mean with coordinates λ, then µ0 = νλ (Panaretos & Zemel, 2020); see the Supplementary Material Section 1 for a precise statement. This characterization of barycenters can be much easier to work with and the assumptions required are mild in practice. To accompany the notion of a Karcher mean, the Fréchet derivative (Fréchet, 1948) of the variance functional Gλ is given by i=1 λi (Ti Id) , (4) where Ti is the optimal transport map from ν to µi as in (1) (Ambrosio et al., 2005; Panaretos & Zemel, 2020). We now have all the tools to begin building up to Theorem 1. The first step is to derive a formula for || Gλ(µ0)||2 µ0 in terms of λ and show its convexity. Proposition 1. Let {µi}p i=0 P2,ac(Rd). Then || Gλ(µ0)||2 µ0 = λT Aλ where A Rp p is given by Rd Ti(x) Id(x), Tj(x) Id(x) dµ0(x) (5) and Ti is the optimal transport map from µ0 to µi as in (1). Furthermore, this is a convex function in λ. The proof is given in the Supplementary Material Section 2.1. The expression Ti(x) Id(x) is the displacement of the point x when being transported from µ0 to µi. Under this interpretation, Ti(x) Id, Tj(x) Id can be thought of as the angle between the displacement associated to the optimal transport map between µ0 to µi with that of µ0 to µj. Integrating this quantity with respect to µ0 therefore quantifies the average angle between displacements and if the optimal displacement maps move in similar directions. We now use this functional form as part of a convex program and show that it can be used to check if a µ0 is a Karcher mean. Our first main result ties Propositions 1 to the conditions for equivalence between barycenters and Karcher means. Theorem 1. Let {µi}p i=0 satisfy A1-A3. Then µ0 Bary({µi}p i=1) if and only if min λ p λT Aλ = 0 where A Rp p is given by (5). Furthermore, if the minimum value is 0 and λ is an optimal argument, then µ0 = νλ . The proof is given in the Supplementary Material Section 2.3. Theorem 1 gives a geometric characterization of the solution to (3) which relies on the evaluation of innerproducts in the tangent space Tµ0 P2,ac(Rd) (noting that Aij = Ti Id, Tj Id µ0) and solving a constrained quadratic program. Furthermore, this result suggests a method for finding an approximator of the µ0 in the set of barycenters: simply minimize the objective and use λ , even if the minimum value is not zero. Note that if A has rank less than p 1 there may be multiple minimizers which correspond to redundancies in the set Bary({µi}p i=1); see Supplementary Material Section 6. 2.2. Theorem 1 For Gaussians In the Gaussian case there are formulas for both the optimal transport maps and the integration involved in Aij. Let Sd ++ denote the set of symmetric positive definite d d matrices. 
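In the general (not necessarily Gaussian) setting, Proposition 1 and Theorem 1 already translate directly into a numerical procedure once the maps T_i and samples from µ_0 are available. A minimal sketch with placeholder maps and samples, before continuing with the Gaussian formulas (in practice the maps must themselves be estimated, which is the subject of Section 3):

```python
import numpy as np
from scipy.optimize import minimize

def build_A(X0, maps):
    """A_ij = (1/n) sum_k <T_i(x_k) - x_k, T_j(x_k) - x_k> with x_k ~ mu_0,
    a Monte Carlo version of the Gram matrix in Proposition 1."""
    disp = [T(X0) - X0 for T in maps]
    p = len(maps)
    return np.array([[np.mean(np.sum(disp[i] * disp[j], axis=1))
                      for j in range(p)] for i in range(p)])

def solve_simplex_qp(A):
    """min_{lambda in the simplex} lambda^T A lambda (Theorem 1) via SLSQP."""
    p = A.shape[0]
    res = minimize(lambda l: l @ A @ l, np.full(p, 1.0 / p),
                   jac=lambda l: 2.0 * A @ l,
                   bounds=[(0.0, 1.0)] * p,
                   constraints=[{"type": "eq", "fun": lambda l: l.sum() - 1.0}],
                   method="SLSQP")
    return res.x, res.fun

# Placeholder example: mu_0 = N(0, I_2) with maps T_i(x) = c_i * x, i.e. the
# references are N(0, 0.25 I), N(0, I), N(0, 4 I); mu_0 is trivially in the BCM.
rng = np.random.default_rng(0)
X0 = rng.standard_normal((20_000, 2))
maps = [lambda x, c=c: c * x for c in (0.5, 1.0, 2.0)]
lam, val = solve_simplex_qp(build_A(X0, maps))
print(lam.round(3), val)  # val is near 0; A has rank 1 here, so the minimizer
                          # is not unique (cf. the remark after Theorem 1)
```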
Let µi = N(0, Si) with Si Sd ++ for i = 0, . . . , p. Then the optimal transport map from µ0 to µi is given by Ti(x) = Cix where Ci = S 1/2 0 S1/2 0 Si S1/2 0 1/2 S 1/2 0 (6) Measure Estimation in the Barycentric Coding Model (Agueh & Carlier, 2011). Furthermore, A1-A3 are satisfied for non-degenerate Gaussians. Combining Theorem 1 with (6 gives the following. Corollary 1. For i = 1, . . . , p, let µi = N(0, Si) with Si Sd ++. Then µ0 is a barycenter if and only if µ0 = N(0, S0) for some S0 Sd ++, and the convex program of Theorem 1 has minimum value 0, where the matrix A is given by Aij = Tr ((Ci I)(Cj I)S0) . Furthermore, if the minimum value is zero and λ is a optimal argument, then µ0 = νλ The proof is deferred to the Supplementary Material Section 2.3. We note that in the Gaussian case, one can use || Gλ(µ0)||µ0 to upper bound W2(µ0, νλ) (Chewi et al., 2020; Altschuler et al., 2021). This further justifies the choice of optimizing || Gλ(µ0)||2 µ0, since it can be seen as a convex, sharp upper bound on the distance to the set of barycenters. 2.3. Projections of Compatible Measures If one imposes extra structure on the set {µi}p i=0, then it is possible to substantially strengthen Theorem 1. Definition 2. A family of measures {µi}p i=0 is said to be compatible if for any i, j, k {0, 1, ..., p} the optimal transport map T k i from µi to µk can can be expressed as T k i = T k j T j i where T j i , T k j are the optimal transport maps from µi to µj and µj to µk respectively. Note that whenever these maps exist, the map T k j T j i is a valid map from µi to µk, but it need not be optimal. Under the condition of compatibility, we have the following result. Theorem 2. Suppose that {µi}p i=0 are compatible and let λ be the minimizer in Theorem 1. Then λ = arg min λ p W 2 2 (µ0, νλ). The proof is deferred to the Supplementary Material Section 2.4. In other words, when the measures are compatible one can solve (3) and find the exact projection of µ0 onto Bary({µi}p i=1). There are several known conditions under which the family of measures is compatible. These include any set of continuous measures on R, any family of Gaussians with simultaneously diagonalizable covariance matrices, families of radial transformations of a base measure, or families with a "common dependence structure ; see (Panaretos & Zemel, 2020) Section 2.3 for further details. 3. Finite Sample Analysis and Rates of Convergence The results of the previous section are all reliant upon having exact knowledge of both the underlying measure µ0 as well as the optimal transport maps Ti. In practice, outside of specific parametric families (e.g., Gaussians), it is rare that one will have either of these available. Much more common is the setting where the µi are accessed through i.i.d. sampling. In this setting we need to estimate the maps Ti, as well as µ0 and use these estimates to compute an ˆA whose entries are ˆAij = ˆTi Id, ˆTj Id ˆµ0, where ˆTi and ˆµ0 are our estimates of the transport maps and µ0 respectively. The quality of the estimate ˆA will depend on the approximations used above, and ideally these would be accompanied by performance guarantees. 3.1. The Entropic Map Estimate Let X1, ..., Xn µ0 and Y1, ..., Yn µ1 be i.i.d. samples. In (Pooladian & Niles-Weed, 2021) the authors propose using the entropic map to estimate the transport maps. 
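Before turning to that estimate, it may help to see Corollary 1 as a computation. A minimal sketch of the Gaussian case using (6), with illustrative covariance matrices; the resulting matrix A can then be passed to a simplex-constrained quadratic program solver such as the one sketched after Theorem 1:

```python
import numpy as np
from scipy.linalg import sqrtm, inv

def gaussian_map_matrix(S0, Si):
    """C_i from (6): the optimal map from N(0, S0) to N(0, S_i) is x -> C_i x."""
    r = sqrtm(S0)
    return (inv(r) @ sqrtm(r @ Si @ r) @ inv(r)).real

def gaussian_A(S0, refs):
    """A_ij = Tr((C_i - I)(C_j - I) S0), the matrix of Corollary 1."""
    d = S0.shape[0]
    D = [gaussian_map_matrix(S0, Si) - np.eye(d) for Si in refs]
    return np.array([[np.trace(D[i] @ D[j] @ S0) for j in range(len(refs))]
                     for i in range(len(refs))])

# Illustrative covariances; since S0 equals the second reference, the row and
# column of A indexed by it vanish, and lambda = e_2 attains the value 0.
S0 = np.diag([2.0, 1.0])
refs = [np.diag([1.0, 1.0]), np.diag([2.0, 1.0]), np.diag([3.0, 0.5])]
print(gaussian_A(S0, refs).round(4))
```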
For ϵ > 0, this map is defined by i=1 Yi exp 1 ϵ (gϵ(Yi) 1 2||x Yi||2 2) ϵ (gϵ(Yi) 1 2||x Yi||2 2) (7) where gϵ is defined to be the optimal variable in the dual entropic optimal transport problem on the samples. See the Supplementary Material 1.1 for details on gϵ and important properties of ˆTϵ. Practically, the calculation of gϵ has been well studied and there exists fast algorithms for its computation (Peyré & Cuturi, 2020). The estimate has also been shown to converge (with rates) to the optimal transport map in L2(µ0) under the appropriate technical conditions. To accompany this we also have the following result, stated informally below. In this result a b means that there exists a constant C > 0 such that a Cb. We allow this constant C to depend on the size of the support of the measures, upper and lower bounds on their densities, and the regularity properties of the optimal transport maps Ti, Tj, but it is importantly independent of the sample size n. Theorem 3. (Informal) Let i, j {1, ..., p} and suppose that µi, µj, µ0 are supported on bounded domains and that the maps Ti and Tj are sufficiently regular. Let X1, ..., X2n µ0, Y1, ..., Yn µi, Z1, ..., Zn µj. For an appropriately chosen ϵ, let ˆTi and ˆTj be the entropic maps computed using {Xi}n i=1, {Yi}n i=1, {Zi}n i=1. Then Measure Estimation in the Barycentric Coding Model k=n+1 ˆTi(Xk) Xk, ˆTj(Xk) Xk 1 n + n α+1 4(d +α+1) p where d = 2 d/2 , and α 3 depends on the regularity of optimal transport maps. A precise statement and proof of this theorem is given in the Supplementary Material Section 2.5. Theorem 3 tells us that in expectation, by using the entropic map we can accurately approximate the entries of the matrix Aij, and do so in a numerically feasible way. We chose the entropic map as it is computationally tractable and has competitive convergence rates when compared to the intractable but optimal estimator of (Hütter & Rigollet, 2021). We expect similar analysis holds for other recently proposed estimators (Deb et al., 2021; Gunsilius, 2021; Manole et al., 2021; Muzellec et al., 2021) under different assumptions. Theorem 3 can be combined with a matrix perturbation analysis to demonstrate the consistency of the estimated barycentric coordinates ˆλ from Algorithm 1 for the true ones λ when µ0 Bary({µi}p i=1), that is when λT Aλ = 0. This is stated more precisely as follows. Corollary 2. Let ˆλ be the random estimate obtained from Algorithm 1. Suppose that A has an eigenvalue of 0 with multiplicity 1 and that λ p realizes λT Aλ = 0. Then under the assumptions of Theorem 3, E[ ˆλ λ 2 2] 1 n + n α+1 4(d +α+1) p The proof is deferred to the Supplementary Material Section 2.6. Note that the implicit constant in the inequality of Corollary 2 depends on p in addition to the dependencies of listed above. Corollary 2 ensures that in the large sample limit, the entropic regularization parameter may be chosen to guarantee precise estimation of the true barycentric coordinates. Note that the rate of convergence in Corollary 2 depends crucially on the smoothness of the transport maps between the reference measures (α) and the dimensionality of the space in which they are supported (d ). In the next section we illustrate the utility of the BCM and propose novel approaches based on Theorem 1 and Algorithm 1 for several applications. 4. Applications 4.1. 
Barycenter Parameterized Covariance Estimation Extending the set-up in (Musolas et al., 2021), we first consider the case where µi = N(0, Si), Si Sd ++ for Algorithm 1 Estimate λ Input: i.i.d. samples {X1, ..., X2n} µ0, {{Y i 1 , ..., Y i n} µi : i = 1, ..., p}, regularization parameter ϵ > 0. for i = 1, ..., p do Set M i Rn n with M i jk = 1 2||Xj Y i k||2 2. Solve for gi as the optimal g in max f,g Rn 1 n j,k exp (fj + gk M i jk)/ϵ . Define ˆTi through 7 with gϵ = gi and {Y i 1 , ..., Y i n}. end for Set ˆA Rp p to be the matrix with entries k=n+1 ˆTi(Xk) Xk, ˆTj(Xk) Xk . Return ˆλ = arg min λ p λT ˆAλ. i = 0, . . . , p. We will use Si both as a matrix and to refer to µi; similarly, let Sλ denote νλ. Let {xi}n i=1 be i.i.d. samples with empirical covariance ˆS. In this setting, we can use the formulas laid out in Corollary 1. For comparison, we also consider maximum likelihood estimation (MLE) for the parameter λ in the BCM. Let Pλ denote the probability density function of Sλ. The MLE is arg maxλ p Qn i=1 Pλ(xi), which is equivalent to minimizing the KL-divergence DKL( ˆS, Sλ). However, this problem may be non-convex and difficult to optimize. We solve it numerically using auto-differentiation (Bartholomew-Biggs et al., 2000) and use the coordinates recovered by this method, which may not be the true MLE. Experimental Setup: We consider the problem of estimating the covariance matrix S0 from i.i.d. samples {xi}n i=1 N(0, S0) when S0 is known to be a barycenter of S1, ..., Sp. To do so we use the following procedure: 1. Choose λ and {Si}p i=1: Select λ p and the reference measures {Si}p i=1 either by hand or at random. 2. Generate Sλ: Using Algorithm 1 of (Chewi et al., 2020), find the true barycenter, Sλ. 3. Sample: Sample {xi}n i=1 i.i.d. from N(0, Sλ). Compute the empirical covariance ˆS = 1 n Pn i=1 xix T i . 4. Estimate ˆλ: Perform the optimization in Corollary 1, or use MLE with S0 = ˆS to recover an estimate ˆλ. 5. Compute Sˆλ: Again using Algorithm 1 of (Chewi et al., 2020) find Sˆλ and use that as the estimate of Sλ. Measure Estimation in the Barycentric Coding Model The results are in Figure 3. These show that using our approach (denoted gradient norm) we can estimate the underlying covariance matrix much more accurately than the empirical covariance. Our optimization is competitive with the auto-differentiation approach to MLE, while being significantly more stable and much faster. Indeed, computing all the estimates ˆλ to generate Figure 3 takes our method less than 1 second compared to over 3 hours and 40 minutes for MLE. For reference it takes on average roughly 0.01 seconds to perform empirical covaraiance estimation2. Further details are in the Supplementary Material Section 3. 0 100 200 300 400 500 600 Number of Samples Wasserstein-2 Distance Distance to Recovered Matrix Emprical MLE Gradient Norm Figure 3: W2 between true covariance and estimates. We set p = 6, d = 10, λ Unif( p), and µi Wish(Id) + 0.5Id with results averaged over 250 trials, and 1 standard deviation shaded. We now consider applications where it is necessary to directly estimate the optimal transport map from samples (unlike the previous case where we estimate it via samplecovariance estimators) and between measures with nonuniform weights on the observed support. To that end, we first modify Algorithm 1 for practical purposes. 4.2. 
Adapting Algorithm 1 for Applications In practice, Algorithm 1 has a few mild requirements: that we use the same number of samples for each reference measure; that the weight on each sample is the same; and access to an extra set of points for evaluating the inner products. Algorithm 2 relaxes these requirements and is in terms of matrix-vector operations, making it suitable for practical implementation. We note that if one uses Algorithm 2 with the same constraints as in Algorithm 1, then both produce the exact same ˆTi on the samples X1, ..., Xn (Pooladian & Niles-Weed, 2021). The main difference between the two approaches is that the dual potentials in Algorithm 1 make an out-of-sample extension possible which may be of its own interest. 2All reported times are obtained using 20 Intel Xeon CPU E5-2660 V4 cores at 2.00 GHz. Algorithm 2 Estimate λ on Point Clouds Input: PMFs p Rn0, qi Rni, support matrices X Rn0 d,Y i Rni d, regularization parameter ϵ > 0. for i = 1, ..., p do Set M i Rn0 ni with M i jk = ||Xj Y i k||2 2. Solve for the entropic assignment matrix πi as the optimal matrix in min π R n0 ni + π1=p πT 1=qi k=1 M i jkπjk + ϵπjk log πjk. Compute the approximate transport matrix ˆTi Rn0 d as ˆTi = diag(1/p)πi Y i. end for Set ˆA Rp p to be the matrix with entries ˆAij = Tr diag(p)( ˆT i X)( ˆT j X)T . Return ˆλ = arg min λ p λT ˆAλ. 4.3. Image Inpainting and Denoising We consider the problem of recovering an image in the presence of corruption, interpreting the image as a probability measure. We consider two specific models of corruption: a. additive noise and b. occlusion of a portion of the image. Our experimental procedure for both of these problems is outlined below taking as experimental data the MNIST dataset of 28 28 pixel images of hand-written digits (Le Cun, 1998). Additional details are in Supplementary Material Section 4. Note that after normalization, these can be treated as measures supported on a 28 28 grid. Experimental Setup: 1. Select µ0: Select a digit µ0 a. generate a white noise image ζ. Set µ0 to be µ0 = (1 α)µ0 + αζ. b. Set µ0 to be µ0 with the central 8 8 square removed, and renormalized. 2. Select {µi}p i=1: Select a set of images of the same digit as µ0 to serve as the reference measures. a. Let µi = (1 α)µ0 + αE[ζ]. b. Let µi be µi with the central 8 8 square removed, and renormalized 3. Estimate A, λ: Use Algorithm 2 to compute the approximate Gram matrix ˆA using µ0, µ1, ... µp and then compute estimated coordinates ˆλ. 4. Compute ˆνλ: Output ˆµ0 = νˆλ, where the barycenter is reconstructed from µ1, ..., µp. An illustration of the procedure is shown in Figure 4. We Measure Estimation in the Barycentric Coding Model Figure 4: Left to right: corrupted image; recovery by linear projection; recovery using ˆλ from Algorithm 2; original image. Top: white noise added to the image as in a. Bottom: occlusion of the image as in b. We see the linear reconstruction fails, while Algorithm 2 recovers well, albeit with some blurring that can be interpreted as a consequence of entropic regularization. also show for comparison digits reconstructed using a linear method, in which the corrupted image is recovered by computing the Euclidean projection onto the convex hull of the reference measures (see the Supplement 4.3 for further details). To recover the barycenter we use the method of (Benamou et al., 2015). As a competitive baseline we compare to a well-known existing method for histogram regression (Bonneel et al., 2016) in this setting. 
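For concreteness, here is a hedged sketch of Algorithm 2 using the POT library for the entropic plans and SciPy for the simplex-constrained quadratic program; the variable names and the regularization value are placeholders, and the authors' released code may differ in its details:

```python
import numpy as np
import ot  # Python Optimal Transport (POT)
from scipy.optimize import minimize

def estimate_lambda(p_w, X, ref_weights, ref_supports, eps=1e-2):
    """Sketch of Algorithm 2: entropic plans, barycentric projections, the
    Gram matrix A-hat, and the simplex-constrained quadratic program.
    p_w: (n0,) weights of mu_0;  X: (n0, d) support of mu_0;
    ref_weights[i]: (n_i,);      ref_supports[i]: (n_i, d)."""
    T_hats = []
    for q, Y in zip(ref_weights, ref_supports):
        M = ot.dist(X, Y)                       # squared Euclidean cost matrix
        pi = ot.sinkhorn(p_w, q, M, reg=eps)    # entropic transport plan
        T_hats.append((pi @ Y) / p_w[:, None])  # barycentric projection of X
    p = len(T_hats)
    A = np.array([[np.sum(p_w * np.sum((T_hats[i] - X) * (T_hats[j] - X), axis=1))
                   for j in range(p)] for i in range(p)])
    res = minimize(lambda l: l @ A @ l, np.full(p, 1.0 / p),
                   bounds=[(0.0, 1.0)] * p,
                   constraints=[{"type": "eq", "fun": lambda l: l.sum() - 1.0}],
                   method="SLSQP")
    return res.x, A
```

Reconstructing ν_λ̂ from the estimated coordinates (step 4 of the procedure above) is a separate synthesis step, carried out in the experiments via the method of (Benamou et al., 2015).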
Our results are summarized in Tables 1 and 2. We note that our method is over an order of magnitude faster than (Bonneel et al., 2016) on this dataset and achieves competitive results. This is particularly remarkable because (Bonneel et al., 2016) is specifically adapted to measures with structured support such as grids and meshes, while our Algorithm 2 is much more general. 4.4. Document Classification Finally, we consider the task of identifying the topic of a document using its word embedding representation as the empirical distribution. We assume that there are t topics and that we have p reference documents about each topic. Each document can be represented as a high-dimensional empirical measure using a combination of a bag-of-words representation and a node2vec (Mikolov et al., 2013) wordembedding, and is paired with topic label. We use the publicly available dataset provided by (Huang et al., 2016). The task is to use a small number of labelled documents from each topic to predict the topic of a test document. We consider four predictors: (1) 1-Nearest Neighbor (1NN) which classifies using the topic of the W 2 2 -nearest reference document; (2) Minimum Average Distance which selects the topic with reference documents on average W 2 2 - closest to the test document; (3) Minimum Barycenter Loss which runs Algorithm 2 using the references from each topic separately and selects the topic with the small- 2 4 6 8 10 12 Number of Reference Documents per Class BBC Sports Topic Prediction 1NN Min. Avg. Dist. Min. BC Loss Max. Coordinate 2 3 4 5 6 7 8 9 10 Number of Reference Documents per Class News 20 Topic Prediction 1NN Min. Avg. Dist. Min. BC Loss Max. Coordinate Figure 5: Document topic prediction accuracy as a function of number of reference documents in each class. Top: BBC Sport dataset (5 classes). Bottom: News20 dataset (20 classes). est loss in the quadratic form; (4) Maximum Coordinate which runs Algorithm 2 using all the reference documents and chooses the topic which receives the most mass from ˆλ. Our findings on the BBCSport (5 topics) and News20 (20 topics) datasets are reported in Figure 5. To generate these figures we randomly sample k reference documents per topic and a test set of 100 random documents. We apply the four methods listed above and compute the accuracy on the test set. This procedure is repeated 50 times for each choice of k and the average accuracy is plotted. See Supplementary Material Section 5.2 for details. We see that even with a very small amount of training data, our BCM approaches perform well. Importantly, the BCM is able to represent unseen documents using these small training sets, giving it a clear advantage over simply classifying based on which document class it is W 2 2 -nearest (1NN) or W 2 2 -nearest in an aggregate sense (Min. Avg. Dist.). This suggests that even when p is small, the BCM has high modeling capacity and that the learned barycentric coordinates encode important information. Measure Estimation in the Barycentric Coding Model Bon., ϵ = 10 Alg. 2, ϵ = 10 Alg. 2, ϵ = 0 Occlusion 2.5101 (2228s) 2.5488 (3.371s) 2.5287 (1.062s) Noise (α = 0.5) 2.4058 (2391s) 2.6797 (86.32s) 2.3787 (50.67s) Table 1: Average quality of recovery of µ0 measured in W 2 2 (µ0, ˆµ0) when reconstructing 500 random 4 s using a barycenter constructed from 10 random 4 s, as well as run times of each method. Algorithm 2 affords comparably accurate reconstructions in orders-of-magnitude faster time than the state-of-the-art (Bonneel et al., 2016). (Alg. 
2, ϵ = 10) - (Bon., ϵ = 10) (Alg. 2, ϵ = 0) - (Bon., ϵ = 10) (Alg. 2, ϵ = 10) - (Alg. 2, ϵ = 0) Occlusion 0.03872 0.3103 0.01868 0.2934 0.02004 0.2145 Additive 0.2739 0.3564 0.02706 0.2558 0.3009 0.2486 Table 2: Average and standard deviation of the pairwise-difference in reconstruction quality measured in W 2 2 (µ0, ˆµ0) in the same setting as Table 1. The average pairwise-difference is smaller than the standard deviation in all cases involving (Bonneel et al., 2016), suggesting there is no statistically significant difference. Our approach should thus be preferred to (Bonneel et al., 2016), on the grounds that it is considerably more efficient while achieving essentially the same accuracy. 5. Conclusion We have proposed a new method for computing coordinates under the BCM, together with guarantees on its solution via a convex program (Theorem 1) with closed-form coordinates in the Gaussian case (Corollary 1). We further developed an algorithm for estimating the coordinates under the BCM when all measures are accessible only via i.i.d. samples (Algorithm 1) which enjoys a natural smoothness and dimension-dependent rate of convergence to the true parameters (Theorem 3, Corollary 2). The BCM paradigm affords significant gains in runtime and robustness for barycenter parameterized Gaussian measure estimation, and provides an effective approach to image reconstruction and document classification when interpreting these data as measures. The results in this paper suggest the efficiency and effectiveness of the BCM as a broadly viable modelling tool. Future Work and Open Questions: Section 3 provides a consistency analysis for A and λ by estimating the transport maps between each measure. But, all that is needed is an estimate on the inner product between the displacement maps, which in principle could admit more sample efficient approaches that avoid explicitly estimating the maps. Section 4 suggests that unseen data can be well-represented in the BCM for randomly chosen reference measures. An interesting theoretical problem is to understand the representational capacity of the BCM using particular reference measures (e.g., random ones, optimally chosen ones, those in certain parametric families). This naturally touches on the question of the smoothness of the synthesis map λ 7 νλ. Computationally, all of our real data experiments use randomly chosen reference measures. It is of interest to develop efficient procedures as in (Schmitz et al., 2018) for learning reference measures that induce low reconstruction error. Beyond that, one could regularize the analysis problem (3) to encourage sparse or otherwise structured coordinates as is well-studied in Euclidean settings (Aharon et al., 2006). Acknowledgements MW is supported by NSF CCF-1553075. RJ is supported by NSF DRL 1931978. SA acknowledges support by NSF CCF-1553075, NSF DRL 1931978, NSF EEC 1937057, and AFOSR FA9550-18-1-0465. JMM acknowledges support from the Dreyfus Foundation and the NSF through grants DMS-1912737, DMS-1924513. All authors acknowledge support through the Tufts TRIPODS Institute, supported by the NSF under grant CCF-1934553. We acknowledge the reviewers for their helpful comments which improved the paper considerably. Agueh, M. and Carlier, G. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2): 904 924, 2011. Aharon, M., Elad, M., and Bruckstein, A. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. 
IEEE Transactions on Signal Processing, 54(11):4311 4322, 2006. Altschuler, J. M., Chewi, S., Gerber, P., and Stromme, A. J. Averaging on the Bures-Wasserstein manifold: dimension-free convergence of gradient descent. In Advances in Neural Information Processing Systems, volume 34, 2021. Álvarez-Esteban, P. C., Del Barrio, E., Cuesta-Albertos, J., and Matrán, C. A fixed-point approach to barycenters in Wasserstein space. Journal of Mathematical Analysis and Applications, 441(2):744 762, 2016. Ambrosio, L., Gigli, N., and Savare, G. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Measure Estimation in the Barycentric Coding Model Lectures in Mathematics. ETH Zürich. Birkhäuser Basel, 2005. Bartholomew-Biggs, M., Brown, S., Christianson, B., and Dixon, L. Automatic differentiation of algorithms. Journal of Computational and Applied Mathematics, 124: 171 190, 12 2000. Benamou, J.-D., Carlier, G., Cuturi, M., Nenna, L., and Peyré, G. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111 A1138, 2015. Bonneel, N., Rabin, J., Peyré, G., and Pfister, H. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22 45, 2015. Bonneel, N., Peyré, G., and Cuturi, M. Wasserstein barycentric coordinates: histogram regression using optimal transport. ACM Transactions on Graphics, 35(4):71:1 71:10, 2016. Brenier, Y. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4):375 417, 1991. Cheng, K., Aeron, S., Hughes, M. C., and Miller, E. Dynamical Wasserstein barycenters for time-series modeling. In Advances in Neural Information Processing Systems, volume 34, 2021. Chewi, S., Maunu, T., Rigollet, P., and Stromme, A. J. Gradient descent algorithms for Bures-Wasserstein barycenters. In Conference on Learning Theory, pp. 1276 1304. PMLR, 2020. Chizat, L., Roussillon, P., Léger, F., Vialard, F.-X., and Peyré, G. Faster Wasserstein distance estimation with the Sinkhorn divergence. ar Xiv preprint ar Xiv:2006.08172, 2020. Claici, S., Chien, E., and Solomon, J. Stochastic Wasserstein barycenters. In International Conference on Machine Learning, pp. 999 1008. PMLR, 2018. Colombo, P., Staerman, G., Clavel, C., and Piantanida, P. Automatic text evaluation through the lens of Wasserstein barycenters. ar Xiv preprint ar Xiv:2108.12463, 2021. Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems, 26:2292 2300, 2013. Cuturi, M. and Doucet, A. Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, pp. 685 693. PMLR, 2014. Deb, N., Ghosal, P., and Sen, B. Rates of estimation of optimal transport maps using plug-in estimators via barycentric projections. Advances in Neural Information Processing Systems, 34, 2021. Donoho, D. L. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289 1306, 2006. Fréchet, M. Les éléments aléatoires de nature quelconque dans un espace distancié. Annales de l institut Henri Poincaré, 10(4):215 310, 1948. Genevay, A. Entropy-regularized optimal transport for machine learning. Ph D thesis, Paris Sciences et Lettres (Com UE), 2019. Gunsilius, F. F. On the convergence rate of potentials of brenier maps. Econometric Theory, pp. 1 37, 2021. Huang, G., Quo, C., Kusner, M. J., Sun, Y., Weinberger, K. Q., and Sha, F. Supervised word mover s distance. 
In Advances in Neural Information Processing Systems, pp. 4869 4877, 2016. Hütter, J.-C. and Rigollet, P. Minimax estimation of smooth optimal transport maps. The Annals of Statistics, 49(2): 1166 1194, 2021. Karcher, H. Riemannian center of mass and mollifier smoothing. Communications on Pure and Applied Mathematics, 30(5):509 541, 1977. Kroshnin, A., Tupitsa, N., Dvinskikh, D., Dvurechensky, P., Gasnikov, A., and Uribe, C. On the complexity of approximating Wasserstein barycenters. In International Conference on Machine Learning, pp. 3530 3540. PMLR, 2019. Le Gouic, T. and Loubes, J.-M. Existence and consistency of Wasserstein barycenters. Probability Theory and Related Fields, 168(3):901 917, 2017. Le Cun, Y. The MNIST database of handwritten digits. http://yann. lecun. com/exdb/mnist/, 1998. Mallat, S. A wavelet tour of signal processing. Elsevier, 1999. Manole, T., Balakrishnan, S., Niles-Weed, J., and Wasserman, L. Plugin estimation of smooth optimal transport maps. ar Xiv preprint ar Xiv:2107.12364, 2021. Mc Cann, R. J. A convexity principle for interacting gases. Advances in Mathematics, 128(1):153 179, 1997. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111 3119, 2013. Measure Estimation in the Barycentric Coding Model Montesuma, E. F. and Mboula, F. M. N. Wasserstein barycenter for multi-source domain adaptation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16785 16793, 2021. Murphy, K. P. Machine learning: a probabilistic perspective. MIT press, 2012. Musolas, A., Smith, S. T., and Marzouk, Y. Geodesically parameterized covariance estimation. SIAM Journal on Matrix Analysis and Applications, 42(2):528 556, 2021. Muzellec, B., Vacher, A., Bach, F., Vialard, F.-X., and Rudi, A. Near-optimal estimation of smooth transport maps with kernel sums-of-squares. ar Xiv preprint ar Xiv:2112.01907, 2021. Panaretos, V. M. and Zemel, Y. An invitation to statistics in Wasserstein space. Springer Nature, 2020. Peyré, G. and Cuturi, M. Computational optimal transport, 2020. Pooladian, A.-A. and Niles-Weed, J. Entropic estimation of optimal transport maps. ar Xiv: 2109.12004, 2021. Popoviciu, T. Sur les équations algébriques ayant toutes leurs racines réelles. Mathematica, 9:129 145, 1935. Rabin, J., Peyré, G., Delon, J., and Bernot, M. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pp. 435 446. Springer, 2011. Redko, I., Courty, N., Flamary, R., and Tuia, D. Optimal transport for multi-source domain adaptation under target shift. In International Conference on Artificial Intelligence and Statistics, pp. 849 858. PMLR, 2019. Santambrogio, F. Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling. Progress in Nonlinear Differential Equations and Their Applications. Springer International Publishing, 2015. Schmitz, M. A., Heitz, M., Bonneel, N., Ngole, F., Coeurjolly, D., Cuturi, M., Peyré, G., and Starck, J.-L. Wasserstein dictionary learning: optimal transport-based unsupervised nonlinear dictionary learning. SIAM Journal on Imaging Sciences, 11(1):643 678, 2018. Schwerdtfeger, H. Introduction to linear algebra and the theory of matrices. P. Noordhoff, 1961. Singh, S. P., Hug, A., Dieuleveut, A., and Jaggi, M. 
Context mover s distance & barycenters: Optimal transport of contexts for building representations. In International Conference on Artificial Intelligence and Statistics, pp. 3437 3449. PMLR, 2020. Toši c, I. and Frossard, P. Dictionary learning. IEEE Signal Processing Magazine, 28(2):27 38, 2011. Villani, C. Topics in Optimal Transportation. Graduate studies in mathematics. American Mathematical Society, 2003. Xu, H., Wang, W., Liu, W., and Carin, L. Distilled Wasserstein learning for word embedding and topic modeling. In Advances in Neural Information Processing Systems, pp. 1716 1725, 2018. Yang, L., Li, J., Sun, D., and Toh, K.-C. A fast globally linearly convergent algorithm for the computation of Wasserstein barycenters. Journal of Machine Learning Research, 22(21):1 37, 2021. Measure Estimation in the Barycentric Coding Model Supplementary Material: Measure Estimation in the Barycentric Coding Model 1. Precise Statements of Technical Results For completeness, we replicate the theorem describing conditions under which Karcher means are guaranteed to be barycenters. Theorem 4. ((Panaretos & Zemel, 2020), Theorem 3.1.15) Let λ p with λi = 0 for all i. Let Ω Rd be open and convex and let µ1, . . . , µp P2,ac(Rd) supported on Ωwith bounded, strictly positive densities g1, . . . , gp. Suppose that µ0 is an absolutely continuous Karcher mean for λ and is supported on Ωwith bounded strictly positive density f there. Then µ0 is the Wasserstein barycenter for coordinates λ if one of the following holds: (1) Ω= Rd and the densities f, g1, . . . , gp are locally Hölder continuous. (2) Ωis bounded and the densities f, g1, . . . , gp are bounded below on Ω. Next we include a technical theorem which we will leverage in our proof of the convergence of the entries Aij. Before doing so, we introduce the necessary background. Theorem 5. ((Brenier, 1991)) Let µ P2,ac(Ω) and ν P(Ω). Then 1. There exists a solution T0 to (1), with T0 = ϕ0, for a convex function ϕ0 solving inf ϕ L1(µ) Z ϕdµ + Z ϕ dν where ϕ is the convex conjugate of ϕ. 2. If in addition ν P2,ac(Ω), then ϕ 0 is an admissible optimal transport map from ν to µ. Let Cα denote the space of functions possessing α continuous derivatives and whose α th derivative is (α α ) Hölder smooth. Using these conventions, we require the technical conditions: A4: µ, ν P2,ac(Ω) for a compact convex set Ω, with densities satisfying p(x), q(x) M and q(x) m > 0 for all x Ω. A5: ϕ0 C2(Ω) and ϕ 0 Cα+1(Ω) for α > 1. A6: There exist l, L > 0 such that T0 = ϕ0, with l I 2ϕ0(x) LI for all x Ω. This first assumption is a specialization of A1 and A2, restricting to the case where Ωis compact. The latter two can be thought of as regularity conditions on the optimal transport maps between base and reference measures. A5 asserts that the optimal transport maps have smooth derivatives and A6 controls the derivatives from both above and below. 1.1. Properties of The Entropic Map Let Xi µ0, Yi µ1 for i = 1, ..., n be i.i.d. samples from their respective distributions. From these samples construct the empirical measures ˆµ0,n = 1 n Pn i=1 δXi and ˆµ1,n = 1 n Pn i=1 δYi. For ϵ > 0 and n samples from each measures, the discrete entropically regularized OT problem can be written min π Rn n + π1=(1/n)1 πT 1=(1/n)1 j,k=1 πjk||Xj Yk||2 2 + ϵπjk log πjk This problem has a dual formulation (Genevay, 2019) given by max f,g Rn 1 n i=1 fi + gi ϵ ef/ϵ, Keg/ϵ Measure Estimation in the Barycentric Coding Model where Ki,j = exp( ||Xi Yi||2 2/ϵ). 
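For reference, the dual potentials of the discrete problem above can be obtained with plain Sinkhorn iterations. A minimal sketch with uniform weights and the (1/2)||·||^2 cost convention of Algorithm 1; the stopping rule and normalization here are simplistic placeholder choices, and a log-domain implementation is preferable for small ϵ:

```python
import numpy as np

def sinkhorn_potentials(X, Y, eps, iters=2000):
    """Plain Sinkhorn iterations for the discrete entropic OT problem above,
    with uniform weights 1/n and cost (1/2)||X_j - Y_k||^2 as in Algorithm 1.
    Returns dual potentials (f, g), defined up to an additive constant, such
    that the optimal plan is proportional to exp((f_j + g_k - M_jk) / eps)."""
    n = X.shape[0]
    M = 0.5 * ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-M / eps)            # for small eps, prefer a log-domain
    a = np.full(n, 1.0 / n)         # implementation to avoid underflow
    u, v = np.ones(n), np.ones(n)
    for _ in range(iters):
        u = a / (K @ v)
        v = a / (K.T @ u)
    return eps * np.log(u), eps * np.log(v)
```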
gϵ is defined to be the optimal g in the maximization above, parameterized by the setting of ϵ > 0. For the method proposed it is important that the fϵ and gϵ are chosen such that for all x, y Rd we have fϵ(x) + gϵ(Yi) 1 2||x Yi||2 2 fϵ(Xi) + gϵ(y) 1 2||Xi y||2 2 This is always possible, and can be easily done in practice (Pooladian & Niles-Weed, 2021). One of the primary motivations for selecting the entropic map as our map estimate is that gϵ can be efficiently computed (Peyré & Cuturi, 2020), and it comes with the performance guarantee stated below. In the following theorem the constants may depend on the dimension d, the diameter of the support |Ω|, M, m, l, L, and ||ϕ 0||Cα+1. Theorem 6. ((Pooladian & Niles-Weed, 2021) Theorem 3) Under A4,A5,A6, the entropic map ˆT = Tϵ,(n,n) from ˆµn to ˆνn with regularization parameter ϵ n 1/(d + α+1) satisfies E|| ˆT T0||2 L2(µ) (1 + I0(µ, ν))n ( α+1) 2(d + α+1) log n (9) where d = 2 d/2 and α = min(α, 3) and I0(µ, ν) is the integrated Fisher information between µ and ν. For more information on I0(µ, ν) see (Chizat et al., 2020). We remark that when α 2 then under A4,A5,A6, it has been shown I0(µ, ν) C for a positive constant C (Chizat et al., 2020). The notation a b means that there exists positive constants c, C such that ca b Ca and a b means that there is a positive constant C such that a Cb. Note that this result still holds without the convexity condition in A4. We include it for ease of exposition below and to ensure that the interior of an Ωthat satisfies A4 is convex and therefore satisfies A3. 2. Proofs of Theoretical Results 2.1. Proof of Proposition 1 Proof. We first apply the definitions given in Section 2 and leverage (4): || Gλ(µ0)||2 µ0 = || i=1 λi(Tµ0 µi Id)||2 µ0 i=1 λi(Tµ0 µi Id), j=1 λj(Tµ0 µj Id) i,j=1 λiλj Tµ0 µi Id, Tµ0 µj Id dµ0 i,j=1 λiλj Aij = λT Aλ. To establish the convexity of the function in terms of λ, it is sufficient to show that A is symmetric positive semi-definite. Observe that A is a Gram matrix of the set {Tµ0 µi Id}p i=1 Tµ0 P2,ac(Rd) with the inner product in Tµ0 P2,ac(Rd). It is classical that all Gram matrices are symmetric positive semi-definite (Schwerdtfeger, 1961) and this completes the proof. 2.2. Proof of Theorem 1 Proof. The convexity of the objective is established in Proposition 1, and the feasible set p is trivially convex. By Theorem 4, when assumptions A1-A3 hold, all Karcher means are barycenters, and therefore µ0 = νλ, and it is always the case that barycenters are Karcher means. Therefore it is sufficient to check that the minimum value is 0 if and only if µ0 is a Karcher mean. Measure Estimation in the Barycentric Coding Model In the case that the minimum is zero, there is some λ such that (λ )T Aλ = 0 and we can apply Proposition 1 to show that µ0 is a Karcher mean for λ . Similarly if µ0 is a Karcher mean for some λ by definition we have || Gλ (µ0)||2 µ0 = 0 and again by Proposition 1 we have that λ achieves the minimum of zero in the optimization. 2.3. Proof of Corollary 1 Proof. A1,A2,A3 are clearly satisfied in the Gaussian case with Ω= Rd since the pdf of a non-degenerate multivariate Gaussian is positive everywhere and Lipschitz continuous. Furthermore the barycenter of zero-mean Gaussians is itself a zero-mean Gaussian ((Agueh & Carlier, 2011) Theorem 6.1). Therefore we can apply Theorem 1 and all that remains is to calculate Aij as follows. Noting that the Ci are all symmetric (Agueh & Carlier, 2011) we have Rd (Ti(x) Id(x)), (Tj(x) Id(x)) d S0(x) Rd (Ci I)x, (Cj I)x d S0(x) (Eq. 
6) = E X N(0,S0) h XT (Ci I)T (Cj I) X i = E X N(0,S0) XT (Ci I) (Cj I) X (Ci all sym.) = Tr ((Ci I) (Cj I) S0) . The last equality comes from the identity E X N(0,S) XT BX = Tr(BS) which holds for all B Rd d and S Sd ++. 2.4. Proof of Theorem 2 Proof. In this setting, the barycenter νλ can be expressed as i=1 λi T i 0 ((Panaretos & Zemel, 2020) Theorem 3.1.9). Together with Proposition 1, this implies (recalling Ti = T i 0) W 2 2 (µ0, νλ) = Z ||x i=1 λi Ti(x)||2 2dµ0(x) = || Gλ(µ0)||2 µ0 = λT Aλ and therefore arg min λ p W 2 2 (µ0, νλ) = arg min λ p λT Aλ. 2.5. Exact Statement and Proof of Theorem 3 Theorem 7. (Formal Version of Theorem 3, Convergence Rate of Aij) Let T1, T2 be the optimal transport maps from µ0 to µ1, µ2 respectively. Suppose that A4, A5, A6 are satisfied for both pairs (µ0, µ1) and (µ0, µ2). Let X1, ..., X2n µ0, Y1, ..., Yn µ1, Z1, ..., Zn µ2. Let ˆT1 and ˆT2 be the entropic maps from µ0 to µ1 and µ2 respectively, both with ϵ n 1/(d + α+1) and computed using X1, ..., Xn. Then we have Z T1 Id, T2 Id dµ0 1 i=n+1 ˆT1(Xi) Xi, ˆT2(Xi) Xi 1 n + n α+1 4(d + α+1) p 1 + I0(µ0, µ1) + I0(µ0, µ2) where d = 2 d/2 , α = α 3 and I0(µ0, µ1) is the integrated Fisher information along the Wasserstein geodesic between µ and µ1. Measure Estimation in the Barycentric Coding Model Before we begin the proof, we remark that by writing µi, µj instead of µ1, µ2, the random variable in the expectation is just |Aij ˆAij| and therefore this result allows us to control the entry-wise deviations of ˆAij from their true values. Proof. Throughout, unless otherwise noted, the expectation is with respect to all of X1, ..., X2n, Y1, ..., Yn and Z1, ..., Zn. We also use the notations Xn to denote the set of variables (X1, ..., Xn), and similarly for Y n and Zn. First observe that we can assume without loss of generality that 0 Ω; see (Peyré & Cuturi, 2020) Remark 2.9 for invariance of T1, T2 under translation. For the translation invariance of ˆT1, ˆT2, observe that both obtaining gϵ, and evaluating ˆT1, ˆT2 at fixed points, requires only the distances ||Xi Yj||2 2 = ||(Xi t) (Yj t)||2 2. We calculate: Z T1 Id, T2 Id dµ0 1 i=n+1 ˆT1(Xi) Xi, ˆT2(Xi) Xi Z T1 Id, T2 Id dµ0 1 i=n+1 T1(Xi) Xi, T2(Xi) Xi i=n+1 T1(Xi) Xi, T2(Xi) Xi 1 i=n+1 ˆT1(Xi) Xi, ˆT2(Xi) Xi Z T1 Id, T2 Id dµ0 1 i=n+1 T1(Xi) Xi, T2(Xi) Xi i=n+1 T1(Xi) Xi, T2(Xi) Xi 1 i=n+1 ˆT1(Xi) Xi, ˆT2(Xi) Xi We consider the first and second terms separately. For brevity, let h be the function defined by h(x) = T1(x) x, T2(x) x . Note that h is bounded on Ωsince |h(x)| = | T1(x) x, T2(x) x | ||T1(x) x||2||T2(x) x||2 (2|Ω|)(2|Ω|) = 4|Ω|2, which implies by Popoviciu s inequality (Popoviciu, 1935) that Var[h(X)] 16|Ω|4. From this it follows Z T1 Id, T2 Id dµ0 1 i=n+1 T1(Xi) Xi, T2(Xi) Xi " EX µ0[h(X)] 1 i=n+1 h(Xi) where we have used that for all i.i.d. random variables U, U1, ..., Un with finite variance Measure Estimation in the Barycentric Coding Model We now handle the second term: i=n+1 T1(Xi) Xi, T2(Xi) Xi 1 i=n+1 ˆT1(Xi) Xi, ˆT2(Xi) Xi i=n+1 E h T1(Xi) Xi, T2(Xi) Xi ˆT1(Xi) Xi, ˆT2(Xi) Xi i =E h T1(Xn+1) Xn+1, T2(Xn+1) Xn+1 ˆT1(Xn+1) Xn+1, ˆT2(Xn+1) Xn+1 i =E h T1(Xn+1), T2(Xn+1) ˆT1(Xn+1), ˆT2(Xn+1) + ˆT1(Xn+1) T1(Xn+1), Xn+1 + ˆT2(Xn+1) T2(Xn+1), Xn+1 i E h T1(Xn+1), T2(Xn+1) ˆT1(Xn+1), ˆT2(Xn+1) i + E h ˆT1(Xn+1) T1(Xn+1), Xn+1 i + E h ˆT2(Xn+1) T2(Xn+1), Xn+1 i Here we can control the three terms separately. We first take care of the middle term and the last term follows by an identical argument. 
E h ˆT1(Xn+1) T1(Xn+1), Xn+1 i =EXn,Y n,Xn+1 h ˆT1(Xn+1) T1(Xn+1), Xn+1 i EXn,Y n,Xn+1 h || ˆT1(Xn+1) T1(Xn+1)||2||Xn+1||2 i (Cauchy-Schwarz) |Ω|EXn,Y n,Xn+1 h || ˆT1(Xn+1) T1(Xn+1)||2 i (||Xn+1|| |Ω|) (10) =|Ω|EXn,Y n,Xn+1 h EXn+1|| ˆT1(Xn+1) T1(Xn+1)||2 i =|Ω|EXn,Y n,Xn+1 || ˆT1(Xn+1) T1(Xn+1)||2 2 |Ω|EXn,Y n,Xn+1 EXn+1|| ˆT1(Xn+1) T1(Xn+1)||2 2 =|Ω|EXn,Y n h || ˆT1 T1||L2(µ0) i EXn,Y n h || ˆT1 T1||2 L2(µ0) i (1 + I0(µ0, µ1))n α+1 2(d + α+1) log(n). (Theorem 6) As mentioned above the third term is exactly the same except replacing T1 and ˆT1 with T2 and ˆT2 as well as Y n with Zn. To control the first term we start with a different trick and then end up following the same pattern as in the bound above. E h T1(Xn+1), T2(Xn+1) ˆT1(Xn+1), ˆT2(Xn+1) i =E h T1(Xn+1) ˆT1(Xn+1), T2(Xn+1) + ˆT1(Xn+1), T2(Xn+1) ˆT2(Xn+1) i E h T1(Xn+1) ˆT1(Xn+1), T2(Xn+1) i + E h ˆT1(Xn+1), T2(Xn+1) ˆT2(Xn+1) i E h ||T1(Xn+1) ˆT1(Xn+1)||2||T2(Xn+1)|| i + E h || ˆT1(Xn+1)||2||T2(Xn+1) ˆT2(Xn+1)||2 i |Ω|E h ||T1(Xn+1) ˆT1(Xn+1)||2 i + |Ω|E h ||T2(Xn+1) ˆT2(Xn+1)||2 i =|Ω|EXn,Y n,Xn+1 h ||T1(Xn+1) ˆT1(Xn+1)||2 i + |Ω|EXn,Zn,Xn+1 h ||T2(Xn+1) ˆT2(Xn+1)||2 i . (11) Measure Estimation in the Barycentric Coding Model Above we have made use of the fact that both T2(X1) and ˆT1(X1) are contained in Ωwhich implies that ||T2(X1)||2 |Ω|. T2(X1) Ωfollows from the fact that T2 is a map from µ0 to µ2. To see that || ˆT1(X1)|| |Ω|, note that from equation (7) ˆT1(X1) is a convex combination of Y1, ..., Yn Ωand each of which satisfies ||Yi||2 |Ω|. Observe that both terms in (11) have already been controlled, starting from Equation (10) (the latter requires replacing T1, ˆT1, and Y n with T2, ˆT2, and Zn respectively). Applying the bound derived there and combining terms terms we have the following result Z T1 Id, T2 Id dµ0 1 i=n+1 ˆT1(Xi) Xi, ˆT2(Xi) Xi (1 + I0(µ0, µ1))n α+1 2(d + α+1) log(n) + 2 q (1 + I0(µ0, µ2))n α+1 2(d + α+1) log(n) = 1 n + n α+1 4(d + α+1) p 1 + I0(µ0, µ1) + p 1 + I0(µ0, µ2) 1 n + n α+1 4(d + α+1) p 1 + I0(µ0, µ1) + I0(µ0, µ2). 2.6. Proof of Corollary 2 Proof. Let Bn denote the entrywise bound in Theorem 3. Noting that ˆλT ˆA ˆλ λT ˆAλ by construction, we estimate E[ˆλT A ˆλ] = E h ˆλT (A ˆA) ˆλ i + E[ˆλT ˆAˆλ] E h |ˆλT (A ˆA) ˆλ | i + E[λT ˆAλ ] = E h |ˆλT (A ˆA) ˆλ | i + E[λT ( ˆA A)λ ] (λT Aλ = 0) i,j=1 (ˆλ)i(ˆλ)j(A ˆA)ij i,j=1 (λ )i(λ )j(A ˆA)ij i,j=1 (ˆλ)i(ˆλ)j|Aij ˆAij| i,j=1 (λ )i(λ )j|Aij ˆAij| (ˆλ, λ p, triangle ineq.) i,j=1 |Aij ˆAij| i,j=1 E[|Aij ˆAij|] 2p2Bn (Theorem 3) Since A is positive semidefinite and by assumption λ p satisfies λT Aλ = 0, it follows that λ is an eigenvector of A with eigenvalue 0. Let 0 < α2 αp be the non-zero eigenvalues of A with associated orthonormal eigenvectors v2, . . . , vp. Orthogonally decompose ˆλ = ˆβλ + ˆλ , where ˆβ R and ˆλ is in the span of {v2, . . . , vp}. Note that ˆβ and ˆλ are random. Then, Measure Estimation in the Barycentric Coding Model E[ ˆλ ˆβλ 2 2] = E[ ˆλ 2 2] i=2 |v T i ˆλ |2 # i=2 αi|v T i ˆλ |2 # α2 E[|(ˆλ )T Aˆλ |] α2 E[|ˆλT A ˆλ |] Summing both sides of the equation ˆλ = ˆβλ + ˆλ and recalling λ , ˆλ p yields ˆβ + p ˆλ 2, which implies that E[(1 ˆβ)2] p E[ ˆλ 2 2] 2p3 Finally, we use the fact that ˆλ ˆβλ = ˆλ and (ˆβ 1)λ are orthogonal to bound: E[ ˆλ λ 2 2] = E[ ˆλ ˆβλ 2 2 + (ˆβ 1)λ 2 2] = E[ ˆλ ˆβλ 2 2] + E[ (ˆβ 1)λ 2 2] α2 Bn + E[(ˆβ 1)2 λ 2 2] α2 Bn + E[(ˆβ 1)2] α2 Bn + 2p3 as desired. 2.7. 
2.7. Karcher Means are Barycenters under A4-A6

Corollary 2 (which holds under A4-A6) can be combined with Theorem 1 (which holds under A1-A3) as follows. Suppose $\|G_\lambda(\nu)\|^2_{\nu} = 0$ for some $\lambda \in \Delta_p$ and reference measures $\{\mu_i\}_{i=1}^p$, and assume that the collection $\{\nu\} \cup \{\mu_i\}_{i=1}^p$ satisfies assumptions A4-A6. We aim to show that this implies that $\nu$ is a barycenter of the measures $\{\mu_i\}_{i=1}^p$.

Let $\tilde\mu_i$ be the measure with density
$$
\frac{d\tilde\mu_i}{dx}(x) = \begin{cases} \frac{d\mu_i}{dx}(x) & x \in \Omega^\circ, \\ 0 & \text{otherwise}, \end{cases}
$$
where $\Omega^\circ$ is the interior of $\Omega$. Define $\tilde\nu$ similarly.

We first remark that $1 = \mu_i[\Omega] = \mu_i[\Omega^\circ] + \mu_i[\partial\Omega] = \mu_i[\Omega^\circ] + 0 = \mu_i[\Omega^\circ]$, and therefore $\tilde\mu_i$ (and analogously $\tilde\nu$) are valid probability measures (note that we have used $\mu_i[\partial\Omega] = 0$, which follows from $\mu_i$ being absolutely continuous and $\partial\Omega$ having Lebesgue measure $0$). It follows from A4 that the collection $\{\tilde\nu\} \cup \{\tilde\mu_i\}_{i=1}^p$ satisfies A2 and A3. Furthermore, the set $\Omega^\circ$ is open and convex and therefore A1 is also satisfied. Therefore, if $\tilde\nu$ is a Karcher mean for $\{\tilde\mu_i\}$ then it is a barycenter for them as well.

We next show that $\tilde\nu$ is a Karcher mean of $\{\tilde\mu_i\}_{i=1}^p$ with coordinates $\lambda$. Consider the maps $\tilde T_i : \Omega^\circ \to \Omega^\circ$ defined by $\tilde T_i(x) = T_i(x)$ for $x \in \Omega^\circ$ and left undefined outside $\Omega^\circ$, where $T_i$ is the optimal transport map from $\nu$ to $\mu_i$. We first require that $\tilde T_i$ is in fact optimal for $\tilde\nu$ and $\tilde\mu_i$ and that it is well-defined $\tilde\nu$-a.e. To show the latter, note that $T_i$ is a map from $\nu$ to $\mu_i$ and therefore $\nu[T_i^{-1}(\partial\Omega)] = \mu_i[\partial\Omega] = 0$, which implies that $T_i(x) \in \Omega^\circ$ for $\nu$-a.e. $x$. Furthermore, $\tilde\nu[\Omega^\circ] = \nu[\Omega^\circ] = 1$ and therefore the map $\tilde T_i$ is well-defined $\tilde\nu$-a.e. To show that the map $\tilde T_i$ is optimal, one need only remark that for any measurable $B \subseteq \Omega$, $T_i$ transports mass optimally between $B$ and $T_i(B)$ (since otherwise $T_i$ would fail to be optimal). Applying this fact to $\Omega^\circ$ and using the considerations above shows that $\tilde T_i$ is optimal for transporting between $\tilde\nu$ and $\tilde\mu_i$.

Having established that the $\tilde T_i$ are optimal, we can now calculate $\|G_\lambda(\tilde\nu)\|^2_{\tilde\nu}$:
$$
\|G_\lambda(\tilde\nu)\|^2_{\tilde\nu} = \sum_{i,j=1}^{p} \lambda_i \lambda_j \int \big\langle \tilde T_i(x) - x, \tilde T_j(x) - x \big\rangle \, d\tilde\nu(x)
= \sum_{i,j=1}^{p} \lambda_i \lambda_j \int_{\Omega^\circ} \big\langle T_i(x) - x, T_j(x) - x \big\rangle \, d\nu(x)
$$
$$
= \sum_{i,j=1}^{p} \lambda_i \lambda_j \left( \int_{\Omega^\circ} \big\langle T_i(x) - x, T_j(x) - x \big\rangle \, d\nu(x) + \int_{\partial\Omega} \big\langle T_i(x) - x, T_j(x) - x \big\rangle \, d\nu(x) \right)
= \sum_{i,j=1}^{p} \lambda_i \lambda_j \int_{\Omega} \big\langle T_i(x) - x, T_j(x) - x \big\rangle \, d\nu(x)
= \|G_\lambda(\nu)\|^2_{\nu} = 0,
$$
where the boundary integral vanishes because $\nu[\partial\Omega] = 0$. This shows that $\tilde\nu$ is a Karcher mean for $\{\tilde\mu_i\}$, and therefore also a barycenter of these measures.

To conclude, we only require the fact that if two measures $\xi, \tilde\xi \in \mathcal{P}_{2,ac}(\mathbb{R}^d)$ differ only on a set of measure zero, then for any $\gamma \in \mathcal{P}_{2,ac}(\mathbb{R}^d)$ we have $W_2(\xi, \gamma) = W_2(\tilde\xi, \gamma)$. Applying this fact repeatedly, we have
$$
\min_{\xi} \frac{1}{2} \sum_{i=1}^{p} \lambda_i W_2^2(\xi, \mu_i) = \min_{\xi} \frac{1}{2} \sum_{i=1}^{p} \lambda_i W_2^2(\xi, \tilde\mu_i) = \frac{1}{2} \sum_{i=1}^{p} \lambda_i W_2^2(\tilde\nu, \tilde\mu_i) = \frac{1}{2} \sum_{i=1}^{p} \lambda_i W_2^2(\tilde\nu, \mu_i) = \frac{1}{2} \sum_{i=1}^{p} \lambda_i W_2^2(\nu, \mu_i),
$$
which shows that $\nu$ is indeed a barycenter of the $\mu_i$, as desired.

3. Covariance Estimation Experiment Details

3.1. Methods for Maximum Likelihood Estimation

In order to solve the maximum likelihood estimation problem we differentiate through a truncated version of Algorithm 1 of (Chewi et al., 2020) and perform projected gradient descent. This is implemented using the auto-differentiation library PyTorch. The procedure is summarized in Algorithm 3.

Algorithm 3 MLE
Input: $\{S_i\}_{i=0}^p$, $\eta > 0$, MaxIters $> 0$, FPIters $> 0$, SQIters $> 0$
  $i \leftarrow 0$, $\lambda \leftarrow (1/p)\mathbf{1}$
  while Not Converged and $i <$ MaxIters do
    $i \leftarrow i + 1$
    $\nabla L(\lambda) \leftarrow$ BackPropLoss($\{S_j\}_{j=0}^p$, $\lambda$, FPIters, SQIters)
    $\lambda \leftarrow$ SimplexProject($\lambda - \eta \nabla L(\lambda)$)
  end while
  Return $\lambda$

Table 3: Parameters used when performing the maximum likelihood estimation.
Parameter | Value
$\eta$ | 0.0003
MaxIters | 500
FPIters | 10
SQIters | 10

Here SimplexProject($x$) is defined as
$$
\text{SimplexProject}(x) = \operatorname*{arg\,min}_{y \in \Delta_p} \|x - y\|_2,
$$
which enforces the constraints on $\lambda$ at each iteration.
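The paper does not spell out how SimplexProject is computed. Below is a minimal sketch of the standard sorting-based Euclidean projection onto the probability simplex, which is one common way to implement this step; it is our choice for illustration, not necessarily the authors' implementation.

```python
import numpy as np

def simplex_project(x):
    """Euclidean projection of x onto the probability simplex {y >= 0, sum(y) = 1}."""
    x = np.asarray(x, dtype=float)
    u = np.sort(x)[::-1]                                 # entries in decreasing order
    css = np.cumsum(u)
    idx = np.arange(1, x.size + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]   # largest feasible support index
    theta = (1.0 - css[rho]) / (rho + 1.0)               # common shift applied to all entries
    return np.maximum(x + theta, 0.0)
```

For example, simplex_project(np.array([0.8, 0.4, -0.1])) returns [0.7, 0.3, 0.0], a nonnegative vector summing to one.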
To compute $\nabla L(\lambda)$ we use auto-differentiation to obtain a gradient of the procedure given in Algorithm 4. In this procedure the square root of a matrix is computed using SQIters Newton-Schulz iterations. The forward pass of the loss computation is given in Algorithm 4, and the gradient is obtained by back-propagation through it.

Algorithm 4 ComputeLoss
Input: $\{S_i\}_{i=0}^p$, $\lambda$, FPIters $> 0$, SQIters $> 0$
  BC $\leftarrow S_0$
  for $j = 1, \dots,$ FPIters do
    BC_root $\leftarrow$ SquareRoot(BC, SQIters)
    BC_root_inv $\leftarrow$ Invert(BC_root)
    BC $\leftarrow$ BC_root_inv $\Big( \sum_{i=1}^{p} \lambda_i\, \text{SquareRoot}(\text{BC\_root} \, S_i \, \text{BC\_root}, \text{SQIters}) \Big)^2$ BC_root_inv
  end for
  Return $\operatorname{Tr}(\text{BC}^{-1} S_0) + \log\det(\text{BC})$

The parameters that we chose in our experiments are given in Table 3. These parameters were sufficient to ensure that both the matrix square roots and the fixed point iterations converged, and $\lambda$ always converged before the final iteration was reached.

3.2. Considerations for Instability of MLE

Due to the requirement of differentiating through a matrix inverse in Algorithm 3, it is possible for the procedure to fail to converge numerically. In our trials, this happened in less than 0.5% of all attempts. When it does happen we discard the result; these numerical failures affect neither the results of Figure 3 nor the timings of the trials.

4. MNIST Experiment Details

4.1. The MNIST Dataset

The MNIST dataset (LeCun, 1998) consists of a collection of $28 \times 28$ black and white images of hand-written digits. The brightness (or intensity) of each pixel is an integer between 0 and 255, with 0 being black and 255 being white. Each image is initially a matrix $I^m \in \mathbb{R}^{28 \times 28}_{+}$, which is normalized to sum to one, $P^m_{ij} = I^m_{ij} / \big( \sum_{i',j'} I^m_{i'j'} \big)$, and then converted into a point cloud
$$
\mu_m = \sum_{i,j=1}^{28} P^m_{ij} \, \delta_{(i,j)}.
$$

4.2. White Noise Model

In order to apply additive noise we must generate white-noise images $\zeta$. We generate $\zeta$ as a random matrix in $\mathbb{R}^{28 \times 28}$ with each entry $\zeta_{ij} \sim \mathrm{Unif}([0,1])$. This random matrix is then converted into a point cloud using the same process used to turn an image matrix into a point cloud.

4.3. Linear Recovery

The linear reconstruction method in Figure 4 is done by the following procedure.

1. Given a corrupted image (which we treat as a matrix) $\tilde\mu_0$ and a collection of references $\mu_1, \dots, \mu_p$ (also treated as matrices), estimate $\hat\lambda$ as
$$
\hat\lambda = \operatorname*{arg\,min}_{\lambda \in \Delta_p} \Big\| \tilde\mu_0 - \sum_{i=1}^{p} \lambda_i \mu_i \Big\|_F.
$$
This corresponds to projecting $\tilde\mu_0$ onto the convex hull of $\mu_1, \dots, \mu_p$ with respect to the Euclidean (equivalently, Frobenius) norm.

2. Recover $\hat\mu_0$ as $\hat\mu_0 = \sum_{i=1}^{p} \hat\lambda_i \mu_i$.

5. Document Classification Experiment Details

5.1. Datasets

Table 4: Dataset characteristics.
Dataset | Description | B.O.W. Dimension | Average Words | C
BBCSPORT | BBC sports articles labeled by sport | 13243 | 117 | 5
20NEWS | canonical news article dataset | 29671 | 72 | 20

We show the characteristics of the datasets in Table 4. The bag-of-words (B.O.W.) dimension is the number of unique words contained in each dataset. Average words is the average number of unique words in a document, and C is the number of classes in each dataset. Both the BBCSport and 20News datasets that we used are those made publicly available by (Huang et al., 2016). Both datasets consist of two components:

1. The datasets employ a word2vec (Mikolov et al., 2013) embedding to represent the words. Each word is represented as a point $w$ on the unit sphere in $\mathbb{R}^{300}$, and the semantic relationships between words are encoded in the geometry of the embeddings. Calling the set of words $\mathcal{W}$, the word2vec embedding is the matrix $W \in \mathbb{R}^{|\mathcal{W}| \times 300}$.

2. Each document $D_m$ with label $y_m$ is represented by a set of tuples $\{(c^m_i, w^m_i)\}_{i=1}^{n}$, where $n$ is the number of unique words in $D_m$ and $c^m_i$ is the number of times $w^m_i$ appears in $D_m$.

We use the following procedure to convert a document into an empirical measure (a minimal sketch follows the list).

1. Given a document $D_m = \{(c^m_i, w^m_i)\}_{i=1}^{n}$, let $p^m \in \mathbb{R}^n$ be the probability vector with entries $p^m_i = c^m_i / \big( \sum_{j} c^m_j \big)$.

2. The document empirical measure is then
$$
\mu_m = \sum_{i} p^m_i \, \delta_{w^m_i}.
$$
For convenience we filter out the points on which $p^m$ places no support.
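Here is the sketch referenced above: a small numpy routine converting a document's word counts and word2vec vectors into the weights and support of its empirical measure. The array layout and function name are our own illustration, not the paper's code.

```python
import numpy as np

def document_to_measure(counts, word_vectors):
    """Convert a document {(c_i, w_i)} into the empirical measure sum_i p_i * delta_{w_i}.

    counts       : (n,) array of counts c_i for the unique words in the document
    word_vectors : (n, 300) array of word2vec embeddings w_i for those words
    Returns the probability weights p and the corresponding support points.
    """
    counts = np.asarray(counts, dtype=float)
    keep = counts > 0                                  # drop words receiving no mass
    weights = counts[keep] / counts[keep].sum()        # p_i = c_i / sum_j c_j
    support = np.asarray(word_vectors)[keep]
    return weights, support
```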
5.2. Procedure for Making Figure 5

As discussed in the main body, for each $k \geq 2$ we select $k$ documents from each topic and 100 test documents, and then repeat 50 times and average. In both plots in Figure 5 we iterate over different choices of $k$. To reduce the computational load we reuse the test sets and reference documents within each trial across choices of $k$. For example, all of the documents used for $k = 4$ are also used for $k = 5$. This allows the Wasserstein distances and the estimates $\hat A_{ij}$ to be cached, avoiding further computation. We note that this procedure produces plots in which each point individually has the same sampling pattern as if there were no reuse; it is only across samples that there is correlation. However, this correlation is shared across the prediction methods considered, which preserves the fairness of the comparison between methods.

6. Possible Rank Deficiency of A

In general, there may exist a set $\Lambda \subseteq \Delta_p$ such that $|\Lambda| > 1$ and for all $\lambda \in \Lambda$ we have $\mu_0 = \nu_\lambda$. Indeed, suppose there are index sets $I_1, I_2$ such that $I_1 \cap I_2 = \emptyset$ and $I_1 \cup I_2 = \{1, 2, \dots, p\}$, but $\mathrm{Bary}(\{\mu_i\}_{i \in I_1}) \cap \mathrm{Bary}(\{\mu_i\}_{i \in I_2}) \neq \emptyset$. If $\mu_0 \in \mathrm{Bary}(\{\mu_i\}_{i \in I_1}) \cap \mathrm{Bary}(\{\mu_i\}_{i \in I_2})$, then it can be expressed both as $\nu_{\lambda_1} \in \mathrm{Bary}(\{\mu_i\}_{i \in I_1})$ and as $\nu_{\lambda_2} \in \mathrm{Bary}(\{\mu_i\}_{i \in I_2})$. Note that we may interpret $\lambda_1$ and $\lambda_2$ as vectors in $\Delta_p$ with supports on $I_1, I_2$, respectively. Now let $\alpha \in [0,1]$ and let $\lambda = (1-\alpha)\lambda_1 + \alpha\lambda_2$ (with the indices modified as needed). Then we have
$$
\nu_\lambda = \operatorname*{arg\,min}_{\nu \in \mathcal{P}_{2,ac}(\mathbb{R}^d)} \sum_{i=1}^{p} \frac{\lambda_i}{2} W_2^2(\nu, \mu_i)
= \operatorname*{arg\,min}_{\nu \in \mathcal{P}_{2,ac}(\mathbb{R}^d)} \left( (1-\alpha) \sum_{i=1}^{p} \frac{(\lambda_1)_i}{2} W_2^2(\nu, \mu_i) + \alpha \sum_{i=1}^{p} \frac{(\lambda_2)_i}{2} W_2^2(\nu, \mu_i) \right) = \mu_0,
$$
where the last equality follows from the fact that if $x$ minimizes both $f(x)$ and $g(x)$ then it also minimizes $(1-\alpha)f(x) + \alpha g(x)$ for $\alpha \in [0,1]$, together with the assumption that $\mu_0$ is the barycenter for both $\lambda_1$ and $\lambda_2$. This demonstrates that $\mu_0$ corresponds to $\lambda = (1-\alpha)\lambda_1 + \alpha\lambda_2$ for every $\alpha \in [0,1]$.

Below, we give a geometric characterization of the rank deficiency of $A$ and how it relates to the optimality of the quadratic objective in Theorem 1. We first recall the quadratic problem over the probability simplex: $\min_{\lambda \in \Delta_p} \lambda^\top A \lambda$. Since $A$ is a symmetric positive semidefinite matrix, a Rayleigh quotient analysis yields that the minimum possible value of the objective without the simplex constraint is zero, attained at any eigenvector corresponding to a zero eigenvalue. Given this fact, finding an exact solution to the above quadratic problem amounts to determining whether the probability simplex intersects the eigenspace of $A$ associated with the zero eigenvalue. Hereafter, we denote this eigenspace by $E_0$.

Proposition 2. If $E_0 \cap \Delta_p = \emptyset$, then there is no $\lambda \in \Delta_p$ such that $\lambda^\top A \lambda = 0$.

Proof. Any $\lambda \in \Delta_p$ realizing $\lambda^\top A \lambda = 0$ is an eigenvector corresponding to the zero eigenvalue, and hence lies in $E_0$. The result follows.

Proposition 2 shows in particular that exact minimization is only possible when $\mathrm{rank}(A) < p$. Next, we state conditions under which there is a unique solution to the minimization program.

Proposition 3. If $\mathrm{rank}(A) = p - 1$ and $E_0 \cap \Delta_p \neq \emptyset$, then there is a unique $\lambda \in \Delta_p$ such that $\lambda^\top A \lambda = 0$.

Proof. If $\mathrm{rank}(A) = p - 1$, then $E_0 = \{\alpha v : \alpha \in \mathbb{R}\}$ for an eigenvector $v \neq 0$ corresponding to the zero eigenvalue. Let $\lambda \in E_0 \cap \Delta_p$. Then $\lambda = \alpha v$ for some $\alpha \in \mathbb{R}$. The constraint $\lambda \in \Delta_p$ determines $\alpha$ uniquely, which gives the result.

Note that when $\mathrm{rank}(A) = p - 1$, the condition that $E_0$ non-trivially intersects the probability simplex is equivalent to the statement that the origin can be written as a convex combination of the columns of $A$.
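The situation in Propositions 2 and 3 is easy to probe numerically. The sketch below (our illustration, with an arbitrary tolerance) checks whether $A$ has a simple zero eigenvalue and, if so, attempts to recover the unique $\lambda \in \Delta_p$ by rescaling the null eigenvector.

```python
import numpy as np

def lambda_from_null_space(A, tol=1e-10):
    """If rank(A) = p - 1, try to recover the unique simplex vector with A @ lam = 0."""
    eigvals, eigvecs = np.linalg.eigh(A)            # A is symmetric positive semidefinite
    null_mask = eigvals < tol
    if null_mask.sum() != 1:
        return None                                 # no zero eigenvalue, or it is not simple
    v = eigvecs[:, null_mask].ravel()
    if abs(v.sum()) < tol:
        return None                                 # E_0 cannot intersect the simplex
    lam = v / v.sum()                               # rescale so the entries sum to one
    return lam if np.all(lam >= -tol) else None     # None when E_0 misses the simplex
```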
We next consider the remaining case $\mathrm{rank}(A) < p - 1$; for simplicity of exposition, we focus on the specific case $\mathrm{rank}(A) = p - 2$.

Proposition 4. Let $\mathrm{rank}(A) = p - 2$ and let $v_0$ and $v_1$ be distinct eigenvectors corresponding to the zero eigenvalue of $A$. If $v_0 \in \Delta_p$ and $v_1 \in \Delta_p$, then there are infinitely many solutions to $\lambda^\top A \lambda = 0$ with $\lambda \in \Delta_p$.

Proof. By construction, $A v_0 = 0$ and $A v_1 = 0$. Let $\alpha \in [0,1]$ and define $v_\alpha = \alpha v_0 + (1-\alpha)v_1$. Since $v_0$ and $v_1$ are in $\Delta_p$, it follows that $v_\alpha \in \Delta_p$. In addition, $A v_\alpha = \alpha A v_0 + (1-\alpha) A v_1 = 0$, so that $v_\alpha \in E_0$. Since $\alpha \in [0,1]$ was arbitrary, the result follows.

The main conclusion from the above propositions is the following: if $\mathrm{rank}(A) < p - 1$, the only way to guarantee a unique solution is to require that only one eigenvector corresponding to the zero eigenvalue intersects the probability simplex. By convex geometry, if two distinct eigenvectors corresponding to the zero eigenvalue intersect the probability simplex, then there are infinitely many solutions to the quadratic program.

7. Example of a Karcher Mean that is not a Barycenter

As mentioned in Section 2, special care must be taken to ensure that a Karcher mean is also a barycenter. In this section we give a simple example which illustrates this point. Consider $S^1 = \{x \in \mathbb{R}^2 : \|x\|_2 = 1\}$ with distance given by $\rho(x,y) = \cos^{-1}\langle x, y \rangle$. Let $z_1, z_2 \in S^1$ with $z_1 \neq z_2$ and $z_1 \neq -z_2$, and let $\lambda = (1/2, 1/2)$. Then the barycenter is given by
$$
x_\lambda = \frac{z_1 + z_2}{\|z_1 + z_2\|_2}.
$$
However, the antipodal point $-\frac{z_1 + z_2}{\|z_1 + z_2\|_2}$ is not the barycenter, but it is a Karcher mean. This is illustrated in Figure 6. To show this rigorously, one need only compute the Euclidean gradient and demonstrate that it is orthogonal to the tangent space.

Figure 6: Red: the points $z_1, z_2$. Green: the true barycenter $x_\lambda$. Blue: a Karcher mean which is not the barycenter. Gold: the tangent space at the blue point.
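A quick numerical check of this example is sketched below; the specific angles, weights, and finite-difference step are our choices. Both candidate points are stationary for the Fréchet functional, but only the true barycenter attains the smaller value.

```python
import numpy as np

def frechet_functional(theta, z_angles, weights):
    """F(x(theta)) = (1/2) sum_i lambda_i * rho(x(theta), z_i)^2 on the circle."""
    x = np.array([np.cos(theta), np.sin(theta)])
    total = 0.0
    for phi, lam in zip(z_angles, weights):
        z = np.array([np.cos(phi), np.sin(phi)])
        total += 0.5 * lam * np.arccos(np.clip(x @ z, -1.0, 1.0)) ** 2
    return total

z_angles, weights = [np.pi / 6, -np.pi / 6], [0.5, 0.5]   # z_1, z_2 and lambda = (1/2, 1/2)
h = 1e-5
for name, theta in [("barycenter", 0.0), ("antipodal Karcher mean", np.pi)]:
    grad = (frechet_functional(theta + h, z_angles, weights)
            - frechet_functional(theta - h, z_angles, weights)) / (2 * h)
    value = frechet_functional(theta, z_angles, weights)
    print(f"{name}: F = {value:.4f}, dF/dtheta ~ {grad:.2e}")  # both derivatives vanish
```

Both points have numerically vanishing tangential derivative, yet the value of F at the antipodal point is strictly larger, so it is a Karcher mean and not a barycenter.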