# Projection Robust Wasserstein Barycenters

Minhui Huang¹, Shiqian Ma², Lifeng Lai¹

¹Department of Electrical and Computer Engineering, University of California, Davis, CA, USA. ²Department of Mathematics, University of California, Davis, CA, USA. Correspondence to: Shiqian Ma.

*Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).*

## Abstract

Collecting and aggregating information from several probability measures or histograms is a fundamental task in machine learning. One popular solution method for this task is to compute the barycenter of the probability measures under the Wasserstein metric. However, approximating the Wasserstein barycenter is numerically challenging because of the curse of dimensionality. This paper proposes the projection robust Wasserstein barycenter (PRWB), which has the potential to mitigate the curse of dimensionality, and a relaxed PRWB (RPRWB) model that is computationally more tractable. By combining the iterative Bregman projection algorithm with Riemannian optimization, we propose two algorithms for computing the RPRWB, which is a max-min problem over the Stiefel manifold. The complexity of arithmetic operations of the proposed algorithms for obtaining an ϵ-stationary solution is analyzed. We incorporate the RPRWB into a discrete distribution clustering algorithm, and numerical results on real text datasets confirm that the RPRWB model improves clustering performance significantly.

## 1. Introduction

The Wasserstein barycenter (WB) problem has attracted a lot of interest recently due to its wide applications in statistics and machine learning, including but not limited to image processing (Rabin et al., 2011), multi-level clustering (Ho et al., 2017), and text mining (Ye et al., 2017a;b). The WB serves as a geodesic interpolation between two or more distributions and aggregates the underlying geometric structures of the input distributions under the Wasserstein metric. Therefore, the WB model provides deep insight when collecting information from probability distributions.

However, computing the WB for a set of probability distributions is notoriously hard. The hardness comes from two aspects: the representation of the measure support and the potential curse of dimensionality. In many applications, the underlying distributions are unknown and we only have data sampled from them, so we wish to estimate the WB using the sampled data only. The task therefore reduces to computing the WB of sampled discrete measures on a fixed number of support points. However, solving the free-support discrete WB is still very difficult (Borgwardt & Patterson, 2019). In this paper, we mainly consider the fixed-support WB problem.

On the other hand, computing the fixed-support WB can be challenging if the dimension of the problem is high. Recent theoretical developments have revealed that the sample complexity of approximating Wasserstein distances grows exponentially in the dimension (Dudley, 1969; Weed & Bach, 2019). For the WB problem, (Altschuler & Boix-Adsera, 2021) proved that computing the WB is NP-hard, with a runtime that scales exponentially in the dimension. However, the sample complexity of the original WB is still not well understood. Since the WB model minimizes a sum of Wasserstein distances, we conjecture that the WB problem also suffers from the curse of dimensionality.
To overcome this difficulty, we adapt a technique used for computing the Wasserstein distance (Paty & Cuturi, 2019) to the WB problem. The resulting projection robust WB (PRWB) model is an inf-sup-inf problem, which is computationally intractable, so we further propose a relaxation of the PRWB that is computationally more tractable. The idea of the new technique is to project the sampled data onto a common low-dimensional subspace and compute the WB of the projected data as an approximation to the original WB. The resulting problem is a max-min problem with a Stiefel manifold constraint, and we propose two algorithms that can find an ϵ-stationary point of it efficiently.

**Related work:** Most existing works on the fixed-support WB focus on designing efficient algorithms. (Cuturi & Doucet, 2014) proposed to add an entropy regularizer and solve the resulting smooth dual problem. This idea was further studied by (Benamou et al., 2015) under the name of the iterative Bregman projection (IBP) algorithm; the convergence behavior of IBP was studied in (Kroshnin et al., 2019). Other algorithms for computing the fixed-support WB include the dual accelerated gradient descent method (Kroshnin et al., 2019; Lin et al., 2020b), the stochastic gradient descent method (Claici et al., 2018), the Bregman ADMM method (Ye et al., 2017b), and the interior-point method (Ge et al., 2019). On the other hand, efficient ways have been proposed to mitigate the curse of dimensionality of the optimal transport (OT) problem (Niles-Weed & Rigollet, 2019). Specifically, (Bonneel et al., 2015) proposed the sliced Wasserstein distance and applied it to the WB problem. Sliced OT projects the sampled data onto a random line and reduces the problem to a one-dimensional OT, which can be solved very efficiently by sorting (a sketch of this reduction is given at the end of this section). This idea motivated the works of (Paty & Cuturi, 2019; Niles-Weed & Rigollet, 2019), which suggest projecting the data onto a low-dimensional subspace. This leads to the projection robust Wasserstein (PRW) distance, and algorithms for computing it include (Lin et al., 2020a; Huang et al., 2020).

**Contributions:** Our main contributions are as follows.

(i) We propose a projection robust Wasserstein barycenter (PRWB) model. The PRWB model has the potential to mitigate the curse of dimensionality by projecting the probability measures onto a low-dimensional subspace. Since the PRWB is still numerically challenging to solve, we further propose a relaxation of the PRWB (RPRWB) that is more tractable. Our numerical results indicate that the RPRWB is more robust to noise than the WB.

(ii) We propose two algorithms to compute the RPRWB: Riemannian block coordinate descent (RBCD) and Riemannian gradient ascent with the IBP algorithm (RGA-IBP). The RGA-IBP incorporates the IBP algorithm into a Riemannian gradient ascent algorithm, and the RBCD is based on a reformulation of the max-min problem that is suitable for BCD-type algorithms. The complexities of arithmetic operations of both algorithms for obtaining an ϵ-stationary point are analyzed.

(iii) We conduct extensive numerical experiments to show the robustness and practicality of the RPRWB model. We incorporate the RPRWB into the discrete distribution (D2) clustering algorithm, which we call projected D2 clustering. We test this new algorithm on real text datasets, and the numerical results show that projected D2 clustering achieves better performance than D2 clustering.
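For intuition on the sliced approach mentioned in the related work above, the following is a minimal sketch (our own illustration, not code from the paper) of the one-dimensional reduction: for two empirical measures with $n$ equal-weight atoms, the optimal 1-D coupling matches sorted projections, so each slice costs only a sort.

```python
import numpy as np

def sliced_w2_squared(X, Y, n_proj=50, seed=0):
    """Monte-Carlo estimate of the squared sliced 2-Wasserstein distance
    between two empirical measures with n equal-weight atoms each.
    X, Y: (n, d) arrays of support points."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    total = 0.0
    for _ in range(n_proj):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)   # random direction on the unit sphere
        # In 1-D the optimal coupling matches the i-th smallest projected
        # point of X to the i-th smallest projected point of Y.
        total += np.mean((np.sort(X @ theta) - np.sort(Y @ theta)) ** 2)
    return total / n_proj
```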
## 2. Optimal Transport and Wasserstein Barycenter

In this section, we review some background on optimal transport and the Wasserstein barycenter. Denote by $\mathcal{P}(\mathbb{R}^d)$ the set of Borel probability measures on $\mathbb{R}^d$ and by $\mathcal{P}_2(\mathbb{R}^d)$ the subset of $\mathcal{P}(\mathbb{R}^d)$ whose elements have finite second moment. The 2-Wasserstein distance between probability measures $\mu, \nu \in \mathcal{P}_2(\mathbb{R}^d)$ is defined as

$$W(\mu, \nu) := \left( \inf_{\pi \in \Pi(\mu,\nu)} \int \|x - y\|^2 \, d\pi(x, y) \right)^{1/2}, \qquad (1)$$

where $\Pi(\mu,\nu)$ is the set of all joint distributions with marginals $\mu$ and $\nu$. We denote by $\Delta_q = \{u \in \mathbb{R}^q_+ \mid u^\top \mathbf{1}_q = 1\}$ the probability simplex in $\mathbb{R}^q$. The WB of $m$ probability measures $\boldsymbol{\mu} := \{\mu_l\}_{l \in [m]}$ is the solution of the problem

$$\inf_{\nu \in \mathcal{P}_2(\mathbb{R}^d)} \mathrm{WB}(\boldsymbol{\mu}, \omega) := \sum_{l=1}^m \omega_l W^2(\mu_l, \nu), \qquad (2)$$

where $\omega \in \Delta_m$ is a given weight vector and $[m] := \{1, \ldots, m\}$. We use $\mathrm{Proj}_E$ to denote the orthogonal projector onto $E$ for any $E \in \mathcal{G}_k$, where the Grassmannian $\mathcal{G}_k := \{E \subseteq \mathbb{R}^d \mid \dim(E) = k\}$ is the set of all $k$-dimensional subspaces of $\mathbb{R}^d$. For the Wasserstein distance, (Paty & Cuturi, 2019) proposed the projection robust Wasserstein distance

$$\mathcal{P}_k(\mu, \nu) := \sup_{E \in \mathcal{G}_k} W(\mathrm{Proj}_E \mu, \mathrm{Proj}_E \nu). \qquad (3)$$

That is, the probability measures $\mu$ and $\nu$ are projected onto a $k$-dimensional subspace $E$, and the Wasserstein distance between the projected measures is computed as an approximation to the original Wasserstein distance. Moreover, to measure the worst-case approximation, the subspace $E$ that maximizes this Wasserstein distance is sought. The study in (Niles-Weed & Rigollet, 2019) shows that the projection robust Wasserstein distance improves the sample complexity from $O(n^{-1/d})$ for the Wasserstein distance to $O(n^{-1/k})$, where $n$ denotes the number of sampled data points. This is a significant improvement since usually $k \ll d$ for high-dimensional OT. Therefore, the projection robust Wasserstein distance can mitigate the curse of dimensionality.

## 3. Projection Robust Wasserstein Barycenter

Our projection robust Wasserstein barycenter is motivated by the success of the projection robust Wasserstein distance and the sliced Wasserstein barycenter proposed in (Bonneel et al., 2015). By replacing the Wasserstein distance in (2) with the PRW distance (3), the fixed-support PRWB is defined as the solution of the problem

$$\inf_{\nu \in \mathcal{P}_2(\mathbb{R}^d)} \sum_{l=1}^m \omega_l \mathcal{P}_k^2(\mu_l, \nu). \qquad (4)$$

Plugging (3) into (4), we have

$$\inf_{\nu \in \mathcal{P}_2(\mathbb{R}^d)} \sum_{l=1}^m \omega_l \sup_{E_l \in \mathcal{G}_k} W^2(\mathrm{Proj}_{E_l}\mu_l, \mathrm{Proj}_{E_l}\nu) = \inf_{\nu \in \mathcal{P}_2(\mathbb{R}^d)} \sum_{l=1}^m \omega_l \sup_{U_l \in \mathrm{St}(d,k)} \inf_{\pi_l \in \Pi(\mu_l,\nu)} \int \|U_l^\top (x_l - y)\|^2 \, d\pi_l(x_l, y). \qquad (5)$$

According to (Paty & Cuturi, 2019, Proposition 1), the PRW is a well-defined distance over $\mathcal{P}_2(\mathbb{R}^d)$ and can be formulated as a sup-inf problem. Moreover, the support of the barycenter is fixed, and our target barycenter $\nu$ lies on a probability simplex. Our PRWB formulation (5) is an inf-sup-inf problem over $m$ Stiefel manifolds. Solving (5) directly is extremely difficult because of the complex inf-sup-inf structure and the presence of $m$ Stiefel manifold constraints. Therefore, we propose the following relaxation of the PRWB (5) that is computationally more tractable:

$$\sup_{E \in \mathcal{G}_k} \inf_{\nu \in \mathcal{P}_2(\mathbb{R}^d)} \sum_{l=1}^m \omega_l W^2(\mathrm{Proj}_E \mu_l, \mathrm{Proj}_E \nu) = \sup_{U \in \mathrm{St}(d,k)} \inf_{\nu \in \mathcal{P}_2(\mathbb{R}^d)} \sum_{l=1}^m \omega_l \inf_{\pi_l \in \Pi(\mu_l,\nu)} \int \|U^\top (x_l - y)\|^2 \, d\pi_l(x_l, y). \qquad (6)$$

More specifically, we first use a common projector $\mathrm{Proj}_E(\cdot)$ for all PRW distances, and then we switch the order of the sup and the first inf. The relaxed model (6) searches for a single common low-dimensional subspace, in place of the $m$ separate subspaces of the PRW distances, that maximizes the barycenter objective. Roughly speaking, we solve an easier problem in a low-dimensional subspace to approximate the original WB problem.
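To make the projected objective in (6) concrete, the sketch below evaluates a single term $W^2(\mathrm{Proj}_E\mu_l, \mathrm{Proj}_E\nu)$ for a fixed $U$ via entropic regularization and plain Sinkhorn scaling. This is only our illustration of the projected cost $\|U^\top(x_i - y_j)\|^2$; the paper's own solvers (IBP, RBCD) handle all $m$ terms and the shared barycenter marginal jointly, and all function names here are ours.

```python
import numpy as np

def projected_entropic_w2(X, Y, p, q, U, eta=0.1, n_iter=500):
    """Entropy-regularized squared Wasserstein distance between the
    projections of two discrete measures onto range(U).
    X, Y: (d, n) supports; p, q: (n,) weights; U: (d, k), orthonormal columns.
    For illustration only; log-domain stabilization is omitted."""
    XU, YU = U.T @ X, U.T @ Y                                  # project supports into R^k
    C = ((XU[:, :, None] - YU[:, None, :]) ** 2).sum(axis=0)   # C_ij = ||U^T (x_i - y_j)||^2
    K = np.exp(-C / eta)
    u = np.ones_like(p)
    for _ in range(n_iter):                                    # standard Sinkhorn scaling
        v = q / (K.T @ u)
        u = p / (K @ v)
    pi = u[:, None] * K * v[None, :]                           # transport plan
    return (pi * C).sum()
```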
We call (6) the relaxed PRWB (RPRWB) and focus on solving this relaxed version in the rest of the paper. We first study some properties of the RPRWB. The following proposition shows the existence of an optimal subspace $E^*$.

**Proposition 1** Given a probability measure set $\boldsymbol{\mu}$, the support of the barycenter $\nu$, the weight vector $\omega$, and $k \in [d]$, there exists an optimal $E^*$ for problem (6).

Notice that the target barycenter $\nu$ lies on a probability simplex. Combined with Proposition 1, this indicates that the fixed-support RPRWB problem can be written as a max-min problem. Using $U \in \mathrm{St}(d, k)$ to denote an orthonormal basis of $E$, the RPRWB can be formulated as

$$\max_{U \in \mathrm{St}(d,k)} \min_{\pi_l \in \Pi(\mu_l,\nu)} \sum_{l=1}^m \omega_l \int \|U^\top (x_l - y)\|^2 \, d\pi_l(x_l, y), \qquad (7)$$

where $\mathrm{St}(d, k)$ denotes the Stiefel manifold, $x_l$ is the support of $\mu_l$, and $y$ is the support of $\nu$.

**Remark 2** We remark here that analyzing the sample complexity of the PRWB is highly nontrivial, and the analysis in (Niles-Weed & Rigollet, 2019) for the PRW does not apply here. In fact, we are not aware of any results on the sample complexity of the empirical discrete WB problem. There are only computational hardness results (Altschuler & Boix-Adsera, 2021) showing that the WB is NP-hard to compute because of the curse of dimensionality. Since the WB problem minimizes the sum of a set of Wasserstein distances, we conjecture that the curse of dimensionality is inherited by the WB. Deriving the sample complexity of the WB and PRWB is an important future topic.

In this paper, we consider solving the WB for a set of discrete distributions. Specifically, we denote by $X^l = [x^l_1, \ldots, x^l_n] \in \mathbb{R}^{d \times n}$ the support of each $\mu_l$ and write $\mu_l = \sum_{i=1}^n p^l_i \delta_{x^l_i}$, where $p^l \in \Delta_n$ and $\delta_x$ denotes the Dirac function at $x$. The support of the barycenter $\nu$ is given and denoted by $Y = [y_1, \ldots, y_n] \in \mathbb{R}^{d \times n}$, so the barycenter can be written as $\nu = \sum_{j=1}^n q_j \delta_{y_j}$ with $q \in \Delta_n$. Denote $\pi = \{\pi_l\}_{l\in[m]}$. Throughout this paper, we denote $\mathcal{M} = \mathrm{St}(d, k)$. Computing the fixed-support RPRWB is equivalent to solving

$$\max_{U \in \mathcal{M}} \min_{q \in \Delta_n} \sum_{l=1}^m \omega_l W^2(\mathrm{Proj}_E \mu_l, \mathrm{Proj}_E \nu) = \max_{U \in \mathcal{M}} \min_{\pi \in \Pi(p)} f(\pi, U), \qquad (8)$$

where $f(\pi, U) := \sum_{l=1}^m \omega_l \sum_{i,j=1}^n \pi^l_{ij} \|U^\top (x^l_i - y_j)\|^2$ and $\Pi(p) = \{\pi \mid \pi^l \in \mathbb{R}^{n \times n}_+,\ \pi^l \mathbf{1} = p^l,\ (\pi^l)^\top \mathbf{1} = (\pi^{l+1})^\top \mathbf{1},\ l \in [m-1]\}$.

## 4. The Riemannian Gradient Ascent and Riemannian BCD Algorithms

In this section, we propose two algorithms for solving (8): RGA-IBP and RBCD. We can show that both algorithms find an ϵ-stationary point of (8), defined as follows.

**Definition 3** We call $(\hat{\pi}, \hat{U}) \in \Pi(p) \times \mathcal{M}$ an ϵ-stationary point of the fixed-support RPRWB problem (8) if the following two inequalities hold:

$$\|\mathrm{grad}_U f(\hat{\pi}, \hat{U})\|_F \le \epsilon, \qquad (9)$$

$$f(\hat{\pi}, \hat{U}) - f(\pi^*(\hat{U}), \hat{U}) \le \epsilon, \qquad (10)$$

where $\mathrm{grad}_U f(\hat{\pi}, \hat{U})$ is the Riemannian gradient with respect to $U$, and $\pi^*(\hat{U})$ is the optimal solution of the inner minimization problem of (8) with $U$ fixed at $\hat{U}$. The corresponding ϵ-approximate barycenter $q \in \Delta_n$ can be computed as $q = (\hat{\pi}^l)^\top \mathbf{1}$ for any $l \in [m]$.

Before we present the algorithms, we define some useful notation first.

**Algorithm 1** The RGA-IBP Algorithm
1: **Input:** $\{\mu_l = (X^l, p^l)\}_{l\in[m]}$, $Y$, accuracy tolerance $\epsilon > 0$. Set parameters $\tau$, $\eta$, and $\rho$ as in (24).
2: **Initialization:** $U_0 \in \mathrm{St}(d, k)$.
3: **for** $t = 0, 1, 2, \ldots$ **do**
4: &nbsp;&nbsp; $\pi_{t+1} = \mathrm{IBPsolver}(\boldsymbol{\mu}, Y, U_t, \eta, \epsilon)$;
5: &nbsp;&nbsp; $\xi_{t+1} = \mathrm{Proj}_{T_{U_t}\mathcal{M}}(2 V_{\pi_{t+1}} U_t)$;
6: &nbsp;&nbsp; $U_{t+1} = \mathrm{Retr}_{U_t}(\tau \xi_{t+1})$;
7: &nbsp;&nbsp; **if** $\|\xi_{t+1}\|_F \le \epsilon$ **then break**;
8: **end for**
9: **Output:** $\hat{U} = U_t$, $\hat{\pi} = \pi_{t+1}$.
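Algorithm 1 moves along the manifold via the tangent-space projection $\mathrm{Proj}_{T_U\mathcal{M}}$ and a retraction $\mathrm{Retr}_U$. As a point of reference (the paper does not commit to a specific retraction here), below is a minimal sketch of one standard choice of both operations on the Stiefel manifold: the Euclidean-metric tangent projection and the QR retraction (see, e.g., Absil et al., 2009).

```python
import numpy as np

def proj_tangent(U, G):
    """Project G onto the tangent space of St(d, k) at U under the
    Euclidean metric: G - U * sym(U^T G)."""
    UtG = U.T @ G
    return G - U @ ((UtG + UtG.T) / 2)

def retract_qr(U, xi):
    """QR retraction: the Q factor of U + xi, with column signs fixed
    so the R factor has a nonnegative diagonal."""
    Q, R = np.linalg.qr(U + xi)
    signs = np.sign(np.sign(np.diag(R)) + 0.5)   # maps 0 to +1
    return Q * signs
```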
**Definition 4 (Cost and Correlation Matrices)** Given the support points $\{X^l\}_{l\in[m]}$ and $Y$, the cost matrices, denoted $\{C^l\}_{l\in[m]}$, are defined as $C^l_{ij} = \|x^l_i - y_j\|^2$, $l \in [m]$. The correlation matrix, denoted $V_\pi$, is defined as

$$V_\pi = \sum_{l=1}^m \omega_l \sum_{i,j=1}^n \pi^l_{ij} (x^l_i - y_j)(x^l_i - y_j)^\top \in \mathbb{R}^{d \times d}.$$

### 4.1. The Riemannian Gradient Ascent with IBP Iterations

The RGA-IBP algorithm is a natural extension of the RGAS algorithm (Riemannian gradient ascent with Sinkhorn's iteration) that was proposed by (Lin et al., 2020a) for computing the projection robust Wasserstein distance. Here we extend it to solve the RPRWB problem (8). The RGA-IBP algorithm solves the following problem, obtained by adding entropy regularization to (8):

$$\max_{U \in \mathcal{M}} \min_{\pi \in \Pi(p)} f_\eta(\pi, U) := \sum_{l=1}^m \omega_l \left( \sum_{i,j=1}^n \pi^l_{ij} \|U^\top (x^l_i - y_j)\|^2 + \eta H(\pi^l) \right), \qquad (11)$$

where $H(\pi) := \sum_{i,j=1}^n (\pi_{ij} \log \pi_{ij} - \pi_{ij})$ is the entropic regularization term and $\eta > 0$ is a weighting parameter. By further defining

$$f_\eta(U) := \min_{\pi \in \Pi(p)} f_\eta(\pi, U), \qquad (12)$$

we know that (11) is equivalent to the following Riemannian optimization problem with smooth objective $f_\eta(U)$:

$$\max_{U \in \mathcal{M}} f_\eta(U). \qquad (13)$$

Problem (13) can be naturally solved by a Riemannian gradient ascent algorithm whose $t$-th iteration is $U_{t+1} := \mathrm{Retr}_{U_t}(\tau\, \mathrm{grad} f_\eta(U_t))$, where $\mathrm{Retr}$ denotes the retraction operation, $\mathrm{grad} f_\eta$ denotes the Riemannian gradient of $f_\eta$, and $\tau > 0$ is a step size. Moreover, it is easy to verify that

$$\mathrm{grad} f_\eta(U) = \mathrm{Proj}_{T_U \mathcal{M}}\big( 2 V_{\pi^*_\eta(U)} U \big), \qquad (14)$$

where $T_U \mathcal{M}$ denotes the tangent space of $\mathcal{M}$ at $U$, and $\pi^*_\eta(U)$ is the optimal solution of (12), which can be found by the IBP algorithm. The RGA-IBP algorithm is detailed in Algorithm 1, where the IBP solver solves (12) up to an accuracy $\epsilon$ (see Algorithm 4 in the supplementary material).

### 4.2. The Riemannian Block Coordinate Descent Algorithm

Notice that the RGA-IBP needs to solve an optimization problem (12) in each iteration using an iterative solver, which can be quite expensive in practice. In this section, we propose the RBCD algorithm, which alleviates this computational burden. The RBCD algorithm presented here can be regarded as an extension of the algorithm recently proposed in (Huang et al., 2020) for computing the projection robust Wasserstein distance. First, note that the optimization problem in (12) is convex, and we have the following result about its dual.

**Lemma 5** The dual problem of (12) is equivalent to

$$\max_{u, v \in \mathbb{R}^{m \times n},\ \sum_{l=1}^m \omega_l v^l = 0} \ \sum_{l=1}^m \omega_l \left( \langle u^l, p^l \rangle - \sum_{i,j=1}^n \pi^l_{ij} \right), \qquad (15)$$

where $\pi^l_{ij} = [\pi(u^l, v^l, U)]_{ij}$ is given by

$$\pi(u^l, v^l, U)_{ij} = \exp\left( -\frac{\|U^\top (x^l_i - y_j)\|^2}{\eta} + u^l_i + v^l_j \right). \qquad (16)$$

As a result, we know that (11) is equivalent to

$$\min_{u, v \in \mathbb{R}^{m \times n},\ U \in \mathcal{M},\ \sum_{l=1}^m \omega_l v^l = 0} g(u, v, U) := \sum_{l=1}^m \omega_l \left( \sum_{i,j=1}^n \pi^l_{ij} - \langle u^l, p^l \rangle \right). \qquad (17)$$

Note that (17) has three block variables, so it is suitable for a block coordinate descent method. Our RBCD for solving (17) updates the iterates as follows:

$$u_{t+1} := \operatorname*{argmin}_u g(u, v_t, U_t), \qquad (18)$$

$$v_{t+1} := \operatorname*{argmin}_{v:\ \sum_{l=1}^m \omega_l v^l = 0} g(u_{t+1}, v, U_t), \qquad (19)$$

$$U_{t+1} := \mathrm{Retr}_{U_t}\big( -\tau\, \mathrm{grad}_U g(u_{t+1}, v_{t+1}, U_t) \big). \qquad (20)$$

**Algorithm 2** The RBCD Algorithm
1: **Input:** $\{\mu_l = (X^l, p^l)\}_{l\in[m]}$, $Y$, accuracy tolerance $\epsilon > 0$. Set parameters $\tau$, $\eta$, and $\rho$ as in (27).
2: **Initialization:** $U_0 \in \mathrm{St}(d, k)$, $u_0, v_0 \in \mathbb{R}^{m \times n}$.
3: **for** $t = 0, 1, 2, \ldots$ **do**
4: &nbsp;&nbsp; Compute $u_{t+1}, v_{t+1}$ by (21)-(22);
5: &nbsp;&nbsp; Compute $U_{t+1}$ by (20);
6: &nbsp;&nbsp; **if** $\sum_{l=1}^m \omega_l \|q^l_t - q_t\|_1 \le \omega_{\min}^{3/2} \epsilon / (12 \bar{c})$ and $\eta \|\mathrm{grad}_U g(u_{t+1}, v_{t+1}, U_t)\|_F \le \epsilon/3$ **then break**;
7: **end for**
8: **Output:** $q = q_t = \sum_{l=1}^m \omega_l q^l_t$, $\hat{U} = U_t$, and $\hat{\pi}^l = \mathrm{Round}(\pi^l(u^l_{t+1}, v^l_t, U_t), p^l, q)$, $l \in [m]$.
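Both algorithms repeatedly assemble the correlation matrix $V_\pi$ of Definition 4 (step 5 of Algorithm 1 and the Riemannian gradients in (14) and (20)). Summing $n^2$ rank-one terms per measure directly costs $O(mn^2d^2)$; expanding the outer products lets $V_\pi$ be formed with a few matrix products. A minimal sketch (our own illustration, not the authors' implementation; names are ours):

```python
import numpy as np

def correlation_matrix(Xs, Y, pis, omega):
    """V_pi = sum_l omega_l sum_{ij} pi^l_{ij} (x^l_i - y_j)(x^l_i - y_j)^T,
    assembled via the expansion
      X diag(pi 1) X^T - X pi Y^T - (X pi Y^T)^T + Y diag(pi^T 1) Y^T.
    Xs: list of m (d, n) supports; Y: (d, n); pis: list of m (n, n) plans;
    omega: length-m weights."""
    d = Y.shape[0]
    V = np.zeros((d, d))
    for X, pi, w in zip(Xs, pis, omega):
        r = pi.sum(axis=1)             # row marginals, pi 1
        c = pi.sum(axis=0)             # column marginals, pi^T 1
        XPY = X @ pi @ Y.T
        V += w * ((X * r) @ X.T - XPY - XPY.T + (Y * c) @ Y.T)
    return V
```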
According to (Benamou et al., 2015, Propositions 1 and 2), the $u$ and $v$ steps in (18) and (19) are the same as the steps in the IBP algorithm, and they admit the closed-form solutions

$$u^l_{t+1} = u^l_t + \log p^l - \log\big( \pi(u^l_t, v^l_t, U_t) \mathbf{1} \big), \quad l \in [m], \qquad (21)$$

$$v^l_{t+1} = v^l_t + \log q_{t+1} - \log q^l_t, \quad l \in [m], \qquad (22)$$

where $q^l_t = (\pi(u^l_{t+1}, v^l_t, U_t))^\top \mathbf{1}$, $q_{t+1} = \exp\big( \sum_{l=1}^m \omega_l \log q^l_t \big)$, and the logarithm and exponential are taken entrywise. Notice that (21)-(22) renormalize the row and column sums of each $\pi^l$ to be $p^l$ and $q_{t+1}$, which yields $\langle q^l_t, \mathbf{1} \rangle = 1$, $l \in [m]$. Moreover, the update (20) requires computing $\mathrm{grad}_U g$, and from (16) and (17) we know that

$$\mathrm{grad}_U g(u, v, U) = \mathrm{Proj}_{T_U \mathcal{M}}\Big( -\frac{2}{\eta} V_{\pi(u,v,U)} U \Big). \qquad (23)$$

Combining (18)-(23), we summarize the details of the RBCD in Algorithm 2, in which we have adopted the following notation for simplicity of presentation:

$$\bar{c} := \max_l \|C^l\|_\infty, \qquad \omega_{\min} := \min_l \omega_l.$$

Note that in Algorithm 2 we adopt a rounding procedure for the output. This is because $\pi$ computed according to (16) does not necessarily lie in the constraint set $\Pi(p)$. The rounding procedure proposed in (Altschuler et al., 2017), outlined in Algorithm 3, rounds the solution back to the set $\Pi(p)$. This rounding procedure is also adopted in the IBP algorithm and thus in the RGA-IBP algorithm.

**Algorithm 3** $\mathrm{Round}(\pi, p, q)$
1: **Input:** $\pi \in \mathbb{R}^{n \times n}$, $p \in \mathbb{R}^n$, $q \in \mathbb{R}^n$.
2: $X = \mathrm{Diag}(x)$ with $x_i = \min\{ p_i / [\pi \mathbf{1}]_i,\ 1 \}$
3: $\pi' = X \pi$
4: $Y = \mathrm{Diag}(y)$ with $y_j = \min\{ q_j / [(\pi')^\top \mathbf{1}]_j,\ 1 \}$
5: $\pi'' = \pi' Y$
6: $\mathrm{err}_p = p - \pi'' \mathbf{1}$, $\mathrm{err}_q = q - (\pi'')^\top \mathbf{1}$
7: **Output:** $\pi'' + \mathrm{err}_p\, \mathrm{err}_q^\top / \|\mathrm{err}_p\|_1$.

## 5. Convergence Analysis

In this section, we give the complexities of both the iteration number and the arithmetic operations of RGA-IBP and RBCD for obtaining an ϵ-stationary point of (8) as defined in Definition 3. The proofs are provided in the supplementary materials. The next theorem and corollary are for the RGA-IBP algorithm; here $L_1$ and $L_2$ are constants specified in the supplementary material.

**Theorem 6** Choose the parameters

$$\tau = \frac{1}{8 L_2 \bar{c} + 2 \rho L_1^2}, \qquad \eta = \frac{\epsilon}{4 \log(n) + 2}, \qquad \rho = 2\bar{c} + 4\bar{c}^2. \qquad (24)$$

Then Algorithm 1 returns an ϵ-stationary point as defined in Definition 3 in

$$T = O\big( \log(n)^2 L_1^2 \bar{c}^4 k \epsilon^{-4} \big) \qquad (25)$$

iterations.

**Corollary 7** The per-iteration arithmetic operation complexity of Algorithm 1 is $O(mn^2 dk + mn^2 \bar{c}^6 \log(n)^2 \epsilon^{-6})$. Therefore, the total arithmetic operation complexity of Algorithm 1 is

$$O\big( mn^2 dk + mn^2 \bar{c}^6 \log(n)^2 \epsilon^{-6} \big) \cdot O\big( \log(n)^2 L_1^2 \bar{c}^4 k \epsilon^{-4} \big) = O\big( L_1^2 \bar{c}^4 k \log(n)^2 \big( mn^2 dk\, \epsilon^{-4} + mn^2 \bar{c}^6 \log(n)^2 \epsilon^{-10} \big) \big). \qquad (26)$$

The next theorem and corollary are for the RBCD algorithm.

**Theorem 8** Choose the parameters

$$\tau = \frac{1}{4 L_2 \bar{c}/\eta + \rho L_1^2}, \qquad \eta = \frac{\epsilon}{4 \log(n) + 2}, \qquad \rho = 2\bar{c}. \qquad (27)$$

Then Algorithm 2 returns an ϵ-stationary point as defined in Definition 3 in

$$T = O\big( L_1^2 \bar{c}^5 \log(n)\, k^{3/2} \omega_{\min}^{-3} \epsilon^{-3} \big) \qquad (28)$$

iterations.

**Corollary 9** The per-iteration arithmetic operation complexity of Algorithm 2 is $O(mn^2 d)$. Therefore, the total arithmetic operation complexity of Algorithm 2 is

$$O\big( mn^2 \log(n)\, d\, L_1^2 \bar{c}^5 k^{3/2} \omega_{\min}^{-3} \epsilon^{-3} \big). \qquad (29)$$

**Remark 10** Comparing (26) with (29), we see that RBCD has a better complexity dependence on $\epsilon$ and $\bar{c}$. However, the complexity of RBCD carries an extra factor $\omega_{\min}^{-3}$, and $\omega_{\min}^{-1} \ge m$. Therefore, RGA-IBP has a better theoretical complexity when $m$ is large.

## 6. Numerical Experiments

In this section, we conduct numerical experiments on both synthetic and real datasets to evaluate the proposed RPRWB model (8) (code available at https://github.com/mhhuang95/PRWB). For the synthetic dataset, we consider solving the RPRWB for a set of Gaussian distributions, for which the WB has a closed-form characterization (Agueh & Carlier, 2011). We compare the convergence rate of the WB and RPRWB to the ground truth for sampled discrete distributions, as well as the robustness of the RPRWB model against noise.
For the real datasets, we incorporate the RPRWB model into the discrete distribution (D2) clustering algorithm (Ye et al., 2017b) and test it on text datasets. All experiments are conducted on a Linux server with a 32-core Intel Xeon CPU (E5-2667, v4, 3.20 GHz per core).

### 6.1. Synthetic Dataset

**Multivariate Gaussian distributions:** It is well known that the Wasserstein barycenter of a set of multivariate Gaussian distributions $\{\mu_l\}_{l\in[m]}$ with $\mu_l = N(a_l, \Sigma_l)$, where $a_l$ is the mean and $\Sigma_l$ is the covariance matrix, has a closed-form characterization (Agueh & Carlier, 2011). Specifically, we have the following theorem.

**Theorem 11 ((Agueh & Carlier, 2011))** Let $\mu_1, \ldots, \mu_m$ be Gaussian distributions with respective means $a_1, \ldots, a_m$ and covariance matrices $\Sigma_1, \ldots, \Sigma_m$. The barycenter of $\mu_1, \ldots, \mu_m$ with weights $\omega_1, \ldots, \omega_m$ is the Gaussian distribution with mean $a = \sum_{l=1}^m \omega_l a_l$ and covariance matrix $\Sigma$ defined as the only positive definite matrix satisfying the equation

$$\Sigma = \sum_{l=1}^m \omega_l \big( \Sigma^{1/2} \Sigma_l \Sigma^{1/2} \big)^{1/2}. \qquad (30)$$

In this subsection, we compute the WB and RPRWB of a given set of zero-mean multivariate Gaussian distributions $\{\mu_l\}_{l\in[m]}$, $\mu_l = N(0, \Sigma_l)$. We set $\omega_l = 1/m$, $l \in [m]$, in all experiments.

**The dependence of RPRWB on k.** We first explore the dependence of the objective function value on $k$. For each $\mu_l$, we sample an empirical measure $\mu^l_n$: we draw $n$ points from the Gaussian distribution $\mu_l = N(0, \Sigma_l)$ to form the support matrix $X^l \in \mathbb{R}^{d \times n}$ and set $p^l_i = 1/n$, $i \in [n]$. We set each covariance matrix $\Sigma_l$ to be a positive semidefinite matrix of rank $k^*$. Therefore, $X^l$ lies in a $k^*$-dimensional subspace, and the barycenter of $\{\mu^l_n\}_{l\in[m]}$ should lie in an $(m k^*)$-dimensional subspace. The support of the barycenter $Y \in \mathbb{R}^{d \times n}$ is obtained by applying k-means clustering to $X = [X^1, \ldots, X^m] \in \mathbb{R}^{d \times mn}$. We set the parameters as $d = 100$, $m = 3$, $n = 10$. We further set the step size $\tau = 0.0005$ for both the RBCD and RGA-IBP algorithms and $\eta = 0.5 \cdot \mathrm{mid}(\{C^l\}_{l\in[m]})$, where $\mathrm{mid}(\{C^l\}_{l\in[m]})$ is the median of the entries of $\{C^l\}_{l\in[m]}$.

*Figure 1. RPRWB function value versus the projection dimension k, averaged over 100 runs.*

We run both RBCD and RGA-IBP for solving (8) with different $k^*$ and $k$ and report the results in Figure 1. From Figure 1, we see that the RPRWB values computed by the two algorithms are almost identical. We also notice that the RPRWB value increases when $k < m k^*$ and remains constant when $k \ge m k^*$, which verifies that the barycenter of $\{\mu^l_n\}_{l\in[m]}$ lies in an $(m k^*)$-dimensional subspace.

**Robustness against noise.** We further compare the robustness of the WB and RPRWB against noise. Specifically, we add Gaussian noise $\sigma N(0, I)$, where $\sigma$ is the noise level, to the discrete supports $\{X^l\}_{l\in[m]}$, and compare the relative error of the objective function value of the WB and RPRWB under different noise levels $\sigma$. The relative error is defined as

$$\text{Relative Error} = \frac{\mathrm{OBJ}(\{\mu^l_n\}_\sigma) - \mathrm{OBJ}(\{\mu^l_n\}_0)}{\mathrm{OBJ}(\{\mu^l_n\}_0)},$$

where $\{\mu^l_n\}_\sigma$ denotes the distributions after adding the noise $\sigma N(0, I)$ and $\mathrm{OBJ}$ denotes the objective function of the WB (the discrete version of (2)) or the RPRWB (8). We set the parameters as $d = 100$, $m = 3$, $n = 10$, and $\sigma \in \{0.01, 0.1, 1, 2, 4, 7, 10\}$. We choose the step size $\tau = 0.001$ when $\sigma < 7$ and $\tau = 0.0005$ otherwise for both the RBCD and RGA-IBP algorithms, and $\eta = 0.5 \cdot \mathrm{mid}(\{C^l\}_{l\in[m]})$. The results are shown in Figure 2, which shows that the proposed RPRWB model is more robust to noise than the WB.

*Figure 2. Relative error of the WB and RPRWB function values under different noise levels σ, averaged over 100 runs.*
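In the next experiment, the ground-truth objective is obtained by solving the fixed-point equation (30) of Theorem 11. The paper does not specify its solver; the sketch below uses a commonly seen fixed-point scheme, under the assumption of symmetric positive definite covariance inputs:

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_barycenter_cov(Sigmas, omega, n_iter=100, tol=1e-10):
    """Iterate S <- sum_l omega_l (S^{1/2} Sigma_l S^{1/2})^{1/2} until it
    stabilizes at a solution of (30). A sketch assuming SPD inputs; np.real
    strips tiny imaginary parts that sqrtm can introduce numerically."""
    d = Sigmas[0].shape[0]
    S = np.eye(d)
    for _ in range(n_iter):
        S_half = np.real(sqrtm(S))
        S_new = sum(w * np.real(sqrtm(S_half @ Sig @ S_half))
                    for w, Sig in zip(omega, Sigmas))
        if np.linalg.norm(S_new - S) < tol:
            return S_new
        S = S_new
    return S
```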
**Convergence rate to the ground truth.** We further consider approximating the Wasserstein barycenter of a set of continuous distributions from sampled data. Note that (Niles-Weed & Rigollet, 2019) proved that, for the so-called spiked transport model, the mean projection robust Wasserstein distance between sampled empirical distributions is $O(n^{-1/k})$, which improves the corresponding rate $O(n^{-1/d})$. We conjecture that similar results hold for the WB and RPRWB and give some numerical evidence in this section. We set $d = 10$, $m = 2$, $k = 2$. The covariance matrices $\Sigma_1, \Sigma_2 \in \mathbb{R}^{d \times d}$ are diagonal matrices with $\Sigma_1(1,1) = 10.1$, $\Sigma_2(2,2) = 10.1$, and all remaining diagonal entries equal to 0.1. In this case, a 2-dimensional subspace captures most of the information about the barycenter. We then sample $n$ points as the support of each $\mu_l$. To obtain a better estimate, the probability of $x^l_i$ is computed according to the Gaussian density

$$\mathrm{Prob}(x^l_i) = \frac{1}{(2\pi)^{d/2} \det(\Sigma_l)^{1/2}}\, e^{-\frac{1}{2} (x^l_i)^\top (\Sigma_l)^{-1} x^l_i}.$$

We sample the support of the barycenter $Y$ according to a uniform distribution over $[-2, 2]^d$. The barycenter mean estimation error is defined as

$$\mathrm{MEE} = \big| \mathrm{OBJ}(\{\mu^l\}) - \mathrm{OBJ}(\{\mu^l_n\}) \big|,$$

where the ground-truth objective $\mathrm{OBJ}(\{\mu^l\})$ is calculated by solving (30) and $\mathrm{OBJ}(\{\mu^l_n\})$ is the sampled barycenter objective value of the WB or RPRWB model. We set the step size $\tau = 0.05$ for both the RBCD and RGA-IBP algorithms, $\eta = 0.5 \cdot \mathrm{mid}(\{C^l\}_{l\in[m]})$, and select $n \in \{20, 50, 100, 250, 500, 1000\}$. The results are shown in Figure 3, which shows that the proposed RPRWB model converges to the ground truth much faster than the WB.

*Figure 3. Mean estimation error (MEE) for different n, averaged over 500 runs.*

**Computational time comparison.** We compare the mean computational time of the WB solved by the IBP algorithm (Benamou et al., 2015) with that of the proposed RPRWB solved by RBCD and RGA-IBP. We set $d = 100$, $m = 3$, $k = 2$ and select $n \in \{20, 50, 100, 250, 500, 1000\}$. We generate the support matrices $X^l \in \mathbb{R}^{d \times n}$ from $\mu_l = N(0, \Sigma_l)$ by empirical sampling. The support of the barycenter $Y \in \mathbb{R}^{d \times n}$ is obtained by k-means clustering. We further set the step size $\tau = 0.01$ for both the RBCD and RGA-IBP algorithms and $\eta = 0.5 \cdot \mathrm{mid}(\{C^l\}_{l\in[m]})$. We stop the RBCD algorithm when $\eta \|\mathrm{grad}_U g(u_{t+1}, v_{t+1}, U_t)\|_F \le \epsilon$ and $\frac{1}{m} \sum_{l=1}^m \|q^l_t - q_t\|_1 \le \epsilon$, and the RGA-IBP algorithm when $\|\xi_{t+1}\|_F \le \epsilon$, with $\epsilon = 10^{-4}$. The results are shown in Figure 4, which shows that RBCD always runs faster than RGA-IBP. Note that the IBP for solving the WB runs much faster than the other two algorithms; this is because the latter two solve a more difficult problem.

*Figure 4. Computational time of the WB model solved by the IBP algorithm and the RPRWB model solved by the RBCD and RGA-IBP algorithms for different n, averaged over 100 runs.*

### 6.2. Real Dataset: Text Data

We consider the discrete distribution (D2) clustering model proposed in (Ye et al., 2017b), which requires solving the free-support discrete Wasserstein barycenter model

$$\min_{\nu, Y} \frac{1}{m} \sum_{l=1}^m W^2(\mu_l, \nu) = \min_{\pi \in \Pi(p),\ Y \in \mathbb{R}^{d \times n}} \frac{1}{m} \sum_{l=1}^m \sum_{i,j=1}^n \pi^l_{ij} \|x^l_i - y_j\|^2. \qquad (31)$$

Note that there are two block variables, $\pi$ and $Y$. (Ye et al., 2017b) proposed to solve (31) using an alternating minimization algorithm: one alternately minimizes the objective function of (31) with respect to one variable while the other is fixed, and this procedure is repeated until no progress can be made. When $Y$ is fixed, problem (31) becomes a fixed-support WB problem. When $\Pi = [\pi^1; \cdots; \pi^m]$ is fixed, we have the following closed-form solution for $Y$:

$$y_j = \frac{1}{m q_j} \sum_{l=1}^m \sum_{i=1}^n \pi^l_{ij} x^l_i, \qquad (32)$$

which can be written more compactly as $Y = \frac{1}{m} X \Pi\, \mathrm{diag}(1/q)$ with $X = [X^1, \ldots, X^m]$.
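The $Y$-update (32) is just a weighted average of the support points and is cheap to evaluate. A minimal sketch (our own illustration; the variable layout is an assumption):

```python
import numpy as np

def update_support(Xs, pis, q):
    """Closed-form Y-update (32): y_j = (1 / (m q_j)) sum_l sum_i pi^l_{ij} x^l_i.
    Xs: list of m (d, n) supports; pis: list of m (n, n) plans;
    q: (n,) barycenter weights, assumed strictly positive."""
    m = len(Xs)
    d = Xs[0].shape[0]
    Y = np.zeros((d, len(q)))
    for X, pi in zip(Xs, pis):
        Y += X @ pi                 # column j accumulates sum_i pi_{ij} x_i
    return Y / (m * q)              # divide column j by m * q_j
```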
Since we have numerically demonstrated that the RPRWB may be a better model than the WB, we propose to replace the WB problem in D2 clustering by the RPRWB. We call the resulting algorithm projection robust D2 (PD2) clustering. More details of the D2 and PD2 clustering algorithms can be found in the supplementary material.

We compare the performance of D2 and PD2 clustering on the three text datasets listed in Table 1. The Reuters Subset is a 5-class subset of the Reuters dataset (https://www.nltk.org/book/ch02.html). The BBCnews Abstract and BBCsport Abstract datasets (http://mlg.ucd.ie/datasets/bbc.html; Greene & Cunningham, 2006) are truncated versions of 2,225 and 737 posts, respectively; each document retains only the title and the first sentence of the original post.

Table 1. Text datasets. N is the number of data points, n is the number of support points, and K is the number of clusters.

| Dataset | N | d | n | K |
| --- | --- | --- | --- | --- |
| Reuters Subset | 1209 | 300 | 16 | 5 |
| BBCnews Abstract | 2225 | 300 | 16 | 5 |
| BBCsport Abstract | 737 | 300 | 16 | 5 |

**Preprocessing.** We follow the idea of treating each document as a bag of word vectors. For all three datasets in Table 1, we use the pre-trained word-vector dataset GloVe (Pennington et al., 2014) to transform a list of words into a measure over $\mathbb{R}^{300}$. The weight of each word is its normalized frequency modified by the TF-IDF scheme. We use GloVe 300d (word vectors in $\mathbb{R}^{300}$), which was trained on 6 billion tokens and contains a 400,000-word lowercase vocabulary. Before transforming words into vectors, we lowercase all capital letters, remove all punctuation and stop words, and lemmatize each document. Finally, we restrict the number of support points to $n$ by recursively merging the closest words. Specifically, when the number of distinct words in $\mu_l$ is larger than $n$, we solve the discrete optimization problem

$$\min_{i,j}\ p^l_i p^l_j \|x^l_i - x^l_j\|^2 / (p^l_i + p^l_j), \qquad (33)$$

and merge $p^l_i, p^l_j$ as $\bar{p}^l = p^l_i + p^l_j$, $\bar{x} = (p^l_i x^l_i + p^l_j x^l_j) / \bar{p}^l$.

**Parameter setting and initialization.** In each iteration of PD2 clustering, we run the RBCD algorithm with step size $\tau = 0.05$ and regularization parameter $\eta = 1$. We choose $k = 2$ for the BBCsport Abstract dataset and $k = 3$ for the Reuters Subset and BBCnews Abstract datasets. The initial $K$ barycenters are chosen randomly from documents with more than $n$ distinct words and recursively merged so that the number of support points remains $n$.

**The Adjusted Mutual Information.** To measure the performance of the clustering results, we use the Adjusted Mutual Information (AMI) (Vinh et al., 2010). Denote by $P_V(i) = |V_i|/N$ the probability of cluster $i$ in a partition $V$. The entropy $H(V)$ is defined as $H(V) = -\sum_{i=1}^{S_V} P_V(i) \log P_V(i)$, where $S_V$ is the number of clusters in $V$. The mutual information between two partitions $V_1, V_2$ is defined as

$$\mathrm{MI}(V_1, V_2) = \sum_{i=1}^{S_{V_1}} \sum_{j=1}^{S_{V_2}} P_{V_1,V_2}(i,j) \log \frac{P_{V_1,V_2}(i,j)}{P_{V_1}(i) P_{V_2}(j)},$$

where $P_{V_1,V_2}(i,j) = |V_{1,i} \cap V_{2,j}|/N$. The AMI score between two partitions $V_1, V_2$ is computed as

$$\mathrm{AMI}(V_1, V_2) = \frac{\mathrm{MI}(V_1, V_2) - \mathrm{EMI}(V_1, V_2)}{(H(V_1) + H(V_2))/2 - \mathrm{EMI}(V_1, V_2)},$$

where $\mathrm{EMI}(V_1, V_2)$ denotes the expected mutual information under random labelings. The AMI score lies in the interval $[0, 1]$, and it remains unchanged when we permute the cluster labels.
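For reference, the AMI score is available off the shelf; the snippet below (our own, not part of the paper's pipeline) uses scikit-learn and illustrates the invariance to label permutation:

```python
from sklearn.metrics import adjusted_mutual_info_score

# AMI between a ground-truth partition and a predicted clustering;
# the score is unchanged when cluster labels are permuted.
labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [1, 1, 0, 0, 2, 2]   # same partition, relabeled
print(adjusted_mutual_info_score(labels_true, labels_pred))  # -> 1.0
```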
In our experiments, we report the AMI scores between the ground-truth labels and the predicted labels.

**Clustering results.** We run D2 and PD2 on the three real datasets in Table 1. The final AMI scores and the average numbers of iterations for the different datasets are given in Tables 2 and 3, respectively. We apply k-means clustering to the raw TF-IDF vectors as a baseline. Each result is averaged over five runs with different initializations. We stop the D2 and PD2 algorithms when the labels of each cluster are stable. Comparing the AMI scores in Table 2, we see that the proposed RPRWB model improves the performance of text clustering. One possible reason is that for many real high-dimensional datasets, a low-dimensional subspace captures most of the information. Notice that the D2 clustering AMI scores reported here are smaller than those in (Ye et al., 2017a); this is because the clustering performance depends strongly on the barycenter initialization, and we report the average AMIs over different initializations, while (Ye et al., 2017a) reported the best AMI they obtained. We further see that the average number of iterations of the PD2 algorithm is smaller. Moreover, we plot the AMI scores for the first ten iterations of the D2 and PD2 clustering algorithms in Figure 5. We see that the PD2 clustering algorithm gives better AMI scores than the D2 clustering algorithm, which shows the advantage of the proposed RPRWB model.

*Figure 5. AMI scores for each iteration. Left: the Reuters Subset dataset; middle: the BBCsport Abstract dataset; right: the BBCnews Abstract dataset. The results are averaged over 5 runs.*

Table 2. AMI scores for the clustering results.

| Dataset | k-means | D2 | PD2 |
| --- | --- | --- | --- |
| Reuters Subset | 0.4627 | 0.4200 | 0.4713 |
| BBCnews Abstract | 0.3877 | 0.6095 | 0.6557 |
| BBCsport Abstract | 0.4276 | 0.6510 | 0.6892 |

Table 3. Average number of clustering iterations.

| Dataset | D2 | PD2 |
| --- | --- | --- |
| Reuters Subset | 24.2 | 23.2 |
| BBCnews Abstract | 23.8 | 22.4 |
| BBCsport Abstract | 29.8 | 14.4 |

## 7. Conclusion

In this paper, we have proposed a novel WB model, the projection robust Wasserstein barycenter, which has the potential to mitigate the curse of dimensionality for the WB problem. To resolve the computational difficulty of the PRWB, we have proposed a relaxed PRWB model, the RPRWB, together with two algorithms, the RBCD algorithm and the RGA-IBP algorithm, for solving the fixed-support RPRWB problem. We have analyzed the iteration complexity and the complexity of arithmetic operations of both algorithms. Numerical results on synthetic datasets demonstrate the robustness and the better sample efficiency of the proposed RPRWB model compared with the WB model. Moreover, we have incorporated the RPRWB model into the D2 clustering algorithm and proposed the projection robust D2 clustering algorithm. Numerical results on real text datasets show that PD2 clustering improves the performance of D2 clustering. Future directions include deriving the sample complexity of the WB, PRWB, and RPRWB.

## Acknowledgements

We would like to thank the AC and all anonymous reviewers for very constructive suggestions that improved the quality of the paper. We are also grateful to Tianyi Lin for discussions on this topic. This research was partially supported by NSF HDR TRIPODS grant CCF-1934568, NSF grants CCF-1717943, CNS-1824553, CCF-1908258, DMS-1953210, ECCS-2000415 and CCF-2007797, and the UC Davis CeDAR (Center for Data Science and Artificial Intelligence Research) Innovative Data Science Seed Funding Program.

## References

Absil, P.-A., Mahony, R., and Sepulchre, R. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.

Agueh, M. and Carlier, G. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904-924, 2011.
Aldaz, J. A monotonicity property of variances. Statistics & Probability Letters, 83(5):1416-1419, 2013.

Altschuler, J., Niles-Weed, J., and Rigollet, P. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems, pp. 1964-1974, 2017.

Altschuler, J. M. and Boix-Adsera, E. Wasserstein barycenters are NP-hard to compute. arXiv preprint arXiv:2101.01100, 2021.

Benamou, J.-D., Carlier, G., Cuturi, M., Nenna, L., and Peyré, G. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111-A1138, 2015.

Bonneel, N., Rabin, J., Peyré, G., and Pfister, H. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22-45, 2015.

Borgwardt, S. and Patterson, S. On the computational complexity of finding a sparse Wasserstein barycenter. arXiv preprint arXiv:1910.07568, 2019.

Boumal, N., Absil, P.-A., and Cartis, C. Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis, 39(1):1-33, 2019.

Claici, S., Chien, E., and Solomon, J. Stochastic Wasserstein barycenters. In International Conference on Machine Learning, pp. 999-1008, 2018.

Cuturi, M. and Doucet, A. Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, 2014.

Dudley, R. M. The speed of mean Glivenko-Cantelli convergence. The Annals of Mathematical Statistics, 40(1):40-50, 1969.

Ge, D., Wang, H., Xiong, Z., and Ye, Y. Interior-point methods strike back: Solving the Wasserstein barycenter problem. In Advances in Neural Information Processing Systems, pp. 6894-6905, 2019.

Greene, D. and Cunningham, P. Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proceedings of the 23rd International Conference on Machine Learning, pp. 377-384, 2006.

Ho, N., Nguyen, X., Yurochkin, M., Bui, H. H., Huynh, V., and Phung, D. Multilevel clustering via Wasserstein means. In International Conference on Machine Learning, pp. 1501-1509, 2017.

Huang, M., Ma, S., and Lai, L. A Riemannian block coordinate descent method for computing the projection robust Wasserstein distance. arXiv preprint arXiv:2012.05199, 2020.

Kroshnin, A., Tupitsa, N., Dvinskikh, D., Dvurechensky, P., Gasnikov, A., and Uribe, C. On the complexity of approximating Wasserstein barycenters. In International Conference on Machine Learning, pp. 3530-3540. PMLR, 2019.

Lin, T., Fan, C., Ho, N., Cuturi, M., and Jordan, M. Projection robust Wasserstein distance and Riemannian optimization. In NeurIPS, volume 33, 2020a.

Lin, T., Ho, N., Chen, X., Cuturi, M., and Jordan, M. I. Fixed-support Wasserstein barycenters: Computational hardness and fast algorithm. Advances in Neural Information Processing Systems, 33, 2020b.

Niles-Weed, J. and Rigollet, P. Estimation of Wasserstein distances in the spiked transport model. arXiv preprint arXiv:1909.07513, 2019.

Paty, F.-P. and Cuturi, M. Subspace robust Wasserstein distances. In International Conference on Machine Learning, pp. 5072-5081, 2019.

Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543, 2014.

Rabin, J., Peyré, G., Delon, J., and Bernot, M. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pp. 435-446. Springer, 2011.
Vinh, N. X., Epps, J., and Bailey, J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. The Journal of Machine Learning Research, 11:2837-2854, 2010.

Weed, J. and Bach, F. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli, 25(4A):2620-2648, 2019.

Ye, J., Li, Y., Wu, Z., Wang, J. Z., Li, W., and Li, J. Determining gains acquired from word embedding quantitatively using discrete distribution clustering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1847-1856, 2017a.

Ye, J., Wu, P., Wang, J. Z., and Li, J. Fast discrete distribution clustering using Wasserstein barycenter with sparse support. IEEE Transactions on Signal Processing, 65(9):2317-2332, 2017b.