# Wasserstein K-means for clustering probability distributions

Yubo Zhuang, Xiaohui Chen, Yun Yang
Department of Statistics, University of Illinois at Urbana-Champaign
{yubo2,xhchen,yy84}@illinois.edu

Abstract

Clustering is an important exploratory data analysis technique for grouping objects based on their similarity. The widely used K-means clustering method relies on some notion of distance to partition data into a smaller number of groups. In the Euclidean space, the centroid-based and distance-based formulations of K-means are equivalent. In modern machine learning applications, data often arise as probability distributions, and a natural generalization for handling measure-valued data is to use the optimal transport metric. Due to the non-negative Alexandrov curvature of the Wasserstein space, barycenters suffer from regularity and non-robustness issues. These peculiar behaviors of Wasserstein barycenters may cause the centroid-based formulation to fail to represent the within-cluster data points, while the more direct distance-based K-means approach and its semidefinite program (SDP) relaxation are capable of recovering the true cluster labels. In the special case of clustering Gaussian distributions, we show that the SDP-relaxed Wasserstein K-means achieves exact recovery provided the clusters are well separated under the 2-Wasserstein metric. Our simulation and real-data examples also demonstrate that distance-based K-means can achieve better classification performance than the standard centroid-based K-means for clustering probability distributions and images.

1 Introduction

Clustering is a major tool for unsupervised machine learning and exploratory data analysis in statistics. Suppose we observe a sample of data points $X_1,\dots,X_n$ taking values in a metric space $(\mathcal{X}, \rho)$, and suppose there exists a clustering structure $G^\star_1,\dots,G^\star_K$ such that each data point $X_i$ belongs to exactly one of the unknown clusters $G^\star_k$. The goal of clustering analysis is to recover the true clusters $G^\star_1,\dots,G^\star_K$ from the input data $X_1,\dots,X_n$. In the Euclidean space $\mathcal{X} = \mathbb{R}^p$, K-means clustering is a widely used method that has achieved empirical success in many applications [MacQueen, 1967]. In modern machine learning and data science problems such as computer graphics [Solomon et al., 2015], data exhibit complex geometric features, and traditional clustering methods developed for Euclidean data may not be well suited to analyzing such data. In this paper, we consider the problem of clustering probability measures $\mu_1,\dots,\mu_n$ into $K$ groups. As a motivating example, the MNIST dataset contains images of handwritten digits 0-9. Normalizing the greyscale images into histograms viewed as probability measures, a common task is to cluster the images. One can certainly apply the Euclidean K-means to the vectorized images; however, this would lose important geometric information in the two-dimensional data. On the other hand, the theory of optimal transport [Villani, 2003] provides an appealing framework for modeling measure-valued data in many statistical tasks [Domazakis et al., 2019, Chen et al., 2021, Bigot et al., 2017, Seguy and Cuturi, 2015, Rigollet and Weed, 2019, Hütter and Rigollet, 2019, Cazelles et al., 2018].

Background on K-means clustering.
Algorithmically, K-means clustering has two equivalent formulations in the Euclidean space, centroid-based and distance-based, in the sense that they both yield the same partition estimate for the true clustering structure. Given Euclidean data $X_1,\dots,X_n \in \mathbb{R}^p$, the centroid-based formulation of the standard K-means can be expressed as

$$\min_{\beta_1,\dots,\beta_K \in \mathbb{R}^p} \sum_{i=1}^n \min_{k\in[K]} \|X_i - \beta_k\|_2^2 \;=\; \min\Big\{ \sum_{k=1}^K \sum_{i\in G_k} \|X_i - \bar X_k\|_2^2 \;:\; \bigsqcup_{k=1}^K G_k = [n] \Big\}, \tag{1}$$

where the clusters $\{G_k\}_{k=1}^K$ are determined by the Voronoi diagram of $\{\beta_k\}_{k=1}^K$, $\bar X_k = |G_k|^{-1}\sum_{i\in G_k} X_i$ denotes the centroid of cluster $G_k$, $\bigsqcup$ denotes the disjoint union, and $[n] = \{1,\dots,n\}$. Heuristic algorithms for solving (1) include Lloyd's algorithm [Lloyd, 1982], an iterative procedure that alternates between partition and centroid estimation steps. Specifically, given an initial centroid estimate $\beta_1^{(1)},\dots,\beta_K^{(1)}$, one first assigns each data point to its nearest centroid at the $t$-th iteration according to the Voronoi diagram, i.e.,

$$G_k^{(t)} = \Big\{ i \in [n] : \|X_i - \beta_k^{(t)}\|_2 \le \|X_i - \beta_j^{(t)}\|_2, \ \forall j \in [K] \Big\}, \tag{2}$$

and then updates the centroid of each cluster,

$$\beta_k^{(t+1)} = \frac{1}{|G_k^{(t)}|} \sum_{i \in G_k^{(t)}} X_i, \tag{3}$$

where $|G_k^{(t)}|$ denotes the cardinality of $G_k^{(t)}$. Steps (2) and (3) alternate until convergence. The distance-based (sometimes also referred to as partition-based) formulation directly solves the following constrained optimization problem without referring to the estimated centroids:

$$\min\Big\{ \sum_{k=1}^K \frac{1}{|G_k|} \sum_{i,j\in G_k} \|X_i - X_j\|_2^2 \;:\; \bigsqcup_{k=1}^K G_k = [n] \Big\}. \tag{4}$$

Observe that (1) with nearest-centroid assignment and (4) are equivalent for the clustering purpose due to the following identity, which extends the parallelogram law from two points to $n$ points:

$$\sum_{i,j=1}^n \|X_i - X_j\|_2^2 = 2n \sum_{i=1}^n \|X_i - \bar X\|_2^2, \quad \text{with } \bar X = \frac{1}{n}\sum_{i=1}^n X_i \text{ and } X_i \in \mathbb{R}^p. \tag{5}$$

Consequently, the two criteria yield the same partition estimate for $G^\star_1,\dots,G^\star_K$. The key identity (5) establishing the equivalence relies on two facts about the Euclidean space: (i) it is a vector space (i.e., vectors can be averaged in the linear sense), and (ii) it is flat (i.e., has zero curvature). Both facts unfortunately fail for the Wasserstein space $(\mathcal{P}_2(\mathbb{R}^p), W_2)$, which endows the space $\mathcal{P}_2(\mathbb{R}^p)$ of all probability distributions with finite second moments with the 2-Wasserstein metric $W_2$ [Ambrosio et al., 2005]. In particular, the 2-Wasserstein distance between two distributions $\mu$ and $\nu$ in $\mathcal{P}_2(\mathbb{R}^p)$ is defined as

$$W_2^2(\mu,\nu) := \min_{\gamma}\Big\{ \int_{\mathbb{R}^p\times\mathbb{R}^p} \|x-y\|_2^2 \, \mathrm{d}\gamma(x,y) \Big\}, \tag{6}$$

where the minimization over $\gamma$ runs over all couplings with marginals $\mu$ and $\nu$. It is well known that the Wasserstein space is a metric space (in fact a geodesic space) with non-negative curvature in the Alexandrov sense [Lott, 2008].

Our contributions. We summarize our main contributions as follows: (i) we provide theoretical and empirical evidence for pitfalls (irregularity and non-robustness) of barycenter-based Wasserstein K-means, and (ii) we generalize the distance-based formulation of K-means to the Wasserstein space and establish the exact recovery property of its SDP relaxation for clustering Gaussian measures under a separation lower bound in the 2-Wasserstein distance.

Existing work. Since K-means clustering is a worst-case NP-hard problem [Aloise et al., 2009], approximation algorithms have been extensively studied in the literature, including Lloyd's algorithm [Lloyd, 1982], spectral methods [von Luxburg, 2007, Meila and Shi, 2001, Ng et al., 2001], semidefinite programming (SDP) relaxations [Peng and Wei, 2007], and non-convex methods via low-rank matrix factorization [Burer and Monteiro, 2003].
Theoretical guarantees for these methods have been established for statistical models on Euclidean data [Lu and Zhou, 2016, von Luxburg et al., 2008, Vempala and Wang, 2004, Fei and Chen, 2018, Giraud and Verzelen, 2018, Chen and Yang, 2021, Zhuang et al., 2022]. The problem of clustering general measure-valued data was introduced by Domazakis et al. [2019], who proposed the centroid-based Wasserstein K-means algorithm, replacing the Euclidean norm and sample means with the Wasserstein distance and barycenters, respectively. Verdinelli and Wasserman [2019] proposed a modified Wasserstein distance for distribution clustering. Subsequently, Chazal et al. [2021] proposed to cluster measures via mean measure quantization, first vectorizing the measures into a finite-dimensional Euclidean space and then applying an efficient clustering algorithm such as single-linkage clustering with the $L_\infty$ distance. Such vectorization methods improve computational efficiency, but they may not capture the properties of probability measures as well as clustering algorithms that operate directly on the Wasserstein space.

2 Wasserstein K-means clustering methods

In this section, we generalize the Euclidean K-means to the Wasserstein space. Our starting point is to mimic the standard K-means methods for Euclidean data, so we define two versions of the Wasserstein K-means clustering formulation: centroid-based and distance-based. As mentioned in Section 1, when working with the Wasserstein space $(\mathcal{P}_2(\mathbb{R}^p), W_2)$, the centroid-based criterion (1) and the distance-based criterion (4), with the Euclidean metric $\|\cdot\|_2$ replaced by the 2-Wasserstein metric $W_2$, may lead to radically different clustering schemes. To begin with, we argue that, due to the irregularity and non-robustness of barycenters in the Wasserstein space, the centroid-based criterion may lead to unreasonable clustering schemes that lack physical interpretation and are sensitive to small data perturbations.

2.1 Clustering based on barycenters

The centroid-based Wasserstein K-means, which extends Lloyd's algorithm to the Wasserstein space, has recently been considered by Domazakis et al. [2019]. Specifically, it is an iterative algorithm that proceeds as follows. Given an initial centroid estimate $\nu_1^{(1)},\dots,\nu_K^{(1)}$, one first assigns each probability measure $\mu_1,\dots,\mu_n$ to its nearest centroid in the Wasserstein geometry at the $t$-th iteration according to the Voronoi diagram,

$$G_k^{(t)} = \Big\{ i \in [n] : W_2(\mu_i, \nu_k^{(t)}) \le W_2(\mu_i, \nu_j^{(t)}), \ \forall j \in [K] \Big\}, \tag{7}$$

and then updates the centroid of each cluster,

$$\nu_k^{(t+1)} = \operatorname*{arg\,min}_{\nu \in \mathcal{P}_2(\mathbb{R}^p)} \frac{1}{|G_k^{(t)}|} \sum_{i \in G_k^{(t)}} W_2^2(\mu_i, \nu). \tag{8}$$

Note that $\nu_k^{(t+1)}$ in (8) is referred to as the barycenter of the probability measures $\mu_i$, $i \in G_k^{(t)}$, a Wasserstein analog of the Euclidean average or mean [Agueh and Carlier, 2011]. We will also use "barycenter-based K-means" interchangeably for the centroid-based K-means in the Wasserstein space. Even though the Wasserstein barycenter is a natural notion of averaging probability measures, it may exhibit peculiar behaviors and fail to represent the within-cluster data points, partly because the generalized parallelogram law (5) is violated (for non-flat spaces) when the Euclidean metric $\|\cdot\|_2$ is replaced with the 2-Wasserstein metric $W_2$.

Example 1 (Irregularity of Wasserstein barycenters). The Wasserstein barycenter has much less regularity than the sample mean in the Euclidean space [Kim and Pass, 2017].
In particular, Santambrogio and Wang [2016] constructed a simple example of two probability measures supported on line segments in $\mathbb{R}^2$ for which the support of their barycenter, obtained as the displacement interpolation between the two endpoint probability measures, is not convex (cf. the left plot in Figure 1). In this example, the probability measures $\mu_0$ and $\mu_1$ are supported on the line segments $L_0 = \{(s, as) : s \in [-1,1]\}$ and $L_1 = \{(s, -as) : s \in [-1,1]\}$, respectively, where $a \in (0,1)$ specifies the orientation of $L_0$ and $L_1$ relative to the x-axis. Moreover, we consider the linear density functions $\mu_0(s) = (1-s)/2$ and $\mu_1(s) = (1+s)/2$ for $s \in [-1,1]$ supported on $L_0$ and $L_1$, respectively. Then the optimal transport map $T := T_{\mu_0 \to \mu_1}$ from $\mu_0$ to $\mu_1$ is given by

$$T(x, ax) = \Big( -1 + \sqrt{4 - (1-x)^2},\; -a\big(-1 + \sqrt{4 - (1-x)^2}\big) \Big), \tag{9}$$

and the barycenter corresponds to the displacement interpolation $\mu_t = [(1-t)\,\mathrm{id} + t\,T]_{\#}\mu_0$ at $t = 0.5$ [McCann, 1997]. To keep the paper self-contained, we give the proof of (9) in Appendix D.1. Figure 1 (left) shows that the support of the barycenter $\mu_{0.5}$ is not convex (in fact, it is part of the boundary of an ellipse) even though the supports of $\mu_0$ and $\mu_1$ are convex. This example shows that the barycenter functional is not geodesically convex in the Wasserstein space. Since barycenters are essential to the centroid-based Wasserstein K-means, and an irregular barycenter may fail to represent its cluster (see Example 3 and Remark 9 below for more details), this counter-example motivates us to seek an alternative formulation.

Figure 1: Left: support of the Wasserstein barycenter as the displacement interpolation between $\mu_0$ and $\mu_1$ at $t = 0.5$ in Example 1. Right: non-robustness of the optimal transport map (arrow lines) and the Wasserstein barycenter w.r.t. a small perturbation around $a = 1$ of the target measure in Example 2.

Example 2 (Non-robustness of Wasserstein barycenters). Another unappealing feature of the Wasserstein barycenter is its sensitivity to data perturbation: a small (local) change in one contributing probability measure may lead to large (global) changes in the resulting barycenter; see Figure 1 (right) for such an example. In this example, we take the source measure $\mu_0 = 0.5\,\delta_{(-1,1)} + 0.5\,\delta_{(1,-1)}$ and the target measure $\mu_1 = 0.5\,\delta_{(-1,-a)} + 0.5\,\delta_{(1,a)}$ for some $a > 0$. It is easy to see that the optimal transport map $T := T_{\mu_0 \to \mu_1}$ exhibits a dichotomy:

$$T(-1,1) = \begin{cases} (-1,-a) & \text{if } 0 < a < 1 \\ (1,a) & \text{if } a > 1 \end{cases} \qquad\text{and}\qquad T(1,-1) = \begin{cases} (1,a) & \text{if } 0 < a < 1 \\ (-1,-a) & \text{if } a > 1 \end{cases}. \tag{10}$$

Thus the Wasserstein barycenter determined by the displacement interpolation $\mu_t = [(1-t)\,\mathrm{id} + t\,T]_{\#}\mu_0$ is a discontinuous function of $a$ at $a = 1$. This non-robustness can be attributed to the discontinuity of the Wasserstein barycenter as a function of its input probability measures; in contrast, the Euclidean mean is a globally Lipschitz continuous function of its input points.

Because of these pitfalls of the Wasserstein barycenter, shown in Examples 1 and 2, the centroid-based Wasserstein K-means approach described at the beginning of this subsection may lead to unreasonable and unstable clustering schemes. In addition, an ill-conditioned configuration may significantly slow down the convergence of commonly used barycenter approximation algorithms such as iterative Bregman projections [Benamou et al., 2015].
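The dichotomy in (10) is easy to check numerically. The following minimal sketch is our own illustration (not part of the paper's code) and assumes the POT package (`pip install pot`); it computes the exact optimal coupling between $\mu_0$ and $\mu_1$ for a value of $a$ slightly below and slightly above 1.

```python
# Numerical check of Example 2: the optimal matching between mu_0 and mu_1 flips
# abruptly as `a` crosses 1, so the displacement interpolation (and hence the
# barycenter) is discontinuous in `a`. Assumes the POT package ("pip install pot").
import numpy as np
import ot  # Python Optimal Transport

mu0_support = np.array([[-1.0, 1.0], [1.0, -1.0]])  # source: two equal point masses
weights = np.array([0.5, 0.5])

for a in (0.9, 1.1):
    mu1_support = np.array([[-1.0, -a], [1.0, a]])  # target measure, parameterized by a
    C = ot.dist(mu0_support, mu1_support)           # squared Euclidean cost matrix
    plan = ot.emd(weights, weights, C)              # exact optimal coupling
    print(f"a = {a}:\n{plan}")
# For a = 0.9 the coupling is diagonal (each mass stays on its side); for a = 1.1
# the two masses swap targets, matching the dichotomy in (10).
```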
Below, we give a concrete example of this phenomenon in the clustering context.

Example 3 (Failure of centroid-based Wasserstein K-means). In a nutshell, the failure in this example is due to the counter-intuitive phenomenon illustrated in the right panel of Fig. 2: a distribution $\mu_3$ in the Wasserstein space may have a larger $W_2$ distance to the Wasserstein barycenter $\bar\mu_1$ of $\mu_1$ and $\mu_2$ than to either of the distributions $\mu_i$ ($i = 1, 2$) that together form it. As a result of this strange configuration, even though $\mu_3$ is closer to $\mu_1$ and $\mu_2$ from the first cluster with barycenter $\mu^\star_1$ than to $\mu_4$ coming from a second cluster with barycenter $\mu^\star_2$, it will be incorrectly assigned to the second cluster by the centroid-based criterion (7), since

$$W_2(\mu_3, \mu^\star_1) > W_2(\mu_3, \mu^\star_2) > \max\big\{ W_2(\mu_3, \mu_1),\, W_2(\mu_3, \mu_2) \big\}.$$

In contrast, in Euclidean spaces, due to the following equivalent formulation of the generalized parallelogram law (5),

$$\sum_{i=1}^n \|X - X_i\|_2^2 = n\,\|X - \bar X\|_2^2 + \sum_{i=1}^n \|X_i - \bar X\|_2^2 \;\ge\; n\,\|X - \bar X\|_2^2, \quad \text{for any } X \in \mathbb{R}^p,$$

there is always some point $X_i$ satisfying $\|X - X_i\|_2 \ge \|X - \bar X\|_2$, that is, further away from $X$ than the mean $\bar X$; this excludes counter-intuitive phenomena such as the one shown in Fig. 2.

Figure 2: Left: visualization of Example 3 in $\mathbb{R}^2$ and in the Wasserstein space. Right: the black curve connecting $\mu_1$ and $\mu_2$ depicts the geodesic between them.

Concretely, the first cluster $G^\star_1$, shown in the left panel of Fig. 2 highlighted by a red circle, consists of $m$ copies of the $(\mu_1, \mu_2)$ pair and one copy of $\mu_3$; the second cluster $G^\star_2$, containing copies of $\mu_4$, is highlighted by a blue circle. Each distribution assigns equal probability mass to two points, and the two supporting points are connected by a dashed line for easy visualization. More specifically, we set $\mu_1 = 0.5\,\delta_{(x,y)} + 0.5\,\delta_{(-x,-y)}$, $\mu_2 = 0.5\,\delta_{(x,-y)} + 0.5\,\delta_{(-x,y)}$, $\mu_3 = 0.5\,\delta_{(x+\epsilon_1,\,y)} + 0.5\,\delta_{(x+\epsilon_1,\,-y)}$, and $\mu_4 = 0.5\,\delta_{(x+\epsilon_1+\epsilon_2,\,y)} + 0.5\,\delta_{(x+\epsilon_1+\epsilon_2,\,-y)}$, where $\delta_{(x,y)}$ denotes the point mass at $(x, y)$ and $(x, y, \epsilon_1, \epsilon_2)$ are positive constants. The properties of this configuration are summarized by the following lemma.

Lemma 4 (Configuration characterization). If $(x, y, \epsilon_1, \epsilon_2)$ satisfies $y^2 < \min\{x^2,\, 0.25\,\epsilon_{1,x}\}$ and $\epsilon_{1,x} < \epsilon_2^2 < \epsilon_{1,x} + y^2$, where $\epsilon_{1,x} := \epsilon_1^2 + 2x^2 + 2x\epsilon_1$, then for all sufficiently large $m$ (the number of copies of $\mu_1$ and $\mu_2$),

$$W_2(\mu_3, \mu^\star_2) < W_2(\mu_3, \mu^\star_1) \quad\text{and}\quad \underbrace{\max_{k=1,2}\ \max_{i,j\in G^\star_k} W_2(\mu_i,\mu_j)}_{\text{largest within-cluster distance}} \;<\; \underbrace{\min_{i\in G^\star_1,\, j\in G^\star_2} W_2(\mu_i,\mu_j)}_{\text{smallest between-cluster distance}},$$

where $\mu^\star_k$ denotes the Wasserstein barycenter of cluster $G^\star_k$ for $k = 1, 2$.

Note that the condition of Lemma 4 implies $y < x$. Therefore, the barycenter of $\mu_1$ and $\mu_2$ is $\bar\mu_1 := 0.5\,\delta_{(x,0)} + 0.5\,\delta_{(-x,0)}$, which lies on the horizontal axis. By increasing $m$, the barycenter $\mu^\star_1$ of cluster $G^\star_1$ can be made arbitrarily close to $\bar\mu_1$. The second inequality in Lemma 4 shows that all within-cluster distances are strictly smaller than the between-cluster distances; therefore, clustering based on pairwise distances correctly recovers the cluster label of $\mu_3$. However, since $\mu_3$ is closer to the barycenter $\mu^\star_2$ of cluster $G^\star_2$ by the first inequality in Lemma 4, it will be misclassified into $G^\star_2$ by the centroid-based criterion. We emphasize that cluster positions as in this example are generic and do exist in real data; see Remark 9 and Section 4.3 for further discussion of our experimental results on the MNIST data. Moreover, similar to Example 2, a small change in the orientation of the distribution $\mu_1$ may completely alter the clustering membership of $\mu_3$ under the centroid-based criterion.
Specifically, if we slightly increase $y$ so that it exceeds $x$, then the barycenter of $\mu_1$ and $\mu_2$ becomes $\bar\mu_1 := 0.5\,\delta_{(0,y)} + 0.5\,\delta_{(0,-y)}$, which lies on the vertical axis. Correspondingly, based on centroids, $\mu_3$ would then be clustered into $G^\star_1$, as it is closer to the barycenter $\mu^\star_1$ of $G^\star_1$ than to the barycenter $\mu^\star_2$ of $G^\star_2$. Therefore, the centroid-based criterion can be unstable against data perturbations. In comparison, a criterion based on pairwise distances always assigns $\mu_3$ to cluster $G^\star_1$, regardless of whether $x < y$ or $x > y$.

2.2 Clustering based on pairwise distances

Due to the irregularity and non-robustness of the centroid-based Wasserstein K-means, we instead propose and advocate the distance-based Wasserstein K-means below, which extends the Euclidean distance-based K-means formulation (4) to the Wasserstein space:

$$\min\Big\{ \sum_{k=1}^K \frac{1}{|G_k|} \sum_{i,j\in G_k} W_2^2(\mu_i,\mu_j) \;:\; \bigsqcup_{k=1}^K G_k = [n] \Big\}. \tag{11}$$

Correspondingly, we can design a greedy algorithm resembling the Wasserstein Lloyd's algorithm described in Section 2.1 that solves the centroid-based Wasserstein K-means. Specifically, the greedy algorithm proceeds iteratively as follows. Given an initial cluster membership estimate $G_1^{(1)},\dots,G_K^{(1)}$, one assigns each probability measure $\mu_1,\dots,\mu_n$ by minimizing the average squared $W_2$ distance to all current members of each cluster, leading to the updated cluster membership estimate

$$G_k^{(t+1)} = \Big\{ i \in [n] : \frac{1}{|G_k^{(t)}|}\sum_{s\in G_k^{(t)}} W_2^2(\mu_i,\mu_s) \;\le\; \frac{1}{|G_j^{(t)}|}\sum_{s\in G_j^{(t)}} W_2^2(\mu_i,\mu_s), \ \forall j \in [K] \Big\}. \tag{12}$$

In the case of a tie, we select arbitrarily among the clusters attaining the smallest average squared $W_2$ distance. We highlight that the centroid-based and distance-based Wasserstein K-means formulations are not necessarily equivalent and may yield different cluster labels (cf. Example 3). Below, we give some examples illustrating connections to the standard K-means clustering in the Euclidean space.

Example 5 (Degenerate probability measures). If the probability measures are point masses at $X_i \in \mathbb{R}^p$, i.e., $\mu_i = \delta_{X_i}$, then the Wasserstein K-means is the same as the standard K-means since $W_2(\mu_i, \mu_j) = \|X_i - X_j\|_2$.

Example 6 (Gaussian measures). If $\mu_i = N(m_i, V_i)$ with positive-definite covariance matrices $V_i \succ 0$, then the squared 2-Wasserstein distance can be expressed as the sum of the squared Euclidean distance between the mean vectors and

$$d^2(V_i, V_j) = \operatorname{Tr}\Big( V_i + V_j - 2\big( V_i^{1/2} V_j V_i^{1/2} \big)^{1/2} \Big), \tag{13}$$

the squared Bures distance between the covariance matrices [Bhatia et al., 2019]. Here, $V^{1/2}$ denotes the unique symmetric square root of $V \succeq 0$. That is,

$$W_2^2(\mu_i, \mu_j) = \|m_i - m_j\|_2^2 + d^2(V_i, V_j). \tag{14}$$

Then the Wasserstein K-means, formulated either as (7) or as (11), can be viewed as a covariance-adjusted Euclidean K-means that takes into account the shape and orientation information in the (non-degenerate) Gaussian inputs; an illustrative sketch of the distance-based update (12) for Gaussian inputs is given at the end of this subsection.

Example 7 (One-dimensional probability measures). If the $\mu_i$ are probability measures on $\mathbb{R}$ with cumulative distribution functions (cdf) $F_i$, then the Wasserstein distance can be written in terms of the quantile transform,

$$W_2^2(\mu_i, \mu_j) = \int_0^1 \big[ F_i^{-}(u) - F_j^{-}(u) \big]^2 \, \mathrm{d}u, \tag{15}$$

where $F^{-}$ is the generalized inverse of the cdf $F$ on $[0,1]$, defined as $F^{-}(u) = \inf\{x \in \mathbb{R} : F(x) > u\}$ (cf. Theorem 2.18 of Villani [2003]). Thus one-dimensional probability measures in the Wasserstein space can be isometrically embedded into a flat $L^2$ space, and the equivalence of the Wasserstein and Euclidean K-means clustering methods is recovered.
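As promised in Example 6, the following sketch illustrates the distance-based update (12) for Gaussian inputs, with the pairwise squared distances filled in via the Bures-Wasserstein formula (13)-(14). It is our own illustration under stated assumptions (NumPy/SciPy; the helper names `gaussian_w2_sq`, `pairwise_w2_sq`, and `greedy_assign` are ours), not the authors' released implementation.

```python
# Illustrative sketch of the distance-based update (12) for Gaussian inputs,
# using the Bures-Wasserstein formula (13)-(14); not the authors' released code.
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_sq(m1, V1, m2, V2):
    """Squared 2-Wasserstein distance between N(m1, V1) and N(m2, V2), cf. (13)-(14)."""
    V1_half = np.real(sqrtm(V1))                  # unique symmetric PSD square root
    cross = np.real(sqrtm(V1_half @ V2 @ V1_half))
    bures_sq = np.trace(V1 + V2 - 2.0 * cross)    # squared Bures distance (13)
    return float(np.sum((m1 - m2) ** 2) + bures_sq)

def pairwise_w2_sq(means, covs):
    """n x n matrix A with A[i, j] = W_2^2(mu_i, mu_j) for Gaussian measures."""
    n = len(means)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            A[i, j] = A[j, i] = gaussian_w2_sq(means[i], covs[i], means[j], covs[j])
    return A

def greedy_assign(A, init_labels, K, n_iter=50):
    """Iterate the update (12): reassign each measure to the cluster minimizing the
    average squared W2 distance to the current cluster members."""
    labels = init_labels.copy()
    for _ in range(n_iter):
        new_labels = labels.copy()
        for i in range(A.shape[0]):
            avg = [A[i, labels == k].mean() if np.any(labels == k) else np.inf
                   for k in range(K)]
            new_labels[i] = int(np.argmin(avg))
        if np.array_equal(new_labels, labels):    # converged
            break
        labels = new_labels
    return labels

# Example usage with a random initialization (hypothetical data):
#   A = pairwise_w2_sq(means, covs)
#   labels = greedy_assign(A, np.random.randint(0, K, size=len(means)), K)
```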
3 SDP relaxation and its theoretical guarantee

Note that the Wasserstein Lloyd's algorithm requires computing the barycenter in (7) and (8) at each iteration, which can be computationally expensive when the domain dimension $p$ is large or the configuration is ill-conditioned (cf. Example 2). On the other hand, it is known that solving the distance-based K-means (4) is worst-case NP-hard for Euclidean data, so we expect solving the distance-based Wasserstein K-means (11) to be computationally hard as well. A common approach is to consider convex relaxations that approximate the solution of (11). It is known that a certain SDP relaxation is information-theoretically tight for (4) when the data $X_1,\dots,X_n \in \mathbb{R}^p$ are generated from a Gaussian mixture model with isotropic known variance [Chen and Yang, 2021]. In this paper, we extend this idea to the Wasserstein setting for solving (11). A typical SDP relaxation for Euclidean data uses pairwise inner products to construct an affinity matrix for clustering [Peng and Wei, 2007]; unfortunately, due to its non-flatness, a globally well-defined inner product does not exist for Wasserstein spaces of dimension higher than one. Therefore, we derive a Wasserstein SDP relaxation of the combinatorial optimization problem (11) using the squared distance matrix $A_{n\times n} = \{a_{ij}\}$ with $a_{ij} = W_2^2(\mu_i, \mu_j)$.

Concretely, any partition $(G_1,\dots,G_K)$ can be reparameterized, in a one-to-one manner, as a binary assignment matrix $H = \{h_{ik}\} \in \{0,1\}^{n\times K}$ with $h_{ik} = 1$ if $i \in G_k$ and $h_{ik} = 0$ otherwise. Then (11) can be expressed as a nonlinear 0-1 integer program,

$$\min\Big\{ \langle A,\, H B H^\top \rangle \;:\; H \in \{0,1\}^{n\times K},\ H 1_K = 1_n \Big\}, \tag{16}$$

where $1_n$ is the $n \times 1$ vector of all ones and $B = \operatorname{diag}(|G_1|^{-1},\dots,|G_K|^{-1})$. Changing variables to the membership matrix $Z = H B H^\top$, we note that $Z_{n\times n}$ is a symmetric positive semidefinite (psd) matrix $Z \succeq 0$ such that $\operatorname{Tr}(Z) = K$, $Z 1_n = 1_n$, and $Z \ge 0$ entrywise. We thus obtain the SDP relaxation of (11) by keeping only these convex constraints:

$$\min\Big\{ \langle A, Z \rangle \;:\; Z^\top = Z,\ Z \succeq 0,\ \operatorname{Tr}(Z) = K,\ Z 1_n = 1_n,\ Z \ge 0 \Big\}. \tag{17}$$

To theoretically justify the SDP formulation (17) of Wasserstein K-means, we consider the scenario of clustering Gaussian distributions as in Example 6, where the Wasserstein distance (14) consists of two separate components: the Euclidean distance between the mean vectors and the Bures distance (13) between the covariance matrices. Without loss of generality, we focus on mean-zero Gaussian distributions, since optimal separation conditions for exact recovery based on the Euclidean mean component have been established in [Chen and Yang, 2021]. Suppose we observe Gaussian distributions $\nu_i = N(0, V_i)$, $i \in [n]$, from $K$ groups $G^\star_1, \dots, G^\star_K$, where cluster $G^\star_k$ contains $n_k$ members and the covariance matrices have the following clustering structure: if $i \in G^\star_k$, then

$$V_i = (I + t X_i)\, V^{(k)}\, (I + t X_i) \quad \text{with } X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} \operatorname{SymN}(0,1), \tag{18}$$

where the psd matrix $V^{(k)}$ is the center of the $k$-th cluster, $\operatorname{SymN}(0,1)$ denotes a symmetric random matrix with i.i.d. standard normal entries, and $t$ is a small perturbation parameter such that $(I + t X_i)$ is psd with high probability. For zero-mean Gaussian distributions, we have $W_2(N(0, V), N(0, U)) = d(V, U)$ by (14). Note that on the Riemannian manifold of psd matrices, the geodesic emanating from $V^{(k)}$ in the direction of a symmetric matrix $X$ can be linearized as $V = (I + tX)\, V^{(k)}\, (I + tX)$ in a small neighborhood of $t = 0$, which motivates the parameterization of our statistical model in (18).
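Before stating the recovery guarantee, we note that (17) is a standard linear SDP and can be handed to an off-the-shelf solver. The sketch below is purely illustrative (the choice of cvxpy and the eigenvector rounding step are our assumptions, not part of the paper); it takes a precomputed matrix $A$ of pairwise squared $W_2$ distances and returns the relaxed membership matrix $Z$.

```python
# Minimal cvxpy sketch of the SDP relaxation (17); illustrative only.
import numpy as np
import cvxpy as cp

def wasserstein_kmeans_sdp(A, K):
    """Solve (17) given the n x n matrix A with A[i, j] = W_2^2(mu_i, mu_j)."""
    n = A.shape[0]
    Z = cp.Variable((n, n), symmetric=True)
    constraints = [
        Z >> 0,                        # Z is positive semidefinite
        cp.trace(Z) == K,              # Tr(Z) = K
        Z @ np.ones(n) == np.ones(n),  # Z 1_n = 1_n
        Z >= 0,                        # entrywise non-negativity
    ]
    problem = cp.Problem(cp.Minimize(cp.trace(A @ Z)), constraints)  # <A, Z>
    problem.solve()
    return Z.value

# When the relaxation is tight, the returned Z is (numerically close to) the
# membership matrix H B H^T; otherwise a common rounding is to run Euclidean
# K-means on the top-K eigenvectors of Z to extract cluster labels.
```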
The next theorem gives a separation lower bound ensuring exact recovery of the cluster labels for Gaussian distributions.

Theorem 8 (Exact recovery for clustering Gaussians). Let $\Delta^2 := \min_{k\ne l} d^2(V^{(k)}, V^{(l)})$ denote the minimal pairwise separation among clusters, let $\bar n := \max_{k\in[K]} n_k$ (resp. $\underline{n} := \min_{k\in[K]} n_k$) be the maximum (resp. minimum) cluster size, and let $m := \min_{k\ne l} \frac{2 n_k n_l}{n_k + n_l}$ be the minimal pairwise harmonic mean of the cluster sizes. Suppose the covariance matrix $V_i$ of the Gaussian distribution $\nu_i = N(0, V_i)$ is independently drawn from model (18) for $i = 1,\dots,n$. Let $\beta \in (0,1)$. If the separation $\Delta^2$ satisfies

$$\Delta^2 > \bar\Delta^2 := \frac{C_1\, t^2\, \bar V\, p^2 \log n}{\min\{(1-\beta)^2,\ \beta^2\}}, \tag{19}$$

then the SDP (17) achieves exact recovery with probability at least $1 - C_2 n^{-1}$, provided that $\underline{n} \ge C_3 \log^2 n$, $t \le C_4 \sqrt{\log n} \big/ \big[ (p + \log n)\, \bar V^{1/2} T_v^{1/2} \big]$, and $\bar n / m \le C_5 \log n$, where $\bar V = \max_k \|V^{(k)}\|_{\mathrm{op}}$, $T_v = \max_k \operatorname{Tr}\big(V^{(k)-1}\big)$, and $C_i$, $i = 1,\dots,5$, are constants.

Remark 9 (Further insight into the pitfalls of barycenter-based Wasserstein K-means). Theorem 8 suggests that, unlike Euclidean data, distributions can be clustered after centering as long as their scales and rotation angles vary (i.e., the clustering is covariance-adjusted). We further illustrate the rotation and scale effects on the MNIST data that may mislead the centroid-based Wasserstein K-means, providing real-data support for Example 3. Here we randomly sample two unbalanced clusters with 200 images of the digit "0" and 100 images of the digit "5". Fig. 3 shows the clustering results for the centroid-based Wasserstein K-means and for its oracle version, in which the estimated barycenters $\hat\mu_1, \hat\mu_2$ are replaced by the true barycenters $\mu^\star_1, \mu^\star_2$ computed from the true labels. Comparing the Wasserstein distances $W_2(\mu_0, \mu^\star_1)$ and $W_2(\mu_0, \mu^\star_2)$, we see that the image $\mu_0$ (containing a digit "0") is closer to $\mu^\star_2$ (the true barycenter of digit "5"), so it cannot be classified correctly based on the nearest true barycenter (cf. Fig. 3, left). Moreover, the Wasserstein K-means based on the estimated barycenters $\hat\mu_1, \hat\mu_2$ yields two clusters of mixed "0"s and "5"s. In both cases, the misclassification is characterized by grouping images with similar rotation angles and/or degrees of stretch. Since the two clusters of distributions are highly unbalanced, Wasserstein K-means tends to force the larger cluster to split into two clusters and to absorb the distributions around the centers (cf. Fig. 3, right), leading to larger classification errors. We shall see in Section 4.3 that the distance-based Wasserstein K-means and its SDP relaxation achieve a much smaller classification error rate on MNIST, for the reason explained in Example 3 (cf. Lemma 4).

Figure 3: Visualization of misclassification for the barycenter-based Wasserstein K-means (B-WKM) on a randomly sampled subset of MNIST (200 images of digit "0" and 100 images of digit "5"). The plot at the bottom is an example of a misclassified image. The right plot is an abstraction of the images in the Wasserstein space; the color depth indicates the frequency of the distributions, and red and blue stand for distributions belonging to the true clusters "0" and "5", respectively.

4 Experiments

4.1 Counter-example in Example 3 revisited

Our first experiment backs up the claim about the failure of the centroid-based Wasserstein K-means in Example 3 through simulations. Instead of using point mass measures, which may cause instability when computing the barycenters, we use Gaussian distributions with small variances as a smoothed version. We consider $K = 2$, where cluster $G^\star_1$ consists of $m_1$ copies of the $(\mu_1, \mu_2)$ pair and $m_2$ copies of $\mu_3$, and cluster $G^\star_2$ consists of $m_3$ copies of $\mu_4$.
We choose $\mu_i$ as the following two-dimensional Gaussian mixtures: $\mu_i = 0.5\, N(a_{i,1}, \Sigma_{i,1}) + 0.5\, N(a_{i,2}, \Sigma_{i,2})$ for $i = 1, 2, 3, 4$. Due to space limits, detailed simulation setups and parameters are given in Appendix B. From Table 1, we observe that the Wasserstein SDP achieves exact recovery in all cases, while the barycenter-based Wasserstein K-means attains an exact recovery rate of only around 40% across repetitions. In addition, the Wasserstein SDP is more stable than the distance-based Wasserstein K-means. Denote by $\Delta_k := W_2^2(\mu_3, \mu^\star_k)$ the squared distance between $\mu_3$ and $\mu^\star_k$ for $k = 1, 2$, where $\mu^\star_k$ is the barycenter of $G^\star_k$. Comparing the maximum within-cluster distance $\max_{k=1,2}\max_{i,j\in G^\star_k} W_2(\mu_i,\mu_j)$ with the minimum between-cluster distance $\min_{i\in G^\star_1,\, j\in G^\star_2} W_2(\mu_i,\mu_j)$ in Table 7 in the Appendix, we observe that the former is smaller than the latter, from which we can expect the Wasserstein SDP to correctly cluster all data points in the Wasserstein space. Moreover, Table 1 shows that in about 25% of the repetitions the distributions in $G^\star_1$ of the form $\mu_3$ satisfy $\Delta_1 > \Delta_2$, implying that those copies of $\mu_3$ are likely to be assigned to the wrong cluster, consistent with Example 3. The experimental results also show that every copy of $\mu_3$ is misclassified whenever exact recovery fails for B-WKM, so the misclassification rate for $\mu_3$ equals $1 - \gamma$, where $\gamma$ is the exact recovery rate of B-WKM shown in Table 1. Table 6 in the appendix further reports a run-time comparison, from which we see that the distance-based approaches are more computationally efficient than the barycenter-based one in our settings.

Table 1: Exact recovery rates, and the frequency of $\Delta_1 > \Delta_2$ for B-WKM, over 50 repetitions of the counter-example. W-SDP: Wasserstein SDP; D-WKM: distance-based Wasserstein K-means; B-WKM: barycenter-based Wasserstein K-means; n: total number of distributions.

n     W-SDP   D-WKM   B-WKM   Frequency of $\Delta_1 > \Delta_2$
101   1.00    0.82    0.40    0.32
202   1.00    0.84    0.34    0.26
303   1.00    0.72    0.46    0.20

4.2 Gaussian distributions

Next, we simulate random Gaussian measures from model (18) with $K = 4$ and equal cluster sizes. We set the cluster centers so that all pairwise squared distances between the barycenters are equal, i.e., $W_2^2\big(N(0, V^{(k_1)}), N(0, V^{(k_2)})\big) = D$ for all $k_1 \ne k_2 \in \{1, 2, 3, 4\}$, with $\bar V = \max_k \|V^{(k)}\|_{\mathrm{op}} \in [4.5, 5.5]$. We fix the dimension $p = 10$, vary the sample size $n = 200, 400, 600$, and set the perturbation parameter of the covariance matrices to $t = 10^{-3}$. Results are reported over 100 replications in each setting. Fig. 4 shows the misclassification rate (log scale) versus the squared distance $D$ between centers. We observe that once the distance between cluster centers exceeds a certain threshold (squared distance $D > 10^{-3}$ in this case), the Wasserstein SDP achieves exact recovery for every $n$, while the misclassification rates of the two Wasserstein K-means methods stay around 10%. When the distance between cluster centers is relatively small, the two Wasserstein K-means methods and the SDP behave similarly.

Figure 4: Misclassification error versus squared distance $D$ for Wasserstein SDP (W-SDP) and barycenter/distance-based Wasserstein K-means (B-WKM and D-WKM) for clustering Gaussians with $n \in \{200, 400, 600\}$. Due to the log scale, $10^{-6}$ corresponds to exact recovery.

4.3 Real-data applications

We consider three benchmark image datasets: MNIST, Fashion-MNIST and the USPS handwritten digits.
Due to computational constraints, we consider subsets of the full datasets, randomly choosing a fixed number of images from each cluster, with 10 replicates for each case. We use the Sinkhorn divergence to compute the Wasserstein distances, and Bregman projections with 100 iterations to compute the barycenters, which is efficient and stable in practice for the non-degenerate case. For both Wasserstein K-means methods, we use an initialization analogous to K-means++ for Euclidean data: the first cluster barycenter is chosen uniformly at random among the distributions, and each subsequent cluster barycenter is chosen from the remaining distributions with probability proportional to its squared Sinkhorn divergence to its closest existing cluster barycenter. MATLAB and Python code implementing W-SDP and D-WKM is available at https://github.com/Yubo02/Wasserstein-K-means-for-clustering-probability-distributions.

For the MNIST dataset, we choose and fix two clusters, $G^\star_1$ containing the digit "0" and $G^\star_2$ containing the digit "5", in two cases. (1) In Case 1, we randomly draw 200 images of "0" and 100 images of "5" in each repetition. (2) In Case 2, we double these numbers and randomly draw 400 images of "0" and 200 images of "5". For the Fashion-MNIST and USPS handwritten digits datasets, we consider $K = 3$: $G^\star_1$ containing "T-shirt/top" (or the digit "0" for USPS), $G^\star_2$ containing "Trouser" (or the digit "5" for USPS), and $G^\star_3$ containing "Dress" (or the digit "7" for USPS). The cluster sizes are unbalanced: we randomly choose 200, 100 and 100 images from $G^\star_1$, $G^\star_2$ and $G^\star_3$, respectively. The error rates are shown in Table 2; detailed setups and additional results on F1 scores (analogous to error rates) as well as time costs are deferred to Appendix A due to space limits. From Table 2, we observe that Wasserstein SDP (W-SDP) and distance-based Wasserstein K-means (D-WKM) outperform barycenter-based Wasserstein K-means (B-WKM) in all cases, and the F1-score results in Table 3 of Appendix A are consistent with Table 2. The standard Euclidean K-means behaves similarly to barycenter-based Wasserstein K-means in some cases and is less favorable in others, such as our experiment on the USPS handwritten digits. In particular, the clustering results for Case 1 are visualized in Fig. 3. From this figure, we find that the classification criterion of B-WKM ends up grouping images by their closeness to a certain shape of "0", characterized by a particular rotation angle or degree of stretch, which leads to the high misclassification error of the barycenter-based (centroid-based) Wasserstein K-means.

Table 2: Error rate (SD) for clustering three benchmark datasets: MNIST, Fashion-MNIST and USPS handwritten digits. MNIST1 (MNIST2) refers to the results of Case 1 (Case 2) for the MNIST dataset.

                   W-SDP           D-WKM           B-WKM           KM
MNIST1             0.235 (0.045)   0.156 (0.057)   0.310 (0.069)   0.295 (0.066)
MNIST2             0.279 (0.050)   0.185 (0.097)   0.324 (0.032)   0.362 (0.033)
Fashion-MNIST      0.082 (0.020)   0.056 (0.014)   0.141 (0.059)   0.138 (0.099)
USPS handwritten   0.206 (0.020)   0.159 (0.061)   0.240 (0.045)   0.284 (0.025)

5 Discussion

In this paper, we identified and analyzed the peculiar behaviors of Wasserstein barycenters and their consequences for clustering probability distributions.
We then proposed the distance-based Wasserstein K-means approach (D-WKM) and its semidefinite programming relaxation (W-SDP), and established exact recovery results for Gaussians both theoretically and numerically. On several real benchmark datasets, we showed that D-WKM and W-SDP can outperform the barycenter-based K-means approach (B-WKM). We also focused on an unbalanced two-cluster case from MNIST and analyzed its behavior by visualizing the misclassifications of the barycenter-based Wasserstein K-means. The reported time costs of B-WKM, D-WKM and W-SDP on the benchmark datasets suggest that scalability, especially of our approaches, can become a serious issue when the sample size is large. One of our future goals is to address the corresponding computational complexity issues for real data. Another is to obtain more conclusive results on when our approaches are preferable, within or beyond the regime of unbalanced clusters, for real datasets.

Acknowledgments and Disclosure of Funding

Xiaohui Chen was partially supported by NSF CAREER grant DMS-1752614. Yun Yang was partially supported by NSF grant DMS-2210717.

References

Radoslaw Adamczak. A tail inequality for suprema of unbounded empirical processes with applications to Markov chains. Electronic Journal of Probability, 13:1000–1034, 2008. doi: 10.1214/EJP.v13-521. URL https://doi.org/10.1214/EJP.v13-521.

Martial Agueh and Guillaume Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.

Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245–248, 2009.

Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient Flows: in Metric Spaces and in the Space of Probability Measures. Springer Science & Business Media, 2005.

Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.

Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. On the Bures-Wasserstein distance between positive definite matrices. Expositiones Mathematicae, 37(2):165–191, 2019. doi: 10.1016/j.exmath.2018.01.002. URL https://www.sciencedirect.com/science/article/pii/S0723086918300021.

Jérémie Bigot, Raúl Gouet, Thierry Klein, and Alfredo López. Geodesic PCA in the Wasserstein space by convex PCA. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 53(1):1–26, 2017. doi: 10.1214/15-AIHP706. URL https://doi.org/10.1214/15-AIHP706.

Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4):375–417, 1991. doi: 10.1002/cpa.3160440402. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/cpa.3160440402.

Samuel Burer and Renato D. C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003. doi: 10.1007/s10107-002-0352-8. URL https://doi.org/10.1007/s10107-002-0352-8.

Elsa Cazelles, Vivien Seguy, Jérémie Bigot, Marco Cuturi, and Nicolas Papadakis. Geodesic PCA versus log-PCA of histograms in the Wasserstein space. SIAM Journal on Scientific Computing, 40(2):B429–B456, 2018. doi: 10.1137/17M1143459. URL https://doi.org/10.1137/17M1143459.

Frédéric Chazal, Clément Levrard, and Martin Royer. Clustering of measures via mean measure quantization.
Electronic Journal of Statistics, 15(1):2060–2104, 2021. doi: 10.1214/21-EJS1834. URL https://doi.org/10.1214/21-EJS1834.

Xiaohui Chen and Yun Yang. Cutoff for exact recovery of Gaussian mixture models. IEEE Transactions on Information Theory, 67(6):4223–4238, 2021.

Yaqing Chen, Zhenhua Lin, and Hans-Georg Müller. Wasserstein regression. Journal of the American Statistical Association, pages 1–14, 2021. doi: 10.1080/01621459.2021.1956937. URL https://doi.org/10.1080/01621459.2021.1956937.

Pierre Del Moral and Angele Niclas. A Taylor expansion of the square root matrix functional, 2017. URL https://arxiv.org/abs/1705.08561.

G. Domazakis, Dimosthenis Drivaliaris, Sotirios Koukoulas, G. I. Papayiannis, Andrianos E. Tsekrekos, and Athanasios N. Yannacopoulos. Clustering measure-valued data with Wasserstein barycenters. arXiv: Machine Learning, 2019.

Darina Dvinskikh and Daniil Tiapkin. Improved complexity bounds in Wasserstein barycenter problem, 2020. URL https://arxiv.org/abs/2010.04677.

Yingjie Fei and Yudong Chen. Hidden integrality of SDP relaxation for sub-Gaussian mixture models. arXiv:1803.06510, 2018.

Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorn divergences. In Amos Storkey and Fernando Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 1608–1617. PMLR, 2018. URL https://proceedings.mlr.press/v84/genevay18a.html.

Christophe Giraud and Nicolas Verzelen. Partial recovery bounds for clustering with the relaxed K-means. arXiv:1807.07547, 2018.

Jan-Christian Hütter and Philippe Rigollet. Minimax estimation of smooth optimal transport maps, 2019. URL https://arxiv.org/abs/1905.05828.

Hicham Janati, Marco Cuturi, and Alexandre Gramfort. Debiased Sinkhorn barycenters. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4692–4701. PMLR, 2020. URL https://proceedings.mlr.press/v119/janati20a.html.

Young-Heon Kim and Brendan Pass. Wasserstein barycenters over Riemannian manifolds. Advances in Mathematics, 307:640–683, 2017. doi: 10.1016/j.aim.2016.11.026. URL https://www.sciencedirect.com/science/article/pii/S0001870815304643.

Khang Le, Huy Nguyen, Quang M. Nguyen, Tung Pham, Hung Bui, and Nhat Ho. On robust optimal transport: Computational complexity and barycenter computation. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 21947–21959. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/b80ba73857eed2a36dc7640e2310055a-Paper.pdf.

Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:129–137, 1982.

John Lott. Some geometric calculations on Wasserstein space. Communications in Mathematical Physics, 277(2):423–437, 2008. doi: 10.1007/s00220-007-0367-3. URL https://doi.org/10.1007/s00220-007-0367-3.

Yu Lu and Harrison Zhou. Statistical and computational guarantees of Lloyd's algorithm and its variants. arXiv:1612.02099, 2016.

J. B. MacQueen. Some methods for classification and analysis of multivariate observations. Proc. Fifth Berkeley Sympos. Math. Statist. and Probability, pages 281–297, 1967.

Robert J. McCann. A convexity principle for interacting gases.
Advances in Mathematics, 128(1):153–179, 1997. doi: 10.1006/aima.1997.1634. URL https://www.sciencedirect.com/science/article/pii/S0001870897916340.

Marina Meila and Jianbo Shi. Learning segmentation by random walks. In Advances in Neural Information Processing Systems, pages 873–879. MIT Press, 2001.

Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856. MIT Press, 2001.

Jiming Peng and Yu Wei. Approximating k-means-type clustering via semidefinite programming. SIAM Journal on Optimization, 18(1):186–205, 2007.

Philippe Rigollet and Jonathan Weed. Uncoupled isotonic regression via minimum Wasserstein deconvolution. Information and Inference: A Journal of the IMA, 8(4):691–717, 2019. doi: 10.1093/imaiai/iaz006. URL https://doi.org/10.1093/imaiai/iaz006.

Filippo Santambrogio and Xu-Jia Wang. Convexity of the support of the displacement interpolation: Counterexamples. Applied Mathematics Letters, 58:152–158, 2016. doi: 10.1016/j.aml.2016.02.016. URL https://www.sciencedirect.com/science/article/pii/S0893965916300726.

Vivien Seguy and Marco Cuturi. Principal geodesic analysis for probability measures under the optimal transport metric. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/f26dab9bf6a137c3b6782e562794c2f2-Paper.pdf.

Justin Solomon, Fernando de Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics, 34(4), 2015. doi: 10.1145/2766963. URL https://doi.org/10.1145/2766963.

Aad W. van der Vaart and Jon A. Wellner. Maximal Inequalities and Covering Numbers, pages 95–106. Springer New York, New York, NY, 1996. doi: 10.1007/978-1-4757-2545-2_14. URL https://doi.org/10.1007/978-1-4757-2545-2_14.

Santosh Vempala and Grant Wang. A spectral algorithm for learning mixture models. Journal of Computer and System Sciences, 68, 2004.

Isabella Verdinelli and Larry Wasserman. Hybrid Wasserstein distance and fast distribution clustering. Electronic Journal of Statistics, 13(2):5088–5119, 2019. doi: 10.1214/19-EJS1639. URL https://doi.org/10.1214/19-EJS1639.

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018. doi: 10.1017/9781108231596.

Cédric Villani. Topics in Optimal Transportation. Graduate Studies in Mathematics. American Mathematical Society, 2003.

Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

Ulrike von Luxburg, Mikhail Belkin, and Olivier Bousquet. Consistency of spectral clustering. Annals of Statistics, 36(2):555–586, 2008.

Yubo Zhuang, Xiaohui Chen, and Yun Yang. Sketch-and-lift: scalable subsampled semidefinite program for K-means clustering. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 9214–9246. PMLR, 2022. URL https://proceedings.mlr.press/v151/zhuang22a.html.
Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [N/A]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [N/A]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [No]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]