# Multilevel Clustering via Wasserstein Means

Nhat Ho (1), Xuan Long Nguyen (1), Mikhail Yurochkin (1), Hung Hai Bui (2), Viet Huynh (3), Dinh Phung (3)

(1) Department of Statistics, University of Michigan, USA. (2) Adobe Research. (3) Center for Pattern Recognition and Data Analytics (PRaDA), Deakin University, Australia. Correspondence to: Nhat Ho.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s). Code is available at https://github.com/moonfolk/Multilevel-Wasserstein-Means.

**Abstract.** We propose a novel approach to the problem of multilevel clustering, which aims to simultaneously partition data in each group and discover grouping patterns among groups in a potentially large, hierarchically structured corpus of data. Our method involves a joint optimization formulation over several spaces of discrete probability measures, which are endowed with Wasserstein distance metrics. We propose a number of variants of this problem, which admit fast optimization algorithms, by exploiting the connection to the problem of finding Wasserstein barycenters. Consistency properties are established for the estimates of both local and global clusters. Finally, experimental results with both synthetic and real data are presented to demonstrate the flexibility and scalability of the proposed approach.

## 1. Introduction

In numerous applications in engineering and the sciences, data are often organized in a multilevel structure. For instance, a typical structural view of text data in machine learning is to have words grouped into documents and documents grouped into corpora. A prominent strand of modeling and algorithmic work in the past couple of decades has been to discover latent multilevel structures from such hierarchically structured data. For specific clustering tasks, one may be interested in simultaneously partitioning the data in each group (to obtain local clusters) and partitioning the collection of data groups (to obtain global clusters). A concrete example is the problem of clustering images (i.e., global clusters) where each image is partitioned into multiple annotated regions (i.e., local clusters) (Oliva and Torralba, 2001). While hierarchical clustering techniques may be employed to find a tree-structured clustering given a collection of data points, they are not applicable to discovering the nested structure of multilevel data. Bayesian hierarchical models provide a powerful approach, exemplified by influential works such as (Blei et al., 2003; Pritchard et al., 2000; Teh et al., 2006). More specific to the simultaneous and multilevel clustering problem, we mention the work of (Rodriguez et al., 2008), which introduced a Bayesian nonparametric model, namely the nested Dirichlet process (NDP), that enables the inference of clusterings of a collection of probability distributions from which different groups of data are drawn. With suitable extensions, this modeling framework has been further developed for simultaneous multilevel clustering; see, for instance, (Wulsin et al., 2016; Nguyen et al., 2014; Huynh et al., 2016).

The focus of this paper is on the multilevel clustering problem motivated by the aforementioned modeling works, but we shall take a purely optimization approach. We aim to formulate optimization problems that enable the discovery of multilevel clustering structures hidden in grouped data.
Our technical approach is inspired by the role of optimal transport distances in hierarchical modeling and clustering problems. Optimal transport distances, also known as Wasserstein distances (Villani, 2003), have been shown to be the natural distance metrics for the convergence theory of latent mixing measures arising in both mixture models (Nguyen, 2013) and hierarchical models (Nguyen, 2016). They are also intimately connected to the problem of clustering; this relationship goes back at least to the work of (Pollard, 1982), where it is pointed out that the well-known K-means clustering algorithm can be directly linked to the quantization problem, namely the problem of determining an optimal finite discrete probability measure that minimizes its second-order Wasserstein distance from the empirical distribution of the given data (Graf and Luschgy, 2000). If one is to perform simultaneous K-means clustering for hierarchically grouped data, both at the global level (among groups) and at the local level (within each group), then this can be achieved by a joint optimization problem defined with suitable notions of Wasserstein distances inserted into the objective function. In particular, multilevel clustering requires optimization over spaces of probability measures defined at different levels of abstraction, including the space of measures of measures on the space of grouped data. Our goal, therefore, is to formulate this optimization precisely, to develop algorithms for solving it efficiently, and to make sense of the obtained solutions in terms of statistical consistency.

The algorithms that we propose address directly a multilevel clustering problem formulated from a purely optimization viewpoint, but they may also be taken as a fast approximation to the inference of latent mixing measures that arise in the nested Dirichlet process of (Rodriguez et al., 2008). From a statistical viewpoint, we shall establish a consistency theory for our multilevel clustering problem in the manner achieved for K-means clustering (Pollard, 1982). From a computational viewpoint, quite interestingly, we will be able to explicate and exploit the connection between our optimization and that of finding a Wasserstein barycenter (Agueh and Carlier, 2011), a computational problem that has also attracted much recent interest, e.g., (Cuturi and Doucet, 2014).

In summary, the main contributions of this work include: (i) a new optimization formulation of the multilevel clustering problem using Wasserstein distances defined on different levels of the hierarchical data structure; (ii) fast algorithms obtained by exploiting the connection of our formulation to the Wasserstein barycenter problem; (iii) consistency theorems established for the proposed estimates under very mild conditions on the data distributions; (iv) several flexible alternatives obtained by introducing constraints that encourage the borrowing of strength among local and global clusters; and (v) a demonstration of the efficiency and flexibility of our approach on a number of simulated and real data sets.

The paper is organized as follows. Section 2 provides preliminary background on the Wasserstein distance, the Wasserstein barycenter, and the connection between K-means clustering and the quantization problem. Section 3 presents several optimization formulations of the multilevel clustering problem and the algorithms for solving them.
Section 4 establishes consistency results for the estimators introduced in Section 3. Section 5 presents careful simulation studies with both synthetic and real data. Finally, we conclude the paper with a discussion in Section 6. Additional technical details, including all proofs, are given in the Supplement.

## 2. Background

For any given subset $\Theta \subset \mathbb{R}^d$, let $\mathcal{P}(\Theta)$ denote the space of Borel probability measures on $\Theta$. The Wasserstein space of order $r \in [1, \infty)$ of probability measures on $\Theta$ is defined as $\mathcal{P}_r(\Theta) = \{G \in \mathcal{P}(\Theta) : \int \|x\|^r \, dG(x) < \infty\}$, where $\|\cdot\|$ denotes the Euclidean metric in $\mathbb{R}^d$. Additionally, for any $k \geq 1$ the probability simplex is denoted by $\Delta_k = \{u \in \mathbb{R}^k : u_i \geq 0, \ \sum_{i=1}^k u_i = 1\}$. Finally, let $\mathcal{O}_k(\Theta)$ (resp., $\mathcal{E}_k(\Theta)$) be the set of probability measures with at most (resp., exactly) $k$ support points in $\Theta$.

**Wasserstein distances.** For any elements $G$ and $G'$ in $\mathcal{P}_r(\Theta)$ where $r \geq 1$, the Wasserstein distance of order $r$ between $G$ and $G'$ is defined as (cf. (Villani, 2003)):

$$W_r(G, G') = \left( \inf_{\pi \in \Pi(G, G')} \int_{\Theta^2} \|x - y\|^r \, d\pi(x, y) \right)^{1/r},$$

where $\Pi(G, G')$ is the set of all probability measures on $\Theta \times \Theta$ that have marginals $G$ and $G'$. In words, $W_r^r(G, G')$ is the optimal cost of moving mass from $G$ to $G'$, where the cost of moving a unit of mass is proportional to the $r$-th power of the Euclidean distance in $\Theta$. When $G$ and $G'$ are two discrete measures with a finite number of atoms, fast computation of $W_r(G, G')$ can be achieved (see, e.g., (Cuturi, 2013)). The details of this are deferred to the Supplement.

By a recursion of concepts, we can speak of measures of measures, and define a suitable distance metric on this abstract space: the space of Borel measures on $\mathcal{P}_r(\Theta)$, to be denoted by $\mathcal{P}_r(\mathcal{P}_r(\Theta))$. This is also a Polish space (that is, a complete and separable metric space), since $\mathcal{P}_r(\Theta)$ is a Polish space. It will be endowed with a Wasserstein metric of order $r$ that is induced by the metric $W_r$ on $\mathcal{P}_r(\Theta)$ as follows (cf. Section 3 of (Nguyen, 2016)): for any $\mathcal{D}, \mathcal{D}' \in \mathcal{P}_r(\mathcal{P}_r(\Theta))$,

$$W_r(\mathcal{D}, \mathcal{D}') := \left( \inf_{\pi \in \Pi(\mathcal{D}, \mathcal{D}')} \int_{\mathcal{P}_r(\Theta)^2} W_r^r(G, G') \, d\pi(G, G') \right)^{1/r},$$

where the infimum ranges over all $\pi \in \Pi(\mathcal{D}, \mathcal{D}')$, and $\Pi(\mathcal{D}, \mathcal{D}')$ is the set of all probability measures on $\mathcal{P}_r(\Theta) \times \mathcal{P}_r(\Theta)$ that have marginals $\mathcal{D}$ and $\mathcal{D}'$. In words, $W_r(\mathcal{D}, \mathcal{D}')$ corresponds to the optimal cost of moving mass from $\mathcal{D}$ to $\mathcal{D}'$, where the cost of moving a unit of mass in its support space $\mathcal{P}_r(\Theta)$ is proportional to the $r$-th power of the $W_r$ distance on $\mathcal{P}_r(\Theta)$. Note a slight notational abuse: $W_r$ is used for both $\mathcal{P}_r(\Theta)$ and $\mathcal{P}_r(\mathcal{P}_r(\Theta))$, but which one is being used should be clear from context.

**Wasserstein barycenter.** Next, we present a brief overview of the Wasserstein barycenter problem, first studied by (Agueh and Carlier, 2011) and subsequently by many others (e.g., (Benamou et al., 2015; Solomon et al., 2015; Álvarez-Esteban et al., 2016)). Given probability measures $P_1, P_2, \ldots, P_N \in \mathcal{P}_2(\Theta)$ for $N \geq 1$, their Wasserstein barycenter $\overline{P}_{N,\lambda}$ is defined as

$$\overline{P}_{N,\lambda} = \underset{P \in \mathcal{P}_2(\Theta)}{\arg\min} \ \sum_{i=1}^N \lambda_i \, W_2^2(P, P_i), \tag{1}$$

where $\lambda \in \Delta_N$ denotes the weights associated with $P_1, \ldots, P_N$. When $P_1, \ldots, P_N$ are discrete measures with a finite number of atoms and the weights $\lambda$ are uniform, it was shown by (Anderes et al., 2015) that the problem of finding the Wasserstein barycenter $\overline{P}_{N,\lambda}$ over the space $\mathcal{P}_2(\Theta)$ in (1) reduces to a search over the much simpler space $\mathcal{O}_l(\Theta)$, where $l = \sum_{i=1}^N s_i - N + 1$ and $s_i$ is the number of components of $P_i$ for $1 \leq i \leq N$. Efficient algorithms for finding local solutions of the Wasserstein barycenter problem over $\mathcal{O}_k(\Theta)$ for some $k \geq 1$ have been studied recently in (Cuturi and Doucet, 2014).
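To make the definition of $W_r$ concrete, the following is a minimal sketch, not the paper's implementation (whose fast computation via (Cuturi, 2013) is deferred to the Supplement), that computes the order-2 distance between two small discrete measures by solving the transport linear program directly with SciPy; the atoms and weights used in the demo are arbitrary illustrative values.

```python
# Minimal sketch: exact order-2 Wasserstein distance between discrete measures
# G = sum_i a_i * delta_{x_i} and G' = sum_j b_j * delta_{y_j}, posed as the
# linear program over transport plans pi with marginals a and b.
import numpy as np
from scipy.optimize import linprog

def wasserstein2_discrete(x, a, y, b):
    """Return W_2(G, G') for discrete measures with atoms x, y and weights a, b."""
    n, m = len(a), len(b)
    # Cost matrix: squared Euclidean distances between atoms.
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    # Equality constraints: row sums of pi equal a, column sums equal b.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([a, b])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return np.sqrt(res.fun)  # res.fun is the optimal cost, i.e., W_2^2(G, G')

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, y = rng.normal(size=(4, 2)), rng.normal(size=(3, 2)) + 2.0
    a, b = np.full(4, 1 / 4), np.full(3, 1 / 3)
    print("W_2(G, G') ~", wasserstein2_discrete(x, a, y, b))
```

For larger inputs one would typically replace the exact linear program with entropically regularized (Sinkhorn) iterations in the spirit of (Cuturi, 2013).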
The barycenter algorithms of (Cuturi and Doucet, 2014) will prove to be a useful building block for our method, as we shall describe in the sequel. The notion of the Wasserstein barycenter has also been utilized for approximate Bayesian inference (Srivastava et al., 2015).

**K-means as a quantization problem.** The well-known K-means clustering algorithm can be viewed as solving an optimization problem that arises in the problem of quantization, a simple but very useful connection (Pollard, 1982; Graf and Luschgy, 2000). The connection is the following. Given $n$ unlabelled samples $Y_1, \ldots, Y_n \in \Theta$, assume that these data are associated with at most $k$ clusters, where $k \geq 1$ is some given number. The K-means problem finds the set $S$ containing at most $k$ elements $\theta_1, \ldots, \theta_k \in \Theta$ that minimizes the objective

$$\inf_{S : |S| \leq k} \ \frac{1}{n} \sum_{i=1}^n d^2(Y_i, S). \tag{2}$$

Let $P_n = \frac{1}{n} \sum_{i=1}^n \delta_{Y_i}$ be the empirical measure of the data $Y_1, \ldots, Y_n$. Then problem (2) is equivalent to finding a discrete probability measure $G$ with a finite number of support points that solves

$$\inf_{G \in \mathcal{O}_k(\Theta)} \ W_2^2(G, P_n). \tag{3}$$

Due to the inclusion of the Wasserstein metric in its formulation, we call this a Wasserstein means problem. This problem can be further thought of as a Wasserstein barycenter problem with $N = 1$. In light of this observation, as noted by (Cuturi and Doucet, 2014), the algorithm for finding the Wasserstein barycenter offers an alternative to the popular Lloyd's algorithm for determining a local minimum of the K-means objective.

## 3. Clustering with multilevel structure data

Given $m$ groups of $n_j$ exchangeable data points $X_{j,i}$, where $1 \leq j \leq m$ and $1 \leq i \leq n_j$ (i.e., the data are presented in a two-level grouping structure), our goal is to learn the two-level clustering structure of the data. We want to obtain simultaneously local clusters for each data group and global clusters among all groups.

### 3.1. Multilevel Wasserstein Means (MWM) Algorithm

For any $j = 1, \ldots, m$, we denote the empirical measure for group $j$ by $P_{n_j}^j := \frac{1}{n_j} \sum_{i=1}^{n_j} \delta_{X_{j,i}}$. Throughout this section, for simplicity of exposition, we assume that the numbers of both local and global clusters are either known or bounded above by a given number. In particular, for local clustering we allow group $j$ to have at most $k_j$ clusters for $j = 1, \ldots, m$. For global clustering, we assume there are $M$ group-level (Wasserstein) means among the $m$ given groups.

**High-level idea.** For local clustering, for each $j = 1, \ldots, m$, performing K-means clustering for group $j$, as expressed by (3), can be viewed as finding a finite discrete measure $G_j \in \mathcal{O}_{k_j}(\Theta)$ that minimizes the squared Wasserstein distance $W_2^2(G_j, P_{n_j}^j)$. For global clustering, we are interested in obtaining clusters out of the $m$ groups, each of which is now represented by the discrete measure $G_j$, for $j = 1, \ldots, m$. Adopting again the viewpoint of Eq. (3), provided that all of the $G_j$'s are given, we can apply the K-means quantization method to find their distributional clusters. The global clustering in the space of measures of measures on $\Theta$ can be succinctly expressed by

$$\inf_{\mathcal{H} \in \mathcal{E}_M(\mathcal{P}_2(\Theta))} \ W_2^2\Big(\mathcal{H}, \ \frac{1}{m} \sum_{j=1}^m \delta_{G_j}\Big).$$

However, the $G_j$ are not known; they have to be optimized through local clustering in each data group.

**MWM problem formulation.** We have arrived at an objective function for jointly optimizing over both local and global clusters:

$$\inf_{G_j \in \mathcal{O}_{k_j}(\Theta), \ \mathcal{H} \in \mathcal{E}_M(\mathcal{P}_2(\Theta))} \ \sum_{j=1}^m W_2^2(G_j, P_{n_j}^j) + W_2^2\Big(\mathcal{H}, \ \frac{1}{m} \sum_{j=1}^m \delta_{G_j}\Big). \tag{4}$$

We call the above optimization the problem of Multilevel Wasserstein Means (MWM).
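The sketch below is an illustrative evaluation of the joint objective (4) for fixed candidate local measures $G_j$ and global atoms $H_1, \ldots, H_M$, with the weights of $\mathcal{H}$ implicitly optimized out so that the global term reduces to the distance from each $G_j$ to its closest global atom (this reduction is made precise in the equivalent form (5) below). It assumes the third-party POT package (`ot.emd2` returns the exact transport cost); the helper names are ours, not the authors' released code.

```python
# Sketch: evaluate the MWM objective for fixed candidates G_j and H_1..H_M.
# Assumes the POT package (pip install pot); with a squared Euclidean ground
# cost, ot.emd2 returns W_2^2 between two discrete measures.
import numpy as np
import ot  # POT: Python Optimal Transport

def w2_sq(x, a, y, b):
    """Squared 2-Wasserstein distance between discrete measures (x, a) and (y, b)."""
    return ot.emd2(a, b, ot.dist(x, y, metric="sqeuclidean"))

def mwm_objective(data, G, H):
    """
    data: list of (n_j, d) arrays, one per group.
    G:    list of (atoms, weights) pairs, the candidate local measures G_j.
    H:    list of (atoms, weights) pairs, the candidate global atoms H_1..H_M.
    """
    total = 0.0
    m = len(data)
    for X, (gx, gw) in zip(data, G):
        a = np.full(len(X), 1.0 / len(X))           # empirical measure P^j_{n_j}
        local = w2_sq(gx, gw, X, a)                  # W_2^2(G_j, P^j_{n_j})
        nearest = min(w2_sq(gx, gw, hx, hw) for hx, hw in H)  # min_i W_2^2(G_j, H_i)
        total += local + nearest / m
    return total
```

Under this reading, the global term carries a $1/m$ factor, which matches the $/m$ weighting that appears in the local update of Algorithm 1 below.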
The notable feature of MWM is that its loss function consists of two types of distances associated with the hierarchical data structure: one is a distance in the space of measures, e.g., $W_2^2(G_j, P_{n_j}^j)$, and the other is a distance in the space of measures of measures, e.g., $W_2^2(\mathcal{H}, \frac{1}{m} \sum_{j=1}^m \delta_{G_j})$.

By adopting K-means optimization for both local and global clustering, the multilevel Wasserstein means problem might look formidable at first sight. Fortunately, it is possible to simplify this original formulation substantially by exploiting the structure of $\mathcal{H}$. Indeed, we can show that formulation (4) is equivalent to the following optimization problem, which looks much simpler as it involves only measures on $\Theta$:

$$\inf_{G_j \in \mathcal{O}_{k_j}(\Theta), \ \boldsymbol{H}} \ \sum_{j=1}^m W_2^2(G_j, P_{n_j}^j) + \frac{1}{m} \sum_{j=1}^m d_{W_2}^2(G_j, \boldsymbol{H}), \tag{5}$$

where $d_{W_2}^2(G, \boldsymbol{H}) := \min_{1 \leq i \leq M} W_2^2(G, H_i)$ and $\boldsymbol{H} = (H_1, \ldots, H_M)$, with each $H_i \in \mathcal{P}_2(\Theta)$. The proof of this equivalence is deferred to Proposition B.4 in the Supplement. Before going into the details of the algorithm for solving (5) in Section 3.1.2, we shall present some simpler cases, which help to illustrate properties of the optimal solutions of (5) while providing insights into subsequent developments of the MWM formulation. Readers may proceed directly to Section 3.1.2 for the description of the algorithm on a first reading.

#### 3.1.1. Properties of MWM in special cases

**Example 1.** Suppose $k_j = 1$ and $n_j = n$ for all $1 \leq j \leq m$, and $M = 1$. Write $\boldsymbol{H} = H \in \mathcal{P}_2(\Theta)$. Under this setting, the objective function (5) can be rewritten as

$$\inf_{\theta_j \in \Theta, \ H \in \mathcal{P}_2(\Theta)} \ \sum_{j=1}^m \bigg( \sum_{i=1}^n \|\theta_j - X_{j,i}\|^2 + W_2^2(\delta_{\theta_j}, H)/m \bigg), \tag{6}$$

where $G_j = \delta_{\theta_j}$ for any $1 \leq j \leq m$. From the result of Theorem A.1 in the Supplement,

$$\inf_{H \in \mathcal{P}_2(\Theta)} \sum_{j=1}^m W_2^2(\delta_{\theta_j}, H) \ = \ \inf_{H \in \mathcal{E}_1(\Theta)} \sum_{j=1}^m W_2^2(\delta_{\theta_j}, H) \ = \ \sum_{j=1}^m \Big\| \theta_j - \Big(\sum_{i=1}^m \theta_i\Big)/m \Big\|^2,$$

where the second infimum is achieved when $H = \delta_{(\sum_{i=1}^m \theta_i)/m}$. Thus, objective function (6) may be rewritten as

$$\inf_{\theta_j \in \Theta} \ \sum_{j=1}^m \bigg( \sum_{i=1}^n \|\theta_j - X_{j,i}\|^2 + \Big\| m\theta_j - \sum_{l=1}^m \theta_l \Big\|^2 / m^3 \bigg).$$

Write $\overline{X}_j = \big(\sum_{i=1}^n X_{j,i}\big)/n$ for all $1 \leq j \leq m$. As $m \geq 2$, we can check that the unique optimal solutions of the above optimization problem are

$$\theta_j = \frac{(m^2 n + 1)\overline{X}_j + \sum_{i \neq j} \overline{X}_i}{m^2 n + m} \quad \text{for any } 1 \leq j \leq m.$$

If we further assume that our data $X_{j,i}$ are i.i.d. samples from a probability measure $P^j$ with mean $\mu_j = \mathbb{E}_{X \sim P^j}(X)$ for any $1 \leq j \leq m$, the previous result implies that $\theta_i \neq \theta_j$ almost surely as long as $\mu_i \neq \mu_j$. As a consequence, if the $\mu_j$ are pairwise different, the multilevel Wasserstein means under this simple scenario of (5) will not have identical centers among the local groups. On the other hand, we have

$$W_2^2(G_i, G_j) = \|\theta_i - \theta_j\|^2 = \Big(\frac{mn}{mn + 1}\Big)^2 \|\overline{X}_i - \overline{X}_j\|^2.$$

Now, from the definition of the Wasserstein distance,

$$W_2^2(P_n^i, P_n^j) = \min_{\sigma} \frac{1}{n} \sum_{l=1}^n \|X_{i,l} - X_{j,\sigma(l)}\|^2 \ \geq \ \|\overline{X}_i - \overline{X}_j\|^2 \ \geq \ W_2^2(G_i, G_j),$$

where $\sigma$ varies over all permutations of $\{1, 2, \ldots, n\}$ and the first inequality is due to the Cauchy-Schwarz inequality. It implies that as long as $W_2^2(P_n^i, P_n^j)$ is small, the optimal solutions $G_i$ and $G_j$ of (6) will be sufficiently close to each other. By letting $n \to \infty$, we also obtain the same conclusion regarding the asymptotic behavior of $G_i$ and $G_j$ with respect to $W_2(P^i, P^j)$.
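The closed form in Example 1 can be sanity-checked numerically. The snippet below is our own check, not taken from the paper's code: it minimizes the reduced objective displayed above with a generic optimizer on randomly generated groups and compares the result against the displayed formula; the data-generating choices are arbitrary.

```python
# Sketch: numerically verify Example 1's closed-form minimizer by comparing it
# against a generic optimizer applied to the reduced objective
#   sum_j [ sum_i ||theta_j - X_{j,i}||^2 + ||m*theta_j - sum_l theta_l||^2 / m^3 ].
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
m, n, d = 4, 20, 3
# Groups with distinct (random) means so that the theta_j should differ.
X = rng.normal(size=(m, n, d)) + rng.normal(scale=3.0, size=(m, 1, d))
Xbar = X.mean(axis=1)                                   # group means X̄_j

def objective(flat_theta):
    theta = flat_theta.reshape(m, d)
    data_fit = ((X - theta[:, None, :]) ** 2).sum()
    S = theta.sum(axis=0)
    penalty = ((m * theta - S) ** 2).sum() / m**3
    return data_fit + penalty

# Closed form from Example 1: ((m^2 n + 1) X̄_j + sum_{i != j} X̄_i) / (m^2 n + m).
theta_closed = ((m**2 * n + 1) * Xbar + (Xbar.sum(axis=0) - Xbar)) / (m**2 * n + m)
# Generic numerical minimizer, started from the group means.
res = minimize(objective, Xbar.ravel(), method="L-BFGS-B")
# Should be near zero, up to optimizer tolerance.
print(np.max(np.abs(res.x.reshape(m, d) - theta_closed)))
```

The same setup also makes it easy to check numerically that $\|\theta_i - \theta_j\| = \frac{mn}{mn+1}\|\overline{X}_i - \overline{X}_j\|$ for the closed-form solution.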
**Example 2.** Suppose $k_j = 1$ and $n_j = n$ for all $1 \leq j \leq m$, and $M = 2$. Write $\boldsymbol{H} = (H_1, H_2)$. Moreover, assume that there is a strict subset $A$ of $\{1, 2, \ldots, m\}$ such that

$$\max\Big\{ \max_{i,j \in A} W_2(P_n^i, P_n^j), \ \max_{i,j \in A^c} W_2(P_n^i, P_n^j) \Big\} \ \ll \ \min_{i \in A, \, j \in A^c} W_2(P_n^i, P_n^j),$$

i.e., the distances between the empirical measures $P_n^i$ and $P_n^j$ when $i$ and $j$ belong to the same set ($A$ or $A^c$) are much smaller than those when $i$ and $j$ do not belong to the same set. Under this condition, by using the argument from Example 1 we can write the objective function (5) as

$$\inf_{\theta_j \in \Theta, \ H_1 \in \mathcal{P}_2(\Theta)} \ \sum_{j \in A} \bigg( \sum_{i=1}^n \|\theta_j - X_{j,i}\|^2 + W_2^2(\delta_{\theta_j}, H_1)/m \bigg) \ + \ \inf_{\theta_j \in \Theta, \ H_2 \in \mathcal{P}_2(\Theta)} \ \sum_{j \in A^c} \bigg( \sum_{i=1}^n \|\theta_j - X_{j,i}\|^2 + W_2^2(\delta_{\theta_j}, H_2)/m \bigg).$$

The above objective function suggests that the optimal solutions $\theta_i, \theta_j$ (equivalently, $G_i$ and $G_j$) will not be close to each other as long as $i$ and $j$ do not belong to the same set $A$ or $A^c$, i.e., as long as $P_n^i$ and $P_n^j$ are far apart. Therefore, the two groups of local measures $G_j$ do not share atoms under this setting of empirical measures.

The examples examined above indicate that the MWM problem in general does not encourage the local measures $G_j$ to share atoms with one another in its solution. Additionally, when the empirical measures of local groups are very close, this may also suggest that they belong to the same cluster, and the distances among the optimal local measures $G_j$ can be very small.

#### 3.1.2. Algorithm description

Now we are ready to describe our algorithm in the general case. This is a procedure for finding a local minimum of Problem (5) and is summarized in Algorithm 1; a condensed code sketch of its outer loop is given after Theorem 3.1 below.

**Algorithm 1: Multilevel Wasserstein Means (MWM)**

- **Input:** data $X_{j,i}$; parameters $k_j$, $M$.
- **Output:** probability measures $G_j$ and elements $H_i$ of $\boldsymbol{H}$.
- Initialize measures $G_j^{(0)}$ and elements $H_i^{(0)}$ of $\boldsymbol{H}^{(0)}$; set $t = 0$.
- **while** $Y_j^{(t)}$, $b_j^{(t)}$, $H_i^{(t)}$ have not converged **do**
  1. Update $Y_j^{(t)}$ and $b_j^{(t)}$ for $1 \leq j \leq m$: **for** $j = 1$ to $m$ **do**
     $i_j \leftarrow \arg\min_{1 \leq u \leq M} W_2^2(G_j^{(t)}, H_u^{(t)})$;
     $G_j^{(t+1)} \leftarrow \arg\min_{G_j \in \mathcal{O}_{k_j}(\Theta)} W_2^2(G_j, P_{n_j}^j) + W_2^2(G_j, H_{i_j}^{(t)})/m$.
     **end for**
  2. Update $H_i^{(t)}$ for $1 \leq i \leq M$: **for** $j = 1$ to $m$ **do**
     $i_j \leftarrow \arg\min_{1 \leq u \leq M} W_2^2(G_j^{(t+1)}, H_u^{(t)})$.
     **end for**
     **for** $i = 1$ to $M$ **do**
     $C_i \leftarrow \{l : i_l = i\}$;
     $H_i^{(t+1)} \leftarrow \arg\min_{H_i \in \mathcal{P}_2(\Theta)} \sum_{l \in C_i} W_2^2(H_i, G_l^{(t+1)})$.
     **end for**
  3. $t \leftarrow t + 1$.
- **end while**

We prepare the following details regarding the initialization and updating steps required by the algorithm:

- The initialization of the local measures $G_j^{(0)}$ (i.e., the initialization of their atoms and weights) can be obtained by performing K-means clustering on the local data $X_{j,i}$ for $1 \leq j \leq m$. The initialization of the elements $H_i^{(0)}$ of $\boldsymbol{H}^{(0)}$ is based on a simple extension of the K-means algorithm; details are given in Algorithm 3 in the Supplement.
- The updates $G_j^{(t+1)}$ can be computed efficiently by simply using the algorithms from (Cuturi and Doucet, 2014) to search for local solutions of these barycenter problems within the space $\mathcal{O}_{k_j}(\Theta)$, starting from the atoms and weights of $G_j^{(t)}$.
- Since all $G_j^{(t+1)}$ are finite discrete measures, finding the updates for $H_i^{(t+1)}$ over the whole space $\mathcal{P}_2(\Theta)$ can be reduced to searching for a local solution within the space $\mathcal{O}_{l(t)}(\Theta)$, where $l(t) = \sum_{j \in C_i} |\mathrm{supp}(G_j^{(t+1)})| - |C_i| + 1$, starting from the global atoms $H_i^{(t)}$ of $\boldsymbol{H}^{(t)}$ (the justification of this reduction is derived from Theorem A.1 in the Supplement). This again can be done by utilizing algorithms from (Cuturi and Doucet, 2014). Note that, since $l(t)$ becomes very large when $m$ is large, to speed up the computation of Algorithm 1 we impose a threshold $L$, e.g., $L = 10$, on $l(t)$ in our implementation.

The following guarantee for Algorithm 1 can be established:

**Theorem 3.1.** Algorithm 1 monotonically decreases the objective function (4) of the MWM formulation.
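To make the alternating updates concrete, here is a condensed sketch of Algorithm 1's outer loop. It leans on POT's free-support barycenter routine in place of the barycenter solvers of (Cuturi and Doucet, 2014); the POT API (`ot.lp.free_support_barycenter`, `ot.emd2`), the random initialization, and the uniform atom weights are assumptions made to keep the sketch short, and it omits the threshold $L$ and the weight updates of the full algorithm.

```python
# Condensed sketch of Algorithm 1 (MWM), assuming the third-party POT package.
# Simplifications vs. the paper: atoms of each G_j and H_i keep uniform weights
# and a fixed support size, and initialization uses random subsampling rather
# than the K-means-based scheme of Algorithm 3 in the Supplement.
import numpy as np
import ot  # POT: Python Optimal Transport

def w2_sq(x, y):
    """Squared W_2 between uniform discrete measures supported on rows of x and y."""
    a = np.full(len(x), 1.0 / len(x))
    b = np.full(len(y), 1.0 / len(y))
    return ot.emd2(a, b, ot.dist(x, y, metric="sqeuclidean"))

def mwm(data, k_local, M, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    m = len(data)
    # Initialize local atoms by subsampling each group; global atoms from random G_j.
    G = [X[rng.choice(len(X), size=k_local, replace=False)] for X in data]
    H = [G[j].copy() for j in rng.choice(m, size=M, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign each group to its closest global atom, then update G_j as a
        # weighted barycenter of P^j_{n_j} (weight m/(m+1)) and H_{i_j} (weight 1/(m+1)),
        # i.e., minimize W_2^2(G_j, P^j_{n_j}) + W_2^2(G_j, H_{i_j}) / m.
        assign = [int(np.argmin([w2_sq(Gj, Hi) for Hi in H])) for Gj in G]
        for j, X in enumerate(data):
            Hi = H[assign[j]]
            locs = [X, Hi]
            wts = [np.full(len(X), 1.0 / len(X)), np.full(len(Hi), 1.0 / len(Hi))]
            G[j] = ot.lp.free_support_barycenter(
                locs, wts, G[j], weights=np.array([m / (m + 1.0), 1.0 / (m + 1.0)]))
        # Step 2: reassign groups, then update each global atom as the barycenter
        # of the local measures currently assigned to it.
        assign = [int(np.argmin([w2_sq(Gj, Hi) for Hi in H])) for Gj in G]
        for i in range(M):
            members = [G[j] for j in range(m) if assign[j] == i]
            if members:
                wts = [np.full(len(Gj), 1.0 / len(Gj)) for Gj in members]
                H[i] = ot.lp.free_support_barycenter(members, wts, H[i])
    return G, H
```

A call such as `G, H = mwm([np.random.randn(50, 2) + c for c in range(5)], k_local=3, M=2)` would exercise the loop; the paper's implementation instead carries out both updates with the barycenter algorithms of (Cuturi and Doucet, 2014), optimizing atom weights as well as locations.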
### 3.2. Multilevel Wasserstein Means with Sharing (MWMS)

As we have observed from the analysis of several specific cases, the multilevel Wasserstein means formulation may not encourage the sharing of components locally among the $m$ groups in its solution. However, enforced sharing has been demonstrated to be a very useful technique, which leads to the borrowing of strength among different parts of the model, consequently improving inferential efficiency (Teh et al., 2006; Nguyen, 2016). In this section, we seek to encourage the borrowing of strength among groups by imposing additional constraints on the atoms of $G_1, \ldots, G_m$ in the original MWM formulation (4). Denote

$$\mathcal{A}_{M, S_K} = \Big\{ G_j \in \mathcal{O}_K(\Theta), \ \mathcal{H} \in \mathcal{E}_M(\mathcal{P}(\Theta)) \ : \ \mathrm{supp}(G_j) \subseteq S_K \ \ \forall \, 1 \leq j \leq m \Big\}$$

for any given $K, M \geq 1$, where the constraint set $S_K$ has exactly $K$ elements. To simplify the exposition, let us assume that $k_j = K$ for all $1 \leq j \leq m$. Consider the following locally constrained version of the multilevel Wasserstein means problem:

$$\inf \ \sum_{j=1}^m W_2^2(G_j, P_{n_j}^j) + W_2^2\Big(\mathcal{H}, \ \frac{1}{m} \sum_{j=1}^m \delta_{G_j}\Big), \tag{7}$$

where the infimum is taken over $S_K$ and $(G_j, \mathcal{H}) \in \mathcal{A}_{M, S_K}$. We call the above optimization the problem of Multilevel Wasserstein Means with Sharing (MWMS). The local constraint assumption $\mathrm{supp}(G_j) \subseteq S_K$ has been utilized previously in the literature; see, for example, the work of (Kulis and Jordan, 2012), who developed an optimization-based approach to the inference of the HDP (Teh et al., 2006), which also explicitly encourages the sharing of local group means among local clusters. Now, we can rewrite objective function (7) as follows:

$$\inf_{S_K, \, G_j, \, \boldsymbol{H} \in \mathcal{B}_{M, S_K}} \ \sum_{j=1}^m W_2^2(G_j, P_{n_j}^j) + \frac{1}{m} \sum_{j=1}^m d_{W_2}^2(G_j, \boldsymbol{H}), \tag{8}$$

where $\mathcal{B}_{M, S_K} = \big\{ G_j \in \mathcal{O}_K(\Theta), \ \boldsymbol{H} = (H_1, \ldots, H_M) \ : \ \mathrm{supp}(G_j) \subseteq S_K \ \forall \, 1 \leq j \leq m \big\}$. The high-level idea for finding local minima of objective function (8) is first to update the elements of the constraint set $S_K$ to provide the supports for the local measures $G_j$, and then to obtain the weights of these measures, as well as the elements of the global set $\boldsymbol{H}$, by computing appropriate Wasserstein barycenters. Due to space constraints, the details of these steps of the MWMS algorithm (Algorithm 2) are deferred to the Supplement.

## 4. Consistency results

We proceed to establish consistency for the estimators introduced in the previous section. For brevity of presentation, we focus only on the MWM method; consistency for MWMS can be obtained in a similar fashion. Fix $m$, and assume that $P^j$ is the true distribution of the data $X_{j,i}$ for $j = 1, \ldots, m$. Write $\boldsymbol{G} = (G_1, \ldots, G_m)$ and $\boldsymbol{n} = (n_1, \ldots, n_m)$. We say $\boldsymbol{n} \to \infty$ if $n_j \to \infty$ for $j = 1, \ldots, m$. Define the following functions:

$$f_{\boldsymbol{n}}(\boldsymbol{G}, \mathcal{H}) = \sum_{j=1}^m W_2^2(G_j, P_{n_j}^j) + W_2^2\Big(\mathcal{H}, \ \frac{1}{m} \sum_{j=1}^m \delta_{G_j}\Big),$$

$$f(\boldsymbol{G}, \mathcal{H}) = \sum_{j=1}^m W_2^2(G_j, P^j) + W_2^2\Big(\mathcal{H}, \ \frac{1}{m} \sum_{j=1}^m \delta_{G_j}\Big),$$

where $G_j \in \mathcal{O}_{k_j}(\Theta)$ and $\mathcal{H} \in \mathcal{E}_M(\mathcal{P}_2(\Theta))$ for $1 \leq j \leq m$. The first consistency property of the MWM formulation is the following:

**Theorem 4.1.** Suppose that $P^j \in \mathcal{P}_2(\Theta)$ for $1 \leq j \leq m$. Then, almost surely, as $\boldsymbol{n} \to \infty$,

$$\inf_{G_j \in \mathcal{O}_{k_j}(\Theta), \ \mathcal{H} \in \mathcal{E}_M(\mathcal{P}_2(\Theta))} f_{\boldsymbol{n}}(\boldsymbol{G}, \mathcal{H}) \ \longrightarrow \ \inf_{G_j \in \mathcal{O}_{k_j}(\Theta), \ \mathcal{H} \in \mathcal{E}_M(\mathcal{P}_2(\Theta))} f(\boldsymbol{G}, \mathcal{H}).$$

The next theorem establishes that the true global and local clusters can be recovered. To this end, assume that for each $\boldsymbol{n}$ there is an optimal solution $(\widehat{G}_1^{n_1}, \ldots, \widehat{G}_m^{n_m}, \widehat{\mathcal{H}}_n)$, or in short $(\widehat{\boldsymbol{G}}_{\boldsymbol{n}}, \widehat{\mathcal{H}}_n)$, of the objective function (4). Moreover, assume there exists a (not necessarily unique) optimal solution minimizing $f(\boldsymbol{G}, \mathcal{H})$ over $G_j \in \mathcal{O}_{k_j}(\Theta)$ and $\mathcal{H} \in \mathcal{E}_M(\mathcal{P}_2(\Theta))$. Let $\mathcal{F}$ be the collection of such optimal solutions. For any $G_j \in \mathcal{O}_{k_j}(\Theta)$ and $\mathcal{H} \in \mathcal{E}_M(\mathcal{P}_2(\Theta))$, define

$$d(\boldsymbol{G}, \mathcal{H}, \mathcal{F}) = \inf_{(\boldsymbol{G}^0, \mathcal{H}^0) \in \mathcal{F}} \ \sum_{j=1}^m W_2^2(G_j, G_j^0) + W_2^2(\mathcal{H}, \mathcal{H}^0).$$

Given the above assumptions, we have the following result regarding the convergence of $(\widehat{\boldsymbol{G}}_{\boldsymbol{n}}, \widehat{\mathcal{H}}_n)$:

**Theorem 4.2.** Assume that $\Theta$ is bounded and $P^j \in \mathcal{P}_2(\Theta)$ for all $1 \leq j \leq m$. Then $d(\widehat{\boldsymbol{G}}_{\boldsymbol{n}}, \widehat{\mathcal{H}}_n, \mathcal{F}) \to 0$ as $\boldsymbol{n} \to \infty$, almost surely.

**Remark.** (i) The assumption that $\Theta$ is bounded is made only for convenience of the proof argument. We believe that the conclusion of this theorem may still hold when $\Theta = \mathbb{R}^d$.
(ii) If $|\mathcal{F}| = 1$, i.e., there exists a unique optimal solution $(\boldsymbol{G}^0, \mathcal{H}^0)$ minimizing $f(\boldsymbol{G}, \mathcal{H})$ over $G_j \in \mathcal{O}_{k_j}(\Theta)$ and $\mathcal{H} \in \mathcal{E}_M(\mathcal{P}_2(\Theta))$, then the result of Theorem 4.2 implies that $W_2(\widehat{G}_j^{n_j}, G_j^0) \to 0$ for $1 \leq j \leq m$ and $W_2(\widehat{\mathcal{H}}_n, \mathcal{H}^0) \to 0$ as $\boldsymbol{n} \to \infty$.

## 5. Empirical studies

### 5.1. Synthetic data

In this section, we evaluate the effectiveness of both the MWM and MWMS clustering algorithms under different synthetic data-generating processes. Unless otherwise specified, we set the number of groups $m = 50$, the number of observations per group $n_j = 50$ in $d = 10$ dimensions, and the number of global clusters $M = 5$, each with 6 atoms. For Algorithm 1 (MWM) the local measures $G_j$ have 5 atoms each; for Algorithm 2 (MWMS) the number of atoms in the constraint set $S_K$ is 50. As a benchmark for comparison we use a basic three-stage K-means approach (the details of which can be found in the Supplement). The Wasserstein distance between the estimated distributions (i.e., $\widehat{G}_1, \ldots, \widehat{G}_m$; $\widehat{H}_1, \ldots, \widehat{H}_M$) and the data-generating ones is used as the comparison metric. Recall that the MWM formulation does not impose constraints on the atoms of the $G_i$, while the MWMS formulation explicitly enforces the sharing of atoms across these measures. We used multiple layers of mixtures, adding Gaussian noise at each layer, to generate global and local clusters and the no-constraint (NC) data. We varied the number of groups $m$ from 500 to 10,000.

We notice that the three-stage K-means algorithm performs best when there is no constraint structure and the variance is constant across clusters (Fig. 1(a) and 2(a)); this is, not surprisingly, a favorable setting for the basic K-means method. As soon as we depart from the (unrealistic) constant-variance, no-sharing assumption, both of our algorithms start to outperform the basic three-stage K-means. The superior performance is most pronounced on the local-constraint (LC) data, with or without constant variance; see Fig. 1(c,d). It is worth noting that even when the group variances are constant, the three-stage K-means is no longer effective because it fails to account for the shared structure. When $m = 50$ and the group sizes are larger, we set the number of atoms in the constraint set $S_K$ to 15. Results are reported in Fig. 2(c,d). These results demonstrate the effectiveness and flexibility of both of our algorithms.

### 5.2. Real data analysis

We applied our multilevel clustering algorithms to two real-world datasets: LabelMe and StudentLife.

The LabelMe dataset consists of 2,688 annotated images classified into 8 scene categories, including tall buildings, inside city, street, highway, coast, open country, mountain, and forest (Oliva and Torralba, 2001). Each image contains multiple annotated regions. Each region, which is annotated by users, represents an object in the image.
As shown in Figure 4, the left image is from the open country category and contains 4 regions, while the right image is from the tall buildings category and contains 16 regions. Note that the regions in each image can overlap. We remove the images containing fewer than 4 regions and obtain 1,800 images. We then extract a GIST feature (Oliva and Torralba, 2001) for each region in an image. GIST is a visual descriptor that represents perceptual dimensions and oriented spatial structures of a scene. Each GIST descriptor is a 512-dimensional vector. We further use PCA to project the GIST features into 30 dimensions. Finally, we obtain 1,800 "documents", each of which contains regions as observations, with each region represented by a 30-dimensional vector. We can then cluster the regions in every image, since they are visually correlated, and at the next level of clustering we can cluster images into scene categories.

Figure 1: Data with a lot of small groups (Wasserstein distance to truth versus number of groups, for Algorithm 1, Algorithm 2, and K-means): (a) NC data with constant variance; (b) NC data with non-constant variance; (c) LC data with constant variance; (d) LC data with non-constant variance.

Figure 2: Data with few big groups (Wasserstein distance to truth versus number of observations per group, for Algorithm 1, Algorithm 2, and K-means): (a) NC data with constant variance; (b) NC data with non-constant variance; (c) LC data with constant variance; (d) LC data with non-constant variance.

Figure 4: Examples of images used in the LabelMe dataset. Each image consists of different annotated regions (e.g., sky, field, tree, flowers, building, car, sidewalk, road, poster).

The StudentLife dataset is a large dataset frequently used in pervasive and ubiquitous computing research. Its data signals consist of multiple channels (e.g., WiFi signals, Bluetooth scans, etc.), collected from the smartphones of 49 students at Dartmouth College over a 10-week spring term in 2013. In our experiments, however, we use only the WiFi signal strengths. We applied a procedure similar to the one described in (Nguyen et al., 2016) to pre-process the data. We aggregate the number of scans by each WiFi access point and select the 500 WiFi IDs with the highest frequencies. Eventually, we obtain 49 documents with approximately 4.6 million 500-dimensional data points in total.

Table 1: Clustering performance for the LabelMe dataset.

| Methods   | NMI   | ARI   | AMI   | Time (s) |
|-----------|-------|-------|-------|----------|
| K-means   | 0.349 | 0.237 | 0.324 | 0.3      |
| TSK-means | 0.236 | 0.112 | 0.22  | 218      |
| MC2       | 0.315 | 0.206 | 0.273 | 4.2      |
| MWM       | 0.373 | 0.263 | 0.352 | 332      |
| MWMS      | 0.391 | 0.284 | 0.368 | 544      |

**Experimental results.** To quantitatively evaluate our proposed methods, we compare our algorithms with several baseline methods: K-means, three-stage K-means (TSK-means) as described in the Supplement, and MC2-SVI without context (Huynh et al., 2016). Clustering performance in Table 1 is evaluated on the image clustering problem for the LabelMe dataset.
Figure 3: Clustering representation for the two datasets: (a) five image clusters from the LabelMe data discovered by the MWMS algorithm: the tag-clouds on the left are accumulated from all images in each cluster, while the six images on the right are randomly chosen images from that cluster; (b) the discovered StudentLife network with three node groups: (1) discovered student clusters, (3) student nodes, (5) discovered activity locations (from WiFi data); and two edge groups: (2) student-to-cluster assignments, (4) student involvement in activity locations. Node sizes (of discovered nodes) depict the number of elements in each cluster, while edge sizes between students and activity locations represent the popularity of students' activities.

With K-means, we average all data points to obtain a single vector for each image. K-means needs much less time to run since the number of data points is then reduced to 1,800. For MC2-SVI, we used stochastic variational inference and the parallelized Spark-based implementation of (Huynh et al., 2016) to carry out the experiments. This implementation has the advantage of making use of all 16 cores on the test machine. The running time for MC2-SVI is reported after scanning one epoch. In terms of clustering accuracy, the MWM and MWMS algorithms perform the best.

Fig. 3a shows five representative image clusters discovered by our MWMS algorithm, with six randomly chosen images from each (on the right). We also accumulate the labeled tags from all images in each cluster to produce the tag-cloud on the left. These tag-clouds can be considered a visual ground truth for the clusters. Our algorithm groups images into clusters that are consistent with their tag-clouds.

We use the StudentLife dataset to demonstrate the capability of multilevel clustering on large-scale datasets. This dataset not only contains a large number of data points but is also high dimensional. Our algorithms need approximately one hour to perform multilevel clustering on this dataset. Fig. 3b presents the two levels of clusters discovered by our algorithms. The innermost (blue) and outermost (green) rings depict local and global clusters, respectively. Global clusters represent groups of students, while local clusters shared between students ("documents") may be used to infer the locations of students' activities. From these clusterings we can dissect students' shared locations (activities); e.g., Student 49 (U49) mainly takes part in activity location 4 (L4).

## 6. Discussion

We have proposed an optimization-based approach to multilevel clustering using Wasserstein metrics. There are several possible directions for extensions. Firstly, we have only considered continuous data; it is of interest to extend our formulation to discrete data. Secondly, our method requires knowledge of the numbers of clusters for both local and global clustering; when these numbers are unknown, it seems reasonable to incorporate a penalty on model complexity. Thirdly, our formulation does not directly account for the noise distribution away from the (Wasserstein) means; to improve robustness, it may be desirable to use the first-order Wasserstein metric instead of the second-order one. Finally, we are interested in extending our approach to richer settings of hierarchical data, such as settings where group-level context is available.

**Acknowledgement.** This research is supported in part by grants NSF CAREER DMS-1351362, NSF CNS-1409303, the Margaret and Herman Sokol Faculty Award, and a research gift from Adobe Research (XN).
DP gratefully acknowledges partial support from the Australian Research Council (ARC) and AOARD (FA2386-16-1-4138).

## References

M. Agueh and G. Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43:904-924, 2011.

E. Anderes, S. Borgwardt, and J. Miller. Discrete Wasserstein barycenters: optimal transport for discrete data. http://arxiv.org/abs/1507.07218, 2015.

J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 2:1111-1138, 2015.

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993-1022, 2003.

M. Cuturi. Sinkhorn distances: lightspeed computation of optimal transport. Advances in Neural Information Processing Systems 26, 2013.

M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. Proceedings of the 31st International Conference on Machine Learning, 2014.

S. Graf and H. Luschgy. Foundations of Quantization for Probability Distributions. Springer-Verlag, New York, 2000.

V. Huynh, D. Phung, S. Venkatesh, X. Nguyen, M. Hoffman, and H. Bui. Scalable nonparametric Bayesian multilevel clustering. Proceedings of Uncertainty in Artificial Intelligence, 2016.

B. Kulis and M. I. Jordan. Revisiting k-means: new algorithms via Bayesian nonparametrics. Proceedings of the 29th International Conference on Machine Learning, 2012.

T.-B. Nguyen, V. Nguyen, S. Venkatesh, and D. Phung. MCNC: Multi-channel nonparametric clustering from heterogeneous data. Proceedings of ICPR, 2016.

V. Nguyen, D. Phung, X. Nguyen, S. Venkatesh, and H. Bui. Bayesian nonparametric multilevel clustering with group-level contexts. Proceedings of the 31st International Conference on Machine Learning, 2014.

X. Nguyen. Convergence of latent mixing measures in finite and infinite mixture models. Annals of Statistics, 41(1):370-400, 2013.

X. Nguyen. Borrowing strength in hierarchical Bayes: Posterior concentration of the Dirichlet base measure. Bernoulli, 22:1535-1571, 2016.

A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42:145-175, 2001.

D. Pollard. Quantization and the method of k-means. IEEE Transactions on Information Theory, 28:199-205, 1982.

J. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155:945-959, 2000.

A. Rodriguez, D. Dunson, and A. E. Gelfand. The nested Dirichlet process. J. Amer. Statist. Assoc., 103(483):1131-1154, 2008.

J. Solomon, F. de Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. The International Conference and Exhibition on Computer Graphics and Interactive Techniques, 2015.

S. Srivastava, V. Cevher, Q. Dinh, and D. Dunson. WASP: Scalable Bayes via barycenters of subset posteriors. Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 2015.

Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. J. Amer. Statist. Assoc., 101:1566-1581, 2006.

C. Villani. Topics in Optimal Transportation. American Mathematical Society, 2003.

D. F. Wulsin, S. T. Jensen, and B. Litt. Nonparametric multi-level clustering of human epilepsy seizures. Annals of Applied Statistics, 10:667-689, 2016.
P. C. Álvarez-Esteban, E. del Barrio, J. A. Cuesta-Albertos, and C. Matrán. A fixed-point approach to barycenters in Wasserstein space. Journal of Mathematical Analysis and Applications, 441:744-762, 2016.