# Bayesian Nonparametric Multilevel Clustering with Group-Level Contexts

Vu Nguyen (TVNGUYE@DEAKIN.EDU.AU), Center for Pattern Recognition and Data Analytics (PRaDA), Deakin University, Australia
Dinh Phung (DINH.PHUNG@DEAKIN.EDU.AU), Center for Pattern Recognition and Data Analytics (PRaDA), Deakin University, Australia
Xuan Long Nguyen (XUANLONG@UMICH.EDU), Department of Statistics, University of Michigan, Ann Arbor, USA
Svetha Venkatesh (SVETHA.VENKATESH@DEAKIN.EDU.AU), Center for Pattern Recognition and Data Analytics (PRaDA), Deakin University, Australia
Hung Hai Bui (BUI.H.HUNG@GMAIL.COM), Laboratory for Natural Language Understanding, Nuance Communications, Sunnyvale, USA

## Abstract

We present a Bayesian nonparametric framework for multilevel clustering which utilizes group-level context information to simultaneously discover low-dimensional structures of the group contents and partition groups into clusters. Using the Dirichlet process as the building block, our model constructs a product base measure with a nested structure to accommodate content and context observations at multiple levels. The proposed model possesses properties that link the nested Dirichlet process (nDP) and the Dirichlet process mixture model (DPM) in an interesting way: integrating out all contents results in the DPM over contexts, whereas integrating out group-specific contexts results in the nDP mixture over content variables. We provide a Pólya-urn view of the model and an efficient collapsed Gibbs inference procedure. Extensive experiments on real-world datasets demonstrate the advantage of utilizing context information via our model in both text and image domains.

## 1. Introduction

In many situations, content data naturally present themselves in groups: students are grouped into classes, classes into schools, words into documents, etc. Furthermore, each content group can be associated with additional context information (teachers of the class, authors of the document, time and location stamps). Dealing with grouped data, a setting known as multilevel analysis (Hox, 2010; Diez-Roux, 2000), has diverse application domains ranging from document modeling (Blei et al., 2003) to public health (Leyland & Goldstein, 2001).

This paper considers specifically the multilevel clustering problem in multilevel analysis: to jointly cluster both the content data and their groups when group-level context information is available. By context, we mean a secondary data source attached to the group of primary content data. An example is the problem of clustering documents, where each document is a group of words associated with group-level context information such as time stamps, a list of authors, etc. Another example is image clustering, where visual image features (e.g., SIFT) are the content and image tags are the context. To cluster groups together, it is often necessary to perform dimensionality reduction of the content data by forming content topics, effectively performing clustering of the content as well. For example, in document clustering, using bag-of-words directly as features is often problematic due to the large vocabulary size and the sparsity of in-document word occurrences.
Thus, a typical approach is to first apply dimensionality reduction techniques such as LDA (Blei et al., 2003) or HDP (Teh et al., 2006) to find word topics (i.e., distributions over words), then perform document clustering using the word topics and the document-level context information as features. In such a cascaded approach, the dimensionality reduction step (e.g., topic modeling) is not able to utilize the context information. This limitation suggests that a better alternative is to perform context-aware document clustering and topic modeling jointly. With a joint model, one can expect to obtain improved document clusters as well as context-guided content topics that are more predictive of the data.

Recent work has attempted to jointly capture word topics and document clusters. Parametric approaches (Xie & Xing, 2013) are extensions of LDA (Blei et al., 2003) and require specifying the number of topics and clusters in advance. Bayesian nonparametric approaches, including the nested Dirichlet process (nDP) (Rodriguez et al., 2008) and the multilevel clustering hierarchical Dirichlet process (MLC-HDP) (Wulsin et al., 2012), can automatically adjust the number of clusters. We note that none of these methods can utilize context data.

This paper proposes Multilevel Clustering with Context (MC2), a Bayesian nonparametric model to jointly cluster both content and groups while fully utilizing group-level context. Using the Dirichlet process as the building block, our model constructs a product base measure with a nested structure to accommodate both content and context observations. The MC2 model possesses properties that link the nested Dirichlet process (nDP) and the Dirichlet process mixture model (DPM) in an interesting way: integrating out all contents results in the DPM over contexts, whereas integrating out group-level contexts results in the nDP mixture over content variables. For inference, we provide an efficient collapsed Gibbs sampling procedure for the model. The advantages of our model are: (1) the model automatically discovers the (unspecified) number of group clusters and the number of topics while fully utilizing the context information; (2) content topic modeling is informed by group-level context information, leading to more predictive content topics; (3) the model is robust to partially missing context information. In our experiments, we demonstrate that our proposed model achieves better document clustering performance and more predictive word topics on real-world datasets in both text and image domains.

## 2. Related Background

There has been extensive work on clustering documents in the literature. Due to the limited scope of the paper, we only describe works closely related to probabilistic topic models. We note that standard topic models such as LDA (Blei et al., 2003) and their nonparametric Bayesian counterpart, HDP (Teh et al., 2006), exploit the group structure for word clustering. However, these models do not cluster documents.

One approach to document clustering is to employ a two-stage process. First, topic models (e.g., LDA or HDP) are applied to extract the topics and their mixture proportions for each document. Then, these are used as feature input to another clustering algorithm. Some examples of this approach include the use of LDA + K-means for image clustering (Elango & Jayaraman, 2005) and HDP + Affinity Propagation for clustering human activities (Nguyen et al., 2013).
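To make the cascaded baseline concrete, here is a minimal sketch of the two-stage pipeline using scikit-learn's LDA and K-means as stand-ins for the two stages. The toy corpus, the number of topics, and the number of clusters are illustrative assumptions, not settings from the paper; note also that the topics here are learned without seeing any context, which is precisely the limitation motivating the joint model.

```python
# Two-stage baseline: (1) topic model for dimensionality reduction,
# (2) cluster documents in topic-proportion space.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

# Toy corpus (illustrative only).
docs = [
    "stick breaking process dirichlet mixture",
    "dirichlet process mixture model inference",
    "image tags and sift visual features",
    "visual features for image clustering",
]

# Stage 1: represent each document by its topic-mixture proportions
# instead of raw bag-of-words counts.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)  # shape: (n_docs, n_topics)

# Stage 2: cluster documents in topic space; context (authors, tags,
# timestamps) could only be appended here, after the topics are fixed.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(doc_topic)
print(labels)
```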
A more elegant approach is to simultaneously cluster documents and discover topics. The first Bayesian nonparametric model proposed for this task is the nested Dirichlet process (nDP) (Rodriguez et al., 2008), in which documents in a cluster share the same distribution over topic atoms. Although the original nDP does not force the topic atoms to be shared across document clusters, this can be achieved by simply introducing a DP prior for the nDP base measure. The same observation was also made by Wulsin et al. (2012), who introduced the MLC-HDP, a three-level extension to the nDP. This model can thus cluster words, documents, and document corpora with shared topic atoms throughout the group hierarchy. Xie & Xing (2013) recently introduced the Multi-Grain Clustering Topic Model, which allows mixing between global topics and document-cluster topics. However, this is a parametric model which requires fixing the number of topics in advance. More crucially, none of these existing models attempts to utilize group-level context information.

**Modelling with Dirichlet Processes.** We provide a brief account of the Dirichlet process and its variants. The literature on the DP is vast, and we refer to (Hjort et al., 2010) for a comprehensive account. Here we focus on the DPM, HDP, and nDP, which are related to our work.

The Dirichlet process (Ferguson, 1973) is a basic building block in Bayesian nonparametrics. Let (Θ, B, H) be a probability measure space and γ a positive number; a Dirichlet process DP(γ, H) is a distribution over discrete random probability measures G on (Θ, B). Sethuraman (1994) provides an alternative constructive definition which makes the discreteness of a draw from a Dirichlet process explicit via the stick-breaking representation:

$$G = \sum_{k=1}^{\infty} \beta_k \delta_{\phi_k}, \qquad \phi_k \overset{\text{iid}}{\sim} H, \; k = 1, 2, \ldots,$$

where $\beta = (\beta_k)_{k=1}^{\infty}$ are the weights constructed through a stick-breaking process, $\beta_k = v_k \prod_{s<k} (1 - v_s)$ with $v_k \overset{\text{iid}}{\sim} \operatorname{Beta}(1, \gamma)$.
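The stick-breaking view translates directly into a sampler. Below is a minimal sketch that draws a K-truncated approximation of G ~ DP(γ, H); the base measure H = N(0, 1), the truncation level, the concentration value, and the helper name `stick_breaking_dp` are all illustrative assumptions rather than details from the paper.

```python
# Truncated stick-breaking draw from a Dirichlet process DP(gamma, H).
import numpy as np

def stick_breaking_dp(gamma, base_sampler, K, rng):
    """Return atoms phi_k and weights beta_k of a K-truncated DP draw."""
    v = rng.beta(1.0, gamma, size=K)          # v_k ~ Beta(1, gamma)
    # prod_{s<k} (1 - v_s), with an empty product (= 1) for k = 1
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    beta = v * remaining                      # beta_k = v_k * prod_{s<k}(1 - v_s)
    phi = base_sampler(K)                     # phi_k ~ iid H
    return phi, beta

rng = np.random.default_rng(0)
phi, beta = stick_breaking_dp(
    gamma=2.0, base_sampler=lambda k: rng.normal(0.0, 1.0, k), K=100, rng=rng
)
print(beta[:5], beta.sum())  # weights decay; their sum approaches 1 as K grows
```

Smaller γ concentrates the mass on fewer atoms, which is why DP-based models such as the DPM and nDP can adjust the effective number of clusters automatically rather than fixing it in advance.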