# Bayesian Nonparametric Multilevel Clustering with Group-Level Contexts

Vu Nguyen (TVNGUYE@DEAKIN.EDU.AU), Center for Pattern Recognition and Data Analytics (PRaDA), Deakin University, Australia
Dinh Phung (DINH.PHUNG@DEAKIN.EDU.AU), Center for Pattern Recognition and Data Analytics (PRaDA), Deakin University, Australia
Xuan Long Nguyen (XUANLONG@UMICH.EDU), Department of Statistics, University of Michigan, Ann Arbor, USA
Svetha Venkatesh (SVETHA.VENKATESH@DEAKIN.EDU.AU), Center for Pattern Recognition and Data Analytics (PRaDA), Deakin University, Australia
Hung Hai Bui (BUI.H.HUNG@GMAIL.COM), Laboratory for Natural Language Understanding, Nuance Communications, Sunnyvale, USA

## Abstract

We present a Bayesian nonparametric framework for multilevel clustering which utilizes group-level context information to simultaneously discover low-dimensional structures of the group contents and partition groups into clusters. Using the Dirichlet process as the building block, our model constructs a product base measure with a nested structure to accommodate content and context observations at multiple levels. The proposed model possesses properties that link the nested Dirichlet process (nDP) and the Dirichlet process mixture model (DPM) in an interesting way: integrating out all contents results in the DPM over contexts, whereas integrating out group-specific contexts results in the nDP mixture over content variables. We provide a Pólya-urn view of the model and an efficient collapsed Gibbs inference procedure. Extensive experiments on real-world datasets demonstrate the advantage of utilizing context information via our model in both text and image domains.

## 1. Introduction

In many situations, content data naturally present themselves in groups: students are grouped into classes, classes into schools, words into documents, etc. Furthermore, each content group can be associated with additional context information (teachers of the class, authors of the document, time and location stamps). Dealing with grouped data, a setting known as multilevel analysis (Hox, 2010; Diez-Roux, 2000), has diverse application domains ranging from document modeling (Blei et al., 2003) to public health (Leyland & Goldstein, 2001).

This paper considers specifically the multilevel clustering problem in multilevel analysis: to jointly cluster both the content data and their groups when group-level context information is available. By context, we mean a secondary data source attached to the group of primary content data. An example is the problem of clustering documents, where each document is a group of words associated with group-level context information such as time stamps, a list of authors, etc. Another example is image clustering, where visual image features (e.g., SIFT) are the content and image tags are the context. To cluster groups together, it is often necessary to perform dimensionality reduction of the content data by forming content topics, effectively performing clustering of the content as well. For example, in document clustering, using bag-of-words directly as features is often problematic due to the large vocabulary size and the sparsity of in-document word occurrences.
Thus, a typical approach is to first apply dimensionality reduction techniques such as LDA (Blei et al., 2003) or HDP (Teh et al., 2006) to find word topics (i.e., distributions over words), then perform document clustering using the word topics and the document-level context information as features. In such a cascaded approach, the dimensionality reduction step (e.g., topic modeling) is not able to utilize the context information. This limitation suggests that a better alternative is to perform context-aware document clustering and topic modeling jointly. With a joint model, one can expect to obtain improved document clusters as well as context-guided content topics that are more predictive of the data.

Recent work has attempted to jointly capture word topics and document clusters. Parametric approaches (Xie & Xing, 2013) are extensions of LDA (Blei et al., 2003) and require specifying the number of topics and clusters in advance. Bayesian nonparametric approaches, including the nested Dirichlet process (nDP) (Rodriguez et al., 2008) and the multilevel clustering hierarchical Dirichlet process (MLC-HDP) (Wulsin et al., 2012), can automatically adjust the number of clusters. We note that none of these methods can utilize context data.

This paper proposes Multilevel Clustering with Context (MC2), a Bayesian nonparametric model to jointly cluster both content and groups while fully utilizing group-level context. Using the Dirichlet process as the building block, our model constructs a product base measure with a nested structure to accommodate both content and context observations. The MC2 model possesses properties that link the nested Dirichlet process (nDP) and the Dirichlet process mixture model (DPM) in an interesting way: integrating out all contents results in the DPM over contexts, whereas integrating out group-level contexts results in the nDP mixture over content variables. For inference, we provide an efficient collapsed Gibbs sampling procedure for the model. The advantages of our model are: (1) the model automatically discovers the (unspecified) number of group clusters and the number of topics while fully utilizing the context information; (2) content topic modeling is informed by group-level context information, leading to more predictive content topics; (3) the model is robust to partially missing context information. In our experiments, we demonstrate that our proposed model achieves better document clustering performance and more predictive word topics on real-world datasets in both text and image domains.

## 2. Related Background

There has been extensive work on clustering documents in the literature. Due to the limited scope of the paper, we only describe works closely related to probabilistic topic models. We note that standard topic models such as LDA (Blei et al., 2003) and their nonparametric Bayesian counterpart, HDP (Teh et al., 2006), exploit the group structure for word clustering. However, these models do not cluster documents.

One approach to document clustering is to employ a two-stage process. First, topic models (e.g., LDA or HDP) are applied to extract the topics and their mixture proportions for each document. Then, these are used as feature input to another clustering algorithm. Some examples of this approach include the use of LDA + K-means for image clustering (Elango & Jayaraman, 2005) and HDP + Affinity Propagation for clustering human activities (Nguyen et al., 2013).
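To make the cascaded baseline concrete, here is a minimal sketch of the two-stage pipeline using scikit-learn's LDA and K-means as stand-ins for the two stages. The toy corpus, the number of topics, and the number of clusters are illustrative assumptions, not settings from the paper; note also that the topics here are learned without seeing any context, which is precisely the limitation motivating the joint model.

```python
# Two-stage baseline: (1) topic model for dimensionality reduction,
# (2) cluster documents in topic-proportion space.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

# Toy corpus (illustrative only).
docs = [
    "stick breaking process dirichlet mixture",
    "dirichlet process mixture model inference",
    "image tags and sift visual features",
    "visual features for image clustering",
]

# Stage 1: represent each document by its topic-mixture proportions
# instead of raw bag-of-words counts.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)  # shape: (n_docs, n_topics)

# Stage 2: cluster documents in topic space; context (authors, tags,
# timestamps) could only be appended here, after the topics are fixed.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(doc_topic)
print(labels)
```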
A more elegant approach is to simultaneously cluster documents and discover topics. The first Bayesian nonparametric model proposed for this task is the nested Dirichlet process (nDP) (Rodriguez et al., 2008), in which documents in a cluster share the same distribution over topic atoms. Although the original nDP does not force the topic atoms to be shared across document clusters, this can be achieved by simply introducing a DP prior for the nDP base measure. The same observation was also made by Wulsin et al. (2012), who introduced the MLC-HDP, a three-level extension to the nDP. This model can thus cluster words, documents, and document corpora with shared topic atoms throughout the group hierarchy. Xie & Xing (2013) recently introduced the Multi-Grain Clustering Topic Model, which allows mixing between global topics and document-cluster topics. However, this is a parametric model which requires fixing the number of topics in advance. More crucially, none of these existing models attempts to utilize group-level context information.

**Modelling with Dirichlet Processes.** We provide a brief account of the Dirichlet process and its variants. The literature on the DP is vast, and we refer to (Hjort et al., 2010) for a comprehensive account. Here we focus on the DPM, HDP, and nDP, which are related to our work.

The Dirichlet process (Ferguson, 1973) is a basic building block in Bayesian nonparametrics. Let (Θ, B, H) be a probability measure space and γ a positive number; a Dirichlet process DP(γ, H) is a distribution over discrete random probability measures G on (Θ, B). Sethuraman (1994) provides an alternative constructive definition which makes the discreteness of a draw from a Dirichlet process explicit via the stick-breaking representation:

$$G = \sum_{k=1}^{\infty} \beta_k \delta_{\phi_k}, \qquad \phi_k \overset{\text{iid}}{\sim} H, \; k = 1, 2, \ldots,$$

where $\beta = (\beta_k)_{k=1}^{\infty}$ are the weights constructed through a stick-breaking process, $\beta_k = v_k \prod_{s<k} (1 - v_s)$ with $v_k \overset{\text{iid}}{\sim} \operatorname{Beta}(1, \gamma)$.
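The stick-breaking view translates directly into a sampler. Below is a minimal sketch that draws a K-truncated approximation of G ~ DP(γ, H); the base measure H = N(0, 1), the truncation level, the concentration value, and the helper name `stick_breaking_dp` are all illustrative assumptions rather than details from the paper.

```python
# Truncated stick-breaking draw from a Dirichlet process DP(gamma, H).
import numpy as np

def stick_breaking_dp(gamma, base_sampler, K, rng):
    """Return atoms phi_k and weights beta_k of a K-truncated DP draw."""
    v = rng.beta(1.0, gamma, size=K)          # v_k ~ Beta(1, gamma)
    # prod_{s<k} (1 - v_s), with an empty product (= 1) for k = 1
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    beta = v * remaining                      # beta_k = v_k * prod_{s<k}(1 - v_s)
    phi = base_sampler(K)                     # phi_k ~ iid H
    return phi, beta

rng = np.random.default_rng(0)
phi, beta = stick_breaking_dp(
    gamma=2.0, base_sampler=lambda k: rng.normal(0.0, 1.0, k), K=100, rng=rng
)
print(beta[:5], beta.sum())  # weights decay; their sum approaches 1 as K grows
```

Smaller γ concentrates the mass on fewer atoms, which is why DP-based models such as the DPM and nDP can adjust the effective number of clusters automatically rather than fixing it in advance.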