Hierarchical Dirichlet Scaling Process

Dongwoo Kim (DW.KIM@KAIST.AC.KR), KAIST, Daejeon, Korea
Alice Oh (ALICE.OH@KAIST.EDU), KAIST, Daejeon, Korea

We present the hierarchical Dirichlet scaling process (HDSP), a Bayesian nonparametric mixed membership model for multi-labeled data. We construct the HDSP based on the gamma representation of the hierarchical Dirichlet process (HDP), which allows scaling the mixture components. With this construction, the HDSP allocates a latent location in a space to each label and each mixture component, and uses the distances between them to guide membership probabilities. We develop a variational Bayes algorithm for approximate posterior inference in the HDSP. Through experiments on synthetic datasets as well as datasets of newswire, medical journal articles, and Wikipedia, we show that the HDSP achieves better predictive performance than the HDP, labeled LDA, and partially labeled LDA.

1. Introduction

The hierarchical Dirichlet process (HDP) is an important nonparametric Bayesian prior for mixed membership models, and the HDP topic model is useful for a wide variety of tasks involving unstructured text (Teh et al., 2006). To extend the HDP topic model, there has been active research in dependent random probability measures as priors for modeling the underlying association between the latent semantic structure and explanatory variables, such as time stamps and spatial coordinates (Ahmed & Xing, 2010; Ren et al., 2011). A large body of this research is rooted in the dependent Dirichlet process (DP) (MacEachern, 1999), where the probabilistic random measure is defined as a function of some covariate. Most dependent DP approaches rely on a generalization of Sethuraman's stick-breaking representation of the DP (Sethuraman, 1991), incorporating the temporal difference between two or more data points, or the spatial difference among observed data, into a predictor-dependent stick-breaking process (Duan et al., 2007; Dunson & Park, 2008). Some of these priors can be integrated into the hierarchical construction of the DP (Srebro & Roweis, 2005), resulting in topic models where temporally or spatially proximate data are more likely to be clustered.

[Figure 1. Locations of observed labels (capital letters in red) and latent topics (small letters in blue) inferred by HDSP from the Reuters corpus. HDSP uses the distances between labels and topics to scale the topic proportions such that the topics closer to the observed labels in a document are given higher probabilities.]

Many datasets, however, come with labels, or categorical side information, which cannot be modeled with these existing dependent DP approaches. Labels, like temporal and spatial information, are correlated with the latent semantics of the documents, but they cannot be used to directly define the distance between two documents. This is because labels are categorical, so there is no simple way to measure distances between labels. Moreover, labels and documents do not have a one-to-one correspondence, as there may be zero, one, or more labels per document.
We develop the hierarchical Dirichlet scaling process (HDSP), which models the latent locations of topics and labels in a space and uses the distances between them to guide the topic proportions. Figure 1 visualizes how the HDSP discovers the latent locations of the topics and the labels from the Reuters articles with news categories as labels. In this example, an article under the news category ECONOMICS would be endowed with high probabilities for nearby topics such as <bank, rate, percent>, and low probabilities for distant topics such as <film, movie, music>.

In the next section, we describe the gamma process construction of the HDP and how the scale parameter is used to develop the HDSP. In Section 3, we derive a variational inference algorithm for the latent variables by directly placing a prior over the distances between the latent locations. In Section 4, we verify our approach on a synthetic dataset and demonstrate the improved predictive power of our model on multi- and partially-labeled corpora.

Related Work

Previously proposed topic models for labeled documents take an approach quite distinct from the dependent DP literature. Labeled LDA (L-LDA) allocates one dimension of the topic simplex per label and generates words only from the topics that correspond to the labels of each document (Ramage et al., 2009). An extension of this model, partially labeled LDA (PLDA), adds more flexibility by allocating a pre-defined number of topics per label and including a background label to handle documents with no labels (Ramage et al., 2011). The Dirichlet process with mixed random measures (DP-MRM) is a nonparametric topic model which generates an unbounded number of topics per label but still excludes topics from labels that are not observed in the document (Kim et al., 2012).

2. Hierarchical Dirichlet Scaling Process

In this section, we describe the hierarchical Dirichlet scaling process (HDSP) for multi-labeled data. First we review the HDP and the gamma process construction of the second-level DP. We then present the HDSP, where the second-level DP incorporates the latent locations of the mixture components and the labels.

2.1. The gamma process construction of the HDP

In the HDP¹, there are two levels of the DP, where the measure drawn from the upper-level DP is the base distribution of the lower-level DP. The hierarchical representation of the process is

$$G_0 \sim \mathrm{DP}(\alpha H), \qquad G_m \sim \mathrm{DP}(\beta G_0), \tag{1}$$

where H is a base distribution, α and β are concentration parameters, and the index m represents multiple draws from the second-level DP. For the mixed membership model, x_mn, observation n in group m, can be drawn from

$$\theta_{mn} \sim G_m, \qquad x_{mn} \sim f(\theta_{mn}), \tag{2}$$

where f(·) is a data distribution parameterized by θ. In the context of topic models, the base distribution H is usually a Dirichlet distribution over the vocabulary, so the atoms of the first-level random measure G_0 are an infinite set of topics drawn from H. The second-level random measure G_m is distributed based on the first-level random measure G_0, so the second level shares the same set of topics, the atoms of the first-level random measure.

¹ In this paper, we limit our discussion of the HDP to the two-level construction of the DP and refer to it simply as the HDP.
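To make the generative story of Eqs. (1)-(2) concrete, the following is a minimal sketch of the two-level process for a topic model, using a finite truncation of the first-level DP. The truncation level K, the symmetric Dirichlet base measure, and all variable names are illustrative assumptions rather than the paper's construction or inference scheme:

```python
# A minimal sketch of the two-level HDP generative process in Eqs. (1)-(2),
# approximated with a finite truncation K of the first-level DP.
import numpy as np

rng = np.random.default_rng(0)
K, V = 20, 1000          # truncation level, vocabulary size (assumed)
alpha, beta = 1.0, 5.0   # concentration parameters

# First level: atoms (topics) from the base distribution H = Dirichlet(eta),
# weights from a truncated stick-breaking draw, so G0 ~ DP(alpha * H).
eta = 0.1
topics = rng.dirichlet(eta * np.ones(V), size=K)   # phi_k ~ H
sticks = rng.beta(1.0, alpha, size=K)
p = sticks * np.cumprod(np.concatenate(([1.0], 1.0 - sticks[:-1])))
p /= p.sum()                                       # renormalize the truncation

# Second level: G_m ~ DP(beta * G0) shares the first-level atoms; under a
# finite truncation this is approximately a Dirichlet draw over the K topics.
def draw_document(n_words):
    pi_m = rng.dirichlet(beta * p)                 # per-document topic weights
    z = rng.choice(K, size=n_words, p=pi_m)        # theta_mn ~ G_m (topic picks)
    return np.array([rng.choice(V, p=topics[k]) for k in z])  # x_mn ~ f(theta_mn)

doc = draw_document(50)
```

With the truncation, each G_m reduces to a finite Dirichlet over the shared topics; the stick-breaking and gamma process constructions reviewed next make the infinite-dimensional version precise.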
The constructive definition of the DP can be represented as a stick-breaking process (Sethuraman, 1991), and in the HDP inference algorithm based on stick breaking, the first-level DP is given by the following conditional distributions:

$$V_k \sim \mathrm{Beta}(1, \alpha), \qquad p_k = V_k \prod_{j=1}^{k-1} (1 - V_j), \qquad G_0 = \sum_{k=1}^{\infty} p_k \delta_{\phi_k}, \tag{3}$$

where the stick lengths p_k define a corpus-level distribution over the topics φ_k. The second-level random measures are conditionally distributed on the first-level discrete random measure G_0:

$$\pi_{ml} \sim \mathrm{Beta}(1, \beta), \qquad p_{ml} = \pi_{ml} \prod_{j=1}^{l-1} (1 - \pi_{mj}), \qquad \theta_{ml} \sim G_0, \qquad G_m = \sum_{l=1}^{\infty} p_{ml} \delta_{\theta_{ml}}, \tag{4}$$

where each second-level atom θ_ml corresponds to one of the first-level atoms φ_k. This stick-breaking construction is the most widely used method for the hierarchical construction (Wang et al., 2011; Teh et al., 2006).

An alternative construction of the HDP is based on the normalized gamma process (Paisley et al., 2012). While the first-level construction remains the same, the gamma process changes the second-level construction from Eq. 4 to

$$\pi_{mk} \sim \mathrm{Gamma}(\beta p_k, 1), \qquad G_m = \sum_{k=1}^{\infty} \frac{\pi_{mk}}{\sum_{j=1}^{\infty} \pi_{mj}} \, \delta_{\phi_k}, \tag{5}$$

where Gamma(x; a, b) = b^a x^{a-1} e^{-bx} / Γ(a). Unlike in the stick-breaking construction, the atom of the kth weight π_mk of the gamma process is the same as the atom of the kth stick of the first level. Therefore, during inference, the model does not need to keep track of which second-level atoms correspond to which first-level atoms. Furthermore, by placing a proper random variable on the rate parameter of the gamma distribution, the model can infer the correlations among the topics (Paisley et al., 2012) through the Gaussian process (Rasmussen & Williams, 2005).

2.2. Hierarchical Dirichlet Scaling Process

In the hierarchical Dirichlet scaling process (HDSP), we start with the gamma process construction of the HDP and place a proper prior on the rate parameter to guide the topic proportions based on the labels of the document. In the model, each topic and each label has a latent location, and the topic proportions of a document are determined by the distances between the topics and the document's labels. With the assumption that the locations of topics and labels are drawn from a distribution over the space, the first-level DP of the HDSP is drawn from the product of two base distributions,

$$G_0 \sim \mathrm{DP}(\alpha H \times L), \qquad G_0 = \sum_{k=1}^{\infty} p_k \delta_{\{\phi_k, l_k\}}, \tag{6}$$

where H is a distribution over the topic parameters φ_k, L is a distribution over the latent locations of the topics and labels, and p_k is the stick length for topic k, p_k = V_k ∏_{j=1}^{k-1} (1 - V_j) with V_k ~ Beta(1, α), as in Eq. 3.
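To illustrate the role of the rate parameter, here is a small numerical sketch contrasting the plain normalized gamma construction of Eq. 5 with an HDSP-style variant in which the gamma rate for each topic depends on the distance between that topic's latent location and the document's observed labels. The exponential form of the rate, the label-to-topic distance reduction, the latent-space dimensionality, and all names are assumptions for illustration; the text above only states that a random variable placed on the rate parameter lets label-topic distances guide the topic proportions:

```python
# Sketch: normalized gamma second level (Eq. 5) vs. an assumed HDSP-style
# variant where the rate is scaled by label-topic distances.
import numpy as np

rng = np.random.default_rng(1)
K, D = 20, 2                       # truncated topic count, latent-space dim (assumed)
beta = 5.0
p = rng.dirichlet(np.ones(K))      # placeholder for first-level stick lengths p_k

topic_loc = rng.normal(size=(K, D))   # l_k ~ L, latent topic locations
label_loc = rng.normal(size=(3, D))   # latent locations of this document's labels

# Plain HDP second level (Eq. 5): pi_mk ~ Gamma(beta * p_k, 1), then normalize.
pi = rng.gamma(shape=beta * p, scale=1.0)
hdp_weights = pi / pi.sum()

# HDSP-style second level: a distance-dependent rate shrinks the expected
# weight of topics far from the document's labels (assumed exponential form).
dist = np.linalg.norm(topic_loc[:, None, :] - label_loc[None, :, :], axis=2)
rate = np.exp(dist.min(axis=1))            # farther topic -> larger rate
pi_scaled = rng.gamma(shape=beta * p, scale=1.0 / rate)
hdsp_weights = pi_scaled / pi_scaled.sum()
```

Since a Gamma(a, rate b) draw has mean a/b, increasing the rate for distant topics lowers their expected unnormalized weight, which matches the behavior shown in Figure 1 where topics near a document's labels receive higher proportions.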