Topic Modeling with Document Relative Similarities

Jianguang Du, Jing Jiang, Dandan Song, Lejian Liao
School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
School of Information Systems, Singapore Management University, Singapore
dujianguang@bit.edu.cn; jingjiang@smu.edu.sg; {sdd,liaolj}@bit.edu.cn

Abstract

Topic modeling has been widely used in text mining. Previous topic models such as Latent Dirichlet Allocation (LDA) are successful in learning hidden topics, but they do not take into account the metadata of documents. To tackle this problem, many augmented topic models have been proposed to jointly model text and metadata. However, most existing models handle only categorical and numerical types of metadata. We identify another type of metadata that can be more natural to obtain in some scenarios: relative similarities among documents. In this paper, we propose a general model that links LDA with constraints derived from document relative similarities. Specifically, in our model the constraints act as a regularizer of the log likelihood of LDA. We fit the proposed model using Gibbs-EM. Experiments with two real-world datasets show that our model is able to learn meaningful topics. The results also show that our model outperforms the baselines in terms of topic coherence and on a document classification task.

1 Introduction

Topic modeling has been a popular text analysis method [Blei et al., 2003]. In topic models, it is assumed that a document is generated by a mixture of topics, each of which is a distribution over the words in the vocabulary. By fitting the models, we can represent each document through the learned topics as well as understand the topics in the corpus through the most probable words of each topic. Classical topic models such as Probabilistic Latent Semantic Analysis (PLSA) [Hofmann, 1999] and Latent Dirichlet Allocation (LDA) [Blei et al., 2003] have shown impressive success in modeling text documents.

Despite the success of the above two topic models, they cannot directly incorporate metadata such as the author or the posting time of a document to improve the quality of the learned topics. To tackle this problem, many augmented topic models have been developed (e.g. [Rosen-Zvi et al., 2004; Wang and McCallum, 2006; Mcauliffe and Blei, 2008]). Depending on the nature of the metadata, different models make different assumptions about the relations between the metadata and the underlying topic mixture of each document and/or the word distributions of topics. Generally, most of these models rely on the assumption that documents with the same attribute values (e.g. the same authors or the same publishing time) tend to have similar topic distributions. This assumption can be captured by establishing a dependency between the topic distributions of documents and the metadata, where the metadata can be either categorical (e.g. [Rosen-Zvi et al., 2004]) or numerical (e.g. [Wang and McCallum, 2006]).

However, metadata may not always be in the form of categorical or numerical attribute values. For example, given documents A, B and C, one may point out that A is more similar to B than to C. Presumably, this kind of relative similarity can also help topic modeling; document A's topic distribution should be closer to B's than to C's. In many scenarios, relative similarities may be easier to obtain than categorical or numerical attributes.
For example, to an online user with a particular information need, it is easier to ask her to compare two Web pages and point out which one is more relevant than to ask her to score each Web page. Given a set of research papers, it might be easier to identify similar papers than to assign categorical labels to the papers. Even for documents with categorical or numerical attributes, it may sometimes be meaningful to consider relative similarities derived from the attribute values. For example, a document written by a 10-year-old is more similar to a document written by a 12-year-old than to one written by a 20-year-old. The value of the absolute age difference is not so important; we generally would not consider the difference between a 10-year-old and a 20-year-old to be 5 times that of the difference between a 10-year-old and a 12-year-old. Instead, the relative difference or similarity is more useful.

In this paper, we study topic modeling on documents with this kind of relative similarities. Our goal is to explore a general model that can utilize such constraints without manipulating the graphical structure of the classical topic models. The intuition is to facilitate the collaboration between topic models and the constraints. Specifically, we first transform the constraints into a loss function, and then design an objective function that combines the log likelihood of the corpus with the loss function. Essentially, the loss function acts as a regularizer for the log likelihood of the text, ensuring that the parameters that minimize the overall objective function balance between fitting the text and satisfying the relative similarity-based constraints. To fit our model, we use the Gibbs-EM algorithm, which alternates between collapsed Gibbs sampling and gradient descent.

We evaluate our model with two datasets, 20 newsgroups and TDT2. Experimental results show that our model outperforms LDA in terms of topic coherence. When we use the learned topics for document classification, our model also achieves higher accuracy than other baselines including supervised LDA [Mcauliffe and Blei, 2008]. In summary, the main contributions of this paper are the following:

- We identify an unexplored type of metadata, which we call relative similarities, for topic modeling. We point out that relative similarities can also be derived from traditional categorical and numerical metadata.
- We propose a general model that combines LDA with relative similarity-based constraints (Section 3).
- We provide an efficient inference algorithm to fit the proposed model (Section 4).
- With two standard datasets, we show that our model is able to learn more meaningful topics than LDA (Section 5).

2 Related Work

Our work is mostly related to topic modeling with additional document metadata. We review two general approaches to such topic modeling.

2.1 Augmented Topic Models

The first approach to modeling document metadata is through modifying the structure of the graphical model. There are mainly two ways to incorporate metadata into the graphical model. One way is to generate the metadata from the topic distributions of each document. In such models, typically a topic is associated with not only a word distribution but also a distribution over different values of an attribute [Wang and McCallum, 2006; Mcauliffe and Blei, 2008; Qiu et al., 2013; Liao et al., 2014].
For example, in [Mcauliffe and Blei, 2008] the authors proposed a general supervised LDA model for scenarios where each document has a response variable. The response variable is assumed to be generated from a Gaussian distribution, the mean of which is determined by the topic assignments of the words in the document through a generalized linear model. This supervised LDA model can be used for predicting response variables such as star ratings of reviews and scores of student essays. Another approach assumes that the topic distribution of each document is a mixture of metadata-specific topic distributions [Rosen-Zvi et al., 2004; McCallum et al., 2005; Eisenstein et al., 2010]. A typical example of this approach is the author-topic model [Rosen-Zvi et al., 2004]. Here, to sample the topic of a word, first one of the authors of the document is chosen, and then a topic is sampled from that author's topic distribution. This approach generally only works for metadata with categorical values.

All the models above assume either a categorical or a numerical type of metadata. For our work, we focus on metadata in the form of relative similarities. It is not easy to modify standard graphical models to incorporate this kind of metadata.

2.2 Topic Modeling with Regularization

Another line of literature related to our work is topic modeling with regularization terms [Cai et al., 2008; Huh and Fienberg, 2010; Tang et al., 2013; McAuley and Leskovec, 2013; Andrzejewski et al., 2011; Mei et al., 2014]. Here the idea is to turn additional information about the documents into constraints or a loss function that need not follow a generative model. For example, in [Cai et al., 2008] and [Huh and Fienberg, 2010], regularization terms were used to capture the manifold structures of documents. In [Tang et al., 2013], the authors added a regularization term that pushes context-specific topic distributions close to the consensus topic distributions. In [McAuley and Leskovec, 2013], rating data were incorporated into the model and rating prediction errors became the additional loss function to regularize standard LDA. Recently, some researchers have also proposed to incorporate domain knowledge in the form of First-Order Logic (FOL) into standard LDA using a regularization framework [Andrzejewski et al., 2011; Mei et al., 2014]. While FOL rules are more expressive in representing knowledge, they require the analysts writing the rules to have a background in FOL. In comparison, relative similarities as we use in this paper are generally easier to obtain. In our work, we borrow the general idea of using a regularization term to incorporate additional knowledge. However, the type of metadata we model is very different from the work discussed above.

3 Model

We address the problem of topic modeling on documents where relative similarities are given. In this section, we first formulate the constraints derived from relative similarities. We then briefly review Latent Dirichlet Allocation (LDA) and propose our model, which combines LDA with relative similarities.

3.1 Constraints

We want to model documents where relative similarities are known for some documents, if not all. Such relative similarities can come from human judges under a particular application setting or be automatically derived from other data. First of all, we assume that we are given a set of documents denoted as D. We further assume that there is a distance function dist(d_i, d_j) defined for any pair of d_i, d_j \in D.
To formally formulate the relative similarities, we assume that we are given a set of T triplets as follows:

S = \{(d_i, d_i^+, d_i^-)\}_{i=1}^{T},    (1)

where d_i, d_i^+, d_i^- \in D, and d_i is more similar to d_i^+ than to d_i^-. In other words,

dist(d_i, d_i^-) > dist(d_i, d_i^+), \quad \forall (d_i, d_i^+, d_i^-) \in S.    (2)

Borrowing the idea from max-margin methods, we further rewrite the constraints as follows:

dist(d_i, d_i^-) \geq dist(d_i, d_i^+) + C, \quad \forall (d_i, d_i^+, d_i^-) \in S,    (3)

where C, a positive constant, is a margin to ensure that dist(d_i, d_i^-) is sufficiently larger than dist(d_i, d_i^+). Given the definitions above, our goal is to minimize the loss function defined below:

\sum_{i=1}^{T} L_i(d_i, d_i^+, d_i^-),    (4)

L_i(d_i, d_i^+, d_i^-) = \max\big(0, \; dist(d_i, d_i^+) + C - dist(d_i, d_i^-)\big).    (5)

The next question is how to define the distance function dist. As we will see, the distance function will be based on the topic distributions learned by LDA.

3.2 Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) [Blei et al., 2003] is a well-developed and widely used topic model. Evolved from Probabilistic Latent Semantic Analysis (PLSA) [Hofmann, 1999], LDA defines a proper Bayesian model which overcomes the overfitting problem in PLSA. Given a document set D where each document d \in D contains N_d words \{w_{d,1}, w_{d,2}, \ldots, w_{d,N_d}\}, we assume that there exist K topics, each associated with a multinomial word distribution \phi_k. Each document has a topic distribution \theta_d in the K-dimensional topic space. Each word in a document has a hidden topic label drawn from the document's topic distribution. Formally, the generative process of LDA is described as follows:

- For each topic k = 1, ..., K, draw \phi_k ~ Dir(\beta)
- For each document d \in D:
  - Draw \theta_d ~ Dir(\alpha)
  - For each word w_{d,n}, n = 1, ..., N_d:
    - Draw z_{d,n} ~ Multi(\theta_d)
    - Draw w_{d,n} ~ Multi(\phi_{z_{d,n}})

Here \alpha and \beta are the parameters of the Dirichlet priors. The parameters of LDA can be learned in several ways, including variational methods and Gibbs sampling.

3.3 The Regularized Model

In this subsection, we present our proposed model, which combines LDA with the constraints defined in Section 3.1. Our motivation is to learn better topics by incorporating these constraints. First of all, without considering the constraints and with only standard LDA, our objective is to find \theta and \phi that maximize the following objective function:

\log \big[ p(w | \theta, \phi) \, p(\theta | \alpha) \, p(\phi | \beta) \big].

To link this objective function with the constraints presented in Section 3.1, we can simply add the two terms together:

\log \big[ p(w | \theta, \phi) \, p(\theta | \alpha) \, p(\phi | \beta) \big] - \eta \sum_{i=1}^{T} L_i(d_i, d_i^+, d_i^-),    (6)

where \eta is a constant to balance the two terms. Now recall that we need to define the distance function dist, and this should be related to our model parameters, i.e. \theta and \phi. Since \phi is not document-specific, we use just \theta to define dist. Intuitively, if two documents have similar \theta_d, then their distance should be smaller. There are several choices we can consider. One is the Euclidean distance, where each \theta_d is treated as a K-dimensional vector and the standard Euclidean distance can be computed between them. In this paper, we experiment with the squared Euclidean distance. Another choice is the KL-divergence, defined as follows:

D_{KL}(\theta \| \theta') = \sum_{k=1}^{K} \theta_k \log \frac{\theta_k}{\theta'_k}.

However, because the KL-divergence is not symmetric, i.e. generally D_{KL}(\theta \| \theta') \neq D_{KL}(\theta' \| \theta), here we consider the symmetric KL-divergence instead:

dist(d_i, d_j) = D_{KL}(\theta_{d_i} \| \theta_{d_j}) + D_{KL}(\theta_{d_j} \| \theta_{d_i}).
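To make the two distance choices and the triplet hinge loss above concrete, the following is a minimal Python/NumPy sketch, assuming each topic distribution \theta_d is available as a length-K probability vector; the function names and the epsilon smoothing are illustrative, not part of the paper:

```python
import numpy as np

def symmetric_kl(theta_i, theta_j, eps=1e-12):
    """Symmetric KL-divergence between two topic distributions."""
    p = np.asarray(theta_i, dtype=float) + eps
    q = np.asarray(theta_j, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def squared_euclidean(theta_i, theta_j):
    """Squared Euclidean distance between two topic distributions."""
    diff = np.asarray(theta_i, dtype=float) - np.asarray(theta_j, dtype=float)
    return float(np.dot(diff, diff))

def triplet_hinge_loss(theta, triplets, dist=symmetric_kl, C=1.0):
    """Sum of hinge losses over (d, d_plus, d_minus) triplets, as in Eqns (4)-(5)."""
    total = 0.0
    for i, i_plus, i_minus in triplets:
        total += max(0.0, dist(theta[i], theta[i_plus]) + C - dist(theta[i], theta[i_minus]))
    return total
```

Either distance function can be plugged into triplet_hinge_loss, mirroring the DRS-KL and DRS-SE variants evaluated later in Section 5.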
However, Eqn (6) is a constrained optimization problem because both \theta and \phi are probability distributions. To transform the objective function into an unconstrained optimization problem, we first define the following transformation function for \theta_{d,k}:

\theta_{d,k} = \frac{e^{\lambda_{d,k}}}{\sum_{k'=1}^{K} e^{\lambda_{d,k'}}}.    (7)

We then change the Dirichlet prior on \theta_d into a Gaussian prior on \lambda_d, that is, each \lambda_{d,k} follows a Gaussian distribution with a zero mean and a variance of \sigma^2. Next we leave \phi out of our objective function and estimate it later based on the hidden variables z (explained in the next section). Our modified objective function becomes the following:

L(\lambda) = \underbrace{\log p(w | \lambda, \beta)}_{\text{log likelihood}} + \underbrace{\log p(\lambda | (0, \sigma^2 I))}_{\text{prior}} - \underbrace{\eta \sum_{i=1}^{T} L_i(d_i, d_i^+, d_i^-)}_{\text{hinge loss}}    (8)

Note that the loss L_i(d_i, d_i^+, d_i^-) is also a function of \lambda.

4 Model Fitting with Gibbs-EM

To optimize the objective function given in Eqn (8), we adopt the Gibbs-EM algorithm [Wallach, 2006]. Recall that we have defined hidden variables z that represent the topic assignments. With the hidden variables, the objective function can be optimized in the following way. During the E-step, we fix the parameter \lambda^{(t)} learned in the t-th iteration and obtain the conditional distribution of the hidden variables p(z | w, \lambda^{(t)}, \beta). During the M-step, we solve the following optimization problem:

\lambda^{(t+1)} = \arg\max_{\lambda} E_{z | w, \lambda^{(t)}, \beta}[L'(\lambda)],

where E_q[f] is the expected value of f with respect to the distribution q, and

L'(\lambda) = \log p(w, z | \lambda, \beta) + \log p(\lambda | (0, \sigma^2 I)) - \eta \sum_{i=1}^{T} L_i(d_i, d_i^+, d_i^-).    (9)

With Gibbs-EM, instead of evaluating the exact conditional distribution p(z | w, \lambda^{(t)}, \beta), we use Gibbs sampling to approximate it.

4.1 E-step

In the E-step, we use Gibbs sampling to sample the hidden topic variables z for all words w while fixing \lambda^{(t)}. To simplify the discussion, we will use \theta when referring to topic distributions and \lambda otherwise. The deterministic relation between them is given in Eqn (7). To perform Gibbs sampling, we need to compute the probability of assigning a topic z_{d,n} to a specific word w_{d,n} given the topic assignments of all the other words:

p(z_{d,n} = k | w, z_{-(d,n)}, \theta, \beta) = \frac{p(w, z | \theta, \beta)}{p(w_{-(d,n)}, z_{-(d,n)} | \theta, \beta)},

where -(d, n) indicates that z_{d,n} or w_{d,n} is excluded. By using the conjugacy of the Dirichlet and multinomial distributions, the Gibbs updating rule of our model can be written as follows:

p(z_{d,n} = k | w, z_{-(d,n)}, \theta, \beta) \propto \frac{n_{k,w_{d,n}} + \beta - 1}{\sum_{v=1}^{V} n_{k,v} + V\beta - 1} \, \theta_{d,k},    (10)

where n_{k,v} denotes the number of times word v is assigned to topic k. As we pointed out in the previous section, we do not directly estimate \phi by optimizing the objective function. But with Gibbs sampling, \phi_{k,v} can be estimated as follows:

\hat{\phi}_{k,v} = \frac{n_{k,v} + \beta}{\sum_{v'=1}^{V} n_{k,v'} + V\beta}.    (11)
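As a reading aid, here is a minimal sketch of one E-step sweep implementing Eqn (10), together with the estimate in Eqn (11). It assumes the topic-word counts are maintained incrementally; all names are illustrative rather than the authors' code. Because the current token is removed from the counts before sampling, the "-1" terms in Eqn (10) are already accounted for.

```python
import numpy as np

def gibbs_estep_pass(docs, z, n_kv, n_k, theta, beta):
    """One Gibbs sweep over all tokens using the update in Eqn (10).

    docs[d]  : list of word ids for document d
    z[d]     : current topic assignments for document d
    n_kv     : K x V matrix of topic-word counts
    n_k      : length-K vector of topic totals (row sums of n_kv)
    theta[d] : current topic distribution of document d (softmax of lambda_d)
    """
    K, V = n_kv.shape
    for d, words in enumerate(docs):
        for n, v in enumerate(words):
            k_old = z[d][n]
            # remove the current assignment from the counts
            n_kv[k_old, v] -= 1
            n_k[k_old] -= 1
            # Eqn (10): unnormalized conditional probability of each topic
            p = (n_kv[:, v] + beta) / (n_k + V * beta) * theta[d]
            p /= p.sum()
            k_new = np.random.choice(K, p=p)
            # add the new assignment back to the counts
            n_kv[k_new, v] += 1
            n_k[k_new] += 1
            z[d][n] = k_new
    return z, n_kv, n_k

def estimate_phi(n_kv, beta):
    """Eqn (11): estimate the topic-word distributions from the counts."""
    return (n_kv + beta) / (n_kv.sum(axis=1, keepdims=True) + n_kv.shape[1] * beta)
```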
Algorithm 1 Gibbs-EM for our model.
Input: D documents, number of topics K, size of the vocabulary V, regularization parameter \eta, margin C, max number of EM iterations n_EM, number of Gibbs sampling iterations n_GS, max number of gradient descent iterations in each M-step n_GD.
Output: \lambda_{d,k} and \phi_{k,v}, d = 1, ..., D; k = 1, ..., K; v = 1, ..., V
1: Randomly initialize z and \lambda
2: t <- 0
3: while (t < n_EM) do
4:   E-step:
5:   Sample z_{d,n} as in Eqn (10) for n_GS iterations
6:   M-step:
7:   n <- 0
8:   while (n < n_GD) do
9:     Compute the objective function L'(\lambda) as in Eqn (9)
10:    Set the learning rate \xi
11:    for (d = 1 to D) do
12:      for (k = 1 to K) do
13:        Compute the partial derivative \partial L'(\lambda) / \partial \lambda_{d,k}
14:        \lambda_{d,k}^{(t)(n+1)} <- \lambda_{d,k}^{(t)(n)} + \xi \, \partial L'(\lambda) / \partial \lambda_{d,k}
15:      end for
16:    end for
17:    n <- n + 1
18:  end while
19:  t <- t + 1
20: end while
21: Compute each \phi_{k,v} as in Eqn (11)

In the M-step, we use the last sample of z obtained from the previous E-step and use gradient descent to learn \lambda^{(t+1)}:

\lambda^{(t+1)} = \arg\max_{\lambda} \Big[ \log p(w, z^{(t+1)} | \lambda, \beta) + \log p(\lambda | (0, \sigma^2 I)) - \eta \sum_{i=1}^{T} L_i(d_i, d_i^+, d_i^-) \Big].    (12)

By computing the first-order partial derivatives of the objective function in Eqn (12) with respect to each \lambda_{d,k}, we can use gradient descent to optimize the objective function. The model fitting algorithm is summarized in Algorithm 1.

5 Experiments

In this section, we present the experiments to evaluate our model. We first describe our datasets. We then conduct experiments both quantitatively and qualitatively.

5.1 Datasets

We use two widely used text corpora, 20 newsgroups (http://qwone.com/~jason/20Newsgroups/) and TDT2 [Cai et al., 2008]. The 20 newsgroups text corpus is a collection of approximately 20,000 newsgroup documents, partitioned evenly across 20 different newsgroups. We used a preprocessed version of this dataset (http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html), where the documents are divided into a training set and a test set. Among the documents in the training set, we randomly selected 100 documents from each category and removed stop words and very short documents (documents with fewer than 3 words), leaving us with 1,997 documents and 14,538 distinct words. The TDT2 corpus consists of 11,201 documents classified into 96 categories. Only the largest 20 categories were kept, and documents appearing in more than one category were removed. We also randomly sampled 100 documents from each category, following the same strategy as for the 20 newsgroups dataset. Finally, 1,998 documents with 12,166 distinct words are left.

5.2 Experimental Setup

In our experiments, we performed 200 runs of Gibbs-EM. In each run, we ran 100 iterations of Gibbs sampling and another 10 iterations of gradient descent. We set the Dirichlet prior \beta = 0.1 and the variance of the Gaussian prior \sigma = 1. We also fixed the number of topics to 20 (the same as the number of categories in each dataset). Note that we do not tune this parameter, since our goal is not to find the optimal number of topics. To automatically obtain the triplet constraints for our model, we sampled a set of triplets from the documents in the training set according to their ground-truth category labels. Specifically, we generated a triplet instance by randomly sampling two documents from the same category and one document from a different category (see the sketch below). In our experiments, we generated 100K, 50K and 10K triplet instances (about 1%, 0.5% and 0.1% of all the triplet instances, respectively) from the training set of each dataset. Experiments were run on a machine with 4 cores and 4GB of memory.
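Referring to the sampling procedure just described, the following is a minimal sketch of how such triplets could be generated from category labels; the function name, seed handling and label format are illustrative assumptions rather than the authors' implementation:

```python
import random
from collections import defaultdict

def sample_triplets(labels, num_triplets, seed=0):
    """Sample (d, d_plus, d_minus) index triplets from category labels.

    labels       : list where labels[d] is the category of training document d
    num_triplets : number of triplet instances to generate
    Returns triplets where d and d_plus share a category and d_minus does not.
    """
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for d, c in enumerate(labels):
        by_cat[c].append(d)
    # categories with at least two documents can supply the similar pair
    cats = [c for c, docs in by_cat.items() if len(docs) >= 2]
    triplets = []
    while len(triplets) < num_triplets:
        c = rng.choice(cats)
        d, d_plus = rng.sample(by_cat[c], 2)
        c_neg = rng.choice([c2 for c2 in by_cat if c2 != c])
        d_minus = rng.choice(by_cat[c_neg])
        triplets.append((d, d_plus, d_minus))
    return triplets
```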
5.3 Quantitative Evaluation

In this subsection, we give a quantitative evaluation of our model on topic coherence and on a document classification task. We evaluate our model with different numbers of sampled triplets (i.e. 100K, 50K and 10K) against other baselines. Specifically, the methods used in our experiments are:

- DRS-KL: our model with the symmetric KL-divergence as the distance metric.
- DRS-SE: our model with the squared Euclidean distance as the distance function.
- LDA: the standard LDA model.
- sLDA: the supervised LDA model [Mcauliffe and Blei, 2008]. This is a strong baseline in which metadata is also used. Here we treat the category labels of all the training documents as the response values. This baseline is used for comparison on the classification task as well as for the computational costs presented later.

Note that for DRS-KL and DRS-SE, we tune the regularization parameter \eta and the margin C, and report the results with the best performance. To test the robustness of our model, we used 5-fold cross validation for all methods. We should emphasize that we sampled triplet instances only from the training documents, i.e. no metadata from the test documents is used. This ensures a fair comparison between our model and the baselines.

Topic Coherence

In this experiment, we want to compare the performance of different topic models. Previous studies usually utilized perplexity (likelihood on held-out data) as the metric. However, such a metric cannot measure the coherence of the learned topics: Chang et al. [Chang et al., 2009] found that perplexity is not always a good indicator of topic coherence [Mimno et al., 2011]. To tackle this problem, we explore another metric to measure the quality of the learned topics. Specifically, to measure the semantic coherence of topics, we use the point-wise mutual information (PMI) [Newman et al., 2010]. PMI measures the co-occurrence of a number of words, which is defined as:

PMI(w) = \frac{2}{N(N-1)} \sum_{1 \le i < j \le N} \log \frac{p(w_i, w_j)}{p(w_i) \, p(w_j)}
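As a rough illustration of how this PMI score can be computed in practice, here is a minimal sketch that estimates p(w_i) and p(w_i, w_j) from document-level co-occurrence in a reference corpus; the co-occurrence unit (whole documents rather than sliding windows) and the epsilon smoothing are simplifying assumptions, not the paper's exact protocol:

```python
import math
from itertools import combinations

def pmi_score(top_words, doc_word_sets, eps=1e-12):
    """Average pairwise PMI of the N most probable words of a topic.

    top_words     : list of the N top words of a topic
    doc_word_sets : list of sets of words, one per reference document,
                    used to estimate p(w) and p(w_i, w_j) by co-occurrence
    """
    total_docs = len(doc_word_sets)

    def p(*words):
        # fraction of reference documents containing all the given words
        hits = sum(1 for s in doc_word_sets if all(w in s for w in words))
        return hits / total_docs

    n = len(top_words)
    score = 0.0
    for wi, wj in combinations(top_words, 2):
        score += math.log((p(wi, wj) + eps) / ((p(wi) + eps) * (p(wj) + eps)))
    # average over the N(N-1)/2 word pairs, matching the 2/(N(N-1)) factor
    return 2.0 * score / (n * (n - 1))
```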