Grounding Topic Models with Knowledge Bases

Zhiting Hu, Gang Luo, Mrinmaya Sachan, Eric Xing, Zaiqing Nie
Microsoft Research, Beijing, China
Microsoft, California, USA
School of Computer Science, Carnegie Mellon University
{zhitingh,mrinmays,epxing}@cs.cmu.edu, {gluo,znie}@microsoft.com
(This work was done when the first two authors were at Microsoft Research, Beijing.)

Abstract

Topic models represent latent topics as probability distributions over words, which can be hard to interpret due to the lack of grounded semantics. In this paper, we propose a structured topic representation based on an entity taxonomy from a knowledge base. A probabilistic model is developed to infer both hidden topics and entities from text corpora. Each topic is equipped with a random walk over the entity hierarchy to extract semantically grounded and coherent themes. Accurate entity modeling is achieved by leveraging rich textual features from the knowledge base. Experiments show the significant superiority of our approach in topic perplexity and key entity identification, indicating the potential of grounded modeling for semantic extraction and language understanding applications.

1 Introduction

Probabilistic topic models [Blei et al., 2003] have been one of the most popular statistical frameworks for identifying latent semantics in large text corpora. The extracted topics are widely used for human exploration [Chaney and Blei, 2012], information retrieval [Wei and Croft, 2006], machine translation [Mimno et al., 2009], and so forth. Despite their popularity, topic models are weak models of natural language semantics. The extracted topics are difficult to interpret due to incoherence [Chang et al., 2009] and lack of background context [Wang et al., 2007]. Furthermore, it is hard to grasp semantics from topics formulated merely as word distributions without any grounded meaning [Song et al., 2011; Gabrilovich and Markovitch, 2009]. Though recent research has attempted to exploit various knowledge sources to improve topic modeling, existing approaches either retain the key weakness of representing topics merely as distributions over words or phrases [Mei et al., 2014; Boyd-Graber et al., 2007; Newman et al., 2006], or sacrifice the flexibility of topic models by imposing a one-to-one binding of topics to pre-defined knowledge base (KB) entities [Gabrilovich and Markovitch, 2009; Chemudugunta et al., 2008].

This paper aims to bridge the gap by proposing a new structured representation of latent topics based on entity taxonomies from KBs. Figure 1 illustrates an example topic extracted from a news corpus. Entities organized in the hierarchical structure carry salient context for human and machine interpretation. For example, the relatively high weight of entity Amy Winehouse can be attributed to the fact that Winehouse and Houston were both prominent singers who died from drug-related causes. In addition, the varying weights associated with taxonomy nodes ensure the flexibility to express the gist of diverse corpora. The new modeling scheme poses challenges for inference, as both topics and entities are hidden in the observed text and the topics are regularized by the hierarchical knowledge. We develop Latent Grounded Semantic Analysis (LGSA), a probabilistic generative model, to infer both topics and entities from text corpora.
Each topic is equipped with a random walk over the taxonomy, which naturally integrates the structure to ground the semantics and leverages the highly organized knowledge to capture entity correlations. For accurate entity modeling, we augment bag-of-words documents with entity mentions and incorporate rich textual features of entities from KBs. To keep inference over large corpora and KBs practical, we use ontology pruning and dynamic programming.

Extensive experiments validate the effectiveness of our approach. LGSA improves topic quality significantly in terms of perplexity. We apply the model to identify the key entities of documents (e.g., the dominant figures of a news article); LGSA achieves a 10% improvement (precision@1 from 80% to 90%) over the best-performing competitors, showing strong potential for semantic search and knowledge acquisition. To our knowledge, this is the first work to combine a statistical topic representation with a structured entity taxonomy. Our probabilistic model, which incorporates rich world knowledge, provides a potentially useful scheme for accurately inducing grounded semantics from natural language data.

Figure 1: Structured representation of the topic "Death of Whitney Houston" from our model. Entities (leaf rectangular nodes) and categories (internal elliptical nodes) with the highest probabilities are shown; top nodes include Music Awards, Whitney Houston, Amy Winehouse, Death, Deaths by Drowning, Funeral, and Condolences. The thickness of each node's outline is proportional to the weight of that node. The dashed links denote multiple edges that have been consolidated.

2 Related Work

Topic modeling: Probabilistic topic models such as LDA [Blei et al., 2003] identify latent topics purely based on observed data. However, it is well known that topic models are only a weak model of semantics. Hence, a large amount of recent work has attempted to incorporate domain knowledge [Foulds et al., 2015; Yang et al., 2015; Mei et al., 2014; Chen and Liu, 2014; Andrzejewski et al., 2011] or relational information [Chang and Blei, 2009; Hu et al., 2015b] as regularization in topic modeling. Yet, a clear shortcoming of these models is that topics are still modeled simply as word distributions without grounded meanings. LGSA mitigates this by grounding topics to KB entities. Another line of research attempts to explicitly map document semantics to human-defined concepts [Gabrilovich and Markovitch, 2009; Chemudugunta et al., 2008]. These methods assume a one-to-one correspondence between topics and a small set of ontological concepts. Though the resulting topics have clear meanings, these methods sacrifice the expressiveness and compactness of latent semantic models. LGSA ensures both interpretability and flexibility by extracting latent yet grounded topics. Topic models have also been used to model KB entities for the entity linking task [Han and Sun, 2012; Kataria et al., 2011], which has a different focus from ours.

Using hierarchical knowledge: Semantic hierarchies are key knowledge sources [Resnik, 1995; Hu et al., 2015a]. A few generative models have been developed for specific tasks that integrate hierarchical structures through random walks [Kataria et al., 2011; Hu et al., 2014; Boyd-Graber et al., 2007].
E.g., [Boyd-Graber et al., 2007] exploits WordNet-Walk [Abney and Light, 1999] for word sense disambiguation. Our work is distinct in that we use an entity taxonomy to construct a representation of topics; moreover, we infer hidden entities from text, leading to unique inference complexity. We propose an efficient approach to tackle this issue. Note that our work also differs from hierarchical topic models [Griffiths and Tenenbaum, 2004; Movshovitz-Attias and Cohen, 2015], which aim to infer latent hierarchies from data rather than ground latent semantics to existing KBs.

3 Latent Grounded Semantic Analysis

Model Overview: LGSA is an unsupervised probabilistic model that goes beyond conventional word-based topic modeling and represents latent topics based on highly organized KB entity taxonomies. We first augment conventional bag-of-words documents with entity mentions in order to capture salient semantics (Sec. 3.1). An entity is modeled as distributions over both mentions and words; here we leverage entities' rich textual features from KBs for accurate entity modeling (Sec. 3.3). Finally, each topic is associated with a root-to-leaf random walk over the entity taxonomy. This endows the topic with a semantic structure and captures the valuable entity correlation knowledge encoded in the taxonomy (Sec. 3.2). A well-defined generative process combines the above components in a joint framework (Sec. 3.4). Next, we present the details of each component (Table 1 lists the key notation; Figure 2 illustrates a running example of LGSA; and the graphical model representation of LGSA is shown in Figure 3).

Table 1: Notation used in this paper.
E, M, V: the sets of entities, mentions, and words
D, K, E, M, V: the number of documents and topics, and the vocabulary sizes of entities/mentions/words
M_d, V_d: the number of mention and word occurrences in doc d
m_dj: the j-th mention in doc d
w_dl: the l-th word in doc d
z_dj: the topic associated with m_dj
e_dj: the entity associated with m_dj
r_dj: the path in the taxonomy associated with m_dj
y_dl: the entity index associated with w_dl
θ_d, θ'_d: multinomial distributions over the topics and entities of doc d
π_k: random-walk transition distributions of topic k
φ_k: multinomial distribution over entities of topic k
ψ_k: probabilities over category nodes of topic k
η_e, ξ_e: multinomial distributions over the mentions and words of entity e
p^η_e, p^ξ_e: base measures of the priors over η_e and ξ_e, extracted from KBs
λ^η, λ^ξ: concentration parameters (prior strengths)

3.1 Document Modeling

Topic models usually represent a document as a bag of words. However, language has rich structure, and different word classes perform different functions in the cognitive understanding of text. The noun class (e.g., entity mentions in a news article) is an important building block and carries much of the salient semantics in a document. Semanticists have often debated the cognitive economy of the noun class [Kemp and Regier, 2012]: if every mention had a unique name, this would eliminate ambiguity; however, our memory is constrained and it is impossible to remember all mention names, so ambiguity is essential. Our work grounds mentions to KBs, leading to a consistent interpretation of entities. As the first step, our model augments the bag-of-words representation with mentions.
Figure 2: Model overview, illustrated on the example document "Gates, the co-founder of Microsoft, was the wealthiest man in the world." Mentions Gates and Microsoft (highlighted) in the document refer to entities in the KB taxonomy. Words co-founder and wealthiest (underlined) describe entity Bill Gates. A topic random walk is parameterized by the parent-to-child transition probabilities.

Figure 3: Graphical model representation of LGSA. The entity e_dj is the leaf of the path r_dj. The document's entity distribution θ'_d is derived from the other variables and thus does not participate in the generative process directly.

Each document d ∈ {1, . . . , D} is now represented by a set of words w_d = {w_dl}_{l=1}^{V_d} as well as a set of mentions m_d = {m_dj}_{j=1}^{M_d} occurring in d. Mentions can be automatically identified using existing mention detection tools; e.g., the document in Figure 2 contains mentions {Gates, Microsoft} and words {co-founder, ...}. Each document d is associated with a topic distribution θ_d = {θ_dk}_{k=1}^{K} and an entity distribution θ'_d = {θ'_de}_{e=1}^{E}, based on which the entity groundings can be identified. LGSA simulates a generative process in which the entity mentions are determined first and the content words come later to describe the entities' attributes and actions (e.g., in Figure 2, wealthiest characterizes Gates). This leads to a differential treatment of mentions and words in the generative procedure: each mention m_dj is associated with a topic z_dj and an entity e_dj (drawn from z_dj as described next), while each word w_dl is associated with an index y_dl indicating that w_dl describes the y_dl-th mentioned entity (i.e., e_{d,y_dl}).

3.2 Topic Random Walk on Entity Taxonomy

We now present the taxonomy-based modeling of latent topics, from which the underlying entities {e_dj} of the mentions {m_dj} are drawn. A KB entity taxonomy is a hierarchical structure that encodes rich knowledge of entity correlations; e.g., nearby entities tend to be relevant to the same topics. To capture this useful information through a generative procedure, we model each topic as a root-to-leaf random walk over the entity taxonomy.

Let E be the set of entities from the KB and H be the hierarchical taxonomy, where entities are leaf nodes assigned to one or more categories, and categories are further organized into a hierarchical structure in a generic-to-specific manner. For each category node c, we denote the set of its immediate children (subcategories or leaf entities) as C(c). The topic random walk over H for topic k (denoted π_k) is parameterized by a set of parent-to-child transitions, i.e., π_k = {π_{k,c}}_{c ∈ H}, where π_{k,c} = {π_{k,cc'}}_{c' ∈ C(c)} is the transition distribution from c to its children. Starting from the root category c_0, a child is selected according to π_{k,c_0}. The process continues until a leaf entity node is reached. Hence the random walk assigns each generated entity e_dj a root-to-leaf path r_dj.

A desirable property of the random walk is that entities with common ancestors in the hierarchy share sub-paths starting at the root and thus tend to have similar generating probabilities. This effectively encourages the clustering of highly correlated entities and produces semantically coherent topics. For example, entities Bill Gates and Microsoft Inc. in Figure 2 share the sub-path from the root to category IT, which carries a transition probability of 0.9. Thus the two entities are likely to both have high generating probabilities in this specific topic, while the less relevant Kobe Bryant will have a low probability. Based on π_k, we can compute the probability of the random walk reaching each of the entities, and hence obtain a distribution over entities, φ_k.
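To make concrete how the transition parameters π_k induce the entity distribution φ_k (and, as formalized next, per-category probabilities), here is a minimal sketch of propagating root-to-leaf random-walk mass down a taxonomy. It is not the authors' implementation; the dictionary-based encoding, function names, and the toy hierarchy with its 0.9/0.1 transitions are illustrative assumptions.

```python
from collections import defaultdict

def reach_probabilities(children, trans, root):
    """Propagate root-to-leaf random-walk mass down an entity taxonomy.

    children: dict mapping each category c to its children C(c)
              (nodes that never appear as keys are leaf entities).
    trans:    dict mapping category c to {child: transition prob}, i.e. pi_{k,c}.
    Returns (category_prob, phi): the probability that the walk visits each
    category node, and the induced distribution phi_k over leaf entities.
    """
    category_prob = defaultdict(float)
    phi = defaultdict(float)
    category_prob[root] = 1.0
    # A simple stack suffices for a tree; for a DAG, visit nodes in topological order.
    frontier = [root]
    while frontier:
        c = frontier.pop()
        for child, p in trans[c].items():
            mass = category_prob[c] * p
            if child in children:            # internal category node
                category_prob[child] += mass
                frontier.append(child)
            else:                            # leaf entity node
                phi[child] += mass           # sums over all root-to-leaf paths
    return dict(category_prob), dict(phi)

# Toy taxonomy and transitions for one topic (illustrative values only).
children = {"Root": ["IT", "Sports"],
            "IT": ["Bill Gates", "Microsoft Inc."],
            "Sports": ["Kobe Bryant"]}
trans = {"Root": {"IT": 0.9, "Sports": 0.1},
         "IT": {"Bill Gates": 0.6, "Microsoft Inc.": 0.4},
         "Sports": {"Kobe Bryant": 1.0}}
print(reach_probabilities(children, trans, "Root")[1])
# approx. {'Kobe Bryant': 0.1, 'Bill Gates': 0.54, 'Microsoft Inc.': 0.36}
```

Because entities under the high-probability category share its mass, this toy run reproduces the behavior described above: Bill Gates and Microsoft Inc. both receive high weight, while Kobe Bryant does not.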
Similarly, for each category node c we can compute a probability ψ_{kc} indicating how likely c is to be included in a random-walk path. The set of parameters {π_k, φ_k, ψ_k} together forms a structured representation of the latent topic k, which has grounded meaning.

3.3 Entity Modeling on Mentions and Words

As described before, we learn entity representations in both the mention and word spaces. Moreover, since the rich textual features of entities in KBs encode the relevance between entities and mentions/words, we leverage them to construct informative priors for accurate entity modeling. Specifically, each entity e ∈ E has a distribution over the mention vocabulary M, denoted η_e, along with a distribution over the word vocabulary V, denoted ξ_e. Intuitively, η_e captures the relatedness between e and other entities (e.g., mention Gates tends to have high probability in entity Microsoft Inc.'s mention distribution), while ξ_e characterizes the attributes of entity e (e.g., word wealthiest for entity Bill Gates).

The informative priors over η_e and ξ_e are derived from the frequencies of mentions and words on entity e's Wikipedia page. Let p^η_e be the prior mention distribution over η_e, with each dimension p^η_{em} proportional to the frequency of mention m on e's page. The prior word distribution p^ξ_e over ξ_e is built in a similar manner. To reflect the confidence in the prior knowledge, we introduce scaling factors λ^η and λ^ξ, with larger values indicating greater emphasis on the prior.

Note that in LGSA the mention distribution of an entity (e.g., Microsoft Inc.) can put mass not only on its referring mentions (e.g., Microsoft) but also on other related mentions (e.g., Gates). This captures the intuition that, for instance, the observation of Gates can promote the probability of the document being about Microsoft Inc. This differs from previous entity linking methods and improves the detection of a document's key entities, as shown in our empirical studies.

3.4 Generative Process

Algorithm 1 summarizes the generative process of LGSA, which combines all the above components. Given the mentions m_d and words w_d of a document d, each mention is first assigned a topic according to the topic distribution θ_d. The topics in turn generate an entity for each mention through the random walks. For each word, one of these entities is uniformly selected.

Algorithm 1: Generative process of LGSA
For each topic k = 1, 2, . . . , K:
  1. For each category c ∈ H, sample the transition probabilities: π_{kc} | β ~ Dir(β).
For each entity e = 1, 2, . . . , E:
  1. Sample the mention distribution: η_e | λ^η, p^η_e ~ Dir(λ^η p^η_e).
  2. Sample the word distribution: ξ_e | λ^ξ, p^ξ_e ~ Dir(λ^ξ p^ξ_e).
For each document d = 1, 2, . . . , D:
  1. Sample the topic distribution: θ_d | α ~ Dir(α).
  2. For each mention m_dj ∈ m_d:
     (a) Sample a topic indicator: z_dj | θ_d ~ Multi(θ_d).
     (b) Initialize the path r_dj = {c_0}, and set h = 0.
     (c) While a leaf is not reached:
         i. Sample the next node: c_{h+1} ~ Multi(π_{z_dj, c_h}).
         ii. If c_{h+1} is a leaf node, set the corresponding entity e_dj = c_{h+1}; otherwise, set h = h + 1.
     (d) Sample the mention: m_dj | η_{e_dj} ~ Multi(η_{e_dj}).
  3. For each word w_dl ∈ w_d:
     (a) Sample an index y_dl ~ Unif(1, . . . , M_d).
     (b) Set e'_dl := e_{d,y_dl}.
     (c) Sample the word: w_dl | ξ_{e'_dl} ~ Multi(ξ_{e'_dl}).
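For readers who prefer code, below is a minimal, self-contained sketch of Algorithm 1 as a forward simulation (no inference). It is not the authors' code: the toy taxonomy, vocabularies, and uniform base measures are illustrative stand-ins for the Wikipedia-derived priors p^η_e and p^ξ_e.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Tiny illustrative setup (NOT the paper's data or priors) -----------------
K = 2                                   # number of topics
children = {"Root": ["IT", "Sports"],   # category -> children; non-keys are entities
            "IT": ["Bill Gates", "Microsoft Inc."],
            "Sports": ["Kobe Bryant"]}
entities = ["Bill Gates", "Microsoft Inc.", "Kobe Bryant"]
mentions = ["Gates", "Microsoft", "Kobe"]
words = ["co-founder", "wealthiest", "basketball"]
alpha, beta = 50.0 / K, 0.01
lam_eta, lam_xi = 10.0, 10.0
# Base measures would come from entity Wikipedia pages; uniform here for brevity.
p_eta = {e: np.full(len(mentions), 1.0 / len(mentions)) for e in entities}
p_xi = {e: np.full(len(words), 1.0 / len(words)) for e in entities}

# --- Sample global parameters (topic transitions and entity distributions) ----
pi = {k: {c: rng.dirichlet(np.full(len(ch), beta))
          for c, ch in children.items()} for k in range(K)}
eta = {e: rng.dirichlet(lam_eta * p_eta[e]) for e in entities}
xi = {e: rng.dirichlet(lam_xi * p_xi[e]) for e in entities}

def generate_document(n_mentions=3, n_words=5):
    """Forward-simulate one document following Algorithm 1."""
    theta = rng.dirichlet(np.full(K, alpha))          # topic distribution
    doc_mentions, doc_entities = [], []
    for _ in range(n_mentions):
        z = rng.choice(K, p=theta)                    # topic indicator
        c = "Root"                                    # root-to-leaf random walk
        while c in children:
            c = rng.choice(children[c], p=pi[z][c])
        doc_entities.append(c)                        # leaf entity e_dj
        doc_mentions.append(mentions[rng.choice(len(mentions), p=eta[c])])
    doc_words = []
    for _ in range(n_words):
        y = rng.integers(n_mentions)                  # uniform mention index y_dl
        doc_words.append(words[rng.choice(len(words), p=xi[doc_entities[y]])])
    return doc_mentions, doc_words

print(generate_document())
```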
4 Model Inference

Exact inference for LGSA is intractable due to the coupling between the hidden variables. We exploit collapsed Gibbs sampling [Griffiths and Steyvers, 2004] for approximate inference. As a widely used Markov chain Monte Carlo algorithm, Gibbs sampling iteratively samples the latent variables ({z, r, e, y} in LGSA) from a Markov chain whose stationary distribution is the posterior. The samples are then used to estimate the distributions of interest: {θ, θ', π, φ, ψ, η, ξ}. We directly give the sampling formulas here and provide the detailed derivations in the supplementary materials.

Sampling the topic z_dj for mention m_dj according to:

p(z_dj = z | e_dj = e, r_{-dj}, ·) ∝ (n^{(z)}_d + α) / (n^{(·)}_d + Kα) · Σ_{r: e ∈ r} p(r | r_{-dj}, z_dj = z, ·),   (1)

where n^{(z)}_d denotes the number of mentions in document d that are associated with topic z. Marginal counts are represented with dots; e.g., n^{(·)}_d is obtained by marginalizing n^{(z)}_d over z. The second term of Eq. (1) is the sum of the probabilities of all paths that could have generated entity e, conditioned on topic z. Here the probability of a path r is the product of the topic-specific transition probabilities along the path from the root c_0 to the leaf c_{|r|-1} (i.e., entity e):

p(r | r_{-dj}, z_dj = z, ·) = ∏_{h=0}^{|r|-2} (n^{(z)}_{c_h, c_{h+1}} + β) / (n^{(z)}_{c_h, ·} + |C(c_h)| β),   (2)

where n^{(z)}_{c_h, c_{h+1}} is the number of paths in topic z that go from c_h to c_{h+1}. All the above counters are calculated with the mention m_dj excluded.

Sampling the path r_dj and entity e_dj for mention m_dj as:

p(r_dj = r, e_dj = e | z_dj = z, m_dj = m, ·) ∝ p(r | r_{-dj}, z_dj = z, ·) · (n^{(m)}_e + λ^η p^η_{em}) / (n^{(·)}_e + λ^η),   (3)

where n^{(m)}_e is the number of times that mention m is generated by entity e; n^{(e)}_d is the number of mentions in d that are associated with e; and q^{(e)}_d = Σ_l 1(e_{d,y_dl} = e) is the number of words in d that are associated with e. All the counters are calculated with the mention m_dj excluded.

Sampling the index y_dl for word w_dl according to:

p(y_dl = y | e_{dy} = e, w_dl = w, ·) ∝ (n^{(w)}_e + λ^ξ p^ξ_{ew}) / (n^{(·)}_e + λ^ξ),   (4)

where n^{(w)}_e is the number of times that word w is generated by entity e, calculated with w_dl excluded.

The Dirichlet hyperparameters are set to fixed values, α = 50/K and β = 0.01, a common setting in topic modeling. We investigate the effects of λ^η and λ^ξ in our empirical studies.

Efficient inference in practice: Inference on large text corpora and KBs can be complicated. To ensure efficiency in practice, we use ontology pruning, dynamic programming, and careful initialization. (a) The total number of entity paths can be very large, rendering the computation of Eq. (3) for all paths prohibitive. We observe that in general only a few entities in E are relevant to a given document, and these are typically the ones whose name mentions occur in the document [Kataria et al., 2011]. Hence, we select candidate entities for each document using a name-to-entity dictionary [Han and Sun, 2011], and only the paths of these entities are considered when sampling. Our experiments show the approximation has negligible impact on modeling performance while dramatically reducing the sampling complexity, making the inference practical. (b) We further reduce the hierarchy depth by pruning low-level concrete category nodes (those whose shortest root-to-node path length exceeds a threshold). We found that such a coarse entity ontology is sufficient to provide strong performance. (c) To compute the path probabilities (Eq. (2)), we use dynamic programming to avoid redundant computation. (d) We initialize the entity and path assignments to ensure a good starting point: the entity assignment of a mention is sampled from the prior entity-mention distributions p^η; based on these assignments, a path leading to the respective entity is then sampled according to an initializing transition distribution in which the probability of transitioning from a category c to its child c' is proportional to the total frequency of the descendant entities of c'.
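As a concrete illustration of how Eqs. (1) and (2) can be evaluated over the pruned candidate paths, here is a schematic sketch. The count-table layout and function names are our assumptions, and the mention/word factors of Eqs. (3)-(4), as well as the count decrements and increments around each sampling step, are omitted for brevity.

```python
import numpy as np
from collections import defaultdict

# Count tables maintained by the sampler (layout is an illustrative assumption):
#   n_dz[d][z]          -- mentions in doc d currently assigned topic z
#   n_edge[z][(c, c1)]  -- paths in topic z using the edge c -> c1
#   n_out[z][c]         -- paths in topic z leaving node c (the marginal count)
n_dz = defaultdict(lambda: defaultdict(int))
n_edge = defaultdict(lambda: defaultdict(int))
n_out = defaultdict(lambda: defaultdict(int))

def path_prob(path, z, n_children, beta=0.01):
    """Eq. (2): product of smoothed topic-specific transition probabilities
    along a root-to-leaf path (current mention's counts assumed excluded)."""
    p = 1.0
    for c, c_next in zip(path[:-1], path[1:]):
        p *= (n_edge[z][(c, c_next)] + beta) / (n_out[z][c] + n_children[c] * beta)
    return p

def sample_topic(d, candidate_paths, K, n_children, alpha, rng):
    """Eq. (1): resample z_dj for a mention whose entity e_dj is fixed.

    candidate_paths: all root-to-leaf paths ending at e_dj, restricted to the
    pruned candidate set; their probabilities are summed per topic."""
    scores = np.empty(K)
    for z in range(K):
        doc_term = n_dz[d][z] + alpha
        path_term = sum(path_prob(r, z, n_children) for r in candidate_paths)
        scores[z] = doc_term * path_term
    scores /= scores.sum()
    return int(rng.choice(K, p=scores))

# Tiny usage example with empty counts (everything falls back to the priors).
rng = np.random.default_rng(0)
n_children = {"Root": 2, "IT": 2}
paths = [["Root", "IT", "Microsoft Inc."]]
print(sample_topic(d=0, candidate_paths=paths, K=2, n_children=n_children,
                   alpha=25.0, rng=rng))
```

In a full sampler, repeated edge terms along shared sub-paths would be cached (the dynamic programming mentioned in (c)) rather than recomputed per path.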
5 Experiments

We evaluate LGSA's modeling performance on two news corpora. We observe that LGSA reduces topic perplexity significantly. In the task of key entity identification, LGSA improves over the competitors by 10% in precision@1. We also explore the effects of the entity textual priors.

Table 2: Statistics of the two datasets. The first three columns describe the text corpus and the last three the pruned Wikipedia KB; the numbers in parentheses are vocabulary sizes. The average number of paths to each entity in the TMZ and NYT KBs is 300 and 25, respectively.

Dataset | #doc | #word       | #mention  | #entity | #category | #layer
TMZ     | 3.2K | 150K (4.6K) | 71K (15K) | 72K     | 102K      | 11
NYT     | 0.3M | 130M (169K) | 13M (71K) | 100K    | 7.1K      | 4

Datasets: We evaluate on two news corpora (Table 2): (a) TMZ news is collected from TMZ.com, a popular celebrity gossip website; each news article is tagged with one or more celebrities, which serve as ground truth in the task of key entity identification. (b) NYT news is a widely used large corpus from the LDC [1]. For both datasets, we extract the mentions of each article using the mention annotation tool The Wiki Machine [2]. We use the Wikipedia snapshot of 04/02/2014 as our KB. In Wikipedia, entities correspond to Wikipedia pages, which are organized as leaf nodes of a category hierarchy. We pruned irrelevant entities and categories for each dataset.

[1] https://www.ldc.upenn.edu
[2] http://thewikimachine.fbk.eu

Baselines: We compare the proposed LGSA with the following competitors (Table 3 lists their differences): (a) Concept TM (Cnpt TM) [Chemudugunta et al., 2008] employs ontological knowledge by assuming a one-to-one correspondence between human-defined entities and latent topics, so each topic has identifiable, transparent semantics. (b) The Entity-Topic Model (ETM) [Newman et al., 2006] models both the words and the mentions of documents, with word topics and mention topics respectively; no external knowledge is incorporated in ETM. (c) Latent Dirichlet Allocation (LDA) [Blei et al., 2003] is a bag-of-words model that represents each latent topic as a word distribution. Following [Gabrilovich and Markovitch, 2009], LDA can be used to identify key entities by measuring the similarity between the topic distributions of the document and of the entity's Wikipedia page. (d) Explicit Semantic Analysis (ESA) [Gabrilovich and Markovitch, 2009] is a popular Wikipedia-based method aimed at finding relevant entities as the semantics of text; features including content words and Wikipedia link structure are used to measure the relatedness between documents and entities. (e) Mention Annotation & Counting (MA-C): we map each mention to its referent entity and rank the entities by the frequency with which they are mentioned, with priority of occurrence used to break ties (a minimal sketch of this scheme follows after this list); we use The Wiki Machine in the mention-annotation step. (f) LGSA without Hierarchy (LGSA-NH): to directly measure the advantage of the structured topic representation, we design this intrinsic competitor, which models each latent topic as a distribution over entities without incorporating the entity hierarchy.
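The MA-C baseline in (e) reduces to frequency counting with earliest-occurrence tie-breaking; the following is a minimal sketch of that scheme. The mention-to-entity pairs are assumed to come from an annotator such as The Wiki Machine, and the function name is ours.

```python
from collections import Counter

def rank_entities_ma_c(annotated_mentions):
    """MA-C baseline: rank entities by mention frequency,
    breaking ties by the position of their first occurrence.

    annotated_mentions: list of (mention_string, linked_entity) pairs in
    document order, as produced by a mention annotation tool.
    """
    freq = Counter(entity for _, entity in annotated_mentions)
    first_pos = {}
    for pos, (_, entity) in enumerate(annotated_mentions):
        first_pos.setdefault(entity, pos)
    # Higher frequency first; earlier first occurrence wins ties.
    return sorted(freq, key=lambda e: (-freq[e], first_pos[e]))

doc = [("Gates", "Bill Gates"), ("Microsoft", "Microsoft Inc."),
       ("Gates", "Bill Gates")]
print(rank_entities_ma_c(doc))  # ['Bill Gates', 'Microsoft Inc.']
```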
Topic Perplexity: We evaluate the quality of the extracted topics by topic perplexity [Blei et al., 2003]. As a widely used metric in text modeling, perplexity measures the predictive power of a model in terms of predicting the words of unseen held-out documents [Chemudugunta et al., 2008]. A lower perplexity means better generalization performance.

Table 3: Feature and task comparison of the different methods (features used: word, mention, structured knowledge; tasks supported: topic extraction, key entity identification) for Cnpt TM, MA-C, LGSA-NH, and LGSA.

We use 5-fold cross-validation for testing. Figures 4a and 4b show the perplexity values on the TMZ and NYT corpora, respectively, for different numbers of topics. We see that LGSA consistently yields the lowest perplexity, indicating the highest predictive quality of the extracted topics. We further observe that: (a) ETM and LDA perform worse than Cnpt TM and LGSA, showing that without the guidance of human knowledge, purely data-driven methods are incapable of accurately modeling latent text semantics. (b) Compared to LGSA, Cnpt TM has inferior performance because it binds each topic to one pre-defined concept, which is not flexible enough to represent diverse corpus semantics; LGSA avoids this pitfall by associating an entity distribution with each topic, which is both expressive and interpretable. (c) Comparing LGSA and LGSA-NH further reveals the advantage of the structured topic representation: LGSA reduces perplexity by 6.5% on average. (d) Even without the taxonomy structure, LGSA-NH still outperforms the baselines. This is because our model goes beyond the bag-of-words assumption and accounts for the mentions and their underlying entities, which captures salient text semantics. (e) On the NYT dataset, LDA and ETM perform best at K = 400, where our method yields 17.9% and 5.03% lower perplexity, respectively. This again validates the benefit of incorporating world knowledge.

Key Entity Identification: Identifying the key entities of documents (e.g., the persons a news article is mainly about) serves to reveal fine-grained semantics as well as to map documents to structured ontologies, which in turn facilitates downstream applications such as semantic search and document categorization. Our next evaluation measures the precision of LGSA in key entity identification. We test on the TMZ dataset since ground truth (usually a celebrity) is available. Given a document d, LGSA infers its entity distribution θ'_d and ranks entities accordingly. Figure 4c shows the Precision@R (the proportion of test instances where a correct key entity is included in the top-R predictions) based on 5-fold cross-validation. Here, both LGSA and LDA achieve their best performance with the number of topics K = 30. From the figure, we can see that LGSA consistently outperforms all the other methods and achieves 90% precision at rank 1.

Figure 4: (a) Perplexity on the TMZ dataset, (b) perplexity on the NYT dataset, and (c) Precision@R of key entity identification on TMZ.
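The Precision@R reported in Figure 4c can be computed as in the sketch below, assuming each test document comes with a ranked entity list (e.g., entities sorted by θ'_d) and a set of ground-truth key entities; the names and toy data are illustrative.

```python
def precision_at_r(ranked_entities_per_doc, gold_entities_per_doc, R):
    """Fraction of test documents whose top-R ranked entities
    contain at least one ground-truth key entity."""
    hits = sum(
        1 for ranked, gold in zip(ranked_entities_per_doc, gold_entities_per_doc)
        if set(ranked[:R]) & set(gold)
    )
    return hits / len(ranked_entities_per_doc)

ranked = [["Whitney Houston", "Amy Winehouse"], ["Kobe Bryant", "LeBron James"]]
gold = [{"Whitney Houston"}, {"LeBron James"}]
print(precision_at_r(ranked, gold, R=1))  # 0.5
```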
Figure 5: Topics (a) "Sports" and (b) "Kardashian and Humphries Divorce", showing the top entities (by the entity distributions φ) and categories (by the probabilities ψ of reaching the category nodes through the random walks); top nodes include ESPN, Basketball, Kobe Bryant, L.A. Lakers, and L.A. Dodgers for (a), and Kris Humphries and Kim Kardashian for (b). The titles of several news articles are attached to their top-1 key entities.

The results reveal that: (a) MA-C has inferior performance to LGSA, which can be attributed to its improper decoupling of candidate selection (i.e., mention annotation) and ranking (i.e., counting). For instance, though the observation of mention Gates may help to correctly annotate mention MS as referring to entity Microsoft Inc., it cannot directly promote the weight of Microsoft Inc. in the document. In contrast, LGSA captures this useful signal by allowing each entity to associate weights with all relevant mentions (Sec. 3.3). (b) Our proposed model also outperforms ESA and LDA. Indeed, LGSA essentially combines these two lines of work (the explicit and the latent semantic representations) by stacking a latent topic layer over the explicit entity knowledge. This ensures the best of both worlds: the flexibility of latent modeling and the interpretability of explicit modeling. (c) LGSA-NH is superior to the previous methods while falling behind the full model. This confirms the effect of incorporating grounded hierarchical knowledge.

Qualitative Analysis: We now qualitatively investigate the extracted topics, illustrating the benefits of semantically grounded modeling as well as revealing potential directions for future improvement. Figure 5 shows two example topics from the TMZ corpus. We can see that the top-ranked entities and categories are semantically coherent, and the highly organized structure provides rich context and relations between the top entities (e.g., Kobe Bryant and LeBron James are both from the NBA), helping topic interpretation. More importantly, the extracted topics show the benefits of entity grounding in latent semantic modeling. Figure 5 also shows example news titles and their key entities inferred by LGSA. This naturally links documents to KBs, showing strong potential for semantic search and automatic knowledge acquisition. It is also noticeable that no single entity or category in Wikipedia directly corresponds to the topic of the Kardashian and Humphries divorce; instead, the full meaning is constituted through the combination of a priori unrelated ones. This validates the superior expressiveness of LGSA compared to Cnpt TM and ESA, which rely on pre-defined concepts. The analysis also reveals some room for improving our work. E.g., the actions of Kardashian and Humphries are captured by the entity Divorce, whereas incorporating action representations (e.g., verbs with grounded meaning) would help to characterize the full semantics more directly. We consider this as future work.

Figure 6: Effect of the entity prior strength λ on the TMZ dataset.

Impact of Entity Prior Strengths: LGSA leverages the mention/word frequencies of entities in KBs to construct informative priors over the mention/word distributions. Here we study the effect of these entity priors by showing the performance variation under different prior strengths. Figure 6 shows the results, where we have set λ^η = λ^ξ = λ for simplicity. We can see that LGSA performs best with a modest λ value (i.e., 10.0) in both tasks.
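To illustrate how the prior strength λ enters the model, the sketch below builds a Dirichlet prior λ^ξ · p^ξ_e over one entity's word distribution from word counts on its Wikipedia page. The count source, the small smoothing floor, and the names are our illustrative assumptions, not the paper's exact preprocessing.

```python
import numpy as np

def entity_word_prior(page_word_counts, vocab, lam_xi=10.0):
    """Build the Dirichlet prior lam_xi * p_xi_e over an entity's word
    distribution from word frequencies on its Wikipedia page.

    page_word_counts: dict word -> frequency on the entity's page.
    Returns the Dirichlet parameter vector aligned with `vocab`.
    """
    counts = np.array([page_word_counts.get(w, 0.0) for w in vocab], dtype=float)
    counts += 1e-3                      # small floor (our assumption) so every word has support
    p_xi = counts / counts.sum()        # base measure p^xi_e
    return lam_xi * p_xi                # larger lam_xi puts more weight on the KB prior

vocab = ["wealthiest", "co-founder", "basketball"]
print(entity_word_prior({"wealthiest": 7, "co-founder": 3}, vocab, lam_xi=10.0))
```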
The improvement in performance as λ increases within a proper range validates that the textual features from KBs can improve modeling, while improperly strong priors can prevent the model from flexibly fitting the data.

6 Conclusion and Future Work

We proposed a structured representation of latent topics based on an entity taxonomy from a KB. A probabilistic model, LGSA, was developed to infer both hidden topics and entities from text corpora. The model integrates structural and textual knowledge from the KB, grounding entity mentions to the KB. This leads to improvements in topic modeling and entity identification. The grounded topics can be useful in various language understanding tasks, which we plan to explore in the future.

Acknowledgments: Zhiting Hu and Mrinmaya Sachan are supported by NSF IIS1218282, NSF IIS1447676, AFOSR FA95501010247, and Air Force FA8721-05-C-0003.

References

[Abney and Light, 1999] Steven Abney and Marc Light. Hiding a semantic hierarchy in a Markov model. In the Workshop on Unsupervised Learning in NLP, ACL, pages 1–8, 1999.
[Andrzejewski et al., 2011] David Andrzejewski, Xiaojin Zhu, Mark Craven, and Benjamin Recht. A framework for incorporating general domain knowledge into latent Dirichlet allocation using first-order logic. In IJCAI, pages 1171–1177, 2011.
[Blei et al., 2003] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
[Boyd-Graber et al., 2007] Jordan L Boyd-Graber, David M Blei, and Xiaojin Zhu. A topic model for word sense disambiguation. In EMNLP-CoNLL, pages 1024–1033, 2007.
[Chaney and Blei, 2012] Allison June-Barlow Chaney and David M Blei. Visualizing topic models. In ICWSM, 2012.
[Chang and Blei, 2009] Jonathan Chang and David M Blei. Relational topic models for document networks. In AISTATS, pages 81–88, 2009.
[Chang et al., 2009] Jonathan Chang, Sean Gerrish, Chong Wang, Jordan L Boyd-Graber, and David M Blei. Reading tea leaves: How humans interpret topic models. In NIPS, pages 288–296, 2009.
[Chemudugunta et al., 2008] Chaitanya Chemudugunta, America Holloway, Padhraic Smyth, and Mark Steyvers. Modeling documents by combining semantic concepts with unsupervised statistical learning. In The Semantic Web-ISWC 2008, pages 229–244. Springer, 2008.
[Chen and Liu, 2014] Zhiyuan Chen and Bing Liu. Topic modeling using topics from many domains, lifelong learning and big data. In ICML, 2014.
[Foulds et al., 2015] James Foulds, Shachi Kumar, and Lise Getoor. Latent topic networks: A versatile probabilistic programming framework for topic models. In ICML, pages 777–786, 2015.
[Gabrilovich and Markovitch, 2009] Evgeniy Gabrilovich and Shaul Markovitch. Wikipedia-based semantic interpretation for natural language processing. JAIR, 34(2):443, 2009.
[Griffiths and Steyvers, 2004] Thomas L Griffiths and Mark Steyvers. Finding scientific topics. PNAS, 101(Suppl 1):5228–5235, 2004.
[Griffiths and Tenenbaum, 2004] DMBTL Griffiths and MIJJB Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. NIPS, 16:17, 2004.
[Han and Sun, 2011] Xianpei Han and Le Sun. A generative entity-mention model for linking entities with knowledge base. In ACL, pages 945–954. ACL, 2011.
[Han and Sun, 2012] Xianpei Han and Le Sun. An entity-topic model for entity linking. In EMNLP, pages 105–115. ACL, 2012.
[Hu et al., 2014] Yuening Hu, Ke Zhai, Vladimir Eidelman, and Jordan Boyd-Graber. Polylingual tree-based topic models for translation domain adaptation. In ACL, 2014.
[Hu et al., 2015a] Zhiting Hu, Poyao Huang, Yuntian Deng, Yingkai Gao, and Eric P Xing. Entity hierarchy embedding. In ACL, pages 1292–1300, 2015.
[Hu et al., 2015b] Zhiting Hu, Junjie Yao, Bin Cui, and Eric Xing. Community level diffusion extraction. In SIGMOD, pages 1555–1569. ACM, 2015.
[Kataria et al., 2011] Saurabh S Kataria, Krishnan S Kumar, Rajeev R Rastogi, Prithviraj Sen, and Srinivasan H Sengamedu. Entity disambiguation with hierarchical topic models. In KDD, pages 1037–1045. ACM, 2011.
[Kemp and Regier, 2012] Charles Kemp and Terry Regier. Kinship categories across languages reflect general communicative principles. Science, 336(6084):1049–1054, 2012.
[Mei et al., 2014] Shike Mei, Jun Zhu, and Jerry Zhu. Robust RegBayes: Selectively incorporating first-order logic domain knowledge into Bayesian models. In ICML, pages 253–261, 2014.
[Mimno et al., 2009] David Mimno, Hanna M Wallach, Jason Naradowsky, David A Smith, and Andrew McCallum. Polylingual topic models. In EMNLP, pages 880–889. ACL, 2009.
[Movshovitz-Attias and Cohen, 2015] Dana Movshovitz-Attias and William W Cohen. KB-LDA: Jointly learning a knowledge base of hierarchy, relations, and facts. In ACL, pages 1449–1459, 2015.
[Newman et al., 2006] David Newman, Chaitanya Chemudugunta, and Padhraic Smyth. Statistical entity-topic models. In KDD, pages 680–686. ACM, 2006.
[Resnik, 1995] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In IJCAI, pages 448–453, 1995.
[Song et al., 2011] Yangqiu Song, Haixun Wang, Zhongyuan Wang, Hongsong Li, and Weizhu Chen. Short text conceptualization using a probabilistic knowledgebase. In IJCAI, pages 2330–2336. AAAI Press, 2011.
[Wang et al., 2007] Xuerui Wang, Andrew McCallum, and Xing Wei. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In ICDM, pages 697–702, 2007.
[Wei and Croft, 2006] Xing Wei and W Bruce Croft. LDA-based document models for ad-hoc retrieval. In SIGIR, pages 178–185. ACM, 2006.
[Yang et al., 2015] Yi Yang, Doug Downey, and Jordan Boyd-Graber. Efficient methods for incorporating knowledge into topic models. In EMNLP, 2015.