# Latent Relation Language Models

Hiroaki Hayashi,¹ Zecong Hu,¹ Chenyan Xiong,² Graham Neubig¹
¹Carnegie Mellon University, ²Microsoft Research AI
{hiroakih, zeconghu, gneubig}@cs.cmu.edu, Chenyan.Xiong@microsoft.com

Equal contribution. Code & Data: https://github.com/neulab/lrlm

In this paper, we propose Latent Relation Language Models (LRLMs), a class of language models that parameterizes the joint distribution over the words in a document and the entities that occur therein via knowledge graph relations. This model has a number of attractive properties: it not only improves language modeling performance, but can also annotate entity spans in a given text with posterior probabilities over the relations that generated them. Experiments demonstrate empirical improvements over both word-based language models and a previous approach that incorporates knowledge graph information. Qualitative analysis further demonstrates the proposed model's ability to learn to predict appropriate relations in context.

## 1 Introduction

Language models (LMs) calculate the probability $P(X)$ of textual data $X$, and are a core model class of interest to NLP. LMs are used as testbeds for evaluation of generative models of text, and have applications such as rescoring the outputs of upstream language generation systems (Sundermeyer, Schlüter, and Ney 2012), grammatical error correction (Felice et al. 2014), or pre-training of sentence representations (Peters et al. 2018). Neural networks are used to model this probability in state-of-the-art LMs (Bengio et al. 2003; Mikolov et al. 2010; Merity et al. 2017).

Textual data $X$ comprise a wide variety of words to be modeled, from closed-class function words, to common nouns or verbs, to named entities and numbers (Zipf 1949). Notably, words on the rarer end of this spectrum are often more semantically or topically important, as evidenced by the success of heuristics such as TF-IDF (Salton and McGill 1986), which up-weight words with low frequency. Previous work has noted that while neural LMs greatly outperform alternatives such as n-gram models on frequent words, they often under-perform on these rare words due to their limited parameter budget, which puts them at a disadvantage compared to non-parametric models like count-based n-grams (Neubig and Dyer 2016). Methods to mitigate this bottleneck have been proposed in the context of conditional LMs, which instead model the conditional probability $P(X \mid C)$, where $C$ is some context given to the model. For instance, in sequence transduction tasks, there are mechanisms to copy from the source sequence (Gu et al. 2016) or use word or phrase dictionaries (Arthur, Neubig, and Nakamura 2016) to improve modeling of low-frequency words.

Figure 1: Overview of our task of language modeling conditioned on a knowledge graph. For a given topic, we want to learn a language model that leverages the knowledge graph through relations when modeling the text.
Perhaps more interesting from an LM perspective are methods conditioned on information from structured knowledge sources such as knowledge graphs (Ahn et al. 2016; Parvez et al. 2018; Logan et al. 2019), tables (Lebret, Grangier, and Auli 2016), or grammars (Konstas and Lapata 2013). These methods are analogous to human language production, where the underlying knowledge is converted into linguistic realizations.

In this work, we propose Latent Relation Language Models (LRLMs), a class of conditional LMs that take relational information between entities in a knowledge graph as context. Specifically, our model is able to generate either words from a fixed word vocabulary, or a span of words defined according to their relations with a topic entity of interest, as shown in Figure 1. The choice of which method of generation to use is modeled as a latent variable sequence $Z$. We use Latent Predictor Networks (LPNs; Ling et al. (2016)) to jointly learn $P(X, Z \mid C)$, thus tractably marginalizing over all possible spans. Compared to other word-by-word generation methods that condition LMs on knowledge graphs (KGs; Ahn et al. (2016); Wang et al. (2018)), span-based generation from the KG alleviates problems of malformed or incomplete mentions. Moreover, the posterior probabilities of $Z$ can be considered entity links, which are of interest in their own right in the information extraction field (Ceccarelli et al. 2013; Ganea and Hofmann 2017).

We apply the model to articles from Wikipedia ($X$), with the help of relational information ($C$) such as Wikidata (Vrandečić and Krötzsch 2014) or Freebase (Bollacker et al. 2008) regarding each article topic. Empirical results on open-vocabulary language modeling show that the proposed model outperforms previous approaches on the same task, demonstrating that LRLMs provide an effective way to condition on this context. We also demonstrate the merit of explicitly modeling latent relations by examining the posterior probabilities over the chosen relations $Z$, which are in concert with human intuitions about how relations are expressed in the text.

## 2 Language Modeling Conditioned on Structured Knowledge

In this section, we define the task of open-vocabulary language modeling conditioned on structured data.

### Task Definition

Knowledge graphs (KGs) can be represented as a directed labeled graph $G = (V, E)$ consisting of a set of nodes $V = \{v_1, \ldots, v_{|V|}\}$ and a set of relation edges $E = \{e_i : \langle s_i, \omega_i, o_i \rangle \mid s_i, o_i \in V, \omega_i \in R\}$. Relation $e_i$ contains $s_i$, $\omega_i$, and $o_i$ as the subject, relation type, and object, and $R$ is the set of all relation types. Each node $v_i \in V$ represents either an entity or an attribute (a value specified with a relation from an entity, e.g., dates), and is associated with a set of surface forms (also called aliases) $A(v_i) = \{a_{i,1}, \ldots, a_{i,|A(v_i)|}\}$ that can be used to refer to $v_i$. For instance, in Figure 1 the subject "Barack Obama" is connected to both "politician" and "lawyer" with the relation "occupation", and the object entity "politician" has "political figure" and "polit." as additional aliases. Notably, the surface forms of many objects in the KG span multiple words, and thus it is necessary to have machinery to deal with this fact.

Given this KG, we further define a topic entity $s$ about which we would like to generate a piece of text. Our conditional language modeling problem is then defined as the problem of modeling the conditional probability of text $X$: $P(X \mid G, s)$.
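To make the notation above concrete, the following is a minimal sketch (not the released implementation) of how such a KG could be represented; the class and field names are illustrative assumptions, and the tiny example mirrors Figure 1.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """An entity or attribute v_i together with its surface forms A(v_i)."""
    name: str
    aliases: List[str] = field(default_factory=list)

@dataclass
class Edge:
    """A relation edge e_i = <s_i, omega_i, o_i>."""
    subject: Node
    relation_type: str  # omega_i, an element of R
    obj: Node           # o_i

# Tiny example mirroring Figure 1 (hypothetical data, for illustration only).
obama = Node("Barack Obama", aliases=["Barack Obama", "Barack Hussein Obama II", "Obama"])
politician = Node("politician", aliases=["politician", "political figure", "polit."])
lawyer = Node("lawyer", aliases=["lawyer", "attorney"])

V = [obama, politician, lawyer]
E = [
    Edge(obama, "occupation", politician),
    Edge(obama, "occupation", lawyer),
]
```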
In particular, we consider a subgraph $G' = (V', E')$ of the original KG $G$ obtained by extracting nodes and edges directly related to the topic entity $s$:

$$V' = \{s\} \cup \{o_i \mid \langle s, \omega_i, o_i \rangle \in E\},$$
$$E' = \{e_i : \langle s, \omega_i, o_i \rangle \mid \langle s, \omega_i, o_i \rangle \in E \wedge o_i \in V'\}.$$

We consider an open-vocabulary setting where all word types within $X$ are incorporated. Perplexity under this setting provides a more realistic measure than under a closed-vocabulary setting, because it takes into account words that rarely or never appear in the training set, which, as previously noted, are particularly important for conveying the main content of the text.

### Why Condition on Knowledge Graphs?

KGs provide two important benefits for neural LMs. First, KGs offer high coverage of rarer words, since entities are often infrequent in text; this addresses the lack of textual supervision for predicting these words. More importantly, KGs have the potential to help LMs generate factually consistent text by providing consistent associations between entities. Normal LMs would have to rely on supervision purely from textual data, which may not provide a learning signal strong enough to accurately generate these facts. For instance, results from Radford et al. (2019) show that even with a very large model trained on massive amounts of data, samples can be factually incorrect despite being fluent and coherent.

## 3 Latent Relation Language Models

In this section, we describe our proposed framework of Latent Relation Language Models (LRLMs).

### Definition

Knowledge from the KG subgraph $G'$ can be incorporated into generation by copying aliases from related entities into the generated text. For instance, in Figure 2, to generate Obama's birth date the model can of course pick words from its vocabulary, but it is more straightforward to copy from the birth-date relation of the topic entity "Barack Obama", which gives the correct birth date. However, it is insufficient to model probabilities for such choices conditioning only on $G'$ and $s$, because it is unknown to us which text spans are matched to which relations. Naïve solutions like simple text matching algorithms would yield many false positives. For example, "New York City" has an alias "New York", which matches "New York (state)" and parts of "New York City Council". To circumvent this lack of relation annotation, we treat relations corresponding to such text spans as latent variables.

Formally, let $X = \{x_i\}_{i=1}^{N}$ be the sequence of $N$ tokens, and $Z = \{(\sigma_t, \pi_t, \rho_t)\}_{t=1}^{T}$ a sequence of latent variable triplets describing text span matches:

- The span variable $\sigma_t := (\ell_t, r_t)$ specifies a token subsequence $x_{\sigma_t} = \{x_i\}_{i=\ell_t}^{r_t}$.
- The source variable $\pi_t \in \{\text{REL}, \text{WORD}\}$ denotes the generation source of the span $x_{\sigma_t}$.
- The relation variable $\rho_t := (e_t, a_t)$ describes the matching relation and surface form of the span $x_{\sigma_t}$, and is only used when $\pi_t = \text{REL}$.

Figure 2: While generating, our model switches between the two sources: Relation and Word. Circles represent hidden states up to each token, and edges represent possible span matches. Here we show one valid derivation with solid lines, and other options as dashed lines, along with an annotation of the generated tokens by the chosen spans and sources.
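For concreteness, the sketch below illustrates how the candidate REL spans (the possible span matches drawn as edges in Figure 2) could be enumerated; it is a rough illustration rather than the released code, reuses the illustrative `Edge` class from the earlier sketch, and all names are assumptions. The marginalization defined next sums over segmentations built from these candidates together with single-token WORD spans.

```python
from typing import List, NamedTuple

class RelSpan(NamedTuple):
    """One candidate value of (sigma_t, rho_t) with pi_t = REL."""
    left: int       # l_t, inclusive token index
    right: int      # r_t, inclusive token index
    relation: str   # omega_t of the matched edge e_t
    alias: str      # surface form a_t in A(o_t)

def candidate_rel_spans(tokens: List[str], subgraph_edges: List["Edge"]) -> List[RelSpan]:
    """Enumerate every span x_{sigma_t} that exactly matches an alias of a related object."""
    spans = []
    for edge in subgraph_edges:
        for alias in edge.obj.aliases:
            alias_toks = alias.split()
            n = len(alias_toks)
            for left in range(len(tokens) - n + 1):
                if tokens[left:left + n] == alias_toks:
                    spans.append(RelSpan(left, left + n - 1, edge.relation_type, alias))
    return spans
```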
For $Z$ to be a valid sequence of latent variables, the following conditions must be satisfied:

- Span variables $\{\sigma_t\}_{t=1}^{T}$ form a segmentation of $X$, i.e., $\ell_t = r_{t-1} + 1$ for $t = 2, \ldots, T$. This also implies $T \le N$.
- If $\pi_t = \text{WORD}$, then $\ell_t = r_t$.
- If $\pi_t = \text{REL}$, then $\rho_t = (e_t, a_t)$, where $e_t = \langle s, \omega_t, o_t \rangle$ should satisfy $e_t \in E'$, $a_t \in A(o_t)$, and $x_{\sigma_t} = a_t$, i.e., $\rho_t$ must correspond to a valid surface form of an object that is related to the topic entity $s$ and matches the text span.

Let $\mathcal{Z}$ be the set of all valid latent variable sequences. We can now model the probability by marginalizing over $Z$:

$$P(X \mid G', s) = \sum_{Z \in \mathcal{Z}} P(X, Z \mid G', s). \qquad (1)$$

For the sake of brevity, unless noted otherwise, we drop $G'$ and $s$ from the conditioning context in the following sections.

### Training

Given the latent variable sequence $Z$, we follow Ling et al. (2016) in factoring the joint probability:

$$P(X, Z) = \prod_{t=1}^{T} P(\sigma_t, \pi_t, \rho_t, x_{\sigma_t} \mid x_{<\ell_t}) = \prod_{t=1}^{T} P(\pi_t \mid x_{<\ell_t})\, P(\sigma_t, x_{\sigma_t}, \rho_t \mid \pi_t, x_{<\ell_t}),$$

where $x_{<\ell_t}$ denotes the tokens preceding the $t$-th span.
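Given this factorization, the sum over all valid $Z$ in Eq. (1) can be computed with a forward dynamic program over span end positions, in the spirit of Latent Predictor Networks. The sketch below is a schematic illustration rather than the authors' implementation: the probability functions it takes as arguments stand in for the neural model's outputs (their names are hypothetical), it reuses the `RelSpan` candidates from the earlier sketch, and log-space computation is omitted for brevity.

```python
def marginal_prob(N, word_prob, rel_spans, rel_span_prob, switch_prob):
    """
    Sum of P(X, Z) over all valid segmentations Z (Eq. 1), computed by a forward pass.

    Assumed, model-provided callables (hypothetical names):
      word_prob(i)        -> P(x_i | x_<i) under the WORD source (single-token span)
      rel_span_prob(sp)   -> P(sigma, x_sigma, rho | REL, x_<l) for a RelSpan candidate sp
      switch_prob(i, src) -> P(pi = src | x_<i), src in {"WORD", "REL"}
    rel_spans is the list of RelSpan candidates (e.g., from candidate_rel_spans).
    Token indices are 0-based.
    """
    # alpha[j] = total probability of generating the first j tokens x_0 .. x_{j-1}.
    alpha = [0.0] * (N + 1)
    alpha[0] = 1.0

    # Index REL candidates by their (inclusive) end position for quick lookup.
    by_end = {}
    for sp in rel_spans:
        by_end.setdefault(sp.right, []).append(sp)

    for j in range(1, N + 1):
        # Case 1: token j-1 is generated from the word vocabulary (span of length 1).
        total = alpha[j - 1] * switch_prob(j - 1, "WORD") * word_prob(j - 1)
        # Case 2: a relation span ends at token j-1; it started at sp.left.
        for sp in by_end.get(j - 1, []):
            total += alpha[sp.left] * switch_prob(sp.left, "REL") * rel_span_prob(sp)
        alpha[j] = total

    return alpha[N]
```

Training then maximizes this marginal likelihood, so the model never needs gold annotations of which spans were produced by which relations.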