# Latent Relation Language Models

Hiroaki Hayashi,¹ Zecong Hu,¹ Chenyan Xiong,² Graham Neubig¹
¹Carnegie Mellon University, ²Microsoft Research AI
{hiroakih, zeconghu, gneubig}@cs.cmu.edu, Chenyan.Xiong@microsoft.com

Equal contribution. Code & Data: https://github.com/neulab/lrlm

In this paper, we propose Latent Relation Language Models (LRLMs), a class of language models that parameterizes the joint distribution over the words in a document and the entities that occur therein via knowledge graph relations. This model has a number of attractive properties: it not only improves language modeling performance, but can also annotate entity spans in a given text with posterior probabilities over the relations that generated them. Experiments demonstrate empirical improvements over both word-based language models and a previous approach that incorporates knowledge graph information. Qualitative analysis further demonstrates the proposed model's ability to learn to predict appropriate relations in context.

## 1 Introduction

Language models (LMs) calculate the probability $P(X)$ of textual data $X$, and are a core model class of interest to NLP. LMs are used as testbeds for evaluation of generative models of text, and have applications such as rescoring the outputs of upstream language generation systems (Sundermeyer, Schlüter, and Ney 2012), grammatical error correction (Felice et al. 2014), or pre-training of sentence representations (Peters et al. 2018). Neural networks are used to model this probability in state-of-the-art LMs (Bengio et al. 2003; Mikolov et al. 2010; Merity et al. 2017).

Textual data $X$ comprise a wide variety of words to be modeled, from closed-class function words, to common nouns or verbs, to named entities and numbers (Zipf 1949). Notably, words on the rarer end of this spectrum are often more semantically or topically important, as evidenced by the success of heuristics such as TF-IDF (Salton and McGill 1986), which up-weight words with low frequency. Previous work has noted that while neural LMs greatly outperform alternatives such as n-gram models on frequent words, they often under-perform on these rare words due to their limited parameter budget, which puts them at a disadvantage compared to non-parametric models like count-based n-grams (Neubig and Dyer 2016). Methods to mitigate this bottleneck have been proposed in the context of conditional LMs, which instead model the conditional probability $P(X \mid C)$, where $C$ is some context given to the model. For instance, in sequence transduction tasks, there are mechanisms to copy from the source sequence (Gu et al. 2016) or use word or phrase dictionaries (Arthur, Neubig, and Nakamura 2016) to improve modeling of low-frequency words.

Figure 1: Overview of our task of language modeling conditioned on a knowledge graph. For a given topic, we want to learn a language model that leverages the knowledge graph through relations when modeling the text.
Perhaps more interesting from an LM perspective are methods conditioned on information from structured knowledge sources such as knowledge graphs (Ahn et al. 2016; Parvez et al. 2018; Logan et al. 2019), tables (Lebret, Grangier, and Auli 2016), or grammars (Konstas and Lapata 2013). These methods are analogous to human language production, where the underlying knowledge is converted into linguistic realizations.

In this work, we propose Latent Relation Language Models (LRLMs), a class of conditional LMs that take relational information between entities in a knowledge graph as context. Specifically, our model is able to generate either words from a fixed word vocabulary, or a span of words defined according to their relations with a topic entity of interest, as shown in Figure 1. The choice of which method of generation to use is modeled as a latent variable sequence $Z$. We use Latent Predictor Networks (LPNs; Ling et al. (2016)) to jointly learn $P(X, Z \mid C)$, thus tractably marginalizing over all possible spans. Compared to other word-by-word generation methods that condition LMs on knowledge graphs (KGs; Ahn et al. (2016); Wang et al. (2018)), span-based generation from the KG alleviates problems of malformed or incomplete mentions. Moreover, the posterior probabilities of $Z$ can be considered entity links, which are of interest in their own right in the information extraction field (Ceccarelli et al. 2013; Ganea and Hofmann 2017).

We apply the model to articles from Wikipedia ($X$), with the help of relational information ($C$) such as Wikidata (Vrandečić and Krötzsch 2014) or Freebase (Bollacker et al. 2008) regarding each article topic. Empirical results on open-vocabulary language modeling show that the proposed model outperforms previous approaches on the same task, demonstrating that LRLMs provide an effective way to condition on this context. We also demonstrate the merit of explicitly modeling latent relations by examining the posterior probabilities over the chosen relations $Z$, which are in concert with human intuitions about how relations are expressed in the text.

## 2 Language Modeling Conditioned on Structured Knowledge

In this section, we define the task of open-vocabulary language modeling conditioned on structured data.

### Task Definition

Knowledge graphs (KGs) can be represented as a directed labeled graph $G = (V, E)$ consisting of a set of nodes $V = \{v_1, \ldots, v_{|V|}\}$ and a set of relation edges $E = \{e_i : \langle s_i, \omega_i, o_i \rangle \mid s_i, o_i \in V, \omega_i \in R\}$. Relation $e_i$ contains $s_i$, $\omega_i$, and $o_i$ as the subject, relation type, and object, and $R$ is the set of all relation types. Each node $v_i \in V$ represents either an entity or an attribute (a value specified with a relation from an entity, e.g., dates), and is associated with a set of surface forms (also called aliases) $A(v_i) = \{a_{i,1}, \ldots, a_{i,|A(v_i)|}\}$ that can be used to refer to $v_i$. For instance, in Figure 1 the subject "Barack Obama" is connected to both "politician" and "lawyer" with the relation "occupation", and the object entity "politician" has "political figure" and "polit." as additional aliases. Notably, the surface forms of many objects in the KG span multiple words, and thus it is necessary to have machinery to deal with this fact.

Given this KG, we further define a topic entity $s$ about which we would like to generate a piece of text. Our conditional language modeling problem is then defined as the problem of modeling the conditional probability of text $X$: $P(X \mid G, s)$.
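To make the notation above concrete, the following is a minimal sketch (not the released implementation) of how such a KG could be represented; the class and field names are illustrative assumptions, and the tiny example mirrors Figure 1.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """An entity or attribute v_i together with its surface forms A(v_i)."""
    name: str
    aliases: List[str] = field(default_factory=list)

@dataclass
class Edge:
    """A relation edge e_i = <s_i, omega_i, o_i>."""
    subject: Node
    relation_type: str  # omega_i, an element of R
    obj: Node           # o_i

# Tiny example mirroring Figure 1 (hypothetical data, for illustration only).
obama = Node("Barack Obama", aliases=["Barack Obama", "Barack Hussein Obama II", "Obama"])
politician = Node("politician", aliases=["politician", "political figure", "polit."])
lawyer = Node("lawyer", aliases=["lawyer", "attorney"])

V = [obama, politician, lawyer]
E = [
    Edge(obama, "occupation", politician),
    Edge(obama, "occupation", lawyer),
]
```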
In particular, we consider a subgraph $G' = (V', E')$ of the original KG $G$ obtained by extracting nodes and edges directly related to the topic entity $s$:

$$V' = \{s\} \cup \{o_i \mid \langle s, \omega_i, o_i \rangle \in E\},$$
$$E' = \{e_i : \langle s, \omega_i, o_i \rangle \mid \langle s, \omega_i, o_i \rangle \in E \wedge o_i \in V'\}.$$

We consider an open-vocabulary setting where all word types within $X$ are incorporated. Perplexity under this setting provides a more realistic measure than under a closed-vocabulary setting, because it takes into account words that rarely or never appear in the training set, which, as previously noted, are particularly important for conveying the main content of the text.

### Why Condition on Knowledge Graphs?

KGs provide two important benefits for neural LMs. First, KGs offer high coverage of rarer words, since entities are often infrequent in text; this addresses the lack of textual supervision for predicting these words. More importantly, KGs have the potential to help LMs generate factually consistent text by providing consistent associations between entities. Normal LMs would have to rely on supervision purely from textual data, which may not provide a learning signal strong enough to accurately generate these facts. For instance, results from Radford et al. (2019) show that even with a very large model trained on massive amounts of data, samples can be factually incorrect despite being fluent and coherent.

## 3 Latent Relation Language Models

In this section, we describe our proposed framework of Latent Relation Language Models (LRLMs).

### Definition

Knowledge from the KG subgraph $G'$ can be incorporated into generation by copying aliases from related entities into the generated text. For instance, in Figure 2, to generate Obama's birth date the model can of course pick words from its vocabulary, but it is more straightforward to copy from the birth-date relation of the topic entity "Barack Obama", which gives the correct birth date. However, it is insufficient to model probabilities for such choices conditioning only on $G'$ and $s$, because it is unknown to us which text spans are matched to which relations. Naïve solutions like simple text matching algorithms would yield many false positives. For example, "New York City" has an alias "New York", which matches "New York (state)" and parts of "New York City Council". To circumvent this lack of relation annotation, we treat relations corresponding to such text spans as latent variables.

Formally, let $X = \{x_i\}_{i=1}^{N}$ be the sequence of $N$ tokens, and $Z = \{(\sigma_t, \pi_t, \rho_t)\}_{t=1}^{T}$ a sequence of latent variable triplets describing text span matches:

- The span variable $\sigma_t := (\ell_t, r_t)$ specifies a token subsequence $x_{\sigma_t} = \{x_i\}_{i=\ell_t}^{r_t}$.
- The source variable $\pi_t \in \{\text{REL}, \text{WORD}\}$ denotes the generation source of the span $x_{\sigma_t}$.
- The relation variable $\rho_t := (e_t, a_t)$ describes the matching relation and surface form of the span $x_{\sigma_t}$, and is only used when $\pi_t = \text{REL}$.

Figure 2: While generating, our model switches between the two sources: Relation and Word. Circles represent hidden states up to each token, and edges represent possible span matches. Here we show one valid derivation with solid lines, and other options as dashed lines, along with an annotation of the generated tokens by the chosen spans and sources.
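For concreteness, the sketch below illustrates how the candidate REL spans (the possible span matches drawn as edges in Figure 2) could be enumerated; it is a rough illustration rather than the released code, reuses the illustrative `Edge` class from the earlier sketch, and all names are assumptions. The marginalization defined next sums over segmentations built from these candidates together with single-token WORD spans.

```python
from typing import List, NamedTuple

class RelSpan(NamedTuple):
    """One candidate value of (sigma_t, rho_t) with pi_t = REL."""
    left: int       # l_t, inclusive token index
    right: int      # r_t, inclusive token index
    relation: str   # omega_t of the matched edge e_t
    alias: str      # surface form a_t in A(o_t)

def candidate_rel_spans(tokens: List[str], subgraph_edges: List["Edge"]) -> List[RelSpan]:
    """Enumerate every span x_{sigma_t} that exactly matches an alias of a related object."""
    spans = []
    for edge in subgraph_edges:
        for alias in edge.obj.aliases:
            alias_toks = alias.split()
            n = len(alias_toks)
            for left in range(len(tokens) - n + 1):
                if tokens[left:left + n] == alias_toks:
                    spans.append(RelSpan(left, left + n - 1, edge.relation_type, alias))
    return spans
```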
For $Z$ to be a valid sequence of latent variables, the following conditions must be satisfied:

- Span variables $\{\sigma_t\}_{t=1}^{T}$ form a segmentation of $X$, i.e., $\ell_t = r_{t-1} + 1$ for $t = 2, \ldots, T$. This also implies $T \le N$.
- If $\pi_t = \text{WORD}$, then $\ell_t = r_t$.
- If $\pi_t = \text{REL}$, then $\rho_t = (e_t, a_t)$, where $e_t = \langle s, \omega_t, o_t \rangle$ should satisfy $e_t \in E'$, $a_t \in A(o_t)$, and $x_{\sigma_t} = a_t$, i.e., $\rho_t$ must correspond to a valid surface form of an object that is related to the topic entity $s$ and matches the text span.

Let $\mathcal{Z}$ be the set of all valid latent variable sequences. We can now model the probability by marginalizing over $Z$:

$$P(X \mid G', s) = \sum_{Z \in \mathcal{Z}} P(X, Z \mid G', s). \qquad (1)$$

For the sake of brevity, unless noted otherwise, we drop $G'$ and $s$ from the conditioning context in the following sections.

### Training

Given the latent variable sequence $Z$, we follow Ling et al. (2016) in factoring the joint probability:

$$P(X, Z) = \prod_{t=1}^{T} P(\sigma_t, \pi_t, \rho_t, x_{\sigma_t} \mid x_{<\ell_t}) = \prod_{t=1}^{T} P(\pi_t \mid x_{<\ell_t})\, P(\sigma_t, x_{\sigma_t}, \rho_t \mid \pi_t, x_{<\ell_t}),$$

where $x_{<\ell_t}$ denotes the tokens preceding the $t$-th span.
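Given this factorization, the sum over all valid $Z$ in Eq. (1) can be computed with a forward dynamic program over span end positions, in the spirit of Latent Predictor Networks. The sketch below is a schematic illustration rather than the authors' implementation: the probability functions it takes as arguments stand in for the neural model's outputs (their names are hypothetical), it reuses the `RelSpan` candidates from the earlier sketch, and log-space computation is omitted for brevity.

```python
def marginal_prob(N, word_prob, rel_spans, rel_span_prob, switch_prob):
    """
    Sum of P(X, Z) over all valid segmentations Z (Eq. 1), computed by a forward pass.

    Assumed, model-provided callables (hypothetical names):
      word_prob(i)        -> P(x_i | x_<i) under the WORD source (single-token span)
      rel_span_prob(sp)   -> P(sigma, x_sigma, rho | REL, x_<l) for a RelSpan candidate sp
      switch_prob(i, src) -> P(pi = src | x_<i), src in {"WORD", "REL"}
    rel_spans is the list of RelSpan candidates (e.g., from candidate_rel_spans).
    Token indices are 0-based.
    """
    # alpha[j] = total probability of generating the first j tokens x_0 .. x_{j-1}.
    alpha = [0.0] * (N + 1)
    alpha[0] = 1.0

    # Index REL candidates by their (inclusive) end position for quick lookup.
    by_end = {}
    for sp in rel_spans:
        by_end.setdefault(sp.right, []).append(sp)

    for j in range(1, N + 1):
        # Case 1: token j-1 is generated from the word vocabulary (span of length 1).
        total = alpha[j - 1] * switch_prob(j - 1, "WORD") * word_prob(j - 1)
        # Case 2: a relation span ends at token j-1; it started at sp.left.
        for sp in by_end.get(j - 1, []):
            total += alpha[sp.left] * switch_prob(sp.left, "REL") * rel_span_prob(sp)
        alpha[j] = total

    return alpha[N]
```

Training then maximizes this marginal likelihood, so the model never needs gold annotations of which spans were produced by which relations.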