# Data-to-Text Generation with Content Selection and Planning

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Ratish Puduppully, Li Dong, Mirella Lapata
Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB
r.puduppully@sms.ed.ac.uk, li.dong@ed.ac.uk, mlap@inf.ed.ac.uk

## Abstract

Recent advances in data-to-text generation have led to the use of large-scale datasets and neural network models which are trained end-to-end, without explicitly modeling what to say and in what order. In this work, we present a neural network architecture which incorporates content selection and planning without sacrificing end-to-end training. We decompose the generation task into two stages. Given a corpus of data records (paired with descriptive documents), we first generate a content plan highlighting which information should be mentioned and in which order, and then generate the document while taking the content plan into account. Automatic and human-based evaluation experiments show that our model (code publicly available at https://github.com/ratishsp/data2text-plan-py) outperforms strong baselines, improving the state of the art on the recently released ROTOWIRE dataset.

## 1 Introduction

Data-to-text generation broadly refers to the task of automatically producing text from non-linguistic input (Reiter and Dale 2000; Gatt and Krahmer 2018). The input may take various forms including databases of records, spreadsheets, expert system knowledge bases, simulations of physical systems, and so on. Table 1 shows an example in the form of a database containing statistics on NBA basketball games, and a corresponding game summary.

Traditional methods for data-to-text generation (Kukich 1983; McKeown 1992) implement a pipeline of modules including content planning (selecting specific content from some input and determining the structure of the output text), sentence planning (determining the structure and lexical content of each sentence), and surface realization (converting the sentence plan to a surface string). Recent neural generation systems (Lebret et al. 2016; Mei et al. 2016; Wiseman et al. 2017) do not explicitly model any of these stages; rather, they are trained end-to-end using the very successful encoder-decoder architecture (Bahdanau et al. 2015) as their backbone.

Figure 1: Block diagram of our approach (content selection and planning over the input records, followed by a text decoder; example output: "The Houston Rockets (3-3) stunned the Los Angeles Clippers (3-3) Thursday in Game 6 at the Staples Center").

Despite producing overall fluent text, neural systems have difficulty capturing long-term structure and generating documents more than a few sentences long. Wiseman et al. (2017) show that neural text generation techniques perform poorly at content selection, and that they struggle to maintain inter-sentential coherence and, more generally, a reasonable ordering of the selected facts in the output text. Additional challenges include avoiding redundancy and being faithful to the input.
Interestingly, comparisons against template-based methods show that neural techniques do not fare well on metrics of content selection recall and factual output generation (i.e., they often hallucinate statements which are not supported by facts in the database).

In this paper, we address these shortcomings by explicitly modeling content selection and planning within a neural data-to-text architecture. Our model learns a content plan from the input and conditions on the content plan in order to generate the output document (see Figure 1 for an illustration). An explicit content planning mechanism has at least three advantages for multi-sentence document generation: it represents a high-level organization of the document structure, allowing the decoder to concentrate on the easier tasks of sentence planning and surface realization; it makes the process of data-to-document generation more interpretable by generating an intermediate representation; and it reduces redundancy in the output, since it is less likely for the content plan to contain the same information in multiple places.

We train our model end-to-end using neural networks and evaluate its performance on ROTOWIRE (Wiseman et al. 2017), a recently released dataset which contains statistics of NBA basketball games paired with human-written summaries (see Table 1). Automatic and human evaluation shows that modeling content selection and planning improves generation considerably over competitive baselines.

| TEAM | WIN | LOSS | PTS | FG PCT | RB | AST | ... |
|---|---|---|---|---|---|---|---|
| Pacers | 4 | 6 | 99 | 42 | 40 | 17 | ... |
| Celtics | 5 | 4 | 105 | 44 | 47 | 22 | ... |

| PLAYER | H/V | AST | RB | PTS | FG | CITY | ... |
|---|---|---|---|---|---|---|---|
| Jeff Teague | H | 4 | 3 | 20 | 4 | Indiana | ... |
| Miles Turner | H | 1 | 8 | 17 | 6 | Indiana | ... |
| Isaiah Thomas | V | 5 | 0 | 23 | 4 | Boston | ... |
| Kelly Olynyk | V | 4 | 6 | 16 | 6 | Boston | ... |
| Amir Johnson | V | 3 | 9 | 14 | 4 | Boston | ... |
| ... | ... | ... | ... | ... | ... | ... | ... |

PTS: points, FT PCT: free throw percentage, RB: rebounds, AST: assists, H/V: home or visiting, FG: field goals, CITY: player team city.

The Boston Celtics defeated the host Indiana Pacers 105-99 at Bankers Life Fieldhouse on Saturday. In a battle between two injury-riddled teams, the Celtics were able to prevail with a much needed road victory. The key was shooting and defense, as the Celtics outshot the Pacers from the field, from three-point range and from the free-throw line. Boston also held Indiana to 42 percent from the field and 22 percent from long distance. The Celtics also won the rebounding and assisting differentials, while tying the Pacers in turnovers. There were 10 ties and 10 lead changes, as this game went down to the final seconds. Boston (5 - 4) has had to deal with a gluttony of injuries, but they had the fortunate task of playing a team just as injured here. Isaiah Thomas led the team in scoring, totaling 23 points and five assists on 4 of 13 shooting. He got most of those points by going 14 of 15 from the free-throw line. Kelly Olynyk got a rare start and finished second on the team with his 16 points, six rebounds and four assists.

Table 1: Example of data records and document summary. Entities and values corresponding to the plan in Table 2 are boldfaced.

## 2 Related Work

The generation literature provides multiple examples of content selection components developed for various domains, which are either hand-built (Kukich 1983; McKeown 1992; Reiter and Dale 1997; Duboue and McKeown 2003) or learned from data (Barzilay and Lapata 2005; Duboue and McKeown 2001; 2003; Liang et al. 2009; Angeli et al. 2010; Kim and Mooney 2010; Konstas and Lapata 2013).
Likewise, creating summaries of sports games has been a topic of interest since the early beginnings of generation systems (Robin 1994; Tanaka-Ishii et al. 1998).

Earlier work on content planning has relied on generic planners (Dale 1988), on Rhetorical Structure Theory (Hovy 1993), and on schemas (McKeown et al. 1997). Content planners are defined by analysing target texts and devising hand-crafted rules. Duboue and McKeown (2001) study ordering constraints for content plans and in follow-on work (Duboue and McKeown 2002) learn a content planner from an aligned corpus of inputs and human outputs. A few researchers (Mellish et al. 1998; Karamanis 2004) select content plans according to a ranking function.

More recent work focuses on end-to-end systems instead of individual components. However, most models make simplifying assumptions such as generation without any content selection or planning (Belz 2008; Wong and Mooney 2007) or content selection without planning (Konstas and Lapata 2012; Angeli et al. 2010; Kim and Mooney 2010). An exception is Konstas and Lapata (2013), who incorporate content plans represented as grammar rules operating on the document level. Their approach works reasonably well with weather forecasts, but does not scale easily to larger databases with richer vocabularies and longer text descriptions. The model relies on the EM algorithm (Dempster, Laird, and Rubin 1977) to learn the weights of the grammar rules, which can be very numerous even when tokens are aligned to database records as a preprocessing step.

Our work is closest to recent neural network models which learn generators from data and accompanying text resources. Most previous approaches generate from Wikipedia infoboxes, focusing either on single sentences (Lebret et al. 2016; 2017; Sha et al. 2017; Liu et al. 2017) or short texts (Perez-Beltrachini and Lapata 2018). Mei et al. (2016) use a neural encoder-decoder model to generate weather forecasts and soccer commentaries, while Wiseman et al. (2017) generate NBA game summaries (see Table 1). They introduce a new dataset for data-to-document generation which is sufficiently large for neural network training and adequately challenging for testing the capabilities of document-scale text generation (e.g., the average summary length is 330 words and the average number of input records is 628). Moreover, they propose various automatic evaluation measures for assessing the quality of system output. Our model follows on from Wiseman et al. (2017), addressing the challenges for data-to-text generation identified in their work. We are not aware of any previous neural network-based approaches which incorporate content selection and planning mechanisms and generate multi-sentence documents. Perez-Beltrachini and Lapata (2018) introduce a content selection component (based on multi-instance learning) without content planning, while Liu et al. (2017) propose a sentence planning mechanism which orders the contents of a Wikipedia infobox so as to generate a single sentence.

## 3 Problem Formulation

The input to our model is a table of records (see Table 1, left-hand side). Each record $r_j$ has four features: its type ($r_{j,1}$; e.g., LOSS, CITY), entity ($r_{j,2}$; e.g., Pacers, Miles Turner), value ($r_{j,3}$; e.g., 11, Indiana), and whether a player is on the home or away team ($r_{j,4}$; see column H/V in Table 1), represented as $\{r_{j,k}\}_{k=1}^{4}$. The output $y$ is a document containing the words $y = y_1 \cdots y_{|y|}$, where $|y|$ is the document length.
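To make this input format concrete, the following minimal sketch (ours, not part of the paper or its released code; field and variable names are illustrative) flattens a few cells of Table 1 into such four-feature records:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Record:
    rtype: str   # r_{j,1}: record type, e.g. "PTS" or "CITY"
    entity: str  # r_{j,2}: entity, e.g. "Isaiah Thomas" or "Pacers"
    value: str   # r_{j,3}: value, e.g. "23" or "Indiana"
    ha: str      # r_{j,4}: "H" (home) or "V" (visiting)

# A handful of records flattened from Table 1; the full table yields
# several hundred such records per game.
records: List[Record] = [
    Record("TEAM-PTS", "Celtics", "105", "V"),
    Record("TEAM-LOSS", "Pacers", "6", "H"),
    Record("PTS", "Isaiah Thomas", "23", "V"),
    Record("AST", "Isaiah Thomas", "5", "V"),
    Record("PTS", "Kelly Olynyk", "16", "V"),
]
# The training target y is simply the tokenized game summary paired with these records.
```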
Figure 2 shows the overall architecture of our model, which consists of two stages: (a) content selection and planning operates on the input records of a database and produces a content plan specifying which records are to be verbalized in the document and in which order (see Table 2), and (b) text generation produces the output text given the content plan as input; at each decoding step, the generation model attends over vector representations of the records in the content plan.

Figure 2: Generation model with content selection and planning (record encoder with content selection gate, content plan, and text decoder attending over the plan with generation and copy probabilities $p_{gen}$ and $p_{copy}$); the content selection gate is illustrated in Figure 3.

Let $r = \{r_j\}_{j=1}^{|r|}$ denote a table of input records and $y$ the output text. We model $p(y|r)$ as the joint probability of text $y$ and content plan $z$, given input $r$. We further decompose $p(y, z|r)$ into $p(z|r)$, a content selection and planning phase, and $p(y|r, z)$, a text generation phase:

$$p(y|r) = \sum_{z} p(y, z|r) = \sum_{z} p(z|r)\, p(y|r, z)$$

In the following we explain how the components $p(z|r)$ and $p(y|r, z)$ are estimated.

### Record Encoder

The input to our model is a table of unordered records, each represented as features $\{r_{j,k}\}_{k=1}^{4}$. Following previous work (Yang et al. 2017; Wiseman et al. 2017), we embed features into vectors, and then use a multilayer perceptron to obtain a vector representation $\mathbf{r}_j$ for each record:

$$\mathbf{r}_j = \mathrm{ReLU}(\mathbf{W}_r[\mathbf{r}_{j,1}; \mathbf{r}_{j,2}; \mathbf{r}_{j,3}; \mathbf{r}_{j,4}] + \mathbf{b}_r)$$

where $[\,;\,]$ indicates vector concatenation, $\mathbf{W}_r \in \mathbb{R}^{n \times 4n}$ and $\mathbf{b}_r \in \mathbb{R}^{n}$ are parameters, and $\mathrm{ReLU}$ is the rectifier activation function.

### Content Selection Gate

The context of a record can be useful in determining its importance vis-à-vis other records in the table. For example, if a player scores many points, it is likely that other meaningfully related records such as field goals, three-pointers, or rebounds will be mentioned in the output summary. To better capture such dependencies among records, we make use of the content selection gate mechanism shown in Figure 3.

We first compute attention scores $\alpha_{j,k}$ over the input table and use them to obtain an attentional vector $\mathbf{r}^{\mathrm{att}}_j$ for each record $r_j$:

$$\alpha_{j,k} \propto \exp(\mathbf{r}_j^{\top} \mathbf{W}_a \mathbf{r}_k)$$
$$\mathbf{c}_j = \sum_{k \neq j} \alpha_{j,k} \mathbf{r}_k$$
$$\mathbf{r}^{\mathrm{att}}_j = \mathbf{W}_g [\mathbf{r}_j; \mathbf{c}_j]$$

where $\mathbf{W}_a \in \mathbb{R}^{n \times n}$ and $\mathbf{W}_g \in \mathbb{R}^{n \times 2n}$ are parameter matrices, and $\sum_{k \neq j} \alpha_{j,k} = 1$.

Figure 3: Content selection mechanism (attention over the embedded features Name, Type, Value, Home/Away of each record, followed by the content selection gate).

We next apply the content selection gating mechanism to $\mathbf{r}_j$ and obtain the new record representation $\mathbf{r}^{\mathrm{cs}}_j$ via:

$$\mathbf{g}_j = \mathrm{sigmoid}(\mathbf{r}^{\mathrm{att}}_j)$$
$$\mathbf{r}^{\mathrm{cs}}_j = \mathbf{g}_j \odot \mathbf{r}_j$$

where $\odot$ denotes element-wise multiplication, and the gate $\mathbf{g}_j \in [0, 1]^{n}$ controls the amount of information flowing from $\mathbf{r}_j$. In other words, each element in $\mathbf{r}_j$ is weighed by the corresponding element of the content selection gate $\mathbf{g}_j$.

### Content Planning

In our generation task, the output text is long but follows a canonical structure. Game summaries typically begin by discussing which team won or lost, continue with various statistics involving individual players and their teams (e.g., who performed exceptionally well or under-performed), and finish with any upcoming games. We hypothesize that generation would benefit from an explicit plan specifying both what to say and in which order. Our model learns such content plans from training data. However, ROTOWIRE (see Table 1) and most similar data-to-text datasets do not naturally contain content plans.
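Before turning to how content plans are obtained, the record encoder and content selection gate described above can be made concrete with a short sketch. The PyTorch code below is our own illustration of the equations; module names, the per-feature embedding tables, and all hyper-parameters are assumptions, and the authors' released data2text-plan-py implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentSelectionEncoder(nn.Module):
    """Record encoder plus content selection gate (illustrative sketch)."""

    def __init__(self, vocab_sizes, n):
        super().__init__()
        # One embedding table per record feature (type, entity, value, home/away).
        self.embs = nn.ModuleList([nn.Embedding(v, n) for v in vocab_sizes])
        self.W_r = nn.Linear(4 * n, n)           # r_j = ReLU(W_r [r_{j,1};...;r_{j,4}] + b_r)
        self.W_a = nn.Linear(n, n, bias=False)   # attention scores r_j^T W_a r_k
        self.W_g = nn.Linear(2 * n, n)           # r_j^att = W_g [r_j; c_j]

    def forward(self, feats):
        # feats: (J, 4) integer feature ids for the J records of one game.
        r = torch.cat([emb(feats[:, i]) for i, emb in enumerate(self.embs)], dim=-1)
        r = F.relu(self.W_r(r))                  # (J, n) record vectors r_j

        scores = r @ self.W_a(r).t()             # (J, J): score of record k for record j
        scores.fill_diagonal_(float("-inf"))     # exclude k = j from the attention
        alpha = torch.softmax(scores, dim=-1)    # alpha_{j,k}, each row sums to 1
        c = alpha @ r                            # context vectors c_j
        r_att = self.W_g(torch.cat([r, c], dim=-1))
        g = torch.sigmoid(r_att)                 # content selection gate g_j
        return g * r                             # r_j^cs = g_j ⊙ r_j
```

A forward pass returns one gated vector per record; these vectors serve as input both to the content planner and, later, to the text decoder.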
Fortunately, such content plans can be obtained relatively straightforwardly following an information extraction approach (which we explain in Section 4). Suffice it to say that plans are extracted by mapping the text in the summaries onto entities in the input table, their values, and types (i.e., relations). A plan is a sequence of pointers, with each entry pointing to an input record $\{r_j\}_{j=1}^{|r|}$. An excerpt of a plan is shown in Table 2. The order in the plan corresponds to the sequence in which entities appear in the game summary. Let $z = z_1 \dots z_{|z|}$ denote the content planning sequence. Each $z_k$ points to an input record, i.e., $z_k \in \{r_j\}_{j=1}^{|r|}$. Given the input records, the probability $p(z|r)$ is decomposed as:

$$p(z|r) = \prod_{k=1}^{|z|} p(z_k \mid z_{<k}, r)$$

where $z_{<k} = z_1 \dots z_{k-1}$ denotes the plan prefix generated so far.
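A planner with this factorization can be implemented as a pointer decoder over the gated record vectors: at each step a recurrent state scores every record and the model points to one of them. The sketch below is our own illustration under that assumption (the LSTM cell, the dot-product scoring function, and the initialisation are choices we made for compactness, not necessarily the authors' exact architecture):

```python
import torch
import torch.nn as nn

class ContentPlanner(nn.Module):
    """Autoregressive pointer decoder for p(z|r) (illustrative sketch)."""

    def __init__(self, n):
        super().__init__()
        self.cell = nn.LSTMCell(n, n)
        self.W_p = nn.Linear(n, n, bias=False)   # pointer score: r_j^cs . (W_p h_k)

    def forward(self, r_cs, plan):
        # r_cs: (J, n) content-selected record vectors; plan: (T,) gold record indices.
        h = r_cs.mean(dim=0, keepdim=True)       # (1, n) initial state (an assumption)
        c = torch.zeros_like(h)
        inp = h.clone()                          # start-of-plan input (an assumption)
        nll = torch.zeros(())                    # accumulates -sum_k log p(z_k | z_<k, r)
        for t in range(plan.size(0)):
            h, c = self.cell(inp, (h, c))
            scores = r_cs @ self.W_p(h.squeeze(0))   # (J,) one pointer score per record
            log_p = torch.log_softmax(scores, dim=0)
            nll = nll - log_p[plan[t]]
            inp = r_cs[plan[t]].unsqueeze(0)         # feed the chosen record at the next step
        return nll
```

At test time the same decoder can be run greedily or with beam search, emitting record indices until an end-of-plan symbol (EOS in Figure 2); the resulting record sequence is what the text decoder then attends over.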