# Graph-based Neural Sentence Ordering

Yongjing Yin¹, Linfeng Song²,³, Jinsong Su¹, Jiali Zeng¹, Chulun Zhou¹ and Jiebo Luo²

¹Xiamen University, Xiamen, China; ²University of Rochester, Rochester, NY, U.S.; ³Tencent AI Lab, Bellevue, WA, U.S.

{yinyongjing, lemon, clzhou}@stu.xmu.edu.cn, freesunshine0316@gmail.com, jssu@xmu.edu.cn, jluo@cs.rochester.edu

## Abstract

Sentence ordering aims to restore the original paragraph from a set of unordered sentences. It involves capturing global dependencies among sentences regardless of their input order. In this paper, we propose a novel and flexible graph-based neural sentence ordering model, which adopts a graph recurrent network [Zhang et al., 2018] to accurately learn semantic representations of the sentences. Instead of assuming connections between all pairs of input sentences, we use entities that are shared among multiple sentences to build more expressive graph representations with less noise. Experimental results show that our proposed model outperforms the existing state-of-the-art systems on several benchmark datasets, demonstrating the effectiveness of our model. We also conduct a thorough analysis of how entities help the performance. Our code is available at https://github.com/DeepLearnXMU/NSEG.git.

## 1 Introduction

Modeling the coherence of a paragraph or a long document is an important task, which contributes to both natural language generation and natural language understanding. Intuitively, it involves dealing with logical consistency and topic transitions. As a subtask, sentence ordering [Barzilay and Lapata, 2008] aims to reconstruct a coherent paragraph from an unordered set of sentences. It has been shown to benefit several tasks, including retrieval-based question answering [Yu et al., 2018] and extractive summarization [Barzilay et al., 2002; Galanis et al., 2012], where erroneous sentence orderings may cause performance degradation. Therefore, it is of great importance to study sentence ordering.

Most conventional approaches [Lapata, 2003; Barzilay and Lapata, 2008; Guinaudeau and Strube, 2013] are rule-based or statistical ones, relying on handcrafted and sophisticated features. However, carefully designing such features requires not only high labor costs but also rich linguistic knowledge, so it is difficult to transfer these methods to new domains or languages. Inspired by the recent success of deep learning, neural networks have been introduced to this task; representative work includes the window network [Li and Hovy, 2014], the neural ranking model [Chen et al., 2016], hierarchical RNN-based models [Gong et al., 2016; Logeswaran et al., 2018], and the deep attentive sentence ordering network (ATTOrderNet) [Cui et al., 2018]. Among these models, ATTOrderNet achieves the state-of-the-art performance with the aid of multi-head self-attention [Vaswani et al., 2017], which it uses to learn a relatively reliable paragraph representation for subsequent sentence ordering.

- S1: This is my dad and we always had a good time when getting together.
- S2: My dad used to take me to a park nearby from time to time.
- S3: In the park we usually sit outside and watched the boats on the lake.
- S4: I really miss the days in the park, as I am working too busy now.

Figure 1: An example of sentence ordering, where the correct order is S1, S2, S3, S4. S2 is more coherent with S1 than S4, as they share the same entity "dad".
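To make the intuition behind Figure 1 concrete, the minimal Python sketch below counts the entities shared by two sentences. The hand-written `ENTITIES` sets are stand-ins for the output of an entity extractor and are not part of the paper; this is only an illustration of the coherence signal, not the model itself.

```python
# Minimal sketch of the entity-overlap intuition behind Figure 1.
# The entity sets below are hand-specified stand-ins for what an
# entity extractor (e.g. an NER or parsing tool) would produce.
ENTITIES = {
    "S1": {"dad"},
    "S2": {"dad", "park"},
    "S3": {"park", "boats", "lake"},
    "S4": {"park", "days"},
}

def shared_entities(a: str, b: str) -> set:
    """Entities mentioned by both sentences."""
    return ENTITIES[a] & ENTITIES[b]

if __name__ == "__main__":
    print(shared_entities("S1", "S2"))  # {'dad'} -> coherent pair
    print(shared_entities("S1", "S4"))  # set()   -> no shared entity
```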
Although ATTOrderNet achieves the best performance so far, it still has two drawbacks. First, it is based on fully-connected graph representations. Although such representations enable the network to capture structural relationships across sentences, they also introduce much noise from pairs of semantically incoherent sentences. Second, the self-attention mechanism only exploits sentence-level information and applies the same set of parameters to quantify the relationship between any two sentences. Obviously, it is not flexible enough to exploit extra information, such as entities, which have proved crucial in modeling text coherence [Barzilay and Lapata, 2008; Elsner and Charniak, 2011]. Thus, we believe it is worth exploring a more suitable neural network for sentence ordering.

In this paper, we propose a novel graph-based neural sentence ordering model that adapts the recent graph recurrent network (GRN) [Zhang et al., 2018]. Inspired by Guinaudeau and Strube [2013], we first represent the input set of sentences (paragraph) as a sentence-entity graph, where each node represents either a sentence or an entity. Each entity node connects only to the sentence nodes that contain it, and two sentence nodes are linked if they contain the same entities. By doing so, our graph representations are able to model not only the semantic relevance between coherent sentences but also the co-occurrence between sentences and entities. Here we take the example in Figure 1 to illustrate the intuition behind our representations. Sentences sharing the same entities tend to be semantically close to each other: both S1 and S2 contain the entity "dad", and thus they are more coherent than S1 and S4. Compared with the fully-connected graph representations explored previously [Cui et al., 2018], our graph representations reduce the noise caused by edges between irrelevant sentence nodes. Another advantage is that the useful entity information can be fully exploited when encoding the input paragraph.

Based on sentence-entity graphs, we then adopt GRN [Zhang et al., 2018] to recurrently perform semantic transitions among connected nodes. In particular, we introduce an additional paragraph-level node to assemble the semantic information of all nodes during this process, where the resulting paragraph-level representation facilitates information transitions among long-distance connected nodes. Moreover, since sentence nodes and entity nodes play different roles, we employ different parameters to distinguish their impacts. Finally, on the basis of the learned paragraph representation, a pointer network is used to produce the order of sentences.

The main contribution of our work lies in introducing GRN into sentence ordering, which can be divided into three aspects:

1) We propose a GRN-based encoder for sentence ordering. Our work is the first to explore such an encoder for this task. Experimental results show that our model significantly outperforms the state-of-the-art systems.
2) We refine the vanilla GRN by modeling sentence nodes and entity nodes with different parameters.
3) Through extensive experiments, we verify that entities are very useful in graph representations for sentence ordering.
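As a rough illustration of how such a sentence-entity graph can be assembled, the sketch below derives sentence-entity and sentence-sentence edges from per-sentence entity sets and adds the extra paragraph-level node. The function and node naming are ours, and the entity sets would in practice come from an entity extractor; this is a minimal sketch under those assumptions, not the released implementation.

```python
from collections import defaultdict
from itertools import combinations

def build_sentence_entity_graph(entity_sets):
    """Build a sentence-entity graph from per-sentence entity sets.

    entity_sets[i] holds the entities mentioned in sentence i. Nodes are
    ('sent', i), ('ent', e), plus one ('para', 0) node linked to every
    other node; edges are returned as an undirected edge list.
    """
    edges = set()
    sentences_of = defaultdict(set)  # entity -> indices of sentences mentioning it

    for i, ents in enumerate(entity_sets):
        for e in ents:
            edges.add((("sent", i), ("ent", e)))   # entity <-> containing sentence
            sentences_of[e].add(i)

    # Two sentence nodes are linked iff they mention at least one common entity.
    for sents in sentences_of.values():
        for i, j in combinations(sorted(sents), 2):
            edges.add((("sent", i), ("sent", j)))

    # Paragraph-level node connected to all sentence and entity nodes.
    nodes = {("sent", i) for i in range(len(entity_sets))}
    nodes |= {("ent", e) for ents in entity_sets for e in ents}
    edges |= {(("para", 0), n) for n in nodes}

    return sorted(nodes | {("para", 0)}), sorted(edges)

# Example with the Figure 1 paragraph (entity sets hand-specified for illustration).
nodes, edges = build_sentence_entity_graph(
    [{"dad"}, {"dad", "park"}, {"park", "boats", "lake"}, {"park", "days"}])
print(len(nodes), "nodes,", len(edges), "edges")
```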
## 2 Baseline: ATTOrderNet [Cui et al., 2018]

In this section, we give a brief introduction to ATTOrderNet, which achieves state-of-the-art performance and is thus chosen as the baseline of our work. As shown in Figure 2, ATTOrderNet consists of a Bi-LSTM sentence encoder, a paragraph encoder based on multi-head self-attention [Vaswani et al., 2017], and a pointer network based decoder [Vinyals et al., 2015b]. It takes a set of sentences $s = [s_{o_1}, \dots, s_{o_M}]$ with input order $o = [o_1, \dots, o_M]$ and tries to recover the correct order $o^* = [o^*_1, \dots, o^*_M]$, where $M$ denotes the number of input sentences.

Figure 2: The architecture of ATTOrderNet [Cui et al., 2018].

### 2.1 Sentence Encoding with Bi-LSTM

The Bi-LSTM sentence encoder takes the word embedding sequence $(x_1, \dots, x_n)$ of each input sentence $s_{o_i}$ and produces its semantic representation. At the $j$-th step, the current states $\overrightarrow{h}_j$ and $\overleftarrow{h}_j$ are generated from the previous hidden states $\overrightarrow{h}_{j-1}$ and $\overleftarrow{h}_{j+1}$ and the current word embedding $x_j$ as follows:

$$\overrightarrow{h}_j = \overrightarrow{\text{LSTM}}(\overrightarrow{h}_{j-1}, x_j); \quad \overleftarrow{h}_j = \overleftarrow{\text{LSTM}}(\overleftarrow{h}_{j+1}, x_j). \tag{1}$$

Finally, the sentence representation is obtained by concatenating the last states of the Bi-LSTM in both directions: $\kappa^0_{o_i} = [\overrightarrow{h}_n; \overleftarrow{h}_1]$.

### 2.2 Paragraph Encoding with Multi-Head Self-Attention Network

The paragraph encoder consists of several self-attention layers followed by an average pooling layer. Given the representations of the input sentences, the initial paragraph representation $K^0$ is obtained by concatenating all sentence representations: $K^0 = [\kappa^0_{o_1}, \dots, \kappa^0_{o_M}]$. Next, this initial representation is fed into $L$ self-attention layers for updating. In particular, the update at layer $l$ is conducted by

$$K^l = \text{SelfAtten}^l(K^{l-1}), \tag{2}$$

where $\text{SelfAtten}^l$ denotes the $l$-th network layer, consisting of multi-head self-attention and feed-forward sub-layers. Finally, an average pooling layer is used to generate the final paragraph representation $g$ from the output $K^L$ of the last self-attention layer: $g = \frac{1}{M}\sum_{i=1}^{M} \kappa^L_i$, where $\kappa^L_i$ is the vector representation of $s_{o_i}$.

### 2.3 Decoding with Pointer Network

After obtaining the final paragraph representation $g$, an LSTM-based pointer network is used to predict the correct sentence order. Formally, the conditional probability of a predicted order $o^*$ given the input paragraph $s$ is factorized as

$$P(o^* \mid s) = \prod_{i=1}^{M} P(o^*_i \mid o^*_{<i}, s). \tag{3}$$
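To make the baseline's encoding pipeline concrete, here is a minimal PyTorch sketch of the two encoding stages: a Bi-LSTM yields each $\kappa^0_{o_i}$, a stack of self-attention layers (using `nn.TransformerEncoderLayer` as a stand-in for the paper's self-attention plus feed-forward block) updates the sentence matrix as in Eq. (2), and average pooling gives $g$. All sizes, layer counts, and module names are illustrative assumptions, and the pointer-network decoder is omitted.

```python
import torch
import torch.nn as nn

class ATTOrderNetEncoder(nn.Module):
    """Rough sketch of the ATTOrderNet encoder (Section 2); sizes are illustrative."""

    def __init__(self, vocab_size=10000, emb_dim=100, hidden_dim=128,
                 num_layers=2, num_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Bi-LSTM sentence encoder; sentence vector = [forward last; backward last].
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        # L self-attention layers playing the role of SelfAtten^l in Eq. (2).
        layer = nn.TransformerEncoderLayer(d_model=2 * hidden_dim,
                                           nhead=num_heads,
                                           dim_feedforward=4 * hidden_dim,
                                           batch_first=True)
        self.self_atten = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, sentences):
        # sentences: list of M LongTensors, each holding the word ids of one sentence.
        kappa = []
        for word_ids in sentences:
            emb = self.embed(word_ids).unsqueeze(0)      # (1, n, emb_dim)
            _, (h_n, _) = self.bilstm(emb)               # h_n: (2, 1, hidden_dim)
            # Concatenate final forward and backward states: kappa^0_{o_i}.
            kappa.append(torch.cat([h_n[0, 0], h_n[1, 0]], dim=-1))
        K0 = torch.stack(kappa).unsqueeze(0)             # (1, M, 2*hidden_dim)
        KL = self.self_atten(K0)                         # K^L after L layers
        g = KL.mean(dim=1).squeeze(0)                    # average pooling -> g
        return KL.squeeze(0), g

# Toy usage: a paragraph of 4 sentences with random word ids.
encoder = ATTOrderNetEncoder().eval()
paragraph = [torch.randint(0, 10000, (torch.randint(5, 12, (1,)).item(),))
             for _ in range(4)]
sentence_states, paragraph_vec = encoder(paragraph)
print(sentence_states.shape, paragraph_vec.shape)  # torch.Size([4, 256]) torch.Size([256])
```

In a full system, `sentence_states` and `paragraph_vec` would feed an LSTM-based pointer network that repeatedly attends over the sentence vectors to emit the next position in the predicted order, following Eq. (3).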