# Sequential Copying Networks

Qingyu Zhou, Nan Yang, Furu Wei, Ming Zhou
Harbin Institute of Technology, Harbin, China
Microsoft Research, Beijing, China
qyzhgm@gmail.com, {nanya, fuwei, mingzhou}@microsoft.com
*Contribution during internship at Microsoft Research.*

## Abstract

The copying mechanism has proven effective in sequence-to-sequence (seq2seq) neural network models for text generation tasks such as abstractive sentence summarization and question generation. However, existing work on modeling the copying or pointing mechanism only considers copying single words from the source sentence. In this paper, we propose a novel copying framework, named Sequential Copying Networks (SeqCopyNet), which learns not only to copy single words but also to copy sequences from the input sentence. It leverages pointer networks to explicitly select a sub-span from the source side to the target side, and integrates this sequential copying mechanism into the generation process of the encoder-decoder paradigm. Experiments on abstractive sentence summarization and question generation tasks show that the proposed SeqCopyNet can copy meaningful spans and outperforms the baseline models.

## Introduction

Recently, the attention-based sequence-to-sequence (seq2seq) framework (Sutskever, Vinyals, and Le 2014; Bahdanau, Cho, and Bengio 2015) has achieved remarkable progress in text generation tasks such as abstractive text summarization (Rush, Chopra, and Weston 2015), question generation (Zhou et al. 2017a) and conversation response generation (Vinyals and Le 2015). In this framework, an encoder reads the input sequence and produces a list of vectors, which are fed into a decoder that generates the output sequence by predicting words one by one through a softmax over a fixed-size target vocabulary.

It has been observed that seq2seq suffers from the unknown or rare word problem (Luong et al. 2015). Gulcehre et al. (2016) and Gu et al. (2016) make the key observation that in tasks like summarization and response generation, rare words in the output sequence can usually be found in the input sequence. Based on this observation, they propose a copying mechanism that directly copies words from the input to the output sequence, which alleviates the rare word problem. In their work, every output word can either be generated by predicting a word in the target vocabulary or be copied from the input sequence.

We further observe that the copied words usually form a continuous chunk of the output, exhibiting a sequential copying phenomenon. For example, in the Gigaword dataset of the abstractive sentence summarization task, about 57.7% of the words are copied from the input, as indicated in Figure 1. Moreover, copied words in multi-word spans account for 28.1%, so sequence copying is also very common. For example, in Figure 2 there are two copied bi-grams in the output summary. A similar phenomenon is also observed in the question generation task.

However, previous methods fall into one paradigm, which we call single word copy. At each decoding time step, these models still follow a word-by-word style and make separate decisions on whether to copy. This single word copy paradigm may therefore introduce errors due to these separate decisions. For example, the words in a phrase should be copied consecutively from the input sentence, but separate decisions cannot guarantee this.
This may cause unrelated words to appear unexpectedly in the middle of a phrase, or a phrase may be copied incompletely with some words missing. Therefore, we argue that tasks such as abstractive sentence summarization and question generation can benefit from sequential copying, considering the intrinsic nature of these tasks and datasets.

Figure 1: Percentage of generated and copied words in the sentence summarization training data.

In this paper, we propose a novel copying framework, Sequential Copying Networks (SeqCopyNet), to extend the vanilla seq2seq framework. SeqCopyNet is intended to learn not only the single word copy behavior, but also the sequence copy operation mentioned above. We design a span extractor for the decoder so that it can perform sequence copy actions during decoding. Specifically, SeqCopyNet consists of three main components: an RNN based sentence encoder, an attention-equipped decoder, and the newly designed copying module. We follow previous works in using a bidirectional RNN as the sentence encoder, and the decoder also employs an RNN with an attention mechanism (Bahdanau, Cho, and Bengio 2015). To achieve the sequential copying mechanism, the copying module is integrated with the decoder to make decisions during decoding.

Figure 2: An example of sequential copying in the abstractive sentence summarization task.

The sequential copying module in SeqCopyNet contains three main components, namely the copy switch gate network, the pointer network and the copy state transducer. The copy switch gate network decides whether to copy according to the current decoding states. Its output is not a binary value but a scalar in [0, 1], which is the probability of choosing to copy. The pointer network is then used to extract a span from the input sentence. We maintain a copying state in the copying module so that the pointer network can make predictions based on it. In detail, the pointer network predicts the start and end positions of the span. The start position is predicted using the start copying state. The copy state transducer then updates the copying state so that the pointer network can predict the end position. This transduction is performed by an RNN so that it can remember related information such as the start position and guide the pointer to the corresponding end position.

We conduct experiments on abstractive sentence summarization and question generation tasks to verify the effectiveness of SeqCopyNet. On both tasks, SeqCopyNet outperforms the baseline models, and the case studies show that it can copy meaningful spans.

## Sequential Copying Networks

As shown in Figure 3, our SeqCopyNet consists of three main components, namely the encoder, the copying module and the decoder. As in vanilla seq2seq frameworks, the encoder leverages two Gated Recurrent Units (GRU) (Cho et al. 2014) to read the input words, and the decoder is a GRU with an attention mechanism. The copying module consists of a copy switch gate network, a pointer network and a recurrent copy state transducer. At each decoding time step, the copying module decides whether to copy or generate. If it decides to copy, the pointer network and the copy state transducer cooperate to copy a sub-span from the input sentence by predicting its start and end positions. After the copying action, if the copied sequence contains more than one word, the decoder applies CopyRun to update its states accordingly.
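To make the interaction of these components concrete, the following is a minimal, illustrative PyTorch sketch of one decoding step under this scheme. It is not the authors' implementation: the class `CopyDecisionStep`, its dimensions, and the bilinear pointer scorer are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class CopyDecisionStep(nn.Module):
    """Toy control-flow sketch of one SeqCopyNet-style decoding step.

    The sub-modules (gate, pointer scorer, transducer) are simplified stand-ins
    for the components described in the text, not the authors' implementation.
    """

    def __init__(self, emb_dim=128, dec_dim=256, enc_dim=512, vocab_size=10000):
        super().__init__()
        mem_dim = emb_dim + dec_dim + enc_dim                 # m_t = [w_{y_{t-1}}; s_t; c_t]
        self.copy_gate = nn.Sequential(                       # copy switch gate network
            nn.Linear(mem_dim, dec_dim), nn.Tanh(),
            nn.Linear(dec_dim, 1), nn.Sigmoid())
        self.init_copy = nn.Linear(mem_dim, dec_dim)          # start copying state
        self.pointer = nn.Bilinear(dec_dim, enc_dim, 1)       # stand-in pointer scorer
        self.copy_transducer = nn.GRUCell(enc_dim, dec_dim)   # copy state transducer (RNN)
        self.generator = nn.Linear(mem_dim, vocab_size)       # generate-mode projection

    def forward(self, m_t, enc_states):
        # m_t: (1, mem_dim) decoder memory; enc_states: (n, enc_dim) encoder states.
        p_copy = self.copy_gate(m_t)                          # probability of copying
        if p_copy.item() > 0.5:                               # copy mode
            q = torch.tanh(self.init_copy(m_t))               # copying state
            n = enc_states.size(0)
            start = self.pointer(q.repeat(n, 1), enc_states).squeeze(-1).argmax().item()
            q = self.copy_transducer(enc_states[start].unsqueeze(0), q)  # update state
            end = self.pointer(q.repeat(n, 1), enc_states).squeeze(-1).argmax().item()
            return ("copy", start, end)
        word_id = self.generator(m_t).argmax(dim=-1).item()   # generate mode
        return ("generate", word_id)


# One step with random inputs: returns ("copy", start, end) or ("generate", word_id).
step = CopyDecisionStep()
action = step(torch.randn(1, 896), torch.randn(7, 512))
```

In the actual model the gate output and pointer scores would be used as probabilities during training rather than hard argmax decisions; the hard decisions above are only meant to make the control flow explicit.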
### Encoder

The role of the sentence encoder is to read the input sentence and construct the basic sentence representation. Here we employ a bidirectional GRU (BiGRU) as the recurrent unit, where the GRU is defined as:

$$
\begin{aligned}
z_i &= \sigma(W_z [x_i, h_{i-1}]) \\
r_i &= \sigma(W_r [x_i, h_{i-1}]) \\
\tilde{h}_i &= \tanh(W_h [x_i, r_i \odot h_{i-1}]) \\
h_i &= (1 - z_i) \odot h_{i-1} + z_i \odot \tilde{h}_i
\end{aligned}
$$

where $W_z$, $W_r$ and $W_h$ are weight matrices. The BiGRU consists of a forward GRU and a backward GRU. The forward GRU reads the input word embeddings from left to right and produces a sequence of hidden states $(\overrightarrow{h}_1, \overrightarrow{h}_2, \ldots, \overrightarrow{h}_n)$. The backward GRU reads the input embeddings in reverse, from right to left, and produces another sequence of hidden states $(\overleftarrow{h}_1, \overleftarrow{h}_2, \ldots, \overleftarrow{h}_n)$:

$$
\overrightarrow{h}_i = \text{GRU}(x_i, \overrightarrow{h}_{i-1}), \qquad
\overleftarrow{h}_i = \text{GRU}(x_i, \overleftarrow{h}_{i+1})
$$

The initial states of the BiGRU are set to zero vectors, i.e., $\overrightarrow{h}_0 = \mathbf{0}$ and $\overleftarrow{h}_{n+1} = \mathbf{0}$. After reading the sentence, the forward and backward hidden states are concatenated, i.e., $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$, to get the basic sentence representation.

### Sequential Copying Mechanism

To model the sequential copying mechanism, SeqCopyNet needs three key abilities: a) at decoding time step $t$, the model needs to decide whether to copy or not; b) if the model decides to copy, it needs to select a sub-span from the input; c) the decoder should switch between generate mode and copy mode smoothly. To equip SeqCopyNet with the first two abilities, we design the copying module as a three-part component, i.e., the copy switch gate network, the pointer network and the copy state transducer. The last ability is enabled by the CopyRun method, which is described in the next section.

The copy switch gate network decides whether to copy during decoding. If the model goes into generate mode, it generates the next word in the same way as the vanilla attention-based seq2seq model. If the model chooses to copy, the pointer network predicts a sub-span.

At each time step $t$, the decoder GRU holds its previous hidden state $s_{t-1}$, the previous output word $y_{t-1}$ and the previous context vector $c_{t-1}$. With these previous states, the decoder GRU updates its state as:

$$
s_t = \text{GRU}(y_{t-1}, c_{t-1}, s_{t-1})
$$

To initialize the GRU hidden state, we use a linear layer with the last backward encoder hidden state $\overleftarrow{h}_1$ as input:

$$
s_0 = \tanh(W_d \overleftarrow{h}_1 + b)
$$

Figure 3: The overview diagram of SeqCopyNet. For simplicity, we omit some units and connections. The copying process of the sequence "security regime" is magnified in the copying module part.

With the new decoder hidden state $s_t$, the context vector $c_t$ for the current time step $t$ is computed through the concatenate attention mechanism (Luong, Pham, and Manning 2015), which matches the current decoder state $s_t$ with each encoder hidden state $h_i$ to get an importance score. The importance scores are then normalized, and the current context vector is obtained as their weighted sum:

$$
\begin{aligned}
e_{t,i} &= v_a^{\top} \tanh(W_a s_t + U_a h_i) \\
\alpha_{t,i} &= \frac{\exp(e_{t,i})}{\sum_{i'=1}^{n} \exp(e_{t,i'})} \\
c_t &= \sum_{i=1}^{n} \alpha_{t,i} h_i
\end{aligned}
$$

where $W_a$ and $U_a$ are learnable parameters. We then construct a new state vector, named the decoder memory vector $m_t$, which is the concatenation of the embedding of the previous output word $y_{t-1}$, the decoder GRU hidden vector $s_t$ and the current context vector $c_t$:

$$
m_t = [w_{y_{t-1}}; s_t; c_t]
$$
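As a concrete reference for this section, the following is a minimal PyTorch-style sketch (our own illustration, not the released implementation) of the BiGRU encoder, the concatenate attention, and the decoder memory vector $m_t$. The class names, dimensions, and the use of `nn.GRU(bidirectional=True)` are assumptions; only the equations they mirror come from the text.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """BiGRU encoder sketch: h_i = [forward h_i ; backward h_i]."""

    def __init__(self, vocab_size=10000, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # bidirectional=True runs forward and backward GRUs (zero initial states)
        # and concatenates their hidden states along the feature dimension.
        self.bigru = nn.GRU(emb_dim, hid_dim, bidirectional=True, batch_first=True)

    def forward(self, word_ids):                          # word_ids: (batch, n)
        states, _ = self.bigru(self.embedding(word_ids))
        return states                                     # (batch, n, 2 * hid_dim)


class ConcatAttention(nn.Module):
    """Concatenate attention: e_{t,i} = v_a^T tanh(W_a s_t + U_a h_i)."""

    def __init__(self, dec_dim=256, enc_dim=512, att_dim=256):
        super().__init__()
        self.W_a = nn.Linear(dec_dim, att_dim, bias=False)
        self.U_a = nn.Linear(enc_dim, att_dim, bias=False)
        self.v_a = nn.Linear(att_dim, 1, bias=False)

    def forward(self, s_t, enc_states):
        # s_t: (batch, dec_dim); enc_states: (batch, n, enc_dim)
        scores = self.v_a(torch.tanh(self.W_a(s_t).unsqueeze(1) + self.U_a(enc_states)))
        alpha = torch.softmax(scores.squeeze(-1), dim=-1)            # alpha_{t,i}
        c_t = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)   # weighted sum
        return c_t, alpha


def decoder_memory(prev_word_emb, s_t, c_t):
    """Decoder memory vector m_t = [w_{y_{t-1}}; s_t; c_t]."""
    return torch.cat([prev_word_emb, s_t, c_t], dim=-1)


# Example wiring for one decoding step (random inputs, illustrative only).
encoder, attention = SentenceEncoder(), ConcatAttention()
h = encoder(torch.randint(0, 10000, (1, 6)))          # (1, 6, 512)
s_t = torch.randn(1, 256)                             # decoder GRU state
c_t, alpha = attention(s_t, h)
m_t = decoder_memory(torch.randn(1, 128), s_t, c_t)   # (1, 896)
```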
In SeqCopyNet, the decoder memory vector $m_t$ plays an important role. In the copying module, the copy switch gate network makes decisions based on $m_t$. Specifically, the copy switch gate network $G$ is a multilayer perceptron (MLP) with two hidden layers:

$$
G(x) = \sigma\big(W_2 \tanh(W_1 x + b_1) + b_2\big) \tag{13}
$$

where $W_1$, $W_2$, $b_1$ and $b_2$ are learnable parameters. The activation function of the first hidden layer is the hyperbolic tangent (tanh). To produce a probability of whether to copy, we use the sigmoid function ($\sigma(\cdot)$ in Equation 13) as the activation function of the last layer. The copy probability $p_c$ and the generate probability $p_g$ are defined as:

$$
p_c = G(m_t), \qquad p_g = 1 - p_c
$$

#### Generate Mode

If the copy switch gate network decides to generate, SeqCopyNet generates the next word using the decoder memory vector $m_t$. The decoder first computes a readout state $r_t$ and then passes it through a maxout hidden layer (Goodfellow et al. 2013) to predict the next word with a softmax layer over the decoder vocabulary:

$$
\begin{aligned}
r_t &= W_r w_{y_{t-1}} + U_r c_t + V_r s_t \\
r'_t &= \big[\max\{r_{t,2j-1},\, r_{t,2j}\}\big]_{j=1,\ldots,d} \\
p(y_t \mid y_{<t}) &= \operatorname{softmax}(W_o r'_t)
\end{aligned}
$$
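For reference, here is a minimal sketch of the copy switch gate (Equation 13) and the generate-mode readout/maxout path. Dimensions and class names are assumed for illustration, and the output projection `W_o` follows the standard readout-to-vocabulary formulation; this is not the released code.

```python
import torch
import torch.nn as nn

class CopySwitchGate(nn.Module):
    """Copy switch gate (Equation 13): G(x) = sigmoid(W2 tanh(W1 x + b1) + b2)."""

    def __init__(self, mem_dim=896, hid_dim=256):
        super().__init__()
        self.W1 = nn.Linear(mem_dim, hid_dim)   # first hidden layer (tanh)
        self.W2 = nn.Linear(hid_dim, 1)         # last layer (sigmoid)

    def forward(self, m_t):
        p_c = torch.sigmoid(self.W2(torch.tanh(self.W1(m_t))))
        return p_c, 1.0 - p_c                   # copy and generate probabilities


class GenerateMode(nn.Module):
    """Generate mode: readout state, maxout layer, softmax over the vocabulary."""

    def __init__(self, emb_dim=128, enc_dim=512, dec_dim=256, maxout_dim=256, vocab=10000):
        super().__init__()
        self.W_r = nn.Linear(emb_dim, 2 * maxout_dim, bias=False)
        self.U_r = nn.Linear(enc_dim, 2 * maxout_dim, bias=False)
        self.V_r = nn.Linear(dec_dim, 2 * maxout_dim, bias=False)
        self.W_o = nn.Linear(maxout_dim, vocab, bias=False)

    def forward(self, prev_word_emb, c_t, s_t):
        r_t = self.W_r(prev_word_emb) + self.U_r(c_t) + self.V_r(s_t)   # readout r_t
        r_t = r_t.view(r_t.size(0), -1, 2).max(dim=-1).values           # maxout r'_t
        return torch.log_softmax(self.W_o(r_t), dim=-1)                 # log p(y_t | y_<t)


# Example: decide, then generate (random inputs, illustrative only).
gate, generator = CopySwitchGate(), GenerateMode()
m_t = torch.randn(1, 896)
p_c, p_g = gate(m_t)
log_probs = generator(torch.randn(1, 128), torch.randn(1, 512), torch.randn(1, 256))
```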