# Contextualized Rewriting for Text Summarization

Guangsheng Bao 1,2, Yue Zhang 1,2
1 School of Engineering, Westlake University
2 Institute of Advanced Technology, Westlake Institute for Advanced Study
{baoguangsheng, zhangyue}@westlake.edu.cn

## Abstract

Extractive summarization suffers from irrelevance, redundancy and incoherence. Existing work shows that abstractive rewriting of extractive summaries can improve conciseness and readability. These rewriting systems take the extracted summary as their only input, which is relatively focused but can lose important background knowledge. In this paper, we investigate contextualized rewriting, which ingests the entire original document. We formalize contextualized rewriting as a seq2seq problem with group alignments, introducing group tags to model the alignments and identify the extracted summary through content-based addressing. Results show that our approach significantly outperforms non-contextualized rewriting systems without requiring reinforcement learning, achieving strong ROUGE improvements over multiple extractive summarizers.

## Introduction

Extractive text summarization systems (Nallapati, Zhai, and Zhou 2017; Narayan, Cohen, and Lapata 2018; Liu and Lapata 2019) work by identifying salient text segments (typically sentences) from an input document as its summary. They have been shown to outperform abstractive systems (Rush, Chopra, and Weston 2015; Nallapati et al. 2016; Chopra, Auli, and Rush 2016) in terms of content selection and faithfulness to the input. However, extractive summarizers exhibit several limitations. First, sentences extracted from the input document tend to contain irrelevant and redundant phrases (Durrett, Berg-Kirkpatrick, and Klein 2016; Chen and Bansal 2018; Gehrmann, Deng, and Rush 2018). Second, extracted sentences can be weak in their coherence with regard to discourse relations and cross-sentence anaphora (Dorr, Zajic, and Schwartz 2003; Cheng and Lapata 2016).

To address these issues, a line of work investigates post-editing of extractive summarizer outputs. While grammar tree trimming has been considered for reducing irrelevant content within sentences (Dorr, Zajic, and Schwartz 2003), rule-based methods have also been investigated for reducing redundancy and enhancing coherence (Durrett, Berg-Kirkpatrick, and Klein 2016). With the rise of neural networks, a more recent line of work considers using abstractive models for rewriting extracted outputs sentence by sentence (Chen and Bansal 2018; Bae et al. 2019; Wei, Huang, and Gao 2019; Xiao et al. 2020).

> **Figure 1:** Example showing that contextual information can benefit summary rewriting.
> *Source Document:* thousands of live earthworms have been falling from the sky ... a biology teacher discovered the worms on the surface of the snow while he was skiing in the mountains near bergen at the weekend ... teacher karstein erstad told norwegian news website ...
> *Gold Summary:* teacher karstein erstad found thousands of live worms on top of the snow.
> *Extractive Summary:* a biology teacher discovered the worms on the surface of the snow while he was skiing in the mountains near bergen at the weekend.
> *Rewritten Summary:* biology teacher karstein erstad discovered the worms on the snow.
Human evaluation shows that such rewriting systems effectively improve conciseness and readability. Interestingly, however, existing rewriters do not improve ROUGE scores compared with their extractive baselines.

Existing abstractive rewriting systems take the extracted summary as their only input. On the other hand, information from the original document can serve as useful background knowledge for inferring factual details. Take Figure 1 for example. A salient summary can be made by extracting the sentence *a biology teacher ... weekend*. While a rewriter can simplify this sentence into a better summary, it cannot provide additional details beyond the sentence unless the document context is also considered. For example, the name of the teacher is not given by the extractive summary, but we can infer that the teacher's name is *karstein erstad* from the context sentences, thereby making the summary more informative.

We propose contextualized rewriting, which uses the full input document as context for rewriting extractive summary sentences. Rather than encoding only the extractive summary, we use a neural representation model to encode the whole input document, representing the extractive summary as a part of the document representation. To inform the rewriter of the current sentence being rewritten, we use content-based addressing (Graves, Wayne, and Danihelka 2014). Specifically, as Figure 2 shows, a unique group tag is used to index each extracted sentence in the source document, matching an increasing sentence index in the abstractive rewriter as it generates the output, where the group tags 1, 2 and 3 guide the first, second and third rewritten summary sentences, respectively.

> **Figure 2:** Example of the three-step summarization process: selecting, grouping and rewriting.
> *Source Document:* our resident coach and technical expert chris meadows has plenty of experience in the sport and has worked with some of the biggest names in golf. [1] chris has worked with more than 100,000 golfers throughout his career. growing up beside nick faldo, meadows learned that success in golf comes through develping a clear understanding of, and being committed to, your objective. a dedicated coach from an early age, he soon realized his gift was the development of others. meadows simple and holistic approach to learning has been personally shared with more than 100,000 golfers in a career spanning three decades. [2] many of his instructional books have become best-sellers, his career recently being recognized by the professional golfers association when he was made an advanced fellow of the pga. [3] chris has been living golf's resident golf expert since 2003.
> *Rewritten Summary:* chris meadows has worked with some of golf's big names. [1] he has personally coached more than 100,000 golfers. [2] chris was made an advanced fellow of the pga. [3]

We choose the BERT (Devlin et al. 2019) base model as the document encoder, building both the extractive summarizer and the abstractive rewriter following the basic models of Liu and Lapata (2019). Our models are evaluated on the CNN/DM dataset (Hermann et al. 2015). Results show that the contextualized rewriter gives significantly improved ROUGE (Lin 2004) scores compared with a state-of-the-art extractive baseline, outperforming a traditional rewriter baseline by a large margin. In addition, our method gives better compression, lower redundancy and better coherence.
The contextualized rewriter achieves strong and consistent improvements over multiple extractive summarizers. To our knowledge, we are the first to report improved ROUGE scores by rewriting extractive summaries. We release our code at https://github.com/baoguangsheng/ctx-rewriter-for-summ.git.

## Related Work

Extractive summarizers have received constant research attention. Early approaches such as TextRank (Mihalcea and Tarau 2004) select sentences based on weighted similarities. Recently, Nallapati, Zhai, and Zhou (2017) use a neural classifier to choose sentences and a selector to rank them. Chen and Bansal (2018) use a Pointer Network (Vinyals, Fortunato, and Jaitly 2015) to extract sentences. Liu and Lapata (2019) use a linear classifier on top of BERT. This method gives the current state-of-the-art result in extractive summarization, and we choose it as our baseline.

Rewriting systems manipulate extractive summaries to reduce irrelevance, redundancy and incoherence. Durrett, Berg-Kirkpatrick, and Klein (2016) use compression rules to remove unimportant content within a sentence and impose anaphoricity constraints to improve cross-sentence coherence. Dorr, Zajic, and Schwartz (2003) trim unnecessary phrases from a sentence without hurting grammatical correctness by analyzing its syntactic structure. In contrast to their work, we consider neural abstractive rewriting, which can address all the above issues more systematically.

Recently, neural rewriting has attracted much research attention. Chen and Bansal (2018) use a seq2seq model with the copy mechanism (See, Liu, and Manning 2017) to rewrite extractive summaries sentence by sentence. A reranking post-process is applied to avoid repetition, and the extractive model is further tuned by reinforcement learning with reward signals from each rewritten sentence. Bae et al. (2019) use a similar strategy but with a BERT document encoder and reward signals from the whole summary. Wei, Huang, and Gao (2019) use a binary classifier on top of a BERT document encoder to select sentences, and a Transformer decoder (Vaswani et al. 2017) with the copy mechanism to generate each summary sentence. Xiao et al. (2020) build a hierarchical representation of the input document. A pointer network and a copy-or-rewrite mechanism are designed to choose sentences for copying or rewriting, followed by a vanilla seq2seq model as the rewriter. The model decisions on sentence selection, copying and rewriting are tuned by reinforcement learning. Compared with these methods, our method is computationally simpler because it uses neither reinforcement learning nor the copy mechanism, on which most of the methods above rely. In addition, as mentioned earlier, in contrast to these methods, we consider rewriting with a document-level context, and therefore can potentially improve factual details and faithfulness.

Some hybrid extractive and abstractive summarization models are also in line with our work. Cheng and Lapata (2016) use a hierarchical encoder for extracting words, constraining a conditioned language model to generate fluent summaries. Gehrmann, Deng, and Rush (2018) consider a bottom-up method, using a neural classifier to select important words from the input document, and informing an abstractive summarizer by restricting the copy source in a pointer-generator network to the selected content. Similar to our work, they use extracted content to guide the abstractive summary.
However, different from their work, which focuses on the word level, we investigate sentence-level constraints for guiding abstractive rewriting.

Our method can also be regarded as using group tags to guide the reading context during abstractive summarization (Rush, Chopra, and Weston 2015; Nallapati et al. 2016; See, Liu, and Manning 2017), where the group tags are obtained from an extractive summary. Compared with vanilla abstractive summarization, the advantages are three-fold. First, extractive summaries can guide the abstractive summarizer towards more salient information. Second, the training difficulty of the abstractive model can be reduced when important contents are marked in the input. Third, the summarization procedure is made more interpretable by associating a crucial source sentence with each target sentence.

## Seq2seq with Group Alignments

As a key contribution of our method, we model contextualized rewriting as a seq2seq mapping problem with group alignments. For an input sequence X and an output sequence Y, a group set G describes a set of segment-wise alignments between X and Y. The mapping problem is defined as finding the estimation

$$\hat{Y} = \arg\max_{Y, G} P(Y, G \mid X), \tag{1}$$

$$X = \{w_i\}_{i=1}^{|X|}, \quad Y = \{w_j\}_{j=1}^{|Y|}, \quad G = \{G_k\}_{k=1}^{|G|}, \tag{2}$$

where |X| denotes the number of elements in X, |Y| the number of elements in Y, and |G| the number of groups. Each group $G_k$ denotes a pair of text segments, one from X and one from Y, which belong to the same group. Taking Figure 2 as an example, the first extracted sentence from the document and the first sentence from the summary form a group $G_1$.

The problem can be simplified given the fact that for each group $G_k$, the text segment from X is known, while the corresponding segment from Y is dynamically decided during the generation of Y. We thus separate G into two components $G_X$ and $G_Y$, and redefine the mapping problem as

$$\hat{Y} = \arg\max_{Y, G_Y} P(Y, G_Y \mid X, G_X), \tag{3}$$

$$G_X = \{g_i = k \text{ if } w_i \in G_k \text{ else } 0\}_{i=1}^{|X|}, \tag{4}$$

$$G_Y = \{g_j = k \text{ if } w_j \in G_k \text{ else } 0\}_{j=1}^{|Y|}, \tag{5}$$

so that for each group $G_k$, a group tag k is assigned, through which the text segment from X in group $G_k$ is linked to the segment from Y in the same group. For the example in Figure 2, $G_X = \{1, ..., 1, 0, ..., 0, 2, ..., 2, 3, ..., 3, 0, ..., 0\}$ and $G_Y = \{1, ..., 1, 2, ..., 2, 3, ..., 3\}$.

In the encoder-decoder framework, we convert $G_X$ and $G_Y$ into vector representations through a shared embedding table, which is randomly initialized and jointly trained with the encoder and decoder. The vector representations of $G_X$ and $G_Y$ are used to enrich the vector representations of X and Y, respectively. As a result, all the tokens tagged with k in both X and Y share the same vector component, through which content-based addressing can be performed by the attention mechanism (Garg et al. 2019). Here, the group tag serves as a mechanism that constrains the attention from Y to the corresponding part of X during decoding. Unlike approaches that modify a seq2seq model using rules (Hsu et al. 2018; Gehrmann, Deng, and Rush 2018), group tags make the modification flexible and trainable.
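To make Eq. 4 and 5 concrete, here is a minimal Python sketch (the helper name and toy data are hypothetical, not taken from the released implementation) that builds the token-level tag sequences $G_X$ and $G_Y$ from tokenized sentences and a list of aligned sentence pairs, reproducing the kind of pattern illustrated by Figure 2.

```python
from typing import List, Tuple


def build_group_tags(
    doc_sents: List[List[str]],     # tokenized document sentences
    sum_sents: List[List[str]],     # tokenized summary sentences
    groups: List[Tuple[int, int]],  # (document sentence index, summary sentence index) per group
) -> Tuple[List[int], List[int]]:
    """Build the token-level group tag sequences G_X and G_Y of Eq. 4 and 5.

    Tokens inside the k-th aligned sentence pair get tag k (1-based);
    all other tokens get tag 0.
    """
    doc_tag = {d: k + 1 for k, (d, _) in enumerate(groups)}
    sum_tag = {s: k + 1 for k, (_, s) in enumerate(groups)}
    g_x = [doc_tag.get(i, 0) for i, sent in enumerate(doc_sents) for _ in sent]
    g_y = [sum_tag.get(j, 0) for j, sent in enumerate(sum_sents) for _ in sent]
    return g_x, g_y


# Toy version of the Figure 2 pattern: document sentences 0, 4 and 5 are aligned
# with summary sentences 0, 1 and 2, forming groups 1, 2 and 3.
doc = [["chris", "meadows", "coaches"], ["he", "trains"], ["growing", "up"],
       ["a", "coach"], ["meadows", "shared", "approach"], ["books", "best-sellers"]]
summ = [["chris", "coaches"], ["he", "trains"], ["advanced", "fellow"]]
g_x, g_y = build_group_tags(doc, summ, groups=[(0, 0), (4, 1), (5, 2)])
print(g_x)  # [1, 1, 1, 0, 0, 0, 0, 0, 0, 2, 2, 2, 3, 3]
print(g_y)  # [1, 1, 2, 2, 3, 3]
```

In training, such tag sequences would be embedded through the shared table and added to the token representations, so that attention from a summary token tagged k naturally concentrates on document tokens tagged k.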
## Contextualized Rewriting System

We take a three-step process to generate a summary. First, an extractive summarization model is used to select a set of sentences from the original document as a guiding source. Second, the guiding source text is matched against the original document, whereby group tags are assigned to the document tokens. Third, an abstractive rewriter is applied to the tagged document, where the group tags serve as guidance for summary generation.

Formally, we use $X = \{w_i\}_{i=1}^{|X|}$ to represent the document X, which contains |X| tokens, and $Y = \{w_j\}_{j=1}^{|Y|}$ to represent the final summary Y, which contains |Y| tokens.

### Extractive Summarizer

Following Liu and Lapata (2019), we use BERT to encode the input document, with a special [CLS] token added to the beginning of each sentence and interval segments applied to distinguish successive sentences. On top of the BERT representations of the [CLS] tokens, an extractor is stacked to select sentences. The extractor uses a Transformer (Vaswani et al. 2017) encoder to generate inter-sentence representations, on which an output layer with sigmoid activation calculates the probability of each sentence being extracted.

**Encoder.** We use the BERT encoder $\mathrm{BERT}_{\mathrm{ENC}}$ to convert the source document X into a sequence of token embeddings $H_X$, taking the [CLS] embeddings as representations of the source sentences, denoted as $H_C$:

$$H_X = \mathrm{BERT}_{\mathrm{ENC}}(X), \quad H_C = \{H_X^{(i)} \mid w_i = \mathrm{[CLS]}\}_{i=1}^{|X|}. \tag{6}$$

**Extractor.** We use a Transformer encoder $\mathrm{TRANS}_{\mathrm{ENC}}$ to convert the sentence embeddings $H_C$ into final inter-sentence representations $H_F$, and calculate the extraction probability of each sentence according to $H_F$:

$$H_F = \mathrm{TRANS}_{\mathrm{ENC}}(H_C), \quad P(\mathrm{ext}_k \mid X) = \sigma(W H_F^{(k)} + b), \tag{7}$$

where $\mathrm{ext}_k$ means that the k-th sentence is extracted, and W and b are trainable model parameters.

Given the sequence of extraction probabilities $\{P(\mathrm{ext}_k \mid X)\}_{k=1}^{C}$, where C denotes the number of sentences in X, we make a decision on each sentence according to three hyper-parameters: the minimum number of sentences to extract min_sel, the maximum number of sentences to extract max_sel, and a probability threshold. In particular, we sort the C sentences in descending order of $P(\mathrm{ext}_k \mid X)$: sentences ranked between 0 and min_sel are selected by default, while sentences ranked between min_sel and max_sel are decided by comparing their probability with the threshold, and only those above the threshold are selected. We decide the hyper-parameter values using dev experiments. Note that our method is slightly different from the extractive model of Liu and Lapata (2019), which extracts the 3 most probable sentences as the summary. Since the rewriter performs strong compression, our method allows extracting more sentences for better recall.
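As a rough illustration of this selection rule, the decision can be written as the following minimal sketch (the function name is hypothetical, and the actual hyper-parameter values are those tuned on the dev set).

```python
from typing import List


def select_sentences(probs: List[float], min_sel: int, max_sel: int, threshold: float) -> List[int]:
    """Pick sentence indices from the extraction probabilities P(ext_k | X).

    The top `min_sel` ranked sentences are always selected; sentences ranked
    between `min_sel` and `max_sel` are selected only if their probability
    exceeds `threshold`.
    """
    ranked = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)
    selected = ranked[:min_sel]
    selected += [k for k in ranked[min_sel:max_sel] if probs[k] > threshold]
    return sorted(selected)  # restore document order


# Example: six sentences, keep at least 2 and at most 4, threshold 0.5.
print(select_sentences([0.9, 0.2, 0.7, 0.1, 0.6, 0.3], min_sel=2, max_sel=4, threshold=0.5))
# -> [0, 2, 4]
```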
### Source Group Tagging

We match the extracted summary against the original document for group tagging, taking each sentence in the extracted summary as a group, so that the first summary sentence and its matched document sentence form group one, the second pair group two, and so on. Formally, for document X and extractive summary E, the k-th summary sentence $E_k$ ($k \in [1, ..., K]$) is matched to X, where every token in $E_k$ is assigned the group tag k. In particular, Eq. 4 is instantiated as

$$G_X = \{g_i = k \text{ if } w_i \in E_k \text{ else } 0\}_{i=1}^{|X|}, \tag{8}$$

where $G_X$ is the sequence of group tags for document X.

### Contextualized Rewriter

The contextualized rewriter extends the abstractive summarizer of Liu and Lapata (2019), which is a standard Transformer sequence-to-sequence model with BERT as the encoder. As Figure 3 shows, to integrate group tag guidance, group tag embeddings are added to both the encoder and the decoder. Formally, for an extractive summary E, the set of group tags is a closed set [1, ..., K]. We use a lookup table $W_G$ to represent the embeddings of the group tags, which is shared by the encoder and the decoder.

> **Figure 3:** Architecture of the contextualized rewriter. The group tag embeddings are tied between the encoder (left) and the decoder (right), through which the decoder can address the corresponding tokens in the document.

**Encoder.** The original document is processed in the same way as for the extractive model, where a [CLS] token is added for each sentence and interval segments are used to distinguish successive sentences. After BERT encoding, the representation of each token is added to its group tag embedding to produce a final representation

$$H_{X+G} = \mathrm{BERT}_{\mathrm{ENC}}(X) + \mathrm{EMB}_{W_G}(G_X), \tag{9}$$

where $\mathrm{EMB}_{W_G}(G_X)$ denotes the embeddings retrieved from the lookup table $W_G$ for the group tag sequence $G_X$.

**Decoder.** Summary sentences are synthesized as a single sequence with the special token [BOS] at the beginning, [SEP] between sentences, and [EOS] at the end. The decoder follows a standard Transformer architecture. We treat each sentence in the summary as a group. Consequently, the group tag sequence $G_Y$ is fully determined by the summary Y. In particular, all the tokens in the k-th summary sentence $Y_k$ ($k \in [1, ..., K]$) are assigned the group tag k. Therefore, Eq. 5 is instantiated as

$$G_Y = \{g_j = k \text{ if } w_j \in Y_k \text{ else } 0\}_{j=1}^{|Y|}. \tag{10}$$

During decoding, the group tag is generated at each beam search step, starting with 1 after the special token [BOS] and increasing by 1 after each special token [SEP]. The embedding of group tag $g_j$ is retrieved from the lookup table $W_G$ by $\mathrm{EMB}_{W_G}(g_j)$, and added to the token embedding $\mathrm{EMB}(w_j)$ and the position embedding:

$$H_{Y+G} = \mathrm{EMB}(Y) + \mathrm{EMB}_{W_G}(G_Y),$$

which the decoder uses to predict $P(w_j \mid w_{<j}, \ldots)$.
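The decoder-side tagging can be illustrated with a minimal sketch, assuming a PyTorch setup; the helper and module names below are hypothetical and independent of the released code. The first function derives the group tag of each generated token from the special tokens seen so far, and the module adds the shared tag embedding table on top of the usual token and position embeddings, in the spirit of the equation above.

```python
from typing import List

import torch
import torch.nn as nn


def group_tags_for_prefix(tokens: List[str], sep_token: str = "[SEP]") -> List[int]:
    """Group tag for each generated token: start at 1 after [BOS] and
    increase by 1 after every [SEP] (Eq. 10, applied incrementally)."""
    tags, k = [], 1
    for tok in tokens:
        tags.append(k)
        if tok == sep_token:
            k += 1
    return tags


class TaggedDecoderEmbedding(nn.Module):
    """Token + position + group tag embeddings for the rewriter decoder.
    The tag table would play the role of W_G, shared with the encoder side."""

    def __init__(self, vocab_size: int, d_model: int, max_len: int, max_groups: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        self.tag = nn.Embedding(max_groups + 1, d_model)  # tag 0 reserved for untagged tokens

    def forward(self, token_ids: torch.Tensor, tag_ids: torch.Tensor) -> torch.Tensor:
        # token_ids, tag_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.pos(positions) + self.tag(tag_ids)


# Tags for a partially generated summary prefix during beam search.
prefix = ["[BOS]", "chris", "meadows", "coaches", "[SEP]", "he", "trains"]
print(group_tags_for_prefix(prefix))  # [1, 1, 1, 1, 1, 2, 2]
```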