# Towards Improving Faithfulness in Abstractive Summarization

Xiuying Chen1, Mingzhe Li2, Xin Gao1,3, Xiangliang Zhang4,1

1 Computational Bioscience Research Center, King Abdullah University of Science and Technology
2 Ant Group  3 SDAIA-KAUST AI  4 University of Notre Dame

{xiuying.chen, xin.gao}@kaust.edu.sa, limingzhe.lmz@antgroup.com, xzhang33@nd.edu

## Abstract

Despite the success achieved in neural abstractive summarization based on pretrained language models, one unresolved issue is that the generated summaries are not always faithful to the input document. There are two possible causes of this unfaithfulness problem: (1) the summarization model fails to understand or capture the gist of the input text, and (2) the model over-relies on the language model to generate fluent but inadequate words. In this work, we propose a Faithfulness Enhanced Summarization model (FES), which is designed to address these two problems and improve faithfulness in abstractive summarization. For the first problem, we propose to use question answering (QA) to examine whether the encoder fully grasps the input document and can answer questions about its key information. The QA attention on the proper input words can also be used to stipulate how the decoder should attend to the source. For the second problem, we introduce a max-margin loss defined on the difference between the language model and the summarization model, aiming to prevent the overconfidence of the language model. Extensive experiments on two benchmark summarization datasets, CNN/DM and XSum, demonstrate that our model significantly outperforms strong baselines. The evaluation of factual consistency also shows that our model generates more faithful summaries than the baselines.²

## 1 Introduction

In recent years, text generation has made impressive progress [1, 2, 3].
The abstractive summarization task, which aims to produce a concise and fluent summary that is salient and faithful to the source document, has become a research hotspot due to its broad application prospects. The prevalence of pretrained transformer language models (LMs) [4, 5] has largely improved the fluency and salience of generated summaries. However, studies [6, 7] have shown that many summarization models suffer from the unfaithfulness problem, i.e., the generated summary is not entailed by the information presented in the source document. Durmus et al. [8] highlighted two notions of the unfaithfulness problem in summarization: one is the manipulation of information presented in the input document (intrinsic errors), and the other is the inclusion of information not inferable from the input (extrinsic errors).

---
∗ Corresponding author.
² https://github.com/iriscxy/FES

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

---

The intrinsic error problem is often caused by the failure of document-level inference, which is necessary for abstractive summarization. Specifically, the summarization model infers misinformation from the input document because of an inadequate encoder that misunderstands the source semantic information and a poor decoder that cannot fetch relevant and consistent content from the encoder. Several recent summarization models were proposed from this perspective. For example, Wu et al. [9] proposed a unified semantic graph encoder to learn better semantic meanings and a graph-aware decoder to utilize the encoded information. Cao et al. [10] used contrastive learning to help the model be aware of factual information. The second type of error, the extrinsic error, is often introduced by excessive attention paid to the LM, which ensures fluency while neglecting to summarize the source document.
For example, an LM is inclined to generate the commonly used phrase "score the winner" when the correct phrase is the less frequently used "score the second highest." This type of error has been studied in neural machine translation [11], but has not been addressed in abstractive summarization.

To address these errors, we propose a novel Faithfulness Enhanced Summarization model (FES). To prevent the intrinsic error problem, we design FES in a multi-task learning paradigm, i.e., completing encoding-decoding for the summarization task with an auxiliary QA-based faithfulness evaluation task. The QA task poses an additional reasoning requirement on the encoder, which must develop a more comprehensive understanding of the key semantic meanings of the input document and learn better representations than when working only for summarization. The QA attention on the key entities of the input can also be used to align the decoder state with the encoder outputs for generating a faithful summary. To address the extrinsic error problem, we propose a max-margin loss to prevent the LM from being overconfident. Concretely, we define an indicator of the degree of overconfidence of the LM. The risk of outputting extrinsic-error tokens with low prediction probabilities is mitigated by minimizing this overconfidence indicator.

We validate the effectiveness of our FES model by conducting extensive experiments on the public benchmark CNN/DM [12] and XSum [13] datasets. Experimental results demonstrate that our faithfulness-enhanced summarization model achieves superior ROUGE scores and improves the faithfulness of news summarization over several strong baselines.

Our main contributions can be summarized as follows. (1) We propose a faithfulness-enhanced summarization model, which alleviates the unfaithfulness problem from both the encoder side and the decoder side. (2) Concretely, we propose a multi-task framework that enhances summarization performance through automatic QA tasks.
We also propose a max-margin loss to control the overconfidence problem of the LM. (3) Experimental results demonstrate that our proposed approach brings substantial improvements over the most recent baselines on benchmark datasets, and also improves the faithfulness of the generated summaries.

## 2 Related Work

**Abstractive Summarization.** In recent years, research on text generation has made impressive progress [14, 15], which has promoted progress in abstractive summarization. The abstractive summarization task generates novel words and phrases not featured in the source text to capture the salient ideas of the source text [16]. Most works apply an encoder-decoder architecture to implicitly learn the summarization procedure [17, 18]. More recently, applying pretrained language models as the encoder [4, 19] or pre-training the generation process on a large-scale unlabeled corpus [20, 21] has brought significant improvements. Explicit structure modeling has also been shown to be effective in summarization tasks. For example, Jin et al. [22] incorporated semantic dependency graphs to help generate sentences with better semantic relevance, and Wu et al. [9] came up with a unified semantic graph to aggregate relevant disjoint context from the input.

**Fact Consistency for Abstractive Summarization.** Producing a summary that is entailed by the information presented in the source document is a key challenge in the summarization task, and comparatively little progress has been made on it. Pioneering works [23, 24] incorporated fact descriptions or entailment knowledge to enhance faithfulness. More recently, Zhu et al. [25] modeled the facts in the source article with knowledge graphs based on a graph neural network. Cao et al. [10] proposed to leverage reference summaries as positive training data and erroneous summaries as negative data, to train summarization systems that are better at distinguishing between them. Aralikatte et al.
[26] introduced a focus attention mechanism to encourage decoders to proactively generate tokens that are similar or topical to the input document. In contrast, other works post-edit the generated summaries. Different from previous works, we enhance the semantic understanding of the document with faithfulness evaluation as a direct signal and prevent the overconfidence of the LM.

Figure 1: Comparison of (a) the existing QA-based faithfulness evaluation model and (b) our faithfulness-enhanced summarization model. The QA task integrated in our model provides an auxiliary supervision signal for understanding the document during training and enhances the faithfulness of the generated summary.

**Multi-task Learning.** Multi-task learning is a learning paradigm in machine learning that aims to leverage useful information contained in multiple related tasks to improve the generalization performance of all the tasks [27]. Many natural language processing tasks have been formulated as multi-task learning problems, such as word segmentation, POS tagging, dependency parsing, and text classification [28, 29, 30, 31]. In this work, we apply multi-task learning to the summarization and question-answering tasks for faithfulness enhancement.

## 3 Methodology

### 3.1 Problem Formulation

For an input document $X = \{x_1, \ldots, x_{n_x}\}$, we assume there is a ground-truth summary $Y = \{y_1, \ldots, y_{n_y}\}$. In our faithfulness-enhanced setting, $n_q$ questions $Q = \{Q_1, \ldots, Q_{n_q}\}$ with corresponding answers $A = \{A_1, \ldots, A_{n_q}\}$ are also attached to $X$. In the training process, our model is given QA pairs and document-summary pairs.
It tries to extract the answers $A$ to the questions and generate the summary $Y$. At test time, our model is given the document $X$ and the questions $Q$, and predicts both the answers and the summary. The final goal is to generate a summary that is not only informative but also consistent with document $X$.

In the following, we introduce our proposed Faithfulness Enhanced Summarization model, which is generally built on the Transformer [32]. The faithfulness enhancement is implemented from three aspects. (1) Multi-task encoder: it improves the semantic understanding of the input document by examining the quality of the encoded document representations on an auxiliary QA task; the encoded representation thus captures the key inputs for making a faithful summary. (2) QA attention-enhanced decoder: the attention from the multi-task encoder aligns the decoder with the encoder so that the decoder can fetch more accurate input information to generate the summary. (3) Max-margin loss: this loss is orthogonal to the generation loss; it measures the accuracy of the LM and prevents it from being overconfident during generation.

### 3.2 Multi-task Encoder

Figure 2: Multi-task encoder (a Transformer encoder with entity nodes, sentence nodes, and question nodes).

The multi-task encoder is designed to encode the input document for both summarization and question answering in an integrated training process, as shown in Figure 1(b). This differs from previous work that uses QA in the post-generation stage to evaluate the faithfulness of generated summaries [8, 7], as shown in Figure 1(a). We bring QA closer to the encoder instead of leaving it for the post-generated summary, and train the encoder to accomplish the QA and summarization tasks at the same time. This integrated training of the multi-task encoder includes faithfulness as an optimization objective, in addition to summary generation quality.
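As a minimal illustration of this integrated setup, the sketch below shows one training instance and a joint objective combining the summarization loss with the auxiliary QA and max-margin terms. The dataclass, the additive weighting, and the weights `lam_qa` and `lam_margin` are our own illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FESExample:
    """One training instance: a document, its reference summary,
    and auxiliary QA pairs whose answers are entities in the document."""
    document: List[str]   # tokens x_1 .. x_{n_x}
    summary: List[str]    # tokens y_1 .. y_{n_y} (training only)
    questions: List[str]  # Q_1 .. Q_{n_q}
    answers: List[str]    # A_1 .. A_{n_q}, entities from the document

def joint_objective(gen_loss: float, qa_loss: float, margin_loss: float,
                    lam_qa: float = 1.0, lam_margin: float = 1.0) -> float:
    """Multi-task objective: summarization loss plus weighted QA and
    max-margin terms (the additive form and weights are assumptions)."""
    return gen_loss + lam_qa * qa_loss + lam_margin * margin_loss
```

At test time only `document` and `questions` are available, and the model predicts both `answers` and `summary`.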
The answers are key entities from the document, so the QA pairs focus on key information in the input. As shown in Figure 2, we first apply the classic Transformer architecture to obtain token-level representations for the document and questions, denoted as $H_w \in \mathbb{R}^{n_w \times d_e}$ and $H_u \in \mathbb{R}^{n_q \times t_q \times d_e}$, where $n_w$ is the total number of tokens in the document, $n_q$ is the number of questions, $t_q$ is the number of tokens in a question, and $d_e$ is the feature dimension. Then, we design the encoder to understand the questions and the input document at the entity and sentence levels.

**Encoding at Multi-level Granularity.** We build the encoder by organizing representation learning at different granularity levels. We use entities as the basic semantic unit, since they contain compact and salient information across the document, and the reading comprehension questions focus on entities. Since a question is usually short, we create one node for each question. We add bidirectional edges from question nodes to sentence nodes, and from sentence nodes to entity nodes. These nodes act as intermediaries between sentences and enrich the cross-sentence relations. Because the initial directed edges are insufficient for learning backward information, we add reverse edges and self-loop edges to the graph, following previous works [33]. We initialize node representations following the token-level and word-span-level mean-pooling process [9]. Given the constructed graph with node features, we use graph attention networks [34] to update the representations of our semantic nodes. We refer to $h_i \in \mathbb{R}^{d_e}$, $i \in \{1, \ldots, n_e + n_s + n_q\}$, as the hidden states of the input nodes, where $n_e$ and $n_s$ are the numbers of entity nodes and sentence nodes, respectively.
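The graph construction above can be sketched in a few lines of Python. The node indexing as [entities | sentences | questions], the function name, and the two input mappings are our own assumptions for illustration; the edges follow the text, with bidirectionality subsuming the reverse edges.

```python
def build_graph(n_ent, n_sent, n_q, sent_of_entity, sents_of_question):
    """Adjacency over nodes indexed [0..n_ent) for entities,
    [n_ent..n_ent+n_sent) for sentences, then question nodes.

    sent_of_entity: entity index -> index of the sentence containing it
    sents_of_question: question index -> sentence indices it touches
    (both mappings are hypothetical inputs for this sketch).
    """
    n = n_ent + n_sent + n_q
    adj = {i: set() for i in range(n)}

    def connect(a, b):  # bidirectional edge == forward + reverse edge
        adj[a].add(b)
        adj[b].add(a)

    for e, s in sent_of_entity.items():         # sentence <-> entity
        connect(n_ent + s, e)
    for q, sents in sents_of_question.items():  # question <-> sentence
        for s in sents:
            connect(n_ent + n_sent + q, n_ent + s)
    for i in range(n):                          # self-loop edges
        adj[i].add(i)
    return adj
```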
The graph attention (GAT) layer is designed as follows:

$$z_{ij} = \mathrm{LeakyReLU}\big(W_a [W_b h_i ; W_c h_j]\big), \qquad \alpha_{ij} = \frac{\exp(z_{ij})}{\sum_{l \in \mathcal{N}_i} \exp(z_{il})}, \qquad l_i = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W_d h_j\Big),$$

where $\mathcal{N}_i$ is the set of neighboring nodes of node $i$, $W_a$, $W_b$, $W_c$, $W_d$ are trainable weights, and $\alpha_{ij}$ is the attention weight between $h_i$ and $h_j$. Besides, we add a residual connection to avoid vanishing gradients after several iterations: $h_i = h_i + l_i$. We iteratively apply the above GAT layer and a position-wise feed-forward layer [32] to update each node representation. The output entity, sentence, and question feature matrices are denoted as $H_e \in \mathbb{R}^{n_e \times d_e}$, $H_s \in \mathbb{R}^{n_s \times d_e}$, and $H_q \in \mathbb{R}^{n_q \times d_e}$, respectively.

**Answer Selector for the QA task.** After fusing information from the question and the document, we can select entities from the document as the answer to each question. Concretely, we apply multi-head cross attention (MHAtt) between the question and the entities from the graph, $h^i_{qe} = \mathrm{MHAtt}(h^i_e, H_q, H_q)$, to obtain question-aware entity representations, where $i$ is the question index. Based on the question-aware entity representations, we employ a feed-forward network (FFN) to generate the entity extraction probabilities $A^i = \mathrm{FFN}(h^i_{qe})$, where $A^i = (a^i_1, \ldots, a^i_{n_e})$. The QA objective is to maximize the likelihood of all ground-truth entity labels $\hat{a}$:

$$\mathcal{L}_c = \sum_{i=1}^{n_q} \sum_{j=1}^{n_e} P\big(\hat{a}^i_j\big). \quad (1)$$

### 3.3 QA Attention-enhanced Decoder

A faithful decoder needs to attend to and fetch the important content from the encoder instead of mixing up the inputs. We observe from Section 3.2 that the QA attention on the key entities can be regarded as an importance signal indicating which entities should be included in the summary. Hence, we propose a summary generator enhanced by QA attention. Generally, the decoder state attends to the encoder states with entities as intermediates, where the entity-level attention is guided by the QA attention.
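As a concrete reference for the encoder's GAT layer defined above, here is a minimal pure-Python sketch of one update with the residual connection. It is a single attention head, treats $W_a$ as a vector so that the score $z_{ij}$ is a scalar, and omits the position-wise feed-forward sublayer; these simplifications are ours.

```python
import math

def matvec(W, v):
    """Matrix-vector product over plain Python lists."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in W]

def leaky_relu(x, slope=0.01):
    return x if x >= 0 else slope * x

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gat_layer(H, adj, Wa, Wb, Wc, Wd):
    """One GAT update: z_ij = LeakyReLU(Wa [Wb h_i ; Wc h_j]),
    alpha_ij = softmax over neighbors N_i, l_i = sigmoid(sum alpha_ij Wd h_j),
    followed by the residual h_i <- h_i + l_i."""
    out = []
    for i, hi in enumerate(H):
        zs = []
        for j in adj[i]:
            concat = matvec(Wb, hi) + matvec(Wc, H[j])  # [Wb h_i ; Wc h_j]
            zs.append(leaky_relu(sum(a * c for a, c in zip(Wa, concat))))
        m = max(zs)                                     # stable softmax
        exps = [math.exp(z - m) for z in zs]
        denom = sum(exps)
        alphas = [e / denom for e in exps]
        agg = [0.0] * len(hi)
        for a, j in zip(alphas, adj[i]):                # sum alpha_ij Wd h_j
            wh = matvec(Wd, H[j])
            agg = [g + a * w for g, w in zip(agg, wh)]
        li = [sigmoid(g) for g in agg]
        out.append([h + l for h, l in zip(hi, li)])     # residual connection
    return out
```

Stacking this update with a feed-forward layer, as the text describes, yields the final entity, sentence, and question feature matrices.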
Concretely, in each layer at the $t$-th decoding step, we apply self-attention to the masked summary embeddings $E$, obtaining $u_t$. The masking mechanism ensures that the prediction at position $t$ depends only on the outputs before $t$. Based on $u_t$, we then compute the cross-attention scores $c^e_t$ over the entities. ut = MHAtt (et, E