# Reader-Aware Multi-Document Summarization via Sparse Coding

Piji Li, Lidong Bing, Wai Lam, Hang Li and Yi Liao

Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong
Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA, USA
Noah's Ark Lab, Huawei Technologies, Hong Kong
{pjli, wlam, yliao}@se.cuhk.edu.hk, lbing@cs.cmu.edu, hangli.hl@huawei.com

*The work described in this paper is substantially supported by grants from the Research and Development Grant of Huawei Technologies Co. Ltd (YB2013090068/TH138232) and the Research Grants Council of the Hong Kong Special Administrative Region, China (Project Codes: 413510 and 14203414).*

## Abstract

We propose a new MDS paradigm called reader-aware multi-document summarization (RA-MDS). Specifically, a set of reader comments associated with the news reports is also collected. The summaries generated from the reports for the event should be salient according to not only the reports but also the reader comments. To tackle this RA-MDS problem, we propose a sparse-coding-based method that calculates the salience of text units by jointly considering news reports and reader comments. Another reader-aware characteristic of our framework is to improve linguistic quality via entity rewriting. The rewriting consideration is jointly assessed together with other summarization requirements under a unified optimization model. To support the generation of compressive summaries via optimization, we explore a finer syntactic unit, namely, the noun/verb phrase. In this work, we also construct a data set for conducting RA-MDS. Extensive experiments on this data set and some classical data sets demonstrate the effectiveness of our proposed approach.

## 1 Introduction

In the typical multi-document summarization (MDS) setting, the input is a set of documents/reports about the same topic/event. Reports on the same event normally cover many aspects, and continuous follow-up reports bring in more information about it. Therefore, it is very challenging to generate a short and salient summary for an event. MDS has drawn considerable attention and a number of methods have been proposed. For example, Wan et al. [2007] proposed an extraction-based approach that employs a manifold ranking method to calculate the salience of each sentence. Filatova and Hatzivassiloglou [2004] modeled the MDS task as an instance of the maximum coverage set problem. Gillick and Favre [2009] developed an exact solution for a model similar to [Filatova and Hatzivassiloglou, 2004] based on the weighted sum of concepts (approximated by bigrams). [Li et al., 2013] proposed a guided sentence compression framework to generate compressive summaries by training a conditional random field (CRF) on an annotated corpus. [Li et al., 2014] considered linguistic quality in their framework. [Ng et al., 2014] exploited timelines to enhance MDS. Moreover, many works [Liu et al., 2012; Kågebäck et al., 2014; Denil et al., 2014; Cao et al., 2015] utilized deep learning techniques to tackle summarization tasks. As more and more user-generated content becomes available, one natural extension of the setting is to incorporate such content regarding the event so as to directly or indirectly improve the generated summaries with greater user satisfaction. In this paper, we investigate a new setting in this direction.
Specifically, a set of reader comments associated with the news reports is also collected. The summaries generated from the reports for the event should be salient according to not only the reports but also the reader comments. We name this extended paradigm reader-aware multi-document summarization (RA-MDS). We give a real example, taken from a data set collected by us, to illustrate the importance of RA-MDS. One hot event in 2014 was "Malaysia Airlines jet MH370 disappeared". After the outbreak of this event, a large number of reports were posted on different news media. Most existing summarization systems can only create summaries with general information, e.g., "Flight MH370, carrying 227 passengers and 12 crew members, vanished early Saturday after departing Kuala Lumpur for Beijing", due to the fact that they extract information solely from the report content. However, after analyzing the reader comments, we find that many readers are interested in more specific aspects, such as "Military radar indicated that the plane may have turned from its flight route before losing contact" and "Two passengers appear to have used stolen European passports to board". Under the RA-MDS setting, one should jointly consider news and comments when generating the summary, so that the summary content covers not only important aspects of the event, but also aspects that attract reader interest as reflected in the comments.

No previous work has investigated how to incorporate comments into the MDS problem. One challenge is how to conduct salience calculation by jointly considering the focus of news reports and the reader interests revealed by comments. Meanwhile, the model should not be sensitive to the availability of diverse aspects in the reader comments. Another challenge is that reader comments are very noisy, both grammatically and in content. Some previous works explore the effect of comments or social contexts in single-document summarization (such as blog summarization) [Hu et al., 2008; Yang et al., 2011]. However, the problem setting of RA-MDS is more challenging, because the considered comments are about an event with multiple reports spanning a time period, resulting in diverse and noisy comments.

To tackle the above challenges, we propose a sparse-coding-based method that calculates the salience of text units by jointly considering news reports and reader comments. Intuitively, the nature of summarization is to select a small number of semantic units to reconstruct the original semantic space of the whole topic. In our RA-MDS setting, the semantic space incorporates both the news and the reader comments. The selected semantic units are sparse and possess the semantic diversity property. One issue is then how to find these sparse and diverse semantic units efficiently without supervised training data. Sparse coding is a suitable method for learning sets of over-complete bases to represent data efficiently, and it has been demonstrated to be very useful in computer vision [Mairal et al., 2014]. Moreover, sparse coding can jointly consider news and comments to select semantic units in a very simple and elegant way, by just adding a comment reconstruction error term to the original loss function. Currently, only a few works employ sparse coding for the summarization task. DSDR [He et al., 2012] represents each sentence as a non-negative linear combination of summary sentences.
But this method does not consider sparsity. MDS-Sparse [Liu et al., 2015] proposed a two-level sparse representation model that considers coverage, sparsity, and diversity, but its results do not show a significant improvement. In this paper, we propose a more efficient and direct sparse model to tackle these problems, and we achieve encouraging results on different data sets.

Another reader-aware characteristic of our framework is to improve linguistic quality via entity rewriting. Summaries may contain phrases that are not understandable out of context, since sentences compiled from different documents might contain too little, too much, or repeated information about a referent. A human summary writer uses the full-form mention of an entity (e.g., President Barack Obama) only once, and uses a short-form mention (e.g., Obama) everywhere else. Analogously, for a particular entity, our framework requires that the full-form mention appear only once in the summary and that its other appearances use the most concise form. Some early works perform rewriting along with the greedy selection of individual sentences [Nenkova, 2008]. Some other works perform summary rewriting as a post-processing step [Siddharthan et al., 2011]. In contrast with such works, the rewriting consideration in our framework is jointly assessed together with other summarization requirements under a unified optimization model. This brings two advantages. First, the assessment of the rewriting operation is jointly considered with the generation of the compressive summary, so it has a global view and can generate better rewriting results. Second, we can make full use of the length limit, because the effect of a rewriting operation on summary length is considered simultaneously with the other constraints in the model.

To support the generation of compressive summaries via optimization, we explore a finer syntactic unit, namely, the noun/verb phrase. Precisely, we first decompose the sentences into noun/verb phrases, and the salience of each phrase is calculated by jointly considering its importance in the reports and the comments. In this work, we also construct a data set for conducting RA-MDS. Extensive experiments on our data set and some benchmark data sets have been conducted to examine the efficacy of our framework.

## 2 Description of the Proposed Framework

### 2.1 Overview

To tackle the RA-MDS problem, we propose an unsupervised compressive summarization framework, depicted in Fig. 1.

*Figure 1: Our RA-MDS framework (news sentences and comment sentences feed the news reconstruction and the comment partial-reconstruction; NPs and VPs feed summary generation and compression).*

A sparse-coding-based method is proposed to reconstruct the semantic space of a topic, revealed by both the news sentences (i.e., the $x_i$'s) and the comment sentences (i.e., the $z_i$'s) posted on the news. From this, an expressiveness score $a_i$ is derived for each news sentence. The dashed boxes around the comment sentences in Fig. 1 indicate that a special treatment is applied to comments to suppress noise in the reconstruction; the details are introduced in Section 2.2. Compression is carried out by deleting the unimportant constituents, i.e., phrases, of an input sentence. We first decompose each sentence into noun phrases (NPs) and verb phrases (VPs). The salience of a phrase depends on two criteria, namely, the expressiveness score inherited from the sentence and the concept score of the phrase. The extraction of phrases and the calculation of phrase salience are introduced in Section 2.3. Our framework carries out mention rewriting for entities to improve the linguistic quality of the summary. Specifically, we rewrite the mentions of three types of named entities, namely, person, location, and organization. We discuss the details of mention detection, mention cluster merging, and short-form and full-form mention finding in Section 2.4. After the above preparation steps, we introduce our summarization model in Section 2.5. The model simultaneously performs sentence compression and mention rewriting via a unified optimization method, while a variety of summarization requirements are formulated as constraints.
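The overview refers to news sentences $x_i$ and comment sentences $z_i$; concretely, these are weighted term-frequency vectors over a unigram/bigram dictionary built from the news, as detailed at the start of Section 2.2. The sketch below shows one plausible way to construct them; the use of scikit-learn, the raw counts standing in for weighted term frequencies, and the function name itself are our illustrative choices rather than anything prescribed by the paper (stemming is omitted for brevity).

```python
from sklearn.feature_extraction.text import CountVectorizer

def build_term_vectors(news_sentences, comment_sentences):
    """Return X (d x m) and Z (d x n) term-frequency matrices.

    The dictionary uses unigrams and bigrams from the news only, so
    any term that appears solely in comments is dropped automatically,
    in line with the noise-reduction step described in Section 2.2.
    """
    # stop_words handles stop-word removal; raw counts stand in for
    # the paper's weighted term frequencies.
    vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
    X = vectorizer.fit_transform(news_sentences).T   # fit on news only
    Z = vectorizer.transform(comment_sentences).T    # reuse news dictionary
    return X.toarray(), Z.toarray()
```

Fitting the vectorizer on the news only means that terms occurring solely in comments never enter the dictionary, which matches the noise-reduction step described in Section 2.2.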
### 2.2 Reader-Aware Sentence Expressiveness

Intuitively, the nature of summarization is to select semantic units which can be used to reconstruct the original semantic space of the topic. The expressiveness score of a news sentence is defined as its contribution to constructing the semantic space of the topic from both the news content and the reader comments. Expressiveness therefore conveys the attention that a sentence attracts from both the news writers and the readers. We propose a sparse coding model to compute such expressiveness scores.

In typical sparse coding, the aim is to find a set of $k$ basis vectors $\phi_j$ which can be used to reconstruct $m$ target/input vectors $x_i$ as linear combinations, by minimizing the following loss function:

$$\sum_{i=1}^{m}\Big\|x_i-\sum_{j=1}^{k}a_{ij}\phi_j\Big\|_2^2+\lambda\sum_{i=1}^{m}\sum_{j=1}^{k}S(a_{ij}) \tag{1}$$

where $S(\cdot)$ is a sparsity cost function which penalizes $a_{ij}$ for being far from zero.

In our summarization task, each topic contains a set of news reports and a set of reader comments. After stemming and stop-word removal, we build a dictionary for the topic using unigrams and bigrams from the news. Each sentence of the news and the comments is then represented as a weighted term-frequency vector. Let $X=\{x_1,x_2,\dots,x_m\}$ and $Z=\{z_1,z_2,\dots,z_n\}$ denote the vectors of sentences from the news and the comments respectively, where $x_i\in\mathbb{R}^d$ and $z_i\in\mathbb{R}^d$ are term-frequency vectors. There are $d$ terms in the dictionary, $m$ sentences in the news, and $n$ sentences in the comments for each topic. Here we take sentences as the semantic units, and assume that each sentence $x_i$ has a coefficient variable $a_i$, named its expressiveness score, representing the contribution of this sentence to the reconstruction. In the spirit of sparse coding, we directly regard each news sentence $x_i$ as a candidate basis vector, and all the $x_i$'s are employed to reconstruct the semantic space of the topic, including both $X$ and $Z$. Thus we propose a preliminary error formulation, which we aim to minimize:

$$\min_{A}\ \frac{1}{2m}\sum_{i=1}^{m}\Big\|x_i-\sum_{j=1}^{m}a_j x_j\Big\|_2^2+\frac{1}{2n}\sum_{i=1}^{n}\Big\|z_i-\sum_{j=1}^{m}a_j x_j\Big\|_2^2 \tag{2}$$

where the coefficients $a_j$ are the expressiveness scores, and all the target vectors share the same coefficient vector $A$.
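To make Eq. 2 concrete before refining it, here is a minimal NumPy sketch, our own illustration rather than the authors' code, that evaluates the preliminary objective for a candidate score vector $A$, with `X` and `Z` holding the term vectors above as columns:

```python
import numpy as np

def preliminary_loss(A, X, Z):
    """Evaluate Eq. 2: every news vector x_i and comment vector z_i is
    compared against one shared reconstruction sum_j a_j * x_j.

    X: d x m matrix of news term vectors (columns).
    Z: d x n matrix of comment term vectors (columns).
    A: length-m vector of candidate expressiveness scores.
    """
    m, n = X.shape[1], Z.shape[1]
    x_hat = X @ A  # shared reconstruction, shape (d,)
    news_err = ((X - x_hat[:, None]) ** 2).sum() / (2 * m)
    comment_err = ((Z - x_hat[:, None]) ** 2).sum() / (2 * n)
    return news_err + comment_err
```

Minimizing this quantity alone would neither produce sparse scores nor distinguish reliable sentences from noisy ones, which motivates the three refinements that follow.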
To harness the characteristics of the summarization problem setting more effectively, we refine the preliminary error formulation in Eq. 2 along three directions. (1) As mentioned before, the original sentence vector space can be reconstructed by a subset of the sentences, i.e., the number of summary sentences is sparse, so we put a sparsity constraint on the coefficient vector $A$ using the L1-norm $\lambda\|A\|_1$, with the weight $\lambda$ as a scaling constant that determines its relative importance. Moreover, we only consider non-negative linear reconstruction in our framework, so we add non-negativity constraints on the coefficients. (2) As previous work [Ng et al., 2011] has noted, some prior knowledge, e.g., sentence position, can benefit sentence expressiveness detection. We therefore add a variable $\rho_i$ to weight each news-sentence reconstruction error, and employ position information to generate $\rho$:

$$\rho_i=\begin{cases}C^{p}, & \text{if } p<\hat{p}\\ C^{\hat{p}}, & \text{otherwise}\end{cases} \tag{3}$$

where $p$ is the paragraph ID of the sentence in its document (starting from 0), $\hat{p}$ is a cut-off paragraph ID, and $C$ is a positive constant smaller than 1. (3) Besides useful information, comments usually introduce a lot of noise. To tackle this problem, we first eliminate terms that appear only in comments; we then add a parameter $\tau_i$ to control the comment-sentence reconstruction error. Since the semantic units of the generated summary all come from the news, a comment sentence intuitively introduces more information if it is more similar to the news. Therefore, we employ the mean cosine similarity between a comment sentence $z_i$ and all the news sentences in $X$ as the weight $\tau_i$. After the above considerations, we obtain the global loss function:

$$J=\min_{A}\ \frac{1}{2m}\sum_{i=1}^{m}\rho_i\Big\|x_i-\sum_{j=1}^{m}a_j x_j\Big\|_2^2+\frac{1}{2n}\sum_{i=1}^{n}\tau_i\Big\|z_i-\sum_{j=1}^{m}a_j x_j\Big\|_2^2+\lambda\|A\|_1 \tag{4}$$

$$\text{s.t. } a_j\ge 0 \text{ for } j\in\{1,\dots,m\},\ \lambda>0.$$

For the optimization problem of sparse coding, many classical algorithms already exist [Mairal et al., 2014]. In this paper, we utilize the coordinate descent method shown in Algorithm 1. Under the iterative updating rule in Eq. 7, the objective function $J$ is non-increasing, and the convergence of the iteration is guaranteed.

**Algorithm 1** Coordinate descent algorithm for sentence expressiveness detection

**Input:** news sentences $X\in\mathbb{R}^{d\times m}$, comment sentences $Z\in\mathbb{R}^{d\times n}$, news reconstruction weights $\rho_i$, comment reconstruction weights $\tau_i$, penalty parameter $\lambda$, and stopping criteria $T$ and $\varepsilon$.
**Output:** expressiveness vector $A\in\mathbb{R}^{m}$.

1. Initialize $A\leftarrow \mathbf{0}$, $t\leftarrow 0$;
2. **while** $t<T$ and $J^{t}_{\varepsilon}>\varepsilon$ **do**
3. &emsp;reconstruct: $\hat{x}=\sum_{j=1}^{m}a_j x_j$;
4. &emsp;take the partial derivatives of the two reconstruction error terms: $\frac{\partial J_X}{\partial a_k}=-\frac{1}{m}\sum_{i=1}^{m}\rho_i(x_i-\hat{x})^{T}x_k$ (5) and $\frac{\partial J_Z}{\partial a_k}=-\frac{1}{n}\sum_{i=1}^{n}\tau_i(z_i-\hat{x})^{T}x_k$ (6);
5. &emsp;select the coordinate with the maximum partial derivative magnitude: $\hat{k}=\arg\max_{k=1\ldots m}\big|\frac{\partial J}{\partial a_k}\big|$;
6. &emsp;update the coordinate by soft-thresholding [Donoho and Johnstone, 1994]: $a^{t+1}_{\hat{k}}\leftarrow S_{\lambda}\big(a^{t}_{\hat{k}}-\eta\,\frac{\partial J}{\partial a_{\hat{k}}}\big)$ (7), where $S_{\lambda}: a_i\mapsto \mathrm{sign}(a_i)\,\max(|a_i|-\lambda,0)$;
7. &emsp;$J^{t}_{\varepsilon}\leftarrow \big|J_{A^{t+1}}-J_{A^{t}}\big|$, $t\leftarrow t+1$;
8. **end while**
9. **return** $A$.

Our sparse coding model has several advantages. First, sparse coding is an unsupervised method, so no manually annotated training data is needed. Second, the optimization procedure is modular, making it easy to plug in different loss functions. Third, our model incorporates semantic diversity naturally, as mentioned in [He et al., 2012]. Last but not least, it helps the subsequent unified optimization component which generates compressive summaries: in particular, it reduces the number of variables, because the sparsity constraint yields sparse expressiveness scores, i.e., most sentences get a score of 0.
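For concreteness, the following NumPy sketch renders Algorithm 1; it is our own reading of the pseudo-code, not the authors' implementation, and the step size `eta`, the default hyper-parameters, and the absolute-difference stopping test are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, lam):
    # S_lambda(v) = sign(v) * max(|v| - lam, 0)  [Donoho and Johnstone, 1994]
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def global_loss(A, X, Z, rho, tau, lam):
    """Eq. 4: weighted reconstruction errors plus the L1 sparsity penalty."""
    m, n = X.shape[1], Z.shape[1]
    x_hat = X @ A
    j = (rho * ((X - x_hat[:, None]) ** 2).sum(axis=0)).sum() / (2 * m)
    j += (tau * ((Z - x_hat[:, None]) ** 2).sum(axis=0)).sum() / (2 * n)
    return j + lam * np.abs(A).sum()

def sentence_expressiveness(X, Z, rho, tau, lam=0.01, eta=0.1, T=500, eps=1e-6):
    """Coordinate descent for Eq. 4 along the lines of Algorithm 1."""
    m, n = X.shape[1], Z.shape[1]
    A = np.zeros(m)
    prev = global_loss(A, X, Z, rho, tau, lam)
    for _ in range(T):
        x_hat = X @ A
        # Step 4: partial derivatives of the reconstruction error terms.
        r = ((X - x_hat[:, None]) * rho).sum(axis=1)  # weighted news residual
        c = ((Z - x_hat[:, None]) * tau).sum(axis=1)  # weighted comment residual
        grad = -(X.T @ r) / m - (X.T @ c) / n
        # Step 5: pick the steepest coordinate.
        k = int(np.argmax(np.abs(grad)))
        # Step 6: gradient step plus soft-thresholding; the outer max
        # with 0 enforces the non-negativity constraint a_j >= 0.
        A[k] = max(soft_threshold(A[k] - eta * grad[k], lam), 0.0)
        # Step 7: stop once the objective barely changes.
        cur = global_loss(A, X, Z, rho, tau, lam)
        if abs(prev - cur) < eps:
            break
        prev = cur
    return A
```

Because of the soft-thresholding in step 6, most entries of the returned vector are exactly zero, which is precisely the sparsity that lets the unified optimization stage of Section 2.5 work with far fewer variables.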
### 2.3 Phrase Extraction and Salience Calculation

We employ the Stanford parser [Klein and Manning, 2003] to obtain a constituency tree for each input sentence. We then extract NPs and VPs from the tree as follows: (1) NPs and VPs that are direct children of the S node are extracted. (2) VPs (NPs) on a path whose nodes are all VPs (NPs) are also recursively extracted and regarded as having the same parent node S. The recursive operation in the second step is only carried out for two levels, since phrases at lower levels may not be able to convey a complete fact.

Take the tree in Fig. 2 as an example: the corresponding sentence is decomposed into the phrases "An armed man", "walked into an Amish school, sent the boys outside and tied up and shot the girls, killing three of them", "walked into an Amish school", "sent the boys outside", and "tied up and shot the girls, killing three of them".¹

*Figure 2: The constituency tree of a sentence.*

The salience of a phrase depends on two criteria. The first criterion is the expressiveness score, which is inherited from the corresponding sentence in the output of our sparse coding model. The second criterion is the concept score, which conveys the overall importance of the individual concepts in the phrase. Let $tf(t)$ be the frequency of the term $t$ (unigram/bigram) in the whole topic. The salience $S_i$ of the phrase $P_i$ is defined as:

$$S_i=\frac{\sum_{t\in P_i} tf(t)}{\sum_{t\in \text{Topic}} tf(t)}\times a_i, \tag{8}$$

where $a_i$ is the expressiveness of the sentence containing $P_i$.

¹ Because of the recursive operation, the extracted phrases may overlap. Later, we will show how to avoid such overlaps in phrase extraction. We only apply the recursive operation to a VP with more than one parallel sub-VP, such as the highest VP in Fig. 2. Sub-VPs following modal, linking, or auxiliary verbs are not extracted as individual VPs. In addition, we also extract clauses functioning as sentence subjects as NPs, such as "that" clauses. Note that we refer to such clauses as noun phrases, although their labels in the tree could be SBAR or S.

### 2.4 Preparation of Entity Mentions for Rewriting

We first conduct co-reference resolution for each document using the Stanford co-reference resolution package [Lee et al., 2013]. We adopt those resolution rules that are able to achieve high quality and address our needs for summarization; in particular, Sieves 1, 2, 3, 4, 5, 9, and 10 in the package are employed. A set of clusters is obtained, and each cluster contains the mentions corresponding to the same entity in a document. The clusters from different documents in the same topic are merged by matching the named entities. Three types of entities are considered, namely, person, location, and organization. Let $M$ denote the mention cluster of an entity. The full-form mention $m_f$ is determined as:

$$m_f=\arg\max_{m\in M}\sum_{t\in m} tf(t) \tag{9}$$

where $tf(t)$ is calculated within $M$. We do not simply select the longest mention since it could be too verbose. The short-form mention $m_s$ is determined as:

$$m_s=\arg\max_{m\in M'}\sum_{t\in m} tf(t) \tag{10}$$

where $M'$ contains the mentions in $M$ that are the shortest and are not pronouns.

### 2.5 Unified Optimization Framework

The objective function of our optimization formulation is defined as: