# Compressive Document Summarization via Sparse Optimization

Jin-ge Yao, Xiaojun Wan, Jianguo Xiao
Institute of Computer Science and Technology, Peking University, Beijing 100871, China
Key Laboratory of Computational Linguistics (Peking University), MOE, China
{yaojinge, wanxiaojun, xiaojianguo}@pku.edu.cn

## Abstract

In this paper, we formulate a sparse optimization framework for extractive document summarization. The proposed framework has a decomposable convex objective function, and we derive an efficient ADMM algorithm to solve it. To encourage diversity in the summaries, we explicitly introduce an additional sentence dissimilarity term into the optimization framework. We achieve significant improvements over previous related work under a similar data reconstruction framework. We then generalize our formulation to the case of compressive summarization and derive a block coordinate descent algorithm to optimize the objective function. Performance on the DUC 2006 and DUC 2007 datasets shows that our compressive summarization results are competitive with the state of the art while maintaining reasonable readability.

## 1 Introduction

Automatic document summarization is a seminal task in the field of text mining and information retrieval. Modern approaches to automatic summarization can be roughly divided into two categories: extractive summarization and abstractive summarization. Extractive summarization forms summaries by selecting sentences from the original documents. Abstractive approaches allow more complex operations on sentences, including deletion, substitution, and reordering. Extractive approaches are rather limited in terms of the final summaries they can produce. However, abstractive summarization is far more difficult than extractive summarization, which does not need to ensure structural and semantic coherence within a sentence. Up to now, extractive methods remain the most popular in summarization tasks.

Almost all existing extractive summarization approaches use ranking models to score and select sentences. To overcome the problem of redundancy, which has not been well addressed in previous approaches, summarization based on data reconstruction has recently been proposed [He et al., 2012]. The intuition is that a good summary should consist of those sentences that can best reconstruct the original document. Unfortunately, although the mathematical idea behind this framework is reasonable, it fails to achieve satisfactory performance. Meanwhile, the accompanying gradient descent algorithm turns out to be slow in practice, which may further limit the generalizability of this paradigm.

In this paper, we formulate the document summarization problem as convex sparse optimization, with an idea similar to data reconstruction but a different methodology. The proposed objective function has a decomposable form, which makes it suitable for efficient solution by modern convex optimization algorithms. With some direct but nontrivial modifications, we can also generalize the proposed formulation to the case of compressive summarization. Keeping the same spirit of sparse optimization, we explore a framework that jointly optimizes sentence selection and word selection. Grammaticality of compressed sentences is ensured by considering dependency relations during a final generation step. Our methods do not even need full constituent parse trees to generate grammatical compressions.
The major contributions of this work include:

- We formulate document summarization as a decomposable row-sparsity regularized optimization problem, and present an efficient alternating direction method of multipliers (ADMM) to solve it.
- Inspired by recent research on exemplar-based data representation, we introduce an additional sentence dissimilarity term to encourage diversity among summary sentences.
- We generalize our extractive framework to compressive summarization. The resulting sparse optimization problem is jointly non-convex, so we derive a block coordinate descent algorithm to solve it, followed by a recursive sentence compression phase that imposes grammatical constraints. Only an additional lightweight dependency parser is needed for this step, and less efficient constituent parsing is avoided.
- We conduct experiments on the DUC 2006 and DUC 2007 datasets to show the improvements of our methods over previous work on unsupervised document summarization based on data reconstruction. Our models achieve fairly competitive results against state-of-the-art approaches while maintaining reasonable readability.

## 2 Our Proposed Framework

### 2.1 Sentence Selection via $\ell_{2,1}$ Regularization

Assume that the documents are represented as a weighted term-frequency matrix $D = [D_1, D_2, \dots, D_n] \in \mathbb{R}^{d \times n}$,¹ where $d$ is the size of the vocabulary and $n$ is the total number of sentences. Each column $D_i \in \mathbb{R}^{d \times 1}$ stands for a sentence vector.

¹ In this paper we follow the convention of using uppercase letters to denote matrices. We place a subscript $k$ when addressing the $k$-th column. Corresponding lowercase letters are used for elements (scalars) in matrices/vectors.

We aim at selecting sentences having both good coverage and low redundancy. Once we have a summary set of sentences taken from the original document, we would like to represent most sentences in the original document well using only these summary sentences. Therefore a natural goal is to minimize

$$\sum_{j=1}^{n} \|D_j - D A_j\|_2^2,$$

where each $A_j = [a_{1j}, \dots, a_{nj}]^T$ is a vector of coefficients used for representing the original sentence vector $D_j$ with other sentence vectors from the document $D$. This is similar to the intuition behind He et al. [2012]'s formulation from a data reconstruction perspective.

We are trying to represent the whole document by selecting only a few sentences. The goal is then to impose sentence-level sparsity on the sentence selection coefficients (the $A_i$'s), while retaining low reconstruction error. If we concatenate these column vectors as a matrix $A = [A_1, \dots, A_n] \in \mathbb{R}^{n \times n}$, sentence-level sparsity obviously corresponds to row sparsity in $A$. Hence the desired selection sparseness can be induced by performing $\ell_1$-regularization on rows, or more compactly, an $\ell_{2,1}$ regularization on the matrix $A$: $\|A\|_{2,1} = \sum_{i=1}^{n} \big(\sum_{j=1}^{n} a_{ij}^2\big)^{1/2}$. The document summarization problem can now be succinctly formulated as:

$$\min_{A} \; \|D - DA\|_F^2 + \lambda \|A\|_{2,1} \quad (1)$$
$$\text{s.t.} \quad a_{ij} \geq 0, \;\; \forall i, j, \quad (2)$$
$$\qquad \mathrm{diag}(A) = 0. \quad (3)$$

Here the additional constraint (3) is introduced to avoid the numerically trivial solution ($A \approx I$) in practice, by forcing the diagonal elements to be zeros. This formulation resembles the problem of subspace clustering studied in the computer vision community [Elhamifar and Vidal, 2013], which tries to learn a self-representation matrix under certain constraints.² Meanwhile, our convex objective function can be decomposed into two separate convex functions. In recent years, much attention has been paid to this type of decomposable objective, especially in the context of sparse optimization. In the next section we present an alternating direction method of multipliers (ADMM) for our optimization problem.

² The major difference between (1) and sparse subspace clustering lies in the regularization terms.
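To make the formulation concrete, the following minimal NumPy sketch (ours, not from the paper) evaluates the objective in (1) for a candidate coefficient matrix and ranks sentences by the magnitude of their coefficient rows, which is how candidates are ordered in the greedy selection step described later. Function and variable names such as `selection_objective` are illustrative.

```python
# A minimal sketch, assuming D is the d x n term-frequency matrix and
# lam plays the role of lambda in Eq. (1). Constraints (2)-(3) are not checked.
import numpy as np

def l21_norm(A):
    """l2,1 norm: the sum of the l2 norms of the rows of A."""
    return np.sum(np.linalg.norm(A, axis=1))

def selection_objective(D, A, lam):
    """||D - D A||_F^2 + lambda * ||A||_{2,1}."""
    reconstruction = np.linalg.norm(D - D @ A, ord="fro") ** 2
    return reconstruction + lam * l21_norm(A)

def rank_sentences(A):
    """Order sentences by the l2 norm of their rows of coefficients,
    largest first, for greedy summary construction."""
    return np.argsort(-np.linalg.norm(A, axis=1))
```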
### 2.2 An ADMM Solver

Problem (1) can be equivalently expressed as:

$$\min_{X, Z} \; \|D - DX\|_F^2 + \lambda \|Z\|_{2,1} \quad (4)$$
$$\text{s.t.} \quad X = Z, \;\; \mathrm{diag}(X) = 0, \;\; x_{ij} \geq 0, \;\; \forall i, j. \quad (5)$$

By expressing the original $A$ separately as $X$ and $Z$, the objective function has been factorized into two independently convex parts. ADMM tries to solve such decomposed problems by iteratively updating $X$ and $Z$, as well as the Lagrangian multipliers (denoted as a matrix $U$) on the constraint $X = Z$. We omit details of our mathematical derivations due to the space limit and point interested readers to Section 6.4 of Boyd et al. [2011], where a similar problem is presented. The entire ADMM procedure is listed in Algorithm 1.

**Algorithm 1** An ADMM solver for Problem (1)

1. **for** $k = 0, 1, 2, \dots$ **do**
2. // update every column $X_j$ of $X$:
3. $X_j^{k+1} \leftarrow (D^T D + \rho I)^{-1}\big(D^T D_j + \rho (Z_j^k - U_j^k)\big)$
4. $X^{k+1} \leftarrow X^{k+1} - \mathrm{diag}(X^{k+1})$, $\quad x_{ij}^{k+1} \leftarrow \max\{x_{ij}^{k+1}, 0\}$
5. // update every row $Z_i$ of $Z$:
6. $Z_i^{k+1} \leftarrow S_{\lambda/\rho}(X_i^{k+1} + U_i^k)$
7. $Z^{k+1} \leftarrow Z^{k+1} - \mathrm{diag}(Z^{k+1})$, $\quad z_{ij}^{k+1} \leftarrow \max\{z_{ij}^{k+1}, 0\}$
8. // update the dual variables $U$:
9. $U^{k+1} \leftarrow U^k + \rho (X^{k+1} - Z^{k+1})$
10. **if** $\|X^{k+1} - Z^{k+1}\| < \epsilon$ **then return** $Z$
11. **end if**
12. **end for**

Here $\rho$ is a constant parameter and the shrinkage operator $S$ in Line 6 is defined as:

$$S_\gamma(x) = \max\Big\{1 - \frac{\gamma}{\|x\|_2},\; 0\Big\}\, x.$$

Note that the implementation of the $X$ update step (Line 3 in Algorithm 1) does not require matrix inversion. It suffices to solve a set of related linear systems to get the closed-form solution.³ In our experiments, this ADMM process converges significantly faster than the gradient descent solution used in He et al. [2012]'s data reconstruction framework. After solving the optimization problem, summaries can be generated via a typical greedy sentence selection process, according to the magnitude of the row vectors in $Z$, which correspond to the selection coefficients of each sentence.

³ The matrix $D^T D + \rho I$ is unchanged during iterations. Thus we may pre-compute the required Cholesky factorization to avoid redundant computations when solving those linear systems.
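A compact sketch of this solver is shown below. It follows the standard scaled-dual ADMM convention of Boyd et al. [2011], so constant factors in the dual update may differ slightly from Algorithm 1 as printed; the function name and default parameter values are our own, and the Cholesky factorization is cached once as suggested in footnote 3.

```python
# A minimal sketch of the ADMM iteration, assuming D is the d x n
# term-frequency matrix; not the authors' implementation.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def row_shrink(M, gamma):
    """Apply the shrinkage operator S_gamma to every row of M."""
    norms = np.maximum(np.linalg.norm(M, axis=1, keepdims=True), 1e-12)
    return np.maximum(1.0 - gamma / norms, 0.0) * M

def admm_sentence_selection(D, lam=0.1, rho=1.0, eps=1e-4, max_iter=500):
    n = D.shape[1]
    X, Z, U = (np.zeros((n, n)) for _ in range(3))
    gram = D.T @ D
    factor = cho_factor(gram + rho * np.eye(n))   # factorize once (footnote 3)
    for _ in range(max_iter):
        # X update: closed form for all columns at once via the cached factor
        X = cho_solve(factor, gram + rho * (Z - U))
        np.fill_diagonal(X, 0.0)
        X = np.maximum(X, 0.0)
        # Z update: row-wise shrinkage realizes the l2,1 penalty
        Z = row_shrink(X + U, lam / rho)
        np.fill_diagonal(Z, 0.0)
        Z = np.maximum(Z, 0.0)
        # dual update and stopping test on the primal residual
        U = U + X - Z
        if np.linalg.norm(X - Z) < eps:
            break
    return Z
```

The returned matrix then drives the greedy selection step described above, with sentences ranked by the norms of their rows in $Z$.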
### 2.3 Diversity from Sentence Dissimilarity

The data reconstruction and sparse optimization formulations tend to select sentences that cover the documents. However, there is no explicit tendency to select diverse sentences capturing different but equally important information described in the documents. A very recent work [Liu et al., 2015] also addresses this diversity problem. They explicitly add the correlation coefficients between sentence pairs to the objective function of He et al. [2012]'s data reconstruction formulation. This makes their objective function difficult to optimize.

Inspired by recent work on exemplar-based sparse selection of representative points [Elhamifar et al., 2012], we introduce an additional dissimilarity term $\sum_{i,j} \delta_{ij} x_{ij} = \mathrm{tr}(\Delta^T X)$ into our optimization problem (4), weighted by a constant parameter $\mu$:

$$\min_{X, Z} \; \|D - DX\|_F^2 + \mu\, \mathrm{tr}(\Delta^T X) + \lambda \|Z\|_{2,1} \quad (6)$$
$$\text{s.t.} \quad X = Z, \;\; \mathrm{diag}(X) = 0, \;\; x_{ij} \geq 0, \;\; \forall i, j, \quad (7)$$

where the matrix $\Delta$ contains the pairwise dissimilarity $\delta_{ij}$ between sentence $i$ and sentence $j$, measured by an information-theoretic criterion as described by Frey and Dueck [2007]: for each word in sentence $j$, if it also appears in sentence $i$, set the encoding cost for the word to the logarithm of sentence $i$'s length; otherwise, set the encoding cost to the logarithm of the vocabulary size. The dissimilarity $\delta_{ij}$ is simply the sum of these costs. Note that this dissimilarity is asymmetric. We would like to explore other dissimilarity measures in future work.

The term $\mathrm{tr}(\Delta^T X)$ is linear. It is incorporated by simply changing Line 3 of Algorithm 1 into:

$$X_j^{k+1} \leftarrow (D^T D + \rho I)^{-1}\big(D^T D_j + \rho (Z_j^k - U_j^k) - \mu \Delta_j\big).$$

The matrix $\Delta$ is pre-computed before the ADMM procedure. Without loss of generality, we temporarily omit the additional dissimilarity term $\mathrm{tr}(\Delta^T X)$ and go back to the compact form (1) in the next section for clarity of description.

## 3 Compressive Document Summarization

The above sparse optimization problems encode a process of sentence selection. Keeping the same spirit of selection sparseness, we can naturally generalize this sentence selection process to word selection and perform compressive summarization. This can be achieved after some nontrivial modifications of (1):

$$\min_{R, A} \; \|D - RA\|_F^2 + \lambda_1 \|A\|_{2,1} + \lambda_2 \sum_{i=1}^{n} \|R_i\|_1 \quad (8)$$
$$\text{s.t.} \quad r_{ij}, a_{ij} \geq 0, \;\; \forall i, j; \quad \mathrm{grammatical}(R_i). \quad (9)$$

The matrix $R \in \mathbb{R}^{d \times n}$ can be regarded as a sparse approximation of the original document $D$: each column $R_i$ is a compressed version of the original sentence vector $D_i$, with several unimportant word dimensions shrunk to zero by the $\ell_1$ regularizer. Additional constraints ($\mathrm{grammatical}(R_i)$) on word selection should be added to ensure the grammaticality of compressed sentences. For example, if a verb is selected, then the subject of this verb should also be selected. These constraints are typically induced by dependency relations from dependency parse trees. Problem (8) jointly optimizes over the sentence selection coefficients $A$ as well as the sparse sentence representation vectors $R$. Since directly involving the grammaticality constraints would be inconvenient, we leave them to a final step of sentence generation performed after the joint optimization.

### 3.1 Joint Sparse Optimization

The product $RA$ in (8) makes the problem nonconvex. Since the objective function decomposes naturally for $R$ and $A$, it is straightforward to conduct a block coordinate descent process for this joint sparse problem. If we fix $R$, the resulting problem for $A$ is:

$$\min_{A} \; \|D - RA\|_F^2 + \lambda_1 \|A\|_{2,1} \quad (10)$$
$$\text{s.t.} \quad a_{ij} \geq 0, \;\; \forall i, j. \quad (11)$$

This is a problem we can solve by calling the same ADMM procedure derived in Section 2.2. If we fix $A$, the resulting problem for $R$ is:

$$\min_{R} \; \|D - RA\|_F^2 + \lambda_2 \sum_{i=1}^{n} \|R_i\|_1 \quad (12)$$
$$\text{s.t.} \quad r_{ij} \geq 0, \;\; \forall i, j. \quad (13)$$

This is essentially a sparse dictionary learning problem [Donoho and Elad, 2003]. We can obtain a solution by iteratively solving LASSO problems defined over the columns $R_i$. After iteratively solving these two subproblems, we obtain sparse sentence vectors as the columns of $R$.
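The sketch below illustrates the alternating structure of this block coordinate descent. For brevity, each block is updated with a few projected proximal-gradient steps rather than the ADMM solver of Section 2.2 for $A$ and LASSO solvers for $R$ that the paper uses; the "shrink then clip" proximal step for the $\ell_{2,1}$ term is a simple approximation, and all names and default iteration counts are illustrative.

```python
# A minimal sketch of block coordinate descent for problem (8),
# under the simplifying choices described in the lead-in paragraph.
import numpy as np

def prox_l21_nonneg(A, t):
    """Row-wise group soft-thresholding followed by projection onto A >= 0
    (an approximation of the exact prox of the constrained l2,1 term)."""
    norms = np.maximum(np.linalg.norm(A, axis=1, keepdims=True), 1e-12)
    return np.maximum(np.maximum(1.0 - t / norms, 0.0) * A, 0.0)

def prox_l1_nonneg(R, t):
    """Exact prox of t*||.||_1 restricted to the nonnegative orthant."""
    return np.maximum(R - t, 0.0)

def bcd_compressive(D, lam1=0.1, lam2=0.1, outer_iters=20, inner_iters=50):
    d, n = D.shape
    R = D.copy()                     # start from the uncompressed sentences
    A = np.zeros((n, n))
    for _ in range(outer_iters):
        # A-step: min_A ||D - RA||_F^2 + lam1 ||A||_{2,1},  A >= 0   (Eqs. 10-11)
        step = 1.0 / max(2.0 * np.linalg.norm(R, 2) ** 2, 1e-8)
        for _ in range(inner_iters):
            grad = -2.0 * R.T @ (D - R @ A)
            A = prox_l21_nonneg(A - step * grad, step * lam1)
        # R-step: min_R ||D - RA||_F^2 + lam2 ||R||_1,  R >= 0       (Eqs. 12-13)
        step = 1.0 / max(2.0 * np.linalg.norm(A, 2) ** 2, 1e-8)
        for _ in range(inner_iters):
            grad = -2.0 * (D - R @ A) @ A.T
            R = prox_l1_nonneg(R - step * grad, step * lam2)
    return R, A
```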
### 3.2 Generation of Compressed Sentences

Once the values of each reconstructed sentence vector $R_i$ are ready, we can recover compressed sentences accordingly, now taking into account the previously ignored grammaticality constraints. Most modern sentence compression techniques require global constrained optimization, such as integer linear programming [Clarke and Lapata, 2008] and first-order Markov Logic Networks [Huang et al., 2012]. Such inference processes are far less efficient than needed for our purpose of compressive summarization. Moreover, currently available sentence compression datasets are typically small in scale. Therefore, in this final step we only consider an unsupervised method of the simplest yet effective form: we generate a compressed sentence by selecting a subtree from the full dependency tree of the original sentence.

Suppose that for all sentences we have dependency parses labeled with grammatical relations. Since the grammatical constraints are usually defined on word-to-word dependency relations, we can treat the dependency arcs locally and perform local inclusion-deletion decisions on dependency trees. We use an efficient recursive algorithm that utilizes the word-level weights and subtree structural constraints. The goal of this algorithm is to extract a subtree with maximum score while satisfying the grammatical constraints.

The constraints are expressed in the form of two sets of grammatical relations. The first, denoted KEEP_HEAD, includes dependency relations for which the inclusion of a modifier should always imply the inclusion of its direct head. An example relation in this set is the noun modifier (NMOD): if we select the word "nice" in "nice book", then the head noun "book" should also be selected. The second, denoted SIMUL_DEL, includes dependency relations that should force simultaneous inclusion or deletion of the head and the modifier. For example, subject and object relations (SBJ and OBJ) should be included in this set, since dropping any word on a subject-verb-object path will often result in an ungrammatical sentence. We trivially derive the complete lists of the two sets from Clarke and Lapata [2008], where these constraints were described in the form of linear constraints in integer linear programs.

The whole process for sentence compression, along with subtree scoring, is listed in Algorithm 2 in a recursive style. Scores and costs (the number of words actually kept) of a subtree are accumulated bottom-up at each node. The score at each node (subtree) is initialized to the value of the corresponding word dimension in the vector $R_i$ described in the last section, for the $i$-th sentence. The cost at each node is initialized to 1, accounting for the word at this node only. The algorithm decides whether to delete a subtree at Line 5, where we also consider bigram scores (e.g., document frequencies of bigrams) to locally ensure grammatical smoothness. At Line 11 the algorithm picks the subtree with maximum score by comparing with the current best subtree cmax (short for "current maximum"). Note that this comparison is only made when the root is an indicator verb of a grammatical partial sentence.⁴

⁴ We simply treat non-VBG verbs as such indicator verbs.

**Algorithm 2** A recursive procedure for sentence compression

1. Init: node scores from $R_i$, node costs 1, cmax ← NULL
2. **function** GET_MAXSCORE_SUBTREE($V$)
3. **for** each child $C$ in $V$.children **do**
4. $T_C \leftarrow$ GET_MAXSCORE_SUBTREE($C$)
5. **if** $C$.label $\notin$ SIMUL_DEL **and** $T_C$.score$/T_C$.cost $< \epsilon$ **and** deletion of $T_C$ does not decrease the local bigram score **then**
6. Delete $T_C$ from $C$
7. **end if**
8. $V$.score += $T_C$.score
9. $V$.cost += $T_C$.cost
10. **end for**
11. **if** $V$ is an indicator verb **and** $V$.label $\notin$ KEEP_HEAD **and** $V$.score $>$ cmax.score **then**
12. cmax ← $V$
13. **end if**
14. **if** $V$.score $>$ cmax.score **then return** $V$
15. **else return** cmax
16. **end if**
17. **end function**
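A minimal Python sketch of this recursive step is given below. It assumes that a child subtree may be pruned only when its relation is not in SIMUL_DEL, that a candidate partial sentence must not be attached to its head by a KEEP_HEAD relation, and that only kept subtrees contribute score and cost to the parent. The Node class, the contents of the two relation sets, and the bigram-smoothness test are placeholders rather than the paper's exact definitions.

```python
# A simplified sketch of the recursive compression procedure (Algorithm 2).
from dataclasses import dataclass, field
from typing import List

SIMUL_DEL = {"SBJ", "OBJ"}      # head and modifier kept or dropped together
KEEP_HEAD = {"NMOD", "AMOD"}    # keeping the modifier forces keeping its head

@dataclass
class Node:
    word: str
    label: str                  # grammatical relation to the parent
    score: float                # initialized from the word's weight in R_i
    cost: int = 1               # number of words kept in this subtree
    is_indicator_verb: bool = False
    children: List["Node"] = field(default_factory=list)

def bigram_ok_after_deleting(parent: Node, child: Node) -> bool:
    # Placeholder for the local bigram-smoothness test used at Line 5.
    return True

def compress(root: Node, eps: float) -> Node:
    """Return the best-scoring grammatical subtree of a dependency tree."""
    best = {"cmax": None}       # shared "current maximum" across recursive calls

    def visit(v: Node) -> None:
        for c in list(v.children):
            visit(c)
            if (c.label not in SIMUL_DEL
                    and c.score / max(c.cost, 1) < eps
                    and bigram_ok_after_deleting(v, c)):
                v.children.remove(c)             # prune the low-scoring subtree
                continue
            v.score += c.score                   # accumulate kept subtree scores
            v.cost += c.cost
        cmax = best["cmax"]
        if (v.is_indicator_verb and v.label not in KEEP_HEAD
                and (cmax is None or v.score > cmax.score)):
            best["cmax"] = v                     # best grammatical partial sentence

    visit(root)
    cmax = best["cmax"]
    return root if cmax is None or root.score > cmax.score else cmax
```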
Here we illustrate the algorithm with an example, proceeding bottom-up after initialization, with special attention to the cases that make Line 5 or Line 11 active. The example is also depicted in Figure 1, where at each node we use the shorthand notation l, s, and c to represent the label of the grammatical relation, the score, and the cost at that node, respectively. Underscored values are the initial values at each node. Shaded nodes are dropped along with all their subnodes, while bold nodes are kept.

[Figure 1: the labeled dependency tree of the example sentence, with each node annotated with its relation label l, accumulated score s, and cost c.]

The original sentence is: "He said the Corps has toughened regulations since 1996, requiring the damage as small as possible," and $\epsilon = 0.02$. At the very bottom, dropping the leaf "possible" from "as" would decrease the bigram score, hence it is kept. Two subnodes of "small" have a score-cost ratio $< \epsilon$, and deleting them will not hurt local smoothness in terms of the bigram score, so they are dropped along with their subnodes. Moving above, the phrase