# Fast and Accurate Neural CRF Constituency Parsing

Yu Zhang, Houquan Zhou, Zhenghua Li
Institute of Artificial Intelligence, School of Computer Science and Technology, Soochow University, Suzhou, China
yzhang.cs@outlook.com, hqzhou@stu.suda.edu.cn, zhli13@suda.edu.cn

Yu Zhang and Houquan Zhou make equal contributions to this work. Zhenghua Li is the corresponding author.

Estimating probability distributions is one of the core issues in the NLP field. However, in both the deep learning (DL) and pre-DL eras, unlike the vast applications of linear-chain CRFs in sequence labeling tasks, very few works have applied tree-structured CRFs to constituency parsing, mainly due to the complexity and inefficiency of the inside-outside algorithm. This work presents a fast and accurate neural CRF constituency parser. The key idea is to batchify the inside algorithm for loss computation via direct large tensor operations on GPU, and meanwhile avoid the outside algorithm for gradient computation via efficient back-propagation. We also propose a simple two-stage bracketing-then-labeling parsing approach to further improve efficiency. To improve parsing performance, inspired by recent progress in dependency parsing, we introduce a new scoring architecture based on boundary representation and biaffine attention, and a beneficial dropout strategy. Experiments on PTB, CTB5.1, and CTB7 show that our two-stage CRF parser achieves new state-of-the-art performance on both settings of w/o and w/ BERT, and can parse over 1,000 sentences per second. We release our code at https://github.com/yzhangcs/crfpar.

## 1 Introduction

Given an input sentence, constituency parsing aims to build a hierarchical tree as depicted in Figure 1(a), where the leaf or terminal nodes correspond to input words and non-terminal nodes are constituents (e.g., $\mathrm{VP}_{3,5}$). As a fundamental yet challenging task in the natural language processing (NLP) field, constituency parsing has attracted a lot of research attention since large-scale treebanks were annotated, such as the Penn Treebank (PTB), the Penn Chinese Treebank (CTB), etc. Parsing outputs have also proven to be extensively useful for a wide range of downstream applications [Akoury et al., 2019; Wang et al., 2018].

Figure 1: Example constituency trees: (a) original tree; (b) CNF tree (left binarized). Part-of-speech (POS) tags are not used as inputs in this work and are thus excluded.

As one of the most influential works, Collins [1997] extends methods from probabilistic context-free grammars (PCFGs) to lexicalized grammars. Since then, constituency parsing has been dominated by such generative models for a long time, among which the widely used Berkeley parser adopts an unlexicalized PCFG with latent non-terminal splitting annotations [Matsuzaki et al., 2005; Petrov and Klein, 2007]. As for discriminative models, there exist two representative lines of research. The first line adopts the graph-based view based on dynamic programming decoding, using either local max-entropy estimation [Kaplan et al., 2004] or global max-margin training [Taskar et al., 2004]. The second group builds a tree via a sequence of shift-reduce actions based on greedy or beam decoding, known as the transition-based view [Sagae and Lavie, 2005; Zhu et al., 2013]. Recently, constituency parsing has achieved significant progress thanks to the impressive capability of deep neural networks in context representation.
Two typical and popular works are, respectively, the transition-based parser of Cross and Huang [2016] and the graph-based parser of Stern et al. [2017]. As discriminative models, the two parsers share several commonalities, both using 1) a multi-layer BiLSTM as encoder; 2) minus features from BiLSTM outputs as span representations; 3) MLPs for span scoring; 4) a max-margin training loss. Most works [Gaddy et al., 2018; Kitaev and Klein, 2018] mainly follow the two parsers and achieve much higher parsing accuracy than traditional non-neural models, especially with contextualized word representations trained with a language modeling loss on large-scale unlabeled data [Peters et al., 2018; Devlin et al., 2019].

Despite the rapid progress, existing constituency parsing research suffers from two closely related drawbacks. First, parsing (and also training) speed is slow and can hardly satisfy the requirement of real-life systems. Second, the lack of explicit modeling of tree/subtree probabilities may hinder the effectiveness of utilizing parsing outputs. On the one hand, estimating probability distributions is one of the core issues in the NLP field [Le and Zuidema, 2014]. On the other hand, compared with unbounded tree scores, tree probabilities can better serve high-level tasks as soft features [Jin et al., 2020], and marginal probabilities of subtrees can support the more sophisticated Minimum Bayes Risk (MBR) decoding [Smith and Smith, 2007].

In fact, Finkel et al. [2008] and Durrett and Klein [2015] both propose CRF-based constituency parsing by directly modeling the conditional probability. However, both models are extremely inefficient due to the high time complexity of the inside-outside algorithm for loss and gradient computation, especially the outside procedure. The issue becomes more severe in the DL era since, to our knowledge, all previous works perform the inside-outside computation on CPUs, and switching between GPU and CPU is expensive.

This work proposes a fast and accurate CRF constituency parser by substantially extending the graph-based parser of Stern et al. [2017]. The key contribution is that we batchify the inside algorithm for direct loss and gradient computation on GPU. Meanwhile, we find that the outside algorithm can be efficiently fulfilled by automatic back-propagation, which turns out to be as efficient as the inside (forward) procedure, naturally verifying the theoretical work of Eisner [2016]; a minimal sketch of this is given after the contribution list below. Similarly, we batchify the Cocke-Kasami-Younger (CKY) algorithm for fast decoding.

In summary, we make the following contributions:

- We for the first time propose a fast and accurate CRF constituency parser for directly modeling (marginal) probabilities of trees and subtrees. The efficiency issue, which has bothered the community for a long time, is well solved by elegantly batchifying the inside and CKY algorithms for direct computation on GPU.
- We propose a two-stage bracketing-then-labeling parsing approach that is more efficient and achieves slightly better performance than the one-stage method.
- We propose a new span scoring architecture based on span boundary representation and biaffine attention scoring, which performs better than the widely used minus-feature method. We also show that parsing performance can be improved by a large margin via better parameter settings such as the dropout configuration.
- Experiments on three English and Chinese benchmark datasets show that our proposed two-stage CRF parser achieves new state-of-the-art parsing performance under both settings of w/o and w/ BERT [Devlin et al., 2019]. In terms of parsing speed, our parser can parse over 1,000 sentences per second.
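As a concrete illustration of the back-propagation trick mentioned above, the sketch below shows how span marginal probabilities fall out of autograd once $\log Z(\boldsymbol{x})$ is available as a differentiable function of the span scores. The `log_partition` argument is a hypothetical stand-in for such a function (e.g., a batchified inside algorithm); all names here are illustrative and not taken from the released code.

```python
import torch

def span_marginals(span_scores: torch.Tensor, log_partition) -> torch.Tensor:
    """Compute span marginal probabilities by back-propagating through log Z.

    span_scores:   [batch, n, n] tensor of constituent scores s(i, j).
    log_partition: any differentiable function returning log Z(x) per sentence,
                   e.g., a batchified inside algorithm (assumed, not shown here).

    For a CRF, d(log Z)/d(s(i, j)) equals the marginal probability of span (i, j),
    so no explicit outside pass is needed.
    """
    span_scores = span_scores.detach().requires_grad_(True)
    log_z = log_partition(span_scores)                 # shape: [batch]
    marginals, = torch.autograd.grad(log_z.sum(), span_scores)
    return marginals
```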
Figure 2: Model architecture. A BiLSTM encoder produces boundary representations via MLPs; a biaffine span scorer performs first-stage bracketing, and biaffine label scorers perform second-stage labeling (example sentence: "I love this game").

## 2 Two-stage CRF Parsing

Formally, given a sentence consisting of $n$ words $\boldsymbol{x} = w_0, \dots, w_{n-1}$, a constituency parse tree, as depicted in Figure 1(a), is denoted as $t$, and $(i, j, l) \in t$ is a constituent spanning $w_i \dots w_j$ with a syntactic label $l \in \mathcal{L}$. Alternatively, a tree can be factored into two parts, i.e., $t = (\boldsymbol{y}, \boldsymbol{l})$, where $\boldsymbol{y}$ is an unlabeled (a.k.a. bracketed) tree and $\boldsymbol{l}$ is a label sequence for all constituents in a certain order. Interchangeably, $(3, 5, \mathrm{VP})$ is also denoted as $\mathrm{VP}_{3,5}$.

To accommodate the inside and CKY algorithms, we transform the original tree into Chomsky normal form (CNF) using the NLTK tool (https://www.nltk.org), as shown in Figure 1(b). Particularly, consecutive unary productions such as $\mathrm{X}_{i,j} \rightarrow \mathrm{Y}_{i,j}$ are collapsed into one $\mathrm{X{+}Y}_{i,j}$. We adopt left binarization since preliminary experiments show it is slightly superior to right binarization. After obtaining the 1-best tree via CKY decoding, the CNF tree is recovered into the $n$-ary form.

### 2.1 Model Definition

In this work, we adopt a two-stage bracketing-then-labeling framework for constituency parsing, which we show not only simplifies the model architecture but also improves efficiency, compared with the traditional one-stage approach adopted in previous works [Stern et al., 2017; Gaddy et al., 2018].

**First stage: bracketing.** Given $\boldsymbol{x}$, the goal of the first stage is to find an optimal unlabeled tree $\boldsymbol{y}$. The score of a tree is decomposed into the scores of all contained constituents:

$$s(\boldsymbol{x}, \boldsymbol{y}) = \sum_{(i,j) \in \boldsymbol{y}} s(i, j) \quad (1)$$

Under CRF, the conditional probability is

$$p(\boldsymbol{y} \mid \boldsymbol{x}) = \frac{e^{s(\boldsymbol{x}, \boldsymbol{y})}}{\sum_{\boldsymbol{y}' \in \mathcal{T}(\boldsymbol{x})} e^{s(\boldsymbol{x}, \boldsymbol{y}')}} \equiv \frac{e^{s(\boldsymbol{x}, \boldsymbol{y})}}{Z(\boldsymbol{x})} \quad (2)$$

where $Z(\boldsymbol{x})$ is known as the normalization term, and $\mathcal{T}(\boldsymbol{x})$ is the set of legal trees. Given all constituent scores $s(i, j)$, we use the CKY algorithm to find the optimal tree $\hat{\boldsymbol{y}}$:

$$\hat{\boldsymbol{y}} = \arg\max_{\boldsymbol{y}} s(\boldsymbol{x}, \boldsymbol{y}) = \arg\max_{\boldsymbol{y}} p(\boldsymbol{y} \mid \boldsymbol{x}) \quad (3)$$

**Second stage: labeling.** Given a sentence $\boldsymbol{x}$ and a tree $\boldsymbol{y}$, the second stage independently predicts a label for each constituent $(i, j) \in \boldsymbol{y}$:

$$\hat{l} = \arg\max_{l \in \mathcal{L}} s(i, j, l) \quad (4)$$

Note that we use gold-standard unlabeled trees for loss computation during training. For a sentence of length $n$, all CNF trees contain the same $2n - 1$ constituents. Therefore, this stage has a time complexity of $O(n|\mathcal{L}|)$.

**Time complexity analysis.** The CKY algorithm has a time complexity of $O(n^3)$. Therefore, the time complexity of our two-stage parsing approach is $O(n^3 + n|\mathcal{L}|)$. In contrast, for the one-stage parsing approach, the CKY algorithm needs to determine the best label for all $n^2$ spans and thus requires $O(n^3 + n^2|\mathcal{L}|)$, where $|\mathcal{L}|$ is usually very large (e.g., 138 for English in Table 1).
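The two-stage procedure can be summarized by the following simplified, per-sentence sketch: CKY finds the 1-best unlabeled binary tree from the span scores, and each predicted span is then labeled independently by an argmax over its label scores. The actual parser batchifies CKY on GPU; the tensor shapes, the inclusive-`j` span indexing, and all names below are assumptions for illustration only.

```python
import torch

def two_stage_decode(span_scores: torch.Tensor, label_scores: torch.Tensor):
    """Simplified single-sentence sketch of two-stage decoding.

    span_scores:  [n, n] tensor, span_scores[i, j] = s(i, j) for span w_i..w_j
                  (j inclusive, i <= j).
    label_scores: [n, n, L] tensor of label scores s(i, j, l).
    Returns the 1-best binarized tree as a list of (i, j, label_id) triples.
    """
    n = span_scores.size(0)
    chart = torch.full((n, n), float("-inf"))
    split = torch.zeros((n, n), dtype=torch.long)

    # First stage: CKY over unlabeled spans.
    for i in range(n):                         # width-1 spans (single words)
        chart[i, i] = span_scores[i, i]
    for w in range(1, n):                      # span width
        for i in range(n - w):
            j = i + w
            # candidate splits r: left subspan (i, r), right subspan (r+1, j)
            cand = chart[i, i:j] + chart[i + 1:j + 1, j]
            k = int(torch.argmax(cand))
            chart[i, j] = span_scores[i, j] + cand[k]
            split[i, j] = i + k

    # Backtrack the best unlabeled tree; second stage labels each of its
    # spans independently via an argmax over the label scores.
    tree = []
    def backtrack(i, j):
        tree.append((i, j, int(torch.argmax(label_scores[i, j]))))
        if i < j:
            r = int(split[i, j])
            backtrack(i, r)
            backtrack(r + 1, j)
    backtrack(0, n - 1)
    return tree
```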
### 2.2 Scoring Architecture

This subsection introduces the network architecture for scoring spans and labels, as shown in Figure 2, which mostly follows Stern et al. [2017] with two important modifications: 1) boundary representation and biaffine attention for score computation; 2) better parameter settings following Dozat and Manning [2017].

**Inputs.** For the $i$th word, its input vector $\mathbf{e}_i$ is the concatenation of the word embedding and the character-level representation:

$$\mathbf{e}_i = \mathbf{e}^{word}_i \oplus \mathrm{CharLSTM}(w_i) \quad (5)$$

where $\mathrm{CharLSTM}(w_i)$ is the output vector obtained by feeding the character sequence into a BiLSTM layer [Lample et al., 2016]. Previous works show that replacing POS tag embeddings with $\mathrm{CharLSTM}(w_i)$ leads to consistent improvements [Kitaev and Klein, 2018]. This also simplifies the model by removing the need to predict POS tags (via n-fold jack-knifing on the training data).

**BiLSTM encoder.** We employ three BiLSTM layers over the input vectors for context encoding. We denote by $\mathbf{f}_i$ and $\mathbf{b}_i$ the output vectors of the top-layer forward and backward LSTMs for $w_i$, respectively. In this work, we borrow most parameter settings from the dependency parser of Dozat and Manning [2017]. We find that the dropout strategy is crucial for parsing performance; it differs from the vanilla implementation of Stern et al. [2017] in two aspects. First, for each word $w_i$, $\mathbf{e}^{word}_i$ and $\mathrm{CharLSTM}(w_i)$ are dropped as a whole, i.e., each is either kept unchanged or becomes a zero vector. If one vector is dropped to zero, the other is compensated with a ratio of 2. Second, the same LSTM layer shares the same dropout masks at different timesteps (words).

**Boundary representation.** For each word $w_i$, we compose the context-aware word representation following Stern et al. [2017]:

$$\mathbf{h}_i = \mathbf{f}_i \oplus \mathbf{b}_{i+1} \quad (6)$$

The dimension of $\mathbf{h}_i$ is 800. Instead of directly applying a single MLP to $\mathbf{h}_i$, we observe that a word must act as either the left or the right boundary of the constituents it touches in a given tree. Therefore, we employ two MLPs to make such a distinction and obtain left and right boundary representation vectors:

$$\mathbf{r}^l_i;\ \mathbf{r}^r_i = \mathrm{MLP}^l(\mathbf{h}_i);\ \mathrm{MLP}^r(\mathbf{h}_i) \quad (7)$$

The dimension $d$ of $\mathbf{r}^{l/r}_i$ is 500. As pointed out by Dozat and Manning [2017], the MLPs reduce the dimension of $\mathbf{h}_i$ and, more importantly, retain only syntax-related information, thus alleviating the risk of over-fitting.

**Biaffine scoring.** Given the boundary representations, we score each candidate constituent $(i, j)$ using a biaffine operation over the left boundary representation of $w_i$ and the right boundary representation of $w_j$:

$$s(i, j) = \begin{bmatrix} \mathbf{r}^l_i \\ 1 \end{bmatrix}^{\mathrm{T}} \mathbf{W}\, \mathbf{r}^r_j \quad (8)$$

where $\mathbf{W} \in \mathbb{R}^{(d+1) \times d}$. The scores of constituent labels $s(i, j, l)$ are computed analogously. Two extra MLPs are applied to $\mathbf{h}_i$ to obtain boundary representations $\mathbf{r}'^{l/r}_i$ (with dimension $d'$). We then use $|\mathcal{L}|$ biaffines (each in $\mathbb{R}^{(d'+1) \times d'}$) to obtain all label scores. Since $|\mathcal{L}|$ is large, we use a small dimension $d'$ of 100 for $\mathbf{r}'^{l/r}_i$ (vs. 500 for $\mathbf{r}^{l/r}_i$) to reduce memory and computation cost.

**Previous scoring method.** Stern et al. [2017] use minus features of BiLSTM outputs as span representations [Wang and Chang, 2016; Cross and Huang, 2016] and apply MLPs to compute span scores:

$$s(i, j) = \mathrm{MLP}(\mathbf{h}_j - \mathbf{h}_i) \quad (9)$$

We show in the experiments that our new scoring method is clearly superior.
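The boundary-representation plus biaffine scorer described above can be sketched as follows, assuming batched fencepost representations $\mathbf{h}$ as input. The class name, the LeakyReLU activation, and the zero initialization are illustrative assumptions, not the released implementation; dropout is omitted, and the label scorer simply repeats the same pattern with $|\mathcal{L}|$ biaffines of dimension $d'$.

```python
import torch
import torch.nn as nn

class BiaffineSpanScorer(nn.Module):
    """Minimal sketch of the boundary-representation + biaffine span scorer."""

    def __init__(self, hidden_dim: int = 800, mlp_dim: int = 500):
        super().__init__()
        # Two MLPs distinguish the left- and right-boundary roles of each word.
        self.mlp_l = nn.Sequential(nn.Linear(hidden_dim, mlp_dim), nn.LeakyReLU())
        self.mlp_r = nn.Sequential(nn.Linear(hidden_dim, mlp_dim), nn.LeakyReLU())
        # Biaffine weight; the extra row accounts for the 1 appended to r^l_i (Eq. 8).
        self.W = nn.Parameter(torch.zeros(mlp_dim + 1, mlp_dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        """h: [batch, n, hidden_dim] representations h_i = f_i (+) b_{i+1}.
        Returns s: [batch, n, n] with s[b, i, j] = score of span (i, j)."""
        r_l = self.mlp_l(h)                                  # [batch, n, d]
        r_r = self.mlp_r(h)                                  # [batch, n, d]
        ones = torch.ones_like(r_l[..., :1])
        r_l = torch.cat([r_l, ones], dim=-1)                 # append bias: [batch, n, d+1]
        # s(i, j) = [r^l_i; 1]^T W r^r_j for all (i, j) pairs at once.
        return torch.einsum("bid,de,bje->bij", r_l, self.W, r_r)
```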
### 2.3 Training Loss

For a training instance $(\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{l})$, the training loss is composed of two parts:

$$\mathcal{L}(\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{l}) = \mathcal{L}_{bracket}(\boldsymbol{x}, \boldsymbol{y}) + \mathcal{L}_{label}(\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{l}) \quad (10)$$

The first term is the sentence-level global CRF loss, which tries to maximize the conditional probability:

$$\mathcal{L}_{bracket}(\boldsymbol{x}, \boldsymbol{y}) = -s(\boldsymbol{x}, \boldsymbol{y}) + \log Z(\boldsymbol{x}) \quad (11)$$

where $\log Z(\boldsymbol{x})$ can be computed using the inside algorithm in $O(n^3)$ time. The second term is the standard constituent-level cross-entropy loss for the labeling stage.

Algorithm 1: Batchified Inside Algorithm.

1: define: $\mathbf{S} \in \mathbb{R}^{n \times n \times B}$  ▷ $B$ is #sents in a batch
2: initialize: all $\mathbf{S}_{:,:} = 0$
3: for $w = 1$ to $n$ do  ▷ span width
4:   parallel computation on $0 \le i, j < n$, $i \le r < j$, $0 \le b < B$:
5:   $\mathbf{S}_{i, j=i+w} = \log \sum_{i \le r < j} e^{\mathbf{S}_{i,r} + \mathbf{S}_{r+1,j}} + s(i, j)$
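Below is a minimal PyTorch sketch of how Algorithm 1 can be batchified: all spans of the same width, across all sentences in the batch, are reduced with a single logsumexp over split points, and back-propagating through the returned $\log Z(\boldsymbol{x})$ stands in for the outside pass. The indexing convention (inclusive $j$, width offset $w = j - i$), the single fixed length $n$ (no padding/masking), and all names are simplifying assumptions; a full implementation would additionally handle padding and use more efficient diagonal/stripe indexing.

```python
import torch

def batchified_inside(s: torch.Tensor) -> torch.Tensor:
    """Minimal batchified inside algorithm.

    s: [B, n, n] span scores, where s[b, i, j] scores the span w_i..w_j
       (j inclusive, i <= j); entries with j < i are ignored.
    Returns logZ: [B], the log partition for each sentence in the batch.
    """
    B, n, _ = s.shape
    # diag[w][b, i] holds the inside score S[b, i, i+w]; all spans with the
    # same width offset w = j - i are computed in one tensor operation.
    diag = [s.diagonal(offset=0, dim1=1, dim2=2)]            # S[i, i] = s(i, i)
    for w in range(1, n):
        m = n - w                                            # #spans with j - i = w
        # For split r = i + k (0 <= k < w):
        #   left  = S[i, i+k]      -> stored in diag[k]
        #   right = S[i+k+1, i+w]  -> stored in diag[w-1-k], shifted by k+1
        left = torch.stack([diag[k][:, :m] for k in range(w)], dim=-1)        # [B, m, w]
        right = torch.stack([diag[w - 1 - k][:, k + 1:k + 1 + m]
                             for k in range(w)], dim=-1)                      # [B, m, w]
        span = s.diagonal(offset=w, dim1=1, dim2=2)                           # [B, m]
        diag.append(span + torch.logsumexp(left + right, dim=-1))             # Alg. 1, line 5
    return diag[-1].squeeze(-1)                              # S[0, n-1] = log Z(x)


# Eq. 11 for a batch, assuming gold_score[b] = s(x, y) of the gold tree:
# loss_bracket = (batchified_inside(s) - gold_score).mean()
```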