# Improving Sequence-to-Sequence Constituency Parsing

Lemao Liu, Muhua Zhu, Shuming Shi
Tencent AI Lab, Shenzhen, China
{redmondliu, muhuazhu, shumingshi}@tencent.com

## Abstract

Sequence-to-sequence constituency parsing casts the tree-structured prediction problem as a general sequential problem by top-down tree linearization, and thus it is easy to train in parallel with distributed facilities. Despite its success, it relies on a general-purpose probabilistic attention mechanism, which cannot guarantee that the selected context is informative in the specific parsing scenario. Previous work introduced a deterministic attention to select informative context for sequence-to-sequence parsing, but it is based on bottom-up linearization, even though top-down linearization has been observed to be better than bottom-up linearization for standard sequence-to-sequence constituency parsing. In this paper, we therefore extend the deterministic attention to operate directly on the top-down tree linearization. Intensive experiments show that our parser delivers substantial improvements in accuracy over the bottom-up linearization, and it achieves 92.3 F-score on Penn English Treebank section 23 and 85.4 F-score on the Penn Chinese Treebank test dataset, without reranking or semi-supervised training.

## Introduction

Constituency parsing is a fundamental task in natural language processing, and it plays an important role in downstream applications such as machine translation (Galley et al. 2004; 2006) and semantic analysis (Rim, Seo, and Simmons 1990; Manning, Schütze, and others 1999). Over the decades, feature-rich linear models have been dominant in constituency parsing (Petrov and Klein 2007; Zhu et al. 2013), but they are not good at capturing long-distance dependencies due to feature sparsity. Recurrent neural networks have the advantage of addressing this issue, and recently there has been much work on recurrent neural models for constituency parsing (Vinyals et al. 2015; Watanabe and Sumita 2015; Dyer et al. 2016; Cross and Huang 2016).

In particular, sequence-to-sequence parsing (Vinyals et al. 2015) has become increasingly popular. Its basic idea is to linearize a parse tree into a sequence in a top-down manner (see Figure 1) and then transform parsing into a standard sequence-to-sequence learning task.

Figure 1: Top-down and bottom-up linearization of a parse tree in sequence-to-sequence constituency parsing. The input sequence x is the leaves of the parse tree at the top, and the output is the linearized sequence y at the bottom. A dashed line indicates the relation between x_i and XX.

The main technique inside sequence-to-sequence parsing is a probabilistic attention mechanism, which aligns an output token to input tokens to select relevant context for better prediction, as shown in Figure 2(a). This parsing model is general and easy to understand; in particular, it runs in a sequential manner and thus is easy to parallelize on GPUs. However, the probabilistic attention cannot guarantee that the selected context is informative enough to yield satisfactory outputs. As a result, its accuracy is only comparable to that of feature-rich linear models (Petrov and Klein 2007; Zhu et al. 2013), especially given that it utilizes global context.
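To make the top-down linearization step described above concrete, the following is a minimal sketch, assuming a simple (label, children) tuple representation of trees; the function name and the example tree are illustrative choices of ours, not taken from the paper.

```python
# A minimal sketch of top-down tree linearization for
# sequence-to-sequence parsing (Vinyals et al. 2015).
# Pre-terminals (POS tags) are collapsed into the generic token 'XX';
# each non-terminal contributes an opening token '(L' and a closing token ')L'.

def linearize_top_down(tree):
    """Return the linearized output sequence y for a parse tree."""
    label, children = tree
    if not children:                 # pre-terminal / leaf: emit the placeholder token
        return ["XX"]
    tokens = ["(" + label]
    for child in children:
        tokens.extend(linearize_top_down(child))
    tokens.append(")" + label)
    return tokens

# Illustrative parse of "John has a dog ." (the exact tree used in
# Figure 1 of the paper may differ; this is only for demonstration).
tree = ("S",
        [("NP", [("XX", [])]),
         ("VP", [("XX", []), ("NP", [("XX", []), ("XX", [])])]),
         ("XX", [])])
print(linearize_top_down(tree))
# ['(S', '(NP', 'XX', ')NP', '(VP', 'XX', '(NP', 'XX', 'XX', ')NP', ')VP', 'XX', ')S']
```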
Ma et al. (2017) proposed a deterministic attention for sequence-to-sequence parsing, which defines the alignments between output and input tokens in a deterministic manner to select the relevant context. This method was able to select better context than probabilistic attention for parsing. However, their approach was conducted on the bottom-up linearization (see its linearized sequence in Figure 1) and it requires binarizing the parse tree, which induces an ambiguity issue: different binarized trees may correspond to the same original tree. In addition, the bottom-up linearization lacks top-down guidance such as lookahead information, which has been proved to be useful for better prediction (Roark and Johnson 1999; Liu and Zhang 2017b). As a result, their parser is still worse in accuracy than state-of-the-art parsers.

Figure 2: Probabilistic attention (a) and deterministic attention (b). At timestep t = 5, the prefix y_{<5} is available but y_t is unavailable and will be predicted next using context selected by attention. Probabilistic attention aligns y_5 to all input tokens according to a distribution α_t (dotted arrow lines), while deterministic attention aligns y_5 to the phrase P(1, 3), the semi-phrase P(3, ?), and x_3 in a deterministic manner (solid arrow lines).

In this paper, we therefore aim to explore the deterministic attention directly on top of top-down linearization, with the expectation of improving sequence-to-sequence constituency parsing. The proposed deterministic attention is inspired by the following intuition: when linearizing a parse tree in a top-down manner, each token XX represents a known word on the input side (indicated by a dashed line in Figure 1), and thus this output token might deterministically align to that specific input token rather than stochastically align to all input tokens; a small sketch of this alignment appears at the end of this section. Respecting this intuition, we analyze the ideal alignment situations at each decoding timestep and propose a general deterministic attention criterion for context selection. Under this criterion, we propose some simple instances that deterministically specify the alignments between input and output tokens (Section 3). Since our deterministic attention sequentially represents the alignments for a given parse tree, its training still proceeds in a sequential manner and is thus as easy to parallelize as standard sequence-to-sequence parsing.

Empirical experiments demonstrate that the resulting deterministic attention on top-down linearization achieves substantial gains over the model of Ma et al. (2017). Furthermore, with the help of ensembling, the proposed parser is competitive with state-of-the-art RNN parsers (Dyer et al. 2016; Stern, Andreas, and Klein 2017), which need to maintain tree structures and thus are not easy to parallelize for training.

This paper makes the following contributions:

- It analyzes the deterministic attention for sequence-to-sequence parsing on top of top-down linearization, and proposes a simple yet effective model without increasing the training time.
- On both Penn English and Chinese Treebank datasets, intensive experiments show that our parser outperforms several direct sequence-to-sequence baselines, and it achieves 92.3 F-score on the English dataset and 85.4 F-score on the Chinese dataset without reranking or semi-supervised training.
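As mentioned above, here is a hedged sketch of the intuition behind deterministic alignment on a top-down linearized sequence. It is only an illustration under the assumption that every XX token covers exactly one input word; the paper's actual alignment criterion and model instances are defined in Section 3, and the function name is ours.

```python
# Illustrative sketch: on a top-down linearized prefix, every 'XX' token
# corresponds to exactly one input word, so the decoder can align to that
# word (and track the uncovered frontier) deterministically, instead of
# computing a soft distribution over all input positions.

def deterministic_alignments(y_prefix):
    """For each token in the generated prefix, record the input word index
    it aligns to ('XX' tokens align to one word each; bracket tokens get
    None in this simplified illustration), and return the index of the
    next uncovered input word (the frontier)."""
    alignments, covered = [], 0
    for token in y_prefix:
        if token == "XX":
            alignments.append(covered)   # this XX stands for word x_{covered+1}
            covered += 1
        else:
            alignments.append(None)      # '(S', '(NP', ')NP', ... carry no word
    return alignments, covered

# A prefix similar to the one in Figure 2 for "John has a dog .":
aligns, frontier = deterministic_alignments(["(S", "(NP", "XX", "XX"])
print(aligns)    # [None, None, 0, 1]  -> the two XX tokens cover x_1 and x_2
print(frontier)  # 2 -> the next prediction looks at x_3 (cf. P(1, 3), P(3, ?) in Fig. 2b)
```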
## Sequence-to-Sequence Parsing

Suppose $x = x_1, x_2, \ldots, x_{|x|}$ denotes an input sequence with length $|x|$, and $y = y_1, y_2, \ldots, y_{|y|}$ denotes the output sequence, which represents a linearized parse tree of $x$ obtained via a linearization method such as top-down linearization. In Figure 1, $x$ is the sequence "John, has, a, dog, ." and $y$ is its linearized tree sequence "(S, (NP, XX, ..., )S".

Generally, sequence-to-sequence constituency parsing directly maps an input sequence to its linearized parse tree sequence by using a neural machine translation (NMT) model (Vinyals et al. 2015; Bahdanau, Cho, and Bengio 2014). NMT relies on recurrent neural networks under the encoder-decoder framework, which consists of two stages.

In the encoding stage, it applies recurrent neural networks to represent $x$ as a sequence of vectors:

$$h_i = f(e_{x_i}, h_{i-1}),$$

where the hidden unit $h_i$ is a vector of dimension $d$ at timestep $i$, $f$ is a recurrent function such as an LSTM or GRU, and $e_x$ denotes the word embedding of $x$. Suppose the encoding sequence is denoted by $E^x = E^x_1, E^x_2, \ldots, E^x_{|x|}$. Then each $E^x_i$ can be set to $h_i$ from a reversed recurrent neural network (Vinyals et al. 2015) or to the concatenation of the hidden units from bidirectional recurrent neural networks (Ma et al. 2017).

In the decoding stage, it generates a linearized sequence from the conditional probability distribution defined by a recurrent neural network as follows:

$$p(y \mid x; \theta) = \prod_{t=1}^{|y|} p(y_t \mid y_{<t}, x; \theta),$$

where $y_{<t}$ denotes the prefix $y_1, \ldots, y_{t-1}$.
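To ground the encoder equation and the factorized decoder probability above, here is a minimal PyTorch sketch. It follows the bidirectional-encoder variant of $E^x$ (Ma et al. 2017); all dimensions, vocabulary sizes, and class or function names are illustrative assumptions rather than the paper's actual configuration, and the attention mechanism itself is omitted.

```python
# A minimal sketch of the encoding stage: a bidirectional LSTM produces
# E^x as the concatenation of forward and backward hidden states.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # e_x: word embeddings
        self.birnn = nn.LSTM(emb_dim, hidden_dim,
                             batch_first=True, bidirectional=True)

    def forward(self, x):
        # x: LongTensor of word ids, shape (batch, |x|)
        # returns E^x, shape (batch, |x|, 2 * hidden_dim)
        emb = self.embed(x)
        encodings, _ = self.birnn(emb)
        return encodings

# The decoder defines p(y | x; theta) autoregressively; training maximizes
# the sum of log p(y_t | y_<t, x; theta) over the linearized tree sequence.
def sequence_log_prob(step_logits, y):
    # step_logits: (batch, |y|, output_vocab) decoder scores per timestep
    # y: (batch, |y|) gold linearized-tree token ids
    log_probs = torch.log_softmax(step_logits, dim=-1)
    return log_probs.gather(-1, y.unsqueeze(-1)).squeeze(-1).sum(dim=-1)
```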