# RNN-Based Sequence-Preserved Attention for Dependency Parsing

Yi Zhou, Junying Zhou, Lu Liu, Jiangtao Feng, Haoyuan Peng, Xiaoqing Zheng
School of Computer Science, Fudan University, Shanghai, China
Shanghai Key Laboratory of Intelligent Information Processing
{zhouyi13, junyingzhou14, l_liu15, fengjt16, hypeng15, zhengxq}@fudan.edu.cn

Recurrent neural networks (RNNs) combined with an attention mechanism have proved useful for various NLP tasks, including machine translation, sequence labeling and syntactic parsing. The attention mechanism is usually applied by estimating the weights (or importance) of the inputs and taking the weighted sum of the inputs as derived features. Although such features have demonstrated their effectiveness, they may fail to capture sequence information because of the simple weighted sum used to produce them. The order of the words does matter to the meaning and the structure of a sentence, especially for syntactic parsing, which aims to recover the structure from a sequence of words. In this study, we propose an RNN-based attention mechanism to capture relevant and sequence-preserved features from a sentence, and use the derived features to perform dependency parsing. We evaluated graph-based and transition-based parsing models enhanced with the RNN-based sequence-preserved attention on both the English PTB and Chinese CTB datasets. The experimental results show that the enhanced systems achieve significant improvements in parsing accuracy.

## Introduction

Typical approaches to dependency parsing can be divided into graph-based and transition-based parsers. Graph-based parsers (McDonald 2006; Zheng 2017) use a search algorithm to find the best tree under a scoring function over candidate dependency trees, which is usually formulated as the sum of arc scores. Transition-based parsers (Nivre 2004; 2008) view parsing as a sequence of transitions that incrementally build a dependency tree, with each transition selected by a classifier at each step of the parsing process. Graph-based parsers are globally trained, taking into account first-order or higher-order features defined on the graph or subgraphs for later decoding (McDonald, Crammer, and Pereira 2005), but those features are defined over a limited history of parsing decisions. By contrast, transition-based parsers are trained locally and use greedy inference algorithms, but they can define features over a rich history of parsing decisions. In short, the features used in the two paradigms differ substantially.

In recent years, deep neural networks have been widely used for feature generation and selection, automatically producing relevant representations for the task of interest (Chen and Manning 2014). Dyer et al. (2015) first employed recurrent neural networks for dependency parsing, since the recurrent structure is good at bridging long time lags between related inputs, which is a great advantage for detecting and representing long-distance dependencies (Dyer et al. 2015; Kiperwasser and Goldberg 2016b; Cheng et al. 2016; Dozat and Manning 2016; Zhang, Cheng, and Lapata 2017). However, recurrent neural networks may suffer from the phenomenon that inputs at early time steps tend to have less influence on the final outputs.
To overcome this shortcoming, the attention mechanism has been proposed to emphasize the relevant features. The attention mechanism has gained popularity for its ability to learn alignments between two different modalities (Luong, Pham, and Manning 2015). It was first introduced in the area of machine translation (Bahdanau, Cho, and Bengio 2014), learning to align words between the source language and the target one. It has since been applied to dialog generation (Shang, Lu, and Li 2015) by automatically aligning information in the response with that in the post, and to reading comprehension (Seo et al. 2016) by aligning keywords in the query with those in the context to find the answer. However, unlike encoder-decoder tasks such as machine translation and dialog generation, dependency parsing is a rather different task. A key difference is that in encoder-decoder tasks there exists a strong one-to-one relationship between the source modality and the target one, measured by semantic relatedness (often similarity) between two words. In contrast, when searching for the head of a word in dependency parsing, many different parts of the sentence should be taken into consideration, and the sequential order among these parts is one of the most important factors.

In this work, we propose a novel approach that captures helpful information sequentially, unlike the parallel style of vanilla attention. During this sequence-preserved procedure, the word of interest is first provided as initial information to generate a prior representation; then a sequence of words is processed in order by an attention-based network. The attention score and attentional representation at each time step depend on those at the previous step, thus preserving the sequence information.

The sequence-preserved attention mechanism can improve the performance of both graph-based and transition-based parsers. Generally speaking, features for graph-based parsers are not sufficiently rich, and features for transition-based parsers tend to lack global information. Word representations with rich non-local features (Zhang and Nivre 2011) may help both of them gain accuracy, which is where the sequence-preserved attention can contribute. Specifically, this mechanism is designed to encode each word in a sentence with recurrent neural networks and sequence-preserved attention. The parser first encodes each word into a context-specific representation with a bi-directional recurrent neural network; then it captures information closely related to a given word from the context with another bi-directional recurrent neural network. This approach is inspired by how a reader first scans a sentence to grasp its overall meaning, and then, to pick the head of a word, rescans the sentence back and forth around that word to collect context information. We use the generated features to train a graph-based parser and a transition-based parser respectively. Experimental results show that these features help the transition-based parser achieve state-of-the-art performance, while the graph-based parser also benefits from them and achieves results close to the state of the art among graph-based approaches. We call the model that uses this mechanism the Sequence-Preserved Attention Dependency Parser (SPA-DP). The source code is available at https://github.com/dugu9sword/spa-dp.
## Related Work

### Dependency Parsing

Our model is closely related to (Kiperwasser and Goldberg 2016b), which proposes a general mechanism to learn representations for dependency parsing in the two paradigms: graph-based and transition-based parsing. One significant difference is that we treat graph-based parsing as a procedure of greedy head selection, like (Zhang, Cheng, and Lapata 2017) and (Dozat and Manning 2016), rather than maximum-spanning-tree generation from a score matrix (McDonald 2006; Zheng 2017), because our experiments indicate that the former method can outperform the latter.

For transition-based dependency parsing, this work differs from recent work in that it learns to integrate rich global features into local word representations (Zhang and Nivre 2011), while still making transition decisions greedily. This keeps the method simple and efficient, whereas most recent work focuses on correcting the errors made by locally optimal decisions. For example, researchers have made great efforts to overcome the shortcomings of greedy search by employing global learning mechanisms, such as designing dynamic oracles or more robust transition systems (Goldberg and Nivre 2012; Qi and Manning 2017), integrating beam search (Zhang and Nivre 2012; Zhou et al. 2015), utilizing structured learning (Weiss et al. 2015; Vaswani and Sagae 2016), augmenting data (Alberti, Weiss, and Petrov 2015), keeping a deeper parsing history (Dyer et al. 2015), minimizing a conditional-random-field objective (Andor et al. 2016), etc. Experiments show that our model can surpass all of the above models by learning global representations for each word in a sentence, and it can still be combined with globally optimized decision-learning methods.

Apart from the two mainstream parsing methodologies above, some other variants have been introduced. An easy-first approach builds the parse tree in a bottom-up way (Goldberg and Elhadad 2010; Kiperwasser and Goldberg 2016a). Several ensemble methods have been proposed to overcome the shortcomings of both graph-based and transition-based approaches, such as adding the parsing results of one parser as guide information for another (Nivre and McDonald 2008; Zhang and Clark 2008; Bohnet and Kuhn 2012), or selecting the best parse tree with a re-ranking mechanism (Zhu et al. 2015).

### Attention Mechanism

The attention mechanism was first introduced in the area of machine translation (Bahdanau, Cho, and Bengio 2014) and has been successfully applied to dialog generation (Shang, Lu, and Li 2015), reading comprehension (Seo et al. 2016), etc. Our model is closest to the work of (Cheng et al. 2016), which first applied the attention mechanism to dependency parsing. However, our approach differs in three aspects: first, our work focuses on learning global representations of words, while theirs employs attention to model parsing history; second, their work selects candidate heads from two directions separately and then votes for the best (which they call "bi-directional agreement"), while ours selects the head directly from the bi-directional word representation, which is both faster and more accurate; finally, our work can also be applied to transition-based parsing, while theirs cannot.
## Bi-RNN Based Dependency Parsing

In dependency parsing, recurrent neural networks have proven useful for bridging long time lags between related inputs, which is a great advantage in finding long-distance dependencies. In this section, the feature representation of each word is modeled by bi-directional recurrent neural networks. These feature representations are then applied to a transition-based parser and a graph-based parser respectively.

### Word Feature Representations

Given a fixed-size word dictionary $D$, the vector representations are stored in a word embedding matrix $E^{word} \in \mathbb{R}^{d_{word} \times |D|}$, where $d_{word}$ is the dimensionality of the vector space and $|D|$ is the size of the dictionary. Analogously, the part-of-speech (POS) tags are mapped to a $d_{pos}$-dimensional vector space represented by $E^{pos} \in \mathbb{R}^{d_{pos} \times |P|}$, where $d_{pos}$ is the dimensionality of the vector space and $|P|$ is the size of the tag set.

Let $S = \omega_{0:N}$ denote a sentence of length $N$, with $\omega_0$ denoting the artificial ROOT node. We use $T = t_{0:N}$ as the token embeddings, where $t_i$ represents $\omega_i$. The token embedding is computed as

$$t_i = E^{word} e^{word}_i \oplus E^{pos} e^{pos}_i \tag{1}$$

where $e^{word}_i$ and $e^{pos}_i$ are the one-hot representations of the $i$-th word and its part-of-speech tag, $E^{word}$ and $E^{pos}$ are parameters to be learned that store the continuous embeddings of the corresponding features, and $\oplus$ is the concatenation operation.

In our case, a recurrent neural network (RNN) is abstracted as a parameterized function $RNN_\theta(x_{1:n})$, or $RNN_\theta(x_1, x_2, \ldots, x_n)$, mapping a sequence of $n$ input vectors $x_{1:n}$ to a sequence of $n$ output vectors $o_{1:n}$. Each output vector $o_i$ can be thought of as a summary of $x_{1:i}$ since it is conditioned on them. We first use a forward RNN ($RNN^F$) to process the sentence from left to right and then use a backward RNN ($RNN^B$) to process it from right to left:

$$h^F_i = RNN^F(t_{0:i}), \qquad h^B_i = RNN^B(t_{i:N}) \tag{2}$$

Then each word $\omega_i$ in the sentence can be represented by concatenating the representations from the bi-directional RNNs:

$$h_i = h^F_i \oplus h^B_i \tag{3}$$

### Training Objective

**Transition-Based Dependency Parsing** In a transition-based parser, the framework assumes a transition system that processes sentences and produces parse trees (Nivre 2008). A transition system defines two components: a set of configurations and a set of actions that are applied to the configurations. When a sentence is parsed, the system is initialized to an initial configuration, and a sequence of actions is then taken to change the configuration repeatedly. The transition system reaches a terminal configuration after a finite number of steps, and the parse tree is generated. In this paper, we examine only greedy parsing, which uses a classifier to predict the correct transition based on features extracted from the configuration.

We employ the arc-standard system (Nivre 2004), whose configuration $c = (\sigma, \beta, A)$ consists of a stack $\sigma$, a buffer $\beta$ and a set of dependency arcs $A$. For the sequence $S = \omega_{0:N}$, the initial and terminal configurations are defined in Table 1, and $A_c$ is returned as the parse tree.

Table 1: Special configurations of an arc-standard system

| Component | Initial State | Terminal State |
|---|---|---|
| $\sigma$ | $[\omega_0]$ | $[\omega_0]$ |
| $\beta$ | $[\omega_1, \omega_2, \ldots, \omega_N]$ | $[\,]$ |
| $A$ | $\emptyset$ | $A_c$ |

Three types of transitions are defined:

$$\begin{aligned}
\mathrm{SHIFT}[(\sigma,\, b_1|\beta,\, A)] &= (\sigma|b_1,\, \beta,\, A) \\
\mathrm{ARC}_L[(\sigma|s_2|s_1,\, \beta,\, A)] &= (\sigma|s_2,\, \beta,\, A \cup \{(s_2 \rightarrow s_1)\}) \\
\mathrm{ARC}_R[(\sigma|s_2|s_1,\, \beta,\, A)] &= (\sigma|s_1,\, \beta,\, A \cup \{(s_1 \rightarrow s_2)\})
\end{aligned}$$

where $s_1$ and $s_2$ denote the first and second words on the stack respectively, and $b_1$ denotes the first word on the buffer.
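As a concrete reference before turning to the classifier, here is a minimal Python sketch of the arc-standard system described above. It handles unlabeled arcs only, follows the transition semantics given in the text (the element removed from the stack becomes the dependent), and uses illustrative names that are not taken from the released SPA-DP code.

```python
# Minimal sketch of the arc-standard transition system (unlabeled arcs only).
class ArcStandardConfig:
    def __init__(self, n_tokens):
        # Initial state: sigma = [w0 (ROOT)], beta = [w1, ..., wN], A = {}
        self.stack = [0]                        # sigma, word indices
        self.buffer = list(range(1, n_tokens))  # beta
        self.arcs = set()                       # A, pairs (head, dependent)

    def shift(self):
        # SHIFT: move the first buffer word b1 onto the stack.
        self.stack.append(self.buffer.pop(0))

    def arc_l(self):
        # ARC_L: pop s1 and add the arc (s2 -> s1).
        s1 = self.stack.pop()
        s2 = self.stack[-1]
        self.arcs.add((s2, s1))

    def arc_r(self):
        # ARC_R: remove s2 and add the arc (s1 -> s2).
        s1 = self.stack.pop()
        s2 = self.stack.pop()
        self.stack.append(s1)
        self.arcs.add((s1, s2))

    def is_terminal(self):
        # Terminal state: only ROOT left on the stack and an empty buffer.
        return len(self.stack) == 1 and not self.buffer
```

At parsing time, the classifier described next picks one of the three transitions for the current configuration until `is_terminal()` holds, and `arcs` is returned as $A_c$.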
Given a sentence $S = \omega_{0:N}$, let $R = r_{0:N}$ be the vector representations of its words. Given a configuration $c = (\sigma, \beta, A)$, we concatenate the representations of the first three words on the stack and the first word on the buffer as a representation of the configuration:

$$\phi(c) = r_{s_3} \oplus r_{s_2} \oplus r_{s_1} \oplus r_{b_1} \tag{4}$$

We build a standard neural network with one hidden layer as the classifier that chooses the proper action for the configuration $c$:

$$y = W_{t2}\, \max(0,\, W_{t1}\, \phi(c) + b_{t1}) + b_{t2} \tag{5}$$

$$P = \mathrm{softmax}(y) \tag{6}$$

where $W_{t1} \in \mathbb{R}^{d_h \times 4 d_r}$, $b_{t1} \in \mathbb{R}^{d_h}$, $W_{t2} \in \mathbb{R}^{3 \times d_h}$ and $b_{t2} \in \mathbb{R}^{3}$. The final output $P$ gives the probabilities of the three actions available in the current configuration. The training objective is to minimize the cross-entropy loss

$$L(\theta) = -\frac{1}{|Q|} \sum_{S \in Q} \sum_{i=1}^{|C_S|} \ln P_{a_i} \tag{7}$$

where $Q$ is the dataset, $C_S$ is the sequence of configurations of the transition system generated from the sentence $S$, and $a_i$ is the gold action for the configuration $C_S^i$.

**Graph-Based Dependency Parsing** We formulate our dependency parser as head selection, following (Zhang, Cheng, and Lapata 2017; Dozat and Manning 2016; Hashimoto et al. 2016). Given a sentence $S = \omega_{0:N}$ with vector representations $R = r_{0:N}$, we define a function $s(\cdot, \cdot)$ that maps a candidate word pair to a score measuring how likely it is that $\omega_j$ is the head of $\omega_i$. Similar to the attention mechanism defined in (Luong, Pham, and Manning 2015), we consider two alternatives:

$$s(r_j, r_i) = \begin{cases} r_j^{\top} W r_i & \text{bi-linear} \\ v^{\top} \max(0,\, W (r_j \oplus r_i) + b) & \text{concat} \end{cases} \tag{8}$$

where $v$, $W$ and $b$ are parameters to be learned. The probability that a word $\omega_j$ is the head of another word $\omega_i$ can be computed by normalizing the score function with respect to all possible heads:

$$P(\omega_j \mid \omega_i) = \frac{e^{s(r_j, r_i)}}{\sum_{k=0}^{N} e^{s(r_k, r_i)}} \tag{9}$$

The model is trained by minimizing the cross entropy of the gold head-modifier pairs in the training set:

$$L(\theta) = -\frac{1}{|Q|} \sum_{S \in Q} \sum_{i=1}^{N} \ln P(\omega_{H(S,i)} \mid \omega_i) \tag{10}$$

where $Q$ is the dataset and $H(S, i)$ is the index of the gold head of the $i$-th word in the sentence $S$. Afterwards, we apply the Chu-Liu-Edmonds algorithm to make sure that the generated graph is a tree (Chu 1965).
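To make the head-selection formulation concrete, the PyTorch sketch below implements the bi-linear variant of Eq. (8) together with the softmax of Eq. (9) and the loss of Eq. (10). Module and variable names are ours rather than from the released code, the feature dimensionality in the toy example is arbitrary, and the Chu-Liu-Edmonds post-processing is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearHeadSelector(nn.Module):
    """Bi-linear variant of Eq. (8): s(r_j, r_i) = r_j^T W r_i."""

    def __init__(self, repr_dim):
        super().__init__()
        self.W = nn.Parameter(torch.empty(repr_dim, repr_dim))
        nn.init.xavier_uniform_(self.W)

    def forward(self, R):
        # R: (N+1, d) word representations; index 0 is the ROOT node.
        # scores[i, j] = r_j^T W r_i = score of w_j being the head of w_i.
        return (R @ self.W @ R.t()).t()

def head_selection_loss(scores, gold_heads):
    # Eqs. (9)-(10): softmax over candidate heads, then the negative
    # log-likelihood of the gold head of each word (ROOT itself is skipped).
    log_p = F.log_softmax(scores, dim=1)      # row i holds P(w_j | w_i)
    return F.nll_loss(log_p[1:], gold_heads)  # averaged over words 1..N

# Toy usage: ROOT plus 5 words with (arbitrary) 800-dimensional representations.
R = torch.randn(6, 800)
gold = torch.tensor([0, 1, 1, 3, 3])          # toy gold heads H(S, 1..5)
scorer = BilinearHeadSelector(800)
loss = head_selection_loss(scorer(R), gold)
loss.backward()
```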
## Sequence-Preserved Attention

We propose a sequence-preserved attention mechanism inspired by a two-step procedure for understanding a word in a sentence: SCAN and RESCAN. In the SCAN step, the sentence is scanned to grasp its overall meaning, while in the RESCAN step, the sentence is rescanned from the word to its boundary, or in the reverse direction. In this section, we first give a brief introduction to vanilla attention (Bahdanau, Cho, and Bengio 2014; Luong, Pham, and Manning 2015). Then we propose the sequence-preserved attention mechanism for computing feature representations of the words in a sequence, where the attentional representation of each word is computed sequentially.

### The Scan Step

As in the section Bi-RNN Based Dependency Parsing, given a sentence $S = \omega_{0:N}$, each word $\omega_i$ is first transformed to a vector representation $t_i$. The SCAN step can be modeled by a bi-directional RNN to obtain the feature representation of a word:

$$\mathrm{SCAN}(t_{[0:N]})_i = RNN^F(t_{0:i}) \oplus RNN^B(t_{i:N}) \tag{11}$$

We denote the results of the SCAN step as $h_{0:N}$, where $h_i = \mathrm{SCAN}(t_{[0:N]})_i$.

### Vanilla Attention

Given a sentence $\omega_{0:N}$ and a word $\omega_i$, the attention mechanism is usually employed to select the words $\omega_j$, $j \in \{0, \ldots, N\}$, relevant to $\omega_i$. $\omega_i$ is transformed to a vector representation $h_i$ during the SCAN step. The relevance between any word $\omega_j$ and $\omega_i$ is measured by a relevance score $s_{ij}$, which is normalized to derive a probability distribution over the sentence, namely the attention $a_{ij}$. The procedure is illustrated in Figure 1.

$$s_{ij} = \mathrm{MLP}(h_i \oplus h_j) \tag{12}$$

$$a_{ij} = \frac{e^{s_{ij}}}{\sum_{k=0}^{N} e^{s_{ik}}} \tag{13}$$

where MLP is a multi-layer perceptron. The derived attention can then be used to compute a summary of the words in the sentence relevant to the given word $\omega_i$ by a weighted average, namely the attentional representation $m_i$:

$$m_i = \sum_{j=0}^{N} a_{ij}\, h_j \tag{14}$$

The attentional representation $m_i$ extracts from the memory units the features relevant to the query, which can be applied to downstream tasks. In dependency parsing, the attentional representation can be used as the feature of a word when selecting its head.

### Sequence-Preserved Attention

Given a sequence $b_{1:N}$ and a word $k$, the sequence-preserved attention mechanism is employed to select the elements $b_i$, $i \in \{1, \ldots, N\}$, relevant to $k$. Each $b_i$ is transformed to a vector representation $v_i$ during the SCAN step, and $k$ is transformed to a vector representation $u$. The sequence-preserved attention mechanism first processes the word $k$ to obtain the initial attentional representation $m_0$, based on which $v_{[1:N]}$ are processed sequentially to obtain the corresponding attentional representations. For the $i$-th element of the sequence, a local attentional representation $\tilde{m}_i$ is computed as a summary of both the current input $v_i$ and the previous attentional representation $m_{i-1}$, and a relevance score $s_i$ is computed to measure the relevance between $v_i$ and $m_{i-1}$. The relevance score $s_i$ is then normalized into the attention $a_i$, which controls the weights with which $m_{i-1}$ and $\tilde{m}_i$ are averaged. The procedure is illustrated in Figure 2.

$$s_i = \begin{cases} G_\lambda(m_{i-1}, v_i) & i > 0 \\ G_\lambda(0, u) & i = 0 \end{cases} \tag{15}$$

$$a_i = \mathrm{sigmoid}(s_i) \tag{16}$$

$$m_i = \begin{cases} (1 - a_i)\, m_{i-1} + a_i\, \tilde{m}_i & i > 0 \\ a_0\, \tilde{m}_0 & i = 0 \end{cases} \tag{17}$$

$$\tilde{m}_i = \begin{cases} G_\mu(m_{i-1}, v_i) & i > 0 \\ G_\mu(0, u) & i = 0 \end{cases} \tag{18}$$

where $G_\lambda$ is a function that decides how much relevance score should be allocated to the specific element $b_i$, taking the last attentional representation $m_{i-1}$ into consideration, and $G_\mu$ generates the local attentional representation $\tilde{m}_i$ at the current time step, combining the current input $v_i$ with the last attentional representation $m_{i-1}$. The sequence-preserved attentional representation $m_i$ is a weighted average of the last attentional representation $m_{i-1}$ and the local attentional representation $\tilde{m}_i$, where the attention $a_i$ derived from $G_\lambda$ serves as the weight. Since the attention scores and attentional representations are defined recursively, the final attentional representation of the word with respect to the sequence is $m_N$:

$$\mathrm{SPA}(u, v_1, v_2, \ldots, v_N) = m_N \tag{19}$$

The sequence-preserved attention tries to locate the relevant words in a sentence sequentially. Coincidentally, this design fits the gated recurrent unit (GRU) well (Bahdanau, Cho, and Bengio 2014). To make the definitions of the GRU and of sequence-preserved attention match explicitly, we can define $G_\lambda$ and $G_\mu$ as follows:

$$G_\tau(m_{i-1}, v_i) = \mathrm{sigmoid}(\mathrm{MLP}_\tau(m_{i-1} \oplus v_i)) \tag{20}$$

$$G_\lambda(m_{i-1}, v_i) = \mathrm{sigmoid}(\mathrm{MLP}_\lambda(m_{i-1} \oplus v_i)) \tag{21}$$

$$G_\mu(m_{i-1}, v_i) = \mathrm{MLP}_\mu(v_i \oplus (G_\tau(m_{i-1}, v_i) \odot m_{i-1})) \tag{22}$$

Here the attentional representation $m_i$ serves as the hidden state, $G_\tau$ serves as a reset gate that controls how much information should be reset, and the attention score $s_i$ (through $a_i = \mathrm{sigmoid}(s_i)$) serves as an update gate that controls how much information from the last state flows into the current state. Since the GRU matches the sequence-preserved attention to this extent, we employ a GRU as the implementation of sequence-preserved attention, so the attentional representation of a word with respect to a sequence can be computed as

$$m_b = \mathrm{SPA}(u, v_1, v_2, \ldots, v_N) = \mathrm{GRU}(u, v_1, v_2, \ldots, v_N) \tag{23}$$
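As a concrete illustration of Eq. (23), the following PyTorch sketch realizes SPA with a GRU: the query vector $u$ is fed as the first input, the context vectors follow in order, and the final hidden state is taken as the attentional representation. The module and variable names are illustrative, and the zero initialization of the hidden state is our assumption rather than a detail stated in the paper.

```python
import torch
import torch.nn as nn

class SequencePreservedAttention(nn.Module):
    """GRU realization of SPA(u, v_1, ..., v_N) = GRU(u, v_1, ..., v_N), Eq. (23)."""

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True)

    def forward(self, u, vs):
        # u:  (batch, input_dim)           vector of the word of interest (the query)
        # vs: (batch, seq_len, input_dim)  context vectors, already in the desired scan order
        seq = torch.cat([u.unsqueeze(1), vs], dim=1)  # the query is the first input
        _, m_final = self.gru(seq)                    # final hidden state = m_N
        return m_final.squeeze(0)                     # (batch, hidden_dim)

# Toy usage: one query attending over 7 context vectors, batch of 2.
spa = SequencePreservedAttention(input_dim=200, hidden_dim=100)
u = torch.randn(2, 200)
vs = torch.randn(2, 7, 200)
m = spa(u, vs)   # (2, 100)
```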
[Figure 1: Architecture of the three models: the vanilla attention model (vanAtt), the SPA model from target to boundary (ToB-SPA), and the SPA model from boundary to target (BoT-SPA). Each rectangle denotes an LSTM cell and each circle a vector; dedicated symbols mark element-wise addition, element-wise multiplication, and concatenation.]

[Figure 2: The procedure for calculating sequence-preserved attention.]

### The Rescan Step with Sequence-Preserved Attention

In the RESCAN step, the sentence is rescanned from the word $\omega_i$ to its boundary, or in the reverse direction. $\omega_i$ is transformed to a vector representation $h_i$ as in the section Bi-RNN Based Dependency Parsing.

$$\mathrm{RESCAN}(h_{[0:N]})_i = \mathrm{RESCAN}_L(h_i, h_{<i}) \oplus \mathrm{RESCAN}_R(h_i, h_{>i}) \tag{24}$$

where $\mathrm{RESCAN}_L$ denotes the procedure of scanning the part of the sentence to the left of the word and $\mathrm{RESCAN}_R$ denotes the procedure of scanning the part to its right. Imitating the two different rescanning behaviors, we derive two variants of the sequence-preserved attention, as in Figure 1:

**Target to Boundary (ToB)** To gather valuable information around a word, scan from the word to the boundary of the sentence, dropping useless information and fetching usable information incrementally. We will refer to this as ToB-SPA later. This procedure can be formulated as:

$$\begin{aligned}
\mathrm{RESCAN}_L(h_i, h_{<i}) &= \mathrm{GRU}(h_i, h_{i-1}, \ldots, h_0) \\
\mathrm{RESCAN}_R(h_i, h_{>i}) &= \mathrm{GRU}(h_i, h_{i+1}, \ldots, h_N)
\end{aligned} \tag{25}$$

**Boundary to Target (BoT)** An RNN tends to put more weight on inputs closer to the current time step, and in dependency parsing, words closer to a given word have more effect on head selection. So we reverse the procedure of ToB and scan the sentence bi-directionally with the word as the first input. We will refer to this procedure as BoT-SPA later. It can be formulated as:

$$\begin{aligned}
\mathrm{RESCAN}_L(h_i, h_{<i}) &= \mathrm{GRU}(h_i, h_0, h_1, \ldots, h_{i-1}) \\
\mathrm{RESCAN}_R(h_i, h_{>i}) &= \mathrm{GRU}(h_i, h_N, h_{N-1}, \ldots, h_{i+1})
\end{aligned} \tag{26}$$

By concatenating the scanning result $\mathrm{SCAN}(t_{[0:N]})_i$ (the word feature representation) and the rescanning result $\mathrm{RESCAN}(h_{[0:N]})_i$ (the attentional representation), we derive the final representation of a word:

$$r_i = \mathrm{SCAN}(t_{[0:N]})_i \oplus \mathrm{RESCAN}(h_{[0:N]})_i \tag{27}$$

which is used as the word representation for dependency parsing as in the section Bi-RNN Based Dependency Parsing.
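To show how the two rescan orders differ in practice, here is a small PyTorch sketch that builds the ToB and BoT input orderings for the word at position $i$ and feeds them to two SequencePreservedAttention modules as sketched earlier. The function and argument names are hypothetical; the released SPA-DP code may organize this differently.

```python
import torch

def rescan(h, i, spa_left, spa_right, mode="BoT"):
    """RESCAN(h)_i = RESCAN_L(h_i, h_<i) (+) RESCAN_R(h_i, h_>i), Eqs. (24)-(26).

    h:    (N+1, d) SCAN outputs for one sentence (index 0 is ROOT)
    i:    position of the word of interest
    spa_left, spa_right: two SequencePreservedAttention modules (see earlier sketch)
    mode: "ToB" scans from the target word out to the boundaries,
          "BoT" scans from the boundaries back toward the target word.
    """
    left, right = h[:i], h[i + 1:]
    if mode == "ToB":
        left = torch.flip(left, [0])    # h_{i-1}, ..., h_0; right stays h_{i+1}, ..., h_N
    else:                               # "BoT"
        right = torch.flip(right, [0])  # h_N, ..., h_{i+1}; left stays h_0, ..., h_{i-1}
    u = h[i].unsqueeze(0)                         # the target word is the query
    m_left = spa_left(u, left.unsqueeze(0))       # RESCAN_L
    m_right = spa_right(u, right.unsqueeze(0))    # RESCAN_R
    return torch.cat([m_left, m_right], dim=-1).squeeze(0)
```

Concatenating this rescan output with the SCAN output $h_i$ then gives the final word representation $r_i$ of Eq. (27).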
## Experiments

We report results on the English Penn Treebank (PTB) and the Chinese Penn Treebank (CTB). We follow the standard split of the PTB, with sections 2-21 for training, section 22 for development and section 23 for testing. For the English PTB, the POS tags are generated by the Stanford POS tagger (Toutanova et al. 2003), while for the CTB, gold word segmentation and POS tags are used.

### Implementations and Hyper-Parameter Choices

The model is implemented with the PyTorch deep learning framework (https://pytorch.org). All experiments were run on a computer equipped with an Intel Core i7 processor, 8 GB of RAM and an NVIDIA GTX-1070 GPU.

Hyper-parameters are tuned on the PTB 3.3.0 development set. For the RNN unit used for the word feature representation, we find that the LSTM cell outperforms the GRU cell owing to its stronger modeling ability. For the graph-based parser, the hidden size of the bi-RNN is set to 400 and the hidden size of the RNN for sequence-preserved attention is set to 100; for the transition-based parser, the hidden size of both the bi-RNN and the RNN for sequence-preserved attention is set to 700. Among the different kinds of pre-trained word embeddings, the networks converge better with GloVe (Pennington, Socher, and Manning 2014) embeddings than with CBOW or Skip-gram (Mikolov et al. 2013) embeddings. The dimension of the word vectors is 300 and the dimension of the tag vectors is 50. We train the model with a dropout rate of 30% on both the word embeddings and the tag embeddings. For the score function of the graph-based parser, experiments show that the bi-linear mode outperforms the concat mode. The model is trained with a batch size of 32 using the Adam optimizer (Kingma and Ba 2014).

We report the UAS of the simple bi-RNN model, the vanilla attention model, ToB-SPA and BoT-SPA on the English PTB in Table 2. The results show that the vanilla attention model beats the simple bi-RNN model, and our BoT-SPA model outperforms almost all the other models.

Table 2: Parsing with different attention mechanisms (UAS on the English PTB)

| Model (transition-based) | UAS |
|---|---|
| BiRNN | 94.02 |
| BiRNN + vanilla attention | 94.33 (+0.31) |
| BiRNN + ToB-SPA | 94.25 (+0.23) |
| BiRNN + BoT-SPA | 94.72 (+0.70) |

| Model (graph-based) | UAS |
|---|---|
| BiRNN | 94.10 |
| BiRNN + vanilla attention | 94.30 (+0.20) |
| BiRNN + ToB-SPA | 94.50 (+0.40) |
| BiRNN + BoT-SPA | 94.79 (+0.69) |

We also report the results on the English PTB and the Chinese CTB in Tables 3 and 4 respectively. Our parser reaches state-of-the-art results among transition-based parsers. It is worth noting that although the state-of-the-art parser on the English PTB (Kuncoro et al. 2016) reaches an accuracy of nearly 95.80%, it should be regarded as a framework that is not directly comparable with other transition-based parsers, since it has access to the original phrase-structure annotation, where conjunctions can be parsed more correctly. We therefore do not consider it a direct competitor, and our work exceeds all the other parsers on the English PTB. In graph-based dependency parsing, our work also beats most previous work while remaining simple and efficient.

Table 3: Results on the English PTB dataset

| Model (transition-based) | UAS | LAS |
|---|---|---|
| (Chen and Manning 2014) | 92.00 | 91.80 |
| (Zhu et al. 2015) | 94.16 | - |
| (Kiperwasser and Goldberg 2016b) | 93.90 | 91.90 |
| (Andor et al. 2016) | 94.61 | 92.79 |
| SPA-DP | 94.72 | 92.57 |

| Model (graph-based) | UAS | LAS |
|---|---|---|
| (Kiperwasser and Goldberg 2016b) | 93.00 | 90.90 |
| (Hashimoto et al. 2016) | 94.67 | 92.90 |
| (Zhang, Cheng, and Lapata 2017) | 94.10 | 91.90 |
| (Dozat and Manning 2016) | 95.74 | 94.08 |
| (Zheng 2017) | 95.53 | 93.94 |
| SPA-DP | 94.79 | 92.61 |

Table 4: Results on the Chinese CTB dataset

| Model (transition-based) | UAS | LAS |
|---|---|---|
| (Chen and Manning 2014) | 83.90 | 82.40 |
| (Andor et al. 2016) | 84.72 | 80.85 |
| (Ballesteros et al. 2016) | 87.65 | 86.21 |
| (Kiperwasser and Goldberg 2016b) | 87.60 | 86.10 |
| SPA-DP | 88.15 | 85.51 |

| Model (graph-based) | UAS | LAS |
|---|---|---|
| (Kiperwasser and Goldberg 2016b) | 87.10 | 85.50 |
| (Cheng et al. 2016) | 88.10 | 85.70 |
| (Dozat and Manning 2016) | 89.30 | 88.23 |
| (Zheng 2017) | 89.42 | 87.94 |
| SPA-DP | 88.04 | 85.40 |

### Error Analysis

To characterize the errors made by the parsers and the improvement in accuracy obtained by introducing sequence-preserved attention, we present some analysis of accuracy with respect to sentence length and linguistic factors such as part-of-speech tags. All analyses are conducted on the unlabeled attachment results of the graph-based parser on the English PTB test set.

[Figure 3: Accuracy with respect to sentence length for BoT-SPA, ToB-SPA, vanAtt and BiRNN.]

**Length Factors** It is well known that parsers tend to have lower accuracy on long-distance dependencies. Figure 3 shows the accuracies of the different models with respect to sentence length.
A dependency parser adopting an attention mechanism clearly outperforms the baseline model by a margin in accuracy. In general, the SPA models obtain better results than the vanilla attention model for most sentence lengths, and BoT-SPA is better than ToB-SPA, since the dependency between two words is affected more by nearby words than by distant ones. The ToB-SPA model uses an RNN to "read" words far into the distance, most of which are irrelevant and may act more as noise. Conspicuously, the accuracy of the BoT-SPA model stays high and even increases slightly as the sentence length grows from 25 to 50, while the other models drop drastically. This shows that our sequence-preserved attention mechanism is good at grasping long-distance relationships.

[Figure 4: Accuracy with respect to POS tags (adverbs, conjunctions, pronouns, nouns, adjectives, verbs).]

**Linguistic Factors** We distinguish nouns, verbs, pronouns, adjectives, adverbs and conjunctions, as in (McDonald and Nivre 2007). We compute the accuracy with respect to the different POS tags on the test set and compare the four parsers in Figure 4. The BoT-SPA model outperforms almost all other models on most categories.

**Nouns & Pronouns** Typically, nouns and pronouns are attached to verbs and sit lower in the parse tree, so their correct heads tend to be easy to find. All four parsers perform well, but the models with vanilla attention or sequence-preserved attention still surpass the baseline model.

**Verbs** In a parse tree, verbs tend to be closer to the root and thus have longer-distance dependencies than nouns and pronouns, which makes them a little more difficult to parse.

**Adjectives** Adjectives are the furthest from the root node and have few siblings, so in theory they should be the easiest to parse, but the results are rather the opposite. One possible explanation is that the graph-based parser is globally normalized and pays more attention to long-distance dependencies, causing a decrease in accuracy on short-distance dependencies.

**Conjunctions & Adverbs** Adverbs and conjunctions often have long dependency lengths, so all four parsers perform rather poorly on them. Although head detection is difficult here, a clear increase in accuracy can still be seen in the results: the SPA model improves accuracy over the BiRNN baseline by about 1.37% and exceeds vanilla attention by about 0.44%. This can be viewed as strong evidence that sequence-preserved attention models long-distance dependencies better than vanilla attention. Without extra information, vanilla attention may struggle to choose the correct head when the dependency is long.

## Conclusions

We have described an RNN-based sequence-preserved attention method that captures global and relevant information for each word of the input sentence being parsed. The proposed method tries to overcome the shortcoming of the vanilla attention mechanism that it neglects the relative order in which features occur. For each parse unit, the sequence-preserved attention captures rich and global features from the input sentence while remaining sensitive to the order of the words. The experimental results show that the features extracted by the sequence-preserved attention benefit both graph-based and transition-based parsing systems, with significant improvements across two different languages.
## Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments. This work was partly supported by a grant from the Shanghai Municipal Natural Science Foundation (No. 15511104303).

## References

Alberti, C.; Weiss, D.; and Petrov, S. 2015. Improved transition-based parsing and tagging with neural networks. In Conference on Empirical Methods in Natural Language Processing.
Andor, D.; Alberti, C.; Weiss, D.; Severyn, A.; Presta, A.; Ganchev, K.; Petrov, S.; and Collins, M. 2016. Globally normalized transition-based neural networks. arXiv preprint arXiv:1603.06042.
Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Ballesteros, M.; Goldberg, Y.; Dyer, C.; and Smith, N. A. 2016. Training with exploration improves a greedy stack-LSTM parser. arXiv preprint arXiv:1603.03793.
Bohnet, B., and Kuhn, J. 2012. The best of both worlds: A graph-based completion model for transition-based parsers. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL '12.
Chen, D., and Manning, C. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.
Cheng, H.; Fang, H.; He, X.; Gao, J.; and Deng, L. 2016. Bi-directional attention with agreement for dependency parsing. arXiv preprint arXiv:1608.02076.
Chu, Y.-J. 1965. On the shortest arborescence of a directed graph. Science Sinica.
Dozat, T., and Manning, C. D. 2016. Deep biaffine attention for neural dependency parsing. arXiv preprint arXiv:1611.01734.
Dyer, C.; Ballesteros, M.; Ling, W.; Matthews, A.; and Smith, N. A. 2015. Transition-based dependency parsing with stack long short-term memory. arXiv preprint arXiv:1505.08075.
Goldberg, Y., and Elhadad, M. 2010. An efficient algorithm for easy-first non-directional dependency parsing. In NAACL. Association for Computational Linguistics.
Goldberg, Y., and Nivre, J. 2012. A dynamic oracle for arc-eager dependency parsing. In COLING.
Hashimoto, K.; Xiong, C.; Tsuruoka, Y.; and Socher, R. 2016. A joint many-task model: Growing a neural network for multiple NLP tasks. arXiv preprint arXiv:1611.01587.
Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kiperwasser, E., and Goldberg, Y. 2016a. Easy-first dependency parsing with hierarchical tree LSTMs. arXiv preprint arXiv:1603.00375.
Kiperwasser, E., and Goldberg, Y. 2016b. Simple and accurate dependency parsing using bidirectional LSTM feature representations. arXiv preprint arXiv:1603.04351.
Kuncoro, A.; Ballesteros, M.; Kong, L.; Dyer, C.; Neubig, G.; and Smith, N. A. 2016. What do recurrent neural network grammars learn about syntax? arXiv preprint arXiv:1611.05774.
Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
McDonald, R., and Nivre, J. 2007. Characterizing the errors of data-driven dependency parsing models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).
McDonald, R. T.; Crammer, K.; and Pereira, F. 2005. Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting of the ACL.
McDonald, R. 2006.
Discriminative training and spanning tree algorithms for dependency parsing. Ph.D. thesis, Computer and Information Science, University of Pennsylvania.
Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Nivre, J., and McDonald, R. T. 2008. Integrating graph-based and transition-based dependency parsers. In ACL.
Nivre, J. 2004. Incrementality in deterministic dependency parsing. In Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together. Association for Computational Linguistics.
Nivre, J. 2008. Algorithms for deterministic incremental dependency parsing. Computational Linguistics.
Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In EMNLP.
Qi, P., and Manning, C. D. 2017. Arc-swift: A novel transition system for dependency parsing. arXiv preprint arXiv:1705.04434.
Seo, M.; Kembhavi, A.; Farhadi, A.; and Hajishirzi, H. 2016. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.
Shang, L.; Lu, Z.; and Li, H. 2015. Neural responding machine for short-text conversation. arXiv preprint arXiv:1503.02364.
Toutanova, K.; Klein, D.; Manning, C. D.; and Singer, Y. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1. Association for Computational Linguistics.
Vaswani, A., and Sagae, K. 2016. Efficient structured inference for transition-based parsing with neural networks and error states. Transactions of the Association for Computational Linguistics.
Weiss, D.; Alberti, C.; Collins, M.; and Petrov, S. 2015. Structured training for neural network transition-based parsing. arXiv preprint arXiv:1506.06158.
Zhang, Y., and Clark, S. 2008. A tale of two parsers: Investigating and combining graph-based and transition-based dependency parsing using beam-search. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
Zhang, Y., and Nivre, J. 2011. Transition-based dependency parsing with rich non-local features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT '11. Association for Computational Linguistics.
Zhang, Y., and Nivre, J. 2012. Analyzing the effect of global learning and beam-search on transition-based dependency parsing. In COLING (Posters).
Zhang, X.; Cheng, J.; and Lapata, M. 2017. Dependency parsing as head selection. In Conference of the European Chapter of the Association for Computational Linguistics.
Zheng, X. 2017. Incremental graph-based neural dependency parsing. In Conference on Empirical Methods in Natural Language Processing.
Zhou, H.; Zhang, Y.; Huang, S.; and Chen, J. 2015. A neural probabilistic structured-prediction model for transition-based dependency parsing. In ACL.
Zhu, C.; Qiu, X.; Chen, X.; and Huang, X. 2015. A re-ranking model for dependency parser with recursive convolutional neural network. arXiv preprint arXiv:1505.05667.