# Discontinuous Constituent Parsing with Pointer Networks

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Daniel Fernández-González, Carlos Gómez-Rodríguez
Universidade da Coruña, CITIC, FASTPARSE Lab, LyS Group
Depto. de Ciencias de la Computación y Tecnologías de la Información
Elviña, 15071 A Coruña, Spain
{d.fgonzalez, carlos.gomez}@udc.es

Among the most complex syntactic representations used in computational linguistics and NLP are discontinuous constituent trees, crucial for representing all grammatical phenomena of languages such as German. Recent advances in dependency parsing have shown that Pointer Networks excel at efficiently parsing syntactic relations between the words of a sentence. This kind of sequence-to-sequence model achieves outstanding accuracies in building non-projective dependency trees, but its potential has not yet been proved on a more difficult task. We propose a novel neural network architecture that, by means of Pointer Networks, is able to generate the most accurate discontinuous constituent representations to date, even without the need of Part-of-Speech tagging information. To do so, we internally model discontinuous constituent structures as augmented non-projective dependency structures. The proposed approach achieves state-of-the-art results on the two widely-used NEGRA and TIGER benchmarks, outperforming previous work by a wide margin.

## Introduction

Syntactic representations are increasingly demanded by a wide range of artificial intelligence applications that process and understand natural language text or speech. This encourages the Natural Language Processing (NLP) community to develop more accurate and efficient parsers that are able to represent all the complex grammatical phenomena present in human languages.

One of the most widely-used syntactic formalisms is the constituent (or phrase-structure) representation. It describes the syntax of a sentence in terms of constituents or phrases and the hierarchical order between them, as shown in the constituent tree in Figure 1(a). As a simpler alternative, a dependency representation is also available. It straightforwardly connects each word of a sentence as a dependent of another word, which acts as its head. The resulting structure is a dependency tree like the one depicted in Figure 1(c).

Among constituent trees we can find the most informative syntactic representation currently available: discontinuous constituent trees. Apart from providing phrase-structure information, they extend regular or continuous constituent trees by allowing the representation of crossing branches and constituents with gaps in the middle, crucial for describing grammatical structures left out by the standard phrase-structure formalism, such as the free-word-order phenomena that can be found in languages such as German (see (Müller 2004) and references therein). Unlike for regular constituent trees (Klein and Manning 2003), context-free grammars are not enough for deriving discontinuous structures and representing complex linguistic phenomena; therefore, more expressive formalisms, such as Linear Context-Free Rewriting Systems (LCFRS) (Vijay-Shanker, Weir, and Joshi 1987), are required.

However, due to the high complexity behind the generation of discontinuous phrase-structure representations, parsers based on probabilistic LCFRS, such as (Kallmeyer and Maier 2010), are not practical in terms of accuracy and speed. In the last decade, alternatives to tackling discontinuous constituent parsing with complex grammar-based approaches have been presented.

On the one hand, one line of research proposes to extend transition-based parsers, which have been successfully applied to efficiently build continuous constituent trees (Zhu et al. 2013; Dyer et al. 2016), with transitions and data structures that are able to deal with discontinuity (Maier 2015; Stanojević and G. Alhama 2017; Coavoux, Crabbé, and Cohen 2019; Coavoux and Cohen 2019). This research branch currently yields the state of the art in discontinuous phrase-structure parsing.

In parallel, another stream led by (Hall and Nivre 2008; Versley 2014; Fernández-González and Martins 2015; Corro, Le Roux, and Lacroix 2017) proposes to represent discontinuous formalisms as dependencies with augmented information. This makes it possible to use efficient dependency parsers to perform the analysis and, after a recovery process, return a well-formed constituent tree. While these parsers offer similar efficiency to transition-based constituent parsers, they are currently slightly behind them in terms of accuracy. The main drawback of this approach is the large number of complex labels necessary to encode the wide variety of syntactic structures into labelled dependency arcs, which grows unboundedly with the treebank size. This forces some authors, such as (Fernández-González and Martins 2015; Corro, Le Roux, and Lacroix 2017), to use an extra module that, in a post-processing step after parsing, labels each dependency to increase the final accuracy. As a result, the parsing and labelling tasks are learned and performed in a two-stage procedure. However, these two tasks might benefit from each other's information during training, since constituent information is encoded in both components (arc and label); from an ideal standpoint, given good enough machine learning models, they should not be considered separately.

To address this problem, we propose a neural network architecture that, using internal augmented dependency representations based on (Fernández-González and Martins 2015) and following a multitask learning approach (Caruana 1997), can directly produce accurate discontinuous constituent trees in one step. Both tasks (creating dependencies and labelling them) are trained in parallel using a shared representation, so that the whole model can jointly learn to predict the correct augmented labels that will finally produce well-formed phrase-structure trees.

In addition, our model relies on Pointer Networks (Vinyals, Fortunato, and Jaitly 2015) for creating dependency arcs between the words of an input sentence. This kind of neural network, obtained by modifying standard sequence-to-sequence models, yields remarkable accuracies in dependency parsing. In particular, Pointer Networks use attention (Bahdanau, Cho, and Bengio 2014) as a pointer to select positions from the input sequence, which can be easily adapted to connect words of an input sentence, as recently shown by (Ma et al. 2018; Fernández-González and Gómez-Rodríguez 2019). We adapt this state-of-the-art neural model to a harder task, also obtaining outstanding performance.

To the best of our knowledge, this is the first attempt to follow a sequence-to-sequence paradigm to perform discontinuous constituent parsing, without the need of any intermediate data structures to create partial trees as required by the transition-based models cited above. To deal with the large number of labels, we propose to use the biaffine classifier introduced by (Dozat and Manning 2017), which has shown remarkable performance even in NLP tasks with a more complex label scheme, such as Semantic Dependency Parsing (Dozat and Manning 2018). This classifier shares the same representation as the Pointer Network and is jointly trained with it to produce well-formed discontinuous constituent trees in a single step.

The resulting neural model (available at https://github.com/danifg/Disco Pointer) produces the most accurate discontinuous constituent representations reported so far. We conduct experiments on the two widely-known NEGRA (Skut et al. 1997) and TIGER (Brants et al. 2002) treebanks, which contain a large number of German sentences syntactically annotated with constituent trees with a high degree of discontinuity: the prevalence of discontinuous constituent structures is over 25% in both treebanks (Maier and Lichte 2011). On both benchmarks, our approach achieves accuracies beyond 85% F-score (even without Part-of-Speech (POS) tagging information), surpassing the current state of the art by a wide margin without the need of orthogonal techniques such as re-ranking or semi-supervision.

## Modelling Discontinuous Constituent Trees

### Preliminaries

Let $w_1, w_2, \ldots, w_n$ be a sentence, where $w_i$ denotes the word in the $i$-th position. A constituent tree is a tree with the $n$ words of the sentence as leaf nodes and phrases (or constituents) as internal nodes. Each constituent can be represented as a tuple $(X, C, w_h)$, where $X$ is the non-terminal symbol, $C$ is the yield, i.e. the set of words $w_i$ included in its span, and $w_h$ is the word in $C$ that acts as head. A language-specific handwritten set of rules is used to select the head word. For instance, the head word of the constituent VROOT in Figure 1(a) is the word kam in constituent S.

Given a tree, if the yield $C$ of each of its constituents is a continuous substring of the sentence, then we say that the tree is continuous. Otherwise, the tree is classified as discontinuous, meaning that we can find at least one constituent whose span is interrupted by one or more gaps between its words. For instance, in Figure 1(a), the span of the constituent (NP, {Es, nichts, Interessantes}, Interessantes) is interrupted by the word kam from a different constituent, generating crossing branches. Finally, if there are no constituents with only one child node, we call it a unaryless constituent tree.

On the other hand, a dependency tree is a directed tree spanning all the words $w_i$ with $i$ ranging from 1 to $n$. Each dependency arc can be represented as $(w_h, w_d, l)$, with $w_h$ being the head word of the dependent word $w_d$ (with $h, d \in [1, n]$) and $l$ being the dependency label indicating the syntactic role played by $w_d$. If, for every dependency arc $(w_h, w_d, l)$, there is a directed path from $w_h$ to all the words $w_i$ between $w_h$ and $w_d$, we classify the tree as a projective dependency tree. If not, it is described as non-projective, resulting in a tree with crossing arcs such as the one depicted in Figure 1(c).

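To make the projectivity condition concrete, the following minimal sketch (our own illustration, not part of the paper or its released code; the helper name `is_projective` and the head-map input format are ours) checks whether every arc only covers words dominated by its head.

```python
# Minimal sketch (our illustration): checking the projectivity condition defined above.
# The tree is given as a head map head[d] = h for dependents d = 1..n; 0 is the dummy root.

def is_projective(head):
    """True iff, for every arc (h, d), each word strictly between h and d is dominated by h."""
    n = len(head) - 1
    for d in range(1, n + 1):
        h = head[d]
        lo, hi = min(h, d), max(h, d)
        for w in range(lo + 1, hi):
            a = w
            while a != 0 and a != h:   # climb from w towards the root
                a = head[a]
            if a != h:
                return False           # w lies inside the span of (h, d) but is not dominated by h
    return True

# "Es kam nichts Interessantes ." with heads roughly as in Figure 1(c)/(d):
# Es -> Interessantes, kam -> root, nichts -> Interessantes, Interessantes -> kam, . -> kam
print(is_projective([0, 4, 0, 4, 2, 2]))   # False: the arc from Interessantes to Es is crossed
```
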
It can be noticed from these definitions that, in order to represent the same syntactic phenomenon as described by a discontinuous constituent tree, we need a non-projective dependency structure to handle the discontinuities. However, a constituent tree is able to provide information that cannot be represented in a regular dependency tree (Kahane and Mazziotta 2015).

### Constituents as Augmented Dependencies

Following (Fernández-González and Martins 2015), we decompose a discontinuous unaryless phrase-structure tree into a set of non-projective dependency arcs with enriched information. This allows us to use a non-projective dependency parser to efficiently perform discontinuous phrase-structure parsing.

A constituent tree with $m$ words can be decomposed into a set of $m$ spines (Carreras, Collins, and Koo 2008), one per word, as shown in Figure 1(b) for the discontinuous tree in Figure 1(a). These spines, and the interactions between them that finally build a constituent tree, can be represented in an augmented dependency tree as described by (Fernández-González and Martins 2015).

[Figure 1: Augmented dependency representation of a sentence ("Es kam nichts Interessantes .") taken from the NEGRA corpus in discontinuous constituent format, as well as the spine interaction required for its reconstruction. Head nodes of each constituent are marked in bold. Panels: (a) discontinuous constituent tree; (b) spine interaction; (c) unlabelled dependency tree; (d) augmented dependency tree.]

More in detail, they propose to encode each constituent into dependencies: for each constituent $(X, C, w_h)$, each child node, which can be a word $w_d$ or a constituent $(Y, G, w_d)$ with $w_d \neq w_h$, is encoded as an unlabelled dependency arc of the form $(w_h, w_d)$. Basically, each non-head child node is attached to the head word and, if the involved child node is a constituent, its own head word is used to build the dependency arc. Additionally, to represent the original phrase structure, it is also necessary to save some vital information in the arc labels: the non-terminal symbol $X$ plus an index that indicates the order in which the spines are attached. This index is crucial in those cases where more than one constituent shares the same head word (and, therefore, the same head spine) but the constituents sit at different levels in the original tree and, therefore, should be created in a different hierarchical order. The final dependency arcs have the form $(w_h, w_d, X\#p)$, where $p$ is the attachment order.

For instance, the constituent (NP, {nichts, Interessantes}, Interessantes) in Figure 1(a) is encoded as the augmented dependency arc (Interessantes, nichts, NP#1) in Figure 1(d), and the constituent (NP, {Es, nichts, Interessantes}, Interessantes) is represented as (Interessantes, Es, NP#2). Both constituents share the same head spine, anchored to the word Interessantes, but they are attached at different levels.

To recover the original discontinuous constituent tree, we just have to follow the steps given by the augmented dependencies from a spine-based perspective. The dependencies indicate which spines are involved in each attachment and the direction of the spine interaction (head and dependent roles); the labels, on the other hand, describe in what order the dependent spines are attached to the head spine, as well as the non-terminal symbol of the resulting constituent. For instance, the augmented dependency tree in Figure 1(d) describes the spine interaction depicted in Figure 1(b).

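As an illustration of this encoding, the sketch below (our own, with hypothetical names such as `encode`; it is not taken from the released implementation) turns a headed constituent tree into augmented dependency arcs of the form $(w_h, w_d, X\#p)$, counting for each head spine how many constituents have already been stacked on it to obtain the attachment order $p$.

```python
# Illustrative sketch (names are ours): encoding a headed constituent tree into
# augmented dependency arcs (head, dependent, "X#p"), following the scheme of
# Fernández-González and Martins (2015) described above.

def encode(tree):
    """tree: (label, head_position, children), where children are sub-trees or leaf
    positions (1-based word indices). Returns augmented arcs (head, dependent, 'X#p')."""
    arcs, level = [], {}                    # level[h] = constituents already stacked on h's spine

    def head_of(node):
        return node if isinstance(node, int) else node[1]

    def visit(node):
        if isinstance(node, int):
            return
        label, head, children = node
        for child in children:              # bottom-up: encode inner constituents first
            visit(child)
        level[head] = level.get(head, 0) + 1
        for child in children:
            dep = head_of(child)
            if dep != head:                 # every non-head child attaches to the head word
                arcs.append((head, dep, f"{label}#{level[head]}"))

    visit(tree)
    return arcs

# A tree shaped like Figure 1(a): "Es(1) kam(2) nichts(3) Interessantes(4) .(5)"
inner_np = ("NP", 4, [3, 4])
outer_np = ("NP", 4, [1, inner_np])
s = ("S", 2, [outer_np, 2])
vroot = ("VROOT", 2, [s, 5])
print(encode(vroot))
# -> [(4, 3, 'NP#1'), (4, 1, 'NP#2'), (2, 4, 'S#1'), (2, 5, 'VROOT#2')]
```

Run on this input, the sketch reproduces the arcs (Interessantes, nichts, NP#1) and (Interessantes, Es, NP#2) discussed above; the remaining arcs depend on the exact shape assumed for the S and VROOT constituents.
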
In the example, we can see that the dependent spines anchored to the words Es and nichts are attached at different orders and levels on the head spine lexicalized by the word Interessantes. Without the attachment order, all dependent spines would be attached at the same level and the resulting constituent would be a different structure: in the example, a single flat NP phrase with the three words as its child nodes.

It is also worth mentioning that the dependency-based representation described above is not able to encode unary constituents and, as a consequence, this information is lost after the constituent tree is recovered. However, as stated by (Fernández-González and Martins 2015), unary nodes are very uncommon in discontinuous representations: for instance, the NEGRA treebank has no unaries at all and, in the TIGER dataset, the fraction of unaries is around 1%. Therefore, we do not perform unary recovery in a post-processing step and simply apply a non-projective dependency parser to the resulting augmented trees.

## Neural Network Architecture

We use a multitask learning strategy to train in parallel both the Pointer Network, in charge of connecting words, and the labeler, a multi-class classifier in charge of tagging each dependency arc with the augmented information, to finally build a well-formed discontinuous constituent tree. In the next subsections, we describe the proposed pointer-network-based architecture for labelled dependency parsing.

### Pointer Networks

(Vinyals, Fortunato, and Jaitly 2015) introduced a novel neural architecture, called Pointer Network, that can be seen as a variation of standard sequence-to-sequence models. This neural network can learn the conditional probability of an output sequence of discrete numbers that correspond to positions in the input sequence, whose length is variable. A neural attention mechanism (Bahdanau, Cho, and Bengio 2014) is used to select positions from the input, without requiring a fixed size of the output dictionary. Thanks to that, Pointer Networks are suitable for addressing those problems where the target classes considered at each step are variable and depend on the length of the input sequence.

Unlabelled dependency parsing is one of the tasks where Pointer Networks can be easily applied: the input is a (variable-length) sequence of words of a sentence and the output is a sequence of numbers that correspond to the positions of the assigned head words. In (Fernández-González and Gómez-Rodríguez 2019), we can find a transition-based perspective on this idea, but it can also be defined as a purely sequence-to-sequence approach, since, unlike in classic transition-based dependency parsers (Nivre 2003), neither transitions nor data structures are required.

We follow the convention of representing scalars in lowercase italics ($i$), vectors in lowercase bold ($\mathbf{v}$), matrices in uppercase italics ($M$), and higher-order tensors in uppercase bold ($\mathbf{T}$).

### Encoder

Let $w_1, w_2, \ldots, w_n$ be an input sentence. The encoder of our network relies on a BiLSTM-CNN architecture, developed by (Ma and Hovy 2016), to encode each word $w_i$ into an encoder hidden state $\mathbf{h}_i$. Convolutional Neural Networks (CNNs) (LeCun et al. 1989) are used for extracting character-level representations of words. In particular, for each word $w_i$, the CNN receives a concatenation of the embedding vectors of its characters as input and produces a character-level representation $\mathbf{e}^c_i$.

Then, each word $w_i$ is represented by the concatenation of three embeddings: the character-level word embedding $\mathbf{e}^c_i$, a pretrained word embedding $\mathbf{e}^w_i$ and a randomly initialized part-of-speech (POS) tag embedding $\mathbf{e}^p_i$ (where $\oplus$ stands for concatenation):

$$\mathbf{x}_i = \mathbf{e}^c_i \oplus \mathbf{e}^w_i \oplus \mathbf{e}^p_i$$

In order to be able to attach every word of the sentence to another, we need to add a dummy root node $w_0$ (represented with a special vector $\mathbf{x}_0$) at the beginning of the sentence. This node will be linked as the head of the real syntactic root word. After that, the resulting embedding representation $\mathbf{x}_i$ of each word $w_i$ is fed one-by-one into a Bi-directional Long Short-Term Memory (BiLSTM) (Graves and Schmidhuber 2005) that captures context information in both directions and generates a vector representation $\mathbf{h}_i$:

$$\mathbf{h}_i = \mathbf{h}^l_i \oplus \mathbf{h}^r_i = \mathrm{BiLSTM}(\mathbf{x}_i)$$

### Decoder

A unidirectional LSTM (Hochreiter and Schmidhuber 1997) is used as decoder. At each time step, it receives as input the encoder hidden state $\mathbf{h}_i$ of the word $w_i$ currently being processed and outputs a decoder hidden state $\mathbf{s}_t$. To add extra information to the input of the LSTM, we follow (Fernández-González and Gómez-Rodríguez 2019) and add the hidden states of the previous ($\mathbf{h}_{i-1}$) and next ($\mathbf{h}_{i+1}$) words. As proposed by (Ma et al. 2018), we use the element-wise sum of the three hidden states, instead of concatenating them, in order not to increase the dimension of the resulting input vector $\mathbf{r}_i$:

$$\mathbf{r}_i = \mathbf{h}_{i-1} + \mathbf{h}_i + \mathbf{h}_{i+1}$$
$$\mathbf{s}_t = \mathrm{LSTM}(\mathbf{r}_i)$$

Once $\mathbf{s}_t$ is generated from $\mathbf{r}_i$, the attention vector $\mathbf{a}^t$ is computed as follows:

$$v^t_j = \mathrm{score}(\mathbf{s}_t, \mathbf{h}_j)$$
$$\mathbf{a}^t = \mathrm{softmax}(\mathbf{v}^t)$$

where an attention scoring function ($\mathrm{score}()$) is used to compute scores between $\mathbf{s}_t$ (the decoder representation of the word $w_i$ currently being processed) and each encoder hidden state $\mathbf{h}_j$ of the input. Then, a softmax is applied to the resulting score vector $\mathbf{v}^t$ to compute a probability distribution over the input. The generated attention vector $\mathbf{a}^t$ is used as a pointer to select the highest-scoring position $j$ of the input. The pointed word $w_j$ is attached as the head of the word $w_i$ currently being processed, building an unlabelled dependency arc $(w_j, w_i)$. In case the arc $(w_j, w_i)$ would generate a cycle in the already-built dependency graph, we select the next highest-scoring position of the input pointed to by $\mathbf{a}^t$.

The decoding process starts at word $w_1$ and is repeated word-by-word from left to right until the last word ($w_n$) is attached, requiring just $n$ steps to fully parse the input sentence. Therefore, since cycles can be checked in amortized linear time ($O(n)$) and, at each step, the attention vector $\mathbf{a}^t$ must be computed over the whole input, the overall runtime complexity is $O(n^2)$. Figure 2 depicts the neural architecture and the decoding procedure for the dependency structure in Figure 1(d).

### Attention Mechanism

As attention scoring function, we follow (Ma et al. 2018) and adopt the biaffine attention mechanism, developed by (Dozat and Manning 2017), to compute the score vector $\mathbf{v}^t$:

$$v^t_j = \mathbf{s}_t^{\mathrm{T}} W \mathbf{h}_j + \mathbf{U}^{\mathrm{T}} \mathbf{s}_t + \mathbf{V}^{\mathrm{T}} \mathbf{h}_j + \mathbf{b}$$

where $W$ is the weight matrix of the bi-linear term, $\mathbf{U}$ and $\mathbf{V}$ are the weight tensors of the linear terms and $\mathbf{b}$ is the bias vector. Basically, the bi-linear term evaluates the score of assigning the word $w_i$ (represented by $\mathbf{s}_t$) to the head word $w_j$ of the input, and the two linear terms compute the scores of both words considered independently.

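The following PyTorch sketch (our own illustration under the assumption of a PyTorch setting; names and dimensions are ours, and it is not the released implementation) shows how this biaffine scoring can act as a pointer: the decoder state $\mathbf{s}_t$ is scored against every encoder state $\mathbf{h}_j$ and a softmax turns the scores into the attention vector $\mathbf{a}^t$. The states are first passed through MLPs, as discussed right after this sketch.

```python
import torch
import torch.nn as nn

class BiaffinePointer(nn.Module):
    """Sketch of the pointer scoring: v_j = s^T W h_j + U^T s + V^T h_j + b, then softmax."""

    def __init__(self, enc_dim, dec_dim, hidden_dim=512):
        super().__init__()
        # MLPs that reduce encoder/decoder states before scoring (see the paragraph below)
        self.mlp_h = nn.Sequential(nn.Linear(enc_dim, hidden_dim), nn.ELU())
        self.mlp_s = nn.Sequential(nn.Linear(dec_dim, hidden_dim), nn.ELU())
        self.W = nn.Parameter(torch.zeros(hidden_dim, hidden_dim))  # bi-linear term
        self.U = nn.Parameter(torch.zeros(hidden_dim))              # linear term on s_t
        self.V = nn.Parameter(torch.zeros(hidden_dim))              # linear term on h_j
        self.b = nn.Parameter(torch.zeros(1))                       # bias

    def forward(self, s_t, H):
        """s_t: (dec_dim,) decoder state of w_i;  H: (n+1, enc_dim) encoder states (root + words)."""
        s = self.mlp_s(s_t)                                              # (hidden_dim,)
        h = self.mlp_h(H)                                                # (n+1, hidden_dim)
        scores = h @ (self.W.T @ s) + self.U @ s + h @ self.V + self.b   # (n+1,)
        return torch.softmax(scores, dim=-1)                             # attention vector a^t
```

At decoding time, the argmax of $\mathbf{a}^t$ gives the pointed head position; if that arc would create a cycle, the next highest-scoring position is taken instead, as described above.
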
Finally, in order to reduce the dimensionality and overfitting of the neural model, a multilayer perceptron (MLP) is used to transform the hidden representations $\mathbf{h}_j$ and $\mathbf{s}_t$ before the attention scoring function is applied, as discussed by (Dozat and Manning 2017).

[Figure 2: Pointer Network architecture and decoding steps to output the dependency tree in Figure 1(d).]

### Biaffine Labeler

We follow (Dozat and Manning 2017) and implement the labeler as a biaffine multi-class classifier; unlike them, we use the previously described BiLSTM-CNN architecture as encoder. (Dozat and Manning 2017) rely on a biaffine attention mechanism as the scoring function for computing the score of assigning a label $l$ to a predicted arc between the dependent word $w_i$ and the head word $w_j$, assigning to that arc the highest-scoring label. In particular, the encoder hidden state $\mathbf{h}_j$ and the decoder hidden state $\mathbf{s}_t$ (representing $w_j$ and $w_i$, respectively) are used as input to obtain the score $s^l_{tj}$ for each label $l$, as shown in the following equation:

$$s^l_{tj} = \mathbf{s}_t^{\mathrm{T}} W_l \mathbf{h}_j + \mathbf{U}_l^{\mathrm{T}} \mathbf{s}_t + \mathbf{V}_l^{\mathrm{T}} \mathbf{h}_j + \mathbf{b}_l$$

where a distinct weight matrix $W_l$, weight tensors $\mathbf{U}_l$ and $\mathbf{V}_l$ and bias $\mathbf{b}_l$ are used for each label $l$, with $l \in \{1, 2, \ldots, L\}$ and $L$ being the number of labels. The first term on the right-hand side of the equation represents the score of assigning the label $l$ to the arc between dependent $w_i$ and head $w_j$; the second and third terms express the score of the label $l$ when the dependent and the head are considered independently. Please note that an MLP is also used to transform the hidden state vectors $\mathbf{h}_j$ and $\mathbf{s}_t$ before feeding the biaffine classifier, in order to filter out irrelevant information and, as a consequence, reduce the model dimensionality (Dozat and Manning 2017).

### Multitask Learning

Our goal is to produce a fully-parsed, labelled non-projective dependency tree in a single stage, even though parsing and labelling would otherwise have to be undertaken individually due to the large number of labels resulting from the formalism by (Fernández-González and Martins 2015). We follow a multitask learning strategy (Caruana 1997) to achieve this: a single neural architecture is trained for more than one task, so that the tasks share a common representation and benefit from each other. In particular, the labeler shares the same encoder as the parser, providing a common encoder hidden state representation for both components. As stated by (Kiperwasser and Goldberg 2016), training the BiLSTM-CNN encoder to correctly predict dependency labels significantly improves unlabelled parsing accuracy and vice versa. This is especially crucial in our approach, where a constituent structure is jointly encoded in dependency arcs and labels, and a wrong label can lead to a completely different phrase-structure tree after recovery.

In addition, the labeler and the parser are simultaneously trained by optimizing the sum of their objectives. On the one hand, a dependency tree $y$ for an input sentence of length $n$ is decomposed into a set of $n$ directed arcs $a_1, \ldots, a_n$, where each arc $a_i$ is represented by the head word $w_h$ and the dependent word $w_i$ in position $i$ of the sentence. Therefore, to train the parser, we factorize the conditional probability $P_\theta(y \mid x)$ of a dependency tree $y$ for a sentence $x$ into a set of head-dependent pairs $(w_h, w_i)$ as follows:

$$P_\theta(y \mid x) = \prod_{i=1}^{n} P_\theta(a_i \mid a_{<i}, x)$$

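A minimal sketch of how the two objectives could be combined under this multitask setup, assuming the pointer distributions and biaffine label scores defined above (the function and tensor names are ours, not from the released code):

```python
import torch
import torch.nn.functional as F

# Illustrative sketch (not the released implementation): the pointer loss and the label
# loss are summed and backpropagated through the shared BiLSTM-CNN encoder, so both
# tasks are trained at the same time on a common representation.

def joint_loss(head_probs, label_scores, gold_heads, gold_labels):
    """head_probs:   (n, n+1) pointer distribution over head positions for each word
       label_scores: (n, L)   biaffine scores over the L augmented labels of each gold arc
       gold_heads:   (n,)     gold head position of each word
       gold_labels:  (n,)     index of the gold augmented label (e.g. 'NP#2')"""
    parser_loss = F.nll_loss(torch.log(head_probs + 1e-12), gold_heads)
    labeler_loss = F.cross_entropy(label_scores, gold_labels)
    return parser_loss + labeler_loss   # both terms share the encoder parameters
```
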