# Discontinuous Constituent Parsing with Pointer Networks

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Daniel Fernández-González, Carlos Gómez-Rodríguez
Universidade da Coruña, CITIC, FASTPARSE Lab, LyS Group
Depto. de Ciencias de la Computación y Tecnologías de la Información
Elviña, 15071 A Coruña, Spain
{d.fgonzalez, carlos.gomez}@udc.es

Among the most complex syntactic representations used in computational linguistics and NLP are discontinuous constituent trees, crucial for representing all grammatical phenomena of languages such as German. Recent advances in dependency parsing have shown that Pointer Networks excel at efficiently parsing syntactic relations between the words of a sentence. This kind of sequence-to-sequence model achieves outstanding accuracies in building non-projective dependency trees, but its potential has not yet been proved on a more difficult task. We propose a novel neural network architecture that, by means of Pointer Networks, is able to generate the most accurate discontinuous constituent representations to date, even without the need of Part-of-Speech tagging information. To do so, we internally model discontinuous constituent structures as augmented non-projective dependency structures. The proposed approach achieves state-of-the-art results on the two widely-used NEGRA and TIGER benchmarks, outperforming previous work by a wide margin.

## Introduction

Syntactic representations are increasingly demanded by a wide range of artificial intelligence applications that process and understand natural language text or speech. This encourages the Natural Language Processing (NLP) community to develop more accurate and efficient parsers that are able to represent all the complex grammatical phenomena present in human languages.

One of the most widely-used syntactic formalisms is the constituent (or phrase-structure) representation. It describes the syntax of a sentence in terms of constituents or phrases and the hierarchical order between them, as shown in the constituent tree in Figure 1(a). As a simpler alternative, a dependency representation is also available. It straightforwardly connects each word of a sentence as a dependent of another word, which acts as its head. The resulting structure is a dependency tree like the one depicted in Figure 1(c).

Among constituent trees we can find the most informative syntactic representation currently available: discontinuous constituent trees. Apart from providing phrase-structure information, they extend regular or continuous constituent trees by allowing the representation of crossing branches and constituents with gaps in the middle, crucial for describing grammatical structures left out by the standard phrase-structure formalism, such as the free-word-order phenomena that can be found in languages such as German (see (Müller 2004) and references therein). Unlike for regular constituent trees (Klein and Manning 2003), context-free grammars are not enough for deriving discontinuous structures and representing complex linguistic phenomena; therefore, more expressive formalisms, such as Linear Context-Free Rewriting Systems (LCFRS) (Vijay-Shanker, Weir, and Joshi 1987), are required.

However, due to the high complexity behind the generation of discontinuous phrase-structure representations, parsers based on probabilistic LCFRS, such as (Kallmeyer and Maier 2010), are not practical in terms of accuracy and speed. In the last decade, alternatives to tackling discontinuous constituent parsing with complex grammar-based approaches have been presented.

On the one hand, one line of research proposes to extend transition-based parsers, which have been successfully applied to efficiently build continuous constituent trees (Zhu et al. 2013; Dyer et al. 2016), with transitions and data structures that are able to deal with discontinuity (Maier 2015; Stanojević and G. Alhama 2017; Coavoux, Crabbé, and Cohen 2019; Coavoux and Cohen 2019). This research branch currently yields the state of the art in discontinuous phrase-structure parsing.

In parallel, another stream led by (Hall and Nivre 2008; Versley 2014; Fernández-González and Martins 2015; Corro, Le Roux, and Lacroix 2017) proposes to represent discontinuous formalisms as dependencies with augmented information. This makes it possible to use efficient dependency parsers to perform the analysis and, after a recovery process, return a well-formed constituent tree. While these parsers offer similar efficiency to transition-based constituent parsers, they are currently slightly behind them in terms of accuracy. The main drawback of this approach is the large number of complex labels necessary to encode the wide variety of syntactic structures into labelled dependency arcs, which grows unboundedly with the treebank size. This forces some authors, such as (Fernández-González and Martins 2015; Corro, Le Roux, and Lacroix 2017), to use an extra module that, in a post-processing step after parsing, labels each dependency to increase the final accuracy. As a result, the parsing and labelling tasks are learned and performed in a two-stage procedure. However, these two tasks might benefit from each other's information during training, since constituent information is encoded in both components (arc and label); from an ideal standpoint, given good enough machine learning models, they should not be considered separately.

To address this problem, we propose a neural network architecture that, using internal augmented dependency representations based on (Fernández-González and Martins 2015) and following a multitask learning approach (Caruana 1997), can directly produce accurate discontinuous constituent trees in one step. Both tasks (creating dependencies and labelling them) are trained in parallel using a shared representation, so that the whole model can jointly learn to predict the correct augmented labels that will finally produce well-formed phrase-structure trees.

In addition, our model relies on Pointer Networks (Vinyals, Fortunato, and Jaitly 2015) for creating dependency arcs between the words of an input sentence. This kind of neural network, obtained by modifying standard sequence-to-sequence models, yields remarkable accuracies in dependency parsing. In particular, Pointer Networks use attention (Bahdanau, Cho, and Bengio 2014) as a pointer to select positions from the input sequence, which can be easily adapted to connect words of an input sentence, as recently shown by (Ma et al. 2018; Fernández-González and Gómez-Rodríguez 2019). We adapt this state-of-the-art neural model to a harder task, also obtaining outstanding performance.

To the best of our knowledge, this is the first attempt to follow a sequence-to-sequence paradigm to perform discontinuous constituent parsing, without the need of any intermediate data structures to create partial trees as required by the transition-based models cited above. To deal with the large number of labels, we propose to use the biaffine classifier introduced by (Dozat and Manning 2017), which has shown remarkable performance even in NLP tasks with a more complex label scheme, such as Semantic Dependency Parsing (Dozat and Manning 2018). This classifier shares the same representation as the Pointer Network and is jointly trained with it to produce well-formed discontinuous constituent trees in a single step.

The resulting neural model (available at https://github.com/danifg/Disco Pointer) produces the most accurate discontinuous constituent representations reported so far. We conduct experiments on the two widely-known NEGRA (Skut et al. 1997) and TIGER (Brants et al. 2002) treebanks, which contain a large number of German sentences syntactically annotated with constituent trees with a high degree of discontinuity: the prevalence of discontinuous constituent structures is over 25% in both treebanks (Maier and Lichte 2011). On both benchmarks, our approach achieves accuracies beyond 85% F-score (even without Part-of-Speech (POS) tagging information), surpassing the current state of the art by a wide margin without the need of orthogonal techniques such as re-ranking or semi-supervision.

## Modelling Discontinuous Constituent Trees

### Preliminaries

Let $w_1, w_2, \ldots, w_n$ be a sentence, where $w_i$ denotes the word in the $i$-th position. A constituent tree is a tree with the $n$ words of the sentence as leaf nodes and phrases (or constituents) as internal nodes. Each constituent can be represented as a tuple $(X, C, w_h)$, where $X$ is the non-terminal symbol, $C$ is the yield, i.e. the set of words $w_i$ included in its span, and $w_h$ is the word in $C$ that acts as head. A language-specific handwritten set of rules is used to select the head word. For instance, the head word of the constituent VROOT in Figure 1(a) is the word kam in constituent S.

Given a tree, if the yield $C$ of each of its constituents is a continuous substring of the sentence, then we say that the tree is continuous. Otherwise, the tree is classified as discontinuous, meaning that we can find at least one constituent whose span is interrupted by one or more gaps between its words. For instance, in Figure 1(a), the span of the constituent (NP, {Es, nichts, Interessantes}, Interessantes) is interrupted by the word kam from a different constituent, generating crossing branches. Finally, if there are no constituents with only one child node, we call it a unaryless constituent tree.

On the other hand, a dependency tree is a directed tree spanning all the words $w_i$ with $i$ ranging from 1 to $n$. Each dependency arc can be represented as $(w_h, w_d, l)$, with $w_h$ being the head word of the dependent word $w_d$ (with $h, d \in [1, n]$) and $l$ being the dependency label indicating the syntactic role played by $w_d$. If, for every dependency arc $(w_h, w_d, l)$, there is a directed path from $w_h$ to all the words $w_i$ between $w_h$ and $w_d$, we classify the tree as a projective dependency tree. If not, it is described as non-projective, resulting in a tree with crossing arcs such as the one depicted in Figure 1(c).

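To make the projectivity condition concrete, the following minimal sketch (our own illustration, not part of the paper or its released code; the helper name `is_projective` and the head-map input format are ours) checks whether every arc only covers words dominated by its head.

```python
# Minimal sketch (our illustration): checking the projectivity condition defined above.
# The tree is given as a head map head[d] = h for dependents d = 1..n; 0 is the dummy root.

def is_projective(head):
    """True iff, for every arc (h, d), each word strictly between h and d is dominated by h."""
    n = len(head) - 1
    for d in range(1, n + 1):
        h = head[d]
        lo, hi = min(h, d), max(h, d)
        for w in range(lo + 1, hi):
            a = w
            while a != 0 and a != h:   # climb from w towards the root
                a = head[a]
            if a != h:
                return False           # w lies inside the span of (h, d) but is not dominated by h
    return True

# "Es kam nichts Interessantes ." with heads roughly as in Figure 1(c)/(d):
# Es -> Interessantes, kam -> root, nichts -> Interessantes, Interessantes -> kam, . -> kam
print(is_projective([0, 4, 0, 4, 2, 2]))   # False: the arc from Interessantes to Es is crossed
```
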
It can be noticed from these definitions that, in order to represent the same syntactic phenomenon as described by a discontinuous constituent tree, we need a non-projective dependency structure to handle the discontinuities. However, a constituent tree is able to provide information that cannot be represented in a regular dependency tree (Kahane and Mazziotta 2015).

### Constituents as Augmented Dependencies

Following (Fernández-González and Martins 2015), we decompose a discontinuous unaryless phrase-structure tree into a set of non-projective dependency arcs with enriched information. This allows us to use a non-projective dependency parser to efficiently perform discontinuous phrase-structure parsing.

A constituent tree with $m$ words can be decomposed into a set of $m$ spines (Carreras, Collins, and Koo 2008), one per word, as shown in Figure 1(b) for the discontinuous tree in Figure 1(a). These spines, and the interactions between them that finally build a constituent tree, can be represented in an augmented dependency tree as described by (Fernández-González and Martins 2015).

[Figure 1: Augmented dependency representation of a sentence ("Es kam nichts Interessantes .") taken from the NEGRA corpus in discontinuous constituent format, as well as the spine interaction required for its reconstruction. Head nodes of each constituent are marked in bold. Panels: (a) discontinuous constituent tree; (b) spine interaction; (c) unlabelled dependency tree; (d) augmented dependency tree.]

More in detail, they propose to encode each constituent into dependencies: for each constituent $(X, C, w_h)$, each child node, which can be a word $w_d$ or a constituent $(Y, G, w_d)$ with $w_d \neq w_h$, is encoded as an unlabelled dependency arc of the form $(w_h, w_d)$. Basically, each non-head child node is attached to the head word and, if the involved child node is a constituent, its own head word is used to build the dependency arc. Additionally, to represent the original phrase structure, it is also necessary to save some vital information in the arc labels: the non-terminal symbol $X$ plus an index that indicates the order in which the spines are attached. This index is crucial in those cases where more than one constituent shares the same head word (and, therefore, the same head spine) but the constituents sit at different levels in the original tree and, therefore, should be created in a different hierarchical order. The final dependency arcs have the form $(w_h, w_d, X\#p)$, where $p$ is the attachment order.

For instance, the constituent (NP, {nichts, Interessantes}, Interessantes) in Figure 1(a) is encoded as the augmented dependency arc (Interessantes, nichts, NP#1) in Figure 1(d), and the constituent (NP, {Es, nichts, Interessantes}, Interessantes) is represented as (Interessantes, Es, NP#2). Both constituents share the same head spine, anchored to the word Interessantes, but they are attached at different levels.

To recover the original discontinuous constituent tree, we just have to follow the steps given by the augmented dependencies from a spine-based perspective. The dependencies indicate which spines are involved in each attachment and the direction of the spine interaction (head and dependent roles); the labels, on the other hand, describe in what order the dependent spines are attached to the head spine, as well as the non-terminal symbol of the resulting constituent. For instance, the augmented dependency tree in Figure 1(d) describes the spine interaction depicted in Figure 1(b).

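As an illustration of this encoding, the sketch below (our own, with hypothetical names such as `encode`; it is not taken from the released implementation) turns a headed constituent tree into augmented dependency arcs of the form $(w_h, w_d, X\#p)$, counting for each head spine how many constituents have already been stacked on it to obtain the attachment order $p$.

```python
# Illustrative sketch (names are ours): encoding a headed constituent tree into
# augmented dependency arcs (head, dependent, "X#p"), following the scheme of
# Fernández-González and Martins (2015) described above.

def encode(tree):
    """tree: (label, head_position, children), where children are sub-trees or leaf
    positions (1-based word indices). Returns augmented arcs (head, dependent, 'X#p')."""
    arcs, level = [], {}                    # level[h] = constituents already stacked on h's spine

    def head_of(node):
        return node if isinstance(node, int) else node[1]

    def visit(node):
        if isinstance(node, int):
            return
        label, head, children = node
        for child in children:              # bottom-up: encode inner constituents first
            visit(child)
        level[head] = level.get(head, 0) + 1
        for child in children:
            dep = head_of(child)
            if dep != head:                 # every non-head child attaches to the head word
                arcs.append((head, dep, f"{label}#{level[head]}"))

    visit(tree)
    return arcs

# A tree shaped like Figure 1(a): "Es(1) kam(2) nichts(3) Interessantes(4) .(5)"
inner_np = ("NP", 4, [3, 4])
outer_np = ("NP", 4, [1, inner_np])
s = ("S", 2, [outer_np, 2])
vroot = ("VROOT", 2, [s, 5])
print(encode(vroot))
# -> [(4, 3, 'NP#1'), (4, 1, 'NP#2'), (2, 4, 'S#1'), (2, 5, 'VROOT#2')]
```

Run on this input, the sketch reproduces the arcs (Interessantes, nichts, NP#1) and (Interessantes, Es, NP#2) discussed above; the remaining arcs depend on the exact shape assumed for the S and VROOT constituents.
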
In the example, we can see that the dependent spines anchored to the words Es and nichts are attached at different orders and levels on the head spine lexicalized by the word Interessantes. Without the attachment order, all dependent spines would be attached at the same level and the resulting constituent would be a different structure: in the example, a single flat NP phrase with the three words as its child nodes.

It is also worth mentioning that the dependency-based representation described above is not able to encode unary constituents and, as a consequence, this information is lost after the constituent tree is recovered. However, as stated by (Fernández-González and Martins 2015), unary nodes are very uncommon in discontinuous representations: for instance, the NEGRA treebank has no unaries at all and, in the TIGER dataset, the fraction of unaries is around 1%. Therefore, we do not perform unary recovery in a post-processing step and simply apply a non-projective dependency parser to the resulting augmented trees.

## Neural Network Architecture

We use a multitask learning strategy to train in parallel both the Pointer Network, in charge of connecting words, and the labeler, a multi-class classifier in charge of tagging each dependency arc with the augmented information, to finally build a well-formed discontinuous constituent tree. In the next subsections, we describe the proposed pointer-network-based architecture for labelled dependency parsing.

### Pointer Networks

(Vinyals, Fortunato, and Jaitly 2015) introduced a novel neural architecture, called Pointer Network, that can be seen as a variation of standard sequence-to-sequence models. This neural network can learn the conditional probability of an output sequence of discrete numbers that correspond to positions in the input sequence, whose length is variable. A neural attention mechanism (Bahdanau, Cho, and Bengio 2014) is used to select positions from the input, without requiring a fixed size of the output dictionary. Thanks to that, Pointer Networks are suitable for addressing those problems where the target classes considered at each step are variable and depend on the length of the input sequence.

Unlabelled dependency parsing is one of the tasks where Pointer Networks can be easily applied: the input is a (variable-length) sequence of words of a sentence and the output is a sequence of numbers that correspond to the positions of the assigned head words. In (Fernández-González and Gómez-Rodríguez 2019), we can find a transition-based perspective on this idea, but it can also be defined as a purely sequence-to-sequence approach, since, unlike in classic transition-based dependency parsers (Nivre 2003), neither transitions nor data structures are required.

We follow the convention of representing scalars in lowercase italics ($i$), vectors in lowercase bold ($\mathbf{v}$), matrices in uppercase italics ($M$), and higher-order tensors in uppercase bold ($\mathbf{T}$).

### Encoder

Let $w_1, w_2, \ldots, w_n$ be an input sentence. The encoder of our network relies on a BiLSTM-CNN architecture, developed by (Ma and Hovy 2016), to encode each word $w_i$ into an encoder hidden state $\mathbf{h}_i$. Convolutional Neural Networks (CNNs) (LeCun et al. 1989) are used for extracting character-level representations of words. In particular, for each word $w_i$, the CNN receives a concatenation of the embedding vectors of its characters as input and produces a character-level representation $\mathbf{e}^c_i$.

Then, each word $w_i$ is represented by the concatenation of three embeddings: the character-level word embedding $\mathbf{e}^c_i$, a pretrained word embedding $\mathbf{e}^w_i$ and a randomly initialized part-of-speech (POS) tag embedding $\mathbf{e}^p_i$ (where $\oplus$ stands for concatenation):

$$\mathbf{x}_i = \mathbf{e}^c_i \oplus \mathbf{e}^w_i \oplus \mathbf{e}^p_i$$

In order to be able to attach every word of the sentence to another, we need to add a dummy root node $w_0$ (represented with a special vector $\mathbf{x}_0$) at the beginning of the sentence. This node will be linked as the head of the real syntactic root word. After that, the resulting embedding representation $\mathbf{x}_i$ of each word $w_i$ is fed one-by-one into a Bi-directional Long Short-Term Memory (BiLSTM) (Graves and Schmidhuber 2005) that captures context information in both directions and generates a vector representation $\mathbf{h}_i$:

$$\mathbf{h}_i = \mathbf{h}^l_i \oplus \mathbf{h}^r_i = \mathrm{BiLSTM}(\mathbf{x}_i)$$

### Decoder

A unidirectional LSTM (Hochreiter and Schmidhuber 1997) is used as decoder. At each time step, it receives as input the encoder hidden state $\mathbf{h}_i$ of the word $w_i$ currently being processed and outputs a decoder hidden state $\mathbf{s}_t$. To add extra information to the input of the LSTM, we follow (Fernández-González and Gómez-Rodríguez 2019) and add the hidden states of the previous ($\mathbf{h}_{i-1}$) and next ($\mathbf{h}_{i+1}$) words. As proposed by (Ma et al. 2018), we use the element-wise sum of the three hidden states, instead of concatenating them, in order not to increase the dimension of the resulting input vector $\mathbf{r}_i$:

$$\mathbf{r}_i = \mathbf{h}_{i-1} + \mathbf{h}_i + \mathbf{h}_{i+1}$$
$$\mathbf{s}_t = \mathrm{LSTM}(\mathbf{r}_i)$$

Once $\mathbf{s}_t$ is generated from $\mathbf{r}_i$, the attention vector $\mathbf{a}^t$ is computed as follows:

$$v^t_j = \mathrm{score}(\mathbf{s}_t, \mathbf{h}_j)$$
$$\mathbf{a}^t = \mathrm{softmax}(\mathbf{v}^t)$$

where an attention scoring function ($\mathrm{score}()$) is used to compute scores between $\mathbf{s}_t$ (the decoder representation of the word $w_i$ currently being processed) and each encoder hidden state $\mathbf{h}_j$ of the input. Then, a softmax is applied to the resulting score vector $\mathbf{v}^t$ to compute a probability distribution over the input. The generated attention vector $\mathbf{a}^t$ is used as a pointer to select the highest-scoring position $j$ of the input. The pointed word $w_j$ is attached as the head of the word $w_i$ currently being processed, building an unlabelled dependency arc $(w_j, w_i)$. In case the arc $(w_j, w_i)$ would generate a cycle in the already-built dependency graph, we select the next highest-scoring position of the input pointed to by $\mathbf{a}^t$.

The decoding process starts at word $w_1$ and is repeated word-by-word from left to right until the last word ($w_n$) is attached, requiring just $n$ steps to fully parse the input sentence. Therefore, since cycles can be checked in amortized linear time ($O(n)$) and, at each step, the attention vector $\mathbf{a}^t$ must be computed over the whole input, the overall runtime complexity is $O(n^2)$. Figure 2 depicts the neural architecture and the decoding procedure for the dependency structure in Figure 1(d).

### Attention Mechanism

As attention scoring function, we follow (Ma et al. 2018) and adopt the biaffine attention mechanism, developed by (Dozat and Manning 2017), to compute the score vector $\mathbf{v}^t$:

$$v^t_j = \mathbf{s}_t^{\mathrm{T}} W \mathbf{h}_j + \mathbf{U}^{\mathrm{T}} \mathbf{s}_t + \mathbf{V}^{\mathrm{T}} \mathbf{h}_j + \mathbf{b}$$

where $W$ is the weight matrix of the bi-linear term, $\mathbf{U}$ and $\mathbf{V}$ are the weight tensors of the linear terms and $\mathbf{b}$ is the bias vector. Basically, the bi-linear term evaluates the score of assigning the word $w_i$ (represented by $\mathbf{s}_t$) to the head word $w_j$ of the input, and the two linear terms compute the scores of both words considered independently.

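The following PyTorch sketch (our own illustration under the assumption of a PyTorch setting; names and dimensions are ours, and it is not the released implementation) shows how this biaffine scoring can act as a pointer: the decoder state $\mathbf{s}_t$ is scored against every encoder state $\mathbf{h}_j$ and a softmax turns the scores into the attention vector $\mathbf{a}^t$. The states are first passed through MLPs, as discussed right after this sketch.

```python
import torch
import torch.nn as nn

class BiaffinePointer(nn.Module):
    """Sketch of the pointer scoring: v_j = s^T W h_j + U^T s + V^T h_j + b, then softmax."""

    def __init__(self, enc_dim, dec_dim, hidden_dim=512):
        super().__init__()
        # MLPs that reduce encoder/decoder states before scoring (see the paragraph below)
        self.mlp_h = nn.Sequential(nn.Linear(enc_dim, hidden_dim), nn.ELU())
        self.mlp_s = nn.Sequential(nn.Linear(dec_dim, hidden_dim), nn.ELU())
        self.W = nn.Parameter(torch.zeros(hidden_dim, hidden_dim))  # bi-linear term
        self.U = nn.Parameter(torch.zeros(hidden_dim))              # linear term on s_t
        self.V = nn.Parameter(torch.zeros(hidden_dim))              # linear term on h_j
        self.b = nn.Parameter(torch.zeros(1))                       # bias

    def forward(self, s_t, H):
        """s_t: (dec_dim,) decoder state of w_i;  H: (n+1, enc_dim) encoder states (root + words)."""
        s = self.mlp_s(s_t)                                              # (hidden_dim,)
        h = self.mlp_h(H)                                                # (n+1, hidden_dim)
        scores = h @ (self.W.T @ s) + self.U @ s + h @ self.V + self.b   # (n+1,)
        return torch.softmax(scores, dim=-1)                             # attention vector a^t
```

At decoding time, the argmax of $\mathbf{a}^t$ gives the pointed head position; if that arc would create a cycle, the next highest-scoring position is taken instead, as described above.
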
Finally, in order to reduce the dimensionality and overfitting of the neural model, a multilayer perceptron (MLP) is used to transform the hidden representations $\mathbf{h}_j$ and $\mathbf{s}_t$ before the attention scoring function is applied, as discussed by (Dozat and Manning 2017).

[Figure 2: Pointer Network architecture and decoding steps to output the dependency tree in Figure 1(d).]

### Biaffine Labeler

We follow (Dozat and Manning 2017) and implement the labeler as a biaffine multi-class classifier; unlike them, we use the previously described BiLSTM-CNN architecture as encoder. (Dozat and Manning 2017) rely on a biaffine attention mechanism as the scoring function for computing the score of assigning a label $l$ to a predicted arc between the dependent word $w_i$ and the head word $w_j$, assigning to that arc the highest-scoring label. In particular, the encoder hidden state $\mathbf{h}_j$ and the decoder hidden state $\mathbf{s}_t$ (representing $w_j$ and $w_i$, respectively) are used as input to obtain the score $s^l_{tj}$ for each label $l$, as shown in the following equation:

$$s^l_{tj} = \mathbf{s}_t^{\mathrm{T}} W_l \mathbf{h}_j + \mathbf{U}_l^{\mathrm{T}} \mathbf{s}_t + \mathbf{V}_l^{\mathrm{T}} \mathbf{h}_j + \mathbf{b}_l$$

where a distinct weight matrix $W_l$, weight tensors $\mathbf{U}_l$ and $\mathbf{V}_l$ and bias $\mathbf{b}_l$ are used for each label $l$, with $l \in \{1, 2, \ldots, L\}$ and $L$ being the number of labels. The first term on the right-hand side of the equation represents the score of assigning the label $l$ to the arc between dependent $w_i$ and head $w_j$; the second and third terms express the score of the label $l$ when the dependent and the head are considered independently. Please note that an MLP is also used to transform the hidden state vectors $\mathbf{h}_j$ and $\mathbf{s}_t$ before feeding the biaffine classifier, in order to filter out irrelevant information and, as a consequence, reduce the model dimensionality (Dozat and Manning 2017).

### Multitask Learning

Our goal is to produce a fully-parsed, labelled non-projective dependency tree in a single stage, even though parsing and labelling would otherwise have to be undertaken individually due to the large number of labels resulting from the formalism by (Fernández-González and Martins 2015). We follow a multitask learning strategy (Caruana 1997) to achieve this: a single neural architecture is trained for more than one task, so that the tasks share a common representation and benefit from each other. In particular, the labeler shares the same encoder as the parser, providing a common encoder hidden state representation for both components. As stated by (Kiperwasser and Goldberg 2016), training the BiLSTM-CNN encoder to correctly predict dependency labels significantly improves unlabelled parsing accuracy and vice versa. This is especially crucial in our approach, where a constituent structure is jointly encoded in dependency arcs and labels, and a wrong label can lead to a completely different phrase-structure tree after recovery.

In addition, the labeler and the parser are simultaneously trained by optimizing the sum of their objectives. On the one hand, a dependency tree $y$ for an input sentence of length $n$ is decomposed into a set of $n$ directed arcs $a_1, \ldots, a_n$, where each arc $a_i$ is represented by the head word $w_h$ and the dependent word $w_i$ in position $i$ of the sentence. Therefore, to train the parser, we factorize the conditional probability $P_\theta(y \mid x)$ of a dependency tree $y$ for a sentence $x$ into a set of head-dependent pairs $(w_h, w_i)$ as follows:

$$P_\theta(y \mid x) = \prod_{i=1}^{n} P_\theta(a_i \mid a_{<i}, x)$$

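A minimal sketch of how the two objectives could be combined under this multitask setup, assuming the pointer distributions and biaffine label scores defined above (the function and tensor names are ours, not from the released code):

```python
import torch
import torch.nn.functional as F

# Illustrative sketch (not the released implementation): the pointer loss and the label
# loss are summed and backpropagated through the shared BiLSTM-CNN encoder, so both
# tasks are trained at the same time on a common representation.

def joint_loss(head_probs, label_scores, gold_heads, gold_labels):
    """head_probs:   (n, n+1) pointer distribution over head positions for each word
       label_scores: (n, L)   biaffine scores over the L augmented labels of each gold arc
       gold_heads:   (n,)     gold head position of each word
       gold_labels:  (n,)     index of the gold augmented label (e.g. 'NP#2')"""
    parser_loss = F.nll_loss(torch.log(head_probs + 1e-12), gold_heads)
    labeler_loss = F.cross_entropy(label_scores, gold_labels)
    return parser_loss + labeler_loss   # both terms share the encoder parameters
```
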