Published as a conference paper at ICLR 2017

STRUCTURED ATTENTION NETWORKS

Yoon Kim, Carl Denton, Luong Hoang, Alexander M. Rush
{yoonkim@seas,carldenton@college,lhoang@g,srush@seas}.harvard.edu
School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA

ABSTRACT

Attention networks have proven to be an effective approach for embedding categorical inference within a deep neural network. However, for many tasks we may want to model richer structural dependencies without abandoning end-to-end training. In this work, we experiment with incorporating richer structural distributions, encoded using graphical models, within deep networks. We show that these structured attention networks are simple extensions of the basic attention procedure, and that they allow for extending attention beyond the standard soft-selection approach, such as attending to partial segmentations or to subtrees. We experiment with two different classes of structured attention networks: a linear-chain conditional random field and a graph-based parsing model, and describe how these models can be practically implemented as neural network layers. Experiments show that this approach is effective for incorporating structural biases, and structured attention networks outperform baseline attention models on a variety of synthetic and real tasks: tree transduction, neural machine translation, question answering, and natural language inference. We further find that models trained in this way learn interesting unsupervised hidden representations that generalize simple attention.

1 INTRODUCTION

Attention networks are now a standard part of the deep learning toolkit, contributing to impressive results in neural machine translation (Bahdanau et al., 2015; Luong et al., 2015), image captioning (Xu et al., 2015), speech recognition (Chorowski et al., 2015; Chan et al., 2015), question answering (Hermann et al., 2015; Sukhbaatar et al., 2015), and algorithm learning (Graves et al., 2014; Vinyals et al., 2015), among many other applications (see Cho et al. (2015) for a comprehensive review). This approach alleviates the bottleneck of compressing a source into a fixed-dimensional vector by equipping a model with variable-length memory (Weston et al., 2014; Graves et al., 2014; 2016), thereby providing random access into the source as needed. Attention is implemented as a hidden layer which computes a categorical distribution (or hierarchy of categorical distributions) to make a soft-selection over source elements.

Noting the empirical effectiveness of attention networks, we also observe that the standard attention-based architecture does not directly model any structural dependencies that may exist among the source elements, and instead relies completely on the hidden layers of the network. While one might argue that these structural dependencies can be learned implicitly by a deep model with enough data, in practice it may be useful to provide a structural bias. Modeling structural dependencies at the final, output layer has been shown to be important in many deep learning applications, most notably in seminal work on graph transformers (LeCun et al., 1998), key work on NLP (Collobert et al., 2011), and in many other areas (Peng et al., 2009; Do & Artières, 2010; Jaderberg et al., 2014; Chen et al., 2015; Durrett & Klein, 2015; Lample et al., 2016, inter alia).
In this work, we consider applications which may require structural dependencies at the attention layer, and develop internal structured layers for modeling these directly. This approach generalizes categorical soft-selection attention layers by specifying possible structural dependencies in a soft manner. Key applications will be the development of an attention function that segments the source input into subsequences and one that takes into account the latent recursive structure (i.e. parse tree) of a source sentence.

Our approach views the attention mechanism as a graphical model over a set of latent variables. The standard attention network can be seen as an expectation of an annotation function with respect to a single latent variable whose categorical distribution is parameterized to be a function of the source. In the general case we can specify a graphical model over multiple latent variables whose edges encode the desired structure. Computing forward attention requires performing inference to obtain the expectation of the annotation function, i.e. the context vector. This expectation is computed over an exponentially-sized set of structures (through the machinery of graphical models/structured prediction), hence the name structured attention network. Notably each step of this process (including inference) is differentiable, so the model can be trained end-to-end without having to resort to deep policy gradient methods (Schulman et al., 2015).

The differentiability of inference algorithms over graphical models has previously been noted by various researchers (Li & Eisner, 2009; Domke, 2011; Stoyanov et al., 2011; Stoyanov & Eisner, 2012; Gormley et al., 2015), primarily outside the area of deep learning. For example, Gormley et al. (2015) treat an entire graphical model as a differentiable circuit and backpropagate risk through variational inference (loopy belief propagation) for minimum risk training of dependency parsers. Our contribution is to combine these ideas to produce structured internal attention layers within deep networks, noting that these approaches allow us to use the resulting marginals to create new features, as long as we do so in a differentiable way. We focus on two classes of structured attention: linear-chain conditional random fields (CRFs) (Lafferty et al., 2001) and first-order graph-based dependency parsers (Eisner, 1996).

The initial work of Bahdanau et al. (2015) was particularly interesting in the context of machine translation, as the model was able to implicitly learn an alignment model as a hidden layer, effectively embedding inference into a neural network. In a similar vein, under our framework the model has the capacity to learn a segmenter as a hidden layer or a parser as a hidden layer, without ever having to see a segmented sentence or a parse tree. Our experiments apply this approach to a difficult synthetic reordering task, as well as to machine translation, question answering, and natural language inference. We find that models trained with structured attention outperform standard attention models. Analysis of the learned representations further reveals that interesting structures emerge as an internal layer of the model. All code is available at http://github.com/harvardnlp/struct-attn.

2 BACKGROUND: ATTENTION NETWORKS

A standard neural network consists of a series of non-linear transformation layers, where each layer produces a fixed-dimensional hidden representation.
For tasks with large input spaces, this paradigm makes it hard to control the interaction between components. For example, in machine translation the source consists of an entire sentence, and the output is a prediction for each word in the translated sentence. Utilizing a standard network leads to an information bottleneck, where one hidden layer must encode the entire source sentence. Attention provides an alternative approach.[1] An attention network maintains a set of hidden representations that scale with the size of the source. The model uses an internal inference step to perform a soft-selection over these representations. This method allows the model to maintain a variable-length memory and has been shown to be crucially important for scaling systems for many tasks.

Formally, let x = [x1, . . . , xn] represent a sequence of inputs, let q be a query, and let z be a categorical latent variable with sample space {1, . . . , n} that encodes the desired selection among these inputs. Our aim is to produce a context c based on the sequence and the query. To do so, we assume access to an attention distribution z ∼ p(z | x, q), where we condition p on the inputs x and the query q. The context over a sequence is defined as the expectation c = E_{z ∼ p(z | x,q)}[f(x, z)], where f(x, z) is an annotation function. Attention of this form can be applied over any type of input; however, we will primarily be concerned with deep networks, where both the annotation function and the attention distribution are parameterized with neural networks, and the context produced is a vector fed to a downstream network.

[1] Another line of work involves marginalizing over latent variables (e.g. latent alignments) for sequence-to-sequence transduction (Kong et al., 2016; Lu et al., 2016; Yu et al., 2016; 2017).

For example, consider the case of attention-based neural machine translation (Bahdanau et al., 2015). Here the inputs [x1, . . . , xn] are the hidden states of a recurrent neural network (RNN) running over the words in the source sentence, q is the RNN hidden state of the target decoder (i.e. a vector representation of the query q), and z represents the source position to be attended to for translation. The attention distribution p is simply p(z = i | x, q) = softmax(θi), where θ ∈ R^n is a parameterized potential typically based on a neural network, e.g. θi = MLP([xi; q]). The annotation function is defined to simply return the selected hidden state, f(x, z) = xz. The context vector can then be computed using a simple sum,

c = E_{z ∼ p(z | x,q)}[f(x, z)] = Σ_{i=1}^{n} p(z = i | x, q) xi    (1)

Other tasks such as question answering use attention in a similar manner, for instance by replacing the source [x1, . . . , xn] with a set of potential facts and q with a representation of the question. In summary, we interpret the attention mechanism as taking the expectation of an annotation function f(x, z) with respect to a latent variable z ∼ p, where p is parameterized to be a function of x and q.

3 STRUCTURED ATTENTION

Attention networks simulate selection from a set using a soft model. In this work we consider generalizing selection to types of attention, such as selecting chunks, segmenting inputs, or even attending to latent subtrees. One interpretation of this attention is as using soft-selection that considers all possible structures over the input, of which there may be exponentially many possibilities.
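Before turning to the structured case, it is useful to fix the categorical baseline of equation (1) in code. The following is a minimal NumPy sketch, not the released implementation; the potentials θ are assumed to be precomputed (e.g. by an MLP over [xi; q]) and all names and toy dimensions are illustrative.

```python
import numpy as np

def softmax(v):
    v = v - v.max()               # stabilize before exponentiation
    e = np.exp(v)
    return e / e.sum()

def simple_attention(X, theta):
    """Soft-selection attention of equation (1).

    X     : (n, d) matrix of input annotations x_1..x_n
    theta : (n,)   potentials, e.g. theta_i = MLP([x_i; q])
    Returns the context c = sum_i p(z = i | x, q) x_i.
    """
    p = softmax(theta)            # categorical attention distribution
    return p @ X                  # expectation of f(x, z) = x_z

# Toy usage: three 4-dimensional annotations and arbitrary potentials.
X = np.random.randn(3, 4)
theta = np.array([0.1, 2.0, -1.0])
c = simple_attention(X, theta)    # (4,) context vector
```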
Of course, this expectation can no longer be computed using a simple sum, and we need to incorporate the machinery of inference directly into our neural network. Define a structured attention model as being an attention model where z is now a vector of discrete latent variables [z1, . . . , zm] and the attention distribution is p(z | x, q) is defined as a conditional random field (CRF), specifying the independence structure of the z variables. Formally, we assume an undirected graph structure with m vertices. The CRF is parameterized with clique (log-)potentials θC(z C) R, where the z C indicates the subset of z given by clique C. Under this definition, the attention probability is defined as, p(z | x, q; θ) = softmax(P C θC(z C)), where for symmetry we use softmax in a general sense, i.e. softmax(g(z)) = 1 Z exp(g(z)) where Z = P z exp(g(z )) is the implied partition function. In practice we use a neural CRF, where θ comes from a deep model over x, q. In structured attention, we also assume that the annotation function f factors (at least) into clique annotation functions f(x, z) = P C f C(x, z C). Under standard conditions on the conditional independence structure, inference techniques from graphical models can be used to compute the forwardpass expectations and the context: c = Ez p(z | x,q)[f(x, z)] = X C Ez p(z C | x,q)[f C(x, z C)] 3.1 EXAMPLE 1: SUBSEQUENCE SELECTION Suppose instead of soft-selecting a single input, we wanted to explicitly model the selection of contiguous subsequences. We could naively apply categorical attention over all subsequences, or hope the model learns a multi-modal distribution to combine neighboring words. Structured attention provides an alternate approach. Concretely, let m = n, define z to be a random vector z = [z1, . . . , zn] with zi {0, 1}, and define our annotation function to be, f(x, z) = Pn i=1 fi(x, zi) where fi(x, zi) = 1{zi = 1}xi. The explicit expectation is then, Ez1,...,zn[f(x, z)] = i=1 p(zi = 1 | x, q)xi (2) Published as a conference paper at ICLR 2017 x1 x2 x3 x4 z1 z2 z3 z4 x1 x2 x3 x4 z1 z2 z3 z4 x1 x2 x3 x4 Figure 1: Three versions of a latent variable attention model: (a) A standard soft-selection attention network, (b) A Bernoulli (sigmoid) attention network, (c) A linear-chain structured attention model for segmentation. The input and query are denoted with x and q respectively. Equation (2) is similar to equation (1) both are a linear combination of the input representations where the scalar is between [0, 1] and represents how much attention should be focused on each input. However, (2) is fundamentally different in two ways: (i) it allows for multiple inputs (or no inputs) to be selected for a given query; (ii) we can incorporate structural dependencies across the zi s. For instance, we can model the distribution over z with a linear-chain CRF with pairwise edges, p(z1, . . . , zn | x, q) = softmax i=1 θi,i+1(zi, zi+1) where θk,l is the pairwise potential for zi = k and zi+1 = l. This model is shown in Figure 1c. Compare this model to the standard attention in Figure 1a, or to a simple Bernoulli (sigmoid) selection method, p(zi = 1 | x, q) = sigmoid(θi), shown in Figure 1b. All three of these methods can use potentials from the same neural network or RNN that takes x and q as inputs. In the case of the linear-chain CRF in (3), the marginal distribution p(zi = 1 | x) can be calculated efficiently in linear-time for all i using message-passing, i.e. the forward-backward algorithm. 
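As a sanity check on this construction, the sketch below computes the node marginals of the linear-chain CRF in (3) with forward-backward and verifies that the marginal-weighted sum equals the explicit expectation of equation (2) obtained by enumerating all 2^n binary selections. It works in plain probability space for readability (Section 3.3 discusses the log-space arithmetic needed in practice); names are illustrative, not taken from the released code.

```python
import itertools
import numpy as np

def chain_marginals(theta):
    """Node marginals p(z_i = c) of the linear-chain CRF in equation (3).

    theta : (n-1, C, C) pairwise potentials theta_{i,i+1}(z_i, z_{i+1}).
    Returns an (n, C) array of marginals via forward-backward.
    """
    n, C = theta.shape[0] + 1, theta.shape[1]
    psi = np.exp(theta)                        # clique factors
    alpha = np.ones((n, C)); beta = np.ones((n, C))
    for i in range(1, n):                      # forward messages
        alpha[i] = alpha[i - 1] @ psi[i - 1]
    for i in range(n - 2, -1, -1):             # backward messages
        beta[i] = psi[i] @ beta[i + 1]
    Z = alpha[-1].sum()                        # partition function
    return alpha * beta / Z

def brute_force_context(theta, X):
    """Equation (2) by summing over all C^n assignments (exposition only)."""
    n, C = theta.shape[0] + 1, theta.shape[1]
    num, Z = np.zeros(X.shape[1]), 0.0
    for z in itertools.product(range(C), repeat=n):
        w = np.exp(sum(theta[i, z[i], z[i + 1]] for i in range(n - 1)))
        Z += w
        num += w * sum(X[i] for i in range(n) if z[i] == 1)
    return num / Z

n, d = 5, 3
theta = np.random.randn(n - 1, 2, 2)           # binary states {0, 1}
X = np.random.randn(n, d)
p = chain_marginals(theta)                     # p[:, 1] = p(z_i = 1 | x)
assert np.allclose(p[:, 1] @ X, brute_force_context(theta, X))
```

The dynamic program touches each of the n positions once per state pair, while the brute-force sum touches all 2^n sequences; both yield the same context vector.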
These marginals allow us to calculate (2), and in doing so we implicitly sum over an exponentially-sized set of structures (i.e. all binary sequences of length n) through dynamic programming. We refer to this type of attention layer as a segmentation attention layer. Note that the forward-backward algorithm is being used as parameterized pooling (as opposed to output computation), and can be thought of as generalizing the standard attention softmax. Crucially this generalization from vector softmax to forward-backward is just a series of differentiable steps,2 and we can compute gradients of its output (marginals) with respect to its input (potentials). This will allow the structured attention model to be trained end-to-end as part of a deep model. 3.2 EXAMPLE 2: SYNTACTIC TREE SELECTION This same approach can be used for more involved structural dependencies. One popular structure for natural language tasks is a dependency tree, which enforces a structural bias on the recursive dependencies common in many languages. In particular a dependency tree enforces that each word in a source sentence is assigned exactly one parent word (head word), and that these assignments do not cross (projective structure). Employing this bias encourages the system to make a soft-selection based on learned syntactic dependencies, without requiring linguistic annotations or a pipelined decision. A dependency parser can be partially formalized as a graphical model with the following cliques (Smith & Eisner, 2008): latent variables zij {0, 1} for all i = j, which indicates that the i-th word is the parent of the j-th word (i.e. xi xj); and a special global constraint that rules out configurations of zij s that violate parsing constraints (e.g. one head, projectivity). The parameters to the graph-based CRF dependency parser are the potentials θij, which reflect the score of selecting xi as the parent of xj. The probability of a parse tree z given the sentence 2As are other dynamic programming algorithms for inference in graphical models, such as (loopy and nonloopy) belief propagation. Published as a conference paper at ICLR 2017 procedure FORWARDBACKWARD(θ) α[0, t ] 0 β[n + 1, t ] 0 for i = 1, . . . , n; c C do y α[i 1, y] θi 1,i(y, c) for i = n, . . . , 1; c C do y β[i + 1, y] θi,i+1(c, y) A α[n + 1, t ] for i = 1, . . . , n; c C do p(zi = c | x) exp(α[i, c] β[i, c] A) return p procedure BACKPROPFORWARDBACKWARD(θ, p, L p ) L α log p log L p β A L β log p log L p α A ˆα[0, t ] 0 ˆβ[n + 1, t ] 0 for i = n, . . . 1; c C do ˆβ[i, c] L α[i, c] L y θi,i+1(c, y) ˆβ[i + 1, y] for i = 1, . . . , n; c C do ˆα[i, c] L β [i, c] L y θi 1,i(y, c) ˆα[i 1, y] for i = 1, . . . , n; y, c C do L θi 1,i(y,c) signexp(ˆα[i, y] β[i + 1, c] α[i, y] ˆβ[i + 1, c] α[i, y] β[i + 1, c] A) return L θ Figure 2: Algorithms for linear-chain CRF: (left) computation of forward-backward tables α, β, and marginal probabilities p from potentials θ (forward-backward algorithm); (right) backpropagation of loss gradients with respect to the marginals L p . C denotes the state space and t is the special start/stop state. Backpropagation uses the identity L log p = p L p to calculate L θ = L log p log p θ , where is the element-wise multiplication. Typically the forward-backward with marginals is performed in the log-space semifield R { } with binary operations = logadd and = + for numerical precision. 
However, backpropagation requires working with the log of negative values (since L p could be negative), so we extend to a field [R { }] {+, } with special +/ log-space operations. Binary operations applied to vectors are implied to be element-wise. The signexp function is defined as signexp(la) = sa exp(la). See Section 3.3 and Table 1 for more details. x = [x1, . . . , xn] is, p(z | x, q) = softmax 1{z is valid} X i =j 1{zij = 1}θij where z is represented as a vector of zij s for all i = j. It is possible to calculate the marginal probability of each edge p(zij = 1 | x, q) for all i, j in O(n3) time using the inside-outside algorithm (Baker, 1979) on the data structures of Eisner (1996). The parsing contraints ensure that each word has exactly one head (i.e. Pn i=1 zij = 1). Therefore if we want to utilize the soft-head selection of a position j, the context vector is defined as: i=1 1{zij = 1}xi cj = Ez[fj(x, z)] = i=1 p(zij = 1 | x, q)xi Note that in this case the annotation function has the subscript j to produce a context vector for each word in the sentence. Similar types of attention can be applied for other tree properties (e.g. soft-children). We refer to this type of attention layer as a syntactic attention layer. 3.3 END-TO-END TRAINING Graphical models of this form have been widely used as the final layer of deep models. Our contribution is to argue that these networks can be added within deep networks in place of simple attention layers. The whole model can then be trained end-to-end. The main complication in utilizing this approach within the network itself is the need to backpropagate the gradients through an inference algorithm as part of the structured attention network. Past work has demonstrated the techniques necessary for this approach (see Stoyanov et al. (2011)), but to our knowledge it is very rarely employed. Consider the case of the simple linear-chain CRF layer from equation (3). Figure 2 (left) shows the standard forward-backward algorithm for computing the marginals p(zi = 1 | x, q; θ). If we treat the forward-backward algorithm as a neural network layer, its input are the potentials θ, and its output Published as a conference paper at ICLR 2017 after the forward pass are these marginals.3 To backpropagate a loss through this layer we need to compute the gradient of the loss L with respect to θ, L θ , as a function of the gradient of the loss with respect to the marginals, L p .4 As the forward-backward algorithm consists of differentiable steps, this function can be derived using reverse-mode automatic differentiation of the forward-backward algorithm itself. Note that this reverse-mode algorithm conveniently has a parallel structure to the forward version, and can also be implemented using dynamic programming. sa sb la+b sa+b la b sa b + + la + log(1 + d) + la + lb + + la + log(1 d) + la + lb + la + log(1 d) la + lb la + log(1 + d) la + lb + Table 1: Signed log-space semifield (from Li & Eisner (2009)). Each real number a is represented as a pair (la, sa) where la = log |a| and sa = sign(a). Therefore a = sa exp(la). For the above we let d = exp(lb la) and assume |a| > |b|. However, in practice, one cannot simply use current off-the-shelf tools for this task. For one, efficiency is quite important for these models and so the benefits of handoptimizing the reverse-mode implementation still outweighs simplicity of automatic differentiation. Secondly, numerical precision becomes a major issue for structured attention networks. 
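To make the precision issue concrete, the following sketch implements the two binary operations that appear in Figure 2 and Table 1: logadd (the ⊕ of the standard log-space semifield) and its signed extension used during backpropagation, where a real number a is stored as a pair (la, sa) with a = sa · exp(la). This is an illustrative reimplementation under those conventions, not the released code.

```python
import math

NEG_INF = float("-inf")

def logadd(la, lb):
    """oplus in the log-space semifield: log(exp(la) + exp(lb))."""
    if la == NEG_INF: return lb
    if lb == NEG_INF: return la
    m = max(la, lb)
    return m + math.log1p(math.exp(-abs(la - lb)))

def signed_logadd(la, sa, lb, sb):
    """oplus in the signed log-space semifield of Table 1.

    Each real a is stored as (la, sa) with a = sa * exp(la); the sum
    a + b is returned in the same representation.
    """
    if la == NEG_INF: return lb, sb
    if lb == NEG_INF: return la, sa
    if la < lb:                              # ensure |a| >= |b|
        la, sa, lb, sb = lb, sb, la, sa
    d = math.exp(lb - la)                    # d = exp(lb - la) <= 1
    if sa == sb:                             # same sign: |a| + |b|
        return la + math.log1p(d), sa
    if d == 1.0:                             # a + b == 0 exactly
        return NEG_INF, +1
    return la + math.log1p(-d), sa           # opposite signs: |a| - |b|

def signexp(la, sa):
    """signexp of Figure 2: recover the real number sa * exp(la)."""
    return sa * math.exp(la)

# a = 3.0, b = -2.5  ->  a + b = 0.5
l, s = signed_logadd(math.log(3.0), +1, math.log(2.5), -1)
assert abs(signexp(l, s) - 0.5) < 1e-12
```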
For computing the forward-pass and the marginals, it is important to use the standard log-space semifield over R { } with binary operations ( = logadd, = +) to avoid underflow of probabilities. For computing the backward-pass, we need to remain in logspace, but also handle log of negative values (since L p could be negative). This requires extending to the signed log-space semifield over [R { }] {+, } with special +/ operations. Table 1, based on Li & Eisner (2009), demonstrates how to handle this issue, and Figure 2 (right) describes backpropagation through the forward-backward algorithm. For dependency parsing, the forward pass can be computed using the inside-outside implementation of Eisner s algorithm (Eisner, 1996). Similarly, the backpropagation parallels the inside-outside structure. Forward/backward pass through the inside-outside algorithm is described in Appendix B. 4 EXPERIMENTS We experiment with three instantiations of structured attention networks on four different tasks: (a) a simple, synthetic tree manipulation task using the syntactic attention layer, (b) machine translation with segmentation attention (i.e. two-state linear-chain CRF), (c) question answering using an nstate linear-chain CRF for multi-step inference over n facts, and (d) natural language inference with syntactic tree attention. These experiments are not intended to boost the state-of-the-art for these tasks but to test whether these methods can be trained effectively in an end-to-end fashion, can yield improvements over standard selection-based attention, and can learn plausible latent structures. All model architectures, hyperparameters, and training details are further described in Appendix A. 4.1 TREE TRANSDUCTION The first set of experiments look at a tree-transduction task. These experiments use synthetic data to explore a failure case of soft-selection attention models. The task is to learn to convert a random formula given in prefix notation to one in infix notation, e.g., ( ( + ( + 15 7 ) 1 8 ) ( + 19 0 11 ) ) ( ( 15 + 7 ) + 1 + 8 ) ( 19 + 0 + 11 ) The alphabet consists of symbols {(, ), +, }, numbers between 0 and 20, and a special root symbol $. This task is used as a preliminary task to see if the model is able to learn the implicit tree structure on the source side. The model itself is an encoder-decoder model, where the encoder is defined below and the decoder is an LSTM. See Appendix A.2 for the full model. 3Confusingly, forward in this case is different than in the forward-backward algorithm, as the marginals themselves are the output. However the two uses of the term are actually quite related. The forward-backward algorithm can be interpreted as a forward and backpropagation pass on the log partition function. See Eisner (2016) for further details (appropriately titled Inside-Outside and Forward-Backward Algorithms Are Just Backprop ). As such our full approach can be seen as computing second-order information. This interpretation is central to Li & Eisner (2009). 4In general we use a b to denote the Jacobian of a with respect to b. Published as a conference paper at ICLR 2017 Figure 3: Visualization of the source self-attention distribution for the simple (left) and structured (right) attention models on the tree transduction task. $ is the special root symbol. Each row delineates the distribution over the parents (i.e. each row sums to one). The attention distribution obtained from the parsing marginals are more able to capture the tree structure e.g. 
the attention weights of closing parentheses are generally placed on the opening parentheses (though not necessarily on a single parenthesis). Training uses 15K prefix-infix pairs where the maximum nesting depth is set to be between 2-4 (the above example has depth 3), with 5K pairs in each depth bucket. The number of expressions in each parenthesis is limited to be at most 4. Test uses 1K unseen sequences with depth between 2-6 (note specifically deeper than train), with 200 sequences for each depth. The performance is measured as the average proportion of correct target tokens produced until the first failure (as in Grefenstette et al. (2015)). For experiments we try using different forms of self-attention over embedding-only encoders. Let xj be an embedding for each source symbol; our three variants of the source representation ˆxj are: (a) no atten, just symbol embeddings by themselves, i.e. ˆxj = xj; (b) simple attention, symbol embeddings and soft-pairing for each symbol, i.e. ˆxj = [xj; cj] where cj = Pn i=1 softmax(θij)xi is calculated using soft-selection; (c) structured attention, symbol embeddings and soft-parent, i.e. ˆxj = [xj; cj] where cj = Pn i=1 p(zij = 1 | x)xi is calculated using parsing marginals, obtained from the syntactic attention layer. None of these models use an explicit query value the potentials come from running a bidirectional LSTM over the source, producing hidden vectors hi, and then computing θij = tanh(s tanh(W1hi + W2hj + b)) where s, b, W1, W2 are parameters (see Appendix A.1). Depth No Atten Simple Structured 2 7.6 87.4 99.2 3 4.1 49.6 87.0 4 2.8 23.3 64.5 5 2.1 15.0 30.8 6 1.5 8.5 18.2 Table 2: Performance (average length to failure %) of models on the tree-transduction task. The source representation [ˆx1, . . . , ˆxn] are attended over using the standard attention mechanism at each decoding step by an LSTM decoder.5 Additionally, symbol embedding parameters are shared between the parsing LSTM and the source encoder. Results Table 2 has the results for the task. Note that this task is fairly difficult as the encoder is quite simple. The baseline model (unsurprisingly) performs poorly as it has no information about the source ordering. The simple attention model performs better, but is significantly outperformed by the structured model with a tree structure bias. We hypothesize that the model is partially reconstructing the arithmetic tree. Figure 3 shows the attention distribution for the simple/structured models on the same source sequence, which indicates that the structured model is able to learn boundaries (i.e. parentheses). 5Thus there are two attention mechanisms at work under this setup. First, structured attention over the source only to obtain soft-parents for each symbol (i.e. self-attention). Second, standard softmax alignment attention over the source representations during decoding. Published as a conference paper at ICLR 2017 4.2 NEURAL MACHINE TRANSLATION Our second set of experiments use a full neural machine translation model utilizing attention over subsequences. Here both the encoder/decoder are LSTMs, and we replace standard simple attention with a segmentation attention layer. We experiment with two settings: translating directly from unsegmented Japanese characters to English words (effectively using structured attention to perform soft word segmentation), and translating from segmented Japanese words to English words (which can be interpreted as doing phrase-based neural machine translation). 
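For concreteness, the sketch below shows how a segmentation attention layer pools encoder states for one decoder step, assuming the two-state parameterization described in the following paragraphs: unary potentials θi(1) = hi⊤Whj′ and θi(0) = 0, a 2x2 pairwise bias b, and marginals renormalized by γ = (1/λ) Σi p(zi = 1 | x, q). It is a minimal NumPy illustration with hypothetical names, not the released model code.

```python
import numpy as np

def two_state_marginals(theta):
    """p(z_i = 1) for a binary linear-chain CRF with pairwise potentials
    theta[i] = theta_{i,i+1} of shape (n-1, 2, 2); plain-space
    forward-backward, adequate for short toy inputs."""
    n = theta.shape[0] + 1
    psi = np.exp(theta)
    alpha = np.ones((n, 2)); beta = np.ones((n, 2))
    for i in range(1, n):
        alpha[i] = alpha[i - 1] @ psi[i - 1]
    for i in range(n - 2, -1, -1):
        beta[i] = psi[i] @ beta[i + 1]
    p = alpha * beta / alpha[-1].sum()
    return p[:, 1]

def segmentation_context(H, h_dec, W, b, lam=2.0):
    """Structured attention context c_j for one decoder state.

    H     : (n, d) encoder states h_1..h_n
    h_dec : (d,)   decoder state (query) h'_j
    W     : (d, d) bilinear map for the unary potential theta_i(1)
    b     : (2, 2) pairwise potential b_{z_i, z_{i+1}}
    """
    n = H.shape[0]
    unary = np.stack([np.zeros(n), H @ W @ h_dec], axis=1)   # (n, 2)
    theta = unary[:-1, :, None] + unary[1:, None, :] + b      # (n-1, 2, 2)
    p = two_state_marginals(theta)                            # p(z_i = 1)
    gamma = p.sum() / lam                                     # normalizer
    return (p / gamma) @ H

# Toy usage with random parameters (illustration only).
n, d = 6, 8
H, h_dec = np.random.randn(n, d), np.random.randn(d)
W, b = 0.1 * np.random.randn(d, d), np.random.randn(2, 2)
c = segmentation_context(H, h_dec, W, b)
```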
Japanese word segmentation is done using the Ky Tea toolkit (Neubig et al., 2011). The data comes from the Workshop on Asian Translation (WAT) (Nakazawa et al., 2016). We randomly pick 500K sentences from the original training set (of 3M sentences) where the Japanese sentence was at most 50 characters and the English sentence was at most 50 words. We apply the same length filter on the provided validation/test sets for evaluation. The vocabulary consists of all tokens that occurred at least 10 times in the training corpus. The segmentation attention layer is a two-state CRF where the unary potentials at the j-th decoder step are parameterized as θi(k) = hi Whj, k = 1 0, k = 0 Here [h1, . . . , hn] are the encoder hidden states and h j is the j-th decoder hidden state (i.e. the query vector). The pairwise potentials are parameterized linearly with b, i.e. all together θi,i+1(zi, zi+1) = θi(zi) + θi+1(zi+1) + bzi,zi+1 Therefore the segmentation attention layer requires just 4 additional parameters. Appendix A.3 describes the full model architecture. We experiment with three attention configurations: (a) standard simple attention, i.e. cj = Pn i=1 softmax(θi)hi; (b) sigmoid attention: multiple selection with Bernoulli random variables, i.e. cj = Pn i=1 sigmoid(θi)hi; (c) structured attention, encoded with normalized CRF marginals, p(zi = 1 | x, q) i=1 p(zi = 1 | x, q) The normalization term γ is not ideal but we found it to be helpful for stable training.6 λ is a hyperparameter (we use λ = 2) and we further add an l2 penalty of 0.005 on the pairwise potentials b. These values were found via grid search on the validation set. Simple Sigmoid Structured CHAR 12.6 13.1 14.6 WORD 14.1 13.8 14.3 Table 3: Translation performance as measured by BLEU (higher is better) on characterto-word and word-to-word Japanese-English translation for the three different models. Results Results for the translation task on the test set are given in Table 3. Sigmoid attention outperforms simple (softmax) attention on the character-toword task, potentially because it is able to learn manyto-one alignments. On the word-to-word task, the opposite is true, with simple attention outperforming sigmoid attention. Structured attention outperforms both models on both tasks, although improvements on the word-to-word task are modest and unlikely to be statistically significant. For further analysis, Figure 4 shows a visualization of the different attention mechanisms on the character-to-word setup. The simple model generally focuses attention heavily on a single character. In contrast, the sigmoid and structured models are able to spread their attention distribution on contiguous subsequences. The structured attention learns additional parameters (i.e. b) to smooth out this type of attention. 6With standard expectation (i.e. cj = Pn i=1 p(zi = 1 | x, q)hi) we empirically observed the marginals to quickly saturate. We tried various strategies to overcome this, such as putting an l2 penalty on the unary potentials and initializing with a pretrained sigmoid attention model, but simply normalizing the marginals proved to be the most effective. However, this changes the interpretation of the context vector as the expectation of an annotation function in this case. Published as a conference paper at ICLR 2017 Figure 4: Visualization of the source attention distribution for the simple (top left), sigmoid (top right), and structured (bottom left) attention models over the ground truth sentence on the character-to-word translation task. 
Manually-annotated alignments are shown in bottom right. Each row delineates the attention weights over the source sentence at each step of decoding. The sigmoid/structured attention models are able learn an implicit segmentation model and focus on multiple characters at each time step. 4.3 QUESTION ANSWERING Our third experiment is on question answering (QA) with the linear-chain CRF attention layer for inference over multiple facts. We use the b Ab I dataset (Weston et al., 2015), where the input is a set of sentences/facts paired with a question, and the answer is a single token. For many of the tasks the model has to attend to multiple supporting facts to arrive at the correct answer (see Figure 5 for an example), and existing approaches use multiple hops to greedily attend to different facts. We experiment with employing structured attention to perform inference in a non-greedy way. As the ground truth supporting facts are given in the dataset, we are able to assess the model s inference accuracy. The baseline (simple) attention model is the End-To-End Memory Network (Sukhbaatar et al., 2015) (Mem N2N), which we briefly describe here. See Appendix A.4 for full model details. Let x1, . . . , xn be the input embedding vectors for the n sentences/facts and let q be the query embedding. In Mem N2N, zk is the random variable for the sentence to select at the k-th inference step (i.e. k-th hop), and thus zk {1, . . . , n}. The probability distribution over zk is given by p(zk = i | x, q) = softmax((xk i ) qk), and the context vector is given by ck = Pn i=1 p(zk = i | x, q)ok i , where xk i , ok i are the input and output embedding for the i-th sentence at the k-th hop, respectively. The k-th context vector is used to modify the query qk+1 = qk + ck, and this process repeats for k = 1, . . . , K (for k = 1 we have xk i = xi, qk = q, ck = 0). The K-th context and query vectors are used to obtain the final answer. The attention mechanism for a K-hop Mem N2N network can therefore be interpreted as a greedy selection of a length-K sequence of facts (i.e. z1, . . . , z K). For structured attention, we use an n-state, K-step linear-chain CRF.7 We experiment with two different settings: (a) a unary CRF model with node potentials θk(i) = (xk i ) qk 7Note that this differs from the segmentation attention for the neural machine translation experiments described above, which was a K-state (with K = 2), n-step linear-chain CRF. Published as a conference paper at ICLR 2017 Mem N2N Binary CRF Unary CRF Task K Ans % Fact % Ans % Fact % Ans % Fact % TASK 02 - TWO SUPPORTING FACTS 2 87.3 46.8 84.7 81.8 43.5 22.3 TASK 03 - THREE SUPPORTING FACTS 3 52.6 1.4 40.5 0.1 28.2 0.0 TASK 07 - COUNTING 3 83.2 83.5 79.3 TASK 08 - LISTS SETS 3 94.1 93.3 87.1 TASK 11 - INDEFINITE KNOWLEDGE 2 97.8 38.2 97.7 80.8 88.6 0.0 TASK 13 - COMPOUND COREFERENCE 2 95.6 14.8 97.0 36.4 94.4 9.3 TASK 14 - TIME REASONING 2 99.9 77.6 99.7 98.2 90.5 30.2 TASK 15 - BASIC DEDUCTION 2 100.0 59.3 100.0 89.5 100.0 51.4 TASK 16 - BASIC INDUCTION 3 97.1 91.0 97.9 85.6 98.0 41.4 TASK 17 - POSITIONAL REASONING 2 61.1 23.9 60.6 49.6 59.7 10.5 TASK 18 - SIZE REASONING 2 86.4 3.3 92.2 3.9 92.0 1.4 TASK 19 - PATH FINDING 2 21.3 10.2 24.4 11.5 24.3 7.8 AVERAGE 81.4 39.6 81.0 53.7 73.8 17.4 Table 4: Answer accuracy (Ans %) and supporting fact selection accuracy (Fact %) of the three QA models on the 1K b Ab I dataset. K indicates the number of hops/inference steps used for each task. 
Task 7 and 8 both contain variable number of facts and hence they are excluded from the fact accuracy measurement. Supporting fact selection accuracy is calculated by taking the average of 10 best runs (out of 20) for each task. and (b) a binary CRF model with pairwise potentials θk,k+1(i, j) = (xk i ) qk + (xk i ) xk+1 j + (xk+1 j ) qk+1 The binary CRF model is designed to test the model s ability to perform sequential reasoning. For both (a) and (b), a single context vector is computed: c = P z1,...,z K p(z1, . . . , z K | x, q)f(x, z) (unlike Mem N2N which computes K context vectors). Evaluating c requires summing over all n K possible sequences of length K, which may not be practical for large values of K. However, if f(x, z) factors over the components of z (e.g. f(x, z) = PK k=1 fk(x, zk)) then one can rewrite the above sum in terms of marginals: c = PK k=1 Pn i=1 p(zk = i | x, q)fk(x, zk). In our experiments, we use fk(x, zk) = ok zk. All three models are described in further detail in Appendix A.4. Results We use the version of the dataset with 1K questions for each task. Since all models reduce to the same network for tasks with 1 supporting fact, they are excluded from our experiments. The number of hops (i.e. K) is task-dependent, and the number of memories (i.e. n) is limited to be at most 25 (note that many question have less than 25 facts e.g. the example in Figure 5 has 9 facts). Due to high variance in model performance, we train 20 models with different initializations for each task and report the test accuracy of the model that performed the best on a 10% held-out validation set (as is typically done for b Ab I tasks). Results of the three different models are shown in Table 4. For correct answer seletion (Ans %), we find that Mem N2N and the Binary CRF model perform similarly while the Unary CRF model does worse, indicating the importance of including pairwise potentials. We also assess each model s ability to attend to the correct supporting facts in Table 4 (Fact %). Since ground truth supporting facts are provided for each query, we can check the sequence accuracy of supporting facts for each model (i.e. the rate of selecting the exact correct sequence of facts) by taking the highest probability sequence ˆz = argmax p(z1, . . . , z K | x, q) from the model and checking against the ground truth. Overall the Binary CRF is able to recover supporting facts better than Mem N2N. This improvement is significant and can be up to two-fold as seen for task 2, 11, 13 & 17. However we observed that on many tasks it is sufficient to select only the last (or first) fact correctly to predict the answer, and thus higher sequence selection accuracy does not necessarily imply better answer accuracy (and vice versa). For example, all three models get 100% answer accuracy on task 15 but have different supporting fact accuracies. Finally, in Figure 5 we visualize of the output edge marginals produced by the Binary CRF model for a single question in task 16. In this instance, the model is uncertain but ultimately able to select the right sequence of facts 5 6 8. Published as a conference paper at ICLR 2017 Figure 5: Visualization of the attention distribution over supporting fact sequences for an example question in task 16 for the Binary CRF model. The actual question is displayed at the bottom along with the correct answer and the ground truth supporting facts (5 6 8). 
The edges represent the marginal probabilities p(zk, zk+1 | x, q), and the nodes represent the n supporting facts (here we have n = 9). The text for the supporting facts are shown on the left. The top three most likely sequences are: p(z1 = 5, z2 = 6, z3 = 8 | x, q) = 0.0564, p(z1 = 5, z2 = 6, z3 = 3 | x, q) = 0.0364, p(z1 = 5, z2 = 2, z3 = 3 | x, q) = 0.0356. 4.4 NATURAL LANGUAGE INFERENCE The final experiment looks at the task of natural language inference (NLI) with the syntactic attention layer. In NLI, the model is given two sentences (hypothesis/premise) and has to predict their relationship: entailment, contradiction, neutral. For this task, we use the Stanford NLI dataset (Bowman et al., 2015) and model our approach off of the decomposable attention model of Parikh et al. (2016). This model takes in the matrix of word embeddings as the input for each sentence and performs inter-sentence attention to predict the answer. Appendix A.5 describes the full model. As in the transduction task, we focus on modifying the input representation to take into account soft parents via self-attention (i.e. intra-sentence attention). In addition to the three baselines described for tree transduction (No Attention, Simple, Structured), we also explore two additional settings: (d) hard pipeline parent selection, i.e. ˆxj = [xj; xhead(j)], where head(j) is the index of xj s parent8; (e) pretrained structured attention: structured attention where the parsing layer is pretrained for one epoch on a parsed dataset (which was enough for convergence). Results Results of our models are shown in Table 5. Simple attention improves upon the no attention model, and this is consistent with improvements observed by Parikh et al. (2016) with their intra-sentence attention model. The pipelined model with hard parents also slightly improves upon the baseline. Structured attention outperforms both models, though surprisingly, pretraining the syntactic attention layer on the parse trees performs worse than training it from scratch it is possible that the pretrained attention is too strict for this task. We also obtain the hard parse for an example sentence by running the Viterbi algorithm on the syntactic attention layer with the non-pretrained model: $ The men are fighting outside a deli . 8The parents are obtained from running the dependency parser of Andor et al. (2016), available at https://github.com/tensorflow/models/tree/master/syntaxnet Published as a conference paper at ICLR 2017 Model Accuracy % Handcrafted features (Bowman et al., 2015) 78.2 LSTM encoders (Bowman et al., 2015) 80.6 Tree-Based CNN (Mou et al., 2016) 82.1 Stack-Augmented Parser-Interpreter Neural Net (Bowman et al., 2016) 83.2 LSTM with word-by-word attention (Rockt aschel et al., 2016) 83.5 Matching LSTMs (Wang & Jiang, 2016) 86.1 Decomposable attention over word embeddings (Parikh et al., 2016) 86.3 Decomposable attention + intra-sentence attention (Parikh et al., 2016) 86.8 Attention over constituency tree nodes (Zhao et al., 2016) 87.2 Neural Tree Indexers (Munkhdalai & Yu, 2016) 87.3 Enhanced Bi LSTM Inference Model (Chen et al., 2016) 87.7 Enhanced Bi LSTM Inference Model + ensemble (Chen et al., 2016) 88.3 No Attention 85.8 No Attention + Hard parent 86.1 Simple Attention 86.2 Structured Attention 86.8 Pretrained Structured Attention 86.5 Table 5: Results of our models (bottom) and others (top) on the Stanford NLI test set. Our baseline model has the same architecture as Parikh et al. 
(2016) but the performance is slightly different due to different settings (e.g. we train for 100 epochs with a batch size of 32 while Parikh et al. (2016) train for 400 epochs with a batch size of 4 using asynchronous SGD.) Despite being trained without ever being exposed to an explicit parse tree, the syntactic attention layer learns an almost plausible dependency structure. In the above example it is able to correctly identify the main verb fighting, but makes mistakes on determiners (e.g. head of The should be men). We generally observed this pattern across sentences, possibly because the verb structure is more important for the inference task. 5 CONCLUSION This work outlines structured attention networks, which incorporate graphical models to generalize simple attention, and describes the technical machinery and computational techniques for backpropagating through models of this form. We implement two classes of structured attention layers: a linear-chain CRF (for neural machine translation and question answering) and a more complicated first-order dependency parser (for tree transduction and natural language inference). Experiments show that this method can learn interesting structural properties and improve on top of standard models. Structured attention could also be a way of learning latent labelers or parsers through attention on other tasks. It should be noted that the additional complexity in computing the attention distribution increases run-time for example, structured attention was approximately 5 slower to train than simple attention for the neural machine translation experiments, even though both attention layers have the same asymptotic run-time (i.e. O(n)). Embedding differentiable inference (and more generally, differentiable algorithms) into deep models is an exciting area of research. While we have focused on models that admit (tractable) exact inference, similar technique can be used to embed approximate inference methods. Many optimization algorithms (e.g. gradient descent, LBFGS) are also differentiable (Domke, 2012; Maclaurin et al., 2015), and have been used as output layers for structured prediction in energy-based models (Belanger & Mc Callum, 2016; Wang et al., 2016). Incorporating them as internal neural network layers is an interesting avenue for future work. ACKNOWLEDGMENTS We thank Tao Lei, Ankur Parikh, Tim Vieira, Matt Gormley, Andr e Martins, Jason Eisner, Yoav Goldberg, and the anonymous reviewers for helpful comments, discussion, notes, and code. We additionally thank Yasumasa Miyamoto for verifying Japanese-English translations. Published as a conference paper at ICLR 2017 Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. Globally Normalized Transition-Based Neural Networks. In Proceedings of ACL, 2016. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of ICLR, 2015. James K. Baker. Trainable Grammars for Speech Recognition. Speech Communication Papers for the 97th Meeting of the Acoustical Society, 1979. David Belanger and Andrew Mc Callum. Structured Prediction Energy Networks. In Proceedings of ICML, 2016. Samuel R. Bowman, Christopher D. Manning, and Christopher Potts. Tree-Structured Composition in Neural Networks without Tree-Structured Architectures. In Proceedings of the NIPS workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches, 2015. Samuel R. 
Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, and Christopher Potts. A Fast Unified Model for Parsing and Sentence Understanding. In Proceedings of ACL, 2016. William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. Listen, Attend and Spell. ar Xiv:1508.01211, 2015. Liang-Chieh Chen, Alexander G. Schwing, Alan L. Yuille, and Raquel Urtasun. Learning Deep Structured Models. In Proceedings of ICML, 2015. Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, and Hui Jiang. Enhancing and Combining Sequential and Tree LSTM for Natural Language Inference. ar Xiv:1609.06038, 2016. Kyunghyun Cho, Aaron Courville, and Yoshua Bengio. Describing Multimedia Content using Attention-based Encoder-Decoder Networks. In IEEE Transactions on Multimedia, 2015. Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-Based Models for Speech Recognition. In Proceedings of NIPS, 2015. Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural Language Processing (almost) from Scratch. Journal of Machine Learning Research, 12:2493 2537, 2011. Trinh-Minh-Tri Do and Thierry Arti eres. Neural Conditional Random Fields. In Proceedings of AISTATS, 2010. Justin Domke. Parameter Learning with Truncated Message-Passing. In Proceedings of CVPR, 2011. Justin Domke. Generic methods for optimization-based modeling. In AISTATS, pp. 318 326, 2012. John Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12:2021 2159, 2011. Greg Durrett and Dan Klein. Neural CRF Parsing. In Proceedings of ACL, 2015. Jason M. Eisner. Three New Probabilistic Models for Dependency Parsing: An Exploration. In Proceedings of ACL, 1996. Jason M. Eisner. Inside-Outside and Forward-Backward Algorithms are just Backprop. In Proceedings of Structured Prediction Workshop at EMNLP, 2016. Matthew R. Gormley, Mark Dredze, and Jason Eisner. Approximation-Aware Dependency Parsing by Belief Propagation. In Proceedings of TACL, 2015. Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing Machines. ar Xiv:1410.5401, 2014. Published as a conference paper at ICLR 2017 Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska Barwinska, Sergio Gomez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adria Puigdomenech Badia, Karl Moritz Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, and Demis Hassabis. Hybrid Computing Using a Neural Network with Dynamic External Memory. Nature, October 2016. Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. Learning to Transduce with Unbounded Memory. In Proceedings of NIPS, 2015. Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching Machines to Read and Comprehend. In Proceedings of NIPS, 2015. Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep Structured Output Learning for Unconstrained Text Recognition. In Proceedings of ICLR, 2014. Diederik Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In Proceedings of ICLR, 2015. Eliyahu Kipperwasser and Yoav Goldberg. Simple and Accurate Dependency Parsing using Bidirectional LSTM Feature Representations. In TACL, 2016. Lingpeng Kong, Chris Dyer, and Noah A. Smith. Segmental Recurrent Neural Networks. In Proceedings of ICLR, 2016. 
John Lafferty, Andrew Mc Callum, and Fernando Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of ICML, 2001. Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. Neural Architectures for Named Entity Recognition. In Proceedings of NAACL, 2016. Yann Le Cun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based Learning Applied to Document Recognition. In Proceedings of IEEE, 1998. Zhifei Li and Jason Eisner. Firstand Second-Order Expectation Semirings with Applications to Minimum-Risk Training on Translation Forests. In Proceedings of EMNLP 2009, 2009. Liang Lu, Lingpeng Kong, Chris Dyer, Noah A. Smith, and Steve Renals. Segmental Recurrent Neural Networks for End-to-End Speech Recognition. In Proceedings of INTERSPEECH, 2016. Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective Approaches to Attentionbased Neural Machine Translation. In Proceedings of EMNLP, 2015. Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. Gradient-based Hyperparameter Optimization through Reversible Learning. In Proceedings of ICML, 2015. Lili Mou, Rui Men, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. Natural language inference by tree-based convolution and heuristic matching. In Proceedings of ACL, 2016. Tsendsuren Munkhdalai and Hong Yu. Neural Tree Indexers for Text Understanding. arxiv:1607.04492, 2016. Toshiaki Nakazawa, Manabu Yaguchi, Kiyotaka Uchimoto, Masao Utiyama, Eiichiro Sumita, Sadao Kurohashi, and Hitoshi Isahara. Aspec: Asian scientific paper excerpt corpus. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis (eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2016), pp. 2204 2208, Portoro, Slovenia, may 2016. European Language Resources Association (ELRA). ISBN 978-2-9517408-9-1. Graham Neubig, Yosuke Nakata, and Shinsuke Mori. Pointwise Prediction for Robust, Adaptable Japanese Morphological Analysis. In Proceedings of ACL, 2011. Published as a conference paper at ICLR 2017 Ankur P. Parikh, Oscar Tackstrom, Dipanjan Das, and Jakob Uszkoreit. A Decomposable Attention Model for Natural Language Inference. In Proceedings of EMNLP, 2016. Jian Peng, Liefeng Bo, and Jinbo Xu. Conditional Neural Fields. In Proceedings of NIPS, 2009. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glo Ve: Global Vectors for Word Representation. In Proceedings of EMNLP, 2014. Tim Rockt aschel, Edward Grefenstette, Karl Moritz Hermann, Tomas Kocisky, and Phil Blunsom. Reasoning about Entailment with Neural Attention. In Proceedings of ICLR, 2016. John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pp. 3528 3536, 2015. David A. Smith and Jason Eisner. Dependency Parsing as Belief Propagation. In Proceedings of EMNLP, 2008. Veselin Stoyanov and Jason Eisner. Minimum-Risk Training of Approximate CRF-based NLP Systems. In Proceedings of NAACL, 2012. Veselin Stoyanov, Alexander Ropson, and Jason Eisner. Empirical Risk Minimization of Graphical Model Parameters Given Approximate Inference, Decoding, and Model Structure. In Proceedings of AISTATS, 2011. Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-To-End Memory Networks. In Proceedings of NIPS, 2015. 
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer Networks. In Proceedings of NIPS, 2015. Shenlong Wang, Sanja Fidler, and Raquel Urtasun. Proximal Deep Structured Models. In Proceedings of NIPS, 2016. Shuohang Wang and Jing Jiang. Learning Natural Language Inference with LSTM. In Proceedings of NAACL, 2016. Jason Weston, Sumit Chopra, and Antoine Bordes. Memory Networks. ar Xiv:1410.3916, 2014. Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merri enboer, Armand Joulin, and Tomas Mikolov. Towards Ai-complete Question Answering: A Set of Prerequisite Toy Tasks. ar Xiv preprint ar Xiv:1502.05698, 2015. Kelvin Xu, Jimma Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of ICML, 2015. Lei Yu, Jan Buys, and Phil Blunsom. Online Segment to Segment Neural Transduction. In Proceedings of EMNLP, 2016. Lei Yu, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Tomas Kocisky. The Neural Noisy Channel. In Proceedings of ICLR, 2017. Kai Zhao, Liang Huang, and Minbo Ma. Textual Entailment with Structured Attentions and Composition. In Proceedings of COLING, 2016. Published as a conference paper at ICLR 2017 A MODEL DETAILS A.1 SYNTACTIC ATTENTION The syntactic attention layer (for tree transduction and natural language inference) is similar to the first-order graph-based dependency parser of Kipperwasser & Goldberg (2016). Given an input sentence [x1, . . . , xn] and the corresponding word vectors [x1, . . . , xn], we use a bidirectional LSTM to get the hidden states for each time step i [1, . . . , n], hfwd i = LSTM(xi, hfwd i 1) hbwd i = LSTM(xi, hbwd i+1) hi = [hfwd i ; hbwd i ] where the forward and backward LSTMs have their own parameters. The score for xi xj (i.e. xi is the parent of xj), is given by an MLP θij = tanh(s tanh(W1hi + W2hj + b)) These scores are used as input to the inside-outside algorithm (see Appendix B) to obtain the probability of each word s parent p(zij = 1 | x), which is used to obtain the soft-parent cj for each word xj. In the non-structured case we simply have p(zij = 1 | x) = softmax(θij). A.2 TREE TRANSDUCTION Let [x1, . . . , xn], [y1, . . . , ym] be the sequence of source/target symbols, with the associated embeddings [x1, . . . , xn], [y1, . . . , ym] with xi, yj Rl. In the simplest baseline model we take the source representation to be the matrix of the symbol embeddings. The decoder is a one-layer LSTM which produces the hidden states h j = LSTM(yj, h j 1), with h j Rl. The hidden states are combined with the input representation via a bilinear map W Rl l to produce the attention distribution used to obtain the vector mi, which is combined with the decoder hidden state as follows, αi = exp xi Wh j Pn k=1 exp xk Wh j mi = i=1 αixi ˆhj = tanh(U[mi; h j]) Here we have W Rl l and U R2l l. Finally, ˆhj is used to to obtain a distribution over the next symbol yj+1, p(yj+1 | x1, . . . , xn, y1, . . . , yj) = softmax(Vˆhj + b) For structured/simple models, the j-th source representation are respectively k=1 p(zki = 1 | x) xk k=1 softmax(θki) xk where θij comes from the bidirectional LSTM described in A.1. Then αi and mi changed accordingly, αi = exp ˆxi Wh j Pn k=1 exp ˆxk Wh j mi = Note that in this case we have W R2l l and U R3l l. We use l = 50 in all our experiments. The forward/backward LSTMs for the parsing LSTM are also 50-dimensional. 
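For reference, the parent scorer of A.1, θij = tanh(s⊤ tanh(W1 hi + W2 hj + b)), can be computed for all (i, j) pairs at once. The following is a minimal NumPy sketch assuming the BiLSTM states hi are already given; parameter names and shapes are illustrative and follow the 50-dimensional setup described above.

```python
import numpy as np

def parent_scores(H, W1, W2, b, s):
    """theta_ij = tanh(s^T tanh(W1 h_i + W2 h_j + b)),
    the score for x_i being the parent of x_j (Appendix A.1).

    H : (n, 2l) BiLSTM states h_i = [h_i^fwd; h_i^bwd] (assumed given)
    W1, W2 : (h, 2l), b : (h,), s : (h,)
    Returns an (n, n) score matrix; the diagonal is not meaningful.
    """
    left = H @ W1.T                # (n, h), head (parent) contribution
    right = H @ W2.T               # (n, h), child contribution
    hidden = np.tanh(left[:, None, :] + right[None, :, :] + b)  # (n, n, h)
    return np.tanh(hidden @ s)     # (n, n)

# Toy usage: n = 5 tokens, 2l = 100-dim BiLSTM states, h = 50 hidden units.
n, two_l, h = 5, 100, 50
H = np.random.randn(n, two_l)
theta = parent_scores(H, 0.1 * np.random.randn(h, two_l),
                      0.1 * np.random.randn(h, two_l),
                      np.zeros(h), 0.1 * np.random.randn(h))
# theta then feeds the inside-outside routine of Appendix B
# to obtain the parse marginals p(z_ij = 1 | x).
```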
Symbol embeddings are shared between the encoder and the parsing LSTMs. Additional training details include: batch size of 20; training for 13 epochs with a learning rate of 1.0, which starts decaying by half after epoch 9 (or the epoch at which performance does not improve on validation, whichever comes first); parameter initialization over a uniform distribution U[ 0.1, 0.1]; gradient normalization at 1 (i.e. renormalize the gradients to have norm 1 if the l2 norm exceeds 1). Decoding is done with beam search (beam size = 5). Published as a conference paper at ICLR 2017 A.3 NEURAL MACHINE TRANSLATION The baseline NMT system is from Luong et al. (2015). Let [x1, . . . , xn], [y1, . . . , ym] be the source/target sentence, with the associated word embeddings [x1, . . . , xn], [y1, . . . , ym]. The encoder is an LSTM over the source sentence, which produces the hidden states [h1, . . . , hn] where hi = LSTM(xi, hi 1) and hi Rl. The decoder is another LSTM which produces the hidden states h j Rl. In the simple attention case with categorical attention, the hidden states are combined with the input representation via a bilinear map W Rl l and this distribution is used to obtain the context vector at the j-th time step, θi = hi Wh j cj = i=1 softmax(θi)hi The Bernoulli attention network has the same θi but instead uses a sigmoid to obtain the weights of the linear combination, i.e., i=1 sigmoid(θi)hi And finally, the structured attention model uses a bilinear map to parameterize one of the unary potentials θi(k) = hi Wh j, k = 1 0, k = 0 θi,i+1(zi, zi+1) = θi(zi) + θi+1(zi+1) + bzi,zi+1 where b are the pairwise potentials. These potentials are used as inputs to the forward-backward algorithm to obtain the marginals p(zi = 1 | x, q), which are further normalized to obtain the context vector p(zi = 1 | x, q) i p(zi = 1 | x, q) We use λ = 2 and also add an l2 penalty of 0.005 on the pairwise potentials b. The context vector is then combined with the decoder hidden state ˆhj = tanh(U[cj; h j]) and ˆhj is used to obtain the distribution over the next target word yj+1 p(yj+1 | x1, . . . , xn, y1, . . . yj) = softmax(Vˆhj + b) The encoder/decoder LSTMs have 2 layers and 500 hidden units (i.e. l = 500). Additional training details include: batch size of 128; training for 30 epochs with a learning rate of 1.0, which starts decaying by half after the first epoch at which performance does not improve on validation; dropout with probability 0.3; parameter initialization over a uniform distribution U[ 0.1, 0.1]; gradient normalization at 1. We generate target translations with beam search (beam size = 5), and evaluate with multi-bleu.perl from Moses.9 A.4 QUESTION ANSWERING Our baseline model (Mem N2N) is implemented following the same architecture as described in Sukhbaatar et al. (2015). In particular, let x = [x1, . . . , xn] represent the sequence of n facts with the associated embeddings [x1, . . . , xn] and let q be the embedding of the query q. The embeddings 9https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/ multi-bleu.perl Published as a conference paper at ICLR 2017 are obtained by simply adding the word embeddings in each sentence or query. The full model with K hops is as follows: p(zk = i | x, q) = softmax((xk i ) qk) i=1 p(zk = i | x, q)ok i qk+1 = qk + ck p(y | x, q) = softmax(W(q K + c K)) where p(y | x, q) is the distribution over the answer vocabulary. At each layer, {xk i } and {ok i } are computed using embedding matrices Xk and Ok. 
We use the adjacent weight tying scheme from the paper, so that X^{k+1} = O^k and W^⊤ = O^K. X^1 is also used to compute the query embedding at the first hop. For k = 1 we have x_i^k = x_i, q^k = q, and c^k = 0.

For both the Unary and the Binary CRF models, the same input fact and query representations are computed (i.e. the same embedding matrices with the weight tying scheme). For the unary model, the potentials are parameterized as

$\theta_k(i) = (x_i^k)^\top q^k$

and for the binary model we compute pairwise potentials as

$\theta_{k,k+1}(i, j) = (x_i^k)^\top q^k + (x_i^k)^\top x_j^{k+1} + (x_j^{k+1})^\top q^{k+1}$

The q^k's are updated between hops simply with a linear mapping. In the case of the Binary CRF, to discourage the model from selecting the same fact again, we additionally set θ_{k,k+1}(i, i) = −∞ for all i ∈ {1, ..., n}. Given these potentials, we compute the marginals p(z_k = i, z_{k+1} = j | x, q) using the forward-backward algorithm, which is then used to compute the context vector:

$c = \sum_{z_1, \dots, z_K} p(z_1, \dots, z_K \mid x, q)\, f(x, z) \qquad f(x, z) = \sum_{k=1}^K f_k(x, z_k) \qquad f_k(x, z_k) = o^k_{z_k}$

Note that if f(x, z) factors over the components of z (as is the case above), then computing c only requires evaluating the marginals p(z_k | x, q). Finally, given the context vector, the prediction is made in a similar fashion to MemN2N:

$p(y \mid x, q) = \text{softmax}(W(q^K + c))$

The rest of the training setup is similar to Sukhbaatar et al. (2015): we use stochastic gradient descent with learning rate 0.01, which is divided by 2 every 25 epochs until 100 epochs are reached. Capacity of the memory is limited to 25 sentences. The embedding vectors are of size 20, and gradients are renormalized if the norm exceeds 40. All models implement position encoding, temporal encoding, and linear start from the original paper. For linear start, the softmax(·) function in the attention layer is removed at the beginning and re-inserted after 20 epochs for MemN2N, while for the CRF models we apply a log(softmax(·)) layer on the q^k after 20 epochs. Each model is trained separately for each task.

A.5 NATURAL LANGUAGE INFERENCE

Our baseline model/setup is essentially the same as that of Parikh et al. (2016). Let [x_1, ..., x_n], [y_1, ..., y_m] be the premise/hypothesis, with the corresponding input representations [x_1, ..., x_n], [y_1, ..., y_m]. The input representations are obtained by a linear transformation of the 300-dimensional pretrained GloVe embeddings (Pennington et al., 2014) after normalizing the GloVe embeddings to have unit norm.10 The pretrained embeddings remain fixed but the linear layer (which is also 300-dimensional) is trained. Words not in the pretrained vocabulary are hashed to one of 100 Gaussian embeddings with mean 0 and standard deviation 1.

10 We use the GloVe embeddings pretrained over the 840 billion word Common Crawl, publicly available at http://nlp.stanford.edu/projects/glove/

We concatenate each input representation with a convex combination of the other sentence's input representations (essentially performing inter-sentence attention), where the weights are determined through a dot product followed by a softmax,

$e_{ij} = f(x_i)^\top f(y_j) \qquad \bar{x}_i = \Big[x_i;\ \sum_{j=1}^m \frac{\exp e_{ij}}{\sum_{k=1}^m \exp e_{ik}}\, y_j\Big] \qquad \bar{y}_j = \Big[y_j;\ \sum_{i=1}^n \frac{\exp e_{ij}}{\sum_{k=1}^n \exp e_{kj}}\, x_i\Big]$

Here f(·) is an MLP. The new representations are fed through another MLP g(·), summed, combined with the final MLP h(·), and fed through a softmax layer to obtain a distribution over the labels l,

$\bar{x} = \sum_{i=1}^n g(\bar{x}_i) \qquad \bar{y} = \sum_{j=1}^m g(\bar{y}_j) \qquad p(l \mid x_1, \dots, x_n, y_1, \dots, y_m) = \text{softmax}(V h([\bar{x}; \bar{y}]) + b)$

All the MLPs have 2 layers, 300 ReLU units, and dropout probability of 0.2.
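For concreteness, a minimal NumPy sketch of the inter-sentence attention step above might look as follows (the names are ours; `f` stands for any row-wise MLP):

```python
import numpy as np

def inter_sentence_attention(x, y, f):
    """x: (n, d) premise, y: (m, d) hypothesis, f: callable MLP applied row-wise.
    Returns the concatenated representations x_bar (n, 2d) and y_bar (m, 2d)."""
    e = f(x) @ f(y).T                                # e_ij = f(x_i)^T f(y_j)
    a = np.exp(e - e.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)                # softmax over j for each premise word i
    b = np.exp(e - e.max(axis=0, keepdims=True))
    b /= b.sum(axis=0, keepdims=True)                # softmax over i for each hypothesis word j
    x_bar = np.concatenate([x, a @ y], axis=1)       # x_i with its attended hypothesis summary
    y_bar = np.concatenate([y, b.T @ x], axis=1)     # y_j with its attended premise summary
    return x_bar, y_bar
```

The structured and simple variants described next only change how the inputs to this step are constructed.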
For the structured/simple models, we first employ the bidirectional parsing LSTM (see A.1) to obtain the scores θ_{ij}. In the structured case each word representation is simply concatenated with its soft-parent,

$\hat{x}_i = \Big[x_i;\ \sum_{k=1}^n p(z_{ki} = 1 \mid x)\, x_k\Big]$

and x̂_i (and analogously ŷ_j) is used as the input to the above model. In the simple case (which closely corresponds to the intra-sentence attention model of Parikh et al. (2016)), we have

$\hat{x}_i = \Big[x_i;\ \sum_{k=1}^n \frac{\exp \theta_{ki}}{\sum_{l=1}^n \exp \theta_{li}}\, x_k\Big]$

The word embeddings for the parsing LSTMs are also initialized with GloVe, and the parsing layer is shared between the two sentences. The forward/backward LSTMs for the parsing layer are 100-dimensional. Additional training details include: batch size of 32; training for 100 epochs with Adagrad (Duchi et al., 2011), where the global learning rate is 0.05 and the sum of squared gradients is initialized to 0.1; parameter initialization over a Gaussian distribution with mean 0 and standard deviation 0.01; gradient normalization at 5. In the pretrained scenario, pretraining is done with Adam (Kingma & Ba, 2015) with learning rate equal to 0.01, and β_1 = 0.9, β_2 = 0.999.

B FORWARD/BACKWARD THROUGH THE INSIDE-OUTSIDE ALGORITHM

Figure 6 shows the procedure for obtaining the parsing marginals from the input potentials. This corresponds to running the inside-outside version of Eisner's algorithm (Eisner, 1996). The intermediate data structures used during the dynamic programming algorithm are the (log) inside tables α and the (log) outside tables β. Both α and β are of size n × n × 2 × 2, where n is the sentence length. The first two dimensions encode the start/end index of the span (i.e. subtree). The third dimension encodes whether the root of the subtree is the left (L) or right (R) index of the span. The fourth dimension indicates whether the span is complete (1) or incomplete (0). With this algorithm we can calculate the marginal distribution of each word's parent (for all words) in O(n³).

The backward pass through the inside-outside algorithm is slightly more involved, but still takes O(n³) time. Figure 7 illustrates the backward procedure, which receives the gradient of the loss L with respect to the marginals, ∂L/∂p, and computes the gradient of the loss with respect to the potentials, ∂L/∂θ. The computations must be performed in the signed log-space semifield to handle logs of negative values. See Section 3.3 and Table 1 for more details.
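As a concrete illustration of the signed log-space semifield mentioned above (and used in Figure 7), here is a minimal NumPy sketch of its two operations; a value a is represented as the pair (sign(a), log|a|), and the helper names are ours:

```python
import numpy as np

def signed_logadd(s1, l1, s2, l2):
    """Add two signed log-space numbers (s, l) ~ s * exp(l)."""
    if l1 < l2:                                   # make the first operand the larger magnitude
        s1, l1, s2, l2 = s2, l2, s1, l1
    if l2 == -np.inf:                             # adding (signed) zero
        return s1, l1
    if s1 == s2:                                  # same sign: ordinary logaddexp
        return s1, l1 + np.log1p(np.exp(l2 - l1))
    return s1, l1 + np.log1p(-np.exp(l2 - l1))    # opposite signs: log of a difference

def signed_logmul(s1, l1, s2, l2):
    """Multiply two signed log-space numbers: signs multiply, log-magnitudes add."""
    return s1 * s2, l1 + l2
```

With numbers represented this way, the ⊕ and ⊗ of Figure 6 extend to gradients that may be negative, which is what the backward pass in Figure 7 requires.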
procedure INSIDEOUTSIDE(θ)
    α, β ← −∞                                         ▷ Initialize log of inside (α) and outside (β) tables
    for i = 1, ..., n do
        α[i, i, L, 1] ← 0
        α[i, i, R, 1] ← 0
    β[1, n, R, 1] ← 0
    for k = 1, ..., n do                              ▷ Inside step
        for s = 1, ..., n − k do
            t ← s + k
            α[s, t, R, 0] ⊕← ⊕_{u ∈ [s, t−1]} α[s, u, R, 1] ⊗ α[u+1, t, L, 1] ⊗ θ_{st}
            α[s, t, L, 0] ⊕← ⊕_{u ∈ [s, t−1]} α[s, u, R, 1] ⊗ α[u+1, t, L, 1] ⊗ θ_{ts}
            α[s, t, R, 1] ⊕← ⊕_{u ∈ [s+1, t]} α[s, u, R, 0] ⊗ α[u, t, R, 1]
            α[s, t, L, 1] ⊕← ⊕_{u ∈ [s, t−1]} α[s, u, L, 1] ⊗ α[u, t, L, 0]
    for k = n, ..., 1 do                              ▷ Outside step
        for s = 1, ..., n − k do
            t ← s + k
            for u = s + 1, ..., t do
                β[s, u, R, 0] ⊕← β[s, t, R, 1] ⊗ α[u, t, R, 1]
                β[u, t, R, 1] ⊕← β[s, t, R, 1] ⊗ α[s, u, R, 0]
            if s > 1 then
                for u = s, ..., t − 1 do
                    β[s, u, L, 1] ⊕← β[s, t, L, 1] ⊗ α[u, t, L, 0]
                    β[u, t, L, 0] ⊕← β[s, t, L, 1] ⊗ α[s, u, L, 1]
            for u = s, ..., t − 1 do
                β[s, u, R, 1] ⊕← β[s, t, R, 0] ⊗ α[u+1, t, L, 1] ⊗ θ_{st}
                β[u+1, t, L, 1] ⊕← β[s, t, R, 0] ⊗ α[s, u, R, 1] ⊗ θ_{st}
            if s > 1 then
                for u = s, ..., t − 1 do
                    β[s, u, R, 1] ⊕← β[s, t, L, 0] ⊗ α[u+1, t, L, 1] ⊗ θ_{ts}
                    β[u+1, t, L, 1] ⊕← β[s, t, L, 0] ⊗ α[s, u, R, 1] ⊗ θ_{ts}
    A ← α[1, n, R, 1]                                 ▷ Log partition
    for s = 1, ..., n − 1 do                          ▷ Compute marginals; note that p[s, t] = p(z_{st} = 1 | x)
        for t = s + 1, ..., n do
            p[s, t] ← exp(α[s, t, R, 0] + β[s, t, R, 0] − A)
            if s > 1 then
                p[t, s] ← exp(α[s, t, L, 0] + β[s, t, L, 0] − A)
    return p

Figure 6: Forward step of the syntactic attention layer to compute the marginals, using the inside-outside algorithm (Baker, 1979) on the data structures of Eisner (1996). We assume the special root symbol is the first element of the sequence and that the sentence length is n. Calculations are performed in the log-space semifield with ⊕ = logadd and ⊗ = + for numerical precision. "a, b ⊕← c" means a ⊕← c and b ⊕← c; "a ⊕← b" means a ← a ⊕ b.

procedure BACKPROPINSIDEOUTSIDE(θ, p, ∂L/∂p)
    for s, t = 1, ..., n; s ≠ t do                    ▷ Backpropagation uses the identity ∂L/∂θ = (p ⊙ ∂L/∂p)^⊤ ∂ log p/∂θ
        δ[s, t] ← log p[s, t] + log (∂L/∂p)[s, t]     ▷ δ = log(p ⊙ ∂L/∂p)
    ∂L/∂α, ∂L/∂β, log ∂L/∂θ ← −∞                      ▷ Initialize inside (∂L/∂α) and outside (∂L/∂β) gradients, and log of ∂L/∂θ
    ▷ Backpropagate δ to ∂L/∂α and ∂L/∂β: since p[s, t] = exp(α[s, t, R, 0] + β[s, t, R, 0] − A) (and p[t, s] analogously
    ▷ uses the L tables), each δ[s, t] is accumulated into ∂L/∂α[s, t, R, 0] and ∂L/∂β[s, t, R, 0], with a corresponding
    ▷ negative-signed contribution to the log-partition entry ∂L/∂α[1, n, R, 1].
    for k = 1, ..., n do                              ▷ Backpropagate through the outside step
        ▷ For each span [s, t] with t = s + k, every Figure 6 outside update of the form β[dst] ⊕← β[src] ⊗ α[other] (⊗ θ)
        ▷ contributes ν ⊗ β[src] ⊗ α[other] (⊗ θ) to ∂L/∂β[src], ∂L/∂α[other], and log ∂L/∂θ, where ν = ∂L/∂β[dst] ⊘ β[dst].
    for k = n, ..., 1 do                              ▷ Backpropagate through the inside step
        ▷ Analogously, every Figure 6 inside update contributes ν ⊗ (its operands) to the gradients of those operands
        ▷ and to log ∂L/∂θ, with ν = ∂L/∂α[dst] ⊘ α[dst].
    return signexp(log ∂L/∂θ)                         ▷ Exponentiate the log gradient, multiply by its sign, and return ∂L/∂θ

Figure 7: Backpropagation through the inside-outside algorithm (shown schematically; each ▷-comment stands for the corresponding block of updates) to calculate the gradient with respect to the input potentials. Calculations are performed in the signed log-space semifield with ⊕ = logadd, ⊗ = +, and ⊘ = −, tracking the sign of each value separately. ∂a/∂b denotes the Jacobian of a with respect to b (so ∂L/∂θ is the gradient with respect to θ). "a, b ⊕← c" means a ← a ⊕ c and b ← b ⊕ c.
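Hand-derived backward passes like the one in Figure 7 are easy to get wrong, so a finite-difference check against the forward procedure of Figure 6 is a useful sanity test. Below is a minimal NumPy sketch under the assumption that `inside_outside(theta)` implements Figure 6 and `backprop_inside_outside(theta, p, dL_dp)` implements Figure 7; both names are placeholders rather than functions from the released code.

```python
import numpy as np

def grad_check(inside_outside, backprop_inside_outside, theta, eps=1e-4):
    """Compare the analytic gradient (Figure 7) to a central finite difference of Figure 6."""
    p = inside_outside(theta)
    dL_dp = np.random.randn(*p.shape)             # gradient of the arbitrary scalar loss L = <dL_dp, p>
    analytic = backprop_inside_outside(theta, p, dL_dp)
    numeric = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        t_plus, t_minus = theta.copy(), theta.copy()
        t_plus[idx] += eps
        t_minus[idx] -= eps
        numeric[idx] = np.sum(dL_dp * (inside_outside(t_plus) - inside_outside(t_minus))) / (2 * eps)
    return np.max(np.abs(analytic - numeric))     # should be near zero if the backward pass is correct
```

A small maximum discrepancy (relative to the scale of the gradient) indicates that the signed log-space updates are consistent with the forward marginals.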