# Syntax-Directed Attention for Neural Machine Translation

Kehai Chen,¹ Rui Wang,² Masao Utiyama,² Eiichiro Sumita,² Tiejun Zhao¹
¹Harbin Institute of Technology, Harbin, China
²National Institute of Information and Communications Technology, Kyoto, Japan
{khchen, tjzhao}@hit.edu.cn, {wangrui, mutiyama, eiichiro.sumita}@nict.go.jp

(Kehai Chen was an internship research fellow at NICT when conducting this work.)

## Abstract

The attention mechanism, including global attention and local attention, plays a key role in neural machine translation (NMT). Global attention attends to all source words when predicting a target word, whereas local attention selectively looks at a fixed window of source words. However, the alignment weights for the current target word decrease to the left and right of the aligned source position according to linear distance, neglecting syntax distance constraints. In this paper, we extend the local attention with a syntax-distance constraint, which focuses on source words syntactically related to the predicted target word, to learn a more effective context vector for predicting translations. Moreover, we further propose a double-context NMT architecture, which combines a global context vector with a syntax-directed local context vector derived from the global attention, to further improve translation by better exploiting the source representation. Experiments on large-scale Chinese-to-English and English-to-German translation tasks show that the proposed approach achieves a substantial and significant improvement over the baseline system.

## 1 Introduction

Recent work on neural machine translation (NMT) adopts the encoder-decoder framework (Kalchbrenner and Blunsom 2013; Cho et al. 2014; Sutskever, Vinyals, and Le 2014), which employs a recurrent neural network (RNN) encoder to represent the source sentence and an RNN decoder to generate the target translation word by word. In particular, NMT with an attention mechanism (referred to as global attention) acquires source sentence context dynamically at each decoding step, thus improving translation performance (Bahdanau, Cho, and Bengio 2015). The global attention was further refined into a local attention (Luong, Pham, and Manning 2015), which selectively looks at a fixed-window source context at each decoding step and has demonstrated its effectiveness on the WMT translation tasks between English and German in both directions.

Specifically, the local attention first predicts a single aligned source position $p_i$ for the current time step $i$. The decoder then focuses on the fixed-window encoder states centered around the source position $p_i$ and computes a context vector $c^l_i$ from the alignment weights $\alpha^l$ for predicting the current target word. Figure 1(a) shows a Chinese-to-English NMT model with the local attention, whose contextual window is set to five. When the aligned source word is fenzi, the local attention focuses on the source words {zhexie, weixian, fenzi, yanzhong, yingxiang} in the window to compute its context vector. Meanwhile, the local attention weights these five encoder states by a Gaussian distribution, which penalizes their alignment weights according to their linear distance from fenzi. For example, the linear distances of these five source words within the contextual window are {2, 1, 0, 1, 2}, as shown in Figure 1(b). In other words, the farther a source word in the window is from the aligned word, the less it contributes to the context vector.
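To make this linear-distance penalty concrete, the following is a minimal NumPy sketch of Luong-style local attention, in which softmax alignment weights over a window around the aligned position $p_i$ are rescaled by a Gaussian term $\exp(-(j-p_i)^2/(2\sigma^2))$. The window size, the choice $\sigma = D/2$, and the dot-product scorer are illustrative assumptions rather than details taken from this paper.

```python
import numpy as np

def local_attention_weights(enc_states, dec_state, p_i, D=2):
    """Gaussian-penalized local attention (a sketch of Luong-style local-p
    attention; window size, sigma, and the dot-product scorer are assumptions).

    enc_states: (J, d) encoder states h_1..h_J
    dec_state:  (d,)   current decoder hidden state
    p_i:        float  predicted aligned source position
    D:          int    half-width of the window (window size = 2D + 1)
    """
    J = enc_states.shape[0]
    sigma = D / 2.0                                    # std. dev. used by Luong et al. (2015)
    lo, hi = max(0, int(p_i) - D), min(J, int(p_i) + D + 1)
    window = np.arange(lo, hi)

    scores = enc_states[window] @ dec_state            # dot-product alignment scores
    align = np.exp(scores - scores.max())
    align /= align.sum()                               # softmax over the window

    gauss = np.exp(-((window - p_i) ** 2) / (2 * sigma ** 2))
    weights = align * gauss                            # penalize by linear distance from p_i

    context = weights @ enc_states[window]             # local context vector c^l_i
    return weights, context

# Toy usage: 9 encoder states of dimension 4, aligned position 2 (e.g. "fenzi").
H = np.random.randn(9, 4)
s = np.random.randn(4)
w, c = local_attention_weights(H, s, p_i=2.0)
print(np.round(w, 3))  # weights decay as the linear distance from position 2 grows
```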
Despite its success, the local attention encodes source context and computes the local context vector according to linear distance centered on the current aligned source position; it does not take syntax distance constraints into account. Figure 1(c) shows the dependency tree of the Chinese sentence in Figure 1(b). Suppose the word fenzi₀ is the aligned source word; its syntax-distance neighbor window is {zhexie₁, weixian₁, fenzi₀, yingxiang₁, yanzhong₂, zhengce₂}, where the subscript of each word is its syntax distance from the central word. In comparison, its local neighbor window based on linear distance is {zhexie, weixian, yanzhong, yingxiang, zhengchang}. Note that zhengce is very informative for the correct translation, but it is so far from fenzi in linear distance that the local attention can hardly focus on it. Besides, the syntax distances of yanzhong and yingxiang are two and one, whereas their linear distances are one and two; that is, yingxiang is syntactically more relevant to fenzi than yanzhong is. However, the existing attention mechanisms, whether global or local, do not allow NMT to exploit such syntax distance constraints over the source representation.

In this paper, we extend the local attention with a novel syntax-distance constraint to capture the source words that are syntactically related to the predicted target word. Following the dependency tree of a source sentence, each source word has a syntax-distance constraint mask, which records its syntax distance to the other source words. The decoder then focuses on the syntax-related source words within the syntax-distance constraint to compute a more effective context vector for predicting the target word. Moreover, we further propose a double-context NMT architecture, which combines a global context vector with a syntax-directed local context vector derived from the global attention, to further improve translation by better exploiting the source representation. Experiments on the large-scale Chinese-to-English and English-to-German translation tasks show that the proposed approach achieves a substantial and significant improvement over the baseline system.

Figure 1: (a) NMT with the local attention; the black dotted box marks the current aligned source word and the red dotted box marks the predicted target word. (b) Linear distances for the source word fenzi, where each number denotes the linear distance. (c) Syntax-directed distances for the source word fenzi, where each blue number denotes the syntax-directed distance between that word and fenzi.

## 2 Background

### 2.1 Global Attention-based NMT

In NMT (Bahdanau, Cho, and Bengio 2015), the context for translation prediction relies heavily on the attention mechanism and the source input. Typically, the decoder computes an alignment score $e_{ij}$ between each source annotation $h_j$ and the predicted target word $y_i$ according to the previous decoder hidden state $s_{i-1}$:

$$e_{ij} = f(s_{i-1}, h_j), \qquad (1)$$

where $f$ is a feed-forward alignment model and $s_{i-1}$ is produced by a GRU-based recurrent decoder.
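Since Eq. (1) leaves the form of $f$ open, here is a minimal sketch of the common single-layer feed-forward parametrization from Bahdanau, Cho, and Bengio (2015); the parameter names $W_a$, $U_a$, $v_a$ are illustrative assumptions, not taken from this paper.

```python
import numpy as np

def alignment_scores(s_prev, enc_states, W_a, U_a, v_a):
    """Eq. (1): e_ij = f(s_{i-1}, h_j), with f as a single-layer feed-forward
    alignment model (Bahdanau et al. 2015); parameter names are assumptions.

    s_prev:     (d_dec,)     previous decoder hidden state s_{i-1}
    enc_states: (J, d_enc)   source annotations h_1..h_J
    W_a: (d_a, d_dec)  U_a: (d_a, d_enc)  v_a: (d_a,)
    """
    # Project the decoder state once and every annotation, then combine.
    hidden = np.tanh(W_a @ s_prev + enc_states @ U_a.T)   # (J, d_a)
    return hidden @ v_a                                    # (J,) scores e_i1..e_iJ
```

The resulting score vector feeds directly into the normalization of Eq. (2) below.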
Then all alignment scores are normalized to compute the weight $\alpha_{ij}$ of each encoder state $h_j$:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{J} \exp(e_{ik})}. \qquad (2)$$

Furthermore, $\alpha_{ij}$ is used to weight all source annotations when computing the context vector $c^g_i$ for the current time step:

$$c^g_i = \sum_{j=1}^{J} \alpha_{ij} h_j. \qquad (3)$$

Finally, the context vector $c^g_i$ is used to predict the target word $y_i$ through a non-linear layer:

$$P(y_i \mid y_{<i}, \mathbf{x}) = \mathrm{softmax}\big(g(y_{i-1}, s_i, c^g_i)\big). \qquad (4)$$
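Continuing the sketch from Eq. (1), Eqs. (2)-(4) can be written as follows; the read-out $g$ is simplified to a single affine map over $[y_{i-1}; s_i; c^g_i]$, which is an illustrative assumption rather than the exact parametrization used in the paper.

```python
import numpy as np

def global_attention_step(e_i, enc_states, s_i, y_prev_emb, W_o, b_o):
    """Eqs. (2)-(3) plus a simplified word-prediction layer, Eq. (4).

    e_i:        (J,)        alignment scores from Eq. (1)
    enc_states: (J, d_enc)  source annotations h_1..h_J
    s_i:        (d_dec,)    current decoder hidden state
    y_prev_emb: (d_emb,)    embedding of the previous target word y_{i-1}
    W_o: (V, d_emb + d_dec + d_enc)  b_o: (V,)   read-out parameters (assumed form)
    """
    # Eq. (2): normalize the scores over all J source positions.
    alpha = np.exp(e_i - e_i.max())
    alpha /= alpha.sum()

    # Eq. (3): global context vector c^g_i as the weighted sum of annotations.
    c_g = alpha @ enc_states

    # Eq. (4), simplified: non-linear read-out over [y_{i-1}; s_i; c^g_i], then softmax.
    logits = W_o @ np.concatenate([y_prev_emb, s_i, c_g]) + b_o
    probs = np.exp(logits - logits.max())
    return alpha, c_g, probs / probs.sum()   # P(y_i | y_<i, x) over the target vocabulary
```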