# Improved Neural Machine Translation with Source Syntax

Shuangzhi Wu, Ming Zhou, Dongdong Zhang
Harbin Institute of Technology, Harbin, China / Microsoft Research
{v-shuawu, mingzhou, dozhang}@microsoft.com
(Contribution during an internship at Microsoft Research.)

## Abstract

Neural Machine Translation (NMT) based on the encoder-decoder architecture has recently achieved state-of-the-art performance. Researchers have shown that extending word-level attention to phrase-level attention by incorporating source-side phrase structure can enhance the attention model and achieve promising improvements. However, the word dependencies that can be crucial for correctly understanding a source sentence do not always hold between consecutive words (i.e., within a phrase structure); sometimes they span long distances. Phrase structures are therefore not the best way to explicitly model long-distance dependencies. In this paper we propose a simple but effective method to incorporate source-side long-distance dependencies into NMT. Our method, based on dependency trees, enriches each source state with the global dependency structure, which better captures the inherent syntactic structure of source sentences. Experiments on Chinese-English and English-Japanese translation tasks show that our proposed method outperforms state-of-the-art SMT and NMT baselines.

## 1 Introduction

Recently, Neural Machine Translation (NMT) with the attention-based encoder-decoder framework [Bahdanau et al., 2015] has achieved significant improvements in translation quality for many language pairs such as English-German, English-French and Chinese-English [Bahdanau et al., 2015; Luong et al., 2015a; Wu et al., 2016]. In a conventional NMT model, an encoder maps a source sentence of variable length into a sequence of intermediate hidden vector representations. These hidden vectors are then combined, weighted by the attention mechanism, and used by the decoder to generate translations. In most cases, both the encoder and the decoder are implemented as recurrent neural networks (RNNs).

Many methods have been proposed to improve the sequence-to-sequence NMT model since it was first proposed [Bahdanau et al., 2015; Sutskever et al., 2014]. Previous work mostly focuses on addressing the problem of out-of-vocabulary words [Jean et al., 2015], designing the attention mechanism [Luong et al., 2015a], and more efficient parameter learning [Shen et al., 2016]. These methods regard sentences as sequences of words, and the syntactic structures inherent in languages are neglected.

Recently, inspired by the successful application of source-side syntactic information in statistical machine translation (SMT) [Liu et al., 2006], [Eriguchi et al., 2016b] propose a new attentional NMT model which takes advantage of source-side syntactic information based on the Head-driven Phrase Structure Grammar [Sag et al., 1999]. They align each target word with both source words and source phrases. This kind of extension effectively handles cases in which one target word corresponds to a fragment of consecutive source words. However, the long-distance syntactic dependencies of the source side, which can be crucial for correctly understanding a sentence, are not explicitly addressed in any previous work. Although, in theory, the encoder RNN is able to remember a sufficiently long history, we still observe a substantial number of incorrect translations that are fluent and grammatical but violate the meaning of the source sentence.
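As a reference point for the conventional attention mechanism described above, the following is a minimal sketch of the attention step: the decoder state scores each encoder state, and the normalized scores weight the encoder states into a context vector. The dot-product scorer and all dimensions are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Weight encoder states by attention and return a context vector.

    A hedged sketch of standard attention; the scoring function
    (a plain dot product here) and shapes are assumptions.
    """
    # Alignment scores between the decoder state and each source state.
    scores = encoder_states @ decoder_state            # shape: (n,)
    # Softmax-normalize the scores into attention weights.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: attention-weighted sum of the encoder states.
    return weights @ encoder_states                    # shape: (d,)

# Toy usage: 5 source positions, hidden size 8.
H = np.random.randn(5, 8)
s = np.random.randn(8)
context = attention_context(s, H)
```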
Figure 1 shows an example of an incorrect translation that relates to the source syntactic structure. Though the translation is well formed and grammatical, its meaning is inconsistent with the given source sentence. The NMT model cannot well capture the dependency between the word 患者 (patients, the subject) and 就医 (see the doctor, the predicate). Even for the phrase-attention-based model, this kind of relation still cannot be explicitly modeled, as the words are non-consecutive and far apart. This demonstrates that it remains a challenge for the NMT encoder to capture such subtle long-range word dependencies needed to correctly understand source sentences. Syntactic dependency trees, by contrast, can directly model such long-distance word correspondences. In Figure 1, if the dependency between the root word 就医 (see the doctor) and its subject 患者 (patients), denoted by an arrow, can be encoded by the NMT encoder, the NMT model is more likely to generate a correct translation.

Source: 请 患者 携带 家属 去 就医
(please) (patients) (with) (family) (go) (see the doctor)
Reference: Patients should go to see the doctor with their family
NMT: Patients should take their families to see the doctor

Figure 1: Example of an incorrect translation from a conventional NMT system. The arrows refer to dependency links in the dependency tree.

In this paper, we address the above problem and propose to improve NMT by leveraging the source-side dependency tree to explicitly incorporate source word dependencies into the NMT framework. Based on source dependency trees, we enrich each encoder state, in both the child-to-head and head-to-child directions, with global knowledge from the dependency structure. Two extra sequences are extracted using this structural knowledge and encoded by two additional RNNs, whose states are used to improve the encoder states (see the sketch at the end of this section). With the enriched source states, the decoder generates the target translation via the attention mechanism in the same way as in most NMT models. We describe our method in detail in Section 3.

We evaluate our method on publicly available data sets for Chinese-English and English-Japanese translation tasks. Experimental results on the Chinese-English task show that our model significantly improves translation accuracy over conventional NMT and SMT baseline systems. Experiments on the English-Japanese task also show that our method achieves better performance than the state-of-the-art tree-based NMT model of [Eriguchi et al., 2016b].

The major differences between our work and the previous tree-based method [Eriguchi et al., 2016b] are twofold: (1) we model source word relations that are important for understanding source sentences, whereas they focus on the mismatch problem that one target word may attend to a source phrase (multiple consecutive words); (2) our model enhances NMT by enriching each encoder state with the global source dependency structure, whereas they improve the NMT model by proposing a phrase-level attention.
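To make the sequence-extraction idea above concrete, here is a hedged sketch of how two auxiliary sequences could be read off a source dependency tree: one pairing each word with its head (child-to-head) and one pairing each word with its children (head-to-child). The function, the head-index encoding, and the toy tree are illustrative assumptions; the paper's exact linearization is specified in Section 3 and may differ.

```python
def dependency_sequences(words, heads):
    """Extract two auxiliary token sequences from a dependency tree.

    heads[i] is the index of word i's head, or -1 for the root.
    This is an illustrative sketch, not the paper's exact scheme.
    """
    # Child-to-head: each position is replaced by its head word.
    child_to_head = [words[h] if h >= 0 else "<root>" for h in heads]
    # Head-to-child: each position lists its dependents (children).
    head_to_child = []
    for i in range(len(words)):
        children = [words[j] for j, h in enumerate(heads) if h == i]
        head_to_child.append(children if children else ["<leaf>"])
    return child_to_head, head_to_child

# Toy tree loosely mirroring Figure 1 (head indices are made up for illustration):
words = ["please", "patients", "with", "family", "go", "see-doctor"]
heads = [5, 5, 4, 2, 5, -1]  # "see-doctor" is the root; "patients" attaches to it
up, down = dependency_sequences(words, heads)
```

Each of the two resulting sequences would then be encoded by its own RNN and combined with the base encoder states; exactly how that combination is done is a detail of Section 3.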
## 2 Background

Different from SMT, which consists of multiple sub-models, NMT is an end-to-end paradigm [Sutskever et al., 2014; Bahdanau et al., 2015] that directly models the conditional translation probability $p(Y|X)$ of a source sentence $X = x_1, x_2, x_3, \ldots, x_n$ and a target sentence $Y = y_1, y_2, y_3, \ldots, y_m$ with an RNN encoder and an RNN decoder.

The RNN encoder bidirectionally encodes the source sentence into a sequence of context vectors $H = h_1, h_2, h_3, \ldots, h_n$, where $h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$, and $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$ are calculated by two RNNs running left-to-right and right-to-left respectively:

$$\overrightarrow{h_i} = f_{\mathrm{RNN}}(x_i, \overrightarrow{h_{i-1}}), \qquad \overleftarrow{h_i} = f_{\mathrm{RNN}}(x_i, \overleftarrow{h_{i+1}})$$

where $f_{\mathrm{RNN}}$ can in practice be a Gated Recurrent Unit (GRU) [Cho et al., 2014] or a Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997]. In this paper, we use GRUs for all RNNs.

[Figure 2: Overview of the NMT framework with attention.]

Based on the target history and the source context, the RNN decoder computes the target translation in sequence by

$$p(Y|X) = \prod_{j=1}^{m} p(y_j \mid y_{<j}, X)$$
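As a concrete illustration of the bidirectional encoder just defined, the following hedged sketch realizes $f_{\mathrm{RNN}}$ with a GRU in PyTorch; the vocabulary size and dimensions are arbitrary assumptions chosen for the example, not the paper's settings.

```python
import torch
import torch.nn as nn

# A minimal sketch of the bidirectional encoder above: each position i
# gets h_i = [forward h_i ; backward h_i], with f_RNN realized as a GRU.
# Vocabulary size and dimensions are illustrative assumptions.
vocab_size, emb_dim, hidden_dim = 10000, 128, 256
embed = nn.Embedding(vocab_size, emb_dim)
encoder = nn.GRU(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

x = torch.randint(0, vocab_size, (1, 7))  # one source sentence of 7 tokens
H, _ = encoder(embed(x))                  # H: (1, 7, 2 * hidden_dim)
# H[:, i, :] is the concatenation of the forward and backward states at i.
```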