# Variational Recurrent Neural Machine Translation

Jinsong Su,^1 Shan Wu,^{1,2} Deyi Xiong,^3 Yaojie Lu,^2 Xianpei Han,^2 Biao Zhang^1
^1 Xiamen University, Xiamen, China; ^2 Institute of Software, Chinese Academy of Sciences, Beijing, China; ^3 Soochow University, Suzhou, China
jssu@xmu.edu.cn, wushan@stu.xmu.edu.cn, dyxiong@suda.edu.cn, yaojie2017@iscas.ac.cn, xianpei@nfs.iscas.ac.cn, zb@stu.xmu.edu.cn

## Abstract

Partially inspired by successful applications of variational recurrent neural networks, we propose a novel variational recurrent neural machine translation (VRNMT) model in this paper. Different from variational NMT, VRNMT introduces a sequence of latent random variables, rather than a single latent variable, to model the translation procedure of a sentence in a generative way. Specifically, the latent random variables are incorporated into the hidden states of the NMT decoder with elements from the variational autoencoder. In this way, these variables are generated recurrently, which enables them to capture strong and complex dependencies among the output translations at different timesteps. To deal with the challenges of efficient posterior inference and large-scale training that arise when incorporating latent variables, we build a neural posterior approximator and equip it with a reparameterization technique to estimate the variational lower bound. Experiments on Chinese-English and English-German translation tasks demonstrate that the proposed model achieves significant improvements over both the conventional and variational NMT models.

## 1. Introduction

Recently, neural machine translation (NMT) has gradually established state-of-the-art results over statistical machine translation (SMT) on various language pairs. Most NMT models consist of two recurrent neural networks (RNNs): a bidirectional RNN encoder that transforms the source sentence $x = \{x_1, x_2, \ldots, x_{T_x}\}$ into a sequence of hidden states, and a decoder that generates the corresponding target sentence $y = \{y_1, y_2, \ldots, y_{T_y}\}$ by exploiting source-side contexts via an attention network (Bahdanau, Cho, and Bengio 2015). This attentional encoder-decoder framework has become the dominant architecture for NMT.
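As a point of reference for the attentional framework just described, the following is a minimal sketch of the additive attention network of Bahdanau, Cho, and Bengio (2015); the layer names and dimensions are illustrative assumptions, not details taken from this work.

```python
# Minimal sketch (assumed names and sizes) of additive attention: at decoding
# step j, the previous decoder state s_{j-1} is scored against every encoder
# hidden state h_i to form a source-side context vector c_j.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim=2000, dec_dim=1000, att_dim=1000):
        super().__init__()
        self.W = nn.Linear(dec_dim, att_dim, bias=False)  # projects decoder state
        self.U = nn.Linear(enc_dim, att_dim, bias=False)  # projects encoder states
        self.v = nn.Linear(att_dim, 1, bias=False)        # scoring vector

    def forward(self, s_prev, enc_states):
        # s_prev: (batch, dec_dim); enc_states: (batch, T_x, enc_dim)
        scores = self.v(torch.tanh(self.W(s_prev).unsqueeze(1) + self.U(enc_states)))
        alpha = torch.softmax(scores, dim=1)              # attention weights over T_x
        c_j = (alpha * enc_states).sum(dim=1)             # context vector (batch, enc_dim)
        return c_j, alpha.squeeze(-1)
```

The additive scoring above follows the single-hidden-layer formulation of the original attention paper; dot-product or multiplicative variants are common alternatives.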
Within this framework, semantic representations of source and target sentences are learned implicitly. As a result, the learned representations are far from sufficient for capturing all semantic details and dependencies (Sutskever, Vinyals, and Le 2014; Tu et al. 2016). To remedy this insufficiency, Zhang et al. (2016a) present variational NMT (VNMT), which incorporates a latent random variable into NMT to serve as a global semantic signal for generating good translations. However, the internal transition structure of the RNN remains entirely deterministic, and hence this implementation may not be an effective way to model the high variability observed in structured data such as language modeling and machine translation (Chung et al. 2015). The potential of VNMT is therefore limited, and how to better improve NMT with latent variables remains open for further exploration.

In this paper, we propose a variational recurrent NMT (VRNMT) model to address the above problem, motivated by the recent success of the variational recurrent neural network (VRNN) (Chung et al. 2015). The model is illustrated in Fig. 1. VRNMT explicitly models the underlying semantics of bilingual sentence pairs, which are then exploited to refine translation. However, instead of employing only a single latent variable to capture the global semantics of each parallel sentence, we assume that there is a sequence of continuous latent random variables $z = \{z_1, z_2, \ldots, z_{T_y}\}$ in the underlying semantic space, where the recurrently generated variable $z_j$ participates in the generation of each target word $y_j$ and hidden state $s_{j+1}$. Formally, the conditional probability $p(y|x)$ is decomposed as follows:

$$p(y \mid x) = \prod_{j=1}^{T_y} p(y_j \mid x, y_{<j}, z_{\leq j}).$$
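To make this factorization concrete, the sketch below implements a single decoder step in which $z_j$ is drawn via the reparameterization technique mentioned in the abstract and then participates both in predicting $y_j$ and in computing the next hidden state $s_{j+1}$. The layer names, dimensions, and the exact conditioning of the Gaussian parameters are assumptions for illustration; the neural posterior approximator used during training is omitted, so this shows only the generative direction, not the authors' implementation.

```python
# Hypothetical sketch of one VRNMT-style decoder step: z_j is sampled from a
# Gaussian whose parameters depend on the current state s_j and context c_j,
# then feeds both the word prediction y_j and the next hidden state s_{j+1}.
import torch
import torch.nn as nn

class LatentDecoderStep(nn.Module):
    def __init__(self, emb_dim=620, hid_dim=1000, ctx_dim=2000, lat_dim=100, vocab=30000):
        super().__init__()
        self.prior = nn.Linear(hid_dim + ctx_dim, 2 * lat_dim)      # -> [mu; logvar]
        self.cell = nn.GRUCell(emb_dim + ctx_dim + lat_dim, hid_dim)
        self.readout = nn.Linear(hid_dim + ctx_dim + lat_dim, vocab)

    def forward(self, y_prev_emb, s_j, c_j):
        # 1) draw z_j by reparameterization: z = mu + sigma * eps, eps ~ N(0, I)
        mu, logvar = self.prior(torch.cat([s_j, c_j], dim=-1)).chunk(2, dim=-1)
        z_j = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # 2) z_j participates in predicting the target word y_j ...
        logits_j = self.readout(torch.cat([s_j, c_j, z_j], dim=-1))
        # 3) ... and in computing the next hidden state s_{j+1}
        s_next = self.cell(torch.cat([y_prev_emb, c_j, z_j], dim=-1), s_j)
        return logits_j, s_next

# usage with an illustrative batch of 16 sentences
step = LatentDecoderStep()
y_emb, s, c = torch.zeros(16, 620), torch.zeros(16, 1000), torch.zeros(16, 2000)
logits, s_next = step(y_emb, s, c)   # logits: (16, 30000); s_next: (16, 1000)
```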