# variational_recurrent_neural_machine_translation__73f4a0fa.pdf

Variational Recurrent Neural Machine Translation

Jinsong Su, Shan Wu, 1 1,2 Deyi Xiong,3 Yaojie Lu,2 Xianpei Han,2 Biao Zhang1

Xiamen University, Xiamen, China1

Institute of Software, Chinese Academy of Sciences, Beijing, China2

Soochow University, Suzhou, China3 jssu@xmu.edu.cn, wushan@stu.xmu.edu.cn, dyxiong@suda.edu.cn yaojie2017@iscas.ac.cn, xianpei@nfs.iscas.ac.cn, zb@stu.xmu.edu.cn

Partially inspired by successful applications of variational recurrent neural networks, we propose a novel variational recurrent neural machine translation (VRNMT) model in this paper. Different from the variational NMT, VRNMT introduces a series of latent random variables to model the translation procedure of a sentence in a generative way, instead of a single latent variable. Speciﬁcally, the latent random variables are included into the hidden states of the NMT decoder with elements from the variational autoencoder. In this way, these variables are recurrently generated, which enables them to further capture strong and complex dependencies among the output translations at different timesteps. In order to deal with the challenges in performing efﬁcient posterior inference and large-scale training during the incorporation of latent variables, we build a neural posterior approximator, and equip it with a reparameterization technique to estimate the variational lower bound. Experiments on Chinese-English and English-German translation tasks demonstrate that the proposed model achieves signiﬁcant improvements over both the conventional and variational NMT models.

1. Introduction Recently, neural machine translation (NMT) has gradually established state-of-the-art results over statistical machine translation (SMT) on various language pairs. Most NMT models consist of two recurrent neural networks (RNNs): a bidirectional RNN based encoder that transforms source sentence x = {x1, x2...x Tx} into a hidden state sequence, and a decoder that generates the corresponding target sentence y = {y1, y2...y Ty} by exploiting source-side contexts via an attention network (Bahdanau, Cho, and Bengio 2015). This attentional neural encoder-decoder framework has now become the dominant architecture for NMT. Within this framework, semantic representations of source and target sentences are learned in an implicit way. As a result, the learned semantic representations are far from being sufﬁcient for capturing all semantic details and dependencies (Sutskever, Vinyals, and Le 2014; Tu et al. 2016). To complement the insufﬁciency of semantic representations of NMT, Zhang et al. (2016a) present variational

Corresponding author. Copyright c 2018, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved.

NMT (VNMT) which incorporates a latent random variable into NMT, serving as a global semantic signal for generating good translations. However, the internal transition structure of RNN is entirely deterministic, and hence, this implementation may not be an effective way to model high variability observed in structured data, such as language modeling and machine translation (Chung et al. 2015). Therefore, the potential of VNMT is limited and how to better improve NMT with latent variables is still open for further exploration. In this paper, we propose a variational recurrent NMT (VRNMT) model to deal with the above-mentioned problem, motivated by recent success of the variational recurrent neural network (VRNN) (Chung et al. 2015). It is illustrated in Fig. 1. VRNMT explicitly models underlying semantics of bilingual sentence pairs, which are then exploited to reﬁne translation. However, instead of only employing a single latent variable to capture the global semantics of each parallel sentence, we assume that there is a continuous latent random variable sequence z = {z1, z2..., z Ty} in the underlying semantic space, where the iteratively generated variable zj participates in the generations of each target word yj and hidden state sj+1. Formally, the conditional probability p(y|x) is decomposed as follows:

j=1 p(yj|x, y<j) =

zj p(yj, zj|x, y<j)dzj

zj p(yj|x, y<j, zj)p(zj|x, y<j)dzj (1)

where zj encodes the semantic contexts at the j-th timestep. In doing so, we expect these latent variables to efﬁciently model the strong and complex dependencies between adjacent target words, which may not be effectively and sufﬁciently captured by the conventional NMT or VNMT. However, the incorporation of latent variables into the existing NMT models faces two challenges, as mentioned in (Zhang et al. 2016a): 1) the posterior inference in our model is intractable; 2) large-scale training, which lays the ground for the data-driven NMT, is accordingly problematic. To address these two issues, we follow Zhang et al. (2016a) to use deep neural networks, which are capable of learning highly nonlinear functions, to ﬁt the latent-variablerelated distributions, i.e. the prior and posterior. The former

The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)

Figure 1: Illustration of VRNMT as a directed graph. The green dotted lines illustrate the modeling procedure of the latent variable zj. The orange lines show the information ﬂow for the prediction of target word yj. The red lines highlight the incorporation of zj into the encoding of hidden state vector sj+1.

is pθ(zj|x, y<j), conditioned on source sentence and previously generated target words. The latter is qφ(zj|x, y j), approximated from all observed variables. To enable efﬁcient inference and learning, we adopt the reparameterization trick (Kingma and Welling 2014; Rezende, Mohamed, and Wierstra 2014) to bridge the gap between these two distributions. In this way, our model becomes an end-to-end neural network endowed with the stochastic optimization ability for enhancing its generality. Our main contributions in this work are twofold:

We propose a VRNMT model that not only explores the utilization of high-level latent random variables but also efﬁciently captures the strong and complex dependencies between neighboring target words for NMT. To the best of our knowledge, this is the ﬁrst attempt to adapt VRNN into NMT modeling.

Experimental results on Chinese-English and English German translation tasks show that the proposed model signiﬁcantly outperforms the conventional NMT and VNMT models.

2. Background

In this section, we brieﬂy describe the attention-based NMT model and VRNN, which provide background knowledge for the proposed model.

2.1 Attention-based NMT Model

Currently, the dominant NMT model mainly consists of a neural encoder and a neural decoder with an attention network (Bahdanau, Cho, and Bengio 2015).

Generally, the encoder is a bidirectional RNN learning hidden representations of a source sentence in the forward and backward directions. The learned hidden states in two directions are then concatenated to form source annotations {hi = [ h T i , h T i ]T }, where hi encodes the contextual semantics of the i-th word with respect to all other surrounding source words. Likewise, the decoder is a forward RNN that adopts the nonlinear function g( ) to sequentially generate the translation y as p(yj|x, y<j)=g(yj 1, sj, cj), where sj and cj denote the decoding state and the source context at the jth timestep, respectively. Among them, si is computed as sj=GRU(sj 1, yj 1, cj). Here we use GRU for both the encoder and decoder in this work. However, our work is also applicable to other types of RNNs. According to the attention mechanism, we calculate cj as the weighted sum of the source annotations {hi}:

i=1 αj,i hi, (2)

where αj,i evaluates how well yj and hi match, computed as follows:

αj,i = exp(ej,i) Tx i =1 exp(ej,i ) , (3)

ej,i = v T a tanh(Wasj 1 + Uahi), (4) where Wa, Ua and va are the weight matrices of the attention model.

2.2 VRNN VRNN is a recurrent extension of the conventional VAE (Chung et al. 2015). As shown in Fig. 2, it contains a VAE at

(a) Prior (b) Generation (c) Recurrence (d) Inference

Figure 2: Graphical illustrations of each operation of the VRNN.

every timestep and introduces four kinds of operations to explicitly model the dependencies between latent random variables across subsequent timesteps. Prior, Generation and Recurrence. Based on the hidden state ht 1 of the RNN, VRNN ﬁrst produces a latent semantic variable zt, which is then used to guide the generation of the hidden state ht and word xt at the t-th timestep. In doing so, the temporal structure of sequential data is exploited for VRNN modeling. Different from the standard VAE where the prior on the latent random variable follows a standard Gaussian distribution, VRNN assumes that zt obeys the following Gaussian with the parameters μ0,t and σ0,t zt N(μ0,t, diag(σ2 0,t)), (5) where μ0,t and σ0,t can be produced by any highly ﬂexible neural networks. Moreover, the generation distribution of xt will be conditioned on both zt and ht 1 such that xt|zt N(μx,t, diag(σ2 x,t)), (6) where μx,t and σx,t are the parameters of the generation distribution. Note that they can also be computed by any highly ﬂexible neural network. Then, we introduce zt to update the hidden state ht in a recurrent way ht = fθ(ht 1, xt, zt). (7) Finally, the parameterization of the generative model can be factorized as follows:

z pθ(x, z)dz, (8)

t=1 p(xt|x<t, z t)p(zt|x<t, z<t), (9)

where T is the sequence length. Inference. Similarly, the approximate posterior is deﬁned as a function of both xt and ht zt|xt N(μz,t, diag(σ2 z,t)), (10) where μz,t and σz,t are the parameters of the approximate posterior. In doing so, ht enables the encoding of the approximate posterior and the decoding for generation are closely tied. Finally, the objective function becomes a timestep-wise variational lower bound

LVRNN(θ, φ; x) = Eq(z M|x T )[

t=1 ( KL(q(zt|x t, z<t)

||p(zt|x<t, z<t)) + logp(xt|z t, x<t) )] (11)

As implemented in VAE, we maximize the variational lower bound with respect to their parameters to jointly learn the generative and inference models. Note that we also have to model the posterior pθ(zt|x t, z<t) while the integration of zt still leads to difﬁculties in the posterior inference and large-scale learning. Likewise, we adopt the neural approximation and reparameterization trick to handle this issue.

3. Our Model In this section, we extend VNMT into VRNMT by adapting VRNN into NMT. In VRNMT, the semantic dependencies between adjacent target words can be captured to reﬁne translation. Formally, the variational lower bound of VRNMT is deﬁned as follows:

LVRNMT(y|x; θ, φ) =

j=1 LVRNMT(yj|x, y<j; θ, φ)

j=1 { KL(qφ(zj|x, y j)||pθ(zj|x, y<j))

+ Eqφ(zj|x,y j)[logpθ(yj|x, y<j, zj)]}, (12)

where pθ(zj|x, y<j) is the prior, qφ(zj|x, y j) is the approximated posterior, and pθ(yj|x, y<j, zj) is the generation model. As shown in Eq. (12), VRNMT mainly contains three neural network based components: (1) a neural encoder for encoding source sentences, (2) a variational neural inferer for qφ(zj|x, y j) and pθ(zj|x, y<j), and (3) a variational neural decoder that models pθ(yj|x, y<j, zj).

3.1 Neural Encoder The encoder of VRNMT is the same as that of the conventional NMT. Due to the limitation of space, we omit the description of the VRNMT encoder (See Section 2.1 for reference).

3.2 Variational Neural Inferer As described previously, the key of variational models lies in how to model the distributions related to latent random variables. With respect to VRNMT, we focus on how to model the posterior qφ(zj|x, y j) and the prior pθ(zj|x, y<j).

The Posterior Model. Under the assumption that the posterior qφ(zj|x, y j) follows the multivariate Gaussian distribution with a diagonal covariance structure, we apply neural networks to simulate the posterior model. Concretely, we compute qφ(zj|x, y j) as

qφ(zj|x, y j) = N(zj; μj(x, y j), σj(x, y j)2I). (13)

As illustrated in Fig. 1, the mean μj and standard derivation σj of neural networks are imposed on x and y j. Obviously, the key to estimate zj is how to calculate μj and σj. To this end, we ﬁrst apply the element-wise activation function g( ) to perform a nonlinear transformation projecting yj 1, sj, cj and yj onto our latent semantic space:

hzj = g(W φ z [yj 1; sj; cj; yj] + bφ z). (14)

where W φ z and bφ z are the parameter matrix and bias term, respectively. Finally, we introduce linear regressions with parameters W φ μ , W φ σ , bφ μ and bφ σ to obtain dz-dimension vectors μj and logσ2 j as follows:

μj = W φ μ hzj + bφ μ (15)

logσ2 j = W φ σ hzj + bφ σ (16)

To obtain a representation for latent variable zj, we follow the implementation of VAE to reparameterize it as zj = μj + σj ϵ, ϵ N(0, I). Intuitively, this reparameterization procedure bridges the gap between pθ(yj|x, y<j, zj) and qφ(zj|x, y j). Thus, our model is an end-to-end neural network endowed with the generality ability.

The Prior Model. Except for the absence of yj, the neural model for the prior pθ(zj|x, y<j) is identical to that (i.e. Eq (13)) for the posterior qφ(zj|x, y j). Here we still model the prior as a multivariate Gaussian distribution but introduce different parameters for the prior and the posterior:

pθ(zj|x, y<j) = N(zj; μ j(x, y<j), σ j(x, y<j)2I), (17)

Using the similar way in computing the posterior model, we ﬁrst calculate h zj without yj in the following way:

h zj = g(W θ z [yj 1; sj; cj] + bθ z). (18)

Here W θ z and bθ z are the parameter matrix and bias term, respectively. Then, we use h zj to generate the mean μ j and standard derivation logσ 2 j with the parameters W θ and bθ :

μ j = W θ μh zj + bθ μ (19)

logσ 2 j = W θ σh zj + bθ σ (20)

Different from the posterior model, we directly set zj as μ j, as implemented in (Zhang et al. 2016a). Note that we also introduce noises to generate non-ﬁxed representation zj in practice, which enables our model to avoid overﬁtting to some extent. Finally, zj is integrated into our decoder to improve translation. The details will be illustrated in the following subsection.

3.3 Variational Neural Decoder

Given the source sentence x, the previously generated target words y<j, and the semantic latent variable zj, we compute the probability distribution over the translation yj as

p(yj|x, y<j, zj) = g VRNMT(yj 1, sj, cj) exp{g(Wd[yj 1; sj; cj; zj] + bd)} (21)

Unlike the conventional NMT, we ﬁrst produce zj using yj 1, sj, and cj (see the dashed green lines in Fig. 1), and then integrate zj with yj 1, sj, and cj to generate translation probability distribution (see the orange lines in Fig. 1). Besides, we use zj to generate the next hidden state (see the

red lines in Fig. 1). Formally, the GRU transition equations of our decoder are as follows:

rj+1 = σ(Wr yj + Ursj + Crcj + Vrzj + br) (22) uj+1 = σ(Wu yj + Uusj + Cucj + Vuzj + bu) (23) sj+1 = tanh(W yj + U[rj+1 sj] + Ccj + V zj + b) (24) sj+1 = (1 uj+1) sj + uj+1 sj+1, (25)

where W , U , C , V , and b , are the model parameters of GRU in VRNMT. Particularly, we initialize the hidden state s0 in a way similar to (Bahdanau, Cho, and Bengio 2015). It should be noted that the latent semantic variable zj has an important inﬂuence on the representation of hidden state sj+1 through the gates rj+1 and uj+1, and temporary hidden state sj+1. This allows our model to access the semantic information of zj indirectly since the prediction of yj+1 depends on sj+1. On the other hand, sj+1, in turn will constrain the generation of zj+1 at the next timestep. Therefore, the context dependencies between adjacent timesteps are indirectly exploited to reﬁne translation.

Model Training The ﬁnal objective for one bilingual sentence (x, y) involves the following two parts: KL(qφ(zj|x, y j)||pθ(zj|x, y<j)) and Eqφ(zj|x,y j)[ ]. We also apply the Monte Carlo method to approximate Eqφ(zj|x,y j)[ ]. Formally, the joint training objective becomes

LRLV (θ, φ; x, y) =

j=1 { KL(qφ(zj|x, y j)||pθ(zj|x, y<j))

+ Eqφ(zj|x,y j)[logpθ(yj|x, y<j, zj)] }

j=1 { KL(qφ(zj|x, y j)||pθ(zj|x, y<j))

l=1 logpθ(yj|x, y<j, z(l) j ) } (26)

where z(l) j = μj +σj ϵ(l), ϵ(l) N(0, I), and L is the number of samples. Essentially, VRNMT can be considered as a regularized version of NMT, which introduces noise ϵ(l) at each timestep to enhance its robustness. Notice that both the KL divergence and the approximate expectation are differentiable. Therefore, we can jointly optimize the model parameters θ and variational parameters φ using standard gradient ascent.

4. Experiments We conducted experiments on Chinese-English and English German translation to examine the effectiveness of our model.

4.1 Setup Our Chinese-English training data consists of 1.25M LDC sentence pairs, with 27.9M Chinese words and 34.5M English words respectively. We used the NIST MT02 dataset

as the validation set, and the NIST MT03/04/05/06 datasets as the test sets. In English-German translation, our training data consists of 4.46M sentence pairs with 116.1M English words and 108.9M German words. We used the news-test 2013 as the validation set and the news-test 2015 as the test set. Following Sennrich et al., (2016), we adopted byte pair encoding to segment words into subwords for English German translation. Finally, we used BLEU (Papineni et al. 2002) as our evaluation metric, and performed paired bootstrap sampling (Koehn 2004) for statistical signiﬁcance test using the Moses script. We set the maximum length of training sentences to be 50 words, and preserved the most frequent 30K (Chinese-English) and 50K (English-German) words as both the source and target vocabulary, covering approximately 97.4%/100.0% and 99.3%/98.2% on the source/target side of the two parallel corpora respectively. All other words were replaced with a speciﬁc token UNK . We applied Rmsprop (Graves 2013) with iter Num=5, momentum=0, ρ=0.95, and ϵ=1 10 4 to train various NMT models. The settings of our model were the same as in (Bahdanau, Cho, and Bengio 2015), except for some hyper-parameters speciﬁc to our model. Speciﬁcally, we set word embedding dimension as 620, hidden layer size as 1000, learning rate as 5 10 4, batch size as 80, gradient norm as 1.0, and dropout rate as 0.3. Particularly, we initialized the parameters of VRNMT with the trained conventional NMT model. As implemented in VAE, we set the sampling number L=1, and d e=dz=2df=2000 according to preliminary experiments. During decoding, we used the beam-search algorithm, and set beam sizes of all models as 10.

4.2 Systems for Comparison

We compared our model against the following systems: (1) Moses1. An open source phrase-based SMT system with default settings and a 4-gram language model trained on the target portion of the training data. (2) DL4MT. Our re-implementation of the attentionbased NMT system (Bahdanau, Cho, and Bengio 2015) with slight changes from dl4mt tutorial2. (3) VNMT. It is a variational NMT system (Zhang et al. 2016a) that incorporates a continuous latent variable to model the underlying semantics of sentence pairs. (4) VRNMT(-TD). A variant of our model without introducing temporal dependencies between the latent random variables. It differs from our model in that the input of posterior model contains only yj but not yj 1, sj, cj. More specifically, we removed yj 1, sj, and cj from Eq. (14). Thus, the latent variables of VRNMT(-TD) directly obey the standard Gauss distribution rather than depend on the output at the previous timestep. As we incorporate temporal dependencies into the prior, we will directly study the impact of the latent random variables on modeling variability characterized by dependencies among output words in comparison to VRNMT(-TD).

1http://www.statmt.org/moses/ 2https://github.com/nyu-dl/dl4mt-tutorial

System MT03 MT04 MT05 MT06 Ave. COVERAGE 34.49 38.34 34.91 34.25 35.50 Mem Dec 35.09 37.73 35.53 34.32 35.67 Deep LAU 36.16 39.81 35.91 35.98 36.97 DMAtten 38.33 40.11 36.71 35.29 37.61 Moses 32.93 34.76 31.31 31.05 32.51 DL4MT 36.59 39.57 35.56 35.29 36.75 VNMT 37.23 40.32 36.28 35.73 37.39 VRNMT(-TD) 36.97 40.07 36.13 35.49 37.17 VRNMT 38.08 ++ 41.07 ++ 36.82 ++ 36.72 ++ 38.17

Table 1: Case-insensitive BLEU scores of Chinese-English translation. / and +/++: signiﬁcant over VNMT and VRNMT(-TD) at 0.05/0.01, respectively. COVERAGE (Tu et al. 2016) presented a coverage model to alleviate the over-translation and under-translation problems. Mem Dec (Meng et al. 2016) exploited a readable and writable attention mechanism to record interactive history in decoding. Deep LAU (Wang et al. 2016) introduced external memory to improve translation quality. DMAtten (Zhang et al. 2017) explicitly incorporated the word reordering knowledge into the attention model of NMT. Note that all these studies focus on capturing semantic information for NMT.

(0 , 1 0 ] (1 0 , 2 0 ] (2 0 , 3 0 ] (3 0 , 4 0 ] (4 0 , 5 0 ] (5 0 , 1 0 0 ]

Length of source sentence

Moses DL4MT VNMT VRNMT(-TD) VRNMT

Figure 3: BLEU scores over different lengths of translated sentences.

4.3 Results on Chinese-English Translation

In addition to the above systems for comparison, we also displayed the BLEU scores of several recent NMT models (Tu et al. 2016; Meng et al. 2016; Wang et al. 2016; Zhang et al. 2017) that have been trained on the same training corpus as ours. Table 1 shows case-insensitive BLEU scores on Chinese English datasets. Overall, VRNMT signiﬁcantly improves translation quality on all test sets, achieving the gains of 5.66, 1.42, 0.78 and 1.0 BLEU points over Moses, DL4MT, VNMT and VRNMT(-TD), respectively. Compared to the existing NMT models, VRNMT is better than them as shown in Table 1. These results echo the results reported in (Zhang et al. 2016a), indicating the integration of latent variables is effective for improving NMT. Particularly, VRNMT performs signiﬁcantly better than VRNMT(-TD), indicating

Source p ıngrˇang cˇaiqˇu sh angsh u x ıngd ong zh ih ou s ı ti an , liˇanh egu o anqu an lˇish ıhu ı de wˇu g e ch angr en lˇish ıgu o d ou w ei cˇi y i w eij i cˇaiqˇu y uf angx ıng w aiji ao x ıngd ong .

Reference four days after pyongyang adopted the aforesaid action , the ﬁve permanent members of united nations security council have all taken preemptive diplomatic actions for the crisis .

Moses pyongyang by four days after the operation of ::: the :: un::::::: security:::::: council , the ﬁve permanent members to adopt preventive diplomacy this crisis .

the ::: four permanent member states of the united nations security council and the ﬁve permanent members of the un security council have adopted a preventive diplomatic action following the four - day .......... operation ..

VNMT four days after :::: north::::: korea took the above actions , the ﬁve permanent members of the un security council have adopted preventive diplomatic .......... activities ..

VRNMT(-TD) ::: four permanent members of the security council of the united nations security council have taken preventive diplomatic actions during the four - day period .......... following.... the ...... above........ actions...

VRNMT four days after pyongyang took the action , the ﬁve permanent members of the un security council have adopted preventive diplomatic actions for the crisis .

Table 2: Translation examples of different systems. Words highlighted in underlines are not ﬂuently translated, in wavy lines are incorrectly translated, in dashed lines are over-translated, and in dotted lines are under-translated.

System 1-Gram 2-Gram 3-Gram 4-Gram Reference 12.94 1.80 0.93 1.29 DL4MT 19.62 5.34 2.96 2.31 VNMT 19.45 5.24 2.93 2.29 VRNMT(-TD) 19.54 5.25 2.93 2.35 VRNMT 18.83 4.97 2.90 2.25

Table 3: Evaluation of over-translation. The lower the score, the better the system deals with the over-translation problem.

that explicitly modeling the temporal dependencies between latent random variables indeed further beneﬁts NMT.

Results on Source Sentences with Different Lengths Further, we carried out experiments to investigate our model on different groups of the test sets, which are divided according to the lengths of source sentences. Figure 3 shows that our system outperforms the others over sentences with different length spans.

Analysis on Over Translation As mentioned in (Tu et al. 2016), over-translation is one of big challenges for NMT. Here we followed Zhang et al. (2017) to evaluate over-translations generated by different NMT models. Concretely, we directly used N-Gram Repetition Rate (N-GRR) metric (Zhang, Xiong, and Su 2017) to calculate the portion of repeated n-grams in a sentence as follows:

N-GRR = 1 C R

|N-gramsc,r| |u(N-gramsc,r)|

|N-gramsc,r| (27)

where |N-gramsc,r| is the number of total n-grams in the r-th translation of the c-th sentence in the testing corpus, and |u(N-gramsc,r)| denotes the number of n-grams after duplicate ngrams are removed. By comparing N-GRR scores

System AER SAER DL4MT 50.07 63.42 VNMT 49.23 62.28 VRNMT(-TD) 49.95 63.17 VRNMT 48.11 61.24

Table 4: Evaluation of word alignment quality. The lower the score is, the better word alignments are.

of translations against those of references, we can roughly know how serious the over-translation problem is. Table 3 gives the ﬁnal results. We ﬁnd that our model is able to better deal with over-translation issue than other models.

Analysis on Attention Results

The attention model heavily depends on target-side hidden state vectors, which are in turn dependent on the previous latent random variables in our model, as illustrated in Eq. (22)- (25). Therefore, if latent variables are helpful for the calculation of target-side hidden state vectors, the attention model can also be improved accordingly. To testify this, we conducted experiments on the evaluation dataset provided by Liu and Sun (2015), which contains 900 manually aligned Chinese-English sentence pairs. Speciﬁcally, we ﬁrst forced the decoder to output reference translations so as to obtain word alignments between input sentences and their reference translations according to attention weights. Then, we used the alignment error rate (AER) (Och and Ney 2003) and the soft version (SAER) of AER (Tu et al. 2016) to evaluate alignment performance. From Table 4, we can conclude that the incorporation of latent variables also improves the attention model as expected.

To understand why our model outperforms the others, we compared and analyzed their 1-best translations. Table 2 provides a translation example with its various translations.

System BLEU BPEChar 23.9 Rec Atten 25.0 Conv Encoder 24.2 Moses 20.54 DL4MT 24.88 VNMT 25.49 VRNMT(-TD) 25.34 VRNMT 25.93 ++

Table 5: Case-sensitive BLEU scores of English-German translation. We directly displayed the results of the ﬁrst three models provided in (Gehring et al. 2017). BPEChar (Chung, Cho, and Bengio 2016) presented a character-level decoder for NMT, Rec Atten (Yang et al. 2017) introduced a recurrent attention model to better capture source-side context for NMT, and Conv Encoder (Gehring et al. 2017) explored the convolutional encoder to encode the source sentence.

We have found that the translation produced by Moses is non-ﬂuent than those of NMT systems. In addition to the issues of incorrect translation and over-translation, the ﬁrst three NMT systems (DL4MT, VNMT, VRNMT(-TD)) do not adequately convey the meaning of the source sentence to the target as some source phrases have not been translated at all, such as w ei cˇi y i w eij i (for this crisis) . By contrast, due to the advantage of modeling long-distance dependencies among target words, VRNMT is able to produce a more complete, ﬂuent, and accurate translation.

4.4 Results on English-German Translation We also carried out experiments on English-German translation. Results are shown in Table 5. We provided results of previous work (Chung, Cho, and Bengio 2016; Yang et al. 2017; Gehring et al. 2017) on this dataset too. Speciﬁcally, VRNMT still outperforms Moses, DL4MT, VNMT, VRNMT(-TD), achieving gains of 5.39, 1.05, 0.44 and 0.59 BLEU points. Additionally, VRNMT reaches the performance level that is competitive to or higher than several recent NMT systems. Note that our approach is orthogonal to these previous models. Therefore it can be adapted to these models. We leave this adaptation to our future work.

5. Related Work The previous studies that are related to our work mainly include NMT and variational neural models. NMT. Most NMT models focus on how to translate a source sentence to a target sentence with an encoderdecoder neural network (Kalchbrenner and Blunsom 2013; Cho et al. 2014; Sutskever, Vinyals, and Le 2014). To handle the defeat of encoding all source-side information into a ﬁxed-length vector, Bahdanau et al. (2015) proposed attention-based NMT, which has now become the dominant architecture. However, this model usually suffer from attention failures, which usually lead to undesirable translations. Therefore, many researchers then resorted to better attention mechanisms (Luong, Pham, and Manning 2015;

Cheng et al. 2016; Tu et al. 2016; Feng et al. 2016; Meng et al. 2016; Calixto, Liu, and Campbell 2017), or more effective neural networks (Wang et al. 2016; Gehring et al. 2017; Wang et al. 2017), or exploiting external semantics (Zhang et al. 2016b; Chen et al. 2017; Li et al. 2017). All these models are designed within the discriminative encoder-decoder framework, leaving the explicit exploration of underlying semantics an open problem. To combine the strengths of discriminative and generative modeling, Zhang et al. (2016a) presented VNMT that incorporates a continuous latent variable to model the underlying semantics of sentence pairs. Variational Neural Networks. Kingma et al. (2014) as well as Rezende et al. (2014) focused on variational neural networks, which are effective in the inference and learning of directed probabilistic models on large-scale dataset. Typically, these models introduce a neural inference model to approximate the intractable posterior, and optimize model parameters jointly with a reparameterized variational lower bound. Further, Kingma et al., (2014b) adapted these models to semi-supervised learning. Chung et al. (2015) incorporated latent variables into the hidden states of a recurrent neural network, while Gregor et al. (2015) combined a novel spatial attention mechanism that mimics the foveation of human eyes, with a sequential variational auto-encoding framework that allows the iterative construction of complex images. Miao et al. (2016) proposed a generic variational inference framework for generative and conditional models of text. Both (Zhang et al. 2016a) and (Chung et al. 2015) are the most related to our work. In our model, we extended VNMT (Zhang et al. 2016a) to a recurrent framework, which has been proven to be more effective for machine translation. Besides, different from Chung et al. (2015) that work on speech generation and handwriting generation, we introduces a sequence of recurrent latent variables for the semantic modeling of NMT, which, to the best of our knowledge, has never been investigated before.

6. Conclusions and Future Work

This paper has presented a variational recurrent NMT model that introduces a sequence of continuous latent variables to capture the underlying semantics of sentence pairs. Similar to VNMT, we approximate the posterior distribution with neural networks and reparameterize the variational lower bound. In doing so, our model becomes an end-to-end neural network which can be optimized through the stochastic gradient algorithms. Compared with the dominant NMT and VNMT, our model not only captures the global semantic contexts but also models strong and complex dependencies among generated words at different timesteps. Experiments on Chinese-English and English-German translation tasks demonstrate the effectiveness of our model. Our future works include the following aspects. We will study how to better exploit latent variables to further improve NMT. Additionally, we are also interested in applying our model to other similar tasks using encoder-decoder framework, such as neural text summarization, neural dialogue generation.

Acknowledgments

The authors were supported by National Natural Science Foundation of China (Nos. 61672440, 61622209 and 61573294), Scientiﬁc Research Project of National Language Committee of China (Grant No. YB135-49). We also thank the reviewers for their insightful comments.

Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR2015. Calixto, I.; Liu, Q.; and Campbell, N. 2017. Doublyattentive decoder for multi-modal neural machine translation. In Proc. of ACL2017, 1913 1924. Chen, H.; Huang, S.; Chiang, D.; and Chen, J. 2017. Improved neural machine translation with a syntax-aware encoder and decoder. In Proc. of ACL2017, 1936 1945. Cheng, Y.; Shen, S.; He, Z.; He, W.; Wu, H.; Sun, M.; and Liu, Y. 2016. Agreement-based joint training for bidirectional attention-based neural machine translation. In Proc. of IJCAI2016, 2761 2767. Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using rnn encoder decoder for statistical machine translation. In Proc. of EMNLP2014, 1724 1734. Chung, J.; Kastner, K.; Dinh, L.; Goel, K.; Courville, A. C.; and Bengio, Y. 2015. A recurrent latent variable model for sequential data. In Proc. of NIPS2015. Chung, J.; Cho, K.; and Bengio, Y. 2016. A character-level decoder without explicit segmentation for neural machine translation. In Proc. of ACL2016, 1693 1703. Feng, S.; Liu, S.; Yang, N.; Li, M.; Zhou, M.; and Zhu, K. Q. 2016. Improving attention modeling with implicit distortion and fertility for machine translation. In Proc. of COLING2016, 3082 3092. Gehring, J.; Auli, M.; Grangier, D.; and Dauphin, Y. 2017. A convolutional encoder model for neural machine translation. In Proc. of ACL2017, 123 135. Graves, A. 2013. Generating sequences with recurrent neural networks. In ar Xiv:1308.0850v5. Gregor, K.; Danihelka, I.; Graves, A.; Rezende, D.; and Wierstra, D. 2015. Draw: A recurrent neural network for image generation. In Proc. of ICML2015, 1462 1471. Kalchbrenner, N., and Blunsom, P. 2013. Recurrent continuous translation models. In Proc. of EMNLP2013, 1700 1709. Kingma, D. P., and Welling, M. 2014. Auto-encoding variational bayes. In Proc. of ICLR2014. Kingma, D. P.; Mohamed, S.; Rezende, D. J.; and Welling, M. 2014b. Semi-supervised learning with deep generative models. In Proc. of NIPS2014, 3581 3589. Koehn, P. 2004. Statistical signiﬁcance tests for machine translation evaluation. In Proc. of EMNLP2004, 388 395.

Li, J.; Xiong, D.; Tu, Z.; Zhu, M.; Zhang, M.; and Zhou, G. 2017. Modeling source syntax for neural machine translation. In Proc. of ACL2017, 688 697. Liu, Y., and Sun, M. 2015. Contrastive unsupervised word alignment with non-local features. In Proc. of AAAI2015, 857 868. Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. In Proc. of EMNLP2015, 1412 1421. Meng, F.; Lu, Z.; Li, H.; and Liu, Q. 2016. Interactive attention for neural machine translation. In Proc. of COLING2016, 2174 2185. Miao, Y.; Yu, L.; and Blunsom, P. 2016. Neural variational inference for text processing. In Proc. of ICML2016, 1727 1736. Och, F. J., and Ney, H. 2003. A systematic comparison of various statistical alignment models. In Computational Linguistics, 2003(29), 19 51. Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W. 2002. Bleu: A method for automatic evaluation of machine translation. In Proc. of ACL2002, 311 318. Rezende, D. J.; Mohamed, S.; and Wierstra, D. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Proc. of ICML2014, 1278 1286. Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural machine translation of rare words with subword units. In Proc. of ACL2016, 1715 1725. Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Proc. of NIPS2014, 3104 3112. Tu, Z.; Lu, Z.; Liu, Y.; Liu, X.; and Li, H. 2016. Modeling coverage for neural machine translation. In Proc. of ACL2016, 76 85. Wang, M.; Lu, Z.; Li, H.; and Liu, Q. 2016. Memoryenhanced decoder for neural machine translation. In Proc. of EMNLP2016, 278 286. Wang, M.; Lu, Z.; Zhou, J.; and Liu, Q. 2017. Deep neural machine translation with linear associative unit. In Proc. of ACL2017, 136 145. Yang, Z.; Hu, Z.; Deng, Y.; Dyer, C.; and Smola, A. 2017. Neural machine translation with recurrent attention modeling. In Proc. of EACL2017, 383 387. Zhang, B.; Xiong, D.; su, j.; Duan, H.; and Zhang, M. 2016a. Variational neural machine translation. In Proc. of EMNLP2016, 521 530. Zhang, J.; Li, L.; Way, A.; and Liu, Q. 2016b. Topicinformed neural machine translation. In Proc. of COLING2016, 1807 1817. Zhang, J.; Liu, Y.; Luan, H.; Xu, J.; and Sun, M. 2017. Prior knowledge integration for neural machine translation using posterior regularization. In Proc. of ACL2017, 1514 1523. Zhang, B.; Xiong, D.; and Su, J. 2017. Generating Sentences from a Continuous Space. Ar Xiv e-prints.