# MIRROR-GENERATIVE NEURAL MACHINE TRANSLATION

Published as a conference paper at ICLR 2020

Zaixiang Zheng¹, Hao Zhou², Shujian Huang¹, Lei Li², Xin-Yu Dai¹, Jiajun Chen¹
¹National Key Laboratory for Novel Software Technology, Nanjing University
zhengzx@smail.nju.edu.cn, {huangsj,daixinyu,chenjj}@nju.edu.cn
²ByteDance AI Lab
{zhouhao.nlp,lileilab}@bytedance.com

ABSTRACT

Training neural machine translation (NMT) models requires large parallel corpora, which are scarce for many language pairs. However, raw non-parallel corpora are often easy to obtain. Existing approaches have not exploited the full potential of non-parallel bilingual data, either in training or in decoding. In this paper, we propose the mirror-generative NMT (MGNMT), a single unified architecture that simultaneously integrates the source-to-target translation model, the target-to-source translation model, and two language models. Both translation models and both language models share the same latent semantic space, so the two translation directions can learn from non-parallel data more effectively. Besides, the translation models and language models can collaborate during decoding. Our experiments show that the proposed MGNMT consistently outperforms existing approaches in a variety of language pairs and scenarios, including resource-rich and low-resource situations.

1 INTRODUCTION

Neural machine translation (NMT) systems (Sutskever et al., 2014; Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017) produce quite promising translations when abundant parallel bilingual data are available for training. But obtaining such large amounts of parallel data is non-trivial in most machine translation scenarios. For example, many low-resource language pairs (e.g., English-to-Tamil) lack adequate parallel data for training. Moreover, it is often difficult to adapt NMT models to other domains when only limited in-domain parallel data are available (e.g., the medical domain), due to the large discrepancy between the test domain and the training parallel data (usually newswire). In these cases, where parallel bilingual data are inadequate, making the most of non-parallel bilingual data (which is usually quite cheap to obtain) is crucial to achieving satisfactory translation performance.

We argue that current NMT approaches to exploiting non-parallel data are suboptimal, in both the training and decoding phases. For training, back-translation (Sennrich et al., 2016b) is the most widely used approach for exploiting monolingual data. However, back-translation updates the two directions of translation models individually, which is not the most effective. Specifically, given monolingual data x (of the source language) and y (of the target language)[1], back-translation utilizes y by applying the tgt2src translation model (TM_{y→x}) to obtain predicted translations x̂. The pseudo translation pairs ⟨x̂, y⟩ are then used to update the src2tgt translation model (TM_{x→y}); x can be used in the same way to update TM_{y→x}. Note that TM_{y→x} and TM_{x→y} are independent here and are updated individually: an update to TM_{y→x} does not directly benefit TM_{x→y}. Related work such as joint back-translation (Zhang et al., 2018) and dual learning (He et al., 2016a) introduces iterative training to make TM_{y→x} and TM_{x→y} benefit from each other implicitly and iteratively, but the translation models in these approaches are still independent (a schematic sketch of such decoupled updates is given below).
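To make this limitation concrete, here is a minimal sketch of vanilla back-translation under the setup just described. The model objects and their `translate`/`train_step` methods are hypothetical placeholders, not part of the paper; the point is only that each monolingual batch updates a single direction and the two models stay decoupled.

```python
# Minimal sketch of standard back-translation with two independent models.
# `tm_xy` and `tm_yx` are hypothetical src2tgt / tgt2src model objects.

def back_translation_round(tm_xy, tm_yx, mono_x, mono_y):
    """One round of vanilla back-translation: each side updates one model only."""
    # Target monolingual data y: back-translate with TM_{y->x}, update TM_{x->y}.
    x_pseudo = [tm_yx.translate(y) for y in mono_y]
    tm_xy.train_step(list(zip(x_pseudo, mono_y)))

    # Source monolingual data x: back-translate with TM_{x->y}, update TM_{y->x}.
    y_pseudo = [tm_xy.translate(x) for x in mono_x]
    tm_yx.train_step(list(zip(mono_x, y_pseudo)))
```

Nothing learned by `tm_yx` here flows into `tm_xy` except through another full round of pseudo-data generation; this indirectness is exactly what MGNMT's shared latent variable is designed to remove.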
Ideally, gains from non-parallel data could be enlarged if TM_{y→x} and TM_{x→y} were related: then every update to TM_{y→x} could directly produce a better TM_{x→y} and vice versa, exploiting non-parallel data more effectively.

[1] Please refer to Section 2 for details of the notation.

Figure 1: The graphical model of MGNMT.

Figure 2: Illustration of the mirror property of MGNMT, which integrates the src2tgt TM p(y|x, z; θ_xy), the target LM p(y|z; θ_y), the tgt2src TM p(x|y, z; θ_yx), and the source LM p(x|z; θ_x).

For decoding, some related work (Gulcehre et al., 2015) proposes to interpolate an external language model LM_y (trained separately on target monolingual data) with the translation model TM_{x→y}, which brings knowledge from target monolingual data into translation. This is particularly useful for domain adaptation, because a better LM_y can yield translations that better fit the test domain (e.g., social networks). However, directly interpolating an independent language model during decoding may not be the best solution. First, the language model used here is external and learned independently of the translation model, so the two models may not cooperate well under a simple interpolation mechanism (and may even conflict). Additionally, the language model is only included in decoding and is not considered during training. This leads to an inconsistency between training and decoding, which may harm performance.

In this paper, we propose the mirror-generative NMT (MGNMT) to address the aforementioned problems and exploit non-parallel data effectively in NMT. MGNMT jointly trains the translation models (i.e., TM_{x→y} and TM_{y→x}) and the language models (i.e., LM_x and LM_y) in a unified framework, which is non-trivial. Inspired by generative NMT (Shah & Barber, 2018), we introduce a latent semantic variable z shared between x and y. Our method exploits the symmetry, or mirror property, in decomposing the conditional joint probability p(x, y|z), namely:

$$\log p(x, y \mid z) = \log p(x \mid z) + \log p(y \mid x, z) = \log p(y \mid z) + \log p(x \mid y, z) = \frac{1}{2}\Big[\underbrace{\log p(y \mid x, z)}_{\text{src2tgt } \mathrm{TM}_{x \to y}} + \underbrace{\log p(y \mid z)}_{\text{target } \mathrm{LM}_{y}} + \underbrace{\log p(x \mid y, z)}_{\text{tgt2src } \mathrm{TM}_{y \to x}} + \underbrace{\log p(x \mid z)}_{\text{source } \mathrm{LM}_{x}}\Big] \tag{1}$$

The graphical model of MGNMT is illustrated in Figure 1. MGNMT aligns the bidirectional translation models as well as the language models of the two languages through a shared latent semantic space (Figure 2), so that all of them are related and become conditionally independent given z. In this setting, MGNMT enables the following advantages: (i) For training, with z acting as a bridge, TM_{y→x} and TM_{x→y} are no longer independent, so every update to one direction directly benefits the other. This improves the efficiency of using non-parallel data (Section 3.1). (ii) For decoding, MGNMT can naturally take advantage of its internal target-side language model, which is learned jointly with the translation model; both contribute to a better generation process (Section 3.2).
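To make Equation (1) concrete, the sketch below scores a sentence pair under the mirror factorization. The component model objects and their `log_prob` interface are hypothetical stand-ins for the four conditionals above, not the paper's actual implementation.

```python
# Minimal sketch of the mirror factorization in Equation (1).
# Each hypothetical component returns a log-probability conditioned on the
# shared latent variable z.

def mirror_joint_log_prob(x, y, z, tm_xy, lm_y, tm_yx, lm_x):
    """log p(x, y | z), averaged over the two symmetric decompositions."""
    return 0.5 * (
        tm_xy.log_prob(y, x, z)    # log p(y | x, z): src2tgt TM
        + lm_y.log_prob(y, z)      # log p(y | z):    target LM
        + tm_yx.log_prob(x, y, z)  # log p(x | y, z): tgt2src TM
        + lm_x.log_prob(x, z)      # log p(x | z):    source LM
    )
```

Because all four terms condition on the same z, training any one of them also shapes the shared latent space, which is what couples the two translation directions.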
Note that MGNMT is orthogonal to dual learning (He et al., 2016a) and joint back-translation (Zhang et al., 2018). The translation models in MGNMT are dependent, and the two translation models can directly promote each other, whereas dual learning and joint back-translation work in an implicit way; these two approaches can also be used to further improve MGNMT. Moreover, the language models used in dual learning face the same problem as Gulcehre et al. (2015). Even given GNMT (Shah & Barber, 2018), the proposed MGNMT is non-trivial. GNMT only has a source-side language model, so it cannot enhance decoding the way MGNMT does. Also, Shah & Barber (2018) require GNMT to share all parameters and vocabularies between the two translation models in order to utilize monolingual data, which is not best suited for distant language pairs. We give a more detailed comparison in the related work.

Experiments show that MGNMT achieves competitive performance on parallel bilingual data and shows clear advantages when trained on non-parallel data. MGNMT outperforms several strong baselines in different scenarios and language pairs, including resource-rich settings as well as resource-poor settings such as low-resource language translation and cross-domain translation. Moreover, we show that translation quality indeed improves when the jointly learned translation model and language model of MGNMT work together. We also demonstrate that MGNMT is architecture-free: it can be applied to any neural sequence model, such as the Transformer or an RNN. These pieces of evidence verify that MGNMT meets our expectation of fully utilizing non-parallel data.

Figure 3: Illustration of the architecture of MGNMT: an inference model q(z|x, y) with reparameterized sampling (ε ~ N(0, I)), the KL term D_KL[N(μ(x, y), Σ(x, y)) || N(0, I)], the expected joint log-likelihood E_z[log p(x, y|z)], and the four component models (src2tgt TM, target LM, tgt2src TM, source LM).

2 BACKGROUND AND RELATED WORK

Notation  Given a pair of sentences from the source and target languages, e.g., ⟨x, y⟩, we denote x as a sentence of the source language and y as a sentence of the target language. Additionally, we use the terms source-side and target-side of a translation direction to denote its input and output sides; e.g., the source-side of the tgt2src translation is the target language.

Neural machine translation  Conventional neural machine translation (NMT) models often adopt an encoder-decoder framework (Bahdanau et al., 2015) with discriminative learning. Such NMT models aim to approximate the conditional distribution log p(y|x; θ_xy) over a target sentence y = ⟨y_1, ..., y_{L_y}⟩ given a source sentence x = ⟨x_1, ..., x_{L_x}⟩. We refer to such regular NMT models as discriminative NMT models. The training criterion for a discriminative NMT model is to maximize the conditional log-likelihood log p(y|x; θ_xy) on abundant parallel bilingual data D_xy = {⟨x^(n), y^(n)⟩ | n = 1...N} of i.i.d. observations. As pointed out by Zhang et al. (2016) and Su et al. (2018), the shared semantics z between x and y are learned only implicitly in discriminative NMT, which is insufficient to model the semantic equivalence in translation. Recently, Shah & Barber (2018) proposed a generative NMT (GNMT), which models the joint distribution p(x, y) instead of p(y|x) with a latent variable z:

$$\log p(x, y \mid z; \theta = \{\theta_x, \theta_{xy}\}) = \log p(x \mid z; \theta_x) + \log p(y \mid x, z; \theta_{xy})$$

where GNMT models log p(x|z; θ_x) as a source variational language model. Eikema & Aziz (2019) also propose a similar approach. In addition, Chan et al. (2019) propose generative insertion-based sequence modeling, which also models the joint distribution.

Exploiting non-parallel data for NMT  Neither discriminative nor generative NMT can directly learn from non-parallel bilingual data. To remedy this, back-translation and its variants (Sennrich et al., 2016b; Zhang et al., 2018) exploit non-parallel bilingual data by generating synthetic parallel data.
Dual learning (He et al., 2016a; Xia et al., 2017) learns from non-parallel data in a round-trip game via reinforcement learning, with the help of pretrained language models. Although these methods have proven effective, the independence between the translation models, and between the translation and language models (in dual learning), makes them less efficient than MGNMT at utilizing non-parallel data in both training and decoding. Meanwhile, iterative learning schemes like theirs can also complement MGNMT. Some other studies exploit non-parallel bilingual data by sharing all parameters and vocabularies between the source and target languages, so that both translation directions can be updated with monolingual data of either language (Dong et al., 2015; Johnson et al., 2017; Firat et al., 2016; Artetxe et al., 2018; Lample et al., 2018a;b); GNMT can do so as well, in an auto-encoder fashion. However, these approaches may still fail on distant language pairs (Zhang & Komachi, 2019) such as English-to-Chinese, due to potential issues such as non-overlapping alphabets, which is also verified in our experiments. Additionally, as mentioned above, integrating a language model is another way to exploit monolingual data for NMT (Gulcehre et al., 2015; Stahlberg et al., 2018; Chu & Wang, 2018). However, this kind of method often resorts to externally trained language models, which are agnostic to the translation task. Besides, although GNMT contains a source-side language model, that model cannot help decoding. In contrast, MGNMT jointly learns translation and language modeling probabilistically and can naturally rely on both for better generation.

Algorithm 1 Training MGNMT from Non-Parallel Data
Input: (pretrained) MGNMT M(θ), source monolingual dataset D_x, target monolingual dataset D_y
1: while not converged do
2:   Draw source and target sentences from the non-parallel data: x^(s) ~ D_x, y^(t) ~ D_y
3:   Use M to translate x^(s), constructing a pseudo-parallel sentence pair ⟨x^(s), y_pseu^(s)⟩
4:   Compute L(x^(s); θ_x, θ_yx, φ) with ⟨x^(s), y_pseu^(s)⟩ by Equation (5)
5:   Use M to translate y^(t), constructing a pseudo-parallel sentence pair ⟨x_pseu^(t), y^(t)⟩
6:   Compute L(y^(t); θ_y, θ_xy, φ) with ⟨x_pseu^(t), y^(t)⟩ by Equation (4)
7:   Compute the gradient ∇θ by Equation (6)
8:   Update parameters θ ← θ + η∇θ
9: end while

3 MIRROR-GENERATIVE NEURAL MACHINE TRANSLATION

We propose the mirror-generative NMT (MGNMT), a novel deep generative model which simultaneously models a pair of src2tgt and tgt2src (variational) translation models, as well as a pair of source and target (variational) language models, in a highly integrated way based on the mirror property. As a result, MGNMT can learn from non-parallel bilingual data and naturally interpolate its learned language model with the translation model in the decoding process. The overall architecture of MGNMT is illustrated graphically in Figure 3 (a minimal component skeleton is also sketched below). MGNMT models the joint distribution over a bilingual sentence pair by exploiting the mirror property of the joint probability, log p(x, y|z) = 1/2 [log p(y|x, z) + log p(y|z) + log p(x|y, z) + log p(x|z)], where the latent variable z (with a standard Gaussian prior, z ~ N(0, I)) stands for the shared semantics between x and y and serves as a bridge between all of the integrated translation and language models.
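The following PyTorch-style skeleton shows how a single latent variable can be shared by all four component models through a reparameterized posterior sample. The submodule names and interfaces are hypothetical; this is a structural sketch under those assumptions, not the paper's released code.

```python
# Structural sketch of MGNMT's components around a shared latent z.
# All submodules passed in (inference_net, tm_xy, lm_y, tm_yx, lm_x) are
# hypothetical encoder/decoder networks.
import torch
import torch.nn as nn


class MGNMTSketch(nn.Module):
    def __init__(self, inference_net, tm_xy, lm_y, tm_yx, lm_x):
        super().__init__()
        self.inference_net = inference_net   # q(z | x, y; phi) -> (mu, log_var)
        self.tm_xy, self.lm_y = tm_xy, lm_y  # p(y | x, z; theta_xy), p(y | z; theta_y)
        self.tm_yx, self.lm_x = tm_yx, lm_x  # p(x | y, z; theta_yx), p(x | z; theta_x)

    def sample_posterior(self, x, y):
        """Reparameterized sample z = mu + eps * sigma, with eps ~ N(0, I)."""
        mu, log_var = self.inference_net(x, y)
        eps = torch.randn_like(mu)
        z = mu + eps * torch.exp(0.5 * log_var)
        return z, mu, log_var

    @staticmethod
    def kl_to_standard_normal(mu, log_var):
        """D_KL[N(mu, diag(sigma^2)) || N(0, I)] for a diagonal Gaussian posterior."""
        return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1)
```

During training (Section 3.1), the four decoders all condition on the same z returned by `sample_posterior`, and the KL term above is the one appearing in Equations (2), (4), and (5).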
3.1 TRAINING

3.1.1 LEARNING FROM PARALLEL DATA

We first introduce how to train MGNMT on regular parallel bilingual data. Given a parallel sentence pair ⟨x, y⟩, we use stochastic gradient variational Bayes (SGVB) (Kingma & Welling, 2014) to perform approximate maximum likelihood estimation of log p(x, y). We parameterize the approximate posterior as q(z|x, y; φ) = N(μ_φ(x, y), Σ_φ(x, y)). Then, from Equation (1), we obtain the Evidence Lower BOund (ELBO) L(x, y; θ, φ) on the log-likelihood of the joint probability:

$$\log p(x, y) \geq \mathcal{L}(x, y; \theta, \phi) = \mathbb{E}_{q(z \mid x, y; \phi)}\Big[\tfrac{1}{2}\big\{\log p(y \mid x, z; \theta_{xy}) + \log p(y \mid z; \theta_y) + \log p(x \mid y, z; \theta_{yx}) + \log p(x \mid z; \theta_x)\big\}\Big] - D_{\mathrm{KL}}\big[q(z \mid x, y; \phi)\,\|\,p(z)\big] \tag{2}$$

where θ = {θ_x, θ_yx, θ_y, θ_xy} is the set of parameters of the translation and language models. The first term is the (expected) log-likelihood of the sentence pair, whose expectation is estimated by Monte Carlo sampling. The second term is the KL-divergence between z's approximate posterior and its prior. Using the reparameterization trick (Kingma & Welling, 2014), we can jointly train all the components with gradient-based algorithms.

3.1.2 LEARNING FROM NON-PARALLEL DATA

Since MGNMT intrinsically contains a pair of mirror translation models, we design an iterative training approach to exploit non-parallel data, in which both directions of MGNMT benefit from the monolingual data mutually and boost each other. The proposed training process on non-parallel bilingual data is illustrated in Algorithm 1.

Formally, given non-parallel sentences, i.e., x^(s) from a source monolingual dataset D_x = {x^(s) | s = 1...S} and y^(t) from a target monolingual dataset D_y = {y^(t) | t = 1...T}, we aim to jointly maximize lower bounds on their marginal log-likelihoods:

$$\log p(x^{(s)}) + \log p(y^{(t)}) \geq \mathcal{L}(x^{(s)}; \theta_x, \theta_{yx}, \phi) + \mathcal{L}(y^{(t)}; \theta_y, \theta_{xy}, \phi) \tag{3}$$

where L(x^(s); θ_x, θ_yx, φ) and L(y^(t); θ_y, θ_xy, φ) are the lower bounds on the source and target marginal log-likelihoods, respectively. Let us take L(y^(t); θ_y, θ_xy, φ) as an example. Inspired by Zhang et al. (2018), we sample an x from p(x|y^(t)) in the source language as the translation of y^(t) (i.e., back-translation) and obtain a pseudo-parallel sentence pair ⟨x, y^(t)⟩. Accordingly, L(y^(t); θ_y, θ_xy, φ) takes the form of Equation (4); likewise, Equation (5) gives L(x^(s); θ_x, θ_yx, φ) (see the Appendix for their derivation).

$$\mathcal{L}(y^{(t)}; \theta_y, \theta_{xy}, \phi) = \mathbb{E}_{p(x \mid y^{(t)})}\Big[\mathbb{E}_{q(z \mid x, y^{(t)}; \phi)}\big[\tfrac{1}{2}\{\log p(y^{(t)} \mid z; \theta_y) + \log p(y^{(t)} \mid x, z; \theta_{xy})\}\big] - D_{\mathrm{KL}}\big[q(z \mid x, y^{(t)}; \phi)\,\|\,p(z)\big]\Big] \tag{4}$$

$$\mathcal{L}(x^{(s)}; \theta_x, \theta_{yx}, \phi) = \mathbb{E}_{p(y \mid x^{(s)})}\Big[\mathbb{E}_{q(z \mid x^{(s)}, y; \phi)}\big[\tfrac{1}{2}\{\log p(x^{(s)} \mid z; \theta_x) + \log p(x^{(s)} \mid y, z; \theta_{yx})\}\big] - D_{\mathrm{KL}}\big[q(z \mid x^{(s)}, y; \phi)\,\|\,p(z)\big]\Big] \tag{5}$$

The parameters in Equation (3) can be updated via gradient-based algorithms, where the gradient is computed in a mirrored, integrated manner:

$$\nabla\theta = \nabla_{\{\theta_x, \theta_{yx}\}}\mathcal{L}(x^{(s)}; \cdot) + \nabla_{\{\theta_y, \theta_{xy}\}}\mathcal{L}(y^{(t)}; \cdot) + \nabla_{\phi}\big[\mathcal{L}(x^{(s)}; \cdot) + \mathcal{L}(y^{(t)}; \cdot)\big] \tag{6}$$

The overall training process on non-parallel data shares, to some extent, a similar idea with joint back-translation (Zhang et al., 2018). However, joint back-translation only uses one side of the non-parallel data to update one direction of the translation models in each iteration. Thanks to z from the shared approximate posterior q(z|x, y; φ) acting as a bridge, both directions of MGNMT benefit from either side of the monolingual data. Besides, MGNMT's back-translated pseudo translations are improved by its advanced decoding process (see Equation (7)), which leads to a better learning effect. A schematic sketch of one training iteration follows.
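Below is a minimal sketch of one iteration of Algorithm 1, assuming a hypothetical `model` object whose `translate_*` methods perform MGNMT decoding and whose `lower_bound_*` methods return differentiable (tensor-valued) estimates of Equations (4) and (5); the optimizer interface is standard PyTorch. It illustrates the data flow only, not the paper's implementation.

```python
# Schematic single iteration of Algorithm 1 (training MGNMT on non-parallel data).
# `model` and its methods are hypothetical; losses are assumed to be torch tensors.

def nonparallel_training_step(model, optimizer, x_s, y_t):
    # Back-translate each monolingual sentence with the current model to form
    # pseudo-parallel pairs (lines 3 and 5 of Algorithm 1).
    y_pseu = model.translate_src2tgt(x_s)
    x_pseu = model.translate_tgt2src(y_t)

    # Lower bound on log p(x_s), Equation (5): reconstruct x_s through the
    # source LM p(x|z) and the tgt2src TM p(x|y, z), with z ~ q(z|x_s, y_pseu).
    loss_x = -model.lower_bound_x(x_s, y_pseu)

    # Lower bound on log p(y_t), Equation (4): reconstruct y_t through the
    # target LM p(y|z) and the src2tgt TM p(y|x, z), with z ~ q(z|x_pseu, y_t).
    loss_y = -model.lower_bound_y(x_pseu, y_t)

    # Equation (6): both bounds share the inference parameters phi, so either
    # side of the monolingual data moves the shared latent space.
    loss = loss_x + loss_y
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```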
3.2 DECODING

Thanks to the simultaneous modeling of translation models and language models, MGNMT can generate translations through the collaboration of both. This endows MGNMT's output in the target language with more domain-related fluency and quality. Due to the mirror nature of MGNMT, the decoding process is also symmetric: given a source sentence x (or a target sentence y), we want to find a translation y* = argmax_y p(y|x) = argmax_y p(x, y) (respectively, x* = argmax_x p(x|y) = argmax_x p(x, y)), which we approximate with a mirror variant of the EM-style decoding algorithm of GNMT (Shah & Barber, 2018). Our decoding process is illustrated in Algorithm 2 (a schematic sketch is also given below). Take the src2tgt direction as an example. Given a source sentence x, 1) we first sample an initial z from the standard Gaussian prior and obtain an initial draft translation y = argmax_y p(y|x, z); 2) this translation is then iteratively refined by re-sampling z, this time from the approximate posterior q(z|x, y; φ) conditioned on the current draft, and re-decoding with beam search to maximize the ELBO:

$$y \leftarrow \arg\max_{y} \mathcal{L}(x, y; \theta, \phi) = \arg\max_{y} \mathbb{E}_{q(z \mid x, y; \phi)}\big[\log p(y \mid x, z) + \log p(y \mid z) + \log p(x \mid z) + \log p(x \mid y, z)\big] = \arg\max_{y} \mathbb{E}_{q(z \mid x, y; \phi)}\Big[\textstyle\sum_{i}\big\{\log p(y_i \mid y_{<i}, x, z) + \log p(y_i \mid y_{<i}, z)\big\} + \log p(x \mid z) + \log p(x \mid y, z)\Big] \tag{7}$$
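The following is a minimal sketch of this iterative decoding procedure, assuming hypothetical helpers: `sample_prior` draws z ~ N(0, I), `posterior_sample` draws z from q(z|x, y; φ) given the current draft, and `beam_search` scores hypotheses with the Equation (7) objective. It is a sketch of the procedure described above, not the paper's actual decoder.

```python
# Schematic src2tgt decoding: draft from the prior, then refine with the posterior.
# `model` and its methods are hypothetical placeholders.

def mgnmt_decode(model, x, num_iters=3, beam_size=4):
    """Iteratively refine the translation of x, as in Section 3.2 / Algorithm 2."""
    z = model.sample_prior()                # initial z ~ N(0, I)
    y = model.beam_search(x, z, beam_size)  # initial draft translation
    for _ in range(num_iters):
        # Re-sample z from the approximate posterior q(z | x, y; phi),
        # conditioned on the current draft translation y.
        z = model.posterior_sample(x, y)
        # Re-decode y with beam search, (approximately) maximizing the
        # Equation (7) score under the re-sampled z.
        y = model.beam_search(x, z, beam_size)
    return y
```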