# MIRROR-GENERATIVE NEURAL MACHINE TRANSLATION

Published as a conference paper at ICLR 2020

Zaixiang Zheng¹, Hao Zhou², Shujian Huang¹, Lei Li², Xin-Yu Dai¹, Jiajun Chen¹
¹National Key Laboratory for Novel Software Technology, Nanjing University
zhengzx@smail.nju.edu.cn, {huangsj,daixinyu,chenjj}@nju.edu.cn
²ByteDance AI Lab
{zhouhao.nlp,lileilab}@bytedance.com

ABSTRACT

Training neural machine translation (NMT) models requires large parallel corpora, which are scarce for many language pairs. However, raw non-parallel corpora are often easy to obtain. Existing approaches have not exploited the full potential of non-parallel bilingual data, either in training or in decoding. In this paper, we propose the mirror-generative NMT (MGNMT), a single unified architecture that simultaneously integrates the source-to-target translation model, the target-to-source translation model, and two language models. Both translation models and both language models share the same latent semantic space, so the two translation directions can learn from non-parallel data more effectively. Besides, the translation models and language models can collaborate during decoding. Our experiments show that the proposed MGNMT consistently outperforms existing approaches in a variety of language pairs and scenarios, including resource-rich and low-resource situations.

1 INTRODUCTION

Neural machine translation (NMT) systems (Sutskever et al., 2014; Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017) produce quite promising translations when abundant parallel bilingual data are available for training. But obtaining such large amounts of parallel data is non-trivial in most machine translation scenarios. For example, many low-resource language pairs (e.g., English-to-Tamil) lack adequate parallel data for training. Moreover, it is often difficult to adapt NMT models to other domains when only limited in-domain parallel data are available (e.g., the medical domain), due to the large discrepancy between the test domain and the training parallel data (usually newswire). In these cases, where parallel bilingual data are inadequate, making the most of non-parallel bilingual data (which is usually quite cheap to obtain) is crucial to achieving satisfactory translation performance.

We argue that current NMT approaches to exploiting non-parallel data are suboptimal, in both the training and decoding phases. For training, back-translation (Sennrich et al., 2016b) is the most widely used approach for exploiting monolingual data. However, back-translation updates the two directions of translation models individually, which is not the most effective. Specifically, given monolingual data x (of the source language) and y (of the target language)[1], back-translation utilizes y by applying the tgt2src translation model (TM_{y→x}) to obtain predicted translations x̂. The pseudo translation pairs ⟨x̂, y⟩ are then used to update the src2tgt translation model (TM_{x→y}); x can be used in the same way to update TM_{y→x}. Note that TM_{y→x} and TM_{x→y} are independent here and are updated individually: an update to TM_{y→x} does not directly benefit TM_{x→y}. Related work such as joint back-translation (Zhang et al., 2018) and dual learning (He et al., 2016a) introduces iterative training to make TM_{y→x} and TM_{x→y} benefit from each other implicitly and iteratively, but the translation models in these approaches are still independent (a schematic sketch of such decoupled updates is given below).
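To make this limitation concrete, here is a minimal sketch of vanilla back-translation under the setup just described. The model objects and their `translate`/`train_step` methods are hypothetical placeholders, not part of the paper; the point is only that each monolingual batch updates a single direction and the two models stay decoupled.

```python
# Minimal sketch of standard back-translation with two independent models.
# `tm_xy` and `tm_yx` are hypothetical src2tgt / tgt2src model objects.

def back_translation_round(tm_xy, tm_yx, mono_x, mono_y):
    """One round of vanilla back-translation: each side updates one model only."""
    # Target monolingual data y: back-translate with TM_{y->x}, update TM_{x->y}.
    x_pseudo = [tm_yx.translate(y) for y in mono_y]
    tm_xy.train_step(list(zip(x_pseudo, mono_y)))

    # Source monolingual data x: back-translate with TM_{x->y}, update TM_{y->x}.
    y_pseudo = [tm_xy.translate(x) for x in mono_x]
    tm_yx.train_step(list(zip(mono_x, y_pseudo)))
```

Nothing learned by `tm_yx` here flows into `tm_xy` except through another full round of pseudo-data generation; this indirectness is exactly what MGNMT's shared latent variable is designed to remove.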
Ideally, gains from non-parallel data could be enlarged if TM_{y→x} and TM_{x→y} were related: then every update to TM_{y→x} could directly produce a better TM_{x→y} and vice versa, exploiting non-parallel data more effectively.

[1] Please refer to Section 2 for details of the notation.

Figure 1: The graphical model of MGNMT.

Figure 2: Illustration of the mirror property of MGNMT, which integrates the src2tgt TM p(y|x, z; θ_xy), the target LM p(y|z; θ_y), the tgt2src TM p(x|y, z; θ_yx), and the source LM p(x|z; θ_x).

For decoding, some related work (Gulcehre et al., 2015) proposes to interpolate an external language model LM_y (trained separately on target monolingual data) with the translation model TM_{x→y}, which brings knowledge from target monolingual data into translation. This is particularly useful for domain adaptation, because a better LM_y can yield translations that better fit the test domain (e.g., social networks). However, directly interpolating an independent language model during decoding may not be the best solution. First, the language model used here is external and learned independently of the translation model, so the two models may not cooperate well under a simple interpolation mechanism (and may even conflict). Additionally, the language model is only included in decoding and is not considered during training. This leads to an inconsistency between training and decoding, which may harm performance.

In this paper, we propose the mirror-generative NMT (MGNMT) to address the aforementioned problems and exploit non-parallel data effectively in NMT. MGNMT jointly trains the translation models (i.e., TM_{x→y} and TM_{y→x}) and the language models (i.e., LM_x and LM_y) in a unified framework, which is non-trivial. Inspired by generative NMT (Shah & Barber, 2018), we introduce a latent semantic variable z shared between x and y. Our method exploits the symmetry, or mirror property, in decomposing the conditional joint probability p(x, y|z), namely:

$$\log p(x, y \mid z) = \log p(x \mid z) + \log p(y \mid x, z) = \log p(y \mid z) + \log p(x \mid y, z) = \frac{1}{2}\Big[\underbrace{\log p(y \mid x, z)}_{\text{src2tgt } \mathrm{TM}_{x \to y}} + \underbrace{\log p(y \mid z)}_{\text{target } \mathrm{LM}_{y}} + \underbrace{\log p(x \mid y, z)}_{\text{tgt2src } \mathrm{TM}_{y \to x}} + \underbrace{\log p(x \mid z)}_{\text{source } \mathrm{LM}_{x}}\Big] \tag{1}$$

The graphical model of MGNMT is illustrated in Figure 1. MGNMT aligns the bidirectional translation models as well as the language models of the two languages through a shared latent semantic space (Figure 2), so that all of them are related and become conditionally independent given z. In this setting, MGNMT enables the following advantages: (i) For training, with z acting as a bridge, TM_{y→x} and TM_{x→y} are no longer independent, so every update to one direction directly benefits the other. This improves the efficiency of using non-parallel data (Section 3.1). (ii) For decoding, MGNMT can naturally take advantage of its internal target-side language model, which is learned jointly with the translation model; both contribute to a better generation process (Section 3.2).
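To make Equation (1) concrete, the sketch below scores a sentence pair under the mirror factorization. The component model objects and their `log_prob` interface are hypothetical stand-ins for the four conditionals above, not the paper's actual implementation.

```python
# Minimal sketch of the mirror factorization in Equation (1).
# Each hypothetical component returns a log-probability conditioned on the
# shared latent variable z.

def mirror_joint_log_prob(x, y, z, tm_xy, lm_y, tm_yx, lm_x):
    """log p(x, y | z), averaged over the two symmetric decompositions."""
    return 0.5 * (
        tm_xy.log_prob(y, x, z)    # log p(y | x, z): src2tgt TM
        + lm_y.log_prob(y, z)      # log p(y | z):    target LM
        + tm_yx.log_prob(x, y, z)  # log p(x | y, z): tgt2src TM
        + lm_x.log_prob(x, z)      # log p(x | z):    source LM
    )
```

Because all four terms condition on the same z, training any one of them also shapes the shared latent space, which is what couples the two translation directions.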
Note that MGNMT is orthogonal to dual learning (He et al., 2016a) and joint back-translation (Zhang et al., 2018). The translation models in MGNMT are dependent, and the two translation models can directly promote each other, whereas dual learning and joint back-translation work in an implicit way; these two approaches can also be used to further improve MGNMT. Moreover, the language models used in dual learning face the same problem as Gulcehre et al. (2015). Even given GNMT (Shah & Barber, 2018), the proposed MGNMT is non-trivial. GNMT only has a source-side language model, so it cannot enhance decoding the way MGNMT does. Also, Shah & Barber (2018) require GNMT to share all parameters and vocabularies between the two translation models in order to utilize monolingual data, which is not best suited for distant language pairs. We give a more detailed comparison in the related work.

Experiments show that MGNMT achieves competitive performance on parallel bilingual data and shows clear advantages when trained on non-parallel data. MGNMT outperforms several strong baselines in different scenarios and language pairs, including resource-rich settings as well as resource-poor settings such as low-resource language translation and cross-domain translation. Moreover, we show that translation quality indeed improves when the jointly learned translation model and language model of MGNMT work together. We also demonstrate that MGNMT is architecture-free: it can be applied to any neural sequence model, such as the Transformer or an RNN. These pieces of evidence verify that MGNMT meets our expectation of fully utilizing non-parallel data.

Figure 3: Illustration of the architecture of MGNMT: an inference model q(z|x, y) with reparameterized sampling (ε ~ N(0, I)), the KL term D_KL[N(μ(x, y), Σ(x, y)) || N(0, I)], the expected joint log-likelihood E_z[log p(x, y|z)], and the four component models (src2tgt TM, target LM, tgt2src TM, source LM).

2 BACKGROUND AND RELATED WORK

Notation  Given a pair of sentences from the source and target languages, e.g., ⟨x, y⟩, we denote x as a sentence of the source language and y as a sentence of the target language. Additionally, we use the terms source-side and target-side of a translation direction to denote its input and output sides; e.g., the source-side of the tgt2src translation is the target language.

Neural machine translation  Conventional neural machine translation (NMT) models often adopt an encoder-decoder framework (Bahdanau et al., 2015) with discriminative learning. Such NMT models aim to approximate the conditional distribution log p(y|x; θ_xy) over a target sentence y = ⟨y_1, ..., y_{L_y}⟩ given a source sentence x = ⟨x_1, ..., x_{L_x}⟩. We refer to such regular NMT models as discriminative NMT models. The training criterion for a discriminative NMT model is to maximize the conditional log-likelihood log p(y|x; θ_xy) on abundant parallel bilingual data D_xy = {⟨x^(n), y^(n)⟩ | n = 1...N} of i.i.d. observations. As pointed out by Zhang et al. (2016) and Su et al. (2018), the shared semantics z between x and y are learned only implicitly in discriminative NMT, which is insufficient to model the semantic equivalence in translation. Recently, Shah & Barber (2018) proposed a generative NMT (GNMT), which models the joint distribution p(x, y) instead of p(y|x) with a latent variable z:

$$\log p(x, y \mid z; \theta = \{\theta_x, \theta_{xy}\}) = \log p(x \mid z; \theta_x) + \log p(y \mid x, z; \theta_{xy})$$

where GNMT models log p(x|z; θ_x) as a source variational language model. Eikema & Aziz (2019) also propose a similar approach. In addition, Chan et al. (2019) propose generative insertion-based sequence modeling, which also models the joint distribution.

Exploiting non-parallel data for NMT  Neither discriminative nor generative NMT can directly learn from non-parallel bilingual data. To remedy this, back-translation and its variants (Sennrich et al., 2016b; Zhang et al., 2018) exploit non-parallel bilingual data by generating synthetic parallel data.
Dual learning (He et al., 2016a; Xia et al., 2017) learns from non-parallel data in a round-trip game via reinforcement learning, with the help of pretrained language models. Although these methods have proven effective, the independence between the translation models, and between the translation and language models (in dual learning), makes them less efficient than MGNMT at utilizing non-parallel data in both training and decoding. Meanwhile, iterative learning schemes like theirs can also complement MGNMT. Some other studies exploit non-parallel bilingual data by sharing all parameters and vocabularies between the source and target languages, so that both translation directions can be updated with monolingual data of either language (Dong et al., 2015; Johnson et al., 2017; Firat et al., 2016; Artetxe et al., 2018; Lample et al., 2018a;b); GNMT can do so as well, in an auto-encoder fashion. However, these approaches may still fail on distant language pairs (Zhang & Komachi, 2019) such as English-to-Chinese, due to potential issues such as non-overlapping alphabets, which is also verified in our experiments. Additionally, as mentioned above, integrating a language model is another way to exploit monolingual data for NMT (Gulcehre et al., 2015; Stahlberg et al., 2018; Chu & Wang, 2018). However, this kind of method often resorts to externally trained language models, which are agnostic to the translation task. Besides, although GNMT contains a source-side language model, that model cannot help decoding. In contrast, MGNMT jointly learns translation and language modeling probabilistically and can naturally rely on both for better generation.

Algorithm 1 Training MGNMT from Non-Parallel Data
Input: (pretrained) MGNMT M(θ), source monolingual dataset D_x, target monolingual dataset D_y
1: while not converged do
2:   Draw source and target sentences from the non-parallel data: x^(s) ~ D_x, y^(t) ~ D_y
3:   Use M to translate x^(s), constructing a pseudo-parallel sentence pair ⟨x^(s), y_pseu^(s)⟩
4:   Compute L(x^(s); θ_x, θ_yx, φ) with ⟨x^(s), y_pseu^(s)⟩ by Equation (5)
5:   Use M to translate y^(t), constructing a pseudo-parallel sentence pair ⟨x_pseu^(t), y^(t)⟩
6:   Compute L(y^(t); θ_y, θ_xy, φ) with ⟨x_pseu^(t), y^(t)⟩ by Equation (4)
7:   Compute the gradient ∇θ by Equation (6)
8:   Update parameters θ ← θ + η∇θ
9: end while

3 MIRROR-GENERATIVE NEURAL MACHINE TRANSLATION

We propose the mirror-generative NMT (MGNMT), a novel deep generative model which simultaneously models a pair of src2tgt and tgt2src (variational) translation models, as well as a pair of source and target (variational) language models, in a highly integrated way based on the mirror property. As a result, MGNMT can learn from non-parallel bilingual data and naturally interpolate its learned language model with the translation model in the decoding process. The overall architecture of MGNMT is illustrated graphically in Figure 3 (a minimal component skeleton is also sketched below). MGNMT models the joint distribution over a bilingual sentence pair by exploiting the mirror property of the joint probability, log p(x, y|z) = 1/2 [log p(y|x, z) + log p(y|z) + log p(x|y, z) + log p(x|z)], where the latent variable z (with a standard Gaussian prior, z ~ N(0, I)) stands for the shared semantics between x and y and serves as a bridge between all of the integrated translation and language models.
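The following PyTorch-style skeleton shows how a single latent variable can be shared by all four component models through a reparameterized posterior sample. The submodule names and interfaces are hypothetical; this is a structural sketch under those assumptions, not the paper's released code.

```python
# Structural sketch of MGNMT's components around a shared latent z.
# All submodules passed in (inference_net, tm_xy, lm_y, tm_yx, lm_x) are
# hypothetical encoder/decoder networks.
import torch
import torch.nn as nn


class MGNMTSketch(nn.Module):
    def __init__(self, inference_net, tm_xy, lm_y, tm_yx, lm_x):
        super().__init__()
        self.inference_net = inference_net   # q(z | x, y; phi) -> (mu, log_var)
        self.tm_xy, self.lm_y = tm_xy, lm_y  # p(y | x, z; theta_xy), p(y | z; theta_y)
        self.tm_yx, self.lm_x = tm_yx, lm_x  # p(x | y, z; theta_yx), p(x | z; theta_x)

    def sample_posterior(self, x, y):
        """Reparameterized sample z = mu + eps * sigma, with eps ~ N(0, I)."""
        mu, log_var = self.inference_net(x, y)
        eps = torch.randn_like(mu)
        z = mu + eps * torch.exp(0.5 * log_var)
        return z, mu, log_var

    @staticmethod
    def kl_to_standard_normal(mu, log_var):
        """D_KL[N(mu, diag(sigma^2)) || N(0, I)] for a diagonal Gaussian posterior."""
        return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1)
```

During training (Section 3.1), the four decoders all condition on the same z returned by `sample_posterior`, and the KL term above is the one appearing in Equations (2), (4), and (5).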
3.1 TRAINING

3.1.1 LEARNING FROM PARALLEL DATA

We first introduce how to train MGNMT on regular parallel bilingual data. Given a parallel sentence pair ⟨x, y⟩, we use stochastic gradient variational Bayes (SGVB) (Kingma & Welling, 2014) to perform approximate maximum likelihood estimation of log p(x, y). We parameterize the approximate posterior as q(z|x, y; φ) = N(μ_φ(x, y), Σ_φ(x, y)). Then, from Equation (1), we obtain the Evidence Lower BOund (ELBO) L(x, y; θ, φ) on the log-likelihood of the joint probability:

$$\log p(x, y) \geq \mathcal{L}(x, y; \theta, \phi) = \mathbb{E}_{q(z \mid x, y; \phi)}\Big[\tfrac{1}{2}\big\{\log p(y \mid x, z; \theta_{xy}) + \log p(y \mid z; \theta_y) + \log p(x \mid y, z; \theta_{yx}) + \log p(x \mid z; \theta_x)\big\}\Big] - D_{\mathrm{KL}}\big[q(z \mid x, y; \phi)\,\|\,p(z)\big] \tag{2}$$

where θ = {θ_x, θ_yx, θ_y, θ_xy} is the set of parameters of the translation and language models. The first term is the (expected) log-likelihood of the sentence pair, whose expectation is estimated by Monte Carlo sampling. The second term is the KL-divergence between z's approximate posterior and its prior. Using the reparameterization trick (Kingma & Welling, 2014), we can jointly train all the components with gradient-based algorithms.

3.1.2 LEARNING FROM NON-PARALLEL DATA

Since MGNMT intrinsically contains a pair of mirror translation models, we design an iterative training approach to exploit non-parallel data, in which both directions of MGNMT benefit from the monolingual data mutually and boost each other. The proposed training process on non-parallel bilingual data is illustrated in Algorithm 1.

Formally, given non-parallel sentences, i.e., x^(s) from a source monolingual dataset D_x = {x^(s) | s = 1...S} and y^(t) from a target monolingual dataset D_y = {y^(t) | t = 1...T}, we aim to jointly maximize lower bounds on their marginal log-likelihoods:

$$\log p(x^{(s)}) + \log p(y^{(t)}) \geq \mathcal{L}(x^{(s)}; \theta_x, \theta_{yx}, \phi) + \mathcal{L}(y^{(t)}; \theta_y, \theta_{xy}, \phi) \tag{3}$$

where L(x^(s); θ_x, θ_yx, φ) and L(y^(t); θ_y, θ_xy, φ) are the lower bounds on the source and target marginal log-likelihoods, respectively. Let us take L(y^(t); θ_y, θ_xy, φ) as an example. Inspired by Zhang et al. (2018), we sample an x from p(x|y^(t)) in the source language as the translation of y^(t) (i.e., back-translation) and obtain a pseudo-parallel sentence pair ⟨x, y^(t)⟩. Accordingly, L(y^(t); θ_y, θ_xy, φ) takes the form of Equation (4); likewise, Equation (5) gives L(x^(s); θ_x, θ_yx, φ) (see the Appendix for their derivation).

$$\mathcal{L}(y^{(t)}; \theta_y, \theta_{xy}, \phi) = \mathbb{E}_{p(x \mid y^{(t)})}\Big[\mathbb{E}_{q(z \mid x, y^{(t)}; \phi)}\big[\tfrac{1}{2}\{\log p(y^{(t)} \mid z; \theta_y) + \log p(y^{(t)} \mid x, z; \theta_{xy})\}\big] - D_{\mathrm{KL}}\big[q(z \mid x, y^{(t)}; \phi)\,\|\,p(z)\big]\Big] \tag{4}$$

$$\mathcal{L}(x^{(s)}; \theta_x, \theta_{yx}, \phi) = \mathbb{E}_{p(y \mid x^{(s)})}\Big[\mathbb{E}_{q(z \mid x^{(s)}, y; \phi)}\big[\tfrac{1}{2}\{\log p(x^{(s)} \mid z; \theta_x) + \log p(x^{(s)} \mid y, z; \theta_{yx})\}\big] - D_{\mathrm{KL}}\big[q(z \mid x^{(s)}, y; \phi)\,\|\,p(z)\big]\Big] \tag{5}$$

The parameters in Equation (3) can be updated via gradient-based algorithms, where the gradient is computed in a mirrored, integrated manner:

$$\nabla\theta = \nabla_{\{\theta_x, \theta_{yx}\}}\mathcal{L}(x^{(s)}; \cdot) + \nabla_{\{\theta_y, \theta_{xy}\}}\mathcal{L}(y^{(t)}; \cdot) + \nabla_{\phi}\big[\mathcal{L}(x^{(s)}; \cdot) + \mathcal{L}(y^{(t)}; \cdot)\big] \tag{6}$$

The overall training process on non-parallel data shares, to some extent, a similar idea with joint back-translation (Zhang et al., 2018). However, joint back-translation only uses one side of the non-parallel data to update one direction of the translation models in each iteration. Thanks to z from the shared approximate posterior q(z|x, y; φ) acting as a bridge, both directions of MGNMT benefit from either side of the monolingual data. Besides, MGNMT's back-translated pseudo translations are improved by its advanced decoding process (see Equation (7)), which leads to a better learning effect. A schematic sketch of one training iteration follows.
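Below is a minimal sketch of one iteration of Algorithm 1, assuming a hypothetical `model` object whose `translate_*` methods perform MGNMT decoding and whose `lower_bound_*` methods return differentiable (tensor-valued) estimates of Equations (4) and (5); the optimizer interface is standard PyTorch. It illustrates the data flow only, not the paper's implementation.

```python
# Schematic single iteration of Algorithm 1 (training MGNMT on non-parallel data).
# `model` and its methods are hypothetical; losses are assumed to be torch tensors.

def nonparallel_training_step(model, optimizer, x_s, y_t):
    # Back-translate each monolingual sentence with the current model to form
    # pseudo-parallel pairs (lines 3 and 5 of Algorithm 1).
    y_pseu = model.translate_src2tgt(x_s)
    x_pseu = model.translate_tgt2src(y_t)

    # Lower bound on log p(x_s), Equation (5): reconstruct x_s through the
    # source LM p(x|z) and the tgt2src TM p(x|y, z), with z ~ q(z|x_s, y_pseu).
    loss_x = -model.lower_bound_x(x_s, y_pseu)

    # Lower bound on log p(y_t), Equation (4): reconstruct y_t through the
    # target LM p(y|z) and the src2tgt TM p(y|x, z), with z ~ q(z|x_pseu, y_t).
    loss_y = -model.lower_bound_y(x_pseu, y_t)

    # Equation (6): both bounds share the inference parameters phi, so either
    # side of the monolingual data moves the shared latent space.
    loss = loss_x + loss_y
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```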
3.2 DECODING

Thanks to the simultaneous modeling of translation models and language models, MGNMT can generate translations through the collaboration of both. This endows MGNMT's output in the target language with more domain-related fluency and quality. Due to the mirror nature of MGNMT, the decoding process is also symmetric: given a source sentence x (or a target sentence y), we want to find a translation y* = argmax_y p(y|x) = argmax_y p(x, y) (respectively, x* = argmax_x p(x|y) = argmax_x p(x, y)), which we approximate with a mirror variant of the EM-style decoding algorithm of GNMT (Shah & Barber, 2018). Our decoding process is illustrated in Algorithm 2 (a schematic sketch is also given below). Take the src2tgt direction as an example. Given a source sentence x, 1) we first sample an initial z from the standard Gaussian prior and obtain an initial draft translation y = argmax_y p(y|x, z); 2) this translation is then iteratively refined by re-sampling z, this time from the approximate posterior q(z|x, y; φ) conditioned on the current draft, and re-decoding with beam search to maximize the ELBO:

$$y \leftarrow \arg\max_{y} \mathcal{L}(x, y; \theta, \phi) = \arg\max_{y} \mathbb{E}_{q(z \mid x, y; \phi)}\big[\log p(y \mid x, z) + \log p(y \mid z) + \log p(x \mid z) + \log p(x \mid y, z)\big] = \arg\max_{y} \mathbb{E}_{q(z \mid x, y; \phi)}\Big[\textstyle\sum_{i}\big\{\log p(y_i \mid y_{<i}, x, z) + \log p(y_i \mid y_{<i}, z)\big\} + \log p(x \mid z) + \log p(x \mid y, z)\Big] \tag{7}$$
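The following is a minimal sketch of this iterative decoding procedure, assuming hypothetical helpers: `sample_prior` draws z ~ N(0, I), `posterior_sample` draws z from q(z|x, y; φ) given the current draft, and `beam_search` scores hypotheses with the Equation (7) objective. It is a sketch of the procedure described above, not the paper's actual decoder.

```python
# Schematic src2tgt decoding: draft from the prior, then refine with the posterior.
# `model` and its methods are hypothetical placeholders.

def mgnmt_decode(model, x, num_iters=3, beam_size=4):
    """Iteratively refine the translation of x, as in Section 3.2 / Algorithm 2."""
    z = model.sample_prior()                # initial z ~ N(0, I)
    y = model.beam_search(x, z, beam_size)  # initial draft translation
    for _ in range(num_iters):
        # Re-sample z from the approximate posterior q(z | x, y; phi),
        # conditioned on the current draft translation y.
        z = model.posterior_sample(x, y)
        # Re-decode y with beam search, (approximately) maximizing the
        # Equation (7) score under the re-sampled z.
        y = model.beam_search(x, z, beam_size)
    return y
```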