# Dual Learning for Machine Translation

Di He1, Yingce Xia2, Tao Qin3, Liwei Wang1, Nenghai Yu2, Tie-Yan Liu3, Wei-Ying Ma3

1Key Laboratory of Machine Perception (MOE), School of EECS, Peking University
2University of Science and Technology of China
3Microsoft Research

1{dih,wanglw}@cis.pku.edu.cn; 2xiayingc@mail.ustc.edu.cn; 2ynh@ustc.edu.cn; 3{taoqin,tie-yan.liu,wyma}@microsoft.com

The first two authors contributed equally to this work. This work was conducted when the second author was visiting Microsoft Research Asia. 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

## Abstract

While neural machine translation (NMT) has made good progress in the past two years, tens of millions of bilingual sentence pairs are needed for its training. However, human labeling is very costly. To tackle this training data bottleneck, we develop a dual-learning mechanism, which can enable an NMT system to automatically learn from unlabeled data through a dual-learning game. This mechanism is inspired by the following observation: any machine translation task has a dual task, e.g., English-to-French translation (primal) versus French-to-English translation (dual); the primal and dual tasks can form a closed loop, and generate informative feedback signals to train the translation models, even without the involvement of a human labeler. In the dual-learning mechanism, we use one agent to represent the model for the primal task and the other agent to represent the model for the dual task, then ask them to teach each other through a reinforcement learning process. Based on the feedback signals generated during this process (e.g., the language-model likelihood of the output of a model, and the reconstruction error of the original sentence after the primal and dual translations), we can iteratively update the two models until convergence (e.g., using the policy gradient methods). We call the corresponding approach to neural machine translation dual-NMT. Experiments show that dual-NMT works very well on English↔French translation; especially, by learning from monolingual data (with 10% bilingual data for warm start), it achieves a comparable accuracy to NMT trained from the full bilingual data for the French-to-English translation task.

## 1 Introduction

State-of-the-art machine translation (MT) systems, including both the phrase-based statistical translation approaches [6, 3, 12] and the recently emerged neural network based translation approaches [1, 5], heavily rely on aligned parallel training corpora. However, such parallel data are costly to collect in practice and thus are usually limited in scale, which may constrain the related research and applications.

Given that there exist almost unlimited monolingual data on the Web, it is very natural to leverage them to boost the performance of MT systems. Actually, different methods have been proposed for this purpose, which can be roughly classified into two categories. In the first category [2, 4], monolingual corpora in the target language are used to train a language model, which is then integrated with the MT models trained from parallel bilingual corpora to improve the translation quality. In the second category [14, 11], pseudo bilingual sentence pairs are generated from monolingual data by using the translation model trained from aligned parallel corpora, and then these pseudo bilingual sentence pairs are used to enlarge the training data for subsequent learning.
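As a concrete (and deliberately simplified) illustration of the second category, the sketch below pairs monolingual sentences with their machine translations to enlarge the training data. The `translate` callable and the `min_score` filter are hypothetical stand-ins, not taken from the cited works.

```python
def build_pseudo_pairs(mono_sentences, translate, min_score=None):
    """Create pseudo bilingual pairs from monolingual sentences with an existing
    translation model, in the spirit of the second category of methods above.

    `translate` is an assumed callable (hypothetical interface) returning a
    (translation, model_score) tuple; the optional `min_score` threshold is one
    simple, illustrative way to filter out low-quality pairs.
    """
    pairs = []
    for sent in mono_sentences:
        hyp, score = translate(sent)
        if min_score is None or score >= min_score:
            pairs.append((sent, hyp))   # pseudo pair added to the training data
    return pairs
```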
While the above methods can improve MT performance to some extent, they still suffer from certain limitations. The methods in the first category only use the monolingual data to train language models, but do not fundamentally address the shortage of parallel training data. Although the methods in the second category can enlarge the parallel training data, there is no guarantee/control on the quality of the pseudo bilingual sentence pairs.

In this paper, we propose a dual-learning mechanism that can leverage monolingual data (in both the source and target languages) in a more effective way. By using our proposed mechanism, these monolingual data can play a similar role to the parallel bilingual data, and significantly reduce the requirement on parallel bilingual data during the training process. Specifically, the dual-learning mechanism for MT can be described as the following two-agent communication game.

1. The first agent, who only understands language A, sends a message in language A to the second agent through a noisy channel, which converts the message from language A to language B using a translation model.
2. The second agent, who only understands language B, receives the translated message in language B. She checks the message and notifies the first agent whether it is a natural sentence in language B (note that the second agent may not be able to verify the correctness of the translation since the original message is invisible to her). Then she sends the received message back to the first agent through another noisy channel, which converts the received message from language B back to language A using another translation model.
3. After receiving the message from the second agent, the first agent checks it and notifies the second agent whether the message she receives is consistent with her original message. Through the feedback, both agents will know whether the two communication channels (and thus the two translation models) perform well and can improve them accordingly.
4. The game can also be started from the second agent with an original message in language B, and then the two agents will go through a symmetric process and improve the two channels (translation models) according to the feedback.

It is easy to see from the above description that, although the two agents may not have aligned bilingual corpora, they can still get feedback about the quality of the two translation models and collectively improve the models based on this feedback. This game can be played for an arbitrary number of rounds, and the two translation models will be improved through this reinforcement procedure (e.g., by means of the policy gradient methods). In this way, we develop a general learning framework for training machine translation models through a dual-learning game; a minimal sketch of the resulting training step is given below.
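The following sketch shows one dual-learning update on a monolingual sentence in language A, combining a language-model reward on the intermediate translation with a reconstruction reward from the dual model. The model and optimizer interfaces (`sample`, `log_prob`, `score`) are assumed for illustration; this is a sketch of the idea, not the exact procedure used in the paper.

```python
import torch

def dual_learning_step(s_A, translate_AB, translate_BA, lm_B,
                       opt_AB, opt_BA, alpha=0.5, num_samples=2):
    """One dual-learning update on a monolingual sentence s_A (language A).

    Assumed (hypothetical) interfaces, not tied to any particular library:
      translate_AB.sample(s_A, k)        -> k sampled mid-translations in language B,
                                            each paired with its log-probability (torch scalar)
      translate_BA.log_prob(s_mid, s_A)  -> log P(s_A | s_mid) under the dual model (torch scalar)
      lm_B.score(s_mid)                  -> language-model log-likelihood of s_mid (float)
    """
    losses_AB, losses_BA = [], []
    for s_mid, logp_AB in translate_AB.sample(s_A, num_samples):
        # Communication reward: is the intermediate translation a natural sentence in B?
        r_lm = lm_B.score(s_mid)
        # Reconstruction reward: can the dual model recover the original sentence?
        logp_rec = translate_BA.log_prob(s_mid, s_A)
        # alpha is an illustrative weighting between the two rewards.
        reward = alpha * r_lm + (1.0 - alpha) * logp_rec.item()
        # REINFORCE-style loss for the primal model: raise the log-prob of well-rewarded samples.
        losses_AB.append(-reward * logp_AB)
        # Maximum-likelihood update for the dual model on the reconstruction.
        losses_BA.append(-(1.0 - alpha) * logp_rec)

    opt_AB.zero_grad()
    torch.stack(losses_AB).mean().backward()
    opt_AB.step()

    opt_BA.zero_grad()
    torch.stack(losses_BA).mean().backward()
    opt_BA.step()
```

The symmetric step that starts from a monolingual sentence in language B simply swaps the roles of the two translation models and uses a language model of language A instead of `lm_B`.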
The dual-learning mechanism has several distinguishing features. First, we train translation models from unlabeled data through reinforcement learning. Our work significantly reduces the requirement on aligned bilingual data, and it opens a new window to learn to translate from scratch (i.e., even without using any parallel data). Experimental results show that our method is very promising. Second, we demonstrate the power of deep reinforcement learning (DRL) for complex real-world applications, rather than just games. Deep reinforcement learning has drawn great attention in recent years; however, most of its successes to date have been in video or board games, and it remains a challenge to enable DRL for more complicated applications whose rules are not pre-defined and where there are no explicit reward signals. Dual learning provides a promising way to extract reward signals for reinforcement learning in real-world applications like machine translation.

The remaining parts of the paper are organized as follows. In Section 2, we briefly review the literature of neural machine translation. After that, we introduce our dual-learning algorithm for neural machine translation. The experimental results are provided and discussed in Section 4. We extend the breadth and depth of dual learning in Section 5 and discuss future work in the last section.

## 2 Background: Neural Machine Translation

In principle, our dual-learning framework can be applied to both phrase-based statistical machine translation and neural machine translation. In this paper, we focus on the latter, i.e., neural machine translation (NMT), due to its simplicity as an end-to-end system, without suffering from human-crafted engineering [5].

Neural machine translation systems are typically implemented with a recurrent neural network (RNN) based encoder-decoder framework. Such a framework learns a probabilistic mapping $P(y|x)$ from a source language sentence $x = \{x_1, x_2, \dots, x_{T_x}\}$ to a target language sentence $y = \{y_1, y_2, \dots, y_{T_y}\}$, in which $x_i$ and $y_t$ are the $i$-th and $t$-th words of sentences $x$ and $y$ respectively.

To be more concrete, the encoder of NMT reads the source sentence $x$ and generates $T_x$ hidden states by an RNN:

$$h_i = f(h_{i-1}, x_i) \qquad (1)$$

in which $h_i$ is the hidden state at time $i$, and the function $f$ is the recurrent unit such as a Long Short-Term Memory (LSTM) unit [12] or a Gated Recurrent Unit (GRU) [3]. Afterwards, the decoder of NMT computes the conditional probability of each target word $y_t$ given its preceding words $y_{<t}$ as well as the source sentence $x$.
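As an illustration of the recurrence in Eq. (1), the sketch below is a minimal GRU encoder in PyTorch; the class name, embedding size, and hidden size are illustrative choices, not the settings used in the paper.

```python
import torch
import torch.nn as nn

class GRUEncoder(nn.Module):
    """Minimal encoder illustrating Eq. (1): h_i = f(h_{i-1}, x_i).

    Dimensions below are illustrative only.
    """
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.cell = nn.GRUCell(emb_dim, hidden_dim)   # the recurrent unit f
        self.hidden_dim = hidden_dim

    def forward(self, x_ids):
        # x_ids: LongTensor of shape (T_x,) holding the source word indices x_1..x_{T_x}
        h = torch.zeros(1, self.hidden_dim)           # h_0
        states = []
        for emb_i in self.embed(x_ids):               # one step per source word
            h = self.cell(emb_i.unsqueeze(0), h)      # h_i = f(h_{i-1}, x_i)
            states.append(h.squeeze(0))
        return torch.stack(states)                    # (T_x, hidden_dim): all encoder states

# Example: three source word ids produce three hidden states of size 512.
# enc = GRUEncoder(vocab_size=30000)
# hs = enc(torch.tensor([4, 17, 9]))   # hs.shape == (3, 512)
```

These hidden states are what the decoder then conditions on, together with the preceding target words $y_{<t}$, when predicting each target word $y_t$.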