# Model-Level Dual Learning

Yingce Xia¹'² Xu Tan² Fei Tian² Tao Qin² Nenghai Yu¹ Tie-Yan Liu²

¹School of Information Science and Technology, University of Science and Technology of China, Hefei, China. ²Microsoft Research, Beijing, China. Correspondence to: Tao Qin.

Abstract

Many artificial intelligence tasks appear in dual forms, such as English↔French translation and speech↔text transformation. Existing dual learning schemes, which are proposed to solve a pair of such dual tasks, explore how to leverage such dualities at the data level. In this work, we propose a new learning framework, model-level dual learning, which takes the duality of tasks into consideration while designing the architectures of the primal/dual models, and ties the model parameters that play similar roles in the two tasks. We study both symmetric and asymmetric model-level dual learning. Our algorithms achieve significant improvements on neural machine translation and sentiment analysis.

1. Introduction

Joint learning of multiple tasks has attracted much attention in the machine learning community, and several learning paradigms have been studied to explore task correlations from different perspectives. Multi-task learning (Luong et al., 2016; Zhang & Yang, 2017) is a paradigm that learns a problem together with other related problems at the same time, using a shared representation. This often leads to a better model for the main task, because it allows the learner to exploit the commonality among the tasks. Transfer learning (Pan & Yang, 2010) focuses on storing the knowledge gained while solving one problem and applying it to a different but related problem. Recently, a new paradigm, dual learning (He et al., 2016a; Xia et al., 2017b; Yi et al., 2017; Lin et al., 2018), has been proposed to leverage the symmetric structure of some learning problems, such as English↔French translation, image↔text transformation, and speech↔text transformation, and it achieves promising results in various AI tasks (He et al., 2016a; Yi et al., 2017; Tang et al., 2017).

Dual learning algorithms have been proposed for different learning settings. (He et al., 2016a; Yi et al., 2017) focus on the unsupervised setting and learn from unlabeled data: given an unlabeled sample $x \in \mathcal{X}$, the primal model $f: \mathcal{X} \mapsto \mathcal{Y}$ first maps it to a sample $y \in \mathcal{Y}$, the dual model $g: \mathcal{Y} \mapsto \mathcal{X}$ maps $y$ back to a new sample $\hat{x} \in \mathcal{X}$, and then the distortion between $x$ and $\hat{x}$ is used as the feedback signal to optimize $f$ and $g$. (Xia et al., 2017b) focuses on the supervised setting and conducts joint learning from labeled data by adding an additional probabilistic constraint $P(x)P(y|x; f) = P(y)P(x|y; g)$ on any $(x, y)$ data pair, implied by structure duality. This probabilistic constraint is also utilized in the inference process (Xia et al., 2017a). Inspired by the law of total probability, (Wang et al., 2018) studies dual transfer learning, in which one model in a pair of dual tasks is used to enhance the training of the other model. Although these algorithms consider different settings, they all consider duality at the data level, characterized by either the reconstruction error of unlabeled samples (He et al., 2016a; Yi et al., 2017), the joint probability of data pairs (Xia et al., 2017b; Wang et al., 2018), or the marginal probability of data samples (Wang et al., 2018).
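To make the data-level signals above concrete, here is a minimal sketch (our illustration, not any of the cited implementations) of the two feedback terms, written in Python/PyTorch-style pseudocode; the interfaces `primal`, `dual`, `log_p_x`, and `log_p_y` are hypothetical placeholders for the primal/dual models and the marginal (language) models.

```python
import torch

def data_level_duality_signals(x, y, primal, dual, log_p_x, log_p_y):
    """Hedged sketch of the two data-level duality signals discussed above.

    primal, dual     : hypothetical sequence models exposing .translate() and .log_prob()
    log_p_x, log_p_y : hypothetical marginal models returning log-probabilities
    """
    # (1) Unsupervised reconstruction feedback (He et al., 2016a; Yi et al., 2017):
    #     x -> f(x) -> g(f(x)) should reconstruct x, so penalize the distortion.
    y_hat = primal.translate(x)                      # y_hat sampled from f(. | x)
    reconstruction_loss = -dual.log_prob(x, given=y_hat)

    # (2) Supervised probabilistic constraint (Xia et al., 2017b):
    #     log P(x) + log P(y|x; f) should match log P(y) + log P(x|y; g).
    lhs = log_p_x(x) + primal.log_prob(y, given=x)
    rhs = log_p_y(y) + dual.log_prob(x, given=y)
    duality_loss = (lhs - rhs) ** 2

    return reconstruction_loss, duality_loss
```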
We find that many tasks exhibit structural duality/symmetry not only at the data level, but also at the model level. Take neural machine translation (briefly, NMT) (Xia et al., 2017c;d), which addresses the problem of translating a source-language sentence $x \in \mathcal{X}$ into a target-language sentence $y \in \mathcal{Y}$, as an example. Such tasks are usually handled by an encoder-decoder framework. We summarize a one-layer LSTM (Hochreiter & Schmidhuber, 1997) based model in Figure 1; other model structures like CNN (Gehring et al., 2017) and self-attention (Vaswani et al., 2017) can be similarly formulated. The encoder $\phi^c_{p,X}$ takes a source sentence $x$ as input and outputs a set of hidden representations $\{h^X_i \in \mathcal{H}\}_{i \in [T_x]}$, where $\mathcal{H}$ is the space of hidden representations and $T_x$ is the length of sentence $x$. Next, the decoder $\phi_{p,Y}$ computes the hidden state $h^Y_j$ by taking the previous $h^Y_{j-1}$, $y_{j-1}$, and the $\{h^X_i\}_{i \in [T_x]}$ from the encoder as inputs. $\phi_{p,Y}$ consists of three parts: $\phi^z_{p,Y}$ is the attention model used to generate $Z^X_j$ (i.e., the contextual information); $\phi^c_{p,Y}$ is used to compute $h^Y_j$; and $\phi^h_{p,Y}$ is used to map $h^Y_j$ to the space $\mathcal{Y}$, which is usually a softmax operator.

Figure 1. An architecture of existing encoder-decoder models. The black square indicates an optional delay of a single time step. Subscripts $p$ and $d$ indicate the primal and dual models respectively; subscripts $X$ and $Y$ denote the two languages. The unfolded version can be found in Appendix A. (All appendices are in the supplementary document due to space limitations.)

The aforementioned processes can be mathematically formulated as follows: for $i \in [T_x]$ and $j \in [T_y]$ ($T_y$ is the length of $y$),
$$
h^X_i = \phi^c_{p,X}(x_i, h^X_{i-1});\quad
Z^X_j = \phi^z_{p,Y}\big(h^Y_{j-1}, \{h^X_i\}_{i=1}^{T_x}\big);\quad
h^Y_j = \phi^c_{p,Y}(y_{j-1}, h^Y_{j-1}, Z^X_j);\quad
y_j = \phi^h_{p,Y}(h^Y_j).
\tag{1}
$$
The dual task can be similarly formulated. In the primal task, $\phi^c_{p,X}$ serves to encode a source-language sentence $x$, without any condition. In the dual task, $\phi^c_{d,X}$ is used to decode a sentence $x$ in language $\mathcal{X}$ (the target language of the dual task), conditioned on $Z^Y$. That is, given two dual tasks, the encoder of the primal task and the decoder of the dual task are highly correlated; they differ only in their conditions. Inspired by representation sharing (e.g., sharing the bottom layers of neural network models for related tasks (Dong et al., 2015)), in this work we propose to share components of the models of two tasks in dual form, e.g., forcing $\phi^c_{p,X} = \phi^c_{d,X}$ and $\phi^c_{p,Y} = \phi^c_{d,Y}$ for neural machine translation, and we call such an approach model-level dual learning.

Strictly symmetric model architectures for a pair of dual tasks are good to have for model-level dual learning, but they are not a must-have. Take sentiment analysis as an example. The primal task is to classify whether a sentence is of positive or negative sentiment. Usually, the input sequence is first encoded into several hidden states by an LSTM and then fed into a few fully-connected layers to obtain the final classification decision. For the dual task, sentence generation given a sentiment label, the label is first encoded into a hidden representation and then decoded into a sequence by an LSTM. In such asymmetric cases, the encoder of the primal task and the decoder of the dual task can still be shared, which can be seen as a degenerate version of our proposed method obtained by setting $\phi^c_{p,X} = \phi^c_{d,X}$ only.
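As a concrete illustration of Eq. (1), the following is a short PyTorch-style sketch (our own, not the paper's code) of one encoder step and one attention-based decoder step; the module names mirror the $\phi$ notation above, and the class name, dimensions, and attention parameterization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EncoderDecoderStep(nn.Module):
    """Minimal sketch of Eq. (1): one-layer LSTM encoder-decoder with attention."""

    def __init__(self, emb_dim=256, hid_dim=512, vocab_y=30000):
        super().__init__()
        self.phi_c_pX = nn.LSTMCell(emb_dim, hid_dim)            # encoder recurrence
        self.phi_c_pY = nn.LSTMCell(emb_dim + hid_dim, hid_dim)  # decoder recurrence
        self.attn_score = nn.Linear(2 * hid_dim, 1)              # phi^z_{p,Y}: attention scorer
        self.phi_h_pY = nn.Linear(hid_dim, vocab_y)              # maps h^Y_j to the vocabulary

    def encode_step(self, x_i_emb, state):
        # h^X_i = phi^c_{p,X}(x_i, h^X_{i-1})
        return self.phi_c_pX(x_i_emb, state)

    def decode_step(self, y_prev_emb, state, enc_outputs):
        h_prev, c_prev = state
        # Z^X_j = phi^z_{p,Y}(h^Y_{j-1}, {h^X_i}): weighted sum of encoder states
        T_x = enc_outputs.size(1)
        query = h_prev.unsqueeze(1).expand(-1, T_x, -1)
        scores = self.attn_score(torch.cat([query, enc_outputs], dim=-1)).squeeze(-1)
        alpha = torch.softmax(scores, dim=-1)                    # attention weights
        Z_x = torch.bmm(alpha.unsqueeze(1), enc_outputs).squeeze(1)
        # h^Y_j = phi^c_{p,Y}(y_{j-1}, h^Y_{j-1}, Z^X_j)
        h_j, c_j = self.phi_c_pY(torch.cat([y_prev_emb, Z_x], dim=-1), (h_prev, c_prev))
        # y_j = phi^h_{p,Y}(h^Y_j): logits over the target vocabulary (softmax applied outside)
        logits = self.phi_h_pY(h_j)
        return logits, (h_j, c_j)
```

Here $\phi^z_{p,Y}$ is written as a simple concatenation-based scorer; any attention mechanism producing a weighted sum of the encoder states fits the same slot.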
Our main contributions can be summarized as follows:

(1) Model Architecture. We re-formulate the current encoder-decoder framework as a combination of two conditional encoders and propose a unified architecture, model-level dual learning, that can handle two dual tasks simultaneously with the same set of parameters. We consider both the symmetric setting and the asymmetric setting, in which the primal and dual models have the same and different architectures, respectively.

(2) Experimental Results. Our model is verified on two different tasks, neural machine translation and sentiment analysis, and achieves promising results: (i) on IWSLT14 German-to-English translation, we improve the BLEU score from 32.85 to 35.19, obtaining a new record (see Table 5); (ii) on WMT14 English→German translation, we improve the BLEU score from 28.4 to 28.9 (see Table 3); (iii) we obtain a series of state-of-the-art results on NIST Chinese-to-English translation (see Table 2); (iv) with supervised data only, we lower the error rate on the IMDB sentiment classification dataset from 9.20% to 6.96% with our proposed framework (see Table 6).

The remaining part of the paper is organized as follows. The framework of model-level dual learning is introduced in Section 2. Sections 3 and 4 show how to apply model-level dual learning to neural machine translation and sentiment analysis. Section 5 discusses the combination of our method with dual inference (Xia et al., 2017a). Section 6 concludes the paper.

2. The Framework

We introduce the general framework of model-level dual learning in this section. Similar to previous work on dual learning (He et al., 2016a; Xia et al., 2017b), we consider two spaces $\mathcal{X}$ and $\mathcal{Y}$ and two tasks in dual form: the primal task aims to learn a mapping $f: \mathcal{X} \mapsto \mathcal{Y}$, and the dual task learns a reverse mapping $g: \mathcal{Y} \mapsto \mathcal{X}$. We consider two scenarios: the symmetric setting and the asymmetric setting. In the symmetric setting, the elements in $\mathcal{X}$ and $\mathcal{Y}$ are of the same format, so it is possible to use the same model architecture for the two mappings. For example, in NMT and Q&A, both $\mathcal{X}$ and $\mathcal{Y}$ consist of natural language sentences, and we can use LSTMs to model both $f$ and $g$. In the asymmetric setting, the objects in $\mathcal{X}$ and $\mathcal{Y}$ are of different formats and semantics, and thus the two mappings have different model architectures. For example, in sentiment analysis, $\mathcal{X}$ is the set of natural language sentences while $\mathcal{Y} = \{0, 1\}$ is the set of sentiment labels. The heterogeneity of $\mathcal{X}$ and $\mathcal{Y}$ forces one to use different model structures for the primal and dual tasks.

2.1. Symmetric Model-Level Dual Learning

In the symmetric setting, the models $f$ and $g$ are made up of two parts: the $\mathcal{X}$ component $\phi_X$ and the $\mathcal{Y}$ component $\phi_Y$. $\phi_X$ acts as both the encoder and the decoder for space $\mathcal{X}$: in the primal model $f$, it encodes a sample in $\mathcal{X}$ into hidden representations; in the dual model $g$, it decodes hidden representations and generates a sample in $\mathcal{X}$. Similarly, $\phi_Y$ acts as both the encoder and the decoder for space $\mathcal{Y}$. One may wonder why one component can act as both the encoder and the decoder, since, compared with the encoder, the decoder of either $\mathcal{X}$ or $\mathcal{Y}$ typically takes additional information from the other space. For example, in neural machine translation, the encoder is a simple LSTM, but the decoder is an LSTM plus an attention module generating context from the source sentence (Bahdanau et al., 2015).
This is easily solved by feeding the encoder a zero vector as such additional context (we name it the null context). The detailed architecture is shown in Figure 2.

Figure 2. Architecture of symmetric model-level dual learning: (a) the $\mathcal{X}$ component and the $\mathcal{Y}$ component; (b) the model for the primal task, with $C^Y = h^X$; (c) the model for the dual task, with $C^X = h^Y$. The black square indicates an optional delay of a single time step. The unfolded version can be found in Appendix B.

In Figure 2(a), the parameter $\phi_X$ of component $\mathcal{X}$ consists of three modules: (1) $\phi^c_X$, used to combine $x$ and $Z^X$ (note that $\phi^c_X$ takes both $Z^X$ and $x$ as inputs); (2) $\phi^z_X$, used to combine $h^X$ and $C^X$; (3) $\phi^h_X$, used to map the hidden states $h^X$ to $x$. Similarly, $\phi_Y$ of component $\mathcal{Y}$ contains $\phi^c_Y$, $\phi^z_Y$ and $\phi^h_Y$. $C^X$ and $C^Y$ are the context information used for decoding $\mathcal{X}$ and $\mathcal{Y}$, and they are both zero vectors when their corresponding components are used for encoding.

Now we show how the models $f$ and $g$ can be composed from $\phi_X$ and $\phi_Y$. We take $f$ as an example: it takes an $x \in \mathcal{X}$ as input and outputs a $y \in \mathcal{Y}$. According to Figure 2(b), the encoder and decoder of $f$ are specified as follows.

(1) The Encoder. Set $C^X$ to the null context, i.e., $C^X = \{0\}$. At step $i \in [T_x]$, where $T_x$ is the length of $x$, preprocess $C^X$ and obtain $Z^X_i$: $Z^X_i = \phi^z_X(h^X_{i-1}, C^X)$. $\phi^z_X$ is a function that sums up the elements in $C^X$ with adaptive weights. Then, calculate the hidden representation $h^X_i = \phi^c_X(x_i, h^X_{i-1}, Z^X_i)$. Eventually, we obtain a set of hidden representations $h^X = \{h^X_i\}_{i=1}^{T_x}$. The module $\phi^h_X$ in component $\mathcal{X}$ is not used while encoding $x \in \mathcal{X}$.

(2) The Decoder. Set $C^Y$ to the hidden representations $h^X$ obtained in the encoding phase. At step $j \in [T_y]$, where $T_y$ is the length of $y$, preprocess $C^Y$ with the information available at step $j$ and obtain $Z^Y_j$: $Z^Y_j = \phi^z_Y(h^Y_{j-1}, C^Y)$. Calculate the hidden representation $h^Y_j = \phi^c_Y(y_{j-1}, h^Y_{j-1}, Z^Y_j)$.
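Putting the pieces together, below is a minimal PyTorch-style sketch (our illustration under stated assumptions, not the authors' released code) of a shared component and of how the primal model $f$ is composed from $\phi_X$ (encoder, fed the null context) and $\phi_Y$ (decoder, with $C^Y = h^X$); the dual model $g$ is obtained by simply swapping the two components. All class names, dimensions, and the concatenation-based attention are assumptions.

```python
import torch
import torch.nn as nn

class Component(nn.Module):
    """One shared component (phi^c, phi^z, phi^h) for a single space (X or Y)."""

    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.phi_c = nn.LSTMCell(emb_dim + hid_dim, hid_dim)  # combines token, prev state, Z
        self.attn = nn.Linear(2 * hid_dim, 1)                 # phi^z: adaptive weights over C
        self.phi_h = nn.Linear(hid_dim, vocab_size)           # maps hidden states back to tokens

    def step(self, tok, state, C):
        """One step. C is the context: encoder states when decoding, a zero vector when encoding."""
        h_prev, c_prev = state
        # Z = phi^z(h_prev, C): sum the elements of C with adaptive (attention) weights
        query = h_prev.unsqueeze(1).expand(-1, C.size(1), -1)
        alpha = torch.softmax(self.attn(torch.cat([query, C], dim=-1)).squeeze(-1), dim=-1)
        Z = torch.bmm(alpha.unsqueeze(1), C).squeeze(1)
        # h = phi^c(token, h_prev, Z)
        h, c = self.phi_c(torch.cat([self.embed(tok), Z], dim=-1), (h_prev, c_prev))
        return h, (h, c)

def primal_forward(phi_X, phi_Y, x, hid_dim=512):
    """Primal model f: encode x with phi_X under the null context, decode with phi_Y.
    The dual model g reuses the very same modules with the roles of phi_X and phi_Y swapped."""
    B, T_x = x.shape
    null_ctx = x.new_zeros(B, 1, hid_dim, dtype=torch.float)        # C^X = {0}
    state = (x.new_zeros(B, hid_dim, dtype=torch.float),
             x.new_zeros(B, hid_dim, dtype=torch.float))
    enc_states = []
    for i in range(T_x):                                            # (1) the encoder
        h, state = phi_X.step(x[:, i], state, null_ctx)
        enc_states.append(h)                                        # phi^h_X is not used here
    C_Y = torch.stack(enc_states, dim=1)                            # C^Y = h^X

    y_prev = x.new_zeros(B)                                         # assume index 0 is <bos>
    state = (x.new_zeros(B, hid_dim, dtype=torch.float),
             x.new_zeros(B, hid_dim, dtype=torch.float))
    h, state = phi_Y.step(y_prev, state, C_Y)                       # (2) one decoder step
    return phi_Y.phi_h(h).argmax(dim=-1)                            # greedy first output token
```

Because $\phi_X$ and $\phi_Y$ are single module instances reused in both directions, training $f$ and $g$ jointly updates one shared set of parameters, which is exactly the tying described above.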