# Dual Supervised Learning

Yingce Xia¹, Tao Qin², Wei Chen², Jiang Bian², Nenghai Yu¹, Tie-Yan Liu²

¹School of Information Science and Technology, University of Science and Technology of China, Hefei, Anhui, China. ²Microsoft Research, Beijing, China. Correspondence to: Tao Qin.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

**Abstract**

Many supervised learning tasks emerge in dual forms, e.g., English-to-French translation vs. French-to-English translation, speech recognition vs. text-to-speech, and image classification vs. image generation. Two dual tasks have intrinsic connections with each other due to the probabilistic correlation between their models. This connection is, however, not effectively utilized today, since people usually train the models of two dual tasks separately and independently. In this work, we propose training the models of two dual tasks simultaneously, and explicitly exploiting the probabilistic correlation between them to regularize the training process. For ease of reference, we call the proposed approach dual supervised learning. We demonstrate that dual supervised learning can improve the practical performance of both tasks, for various applications including machine translation, image processing, and sentiment analysis.

## 1. Introduction

Deep learning brings state-of-the-art results to many artificial intelligence tasks, such as neural machine translation (Wu et al., 2016), image classification (He et al., 2016b;c), image generation (van den Oord et al., 2016b;a), speech recognition (Graves et al., 2013; Amodei et al., 2016), and speech generation/synthesis (Oord et al., 2016). Interestingly, we find that many of the aforementioned AI tasks emerge in dual forms, i.e., the input and output of one task are exactly the output and input of the other task, respectively. Examples include translation from language A to language B vs. translation from language B to A, image classification vs. image generation, and speech recognition vs. speech synthesis.

Even more interestingly (and somewhat surprisingly), this natural duality is largely ignored in the current practice of machine learning. That is, despite the fact that two tasks are dual to each other, people usually train them independently and separately. A question then arises: can we exploit the duality between two tasks so as to achieve better performance for both of them? In this work, we give a positive answer to this question.

To exploit the duality, we formulate a new learning scheme which involves two tasks: a primal task and its dual task. The primal task takes a sample from space $\mathcal{X}$ as input and maps it to space $\mathcal{Y}$, and the dual task takes a sample from space $\mathcal{Y}$ as input and maps it to space $\mathcal{X}$. Using the language of probability, the primal task learns a conditional distribution $P(y|x; \theta_{xy})$ parameterized by $\theta_{xy}$, and the dual task learns a conditional distribution $P(x|y; \theta_{yx})$ parameterized by $\theta_{yx}$, where $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. In the new scheme, the two dual tasks are jointly learned and their structural relationship is exploited to improve the learning effectiveness. We name this new scheme dual supervised learning (briefly, DSL). There could be many different ways of exploiting the duality in DSL; in this paper, we use it as a regularization term to govern the training process.
Since the joint probability $P(x, y)$ can be computed in two equivalent ways, $P(x, y) = P(x)P(y|x) = P(y)P(x|y)$ for any $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, ideally the conditional distributions of the primal and dual tasks should satisfy the following equality:

$$P(x)\,P(y|x; \theta_{xy}) = P(y)\,P(x|y; \theta_{yx}). \tag{1}$$

However, if the two models (conditional distributions) are learned separately by minimizing their own loss functions (as in the current practice of machine learning), there is no guarantee that the above equation will hold. The basic idea of DSL is to jointly learn the two models $\theta_{xy}$ and $\theta_{yx}$ by minimizing their loss functions subject to the constraint of Eqn. (1). By doing so, the intrinsic probabilistic connection between $\theta_{yx}$ and $\theta_{xy}$ is explicitly strengthened, which is supposed to push the learning process in the right direction. To solve the constrained optimization problem of DSL, we convert the constraint of Eqn. (1) into a penalty term by using the method of Lagrange multipliers (Boyd & Vandenberghe, 2004). Note that the penalty term can also be seen as a data-dependent regularization term.
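To make the identity in Eqn. (1) concrete, here is a toy numerical check (our own illustration, not part of the paper): for an arbitrary small joint distribution, both factorizations recover the same joint when the conditionals are exact, and the gap between the two sides is precisely what the penalty term above measures for learned, imperfect models.

```python
# A toy numerical check of Eqn. (1), using an arbitrary 3x2 joint
# distribution P(x, y); the values below are made up for illustration.
import numpy as np

P_joint = np.array([[0.10, 0.20],
                    [0.25, 0.05],
                    [0.15, 0.25]])          # rows index x, columns index y

P_x = P_joint.sum(axis=1, keepdims=True)    # marginal P(x), shape (3, 1)
P_y = P_joint.sum(axis=0, keepdims=True)    # marginal P(y), shape (1, 2)

P_y_given_x = P_joint / P_x                 # exact conditional P(y|x)
P_x_given_y = P_joint / P_y                 # exact conditional P(x|y)

# With exact conditionals both factorizations recover the joint,
# so the duality gap is zero.
assert np.allclose(P_x * P_y_given_x, P_y * P_x_given_y)
assert np.allclose(P_x * P_y_given_x, P_joint)
```

For two separately trained models $P(y|x; \theta_{xy})$ and $P(x|y; \theta_{yx})$, the two sides of the first assertion would generally differ, and it is this gap that DSL penalizes during training.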
To demonstrate the effectiveness of DSL, we apply it to three artificial intelligence applications.¹

¹ In our experiments, we chose the most cited models with either open-source code or enough implementation details, to ensure that we can reproduce the results reported in previous papers. All of our experiments were run on a single Tesla K40m GPU.

**(1) Neural Machine Translation (NMT).** We first apply DSL to NMT, which formulates machine translation as a sequence-to-sequence learning problem, with sentences in the source language as inputs and those in the target language as outputs. The input space and output space of NMT are symmetric, and there is almost no information loss when mapping from $x$ to $y$ or from $y$ to $x$; thus, the symmetric translation tasks in NMT fit well into the scope of DSL. Experimental studies show significant accuracy improvements from applying DSL to NMT: +2.07/0.86 BLEU points for English↔French translation, +1.37/0.12 points for English↔German translation, and +0.74/1.69 points for English↔Chinese.

**(2) Image Processing.** We then apply DSL to image processing, in which the primal task is image classification and the dual task is image generation conditioned on category labels. Both tasks are hot research topics in the deep learning community. We choose ResNet (He et al., 2016b) as our baseline for image classification, and PixelCNN++ (Salimans et al., 2017) as our baseline for image generation. Experimental results show that on CIFAR-10, DSL reduces the error rate of ResNet-110 from 6.43 to 5.40 and obtains a better image generation model with both clearer images and smaller bits per dimension. Note that these primal and dual tasks do not yield a pair of completely symmetric input and output spaces, since there is information loss when mapping from an image to its class label. Our experimental studies therefore reveal that DSL can also work well for dual tasks with information loss.

**(3) Sentiment Analysis.** Finally, we apply DSL to sentiment analysis, in which the primal task is sentiment classification (i.e., predicting the sentiment of a given sentence) and the dual one is sentence generation with a given sentiment polarity. Experiments on the IMDB dataset show that DSL can reduce the error rate of a widely used sentiment classification model by 0.9 point, and can generate sentences with clearer/richer styles of sentiment expression.

All of the above experiments on real artificial intelligence applications demonstrate that DSL can improve the practical performance of both tasks simultaneously.

## 2. Framework

In this section, we formulate the problem of dual supervised learning (DSL), describe an algorithm for DSL, and discuss its connections with existing learning schemes as well as its application scope.

### 2.1. Problem Formulation

To exploit the duality, we formulate a new learning scheme that involves two tasks: a primal task that takes a sample from space $\mathcal{X}$ as input and maps it to space $\mathcal{Y}$, and a dual task that takes a sample from space $\mathcal{Y}$ as input and maps it to space $\mathcal{X}$.

Assume we have $n$ training pairs $\{(x_i, y_i)\}_{i=1}^{n}$ sampled i.i.d. from the space $\mathcal{X} \times \mathcal{Y}$ according to some unknown distribution $P$. Our goal is to reveal the bi-directional relationship between the two inputs $x$ and $y$. To be specific, we perform the following two tasks: (1) the primal learning task aims at finding a function $f: \mathcal{X} \mapsto \mathcal{Y}$ such that the prediction of $f$ for $x$ is similar to its real counterpart $y$; (2) the dual learning task aims at finding a function $g: \mathcal{Y} \mapsto \mathcal{X}$ such that the prediction of $g$ for $y$ is similar to its real counterpart $x$. The dissimilarity is penalized by a loss function. Given any $(x, y)$, let $\ell_1(f(x), y)$ and $\ell_2(g(y), x)$ denote the loss functions for $f$ and $g$ respectively, both of which are mappings from $\mathcal{X} \times \mathcal{Y}$ to $\mathbb{R}$.

A common practice to design $(f, g)$ is maximum likelihood estimation based on the parameterized conditional distributions $P(\cdot\,|x; \theta_{xy})$ and $P(\cdot\,|y; \theta_{yx})$:

$$f(x; \theta_{xy}) \triangleq \arg\max_{y' \in \mathcal{Y}} P(y'|x; \theta_{xy}), \qquad g(y; \theta_{yx}) \triangleq \arg\max_{x' \in \mathcal{X}} P(x'|y; \theta_{yx}),$$

where $\theta_{xy}$ and $\theta_{yx}$ are the parameters to be learned. Under standard supervised learning, the primal model $f$ is learned by minimizing the empirical risk in space $\mathcal{Y}$,

$$\min_{\theta_{xy}} \frac{1}{n}\sum_{i=1}^{n} \ell_1(f(x_i; \theta_{xy}), y_i),$$

and the dual model $g$ is learned by minimizing the empirical risk in space $\mathcal{X}$,

$$\min_{\theta_{yx}} \frac{1}{n}\sum_{i=1}^{n} \ell_2(g(y_i; \theta_{yx}), x_i).$$
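For concreteness, the following is a minimal sketch of this standard, independent training, assuming the primal and dual models are PyTorch modules (the names `primal_model`, `dual_model`, `loader`, and the hyperparameters are placeholders of our own) that return per-example log-likelihoods $\log P(y|x; \theta_{xy})$ and $\log P(x|y; \theta_{yx})$, with $\ell_1$ and $\ell_2$ taken to be negative log-likelihoods as in the MLE setup above.

```python
# Sketch of standard (independent) supervised training of the primal and
# dual models by maximum likelihood. Assumes hypothetical modules that
# return log P(y|x; theta_xy) and log P(x|y; theta_yx) for a batch of pairs.
import torch

def train_independently(primal_model, dual_model, loader, epochs=1, lr=1e-3):
    opt_f = torch.optim.Adam(primal_model.parameters(), lr=lr)
    opt_g = torch.optim.Adam(dual_model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            # Objective for f: minimize the empirical risk (1/m) sum_j l1(f(x_j), y_j).
            loss_f = -primal_model(x, y).mean()   # -log P(y|x; theta_xy)
            opt_f.zero_grad()
            loss_f.backward()
            opt_f.step()

            # Objective for g: minimize the empirical risk (1/m) sum_j l2(g(y_j), x_j).
            loss_g = -dual_model(y, x).mean()     # -log P(x|y; theta_yx)
            opt_g.zero_grad()
            loss_g.backward()
            opt_g.step()
    return primal_model, dual_model
```

DSL keeps these two objectives but couples them through the probabilistic-duality constraint introduced next.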
Given the duality of the primal and dual tasks, if the learned primal and dual models are perfect, we should have

$$P(x)\,P(y|x; \theta_{xy}) = P(y)\,P(x|y; \theta_{yx}) = P(x, y), \quad \forall x, y.$$

We call this property probabilistic duality; it serves as a necessary condition for the optimality of the two learned dual models.

Under the standard supervised learning scheme, probabilistic duality is not considered during training, and the primal and dual models are trained independently and separately. Thus, there is no guarantee that the learned models satisfy probabilistic duality. To tackle this problem, we propose explicitly reinforcing the empirical probabilistic duality of the dual models by solving the following multi-objective optimization problem instead:

$$\begin{aligned}
\text{objective 1:}\quad & \min_{\theta_{xy}} \frac{1}{n}\sum_{i=1}^{n} \ell_1(f(x_i; \theta_{xy}), y_i),\\
\text{objective 2:}\quad & \min_{\theta_{yx}} \frac{1}{n}\sum_{i=1}^{n} \ell_2(g(y_i; \theta_{yx}), x_i),\\
\text{s.t.}\quad & P(x)\,P(y|x; \theta_{xy}) = P(y)\,P(x|y; \theta_{yx}), \quad \forall x, y,
\end{aligned}\tag{2}$$

where $P(x)$ and $P(y)$ are the marginal distributions. We call this new learning scheme dual supervised learning (abbreviated as DSL).

We provide a simple theoretical analysis showing that DSL has theoretical guarantees in terms of a generalization bound. Since the analysis is straightforward, we put it in Appendix A due to space limitations.²

² All the appendices are left in the supplementary document.

### 2.2. Algorithm Description

In practical artificial intelligence applications, the ground-truth marginal distributions are usually not available. As an alternative, we use the empirical marginal distributions $\hat{P}(x)$ and $\hat{P}(y)$ to fulfill the constraint in Eqn. (2).

To solve the DSL problem, following the common practice in constrained optimization, we introduce Lagrange multipliers and add the equality constraint of probabilistic duality into the objective functions. First, we convert the probabilistic duality constraint into the following regularization term (with the empirical marginal distributions included):

$$\ell_{\text{duality}} = \big(\log \hat{P}(x) + \log P(y|x; \theta_{xy}) - \log \hat{P}(y) - \log P(x|y; \theta_{yx})\big)^2. \tag{3}$$

Then, we learn the models of the two tasks by minimizing the weighted combination of the original loss functions and the above regularization term, as shown in Algorithm 1.

**Algorithm 1** Dual Supervised Learning

*Input:* marginal distributions $\hat{P}(x_i)$ and $\hat{P}(y_i)$ for any $i \in [n]$; Lagrange parameters $\lambda_{xy}$ and $\lambda_{yx}$; optimizers Opt1 and Opt2.

*Repeat:*

1. Get a minibatch of $m$ pairs $\{(x_j, y_j)\}_{j=1}^{m}$.

2. Calculate the gradients:
$$G_f = \nabla_{\theta_{xy}} \frac{1}{m}\sum_{j=1}^{m}\Big[\ell_1(f(x_j; \theta_{xy}), y_j) + \lambda_{xy}\,\ell_{\text{duality}}(x_j, y_j; \theta_{xy}, \theta_{yx})\Big],$$
$$G_g = \nabla_{\theta_{yx}} \frac{1}{m}\sum_{j=1}^{m}\Big[\ell_2(g(y_j; \theta_{yx}), x_j) + \lambda_{yx}\,\ell_{\text{duality}}(x_j, y_j; \theta_{xy}, \theta_{yx})\Big].$$

3. Update the parameters of $f$ and $g$:
$$\theta_{xy} \leftarrow \text{Opt1}(\theta_{xy}, G_f), \qquad \theta_{yx} \leftarrow \text{Opt2}(\theta_{yx}, G_g).$$

*Until* the models converge.

In the algorithm, the choice of the optimizers Opt1 and Opt2 is quite flexible: one can choose different optimizers such as Adadelta (Zeiler, 2012), Adam (Kingma & Ba, 2014), or SGD for different tasks, depending on the common practice in the specific task and personal preference.
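The following is one possible PyTorch reading of a single iteration of Algorithm 1, reusing the hypothetical `primal_model`/`dual_model` interface from the previous sketch and assuming the minibatch also carries precomputed empirical marginal log-probabilities `log_px` and `log_py`; the function name `dsl_step` and the default values of $\lambda_{xy}$, $\lambda_{yx}$ are illustrative, not the authors' settings.

```python
# Sketch of one DSL iteration (Algorithm 1). Assumes hypothetical models
# returning per-example log P(y|x; theta_xy) and log P(x|y; theta_yx), and
# precomputed empirical marginal log-probs log_px, log_py for the minibatch.
import torch

def dsl_step(primal_model, dual_model, opt_f, opt_g,
             x, y, log_px, log_py, lam_xy=0.01, lam_yx=0.01):
    log_p_y_given_x = primal_model(x, y)      # shape: (m,)
    log_p_x_given_y = dual_model(y, x)        # shape: (m,)

    # Eqn. (3): squared duality gap in log space, per example.
    def duality_gap(lp_y_given_x, lp_x_given_y):
        return (log_px + lp_y_given_x - log_py - lp_x_given_y) ** 2

    # G_f is a gradient w.r.t. theta_xy only, so detach the dual model's output.
    loss_f = (-log_p_y_given_x
              + lam_xy * duality_gap(log_p_y_given_x, log_p_x_given_y.detach())).mean()
    # G_g is a gradient w.r.t. theta_yx only, so detach the primal model's output.
    loss_g = (-log_p_x_given_y
              + lam_yx * duality_gap(log_p_y_given_x.detach(), log_p_x_given_y)).mean()

    opt_f.zero_grad()
    opt_g.zero_grad()
    (loss_f + loss_g).backward()   # detaching keeps the two gradients separate
    opt_f.step()
    opt_g.step()
    return loss_f.item(), loss_g.item()
```

Wrapping `dsl_step` in a loop over minibatches until convergence mirrors the repeat/until structure of Algorithm 1; detaching the other model's log-probability inside each penalty term keeps $G_f$ a gradient with respect to $\theta_{xy}$ only and $G_g$ a gradient with respect to $\theta_{yx}$ only, as the algorithm specifies.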
### 2.3. Discussions

The duality between tasks has been used to enable learning from unlabeled data in (He et al., 2016a). As an early attempt to exploit the duality, that work actually uses the exterior connection between dual tasks, which helps to form a closed feedback loop and enables unsupervised learning. For example, in the application of machine translation, the primal task/model first translates an unlabeled English sentence $x$ into a French sentence $y'$; then, the dual task/model translates $y'$ back into an English sentence $x'$; finally, both the primal and the dual models are optimized by minimizing the difference between $x$ and $x'$. In contrast, by making use of the intrinsic probabilistic connection between the primal and dual models, DSL makes an innovative attempt to extend the benefit of duality to supervised learning.

While $\ell_{\text{duality}}$ can be regarded as a regularization term, it is data dependent, which makes DSL different from Lasso (Tibshirani, 1996) or SVM (Hearst et al., 1998), where the regularization term is data independent. More precisely, in DSL every training sample contributes to the regularization term, and each model contributes to the regularization of the other model.

DSL also differs from the following three learning schemes: (1) Co-training focuses on single-task learning and assumes that different subsets of features can provide enough and complementary information about the data, while DSL targets learning two tasks with structural duality simultaneously and does not impose any prerequisites or assumptions on features. (2) Multi-task learning requires that different tasks share the same input space and a coherent feature representation, while DSL does not. (3) Transfer learning uses auxiliary tasks to boost the main task, whereas in DSL there is no difference between the roles of the two tasks, and DSL enables them to boost each other's performance simultaneously.

We would like to point out that there are several requirements for applying DSL to a given scenario: (1) duality should exist for the two tasks; (2) both the primal and dual models should be trainable; (3) $\hat{P}(x)$ and $\hat{P}(y)$ in Eqn. (3) should be available. If these conditions are not satisfied, DSL might not work very well. Fortunately, as discussed in this paper, many machine learning tasks related to images, speech, and text satisfy these conditions.

## 3. Application to Machine Translation

We first apply our dual supervised learning algorithm to machine translation and study whether it can improve translation quality by utilizing the probabilistic duality of dual translation tasks. In the rest of this section, we report experiments on three pairs of dual tasks³: English↔French (En↔Fr), English↔German (En↔De), and English↔Chinese (En↔Zh).

### 3.1. Settings

**Datasets.** We employ the same datasets as used in (Jean et al., 2015) to conduct the experiments on En↔Fr and En↔De. As part of WMT'14, the training data consists of 12M sentence pairs for En↔Fr and 4.5M for En↔De, respectively (WMT, 2014). We combine newstest2012 and newstest2013 as the validation sets and use newstest2014 as the test sets. For the dual tasks of En↔Zh, we use 10M sentence pairs obtained from a commercial company as training data. We leverage NIST2006 as the validation set and NIST2008 as well as NIST2012 as the test sets⁴. Note that, during the training of all three pairs of dual tasks, we drop all sentences with more than 50 words.

**Marginal Distributions $\hat{P}(x)$ and $\hat{P}(y)$.** We use the LSTM-based language modeling approach (Sundermeyer et al., 2012; Mikolov et al., 2010) to characterize the marginal distribution of a sentence $x$, defined as $\prod_{i=1}^{T_x} P(x_i \mid x_{<i})$.
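The duality regularizer in Eqn. (3) only needs the scalar $\log \hat{P}(x)$ (and $\log \hat{P}(y)$) for each sentence. Below is a rough PyTorch sketch of an LSTM language model that scores a sentence this way; the class name, hyperparameters, and the `<bos>` convention are our own illustrative choices rather than the paper's implementation.

```python
# Minimal LSTM language model for scoring log P(x) = sum_i log P(x_i | x_<i).
# Hyperparameters and the BOS-token convention are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def log_prob(self, tokens):
        """tokens: (batch, T) token ids; position 0 is assumed to be <bos>."""
        inputs, targets = tokens[:, :-1], tokens[:, 1:]
        hidden, _ = self.lstm(self.embed(inputs))
        log_probs = F.log_softmax(self.proj(hidden), dim=-1)  # (batch, T-1, V)
        # Pick log P(x_i | x_<i) for the observed next token at each position.
        token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        return token_lp.sum(dim=-1)                           # log P-hat(x) per sentence
```

A model of this kind, trained beforehand on monolingual data, could supply the fixed `log_px` and `log_py` values used in the earlier `dsl_step` sketch.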