# Dual Supervised Learning

Yingce Xia¹, Tao Qin², Wei Chen², Jiang Bian², Nenghai Yu¹, Tie-Yan Liu²

¹School of Information Science and Technology, University of Science and Technology of China, Hefei, Anhui, China. ²Microsoft Research, Beijing, China. Correspondence to: Tao Qin.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

**Abstract**

Many supervised learning tasks emerge in dual forms, e.g., English-to-French translation vs. French-to-English translation, speech recognition vs. text-to-speech, and image classification vs. image generation. Two dual tasks have intrinsic connections with each other due to the probabilistic correlation between their models. This connection is, however, not effectively utilized today, since people usually train the models of two dual tasks separately and independently. In this work, we propose training the models of two dual tasks simultaneously, and explicitly exploiting the probabilistic correlation between them to regularize the training process. For ease of reference, we call the proposed approach dual supervised learning. We demonstrate that dual supervised learning can improve the practical performance of both tasks, for various applications including machine translation, image processing, and sentiment analysis.

## 1. Introduction

Deep learning brings state-of-the-art results to many artificial intelligence tasks, such as neural machine translation (Wu et al., 2016), image classification (He et al., 2016b;c), image generation (van den Oord et al., 2016b;a), speech recognition (Graves et al., 2013; Amodei et al., 2016), and speech generation/synthesis (Oord et al., 2016). Interestingly, we find that many of the aforementioned AI tasks emerge in dual forms, i.e., the input and output of one task are exactly the output and input of the other task, respectively. Examples include translation from language A to language B vs. translation from language B to A, image classification vs. image generation, and speech recognition vs. speech synthesis.

Even more interestingly (and somewhat surprisingly), this natural duality is largely ignored in the current practice of machine learning. That is, despite the fact that two tasks are dual to each other, people usually train them independently and separately. A question then arises: can we exploit the duality between two tasks so as to achieve better performance for both of them? In this work, we give a positive answer to this question.

To exploit the duality, we formulate a new learning scheme which involves two tasks: a primal task and its dual task. The primal task takes a sample from space $\mathcal{X}$ as input and maps it to space $\mathcal{Y}$, and the dual task takes a sample from space $\mathcal{Y}$ as input and maps it to space $\mathcal{X}$. Using the language of probability, the primal task learns a conditional distribution $P(y|x; \theta_{xy})$ parameterized by $\theta_{xy}$, and the dual task learns a conditional distribution $P(x|y; \theta_{yx})$ parameterized by $\theta_{yx}$, where $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. In the new scheme, the two dual tasks are jointly learned and their structural relationship is exploited to improve the learning effectiveness. We name this new scheme dual supervised learning (briefly, DSL). There could be many different ways of exploiting the duality in DSL; in this paper, we use it as a regularization term to govern the training process.
Since the joint probability $P(x, y)$ can be computed in two equivalent ways, $P(x, y) = P(x)P(y|x) = P(y)P(x|y)$ for any $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, ideally the conditional distributions of the primal and dual tasks should satisfy the following equality:

$$P(x)\,P(y|x; \theta_{xy}) = P(y)\,P(x|y; \theta_{yx}). \tag{1}$$

However, if the two models (conditional distributions) are learned separately by minimizing their own loss functions (as in the current practice of machine learning), there is no guarantee that the above equation will hold. The basic idea of DSL is to jointly learn the two models $\theta_{xy}$ and $\theta_{yx}$ by minimizing their loss functions subject to the constraint of Eqn. (1). By doing so, the intrinsic probabilistic connection between $\theta_{yx}$ and $\theta_{xy}$ is explicitly strengthened, which is supposed to push the learning process in the right direction. To solve the constrained optimization problem of DSL, we convert the constraint of Eqn. (1) into a penalty term by using the method of Lagrange multipliers (Boyd & Vandenberghe, 2004). Note that the penalty term can also be seen as a data-dependent regularization term.
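To make the identity in Eqn. (1) concrete, here is a toy numerical check (our own illustration, not part of the paper): for an arbitrary small joint distribution, both factorizations recover the same joint when the conditionals are exact, and the gap between the two sides is precisely what the penalty term above measures for learned, imperfect models.

```python
# A toy numerical check of Eqn. (1), using an arbitrary 3x2 joint
# distribution P(x, y); the values below are made up for illustration.
import numpy as np

P_joint = np.array([[0.10, 0.20],
                    [0.25, 0.05],
                    [0.15, 0.25]])          # rows index x, columns index y

P_x = P_joint.sum(axis=1, keepdims=True)    # marginal P(x), shape (3, 1)
P_y = P_joint.sum(axis=0, keepdims=True)    # marginal P(y), shape (1, 2)

P_y_given_x = P_joint / P_x                 # exact conditional P(y|x)
P_x_given_y = P_joint / P_y                 # exact conditional P(x|y)

# With exact conditionals both factorizations recover the joint,
# so the duality gap is zero.
assert np.allclose(P_x * P_y_given_x, P_y * P_x_given_y)
assert np.allclose(P_x * P_y_given_x, P_joint)
```

For two separately trained models $P(y|x; \theta_{xy})$ and $P(x|y; \theta_{yx})$, the two sides of the first assertion would generally differ, and it is this gap that DSL penalizes during training.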
To demonstrate the effectiveness of DSL, we apply it to three artificial intelligence applications.¹

¹ In our experiments, we chose the most cited models with either open-source code or enough implementation details, to ensure that we can reproduce the results reported in previous papers. All of our experiments were run on a single Tesla K40m GPU.

**(1) Neural Machine Translation (NMT).** We first apply DSL to NMT, which formulates machine translation as a sequence-to-sequence learning problem, with sentences in the source language as inputs and those in the target language as outputs. The input space and output space of NMT are symmetric, and there is almost no information loss when mapping from $x$ to $y$ or from $y$ to $x$; thus, the symmetric translation tasks in NMT fit well into the scope of DSL. Experimental studies show significant accuracy improvements from applying DSL to NMT: +2.07/0.86 BLEU points for English↔French translation, +1.37/0.12 points for English↔German translation, and +0.74/1.69 points for English↔Chinese.

**(2) Image Processing.** We then apply DSL to image processing, in which the primal task is image classification and the dual task is image generation conditioned on category labels. Both tasks are hot research topics in the deep learning community. We choose ResNet (He et al., 2016b) as our baseline for image classification, and PixelCNN++ (Salimans et al., 2017) as our baseline for image generation. Experimental results show that on CIFAR-10, DSL reduces the error rate of ResNet-110 from 6.43 to 5.40 and obtains a better image generation model with both clearer images and smaller bits per dimension. Note that these primal and dual tasks do not yield a pair of completely symmetric input and output spaces, since there is information loss when mapping from an image to its class label. Our experimental studies therefore reveal that DSL can also work well for dual tasks with information loss.

**(3) Sentiment Analysis.** Finally, we apply DSL to sentiment analysis, in which the primal task is sentiment classification (i.e., predicting the sentiment of a given sentence) and the dual one is sentence generation with a given sentiment polarity. Experiments on the IMDB dataset show that DSL can reduce the error rate of a widely used sentiment classification model by 0.9 point, and can generate sentences with clearer/richer styles of sentiment expression.

All of the above experiments on real artificial intelligence applications demonstrate that DSL can improve the practical performance of both tasks simultaneously.

## 2. Framework

In this section, we formulate the problem of dual supervised learning (DSL), describe an algorithm for DSL, and discuss its connections with existing learning schemes as well as its application scope.

### 2.1. Problem Formulation

To exploit the duality, we formulate a new learning scheme that involves two tasks: a primal task that takes a sample from space $\mathcal{X}$ as input and maps it to space $\mathcal{Y}$, and a dual task that takes a sample from space $\mathcal{Y}$ as input and maps it to space $\mathcal{X}$.

Assume we have $n$ training pairs $\{(x_i, y_i)\}_{i=1}^{n}$ sampled i.i.d. from the space $\mathcal{X} \times \mathcal{Y}$ according to some unknown distribution $P$. Our goal is to reveal the bi-directional relationship between the two inputs $x$ and $y$. To be specific, we perform the following two tasks: (1) the primal learning task aims at finding a function $f: \mathcal{X} \mapsto \mathcal{Y}$ such that the prediction of $f$ for $x$ is similar to its real counterpart $y$; (2) the dual learning task aims at finding a function $g: \mathcal{Y} \mapsto \mathcal{X}$ such that the prediction of $g$ for $y$ is similar to its real counterpart $x$. The dissimilarity is penalized by a loss function. Given any $(x, y)$, let $\ell_1(f(x), y)$ and $\ell_2(g(y), x)$ denote the loss functions for $f$ and $g$ respectively, both of which are mappings from $\mathcal{X} \times \mathcal{Y}$ to $\mathbb{R}$.

A common practice to design $(f, g)$ is maximum likelihood estimation based on the parameterized conditional distributions $P(\cdot\,|x; \theta_{xy})$ and $P(\cdot\,|y; \theta_{yx})$:

$$f(x; \theta_{xy}) \triangleq \arg\max_{y' \in \mathcal{Y}} P(y'|x; \theta_{xy}), \qquad g(y; \theta_{yx}) \triangleq \arg\max_{x' \in \mathcal{X}} P(x'|y; \theta_{yx}),$$

where $\theta_{xy}$ and $\theta_{yx}$ are the parameters to be learned. Under standard supervised learning, the primal model $f$ is learned by minimizing the empirical risk in space $\mathcal{Y}$,

$$\min_{\theta_{xy}} \frac{1}{n}\sum_{i=1}^{n} \ell_1(f(x_i; \theta_{xy}), y_i),$$

and the dual model $g$ is learned by minimizing the empirical risk in space $\mathcal{X}$,

$$\min_{\theta_{yx}} \frac{1}{n}\sum_{i=1}^{n} \ell_2(g(y_i; \theta_{yx}), x_i).$$
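For concreteness, the following is a minimal sketch of this standard, independent training, assuming the primal and dual models are PyTorch modules (the names `primal_model`, `dual_model`, `loader`, and the hyperparameters are placeholders of our own) that return per-example log-likelihoods $\log P(y|x; \theta_{xy})$ and $\log P(x|y; \theta_{yx})$, with $\ell_1$ and $\ell_2$ taken to be negative log-likelihoods as in the MLE setup above.

```python
# Sketch of standard (independent) supervised training of the primal and
# dual models by maximum likelihood. Assumes hypothetical modules that
# return log P(y|x; theta_xy) and log P(x|y; theta_yx) for a batch of pairs.
import torch

def train_independently(primal_model, dual_model, loader, epochs=1, lr=1e-3):
    opt_f = torch.optim.Adam(primal_model.parameters(), lr=lr)
    opt_g = torch.optim.Adam(dual_model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            # Objective for f: minimize the empirical risk (1/m) sum_j l1(f(x_j), y_j).
            loss_f = -primal_model(x, y).mean()   # -log P(y|x; theta_xy)
            opt_f.zero_grad()
            loss_f.backward()
            opt_f.step()

            # Objective for g: minimize the empirical risk (1/m) sum_j l2(g(y_j), x_j).
            loss_g = -dual_model(y, x).mean()     # -log P(x|y; theta_yx)
            opt_g.zero_grad()
            loss_g.backward()
            opt_g.step()
    return primal_model, dual_model
```

DSL keeps these two objectives but couples them through the probabilistic-duality constraint introduced next.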
Given the duality of the primal and dual tasks, if the learned primal and dual models are perfect, we should have

$$P(x)\,P(y|x; \theta_{xy}) = P(y)\,P(x|y; \theta_{yx}) = P(x, y), \quad \forall x, y.$$

We call this property probabilistic duality; it serves as a necessary condition for the optimality of the two learned dual models.

Under the standard supervised learning scheme, probabilistic duality is not considered during training, and the primal and dual models are trained independently and separately. Thus, there is no guarantee that the learned models satisfy probabilistic duality. To tackle this problem, we propose explicitly reinforcing the empirical probabilistic duality of the dual models by solving the following multi-objective optimization problem instead:

$$\begin{aligned}
\text{objective 1:}\quad & \min_{\theta_{xy}} \frac{1}{n}\sum_{i=1}^{n} \ell_1(f(x_i; \theta_{xy}), y_i),\\
\text{objective 2:}\quad & \min_{\theta_{yx}} \frac{1}{n}\sum_{i=1}^{n} \ell_2(g(y_i; \theta_{yx}), x_i),\\
\text{s.t.}\quad & P(x)\,P(y|x; \theta_{xy}) = P(y)\,P(x|y; \theta_{yx}), \quad \forall x, y,
\end{aligned}\tag{2}$$

where $P(x)$ and $P(y)$ are the marginal distributions. We call this new learning scheme dual supervised learning (abbreviated as DSL).

We provide a simple theoretical analysis showing that DSL has theoretical guarantees in terms of a generalization bound. Since the analysis is straightforward, we put it in Appendix A due to space limitations.²

² All the appendices are left in the supplementary document.

### 2.2. Algorithm Description

In practical artificial intelligence applications, the ground-truth marginal distributions are usually not available. As an alternative, we use the empirical marginal distributions $\hat{P}(x)$ and $\hat{P}(y)$ to fulfill the constraint in Eqn. (2).

To solve the DSL problem, following the common practice in constrained optimization, we introduce Lagrange multipliers and add the equality constraint of probabilistic duality into the objective functions. First, we convert the probabilistic duality constraint into the following regularization term (with the empirical marginal distributions included):

$$\ell_{\text{duality}} = \big(\log \hat{P}(x) + \log P(y|x; \theta_{xy}) - \log \hat{P}(y) - \log P(x|y; \theta_{yx})\big)^2. \tag{3}$$

Then, we learn the models of the two tasks by minimizing the weighted combination of the original loss functions and the above regularization term, as shown in Algorithm 1.

**Algorithm 1** Dual Supervised Learning

*Input:* marginal distributions $\hat{P}(x_i)$ and $\hat{P}(y_i)$ for any $i \in [n]$; Lagrange parameters $\lambda_{xy}$ and $\lambda_{yx}$; optimizers Opt1 and Opt2.

*Repeat:*

1. Get a minibatch of $m$ pairs $\{(x_j, y_j)\}_{j=1}^{m}$.

2. Calculate the gradients:
$$G_f = \nabla_{\theta_{xy}} \frac{1}{m}\sum_{j=1}^{m}\Big[\ell_1(f(x_j; \theta_{xy}), y_j) + \lambda_{xy}\,\ell_{\text{duality}}(x_j, y_j; \theta_{xy}, \theta_{yx})\Big],$$
$$G_g = \nabla_{\theta_{yx}} \frac{1}{m}\sum_{j=1}^{m}\Big[\ell_2(g(y_j; \theta_{yx}), x_j) + \lambda_{yx}\,\ell_{\text{duality}}(x_j, y_j; \theta_{xy}, \theta_{yx})\Big].$$

3. Update the parameters of $f$ and $g$:
$$\theta_{xy} \leftarrow \text{Opt1}(\theta_{xy}, G_f), \qquad \theta_{yx} \leftarrow \text{Opt2}(\theta_{yx}, G_g).$$

*Until* the models converge.

In the algorithm, the choice of the optimizers Opt1 and Opt2 is quite flexible: one can choose different optimizers such as Adadelta (Zeiler, 2012), Adam (Kingma & Ba, 2014), or SGD for different tasks, depending on the common practice in the specific task and personal preference.
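The following is one possible PyTorch reading of a single iteration of Algorithm 1, reusing the hypothetical `primal_model`/`dual_model` interface from the previous sketch and assuming the minibatch also carries precomputed empirical marginal log-probabilities `log_px` and `log_py`; the function name `dsl_step` and the default values of $\lambda_{xy}$, $\lambda_{yx}$ are illustrative, not the authors' settings.

```python
# Sketch of one DSL iteration (Algorithm 1). Assumes hypothetical models
# returning per-example log P(y|x; theta_xy) and log P(x|y; theta_yx), and
# precomputed empirical marginal log-probs log_px, log_py for the minibatch.
import torch

def dsl_step(primal_model, dual_model, opt_f, opt_g,
             x, y, log_px, log_py, lam_xy=0.01, lam_yx=0.01):
    log_p_y_given_x = primal_model(x, y)      # shape: (m,)
    log_p_x_given_y = dual_model(y, x)        # shape: (m,)

    # Eqn. (3): squared duality gap in log space, per example.
    def duality_gap(lp_y_given_x, lp_x_given_y):
        return (log_px + lp_y_given_x - log_py - lp_x_given_y) ** 2

    # G_f is a gradient w.r.t. theta_xy only, so detach the dual model's output.
    loss_f = (-log_p_y_given_x
              + lam_xy * duality_gap(log_p_y_given_x, log_p_x_given_y.detach())).mean()
    # G_g is a gradient w.r.t. theta_yx only, so detach the primal model's output.
    loss_g = (-log_p_x_given_y
              + lam_yx * duality_gap(log_p_y_given_x.detach(), log_p_x_given_y)).mean()

    opt_f.zero_grad()
    opt_g.zero_grad()
    (loss_f + loss_g).backward()   # detaching keeps the two gradients separate
    opt_f.step()
    opt_g.step()
    return loss_f.item(), loss_g.item()
```

Wrapping `dsl_step` in a loop over minibatches until convergence mirrors the repeat/until structure of Algorithm 1; detaching the other model's log-probability inside each penalty term keeps $G_f$ a gradient with respect to $\theta_{xy}$ only and $G_g$ a gradient with respect to $\theta_{yx}$ only, as the algorithm specifies.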
### 2.3. Discussions

The duality between tasks has been used to enable learning from unlabeled data in (He et al., 2016a). As an early attempt to exploit the duality, that work actually uses the exterior connection between dual tasks, which helps to form a closed feedback loop and enables unsupervised learning. For example, in the application of machine translation, the primal task/model first translates an unlabeled English sentence $x$ into a French sentence $y'$; then, the dual task/model translates $y'$ back into an English sentence $x'$; finally, both the primal and the dual models are optimized by minimizing the difference between $x$ and $x'$. In contrast, by making use of the intrinsic probabilistic connection between the primal and dual models, DSL makes an innovative attempt to extend the benefit of duality to supervised learning.

While $\ell_{\text{duality}}$ can be regarded as a regularization term, it is data dependent, which makes DSL different from Lasso (Tibshirani, 1996) or SVM (Hearst et al., 1998), where the regularization term is data independent. More precisely, in DSL every training sample contributes to the regularization term, and each model contributes to the regularization of the other model.

DSL also differs from the following three learning schemes: (1) Co-training focuses on single-task learning and assumes that different subsets of features can provide enough and complementary information about the data, while DSL targets learning two tasks with structural duality simultaneously and does not impose any prerequisites or assumptions on features. (2) Multi-task learning requires that different tasks share the same input space and a coherent feature representation, while DSL does not. (3) Transfer learning uses auxiliary tasks to boost the main task, whereas in DSL there is no difference between the roles of the two tasks, and DSL enables them to boost each other's performance simultaneously.

We would like to point out that there are several requirements for applying DSL to a given scenario: (1) duality should exist for the two tasks; (2) both the primal and dual models should be trainable; (3) $\hat{P}(x)$ and $\hat{P}(y)$ in Eqn. (3) should be available. If these conditions are not satisfied, DSL might not work very well. Fortunately, as discussed in this paper, many machine learning tasks related to images, speech, and text satisfy these conditions.

## 3. Application to Machine Translation

We first apply our dual supervised learning algorithm to machine translation and study whether it can improve translation quality by utilizing the probabilistic duality of dual translation tasks. In the rest of this section, we report experiments on three pairs of dual tasks³: English↔French (En↔Fr), English↔German (En↔De), and English↔Chinese (En↔Zh).

### 3.1. Settings

**Datasets.** We employ the same datasets as used in (Jean et al., 2015) to conduct the experiments on En↔Fr and En↔De. As part of WMT'14, the training data consists of 12M sentence pairs for En↔Fr and 4.5M for En↔De, respectively (WMT, 2014). We combine newstest2012 and newstest2013 as the validation sets and use newstest2014 as the test sets. For the dual tasks of En↔Zh, we use 10M sentence pairs obtained from a commercial company as training data. We leverage NIST2006 as the validation set and NIST2008 as well as NIST2012 as the test sets⁴. Note that, during the training of all three pairs of dual tasks, we drop all sentences with more than 50 words.

**Marginal Distributions $\hat{P}(x)$ and $\hat{P}(y)$.** We use the LSTM-based language modeling approach (Sundermeyer et al., 2012; Mikolov et al., 2010) to characterize the marginal distribution of a sentence $x$, defined as $\prod_{i=1}^{T_x} P(x_i \mid x_{<i})$.
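The duality regularizer in Eqn. (3) only needs the scalar $\log \hat{P}(x)$ (and $\log \hat{P}(y)$) for each sentence. Below is a rough PyTorch sketch of an LSTM language model that scores a sentence this way; the class name, hyperparameters, and the `<bos>` convention are our own illustrative choices rather than the paper's implementation.

```python
# Minimal LSTM language model for scoring log P(x) = sum_i log P(x_i | x_<i).
# Hyperparameters and the BOS-token convention are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def log_prob(self, tokens):
        """tokens: (batch, T) token ids; position 0 is assumed to be <bos>."""
        inputs, targets = tokens[:, :-1], tokens[:, 1:]
        hidden, _ = self.lstm(self.embed(inputs))
        log_probs = F.log_softmax(self.proj(hidden), dim=-1)  # (batch, T-1, V)
        # Pick log P(x_i | x_<i) for the observed next token at each position.
        token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        return token_lp.sum(dim=-1)                           # log P-hat(x) per sentence
```

A model of this kind, trained beforehand on monolingual data, could supply the fixed `log_px` and `log_py` values used in the earlier `dsl_step` sketch.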