# Deliberation Learning for Image-to-Image Translation

Tianyu He¹, Yingce Xia², Jianxin Lin¹, Xu Tan², Di He³, Tao Qin² and Zhibo Chen¹

¹CAS Key Laboratory of Technology in Geo-spatial Information Processing and Application System, University of Science and Technology of China
²Microsoft Research Asia
³Key Laboratory of Machine Perception, MOE, School of EECS, Peking University

{hetianyu, linjx}@mail.ustc.edu.cn, {Yingce.Xia, xuta}@microsoft.com, di he@pku.edu.cn, taoqin@microsoft.com, chenzhibo@ustc.edu.cn

Equal contribution. Corresponding author.

Image-to-image translation, which transfers an image from a source domain to a target one, has attracted much attention in both academia and industry. The major approach is to adopt an encoder-decoder based framework, where the encoder extracts features from the input image and the decoder then decodes the features and generates an image in the target domain as the output. In this paper, we go beyond this learning framework by adding a polishing step on the output image. Polishing an image is very common in daily life, such as editing and beautifying a photo in Photoshop after taking or generating it with a digital camera. Such a deliberation process has proven very helpful and important in practice, and we believe it will also be helpful for image translation. Inspired by the success of deliberation networks in natural language processing, we extend the deliberation process to image translation. We verify our proposed method on four two-domain translation tasks and one multi-domain translation task. Both the qualitative and quantitative results demonstrate the effectiveness of our method.

## 1 Introduction

Unsupervised image-to-image translation is an important application task in computer vision [Zhu et al., 2017; Choi et al., 2018]. The encoder-decoder framework is widely used to achieve such translation, where the encoder maps the image to a latent representation and the decoder translates it to the target domain. Considering that labeled data is costly to obtain, unsupervised image-to-image translation, which tries to uncover the mapping without paired images in the two domains, is widely adopted. CycleGAN [Zhu et al., 2017] is one of the most cited models that achieves unsupervised translation between two domains. Denote by X_s and X_t two image domains of interest, for which no labeled image pair in X_s × X_t is available.

Figure 1: Examples of Winter → Summer translation (top row) and Summer → Winter translation (bottom row), where the three columns are the input, the output of standard CycleGAN, and the output after deliberation.

As the name suggests, the two ingredients that make CycleGAN successful are: (1) minimizing the reconstruction losses of the two mappings X_s → X_t → X_s and X_t → X_s → X_t, which ensures the main content of an image is kept; (2) the adversarial network [Goodfellow et al., 2014], which verifies whether an output image belongs to a specific domain. DualGAN [Yi et al., 2017] and DiscoGAN [Kim et al., 2017] adopt similar ideas. Another important model, StarGAN [Choi et al., 2018], is designed for multi-domain image-to-image translation with one encoder and one decoder. The motivation of StarGAN is to reduce parameters by using a single model for multi-domain translation instead of multiple independent models.
Although the aforementioned techniques have achieved great success, they lack an obvious step compared with human behavior: deliberation, i.e., reviewing and continuing to polish the output. For example, to create a painting, an artist first sketches the outline to get an overall impression of the image, and then gradually enriches details and textures. Such deliberation broadly exists in human behavior, but is still missing from the unsupervised image-to-image translation literature.

Deliberation is important for image translation, as illustrated in Figure 1. For the Winter → Summer translation (first row), there is still some snow in the direct output of the decoder (middle). Similarly, for the Summer → Winter translation (second row), we can still find green trees in the output (middle). This shows the necessity of deliberation in image translation.

To leverage such an important property, we propose deliberation learning for image-to-image translation, which further polishes and improves the output of a standard model. Take the translation from X_s to X_t as an example. Our deliberation learning framework consists of an encoder, a decoder and a post-editor: the encoder and decoder are the same as those in CycleGAN or StarGAN, encoding the image into a hidden vector and decoding an image conditioned on the hidden vector; they work together to translate x ∈ X_s to ŷ ∈ X_t. The post-editor then outputs another y′ ∈ X_t, taking both x and ŷ as inputs. Compared with the standard encoder-decoder framework, the post-editor can form an overall impression of ŷ, the mapping of x in the target domain, and keep refining it, which the standard framework cannot do. As shown in Figure 1, after using the post-editor, the snow is erased in the Winter → Summer translation and the green parts are covered with white frost in the Summer → Winter translation.

## 2 Related Work

The generative adversarial network [Goodfellow et al., 2014] (briefly, GAN) has enabled significant progress in unsupervised image-to-image translation [Zhu et al., 2017; Choi et al., 2018]. There are two essential elements in a GAN: a generator, used to map random noise to an image, and a discriminator, used to verify whether the input is a natural image or a fake image produced by the generator. Unsupervised image-to-image translation, i.e., mapping an image from one domain to another, is an important application of GANs and has received considerable attention recently. According to the number of domains involved in the translation, related work can be categorized into two-domain translation and multi-domain translation.

### 2.1 Two-Domain Translation

Let X_s and X_t denote two different image domains. The target is to learn a mapping f : X_s → X_t. A common technique adopted in two-domain translation is the conditional GAN, which has made much progress recently [Mirza and Osindero, 2014; Isola et al., 2017]. In these frameworks, the input image is compressed into a hidden vector via a series of convolutional layers and then converted to the target domain by several transposed convolutional layers. The generated images are fed into the discriminator to ensure generation quality. Additional inputs such as random noise [Isola et al., 2017] and text [Reed et al., 2016] can also be included.
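To make this encoder-decoder pattern concrete, here is a minimal, illustrative PyTorch sketch of such a generator: strided convolutions compress the input image into a hidden representation, and transposed convolutions expand it back into an image in the target domain. The class name, layer counts and channel widths are assumptions for illustration, not the architecture of any cited model.

```python
import torch
import torch.nn as nn

# Minimal sketch of a conditional-GAN-style generator: convolutions compress the input
# image into a hidden representation, transposed convolutions map it back to image space.
# Layer counts and channel widths are illustrative, not those of any cited model.
class ToyGenerator(nn.Module):
    def __init__(self, in_ch=3, out_ch=3, base=64):
        super().__init__()
        self.encoder = nn.Sequential(                     # downsampling path
            nn.Conv2d(in_ch, base, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(                     # upsampling path
            nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base, out_ch, 4, stride=2, padding=1),
            nn.Tanh(),                                    # images scaled to [-1, 1]
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# A discriminator would then score the generator outputs to provide the adversarial signal.
```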
Inspired by the success of dual learning in neural machine translation [He et al., 2016; Wang et al., 2019], learning the two dual mappings f : X_s → X_t and g : X_t → X_s together has been introduced to image translation [Kim et al., 2017; Zhu et al., 2017; Yi et al., 2017; Lin et al., 2018; Mejjati et al., 2018]. CycleGAN [Zhu et al., 2017] is one of the most cited works built on this idea: given an x ∈ X_s, it is first mapped to ŷ by f(x) and then mapped back to x̂ by g(ŷ). The cycle-consistency loss ‖x − x̂‖₁ minimizes the distance between x and x̂, allowing f and g to obtain feedback signals from each other. Similar ideas also exist in DualGAN [Yi et al., 2017] and DiscoGAN [Kim et al., 2017].

### 2.2 Multi-Domain Translation

Strong performance has been achieved in two-domain translation. However, when applied to multi-domain translation, training a separate model for each pair of domains incurs much more resource consumption. To alleviate this scalability limitation, several works extend two-domain translation to multi-domain translation by learning the relationships among multiple domains in an unsupervised fashion [Choi et al., 2018; Liu et al., 2018; Anoosheh et al., 2018; Pumarola et al., 2018]. Instead of employing a generator and a discriminator for each domain, StarGAN [Choi et al., 2018] learns all mappings with only one generator and one discriminator.

Although deliberation learning is not widely studied in image generation, it has been used in many natural language processing tasks, including neural machine translation [Xia et al., 2017], grammar correction [Ge et al., 2018] and review generation [Guu et al., 2018]. It is beneficial to introduce this idea into image generation tasks.

## 3 Framework

In this section, we introduce the framework of deliberation learning for image translation, covering both two-domain translation (based on CycleGAN) and multi-domain translation (based on StarGAN).

Figure 2: Translation module of X_s → X_t, where x ∈ X_s and ŷ, y′ ∈ X_t. E_s, G_t and G′_t denote the source-domain encoder, the decoder and the post-editor in the target domain.

### 3.1 Two-Domain Translation

Given two image domains X_s and X_t, the architecture of our proposed deliberation network for X_s → X_t translation is shown in Figure 2. There are three components: an encoder E_s, a decoder G_t,¹ and a post-editor G′_t. The three components work together to achieve the translation, as shown in Eqn. (1): for any x ∈ X_s,

$$h_x = E_s(x); \quad \hat{y} = G_t(h_x); \quad h_{\hat{y}} = E_s(\hat{y}); \quad y' = G'_t(h_x + h_{\hat{y}}). \qquad (1)$$

y′ is used as the output of the translation. Note that ŷ is the output of conventional models like CycleGAN and DualGAN, without deliberation. To generate y′, the information from both the raw input x and the first-round output ŷ is leveraged, carried by h_x and h_ŷ, leading to better translation results.

The reason we use one encoder E_s to encode both x and ŷ is that, in standard CycleGAN, there is an identity mapping loss ‖G_t(E_s(y)) − y‖₁, y ∈ X_t; that is, an encoder can naturally encode images from the target domain. To reduce memory cost, we reuse the encoder. We found that using two encoders does not bring much additional gain; see Section 4.3. Following common practice in image translation, we apply the cycle consistency loss between the translations of the two domains as well as the adversarial loss.

¹ We use G instead of D to represent the decoder, in order to avoid confusion with the discriminator D.
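The forward pass of Eqn. (1) can be sketched in a few lines of PyTorch. The function and argument names below (enc_s, dec_t, post_edit_t) are hypothetical stand-ins for E_s, G_t and G′_t; this is only a sketch of the data flow, not the authors' released implementation.

```python
import torch
import torch.nn as nn

# Sketch of the deliberation forward pass in Eqn. (1). enc_s, dec_t and post_edit_t are
# assumed to be nn.Modules standing in for E_s, G_t and G'_t respectively.
def deliberate_translate(x: torch.Tensor,
                         enc_s: nn.Module,        # E_s: shared encoder
                         dec_t: nn.Module,        # G_t: first-pass decoder
                         post_edit_t: nn.Module   # G'_t: post-editor
                         ) -> torch.Tensor:
    h_x = enc_s(x)                         # encode the source image
    y_hat = dec_t(h_x)                     # first-pass output, as in plain CycleGAN
    h_y_hat = enc_s(y_hat)                 # re-encode the first-pass output with the same encoder
    y_prime = post_edit_t(h_x + h_y_hat)   # polish using both feature maps (element-wise sum)
    return y_prime
```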
To achieve the X_t → X_s translation, another group of modules, an encoder E_t, a decoder G_s and a post-editor G′_s, is needed, working as follows: for any y ∈ X_t,

$$h_y = E_t(y); \quad \hat{x} = G_s(h_y); \quad h_{\hat{x}} = E_t(\hat{x}); \quad x' = G'_s(h_y + h_{\hat{x}}). \qquad (2)$$

For ease of reference, let f′ denote the X_s → X_t translation of Eqn. (1) and g′ the X_t → X_s translation of Eqn. (2). Note that G_t ∘ E_s and G_s ∘ E_t can also achieve the above two translations, where ∘ denotes the composition of functions; denote G_t ∘ E_s as f and G_s ∘ E_t as g, respectively. To stabilize training, E_s, E_t, G_s and G_t are pre-trained following standard CycleGAN until convergence, and only G′_s and G′_t are updated. An empirical study is shown in Section 4.3. With a slight abuse of notation, in the following, X_s and X_t also refer to the training sets of images in the source and target domains.

**Cycle consistency loss.** Following common practice in image translation, the cycle consistency loss is defined as

$$\ell_{\mathrm{cyc}} = \frac{1}{|X_s|} \sum_{x \in X_s} \|g'(f'(x)) - x\|_1 + \frac{1}{|X_t|} \sum_{y \in X_t} \|f'(g'(y)) - y\|_1, \qquad (3)$$

where |X_s| refers to the number of images in X_s, and likewise for |X_t|. f′ and g′ jointly work to minimize this reconstruction loss, which is exactly the loss used in standard CycleGAN.

**Adversarial loss.** To force the generated images to belong to the corresponding domains, we use the adversarial loss. Define D_s and D_t as the discriminators of domains X_s and X_t, which map an input image to [0, 1], the probability that the input is a natural image of the corresponding domain. The adversarial loss is

$$\ell_{\mathrm{adv}} = \frac{1}{2|X_s|} \sum_{x \in X_s} \big(\log D_s(x) + \log(1 - D_t(f'(x)))\big) + \frac{1}{2|X_t|} \sum_{y \in X_t} \big(\log D_t(y) + \log(1 - D_s(g'(y)))\big). \qquad (4)$$

Therefore, the training objective for f′ and g′ is to minimize

$$\ell_{\mathrm{total}} = \ell_{\mathrm{cyc}} + \lambda\, \ell_{\mathrm{adv}}, \qquad (5)$$

while D_s and D_t work to maximize ℓ_adv. In experiments, we fix λ = 10 following [Zhu et al., 2017].

### 3.2 Multi-Domain Translation

Several works target image translation among multiple domains [Choi et al., 2018; Liu et al., 2018]. Our proposed deliberation learning framework also works in this setting. Let X_1, X_2, …, X_N be N domains of interest (N ≥ 2). The target is to achieve the N(N − 1) mappings among the N image domains. It is impractical to learn so many separate models, especially when N is large. A lightweight way is to use a StarGAN-like structure with only one encoder, one decoder and one discriminator. Adapted to the deliberation learning framework, there are one encoder E, one decoder G, one post-editor G′ and one discriminator D. Each image domain has a learnable embedding t_i, i ∈ [N], representing the domain characteristics. E maps images from any domain to hidden representations conditioned on the embedding; G and G′ map the hidden representation, guided by the domain embedding, to the target space. Take the mapping from X_i to X_j as an example (i ≠ j): for an x ∈ X_i,

$$h_{i \to j} = E(\mathrm{concat}[x; t_j]); \quad \hat{y} = G(h_{i \to j}); \quad h_j = E(\mathrm{concat}[\hat{y}; t_j]); \quad y' = G'(h_j + h_{i \to j}), \qquad (6)$$

where concat appends the second input to the first along the last dimension. y′ is eventually used as the output of x in X_j. For ease of reference, denote the translation from X_i to X_j following Eqn. (6) as f′_{i,j}; correspondingly, the generation function based on G ∘ E is denoted as f_{i,j}. For multi-domain translation, E and G are pre-trained and then fixed as well. A code sketch of this domain-conditioned forward pass is given below.
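The sketch below illustrates the forward pass of Eqn. (6). How concat is realized here is an assumption: the learnable target-domain embedding t_j is broadcast spatially and concatenated to the image as extra channels, in the spirit of StarGAN's label concatenation; module and variable names are hypothetical.

```python
import torch
import torch.nn as nn

# Sketch of the domain-conditioned forward pass in Eqn. (6). The concatenation scheme
# (broadcasting t_j to a spatial map and appending it as extra channels) is an assumption.
def deliberate_translate_multi(x: torch.Tensor,            # image from domain X_i, shape (B, C, H, W)
                               t_j: torch.Tensor,          # target-domain embedding, shape (D,)
                               E: nn.Module, G: nn.Module, G_prime: nn.Module) -> torch.Tensor:
    B, _, H, W = x.shape
    t_map = t_j.view(1, -1, 1, 1).expand(B, -1, H, W)      # broadcast embedding to a feature map

    h_ij = E(torch.cat([x, t_map], dim=1))                 # h_{i->j} = E(concat[x; t_j])
    y_hat = G(h_ij)                                        # first-pass output in X_j
    h_j = E(torch.cat([y_hat, t_map], dim=1))              # re-encode with the same target embedding
    return G_prime(h_j + h_ij)                             # y' = G'(h_j + h_{i->j})
```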
Similar to two-domain translation, the training loss of multi-domain translation consists of two parts: the cycle consistency loss and the adversarial loss. The training process of multi-domain deliberation is as follows:

1. Randomly choose two different domains X_i and X_j with i ≠ j; randomly sample two batches of data B_i ⊂ X_i and B_j ⊂ X_j.

2. Formulate the cycle consistency loss as

   $$\ell_{\mathrm{cyc}} = \frac{1}{|B_i|} \sum_{x \in B_i} \|f'_{j,i}(f'_{i,j}(x)) - x\|_1 + \frac{1}{|B_j|} \sum_{y \in B_j} \|f'_{i,j}(f'_{j,i}(y)) - y\|_1. \qquad (7)$$

3. The discriminator slightly differs from the one in two-domain translation. It consists of two parts: D_src, which judges whether the input is a natural image, and D_cls, which predicts which domain the image belongs to. D_src and D_cls share the basic architecture except for the last few layers. The adversarial loss is

   $$\begin{aligned} \ell_{\mathrm{adv}} ={}& \frac{1}{2|B_i|} \sum_{x \in B_i} \big(\log D_{\mathrm{src}}(x) + \log(1 - D_{\mathrm{src}}(f'_{i,j}(x)))\big) + \frac{1}{2|B_j|} \sum_{y \in B_j} \big(\log D_{\mathrm{src}}(y) + \log(1 - D_{\mathrm{src}}(f'_{j,i}(y)))\big) \\ &+ \frac{1}{2|B_i|} \sum_{x \in B_i} \big(\log D_{\mathrm{cls}}(i \mid x) + \log D'_{\mathrm{cls}}(j \mid f'_{i,j}(x))\big) + \frac{1}{2|B_j|} \sum_{y \in B_j} \big(\log D_{\mathrm{cls}}(j \mid y) + \log D'_{\mathrm{cls}}(i \mid f'_{j,i}(y))\big). \end{aligned} \qquad (8)$$

   The f′ mappings work to minimize ℓ_adv, while D_src and D_cls try to enlarge it. When optimizing the discriminators, D′_cls is fixed.

4. Minimize ℓ_cyc + λℓ_adv on B_i and B_j, where λ = 10.

Repeat steps (1) to (4) until convergence. Compared with two-domain deliberation learning, each component of the model in the multi-domain setting has to deal with images from all domains, while in the two-domain setting each component only works for its own domain pair.

### 3.3 Discussion

The idea of deliberation also exists in super resolution (briefly, SR), whose task is to convert a low-resolution image to a high-resolution one: a bicubic interpolation [Dong et al., 2014] is first applied to the low-resolution image, followed by a neural network that reconstructs the high-resolution image. The SR task and our framework indeed share the high-level idea, but there are several differences: (1) image translation covers at least two domains with different semantics, while SR works within one domain only; (2) when making the deliberation, our framework takes the information from two domains as input and further deliberates on it with a post-editor, whereas SR conducts a one-pass operation in which the interpolated image is directly used by the subsequent module. Another work leveraging deliberation is [Xu et al., 2018], which addresses the text-to-image problem: the images are generated from low resolution to high resolution, with a multi-modal loss as a constraint. Different from that work, our model is optimized in a fully unsupervised manner, and our focus is to polish a generated image (at the canonical output resolution) into a better one.

## 4 Application to Two-Domain Translation

For two-domain translation, we work on four datasets to verify the effectiveness of our algorithm.

### 4.1 Settings

**Tasks.** We select four tasks evaluated in CycleGAN [Zhu et al., 2017]: semantic Label ↔ Photo translation on the Cityscapes dataset [Cordts et al., 2016], Apple ↔ Orange translation, Winter ↔ Summer translation, and Photo ↔ Paint translation.

Figure 3: From top to bottom are results of Label ↔ Photo translation, Apple ↔ Orange translation and Paint ↔ Photo translation.

**Model architecture.** For the encoder, decoder and discriminator, we adopt the same architectures as those in CycleGAN for consistency. In addition, we need two post-editors G′_s and G′_t. We split the generator of CycleGAN into two components: the first serves as the encoder in our scheme and contains two stride-2 convolutional layers and four residual blocks; the remaining part serves as the decoder and contains five residual blocks and two 1/2-strided (transposed) convolutional layers. Therefore, the total number of layers is consistent with CycleGAN. The architecture of the post-editor is the same as that of the decoder. For the discriminator, we directly follow CycleGAN. A minimal sketch of this generator split is given below.
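The following is an illustrative sketch of this split into an encoder (two stride-2 convolutions plus four residual blocks) and a decoder (five residual blocks plus two transposed convolutions). The exact normalization layers, channel widths and padding are assumptions, not the layers of the official CycleGAN implementation.

```python
import torch.nn as nn

# Illustrative split of a ResNet-style generator into encoder and decoder, as described above.
# ResBlock, channel widths and normalization choices are placeholders (assumptions).
class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.body(x)  # residual connection

def build_encoder(in_ch=3, ch=256):
    # E_s / E_t: two stride-2 convolutions followed by four residual blocks
    return nn.Sequential(
        nn.Conv2d(in_ch, ch // 2, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(ch // 2, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        *[ResBlock(ch) for _ in range(4)],
    )

def build_decoder(out_ch=3, ch=256):
    # G_t / G_s (and the post-editor G'): five residual blocks, then two 1/2-strided convolutions
    return nn.Sequential(
        *[ResBlock(ch) for _ in range(5)],
        nn.ConvTranspose2d(ch, ch // 2, 3, stride=2, padding=1, output_padding=1), nn.ReLU(inplace=True),
        nn.ConvTranspose2d(ch // 2, out_ch, 3, stride=2, padding=1, output_padding=1),
        nn.Tanh(),
    )
```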
**Implementation details.** We follow the official CycleGAN code² to implement our scheme in PyTorch. To stabilize the training, we first pre-train E_s, G_t, E_t and G_s using the standard CycleGAN code until convergence. After that, we start to train G′_s and G′_t. We use Adam with initial learning rate 2 × 10⁻⁴ to train the models for the first 100 epochs, and then linearly decay the learning rate to 0 over the next 100 epochs.

² https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix

| Task | CycleGAN | Ours |
| --- | --- | --- |
| Apple → Orange | 147.31 | 130.17 |
| Orange → Apple | 146.96 | 132.70 |
| Photo → Label | 68.66 | 45.62 |
| Label → Photo | 95.23 | 69.34 |
| Winter → Summer | 85.48 | 76.89 |
| Summer → Winter | 82.20 | 77.64 |
| Cezanne → Photo | 186.16 | 158.75 |
| Photo → Cezanne | 192.25 | 175.78 |
| Monet → Photo | 134.01 | 123.96 |
| Photo → Monet | 139.37 | 123.45 |
| Ukiyo-e → Photo | 197.46 | 162.53 |
| Photo → Ukiyo-e | 152.86 | 127.18 |
| Van Gogh → Photo | 93.04 | 77.34 |
| Photo → Van Gogh | 96.37 | 87.93 |

Table 1: FID scores of CycleGAN and our algorithm.

**Metrics.** The Fréchet Inception Distance (briefly, FID), first proposed by [Heusel et al., 2017], has recently become a commonly adopted metric for evaluating generative models [Lucic et al., 2018; Brock et al., 2019]. For the FID measurement, the generated samples and the real ones are first mapped into a feature space by an Inception-v3 model [Szegedy et al., 2016]; the Fréchet distance between these two distributions is then calculated to obtain the FID score. The authors demonstrated that FID has a reasonable correlation with human judgment [Heusel et al., 2017]. Smaller FID scores indicate better translation quality. In addition, we follow [Zhu et al., 2017] and provide FCN scores [Isola et al., 2017]³ in our framework analysis for fair comparison, including per-pixel accuracy (Pixel Acc.), per-class accuracy (Class Acc.) and mean class Intersection-Over-Union (Class IOU). Higher accuracies indicate better translation quality.

³ We directly use the code and pre-trained FCN model in https://github.com/phillipi/pix2pix/tree/master/scripts/eval_cityscapes.
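As a reference for how the FID numbers above are obtained, here is a minimal sketch of the metric: Gaussians are fitted to Inception-v3 features of the real and generated images, and the Fréchet distance between them is reported. Feature extraction is omitted; feats_real and feats_fake are assumed to be arrays of pooled Inception-v3 features.

```python
import numpy as np
from scipy import linalg

# Minimal FID sketch: fit Gaussians to Inception-v3 features of real and generated images,
# then compute the Fréchet distance between them. Inputs are assumed to be (N, d) feature arrays.
def frechet_inception_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    mu1, sigma1 = feats_real.mean(axis=0), np.cov(feats_real, rowvar=False)
    mu2, sigma2 = feats_fake.mean(axis=0), np.cov(feats_fake, rowvar=False)

    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)  # matrix square root of the product
    if np.iscomplexobj(covmean):                            # discard tiny imaginary parts
        covmean = covmean.real

    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```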
### 4.2 Results

**Qualitative evaluation.** Figure 3 shows three groups of translation results on Label ↔ Photo (the first two rows), Apple ↔ Orange (the second two rows) and Paint ↔ Photo (the last two rows). In general, deliberation learning can: (1) generate images with rich details; (2) correct many failure cases (e.g., missing items, flawed translations) produced by CycleGAN; (3) make the generated images more realistic.

**Quantitative evaluation.** FID scores on the four datasets are shown in Table 1. We can see that deliberation learning significantly surpasses the baseline across all datasets.

### 4.3 Framework Analysis

We carry out a detailed analysis of our proposed framework. We implement another six settings on Label → Photo translation; the scores are listed in Table 2.

(1) To verify whether the way of combining the features from the encoder and decoder influences deliberation quality, we concatenate the two features h_x and h_ŷ in Eqn. (1) instead of adding them. (Ours-1)

(2) To verify whether better accuracies are simply brought by larger models, we double the depth of the decoder, i.e., ten residual blocks. (Ours-2)

(3) To verify whether two different encoders are needed, we use two encoders to encode x and ŷ separately. (Ours-3)

(4) To verify whether the identity loss is needed, we remove it from our framework. (Ours-4)

(5) To verify whether the encoder and decoder need to be fixed, we update all parameters in the encoder, decoder and post-editor. (Ours-5)

(6) To verify whether the post-editor needs the raw input, we feed the post-editor with h_ŷ only. (Ours-6)

For fair comparison, we also provide the FCN scores in Table 2, and list the existing results of CoGAN [Liu and Tuzel, 2016], BiGAN [Donahue et al., 2017; Dumoulin et al., 2017], standard CycleGAN [Zhu et al., 2017] and pix2pix [Isola et al., 2017]. Note that pix2pix is trained in a supervised way.

| Setting | Pixel Acc. | Class Acc. | Class IOU | FID |
| --- | --- | --- | --- | --- |
| CoGAN | 0.40 | 0.10 | 0.06 | – |
| BiGAN | 0.19 | 0.06 | 0.02 | – |
| CycleGAN | 0.52 | 0.17 | 0.11 | 95.23 |
| pix2pix | 0.71 | 0.25 | 0.18 | – |
| Ours | 0.62 | 0.20 | 0.15 | 69.34 |
| Ours-1 | 0.61 | 0.20 | 0.15 | 67.63 |
| Ours-2 | 0.57 | 0.19 | 0.14 | 81.59 |
| Ours-3 | 0.57 | 0.20 | 0.15 | 80.83 |
| Ours-4 | 0.57 | 0.20 | 0.15 | 77.58 |
| Ours-5 | 0.52 | 0.13 | 0.10 | 106.28 |
| Ours-6 | 0.56 | 0.20 | 0.15 | 78.29 |

Table 2: FCN and FID scores of Label → Photo on Cityscapes.

## 5 Application to Multi-Domain Translation

In this section, we apply deliberation learning to multi-domain image translation and give qualitative and quantitative analyses of the performance of our scheme.

### 5.1 Settings

**Tasks.** We use the publicly available CelebA dataset [Liu et al., 2015] for facial attribute translation. CelebA contains 10,177 identities and 202,599 facial images. The original images are cropped to 128 × 156. The test set is randomly sampled (2,000 images) and the remaining images are used for training. For direct comparison with StarGAN [Choi et al., 2018], we perform our experiments on the same three attributes: hair color (black, blond and brown), gender (male/female) and age (young/old).

**Model architecture.** For fair comparison, we adopt the same architectures as StarGAN [Choi et al., 2018] for the encoder and decoder. We additionally need one post-editor G′ for our method. Similar to two-domain image translation, we split the generator of StarGAN into an encoder and a decoder, and adopt the architecture of the decoder for the post-editor G′. Different from a conventional GAN [Goodfellow et al., 2014], the discriminator is additionally able to distinguish which domain a translated image belongs to.

Figure 4: Qualitative results of our scheme compared with StarGAN. It is worth noting that deliberation effectively boosts the performance of StarGAN on facial attributes, making the generated images more natural and indistinguishable from the real images.

| Attribute | StarGAN | Ours |
| --- | --- | --- |
| Hair | 80.99% | 84.10% |
| Gender | 88.40% | 94.19% |
| Age | 74.48% | 76.64% |

Table 3: Classification accuracies of StarGAN and our algorithm.

**Implementation details.** Our implementation is based on the official StarGAN code⁴. Similar to CycleGAN, we first pre-train the encoder and the decoder with standard StarGAN until convergence. The pre-trained parameters are loaded as the initial model when optimizing G′. We use Adam with initial learning rate 1 × 10⁻⁴ to train the models for the first 100,000 iterations, and then linearly decay the learning rate to 0 over the next 100,000 iterations. The batch size is set to 16.

⁴ https://github.com/yunjey/stargan
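A minimal sketch of the optimization schedule just described (Adam, constant learning rate for the first half of training, then linear decay to zero), using PyTorch's LambdaLR scheduler. The post_editor module below is a placeholder; the learning rate and iteration counts follow the text, everything else is illustrative.

```python
import torch

# Sketch of the described schedule: Adam at 1e-4 for the first 100k iterations,
# then a linear decay to 0 over the next 100k. `post_editor` is a stand-in for G'.
post_editor = torch.nn.Conv2d(3, 3, 3, padding=1)   # placeholder module for illustration
optimizer = torch.optim.Adam(post_editor.parameters(), lr=1e-4)

total_iters, decay_start = 200_000, 100_000

def lr_lambda(it):
    # constant for the first half of training, then linear decay to 0
    if it < decay_start:
        return 1.0
    return max(0.0, 1.0 - (it - decay_start) / (total_iters - decay_start))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# inside the training loop: optimizer.step(); scheduler.step()
```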
**Metrics.** To evaluate the quality of the generated images quantitatively, we follow [Choi et al., 2018] and train a classification model on the three facial attributes we use (i.e., hair color, gender and age). We directly adopt the discriminator architecture of StarGAN for the classifier and use the same training set as our image translation models. Higher classification accuracies on the translated facial attributes indicate better translation quality.

### 5.2 Results

**Qualitative evaluation.** We provide facial attribute translation results in Figure 4. Generally, deliberation learning effectively boosts the performance of StarGAN on facial attributes, making the generated images more natural and indistinguishable from the real images. From Figure 4, we can observe that deliberation learning successfully generates facial images with rich details, especially in semantic regions such as the eyes and cheeks. In addition, in many cases StarGAN suffers from color shifting, while our scheme corrects this fault and preserves realism without affecting the background or hair color.

**Quantitative evaluation.** The quantitative evaluation of the three facial attribute translations is given in Table 3. All classification accuracies are measured on translated images with our pre-trained facial attribute classifier. In terms of classification accuracy, our method surpasses the StarGAN baseline by 3.11, 5.79 and 2.16 percentage points for hair color, gender and age, respectively.

## 6 Conclusion and Future Work

In this paper, we introduced the concept of deliberation learning for image translation, which shares a high-level idea with human behavior: reviewing and continuing to polish. We implemented deliberation learning for image-to-image translation. The experimental results demonstrate that our method generates more natural images and preserves crucial details. There are many interesting directions to explore in the future. First, we will apply the idea of deliberation learning to more tasks such as supervised image classification and segmentation. Second, deliberation learning can be used iteratively to keep polishing an image for a better output. Third, we will study how to speed up the algorithm.

## Acknowledgments

This work was supported in part by NSFC under Grants 61571413 and 61632001. We thank all the anonymous reviewers for their valuable comments on our paper.

## References

[Anoosheh et al., 2018] Asha Anoosheh, Eirikur Agustsson, Radu Timofte, and Luc Van Gool. ComboGAN: Unrestrained scalability for image domain translation. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.

[Brock et al., 2019] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR), 2019.

[Choi et al., 2018] Yunjey Choi, Minje Choi, and Munyoung Kim. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Computer Vision and Pattern Recognition (CVPR). IEEE, July 2018.

[Cordts et al., 2016] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2016.
[Donahue et al., 2017] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In International Conference on Learning Representations (ICLR), 2017.

[Dong et al., 2014] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision (ECCV). Springer, 2014.

[Dumoulin et al., 2017] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. In International Conference on Learning Representations (ICLR), 2017.

[Ge et al., 2018] Tao Ge, Furu Wei, and Ming Zhou. Fluency boost learning and inference for neural grammatical error correction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1055–1065, 2018.

[Goodfellow et al., 2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), 2014.

[Guu et al., 2018] Kelvin Guu, Tatsunori B Hashimoto, Yonatan Oren, and Percy Liang. Generating sentences by editing prototypes. Transactions of the Association for Computational Linguistics, 6:437–450, 2018.

[He et al., 2016] Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. Dual learning for machine translation. In Advances in Neural Information Processing Systems (NeurIPS), 2016.

[Heusel et al., 2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

[Isola et al., 2017] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.

[Kim et al., 2017] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. In International Conference on Machine Learning (ICML), 2017.

[Lin et al., 2018] Jianxin Lin, Yingce Xia, Tao Qin, Zhibo Chen, and Tie-Yan Liu. Conditional image-to-image translation. In Computer Vision and Pattern Recognition (CVPR), pages 5524–5532. IEEE, 2018.

[Liu and Tuzel, 2016] Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems (NeurIPS), 2016.

[Liu et al., 2015] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In International Conference on Computer Vision (ICCV), 2015.

[Liu et al., 2018] Alexander Liu, Yen-Chen Liu, Yu-Ying Yeh, and Yu-Chiang Frank Wang. A unified feature disentangler for multi-domain image translation and manipulation. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[Lucic et al., 2018] Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[Mejjati et al., 2018] Youssef Alami Mejjati, Christian Richardt, James Tompkin, Darren Cosker, and Kwang In Kim. Unsupervised attention-guided image-to-image translation. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
[Mirza and Osindero, 2014] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

[Pumarola et al., 2018] Albert Pumarola, Antonio Agudo, Aleix M Martinez, Alberto Sanfeliu, and Francesc Moreno-Noguer. GANimation: Anatomically-aware facial animation from a single image. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[Reed et al., 2016] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In International Conference on Machine Learning (ICML), pages 1060–1069, 2016.

[Szegedy et al., 2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Computer Vision and Pattern Recognition (CVPR), 2016.

[Wang et al., 2019] Yiren Wang, Yingce Xia, Tianyu He, Fei Tian, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. Multi-agent dual learning. In International Conference on Learning Representations (ICLR), 2019.

[Xia et al., 2017] Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin, Tao Qin, Nenghai Yu, and Tie-Yan Liu. Deliberation networks: Sequence generation beyond one-pass decoding. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

[Xu et al., 2018] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[Yi et al., 2017] Zili Yi, Hao (Richard) Zhang, Ping Tan, and Minglun Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In International Conference on Computer Vision (ICCV), 2017.

[Zhu et al., 2017] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In International Conference on Computer Vision (ICCV), 2017.