# HARMONIC UNPAIRED IMAGE-TO-IMAGE TRANSLATION

Published as a conference paper at ICLR 2019

Rui Zhang (Google Cloud AI & Chinese Academy of Sciences, Beijing, China) zhangrui@ict.ac.cn
Tomas Pfister (Google Cloud AI, Sunnyvale, USA) tpfister@google.com
Jia Li (Google Cloud AI, Sunnyvale, USA) lijiali@google.com

## ABSTRACT

The recent direction of unpaired image-to-image translation is on one hand very exciting, as it alleviates the big burden of obtaining label-intensive pixel-to-pixel supervision, but it is on the other hand not fully satisfactory due to the presence of artifacts and degenerated transformations. In this paper, we take a manifold view of the problem by introducing a smoothness term over the sample graph to attain harmonic functions that enforce consistent mappings during the translation. We develop HarmonicGAN to learn bi-directional translations between the source and the target domains. With the help of similarity-consistency, the inherent self-consistency property of samples can be maintained. Distance metrics defined on two types of features, histogram and CNN, are exploited. Under an identical problem setting to CycleGAN, without additional manual inputs and only at a small training-time cost, HarmonicGAN demonstrates a significant qualitative and quantitative improvement over the state of the art, as well as improved interpretability. We show experimental results in a number of applications including medical imaging, object transfiguration, and semantic labeling. We outperform the competing methods in all tasks, and on a medical imaging task in particular our method turns CycleGAN from a failure into a success, halving the mean-squared error and generating images that radiologists prefer over competing methods in 95% of cases.

## 1 INTRODUCTION

Image-to-image translation (Isola et al., 2017) aims to learn a mapping from a source domain to a target domain. As a significant and challenging task in computer vision, image-to-image translation benefits many vision and graphics tasks, such as realistic image synthesis (Isola et al., 2017; Zhu et al., 2017a), medical image generation (Zhang et al., 2018; Dar et al., 2018), and domain adaptation (Hoffman et al., 2018). Given pairs of training images with detailed pixel-to-pixel correspondences between the source and the target, image-to-image translation can be cast as a regression problem, e.g. using Fully Convolutional Networks (FCNs) (Long et al., 2015) to minimize a per-pixel prediction loss. Recently, approaches using rich generative models based on Generative Adversarial Networks (GANs) (Goodfellow et al., 2014; Radford et al., 2016; Arjovsky et al., 2017) have achieved astonishing success. The main benefit of introducing GANs (Goodfellow et al., 2014) to image-to-image translation (Isola et al., 2017) is to attain additional image-level (often through patches) feedback about the overall quality of the translation, information that is not directly accessible through the per-pixel regression objective. The method of Isola et al. (2017) is able to generate high-quality images, but it requires paired training data, which is difficult to collect and often does not exist.
To perform translation without paired data, circularity-based approaches (Zhu et al., 2017a; Kim et al., 2017; Yi et al., 2017) have been proposed to learn translations from one set to another, using a circularity constraint to establish relationships between the source and target domains and to force the result generated from a sample in the source domain to map back and reproduce the original sample. The original image-to-image translation problem (Isola et al., 2017) is supervised at the pixel level, whereas the unpaired image-to-image translation task (Zhu et al., 2017a) is considered unsupervised: pixel-level supervision is absent, but adversarial supervision at the image level (in the target domain) is present.

Figure 1: HarmonicGAN corrects major failures in multiple domains (panels: Flair source, CycleGAN output, HarmonicGAN output, T1 ground truth; natural image source, CycleGAN output, HarmonicGAN output): (a) for medical images it corrects incorrectly removed (top) and added (bottom) tumors; and (b) for horse-to-zebra transfiguration it does not incorrectly transform the background (top) and performs a complete translation (bottom).

By using a cycled regression for the pixel-level prediction (source → target → source) plus a term for the adversarial difference between the transferred images and the target images, CycleGAN is able, in many cases, to successfully train a translation model without paired source-target supervision. However, the lack of a mechanism to enforce regularity in the translation creates problems like those in Fig. 1 (a) and Fig. 2: undesirable changes to the image contents, such as superficially removing tumors (first row) or creating tumors (second row) at the wrong positions in the target domain. Fig. 1 (b) also shows artifacts of CycleGAN on natural images when translating horses into zebras.

Figure 2: For CycleGAN (panels: real Flair source, fake T1 from CycleGAN, real T1 target, reconstructed Flair from CycleGAN), the reconstructed image may perfectly match the source image under the circularity constraint while the translated image does not maintain the inherent property of the source image (e.g. a tumor), thus generating unexpected results (e.g. incorrectly removing a tumor).

To combat the above issue, in this paper we look at the problem of unpaired image-to-image translation from a manifold learning perspective (Tenenbaum et al., 2000; Roweis & Saul, 2000). Intuitively, the problem can be alleviated by introducing a regularization term in the translation, encouraging similar contents (based on textures or semantics) in the same image to undergo similar translations/transformations. A common principle in manifold learning is to preserve local distances after the unfolding: forcing neighboring (similar) samples in the original space to be neighbors in the new space. The same principle has been applied to graph-based semi-supervised learning (Zhu, 2006), where harmonic functions with graph Laplacians (Zhu et al., 2003; Belkin et al., 2006) are used to obtain regularized labels of unlabeled data points. During the translation/transformation, some domain-specific attributes are changed, such as the colors, texture, and semantics of certain image regions. Although there is no supervised information for these changes, a certain consistency during the transformation is desirable: image contents that are similar in the source space should also be similar in the target space.
Inspired by graph-based semi-supervised learning (Zhu et al., 2003; Zhu, 2006), we introduce smoothness terms to unpaired image-to-image translation (Zhu et al., 2017a), providing a stronger regularization for the translation/transformation between the source and target domains and aiming to exploit the manifold structure of those domains. For a pair of similar samples (two different locations in an image; one can think of them as two patches, although the receptive fields of a CNN are quite large), we add a smoothness term that minimizes a weighted distance between the corresponding locations in the target image. Note that two spatially distant samples might still be neighbors in the feature space. We name our algorithm HarmonicGAN as it behaves harmonically along with the circularity and adversarial constraints to learn a pair of dual translations between the source and target domains, as shown in Fig. 1. Distance metrics defined on two alternative features are adopted: (1) low-level soft RGB histograms; and (2) CNN (VGG) features with pre-trained semantics.

We conduct experiments in a number of applications, showing that in each of them our method outperforms existing methods quantitatively, qualitatively, and in user studies. For a medical imaging task (Cohen et al., 2018) that recently called attention to a major CycleGAN failure case (learning to accidentally add/remove tumors in an MRI image translation task), our proposed method provides a large improvement over CycleGAN, halving the mean-squared error and generating images that radiologists prefer over competing methods in 95% of cases.

### CONTRIBUTIONS

1. We introduce smooth regularization over the graph for unpaired image-to-image translation to attain harmonic translations.
2. When building an end-to-end learning pipeline, we adopt two alternative types of feature measures to compute the weight matrix for the graph Laplacian, one based on a soft histogram (Wang et al., 2016) and another based on semantic CNN (VGG) features (Simonyan & Zisserman, 2015).
3. We show that this method results in significantly improved consistency for transformations. With experiments on multiple translation tasks, we demonstrate that HarmonicGAN outperforms the state of the art.

## 2 RELATED WORK

As discussed in the introduction, the general image-to-image translation task in the deep learning era was pioneered by Isola et al. (2017), but there are prior works such as image analogies (Hertzmann et al., 2001) that aim at a similar goal, along with other exemplar-based methods (Efros & Freeman, 2001; Criminisi et al., 2004; Barnes et al., 2009). After Isola et al. (2017), a series of other works have also exploited pixel-level reconstruction constraints to build connections between the source and target domains (Zhang et al., 2017; Wang et al., 2018). The image-to-image translation framework (Isola et al., 2017) is very powerful, but it requires a sufficient amount of training data with paired source-to-target images, which are often laborious to obtain in general tasks such as labeling (Long et al., 2015), synthesis (Chen & Koltun, 2017), and style transfer (Huang & Belongie, 2017). Unpaired image-to-image translation frameworks (Zhu et al., 2017a;b; Liu et al., 2017; Shrivastava et al., 2017; Kim et al., 2017) such as CycleGAN remove the requirement of detailed pixel-level supervision.
In CycleGAN this is achieved by enforcing a bi-directional prediction from source to target and from target back to source, with an adversarial penalty on the translated images in the target domain. Similar unsupervised circularity-based approaches (Kim et al., 2017; Yi et al., 2017) have also been developed. The CycleGAN family of models (Zhu et al., 2017a;b) points to an exciting direction of unsupervised approaches, but these models also create artifacts in many applications. As shown in Fig. 2, one reason for this is that the circularity constraint in CycleGAN lacks a straightforward description of the target domain, so it may change the inherent properties of the original samples and generate unexpected results that are inconsistent across different image locations. These failures have been prominently explored in recent works, showing that CycleGAN (Zhu et al., 2017a) may accidentally add or remove tumors in cross-modal medical image synthesis (Cohen et al., 2018), and that in the task of natural image transfiguration, e.g. from horse to zebra, regions in the background may also be translated into a zebra-like texture (Zhu et al., 2018) (see Fig. 1 (b)). Here we propose HarmonicGAN, which introduces a smoothness term into the CycleGAN framework to enforce a regularized translation, requiring image content that is similar in the source space to also be similar in the target space. We follow the general design principle in manifold learning (Tenenbaum et al., 2000; Roweis & Saul, 2000) and the development of harmonic functions in the graph-based semi-supervised learning literature (Zhu et al., 2003; Belkin et al., 2006; Zhu, 2006; Weston et al., 2012).

There has been previous work, DistanceGAN (Benaim & Wolf, 2017), in which distance preservation was also implemented. However, DistanceGAN differs from HarmonicGAN in (1) motivation, (2) formulation, (3) implementation, and (4) performance. The primary motivation of DistanceGAN is to demonstrate an alternative loss term for the per-pixel difference in CycleGAN. In HarmonicGAN, we observe that the cycled per-pixel loss is effective, and we aim to make the translation harmonic by introducing additional regularization. The smoothness term acts as a graph Laplacian imposed on all pairs of samples (using random samples in the implementation). In the experimental results, we show that the artifacts of CycleGAN are still present in DistanceGAN, whereas HarmonicGAN provides a significant boost to the performance of CycleGAN.

In addition, it is worth mentioning that the smoothness term proposed here is quite different from the binary term used in the Conditional Random Fields literature (Lafferty et al., 2001; Krähenbühl & Koltun, 2011), whether fully supervised (Chen et al., 2018; Zheng et al., 2015) or weakly supervised (Tang et al., 2018; Lin et al., 2016). The two differ in (1) output space (multi-class label vs. high-dimensional features), (2) mathematical formulation (a joint conditional probability for the neighboring labels vs. a Laplacian function over the graph), (3) application domain (image labeling vs. image translation), (4) effectiveness (boundary smoothing vs. manifold structure preservation), and (5) the role in the overall algorithm (a post-processing effect with relatively small improvement vs. large-area error correction).
## 3 HARMONICGAN FOR UNPAIRED IMAGE-TO-IMAGE TRANSLATION

Following the basic formulation in CycleGAN (Zhu et al., 2017a), for the source domain $X$ and the target domain $Y$, we consider unpaired training samples $\{x_k\}_{k=1}^{N}$ where $x_k \in X$, and $\{y_k\}_{k=1}^{N}$ where $y_k \in Y$. The goal of image-to-image translation is to learn a pair of dual mappings: a forward mapping $G: X \to Y$ and a backward mapping $F: Y \to X$. Two discriminators $D_X$ and $D_Y$ are adopted in (Zhu et al., 2017a) to distinguish between real and generated images. In particular, the discriminator $D_X$ aims to distinguish real images $\{x\}$ from generated images $\{F(y)\}$; similarly, discriminator $D_Y$ distinguishes $\{y\}$ from $\{G(x)\}$. The adversarial constraint is therefore applied in both the source and target domains, expressed in (Zhu et al., 2017a) as

$$\mathcal{L}_{\mathrm{GAN}}(G, D_Y, X, Y) = \mathbb{E}_{y \sim Y}[\log D_Y(y)] + \mathbb{E}_{x \sim X}[\log(1 - D_Y(G(x)))], \tag{1}$$

and

$$\mathcal{L}_{\mathrm{GAN}}(F, D_X, X, Y) = \mathbb{E}_{x \sim X}[\log D_X(x)] + \mathbb{E}_{y \sim Y}[\log(1 - D_X(F(y)))]. \tag{2}$$

For notational simplicity, we denote the GAN loss as

$$\mathcal{L}_{\mathrm{GAN}}(G, F) = \arg\max_{D_Y, D_X} \big[\mathcal{L}_{\mathrm{GAN}}(G, D_Y, X, Y) + \mathcal{L}_{\mathrm{GAN}}(F, D_X, X, Y)\big]. \tag{3}$$

Since the data in the two domains are unpaired, a circularity constraint is introduced in (Zhu et al., 2017a) to establish relationships between $X$ and $Y$. The circularity constraint enforces that $G$ and $F$ are a pair of inverse mappings and that a translated sample can be mapped back to the original sample. The circularity constraint contains consistencies in two directions: the forward cycle $x \to G(x) \to F(G(x)) \approx x$ and the backward cycle $y \to F(y) \to G(F(y)) \approx y$. Thus, the circularity constraint is formulated as (Zhu et al., 2017a)

$$\mathcal{L}_{\mathrm{cyc}}(G, F) = \mathbb{E}_{x \sim X}\,\|F(G(x)) - x\|_1 + \mathbb{E}_{y \sim Y}\,\|G(F(y)) - y\|_1. \tag{4}$$

Here we rewrite the overall objective in (Zhu et al., 2017a) to be minimized as

$$\mathcal{L}_{\mathrm{CycleGAN}}(G, F) = \lambda_{\mathrm{GAN}}\,\mathcal{L}_{\mathrm{GAN}}(G, F) + \lambda_{\mathrm{cyc}}\,\mathcal{L}_{\mathrm{cyc}}(G, F), \tag{5}$$

where the weights $\lambda_{\mathrm{GAN}}$ and $\lambda_{\mathrm{cyc}}$ control the importance of the corresponding objectives.
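To make the baseline objective concrete, below is a minimal PyTorch-style sketch of Eqns. (1)-(5). It is a sketch under stated assumptions, not our exact training code: the modules `G`, `F_`, `D_X`, `D_Y` are placeholders for the generators and discriminators, and it uses the least-squares GAN variant (Mao et al., 2017) that our implementation substitutes for the log-likelihood objective (see Section 4.2).

```python
import torch
import torch.nn.functional as nnf

def lsgan_d_loss(D, real, fake):
    # Discriminator side of Eqns. (1)-(2), in the least-squares form
    # of Mao et al. (2017): real -> 1, fake -> 0.
    pred_real = D(real)
    pred_fake = D(fake.detach())  # do not backprop into the generator
    return (nnf.mse_loss(pred_real, torch.ones_like(pred_real)) +
            nnf.mse_loss(pred_fake, torch.zeros_like(pred_fake)))

def lsgan_g_loss(D, fake):
    # Generator side: try to make D label translated images as real.
    pred = D(fake)
    return nnf.mse_loss(pred, torch.ones_like(pred))

def cyclegan_objective(G, F_, D_X, D_Y, x, y,
                       lambda_gan=1.0, lambda_cyc=10.0):
    # L_CycleGAN of Eqn. (5): adversarial terms (Eqn. (3)) plus the
    # bi-directional cycle-consistency term (Eqn. (4)).
    fake_y, fake_x = G(x), F_(y)
    l_gan = lsgan_g_loss(D_Y, fake_y) + lsgan_g_loss(D_X, fake_x)
    l_cyc = nnf.l1_loss(F_(fake_y), x) + nnf.l1_loss(G(fake_x), y)
    return lambda_gan * l_gan + lambda_cyc * l_cyc
```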
### 3.1 SMOOTHNESS TERM OVER THE GRAPH

The full objective of the circularity-based approach contains adversarial constraints and a circularity constraint. The adversarial constraints ensure that the generated samples lie in the distribution of the source or target domain, but they ignore the relationship between the input and output of the forward or backward translations. The circularity constraint establishes connections between the source and target domains by forcing the forward and backward translations to be inverses of each other. However, CycleGAN has limitations: as shown in Fig. 2, the circular projection might perfectly match the input, and the translated image might look very much like a real one, yet the translated image may not maintain the inherent property of the input and may contain a large artifact that is not connected to the input.

Figure 3: Architecture of HarmonicGAN, consisting of a pair of inverse generators $G$, $F$ and two discriminators $D_X$, $D_Y$, applied over (a) the forward cycle and (b) the backward cycle. The objective combines an adversarial constraint, a circularity constraint and a smoothness term.

Here we propose a smoothness term that enforces a stronger correlation between the source and target domains, focusing on providing similarity-consistency between image patches during the translation. The smoothness term defines a graph Laplacian whose minimal value is achieved by a harmonic function. We define the set of individual image patches as the nodes of the graph $\mathcal{G}$. Let $\tilde{x}^{(i)}$ denote the feature vector of the $i$-th image patch of $x \in X$. For the image set $X$, we define the set of individual samples (image patches) as $S = \{\tilde{x}^{(i)},\ i = 1, \ldots, M\}$, where $M$ is the total number of samples/patches. An affinity measure (similarity) computed on image patches $\tilde{x}^{(i)}$ and $\tilde{x}^{(j)}$, the scalar $w_{ij}(X)$, defines the corresponding edge of the graph $\mathcal{G}$ over $S$. The smoothness term acts as a graph Laplacian imposed on all pairs of image patches. We therefore define a smoothness term over the graph as

$$\mathcal{L}_{\mathrm{Smooth}}(G, X, Y) = \mathbb{E}_{x \sim X}\Big[\sum_{i,j} w_{ij}(X)\,\mathrm{Dist}\big[G(\tilde{x})^{(i)}, G(\tilde{x})^{(j)}\big] + \sum_{i,j} w_{ij}(G(X))\,\mathrm{Dist}\big[F(G(\tilde{x}))^{(i)}, F(G(\tilde{x}))^{(j)}\big]\Big], \tag{6}$$

where $w_{ij}(X) = \exp\{-\mathrm{Dist}[\tilde{x}^{(i)}, \tilde{x}^{(j)}]/\sigma^2\}$ (Zhu et al., 2003) defines the affinity between two patches $\tilde{x}^{(i)}$ and $\tilde{x}^{(j)}$ based on their distance (e.g. measured on histogram or CNN features), and $\mathrm{Dist}[G(\tilde{x})^{(i)}, G(\tilde{x})^{(j)}]$ is the distance between the two image patches at the same locations after translation. In the implementation, we first normalize the features to the range $[0, 1]$ and then use the L1 distance of the normalized features as the $\mathrm{Dist}$ function (for both histogram and CNN features). Similarly, we define a smoothness term for the backward direction as

$$\mathcal{L}_{\mathrm{Smooth}}(F, Y, X) = \mathbb{E}_{y \sim Y}\Big[\sum_{i,j} w_{ij}(Y)\,\mathrm{Dist}\big[F(\tilde{y})^{(i)}, F(\tilde{y})^{(j)}\big] + \sum_{i,j} w_{ij}(F(Y))\,\mathrm{Dist}\big[G(F(\tilde{y}))^{(i)}, G(F(\tilde{y}))^{(j)}\big]\Big]. \tag{7}$$

The combined smoothness loss thus becomes

$$\mathcal{L}_{\mathrm{Smooth}}(G, F) = \mathcal{L}_{\mathrm{Smooth}}(G, X, Y) + \mathcal{L}_{\mathrm{Smooth}}(F, Y, X). \tag{8}$$
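The sketch below shows one plausible implementation of the forward smoothness term of Eqn. (6), assuming the patch features have already been extracted and normalized to $[0, 1]$ as described above (the extractors of Section 3.3 below are abstracted into `feats` arguments). The value of `sigma` and the choice to stop gradients through the affinity weights are assumptions of this sketch, not details confirmed by the paper.

```python
import torch

def patch_affinity(feats, sigma=1.0):
    # w_ij = exp(-Dist[f_i, f_j] / sigma^2), with Dist the L1 distance
    # of [0, 1]-normalized patch features (Eqn. (6)); sigma is a
    # placeholder hyperparameter of this sketch.
    dist = torch.cdist(feats, feats, p=1)   # (M, M) pairwise L1 distances
    return torch.exp(-dist / sigma ** 2)

def smoothness_half(feats_in, feats_out, sigma=1.0):
    # One half of Eqn. (6): sum_ij w_ij(input) * Dist[output_i, output_j].
    # Whether gradients should also flow through the weights is an
    # implementation choice; here we detach them for simplicity.
    w = patch_affinity(feats_in, sigma).detach()
    return (w * torch.cdist(feats_out, feats_out, p=1)).sum()

def smoothness_forward(feats_x, feats_Gx, feats_FGx, sigma=1.0):
    # Eqn. (6): weights from x applied to the translated patches G(x),
    # plus weights from G(x) applied to the reconstructed patches F(G(x)).
    # feats_* are (M, C) per-patch feature matrices (Section 3.3).
    return (smoothness_half(feats_x, feats_Gx, sigma) +
            smoothness_half(feats_Gx, feats_FGx, sigma))
```

The backward term of Eqn. (7) is symmetric, and Eqn. (8) simply sums the two directions.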
### 3.2 OVERALL OBJECTIVE FUNCTION

As shown in Fig. 3, HarmonicGAN consists of a pair of inverse generators $G$, $F$ and two discriminators $D_X$, $D_Y$, defined in Eqn. (1) and Eqn. (2). The full objective combines an adversarial constraint (see Eqn. (3)), a circularity constraint (see Eqn. (4)), and a smoothness term (see Eqn. (8)). The adversarial constraint forces the translated images to be plausible and indistinguishable from real images; the circularity constraint ensures the cycle-consistency of translated images; and the smoothness term provides a stronger similarity-consistency between patches to maintain the inherent properties of the images. Combining Eqn. (5) and Eqn. (8), the overall objective for our proposed HarmonicGAN under the smoothness constraint becomes

$$\mathcal{L}_{\mathrm{HarmonicGAN}}(G, F) = \mathcal{L}_{\mathrm{CycleGAN}}(G, F) + \lambda_{\mathrm{Smooth}}\,\mathcal{L}_{\mathrm{Smooth}}(G, F). \tag{9}$$

Similar to the graph-based semi-supervised learning definition (Zhu et al., 2003; Zhu, 2006), the solution to Eqn. (9) leads to a harmonic function. The optimization process during training obtains

$$G^*, F^* = \arg\min_{G, F} \mathcal{L}_{\mathrm{HarmonicGAN}}(G, F). \tag{10}$$

The effectiveness of the smoothness term of Eqn. (8) is evident: in Fig. 4, we show (using t-SNE (Maaten & Hinton, 2008)) that the local neighborhood structure is preserved by HarmonicGAN, whereas CycleGAN results in two similar patches being far apart after translation.

Figure 4: Visualization using t-SNE (Maaten & Hinton, 2008) of real T1 (source) and real Flair (target) patches, illustrating the effectiveness of the smoothness term in HarmonicGAN (best viewed in color). As shown in the top two figures (source and target respectively), the smoothness term acts as a graph Laplacian imposed on all pairs of image patches. Bottom-left: for two similar patches in the original sample, if one patch is translated to a tumor region while the other is not, the two patches will have a large distance in the target space, resulting in a translation that incorrectly adds a tumor to the original sample (CycleGAN result shown). Bottom-right: for two similar patches in the original sample, if the translation maintains the non-tumor property of these two patches in the translated sample, then the two patches will also be similar in the target space (HarmonicGAN result shown).

### 3.3 FEATURE DESIGN

In the smoothness constraint, the similarity of a pair of patches is measured on the features of each patch (sample point); all the patches in an image form a graph. We adopt two types of features: (1) a low-level soft histogram, and (2) pre-trained CNN (VGG) features that carry semantic information. Soft histogram features are lightweight and easy to implement but carry little semantic information; VGG features require an additional CNN but carry more semantics.

#### 3.3.1 SOFT RGB HISTOGRAM FEATURES

We first design a weight matrix based on simple low-level RGB histograms. To make the end-to-end learning system work, it is crucial that the histogram computation be differentiable. We adopt the soft histogram representation proposed in (Wang et al., 2016), but fix the bin centers and sizes. This histogram representation is differentiable and its gradient can be back-propagated. The soft histogram function contains a family of linear basis functions $\psi_b$, $b = 1, \ldots, B$, where $B$ is the number of bins in the histogram. Let $\tilde{x}_i$ denote the $i$-th patch of an image in domain $X$; for each pixel $j$ in $\tilde{x}_i$, $\psi_b(\tilde{x}_i(j))$ represents pixel $j$ voting for the $b$-th bin:

$$\psi_b(\tilde{x}_i(j)) = \max\{0,\ 1 - |\tilde{x}_i(j) - \mu_b| \cdot w_b\}, \tag{11}$$

where $\mu_b$ and $w_b$ are the center and width parameters of the $b$-th bin. The representation of $\tilde{x}_i$ in RGB space is the combination of the linear basis functions over all pixels in $\tilde{x}_i$:

$$\phi_h(X, i, b) = \phi_h(\tilde{x}_i, b) = \sum_j \psi_b(\tilde{x}_i(j)), \tag{12}$$

where $\phi_h$ is the RGB histogram feature, $b$ indexes the dimensions of the RGB histogram representation, and $j$ ranges over the pixels in patch $\tilde{x}_i$. The RGB histogram representation $\phi_h(X, i)$ of $\tilde{x}_i$ is a $B$-dimensional vector.

#### 3.3.2 SEMANTIC CNN FEATURES

For some domains we instead use semantic features to acquire higher-level representations of patches. The semantic representations are extracted from a pre-trained Convolutional Neural Network (CNN), which encodes semantically relevant features learned on a large-scale dataset. Through its pooling and strided operators, the CNN extracts semantic information from local patches of the image: each point in a CNN feature map is a semantic descriptor of the corresponding image patch. Moreover, the CNN features are differentiable, so the CNN can be integrated into HarmonicGAN and trained end-to-end. We instantiate the semantic feature $\phi_s$ with a pre-trained CNN model, e.g. VGGNet (Simonyan & Zisserman, 2014). In the implementation, we use the relu4_3 layer of the VGG-16 network to compute the semantic features.
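Below is a sketch of the two patch-feature extractors, assuming 16 bins, 8 × 8 non-overlapping patches, and inputs in $[0, 1]$ (the settings reported in Section 4.2). The torchvision layer index used for relu4_3 and the exact pooling arrangement are assumptions of this sketch.

```python
import torch
import torch.nn.functional as nnf
import torchvision

def soft_histogram_features(img, bins=16, patch=8):
    # Differentiable soft RGB histogram of Eqns. (11)-(12): each pixel
    # votes linearly into its nearby bins, and votes are summed per patch.
    # img: (C, H, W) in [0, 1]; bin centers and widths are fixed.
    centers = (torch.arange(bins, dtype=img.dtype) + 0.5) / bins
    width = 1.0 / bins
    # psi_b(v) = max(0, 1 - |v - mu_b| / width)   (Eqn. (11))
    votes = (1 - (img.unsqueeze(-1) - centers).abs() / width).clamp(min=0)
    # Sum the votes over each non-overlapping patch (Eqn. (12)).
    v = votes.permute(0, 3, 1, 2).reshape(-1, 1, *img.shape[1:])
    pooled = nnf.avg_pool2d(v, patch) * (patch * patch)
    return pooled.reshape(img.shape[0] * bins, -1).t()   # (M, C*bins)

def vgg_patch_features(img, layer=22):
    # Patch descriptors from a pre-trained VGG-16; index 22 corresponds
    # to the relu4_3 activation in torchvision's layer numbering (an
    # assumption of this sketch). Each spatial position of the feature
    # map serves as the descriptor of one image patch.
    vgg = torchvision.models.vgg16(pretrained=True).features[:layer + 1]
    f = vgg.eval()(img.unsqueeze(0))[0]                  # (C', H', W')
    return f.flatten(1).t()                              # (M, C')
```

Either feature matrix can then be fed to the `patch_affinity` sketch of Section 3.1 to build the graph.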
## 4 EXPERIMENTS

We evaluate the proposed method on three applications: medical imaging, semantic labeling, and object transfiguration. We compare against several unpaired image-to-image translation methods: CycleGAN (Zhu et al., 2017a), DiscoGAN (Kim et al., 2017), DistanceGAN (Benaim & Wolf, 2017), and UNIT (Liu et al., 2017). We also provide two user studies as well as qualitative results. The appendix provides additional results and analysis.

### 4.1 DATASETS AND EVALUATION METRICS

Medical imaging. This task evaluates cross-modal medical image synthesis, Flair ↔ T1. The models are trained on the BRATS dataset (Menze et al., 2015), which contains paired MRI data and so allows quantitative evaluation. Following previous work (Cohen et al., 2018), we use a training set of 1400 image slices (50% healthy, 50% with tumors) and a test set of 300, and use their unpaired training scenario. We adopt the Mean Absolute Error (MAE) and Mean Squared Error (MSE) between the generated and real images to evaluate reconstruction error, and further use the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) to evaluate the reconstruction quality of the generated images.

Semantic labeling. We also test our method on the labels ↔ photos task using the Cityscapes dataset (Cordts et al., 2016) under the unpaired setting, as in the original CycleGAN paper. For quantitative evaluation, in line with previous work, for labels → photos we adopt the FCN score (Isola et al., 2017), which evaluates how interpretable the generated photos are to a semantic segmentation algorithm. For photos → labels, we use the standard segmentation metrics: per-pixel accuracy, per-class accuracy, and mean class Intersection-Over-Union (Class IoU).

Object transfiguration. Finally, we test our method on the horse ↔ zebra task using the standard CycleGAN dataset (2401 training images, 260 test images). This task has no quantitative evaluation measure, so we instead provide a user study together with qualitative results.

### 4.2 IMPLEMENTATION DETAILS

We apply the proposed smoothness term within the CycleGAN framework (Zhu et al., 2017a). As in CycleGAN, we adopt the architecture of Johnson et al. (2016) as the generator and the PatchGAN of Isola et al. (2017) as the discriminator. The log-likelihood objective of the original GAN is replaced with a least-squares loss (Mao et al., 2017) for more stable training. We resize the input images to 256 × 256. For the histogram feature, we split the RGB range [0, 255] equally into 16 bins, each covering a range of 16. Images are divided into non-overlapping 8 × 8 patches and the histogram feature is computed on each patch. For the semantic feature, we adopt a VGG network pre-trained on ImageNet and use the feature map of the relu4_3 layer. The loss weights are set to λ_GAN = λ_Smooth = 1 and λ_cyc = 10. Following CycleGAN, we use the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 0.0002; the learning rate is fixed for the first 100 epochs and linearly decayed to zero over the next 100 epochs.
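The optimizer setup can be summarized as the sketch below. The learning rate, loss weights, and schedule follow the paragraph above; the Adam betas and the placeholder modules are assumptions of this sketch.

```python
import itertools
import torch

# Placeholders standing in for the ResNet-based generators (Johnson
# et al., 2016) and PatchGAN discriminators (Isola et al., 2017).
G, F_, D_X, D_Y = (torch.nn.Conv2d(3, 3, 1) for _ in range(4))

# Loss weights as reported above.
LAMBDA_GAN, LAMBDA_SMOOTH, LAMBDA_CYC = 1.0, 1.0, 10.0

# Adam with lr = 0.0002; betas=(0.5, 0.999) is the common CycleGAN
# default and an assumption here.
opt_g = torch.optim.Adam(itertools.chain(G.parameters(), F_.parameters()),
                         lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(itertools.chain(D_X.parameters(), D_Y.parameters()),
                         lr=2e-4, betas=(0.5, 0.999))

# Constant learning rate for the first 100 epochs, then linear decay
# to zero over the next 100 epochs.
lr_lambda = lambda epoch: 1.0 - max(0, epoch - 100) / 100.0
sched_g = torch.optim.lr_scheduler.LambdaLR(opt_g, lr_lambda)
sched_d = torch.optim.lr_scheduler.LambdaLR(opt_d, lr_lambda)
```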
### 4.3 QUANTITATIVE COMPARISON

Medical imaging. Table 1 shows the reconstruction performance on medical image synthesis, Flair ↔ T1. The proposed method yields a large improvement over CycleGAN, with lower MAE and MSE reconstruction errors and higher PSNR and SSIM reconstruction scores, highlighting the significance of the proposed smoothness regularization. The histogram-based and VGG-based variants of HarmonicGAN show similar performance: the reconstruction errors of the histogram-based variant are slightly lower than the VGG-based one for Flair → T1 and slightly higher for T1 → Flair, indicating that both low-level RGB values and high-level CNN features can represent the inherent properties of medical images well and help maintain the smoothness-consistency of samples.

Table 1: Reconstruction evaluation of cross-modal medical image synthesis on the BRATS dataset.

| Method | Flair → T1 MAE | MSE | PSNR | SSIM | T1 → Flair MAE | MSE | PSNR | SSIM |
|---|---|---|---|---|---|---|---|---|
| CycleGAN | 10.47 | 674.40 | 22.35 | 0.80 | 11.81 | 1026.19 | 18.73 | 0.74 |
| DiscoGAN | 10.63 | 641.35 | 20.06 | 0.79 | 10.66 | 839.15 | 19.14 | 0.69 |
| DistanceGAN | 14.93 | 1197.64 | 17.92 | 0.67 | 10.57 | 716.75 | 19.95 | 0.64 |
| UNIT | 9.48 | 439.33 | 22.24 | 0.76 | 6.69 | 261.26 | 25.11 | 0.76 |
| HarmonicGAN (ours), Histogram | 6.38 | 216.83 | 24.34 | 0.83 | 5.04 | 163.29 | 26.72 | 0.75 |
| HarmonicGAN (ours), VGG | 6.86 | 237.94 | 24.14 | 0.81 | 4.69 | 127.84 | 27.22 | 0.76 |

Semantic labeling. We report semantic labeling results in Table 2. The proposed method with VGG features yields a 3% improvement in pixel accuracy for photo → label and shows stable improvements on the other metrics, clearly outperforming all competing methods. The performance with histogram features is slightly below CycleGAN; we hypothesize that this is because objects in photos have large intra-class variance and inter-class similarity in appearance (e.g. cars have different colors, while vegetation and terrain have similar colors), so an RGB-histogram regularization is not well suited to extracting the inherent properties of photos.

Table 2: FCN scores of photo ↔ label translation on the Cityscapes dataset.

| Method | Label → Photo Pixel Acc. | Class Acc. | Class IoU | Photo → Label Pixel Acc. | Class Acc. | Class IoU |
|---|---|---|---|---|---|---|
| CycleGAN | 52.7 | 15.2 | 11.0 | 57.2 | 21.0 | 15.7 |
| DiscoGAN | 45.0 | 11.1 | 7.0 | 45.2 | 10.9 | 6.3 |
| DistanceGAN | 48.5 | 10.9 | 7.3 | 20.5 | 8.2 | 3.4 |
| UNIT | 48.5 | 12.9 | 7.9 | 56.0 | 20.5 | 14.3 |
| HarmonicGAN (ours), Histogram | 52.2 | 14.8 | 10.9 | 56.6 | 20.9 | 15.7 |
| HarmonicGAN (ours), VGG | 55.9 | 17.6 | 13.3 | 59.8 | 22.1 | 17.2 |

### 4.4 USER STUDIES

Medical imaging. We randomly selected 100 images from the BRATS test set. For each image, we showed one radiologist the real ground-truth image, followed by the images generated by CycleGAN, DistanceGAN, and HarmonicGAN (in a different order for each image set to avoid bias). The radiologist was told to evaluate similarity by how likely the images would lead to the same clinical diagnosis, and was asked to rate the similarity of each generation method on a Likert scale from 1 to 5 (1: not similar at all; 5: exactly the same). Results are shown in Table 3. In 95% of cases, the radiologist preferred images generated by our method over the competing methods, and the average Likert score was 4.00 compared to 1.68 for CycleGAN, confirming that our generated images are significantly better. This is important because it confirms that we address the issue raised in a recent paper (Cohen et al., 2018) showing that CycleGAN can learn to accidentally add/remove tumors in images.

Table 3: User study on the BRATS dataset.

| Metric | CycleGAN | DistanceGAN | HarmonicGAN |
|---|---|---|---|
| Prefer [%] | 5 | 0 | 95 |
| Mean Likert | 1.68 | 1.62 | 4.00 |
| Std Likert | 0.99 | 0.95 | 0.88 |

Table 4: User study on the horse → zebra dataset.

| Metric | CycleGAN | DistanceGAN | HarmonicGAN |
|---|---|---|---|
| Prefer [%] | 28 | 0 | 72 |
| Mean Likert | 3.16 | 1.08 | 3.60 |
| Std Likert | 0.81 | 0.23 | 0.78 |

Figure 5: Qualitative comparison for BRATS, Cityscapes and horse ↔ zebra (see appendix for more images).

Object transfiguration. We evaluate our algorithm on horse ↔ zebra with a human perceptual study. We randomly selected 50 images from the horse2zebra test set and showed the input images together with the three generated images from CycleGAN, DistanceGAN, and HarmonicGAN (in random order).
Ten participants were asked to score the generated images on a Likert scale from 1 to 5 (as above). As shown in Table 4, the participants gave the highest score to the proposed method in 72% of cases, significantly more often than to CycleGAN (28% of cases). The average Likert score of our method was 3.60, outperforming CycleGAN's 3.16 and DistanceGAN's 1.08, indicating that our method generates better results.

### 4.5 QUALITATIVE COMPARISON

Medical imaging. Fig. 5 shows a qualitative comparison of the proposed method (HarmonicGAN) and the CycleGAN baseline on Flair → T1. CycleGAN may remove tumors from the original images and add tumors at other locations in the brain. In contrast, our method preserves the locations of tumors, confirming that the harmonic regularization maintains the inherent property of tumor/non-tumor regions and solves the tumor add/remove problem reported by Cohen et al. (2018). More results and analysis are shown in Fig. 6 and Fig. 9.

Object transfiguration. Fig. 5 also shows a qualitative comparison of our method on the horse ↔ zebra task. We observe that it corrects several problems of CycleGAN, leaving the background unchanged and performing more complete transformations. More results and analysis are shown in Fig. 7 and Fig. 10.

## 5 CONCLUSION

We introduce a smoothness term over the sample graph to enforce smoothness-consistency between the source and target domains. We have shown that by introducing additional regularization to enforce consistent mappings during image-to-image translation, the inherent self-consistency property of samples can be maintained. Through quantitative and qualitative comparisons and user studies, we have demonstrated that this yields a significant improvement over the current state-of-the-art methods in a number of applications, including medical imaging, object transfiguration, and semantic labeling. On a medical imaging task in particular, our method provides a very significant improvement over CycleGAN.

## REFERENCES

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML, 2017.

Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graphics (ToG), 2009.

Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR, 2006.

Sagie Benaim and Lior Wolf. One-sided unsupervised domain mapping. In NIPS, 2017.

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834-848, 2018.

Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. In ICCV, 2017.

Joseph Paul Cohen, Margaux Luck, and Sina Honari. Distribution matching losses can hallucinate features in medical image translation. In MICCAI, 2018.

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.

Antonio Criminisi, Patrick Pérez, and Kentaro Toyama. Region filling and object removal by exemplar-based image inpainting. IEEE Trans. Image Processing, 2004.
Salman Ul Hassan Dar, Mahmut Yurt, Levent Karacan, Aykut Erdem, Erkut Erdem, and Tolga Çukur. Image synthesis in multi-contrast MRI with conditional generative adversarial networks. arXiv preprint, 2018.

Alexei A Efros and William T Freeman. Image quilting for texture synthesis and transfer. In Proc. Computer Graphics and Interactive Techniques, 2001.

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.

Aaron Hertzmann, Charles E Jacobs, Nuria Oliver, Brian Curless, and David H Salesin. Image analogies. In Proc. Computer Graphics and Interactive Techniques, 2001.

Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A Efros, and Trevor Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In ICML, 2018.

Xun Huang and Serge J Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, 2017.

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.

Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.

Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. In ICML, 2017.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected CRFs with gaussian edge potentials. In NIPS, 2011.

John Lafferty, Andrew McCallum, and Fernando CN Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.

Di Lin, Jifeng Dai, Jiaya Jia, Kaiming He, and Jian Sun. ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In CVPR, 2016.

Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 2008.

Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In ICCV, 2017.

Bjoern H Menze, Andras Jakab, Stefan Bauer, Jayashree Kalpathy-Cramer, Keyvan Farahani, Justin Kirby, Yuliya Burren, Nicole Porz, Johannes Slotboom, Roland Wiest, et al. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Trans. Medical Imaging, 2015.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 2000.

Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, 2017.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
Meng Tang, Federico Perazzi, Abdelaziz Djelouah, Ismail Ben Ayed, Christopher Schroers, and Yuri Boykov. On regularized losses for weakly-supervised CNN segmentation. In ECCV, 2018.

Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 2000.

Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In CVPR, 2018.

Zhe Wang, Hongsheng Li, Wanli Ouyang, and Xiaogang Wang. Learnable histogram: Statistical context features for deep neural networks. In ECCV, 2016.

Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pp. 639-655. Springer, 2012.

Zili Yi, Hao (Richard) Zhang, Ping Tan, and Minglun Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In ICCV, 2017.

Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint, 2017.

Zizhao Zhang, Lin Yang, and Yefeng Zheng. Translating and segmenting multimodal medical volumes with cycle- and shape-consistency generative adversarial network. In CVPR, 2018.

Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017a.

Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In NIPS, 2017b.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. CycleGAN failure cases. https://github.com/junyanz/CycleGAN#failure-cases, 2018.

Xiaojin Zhu. Semi-supervised learning literature survey. Computer Science, University of Wisconsin-Madison, 2006.

Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In ICML, 2003.

## 6 APPENDIX

### 6.1 COMPARISON TO DISTANCEGAN

The proposed HarmonicGAN may look similar to DistanceGAN (Benaim & Wolf, 2017), but there are in fact large differences between them. DistanceGAN encourages the distance of samples to an absolute mean during translation, whereas HarmonicGAN enforces a smoothness term that arises naturally from a graph Laplacian, making the motivations of the two methods quite different. Comparing the distance constraint in DistanceGAN and the smoothness constraint in HarmonicGAN, we note the following main differences:

(1) They have different motivations and formulations. The distance constraint aims to preserve the distance between samples across the mapping in a direct way, so it minimizes the expected difference between distances in the two domains; it does not apply a graph-based Laplacian to explicitly enforce smoothness. In contrast, the smoothness constraint is derived from a graph Laplacian to build similarity-consistency between image patches.
Thus, the smoothness constraint uses the affinity between two patches as a weight to measure the similarity-consistency between the two domains; the whole idea is based on manifold learning. The smoothness term defines a Laplacian $\Delta = D - W$, where $W$ is our weight matrix and $D$ is a diagonal matrix with $D_{ii} = \sum_j w_{ij}$; the smoothness term therefore defines a graph Laplacian whose minimal value is achieved by a harmonic function.

(2) They differ in implementation. The smoothness constraint in HarmonicGAN is computed on image patches, while the distance constraint in DistanceGAN is computed on whole image samples; the smoothness constraint is therefore fine-grained compared to the distance constraint. Moreover, the distances in DistanceGAN are computed directly from the samples in each domain, scaled by the pre-computed means and standard deviations of the two domains to reduce the effect of the gap between them. In contrast, the smoothness constraint in HarmonicGAN is measured on the features (histogram or CNN features) of each patch, which maps samples from the two domains into the same feature space and removes the gap between the domains.

(3) They show different results. Fig. 6 shows qualitative results of CycleGAN, DistanceGAN and the proposed HarmonicGAN on the BRATS dataset. The problem of randomly adding/removing tumors in the translations of CycleGAN is still present in the results of DistanceGAN, while HarmonicGAN corrects the locations of tumors. Table 1 shows the quantitative results on the whole test set, which supports the same conclusion: the results of DistanceGAN on all four metrics are even worse than CycleGAN's, while HarmonicGAN yields a large improvement over CycleGAN.

Figure 6: Comparison of CycleGAN, DistanceGAN and the proposed HarmonicGAN on the BRATS dataset (real Flair source, translated results, real T1 target).

### 6.2 COMPARISON TO CRF

There are fundamental differences between the CRF literature and our work: they differ in output space, mathematical formulation, application domain, effectiveness, and role in the overall algorithm. The similarity between CRFs and HarmonicGAN lies in the adoption of a regularization term: a binary term in the CRF case and a Laplacian term in HarmonicGAN. The smoothness term in HarmonicGAN is not about obtaining smoother images/labels in the translated domain, as seen in the experiments; instead, HarmonicGAN is about preserving the overall integrity of the translation itself over the image manifold. This is the main reason for the large improvement of HarmonicGAN over CycleGAN.

To further demonstrate the difference between HarmonicGAN and CRFs, we perform an experiment applying the pairwise regularization of CRFs within the CycleGAN framework: for each pixel of the generated image, we compute the unary term and the binary term with its 8 neighbors, and then minimize the CRF objective function. The results are shown in Table 5. The pairwise regularization of CRFs is unable to handle the problem of CycleGAN illustrated in Fig. 1. Worse, the pairwise regularization may over-smooth the boundaries of generated images, which introduces additional artifacts. In contrast, HarmonicGAN aims at preserving similarity from the overall view of the image manifold, and can thus exploit the similarity-consistency of the generated images rather than over-smoothing boundaries.

Table 5: Reconstruction evaluation on the BRATS dataset with comparison to CRF.

| Method | Flair → T1 MAE | MSE | T1 → Flair MAE | MSE |
|---|---|---|---|---|
| CycleGAN | 10.47 | 674.40 | 11.81 | 1026.19 |
| CycleGAN+CRF | 11.24 | 839.47 | 12.25 | 1138.42 |
| HarmonicGAN-Histogram (ours) | 6.38 | 216.83 | 5.04 | 163.29 |
| HarmonicGAN-VGG (ours) | 6.86 | 237.94 | 4.69 | 127.84 |
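As a side note, the graph-Laplacian view of Section 6.1 can be checked numerically. The snippet below verifies the classical identity linking weighted pairwise distances to the quadratic form of $\Delta = D - W$; the identity holds for squared Euclidean distances (the setting of harmonic functions in Zhu et al. (2003)), and the toy sizes and random affinities are illustrative only.

```python
import torch

# For a symmetric weight matrix W, degree matrix D with D_ii = sum_j w_ij,
# and Laplacian L = D - W, per-patch features f satisfy
#   sum_ij w_ij * ||f_i - f_j||^2 = 2 * tr(F^T L F).
M, C = 6, 4                        # toy setup: 6 patches, 4-dim features
feats = torch.rand(M, C)
W = torch.rand(M, M)
W = (W + W.t()) / 2                # symmetrize the affinities
L = torch.diag(W.sum(dim=1)) - W   # graph Laplacian: Delta = D - W

pairwise = ((feats[:, None] - feats[None, :]) ** 2).sum(-1)  # ||f_i - f_j||^2
lhs = (W * pairwise).sum()
rhs = 2 * torch.trace(feats.t() @ L @ feats)
assert torch.allclose(lhs, rhs, atol=1e-5)  # the two sides agree
```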
### 6.3 ADDITIONAL EXPERIMENTS

Figure 7: Comparison on horse → zebra for the Putin photo (input, CycleGAN, HarmonicGAN). CycleGAN translates both the background and the human to a zebra-like texture. In contrast, HarmonicGAN does better in the background region and achieves an improvement in some regions of the human (Putin's face), but it still fails on the human body. We hypothesize this is because the semantic features used by HarmonicGAN have not been trained on humans without a shirt.

Figure 8: For similar image patches in input images, visualization of the average distance of the corresponding image patches in the translated images, for cross-modal medical image synthesis on the BRATS dataset. The distance of image patches in the target space reveals the inconsistently translated patches.

Figure 9: Comparison on the BRATS dataset (real sources, CycleGAN and HarmonicGAN outputs, real T1 targets).

Figure 10: Comparison on horse ↔ zebra (real zebra/horse inputs, CycleGAN and HarmonicGAN outputs).

Figure 11: Additional results of the proposed HarmonicGAN (input, CycleGAN, HarmonicGAN). From top to bottom: apple to orange, orange to apple, facade to label, label to facade, aerial to map, map to aerial, summer to winter, winter to summer.