# Unsupervised Attention-guided Image-to-Image Translation

Youssef A. Mejjati (University of Bath, yam28@bath.ac.uk), Christian Richardt (University of Bath, christian@richardt.name), James Tompkin (Brown University, james_tompkin@brown.edu), Darren Cosker (University of Bath, D.P.Cosker@bath.ac.uk), Kwang In Kim (University of Bath, k.kim@bath.ac.uk)

Current unsupervised image-to-image translation techniques struggle to focus their attention on individual objects without altering the background or the way multiple objects interact within a scene. Motivated by the important role of attention in human perception, we tackle this limitation by introducing unsupervised attention mechanisms that are jointly adversarially trained with the generators and discriminators. We demonstrate qualitatively and quantitatively that our approach attends to relevant regions in the image without requiring supervision, which creates more realistic mappings than those of recent approaches.

[Figure 1: Translation results for an input image, comparing Ours, CycleGAN [1], RA [2], DiscoGAN [3], UNIT [4], and DualGAN [5].] By explicitly modeling attention, our algorithm is able to better alter the object of interest in unsupervised image-to-image translation tasks, without changing the background at the same time.

1 Introduction

Image-to-image translation is the task of mapping an image from a source domain to a target domain. Applications include image colorization [6], image super-resolution [7, 8], style transfer [9], domain adaptation [10] and data augmentation [11]. Many approaches require data from each domain to be paired or under alignment, e.g., when translating satellite images to topographic maps, which restricts applications and may not even be possible for some domains. Unsupervised approaches, such as DiscoGAN [3] and CycleGAN [1], overcome this problem with cyclic losses which encourage the translated domain to be faithfully reconstructed when mapped back to the original domain.

Existing algorithms feed an input image to an encoder-decoder-like neural network architecture called the generator, which tries to translate the image. Then, this output is fed to a discriminator which attempts to classify whether the output image has indeed been translated. In these generative adversarial networks (GANs), the quality of the generated images improves as the generator and discriminator compete to reach the Nash equilibrium expressed by the minimax loss of the training procedure [12].

However, these approaches are limited by the system's inability to attend only to specific scene objects. In the unsupervised case, where images are not paired or aligned, the network must additionally learn which parts of the scene are intended to be translated. For instance, in Figure 1, a convincing translation between the horse and zebra domains requires the network to attend to each animal and change only those parts of the image. This is challenging for existing approaches, even if they use a localized loss like PatchGAN [13], as the network itself has no explicit attention mechanism. Instead, they typically aim to minimize the divergence between the underlying data-generating distributions of the entire image in the source and target domains. To overcome this limitation, we propose to minimize the divergence between only the relevant parts of the data-generating distributions for the source and target domains.
For this, we find inspiration from attentional mechanisms in human perception [14], and from their successful application in machine learning [2, 15]. We add an attention network to each generator in the CycleGAN setup. These are jointly trained to produce attention maps for the regions that the discriminator considers most discriminative between the source and target domains. These maps are then applied to the input of the generator to constrain it to relevant image regions. The whole network is trained end-to-end with no additional supervision. We qualitatively and quantitatively show that explicitly incorporating attention into image translation networks significantly improves the quality of translated images (see Figure 1).

2 Related work

Image-to-image translation. Contemporary image-to-image translation approaches leverage the powerful ability of deep neural networks to build meaningful representations. Specifically, GANs have proven to be the gold standard for achieving appealing image-to-image translation results. For instance, Isola et al.'s pix2pix algorithm [9] uses a GAN conditioned on the source image and imposes an L1 loss between the generated image and its ground-truth map. This requires the existence of ground-truth paired images from each of the source and target domains. Zhu et al.'s unpaired image-to-image translation network [1] builds upon pix2pix and removes the paired input data burden by imposing that each image should be reconstructed correctly when translated twice, i.e., when mapped from source to target to source. These maps must conserve the overall structure and content of the image. DiscoGAN [3] and DualGAN [5] use the same principle, but with different losses, making them more or less robust to changes in shape.

Some unsupervised translation approaches assume the existence of a shared latent space between source and target domains. Liu and Tuzel's coupled GAN (CoGAN) [16] learns an estimate of the joint data-generating distribution using samples from the marginals, by enforcing source and target discriminators and generators to share parameters in low-level layers. Liu et al.'s unsupervised image-to-image translation networks (UNIT) [4] build upon CoGAN by assuming the existence of a shared low-dimensional latent space between the source and target domains. Once an image is mapped to its latent representation, a generator decodes it into its target domain version. Huang et al.'s multi-modal UNIT (MUNIT) framework [17] extends this idea to multi-modal image-to-image translation by assuming two latent representations: one for style and one for content. The cross-domain image translation is then performed by combining different content and style representations.

Given input images depicting objects at multiple scales, the aforementioned approaches are sometimes able to translate the foreground. However, they generally also affect the background in unwanted ways, leading to unrealistic translations. We demonstrate that our algorithm is able to overcome this limitation by incorporating attention into the image translation framework. Attending to specific regions within image translation has recently been explored by Ma et al. [18], who attempt to decouple local textures from holistic shapes by attending to local objects of interest (e.g., eyes, nose, and mouth in a face); this is manifested through attention maps as individual square image regions.
This limits the approach, as (1) it assumes that all objects are the same size, corresponding to the sizes of the square attention maps, and (2) it involves tuning hyper-parameters for the number and size of the square regions. As a consequence, this approach cannot straightforwardly deal with image translation without altering the background.

Attention learning. Attention learning has benefited from advances in deep learning. Contemporary approaches use convolution-deconvolution networks trained on ground-truth masks [19], and combine these architectures with recurrent attention models. Specifically, Kuen et al.'s saliency detection [20] uses recurrent neural networks (RNNs) to adaptively select a sequence of local regions in the input image for saliency estimation. These local estimates are then combined into a global estimate. Such approaches cannot be applied in our setting, since they require supervision.

Unsupervised attention learning includes Mnih et al.'s recurrent model of visual attention [15], which uses only a few learned square regions of the image, trained from classification labels. This approach is not differentiable and requires training with reinforcement learning, which is not straightforward to apply to our problem. More recently, attention has been enforced on activation functions to select only task-relevant features [2, 21]. However, we show in experiments that our approach of enforcing attention on the input image provides better results for image-to-image translation.

Learning attention also encourages the generation of more realistic images compared to classic vanilla GANs. For example, Zhang et al.'s self-attention GANs [22] constrain the generator to gradually consider non-local relationships in the feature space by using unsupervised attention, which produces globally realistic images. Yang et al.'s recursive approach [23] generates images by decoupling the generation of the foreground and background in a sequential manner; however, its extension to image-to-image translation is not straightforward, as in that case we only care about modifying the foreground. Attention has also been used for video generation [24], where a binary mask is learned to distinguish between dynamic and static regions in each frame of a generated video. These generated masks are trained to detect unrealistic motions and patterns in the generated frames, whereas our attention network is trained to find the most discriminative regions which characterize a given image domain. Finally, Chen et al.'s contemporaneous work shares our goal of learning an attention map for image translation [25]; we discuss the differences between our methods after explaining our approach (see Section 4).

3 Our approach

The goal of image translation is to estimate a map F_{S→T} from a source image domain S to a target image domain T, based on independently sampled data instances X_S and X_T, such that the distribution of the mapped instances F_{S→T}(X_S) matches the probability distribution P_T of the target. Our starting point is Zhu et al.'s CycleGAN approach [1], which also learns a domain inverse F_{T→S} to enforce cycle consistency: F_{T→S}(F_{S→T}(X_S)) ≈ X_S. Training the transfer network F_{S→T} requires a discriminator D_T that tries to distinguish the translated outputs from the observed instances X_T.
For cycle consistency, the inverse map F_{T→S} and the corresponding discriminator D_S are trained simultaneously. Solving this problem requires solving two equally important tasks: (1) locating the areas to translate in each image, and (2) applying the right translation to the located areas. We achieve this by adding two attention networks A_S and A_T, which select areas to translate by maximizing the probability that the discriminator makes a mistake. We denote A_S : S → S_a and A_T : T → T_a, where S_a and T_a are the attention maps induced from S and T, respectively. Each attention map contains per-pixel estimates in [0,1]. After feeding the input image to the generator, we apply the learned mask to the generated image using an element-wise product ⊙, and then add the background using the inverse of the mask applied to the input image. As such, A_S and A_T are trained in tandem with the generators; Figure 2 visualizes this process. Henceforth, we describe only the map F_{S→T}; the inverse map F_{T→S} is defined similarly.

[Figure 2: Data-flow diagram from the source domain S to the target domain T during training. The roles of S and T are symmetric in our network, so that data also flows in the opposite direction T → S.]

3.1 Attention-guided generator

First, we feed the input image s ∈ S into the generator F_{S→T}, which maps s to the target domain T. Then, the same input is fed to the attention network A_S, resulting in the attention map s_a = A_S(s). To create the foreground object s_f ∈ T, we apply s_a to F_{S→T}(s) via an element-wise product on each RGB channel: s_f = s_a ⊙ F_{S→T}(s) (Figure 2 shows an example). Finally, we create the background image s_b = (1 − s_a) ⊙ s, and add it to the masked output of the generator F_{S→T}. Thus, the mapped image s′ is obtained by:

$$
s' = \underbrace{s_a \odot F_{S\to T}(s)}_{\text{Foreground}} \;+\; \underbrace{(1 - s_a) \odot s}_{\text{Background}}. \tag{1}
$$

Attention map intuition. The attention network A_S plays a key role in Equation 1. If the attention map s_a were replaced by all ones, marking the entire image as relevant, then we would obtain CycleGAN as a special case of our approach. If s_a were all zeros, then the generated image would be identical to the input image due to the background term in Equation 1, and the discriminator would never be fooled by the generator. If s_a attends to an image region without a relevant foreground instance to translate, then the result s′ preserves its source domain class (i.e., a horse will remain a horse). In other words, the image parts which most describe the domain remain unchanged, which makes it straightforward for the discriminator D_T to detect the image as a fake. Therefore, the only way to find an equilibrium between generator F_{S→T}, attention network A_S, and discriminator D_T is for A_S to focus on the objects or areas that the corresponding discriminator thinks are the most descriptive within its domain (i.e., the horses). The discriminator mechanism which makes GAN generators produce realistic images also makes our attention networks find the domain-descriptive objects in the images.

The attention map is continuous in [0,1], i.e., it is a matte rather than a segmentation mask. This is valuable for three reasons: (1) it makes estimating the attention maps differentiable, and so able to train at all; (2) it allows the network to be uncertain about attention during the training process, which allows convergence; and (3) it allows the network to learn how to compose edges, which otherwise might make the foreground object look stuck on or produce fringing artifacts.
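To make the compositing in Equation 1 concrete, the following is a minimal PyTorch-style sketch of the attention-guided generator output. It is an illustration under stated assumptions rather than the authors' released implementation: the function name attention_guided_translate and the handles F_st and A_s are placeholders for the generator and attention networks.

```python
import torch

def attention_guided_translate(s, F_st, A_s):
    """Sketch of Equation 1: blend the translated foreground with the
    original background using the learned attention map.

    s    -- source image batch of shape (B, 3, H, W)
    F_st -- generator mapping domain S to domain T, returns (B, 3, H, W)
    A_s  -- attention network, returns a (B, 1, H, W) map with values in [0, 1]
    """
    s_a = A_s(s)                    # continuous per-pixel attention matte
    foreground = s_a * F_st(s)      # translated content in attended regions
    background = (1.0 - s_a) * s    # input pixels pass through elsewhere
    return foreground + background  # mapped image s'
```

Here the single-channel attention map broadcasts across the RGB channels, which matches the per-channel element-wise product described above.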
Loss function. This process is governed by the adversarial energy:

$$
\mathcal{L}^s_{adv}(F_{S\to T}, A_S, D_T) = \mathbb{E}_{t \sim P_T(t)}\!\left[\log D_T(t)\right] + \mathbb{E}_{s \sim P_S(s)}\!\left[\log\!\left(1 - D_T(s')\right)\right]. \tag{2}
$$

In addition, and similarly to CycleGAN, we add a cycle-consistency loss to the overall framework by enforcing a one-to-one mapping between s and the output of its inverse mapping s″:

$$
\mathcal{L}^s_{cyc}(s, s'') = \lVert s - s'' \rVert_1, \tag{3}
$$

where s″ is obtained from s′ via F_{T→S} and A_T, similarly to Equation 1. This added loss makes our framework more robust in two ways: (1) it enforces the attended regions in the generated image to conserve content (e.g., pose), and (2) it encourages the attention maps to be sharp (converging towards a binary map), as the cycle-consistency loss of unattended areas will always be zero. Further, when computing s″, we use the attention map extracted from A_T(s′). This adds another consistency requirement, as the generated attention maps produced by A_S and A_T for s and s′, respectively, should match to minimize Equation 3.

We obtain the final energy to optimize by combining the adversarial and cycle-consistency losses for both source and target domains:

$$
\mathcal{L}(F_{S\to T}, F_{T\to S}, A_S, A_T, D_S, D_T) = \mathcal{L}^s_{adv} + \mathcal{L}^t_{adv} + \lambda_{cyc}\left(\mathcal{L}^s_{cyc} + \mathcal{L}^t_{cyc}\right), \tag{4}
$$

where we use the loss hyper-parameter λ_cyc = 10 throughout our experiments. The optimal parameters of L are obtained by solving the minimax optimization problem:

$$
F^*_{S\to T}, F^*_{T\to S}, A^*_S, A^*_T, D^*_S, D^*_T = \arg\min_{F_{S\to T}, F_{T\to S}, A_S, A_T} \; \arg\max_{D_S, D_T} \; \mathcal{L}(F_{S\to T}, F_{T\to S}, A_S, A_T, D_S, D_T). \tag{5}
$$
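As a concrete illustration of how Equations 2-4 fit together, here is a hedged sketch of the generator-side objective for the S→T direction, reusing the attention_guided_translate helper sketched above. The handles F_st, F_ts, A_s, A_t and D_t are placeholders, D_t is assumed to output probabilities, and the binary cross-entropy below follows the log form of Equation 2; an actual implementation might instead use a least-squares adversarial term.

```python
import torch
import torch.nn.functional as F

LAMBDA_CYC = 10.0  # cycle-consistency weight lambda_cyc from Equation 4

def generator_objective(s, F_st, F_ts, A_s, A_t, D_t):
    """Generator-side terms of Equations 2-4 for the S -> T direction;
    the T -> S direction is symmetric."""
    s_prime = attention_guided_translate(s, F_st, A_s)        # Equation 1
    s_cycle = attention_guided_translate(s_prime, F_ts, A_t)  # s'' via F_ts and A_T(s')

    # Adversarial term (non-saturating form): the generator and attention
    # network try to make D_t label the translated image as real.
    pred_fake = D_t(s_prime)
    adv = F.binary_cross_entropy(pred_fake, torch.ones_like(pred_fake))

    # Cycle-consistency term (Equation 3): L1 distance between s and s''.
    cyc = torch.mean(torch.abs(s - s_cycle))

    return adv + LAMBDA_CYC * cyc
```

The discriminators are updated in alternation with the generators and attention networks, following the minimax structure of Equation 5.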
3.2 Attention-guided discriminator

Equation 1 constrains the generators to act only on attended regions: as the attention networks train to become more accurate at finding the foreground, the generator improves at translating just the object of interest between domains, e.g., from horse to zebra. However, there is a tension: the whole-image discriminators look (implicitly) at the distribution of backgrounds with respect to the translated foregrounds. For instance, one observes that the translated horse now looks correctly like a zebra, but also that the overall scene is fake, because the background still shows where horses live (meadows) and not where zebras live (savannas). In this sense, we really are trying to make a fake image which does not match either underlying probability distribution P_S or P_T. This tension manifests itself in two behaviors: (1) the generator F_{S→T} tries to paint background directly into the attended regions, and (2) the attention map slowly includes more and more background, converging towards a fully attended map (all values in the map converge to 1). Our supplemental material provides example cases (last column in Figure 2; ablation studies "Ours-D" and "Ours-D-A" in Figure 5).

To overcome this, we train the discriminator such that it only considers attended regions. Simply using s_a ⊙ s is problematic, as real samples fed to the discriminator would now depend on the initially untrained attention map s_a. This leads to mode collapse if all networks in the GAN are trained jointly. To overcome this issue, we first train the discriminators on full images for 30 epochs, and then switch to masked images once the attention networks A_S and A_T have developed. Further, with a continuous attention map, the discriminator may receive fractional pixel values, which may be close to zero early in training. While the generator benefits from being able to blend pixels at object boundaries, multiplying real images by these fractional values causes the discriminator to learn that mid-gray is real (i.e., we push the answer towards the midpoint 0 of the normalized [−1,1] pixel space). Thus, we threshold the learned attention map for the discriminator:

$$
t_{new} = \begin{cases} t & \text{if } A_T(t) > \tau \\ 0 & \text{otherwise} \end{cases}
\quad\text{and}\quad
s'_{new} = \begin{cases} F_{S\to T}(s) & \text{if } A_S(s) > \tau \\ 0 & \text{otherwise,} \end{cases} \tag{6}
$$

where t_new and s′_new are masked versions of the target sample t and the translated source sample s′, which only contain pixels exceeding a user-defined attention threshold τ, which we set to 0.1 (Figure 3 in the supplemental material justifies this choice). Moreover, we find that removing instance normalization from the discriminator at this stage is helpful, as we do not want its final prediction to be influenced by zero values coming from the background. Thus, we update the adversarial energy L_adv of Equation 2 to:

$$
\mathcal{L}^s_{adv}(F_{S\to T}, A_S, D_T) = \mathbb{E}_{t \sim P_T(t)}\!\left[\log D_T(t_{new})\right] + \mathbb{E}_{s \sim P_S(s)}\!\left[\log\!\left(1 - D_T(s'_{new})\right)\right]. \tag{7}
$$

Algorithm 1 summarizes the training procedure for learning F_{S→T}; training F_{T→S} is similar. Our supplemental material provides details of the individual network configurations.

Algorithm 1: Training procedure for the source-to-target map F_{S→T}.
Input: X_S, X_T, K (number of epochs), λ_cyc (cycle-consistency weight), α (Adam learning rate).
1: for c = 0 to K − 1 do
2:   for i = 0 to |X_S| − 1 do
3:     Sample a data point s from X_S and a data point t from X_T.
4:     if c < 30 then
5:       Compute s′ using Equation 1.
6:       Update the parameters of F_{S→T}, D_T, and A_S using Equation 4 with learning rate α.
7:     else
8:       Compute s′_new and t_new using Equation 6.
9:       Update the parameters of F_{S→T} and D_T using Equations 4 and 7 with learning rate α.
10:    end if
11:  end for
12: end for
Output: Trained networks F*_{S→T}, A*_S and D*_T.

When optimizing the objective in Equation 7 beyond 30 epochs, real image inputs to the discriminator also depend on the learned attention maps. This can lead to mode collapse if the training is not performed carefully. For instance, if the mask returned by the attention network is always zero, then the generator will always create "real" images from the point of view of the discriminator, as the masked sample t_new in Equation 7 would be all black. We avoid this situation by stopping the training of both A_S and A_T after 30 epochs (Figure 2 in the supplemental material justifies this hyper-parameter choice).
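The following is a minimal sketch of the discriminator-side masking of Equation 6 and of the epoch-dependent switch in Algorithm 1, reusing the attention_guided_translate helper from the Section 3.1 sketch. The threshold value and the 30-epoch switch come from the text above; the function names and assumed tensor shapes are illustrative, not the authors' implementation.

```python
import torch

TAU = 0.1           # attention threshold tau from Equation 6
SWITCH_EPOCH = 30   # epoch after which masked inputs are used and A_S, A_T are frozen

def masked_discriminator_inputs(s, t, F_st, A_s, A_t):
    """Sketch of Equation 6: zero out pixels whose attention does not exceed
    tau before the samples are shown to the discriminator D_T."""
    fake_mask = (A_s(s) > TAU).float()   # binary mask from the source attention map
    real_mask = (A_t(t) > TAU).float()   # binary mask from the target attention map
    s_new = fake_mask * F_st(s)          # masked translated source sample s'_new
    t_new = real_mask * t                # masked real target sample t_new
    return s_new, t_new

def discriminator_inputs(epoch, s, t, F_st, A_s, A_t):
    """Epoch-dependent inputs to D_T, mirroring the switch in Algorithm 1."""
    if epoch < SWITCH_EPOCH:
        # Early training: full images, so the still-untrained attention maps
        # cannot collapse the discriminator's real/fake signal.
        return attention_guided_translate(s, F_st, A_s), t
    # Later training: thresholded, attention-masked images (Equation 7);
    # A_S and A_T are no longer updated at this stage.
    return masked_discriminator_inputs(s, t, F_st, A_s, A_t)
```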
4 Experiments

Baselines. We compare to DiscoGAN [3] and CycleGAN [1], which are similar but use different losses: DiscoGAN uses a standard GAN loss [12], and CycleGAN uses a least-squares GAN loss [26]. We also compare with DualGAN [5], which is similar to CycleGAN but uses a Wasserstein GAN loss [27]. Additionally, we compare with Liu et al.'s UNIT algorithm [4], which leverages the latent space assumption between each pair of source/target images. Finally, we compare with Wang et al.'s attention module [2] by incorporating it after the first layer of our generators; we refer to this implementation as "RA".

Datasets. We use the Apple to Orange (A↔O) and Horse to Zebra (H↔Z) datasets provided by Zhu et al. [1], and the Lion to Tiger (L↔T) dataset obtained from the corresponding classes in the Animals With Attributes (AWA) dataset [28]. These datasets contain objects at different scales across different backgrounds, which makes the image-to-image translation setting more challenging. Note that for the Lion to Tiger mapping, we did not find it necessary to apply the attention-guided discriminator.

Qualitative results. Observing our learned attention maps, we can see that our approach is able to learn relevant image regions and ignore the background (Figure 3). When an input image does not contain any elements of the source domain, our approach does not attend to it, and so successfully leaves the image unedited. Holistic image translation approaches, on the other hand, are misled by irrelevant background content and so incorrectly hallucinate texture patterns of the target objects (last two rows of Figure 5).

[Figure 3: Input source images (top row) and their corresponding estimated attention maps (below). These reflect the discriminative areas between the source and target domains. The right side of the figure shows source and target attention maps, trained on horses and zebras, respectively, when applied to images without horse or zebra. The lack of attention suggests appropriate attention network behavior.]

[Figure 4: Image translation results for mapping apples to oranges, and our learned attention. Columns: Input, Our Attention, Ours, CycleGAN [1], RA [2], DiscoGAN [3], UNIT [4], DualGAN [5].]

[Figure 5: Translation results. Columns: Input, Ours, CycleGAN [1], RA [2], DiscoGAN [3], UNIT [4], DualGAN [5]. From top to bottom: Z→H, Z→H, H→Z, H→Z, A→O, O→A, L→T, and T→L. Below the line: image translation in the absence of the source domain class (Z→H).]

Among competing approaches, DiscoGAN struggles to separate the background and foreground content (see Figures 1, 4 and 5). We believe this is partly because its cycle-consistency energy is given the same weight as the GAN's adversarial energy. DualGAN produces slightly better results, although the background is still heavily altered. For example, the first row of Figure 1 contains undesirable zebra patterns in the background. CycleGAN produces more visually appealing results with its least-squares GAN and appropriate weighting between the adversarial and cycle-consistency losses, even though some elements of the background are still altered. For instance, CycleGAN alters the writing on the chalkboard in the last row of Figure 4, and generates a blue-grey lion in the first row of Figure 5 when asked to translate the zebra pinned down by the lion. The UNIT algorithm uses the shared latent space assumption between source and target domains to be robust to changes in geometric shape. For example, in the 7th row of Figure 5, we can see that the face of the lion cub is mapped to a tiger; however, the overall image is not realistic. Finally, incorporating residual attention (RA) modules into the image translation framework does not improve the generated image quality, which validates our choice of applying attention to images instead of to activation functions. This is particularly noticeable when the input source image does not contain any relevant object, as in Figure 5 (bottom). In this case, existing algorithms are misled by irrelevant background content and incorrectly hallucinate texture patterns of the target objects.
By learning attention maps, our algorithm successfully ignores background content and reproduces the input images. One limitation of our approach is visible in the third-to-last row of Figure 5, which contains an albino tiger. In this challenging case of an object with outlier appearance within its domain, our attention network fails to identify the tiger as foreground, and so our network changes the background image content, too. However, overall, our approach of learning attention maps within unsupervised image-to-image translation obtains more realistic results, particularly for datasets containing objects at multiple scales and with different backgrounds.

Quantitative results. We use the recently proposed Kernel Inception Distance (KID) [29] to quantitatively evaluate our image translation framework. KID computes the squared maximum mean discrepancy (MMD) between feature representations of real and generated images. These feature representations are extracted from the Inception network architecture [30]. In contrast to the Fréchet Inception Distance [31], KID has an unbiased estimator, which makes it more reliable, especially when there are fewer test images than the dimensionality of the Inception features. While KID is not bounded, the lower its value, the more visual similarities are shared between real and generated images. As we wish the foreground of mapped images to be in the target domain T and the background to remain in the source domain S, a good mapping should have a low KID value when computed using both the target and the source domains. Therefore, we report the mean KID value computed between generated samples and both the source and target domains in Table 1. Further, to ensure consistency, the reported mean KID values are averaged over 10 different splits of size 50, randomly sampled from each domain.

Table 1: Kernel Inception Distance × 100 ± std. × 100 for different image translation algorithms. Lower is better. Abbreviations: (A)pple, (O)range, (H)orse, (Z)ebra, (T)iger, (L)ion.

| Algorithm | A→O | O→A | Z→H | H→Z | L→T | T→L |
| --- | --- | --- | --- | --- | --- | --- |
| DiscoGAN [3] | 18.34 ± 0.75 | 21.56 ± 0.80 | 16.60 ± 0.50 | 13.68 ± 0.28 | 16.10 ± 0.55 | 19.97 ± 0.09 |
| RA [2] | 12.75 ± 0.49 | 13.84 ± 0.78 | 10.97 ± 0.26 | 10.16 ± 0.12 | 9.98 ± 0.13 | 12.68 ± 0.07 |
| DualGAN [5] | 13.04 ± 0.72 | 12.42 ± 0.88 | 12.86 ± 0.50 | 10.38 ± 0.31 | 10.18 ± 0.15 | 10.44 ± 0.04 |
| UNIT [4] | 11.68 ± 0.43 | 11.76 ± 0.51 | 13.63 ± 0.34 | 11.22 ± 0.24 | 11.00 ± 0.09 | 10.23 ± 0.03 |
| CycleGAN [1] | 8.48 ± 0.53 | 9.82 ± 0.51 | 11.44 ± 0.38 | 10.25 ± 0.25 | 10.15 ± 0.08 | 10.97 ± 0.04 |
| Ours | 6.44 ± 0.69 | 5.32 ± 0.48 | 8.87 ± 0.26 | 6.93 ± 0.27 | 8.56 ± 0.16 | 9.17 ± 0.07 |

Our approach achieves the lowest KID score in all the mappings, with CycleGAN as the next best performing approach. UNIT achieves the third-lowest KID score, which suggests that the latent space assumption is useful in our setting. Using a Wasserstein GAN loss allows DualGAN to follow closely behind. The CycleGAN variant using residual attention modules (RA) produces worse results than regular CycleGAN but is comparable to UNIT, which suggests that applying attention in the feature space does not considerably improve performance. Finally, by giving the same weight to the adversarial and cyclic energies, DiscoGAN achieves the worst performance in terms of mean KID values, which is consistent with our qualitative results.
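For reference, KID can be computed as an unbiased estimate of the squared MMD between Inception features of real and generated images, typically with a cubic polynomial kernel [29]. The sketch below illustrates this under those assumptions; feature extraction with the Inception network is assumed to happen elsewhere, and the exact evaluation protocol (feature layer, block averaging) should follow Bińkowski et al. [29].

```python
import numpy as np

def polynomial_kernel(x, y):
    """Cubic polynomial kernel k(a, b) = (a.b / d + 1)^3, where d is the
    feature dimensionality (the kernel commonly used for KID)."""
    d = x.shape[1]
    return (x @ y.T / d + 1.0) ** 3

def kid(real_feats, fake_feats):
    """Unbiased squared-MMD estimate between Inception feature arrays of
    shape (n_real, d) and (n_fake, d)."""
    m, n = len(real_feats), len(fake_feats)
    k_rr = polynomial_kernel(real_feats, real_feats)
    k_ff = polynomial_kernel(fake_feats, fake_feats)
    k_rf = polynomial_kernel(real_feats, fake_feats)
    # Exclude diagonal (self-similarity) terms for the unbiased within-set averages.
    mmd2 = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1)) \
         + (k_ff.sum() - np.trace(k_ff)) / (n * (n - 1)) \
         - 2.0 * k_rf.mean()
    return mmd2
```

In Table 1, this quantity is scaled by 100 and averaged over 10 random splits of 50 images per domain.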
Ablation Study. First, we evaluate the cycle-consistency loss governed by Equation 3. This is motivated by using attention to constrain the mapping between only relevant instances, which can be considered a weak form of cycle consistency. The cycle-consistency loss plays an important role in making attention maps sharp; without it, we notice an onset of mode collapse in GAN training. As a result, we obtain a model ("Ours-cycle") with very high KID (Table 2).

Table 2: Kernel Inception Distance × 100 ± std. × 100 for ablations of our algorithm. Lower is better. Abbreviations: (H)orse, (Z)ebra.

| Algorithm | Z→H | H→Z |
| --- | --- | --- |
| Ours-cycle | 64.55 ± 0.34 | 41.48 ± 0.34 |
| Ours-cycleAtt | 9.46 ± 0.38 | 7.79 ± 0.23 |
| Ours-A_S | 10.90 ± 0.25 | 7.62 ± 0.25 |
| Ours-A_T | 9.30 ± 0.45 | 7.80 ± 0.21 |
| Ours-D | 9.26 ± 0.22 | 7.77 ± 0.35 |
| Ours-D-A | 9.86 ± 0.32 | 8.28 ± 0.34 |
| Ours | 8.87 ± 0.26 | 6.93 ± 0.27 |

Next, we test the effect of computing attention on the inverse mapping. Instead of computing a new attention map A_T(s′), we use the formerly computed A_S(s). This model ("Ours-cycleAtt") performs worse, because computing attention on both the mapping and its inverse indirectly enforces similarity between the two attention maps A_T(s′) and A_S(s). Further, we evaluate behavior with only a single attention network: "Ours-A_S" and "Ours-A_T", corresponding to A_S and A_T, respectively. These approaches are the best performing after our final implementation: A_S acts on s, but also on t via the inverse mapping, which influences the generators to still only translate relevant regions. Moreover, we measure the importance of our attention-guided discriminator by replacing it with a whole-image discriminator while stopping the training of the attention networks ("Ours-D"). For this model, mean KID values are higher than for our final formulation because the generator tries to paint elements of the background onto the foreground to compensate for the variance between foreground and background in the source and target domains.

Finally, we consider the contemporaneous Attention-GAN of Chen et al. [25], which also learns an attention map for image translation through a cyclic loss. We compare their approach using an ablated version of our software implementation, as we await a code release from the authors for a direct results comparison. Our approach differs in two ways: first, we feed the holistic image to the discriminator for the first 30 epochs, and afterwards show it only the masked image; second, we stop the training of the attention networks after 30 epochs to prevent them from focusing on the background as well. These two differences reduce errors caused by spurious image additions from the generator, and remove the need for the optional supervision introduced by Chen et al. to help remove background artifacts and better focus the attention map on the foreground. Table 2 demonstrates this quantitatively ("Ours-D-A"), with higher KID scores compared to our final implementation. Please see the supplemental document for visual examples.

5 Conclusion

While recent unsupervised image-to-image translation techniques are able to map relevant image regions, they also inadvertently map irrelevant regions. By doing so, the generated images fail to look realistic, as the background and foreground are generally not blended properly. By incorporating an attention mechanism into unsupervised image-to-image translation, we demonstrate significant improvements in the quality of generated images. Our simple algorithm leverages the discriminator to learn accurate attention maps with no additional supervision. This suggests that our learned attention maps reflect where the discriminator looks before deciding whether an image is real or fake, making them an appropriate tool for investigating the behavior of adversarial networks.

Future work. Although our approach can produce appealing translation results in the presence of multi-scale objects and varying backgrounds, it is still not robust to shape changes between domains, e.g., making Pegasus by translating a horse into a bird. Our transfer must happen within attended regions in the image, but shape change typically requires altering parts outside these regions. In the supplementary material, we provide an example of this limitation via the zebra-to-lion mapping. Our code is released in the following GitHub repository: https://github.com/AlamiMejjati/Unsupervised-Attention-guided-Image-to-Image-Translation.
Acknowledgements: Youssef A. Mejjati thanks the Marie Skłodowska-Curie grant agreement No 665992, and the UK's EPSRC Centre for Doctoral Training in Digital Entertainment (CDE), EP/L016540/1. Kwang In Kim, Christian Richardt, and Darren Cosker thank RCUK EP/M023281/1.

References

[1] J. Zhu, T. Park, P. Isola, and A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
[2] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. In CVPR, 2017.
[3] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. JMLR, 2017.
[4] M. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.
[5] Z. Yi, H. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In ICCV, 2017.
[6] Y. Cao, Z. Zhou, W. Zhang, and Y. Yu. Unsupervised diverse colorization via generative adversarial networks. In ECML-PKDD, 2017.
[7] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
[8] B. Wu, H. Duan, Z. Liu, and G. Sun. SRPGAN: Perceptual generative adversarial network for single image super resolution. arXiv preprint arXiv:1712.05927, 2017.
[9] P. Isola, J. Zhu, T. Zhou, and A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[10] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim. Image to image translation for domain adaptation. In CVPR, 2018.
[11] G. Mariani, F. Scheidegger, R. Istrate, C. Bekas, and C. Malossi. BAGAN: Data augmentation with balancing GAN. arXiv preprint arXiv:1803.09655, 2018.
[12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[13] C. Li and M. Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In ECCV, 2016.
[14] R. Rensink. The dynamic representation of scenes. Visual Cognition, 7(1-3):17-42, 2000.
[15] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In NIPS, 2014.
[16] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In NIPS, 2016.
[17] X. Huang, M. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.
[18] S. Ma, J. Fu, C. Wen Chen, and T. Mei. DA-GAN: Instance-level image translation by deep attention generative adversarial networks. In CVPR, 2018.
[19] N. Liu, J. Han, and M.-H. Yang. PiCANet: Learning pixel-wise contextual attention for saliency detection. In CVPR, 2018.
[20] J. Kuen, Z. Wang, and G. Wang. Recurrent attentional networks for saliency detection. In CVPR, 2016.
[21] S. Jetley, N. Lord, N. Lee, and P. Torr. Learn to pay attention. In ICLR, 2018.
[22] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
[23] J. Yang, A. Kannan, D. Batra, and D. Parikh. LR-GAN: Layered recursive generative adversarial networks for image generation. In ICLR, 2017.
[24] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In NIPS, 2016.
[25] X. Chen, C. Xu, X. Yang, and D. Tao. Attention-GAN for object transfiguration in wild images. In ECCV, 2018.
[26] X. Mao, Q. Li, H. Xie, R. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In ICCV, 2017.
[27] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
[28] C. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.
[29] M. Bińkowski, D. Sutherland, M. Arbel, and A. Gretton. Demystifying MMD GANs. In ICLR, 2018.
[30] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In CVPR, 2016.
[31] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a Nash equilibrium. In NIPS, 2017.