# Analogical Image Translation for Fog Generation

Rui Gong,¹ Dengxin Dai,¹ Yuhua Chen,¹ Wen Li,³ Danda Pani Paudel,¹ Luc Van Gool¹,²
¹Computer Vision Lab, ETH Zurich   ²VISICS, KU Leuven   ³University of Electronic Science and Technology of China
{gongr, dai, yuhua.chen, paudel, vangool}@vision.ee.ethz.ch, liwenbnu@gmail.com

## Abstract

Image-to-image translation maps images from one given style to another. While exceptionally successful, current methods assume the availability of training images in both the source and target domains, which does not always hold in practice. Inspired by humans' capability of analogical reasoning, we propose analogical image translation (AIT), which exploits the concept of gist for the first time. Given images of two styles in the source domain, A and A′, along with images B of the first style in the target domain, we learn a model that translates B to B′ in the target domain, such that A : A′ :: B : B′. AIT is especially useful for translation scenarios in which training data of one style is hard to obtain, but training data of the same two styles in another domain is available. For instance, when going from normal conditions to extreme, rare conditions, obtaining real training images for the latter case is challenging, whereas obtaining synthetic data for both cases is relatively easy. In this work, we aim at adding adverse weather effects, more specifically fog, to images taken in clear weather. To circumvent the challenge of collecting real foggy images, AIT learns the gist of translating synthetic clear-weather images to synthetic foggy images, and then adds fog effects onto real clear-weather images, without ever seeing any real foggy image. AIT thus achieves zero-shot image translation, whose effectiveness and benefit are demonstrated by the downstream task of semantic foggy scene understanding.

## Introduction

Image-to-image translation has enjoyed tremendous progress in the last years. Excellent methods have been developed for a diverse set of learning paradigms such as supervised (Isola et al. 2017), unsupervised (Zhu et al. 2017; Huang et al. 2018) and few-shot (Liu et al. 2019). While exceptionally successful, current methods share the assumption that training data, be it paired or unpaired, is available for both styles.¹ This may limit the use of image translation when data in one of the two styles is hard to obtain, e.g., translation from a normal condition to an extreme, corner-case condition. To address this, we take a new route and propose analogical image translation (AIT), which learns image translation via analogy.

¹We reserve "domains" for analogy, since the sought analogy exists between domains, and use "styles" for image translation.

Analogy is a basic reasoning process to transfer information or meaning from a source to a target. Humans use it commonly to solve problems, provide explanations and make predictions (Hertzmann et al. 2001; Schunn and Dunbar 1996). In this paper, we explore the use of analogy as a means of extracting the gist of image translation in the source domain and applying it to the target domain.
Particularly, we aim to solve the following problem.

Problem (Analogical Image Translation): Given images of two styles in the source domain, A and A′, along with images B of the first style in the target domain, learn the translation gist and apply it to B to obtain B′, such that A : A′ :: B : B′.

The above problem cannot be addressed by traditional image translation methods, due to the absence of B′. On the other hand, to our knowledge there exists only one work that exploits the concept of analogy for deep image translation, namely (Chen, Xu, and Jia 2020). However, (Chen, Xu, and Jia 2020) does not use the concept of gist. In this work, we demonstrate that the task of AIT can greatly benefit from modeling the concept of gist. In fact, our work also introduces a formal concept of gist for the task at hand. A schematic comparison of AIT to standard image translation is shown in Fig. 1.

Figure 1: Traditional image translation (TIT) vs. analogical image translation (AIT). Given images of two styles in the source domain, A and A′, along with images of the first style B in the target domain, traditional methods can only translate between the seen styles A, A′ and B. The proposed analogical image translation is able to translate B to B′, such that A : A′ :: B : B′, without requiring any sample from B′.

Our work is partially motivated by the difficulty of obtaining real training images for semantic understanding tasks of autonomous driving in adverse conditions, e.g., foggy weather. Despite tremendous progress, prior works in semantic scene understanding (Ronneberger, Fischer, and Brox 2015; Chen et al. 2017; Yu and Koltun 2016; Zhao et al. 2017; Lin et al. 2017) have mostly focused on clear weather, leading to unsatisfactory performance in adverse conditions (Halder, Lalonde, and Charette 2019; Sakaridis, Dai, and Van Gool 2018; Blum et al. 2019; Li et al. 2017). Collecting large-scale training datasets for such adverse conditions, and other corner cases, may resolve the issue. Unfortunately, such solutions are neither scalable nor affordable, nor very practical.

To address the scarcity of data, recent works synthesize fog effects onto clear-weather images by using a physical optical model (Sakaridis, Dai, and Van Gool 2018; Hahner et al. 2019; Ren et al. 2016). The success of these methods hinges on accurate depth and atmospheric light estimation, both of which, however, are still open problems on their own. Therefore, the synthesized fog still suffers from artifacts. On the other hand, synthetic foggy images can be generated easily in virtual environments (Gaidon et al. 2016). This motivates the development of our AIT method, which learns from the abundant synthetic clear-weather and synthetic foggy images to perform an analogical image translation from real clear-weather images to real foggy images. AIT learns the correlation between synthetic clear-weather and synthetic foggy images, and then applies the learned knowledge to the real domain. We call this learned correlation the gist of translation and assume it is transferable across domains. Since the proposed method uses analogy in the GAN setup, we call it Analogical GAN. Analogical GAN achieves zero-shot translation ability by coupling a supervised training scheme in the synthetic domain, a cycle consistency strategy in the real domain, and an adversarial training scheme between the two domains.
More specifically, in the synthetic domain, the gist of translation is learned in a supervised manner from the accessible paired clear-weather and foggy images. This translation gist is then transferred to the real domain through an adversarial learning scheme. In the real domain, the learning is further supervised through a cycle consistency scheme. The pipeline of Analogical GAN is shown in Fig. 2. Our extensive experiments demonstrate the superiority of Analogical GAN over standard zero-shot image translation methods when tested for fog generation. The quality of our foggy real images is also validated by the state-of-the-art performance on the downstream semantic foggy scene understanding task.

## Related Works

Image-to-Image Translation. Image translation methods have been developed to convert images of one given style to another given style, with remarkable success in the last years (Zhu et al. 2017; Huang et al. 2018; Liu et al. 2019). Image translation is also becoming a standard step for domain adaptation (Tsai et al. 2018; Lian et al. 2019; Tsai et al. 2019; Zou et al. 2019; Vu et al. 2019; Hoffman et al. 2016; Chen, Li, and Van Gool 2018): synthetic images are first translated to the real style, on which downstream tasks such as segmentation and detection are then conducted (Hoffman et al. 2018; Chen et al. 2019; Li, Yuan, and Vasconcelos 2019; Gong et al. 2019; Dundar et al. 2018). The standard image translation frameworks (Zhu et al. 2017; Isola et al. 2017; Huang et al. 2018; Liu, Breuel, and Kautz 2017) require the availability of images of both styles involved in the translation. To facilitate translation to an unseen style, the proposed AIT exploits the concept of analogy from synthetic images and then applies it to real images.

Image Analogy. The image analogy works (Hertzmann et al. 2001; Liao et al. 2017; Cheng, Vishwanathan, and Zhang 2008; Chen, Xu, and Jia 2020) aim to find B′ related to B in the same way as A′ relates to A. Even though the purpose of image analogy is similar to that of our AIT, traditional image analogy works (Hertzmann et al. 2001; Liao et al. 2017; Cheng, Vishwanathan, and Zhang 2008) only apply coarse-to-fine filters to reduce a perceptual similarity distance, such as a luminance feature (Hertzmann et al. 2001) or VGG feature (Liao et al. 2017) distance, between the source and target domains. They do not disentangle the gist and have no knowledge transfer between the source and target domains. These works limit themselves, by design, to low-level applications such as super-resolution, artistic filters, and texture transfer (Hertzmann et al. 2001). In contrast, the high-level task of image translation in the concurrent work (Chen, Xu, and Jia 2020) combines GANs with an analogical perceptual loss. Although the used perceptual loss is found to be effective for attribute manipulation, the same loss may not be sufficient for other image translation tasks (Huang et al. 2018). Furthermore, (Chen, Xu, and Jia 2020) does not exploit the concept of gist. Our work demonstrates that the gist can be effectively used for the task of analogical image translation.

Semantic Foggy Scene Understanding. Our work is also related to methods for semantic foggy scene understanding (SFSU). SFSU aims to improve the performance of semantic scene understanding under foggy conditions (Sakaridis, Dai, and Van Gool 2018; Dai et al. 2019; Hahner et al. 2019; Erkent and Laugier 2020; Tarel et al. 2010).
Due to the difficulty of gathering and labeling a large-scale foggy image dataset, some works (Sakaridis, Dai, and Van Gool 2018; Dai et al. 2019) synthesize fog by applying a physical model to the real clear-weather images of Cityscapes (Cordts et al. 2016), resulting in the Foggy Cityscapes dataset. While yielding improved results, these methods require accurate depth and atmospheric light estimation. Any failure of these two tasks directly implies the failure of fog synthesis. The proposed AIT does not require estimating atmospheric light, and uses depth only as auxiliary information. Instead, AIT learns the necessary gist for translation from synthetic examples.

Unsupervised Domain Adaptation. AIT also shares some similarity with unsupervised domain adaptation (UDA) works. UDA has been extensively studied in the past years, mainly for classification (Ganin and Lempitsky 2015; Long et al. 2018; Tzeng et al. 2017), semantic segmentation (Chen, Li, and Van Gool 2018; Tsai et al. 2019; Dai et al. 2019; Li, Yuan, and Vasconcelos 2019) and object detection (Chen et al. 2018; Xie et al. 2019; Zhu et al. 2019). Given a set of image and annotation pairs from the source domain, along with only images from the target, the goal is to learn a model that also performs well in the target domain. Our AIT shares the same spirit of transferring the learned model from the source to the target domain without using annotations (images of the desired style) from the target domain. While previous UDA works focus on understanding tasks such as classification, object detection and segmentation, our work addresses a fundamentally different task, namely image-to-image translation.

## Analogical Image Translation

### Problem Statement

In the image translation problem, we are given a source domain S and a target domain T, which consist of the samples x_s ∈ S and x_t ∈ T, respectively. The goal of traditional image translation is to transfer image samples x_s and x_t between domain S and domain T. In our work, we propose analogical image translation (AIT), where the source domain S and the target domain T cover two styles A, A′ and B, B′, respectively, but during training and testing only samples x_a ∈ A, x_{a′} ∈ A′ and x_b ∈ B are available. AIT aims at learning from the available samples x_a, x_{a′} how to translate x_b to the unseen samples x_{b′}, such that x_a : x_{a′} :: x_b : x_{b′}. The data distributions are denoted as x_a ∼ P_A, x_{a′} ∼ P_{A′}, x_b ∼ P_B and x_{b′} ∼ P_{B′}. Our objective in this work is to learn the mapping G_{BB′} : B → B′ conditioned on the mapping G_{AA′} : A → A′. Note that, unlike our objective, the traditional methods (Zhu et al. 2017; Hoffman et al. 2018; Huang et al. 2018; Dundar et al. 2018) only focus on learning the mapping G_{ST} : S → T.

### Analogical GAN Model

In this section, we present our Analogical GAN for the analogical image translation problem. The key idea of Analogical GAN is to disentangle the translation gist in the source domain, transfer the gist to the target domain, and make the gist compatible with the target domain. In our work, the gist is measured with an alignment map M and a residual map N, formally denoted as {M, N}. Taking the translation direction into account, {M, N} can be further expressed as M = {M_{AA′}, M_{A′A}, M_{BB′}, M_{B′B}} and N = {N_{AA′}, N_{A′A}, N_{BB′}, N_{B′B}}. Moreover, the gist is assumed to be invariant across the source domain and the target domain.
The gist can then be defined implicitly as:

$$A' = A \odot M_{AA'} + N_{AA'}, \quad (1)$$
$$B' = B \odot M_{BB'} + N_{BB'}, \quad (2)$$
$$A = A' \odot M_{A'A} + N_{A'A}, \quad (3)$$
$$B = B' \odot M_{B'B} + N_{B'B}, \quad (4)$$

where $\odot$ denotes element-wise multiplication. On this basis, as shown in Fig. 2 and taking the direction from the first style to the second style (i.e., A → A′, B → B′) as an example, our framework consists of three main components: the supervised module, the adversarial module and the cycle consistent module. Firstly, in the source domain, since paired samples from A and A′ are available, the gist {M_{AA′}, N_{AA′}} is disentangled in a supervised way according to Eq. (1), which forms the supervised module. Secondly, in the adversarial module, based on the domain-invariance assumption of the gist, the gist of the source domain, {M_{AA′}, N_{AA′}}, is transferred to the target domain, {M_{BB′}, N_{BB′}}, through adversarial learning. Thirdly, in the target domain, due to the unavailability of the second style B′, the gist {M_{BB′}, N_{BB′}} is kept compatible with the target domain through cycle consistency, which constitutes the cycle consistent module. The other direction, from the second style to the first style (A′ → A, B′ → B), works in the same way. Next, the different modules and the corresponding loss functions are introduced in detail.

Figure 2: Analogical GAN overview. The Analogical GAN consists of three modules: the supervised, the adversarial, and the cycle-consistent module. The supervised module disentangles the gist {M_{AA′}, N_{AA′}} via supervised learning. The adversarial module transfers the learned gist from the source domain, {M_{AA′}, N_{AA′}}, to the target domain, {M_{BB′}, N_{BB′}}. The cycle consistent module ensures that the transferred gist is compatible with the target domain.

Supervised Module. The supervised module is used to disentangle the gist {M, N} in the source domain. Given the paired samples x_a ∈ A and x_{a′} ∈ A′ in the source domain S, the translation between A and A′ can be trained in a supervised way by substituting into Eq. (1):

$$\mathcal{L}_{sup} = \mathbb{E}_{x_a \sim P_A}\!\left[ \left\| x_a \odot m_{aa'} + n_{aa'} - x_{a'} \right\|_1 \right] + \mathbb{E}_{x_{a'} \sim P_{A'}}\!\left[ \left\| x_{a'} \odot m_{a'a} + n_{a'a} - x_a \right\|_1 \right], \quad (6)$$

where (m_{aa′}, n_{aa′}) = G_{AA′}(x_a) and (m_{a′a}, n_{a′a}) = G_{A′A}(x_{a′}).

Adversarial Module. The adversarial module aims to transfer the gist disentangled in the source domain to the target domain. Specifically, taking the direction A → A′, B → B′ as an example, we introduce the discriminator D_I to distinguish between the gist of the source domain, {M_{AA′}, N_{AA′}}, and that of the target domain, {M_{BB′}, N_{BB′}}. The discriminator D_J acts in the same way for the inverse direction A′ → A, B′ → B. The adversarial loss over the gist {M, N} on S and T can then be written as:

$$\mathcal{L}_{adv1}(G_{AA'}, G_{BB'}, D_I) = \mathbb{E}_{x_a \sim P_A}\!\left[\log D_I(G_{AA'}(x_a))\right] + \mathbb{E}_{x_b \sim P_B}\!\left[\log\big(1 - D_I(G_{BB'}(x_b))\big)\right]. \quad (7)$$

A similar adversarial loss L_adv2(G_{A′A}, G_{B′B}, D_J) is defined for the direction A′ → A, B′ → B. The gist adversarial loss is then formulated as:

$$\mathcal{L}_{adv} = \mathcal{L}_{adv1} + \mathcal{L}_{adv2}. \quad (8)$$

In order to make the mapping G_{BB′} conditional on G_{AA′}, the pairs G_{AA′} and G_{BB′}, as well as G_{A′A} and G_{B′B}, share all their parameters, respectively.
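To make the gist formulation more concrete, the following is a minimal PyTorch sketch, not the authors' released code, of a generator that predicts an alignment map m and a residual map n, the application of Eqs. (1)-(4), the supervised loss of Eq. (6), and the gist adversarial loss of Eq. (7). The module names (`GistGenerator`, `gist_disc`) and the tiny backbone are illustrative assumptions; the paper uses CycleGAN-style generators and discriminators, and since G_{AA′} and G_{BB′} share all parameters, a single generator instance would in practice serve both domains.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GistGenerator(nn.Module):
    """Illustrative generator: predicts an alignment map m and a residual map n,
    so that the translated image is x * m + n (Eq. 1). The actual model would use
    a CycleGAN-style encoder/decoder; this toy backbone only fixes the interface."""
    def __init__(self, in_ch: int = 3):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.to_m = nn.Conv2d(64, in_ch, 3, padding=1)  # alignment-map head
        self.to_n = nn.Conv2d(64, in_ch, 3, padding=1)  # residual-map head

    def forward(self, x):
        h = self.backbone(x)
        return self.to_m(h), self.to_n(h)  # (m, n) = G(x)

def apply_gist(x, m, n):
    """Eqs. (1)-(4): element-wise alignment plus residual."""
    return x * m + n

def supervised_loss(G_AAp, G_ApA, x_a, x_ap):
    """Eq. (6): L1 supervision with paired synthetic clear/foggy images."""
    m_aap, n_aap = G_AAp(x_a)    # A  -> A'
    m_apa, n_apa = G_ApA(x_ap)   # A' -> A
    return (F.l1_loss(apply_gist(x_a, m_aap, n_aap), x_ap) +
            F.l1_loss(apply_gist(x_ap, m_apa, n_apa), x_a))

def gist_adversarial_loss(gist_disc, G_AAp, G_BBp, x_a, x_b, eps=1e-8):
    """Eq. (7): D_I separates the source-domain gist from the target-domain gist.
    `gist_disc` is assumed to output a probability in (0, 1); the discriminator
    maximizes this value while the generator minimizes it."""
    gist_a = torch.cat(G_AAp(x_a), dim=1)  # (m, n) from the synthetic domain
    gist_b = torch.cat(G_BBp(x_b), dim=1)  # (m, n) from the real domain
    return (torch.log(gist_disc(gist_a) + eps).mean() +
            torch.log(1.0 - gist_disc(gist_b) + eps).mean())
```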
Cycle Consistent Module. The cycle consistent module is utilized to make the gist compatible with the target domain, i.e., to preserve the target-domain features in the translated gist. Accordingly, a reconstruction loss is used to recover x_b from the translated image x_{b′} through the inverse mapping G_{B′B}. Furthermore, in order to strengthen the recovery, another discriminator D_T is introduced to distinguish between the recovered x_b and the original x_b. The image cycle consistency loss L_cyc then consists of the reconstruction loss L_rec and the adversarial loss L_adv(G_{BB′}, G_{B′B}, D_T). Substituting into Eq. (2), it is given by:

$$\mathcal{L}_{cyc} = \mathcal{L}_{rec} + \mathcal{L}_{adv}(G_{BB'}, G_{B'B}, D_T), \quad (9)$$
$$\mathcal{L}_{rec} = \mathbb{E}_{x_b \sim P_B}\!\left[ \left\| m_{b'b} \odot x_{b'} + n_{b'b} - x_b \right\|_1 \right], \quad (10)$$
$$\mathcal{L}_{adv}(G_{BB'}, G_{B'B}, D_T) = \mathbb{E}_{x_b \sim P_B}\!\left[ \log\big(1 - D_T(m_{b'b} \odot x_{b'} + n_{b'b})\big) \right] + \mathbb{E}_{x_b \sim P_B}\!\left[ \log D_T(x_b) \right], \quad (11)$$

where (m_{bb′}, n_{bb′}) = G_{BB′}(x_b), (m_{b′b}, n_{b′b}) = G_{B′B}(x_{b′}) and x_{b′} = m_{bb′} ⊙ x_b + n_{bb′}.

Auxiliary Module. Besides the three main modules, an auxiliary module is added to assist the analogical image translation process and to introduce auxiliary information. Following (Huang et al. 2018) and (Johnson, Alahi, and Fei-Fei 2016), the perceptual loss computes the VGG feature distance Φ(·) (Simonyan and Zisserman 2014) between the translated image and a reference image, and has been shown to assist the image translation process. Generalizing the perceptual loss to analogical image translation, it is formulated in an analogical way as:

$$e_S = \Phi(x_{a'}) - \Phi(x_a), \quad (12)$$
$$e_T = \Phi(x_{b'}) - \Phi(x_b), \quad (13)$$
$$\mathcal{L}_{percep} = \mathbb{E}_{x_b \sim P_B}\!\left[ \left\| e_S - e_T \right\|_1 \right], \quad (14)$$

where (m_{bb′}, n_{bb′}) = G_{BB′}(x_b) and x_{b′} = m_{bb′} ⊙ x_b + n_{bb′}. Meanwhile, for specific settings such as analogical foggy image translation, auxiliary information related to the fog effect, such as depth (Fattal 2008; Sakaridis, Dai, and Van Gool 2018; Dai et al. 2019), can also be leveraged. We introduce the mappings G_{IH} : A → H_S, B → H_T and G_{JH} : A′ → H_S, B′ → H_T, where H_S and H_T denote the depth domains corresponding to S and T, composed of depth maps d_S and d_T, respectively. The auxiliary depth loss is given by:

$$\mathcal{L}_{dep} = \mathbb{E}_{x_a \sim P_A}\!\left[ \left\| G_{IH}(x_a) - d_S \right\|_1 \right] + \mathbb{E}_{x_{a'} \sim P_{A'}}\!\left[ \left\| G_{JH}(x_{a'}) - d_S \right\|_1 \right] + \mathbb{E}_{x_b \sim P_B}\!\left[ \left\| G_{IH}(x_b) - d_T \right\|_1 \right] + \mathbb{E}_{x_{b'} \sim P_{B'}}\!\left[ \left\| G_{JH}(x_{b'}) - d_T \right\|_1 \right]. \quad (15)$$

By sharing the network parameters between G_{IH}, G_{AA′} and G_{BB′}, as well as between G_{JH}, G_{A′A} and G_{B′B}, the depth information is implicitly encoded into our analogical translation process.

Full Objective. Integrating the losses defined above, the full objective of Analogical GAN is:

$$\mathcal{L} = \mathcal{L}_{adv} + \lambda_1 \mathcal{L}_{sup} + \lambda_2 \mathcal{L}_{cyc} + \lambda_3 \mathcal{L}_{dep} + \lambda_4 \mathcal{L}_{percep}, \quad (16)$$

where λ1, λ2, λ3 and λ4 are hyper-parameters that balance the different parts of the training loss. Following the common practice for training adversarial models, the full objective is optimized in a minimax fashion, i.e., it is minimized over the generators while being maximized over the discriminators.

Domain Interpolation. Benefiting from the disentangled gist, our Analogical GAN is able to generate intermediate domains between B and B′ at test time. Following (Gong et al. 2019), a variable z ∈ [0, 1] is used to measure the domainness. The intermediate domain between B and B′ is denoted as I(z)_{B′}. When z = 0, the intermediate domain I(z)_{B′} is identical to B; when z = 1, it is identical to B′. In order to generate the intermediate domain, the gist between B and B′ is assumed to vary linearly. On the basis of this linear assumption and Eq. (2), the intermediate domain can be written as:

$$I(z)_{B'} = B \odot \big( (M_{BB'} - 1) \cdot z + 1 \big) + N_{BB'} \cdot z. \quad (17)$$
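As a brief illustration of Eqs. (16) and (17), the sketch below, under the same assumptions as the earlier snippet, assembles the weighted full objective and performs the linear gist interpolation; `total_loss` and `interpolate_fog` are hypothetical helper names, not part of the released code.

```python
def total_loss(l_adv, l_sup, l_cyc, l_dep, l_percep,
               lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Eq. (16): weighted sum of the module losses. Training is minimax: the
    generators minimize this objective while the discriminators maximize the
    adversarial terms."""
    l1, l2, l3, l4 = lambdas
    return l_adv + l1 * l_sup + l2 * l_cyc + l3 * l_dep + l4 * l_percep

def interpolate_fog(x_b, m_bbp, n_bbp, z: float):
    """Eq. (17): intermediate domain I(z) between B (z = 0) and B' (z = 1),
    assuming the gist varies linearly with the domainness variable z."""
    assert 0.0 <= z <= 1.0
    return x_b * ((m_bbp - 1.0) * z + 1.0) + n_bbp * z
```

At test time a single scalar z therefore controls the fog density of the generated image, which is how the fog density is selected in the experiments below.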
## Experiments

In this section, we evaluate our Analogical GAN on the fog generation task. As mentioned above, our method involves two domains, a source domain S and a target domain T, with two styles A, A′ and B, B′ defined on S and T, respectively. Because training data for B′ is unavailable, existing image translation methods can only be trained for A and B, which does not serve the exact purpose of generating data in B′. Training standard translation methods on A and B can nevertheless be taken as a baseline. In our experiments, we instantiate S, T, A, A′, B and B′ as follows: synthetic as S, real as T, synthetic clear weather as A, synthetic foggy weather as A′, real clear weather as B, and real foggy weather as B′.

### Analogical Image Translation

We conduct the analogical image translation experiments by taking Virtual KITTI (Gaidon et al. 2016) as the synthetic domain and Cityscapes (Cordts et al. 2016) as the real domain. The depth maps of Cityscapes are generated with the pretrained model of (Chang and Chen 2018).

Virtual KITTI. Virtual KITTI is a dataset of 2,136 photo-realistic synthetic clear-weather images imitating the content and structure of the KITTI dataset (Geiger et al. 2013), each of which has a paired foggy-weather image and the corresponding depth map available.

Cityscapes. Cityscapes is a dataset of 2,975 real clear-weather images taken in different European cities, densely labeled with 19 semantic categories.

We follow the training procedure and the generator and discriminator architectures of CycleGAN (Zhu et al. 2017). The Adam optimizer (Kingma and Ba 2015) is adopted, the learning rate is fixed to 0.0002, and the batch size is set to 1. Images are resized to 512 × 256. The weight of the gist adversarial loss is set to 3, the weight of the cycle consistency adversarial loss is set to 1, and the weights of the remaining terms are 10. We implement our model in PyTorch (Paszke et al. 2017). More details on the network architecture and implementation are given in the Appendix due to space limitations.
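For reference, the training setup just described can be summarized in a small configuration sketch; the dataclass and its field names are illustrative rather than taken from the paper's code, while the values are the ones reported above.

```python
from dataclasses import dataclass

@dataclass
class AnalogicalGANTrainConfig:
    # Optimization settings reported in the paper
    optimizer: str = "adam"         # Adam (Kingma and Ba 2015)
    lr: float = 2e-4                # fixed learning rate
    batch_size: int = 1
    image_size: tuple = (512, 256)  # images resized to 512 x 256
    # Loss weights
    w_gist_adv: float = 3.0         # gist adversarial loss
    w_cyc_adv: float = 1.0          # cycle-consistency adversarial loss
    w_rest: float = 10.0            # supervised / reconstruction / perceptual / depth terms

cfg = AnalogicalGANTrainConfig()
# e.g. torch.optim.Adam(generator.parameters(), lr=cfg.lr)
```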
Gist. In order to verify the necessity of the gist for the AIT task, the state-of-the-art traditional image translation framework CycleGAN (Zhu et al. 2017) and the recent domain adaptive image translation framework DAI2I (Chen, Xu, and Jia 2020) are taken as baseline methods. When applied to the AIT task, the traditional frameworks do not disentangle the gist from the source domain S. For example, CycleGAN can only translate between S (A, A′) and T (B). The direct translation between S and T, if performed, unavoidably translates features irrelevant to the gist. Such translation causes artifacts in the generated B′. In the fog generation case, the translation from real clear weather to synthetic clear weather (and similarly from real clear to synthetic foggy weather) introduces the synthetic style into the real domain. As a result, the generated real foggy-weather images inherit abundant synthetic artifacts instead of only the foggy-weather style. Although auxiliary information such as depth can be encoded into the CycleGAN baseline by network parameter sharing as done in Eq. (16), it still cannot resolve the synthetic-artifact problem. Fig. 3b shows qualitative results of the CycleGAN baseline translating between A and B while using the depth information. The real foggy-weather image translated by the CycleGAN baseline is heavily affected by the translated synthetic style. In order to improve the CycleGAN baseline for the AIT task, we also trained the model to translate from A to A′ while testing on B, which reduces the synthetic effect. Comparing Fig. 3b and Fig. 5d shows that the CycleGAN baseline's performance improves when T is eliminated during training. Nevertheless, these results are not yet satisfactory. One can also think of a multi-stage translation strategy; a typical case could follow B → A → A′. Such multi-stage translation is bound to exhibit synthetic effects in the end, because the final domain, i.e. A′, is still synthetic. In essence, DAI2I (Chen, Xu, and Jia 2020) combines the aforementioned multi-stage translation strategy with an analogical perceptual similarity measurement, whose results are shown in Fig. 3c. Although DAI2I adopts the spirit of image analogy through the perceptual similarity, it still does not exploit the gist. Fig. 3c shows that, for fog generation, DAI2I cannot deal with the synthetic artifacts by relying purely on the analogical perceptual similarity, without exploiting the gist. In contrast, by explicitly disentangling the gist from S and transferring it to T, our Analogical GAN eliminates the synthetic artifacts. The qualitative results in Fig. 3c show a clear distinction between DAI2I and Analogical GAN. We choose not to conduct further analysis of the DAI2I results, as they are clearly inferior in qualitative terms. The artifacts in the reported images are consistent across the test set (more images can be found in the appendix).

Table 1: Full model and ablations comparison for SFSU, tested on the Foggy Zurich dataset based on RefineNet with a ResNet-101 backbone. Ablating L_adv, L_cyc, L_percep and L_dep yields mIoU values (over 19 classes) of 34.6, 32.8, 42.0, 41.7, 41.9 and 40.8, while the full model achieves the best result of 42.3.

Figure 3: Qualitative translation results of CycleGAN encoding depth and DAI2I (Chen, Xu, and Jia 2020). (a) is the real clear-weather image, while (b) and (c) are the real foggy-weather images translated by the CycleGAN model encoding depth and by DAI2I, respectively. Both CycleGAN and DAI2I introduce strong synthetic artifacts into the translated images.

Quantitative Results. In order to validate the effectiveness of our Analogical GAN for the AIT task, a user study on Amazon Mechanical Turk (AMT) is conducted to compare the translation results of our Analogical GAN with the state-of-the-art traditional image translation methods CycleGAN (Zhu et al. 2017) and MUNIT (Huang et al. 2018). Each individual task completed by the participants, referred to as a Human Intelligence Task (HIT), comprises two image pairs to be compared: ours vs. CycleGAN and ours vs. MUNIT. In total, 100 HITs were used, each completed by three annotators, and the results are averaged. For each image pair, the users were asked to select the image that looks more like a real foggy image. The user study results are listed in Table 3: users prefer our translation results over those of CycleGAN (61.0% vs. 39.0%) and MUNIT (66.7% vs. 33.3%).

Table 3: User study results for fog generation. More users prefer the translation results of our Analogical GAN over those of CycleGAN and MUNIT.

|                 | CycleGAN / Ours | MUNIT / Ours  |
|-----------------|-----------------|---------------|
| user preference | 39.0% / 61.0%   | 33.3% / 66.7% |

Qualitative Results. Furthermore, we show the qualitative comparison in Fig. 5. It is observed that the standard image translation models CycleGAN (Fig. 5(d)) and MUNIT (Fig. 5(e)) suffer from inheriting synthetic features from Virtual KITTI (Fig. 5(b)), such as the color of the cars, the lines on the road and the skin of the people. Besides, although the translated foggy parts tend to be gray, they lose the correct sense that fog changes with depth.
In contrast, our Analogical GAN, the analogical image translation framework, preserves the real appearance of the objects in the scene, generates realistic foggy images and yields the correct sense that fog changes with the depth of the scene, as shown in Fig. 5(f).

Table 2: Results of semantic segmentation on the Foggy Zurich (FZ) and Foggy Driving (FD) datasets. The reported models are pretrained on Cityscapes, fine-tuned on different simulated foggy images, and tested on FZ and FD. The columns correspond to the segmentation architectures RefineNet (R) with ResNet-101 backbone and BiSeNet (B) with ResNet-18 backbone. Results are mIoU over 19 categories; best results in bold. FC, FS, AC and AS denote Foggy Cityscapes, Foggy Synscapes, Analogical GAN Cityscapes and Analogical GAN Synscapes, respectively.

(a) Virtual KITTI → Cityscapes

| Fine-tuning                     | FZ (R)   | FZ (B)   | FD (R)   | FD (B)   |
|---------------------------------|----------|----------|----------|----------|
| Cityscapes (Hahner et al. 2019) | 34.6     | 16.1     | 44.3     | 27.2     |
| FC (Hahner et al. 2019)         | 36.9     | 25.0     | 46.1     | 30.3     |
| CycleGAN (Zhu et al. 2017)      | 40.5     | 27.1     | 47.7     | 30.0     |
| MUNIT (Huang et al. 2018)       | 39.1     | 26.0     | **47.8** | 30.5     |
| AC (ours)                       | **42.3** | **28.4** | 47.5     | **30.8** |

(b) Virtual KITTI → Synscapes

| Fine-tuning                     | FZ (R)   | FZ (B)   | FD (R)   | FD (B)   |
|---------------------------------|----------|----------|----------|----------|
| Cityscapes (Hahner et al. 2019) | 34.6     | 16.1     | 44.3     | 27.2     |
| FS (Hahner et al. 2019)         | 40.3     | 27.8     | 48.4     | 30.9     |
| CycleGAN (Zhu et al. 2017)      | 41.6     | 30.9     | 47.8     | 33.1     |
| MUNIT (Huang et al. 2018)       | 40.5     | 27.5     | 48.3     | 32.8     |
| AS (ours)                       | **41.8** | **31.5** | **49.8** | **34.2** |

Ablation Study. Fig. 4 gives a qualitative ablation study of each module in our Analogical GAN. More qualitative ablation results are provided in the Appendix due to space limitations.

Figure 4: Qualitative ablation study of Analogical GAN for fog generation. It is observed that each module is effective for the analogical image translation (AIT) task.

### Semantic Foggy Scene Understanding

Experiments Setup. In this section, we validate the usefulness of our translated images for the downstream task of semantic foggy scene understanding. Specifically, following the paradigm of (Sakaridis, Dai, and Van Gool 2018; Hahner et al. 2019), a semantic segmentation model pretrained on real clear-weather images (Cityscapes) is fine-tuned on the synthesized foggy images. The fine-tuned model is then tested on two real foggy image datasets: Foggy Zurich (Dai et al. 2019) and Foggy Driving (Sakaridis, Dai, and Van Gool 2018). We compare the semantic foggy scene understanding performance obtained with our Analogical GAN translations against the state-of-the-art physics-based foggy image synthesis results, Foggy Cityscapes (Sakaridis, Dai, and Van Gool 2018), and against the translation results of the traditional image translation methods CycleGAN and MUNIT. In addition to the setting Virtual KITTI to Cityscapes, we further evaluate all methods in another setting, Virtual KITTI to Synscapes. The foggy scene understanding performance of all methods is reported for both settings.

Table 4: Results of semantic segmentation on the Foggy Zurich (FZ) and Foggy Driving (FD) datasets. The reported models are pretrained on Cityscapes, fine-tuned on Foggy Cityscapes (FC) or Analogical GAN Cityscapes (AC) mixed with Foggy Synscapes (FS), and tested on FZ and FD. The columns correspond to RefineNet (R) with ResNet-101 backbone and BiSeNet (B) with ResNet-18 backbone. Results are mIoU over 19 categories; best results in bold.

| Fine-tuning                | FZ (R)   | FZ (B)   | FD (R)   | FD (B)   |
|----------------------------|----------|----------|----------|----------|
| FC+FS (Hahner et al. 2019) | 41.4     | 30.9     | **50.7** | 35.2     |
| AC+FS (ours)               | **43.8** | **32.9** | 50.3     | **39.9** |
Synscapes is a synthetic dataset consisting of 25,000 clear-weather images imitating the content and structure of the Cityscapes dataset. Pixel-wise ground-truth semantic labels and depth maps are provided with the dataset. Foggy Zurich consists of 3,808 foggy scene images taken in the city of Zurich, 40 of which are densely labeled; we use them as test data in our experiments. Foggy Driving is a dataset containing 101 coarsely annotated real foggy images, collected in various areas of Zurich and from the Internet.

As shown in (Dai et al. 2019), the fog density of the synthesized foggy images strongly affects semantic foggy scene understanding performance. Our Analogical GAN can control the density of the synthesized fog via the domainness variable z. In order to generate foggy images with an appropriate fog density, the domainness variable z is set at test time to 0.88 and 0.9 for Cityscapes and Synscapes, respectively. For semantic segmentation, we follow the paradigm and fine-tuning details of (Sakaridis, Dai, and Van Gool 2018) and (Hahner et al. 2019). RefineNet (Lin et al. 2017) with a ResNet-101 backbone (He et al. 2016) and BiSeNet (Yu et al. 2018) with a ResNet-18 backbone (He et al. 2016) are used as the segmentation networks.

Figure 5: Comparison of the analogical translation results of our Analogical GAN (column (f)) with the traditional image translation methods (columns (d) and (e)). Columns (a), (b) and (c) show the synthetic clear-weather image (Clear Virtual KITTI), the synthetic foggy-weather image (Foggy Virtual KITTI) and the real clear-weather image (Cityscapes), respectively. The analogical translation reads: column (a) : column (b) :: column (c) : column (d), (e), (f).

Experiments Results. The results of semantic foggy scene understanding based on foggy images synthesized from Cityscapes and Synscapes are shown in Table 2a and Table 2b, respectively. Using Cityscapes and Synscapes as real clear-weather images, our Analogical GAN outperforms the physics-based foggy image synthesis methods Foggy Cityscapes and Foggy Synscapes. The improvement is consistent on both Foggy Zurich and Foggy Driving, and with both the RefineNet and BiSeNet segmentation networks. Compared to the traditional image translation methods, our Analogical GAN outperforms both CycleGAN and MUNIT on both test sets and for both segmentation networks, except for one case (RefineNet tested on Foggy Driving) in which our method reaches comparable performance to MUNIT (47.5% vs. 47.8%). Moreover, following (Hahner et al. 2019), mixing Foggy Synscapes with Analogical GAN Cityscapes, i.e. Cityscapes translated with our Analogical GAN model, further improves performance. As shown in Table 4, the mixture of Analogical GAN Cityscapes and Foggy Synscapes improves over the state-of-the-art mixture of Foggy Cityscapes and Foggy Synscapes by 2.4% and 2.0% on Foggy Zurich with RefineNet and BiSeNet, respectively, improves by 4.7% on Foggy Driving with BiSeNet, and reaches comparable performance (50.3% vs. 50.7%) on Foggy Driving with RefineNet. The semantic foggy scene understanding performance and comparisons demonstrate the effectiveness of our Analogical GAN for synthesizing fog effects onto real images. The results also show the advantage of our proposed method over the physics-based fog synthesis methods and the traditional image translation methods. More detailed per-class results are listed in the Appendix due to space limitations.
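For completeness, the mIoU figure reported throughout Tables 1-4 is the mean intersection-over-union over the 19 Cityscapes classes; a minimal sketch is given below. The ignore label of 255 and the single-pair computation are common conventions assumed here, not details stated in the paper; benchmark evaluation typically accumulates intersections and unions over the whole test set.

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray,
             num_classes: int = 19, ignore_index: int = 255) -> float:
    """Minimal mIoU sketch: per-class IoU, averaged over the classes that appear
    in the prediction or the ground truth; `ignore_index` pixels in the ground
    truth are excluded."""
    valid = gt != ignore_index
    ious = []
    for c in range(num_classes):
        p = (pred == c) & valid
        g = (gt == c) & valid
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")
```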
Ablation Study. In Table 1, we compare our full model with ablations of the full objective for semantic foggy scene understanding, i.e., a quantitative ablation study. It shows that each module of our Analogical GAN contributes to semantic foggy scene understanding.

## Conclusion

In this work, we have presented Analogical GAN, a novel analogical image translation (AIT) framework. Different from traditional image translation, analogical image translation achieves zero-shot image translation via analogy. Compared with previous image analogy works, our Analogical GAN explicitly disentangles and transfers the gist, which proves necessary and beneficial for the AIT task. Applying our Analogical GAN to the fog generation task, realistic fog effects are synthesized onto real clear-weather images, even though no real foggy image is ever seen. Further experiments prove the effectiveness of our Analogical GAN. While some choices in Analogical GAN are made specifically for fog generation, the method itself has the potential to be used for other AIT tasks.

## Acknowledgements

This research has received funding from the EU Horizon 2020 research and innovation programme under grant agreement No. 820434. This work is funded by Toyota Motor Europe via the research project TRACE Zurich. This work is also partially supported by the Major Project for New Generation of AI under Grant No. 2018AAA0100400.

## Ethics Statement

In this paper, we propose the Analogical GAN model, an analogical image translation framework. It can be seen as a zero-shot generalization of existing image-to-image translation frameworks. The analogical image translation framework has the potential to greatly reduce the difficulty of gathering and labeling data. Benefiting from the scale and diversity of the transferred data, deep models are expected to become more robust, reliable and effective under different, even extreme, conditions, which can promote and accelerate the deployment of deep-learning-based systems such as medical computer-assisted systems and autonomous driving systems.

The easy availability of transferred labeled data and the deployment of more reliable and effective deep-learning-based systems are likely to have complex social impacts. (i) On the one hand, transferred labeled data will save much of the cost of data gathering and labeling and avoid wasteful duplication of labor. More and more deep-learning-based artificial intelligence systems will become part of people's lives, bringing convenience, wealth and prosperity. (ii) On the other hand, transferred labeled data might cause unemployment among people who are engaged in gathering and labeling datasets, and the deployment of artificial intelligence systems may also cause job losses. Besides, another concern is that image synthesis techniques can be used for the illegal purposes of forgery and deception. We encourage further work on the detection of forged and deceptive images, even though such detection will become harder and harder as image synthesis techniques develop.
From a long-term perspective, in order to mitigate the risks of image synthesis, more regulations and guidance on tracking and stopping harmful and dangerous synthesized images should be established.

## References

Blum, H.; Sarlin, P.-E.; Nieto, J.; Siegwart, R.; and Cadena, C. 2019. Fishyscapes: A benchmark for safe semantic segmentation in autonomous driving. In ICCV Workshops.

Chang, J.-R.; and Chen, Y.-S. 2018. Pyramid stereo matching network. In CVPR.

Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; and Yuille, A. L. 2017. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 40(4): 834-848.

Chen, Y.; Li, W.; Chen, X.; and Van Gool, L. 2019. Learning semantic segmentation from synthetic data: A geometrically guided input-output adaptation approach. In CVPR.

Chen, Y.; Li, W.; Sakaridis, C.; Dai, D.; and Van Gool, L. 2018. Domain adaptive Faster R-CNN for object detection in the wild. In CVPR.

Chen, Y.; Li, W.; and Van Gool, L. 2018. ROAD: Reality oriented adaptation for semantic segmentation of urban scenes. In CVPR.

Chen, Y.-C.; Xu, X.; and Jia, J. 2020. Domain adaptive image-to-image translation. In CVPR.

Cheng, L.; Vishwanathan, S. N.; and Zhang, X. 2008. Consistent image analogies using semi-supervised learning. In CVPR.

Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; and Schiele, B. 2016. The Cityscapes dataset for semantic urban scene understanding. In CVPR.

Dai, D.; Sakaridis, C.; Hecker, S.; and Van Gool, L. 2019. Curriculum model adaptation with synthetic and real data for semantic foggy scene understanding. International Journal of Computer Vision: 1-23.

Dundar, A.; Liu, M.-Y.; Wang, T.-C.; Zedlewski, J.; and Kautz, J. 2018. Domain stylization: A strong, simple baseline for synthetic to real image domain adaptation. arXiv preprint arXiv:1807.09384.

Erkent, O.; and Laugier, C. 2020. Semantic segmentation with unsupervised domain adaptation under varying weather conditions for autonomous vehicles. IEEE Robotics and Automation Letters 5(2): 3580-3587.

Fattal, R. 2008. Single image dehazing. ACM Transactions on Graphics (TOG) 27(3): 1-9.

Gaidon, A.; Wang, Q.; Cabon, Y.; and Vig, E. 2016. Virtual worlds as proxy for multi-object tracking analysis. In CVPR.

Ganin, Y.; and Lempitsky, V. 2015. Unsupervised domain adaptation by backpropagation. In ICML.

Geiger, A.; Lenz, P.; Stiller, C.; and Urtasun, R. 2013. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research 32(11): 1231-1237.

Gong, R.; Li, W.; Chen, Y.; and Van Gool, L. 2019. DLOW: Domain flow for adaptation and generalization. In CVPR.

Hahner, M.; Dai, D.; Sakaridis, C.; Zaech, J.-N.; and Van Gool, L. 2019. Semantic understanding of foggy scenes with purely synthetic data. In ITSC.

Halder, S. S.; Lalonde, J.-F.; and Charette, R. d. 2019. Physics-based rendering for improving robustness to rain. In ICCV.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.

Hertzmann, A.; Jacobs, C. E.; Oliver, N.; Curless, B.; and Salesin, D. H. 2001. Image analogies. In Annual Conference on Computer Graphics and Interactive Techniques.

Hoffman, J.; Tzeng, E.; Park, T.; Zhu, J.-Y.; Isola, P.; Saenko, K.; Efros, A. A.; and Darrell, T. 2018. CyCADA: Cycle-consistent adversarial domain adaptation. In ICML.
Hoffman, J.; Wang, D.; Yu, F.; and Darrell, T. 2016. FCNs in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649.

Huang, X.; Liu, M.-Y.; Belongie, S.; and Kautz, J. 2018. Multimodal unsupervised image-to-image translation. In ECCV.

Isola, P.; Zhu, J.-Y.; Zhou, T.; and Efros, A. A. 2017. Image-to-image translation with conditional adversarial networks. In CVPR.

Johnson, J.; Alahi, A.; and Fei-Fei, L. 2016. Perceptual losses for real-time style transfer and super-resolution. In ECCV.

Kingma, D. P.; and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR.

Li, K.; Li, Y.; You, S.; and Barnes, N. 2017. Photo-realistic simulation of road scene for data-driven methods in bad weather. In ICCV Workshops.

Li, Y.; Yuan, L.; and Vasconcelos, N. 2019. Bidirectional learning for domain adaptation of semantic segmentation. In CVPR.

Lian, Q.; Lv, F.; Duan, L.; and Gong, B. 2019. Constructing self-motivated pyramid curriculums for cross-domain semantic segmentation: A non-adversarial approach. In ICCV.

Liao, J.; Yao, Y.; Yuan, L.; Hua, G.; and Kang, S. B. 2017. Visual attribute transfer through deep image analogy. In SIGGRAPH.

Lin, G.; Milan, A.; Shen, C.; and Reid, I. 2017. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In CVPR.

Liu, M.-Y.; Breuel, T.; and Kautz, J. 2017. Unsupervised image-to-image translation networks. In NIPS.

Liu, M.-Y.; Huang, X.; Mallya, A.; Karras, T.; Aila, T.; Lehtinen, J.; and Kautz, J. 2019. Few-shot unsupervised image-to-image translation. In ICCV.

Long, M.; Cao, Z.; Wang, J.; and Jordan, M. I. 2018. Conditional adversarial domain adaptation. In NIPS.

Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch. In NIPS-W.

Ren, W.; Liu, S.; Zhang, H.; Pan, J.; Cao, X.; and Yang, M.-H. 2016. Single image dehazing via multi-scale convolutional neural networks. In ECCV.

Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI.

Sakaridis, C.; Dai, D.; and Van Gool, L. 2018. Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision (IJCV) 126(9): 973-992.

Schunn, C. D.; and Dunbar, K. 1996. Priming, analogy, and awareness in complex reasoning. Memory & Cognition 24(3): 271-284.

Simonyan, K.; and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Tarel, J.-P.; Hautiere, N.; Cord, A.; Gruyer, D.; and Halmaoui, H. 2010. Improved visibility of road scene images under heterogeneous fog. In IEEE Intelligent Vehicles Symposium.

Tsai, Y.-H.; Hung, W.-C.; Schulter, S.; Sohn, K.; Yang, M.-H.; and Chandraker, M. 2018. Learning to adapt structured output space for semantic segmentation. In CVPR.

Tsai, Y.-H.; Sohn, K.; Schulter, S.; and Chandraker, M. 2019. Domain adaptation for structured output via discriminative patch representations. In CVPR.

Tzeng, E.; Hoffman, J.; Saenko, K.; and Darrell, T. 2017. Adversarial discriminative domain adaptation. In CVPR.

Vu, T.-H.; Jain, H.; Bucher, M.; Cord, M.; and Pérez, P. 2019. ADVENT: Adversarial entropy minimization for domain adaptation in semantic segmentation. In CVPR.

Xie, R.; Yu, F.; Wang, J.; Wang, Y.; and Zhang, L. 2019. Multi-level domain adaptive learning for cross-domain detection. In ICCV Workshops.
Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; and Sang, N. 2018. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In ECCV.

Yu, F.; and Koltun, V. 2016. Multi-scale context aggregation by dilated convolutions. In ICLR.

Zhao, H.; Shi, J.; Qi, X.; Wang, X.; and Jia, J. 2017. Pyramid scene parsing network. In CVPR.

Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.

Zhu, X.; Pang, J.; Yang, C.; Shi, J.; and Lin, D. 2019. Adapting object detectors via selective cross-domain alignment. In CVPR.

Zou, Y.; Yu, Z.; Liu, X.; Kumar, B. V.; and Wang, J. 2019. Confidence regularized self-training. In ICCV.