# Interpretable Generative Adversarial Networks

Chao Li1,3, Kelu Yao1, Jin Wang1, Boyu Diao1, Yongjun Xu1, Quanshi Zhang2
1Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
2Shanghai Jiao Tong University, China
3Zhejiang Laboratory, Hangzhou 311100, China
{lichao, yaokelu, wangjin20g, diaoboyu2012, xyj}@ict.ac.cn

These authors contributed equally. Chao Li and Quanshi Zhang are the corresponding authors. Chao Li is with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China and the Zhejiang Laboratory, Hangzhou, China.

Learning a disentangled representation is still a challenge in the field of the interpretability of generative adversarial networks (GANs). This paper proposes a generic method to modify a traditional GAN into an interpretable GAN, which ensures that filters in an intermediate layer of the generator encode disentangled, localized visual concepts. Each filter in the layer is supposed to consistently generate image regions corresponding to the same visual concept when generating different images. The interpretable GAN learns to automatically discover meaningful visual concepts without any annotations of visual concepts. The interpretable GAN enables people to modify a specific visual concept on generated images by manipulating the feature maps of the corresponding filters in the layer. Our method can be broadly applied to different types of GANs. Experiments have demonstrated the effectiveness of our method.

Introduction

Recently, generative adversarial networks (GANs) have achieved huge success in generating high-resolution and realistic images (Brock, Donahue, and Simonyan 2018; Karras, Laine, and Aila 2019). In addition, the interpretability of GANs has attracted increasing attention in recent years. In this field, learning a disentangled representation is still a challenge for state-of-the-art algorithms. A disentangled representation of a GAN means that each component of the representation only affects a distinct aspect of a generated image.

Previous studies on the disentanglement of GANs mainly focused on two perspectives. Some studies (Radford, Metz, and Chintala 2016; Chen et al. 2016) disentangled the attributes of images, such as the expression and eyeglasses of generated human face images. Other studies (Zhu et al. 2017; Huang et al. 2018) disentangled the structure and texture of images. However, these works failed to provide clear and symbolic features for visual concepts in the intermediate layer of the generator.

Figure 1: Compared with the traditional GAN, each filter in the interpretable GAN consistently represents a meaningful visual concept when generating different images. Different filters represent different visual concepts.

Therefore, we aim to propose a generic method to modify a traditional GAN into an interpretable GAN, which ensures that filters in an intermediate layer of the generator encode disentangled and localized visual concepts (e.g. object parts like the eyes, noses and mouths of human faces). Specifically, each filter in the intermediate layer is expected to consistently generate image regions corresponding to the same visual concept when generating different images. Different filters in the intermediate layer are expected to generate image regions corresponding to different visual concepts.
Learning the disentangled and localized visual concepts is of great value in both theory and practice. For example, Shen et al. (2020) enabled people to manipulate various facial attributes on generated images by varying the latent codes. In contrast, this research enables people to modify a specific visual concept on generated images by manipulating the feature maps of the corresponding filters, such as changing the appearance of a specific visual concept.

However, it remains challenging to ensure that the visual concepts learned by the GAN have clear meanings, i.e. to explore the essence of meaningful visual concepts. To the best of our knowledge, there is no specific method that directly guarantees that filters in an intermediate layer of the generator encode meaningful visual concepts. In particular, we expect filters in the intermediate layer to automatically learn meaningful visual concepts without any manual annotations of visual concepts. This is because such annotations usually represent humans' understanding of images and cannot reflect the representations inside the GAN.

Table 1: Comparisons with other face-editing methods (Editing in Style 2020; MaskGAN 2020; InterFaceGAN 2020; MegaFS 2021; Feature Collaging 2018; InfoSwap 2021a; RSGAN 2018; HifaFace 2021b; Age Embedding 2021; ELEGANT 2018; StyleFlow 2021; Facial Semantics 2021; NaviGAN 2021; and ours). The first column refers to replacing whole objects (e.g. faces) and object parts (e.g. noses of faces) on images. The second column refers to changing global attributes (e.g. age) and local attributes (e.g. smiling) on images. The third column represents whether the method learns interpretable intermediate features. The fourth column refers to changing the location of parts on images. The fifth column represents whether the method requires annotations of facial semantics. Our method meets most of the requirements.

In order to ensure that the GAN learns meaningful visual concepts, we expect each filter in an intermediate layer of the generator to consistently represent the same visual concept across different images. We notice that a specific visual concept is usually represented by multiple filters in an intermediate layer of the generator. In this way, we divide the filters in the intermediate layer into different groups and assume that different groups represent different visual concepts. Specifically, we expect filters in the same group to consistently generate image regions corresponding to the same visual concept when generating different images. Note that filters in the same group are expected to represent almost the entire visual concept rather than sub-parts of the visual concept, which ensures the clarity of the visual concept represented by each filter.

Furthermore, it is also crucial to ensure the strictness of explanation results. In other words, if a filter represents a certain visual concept, then the neural activations in the feature map of this filter should exclusively correspond to this visual concept, without any noisy activations in other unrelated regions. To this end, we propose a probability model to measure the fitness between explanation results and the neural activations in the feature maps. Specifically, the probability model is formulated as an energy-based model.
The input of the energy-based model is the set of feature maps in the target layer. The output is the probability that the neural activations in the feature maps correspond to visual concepts. A high probability indicates that the neural activations in the feature map of each filter correspond to a clear visual concept; a low probability indicates the opposite. In this way, we train the energy-based model to learn to evaluate the feature maps in the target layer. Then, this energy-based model is used to refine the representations inside the target layer of the GAN.

In this study, we evaluate our interpretable GANs both qualitatively and quantitatively. For qualitative evaluation, we visualize the feature map of each filter to evaluate the consistency of the visual concept represented by each filter across different images. For quantitative evaluation, we evaluate the results of modifying visual concepts on generated images, in order to show the correctness and locality of the modification of a specific visual concept. Besides, we also evaluate the realism of generated images both qualitatively and quantitatively.

Contributions of this paper can be summarized as follows. We propose a generic method to modify a traditional GAN into an interpretable GAN without any annotations of visual concepts. In the interpretable GAN, each filter in an intermediate layer of the generator consistently generates the same localized visual concept when generating different images. Experiments show that our method can be applied to different types of GANs and enables people to modify a specific visual concept on generated images.

Related Work

Disentanglement of GANs. Previous works have mainly explored the disentanglement of GANs from two perspectives. Several works (Radford, Metz, and Chintala 2016; Chen et al. 2016; Härkönen et al. 2020; Shen and Zhou 2021; Voynov and Babenko 2020; Wu, Lischinski, and Shechtman 2021) focused on disentangling the attributes of the generated images. Shen et al. (2020) disentangled the gender, age and expression of the generated human faces. Jahanian et al. (2019) and Plumerault et al. (2019) disentangled simple transformations of the generated images, such as translation and zooming, to control the image generation of GANs. Other works (Zhu et al. 2017; Huang et al. 2018) focused on disentangling the structure and texture of the generated images. FineGAN (Singh, Ojha, and Lee 2019) disentangled the object shape, background and object appearance (color/texture) to generate images. Ma et al. (2018) disentangled foreground, background and pose information to generate images of persons. However, these studies failed to provide clear and symbolic representations for visual concepts in the generation of images.

Figure 2: Visual comparisons with other methods for interpretable GANs. These methods focus on different types of interpretability. Method 1 (2017) disentangled the structure and the style of the image. Method 2 (2019) learned features for the localized object in the image. Method 3 (2020) learned the disentangled features for attributes of the image. In contrast, our method learns each filter to encode an object part without part annotations. To the best of our knowledge, no other GANs can ensure such part interpretability.

Bau et al. (2018) identified a group of filters closely related to objects and object parts, but required supervision from a semantic segmentation model.
Collins et al. (2020) explored disentangling object parts of the generated images without external supervision, but did not ensure that each filter represented a clear meaning. In contrast, our method ensures that each filter represents a localized visual concept without human supervision.

Face editing with GANs. Previous methods for face editing were mainly conducted in image-to-image settings (Shen and Liu 2017; Zhang et al. 2018; Richardson et al. 2021). Fader networks (Lample et al. 2017) learned to vary the values of attributes to change the attributes of the generated images. StarGAN (Choi et al. 2018) learned to perform image-to-image translations across multiple domains using a single model to edit different attributes of human face images. However, these methods could not edit images with exemplars. To this end, ELEGANT (Xiao, Hong, and Ma 2018) learned to transfer attributes between two images by exchanging their latent codes. MaskGAN (Lee et al. 2020) enabled diverse manipulations of human face images by modifying the masks of target images according to the source images. However, these methods required supervision from annotated attributes or masks. In comparison, our method modifies a specific visual concept according to other generated images without manual annotations of visual concepts.

Given training images without annotations of visual concepts, we aim to train an interpretable GAN in an end-to-end manner. Specifically, given a target convolutional layer of the generator, we expect each filter in this layer to represent a meaningful visual concept (e.g. object parts like the eyes, noses and mouths of human faces). In other words, each filter in the target layer is expected to consistently generate the same visual concept when generating different images.

The key challenge is to ensure that each filter in the target layer of the GAN represents a meaningful visual concept. To this end, we notice that multiple filters usually represent a certain visual concept when they generate similar image regions corresponding to this visual concept. This phenomenon was also discussed in (Shen et al. 2021) for filters in convolutional neural networks (CNNs). Therefore, we divide the filters in the target layer into different groups, which represent different visual concepts respectively. We expect filters in the same group to represent the same visual concept.

Let $M$ denote the number of filters in the target layer. We divide the $M$ filters in the target layer into $C$ groups. Let $q_j \in \{1, 2, \dots, C\}$ denote the index of the group to which the $j$-th filter belongs across different images. $Q = \{q_1, q_2, \dots, q_M\}$ denotes the partition of filters. Let $G$ denote the generator of the GAN. To encourage each filter in the target layer to represent a meaningful visual concept, we aim to optimize the generator $G$ and the partition $Q$ to force filters in the same group to generate the same image region on a generated image. In addition to ensuring that each filter represents a meaningful visual concept, it is also important that the generator of the GAN generates realistic images. In this way, we design the following loss function to train the interpretable GAN:

$$L = L_{\text{GAN}}(G, D) + \lambda_0 \, \text{Loss}(Q, G) \quad (1)$$

where $\lambda_0$ denotes a positive weight; $L_{\text{GAN}}(G, D)$ denotes the traditional GAN loss (Goodfellow et al. 2014; Gulrajani et al. 2017), where $D$ denotes the discriminator of the GAN; $\text{Loss}(Q, G)$ is the interpretability loss that encourages each filter in the target layer to represent a meaningful visual concept, which will be introduced later.
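To make the structure of Eq. (1) concrete, below is a minimal PyTorch-style sketch of a generator update that adds the interpretability term to an ordinary GAN loss. The non-saturating loss, the `return_features` flag, and the `interpretability_loss` callable are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def generator_step(generator, discriminator, interpretability_loss, z, lambda_0=1.0):
    """One hypothetical generator update implementing L = L_GAN + lambda_0 * Loss(Q, G)."""
    # Assumed: the generator returns both the image and the feature maps
    # of the target intermediate layer, i.e. f_G(z).
    fake_images, target_features = generator(z, return_features=True)
    # A common choice of L_GAN for the generator: the non-saturating loss.
    l_gan = F.softplus(-discriminator(fake_images)).mean()
    # Interpretability term Loss(Q, G), computed from the target-layer feature maps.
    l_interp = interpretability_loss(target_features)
    total = l_gan + lambda_0 * l_interp
    total.backward()
    return total.item()
```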
Learning the partition Q. Given the generator $G$, we expect to learn the partition $Q$ that ensures filters in the same group generate similar image regions. In other words, we expect feature maps in each group to have similar neural activations. To this end, we use a Gaussian mixture model (GMM) to learn the partition $Q$ from the feature maps in the target layer. Let $\{z_i\}_{i=1}^{N}$ denote a set of $N$ input latent vectors, which generate $N$ different images through the generator $G$. Given the $i$-th latent vector $z_i \in \mathbb{R}^d$, let $f_G(z_i) = [f_i^1, f_i^2, \dots, f_i^j, \dots, f_i^M]$ denote the feature maps in the target convolutional layer of the generator $G$ after the ReLU operation. Here $f_i^j \in \mathbb{R}^K$ denotes the feature map of the $j$-th filter. Then, let $F^j = [f_1^j, f_2^j, \dots, f_N^j]$ denote the feature maps of the $j$-th filter given the set of $N$ input latent vectors $\{z_i\}_{i=1}^{N}$.

The Gaussian mixture model is formulated as $P_\Theta(F^j)$, where $\Theta$ denotes the model parameters. The key challenge is to optimize the GMM parameters $\Theta$ to learn the partition $Q$. Specifically, we take the $j$-th filter's group index $q_j$ as a latent variable, and $P_\Theta(F^j)$ estimates the likelihood of the $j$-th filter's feature maps belonging to any group $c$, i.e. $P_\Theta(F^j) = \sum_c P_\Theta(F^j, q_j = c)$. In this way, we have $P_\Theta(F^j, q_j = c) = P_\Theta(q_j = c)\, P_\Theta(F^j \mid q_j = c)$. We define $P_\Theta(q_j = c) = p_c$, where $p_c$ denotes the prior probability of the $c$-th group. $P_\Theta(F^j \mid q_j = c)$ denotes the probability of the $j$-th filter's feature maps having similar neural activations to the feature maps in the $c$-th group. To simplify the calculation of $P_\Theta(F^j \mid q_j = c)$, we assume that when the $j$-th filter belongs to the $c$-th group, the probabilities of the $j$-th filter's feature maps across different images are independent of each other, i.e. $P_\Theta(F^j \mid q_j = c) = \prod_{i=1}^{N} P_\Theta(f_i^j \mid q_j = c)$. We assume that $f_i^j \mid (q_j = c) \sim N(\mu_c, \sigma_c^2 I)$, where $I$ denotes the identity matrix. To learn the model parameters $\{p_c, \mu_c, \sigma_c^2\} \subseteq \Theta$, we design the following loss.

$$\max_\Theta L_{\text{GMM}} = \max_\Theta \sum_{j=1}^{M} \log P_\Theta(F^j) \quad (2)$$

Let $\Theta^*$ denote the optimal $\Theta$ for equation (2). In this way, the optimal partition $Q^*$ is solved as $Q^* = \{q_j^* \mid q_j^* = \arg\max_{q_j} P_{\Theta^*}(q_j \mid F^j)\}$.

Figure 3: Visualization of feature maps in interpretable GANs based on the method in (Zhang, Wu, and Zhu 2018). The first column shows the generated images. The second column shows the visualization of the distributions of visual concepts encoded by filters in an intermediate layer. Each remaining column in the figure corresponds to a certain filter.
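The partition step can be approximated with an off-the-shelf Gaussian mixture model. Below is a minimal sketch assuming feature maps of shape (N, M, K) collected from the target layer; it summarizes each filter by its mean feature map and uses scikit-learn's GaussianMixture, so it is a simplified illustration of Eq. (2) rather than the authors' exact EM procedure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def partition_filters(feature_maps: np.ndarray, num_groups: int = 24) -> np.ndarray:
    """Assign each of the M filters to one of C groups (an approximation of Eq. 2).

    feature_maps: array of shape (N, M, K), i.e. the K-dimensional (flattened
    spatial) feature maps of M filters collected over N generated images.
    Returns q of shape (M,), where q[j] is the group index of the j-th filter.
    """
    # Describe each filter j by a single descriptor of its feature maps F^j.
    # Here we simply average over the N images (a simplification of the
    # per-image product of Gaussians in the paper's formulation).
    per_filter = feature_maps.mean(axis=0)                    # (M, K)
    gmm = GaussianMixture(n_components=num_groups,
                          covariance_type="spherical",        # matches sigma_c^2 * I
                          random_state=0).fit(per_filter)
    # q_j = argmax_c P(q_j = c | descriptor of F^j)
    return gmm.predict(per_filter)

# Usage: q = partition_filters(collected_maps, num_groups=24)
```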
Realism of generated images. Given the partition $Q$ of the filters in the target layer, forcing each filter in the same group to exclusively generate the same visual concept may decrease the realism of the generated images, even with the help of the discriminator $D$. To this end, we use an energy-based model (Gao et al. 2018; Nijkamp et al. 2019) that outputs a probability of the realism of the feature maps $f_G(z)$ in the target layer. Specifically, the energy-based model outputs the probability that the feature maps generate realistic images. In this way, feature maps $f_G(z)$ with high realism can be expected to generate realistic images. The energy-based model is formulated as $P_W(f_G(z) \mid Q)$, where $W$ denotes the model parameters. To increase the realism of the images generated from the feature maps $f_G(z)$, we use the following loss to learn the energy-based model $P_W(f_G(z) \mid Q)$ by maximizing the log-likelihood.

$$L_{\text{real}}(W, G) = -\frac{1}{N} \sum_{i=1}^{N} \log P_W\!\left(f_G(z_i) \mid Q\right) \quad (3)$$

To measure the realism of the feature maps in the target layer, the energy-based model $P_W(f_G(z) \mid Q)$ is designed as follows.

$$P_W\!\left(f_G(z) \mid Q\right) = \frac{1}{Z(W)} \exp\!\left(g_W(f_G(z))\right) P_0(z) \quad (4)$$

where $Z(W) = \int \exp\!\left(g_W(f_G(z))\right) P_0(z)\, dz$ is used for normalization. Here we consider $G$ as the current generator with fixed parameters for calculating $Z(W)$, in order to learn the parameters $W$. $P_0(z)$ denotes the Gaussian distribution, i.e. $P_0(z) \sim N(0, \sigma_0^2 I)$. $g_W(f_G(z))$ denotes the metric that measures the realism of the feature maps in the target layer. Specifically, we have $g_W(f_G(z)) = \sum_{j=1}^{M} \sum_{c=1}^{C} \left[ W_{jc} \cdot (f^j \circ \bar{f}^c) \right]$, where $\cdot$ denotes the inner product and $\circ$ denotes the element-wise product. $W \in \mathbb{R}^{M \times C \times K}$ denotes the parameters of the energy-based model. $f^j \in \mathbb{R}^K$ denotes the feature map of the $j$-th filter. $\bar{f}^c \in \mathbb{R}^K$ denotes the center of the $c$-th group of feature maps, which can be computed as the mean of the feature maps in the $c$-th group.

Interpretability of filters in the target layer. In order to increase the interpretability of the filters in the target layer, we expect each filter in the same group to exclusively generate the same image region. In other words, when the $j$-th filter belongs to the $c$-th group, we expect the $j$-th filter's feature map $f^j$ to be close to the group center $\bar{f}^c$. Besides, we also consider the diversity of visual concepts represented by different filters. To this end, when the $j$-th filter does not belong to the $c$-th group, we expect the $j$-th filter's feature map $f^j$ to be different from the group center $\bar{f}^c$. In this way, we design the following loss.

$$L_{\text{interp}}(W) = -\sum_{j=1}^{M} \sum_{c=1}^{C} \sum_{k=1}^{K} \mathbb{I}(q_j = c)\, W_{jck} + \lambda_1 \sum_{j=1}^{M} \sum_{c=1}^{C} \sum_{k=1}^{K} \mathbb{I}(q_j \neq c)\, W_{jck} \quad (5)$$

where $\lambda_1$ denotes a positive weight and $\mathbb{I}(\cdot)$ is the indicator function. In this way, when the $j$-th filter belongs to the $c$-th group, the metric $g_W(f_G(z))$ forces the $j$-th filter's feature map $f^j$ and the $c$-th group center $\bar{f}^c$ to have neural activations at similar positions by pushing $W_{jck}$ to be positive. Otherwise, $g_W(f_G(z))$ forces $f^j$ and $\bar{f}^c$ to have neural activations at different positions by pushing $W_{jck}$ to be negative. Please see Fig. 4 for more details.

Figure 4: (a) Comparisons of receptive fields (RFs) between the center of a group and each filter in the group. (b) Proportions of filters representing different visual concepts. (c) Filters learned with different values of C.

To sum up, $\text{Loss}(W, Q, G)$ is designed as follows:

$$\text{Loss}(W, Q, G) = \sum_{q_j \in Q} P_{\Theta^*}(q_j \mid F^j) + \lambda_2 L_{\text{real}}(W, G) + \lambda_3 L_{\text{interp}}(W) \quad (6)$$

where $\lambda_2$ and $\lambda_3$ are positive weights. $\sum_j P_{\Theta^*}(q_j \mid F^j)$ is designed to learn the partition $Q$ of the filters. $L_{\text{real}}(W, G)$ and $L_{\text{interp}}(W)$ are designed to increase the realism of the generated images and the interpretability of the filters in the target layer, respectively. The overall loss is optimized as follows.

$$\min_{W, G} \max_{D, Q} L \quad (7)$$

Learning. Given the partition of filters $Q$, we optimize $\text{Loss}(W, Q, G)$ w.r.t. $W$ and $G$ once after optimizing $L_{\text{GAN}}(G, D)$ w.r.t. $G$ and $D$ for $T$ times. However, the gradient of $L_{\text{real}}(W, G)$ w.r.t. $W$ cannot be calculated directly and has to be approximated by Markov chain Monte Carlo (MCMC), such as Langevin dynamics (Girolami and Calderhead 2011; Zhu and Mumford 1998). Specifically, following the method in (Gao et al. 2018), the gradient of $L_{\text{real}}(W, G)$ w.r.t. $W$ is approximately calculated as follows.

$$\nabla_W L_{\text{real}}(W, G) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_W\, g_W\!\left(f_G(\hat{z}_i)\right) - \frac{1}{N} \sum_{i=1}^{N} \nabla_W\, g_W\!\left(f_G(z_i)\right) \quad (8)$$

where $\{\hat{z}_i\}_{i=1}^{N}$ denotes the revised latent vectors sampled by Langevin dynamics. The iterative process of Langevin dynamics is carried out as follows.

$$z^{\tau+1} = z^{\tau} + \frac{\delta^2}{2} \nabla_z P_W\!\left(f_G(z^{\tau}) \mid Q\right) + \delta U^{\tau} \quad (9)$$

where $\tau$ denotes the time step, $\delta$ denotes the step size, and $U^{\tau} \sim N(0, I)$ is Gaussian noise. In this way, equations (9) and (8) are calculated alternately to update the energy-based model parameters $W$.
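For intuition, here is a small PyTorch-style sketch of the realism metric $g_W$ from Eq. (4) and one Langevin update on the latent vector in the spirit of Eq. (9). It uses the standard log-density form of Langevin dynamics, and the `generator_features` helper and all names are illustrative assumptions rather than the released implementation.

```python
import torch

def g_w(W: torch.Tensor, features: torch.Tensor, group_centers: torch.Tensor) -> torch.Tensor:
    """g_W(f_G(z)) = sum_{j,c} < W_{jc}, f^j (elementwise*) fbar^c >.

    W:             (M, C, K) energy-based model parameters.
    features:      (M, K) target-layer feature maps f^j for one latent vector z.
    group_centers: (C, K) group centers fbar^c (mean feature map of each group).
    """
    pairwise = features.unsqueeze(1) * group_centers.unsqueeze(0)   # (M, C, K)
    return (W * pairwise).sum()

def langevin_step(z, W, generator_features, group_centers, delta=0.1, sigma0=1.0):
    """One Langevin update of z toward feature maps with higher realism.

    Uses grad_z of log[exp(g_W(f_G(z))) * P_0(z)]; Z(W) does not depend on z,
    so it drops out of the gradient.
    """
    z = z.detach().requires_grad_(True)
    features = generator_features(z)          # assumed helper returning the (M, K) maps
    log_density = g_w(W, features, group_centers) - 0.5 * (z ** 2).sum() / sigma0 ** 2
    grad = torch.autograd.grad(log_density, z)[0]
    return (z + 0.5 * delta ** 2 * grad + delta * torch.randn_like(z)).detach()
```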
Experiments

We applied our method to two state-of-the-art GANs trained on two different datasets. For qualitative evaluation, we visualized the feature maps of filters to show the consistency of the visual concept represented by each filter. We also visualized the results of modifying specific visual concepts on generated images. Besides, we demonstrated that performing Langevin dynamics could improve the realism of some badly generated images and modified images. For quantitative evaluation, we conducted a user study and a face verification experiment to examine the correctness of exchanging a specific visual concept and of swapping faces between pairs of images. We also calculated the mean squared error (MSE) between original images and modified images in terms of a certain visual concept, in order to evaluate the locality of our modifications. We calculated the Fréchet Inception Distance (FID) (Heusel et al. 2017) to measure the realism of generated images. Experiments show that our method successfully disentangled the localized visual concepts encoded by filters of the generator.

Figure 5: Exchanging a specific visual concept between the original images and the source images. The second column shows the chosen parts for exchanging, which are marked in red. The fifth column shows the mean squared-error heatmaps between the original images and the modified images.

Models and datasets. We applied our method to two different GANs, BigGAN (Brock, Donahue, and Simonyan 2018) and StyleGAN (Karras, Laine, and Aila 2019). BigGAN was trained on the FFHQ dataset (Karras, Laine, and Aila 2019). StyleGAN was trained on the CelebA-HQ dataset (Karras et al. 2018).

Implementation details. We set the hyperparameters as C = 24, λ0 = 1 and λ1 = 2/3. Since Lreal(W, G) was used to update two separate models, i.e. the generator and the energy-based model, we set λ2 to different values for updating the different models. Specifically, we set λ2 = 1 for updating the energy-based model parameters W. For updating the generators of BigGAN and StyleGAN, we set λ2 = 0.1 and λ2 = 0.05, respectively. To ensure the interpretability of filters, we expected Linterp to dominate the learning process in the early stage. To this end, for StyleGAN, λ3 was initially set to 3e-2 and exponentially decayed to 3e-6 over 1000 batches. For BigGAN, λ3 was set the same but exponentially decayed over 500 batches. We set T = 50 for BigGAN and T = 100 for StyleGAN. We initialized each dimension of the parameters W to zero. We used the SGD optimizer with a learning rate of 10 for the parameters W.

Figure 6: Swapping whole faces between the original images and the source images. The second column shows the chosen parts for swapping, which are marked in red. The fourth column shows the replaced images.

Learning Interpretable GANs

Learning an interpretable BigGAN. We learned an interpretable GAN based on the BigGAN architecture to generate images of size 64×64. We first trained the BigGAN following the experimental settings in (Brock, Donahue, and Simonyan 2018). Then, we added our proposed loss Loss(W, Q, G) to an intermediate layer of the generator to fine-tune the BigGAN, where the size of the feature maps f_G(z) is 32×32. To be clear, we only fine-tuned the generator of the BigGAN. The discriminator of the BigGAN was reinitialized and trained from scratch, because the discriminator usually converged faster than the generator in BigGAN.
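Attaching the proposed loss to an intermediate layer, as described above, can be implemented by capturing that layer's feature maps during each forward pass of the generator. Below is a hedged sketch using a PyTorch forward hook; the layer name is a hypothetical placeholder and this is not the authors' code.

```python
import torch

def attach_feature_probe(generator: torch.nn.Module, layer_name: str):
    """Capture the target intermediate layer's feature maps f_G(z) on every forward pass."""
    captured = {}

    def hook(module, inputs, output):
        # Keep activations after a ReLU so the captured feature maps are non-negative,
        # matching the paper's use of feature maps after the ReLU operation.
        captured["features"] = torch.relu(output)

    layer = dict(generator.named_modules())[layer_name]   # e.g. "blocks.2.conv" (hypothetical)
    handle = layer.register_forward_hook(hook)
    return captured, handle

# Usage sketch:
# captured, handle = attach_feature_probe(generator, "blocks.2.conv")
# fake = generator(z)                    # fills captured["features"] with (B, M, 32, 32) maps
# l_interp = interpretability_loss(captured["features"])
```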
Learning an interpretable StyleGAN. We learned an interpretable GAN based on the StyleGAN architecture to generate images of size 128×128. We first trained the StyleGAN following the experimental settings in (Karras, Laine, and Aila 2019). We noticed that the activation functions in the generator were all leaky ReLU (Maas et al. 2013). To this end, we added a ReLU layer after an intermediate layer of the generator, where the size of the feature maps f_G(z) is 32×32. Then, we added our proposed loss Loss(W, Q, G) to the output of the added ReLU layer. The generator and the discriminator of the StyleGAN were jointly fine-tuned, because they were progressively trained in (Karras, Laine, and Aila 2019).

Qualitative Evaluation

Visualization of feature maps. Based on the method in (Zhang, Wu, and Zhu 2018), we visualized the receptive fields (RFs) corresponding to a filter's feature maps, which were scaled up to the image resolution. Fig. 3 shows the RFs of filters in our interpretable GANs. In our interpretable GANs, each filter consistently generated image regions corresponding to the same visual concept when generating different images. Different filters generated image regions corresponding to different visual concepts. We also compared the RFs between the group center and the filters in this group, as shown in Fig. 4 (a). Moreover, we explored the number of visual concepts represented by filters in our interpretable GAN. Fig. 4 (b) illustrates the proportions of filters representing different visual concepts when setting C = 24. Results show that the 512 filters represented 11 visual concepts in total. Besides, as shown in Fig. 4 (c), when setting different values of C, GANs with a larger value of C learned more detailed concepts.

Modifying visual concepts on images. Our interpretable GAN enabled us to modify specific visual concepts on generated images. For example, we exchanged a specific visual concept between pairs of images by exchanging the corresponding feature maps in the target layer (i.e. the convolutional layer that was modified into an interpretable layer). Fig. 5 shows the results of exchanging the mouth, hair and nose between pairs of images. Note that our method only changed the shape of a specific visual concept. For StyleGAN, the color of a specific visual concept was mainly controlled by the styles in higher-resolution layers, as discussed in (Karras, Laine, and Aila 2019). Fig. 5 also shows the difference between the modified images and the original images, where at every pixel location we calculated the squared distance in RGB space. Our method only modified a localized visual concept without changing other unrelated regions. Besides, we also exchanged whole faces between pairs of images, as shown in Fig. 6.
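The concept-exchange operation amounts to copying the feature maps of one group of filters from a source forward pass into the original forward pass at the interpretable layer. A minimal PyTorch-style sketch is shown below, assuming the generator is split into a head (up to the interpretable layer) and a tail (the remaining layers); the split and the names are hypothetical.

```python
import torch

@torch.no_grad()
def swap_concept(gen_head, gen_tail, z_original, z_source, q, concept_id: int):
    """Replace one visual concept of the original image with that of the source image.

    gen_head: maps a latent vector to the (B, M, H, W) feature maps of the interpretable layer.
    gen_tail: maps those feature maps to the generated image.
    q:        LongTensor of shape (M,), q[j] = group index of the j-th filter (the partition Q).
    """
    feats_orig = gen_head(z_original)           # (1, M, H, W)
    feats_src = gen_head(z_source)              # (1, M, H, W)
    mask = (q == concept_id)                    # filters belonging to the chosen concept
    feats_mod = feats_orig.clone()
    feats_mod[:, mask] = feats_src[:, mask]     # copy that concept's feature maps
    return gen_tail(feats_mod)                  # image with the exchanged concept
```

Running the swap in both directions would yield the pair of modified images used in the experiments.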
Improving the realism of images. To improve the realism of some badly generated images, we used Langevin dynamics to sample revised latent vectors. As shown in Fig. 7 (a), the revised latent vectors sampled by Langevin dynamics generated more realistic images than the original latent vectors. Besides, we also performed Langevin dynamics to improve the realism of modified images. Specifically, given two latent vectors $z_a$ and $z_b$, we exchanged a certain group of feature maps between $f_G(z_a)$ and $f_G(z_b)$. Let $f(z_a)$ and $f(z_b)$ denote the exchanged feature maps of $z_a$ and $z_b$. Then, we performed Langevin dynamics to sample revised latent vectors. Specifically, $z_a$ was updated as follows:

$$z_a^{\tau+1} = z_a^{\tau} + \frac{\delta^2}{2} \nabla_{z_a}\!\left( P_W\!\left(f(z_a^{\tau}) \mid Q\right) + P_W\!\left(f(z_b^{\tau}) \mid Q\right) \right) + \delta U^{\tau}$$

and $z_b$ was updated in the same way. In this way, the exchanged feature maps $f(z_a)$ and $f(z_b)$ had higher probabilities and could generate more realistic images. Fig. 7 (b) shows the results of the modified images after performing Langevin dynamics.

Figure 7: (a) Improving the realism of generated images by Langevin dynamics. Each column shows the images generated after running the iterative process of Langevin dynamics for τ steps. (b) Improving the realism of modified images by Langevin dynamics. The third column shows the replaced images.

Quantitative Analysis

Human perception evaluation. We conducted a user study to evaluate the results of modifying a specific visual concept on generated images. Specifically, we exchanged the mouth, chin and eyes between pairs of images as three tasks. We randomly chose 200 pairs of test images for each task. For each task, given an original image and a modified image, 10 volunteers were asked to choose, among four choices, which image contained the visual concept exchanged onto the modified image. Table 2 shows the human evaluation scores. Each score represents the average percentage of correctly-answered questions among all volunteers. We used the methods proposed in (Collins et al. 2020) and (Suzuki et al. 2018) as baselines. Our method outperformed the above methods in the user study.

Table 2: Human evaluation scores.

| Model | Mouth (%) | Eyes (%) | Chin (%) |
|---|---|---|---|
| Editing in Style (2020) | 37.90 | 34.60 | - |
| Feature Collaging (2018) | 56.00 | 45.40 | 46.40 |
| Interpretable StyleGAN | 83.60 | 63.70 | 81.67 |
| Interpretable BigGAN | 89.60 | 82.10 | 92.30 |

Identity preserving evaluation. We performed a face verification experiment to evaluate the results of face swapping. For each pair of images, we replaced the face of the original image with the face of the source image to generate the modified image. Then we tested whether the face of the modified image and the face of the source image belonged to the same identity. Specifically, we selected 2K pairs of faces and used ArcFace (Deng et al. 2019) (99.52% on LFW (Huang et al. 2008)) to test the results. Table 3 shows the face verification accuracy. Our method was superior to other state-of-the-art face swapping methods in terms of identity preservation.

Table 3: Face verification accuracy. All methods were tested on images generated by our Interpretable StyleGAN.

| Model | Face verification accuracy (%) |
|---|---|
| SimSwap (2020) | 87.40 |
| FaceShifter¹ (2020) | 85.45 |
| FSGAN (2019) | 89.20 |
| Interpretable StyleGAN | 90.25 |

¹Using the code in https://github.com/denis19973/faceshifter tornado, because the original paper has not released the code yet.

Locality evaluation. To evaluate the locality of modifying a specific visual concept, we calculated the mean squared error (MSE) between the original images and the modified images in RGB space. Specifically, we manually annotated segmentation masks for specific visual concepts on 100 generated images. Then, we measured the ratio of the Out-MSE to the In-MSE for each pair of images, i.e. the MSE outside the region of a specific visual concept and the MSE inside the region of that visual concept. Let $x \in \mathbb{R}^D$ and $x' \in \mathbb{R}^D$ denote the original image and the modified image. $G^c(x) \in \{0, 1\}^D$ denotes the hand-annotated segmentation mask of the $c$-th visual concept on image $x$ ($c = 1, \dots, C$). $\hat{G}^c(x) \in \{0, 1\}^D$ denotes the reverse mask, i.e. $\hat{G}^c_u(x) = \mathbb{I}(G^c_u(x) = 0)$, where $\mathbb{I}(\cdot)$ is the indicator function ($u = 1, \dots, D$). The In-MSE and Out-MSE for the $c$-th visual concept are calculated as follows:

$$\text{In-MSE}_c = \frac{\sum_{u=1}^{D} G^c_u(x)\,(x_u - x'_u)^2}{\sum_{u=1}^{D} G^c_u(x)}, \qquad \text{Out-MSE}_c = \frac{\sum_{u=1}^{D} \hat{G}^c_u(x)\,(x_u - x'_u)^2}{\sum_{u=1}^{D} \hat{G}^c_u(x)}.$$

The locality metric of the modification for the $c$-th visual concept is calculated as $\text{Locality}_c = \text{Out-MSE}_c / \text{In-MSE}_c$. A small value of this metric indicates that our modification mainly changes the regions related to a specific visual concept. Table 4 shows the results of our locality metric for each visual concept. Our method had better localization, i.e. less change outside the region of a specific visual concept.

Table 4: Locality evaluation.

| Model | Mouth | Eyes | Chin |
|---|---|---|---|
| Editing in Style (2020) | 1.3649 | 0.9745 | - |
| Feature Collaging (2018) | 0.1872 | 0.1293 | 0.0576 |
| Interpretable StyleGAN | 0.0606 | 0.0502 | 0.0163 |
| Interpretable BigGAN | 0.0296 | 0.0197 | 0.0311 |
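The locality metric can be computed directly from an image pair and a concept mask; below is a small NumPy sketch of In-MSE, Out-MSE and their ratio as defined above (array names are illustrative).

```python
import numpy as np

def locality_metric(x: np.ndarray, x_mod: np.ndarray, mask: np.ndarray) -> float:
    """Locality_c = Out-MSE_c / In-MSE_c for one image pair and one concept mask.

    x, x_mod: original and modified images, flattened to shape (D,) in RGB space.
    mask:     binary mask G^c(x) of shape (D,), 1 inside the concept region.
    """
    sq_err = (x.astype(np.float64) - x_mod.astype(np.float64)) ** 2
    inv_mask = 1 - mask                                    # reverse mask G_hat^c(x)
    in_mse = (mask * sq_err).sum() / mask.sum()            # MSE inside the concept region
    out_mse = (inv_mask * sq_err).sum() / inv_mask.sum()   # MSE outside the concept region
    return out_mse / in_mse
```

A value well below 1 indicates that the edit stayed largely inside the annotated concept region.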
Realism evaluation. To measure the realism of generated images, we used the Fréchet Inception Distance (FID) (Heusel et al. 2017), which compares the distributions of two sets of images in the feature space of a deep CNN layer. The smaller the FID, the more realistic the generated images usually are. Table 5 shows the FID between the ground-truth images and 50K generated images of each GAN. The table indicates that forcing filters to encode disentangled visual concepts decreased the realism of generated images a bit. Surprisingly, performing Langevin dynamics achieved worse FID results, although Fig. 7 shows qualitatively that the realism of generated images was improved by Langevin dynamics. This re-emphasizes that correctly and automatically measuring the realism of generated images is still difficult.

Table 5: Fréchet Inception Distance (FID) between ground-truth images and generated images of GANs. * represents performing Langevin dynamics on generated images.

| Model | FID |
|---|---|
| StyleGAN, 128×128 | 12.86 |
| Interpretable StyleGAN, 128×128 | 18.81 |
| Interpretable StyleGAN*, 128×128 | 19.42 |
| BigGAN, 64×64 | 41.81 |
| Interpretable BigGAN, 64×64 | 56.74 |
| Interpretable BigGAN*, 64×64 | 57.72 |

Conclusion

In this paper, we have proposed a generic method to modify a traditional GAN into an interpretable GAN, which forces each filter in an intermediate layer of the generator to represent a meaningful visual concept. Specifically, we design a loss that pushes each filter in the intermediate layer to consistently generate image regions corresponding to the same visual concept when generating different images, and different filters to generate image regions corresponding to different visual concepts. Experiments have demonstrated that our method enables people to modify a specific visual concept on generated images, such as changing the appearance of this visual concept.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (No. 61906120, U19B2043), the Shanghai Natural Science Foundation (21JC1403800, 21ZR1434600), the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), and the Key Research Project of Zhejiang Lab (No. 2021PC0AC02). Besides Dr. Chao Li, Dr. Quanshi Zhang is also a corresponding author. He is with the John Hopcroft Center and the MoE Key Lab of Artificial Intelligence, AI Institute, at Shanghai Jiao Tong University, China.

References

Abdal, R.; Zhu, P.; Mitra, N. J.; and Wonka, P. 2021. StyleFlow: Attribute-conditioned exploration of StyleGAN-generated images using conditional continuous normalizing flows. ACM Transactions on Graphics (TOG), 40(3): 1-21.
Bau, D.; Zhu, J.-Y.; Strobelt, H.; Zhou, B.; Tenenbaum, J. B.; Freeman, W. T.; and Torralba, A. 2018. GAN Dissection: Visualizing and Understanding Generative Adversarial Networks. In International Conference on Learning Representations.

Brock, A.; Donahue, J.; and Simonyan, K. 2018. Large Scale GAN Training for High Fidelity Natural Image Synthesis. In International Conference on Learning Representations.

Chen, R.; Chen, X.; Ni, B.; and Ge, Y. 2020. SimSwap: An Efficient Framework For High Fidelity Face Swapping. In MM '20: The 28th ACM International Conference on Multimedia, 2003-2011. ACM.

Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; and Abbeel, P. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Proceedings of the 30th International Conference on Neural Information Processing Systems, 2180-2188.

Cherepkov, A.; Voynov, A.; and Babenko, A. 2021. Navigating the GAN parameter space for semantic image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3671-3680.

Choi, Y.; Choi, M.; Kim, M.; Ha, J.-W.; Kim, S.; and Choo, J. 2018. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8789-8797.

Collins, E.; Bala, R.; Price, B.; and Süsstrunk, S. 2020. Editing in Style: Uncovering the local semantics of GANs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5771-5780.

Deng, J.; Guo, J.; Xue, N.; and Zafeiriou, S. 2019. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4690-4699.

Gao, G.; Huang, H.; Fu, C.; Li, Z.; and He, R. 2021a. Information Bottleneck Disentanglement for Identity Swapping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3404-3413.

Gao, R.; Lu, Y.; Zhou, J.; Zhu, S.-C.; and Wu, Y. N. 2018. Learning generative ConvNets via multi-grid modeling and sampling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 9155-9164.

Gao, Y.; Wei, F.; Bao, J.; Gu, S.; Chen, D.; Wen, F.; and Lian, Z. 2021b. High-Fidelity and Arbitrary Face Editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16115-16124.

Girolami, M.; and Calderhead, B. 2011. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2): 123-214.

Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. Advances in Neural Information Processing Systems, 27.

Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; and Courville, A. C. 2017. Improved Training of Wasserstein GANs. In NIPS.

Härkönen, E.; Hertzmann, A.; Lehtinen, J.; and Paris, S. 2020. GANSpace: Discovering Interpretable GAN Controls. In Conference on Neural Information Processing Systems.

Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; and Hochreiter, S. 2017. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30.

Huang, G. B.; Mattar, M.; Berg, T.; and Learned-Miller, E. 2008. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Workshop on Faces in Real-Life Images: Detection, Alignment, and Recognition.
Huang, X.; Liu, M.-Y.; Belongie, S.; and Kautz, J. 2018. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), 172-189.

Jahanian, A.; Chai, L.; and Isola, P. 2019. On the steerability of generative adversarial networks. In International Conference on Learning Representations.

Karras, T.; Aila, T.; Laine, S.; and Lehtinen, J. 2018. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In International Conference on Learning Representations.

Karras, T.; Laine, S.; and Aila, T. 2019. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4401-4410.

Lample, G.; Zeghidour, N.; Usunier, N.; Bordes, A.; Denoyer, L.; and Ranzato, M. 2017. Fader networks: Generating image variations by sliding attribute values. In Advances in Neural Information Processing Systems, 5963-5972.

Lee, C.-H.; Liu, Z.; Wu, L.; and Luo, P. 2020. MaskGAN: Towards diverse and interactive facial image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5549-5558.

Li, L.; Bao, J.; Yang, H.; Chen, D.; and Wen, F. 2020. Advancing high fidelity identity swapping for forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5074-5083.

Li, Z.; Jiang, R.; and Aarabi, P. 2021. Continuous Face Aging via Self-estimated Residual Age Embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 15008-15017.

Ma, L.; Sun, Q.; Georgoulis, S.; Van Gool, L.; Schiele, B.; and Fritz, M. 2018. Disentangled person image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 99-108.

Maas, A. L.; Hannun, A. Y.; Ng, A. Y.; et al. 2013. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, 3. Citeseer.

Natsume, R.; Yatagawa, T.; and Morishima, S. 2018. RSGAN: Face swapping and editing using face and hair representation in latent spaces. arXiv preprint arXiv:1804.03447.

Nijkamp, E.; Hill, M.; Zhu, S.-C.; and Wu, Y. N. 2019. Learning Non-Convergent Non-Persistent Short-Run MCMC Toward Energy-Based Model. NeurIPS 2019.

Nirkin, Y.; Keller, Y.; and Hassner, T. 2019. FSGAN: Subject agnostic face swapping and reenactment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 7184-7193.

Plumerault, A.; Le Borgne, H.; and Hudelot, C. 2019. Controlling generative models with continuous factors of variations. In International Conference on Learning Representations.

Radford, A.; Metz, L.; and Chintala, S. 2016. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In Bengio, Y.; and LeCun, Y., eds., 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.

Richardson, E.; Alaluf, Y.; Patashnik, O.; Nitzan, Y.; Azar, Y.; Shapiro, S.; and Cohen-Or, D. 2021. Encoding in Style: A StyleGAN encoder for image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2287-2296.

Shen, W.; and Liu, R. 2017. Learning residual images for face attribute manipulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4030-4038.
Shen, W.; Wei, Z.; Huang, S.; Zhang, B.; Fan, J.; Zhao, P.; and Zhang, Q. 2021. Interpretable Compositional Convolutional Neural Networks. In Zhou, Z.-H., ed., Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, 2971-2978. International Joint Conferences on Artificial Intelligence Organization. Main Track.

Shen, Y.; Gu, J.; Tang, X.; and Zhou, B. 2020. Interpreting the latent space of GANs for semantic face editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9243-9252.

Shen, Y.; and Zhou, B. 2021. Closed-form factorization of latent semantics in GANs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1532-1540.

Singh, K. K.; Ojha, U.; and Lee, Y. J. 2019. FineGAN: Unsupervised hierarchical disentanglement for fine-grained object generation and discovery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6490-6499.

Suzuki, R.; Koyama, M.; Miyato, T.; Yonetsuji, T.; and Zhu, H. 2018. Spatially controllable image synthesis with internal representation collaging. arXiv preprint arXiv:1811.10153.

Voynov, A.; and Babenko, A. 2020. Unsupervised discovery of interpretable directions in the GAN latent space. In International Conference on Machine Learning, 9786-9796. PMLR.

Wu, Z.; Lischinski, D.; and Shechtman, E. 2021. StyleSpace analysis: Disentangled controls for StyleGAN image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12863-12872.

Xiao, T.; Hong, J.; and Ma, J. 2018. ELEGANT: Exchanging latent encodings with GAN for transferring multiple face attributes. In Proceedings of the European Conference on Computer Vision (ECCV), 168-184.

Zhang, G.; Kan, M.; Shan, S.; and Chen, X. 2018. Generative adversarial network with spatial attention for face attribute editing. In Proceedings of the European Conference on Computer Vision (ECCV), 417-432.

Zhang, Q.; Wu, Y. N.; and Zhu, S.-C. 2018. Interpretable convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8827-8836.

Zheng, Y.; Huang, Y.-K.; Tao, R.; Shen, Z.; and Savvides, M. 2021. Unsupervised Disentanglement of Linear-Encoded Facial Semantics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3917-3926.

Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, 2223-2232.

Zhu, S. C.; and Mumford, D. 1998. GRADE: Gibbs reaction and diffusion equations. In Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271), 847-854. IEEE.

Zhu, Y.; Li, Q.; Wang, J.; Xu, C.-Z.; and Sun, Z. 2021. One Shot Face Swapping on Megapixels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4834-4844.