The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Smooth Deep Image Generator from Noises

Tianyu Guo,1,2,3 Chang Xu,2 Boxin Shi,4 Chao Xu,1,3 Dacheng Tao2
1Key Laboratory of Machine Perception (MOE), School of EECS, Peking University, China
2UBTECH Sydney AI Centre, School of Computer Science, FEIT, University of Sydney, Australia
3Cooperative Medianet Innovation Center, Peking University, China
4National Engineering Laboratory for Video Technology, School of EECS, Peking University, China
tianyuguo@pku.edu.cn, c.xu@sydney.edu.au, shiboxin@pku.edu.cn, xuchao@cis.pku.edu.cn, dacheng.tao@sydney.edu.au

Abstract

Generative Adversarial Networks (GANs) have demonstrated a strong ability to fit complex distributions since they were first presented, especially in the field of natural image generation. Linear interpolation in the noise space produces a continuous change in the image space, which is an impressive property of GANs. However, the objective functions of GANs and their derived models give this property no special consideration. This paper analyzes perturbations on the input of the generator and their influence on the generated images. A smooth generator is then developed by investigating the tolerable input perturbation. We further integrate this smooth generator with a gradient-penalized discriminator, and design a smooth GAN that generates stable and high-quality images. Experiments on real-world image datasets demonstrate the necessity of studying the smooth generator and the effectiveness of the proposed algorithm.

Introduction

Deep generative models have attracted increasing attention from researchers, especially in the task of natural image generation. Representative techniques include the Variational Auto-Encoder (VAE) (Kingma and Welling 2013), PixelCNN (van den Oord et al. 2016), and Generative Adversarial Networks (GANs) (Goodfellow et al. 2014). GANs (Goodfellow et al. 2014) translate Gaussian inputs into natural images by discovering the equilibrium of a max-min game. The generator in vanilla GANs transforms noise vectors into images, while the discriminator aims to distinguish the generated samples from real samples. Convincing images generated from noise vectors through GANs can be employed to augment image datasets, which alleviates the shortage of training data in some tasks. Moreover, image-to-image translation based on GANs (Chen et al. 2018; 2019) has also gained popularity. However, vanilla GANs have flaws in their stability, and many promising works alleviate this problem by modifying the network architecture or proposing improved loss functions (Radford, Metz, and Chintala 2015; Nguyen et al. 2017; Karras et al. 2017; Mao et al. 2017; Berthelot, Schumm, and Metz 2017a; Arjovsky, Chintala, and Bottou 2017; Gulrajani et al. 2017). Besides, a considerable body of work aims to arbitrarily manipulate generated images according to different factors, e.g., the category, illumination, and style (Chen et al. 2016).

Figure 1: Interpolation images generated from WGAN-GP (Gulrajani et al. 2017) (a) and the proposed smooth GAN (b). (a) Interpolation shows blurred and distorted images. (b) Smooth interpolation shows clear and high-quality images.
Beyond the meaningless noise input in GANs, interpretable features can be discovered by investigating label information in conditional GANs (Mirza and Osindero 2014), exploring the mutual information between elements of the input in InfoGAN (Chen et al. 2016), or leveraging a discriminator on the latent space in AAE (Makhzani et al. 2015).

Noise vector inputs for GANs can be taken as low-dimensional representations of images. As widely accepted in representation learning, the closeness of two data points is supposed to be preserved before and after transformation. Most of these improved GAN methods implicitly assume that the generator translates linear interpolation in the input noise space into semantic interpolation in the output image space (Bojanowski et al. 2017). Although experimental results of this kind show interesting visual effects and attract readers' attention, the quality of images generated through interpolation can be noisy and fragile, and some of these images look obviously unnatural or even meaningless, as demonstrated in Figure 1(a). Much effort has been spent on generating high-quality images and on stabilizing the training of GANs, but how to ensure the success of semantic interpolation in GANs has rarely been investigated.

In this paper, we propose a smooth deep image generator that can suppress the influence of input perturbations on generated images. By investigating the connections between input noises and generated images, we theoretically present the most serious input perturbation that can be tolerated for an output image of desired precision. A gradient-based loss function is then introduced to reduce the variation of generated images caused by perturbations on input noises, which encourages smooth interpolation of images. Combining a discriminator with a gradient penalty, we show that the smooth generator is beneficial for improving the quality of interpolation samples, as demonstrated in Figure 1(b). Experimental results on the real-world datasets MNIST (LeCun et al. 1998), CIFAR-10 (Krizhevsky and Hinton 2009), and CelebA (Liu et al. 2015) demonstrate that the generator produced by the proposed method is essential for the success of smooth and high-quality interpolation of images.

Related Work

In this section, we briefly introduce related works on generative adversarial networks (GANs). Although the GAN model has powerful image generation capabilities, it is often trapped by unstable training and difficult convergence. Several methods have been proposed to solve this problem. DCGAN (Radford, Metz, and Chintala 2015) introduced a network structure that works well and is stable. WGAN (Arjovsky, Chintala, and Bottou 2017) proved the defect of the vanilla adversarial loss and proposed the Wasserstein distance to measure the distance between the generated data distribution and the real data distribution. However, the weight clipping used in WGAN to keep D Lipschitz continuous reduces the capacity of the neural network. To solve this problem, WGAN-GP (Gulrajani et al. 2017) proposed a gradient penalty instead of the weight clipping operation to satisfy the Lipschitz continuity condition. BEGAN (Berthelot, Schumm, and Metz 2017a) proposed a novel concept of equilibrium that helps GANs achieve considerable results using standard training methods without extra tricks; at the same time, similar to the Wasserstein distance, this degree of equilibrium can estimate the degree of convergence of the model. MMD GAN (Li et al. 2017b) connected moment matching networks and GANs and achieved performance competitive with state-of-the-art GANs. Kim et al. (Kim and Bengio 2016) and VGAN (Zhai et al. 2016) integrated GANs with energy-based models and improved the performance of generative models.
GANs have achieved remarkable results in image generation. LapGAN (Denton et al. 2015) generated high-resolution images from low-resolution ones with the help of the Laplacian pyramid framework. Furthermore, ProgGAN (Karras et al. 2017) proposed to train the generator and discriminator progressively at increasing resolution levels, which can produce extremely high-quality 2k-resolution images. In semi-supervised learning, Triple GAN (Li et al. 2017a) introduced a classifier C to perform generation tasks under semi-supervised conditions. DCGAN (Radford, Metz, and Chintala 2015) introduced interpolation in the latent space to generate smooth transitions in the image space. However, nothing in the adversarial loss guarantees such smooth transitions. This paper therefore analyzes the constraint required by a smooth transition in image space and introduces a method to strengthen this property of GANs.

Proposed Method

In this section, we analyze the conditions required by smooth transitions in image space and develop a smooth generator within GANs.

Generative Adversarial Nets

A discriminator D and a generator G play a max-min game in GANs, in which the discriminator D is responsible for distinguishing real samples from generated samples, while the generator G tries to deceive the discriminator D. When the game achieves equilibrium, the generator G is able to fit the complicated distribution of real samples. Formally, given a sample x from the real distribution P_d and noise z drawn from a noise distribution P_z (e.g., a Gaussian or uniform distribution), the optimal generator transforming the noise distribution into the real data distribution can be solved from the following min-max optimization problem:

    min_G max_D E_{x∼P_d}[log D(x)] + E_{z∼P_z}[log(1 − D(G(z)))].   (1)

We denote the distribution of generated samples G(z) as P_G. By alternately optimizing the generator G and the discriminator D in the min-max problem, we expect the generated distribution P_G and the real data distribution P_d to gradually become consistent with each other.

Smooth Generator

In the noise-to-image generation task, it is difficult to know what type of perturbation could happen in practice. Hence we consider general perturbations on pixels, and Euclidean distance is adopted for the measurement. Under a continuous translation or rotation, generated images are still expected to evolve smoothly, and thus pixel values should avoid sudden changes. Given the input noise vector z and the generator G, the generated image can be written as G(z). We suppose that the value of the i-th pixel of the image G(z) is determined by G_i(z), where G_i(·) is reduced from G(·). A smooth generator is then expected to have the following pixel-wise property:

    |G_i(z + δ) − G_i(z)| < ε,   (2)

where δ stands for a small perturbation of the input noise z, and ε > 0 is a small constant. Since linear interpolation around z in the noise space can be approximated as imposing a perturbation δ on z, Eq. (2) encourages images generated from noise interpolation to stay close to the original image. In addition, Eq. (2) helps improve the robustness of the generator, so that it is not easily disabled by adversarial noise inputs with slight perturbations.
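As a concrete illustration of the property in Eq. (2), the following minimal PyTorch sketch empirically measures the largest pixel change of a generator's output under small random input perturbations. The toy generator architecture and the perturbation scale are our illustrative assumptions, not the networks used in the experiments.

```python
import torch
import torch.nn as nn

# Toy stand-in for G: any noise-to-image network would do here.
# This architecture is illustrative, not the paper's model.
G = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 28 * 28), nn.Tanh())

def max_pixel_change(G, z, delta_scale=1e-2, trials=16):
    """Empirically estimate max_i |G_i(z + delta) - G_i(z)| over random
    small perturbations delta, i.e., the left-hand side of Eq. (2)."""
    with torch.no_grad():
        base = G(z)
        worst = 0.0
        for _ in range(trials):
            delta = delta_scale * torch.randn_like(z)
            diff = (G(z + delta) - base).abs().max().item()
            worst = max(worst, diff)
    return worst

z = torch.randn(1, 128)
print(f"max pixel change under small perturbations: {max_pixel_change(G, z):.4f}")
```

A smooth generator in the sense of Eq. (2) should keep this quantity small for perturbation scales comparable to the spacing of interpolation points.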
However, it is difficult to straightforwardly integrate Eq. (2) into the objective function of GANs, because δ is unspecified. We next analyze an appropriate δ that satisfies Eq. (2) in the following theorem.

Theorem 1. Fix ε > 0. Given z ∈ R^d as the noise input of generator G_i, if the perturbation δ ∈ R^d satisfies

    ‖δ‖_q < ε / max_{ẑ ∈ B_p(z,R)} ‖∇_ẑ G_i(ẑ)‖_p,   (3)

with 1/p + 1/q = 1, then |G_i(z + δ) − G_i(z)| < ε.

Proof. Without loss of generality, we first suppose G_i(z) ≤ G_i(z + δ). Our aim is then to demonstrate what condition δ should obey to realize

    0 ≤ G_i(z + δ) − G_i(z) < ε.   (4)

By the fundamental theorem of calculus, we have

    G_i(z + δ) = G_i(z) + ∫_0^1 ⟨∇_z G_i(z + tδ), δ⟩ dt,   (5)

    0 ≤ G_i(z + δ) − G_i(z) = ∫_0^1 ⟨∇_z G_i(z + tδ), δ⟩ dt.   (6)

Consider the fact that

    ∫_0^1 ⟨∇_z G_i(z + tδ), δ⟩ dt ≤ ‖δ‖_q ∫_0^1 ‖∇_z G_i(z + tδ)‖_p dt,   (7)

where Hölder's inequality is applied and the q-norm is dual to the p-norm with 1/p + 1/q = 1. Suppose that ẑ = z + tδ lies in a ball centered at z with radius R, defined as B_p(z, R) = {ẑ ∈ R^d | ‖z − ẑ‖_p ≤ R}. Hence, we have

    ∫_0^1 ‖∇_z G_i(z + tδ)‖_p dt ≤ max_{ẑ ∈ B_p(z,R)} ‖∇_ẑ G_i(ẑ)‖_p.   (8)

By combining Eqs. (7) and (8), Eq. (6) can be re-written as

    0 ≤ G_i(z + δ) − G_i(z) ≤ ‖δ‖_q max_{ẑ ∈ B_p(z,R)} ‖∇_ẑ G_i(ẑ)‖_p.   (9)

If the right side of Eq. (9) is always upper bounded by ε, i.e.,

    ‖δ‖_q max_{ẑ ∈ B_p(z,R)} ‖∇_ẑ G_i(ẑ)‖_p < ε,   (10)

we reach the conclusion that 0 ≤ G_i(z + δ) − G_i(z) < ε. According to Eq. (10), δ should satisfy

    ‖δ‖_q < ε / max_{ẑ ∈ B_p(z,R)} ‖∇_ẑ G_i(ẑ)‖_p.   (11)

For the opposite case, by setting z := z + δ and δ := −δ, we obtain the same constraint (i.e., Eq. (11)) over δ to achieve −ε < G_i(z + δ) − G_i(z). The proof is completed.

By minimizing the denominator in Eq. (3), the model is expected to tolerate a larger perturbation δ for a fixed difference ε on the i-th pixel. If all pixels of the generated image are investigated simultaneously, we then have

    L = min_G max_{ẑ ∈ B_p(z,R)} ‖∇_ẑ G(ẑ)‖_p.   (12)

However, max_{ẑ ∈ B_p(z,R)} ‖∇_ẑ G(ẑ)‖_p is difficult to calculate. Since ẑ lies in a local region around z, it is reasonable to assume that there is a data point ẑ ∼ P_z that well approximates ẑ. Hence, we can reformulate Eq. (12) as

    L = min_G E_{z∼P_z} ‖∇_z G(z)‖_p.   (13)

Though minimizing ‖∇_z G(z)‖_p increases the perturbation δ that can be tolerated by the generator, it is inappropriate to expect an enormously large δ, which could damage the diversity of generated images: if the generator is extremely insensitive to changes in the input, linear interpolation in noise space always leads to the same output. As a result, we introduce a constant k as a margin to constrain the value of ‖∇_z G(z)‖_p,

    L = E_{z∼P_z} max(0, ‖∇_z G(z)‖_p^2 − k).   (14)

If the value of ‖∇_z G(z)‖_p^2 is larger than k, the generator is penalized; otherwise, we consider ‖∇_z G(z)‖_p small enough to admit an appropriate δ for the generator. This hinge loss is advantageous over a classical squared loss that expects the gradient magnitude to be exactly k, as it is unreasonable to enforce the same gradient magnitude for all data points from the distribution P_z.
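The hinge penalty of Eq. (14) can be implemented directly with automatic differentiation. Below is a minimal PyTorch sketch under the common simplification of differentiating the summed generator output rather than every pixel-wise G_i separately; the toy architecture and the margin value are illustrative assumptions.

```python
import torch
import torch.nn as nn

def smoothness_hinge_loss(G, z, k=10.0):
    """Hinge penalty of Eq. (14): E_z[max(0, ||grad_z G(z)||_2^2 - k)].
    The gradient of the summed output is used as a scalar reduction of the
    Jacobian; other reductions are possible."""
    z = z.detach().requires_grad_(True)
    out = G(z)
    # create_graph=True so the penalty itself is differentiable w.r.t. G's weights.
    grad_z, = torch.autograd.grad(out.sum(), z, create_graph=True)
    grad_norm_sq = grad_z.norm(2, dim=1) ** 2
    return torch.clamp(grad_norm_sq - k, min=0).mean()

# Usage with a toy generator (architecture is illustrative):
G = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
z = torch.randn(64, 128)
loss = smoothness_hinge_loss(G, z, k=10.0)
loss.backward()  # gradients flow into G's parameters via double backprop
```

The clamp at zero realizes the margin behavior described above: samples whose gradient norm is already below k contribute nothing to the loss.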
Smooth GAN

So far, we have mainly focused on the smoothness of generated images while neglecting their quality. Considering the generator network and the discriminator network within the framework of GANs, we suggest that the proposed smooth generator is beneficial for improving the quality of generated images.

Well-trained deep neural networks have recently been found vulnerable to adversarial examples that are imperceptible to humans. Most studies on adversarial examples target the image classification problem, but in the image generation task we can easily discover failure generations of well-trained generators as well. The noises resulting in these failure cases can thus be regarded as adversarial noise inputs. WGAN (Arjovsky, Chintala, and Bottou 2017), on which WGAN-GP builds, is a recent promising variant of the vanilla GAN:

    min_D E_{z∼P_z}[D(G(z))] − E_{x∼P_d}[D(x)].   (15)

This loss function reflects image quality, which is distinct from the vanilla GAN loss that measures how well the generator fools the discriminator. The term involving the real sample x has no connection with the generator; a larger value of D(G(z)) in Eq. (15) therefore indicates higher quality of generated images. If a noise vector z generates a high-quality image, we expect its neighboring point z + δ to generate a high-quality image as well. To decrease the quality gap between images generated from close noise inputs, we need to ensure that D(G(z)) does not drop significantly when the input varies, i.e.,

    |D[G(z + δ)] − D[G(z)]| < ε.   (16)

In the following theorem, we analyze what condition the perturbation δ should satisfy to guarantee the image quality.

Theorem 2. Fix ε > 0. Consider generator G and discriminator D in GANs. Given a noise input z ∈ R^d, the generated image is x̂ = G(ẑ). If the perturbation δ ∈ R^d satisfies

    ‖δ‖_q < ε / max_{ẑ ∈ B_p(z,R)} ‖∇_x̂ D(x̂)‖_p ‖∇_ẑ G(ẑ)‖_p,   (17)

with 1/p + 1/q = 1, then |D[G(z + δ)] − D[G(z)]| < ε.

Proof. Without loss of generality, we first suppose D[G(z + δ)] ≥ D[G(z)]. Following the proof of Theorem 1, we can draw a similar conclusion,

    0 ≤ D[G(z + δ)] − D[G(z)] ≤ ‖δ‖_q max_{ẑ ∈ B_p(z,R)} ‖∇_ẑ D[G(ẑ)]‖_p.   (18)

According to the chain rule, we have

    ∇_ẑ D[G(ẑ)] = ∇_x̂ D(x̂) · ∇_ẑ G(ẑ),   (19)

where x̂ = G(ẑ) is the generated image. Given the fact that

    ‖∇_x̂ D(x̂) · ∇_ẑ G(ẑ)‖_p ≤ ‖∇_x̂ D(x̂)‖_p ‖∇_ẑ G(ẑ)‖_p,   (20)

Eq. (18) can be re-written as

    ‖δ‖_q max_{ẑ ∈ B_p(z,R)} ‖∇_ẑ D[G(ẑ)]‖_p ≤ ‖δ‖_q max_{ẑ ∈ B_p(z,R)} ‖∇_x̂ D(x̂)‖_p ‖∇_ẑ G(ẑ)‖_p.   (21)

If the right side of Eq. (21) is always upper bounded by ε, i.e.,

    ‖δ‖_q max_{ẑ ∈ B_p(z,R)} ‖∇_x̂ D(x̂)‖_p ‖∇_ẑ G(ẑ)‖_p < ε,   (22)

we then have

    ‖δ‖_q < ε / max_{ẑ ∈ B_p(z,R)} ‖∇_x̂ D(x̂)‖_p ‖∇_ẑ G(ẑ)‖_p.   (23)

In a similar manner, we can get the same constraint (i.e., Eq. (23)) over δ to achieve −ε < D[G(z + δ)] − D[G(z)]. The proof is completed.

Based on Theorem 2, we propose to minimize

    max_{ẑ ∈ B_p(z,R)} ‖∇_x̂ D(x̂)‖_p ‖∇_ẑ G(ẑ)‖_p,   (24)

so that the upper bound on δ is enlarged and the GAN model is expected to tolerate more drastic perturbations. Since it is difficult to discover the optimal ẑ ∈ B_p(z, R), we again suppose that an approximating ẑ can be sampled from the distribution P_z. The loss function can then be reformulated as

    min_{G,D} E_{z∼P_z} ‖∇_x D(x)‖_p ‖∇_z G(z)‖_p,   (25)

where x = G(z) ∼ P_G is the generated sample.
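To make the bound of Theorem 2 tangible, the following sketch roughly estimates the tolerable ‖δ‖ of Eq. (17) with p = q = 2 by sampling ẑ on the sphere of radius R around z. The networks, the sampling approximation of the maximum, and the sum-reduction of the Jacobians are all illustrative assumptions.

```python
import torch
import torch.nn as nn

def tolerable_delta_bound(G, D, z, eps=0.1, radius=0.5, samples=32):
    """Monte-Carlo estimate of the Theorem 2 bound (Eq. 17) with p = q = 2:
    sample z_hat in B(z, R) and approximate the maximum of
    ||grad_x D(x_hat)||_2 * ||grad_z G(z_hat)||_2 over the ball."""
    worst = 0.0
    for _ in range(samples):
        direction = torch.randn_like(z)
        z_hat = (z + radius * direction / direction.norm()).requires_grad_(True)
        x_hat = G(z_hat)
        # ||grad_z G(z_hat)||: gradient of the summed output as a scalar reduction.
        g_grad, = torch.autograd.grad(x_hat.sum(), z_hat)
        # ||grad_x D(x_hat)||: treat the generated image as a fresh input to D.
        x_leaf = x_hat.detach().requires_grad_(True)
        d_grad, = torch.autograd.grad(D(x_leaf).sum(), x_leaf)
        worst = max(worst, (g_grad.norm() * d_grad.norm()).item())
    return eps / worst

# Toy networks; architectures and constants are illustrative only.
G = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 1))
z = torch.randn(1, 128)
print(f"perturbation tolerated for eps=0.1: ||delta|| < {tolerable_delta_bound(G, D, z):.5f}")
```

Shrinking either gradient-norm factor enlarges the tolerated ‖δ‖, which is exactly what the two penalties below pursue.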
Two terms, ‖∇_x D(x)‖_p and ‖∇_z G(z)‖_p, are involved in Eq. (25). WGAN-GP (Gulrajani et al. 2017) proposed the gradient penalty

    L_D^{GP} = E_{x̄∼P_x̄}[(‖∇_x̄ D(x̄)‖_2 − 1)^2],   (26)

where P_x̄ involves both the real sample distribution P_d and the generated sample distribution P_G. By covering generated samples x ∼ P_G, Eq. (26) encourages ‖∇_x D(x)‖_2 to go towards 1, and it has been experimentally shown to successfully constrain the gradient norm of the discriminator, ‖∇_x D(x)‖_2. The remaining term ‖∇_z G(z)‖_p in Eq. (25) is therefore our only focus. In a similar manner, we encourage the gradient norm of the generator to stay at a low level and reformulate Eq. (25) as

    L_G^{GP} = E_{z∼P_z} max(0, ‖∇_z G(z)‖_2^2 − k),   (27)

where we set p = q = 2. This equation is exactly the same as Eq. (14). By integrating Eqs. (26) and (27) with WGAN, we obtain the resulting objective function:

    L = E_{z∼P_z}[D(G(z))] − E_{x∼P_d}[D(x)] + λ E_{x̄∼P_x̄}[(‖∇_x̄ D(x̄)‖_2 − 1)^2] + γ E_{z∼P_z} max(0, ‖∇_z G(z)‖_2^2 − k),   (28)

where λ and γ are constants that balance the different terms in the function. Our complete algorithm pipeline is summarized in Algorithm 1.

Algorithm 1 Smooth GAN
Require: the number of critic iterations per generator iteration n_critic, the batch size m, Adam hyperparameters α, β1, and β2, and the loss-balancing coefficients λ, γ.
Require: initial discriminator parameters w_0, initial generator parameters θ_0.
repeat
 1: for t = 1, ..., n_critic do
 2:   for i = 1, ..., m do
 3:     Sample real data x ∼ P_d, a latent variable z ∼ P_z, and a random number t ∼ U[0, 1].
 4:     Compute the fake sample G_θ(z) and the interpolated sample x̄ ← t x + (1 − t) G_θ(z).
 5:     Compute the loss L_D^(i) ← D_w[G_θ(z)] − D_w(x) + λ(‖∇_x̄ D_w(x̄)‖_2 − 1)^2.
 6:   end for
 7:   Update the discriminator: w ← Adam(∇_w (1/m) Σ_{i=1}^m L_D^(i), w, α, β1, β2).
 8: end for
 9: Sample a batch of latent variables {z^(i)}_{i=1}^m ∼ P_z.
10: Compute the loss L_G^(i) ← −D_w[G_θ(z^(i))] + γ max(0, ‖∇_z G_θ(z^(i))‖_2^2 − k).
11: Update the generator: θ ← Adam(∇_θ (1/m) Σ_{i=1}^m L_G^(i), θ, α, β1, β2).
until θ has converged
Ensure: a smooth generator network G.
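For reference, a compact PyTorch sketch of one training iteration of Algorithm 1 might look as follows. The toy fully connected networks, hyperparameter values, and placeholder data are our assumptions; the experiments below use convolutional architectures.

```python
import torch
import torch.nn as nn

def discriminator_step(D, G, x_real, opt_D, lam=10.0, z_dim=128):
    """One critic update (Algorithm 1, steps 2-7): WGAN loss plus the
    gradient penalty of Eq. (26) on interpolated samples."""
    z = torch.randn(x_real.size(0), z_dim, device=x_real.device)
    x_fake = G(z).detach()
    t = torch.rand(x_real.size(0), *([1] * (x_real.dim() - 1)), device=x_real.device)
    x_bar = (t * x_real + (1 - t) * x_fake).requires_grad_(True)
    grad_bar, = torch.autograd.grad(D(x_bar).sum(), x_bar, create_graph=True)
    gp = ((grad_bar.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    loss_D = D(x_fake).mean() - D(x_real).mean() + lam * gp
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()
    return loss_D.item()

def generator_step(D, G, batch_size, opt_G, gamma=0.1, k=10.0, z_dim=128):
    """One generator update (Algorithm 1, steps 9-11): adversarial term plus
    the hinge smoothness penalty of Eq. (27)."""
    z = torch.randn(batch_size, z_dim, requires_grad=True)
    x_fake = G(z)
    grad_z, = torch.autograd.grad(x_fake.sum(), z, create_graph=True)
    smooth = torch.clamp(grad_z.norm(2, dim=1) ** 2 - k, min=0).mean()
    loss_G = -D(x_fake).mean() + gamma * smooth
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_G.item()

# Toy networks and hyperparameters for a dry run.
G = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 1))
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.9))
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.9))
x_real = torch.rand(64, 784) * 2 - 1  # placeholder batch in [-1, 1]
for _ in range(5):  # n_critic = 5 critic steps per generator step
    discriminator_step(D, G, x_real, opt_D)
generator_step(D, G, 64, opt_G)
```

Note that both penalties require a second backward pass through the gradient (create_graph=True), which roughly doubles the cost of each update.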
Experimental Results

In this section, we conduct comprehensive experiments on a toy dataset and three real-world image datasets: MNIST (LeCun et al. 1998), CIFAR-10 (Krizhevsky and Hinton 2009), and CelebA (Liu et al. 2015).

Datasets and Settings

Here we introduce the real image datasets used in the experiments together with the corresponding experimental settings and network structures. All images are normalized to pixel values in [−1, +1]. We utilize different network architectures on the different datasets, detailed as follows. The common settings are: i) no nonlinear activation is attached to the end of the discriminators; ii) the minibatch size in training is 64 for both generator and discriminator; iii) the Adam optimizer is used with learning rate 0.0001 and momentum 0.5; iv) the noise dimension for the generator is 128; v) weights are initialized from the Gaussian N(0, 0.01).

MNIST (LeCun et al. 1998) is a handwritten digit dataset (digits 0 to 9) composed of 28×28-pixel greyscale images from ten categories. The whole dataset of 70,000 images is split into 60,000 training and 10,000 test images. In the experiments on MNIST, we use the 10,000 test images as the validation set when calculating FID.

CIFAR-10 (Krizhevsky and Hinton 2009) consists of 32×32-pixel RGB color images drawn from 10 categories. Its 60,000 images are split into 50,000 training and 10,000 test images. We calculate FID with 3,000 images randomly selected from the test set.

CelebA (Liu et al. 2015) consists of 202,599 portraits of celebrities. We use the aligned and cropped version, which preprocesses each image to a size of 64×64 pixels. 3,000 examples are randomly selected as the test set and the remaining samples form the training set.

Evaluation Metrics

We evaluate the proposed method mainly in terms of three metrics well suited to the image domain.

Inception score (IS) (Salimans et al. 2016), which rewards high quality and high variability of samples, can be expressed as exp(E_x[D_KL(p(y|x) ‖ p(y))]), where p(y) = (1/N) Σ_{i=1}^N p(y|x_i = G(z_i)) is the marginal distribution and p(y|x) is the conditional label distribution for a sample. In this paper, we estimate the IS using an Inception model (Szegedy et al. 2016) pretrained in torchvision of PyTorch.

Fréchet Inception Distance (FID) (Heusel et al. 2017) describes the distance between two distributions and is computed as follows:

    FID = ‖µ_g − µ_r‖_2^2 + Tr(Σ_g + Σ_r − 2(Σ_g Σ_r)^{1/2}),   (29)

where (µ_g, Σ_g) and (µ_r, Σ_r) are the means and covariances of embedded samples from the generated distribution P_g and the real image distribution P_r, respectively. We regard the feature maps obtained from a specific layer of the pretrained Inception model as the embedding of the samples. FID is more sensitive to the diversity of samples belonging to the same category, and it fixes the drawback of the inception score of being easily fooled by a model that generates only one image per class. We describe the quality of models with the FID and IS scores together.
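As a worked example of Eq. (29), the following NumPy/SciPy sketch computes FID from Gaussian statistics fitted to embedded samples; the random placeholder embeddings stand in for features extracted from a pretrained Inception layer.

```python
import numpy as np
from scipy import linalg

def fid(mu_g, sigma_g, mu_r, sigma_r):
    """FID of Eq. (29) between Gaussians fitted to embedded sample sets."""
    covmean = linalg.sqrtm(sigma_g @ sigma_r)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error
    return float(((mu_g - mu_r) ** 2).sum() + np.trace(sigma_g + sigma_r - 2 * covmean))

# Placeholder embeddings; in practice these come from an Inception layer.
emb_g = np.random.randn(3000, 64)
emb_r = np.random.randn(3000, 64) + 0.1
score = fid(emb_g.mean(axis=0), np.cov(emb_g, rowvar=False),
            emb_r.mean(axis=0), np.cov(emb_r, rowvar=False))
print(f"FID: {score:.3f}")
```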
Multi-scale Structural Similarity (MS-SSIM) (Wang, Simoncelli, and Bovik 2003) was proposed to measure the similarity of two images, and it is well suited to evaluating the quality of samples belonging to one class. We therefore apply it to describe the smoothness of samples obtained from the interpolation operation. If two algorithms receive similar FID and IS scores, their samples can be considered equivalent in quality and diversity, and a higher MS-SSIM score then indicates a smoother conversion process.

Interpolation

We perform image interpolation on the MNIST and CelebA datasets and show a few results in Figures 2 and 3. We feed a series of noises, obtained by linearly interpolating between two points in noise space, into the model, and expect the resulting images to show a smooth transition in image space.

Figure 2: Illustration of image interpolations on the MNIST dataset generated from WGAN-GP (Gulrajani et al. 2017) (left) and the proposed method (middle and right).

In Figure 2, the left and middle parts, which show similar transitions, are generated from models trained with WGAN-GP and the proposed Smooth GAN, respectively. The left part shows more meaningless images (highlighted in red) during the transition. When changing an image from one class to another, we want the whole process to remain smooth and meaningful. As shown in the right part of Figure 2, these images accomplish the transitions by considering both image quality and image semantics. For example, the number "6" first becomes "1" and then becomes "9"; obviously, "1" is closer to "9" than "6" is. Interpolation results on the other numbers also illustrate this phenomenon. In the results on the CelebA dataset shown in Figure 3, we achieve high quality while maintaining smoothness, which illustrates the effectiveness of our approach.

Figure 3: Image interpolations on the CelebA dataset: (a) images generated from WGAN-GP; (b) images generated from the proposed Smooth GAN.

Taking two images and the interpolation images generated from the linear interpolation between their noise vectors as an interpolation slice, we generate several slices on the CelebA dataset to illustrate the effectiveness of the proposed method in a more convincing way. We generated such slices with different models and calculated the FID and MS-SSIM of these slices for comparison. Different from calculating the MS-SSIM score over the whole set of resulting images, we calculate it for every interpolation slice independently. Higher MS-SSIM values correspond to perceptually more similar images, but also to lower diversity and mode collapse (Odena, Olah, and Shlens 2016; Fedus et al. 2017). Meanwhile, a good FID score ensures diversity and prevents the GAN from mode collapse. As a result, considered together with FID, the MS-SSIM score can focus on indicating the similarity of images, which is consistent with a smooth transition within an interpolation slice. The sliced MS-SSIM score used in this experiment can be described as

    Σ_i Σ_{j=i+1} MS-SSIM(S_i, S_j),   (30)

where S_i is the i-th sample in an interpolation slice and the sum runs over all sample pairs within the slice.

Figure 4: FID and sliced MS-SSIM obtained by different models on the CIFAR-10 dataset.

Now we can estimate the effectiveness with FID and sliced MS-SSIM. We report the FID and sliced MS-SSIM obtained on the CelebA dataset in Figure 4. Our model not only has the lowest FID score, but its MS-SSIM score also exceeds all other models. This is consistent with our observations and demonstrates the effectiveness of our approach.

Hyperparameter Analysis

To illustrate the choice of the value of k, we show FID scores and Wasserstein distance convergence curves for experiments with different k values on the CIFAR-10 dataset in Figure 5. Figure 5(a) shows the FID scores obtained with four k values (k = 100, 10, 3, 0.1), and k = 10 provides the best score. Figure 5(b) shows that the experiment with k = 10 also achieves the best convergence. Setting k to 3 affects the training progress and yields a slightly higher Wasserstein distance than Figure 5(b) at convergence. The generator fails to produce sufficiently realistic images when k is set to 0.1; this is because too small a k value suppresses the diversity of the generator output. Moreover, a large value of k relaxes the constraint on the gradient and can result in insufficient smoothness of the network output. In our experiments, we chose the k value per dataset based on these experimental results.

Figure 5: FID scores (a) and Wasserstein distance convergence for k = 10 (b), k = 3 (c), and k = 0.1 (d).

Image Generation

We conduct experiments on three real image datasets to investigate the capabilities of the proposed method. Table 1 reports the inception scores and FIDs on the CIFAR-10 dataset obtained by the proposed model and the baseline methods.

Table 1: Inception scores (higher is better) and FIDs (lower is better) on the CIFAR-10 dataset.

    Method                                       IS           FID
    DCGAN (Radford, Metz, and Chintala 2015)     6.40 ± .05   36.7
    BEGAN (Berthelot, Schumm, and Metz 2017b)    5.62 ± .07   32.3
    WGAN (Arjovsky, Chintala, and Bottou 2017)   3.82 ± .06   42.8
    D2GAN (Nguyen et al. 2017)                   7.15 ± .07   30.6
    WGAN-GP (Gulrajani et al. 2017)              7.36 ± .07   26.5
    Smooth GAN (Ours)                            7.66 ± .05   24.1

In these results, the proposed method outperforms almost all of the state-of-the-art methods and thus provides considerable quality on the three datasets. Figure 6 shows several samples generated by the model learned with the proposed method. The samples on the MNIST dataset show a variety of digits and styles. Dogs, trucks, boats, and fish can also be found in the samples on the CIFAR-10 dataset.
Age and gender diversity can also be observed in the results on the CelebA dataset. These results confirm the capabilities of the proposed method.

Figure 6: Samples generated from the proposed model trained on MNIST (left), CIFAR-10 (middle), and CelebA (right).

Samples Quality Distribution

In this section, we introduce a way to describe the quality distribution of samples, and demonstrate that the proposed method can effectively reduce the number of low-quality images. FID is a good indicator of the quality of generated samples; however, it only provides an average quality over the whole test set. First, we generate a sufficiently large set of images (50,000 in our experiments). Next, we randomly sample 512 images to form a subset and calculate the FID of this subset. We repeat the second step 120 times and calculate the FID scores. By comparing the FIDs obtained on these subsets, we can roughly estimate the quality distribution of the samples.

Figure 7: FID distribution obtained on several subsets for DCGAN, WGAN-GP, and Smooth GAN.

Figure 7 shows the distribution of FID scores calculated over subsets for three models. Compared to the other models, our model obtains lower FID scores, with fewer large FID values. Moreover, the FID scores obtained by the proposed model are mainly concentrated between 46 and 50, and only 8 subset scores fall outside this region, while the other two algorithms yield looser distributions of FID scores. Therefore, our method can effectively reduce the proportion of low-quality samples among the generated samples.

Conclusions

We analyzed the relationship between perturbations on the input of the generator and their influence on the output images. By investigating the tolerable input perturbation, we developed a smooth generator. We further integrated this smooth generator with a gradient-penalized discriminator to produce a smooth GAN that generates stable and high-quality images. Experiments on real-world image datasets demonstrate the necessity of studying the smooth generator and show that the proposed method is capable of learning a smooth GAN.

Acknowledgments

We thank the support of the NSFC under Grants 61876007 and 61872012, ARC Projects FL-170100117, DP-180103424, IH-180100002, DE-180101438, and Beijing Major Science and Technology Project Z171100000117008.

References

Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein GAN. arXiv preprint arXiv:1701.07875.

Berthelot, D.; Schumm, T.; and Metz, L. 2017a. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717.

Berthelot, D.; Schumm, T.; and Metz, L. 2017b. BEGAN: Boundary equilibrium generative adversarial networks. CoRR abs/1703.10717.

Bojanowski, P.; Joulin, A.; Lopez-Paz, D.; and Szlam, A. 2017. Optimizing the latent space of generative networks. arXiv preprint arXiv:1707.05776.

Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; and Abbeel, P. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, 2172-2180.

Chen, X.; Xu, C.; Yang, X.; and Tao, D. 2018. Attention-GAN for object transfiguration in wild images. In The European Conference on Computer Vision (ECCV).

Chen, X.; Xu, C.; Yang, X.; Song, L.; and Tao, D. 2019. Gated-GAN: Adversarial gated networks for multi-collection style transfer. IEEE Transactions on Image Processing 28(2):546-560.

Denton, E. L.; Chintala, S.; Szlam, A.; and Fergus, R. 2015. Deep generative image models using a Laplacian pyramid of adversarial networks. CoRR abs/1506.05751.

Fedus, W.; Rosca, M.; Lakshminarayanan, B.; Dai, A. M.; Mohamed, S.; and Goodfellow, I. 2017. Many paths to equilibrium: GANs do not need to decrease a divergence at every step. arXiv preprint arXiv:1710.08446.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672-2680.

Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; and Courville, A. C. 2017. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, 5767-5777.

Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Klambauer, G.; and Hochreiter, S. 2017. GANs trained by a two time-scale update rule converge to a Nash equilibrium. arXiv preprint arXiv:1706.08500.

Karras, T.; Aila, T.; Laine, S.; and Lehtinen, J. 2017. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.

Kim, T., and Bengio, Y. 2016. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439.

Kingma, D. P., and Welling, M. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images.

LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278-2324.

Li, C.; Xu, K.; Zhu, J.; and Zhang, B. 2017a. Triple generative adversarial nets. arXiv preprint arXiv:1703.02291.

Li, C.-L.; Chang, W.-C.; Cheng, Y.; Yang, Y.; and Póczos, B. 2017b. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, 2203-2213.

Liu, Z.; Luo, P.; Wang, X.; and Tang, X. 2015. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, 3730-3738.

Makhzani, A.; Shlens, J.; Jaitly, N.; Goodfellow, I.; and Frey, B. 2015. Adversarial autoencoders. arXiv preprint arXiv:1511.05644.

Mao, X.; Li, Q.; Xie, H.; Lau, R. Y.; Wang, Z.; and Smolley, S. P. 2017. Least squares generative adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2813-2821. IEEE.

Mirza, M., and Osindero, S. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.

Nguyen, T.; Le, T.; Vu, H.; and Phung, D. 2017. Dual discriminator generative adversarial nets. In Advances in Neural Information Processing Systems, 2670-2680.

Odena, A.; Olah, C.; and Shlens, J. 2016. Conditional image synthesis with auxiliary classifier GANs. arXiv preprint arXiv:1610.09585.

Radford, A.; Metz, L.; and Chintala, S. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.

Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; and Chen, X. 2016. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, 2234-2242.

Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2818-2826.

van den Oord, A.; Kalchbrenner, N.; Espeholt, L.; Vinyals, O.; Graves, A.; et al. 2016. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, 4790-4798.

Wang, Z.; Simoncelli, E. P.; and Bovik, A. C. 2003. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, volume 2, 1398-1402. IEEE.
Zhai, S.; Cheng, Y.; Feris, R.; and Zhang, Z. 2016. Generative adversarial networks as variational training of energy based models. arXiv preprint arXiv:1611.01799.