# Slimmable Generative Adversarial Networks

Liang Hou,1,2* Zehuan Yuan,3 Lei Huang,4 Huawei Shen,1,2 Xueqi Cheng,1,2 Changhu Wang3

1 CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 ByteDance AI Lab
4 SKLSDE, Institute of Artificial Intelligence, Beihang University

{houliang17z, shenhuawei, cxq}@ict.ac.cn, {yuanzehuan, wangchanghu}@bytedance.com, huanglei@nlsde.buaa.edu.cn

*Work done as an intern at ByteDance AI Lab. Corresponding authors.

## Abstract

Generative adversarial networks (GANs) have achieved remarkable progress in recent years, but the continuously growing scale of models makes them challenging to deploy widely in practical applications. In particular, for real-time generation tasks, different devices require generators of different sizes due to varying computing power. In this paper, we introduce slimmable GANs (SlimGANs), which can flexibly switch the width of the generator to accommodate various quality-efficiency trade-offs at runtime. Specifically, we leverage multiple discriminators that share partial parameters to train the slimmable generator. To facilitate the consistency between generators of different widths, we present a stepwise inplace distillation technique that encourages narrow generators to learn from wide ones. For class-conditional generation, we propose a sliceable conditional batch normalization that incorporates the label information into different widths. Our methods are validated, both quantitatively and qualitatively, by extensive experiments and a detailed ablation study.

## Introduction

One of the main reasons for the tremendous success of deep learning in recent years is the increasing scale of models. Among deep generative models, generative adversarial networks (GANs) (Goodfellow et al. 2014) have received widespread attention and evolved from the original simple multi-layer perceptrons to the vast BigGAN framework (Brock, Donahue, and Simonyan 2019) with residual blocks (He et al. 2016) and self-attention layers (Zhang et al. 2019) that synthesizes realistic images. The arms race on increasing the model size is endless, while the computational power and budget of devices are limited, especially for mobile phones. Several GAN applications such as photo deblurring (Kupyn et al. 2018) and autonomous driving (Zhang et al. 2018) require short response times and should ideally run on devices with limited computing power. Recently, researchers began to develop lightweight GAN models. However, different devices usually require customized models of different sizes to meet a given response-time budget. Moreover, even a single device needs models of different sizes because of switchable performance modes, e.g., a high-performance mode and a power-saving mode. Consequently, numerous models need to be trained and deployed for a single task, which is a heavy burden.

In this work, we are committed to developing a once-for-all generator that is trained and deployed only once but can flexibly switch its model size at runtime, addressing these practical challenges. Inspired by slimmable neural networks (SNNs) (Yu et al. 2019), we focus on developing a generator with configurable widths, where the width refers to the number of channels in the layers.
In addition to saving inference time, customizing the width can reduce the memory footprint during layer-by-layer inference, an advantage that reducing depth does not offer. Although several discriminative tasks such as image classification and object detection are well studied with SNNs, applying slimmable operators to GANs raises the following three challenges. First, how can we accurately and appropriately estimate, through discriminators, the divergences between the real data and the generators at different widths? Second, how can we ensure consistency between generators of different widths? Here, consistency means that these generators should produce similar images given the same latent code. Third, how can we incorporate the label information into generators at different widths for class-conditional generation?

In this paper, we propose slimmable generative adversarial networks (SlimGANs) to address the aforementioned problems. First, we present discriminators with partially shared parameters to serve the generators at different widths. Second, to improve the consistency between generators at different widths, we introduce a novel stepwise inplace distillation technique, which encourages narrow generators to learn from wide generators. Third, we propose a sliceable conditional batch normalization (scBN) that incorporates the label information into different widths on the basis of switchable batch normalization (sBN) (Yu et al. 2019) for class-conditional generation. Extensive experiments across several real-world datasets and two neural network backbones demonstrate that SlimGAN can compete with or even outperform individually trained GANs. Remarkably, our proposed scBN achieves better performance with fewer parameters. A systematic ablation study verifies the effectiveness of our design, including the network framework and the loss function.

## Related Work

### Generative Adversarial Networks

Generative adversarial networks (GANs) (Goodfellow et al. 2014) were implemented with multi-layer perceptrons at the beginning. To improve the capability of the generator and the discriminator, convolutional layers were introduced in DCGAN (Radford, Metz, and Chintala 2015). Later, WGAN-gp (Gulrajani et al. 2017) not only established flexible Lipschitz constraints but also brought the ResNet (He et al. 2016) backbone into the GAN literature. To further impose the Lipschitz constraint, SNGAN (Miyato et al. 2018) introduced spectral normalization in the discriminator, which is also applied to the generator in SAGAN (Zhang et al. 2019). For class-conditional generation tasks, cGAN-pd (Miyato and Koyama 2018) injected the label information into the generator via conditional batch normalization (cBN) (de Vries et al. 2017) and into the discriminator via a projection technique. Recently, BigGAN (Brock, Donahue, and Simonyan 2019) became capable of generating diverse and realistic high-resolution images, mainly attributed to its massive model.

### Model Compression in GANs

The arms race on developing increasingly bloated network architectures hinders the extensive deployment of GANs in practical applications. To reduce the size of the generator, Aguinaldo et al. (2019) compressed GAN models using knowledge distillation techniques. Li et al. (2020) proposed a compression method for conditional GAN models.
Meanwhile, Yu and Pool (2020) developed a self-supervised compression method that uses the trained discriminator to supervise the training of a compressed generator. AutoGAN-Distiller (Fu et al. 2020) compressed GAN models using neural architecture search. Recently, Wang et al. (2020a) developed a unified GAN compression framework that includes model distillation, channel pruning, and quantization.

### Dynamic Neural Networks

Unlike model compression, dynamic neural networks can adaptively choose the computational graph to reduce computation during training and inference. For example, Liu and Deng (2018) presented an additional controller network that decides the computational graph depending on the input. Similarly, Hu et al. (2019) proposed to reduce test time by introducing an early-exit gating function. Instead of adjusting the depth of neural networks, slimmable neural networks (SNNs) (Yu et al. 2019) are trained to be executable at different widths, allowing immediate and adaptive accuracy-efficiency trade-offs at runtime. Later, US-Net (Yu and Huang 2019b) extended SNNs to universally slimmable scenarios and proposed improved training techniques. AutoSlim (Yu and Huang 2019a) utilized model pruning methods to obtain accuracy-latency optimal models but introduced additional storage consumption. RS-Nets (Wang et al. 2020b) proposed an approach to train neural networks that can switch image resolutions during inference. Nevertheless, the aforementioned approaches are designed for discriminative tasks with a single neural network, whereas we focus on generative tasks based on GANs. Since a GAN consists of two networks, i.e., the generator and the discriminator, modifying the operational mechanism of the generator may destroy the stability of the entire system, which makes training a GAN with a slimmable generator challenging.

## Preliminaries

### Generative Adversarial Networks

Generative adversarial networks (GANs) (Goodfellow et al. 2014) are typically composed of a generator and a discriminator. Specifically, the generator $G: \mathcal{Z} \rightarrow \mathcal{X}$ learns to generate fake samples by mapping a random noise vector $z \in \mathcal{Z}$ in the latent space, endowed with a predefined prior $P_Z$ (e.g., a multivariate normal distribution), to a sample $x \in \mathcal{X}$ in the high-dimensional complex data space. The discriminator $D: \mathcal{X} \rightarrow [0, 1]$ attempts to distinguish the synthetic examples generated by the generator from real data. In contrast, the goal of the generator is to fool the discriminator by mimicking real data. Formally, the objective function of a GAN is formulated as follows:

$$
\min_G \max_D \; \mathbb{E}_{x \sim P_{\mathrm{data}}}[\log(D(x))] + \mathbb{E}_{z \sim P_Z}[\log(1 - D(G(z)))], \tag{1}
$$

where $P_{\mathrm{data}}$ represents the underlying distribution of real data. As proved in (Goodfellow et al. 2014), this minimax game amounts to minimizing the Jensen-Shannon (JS) divergence between the real data distribution and the generated one. Ideally, the generator is supposed to converge until $P_G = P_{\mathrm{data}}$. The JS divergence estimated by the discriminator can be replaced with other f-divergences (Nowozin, Cseke, and Tomioka 2016) or even true metrics such as the Wasserstein distance (Arjovsky, Chintala, and Bottou 2017) by modifying the objective function.

### Slimmable Neural Networks

Slimmable neural networks (SNNs) (Yu et al. 2019) can instantly adjust the network width according to the demands of various devices with different capacities.
Unlike other methods for training lightweight models, such as neural architecture search and model compression, an SNN is more flexible because it only needs to be trained and deployed once to obtain multiple models at the different widths in a pre-specified width list $\mathcal{W}$. To avoid the discrepancy of the mean and variance between networks at different widths, SNNs use a switchable batch normalization (sBN), i.e., independent learnable BN parameters for each width:

$$
\hat{x}_{w_i} = \gamma_{w_i} \frac{x_{w_i} - \mu(x_{w_i})}{\sigma(x_{w_i})} + \beta_{w_i}, \tag{2}
$$

where $x_{w_i}$ represents the data batch at the current width $w_i \in \mathcal{W}$. Specifically, $\mu(\cdot)$ and $\sigma(\cdot)$ compute the mean and standard deviation of this batch, and $\gamma_{w_i}$ and $\beta_{w_i}$ are the learnable scale and shift, respectively, of the sBN at width $w_i$.

Figure 1: Illustration of SlimGAN with width multiplier list W = [0.25, 0.5, 0.75, 1.0]. Wide generators contain the channels of narrow ones. Multiple discriminators share the first several layers. Blue dashed lines indicate the stepwise inplace distillation.

## Slimmable GAN

We aim to develop a size-flexible generator that can switch its size to accommodate various levels of computing power. Roughly speaking, a size-flexible generator implies multiple generators $G_{\theta_1}, G_{\theta_2}, \ldots, G_{\theta_N}$ with $N$ incremental parameter sets $\theta_1 \subset \theta_2 \subset \cdots \subset \theta_N$, respectively. In this work, we focus on slimming the width (number of channels) of the generator network instead of the depth, as reducing the width saves memory footprint during layer-by-layer inference. The width-slimmable generator contains several generators $G_{w_1}, G_{w_2}, \ldots, G_{w_N}$ at $N = |\mathcal{W}|$ incremental widths $w_1 < w_2 < \cdots < w_N$ ($w_i \in \mathcal{W}$), respectively. In particular, we train the generator via adversarial training and call our method slimmable GAN (SlimGAN).¹

¹ Code is available at https://github.com/houliangict/SlimGAN

### Slimmable GAN Framework

We illustrate the overall framework of SlimGAN in Figure 1. Specifically, SlimGAN consists of a slimmable generator with multi-width configurations and multiple discriminators that share the first several layers. Each discriminator guides the generator at the corresponding width. Here, using multiple shared discriminators, instead of a single discriminator or multiple independent discriminators, is critical for our SlimGAN model; this is also its first major novelty. The idea is motivated by two insights. On the one hand, using a single discriminator for all the generators with different widths limits the flexibility and capability of discriminators to discriminate generated data from real data, and finally fails to obtain well-performing generators. On the other hand, although assigning one discriminator to each generator offers high flexibility, it is incapable of leveraging the shared characteristics of data generated by slimmable generators. Therefore, we borrow the idea of multi-task learning and design multiple parameter-shared discriminators. This design not only offers high flexibility of discriminators but also exploits the similar characteristics of data generated by slimmable generators to improve the training of generators. In addition, sharing parameters with other tasks offers a kind of consistency regularization on discriminators, which potentially improves the generalization of discriminators and hence promotes the performance of generators (Thanh-Tung, Tran, and Venkatesh 2019).

To train the generator-discriminator pair at width $w_i$, we utilize the hinge version of the loss (Lim and Ye 2017; Tran, Ranganath, and Blei 2017), which is prevalent and successful in the GAN literature.
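Before turning to the objectives, the width-switchable building blocks described above can be made concrete with a minimal PyTorch-style sketch. The class names and the per-module `width` attribute are our own illustrative assumptions, not the authors' released code; the sketch mirrors the SNN design of slicing a shared weight tensor and keeping one BN per width as in Eq. (2).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

WIDTH_LIST = [0.25, 0.5, 0.75, 1.0]  # pre-specified width multiplier list W

class SlimmableConv2d(nn.Conv2d):
    """Keeps one full weight tensor; at runtime only the first channels are used."""
    def __init__(self, max_in, max_out, kernel_size, **kwargs):
        super().__init__(max_in, max_out, kernel_size, **kwargs)
        self.max_out = max_out
        self.width = 1.0  # current width multiplier

    def forward(self, x):
        in_ch = x.size(1)                       # follow the incoming activation
        out_ch = int(self.max_out * self.width)
        weight = self.weight[:out_ch, :in_ch]   # slice the shared weight tensor
        bias = self.bias[:out_ch] if self.bias is not None else None
        return F.conv2d(x, weight, bias, self.stride,
                        self.padding, self.dilation, self.groups)

class SwitchableBatchNorm2d(nn.Module):
    """sBN (Eq. 2): an independent BatchNorm2d (statistics + affine) per width."""
    def __init__(self, max_channels):
        super().__init__()
        self.bns = nn.ModuleList(
            [nn.BatchNorm2d(int(max_channels * w)) for w in WIDTH_LIST])
        self.width = 1.0

    def forward(self, x):
        return self.bns[WIDTH_LIST.index(self.width)](x)
```

Switching the generator to a given width then amounts to setting the `width` attribute on every such module (e.g., via `model.apply`) before the forward pass.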
The objectives for the discriminator and the generator at width $w_i$ are

$$
\max_D \; \mathbb{E}_{x \sim P_{\mathrm{data}}}\big[\min(0, -1 + D_{w_i}(x))\big] + \mathbb{E}_{z \sim P_Z}\big[\min(0, -1 - D_{w_i}(G_{w_i}(z)))\big],
$$
$$
\max_G \; \mathbb{E}_{z \sim P_Z}\big[D_{w_i}(G_{w_i}(z))\big], \qquad i = 1, 2, \ldots, N. \tag{3}
$$

### Stepwise Inplace Distillation

Although a single slimmable generator implies multiple sub-generators, we expect these generators to be consistent with one another, as if they were an identical generator. Imagine that a trained slimmable generator is deployed as clients on various devices, and these devices may choose different width configurations according to their diverse energy budgets. We expect these clients to generate consistent samples for the same command (e.g., the latent code $z$) broadcast by the server; we characterize this requirement as spatial translation consistency. In addition, since a single device has different performance modes, e.g., a high-power mode and a power-saving mode, even the same device may choose generators of different sizes. We also expect such a device to generate a consistent sample for the same latent code in any mode, which we regard as time translation consistency. However, the adversarial objective cannot explicitly guarantee the consistency between generators of different widths, because the discriminator only distinguishes real from fake, not similar from dissimilar.

To achieve consistency, we propose a novel stepwise inplace distillation technique. Unlike general-purpose model distillation, we do not utilize knowledge distillation to obtain a smaller model from an already trained one. Instead, we train narrow networks by encouraging them to learn from wide networks during the training process, thereby improving the consistency between them. Specifically, the proposed distillation first distills the full generator to the second widest one, then distills the second widest to the third widest, and so on. We employ the pixel-wise mean squared error as the distillation objective:

$$
\min_G \; \frac{\lambda}{N-1} \, \mathbb{E}_{z \sim P_Z} \sum_{i=1}^{N-1} \big\| G_{w_i}(z) - \mathrm{sg}\big(G_{w_{i+1}}(z)\big) \big\|_2^2, \tag{4}
$$

where $\lambda$ is a hyper-parameter that balances the adversarial objectives and the distillation, and $\mathrm{sg}(\cdot)$ stops the flow of gradients in the computational graph. Stopping the update of the wide generator during distillation prevents it from learning from the narrow one. Arguably, the distillation can effectively improve the performance of narrow networks. Furthermore, the improvement of narrow networks could in turn enhance wide networks, because wide generators contain all the channels of narrow generators, which forms a virtuous cycle in SlimGAN. As an alternative, leveraging the full network to teach all narrow generators may be contrary to the assumption of width residuals (Yu and Huang 2019b). In other words, forcing all narrow generators to learn from the widest one would leave little difference between them, which tends to strengthen the shared parameters but weaken the width-specific ones.

### Training Algorithm

Algorithm 1 shows the training procedure of SlimGAN in PyTorch-style pseudo-code. The main difference from training a normal GAN is that we enumerate all the widths in the pre-specified width list at each iteration and switch the computational graph according to the configured width. In the adversarial training part, we sample independent random noise as the input of each generator. This increases the diversity of fake samples and encourages the models to explore a wider optimization space to achieve better results.
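For concreteness, the distillation term of Eq. (4) used in this loop can be sketched as follows; `generator(z, width=w)` is an assumed interface for running the slimmable generator at width multiplier `w`, not the authors' exact API.

```python
import torch

def stepwise_distillation_loss(generator, z_fixed, width_list, lam):
    """Sketch of Eq. (4): each narrow width matches the next wider width."""
    # Run the generator at every width on the SAME fixed latent code.
    fakes = [generator(z_fixed, width=w) for w in width_list]
    loss = z_fixed.new_zeros(())
    for i in range(len(width_list) - 1):
        # sg(.) in Eq. (4): detach the wider output so gradients only flow
        # into the narrower generator.
        target = fakes[i + 1].detach()
        loss = loss + torch.mean((fakes[i] - target) ** 2)  # pixel-wise MSE
    return lam / (len(width_list) - 1) * loss
```

In Algorithm 1 below, the gradients of this term are accumulated together with the per-width adversarial generator losses before a single optimizer step.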
In the consistency training part, we sample the same latent code to optimize the discrepancy between the outputs of the generators at different widths.

Algorithm 1: Training SlimGAN
Require: dataset D, switchable width multiplier list W
Ensure: generator G
1: for t = 1, ..., T do
2:   for k = 1, ..., K do
3:     Get mini-batch data, x = sample(D)
4:     for i = 1, ..., N do
5:       Generate samples, x̂ = G_{w_i}(z) with z ~ P_Z
6:       Compute D loss, loss = loss_D(D_{w_i}(x), D_{w_i}(x̂))
7:       Compute D gradients, loss.backward()
8:     end for
9:     Update D weights, optimizer_D.step()
10:  end for
11:  Sample fixed noise z̃ ~ P_Z and initialize x̃ = [ ]
12:  for i = 1, ..., N do
13:    Generate samples, x̂ = G_{w_i}(z) with z ~ P_Z
14:    Compute G loss, loss = loss_G(D_{w_i}(x̂))
15:    Compute G gradients, loss.backward()
16:    Generate fixed samples, x̃.append(G_{w_i}(z̃))
17:  end for
18:  Compute distillation loss, loss = loss_Distill(x̃)
19:  Compute distillation gradients, loss.backward()
20:  Update G weights, optimizer_G.step()
21: end for
22: return G

### Sliceable Conditional Batch Normalization

In the case of class-conditional generation, state-of-the-art class-conditional GANs, e.g., BigGAN (Brock, Donahue, and Simonyan 2019), follow the way of incorporating label information proposed in cGAN-pd (Miyato and Koyama 2018), i.e., conditional batch normalization (cBN) in the generator and projection in the discriminator. In this work, we follow the label projection technique in the discriminator. For the generator, however, how to introduce the label information under the width-switchable mechanism is the key problem faced by SlimGAN in the class-conditional generation scenario; in other words, how to unify sBN and cBN. A naive way to achieve this goal is to expand each sBN into a cBN:

$$
\hat{x}_{w_i, c_j} = \gamma_{w_i, c_j} \frac{x_{w_i, c_j} - \mu(x_{w_i, c_j})}{\sigma(x_{w_i, c_j})} + \beta_{w_i, c_j}, \tag{5}
$$

where $c_j$ indicates the current label. However, the disadvantages of this design are obvious from two perspectives. First, the number of parameters increases dramatically because there are $N \times C$ sets of BN parameters ($C$ is the number of labels), which contradicts our motivation, i.e., saving parameters to reduce model size and computation. Second, the information of the same label is separated across generators at different widths. To remedy these issues, we propose a sliceable conditional batch normalization (scBN) defined as follows:

$$
\hat{x}_{w_i, c_j} = \gamma_{w_i} \gamma_{c_j}^{:s_i} \frac{x_{w_i, c_j} - \mu(x_{w_i, c_j})}{\sigma(x_{w_i, c_j})} + \beta_{w_i} + \beta_{c_j}^{:s_i}, \tag{6}
$$

where $\gamma_{c_j}$ and $\beta_{c_j}$ are the learnable parameters of the cBN with label $c_j$. To incorporate the label embedding into different widths, we slice the cBN vectors into sub-vectors consisting of their first $s_i = |\gamma_{w_i}|$ elements ($s_i$ is the number of channels in the layer at the current width $w_i$). Since the cBN and sBN parameters are independent, there are only $N + C$ sets of BN parameters in our proposed scBN, which not only reduces the parameters accordingly but also explicitly shares the information of the same label.

## Experiments

In this section, we first evaluate our proposed SlimGAN across several datasets with two network backbones, compared with individually trained models. We then conduct class-conditional generation experiments to verify the effectiveness of scBN. Besides, we report qualitative and quantitative results that indicate the consistency between generators at different widths. We further validate the design of SlimGAN through an extensive ablation study. Finally, we analyze the parameter complexities of the generators.
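Before turning to the results, the scBN of Eq. (6) can also be sketched in the same PyTorch style. This is a sketch only; the module name and the use of affine-free per-width BatchNorm layers for the running statistics are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class SliceableConditionalBN2d(nn.Module):
    """Sketch of scBN (Eq. 6): per-width sBN affine parameters plus ONE
    class-conditional gain/bias table shared across widths and sliced
    to the first s_i channels at the current width."""
    def __init__(self, max_channels, num_classes, width_list):
        super().__init__()
        self.width_list = width_list
        # Per-width BN keeps only running statistics; the affine part is explicit below.
        self.bns = nn.ModuleList(
            [nn.BatchNorm2d(int(max_channels * w), affine=False) for w in width_list])
        # Per-width affine parameters gamma_{w_i}, beta_{w_i}, as in sBN.
        self.gamma_w = nn.ParameterList(
            [nn.Parameter(torch.ones(int(max_channels * w))) for w in width_list])
        self.beta_w = nn.ParameterList(
            [nn.Parameter(torch.zeros(int(max_channels * w))) for w in width_list])
        # Class-conditional parameters gamma_{c_j}, beta_{c_j}, shared by all widths.
        self.gamma_c = nn.Embedding(num_classes, max_channels)
        self.beta_c = nn.Embedding(num_classes, max_channels)
        nn.init.ones_(self.gamma_c.weight)
        nn.init.zeros_(self.beta_c.weight)
        self.width = 1.0

    def forward(self, x, y):
        idx = self.width_list.index(self.width)
        s_i = x.size(1)                                # channels at the current width
        x_hat = self.bns[idx](x)                       # (x - mu) / sigma
        gamma = self.gamma_w[idx] * self.gamma_c(y)[:, :s_i]   # gamma_{w_i} * gamma_{c_j}^{:s_i}
        beta = self.beta_w[idx] + self.beta_c(y)[:, :s_i]      # beta_{w_i} + beta_{c_j}^{:s_i}
        return gamma[:, :, None, None] * x_hat + beta[:, :, None, None]
```

Because the class embeddings `gamma_c` and `beta_c` are stored once at full width and merely sliced to the first `s_i` entries, the parameter count grows with N + C rather than N × C sets of BN parameters, matching the motivation above.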
| Backbone | Dataset | Method | FID (↓) @0.25 | @0.5 | @0.75 | @1.0 | IS (↑) @0.25 | @0.5 | @0.75 | @1.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| DCGAN (uncond) | CIFAR-10 | Individual | 46.9 | 34.6 | 30.4 | 26.7 | 6.08 | 6.95 | 7.39 | 7.43 |
| | | Slimmable | 37.3 | 28.5 | 25.8 | 25.2 | 6.90 | 7.31 | 7.43 | 7.44 |
| | STL-10 | Individual | 93.1 | 69.1 | 61.8 | 57.4 | 6.51 | 7.82 | 7.96 | 8.38 |
| | | Slimmable | 68.9 | 60.9 | 56.2 | 55.1 | 7.67 | 8.00 | 8.34 | 8.38 |
| | CelebA | Individual | 24.4 | 13.2 | 10.4 | 9.8 | - | - | - | - |
| | | Slimmable | 23.3 | 13.3 | 10.6 | 9.4 | - | - | - | - |
| ResNet (uncond) | CIFAR-10 | Individual | 41.8 | 24.1 | 21.6 | 20.3 | 7.36 | 7.68 | 7.93 | 7.91 |
| | | Slimmable | 29.9 | 21.6 | 19.6 | 20.0 | 7.32 | 8.02 | 8.15 | 8.09 |
| | STL-10 | Individual | 66.6 | 58.5 | 56.3 | 52.9 | 7.90 | 8.52 | 8.30 | 8.60 |
| | | Slimmable | 69.1 | 59.0 | 50.8 | 50.6 | 7.60 | 8.23 | 8.83 | 8.81 |
| | CelebA | Individual | 18.0 | 11.9 | 9.9 | 8.9 | - | - | - | - |
| | | Slimmable | 13.9 | 10.6 | 9.8 | 8.5 | - | - | - | - |
| cGAN-pd (cond) | CIFAR-10 | Individual | 55.1 | 33.5 | 16.5 | 15.5 | 6.46 | 7.90 | 8.22 | 8.52 |
| | | Slimmable (−) | 21.7 | 17.2 | 16.1 | 16.2 | 7.87 | 8.31 | 8.49 | 8.34 |
| | | Slimmable (+) | 19.5 | 14.5 | 13.6 | 14.2 | 7.88 | 8.38 | 8.67 | 8.59 |
| | CIFAR-100 | Individual | 45.8 | 23.7 | 22.5 | 19.9 | 7.26 | 8.49 | 8.50 | 9.11 |
| | | Slimmable (−) | 26.8 | 19.9 | 18.9 | 19.0 | 8.13 | 8.90 | 9.14 | 9.22 |
| | | Slimmable (+) | 23.8 | 18.9 | 18.6 | 17.9 | 8.26 | 9.08 | 9.17 | 9.29 |

Table 1: FID (↓) and IS (↑) on both unconditional (uncond) and class-conditional (cond) generation at width multipliers 0.25, 0.5, 0.75, and 1.0. We do not calculate IS on CelebA because it is a face dataset lacking the inter-class diversity that IS measures. For class-conditional generation, (+) denotes our proposed sliceable conditional batch normalization, while (−) denotes the naive way that extends each sBN to a cBN. Bold numbers indicate that our slimmable method outperforms the individually trained models.

We employ the following datasets for the main experiments:

- CIFAR-10/100 consists of 50k training images and 10k validation images at a resolution of 32 × 32. CIFAR-10 has 10 classes, while CIFAR-100 has 100 classes.
- STL-10 is resized to 48 × 48 as done in (Miyato et al. 2018). There are 100k and 8k unlabeled images in the training set and validation set, respectively.
- CelebA is a face dataset with 202,599 celebrity images, originally at a resolution of 178 × 218. We follow the practice in (Hou, Shen, and Cheng 2020) to center-crop them to 178 × 178 and then resize them to 64 × 64. We use the last 19,962 images as the validation set and the remaining 182,637 images as the training set.

We use the training set for training the models and the validation set for evaluation when calculating the statistics of the real data.

### Evaluation Metrics

To evaluate the generation performance of all models, we adopt two widely used evaluation metrics: Inception Score (IS) (Salimans et al. 2016) and Fréchet Inception Distance (FID) (Heusel et al. 2017). IS computes the KL divergence between the conditional class distribution and the marginal class distribution. FID is the Fréchet distance (the Wasserstein-2 distance between two Gaussian distributions) between two sets of features obtained through the Inception-v3 network trained on ImageNet. We randomly generate 50k images to calculate IS on all datasets, and 10k images to compute FID, except on STL-10, where we sample 8k images.

To measure the consistency between the generators at different widths of SlimGAN, we present a metric, called Inception Consistency (IC), which measures the expected feature difference between two generators $G_{w_i}$ and $G_{w_j}$ at widths $w_i$ and $w_j$, respectively:

$$
\mathrm{IC}(G_{w_i}, G_{w_j}) = \mathbb{E}_{z \sim P_Z}\big[\| \Phi(G_{w_i}(z)) - \Phi(G_{w_j}(z)) \|_2^2\big],
$$

where $\Phi(\cdot)$ outputs the features of the last hidden layer of the Inception-v3 network trained on ImageNet. Given the width multiplier list $\mathcal{W}$, we average IC over all generator pairs to obtain the mean IC (mIC):

$$
\mathrm{mIC}(G, \mathcal{W}) = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j=1, j \neq i}^{N} \mathrm{IC}(G_{w_i}, G_{w_j}).
$$
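A minimal sketch of how mIC could be estimated is given below; `feature_extractor` stands in for the Inception-v3 feature map Φ and `generator(z, width=w)` is the same assumed interface as before (illustrative only, not the evaluation code used in the paper).

```python
import itertools
import torch

def mean_inception_consistency(generator, feature_extractor, width_list,
                               num_samples=10_000, batch_size=100, z_dim=128):
    """Average pairwise feature distance (IC) over all ordered width pairs."""
    pairs = list(itertools.permutations(range(len(width_list)), 2))
    total = {pair: 0.0 for pair in pairs}
    n_batches = num_samples // batch_size
    with torch.no_grad():
        for _ in range(n_batches):
            z = torch.randn(batch_size, z_dim)
            feats = [feature_extractor(generator(z, width=w)) for w in width_list]
            for i, j in pairs:
                # squared L2 distance between features, averaged over the batch
                total[(i, j)] += torch.mean(
                    torch.sum((feats[i] - feats[j]) ** 2, dim=1)).item()
    ics = [v / n_batches for v in total.values()]   # IC per ordered pair
    return sum(ics) / len(ics)                      # mean over N(N-1) pairs
```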
We randomly sample 10k images to estimate the mIC score.

### Experimental Settings

We implement all models based on Mimicry (Lee and Town 2020) using the PyTorch framework. The optimizer is Adam with betas $(\beta_1, \beta_2) = (0.5, 0.999)$ for DCGAN and $(\beta_1, \beta_2) = (0.0, 0.9)$ for the ResNet-based SNGAN. The learning rate is $\alpha = 2 \times 10^{-4}$, except for CelebA on DCGAN, where $\alpha = 10^{-4}$. The number of generator update iterations is $T = 100$k for all methods. The number of discriminator update steps per generator update step is $K = 5$ for ResNet and $K = 1$ for DCGAN. As for the detailed network architectures, we exactly follow those in SNGAN (Miyato et al. 2018) and cGAN-pd (Miyato and Koyama 2018). The width multiplier list is set to $\mathcal{W} = [0.25, 0.5, 0.75, 1.0]$.

Figure 2: Qualitative consistency on CelebA. (a) Slimmable GAN without the stepwise inplace distillation, showing clear inconsistency. (b) Slimmable GAN with the stepwise inplace distillation, showing improved consistency.

### Experimental Results

Unconditional generation. For unconditional generation, we experiment with three datasets, CIFAR-10, STL-10, and CelebA, on two backbones, DCGAN and ResNet. The hyper-parameter is set to $\lambda = 20$ for both backbones on CIFAR-10 and CelebA, and to $\lambda = 10$ and $\lambda = 30$ for DCGAN and ResNet, respectively, on STL-10. We report the FID and IS results in Table 1, where "Individual" denotes individually trained GANs at each width. Our proposed SlimGAN surpasses in most cases, or competes with, the individually trained GANs in terms of both FID and IS, consistently demonstrating the effectiveness of SlimGAN across various datasets and network backbones. Surprisingly, SlimGAN outperforms the individual model even at the widest width. We argue that the reasons are twofold. First, training narrow networks could provide extra informative signals for the parameters shared with wide networks. Second, the parameter-shared discriminators impose a certain regularization, which may improve the generalization of each discriminator. We believe this is a promising advanced training technique for GANs and leave it for future work. Additionally, some generators at width 0.75 reach or surpass the widest generators, which are trained with only adversarial objectives, reflecting the benefit of combining distillation with adversarial training.

Class-conditional generation. For class-conditional generation experiments, we adopt cGAN-pd as the backbone on both CIFAR-10 and CIFAR-100 and report both FID and IS at the bottom of Table 1. The hyper-parameter is set to $\lambda = 10$ for CIFAR-10 and $\lambda = 20$ for CIFAR-100. The symbols in the parentheses after our slimmable methods represent different implementations of BN, i.e., (−) represents the naive BN and (+) represents our proposed scBN. Overall, the slimmable generators with either BN outperform the baseline heavily. In particular, our proposed scBN gains further improvement compared with the naive BN, owing to sharing the label information across different widths.

BigGANs on ImageNet. We train our slimmable method with BigGAN (Brock, Donahue, and Simonyan 2019) on ImageNet (128 × 128) for 50k iterations. The width multiplier list is set to $\mathcal{W} = [0.5, 1.0]$. The IS and FID are reported in Table 2. In short, our slimmable method surpasses the individually trained BigGANs, showing a strong capability on a large-scale dataset of high-resolution images.

| Method | IS (↑) @0.5 | IS (↑) @1.0 | FID (↓) @0.5 | FID (↓) @1.0 |
|---|---|---|---|---|
| Individual | 18.8 | 29.9 | 48.1 | 33.9 |
| Slimmable | 32.7 | 36.1 | 32.8 | 30.8 |

Table 2: BigGANs on ImageNet after 50k iterations.
| Slim DCGAN | CIFAR-10 | STL-10 | CelebA |
|---|---|---|---|
| w/o distillation | 282.7 | 277.4 | 110.2 |
| w/ distillation | 231.3 | 243.2 | 96.1 |

| Slim ResGAN | CIFAR-10 | STL-10 | CelebA |
|---|---|---|---|
| w/o distillation | 285.7 | 342.4 | 116.9 |
| w/ distillation | 241.4 | 248.7 | 97.9 |

Table 3: mIC (↓) on CIFAR-10, STL-10, and CelebA.

### Consistency

We first report the quantitative consistency (mIC) in Table 3, which verifies that distillation improves the consistency. We also show qualitative consistency results on CelebA in Figure 2. For each method, the top row represents the narrowest generator and the bottom row the widest generator; the same column within each method shows the images generated from the same latent code. Compared with the method without distillation, our distillation improves the consistency. For example, the method without distillation synthesizes faces with disparate hairstyles.

| DCGAN on CIFAR-10 | FID (↓) @0.25 | @0.5 | @0.75 | @1.0 | AVG | mIC (↓) |
|---|---|---|---|---|---|---|
| Individual | 46.9 | 34.6 | 30.4 | 27.4 | 34.8 | - |
| Individual (full D) | 45.6 | 33.2 | 29.4 | 27.4 | 33.9 | - |
| Slimmable G | 40.0 | 35.2 | 34.4 | 33.4 | 35.8 | 264.3 |
| + shared D | 40.9 | 30.2 | 27.0 | 25.2 | 30.8 | 282.7 |
| + shared D + distillation (SlimGAN) | 37.3 | 28.5 | 25.8 | 25.2 | 29.2 | 231.3 |
| + same D | 180.4 | 136.9 | 141.3 | 158.6 | 154.3 | 376.8 |
| + slimmable D | 43.6 | 35.8 | 31.0 | 33.0 | 35.9 | 269.5 |
| + distillation (w/o GAN loss for narrows) | 87.9 | 56.2 | 37.8 | 28.9 | 52.7 | 204.8 |
| + shared D + naive distillation | 36.6 | 29.8 | 26.3 | 25.5 | 29.6 | 232.5 |

Table 4: Ablation study on CIFAR-10. AVG means the FID averaged across all widths.

### Ablation Study

In this section, we conduct an extensive ablation study on CIFAR-10 to verify the effectiveness of the design of SlimGAN, including the network framework and the objective function. The first two rows in Table 4 are both individually trained GANs; Individual (full D) means that the widths of all discriminators in these individual GANs are fixed to the widest width, which is consistent with SlimGAN. Directly applying the slimmable operator to the generator with multiple independent discriminators (Slimmable G), unfortunately, leads to degradation, especially for wide generators. Although this issue is alleviated by sharing partial parameters of these discriminators (shared D), doing so compromises consistency. Fortunately, with stepwise inplace distillation, our final method (SlimGAN) not only achieves further improvements for narrow generators but also obtains remarkable consistency. When utilizing the same discriminator (same D) for all generators, the poor FID reveals that the one-to-one relationship in the generator-discriminator pairing should be obeyed. As an alternative parameter-sharing scheme, slimming the discriminator (slimmable D) does not yield satisfactory results; the narrow discriminators lack the capability to estimate the divergences, as they are contained in the wide discriminators. When narrow generators are trained with only distillation and no adversarial loss, they tend to produce blurry images and obtain inferior FID. Compared with the stepwise distillation, only the narrowest network is improved when using the naive distillation (all narrow generators learn from the widest one).

### Complexity Analysis

Saving parameters is the major advantage of the slimmable generator over individually trained ones. We report the number of parameters of the unconditional (uncond) and class-conditional generators in Table 5. Specifically, cond-10 and cond-100 represent the class-conditional generators (cGAN-pd) trained with 10 (CIFAR-10) and 100 (CIFAR-100) labels, respectively.
Individual (I-) methods require an independent generator at each width, while the slimmable (S-) approach only needs one. Therefore, the slimmable generator greatly reduces parameters compared with the sum of all individuals. As for class-conditional generative models, our proposed scBN (+) adds only negligible parameters on top of the widest individual generator, compared with the naive BN approach. This advantage becomes more pronounced as the number of labels or switches increases.

| CIFAR | 0.25 | 0.5 | 0.75 | 1.0 | Total |
|---|---|---|---|---|---|
| I-uncond | 0.35 | 1.15 | 2.39 | 4.08 | 7.97 |
| I-cond-10 | 0.36 | 1.16 | 2.41 | 4.10 | 8.04 |
| I-cond-100 | 0.42 | 1.29 | 2.61 | 4.37 | 8.70 |
| S-uncond | - | - | - | - | 4.08 |
| S-cond-10 (+) | - | - | - | - | 4.11 |
| S-cond-100 (+) | - | - | - | - | 4.38 |
| S-cond-10 (−) | - | - | - | - | 4.15 |
| S-cond-100 (−) | - | - | - | - | 4.81 |

Table 5: The number of parameters (M) in the generators.

## Conclusions

In this paper, we introduce slimmable generative adversarial networks (SlimGAN), which can execute at different widths at runtime according to the various energy budgets of different devices. To this end, we utilize multiple discriminators that share partial parameters to train the slimmable generator. In addition to the adversarial objectives, we introduce stepwise inplace distillation to explicitly guarantee the consistency between generators at different widths. In the case of class-conditional generation, we propose a sliceable conditional batch normalization to incorporate the label information under the width-switchable mechanism. Comprehensive experiments demonstrate that SlimGAN reaches or surpasses individually trained GANs. In the future, we will explore more practical generation tasks, e.g., text-to-image generation and image-to-image translation.

## Acknowledgments

This work is funded by the National Key R&D Program of China (2020AAA0105200) and the National Natural Science Foundation of China under grant numbers 91746301 and U1911401. Huawei Shen is also funded by the K.C. Wong Education Foundation and the Beijing Academy of Artificial Intelligence (BAAI).

## References

Aguinaldo, A.; Chiang, P.-Y.; Gain, A.; Patil, A.; Pearson, K.; and Feizi, S. 2019. Compressing GANs using Knowledge Distillation. arXiv preprint arXiv:1902.00159.

Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein Generative Adversarial Networks. In Proceedings of the 34th International Conference on Machine Learning, 214-223.

Brock, A.; Donahue, J.; and Simonyan, K. 2019. Large Scale GAN Training for High Fidelity Natural Image Synthesis. In International Conference on Learning Representations.

de Vries, H.; Strub, F.; Mary, J.; Larochelle, H.; Pietquin, O.; and Courville, A. 2017. Modulating Early Visual Processing by Language. In Advances in Neural Information Processing Systems 30.

Fu, Y.; Chen, W.; Wang, H.; Li, H.; Lin, Y.; and Wang, Z. 2020. AutoGAN-Distiller: Searching to Compress Generative Adversarial Networks. arXiv preprint arXiv:2006.08198.

Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27.

Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; and Courville, A. C. 2017. Improved Training of Wasserstein GANs. In Advances in Neural Information Processing Systems 30.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; and Hochreiter, S. 2017. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Advances in Neural Information Processing Systems 30.

Hou, L.; Shen, H.; and Cheng, X. 2020. Dual Rejection Sampling for Wasserstein Auto-Encoders. In 24th European Conference on Artificial Intelligence.

Hu, H.; Dey, D.; Hebert, M.; and Bagnell, J. A. 2019. Learning Anytime Predictions in Neural Networks via Adaptive Loss Balancing. In Proceedings of the AAAI Conference on Artificial Intelligence, 3812-3821.

Kupyn, O.; Budzan, V.; Mykhailych, M.; Mishkin, D.; and Matas, J. 2018. DeblurGAN: Blind Motion Deblurring Using Conditional Adversarial Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Lee, K. S.; and Town, C. 2020. Mimicry: Towards the Reproducibility of GAN Research. In CVPR Workshop on AI for Content Creation.

Li, M.; Lin, J.; Ding, Y.; Liu, Z.; Zhu, J.-Y.; and Han, S. 2020. GAN Compression: Efficient Architectures for Interactive Conditional GANs. arXiv preprint arXiv:2003.08936.

Lim, J. H.; and Ye, J. C. 2017. Geometric GAN. arXiv preprint arXiv:1705.02894.

Liu, L.; and Deng, J. 2018. Dynamic Deep Neural Networks: Optimizing Accuracy-Efficiency Trade-offs by Selective Execution. In Thirty-Second AAAI Conference on Artificial Intelligence, 3675-3682.

Miyato, T.; Kataoka, T.; Koyama, M.; and Yoshida, Y. 2018. Spectral Normalization for Generative Adversarial Networks. In International Conference on Learning Representations.

Miyato, T.; and Koyama, M. 2018. cGANs with Projection Discriminator. In International Conference on Learning Representations.

Nowozin, S.; Cseke, B.; and Tomioka, R. 2016. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. In Advances in Neural Information Processing Systems 29.

Radford, A.; Metz, L.; and Chintala, S. 2015. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv preprint arXiv:1511.06434.

Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X.; and Chen, X. 2016. Improved Techniques for Training GANs. In Advances in Neural Information Processing Systems 29.

Thanh-Tung, H.; Tran, T.; and Venkatesh, S. 2019. Improving Generalization and Stability of Generative Adversarial Networks. In International Conference on Learning Representations.

Tran, D.; Ranganath, R.; and Blei, D. M. 2017. Deep and Hierarchical Implicit Models. arXiv preprint arXiv:1702.08896.

Wang, H.; Gui, S.; Yang, H.; Liu, J.; and Wang, Z. 2020a. GAN Slimming: All-in-One GAN Compression by A Unified Optimization Framework. arXiv preprint arXiv:2008.11062.

Wang, Y.; Sun, F.; Li, D.; and Yao, A. 2020b. Resolution Switchable Networks for Runtime Efficient Image Recognition. In European Conference on Computer Vision, 533-549.

Yu, C.; and Pool, J. 2020. Self-Supervised GAN Compression. arXiv preprint arXiv:2007.01491.

Yu, J.; and Huang, T. 2019a. AutoSlim: Towards One-Shot Architecture Search for Channel Numbers. arXiv preprint arXiv:1903.11728.

Yu, J.; and Huang, T. S. 2019b. Universally Slimmable Networks and Improved Training Techniques. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 1803-1811.

Yu, J.; Yang, L.; Xu, N.; Yang, J.; and Huang, T. 2019. Slimmable Neural Networks. In International Conference on Learning Representations.
Zhang, H.; Goodfellow, I.; Metaxas, D.; and Odena, A. 2019. Self-Attention Generative Adversarial Networks. In Proceedings of the 36th International Conference on Machine Learning, 7354-7363.

Zhang, M.; Zhang, Y.; Zhang, L.; Liu, C.; and Khurshid, S. 2018. DeepRoad: GAN-based Metamorphic Autonomous Driving System Testing. arXiv preprint arXiv:1802.02295.