
Conditional GANs with Auxiliary Discriminative Classifier

Liang Hou 1,2, Qi Cao 1, Huawei Shen 1,2, Siyuan Pan 3, Xiaoshuang Li 3, Xueqi Cheng 4,2

Conditional generative models aim to learn the underlying joint distribution of data and labels to achieve conditional data generation. Among them, the auxiliary classifier generative adversarial network (AC-GAN) has been widely used, but it suffers from low intra-class diversity of the generated samples. The fundamental reason pointed out in this paper is that the classifier of AC-GAN is generator-agnostic and therefore cannot provide informative guidance for the generator to approach the joint distribution, resulting in a minimization of the conditional entropy that decreases the intra-class diversity. Motivated by this understanding, we propose a novel conditional GAN with an auxiliary discriminative classifier (ADC-GAN) to resolve this problem. Specifically, the proposed auxiliary discriminative classifier becomes generator-aware by recognizing the class-labels of the real data and the generated data discriminatively. Our theoretical analysis reveals that the generator can faithfully learn the joint distribution even without the original discriminator, making the proposed ADC-GAN robust to the value of the coefficient hyperparameter and the selection of the GAN loss, and stable during training. Extensive experimental results on synthetic and real-world datasets demonstrate the superiority of ADC-GAN in conditional generative modeling compared to state-of-the-art classifier-based and projection-based conditional GANs.

1Data Intelligence System Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China 2University of Chinese Academy of Sciences, Beijing, China 3Shanghai Jiao Tong University, Shanghai, China 4CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. Correspondence to: Huawei Shen <shenhuawei@ict.ac.cn>.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

1. Introduction

Generative adversarial networks (GANs) (Goodfellow et al., 2014) have achieved substantial progress in learning high-dimensional, complex data distributions such as those of images (Brock et al., 2019; Karras et al., 2019; 2020b;a; Karras et al.). A standard GAN consists of a generator network, which maps latent codes sampled from a tractable distribution (such as a Gaussian) in the latent space to data points in the data space, and a discriminator network, which attempts to distinguish real data from generated data. The generator is trained in an adversarial game against the discriminator so that it can learn the data distribution at the Nash equilibrium. Notably, it is difficult to reach equilibrium when training GANs unconditionally, which makes the generator prone to mode collapse (Salimans et al., 2016; Lin et al., 2018; Chen et al., 2019). In addition, practitioners often want to control the content of the generated samples in advance (Yan et al., 2015; Tan et al., 2020). A promising solution to both issues is to condition the generator, which leads to conditional GANs.

Conditional GANs (cGANs) (Mirza & Osindero, 2014) are a family of GAN variants that leverage side information from the annotated labels of samples to implement and train a conditional generator for conditional image generation from class-labels (Odena et al., 2017; Miyato & Koyama, 2018; Brock et al., 2019). To implement the conditional generator, the common technique nowadays is to inject the conditional information via conditional batch normalization (de Vries et al., 2017; Hou et al., 2021b). To train the conditional generator, much effort has been devoted to effectively injecting the conditional information into the discriminator or the auxiliary classifier that guides the conditional generator (Odena, 2016; Miyato & Koyama, 2018; Zhou et al., 2018; Kavalerov et al., 2021; Kang & Park, 2020; Zhou et al., 2020). Among them, the auxiliary classifier generative adversarial network (AC-GAN) (Odena et al., 2017) has been widely used due to its simplicity and extensibility. Specifically, AC-GAN utilizes an auxiliary classifier that first attempts to recognize the labels of data and then teaches the generator to produce label-consistent (classifiable) data. However, it has been reported that AC-GAN suffers from low intra-class diversity in the generated samples, especially on datasets with a large number of classes (Odena et al., 2017; Shu et al., 2017; Gong et al., 2019).


In this study, we point out that the fundamental reason for the low intra-class diversity problem of AC-GAN is that the classifier is agnostic to the generated data distribution and thus cannot provide informative guidance for the generator to learn the target distribution. Motivated by this understanding, we propose a novel conditional GAN with an auxiliary discriminative classifier, namely ADC-GAN, which resolves this problem by making the classifier aware of the generated data distribution as well as the real data distribution. To this end, the discriminative classifier is trained to distinguish between the real and generated data while recognizing their class-labels. The discriminative capability allows the classifier to provide the discrepancy between the real and generated data distributions, like the discriminator, and the classification capability enables it to capture the dependencies between data and labels. We show in theory that the generator of ADC-GAN can learn the joint data and label distribution under the optimal discriminative classifier even without the discriminator, making the method robust to the value of the coefficient hyperparameter and the selection of the GAN loss, and stable during training. We also highlight the superiority of ADC-GAN compared to the two most related works (TAC-GAN (Gong et al., 2019) and PD-GAN (Miyato & Koyama, 2018)) by analyzing their potential issues and limitations. Results on synthetic data clearly show that ADC-GAN resolves the problem of AC-GAN by faithfully recovering the joint distribution of real data and labels. Extensive experiments based on two popular codebases demonstrate the effectiveness of ADC-GAN compared with state-of-the-art cGANs in conditional generative modeling.

2. Preliminaries and Analysis

2.1. Generative Adversarial Networks

Generative adversarial networks (GANs) (Goodfellow et al., 2014) consist of two types of neural networks: the generator $G : \mathcal{Z} \to \mathcal{X}$, which maps a latent code $z \in \mathcal{Z}$ endowed with an easily sampled distribution $P_Z$ to a data point $x \in \mathcal{X}$, and the discriminator $D : \mathcal{X} \to [0, 1]$, which distinguishes between real data sampled from the real data distribution $P_X$ and fake data sampled from the generated data distribution $Q_X = G \circ P_Z$ induced by the generator. The goal of the generator is to confuse the discriminator by producing data that are as real as possible. Formally, the objective functions for the discriminator and generator are defined as follows:

$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim P_X}[\log D(x)] + \mathbb{E}_{x \sim Q_X}[\log(1 - D(x))]. \tag{1}$$

Theoretically, learning the generator under the optimal discriminator can be regarded as minimizing the Jensen-Shannon (JS) divergence between the real data distribution and the generated data distribution, i.e., $\min_G \mathrm{JS}(P_X \| Q_X)$.

This would enable the generator to recover the real data distribution at its optimum. However, the training of GANs on complex natural images is typically unstable (Che et al., 2016), especially in the absence of supervision such as conditional information. In addition, the content of the images generated by GANs cannot be specified in advance.
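The link between Equation 1 and the JS divergence can be checked numerically on a small discrete example. The sketch below is illustrative only: the two three-point distributions are made-up values, and the helper names are not from the paper. It uses the well-known closed form of the optimal discriminator, $D^{*}(x) = p(x) / (p(x) + q(x))$, and the identity $V(G, D^{*}) = 2\,\mathrm{JS}(P_X \| Q_X) - \log 4$ from Goodfellow et al. (2014).

```python
import math

def optimal_discriminator(p, q):
    """D*(x) = p(x) / (p(x) + q(x)) for each point of a finite support."""
    return [pi / (pi + qi) for pi, qi in zip(p, q)]

def gan_value(p, q, d):
    """V(G, D) = E_{x~P}[log D(x)] + E_{x~Q}[log(1 - D(x))]."""
    real_term = sum(pi * math.log(di) for pi, di in zip(p, d) if pi > 0)
    fake_term = sum(qi * math.log(1.0 - di) for qi, di in zip(q, d) if qi > 0)
    return real_term + fake_term

def kl(a, b):
    """KL(a || b) for two distributions on the same finite support."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def js(p, q):
    """Jensen-Shannon divergence via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_x = [0.5, 0.3, 0.2]   # toy "real" distribution (assumed values)
q_x = [0.2, 0.3, 0.5]   # toy "generated" distribution (assumed values)

d_star = optimal_discriminator(p_x, q_x)
value_at_optimum = gan_value(p_x, q_x, d_star)
print(value_at_optimum, 2 * js(p_x, q_x) - math.log(4))
```

The two printed numbers should agree up to floating-point error, and the value reduces to $-\log 4$ when the two distributions coincide.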

2.2. Base Method: AC-GAN

Learning GANs with conditional information can not only improve training stability but also enable conditional generation. As one of the most representative conditional GANs, AC-GAN (Odena et al., 2017) utilizes an auxiliary classifier $C : \mathcal{X} \to \mathcal{Y}$ to learn the dependencies between data and labels, where the labels are endowed with a label prior $P_Y$, and then encourages the conditional generator $G : \mathcal{Z} \times \mathcal{Y} \to \mathcal{X}$ to generate data that are as classifiable as possible. The objective functions for the discriminator, the auxiliary classifier, and the generator of AC-GAN¹ are defined as follows:

$$\max_{D, C} V(G, D) + \lambda \cdot \mathbb{E}_{x, y \sim P_{X,Y}}[\log C(y|x)], \tag{2}$$

$$\min_G V(G, D) - \lambda \cdot \mathbb{E}_{x, y \sim Q_{X,Y}}[\log C(y|x)], \tag{3}$$

where $\lambda > 0$ is a coefficient hyperparameter, $P_{X,Y}$ indicates the joint distribution of real data and labels, and $Q_{X,Y} = G \circ (P_Z \times P_Y)$ denotes the joint distribution of the generated data and labels induced by the conditional generator.

Proposition 2.1. For a fixed generator, the optimal classifier of AC-GAN has the form $C^{*}(y|x) = \frac{p(x, y)}{p(x)}$.

Theorem 2.2. Given the optimal classifier, at the equilibrium point, optimizing the classification task for the generator of AC-GAN is equivalent to:

$$\min_G \mathrm{KL}(Q_{X,Y} \| P_{X,Y}) - \mathrm{KL}(Q_X \| P_X) + H_Q(Y|X), \tag{4}$$

where $H_Q(Y|X) = -\int \sum_{y} q(x, y) \log q(y|x)\, dx$ is the conditional entropy of the generated samples.
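As a brief worked step (a sketch of the argument, not a substitute for the proof in Appendix A), the three terms arise from substituting the optimal classifier $C^{*}(y|x) = p(x, y) / p(x)$ into the generator's classification term and then adding and subtracting $\log q(x, y)$ and $\log q(x)$:

```latex
\begin{aligned}
-\mathbb{E}_{x,y \sim Q_{X,Y}}\left[\log C^{*}(y|x)\right]
&= -\mathbb{E}_{x,y \sim Q_{X,Y}}\left[\log \frac{p(x,y)}{p(x)}\right] \\
&= \mathbb{E}_{x,y \sim Q_{X,Y}}\left[\log \frac{q(x,y)}{p(x,y)}\right]
 - \mathbb{E}_{x \sim Q_{X}}\left[\log \frac{q(x)}{p(x)}\right]
 - \mathbb{E}_{x,y \sim Q_{X,Y}}\left[\log \frac{q(x,y)}{q(x)}\right] \\
&= \mathrm{KL}(Q_{X,Y} \,\|\, P_{X,Y}) - \mathrm{KL}(Q_X \,\|\, P_X) + H_Q(Y|X).
\end{aligned}
```

The last term is the conditional entropy because $q(x, y) / q(x) = q(y|x)$.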

The proofs of all theorems are deferred to Appendix A. Theorem 2.2 exposes two shortcomings of AC-GAN. First, maximizing the KL divergence between the marginal generator and data distributions ($\max_G \mathrm{KL}(Q_X \| P_X)$) contradicts the goal of conditional generative modeling, which is to match $Q_{X,Y}$ with $P_{X,Y}$. Although this issue can be mitigated to some extent by the adversarial game between the discriminator and generator, which minimizes the JS divergence between the two marginal distributions ($\min_G \mathrm{JS}(Q_X \| P_X)$), we find that it still has a negative impact on training stability and generation performance. Second, minimization of the entropy of labels conditioned

¹ We follow the common practice in the literature to adopt the stable version instead of the original one. We also provide an analysis of the original AC-GAN in Appendix B.


[Figure 1 panels include (c) TAC-GAN and (d) ADC-GAN.]

Figure 1: Illustration of the discriminators/classifiers of existing cGANs (PD-GAN (Miyato & Koyama, 2018), AC-GAN (Odena et al., 2017), and TAC-GAN (Gong et al., 2019)) and ADC-GAN. The symbol +/− indicates the GAN labels (real or fake) and $y$ is the class-label of data $x$. ADC-GAN differs from PD-GAN by explicitly predicting the label, and differs from AC-GAN and TAC-GAN in that the classifier $C_d$ also distinguishes real from generated data, like the discriminator.

on data of the generated distribution ($\min_G H_Q(Y|X)$) will make the label of the generated data deterministic. In other words, it pushes the generated data for each class away from the classification hyperplane, which explains the low intra-class diversity of the generated samples in AC-GAN, especially when the distributions of different classes have non-negligible overlap. Such overlap occurs naturally, given that neither state-of-the-art classifiers nor human beings can achieve 100% classification accuracy on real-world datasets (Russakovsky et al., 2015). The original AC-GAN, whose classifier is trained on both real and generated samples, suffers from the same issue (cf. Appendix B).

3. Proposed Method: ADC-GAN

The goal of conditional generative modeling is to faithfully learn the joint distribution of real data and labels regardless of the shape of the joint distribution (whether or not the distributions of different classes overlap). We first note that AC-GAN fails to learn the target joint distribution (Theorem 2.2) because the optimal classifier $C^{*}(y|x) = \frac{p(x, y)}{p(x)}$ (Proposition 2.1) is agnostic to the density of the generated (marginal or joint) distribution ($q(x)$ or $q(x, y)$). As a result, the classifier cannot provide the discrepancy between the target distribution and the generated distribution, resulting in a biased learning objective for the generator. Recall that the optimal discriminator $D^{*}(x) = \frac{p(x)}{p(x) + q(x)}$ is aware of the real data distribution as well as the generated data distribution (Goodfellow et al., 2014), and can therefore provide the discrepancy between the real and generated data distributions, $\frac{p(x)}{q(x)} = \frac{D^{*}(x)}{1 - D^{*}(x)}$, for faithful generative modeling. Intuitively, the discriminator is aware of both distributions because it distinguishes between the real and generated data with different labels (real or fake). Motivated by this understanding, we propose to make the classifier classify the real and generated data with different class-labels, establishing a discriminative classifier $C_d : \mathcal{X} \to \mathcal{Y}^{+} \cup \mathcal{Y}^{-}$ ($\mathcal{Y}^{+}$ for real data and $\mathcal{Y}^{-}$ for generated data) that recognizes the labels of the real and generated samples discriminatively. The generator is encouraged to produce classifiable real data rather than classifiable fake data. Mathematically, the objective functions for the discriminator, the discriminative classifier, and the generator of ADC-GAN are defined as:

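$$\max_{D, C_d} V(G, D) + \lambda \cdot \left(\mathbb{E}_{x, y \sim P_{X,Y}}[\log C_d(y^{+}|x)] + \mathbb{E}_{x, y \sim Q_{X,Y}}[\log C_d(y^{-}|x)]\right), \tag{5}$$

$$\min_G V(G, D) - \lambda \cdot \left(\mathbb{E}_{x, y \sim Q_{X,Y}}[\log C_d(y^{+}|x)] - \mathbb{E}_{x, y \sim Q_{X,Y}}[\log C_d(y^{-}|x)]\right), \tag{6}$$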

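where $C_d(y^{+}|x) = \frac{\exp(\varphi^{+}(y) \cdot \phi(x))}{\sum_{\bar{y}} \exp(\varphi^{+}(\bar{y}) \cdot \phi(x)) + \sum_{\bar{y}} \exp(\varphi^{-}(\bar{y}) \cdot \phi(x))}$ (resp. $C_d(y^{-}|x) = \frac{\exp(\varphi^{-}(y) \cdot \phi(x))}{\sum_{\bar{y}} \exp(\varphi^{+}(\bar{y}) \cdot \phi(x)) + \sum_{\bar{y}} \exp(\varphi^{-}(\bar{y}) \cdot \phi(x))}$) indicates the probability that a data point $x$ is classified as label $y$ and as real (resp. fake) simultaneously by the discriminative classifier. Here, $\phi : \mathcal{X} \to \mathbb{R}^d$ is a feature extractor that is shared with the original discriminator in our implementation ($D = \sigma \circ \psi \circ \phi$ with a linear mapping $\psi : \mathbb{R}^d \to \mathbb{R}$ and a sigmoid function $\sigma : \mathbb{R} \to [0, 1]$), and $\varphi^{+} : \mathcal{Y} \to \mathbb{R}^d$ and $\varphi^{-} : \mathcal{Y} \to \mathbb{R}^d$ are learnable label embeddings for the real and generated data, respectively.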
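To make the definition concrete, the following is a minimal sketch of the discriminative classifier as a single softmax over $2K$ logits, together with the classification terms of Equations 5 and 6 estimated on small batches. It is written in plain Python for readability rather than as the authors' implementation; the function names, the toy feature vector, and the toy embeddings are all illustrative assumptions, and features are assumed to be precomputed outputs of the feature extractor.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def discriminative_classifier(feature, emb_real, emb_fake):
    """Return (probs_real, probs_fake): one softmax over 2K logits.

    feature  : extracted feature of x, a list of d floats
    emb_real : K label embeddings for real data, each a list of d floats
    emb_fake : K label embeddings for generated data, each a list of d floats
    """
    logits = [dot(e, feature) for e in emb_real] + [dot(e, feature) for e in emb_fake]
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    k = len(emb_real)
    return [e / total for e in exps[:k]], [e / total for e in exps[k:]]

def classifier_loss(real_batch, fake_batch, emb_real, emb_fake):
    """Negated classification term of Eq. 5 (a quantity the classifier minimises)."""
    loss = 0.0
    for feature, y in real_batch:             # real pairs are pushed towards y+
        probs_real, _ = discriminative_classifier(feature, emb_real, emb_fake)
        loss -= math.log(probs_real[y]) / len(real_batch)
    for feature, y in fake_batch:             # generated pairs are pushed towards y-
        _, probs_fake = discriminative_classifier(feature, emb_real, emb_fake)
        loss -= math.log(probs_fake[y]) / len(fake_batch)
    return loss

def generator_classification_loss(fake_batch, emb_real, emb_fake):
    """Classification term of Eq. 6: -E_Q[log C_d(y+|x)] + E_Q[log C_d(y-|x)]."""
    loss = 0.0
    for feature, y in fake_batch:
        probs_real, probs_fake = discriminative_classifier(feature, emb_real, emb_fake)
        loss += (math.log(probs_fake[y]) - math.log(probs_real[y])) / len(fake_batch)
    return loss

# Toy usage with K = 3 classes and d = 2 feature dimensions (assumed values).
feature = [0.2, -0.4]
emb_real = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
emb_fake = [[-1.0, 0.0], [0.0, -1.0], [0.2, -0.3]]
probs_real, probs_fake = discriminative_classifier(feature, emb_real, emb_fake)
print(sum(probs_real) + sum(probs_fake))      # the 2K probabilities sum to one
```

Summing `probs_real` over labels gives the probability that the input is real under this classifier, which is the sense in which it plays a discriminator-like role.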

At first glance, the generator's objective with the discriminative classifier seems redundant, as maximization of $\log C_d(y^{+}|x)$ implicitly contains the goal of minimization of $\log C_d(y^{-}|x)$. However, we show below that the second term is indispensable for accurately learning


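Table 1: Theoretical learning objective for the generator of competing methods under the optimal discriminator and classifier.

| Method | Theoretical learning objective for the generator |
| --- | --- |
| AC-GAN (Odena et al., 2017) | $\min_G \mathrm{JS}(P_X \Vert Q_X) + \lambda\,(\mathrm{KL}(Q_{X,Y} \Vert P_{X,Y}) - \mathrm{KL}(Q_X \Vert P_X) + H_Q(Y \mid X))$ |
| TAC-GAN (Gong et al., 2019) | $\min_G \mathrm{JS}(P_X \Vert Q_X) + \lambda\,(\mathrm{KL}(Q_{X,Y} \Vert P_{X,Y}) - \mathrm{KL}(Q_X \Vert P_X))$ |
| ADC-GAN (ours) | $\min_G \mathrm{JS}(P_X \Vert Q_X) + \lambda\,(\mathrm{KL}(Q_{X,Y} \Vert P_{X,Y}))$ |
| PD-GAN (Miyato & Koyama, 2018) | $\min_G \mathrm{JS}(Q_{X,Y} \Vert P_{X,Y})$ |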

the real joint data-label distribution. Arguably, maximization of $\log C_d(y^{+}|x)$ forces the generator to produce only a few label-consistent data, improving the fidelity but losing the diversity of the generated samples. On the other hand, minimization of $\log C_d(y^{-}|x)$ encourages the generator not to synthesize typical label-consistent data, which increases the diversity but may degrade the fidelity of the generated samples. Together, the two objectives help the generator achieve its goal, as we prove below.

Proposition 3.1. For a fixed generator, the optimal discriminative classifier of ADC-GAN has the following form:

$$C_d^{*}(y^{+}|x) = \frac{p(x, y)}{p(x) + q(x)}, \qquad C_d^{*}(y^{-}|x) = \frac{q(x, y)}{p(x) + q(x)}.$$

Proposition 3.1 shows that the optimal discriminative classifier is aware of the densities of the real and generated joint distributions; it is therefore able to provide the discrepancy $\frac{p(x, y)}{q(x, y)} = \frac{C_d^{*}(y^{+}|x)}{C_d^{*}(y^{-}|x)}$ to optimize the generator.
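Proposition 3.1 can be sanity-checked on a toy discrete example. The sketch below is an illustration under assumed values (the two joint tables and the helper names are made up): it builds the closed-form classifier at one data point, confirms that its $2K$ outputs sum to one and recover the density ratio, and checks that shifting a little probability mass between two outputs lowers the classifier's objective at that point.

```python
import math

# Toy joint tables over 2 data points x and 2 labels y (assumed values);
# p[x][y] is the real joint density, q[x][y] the generated one.
p = [[0.30, 0.20], [0.10, 0.40]]
q = [[0.10, 0.30], [0.35, 0.25]]

def optimal_cd(p, q, x):
    """Closed form of Proposition 3.1 at a fixed x, for every label y."""
    denom = sum(p[x]) + sum(q[x])             # p(x) + q(x)
    return [v / denom for v in p[x]], [v / denom for v in q[x]]

def pointwise_objective(p, q, x, c_real, c_fake):
    """Contribution of the point x to the classification term of Eq. 5."""
    return (sum(pv * math.log(c) for pv, c in zip(p[x], c_real))
            + sum(qv * math.log(c) for qv, c in zip(q[x], c_fake)))

c_real, c_fake = optimal_cd(p, q, x=0)
best = pointwise_objective(p, q, 0, c_real, c_fake)

# Move a little probability mass between two of the 2K outputs and re-evaluate.
eps = 0.01
perturbed = pointwise_objective(
    p, q, 0,
    [c_real[0] + eps, c_real[1]],
    [c_fake[0] - eps, c_fake[1]],
)
print(best, perturbed)                        # the perturbed value is smaller
```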

Theorem 3.2. Given the optimal discriminative classifier, at the equilibrium point, optimizing the classification task for the generator of ADC-GAN is equivalent to:

$$\min_G \mathrm{KL}(Q_{X,Y} \| P_{X,Y}). \tag{7}$$

Theorem 3.2 confirms that the discriminative classifier alone can guarantee that the generator recovers the real joint distribution at the optimum. In practice, we retain the discriminator to train the generator for better training stability and convergence. The overall learning objective for the generator under the optimal discriminator and discriminative classifier is to minimize the JS divergence between the marginal data distributions and the reverse KL divergence between the joint data-label distributions ($\min_G \mathrm{JS}(P_X \| Q_X) + \lambda \cdot \mathrm{KL}(Q_{X,Y} \| P_{X,Y})$). Since the optimal solution set for generative modeling contains the optimal solution set for conditional generative modeling ($\arg\min_G \mathrm{JS}(P_X \| Q_X) \supseteq \arg\min_G \mathrm{KL}(Q_{X,Y} \| P_{X,Y})$), the guidance provided to the generator by the discriminator and the discriminative classifier is harmonious, which makes ADC-GAN robust to the value of the hyperparameter $\lambda$ and the selection of the GAN loss $V(G, D)$.
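The equivalence stated in Theorem 3.2 follows in one step once the optimal classifier of Proposition 3.1 is substituted into the generator's classification terms. The short derivation below is a sketch of that step (the full proof is in Appendix A):

```latex
\begin{aligned}
&-\mathbb{E}_{x,y \sim Q_{X,Y}}\left[\log C_d^{*}(y^{+}|x)\right]
 + \mathbb{E}_{x,y \sim Q_{X,Y}}\left[\log C_d^{*}(y^{-}|x)\right] \\
&\quad = \mathbb{E}_{x,y \sim Q_{X,Y}}\left[\log \frac{q(x,y) / (p(x)+q(x))}{p(x,y) / (p(x)+q(x))}\right]
 = \mathbb{E}_{x,y \sim Q_{X,Y}}\left[\log \frac{q(x,y)}{p(x,y)}\right]
 = \mathrm{KL}(Q_{X,Y} \,\|\, P_{X,Y}).
\end{aligned}
```

The normalizer $p(x) + q(x)$ cancels only because both terms are present; with the first term alone, it would remain inside the logarithm.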

4. Analysis on Competing Methods

In this section, we analyze the drawbacks of the two competing methods, TAC-GAN (Gong et al., 2019) and PD-GAN (Miyato & Koyama, 2018), to show the superiority of ADC-GAN. We also analyze AM-GAN (Zhou et al., 2018) in Appendix C. Before diving into the details, we show diagrams of the discriminator and classifier of these methods in Figure 1 and summarize their theoretical learning objectives for the generator under the optimal discriminator and classifier in Table 1.
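The classifier-induced terms in Table 1 can be compared numerically on a small discrete example. The sketch below is illustrative only: the joint tables are made-up values and the variable names are not from the paper. It plugs in each classifier at its optimum, namely $p(y|x)$ for the AC-GAN classifier, $q(y|x)$ for the second TAC-GAN classifier (the optimum of cross-entropy training on generated samples), and the form in Proposition 3.1 for ADC-GAN, and compares the result with the corresponding divergence expression.

```python
import math

# Toy joint tables indexed as table[x][y] (assumed values, each sums to 1).
p = [[0.30, 0.20], [0.10, 0.40]]   # real joint p(x, y)
q = [[0.10, 0.30], [0.35, 0.25]]   # generated joint q(x, y)

def marginal(joint):
    """Marginal over labels: returns the density of each data point x."""
    return [sum(row) for row in joint]

def expect_q(fn):
    """E_{x,y ~ Q}[fn(x, y)] over the finite support."""
    return sum(q[x][y] * fn(x, y) for x in range(len(q)) for y in range(len(q[x])))

px, qx = marginal(p), marginal(q)

kl_joint = expect_q(lambda x, y: math.log(q[x][y] / p[x][y]))
kl_marginal = sum(qx[x] * math.log(qx[x] / px[x]) for x in range(len(qx)))
cond_entropy = -expect_q(lambda x, y: math.log(q[x][y] / qx[x]))

# Classification terms seen by the generator, with each classifier at its optimum.
ac_term = -expect_q(lambda x, y: math.log(p[x][y] / px[x]))
tac_term = ac_term + expect_q(lambda x, y: math.log(q[x][y] / qx[x]))
adc_term = expect_q(
    lambda x, y: math.log(q[x][y] / (px[x] + qx[x]))
    - math.log(p[x][y] / (px[x] + qx[x]))
)

print(ac_term, kl_joint - kl_marginal + cond_entropy)   # AC-GAN row
print(tac_term, kl_joint - kl_marginal)                 # TAC-GAN row
print(adc_term, kl_joint)                               # ADC-GAN row
```

Each printed pair should agree up to floating-point error, matching the $\lambda$-weighted terms in the first three rows of Table 1.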

4.1. Competing Method: TAC-GAN

TAC-GAN (Gong et al., 2019) addresses the low intra-class diversity problem of AC-GAN by eliminating the conditional entropy of the generated data distribution, $H_Q(Y|X)$; it does so by learning the generator with another classifier $C^{\mathrm{mi}} : \mathcal{X} \to \mathcal{Y}$, which is trained on the generated samples. The objective functions for the discriminator, the twin classifiers, and the generator of TAC-GAN are defined as follows:

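$$\max_{D, C, C^{\mathrm{mi}}} V(G, D) + \lambda \cdot \left(\mathbb{E}_{x, y \sim P_{X,Y}}[\log C(y|x)] + \mathbb{E}_{x, y \sim Q_{X,Y}}[\log C^{\mathrm{mi}}(y|x)]\right), \tag{8}$$

$$\min_G V(G, D) - \lambda \cdot \left(\mathbb{E}_{x, y \sim Q_{X,Y}}[\log C(y|x)] - \mathbb{E}_{x, y \sim Q_{X,Y}}[\log C^{\mathrm{mi}}(y|x)]\right). \tag{9}$$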

Theorem 4.1. Given the twin optimal classifiers, at the equilibrium point, optimizing the classification tasks for the generator of TAC-GAN is equivalent to:

$$\min_G \mathrm{KL}(Q_{X,Y} \| P_{X,Y}) - \mathrm{KL}(Q_X \| P_X). \tag{10}$$

Our Theorem 4.1 reveals that the learning objective of the generator of TAC-GAN, under the twin optimal classifiers, can be regarded as optimizing contradictory divergences, i.e., minimization between joint distributions but maximization between marginal distributions. Although theoretically the JS divergence or others (Nowozin et al., 2016; Arjovsky et al., 2017) introduced through the adversarial training between the discriminator and generator may remedy this issue, it is difficult to obtain the optimal discriminator and classifier in the practical optimization to ensure the elimination of the contradiction. We argue that the training


instability of TAC-GAN reported in the literature (Kocaoglu et al., 2018; Han et al., 2020) and found in our experiments (cf. Figures 3(a) and 5) can be explained by this analysis.

4.2. Competing Method: PD-GAN

PD-GAN (Miyato & Koyama, 2018) injects the conditional information into the projection discriminator $D_p : \mathcal{X} \times \mathcal{Y} \to [0, 1]$ via the inner product between the embedding of the label and the representation of the data to calculate the joint discriminative score of the data-label pair. In this way, PD-GAN inherits a convergence point similar to that of the standard GAN, so that ideally it can avoid the low intra-class diversity problem of AC-GAN. Specifically, the objective functions for the projection discriminator and the generator of PD-GAN are defined as follows:

$$\min_G \max_{D_p} V(G, D_p) = \mathbb{E}_{x, y \sim P_{X,Y}}[\log D_p(x, y)] + \mathbb{E}_{x, y \sim Q_{X,Y}}[\log(1 - D_p(x, y))]. \tag{11}$$

Based on this formulation, the optimal projection discriminator has the following form:

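$$D_p^{*}(x, y) = \frac{1}{1 + \exp(-d^{*}(x, y))} = \frac{p(x, y)}{p(x, y) + q(x, y)},$$

$$d^{*}(x, y) = \log \frac{p(x, y)}{q(x, y)} = \log \frac{p(x)}{q(x)} + \log \frac{p(y|x)}{q(y|x)}, \tag{12}$$

where $p(y|x) = \frac{\exp(\varphi^{+}(y) \cdot \phi(x))}{\sum_{\bar{y}} \exp(\varphi^{+}(\bar{y}) \cdot \phi(x))}$ and $q(y|x) = \frac{\exp(\varphi^{-}(y) \cdot \phi(x))}{\sum_{\bar{y}} \exp(\varphi^{-}(\bar{y}) \cdot \phi(x))}$. PD-GAN accordingly defines:

$$r(x) := \log \frac{p(x)}{q(x)} := \psi(\phi(x)),$$

$$r(y|x) := \log \frac{p(y|x)}{q(y|x)} := \underbrace{(\overbrace{\varphi^{+}(y) - \varphi^{-}(y)}^{\varphi(y)}) \cdot \phi(x)}_{\hat{r}(y|x)} \; \underbrace{- \log \sum_{\bar{y} \in \mathcal{Y}} \exp(\varphi^{+}(\bar{y}) \cdot \phi(x)) + \log \sum_{\bar{y} \in \mathcal{Y}} \exp(\varphi^{-}(\bar{y}) \cdot \phi(x))}_{a}. \tag{13}$$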

However, PD-GAN actually ignores the partition term $a$ (see footnote 2) in Equation 13 and heuristically constructs the logit of the projection discriminator in the form of:

$$d(x, y) = r(x) + \hat{r}(y|x) = \psi(\phi(x)) + \varphi(y) \cdot \phi(x). \tag{14}$$

Discarding the partition term means that PD-GAN no longer belongs to the class of probabilistic models that can model the conditional probabilities $p(y|x)$ and $q(y|x)$, so it loses the complete dependencies between data and labels. In particular, for a mismatched data-label pair $(x, y)$ with $p(x, y) = 0$ and $q(x, y) = 0$, the optimal projection discriminator $D_p^{*}(x, y) = \frac{p(x, y)}{p(x, y) + q(x, y)} = \frac{0}{0}$ is undefined and thus unreliable. Our ADC-GAN can penalize the mismatched data-label pair because $C_d^{*}(y^{+}|x) = \frac{p(x, y)}{p(x) + q(x)} = \frac{0}{>0} = 0$ (since $p(x) + q(x) > 0$ for valid data $x$). Moreover, the optimal projection discriminator constructed according to the minimax GAN lacks theoretical guarantees for other GAN loss functions. The proposed ADC-GAN can be flexibly applied to any version of the GAN loss, as we do not require a specific form of the discriminator.

² PD-GAN discards $a$ when implementing the projection discriminator, based on the hypothesis that $a$ can be merged into $r(x)$. However, $r(x)$ does not model any label information, which should be involved in $a$. Therefore, this simplification is unreasonable.
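For readers who want to see the two logits side by side, the sketch below computes the projection logit of Equation 14 and the logit obtained when the log-partition terms of the two softmax conditionals are kept. It is an illustration under stated assumptions rather than PD-GAN's code: the linear map $\psi$ is represented by a weight vector without bias, features are assumed to be precomputed, and all function names and toy values are made up.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def projection_logit(feature, psi, emb_real, emb_fake, y):
    """Eq. 14: the unconditional score plus the projected label embedding."""
    proj = [a - b for a, b in zip(emb_real[y], emb_fake[y])]
    return dot(psi, feature) + dot(proj, feature)

def partition_term(feature, emb_real, emb_fake):
    """The difference of log-partition functions that Eq. 14 leaves out."""
    return (-log_sum_exp([dot(e, feature) for e in emb_real])
            + log_sum_exp([dot(e, feature) for e in emb_fake]))

def full_logit(feature, psi, emb_real, emb_fake, y):
    """r(x) + log p(y|x) - log q(y|x), with both conditionals as softmaxes."""
    return (projection_logit(feature, psi, emb_real, emb_fake, y)
            + partition_term(feature, emb_real, emb_fake))

# Toy values with K = 3 classes and d = 2 feature dimensions (assumed).
feature = [0.3, -1.2]
psi = [0.5, 0.1]
emb_real = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
emb_fake = [[-1.0, 0.2], [0.3, 0.3], [0.0, -0.7]]

for y in range(3):
    print(y, projection_logit(feature, psi, emb_real, emb_fake, y),
          full_logit(feature, psi, emb_real, emb_fake, y))
```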

5. Experiments

5.1. Synthetic Data

We first conduct experiments on a one-dimensional synthetic mixture of Gaussians, following the practice of Gong et al. (2019), to qualitatively show how faithfully ADC-GAN learns the target distribution. As shown in Figure 2(a), the real data distribution consists of three classes with non-negligible overlaps. Figures 2(b) to 2(d) show the learned distributions, which are estimated by kernel density estimation (KDE) (Parzen, 1962) on the generated data of AC-GAN, TAC-GAN, and ADC-GAN without the original GAN loss $V(G, D)$, respectively. Figures 2(e) to 2(h) show the KDE results of PD-GAN, AC-GAN, TAC-GAN, and ADC-GAN trained with the non-saturating GAN loss (Goodfellow et al., 2014), respectively. AC-GAN tends to generate classifiable data, which decreases the intra-class diversity. Without the GAN loss $V(G, D)$, AC-GAN outputs nearly deterministic data for each class. TAC-GAN without the GAN loss also cannot accurately capture the real data distribution, verifying the contradiction in Theorem 4.1. Impressively, the proposed ADC-GAN faithfully recovers the real data distribution even without the GAN loss, validating Theorem 3.2 that the discriminative classifier alone can guide the generator to learn the real data distribution.
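A setup of this kind can be sketched with the standard library alone. In the snippet below, the class means and standard deviations, the sample count, and the bandwidth are hypothetical choices made only to produce three overlapping classes; they are not the values used in the paper. The KDE is a plain Gaussian Parzen-window estimator.

```python
import math
import random

# Hypothetical class-conditional Gaussians as (mean, std); neighbouring
# classes overlap, mimicking the setting described above.
CLASS_PARAMS = {0: (0.0, 1.0), 1: (3.0, 2.0), 2: (6.0, 3.0)}

def sample_mixture(n, rng):
    """Draw n (x, y) pairs with a uniform label prior."""
    data = []
    for _ in range(n):
        y = rng.randrange(len(CLASS_PARAMS))
        mean, std = CLASS_PARAMS[y]
        data.append((rng.gauss(mean, std), y))
    return data

def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of 1-D samples (Parzen window)."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

rng = random.Random(0)
data = sample_mixture(600, rng)

# One density estimate per class, as would be plotted for real or generated data.
per_class_density = {
    y: gaussian_kde([x for x, label in data if label == y], bandwidth=0.5)
    for y in CLASS_PARAMS
}
print({y: round(density(3.0), 4) for y, density in per_class_density.items()})
```

Applying the same estimator to samples drawn from a trained generator, one class at a time, gives curves that can be compared against those of the real data.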

5.2. Experiments based on BigGAN-PyTorch

In this section, we conduct experiments on three common real-world datasets: CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), and Tiny-ImageNet (Le & Yang, 2015) based on the BigGAN-PyTorch repository³ with our extensions⁴. The optimizer is Adam with a learning rate of $2 \times 10^{-4}$ on CIFAR-10/100, and of $1 \times 10^{-4}$ for the generator and $4 \times 10^{-4}$ for the discriminator on Tiny-ImageNet. We train all methods for 1000 and 500 epochs with batch size of 50 and 100

³ https://github.com/ajbrock/BigGAN-PyTorch

⁴ https://github.com/houliangict/adcgan


[Figure 2: eight kernel density plots over a shared horizontal axis running from about −2.5 to 15.0. Panels: (a) Real Data; (b) AC-GAN w/o $V(G, D)$; (c) TAC-GAN w/o $V(G, D)$; (d) ADC-GAN w/o $V(G, D)$; (e) PD-GAN w/ $V(G, D)$; (f) AC-GAN w/ $V(G, D)$; (g) TAC-GAN w/ $V(G, D)$; (h) ADC-GAN w/ $V(G, D)$.]

Figure 2: Qualitative comparison of distribution modeling results on the one-dimensional synthetic data.

Table 2: FID, Intra-FID, and Accuracy (%) comparisons on CIFAR-10, CIFAR-100, and Tiny-ImageNet.

| Dataset | Metric | PD-GAN | AC-GAN | AM-GAN | TAC-GAN | ADC-GAN |
| --- | --- | --- | --- | --- | --- | --- |
| CIFAR-10 | FID (↓) | 6.23 | 6.50 | 6.81 | 5.83 | 5.66 |
| | Intra-FID (↓) | 48.90 | 57.67 | 69.31 | 56.67 | 40.45 |
| | Accuracy (↑) | 66.22 | 84.69 | 83.63 | 88.27 | 89.51 |
| CIFAR-100 | FID (↓) | 8.70 | 11.24 | 10.42 | 10.38 | 8.12 |
| | Intra-FID (↓) | 51.15 | 83.06 | 78.11 | 79.59 | 49.24 |
| | Accuracy (↑) | 37.89 | 55.26 | 55.77 | 60.03 | 64.24 |
| Tiny-ImageNet | FID (↓) | 26.10 | 25.02 | 21.34 | 21.12 | 19.02 |
| | Intra-FID (↓) | 66.23 | 99.04 | 90.56 | 95.48 | 63.05 |
| | Accuracy (↑) | 27.79 | 44.59 | 44.67 | 44.44 | 48.89 |

on CIFAR-10/100 and Tiny-ImageNet, respectively. The discriminator and classifier are updated 4 and 2 times per generator update step on CIFAR-10/100 and Tiny-ImageNet, respectively. We follow the practice of Miyato & Koyama (2018) and Gong et al. (2019) and adopt the hinge loss (Lim & Ye, 2017; Tran et al., 2017) as the implementation of $V(G, D)$. The coefficient hyperparameters of AC-GAN and AM-GAN (Zhou et al., 2018) (cf. Appendix C for analysis) are set to $\lambda = 0.2$, as this value performs best. For TAC-GAN and ADC-GAN, the coefficient hyperparameters are set to $\lambda = 1.0$ on CIFAR-10/100 and $\lambda = 0.5$ on Tiny-ImageNet.
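For reference, the hinge version of the GAN loss in its commonly used form, together with the settings quoted in this section, might be written down as follows. This is a sketch, not the repository's code: the function names and dictionary keys are illustrative, the scores are assumed to be raw (unbounded) discriminator outputs, and the single CIFAR learning rate is applied to both networks as an assumption.

```python
def hinge_discriminator_loss(real_scores, fake_scores):
    """Hinge loss for D: E[max(0, 1 - D(x_real))] + E[max(0, 1 + D(x_fake))]."""
    real_part = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_part = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_part + fake_part

def hinge_generator_loss(fake_scores):
    """Hinge loss for G: -E[D(G(z))]."""
    return -sum(fake_scores) / len(fake_scores)

# Settings quoted in this section, gathered in one place (key names are illustrative).
TRAINING_CONFIG = {
    "cifar": {
        "g_lr": 2e-4, "d_lr": 2e-4,           # one rate is quoted; used for both here
        "epochs": 1000, "batch_size": 50,
        "d_steps_per_g_step": 4, "lambda_tac_adc": 1.0,
    },
    "tiny_imagenet": {
        "g_lr": 1e-4, "d_lr": 4e-4,
        "epochs": 500, "batch_size": 100,
        "d_steps_per_g_step": 2, "lambda_tac_adc": 0.5,
    },
    "lambda_ac_am": 0.2,                      # AC-GAN and AM-GAN on all datasets
}

d_loss = hinge_discriminator_loss([2.0, 0.5], [-2.0, 0.0])
g_loss = hinge_generator_loss([-2.0, 0.0])
print(d_loss, g_loss)
```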

Image Generation. We use the Fréchet Inception Distance (FID) (Heusel et al., 2017) and Intra-FID (Miyato & Koyama, 2018) metrics to measure the overall and intra-class quality of the generated images, respectively. Table 2 shows that ADC-GAN obtains the best FID and Intra-FID scores on all three datasets, indicating consistent superiority over previous cGANs in conditional image generation.
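The relationship between the two metrics can be illustrated with a deliberately simplified sketch. FID proper fits multivariate Gaussians to Inception features and uses full covariance matrices; the snippet below instead uses the univariate special case, where the squared Fréchet distance between two Gaussians reduces to $(\mu_1 - \mu_2)^2 + (\sigma_1 - \sigma_2)^2$, and averages it over classes in the spirit of Intra-FID. The function names and toy samples are assumptions for illustration.

```python
from statistics import fmean, pstdev

def frechet_distance_1d(real_features, fake_features):
    """Squared Fréchet distance between Gaussians fitted to two 1-D samples."""
    mean_gap = fmean(real_features) - fmean(fake_features)
    std_gap = pstdev(real_features) - pstdev(fake_features)
    return mean_gap ** 2 + std_gap ** 2

def intra_class_distance(real, fake):
    """Average the per-class distance; `real` and `fake` map label -> samples."""
    return fmean([frechet_distance_1d(real[y], fake[y]) for y in real])

# Toy example: class 1 of the "generated" data has collapsed to a single value.
real = {0: [0.0, 2.0], 1: [4.0, 6.0]}
fake = {0: [0.0, 2.0], 1: [5.0, 5.0]}
print(intra_class_distance(real, fake))
```

In this toy case the class means match exactly, so the per-class distance is driven entirely by the lost spread within class 1, which is the kind of failure the intra-class metric is meant to expose.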

Training Stability. We also note that ADC-GAN yields the best training stability according to the FID training curves (cf. Figures 3(a) and 5). Even without the discriminator, the training stability of ADC-GAN (w/o D) still exceeds that of most competing methods. AC-GAN diverges during training on all three datasets. TAC-GAN also diverges on CIFAR-100 and Tiny-ImageNet and achieves a relatively stable FID training curve only on the simplest dataset, CIFAR-10. We hence report the results of all methods using the best checkpoint. These unstable FID training curves implicitly verify the drawback of existing classifier-based cGANs, which optimize contradictory divergences.

Different Coefficients. To explicitly show the above issues, we set the objective function of classifier-based cGANs as $(1 - \lambda) V(G, D) + \lambda V_C(G, C)$, where $V_C(G, C)$ is the task between the generator and classifier. As shown in Figures 3(b) and 6, ADC-GAN consistently gains superior FID scores across different coefficient hyperparameters even for


Figure 3: (a) FID curves during GAN training on CIFAR-100. (b) FID scores of classifier-based cGANs with different $\lambda$ on CIFAR-100. The objective function in this experiment is $(1 - \lambda) V(G, D) + \lambda V_C(G, C)$, where $V_C(G, C)$ is the task between the generator and classifier.

[Figure 4 panels include (b) ADC-GAN.]

Figure 4: t-SNE visualization of CIFAR-10 validation data based on learned representations extracted from the penultimate layer $\phi(x)$ of the discriminator/classifier. Different colors indicate different classes.

$\lambda = 1.0$ (i.e., without the discriminator), showing strong robustness with respect to $\lambda$, while AC-GAN and TAC-GAN perform substantially worse as $\lambda$ becomes larger.

Data-to-Class Relations. To investigate whether the model captures appropriate data-to-class relations, we conduct image classification experiments based on the learned representations $\phi(x)$ of the discriminator/classifier. Specifically, we first train a logistic regression classifier on the training data using the scikit-learn library and then compute the classification accuracy on the validation data. As reported in Table 2, ADC-GAN significantly outperforms competing methods on all datasets in terms of the Accuracy metric. The reason is that the discriminative classifier needs to recognize the labels of data while simultaneously distinguishing between real and fake data, which makes the classifier more robust in modeling data-to-class relations. Notice that PD-GAN obtains the worst results. Comparing the CIFAR-10 t-SNE (Van der Maaten & Hinton, 2008) visualization results of PD-GAN and ADC-GAN in Figure 4, it is clear that PD-GAN does not learn proper data-to-class relations as ADC-GAN does, reflecting the problem caused by the loss of partition terms in PD-GAN.

Table 3: FID and IS comparisons on ImageNet (128 × 128). B.S. denotes the batch size and Iters. the number of training iterations. Results of BigGAN and ReACGAN are copied from the ReACGAN paper (Kang et al., 2021).

| B.S. | Iters. | Method | IS (↑) | FID (↓) |
| --- | --- | --- | --- | --- |
| 256 | 500K | BigGAN | 43.97 | 16.36 |
| | | ReACGAN | 68.27 | 13.98 |
| | | ADC-GAN | 66.96 | 11.65 |
| 2048 | 200K | BigGAN | 99.71 | 7.89 |
| | | ReACGAN | 92.74 | 8.23 |
| | | ADC-GAN | 97.47 | 9.46 |
| | 500K | ADC-GAN | 108.10 | 8.02 |

5.3. Experiments based on PyTorch-StudioGAN

In this section, we compare ADC-GAN with state-of-the-art cGANs using the PyTorch-StudioGAN repository⁵, whose evaluation protocols differ from those of the BigGAN-PyTorch repository used in Table 2. Nonetheless, our comparison is fair because the methods in each experiment follow the same evaluation protocol.

Image Generation on ImageNet. We first conduct experiments on ImageNet (128 × 128) following the experimental settings of ReACGAN (Kang et al., 2021). Table 3 reports the Inception Score (IS) (Salimans et al., 2016) and FID results. Our ADC-GAN is comparable with the state-of-the-art cGANs, BigGAN and ReACGAN (Kang et al., 2021), at batch sizes of 256 and 2048, showing effectiveness on large-scale high-resolution image datasets. Note, however, that we ran ADC-GAN only once with $\lambda = 1$ in each of the two batch size settings, and did not make other attempts due to our limited computational resources. We argue that the results of ADC-GAN can be improved by choosing an appropriate coefficient hyperparameter $\lambda$.

Different GAN Losses. We also investigate the robustness of ADC-GAN with respect to the GAN loss function $V(G, D)$ by adopting different versions. Table 4 reports the quantitative results on CIFAR-100 (cf. Table 5 in Appendix D for complete results). Impressively, the proposed ADC-GAN achieves the best iFID (intra-FID), recall (Kynkäänniemi et al., 2019), and coverage (Naeem et al., 2020) scores across the non-saturation (Goodfellow et al., 2014), WGAN-GP (Gulrajani et al., 2017), and hinge (Lim & Ye, 2017) versions of the GAN loss. The best iFID scores indicate the best conditional generative modeling performance, and the best recall and coverage results reflect the best (intra-class) diversity of the generated samples.

⁵ https://github.com/POSTECH-CVLab/PyTorch-StudioGAN


Table 4: IS, FID, iFID, Precision, Recall, Density, and Coverage comparisons with state-of-the-art methods under different GAN loss functions on CIFAR-100. The best results are bold and the second best are underlined.

| GAN loss | Method | IS | FID | iFID | Precision | Recall | Density | Coverage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Non-saturation | PD-GAN | 11.48 | 11.59 | 105.38 | 0.7337 | 0.6804 | 0.8646 | 0.8513 |
| | AC-GAN | 7.98 | 49.46 | 207.56 | 0.7322 | 0.0793 | 0.6225 | 0.4112 |
| | TAC-GAN | 11.34 | 14.47 | 131.90 | 0.7429 | 0.6077 | 0.8324 | 0.7887 |
| | ADC-GAN | 11.88 | 11.07 | 104.21 | 0.7379 | 0.6972 | 0.8521 | 0.8609 |
| | ContraGAN | 11.15 | 13.54 | 146.86 | 0.7390 | 0.6155 | 0.8481 | 0.7729 |
| | ReACGAN | 11.79 | 13.72 | 125.21 | 0.7541 | 0.5861 | 0.8695 | 0.8005 |
| WGAN-GP | PD-GAN | 5.66 | 69.48 | | 0.5976 | 0.1603 | 0.4310 | 0.2649 |
| | AC-GAN | 10.97 | 19.30 | 148.40 | 0.6880 | 0.5444 | 0.6770 | 0.7242 |
| | TAC-GAN | 11.04 | 15.56 | 121.23 | 0.7023 | 0.6474 | 0.7048 | 0.7535 |
| | ADC-GAN | 11.01 | 14.02 | 101.14 | 0.7058 | 0.6804 | 0.7549 | 0.7956 |
| | ContraGAN | 6.72 | 49.77 | 147.22 | 0.6498 | 0.2834 | 0.5827 | 0.3549 |
| | ReACGAN | 6.67 | 47.74 | 150.7 | 0.6188 | 0.3104 | 0.4806 | 0.3396 |
| Hinge | PD-GAN | 11.76 | 10.96 | 108.08 | 0.7436 | 0.6812 | 0.8790 | 0.8609 |
| | AC-GAN | 11.66 | 21.65 | 168.87 | 0.7577 | 0.3649 | 0.8297 | 0.7225 |
| | TAC-GAN | 12.07 | 12.56 | 134.75 | 0.7572 | 0.6020 | 0.8957 | 0.8400 |
| | ADC-GAN | 11.82 | 10.73 | 103.78 | 0.7387 | 0.7023 | 0.8721 | 0.8707 |
| | ContraGAN | 10.08 | 13.22 | 128.50 | 0.7372 | 0.6251 | 0.8356 | 0.7790 |
| | ReACGAN | 11.80 | 12.52 | 140.47 | 0.7510 | 0.5982 | 0.9300 | 0.8327 |

6. Related Work

Efforts on developing cGANs (Mirza & Osindero, 2014) can be divided into two lines of work. The first studies how to implement a conditional generator; methods in this line include concatenation (Mirza & Osindero, 2014), conditional batch normalization (de Vries et al., 2017), and conditional convolution layers (Sagong et al., 2019). The second studies how to train the conditional generator to produce label-dependent samples, and can be further divided into two categories: classifier-based and projection-based cGANs.

Classifier-based cGANs. AC-GAN (Odena et al., 2017) leveraged an auxiliary classifier to identify consistency between data and labels. MH-GAN (Kavalerov et al., 2021) improved AC-GAN by replacing the cross-entropy loss of the classifier with the multi-hinge loss. AM-GAN (Zhou et al., 2018) replaced the discriminator with a (K + 1)-way classifier with an additional fake label. Omni-GAN (Zhou et al., 2020) combined the discriminator with the classifier to construct a (K + 2)-dimensional multi-label classifier. TAC-GAN (Gong et al., 2019) corrected the biased learning objective of AC-GAN by introducing another classifier, which is the multi-class version of the Anti-Labeler of CausalGAN (Kocaoglu et al., 2018). UAC-GAN (Han et al., 2020) improved the training stability of TAC-GAN with MINE (Belghazi et al., 2018). ECGAN (Chen et al., 2021) provided a unified view of cGANs with and without classifiers. Orthogonally to our work, ContraGAN (Kang & Park, 2020) and ReACGAN (Kang et al., 2021) modeled data-to-data relations as well as data-to-class relations using the conditional contrastive loss and the data-to-data cross-entropy loss, respectively. However, they did not solve the low intra-class diversity problem of AC-GAN because they inherited the generator-agnostic classifier.

Projection-based cGANs. PD-GAN (Miyato & Koyama, 2018) injected the class information into the discriminator via label projection and achieved state-of-the-art generation quality on natural images (Brock et al., 2019; Wu et al., 2019; Zhang et al., 2020; Zhao et al., 2021). P2GAN (Han et al., 2021) further improved PD-GAN by compensating for the missing partition term in the objective function.

Discriminative classifiers. Watanabe & Favaro (2021) exploited a discriminative classifier for training GANs with any level of labeling, but their objective function for the generator differs from ours, and it is our generator objective that enables ADC-GAN to faithfully learn the target distribution. SSGAN-LA (Hou et al., 2021a) presented a similar idea but with different loss functions from ADC-GAN (multi-hinge vs. cross-entropy) to tackle the degraded learning objective of self-supervised GANs, whereas ADC-GAN targets conditional GANs. Moreover, our analysis of the degraded objective is more accurate and informative than that of SSGAN-LA.

7. Conclusion

In this paper, we present a novel conditional generative adversarial network with an auxiliary discriminative classifier (ADC-GAN) to achieve faithful conditional generative modeling. We also discuss the differences between ADC-GAN and competing cGANs and analyze their potential issues and limitations. Extensive experimental results validate the theoretical superiority of ADC-GAN over state-of-the-art classifier-based and projection-based cGANs.


Acknowledgements

This work is funded by the National Natural Science Foundation of China under Grant Nos. 62102402 and U21B2046, and by the National Key R&D Program of China (2020AAA0105200). Huawei Shen is also supported by the Beijing Academy of Artificial Intelligence (BAAI).

References

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.

Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. Mutual information neural estimation. In Proceedings of the 35th International Conference on Machine Learning, 2018.

Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.

Che, T., Li, Y., Jacob, A. P., Bengio, Y., and Li, W. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136, 2016.

Chen, S.-A., Li, C.-L., and Lin, H.-T. A unified view of cGANs with and without classifiers. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021.

Chen, T., Zhai, X., Ritter, M., Lucic, M., and Houlsby, N. Self-supervised gans via auxiliary rotation loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

de Vries, H., Strub, F., Mary, J., Larochelle, H., Pietquin, O., and Courville, A. C. Modulating early visual processing by language. In Advances in Neural Information Processing Systems, 2017.

Gong, M., Xu, Y., Li, C., Zhang, K., and Batmanghelich, K. Twin auxilary classifiers gan. In Advances in Neural Information Processing Systems, 2019.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, 2017.

Han, L., Stathopoulos, A., Xue, T., and Metaxas, D. Unbiased auxiliary classifier gans with mine. arXiv preprint arXiv:2006.07567, 2020.

Han, L., Min, M. R., Stathopoulos, A., Tian, Y., Gao, R., Kadav, A., and Metaxas, D. N. Dual projection generative adversarial networks for conditional image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, 2017.

Hou, L., Shen, H., Cao, Q., and Cheng, X. Self-supervised GANs with label augmentation. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021a.

Hou, L., Yuan, Z., Huang, L., Shen, H., Cheng, X., and Wang, C. Slimmable generative adversarial networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 7746–7753, 2021b.

Kang, M. and Park, J. Contragan: Contrastive learning for conditional image generation. In Advances in Neural Information Processing Systems, 2020.

Kang, M., Shim, W. J., Cho, M., and Park, J. Rebooting ACGAN: Auxiliary classifier GANs with stable training. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021.

Karras, T., Aittala, M., Laine, S., Härkönen, E., Hellsten, J., Lehtinen, J., and Aila, T. Alias-free generative adversarial networks. In Advances in Neural Information Processing Systems.

Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., and Aila, T. Training generative adversarial networks with limited data. 2020a.

Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020b.

Kavalerov, I., Czaja, W., and Chellappa, R. A multi-class hinge loss for conditional gans. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 1290–1299, January 2021.


Kocaoglu, M., Snyder, C., Dimakis, A. G., and Vishwanath, S. Causal GAN: Learning causal implicit generative models with adversarial training. In International Conference on Learning Representations, 2018.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., and Aila, T. In Advances in Neural Information Processing Systems, 2019.

Le, Y. and Yang, X. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015.

Lim, J. H. and Ye, J. C. Geometric gan. arXiv preprint arXiv:1705.02894, 2017.

Lin, Z., Khetan, A., Fanti, G., and Oh, S. Pacgan: The power of two samples in generative adversarial networks. In Advances in Neural Information Processing Systems, 2018.

Mirza, M. and Osindero, S. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

Miyato, T. and Koyama, M. cGANs with projection discriminator. In International Conference on Learning Representations, 2018.

Naeem, M. F., Oh, S. J., Uh, Y., Choi, Y., and Yoo, J. Reliable fidelity and diversity metrics for generative models. In Proceedings of the 37th International Conference on Machine Learning, 2020.

Nowozin, S., Cseke, B., and Tomioka, R. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, 2016.

Odena, A. Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583, 2016.

Odena, A., Olah, C., and Shlens, J. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning, 2017.

Parzen, E. On estimation of a probability density function and mode. The annals of mathematical statistics, 33(3): 1065–1076, 1962.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3): 211–252, 2015.

Sagong, M.-C., Shin, Y.-G., Yeo, Y.-J., Park, S., and Ko, S.-J. cgans with conditional convolution layer. arXiv preprint arXiv:1906.00709, 2019.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X., and Chen, X. Improved techniques for training gans. In Advances in Neural Information Processing Systems, 2016.

Shu, R., Bui, H., and Ermon, S. Ac-gan learns a biased distribution. In NIPS Workshop on Bayesian Deep Learning, volume 8, 2017.

Tan, Z., Chai, M., Chen, D., Liao, J., Chu, Q., Yuan, L., Tulyakov, S., and Yu, N. Michigan: Multi-input-conditioned hair image generation for portrait editing. arXiv preprint arXiv:2010.16417, 2020.

Tran, D., Ranganath, R., and Blei, D. Hierarchical implicit models and likelihood-free variational inference. In Advances in Neural Information Processing Systems, 2017.

Van der Maaten, L. and Hinton, G. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.

Watanabe, T. and Favaro, P. A unified generative adversarial network training via self-labeling and self-attention. In Proceedings of the 38th International Conference on Machine Learning, 2021.

Wu, Y., Donahue, J., Balduzzi, D., Simonyan, K., and Lillicrap, T. Logan: Latent optimisation for generative adversarial networks. arXiv preprint arXiv:1912.00953, 2019.

Yan, X., Yang, J., Sohn, K., and Lee, H. Attribute2image: Conditional image generation from visual attributes. arXiv preprint arXiv:1512.00570, 2015.

Zhang, H., Zhang, Z., Odena, A., and Lee, H. Consistency regularization for generative adversarial networks. In International Conference on Learning Representations, 2020.

Zhao, Z., Singh, S., Lee, H., Zhang, Z., Odena, A., and Zhang, H. Improved consistency regularization for gans. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 11033–11041, 2021.

Zhou, P., Xie, L., Ni, B., Geng, C., and Tian, Q. Omnigan: On the secrets of cgans and beyond. arXiv preprint arXiv:2011.13074, 2020.

Zhou, Z., Cai, H., Rong, S., Song, Y., Ren, K., Zhang, W., Wang, J., and Yu, Y. Activation maximization generative adversarial nets. In International Conference on Learning Representations, 2018.


A.1. Proof of Proposition 2.1

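Proposition 2.1. For fixed generator, the optimal classifier of AC-GAN has the form of $C^*(y|x) = \frac{p(x,y)}{p(x)}$.

$$\max_C \mathbb{E}_{x,y\sim P_{X,Y}}[\log C(y|x)] = \mathbb{E}_{x\sim P_X}\mathbb{E}_{y\sim P_{Y|X}}[\log C(y|x)] \tag{15}$$

$$\Rightarrow \min_C \mathbb{E}_{x\sim P_X}\mathbb{E}_{y\sim P_{Y|X}}[-\log C(y|x)] = \mathbb{E}_{x\sim P_X}\left[H(p(y|x)) + \mathrm{KL}(p(y|x)\,\|\,C(y|x))\right] \tag{16}$$

$$\Rightarrow C^*(y|x) = \arg\min_C \mathrm{KL}(p(y|x)\,\|\,C(y|x)) = p(y|x) = \frac{p(x,y)}{p(x)}$$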
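The decomposition used in this proof (expected negative log-likelihood equals conditional entropy plus a KL term) can be checked numerically on a small discrete example. The sketch below is only an illustration: the joint distribution `p` and the comparison classifier `C_uniform` are made-up values, not quantities from the paper.

```python
import math

# Toy real joint distribution p(x, y) over 2 inputs and 3 labels (hypothetical numbers).
p = {(0, 0): 0.30, (0, 1): 0.15, (0, 2): 0.05,
     (1, 0): 0.05, (1, 1): 0.10, (1, 2): 0.35}
xs, ys = (0, 1), (0, 1, 2)
p_x = {x: sum(p[(x, y)] for y in ys) for x in xs}

def expected_nll(C):
    """E_{x,y ~ P}[-log C(y|x)] for a classifier table C[x][y]."""
    return -sum(p[(x, y)] * math.log(C[x][y]) for x in xs for y in ys)

# Closed-form optimum from Proposition 2.1: C*(y|x) = p(x, y) / p(x).
C_star = {x: {y: p[(x, y)] / p_x[x] for y in ys} for x in xs}

# Conditional entropy H_P(Y|X): the loss value reached when the KL term vanishes.
H_cond = -sum(p[(x, y)] * math.log(p[(x, y)] / p_x[x]) for x in xs for y in ys)

# Any other classifier, here the uniform one, pays an additional KL penalty.
C_uniform = {x: {y: 1.0 / len(ys) for y in ys} for x in xs}
kl_gap = sum(p[(x, y)] * math.log(C_star[x][y] / C_uniform[x][y])
             for x in xs for y in ys)

loss_star = expected_nll(C_star)
loss_uniform = expected_nll(C_uniform)
```

On this example `loss_star` coincides with `H_cond`, and `loss_uniform` exceeds it by exactly `kl_gap`, which is what Eq. (16) predicts.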

A.2. Proof of Theorem 2.2

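Theorem 2.2. Given the optimal classifier, at the equilibrium point, optimizing the classification task for the generator of AC-GAN is equivalent to:

$$\min_G \mathrm{KL}(Q_{X,Y}\,\|\,P_{X,Y}) - \mathrm{KL}(Q_X\,\|\,P_X) + H_Q(Y|X), \tag{4}$$

where $H_Q(Y|X) = -\int \sum_y q(x,y)\log q(y|x)\,dx$ is the conditional entropy of the generated samples.

$$\max_G \mathbb{E}_{x,y\sim Q_{X,Y}}[\log C^*(y|x)] = \mathbb{E}_{x,y\sim Q_{X,Y}}\left[\log\frac{p(x,y)}{p(x)}\right]$$

$$= \mathbb{E}_{x,y\sim Q_{X,Y}}\left[\log\left(\frac{p(x,y)}{q(x,y)}\cdot\frac{q(x)}{p(x)}\cdot\frac{q(x,y)}{q(x)}\right)\right]$$

$$\Rightarrow \min_G \mathbb{E}_{x,y\sim Q_{X,Y}}\left[\log\frac{q(x,y)}{p(x,y)} - \log\frac{q(x)}{p(x)} - \log\frac{q(x,y)}{q(x)}\right]$$

$$\Rightarrow \min_G \mathrm{KL}(Q_{X,Y}\,\|\,P_{X,Y}) - \mathrm{KL}(Q_X\,\|\,P_X) + H_Q(Y|X) \tag{20}$$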
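The identity in Theorem 2.2 can be sanity-checked on discrete distributions. The following sketch uses two made-up joint tables `p` and `q` (assumptions for illustration only) and compares the generator's classification reward under the optimal AC-GAN classifier with the three-term expression in Eq. (20).

```python
import math

def marginal(joint):
    """Marginal over x of a joint table keyed by (x, y)."""
    out = {}
    for (x, _), v in joint.items():
        out[x] = out.get(x, 0.0) + v
    return out

def kl(a, b):
    """KL(a || b) for two tables defined on the same keys."""
    return sum(v * math.log(v / b[k]) for k, v in a.items() if v > 0)

# Hypothetical real (p) and generated (q) joint distributions, x in {0,1,2}, y in {0,1}.
p = {(0, 0): 0.25, (0, 1): 0.05, (1, 0): 0.15, (1, 1): 0.15, (2, 0): 0.05, (2, 1): 0.35}
q = {(0, 0): 0.10, (0, 1): 0.10, (1, 0): 0.30, (1, 1): 0.10, (2, 0): 0.10, (2, 1): 0.30}
p_x, q_x = marginal(p), marginal(q)

# Reward the generator maximizes: E_Q[log C*(y|x)] with C*(y|x) = p(x,y)/p(x).
reward = sum(v * math.log(p[k] / p_x[k[0]]) for k, v in q.items())

# Conditional entropy of the generated samples, H_Q(Y|X).
h_q = -sum(v * math.log(v / q_x[k[0]]) for k, v in q.items())

# Quantity minimized according to Eq. (20).
objective = kl(q, p) - kl(q_x, p_x) + h_q
```

Here `reward` equals `-objective` up to floating-point error, so maximizing the classification reward is the same as minimizing the expression that contains the conditional entropy term.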

A.3. Proof of Proposition 3.1

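Proposition 3.1. For fixed generator, the optimal discriminative classifier of ADC-GAN has the form of the following:

$$C_d^*(y^+|x) = \frac{p(x,y)}{p(x)+q(x)}, \quad C_d^*(y^-|x) = \frac{q(x,y)}{p(x)+q(x)}.$$

$$\max_{C_d} \mathbb{E}_{x,y\sim P_{X,Y}}[\log C_d(y^+|x)] + \mathbb{E}_{x,y\sim Q_{X,Y}}[\log C_d(y^-|x)] \Rightarrow \max_{C_d} \mathbb{E}_{x,y\sim P^m_{X,Y}}[\log C_d(y|x)], \tag{21}$$

with $p^m(x,y^+) = \frac{1}{2}p(x,y)$, $p^m(x,y^-) = \frac{1}{2}q(x,y)$, and $p^m(x) = \sum_y p^m(x,y) = \frac{1}{2}(p(x)+q(x))$.

$$\max_{C_d} \mathbb{E}_{x\sim P^m_X}\mathbb{E}_{y\sim P^m_{Y|X}}[\log C_d(y|x)] \Rightarrow \min_{C_d} \mathbb{E}_{x\sim P^m_X}\mathbb{E}_{y\sim P^m_{Y|X}}[-\log C_d(y|x)] \tag{22}$$

$$\Rightarrow \min_{C_d} \mathbb{E}_{x\sim P^m_X}\left[H(p^m(y|x)) + \mathrm{KL}(p^m(y|x)\,\|\,C_d(y|x))\right] \tag{23}$$

$$\Rightarrow C_d^*(y|x) = \arg\min_{C_d} \mathrm{KL}(p^m(y|x)\,\|\,C_d(y|x)) = p^m(y|x) = \frac{p^m(x,y)}{p^m(x)}$$

Therefore, the optimal discriminative classifier of ADC-GAN has the form of $C_d^*(y^+|x) = \frac{p^m(x,y^+)}{p^m(x)} = \frac{p(x,y)}{p(x)+q(x)}$ and $C_d^*(y^-|x) = \frac{p^m(x,y^-)}{p^m(x)} = \frac{q(x,y)}{p(x)+q(x)}$, which concludes the proof.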
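A small numerical check of Proposition 3.1 is sketched below. It builds the closed-form classifier over the 2K labels (K real labels and K fake labels), verifies that it is a valid distribution for every x, and compares its objective value against two other classifiers. The tables `p` and `q`, and the comparison classifiers `uniform` and `blend`, are hypothetical choices for illustration.

```python
import math

xs, ys = (0, 1, 2), (0, 1)
# Hypothetical real (p) and generated (q) joint distributions.
p = {(0, 0): 0.25, (0, 1): 0.05, (1, 0): 0.15, (1, 1): 0.15, (2, 0): 0.05, (2, 1): 0.35}
q = {(0, 0): 0.10, (0, 1): 0.10, (1, 0): 0.30, (1, 1): 0.10, (2, 0): 0.10, (2, 1): 0.30}
p_x = {x: sum(p[(x, y)] for y in ys) for x in xs}
q_x = {x: sum(q[(x, y)] for y in ys) for x in xs}

def objective(Cd):
    """E_P[log Cd(y+|x)] + E_Q[log Cd(y-|x)] for a table Cd[x][(y, sign)]."""
    return sum(p[(x, y)] * math.log(Cd[x][(y, "+")]) +
               q[(x, y)] * math.log(Cd[x][(y, "-")])
               for x in xs for y in ys)

# Closed-form optimum from Proposition 3.1.
Cd_star = {}
for x in xs:
    z = p_x[x] + q_x[x]
    Cd_star[x] = {}
    for y in ys:
        Cd_star[x][(y, "+")] = p[(x, y)] / z
        Cd_star[x][(y, "-")] = q[(x, y)] / z

# Each row should sum to one over the 2K labels.
row_sums = [sum(Cd_star[x].values()) for x in xs]

# Two alternative classifiers: uniform over 2K labels, and a blend with the optimum.
uniform = {x: {lab: 1.0 / (2 * len(ys)) for lab in Cd_star[x]} for x in xs}
blend = {x: {lab: 0.9 * Cd_star[x][lab] + 0.1 * uniform[x][lab] for lab in Cd_star[x]}
         for x in xs}

best, mixed, flat = objective(Cd_star), objective(blend), objective(uniform)
```

The closed-form classifier scores higher than both alternatives on this example, consistent with it being the maximizer of Eq. (21).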


A.4. Proof of Theorem 3.2

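Theorem 3.2. Given the optimal discriminative classifier, at the equilibrium point, optimizing the classification task for the generator of ADC-GAN is equivalent to:

$$\min_G \mathrm{KL}(Q_{X,Y}\,\|\,P_{X,Y}). \tag{7}$$

$$\max_G \mathbb{E}_{x,y\sim Q_{X,Y}}[\log C_d^*(y^+|x)] - \mathbb{E}_{x,y\sim Q_{X,Y}}[\log C_d^*(y^-|x)] \tag{25}$$

$$\Rightarrow \max_G \mathbb{E}_{x,y\sim Q_{X,Y}}\left[\log\frac{p(x,y)}{p(x)+q(x)}\right] - \mathbb{E}_{x,y\sim Q_{X,Y}}\left[\log\frac{q(x,y)}{p(x)+q(x)}\right]$$

$$\Rightarrow \min_G \mathbb{E}_{x,y\sim Q_{X,Y}}\left[\log\frac{q(x,y)}{p(x,y)}\right]$$

$$\Rightarrow \min_G \mathrm{KL}(Q_{X,Y}\,\|\,P_{X,Y}) \tag{27}$$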
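The cancellation of the shared denominator p(x) + q(x) in this proof can be verified numerically. The sketch below, which uses made-up joint tables `p` and `q` and a hypothetical helper `generator_reward`, evaluates the generator's classification term under the optimal discriminative classifier and compares it with the negative joint KL divergence.

```python
import math

def marginal(joint):
    """Marginal over x of a joint table keyed by (x, y)."""
    out = {}
    for (x, _), v in joint.items():
        out[x] = out.get(x, 0.0) + v
    return out

def kl(a, b):
    """KL(a || b) for two tables defined on the same keys."""
    return sum(v * math.log(v / b[k]) for k, v in a.items() if v > 0)

def generator_reward(q, p):
    """E_Q[log Cd*(y+|x)] - E_Q[log Cd*(y-|x)] with Cd* from Proposition 3.1."""
    p_x, q_x = marginal(p), marginal(q)
    total = 0.0
    for (x, y), q_xy in q.items():
        z = p_x[x] + q_x[x]
        total += q_xy * (math.log(p[(x, y)] / z) - math.log(q_xy / z))
    return total

# Hypothetical real (p) and generated (q) joint distributions.
p = {(0, 0): 0.25, (0, 1): 0.05, (1, 0): 0.15, (1, 1): 0.15, (2, 0): 0.05, (2, 1): 0.35}
q = {(0, 0): 0.10, (0, 1): 0.10, (1, 0): 0.30, (1, 1): 0.10, (2, 0): 0.10, (2, 1): 0.30}

reward_off_target = generator_reward(q, p)   # generator differs from the data
reward_on_target = generator_reward(p, p)    # generator matches the data exactly
joint_kl = kl(q, p)
```

On this example the reward equals the negative joint KL divergence, and it reaches zero, its maximum, only when the generated joint matches the real one.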

A.5. Proof of Theorem 4.1

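Proposition A.1. For fixed generator, the twin optimal classifiers of TAC-GAN have the following forms:

$$C^*(y|x) = \frac{p(x,y)}{p(x)}, \quad C^{\mathrm{mi}*}(y|x) = \frac{q(x,y)}{q(x)}. \tag{28}$$

Proof. The proof is similar to that of Proposition 2.1 in Appendix A.1 by considering $C$ and $C^{\mathrm{mi}}$ as two independent classifiers with respect to distribution $P$ and $Q$, respectively.

Theorem 4.1. Given the twin optimal classifiers, at the equilibrium point, optimizing the classification tasks for the generator of TAC-GAN is equivalent to:

$$\min_G \mathrm{KL}(Q_{X,Y}\,\|\,P_{X,Y}) - \mathrm{KL}(Q_X\,\|\,P_X). \tag{10}$$

$$\max_G \mathbb{E}_{x,y\sim Q_{X,Y}}[\log C^*(y|x)] - \mathbb{E}_{x,y\sim Q_{X,Y}}[\log C^{\mathrm{mi}*}(y|x)] \tag{29}$$

$$\Rightarrow \max_G \mathbb{E}_{x,y\sim Q_{X,Y}}\left[\log\frac{p(x,y)}{p(x)}\right] - \mathbb{E}_{x,y\sim Q_{X,Y}}\left[\log\frac{q(x,y)}{q(x)}\right]$$

$$\Rightarrow \max_G \mathbb{E}_{x,y\sim Q_{X,Y}}\left[\log\frac{p(x,y)}{q(x,y)}\right] - \mathbb{E}_{x,y\sim Q_{X,Y}}\left[\log\frac{p(x)}{q(x)}\right]$$

$$\Rightarrow \min_G \mathrm{KL}(Q_{X,Y}\,\|\,P_{X,Y}) - \mathrm{KL}(Q_X\,\|\,P_X) \tag{32}$$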
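The TAC-GAN identity can be checked in the same way. The sketch below also illustrates one consequence of the chain rule for KL divergence: the difference of the two KL terms only compares the conditionals q(y|x) and p(y|x), so a generator with the right conditionals but a wrong marginal over x already attains the optimum of this term. The tables `p`, `q`, the shifted marginal `r`, and the helper names are assumptions made for this example.

```python
import math

def marginal(joint):
    """Marginal over x of a joint table keyed by (x, y)."""
    out = {}
    for (x, _), v in joint.items():
        out[x] = out.get(x, 0.0) + v
    return out

def kl(a, b):
    """KL(a || b) for two tables defined on the same keys."""
    return sum(v * math.log(v / b[k]) for k, v in a.items() if v > 0)

def tac_reward(q, p):
    """E_Q[log C*(y|x)] - E_Q[log C_mi*(y|x)] with the twin optimal classifiers."""
    p_x, q_x = marginal(p), marginal(q)
    return sum(v * (math.log(p[k] / p_x[k[0]]) - math.log(v / q_x[k[0]]))
               for k, v in q.items())

# Hypothetical real (p) and generated (q) joint distributions.
p = {(0, 0): 0.25, (0, 1): 0.05, (1, 0): 0.15, (1, 1): 0.15, (2, 0): 0.05, (2, 1): 0.35}
q = {(0, 0): 0.10, (0, 1): 0.10, (1, 0): 0.30, (1, 1): 0.10, (2, 0): 0.10, (2, 1): 0.30}
p_x, q_x = marginal(p), marginal(q)

reward = tac_reward(q, p)
objective = kl(q, p) - kl(q_x, p_x)   # quantity minimized in Eq. (32)

# Same conditionals p(y|x) as the data, but a different marginal r(x).
r = {0: 0.6, 1: 0.2, 2: 0.2}
q_shift = {(x, y): r[x] * v / p_x[x] for (x, y), v in p.items()}
reward_shift = tac_reward(q_shift, p)
joint_kl_shift = kl(q_shift, p)
```

In this toy case `reward_shift` is zero although the joint distributions differ, which suggests that the classification term alone does not correct the marginal over x.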

B. Analysis on the Original AC-GAN

In this section, we show that the original AC-GAN, whose auxiliary classifier is trained with both real and generated samples, still suffers from the same issue proved in Theorem 2.2. Formally, the full objective function of the original AC-GAN is formulated as follows:

$$\max_{D,C} V(G,D) + \lambda\cdot\left(\mathbb{E}_{x,y\sim P_{X,Y}}[\log C(y|x)] + \mathbb{E}_{x,y\sim Q_{X,Y}}[\log C(y|x)]\right), \tag{33}$$

$$\min_G V(G,D) - \lambda\cdot\left(\mathbb{E}_{x,y\sim Q_{X,Y}}[\log C(y|x)]\right). \tag{34}$$

The objective function for training the classifier can be rewritten as:

$$\max_C \mathbb{E}_{x,y\sim P_{X,Y}}[\log C(y|x)] + \mathbb{E}_{x,y\sim Q_{X,Y}}[\log C(y|x)] \Rightarrow \max_C \mathbb{E}_{x,y\sim P^m_{X,Y}}[\log C(y|x)], \tag{35}$$

with $p^m(x,y) = \frac{1}{2}(p(x,y)+q(x,y))$ and $p^m(x) = \sum_y p^m(x,y) = \frac{1}{2}(p(x)+q(x))$. We can then obtain the optimal classifier as follows:

$$\max_C \mathbb{E}_{x,y\sim P^m_{X,Y}}[\log C(y|x)] \Rightarrow \min_C \mathbb{E}_{x\sim P^m_X,\,y\sim P^m_{Y|X}}[-\log C(y|x)] \tag{36}$$

$$\Rightarrow \min_C \mathbb{E}_{x\sim P^m_X}\left[H(p^m(y|x)) + \mathrm{KL}(p^m(y|x)\,\|\,C(y|x))\right] \tag{37}$$

$$\Rightarrow C^*(y|x) = p^m(y|x) = \frac{p(x,y)+q(x,y)}{p(x)+q(x)}. \tag{38}$$

Suppose that the conditional generator has learned the joint distribution of real data and labels, i.e., $q(x,y) = p(x,y)$ and $q(x) = p(x)$. The optimal classifier $C^*(y|x) = \frac{p(x,y)+q(x,y)}{p(x)+q(x)} = \frac{p(x,y)}{p(x)}$ then also provides the objective stated in Theorem 2.2 for the generator, which contains the conditional entropy of the generated samples $H_Q(Y|X)$ that reduces the intra-class diversity of the generated samples. In other words, the original classifier does not allow the generator to remain on the desired distribution because it still provides momentum to update the generator, resulting in a biased learning objective for the generator in the original version of AC-GAN. The essential reason is that the classifier of the original AC-GAN is incapable of distinguishing the real data from the generated data. Therefore, the classifier of the original AC-GAN cannot provide the difference between the real and generated joint distributions to optimize the generator.
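The claim that the classifier keeps pushing the generator away from the target can be illustrated on a toy problem with overlapping classes. The sketch below assumes the classifier is held fixed at its optimum for one generator step, as in alternating training; the joint table `p` and the alternative generator `q_collapsed` are hypothetical.

```python
import math

# Real joint distribution with class overlap at x = 1 (hypothetical numbers).
p = {(0, 0): 0.3, (1, 0): 0.2, (1, 1): 0.2, (2, 1): 0.3}
p_x = {0: 0.3, 1: 0.4, 2: 0.3}

# Optimal classifier of the original AC-GAN when q = p: Eq. (38) reduces to p(x,y)/p(x).
C_star = {k: v / p_x[k[0]] for k, v in p.items()}

def classification_reward(q):
    """E_{x,y ~ Q}[log C*(y|x)], the classifier term the generator maximizes."""
    return sum(v * math.log(C_star[k]) for k, v in q.items() if v > 0)

def conditional_entropy(q):
    """H_Q(Y|X) of a joint table keyed by (x, y)."""
    q_x = {}
    for (x, _), v in q.items():
        q_x[x] = q_x.get(x, 0.0) + v
    return -sum(v * math.log(v / q_x[k[0]]) for k, v in q.items() if v > 0)

# A generator sitting exactly on the target distribution.
q_faithful = dict(p)
# A generator that drops the ambiguous region x = 1 and keeps only "easy" samples.
q_collapsed = {(0, 0): 0.5, (2, 1): 0.5}

reward_faithful = classification_reward(q_faithful)
reward_collapsed = classification_reward(q_collapsed)
```

Even though `q_faithful` matches the data exactly, `q_collapsed` earns a higher classification reward because it has zero conditional entropy, so in this toy setting the classifier term favours leaving the target distribution and discarding the overlapping samples.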

C. Analysis on AM-GAN

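AM-GAN (Zhou et al., 2018) optimizes the following objectives with a label-extended discriminator $D_+ : \mathcal{X} \to \mathcal{Y} \cup \{0\}$:

$$\max_{D_+} \mathbb{E}_{x,y\sim P_{X,Y}}[\log D_+(y|x)] + \mathbb{E}_{x,y\sim Q_{X,Y}}[\log D_+(0|x)], \tag{39}$$

$$\min_G -\mathbb{E}_{x,y\sim Q_{X,Y}}[\log D_+(y|x)]. \tag{40}$$

The objective function for training the discriminator $D_+$ can be rewritten as:

$$\max_{D_+} \mathbb{E}_{x,y\sim P_{X,Y}}[\log D_+(y|x)] + \mathbb{E}_{x,y\sim Q_{X,Y}}[\log D_+(0|x)] \Rightarrow \max_{D_+} \mathbb{E}_{x,y\sim P^m_{X,Y}}[\log D_+(y|x)], \tag{41}$$

where $p^m(x,y) = \frac{1}{2}p(x,y)$ for all $y \in \mathcal{Y}$, $p^m(x,0) = \frac{1}{2}q(x)$, and $p^m(x) = \sum_y p^m(x,y) = \frac{1}{2}(p(x)+q(x))$. Then we have:

$$\max_{D_+} \mathbb{E}_{x,y\sim P^m_{X,Y}}[\log D_+(y|x)] \Rightarrow \min_{D_+} \mathbb{E}_{x\sim P^m_X,\,y\sim P^m_{Y|X}}[-\log D_+(y|x)] \tag{42}$$

$$\Rightarrow \min_{D_+} \mathbb{E}_{x\sim P^m_X}\left[H(p^m(y|x)) + \mathrm{KL}(p^m(y|x)\,\|\,D_+(y|x))\right] \Rightarrow D_+^*(y|x) = p^m(y|x) = \frac{p(x,y)}{p(x)+q(x)}, \ \forall y \in \mathcal{Y}. \tag{43}$$

Under the optimal discriminator $D_+^*$, the generator of AM-GAN can be regarded as optimizing the following:

$$\max_G \mathbb{E}_{x,y\sim Q_{X,Y}}[\log D_+^*(y|x)] \Rightarrow \max_G \mathbb{E}_{x,y\sim Q_{X,Y}}\left[\log\frac{p(x,y)}{p(x)+q(x)}\right] \tag{44}$$

$$\Rightarrow \min_G \mathbb{E}_{x,y\sim Q_{X,Y}}\left[\log\left(\frac{q(x,y)}{p(x,y)}\cdot\frac{p(x)+q(x)}{q(x,y)}\right)\right] = \mathbb{E}_{x,y\sim Q_{X,Y}}\left[\log\frac{q(x,y)}{p(x,y)} + \log\frac{p(x)+q(x)}{2} - \log q(x,y) + \log 2\right] \tag{45}$$

$$\geq \min_G \mathbb{E}_{x,y\sim Q_{X,Y}}\left[\log\frac{q(x,y)}{p(x,y)} + \frac{1}{2}\log p(x) + \frac{1}{2}\log q(x) - \log q(x,y) + \log 2\right] \tag{46}$$

$$\Rightarrow \min_G \mathbb{E}_{x,y\sim Q_{X,Y}}\left[\log\frac{q(x,y)}{p(x,y)} - \frac{1}{2}\log\frac{q(x)}{p(x)} - \log\frac{q(x,y)}{q(x)} + \log 2\right] \tag{47}$$

$$\Rightarrow \min_G \mathrm{KL}(Q_{X,Y}\,\|\,P_{X,Y}) - \frac{1}{2}\mathrm{KL}(Q_X\,\|\,P_X) + H_Q(Y|X) + \log 2. \tag{48}$$

The inequality between (45) and (46) follows from the concavity of the logarithm, which gives $\log\frac{p(x)+q(x)}{2} \geq \frac{1}{2}\log p(x) + \frac{1}{2}\log q(x)$.

In summary, AM-GAN with the original discriminator retained (as compared in our experiments) can be considered to be minimizing an upper bound of $\mathrm{JS}(Q_X\,\|\,P_X) + \mathrm{KL}(Q_{X,Y}\,\|\,P_{X,Y}) - \frac{1}{2}\mathrm{KL}(Q_X\,\|\,P_X) + H_Q(Y|X) + \log 2$.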
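The bound derived for AM-GAN can be checked numerically. The sketch below evaluates the exact generator loss under the optimal label-extended discriminator and the expression in Eq. (48) on two made-up joint tables `p` and `q`, and also computes the slack introduced by the concavity step. All numbers and variable names are assumptions for illustration.

```python
import math

def marginal(joint):
    """Marginal over x of a joint table keyed by (x, y)."""
    out = {}
    for (x, _), v in joint.items():
        out[x] = out.get(x, 0.0) + v
    return out

def kl(a, b):
    """KL(a || b) for two tables defined on the same keys."""
    return sum(v * math.log(v / b[k]) for k, v in a.items() if v > 0)

# Hypothetical real (p) and generated (q) joint distributions.
p = {(0, 0): 0.25, (0, 1): 0.05, (1, 0): 0.15, (1, 1): 0.15, (2, 0): 0.05, (2, 1): 0.35}
q = {(0, 0): 0.10, (0, 1): 0.10, (1, 0): 0.30, (1, 1): 0.10, (2, 0): 0.10, (2, 1): 0.30}
p_x, q_x = marginal(p), marginal(q)

# Exact generator loss under D+*: -E_Q[log p(x,y) / (p(x) + q(x))], cf. Eq. (44).
exact = -sum(v * math.log(p[k] / (p_x[k[0]] + q_x[k[0]])) for k, v in q.items())

# Expression in Eq. (48).
h_q = -sum(v * math.log(v / q_x[k[0]]) for k, v in q.items())
bound = kl(q, p) - 0.5 * kl(q_x, p_x) + h_q + math.log(2)

# Slack from replacing log((p(x)+q(x))/2) by the average of log p(x) and log q(x).
gap = sum(v * (math.log((p_x[k[0]] + q_x[k[0]]) / 2)
               - 0.5 * math.log(p_x[k[0]]) - 0.5 * math.log(q_x[k[0]]))
          for k, v in q.items())
```

On this example `exact` exceeds `bound` by exactly `gap`, and `gap` is strictly positive because the two marginals over x differ, which matches the upper-bound reading in the summary.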


D. More Results

Figure 5: FID curves during GAN training on (a) CIFAR-10 and (b) Tiny-ImageNet, respectively.

Figure 6: FID comparisons of classifier-based cGANs with different coefficient hyperparameters λ on (a) CIFAR-10 and (b) Tiny-ImageNet, respectively. The objective function in this experiment is $(1 - \lambda)V(G, D) + \lambda V_C(G, C)$, where $V_C(G, C)$ is the task between the generator and classifier.


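Table 5: IS, FID, iFID, Precision, Recall, Density, and Coverage comparisons of competing methods under different GAN loss functions on CIFAR-10 and CIFAR-100, respectively. The best results are bold and the second best are underlined.

CIFAR-10

| GAN loss | Method | IS | FID | iFID | Precision | Recall | Density | Coverage |
|---|---|---|---|---|---|---|---|---|
| Non-saturation | PD-GAN | 9.68 | 8.93 | 81.30 | 0.7581 | 0.6718 | 1.0622 | 0.9208 |
| Non-saturation | AC-GAN | 9.74 | 9.21 | 87.76 | 0.7592 | 0.6484 | 1.0491 | 0.9147 |
| Non-saturation | TAC-GAN | 9.61 | 9.31 | 81.04 | 0.7349 | 0.6717 | 0.9575 | 0.8990 |
| Non-saturation | ADC-GAN | 9.87 | 8.47 | 77.69 | 0.7497 | 0.6912 | 0.9968 | 0.9202 |
| Non-saturation | ContraGAN | 9.60 | 8.87 | 120.45 | 0.7598 | 0.6595 | 1.0025 | 0.9061 |
| Non-saturation | ReACGAN | 9.69 | 8.51 | 113.23 | 0.7648 | 0.6594 | 1.0532 | 0.9242 |
| Least square | PD-GAN | 9.99 | 8.72 | 80.11 | 0.7525 | 0.6771 | 1.0395 | 0.9182 |
| Least square | AC-GAN | 5.01 | 81.93 | 176.24 | 0.7389 | 0.0037 | 0.7484 | 0.2129 |
| Least square | TAC-GAN | 9.41 | 10.67 | 80.92 | 0.7386 | 0.6520 | 0.9159 | 0.8657 |
| Least square | ADC-GAN | 9.89 | 8.61 | 75.86 | 0.7405 | 0.6919 | 0.9944 | 0.9223 |
| Least square | ContraGAN | 9.10 | 12.93 | 135.75 | 0.7661 | 0.5761 | 1.0236 | 0.8262 |
| Least square | ReACGAN | 9.80 | 9.52 | 125.83 | 0.7772 | 0.5988 | 1.1008 | 0.9138 |
| WGAN-GP | PD-GAN | 5.27 | 75.24 | 104.15 | 0.5569 | 0.2132 | 0.3678 | 0.2141 |
| WGAN-GP | AC-GAN | 8.88 | 14.77 | 88.02 | 0.7015 | 0.6477 | 0.7421 | 0.7798 |
| WGAN-GP | TAC-GAN | 8.93 | 13.26 | 76.93 | 0.6847 | 0.6705 | 0.7454 | 0.8127 |
| WGAN-GP | ADC-GAN | 9.49 | 11.25 | 74.98 | 0.6996 | 0.7019 | 0.8182 | 0.8517 |
| WGAN-GP | ContraGAN | 6.38 | 51.43 | 137.17 | 0.5640 | 0.3995 | 0.4040 | 0.2931 |
| WGAN-GP | ReACGAN | 6.60 | 44.62 | 117.25 | 0.5813 | 0.4333 | 0.4559 | 0.3287 |
| Hinge | PD-GAN | 9.79 | 8.45 | 79.40 | 0.7464 | 0.6853 | 1.0083 | 0.9158 |
| Hinge | AC-GAN | 9.96 | 8.97 | 88.40 | 0.7681 | 0.6523 | 1.0250 | 0.9168 |
| Hinge | TAC-GAN | 9.78 | 8.80 | 81.30 | 0.7446 | 0.6749 | 1.0026 | 0.9103 |
| Hinge | ADC-GAN | 9.63 | 8.42 | 75.50 | 0.7447 | 0.6882 | 0.9854 | 0.9193 |
| Hinge | ContraGAN | 9.63 | 8.89 | 85.39 | 0.7582 | 0.6538 | 1.0411 | 0.9098 |
| Hinge | ReACGAN | 9.83 | 8.84 | 78.07 | 0.7623 | 0.6675 | 1.0003 | 0.9158 |

CIFAR-100

| GAN loss | Method | IS | FID | iFID | Precision | Recall | Density | Coverage |
|---|---|---|---|---|---|---|---|---|
| Non-saturation | PD-GAN | 11.48 | 11.59 | 105.38 | 0.7337 | 0.6804 | 0.8646 | 0.8513 |
| Non-saturation | AC-GAN | 7.98 | 49.46 | 207.56 | 0.7322 | 0.0793 | 0.6225 | 0.4112 |
| Non-saturation | TAC-GAN | 11.34 | 14.47 | 131.90 | 0.7429 | 0.6077 | 0.8324 | 0.7887 |
| Non-saturation | ADC-GAN | 11.88 | 11.07 | 104.21 | 0.7379 | 0.6972 | 0.8521 | 0.8609 |
| Non-saturation | ContraGAN | 11.15 | 13.54 | 146.86 | 0.7390 | 0.6155 | 0.8481 | 0.7729 |
| Non-saturation | ReACGAN | 11.79 | 13.72 | 125.21 | 0.7541 | 0.5861 | 0.8695 | 0.8005 |
| Least square | PD-GAN | 11.32 | 12.19 | 101.92 | 0.7263 | 0.6903 | 0.8318 | 0.8471 |
| Least square | AC-GAN | 4.93 | 87.70 | 252.85 | 0.7087 | 0.0007 | 0.5836 | 0.2220 |
| Least square | TAC-GAN | 7.27 | 49.08 | 162.58 | 0.7427 | 0.2114 | 0.7210 | 0.4438 |
| Least square | ADC-GAN | 11.56 | 11.85 | 103.06 | 0.7334 | 0.6949 | 0.8145 | 0.8526 |
| Least square | ContraGAN | 12.59 | 15.62 | 122.71 | 0.7866 | 0.4642 | 1.0109 | 0.7863 |
| Least square | ReACGAN | 12.90 | 15.09 | 164.93 | 0.7827 | 0.4672 | 1.0454 | 0.8282 |
| WGAN-GP | PD-GAN | 5.66 | 69.48 | | 0.5976 | 0.1603 | 0.4310 | 0.2649 |
| WGAN-GP | AC-GAN | 10.97 | 19.30 | 148.40 | 0.6880 | 0.5444 | 0.6770 | 0.7242 |
| WGAN-GP | TAC-GAN | 11.04 | 15.56 | 121.23 | 0.7023 | 0.6474 | 0.7048 | 0.7535 |
| WGAN-GP | ADC-GAN | 11.01 | 14.02 | 101.14 | 0.7058 | 0.6804 | 0.7549 | 0.7956 |
| WGAN-GP | ContraGAN | 6.72 | 49.77 | 147.22 | 0.6498 | 0.2834 | 0.5827 | 0.3549 |
| WGAN-GP | ReACGAN | 6.67 | 47.74 | 150.7 | 0.6188 | 0.3104 | 0.4806 | 0.3396 |
| Hinge | PD-GAN | 11.76 | 10.96 | 108.08 | 0.7436 | 0.6812 | 0.8790 | 0.8609 |
| Hinge | AC-GAN | 11.66 | 21.65 | 168.87 | 0.7577 | 0.3649 | 0.8297 | 0.7225 |
| Hinge | TAC-GAN | 12.07 | 12.56 | 134.75 | 0.7572 | 0.6020 | 0.8957 | 0.8400 |
| Hinge | ADC-GAN | 11.82 | 10.73 | 103.78 | 0.7387 | 0.7023 | 0.8721 | 0.8707 |
| Hinge | ContraGAN | 10.08 | 13.22 | 128.50 | 0.7372 | 0.6251 | 0.8356 | 0.7790 |
| Hinge | ReACGAN | 11.80 | 12.52 | 140.47 | 0.7510 | 0.5982 | 0.9300 | 0.8327 |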